[BDTC Preview] Jin Rong of Alibaba iDST: Overcoming the Limitations of Randomized Machine Learning Algorithms for Big Data
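Summary: Jin Rong, one of the heads of Alibaba iDST (Institute of Data Science and Technologies), will speak at BDTC 2015 on December 10 about overcoming the limitations of randomized machine learning algorithms so that they can make effective use of large-scale data, and about Alibaba's successful practice in this area. This article is CSDN Cloud Computing's pre-conference interview with Jin Rong.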
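To help enterprises better understand the latest big data technologies in China and abroad, learn from more industry experience in big data practice, and further advance big data innovation, industry applications, and talent development, the 2015 China Big Data Technology Conference (Big Data Technology Conference 2015, BDTC 2015) will be held on December 10-12, 2015 at the Crowne Plaza Beijing Sun Palace hotel in Beijing. The conference is hosted by the China Computer Federation (CCF), organized by the CCF Task Force on Big Data, and co-organized by the Institute of Computing Technology of the Chinese Academy of Sciences and CSDN.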
2015 China Big Data Technology Conference
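BDTC 2015 will run for three days. Beyond the main sessions, 16 sub-forums are planned: 6 technical forums including databases, deep learning, recommender systems, and security; 7 application forums covering finance, manufacturing, transportation and tourism, the Internet, healthcare, education, and network communications; and 3 hot-topic forums on policy, regulation and standardization, data markets and trading, and social governance. Nearly 100 leading experts and frontline practitioners in big data technology from abroad will be invited for in-depth discussion of popular technologies and industry practice, including Spark, Kudu, PostgreSQL, YARN, HBase, machine learning and deep learning, and recommender systems.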
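The conference has invited Jin Rong, one of the heads of Alibaba iDST (Institute of Data Science and Technologies) and a tenured professor at Michigan State University, as a plenary speaker. He will deliver a keynote titled "Randomized Algorithms for Big Data: Making the Impossible Possible".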
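Before the conference, Jin Rong told a CSDN reporter that randomized machine learning algorithms have attracted wide attention in recent machine learning research because of their high computational efficiency. However, because of limitations inherent to randomized algorithms, they cannot make very effective use of large-scale data in many learning tasks (Alibaba's e-commerce platform receives billions of service requests every day). At the conference he will use two examples to show how side information and prior knowledge can overcome these limitations: with only a slight modification to the algorithms, their effectiveness can be greatly improved. He will also present successful applications of randomized machine learning methods at Alibaba.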
Jin Rong
One of the heads of Alibaba iDST (Institute of Data Science and Technologies); tenured professor at Michigan State University
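Professor Jin Rong holds a Ph.D. from Carnegie Mellon University. His long-term research focus is statistical machine learning, with an emphasis on big data analysis and its applications in areas such as Internet information retrieval and e-commerce. He has proposed a series of original algorithms and theories in stochastic optimization, online learning, kernel learning, metric learning, semi-supervised learning, active learning, and crowdsourcing. He has published more than 200 papers in international conferences and journals, including 32 papers in top journals in the field such as JMLR, TPAMI, and PNAS, and 147 papers at top international conferences such as ICML, NIPS, and COLT. His work has been cited more than 10,000 times by other researchers. He has served as an area chair for top international conferences including NIPS and SIGIR, and as a senior program committee member for top conferences including KDD, AAAI, and IJCAI. He is a recipient of the NSF CAREER Award from the U.S. National Science Foundation.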
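The following is a transcript of the interview with Professor Jin Rong.
Technical Practice
CSDN: Please describe your company's business, the value of big data to that business, and your department's responsibilities.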
Jin Rong: The goal of our BU is to develop state-of-the-art machine learning and data mining algorithms to support the key technologies of Alibaba, including search, recommendation, business data analysis, and sales forecasting.
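CSDN: Which big data technologies have you used in your projects? What are you satisfied with, and what are you dissatisfied with, about these technologies?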
Jin Rong: The key technologies we have used are large-scale optimization and machine learning. Although considerable effort has gone into both, existing work is limited in three respects: first, most effort goes into developing computing infrastructure for large-scale optimization; second, most algorithms cannot handle large-scale and high-dimensional data simultaneously; third, most machine learning algorithms cannot deal effectively with noisy data, which is quite common in industry.
CSDN: What are the main challenges that big data adoption currently faces in your industry?
Jin Rong: The key challenge for individual developers is a lack of computing resources. Alibaba now offers the general public a powerful distributed environment that makes it possible for individual developers to perform large-scale data analysis. In particular, this platform offers powerful tools for large-scale optimization and large-scale machine learning.
CSDN: In your experience, which mistakes most often cause enterprises' big data efforts to fail?
Jin Rong: A common mistake that I have observed is inferring causal relations from noisy data. For instance, we might find from the estimated conditional probabilities that male clients are more likely than female clients to search for products for women. We later found that this was because many Taobao accounts are owned jointly by couples and, for some reason, only the men were listed as the account owners.
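The arithmetic behind this pitfall can be sketched in a few lines. This is only an illustration under made-up assumptions: the function name `observed_rate`, its parameters, and every number below are hypothetical and do not come from Alibaba's data. It shows how the rate measured on accounts listed under a man's name drifts away from the true rate for men once some of those accounts are shared.

```python
def observed_rate(p_men, p_women, shared_fraction, woman_share):
    """Rate of searches for women's products measured on MALE-LISTED accounts.

    All arguments are hypothetical illustration values:
    p_men / p_women  -- true probability that a search made by a man / a woman
                        targets a product for women
    shared_fraction  -- fraction of male-listed accounts actually used by a couple
    woman_share      -- fraction of searches on a shared account made by the woman
    """
    # Accounts really used by one man behave like the true male rate.
    solo = p_men
    # Shared accounts mix the two partners' behavior.
    shared = (1 - woman_share) * p_men + woman_share * p_women
    # The analyst only sees the blend, labeled "male".
    return (1 - shared_fraction) * solo + shared_fraction * shared


true_rate = 0.05
measured = observed_rate(p_men=true_rate, p_women=0.40,
                         shared_fraction=0.5, woman_share=0.6)
print(f"true rate for men: {true_rate:.3f}")
print(f"rate measured on male-listed accounts: {measured:.3f}")
```

With these invented numbers the measured rate is roughly three times the true one, even though no individual's behavior changed. The distortion comes entirely from the noisy "owner" label, which is why a causal reading of the estimate would be misleading.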
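Technology Trends
CSDN: New technologies in big data are developing quickly. Looking at the big data industry as a whole, which technology trends do you think deserve attention?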
Jin Rong: Large-scale optimization, large-scale machine learning, and large-scale deep learning.
CSDN: In your own industry, which technologies are you mainly watching and researching, and why are you optimistic about them?
Jin Rong: Large-scale deep learning or, more generally, learning non-linear prediction functions from massive amounts of data.
Technical Talent
CSDN: Talent is directly tied to the success of big data projects. What experience can you share about building a big data team?
Jin Rong: To build a strong data science team, the key is to combine people who have a good understanding of the business with people who have a solid background in machine learning and data mining.
CSDN: What qualities do you think an excellent data scientist needs?
Jin Rong: I have noticed that although many fresh graduates have received a good education in machine learning and data mining, they are not good at problem solving, particularly when they encounter unexpected difficulties. A good data scientist should be able to find the source of a problem, particularly when the data being analyzed is complex and seems to support conflicting conclusions.
Talk Topic
CSDN: Please tell us about the topic you will present at the conference.
Jin Rong: Exploiting randomized algorithms for large-scale optimization.
We continue to encounter explosive growth in data: the number of web pages grew from 300 million in 1997 to 50 billion in 2013; about 10 billion images are indexed by Google and 6 billion videos are indexed by YouTube; Alibaba's e-commerce platform receives billions of requests on a daily basis. This data explosion poses a great challenge for data analysis. Randomized algorithms have attracted significant interest in recent machine learning research, mostly because of their computational efficiency. On the other hand, formal limitations of randomized algorithms have been established for various learning tasks, making them less effective at exploiting the massive amount of data that is available to computer programs. In this talk, I will discuss, based on two examples, how to overcome the limitations of randomized machine learning algorithms by exploiting either side information or prior knowledge about the data. We have shown, both theoretically and empirically, that with a slight modification it is possible to dramatically improve the effectiveness of randomized algorithms for machine learning. I will also introduce successful cases of applying randomized algorithms at Alibaba.
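For readers unfamiliar with the term, the computational appeal of randomized learning algorithms can be illustrated with stochastic gradient descent (SGD), a widely used randomized optimizer. The sketch below is a generic textbook example and not one of the methods from the talk; the synthetic data, the function names `make_data` and `sgd_least_squares`, and the step size are all assumptions chosen for illustration. Each update looks at a single randomly sampled example instead of the whole data set.

```python
import random


def make_data(n, rng):
    """Synthetic regression data (assumed for illustration):
    y = 2*x0 - 3*x1 + small Gaussian noise."""
    data = []
    for _ in range(n):
        x = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        y = 2.0 * x[0] - 3.0 * x[1] + rng.gauss(0, 0.01)
        data.append((x, y))
    return data


def sgd_least_squares(data, steps, lr, rng):
    """Fit a 2-weight linear model by SGD on the squared error.

    Each step samples ONE example at random, so the cost of a step
    does not depend on the size of the data set.
    """
    w = [0.0, 0.0]
    for _ in range(steps):
        x, y = rng.choice(data)                  # random sample
        err = w[0] * x[0] + w[1] * x[1] - y      # prediction error
        w[0] -= lr * err * x[0]                  # gradient of 0.5 * err**2
        w[1] -= lr * err * x[1]
    return w


rng = random.Random(0)                           # fixed seed for repeatability
data = make_data(2000, rng)
w = sgd_least_squares(data, steps=20000, lr=0.1, rng=rng)
print("estimated weights:", w)
```

On this toy problem the estimated weights end up close to the generating values 2 and -3. The per-step cost is constant, which is the efficiency the passage above refers to; the limitations and the fixes based on side information or prior knowledge are the subject of the talk and are not shown here.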
CSDN: Which audience members most need to know about these topics? What problems can your talk help them solve?
Jin Rong: Anyone interested in large-scale learning will be interested in this topic. The material presented in my talk will help people find ways to solve large-scale optimization problems without having to resort to a distributed computing environment.
CSDN: What are your expectations for BDTC 2015 and for the topics the other speakers will present?
Jin Rong: I would love to learn about big data topics in other fields.
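The 9th China Big Data Technology Conference will be held in Beijing on December 10-12, 2015. Beyond the main sessions, the conference has 16 sub-forums: 6 technical forums including databases, deep learning, recommender systems, and security; 7 application forums including finance, manufacturing, transportation and tourism, the Internet, healthcare, and education; and 3 hot-topic forums. Discounted tickets are on sale now, so book early.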
This is an original CSDN article and may not be reproduced without permission. For reprint requests, contact market#csdn.net (replace # with @).
- Dec 22 Fri 2017 12:54