Machine Learning in a Nutshell

Data Sciences = Big Data + Machine Learning + Domain Expertise (NLP, Bio, …)
NAS – March, 2015
Jaime Carbonell & Colleagues
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
www.cs.cmu.edu/~jgc

Computing Departments at CMU
[Figure: diagram of computing departments at CMU. Labels: Systems, Lang Tech, Theory, Comp Bio, Natural Sciences, Machine Learning, Comp Eng, Hardware, Software, Robotics, Computer Science, Engineering]

3/9/2015 Jaime G. Carbonell, CMU

Computing Departments at CMU
[Figure: the same diagram with Data Sciences added]

How Big is Big? Dimensions of Big Data Analytics
• Billions++ of entries: terabytes/petabytes of data
• Trillions of potential relations among entries (graphs)
• Millions of attributes per entry (but typically sparse encoding)

The Big-Data “Stack”
• Analytics Algorithms
  – Machine Learning
  – Pattern Detection
• Big-Data Architecture
  – Hadoop/H-Table
  – Asynch/Pegasus
• Big-Data “Plumbing”
  – Cloud/Storage
  – Resource Allocator
[Figure labels around the stack: Sensors; Alerts, Visualization; Knowledge base; Historical & Normative Data]

Machine Learning
• Objectives
  – Discover patterns in data → make useful predictions
• General approach
  – Select & train functional models on historical data
  – Apply trained models to make predictions on new data
• Applications
  – Robotics, speech, medicine, biology, finance, cyber, energy, …
• Hot topics
  – Deep neural networks
  – Transfer and proactive learning
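The "General approach" on the Machine Learning slide (train a functional model on historical data, then apply it to new data) can be illustrated with a minimal sketch. It assumes the simplest possible functional model, a one-variable least-squares line; the data and the function names `train_linear_model` and `predict` are hypothetical and are not taken from the talk.

```python
from statistics import mean


def train_linear_model(xs, ys):
    """Fit y ~ slope * x + intercept by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(model, x):
    """Apply a trained (slope, intercept) model to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical "historical" data, generated from y = 2x + 1.
historical_x = [0.0, 1.0, 2.0, 3.0, 4.0]
historical_y = [1.0, 3.0, 5.0, 7.0, 9.0]

# Step 1: train the model on historical data.
model = train_linear_model(historical_x, historical_y)

# Step 2: apply the trained model to new, unseen inputs.
new_x = [5.0, 10.0]
predictions = [predict(model, x) for x in new_x]
print(predictions)
```

Richer model families (such as the deep neural networks listed under hot topics) change the training step, but the train-then-predict split stays the same.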
Machine Learning in a Nutshell
• Training data: $\{\vec{x}_i, y_i\}_{i=1,\dots,k} + \{\vec{x}_i\}_{i=k+1,\dots,n} + O : \vec{x}_i \to y_i$
  – Special case: $k = 0$
• Functional space: $\Im = \{ f_j \times p_l \}$
• Fitness criterion (a.k.a. loss function):
  $\arg\min_{j,l} \left[ \left( \sum_i y_i - f_{j,p_l}(\vec{x}_i) \right) + b(f_j, p_l) \right]$
• Active sampling strategy:
  $\arg\min_{\vec{x}_i \in \{\vec{x}_{k+1},\dots,\vec{x}_n\}} \left( L\bigl(\hat{f}(\vec{x}_{test}, \hat{y}_{test})\bigr) \,\middle|\, \vec{x}_i \cup \{\vec{x}_1,\dots,\vec{x}_k\} \right)$

The Big Picture
[Figure labels: Raw Data (Relational, Temporal); Feature Extraction; Feature Representation; Unbalanced Unlabeled Data Set; Rare Category Detection; Learning in Unbalanced Settings; Classifier]

Evolutionary Tree of MT Paradigms
[Figure: timeline from 1950 through 1980 to 2010. Labels: Transfer MT, Interlingua MT, Analogy MT, Example-based MT, Context-Based MT, Decoding MT, Statistical MT, Phrasal SMT, Stat MT on syntax struct., Transfer MT w stat phrases, Large-scale TMT]

Morpho-Syntactics & Multi-Morphemics
• Iñupiaq (North Slope Alaska, Lori Levin)
  – Tauqsiġñiaġviŋmuŋniaŋitchugut.
  – ‘We won’t go to the store.’
• Kalaallisut (Greenlandic, Per Langaard)
  – Pittsburghimukarthussaqarnavianngilaq
  – Pittsburgh+PROP+Trim+SG+kar+tuq+ssaq+qar+naviar+nngit+v+IND+3SG
  – "It is not likely that anyone is going to Pittsburgh"

Context Needed to Resolve Ambiguity
Example: English → Japanese Machine Translation
  Power line – densen (電線)
  Subway line – chikatetsu (地下鉄)
  (Be) on line – onrain (オンライン)
  (Be) on the line – denwachuu (電話中)
  Line up – narabu (並ぶ)
  Line one’s pockets – kanemochi ni naru (金持ちになる)
  Line one’s jacket – uwagi o nijuu ni suru (上着を二重にする)
  Actor’s line – serifu (セリフ)
  Get a line on – joho o eru (情報を得る)
Sometimes local context suffices (as above)

Active Learning for MT
[Figure labels: Source Language Corpus; Active Learner; Sampled corpus (S); Expert Translator; Parallel corpus (S,T); Trainer; Model; MT System]

Active Crowd Translation
[Figure: ACT Framework. Labels: Source Language Corpus; Sentence Selection; S; translations S,T 1 … S,T n; Translation Selection; Trainer; Model; MT System]

Structure Learning for Next-Generation Machine Translation
• Tree-tree aligner enforces equivalence constraints and optimizes over terminal alignment scores (words/phrases)
• Resulting aligned nodes are highlighted in figure
• Transfer rules are partially lexicalized and read off tree.

Active vs Proactive Learning
                   | Active Learning               | Proactive Learning
Number of Oracles  | Individual (only one)         | Multiple, with different capabilities, costs and areas of expertise
Reliability        | Infallible (100% right)       | Variable across oracles and queries, depending on difficulty, expertise, …
Reluctance         | Indefatigable (always answers) | Variable across oracles and queries, depending on workload, certainty, …
Cost per query     | Invariant (free or constant)  | Variable across oracles and queries, depending on workload, difficulty, …
Note: “Oracle” ∈ {expert, experiment, computation, …}

Proactive Learning: Wind Sampling
• Predict: prevalent direction, speed, seasonality
• Measurement towers: expensive – where, when?

Inferring the Protein Interactome
• Degree distribution / Hub analysis / Disease checking
• Graph modules analysis (from bi-clustering study)
• Protein-family based graph patterns (receptors / ligands / etc.)

PPIs: Protein-Protein Interactions
• The cell machinery is run by the proteins
  – Enzymatic activities, replication, translation, transport, signaling, structural
• Proteins interact with each other to perform these functions
  – Through physical contact
  – Indirectly in a protein complex
  – Indirectly in a pathway
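The "Degree distribution / Hub analysis" item on the interactome slide can be made concrete with a small sketch. It treats interactions as an undirected graph and counts partners per protein. The edge list, the placeholder protein names (`P1` … `P7`) and the hub threshold are all hypothetical, chosen only for illustration; they are not data or parameters from the talk.

```python
from collections import Counter

# Hypothetical toy interaction list; names are placeholders, not real proteins.
interactions = [
    ("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P1", "P5"),
    ("P2", "P3"), ("P4", "P6"), ("P6", "P7"),
]


def degrees(edges):
    """Count distinct interaction partners per protein (undirected graph)."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return {protein: len(partners) for protein, partners in neighbors.items()}


def degree_distribution(degree_by_protein):
    """Map each degree value to the number of proteins that have it."""
    return dict(Counter(degree_by_protein.values()))


def hubs(degree_by_protein, min_degree):
    """Proteins whose degree meets an assumed hub threshold."""
    return sorted(p for p, d in degree_by_protein.items() if d >= min_degree)


deg = degrees(interactions)
dist = degree_distribution(deg)
hub_list = hubs(deg, min_degree=4)  # the threshold of 4 is an arbitrary assumption
print(dist, hub_list)
```

On real interactome data the threshold would more plausibly be set relative to the observed degree distribution than fixed in advance.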
Figure source: http://www.cellsignal.com/reference/pathway/Apoptosis_Overview.html

Host Pathogen Interactomes
HIV-1 depends on the cellular machinery in every aspect of its life cycle.
[Figure labels: Fusion; Reverse transcription; Transcription; Budding; Maturation]
Peterlin and Trono, Nature Rev Immunol 2003.

PROTEINS (Borrowed from: Judith Klein-Seetharaman)
Sequence → Structure → Function
Primary Sequence:
MNGTEGPNFY VPFSNKTGVV RSPFEAPQYY LAEPWQFSML AAYMFLLIML GFPINFLTLY VTVQHKKLRT PLNYILLNLA VADLFMVFGG FTTTLYTSLH GYFVFGPTGC NLEGFFATLG GEIALWSLVV LAIERYVVVC KPMSNFRFGE NHAIMGVAFT WVMALACAAP PLVGWSRYIP EGMQCSCGID YYTPHEETNN ESFVIYMFVV HFIIPLIVIF FCYGQLVFTV KEAAAQQQES ATTQKAEKEV TRMVIIMVIA FLICWLPYAG VAFYIFTHQG SDFGPIFMTI PAFFAKTSAV YNPVIYIMMN KQFRNCMVTT LCCGKNPLGD DEASTTVSKT ETSQVAPA
Folding → 3D Structure → Complex function within network of proteins
Normal

PROTEINS
Sequence → Structure → Function
Primary Sequence (as above) → Folding → 3D Structure → Complex function within network of proteins
Cancer: unchecked cellular replication

The “Folding Problem”: 3D Protein Structure Determination In Silico
• In vitro lab experiments: time, cost, uncertainty, …
  – X-ray crystallography (months to crystallize, uncertain outcome). Nobel Prize, Kendrew & Perutz, 1962
  – NMR spectroscopy (only works for small proteins or domains). Nobel Prize, Kurt Wüthrich, 2002
• The gap between sequence and structure necessitates computational methods of protein structure determination
  – 3,023,461 sequences vs. 36,247 resolved structures (1.2%)

Linked Segmentation CRF
• Node: secondary structure elements and/or simple fold
• Edges: local interactions and long-range inter-chain and intra-chain interactions
• L-SCRF: conditional probability of y given x is defined as
  $P(\mathbf{y}_1,\dots,\mathbf{y}_R \mid \mathbf{x}_1,\dots,\mathbf{x}_R) = \frac{1}{Z} \prod_{y_{i,j} \in V_G} \exp\left( \sum_k \lambda_k f_k(\mathbf{x}_i, y_{i,j}) \right) \prod_{\langle y_{i,j}, y_{a,b} \rangle \in E_G} \exp\left( \sum_l \mu_l g_l(\mathbf{x}_i, \mathbf{x}_a, y_{i,j}, y_{a,b}) \right)$
[Figure label: Joint Labels]

Fold Alignment Prediction: β-Helix
• Predicted alignment for known β-helices on cross-family validation

Parting Thoughts
• Data Science = Big Data + ML + Domains
  – Big data requires big (scalable) ML
  – Leverage scarce human expertise
• Machine Learning is Engine for AI
  – Heretofore it was hand-coded knowledge
  – Applicable to virtually any domain
• Big Impact Everywhere
  – Bio: genetic, molecular, healthcare
  – Language: speech, MT, text mining
  – Education: student models, transfer
  – Energy: optimization, production, conservation
  – Finance: investment, fraud detection, risk analysis

THANK YOU!
