Data Sciences = Big Data + Machine Learning + Domain Expertise (NLP, Bio, …)

NAS –March, 2015

Jaime Carbonell& Colleagues Language Technologies Institute School of Computer Science Carnegie Mellon University www.cs.cmu.edu/~jgc Computing Departments at CMU

Systems Lang Tech Theory

Comp Bio

Natural Machine Sciences Comp Hardware Software Learning Eng Robotics

Computer Science Engineering

3/9/2015 Jaime G. Carbonell, CMU 2 Computing Departments at CMU

Systems Lang Tech Theory

Comp Bio Data Sciences

Natural Machine Sciences Comp Hardware Software Learning Eng Robotics

Computer Science Engineering

3/9/2015 Jaime G. Carbonell, CMU 3 How Big is Big? Dimensions of Big Data Analytics

Billions++ of entries: Terabyes/Petabyesof data

Trillions of potential relations among entries (graphs)

Millions of attributes per entry (but typically sparse encoding)

3/9/2015 Jaime G. Carbonell, CMU 4 The Big-Data “Stack”

Analytics Algorithms --Machine Learning --Pattern Detection Alerts, Visualization

Big-DataArchitecture --Hadoop/H-Table --Asynch/Pegasus

Sensors Big-Data “Plumbing” --Cloud/Storage Knowledge --ResourceAllocator Historical & base Normative Data

3/9/2015 Jaime G. Carbonell, CMU 5 The Big-Data “Stack”

Analytics Algorithms --Machine Learning --Pattern Detection Alerts, Visualization

Big-DataArchitecture --Hadoop/H-Table --Asynch/Pegasus

Sensors Big-Data “Plumbing” --Cloud/Storage Knowledge --ResourceAllocator Historical & base Normative Data

3/9/2015 Jaime G. Carbonell, CMU 6 Machine Learning

• Objectives – Discover patterns in data à make useful predictions • General approach – Select & train functional models on historical data – Apply trained models to make predictions on new data • Applications – Robotics, speech, medicine, biology, finance, cyber, energy, … • Hot topics – Deep neural networks – Transfer and proactive learning

3/9/2015 Jaime G. Carbonell, CMU 7 Machine Learning in a Nutshell

r r r r o Training data: {xi , yi}i=1,...k +{xi}i=k +1,...n + O : xi ® yi n Special case: k = 0 o Functional space: Á = { f j ´ pl } o Fitness Criterion: n a.k.a. loss function éæ r ö ù arg min ç y - f (x ) ÷ + b ( f , p ) ê å i j, pl i j l ú j,l ëè i ø û o Active Sampling Strategy: r r r r argmin L( fˆ(x , yˆ ))| x È{x ,...,x } r r r ( test test i 1 k ) xiÎ{xk+1,...,xn }

3/9/2015 Jaime G. Carbonell, CMU 8 The Big Picture

Unbalanced Rare Learning in Classifier Unlabeled Category Unbalanced Data Set Detection Settings

Feature Feature Extraction Representation

Relational Raw Data Temporal 3/9/2015 Jaime G. Carbonell, CMU 9 LL Evolutionary Tree of MT Paradigms

Large- Large- scale TMT scale TMT Transfer Transfer MT MT w stat Interlingua phrases MT Context- Analogy Based MT MT Example- based MT Stat MT on syntax struct. Decoding Statistic Phrasal MT al MT SMT

10 1950 1980 2010 Morpho-Syntactics & Multi-Morphemics oIñupiaq(North Slope Alaska, Lori Levin) nTauqsiġñiaġviŋmuŋniaŋitchugut. n‘We won’t go to the store.’ oKalaallisut(Greenlandic, Per Langaard) nPittsburghimukarthussaqarnavianngilaq nPittsburgh+PROP+Trim+SG+kar+tuq+ssaq+qar+n aviar+nngit+v+IND+3SG n"It is not likely that anyone is going to Pittsburgh"

11 Context Needed to Resolve Ambiguity

Example: English à Japanese Power line –densen(電線) Subway line –chikatetsu(地下鉄) (Be) on line –onrain(オンライン) (Be) on the line –denwachuu(電話中) Line up –narabu(並ぶ) Line one’s pockets – kanemochininaru (金持ちになる) Line one’s jacket – uwagio nijuunisuru(上着を二重にする) Actor’s line –serifu(セリフ) Get a line on –johoo eru(情報を得る)

Sometimes local context suffices (as above)

3/9/2015 Jaime G. Carbonell, CMU 12 Active Learning for MT Parallel Expert corpus Translator S,T Trainer

S Sampled Model corpus

MT System Source Language Active Corpus Learner

3/9/2015 Jaime G. Carbonell, CMU 13 Active Crowd Translation S,T 1

S,T 2 Trainer . Translation . Selection .

S,T Model n

S Sentenc e Selectio n MT System Source Language ACT Corpus Framework

3/9/2015 Jaime G. Carbonell, CMU 14 Structure Learning for Next-Generation Machine Translation

•Tree-tree aligner enforces equivalence constraints and optimizes over terminal alignment scores (words/phrases)

•Resulting aligned nodes are highlighted in figure

•T ransferrules are partially lexicalized and read off tree.

3/9/2015 Jaime G. Carbonell, CMU 15 Active vs Proactive Learning

Active Learning Proactive Learning

Number of Oracles Individual (only one) Multiple, with different capabilities, costs and areas of expertise Reliability Infallible (100% right) Variable across oracles and queries, depending on difficulty, expertise, … Reluctance Indefatigable (always Variable across oracles and answers) queries, depending on workload, certainty, … Cost per query Invariant (free or constant) Variable across oracles and queries, depending on workload, difficulty, …

Note: “Oracle” Î {expert, experiment, computation, …}

3/9/2015 Jaime G. Carbonell, CMU 16 Proactive Learning: Wind Sampling o Predict: Prevalent Direction, Speed, seasonality o Measurement towers: Expensive –where, when?

3/9/2015 Jaime G. Carbonell, CMU 17 Inferring the Protein Interactome

o Degree distribution / Hub analysis / Disease checking o Graph modules analysis (from bi-clustering study) o Protein-family based graph patterns (receptors / ligands/ etc )

3/9/2015 Jaime G. Carbonell, CMU 18 18 PPIs: Protein-Protein Interactions o The cell machinery is run by the proteins n Enzymatic activities, replication, translation, transport, signaling, structural

o Proteins interact with each other to perform these functions

Through physical contact Indirectly Indirectly in pathway in a protein complex

Indirectly in a pathway

3/9/2015 Jaime G. Carbonell, http://CMUwww.cellsignal.com/reference/pathway/Apoptosis_Overview.html Host Pathogen Interactomes

Fusion HIV-1 depends on the cellular machinery in every Reverse aspect of its life cycle. transcription

Transcription

Budding Maturation

Peterlin and Torono, Nature Rev Immu 2003. 3/9/2015 Jaime G. Carbonell, CMU 20 ROTEIN (Borrowed from: Judith P S Klein-Seetharaman) Sequence à Structure à Function

Primary Sequence MNGTEGPNFY VPFSNKTGVV RSPFEAPQYY LAEPWQFSML AAYMFLLIML GFPINFLTLY VTVQHKKLRT PLNYILLNLA VADLFMVFGG FTTTLYTSLH GYFVFGPTGC NLEGFFATLG GEIALWSLVV LAIERYVVVC KPMSNFRFGE NHAIMGVAFT WVMALACAAP PLVGWSRYIP EGMQCSCGID YYTPHEETNN ESFVIYMFVV HFIIPLIVIF FCYGQLVFTV KEAAAQQQES ATTQKAEKEV TRMVIIMVIA FLICWLPYAG VAFYIFTHQG SDFGPIFMTI PAFFAKTSAV YNPVIYIMMN KQFRNCMVTT LCCGKNPLGD DEASTTVSKT ETSQVAPA Folding 3D Structure

Complex function within network of proteins

Normal 3/9/2015 Jaime G. Carbonell, CMU 21 PROTEINS Sequence à Structure à Function

Primary Sequence MNGTEGPNFY VPFSNKTGVV RSPFEAPQYY LAEPWQFSML AAYMFLLIML GFPINFLTLY VTVQHKKLRT PLNYILLNLA VADLFMVFGG FTTTLYTSLH GYFVFGPTGC NLEGFFATLG GEIALWSLVV LAIERYVVVC KPMSNFRFGE NHAIMGVAFT WVMALACAAP PLVGWSRYIP EGMQCSCGID YYTPHEETNN ESFVIYMFVV HFIIPLIVIF FCYGQLVFTV KEAAAQQQES ATTQKAEKEV TRMVIIMVIA FLICWLPYAG VAFYIFTHQG SDFGPIFMTI PAFFAKTSAV YNPVIYIMMN KQFRNCMVTT LCCGKNPLGD DEASTTVSKT ETSQVAPA Folding 3D Structure

Complex function within network of proteins

Cancer: 3/9/2015 Jaime G. Carbonell, CMU unckecked cellular 22 replicaiton The “Folding Problem:” 3D Protein Structure Determination In Silico

o In Vitro Lab experiments: time, cost, uncertainty, … n X-ray crystallography (months to crystalize, uncertain outcome) Nobel Prize, Kendrew & Perutz, 1962 n NMR spectroscopy (only works for small proteins or domains) Nobel Prize, Kurt Wuthrich, 2002 o The gap between sequence and structure necessitates computational methods of protein structure determination n 3,023,461 sequences v.s. 36,247 resolved structures (1.2%)

3/9/2015 Jaime G. Carbonell, CMU 23 Linked Segmentation CRF

o Node: secondary structure elements and/or simple fold o Edges: Local interactions and long-range inter-chain and intra-chain interactions o L-SCRF: conditional probability of y given x is defined as 1 P(y,...,y|x,...,x)=exp(lmf(x,y))exp(gy(x,xy,,)) 1R1RZÕåkkii,jÕålkiai,,jab yi, jGÎVklyyi,,j,abGÎE Joint Labels

3/9/2015 Jaime G. Carbonell, CMU 24 Fold Alignment Prediction: β-Helix

o Predicted alignment for known β -helices on cross-family validation

3/9/2015 Jaime G. Carbonell, CMU 25 Parting Thoughts o Data Science = Big Data + ML + Domains n Big data requires big (scalable) ML n Leverage scarce human expertise o Machine Learning is Engine for AI n Heretofore it was hand-coded knowledge n Applicable to virtually any domain o Big Impact Everywhere n Bio: genetic, molecular, healthcare n Language: speech, MT, n Education: student models, transfer n Energy: optimization, production, conservation n Finance: investment, fraud detection, risk analysis

3/9/2015 Jaime G. Carbonell, CMU 26 THANK YOU!

3/9/2015 Jaime G. Carbonell, CMU 27