Data Sciences = Big Data + Machine Learning + Domain Expertise (NLP, Bio, …)
NAS – March 2015
Jaime Carbonell & Colleagues
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
www.cs.cmu.edu/~jgc
Computing Departments at CMU
[Figure: diagram of computing areas at CMU. Legible labels include Systems, Lang Tech, Theory, Comp Bio, Natural Sciences, Machine Learning, Hardware, Software Eng, Robotics, Computer Science, Engineering.]
3/9/2015 Jaime G. Carbonell, CMU
Computing Departments at CMU
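[Figure: diagram of computing areas at CMU with Data Sciences added. Legible labels include Systems, Lang Tech, Theory, Comp Bio, Data Sciences, Natural Sciences, Machine Learning, Hardware, Software Eng, Robotics, Computer Science, Engineering.]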
How Big is Big? Dimensions of Big Data Analytics
Billions++ of entries: Terabytes/Petabytes of data
Trillions of potential relations among entries (graphs)
Millions of attributes per entry (but typically sparse encoding)
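The sparse-encoding point can be illustrated with a minimal sketch, assuming each entry is stored as a dictionary from attribute index to nonzero value. The attribute count, the example entries, and the helper name `sparse_dot` are illustrative assumptions, not taken from the talk.

```python
# Sparse encoding sketch: store only the nonzero attributes of each entry.
# NUM_ATTRIBUTES and the example values below are illustrative assumptions.
NUM_ATTRIBUTES = 5_000_000  # "millions of attributes per entry"


def sparse_dot(a: dict, b: dict) -> float:
    """Dot product that touches only attributes present in both entries."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter entry
    return sum(value * b[index] for index, value in a.items() if index in b)


entry_1 = {7: 1.0, 42_000: 2.5, 4_999_999: 1.0}
entry_2 = {42_000: 2.0, 123_456: 3.0}

similarity = sparse_dot(entry_1, entry_2)   # only attribute 42_000 overlaps
density = len(entry_1) / NUM_ATTRIBUTES     # fraction of attributes stored
print(similarity, density)
```

The cost of comparing two entries then scales with the number of nonzero attributes rather than with the full attribute count.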
The Big-Data “Stack”
• Analytics Algorithms
 – Machine Learning
 – Pattern Detection
• Big-Data Architecture
 – Hadoop/H-Table
 – Asynch/Pegasus
• Big-Data “Plumbing”
 – Cloud/Storage
 – Resource Allocator
• Side labels in the figure: Sensors; Knowledge base; Historical & Normative Data; Alerts, Visualization
Machine Learning
• Objectives
 – Discover patterns in data → make useful predictions
• General approach
 – Select & train functional models on historical data
 – Apply trained models to make predictions on new data
• Applications
 – Robotics, speech, medicine, biology, finance, cyber, energy, …
• Hot topics
 – Deep neural networks
 – Transfer and proactive learning
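The "train on historical data, then apply to new data" approach can be sketched with the simplest possible functional model. This is a minimal illustration, assuming a one-variable linear model fitted by ordinary least squares; the data and the function names `train` and `predict` are hypothetical.

```python
# Minimal train-then-predict sketch with a one-variable linear model.
# The model family, data, and function names are illustrative assumptions.

def train(xs, ys):
    """Fit y ~ slope * x + intercept by closed-form ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a trained (slope, intercept) model to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Historical data, generated here from y = 2x + 1 for illustration.
historical_x = [1.0, 2.0, 3.0, 4.0]
historical_y = [3.0, 5.0, 7.0, 9.0]

model = train(historical_x, historical_y)
new_predictions = [predict(model, x) for x in [5.0, 6.0]]
print(model, new_predictions)
```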
Machine Learning in a Nutshell
• Training data: $\{\vec{x}_i, y_i\}_{i=1,\dots,k} + \{\vec{x}_i\}_{i=k+1,\dots,n} + O : \vec{x}_i \to y_i$
 – Special case: $k = 0$
• Functional space: $\mathcal{F} = \{ f_j \times p_l \}$
• Fitness criterion (a.k.a. loss function):
$$\arg\min_{j,l} \left[ \left( \sum_i y_i - f_{j,p_l}(\vec{x}_i) \right) + b(f_j, p_l) \right]$$
• Active sampling strategy:
$$\arg\min_{\vec{x}_i \in \{\vec{x}_{k+1},\dots,\vec{x}_n\}} \left( L\big(\hat{f}(\vec{x}_{test}, \hat{y}_{test})\big) \,\middle|\, \vec{x}_i \cup \{\vec{x}_1,\dots,\vec{x}_k\} \right)$$
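The fitness criterion on this slide searches over function families f_j and parameter settings p_l for the pair that minimizes data loss plus a penalty term b. A minimal sketch follows, assuming absolute error as the per-example loss, a tiny hand-picked set of families and parameters, and a penalty that depends only on the family; all of these choices are illustrative assumptions.

```python
# Grid search over a small functional space {f_j x p_l}.
# Families, parameter grid, penalty values, and data are all assumptions.

FAMILIES = {
    "constant": lambda p: (lambda x: p),
    "linear": lambda p: (lambda x: p * x),
    "quadratic": lambda p: (lambda x: p * x * x),
}
PARAMETERS = [0.5, 1.0, 2.0, 3.0]
# Assumed complexity penalty b(f_j, p_l): here it depends only on the family.
PENALTY = {"constant": 0.0, "linear": 0.1, "quadratic": 0.2}


def fitness(family, p, data):
    """Data loss (sum of absolute errors) plus the complexity penalty."""
    f = FAMILIES[family](p)
    loss = sum(abs(y - f(x)) for x, y in data)
    return loss + PENALTY[family]


def select_model(data):
    """Return the (family, parameter) pair with the lowest fitness value."""
    candidates = [(j, l) for j in FAMILIES for l in PARAMETERS]
    return min(candidates, key=lambda jl: fitness(jl[0], jl[1], data))


# Labeled training pairs (x_i, y_i), roughly following y = 2x.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.0)]
best_family, best_param = select_model(data)
print(best_family, best_param)
```

In practice the parameters are fitted by an optimizer rather than enumerated; the grid only makes the arg min over j and l explicit.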
The Big Picture
[Figure: processing pipeline. Boxes: Raw Data (Relational, Temporal); Feature Extraction; Feature Representation; Unbalanced Unlabeled Data Set; Rare Category Detection; Learning in Unbalanced Settings; Classifier.]
Evolutionary Tree of MT Paradigms
[Figure: evolutionary tree of MT paradigms on a timeline from 1950 through 1980 to 2010. Legible labels: Transfer MT, Large-scale TMT, Transfer MT w stat phrases, Interlingua MT, Analogy MT, Example-based MT, Context-Based MT, Decoding, Statistical MT, Phrasal SMT, Stat MT on syntax struct.]
Morpho-Syntactics & Multi-Morphemics
• Iñupiaq (North Slope Alaska, Lori Levin)
 – Tauqsiġñiaġviŋmuŋniaŋitchugut.
 – ‘We won’t go to the store.’
• Kalaallisut (Greenlandic, Per Langaard)
 – Pittsburghimukarthussaqarnavianngilaq
 – Pittsburgh+PROP+Trim+SG+kar+tuq+ssaq+qar+naviar+nngit+v+IND+3SG
 – "It is not likely that anyone is going to Pittsburgh"
Context Needed to Resolve Ambiguity
Example: English → Japanese Machine Translation
• Power line – densen (電線)
• Subway line – chikatetsu (地下鉄)
• (Be) on line – onrain (オンライン)
• (Be) on the line – denwachuu (電話中)
• Line up – narabu (並ぶ)
• Line one’s pockets – kanemochi ni naru (金持ちになる)
• Line one’s jacket – uwagi o nijuu ni suru (上着を二重にする)
• Actor’s line – serifu (セリフ)
• Get a line on – joho o eru (情報を得る)
Sometimes local context suffices (as above)
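When local context suffices, disambiguation can be as simple as matching the words around "line". The sketch below is a toy lookup, assuming a hand-written pattern table built from a few of the examples on this slide; it is not how a real MT system resolves ambiguity, and the function name `translate_line` is hypothetical.

```python
# Toy local-context disambiguation for the English word "line".
# The pattern table is a hand-written assumption built from the slide's examples.
LINE_TRANSLATIONS = [
    ("on the line", "denwachuu"),
    ("on line", "onrain"),
    ("power line", "densen"),
    ("subway line", "chikatetsu"),
    ("line up", "narabu"),
    ("actor's line", "serifu"),
]


def translate_line(phrase):
    """Return the romanized Japanese rendering chosen by local context, or None."""
    text = phrase.lower()
    # Try longer patterns first so the most specific context wins.
    for pattern, translation in sorted(LINE_TRANSLATIONS, key=lambda pt: -len(pt[0])):
        if pattern in text:
            return translation
    return None  # local context is not enough to decide


print(translate_line("The power line is down"))
print(translate_line("She is on the line"))
print(translate_line("a straight line"))
```

A `None` result marks the cases where wider context, beyond the neighboring words, would be needed.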
Active Learning for MT
[Figure: active learning loop. Labels: Source Language Corpus; Active Learner; Sampled corpus (S); Expert Translator; Parallel corpus (S,T); Trainer; Model; MT System.]
Active Crowd Translation
[Figure: ACT Framework. Labels: Source Language Corpus; Sentence Selection (S); crowd translations (S,T)_1, (S,T)_2, …, (S,T)_n; Translation Selection; Trainer; Model; MT System.]
Structure Learning for Next-Generation Machine Translation
•Tree-tree aligner enforces equivalence constraints and optimizes over terminal alignment scores (words/phrases)
•Resulting aligned nodes are highlighted in figure
•Transfer rules are partially lexicalized and read off the tree.
Active vs Proactive Learning
• Number of Oracles
 – Active Learning: Individual (only one)
 – Proactive Learning: Multiple, with different capabilities, costs and areas of expertise
• Reliability
 – Active Learning: Infallible (100% right)
 – Proactive Learning: Variable across oracles and queries, depending on difficulty, expertise, …
• Reluctance
 – Active Learning: Indefatigable (always answers)
 – Proactive Learning: Variable across oracles and queries, depending on workload, certainty, …
• Cost per query
 – Active Learning: Invariant (free or constant)
 – Proactive Learning: Variable across oracles and queries, depending on workload, difficulty, …
Note: “Oracle” ∈ {expert, experiment, computation, …}
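The contrast above suggests that a proactive learner must weigh reliability, reluctance, and cost when deciding whom to ask. The sketch below shows one simple way to combine them into an expected utility; the scoring rule, the oracle names, and all numbers are illustrative assumptions rather than the method from the talk.

```python
# Choosing an oracle under variable reliability, reluctance, and cost.
# The utility formula and every number below are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Oracle:
    name: str
    reliability: float   # probability that a given answer is correct
    answer_rate: float   # probability that the oracle answers at all
    cost: float          # cost charged per query


def expected_utility(oracle, value_of_label):
    """Expected benefit of a correct answer minus the query cost."""
    return value_of_label * oracle.reliability * oracle.answer_rate - oracle.cost


def choose_oracle(oracles, value_of_label):
    """Pick the oracle with the highest expected utility for this query."""
    return max(oracles, key=lambda o: expected_utility(o, value_of_label))


oracles = [
    Oracle("expert", reliability=0.95, answer_rate=0.90, cost=5.0),
    Oracle("crowd_worker", reliability=0.70, answer_rate=0.99, cost=0.5),
    Oracle("simulation", reliability=0.80, answer_rate=1.00, cost=2.0),
]

cheap_query = choose_oracle(oracles, value_of_label=10.0)
valuable_query = choose_oracle(oracles, value_of_label=100.0)
print(cheap_query.name, valuable_query.name)
```

Under these assumed numbers, a low-value query goes to the cheapest oracle while a high-value query justifies the expensive but reliable one.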
Proactive Learning: Wind Sampling
• Predict: prevalent direction, speed, seasonality
• Measurement towers: expensive – where, when?
Inferring the Protein Interactome
• Degree distribution / Hub analysis / Disease checking
• Graph modules analysis (from bi-clustering study)
• Protein-family based graph patterns (receptors / ligands / etc.)
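Degree distribution and hub analysis reduce to counting how many interaction partners each protein has. A minimal sketch follows, assuming the interactome is given as an undirected edge list; the protein identifiers and the hub threshold are hypothetical.

```python
# Degree counting and hub detection on a toy interaction graph.
# Protein names, edges, and the hub threshold are illustrative assumptions.
from collections import Counter


def degree_distribution(edges):
    """Count interaction partners per protein in an undirected edge list."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree


def hubs(degree, min_degree):
    """Proteins whose degree meets or exceeds the chosen threshold."""
    return sorted(p for p, d in degree.items() if d >= min_degree)


interactions = [
    ("P1", "P2"), ("P1", "P3"), ("P1", "P4"),
    ("P2", "P3"), ("P4", "P5"),
]

degree = degree_distribution(interactions)
hub_proteins = hubs(degree, min_degree=3)
print(dict(degree), hub_proteins)
```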
PPIs: Protein-Protein Interactions
• The cell machinery is run by the proteins
 – Enzymatic activities, replication, translation, transport, signaling, structural
• Proteins interact with each other to perform these functions
 – Through physical contact
 – Indirectly in a protein complex
 – Indirectly in a pathway
Source: http://www.cellsignal.com/reference/pathway/Apoptosis_Overview.html
Host Pathogen Interactomes
• HIV-1 depends on the cellular machinery in every aspect of its life cycle.
[Figure: HIV-1 life cycle stages: Fusion, Reverse transcription, Transcription, Budding, Maturation. Source: Peterlin and Trono, Nature Rev Immu 2003.]
PROTEINS: Sequence → Structure → Function (Borrowed from: Judith Klein-Seetharaman)
Primary Sequence: MNGTEGPNFY VPFSNKTGVV RSPFEAPQYY LAEPWQFSML AAYMFLLIML GFPINFLTLY VTVQHKKLRT PLNYILLNLA VADLFMVFGG FTTTLYTSLH GYFVFGPTGC NLEGFFATLG GEIALWSLVV LAIERYVVVC KPMSNFRFGE NHAIMGVAFT WVMALACAAP PLVGWSRYIP EGMQCSCGID YYTPHEETNN ESFVIYMFVV HFIIPLIVIF FCYGQLVFTV KEAAAQQQES ATTQKAEKEV TRMVIIMVIA FLICWLPYAG VAFYIFTHQG SDFGPIFMTI PAFFAKTSAV YNPVIYIMMN KQFRNCMVTT LCCGKNPLGD DEASTTVSKT ETSQVAPA
Folding → 3D Structure
Complex function within network of proteins
• Normal
• Cancer: unchecked cellular replication
The “Folding Problem”: 3D Protein Structure Determination In Silico
• In vitro lab experiments: time, cost, uncertainty, …
 – X-ray crystallography (months to crystallize, uncertain outcome); Nobel Prize, Kendrew & Perutz, 1962
 – NMR spectroscopy (only works for small proteins or domains); Nobel Prize, Kurt Wüthrich, 2002
• The gap between sequence and structure necessitates computational methods of protein structure determination
 – 3,023,461 sequences vs. 36,247 resolved structures (1.2%)
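The 1.2% figure follows directly from the two counts on this slide, as this short check shows (variable names are illustrative):

```python
# Share of known sequences that have an experimentally resolved structure,
# using the two counts quoted on the slide.
sequences = 3_023_461
resolved_structures = 36_247

coverage_percent = 100 * resolved_structures / sequences
print(f"{coverage_percent:.1f}%")
```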
Linked Segmentation CRF
• Nodes: secondary structure elements and/or simple folds
• Edges: local interactions and long-range inter-chain and intra-chain interactions
• L-SCRF: the conditional probability of y given x is defined as
$$P(y_1,\dots,y_R \mid x_1,\dots,x_R) = \frac{1}{Z} \prod_{y_{i,j} \in V_G} \exp\Big(\sum_k \lambda_k f_k(x_i, y_{i,j})\Big) \prod_{\langle y_{i,j}, y_{a,b} \rangle \in E_G} \exp\Big(\sum_l \mu_l g_l(x_i, x_a, y_{i,j}, y_{a,b})\Big)$$
[Figure label: Joint Labels]
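The conditional probability on this slide multiplies exponentiated node scores and edge scores and divides by a normalizer Z. The sketch below evaluates that form by brute-force enumeration on a tiny graph. It is a heavily simplified toy, assuming two labels, one hand-made node feature, and one hand-made edge feature; it does not model segments or multiple chains as the L-SCRF does.

```python
# Toy CRF-style conditional probability computed by exhaustive enumeration.
# Labels, features, weights, and the graph are illustrative assumptions.
import itertools
import math

LABELS = ["helix", "strand"]


def node_score(x, y):
    """Stands in for sum_k lambda_k f_k(x, y): one feature with weight 1.0."""
    return 1.0 if (x > 0.5) == (y == "helix") else 0.0


def edge_score(y_a, y_b):
    """Stands in for sum_l mu_l g_l(...): rewards linked nodes sharing a label."""
    return 0.5 if y_a == y_b else 0.0


def unnormalized(xs, ys, edges):
    """Product of exponentiated node and edge scores (exp of their sum)."""
    total = sum(node_score(x, y) for x, y in zip(xs, ys))
    total += sum(edge_score(ys[a], ys[b]) for a, b in edges)
    return math.exp(total)


def conditional_probability(xs, ys, edges):
    """P(ys | xs) with Z summed over every possible joint labeling."""
    z = sum(unnormalized(xs, candidate, edges)
            for candidate in itertools.product(LABELS, repeat=len(xs)))
    return unnormalized(xs, tuple(ys), edges) / z


observations = [0.9, 0.8, 0.1]   # one observed value per node
links = [(0, 1), (1, 2)]         # edges between node indices

labelings = list(itertools.product(LABELS, repeat=len(observations)))
best = max(labelings, key=lambda ys: conditional_probability(observations, ys, links))
print(best, conditional_probability(observations, best, links))
```

Enumeration is only feasible for a handful of nodes; it is used here to make the role of Z explicit.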
Fold Alignment Prediction: β-Helix
• Predicted alignment for known β-helices on cross-family validation
Parting Thoughts
• Data Science = Big Data + ML + Domains
 – Big data requires big (scalable) ML
 – Leverage scarce human expertise
• Machine Learning is the Engine for AI
 – Heretofore it was hand-coded knowledge
 – Applicable to virtually any domain
• Big Impact Everywhere
 – Bio: genetic, molecular, healthcare
 – Language: speech, MT, text mining
 – Education: student models, transfer
 – Energy: optimization, production, conservation
 – Finance: investment, fraud detection, risk analysis
THANK YOU!