Gad Getz, Stefano Monti, Michael Reich
{gadgetz,smonti,mreich}@broad.mit.edu
http://www.broad.mit.edu/~smonti/aws
Broad Institute of MIT & Harvard
October 18-20, 2006, Cambridge, MA
Workshop Format
• Morning lectures:
  – Principles of statistics, machine learning and pattern recognition.
  – Their application to the analysis of gene expression data.
• Afternoon hands-on sessions:
  – Practice sessions with GenePattern.
  – Application of the concepts presented in the lectures.
User Profile
• Knowledge of basic mathematical concepts assumed (square root, log, function, …)
• No (or little) previous analysis experience required.
• Basic familiarity with microarrays and expression analysis terms.
• Mixed audience: nobody satisfied.
Outline of the course
Lectures
• Day 1:
  – Introduction: Functional Genomics
  – GenePattern mini-tutorial
  – FG Pipeline:
    • Data Acquisition
    • Preprocessing & Visualization
• Day 2:
  – Supervised Analysis
    • Differential analysis/GSEA
    • Class Prediction/Classification
    • Validation
• Day 3:
    • [Survival analysis]
  – Unsupervised Analysis
    • Clustering, Bi-clustering
  – Annotation

Hands-on’s
• Day 1:
  – Preprocessing
  – Data visualization/Dimensionality reduction:
    • HeatMaps
    • PCA, NMF, MDS
• Day 2:
  – Differential Analysis/Annotation
  – Classification:
    • Model building/selection
    • Evaluation
• Day 3:
  – Clustering:
    • HC, NMF, CC, Bi-clustering
    • GO Annotation
  – Final Project
The use of high-throughput gene expression micro-arrays and computational tools for molecular profiling.
Functional genomics definition
• The use of systematic approaches to answer questions for the majority of genes in a genome, including:
  – when is a gene expressed?
  – with which other genes does it interact?
  – what phenotype results if a gene is switched on, switched off, or mutated?
Functional genomics aspires to answer such questions systematically for all genes in a genome in contrast to conventional approaches that do so for one gene at a time.
Paradigm for Functional Genomics

Biological States/Phenotypes
• Tumor vs. Normal
• Chemical treatment vs. untreated
• Remission vs. Refractory Disease
• Successful vs. unsuccessful Treatment
• RNAi
• Time courses

Readouts
• DNA: Polymorphism, Mutation, Loss of Heterozygosity
• RNA: Expression Levels
• Protein: Relative abundance, Modification, Activity

Analysis and Understanding
• Methods (Statistics, Machine Learning, Pattern Recognition):
  – Statistical inference
  – Classification/Prediction
  – Clustering
  – Feature extraction / projection
  – Pattern discovery
  – Network extraction
• Questions:
  – What pathways are affected by a disease?
  – What pathways are modulated by a specific drug?
  – What signatures predict tumor type or patient outcome?
  – What genes confer susceptibility to disease?
High-throughput assay technologies

DNA
• Readouts: Polymorphism, Copy number variation, Loss of Heterozygosity
• Technologies: SNP arrays, CGH arrays, sequencing

RNA
• Readouts: Expression levels
• Technologies: Microarrays, SAGE, …

Protein
• Readouts: Relative abundance, Modification, Activity
• Technologies: Mass Spectrometry, ChIP-chip, …
mRNA micro-array
• Measures the “activity” of tens of thousands of genes at once.
• Read out organized in a high-dimensional numerical matrix, with Samples and Genes as its two axes.
[Figure: DNA (genes) → transcription → RNA (mRNA) → translation → proteins → traits (diseases, physiology, metabolism, drug resistance). The mRNA measurements are the input to computational analysis.]
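To make the "high-dimensional numerical matrix" concrete, here is a minimal sketch in plain Python. It assumes rows are genes and columns are samples (the slide does not fix the orientation), and every gene name, sample name, and value below is invented for illustration.

```python
# Toy expression matrix: rows are genes, columns are samples.
# All names and values are hypothetical, chosen only for illustration.
genes = ["gene_A", "gene_B", "gene_C"]
samples = ["tumor_1", "tumor_2", "normal_1", "normal_2"]
matrix = [
    [1200.0, 1350.0, 310.0, 290.0],   # gene_A
    [45.0, 52.0, 48.0, 50.0],         # gene_B
    [800.0, 760.0, 1500.0, 1620.0],   # gene_C
]


def expression_profile(gene):
    """Return one row of the matrix: the gene's value in every sample."""
    row = matrix[genes.index(gene)]
    return dict(zip(samples, row))


def sample_column(sample):
    """Return one column of the matrix: every gene's value in one sample."""
    j = samples.index(sample)
    return {g: row[j] for g, row in zip(genes, matrix)}


n_genes, n_samples = len(matrix), len(matrix[0])
print(f"{n_genes} genes x {n_samples} samples")
print(expression_profile("gene_A"))
print(sample_column("tumor_2"))
```

A real array has thousands of rows and typically far fewer columns, which is why the matrix is called high-dimensional.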
Number of articles
PubMed query: “microarray” in title/abstract
[Figure: bar chart of the number of articles per year, 1996 to 2006. Bar labels legible in the source include 230, 792, 1542, 2383, 3377, 4211 and 4908‡.]
‡ Extrapolated from first 5 months
The functional genomics pipeline
• Experimental design: affects the outcome of the data analysis
• Data acquisition: microarray processing
• Data preprocessing: scaling/normalization/filtering
• Data analysis/Hypothesis generation:
  – Supervised Analysis: Differential analysis, Classification, …
  – Unsupervised Analysis: Clustering, Bi-clustering, …
• Validation/Annotation:
  – Enrichment analysis: GO annotation, GSEA, …
  – “In silico” testing: Cross validation, train/test, etc.
  – “In vitro” testing: Back to the lab
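The preprocessing and differential-analysis stages of the pipeline can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the GenePattern implementation: the thresholds (`floor`, `ceiling`, `min_fold`), the toy data, and the function names are all hypothetical, and a Welch t-statistic stands in for whichever differential score the course actually uses.

```python
import math
from statistics import mean, stdev


def preprocess(matrix, floor=20.0, ceiling=16000.0, min_fold=2.0):
    """Threshold each value, drop low-variation genes, then log2-transform.

    Returns (kept_row_indices, log2_rows). All cut-offs are assumed values.
    """
    kept, log_rows = [], []
    for i, row in enumerate(matrix):
        clipped = [min(max(v, floor), ceiling) for v in row]
        # Variation filter: keep a gene only if max/min reaches min_fold.
        if max(clipped) / min(clipped) < min_fold:
            continue
        kept.append(i)
        log_rows.append([math.log2(v) for v in clipped])
    return kept, log_rows


def welch_t(row, labels):
    """Welch t-statistic of class 1 vs. class 0 for one gene.

    Needs at least two samples per class and non-zero variance.
    """
    a = [v for v, lab in zip(row, labels) if lab == 1]
    b = [v for v, lab in zip(row, labels) if lab == 0]
    se = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    return (mean(a) - mean(b)) / se


# Hypothetical data: 3 genes x 4 samples, two tumor (1) and two normal (0).
toy_matrix = [
    [1200.0, 1350.0, 310.0, 290.0],   # higher in tumor
    [45.0, 52.0, 48.0, 50.0],         # flat, removed by the variation filter
    [800.0, 760.0, 1500.0, 1620.0],   # higher in normal
]
toy_labels = [1, 1, 0, 0]

kept, log_rows = preprocess(toy_matrix)
scores = {i: welch_t(row, toy_labels) for i, row in zip(kept, log_rows)}
for i, t in sorted(scores.items(), key=lambda kv: -abs(kv[1])):
    print(f"gene row {i}: t = {t:.2f}")
```

Genes with a large positive or negative score are candidates for further validation; with thousands of genes tested at once, the scores would still need a multiple-testing correction before any gene is called significant.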
Introductory references
1. Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning. Springer-Verlag, 2001.
2. Nature Genetics supplements: The Chipping Forecast I [Nat Genet, 21(1s), 1999], II [Nat Genet, 32(4s), 2002], III [Nat Genet, 37(6s), 2005].
3. Allison, D. B., Cui, X., Page, G. P., and Sabripour, M. Microarray data analysis: from disarray to consolidation and consensus. Nat Rev Genet, 7: 55-65, 2006.
4. Hoffman, E. P., Awad, T., et al. Expression Profiling - Best Practices for Data Generation and Interpretation in Clinical Trials. Nat Rev Genet, 5: 229-237, 2004.
5. Larkin, J. E., Frank, B. C., et al. Independence and reproducibility across microarray platforms. Nat Meth, 2: 337-344, 2005.
6. Irizarry, R. A., Warren, D., et al. Multiple-laboratory comparison of microarray platforms. Nat Meth, 2: 345-350, 2005.