Gad Getz, Stefano Monti, Michael Reich
{gadgetz,smonti,mreich}@broad.mit.edu
http://www.broad.mit.edu/~smonti/aws
Broad Institute of MIT & Harvard
October 18-20, 2006, Cambridge, MA

Workshop Format

• Morning lectures:
  – Principles of statistics and pattern recognition.
  – Their application to the analysis of expression data.

• Afternoon hands-on’s:
  – Practice sessions w/ GenePattern.
  – Application of the concepts presented in the lectures.

User Profile

• Knowledge of basic mathematical concepts assumed (square root, log, function, …)

• No (or little) previous analysis experience required.

• Basic familiarity with microarrays and expression analysis terms.

• Mixed audience: nobody satisfied.

Outline of the course

Lectures

• Day 1:
  – Introduction: Functional Genomics
  – GenePattern mini-tutorial
  – FG Pipeline:
    • Data Acquisition
    • Preprocessing & Visualization

• Day 2:
  – Supervised Analysis
    • Differential analysis/GSEA
    • Class Prediction/Classification
    • Validation

• Day 3:
  – [Survival analysis]
  – Unsupervised Analysis
    • Clustering, Bi-clustering
    • GO Annotation

Hands-on’s

• Day 1:
  – Preprocessing
  – Data visualization/Dimensionality reduction:
    • HeatMaps
    • PCA, NMF, MDS

• Day 2:
  – Differential Analysis/Annotation
  – Classification:
    • Model building/selection
    • Evaluation

• Day 3:
  – Clustering:
    • HC, NMF, CC, Bi-clustering
  – Annotation
  – Final Project

The use of high-throughput micro-arrays and computational tools for molecular profiling.

Functional genomics definition

• The use of systematic approaches to answer questions for the majority of genes in a genome, including:
  – when is a gene expressed?
  – with which other genes does it interact?
  – what results if a gene is switched on/off/mutated?

Functional genomics aspires to answer such questions systematically for all genes in a genome, in contrast to conventional approaches that do so for one gene at a time.

Paradigm for Functional Genomics

Biological States/Disease/Treatment
• Tumor vs. Normal
• Chemical treatment vs. untreated
• Remission vs. Refractory
• Successful vs. unsuccessful
• RNAi
• Time courses

Readouts
• DNA
  – Loss of Heterozygosity
• RNA
  – Expression Levels
• Protein
  – Relative abundance
  – Modification
  – Activity

Analysis and Understanding
• What pathways are affected by a disease?
• What pathways are modulated by a specific drug?
• What signatures predict tumor type or patient outcome?
• What genes confer susceptibility to disease?

Statistics, Machine Learning, Pattern Recognition
• Statistical inference
• Classification/Prediction
• Clustering
• Feature extraction / projection
• Pattern discovery
• Network extraction

High-throughput assay technologies

• DNA
  – Readouts: Polymorphism, Copy number variation, Loss of Heterozygosity
  – Technologies: SNP arrays, CGH arrays, sequencing, …

• RNA
  – Readouts: Expression levels
  – Technologies: Microarrays, SAGE, …

• Protein
  – Readouts: Relative abundance, Modification, Activity
  – Technologies: Mass Spectrometry, ChIP2chip, …

mRNA micro-array

• Measures the gene “activity” of 10K of genes at once.
• Read out organized in a high-dimensional numerical matrix (Genes by Samples).

[Figure labels: Genes (DNA, RNA, mRNA); Traits (Diseases, Physiology, Metabolism, Drug Resistance); Computational analysis]
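
The genes-by-samples matrix mentioned on this slide can be pictured with a small toy example. This is only a sketch: the gene names, sample names, and values are invented for illustration, and the helper functions `gene_profile` and `sample_profile` are hypothetical, not part of GenePattern.

```python
# Hypothetical toy data: rows are genes, columns are samples.
gene_ids = ["gene_A", "gene_B", "gene_C"]
sample_ids = ["tumor_1", "tumor_2", "normal_1", "normal_2"]
expression = [
    [120.0, 135.0, 30.0, 28.0],   # gene_A
    [55.0, 60.0, 58.0, 61.0],     # gene_B
    [10.0, 12.0, 95.0, 101.0],    # gene_C
]


def gene_profile(matrix, genes, gene):
    """Return the expression of one gene across all samples (one row)."""
    return matrix[genes.index(gene)]


def sample_profile(matrix, samples, sample):
    """Return the expression of all genes in one sample (one column)."""
    col = samples.index(sample)
    return [row[col] for row in matrix]


n_genes, n_samples = len(expression), len(expression[0])
print(f"{n_genes} genes x {n_samples} samples")
print("gene_A across samples:", gene_profile(expression, gene_ids, "gene_A"))
print("all genes in normal_1:", sample_profile(expression, sample_ids, "normal_1"))
```

A real array has thousands of rows and far fewer columns, which is what makes the matrix "high-dimensional": each sample is a point in a space with one axis per gene.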

Number of articles
PubMed query: “microarray” in title/abstract

[Figure: bar chart of the number of articles per year, 1996 to 2006 (x-axis: Year; y-axis: No. of articles), rising to 4908‡ for 2006.]

‡Extrapolated from first 5 months

The functional genomics pipeline

1. Experimental design: affects the outcome of data analysis.

2. Data acquisition: microarray processing.

3. Data preprocessing: scaling/normalization/filtering.

4. Data analysis/Hypothesis generation:
   – Supervised Analysis: Differential analysis, Classification, …
   – Unsupervised Analysis: Clustering, Bi-clustering, …

5. Validation/Annotation:
   – Enrichment analysis: GO annotation, GSEA, …
   – “In silico” testing: Cross validation, train/test, etc.
   – “In vitro” testing: Back to the lab
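
To make the preprocessing and differential-analysis stages of the pipeline more concrete, here is a minimal sketch in plain Python. It is not the GenePattern implementation: the function names, the floor of 20, the minimum fold variation of 3, and the toy data are all assumptions chosen for illustration, and the score is a simple difference of class means on the log2 scale rather than a full statistical test.

```python
import math
from statistics import mean


def preprocess(matrix, floor=20.0, min_fold=3.0):
    """Floor low values, drop low-variation genes, then log2-transform.

    Returns (kept_row_indices, log2_matrix). The floor and min_fold
    defaults are illustrative assumptions, not recommended settings.
    """
    kept, logged = [], []
    for i, row in enumerate(matrix):
        floored = [max(v, floor) for v in row]
        # Keep a gene only if it varies enough across samples.
        if max(floored) / min(floored) >= min_fold:
            kept.append(i)
            logged.append([math.log2(v) for v in floored])
    return kept, logged


def log_fold_change(row, labels, class_a, class_b):
    """Difference of mean log2 expression between class_a and class_b."""
    mean_a = mean(v for v, lab in zip(row, labels) if lab == class_a)
    mean_b = mean(v for v, lab in zip(row, labels) if lab == class_b)
    return mean_a - mean_b


# Hypothetical toy matrix: rows are genes, columns are samples.
expression = [
    [160.0, 160.0, 40.0, 40.0],   # higher in tumor
    [55.0, 60.0, 58.0, 61.0],     # roughly flat across samples
    [5.0, 12.0, 80.0, 80.0],      # higher in normal (low values get floored)
]
labels = ["tumor", "tumor", "normal", "normal"]

kept, logged = preprocess(expression)
scores = {
    i: log_fold_change(row, labels, "tumor", "normal")
    for i, row in zip(kept, logged)
}
print("genes kept after filtering:", kept)
print("log2 fold change (tumor vs. normal):", scores)
```

In this toy run the flat gene is removed by the variation filter, and the two remaining genes receive scores of opposite sign. Any candidate found this way would still go through the validation stage of the pipeline before being trusted.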

Introductory references

1. Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning. Springer-Verlag, 2001.

2. Nature Genetics supplements: The Chipping Forecast I [Nat Genet, 21(1s), 1999], II [Nat Genet, 32(4s), 2002], III [Nat Genet, 37(6s), 2005].

3. Allison, D. B., Cui, X., Page, G. P., and Sabripour, M. Microarray data analysis: from disarray to consolidation and consensus. Nat Rev Genet, 7: 55-65, 2006.

4. Hoffman, E. P., Awad, T., et al. Expression Profiling - Best Practices for Data Generation and Interpretation in Clinical Trials. Nat Rev Genet, 5: 229-237, 2004.

5. Larkin, J. E., Frank, B. C., et al. Independence and reproducibility across microarray platforms. Nat Meth, 2: 337-344, 2005.

6. Irizarry, R. A., Warren, D., et al. Multiple-laboratory comparison of microarray platforms. Nat Meth, 2: 345-350, 2005.
