Functional Genomics Algorithms and Tools

Gad Getz, Stefano Monti, Michael Reich
{gadgetz,smonti,mreich}@broad.mit.edu
http://www.broad.mit.edu/~smonti/aws
Broad Institute of MIT & Harvard
October 18-20, 2006, Cambridge, MA

Workshop Format
• Morning lectures:
  – Principles of statistics, machine learning and pattern recognition.
  – Their application to the analysis of gene expression data.
• Afternoon hands-on sessions:
  – Practice sessions with GenePattern.
  – Application of the concepts presented in the lectures.

User Profile
• Knowledge of basic mathematical concepts assumed (square root, log, function, …).
• No (or little) previous analysis experience required.
• Basic familiarity with microarrays and expression analysis terms.
• Mixed audience: nobody satisfied.

Outline of the course

Lectures
• Day 1:
  – Introduction: Functional Genomics
  – GenePattern mini-tutorial
  – FG Pipeline:
    • Data Acquisition
    • Preprocessing & Visualization
• Day 2:
  – Supervised Analysis
    • Differential analysis/GSEA
    • Class Prediction/Classification
    • Validation
    • [Survival analysis]
• Day 3:
  – Unsupervised Analysis
    • Clustering, Bi-clustering
  – Annotation

Hands-on sessions
• Day 1:
  – Preprocessing
  – Data visualization/Dimensionality reduction:
    • HeatMaps
    • PCA, NMF, MDS
• Day 2:
  – Differential Analysis/Annotation
  – Classification:
    • Model building/selection
    • Evaluation
• Day 3:
  – Clustering:
    • HC, NMF, CC, Bi-clustering
    • GO Annotation
  – Final Project

Functional Genomics
The use of high-throughput gene expression microarrays and computational tools for molecular profiling.

Functional genomics definition
• The use of systematic approaches to answer questions for the majority of genes in a genome, including:
  – When is a gene expressed?
  – With which other genes does it interact?
  – What phenotype results if a gene is switched on, switched off, or mutated?
Functional genomics aspires to answer such questions systematically for all genes in a genome, in contrast to conventional approaches that do so for one gene at a time.

Paradigm for Functional Genomics
Biological States/Phenotypes
• Tumor vs. Normal
• Chemical treatment vs. untreated
• Remission vs. Refractory Disease
• Successful vs. unsuccessful Treatment
• RNAi
• Time courses
Readouts
• DNA: Polymorphism, Mutation, Loss of Heterozygosity
• RNA: Expression Levels
• Protein: Relative abundance, Modification, Activity
Analysis and Understanding (Statistics, Machine Learning, Pattern Recognition)
• Statistical inference
• Classification/Prediction
• Clustering
• Feature extraction / projection
• Pattern discovery
• Network extraction
Questions addressed
• What pathways are affected by a disease?
• What pathways are modulated by a specific drug?
• What signatures predict tumor type or patient outcome?
• What genes confer susceptibility to disease?

High-throughput assay technologies
• DNA (Polymorphism, Copy number variation, Loss of Heterozygosity): SNP arrays, CGH arrays, sequencing
• RNA (Expression levels): Microarrays, SAGE, …
• Protein (Relative abundance, Modification, Activity): Mass Spectrometry, ChIP2chip, …

mRNA microarray
• Measures the gene “activity” of tens of thousands of genes at once.
• Readout is organized in a high-dimensional numerical matrix (Genes × Samples).
• DNA → (transcription) → mRNA → (translation) → Proteins → Traits (Diseases, Physiology, Metabolism, Drug Resistance)
• The expression matrix is the input to computational analysis.

Number of articles
[Figure: number of PubMed articles per year, 1996 to 2006, for the query “microarray” in title/abstract. The count rises steadily to 4908‡ in 2006.]
‡Extrapolated from first 5 months

The functional genomics pipeline
• Experimental design: affects the outcome of the data analysis
• Data acquisition: microarray processing
• Data preprocessing: scaling/normalization/filtering
• Data analysis/Hypothesis generation:
  – Supervised Analysis: Differential analysis, Classification, …
  – Unsupervised Analysis: Clustering, Bi-clustering, …
• Validation/Annotation:
  – Enrichment analysis: GO annotation, GSEA, …
  – “In silico” testing: Cross validation, train/test, etc.
  – “In vitro” testing: Back to the lab

Introductory references
1. Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning. Springer-Verlag, 2001.
2. Nature Genetics supplements: The Chipping Forecast I [Nat Genet, 21(1s), 1999], II [Nat Genet, 32(4s), 2002], III [Nat Genet, 37(6s), 2005].
3. Allison, D. B., Cui, X., Page, G. P., and Sabripour, M. Microarray data analysis: from disarray to consolidation and consensus. Nat Rev Genet, 7: 55-65, 2006.
4. Hoffman, E. P., Awad, T., et al. Expression Profiling - Best Practices for Data Generation and Interpretation in Clinical Trials. Nat Rev Genet, 5: 229-237, 2004.
5. Larkin, J. E., Frank, B. C., et al. Independence and reproducibility across microarray platforms. Nat Meth, 2: 337-344, 2005.
6. Irizarry, R. A., Warren, D., et al. Multiple-laboratory comparison of microarray platforms. Nat Meth, 2: 345-350, 2005.
7.
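
Illustration: the data preprocessing stage
The pipeline above names scaling/normalization/filtering as the preprocessing stage but gives no details. The following is a minimal sketch of what such a step might look like on a genes × samples matrix. The function name `preprocess`, the median-scaling scheme, and the default thresholds (`floor`, `ceiling`, `min_fold`) are assumptions chosen for illustration; they are not taken from the slides and are not GenePattern's implementation.

```python
import math
from statistics import median


def preprocess(matrix, floor=20.0, ceiling=16000.0, min_fold=3.0):
    """Hypothetical preprocessing sketch.

    matrix: list of gene rows, each a list of positive per-sample
    expression values (genes x samples).
    Returns (kept_row_indices, log2_matrix_of_kept_rows).
    All default thresholds are illustrative assumptions.
    """
    n_samples = len(matrix[0])

    # Scaling: rescale each sample (column) to a common median intensity.
    col_medians = [median(row[j] for row in matrix) for j in range(n_samples)]
    target = median(col_medians)
    scaled = [
        [row[j] * target / col_medians[j] for j in range(n_samples)]
        for row in matrix
    ]

    # Thresholding: clip values to the range [floor, ceiling].
    clipped = [[min(max(v, floor), ceiling) for v in row] for row in scaled]

    # Filtering: keep genes whose max/min ratio across samples
    # is at least min_fold.
    kept = [
        i for i, row in enumerate(clipped)
        if max(row) / min(row) >= min_fold
    ]

    # Log transform the surviving genes.
    logged = [[math.log2(v) for v in clipped[i]] for i in kept]
    return kept, logged


if __name__ == "__main__":
    toy = [
        [100.0, 800.0],   # varies 8-fold across samples
        [200.0, 200.0],   # flat
        [400.0, 100.0],   # varies 4-fold
    ]
    kept, logged = preprocess(toy)
    print(kept)
    print(logged)
```

In this toy matrix the flat gene is removed by the variation filter, while the two genes that change between samples are kept and log-transformed. Real analyses would choose the scaling method and thresholds to suit the array platform.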
