Functional Interpretation of Gene Lists

Bing Zhang
Department of Biomedical Informatics, Vanderbilt University

Microarray experiment comparing metastatic and non-metastatic colon cancer cell lines
- [Figure: RNA from the parental cell line and the metastatic cell line, profiled on Affymetrix Mouse430_2 arrays, giving .cel files]
- Smith et al., Gastroenterology, 138:958-968, 2010

Where to get the data
- Gene Expression Omnibus (GEO)
  - http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE19073

Download and prepare the data in Linux
- In your home directory in Linux
  - mkdir gse19073
  - cd gse19073
  - wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE19nnn/GSE19073/suppl/GSE19073_RAW.tar
  - tar xvf GSE19073_RAW.tar
  - rm GSE19073_RAW.tar
  - gunzip *.gz
  - cp /home/igptest/diff/annotation.txt .
- (You can use nano to create the annotation file for future projects.)
- Annotation file format
  - The first row contains one fewer field
  - Columns are separated by tabs
  - Sample names are consistent with the .cel file names

Make sure you have the correct setting to use R
- In the first lecture, you were asked to get an ACCRE account, copy the file sample_file.txt under directory /home/igptest to your home directory, and add a line "setpkgs -a R" (without the quotation marks) to the end of your .bashrc file.

Analyze the data using R
- R CMD BATCH /home/igptest/diff/limma.r limma.rout &
- [Figure: listing of limma.r]

Explore the results using Excel
- Copy the output file data_rma_limma.csv to your local computer for exploring in Excel (SSH or Fugu)
- 863 significant probe set IDs (adjp<0.01 and fold-change>2), out of 45,101 probe sets
- Because the data are log2 based, fold-change>2 means abs(logFC)>1

Omics studies generate lists of interesting genes
- [Figure: microarray, RNA-Seq, and proteomics data; differential expression shown as a volcano plot of log2(ratio) against -log10(p value); example IDs 92546_r_at, 92545_f_at, 96055_at, 102105_f_at, 102700_at, 161361_s_at, 92202_g_at, 103548_at, 100947_at, 101869_s_at, 102727_at, 160708_at, ...]
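The significance filter used for the Excel step (adjp<0.01 and abs(logFC)>1) can also be applied in a script. The following is a minimal Python sketch, not the course's own method. The column names `ID`, `logFC`, and `adj.P.Val` are assumptions modeled on limma's usual result columns, because the actual header of data_rma_limma.csv is not shown; the sample rows and their IDs are made up for illustration.

```python
import csv
import io


def select_significant(rows, adjp_cutoff=0.01, logfc_cutoff=1.0,
                       id_col="ID", logfc_col="logFC", adjp_col="adj.P.Val"):
    """Return IDs with adjusted p below the cutoff and |log2 fold change| above it.

    Column names are assumptions; adjust them to the real file header.
    """
    selected = []
    for row in rows:
        adjp = float(row[adjp_col])
        logfc = float(row[logfc_col])
        # On the log2 scale, fold-change > 2 in either direction is |logFC| > 1.
        if adjp < adjp_cutoff and abs(logfc) > logfc_cutoff:
            selected.append(row[id_col])
    return selected


# Hypothetical rows standing in for data_rma_limma.csv.
SAMPLE_CSV = """ID,logFC,adj.P.Val
probe_a,2.3,0.0004
probe_b,-1.7,0.002
probe_c,0.4,0.0001
probe_d,3.1,0.2
"""

sample_rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
significant_ids = select_significant(sample_rows)
print(significant_ids)
```

To run this on the real output, the `io.StringIO(SAMPLE_CSV)` argument would be replaced by an opened file handle such as `open("data_rma_limma.csv", newline="")`, with the column names adjusted to match.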
- Clustering likewise yields lists of genes with potential biological interest

Understanding a gene list
- Level I
  - What are the genes behind the IDs and what do we know about the function of the genes?
- Example probe set IDs: 1451263_a_at, 1436486_x_at, 1451780_at, 1438237_at, 1417023_a_at, 1441054_at, 1416203_at, 1416295_a_at, 1435012_x_at, 1416069_at, 1436485_s_at, 1438148_at, 1452740_at, 1422184_a_at, ...

One-gene-at-a-time information systems

Biomart: a batch information retrieval system
- http://central.biomart.org/

Biomart analysis
- Choose dataset
  - Choose database: Ensembl 74 Genes
  - Choose dataset: Mus musculus genes
- Set filters
  - Gene: a list of genes identified by various database IDs (e.g. Affy probe set IDs)
  - Gene Ontology: filter for genes with specific GO terms (e.g. cell cycle)
  - Protein domains: filter for genes with specific protein domains (e.g. SH2 domain, signal domains)
  - Region: filter for genes in a specific chromosome region (e.g. chr1 1:1000000 or 11q13)
  - Others
- Select output attributes
  - Gene annotation information in the Ensembl database, e.g. gene description, chromosome name, gene start, gene end, strand, band, gene name, etc.
  - External data: Gene Ontology, IDs in other databases
  - Expression: anatomical system, development stage, cell type, pathology
  - Protein domains: SMART, PFAM, Interpro, etc.

Biomart: sample output

Understanding a gene list
- Level I
  - What are the genes behind the IDs and what do we know about the function of the genes?
- Level II
  - Which biological processes and pathways are the most interesting in terms of the experimental question?

Functional group enrichment analysis
- Differentially expressed genes (581 genes)
- System development (1842 genes)
- Compare the observed overlap (98 genes) with the expected overlap (65 genes)
- Gene symbols shown in the figure: Hoxa5, Hoxa11, Sash1, Ltbp3, Cd24a, Agt, Sox4, Foxc1, Psrc1, Ctla2b, Edn1, Ror2, Angptl4, Gnag, Depdc7, Sorbs1, Smad3, Wdr5, Macrod1, Enpp2, Trp63, Sox9, Tmem176a, Pax1, Acd, Rai1, Pitx1, ...
- Is the observed overlap significantly larger than the expected value?
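Whether an observed overlap (98 genes here) is significantly larger than expected is commonly assessed with a hypergeometric tail probability: with m genes in the reference set, j of them in the functional group, and n genes of interest, it gives the chance of drawing k or more group members at random. Below is a minimal sketch using only Python's standard library. The counts n = 581, j = 1842, and k = 98 come from the figure; the reference size m is not stated, so the value used is back-calculated from the expected overlap of about 65 and is an assumption.

```python
from math import comb


def expected_overlap(m, j, n):
    """Expected number of group genes among n genes drawn at random from m."""
    return n * j / m


def hypergeom_tail(m, j, n, k):
    """P(X >= k) for X ~ Hypergeometric(m total, j in group, n drawn)."""
    upper = min(n, j)
    lower = max(k, 0)
    # Exact integer arithmetic: count the favourable draws, then divide once.
    favourable = sum(comb(j, i) * comb(m - j, n - i)
                     for i in range(lower, upper + 1))
    return favourable / comb(m, n)


# n, j, k are from the slide; M_ASSUMED is an assumption (about 581 * 1842 / 65).
M_ASSUMED = 16465
n_genes, j_group, k_observed = 581, 1842, 98

expected = expected_overlap(M_ASSUMED, j_group, n_genes)
p_value = hypergeom_tail(M_ASSUMED, j_group, n_genes, k_observed)
print(f"expected overlap: {expected:.1f}")
print(f"P(overlap >= {k_observed}): {p_value:.3g}")
```

Exact integer binomials are used instead of floating-point factorials so that the large counts do not overflow. In R, `phyper` with `lower.tail=FALSE` computes the same kind of tail probability.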
Enrichment analysis: hypergeometric test

                      Significant genes   Non-significant genes   Total
  Genes in the group  k                   j-k                     j
  Other genes         n-k                 m-n-j+k                 m-j
  Total               n                   m-n                     m

Hypergeometric test: given a total of m genes where j genes are in the functional group, if we pick n genes randomly, what is the probability of having k or more genes from the group?

\[
p = \sum_{i=k}^{\min(n,j)} \frac{\binom{m-j}{n-i}\binom{j}{i}}{\binom{m}{n}}
\]

- Observed: k
- Expected: nj/m

Zhang et al., Nucleic Acids Res. 33:W741, 2005

Commonly used functional groups
- Gene Ontology (http://www.geneontology.org)
  - Structured, precisely defined, controlled vocabulary for describing the roles of genes and gene products
  - Three organizing principles: molecular function, biological process, and cellular component
- Pathways
  - KEGG (http://www.genome.jp/kegg/pathway.html)
  - Pathway commons (http://www.pathwaycommons.org)
  - WikiPathways (http://www.wikipathways.org)
- Cytogenetic bands
- Targets of transcription factors/miRNAs

WebGestalt: Web-based Gene Set Analysis Toolkit
- http://www.webgestalt.org
- 8 organisms: Human, Mouse, Rat, Dog, Fruitfly, Worm, Zebrafish, Yeast
- 196 ID types with mapping to Entrez Gene ID
  - Microarray Probe IDs: Affymetrix, Agilent, Codelink, Illumina
  - Gene IDs: Gene Symbol, GenBank, Ensembl Gene, RefSeq Gene, UniGene, Entrez Gene, SGD, MGI, Flybase ID, Wormbase ID, ZFIN
  - Protein IDs: UniProt, IPI, RefSeq Peptide, Ensembl Peptide
  - Genetic Variation IDs: dbSNP
- 59,278 functional categories with genes identified by Entrez Gene IDs
  - Gene Ontology: Biological Process, Molecular Function, Cellular Component
  - Pathway: KEGG, Pathway Commons, WikiPathways
  - Network module: Transcription factor targets, microRNA targets, Protein interaction modules
  - Disease and Drug: Disease association genes, Drug association genes
  - Chromosomal location: Cytogenetic bands
- Zhang et al., Nucleic Acids Res. 33:W741, 2005
- Wang et al., Nucleic Acids Res. 41:W77, 2013

WebGestalt: ID mapping
- Input list
  - 863 significant probe sets identified in the microarray study
- Mapping result
  - Total number of User IDs: 863. Unambiguously mapped User IDs to Entrez IDs: 734. Unique User Entrez IDs: 581. The Enrichment Analysis will be based upon the unique IDs.

WebGestalt: top 10 enriched GO biological processes

WebGestalt: top 10 enriched KEGG pathways

Understanding a gene list
- Level I
  - What are the genes behind the IDs and what do we know about the function of the genes?
- Level II
  - Which biological processes and pathways are the most interesting in terms of the experimental question?
- Level III
  - How do the gene products work together to form a functional network?

Biological networks

  Networks                              Nodes                     Edges                             Category
  Protein-protein interaction network   Proteins                  Physical interaction, undirected  Physical interaction networks
  Signaling network                     Proteins                  Modification, directed            Physical interaction networks
  Gene regulatory network               TFs/miRNAs, target genes  Physical interaction, directed    Physical interaction networks
  Metabolic network                     Metabolites               Metabolic reaction, directed      Physical interaction networks
  Co-expression network                 Genes/proteins            Co-expression, undirected         Functional association networks
  Genetic network                       Genes                     Genetic interaction, undirected   Functional association networks

Properties of complex networks
- Scale-free
- Modular
- Hierarchical

WebGestalt protein interaction module analysis

STRING (http://string-db.org)
- A database of known and predicted protein interactions, including both direct (physical) and indirect (functional) associations.
- Quantitatively integrates interaction data from different sources for a large number of organisms, and transfers information between these organisms where applicable.
- Covers 5,214,234 proteins from 1,133 organisms.

Understanding a gene list: summary
- Level I
  - What are the genes behind the IDs and what do we know about the function of the genes?
  - Biomart (http://www.biomart.org/)
- Level II
  - Which biological processes and pathways are the most interesting in terms of the experimental question?
  - WebGestalt (http://bioinfo.vanderbilt.edu/webgestalt)
  - Related tools: DAVID (http://david.abcc.ncifcrf.gov/), GenMAPP (http://www.genmapp.org/), GSEA (http://www.broadinstitute.org/gsea)
- Level III
  - How do the gene products work together to form a functional network?
  - STRING (http://string-db.org)
  - Related tools: Cytoscape (http://www.cytoscape.org/), Genemania (http://www.genemania.org), Ingenuity (http://www.ingenuity.com/), Pathway Studio (http://www.ariadnegenomics.com/products/pathway-studio/)
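The ID-mapping result reported for WebGestalt (863 user IDs, 734 unambiguously mapped, 581 unique Entrez IDs) shrinks at each step because some probe sets have no single gene assignment and several probe sets can point to the same gene. A minimal Python sketch of that bookkeeping follows. The mapping table and every ID in it are hypothetical placeholders, and this is not WebGestalt's actual implementation.

```python
def map_to_unique_genes(user_ids, id_to_genes):
    """Keep IDs that map to exactly one gene, then collapse to unique genes."""
    unambiguous = {}
    for user_id in user_ids:
        genes = id_to_genes.get(user_id, set())
        # Unmapped IDs (no gene) and ambiguous IDs (several genes) are dropped.
        if len(genes) == 1:
            unambiguous[user_id] = next(iter(genes))
    unique_genes = sorted(set(unambiguous.values()))
    return unambiguous, unique_genes


# Hypothetical probe-to-gene table, for illustration only.
HYPOTHETICAL_MAPPING = {
    "probe_1": {"gene_A"},
    "probe_2": {"gene_A"},            # second probe set for the same gene
    "probe_3": {"gene_B"},
    "probe_4": {"gene_C", "gene_D"},  # ambiguous: maps to two genes
}

user_ids = ["probe_1", "probe_2", "probe_3", "probe_4", "probe_5"]
mapped, unique_genes = map_to_unique_genes(user_ids, HYPOTHETICAL_MAPPING)
print(len(user_ids), len(mapped), len(unique_genes))
```

The three printed counts correspond to the three numbers in the mapping result: total user IDs, unambiguously mapped IDs, and unique gene IDs.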
