Functional interpretation of gene lists

Bing Zhang
Department of Biomedical Informatics
Vanderbilt University
[email protected]

Microarray data analysis workflow

[Figure: microarray data analysis workflow. Steps shown: microarray data, normalization, differential expression (volcano plot of log2(ratio) against -log10(p value), with probe set IDs such as 92546_r_at and 161361_s_at labeled), clustering, and the resulting lists of genes with potential biological interest.]

Applied Bioinformatics, Spring 2013

Omics studies generate gene/protein lists

• Genomics
  - Genome Wide Association Study (GWAS)
  - Next generation sequencing (NGS)
• Transcriptomics
  - mRNA profiling
    * Microarrays
    * Serial analysis of gene expression (SAGE)
    * RNA-Seq
  - Protein-DNA interaction
    * Chromatin immunoprecipitation
• Proteomics
  - Protein profiling
    * LC-MS/MS
  - Protein-protein interaction
    * Yeast two hybrid
    * Affinity pull-down/LC-MS/MS

Microarray experiment comparing metastatic and non-metastatic colon cancer cell lines

[Figure: RNA from the parental cell line and from the metastatic cell line, profiled on Affymetrix Mouse430_2 arrays.]

Smith et al., Gastroenterology, 138:958-968, 2010

Data matrix and differential expression analysis

863 significant probe set IDs (adjp < 0.01 and fold-change > 2), out of 45,101 probe sets.

* Because the data are log2-based, fold-change > 2 means abs(logFC) > 1.
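The two cutoffs can be sketched in a few lines of Python. This is only an illustration of the filtering rule stated above: the function name, the row layout, and the example values are hypothetical and are not taken from the study.

```python
import math


def is_significant(adj_p, log_fc, adj_p_cutoff=0.01, fold_change_cutoff=2.0):
    """Return True if a probe set passes both cutoffs.

    log_fc is a log2 fold change, so the fold-change cutoff is converted
    to the log2 scale first (fold-change > 2 is the same as |logFC| > 1).
    """
    return adj_p < adj_p_cutoff and abs(log_fc) > math.log2(fold_change_cutoff)


# Hypothetical rows: (probe set ID, adjusted p value, log2 fold change)
rows = [
    ("probe_A", 0.001, 1.8),   # up-regulated, passes both cutoffs
    ("probe_B", 0.001, -2.3),  # down-regulated, passes both cutoffs
    ("probe_C", 0.200, 3.0),   # large change but not significant
    ("probe_D", 0.005, 0.4),   # significant but small change
]

significant_ids = [pid for pid, adj_p, log_fc in rows if is_significant(adj_p, log_fc)]
print(significant_ids)
```

Taking the absolute value keeps both up- and down-regulated probe sets, which is why the rule is written as abs(logFC) > 1 rather than logFC > 1.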

Understanding a gene list

• Level I
  - What are the genes behind the IDs and what do we know about the function of the genes?

Example probe set IDs: 1451263_a_at, 1436486_x_at, 1451780_at, 1438237_at, 1417023_a_at, 1441054_at, 1416203_at, 1416295_a_at, 1435012_x_at, 1416069_at, 1436485_s_at, 1438148_at, 1452740_at, 1422184_a_at, ……

One-gene-at-a-time information systems

Biomart: a batch information retrieval system

• A batch system, in contrast to the “one-gene-at-a-time” systems, e.g. Entrez Gene
• Originally developed for the Ensembl genome databases (http://www.ensembl.org)
• Adopted by other projects including UniProt, InterPro, Reactome, Pancreatic Expression Database, and many others (a complete list and access to the tools are available at http://www.biomart.org/)

Biomart analysis

• Choose dataset
  - Choose database: Ensembl Genes 69
  - Choose dataset: Mus musculus genes (NCBIM37)
• Set filters
  - Gene: a list of genes identified by various database IDs (e.g. Affy probe set IDs)
  - Gene Ontology: filter for genes with specific GO terms (e.g. cell cycle)
  - Protein domains: filter for genes with specific protein domains (e.g. SH2 domain, signal domains)
  - Region: filter for genes in a specific region (e.g. chr1 1:1000000 or 11q13)
  - Others
• Select output attributes
  - Gene annotation information in the Ensembl database, e.g. gene description, chromosome name, gene start, gene end, strand, band, gene name, etc.
  - External data: Gene Ontology, IDs in other databases
  - Expression: anatomical system, development stage, cell type, pathology
  - Protein domains: SMART, Pfam, InterPro, etc.
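The same three choices (dataset, filters, output attributes) are also what a BioMart XML query expresses. The sketch below only builds such a query string and does not contact any server. The function name is hypothetical, and the dataset, filter, and attribute names (mmusculus_gene_ensembl, affy_mouse430_2, ensembl_gene_id, and so on) are assumptions that should be checked against the configuration of the mart actually used.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, filters, attributes):
    """Build a BioMart XML query string from the three choices on the slide.

    dataset    -- dataset name (assumed name, check the mart configuration)
    filters    -- dict mapping a filter name to a list of values
    attributes -- list of output attribute names
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, values in filters.items():
        # Multiple filter values are passed as one comma-separated string.
        ET.SubElement(ds, "Filter", {"name": name, "value": ",".join(values)})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    header = '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>'
    return header + ET.tostring(query, encoding="unicode")


query_xml = build_biomart_query(
    dataset="mmusculus_gene_ensembl",                               # assumed dataset name
    filters={"affy_mouse430_2": ["1451263_a_at", "1436486_x_at"]},  # assumed filter name
    attributes=["ensembl_gene_id", "description", "chromosome_name",
                "start_position", "end_position", "strand", "band"],  # assumed names
)
print(query_xml)
```

Submitting the full list of significant probe set IDs in one query is what makes this a batch retrieval, as opposed to looking up each ID separately.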

Biomart: sample output

Understanding a gene list

• Level I
  - What are the genes behind the IDs and what do we know about the function of the genes?
• Level II
  - Which biological processes and pathways are the most interesting in terms of the experimental question?

Functional group enrichment analysis

[Figure: Venn diagrams comparing the differentially expressed genes (581 genes) with the genes annotated to "System development" (1842 genes). Observed overlap: 98 genes; expected overlap: 65 genes. Gene symbols shown: Hoxa5, Hoxa11, Sash1, Ltbp3, Cd24a, Agt, Sox4, Foxc1, Psrc1, Ctla2b, Edn1, Ror2, Angptl4, Gnag, Depdc7, Sorbs1, Smad3, Wdr5, Macrod1, Enpp2, Trp63, Sox9, Tmem176a, Pax1, Acd, Rai1, Pitx1, ……]

• Is the observed overlap significantly larger than the expected value?

Enrichment analysis: hypergeometric test

                     Significant genes   Non-significant genes   Total
Genes in the group   k                   j-k                     j
Other genes          n-k                 m-n-j+k                 m-j
Total                n                   m-n                     m

Hypergeometric test: given a total of m genes where j genes are in the functional group, if we pick n genes randomly, what is the probability of having k or more genes from the group? Here k is the observed number of significant genes in the group.

\[
p = \sum_{i=k}^{\min(n,j)} \frac{\binom{m-j}{n-i}\binom{j}{i}}{\binom{m}{n}}
\]

Zhang et al., Nucleic Acids Res. 33:W741, 2005
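The tail probability can be computed directly from its definition. The sketch below uses the symbols m, j, n, and k from the contingency table, with exact integer arithmetic from the Python standard library. The function names and the small example values are hypothetical, chosen so that the result can be checked by hand. The expected overlap n*j/m is the mean of the hypergeometric distribution.

```python
from fractions import Fraction
from math import comb


def hypergeom_enrichment_p(m, j, n, k):
    """P(X >= k): probability of drawing k or more genes from the group.

    m -- total number of genes
    j -- number of genes in the functional group
    n -- number of genes picked (the significant genes)
    k -- observed number of picked genes that are in the group
    """
    upper = min(n, j)
    tail = sum(comb(m - j, n - i) * comb(j, i) for i in range(k, upper + 1))
    return Fraction(tail, comb(m, n))


def expected_overlap(m, j, n):
    """Mean overlap under random picking: n * j / m."""
    return Fraction(n * j, m)


# Hypothetical toy numbers: 10 genes, 4 in the group, 3 picked, 2 observed.
p = hypergeom_enrichment_p(m=10, j=4, n=3, k=2)
print(p, float(p))                      # exact fraction and decimal value
print(expected_overlap(m=10, j=4, n=3))
```

For the toy numbers, the two terms of the sum are C(6,1)*C(4,2) = 36 and C(6,0)*C(4,3) = 4, and C(10,3) = 120, so p = 40/120 = 1/3. For genome-scale inputs the same function works unchanged, because Python integers do not overflow, although a statistics library would be faster.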

Commonly used functional groups

• Gene Ontology (http://www.geneontology.org)
  - Structured, precisely defined, controlled vocabulary for describing the roles of genes and gene products
  - Three organizing principles: molecular function, biological process, and cellular component
• Pathways
  - KEGG (http://www.genome.jp/kegg/pathway.html)
  - Pathway Commons (http://www.pathwaycommons.org)
  - WikiPathways (http://www.wikipathways.org)
• Cytogenetic bands
• Targets of transcription factors/miRNAs

WebGestalt: Web-based Gene Set Analysis Toolkit

http://bioinfo.vanderbilt.edu/webgestalt
Zhang et al., Nucleic Acids Res. 33:W741, 2005

8 organisms: Human, Mouse, Rat, Dog, Fruitfly, Worm, Zebrafish, Yeast

196 ID types with mapping to Entrez Gene ID:
• Microarray Probe IDs: Affymetrix, Agilent, Codelink, Illumina
• Gene IDs: Gene Symbol, GenBank, Ensembl Gene, RefSeq Gene, UniGene, Entrez Gene, SGD, MGI, Flybase ID, Wormbase ID, ZFIN
• Protein IDs: UniProt, IPI, RefSeq Peptide, Ensembl Peptide
• Genetic Variation IDs: dbSNP

59,278 functional categories with genes identified by Entrez Gene IDs:
• Gene Ontology: Biological Process, Molecular Function, Cellular Component
• Pathway: KEGG, Pathway Commons, WikiPathways
• Network module: Transcription factor targets, microRNA targets, Protein interaction modules
• Disease and Drug: Disease association genes, Drug association genes
• Chromosomal location: Cytogenetic bands

WebGestalt: ID mapping

• Input list
  - 863 significant probe sets identified in the microarray study
• Mapping result
  - Total number of User IDs: 863. Unambiguously mapped User IDs to Entrez IDs: 734. Unique User Entrez IDs: 581. The Enrichment Analysis will be based upon the unique IDs.
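The drop from 863 IDs to 734 mapped IDs to 581 unique Entrez IDs can be illustrated with a small sketch. It assumes that "unambiguously mapped" means a probe set maps to exactly one Entrez ID, and that several probe sets may share a gene. This is a guess at the logic, not WebGestalt's actual implementation, and the mapping table and IDs below are hypothetical.

```python
def map_to_unique_entrez(user_ids, probe_to_entrez):
    """Map probe set IDs to Entrez IDs and collapse duplicates.

    Probe sets with no mapping, or with more than one Entrez ID, are
    dropped (assumed meaning of "unambiguously mapped").
    """
    mapped = {}
    for probe in user_ids:
        targets = probe_to_entrez.get(probe, set())
        if len(targets) == 1:
            mapped[probe] = next(iter(targets))
    unique_entrez = sorted(set(mapped.values()))
    return mapped, unique_entrez


# Hypothetical mapping table: probe set ID -> set of Entrez Gene IDs
probe_to_entrez = {
    "p1": {"1001"},
    "p2": {"1001"},          # second probe set for the same gene
    "p3": {"1002", "1003"},  # ambiguous: maps to two genes
    "p5": {"1004"},
}
user_ids = ["p1", "p2", "p3", "p4", "p5"]  # "p4" has no mapping at all

mapped, unique_entrez = map_to_unique_entrez(user_ids, probe_to_entrez)
print(len(user_ids), len(mapped), len(unique_entrez))
```

In this toy case five user IDs give three unambiguous mappings and two unique genes, mirroring the 863, 734, 581 pattern reported above. Basing the enrichment test on unique genes avoids counting a gene twice because it has two probe sets.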

WebGestalt: top 10 enriched GO biological processes

WebGestalt: top 10 enriched KEGG pathways

Understanding a gene list

• Level I
  - What are the genes behind the IDs and what do we know about the function of the genes?
• Level II
  - Which biological processes and pathways are the most interesting in terms of the experimental question?
• Level III
  - How do the gene products work together to form a functional network?

Biological networks

Networks                              Nodes                      Edges                              Category
Protein-protein interaction network   Proteins                   Physical interaction, undirected   Physical interaction networks
Signaling network                     Proteins                   Modification, directed             Physical interaction networks
Gene regulatory network               TFs/miRNAs, target genes   Physical interaction, directed     Physical interaction networks
Metabolic network                     Metabolites                Metabolic reaction, directed       Physical interaction networks
Co-expression network                 Genes/proteins             Co-expression, undirected          Functional association networks
Genetic network                       Genes                      Genetic interaction, undirected    Functional association networks
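The directed/undirected distinction in the table affects how a network is stored. The minimal adjacency-set sketch below makes the difference concrete. The class, its method names, and the node labels are all hypothetical and do not describe real interactions.

```python
class Network:
    """Minimal adjacency-set network; edges are directed or undirected."""

    def __init__(self, directed):
        self.directed = directed
        self.neighbors = {}  # node -> set of nodes reachable by one edge

    def add_edge(self, source, target):
        self.neighbors.setdefault(source, set()).add(target)
        self.neighbors.setdefault(target, set())  # register the target node
        if not self.directed:
            # An undirected edge is stored in both directions.
            self.neighbors[target].add(source)

    def out_degree(self, node):
        """Number of edges leaving a node (plain degree if undirected)."""
        return len(self.neighbors.get(node, set()))


# Undirected, e.g. a protein-protein interaction network (hypothetical nodes)
ppi = Network(directed=False)
ppi.add_edge("protein_A", "protein_B")
ppi.add_edge("protein_A", "protein_C")

# Directed, e.g. a gene regulatory network: TF -> target gene (hypothetical)
grn = Network(directed=True)
grn.add_edge("tf_X", "gene_1")
grn.add_edge("tf_X", "gene_2")

print(ppi.out_degree("protein_B"), grn.out_degree("gene_1"))
```

In the undirected network protein_B has degree 1 because the edge counts for both endpoints, while in the directed network gene_1 has out-degree 0 because it is only a target.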

Properties of complex networks

[Figure: three network properties illustrated: scale-free, modular, hierarchical.]

WebGestalt protein interaction module analysis

STRING (http://string-db.org)

• A database of known and predicted protein interactions, including both direct (physical) and indirect (functional) associations.
• Quantitatively integrates interaction data from different sources for a large number of organisms, and transfers information between these organisms where applicable.
• Covers 5,214,234 proteins from 1,133 organisms.
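Besides the web interface, STRING can be queried programmatically. The sketch below only constructs a request URL for a gene list and does not send it. The URL layout (api/<format>/network with identifiers and species parameters) follows the STRING REST API as I understand it and should be confirmed against the current STRING documentation. The function name is hypothetical, and 10090 is the NCBI taxonomy ID for mouse.

```python
from urllib.parse import urlencode

STRING_API_BASE = "https://string-db.org/api"  # assumed base URL


def build_string_network_url(identifiers, species, output_format="tsv"):
    """Build a STRING network request URL for a list of gene identifiers.

    Identifiers are joined with carriage returns, which urlencode turns
    into %0D separators in the query string.
    """
    params = {"identifiers": "\r".join(identifiers), "species": species}
    return f"{STRING_API_BASE}/{output_format}/network?{urlencode(params)}"


# Gene symbols taken from the differentially expressed list shown earlier
url = build_string_network_url(["Smad3", "Sox9", "Hoxa5"], species=10090)
print(url)
```

Building the URL separately from sending it makes it easy to inspect the request before any network traffic happens.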

Understanding a gene list: summary

• Level I
  - What are the genes behind the IDs and what do we know about the function of the genes?
  - Biomart (http://www.biomart.org/)
• Level II
  - Which biological processes and pathways are the most interesting in terms of the experimental question?
  - WebGestalt (http://bioinfo.vanderbilt.edu/webgestalt)
  - Related tools: DAVID (http://david.abcc.ncifcrf.gov/), GenMAPP (http://www.genmapp.org/), GSEA (http://www.broadinstitute.org/gsea)
• Level III
  - How do the gene products work together to form a functional network?
  - STRING (http://string-db.org)
  - Related tools: Cytoscape (http://www.cytoscape.org/), GeneMANIA (http://www.genemania.org), Ingenuity (http://www.ingenuity.com/), Pathway Studio (http://www.ariadnegenomics.com/products/pathway-studio/)
