Functional interpretation of lists

Bing Zhang Department of Biomedical Informatics Vanderbilt University [email protected] Microarray experiment comparing metastatic and non- metastatic colon cancer cell lines

Parental cell line

Affymetrix RNA .cel files Mouse430_2

Metastatic cell line

Smith et al., Gastroenterology, 138:958-968, 2010

2 Where to get the data n Omnibus (GEO)

q http://www.ncbi.nlm.nih.gov/ geo/query/acc.cgi? acc=GSE19073

3 Download and prepare the data in Linux n In your home directory in Linux

q mkdir gse19073

q cd gse19073

q wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE19nnn/GSE19073/suppl/GSE19073_RAW.tar

q tar xvf GSE19073_RAW.tar

q rm GSE19073_RAW.tar

q gunzip *.gz

q cp /home/igptest/diff/annotation.txt .

n (can use nano to create the annotation file for future projects)

first row contains one fewer field Columns separated by Sample names tab consistent with cel file names

4 Make sure you have the correct setting to use R

In the first lecture, you were asked to get an ACCRE account, copy the file sample_file.txt under directory /home/igptest to your home directory, and add a line “setpkgs –a R” (without the quotation marks) to the end of your .bashrc file.

5 Analyze the data using R

n R CMD BATCH /home/igptest/diff/limma.r limma.rout &

limma.r

6 Explore the results using excel

n Copy the output file data_rma_limma.csv to your local computer for exploring in Excel (SSH or Fugu)

863 significant probe set IDs (adjp<0.01 and fold-change>2), out of 45,101 probe sets *because the data are log2 based, fold-change>2 means abs(logFC)>1

7 Omics studies generate lists of interesting

log2(ratio) 92546_r_at 92545_f_at 96055_at 102105_f_at Microarray 102700_at -log10(p value) 161361_s_at RNA-Seq Differential 92202_g_at expression 103548_at 100947_at Proteomics 101869_s_at 102727_at 160708_at …...

Clustering Lists of genes with potential biological interest

8 Understanding a gene list

n Level I 1451263_a_at 1436486_x_at q What are the genes behind the IDs and 1451780_at what do we know about the function of the 1438237_at 1417023_a_at genes? 1441054_at 1416203_at 1416295_a_at 1435012_x_at 1416069_at 1436485_s_at 1438148_at 1452740_at 1422184_a_at ……

9 One-gene-at-a-time information systems

10 Biomart: a batch information retrieval system

http://central.biomart.org/

11 Biomart analysis

n Choose dataset

q Choose database: Ensembl 74 Genes

q Choose dataset: Mus musculus genes

n Set filters

q Gene: a list of genes identified by various database IDs (e.g. Affy probe set IDs)

q : filter for genes with specific GO terms (e.g. cell cycle)

q domains: filter for genes with specific protein domains (e.g. SH2 domain, signal domains )

q Region: filter for genes in a specific region (e.g. chr1 1:1000000 or 11q13)

q Others

n Select output attributes

q Gene annotation information in the Ensembl database, e.g. gene description, chromosome name, gene start, gene end, strand, band, gene name, etc.

q External data: Gene Ontology, IDs in other databases

q Expression: anatomical system, development stage, cell type, pathology

q Protein domains: SMART, PFAM, Interpro, etc.

12 Biomart: sample output

13 Understanding a gene list

n Level I

q What are the genes behind the IDs and what do we know about the function of the genes?

n Level II

q Which biological processes and pathways are the most interesting in terms of the experimental question?

14 Functional group enrichment analysis

98 Hoxa5 Hoxa11 Sash1 Ltbp3 Cd24a Agt Sox4 581 1842 Foxc1 Psrc1 Ctla2b Edn1 Ror2 Angptl4 Gnag Depdc7 Observe Sorbs1 Smad3 compare Wdr5 Macrod1 Enpp2 Trp63 Sox9 Tmem176a 65 Pax1 …… Acd Rai1 Pitx1 581 1842 …… Differentially expressed genes (581 genes) Expect

System development n Is the observed overlap significantly (1842 genes) larger than the expected value?

15 Enrichment analysis: hypergeometric test

Significant genes Non-significant genes Total

genes in the group k j-k j

Other genes n-k m-n-j+k m-j Total n m-n m

Hypergeometric test: given a total of m genes where j genes are in the functional group, if we pick n genes randomly, what is the probability of having k or more genes from the group? Observed # m − j&# j& k min(n, j ) % (% ( $ n − i '$ i' p = n j ∑ # m& i= k % ( m $ n ' Zhang et.al. Nucleic Acids Res. 33:W741, 2005

16 € Commonly used functional groups

n Gene Ontology (http://www.geneontology.org)

q Structured, precisely defined, controlled vocabulary for describing the roles of genes and gene products

q Three organizing principles: molecular function, biological process, and cellular component

n Pathways

q KEGG (http://www.genome.jp/kegg/pathway.html)

q Pathway commons (http://www.pathwaycommons.org)

q WikiPathways (http://www.wikipathways.org)

n Cytogenetic bands

n Targets of transcription factors/miRNAs

17 WebGestalt: Web-based Gene Set Analysis Toolkit

8 organisms Human, Mouse, Rat, Dog, Fruitfly, Worm, Zebrafish, Yeast

Microarray Probe IDs Gene IDs Protein IDs • Affymetrix • Gene Symbol • UniProt • Agilent • GenBank • IPI • Codelink • Ensembl Gene • RefSeq Peptide • Illumina • RefSeq Gene • Ensembl Peptide • UniGene • Gene Genetic Variation IDs • SGD • dbSNP • MGI • Flybase ID • Wormbase ID • ZFIN

196 ID types with mapping to Entrez Gene ID

WebGestalt

59,278 functional categories with genes identified by Entrez Gene IDs

Gene Ontology Pathway Network module • Biological Process • KEGG • Transcription factor targets • Molecular Function • Pathway Commons • microRNA targets • Cellular Component • WikiPathways • Protein interaction modules http://www.webgestalt.org

Disease and Drug Chromosomal location • Disease association genes • Cytogenetic bands Zhang et.al. Nucleic Acids Res. 33:W741, 2005 • Drug association genes Wang et al. Nucleic Acids Res. 41:W77, 2013 18 WebGestalt: ID mapping

n Input list

q 863 significant probe sets identified in the microarray study

n Mapping result

q Total number of User IDs: 863. Unambiguously mapped User IDs to Entrez IDs: 734. Unique User Entrez IDs: 581. The Enrichment Analysis will be based upon the unique IDs.

19 WebGestalt: top 10 enriched GO biological processes

20 WebGestalt: top 10 enriched KEGG pathways

21 Understanding a gene list

n Level I

q What are the genes behind the IDs and what do we know about the function of the genes?

n Level II

q Which biological processes and pathways are the most interesting in terms of the experimental question?

n Level III

q How do the gene products work together to form a functional network?

22 Biological networks

Networks Nodes Edges Protein-protein Physical interaction, interaction network undirected Signaling network Proteins Modification, Physical directed interaction networks Gene regulatory TFs/miRNAs Physical interaction, network Target genes directed Metabolic network Metabolites Metabolic reaction, directed Co-expression Genes/ Co-expression, Functional network proteins undirected association networks Genetic network Genes Genetic interaction, undirected

23 Properties of complex networks

Scale-free Modular Hierarchical

24 WebGestalt protein interaction module analysis

25 STRING (http://string-db.org)

n A database of known and predicted protein interactions, including both direct (physical) and indirect associations (functional).

n Quantitatively integrates interaction data from different sources for a large number of organisms, and transfers information between these organisms where applicable.

n Covers 5,214,234 proteins from 1,133 organisms.

26 Understanding a gene list: summary

n Level I

q What are the genes behind the IDs and what do we know about the function of the genes?

q Biomart (http://www.biomart.org/)

n Level II

q Which biological processes and pathways are the most interesting in terms of the experimental question?

q WebGestalt (http://bioinfo.vanderbilt.edu/webgestalt)

q Related tools: DAVID (http://david.abcc.ncifcrf.gov/), GenMAPP (http://www.genmapp.org/), GSEA (http://www.broadinstitute.org/gsea )

n Level III

q How do the gene products work together to form a functional network?

q STRING (http://string-db.org)

q Related tools: Cytoscape (http://www.cytoscape.org/), Genemania (http://www.genemania.org), Ingenuity (http://www.ingenuity.com/), Pathway Studio ( http://www.ariadnegenomics.com/products/pathway-studio/)

27