Introduction to WebGestalt and DataView

Bing Zhang, Ph.D.
Department of Biomedical Informatics
Vanderbilt University School of Medicine
[email protected]

Microarray data analysis workflow

[Figure: microarray data analysis workflow. Stages shown: microarray data, normalization, differential expression, clustering, and lists of genes with potential biological interest (example probe set IDs: 92546_r_at, 92545_f_at, 96055_at, 102105_f_at, 102700_at, ……).]

Conte bioinformatics workshop, 12-03-2010

Serotonin neuron transcriptome analysis

1416882_at 1418208_at 1418310_a_at 1418582_at 1418934_at 1419127_at 1419225_at 1419434_at 1420227_at 1420337_at 1420565_at 1421601_at 1422210_at 1422520_at 1422643_at ……

Wylie et al. J Neurosci, 2010

Over-representation analysis

[Figure: probe set IDs (92546_r_at, 92545_f_at, 96055_at, ……) are converted to gene symbols (PNRC1, GADD45B, RRAGC, DDIT3, ASNS, FOSB, ……) to form the input list (152 genes), which is then compared with a predefined functional category (339 genes; HSPA1A, HSPA1B, HSPA1L, HSPA8, ……).]

 Observed overlap: 16 genes

 Expected overlap: 2.6 genes

 Enrichment ratio: 6.08

 p value: 9.34E-9

Over-representation analysis

                         Significant genes   Non-significant genes   Total
Genes in the category    k                   j-k                     j
Other genes              n-k                 m-n-j+k                 m-j
Total                    n                   m-n                     m

Hypergeometric distribution: given a total of m genes where j genes are in the functional category, if we pick n genes randomly, what is the probability of having k or more genes from the category?

\[ p = \sum_{i=k}^{\min(n,j)} \frac{\binom{m-j}{n-i}\binom{j}{i}}{\binom{m}{n}} \]

Zhang et al. Nucleic Acids Res. 33:W741, 2005
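The hypergeometric test above can be sketched in a few lines of Python. This is only an illustration of the formula using the slide's symbols (m, j, n, k); the function names are hypothetical and it is not WebGestalt's actual implementation. The numbers in the toy call are made up for checking the arithmetic.

```python
from math import comb


def expected_overlap(m: int, j: int, n: int) -> float:
    """Expected number of category genes when n genes are picked at random."""
    return n * j / m


def enrichment_ratio(m: int, j: int, n: int, k: int) -> float:
    """Observed overlap k divided by the expected overlap."""
    return k / expected_overlap(m, j, n)


def hypergeom_upper_tail(m: int, j: int, n: int, k: int) -> float:
    """P(overlap >= k) when n genes are drawn at random from m genes,
    j of which belong to the functional category."""
    # Sum exact integer counts first, then divide once to limit rounding error.
    favourable = sum(comb(m - j, n - i) * comb(j, i)
                     for i in range(k, min(n, j) + 1))
    return favourable / comb(m, n)


# Toy numbers (made up): 10 genes, 4 in the category, 3 picked, 2 of them in the category.
p_toy = hypergeom_upper_tail(m=10, j=4, n=3, k=2)  # (36 + 4) / 120
```

To apply this to the over-representation example (n = 152, j = 339, k = 16), the reference-set size m is also needed; the slides do not state it.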

Commonly used functional categories

 Gene Ontology (http://www.geneontology.org)

 Structured, precisely defined, controlled vocabulary for describing the roles of genes and gene products

 Three organizing principles: molecular function, biological process, and cellular component

 Pathways

 KEGG (http://www.genome.jp/kegg/pathway.html)

 Pathway Commons (http://www.pathwaycommons.org/pc/)

 WikiPathways (http://www.wikipathways.org)

 Common targets of transcription factors

 TRANSFAC (http://www.gene-regulation.com)

 Cytogenetic bands

WebGestalt: Web-based Gene Set Analysis Toolkit

 What is WebGestalt

 A web-based application for the functional enrichment analysis of gene lists derived from microarray gene-expression studies.

 How to access WebGestalt

 http://bioinfo.vanderbilt.edu/webgestalt

 Functions in WebGestalt:

 Convert different types of IDs to Entrez Gene IDs (132 ID types from 8 organisms are supported)

 Perform enrichment analysis against GO, KEGG, WikiPathways, and other functional categories (73,986 functional categories in total)

 Visualize enriched pathways

WebGestalt: Web-based Gene Set Analysis Toolkit

8 organisms

132 ID types

http://bioinfo.vanderbilt.edu/webgestalt

73,986 functional categories

Zhang et al. Nucleic Acids Res. 33:W741, 2005
Duncan et al. BMC Bioinformatics. 11:P10, 2010

WebGestalt analysis

 Select the organism of interest.

 Upload a gene list in txt format, one ID per row. Optionally, a value can be provided for each ID; in that case, put the ID and its value on the same row, separated by a tab. Then pick the ID type that corresponds to the list of IDs.

 Categorize the uploaded ID list based upon GO Slim (a simplified version of Gene Ontology that focuses on high level classifications).

 Analyze the uploaded ID list for enrichment in various biological contexts. You will need to select an appropriate predefined reference set or upload your own. If a customized reference set is uploaded, its ID type also needs to be selected. Then select the analysis parameters (e.g., significance level, multiple test adjustment method).

 Retrieve enrichment results by opening the respective results files. You may also open and/or download a TSV file, or download the zipped results to a directory on your desktop.
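The upload format described in the steps above (one ID per row, optionally followed by a tab and a value) can be produced with a short script. The sketch below is an illustration only: the helper name `write_gene_list` is hypothetical, and the two values attached to the probe set IDs are made up.

```python
import io


def write_gene_list(ids, out, values=None):
    """Write one ID per row; if values are given, append each after a tab."""
    for idx, gene_id in enumerate(ids):
        if values is None:
            out.write(f"{gene_id}\n")
        else:
            out.write(f"{gene_id}\t{values[idx]}\n")


# Two probe set IDs from the slides, paired with made-up values for illustration.
buf = io.StringIO()
write_gene_list(["1416882_at", "1418208_at"], buf, values=[2.1, -0.7])
text = buf.getvalue()
```

Passing an open text file instead of the `StringIO` buffer would write the same rows to a .txt file ready for upload.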

WebGestalt sample input

 Gene list

 Wylie et al. Distinct Transcriptomes Define Rostral and Caudal Serotonin Neurons. J. Neurosci, 30:670-684, 2010

 Supplemental Table S8. Cluster I comprises 317 probe set IDs enriched in rostral and caudal 5HT neurons (R+C+) relative to rostral non-5HT and caudal non-5HT cells (R-C-).

 http://www.jneurosci.org/cgi/content/full/30/2/670/DC1

 Array type: Affy_mouse430_2

WebGestalt: ID mapping

 Mapping result

 Total number of User IDs: 317. Unambiguously mapped User IDs to Entrez IDs: 259. Unique User Entrez IDs: 226. The Enrichment Analysis will be based upon the unique IDs.
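The three counts in the mapping result can be reproduced on a toy table as follows. This sketch assumes that "unambiguously mapped" means a user ID matches exactly one Entrez ID; the slides do not define the term, and the function name and toy IDs are hypothetical.

```python
def summarize_mapping(user_ids, id_to_entrez):
    """Count total user IDs, unambiguously mapped IDs, and unique Entrez IDs.

    id_to_entrez maps a user ID to the set of Entrez IDs it matches.
    Assumption: an ID is 'unambiguous' when it matches exactly one Entrez ID.
    """
    mapped = {}
    for uid in user_ids:
        hits = id_to_entrez.get(uid, set())
        if len(hits) == 1:
            mapped[uid] = next(iter(hits))
    return {
        "total_user_ids": len(user_ids),
        "unambiguously_mapped": len(mapped),
        "unique_entrez_ids": len(set(mapped.values())),
    }


# Toy data (made up): p1 and p2 hit the same gene, p3 is ambiguous, p5 is unmapped.
toy_table = {"p1": {"101"}, "p2": {"101"}, "p3": {"102", "103"}, "p4": {"104"}}
summary = summarize_mapping(["p1", "p2", "p3", "p4", "p5"], toy_table)
```

In the mapping result above, the corresponding counts are 317, 259, and 226. Several probe sets can map to the same gene, which is why the number of unique Entrez IDs is smaller than the number of mapped user IDs.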

WebGestalt: GOSlim classification

[Figure: GO Slim classification of the uploaded list, with one panel each for biological process, molecular function, and cellular component.]

WebGestalt: top 10 enriched GO cellular components

Reference list: Affy_mouse430_2

WebGestalt: top 10 enriched KEGG pathways

WebGestalt: top 10 enriched WikiPathways

WebGestalt: top 10 transcription factor binding sites

Microarray data analysis workflow

[Figure: microarray data analysis workflow. Stages shown: microarray data, normalization, differential expression, clustering, and lists of genes with potential biological interest (example probe set IDs: 92546_r_at, 92545_f_at, 96055_at, 102105_f_at, 102700_at, ……).]

DataView

 What is DataView

 A web-based application for the storage, retrieval, and statistical analysis of microarray data.

 How to access DataView

 http://bioinfo.vanderbilt.edu/dataview/conte.php

 Conte Data Sets in DataView:

 Evan’s microarray data on 5HT neurons (public)

 Randy’s microarray data on SERT G56A mutation (private)

 Doug’s microarray data on photoperiod (private)

DataView

 Analysis tools in DataView:

 Retrieve gene expression data and annotation for a list of gene ids, biological processes, molecular functions, cellular locations, or chromosomal regions

 Identify genes co-expressed with a gene of interest

 Perform differential expression analysis for user defined sample groups

 Send co-expression and differential expression results to WebGestalt for functional enrichment analysis

 Visualize gene expression data in a heatmap

DataView: data preparation

 Gene expression data matrix

 Sample annotation

DataView: which genes are co-expressed with the 5HT transporter Slc6a4?

 Method: correlation analysis

 Input: gene symbol Slc6a4 (probe set id 1417150_at)

 Output

 List of probe set ids with correlation values

 Link to WebGestalt for enrichment analysis (link gene to function if unannotated)

 Link to Heatmap for visualizing expression pattern

 Link to annotation of the probe set ids
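The correlation analysis named above can be sketched as ranking every probe set by its correlation with the query probe set across samples. The slides do not say which correlation measure DataView uses; this sketch assumes Pearson correlation, the function names are hypothetical, and all expression values are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant profile: correlation undefined, treated as 0 here
    return sxy / sqrt(sxx * syy)


def coexpressed(expr, query_id, top=10):
    """Rank all other probe sets by correlation with the query probe set."""
    query = expr[query_id]
    scores = [(pid, pearson(query, profile))
              for pid, profile in expr.items() if pid != query_id]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:top]


# Made-up expression values across four samples; only the query ID is from the slides.
toy_expr = {
    "1417150_at": [1.0, 2.0, 3.0, 4.0],
    "probe_A": [2.0, 4.0, 6.0, 8.0],
    "probe_B": [4.0, 3.0, 2.0, 1.0],
    "probe_C": [1.0, 3.0, 2.0, 4.0],
}
ranked = coexpressed(toy_expr, "1417150_at")
```

Sorting in descending order puts positively co-expressed probe sets first; sorting by absolute value instead would also surface strongly anti-correlated ones.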

DataView: top 500 genes that are differentially expressed among the 4 groups

 Method: Differential expression

 Input:

 Manual grouping, four groups

 Display result top: Top500

 Output

 List of the top 500 probe set ids with p values

 Link to WebGestalt for enrichment analysis (link gene to function if unannotated)

 Link to Heatmap for visualizing expression pattern

 Link to annotation of the probe set ids

DataView: top 500 genes that are up-regulated in 5HT neurons compared to non-5HT neurons in rostral

 Method: Differential expression

 Input:

 Manual grouping, two groups

 Display result top: Top500

 Output

 List of the top 500 probe set ids with p values

 Link to WebGestalt for enrichment analysis

 Link to Heatmap for visualizing expression pattern

 Link to annotation of the probe set ids
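For a two-group comparison where direction matters, as in the up-regulation query above, one possible sketch scores each probe set with a signed statistic and keeps only those higher in the first group. The choice of Welch's t statistic is an assumption, since the slides do not name the test DataView uses; p values are omitted, and the function names and data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic; positive when group a has the higher mean."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        return 0.0  # degenerate case with no variation; treated as uninformative here
    return (mean(a) - mean(b)) / se


def top_upregulated(expr, idx_a, idx_b, top=500):
    """Rank probe sets that are higher in group a (sample indices idx_a) than in group b."""
    scored = []
    for pid, values in expr.items():
        t = welch_t([values[i] for i in idx_a], [values[i] for i in idx_b])
        if t > 0:  # keep only probe sets up-regulated in group a
            scored.append((pid, t))
    return sorted(scored, key=lambda s: s[1], reverse=True)[:top]


# Made-up data: samples 0-2 form group a, samples 3-5 form group b.
toy_expr = {
    "probe_up": [5, 6, 7, 1, 2, 3],
    "probe_down": [1, 2, 3, 5, 6, 7],
}
up_hits = top_upregulated(toy_expr, [0, 1, 2], [3, 4, 5])
```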

DataView: show me expression patterns for all neurotransmitter transporters

 Method: Clustering/heatmap

 Input:

 Type: GOFUNCTION; ID: 0005326 (neurotransmitter transporter activity)

 Output

 Heatmap in gif format

 High resolution pdf file

DataView: retrieve expression data for all neurotransmitter transporters

 Method: Retrieve expression data

 Input:

 Type: GOFUNCTION; ID: 0005326 (neurotransmitter transporter activity)

 Output

 Expression data, with probe set annotation
