Introduction to WebGestalt and DataView
Bing Zhang, Ph.D.
Department of Biomedical Informatics
Vanderbilt University School of Medicine
[email protected]
Microarray data analysis workflow
[Figure: workflow from microarray data through normalization, differential expression, and clustering to lists of genes with potential biological interest, shown as probe set IDs such as 92546_r_at, 92545_f_at, 96055_at, 102105_f_at, 102700_at, ……]
Conte bioinformatics workshop, 12-03-2010
Serotonin neuron transcriptome analysis
1416882_at 1418208_at 1418310_a_at 1418582_at 1418934_at 1419127_at 1419225_at 1419434_at 1420227_at 1420337_at 1420565_at 1421601_at 1422210_at 1422520_at 1422643_at ……
Wylie et al. J Neurosci, 2010
Over-representation analysis
[Figure: over-representation analysis. Probe set IDs (92546_r_at, 92545_f_at, 96055_at, ……) are converted to an input gene list (152 genes), which is compared with a predefined functional category (339 genes).]
Observed overlap: 16 genes
Expected overlap: 2.6 genes
Enrichment ratio: 6.08
p value: 9.34E-9
Over-representation analysis
                         Significant genes   Non-significant genes   Total
Genes in the category    k                   j-k                     j
Other genes              n-k                 m-n-j+k                 m-j
Total                    n                   m-n                     m
Hypergeometric distribution: given a total of m genes, of which j are in the functional category, if we pick n genes at random, what is the probability of having k or more genes from the category?
$$p = \sum_{i=k}^{\min(n,j)} \frac{\binom{m-j}{n-i}\binom{j}{i}}{\binom{m}{n}}$$
Zhang et al. Nucleic Acids Res. 33:W741, 2005
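The hypergeometric tail probability and the enrichment ratio can be sketched in a few lines of Python. This is only an illustration of the formula with the symbols m, j, n, and k defined in the table, not WebGestalt's own code. The function names are made up for this example, and the reference set size m = 20000 in the demo is a hypothetical value, since the slides do not state it.

```python
from math import comb


def hypergeom_enrichment_p(m: int, j: int, n: int, k: int) -> float:
    """Probability of seeing k or more category genes among n picked genes.

    m: total number of genes in the reference set
    j: number of reference genes in the functional category
    n: number of genes in the input (significant) list
    k: number of input genes that fall in the category
    """
    upper = min(n, j)
    # Sum the numerators exactly as integers, then divide once.
    numerator = sum(comb(m - j, n - i) * comb(j, i) for i in range(k, upper + 1))
    return numerator / comb(m, n)


def enrichment_ratio(m: int, j: int, n: int, k: int) -> float:
    """Observed overlap divided by the overlap expected by chance (n * j / m)."""
    expected = n * j / m
    return k / expected


if __name__ == "__main__":
    m = 20000  # hypothetical reference set size (not given on the slides)
    j, n, k = 339, 152, 16  # category size, input list size, observed overlap
    print("expected overlap:", n * j / m)
    print("enrichment ratio:", enrichment_ratio(m, j, n, k))
    print("p value:", hypergeom_enrichment_p(m, j, n, k))
```

Because m is assumed here, the printed numbers will not match the slide's values exactly; the point is only how the three quantities relate.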
Commonly used functional categories
Gene Ontology (http://www.geneontology.org)
  Structured, precisely defined, controlled vocabulary for describing the roles of genes and gene products
  Three organizing principles: molecular function, biological process, and cellular component
Pathways
  KEGG (http://www.genome.jp/kegg/pathway.html)
  Pathway Commons (http://www.pathwaycommons.org/pc/)
  WikiPathways (http://www.wikipathways.org)
Common targets of transcription factors
  TRANSFAC (http://www.gene-regulation.com)
Cytogenetic bands
WebGestalt: Web-based Gene Set Analysis Toolkit
What is WebGestalt
A web-based application for the functional enrichment analysis of gene lists derived from microarray gene-expression studies.
How to access WebGestalt
http://bioinfo.vanderbilt.edu/webgestalt
Functions in WebGestalt:
Convert different types of IDs to Entrez Gene IDs: supports 132 ID types from 8 organisms
Perform enrichment analysis against GO, KEGG, WikiPathways, and other functional categories (73,986 functional categories in total)
Visualize enriched pathways
WebGestalt: Web-based Gene Set Analysis Toolkit
8 organisms
132 ID types
http://bioinfo.vanderbilt.edu/webgestalt
73,986 functional categories
Zhang et al. Nucleic Acids Res. 33:W741, 2005
Duncan et al. BMC Bioinformatics. 11:P10, 2010
WebGestalt analysis
Select the organism of interest.
Upload a gene/protein list in the txt format, one ID per row. Optionally, a value can be provided for each ID. In this case, put the ID and value in the same row and separate them by a tab. Then pick the ID type that corresponds to the list of IDs.
Categorize the uploaded ID list based upon GO Slim (a simplified version of Gene Ontology that focuses on high level classifications).
Analyze the uploaded ID list for enrichment in various biological contexts. You will need to select an appropriate predefined reference set or upload your own. If you upload a customized reference set, you must also select its ID type. Then select the analysis parameters (e.g., significance level, multiple test adjustment method).
Retrieve enrichment results by opening the respective results files. You may also open and/or download a TSV file, or download the zipped results to a directory on your desktop.
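The upload format described in the steps above (one ID per row, with an optional tab-separated value) can be produced with a short Python helper. This is a minimal sketch: the function name, the output file name, and the numeric values are hypothetical, and the probe set IDs are taken from the example list earlier in the slides.

```python
from typing import Iterable, Optional, Tuple


def format_gene_list(entries: Iterable[Tuple[str, Optional[float]]]) -> str:
    """Return upload-ready text: one ID per row, optional value after a tab."""
    lines = []
    for gene_id, value in entries:
        gene_id = gene_id.strip()
        if not gene_id:
            continue  # skip blank IDs
        lines.append(gene_id if value is None else f"{gene_id}\t{value}")
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    # Probe set IDs from the slides; the values are made up for illustration.
    entries = [("1416882_at", 2.5), ("1418208_at", 1.8), ("1418310_a_at", None)]
    with open("gene_list.txt", "w", encoding="utf-8") as handle:
        handle.write(format_gene_list(entries))
```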
WebGestalt sample input
Gene list
Wylie et al. Distinct Transcriptomes Define Rostral and Caudal Serotonin Neurons. J. Neurosci, 30:670-684, 2010
Supplemental Table S8. Cluster I comprises 317 probe set IDs enriched in rostral and caudal 5HT neurons (R+C+) relative to rostral non-5HT and caudal non-5HT cells (R-C-).
http://www.jneurosci.org/cgi/content/full/30/2/670/DC1
Array type: Affy_mouse430_2
WebGestalt: ID mapping
Mapping result
Total number of User IDs: 317. Unambiguously mapped User IDs to Entrez IDs: 259. Unique User Entrez IDs: 226. The Enrichment Analysis will be based upon the unique IDs.
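The three counts in the mapping result can be related with a small Python sketch. It assumes that "unambiguously mapped" means a user ID maps to exactly one Entrez ID, which is an interpretation rather than something the slide states. The function name, the toy probe labels, and the toy Entrez IDs are all hypothetical.

```python
from typing import Dict, List, Set


def summarize_mapping(user_ids: List[str], id_to_entrez: Dict[str, Set[int]]) -> Dict[str, int]:
    """Count total, unambiguously mapped, and unique Entrez IDs for a list."""
    # Keep only user IDs that map to exactly one Entrez ID (assumed definition).
    mapped = [
        next(iter(id_to_entrez[uid]))
        for uid in user_ids
        if len(id_to_entrez.get(uid, ())) == 1
    ]
    return {
        "total": len(user_ids),
        "unambiguous": len(mapped),
        # Several probe sets can map to the same gene, so this may be smaller.
        "unique_entrez": len(set(mapped)),
    }


if __name__ == "__main__":
    toy_map = {"probe_A": {101}, "probe_B": {101}, "probe_C": {202, 303}}
    print(summarize_mapping(["probe_A", "probe_B", "probe_C", "probe_D"], toy_map))
```

In the toy data, two probe sets map to the same gene, one is ambiguous, and one is unmapped, which mirrors how 317 user IDs can shrink to 259 mapped IDs and 226 unique Entrez IDs.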
WebGestalt: GOSlim classification
[Figure: GO Slim classification charts with three panels: biological process, molecular function, and cellular component]
WebGestalt: top 10 enriched GO cellular components
Reference list: Affy_mouse430_2
WebGestalt: top 10 enriched KEGG pathways
WebGestalt: top 10 enriched WikiPathways
WebGestalt: top 10 transcription factor binding sites
Microarray data analysis workflow
[Figure: workflow from microarray data through normalization, differential expression, and clustering to lists of genes with potential biological interest]
DataView
What is DataView
A web-based application for the storage, retrieval, and statistical analysis of microarray gene expression data.
How to access DataView
http://bioinfo.vanderbilt.edu/dataview/conte.php
Conte Data Sets in DataView:
Evan’s microarray data on 5HT neurons (public)
Randy’s microarray data on SERT G56A mutation (private)
Doug’s microarray data on photoperiod (private)
DataView
Analysis tools in DataView:
Retrieve gene expression data and annotation for a list of gene ids, biological processes, molecular functions, cellular locations, or chromosomal regions
Identify genes co-expressed with a gene of interest
Perform differential expression analysis for user defined sample groups
Send co-expression and differential expression results to WebGestalt for functional enrichment analysis
Visualize gene expression data in a heatmap
DataView: data preparation
Gene expression data matrix
Sample annotation
DataView: which genes are co-expressed with the 5HT transporter Slc6a4?
Method: correlation analysis
Input: gene symbol Slc6a4 (probe set id 1417150_at)
Output
List of probe set ids with correlation values
Link to WebGestalt for enrichment analysis (link gene to function if unannotated)
Link to Heatmap for visualizing expression pattern
Link to annotation of the probe set ids
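The correlation analysis can be illustrated with a short Python sketch that ranks probe sets by their correlation with a target probe set. The slides do not say which correlation measure DataView uses; Pearson correlation is assumed here. The function names and the toy expression values are hypothetical, and only the probe set ID 1417150_at comes from the slide.

```python
from math import sqrt
from typing import Dict, List, Optional, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # correlation is undefined for a constant profile; treated as 0 here
    return sxy / sqrt(sxx * syy)


def coexpressed(
    expr: Dict[str, List[float]], target: str, top: Optional[int] = None
) -> List[Tuple[str, float]]:
    """Rank all other probe sets by correlation with the target, highest first."""
    reference = expr[target]
    scores = [(pid, pearson(reference, values)) for pid, values in expr.items() if pid != target]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores if top is None else scores[:top]


if __name__ == "__main__":
    # Toy matrix: rows are probe sets, columns are samples (values are made up).
    toy = {
        "1417150_at": [1.0, 2.0, 3.0, 4.0],
        "probe_up": [2.0, 4.0, 6.0, 8.0],
        "probe_down": [4.0, 3.0, 2.0, 1.0],
    }
    for probe_id, r in coexpressed(toy, "1417150_at"):
        print(probe_id, round(r, 3))
```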
DataView: top 500 genes that are differentially expressed among the 4 groups
Method: Differential expression
Input:
Manual grouping, four groups
Display result top: Top500
Output
List of the top 500 probe set ids with p values
Link to WebGestalt for enrichment analysis (link gene to function if unannotated)
Link to Heatmap for visualizing expression pattern
Link to annotation of the probe set ids
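Ranking probe sets by differential expression across user-defined groups can be sketched as follows. The slides do not name the statistical test behind DataView's p values; this example assumes a one-way ANOVA and ranks by the F statistic only, leaving out the conversion to p values (which needs the F distribution). The function names, group labels, and expression values are hypothetical.

```python
from typing import Dict, List, Tuple


def f_statistic(groups: List[List[float]]) -> float:
    """One-way ANOVA F statistic for one probe set across sample groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (mu - grand_mean) ** 2 for g, mu in zip(groups, means))
    ss_within = sum((v - mu) ** 2 for g, mu in zip(groups, means) for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def rank_probes(
    expr: Dict[str, List[float]], labels: List[str], top: int = 500
) -> List[Tuple[str, float]]:
    """Return the `top` probe sets with the largest F statistic.

    expr maps probe set ID to values; labels gives each sample's group
    in the same column order (manual grouping).
    """
    group_names = sorted(set(labels))
    scored = []
    for probe_id, values in expr.items():
        groups = [[v for v, lab in zip(values, labels) if lab == g] for g in group_names]
        scored.append((probe_id, f_statistic(groups)))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top]


if __name__ == "__main__":
    labels = ["g1", "g1", "g2", "g2", "g3", "g3", "g4", "g4"]  # four manual groups
    toy = {
        "probe_a": [1.0, 1.2, 3.0, 3.1, 5.0, 5.2, 7.0, 7.1],
        "probe_b": [2.0, 2.1, 2.0, 2.2, 2.1, 2.0, 2.2, 2.1],
    }
    print(rank_probes(toy, labels, top=2))
```

The same function covers a two-group comparison, since a list of two groups is a valid input.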
DataView: top 500 genes that are up-regulated in 5HT neurons compared to non-5HT neurons in rostral
Method: Differential expression
Input:
Manual grouping, two groups
Display result top: Top500
Output
List of the top 500 probe set ids with p values
Link to WebGestalt for enrichment analysis
Link to Heatmap for visualizing expression pattern
Link to annotation of the probe set ids
DataView: show me expression patterns for all neurotransmitter transporters
Method: Clustering/heatmap
Input:
Type: GOFUNCTION ID: 0005326 (neurotransmitter transporter activity)
Output
Heatmap in gif format
High resolution pdf file
DataView: retrieve expression data for all neurotransmitter transporters
Method: Retrieve expression data
Input:
Type: GOFUNCTION ID: 0005326 (neurotransmitter transporter activity)
Output
Expression data, with probe set annotation
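Retrieval by GO term amounts to filtering the expression matrix by probe set annotation, which can be sketched in Python as below. The data structures, the function name, and the toy annotation (including which probe sets carry GO ID 0005326) are hypothetical and only illustrate the idea; they do not describe DataView's internals.

```python
from typing import Dict, List


def retrieve_by_go(
    expr: Dict[str, List[float]],
    annotation: Dict[str, dict],
    go_id: str,
) -> Dict[str, dict]:
    """Return expression values plus gene symbol for probe sets annotated with go_id."""
    selected = {}
    for probe_id, values in expr.items():
        info = annotation.get(probe_id, {})
        if go_id in info.get("go", ()):
            selected[probe_id] = {"symbol": info.get("symbol", ""), "values": values}
    return selected


if __name__ == "__main__":
    # Toy data: expression values and GO assignments are made up for illustration.
    toy_expr = {"1417150_at": [8.1, 8.4, 2.0, 2.2], "probe_other": [5.0, 5.1, 5.2, 5.0]}
    toy_annotation = {
        "1417150_at": {"symbol": "Slc6a4", "go": {"0005326"}},
        "probe_other": {"symbol": "GeneX", "go": {"0000000"}},
    }
    print(retrieve_by_go(toy_expr, toy_annotation, "0005326"))
```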