Introduction to WebGestalt and DataView
Bing Zhang, Ph.D.
Department of Biomedical Informatics
Vanderbilt University School of Medicine
Conte bioinformatics workshop, 12-03-2010

Microarray data analysis workflow
[Figure: workflow diagram with the stages Microarray data, Normalization, Differential expression, Clustering, and Lists of genes with potential biological interest. Example probe set IDs: 92546_r_at, 92545_f_at, 96055_at, 102105_f_at, 102700_at, 161361_s_at, 92202_g_at, 103548_at, 100947_at, 101869_s_at, 102727_at, 160708_at, ……]

Serotonin neuron transcriptome analysis
Example probe set IDs: 1416882_at, 1418208_at, 1418310_a_at, 1418582_at, 1418934_at, 1419127_at, 1419225_at, 1419434_at, 1420227_at, 1420337_at, 1420565_at, 1421601_at, 1422210_at, 1422520_at, 1422643_at, ……
Wylie et al. J Neurosci, 2010

Over-representation analysis
[Figure: the probe set IDs of the input gene list (152 genes) are converted to gene symbols and compared with a predefined functional category (339 genes).]
Observed overlap: 16 genes
Expected overlap: 2.6 genes
Enrichment ratio: 6.08
p value: 9.34E-9

Over-representation analysis

                         Significant genes   Non-significant genes   Total
Genes in the category    k                   j-k                     j
Other genes              n-k                 m-n-j+k                 m-j
Total                    n                   m-n                     m

Hypergeometric distribution: given a total of m genes where j genes are in the functional category, if we pick n genes randomly, what is the probability of having k or more genes from the category?

$$p = \sum_{i=k}^{\min(n,j)} \frac{\binom{m-j}{n-i}\binom{j}{i}}{\binom{m}{n}}$$

Zhang et al. Nucleic Acids Res. 33:W741, 2005

Commonly used functional categories
Gene Ontology (http://www.geneontology.org)
  Structured, precisely defined, controlled vocabulary for describing the roles of genes and gene products
  Three organizing principles: molecular function, biological process, and cellular component
Pathways
  KEGG (http://www.genome.jp/kegg/pathway.html)
  Pathway Commons (http://www.pathwaycommons.org/pc/)
  WikiPathways (http://www.wikipathways.org)
Common targets of transcription factors
  TRANSFAC (http://www.gene-regulation.com)
Cytogenetic bands

WebGestalt: Web-based Gene Set Analysis Toolkit
What is WebGestalt
  A web-based application for the functional enrichment analysis of gene lists derived from microarray gene-expression studies.
How to access WebGestalt
  http://bioinfo.vanderbilt.edu/webgestalt
Functions in WebGestalt:
  Convert different types of IDs to Entrez Gene IDs: supports 132 ID types from 8 organisms
  Perform enrichment analysis against GO, KEGG, WikiPathways, and other functional categories (73,986 functional categories in total)
  Visualize enriched pathways

WebGestalt: Web-based Gene Set Analysis Toolkit
[Figure: overview of WebGestalt (http://bioinfo.vanderbilt.edu/webgestalt): 8 organisms, 132 ID types, 73,986 functional categories]
Zhang et al. Nucleic Acids Res. 33:W741, 2005
Duncan et al. BMC Bioinformatics. 11:P10, 2010

WebGestalt analysis
Select the organism of interest.
Upload a gene/protein list in the txt format, one ID per row. Optionally, a value can be provided for each ID. In this case, put the ID and value in the same row and separate them by a tab. Then pick the ID type that corresponds to the list of IDs.
Categorize the uploaded ID list based upon GO Slim (a simplified version of Gene Ontology that focuses on high level classifications).
Analyze the uploaded ID list for enrichment in various biological contexts. You will need to select an appropriate predefined reference set or upload a reference set. If a customized reference set is uploaded, its ID type also needs to be selected. After this, select the analysis parameters (e.g., significance level, multiple test adjustment method, etc.).
Retrieve enrichment results by opening the respective results files. You may also open and/or download a TSV file, or download the zipped results to a directory on your desktop.

WebGestalt sample input
Gene list: Wylie et al. Distinct Transcriptomes Define Rostral and Caudal Serotonin Neurons. J. Neurosci, 30:670-684, 2010
Supplemental Table S8. Cluster I comprises 317 probe set IDs enriched in rostral and caudal 5HT neurons (R+C+) relative to rostral non-5HT and caudal non-5HT cells (R-C-).
http://www.jneurosci.org/cgi/content/full/30/2/670/DC1
Array type: Affy_mouse430_2

WebGestalt: ID mapping
Mapping result
  Total number of User IDs: 317.
  Unambiguously mapped User IDs to Entrez IDs: 259.
  Unique User Entrez IDs: 226.
  The Enrichment Analysis will be based upon the unique IDs.
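
The enrichment step applied to the unique mapped IDs can be illustrated with a minimal Python sketch of the hypergeometric over-representation test, using the symbols from the over-representation slide: m genes in the reference set, j genes in the functional category, n genes in the input list, and k genes in both. The function name `over_representation` and the toy numbers are assumptions made for illustration; this is not WebGestalt's own code.

```python
from math import comb


def over_representation(m, j, n, k):
    """Return (expected, ratio, p) for an over-representation test.

    m: genes in the reference set
    j: genes in the functional category
    n: genes in the input list
    k: genes in both the input list and the category
    (hypothetical helper, not part of WebGestalt)
    """
    if not (0 < j <= m and 0 < n <= m and 0 <= k <= min(n, j)):
        raise ValueError("inconsistent counts")
    # Expected overlap if the n genes were picked at random.
    expected = n * j / m
    ratio = k / expected
    # Upper tail of the hypergeometric distribution: P(overlap >= k).
    tail = sum(comb(m - j, n - i) * comb(j, i)
               for i in range(k, min(n, j) + 1))
    p = tail / comb(m, n)
    return expected, ratio, p


# Toy example: 20 reference genes, 5 in the category,
# 4 in the input list, 2 of them in the category.
expected, ratio, p = over_representation(m=20, j=5, n=4, k=2)
print(expected, ratio, round(p, 4))
```

The slide example has n = 152, j = 339, and k = 16, but the slides do not state the reference set size m, so the toy numbers are used here instead of guessing it.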

WebGestalt: GOSlim classification
[Figure: panels for Molecular function, Biological process, and Cellular component]

WebGestalt: top 10 enriched GO cellular components
Reference list: Affy_mouse430_2

WebGestalt: top 10 enriched KEGG pathways

WebGestalt: top 10 enriched WikiPathways

WebGestalt: top 10 transcription factor binding sites

Microarray data analysis workflow
[Figure: workflow diagram with the stages Microarray data, Normalization, Differential expression, Clustering, and Lists of genes with potential biological interest]

DataView
What is DataView
  A web-based application for the storage, retrieval, and statistical analysis of microarray gene expression data.
How to access DataView
  http://bioinfo.vanderbilt.edu/dataview/conte.php
Conte Data Sets in DataView:
  Evan’s microarray data on 5HT neurons (public)
  Randy’s microarray data on SERT G56A mutation (private)
  Doug’s microarray data on photoperiod (private)

DataView
Analysis tools in DataView:
  Retrieve gene expression data and annotation for a list of gene ids, biological processes, molecular functions, cellular locations, or chromosomal regions
  Identify genes co-expressed with a gene of interest
  Perform differential expression analysis for user defined sample groups
  Send co-expression and differential expression results to WebGestalt for functional enrichment analysis
  Visualize gene expression data in a heatmap

DataView: data preparation
  Gene expression data matrix
  Sample annotation

DataView: which genes are co-expressed with the 5HT transporter Slc6a4?
Method: correlation analysis
Input: gene symbol Slc6a4 (probe set id 1417150_at)
Output:
  List of probe set ids with correlation values
  Link to WebGestalt for enrichment analysis (link gene to function if unannotated)
  Link to Heatmap for visualizing expression pattern
  Link to annotation of the probe set ids

DataView: top 500 genes that are differentially expressed among the 4 groups
Method: Differential expression
Input: Manual grouping, four groups
Display result top: Top500
Output:
  List of the top 500 probe set ids with p values
  Link to WebGestalt for enrichment analysis (link gene to function if unannotated)
  Link to Heatmap for visualizing expression pattern
  Link to annotation of the probe set ids

DataView: top 500 genes that are up-regulated in 5HT neurons compared to non-5HT neurons in rostral
Method: Differential expression
Input: Manual grouping, two groups
Display result top: Top500
Output:
  List of the top 500 probe set ids with p values
  Link to WebGestalt for enrichment analysis
  Link to Heatmap for visualizing expression pattern
  Link to annotation of the probe set ids

DataView: show me expression patterns for all neurotransmitter transporters
Method: Clustering/heatmap
Input: Type: GOFUNCTION ID: 0005326 (neurotransmitter transporter activity)
Output:
  Heatmap in gif format
  High resolution pdf file

DataView: retrieve expression data for all neurotransmitter transporters
Method: Retrieve expression data
Input: Type: GOFUNCTION ID: 0005326 (neurotransmitter transporter activity)
Output:
  Expression data, with probe set annotation
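
The co-expression query for Slc6a4 described above (correlation analysis that returns probe set ids with correlation values) can be sketched in Python as follows. The slides say only "correlation analysis", so the use of the Pearson coefficient is an assumption. The function names, the probe ids `probe_A` to `probe_C`, and all expression values are hypothetical; only the probe set id 1417150_at comes from the slides.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (>= 2)")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant profile: correlation undefined, report 0
    return sxy / sqrt(sxx * syy)


def coexpressed(matrix, query_id, top=None):
    """Rank all other probe sets by correlation with the query probe set.

    matrix: dict mapping probe set id -> list of expression values,
            one value per sample (hypothetical in-memory layout).
    """
    query = matrix[query_id]
    scores = [(pid, pearson(query, profile))
              for pid, profile in matrix.items() if pid != query_id]
    # Highest positive correlation first.
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores if top is None else scores[:top]


# Hypothetical expression values across four samples.
matrix = {
    "1417150_at": [1.0, 2.0, 3.0, 4.0],  # Slc6a4 probe set id from the slides
    "probe_A": [2.0, 4.0, 6.0, 8.0],
    "probe_B": [4.0, 3.0, 2.0, 1.0],
    "probe_C": [1.0, 3.0, 2.0, 4.0],
}
for probe_id, r in coexpressed(matrix, "1417150_at"):
    print(probe_id, round(r, 3))
```

Sorting by the signed coefficient puts positively co-expressed probe sets first. Ranking by absolute value instead would also surface anti-correlated probe sets near the top.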