Understanding lists from proteomics studies

Bing Zhang
Department of Biomedical Informatics
Vanderbilt University

A typical comparative shotgun proteomics study

IPI00375843 IPI00171798 IPI00299485 IPI00009542 IPI00019568 IPI00060627 IPI00168262 IPI00082931 IPI00025084 IPI00412546 IPI00165528 IPI00043992 IPI00384992 IPI00006991 IPI00021885 IPI00377045 IPI00022471 …….

Li et al., JPR, 2010

BCHM352

Omics technologies generate gene/protein lists

• Genomics
  ◦ Genome-wide association study (GWAS)
  ◦ Next-generation sequencing (NGS)
• Transcriptomics
  ◦ mRNA profiling
    ▪ Microarrays
    ▪ Serial analysis of gene expression (SAGE)
    ▪ RNA-Seq
  ◦ Protein-DNA interaction
    ▪ Chromatin immunoprecipitation
• Proteomics
  ◦ Protein profiling
    ▪ LC-MS/MS
  ◦ Protein-protein interaction
    ▪ Yeast two-hybrid
    ▪ Affinity pull-down/LC-MS/MS

Sample files

• Sample files can be downloaded from
  ◦ http://bioinfo.vanderbilt.edu/zhanglab/?q=node/410
• Significant proteins
  ◦ hnscc_sig_proteins.txt
• Significant proteins with log fold change
  ◦ hnscc_sig_withLogRatio.txt
• All proteins identified in the study
  ◦ hnscc_all_proteins.txt

Understanding a protein list

• Level I
  ◦ What are the proteins/genes behind the IDs and what do we know about the functions of the proteins/genes?

Level one: information retrieval

[Figure: query interface (http://www.ebi.ac.uk/IPI) and its output]

• One protein at a time
• Time-consuming
• Information is local and isolated
• Hard to automate the information retrieval process

A typical question

• “I’ve attached a spreadsheet of our proteomics results comparing 5 Vehicle and 5 Aldosterone treated patients. We’ve included only those proteins whose summed spectral counts are >30 in one treatment group. Would it be possible to get the GO annotations for these? The Uniprot name is listed in column A and the gene name is listed in column R. If this is a time consuming task (and I imagine that it is), can you tell me how to do it?”

Biomart: a batch information retrieval system


Biomart analysis

• Choose dataset
  ◦ Choose database: Ensembl 75 Genes
  ◦ Choose dataset: Homo sapiens genes (GRCh37.p13)
• Set filters
  ◦ Gene: a list of genes identified by various database IDs (e.g. IPI IDs)
  ◦ Gene Ontology: filter for genes with specific GO terms (e.g. cell cycle)
  ◦ Protein domains: filter for genes with specific protein domains (e.g. SH2 domain, signal domains)
  ◦ Region: filter for genes in a specific region (e.g. chr1 1:1000000 or 11q13)
  ◦ Others
• Select output attributes
  ◦ Gene annotation information in the Ensembl database, e.g. gene description, chromosome name, gene start, gene end, strand, band, gene name, etc.
  ◦ External data: Gene Ontology, IDs in other databases
  ◦ Expression: anatomical system, development stage, cell type, pathology
  ◦ Protein domains: SMART, PFAM, Interpro, etc.
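
The same dataset/filter/attribute choices can also be expressed as a BioMart XML query instead of clicking through the web form. The sketch below only builds the query text and does not contact any server. The dataset name `hsapiens_gene_ensembl`, the filter name `uniprotswissprot`, and the attribute names are assumptions that may differ between BioMart releases, and the two accessions are illustrative inputs, not proteins from the study.

```python
import xml.etree.ElementTree as ET


def build_query_element(ids, id_filter, attributes,
                        dataset="hsapiens_gene_ensembl"):
    """Build a BioMart <Query> element: one dataset, one ID filter, N attributes.

    All names passed in (dataset, filter, attributes) are assumptions and
    should be checked against the BioMart release actually being queried.
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",      # tab-separated output
        "header": "1",           # include a header row
        "uniqueRows": "1",       # drop duplicate rows
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset",
                       {"name": dataset, "interface": "default"})
    # A list filter takes all IDs as one comma-separated value.
    ET.SubElement(ds, "Filter", {"name": id_filter, "value": ",".join(ids)})
    for attr in attributes:
        ET.SubElement(ds, "Attribute", {"name": attr})
    return query


def query_to_xml(query):
    """Serialize the query element to the XML text BioMart expects."""
    header = '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>'
    return header + ET.tostring(query, encoding="unicode")


example_query = build_query_element(
    ids=["P04637", "P38398"],                 # illustrative UniProt accessions
    id_filter="uniprotswissprot",             # assumed filter name
    attributes=["external_gene_name", "go_id", "name_1006"],  # assumed names
)
print(query_to_xml(example_query))
```

The printed XML is what would be pasted into the BioMart XML view or submitted to a mart service endpoint. Keeping the query as text makes the retrieval repeatable for every new protein list.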

Biomart: sample output

Understanding a protein list

• Level I
  ◦ What are the proteins/genes behind the IDs and what do we know about the functions of the proteins/genes?
• Level II
  ◦ Which biological processes and pathways are the most interesting in terms of the experimental question?

Enrichment analysis

• Enrichment analysis: is a functional group (e.g. cell cycle) significantly associated with the experimental question?

[Figure: all identified proteins (1733) are filtered for significant proteins to give the differentially expressed protein list (260 proteins), which is compared with the proteins annotated to "Extracellular space" (83 proteins). Venn diagram panels: "Random" (9.2) and "Observed" (22).]

Enrichment analysis: hypergeometric test

                        Significant proteins   Non-significant proteins   Total
Proteins in the group   k (observed)           j - k                      j
Other proteins          n - k                  m - n - j + k              m - j
Total                   n                      m - n                      m

Hypergeometric test: given a total of m proteins where j proteins are in the functional group, if we pick n proteins randomly, what is the probability of having k or more proteins from the group?

$$p = \sum_{i=k}^{\min(n,j)} \frac{\binom{m-j}{n-i}\binom{j}{i}}{\binom{m}{n}}$$

Zhang et al., Nucleic Acids Res. 33:W741, 2005
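
The tail probability above can be evaluated directly with exact binomial coefficients. The following is a minimal sketch using the slide's symbols (m, j, n, k). The function name and the toy numbers are illustrative assumptions, not values from the HNSCC study.

```python
from math import comb


def hypergeom_enrichment_p(m, j, n, k):
    """P(at least k group members among n proteins drawn from m, j in the group).

    m: total proteins, j: proteins in the functional group,
    n: significant proteins, k: significant proteins in the group.
    """
    if not (0 <= j <= m and 0 <= n <= m and 0 <= k <= min(n, j)):
        raise ValueError("need 0 <= j, n <= m and 0 <= k <= min(n, j)")
    # Sum the hypergeometric probabilities for i = k .. min(n, j).
    tail = sum(comb(m - j, n - i) * comb(j, i)
               for i in range(k, min(n, j) + 1))
    return tail / comb(m, n)


# Toy example: 10 proteins, 4 in the group, 3 picked, 2 or more from the group.
toy_p = hypergeom_enrichment_p(m=10, j=4, n=3, k=2)
print(round(toy_p, 4))
```

For the toy case the sum is (6·6 + 1·4)/120 = 1/3. Exact integer arithmetic avoids the rounding problems that factorial-based floating-point code can run into for lists with thousands of proteins.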

Commonly used functional groups

• Gene Ontology (http://www.geneontology.org)
  ◦ Structured, precisely defined, controlled vocabulary for describing the roles of genes and gene products
  ◦ Three organizing principles: molecular function, biological process, and cellular component
• Pathways
  ◦ KEGG (http://www.genome.jp/kegg/pathway.html)
  ◦ Pathway Commons (http://www.pathwaycommons.org)
  ◦ WikiPathways (http://www.wikipathways.org)
• Cytogenetic bands
• Targets of transcription factors/miRNAs

WebGestalt: Web-based Gene Set Analysis Toolkit

http://www.webgestalt.org

8 organisms: Human, Mouse, Rat, Dog, Fruitfly, Worm, Zebrafish, Yeast

196 ID types with mapping to Entrez Gene ID

• Microarray Probe IDs: Affymetrix, Agilent, Codelink, Illumina
• Gene IDs: Gene Symbol, GenBank, Ensembl Gene, RefSeq Gene, UniGene, Gene, SGD, MGI, Flybase ID, Wormbase ID, ZFIN
• Protein IDs: UniProt, IPI, RefSeq Peptide, Ensembl Peptide
• Genetic Variation IDs: dbSNP

59,278 functional categories with genes identified by Entrez Gene IDs

• Gene Ontology: Biological Process, Molecular Function, Cellular Component
• Pathway: KEGG, Pathway Commons, WikiPathways
• Network module: Transcription factor targets, microRNA targets, Protein interaction modules
• Disease and Drug: Disease association genes, Drug association genes
• Chromosomal location: Cytogenetic bands

Zhang et al., Nucleic Acids Res. 33:W741, 2005
Wang et al., Nucleic Acids Res. 41:W77, 2013

WebGestalt analysis

• Select the organism of interest.
• Upload a gene/protein list in txt format, one ID per row. Optionally, a value can be provided for each ID. In this case, put the ID and value in the same row and separate them by a tab. Then pick the ID type that corresponds to the list of IDs.
• Categorize the uploaded ID list based upon GO Slim (a simplified version of Gene Ontology that focuses on high-level classifications).
• Analyze the uploaded ID list for enrichment in various biological contexts. You will need to select an appropriate predefined reference set or upload a reference set. If a customized reference set is uploaded, its ID type also needs to be selected. After this, select the analysis parameters (e.g., significance level, multiple test adjustment method, etc.).
• Retrieve enrichment results by opening the respective results files. You may also open and/or download a TSV file, or download the zipped results to a directory on your desktop.
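
The upload format described above (one ID per row, optionally followed by a tab and a value) can be produced with a few lines of code. This is a small sketch: the function name is hypothetical, the two IPI IDs are taken from the example list at the start of the deck, and the log ratios are made-up illustrative values.

```python
def format_upload(ids, values=None):
    """Return upload text: one ID per row, or 'ID<TAB>value' per row."""
    if values is None:
        rows = [str(protein_id) for protein_id in ids]
    else:
        if len(ids) != len(values):
            raise ValueError("ids and values must have the same length")
        rows = [f"{protein_id}\t{value}"
                for protein_id, value in zip(ids, values)]
    return "\n".join(rows) + "\n"


# Made-up log ratios, for illustration only.
upload_text = format_upload(["IPI00375843", "IPI00171798"], [1.8, -0.7])
print(upload_text, end="")

# Writing the text to disk gives a file that can be uploaded:
# with open("my_list.txt", "w") as handle:
#     handle.write(upload_text)
```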

WebGestalt: ID mapping

• Input list
  ◦ 260 significant proteins identified in the HNSCC study (hnscc_sig_withLogRatio.txt)
• Mapping result
  ◦ Total number of User IDs: 260. Unambiguously mapped User IDs to Entrez IDs: 229. Unique User Entrez IDs: 224. The Enrichment Analysis will be based upon the unique IDs.
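
The three numbers in the mapping result can be read as successive reductions of the input list. The sketch below shows one plausible way to compute such a summary. It assumes that "unambiguously mapped" means a user ID maps to exactly one Entrez ID, which is an interpretation and not WebGestalt's documented procedure. All IDs in the example are hypothetical.

```python
def summarize_mapping(mapping):
    """Summarize a {user_id: [entrez_id, ...]} mapping.

    Returns (total user IDs, unambiguously mapped user IDs, unique Entrez IDs).
    Assumption: 'unambiguous' means exactly one distinct Entrez ID per user ID.
    """
    total = len(mapping)
    unambiguous = {}
    for user_id, entrez_ids in mapping.items():
        distinct = set(entrez_ids)
        if len(distinct) == 1:
            unambiguous[user_id] = next(iter(distinct))
    # Several user IDs may point to the same gene, so count genes once.
    unique_entrez = set(unambiguous.values())
    return total, len(unambiguous), len(unique_entrez)


# Hypothetical mapping: two protein IDs share a gene, one is ambiguous,
# one has no match.
example = {
    "prot1": ["100"],
    "prot2": ["100"],
    "prot3": ["200", "201"],
    "prot4": [],
}
print(summarize_mapping(example))
```

Under this reading, the gap between 229 and 224 in the result above would correspond to different protein IDs that map to the same gene.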

WebGestalt: GOSlim classification

[Figure panels: Molecular function, Biological process, Cellular component]

WebGestalt: top 10 enriched GO biological processes

Reference list: CSHL2010_hnscc_all_proteins.txt

WebGestalt: top 10 enriched WikiPathways

Limitations of over-representation analysis

• Does not account for the order of genes in the significant gene list
• Arbitrary thresholding leads to loss of information

Gene Set Enrichment Analysis (GSEA)

Subramanian et al., PNAS 102:15545, 2005

• Test whether the members of a predefined gene set are randomly distributed throughout the ranked gene list
  ◦ Calculation of an enrichment score (ES): modified Kolmogorov-Smirnov test
  ◦ Estimation of the significance level of the ES: permutation test
  ◦ Adjustment for multiple hypothesis testing: control of the false discovery rate
  ◦ Leading-edge subset: the genes that contribute to the significance
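
The enrichment score step can be illustrated with a weighted running sum in the spirit of Subramanian et al. (2005). Walking down the ranked list, the sum increases at gene-set members (in proportion to their score) and decreases elsewhere, and the ES is the largest deviation from zero. This is a simplified sketch, not the GSEA software itself. The permutation test and FDR adjustment are not included, and the gene names and scores are made up.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """Running-sum enrichment score for a ranked list.

    ranked: list of (gene, score) pairs, sorted from highest to lowest score.
    gene_set: set of gene names to test.
    weight: exponent applied to |score| for hits (0 gives the unweighted KS form).
    """
    hits = [gene in gene_set for gene, _ in ranked]
    n_hits = sum(hits)
    if n_hits == 0 or n_hits == len(ranked):
        raise ValueError("gene set must contain some, but not all, ranked genes")
    hit_norm = sum(abs(score) ** weight
                   for (_, score), is_hit in zip(ranked, hits) if is_hit)
    if hit_norm == 0:
        raise ValueError("all gene-set members have zero score")
    miss_step = 1.0 / (len(ranked) - n_hits)

    running, best = 0.0, 0.0
    for (_, score), is_hit in zip(ranked, hits):
        if is_hit:
            running += abs(score) ** weight / hit_norm
        else:
            running -= miss_step
        if abs(running) > abs(best):   # keep the maximum deviation from zero
            best = running
    return best


# Made-up ranked list (e.g. proteins ordered by log ratio).
ranked_example = [("A", 3.0), ("B", 2.0), ("C", 1.0),
                  ("D", -1.0), ("E", -2.0), ("F", -3.0)]
print(enrichment_score(ranked_example, {"A", "B"}))   # set at the top
print(enrichment_score(ranked_example, {"E", "F"}))   # set at the bottom
```

A set concentrated at the top of the list gives a positive score and a set concentrated at the bottom gives a negative one. Unlike over-representation analysis, no significance cutoff on the list is needed.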

Understanding a protein list

• Level I
  ◦ What are the proteins/genes behind the IDs and what do we know about the functions of the proteins/genes?
• Level II
  ◦ Which biological processes and pathways are the most interesting in terms of the experimental question?
• Level III
  ◦ How do the proteins work together to form a network?

Resources

• GeneMANIA
  ◦ http://genemania.org
• STRING
  ◦ http://string-db.org/
• Genes2Networks
  ◦ http://actin.pharm.mssm.edu/genes2networks/

Understanding a protein list: summary

• Level I
  ◦ What are the proteins/genes behind the IDs and what do we know about the functions of the proteins/genes?
  ◦ Biomart (http://www.biomart.org/)
• Level II
  ◦ Which biological processes and pathways are the most interesting in terms of the experimental question?
  ◦ WebGestalt (http://bioinfo.vanderbilt.edu/webgestalt)
  ◦ Related tools: DAVID (http://david.abcc.ncifcrf.gov/), GenMAPP (http://www.genmapp.org/), GSEA (http://www.broadinstitute.org/gsea)
• Level III
  ◦ How do the proteins work together to form a network?
  ◦ GeneMANIA (http://genemania.org)
  ◦ Related tools: Cytoscape (http://www.cytoscape.org/), STRING (http://string.embl.de/), Genes2Networks (http://actin.pharm.mssm.edu/genes2networks), Ingenuity (http://www.ingenuity.com/), Pathway Studio (http://www.ariadnegenomics.com/products/pathway-studio/)
