Proteomics and systems biology
Bing Zhang, Ph.D. Associate Professor Department of Biomedical Informatics Vanderbilt University School of Medicine [email protected] NoNo man gene is an isisland an island
No man is an island, entire of itself; Pathways/ every man is a piece of the continent, a part of the main. Networks If a clod be washed away by the sea, Europe is the less, as well as if a promontory were, as well as if a manor of thy friend's or of thine own were. Any man's death diminishes me because I am involved in mankind; and therefore never send to know for whom the bell tolls; it tolls for thee.
John Donne, Devotions upon Emergent Occasions, 1624
The impact of a specific genetic abnormality is not restricted to the activity of the gene product that carries it, but can spread along the links of the network and alter the activity of gene products that otherwise carry no defects. Disease Barabasi et al., Nature Review Genetics, 2011
2 Omics opportunities and challenges
Genome CNV mutation methylation Data for Elephant individual genes
Transcriptome expression splicing
Proteome Understanding of expression biological systems modification
3 Outline
! Pathway analysis
" Motivation
" Pathway databases
" Over-representation analysis
" Gene set enrichment analysis
! Network analysis
" Network mapping
" Network representation
" Network properties
" Network-based applications
4 Outline
! Pathway analysis
" Motivation
" Pathway databases
" Over-representation analysis
" Gene set enrichment analysis
! Network analysis
" Network mapping
" Network representation
" Network properties
" Network-based applications
5 Omics studies generate gene/protein lists
! Genomics
" Genome Wide Association Study (GWAS)
" Whole genome or exome sequencing
! Transcriptomics " mRNA profiling ……….. ! Microarray ……….. ! RNA-Seq ………..
" Protein-DNA interaction ………..
! Chromatin immunoprecipitation (ChIP)-Seq ………. ………. ! Proteomics ………. " Protein profiling ………. ! LC-MS/MS ……… " Protein-protein interaction ……… ! Yeast two hybrid ……… ! Affinity pull-down/LC-MS/MS
6 Omics studies generate gene/protein lists
log2(ratio)
IPI00375843 IPI00171798 IPI00299485
Microarray -log10(p value) IPI00009542 IPI00019568 IPI00060627 RNA-Seq Differential IPI00168262 IPI00082931 expression IPI00025084 IPI00412546 Proteomics IPI00165528 IPI00043992 IPI00384992 IPI00006991 IPI00021885 …...
Clustering Lists of genes or proteins with potential biological interest
7 Different ways of data organization
! Gene 1 ! Pathway 1
" Pathway 1 " Gene 1
" Pathway 2 " Gene 2
! Gene 2 ! Pathway 2
" Pathway 1 " Gene 1
" Pathway 3 " Gene 3
! Gene 3 ! Pathway 3
" Pathway 2 " Gene 2
" Pathway 3 " Gene 3
! Gene 4 " Gene 4
" Pathway 3
8 Advantages of pathway analysis
! Better interpretation
" From interesting genes to interesting biological themes
! Improved robustness
" Robust against noise in the data
! Improved sensitivity
" Detecting minor but concordant changes in a pathway
9 Outline
! Pathway analysis
" Motivation
" Pathway databases
" Over-representation analysis
" Gene set enrichment analysis
! Network analysis
" Network mapping
" Network representation
" Network properties
" Network-based applications
10 Pathway databases
! Databases
" BioCarta (http://www.biocarta.com/genes/index.asp)
" KEGG (http://www.genome.jp/kegg/pathway.html)
" MetaCyc (http://metacyc.org)
" Pathway commons (http://www.pathwaycommons.org)
" Reactome (http://www.reactome.org)
" STKE (http://stke.sciencemag.org/cm)
" Signaling Gateway (http://www.signaling-gateway.org)
" Wikipathways (http://www.wikipathways.org)
! Limitation
" Limited coverage
" Inconsistency among different databases
" Relationship between pathways is not defined
! Structured, precisely defined, controlled vocabulary for describing the roles of genes and gene products
! Three organizing principles: molecular function, biological process, and cellular component
" Dopamine receptor D2, the product of human gene DRD2
" molecular function: dopamine receptor activity
" biological process: synaptic transmission
" cellular component: plasma membrane ! Terms in GO are linked by several types of relationships " Is_a (e.g. plasma membrane is_a membrane)
" Part_of (e.g. membrane is part_of cell) " Has part
" Regulates " Occurs in
12 Gene Ontology
13 Annotating genes using GO terms
! Two types of GO annotations
" Electronic annotation
" Manual annotation
! All annotations must:
" be attributed to a source
" indicate what evidence was found to support the GO term-gene/protein association
! Types of evidence codes
" Experimental codes - IDA, IMP, IGI, IPI, IEP
" Computational codes - ISS, IEA, RCA, IGC
" Author statement - TAS, NAS
" Other codes - IC, ND
14 Annotating genes using GO terms
…… ……
Parent DLGAP1 discs, large (Drosophila) homolog-associated protein 1
Cell-cell signaling DLGAP2 discs, large (Drosophila) homolog-associated protein 2
DNM1 dynamin 1
DOC2A double C2-like domains, alpha Synaptic transmission Child DRD1 dopamine receptor D1 226 human genes DRD1IP dopamine receptor D1 interacting protein
DRD2 dopamine receptor D2
DRD3 dopamine receptor D3
DRD4 dopamine receptor D4
DRD5 dopamine receptor D5
…… ……
15 Access GO
! Downloads (http://www.geneontology.org)
" Ontologies
! http://www.geneontology.org/page/download-ontology
" Annotations
! http://www.geneontology.org/page/download-annotations
! Web-based access
" AmiGO: http://www.godatabase.org
" QuickGO: http://www.ebi.ac.uk/QuickGO
16 Coverage of GO annotations
Homo sapiens Mus musculus #term #gene #term #gene GO/BP 6502 15228 6227 15709 GO/MF 3144 16389 2961 17287 GO/CC 947 16765 882 16801
17 Other “gene sets”
! Transcription factor targets
! miRNA targets
! Protein complexes
! Cytogenetic bands
! Disease-related genes
! Drug targets
! Experimentally-derived gene sets
! ……
18 Outline
! Pathway analysis
" Motivation
" Pathway databases
" Over-representation analysis
" Gene set enrichment analysis
! Network analysis
" Network mapping
" Network representation
" Network properties
" Network-based applications
19 Over-representation analysis: concept
Is the observed overlap significantly larger than the expected value?
MMP9 SERPINF1 Random Observed IPI00375843 A2ML1 9.2 22 IPI00171798 F2 IPI00299485 FN1 IPI00009542 LYZ 180 IPI00019568 TNXB 180 83 83 IPI00060627 FGG IPI00168262 MPO IPI00082931 FBLN1 1305 1305 IPI00025084 THBS1 IPI00412546 Compare HDLBP annotated IPI00165528 GSN IPI00043992 FBN1 All identified Filter for IPI00384992 CA2 proteins (1733) significant IPI00006991 P11 IPI00021885 CCL21 proteins …... FGB ……
Differentially Extracellular expressed protein list space (260 proteins) (83 genes)
20 Over-representation analysis: hypergeometric test
Significant Non-significant Total proteins proteins Proteins in the group k j-k j
Other proteins n-k m-n-j+k m-j Total n m-n m Hypergeometric test: given a total of m proteins where j proteins are in the functional group, if we pick n proteins randomly, what is the probability of having k or more proteins from the group? Observed # m − j j& k min(n, j ) % (% ( $ n − i '$ i' p = n j ∑ # m& i= k % ( m $ n ' Zhang et.al. Nucleic Acids Res. 33:W741, 2005
21 € Over-representation analysis: WebGestalt
http://www.webgestalt.org 92546_r_at 8 organisms 92545_f_at Human, Mouse, Rat, Dog, Fruitfly, Worm, Zebrafish, Yeast 96055_at
102105_f_at Microarray Probe IDs Gene IDs Protein IDs 102700_at • Affymetrix • Gene Symbol • UniProt • Agilent • GenBank • IPI …… • Codelink • Ensembl Gene • RefSeq Peptide • Illumina • RefSeq Gene • Ensembl Peptide • UniGene • Entrez Gene Genetic Variation IDs • SGD • dbSNP • MGI • Flybase ID • Wormbase ID • ZFIN ~200 ID types 196 ID types with mapping to Entrez Gene ID Statistical WebGestalt analysis ~60K gene sets 59,278 functional categories with genes identified by Entrez Gene IDs
Gene Ontology Pathway Network module • Biological Process • KEGG • Transcription factor targets • Molecular Function • Pathway Commons • microRNA targets • Cellular Component • WikiPathways • Protein interaction modules
Disease and Drug Chromosomal location • Disease association genes • Cytogenetic bands • Drug association genes Jul. 1, 2013 – Jun. 30, 2014
Zhang et.al. Nucleic Acids Res. 33:W741, 2005 49,136 visits from 18,213 visitors Wang et al. Nucleic Acids Res. 41:W77, 2013 22 WebGestalt output: an enriched pathways
Input genes
TGF Beta Signaling Pathway
23 WebGestalt output: Enriched GO terms
Response to unfolded proteins 12 genes adjp=1.32e-08
24 Limitation of the over-representation analysis
! Does not account for the order of genes in the significant gene list
! Arbitrary thresholding may lead to lose of information
25 Outline
! Pathway analysis
" Motivation
" Pathway databases
" Over-representation analysis
" Gene set enrichment analysis
! Network analysis
" Network mapping
" Network representation
" Network properties
" Network-based applications
26 Gene Set Enrichment Analysis: concept
! Do genes in a gene set tend to locate at the top or bottom of the ranked gene list?
27 Gene Set Enrichment Analysis: method
-1/(n-k)
+1/k
Subramanian et.al. PNAS 102:15545, 2005 k: Number of genes in the gene set S http://www.broad.mit.edu/gsea/ n: Number of all genes in the ranked gene list
28 Outline
! Pathway analysis
" Motivation
" Pathway databases
" Over-representation analysis
" Gene set enrichment analysis
! Network analysis
" Network mapping
" Network representation
" Network properties
" Network-based applications
29 Exercise
! http://bioinfo.vanderbilt.edu/zhanglab/?q=node/410
! Using WebGestalt (http://www.webgestalt.org) to identify GO categories and pathways enriched in proteins up- regulated in the poor-prognosis colorectal cancer (CRC) subtype (Zhang et al. Nature, 513:382-387, 2014).
30 Proteomics and systems biology (II)
Bing Zhang, Ph.D. Associate Professor Department of Biomedical Informatics Vanderbilt University School of Medicine [email protected] Outline
! Pathway analysis
" Motivation
" Pathway databases
" Over-representation analysis
" Gene set enrichment analysis
! Network analysis
" Network mapping
" Network representation
" Network properties
" Network-based applications
32 Mapping protein-protein interaction networks
! Experimental
" Yeast two-hybrid
" Tandem affinity purification
! Computational (based on evidence of association)
" Gene fusion (form part of a single protein in other organisms)
" Ortholog interaction (orthologs interact in other organisms)
" Phylogenetic profiling (co-existence in different organisms)
" Microarray gene co-expression (co-expression in various conditions)
! PPI data integration
" Voting
" Probabilistic integration (Jansen et al. Science 302:449, 2003)
Phylogenetic profiling
Valencia et al. Curr. Opin. Struct. Biol, 12:368, 2002
33 PPI databases
! Database of Interacting Proteins (DIP) http://dip.doe-mbi.ucla.edu/
! The Molecular INTeraction database (MINT) http://mint.bio.uniroma2.it/mint/
! The Biomolecular Object Network Databank (BOND) http://bond.unleashedinformatics.com/
! The General Repository for Interaction Datasets (BioGRID) http://www.thebiogrid.org/
! Human Protein Reference Database (HPRD) http://www.hprd.org
! Online Predicted Human Interaction Database (OPHID) http://ophid.utoronto.ca
! iRef http://wodaklab.org/iRefWeb
! HI-II-14 http://interactome.dfci.harvard.edu/H_sapiens/
! The International Molecular Exchange Consortium (IMEX) http://www.imexconsortium.org
34 Mapping transcriptional regulatory networks
! Experimental
" Chromatin immunoprecipitation (ChIP)
! ChIP-chip
! ChIP-seq
! Computational
" Promoter sequence analysis
" Reverse engineering from gene expression data
! Public databases
" Transfac (http://www.gene-regulation.com)
" MSigDB (http://www.broadinstitute.org/gsea/msigdb)
" hPDI (http://bioinfo.wilmer.jhu.edu/PDI/ )
35 Mapping metabolic networks
! Public databases
" KEGG (http://www.genome.jp/kegg/ )
" MetaCyc (http://metacyc.org/ )
36 Outline
! Pathway analysis
" Motivation
" Pathway databases
" Over-representation analysis
" Gene set enrichment analysis
! Network analysis
" Network mapping
" Network representation
" Network properties
" Network-based applications
37 Graph representation of networks
! Graph: a graph is a set of objects called nodes or vertices connected by links called edges. In mathematics and computer science, a graph is the basic object of study in graph theory.
node
edge
RNA polymerase II Cramer et al. Science 292:1863, 2001
38 Undirected graph vs directed graph
Protein interaction network Nodes: protein Edges: physical interaction Undirected
Krogan et al. Nature 440:637, 2006
Shen-orr et al. Nat Genet, 31:64, 2002
Metabolic network Transcriptional regulatory network Nodes: metabolites Nodes: genes Edges: enzymes Edges: transcriptional regulation Directed Directed Substrate->Product TF->target gene
Ravasz et al. Science 297:1551, 2002
39 Unweighted graph vs weighted graph
Protein interaction network Nodes: protein Edges: physical interaction Unweighted
Krogan et al. Nature 440:637, 2006
Gene co-expression network Nodes: gene Edges: co-expression level Correlation filtering Weighted Unweighted (>0.8)
40 Degree, path, shortest path
! Degree: the number of edges adjacent to a node. A simple measure of the node centrality.
! Path: a sequence of nodes such that from each of its nodes there is an edge to the next node in the sequence.
! Shortest path: a path between two nodes such that the sum of the distance of its constituent edges is minimized.
MetR YDL176W Out degree: 3 Degree: 3 In degree: 1
41 Outline
! Pathway analysis
" Motivation
" Pathway databases
" Over-representation analysis
" Gene set enrichment analysis
! Network analysis
" Network mapping
" Network representation
" Network properties
" Network-based applications
42 Properties of complex networks
Scale-free Small world Modular Hierarchical
43 Obama vs Lady Gaga: who is more influential?
Twitter following Twitter followers (out degree) (in degree) 701,301 Obama 7,035,548
664,606 28,490,739
144,263 Gaga 8,873,525
136,511 35,158,014
0 Eminem 3,509,469
0 13,946,813
44 Role of hubs in biological networks
! Based on data from model organisms S. cerevisiae and C. elegans
" Correspond to essential genes
" Have a tendency to be older and have evolved more slowly
" Have a tendency to be more abundant
" Have a larger diversity of phenotypic outcomes resulting from their deletion
Vidal et al. Cell, 144:986, 2011
45 Scale-free topology and network robustness
! Robust against random damage
! Yet fragile against selective damage
46 Modularity
! Modularity refers to a group of physically or functionally linked molecules (nodes) that work together to achieve a relatively distinct function.
! Examples Protein interaction modules Palla et al, Nature, 435:841, 2005 " Transcriptional module: a set of co- regulated genes
" Protein complex: assembly of proteins that build up some cellular machinery, commonly spans a dense sub-network of proteins in a protein interaction network
" Signaling pathway: a chain of interacting proteins propagating a signal in the cell
Gene co-expression modules Shi et al, BMC Syst Biol, 4:74, 2010
47 Making sense of the hairballs using network hierarchies
Nodes ordered in one dimension on the basis of network structure Data visualized as tracks
www.netgestalt.org Shi et al., Nature Methods, 2013 48 Outline
! Pathway analysis
" Motivation
" Pathway databases
" Over-representation analysis
" Gene set enrichment analysis
! Network analysis
" Network mapping
" Network representation
" Network properties
" Network-based applications
49 Network visualization
Network visualization tools Cytoscape (http://www.cytoscape.org)
Node label color: protein type Red HNE targets Blue Transcription factors Black Other proteins Node size: prediction cutoff level Loose cutoff level Stringent cutoff level
Node color: protein function Cell cycle RNA splicing Both Neither
Gehlenborg et al. Nature Methods, 7:S56, 2010
50 Network distance vs functional similarity
! Proteins that lie closer to one another in a protein interaction network are more likely to have similar function and involve in similar biological process.
Sharan et al. Mol Syst Biol, 3:88, 2007
51 Gene function prediction: neighborhood majority voting
Direct neighbor Iterative majority voting (local) (global)
52 Gene function prediction: module-based approach
Direct neighbor voting Module-based
53 Candidate disease gene prioritization
Kohler et al. Am J Hum Genet. 82:949, 2008
54 Outline
! Pathway analysis
" Motivation Data for " Pathway databases individual genes
" Over-representation analysis
" Gene set enrichment analysis
! Network analysis
" Network mapping
" Network representation Understanding of " Network properties biological systems " Network-based applications
55