Proteomics and systems biology

Bing Zhang, Ph.D. Associate Professor Department of Biomedical Informatics Vanderbilt University School of Medicine [email protected] NoNo man is an isisland an island

No man is an island, entire of itself; Pathways/ every man is a piece of the continent, a part of the main. Networks If a clod be washed away by the sea, Europe is the less, as well as if a promontory were, as well as if a manor of thy friend's or of thine own were. Any man's death diminishes me because I am involved in mankind; and therefore never send to know for whom the bell tolls; it tolls for thee.

John Donne, Devotions upon Emergent Occasions, 1624

The impact of a specific genetic abnormality is not restricted to the activity of the gene product that carries it, but can spread along the links of the network and alter the activity of gene products that otherwise carry no defects. Disease Barabasi et al., Nature Review Genetics, 2011

2 Omics opportunities and challenges

Genome CNV mutation methylation Data for Elephant individual

Transcriptome expression splicing

Proteome Understanding of expression biological systems modification

3 Outline

! Pathway analysis

" Motivation

" Pathway databases

" Over-representation analysis

" Gene set enrichment analysis

! Network analysis

" Network mapping

" Network representation

" Network properties

" Network-based applications

4 Outline

! Pathway analysis

" Motivation

" Pathway databases

" Over-representation analysis

" Gene set enrichment analysis

! Network analysis

" Network mapping

" Network representation

" Network properties

" Network-based applications

5 Omics studies generate gene/ lists

! Genomics

" Genome Wide Association Study (GWAS)

" Whole genome or exome sequencing

! Transcriptomics " mRNA profiling ……….. ! Microarray ……….. ! RNA-Seq ………..

" Protein-DNA interaction ………..

! Chromatin immunoprecipitation (ChIP)-Seq ………. ………. ! Proteomics ………. " Protein profiling ………. ! LC-MS/MS ……… " Protein-protein interaction ……… ! Yeast two hybrid ……… ! Affinity pull-down/LC-MS/MS

6 Omics studies generate gene/protein lists

log2(ratio)

IPI00375843 IPI00171798 IPI00299485

Microarray -log10(p value) IPI00009542 IPI00019568 IPI00060627 RNA-Seq Differential IPI00168262 IPI00082931 expression IPI00025084 IPI00412546 Proteomics IPI00165528 IPI00043992 IPI00384992 IPI00006991 IPI00021885 …...

Clustering Lists of genes or with potential biological interest

7 Different ways of data organization

! Gene 1 ! Pathway 1

" Pathway 1 " Gene 1

" Pathway 2 " Gene 2

! Gene 2 ! Pathway 2

" Pathway 1 " Gene 1

" Pathway 3 " Gene 3

! Gene 3 ! Pathway 3

" Pathway 2 " Gene 2

" Pathway 3 " Gene 3

! Gene 4 " Gene 4

" Pathway 3

8 Advantages of pathway analysis

! Better interpretation

" From interesting genes to interesting biological themes

! Improved robustness

" Robust against noise in the data

! Improved sensitivity

" Detecting minor but concordant changes in a pathway

9 Outline

! Pathway analysis

" Motivation

" Pathway databases

" Over-representation analysis

" Gene set enrichment analysis

! Network analysis

" Network mapping

" Network representation

" Network properties

" Network-based applications

10 Pathway databases

! Databases

" BioCarta (http://www.biocarta.com/genes/index.asp)

" KEGG (http://www.genome.jp/kegg/pathway.html)

" MetaCyc (http://metacyc.org)

" Pathway commons (http://www.pathwaycommons.org)

" Reactome (http://www.reactome.org)

" STKE (http://stke.sciencemag.org/cm)

" Signaling Gateway (http://www.signaling-gateway.org)

" Wikipathways (http://www.wikipathways.org)

! Limitation

" Limited coverage

" Inconsistency among different databases

" Relationship between pathways is not defined

11

! Structured, precisely defined, controlled vocabulary for describing the roles of genes and gene products

! Three organizing principles: molecular function, biological process, and cellular component

" Dopamine receptor D2, the product of human gene DRD2

" molecular function: dopamine receptor activity

" biological process: synaptic transmission

" cellular component: plasma membrane ! Terms in GO are linked by several types of relationships " Is_a (e.g. plasma membrane is_a membrane)

" Part_of (e.g. membrane is part_of cell) " Has part

" Regulates " Occurs in

12 Gene Ontology

13 Annotating genes using GO terms

! Two types of GO annotations

" Electronic annotation

" Manual annotation

! All annotations must:

" be attributed to a source

" indicate what evidence was found to support the GO term-gene/protein association

! Types of evidence codes

" Experimental codes - IDA, IMP, IGI, IPI, IEP

" Computational codes - ISS, IEA, RCA, IGC

" Author statement - TAS, NAS

" Other codes - IC, ND

14 Annotating genes using GO terms

…… ……

Parent DLGAP1 discs, large (Drosophila) homolog-associated protein 1

Cell-cell signaling DLGAP2 discs, large (Drosophila) homolog-associated protein 2

DNM1 dynamin 1

DOC2A double C2-like domains, alpha Synaptic transmission Child DRD1 dopamine receptor D1 226 human genes DRD1IP dopamine receptor D1 interacting protein

DRD2 dopamine receptor D2

DRD3 dopamine receptor D3

DRD4 dopamine receptor D4

DRD5 dopamine receptor D5

…… ……

15 Access GO

! Downloads (http://www.geneontology.org)

" Ontologies

! http://www.geneontology.org/page/download-ontology

" Annotations

! http://www.geneontology.org/page/download-annotations

! Web-based access

" AmiGO: http://www.godatabase.org

" QuickGO: http://www.ebi.ac.uk/QuickGO

16 Coverage of GO annotations

Homo sapiens Mus musculus #term #gene #term #gene GO/BP 6502 15228 6227 15709 GO/MF 3144 16389 2961 17287 GO/CC 947 16765 882 16801

17 Other “gene sets”

! Transcription factor targets

! miRNA targets

! Protein complexes

! Cytogenetic bands

! Disease-related genes

! Drug targets

! Experimentally-derived gene sets

! ……

18 Outline

! Pathway analysis

" Motivation

" Pathway databases

" Over-representation analysis

" Gene set enrichment analysis

! Network analysis

" Network mapping

" Network representation

" Network properties

" Network-based applications

19 Over-representation analysis: concept

Is the observed overlap significantly larger than the expected value?

MMP9 SERPINF1 Random Observed IPI00375843 A2ML1 9.2 22 IPI00171798 F2 IPI00299485 FN1 IPI00009542 LYZ 180 IPI00019568 TNXB 180 83 83 IPI00060627 FGG IPI00168262 MPO IPI00082931 FBLN1 1305 1305 IPI00025084 THBS1 IPI00412546 Compare HDLBP annotated IPI00165528 GSN IPI00043992 FBN1 All identified Filter for IPI00384992 CA2 proteins (1733) significant IPI00006991 P11 IPI00021885 CCL21 proteins …... FGB ……

Differentially Extracellular expressed protein list space (260 proteins) (83 genes)

20 Over-representation analysis: hypergeometric test

Significant Non-significant Total proteins proteins Proteins in the group k j-k j

Other proteins n-k m-n-j+k m-j Total n m-n m Hypergeometric test: given a total of m proteins where j proteins are in the functional group, if we pick n proteins randomly, what is the probability of having k or more proteins from the group? Observed # m − j&# j& k min(n, j ) % (% ( $ n − i '$ i' p = n j ∑ # m& i= k % ( m $ n ' Zhang et.al. Nucleic Acids Res. 33:W741, 2005

21 € Over-representation analysis: WebGestalt

http://www.webgestalt.org 92546_r_at 8 organisms 92545_f_at Human, Mouse, Rat, Dog, Fruitfly, Worm, Zebrafish, Yeast 96055_at

102105_f_at Microarray Probe IDs Gene IDs Protein IDs 102700_at • Affymetrix • Gene Symbol • UniProt • Agilent • GenBank • IPI …… • Codelink • Ensembl Gene • RefSeq Peptide • Illumina • RefSeq Gene • Ensembl Peptide • UniGene • Gene Genetic Variation IDs • SGD • dbSNP • MGI • Flybase ID • Wormbase ID • ZFIN ~200 ID types 196 ID types with mapping to Entrez Gene ID Statistical WebGestalt analysis ~60K gene sets 59,278 functional categories with genes identified by Entrez Gene IDs

Gene Ontology Pathway Network module • Biological Process • KEGG • Transcription factor targets • Molecular Function • Pathway Commons • microRNA targets • Cellular Component • WikiPathways • Protein interaction modules

Disease and Drug Chromosomal location • Disease association genes • Cytogenetic bands • Drug association genes Jul. 1, 2013 – Jun. 30, 2014

Zhang et.al. Nucleic Acids Res. 33:W741, 2005 49,136 visits from 18,213 visitors Wang et al. Nucleic Acids Res. 41:W77, 2013 22 WebGestalt output: an enriched pathways

Input genes

TGF Beta Signaling Pathway

23 WebGestalt output: Enriched GO terms

Response to unfolded proteins 12 genes adjp=1.32e-08

24 Limitation of the over-representation analysis

! Does not account for the order of genes in the significant gene list

! Arbitrary thresholding may lead to lose of information

25 Outline

! Pathway analysis

" Motivation

" Pathway databases

" Over-representation analysis

" Gene set enrichment analysis

! Network analysis

" Network mapping

" Network representation

" Network properties

" Network-based applications

26 Gene Set Enrichment Analysis: concept

! Do genes in a gene set tend to locate at the top or bottom of the ranked gene list?

27 Gene Set Enrichment Analysis: method

-1/(n-k)

+1/k

Subramanian et.al. PNAS 102:15545, 2005 k: Number of genes in the gene set S http://www.broad.mit.edu/gsea/ n: Number of all genes in the ranked gene list

28 Outline

! Pathway analysis

" Motivation

" Pathway databases

" Over-representation analysis

" Gene set enrichment analysis

! Network analysis

" Network mapping

" Network representation

" Network properties

" Network-based applications

29 Exercise

! http://bioinfo.vanderbilt.edu/zhanglab/?q=node/410

! Using WebGestalt (http://www.webgestalt.org) to identify GO categories and pathways enriched in proteins up- regulated in the poor-prognosis colorectal cancer (CRC) subtype (Zhang et al. Nature, 513:382-387, 2014).

30 Proteomics and systems biology (II)

Bing Zhang, Ph.D. Associate Professor Department of Biomedical Informatics Vanderbilt University School of Medicine [email protected] Outline

! Pathway analysis

" Motivation

" Pathway databases

" Over-representation analysis

" Gene set enrichment analysis

! Network analysis

" Network mapping

" Network representation

" Network properties

" Network-based applications

32 Mapping protein-protein interaction networks

! Experimental

" Yeast two-hybrid

" Tandem affinity purification

! Computational (based on evidence of association)

" Gene fusion (form part of a single protein in other organisms)

" Ortholog interaction (orthologs interact in other organisms)

" Phylogenetic profiling (co-existence in different organisms)

" Microarray gene co-expression (co-expression in various conditions)

! PPI data integration

" Voting

" Probabilistic integration (Jansen et al. Science 302:449, 2003)

Phylogenetic profiling

Valencia et al. Curr. Opin. Struct. Biol, 12:368, 2002

33 PPI databases

! Database of Interacting Proteins (DIP) http://dip.doe-mbi.ucla.edu/

! The Molecular INTeraction database (MINT) http://mint.bio.uniroma2.it/mint/

! The Biomolecular Object Network Databank (BOND) http://bond.unleashedinformatics.com/

! The General Repository for Interaction Datasets (BioGRID) http://www.thebiogrid.org/

! Human Protein Reference Database (HPRD) http://www.hprd.org

! Online Predicted Human Interaction Database (OPHID) http://ophid.utoronto.ca

! iRef http://wodaklab.org/iRefWeb

! HI-II-14 http://interactome.dfci.harvard.edu/H_sapiens/

! The International Molecular Exchange Consortium (IMEX) http://www.imexconsortium.org

34 Mapping transcriptional regulatory networks

! Experimental

" Chromatin immunoprecipitation (ChIP)

! ChIP-chip

! ChIP-seq

! Computational

" Promoter sequence analysis

" Reverse engineering from gene expression data

! Public databases

" Transfac (http://www.gene-regulation.com)

" MSigDB (http://www.broadinstitute.org/gsea/msigdb)

" hPDI (http://bioinfo.wilmer.jhu.edu/PDI/ )

35 Mapping metabolic networks

! Public databases

" KEGG (http://www.genome.jp/kegg/ )

" MetaCyc (http://metacyc.org/ )

36 Outline

! Pathway analysis

" Motivation

" Pathway databases

" Over-representation analysis

" Gene set enrichment analysis

! Network analysis

" Network mapping

" Network representation

" Network properties

" Network-based applications

37 Graph representation of networks

! Graph: a graph is a set of objects called nodes or vertices connected by links called edges. In mathematics and computer science, a graph is the basic object of study in graph theory.

node

edge

RNA polymerase II Cramer et al. Science 292:1863, 2001

38 Undirected graph vs directed graph

Protein interaction network Nodes: protein Edges: physical interaction Undirected

Krogan et al. Nature 440:637, 2006

Shen-orr et al. Nat Genet, 31:64, 2002

Metabolic network Transcriptional regulatory network Nodes: metabolites Nodes: genes Edges: enzymes Edges: transcriptional regulation Directed Directed Substrate->Product TF->target gene

Ravasz et al. Science 297:1551, 2002

39 Unweighted graph vs weighted graph

Protein interaction network Nodes: protein Edges: physical interaction Unweighted

Krogan et al. Nature 440:637, 2006

Gene co-expression network Nodes: gene Edges: co-expression level Correlation filtering Weighted Unweighted (>0.8)

40 Degree, path, shortest path

! Degree: the number of edges adjacent to a node. A simple measure of the node centrality.

! Path: a sequence of nodes such that from each of its nodes there is an edge to the next node in the sequence.

! Shortest path: a path between two nodes such that the sum of the distance of its constituent edges is minimized.

MetR YDL176W Out degree: 3 Degree: 3 In degree: 1

41 Outline

! Pathway analysis

" Motivation

" Pathway databases

" Over-representation analysis

" Gene set enrichment analysis

! Network analysis

" Network mapping

" Network representation

" Network properties

" Network-based applications

42 Properties of complex networks

Scale-free Small world Modular Hierarchical

43 Obama vs Lady Gaga: who is more influential?

Twitter following Twitter followers (out degree) (in degree) 701,301 Obama 7,035,548

664,606 28,490,739

144,263 Gaga 8,873,525

136,511 35,158,014

0 Eminem 3,509,469

0 13,946,813

44 Role of hubs in biological networks

! Based on data from model organisms S. cerevisiae and C. elegans

" Correspond to essential genes

" Have a tendency to be older and have evolved more slowly

" Have a tendency to be more abundant

" Have a larger diversity of phenotypic outcomes resulting from their deletion

Vidal et al. Cell, 144:986, 2011

45 Scale-free topology and network robustness

! Robust against random damage

! Yet fragile against selective damage

46 Modularity

! Modularity refers to a group of physically or functionally linked molecules (nodes) that work together to achieve a relatively distinct function.

! Examples Protein interaction modules Palla et al, Nature, 435:841, 2005 " Transcriptional module: a set of co- regulated genes

" Protein complex: assembly of proteins that build up some cellular machinery, commonly spans a dense sub-network of proteins in a protein interaction network

" Signaling pathway: a chain of interacting proteins propagating a signal in the cell

Gene co-expression modules Shi et al, BMC Syst Biol, 4:74, 2010

47 Making sense of the hairballs using network hierarchies

Nodes ordered in one dimension on the basis of network structure Data visualized as tracks

www.netgestalt.org Shi et al., Nature Methods, 2013 48 Outline

! Pathway analysis

" Motivation

" Pathway databases

" Over-representation analysis

" Gene set enrichment analysis

! Network analysis

" Network mapping

" Network representation

" Network properties

" Network-based applications

49 Network visualization

Network visualization tools Cytoscape (http://www.cytoscape.org)

Node label color: protein type Red HNE targets Blue Transcription factors Black Other proteins Node size: prediction cutoff level Loose cutoff level Stringent cutoff level

Node color: protein function Cell cycle RNA splicing Both Neither

Gehlenborg et al. Nature Methods, 7:S56, 2010

50 Network distance vs functional similarity

! Proteins that lie closer to one another in a protein interaction network are more likely to have similar function and involve in similar biological process.

Sharan et al. Mol Syst Biol, 3:88, 2007

51 Gene function prediction: neighborhood majority voting

Direct neighbor Iterative majority voting (local) (global)

52 Gene function prediction: module-based approach

Direct neighbor voting Module-based

53 Candidate disease gene prioritization

Kohler et al. Am J Hum Genet. 82:949, 2008

54 Outline

! Pathway analysis

" Motivation Data for " Pathway databases individual genes

" Over-representation analysis

" Gene set enrichment analysis

! Network analysis

" Network mapping

" Network representation Understanding of " Network properties biological systems " Network-based applications

55