Pathway and network analysis for mRNA and protein profiling data
Bing Zhang, Ph.D. Professor of Molecular and Human Genetics Lester & Sue Smith Breast Center Baylor College of Medicine [email protected]
VU workshop, 2016 Gene expression
DNA Transcription
Transcriptome Transcriptome RNA mRNA decay profiling
Translation
Proteome Protein Proteome Protein degradation profiling
Phenotype Networks
VU workshop, 2016 Overall workflow of gene expression studies
Biological question
Experimental design
Microarray RNA-Seq Shotgun proteomics
Image analysis Reads mapping Peptide/protein ID
Signal intensities Read counts Spectral counts; Intensities
Data Analysis
Experimental Hypothesis validation
VU workshop, 2016 Data matrix
Samples
probe_set_id HNE0_1 HNE0_2 HNE0_3 HNE60_1 HNE60_2 HNE60_3 1007_s_at 8.6888 8.5025 8.5471 8.5412 8.5624 8.3073 1053_at 9.1558 9.1835 9.4294 9.2111 9.1204 9.2494 117_at 7.0700 7.0034 6.9047 9.0414 8.6382 9.2663 121_at 9.7174 9.7440 9.6120 9.7581 9.7422 9.7345 1255_g_at 4.2801 4.4669 4.2360 4.3700 4.4573 4.2979 1294_at 6.3556 6.2381 6.2053 6.4290 6.5074 6.2771
Genes 1316_at 6.5759 6.5330 6.4709 6.6636 6.6438 6.4688 1320_at 6.5497 6.5388 6.5410 6.6605 6.5987 6.7236 1405_i_at 4.3260 4.4640 4.1438 4.3462 4.3876 4.6849 1431_at 5.2191 5.2070 5.2657 5.2823 5.2522 5.1808 1438_at 7.0155 6.9359 6.9241 7.0248 7.0142 7.0971 1487_at 8.6361 8.4879 8.4498 8.4470 8.5311 8.4225 1494_f_at 7.3296 7.3901 7.0886 7.2648 7.6058 7.2949 1552256_a_at 10.6245 10.5235 10.6522 10.4205 10.2344 10.3144 1552257_a_at 10.3224 10.1749 10.1992 10.2464 10.2191 10.2405
Signal intensities Read counts Spectral counts; Intensities
VU workshop, 2016 Differential expression and clustering
log2(ratio)
Microarray log10(p value) log10(p - RNA-Seq Differential expression
Proteomics
Clustering
VU workshop, 2016 Differential expression: t-test
graph courtesy of www.socialresearchmethods.net
VU workshop, 2016 Data are not normally distributed
n Log transformation
Histogram of a Histogram of log(a) 200 150 150 100 100 Frequency Frequency 50 50 0 0
50 150 250 3.5 4.0 4.5 5.0 5.5
a log(a) n Mann-Whitney U test (Wilcoxon rank-sum test)
q Based on ranks of observations
q Does not rely on data belonging to any particular distribution
VU workshop, 2016 Sample size is small
n Moderated t-test (limma)
q Designed for experiments with small sample size
q The standard errors are moderated across genes, effectively borrowing information from the ensemble of genes to aid with inference about each individual gene.
VU workshop, 2016 Count data
n Negative binomial distribution
q Cuffdiff
q DESeq
q edgeR
n Log-transformed counts
q Voom/limma
VU workshop, 2016 Correction for multiple testing
n Why
q In an experiment with a 10,000-gene array in which the significance level p is set at 0.05, 10,000 x 0.05 = 500 genes would be inferred as significant even though none is differentially expressed
n How
q Control the family-wise error rate (FWER), the probability that there is a single type I error in the entire set (family) of hypotheses tested. e.g., Standard Bonferroni Correction
q Control the false discovery rate (FDR), the expected proportion of false positives among the number of rejected hypotheses. e.g., Benjamini and Hochberg correction.
VU workshop, 2016 Clustering (unsupervised analysis)
n Clustering algorithms are methods to divide a set of n objects (genes or samples) into g groups so that within group similarities are larger than between group similarities
n Unsupervised techniques that do not require sample annotation in the process
n Identify candidate subgroups in complex data. e.g. identification of novel sub-types in cancer, identification of co-expressed genes
Samples Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 …… TNNC1 14.82 14.46 14.76 11.22 11.55 …… DKK4 10.71 10.37 11.23 19.74 19.73 …… ZNF185 15.20 14.96 15.07 12.57 12.37 …… CHST3 13.40 13.18 13.15 11.18 10.99 …… FABP3 15.87 15.80 15.85 13.16 12.99 …… MGST1 12.76 12.80 12.67 14.92 15.02 ……
Genes DEFA5 10.63 10.47 10.54 15.52 15.52 …… VIL1 11.47 11.69 11.87 13.94 14.01 …… AKAP12 18.26 18.10 18.50 15.60 15.69 …… HS3ST1 10.61 10.67 10.50 12.44 12.23 …… …… …… …… …… …… …… ……
VU workshop, 2016 Hierarchical clustering
n Agglomerative hierarchical clustering
q Start out with all sample units in n clusters of size 1.
q At each step of the algorithm, the pair of clusters with the shortest distance are combined into a single cluster.
q The algorithm stops when all sample units are combined into a single cluster of size n.
VU workshop, 2016 Visualization of hierarchical clustering results n Dendrogram
q Output of a hierarchical clustering Neuron type Non-5ht 5ht Brain region caudal rostral rostral caudal q Tree structure with the genes or samples as the leaves
q The height of the join indicates the distance between the branches n Heat map
q Graphical representation of data where the values are represented as colors.
VU workshop, 2016 Omics studies generate lists of interesting genes
log2(ratio) 92546_r_at 92545_f_at 96055_at 102105_f_at Microarray 102700_at log10(p value) log10(p - 161361_s_at RNA-Seq Differential 92202_g_at expression 103548_at 100947_at Proteomics 101869_s_at 102727_at 160708_at …...
Clustering Lists of genes with potential biological interest
VU workshop, 2016 Organizing genes based on pathways
n Gene 1 n Pathway 1
q Pathway 1 q Gene 1
q Pathway 2 q Gene 2
n Gene 2 n Pathway 2
q Pathway 1 q Gene 1
q Pathway 3 q Gene 3
n Gene 3 n Pathway 3
q Pathway 2 q Gene 2
q Pathway 3 q Gene 3
n Gene 4 q Gene 4
q Pathway 3
VU workshop, 2016 Advantages of pathway analysis
n Better interpretation
q From interesting genes to interesting biological themes
n Improved robustness
q Robust against noise in the data
n Improved sensitivity
q Detecting minor but concordant changes in a pathway
VU workshop, 2016 Pathway databases
n Databases
q BioCarta (http://www.biocarta.com/genes/index.asp)
q KEGG (http://www.genome.jp/kegg/pathway.html)
q MetaCyc (http://metacyc.org)
q Pathway commons (http://www.pathwaycommons.org)
q Reactome (http://www.reactome.org)
q STKE (http://stke.sciencemag.org/cm)
q Signaling Gateway (http://www.signaling-gateway.org)
q Wikipathways (http://www.wikipathways.org)
n Limitation
q Limited coverage
q Inconsistency among different databases
q Relationship between pathways is not defined
VU workshop, 2016 Gene Ontology
n Structured, precisely defined, controlled vocabulary for describing the roles of genes and gene products
n Three organizing principles: molecular function, biological process, and cellular component
q Dopamine receptor D2, the product of human gene DRD2
q molecular function: dopamine receptor activity
q biological process: synaptic transmission
q cellular component: plasma membrane
n Terms in GO are linked by several types of relationships
q Is_a (e.g. plasma membrane is_a membrane)
q Part_of (e.g. membrane is part_of cell)
q Has part
q Regulates
q Occurs in
VU workshop, 2016 Gene Ontology
VU workshop, 2016 Annotating genes using GO terms
n Two types of GO annotations
q Electronic annotation
q Manual annotation
n All annotations must:
q be attributed to a source
q indicate what evidence was found to support the GO term-gene/protein association
n Types of evidence codes
q Experimental codes - IDA, IMP, IGI, IPI, IEP
q Computational codes - ISS, IEA, RCA, IGC
q Author statement - TAS, NAS
q Other codes - IC, ND
VU workshop, 2016 Access GO
n Downloads (http://www.geneontology.org)
q Ontologies
n http://www.geneontology.org/page/download-ontology
q Annotations
n http://www.geneontology.org/page/download-annotations
n Web-based access
q AmiGO: http://www.godatabase.org
q QuickGO: http://www.ebi.ac.uk/QuickGO
VU workshop, 2016 Coverage of GO annotations
Homo sapiens Mus musculus #term #gene #term #gene GO/BP 6502 15228 6227 15709 GO/MF 3144 16389 2961 17287 GO/CC 947 16765 882 16801
VU workshop, 2016 Over-representation analysis: concept
98 Hoxa5 Sash1 Hoxa11 Ltbp3 Cd24a Agt Sox4 581 1842 Foxc1 Psrc1 Edn1 Ctla2b Angptl4 Ror2 Gnag Depdc7 Observe Sorbs1 Smad3 compare Wdr5 Macrod1 Trp63 Enpp2 Tmem176a Sox9 65 Pax1 …… Acd Rai1 Pitx1 581 1842 …… Differentially expressed genes (581 genes) Expect
Development (1842 genes) n Is the observed overlap significantly larger than the expected value?
VU workshop, 2016 Over-representation analysis: method
Significant genes Non-significant genes Total genes in the group k j-k j
Other genes n-k m-n-j+k m-j Total n m-n m
Hypergeometric test: given a total of m genes where j genes are in the functional group, if we pick n genes randomly, what is the probability of having k or more genes from the group?
Observed # m − j j& k min(n, j ) % (% ( $ n − i '$ i' p = n j ∑ # m& i= k % ( m $ n ' Zhang et.al. Nucleic Acids Res. 33:W741, 2005
VU workshop, 2016 € Over-representation analysis: WebGestalt
Gene list http://www.webgestalt.org 92546_r_at 8 organisms 92545_f_at Human, Mouse, Rat, Dog, Fruitfly, Worm, Zebrafish, Yeast 96055_at Daily Unique Visitors (Jan 2015 – Dec 2015)
102105_f_at Microarray Probe IDs Gene IDs Protein IDs 102700_at • Affymetrix • Gene Symbol • UniProt • Agilent • GenBank • IPI …… • Codelink • Ensembl Gene • RefSeq Peptide • Illumina • RefSeq Gene • Ensembl Peptide • UniGene • Entrez Gene Genetic Variation IDs • SGD • dbSNP • MGI • Flybase ID • Wormbase ID • ZFIN ~200 ID types 196 ID types with mapping to Entrez Gene ID Analysis/ ~60K WebGestalt visualization Functional 59,278 functional categories with genes identified by categories Entrez Gene IDs
Gene Ontology Pathway Network module • Biological Process • KEGG • Transcription factor targets • Molecular Function • Pathway Commons • microRNA targets • Cellular Component • WikiPathways • Protein interaction modules
Disease and Drug Chromosomal location • Disease association genes • Cytogenetic bands • Drug association genes Jan. 1, 2015 – Dec. 31, 2015 Pathways/ Zhang et.al. Nucleic Acids Res. 33:W741, 2005 63,932 visits from 27,409 visitors functional categories Wang et al. Nucleic Acids Res. 41:W77, 2013 VU workshop, 2016 >300 citations Over-representation analysis: limitations
n Arbitrary thresholding
n Ignoring the order of genes in the significant gene list
VU workshop, 2016 Gene Set Enrichment Analysis: concept
n Do genes in a gene set tend to locate at the top or bottom of the ranked gene list?
VU workshop, 2016 Gene Set Enrichment Analysis: method
-1/(n-k)
+1/k
Subramanian et.al. PNAS 102:15545, 2005 k: Number of genes in the gene set S n: Number of all genes in the ranked gene list http://www.broad.mit.edu/gsea/
VU workshop, 2016 Pathway-based analysis
n Organizing genes by
q Pathways
q Gene Ontology
q Other functional categories
n Enrichment analysis methods
q Over-representation analysis
q Gene Set enrichment analysis
n Major limitation
q Existing knowledge on pathways or gene functions is far from complete
q Connections between genes are not considered
VU workshop, 2016 Biological networks
Networks Nodes Edges Protein-protein Proteins Physical interaction, interaction network undirected Signaling network Proteins Modification, Physical directed interaction networks Gene regulatory TFs/miRNAs Physical interaction, network Target genes directed Metabolic network Metabolites Metabolic reaction, directed Co-expression Genes/protei Co-expression, Functional network ns undirected association networks Genetic network Genes Genetic interaction, undirected
VU workshop, 2016 Protein-protein interaction databases
n Database of Interacting Proteins (DIP) http://dip.doe-mbi.ucla.edu/
n The Molecular INTeraction database (MINT) http://mint.bio.uniroma2.it/mint/
n The Biomolecular Object Network Databank (BOND) http://bond.unleashedinformatics.com/
n The General Repository for Interaction Datasets (BioGRID) http://www.thebiogrid.org/
n Human Protein Reference Database (HPRD) http://www.hprd.org
n Online Predicted Human Interaction Database (OPHID) http://ophid.utoronto.ca
n iRef http://wodaklab.org/iRefWeb
n The International Molecular Exchange Consortium (IMEX) http://www.imexconsortium.org
VU workshop, 2016 Properties of complex networks
Human protein-protein interaction network 9,198 proteins and 36,707 interactions
Scale-free Small world Hierarchical modular (hubs) (six degree separation)
VU workshop, 2016 Network visualization
VU workshop, 2016 Network visualization
Network visualization tools Cytoscape (http://www.cytoscape.org)
Node label color: protein type Red HNE targets Blue Transcription factors Black Other proteins Node size: prediction cutoff level Loose cutoff level Stringent cutoff level
Node color: protein function Cell cycle RNA splicing Both Neither
Gehlenborg et al. Nature Methods, 7:S56, 2010
VU workshop, 2016 Scalability challenges
Data complexity Network size
DNA CNA mutation methylation
mRNA expression splicing
Protein expression modification
VU workshop, 2016 Making sense of the hairballs
NetSAM is available in BioConductor Shi et al. Nature Methods, 2013
Nodes ordered in one dimension on the basis of network structure Data visualized as tracks as visualized Data
Barabasi and Oltvai, Nat Rev Genet, 2004
VU workshop, 2016 NetGestalt: exploring multidimensional omics data over biological networks n Genes ordered in the horizontal dimension 2D layout n Data visualized as tracks along the vertical dimension
q Composite continuous track (e.g. gene NetSAM expression matrices) 1D layout
q Single continuous track (e.g. fold changes, p values) Heat map q Single binary track (e.g. significant genes, GO annotations) n Interactive data visualization and analysis Bar chart
q Search, zoom, pan, filter Barcode plot q Track comparison
q Statistical analysis
q Network analysis and visualization http://www.netgestalt.org
q Pathway enrichment analysis Shi et al. Nature Methods, 2013
q Upload user networks and tracks
VU workshop, 2016 Network distance vs functional similarity
n Proteins that lie closer to one another in a protein interaction network are more likely to have similar function and involve in similar biological process.
n Network-based gene function prediction
n Network-based gene prioritization
Sharan et al. Mol Syst Biol, 3:88, 2007
VU workshop, 2016 Network-based analysis
n Direct neighbor-based analysis
n Network module-based analysis
n Network diffusion analysis
VU workshop, 2016 Direct neighbor-based analysis
n Network expansion: to expand a list of “seed” genes to include other related genes in the network
q All neighbors: to identify all direct neighbors of these seed genes in the network
q Enriched neighbors: to identify all non-seed genes significantly
enriched with seed neighbors WNT2 NKX2-1 TBX5
RAD50 EPB41L5 SUB1 KDM6B
PRKAA2 PCBD1 SOX9 HIF1A FOXO3 SFRP1 WWTR1 PIAS4 PPARGC1B PABPC4 BCL9L NR2C2 PARD3B NF1 MAPK14 XPO1 LEF1 HNF1A PROX1 SNCG NRBF2 SREBF2 WNT4 NOTCH1 SMAD2 ACVR1B SMAD4 PITX2 EP300 NR0B2 FOXC1 EFNA1 NCOA2 HDAC4 TRIM24 SMAD3 NCOA1 TP53 MAPK3 CTNNB1 SP1 GSK3B TGFB1I1 SNAI1 SMAD7 TGFBRAP1 SFRP2 NR2F1 UBE2I COPS5 EXT2 CITED2 CTNNB1 HDAC3 HNF4A TGFBR1 MED21 NR1I3 AKT1 BMP7 MED14 AXIN2 DVL1 MED10 NOG SMAD4 CREBBP MAPK1 MED24 TCF7L2 S100A4 TGFB2 TGFBR2 TGFB1 PRMT1 PPP2CA SIRT1 NCOA6 MED17 VDRSREBF1 STK16 FOXO1 AR BMP2
SMAD2 PPARGC1A TGFBR3 ZNHIT3 ENG MED1 MED16 NCOR2 MED7 FMOD PNRC1 TGFB3 VASN C16orf80 TGS1 NPPA MED23 MECR All neighbors RNASEH2B Enriched neighbors HNRNPAB ADAM15 COL1A1 PPP3R1 DAB2IP ZNF703 WNT5A TRIM28 GREM1 WNT16 FOXC2 BAMBI FOXS1 FOXF2 MSX1 HGF
VU workshop, 2016 Direct neighbor-based analysis
n Gene prioritization: to prioritize genes in a list of candidate genes
CDH11 PIK3CA OPCML TNFRSF10C
CDH2 LPPR4 CDH8 KRAS
ATM APC CDH18 BRAF
DMD TCF7L2 ANK2 PTEN CTNNB1 CDH9 TP53
FAM123B EPHA3
ING1 ACVR2A ACVR1B SMAD4
NRAS
ARID1A SMAD2
PCBP1 RIMS1 DPP10
SOX9
EIF4A2
DNAH5
MYO1B
GUCY1A3 KIAA1804 PCDHA10 SLITRK1 PCDHB5 ADAM29 CHRDL1 MAP2K4 UBE2NL CASP14 ZC3H13 ZNF208 ROBO1 NCAM2 CHRM2 EDNRB FBXW7 CSMD1 PTPRM DOCK2 NALCN
DACH2 KCNA4 CCBP2 FAM5C ERBB4 NRXN1 LRRN3 CRTC1 CNTN1 TRPC4 LRP1B CHST9 TRPS1 LOXL2 PRIM2 GRIK3 GRIA1 ARID2 DKK2 HCN1 GGT1 SDK1 KLK2 RYR2 EVC2 BAI3
IFT80 NXF5 AFF2 ELF3 TLL1 FAT4 LIFR ISL1 FLG TTN
VU workshop, 2016 Network module-based analysis
n Identify network modules from a network
q e.g., NetSAM
n Use network modules as gene sets for enrichment analysis
q Over-representation analysis; GSEA
MLIP NARF TOR1AIP1 IGDCC3 URB2 HSBP1 LMNA RAP1B
TRIB1 KIFC3 FTL VARS2 ALOX12 EIF2AK1 MRPL36 ITGAE RGL2
CALB2 ALKBH1 CBLL1 RAP1GDS1 DNAJB6 KRT14 RRAS PIK3CA ALOX12B PLEKHA7 KRT1 LIMA1 NRAS KRT5 LRPAP1 ITIH1 KRT16 AJAP1 KRAS PEX5L RALGDS FAM84B ARHGEF1 TCHP CA9 PKP1 CDH1 OIP5 SHOC2 HRAS DSP PKP4 MYO7A CPS1 DGKZ VRK3 CTNND1 PAQR3 DSC3 CDH24 ARAF KRT6A PKP2 CTNNA1 DSG1 PTPRM FSCN2 RASGRP1 USP53 DSC1 VEZT CPNE7 CDH8 RAF1 BRAF CDH3 NUDT14 KRT8 KRT18 JUP DNAJC27 PKP3 PKD1 CTNNB1 KIAA1109 PPM1B KRT17 CDH16 CDH2 KIAA1161 CLN5 TERF2IP CDH17 MAP2K2 TIMM50 DSC2 CDON MAP2K1 DSG2 CLEC3B PTPN14 SOX1 FKBP10 ADCY6 KRT72 BOC KSR2 MOXD1 VRK2 CDH11 LDHB KANK1 CHRNA7 DSG3 AKAP12 KSR1 FGF13 PEBP4 CDH4 ERMN NEURL2 BCL9L SOX17 CTNNA3 VU workshop, 2016 Network diffusion analysis
n Simulating a random walker’s behavior on a network (with restart)
VU workshop, 2016 Network diffusion analysis
n Simulating a random walker’s behavior on a network (with restart)
q Gene prioritization
q Network expansion
http://www.gene2net.org
VU workshop, 2016 Summary
n Biological networks
q Physical interaction networks
q Functional association networks
n Properties of complex networks
q Scale-free
q Small world
q Hierarchical modular
n Network visualization
q Node-link diagrams
q NetGestalt
n Network-based analysis
q Direct neighbor-based analysis
q Network module-based analysis
q Network diffusion analysis
VU workshop, 2016 Hands-on session
n WebGestalt
q Pathway analysis for gene lists
q http://www.webgestalt.org
n NetGestalt
q Pathway and network analysis for gene lists and data matrices
q http://www.netgestalt.org
n Gene2Net
q network diffusion analysis for gene lists
q http://www.gene2net.org
VU workshop, 2016