Pathway and network analysis for mRNA and profiling data

Bing Zhang, Ph.D. Professor of Molecular and Genetics Lester & Sue Smith Breast Center Baylor College of Medicine [email protected]

VU workshop, 2016 expression

DNA

Transcriptome Transcriptome RNA mRNA decay profiling

Translation

Proteome Protein Proteome Protein degradation profiling

Phenotype Networks

VU workshop, 2016 Overall workflow of studies

Biological question

Experimental design

Microarray RNA-Seq Shotgun proteomics

Image analysis Reads mapping Peptide/protein ID

Signal intensities Read counts Spectral counts; Intensities

Data Analysis

Experimental Hypothesis validation

VU workshop, 2016 Data matrix

Samples

probe_set_id HNE0_1 HNE0_2 HNE0_3 HNE60_1 HNE60_2 HNE60_3 1007_s_at 8.6888 8.5025 8.5471 8.5412 8.5624 8.3073 1053_at 9.1558 9.1835 9.4294 9.2111 9.1204 9.2494 117_at 7.0700 7.0034 6.9047 9.0414 8.6382 9.2663 121_at 9.7174 9.7440 9.6120 9.7581 9.7422 9.7345 1255_g_at 4.2801 4.4669 4.2360 4.3700 4.4573 4.2979 1294_at 6.3556 6.2381 6.2053 6.4290 6.5074 6.2771

Genes 1316_at 6.5759 6.5330 6.4709 6.6636 6.6438 6.4688 1320_at 6.5497 6.5388 6.5410 6.6605 6.5987 6.7236 1405_i_at 4.3260 4.4640 4.1438 4.3462 4.3876 4.6849 1431_at 5.2191 5.2070 5.2657 5.2823 5.2522 5.1808 1438_at 7.0155 6.9359 6.9241 7.0248 7.0142 7.0971 1487_at 8.6361 8.4879 8.4498 8.4470 8.5311 8.4225 1494_f_at 7.3296 7.3901 7.0886 7.2648 7.6058 7.2949 1552256_a_at 10.6245 10.5235 10.6522 10.4205 10.2344 10.3144 1552257_a_at 10.3224 10.1749 10.1992 10.2464 10.2191 10.2405

Signal intensities Read counts Spectral counts; Intensities

VU workshop, 2016 Differential expression and clustering

log2(ratio)

Microarray log10(p value) log10(p - RNA-Seq Differential expression

Proteomics

Clustering

VU workshop, 2016 Differential expression: t-test

graph courtesy of www.socialresearchmethods.net

VU workshop, 2016 Data are not normally distributed

n Log transformation

Histogram of a Histogram of log(a) 200 150 150 100 100 Frequency Frequency 50 50 0 0

50 150 250 3.5 4.0 4.5 5.0 5.5

a log(a) n Mann-Whitney U test (Wilcoxon rank-sum test)

q Based on ranks of observations

q Does not rely on data belonging to any particular distribution

VU workshop, 2016 Sample size is small

n Moderated t-test (limma)

q Designed for experiments with small sample size

q The standard errors are moderated across , effectively borrowing information from the ensemble of genes to aid with inference about each individual gene.

VU workshop, 2016 Count data

n Negative binomial distribution

q Cuffdiff

q DESeq

q edgeR

n Log-transformed counts

q Voom/limma

VU workshop, 2016 Correction for multiple testing

n Why

q In an experiment with a 10,000-gene array in which the significance level p is set at 0.05, 10,000 x 0.05 = 500 genes would be inferred as significant even though none is differentially expressed

n How

q Control the family-wise error rate (FWER), the probability that there is a single type I error in the entire set (family) of hypotheses tested. e.g., Standard Bonferroni Correction

q Control the false discovery rate (FDR), the expected proportion of false positives among the number of rejected hypotheses. e.g., Benjamini and Hochberg correction.

VU workshop, 2016 Clustering (unsupervised analysis)

n Clustering algorithms are methods to divide a set of n objects (genes or samples) into g groups so that within group similarities are larger than between group similarities

n Unsupervised techniques that do not require sample annotation in the process

n Identify candidate subgroups in complex data. e.g. identification of novel sub-types in cancer, identification of co-expressed genes

Samples Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 …… TNNC1 14.82 14.46 14.76 11.22 11.55 …… DKK4 10.71 10.37 11.23 19.74 19.73 …… ZNF185 15.20 14.96 15.07 12.57 12.37 …… CHST3 13.40 13.18 13.15 11.18 10.99 …… FABP3 15.87 15.80 15.85 13.16 12.99 …… MGST1 12.76 12.80 12.67 14.92 15.02 ……

Genes DEFA5 10.63 10.47 10.54 15.52 15.52 …… VIL1 11.47 11.69 11.87 13.94 14.01 …… AKAP12 18.26 18.10 18.50 15.60 15.69 …… HS3ST1 10.61 10.67 10.50 12.44 12.23 …… …… …… …… …… …… …… ……

VU workshop, 2016 Hierarchical clustering

n Agglomerative hierarchical clustering

q Start out with all sample units in n clusters of size 1.

q At each step of the algorithm, the pair of clusters with the shortest distance are combined into a single cluster.

q The algorithm stops when all sample units are combined into a single cluster of size n.

VU workshop, 2016 Visualization of hierarchical clustering results n Dendrogram

q Output of a hierarchical clustering Neuron type Non-5ht 5ht Brain region caudal rostral rostral caudal q Tree structure with the genes or samples as the leaves

q The height of the join indicates the distance between the branches n Heat map

q Graphical representation of data where the values are represented as colors.

VU workshop, 2016 Omics studies generate lists of interesting genes

log2(ratio) 92546_r_at 92545_f_at 96055_at 102105_f_at Microarray 102700_at log10(p value) log10(p - 161361_s_at RNA-Seq Differential 92202_g_at expression 103548_at 100947_at Proteomics 101869_s_at 102727_at 160708_at …...

Clustering Lists of genes with potential biological interest

VU workshop, 2016 Organizing genes based on pathways

n Gene 1 n Pathway 1

q Pathway 1 q Gene 1

q Pathway 2 q Gene 2

n Gene 2 n Pathway 2

q Pathway 1 q Gene 1

q Pathway 3 q Gene 3

n Gene 3 n Pathway 3

q Pathway 2 q Gene 2

q Pathway 3 q Gene 3

n Gene 4 q Gene 4

q Pathway 3

VU workshop, 2016 Advantages of pathway analysis

n Better interpretation

q From interesting genes to interesting biological themes

n Improved robustness

q Robust against noise in the data

n Improved sensitivity

q Detecting minor but concordant changes in a pathway

VU workshop, 2016 Pathway databases

n Databases

q BioCarta (http://www.biocarta.com/genes/index.asp)

q KEGG (http://www.genome.jp/kegg/pathway.html)

q MetaCyc (http://metacyc.org)

q Pathway commons (http://www.pathwaycommons.org)

q Reactome (http://www.reactome.org)

q STKE (http://stke.sciencemag.org/cm)

q Signaling Gateway (http://www.signaling-gateway.org)

q Wikipathways (http://www.wikipathways.org)

n Limitation

q Limited coverage

q Inconsistency among different databases

q Relationship between pathways is not defined

VU workshop, 2016

n Structured, precisely defined, controlled vocabulary for describing the roles of genes and gene products

n Three organizing principles: molecular function, biological process, and cellular component

q Dopamine receptor D2, the product of human gene DRD2

q molecular function: dopamine receptor activity

q biological process: synaptic transmission

q cellular component: plasma membrane

n Terms in GO are linked by several types of relationships

q Is_a (e.g. plasma membrane is_a membrane)

q Part_of (e.g. membrane is part_of cell)

q Has part

q Regulates

q Occurs in

VU workshop, 2016 Gene Ontology

VU workshop, 2016 Annotating genes using GO terms

n Two types of GO annotations

q Electronic annotation

q Manual annotation

n All annotations must:

q be attributed to a source

q indicate what evidence was found to support the GO term-gene/protein association

n Types of evidence codes

q Experimental codes - IDA, IMP, IGI, IPI, IEP

q Computational codes - ISS, IEA, RCA, IGC

q Author statement - TAS, NAS

q Other codes - IC, ND

VU workshop, 2016 Access GO

n Downloads (http://www.geneontology.org)

q Ontologies

n http://www.geneontology.org/page/download-ontology

q Annotations

n http://www.geneontology.org/page/download-annotations

n Web-based access

q AmiGO: http://www.godatabase.org

q QuickGO: http://www.ebi.ac.uk/QuickGO

VU workshop, 2016 Coverage of GO annotations

Homo sapiens Mus musculus #term #gene #term #gene GO/BP 6502 15228 6227 15709 GO/MF 3144 16389 2961 17287 GO/CC 947 16765 882 16801

VU workshop, 2016 Over-representation analysis: concept

98 Hoxa5 Sash1 Hoxa11 Ltbp3 Cd24a Agt Sox4 581 1842 Foxc1 Psrc1 Edn1 Ctla2b Angptl4 Ror2 Gnag Depdc7 Observe Sorbs1 Smad3 compare Wdr5 Macrod1 Trp63 Enpp2 Tmem176a Sox9 65 Pax1 …… Acd Rai1 Pitx1 581 1842 …… Differentially expressed genes (581 genes) Expect

Development (1842 genes) n Is the observed overlap significantly larger than the expected value?

VU workshop, 2016 Over-representation analysis: method

Significant genes Non-significant genes Total genes in the group k j-k j

Other genes n-k m-n-j+k m-j Total n m-n m

Hypergeometric test: given a total of m genes where j genes are in the functional group, if we pick n genes randomly, what is the probability of having k or more genes from the group?

Observed # m − j&# j& k min(n, j ) % (% ( $ n − i '$ i' p = n j ∑ # m& i= k % ( m $ n ' Zhang et.al. Nucleic Acids Res. 33:W741, 2005

VU workshop, 2016 € Over-representation analysis: WebGestalt

Gene list http://www.webgestalt.org 92546_r_at 8 organisms 92545_f_at Human, Mouse, Rat, Dog, Fruitfly, Worm, Zebrafish, Yeast 96055_at Daily Unique Visitors (Jan 2015 – Dec 2015)

102105_f_at Microarray Probe IDs Gene IDs Protein IDs 102700_at • Affymetrix • Gene Symbol • UniProt • Agilent • GenBank • IPI …… • Codelink • Ensembl Gene • RefSeq Peptide • Illumina • RefSeq Gene • Ensembl Peptide • UniGene • Gene Genetic Variation IDs • SGD • dbSNP • MGI • Flybase ID • Wormbase ID • ZFIN ~200 ID types 196 ID types with mapping to Entrez Gene ID Analysis/ ~60K WebGestalt visualization Functional 59,278 functional categories with genes identified by categories Entrez Gene IDs

Gene Ontology Pathway Network module • Biological Process • KEGG • Transcription factor targets • Molecular Function • Pathway Commons • microRNA targets • Cellular Component • WikiPathways • Protein interaction modules

Disease and Drug Chromosomal location • Disease association genes • Cytogenetic bands • Drug association genes Jan. 1, 2015 – Dec. 31, 2015 Pathways/ Zhang et.al. Nucleic Acids Res. 33:W741, 2005 63,932 visits from 27,409 visitors functional categories Wang et al. Nucleic Acids Res. 41:W77, 2013 VU workshop, 2016 >300 citations Over-representation analysis: limitations

n Arbitrary thresholding

n Ignoring the order of genes in the significant gene list

VU workshop, 2016 Gene Set Enrichment Analysis: concept

n Do genes in a gene set tend to locate at the top or bottom of the ranked gene list?

VU workshop, 2016 Gene Set Enrichment Analysis: method

-1/(n-k)

+1/k

Subramanian et.al. PNAS 102:15545, 2005 k: Number of genes in the gene set S n: Number of all genes in the ranked gene list http://www.broad.mit.edu/gsea/

VU workshop, 2016 Pathway-based analysis

n Organizing genes by

q Pathways

q Gene Ontology

q Other functional categories

n Enrichment analysis methods

q Over-representation analysis

q Gene Set enrichment analysis

n Major limitation

q Existing knowledge on pathways or gene functions is far from complete

q Connections between genes are not considered

VU workshop, 2016 Biological networks

Networks Nodes Edges Protein-protein Physical interaction, interaction network undirected Signaling network Proteins Modification, Physical directed interaction networks Gene regulatory TFs/miRNAs Physical interaction, network Target genes directed Metabolic network Metabolites Metabolic reaction, directed Co-expression Genes/protei Co-expression, Functional network ns undirected association networks Genetic network Genes Genetic interaction, undirected

VU workshop, 2016 Protein-protein interaction databases

n Database of Interacting Proteins (DIP) http://dip.doe-mbi.ucla.edu/

n The Molecular INTeraction database (MINT) http://mint.bio.uniroma2.it/mint/

n The Biomolecular Object Network Databank (BOND) http://bond.unleashedinformatics.com/

n The General Repository for Interaction Datasets (BioGRID) http://www.thebiogrid.org/

n Human Protein Reference Database (HPRD) http://www.hprd.org

n Online Predicted Human Interaction Database (OPHID) http://ophid.utoronto.ca

n iRef http://wodaklab.org/iRefWeb

n The International Molecular Exchange Consortium (IMEX) http://www.imexconsortium.org

VU workshop, 2016 Properties of complex networks

Human protein-protein interaction network 9,198 proteins and 36,707 interactions

Scale-free Small world Hierarchical modular (hubs) (six degree separation)

VU workshop, 2016 Network visualization

VU workshop, 2016 Network visualization

Network visualization tools Cytoscape (http://www.cytoscape.org)

Node label color: protein type Red HNE targets Blue Transcription factors Black Other proteins Node size: prediction cutoff level Loose cutoff level Stringent cutoff level

Node color: protein function Cell cycle RNA splicing Both Neither

Gehlenborg et al. Nature Methods, 7:S56, 2010

VU workshop, 2016 Scalability challenges

Data complexity Network size

DNA CNA mutation methylation

mRNA expression splicing

Protein expression modification

VU workshop, 2016 Making sense of the hairballs

NetSAM is available in BioConductor Shi et al. Nature Methods, 2013

Nodes ordered in one dimension on the basis of network structure Data visualized as tracks as visualized Data

Barabasi and Oltvai, Nat Rev Genet, 2004

VU workshop, 2016 NetGestalt: exploring multidimensional omics data over biological networks n Genes ordered in the horizontal dimension 2D layout n Data visualized as tracks along the vertical dimension

q Composite continuous track (e.g. gene NetSAM expression matrices) 1D layout

q Single continuous track (e.g. fold changes, p values) Heat map q Single binary track (e.g. significant genes, GO annotations) n Interactive data visualization and analysis Bar chart

q Search, zoom, pan, filter Barcode plot q Track comparison

q Statistical analysis

q Network analysis and visualization http://www.netgestalt.org

q Pathway enrichment analysis Shi et al. Nature Methods, 2013

q Upload user networks and tracks

VU workshop, 2016 Network distance vs functional similarity

n Proteins that lie closer to one another in a protein interaction network are more likely to have similar function and involve in similar biological process.

n Network-based gene function prediction

n Network-based gene prioritization

Sharan et al. Mol Syst Biol, 3:88, 2007

VU workshop, 2016 Network-based analysis

n Direct neighbor-based analysis

n Network module-based analysis

n Network diffusion analysis

VU workshop, 2016 Direct neighbor-based analysis

n Network expansion: to expand a list of “seed” genes to include other related genes in the network

q All neighbors: to identify all direct neighbors of these seed genes in the network

q Enriched neighbors: to identify all non-seed genes significantly

enriched with seed neighbors WNT2 NKX2-1 TBX5

RAD50 EPB41L5 SUB1 KDM6B

PRKAA2 PCBD1 SOX9 HIF1A FOXO3 SFRP1 WWTR1 PIAS4 PPARGC1B PABPC4 BCL9L NR2C2 PARD3B NF1 MAPK14 XPO1 LEF1 HNF1A PROX1 SNCG NRBF2 SREBF2 WNT4 NOTCH1 SMAD2 ACVR1B SMAD4 PITX2 EP300 NR0B2 FOXC1 EFNA1 NCOA2 HDAC4 TRIM24 SMAD3 NCOA1 TP53 MAPK3 CTNNB1 SP1 GSK3B TGFB1I1 SNAI1 SMAD7 TGFBRAP1 SFRP2 NR2F1 UBE2I COPS5 EXT2 CITED2 CTNNB1 HDAC3 HNF4A TGFBR1 MED21 NR1I3 AKT1 BMP7 MED14 AXIN2 DVL1 MED10 NOG SMAD4 CREBBP MAPK1 MED24 TCF7L2 S100A4 TGFB2 TGFBR2 TGFB1 PRMT1 PPP2CA SIRT1 NCOA6 MED17 VDRSREBF1 STK16 FOXO1 AR BMP2

SMAD2 PPARGC1A TGFBR3 ZNHIT3 ENG MED1 MED16 NCOR2 MED7 FMOD PNRC1 TGFB3 VASN C16orf80 TGS1 NPPA MED23 MECR All neighbors RNASEH2B Enriched neighbors HNRNPAB ADAM15 COL1A1 PPP3R1 DAB2IP ZNF703 WNT5A TRIM28 GREM1 WNT16 FOXC2 BAMBI FOXS1 FOXF2 MSX1 HGF

VU workshop, 2016 Direct neighbor-based analysis

n Gene prioritization: to prioritize genes in a list of candidate genes

CDH11 PIK3CA OPCML TNFRSF10C

CDH2 LPPR4 CDH8 KRAS

ATM APC CDH18 BRAF

DMD TCF7L2 ANK2 PTEN CTNNB1 CDH9 TP53

FAM123B EPHA3

ING1 ACVR2A ACVR1B SMAD4

NRAS

ARID1A SMAD2

PCBP1 RIMS1 DPP10

SOX9

EIF4A2

DNAH5

MYO1B

GUCY1A3 KIAA1804 PCDHA10 SLITRK1 PCDHB5 ADAM29 CHRDL1 MAP2K4 UBE2NL CASP14 ZC3H13 ZNF208 ROBO1 NCAM2 CHRM2 EDNRB FBXW7 CSMD1 PTPRM DOCK2 NALCN

DACH2 KCNA4 CCBP2 FAM5C ERBB4 NRXN1 LRRN3 CRTC1 CNTN1 TRPC4 LRP1B CHST9 TRPS1 LOXL2 PRIM2 GRIK3 GRIA1 ARID2 DKK2 HCN1 GGT1 SDK1 KLK2 RYR2 EVC2 BAI3

IFT80 NXF5 AFF2 ELF3 TLL1 FAT4 LIFR ISL1 FLG TTN

VU workshop, 2016 Network module-based analysis

n Identify network modules from a network

q e.g., NetSAM

n Use network modules as gene sets for enrichment analysis

q Over-representation analysis; GSEA

MLIP NARF TOR1AIP1 IGDCC3 URB2 HSBP1 LMNA RAP1B

TRIB1 KIFC3 FTL VARS2 ALOX12 EIF2AK1 MRPL36 ITGAE RGL2

CALB2 ALKBH1 CBLL1 RAP1GDS1 DNAJB6 KRT14 RRAS PIK3CA ALOX12B PLEKHA7 KRT1 LIMA1 NRAS KRT5 LRPAP1 ITIH1 KRT16 AJAP1 KRAS PEX5L RALGDS FAM84B ARHGEF1 TCHP CA9 PKP1 CDH1 OIP5 SHOC2 HRAS DSP PKP4 MYO7A CPS1 DGKZ VRK3 CTNND1 PAQR3 DSC3 CDH24 ARAF KRT6A PKP2 CTNNA1 DSG1 PTPRM FSCN2 RASGRP1 USP53 DSC1 VEZT CPNE7 CDH8 RAF1 BRAF CDH3 NUDT14 KRT8 KRT18 JUP DNAJC27 PKP3 PKD1 CTNNB1 KIAA1109 PPM1B KRT17 CDH16 CDH2 KIAA1161 CLN5 TERF2IP CDH17 MAP2K2 TIMM50 DSC2 CDON MAP2K1 DSG2 CLEC3B PTPN14 SOX1 FKBP10 ADCY6 KRT72 BOC KSR2 MOXD1 VRK2 CDH11 LDHB KANK1 CHRNA7 DSG3 AKAP12 KSR1 FGF13 PEBP4 CDH4 ERMN NEURL2 BCL9L SOX17 CTNNA3 VU workshop, 2016 Network diffusion analysis

n Simulating a random walker’s behavior on a network (with restart)

VU workshop, 2016 Network diffusion analysis

n Simulating a random walker’s behavior on a network (with restart)

q Gene prioritization

q Network expansion

http://www.gene2net.org

VU workshop, 2016 Summary

n Biological networks

q Physical interaction networks

q Functional association networks

n Properties of complex networks

q Scale-free

q Small world

q Hierarchical modular

n Network visualization

q Node-link diagrams

q NetGestalt

n Network-based analysis

q Direct neighbor-based analysis

q Network module-based analysis

q Network diffusion analysis

VU workshop, 2016 Hands-on session

n WebGestalt

q Pathway analysis for gene lists

q http://www.webgestalt.org

n NetGestalt

q Pathway and network analysis for gene lists and data matrices

q http://www.netgestalt.org

n Gene2Net

q network diffusion analysis for gene lists

q http://www.gene2net.org

VU workshop, 2016