Pathways and Networks Biological Meaning of the Gene Sets
• Gene ontology terms
• Pathway mapping
• Linking to PubMed abstracts or associated MeSH terms
• Regulation by the same transcription factor (module)
• Protein families and domains
• Gene set enrichment analysis
• Over-representation analysis

Gene set enrichment analysis
1. Given an a priori defined set of genes S.
2. Rank the genes (e.g. by t-value between 2 groups of microarray samples) to obtain a ranked gene list L.
3. Calculate an enrichment score (ES) that reflects the degree to which the set S is overrepresented at the extremes (top or bottom) of the entire ranked list L.
4. Estimate the statistical significance (nominal P value) of the ES using an empirical phenotype-based permutation test procedure.
5. Adjust for multiple hypothesis testing by controlling the false discovery rate (FDR).
Subramanian A et al. Proc Natl Acad Sci (2005)

Pathway resources
• Biochemical and Metabolic Pathways: Böhringer Mannheim
• Signaling networks (in cancer cell): Hanahan, Weinberg, Cell. 2000
• 137 NCI curated pathway maps (http://pid.nci.nih.gov)
• Signal Transduction Knowledge Environment: http://stke.sciencemag.org/cm/

Pathways
• Pathways represent the available mechanistic information
• Kyoto Encyclopedia of Genes and Genomes, KEGG (http://www.genome.jp/kegg/)

BioCyc
• EcoCyc Metabolic Pathway Database (E. coli)
• Also for other organisms (e.g.
HumanCyc)

Pathways
• Pathways from BioCarta
• http://www.biocarta.com/genes/index.asp

TRANSPATH
• Part of the larger BIOBASE package (commercial)
• PathwayBuilder package for network visualization
• Highly integrated with signaling networks and transcription factor networks (TRANSFAC)
• Linked to extensive enzyme information in BRENDA (www.brenda.uni-koeln.de)
• 28,456 molecules; 52,007 reactions; 54 hand-drawn pathways

Reactome
• Curated pathway database
• http://www.reactome.org/

Pathway Commons
• Aim: convenient access to pathway information
• Facilitate creation and communication of pathway data
• Aggregate pathway data in the public domain
• Provide easy access for pathway analysis

Cytoscape
• Access Pathway Commons from Cytoscape
• http://www.cytoscape.org
• Open source software for network visualization
• Active community
• >40 plugins extend functionality, e.g. BiNGO, ClueGO (for gene ontology)
• Easy to use and good documentation
• VizMapper, various layouts
Cline MS et al. Nat Protoc. 2007

Map gene expression to pathways
• GenMAPP, Cytoscape, Pathway Explorer

Concepts for network analysis
• Pairwise similarity measures
• Connection strength (adjacency functions)
• Network modules (measures of node dissimilarity)
• Connectivity and scale-free network topology
• Network motifs

Similarity measures
• Correlation coefficient
• Partial correlation
• Mutual information (discrete data, continuous data)

Gene co-expression networks
Expression profiles of selected genes:
• abs(log2 ratios) > 1 in at least 5 conditions
• association with metabolic pathways identified by GO analysis
[Figure: gene co-expression network, 67 conditions, 548 genes; subnetworks for insulin signaling (P Charoentong)]

Gene association network (MICO)
• Discretizing expression profiles → groups of genes with identical profiles
• REVEAL algorithm based on mutual information: M(x,y) = I(x) + I(y) - I(x,y)
• M(x,y) = I(x) → directed
• Correlation
[Figure: example nodes Pparg and Apmap]
Bogner-Strauss et al. Cell Mol Life Sci.
2010

Adjacency function
• unweighted
• weighted

Topological overlap matrix (TOM)
• Hierarchical clustering of modules (subnetworks)

Finding active subnetworks
Discovering regulatory and signaling circuits in molecular interaction networks. Ideker T, Ozier O, Schwikowski B, Siegel AF. Bioinformatics 18:S233-S240 (2002)

Aggregate z-score for an entire subnetwork A of k genes:
$$ z_A = \frac{1}{\sqrt{k}} \sum_{i \in A} z_i $$

Calibrating z against background distribution
Randomly sample gene sets of size k using a Monte Carlo approach, compute their scores z_A, and calculate the mean \mu_k and standard deviation \sigma_k for each k.
Corrected subnet score:
$$ s_A = \frac{z_A - \mu_k}{\sigma_k} $$

Scoring over multiple conditions
Finding the probability that at least j of the m conditions had scores above z_A(j).

Finding the maximal scoring connected subgraph
• Simulated annealing
[Figure: significance score over subnetworks, showing the global maximum and local maxima]

Reverse engineering
[Figure: complex system (unknown) with activating (+) and inhibiting (-) interactions]
[Figure: system modeling versus reverse engineering; labels: input, system, temporal series of data]

General assumption
The most effective regulation acts at the transcription level and is due to protein-gene interaction. Since proteins are not measured, mRNA concentration is implicitly considered roughly proportional to protein concentration.
[Figure: gene-protein network (example of post-transcriptional regulation), genomic network, and correct genomic network; elements are "gene-corresponding protein(s)", with edges marked "real" and "incorrect"]

Predictive vs inferential power
Techniques can have good predictive power but low inferential power.
Anyway, we can (hopefully) identify the main genes involved.
• An arrow does not necessarily mean a direct regulation
• Results need to be validated using other experiments
[Figure: gene network with + and - arrows]

General model assumptions
• The model's parameters are static (it is the concentration of genes and proteins that changes in time)
• The control is usually supposed to be instantaneous, or synchronous with time delay t (the number of timepoints is limited)
[Figure: x1 acting on x2 with weight w21; time courses for the instantaneous model, x1(t) and x2(t), and for the synchronous model, x1(t) and x2(t+1)]

General model constraints
The network must be stable: it must tend to an equilibrium, i.e. to a static or a cyclic condition some time after a perturbation.
[Figure: network with only positive (+) connections, forbidden in the absence of other connections (mRNAs and proteins in a cell cannot increase indefinitely)]

Boolean networks
Starting from "direct" engineering: cdk7 AND cyclin H = cdk2

cdk7   cyclin H   cdk2
1      1          1
1      0          0
0      1          0
0      0          0

Boolean networks for sensitivity/robustness analysis
• CellNetAnalyzer
Klamt S et al. BMC Systems Biology. 2007

Differential equation model
The effect on a gene (expression level or its derivative) is modeled as a linear combination (weighted sum) of other expression levels.

Naïve example
Starting from "direct" engineering:
$$ x_3(t+1) = w_{31} x_1(t) + w_{32} x_2(t) + w_{33} x_3(t) $$
with w_{31} > 0 (x1 acts positively on x3), w_{32} < 0 (x2 acts negatively on x3), and w_{33} = 0.

Differential equations
$$ \frac{dx_i(t)}{dt} = R_i \, f\left( \sum_{j=1}^{N} w_{ij} x_j(t) + \sum_{k=1}^{K} v_{ik} u_k(t) + B_i \right) - \lambda_i x_i(t) $$
• R_i: activation constant of gene i
• f(.): activation function of gene i
• w_{ij}: control strength of gene j on gene i
• x_j(t): expression level of gene j
• u_k(t), v_{ik}: external inputs and their control strength on gene i
• B_i: basal activation level of gene i
• \lambda_i x_i(t): degradation law of gene i
[Figure: linear f(.) and sigmoidal f(.) activation functions]

Parameter identification
• For each gene: m equations (m = time points) in n (+c) unknown parameters.
• m < n: underdetermined problem
Time arrays (rows: genes, columns: time points):
$$ \begin{pmatrix} x_{11} & \cdots & x_{1m} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{nm} \end{pmatrix} $$
Equation system for gene 1 (external inputs and basal term omitted):
$$ \begin{aligned}
\dot{x}_1(t_1) &= R_1 f\big( w_{11} x_1(t_1) + \dots + w_{1n} x_n(t_1) \big) - \lambda_1 x_1(t_1) \\
\dot{x}_1(t_2) &= R_1 f\big( w_{11} x_1(t_2) + \dots + w_{1n} x_n(t_2) \big) - \lambda_1 x_1(t_2) \\
&\;\;\vdots \\
\dot{x}_1(t_m) &= R_1 f\big( w_{11} x_1(t_m) + \dots + w_{1n} x_n(t_m) \big) - \lambda_1 x_1(t_m)
\end{aligned} $$
• Strategies: re-sampling + LS, SVD, simulated annealing, genetic algorithms, gradient descent

Bayesian networks
• A Bayesian network provides not just the network, but the probability that this network is true according to:
  – given data
  – some a priori knowledge
• Components:
  – Acyclic graph (main disadvantage)
      Nodes = random variables to study
      Connections = relations of conditional dependence among variables
  – Conditional probability distribution

Assumptions and basic definitions
p(G1,G2,G3,G4,G5) = p(G1) p(G2|G1) p(G3|G1) p(G4|G2) p(G5|G1,G2,G3)
$$ p(x_1, x_2, \dots, x_n) = \prod_{i=1}^{n} p(x_i \mid pa(x_i)) $$

Conditional independence
• Serial connection (X → Y → Z):
  p(x,y,z) = p(y) p(x|y) p(z|y)
  p(z|x) = \sum_y p(y|x) p(z|y)
• Diverging connection (X ← Y → Z):
  p(x,y,z) = p(y) p(x|y) p(z|y)
  p(x,z) = \sum_y p(y) p(x|y) p(z|y)
• Converging connection (X → Y ← Z):
  p(x,y,z) = p(x) p(z) p(y|x,z)
  p(z|x) = p(z), independent of x

Problem to be solved
$$ \max_{G} S(G \mid D) $$
D = {d_1, …, d_m}, where m is the number of samples, d_l = {d_{l1}, …, d_{ln}} for l ∈ {1, …, m}, and 1, …, n index the genes in the network.
S(G|D) is a score function that measures how likely it is that the data come from each of the networks proposed for the given nodes (genes).
G ranges over the set of all available networks for a given set of nodes (genes).

Score function
• Bayesian score
$$ S(G \mid D) = \frac{P(D \mid G) \, P(G)}{P(D)} $$
  P(G): a priori probability of the network; P(D): constant across the networks
$$ \max S(G \mid D) = \prod_{l=1}^{m} \prod_{i=1}^{n} P\big( d_{li} \mid pa(d_{li}) \big) $$
  Probability model: discrete (multinomial) or continuous (Gaussian)
  Finding the parameters of the proposed probability function that maximize the score function.
• Minimum length

Different network representation

Different levels of networks

Network measures
• Degree (connectivity) k_i
• Degree distribution P(k)
• Mean path length
• Network diameter
• Clustering coefficient
• Topological overlap

Graphs
• A graph G = (V, E) is a set of vertices V and edges E, where |V| = n and |E| = m are the numbers of vertices and edges
• A subgraph G' of G is induced by some V' ⊆ V and E' ⊆ E
• Graph properties:
  – Connectivity (node degree, paths)
  – Cyclic vs. acyclic
  – Directed vs. undirected
• A graph is sparse if m ~ n and dense if m ~ n^2
• A graph is complete if every pair of vertices is joined by an edge (m = n(n-1)/2 for a simple undirected graph)

Small-world network
• Every node can be reached from every other by a small number of hops or steps
• High clustering coefficient and low mean shortest path length (random graphs don't necessarily have high clustering coefficients)
• Social networks, the Internet, and biological networks all exhibit small-world network characteristics
• Six degrees of separation
• Milgram experiment: random people from Nebraska were to send a letter (via intermediaries) to a stock broker in Boston.
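
The network measures listed above (degree, clustering coefficient, mean path length, diameter) can be illustrated on a toy graph. The following is a minimal sketch in plain Python, not taken from the slides: the five-node graph, the node names A to E, and the helper names `clustering_coefficient`, `shortest_path_lengths`, and `path_stats` are all hypothetical choices made for illustration, and the graph is assumed to be undirected and connected.

```python
from collections import deque
from itertools import combinations

# Hypothetical toy graph (undirected): a triangle A-B-C with a tail C-D-E.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def clustering_coefficient(adj, v):
    """Fraction of pairs of neighbours of v that are themselves connected."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0  # convention: undefined cases are reported as 0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))

def shortest_path_lengths(adj, source):
    """Breadth-first search: hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def path_stats(adj):
    """Mean shortest path length and diameter (assumes a connected graph)."""
    lengths = []
    for s in adj:
        d = shortest_path_lengths(adj, s)
        lengths.extend(hops for t, hops in d.items() if t != s)
    return sum(lengths) / len(lengths), max(lengths)

degrees = {v: len(adj[v]) for v in adj}            # degree k_i of each node
clustering = {v: clustering_coefficient(adj, v) for v in adj}
mean_len, diameter = path_stats(adj)

print("degrees:", degrees)
print("clustering:", clustering)
print("mean path length:", mean_len, "diameter:", diameter)
```

In this toy graph the triangle nodes A and B have clustering coefficient 1, while the tail nodes D and E have 0. A small-world network would combine a high average clustering coefficient with a low mean path length across the whole graph.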