Pathways and Networks
Biological meaning of the gene sets
• Gene ontology terms
• Pathway mapping
• Linking to PubMed abstracts or associated MeSH terms
• Regulation by the same transcription factor (module)
• Protein families and domains
• Gene set enrichment analysis
• Over-representation analysis
Gene set enrichment analysis
1. Given an a priori defined set of genes S.
2. Rank genes (e.g. by t-value between 2 groups of microarray samples) to obtain a ranked gene list L.
3. Calculate an enrichment score (ES) that reflects the degree to which the set S is overrepresented at the extremes (top or bottom) of the entire ranked list L.
4. Estimate the statistical significance (nominal P value) of the ES by using an empirical phenotype-based permutation test procedure.
5. Adjust for multiple hypothesis testing by controlling the false discovery rate (FDR).
Gene set enrichment analysis
Subramanian A et al. Proc Natl Acad Sci (2005)
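The ranking, enrichment score, and phenotype permutation steps listed above can be sketched as follows. This is a minimal illustration, not the reference GSEA implementation: the function names, the Welch-type t-value, the weighting exponent `p=1`, and the same-sign convention for the nominal P value are assumptions made for the example, and the FDR step is left out.

```python
import numpy as np

def t_values(expr, labels):
    """Welch-type t-value per gene (rows) between samples with label 1 and label 0."""
    a, b = expr[:, labels == 1], expr[:, labels == 0]
    se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1] + b.var(axis=1, ddof=1) / b.shape[1])
    return (a.mean(axis=1) - b.mean(axis=1)) / se

def enrichment_score(ranked_ids, ranked_scores, gene_set, p=1.0):
    """Running-sum statistic: step up at genes in S, step down otherwise."""
    in_set = np.isin(ranked_ids, gene_set)
    hit = np.where(in_set, np.abs(ranked_scores) ** p, 0.0)
    hit_cum = np.cumsum(hit) / hit.sum()
    miss_cum = np.cumsum(~in_set) / (~in_set).sum()
    running = hit_cum - miss_cum
    return running[np.argmax(np.abs(running))]  # maximum deviation from zero

def gsea(expr, labels, gene_set, n_perm=200, seed=0):
    """Observed ES and nominal P value from phenotype-label permutations."""
    rng = np.random.default_rng(seed)

    def es_for(lab):
        t = t_values(expr, lab)
        order = np.argsort(-t)  # ranked gene list L, highest t first
        return enrichment_score(order, t[order], gene_set)

    observed = es_for(labels)
    null = np.array([es_for(rng.permutation(labels)) for _ in range(n_perm)])
    same_sign = null[np.sign(null) == np.sign(observed)]
    p_value = (np.sum(np.abs(same_sign) >= abs(observed)) + 1) / (len(same_sign) + 1)
    return observed, p_value
```

Here `expr` is a genes x samples array, `labels` a 0/1 array over samples, and `gene_set` an array of row indices; all three are hypothetical inputs for illustration.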
Biochemical and Metabolic Pathways
Böhringer Mannheim
Signaling networks (in cancer cell)
Hanahan, Weinberg, Cell. 2000
137 NCI curated pathway maps (http://pid.nci.nih.gov)
Signal Transduction Knowledge Environment http://stke.sciencemag.org/cm/
Pathways
• Pathways capture the available mechanistic information
• Kyoto Encyclopedia of Genes and Genomes, KEGG (http://www.genome.jp/kegg/)
BioCyc
• EcoCyc Metabolic Pathway Database (E. coli)
• Also for other organisms (e.g. HumanCyc)
Pathways
• Pathways from BioCarta
• http://www.biocarta.com/genes/index.asp
Transpath
Part of larger BioBase package (commercial)
• PathwayBuilder package for network visualization
• Highly integrated with signaling networks and transcription factor networks (TransFAC)
• Linked to extensive enzyme information in BRENDA (www.brenda.uni‐koeln.de)
• 28,456 molecules; 52,007 reactions; 54 hand-drawn pathways
Reactome
• Curated pathway database
• http://www.reactome.org/
Pathway Commons
• Aim: convenient access to pathway information
• Facilitate creation and communication of pathway data
• Aggregate pathway data in the public domain
• Provide easy access for pathway analysis
• Access Pathway Commons from Cytoscape
• http://www.cytoscape.org
• Open source software for network visualization
• Active community
• >40 plugins extend functionality, e.g. BiNGO, ClueGO (for gene ontology)
• Easy to use and good documentation
• VizMapper, various layouts
Cline MS et al Nat Protoc. 2007
Map gene expression to pathways
• GenMAPP, Cytoscape, Pathway Explorer
Concepts for network analysis
• Pairwise similarity measures
• Connection strength (adjacency functions)
• Network modules (measures of node dissimilarity)
• Connectivity and scale-free network topology
• Network motifs
Similarity measures
Correlation coefficient
Partial correlation
Mutual information
Discrete data
Continuous data
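The three similarity measures named above can be sketched in a few lines. The formulas on the original slides did not survive extraction, so the choices here are assumptions for illustration: partial correlation is computed from the inverse covariance matrix (conditioning on all other variables), and mutual information for continuous data is estimated after equal-width binning, which reduces it to the discrete case.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    return float(np.corrcoef(x, y)[0, 1])

def partial_correlation(data):
    """Partial correlation of every pair of variables given all the others.

    data: samples x variables. Uses the precision (inverse covariance) matrix.
    """
    precision = np.linalg.inv(np.cov(data, rowvar=False))
    d = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor

def mutual_information(x, y, bins=8):
    """Plug-in mutual information (in bits) after equal-width binning."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)  # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)  # marginal of y, shape (1, bins)
    nz = pxy > 0                         # skip empty cells, 0 * log(0) = 0
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))
```

For a chain x → y → z, the correlation between x and z is high while their partial correlation given y is close to zero, which is why partial correlation is used to remove indirect associations.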
Gene co‐expression networks
Expression profiles
Selected genes:
• abs(log2 ratio) > 1 in at least 5 conditions
• association with metabolic pathways identified by GO analysis
Gene co-expression networks
67 conditions, 548 genes
P Charoentong
Subnetworks for insulin signaling
P Charoentong
Gene association network
MICO
• Discretizing expression profiles → groups of genes with identical profiles
• REVEAL algorithm based on mutual information
  M(x,y) = I(x) + I(y) - I(x,y)
  M(x,y) = I(x) → directed
• Correlation
[Figure: association network with Pparg and Apmap]
Bogner‐Strauss et al. Cell Mol Life Sci. 2010
Adjacency function
unweighted
weighted
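A minimal sketch of the two adjacency functions. The exact definitions on the slide were lost, so this assumes the common co-expression convention: similarity is the absolute Pearson correlation, the unweighted network uses a hard threshold `tau`, and the weighted network uses a power `beta` (soft threshold). The function names and default values are hypothetical.

```python
import numpy as np

def similarity(expr):
    """Absolute Pearson correlation between gene profiles (genes in rows)."""
    return np.abs(np.corrcoef(expr))

def unweighted_adjacency(sim, tau=0.8):
    """Hard threshold: a_ij = 1 if s_ij >= tau, else 0 (no self-connections)."""
    adj = (sim >= tau).astype(int)
    np.fill_diagonal(adj, 0)
    return adj

def weighted_adjacency(sim, beta=6):
    """Soft threshold: a_ij = s_ij ** beta (no self-connections)."""
    adj = sim.astype(float) ** beta
    np.fill_diagonal(adj, 0.0)
    return adj
```

The soft threshold keeps all connections but strongly down-weights weak similarities, so no information is discarded by an arbitrary cut-off.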
Topological overlap matrix (TOM)
Hierarchical clustering of the TOM-based dissimilarity yields modules (subnetworks)
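The module detection step can be sketched as below. The TOM formula is not given on the slide; this assumes the standard definition TOM_ij = (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij), with l_ij the shared-neighbor term and k_i the connectivity, followed by average-linkage clustering of 1 - TOM. Cutting the tree into a fixed number of clusters is a simplification chosen for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def topological_overlap(adj):
    """TOM for a symmetric adjacency matrix with zero diagonal."""
    shared = adj @ adj                 # l_ij = sum_u a_iu * a_uj
    k = adj.sum(axis=1)                # connectivity k_i
    tom = (shared + adj) / (np.minimum.outer(k, k) + 1.0 - adj)
    np.fill_diagonal(tom, 1.0)
    return tom

def modules(adj, n_modules=2):
    """Average-linkage clustering of the dissimilarity 1 - TOM."""
    diss = 1.0 - topological_overlap(adj)
    diss = (diss + diss.T) / 2.0       # enforce exact symmetry
    np.fill_diagonal(diss, 0.0)
    tree = linkage(squareform(diss, checks=False), method="average")
    return fcluster(tree, t=n_modules, criterion="maxclust")
```

Two genes get a high topological overlap when they are connected to each other and also share many neighbors, which makes the resulting modules more robust than clustering on pairwise adjacency alone.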
Finding active subnetworks
Discovering regulatory and signaling circuits in molecular interaction networks Ideker T, Ozier O, Schwikowski B, Siegel AF. Bioinformatics 18:S233‐S240 (2002)
Aggregate z-score for an entire subnetwork A of k genes:
$$z_A = \frac{1}{\sqrt{k}} \sum_{i \in A} z_i$$
Calibrating z against background distribution
Randomly sample gene sets of size k using a Monte Carlo approach, compute their scores z_A, and calculate the mean μ_k and standard deviation σ_k of these scores for each k.
Corrected subnet score:
$$s_A = \frac{z_A - \mu_k}{\sigma_k}$$
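The aggregate score and its Monte Carlo calibration can be sketched as follows. The function names, the number of random samples, and the seed handling are assumptions for illustration; `z` is taken to be an array of per-gene z-scores.

```python
import numpy as np

def aggregate_z(z, members):
    """z_A = sum of member z-scores divided by sqrt(k)."""
    members = np.asarray(members)
    return float(z[members].sum() / np.sqrt(len(members)))

def background(z, k, n_samples=2000, seed=0):
    """Mean and standard deviation of z_A over random gene sets of size k."""
    rng = np.random.default_rng(seed)
    scores = [
        aggregate_z(z, rng.choice(len(z), size=k, replace=False))
        for _ in range(n_samples)
    ]
    return float(np.mean(scores)), float(np.std(scores))

def corrected_score(z, members, n_samples=2000, seed=0):
    """s_A = (z_A - mu_k) / sigma_k, calibrated against random sets of equal size."""
    mu_k, sigma_k = background(z, len(members), n_samples, seed)
    return (aggregate_z(z, members) - mu_k) / sigma_k
```

Calibrating per size k matters because the search compares subnetworks of different sizes; without it, the raw z_A of large and small candidates would not be on the same scale.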
Scoring over multiple conditions
Finding the probability that at least j of the m conditions had scores above ZA(j)
Finding the maximal scoring connected subgraph
Simulated annealing
[Figure: significance score across candidate subnetworks, showing local maxima and the global maximum]
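A minimal sketch of a simulated annealing search for a high-scoring connected subgraph. It is a simplification under stated assumptions, not the published procedure: nodes are toggled active/inactive one at a time, the state is scored by the best connected component using the uncorrected aggregate z-score, and the geometric cooling schedule and all parameter values are hypothetical.

```python
import math
import random

def components(active, neighbors):
    """Connected components of the subgraph induced by the active nodes."""
    seen, comps = set(), []
    for start in active:
        if start in seen:
            continue
        comp, stack = set(), [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            comp.add(node)
            for nb in neighbors[node]:
                if nb in active and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        comps.append(comp)
    return comps

def best_component(active, neighbors, z):
    """Highest-scoring component, scored by sum(z) / sqrt(k)."""
    best, best_score = set(), float("-inf")
    for comp in components(active, neighbors):
        score = sum(z[n] for n in comp) / math.sqrt(len(comp))
        if score > best_score:
            best, best_score = comp, score
    return best, best_score

def anneal(neighbors, z, steps=5000, t_start=1.0, t_end=0.01, seed=0):
    """Toggle random nodes; accept worse states with probability exp(delta / T)."""
    rng = random.Random(seed)
    nodes = sorted(neighbors)
    active = {n for n in nodes if rng.random() < 0.5}
    best_set, best_score = best_component(active, neighbors, z)
    current = best_score
    for step in range(steps):
        temp = t_start * (t_end / t_start) ** (step / steps)
        node = rng.choice(nodes)
        active ^= {node}                      # toggle the node
        comp, score = best_component(active, neighbors, z)
        if score >= current or rng.random() < math.exp((score - current) / temp):
            current = score
            if score > best_score:
                best_set, best_score = set(comp), score
        else:
            active ^= {node}                  # reject: undo the toggle
    return best_set, best_score
```

Accepting some score-decreasing moves at high temperature is what lets the search leave local maxima; as the temperature falls the search becomes increasingly greedy.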
Reverse Engineering
[Figure: complex system (unknown) with activating (+) and inhibiting (-) connections between its components]
Reverse Engineering
[Figure: system modeling versus reverse engineering, each relating an input to a temporal series of data]
General assumption
The most effective regulation acts at the transcription level and is due to protein-gene interaction. Since proteins are not measured, mRNA levels are implicitly assumed to be roughly proportional to the concentrations of the corresponding proteins.
[Figure: gene-protein network (example of post-transcriptional regulation) built from proteins, genes, and “gene-corresponding protein(s)” elements, shown next to the inferred genomic network with an incorrect connection and the correct genomic network]
Predictive vs inferential power
Techniques can have good predictive power but low inferential power.
Even so, we can (hopefully) identify the main genes involved.
An arrow in the gene network does not necessarily mean direct regulation.
Results need to be validated with other experiments.
General Model Assumptions
The model's parameters are static (it is the concentrations of genes and proteins that change over time)
[Figure: gene x1 acting on gene x2 with weight w21]
Control is usually assumed to be either instantaneous or synchronous with a time delay Δt (the number of time points is limited)
[Figure: time courses of x1 and x2. Instantaneous model: x1(t) acts on x2(t). Synchronous model: x1(t) acts on x2(t+1).]
General Model Constraints
The network must be stable: some time after a perturbation it must tend to an equilibrium, i.e. to a static or a cyclic condition.
[Figure: network made only of positive (+) connections] Such a network is forbidden in the absence of other connections (mRNAs and proteins in a cell cannot increase indefinitely).
Boolean Networks
Starting from “direct” engineering:
cdk7  cyclin H  cdk2
1     1         1
1     0         0
0     1         0
0     0         0

cdk7 AND cyclin H = cdk2
[Figure: cdk7 and cyclin H both act on cdk2]
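The AND rule above can be written as a small Boolean network with a synchronous update. Treating cdk7 and cyclin H as inputs that keep their own value is an assumption made for this example; the dictionary layout and function names are hypothetical.

```python
from itertools import product

# One Boolean update rule per node; each rule reads the current state.
rules = {
    "cdk7": lambda s: s["cdk7"],                       # input, held constant
    "cyclin H": lambda s: s["cyclin H"],               # input, held constant
    "cdk2": lambda s: s["cdk7"] and s["cyclin H"],     # cdk7 AND cyclin H
}

def step(state, rules):
    """Synchronous update: every node is recomputed from the same old state."""
    return {gene: int(bool(rule(state))) for gene, rule in rules.items()}

def truth_table(rules, inputs, output):
    """Enumerate all input combinations and record the resulting output."""
    table = []
    for values in product([1, 0], repeat=len(inputs)):
        state = dict(zip(inputs, values))
        state[output] = 0
        table.append(values + (step(state, rules)[output],))
    return table
```

Calling `truth_table(rules, ["cdk7", "cyclin H"], "cdk2")` enumerates the same four rows as the table above.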
Boolean networks for sensitivity/robustness analysis
CellNetAnalyzer
Klamt S et al. BMC Systems Biology. 2007
Differential equation model
The effect on a gene (expression level or its derivative) is modeled as a linear combination (weighted sum) of other expression levels.
Naïve example, starting from “direct” engineering:
$$x_3(t+1) = w_{31} x_1(t) + w_{32} x_2(t) + w_{33} x_3(t)$$
with $w_{31} > 0$ (x1 activates x3), $w_{32} < 0$ (x2 represses x3) and $w_{33} = 0$.
Differential equations
$$\frac{dx_i(t)}{dt} = R_i \, f\!\left( \sum_{j=1}^{N} w_{ij} x_j(t) + \sum_{k=1}^{K} v_{ik} u_k(t) + B_i \right) - \lambda_i x_i(t)$$
• R_i: activation constant of gene i
• f(.): activation function of gene i
• w_ij: control strength of gene j on gene i
• x_j(t): expression level of gene j
• u_k(t), v_ik: external inputs and their control strength on gene i
• B_i: basal activation level of gene i
• λ_i x_i(t): degradation law of gene i
[Figure: linear f(.) and sigmoidal f(.), each plotted for inputs from -10 to 10]
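A minimal sketch of how such a model can be simulated, assuming a sigmoidal activation function and simple forward-Euler integration with constant external inputs. The function name `simulate`, the step size, and the two-gene parameters in the usage note are hypothetical.

```python
import numpy as np

def sigmoid(a):
    """Sigmoidal activation function f(.)."""
    return 1.0 / (1.0 + np.exp(-a))

def simulate(W, R, B, lam, x0, V=None, u=None, dt=0.01, steps=2000, f=sigmoid):
    """Integrate dx_i/dt = R_i f(sum_j w_ij x_j + sum_k v_ik u_k + B_i) - lam_i x_i."""
    x = np.array(x0, dtype=float)
    trajectory = [x.copy()]
    for _ in range(steps):
        drive = W @ x + B
        if V is not None and u is not None:
            drive = drive + V @ u          # constant external inputs
        x = x + dt * (R * f(drive) - lam * x)
        trajectory.append(x.copy())
    return np.array(trajectory)
```

For example, with two genes where gene 1 represses gene 2 (`W = [[0, 0], [-4, 0]]`, `R = B-free`, unit degradation), the trajectory settles to a static equilibrium, in line with the stability constraint stated earlier: the bounded sigmoid together with the degradation term keeps expression levels from growing indefinitely.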
Parameter identification
• For each gene: m equations (m=time points) in n (+c) unknown parameters.
• m time arrays: the expression data form a matrix with one row per gene and one column per time point
$$X = \begin{pmatrix} x_{11} & \cdots & x_{1m} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{nm} \end{pmatrix}$$
Equation system for gene 1:
$$\begin{aligned}
\frac{dx_1(t_1)}{dt} &= R_1 \, f\big( w_{11} x_1(t_1) + \dots + w_{1n} x_n(t_1) \big) - \lambda_1 x_1(t_1) \\
\frac{dx_1(t_2)}{dt} &= R_1 \, f\big( w_{11} x_1(t_2) + \dots + w_{1n} x_n(t_2) \big) - \lambda_1 x_1(t_2) \\
&\;\;\vdots \\
\frac{dx_1(t_m)}{dt} &= R_1 \, f\big( w_{11} x_1(t_m) + \dots + w_{1n} x_n(t_m) \big) - \lambda_1 x_1(t_m)
\end{aligned}$$
• Strategies: re-sampling + LS, SVD, simulated annealing, genetic algorithms, gradient descent
Bayesian networks
• A Bayesian network provides not just the network, but the probability of this network being true according to:
  – the given data
  – some a priori knowledge
• Components:
  – Acyclic graph (main disadvantage)
    Nodes = random variables to study
    Connections = relations of conditional dependence among variables
  – Conditional probability distribution
Assumptions and basic definitions
$$p(G_1,G_2,G_3,G_4,G_5) = p(G_1)\,p(G_2 \mid G_1)\,p(G_3 \mid G_1)\,p(G_4 \mid G_2)\,p(G_5 \mid G_1,G_2,G_3)$$
$$p(x_1,x_2,\dots,x_n) = \prod_{i=1}^{n} p(x_i \mid pa(x_i))$$
Conditional independence
Serial connection (X → Y → Z):
$$p(x,y,z) = p(x)\,p(y \mid x)\,p(z \mid y), \qquad p(z \mid x) = \sum_y p(y \mid x)\,p(z \mid y)$$
Diverging connection (X ← Y → Z):
$$p(x,y,z) = p(y)\,p(x \mid y)\,p(z \mid y), \qquad p(z \mid x) = \frac{1}{p(x)} \sum_y p(y)\,p(x \mid y)\,p(z \mid y)$$
Converging connection (X → Y ← Z):
$$p(x,y,z) = p(x)\,p(z)\,p(y \mid x,z), \qquad p(z \mid x) = p(z) \quad \text{(independent of x)}$$
Problem to be solved
$$\max_{G} S(G \mid D)$$
D = {d^1,…,d^m}, where m is the number of samples, d^l = {d^l_1,…,d^l_n} for l ∈ {1,…,m}, and 1,…,n index the genes in the network.
S(G|D) is a score function that measures how likely it is that the data come from each of the networks proposed for the given nodes (genes).
G ranges over the set of all available networks for a given set of nodes (genes).
Score function
• Bayesian Score
$$S(G \mid D) = \frac{P(D \mid G)\,P(G)}{P(D)}$$
  P(G) is the a priori probability of the network; P(D) is constant across the networks.
$$\max S(G \mid D) = \prod_{l=1}^{m} \prod_{i=1}^{n} P\big(d_i^l \mid pa(d_i^l)\big)$$
  Probability function: discrete (multinomial) or continuous (Gaussian)
  Finding the parameters of the proposed probability function that maximize the score function.
• Minimum length
Different network representation
Different levels of networks
Network measures
• Degree k_i
• Degree distribution P(k)
• Mean path length
• Network diameter
• Clustering coefficient
Network measures
• Connectivity (degree)
• Clustering coefficient
• Topological overlap
Graphs
• A graph G=(V,E) is a set of vertices V and edges E, where |V|=n and |E|=m are the numbers of vertices and edges
• A subgraph G' of G is induced by some V' ⊆ V and E' ⊆ E
• Graph properties:
  Connectivity (node degree, paths)
  Cyclic vs. acyclic
  Directed vs. undirected
• A graph is sparse if m ~ n and dense if m ~ n^2
• An undirected graph without self-loops is complete if m = n(n-1)/2
Small-world network
• Every node can be reached from every other by a small number of hops or steps
• High clustering coefficient and low mean shortest path length (random graphs don't necessarily have high clustering coefficients)
• Social networks, the Internet, and biological networks all exhibit small-world network characteristics
• Six degrees of separation
  Milgram experiment: random people from Nebraska were to send a letter (via intermediaries) to a stock broker in Boston. They could only send it to someone with whom they were on a first-name basis
  Kevin Bacon Game: connect any actor to Kevin Bacon by linking actors who have acted in the same movie
  Erdős number: number of links required to connect scholars to Erdős via co-authorship of papers
Kevin Bacon Game
[Figure: Kevin Bacon - Mystic River (2003) - Tim Robbins - Code 46 (2003) - Om Puri - Yuva (2004) - Rani Mukherjee - Black (2005) - Amitabh Bachchan]
Path length and network diameter
A path is a sequence {x1, x2,…, xn} such that (x1,x2), (x2,x3), …, (xn-1,xn) are edges of the graph. A closed path (xn=x1) on a graph is called a graph cycle or circuit.
• Path length: shortest path between nodes
• Network diameter: longest shortest path
Complex network models
• Scale-free network
• Modular networks
• Hierarchical networks
Network topology
Metabolic networks
• We have seen that cellular functionality can be partitioned into a collection of modules.
• Each module is a discrete entity of several elementary components which perform an identifiable task → Modular network (Ci>)
• But it was demonstrated that the degree distribution follows a power law → Scale-free network (power law)
• Modular network ≠ scale-free network → Hierarchical network
Ravasz et al. Science (2002)
Essential genes tend to be hubs
Scale-free networks are robust
• Complex systems (cell, Internet, social networks) are resilient to component failure
• Network topology plays an important role in this robustness (even if ~80% of nodes fail, the remaining ~20% still maintain network connectivity)
• Attack vulnerability if hubs are selectively targeted
• In yeast, only ~20% of proteins are lethal when deleted, and these are 5 times more likely to have degree k>15 than k<5.
Interesting features
• Cellular networks are disassortative: hubs tend not to interact directly with other hubs.
• Hubs tend to be “older” proteins (so far claimed for protein-protein interaction networks only)
• Hubs also seem to be under more evolutionary pressure: their protein sequences are more conserved than average between species (shown in yeast vs. worm)
• Experimentally determined protein complexes tend to contain solely essential or solely non-essential proteins, which is further evidence for modularity.
What is missing?
• Dynamics
  Pathways/networks are represented as static processes
  Difficult to represent a calcium wave or a feedback loop
  More detailed mathematical representations exist that handle these, e.g. stoichiometric modeling and kinetic modeling (VirtualCell, E-Cell)
  Need to accumulate or estimate comprehensive kinetic information
• Details (atomic structures)
• Context (cell type, developmental stage)
Network motifs
Negative and positive autoregulation
[Figure: panels a, b, c]
• Negative autoregulation (NAR) speeds up the response time of gene circuits
• NAR can reduce cell-cell variation in protein levels
• Positive autoregulation (PAR) works in the opposite way
Feedforward loop
Detecting network motifs
There are two main tasks in detecting network motifs:
1. generating an ensemble of proper random networks
2. counting the subgraphs in the real network and in random networks.
MicroRNA-TF network motifs
Martinez NJ et al. Genes Dev. 2008. 22:2535–2549
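The two tasks of motif detection listed above can be sketched for one motif, the feedforward loop. This is an illustration under assumptions: the random ensemble is generated by degree-preserving edge swaps (one common choice of "proper" random network), the graph is a simple directed graph without self-loops, and the function names, number of swaps, and ensemble size are hypothetical.

```python
import random
from statistics import mean, pstdev

def count_ffl(edges):
    """Count feedforward loops X->Y, Y->Z, X->Z in a directed graph."""
    succ = {}
    for a, b in edges:
        succ.setdefault(a, set()).add(b)
    count = 0
    for x, targets in succ.items():
        for y in targets:
            for z in succ.get(y, ()):
                if z != x and z in targets:
                    count += 1
    return count

def randomize(edges, n_swaps, rng):
    """Degree-preserving rewiring: (a->b, c->d) becomes (a->d, c->b)."""
    edges = list(edges)
    present = set(edges)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        new1, new2 = (a, d), (c, b)
        # Skip swaps that would create self-loops or duplicate edges.
        if a == d or c == b or new1 in present or new2 in present:
            continue
        present -= {(a, b), (c, d)}
        present |= {new1, new2}
        edges[i], edges[j] = new1, new2
    return edges

def motif_z_score(edges, n_random=100, seed=0):
    """Compare the real motif count with the counts in randomized networks."""
    rng = random.Random(seed)
    real = count_ffl(edges)
    null = [count_ffl(randomize(edges, 10 * len(edges), rng)) for _ in range(n_random)]
    mu, sd = mean(null), pstdev(null)
    z = (real - mu) / sd if sd > 0 else float("inf")
    return real, mu, z
```

A subgraph is reported as a motif when its count in the real network is significantly higher than in the randomized ensemble; the z-score returned here is one simple way to express that comparison.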