Pathways and Networks

Biological meaning of the gene sets

• Gene ontology terms
• Pathway mapping
• Linking to PubMed abstracts or associated MeSH terms
• Regulation by the same transcription factor (module)
• Protein families and domains
• Gene set enrichment analysis
• Over‐representation analysis

Gene set enrichment analysis

1. Given an a priori defined set of genes S.

2. Rank genes (e.g. by t‐value between 2 groups of microarray samples) → ranked gene list L.

3. Calculation of an enrichment score (ES) that reflects the degree to which a set S is overrepresented at the extremes (top or bottom) of the entire ranked list L.

4. Estimation of the statistical significance (nominal P value) of the ES by using an empirical phenotype‐based permutation test procedure.

5. Adjustment for multiple hypothesis testing by controlling the false discovery rate (FDR).
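Step 3 can be illustrated with a minimal Python sketch of the weighted running‐sum statistic described by Subramanian et al. (2005). The function name `enrichment_score`, the exponent parameter `p`, and the toy gene names are illustrative assumptions; the permutation test (step 4) and the FDR adjustment (step 5) are not covered here.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Running-sum enrichment score of gene_set along a ranked list.

    ranked_genes: genes ordered from top to bottom of the ranked list L
    scores:       ranking metric per gene (e.g. t-values), same order
    gene_set:     the a priori defined set S
    p:            weight exponent (hypothetical default of 1.0)
    """
    hits = [g in gene_set for g in ranked_genes]
    n_hits = sum(hits)
    if n_hits == 0 or n_hits == len(ranked_genes):
        raise ValueError("gene set must contain some, but not all, ranked genes")
    norm = sum(abs(s) ** p for s, h in zip(scores, hits) if h)
    miss_step = 1.0 / (len(ranked_genes) - n_hits)

    running, best = 0.0, 0.0
    for s, h in zip(scores, hits):
        # Step up at members of S (weighted by the score), down otherwise.
        running += abs(s) ** p / norm if h else -miss_step
        if abs(running) > abs(best):
            best = running  # keep the maximum deviation from zero
    return best


genes = ["a", "b", "c", "d", "e", "f"]
t_values = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
es_top = enrichment_score(genes, t_values, {"a", "b"})
es_bottom = enrichment_score(genes, t_values, {"e", "f"})
```

A set concentrated at the top of the list gives a positive ES, and a set concentrated at the bottom gives a negative ES.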

Gene set enrichment analysis

Subramanian A et al. Proc Natl Acad Sci (2005)

Biochemical and Metabolic Pathways

Böhringer Mannheim

Signaling networks (in cancer cell)

Hanahan, Weinberg, Cell. 2000

137 NCI curated pathway maps (http://pid.nci.nih.gov)

Signal Transduction Knowledge Environment http://stke.sciencemag.org/cm/

Pathways

• Pathways represent the available mechanistic information
• Kyoto Encyclopedia of Genes and Genomes, KEGG (http://www.genome.jp/kegg/)

BioCyc

• EcoCyc Metabolic Pathway Database (E. coli)
• Also for other organisms (e.g. HumanCyc)

Pathways

• Pathways from BioCarta
• http://www.biocarta.com/genes/index.asp

Transpath

Part of larger BioBase package (commercial)

• PathwayBuilder package for network visualization

• Highly integrated with signaling networks and transcription factor networks (TRANSFAC)

• Linked to extensive enzyme information in BRENDA (www.brenda.uni‐koeln.de)

• 28,456 molecules; 52,007 reactions; 54 hand‐drawn pathways

Reactome

• Curated pathway database
• http://www.reactome.org/

Pathway Commons

• Aim: convenient access to pathway information
• Facilitate creation and communication of pathway data
• Aggregate pathway data in the public domain
• Provide easy access for pathway analysis

• Access Pathway Commons from Cytoscape

• http://www.cytoscape.org
• Open source software for network visualization
• Active community
• >40 plugins extend functionality, e.g. BiNGO, ClueGO (for gene ontology)
• Easy to use and good documentation

[Figure: VizMapper, various layouts]

Cline MS et al Nat Protoc. 2007

Map gene expression to pathways

• GenMAPP, Cytoscape, Pathway Explorer

Concepts for network analysis

• Pairwise similarity measures

• Connection strength (adjacency functions)

• Network modules (measures of node dissimilarity)

• Connectivity and scale‐free network topology

• Network motifs

Similarity measures

Correlation coefficient

Partial correlation

Mutual information

Discrete data

Continuous data

Gene co‐expression networks

Expression profiles

Selected genes:

• abs(log2 ratio) > 1 in at least 5 conditions

• association with metabolic pathways identified by GO analysis

Gene co‐expression networks

[Figure: 67 conditions, 548 genes]

P Charoentong

Subnetworks for insulin signaling

P Charoentong

Gene association network

MICO
• Discretizing expression profiles → groups of genes with identical profile
• REVEAL algorithm based on mutual information: M(x,y) = I(x) + I(y) − I(x,y); if M(x,y) = I(x) → directed
• Correlation

[Figure: example genes Pparg and Apmap]

Bogner‐Strauss et al. Cell Mol Life Sci. 2010
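The mutual information formula used by REVEAL can be sketched for discretized profiles as follows. This is a minimal illustration that assumes I(.) on the slide denotes the Shannon entropy of a discretized profile; the function names and the toy profiles are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete states."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """M(x,y) = I(x) + I(y) - I(x,y), with I(.) taken as entropy."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def is_determined_by(x, y, tol=1e-9):
    """True if M(x,y) = I(x), i.e. the states of y fully determine x."""
    return abs(mutual_information(x, y) - entropy(x)) < tol


x = [0, 0, 1, 1]          # discretized profile of gene x (toy data)
y = [0, 1, 2, 3]          # gene y takes a distinct state in every condition
independent = [0, 1, 0, 1]
```

Here `is_determined_by(x, y)` holds but `is_determined_by(y, x)` does not, which is the asymmetry that lets an edge be drawn as directed.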

Adjacency function

unweighted

weighted
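The slide names the two adjacency functions without giving formulas. A common choice in co‐expression analysis, assumed here rather than taken from the slide, is a hard threshold on the absolute correlation for the unweighted case and a power of the absolute correlation for the weighted case. The parameter names `tau` and `beta` and their default values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def unweighted_adjacency(similarity, tau=0.8):
    """Hard threshold: an edge exists if |similarity| >= tau."""
    return 1 if abs(similarity) >= tau else 0


def weighted_adjacency(similarity, beta=6):
    """Soft threshold: connection strength |similarity| ** beta."""
    return abs(similarity) ** beta


r = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

The weighted form keeps weak correlations as weak connections instead of discarding them at a cutoff.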

Topological overlap matrix (TOM)

Hierarchical clustering of

modules (subnetworks)

Finding active subnetworks

Discovering regulatory and signaling circuits in molecular interaction networks. Ideker T, Ozier O, Schwikowski B, Siegel AF. Bioinformatics 18:S233‐S240 (2002)

Aggregate z‐score for an entire subnetwork A of k genes:

$$z_A = \frac{1}{\sqrt{k}} \sum_{i \in A} z_i$$

Calibrating z against background distribution

Randomly sample gene sets of size k using a Monte Carlo approach, compute their scores z_A, and estimate the mean μ_k and standard deviation σ_k for each k.

Corrected subnet score:

$$s_A = \frac{z_A - \mu_k}{\sigma_k}$$

[Figure: gene scores z_a, z_b, z_c, z_d are combined into z_A, which is then corrected to s_A]
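The aggregation and calibration steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: per‐gene z‐scores are available as a plain list, and the function names, the number of Monte Carlo samples, and the seed are hypothetical choices.

```python
import math
import random
import statistics


def aggregate_z(z_scores):
    """Aggregate score z_A = sum(z_i) / sqrt(k) for a set of k genes."""
    return sum(z_scores) / math.sqrt(len(z_scores))


def background(all_z, k, n_samples=1000, seed=0):
    """Monte Carlo estimate of mean and standard deviation of z_A for size k."""
    rng = random.Random(seed)
    scores = [aggregate_z(rng.sample(all_z, k)) for _ in range(n_samples)]
    return statistics.mean(scores), statistics.pstdev(scores)


def corrected_score(z_scores, all_z, n_samples=1000, seed=0):
    """Corrected subnet score s_A = (z_A - mu_k) / sigma_k."""
    mu_k, sigma_k = background(all_z, len(z_scores), n_samples, seed)
    return (aggregate_z(z_scores) - mu_k) / sigma_k


all_z = [float(i) for i in range(-10, 11)]   # toy z-scores for 21 genes
subnet = [10.0, 9.0, 8.0, 7.0]               # a high-scoring candidate subnetwork
s_a = corrected_score(subnet, all_z)
```

Calibrating per size k matters because larger random gene sets do not have the same score distribution as smaller ones.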

Scoring over multiple conditions

Finding the probability that at least j of the m conditions had scores above ZA(j)

Finding the maximal scoring connected subgraph

→ Simulated annealing

[Figure: significance score across subnetworks, with several local maxima and the global maximum]

Reverse Engineering

Complex system (unknown)

[Figure: unknown system with activating (+) and inhibiting (−) interactions]

Reverse Engineering

Modeling: system + input → temporal series of data

Reverse engineering: input + temporal series of data → system

General assumption

The most effective regulation acts at the transcription level and is due to protein‐gene interaction. Since proteins are not measured, mRNA concentration is implicitly assumed to be roughly proportional to protein concentration.

[Figure: a real gene‐protein network, including an example of post‐transcriptional regulation, compared with a correct and an incorrect genomic network; elements are proteins, genes, and “gene‐corresponding protein(s)”; edges are activating (+) or inhibiting (−)]

Predictive vs inferential power

Techniques can have good predictive power but low inferential power.

Even so, we can (hopefully) identify the main genes involved.

An arrow does not necessarily mean direct regulation.

[Figure: gene network with + and − edges]

Results need to be validated by using other experiments.

General Model Assumptions

The model’s parameters are static (it is the concentration of genes and proteins that changes in time).

[Figure: x1 acts on x2 with weight w21 (−)]

The control is usually assumed to be instantaneous or synchronous with time delay t (the number of time points is limited).

[Figure: time courses of x1 and x2. Instantaneous model: x1(t) and x2(t). Synchronous model: x1(t) and x2(t+1)]

General Model Constraints

The network must be stable: it must tend to an equilibrium, i.e. to a static or a cyclic condition, some time after a perturbation.

[Figure: network with only positive (+) connections] Such a network is forbidden in the absence of other connections (mRNAs and proteins in a cell cannot increase indefinitely).

Boolean Networks

Starting from “direct” engineering:

cdk7   cyclin H   cdk2
1      1          1
1      0          0
0      1          0
0      0          0

cdk7 AND cyclin H = cdk2

[Figure: cdk7 and cyclin H as inputs to cdk2]
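The rule on this slide can be written as a Boolean function and its truth table enumerated. This is a minimal sketch; the function names `cdk2_rule` and `truth_table` are hypothetical.

```python
from itertools import product


def cdk2_rule(cdk7, cyclin_h):
    """cdk2 is active only if both cdk7 and cyclin H are active."""
    return cdk7 and cyclin_h


def truth_table(rule, n_inputs):
    """List (input_1, ..., input_n, output) for every input combination."""
    rows = []
    for inputs in product([1, 0], repeat=n_inputs):
        rows.append(inputs + (int(rule(*inputs)),))
    return rows


table = truth_table(cdk2_rule, 2)
```

Reverse engineering a Boolean network is the inverse task: given observed rows of such a table, find the rule that produces them.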

Boolean networks for sensitivity/robustness analysis

CellNetAnalyzer

Klemt A et al. BMC . 2007

Differential equation model

The effect on a gene (expression level or its derivative) is modeled as a linear combination (weighted sum) of other expression levels.

Naïve example, starting from “direct” engineering:

$$x_3(t+1) = w_{31}\,x_1(t) + w_{32}\,x_2(t) + w_{33}\,x_3(t)$$

[Figure: x1 → x3 with weight w31 > 0 (+), x2 → x3 with weight w32 < 0 (−), and a self‐connection of x3 with weight w33]
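One synchronous update of this linear model is a weighted sum per gene, as in the following minimal sketch. The weight values are invented for illustration and only respect the signs shown on the slide (w31 positive, w32 negative); the first two rows simply hold x1 and x2 constant.

```python
def linear_step(x, W):
    """Return x(t+1), where x_i(t+1) = sum_j W[i][j] * x_j(t)."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


# Hypothetical weights: row i holds the influences on gene i.
W = [
    [1.0, 0.0, 0.0],    # x1 stays constant
    [0.0, 1.0, 0.0],    # x2 stays constant
    [0.5, -0.5, 0.0],   # x3: activated by x1 (w31), repressed by x2 (w32)
]
x_next = linear_step([2.0, 1.0, 5.0], W)
```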

Differential equations

$$\frac{dx_i(t)}{dt} = R_i\, f\!\left(\sum_{j=1}^{N} w_{ij}\, x_j(t) + \sum_{k=1}^{K} v_{ik}\, u_k(t) + B_i\right) - \lambda_i\, x_i(t)$$

• R_i: activation constant of gene i
• f(.): activation function of gene i
• w_ij: control strength of gene j on gene i
• x_j(t): expression level of gene j
• u_k(t), v_ik: external inputs and their control strength on gene i
• B_i: basal activation level of gene i
• λ_i x_i(t): degradation law of gene i

[Figure: linear f(.) and sigmoidal f(.) activation functions]
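The model can be simulated numerically once its parameters are fixed. The sketch below uses forward Euler integration with a sigmoidal f(.); the step size, the number of steps, and all parameter values are hypothetical, and Euler is only one of several possible integration schemes.

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def derivatives(x, u, W, V, B, R, lam, f=sigmoid):
    """dx_i/dt = R_i * f(sum_j w_ij x_j + sum_k v_ik u_k + B_i) - lam_i * x_i."""
    dx = []
    for i in range(len(x)):
        drive = sum(W[i][j] * x[j] for j in range(len(x)))
        drive += sum(V[i][k] * u[k] for k in range(len(u)))
        dx.append(R[i] * f(drive + B[i]) - lam[i] * x[i])
    return dx


def simulate(x0, u, W, V, B, R, lam, dt=0.1, steps=200, f=sigmoid):
    """Forward Euler integration with constant external inputs u."""
    x = list(x0)
    for _ in range(steps):
        dx = derivatives(x, u, W, V, B, R, lam, f)
        x = [xi + dt * dxi for xi, dxi in zip(x, dx)]
    return x


# One unregulated gene: production R*f(0) = 0.5, degradation 0.5*x.
params = dict(u=[0.0], W=[[0.0]], V=[[0.0]], B=[0.0], R=[1.0], lam=[0.5])
x_end = simulate([0.0], **params)
```

For this toy gene the expression level approaches R·f(0)/λ = 1, the point where production and degradation balance.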

Parameter identification

• For each gene: m equations (m=time points) in n (+c) unknown parameters.

• m

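Data matrix (rows: genes, columns: time arrays):

x_11 ... x_1m
...
x_n1 ... x_nm

Equation system for gene 1:

$$\begin{aligned}
\dot{x}_1(t_1) &= R_1 f\big(w_{11} x_1(t_1) + \dots + w_{1n} x_n(t_1)\big) - \lambda_1 x_1(t_1) \\
\dot{x}_1(t_2) &= R_1 f\big(w_{11} x_1(t_2) + \dots + w_{1n} x_n(t_2)\big) - \lambda_1 x_1(t_2) \\
&\;\;\vdots \\
\dot{x}_1(t_m) &= R_1 f\big(w_{11} x_1(t_m) + \dots + w_{1n} x_n(t_m)\big) - \lambda_1 x_1(t_m)
\end{aligned}$$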

• Strategies: Re‐sampling + LS, SVD, simulated annealing, genetic algorithms, gradient descent

Bayesian networks

• A Bayesian network provides not just the network, but the probability of this network to be true according to:

– given data – some a priori knowledge

• Components:

– Acyclic graph (main disadvantage)
  Nodes = random variables to study
  Connections = relations of conditional dependence among variables

– Conditional Probability Distribution

Assumptions and basic definitions

p(G1,G2,G3,G4,G5)=p(G1)p(G2|G1)p(G3|G1)p(G4|G2)p(G5|G1,G2,G3)

$$p(x_1, x_2, \dots, x_n) = \prod_{i=1}^{n} p\big(x_i \mid \mathrm{pa}(x_i)\big)$$
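The factorization can be evaluated directly for the five‐gene example. In this minimal sketch the graph structure follows the example above, while the conditional probabilities in `p_on` are invented placeholder values and the genes are assumed to be binary (on/off).

```python
from itertools import product

# Parents of each gene, as in p(G1)p(G2|G1)p(G3|G1)p(G4|G2)p(G5|G1,G2,G3).
PARENTS = {
    "G1": (),
    "G2": ("G1",),
    "G3": ("G1",),
    "G4": ("G2",),
    "G5": ("G1", "G2", "G3"),
}


def p_on(node, parent_values):
    """Hypothetical CPT: P(node = 1 | parents) rises with each active parent."""
    return 0.2 + 0.2 * sum(parent_values)


def joint(assignment):
    """p(x1,...,xn) as the product of p(xi | pa(xi)) over all nodes."""
    prob = 1.0
    for node, parents in PARENTS.items():
        p1 = p_on(node, [assignment[p] for p in parents])
        prob *= p1 if assignment[node] else 1.0 - p1
    return prob


nodes = list(PARENTS)
total = sum(joint(dict(zip(nodes, values)))
            for values in product([0, 1], repeat=len(nodes)))
all_off = joint({n: 0 for n in nodes})
```

Summing the joint over all 32 on/off assignments gives 1, a quick check that the conditional tables define a valid distribution.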

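Conditional independence

Serial connection (X → Y → Z)

$$p(x,y,z) = p(y)\,p(x \mid y)\,p(z \mid y), \qquad p(z \mid x) = \sum_{y} p(y \mid x)\,p(z \mid y)$$

Diverging connection (X ← Y → Z)

$$p(x,y,z) = p(y)\,p(x \mid y)\,p(z \mid y), \qquad p(z \mid x) = \frac{1}{p(x)} \sum_{y} p(y)\,p(x \mid y)\,p(z \mid y)$$

Converging connection (X → Y ← Z)

$$p(x,y,z) = p(x)\,p(z)\,p(y \mid x,z), \qquad p(z \mid x) = p(z) \quad \text{(independent of } x\text{)}$$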

Problem to be solved

$$\max_{G} S(G \mid D)$$

D = {d_1, …, d_m}, where m is the number of samples, and d_l = {d_l1, …, d_ln} with l ∈ {1, …, m} for the n genes in the network.

S(G|D) is a score function that measures how likely it is that the data come from each of the networks proposed for the given nodes (genes).

G ranges over the set of all available networks for a given set of nodes (genes).

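Score function

• Bayesian score

$$S(G \mid D) = \frac{P(D \mid G)\,P(G)}{P(D)}$$

P(G): a priori probability of the network
P(D): constant across the networks

$$\max S(G \mid D) = \prod_{l=1}^{m} \prod_{i=1}^{n} P\big(d_{li} \mid \mathrm{pa}(d_{li})\big)$$

Probability model: discrete (multinomial) or continuous (Gaussian)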

Finding the parameters of the proposed probability function that maximizes the score function.

• Minimum length

Different network representation

Different levels of networks

Network measures

• Degree ki

•Degree distribution P(k)

•Mean path length

•Network Diameter

• Clustering Coefficient

Network measures

Connectivity (degree)

Clustering coefficient

Topological overlap
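The three measures can be computed from an adjacency structure as sketched below. The formulas were figures on the slide, so the ones used here are the commonly cited definitions and should be read as assumptions: C_i = 2 n_i / (k_i (k_i − 1)), with n_i the number of links among the k_i neighbours of node i, and topological overlap (l_ij + a_ij) / (min(k_i, k_j) + 1 − a_ij), with l_ij the number of shared neighbours. The toy graph is hypothetical.

```python
def degree(adj, i):
    """Connectivity k_i: number of neighbours of node i."""
    return len(adj[i])


def clustering_coefficient(adj, i):
    """Fraction of possible links among the neighbours of i that exist."""
    neighbours = list(adj[i])
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for a in range(k) for b in range(a + 1, k)
                if neighbours[b] in adj[neighbours[a]])
    return 2.0 * links / (k * (k - 1))


def topological_overlap(adj, i, j):
    """Overlap of the neighbourhoods of i and j (undirected, no self-loops)."""
    a_ij = 1 if j in adj[i] else 0
    shared = len(adj[i] & adj[j])
    return (shared + a_ij) / (min(degree(adj, i), degree(adj, j)) + 1 - a_ij)


# Toy undirected graph: a triangle a-b-c with a pendant node d attached to c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
```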

Graphs

• A graph G=(V,E) is a set of vertices V and edges E, where |V|=n and |E|=m are the numbers of vertices and edges

• A subgraph G’ of G is induced by some V’ ⊆ V and E’ ⊆ E

• Graph properties:

– Connectivity (node degree, paths)
– Cyclic vs. acyclic
– Directed vs. undirected

• A graph is sparse if m ~ n and dense if m ~ n²

• An undirected simple graph is complete if m = n(n−1)/2, i.e. m ~ n²

Small‐world network

• Every node can be reached from every other by a small number of hops or steps
• High clustering coefficient and low mean shortest path length (random graphs don’t necessarily have high clustering coefficients)
• Social networks, the Internet, and biological networks all exhibit small‐world network characteristics
• Six degrees of separation

– Milgram experiment: random people from Nebraska were to send a letter (via intermediaries) to a stock broker in Boston. They could only send it to someone with whom they were on a first‐name basis

– Kevin Bacon game: connect any actor to Kevin Bacon by linking actors who have acted in the same movie

– Erdős number: number of links required to connect scholars to Erdős via co‐authorship of papers

Kevin Bacon Game

Kevin Bacon → Mystic River (2003) → Tim Robbins → Code 46 (2003) → Om Puri → Yuva (2004) → Rani Mukherjee → Black (2005) → Amitabh Bachchan

Path length and network diameter

A path is a sequence {x1, x2,…, xn} such that (x1,x2), (x2,x3), …, (xn‐1,xn) are edges of the graph.

A closed path (xn = x1) on a graph is called a graph cycle or circuit.

• Path length: shortest path between two nodes
• Network diameter: longest shortest path
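Both quantities follow from breadth‐first search on an unweighted graph. The sketch below assumes a connected, undirected graph with at least two nodes, stored as a dictionary of neighbour sets; the function names and the toy graph are hypothetical.

```python
from collections import deque


def shortest_path_lengths(adj, source):
    """Breadth-first search: number of edges from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def diameter_and_mean_path_length(adj):
    """Longest shortest path and mean shortest path over all node pairs."""
    lengths = []
    for s in adj:
        for t, d in shortest_path_lengths(adj, s).items():
            if t != s:
                lengths.append(d)
    return max(lengths), sum(lengths) / len(lengths)


# Toy graph: a triangle a-b-c with a pendant node d attached to c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
diameter, mean_length = diameter_and_mean_path_length(adj)
```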

Complex network models

Scale‐free network

Modular networks

Hierarchical networks

Network topology

Metabolic networks

• We have seen that cellular functionality can be partitioned into a collection of modules.

• Each module is a discrete entity of several elementary components which performs an identifiable task

→ Modular network (Ci>)

• But it was demonstrated that the degree distribution follows a power law

→ Scale‐free network (power law)

• Modular network ≠ Scale‐free network

→ Hierarchical network

Ravasz et al. Science (2002)

Essential genes tend to be hubs

Scale‐free networks are robust

• Complex systems (cell, internet, social networks) are resilient to component failure

• Network topology plays an important role in this robustness (even if ~80% of nodes fail, the remaining ~20% still maintain network connectivity)

• Attack vulnerability if hubs are selectively targeted

• In yeast, only ~20% of proteins are lethal when deleted, and these are 5 times more likely to have degree k>15 than k<5.

Interesting features

• Cellular networks are disassortative: hubs tend not to interact directly with other hubs.

• Hubs tend to be “older” proteins (so far claimed for protein‐protein interaction networks only)

• Hubs also seem to be under more evolutionary pressure: their protein sequences are more conserved than average between species (shown in yeast vs. worm)

• Experimentally determined protein complexes tend to contain solely essential or solely non‐essential proteins, which is further evidence for modularity.

What is missing?

• Dynamics

– Pathways/networks are represented as static processes; it is difficult to represent a calcium wave or a feedback loop

– More detailed mathematical representations exist that handle these, e.g. stoichiometric modeling and kinetic modeling (VirtualCell, E‐cell), but comprehensive kinetic information needs to be accumulated or estimated

• Details (atomic structures)

• Context (cell type, developmental stage)

Network motifs

Negative and positive autoregulation

• Negative autoregulation (NAR) speeds up the response time of gene circuits

• NAR can reduce cell‐cell variation in protein levels

• Positive autoregulation (PAR) works in the opposite way

Feedforward loop

Detecting network motifs

There are two main tasks in detecting network motifs:

1. generating an ensemble of proper random networks

2. counting the subgraphs in the real network and in random networks.
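The two tasks can be sketched for one motif, the feedforward loop (a→b, b→c, a→c). This is a minimal illustration: the random ensemble is generated by degree‐preserving edge swaps, which is one common choice of "proper" random network and is assumed here rather than prescribed by the slide; function names, ensemble size, and the toy edge list are hypothetical.

```python
import random
import statistics


def count_ffl(edges):
    """Count feedforward loops: ordered triples with a->b, b->c and a->c."""
    edge_set = set(edges)
    successors = {}
    for a, b in edge_set:
        successors.setdefault(a, set()).add(b)
    count = 0
    for a, b in edge_set:
        if a == b:
            continue  # ignore self-loops
        for c in successors.get(b, ()):
            if c not in (a, b) and (a, c) in edge_set:
                count += 1
    return count


def randomize(edges, n_swaps, rng):
    """Swap edge targets at random, keeping every in- and out-degree."""
    edge_list = sorted(set(edges))
    edge_set = set(edge_list)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        # Skip swaps that would create a self-loop or a duplicate edge.
        if a == d or c == b or (a, d) in edge_set or (c, b) in edge_set:
            continue
        edge_set -= {(a, b), (c, d)}
        edge_set |= {(a, d), (c, b)}
        edge_list[i], edge_list[j] = (a, d), (c, b)
    return edge_list


def motif_z_score(edges, n_random=100, seed=0):
    """Compare the real motif count with counts in the random ensemble."""
    rng = random.Random(seed)
    counts = [count_ffl(randomize(edges, 10 * len(edges), rng))
              for _ in range(n_random)]
    sd = statistics.pstdev(counts)
    real = count_ffl(edges)
    return (real - statistics.mean(counts)) / sd if sd > 0 else float("nan")


edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 1), (2, 5)]
shuffled = randomize(edges, 50, random.Random(1))
```

A motif is reported when its count in the real network is clearly higher than in the random ensemble, for example through a large z‐score.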

MicroRNA‐TF network motifs

Martinez NJ et al. Genes Dev. 2008. 22:2535–2549
