Advanced Algorithms and Models for Computational Biology -- a machine learning approach

Systems Biology: Inferring gene regulatory networks using graphical models

Eric Xing Lecture 24, April 17, 2006

Gene regulatory networks

• Regulation of the expression of genes is crucial:
  • Genes control cell behavior by controlling which proteins are made by a cell, and when and where they are delivered
• Regulation occurs at many stages:
  • pre-transcriptional (chromatin structure)
  • transcription initiation
  • RNA editing (splicing) and transport
  • translation initiation
  • post-translational modification
  • RNA & protein degradation
• Understanding regulatory processes is a central problem of biological research

Inferring gene regulatory networks

• Expression network
  • has gotten the most attention so far; many algorithms
  • still algorithmically challenging
• Protein-DNA interaction network
  • actively pursued recently; some interesting methods available
  • lots of room for algorithmic advance

Inferring gene regulatory networks

• Network of cis-regulatory pathways
  • Success stories in sea urchin, fruit fly, etc., from decades of experimental research
  • Statistical modeling and automated learning just started

1: Expression networks

• Early work
  • Clustering of expression data
    • Groups together genes with similar expression patterns
    • Disadvantage: does not reveal structural relations between genes
  • Boolean networks
    • Deterministic models of the logical interactions between genes
    • Disadvantage: deterministic, static
  • Deterministic linear models
    • Disadvantage: under-constrained, capture only linear interactions
• The challenge:
  • Extract biologically meaningful information from the expression data
  • Discover genetic interactions based on statistical associations among data
• Currently dominant methodology: probabilistic networks

Probabilistic Network Approach

• Characterize stochastic (non-deterministic!) relationships between expression patterns of different genes
• Beyond pair-wise interactions => structure!
  • Many interactions are explained by intermediate factors
  • Regulation involves combined effects of several gene-products
• Flexible in terms of types of interactions (not necessarily linear or Boolean!)

Probabilistic graphical models

• Given a graph G = (V,E), where each node v ∈ V is associated with a random variable Xv
  (example: a DAG on six nodes with edges X1 → X2, X1 → X4, X2 → X3, X4 → X5, and X2, X5 → X6)

• The joint distribution on (X1, X2, …, XN) factors according to the "parent-of" relations defined by the edges E:

  p(X1, X2, X3, X4, X5, X6) = p(X1) p(X2|X1) p(X3|X2) p(X4|X1) p(X5|X4) p(X6|X2, X5)
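The factorization above can be made concrete in a short sketch (not from the lecture; all CPD numbers below are invented for illustration): each node stores a table p(node | parents), and the joint probability of a full assignment is just the product of the local terms.

```python
# Each entry maps a variable to (parent names, table of
# p(value | parent values)) for binary variables. Numbers are made up.
cpds = {
    "X1": ((), {(): {0: 0.6, 1: 0.4}}),
    "X2": (("X1",), {(0,): {0: 0.7, 1: 0.3}, (1,): {0: 0.2, 1: 0.8}}),
    "X3": (("X2",), {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.4, 1: 0.6}}),
    "X4": (("X1",), {(0,): {0: 0.5, 1: 0.5}, (1,): {0: 0.1, 1: 0.9}}),
    "X5": (("X4",), {(0,): {0: 0.8, 1: 0.2}, (1,): {0: 0.3, 1: 0.7}}),
    "X6": (("X2", "X5"), {(0, 0): {0: 0.99, 1: 0.01}, (0, 1): {0: 0.6, 1: 0.4},
                          (1, 0): {0: 0.5, 1: 0.5}, (1, 1): {0: 0.05, 1: 0.95}}),
}

def joint(assignment):
    """p(x1..x6) = product over nodes of p(xi | parents(xi))."""
    p = 1.0
    for var, (parents, table) in cpds.items():
        key = tuple(assignment[pa] for pa in parents)
        p *= table[key][assignment[var]]
    return p

print(joint({"X1": 1, "X2": 1, "X3": 0, "X4": 1, "X5": 1, "X6": 1}))
```

Note that six binary variables have 2^6 joint probabilities, but the factored form stores far fewer numbers; that compactness is the point of the graph.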

Probabilistic Inference

• Computing statistical queries regarding the network, e.g.:
  • Is node X independent of node Y given nodes Z, W?
  • What is the probability of X = true if (Y = false and Z = true)?
  • What is the joint distribution of (X, Y) if R = false?
  • What is the likelihood of some full assignment?
  • What is the most likely assignment of values to all or a subset of the nodes of the network?
• General-purpose algorithms exist to fully automate such computation
  • Computational cost depends on the topology of the network
  • Exact inference: the junction tree algorithm
  • Approximate inference: loopy belief propagation, variational inference, Monte Carlo sampling
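As a toy illustration of such a query (this is brute-force enumeration, not the junction tree algorithm; the chain and its CPD numbers are invented), a conditional probability like P(C = 1 | A = 0) on a chain A → B → C can be answered by summing the factored joint over all assignments. The exponential cost of this enumeration is exactly why topology-aware exact algorithms matter.

```python
from itertools import product

# invented CPDs for a binary chain A -> B -> C
p_a = {0: 0.3, 1: 0.7}
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
p_c_given_b = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.1, 1: 0.9}}

def joint(a, b, c):
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

def query(c_val, a_val):
    """P(C = c_val | A = a_val) by exhaustive enumeration over B and C."""
    num = sum(joint(a_val, b, c_val) for b in (0, 1))
    den = sum(joint(a_val, b, c) for b, c in product((0, 1), repeat=2))
    return num / den

print(query(1, 0))
```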

Two types of GMs

• Directed edges give causality relationships (Bayesian Network or Directed Graphical Model)

• Undirected edges simply give correlations between variables (Markov Random Field or Undirected Graphical Model)

Bayesian Network

• Structure: DAG

• Meaning: a node is conditionally independent of every other node in the network outside its Markov blanket (its parents, its children, and its children's co-parents)

• Local conditional distributions (CPDs) and the DAG completely determine the joint distribution

Local Structures & Independencies

• Common parent (A ← B → C)
  • Fixing B decouples A and C: "given the level of gene B, the levels of A and C are independent"
• Cascade (A → B → C)
  • Knowing B decouples A and C: "given the level of gene B, the level of gene A provides no extra prediction value for the level of gene C"
• V-structure (A → C ← B)
  • Knowing C couples A and B, because A can "explain away" B w.r.t. C: "if A correlates to C, then the chance for B to also correlate to C will decrease"

• The language is compact, the concepts are rich!
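The common-parent case above can be checked numerically (a small sketch, not from the lecture; CPD numbers are invented): with A ← B → C, A and C are correlated marginally but become independent once B is fixed.

```python
# p(A, B, C) = p(B) p(A | B) p(C | B) for binary variables; numbers invented
p_b = {0: 0.5, 1: 0.5}
p_a_given_b = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
p_c_given_b = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}

def p_abc(a, b, c):
    return p_b[b] * p_a_given_b[b][a] * p_c_given_b[b][c]

# Marginally: p(A=1, C=1) != p(A=1) p(C=1), so A and C are dependent
p_ac = sum(p_abc(1, b, 1) for b in (0, 1))
p_a = sum(p_abc(1, b, c) for b in (0, 1) for c in (0, 1))
p_c = sum(p_abc(a, b, 1) for a in (0, 1) for b in (0, 1))
print(p_ac, p_a * p_c)    # unequal: marginal dependence

# Given B = 1: p(A=1, C=1 | B=1) equals p(A=1 | B=1) p(C=1 | B=1)
p_ac_b = p_abc(1, 1, 1) / p_b[1]
print(p_ac_b, p_a_given_b[1][1] * p_c_given_b[1][1])    # equal: conditional independence
```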

Why Bayesian Networks?

• Sound statistical foundation and intuitive probabilistic semantics

• Compact and flexible representation of the (in)dependency structure of multivariate distributions and interactions

• Natural for modeling global processes with local interactions => good for biology

• Natural for statistical confidence analysis of results and answering of queries

• Stochastic in nature: models stochastic processes & deals with ("sums out") noise in measurements

• General-purpose learning and inference

• Captures causal relationships

Possible Biological Interpretation

Measured expression level of each gene → random variables (nodes)
Gene interactions → probabilistic dependencies (edges)

• Common cause: A ← B → C

• Intermediate gene: A → B → C

• Common/combinatorial effects: A → C ← B

More directed probabilistic networks

• Dynamic Bayesian Networks (Ong et al.)
  • A "static" network over genes 1…N unrolled over a temporal series

• Module networks (Segal et al.)
  • Partition variables into modules that share the same parents and the same CPD

• Probabilistic Relational Models (Segal et al.)
  • Data fusion: integrating related data from multiple sources

Bayesian Network – CPDs

Local probabilities: CPD - conditional probability distribution P(Xi | Pai)

• Discrete variables: multinomial distribution (can represent any kind of statistical dependency)

Example (alarm network: Earthquake and Burglary are parents of Alarm; Earthquake → Radio; Alarm → Call):

   E   B   | P(A=1)  P(A=0)
   e   b   |  0.9     0.1
   e   ¬b  |  0.2     0.8
   ¬e  b   |  0.9     0.1
   ¬e  ¬b  |  0.01    0.99

Bayesian Network – CPDs (cont.)

• Continuous variables: e.g. linear Gaussian:

  P(X | y1, …, yk) ~ N(a0 + Σ_{i=1}^{k} ai yi, σ²)

(Figure: P(X | Y) is a Gaussian whose mean depends linearly on the parent value Y.)
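A linear Gaussian CPD is easy to sketch in a few lines (illustrative parameters, not from the lecture): the log-density below is the quantity a structure-learning score would accumulate over samples.

```python
import math

def linear_gaussian_logpdf(x, ys, a0, a, sigma):
    """log N(x | a0 + sum_i a_i * y_i, sigma^2): a linear Gaussian CPD."""
    mean = a0 + sum(ai * yi for ai, yi in zip(a, ys))
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mean)**2 / (2 * sigma**2)

# Parent values y = (1.0, 2.0) with coefficients a = (0.5, -0.25), a0 = 1.0
# give conditional mean 1.0 + 0.5 - 0.5 = 1.0, so x = 1.0 sits at the mode.
lp = linear_gaussian_logpdf(1.0, (1.0, 2.0), a0=1.0, a=(0.5, -0.25), sigma=1.0)
print(lp)
```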

Learning Bayesian Network

• The goal:
  • Given a set of independent samples (assignments of the random variables), find the best (the most likely?) Bayesian network: both the DAG and the CPDs

  e.g., from samples such as
    (B,E,A,C,R) = (T,F,F,T,F)
    (B,E,A,C,R) = (T,F,T,T,F)
    ……
    (B,E,A,C,R) = (F,T,T,T,F)
  recover the structure (E, B → A; A → C; E → R) and CPDs such as P(A | E,B)

Learning Bayesian Network

• Learning the best CPDs given the DAG is easy
  • collect statistics of the values of each node given specific assignments to its parents
• Learning the graph topology (structure) is NP-hard
  • heuristic search must be applied; it generally leads to a locally optimal network
• Overfitting
  • It turns out that richer structures give higher likelihood P(D|G) to the data (adding an edge is always preferable):
    P(C | A) ≤ P(C | A, B): more parameters to fit => more freedom => there always exists a more "optimal" CPD(C)
• We prefer simpler (more explanatory) networks
  • Practical scores regularize the likelihood improvement of complex networks
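Both points above fit in a tiny sketch (toy data and a BIC-style penalty of my choosing, not the lecture's exact score): maximum-likelihood estimation of a multinomial CPD is just counting, the likelihood never decreases when a parent is added, and the penalized score can still prefer the smaller parent set.

```python
import math
from collections import Counter

# invented samples of binary (A, B, C); we score CPDs for the C column
data = [(0,0,0), (0,1,0), (1,0,1), (1,1,1), (1,0,1), (0,1,0), (1,1,0), (0,0,0)]

def loglik_c_given(parent_idxs):
    """Max log-likelihood of the C column given the chosen parent set (by counting)."""
    joint, parents = Counter(), Counter()
    for row in data:
        key = tuple(row[i] for i in parent_idxs)
        joint[key + (row[2],)] += 1
        parents[key] += 1
    return sum(n * math.log(joint[k] / parents[k[:-1]]) for k, n in joint.items())

def bic(parent_idxs):
    """Likelihood minus a complexity penalty: one free parameter per parent config."""
    n_params = 2 ** len(parent_idxs)
    return loglik_c_given(parent_idxs) - 0.5 * n_params * math.log(len(data))

# adding parent B raises the raw likelihood but also the penalty
print(loglik_c_given([0]), loglik_c_given([0, 1]))
print(bic([0]), bic([0, 1]))
```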

BN Learning Algorithms

• Structural EM (Friedman 1998)
  • The original algorithm

• Sparse Candidate Algorithm (Friedman et al.)
  • Discretize array signals
  • Hill-climbing search using local operators: add/delete/swap of a single edge
  • Feature extraction: Markov relations, order relations
  • Re-assemble high-confidence sub-networks from the features

• Module network learning (Segal et al.)
  • Heuristic search of structure in a "module graph"
  • Module assignment
  • Parameter sharing
  • Prior knowledge: possible regulators (TF genes)

Confidence Estimates

Bootstrap approach:

• Resample the data D with replacement into data sets D1, …, Dm
• Learn a network Gi from each resampled data set
• Estimate the "confidence level" of a feature f:

  C(f) = (1/m) Σ_{i=1}^{m} 1{f ∈ Gi}
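The bootstrap recipe above can be sketched directly (toy stand-ins throughout: `learn_network` here is a placeholder that thresholds pairwise correlation, not an actual BN structure learner, and the data are invented):

```python
import random
from itertools import combinations

def correlation(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / ((vx * vy) ** 0.5 + 1e-12)

def learn_network(samples):
    """Placeholder 'learner': put an edge i-j when |correlation| > 0.5."""
    cols = list(zip(*samples))
    return {(i, j) for i, j in combinations(range(len(cols)), 2)
            if abs(correlation(cols[i], cols[j])) > 0.5}

def edge_confidence(data, m=200, seed=0):
    rng = random.Random(seed)
    counts = {}
    for _ in range(m):
        boot = [rng.choice(data) for _ in data]   # resample with replacement
        for edge in learn_network(boot):
            counts[edge] = counts.get(edge, 0) + 1
    return {e: c / m for e, c in counts.items()}  # C(f) = (1/m) sum_i 1{f in Gi}

# toy data: genes 0 and 1 move together, gene 2 is unrelated
data = [(0, 0, 0), (1, 1, 1), (0, 0, 1), (1, 1, 0), (1, 1, 1),
        (0, 0, 0), (1, 1, 0), (0, 0, 1), (0, 0, 0), (1, 1, 1)]
conf = edge_confidence(data)
print(conf.get((0, 1), 0.0))   # high confidence for the real edge
```

Features that survive most resamples (confidence near 1) are the ones worth reporting; spurious edges come and go across resamples.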

Results from SCA + feature extraction (Friedman et al.)

(Figure: the initially learned network of ~800 genes, and the "mating response" substructure, containing genes such as SST2, KAR4, TEC1, NDJ1, KSS1, FUS1, PRM1, AGA1, AGA2, TOM6, FIG1, FUS3, MFA1, STE6, YLR343W, YLR334C, YEL059W.)

A Module Network

Nature Genetics 34, 166 - 176 (2003)

Gaussian Graphical Models

• Why? Sometimes an UNDIRECTED association graph makes more sense and/or is more informative
  • gene expressions may be influenced by unobserved factors that are post-transcriptionally regulated
  • (Figure: three directed structures over A, B, C that all couple A and C through B.) The unavailability of the state of B results in a constraint over A and C

Covariance Selection

• Multivariate Gaussian over all continuous expressions:

  p(x1, …, xn) = (2π)^{-n/2} |Σ|^{-1/2} exp{ -(1/2) (x - µ)^T Σ^{-1} (x - µ) }

• The precision matrix K = Σ^{-1} reveals the topology of the (undirected) network:

  E(xi | x_{-i}) = -Σ_{j≠i} (Kij / Kii) xj

• Edge ~ |Kij| > 0

• Learning algorithm: covariance selection
  • Want a sparse matrix
  • Regression for each node with a degree constraint (Dobra et al.)
  • Regression for each node with a hierarchical Bayesian prior (Li et al.)
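A small numeric sketch of the "edge ~ nonzero precision entry" point (the 3-gene precision matrix below is invented): zeros in K are missing edges, even though the covariance Σ = K⁻¹ itself is dense, which is why thresholding marginal correlations would overconnect the graph.

```python
def inverse(m):
    """Gauss-Jordan inverse of a small matrix (partial pivoting only)."""
    n = len(m)
    a = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        p = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[p] = a[p], a[col]
        piv = a[col][col]
        a[col] = [v / piv for v in a[col]]
        for r in range(n):
            if r != col:
                a[r] = [v - a[r][col] * w for v, w in zip(a[r], a[col])]
    return [row[n:] for row in a]

# invented sparse precision matrix: only gene pairs 0-1 and 1-2 interact
K = [[2.0, -0.8, 0.0],
     [-0.8, 2.0, -0.8],
     [0.0, -0.8, 2.0]]     # K[0][2] == 0: no edge between genes 0 and 2

Sigma = inverse(K)
print(Sigma[0][2])          # nonzero: genes 0 and 2 ARE marginally correlated

edges = [(i, j) for i in range(3) for j in range(i + 1, 3) if abs(K[i][j]) > 1e-9]
print(edges)                # only the direct interactions
```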

Gene modules identified using GGM (Li, Yang and Xing)

(Figure: a comparison of the networks learned by GGM and BN.)

2: Protein-DNA Interaction Network

• Expression networks are not necessarily causal
  • BNs are identifiable only up to Markov equivalence: e.g., A ← B → C and A → B → C can give the same optimal score, and are not further distinguishable under a likelihood score unless further perturbation experiments are performed
  • GGMs yield functional modules, but no causal semantics

• TF-motif interactions provide direct evidence of causal, regulatory dependencies among genes
  • stronger evidence than expression correlations
  • indicate the presence of binding sites on the target gene -- more easily verifiable
  • disadvantage: often very noisy, only applies to cell cultures, restricted to known TFs ...

ChIP-chip analysis

Simon I et al (Young RA) Cell 2001(106):697-708 Ren B et al (Young RA) Science 2000 (290):2306-2309

Advantages:
- Identifies "all" the sites where a TF binds "in vivo" under the experimental condition.

Limitations:
- Expense: only 1 TF per experiment.
- Feasibility: need an antibody for the TF.
- Prior knowledge: need to know which TF to test.

Transcription network

[Horak, et al, Genes & Development, 16:3017-3033]

Often reduced to a clustering problem:

• A transcriptional module can be defined as a set of genes that are:

1. co-bound (bound by the same set of TFs, up to limits of experimental noise), and

2. co-expressed (have the same expression pattern, up to limits of experimental noise).

• Genes in the same module are likely to be co-regulated, and hence likely to have a common biological function.

Inferring transcription network

• Clustering + post-processing
  • SAMBA (Tanay et al.)
  • GRAM (Genetic RegulAtory Modules) (Bar-Joseph et al.)

• Bayesian network
  • Joint clustering model for mRNA signal, motif, and binding score (Segal et al.)
  • Using binding data to define an informative structural prior (Bernard & Hartemink) (the only reported work that directly learns the network)

The SAMBA model

(Figure: the GE (gene-condition) expression matrix viewed as a bipartite graph G = (U,V,E), with genes on one side, conditions on the other, and an edge where a gene responds in a condition.)

Goal: finding high-similarity submatrices of the GE matrix = finding dense subgraphs of G = (U,V,E).

The SAMBA approach

• Normalization: translate the gene/experiment matrix to a weighted bipartite graph using a statistical model for the data
  • The likelihood ratio of a bicluster random subgraph model to a background random graph model translates to a sum of weights over edges and non-edges

• Bicluster model: heavy subgraphs

• Combined hashing and local optimization (Tanay et al.): incrementally finds heavy-weight biclusters that can partially overlap

• Other algorithms are also applicable, e.g., spectral graph cuts, SDP: but they mostly find disjoint clusters
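The weighting scheme above can be sketched as follows (all probabilities and the toy matrix are invented; SAMBA's actual model is richer): once each cell carries a log-likelihood-ratio weight, positive for an edge and negative for a non-edge, a candidate bicluster's score is just the sum of weights inside its submatrix, so "heavy subgraph" = "likely bicluster".

```python
import math

def cell_weight(has_edge, p_bicluster=0.9, p_background=0.3):
    """log P(cell | bicluster model) - log P(cell | background model)."""
    if has_edge:
        return math.log(p_bicluster / p_background)     # positive weight
    return math.log((1 - p_bicluster) / (1 - p_background))  # negative weight

# toy 4 genes x 4 conditions response matrix (1 = gene responds)
M = [[1, 1, 1, 0],
     [1, 1, 1, 0],
     [1, 0, 1, 0],
     [0, 0, 0, 1]]

def bicluster_score(genes, conds):
    return sum(cell_weight(M[g][c]) for g in genes for c in conds)

# the dense 3x3 corner scores well; padding it out with sparse rows hurts
print(bicluster_score([0, 1, 2], [0, 1, 2]))
print(bicluster_score([0, 1, 2, 3], [0, 1, 2, 3]))
```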

Data pre-processing

From experiments to properties:

(Figure: raw measurements are discretized into graded gene properties, e.g. "strong/medium complex binding to protein P" from interaction data; "strong/medium induction" and "medium/strong repression" of gene g across probes p1…p4 from expression data; "high/medium-confidence interaction with TF T" from binding data, combining binding strength and sensitivity.)

A SAMBA module

(Figure: a SAMBA module shown as a genes-by-properties submatrix, including the genes CPA1 and CPA2, together with their GO annotations.)

Post-clustering analysis

• Using topological properties to identify genes connecting several processes
  • Specialized groups of genes (transport, signaling) are more likely to bridge modules

• Inferring the transcription network
  • Genes in modules with expression properties should be driven by at least one common TF
  • There exist gaps between the information sources (condition-specific TF binding?)
  • Enriched binding motifs help identify "missing TF profiles"

(Figure: enriched motifs bridging modules, e.g. CGATGAG (AP-1), AGGGG (STRE), and AAAATTTT, with TFs Fhl1 and Rap1, connecting ribosome biogenesis, RNA processing, stress, and protein biosynthesis modules.)

The GRAM model

• Cluster genes that co-express (nearly)
• Group genes bound by the same set of TFs
• A module is a statistically significant union of the two

The GRAM algorithm

• Exhaustively search all subsets of TFs (starting with the largest sets); find the genes bound by subset F with significance p: G(F,p)

  Example binding-indicator matrix:

        Arg80  Arg81  Leu3  Gcn4
   g1     1      1     0     1
   g2     1      0     1     1
   g3     1      1     1     1
   g4     0      0     0     0
   g5     1      1     1     1
   g6     1      1     0     1

• Find an r-tight gene cluster B(c*,r) with centroid c*:

   c* = argmax_c |G(F,p) ∩ B(c,r)|

• Some local massaging to further improve cluster significance

  Example binding p-values (√ = kept in the module, x = dropped):

        Arg80   Arg81   Leu3    Gcn4
   g1   .0004   .00003  .33     .0004   √
   g2   .00002  .0006   .02     .0001   √
   g3   .0007   .002    .15     .0002   √
   g4   .007    .2      .04     .7      x
   g5   .00001  .00001  .0001   .0002   √
   g6   .00001  .00007  .5      .0001   √
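A rough sketch of the subset search (simplified: it applies a hard p-value cutoff per TF instead of GRAM's significance computation and tight-cluster step, and reuses the slide's p-value table as toy data):

```python
from itertools import combinations

TFS = ["Arg80", "Arg81", "Leu3", "Gcn4"]
PVALS = {   # gene -> binding p-value per TF (numbers from the slide)
    "g1": [.0004, .00003, .33, .0004],
    "g2": [.00002, .0006, .02, .0001],
    "g3": [.0007, .002, .15, .0002],
    "g4": [.007, .2, .04, .7],
    "g5": [.00001, .00001, .0001, .0002],
    "g6": [.00001, .00007, .5, .0001],
}

def bound_genes(tf_subset, cutoff=0.001):
    """Genes whose binding p-value beats the cutoff for every TF in the subset."""
    idx = [TFS.index(t) for t in tf_subset]
    return {g for g, ps in PVALS.items() if all(ps[i] < cutoff for i in idx)}

def search(min_genes=3):
    """Largest TF subset whose common targets form a big enough gene set."""
    for size in range(len(TFS), 0, -1):       # largest subsets first
        for subset in combinations(TFS, size):
            genes = bound_genes(subset)
            if len(genes) >= min_genes:
                return subset, genes
    return (), set()

subset, genes = search()
print(subset, sorted(genes))
```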

Regulatory Interactions revealed by GRAM

The Full Bayesian Network Approach

(Figure: a Bayesian network relating TF binding sites, a regulation indicator, experimental conditions, the p-value of the localization assay, and the expression level.)

The Full BN Approach, cont.

• Introduce random variables for each observation, and latent states/events to be modeled

• A network with hidden variables:
  • the regulation indicator and the binding sites are not observed

• Results in a very complex model that integrates motif search, gene activation/repression function, mRNA level distribution, ...

• Advantage: in principle, allows heterogeneous information to flow and cross-influence to produce a globally optimal and consistent solution

• Disadvantage: in practice, a very difficult statistical inference and learning problem. Lots of heuristic tricks are needed, and the computational solutions are usually not globally optimal.

A Bayesian Learning Approach

For a model p(A | Θ):

• Treat the parameters/model Θ as unobserved random variable(s) with a prior p(Θ)

• Probabilistic statements about Θ are conditioned on the values of the observed variables Aobs

• Bayes' rule:

  p(Θ | A) ∝ p(A | Θ) p(Θ)
  (posterior ∝ likelihood × prior)

• Bayesian estimation:

  Θ_Bayes = ∫ Θ p(Θ | A) dΘ

Informative Structure Prior (Bernard & Hartemink)

• If we observe from the location assay that TF i binds gene j, then the probability of having an edge between nodes i and j in the expression network is increased

• This leads to a prior probability of an edge being present

• This is like adding a regularizer (a compensating score) to the original likelihood score of an expression network

• The regularized score is now ...

Informative Structure Prior (Bernard & Hartemink)

• The regularized score of an expression network under an informative structure prior is:

  Score = log P(array profile | S) + log P(S)

  where the structure prior P(S) is derived from the ChIP profile

• Now we run the usual structure learning algorithm as for an expression network, but optimizing log P(array profile | S) + log P(S) instead of log P(array profile | S) alone
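The prior term log P(S) can be sketched like this (the prior strengths, p-value cutoff, and edge names are all invented for illustration, not Bernard & Hartemink's actual parameterization): each potential TF→gene edge gets a prior probability that is high when the ChIP localization p-value supports the binding, and a structure is charged log q for including an edge and log(1 − q) for omitting it.

```python
import math

def edge_prior(chip_pvalue, strong=0.8, weak=0.1):
    """Prior probability of an edge: high when ChIP supports the binding (toy cutoff)."""
    return strong if chip_pvalue < 0.001 else weak

def log_structure_prior(edges, chip_pvalues):
    """log P(S): sum of log q over present edges, log(1 - q) over absent ones."""
    lp = 0.0
    for edge, pval in chip_pvalues.items():
        q = edge_prior(pval)
        lp += math.log(q) if edge in edges else math.log(1 - q)
    return lp

# hypothetical localization p-values for two candidate edges
chip = {("TF1", "g1"): 1e-5, ("TF1", "g2"): 0.4}

s_with = {("TF1", "g1")}   # structure including the well-supported edge
s_without = set()
print(log_structure_prior(s_with, chip), log_structure_prior(s_without, chip))
```

In the full score this term is simply added to the expression likelihood, so binding evidence tilts the structure search without overriding the data.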

We've found a network…now what?

• How can we validate our results?
  • Literature
  • Curated databases (e.g., GO/MIPS/TRANSFAC)
  • Other high-throughput data sources
  • "Randomized" versions of the data
  • New experiments
    • Active learning: proposing the "most informative" new experiments

What is next?

• We are still far away from being able to correctly infer cis-regulatory networks in silico, even in a simple multi-cellular organism

• Open computational problems:
  • Causality inference in networks
  • Algorithms for identifying TF binding motifs and motif clusters directly from genomic sequences in higher organisms
  • Techniques for analyzing temporal-spatial patterns of gene expression (e.g., whole-embryo in situ hybridization patterns)
  • Probabilistic inference and learning algorithms that handle large networks

• Experimental limitations:
  • Binding location assays cannot yet be performed inexpensively in vivo in multi-cellular organisms
  • Measuring the temporal-spatial patterns of gene expression in multi-cellular organisms is not yet high-throughput
  • Perturbation experiments are still low-throughput

More "Direct" Approaches

• Principle:
  • Use "direct" structural/sequence/biochemical-level evidence of network connectivity, rather than "indirect" statistical correlations of network products, to recover the network topology

• Algorithmic approach:
  • Supervised and de novo motif search
  • Protein function prediction
  • Protein-protein interactions
  => networks of cis-regulatory pathways
  • This strategy has been a hot trend, but faces many algorithmic challenges

• Experimental approach:
  • In principle feasible via systematic mutational perturbations
  • In practice very tedious
  • The ultimate ground truth

A thirty-year network for a single gene

The Endo16 model (Eric Davidson, with tens of students/postdocs over the years)

Acknowledgments

Sources of some of the slides:

Mark Gerstein, Yale
Hebrew U
TAU
Irene Ong, Wisconsin
Georg Gerber, MIT

References

Using Bayesian Networks to Analyze Expression Data. N. Friedman, M. Linial, I. Nachman, D. Pe'er. In Proc. Fourth Annual Inter. Conf. on Computational Molecular Biology (RECOMB), 2000.

Genome-wide Discovery of Transcriptional Modules from DNA Sequence and Gene Expression. E. Segal, R. Yelensky, and D. Koller. 19(Suppl 1):273-82, 2003.

A Gene-Coexpression Network for Global Discovery of Conserved Genetic Modules. Joshua M. Stuart, …, Stuart K. Kim. Science, 302(5643):249-55, 2003.

Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Segal E, Shapira M, Regev A, Pe'er D, Botstein D, Koller D, Friedman N. Nature Genetics, 34(2):166-76, 2003.

Learning Module Networks. E. Segal, N. Friedman, D. Pe'er, A. Regev, and D. Koller. In Proc. Nineteenth Conf. on Uncertainty in Artificial Intelligence (UAI), 2003.

From Promoter Sequence to Expression: A Probabilistic Framework. E. Segal, Y. Barash, I. Simon, N. Friedman, and D. Koller. In Proc. 6th Inter. Conf. on Research in Computational Molecular Biology (RECOMB), 2002.

Modelling Regulatory Pathways in E. coli from Time Series Expression Profiles. Ong, I. M., J. D. Glasner and D. Page. Bioinformatics, 18:241S-248S, 2002.

References

Sparse graphical models for exploring gene expression data. Adrian Dobra, Beatrix Jones, Chris Hans, Joseph R. Nevins and Mike West. J. Mult. Analysis, 90:196-212, 2004.

Experiments in stochastic computation for high-dimensional graphical models. Beatrix Jones, Adrian Dobra, Carlos Carvalho, Chris Hans, Chris Carter and Mike West. Statistical Science, 2005.

Inferring regulatory network using a hierarchical Bayesian graphical Gaussian model. Fan Li, Yiming Yang, Eric Xing (working paper), 2005.

Computational discovery of gene modules and regulatory networks. Z. Bar-Joseph*, G. Gerber*, T. Lee*, N. Rinaldi, J. Yoo, F. Robert, B. Gordon, E. Fraenkel, T. Jaakkola, R. Young, and D. Gifford. Nature Biotechnology, 21(11):1337-42, 2003.

Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data. A. Tanay, R. Sharan, M. Kupiec, R. Shamir. Proc. National Academy of Science USA, 101(9):2981-2986, 2004.

Informative Structure Priors: Joint Learning of Dynamic Regulatory Networks from Multiple Types of Data. Bernard, A. & Hartemink, A. In Pacific Symposium on Biocomputing 2005 (PSB05).
