Gene Regulatory Networks
Advanced Algorithms and Models for Computational Biology -- a machine learning approach
Systems Biology: Inferring gene regulatory networks using graphical models
Eric Xing, Lecture 24, April 17, 2006

Gene regulatory networks
- Regulation of gene expression is crucial: genes control cell behavior by controlling which proteins are made by a cell, and when and where they are delivered.
- Regulation occurs at many stages:
  - pre-transcriptional (chromatin structure)
  - transcription initiation
  - RNA editing (splicing) and transport
  - translation initiation
  - post-translational modification
  - RNA and protein degradation
- Understanding regulatory processes is a central problem of biological research.

Inferring gene regulatory networks
- Expression networks
  - have received the most attention so far; many algorithms exist
  - still algorithmically challenging
- Protein-DNA interaction networks
  - actively pursued recently; some interesting methods are available
  - lots of room for algorithmic advances
- Networks of cis-regulatory pathways
  - success stories in sea urchin, fruit fly, etc., from decades of experimental research
  - statistical modeling and automated learning have only just started

1: Expression networks
- Early work:
  - Clustering of expression data groups together genes with similar expression patterns. Disadvantage: does not reveal structural relations between genes.
  - Boolean networks are deterministic models of the logical interactions between genes. Disadvantage: deterministic and static.
  - Deterministic linear models. Disadvantage: under-constrained, and capture only linear interactions.
- The challenge: extract biologically meaningful information from the expression data, and discover genetic interactions based on statistical associations among the data.
- Currently dominant methodology: probabilistic networks.

Probabilistic network approach
- Characterizes stochastic (non-deterministic!) relationships between the expression patterns of different genes.
- Goes beyond pairwise interactions to structure: many interactions are explained by intermediate factors, and regulation involves the combined effects of several gene products.
- Flexible in the types of interactions it can capture (not necessarily linear or Boolean!).

Probabilistic graphical models
- Given a graph G = (V, E), each node v in V is associated with a random variable Xv and carries a local conditional distribution given its parents, e.g. p(X2 | X1), p(X6 | X2, X5).
- The joint distribution on (X1, X2, ..., XN) factors according to the "parent-of" relations defined by the edges E; for the six-node example:
  p(X1, X2, X3, X4, X5, X6) = p(X1) p(X2|X1) p(X3|X2) p(X4|X1) p(X5|X4) p(X6|X2,X5)

Probabilistic inference
- Computing statistical queries regarding the network, e.g.:
  - Is node X independent of node Y given nodes Z and W?
  - What is the probability of X = true if Y = false and Z = true?
  - What is the joint distribution of (X, Y) if R = false?
  - What is the likelihood of some full assignment?
  - What is the most likely assignment of values to all, or a subset of, the nodes of the network?
- General-purpose algorithms exist to fully automate such computation; the computational cost depends on the topology of the network.
  - Exact inference: the junction tree algorithm.
  - Approximate inference: loopy belief propagation, variational inference, Monte Carlo sampling.

Two types of graphical models
- Directed edges express causal relationships (Bayesian network, or directed graphical model).
- Undirected edges simply express correlations between variables (Markov random field, or undirected graphical model).

Bayesian networks
- Structure: a directed acyclic graph (DAG).
- Meaning: a node is conditionally independent of every other node in the network outside its Markov blanket (its parents, its children, and its children's co-parents).
- The local conditional probability distributions (CPDs) and the DAG together completely determine the joint distribution.
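The factorization above can be made concrete with a short sketch. All CPD values below are hypothetical (the lecture does not specify any numbers); the point is that the six local CPDs fully determine the joint, and that simple queries can be answered by brute-force enumeration, the simplest form of exact inference:

```python
from itertools import product

# Hypothetical CPDs for the six-node example network; every probability
# below is made up purely for illustration.
p_x1 = 0.3                               # P(X1 = 1)
p_x2 = {0: 0.2, 1: 0.8}                  # P(X2 = 1 | X1)
p_x3 = {0: 0.1, 1: 0.7}                  # P(X3 = 1 | X2)
p_x4 = {0: 0.5, 1: 0.9}                  # P(X4 = 1 | X1)
p_x5 = {0: 0.3, 1: 0.6}                  # P(X5 = 1 | X4)
p_x6 = {(0, 0): 0.05, (0, 1): 0.4,       # P(X6 = 1 | X2, X5)
        (1, 0): 0.5,  (1, 1): 0.95}

def bern(p, v):
    """P(X = v) for a binary variable with P(X = 1) = p."""
    return p if v == 1 else 1.0 - p

def joint(x1, x2, x3, x4, x5, x6):
    """p(x1..x6) = p(x1) p(x2|x1) p(x3|x2) p(x4|x1) p(x5|x4) p(x6|x2,x5)."""
    return (bern(p_x1, x1) * bern(p_x2[x1], x2) * bern(p_x3[x2], x3) *
            bern(p_x4[x1], x4) * bern(p_x5[x4], x5) * bern(p_x6[(x2, x5)], x6))

# Sanity check: the factored joint sums to 1 over all 2^6 assignments.
total = sum(joint(*xs) for xs in product([0, 1], repeat=6))

# A query answered by enumeration: P(X1 = 1 | X6 = 1).
num = sum(joint(*xs) for xs in product([0, 1], repeat=6)
          if xs[0] == 1 and xs[5] == 1)
den = sum(joint(*xs) for xs in product([0, 1], repeat=6) if xs[5] == 1)
posterior = num / den
```

Enumeration is exponential in the number of nodes; the junction tree algorithm mentioned above exploits the graph structure to perform the same computation efficiently.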
Local structures and independencies
- Common parent (A <- B -> C): fixing B decouples A and C. "Given the level of gene B, the levels of A and C are independent."
- Cascade (A -> B -> C): knowing B decouples A and C. "Given the level of gene B, the level of gene A provides no extra prediction value for the level of gene C."
- V-structure (A -> C <- B): knowing C couples A and B, because A can "explain away" B with respect to C. "If A correlates with C, then the chance for B to also correlate with C decreases."
- The language is compact, the concepts are rich!

Why Bayesian networks?
- Sound statistical foundation and intuitive probabilistic semantics.
- Compact and flexible representation of the (in)dependency structure of multivariate distributions and interactions.
- Natural for modeling global processes with local interactions => good for biology.
- Natural for statistical confidence analysis of results and for answering queries.
- Stochastic in nature: models stochastic processes and deals with ("sums out") noise in measurements.
- General-purpose learning and inference algorithms.
- Can capture causal relationships.

Possible biological interpretation
- Random variables (nodes): the measured expression level of each gene.
- Probabilistic dependencies (edges): gene interactions, e.g.
  - common cause (A <- B -> C)
  - intermediate gene (A -> B -> C)
  - common/combinatorial effects (A -> C <- B)

More directed probabilistic networks
- Dynamic Bayesian networks (Ong et al.): a temporal series of "static" networks over genes 1, 2, ..., N.
- Module networks (Segal et al.): partition the variables into modules that share the same parents and the same CPD.
- Probabilistic relational models (Segal et al.):
  - data fusion: integrating related data from multiple sources.

Bayesian networks -- CPDs
- Local probabilities: the CPD is the conditional probability distribution P(Xi | Pai).
- Discrete variables: multinomial distribution (can represent any kind of statistical dependency). Example, for the earthquake/burglary alarm network:

  E   B   | P(a | E,B)   P(¬a | E,B)
  e   b   |    0.9          0.1
  e   ¬b  |    0.2          0.8
  ¬e  b   |    0.9          0.1
  ¬e  ¬b  |    0.01         0.99

- Continuous variables: e.g. linear Gaussian CPDs:
  P(X | y1, ..., yk) ~ N(a0 + Σ_{i=1..k} ai yi, σ²)

Learning Bayesian networks
- The goal: given a set of independent samples (assignments of the random variables), e.g.
  (B,E,A,C,R) = (T,F,F,T,F), (T,F,T,T,F), ......, (F,T,T,T,F),
  find the best (the most likely?) Bayesian network: both the DAG and the CPDs.
- Learning the best CPDs given the DAG is easy: collect the statistics of the values of each node given each specific assignment of its parents.
- Learning the graph topology (structure) is NP-hard: heuristic search must be applied, which generally leads to a locally optimal network.
- Overfitting: richer structures give higher likelihood P(D | G) to the data (adding an edge is always preferable), since P(C | A) <= P(C | A, B) at the maximum-likelihood fit: more parameters to fit => more freedom => there always exists a more "optimal" CPD for C.
- We prefer simpler (more explanatory) networks: practical scores regularize the likelihood improvement of complex networks.

BN learning algorithms
- Structural EM (Friedman 1998): the original algorithm.
- Sparse candidate algorithm (Friedman et al.):
  - discretize the array signals
  - hill-climbing search using local operators: add/delete/swap of a single edge
  - feature extraction: Markov relations, order relations
  - re-assemble high-confidence sub-networks from the features
- Module network learning (Segal et al.):
  - heuristic search of structure in a "module graph"
  - module assignment
  - parameter sharing
  - prior knowledge about possible regulators (TF genes)

Confidence estimates
- Bootstrap approach: resample the data into data sets D1, ..., Dm, learn a network Gi from each, and estimate the "confidence level" of a feature f as
  C(f) = (1/m) Σ_{i=1..m} 1{f ∈ Gi}

Results from the sparse candidate algorithm + feature extraction (Friedman et al.)
- Figures: the initially learned network of ~800 genes, and the "mating response" substructure (SST2, KAR4, TEC1, NDJ1, KSS1, FUS1, PRM1, AGA1, YLR343W, AGA2, TOM6, FIG1, FUS3, YLR334C, MFA1, STE6, YEL059W).
- A module network: Nature Genetics 34, 166-176 (2003).

Gaussian graphical models
- Why? Sometimes an UNDIRECTED association graph makes more sense and/or is more informative: gene expression may be influenced by unobserved factors that are post-transcriptionally regulated, and the unavailability of the state of such a factor B results in a constraint over A and C.

Covariance selection
- Model a multivariate Gaussian over all continuous expression values:
  p(x1, ..., xn) = (2π)^(-n/2) |Σ|^(-1/2) exp{ -(1/2) (x - µ)^T Σ^(-1) (x - µ) }
- The precision matrix K = Σ^(-1) reveals the topology of the (undirected) network: for centered variables,
  E(xi | x_-i) = - Σ_{j≠i} (Kij / Kii) xj,
  and there is an edge between i and j iff |Kij| > 0.
- Learning algorithm (covariance selection): we want a sparse precision matrix.
  - regression for each node with a degree constraint (Dobra et al.)
  - regression for each node with a hierarchical Bayesian prior (Li et al.)

Results: gene modules identified using GGM (Li, Yang and Xing), and a comparison of the networks learned by BN and GGM.

2: Protein-DNA interaction networks
- Expression networks are not necessarily causal:
  - BNs are identifiable only up to Markov equivalence: e.g., a common-parent graph A <- B -> C and a cascade A -> B -> C can give the same optimal score, and are not further distinguishable under a likelihood score unless further perturbation experiments are performed.
  - GGMs yield functional modules, but carry no causal semantics.
- TF-motif interactions provide direct evidence of causal, regulatory dependencies among genes:
  - stronger evidence than expression correlations
  - indicate the presence of binding sites on the target gene, and so are more easily verifiable
  - disadvantage: often very noisy, only applies to cell cultures, and is restricted to known TFs.
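The Markov-equivalence point above can be made concrete: a common-parent graph A <- B -> C and a cascade A -> B -> C encode exactly the same independence assumption (A and C are independent given B), so any joint distribution representable by one is representable by the other, and a likelihood score cannot tell them apart. A minimal sketch with made-up parameters:

```python
from itertools import product

# Common-parent model B -> A, B -> C; all parameters are made up.
p_b = 0.4                          # P(B = 1)
p_a_given_b = {0: 0.2, 1: 0.7}    # P(A = 1 | B)
p_c_given_b = {0: 0.1, 1: 0.8}    # P(C = 1 | B)

def bern(p, v):
    return p if v == 1 else 1.0 - p

def joint_parent(a, b, c):
    """p(a,b,c) = p(b) p(a|b) p(c|b): the common-parent factorization."""
    return bern(p_b, b) * bern(p_a_given_b[b], a) * bern(p_c_given_b[b], c)

def marg(pred):
    """Total probability of the assignments satisfying pred."""
    return sum(joint_parent(a, b, c)
               for a, b, c in product([0, 1], repeat=3) if pred(a, b, c))

# Re-parameterize the SAME distribution as a cascade A -> B -> C,
# reading each factor p(a), p(b|a), p(c|b) off the joint.
p_a = marg(lambda a, b, c: a == 1)
p_b_given_a = {v: marg(lambda a, b, c, v=v: a == v and b == 1) /
                  marg(lambda a, b, c, v=v: a == v)
               for v in (0, 1)}
p_c_given_b2 = {v: marg(lambda a, b, c, v=v: b == v and c == 1) /
                   marg(lambda a, b, c, v=v: b == v)
                for v in (0, 1)}

def joint_cascade(a, b, c):
    """p(a,b,c) = p(a) p(b|a) p(c|b): the cascade factorization."""
    return bern(p_a, a) * bern(p_b_given_a[a], b) * bern(p_c_given_b2[b], c)

# The two factorizations assign identical probability to every assignment,
# hence identical likelihood to any data set: they are Markov equivalent.
max_gap = max(abs(joint_parent(a, b, c) - joint_cascade(a, b, c))
              for a, b, c in product([0, 1], repeat=3))
```

A v-structure A -> C <- B, by contrast, is not in this equivalence class, which is why perturbation experiments (or non-expression evidence such as TF binding) are needed to orient the remaining edges.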
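The covariance-selection idea can likewise be illustrated on a toy chain x1 -- x2 -- x3. The tridiagonal precision matrix below is assumed purely for illustration; the sketch shows that the covariance Σ = K^(-1) is dense even though the graph is sparse, so edges should be read off K, not Σ:

```python
def inv3(m):
    """Invert a 3x3 matrix via the adjugate (enough for this toy example)."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e*i - f*h) - b * (d*i - f*g) + c * (d*h - e*g)
    return [[(e*i - f*h) / det, (c*h - b*i) / det, (b*f - c*e) / det],
            [(f*g - d*i) / det, (a*i - c*g) / det, (c*d - a*f) / det],
            [(d*h - e*g) / det, (b*g - a*h) / det, (a*e - b*d) / det]]

# Assumed precision matrix K for a chain x1 -- x2 -- x3:
# K[0][2] == 0 encodes "x1 and x3 are conditionally independent given x2".
K = [[ 2.0, -1.0,  0.0],
     [-1.0,  2.0, -1.0],
     [ 0.0, -1.0,  2.0]]
Sigma = inv3(K)

# The covariance is dense: the end nodes x1 and x3 are marginally correlated...
marginal_cov_13 = Sigma[0][2]      # nonzero
# ...but the precision matrix is sparse: no edge x1 -- x3.
precision_13 = K[0][2]             # exactly zero

# Regression reading of K: E(x1 | x2, x3) = -(K12/K11) x2 - (K13/K11) x3.
coef_on_x2 = -K[0][1] / K[0][0]
coef_on_x3 = -K[0][2] / K[0][0]
```

Each row of K thus gives the coefficients for predicting one variable from all the others, and a zero coefficient means no edge; this is what the per-node regression approaches of Dobra et al. and Li et al. exploit when searching for a sparse K.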