Modelling Gene Expression Data using Dynamic Bayesian Networks
Kevin Murphy and Saira Mian
Computer Science Division, University of California
Life Sciences Division, Lawrence Berkeley National Laboratory
Berkeley, CA 94720
Tel (510) 642 2128, Fax (510) 642 5775
[email protected], [email protected]
Abstract

Recently, there has been much interest in reverse engineering genetic networks from time series data. In this paper, we show that most of the proposed discrete time models -- including the boolean network model [Kau93, SS96], the linear model of D'haeseleer et al. [DWFS99], and the nonlinear model of Weaver et al. [WWS99] -- are all special cases of a general class of models called Dynamic Bayesian Networks (DBNs). The advantages of DBNs include the ability to model stochasticity, to incorporate prior knowledge, and to handle hidden variables and missing data in a principled way. This paper provides a review of techniques for learning DBNs.

Keywords: Genetic networks, boolean networks, Bayesian networks, neural networks, reverse engineering, machine learning.

1 Introduction

Recently, it has become possible to experimentally measure the expression levels of many genes simultaneously, as they change over time and react to external stimuli (see e.g., [WFM+98, DLB97]). In the future, the amount of such experimental data is expected to increase dramatically. This increases the need for automated ways of discovering patterns in such data. Ultimately, we would like to automatically discover the structure of the underlying causal network that is assumed to generate the observed data.

In this paper, we consider learning stochastic, discrete time models with discrete or continuous state, and hidden variables. This generalizes the linear model of D'haeseleer et al. [DWFS99], the nonlinear model of Weaver et al. [WWS99], and the popular boolean network model [Kau93, SS96], all of which are deterministic and fully observable.

The fact that our models are stochastic is very important, since it is well known that gene expression is an inherently stochastic phenomenon [MA97]. In addition, even if the underlying system were deterministic, it might appear stochastic due to our inability to perfectly measure all the variables. Hence it is crucial that our learning algorithms be capable of handling noisy data. For example, suppose the underlying system really is a boolean network, but that we have noisy observations of some of the variables. Then the data set might contain inconsistencies, i.e., there might not be any boolean network which can model it. Rather than giving up, we should look for the most probable model given the data; this of course requires that our model have a well-defined probabilistic semantics.

The ability of our models to handle hidden variables is also important. Typically, what is measured (usually mRNA levels) is only one of the factors that we care about; other ones include cDNA levels, protein levels, etc. Often we can model the relationship between these factors, even if we cannot measure their values. This prior knowledge can be used to constrain the set of possible models we learn.

The models we use are called Bayesian (belief) Networks (BNs) [Pea88], which have become the method of choice for representing stochastic models in the UAI (Uncertainty in Artificial Intelligence) community. In Section 2, we explain what BNs are, and show how they generalize the boolean network model [Kau93, SS96], Hidden Markov Models [DEKM98], and other models widely used in the computational biology community. In Sections 3 to 7, we review various techniques for learning BNs from data, and show how REVEAL [LFS98] is a special case of such an algorithm. In Section 8, we consider BNs with continuous (as opposed to discrete) state, and discuss their relationship to the linear model of D'haeseleer et al. [DWFS99], the nonlinear model of Weaver et al. [WWS99], and techniques from the neural network literature [Bis95].

2 Bayesian Networks

BNs are a special case of a more general class called graphical models, in which nodes represent random variables and the lack of arcs represents conditional independence assumptions. Undirected graphical models, also called Markov Random Fields (MRFs; see e.g., [WMS94] for an application in biology), have a simple definition of independence: two (sets of) nodes A and B are conditionally independent given all the other nodes if they are separated in the graph. By contrast, directed graphical models (i.e., BNs) have a more complicated notion of independence, which can be determined using the Bayes-Ball algorithm (see Figure 1).
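To make this contrast concrete, here is a minimal sketch of such an independence test. It is our illustration, not code from the paper, and instead of Bayes-Ball itself it uses the equivalent ancestral-moralization reduction: restrict the DAG to the ancestors of the query nodes, marry co-parents, drop arc directions, and then apply the simple MRF separation criterion described above. The graph encoding (a dict mapping each node to its parents) and the example queries on the HMM of Figure 2(b), introduced below, are assumptions made for illustration.

```python
# A minimal d-separation check via the ancestral-moralization reduction.
# Illustrative sketch; graph encoding and node names are our assumptions.
from itertools import combinations

def ancestors(parents, nodes):
    """Return the given nodes plus all of their ancestors."""
    result, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in result:
            result.add(n)
            stack.extend(parents.get(n, []))
    return result

def d_separated(parents, A, B, given):
    """True if node sets A and B are d-separated given `given`.

    Restrict to the ancestral subgraph, moralize (marry parents of a
    common child), drop arc directions, then test plain undirected
    separation -- the MRF notion of independence described above.
    """
    keep = ancestors(parents, set(A) | set(B) | set(given))
    adj = {n: set() for n in keep}
    for child in keep:
        ps = parents.get(child, [])
        for p in ps:                      # parent-child edges
            adj[child].add(p); adj[p].add(child)
        for p, q in combinations(ps, 2):  # marry co-parents
            adj[p].add(q); adj[q].add(p)
    # Search for a path from A to B that avoids the conditioning set.
    blocked = set(given)
    seen, stack = set(), [a for a in A if a not in blocked]
    while stack:
        n = stack.pop()
        if n in B:
            return False                  # found an active path
        if n not in seen:
            seen.add(n)
            stack.extend(m for m in adj[n] if m not in blocked)
    return True

# The HMM of Figure 2(b): X1 -> X2 -> X3, with each Yt a child of Xt.
parents = {'X2': ['X1'], 'X3': ['X2'],
           'Y1': ['X1'], 'Y2': ['X2'], 'Y3': ['X3']}
print(d_separated(parents, {'X1'}, {'X3'}, {'X2'}))  # True
print(d_separated(parents, {'Y1'}, {'Y3'}, {'Y2'}))  # False
```

On this example, the test confirms that X1 and X3 are d-separated given X2 (the Markov property of the hidden chain), while Y1 and Y3 are not d-separated given Y2 alone, i.e., the observed sequence by itself is not Markov.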
Figure 1: The Bayes-Ball algorithm. Two (sets of) nodes A and B are conditionally independent (d-separated [Pea88]) given all the others if and only if there is no way for a ball to get from A to B in the graph. Hidden nodes are nodes whose values are not known, and are depicted as unshaded; observed nodes are shaded. The dotted arcs indicate the direction of flow of the ball. The ball cannot pass through hidden nodes with convergent arrows (top left), nor through observed nodes with any outgoing arrows.

Figure 2: (a) A Markov Chain over X1, X2, X3 represented as a Dynamic Bayesian Net (DBN). (b) A Hidden Markov Model (HMM) represented as a DBN, with observations Y1, Y2, Y3. Shaded nodes are observed, non-shaded nodes are hidden.

2.1 Relationship to HMMs

For our first example of a BN, consider Figure 2(a). We call this a Dynamic Bayesian Net (DBN) because it represents how the random variable X evolves over time (three time slices are shown). From the graph, we see that X_{t+1} is independent of X_{t-1} given X_t.
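To make the DBN factorization concrete, here is a minimal sketch (our own, with made-up probabilities) of the HMM of Figure 2(b): the joint distribution factors as P(X_1) prod_t P(X_{t+1} | X_t) prod_t P(Y_t | X_t), so ancestral sampling only ever draws a node given its parents. The two-state alphabet and the particular numbers are illustrative assumptions, not values from the paper.

```python
# A two-state HMM written as a DBN, as in Figure 2(b).  Illustrative numbers.
import random

random.seed(0)

prior      = [0.5, 0.5]    # P(X1)
transition = [[0.9, 0.1],  # P(X_{t+1} | X_t = 0)
              [0.2, 0.8]]  # P(X_{t+1} | X_t = 1)
emission   = [[0.8, 0.2],  # P(Y_t | X_t = 0)
              [0.3, 0.7]]  # P(Y_t | X_t = 1)

def draw(dist):
    """Sample an index from a discrete distribution."""
    r, acc = random.random(), 0.0
    for i, p in enumerate(dist):
        acc += p
        if r < acc:
            return i
    return len(dist) - 1

def sample(T):
    """Ancestral sampling: each node is drawn given its parents only."""
    xs, ys = [], []
    x = draw(prior)
    for _ in range(T):
        xs.append(x)
        ys.append(draw(emission[x]))  # Y_t depends only on X_t
        x = draw(transition[x])       # X_{t+1} depends only on X_t
    return xs, ys

print(sample(5))
```

Because X_{t+1} is sampled from a distribution indexed only by the current state X_t, the code makes the Markov property stated above explicit: nothing earlier than the current time slice is ever consulted.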