Modelling Gene Expression Data using Dynamic Bayesian Networks

Kevin Murphy and Saira Mian
Computer Science Division, University of California
Life Sciences Division, Lawrence Berkeley National Laboratory
Berkeley, CA 94720
Tel (510) 642 2128, Fax (510) 642 5775
[email protected], [email protected]

Abstract

Recently, there has been much interest in reverse engineering genetic networks from time series data. In this paper, we show that most of the proposed discrete time models, including the boolean network model [Kau93, SS96], the linear model of D'haeseleer et al. [DWFS99], and the nonlinear model of Weaver et al. [WWS99], are all special cases of a general class of models called Dynamic Bayesian Networks (DBNs). The advantages of DBNs include the ability to model stochasticity, to incorporate prior knowledge, and to handle hidden variables and missing data in a principled way. This paper provides a review of techniques for learning DBNs. Keywords: genetic networks, boolean networks, Bayesian networks, neural networks, reverse engineering.

1 Introduction

Recently, it has become possible to experimentally measure the expression levels of many genes simultaneously, as they change over time and react to external stimuli (see e.g., [WFM+98, DLB97]). In the future, the amount of such experimental data is expected to increase dramatically. This increases the need for automated ways of discovering patterns in such data. Ultimately, we would like to automatically discover the structure of the underlying causal network that is assumed to generate the observed data.

In this paper, we consider learning stochastic, discrete time models with discrete or continuous state, and hidden variables. This generalizes the linear model of D'haeseleer et al. [DWFS99], the nonlinear model of Weaver et al. [WWS99], and the popular boolean network model [Kau93, SS96], all of which are deterministic and fully observable.

The fact that our models are stochastic is very important, since it is well known that gene expression is an inherently stochastic phenomenon [MA97]. In addition, even if the underlying system were deterministic, it might appear stochastic due to our inability to perfectly measure all the variables. Hence it is crucial that our learning algorithms be capable of handling noisy data. For example, suppose the underlying system really is a boolean network, but that we have noisy observations of some of the variables. Then the data set might contain inconsistencies, i.e., there might not be any boolean network which can model it. Rather than giving up, we should look for the most probable model given the data; this of course requires that our model have a well-defined probabilistic semantics.

The ability of our models to handle hidden variables is also important. Typically, what is measured (usually mRNA levels) is only one of the factors that we care about; other ones include cDNA levels, protein levels, etc. Often we can model the relationship between these factors, even if we cannot measure their values. This prior knowledge can be used to constrain the set of possible models we learn.

The models we use are called Bayesian (belief) Networks (BNs) [Pea88], which have become the method of choice for representing stochastic models in the UAI (Uncertainty in Artificial Intelligence) community. In Section 2, we explain what BNs are, and show how they generalize the boolean network model [Kau93, SS96], Hidden Markov Models [DEKM98], and other models widely used in the computational biology community. In Sections 3 to 7, we review various techniques for learning BNs from data, and show how REVEAL [LFS98] is a special case of such an algorithm. In Section 8, we consider BNs with continuous (as opposed to discrete) state, and discuss their relationship to the linear model of D'haeseleer et al. [DWFS99], the nonlinear model of Weaver et al. [WWS99], and techniques from the neural network literature [Bis95].

2 Bayesian Networks

BNs are a special case of a more general class called graphical models, in which nodes represent random variables, and the lack of arcs represents conditional independence assumptions. Undirected graphical models, also called Markov Random Fields (MRFs; see e.g., [WMS94] for an application in biology), have a simple definition of independence: two (sets of) nodes A and B are conditionally independent given all the other nodes if they are separated in the graph. By contrast, directed graphical models (i.e., BNs) have a more complicated notion of independence, which takes into account the directionality of the arcs (see Figure 1). Graphical models with both directed and undirected arcs are called chain graphs.

Figure 1: The Bayes-Ball algorithm. Two (sets of) nodes A and B are conditionally independent (d-separated [Pea88]) given all the others if and only if there is no way for a ball to get from A to B in the graph. Hidden nodes are nodes whose values are not known, and are depicted as unshaded; observed nodes are shaded. The dotted arcs indicate direction of flow of the ball. The ball cannot pass through hidden nodes with convergent arrows (top left), nor through observed nodes with any outgoing arrows. See [Sha98] for details.

In a BN, one can intuitively regard an arc from A to B as indicating the fact that A "causes" B. (For a more formal treatment of causality in the context of BNs, see [HS95].) Since evidence can be assigned to any subset of the nodes (i.e., any subset of nodes can be observed), BNs can be used for both causal reasoning (from known causes to unknown effects) and diagnostic reasoning (from known effects to unknown causes), or any combination of the two. The inference algorithms which are needed to do this are briefly discussed in Section 5.1. Note that, if all the nodes are observed, there is no need to do inference, although we might still want to do learning.

In addition to causal and diagnostic reasoning, BNs support the powerful notion of "explaining away": if a node is observed, then its parents become dependent, since they are rival causes for explaining the child's value (see the bottom left case in Figure 1). In contrast, in an undirected graphical model, the parents would be independent, since the child separates (but does not d-separate) them.

Some other important advantages of directed graphical models over undirected ones include the fact that BNs can encode deterministic relationships, and that it is easier to learn BNs (see Section 3), since they are separable models (in the sense of [Fri98]). Hence we shall focus exclusively on BNs in this paper. For a careful study of the relationship between directed and undirected graphical models, see [Pea88, Whi90, Lau96]. (It is interesting to note that much of the theory underlying graphical models involves concepts such as chordal (triangulated) graphs [Gol80], which also arise in other areas of computational biology, such as evolutionary tree construction (perfect phylogenies) and physical mapping (interval graphs).)

2.1 Relationship to HMMs

Figure 2: (a) A Markov chain represented as a Dynamic Bayesian Net (DBN). (b) A Hidden Markov Model (HMM) represented as a DBN. Shaded nodes are observed, non-shaded nodes are hidden.

For our first example of a BN, consider Figure 2(a). We call this a Dynamic Bayesian Net (DBN) because it represents how the random variable X evolves over time (three time slices are shown). From the graph, we see that X_{t+1} is independent of X_{t-1} given X_t (since X_t blocks the only path for the Bayes ball between X_{t-1} and X_{t+1}). This, of course, is the (first-order) Markov property, which states that the future is independent of the past given the present.

Now consider Figure 2(b). X_t is as before, but is now hidden. What we observe at each time step is Y_t, which is another random variable whose distribution depends on (and only on) X_t. Hence this graph captures all and only the conditional independence assumptions that are made in a Hidden Markov Model (HMM) [Rab89].

In addition to the graph structure, a BN requires that we specify the Conditional Probability Distribution (CPD) of each node given its parents. In an HMM, we assume that the hidden state variables X_t are discrete, and have a distribution given by Pr(X_t = j | X_{t-1} = i) = T_t(i,j). (T_t is the transition matrix for time slice t.) If the observed variables Y_t are discrete, we can specify their distribution by Pr(Y_t = j | X_t = i) = O_t(i,j). (O_t is the observation matrix for time slice t.) However, in an HMM, it is also possible for the observed variables to be Gaussian, in which case we must specify the mean and covariance for each value of the hidden state variable, and each value of t: see Section 8.

As a (hopefully!) familiar example of HMMs, let us consider the way that they are used for aligning protein sequences [DEKM98]. In this case, the hidden state variable can take on three possible values, X_t in {D, I, M}, which represent delete, insert and match respectively. In protein alignment, the t subscript does not refer to time, but rather to position along a static sequence. This is an important difference from gene expression, where t really does represent time (see Section 3).

The observable variable can take on 21 possible values, which represent the 20 possible amino acids, plus the gap alignment character "-". The probability distribution over these 21 values depends on the current position t and on the current state of the system, X_t. Thus the distribution Pr(Y_t | X_t = M) is the profile for position t, Pr(Y_t | X_t = I) is the ("time"-invariant) "background" distribution, and Pr(Y_t = "-" | X_t) = 1.0 if X_t = D and 0.0 if X_t != D, for all t.
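The HMM CPDs above are easy to state concretely. The following sketch (Python with NumPy; the two-state chain, the matrix values, and the function name are invented for illustration, not taken from the paper) unrolls the two-slice DBN of Figure 2(b) and samples a trajectory, so that X_{t+1} depends only on X_t and Y_t depends only on X_t:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-state HMM. Rows index the conditioning value:
# Pr(X_t = j | X_{t-1} = i) = T[i, j], Pr(Y_t = j | X_t = i) = O[i, j].
T = np.array([[0.9, 0.1],
              [0.2, 0.8]])
O = np.array([[0.7, 0.3],
              [0.1, 0.9]])
prior = np.array([0.5, 0.5])  # distribution over the initial state X_1

def sample_hmm(n_slices):
    """Unroll the two-slice DBN for n_slices steps and sample (X, Y)."""
    xs, ys = [], []
    x = rng.choice(2, p=prior)
    for _ in range(n_slices):
        xs.append(int(x))
        ys.append(int(rng.choice(2, p=O[x])))  # Y_t depends only on X_t
        x = rng.choice(2, p=T[x])              # X_{t+1} depends only on X_t
    return xs, ys

xs, ys = sample_hmm(10)
```

Because each row of T and O is a conditional distribution, the rows must sum to one; reusing the same T and O at every t is exactly the parameter-tying assumption discussed in Section 4.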

In an HMM, "learning" means finding estimates of the parameters T_t and O_t (see Section 4.1); the structure of the model is fixed (namely, it is Figure 2(b)).

2.2 Relationship to boolean networks

Now consider the DBN in Figure 3(a). (We only show two time slices of the DBN, since the structure repeats.) Just as an HMM can be thought of as a single stochastic automaton, so this can be thought of as a network of four interacting stochastic automata: node A_t represents the state of the A automaton at time t, for example; its next state is determined by its previous state, and the previous state of B.

Figure 3: (a) A DBN, (b) represented as a boolean network.

The interconnectivity between the automata can also be represented as in Figure 3(b). The presence of directed cycles means that this is not a BN; rather, an arc from node i to node j means A(i,j) = 1, where A is the inter-slice connectivity matrix.

In addition to connections between time slices, we can also allow connections within a time slice (intra-slice connections). This can be used to model co-occurrence effects, e.g., gene B_{t-1} causes A_t to turn on if and only if C_t is also turned on. Since this relationship between A_t and C_t is not causal, it is more natural to model this with an undirected arc; the resulting model would therefore be a chain graph, which we don't consider in this paper.

The transition matrix of each automaton (e.g., Pr(A_t | A_{t-1}, B_{t-1})) is often called a CPT (Conditional Probability Table), since it represents the CPD in tabular form. If each row of the CPT only contains a single non-zero value (which must therefore be 1.0), then the automaton is in fact a deterministic automaton. If all the nodes (automata) are deterministic and have two states, the system is equivalent to a boolean network [Kau93, SS96].

In a boolean network, "learning" means finding the best structure, i.e., inter-slice connectivity matrix (see Section 6). Once we know the correct structure, it is easy to figure out what logical rule each node is using, e.g., by exhaustively enumerating them all and finding the one that fits the data.

2.3 Stochastic boolean networks

We can define a stochastic boolean network model by modifying the CPDs. For example, a popular CPD in the UAI community is the noisy-OR model [Pea88]. This is just like an OR gate, except that there is a certain probability q_i that the i'th input will be flipped from 1 to 0. For example, if C = noisyor(A, B), we can represent its CPD in tabular form as follows:

  A  B | Pr(C=0)  Pr(C=1)
  0  0 | 1.0      0.0
  1  0 | q1       1 - q1
  0  1 | q2       1 - q2
  1  1 | q1 q2    1 - q1 q2

Note that this CPT has 2^{K+1} entries, where K is the number of parents (here, K = 2), but that there are constraints on these entries. For a start, the requirement that Pr(C=0 | A,B) + Pr(C=1 | A,B) = 1.0 for all values of A and B eliminates one of the columns, resulting in 2^K entries. However, the entries in the remaining column are themselves functions of just K free parameters, q_1, ..., q_K. Hence this is called an implicit or parametric distribution, and has the advantage that less data is needed to learn the parameter values (see Section 3). By contrast, a full multinomial distribution (i.e., an unconstrained CPT) would have 2^K parameters, but could model arbitrary interactions between the parents.

The interpretation of the noisy-OR gate is that each input is an independent cause, which may be inhibited. This can be modelled by the BN shown in Figure 4. It is common to imagine that one of the parents is permanently on, so that the child can turn on "spontaneously". This "leak" node can be used as a catch-all for all other, unmodelled causes.

Figure 4: The noisy-OR gate modelled as a BN. A' is a noisy version of A, and B' is a noisy version of B. C is a deterministic OR of A' and B' (indicated by the double ring). Shaded nodes are observed, non-shaded nodes are hidden.

It is straightforward to generalize the noisy-OR gate to non-binary variables and other functions such as AND [Hen89, Sri93, HB94]. It is also possible to loosen the assumption that all the causes are independent [MH97].

Another popular compact representation for CPDs in the UAI community is a decision tree [BFGK96]. This is a stochastic generalization of the concept of canalyzing function [Kau93], popular in the boolean networks field. (A function is canalyzing if at least one of its inputs has the property that, when it takes a specific value, the output of the function is independent of all the remaining inputs.)

2.4 Comparison of DBNs and HMMs

Note that we can convert the model in Figure 3(a) into the one in Figure 2(a), by defining a new variable whose state space is the cross product of the original variables, X_t = (A_t, B_t, C_t), i.e., by "collapsing" the original model into a chain. But now the conditional independence relationships are "buried" inside the transition matrix. In particular, the entries in the transition matrix are products of a smaller number of parameters, i.e.,

  Pr(X_t | X_{t-1}) = Pr(A_t | A_{t-1}, B_{t-1}) Pr(B_t | A_{t-1}, C_{t-1}) Pr(C_t | A_{t-1}, C_{t-1})

So 0s in the transition matrix do not correspond to absence of arcs in the model. Rather, if T(i,j) = 0, it just means the automaton cannot get from state i to state j; this says nothing about the connections between the underlying state variables. Thus Markov chains and HMMs are not a good representation for sparse, discrete models of the kind we are considering here.

2.5 Higher-order models

We have assumed that our models have the first-order Markov property, i.e., that there are only connections between adjacent time slices. This assumption is without loss of generality, since it is easy to convert a higher-order Markov chain to a first-order Markov chain, simply by adding new variables (called lag variables) which contain the old state, and clamping the transition matrices of these new variables to the identity matrix. (Note that doing this for an HMM (as opposed to a DBN) would blow up the state space exponentially.)

2.6 Relationship to other probabilistic models

Although the emphasis in this section has been on models which are relevant to gene expression, we should remark that Bayesian Nets can also be used in evolutionary tree construction and genetic pedigree analysis. (In fact, the peeling (Elston-Stewart) algorithm was invented over a decade before being rediscovered in the UAI community.) Also, Bayesian networks can be augmented with utility and decision nodes, to create an influence diagram; this can be used to compute an optimal action or sequence of actions (optimal in the decision-theoretic sense of maximizing expected utility). This might be useful for designing gene-knockout experiments, although we don't pursue this issue here.

3 Learning Bayesian Networks

In this section, we discuss how to learn BNs from data. The problem can be summarized as follows.

  Structure / Observability    Method
  Known, full                  Sample statistics
  Known, partial               EM or gradient ascent
  Unknown, full                Search through model space
  Unknown, partial             Structural EM

Full observability means that the values of all variables are known; partial observability means that we do not know the values of some of the variables. This might be because they cannot, in principle, be measured (in which case they are usually called hidden variables), or because they just happen to be unmeasured in the training data (in which case they are called missing variables). Note that a variable can be observed intermittently.

Unknown structure means we do not know the complete topology of the graph. Typically we will know some parts of it, or at least know some properties the graph is likely to have, e.g., it is common to assume a bound, K, on the maximum number of parents (fan-in) that a node can take on, and we may know that nodes of a certain "type" only connect to other nodes of the same type. Such prior knowledge can either be a "hard" constraint, or a "soft" prior.

To exploit such soft prior knowledge, we must use Bayesian methods, which have the additional advantage that they return a distribution over possible models instead of a single best model. Handling priors on model structures, however, is quite complicated, so we do not discuss this issue here (see [Hec96] for a review). Instead, we assume that the goal is to find a single model which maximizes some scoring function (discussed in Section 6.1). We will, however, consider priors on parameters, which are much simpler. In Section 8.1, when we consider DBNs in which all the variables are continuous, we will see that we can use numerical priors as a proxy for structural priors.

An interesting approximation to the full-blown Bayesian approach is to learn a mixture of models, each of which has a different structure, depending on the value of a hidden, discrete, mixture node. This has been done for Gaussian networks [TMCH98] and discrete, undirected trees [MJM98]; unfortunately, there are severe computational difficulties in generalizing this technique to arbitrary, discrete networks. (In one particularly relevant example, [MJM98] applies her algorithm to the problem of classifying DNA splice sites as intron-exon or exon-intron. When she examined each tree in the mixture to find the nodes which are most commonly connected to the class variable, she found that they were the nodes in the vicinity of the splice junction, and further, that their CPTs encoded the known AG/G pattern.)

In this paper, we restrict our attention to learning the structure of the inter-slice connectivity matrix of the DBN. This has the advantage that we do not need to worry about introducing (directed) cycles, and also that we can use the "arrow of time" to disambiguate the direction of causal relations, i.e., we know an arc must go from X_{t-1} to X_t and not vice versa, because the network models the temporal evolution of a physical system. Of course, we still cannot completely overcome the problem that "correlation is not causation": it is possible that the correlation between X_{t-1} and X_t is due to one or more hidden common causes (or observed common effects, because of the explaining away phenomenon). Hence the models learned from data should be subject to experimental verification. ([HMC97] discusses Bayesian techniques for learning causal (static) networks, and [SGS93] discusses constraint based techniques, but see [Fre97] for a critique of this approach. These techniques have mostly been used in the social sciences. One major advantage of molecular biology is that it is usually feasible to verify hypotheses experimentally.)
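As a concrete illustration of the noisy-OR CPD from Section 2.3, the sketch below (plain Python; the function name and the example inhibition probabilities are invented for illustration) expands the K free parameters q_1, ..., q_K into the full table, reproducing the K = 2 table given there:

```python
from itertools import product

def noisy_or_cpt(q):
    """Full CPT for C = noisyor(parents), given inhibition probs q[i].

    Each active cause is independently inhibited with probability q[i],
    so Pr(C=0 | parents) is the product of q[i] over the parents that
    are on (an empty product, i.e. 1.0, if all parents are off).
    """
    cpt = {}
    for parents in product([0, 1], repeat=len(q)):
        p0 = 1.0
        for qi, on in zip(q, parents):
            if on:
                p0 *= qi
        cpt[parents] = (p0, 1.0 - p0)  # (Pr(C=0), Pr(C=1))
    return cpt

cpt = noisy_or_cpt([0.1, 0.2])  # hypothetical q1, q2
# cpt[(0, 0)] == (1.0, 0.0), cpt[(1, 0)] == (q1, 1 - q1),
# cpt[(0, 1)] == (q2, 1 - q2), cpt[(1, 1)] == (q1*q2, 1 - q1*q2)
```

A "leak" cause, as described in Section 2.3, can be modelled by simply appending a parent that is clamped on.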
Finally, we mention that learning BNs is a huge subject, and in this paper, we only touch on the aspects that we consider most relevant to learning genetic networks from 4.1 Parameter estimation time series data. For further details, see the review papers [Bun94, Bun96, Hec96]. For discrete nodes with CPTs, we can compute the ML estimates by simple counting. As is well known from the HMM literature, however, ML estimates of CPTs are prone 4 Known structure, full observability to sparse data problems, which can be solved by using (mixtures of) Dirichlet priors (pseudo counts). We assume that the goal of learning in this case is to ®nd Estimating the parameters of noisy-OR distributions (and the values of the parameters of each CPD which maximizes its relatives) is harder, essentially because of the presence the likelihood of the training data, which contains S se- of hidden variables (see Figure 4), so we must use the quences (assumed to be independent), each of which has

techniques discussed in Section 5. T the observed values of all n nodes per slice for each of slices. For notational simplicity, we assume each sequence is of the same length. Thus we can imagine ªunrollingº a 4.2 Choosing the form of the CPDs

two-slice DBN to produce a (static) BN with T slices. In the above discussion, we assumed that we knew the form We assume the parameter values for all nodes are constant oftheCPD foreach node. Ifwe are uncertain, one approach (tied) across time (in contrast to the assumption made in is to use a mixture distribution, although this introduces HMM proteinalignment), so that for a time series of length a hidden variable, which makes things more complicated

T , weget onedatapoint(sample)foreach CPDintheinitial (see Section 5). Alternatively, we can use a decision tree

slice, and T 1 data points for each of the other CPDs. If [BFGK96], ora tableof parent values along with theirasso- = S 1, we cannot reliably estimate the parameters of the ciated non-zero probabilities [FG96], to represent the CPD.

nodes in the ®rst slice, so we usually assume these are ®xed This can increase the number of free parameters gradually,

k

= S (T ) apriori. That leaves us with N 1 samples for from1 to 2 , where k is the number of parents.

each of the remaining CPDs. In cases where N is small compared to the number of parameters that require ®tting, we can use a numerical prior to regularize the problem (see 5 Known structure, partial observability Section 4.1). In this case, we call the estimates Maximum A Posterori (MAP) estimates, as opposed to Maximum Like- When some of the variables are not observed, the likeli- lihood (ML) estimates. hood surface becomes multimodal, and we must use it- Using the Bayes Ball algorithm (see Figure 1), it is easy to erative methods, such as EM [Lau95] or gradient ascent see that each node is conditionally independent of its non- [BKRK97], to ®nd a local maximum of the ML/MAP func- descendants given its parents (indeed, this is often taken tion. These algorithmsneed touse an inferencealgorithmto as the de®nition of a BN), and hence, by the chain rule of compute the expected suf®cient statistics (or related quan- probability,we ®nd that the jointprobabilityof all thenodes tity) for each node. in the graph is

5.1 Inference in BNs

P (X ;:::;X ) =

1 m

Y Y

P (X jX P (X j = ;:::;X ) = (X ));

i i i

1 i 1 Pa Inference in Bayesian networks is a huge subject which we i i willnotgointointhispaper. See [Jen96]foranintroduction

to one of the most commonly used algorithm (the junction

= n(T ) where m 1 is the number of nodes in the un-

4 tree algorithm). [HD94] gives a good cookbook introduc-

(X )

rolled network (excluding the ®rst slice ), Pa i are the tion to the junction tree algorithm. [SHJ97] explains how parents of node i, and nodes are numbered in topological order (i.e., parents before children). The normalized log- the forwards-backwards algorithm is a special case of the

1 junction tree algorithm, and might be a good place to start

= (D jG) likelihood of the training set L logPr , where

N if you are familiar with HMMs. [Dec98] would be a good

D = fD ;:::;D g

1 S , is a sum of terms, one for each node:

place to start if you are familiar with the peeling algorithm S

m (although the junction tree approach is much more ef®cient X

1 X

= P (X j (X );D ):

L for learning). [Mur99] discusses inference in BNs with dis-

i l log i Pa (1)

N crete and continuous variables, and ef®cient techniques for

= l= i 1 1 handling networks with many observed nodes. to verify hypotheses experimentally. 4Since we are not trying to estimate the parameters of nodes in Exact inference in densely connected (discrete) BNs is com- the initial slice, we have omitted their contribution to the overall putationally intractible, so we must use approximate meth- probability, for simplicity. ods. There are many approaches, including sampling meth- ods such as MCMC [Mac98] and variational methods such {1,2,3,4} as mean-®eld [JGJS98]. DBNs are even more computa- tionally intractible in the sense that, even if the connec- {1,2,3} {2,3,4} {1,3,4} {1,2,4} tions between two slices are sparse, correlations can arise over several time steps, thus rendering the unrolled network {1,2} {1,3} {1,4} {2,3} {2,4} {3,4} ªeffectivelyº dense. A simple approximate inference algo- rithm for the speci®c case of sparse DBNs is described in {1} {2} {3} {4} [BK98b, BK98a].

{}

; ; ; g

6 Unknown structure, full observability Figure 5: The subsets of f1 2 3 4 arranged in a lattice.

k  k  The k 'th row represents subsets of size , for0 4. We start by discussing the scoring functionwhich we use to select models; we then discuss algorithms which attempt to optimize this function over the space of models, and ®nally The ®rst term is computed using Equation 1. The second examine their computational and sample complexity. termisa penalty formodel complexity. Sincethenumberof parameters in themodel is the sumof thenumber of param- eters in each node6, we see that the BIC score decomposes 6.1 The objective function used for model selection just like the log-likelihood (Equation 1). Hence all of the It is commonly assumed that the goal is to ®nd the model techniques invented for maximizing MI also work for the with maximum likelihood. Often (e.g., as in the RE- BIC case, and indeed, any decomposable scoring function. VEAL algorithm [LFS98], and as in the Reconstructabil- ity Analysis (RA) or General Systems Theory community 6.2 Algorithms for ®nding the best model [Kli85, Kri86]) this is stated as ®nding the model in which the sum of the mutual information (MI) [CT91] between Given that the score is decomposable, we can learn

the parent set for each node independently. There are 

each node and its parents is maximal; in Appendix A, we 

P

n n prove that these objective functions are equivalent, in the n

= 2 such sets, which can be arranged in a = k 0 sense that they rank models in the same order. k lattice as shown in Figure 5. The problem is to ®nd the The trouble is, the ML model will be a , highest scoring point in this lattice. since this has the largest number of parameters, and hence can ®t the datathe best. A well-principledway to avoid this The approach taken by REVEAL [LFS98] is to start at kind of over-®tting is to put a prior on models, specifying the bottom of the lattice, and evaluate the score at all that we prefer sparse models. Then, by Bayes' rule, the points in each successive level, until a point is found with

MAP model is the one that maximizes a score of 1.0. Since the scoring function they use is

(X; (X ))= fH (X );H ( (X ))g I (X;Y )

I Pa max Pa , where is

D jG) (G)

Pr( Pr

Y H (X )

the mutual information between X and , and is the

GjD ) =

Pr( (2)

(D )

Pr entropy of X (see Appendix A for de®nitions), 1.0 is the

X ) highest possible score, and corresponds to Pa ( being a

Taking logs, we ®nd

H (X j (X )) =

perfect predictor of X , i.e., Pa 0.

GjD ) = (D jG) + (G) + c logPr( logPr logPr A normalized MI of 1 is only achievable with a deterministic

relationship. For stochastic relationships, we must decide

= (D ) G where c logPr is a constant independent of . Thus whether the gains in MI produced by a larger parent set is the effect of the prior is equivalent to penalizing overly ªworth itº. The standard approach in the RA community

2

(X;Y )  I (X;Y )N ( ) complex models, as in the Minimum Description Length uses the fact [Mil55] that ln 4 ,

(MDL) approach. where N is the number of samples. Hence we can use a 2

test to decide whether an increase in MI is statistically An exact Bayesian approach to model selection is usually signi®cant. (This also gives us some kind of con®dence unfeasible, since it involves computing the marginal likeli-

P measure on the connections that we learn.) Alternatively,

P (D ) = (D;G) hood P , which is a sum over an expo- G we can use a complexity penalty term, as in the BIC score. nential number of models (see Section 6.2). However, we can use an asymptotic approximationto the posterior called Of course, if we do not know if we have achieved the max- BIC (Bayesian Information Criterion), which is de®ned as imum possible score, we do not know when to stop search- follows: ing, and hence we must evaluate all pointsin the lattice (al- though we can obviously use branch-and-bound). For large

log N

à n, thisis computationallyinfeasible,so a common approach

Θ ) (GjD )  (D jG; G

logPr logPr G # 2

number of free parameters. In a model with hidden variables, it G where N is the number of samples, # is the dimension of might be less than this [GHM96]. 5 ΘÃ 6Hence nodes with compact representations of their CPDs will the model , and G is the ML estimate of the parameters. encur a lower penalty, which can allow connections to form which 5In the fully observable case, the dimension of a model is the might otherwise have been rejected [FG96]. is to only search up until level K (i.e., assume a bound on

the maximum number of parents of each node), which takes

K

(n ) O time. Unfortunately, in real genetic networks, it is

known that some genes can have very high fan-in (and fan- = out), so restricting the bound to K 3 would make it impossible to discover these important ªmasterº genes. The obviousway to avoid theexponential cost (and the need (a) (b) for a bound, K ) is to use heuristics to avoid examining all possible subsets. (In fact, we must use heuristics of some kind, since the problem of learning optimal structure is Figure 6: (a) A BN with a hidden variable H . (b) The sim- NP-hard [CHG95].) One approach in the RA framework, plest networkthat can capture the same distributionwithout called Extended Dependency Analysis (EDA) [Con88], is using a hidden variable (created using arc reversal and node as follows. Start by evaluating all subsets of size up to two, elimination). If H is binary and the other nodes are tri- 2 keep all the ones with signi®cant (in the sense) MI with nary, and we assume full CPTs, the ®rst network has 45 thetarget node, and take theunionof the resultingset as the independent parameters, and the second has 708. set of parents.

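As a concrete illustration (ours, not the paper's), the BIC score of a single node with a full CPT is easy to compute from counts; the helper name and input format below are hypothetical:

```python
import math
from collections import defaultdict

def bic_node_score(samples, child, parents, arity):
    """BIC score of one node with a full CPT (a minimal sketch).

    samples: list of dicts mapping variable name -> discrete value
    arity:   dict mapping variable name -> number of possible values
    """
    n_jk = defaultdict(int)   # counts of (parent configuration, child value)
    n_j = defaultdict(int)    # counts of parent configuration
    for s in samples:
        j = tuple(s[p] for p in parents)
        n_jk[(j, s[child])] += 1
        n_j[j] += 1
    # maximum-likelihood log-likelihood: sum_jk N_jk log(N_jk / N_j)
    loglik = sum(c * math.log(c / n_j[j]) for (j, _k), c in n_jk.items())
    # dimension of a full CPT: (arity-1) free parameters per parent configuration
    n_configs = 1
    for p in parents:
        n_configs *= arity[p]
    dim = (arity[child] - 1) * n_configs
    return loglik - 0.5 * dim * math.log(len(samples))
```

Comparing `bic_node_score` across candidate parent sets gives one term of the decomposable score; note how the penalty grows with the number of parent configurations, so full CPTs are penalized heavily, as footnote 6 observes.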
The disadvantage of this greedy technique is that it will fail to find a set of parents unless some subset of size two has significant MI with the target variable. However, a Monte Carlo simulation in [Con88] shows that most random relations have this property. In addition, highly interdependent sets of parents (which might fail the pairwise MI test) violate the causal independence assumption, which is necessary to justify the use of noisy-OR and similar CPDs.

An alternative technique, popular in the UAI community, is to start with an initial guess of the model structure (i.e., at a specific point in the lattice), and then perform local search, i.e., evaluate the score of neighboring points in the lattice, and move to the best such point, until we reach a local optimum. We can use multiple restarts to try to find the global optimum, and to learn an ensemble of models. Note that, in the partially observable case, we need to have an initial guess of the model structure in order to estimate the values of the hidden nodes, and hence the (expected) score of each model (see Section 5); starting with the fully disconnected model (i.e., at the bottom of the lattice) would be a bad idea, since it would lead to a poor estimate.

6.3 Sample complexity

In addition to the computational cost, another important consideration is the amount of data that is required to reliably learn structure. For deterministic boolean networks, this issue has been addressed from a statistical physics perspective [Her98] and a computational learning theory (combinatorial) perspective [AMK99]. In particular, [AMK99] prove that, if the fan-in is bounded by a constant K, the number of samples needed to identify a boolean network of n nodes is lower bounded by Ω(2^K + K log n) and upper bounded by O(2^{2K} · 2K log n). Unfortunately, the constant factor is exponential in K, which can be quite large for certain genes, as we have already remarked. These analyses all assume a "tabula rasa" approach; however, in practice, we will often have strong prior knowledge about the structure (or at least parts of it), which can reduce the data requirements considerably.

6.4 Scaling up to large networks

Since there are about n ≈ 6400 genes for yeast (S. cerevisiae), all of which can now be simultaneously measured [DLB97], it is clear that we will have to do some pre-processing to exclude the "irrelevant" genes, and just try to learn the connections between the rest. After all, many of them do not change their expression level during a given experiment. In the [DLB97] dataset, for example, which consists of 7 time steps, only 416 genes showed a significant change; the rest are completely uninformative. Of course, even n = 416 is too big for the techniques we have discussed, so we must pre-process further, perhaps by clustering genes with similar timeseries [ESBB98]. In Section 8.1, we discuss techniques for learning the structure of DBNs with continuous state, which scale much better than the methods we have discussed above for discrete models.

7 Unknown structure, partial observability

Finally, we come to the hardest case of all, where the structure is unknown and there are hidden variables and/or missing data. One difficulty is that the score does not decompose. However, as observed in [Fri97, Fri98, FMR98, TMCH98], the expected score does decompose, and so we can use an iterative method, which alternates between evaluating the expected score of a model (using an inference engine), and changing the model structure, until a local maximum is reached. This is called the Structural EM (SEM) algorithm. SEM was successfully used to learn the structure of discrete DBNs with missing data in [FMR98].

7.1 Inventing new hidden nodes

So far, structure learning has meant finding the right connectivity between pre-existing nodes. A more interesting problem is inventing hidden nodes on demand. Hidden nodes can make a model much more compact: see Figure 6. The standard approach is to keep adding hidden nodes one at a time, to some part of the network (see below), performing structure learning at each step, until the score drops. One problem is choosing the cardinality (number of possible values) for the hidden node, and its type of CPD. Another problem is choosing where to add the new hidden node. There is no point making it a child, since hidden children can always be marginalized away, so we need to find an existing node which needs a new parent, because the current set of possible parents is not adequate. [RM98] use the following heuristic for finding nodes which need new parents: they consider a noisy-OR node which is nearly always on, even if its non-leak parents are off, as an indicator that there is a missing parent. Generalizing this technique beyond noisy-ORs is an interesting open problem. One approach might be to examine H(X | Pa(X)): if this is very high, it means the current set of parents is inadequate to "explain" the residual entropy; if Pa(X) is the best (in the BIC or χ² sense) set of parents we have been able to find in the current model, it suggests we need to create a new node and add it to Pa(X).

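The residual-entropy heuristic is easy to estimate from fully observed data; a minimal sketch (the helper name and input format are our own), computing the empirical H(X | Pa(X)):

```python
import math
from collections import defaultdict

def conditional_entropy(samples, child, parents):
    """Empirical conditional entropy H(child | parents) in nats.

    samples: list of dicts mapping variable name -> discrete value.
    """
    n_jk = defaultdict(int)   # counts of (parent configuration, child value)
    n_j = defaultdict(int)    # counts of parent configuration
    for s in samples:
        j = tuple(s[p] for p in parents)
        n_jk[(j, s[child])] += 1
        n_j[j] += 1
    n = len(samples)
    # H(X | Pa) = -sum_{j,k} P(j, k) log P(k | j)
    return -sum(c / n * math.log(c / n_j[j]) for (j, _), c in n_jk.items())
```

If `conditional_entropy` stays close to the unconditional entropy for every candidate parent set we can find, the node is a candidate for receiving a new hidden parent.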
A simple heuristic for inventing hidden nodes in the case of DBNs is to check if the Markov property is being violated for any particular node. If so, it suggests that we need connections to slices further back in time. Equivalently, we can add new lag variables (see Section 2.5) and connect to them. (Note that reasoning across multiple time slices forms the basis of the ("model free") correlational analysis approach of [AR95, ASR97].)

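One way to test the first-order Markov property (our illustration, not the paper's method) is to estimate the conditional mutual information I(X_t ; X_{t−2} | X_{t−1}) from a discretized timeseries; a value well above zero suggests a longer lag is needed:

```python
import math
from collections import defaultdict

def lag2_conditional_mi(series):
    """Estimate I(X_t ; X_{t-2} | X_{t-1}) in nats for a discrete timeseries."""
    n_abc = defaultdict(int)  # counts of (x_{t-2}, x_{t-1}, x_t)
    n_ab = defaultdict(int)   # counts of (x_{t-2}, x_{t-1})
    n_bc = defaultdict(int)   # counts of (x_{t-1}, x_t)
    n_b = defaultdict(int)    # counts of x_{t-1}
    for a, b, c in zip(series, series[1:], series[2:]):
        n_abc[(a, b, c)] += 1
        n_ab[(a, b)] += 1
        n_bc[(b, c)] += 1
        n_b[b] += 1
    n = len(series) - 2
    mi = 0.0
    for (a, b, c), cnt in n_abc.items():
        # sum P(a,b,c) log [ P(a,b,c) P(b) / (P(a,b) P(b,c)) ]
        mi += cnt / n * math.log(cnt * n_b[b] / (n_ab[(a, b)] * n_bc[(b, c)]))
    return mi
```

For a genuinely first-order chain the estimate is near zero; for a series driven by lag-2 dependence it approaches H(X_t | X_{t−1}). (With short series, a significance threshold, e.g. a χ² test, would be needed in practice.)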
Another heuristic for DBNs might be to first perform a clustering of the timeseries of the observed variables (see e.g., [ESBB98]), and then to associate hidden nodes with each cluster. The result would be a Markov model with a tree-structured hidden "backbone", c.f., [JGS96]. This is one possible approach to the problem of learning hierarchically structured DBNs. (Building hierarchical (object oriented) BNs and DBNs by hand is straightforward, and there are algorithms which can exploit this modular structure to a certain extent to speed up inference [KP97, FKP98].)

Of course, interpreting the "meaning" of hidden nodes is always tricky, especially since they are often unidentifiable; e.g., we can often switch the interpretation of the true and false states (assuming for simplicity that the hidden node is binary) provided we also permute the parameters appropriately. (Symmetries such as this are one cause of the multiple maxima in the likelihood surface.) Our opinion is that fully automated structure discovery techniques can be useful as hypothesis generators, which can then be tested by experiment.

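As an illustration of the clustering step (our sketch; [ESBB98] use a different, correlation-based hierarchical method), genes whose timeseries are strongly correlated can be grouped by a simple union-find pass, with one hidden backbone node then assigned per cluster:

```python
def correlation(xs, ys):
    """Pearson correlation of two equal-length timeseries."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

def cluster_genes(series, threshold=0.9):
    """Group genes whose pairwise correlation exceeds threshold (union-find)."""
    names = list(series)
    parent = {g: g for g in names}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path compression
            g = parent[g]
        return g

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if correlation(series[a], series[b]) > threshold:
                parent[find(a)] = find(b)
    clusters = {}
    for g in names:
        clusters.setdefault(find(g), []).append(g)
    return list(clusters.values())
```

The threshold of 0.9 is an arbitrary choice for illustration; a real pipeline would pick it (or the number of clusters) by cross-validation or a significance test.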
8 DBNs with continuous state

Up until now, we have assumed, for simplicity, that all the random variables are discrete-valued. However, in reality, most of the quantities we care about in genetic networks are continuous-valued. If we coarsely quantize the continuous variables (i.e., convert them to a small number of discrete values), we lose a lot of information, but if we finely quantize them, the resulting discrete model will have too many parameters. Hence it is better to work with continuous variables directly.

To create a BN with continuous variables, we need to specify the graph structure and the Conditional Probability Distributions at each node, as before. The most common distribution is a Gaussian, since it is analytically tractable. Consider, for example, the DBN in Figure 2(a), with P(X_t = x_t | X_{t−1} = x_{t−1}) = N(x_t; μ + W x_{t−1}, Σ), where

    N(x; μ, Σ) = (1/Z) exp(−(1/2) (x − μ)′ Σ^{−1} (x − μ))        (3)

is the Normal (Gaussian) distribution evaluated at x, Z = (2π)^{n/2} |Σ|^{1/2} is the normalizing constant to ensure the density integrates to 1, and W is the weight or regression matrix. (x′ denotes the transpose of x.) Another way of writing this is X_t = W X_{t−1} + μ + ε_t, where ε_t ∼ N(0, Σ) are independent Gaussian noise terms, corresponding to unmodelled influences. This is called a first-order auto-regressive, AR(1), time series model [Ham94].

If X_t is a vector containing the expression levels of all the genes at time t, then this corresponds precisely to the model in [DWFS99]. (They do not explicitly mention Gaussian noise, but it is implicit in their decision to use least squares to fit the W term: see Section 8.1.) Despite the simplicity of this model (in particular, its linear nature), they find that it models their experimental data very well.

We can also define nonlinear models. For example, if we define X_t = g(W X_{t−1}) + ε_t, where g(x) = 1/(1 + exp(−x)) is the sigmoid function, we obtain the nonlinear model in [WWS99].

It is simple to extend both the linear and nonlinear models to allow for external inputs E_t, so that we regress X_t on X_{t−1} and E_t, as the authors point out. For example, in [DWFS99], they inject kainate, "a glutamatergic agonist which causes seizures, localized cell death, and severely disrupts the normal gene expression patterns", into rat CNS, and measure expression levels before and after. Their model is therefore X_t = W X_{t−1} + b K_t + μ, where K_t = 0 before the injection, K_t = exp(−t/2) with t the number of hours after injection, and b and μ are vectors of length n, representing the response to kainate and an offset (bias) term, respectively.

In Figure 2(a), the X_t variables are always observed. Now consider Figure 2(b), where only the Y_t are observed, and have Gaussian CPDs. If the X_t are hidden, discrete variables, we have P(Y_t = y | X_t = i) = N(y; μ_i, Σ_i); this is an HMM with Gaussian outputs. If the X_t are hidden, continuous variables, we have P(Y_t = y | X_t = x) = N(y; μ + C x, Σ). This is called a Linear Dynamical System, since both the dynamics, P(X_t | X_{t−1}), and the observations, P(Y_t | X_t), are linear. (It is also called a State Space model.)

To perform inference in an LDS, we can use the well-known Kalman filter algorithm [BSL93], which is just a special case of the junction tree algorithm. If the dynamics are non-linear, however, inference becomes much harder; the standard technique is to make a local linear approximation; this is called the Extended Kalman Filter (EKF), and is similar to techniques developed in the Recurrent Neural Networks literature [Pea95]. (One reason for the widespread success of HMMs with Gaussian outputs is that the discrete hidden variables can be used to approximate arbitrary non-linear dynamics, given enough training data.)

In the rest of this paper, we will stick to the case of fully observable (but possibly non-linear) models, for simplicity, so that inference will not be necessary. (We will, however, consider Bayesian methods of parameter learning, which can be thought of as inferring the most probable values of nodes which represent the parameters themselves.)

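To make the LDS concrete, here is a minimal scalar Kalman filter (our sketch, not the paper's code) for the model x_t = w x_{t−1} + ε with process variance q, observed through y_t = x_t + δ with observation variance r:

```python
def kalman_filter_1d(ys, w, q, r, mu0=0.0, v0=1.0):
    """Scalar Kalman filter: returns the filtered means E[x_t | y_1..y_t].

    Model: x_t = w * x_{t-1} + N(0, q),  y_t = x_t + N(0, r).
    (mu0, v0) is the Gaussian prior on the initial state.
    """
    mu, v = mu0, v0
    means = []
    for y in ys:
        # predict one step ahead through the linear dynamics
        mu_pred = w * mu
        v_pred = w * w * v + q
        # correct with the new observation, using the Kalman gain k
        k = v_pred / (v_pred + r)
        mu = mu_pred + k * (y - mu_pred)
        v = (1 - k) * v_pred
        means.append(mu)
    return means
```

With r → 0 the filter trusts the observations completely (the gain k → 1); with large r it falls back on the predictions of the dynamics. The vector case replaces these scalars with the matrix updates of [BSL93].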
8.1 Learning

We have assumed that X_t is a vector of random variables. If we were to represent each component of this vector as its own (scalar) node, the inter-slice connections would correspond to the non-zero values in the weight matrix W, and undirected connections within a slice would correspond to non-zero entries in Σ^{−1} [Lau96]. For example, if

    W = ( 1 1 1 ;
          1 0 0 ;
          0 1 1 )

and Σ is diagonal, the model is equivalent to the one in Figure 3(a). Hence, in the case of continuous state models, we can convert back and forth between representing the structure graphically and numerically. This means that structure learning reduces to parameter fitting, and we can avoid expensive search through a discrete space of models! (c.f., [Bra98].) Consequently, we will now discuss techniques for learning the parameters of continuous-state DBNs.

Consider the AR(1) model in Figure 2(a). For notational simplicity we will assume Σ = σ² I, so the normalizing constant is Z = (2π/β)^{n/2}, where β = 1/σ² is the inverse variance (the precision). Using Equation 3, we find that the likelihood of the data (ignoring terms which are independent of W) is

    P(D | W) = ∏_{l=1}^{S} ∏_{t=2}^{T} P(X_t^l | X_{t−1}^l)
             = ∏_{l} ∏_{t=2}^{T} (1/Z(β)) exp(−(β/2) ||X_t^l − W X_{t−1}^l||²)
             = (1/Z_D(β)) exp(−β E_D)        (4)

where X_t^l is the value of X at time t in sequence l, Z_D is the product of the individual normalizing constants, N = S(T−1) is the number of samples, and E_D is an error (cost) function on the data. (Having multiple sequences is essentially the same as concatenating them into a single long sequence (modulo boundary conditions), so we shall henceforth just use a single index, t.) Differentiating this with respect to W and setting to 0 yields

    W = ( Σ_{t=2}^{T} X_t X′_{t−1} ) ( Σ_{t=2}^{T} X_{t−1} X′_{t−1} )^{−1}

Let us now rewrite this in a more familiar form. Define the matrices

    X = [ X′_1 ; ... ; X′_{T−1} ],    Y = [ X′_2 ; ... ; X′_T ]

We can think of the t'th row of X as the t'th input vector to a linear neural network with no hidden layers (i.e., a perceptron), and the t'th row of Y as the corresponding output. Using this notation, we can rewrite the above expression as follows (this is called the normal equation):

    W′ = (X′X)^{−1} X′Y = X† Y

where X† is the pseudo-inverse of X (which is not square). Thus we see that the least squares solution is the same as the Maximum Likelihood (ML) solution assuming Gaussian noise.

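The normal equation is easy to verify numerically. A small pure-Python sketch (hypothetical helper names; in practice one would use a numerical library's least-squares routine) fitting W for a 2-gene AR(1) series:

```python
def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def inv2(m):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def fit_ar1_weights(series):
    """Least squares / ML estimate of W for x_t = W x_{t-1} (2-dim state).

    series: list of 2-vectors x_1 .. x_T. Returns W as a 2x2 list of lists,
    via the normal equation W' = (X'X)^{-1} X'Y.
    """
    X = series[:-1]   # rows are inputs  x_1 .. x_{T-1}
    Y = series[1:]    # rows are outputs x_2 .. x_T
    Xt = transpose(X)
    W_t = matmul(matmul(inv2(matmul(Xt, X)), Xt), Y)
    return transpose(W_t)
```

On noise-free data generated from a known W, the estimate recovers W exactly (up to floating-point error), which is a useful sanity check before applying the estimator to real expression data.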
The technique used by D'haeseleer et al. [DWFS99] is to compute the least squares value of W, and interpret |W_{i,j}| < θ, for some (unspecified) threshold θ, as indicating the absence of an arc between nodes i and j. However, as Bishop writes [Bis95, p.360], "[this technique] has little theoretical motivation, and performs poorly in practice".

More sophisticated techniques have been developed in the neural network literature, which involve examining the change in the error function due to small changes in the values of the weights. This requires computing the Hessian H of the error surface. In the technique of "optimal brain damage" [CDS90], they assume H is diagonal; in the more sophisticated technique of "optimal brain surgeon" [HS93], they remove this assumption.

Since we believe (hope!) that the true model is sparse (i.e., most entries in W are 0), we can encode this knowledge (assumption) by using a N(0, 1/α) Gaussian prior for each weight:[7]

    P(W) = ∏_{ij} P(W_{ij}) = ∏_{ij} (1/Z_1(α)) exp(−(α/2) W_{ij}²)
         = (1/Z_W(α)) exp(−(α/2) Σ_{ij} W_{ij}²)
         = (1/Z_W(α)) exp(−α E_W)        (5)

where Z_W(α) = (2π/α)^{n²/2} (since W is an n × n matrix) and E_W is an error (cost) function on the weights. Then, using Bayes' rule (Equation 2) and Equations 4 and 5, the posterior distribution on the weights is

    P(W | D) = (1/Z_S(α, β)) exp(−β E_D − α E_W)

[7] In [WWS99], they use an ad-hoc iterative pruning method to achieve a similar effect.

Since the denominator is independent of W, the Maximum Posterior (MAP) value of W is given by minimizing the cost function

    S(W) = β E_D + α E_W = β E_D + (α/2) Σ_{i,j} W_{i,j}²

The second term is called a regularizer, and encourages learning of a weight matrix with small values. (Unfortunately, this prior favours many small weights, rather than a few large ones, although this can be fixed [HP89].) This regularization technique is also called weight decay. Note that the use of a regularizer overcomes the problem that W will often be underdetermined, i.e., there will be fewer samples than parameters, Nn < n².

The values α and β are called hyperparameters, since they control the prior on the weights and the system noise, respectively. Since their values are unknown, the correct Bayesian approach is to integrate them out, but an approximation that actually works better in practice is to approximate their posterior by a Gaussian (or maybe a mixture of Gaussians), find their MAP values, and then plug them in to the above equation: see [Bis95, sec. 10.4] and [Mac95] for details.

We can associate a separate hyperparameter α_{i,j} with each weight W_{i,j}, find their MAP values, and use this as a "soft" means of finding which entries of W to keep: this is called Automatic Relevance Determination (ARD) [Mac95], and has been found to perform better than the weight deletion techniques discussed above, especially on small data sets.

9 Conclusion

In this paper, we have explained what DBNs are, and discussed how to learn them. In the future, we hope to apply these techniques (in particular, the ones discussed in Section 8.1) to real biological data, when enough becomes publicly available.[8]

[8] We carried out some provisional experiments on a subset of the data described in [WFM+98] (available at rsb.info.nih.gov/mol-physiol/PNAS/GEMtable.html), which has n = 112, T = 9 and S = 1. Applying the ARD technique to the 17 genes involved in the GAD/GABA subsystem described in [DWFS99], we found that all the α_{i,j}'s were significant, indicating a fully connected graph. [DWFS99] got a sparse graph, but used a larger (unpublished) data set with T = 28, and used an unspecified pruning threshold θ.

A The equivalence between Mutual Information and Maximum Likelihood methods

We start by introducing some standard notation from information theory [CT91].

    I(X; Y) = Σ_{x,y} P(x, y) log [ P(x, y) / (P(x) P(y)) ]
            = H(X) − H(X | Y)
            = H(X) + H(Y) − H(X, Y)

is the mutual information (MI) between X and Y,

    H(X) = −Σ_x P(x) log P(x)

is the entropy of X, and

    H(X | Y) = −Σ_{x,y} P(y) P(x | y) log P(x | y)

is the conditional entropy of X given Y. (If Pa(X_i) = ∅, we define H(X_i | Pa(X_i)) = H(X_i) and I(X_i; Pa(X_i)) = 0.) Finally, we define the MI score of a model as MIS(G; D) = Σ_i I(X_i; Pa_G(X_i); D), where the probabilities are set to their empirical values given D, and Pa_G(X_i) means the parents of node X_i in graph G.

Theorem. Let G, G′ be BNs in which every node has a full CPT, and D be a fully observable data set. Then[9]

    L(G; D) − L(G′; D) = MIS(G; D) − MIS(G′; D)

[9] If f(G) − f(G′) = g(G) − g(G′), then f(G) > f(G′) ⟺ g(G) > g(G′), so f and g rank models in the same order. However, there might be many models which receive the same score under f or g, so it is ambiguous to say arg max_G f(G) = arg max_G g(G).

Proof: To see this, note that the normalized log-likelihood (Equation 1) can be rewritten as

    L = (1/N) Σ_{i,j,k} N_{ijk} log Pr(X_i = k | Π_i = j)

where we have defined Π_i = Pa(X_i) for brevity. N_{ijk} is the number of times the event (X_i = k, Π_i = j) occurs in the whole training set. (If Π_i = ∅, we define Pr(X_i = k | Π_i = j) = Pr(X_i = k) and N_{ijk} as the number of times the event X_i = k occurs.) If we assume each node has a full CPT, and set the probabilities to their empirical estimates, then N_{ijk} = N Pr(X_i = k, Π_i = j), and so

    L = Σ_{i,j,k} Pr(X_i = k, Π_i = j) log Pr(X_i = k | Π_i = j)
      = −Σ_i H(X_i | Π_i)
      = Σ_i I(X_i; Π_i) − Σ_i H(X_i)

Since H(X_i) is independent of the structure of the graph G, we find L(G; D) − L(G′; D) = Σ_i I(X_i; Pa_G(X_i)) − Σ_i I(X_i; Pa_{G′}(X_i)).

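The theorem can be checked numerically; a small sketch (our helper names and data layout) computing both the normalized log-likelihood and the MI score of two candidate structures on the same fully observed data:

```python
import math
from collections import Counter

def loglik_node(samples, child, parents):
    """Normalized ML log-likelihood contribution of one full-CPT node."""
    n = len(samples)
    n_jk = Counter((tuple(s[p] for p in parents), s[child]) for s in samples)
    n_j = Counter(tuple(s[p] for p in parents) for s in samples)
    return sum(c / n * math.log(c / n_j[j]) for (j, _), c in n_jk.items())

def mi_node(samples, child, parents):
    """Empirical mutual information I(child; parents)."""
    n = len(samples)
    n_jk = Counter((tuple(s[p] for p in parents), s[child]) for s in samples)
    n_j = Counter(tuple(s[p] for p in parents) for s in samples)
    n_k = Counter(s[child] for s in samples)
    return sum(c / n * math.log(c * n / (n_j[j] * n_k[k]))
               for (j, k), c in n_jk.items())

def score(samples, structure, fn):
    """Sum a per-node score fn over a structure {child: parent list}."""
    return sum(fn(samples, child, parents) for child, parents in structure.items())
```

For any two structures, the difference of `score(..., loglik_node)` values matches the difference of `score(..., mi_node)` values, as the theorem asserts, so the two criteria rank candidate graphs identically.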
We note that this theorem is similar to the result in [CL68], who show that the optimal[10] tree-structured MRF is a maximal weight spanning tree (MWST) of the graph in which the weight of an arc between two nodes, i and j, is I(X_i; X_j). [MJM98] extends this result to mixtures of trees. Trees have the significant advantage that we can compute the MWST in O(n²) time, where n is the number of variables. However, it is not clear if they are a useful model for gene expression data.

[10] In the sense of minimizing the KL divergence between the true and estimated distribution.

If a node does not have a full CPT, but instead has a parametric CPD of the right form (i.e., at least capable of representing the "true" conditional probability distribution of the node), then as N → ∞, N_{ijk}/N → Pr(X_i = k, Π_i = j), and the proof goes through as before. But this leaves us with the additional problem of choosing the form of CPD: see Section 4.2.

References

[AMK99] T. Akutsu, S. Miyano, and S. Kuhara. Identification of genetic networks from a small number of gene expression patterns under the boolean network model. In Proc. of the Pacific Symp. on Biocomputing, 1999.
[AR95] A. Arkin and J. Ross. Statistical construction of chemical reaction mechanisms from measured time-series. J. Phys. Chem., 99:970, 1995.
[ASR97] A. Arkin, P. Shen, and J. Ross. A test case of correlation metric construction of a reaction pathway from measurements. Science, 277:1275-1279, 1997.
[BFGK96] C. Boutilier, N. Friedman, M. Goldszmidt, and D. Koller. Context-specific independence in Bayesian networks. In Proc. of the Conf. on Uncertainty in AI, 1996.
[Bis95] C. M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, 1995.
[BK98a] X. Boyen and D. Koller. Approximate learning of dynamic models. In Neural Info. Proc. Systems, 1998.
[BK98b] X. Boyen and D. Koller. Tractable inference for complex stochastic processes. In Proc. of the Conf. on Uncertainty in AI, 1998.
[BKRK97] J. Binder, D. Koller, S. J. Russell, and K. Kanazawa. Adaptive probabilistic networks with hidden variables. Machine Learning, 29:213-244, 1997.
[Bra98] M. Brand. An entropic estimator for structure discovery. In Neural Info. Proc. Systems, 1998.
[BSL93] Y. Bar-Shalom and X. Li. Estimation and Tracking: Principles, Techniques and Software. Artech House, 1993.
[Bun94] W. L. Buntine. Operations for learning with graphical models. J. of AI Research, pages 159-225, 1994.
[Bun96] W. Buntine. A guide to the literature on learning probabilistic networks from data. IEEE Trans. on Knowledge and Data Engineering, 8(2), 1996.
[CDS90] Y. Le Cun, J. Denker, and S. Solla. Optimal brain damage. In Neural Info. Proc. Systems, 1990.
[CHG95] D. Chickering, D. Heckerman, and D. Geiger. Learning Bayesian networks: search methods and experimental results. In AI + Stats, 1995.
[CL68] C. K. Chow and C. N. Liu. Approximating discrete probability distributions with dependence trees. IEEE Trans. on Info. Theory, 14:462-67, 1968.
[Con88] R. C. Conant. Extended dependency analysis of large systems. Intl. J. General Systems, 14:97-141, 1988.
[CT91] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley, 1991.
[Dec98] R. Dechter. Bucket elimination: a unifying framework for probabilistic inference. In M. Jordan, editor, Learning in Graphical Models. Kluwer, 1998.
[DEKM98] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, 1998.
[DLB97] J. L. DeRisi, V. R. Lyer, and P. O. Brown. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, 278:680-686, 1997.
[DWFS99] P. D'haeseleer, X. Wen, S. Fuhrman, and R. Somogyi. Linear modeling of mRNA expression levels during CNS development and injury. In Proc. of the Pacific Symp. on Biocomputing, 1999.
[ESBB98] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proc. of the National Academy of Science, USA, 1998.
[FG96] N. Friedman and M. Goldszmidt. Learning Bayesian networks with local structure. In Proc. of the Conf. on Uncertainty in AI, 1996.
[FKP98] N. Friedman, D. Koller, and A. Pfeffer. Structured representation of complex stochastic systems. In Proc. of the Conf. of the Am. Assoc. for AI, 1998.
[FMR98] N. Friedman, K. Murphy, and S. Russell. Learning the structure of dynamic probabilistic networks. In Proc. of the Conf. on Uncertainty in AI, 1998.
[Fre97] D. A. Freeman. From association to causation via regression. In V. R. McKim and S. P. Turner, editors, Causality in Crisis? Statistical methods and the search for causal knowledge in the social sciences. U. Notre Dame Press, 1997.
[Fri97] N. Friedman. Learning Bayesian networks in the presence of missing values and hidden variables. In Proc. of the Conf. on Uncertainty in AI, 1997.
[Fri98] N. Friedman. The Bayesian structural EM algorithm. In Proc. of the Conf. on Uncertainty in AI, 1998.
[GHM96] D. Geiger, D. Heckerman, and C. Meek. Asymptotic model selection for directed networks with hidden variables. In Proc. of the Conf. on Uncertainty in AI, 1996.
[Gol80] M. C. Golumbic. Algorithmic Graph Theory and Perfect Graphs. Academic Press, 1980.
[Ham94] J. Hamilton. Time Series Analysis. Wiley, 1994.
[HB94] D. Heckerman and J. S. Breese. Causal independence for probability assessment and inference using Bayesian networks. IEEE Trans. on Systems, Man and Cybernetics, 26(6):826-831, 1994.
[HD94] C. Huang and A. Darwiche. Inference in belief networks: A procedural guide. Intl. J. Approx. Reasoning, 11, 1994.
[Hec96] D. Heckerman. A tutorial on learning with Bayesian networks. Technical Report MSR-TR-95-06, Microsoft Research, March 1996.
[Hen89] M. Henrion. Some practical issues in constructing belief networks. In Proc. of the Conf. on Uncertainty in AI, 1989.
[Her98] J. Hertz. Statistical issues in the reverse engineering of genetic networks. In Proc. of the Pacific Symp. on Biocomputing, 1998.
[HMC97] D. Heckerman, C. Meek, and G. Cooper. A Bayesian approach to causal discovery. Technical Report MSR-TR-97-05, Microsoft Research, February 1997.
[HP89] S. Hanson and L. Pratt. Comparing biases for minimal network construction with back-propagation. In Neural Info. Proc. Systems, 1989.
[HS93] B. Hassibi and D. Stork. Second order derivatives for network pruning: optimal brain surgeon. In Neural Info. Proc. Systems, 1993.
[HS95] D. Heckerman and R. Shachter. Decision-theoretic foundations for causal reasoning. J. of AI Research, 3:405-430, 1995.
[Jen96] F. V. Jensen. An Introduction to Bayesian Networks. UCL Press, London, England, 1996.
[JGJS98] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. In M. Jordan, editor, Learning in Graphical Models. Kluwer, 1998.
[JGS96] M. I. Jordan, Z. Ghahramani, and L. K. Saul. Hidden Markov decision trees. In Neural Info. Proc. Systems, 1996.
[Kau93] S. Kauffman. The Origins of Order: Self-organization and selection in evolution. Oxford Univ. Press, 1993.
[Kli85] G. Klir. The Architecture of Systems Problem Solving. Plenum Press, 1985.
[KP97] D. Koller and A. Pfeffer. Object-oriented Bayesian networks. In Proc. of the Conf. on Uncertainty in AI, 1997.
[Kri86] K. Krippendorff. Information Theory: Structural Models for Qualitative Data (Quantitative Applications in the Social Sciences, No 62). Sage Publications, 1986.
[Lau95] S. L. Lauritzen. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19:191-201, 1995.
[Lau96] S. Lauritzen. Graphical Models. OUP, 1996.
[LFS98] S. Liang, S. Fuhrman, and R. Somogyi. Reveal, a general reverse engineering algorithm for inference of genetic network architectures. In Pacific Symposium on Biocomputing, volume 3, pages 18-29, 1998.
[MA97] H. H. McAdams and A. Arkin. Stochastic mechanisms in gene expression. Proc. of the National Academy of Science, USA, 94(3):814-819, 1997.
[Mac95] D. MacKay. Probable networks and plausible predictions: a review of practical Bayesian methods for supervised neural networks. Network, 1995.
[Mac98] D. MacKay. Introduction to Monte Carlo methods. In M. Jordan, editor, Learning in Graphical Models. Kluwer, 1998.
[MH97] C. Meek and D. Heckerman. Structure and parameter learning for causal independence and causal interaction models. In Proc. of the Conf. on Uncertainty in AI, pages 366-375, 1997.
[Mil55] G. Miller. Note on the bias of information estimates. In H. Quastler, editor, Information Theory in Psychology. The Free Press, 1955.
[MJM98] M. Meila, M. Jordan, and Q. Morris. Estimating dependency structure as a hidden variable. Technical Report 1648, MIT AI Lab, 1998.
[Mur99] K. P. Murphy. A variational approximation for Bayesian networks with discrete and continuous latent variables. In Proc. of the Conf. on Uncertainty in AI, 1999. Submitted.
[Pea88] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
[Pea95] B. A. Pearlmutter. Gradient calculations for dynamic recurrent neural networks: a survey. IEEE Trans. on Neural Networks, 6, 1995.
[Rab89] L. R. Rabiner. A tutorial on Hidden Markov Models and selected applications in speech recognition. Proc. of the IEEE, 77(2):257-286, 1989.
[RM98] S. Ramachandran and R. Mooney. Theory refinement of Bayesian networks with hidden variables. In Machine Learning: Proceedings of the International Conference, 1998.
[SGS93] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. Springer-Verlag, 1993. Out of print.
[Sha98] R. Shachter. Bayes-ball: The rational pastime (for determining irrelevance and requisite information in belief networks and influence diagrams). In Proc. of the Conf. on Uncertainty in AI, 1998.
[SHJ97] P. Smyth, D. Heckerman, and M. I. Jordan. Probabilistic independence networks for hidden Markov probability models. Neural Computation, 9(2):227-269, 1997.
[Sri93] S. Srinivas. A generalization of the noisy-or model. In Proc. of the Conf. on Uncertainty in AI, 1993.
[SS96] R. Somogyi and C. A. Sniegoski. Modeling the complexity of genetic networks: understanding multigenetic and pleiotropic regulation. Complexity, 1(45), 1996.
[TMCH98] B. Thiesson, C. Meek, D. Chickering, and D. Heckerman. Learning mixtures of DAG models. In Proc. of the Conf. on Uncertainty in AI, 1998.
[WFM+98] X. Wen, S. Fuhrman, G. S. Michaels, D. B. Carr, S. Smith, J. L. Barker, and R. Somogyi. Large-scale temporal gene expression mapping of central nervous system development. Proc. of the National Academy of Science, USA, 95:334-339, 1998.
[Whi90] J. Whittaker. Graphical Models in Applied Multivariate Statistics. Wiley, 1990.
[WMS94] J. V. White, I. Muchnik, and T. F. Smith. Modelling protein cores with Markov Random Fields. Mathematical Biosciences, 124:149-179, 1994.
[WWS99] D. C. Weaver, C. T. Workman, and G. D. Stormo. Modeling regulatory networks with weight matrices. In Proc. of the Pacific Symp. on Biocomputing, 1999.

general reverse engineering algorithm for inference [WFM+ 98] X. Wen, S. Fuhrman, G. S. Michaels, D. B. Carr, of genetic network architectures. In Paci®c Sym- S. Smith, J. L. Barker, and R. Somogyi. Large-scale posium on Biocomputing, volume 3, pages 18±29, temporal gene expression mapping of central nervous 1998. system development. Proc. of the National Academy [MA97] H. H. McAdams and A. Arkin. Stochastic mech- of Science, USA, 95:334±339, 1998. anisms in gene expression. Proc. of the National [Whi90] J. Whittaker. Graphical Models in Applied Multi- Academy of Science, USA, 94(3):814±819, 1997. variate Statistics. Wiley, 1990. [Mac95] D. MacKay. Probable networks and plausible pre- [WMS94] J. V. White, I. Muchnik, and T. F. Smith. Modelling dictions Ð a review of practical Bayesian methods protein cores with Markov Random Fields. Mathe- for supervised neural networks. Network, 1995. matical Biosciences, 124:149±179, 1994. [Mac98] D. MacKay. Introduction to monte carlo methods. [WWS99] D. C. Weaver, C. T. Workman, and G. D. Stormo. In M. Jordan, editor, Learning in graphical models. Modeling regulatory networks with weight matrices. Kluwer, 1998. In Proc. of the Paci®c Symp.on Biocomputing, 1999. [MH97] C. Meek and D. Heckerman. Structure and parameter learning for causal independence and causal interac- tion models. In Proc. of the Conf. on Uncertainty in AI, pages 366±375, 1997. [Mil55] G. Miller. Note on the bias of information estimates. In H. Quastler, editor, Information theory in psychol- ogy. The Free Press, 1955.