Learning Belief Networks in the Presence of Missing Values and Hidden Variables

Nir Friedman
Computer Science Division, 387 Soda Hall
University of California, Berkeley, CA 94720
[email protected]

Abstract

In recent years there has been a flurry of work on learning probabilistic belief networks. Current state-of-the-art methods have been shown to be successful for two learning scenarios: learning both network structure and parameters from complete data, and learning parameters for a fixed network from incomplete data, that is, in the presence of missing values or hidden variables. However, no method has yet been demonstrated to effectively learn network structure from incomplete data.

In this paper, we propose a new method for learning network structure from incomplete data. This method is based on an extension of the Expectation-Maximization (EM) algorithm for model selection problems that performs search for the best structure inside the EM procedure. We prove the convergence of this algorithm, and adapt it for learning belief networks. We then describe how to learn networks in two scenarios: when the data contains missing values, and in the presence of hidden variables. We provide experimental results that show the effectiveness of our procedure in both scenarios.

1 INTRODUCTION

Belief networks (also known as Bayesian networks and directed probabilistic networks) are a graphical representation for probability distributions. They are arguably the representation of choice for uncertainty in artificial intelligence. These networks provide a compact and natural representation, effective inference, and efficient learning, and they have been successfully applied in expert systems, diagnostic engines, and optimal decision making systems (e.g., [Heckerman et al. 1995]).

A belief network consists of two components. The first is a directed acyclic graph in which each vertex corresponds to a random variable. This graph represents a set of conditional independence properties of the represented distribution. This component captures the structure of the probability distribution, and is exploited for efficient inference and decision making. Thus, while belief networks can represent arbitrary probability distributions, they provide a computational advantage for those distributions that can be represented with a simple structure. The second component is a collection of local interaction models that describe the conditional probability of each variable given its parents in the graph. Together, these two components represent a unique probability distribution [Pearl 1988].

Eliciting belief networks from experts can be a laborious and expensive process in large applications. Thus, in recent years there has been a growing interest in learning belief networks from data [Cooper and Herskovits 1992; Lam and Bacchus 1994; Heckerman et al. 1995].1 Current methods are successful at learning both the structure and the parameters from complete data, that is, when each data record describes the values of all variables in the network. Unfortunately, things are different when the data is incomplete. Current learning methods are essentially limited to learning the parameters for a fixed network structure.

This is a significant problem for several reasons. First, most real-life data contains missing values (e.g., most of the non-synthetic datasets in the UC Irvine repository [Murphy and Aha 1995]). One of the cited advantages of belief networks (e.g., [Heckerman 1995, p. 1]) is that they allow for principled methods of reasoning with incomplete data. It is unreasonable to require, at the same time, complete data for training them.

Second, learning a concise structure is crucial both for avoiding overfitting and for efficient inference in the learned model. By introducing hidden variables that do not appear explicitly in the model we can often learn simpler models. A simple example, originally given by Binder et al. [1997], is shown in Figure 1. Current methods for learning hidden

1We refer the interested reader to the tutorial by Heckerman [1995], which overviews the current state of this field.

variables require that human experts choose a fixed network structure or a small set of possible structures. While this is reasonable in some domains, it is clearly infeasible in general. Moreover, the motivation for learning models in the first place is to avoid such strong dependence on the expert.

Figure 1: (a) An example of a network with a hidden variable. (b) The simplest network that can capture the same distribution without using the hidden variable.

In this paper, we propose a new method for learning structure from incomplete data that uses a variant of the Expectation-Maximization (EM) algorithm [Dempster et al. 1977] to facilitate efficient search over a large number of candidate structures. Roughly speaking, we reduce the search problem to one in the complete data case, which can be solved efficiently. As we show experimentally, our method is capable of learning structure from non-trivial datasets. For example, our procedure is able to learn structures in a domain with several dozen variables and 30% missing values, and to learn structure in the presence of several hidden variables. (We note that in both of these experiments, we did not rely on prior knowledge to reduce the number of candidate structures.) We believe that this is a crucial step toward making belief network induction applicable to real-life problems.

To convey the main idea of our approach, we must first review the source of the difficulties in learning belief networks from incomplete data. The common approach to learning belief networks is to introduce a scoring metric that evaluates each network with respect to the training data, and then to search for the best network according to this metric. The current metrics are all based, to some extent, on the likelihood function, that is, the probability of the data given a candidate network.

When the data is complete, by using the independencies encoded in the network structure, we can decompose the likelihood function (and the score metric) into a product of terms, where each term depends only on the choice of parents for a particular variable and the relevant statistics in the data, that is, counts of the number of common occurrences for each possible assignment of values to the variable and its parents. This allows for a modular evaluation of a candidate network and of all local changes to it. Additionally, the evaluation of a particular change (e.g., adding an arc from X_1 to X_2) remains the same after changing a different part of the network (e.g., removing an arc from X_1 to X_3). Thus, after making one change, we do not need to reevaluate the score of most of the possible neighbors in the search space. These properties allow for efficient learning procedures.

When the data is incomplete, we can no longer decompose the likelihood function, and must perform inference to evaluate it. Moreover, to evaluate the optimal choice of parameters for a candidate network structure, we must perform non-linear optimization using either EM [Lauritzen 1995] or gradient descent [Binder et al. 1997]. In particular, the EM procedure iteratively improves its current choice of parameters \Theta using the following two steps. In the first step, the current parameters are used to compute the expected value of all the statistics needed to evaluate the current structure. In the second step, we replace \Theta by the parameters that maximize the complete data score with these expected statistics. This second step is essentially equivalent to learning from complete data, and thus can be done efficiently. However, the first step requires us to compute the probabilities of several events for each instance in the training data. Thus, learning parameters with the EM procedure is significantly slower than learning parameters from complete data.

Finally, in the incomplete data case, a local change in one part of the network can affect the evaluation of a change in another part of the network. Thus, the currently proposed methods evaluate all the neighbors (e.g., networks that differ by one or a few local changes) of each candidate they visit. This requires many calls to the EM procedure before making a single change to the current candidate. To the best of our knowledge, such methods have been successfully applied only to problems where there are few choices to be made: clustering methods (e.g., [Cheeseman et al. 1988; Chickering and Heckerman 1996]) select the number of values for a single hidden variable in networks with a fixed structure, and Heckerman [1995] describes an experiment with a single missing value and five observable variables.

The novel idea of our approach is to perform the search for the best structure inside the EM procedure. Our procedure maintains a current network candidate, and at each iteration it attempts to find a better network structure by computing the expected statistics needed to evaluate alternative structures. Since this search is done in a complete data setting, we can exploit the properties of the scoring metric for effective search. (In fact, we use exactly the same search procedures as in the complete data case.) In contrast to current practice, this procedure allows us to make significant progress in the search in each EM iteration. As we show in our experimental validation, our procedure requires relatively few EM iterations to learn non-trivial networks.

The rest of the paper is organized as follows. We start in Section 2 with a review of learning belief networks. In Section 3, we describe a new variant of EM, the model selection EM (MS-EM) algorithm, and show that by choosing a model (e.g., a network structure) that maximizes the complete data score in each MS-EM step we indeed improve the objective score. In Section 4, we discuss the details of applying the MS-EM algorithm to learning belief networks using the minimal description length scoring metric. In Section 5, we present an experimental evaluation of this procedure, both for handling missing values and for learning structure with hidden nodes. Finally, in Section 6 we discuss the implications of our results and possible future extensions.

2 REVIEW OF LEARNING BELIEF NETWORKS

Consider a finite set U = {X_1, ..., X_n} of discrete random variables where each variable X_i may take on values from a finite set. We use capital letters, such as X, Y, Z, for variable names and lowercase letters x, y, z to denote specific values taken by those variables. Sets of variables are denoted by boldface capital letters X, Y, Z, and assignments of values to the variables in these sets are denoted by boldface lowercase letters x, y, z.

A belief network is an annotated directed acyclic graph that encodes a joint probability distribution over U. Formally, a belief network for U is a pair B = (G, \Theta). The first component, G, is a directed acyclic graph whose vertices correspond to the random variables X_1, ..., X_n, and which encodes the following set of conditional independence assumptions: each variable X_i is independent of its non-descendants given its parents in G. The second component, \Theta, represents the set of parameters that quantifies the network. It contains a parameter \theta_{x_i|pa_i} = Pr(x_i | pa_i) for each possible value x_i of X_i and pa_i of Pa(X_i), where Pa(X_i) denotes the set of parents of X_i in G. A belief network B defines a unique joint probability distribution over U, given by

    Pr_B(X_1, ..., X_n) = \prod_{i=1}^{n} Pr_B(X_i | Pa(X_i)) = \prod_{i=1}^{n} \theta_{X_i | Pa(X_i)}.

The problem of learning a belief network can be stated as follows. Given a training set D = {x^1, ..., x^N} of instances of U, find a network B that best matches D. The common approach to this problem is to introduce a scoring function that evaluates each network with respect to the training data, and then to search for the best network according to this metric. The two main scoring functions commonly used to learn belief networks are the Bayesian scoring function [Cooper and Herskovits 1992; Heckerman et al. 1995], and the one based on the principle of minimal description length (MDL) [Lam and Bacchus 1994], which is equivalent to Schwarz's Bayesian information criterion (BIC) [Schwarz 1978]. In this extended abstract we concentrate on the MDL/BIC metric; we defer treatment of the Bayesian metric to the full version of this paper.

Let B = (G, \Theta) be a belief network, and let D = {x^1, ..., x^N} be a training set, where each x^i assigns a value to some (or all) variables in U. The MDL score of a network given a training data set, written Score(B : D), is given by the following equation:2

    Score(B : D) = L(B : D) - (log N / 2) #(B),    (1)

where #(B) is the number of parameters in the network. The first term is the log-likelihood of B given D: L(B : D) = \sum_i log Pr_B(x^i). The log-likelihood has a statistical interpretation: the higher the log-likelihood, the closer B is to modeling the probability distribution in the data D. The second term is a penalty term that biases the score metric to prefer simpler networks. (We refer the interested reader to [Heckerman 1995; Lam and Bacchus 1994] for a detailed description of this score.)

When all the instances x^i in D are complete, that is, when they assign values to all the variables in U, the log-likelihood term decomposes. Let N_X(x) be the statistics of X in D, that is, the number of instances in D where X = x. (Note that N_X is well-defined only for complete datasets.) We omit the subscript X of N_X whenever it is clear from the context. Applying the definition of Pr_B to the log-likelihood and changing the order of summation yields the following well-known decomposition of the log-likelihood according to the structure of G:

    L(B : D) = \sum_i \sum_{x_i, pa_i} N(x_i, pa_i) log \theta_{x_i | pa_i}.

It is easy to show that this expression is maximized when \theta_{x_i | pa_i} = N(x_i, pa_i) / N(pa_i). Thus, given a network structure, there is a closed form solution for the parameters that maximize the log-likelihood score. Moreover, since the second term of (1) does not depend on the choice of parameters, this solution also maximizes the MDL score.

Using this decomposition, we have that for each network structure G the score decomposes as well: Score(B : D) = \sum_i Score(X_i, Pa(X_i) : D), where

    Score(X_i, Pa(X_i) : D) = \sum_{x_i, pa_i} N(x_i, pa_i) log \theta_{x_i | pa_i} - (log N / 2) #(X_i, Pa(X_i)),    (2)

and #(X_i, Pa(X_i)) is the number of parameters needed to represent Pr(X_i | Pa(X_i)). As explained in the introduction, this decomposition is crucial for learning structure. A local search procedure that changes one arc at each move can efficiently evaluate the gains made by adding or removing an arc. Such a procedure can also reuse computations made in previous stages to evaluate changes to the parents of all variables that have not been changed in the last move. One particular search procedure that exploits this decomposition is a greedy hill-climbing procedure that at each step performs the local change that results in the maximal gain, until it reaches a local maximum. Although this procedure does not necessarily find a global maximum, it does perform well in practice; e.g., see [Heckerman et al. 1995].

When the training data is incomplete, that is, when some of the x^i do not assign values to all the variables in U, the situation is quite different. In this case the log-likelihood, and consequently the MDL score, do not decompose. Moreover, we cannot choose the parameters in each local interaction model independently of the others. These issues have a drastic effect on the learning procedure. To evaluate the score of a structure, we must find the optimal parameter setting (see [Chickering and Heckerman 1996]). This usually involves some form of parametric search (e.g., gradient descent [Binder et al. 1997] or EM [Lauritzen 1995]). Moreover, for each candidate structure we consider, we must reevaluate the choice of parameters, since, in general, we cannot adopt the parameters computed for previous candidates.

2The MDL scoring metric is often defined as the negative of (1), where learning attempts to minimize the score rather than maximize it.
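On complete data, the decomposed score of equations (1) and (2) can be computed directly from counts. The following sketch shows one way to do this; the data representation and the function names are our own illustration, not the paper's implementation.

```python
from collections import Counter
from math import log

def family_score(data, child, parents, num_values):
    """Local term Score(X_i, Pa(X_i) : D) of Eq. (2), for complete data.
    `data` is a list of dicts mapping variable names to values."""
    N = len(data)
    # N(x_i, pa_i) and N(pa_i): counts of child/parent configurations
    joint = Counter((inst[child], tuple(inst[p] for p in parents)) for inst in data)
    marg = Counter(tuple(inst[p] for p in parents) for inst in data)
    # Plug in the maximum-likelihood parameters theta = N(x, pa) / N(pa)
    loglik = sum(n * log(n / marg[pa]) for (x, pa), n in joint.items())
    # Number of free parameters needed to represent Pr(X_i | Pa(X_i))
    n_params = num_values[child] - 1
    for p in parents:
        n_params *= num_values[p]
    return loglik - (log(N) / 2) * n_params

def mdl_score(data, structure, num_values):
    """Score(B : D): by Eqs. (1)-(2) it is a sum of family scores."""
    return sum(family_score(data, x, pa, num_values)
               for x, pa in structure.items())
```

Because the total is a sum of per-family terms, a local search move that changes the parents of a single variable only requires recomputing that variable's term, which is the decomposability property exploited throughout the paper.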

3 THEORETICAL FOUNDATIONS

The standard EM algorithm is a method for parametric estimation for problems with missing data [Dempster et al. 1977; McLachlan and Krishnan 1997; Tanner 1993]. Here we present an extension of EM, which we call the model selection EM (MS-EM) algorithm, that deals with model selection as well as parameter estimation.

We start with some notation. Assume we have an input dataset with N records. We define random variables X_{l,i}, for 1 <= l <= N and 1 <= i <= n, such that X_{l,i} describes the assignment to X_i in the l'th input record. Let O \subseteq U be the set of observed variables, that is, those whose values are specified in D, and let H = U - O be the hidden variables. We assume that we have a class of models \mathcal{M} such that each model M \in \mathcal{M} is parameterized by a vector \theta_M, where each (legal) choice of values for \theta_M defines a probability distribution Pr(\cdot : M, \theta_M) over U. From now on we use \theta as shorthand for \theta_M when the model is clear from the context.

We also assume that we want to find the choice of M, \theta that maximizes a scoring metric of the form

    Score(M, \theta) = log Pr(O : M, \theta) - Pen(M, \theta, O).

The term log Pr(O : M, \theta) is the log-likelihood of the data given the choice of model, and the term Pen(M, \theta, O) is some penalty function that might depend on the values of the observable variables in the training data. We assume that had we observed the complete data, we would have been able to maximize this score. Unfortunately, we do not have these values. However, we can examine the expected score by taking the expectation over all possible values H might take. To do so, we need to estimate the probability of these assignments. Given a particular estimate (M', \theta'), the expected score is

    E[Score](M, \theta : M', \theta') = E[ log Pr(O, H : M, \theta) - Pen(M, \theta, O, H) | O : M', \theta' ],

where the expectation is over the value of H according to Pr(H | O : M', \theta').

The MS-EM algorithm can now be stated concisely:

Procedure MS-EM:
    Choose M^0 and \theta^0 randomly.
    Loop for n = 0, 1, ... until convergence:
        Find a model M^{n+1} that maximizes E[Score](M, \theta : M^n, \theta^n).
        Let \theta^{n+1} = arg max_\theta E[Score](M^{n+1}, \theta : M^n, \theta^n).

That is, at each stage we choose a model and parameters that have the highest expected score given our previous assessment. For any particular instantiation of this algorithm, we have to show that we can indeed find a pair (M, \theta) that maximizes the expected score. In fact, it is not necessary to maximize the expected score: it suffices that at each iteration we choose M^{n+1} and \theta^{n+1} so that E[Score](M^{n+1}, \theta^{n+1} : M^n, \theta^n) >= E[Score](M^n, \theta^n : M^n, \theta^n).

This process converges when there is no further improvement in the objective score, that is, when Score(M^{n+1}, \theta^{n+1}) = Score(M^n, \theta^n). In practice, we stop the procedure when the change in the objective score is negligible (i.e., smaller than 0.5 percent).

To show that this algorithm is useful, we need to show that it indeed improves the objective score in each iteration.

Theorem 3.1: If E[Score](M, \theta : M', \theta') >= E[Score](M', \theta' : M', \theta'), then Score(M, \theta) >= Score(M', \theta').

Thus, if at each iteration we choose (M^{n+1}, \theta^{n+1}) with a higher expected score than the previous candidate, we are bound to improve the objective score. The proof of this theorem is a relatively simple extension of the corresponding proof for parametric EM [McLachlan and Krishnan 1997].

This theorem shows that MS-EM makes progress at each step until it converges. What do we know about the point of convergence of this algorithm? Using results for standard EM, we have that if (M', \theta') = arg max_{M, \theta} E[Score](M, \theta : M', \theta'), then \theta' is a stationary point of Score(M', \cdot); that is, the gradient \partial Score / \partial \theta at that point is zero [McLachlan and Krishnan 1997]. This means that the choice of parameters at the point of convergence is either a local maximum, a local minimum, or a saddle point of Score(M', \cdot).

What can we say about the choice of model at the convergence point? The formal notion of stationarity does not apply to the discrete space \mathcal{M} of possible models. However, we can define the class of "stationary points" of this algorithm as all the points to which the algorithm can converge. Note that this set of points is a subset of the set of stationary points in the parameterization spaces of all the candidate models in \mathcal{M}. Thus, had we run standard (e.g., parametric) EM for each of these models, we would have potentially encountered at least the same number of stationary points.
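The MS-EM procedure can be sketched as a generic loop. The callbacks below, `e_step`, `m_step`, and `score`, are our own placeholder names for the E-step inference, a model-and-parameter choice satisfying the condition of Theorem 3.1, and the objective Score(M, theta); none of these names come from the paper.

```python
def ms_em(e_step, m_step, score, model, params, max_iter=100, tol=1e-4):
    """Skeleton of an MS-EM loop (illustrative sketch).

    e_step(model, params) -> expected sufficient statistics
    m_step(stats, model)  -> (model', params'), chosen so that the expected
                             score does not decrease (Theorem 3.1 condition)
    score(model, params)  -> the objective Score(M, theta)
    """
    prev = score(model, params)
    for _ in range(max_iter):
        stats = e_step(model, params)          # expectation step
        model, params = m_step(stats, model)   # model-selection "M" step
        cur = score(model, params)
        if cur - prev < tol:                   # negligible improvement: stop
            break
        prev = cur
    return model, params
```

With a toy one-parameter objective, the loop climbs monotonically and stops once the improvement is negligible, mirroring the convergence test described above.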

This discussion suggests another modification of the learning algorithm. As we shall see below, maximizing the choice of parameters \theta for a fixed model is computationally cheaper than searching for a better model. Thus, we can modify our algorithm so that it alternates between iterations that optimize the parameters for the current model candidate, and iterations that search for a different model. We call this procedure the alternating MS-EM (AMS-EM) algorithm:

Procedure AMS-EM:
    Choose M^0 and \theta^{0,0} randomly.
    Loop for n = 0, 1, ... until convergence:
        Loop for l = 0, 1, ... until convergence or l = l_max:
            Let \theta^{n,l+1} = arg max_\theta E[Score](M^n, \theta : M^n, \theta^{n,l}).
        Find a model M^{n+1} that maximizes E[Score](M, \theta : M^n, \theta^{n,l}).
        Let \theta^{n+1,0} = arg max_\theta E[Score](M^{n+1}, \theta : M^n, \theta^{n,l}).

This variant attempts to make progress using parametric EM steps, for some specified number of steps or until convergence. Then it considers making additional progress by changing the choice of model. This can lead to computational advantages; moreover, as we shall see in Section 5.2, in some situations it can avoid early convergence to undesirable local maxima.
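The alternation can be sketched with placeholder callbacks of our own naming; the split between cheap inner parametric iterations and a single, more expensive structural step per outer round is the point of the variant.

```python
def ams_em(e_step, fit_params, search_model, score, model, params,
           max_outer=50, max_inner=5, tol=1e-4):
    """Skeleton of an AMS-EM loop (illustrative sketch): parametric EM
    iterations for the fixed current model alternate with one model-search
    step against the expected statistics."""
    prev = score(model, params)
    for _ in range(max_outer):
        for _ in range(max_inner):              # cheap parametric steps
            params = fit_params(e_step(model, params), model)
        stats = e_step(model, params)           # one structural step
        model = search_model(stats, model)
        params = fit_params(stats, model)
        cur = score(model, params)
        if cur - prev < tol:                    # negligible improvement: stop
            break
        prev = cur
    return model, params
```

The inner loop never changes the model, so its expected statistics can be computed once per iteration from a fixed candidate, which is where the computational saving discussed in Section 4 comes from.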

4 MS-EM FOR LEARNING BELIEF NETWORKS

To apply the MS-EM algorithm to learning belief networks with the MDL score, we need to show how to choose a model (G, \Theta) that increases the expected score. As usual for models in the exponential family, the expected score has the same form as the complete data score. Using the definition of the MDL score and the linearity of expectation, we get that E[Score](G, \Theta : G', \Theta') = \sum_i E[Score](X_i, Pa(X_i) : G', \Theta'), where

    E[Score](X_i, Pa(X_i) : G', \Theta') = \sum_{x_i, pa_i} E[N(x_i, pa_i) | O : G', \Theta'] log \theta_{x_i | pa_i} - (log N / 2) #(X_i, Pa(X_i)),

and the expectation is taken according to Pr(\cdot : G', \Theta'). Thus, we get an analogue of the decomposition of the MDL score for the complete data case. The only difference is that we use the expected statistics based on the model (G', \Theta') instead of the actual statistics. As in the complete data case, we maximize the expected score for a particular network structure G by setting \theta_{x_i | pa_i} = E[N(x_i, pa_i)] / E[N(pa_i)]. As a consequence, we can use the same search strategies that exist for the complete data case.

Figure 2: Architecture of the implementation of the MS-EM algorithm. [The figure shows a Search Engine calling a Score module, which obtains (expected) statistics from a Statistics module that uses the training data and the current model to evaluate candidate networks.]

The MS-EM algorithm for belief networks is implemented using the architecture described in Figure 2. The Search Engine module is responsible for choosing candidate networks to evaluate. For each candidate, it calls the Score module, which implements the particulars of the scoring metric (e.g., MDL or Bayesian scoring). To evaluate the score, the Score module requires the corresponding statistics, which are supplied by the Statistics module. In the complete data case, the Statistics module answers queries by counting instances in the training data. In MS-EM, this module computes expected statistics, using the network found in the previous iteration. Indeed, our implementation is based on a previous implementation of complete-data learning software. The implementation of MS-EM involved minor changes to the Statistics module and an implementation of an inference algorithm.

Most of the running time during the execution of our procedure is spent in the computation of expected statistics. This is where our procedure differs from parametric EM. In parametric EM, we know in advance which expected statistics are required; namely, these are all statistics of events of the form (x_i, pa_i), where Pa(X_i) are the parents of X_i in the fixed network structure we are considering. Since these variables are also the parents of X_i in our current candidate, we can employ efficient inference algorithms that compute all the required statistics in one pass over the training data (e.g., the clique-tree algorithm of Lauritzen and Spiegelhalter [1988]).

In our procedure, we cannot determine in advance which statistics will be required. Thus, we have to handle each query separately. Moreover, when we consider different structures, we are bound to evaluate a larger number of queries than parametric EM does. Thus, a "structural" iteration, where we search for a better network structure, is more expensive than a "parametric" iteration, where we only update the parameters. Since we usually need several iterations to converge on the best parameters, this suggests that by using the AMS-EM algorithm, we can save computation.
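For discrete belief networks, the expected statistics E[N(x_i, pa_i)] and the parametric update can be sketched as follows. The brute-force enumeration over completions is purely illustrative: the paper computes these quantities with clique-tree inference, and all identifiers here are our own.

```python
from collections import defaultdict
from itertools import product

def expected_counts(data, structure, theta, domains):
    """E-step: fractional counts E[N(x_i, pa_i)] obtained by distributing
    each partially observed instance over its completions, weighted by the
    current network. `None` marks a missing value."""
    counts = {x: defaultdict(float) for x in structure}
    for inst in data:
        hidden = [v for v in structure if inst.get(v) is None]
        completions, total = [], 0.0
        for vals in product(*(domains[h] for h in hidden)):
            full = dict(inst, **dict(zip(hidden, vals)))
            # weight = Pr_B(completion), via the chain rule
            w = 1.0
            for x, parents in structure.items():
                w *= theta[x][(full[x], tuple(full[p] for p in parents))]
            completions.append((full, w))
            total += w
        for full, w in completions:
            for x, parents in structure.items():
                counts[x][(full[x], tuple(full[p] for p in parents))] += w / total
    return counts

def max_expected_likelihood(counts, structure):
    """Parametric M-step: theta_{x|pa} = E[N(x, pa)] / E[N(pa)]."""
    theta = {}
    for x in structure:
        pa_totals = defaultdict(float)
        for (_, pa), n in counts[x].items():
            pa_totals[pa] += n
        theta[x] = {(xv, pa): n / pa_totals[pa]
                    for (xv, pa), n in counts[x].items()}
    return theta
```

A structural step would score many different (child, parent-set) families against these expected counts, which is why the number of distinct queries grows compared with parametric EM.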

Additionally, we implemented two straightforward mechanisms to reduce the overhead of computing statistics. First, at the beginning of each iteration, we use the clique-tree algorithm to compute the statistics for the current candidate. This is the only computation needed during the parametric optimization steps in AMS-EM, and these statistics are also used in the initial phases of the search in MS-EM. Second, the Statistics module caches queries computed using the current network. This allows us to avoid computing a query (or marginals of it) more than once in each iteration.

Another crucial point is the choice of the initial model for MS-EM. Clearly, this choice determines the convergence point of the algorithm (since the algorithm itself is deterministic). In general, we do not want to choose too simple an initial structure, since such a structure embodies many independencies, and thus biases the expected statistics to indicate that variables are independent of each other. On the other hand, we do not want to choose too complex an initial structure, since such a structure might be too hard to perform inference on. In the following sections, we discuss particular choices of initial structure in the scenarios we consider. As usual in EM procedures, we choose the initial parameters for the chosen structure randomly. Moreover, in order to avoid unfortunate early convergence, we usually run the procedure several times, starting from different initial points. We then choose the network with the highest score from those found by these runs.

5 EXPERIMENTAL RESULTS

In this section we report initial experimental results for two scenarios of incomplete data: missing values and hidden variables. We discuss aspects of the procedure that are relevant to each type of problem, and how we address them. In both settings, we evaluate our procedure as a density estimator, that is, by how well it learns the distribution in the training data. Another interesting aspect we could measure for belief networks is how well they model the structure of the domain (e.g., the number of correct arcs found). We defer this analysis to the full version of this paper.

5.1 Missing Values

Many real-life data sets contain missing values. In order to evaluate our procedure, we performed the following experiment, which examines the degradation in performance of our learning procedure as a function of the percentage of missing values.

In this experiment we generated artificial training data from two networks: insurance, a network for classifying car insurance applications [Binder et al. 1997] that has 26 variables; and alarm, a network for intensive care patient monitoring [Beinlich et al. 1989] that has 37 variables. From each network we randomly sampled training sets of different sizes. We then randomly removed values from each of these training sets to get training sets with varying percentages of missing values. For each data point we generated five independent training sets.

For each training set, we started the procedure from five random initial points. We chose the initial structure to be a random chain-like network that connects all the variables. We suspect that for most missing value problems with a relatively low percentage of missing values (e.g., less than 10-15%), the topology of the initial network is not crucial. We hope to verify this experimentally in the future. In both experiments, we tried both the MS-EM procedure and the AMS-EM procedure.

Figure 3: Plots showing the degradation in learning performance as a function of the percentage of missing values. The two panels correspond to the insurance and alarm networks. The horizontal axis shows the percentage of missing values, and the vertical axis shows the log-loss (higher is better).

We evaluated the performance of the learned networks by measuring how well they model the target distribution. To do so, we sampled a test set of 10,000 instances and evaluated the average log-loss of each learned network on this test set, that is, (1/10,000) \sum_i log Pr_B(x^i).3 (We used the same test set for evaluating all the learned networks.) The results are summarized in Figure 3.

3This is a standard method for evaluating density estimates.
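The evaluation metric can be sketched as follows, using an illustrative dictionary encoding of a learned network of our own devising (the paper does not prescribe one):

```python
from math import log

def log_prob(instance, structure, theta):
    """log Pr_B(x) for a complete instance, via the chain rule over the
    network's families."""
    return sum(log(theta[x][(instance[x], tuple(instance[p] for p in parents))])
               for x, parents in structure.items())

def avg_log_loss(test_set, structure, theta):
    """Average log-loss (1/|T|) * sum_t log Pr_B(x_t); higher is better."""
    return sum(log_prob(x, structure, theta) for x in test_set) / len(test_set)
```

Since the test instances are complete, no inference is needed here; each instance contributes one chain-rule product.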

Figure 4: (a) The network 3x2x4 used in the experiments. The shaded nodes correspond to hidden variables. (b) One of the networks learned from 1000 samples using one hidden variable.

As expected, there is a degradation in performance when a large number of values are missing. However, for reasonable percentages of missing values, the degradation is moderate or non-existent.4 These results are encouraging, since they show that even with 30% of the values missing, our procedure usually performs better than one that receives half the number of complete instances. (Note that with 30% missing values, virtually all the instances in the dataset are incomplete.) These results also show that the AMS-EM algorithm performs roughly as well as the MS-EM algorithm.

We note that in this experiment, values were missing at random. In many real-life domains, this is not the case. In such domains, the pattern of missing values can provide additional information (e.g., the patient was too ill to take a certain test). Such information can easily be modeled by introducing, for each variable X_i, a new variable labeled "Seen-X_i" that denotes whether X_i was observed or not. Datasets annotated with this information can then be handled by our procedure. This allows the learning procedure to learn correlations between the observations of various variables and the values of other variables. This approach utilizes the belief network to describe both the underlying domain and the process of observation in the domain. We hope to return to this issue in future work.

4We suspect that the slight improvement in the score for 10% missing values is due to the randomness of our procedure. This apparently allows the search to escape some of the local maxima found when there are no missing values.
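One way to realize this annotation is sketched below; the record encoding and function name are our own, and `None` stands for a missing value:

```python
def add_missingness_indicators(data, variables):
    """Annotate each record with a 'Seen-X' indicator per variable, so the
    learner can model the observation process itself. The paper suggests
    the idea; this particular encoding is our own sketch."""
    augmented = []
    for inst in data:
        aug = dict(inst)
        for v in variables:
            aug['Seen-' + v] = 1 if inst.get(v) is not None else 0
        augmented.append(aug)
    return augmented
```

The indicator variables are always observed, so they participate in structure search like any other variable and can acquire arcs to or from the substantive variables.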

not add arcs to or from ” . Second, suppose that we learn

a network where ” is a leaf (e.g., there are no arcs leading

5.2 Hidden Variables ” out of ” ) or is a root with only one child. In this

case, marginalizing ” from the learned distribution does In most domains, the observable variables describe only not effect the distribution over the observable variables. some of the relevant aspects of the world. This can have Thus, ” does not contribute to the representation of the adverse effect on our learning procedure. Consider, for domain. example, a medical domain in which the training data con- sists of observed symptoms (e.g., fever, headache, blood These observations suggest that a hidden variable is bene®- pressure etc.), and the medication prescribed by the doctor. cial only if it is connected to other variables in the networks. One hidden quantity that we not observe is which disease(s) Moreover, once we lose that property, we are effectively ig- the patient has. Knowledge of the patient's disease makes noring the hidden variable from there on. Thus, we want to ensure that our choice of initial structure connects hidden 4We suspect that the slight improvement in the score for 10% variables to several observable variables. One structure that is due to the randomness of our procedure. This apparently allows ensures that this is the case is a bipartite graph in which all the search to escape some of the local maxima found when there of the hidden variables are parents of each observable vari- are no missing values. 3x1x3 network 3x2x4 network -5.32 -8 -5.34 -8.05 -5.36 -8.1 -5.38 -8.15 -8.2 -5.4 -8.25 -5.42 Log Loss

Log Loss -8.3 -5.44 2000 inst. -8.35 2000 inst. 1000 inst. -5.46 500 inst. -8.4 1000 inst. 250 inst. 500 inst. -5.48 -8.45 250 inst. -5.5 -8.5 0 1 2 3 0 1 2 3 # Hidden # Hidden
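The "Seen-Xi" annotation described above can be produced mechanically from any dataset with missing values. The following is a minimal sketch, not the paper's implementation; the function name and attribute names are hypothetical, and `None` is used to mark a missing value.

```python
# Sketch: annotate a dataset with one indicator variable "Seen-X" per
# attribute X, so that the pattern of missingness itself becomes part
# of the data the learning procedure can exploit.

def add_seen_indicators(records, attributes):
    """Return records extended with a boolean 'Seen-X' per attribute."""
    annotated = []
    for rec in records:
        new_rec = dict(rec)
        for attr in attributes:
            # True iff the attribute's value was actually observed.
            new_rec["Seen-" + attr] = rec.get(attr) is not None
        annotated.append(new_rec)
    return annotated

# Hypothetical medical records; Headache was not measured for patient 0.
data = [{"Fever": True, "Headache": None},
        {"Fever": None, "Headache": False}]
annotated = add_seen_indicators(data, ["Fever", "Headache"])
```

The annotated records can then be handed to the structure-learning procedure unchanged; the indicators are ordinary binary variables.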

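The bipartite initial structure, with a bound on the number of parents per observable variable, can be sketched as follows. This is an illustrative reading of the construction, not the paper's code; the function name and the `max_parents` parameter are our own.

```python
import random

# Sketch: build the initial bipartite structure in which hidden
# variables point to observable variables.  When the full bipartite
# graph would give an observable too many parents, keep a random
# subset of at most max_parents hidden parents.

def initial_bipartite_structure(hidden, observed, max_parents, rng=random):
    """Map each observable variable to its chosen hidden parents."""
    parents = {}
    for x in observed:
        if len(hidden) <= max_parents:
            parents[x] = list(hidden)              # full bipartite graph
        else:
            parents[x] = rng.sample(hidden, max_parents)
    return parents

structure = initial_bipartite_structure(
    hidden=["H1", "H2", "H3"], observed=["X1", "X2"], max_parents=2)
# Every observable variable gets at most two hidden parents.
```

As the text notes, this structure only fixes the starting point; the parameters are then drawn at random and strengthened by a few parametric EM iterations before structure search begins.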
Figure 5: Plots showing learning performance when learning models with hidden variables, as a function of the number of hidden variables. The results shown are the average loss over 5 different experiments at each sample size. (Left panel: the 3x1x3 network; right panel: the 3x2x4 network. The y-axis is log loss, the x-axis is the number of hidden variables, and the curves correspond to 250, 500, 1000, and 2000 training instances.)

Of course, when there are many hidden variables, this network might require many parameters. In these cases, we randomly choose edges from the bipartite graph, ensuring that each observable variable does not have more than a certain fixed number of parents.

As usual, we randomly choose parameters for this network. These random choices create weak dependencies between the observable variables and the hidden variables. However, if we immediately apply MS-EM with such an initial guess, we end up choosing a structure where most of the hidden variables are independent of the rest of the network. To avoid this problem, we run several iterations of the standard EM procedure on the initial network. These iterations change the parameters so that there are stronger dependencies between the hidden variables and the observables. After this preprocessing stage, we continue as before (i.e., with either MS-EM or AMS-EM). In our experiments there was no significant difference between MS-EM and AMS-EM, and thus we focus on the latter in the remainder of this section.

In our experiments, we created two networks: 3x1x3, with the topology shown in Figure 1a (where all the variables are binary); and 3x2x4, with the topology shown in Figure 4a. Like the first network, this second network also has hidden variables "mediating" between two groups of observed variables. It includes, however, other variables that make the problem harder. We quantified both networks using randomly chosen parameters. We then sampled, from each network, five training sets of sizes 250, 500, 1000, and 2000 instances of the observable variables, and learned networks in the presence of 0, 1, 2, or 3 hidden binary variables using the AMS-EM algorithm. We tested the average log-loss of our procedure on a separate test set. The results are summarized in Figure 5. Experiments with the MS-EM algorithm led to similar results, which we omit here because of space limitations.

These results show that introducing hidden variables results in improved density estimation. However, when the number of instances grows larger, we are able to "afford" to learn more complex structures, and then the impact of adding hidden variables diminishes. Also note that for small samples, additional hidden variables can lead to worse performance. We suspect that this is due to overfitting, and we are currently exploring this phenomenon.

6 DISCUSSION

In this paper, we introduced a new method for learning belief networks from incomplete data. As the preliminary experimental results show, this method allows us to learn in non-trivial domains. We believe that these results will have impact in several domains. In particular, we are planning to explore the application of this method to learning temporal models that involve hidden entities. Such models are crucial for applications in speech recognition, computational and reinforcement learning. Additionally, we are exploring the use of this method for classification tasks.

The main computational effort in our procedure is the computation of expected statistics during the "structural" iterations, where the procedure performs search. Our current inference procedure is unoptimized, and this has restricted the scope of the reported experiments. We are currently replacing this module with an optimized inference engine. Additionally, we are examining a stochastic variant of MS-EM (analogous to stochastic EM [McLachlan and Krishnan 1997; Tanner 1993]). In this variant, we use sampling procedures to complete the training data. That is, from each partial instance, we sample several complete instances (using our current estimate of the model).
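This completion step might be sketched as follows. The sketch is ours, not the paper's code: `sample_missing` stands in for a real inference routine that samples a missing value from the current model estimate given the observed part of the instance, and all names are hypothetical.

```python
import random

# Sketch of the completion step of the stochastic variant: each
# partial instance is expanded into several complete instances drawn
# from the current model estimate (None marks a missing value).

def complete_instances(partial, k, sample_missing, rng):
    """Draw k completions of a partial instance."""
    completions = []
    for _ in range(k):
        filled = dict(partial)
        for var, val in partial.items():
            if val is None:
                filled[var] = sample_missing(var, partial, rng)
        completions.append(filled)
    return completions

def toy_sampler(var, instance, rng):
    # Stand-in for model-based sampling: a fair coin for binary values.
    return rng.random() < 0.5

rng = random.Random(0)
completed = complete_instances({"A": True, "B": None}, 3, toy_sampler, rng)
# Three complete copies; the observed value of A is preserved in each.
```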
We then learn from the completed data using the complete-data methods. This approach is appealing since, unlike exact inference methods, the running time of sampling is not sensitive to the complexity of the network. Finally, we are examining alternative search procedures that attempt to escape local maxima by using various local perturbations.

An additional topic that deserves more attention is learning models with hidden variables. Our current procedure starts with a given set of hidden variables and attempts to find a model that includes them. A more sophisticated approach would create hidden variables on an as-needed basis during the learning process. This requires making a decision as to when to add a new variable, and how to insert it into the current model. We are currently exploring several methods for recognizing where to insert a hidden variable.

Finally, in the current presentation we focused on the MDL learning score. An analogous development can be carried out for the Bayesian learning score [Cooper and Herskovits 1992; Heckerman et al. 1995] in a relatively straightforward manner. Due to space restrictions, we defer the presentation of these issues to the full version of this paper.

Acknowledgments

I am grateful to Kevin Murphy, Stuart Russell, Geoff Zweig, and particularly Moises Goldszmidt, for comments on an earlier version of this paper and useful discussions relating to this topic. This research was supported by ARO under the MURI program "Integrated Approach to Intelligent Systems", grant number DAAH04-96-1-0341.

References

Beinlich, I., G. Suermondt, R. Chavez, and G. Cooper (1989). The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks. In Proc. 2nd European Conf. on AI and Medicine. Berlin: Springer-Verlag.

Binder, J., D. Koller, S. Russell, and K. Kanazawa (1997). Adaptive probabilistic networks with hidden variables. Machine Learning, this volume.

Cheeseman, P., J. Kelly, M. Self, J. Stutz, W. Taylor, and D. Freeman (1988). AutoClass: A Bayesian classification system. In ML '88.

Chickering, D. M. and D. Heckerman (1996). Efficient approximations for the marginal likelihood of incomplete data given a Bayesian network. In Proc. Twelfth Conference on Uncertainty in Artificial Intelligence (UAI '96), pp. 158-168.

Cooper, G. F. and E. Herskovits (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning 9, 309-347.

Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B 39, 1-38.

Heckerman, D. (1995). A tutorial on learning Bayesian networks. Technical Report MSR-TR-95-06, Microsoft Research.

Heckerman, D., D. Geiger, and D. M. Chickering (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning 20, 197-243.

Heckerman, D., A. Mamdani, and M. P. Wellman (1995). Real-world applications of Bayesian networks. Communications of the ACM 38.

Lam, W. and F. Bacchus (1994). Learning Bayesian belief networks: An approach based on the MDL principle. Computational Intelligence 10, 269-293.

Lauritzen, S. L. (1995). The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis 19, 191-201.

Lauritzen, S. L. and D. J. Spiegelhalter (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B 50(2), 157-224.

McLachlan, G. J. and T. Krishnan (1997). The EM Algorithm and Extensions. Wiley Interscience.

Murphy, P. M. and D. W. Aha (1995). UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html.

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. San Francisco, Calif.: Morgan Kaufmann.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics 6, 461-464.

Tanner, M. A. (1993). Tools for Statistical Inference. New York: Springer-Verlag.