Learning Belief Networks in the Presence of Missing Values and Hidden Variables
Nir Friedman
Computer Science Division, 387 Soda Hall
University of California, Berkeley, CA 94720
[email protected]

Abstract

In recent years there has been a flurry of works on learning probabilistic belief networks. Current state of the art methods have been shown to be successful for two learning scenarios: learning both network structure and parameters from complete data, and learning parameters for a fixed network from incomplete data, that is, in the presence of missing values or hidden variables. However, no method has yet been demonstrated to effectively learn network structure from incomplete data.

In this paper, we propose a new method for learning network structure from incomplete data. This method is based on an extension of the Expectation-Maximization (EM) algorithm for model selection problems that performs search for the best structure inside the EM procedure. We prove the convergence of this algorithm, and adapt it for learning belief networks. We then describe how to learn networks in two scenarios: when the data contains missing values, and in the presence of hidden variables. We provide experimental results that show the effectiveness of our procedure in both scenarios.

1 INTRODUCTION

Belief networks (BN) (also known as Bayesian networks and directed probabilistic networks) are a graphical representation for probability distributions. They are arguably the representation of choice for uncertainty in artificial intelligence. These networks provide a compact and natural representation, effective inference, and efficient learning. They have been successfully applied in expert systems, diagnostic engines, and optimal decision making systems (e.g., [Heckerman et al. 1995]).

A belief network consists of two components. The first is a directed acyclic graph in which each vertex corresponds to a random variable. This graph represents a set of conditional independence properties of the represented distribution. This component captures the structure of the probability distribution, and is exploited for efficient inference and decision making. Thus, while belief networks can represent arbitrary probability distributions, they provide computational advantage for those distributions that can be represented with a simple structure. The second component is a collection of local interaction models that describe the conditional probability of each variable given its parents in the graph. Together, these two components represent a unique probability distribution [Pearl 1988].
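To make the two components concrete, the following minimal sketch (not taken from the paper; the sprinkler-style variable names and probabilities are invented for illustration) represents a small discrete belief network as a parent map plus conditional probability tables, and evaluates the joint probability of a full assignment as the product of local conditionals.

```python
# Illustrative sketch of a discrete belief network:
# (i) a DAG given by each variable's parent list, and
# (ii) local conditional probability tables (CPTs).

parents = {
    "Cloudy":    [],
    "Sprinkler": ["Cloudy"],
    "Rain":      ["Cloudy"],
    "WetGrass":  ["Sprinkler", "Rain"],
}

# cpt[var][(parent values...)][value] = P(var = value | parents)
cpt = {
    "Cloudy":    {(): {"T": 0.5, "F": 0.5}},
    "Sprinkler": {("T",): {"T": 0.1, "F": 0.9}, ("F",): {"T": 0.5, "F": 0.5}},
    "Rain":      {("T",): {"T": 0.8, "F": 0.2}, ("F",): {"T": 0.2, "F": 0.8}},
    "WetGrass":  {("T", "T"): {"T": 0.99, "F": 0.01},
                  ("T", "F"): {"T": 0.90, "F": 0.10},
                  ("F", "T"): {"T": 0.90, "F": 0.10},
                  ("F", "F"): {"T": 0.00, "F": 1.00}},
}

def joint_probability(assignment):
    """P(x) = product over variables of P(x_i | pa_i)."""
    p = 1.0
    for var, pa in parents.items():
        pa_vals = tuple(assignment[v] for v in pa)
        p *= cpt[var][pa_vals][assignment[var]]
    return p

print(joint_probability({"Cloudy": "T", "Sprinkler": "F", "Rain": "T", "WetGrass": "T"}))
```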
Eliciting belief networks from experts can be a laborious and expensive process in large applications. Thus, in recent years there has been a growing interest in learning belief networks from data [Cooper and Herskovits 1992; Lam and Bacchus 1994; Heckerman et al. 1995].¹ Current methods are successful at learning both the structure and parameters from complete data, that is, when each data record describes the values of all variables in the network. Unfortunately, things are different when the data is incomplete. Current learning methods are essentially limited to learning the parameters for a fixed network structure.

¹ We refer the interested reader to the tutorial by Heckerman [1995] that overviews the current state of this field.

This is a significant problem for several reasons. First, most real-life data contains missing values (e.g., most of the non-synthetic datasets in the UC Irvine repository [Murphy and Aha 1995]). One of the cited advantages of belief networks (e.g., [Heckerman 1995, p. 1]) is that they allow for principled methods for reasoning with incomplete data. However, it is unreasonable at the same time to require complete data for training them.

Second, learning a concise structure is crucial both for avoiding overfitting and for efficient inference in the learned model. By introducing hidden variables that do not appear explicitly in the model we can often learn simpler models. A simple example, originally given by Binder et al. [1997], is shown in Figure 1. Current methods for learning hidden variables require that human experts choose a fixed network structure or a small set of possible structures. While this is reasonable in some domains, it is clearly infeasible in general. Moreover, the motivation for learning models in the first place is to avoid such strong dependence on the expert.

Figure 1: (a) An example of a network with a hidden variable. (b) The simplest network that can capture the same distribution without using the hidden variable.

In this paper, we propose a new method for learning structure from incomplete data that uses a variant of the Expectation-Maximization (EM) algorithm [Dempster et al. 1977] to facilitate efficient search over a large number of candidate structures. Roughly speaking, we reduce the search problem to one in the complete data case, which can be solved efficiently. As we experimentally show, our method is capable of learning structure from non-trivial datasets. For example, our procedure is able to learn structures in a domain with several dozen variables and 30% missing values, and to learn the structure in the presence of several hidden variables. (We note that in both of these experiments, we did not rely on prior knowledge to reduce the number of candidate structures.) We believe that this is a crucial step toward making belief network induction applicable to real-life problems.

To convey the main idea of our approach, we must first review the source of the difficulties in learning belief networks from incomplete data. The common approach to learning belief networks is to introduce a scoring metric that evaluates each network with respect to the training data, and then to search for the best network (according to this metric). The current metrics are all based, to some extent, on the likelihood function, that is, the probability of the data given a candidate network.

When the data is complete, by using the independencies encoded in the network structure, we can decompose the likelihood function (and the score metric) into a product of terms, where each term depends only on the choice of parents for a particular variable and the relevant statistics in the data, that is, counts of the number of common occurrences for each possible assignment of values to the variable and its parents. This allows for a modular evaluation of a candidate network and of all local changes to it. Additionally, the evaluation of a particular change (e.g., adding an arc from X1 to X2) remains the same after changing a different part of the network (e.g., removing an arc from X1 to X3). Thus, after making one change, we do not need to reevaluate the score of most of the possible neighbors in the search space. These properties allow for efficient learning procedures.
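For concreteness, with multinomial local models the complete-data log-likelihood decomposes into one term per family; the notation below is introduced here for illustration (it is not taken from this excerpt), with N(x_i, pa_i) denoting the number of records in which X_i takes value x_i and its parents in the graph G take the joint value pa_i, and θ the entries of the conditional probability tables.

```latex
\log P(D \mid G, \Theta)
  \;=\; \sum_{i} \sum_{\mathrm{pa}_i} \sum_{x_i}
        N(x_i, \mathrm{pa}_i)\,\log \theta_{x_i \mid \mathrm{pa}_i}
```

Each inner term depends only on one variable, its parents in G, and the corresponding counts, which is why a local change such as adding or removing a single arc affects only the term of the variable whose parent set changed.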
When the data is incomplete, we can no longer decompose the likelihood function, and must perform inference to evaluate it. Moreover, to evaluate the optimal choice of parameters for a candidate network structure, we must perform non-linear optimization using either EM [Lauritzen 1995] or gradient descent [Binder et al. 1997]. In particular, the EM procedure iteratively improves its current choice of parameters Θ using the following two steps. In the first step, the current parameters are used for computing the expected value of all the statistics needed to evaluate the current structure. In the second step, we replace Θ by the parameters that maximize the complete data score with these expected statistics. This second step is essentially equivalent to learning from complete data, and thus, can be done efficiently. However, the first step requires us to compute the probabilities of several events for each instance in the training data. Thus, learning parameters with the EM procedure is significantly slower than learning parameters from complete data.

Finally, in the incomplete data case, a local change in one part of the network can affect the evaluation of a change in another part of the network. Thus, the current proposed methods evaluate all the neighbors (e.g., networks that differ by one or a few local changes) of each candidate they visit. This requires many calls to the EM procedure before making a single change to the current candidate. To the best of our knowledge, such methods have been successfully applied only to problems where there are few choices to be made: clustering methods (e.g., [Cheeseman et al. 1988; Chickering and Heckerman 1996]) select the number of values for a single hidden variable in networks with a fixed structure, and Heckerman [1995] describes an experiment with a single missing value and five observable variables.

The novel idea of our approach is to perform search for the best structure inside the EM procedure. Our procedure maintains a current network candidate, and at each iteration it attempts to find a better network structure by computing the expected statistics needed to evaluate alternative structures. Since this search is done in a complete data setting, we can exploit the properties of the scoring metric for effective search. (In fact, we use exactly the same search procedures as in the complete data case.) In contrast to current practice, this procedure allows us to make significant progress in the search in each EM iteration. As we show in our experimental validation, our procedure requires relatively few EM iterations to learn non-trivial networks.

The rest of the paper is organized as follows. We start in Section 2 with a review of learning belief networks. In Section 3, we describe a new variant of EM, the model selection EM (MS-EM) algorithm, and show that by choosing a model (e.g., network structure) that maximizes the complete data score in each MS-EM step we indeed improve the objective score.
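As a concrete illustration of the two EM steps described above (compute expected statistics, then maximize the complete-data score), the following self-contained sketch runs parametric EM for a toy network H -> X1, H -> X2 in which H is never observed. The network, data, and names are invented for illustration; this is parameter learning only, not the paper's MS-EM algorithm.

```python
import random
from collections import defaultdict

# Toy model: hidden binary H with children X1 and X2.
# Parameters: P(H=1), P(X1=1 | H=h), P(X2=1 | H=h).

def e_step(data, p_h, p_x1, p_x2):
    """Expected sufficient statistics: soft counts weighted by P(H | x1, x2)."""
    n_h = defaultdict(float)    # expected count of H = h
    n_x1 = defaultdict(float)   # expected count of (X1 = x1, H = h)
    n_x2 = defaultdict(float)   # expected count of (X2 = x2, H = h)
    for x1, x2 in data:
        # Joint P(H = h, x1, x2), then normalize to get the posterior over H.
        joint = {}
        for h in (0, 1):
            ph = p_h if h == 1 else 1.0 - p_h
            px1 = p_x1[h] if x1 == 1 else 1.0 - p_x1[h]
            px2 = p_x2[h] if x2 == 1 else 1.0 - p_x2[h]
            joint[h] = ph * px1 * px2
        z = joint[0] + joint[1]
        for h in (0, 1):
            w = joint[h] / z
            n_h[h] += w
            n_x1[(x1, h)] += w
            n_x2[(x2, h)] += w
    return n_h, n_x1, n_x2

def m_step(n_h, n_x1, n_x2):
    """Maximize the expected complete-data likelihood: normalize the soft counts."""
    p_h = n_h[1] / (n_h[0] + n_h[1])
    p_x1 = {h: n_x1[(1, h)] / n_h[h] for h in (0, 1)}
    p_x2 = {h: n_x2[(1, h)] / n_h[h] for h in (0, 1)}
    return p_h, p_x1, p_x2

# Synthetic incomplete data: only X1 and X2 are recorded, H stays hidden.
random.seed(0)
data = []
for _ in range(500):
    h = 1 if random.random() < 0.3 else 0
    x1 = 1 if random.random() < (0.9 if h else 0.2) else 0
    x2 = 1 if random.random() < (0.8 if h else 0.1) else 0
    data.append((x1, x2))

p_h, p_x1, p_x2 = 0.5, {0: 0.4, 1: 0.6}, {0: 0.4, 1: 0.6}
for _ in range(50):
    p_h, p_x1, p_x2 = m_step(*e_step(data, p_h, p_x1, p_x2))
print(p_h, p_x1, p_x2)
```

The structural variant outlined in this section reuses the same kind of expected counts, but in the maximization step it reconsiders the network structure as well, rather than only the parameters.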