Learning Classifiers by Maximizing Conditional Likelihood

Daniel Grossman [email protected]    Pedro Domingos [email protected]
Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195-2350, USA

Abstract

Bayesian networks are a powerful probabilistic representation, and their use for classification has received considerable attention. However, they tend to perform poorly when learned in the standard way. This is attributable to a mismatch between the objective function used (likelihood or a function thereof) and the goal of classification (maximizing accuracy or conditional likelihood). Unfortunately, the computational cost of optimizing structure and parameters for conditional likelihood is prohibitive. In this paper we show that a simple approximation—choosing structures by maximizing conditional likelihood while setting parameters by maximum likelihood—yields good results. On a large suite of benchmark datasets, this approach produces better class probability estimates than naive Bayes, TAN, and generatively-trained Bayesian networks.

1. Introduction

The simplicity and surprisingly high accuracy of the naive Bayes classifier have led to its wide use, and to many attempts to extend it (Domingos & Pazzani, 1997). In particular, naive Bayes is a special case of a Bayesian network, and learning the structure and parameters of an unrestricted Bayesian network would appear to be a logical means of improvement. However, Friedman et al. (1997) found that naive Bayes easily outperforms such unrestricted Bayesian network classifiers on a large sample of benchmark datasets. Their explanation was that the scoring functions used in standard Bayesian network learning attempt to optimize the likelihood of the entire data, rather than just the conditional likelihood of the class given the attributes. Such scoring results in suboptimal choices during the search process whenever the two functions favor differing changes to the network. The natural solution would then be to use conditional likelihood as the objective function. Unfortunately, Friedman et al. observed that, while maximum likelihood parameters can be efficiently computed in closed form, this is not true of conditional likelihood. The latter must be optimized using numerical methods, and doing so at each search step would be prohibitively expensive. Friedman et al. thus abandoned this avenue, leaving the investigation of possible heuristic alternatives to it as an important direction for future research. In this paper, we show that the simple heuristic of setting the parameters by maximum likelihood while choosing the structure by conditional likelihood is accurate and efficient.

Friedman et al. chose instead to extend naive Bayes by allowing a slightly less restricted structure (one parent per variable in addition to the class) while still optimizing likelihood. They showed that TAN, the resulting algorithm, was indeed more accurate than naive Bayes on benchmark datasets. We compare our algorithm to TAN and naive Bayes on the same datasets, and show that it outperforms both in the accuracy of class probability estimates, while outperforming naive Bayes and tying TAN in classification error.

If the structure is fixed in advance, computing the maximum conditional likelihood parameters by gradient descent should be computationally feasible, and Greiner and Zhou (2002) have shown with their ELR algorithm that it is indeed beneficial. They leave optimization of the structure as an important direction for future work, and that is what we accomplish in this paper.

Appearing in Proceedings of the 21st International Conference on Machine Learning, Banff, Canada, 2004. Copyright 2004 by the authors.
Perhaps the most important reason to seek an improved Bayesian network classifier is that, for many applications, high accuracy in class predictions is not enough; accurate estimates of class probabilities are also desirable. For example, we may wish to rank cases by probability of class membership (Provost & Domingos, 2003), or the costs associated with incorrect predictions may be variable and not known precisely at learning time (Provost & Fawcett, 2001). In this case, knowing the class probabilities allows the learner to make optimal decisions at classification time, whatever the misclassification costs (Duda & Hart, 1973). More generally, a classifier is often only one part of a larger decision process, and outputting accurate class probabilities increases its utility to the process.

We begin by reviewing the essentials of learning Bayesian networks. We then present our algorithm, followed by experimental results and their interpretation. The paper concludes with a discussion of related and future work.

2. Bayesian Networks

A Bayesian network (Pearl, 1988) encodes the joint probability distribution of a set of v variables, {x_1, ..., x_v}, as a directed acyclic graph and a set of conditional probability tables (CPTs). (In this paper we assume all variables are discrete, or have been pre-discretized.) Each node corresponds to a variable, and the CPT associated with it contains the probability of each state of the variable given every possible combination of states of its parents. The set of parents of x_i, denoted \pi_i, is the set of nodes with an arc to x_i in the graph. The structure of the network encodes the assertion that each node is conditionally independent of its non-descendants given its parents. Thus the probability of an arbitrary event X = (x_1, ..., x_v) can be computed as P(X) = \prod_{i=1}^{v} P(x_i | \pi_i). In general, encoding the joint distribution of a set of v discrete variables requires space exponential in v; Bayesian networks reduce this to space exponential in \max_{i \in \{1,...,v\}} |\pi_i|.
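
To make the factorization concrete, here is a minimal sketch (ours, not the paper's code) that evaluates P(X) for a toy three-variable network by multiplying CPT entries; the network, CPT layout, and numbers are illustrative assumptions.

```python
# Minimal sketch: P(X) = prod_i P(x_i | pi_i) for a toy network x1 -> x2, x1 -> x3
# with binary variables.  Variable names, CPT layout, and numbers are illustrative.

cpts = {
    "x1": {(): [0.6, 0.4]},                      # P(x1); no parents
    "x2": {(0,): [0.7, 0.3], (1,): [0.2, 0.8]},  # P(x2 | x1)
    "x3": {(0,): [0.9, 0.1], (1,): [0.5, 0.5]},  # P(x3 | x1)
}
parents = {"x1": [], "x2": ["x1"], "x3": ["x1"]}

def joint_probability(event):
    """event maps each variable name to its observed state (an int index)."""
    p = 1.0
    for var, table in cpts.items():
        parent_states = tuple(event[q] for q in parents[var])
        p *= table[parent_states][event[var]]
    return p

print(joint_probability({"x1": 1, "x2": 0, "x3": 1}))  # 0.4 * 0.2 * 0.5 = 0.04
```
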
2.1. Learning Bayesian Networks

Given an i.i.d. training set D = {X_1, ..., X_d, ..., X_n}, where X_d = (x_{d,1}, ..., x_{d,v}), the goal of learning is to find the Bayesian network that best represents the joint distribution P(x_{d,1}, ..., x_{d,v}). One approach is to find the network B that maximizes the likelihood of the data or (more conveniently) its logarithm:

LL(B|D) = \sum_{d=1}^{n} \log P_B(X_d) = \sum_{d=1}^{n} \sum_{i=1}^{v} \log P_B(x_{d,i} | \pi_{d,i})    (1)

When the structure of the network is known, this reduces to estimating p_{ijk}, the probability that variable i is in state k given that its parents are in state j, for all i, j, k. When there are no examples with missing values in the training set and we assume parameter independence, the maximum likelihood estimates are simply the observed frequency estimates \hat{p}_{ijk} = n_{ijk}/n_{ij}, where n_{ijk} is the number of occurrences in the training set of the kth state of x_i with the jth state of its parents, and n_{ij} is the sum of n_{ijk} over all k.
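
As an illustration of these frequency estimates, the following sketch (our own, assuming a hypothetical data layout of one dict per training example) computes \hat{p}_{ijk} = n_{ijk}/n_{ij} for every variable and parent configuration.

```python
from collections import Counter

# Illustrative sketch (not the paper's implementation): maximum likelihood CPT
# estimates p_ijk = n_ijk / n_ij from fully observed data.  `data` is a list of
# assignments {variable: state index}; `parents` gives the structure.

def ml_cpts(data, variables, states, parents):
    counts = {var: Counter() for var in variables}
    for row in data:
        for var in variables:
            j = tuple(row[p] for p in parents[var])   # parent configuration
            counts[var][(j, row[var])] += 1
    cpts = {}
    for var in variables:
        cpts[var] = {}
        parent_configs = {j for (j, _) in counts[var]}
        for j in parent_configs:
            n_ij = sum(counts[var][(j, k)] for k in states[var])
            cpts[var][j] = [counts[var][(j, k)] / n_ij for k in states[var]]
    return cpts

data = [{"x1": 0, "x2": 1}, {"x1": 0, "x2": 0}, {"x1": 1, "x2": 1}]
print(ml_cpts(data, ["x1", "x2"], {"x1": [0, 1], "x2": [0, 1]},
              {"x1": [], "x2": ["x1"]}))
```
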

In this paper we assume no missing data throughout, and focus on the problem of learning network structure. Chow and Liu (1968) provide an efficient algorithm for the special case where each variable has only one parent. Solution methods for the general (intractable) case fall into two main classes: independence tests (Spirtes et al., 1993) and search-based methods (Cooper & Herskovits, 1992; Heckerman et al., 1995). The latter are probably the most widely used, and we focus on them in this paper. We assume throughout that hill-climbing search is used; this was found by Heckerman et al. to yield the best combination of accuracy and efficiency. Hill-climbing starts with an initial network, which can be empty, random, or constructed from expert knowledge. At each search step, it creates all legal variations of the current network obtainable by adding, deleting, or reversing any single arc, and scores these variations. The best variation becomes the new current network, and the process repeats until no variation improves the score.

Since on average adding an arc never decreases likelihood on the training data, using the log likelihood as the scoring function can lead to severe overfitting. This problem can be overcome in a number of ways. The simplest one, which is often surprisingly effective, is to limit the number of parents a variable can have. Another alternative is to add a complexity penalty to the log-likelihood. For example, the MDL method (Lam & Bacchus, 1994) minimizes MDL(B|D) = \frac{1}{2} m \log n - LL(B|D), where m is the number of parameters in the network. In both these approaches, the parameters of each candidate network are set by maximum likelihood, as in the known-structure case. Finally, the full Bayesian approach (Cooper & Herskovits, 1992; Heckerman et al., 1995) maximizes the Bayesian Dirichlet (BD) score

P(B_S, D) = P(B_S) P(D|B_S) = P(B_S) \prod_{i=1}^{v} \prod_{j=1}^{q_i} \frac{\Gamma(n'_{ij})}{\Gamma(n_{ij} + n'_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(n_{ijk} + n'_{ijk})}{\Gamma(n'_{ijk})}    (2)

where B_S is the structure of network B, \Gamma() is the gamma function, q_i is the number of states of the Cartesian product of x_i's parents, and r_i is the number of states of x_i. P(B_S) is the prior probability of the structure, which Heckerman et al. set to an exponentially decreasing function of the number of differing arcs between B_S and the initial (prior) network. Each multinomial distribution for x_i given a state of its parents has an associated Dirichlet prior distribution with parameters n'_{ijk}, with n'_{ij} = \sum_{k=1}^{r_i} n'_{ijk}. These parameters can be thought of as equivalent to seeing n'_{ijk} occurrences of the corresponding states in advance of the training examples. In this approach, the network parameters are not set to specific values; rather, their entire posterior distribution is implicitly maintained and used. The BD score is the result of integrating over this distribution. See Heckerman (1999) for a more detailed introduction to learning Bayesian networks.

2.2. Bayesian Network Classifiers

The goal of classification is to correctly predict the value of a designated discrete class variable y = x_v given a vector of predictors or attributes (x_1, ..., x_{v-1}). If the performance measure is accuracy (i.e., the fraction of correct predictions made on a test sample), the optimal prediction for (x_1, ..., x_{v-1}) is the class that maximizes P(y | x_1, ..., x_{v-1}) (Duda & Hart, 1973). If we have a Bayesian network for (x_1, ..., x_v), these probabilities can be computed by inference over it. In particular, the naive Bayes classifier is a Bayesian network where the class has no parents and each attribute has the class as its sole parent. Friedman et al.'s (1997) TAN algorithm uses a variant of the Chow and Liu (1968) method to produce a network where each variable has one other parent in addition to the class. More generally, a Bayesian network learned using any of the methods described above can be used as a classifier. All of these are generative models in the sense that they are learned by maximizing the log likelihood of the entire data being generated by the model, LL(B|D), or a related function. However, for classification purposes only the conditional log likelihood CLL(B|D) of the class given the attributes is relevant, where

CLL(B|D) = \sum_{d=1}^{n} \log P_B(y_d | x_{d,1}, ..., x_{d,v-1}).    (3)

Notice LL(B|D) = CLL(B|D) + \sum_{d=1}^{n} \log P_B(x_{d,1}, ..., x_{d,v-1}). Maximizing LL(B|D) can lead to underperforming classifiers, particularly since in practice the contribution of CLL(B|D) is likely to be swamped by the generally much larger (in absolute value) \log P_B(x_{d,1}, ..., x_{d,v-1}) term. A better approach would presumably be to use CLL(B|D) by itself as the objective function. This would be a form of discriminative learning, because it would focus on correctly discriminating between classes. The problem with this approach is that, unlike LL(B|D) (Equation 1), CLL(B|D) = \sum_{d=1}^{n} \log [P_B(x_{d,1}, ..., x_{d,v-1}, y_d) / P_B(x_{d,1}, ..., x_{d,v-1})] does not decompose into a separate term for each variable, and as a result there is no known closed form for the optimal parameter estimates. When the structure is known, locally optimal estimates can be found by a numeric method such as conjugate gradient with line search (Press et al., 1992), and this is what Greiner and Zhou's (2002) ELR algorithm does. When the structure is unknown, a new gradient descent is required for each candidate network at each search step. The computational cost of this is presumably prohibitive. In this paper we verify this, and propose and evaluate a simple alternative: using conditional likelihood to guide structure search, while approximating parameters by their maximum likelihood estimates.
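
The contrast between the two objectives can be made concrete with a small sketch (ours, not the authors' code). Given any routine that returns the joint P_B(x, y), the conditional term of Equation 3 is obtained by summing the joint over the classes; the toy probability table below is purely illustrative.

```python
import math

# LL vs. CLL for a model that supplies the joint P_B(x, y).  Equation 3 is
# computed by normalizing the joint over all classes:
#   P_B(y|x) = P_B(x, y) / sum_{y'} P_B(x, y').

def log_likelihood(joint, data):
    return sum(math.log(joint(x, y)) for x, y in data)

def conditional_log_likelihood(joint, data, classes):
    cll = 0.0
    for x, y in data:
        evidence = sum(joint(x, c) for c in classes)   # P_B(x), summing out y
        cll += math.log(joint(x, y) / evidence)
    return cll

# Toy joint: one binary attribute, binary class (numbers are made up).
table = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}
joint = lambda x, y: table[(x, y)]
data = [(0, 0), (1, 1), (1, 1)]
print(log_likelihood(joint, data),
      conditional_log_likelihood(joint, data, [0, 1]))
```
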
3. The BNC Algorithm

We now introduce BNC, an algorithm for learning the structure of a Bayesian network classifier by maximizing conditional likelihood. BNC is similar to the hill-climbing algorithm of Heckerman et al. (1995), except that it uses the conditional log likelihood of the class as the primary objective function. BNC starts from an empty network, and at each step considers adding each possible new arc (i.e., all those that do not create cycles) and deleting or reversing each current arc. BNC pre-discretizes continuous values and ignores missing values in the same way that TAN (Friedman et al., 1997) does.

We consider two versions of BNC. The first, BNC-nP, avoids overfitting by limiting its networks to a maximum of n parents per variable. Parameters in each network are set to their maximum likelihood values. (In contrast, full optimization of both structure and parameters, described more fully in Section 4, would set the parameters to their maximum conditional likelihood values.) The network is then scored using the conditional log likelihood CLL(B|D) (Equation 3). The rationale for this approach is that computing maximum likelihood parameter estimates is extremely fast, and, for an optimal structure, they are asymptotically equivalent to the maximum conditional likelihood ones (Friedman et al., 1997).

The second version, BNC-MDL, is the same as BNC-nP, except that instead of limiting the number of parents, the scoring function CMDL(B|D) = \frac{1}{2} m \log n - CLL(B|D) is used, where m is the number of parameters in the network and n is the training set size.
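
The following sketch conveys the flavor of BNC-2P; it is not the authors' implementation. It hill-climbs with arc additions and deletions only (the paper also tries reversals), accepts the first improving move rather than the best-scoring one, and adds a small pseudocount to the maximum likelihood counts so that unseen parent configurations do not yield zero probabilities; these are all simplifying assumptions on our part.

```python
import math
from itertools import permutations

def ml_cpts(data, variables, states, parents, alpha=1.0):
    # Maximum likelihood CPTs with a small pseudocount alpha (our addition).
    cpts = {}
    for var in variables:
        table = {}
        for row in data:
            j = tuple(row[p] for p in parents[var])
            counts = table.setdefault(j, [alpha] * len(states[var]))
            counts[row[var]] += 1
        cpts[var] = {j: [c / sum(cs) for c in cs] for j, cs in table.items()}
    return cpts

def cll(data, variables, states, parents, cpts, class_var):
    # Conditional log likelihood of the class (Equation 3).
    def log_joint(row):
        total = 0.0
        for var in variables:
            j = tuple(row[p] for p in parents[var])
            dist = cpts[var].get(j, [1.0 / len(states[var])] * len(states[var]))
            total += math.log(dist[row[var]])
        return total
    score = 0.0
    for row in data:
        logs = [log_joint(dict(row, **{class_var: c})) for c in states[class_var]]
        m = max(logs)
        log_evidence = m + math.log(sum(math.exp(l - m) for l in logs))
        score += log_joint(row) - log_evidence
    return score

def would_create_cycle(parents, new_parent, child):
    # Adding new_parent -> child creates a cycle iff child is already an
    # ancestor of new_parent.
    stack, seen = [new_parent], set()
    while stack:
        node = stack.pop()
        if node == child:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False

def bnc_2p(data, variables, states, class_var, max_parents=2):
    parents = {v: [] for v in variables}           # start from the empty network
    best = cll(data, variables, states, parents,
               ml_cpts(data, variables, states, parents), class_var)
    improved = True
    while improved:
        improved = False
        for a, b in permutations(variables, 2):    # consider the arc a -> b
            if a in parents[b]:
                parents[b].remove(a)               # candidate move: delete a -> b
            elif len(parents[b]) < max_parents and not would_create_cycle(parents, a, b):
                parents[b].append(a)               # candidate move: add a -> b
            else:
                continue
            score = cll(data, variables, states, parents,
                        ml_cpts(data, variables, states, parents), class_var)
            if score > best + 1e-9:
                best, improved = score, True       # keep the improving move
            elif a in parents[b]:
                parents[b].remove(a)               # undo a rejected addition
            else:
                parents[b].append(a)               # undo a rejected deletion
    return parents, best
```

Here `data` is a list of dicts mapping variable names to integer state indices, `states` lists the possible indices for each variable, and the returned `parents` map encodes the learned structure. A BNC-MDL variant would instead score each candidate by subtracting the penalty (1/2) m \log n from this CLL value.
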
The goal of BNC is to produce accurate class probability estimates. If correct class predictions are all that is required, a Bayesian network classifier could in principle be learned simply by using the training-set accuracy as the objective function (together with some overfitting avoidance scheme). While trying this is an interesting item for future work, we note that even in this case the conditional likelihood may be preferable, because it is a more informative and more smoothly-varying measure, potentially leading to an easier optimization problem.

4. Experiments

To gauge the performance of our BNC variants for the task of classification, we evaluated them and several alternative learning methods on the 25 benchmark datasets used by Friedman et al. (1997). These include 23 datasets from the UCI repository (Blake & Merz, 2000) and 2 extra datasets developed by Kohavi and John (1997) for feature selection experiments. We used the same experimental procedure as Friedman et al., choosing either 5-fold cross-validation or holdout testing for each domain by the same criteria. In all cases, we measured both the conditional log likelihood (CLL) of the test data given the learned model, and the model's classification error rate (ERR).

4.1. Full Structure and Parameter Optimization

We first compared BNC to optimizing both structure and parameters for conditional log likelihood, to the extent possible given the computational resources available. In this version, the parameters of each alternative network at each search step are set to their (locally) maximum conditional likelihood values by the conjugate gradient method with line search, as in Greiner and Zhou (2002). The parameters are first initialized to their maximum likelihood values, and the best stopping point for the gradient descent is then found by two-fold cross-validation (this is necessary to prevent overfitting of the parameters).

Clearly, examining hundreds of networks per search step and performing gradient descent over all of the training data for each structure is impractical. For the smallest datasets in our experiments, full optimization was possible but took two orders of magnitude longer to complete than BNC (which requires several minutes on a 1 GHz Pentium 3 processor). To scale to larger datasets, we introduced two variations. First, we used only 200 randomly-chosen samples for gradient descent, keeping the full training set to score networks and to fit the final parameters. This makes the running time of gradient descent independent of the training set size, while degrading the optimization only by increasing the variance of the objective function. Second, we imposed a maximum number of line searches for the gradient descent on each network. This takes advantage of the fact that the greatest improvement in the objective function is often obtained in the first few iterations.

These alterations sped up the full optimization approach to the point where small domains (i.e., those with few attributes) such as corral, diabetes, and shuttle-small could be completed within an hour or two. Medium-sized domains (e.g., heart, cleve, breast) required on the order of five hours of runtime. One of the larger domains, segment, was tried on a 2.4 GHz machine and was still running two days later when we terminated the process. The completed experiments revealed that neither the full optimization nor the accelerated versions produced better results than those obtained by BNC. We thus conclude that, at least for small to medium datasets, full optimization is unlikely to be worth the computational cost.

4.2. Structure Optimization

We compared BNC with the following algorithms:

• C4.5, for which we obtain only classification error results.

• The naive Bayes classifier (NB).

• The tree-augmented naive Bayes (TAN) algorithm from Friedman et al. (1997).

• HGC, the original Bayesian network structure search algorithm from Heckerman et al. (1995) that utilizes the Bayesian Dirichlet (BD) score (with Dirichlet parameters n'_{ijk} = 1 and structure prior parameter κ = 0.1).

• Two basic maximum likelihood learners, one using the MDL score (ML-MDL) and the other restricted to a maximum of two parents per node (ML-2P). We also tried higher maximum numbers of parents, but these did not produce better results.

• NB-ELR and TAN-ELR, NB and TAN with parameters optimized for conditional log likelihood as in Greiner and Zhou (2002).
Mean negative-CLL and ERR data from BNC and the competing algorithms are presented in Tables 1 and 2, respectively. One-standard-deviation bars for those data points appear in the corresponding graphs in Figures 1 and 2. In both figures, points above the y = x diagonal are datasets for which BNC achieved better results than the competing algorithm. The significance values we report for these results below were obtained using the Wilcoxon paired-sample signed-ranks test.

We also compared different versions of BNC. As with ML-2P, we found that increasing the maximum number of parents per node in BNC-nP did not yield improved results beyond the two-parent version. BNC-2P outperformed BNC-MDL at the p < .01 and p < .001 significance levels for CLL and ERR, respectively. A similar comparison showed that ML-2P likewise outperforms ML-MDL in both measures. These observations suggest that the MDL penalty is not well suited to the task of training Bayesian network classifiers, a conclusion consistent with previous results (e.g., Chickering and Heckerman (1997), Allen and Greiner (2000)) showing that MDL does not perform well for learning Bayesian networks. We thus used BNC-2P as the default version for comparison with other algorithms.

We see in Figures 1a and 2a that BNC-2P outperforms NB on both classification error (p < .06) and CLL (p < .001). BNC-2P and TAN have similar error rates (Figure 2b), but BNC outperforms TAN in conditional likelihood with significance p < .06 (Figure 1b). On these 25 datasets, NB performs about as well as C4.5 in terms of classification error. TAN, originally suggested as a superior alternative to C4.5 (Friedman et al., 1997), attained a significance level of only p < .24 against it in our experiments, whereas BNC-2P fared slightly better with p < .20. HGC, Heckerman et al.'s (1995) algorithm for learning unrestricted Bayesian networks, falls well below BNC in both the CLL and ERR metrics (p < .015, p < .006) (Figures 1c and 2c). As mentioned above, ML-MDL substantially underperforms other algorithms. ML-2P (Figures 1d and 2d) is competitive with BNC-2P in terms of conditional log likelihood, but BNC-2P outperforms it in classification accuracy in a large majority of the domains (p < .069).

Finally, in Figures 1e,f and 2e,f we see that BNC-2P outperformed NB-ELR and TAN-ELR in CLL, while obtaining similar error rates. We also applied ELR optimization to the networks produced by BNC-2P. In all domains, ELR either converged to the original maximum likelihood parameters or to worse-performing ones. This is consistent with Greiner and Zhou's (2002) observation that parameter optimization is less helpful for structures that are closer to the "true" one, and further supports the use of BNC. An additional advantage of BNC relative to ELR is that optimizing structure is arguably preferable to optimizing parameters, in that the networks it produces may give a human viewer more insight into the domain. Maximum likelihood parameters are also much more interpretable than conditional likelihood ones.

We suspect that there may be further performance differences between BNC and other algorithms that the benchmark datasets used in these experiments are too small and simple to elicit. For example, the advantage of discriminative over generative training should be larger in domains with a large number of attributes, and/or when the dependency structure of the domain is too complex to capture empirically given the available data. Conducting experiments in such domains is thus a key area for future research.
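
To illustrate the kind of fixed-structure discriminative parameter training discussed in Section 4.1 and in the ELR comparisons above, here is a hedged sketch for a naive Bayes structure: CPTs are reparameterized as softmax logits and scipy's conjugate-gradient optimizer (with a finite-difference gradient) minimizes the negative conditional log likelihood. The subsample and line-search limits, the cross-validated stopping point, and the authors' actual gradient code are not reproduced; the synthetic data is illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

# Illustrative sketch only, not the ELR implementation: discriminative parameter
# training for a fixed naive Bayes structure by minimizing the negative CLL.

def neg_cll(theta, X, y, n_classes, n_states):
    n, n_attrs = X.shape
    class_logits = theta[:n_classes]
    attr_logits = theta[n_classes:].reshape(n_attrs, n_classes, n_states)
    log_prior = class_logits - logsumexp(class_logits)                  # log P(y)
    log_cpt = attr_logits - logsumexp(attr_logits, axis=2, keepdims=True)  # log P(x_i|y)
    log_joint = np.tile(log_prior, (n, 1))        # shape (n, n_classes)
    for i in range(n_attrs):
        log_joint += log_cpt[i][:, X[:, i]].T     # add log P(x_i | y) per example
    log_cond = log_joint[np.arange(n), y] - logsumexp(log_joint, axis=1)
    return -log_cond.sum()

# Tiny synthetic example: 2 binary attributes, binary class (placeholder data).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=100)
X = (rng.random((100, 2)) < np.where(y[:, None] == 1, 0.8, 0.3)).astype(int)

n_classes, n_states = 2, 2
theta0 = np.zeros(n_classes + X.shape[1] * n_classes * n_states)
result = minimize(neg_cll, theta0, args=(X, y, n_classes, n_states), method="CG")
print("final negative CLL:", result.fun)
```
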
Table 1. Experimental results: negative conditional log likelihood.

Dataset         BNC-2P  BNC-MDL  NB     TAN    HGC    ML-2P  ML-MDL  NB-ELR  TAN-ELR
australian      0.349   0.357    0.408  0.434  0.358  0.351  0.343   0.380   0.434
breast          0.121   0.141    0.223  0.113  0.223  0.155  0.223   0.194   0.122
chess           0.135   0.116    0.298  0.190  0.137  0.263  0.272   0.151   0.134
cleve           0.494   0.510    0.461  0.509  0.477  0.418  0.523   0.421   0.509
corral          0.088   0.115    0.314  0.091  0.106  0.084  0.082   0.252   0.238
crx             0.370   0.344    0.409  0.450  0.339  0.341  0.341   0.375   0.448
diabetes        0.562   0.530    0.527  0.541  0.530  0.531  0.479   0.507   0.541
flare           0.417   0.410    0.586  0.427  0.410  0.400  0.410   0.421   0.427
german          0.605   0.551    0.536  0.588  0.534  0.563  0.547   0.530   0.588
glass           1.145   1.535    1.232  1.521  1.535  1.104  1.535   1.173   1.332
glass2          0.586   0.726    0.558  0.572  0.698  0.556  0.726   0.517   0.572
heart           0.477   0.705    0.450  0.448  0.412  0.474  0.716   0.440   0.448
hepatitis       0.481   0.410    0.516  0.310  0.489  0.410  0.474   0.365   0.310
iris            0.155   0.256    0.189  0.236  0.137  0.154  0.169   0.189   0.236
letter          0.640   1.244    1.261  0.655  1.170  0.679  1.190   1.261   0.655
lymphography    0.438   0.697    0.430  0.419  0.790  0.585  0.746   0.364   0.419
mofn-3-7-10     0.197   0.225    0.228  0.202  0.226  0.195  0.226   0.245   0.000
pima            0.562   0.530    0.527  0.541  0.530  0.531  0.479   0.513   0.541
satimage        0.453   0.651    3.120  0.751  0.780  0.510  0.780   1.046   0.665
segment         0.171   0.377    0.565  0.246  0.303  0.228  0.312   0.212   0.211
shuttle-small   0.033   0.124    0.087  0.070  0.241  0.063  0.241   0.047   0.033
soybean-large   0.200   0.804    0.441  0.171  1.541  0.251  0.874   0.220   0.172
vehicle         0.686   1.018    2.034  0.741  1.105  0.743  0.958   0.773   0.696
vote            0.126   0.137    0.618  0.169  0.183  0.121  0.144   0.129   0.150
waveform-21     0.714   0.707    0.876  0.779  0.891  0.630  0.745   0.503   0.779

Table 2. Experimental results: classification error.

Dataset         BNC-2P  BNC-MDL  NB     TAN    C4.5   HGC    ML-2P  ML-MDL  NB-ELR  TAN-ELR
australian      .1296   .1405    .1489  .1751  .1510  .1445  .1391  .1373   .1488   .1723
breast          .0425   .0518    .0245  .0351  .0610  .0245  .0668  .0245   .0339   .0351
chess           .0422   .0450    .1266  .0760  .0050  .0469  .0966  .0994   .0600   .0375
cleve           .1997   .2563    .1791  .2164  .2060  .2129  .1820  .2196   .1660   .2164
corral          .0119   .0000    .1277  .0143  .0150  .0000  .0000  .0000   .1273   .0771
crx             .1580   .1397    .1505  .1631  .1390  .1308  .1370  .1238   .1505   .1603
diabetes        .2666   .2569    .2571  .2384  .2590  .2569  .2550  .2484   .2419   .2384
flare           .1804   .1776    .2024  .1780  .1730  .1776  .1810  .1776   .1813   .1780
german          .2643   .2977    .2458  .2609  .2710  .2748  .3085  .3069   .2456   .2609
glass           .4173   .6884    .4412  .4578  .4070  .6884  .4412  .6884   .4220   .5018
glass2          .2691   .4701    .2236  .2249  .2390  .4701  .2329  .4701   .1938   .2249
heart           .1874   .4635    .1550  .1847  .2180  .1484  .2221  .4390   .1550   .1847
hepatitis       .1618   .1877    .1930  .1302  .1750  .1877  .1967  .2011   .1294   .1302
iris            .0420   .0563    .0699  .0763  .0400  .0427  .0414  .0699   .0485   .0763
letter          .1830   .3530    .3068  .1752  .1220  .3092  .1896  .2998   .3068   .1752
lymphography    .1635   .2794    .1662  .1784  .2160  .3624  .2247  .3027   .1470   .1784
mofn-3-7-10     .0859   .1328    .1328  .0850  .1600  .1328  .0859  .1328   .1367   .0000
pima            .2666   .2569    .2571  .2384  .2590  .2569  .2550  .2484   .2505   .2384
satimage        .1745   .2220    .1915  .1395  .1770  .2710  .1840  .2710   .1730   .1420
segment         .0571   .1364    .1221  .0675  .0820  .1130  .0831  .1130   .0701   .0571
shuttle-small   .0047   .0186    .0140  .0093  .0060  .1349  .0145  .1349   .0083   .0052
soybean-large   .0754   .3373    .0852  .0644  .0890  .6466  .0922  .3668   .0920   .0663
vehicle         .2922   .4478    .3892  .2718  .3170  .5077  .3451  .4001   .3453   .2727
vote            .0417   .0420    .0991  .0509  .0530  .0463  .0467  .0489   .0370   .0487
waveform-21     .2670   .3281    .2142  .2534  .3490  .4345  .2543  .2160   .1772   .2534
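
The significance levels quoted in Section 4 come from the Wilcoxon paired-sample signed-ranks test applied to per-dataset results such as the columns above; a minimal example with scipy (the numbers below are placeholders, not values from the tables):

```python
from scipy.stats import wilcoxon

# Illustrative only: the paired test behind the reported significance levels.
# The two lists would hold per-dataset scores (e.g., negative CLL) for BNC-2P
# and one competing algorithm; these numbers are placeholders.
bnc_scores        = [0.35, 0.12, 0.14, 0.49, 0.09]
competitor_scores = [0.41, 0.22, 0.30, 0.46, 0.31]

statistic, p_value = wilcoxon(bnc_scores, competitor_scores)
print(f"Wilcoxon signed-rank statistic = {statistic}, p = {p_value:.3f}")
```
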


Figure 1. BNC-2P vs. competing algorithms: negative conditional log likelihood.


Figure 2. BNC-2P vs. competing algorithms: classification error.

5. Related Work

The issue of discriminative vs. generative learning has received considerable attention in recent years (e.g., Rubinstein and Hastie (1997)). It is now well understood that, although in the infinite-data limit an unrestricted maximum-likelihood learner is necessarily optimal, at finite sample sizes discriminative training is often preferable (Vapnik, 1998). As a result, there has been considerable work developing discriminative learning algorithms for probabilistic representations (e.g., the conditional EM algorithm (Jebara & Pentland, 1999) and maximum entropy discrimination (Jaakkola et al., 1999)). This paper falls into this line of research.

Ng and Jordan (2002) show that, for very small sample sizes, generatively trained naive Bayes classifiers can outperform discriminatively trained ones. This is consistent with the fact that, for the same representation, discriminative training has lower bias and higher variance than generative training, and the variance term dominates at small sample sizes (Domingos & Pazzani, 1997; Friedman, 1997). For the dataset sizes typically found in practice, however, their results, ours, and those of Greiner and Zhou (2002) all support the choice of discriminative training.

Discriminatively-trained naive Bayes classifiers are known in the statistical literature as logistic regression (Agresti, 1990). Our work can thus be viewed as an extension of the latter to include structure learning. Discriminative learning of Bayesian network structure has received some attention in the speech recognition community (Bilmes et al., 2001). Keogh and Pazzani (1999) augmented naive Bayes by adding at most one parent to each node, with classification accuracy as the objective function.

6. Conclusions and Future Work

This paper showed that effective Bayesian network classifiers can be learned by directly searching for the structure that optimizes the conditional likelihood of the class variable. In experimental tests, BNC, the resulting classifier, produced better class probability estimates than maximum-likelihood approaches such as naive Bayes and TAN.

Directions for future work include: studying further heuristic approximations to full optimization of the conditional likelihood; developing improved methods for avoiding overfitting in discriminatively-trained Bayesian networks; extending BNC to handle missing data, undiscretized continuous variables, etc.; applying BNC to a wider variety of datasets; and extending our treatment to maximizing the conditional likelihood of an arbitrary query distribution (i.e., to problems where the variables being queried can vary from example to example, instead of always having a single designated class variable).

Acknowledgements: We are grateful to Jayant Madhavan and the reviewers for their invaluable comments, and to all those who provided the datasets used in the experiments. This work was partly supported by an NSF CAREER award to the second author.

References

Agresti, A. (1990). Categorical data analysis. New York, NY: Wiley.

Allen, T. V., & Greiner, R. (2000). Model selection criteria for learning belief nets: An empirical comparison. Proc. 17th Intl. Conf. on Machine Learning (pp. 1047–1054).

Bilmes, J., Zweig, G., Richardson, T., Filali, K., Livescu, K., Xu, P., Jackson, K., Brandman, Y., Sandness, E., Holtz, E., Torres, J., & Byrne, B. (2001). Discriminatively structured graphical models for speech recognition (Tech. Rept.). Center for Language and Speech Processing, Johns Hopkins Univ., Baltimore, MD.

Blake, C., & Merz, C. J. (2000). UCI repository of machine learning databases. Dept. Information and Computer Science, Univ. California, Irvine, CA. http://www.ics.uci.edu/~mlearn/MLRepository.html.

Chickering, M., & Heckerman, D. (1997). Efficient approximations for the marginal likelihood of Bayesian networks with hidden variables. Machine Learning, 29, 181–212.

Chow, C. K., & Liu, C. N. (1968). Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14, 462–467.

Cooper, G., & Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9, 309–347.

Domingos, P., & Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29, 103–130.

Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York, NY: Wiley.

Friedman, J. H. (1997). On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery, 1, 55–77.

Friedman, N., Geiger, D., & Goldszmidt, M. (1997). Bayesian network classifiers. Machine Learning, 29, 131–163.

Greiner, R., & Zhou, W. (2002). Structural extension to logistic regression: Discriminative parameter learning of belief net classifiers. Proc. 18th Natl. Conf. on Artificial Intelligence (pp. 167–173).

Heckerman, D. (1999). A tutorial on learning with Bayesian networks. In M. Jordan (Ed.), Learning in graphical models. Cambridge, MA: MIT Press.

Heckerman, D., Geiger, D., & Chickering, M. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20, 197–243.

Jaakkola, T., Meila, M., & Jebara, T. (1999). Maximum entropy discrimination. In Advances in neural information processing systems 12. Cambridge, MA: MIT Press.

Jebara, T., & Pentland, A. (1999). Maximum conditional likelihood via bound maximization and the CEM algorithm. In Advances in neural information processing systems 11. Cambridge, MA: MIT Press.

Keogh, E., & Pazzani, M. (1999). Learning augmented Bayesian classifiers: A comparison of distribution-based and classification-based approaches. Proc. 7th Intl. Wkshp. on AI and Statistics (pp. 225–230).

Kohavi, R., & John, G. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97, 273–324.

Lam, W., & Bacchus, F. (1994). Learning Bayesian belief networks: An approach based on the MDL principle. Computational Intelligence, 10, 269–293.

Ng, A. Y., & Jordan, M. I. (2002). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Advances in neural information processing systems 14. Cambridge, MA: MIT Press.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco, CA: Morgan Kaufmann.

Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C (2nd ed.). Cambridge, UK: Cambridge University Press.

Provost, F., & Domingos, P. (2003). Tree induction for probability-based ranking. Machine Learning, 52, 199–216.

Provost, F., & Fawcett, T. (2001). Robust classification for imprecise environments. Machine Learning, 42, 203–231.

Rubinstein, Y. D., & Hastie, T. (1997). Discriminative vs. informative learning. Proc. 3rd Intl. Conf. on Knowledge Discovery and Data Mining (pp. 49–53).

Spirtes, P., Glymour, C., & Scheines, R. (1993). Causation, prediction, and search. New York, NY: Springer.

Vapnik, V. N. (1998). Statistical learning theory. New York, NY: Wiley.