Building Classifiers Using Bayesian Networks
From: AAAI-96 Proceedings. Copyright © 1996, AAAI (www.aaai.org). All rights reserved.

Nir Friedman, Stanford University, Dept. of Computer Science, Gates Building 1A, Stanford, CA 94305-9010, [email protected]

Moises Goldszmidt*, Rockwell Science Center, 444 High St., Suite 400, Palo Alto, CA 94301, [email protected]

*Current address: SRI International, 333 Ravenswood Way, Menlo Park, CA 94025, [email protected]

Abstract

Recent work in supervised learning has shown that a surprisingly simple Bayesian classifier with strong assumptions of independence among features, called naive Bayes, is competitive with state-of-the-art classifiers such as C4.5. This fact raises the question of whether a classifier with less restrictive assumptions can perform even better. In this paper we examine and evaluate approaches for inducing classifiers from data, based on recent results in the theory of learning Bayesian networks. Bayesian networks are factored representations of probability distributions that generalize the naive Bayes classifier and explicitly represent statements about independence. Among these approaches we single out a method we call Tree Augmented Naive Bayes (TAN), which outperforms naive Bayes, yet at the same time maintains the computational simplicity (no search involved) and robustness which are characteristic of naive Bayes. We experimentally tested these approaches using benchmark problems from the U. C. Irvine repository, and compared them against C4.5, naive Bayes, and wrapper-based feature selection methods.

1 Introduction

A somewhat simplified statement of the problem of supervised learning is as follows. Given a training set of labeled instances of the form (a1, ..., an), c, construct a classifier f capable of predicting the value of c, given instances (a1, ..., an) as input. The variables A1, ..., An are called features or attributes, and the variable C is usually referred to as the class variable or label. This is a basic problem in many applications of machine learning, and there are numerous approaches to solve it based on various functional representations such as decision trees, decision lists, neural networks, decision graphs, rules, and many others.

One of the most effective classifiers, in the sense that its predictive performance is competitive with state-of-the-art classifiers such as C4.5 (Quinlan 1993), is the so-called naive Bayesian classifier (or simply naive Bayes) (Langley, Iba, & Thompson 1992). This classifier learns the conditional probability of each attribute Ai given the label C in the training data. Classification is then done by applying Bayes rule to compute the probability of C given the particular instantiation of A1, ..., An. This computation is rendered feasible by making a strong independence assumption: all the attributes Ai are conditionally independent given the value of the label C.¹ The performance of naive Bayes is somewhat surprising given that this is clearly an unrealistic assumption. Consider, for example, a classifier for assessing the risk in loan applications: it would be erroneous to ignore the correlations between age, education level, and income.

¹ By independence we mean probabilistic independence, that is, A is independent of B given C whenever Pr(A|B, C) = Pr(A|C) for all possible values of A, B and C.
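As a concrete illustration of the classification rule just described, the sketch below estimates P(C) and each P(Ai | C) by frequency counts and labels an instance with the class maximizing P(c) · ∏i P(ai | c). It is a minimal sketch for discrete attributes, not the authors' implementation; the class name NaiveBayes and the absence of smoothing are our own simplifications.

```python
from collections import Counter, defaultdict

class NaiveBayes:
    """Illustrative naive Bayes for discrete attributes (not the paper's code)."""

    def fit(self, X, y):
        n = len(y)
        self.class_count = Counter(y)
        self.class_prior = {c: cnt / n for c, cnt in self.class_count.items()}
        # cond[i][c][v] = number of instances with class c and attribute i equal to v
        self.cond = defaultdict(lambda: defaultdict(Counter))
        for row, c in zip(X, y):
            for i, v in enumerate(row):
                self.cond[i][c][v] += 1
        return self

    def predict(self, row):
        best_c, best_score = None, -1.0
        for c, prior in self.class_prior.items():
            score = prior
            for i, v in enumerate(row):
                # P(A_i = v | C = c) estimated by relative frequency
                score *= self.cond[i][c][v] / self.class_count[c]
            if score > best_score:
                best_c, best_score = c, score
        return best_c
```

Calling NaiveBayes().fit(rows, labels).predict(new_row) then returns the class with the highest product of its prior and per-attribute conditional probabilities, which is exactly the Bayes-rule computation made tractable by the independence assumption.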
This fact naturally begs the question of whether we can improve the performance of Bayesian classifiers by avoiding unrealistic assumptions about independence. In order to effectively tackle this problem we need an appropriate language and effective machinery to represent and manipulate independences. Bayesian networks (Pearl 1988) provide both. Bayesian networks are directed acyclic graphs that allow for an efficient and effective representation of the joint probability distribution over a set of random variables. Each vertex in the graph represents a random variable, and edges represent direct correlations between the variables. More precisely, the network encodes the following statement about each random variable: each variable is independent of its non-descendants in the graph given the state of its parents. Other independences follow from these ones; they can be efficiently read from the network structure by means of a simple graph-theoretic criterion. Independences are then exploited to reduce the number of parameters needed to characterize a probability distribution, and to efficiently compute posterior probabilities given evidence. Probabilistic parameters are encoded in a set of tables, one for each variable, in the form of local conditional distributions of a variable given its parents. Given the independences encoded in the network, the joint distribution can be reconstructed by simply multiplying these tables.

When represented as a Bayesian network, a naive Bayesian classifier has the simple structure depicted in Figure 1. This network captures the main assumption behind the naive Bayes classifier, namely that every attribute (every leaf in the network) is independent from the rest of the attributes given the state of the class variable (the root in the network). Now that we have the means to represent and manipulate independences, the obvious question follows: can we learn an unrestricted Bayesian network from the data that, when used as a classifier, maximizes the prediction rate?

[Figure 1: The structure of the naive Bayes network. The class variable C is the root, with a directed edge to each attribute A1, A2, ..., An.]

Learning Bayesian networks from data is a rapidly growing field of research that has seen a great deal of activity in recent years; see, for example, (Heckerman 1995; Heckerman, Geiger, & Chickering 1995; Lam & Bacchus 1994). This is a form of unsupervised learning in the sense that the learner is not guided by a set of informative examples. The objective is to induce a network (or a set of networks) that "best describes" the probability distribution over the training data. This optimization process is implemented in practice using heuristic search techniques to find the best candidate over the space of possible networks. The search process relies on a scoring metric that assesses the merits of each candidate network.

We start by examining a straightforward application of current Bayesian network techniques. We learn networks using the MDL metric (Lam & Bacchus 1994) and use them for classification. The results, which are analyzed in Section 3, are mixed: although the learned networks perform significantly better than naive Bayes on some datasets, they perform worse on others. We trace the reasons for these results to the definition of the MDL scoring metric. Roughly speaking, the problem is that the MDL scoring metric measures the "error" of the learned Bayesian network over all the variables in the domain. Minimizing this error, however, does not necessarily minimize the local "error" in predicting the class variable given the attributes. We argue that similar problems will occur with other scoring metrics in the literature.

In light of these results we limit our attention to a class of network structures that are based on the structure of naive Bayes, requiring that the class variable be a parent of every attribute. The method we examine, proposed by Geiger (1992), is a consequence of a well-known result by Chow and Liu (1968) (see also (Pearl 1988)). This approach, which we call Tree Augmented Naive Bayes (TAN), approximates the interactions between attributes using a tree structure imposed on the naive Bayes structure. We note that while this method has been proposed in the literature, it has never been rigorously tested in practice. We show that TAN maintains the robustness and computational complexity (the search process is bounded) of naive Bayes, and at the same time displays better performance in our experiments. We compare this approach with C4.5, naive Bayes, and selective naive Bayes (a wrapper-based feature subset selection method combined with naive Bayes) on a set of benchmark problems from the U. C. Irvine repository (see Section 4). These experiments show that TAN leads to significant improvement over all three of these approaches.
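The tree structure that TAN imposes on the attributes can be obtained with the Chow and Liu (1968) construction alluded to above: weight every pair of attributes by their conditional mutual information given the class, keep a maximum-weight spanning tree, and add the class as a parent of every attribute. The sketch below is our illustrative reading of that idea, not the exact procedure specified later in the paper; the function names and the use of raw empirical frequencies are our own assumptions.

```python
import math
from collections import Counter
from itertools import combinations

def cond_mutual_info(col_i, col_j, labels):
    """Empirical I(A_i; A_j | C) from three parallel lists of discrete values."""
    n = len(labels)
    p_ijc = Counter(zip(col_i, col_j, labels))
    p_ic = Counter(zip(col_i, labels))
    p_jc = Counter(zip(col_j, labels))
    p_c = Counter(labels)
    mi = 0.0
    for (ai, aj, c), cnt in p_ijc.items():
        p_xyz = cnt / n
        mi += p_xyz * math.log((p_xyz * (p_c[c] / n)) /
                               ((p_ic[(ai, c)] / n) * (p_jc[(aj, c)] / n)))
    return mi

def chow_liu_tree(X, y):
    """Maximum-weight spanning tree over attributes, weights = I(A_i; A_j | C)."""
    m = len(X[0])                      # number of attributes (assumed >= 1)
    cols = [[row[i] for row in X] for i in range(m)]
    w = {(i, j): cond_mutual_info(cols[i], cols[j], y)
         for i, j in combinations(range(m), 2)}
    in_tree, edges = {0}, []           # Prim's algorithm rooted at attribute 0
    while len(in_tree) < m:
        i, j = max(((i, j) for i in in_tree for j in range(m) if j not in in_tree),
                   key=lambda e: w[(min(e), max(e))])
        edges.append((i, j))           # directed away from the root: A_i -> A_j
        in_tree.add(j)
    # In a TAN classifier the class C is additionally made a parent of every attribute.
    return edges
```

Because the weights and the spanning tree are computed directly from counts, no search over network structures is needed, which is the sense in which TAN keeps the computational simplicity of naive Bayes.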
2 Learning Bayesian Networks

Consider a finite set U = {X1, ..., Xn} of discrete random variables, where each variable Xi may take on values from a finite domain. We use capital letters, such as X, Y, Z, for variable names, and lowercase letters x, y, z to denote specific values taken by those variables. The set of values X can attain is denoted Val(X), and the cardinality of this set is denoted ||X|| = |Val(X)|. Sets of variables are denoted by boldface capital letters X, Y, Z, and assignments of values to the variables in these sets are denoted by boldface lowercase letters x, y, z (we use Val(X) and ||X|| in the obvious way). Finally, let P be a joint probability distribution over the variables in U, and let X, Y, Z be subsets of U. X and Y are conditionally independent given Z if for all x ∈ Val(X), y ∈ Val(Y), z ∈ Val(Z), P(x | z, y) = P(x | z) whenever P(y, z) > 0.

A Bayesian network is an annotated directed acyclic graph that encodes a joint probability distribution of a domain composed of a set of random variables. Formally, a Bayesian network for U is the pair B = (G, Θ). G is a directed acyclic graph whose nodes correspond to the random variables X1, ..., Xn, and whose edges represent direct dependencies between the variables. The graph structure G encodes the following set of independence assumptions: each node Xi is independent of its non-descendants given its parents in G (Pearl 1988). The second component of the pair, namely Θ, represents the set of parameters that quantifies the network. It contains a parameter θ_{xi|Πxi} = P(xi | Πxi) for each possible value xi of Xi, and Πxi of ΠXi (the set of parents of Xi in G). B defines a unique joint probability distribution over U given by:

P_B(X1, ..., Xn) = ∏_{i=1}^{n} P(Xi | ΠXi) = ∏_{i=1}^{n} θ_{Xi|ΠXi}
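To make the factorization concrete, the toy sketch below multiplies the local table entries of a full instantiation of U. The dictionary encodings of G and Θ, the variable names, and the numbers in the example are our own illustration, not anything specified in the paper.

```python
def joint_probability(assignment, parents, cpt):
    """P_B(x1, ..., xn) = product over i of theta_{xi | parents(xi)}.

    assignment: dict variable -> value, a full instantiation of U
    parents:    dict variable -> tuple of parent variables (the structure G)
    cpt:        dict variable -> {(value, parent_values): probability} (the parameters Theta)
    """
    prob = 1.0
    for var, value in assignment.items():
        parent_values = tuple(assignment[p] for p in parents[var])
        prob *= cpt[var][(value, parent_values)]
    return prob

# Naive Bayes as a special case: C is the root and the only parent of every attribute.
parents = {"C": (), "A1": ("C",), "A2": ("C",)}
cpt = {
    "C":  {("yes", ()): 0.3, ("no", ()): 0.7},
    "A1": {("t", ("yes",)): 0.8, ("f", ("yes",)): 0.2,
           ("t", ("no",)):  0.1, ("f", ("no",)):  0.9},
    "A2": {("t", ("yes",)): 0.5, ("f", ("yes",)): 0.5,
           ("t", ("no",)):  0.4, ("f", ("no",)):  0.6},
}
print(joint_probability({"C": "yes", "A1": "t", "A2": "f"}, parents, cpt))  # 0.3 * 0.8 * 0.5 = 0.12
```

With the naive Bayes structure of Figure 1 this product reduces to P(C) times the product of the P(Ai | C) terms used for classification above.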