Bayesian Network Classifiers in a High Dimensional Framework
Tatjana Pavlenko
Dept. of Engineering, Physics and Mathematics
Mid Sweden University
851 70 Sundsvall, Sweden
[email protected]

Dietrich von Rosen
Dept. of Biometry and Informatics
Swedish University of Agricultural Sciences
Box 480, SE-750 07 Uppsala, Sweden
[email protected]

Abstract

We present a growing dimension asymptotic formalism. The perspective in this paper is classification theory, and we show that the formalism can accommodate probabilistic network classifiers, including the naive Bayes model and its augmented version. When represented as a Bayesian network, these classifiers have an important advantage: the corresponding discriminant function turns out to be a specialized case of a generalized additive model, which makes it possible to obtain closed form expressions for the asymptotic misclassification probabilities, used here as a measure of classification accuracy. Moreover, in this paper we propose a new quantity for assessing the discriminative power of a set of features, which is then used to elaborate the augmented naive Bayes classifier. The result is a weighted form of the augmented naive Bayes that distributes weights among the sets of features according to their discriminative power. We derive the asymptotic distribution of the sample based discriminative power and show that it is seriously overestimated in the high dimensional case. We then apply this result to find the type of weighting that is optimal in the sense of minimum misclassification probability.

1 INTRODUCTION

Bayesian networks (also referred to as probabilistic networks or belief networks) have been used in many areas as convenient tools for presenting dependence structures. A few examples are [2,8,17], also including artificial intelligence research, where these models have by now established their position as valuable representations of uncertainty.

Bayesian network models offer complementary advantages in classification theory and pattern recognition tasks, such as the ability to deal effectively with uncertain and high dimensional examples. Several approaches have recently been proposed for combining the network models and classification (discrimination) methods; see for instance [10,12]. In the current paper we also combine the network approach with discrimination and provide some new ideas which are of principal interest.

The focus of this paper is on the classification task, where Bayesian networks are designed and trained essentially to answer the question of accurately classifying yet unseen examples. Assuming each class to be modeled by a probability distribution, quantifying a Bayesian network amounts to assigning these distributions (densities) to the network's variables conditional on their direct predecessors in the graph. The problem of assessing a distribution, in statistics as well as in pattern recognition, can be viewed as a problem of estimating a set of parameters associated with a specific parametric family of distributions that is given a priori. However, applying parametric estimation techniques naturally raises the following question: What can we say about the accuracy of the induced classifier, given that the network structure is known or assumed to be correct and we are training on a possibly small sample of observations? Clearly, in such a case classification will be affected by the inaccuracy of the involved estimates; the resulting classifier will especially degrade in performance in a high dimensional setting, i.e. when the size of the training data set is comparable to the dimensionality of the observations.
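To make this degradation concrete, the following is a small self-contained simulation sketch. It is ours, not the authors', and every name and parameter choice in it (the function plugin_error, the fixed class separation, the sample sizes) is an illustrative assumption: a plug-in Gaussian classifier whose class means must be estimated from a fixed-size training sample loses accuracy as the dimension p grows toward and past the sample size, even though the true (Bayes) error is held fixed.

```python
# Illustrative sketch (not from the paper): plug-in classification with
# estimated class means degrades as dimension p approaches sample size n.
import numpy as np

rng = np.random.default_rng(0)

def plugin_error(p, n_per_class, n_test=2000):
    """Test error of a plug-in rule that assigns a point to the closer
    estimated class mean (identity covariance, equal priors)."""
    delta = 2.0 / np.sqrt(p)              # keeps ||mu1 - mu0|| = 2 for every p
    mu0, mu1 = np.zeros(p), np.full(p, delta)
    X0 = rng.normal(mu0, 1.0, (n_per_class, p))   # training sample, class 0
    X1 = rng.normal(mu1, 1.0, (n_per_class, p))   # training sample, class 1
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)     # estimated class means
    T0 = rng.normal(mu0, 1.0, (n_test, p))        # test points, class 0
    T1 = rng.normal(mu1, 1.0, (n_test, p))        # test points, class 1

    def classify(T):
        d0 = ((T - m0) ** 2).sum(axis=1)
        d1 = ((T - m1) ** 2).sum(axis=1)
        return (d1 < d0).astype(int)              # 1 means "class 1"

    return 0.5 * (classify(T0).mean() + (1 - classify(T1)).mean())

for p in (5, 50, 500):                            # n fixed, p grows
    print(f"p={p:4d}  test error={plugin_error(p, n_per_class=50):.3f}")
```

Since the Mahalanobis separation between the classes is held at 2, the Bayes error is about 16% for every p; the rising test error at large p is driven entirely by the estimation noise accumulated over the p coordinates of the sample means.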
This degradation effect is known as the "curse of dimensionality", a phrase due to Bellman [1], and one of our primary goals is to explore this effect when using Bayesian network classifiers. In order to tackle this problem effectively, we employ a growing dimension asymptotic formalism [15], essentially designed for multivariate problems, and show how it can accommodate such classifiers as naive Bayes [3] and its augmented version [6]. The technique we develop in this study exploits the main assumption encoded in these networks, namely that each feature variable (each set of features, in the augmented naive Bayes) is independent of the rest of the features given the root of the network (viewed in the present setting as a class variable). This provides a factorized representation of the class distributions, and we show that the corresponding discriminant for both types of network turns out to be a specialized case of the generalized additive models [9]. The main advantage here is that, given the additive structure of the discriminant, a Gaussian approximation can be applied to its distribution in a high dimensional setting. As we show in the present study, this in turn yields explicit expressions for the misclassification probabilities, which are used as a measure of classification accuracy.

In addition, we introduce a new technique for measuring the discriminative power of a given set of features, which is based on the cross-entropy distance between the classes. This measure is then used to extend the augmented naive Bayes classifier to deal with sets of features with a priori different discriminative power. The extended classifier in this approach is naturally a weighted version of the augmented naive Bayes. The analysis of the performance of this modified classifier in a high dimensional setting, and the choice of weighting that is optimal in the sense of minimum misclassification probability, are the further subjects of the current study.

2 OVERVIEW OF BAYESIAN NETWORK CLASSIFIERS

In the classification problem, the goal is to build an accurate classifier from a given data set $(x, C)$, where each $x$ consists of $p$ feature variables $x_1, \ldots, x_p$ together with a value for a class variable $C$. In the sequel we assume the feature variables to be continuous, which makes it possible to use various parametric models for the joint probability density $f(x, C; \theta)$ of a random sample $(x, C)$, where $\theta \in \Theta$ is the vector of (unknown) parameters needed to describe the $j$th class density, $j = 1, \ldots, \nu$.

In the probabilistic framework, the main goal is to evaluate the classification posterior distribution $\Pr(C = j \mid x; \theta)$. We can further distinguish two different approaches for estimating this distribution: in the diagnostic paradigm one tries to model $\Pr(C \mid x; \theta)$ directly, while in the sampling paradigm the interest centers on $f^j(x; \theta)$, and we have the factorization $f(x, C; \theta) = \pi_j f^j(x; \theta)$, with the prior probabilities $\pi_j$ for the classes assumed to be known or estimated from the proportions in the training set, i.e. $\pi_j = n_j / \sum_k n_k$. In this study we will concentrate on the sampling paradigm, which means that we assume a stationary underlying joint distribution (density) $f^j(x) = f^j(x; \theta)$ over the $p$ feature variables $x_1, \ldots, x_p$. The desired posterior probability can then be computed by using Bayes' formula, which tells us that $\Pr(C = j \mid x; \theta) \propto f^j(x; \theta)\,\pi_j$. It is relatively simple to show that the rule that classifies to the largest posterior probability will give the smallest expected misclassification rate. This rule is known as the Bayes classifier.

As a model family for the underlying class distributions we consider in this study a finite number of Bayesian network models. As a quick synopsis: a Bayesian network [17] is the graphical representation of a set of independence assumptions holding among the feature variables. These independencies are encoded in a directed acyclic graph, where nodes correspond to the features $x_1, \ldots, x_p$ and the arcs represent the direct probabilistic influence between their incident nodes. Absence of an arc between two variables means that these variables do not influence each other directly, and hence are (conditionally) independent. In this way a Bayesian network specifies a complete joint probability distribution over all feature variables, which can then be used for producing classifiers.

A well-known example of a sampling Bayesian network classifier is the naive Bayesian classifier [3,6] (or simply naive Bayes), which is a network with one arc from the class node to each of the feature nodes. This graph structure represents the assumption that the feature variables are conditionally independent of each other given the class, from which it is immediate that $f^j(x; \theta) = \prod_{i=1}^{p} f_i^j(x_i; \theta)$, $j = 1, \ldots, \nu$. Hence, using Bayes' formula we get $\Pr(C = j \mid x) \propto \pi_j \prod_{i=1}^{p} f_i^j(x_i; \theta)$. It is worth noting that naive Bayes does surprisingly well when only a finite sample of training observations is available; this behavior has been noted in [14]. The naive Bayes approach also turns out to be effective when studying the curse of dimensionality effect; see for instance [4], where the bias and variance induced on the class density estimation by the naive Bayes decomposition, and their effect on classification, have been studied in a high dimensional setting.
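As a concrete illustration of this factorization, here is a minimal Gaussian naive Bayes sketch. It is a hypothetical implementation under our own assumptions (Gaussian class-conditional densities $f_i^j$ fitted by sample moments; all function and variable names are ours), not something the paper prescribes. It estimates the priors $\pi_j$ from the class proportions, fits one univariate density per feature and class, and classifies to the largest posterior, i.e. it applies the Bayes rule to the factorized densities.

```python
# Minimal Gaussian naive Bayes sketch (illustrative; names are ours).
import numpy as np
from scipy.stats import norm

def fit_naive_bayes(X, y):
    """Estimate priors pi_j = n_j / sum_k n_k and per-feature Gaussian
    parameters: one (mean, std) pair per feature and class."""
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    stds = np.array([X[y == c].std(axis=0, ddof=1) for c in classes])
    return classes, priors, means, stds

def predict(X, classes, priors, means, stds):
    """Classify to the largest posterior:
    Pr(C = j | x) proportional to pi_j * prod_i f_i^j(x_i)."""
    # Summing log densities over the p features is the factorized
    # (naive Bayes) class-conditional density, in log space.
    log_post = np.stack(
        [np.log(priors[j]) + norm.logpdf(X, means[j], stds[j]).sum(axis=1)
         for j in range(len(classes))],
        axis=1)
    return classes[np.argmax(log_post, axis=1)]
```

Working in log space avoids numerical underflow of the product over $p$ factors, which matters precisely in the high dimensional regime considered here.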
However, the structural simplicity of the naive Bayes classifier calls for developing better alternatives. In [6] the problem was approached by augmenting the naive Bayes model, allowing additional arcs between the features that capture possible dependencies among them. This approach is called augmented naive Bayes; it approximates interactions between features using a tree structure imposed on the naive Bayes structure. Classification accuracy has further been studied in