Bayesian Model Averaging Naive Bayes (BMA-NB): Averaging Over an Exponential Number of Feature Models in Linear Time
Total Page:16
File Type:pdf, Size:1020Kb
Bayesian Model Averaging Naive Bayes (BMA-NB): Averaging over an Exponential Number of Feature Models in Linear Time Ga Wu Scott Sanner Rodrigo F.S.C. Oliveira Australian National University NICTA & Australian National University University of Pernambuco Canberra, Australia Canberra, Australia Recife, Brazil [email protected] [email protected] [email protected] Abstract to different feature sets, BMA has the advantage over feature selection that its predictions tend to have lower variance in Naive Bayes (NB) is well-known to be a simple but ef- comparison to any single model. fective classifier, especially when combined with fea- ture selection. Unfortunately, feature selection methods While BMA has had some notable successes for av- are often greedy and thus cannot guarantee an optimal eraging over tree-based models such as the context-tree feature set is selected. An alternative to feature selec- (CTW) weighting (Willems, Shtarkov, and Tjalkens 1995) tion is to use Bayesian model averaging (BMA), which algorithm for sequential prediction, or prunings of decision computes a weighted average over multiple predictors; trees (Oliver and Dowe 1995), it has not been previously when the different predictor models correspond to dif- applied to averaging over all possible feature sets (i.e., the ferent feature sets, BMA has the advantage over feature exponentially sized power set) for classifiers such as NB, selection that its predictions tend to have lower vari- which would correspond to averaging over all nodes of the ance on average in comparison to any single model. feature subset lattice — a problem which defies the simpler In this paper, we show for the first time that it is pos- recursions that can be used for BMA on tree structures. sible to exactly evaluate BMA over the exponentially- sized powerset of NB feature models in linear-time in In this paper, we show the somewhat surprising result the number of features; this yields an algorithm about that it is indeed possible to exactly evaluate BMA over as expensive to train as a single NB model with all fea- the exponentially-sized powerset of NB feature models in tures, but yet provably converges to the globally optimal linear-time in the number of features; this yields an algo- feature subset in the asymptotic limit of data. We evalu- rithm no more expensive to train than a single NB model ate this novel BMA-NB classifier on a range of datasets with all features, but yet provably converges to the globally showing that it never underperforms NB (as expected) optimal feature subset in the asymptotic limit of data. and sometimes offers performance competitive (or su- While previous work has proposed algorithms for BMA perior) to classifiers such as SVMs and logistic regres- sion while taking a fraction of the time to train. over the feature subset lattice in generalized linear models such as logistic regression and linear regression (Raftery 1996; Raftery, Madigan, and Hoeting 1997; Hoeting et al. Introduction 1999), all of these methods either rely on biased approxi- mations or use MCMC methods to sample the model space Naive Bayes (NB) is well-known to be a simple but effective with no a priori bound on the mixing time of such Markov classifier in practice owing to both theoretical justifications Chains; in contrast our proposal for NB does not approxi- and extensive empirical studies (Langley, Iba, and Thomp- mate and provides an exact answer in linear-time in the num- son 1992; Friedman 1997; Domingos and Pazzani 1997; ber of features.1 Ng and Jordan 2001). At the same time, in domains such as text classification that tend to have many features, it is We evaluate our novel BMA-NB classifier on a range of also noted that NB often performs best when combined with datasets showing that it never underperforms NB (as ex- feature selection (Manning, Raghavan, and Schutze¨ 2008; pected) and sometimes offers performance competitive (or Fung, Morstatter, and Liu 2011). superior) to classifiers such as SVMs and logistic regression while taking a fraction of the time to train. Unfortunately, tractable feature selection methods are in general greedy (Guyon and Elisseeff 2003) and thus cannot guarantee that an optimal feature set is selected — even for Preliminaries the training data. An interesting alternative to feature selec- We begin by providing a graphical model perspective of tion is to use Bayesian model averaging (BMA) (Hoeting et Naive Bayes (NB) classifiers and Bayesian Model Averag- al. 1999), which computes a weighted average over multiple ing (BMA) that simplify the subsequent presentation and predictors; when the different predictor models correspond derivation of our BMA-NB algorithm. Copyright c 2015, Association for the Advancement of Artificial 1At AAAI-15, it was pointed out that Dash and Cooper (2002) Intelligence (www.aaai.org). All rights reserved. arrived at similar results using a different derivational approach. Bayesian Model Averaging (BMA) If we had a class of models m 2 f1::Mg, each of which yi yi m yi specified a different way to generate the feature probabili- ties P (xijyi; m), then we can write down a prediction for a 2 f generative classification model m as follows: xi,k k xi,k xi k=1..K k=1..K P (y; xjm; D) = P (xjm; y; D)P (yjD): (4) i=1..N+1 i=1..N+1 i=1..N+1 While we could select a single model m according to some predefined criteria, an effective way to combine all (a) (b) (c) models to produce a lower variance prediction than any sin- gle model is to use Bayesian model averaging (Hoeting et Figure 1: Bayesian graphical model representations of (a) Naive al. 1999) and compute a weighted average over all models: Bayes (NB), (b) Bayesian model averaging (BMA) for generative X models such as NB, and (c) BMA-NB. Shaded nodes are observed. P (y; xjD) = P (xjm; y; D)P (yjD)P (mjD): (5) m Naive Bayes (NB) Classifiers Here we can view P (mjD) as the weight of model m and Assume we have N i.i.d. data points D = f(yi; xi)g (1 ≤ observe that it provides a convex combination of model pre- P i ≤ N) where yi 2 f1::Cg is the discrete class label for da- dictions, since by definition: m P (mjD) = 1. BMA has tum i and xi = (xi;1; : : : ; xi;K ) is the corresponding vector the convenient property that in the asymptotic limit of data of K features for datum i. In a probabilistic setting for clas- D, as jDj ! 1, P (mjD) ! 1 for the single best model m sification, given a new feature vector xN+1 (abbreviated x), (assuming no models are identical) meaning that asymptoti- our objective is to predict the highest probability class label cally, BMA can be viewed as optimal model selection. yN+1 (abbreviated y) based on our observed data. That is, To compute P (mjD), we can apply Bayes rule to obtain for each y, we want to compute N N+1 Y Y P (mjD) / P (Djm)P (m) = P (m) P (yi; xijm) P (yjx;D) / P (y; x;D) = P (yi; xi); (1) i=1 i=1 N Y where the first proportionality is due to the fact that D and = P (m) P (yi)P (xijm; yi): x are fixed (so the proportionality term is a constant) and i=1 QN+1 the i=1 owes to our i.i.d. assumption for D (we likewise (6) assume the N+1st data point is also drawn i.i.d.) and the fact Substituting (6) into (5) we arrive at the final simple form, that we can absorb (y; x) = (yN+1; xN+1) into the product by changing the range of i to include N + 1. where we again absorb P (xjm; yN+1;D)P (yjD) into the QN+1 The Naive Bayes independence assumption posits that all renamed N + 1st factor of i=1 : x ≤ ≤ conditionally independent features i;k (1 k K) are N+1 given the class yi. If we additionally assume that P (yi) and X Y P (y; xjD) / P (m) P (yi)P (xijm; yi): (7) P (xi;kjyi) are set to their maximum likelihood (empirical) estimates given D, we arrive at the following standard result m i=1 for NB in (3): We pause to make two simple observations on BMA be- N+1 K fore proceeding to our main result combining BMA and Y Y arg max P (yjx;D) = arg max P (yi) P (xi;kjyi) (2) NB. First, as for NB, we can represent the underlying joint y y N+1 i=1 k=1 distribution for (7) as the graphical model in Figure 1(b) K where in this case m is a latent (unobserved) variable that Y we marginalize over. Second, unlike the previous result for = arg max P (yN+1) P (xN+1;kjyN+1) (3) yN+1 NB, we observe that with BMA, we cannot simply absorb k=1 the data likelihood into the proportionality constant since it QN In (3), we dropped all factors in the i=1 from (2) since depends on the unobserved m and is hence non-constant. these correspond to data constants in D that do not affect the arg maxy. Main Result: BMA-NB Derivation Later when we derive BMA-NB, we’ll see that it is not QN Consider a naive combination of the previous sections on possible to drop the i=1 since latent feature selection vari- BMA and NB if we consider each model m to correspond to ables will critically depend on D.