Bagging and the Bayesian Bootstrap

Merlise A. Clyde and Herbert K. H. Lee
Institute of Statistics & Decision Sciences, Duke University, Durham, NC 27708

Abstract

Bagging is a method of obtaining more robust predictions when the model class under consideration is unstable with respect to the data, i.e., when small changes in the data can cause the predicted values to change significantly. In this paper, we introduce a Bayesian version of bagging based on the Bayesian bootstrap. The Bayesian bootstrap resolves a theoretical problem with ordinary bagging and often results in more efficient estimators. We show how model averaging can be combined within the Bayesian bootstrap and illustrate the procedure with several examples.

1 INTRODUCTION

In a typical prediction problem, there is a trade-off between bias and variance, in that after a certain amount of fitting, any increase in the precision of the fit will cause an increase in the prediction variance on future observations. Similarly, any reduction in the prediction variance causes an increase in the expected bias for future predictions. Breiman (1996a) introduced bagging as a method of reducing the prediction variance without affecting the prediction bias.

Bagging is short for "Bootstrap AGGregatING", which describes how it works. The idea is straightforward. Instead of making predictions from a single model fit to the observed data, bootstrap samples are taken of the data, the model is fit to each sample, and the predictions are averaged over all of the fitted models to get the bagged prediction. Breiman explains that bagging works well for unstable modeling procedures, i.e., those for which the conclusions are sensitive to small changes in the data, such as neural networks, classification and regression trees (CART), and variable selection for regression (Breiman, 1996b). He also gives a theoretical explanation of how bagging works, demonstrating the reduction in mean-squared prediction error for unstable procedures.

In this paper, we consider a Bayesian version of bagging based on Rubin's Bayesian bootstrap (1981). This overcomes a technical difficulty with the usual bootstrap in bagging, and it leads to a reduction in variance over the bootstrap for certain classes of estimators. Another Bayesian approach for dealing with unstable procedures is Bayesian model averaging (BMA) (Hoeting et al., 1999). In BMA, one fits several models to the data and makes predictions by taking the weighted average of the predictions from each of the fitted models, where the weights are posterior probabilities of models. We show that the Bayesian bootstrap and Bayesian model averaging can be combined. We illustrate Bayesian bagging in a regression problem with variable selection and a highly influential data point, a classification problem using logistic regression, and a CART model.

2 BOOTSTRAPPING

Suppose we have a sample of size n, with observed data Z_1, ..., Z_n, where Z_i is a vector in R^{p+1} and the Z_i are independent, identically distributed realizations from some distribution F ∈ 𝓕. Here 𝓕 is the class of all distribution functions on R^{p+1}. The parameter of interest is a functional T(F), where T is a mapping from 𝓕 to R, or to R^k in the case of vector-valued functions; for example, the (p+1)-dimensional mean of the distribution, µ = ∫ z dF.

In Efron's bootstrap and the Bayesian bootstrap, the class of distribution functions is restricted to a parametric model by restricting estimation to F_n ∈ 𝓕_n, where F_n is represented as

    F_n = Σ_{i=1}^n ω_i δ_{Z_i},

δ_{Z_i} is a degenerate probability measure at Z_i, and ω_i is the weight associated with Z_i, with ω_i ≥ 0 and Σ_i ω_i = 1.

In the bootstrap, the distribution of T(F) is obtained by repeatedly generating bootstrap replicates, where one bootstrap replicate is a sample of size n drawn with replacement from Z_1, ..., Z_n. The bootstrap distribution of T(F) is based on considering all possible bootstrap replications T(F_n^{(r)}), where ω_i^{(r)} in F_n^{(r)} corresponds to the proportion of times Z_i appears in the rth bootstrap replicate, with ω_i^{(r)} taking values in {0, 1/n, ..., n/n}. For example, the bootstrap mean for the rth replicate is calculated as Ẑ^{(r)} = Σ_{i=1}^n ω_i^{(r)} Z_i, and the bagged estimate of the mean is the average over all bootstrap replicates, Σ_{r=1}^R Ẑ^{(r)}/R, where R is the number of bootstrap replicates. Of course, one cannot usually consider all possible bootstrap samples, of which there are (2n−1 choose n), and bagging is often based on a much smaller set of bootstrap replicates, say 25 to 50 (Breiman, 1996a).
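As a concrete illustration of this weight representation (a minimal sketch, not code from the paper; the data matrix Z below is simulated purely for illustration), each ordinary bootstrap replicate can be expressed through its weight vector ω^{(r)}, and the bagged estimate of the mean is obtained by averaging over replicates:

```python
import numpy as np

rng = np.random.default_rng(0)
n, R = 30, 50                            # sample size and number of bootstrap replicates
Z = rng.normal(size=(n, 3))              # simulated data: n draws of a (p+1)-vector

T_reps = np.empty((R, Z.shape[1]))
for r in range(R):
    # Ordinary bootstrap replicate expressed through its weights:
    # omega_i^(r) = (number of times Z_i is drawn) / n, taking values in {0, 1/n, ..., n/n}.
    counts = rng.multinomial(n, np.full(n, 1.0 / n))
    omega = counts / n
    T_reps[r] = omega @ Z                # T(F_n^(r)) for the mean functional

bagged_mean = T_reps.mean(axis=0)        # bagged estimate: average over the R replicates
```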
3 BAYESIAN BOOTSTRAP

The Bayesian bootstrap was introduced by Rubin (1981) as a Bayesian analog of the original bootstrap. Instead of drawing weights ω_i from the discrete set {0, 1/n, ..., n/n}, the Bayesian approach treats the vector of weights ω in F_n as unknown parameters and derives a posterior distribution for ω, and hence for T(F). Rubin (1981) used a non-informative prior, Π_{i=1}^n ω_i^{−1}, which, when combined with the multinomial likelihood for Z, leads to a Dirichlet(1, ..., 1) distribution for the posterior distribution of ω. The posterior distribution of T(F) is estimated by Monte Carlo methods: generate ω^{(b)} from a Dirichlet(1, ..., 1) distribution and then calculate T(F_n^{(b)}) for each sample ω^{(b)}. The average of T(F_n^{(b)}) over the samples corresponds to the Monte Carlo estimate of the posterior mean of T(F) and can be viewed as a Bayesian analog of bagging.

Although there are differences in interpretation, operationally the ordinary bootstrap and Bayesian bootstrap differ primarily in how the values of ω are drawn. As Rubin (1981) shows, the expected values of the weights ω are equal to 1/n under both bootstrap methods. As the expectations of the weights are the same, both ordinary bagging and Bayesian bagging will have the same expectation for functions T(F) that are linear in the weights ω, such as means. There are situations, which will be discussed later, where the ordinary bootstrap distribution is not well defined, and the two approaches may yield different answers. Both approaches also lead to the same correlation between weights. However, the variability of the weights ω under the ordinary bootstrap is (n+1)/n times the variance of ω under the Bayesian bootstrap. For linear functionals, the variance of the estimate under the Bayesian bootstrap is therefore strictly less than the variance under the ordinary bootstrap. This applies directly for CART models; for other estimators that are not necessarily linear in the weights, in our experience the Bayesian bootstrap has also empirically exhibited less variability than the ordinary bootstrap. We illustrate this reduction in the rat liver example.
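Operationally, the only change from the ordinary bootstrap is that the weight vector is drawn from a Dirichlet(1, ..., 1) distribution rather than formed from rescaled multinomial counts. A minimal sketch along the lines of the previous one (again with simulated data, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, B = 30, 200                           # sample size and number of Monte Carlo draws
Z = rng.normal(size=(n, 3))              # simulated data, as before

T_draws = np.empty((B, Z.shape[1]))
for b in range(B):
    omega = rng.dirichlet(np.ones(n))    # omega^(b) ~ Dirichlet(1, ..., 1); E[omega_i] = 1/n
    T_draws[b] = omega @ Z               # T(F_n^(b)) for the mean functional

bb_mean = T_draws.mean(axis=0)           # Monte Carlo estimate of the posterior mean of T(F)
```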
4 BAGGING VIA THE BAYESIAN BOOTSTRAP

Bagging is used primarily in prediction problems, and with that in mind we partition each Z_i into a response Y_i (which could be continuous or categorical) and a p-dimensional vector of input variables x_i for predicting Y. In matrix form, the data are Y = (y_1, ..., y_n)′ with an n × p matrix of covariates X with rows x_i.

Under the nonparametric model for the data, the only unknown quantity is the distribution F, or the parameters ω under the restricted class of distribution functions. The posterior distribution on ω induces a posterior distribution on a functional T(F) for predicting Y given X. We first consider the case where interest is in regression-type estimates, then extend the procedure to allow for variable selection, nonlinear functions, categorical responses and model uncertainty.

4.1 LINEAR REGRESSION

For making predictions based on linear combinations of Y, we consider functionals of the form

    β̂ = T(F) = argmin_β ∫ ||Y − Xβ||² dF                         (1)
              = argmin_β Σ_{i=1}^n ω_i (y_i − x_i β)²
              = (X′WX)^{−1} X′WY,

where W is a diagonal matrix of weights ω. The values of β̂ that minimize (1) with the restriction to 𝓕_n are equivalent to weighted least squares estimates using weights ω.

Operationally, Bayesian bagging (BB) proceeds by taking a sample ω^{(b)} from a Dirichlet(1, ..., 1) distribution, and then using weighted least squares to obtain

    β̂^{(b)} = (X′W^{(b)}X)^{−1} X′W^{(b)}Y,

where W^{(b)} is a diagonal matrix of weights ω^{(b)}. This is repeated for b = 1, ..., B, where B is the total number of Monte Carlo samples, to obtain the posterior distribution of β̂, and the posterior distribution of Ŷ = Xβ̂ or other functions of ω. Let Ŷ^{(b)} = Xβ̂^{(b)}. The BB estimate of Ŷ given X is the Monte Carlo average

    Ŷ = (1/B) Σ_{b=1}^B Ŷ^{(b)}.                                  (2)
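The following sketch (illustrative only; the design matrix and response are simulated and are not part of the paper) carries out this procedure, passing each Dirichlet draw to a weighted least squares fit and averaging the fitted values as in (2):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, B = 100, 3, 200
X = rng.normal(size=(n, p))                          # simulated covariates
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=n)

beta_draws = np.empty((B, p))
for b in range(B):
    omega = rng.dirichlet(np.ones(n))                # omega^(b) ~ Dirichlet(1, ..., 1)
    XtW = X.T * omega                                # X' W^(b) without forming the diagonal matrix
    # Weighted least squares: beta^(b) = (X' W^(b) X)^{-1} X' W^(b) Y
    beta_draws[b] = np.linalg.solve(XtW @ X, XtW @ Y)

Y_hat_draws = beta_draws @ X.T                       # row b holds Y_hat^(b) = X beta^(b)
Y_hat_bb = Y_hat_draws.mean(axis=0)                  # BB estimate (2): Monte Carlo average
```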
4.2 VARIABLE SELECTION

While linear regression is a stable procedure, where bagging does not lead to substantial improvements, variable selection is viewed as being unstable. The BB procedure is modified to combine model selection with parameter estimation, where for each sample ω^{(b)} one selects a model M^{(b)} using an appropriate model selection criterion and then estimates β̂_M^{(b)} under model M^{(b)}. The posterior distribution for Ŷ and the posterior mean are now based on multiple models, where the BB estimate of Y given X is

    Ŷ = (1/B) Σ_{b=1}^B Ŷ_M^{(b)},

where Ŷ_M^{(b)} = X_{M^{(b)}} β̂_M^{(b)} is the prediction using the design matrix X_{M^{(b)}} based on model M^{(b)}. Although not equivalent to Bayesian model averaging as described in Hoeting et al. (1999), the above estimator is a variant of model averaging, as the bootstrap aggregation results in averaging over different models.

4.3 MODEL AVERAGING

Bayesian model averaging can be introduced into Bayesian bootstrap estimates by replacing (1) by the Bayes risk for squared error loss with model uncertainty. For sample b, the BMA estimate of Ŷ^{(b)} is the weighted average

    Ŷ_BMA^{(b)} = Σ_M π(M | X, Y, ω^{(b)}) Ŷ_M^{(b)},             (3)

where π(M | X, Y, ω^{(b)}) is the "posterior probability" of model M, and the predicted values Ŷ_M^{(b)} = Xβ̂_M^{(b)} and coefficients β̂_M^{(b)} are calculated given model M, using weights ω^{(b)}. These are combined to form the BB BMA predictions,

    Ŷ = (1/B) Σ_{b=1}^B Ŷ_BMA^{(b)},

which incorporate any instability in the model weights due to changes in the data. Posterior model probabilities may be based on the BIC (Schwarz, 1978),

    BIC_M = RSS_M + p_M log(n),                                   (4)

    π(M | X, Y, ω^{(b)}) = exp(−0.5 BIC_M) / Σ_M exp(−0.5 BIC_M), (5)

where RSS_M is the residual sum of squares under model M using weighted least squares and p_M is the number of parameters under model M. Of course, other prior specifications may lead to other posterior model probabilities; however, for many cases BIC does lead to consistent model selection and is a useful default (Hoeting et al., 1999).
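A minimal sketch of combining the Dirichlet draws with BIC-based model weights as in (3)-(5). The assumptions here are purely illustrative and not from the paper: simulated data, an all-subsets model space over a handful of candidate predictors, and a weighted residual sum of squares rescaled so that the weights sum to n.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
n, p, B = 100, 4, 200
X = rng.normal(size=(n, p))
Y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Model space: all non-empty subsets of the p candidate predictors.
models = [cols for k in range(1, p + 1) for cols in combinations(range(p), k)]

Y_hat_bb = np.zeros(n)
for b in range(B):
    omega = rng.dirichlet(np.ones(n))                # Bayesian bootstrap weights
    preds, bics = [], []
    for cols in models:
        Xm = X[:, list(cols)]
        XtW = Xm.T * omega
        beta = np.linalg.solve(XtW @ Xm, XtW @ Y)    # weighted least squares under model M
        resid = Y - Xm @ beta
        rss = n * np.sum(omega * resid**2)           # weighted RSS (rescaling is a convention)
        bics.append(rss + len(cols) * np.log(n))     # BIC_M as in (4)
        preds.append(Xm @ beta)
    bics = np.asarray(bics)
    w = np.exp(-0.5 * (bics - bics.min()))           # model probabilities (5), numerically stabilized
    w /= w.sum()
    Y_hat_bb += (np.stack(preds).T @ w) / B          # accumulate the BMA prediction (3) over b
```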
4.4 NEURAL NETWORKS AND OTHER NONLINEAR ESTIMATORS

For continuous responses, the linear predictor Xβ in (1) can be replaced by nonlinear functions, as in neural nets or generalized additive models. While we no longer have an explicit solution for β̂^{(b)}, any code for fitting neural networks (or other nonlinear models) that allows weights can be used to construct the BB predictions, where one substitutes Ŷ^{(b)} using predictions from the neural network in (2). For model averaging with neural networks with continuous responses, model probabilities based on (4)-(5) are still appropriate.

4.5 EXPONENTIAL FAMILY MODELS, GLMS AND CART

For continuous responses, linear regression predictions were based on minimizing a residual sum of squares (1), which is equivalent to maximizing a normal likelihood. While the nonparametric bootstrap model implies a multinomial likelihood for the data Z, the use of likelihood score functions based on alternative distributional assumptions to provide estimates and predictions is in the same spirit as generalized estimating equations (Liang and Zeger, 1986). In this vein, we can extend the BB approach to other model classes, such as exponential families, CART models, and neural networks for categorical responses, to allow for categorical and discrete response variables. The connection between iteratively reweighted least squares and maximum likelihood estimation provides the basis for computations using the Bayesian bootstrap weights ω.

For exponential family models, the log likelihood can be written as

    l(θ) = Σ_{i=1}^n [ (v_i/φ)(y_i θ_i − b(θ_i)) + c(y_i, φ) ],

where θ is the canonical parameter, v_i is a known prior weight, and φ is a dispersion parameter; the mean parameter is µ_i = b′(θ_i) (McCullagh and Nelder, 1989). As in GLMs, we express µ_i as a function of x_i and β, µ = f(X, β) (although not necessarily through a linear predictor Xβ). Incorporating the bootstrap weights ω_i^{(b)} into the exponential family weights v_i, so that w_i^{(b)} = v_i ω_i^{(b)}, we find the bootstrap estimate β̂^{(b)} that maximizes l(θ(β)) using the weights w_i^{(b)}. This is repeated to provide b = 1, ..., B samples and a Monte Carlo estimate of the posterior distribution of f(X, β̂). The BB estimate Ŷ is

    Ŷ = (1/B) Σ_{b=1}^B f(X, β̂^{(b)}).

Any GLM or CART software that allows weights can be used to construct the estimates f(X, β̂^{(b)}) for BB. BB with BMA can be carried out by replacing the residual sum of squares in the expression for the BIC (4) with the residual deviance for the model.
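As an example of using off-the-shelf software that accepts case weights, the sketch below uses scikit-learn's LogisticRegression as one such fitter. The assumptions are ours for illustration: the binary data are simulated rather than the Pima data, the penalty is made negligible to approximate a maximum likelihood fit, and the weights are rescaled to sum to n as a convention.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n, p, B = 200, 3, 100
X = rng.normal(size=(n, p))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X[:, 0] - X[:, 1]))))   # simulated binary response

prob_draws = np.empty((B, n))
for b in range(B):
    omega = rng.dirichlet(np.ones(n))                # Bayesian bootstrap weights
    # Case weights w_i^(b) = v_i * omega_i^(b), with v_i = 1 here; rescaled by n.
    fit = LogisticRegression(C=1e6, max_iter=1000).fit(X, y, sample_weight=n * omega)
    prob_draws[b] = fit.predict_proba(X)[:, 1]

bb_prob = prob_draws.mean(axis=0)                    # BB estimate of P(Y = 1 | x)
bb_class = (bb_prob > 0.5).astype(int)               # classification from the averaged probabilities
```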
5 EXAMPLES

5.1 RAT LIVERS

Weisberg (1985, p. 121) describes a dataset on drug uptake in rat livers. The experimental hypothesis is that because dose was matched to weight, there would be no relationship between the percent dose in the liver and the three input variables (body weight, liver weight, and relative dose). One rat (case 3), however, received a large dose relative to its weight and is an influential point, leading to a rejection of the experimental hypothesis (the null model, which is believed to be true). Regression is normally thought of as a stable procedure, where methods such as bagging will not help. However, in the presence of outliers and influential points (e.g., case 3), regression is no longer stable. Variable selection further contributes to instability.

The left plot of Figure 1 shows body weight versus dose, where one can see both the high correlation as well as the highly influential point (case 3) at weight 190, which causes instability in the linear models. The right plot of Figure 1 shows predicted percent dose in the liver by body weight. The experimenters expected no relationship between percent dose in the liver and body weight; deviations from a horizontal line are because of nonzero regression coefficients for the input variables. Predictions from the full model show the greatest deviations, demonstrating the trouble caused by case 3. Bagging and BB of the full model (Bag/BB in the plot) have nearly identical predictions, both providing additional shrinkage towards the overall mean and reducing the effect of case 3. BMA produces the best fit, with virtually no change under bagging BMA or BB BMA, as the null model receives the highest posterior probability under the complete data.

The series of boxplots in Figure 2 highlights the variation in predictions for case 3. Of particular interest is that BB estimates show a large reduction in variation over estimates using bagging (without or with BMA). BMA is more stable, with greater shrinkage towards the overall mean of the data (without case 3), which is in the direction expected by the experimenters. Even with BMA, bagging occasionally produces large deviations. While estimates under bagging and BB are comparable, the distribution under the bootstrap exhibits more variability than the posterior distribution under the Bayesian bootstrap.

5.2 OZONE

Ground level ozone data were analyzed in the original bagging paper (Breiman, 1996a). The dataset consists of daily readings of the maximum ozone concentration at ground level in Los Angeles over the course of a year, with 9 meteorological predictors. Eliminating cases with missing data leaves 330 complete records. Following Breiman, we: (i) randomly selected a test set of 15 cases; (ii) fit a single regression tree using ten-fold cross-validation on the remaining cases (the training data), and then used this fitted tree to predict on the test data; (iii) for b = 1, ..., 25 generated ω^{(b)}, which were used as weights to fit a regression tree using ten-fold cross-validation on the training data, and then used this fitted tree to predict on the test data; the average of the 25 predictions is the BB prediction. This process was repeated 500 times.

The average mean squared error (MSE) of prediction was calculated for the single tree model and the BB method. The MSE for the single tree model was 23.9% (standard error 0.47), and the MSE for the BB predictions was 18.6% (standard error 0.35), resulting in a 22.0% reduction in error due to the BB, comparable to Breiman's results.

5.3 DIABETES

Smith et al. (1988) introduced a dataset on the prevalence of diabetes in Pima Indian women. The goal is to predict the presence of diabetes using seven health-related covariates. There are 532 complete records, of which 200 are used as a training set and the other 332 are used as a test set. The data are available at http://www.ics.uci.edu/~mlearn/MLSummary.html. We used logistic regression for classifying the data, which is essentially equivalent to fitting a neural network with a single hidden node. Ripley (1996) pointed out that there is no gain in fit by using more hidden nodes, so logistic regression is a sufficiently flexible procedure. We find that for each of three model classes (the full model, the best BIC model, and BMA), the Bayesian bootstrap improves predictions, as shown in Table 1, with error rates comparable to bagging (Breiman, 1996a).

6 DISCUSSION

Since Breiman introduced bagging, a number of papers have demonstrated its effectiveness in a variety of contexts. The Bayesian bootstrap leads to similar improvements in prediction rates, with less apparent variability.

The approach in this paper is technically equivalent to the weighted likelihood bootstrap (WLB) of Newton and Raftery (1994), which appeared in a different context. They used the WLB as a tool to approximate posterior distributions for standard likelihoods, which today can be readily approximated using Markov chain Monte Carlo sampling.

[Figure 1 appears here: scatterplots of Dose and Percent Dose in Liver against Body Weight, with fitted values for the full model, Bag/BB, and BMA.]
Figure 1: Rat Data: A Point with High Influence and the Resulting Fitted Models

Table 1: Percent Misclassification for the Pima Diabetes Data

Method                                        % Misclassification
Best single model using BIC                   20.2
Bayesian bootstrap on the best BIC model      19.3
Full model                                    20.2
Bayesian bootstrap on the full model          20.0
Bayesian model averaging                      21.1
Bayesian bootstrap and model averaging        20.9

An alternative view of the Bayesian bootstrap is that the data arise from a nonparametric model with distribution F. In this case, T(F) is not necessarily a parameter in the model, but is taken as an interesting summary of the distribution. As nonparametric models, both the bootstrap and Bayesian bootstrap share the problem that they only give positive weight to values of (x, y) that were actually observed. This raises theoretical issues when the bootstrapped quantities are used for prediction at values of x that are not in the original dataset. Other problems with both bootstrap methods are raised by Rubin (1981).

Another problem with using the bootstrap for bagging is that weights may be 0, and bootstrap replicates where X is not of full rank receive positive probability. In these samples, β̂ is not well defined, although predictions for Y are still defined. Even though these cases have very low probability (and may not appear in the samples), they do contribute to the theoretical bootstrap distribution of T(F). This problem is not specific to regression, but occurs in all estimation cases where more than one distinct data point is necessary for well-defined estimates. Further examples of problematic procedures include non-linear and nonparametric regression and estimation of standard deviations or correlations. The use of the Bayesian bootstrap avoids this issue.

Rubin's Bayesian bootstrap can be viewed as a limiting case of the nonparametric distribution for F using a Dirichlet process prior (Gasparini, 1995). If the prior for F is a Dirichlet process with parameter α (DP(α)), then the posterior distribution for F is again a Dirichlet process, DP(α + Σ_{i=1}^n δ_{Z_i}) (Ferguson, 1973). In the limit as α(R^{p+1}) goes to 0, the posterior distribution is a DP(Σ_{i=1}^n δ_{Z_i}), with mean equal to the empirical cumulative distribution function. It is this limiting noninformative case that is equivalent to Rubin's Bayesian bootstrap (Gasparini, 1995).

Fully non-parametric or semi-parametric Bayesian models that can adapt to nonlinearities and model mis-specification are other useful robust alternatives to bagging or BB. While these are typically more computationally intensive than either form of bootstrapping, they may resolve theoretical problems with bootstrapping, noted by Rubin, that are inherited by bagging and BB. As computing environments improve, both approaches may see wider use.

[Figure 2 appears here: boxplots of the estimated mean for case 3 under the full model with bagging, the full model with the Bayesian bootstrap, bagging with BMA, and the Bayesian bootstrap with BMA.]

Figure 2: Distribution of Estimated Means for Case 3 in the Rat Liver Data Set. The Horizontal Line is the Overall Mean without Case 3.

Acknowledgments

This research was partially supported by NSF grants DMS 9733013 and DMS 9873275. We would like to thank Steve MacEachern for helpful discussions.

References

Breiman, L. (1996a). "Bagging Predictors." Machine Learning, 24, 2, 123–140.

Breiman, L. (1996b). "Heuristics of Instability and Stabilization in Model Selection." The Annals of Statistics, 24, 2350–2383.

Ferguson, T. S. (1973). "A Bayesian Analysis of Some Nonparametric Problems." The Annals of Statistics, 1, 2, 209–230.

Gasparini, M. (1995). "Exact Multivariate Bayesian Bootstrap Distributions of Moments." The Annals of Statistics, 23, 762–768.

Hoeting, J. A., Madigan, D., Raftery, A., and Volinsky, C. T. (1999). "Bayesian Model Averaging: A Tutorial (with discussion)." Statistical Science, 14, 4, 382–417.

Liang, K.-Y. and Zeger, S. L. (1986). "Longitudinal Data Analysis Using Generalized Linear Models." Biometrika, 73, 13–22.

McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models (Second Edition). Chapman & Hall.

Newton, M. A. and Raftery, A. E. (1994). "Approximate Bayesian Inference with the Weighted Likelihood Bootstrap (with discussion)." Journal of the Royal Statistical Society B, 56, 3–48.

Ripley, B. D. (1996). Pattern Recognition and Neural Networks. Cambridge: Cambridge University Press.

Rubin, D. B. (1981). "The Bayesian Bootstrap." The Annals of Statistics, 9, 130–134.

Schwarz, G. (1978). "Estimating the Dimension of a Model." The Annals of Statistics, 6, 2, 461–464.

Smith, J., Everhart, J., Dickson, W., Knowler, W., and Johannes, R. (1988). "Using the ADAP Learning Algorithm to Forecast the Onset of Diabetes Mellitus." In Proceedings of the Symposium on Computer Applications and Medical Care, 261–265. IEEE Computer Society Press.

Weisberg, S. (1985). Applied Linear Regression (2nd edition). New York: John Wiley & Sons.