Bagging and the Bayesian Bootstrap
Merlise A. Clyde and Herbert K. H. Lee
Institute of Statistics & Decision Sciences, Duke University, Durham, NC 27708

Abstract

Bagging is a method of obtaining more robust predictions when the model class under consideration is unstable with respect to the data, i.e., small changes in the data can cause the predicted values to change significantly. In this paper, we introduce a Bayesian version of bagging based on the Bayesian bootstrap. The Bayesian bootstrap resolves a theoretical problem with ordinary bagging and often results in more efficient estimators. We show how model averaging can be combined within the Bayesian bootstrap and illustrate the procedure with several examples.

1 INTRODUCTION

In a typical prediction problem, there is a trade-off between bias and variance, in that after a certain amount of fitting, any increase in the precision of the fit will cause an increase in the prediction variance on future observations. Similarly, any reduction in the prediction variance causes an increase in the expected bias for future predictions. Breiman (1996a) introduced bagging as a method of reducing the prediction variance without affecting the prediction bias.

Bagging is short for "Bootstrap AGGregatING", which describes how it works. The idea is straightforward. Instead of making predictions from a single model fit to the observed data, bootstrap samples are taken of the data, the model is fit to each sample, and the predictions are averaged over all of the fitted models to get the bagged prediction. Breiman explains that bagging works well for unstable modeling procedures, i.e., those for which the conclusions are sensitive to small changes in the data, such as neural networks, classification and regression trees (CART), and variable selection for regression (Breiman, 1996b). He also gives a theoretical explanation of how bagging works, demonstrating the reduction in mean-squared prediction error for unstable procedures.

In this paper, we consider a Bayesian version of bagging based on Rubin's Bayesian bootstrap (1981). This overcomes a technical difficulty with the usual bootstrap in bagging, and it leads to a reduction in variance over the bootstrap for certain classes of estimators. Another Bayesian approach for dealing with unstable procedures is Bayesian model averaging (BMA) (Hoeting et al., 1999). In BMA, one fits several models to the data and makes predictions by taking the weighted average of the predictions from each of the fitted models, where the weights are posterior probabilities of models. We show that the Bayesian bootstrap and Bayesian model averaging can be combined. We illustrate Bayesian bagging in a regression problem with variable selection and a highly influential data point, a classification problem using logistic regression, and a CART model.

2 BOOTSTRAPPING

Suppose we have a sample of size $n$, with observed data $Z_1, \ldots, Z_n$, where $Z_i$ is a vector in $\mathbb{R}^{p+1}$ and the $Z_i$ are independent, identically distributed realizations from some distribution $F \in \mathcal{F}$. Here $\mathcal{F}$ is the class of all distribution functions on $\mathbb{R}^{p+1}$. The parameter of interest is a functional $T(F)$, where $T$ is a mapping from $\mathcal{F}$ to $\mathbb{R}$, or to $\mathbb{R}^k$ in the case of vector-valued functions; for example, the $(p+1)$-dimensional mean of the distribution, $\mu = \int z \, dF$.

In Efron's bootstrap and the Bayesian bootstrap, the class of distribution functions is restricted to a parametric model by restricting estimation to $F_n \in \mathcal{F}_n$, where $F_n$ is represented as
$$F_n = \sum_{i=1}^{n} \omega_i \, \delta_{Z_i},$$
$\delta_{Z_i}$ is a degenerate probability measure at $Z_i$, and $\omega_i$ is the weight associated with $Z_i$, with $\omega_i \ge 0$ and $\sum_i \omega_i = 1$.

In the bootstrap, the distribution of $T(F)$ is obtained by repeatedly generating bootstrap replicates, where one bootstrap replicate is a sample drawn with replacement of size $n$ from $Z_1, \ldots, Z_n$. The bootstrap distribution of $T(F)$ is based on considering all possible bootstrap replications $T(F_n^{(r)})$, where $\omega_i^{(r)}$ in $F_n^{(r)}$ corresponds to the proportion of times $Z_i$ appears in the $r$th bootstrap replicate, with $\omega_i^{(r)}$ taking on values in $\{0, 1/n, \ldots, n/n\}$. For example, the bootstrap mean for the $r$th replicate is calculated as $\hat{Z}^{(r)} = \sum_{i=1}^{n} \omega_i^{(r)} Z_i$, and the bagged estimate of the mean is the average over all bootstrap replicates, $\sum_{r=1}^{R} \hat{Z}^{(r)}/R$, where $R$ is the number of bootstrap replicates. Of course, one cannot usually consider all possible bootstrap samples, of which there are $\binom{2n-1}{n}$, and bagging is often based on a much smaller set of bootstrap replicates, say 25 to 50 (Breiman, 1996a).
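The weight representation of a bootstrap replicate can be made concrete with a short sketch. This is not code from the paper; it is a minimal illustration, assuming the simple case where $T(F)$ is the mean, and the function name `bootstrap_bagged_mean` is ours.

```python
import numpy as np

def bootstrap_bagged_mean(Z, R=50, seed=None):
    """Bagged estimate of the mean via the ordinary bootstrap.

    Each replicate resamples n observations with replacement; omega[i] is the
    proportion of times Z[i] appears in the replicate, so the replicate mean is
    sum_i omega[i] * Z[i], and the bagged estimate averages over R replicates.
    """
    rng = np.random.default_rng(seed)
    Z = np.asarray(Z, dtype=float)
    n = len(Z)
    replicate_means = []
    for _ in range(R):
        counts = rng.multinomial(n, np.full(n, 1.0 / n))  # sample of size n with replacement
        omega = counts / n                                 # weights in {0, 1/n, ..., n/n}
        replicate_means.append(omega @ Z)                  # T(F_n^(r)) for the mean functional
    return np.mean(replicate_means, axis=0)

# Example: bagged mean of a small sample
print(bootstrap_bagged_mean([1.2, 3.4, 2.2, 5.1, 0.7], R=25, seed=0))
```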
3 BAYESIAN BOOTSTRAP

The Bayesian bootstrap was introduced by Rubin (1981) as a Bayesian analog of the original bootstrap. Instead of drawing weights $\omega_i$ from the discrete set $\{0, 1/n, \ldots, n/n\}$, the Bayesian approach treats the vector of weights $\omega$ in $F_n$ as unknown parameters and derives a posterior distribution for $\omega$, and hence for $T(F)$. Rubin (1981) used a non-informative prior, $\prod_{i=1}^{n} \omega_i^{-1}$, which, when combined with the multinomial likelihood for $Z$, leads to a Dirichlet$(1, \ldots, 1)$ distribution for the posterior distribution of $\omega$. The posterior distribution of $T(F)$ is estimated by Monte Carlo methods: generate $\omega^{(b)}$ from a Dirichlet$(1, \ldots, 1)$ distribution and then calculate $T(F_n^{(b)})$ for each sample $\omega^{(b)}$. The average of $T(F_n^{(b)})$ over the samples corresponds to the Monte Carlo estimate of the posterior mean of $T(F)$ and can be viewed as a Bayesian analog of bagging.

Although there are differences in interpretation, operationally the ordinary bootstrap and Bayesian bootstrap differ primarily in how the values of $\omega$ are drawn. As Rubin (1981) shows, the expected values of the weights $\omega$ are equal to $1/n$ under both bootstrap methods. As the expectations of the weights are the same, both ordinary bagging and Bayesian bagging will have the same expectation for functions $T(F)$ that are linear in the weights $\omega$, such as means. There are situations, which will be discussed later, where the ordinary bootstrap distribution is not well defined, and the two approaches may yield different answers. Both approaches also lead to the same correlation between weights. However, the variability of the weights $\omega$ under the ordinary bootstrap is $(n+1)/n$ times the variance of $\omega$ under the Bayesian bootstrap. For linear functionals, the variance of the estimate under the Bayesian bootstrap is therefore strictly less than the variance under the ordinary bootstrap. This applies directly for CART models; for other estimators that are not necessarily linear in the weights, in our experience the Bayesian bootstrap has also empirically exhibited less variability than the ordinary bootstrap. We illustrate this reduction in the rat liver example.

4 BAGGING VIA THE BAYESIAN BOOTSTRAP

Bagging is used primarily in prediction problems, and with that in mind we partition each $Z_i$ into a response $Y_i$ (which could be continuous or categorical) and a $p$-dimensional vector of input variables $x_i$ for predicting $Y$. In matrix form, the data are $Y = (y_1, \ldots, y_n)'$ with an $n \times p$ matrix of covariates $X$ with rows $x_i$.

Under the nonparametric model for the data, the only unknown quantity is the distribution $F$, or the parameters $\omega$ under the restricted class of distribution functions. The posterior distribution on $\omega$ induces a posterior distribution on a functional $T(F)$ for predicting $Y$ given $X$. We first consider the case where interest is in regression-type estimates, then extend the procedure to allow for variable selection, nonlinear functions, categorical responses and model uncertainty.

4.1 LINEAR REGRESSION

For making predictions based on linear combinations of $Y$, we consider functionals of the form
$$\hat{\beta} = T(F) = \arg\min_{\beta} \int \|Y - X\beta\|^2 \, dF
= \arg\min_{\beta} \sum_{i=1}^{n} \omega_i (y_i - x_i \beta)^2
= (X'WX)^{-1} X'WY, \qquad (1)$$
where $W$ is a diagonal matrix of weights $\omega$. The values of $\hat{\beta}$ that minimize (1) with the restriction to $\mathcal{F}_n$ are equivalent to weighted least squares estimates using weights $\omega$.

Operationally, Bayesian bagging (BB) proceeds by taking a sample $\omega^{(b)}$ from a Dirichlet$(1, \ldots, 1)$ distribution and then using weighted least squares to obtain
$$\hat{\beta}^{(b)} = (X'W^{(b)}X)^{-1} X'W^{(b)}Y,$$
where $W^{(b)}$ is a diagonal matrix of weights $\omega^{(b)}$. This is repeated for $b = 1, \ldots, B$, where $B$ is the total number of Monte Carlo samples, to obtain the posterior distribution of $\hat{\beta}$, and the posterior distribution of $\hat{Y} = X\hat{\beta}$ or other functions of $\omega$. Let $\hat{Y}^{(b)} = X\hat{\beta}^{(b)}$. The BB estimate of $\hat{Y}$ given $X$ is the Monte Carlo average
$$\hat{Y} = \frac{1}{B} \sum_{b=1}^{B} \hat{Y}^{(b)}. \qquad (2)$$
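To make this recipe concrete, the following is a minimal sketch, not taken from the paper: it draws Dirichlet$(1,\ldots,1)$ weights, computes the weighted least squares fit of (1) for each draw, and returns the Monte Carlo average (2). The function name `bayesian_bagging_lm` and the simulated-data example are our own illustrations.

```python
import numpy as np

def bayesian_bagging_lm(X, Y, B=1000, seed=None):
    """Bayesian bagging for linear regression (Section 4.1).

    For b = 1,...,B: draw omega^(b) ~ Dirichlet(1,...,1), fit weighted least
    squares with W^(b) = diag(omega^(b)), and average Yhat^(b) = X beta^(b).
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n, p = X.shape
    betas = np.empty((B, p))
    Yhat_sum = np.zeros(n)
    for b in range(B):
        omega = rng.dirichlet(np.ones(n))                    # posterior draw of the weights
        sw = np.sqrt(omega)                                  # weighted LS via row scaling
        beta, *_ = np.linalg.lstsq(sw[:, None] * X, sw * Y, rcond=None)
        betas[b] = beta
        Yhat_sum += X @ beta                                 # Yhat^(b) = X beta^(b)
    return betas, Yhat_sum / B                               # posterior draws and BB estimate (2)

# Example usage with simulated data
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=50)
betas, Yhat = bayesian_bagging_lm(X, Y, B=200, seed=1)
print(betas.mean(axis=0))
```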
4.2 VARIABLE SELECTION

While linear regression is a stable procedure, where bagging does not lead to substantial improvements, variable selection is viewed as being unstable. The BB procedure is modified to combine model selection with parameter estimation, where for each sample $\omega^{(b)}$, one selects a model $M^{(b)}$ using an appropriate model selection criterion and then estimates $\hat{\beta}_M^{(b)}$ under model $M^{(b)}$. The posterior distribution for $\hat{Y}$ and the posterior mean are now based on multiple models, where the BB estimate of $Y$ given $X$ is
$$\hat{Y} = \frac{1}{B} \sum_{b=1}^{B} \hat{Y}_M^{(b)},$$
where $\hat{Y}_M^{(b)} = X_{M^{(b)}} \hat{\beta}_M^{(b)}$ is the prediction using design matrix $X_{M^{(b)}}$ based on model $M^{(b)}$. Although not equivalent to Bayesian model averaging as described in Hoeting et al.

4.4 NEURAL NETWORKS AND OTHER NONLINEAR ESTIMATORS

For continuous responses, the linear predictor $X\beta$ in (1) can be replaced by nonlinear functions as in neural nets or generalized additive models. While we no longer have an explicit solution for $\hat{\beta}^{(b)}$, any code for fitting neural networks (or other nonlinear models) that allows weights can be used to construct the BB predictions, where one substitutes $\hat{Y}^{(b)}$ using predictions from the neural network in (2). For model averaging with neural networks with continuous responses, model probabilities based on (4-5) are still appropriate.

4.5 EXPONENTIAL FAMILY MODELS, GLMS AND CART

For continuous responses, linear regression predictions
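As a companion to the variable-selection procedure of Section 4.2, the sketch below combines Bayesian bootstrap weights with all-subsets selection. The section only requires "an appropriate model selection criterion"; the weighted BIC used here, and the function name `bb_variable_selection`, are our own illustrative choices.

```python
import itertools
import numpy as np

def bb_variable_selection(X, Y, B=500, seed=None):
    """Bayesian bagging combined with variable selection (Section 4.2).

    For each Dirichlet(1,...,1) draw omega^(b), select a model M^(b) among all
    subsets of columns of X (here by a weighted BIC, one possible criterion),
    estimate beta under M^(b) by weighted least squares, and average the
    predictions Yhat_M^(b) = X_{M^(b)} beta_M^(b) over the B draws.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n, p = X.shape
    # all non-empty subsets of predictors (assumes p is small)
    models = [list(s) for k in range(1, p + 1)
              for s in itertools.combinations(range(p), k)]
    Yhat_sum = np.zeros(n)
    for _ in range(B):
        omega = rng.dirichlet(np.ones(n))
        sw = np.sqrt(omega)
        best_bic, best_fit = None, None
        for cols in models:
            XM = X[:, cols]
            beta, *_ = np.linalg.lstsq(sw[:, None] * XM, sw * Y, rcond=None)
            resid = Y - XM @ beta
            sigma2 = max(np.sum(omega * resid**2), 1e-12)       # weighted residual variance
            bic = n * np.log(sigma2) + len(cols) * np.log(n)    # illustrative weighted BIC
            if best_bic is None or bic < best_bic:
                best_bic, best_fit = bic, XM @ beta             # keep fit under model M^(b)
        Yhat_sum += best_fit
    return Yhat_sum / B                                          # BB estimate of Y given X
```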