Boosting Algorithms: Regularization, Prediction and Model Fitting
Statistical Science 2007, Vol. 22, No. 4, 477–505
DOI: 10.1214/07-STS242
© Institute of Mathematical Statistics, 2007

Peter Bühlmann and Torsten Hothorn

Abstract. We present a statistical perspective on boosting. Special emphasis is given to estimating potentially complex parametric or nonparametric models, including generalized linear and additive models as well as regression models for survival analysis. Concepts of degrees of freedom and corresponding Akaike or Bayesian information criteria, particularly useful for regularization and variable selection in high-dimensional covariate spaces, are discussed as well.
The practical aspects of boosting procedures for fitting statistical models are illustrated by means of the dedicated open-source software package mboost. This package implements functions which can be used for model fitting, prediction and variable selection. It is flexible, allowing for the implementation of new boosting algorithms optimizing user-specified loss functions.

Key words and phrases: Generalized linear models, generalized additive models, gradient boosting, survival analysis, variable selection, software.

Peter Bühlmann is Professor, Seminar für Statistik, ETH Zürich, CH-8092 Zürich, Switzerland (e-mail: [email protected]). Torsten Hothorn is Professor, Institut für Statistik, Ludwig-Maximilians-Universität München, Ludwigstraße 33, D-80539 München, Germany (e-mail: [email protected]). Torsten Hothorn wrote this paper while he was a lecturer at the Universität Erlangen-Nürnberg.

1. INTRODUCTION

Freund and Schapire's AdaBoost algorithm for classification [29, 30, 31] has attracted much attention in the machine learning community (cf. [76] and the references therein) as well as in related areas in statistics [15, 16, 33]. Various versions of the AdaBoost algorithm have proven to be very competitive in terms of prediction accuracy in a variety of applications. Boosting methods were originally proposed as ensemble methods (see Section 1.1), which rely on the principle of generating multiple predictions and majority voting (averaging) among the individual classifiers.

Later, Breiman [15, 16] made a path-breaking observation: the AdaBoost algorithm can be viewed as a gradient descent algorithm in function space, inspired by numerical optimization and statistical estimation. Moreover, Friedman, Hastie and Tibshirani [33] laid out further important foundations which linked AdaBoost and other boosting algorithms to the framework of statistical estimation and additive basis expansion. In their terminology, boosting is represented as "stagewise, additive modeling": the word "additive" does not imply a model fit which is additive in the covariates (see our Section 4), but refers to the fact that boosting is an additive (in fact, a linear) combination of "simple" (function) estimators. Also Mason et al. [62] and Rätsch, Onoda and Müller [70] developed related ideas which were mainly acknowledged in the machine learning community.

In Hastie, Tibshirani and Friedman [42], additional views on boosting are given; in particular, the authors first pointed out the relation between boosting and ℓ1-penalized estimation. The insights of Friedman, Hastie and Tibshirani [33] opened new perspectives, namely to use boosting methods in many other contexts than classification. We mention here boosting methods for regression (including generalized regression) [22, 32, 71], for density estimation [73], for survival analysis [45, 71] or for multivariate analysis [33, 59]. In quite a few of these proposals, boosting is not only a black-box prediction tool but also an estimation method for models with a specific structure such as linearity or additivity [18, 22, 45]. Boosting can then be seen as an interesting regularization scheme for estimating a model. This statistical perspective will drive the focus of our exposition of boosting.

We present here some coherent explanations and illustrations of concepts about boosting, some derivations which are novel, and we aim to increase the understanding of some methods and some selected known results. Besides giving an overview on theoretical concepts of boosting as an algorithm for fitting statistical models, we look at the methodology from a practical point of view as well. The dedicated add-on package mboost ("model-based boosting," [43]) to the R system for statistical computing [69] implements computational tools which enable the data analyst to compute on the theoretical concepts explained in this paper as closely as possible. The illustrations presented throughout the paper focus on three regression problems with continuous, binary and censored response variables, some of them having a large number of covariates. For each example, we only present the most important steps of the analysis. The complete analysis is contained in a vignette as part of the mboost package (see Appendix A.1) so that every result shown in this paper is reproducible.
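As a minimal sketch of what such a model fit looks like in mboost, the following call fits a linear model by boosting; the data frame mydata and the variables y, x1, x2, x3 are hypothetical placeholders and are not one of the three examples treated later in the paper.

## minimal, hypothetical example of a boosted linear model with mboost;
## 'mydata' and the variables y, x1, x2, x3 are placeholders
library("mboost")

fit <- glmboost(y ~ x1 + x2 + x3, data = mydata,
                control = boost_control(mstop = 100))  # run 100 boosting iterations

coef(fit)      # coefficients of the componentwise linear model after mstop iterations
predict(fit)   # fitted values on the training data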
Unless stated differently, we assume that the data are realizations of random variables

(X_1, Y_1), ..., (X_n, Y_n)

from a stationary process with p-dimensional predictor variables X_i and one-dimensional response variables Y_i; for the case of multivariate responses, some references are given in Section 9.1. In particular, the setting above includes independent, identically distributed (i.i.d.) observations. The generalization to stationary processes is fairly straightforward: the methods and algorithms are the same as in the i.i.d. framework, but the mathematical theory requires more elaborate techniques. Essentially, one needs to ensure that some (uniform) laws of large numbers still hold, for example, assuming stationary, mixing sequences; some rigorous results are given in [57] and [59].

1.1 Ensemble Schemes: Multiple Prediction and Aggregation

Ensemble schemes construct multiple function estimates or predictions from reweighted data and use a linear (or sometimes convex) combination thereof for producing the final, aggregated estimator or prediction.

First, we specify a base procedure which constructs a function estimate ĝ(·) with values in ℝ, based on some data (X_1, Y_1), ..., (X_n, Y_n):

(X_1, Y_1), ..., (X_n, Y_n)  --base procedure-->  ĝ(·).

For example, a very popular base procedure is a regression tree.

Then, generating an ensemble from the base procedures, that is, an ensemble of function estimates or predictions, works generally as follows:

reweighted data 1  --base procedure-->  ĝ^[1](·)
reweighted data 2  --base procedure-->  ĝ^[2](·)
        ···                                ···
reweighted data M  --base procedure-->  ĝ^[M](·)

aggregation: f̂_A(·) = Σ_{m=1}^{M} α_m ĝ^[m](·).

What is termed here as "reweighted data" means that we assign individual data weights to each of the n sample points. We have also implicitly assumed that the base procedure allows to do some weighted fitting, that is, estimation is based on a weighted sample.
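To make this generic scheme concrete, the following toy R sketch (ours, not part of the article) fits a weighted base procedure M times and combines the fits linearly. The base procedure (a weighted simple linear regression), the reweighting rule and the combination coefficients alpha are arbitrary illustrative placeholders; their specific choice is exactly what distinguishes the different ensemble schemes discussed next.

## toy illustration of the generic ensemble scheme (a sketch, not mboost code);
## base procedure: weighted simple linear regression; the reweighting rule and
## the combination coefficients alpha are arbitrary illustrative placeholders
set.seed(1)
n <- 100
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

M <- 10
w <- rep(1 / n, n)                       # initial data weights
alpha <- rep(1 / M, M)                   # fixed combination coefficients (placeholder)
ghat <- vector("list", M)

for (m in 1:M) {
  ghat[[m]] <- lm(y ~ x, weights = w)    # weighted fit of the base procedure
  res <- abs(residuals(ghat[[m]]))
  w <- res / sum(res)                    # reweighting: emphasize poorly fitted points
}

## aggregated estimator f_A(x) = sum_m alpha_m * ghat^[m](x)
f_A <- function(xnew) {
  preds <- sapply(ghat, predict, newdata = data.frame(x = xnew))
  drop(preds %*% alpha)
}
f_A(c(0.1, 0.5, 0.9))                    # aggregated predictions at three new points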
Throughout the paper (except in Section 1.2), we assume that a base procedure estimate ĝ(·) is real-valued (i.e., a regression procedure), making it more adequate for the "statistical perspective" on boosting, in particular for the generic FGD algorithm in Section 2.1.

The above description of an ensemble scheme is too general to be of any direct use. The specification of the data reweighting mechanism as well as the form of the linear combination coefficients {α_m}_{m=1}^{M} are crucial, and various choices characterize different ensemble schemes. Most boosting methods are special kinds of sequential ensemble schemes, where the data weights in iteration m depend on the results from the previous iteration m − 1 only (memoryless with respect to iterations m − 2, m − 3, ...). Examples of other ensemble schemes include bagging [14] or random forests [1, 17].

1.2 AdaBoost

The AdaBoost algorithm for binary classification [31] is the most well-known boosting algorithm. The base procedure is a classifier with values in {0, 1} (slightly different from a real-valued function estimator as assumed above), for example, a classification tree; an illustrative implementation of the steps below is sketched at the end of this section.

AdaBoost algorithm

1. Initialize some weights for individual sample points: w_i^[0] = 1/n for i = 1, ..., n. Set m = 0.
2. Increase m by 1. Fit the base procedure to the weighted data, that is, do a weighted fitting using the weights w_i^[m−1], yielding the classifier ĝ^[m](·).
3. Compute the weighted in-sample misclassification rate

err^[m] = Σ_{i=1}^{n} w_i^[m−1] I(Y_i ≠ ĝ^[m](X_i)) / Σ_{i=1}^{n} w_i^[m−1].

It is clear nowadays that AdaBoost and also other boosting algorithms are overfitting eventually, and early stopping [using a value of m_stop before convergence of the surrogate loss function, given in (3.3), takes place] is necessary [7, 51, 64]. We emphasize that this is not in contradiction to the experimental results of Breiman [15] where the test set misclassification error still decreases after the training misclassification error is zero [because the training error of the surrogate loss function in (3.3) is not zero before numerical convergence]. Nevertheless, the AdaBoost algorithm is quite resistant to overfitting (slow overfitting behavior) when increasing the number of iterations m_stop. This has been observed empirically, although some cases with clear overfitting do occur for some datasets [64]. A stream of work has been devoted to developing VC-type bounds for the generalization (out-of-sample) error to explain why boosting is overfitting very slowly only. Schapire et al. [77] proved a remarkable bound for the generalization misclassification error for classifiers in the convex hull of a base procedure.
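As a concrete illustration of the AdaBoost steps displayed above (ours, not the article's), the following R sketch uses a weighted decision stump as base classifier on simulated data; the weight update and the final weighted majority vote, which are cut off in the excerpt above, follow the standard AdaBoost recipe.

## illustrative AdaBoost sketch: weighted decision stump as base classifier;
## data are simulated; weight update and aggregation are the standard recipe
set.seed(1)
n <- 200
x <- runif(n)
y <- as.numeric(x + rnorm(n, sd = 0.2) > 0.5)       # labels in {0, 1}

stump <- function(x, y, w) {
  ## weighted stump: threshold minimizing the weighted misclassification rate
  ths  <- sort(unique(x))
  errs <- sapply(ths, function(t) sum(w * (as.numeric(x > t) != y)) / sum(w))
  t_best <- ths[which.min(errs)]
  function(xnew) as.numeric(xnew > t_best)
}

m_stop <- 50
w <- rep(1 / n, n)                                   # step 1: initial weights
g <- vector("list", m_stop)
alpha <- numeric(m_stop)
for (m in 1:m_stop) {
  g[[m]] <- stump(x, y, w)                           # step 2: weighted fit
  miss <- as.numeric(g[[m]](x) != y)
  err  <- sum(w * miss) / sum(w)                     # step 3: weighted error rate
  alpha[m] <- log((1 - err) / err)                   # standard AdaBoost voting weight
  w <- w * exp(alpha[m] * miss)                      # up-weight misclassified points
}

f_hat <- function(xnew) {                            # weighted majority vote over {0, 1}
  score <- rep(0, length(xnew))
  for (m in 1:m_stop) score <- score + alpha[m] * (2 * g[[m]](xnew) - 1)
  as.numeric(score > 0)
}
mean(f_hat(x) == y)                                  # in-sample accuracy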