arXiv:0804.2752v1 [stat.ME] 17 Apr 2008

Statistical Science
2007, Vol. 22, No. 4, 477-505
DOI: 10.1214/07-STS242
© Institute of Mathematical Statistics, 2007

Boosting Algorithms: Regularization, Prediction and Model Fitting

Peter Bühlmann and Torsten Hothorn

Peter Bühlmann is Professor, Seminar für Statistik, ETH Zürich, CH-8092 Zürich, Switzerland (e-mail: [email protected]). Torsten Hothorn is Professor, Institut für Statistik, Ludwig-Maximilians-Universität München, Ludwigstraße 33, D-80539 München, Germany (e-mail: [email protected]). Hothorn wrote this paper while he was a lecturer at the Universität Erlangen-Nürnberg.

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in Statistical Science, 2007, Vol. 22, No. 4, 477-505. This reprint differs from the original in pagination and typographic detail.

Discussed in 10.1214/07-STS242A and 10.1214/07-STS242B; rejoinder at 10.1214/07-STS242REJ.

Abstract. We present a statistical perspective on boosting. Special emphasis is given to estimating potentially complex parametric or nonparametric models, including generalized linear and additive models as well as regression models for survival analysis. Concepts of degrees of freedom and corresponding Akaike or Bayesian information criteria, particularly useful for regularization and variable selection in high-dimensional covariate spaces, are discussed as well. The practical aspects of boosting procedures for fitting statistical models are illustrated by means of the dedicated open-source software package mboost. This package implements functions which can be used for model fitting, prediction and variable selection. It is flexible, allowing for the implementation of new boosting algorithms optimizing user-specified loss functions.

Key words and phrases: Generalized linear models, generalized additive models, gradient boosting, survival analysis, variable selection, software.

1. INTRODUCTION

Freund and Schapire's AdaBoost algorithm for classification [29, 30, 31] has attracted much attention in the machine learning community (cf. [76] and the references therein) as well as in related areas in statistics [15, 16, 33]. Various versions of the AdaBoost algorithm have proven to be very competitive in terms of prediction accuracy in a variety of applications. Boosting methods have been originally proposed as ensemble methods (see Section 1.1), which rely on the principle of generating multiple predictions and majority voting (averaging) among the individual classifiers.

Later, Breiman [15, 16] made a path-breaking observation that the AdaBoost algorithm can be viewed as a gradient descent algorithm in function space, inspired by numerical optimization and statistical estimation. Moreover, Friedman, Hastie and Tibshirani [33] laid out further important foundations which linked AdaBoost and other boosting algorithms to the framework of statistical estimation and additive basis expansion. In their terminology, boosting is represented as "stagewise, additive modeling": the word "additive" does not imply a model fit which is additive in the covariates (see our Section 4), but refers to the fact that boosting is an additive (in fact, linear) combination of "simple" (function) estimators. Also Mason et al. [62] and Rätsch, Onoda and Müller [70] developed related ideas which were mainly acknowledged in the machine learning community. In [15, 16, 33, 42], additional views on boosting are given; in particular, the authors first pointed out the relation between boosting and ℓ1-penalized estimation. The insights of Friedman, Hastie and Tibshirani [33] opened new perspectives, namely to use boosting methods in many other contexts than classification. We mention here boosting methods for regression (including generalized regression) [22, 32, 71], for density estimation [73], for survival analysis [45, 71] or for multivariate analysis [33, 59]. In quite a few of these proposals, boosting is not only a black-box prediction tool but also an estimation method for models with a specific structure such as linearity or additivity [18, 22, 45]. Boosting can then be seen as an interesting regularization scheme for estimating a model. This statistical perspective will drive the focus of our exposition of boosting.

We present here some coherent explanations and illustrations of concepts about boosting, some derivations which are novel, and we aim to increase the understanding of some methods and some selected known results. Besides giving an overview on theoretical concepts of boosting as an algorithm for fitting statistical models, we look at the methodology from a practical point of view as well. The dedicated add-on package mboost ("model-based boosting," [43]) to the R system for statistical computing [69] implements computational tools which enable the data analyst to compute on the theoretical concepts explained in this paper as closely as possible. The illustrations presented throughout the paper focus on three regression problems with continuous, binary and censored response variables, some of them having a large number of covariates. For each example, we only present the most important steps of the analysis. The complete analysis is contained in a vignette as part of the mboost package (see Appendix A.1) so that every result shown in this paper is reproducible.

Unless stated differently, we assume that the data are realizations of random variables

(X1, Y1), ..., (Xn, Yn)

from a stationary process with p-dimensional predictor variables Xi and one-dimensional response variables Yi; for the case of multivariate responses, some references are given in Section 9.1. In particular, the setting above includes independent, identically distributed (i.i.d.) observations. The generalization to stationary processes is fairly straightforward: the methods and algorithms are the same as in the i.i.d. framework, but the mathematical theory requires more elaborate techniques. Essentially, one needs to ensure that some (uniform) laws of large numbers still hold, for example, assuming stationary, mixing sequences; some rigorous results are given in [57] and [59].

1.1 Ensemble Schemes: Multiple Prediction and Aggregation

Ensemble schemes construct multiple function estimates or predictions from reweighted data and use a linear (or sometimes convex) combination thereof for producing the final, aggregated estimator or prediction.

First, we specify a base procedure which constructs a function estimate ĝ(·) with values in R, based on some data (X1, Y1), ..., (Xn, Yn):

    (X1, Y1), ..., (Xn, Yn)  -- base procedure -->  ĝ(·).

For example, a very popular base procedure is a regression tree.

Then, generating an ensemble from the base procedures, that is, an ensemble of function estimates or predictions, works generally as follows:

    reweighted data 1  -- base procedure -->  ĝ^[1](·)
    reweighted data 2  -- base procedure -->  ĝ^[2](·)
        ···                                     ···
    reweighted data M  -- base procedure -->  ĝ^[M](·)

    aggregation:  f̂_A(·) = Σ_{m=1}^{M} α_m ĝ^[m](·).

What is termed here as "reweighted data" means that we assign individual data weights to each of the n sample points. We have also implicitly assumed that the base procedure allows for some weighted fitting, that is, estimation is based on a weighted sample. Throughout the paper (except in Section 1.2), we assume that a base procedure estimate ĝ(·) is real-valued (i.e., a regression procedure), making it more adequate for the "statistical perspective" on boosting, in particular for the generic FGD algorithm in Section 2.1.

The above description of an ensemble scheme is too general to be of any direct use. The specification of the data reweighting mechanism as well as the form of the linear combination coefficients {α_m}_{m=1}^{M} are crucial, and various choices characterize different ensemble schemes. Most boosting methods are special kinds of sequential ensemble schemes, where the data weights in iteration m depend on the results from the previous iteration m − 1 only (memoryless with respect to iterations m − 2, m − 3, ...). Examples of other ensemble schemes include bagging [14] or random forests [1, 17].

1.2 AdaBoost

The AdaBoost algorithm for binary classification [31] is the most well-known boosting algorithm. The base procedure is a classifier with values in {0, 1} (slightly different from a real-valued function estimator as assumed above), for example, a classification tree.

AdaBoost algorithm

1. Initialize some weights for individual sample points: w_i^[0] = 1/n for i = 1, ..., n. Set m = 0.
2. Increase m by 1. Fit the base procedure to the weighted data, that is, do a weighted fitting using the weights w_i^[m−1], yielding the classifier ĝ^[m](·).
3. Compute the weighted in-sample misclassification rate

    err^[m] = Σ_{i=1}^{n} w_i^[m−1] I(Y_i ≠ ĝ^[m](X_i)) / Σ_{i=1}^{n} w_i^[m−1],

    α^[m] = log((1 − err^[m]) / err^[m]),

   and update the weights

    w̃_i = w_i^[m−1] exp(α^[m] I(Y_i ≠ ĝ^[m](X_i))),
    w_i^[m] = w̃_i / Σ_{j=1}^{n} w̃_j.

4. Iterate steps 2 and 3 until m = mstop and build the aggregated classifier by weighted majority voting:

    f̂_AdaBoost(x) = argmax_{y ∈ {0,1}} Σ_{m=1}^{mstop} α^[m] I(ĝ^[m](x) = y).

By using the terminology mstop (instead of M as in the general description of ensemble schemes), we emphasize here and later that the iteration process should be stopped to avoid overfitting. It is a tuning parameter of AdaBoost which may be selected using some cross-validation scheme.

1.3 Slow Overfitting Behavior

It had been debated until about the year 2000 whether the AdaBoost algorithm is immune to overfitting when running more iterations, that is, whether stopping would not be necessary. It is clear nowadays that AdaBoost and also other boosting algorithms are overfitting eventually, and early stopping [using a value of mstop before convergence of the surrogate loss, given in (3.3), takes place] is necessary [7, 51, 64]. We emphasize that this is not in contradiction to the experimental results by [15] where the test set misclassification error still decreases after the training misclassification error is zero [because the training error of the surrogate loss function in (3.3) is not zero before numerical convergence].

Nevertheless, the AdaBoost algorithm is quite resistant to overfitting (slow overfitting behavior) when increasing the number of iterations mstop. This has been observed empirically, although some cases with clear overfitting do occur for some datasets [64]. A stream of work has been devoted to developing VC-type bounds for the generalization (out-of-sample) error to explain why boosting is overfitting very slowly only. Schapire et al. [77] proved a remarkable bound for the generalization misclassification error for classifiers in the convex hull of a base procedure. This bound for the misclassification error has been improved by Koltchinskii and Panchenko [53], deriving also a generalization bound for AdaBoost which depends on the number of boosting iterations.

It has been argued in [33] (rejoinder) and [21] that the overfitting resistance (slow overfitting behavior) is much stronger for the misclassification error than for many other loss functions such as the (out-of-sample) negative log-likelihood (e.g., squared error in Gaussian regression). Thus, boosting's resistance to overfitting is coupled with the general fact that overfitting is less an issue for classification (i.e., the 0-1 loss function). Furthermore, it is proved in [6] that the misclassification risk can be bounded by the risk of the surrogate loss function: it demonstrates from a different perspective that the 0-1 loss can exhibit quite a different behavior than the surrogate loss.

Finally, Section 5.1 develops the variance and bias for boosting when utilized to fit a one-dimensional curve. Figure 5 illustrates the difference between the boosting and the smoothing spline approach, and the eigen-analysis of the boosting method [see (5.2)] yields the following: boosting's variance increases with exponentially small increments while its squared bias decreases exponentially fast as the number of iterations grows. This also explains why boosting's overfitting kicks in very slowly.
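To make the AdaBoost recursion of Section 1.2 concrete, the following minimal R sketch implements the algorithm with a classification stump (an rpart tree of depth one) as weighted base procedure. It is our own illustration under the coding y ∈ {0, 1}; the names ada_boost and predict_ada are hypothetical and not part of mboost.

library("rpart")

## Minimal AdaBoost sketch (cf. the algorithm in Section 1.2), using a
## classification stump as weighted base procedure; for illustration only.
ada_boost <- function(x, y, mstop = 100) {
  stopifnot(all(y %in% c(0, 1)))
  x <- as.data.frame(x)
  n <- nrow(x)
  w <- rep(1 / n, n)                         # step 1: equal initial weights
  alpha <- numeric(mstop)
  trees <- vector("list", mstop)
  d <- data.frame(.y = factor(y), x)
  for (m in 1:mstop) {
    ## step 2: weighted fit of the base classifier (a stump)
    trees[[m]] <- rpart(.y ~ ., data = d, weights = w, method = "class",
                        control = rpart.control(maxdepth = 1, cp = 0,
                                                minsplit = 2, xval = 0))
    yhat <- as.numeric(predict(trees[[m]], d, type = "class")) - 1
    miss <- as.numeric(yhat != y)
    ## step 3: weighted in-sample misclassification rate and voting weight
    err <- sum(w * miss) / sum(w)
    err <- min(max(err, 1e-10), 1 - 1e-10)   # guard against err = 0 or 1
    alpha[m] <- log((1 - err) / err)
    w <- w * exp(alpha[m] * miss)            # upweight misclassified points
    w <- w / sum(w)
  }
  ## step 4: aggregation by weighted majority voting
  predict_ada <- function(newx) {
    nd <- as.data.frame(newx)
    names(nd) <- names(x)
    votes <- sapply(seq_len(mstop), function(m)
      2 * (as.numeric(predict(trees[[m]], nd, type = "class")) - 1) - 1)
    as.numeric(votes %*% alpha > 0)
  }
  list(alpha = alpha, trees = trees, predict = predict_ada)
}

Calling ada_boost(x, y, mstop = 50)$predict(xnew) returns 0/1 predictions by weighted majority voting; in practice, the implementations discussed later in this paper should of course be preferred.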

1.4 Historical Remarks

The idea of boosting as an ensemble method for improving the predictive performance of a base procedure seems to have its roots in machine learning. Kearns and Valiant [52] proved that if individual classifiers perform at least slightly better than guessing at random, their predictions can be combined and averaged, yielding much better predictions. Later, Schapire [75] proposed a boosting algorithm with provable polynomial runtime to construct such a better ensemble of classifiers. The AdaBoost algorithm [29, 30, 31] is considered as a first path-breaking step toward practically feasible boosting algorithms.

The results from Breiman [15, 16], showing that boosting can be interpreted as a functional gradient descent algorithm, uncover older roots of boosting. In the context of regression, there is an immediate connection to the Gauss-Southwell algorithm [79] for solving a linear system of equations (see Section 4.1) and to Tukey's [83] method of "twicing" (see Section 5.1).

2. FUNCTIONAL GRADIENT DESCENT

Breiman [15, 16] showed that the AdaBoost algorithm can be represented as a steepest descent algorithm in function space which we call functional gradient descent (FGD). Friedman, Hastie and Tibshirani [33] and Friedman [32] then developed a more general, statistical framework which yields a direct interpretation of boosting as a method for function estimation. In their terminology, it is a "stagewise, additive modeling" approach (but the word "additive" does not imply a model fit which is additive in the covariates; see Section 4). Consider the problem of estimating a real-valued function

(2.1)  f*(·) = argmin_{f(·)} E[ρ(Y, f(X))],

where ρ(·, ·) is a loss function which is typically assumed to be differentiable and convex with respect to the second argument. For example, the squared error loss ρ(y, f) = |y − f|^2 yields the well-known population minimizer f*(x) = E[Y | X = x].

2.1 The Generic FGD or Boosting Algorithm

In the sequel, FGD and boosting are used as equivalent terminology for the same method or algorithm. Estimation of f*(·) in (2.1) with boosting can be done by considering the empirical risk n^{-1} Σ_{i=1}^{n} ρ(Y_i, f(X_i)) and pursuing iterative steepest descent in function space. The following algorithm has been given by Friedman [32]:

Generic FGD algorithm

1. Initialize f̂^[0](·) with an offset value. Common choices are

    f̂^[0](·) ≡ argmin_c n^{-1} Σ_{i=1}^{n} ρ(Y_i, c)

   or f̂^[0](·) ≡ 0. Set m = 0.
2. Increase m by 1. Compute the negative gradient −∂ρ(Y, f)/∂f and evaluate it at f̂^[m−1](X_i):

    U_i = −∂ρ(Y_i, f)/∂f |_{f = f̂^[m−1](X_i)},   i = 1, ..., n.

3. Fit the negative gradient vector U_1, ..., U_n to X_1, ..., X_n by the real-valued base procedure (e.g., regression):

    (X_i, U_i)_{i=1}^{n}  -- base procedure -->  ĝ^[m](·).

   Thus, ĝ^[m](·) can be viewed as an approximation of the negative gradient vector.
4. Update f̂^[m](·) = f̂^[m−1](·) + ν · ĝ^[m](·), where 0 < ν ≤ 1 is a step-length factor (see below), that is, proceed along an estimate of the negative gradient vector.
5. Iterate steps 2 to 4 until m = mstop for some stopping iteration mstop.

The stopping iteration, which is the main tuning parameter, can be determined via cross-validation or some information criterion; see Section 5.4. The choice of the step-length factor ν in step 4 is of minor importance, as long as it is "small," such as ν = 0.1. A smaller value of ν typically requires a larger number of boosting iterations and thus more computing time, while the predictive accuracy has been empirically found to be potentially better and almost never worse when choosing ν "sufficiently small" (e.g., ν = 0.1) [32]. Friedman [32] suggests using an additional line search between steps 3 and 4 (in case of loss functions ρ(·, ·) other than squared error): it yields a slightly different algorithm, but the additional line search seems unnecessary for achieving a good estimator f̂^[mstop]. The latter statement is based on empirical evidence and some mathematical reasoning as described at the beginning of Section 7.
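The generic FGD loop can be written down in a few lines. The following R sketch is our own illustration (not the mboost implementation): it takes the negative gradient of a user-specified loss and any real-valued base fitting routine; here we plug in a low-variance smoothing spline base procedure together with the toy regression model used later for Figure 5.

## Generic FGD sketch (Section 2.1): ngradient(y, f) is the negative gradient
## of the loss, basefit(x, u) returns a function xnew -> ghat(xnew).
## The interface and names are our own illustration, not the mboost API.
fgd <- function(x, y, ngradient, basefit, mstop = 100, nu = 0.1,
                offset = mean(y)) {
  fhat <- rep(offset, length(y))        # step 1: initialize with an offset
  ensemble <- vector("list", mstop)
  for (m in 1:mstop) {
    u <- ngradient(y, fhat)             # step 2: negative gradient
    g <- basefit(x, u)                  # step 3: fit it by the base procedure
    ensemble[[m]] <- g
    fhat <- fhat + nu * g(x)            # step 4: small step along the estimate
  }
  ## return the prediction function f^[mstop](.)
  function(xnew) {
    pred <- rep(offset, length(xnew))
    for (g in ensemble) pred <- pred + nu * g(xnew)
    pred
  }
}

## Example: squared error loss (negative gradient = residuals) with a
## low-df smoothing spline base procedure, cf. Sections 3.3.1 and 4.2.
base_spline <- function(x, u) {
  fit <- smooth.spline(x, u, df = 4)
  function(xnew) predict(fit, xnew)$y
}
set.seed(1)
x <- sort(runif(100, -0.5, 0.5))
y <- 0.8 * x + sin(6 * x) + rnorm(100, sd = sqrt(2))
fboost <- fgd(x, y, ngradient = function(y, f) y - f,
              basefit = base_spline, mstop = 100, nu = 0.1)
plot(x, y); lines(x, fboost(x), lwd = 2)

With the squared error loss the negative gradient equals the residuals, so this loop specializes to the L2Boosting algorithm described in Section 3.3.1.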

−1 n 2.1.1 Alternative formulation in function space. −hU, gˆi = n i=1 Uigˆ(Xi). For certain base pro- In steps 2 and 3 of the generic FGD algorithm, we cedures, the two algorithms coincide. For example, P associated with U1,...,Un a negative gradient vec- ifg ˆ(·) is the componentwise linear least squares base −1 n tor. A reason for this can be seen from the following procedure described in (4.1), it holds that n i=1(Ui − formulation in function space which is similar to the gˆ(X ))2 = C − hU, gˆi, where C = n−1 n U 2 is a i i=1 Pi exposition in Mason et al. (author?) [62] and to the constant. P discussion in Ridgeway (author?) [72]. Consider the empirical risk functional C(f) = 3. SOME LOSS FUNCTIONS AND BOOSTING −1 n ALGORITHMS n i=1 ρ(Yi,f(Xi)) and the usual inner product hf, gi = n−1 n f(X )g(X ). We can then calculate P i=1 i i Various boosting algorithms can be defined by the negative Gˆateaux derivative dC(·) of the func- P specifying different (surrogate) loss functions ρ(·, ·). tional C(·), The mboost package provides an environment for ∂ defining loss functions via boost family objects, as − dC(f)(x)= − C(f + αδ )| , ∂α x α=0 exemplified below. f : Rp → R, x ∈ Rp, 3.1 Binary Classification where δx denotes the delta- (or indicator-) function For binary classification, the response variable is at x ∈ Rp. In particular, when evaluating the deriva- Y ∈ {0, 1} with P[Y =1]= p. Often, it is notation- [m−1] ˜ tive −dC at fˆ and Xi, we get ally more convenient to encode the response by Y = 2Y − 1 ∈ {−1, +1} (this coding is used in mboost as ˆ[m−1] −1 −dC(f )(Xi)= n Ui, well). We consider the negative binomial log-likelihood as loss function: with U1,...,Un exactly as in steps 2 and 3 of the generic FGD algorithm. Thus, the negative gradient −(y log(p) + (1 − y) log(1 − p)). vector U1,...,Un can be interpreted as a functional (Gˆateaux) derivative evaluated at the data points. We parametrize p = exp(f)/(exp(f) + exp(−f)) so that f = log(p/(1 − p))/2 equals half of the log-odds We point out that the algorithm in Mason et al. ratio; the factor 1/2 is a bit unusual but it will en- (author?) [62] is different from the generic FGD able that the population minimizer of the loss in method above: while the latter is fitting the nega- (3.1) is the same as for the exponential loss in (3.3) tive gradient vector by the base procedure, typically below. Then, the negative log-likelihood is using (nonparametric) least squares, Mason et al. (author?) [62] fit the base procedure by maximizing log(1 + exp(−2˜yf)).
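As a small numerical check of this derivation (our own illustration, independent of the Binomial() family mentioned below), the loss and its negative gradient −∂ρ/∂f = 2ỹ/(1 + exp(2ỹf)) can be coded and compared against a finite-difference derivative:

## Negative binomial log-likelihood as a function of the margin (illustration):
## rho(ytilde, f) = log(1 + exp(-2 * ytilde * f)) and its negative gradient.
rho_loglik <- function(ytilde, f) log(1 + exp(-2 * ytilde * f))
ngrad_loglik <- function(ytilde, f) 2 * ytilde / (1 + exp(2 * ytilde * f))

## quick numerical check of the gradient at a few points
f <- seq(-2, 2, by = 0.5)
num <- -(rho_loglik(1, f + 1e-6) - rho_loglik(1, f - 1e-6)) / 2e-6
all.equal(num, ngrad_loglik(1, f), tolerance = 1e-6)  # TRUE

The rescaled version in (3.1) below differs from this loss only by the constant factor 1/log(2).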

[Fig. 1. Losses, as functions of the margin ỹf = (2y − 1)f, for binary classification. Left panel with monotone loss functions: 0-1 loss, exponential loss, negative log-likelihood, hinge loss (SVM); right panel with nonmonotone loss functions: squared error (L2) and absolute error (L1) as in (3.5).]

By scaling, we prefer to use the equivalent loss func- The population minimizer can be shown to be the tion same as for the log-likelihood loss (cf. [33]):

(3.1) ρlog-lik(˜y,f) = log (1 + exp(−2˜yf)), 1 p(x) 2 f ∗ (x)= log , exp 2 1 − p(x) which then becomes an upper bound of the misclas-   sification error; see Figure 1. In mboost, the neg- p(x)= P[Y = 1|X = x]. ative gradient of this loss function is implemented in a function Binomial () returning an object of Using functional gradient descent with different class boost family which contains the negative gra- (surrogate) loss functions yields different boosting dient function as a slot (assuming a binary response algorithms. When using the log-likelihood loss in variable y ∈ {−1, +1}). (3.1), we obtain LogitBoost [33] or BinomialBoost- The population minimizer can be shown to be (cf. ing from Section 3.3; and with the exponential loss [33]) in (3.3), we essentially get AdaBoost [30] from Sec- tion 1.2. ∗ 1 p(x) ˆ[m] flog-lik(x)= log , We interpret the boosting estimate f (·) as an 2 1 − p(x) ∗   estimate of the population minimizer f (·). Thus, p(x)= P[Y = 1|X = x]. the output from AdaBoost, Logit- or BinomialBoost- ing are estimates of half of the log-odds ratio. In The loss function in (3.1) is a function ofyf ˜ , the particular, we define probability estimates via so-called margin value, where the function f induces the following classifier for Y : exp(fˆ[m](x)) pˆ[m](x)= . 1, if f(x) > 0, exp(fˆ[m](x)) + exp(−fˆ[m](x)) C(x)= 0, if f(x) < 0,  undetermined, if f(x) = 0. The reason for constructing these probability esti-  mates is based on the fact that boosting with a Therefore, a misclassification (including the unde- suitable stopping iteration is consistent [7, 51]. Some ˜ termined case) happens if and only if Yf(X) ≤ 0. cautionary remarks about this line of argumentation Hence, the misclassification loss is are presented by Mease, Wyner and Buja (author?) [64]. (3.2) ρ0−1(y,f)= I{yf˜ ≤0}, Very popular in machine learning is the hinge func- whose population minimizer is equivalent to the tion, the standard loss function for support vector ˜ Bayes classifier (for Y ∈ {−1, +1}) machines: +1, if p(x) > 1/2, f ∗ (x)= ρ (y,f) = [1 − yf˜ ] , 0−1 −1, if p(x) ≤ 1/2, SVM +  where p(x)= P[Y = 1|X = x]. Note that the 0-1 loss where [x]+ = xI{x>0} denotes the positive part. It is in (3.2) cannot be used for boosting or FGD: it is also an upper convex bound of the misclassification nondifferentiable and also nonconvex as a function of error; see Figure 1. Its population minimizer is the margin valueyf ˜ . The negative log-likelihood loss f ∗ (x) = sign(p(x) − 1/2), in (3.1) can be viewed as a convex upper approxima- SVM tion of the (computationally intractable) nonconvex which is the Bayes classifier for Y˜ ∈ {−1, +1}. Since 0-1 loss; see Figure 1. We will describe in Section 3.3 ∗ fSVM(·) is a classifier and noninvertible function of the BinomialBoosting algorithm (similar to Logit- p(x), there is no direct way to obtain conditional Boost [33]) which uses the negative log-likelihood as class probability estimates. loss function (i.e., the surrogate loss which is the implementing loss function for the algorithm). 3.2 Regression Another upper convex approximation of the 0-1 For regression with response Y ∈ R, we use most loss function in (3.2) is the exponential loss often the squared error loss (scaled by the factor

(3.3) ρexp(y,f) = exp(−yf˜ ), 1/2 such that the negative gradient vector equals the residuals; see Section 3.3 below), implemented (with notation y ∈ {−1, +1}) in mboost 1 2 as AdaExp () family. (3.4) ρL2 (y,f)= 2 |y − f| BOOSTING ALGORITHMS AND MODEL FITTING 7 with population minimizer Moreover, both the L1- and L2-loss functions can be parametrized as functions of the margin value f ∗ (x)= E[Y |X = x]. L2 yf˜ (˜y ∈ {−1, +1}):

The corresponding boosting algorithm is L2Boosting; |y˜ − f| = |1 − yf˜ |, see Friedman (author?) [32] and B¨uhlmann and Yu (3.5) |y˜ − f|2 = |1 − yf˜ |2 (author?) [22]. It is described in more detail in Sec- tion 3.3. This loss function is available in mboost as = (1 − 2˜yf + (˜yf)2). family GaussReg(). The L - and L -loss functions are nonmonotone func- Alternative loss functions which have some ro- 1 2 tions of the margin valueyf ˜ ; see Figure 1. A nega- bustness properties (with respect to the error dis- tive aspect is that they penalize margin values which tribution, i.e., in “Y-space”) include the L - and 1 are greater than 1: penalizing large margin values Huber-loss. The former is can be seen as a way to encourage solutions fˆ ∈ ρ (y,f)= |y − f| [−1, 1] which is the range of the population mini- L1 ∗ ∗ ˜ mizers fL1 and fL2 (for Y ∈ {−1, +1}), respectively. with population minimizer However, as discussed below, we prefer to use mono- tone loss functions. f ∗(x) = (Y |X = x) The L2-loss for classification (with response vari- and is implemented in mboost as Laplace(). able y ∈ {−1, +1}) is implemented in GaussClass(). All loss functions mentioned for binary classifica- Although the L1-loss is not differentiable at the point y = f, we can compute partial derivatives since tion (displayed in Figure 1) can be viewed and inter- the single point y = f (usually) has probability zero preted from the perspective of proper scoring rules; cf. Buja, Stuetzle and Shen (author?) [24]. We usu- to be realized by the data. A compromise between ally prefer the negative log-likelihood loss in (3.1) the L - and L -loss is the Huber-loss function from 1 2 because: (i) it yields probability estimates; (ii) it is : a monotone loss function of the margin valueyf ˜ ;

ρHuber(y,f) (iii) it grows linearly as the margin valueyf ˜ tends to −∞, unlike the exponential loss in (3.3). The |y − f|2/2, if |y − f|≤ δ, = third point reflects a robustness aspect: it is similar δ(|y − f|− δ/2), if |y − f| > δ,  to Huber’s loss function which also penalizes large which is available in mboost as Huber(). A strat- values linearly (instead of quadratically as with the egy for choosing (a changing) δ adaptively has been L2-loss). proposed by Friedman (author?) [32]: 3.3 Two Important Boosting Algorithms [m−1] δm = median({|Yi − fˆ (Xi)|; i = 1,...,n}), Table 1 summarizes the most popular loss func- tions and their corresponding boosting algorithms. ˆ[m−1] where the previous fit f (·) is used. We now describe the two algorithms appearing in 3.2.1 Connections to binary classification. Moti- the last two rows of Table 1 in more detail. vated from the population point of view, the L2- or 3.3.1 L2Boosting. L2Boosting is the simplest and L1-loss can also be used for binary classification. For perhaps most instructive boosting algorithm. It is Y ∈ {0, 1}, the population minimizers are very useful for regression, in particular in presence of very many predictor variables. Applying the general f ∗ (x)= E[Y |X = x] L2 description of the FGD algorithm from Section 2.1 P = p(x)= [Y = 1|X = x], to the squared error loss function ρL2 (y,f)= |y − 2 ∗ f| /2, we obtain the following algorithm: fL (x) = median(Y |X = x) 1 L2Boosting algorithm 1, if p(x) > 1/2, = 1. Initialize fˆ[0](·) with an offset value. The default 0, if p(x) ≤ 1/2.  value is fˆ[0](·) ≡ Y . Set m = 0. Thus, the population minimizer of the L1-loss is the 2. Increase m by 1. Compute the residuals Ui = Yi − [m−1] Bayes classifier. fˆ (Xi) for i = 1,...,n. 8 P. BUHLMANN¨ AND T. HOTHORN

3. Fit the residual vector U1,...,Un to X1,...,Xn With BinomialBoosting, there is no need that the by the real-valued base procedure (e.g., regres- base procedure is able to do weighted fitting; this sion): constitutes a slight difference to the requirement for n base procedure [m] Logit-Boost [33]. (Xi, Ui)i=1 −→ gˆ (·). 3.4 Other Data Structures and Models 4. Update fˆ[m](·)= fˆ[m−1](·)+ ν · gˆ[m](·), where 0 < ν ≤ 1 is a step-length factor (as in the general Due to the generic nature of boosting or func- FGD algorithm). tional gradient descent, we can use the technique in 5. Iterate steps 2 to 4 until m = mstop for some stop- very many other settings. For data with univariate ping iteration mstop. responses and loss functions which are differentiable

The stopping iteration mstop is the main tuning with respect to the second argument, the boosting parameter which can be selected using cross-valida- algorithm is described in Section 2.1. Survival analy- tion or some information criterion as described in sis is an important area of application with censored Section 5.4. observations; we describe in Section 8 how to deal The derivation from the generic FGD algorithm with it. in Section 2.1 is straightforward. Note that the neg- ative gradient vector becomes the residual vector. 4. CHOOSING THE BASE PROCEDURE Thus, L Boosting amounts to refitting residuals mul- 2 Every boosting algorithm requires the specifica- tiple times. Tukey (author?) [83] recognized this to tion of a base procedure. This choice can be driven be useful and proposed “twicing,” which is nothing else than L Boosting using m = 2 (and ν = 1). by the aim of optimizing the predictive capacity only 2 stop or by considering some structural properties of the 3.3.2 BinomialBoosting: the FGD version of Logit- boosting estimate in addition. We find the latter Boost. We already gave some reasons at the end of usually more interesting as it allows for better in- Section 3.2.1 why the negative log-likelihood loss terpretation of the resulting model. function in (3.1) is very useful for binary classifi- We recall that the generic boosting estimator is a cation problems. Friedman, Hastie and Tibshirani sum of base procedure estimates (author?) [33] were first in advocating this, and they m proposed Logit-Boost, which is very similar to the fˆ[m](·)= ν gˆ[k](·). generic FGD algorithm when using the loss from (3.1): the deviation from FGD is the use of New- kX=1 ton’s method involving the Hessian matrix (instead Therefore, structural properties of the boosting func- of a step-length for the gradient). tion estimator are induced by a linear combination For the sake of coherence with the generic func- of structural characteristics of the base procedure. tional gradient descent algorithm in Section 2.1, we The following important examples of base proce- describe here a version of LogitBoost; to avoid con- dures yield useful structures for the boosting esti- flicting terminology, we name it BinomialBoosting: mator fˆ[m](·). The notation is as follows:g ˆ(·) is an BinomialBoosting algorithm estimate from a base procedure which is based on Apply the generic FGD algorithm from Section 2.1 data (X1, U1),..., (Xn, Un) where (U1,...,Un) de- using the loss function ρlog-lik from (3.1). The de- notes the current negative gradient. In the sequel, fault offset value is fˆ[0](·) ≡ log(ˆp/(1 − pˆ))/2, where the jth component of a vector c will be denoted by pˆ is the relative frequency of Y = 1. c(j).
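Before turning to the choice of the base procedure, a minimal sketch of BinomialBoosting in practice may be helpful. Assuming the glmboost interface described in Section 4.1 below behaves as documented, a call along the following lines fits a linear logistic regression model by boosting; the simulated data and object names are our own toy example, and the required response coding (two-level factor versus −1/+1, cf. Section 3.1) may depend on the mboost version.

library("mboost")

## Toy binary classification data (simulated for illustration only)
set.seed(29)
n <- 200; p <- 10
x <- matrix(rnorm(n * p), n, p)
colnames(x) <- paste("x", 1:p, sep = "")
eta <- 0.5 * x[, 1] - x[, 2]                 # true half log-odds: two informative covariates
y <- factor(rbinom(n, 1, plogis(2 * eta)))   # response as a two-level factor

d <- data.frame(y = y, x)
## BinomialBoosting: negative binomial log-likelihood loss with
## componentwise linear least squares as base procedure
bb <- glmboost(y ~ ., data = d, family = Binomial(),
               control = boost_control(mstop = 200))
coef(bb)   # coefficients on the scale of half the log-odds ratio

According to Section 3.1, coef(bb) is to be read on the scale of half the log-odds ratio; covariates that are never selected keep a zero coefficient.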

Table 1
Various loss functions ρ(y, f), population minimizers f*(x) and names of corresponding boosting algorithms; p(x) = P[Y = 1 | X = x]

Range spaces           ρ(y, f)                     f*(x)                          Algorithm
y ∈ {0, 1}, f ∈ R      exp(−(2y − 1)f)             (1/2) log(p(x)/(1 − p(x)))     AdaBoost
y ∈ {0, 1}, f ∈ R      log2(1 + e^{−2(2y−1)f})     (1/2) log(p(x)/(1 − p(x)))     LogitBoost / BinomialBoosting
y ∈ R, f ∈ R           (1/2)|y − f|^2              E[Y | X = x]                   L2Boosting
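To connect the last row of Table 1 with the componentwise linear least squares base procedure introduced in Section 4.1 below, here is a bare-bones R sketch of L2Boosting as repeated refitting of residuals, selecting in each iteration the single best predictor in the sense of (4.1). It is our own illustration; the function l2boost_cwls is hypothetical and not part of mboost.

## L2Boosting with componentwise linear least squares (Sections 3.3.1 and 4.1);
## a bare-bones sketch for illustration only, not the mboost implementation.
l2boost_cwls <- function(x, y, mstop = 100, nu = 0.1) {
  y <- as.numeric(y)
  x <- scale(x, center = TRUE, scale = FALSE)   # mean-centered predictors
  offset <- mean(y)                             # default offset f^[0] = mean(y)
  fhat <- rep(offset, length(y))
  beta <- numeric(ncol(x))
  for (m in 1:mstop) {
    u <- y - fhat                               # step 2: current residuals
    ## step 3: componentwise least squares, cf. (4.1)
    bhat <- colSums(x * u) / colSums(x^2)       # simple LS coefficient per column
    rss <- colSums((u - sweep(x, 2, bhat, "*"))^2)
    shat <- which.min(rss)                      # best single predictor
    ## step 4: update coefficient vector and fit, with step length nu
    beta[shat] <- beta[shat] + nu * bhat[shat]
    fhat <- fhat + nu * bhat[shat] * x[, shat]
  }
  list(offset = offset, beta = beta, fitted = fhat)
}

Up to the handling of the intercept, this should correspond to what glmboost with the default squared error family computes with its componentwise linear base learner; the sketch is only meant to make the refitting-of-residuals view and the variable selection in (4.1) explicit.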

4.1 Componentwise Linear Least Squares for ˜ (j) (j) (j) Xi = Xi − X . In case of a linear model, when Linear Models centering also the response Y˜i = Yi −Y , this becomes Boosting can be very useful for fitting potentially p ˜ (j) ˜ (j) high-dimensional generalized linear models. Consider Yi = β Xi + noisei the base procedure jX=1 which forces the regression surface through the cen- ˆ ˆ gˆ(x)= βˆ(S)x(S), ter (˜x(1),..., x˜(p), y˜) = (0, 0,..., 0) as with ordinary n n least squares. Note that it is not necessary to cen- ˆ(j) (j) (j) 2 (4.1) β = Xi Ui (Xi ) , ter the response variables when using the default offset value fˆ[0] = Y in L Boosting. [For Binomi- Xi=1 . Xi=1 2 n alBoosting, we would center the predictor variables ˆ ˆ(j) (j) 2 ˆ[0] S = argmin (Ui − β Xi ) . only but never the response, and we would use f ≡ 1≤j≤p −1 n Xi=1 arg mincn i=1 ρ(Yi, c).] It selects the best variable in a simple linear model IllustrationP: Prediction of total body fat. Garcia et in the sense of ordinary least squares fitting. al. (author?) [34] report on the development of pre- When using L2Boosting with this base procedure, dictive regression equations for body fat content by we select in every iteration one predictor variable, means of p = 9 common anthropometric measure- not necessarily a different one for each iteration, and ments which were obtained for n = 71 healthy Ger- we update the function linearly: man women. In addition, the women’s body compo- sition was measured by dual energy X-ray absorp- ˆ ˆ fˆ[m](x)= fˆ[m−1](x)+ νβˆ(Sm)x(Sm), tiometry (DXA). This reference method is very ac- curate in measuring body fat but finds little appli- where Sˆm denotes the index of the selected predictor cability in practical environments, mainly because variable in iteration m. Alternatively, the update of of high costs and the methodological efforts needed. the coefficient estimates is Therefore, a simple regression equation for predict- ing DXA measurements of body fat is of special [m] [m−1] (Sˆ ) βˆ = βˆ + ν · βˆ m . interest for the practitioner. Backward-elimination ˆ was applied to select important variables from the The notation should be read that only the Smth available anthropometrical measurements and Gar- ˆ[m] component of the coefficient estimate β (in iter- cia et al. (author?) [34] report a final linear model ation m) has been updated. For every iteration m, utilizing hip circumference, knee breadth and a com- we obtain a linear model fit. As m tends to infinity, pound covariate which is defined as the sum of log ˆ[m] f (·) converges to a least squares solution which is chin skinfold, log triceps skinfold and log subscapu- unique if the design matrix has full rank p ≤ n. The lar skinfold: method is also known as matching pursuit in signal R> bf_lm <- lm(DEXfat ~ hipcirc processing [60], weak greedy algorithm in computa- + kneebreadth tional mathematics [81], and it is a Gauss–Southwell + anthro3a, algorithm [79] for solving a linear system of equa- data = bodyfat) tions. We will discuss more properties of L2Boosting R> coef(bf_lm) with componentwise linear least squares in Section (Intercept) hipcirc kneebreadth anthro3a -75.23478 0.51153 1.90199 8.90964 5.2. 
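As a quick numerical check of the convergence statement above (our own toy example, reusing the hypothetical l2boost_cwls sketch shown after Table 1), boosting with componentwise linear least squares approaches the ordinary least squares solution when m grows large:

## Convergence check: componentwise L2Boosting vs. ordinary least squares
set.seed(4)
n <- 100; p <- 5
x <- matrix(rnorm(n * p), n, p)
y <- drop(x %*% c(3, -2, 0, 0, 1)) + rnorm(n)
fit_boost <- l2boost_cwls(x, y, mstop = 5000, nu = 0.1)
fit_ols <- lm.fit(cbind(1, scale(x, scale = FALSE)), y)
max(abs(fit_boost$beta - coef(fit_ols)[-1]))   # close to zero for large mstop

With a smaller mstop, several of the five coefficients would still be exactly zero, illustrating the variable selection effect of early stopping discussed in Section 5.2.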
A simple regression formula which is easy to com- When using BinomialBoosting with component- municate, such as a linear combination of only a wise linear least squares from (4.1), we obtain a fit, few covariates, is of special interest in this applica- including variable selection, of a linear logistic re- tion: we employ the glmboost function from pack- gression model. age mboost to fit a linear regression model by means As will be discussed in more detail in Section 5.2, of L2Boosting with componentwise linear least squares. boosting typically shrinks the (logistic) regression By default, the function glmboost fits a linear model coefficients toward zero. Usually, we do not want to (with initial mstop = 100 and shrinkage parameter shrink the intercept term. In addition, we advocate ν = 0.1) by minimizing squared error (argument family to use boosting on mean centered predictor variables = GaussReg() is the default): 10 P. BUHLMANN¨ AND T. HOTHORN

R> bf_glm <- glmboost(DEXfat ~ ., data = bodyfat, control= boost_control (center = TRUE)) Note that, by default, the mean of the response variable is used as an offset in the first step of the boosting algorithm. We center the covariates prior to model fitting in addition. As mentioned above, the special form of the base learner, that is, compo- nentwise linear least squares, allows for a reformu- lation of the boosting fit in terms of a linear combi- nation of the covariates which can be assessed via R> coef(bf_glm) (Intercept) age waistcirc hipcirc 0.000000 0.013602 0.189716 0.351626 elbowbreadth kneebreadth anthro3a anthro3b -0.384140 1.736589 3.326860 3.656524 anthro3c anthro4 0.595363 0.000000 attr(,"offset") [1] 30.783 We notice that most covariates have been used for fitting and thus no extensive variable selection was performed in the above model. Thus, we need to in- vestigate how many boosting iterations are appro- priate. Resampling methods such as cross-validation or the bootstrap can be used to estimate the out- of-sample error for a varying number of boosting it- erations. The out-of-bootstrap for 100 bootstrap samples is depicted in the upper part of Figure 2. The plot leads to the impression that approximately mstop = 44 would be a sufficient Fig. 2. bodyfat data: Out-of-bootstrap squared error for number of boosting iterations. In Section 5.4, a cor- varying number of boosting iterations mstop (top). The dashed rected version of the Akaike information criterion horizontal line depicts the average out-of-bootstrap error (AIC) is proposed for determining the optimal num- of the linear model for the preselected variables hipcirc, ber of boosting iterations. This criterion attains its kneebreadth and anthro3a fitted via ordinary least squares. The lower part shows the corrected AIC criterion. minimum for R> mstop(aic <- AIC(bf_glm)) [1] 45 −1 n n i=1 Yi = 30.783 is the intercept in the uncen- boosting iterations; see the bottom part of Fig- tered model). Note that the variables hipcirc, P ure 2 in addition. The coefficients of the linear model kneebreadth and anthro3a, which we have used for with mstop = 45 boosting iterations are fitting a linear model at the beginning of this para- R> coef(bf_glm[mstop(aic)]) graph, have been selected by the boosting algorithm (Intercept) age waistcirc hipcirc as well. 0.0000000 0.0023271 0.1893046 0.3488781 elbowbreadth kneebreadth anthro3a anthro3b 4.2 Componentwise Smoothing Spline for 0.0000000 1.5217686 3.3268603 3.6051548 anthro3c anthro4 Additive Models 0.5043133 0.0000000 attr(,"offset") Additive and generalized additive models, intro- [1] 30.783 duced by Hastie and Tibshirani (author?) [40] (see and thus seven covariates have been selected for also [41]), have become very popular for adding more the final model (intercept equal to zero occurs here flexibility to the linear structure in generalized lin- for mean centered response and predictors and hence, ear models. Such flexibility can also be added in BOOSTING ALGORITHMS AND MODEL FITTING 11 boosting (whose framework is especially useful for df = 4. This yields low variance but typically large high-dimensional problems). bias of the base procedure. The bias can then be re- We can choose to use a nonparametric base pro- duced by additional boosting iterations. This choice cedure for function estimation. Suppose that of low variance but high bias has been analyzed in B¨uhlmann and Yu (author?) [22]; see also Sec- fˆ(j)(·) is a least squares cubic smoothing tion 4.4. 
spline estimate based on U1,...,Un against (4.2) (j) (j) Componentwise smoothing splines can be gener- X1 ,...,Xn with fixed degrees of freedom alized to pairwise smoothing splines which search df. for and fit the best pairs of predictor variables such That is, that smoothing of U1,...,Un against this pair of pre- n dictors reduces the residual sum of squares most. (j) 2 ˆ(j) With L2Boosting, this yields a nonparametric model f (·) =argmin (Ui − f(Xi )) f(·) fit with first-order interaction terms. The procedure Xi=1 (4.3) has been empirically demonstrated to be often much + λ (f ′′(x))2 dx, better than fitting with MARS [23]. Z where λ> 0 is a tuning parameter such that the Illustration: Prediction of total body fat (cont.). trace of the corresponding hat matrix equals df. For Being more flexible than the linear model which we further details, we refer to Green and Silverman (au- fitted to the bodyfat data in Section 4.1, we esti- thor?) [36]. As a note of caution, we use in the sequel mate an additive model using the gamboost func- the terminology of “hat matrix” in a broad sense: it tion from mboost (first with prespecified mstop = is a linear operator but not a projection in general. 100 boosting iterations, ν = 0.1 and squared error The base procedure is then loss): R> bf_gam ˆ ˆ gˆ(x)= fˆ(S)(x(S)), <- gamboost(DEXfat ~ ., data = bodyfat) fˆ(j)(·) as above and The degrees of freedom in the componentwise n (j) smoothing spline base procedure can be defined by Sˆ = argmin (U − fˆ(j)(X ))2, i i the dfbase argument, defaulting to 4. 1≤j≤p i=1 X We can estimate the number of boosting iterations where the degrees of freedom df are the same for all m using the corrected AIC criterion described in ˆ(j) stop f (·). Section 5.4 via L2Boosting with componentwise smoothing splines R> mstop(aic <- AIC(bf_gam)) yields an additive model, including variable selec- [1] 46 tion, that is, a fit which is additive in the predic- Similarly to the linear regression model, the par- tor variables. This can be seen immediately since tial contributions of the covariates can be extracted L2Boosting proceeds additively for updating the func- from the boosting fit. For the most important vari- [m] tion fˆ (·); see Section 3.3. We can normalize to ables, the partial fits are given in Figure 3 showing obtain the following additive model estimator: some slight nonlinearity, mainly for kneebreadth. p 4.3 Trees fˆ[m](x) =µ ˆ + fˆ[m],(j)(x(j)), jX=1 In the machine learning community, regression trees n are the most popular base procedures. They have −1 ˆ[m],(j) (j) n f (Xi ) = 0 for all j = 1,...,p. the advantage to be invariant under monotone trans- Xi=1 formations of predictor variables, that is, we do not As with the componentwise linear least squares base need to search for good data transformations. More- procedure, we can use componentwise smoothing over, regression trees handle covariates measured at splines also in BinomialBoosting, yielding an addi- different scales (continuous, ordinal or nominal vari- tive logistic regression fit. ables) in a unified way; unbiased split or variable se- The degrees of freedom in the smoothing spline lection in the context of different scales is proposed base procedure should be chosen “small” such as in [47]. 12 P. BUHLMANN¨ AND T. HOTHORN

Fig. 3. bodyfat data: Partial contributions of four covariates in an additive model (without centering of estimated functions to mean zero).

When using stumps, that is, a tree with two ter- = boost_control minal nodes only, the boosting estimate will be an (mstop = 500)) additive model in the original predictor variables, Conditional inference trees [47] as available from because every stump-estimate is a function of a sin- the party package [46] are utilized as base proce- gle predictor variable only. Similarly, boosting trees dures. Here, the function boost control defines the with (at most) d terminal nodes result in a nonpara- number of boosting iterations mstop. metric model having at most interactions of order Alternatively, we can use the function gbm from d − 2. Therefore, if we want to constrain the degree the gbm package which yields roughly the same fit of interactions, we can easily do this by constraining as can be seen from Figure 4. the (maximal) number of nodes in the base proce- dure. 4.4 The Low-Variance Principle Illustration: Prediction of total body fat (cont.). We have seen above that the structural properties Both the gbm package [74] and the mboost package of a boosting estimate are determined by the choice are helpful when decision trees are to be used as base of a base procedure. In our opinion, the structure procedures. In mboost, the function blackboost im- specification should come first. After having made plements boosting for fitting such classical black-box a choice, the question becomes how “complex” the models: base procedure should be. For example, how should R> bf_black we choose the degrees of freedom for the componen- <- blackboost(DEXfat ~ ., twise smoothing spline in (4.2)? A general answer data = bodyfat, is: choose the base procedure (having the desired control structure) with low variance at the price of larger BOOSTING ALGORITHMS AND MODEL FITTING 13

Consider the case with a linear base procedure having a hat matrix H : Rn → Rn, mapping the re- ⊤ sponse variables Y = (Y1,...,Yn) to their fitted ⊤ values (fˆ(X1),..., fˆ(Xn)) . Examples include non- parametric kernel smoothers or smoothing splines. It is easy to show that the hat matrix of the L2Boosting fit (for simplicity, with fˆ[0] ≡ 0 and ν = 1) in itera- tion m equals

Bm = Bm−1 + H(I −Bm−1) (5.1) = I − (I −H)m. Formula (5.1) allows for several insights. First, if the base procedure satisfies kI − Hk < 1 for a suit- able norm, that is, has a “learning capacity” such that the residual vector is shorter than the input- response vector, we see that Bm converges to the Fig. 4. bodyfat data: Fitted values of both the gbm and identity I as m →∞, and BmY converges to the mboost implementations of L2Boosting with different regres- fully saturated model Y, interpolating the response sion trees as base learners. variables exactly. Thus, we see here explicitly that we have to stop early with the boosting iterations estimation bias. For the componentwise smoothing in order to prevent overfitting. splines, this would imply a low number of degrees of When specializing to the case of a cubic smoothing freedom, for example, df = 4. spline base procedure [cf. (4.3)], it is useful to invoke We give some reasons for the low-variance prin- some eigenanalysis. The spectral representation is ciple in Section 5.1 (Replica 1). Moreover, it has been demonstrated in Friedman (author?) [32] that H = UDU ⊤, a small step-size factor ν can be often beneficial U ⊤U = UU ⊤ = I, and almost never yields substantially worse predic- tive performance of boosting estimates. Note that D = diag(λ1,...,λn), a small step-size factor can be seen as a shrinkage where λ ≥ λ ≥···≥ λ denote the (ordered) eigen- of the base procedure by the factor ν, implying low 1 2 n variance but potentially large estimation bias. values of H. It then follows with (5.1) that ⊤ Bm = UDmU , 5. L2BOOSTING Dm = diag(d1,m,...,dn,m), L Boosting is functional gradient descent using 2 m the squared error loss which amounts to repeated di,m = 1 − (1 − λi) . fitting of ordinary residuals, as described already in It is well known that a smoothing spline satisfies Section 3.3.1. Here, we aim at increasing the under- standing of the simple L2Boosting algorithm. We λ1 = λ2 = 1, 0 < λi < 1 (i = 3,...,n). first start with a toy problem of curve estimation, Therefore, the eigenvalues of the boosting hat oper- and we will then illustrate concepts and results which ator (matrix) in iteration m satisfy are especially useful for high-dimensional data. These can serve as heuristics for boosting algorithms with (5.2) d1,m ≡ d2,m ≡ 1 for all m, other convex loss functions for problems in for ex- m 0 < di,m = 1 − (1 − λi) < 1 (i = 3,...,n), ample, classification or survival analysis. (5.3) 5.1 Nonparametric Curve Estimation: From di,m → 1 (m →∞). Basics to Asymptotic Optimality When comparing the spectrum, that is, the set of Consider the toy problem of estimating a regres- eigenvalues, of a smoothing spline with its boosted sion function E[Y |X = x] with one-dimensional pre- version, we have the following. For both cases, the dictor X ∈ R and a continuous response Y ∈ R. largest two eigenvalues are equal to 1. Moreover, all 14 P. BUHLMANN¨ AND T. HOTHORN

Fig. 5. Mean squared prediction error E[(f(X) − fˆ(X))2] for the regression model Yi = 0.8Xi + sin(6Xi) + εi (i = 1,...,n = 100), with ε ∼ N (0, 2), Xi ∼ U(−1/2, 1/2), averaged over 100 simulation runs. Left: L2Boosting with smoothing spline base procedure (having fixed degrees of freedom df = 4) and using ν = 0.1, for varying number of boosting iterations. Right: single smoothing spline with varying degrees of freedom. other eigenvalues can be changed either by vary- (without the need of choosing a higher-order spline n ing the degrees of freedom df = i=1 λi in a single base procedure). smoothing spline, or by varying the boosting iter- Recently, asymptotic convergence and minimax P ation m with some fixed (low-variance) smoothing rate results have been established for early-stopped spline base procedure having fixed (low) values λi. boosting in more general settings [10, 91]. In Figure 5 we demonstrate the difference between 5.1.1 L Boosting using kernel estimators. As we the two approaches for changing “complexity” of 2 have pointed out in Replica 1, L Boosting of smooth- the estimated curve fit by means of a toy example 2 ing splines can achieve faster mean squared error first shown in [22]. Both methods have about the convergence rates than the classical O(n−4/5), as- same minimum mean squared error, but L Boosting 2 suming that the true underlying function is suffi- overfits much more slowly than a single smoothing ciently smooth. We illustrate here a related phe- spline. nomenon with kernel estimators. By careful inspection of the eigenanalysis for this We consider fixed, univariate design points x = simple case of boosting a smoothing spline, B¨uhlmann i i/n (i = 1,...,n) and the Nadaraya–Watson kernel and Yu (author?) [22] proved an asymptotic mini- estimator for the nonparametric regression function max rate result: E[Y |X = x]: Replica 1 ([22]). When stopping the boosting it- n −1 x − xi erations appropriately, that is, mstop = mn = gˆ(x; h) = (nh) K Yi 4/(2ξ+1) h O(n ), mn →∞ (n →∞) with ξ ≥ 2 as be- Xi=1   low, L2Boosting with cubic smoothing splines having n −1 fixed degrees of freedom achieves the minimax con- = n Kh(x − xi)Yi, vergence rate over Sobolev function classes of smooth- Xi=1 ness degree ξ ≥ 2, as n →∞. where h> 0 is the bandwidth, K(·) is a kernel in the form of a probability density which is symmetric Two items are interesting. First, minimax rates −1 are achieved by using a base procedure with fixed around zero and Kh(x)= h K(x/h). It is straight- degrees of freedom which means low variance from forward to derive the form of L2Boosting using m = 2 iterations (with fˆ[0] ≡ 0 and ν = 1), that is, twicing an asymptotic perspective. Second, L2Boosting with cubic smoothing splines has the capability to adapt [83], with the Nadaraya–Watson kernel estimator: to higher-order smoothness of the true underlying n ˆ[2] −1 tw function; thus, with the stopping iteration as the f (x) = (nh) Kh (x − xi)Yi, one and only tuning parameter, we can neverthe- Xi=1 tw less adapt to any higher-order degree of smoothness Kh (u) = 2Kh(u) − Kh ∗ Kh(u), BOOSTING ALGORITHMS AND MODEL FITTING 15 where Illustration: Breast cancer subtypes. Variable se- n lection is especially important in high-dimensional −1 Kh ∗ Kh(u)= n Kh(u − xr)Kh(xr). situations. 
As an example, we study a binary classi- rX=1 fication problem involving p = 7129 gene expression tw For fixed design points xi = i/n, the kernel Kh (·) levels in n = 49 breast cancer tumor samples (data is asymptotically equivalent to a higher-order kernel taken from [90]). For each sample, a binary response (which can take negative values) yielding a squared variable describes the lymph node status (25 nega- 8 bias term of order O(h ), assuming that the true tive and 24 positive). regression function is four times continuously differ- The data are stored in form of an exprSet object entiable. Thus, twicing or L2Boosting with m = 2 westbc (see [35]) and we first extract the matrix of iterations amounts to a Nadaraya–Watson kernel expression levels and the response variable: estimator with a higher-order kernel. This explains R> x <- t(exprs(westbc)) from another angle why boosting is able to improve R> y <- pData(westbc)$nodal.y the mean squared error rate of the base procedure. We aim at using L Boosting for classification (see More details including also nonequispaced designs 2 Section 3.2.1), with classical AIC based on the bi- are given in DiMarzio and Taylor (author?) [27]. nomial log-likelihood for stopping the boosting it- 5.2 L2Boosting for High-Dimensional Linear erations. Thus, we first transform the factor y to a Models numeric variable with 0/1 coding: Consider a potentially high-dimensional linear mo- R> yfit <- as.numeric(y) - 1 del The general framework implemented in mboost al- p lows us to specify the negative gradient (the ngradient (j) (j) (5.4) Yi = β0 + β Xi + εi, i = 1,...,n, argument) corresponding to the surrogate loss func- jX=1 tion, here the squared error loss implemented as a where ε1,...,εn are i.i.d. with E[εi] = 0 and inde- function rho, and a different evaluating loss func- pendent from all Xi’s. We allow for the number of tion (the loss argument), here the negative bino- predictors p to be much larger than the sample size mial log-likelihood, with the Family function as fol- n. The model encompasses the representation of a lows: noisy signal by an expansion with an overcomplete R> rho <- function(y, f, w = 1) { dictionary of functions {g(j)(·) : j = 1,...,p}; for ex- p <- pmax(pmin(1 - 1e-05, f), ample, for surface modeling with design points in 1e-05) 2 Zi ∈ R , -y * log(p) - (1 - y)

Yi = f(Zi)+ εi, * log(1 - p) } (j) (j) 2 f(z)= β g (z) (z ∈ R ). R> ngradient Xj <- function(y, f, w = 1) y - f Fitting the model (5.4) can be done using R> offset L2Boosting with the componentwise linear least <- function(y, w) squares base procedure from Section 4.1 which fits weighted.mean(y, w) in every iteration the best predictor variable reduc- R> L2fm <- Family(ngradient = ing the residual sum of squares most. This method ngradient, has the following basic properties: loss = rho, 1. As the number m of boosting iterations increases, offset = offset) [m] the L2Boosting estimate fˆ (·) converges to a The resulting object (called L2fm), bundling the least squares solution. This solution is unique if negative gradient, the loss function and a function the design matrix has full rank p ≤ n. for computing an offset term (offset), can now 2. When stopping early, which is usually needed to be passed to the glmboost function for boosting avoid overfitting, the L2Boosting method often with componentwise linear least squares (here ini- does variable selection. tial mstop = 200 iterations are used): 3. The coefficient estimates βˆ[m] are (typically) R> ctrl <- boost_control shrunken versions of a least squares estimate βˆOLS, (mstop = 200, related to the Lasso as described in Section 5.2.1. center = TRUE) 16 P. BUHLMANN¨ AND T. HOTHORN

R> west_glm <- glmboost(x, yfit,
                        family = L2fm,
                        control = ctrl)

Fitting such a linear model to p = 7129 covariates for n = 49 observations takes about 3.6 seconds on a medium-scale desktop computer (Intel Pentium 4, 2.8 GHz). Thus, this form of estimation and variable selection is computationally very efficient. As a comparison, computing all Lasso solutions, using package lars [28, 39] in R (with use.Gram = FALSE), takes about 6.7 seconds.

The question how to choose mstop can be addressed by the classical AIC criterion as follows:

R> aic <- AIC(west_glm,
              method = "classical")
R> mstop(aic)
[1] 100

where the AIC is computed as −2(log-likelihood) + 2(degrees of freedom) = 2(evaluating loss) + 2(degrees of freedom); see (5.8). The notion of degrees of freedom is discussed in Section 5.3.

Figure 6 shows the AIC curve depending on the number of boosting iterations. When we stop after mstop = 100 boosting iterations, we obtain 33 genes with nonzero regression coefficients whose standardized values β̂^{(j)} \sqrt{\widehat{Var}(X^{(j)})} are depicted in the left panel of Figure 6. Of course, we could also use BinomialBoosting for analyzing the data; the computational CPU time would be of the same order of magnitude, that is, only a few seconds.

Fig. 6. westbc data: Standardized regression coefficients β̂^{(j)} \sqrt{\widehat{Var}(X^{(j)})} (left panel) for mstop = 100 determined from the classical AIC criterion shown in the right panel.

5.2.1 Connections to the Lasso. Hastie, Tibshirani and Friedman [42] first pointed out an intriguing connection between L2Boosting with componentwise linear least squares and the Lasso [82], which is the following ℓ1-penalty method:

(5.5)   β̂(λ) = argmin_β n^{-1} \sum_{i=1}^n \Big(Y_i − β_0 − \sum_{j=1}^p β^{(j)} X_i^{(j)}\Big)^2 + λ \sum_{j=1}^p |β^{(j)}|.

Efron et al. [28] made the connection rigorous and explicit: they considered a version of L2Boosting, called forward stagewise linear regression (FSLR), and they showed that FSLR with infinitesimally small step-sizes (i.e., the value ν in step 4 of the L2Boosting algorithm in Section 3.3.1) produces a set of solutions which is approximately equivalent to the set of Lasso solutions when varying the regularization parameter λ in the Lasso [see (5.5)]. The approximate equivalence is derived by representing FSLR and Lasso as two different modifications of the computationally efficient least angle regression (LARS) algorithm from Efron et al. [28] (see also [68] for generalized linear models). The latter is very similar to the algorithm proposed earlier by Osborne, Presnell and Turlach [67]. In special cases where the design matrix satisfies a "positive cone condition," FSLR, Lasso and LARS all coincide ([28], page 425). For more general situations, when adding some backward steps to boosting, such modified L2Boosting coincides with the Lasso (Zhao and Yu [93]). Despite the fact that L2Boosting and Lasso are not equivalent methods in general, it may be useful to interpret boosting as being "related" to ℓ1-penalty based methods.

5.2.2 Asymptotic consistency in high dimensions. We review here a result establishing asymptotic consistency for very high-dimensional but sparse linear models as in (5.4). To capture the notion of high-dimensionality, we equip the model with a dimensionality p = p_n which is allowed to grow with sample size n; moreover, the coefficients β^{(j)} = β_n^{(j)} are now potentially depending on n and the regression function is denoted by f_n(·).

Theorem 2 ([18]). Consider the linear model in (5.4). Assume that p_n = O(exp(n^{1−ξ})) for some 0 < ξ ≤ 1 (high-dimensionality) and sup_{n∈N} \sum_{j=1}^{p_n} |β_n^{(j)}| < ∞ (sparseness of the true regression function w.r.t. the ℓ1-norm); moreover, the variables X_i^{(j)} are bounded and E[|ε_i|^{4/ξ}] < ∞. Then: when stopping the boosting iterations appropriately, that is, m = m_n → ∞ (n → ∞) sufficiently slowly, L2Boosting with componentwise linear least squares satisfies

E_{X_{new}}[(f̂^{[m_n]}(X_{new}) − f_n(X_{new}))^2] → 0 in probability (n → ∞),

where X_new denotes new predictor variables, independent of and with the same distribution as the X-component of the data (X_i, Y_i) (i = 1, . . . , n).

The result holds for almost arbitrary designs and no assumptions about collinearity or correlations are required. Theorem 2 identifies boosting as a method which is able to consistently estimate a very high-dimensional but sparse linear model; for the Lasso in (5.5), a similar result holds as well [37]. In terms of empirical performance, there seems to be no overall superiority of L2Boosting over Lasso or vice versa.
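As a toy illustration of this sparse high-dimensional setting, one might simulate a model with p much larger than n and only a few active coefficients and check which variables L2Boosting picks up. The following sketch is not taken from the original analysis (data, dimensions, seed and mstop are hypothetical); it reuses the matrix interface of glmboost and the coefficient idiom from above:

R> set.seed(29)
R> n <- 100; p <- 1000
R> Xsim <- matrix(rnorm(n * p), nrow = n)
R> colnames(Xsim) <- paste("x", 1:p, sep = "")
R> beta <- c(3, -2, 1.5, 1, rep(0, p - 4))
R> ysim <- drop(Xsim %*% beta + rnorm(n))
R> sim_glm <- glmboost(Xsim, ysim,
                       control = boost_control(mstop = 200,
                                               center = TRUE))
R> ## names of the nonzero coefficients; ideally a small superset of
R> ## the truly active variables x1, ..., x4
R> names(coef(sim_glm))[abs(coef(sim_glm)) > 0]

Early stopping (for example, via AIC as above) keeps the selected set small; letting mstop grow would successively add more noise variables.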

5.2.3 Transforming predictor variables. In view of Theorem 2, we may enrich the design matrix in model (5.4) with many transformed predictors: if the true regression function can be represented as a sparse linear combination of original or transformed predictors, consistency is still guaranteed. It should be noted, though, that the inclusion of noneffective variables in the design matrix does degrade the finite-sample performance to a certain extent.

For example, higher-order interactions can be specified in generalized AN(C)OVA models, and L2Boosting with componentwise linear least squares can be used to select a small number out of potentially many interaction terms.

As an option for continuously measured covariates, we may utilize a B-spline basis as illustrated in the next paragraph. We emphasize that during the process of L2Boosting with componentwise linear least squares, individual spline basis functions from various predictor variables are selected and fitted one at a time; in contrast, L2Boosting with componentwise smoothing splines fits a whole smoothing spline function (for a selected predictor variable) at a time.

Illustration: Prediction of total body fat (cont.). Such transformations and estimation of a corresponding linear model can be done with the glmboost function, where the model formula performs the computations of all transformations by means of the bs (B-spline basis) function from the package splines. First, we set up a formula transforming each covariate:

R> bsfm
DEXfat ~ bs(age) + bs(waistcirc) + bs(hipcirc) +
    bs(elbowbreadth) + bs(kneebreadth) + bs(anthro3a) +
    bs(anthro3b) + bs(anthro3c) + bs(anthro4)

and then fit the complex linear model by using the glmboost function with initial mstop = 5000 boosting iterations:

R> ctrl <- boost_control(mstop = 5000)
R> bf_bs <- glmboost(bsfm, data = bodyfat,
                     control = ctrl)
R> mstop(aic <- AIC(bf_bs))
[1] 2891

The corrected AIC criterion (see Section 5.4) suggests stopping after mstop = 2891 boosting iterations, and the final model selects 21 (transformed) predictor variables. Again, the partial contributions of each of the nine original covariates can be computed easily and are shown in Figure 7 (for the same variables as in Figure 3). Note that the depicted functional relationship derived from the model fitted above (Figure 7) is qualitatively the same as the one derived from the additive model (Figure 3).
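Which of the individual B-spline basis coefficients are nonzero at the corrected-AIC stopping iteration can be inspected with the subsetting and coef idiom used elsewhere in this paper. This is a brief hedged sketch (the object names follow the code above; the exact labels of the bs() design columns, such as "bs(hipcirc)3", are assumptions):

R> bf_aic <- bf_bs[mstop(aic)]
R> ## number of selected (transformed) predictor variables
R> sum(abs(coef(bf_aic)) > 0)
R> ## their labels, i.e., individual B-spline basis columns
R> names(coef(bf_aic))[abs(coef(bf_aic)) > 0]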

Fig. 7. bodyfat data: Partial fits for a linear model fitted to transformed covariates using B-splines (without centering of estimated functions to mean zero).

5.3 Degrees of Freedom for L2Boosting

A notion of degrees of freedom will be useful for estimating the stopping iteration of boosting (Section 5.4).

5.3.1 Componentwise linear least squares. We consider L2Boosting with componentwise linear least squares. Denote by

H^{(j)} = X^{(j)} (X^{(j)})^⊤ / ‖X^{(j)}‖^2,   j = 1, . . . , p,

the n × n hat matrix for the linear least squares fitting operator using the jth predictor variable X^{(j)} = (X_1^{(j)}, . . . , X_n^{(j)})^⊤ only; ‖x‖^2 = x^⊤x denotes the Euclidean norm for a vector x ∈ R^n. The hat matrix of the componentwise linear least squares base procedure [see (4.1)] is then

H^{(Ŝ)} : (U_1, . . . , U_n) ↦ (Û_1, . . . , Û_n),

where Ŝ is as in (4.1). Similarly to (5.1), we then obtain the hat matrix of L2Boosting in iteration m:

B_m = B_{m−1} + ν · H^{(Ŝ_m)} (I − B_{m−1})
(5.6)
    = I − (I − ν H^{(Ŝ_m)}) (I − ν H^{(Ŝ_{m−1})}) · · · (I − ν H^{(Ŝ_1)}),

where Ŝ_r ∈ {1, . . . , p} denotes the component which is selected in the componentwise least squares base procedure in the rth boosting iteration. We emphasize that B_m depends on the response variable Y via the selected components Ŝ_r, r = 1, . . . , m. Due to this dependence on Y, B_m should be viewed as an approximate hat matrix only. Neglecting the selection effect of Ŝ_r (r = 1, . . . , m), we define the degrees of freedom of the boosting fit in iteration m as

df(m) = trace(B_m).

Even with ν = 1, df(m) is very different from counting the number of variables which have been selected until iteration m.

Having some notion of degrees of freedom at hand, we can estimate the error variance σ_ε^2 = E[ε_i^2] in the linear model (5.4) by

σ̂_ε^2 = \frac{1}{n − df(m_{stop})} \sum_{i=1}^n (Y_i − f̂^{[m_{stop}]}(X_i))^2.

Moreover, we can represent

(5.7)   B_m = \sum_{j=1}^p B_m^{(j)},

where B_m^{(j)} is the (approximate) hat matrix which yields the fitted values for the jth predictor, that is, B_m^{(j)} Y = X^{(j)} β̂_j^{[m]}. Note that the B_m^{(j)}'s can be easily computed in an iterative way by updating as follows:

B_m^{(Ŝ_m)} = B_{m−1}^{(Ŝ_m)} + ν · H^{(Ŝ_m)} (I − B_{m−1}),
B_m^{(j)} = B_{m−1}^{(j)}   for all j ≠ Ŝ_m.

Thus, we have a decomposition of the total degrees of freedom into p terms:

df(m) = \sum_{j=1}^p df^{(j)}(m),   df^{(j)}(m) = trace(B_m^{(j)}).

The individual degrees of freedom df^{(j)}(m) are a useful measure to quantify the "complexity" of the individual coefficient estimate β̂_j^{[m]}.

5.4 Internal Stopping Criteria for L2Boosting

Having some degrees of freedom at hand, we can now use information criteria for estimating a good stopping iteration, without pursuing some sort of cross-validation.

We can use the corrected AIC [49]:

AIC_c(m) = log(σ̂^2) + \frac{1 + df(m)/n}{1 − (df(m) + 2)/n},

σ̂^2 = n^{-1} \sum_{i=1}^n (Y_i − (B_m Y)_i)^2.

In mboost, the corrected AIC criterion can be computed via AIC(x, method = "corrected") (with x being an object returned by glmboost or gamboost called with family = GaussReg()). Alternatively, we may employ the gMDL criterion (Hansen and Yu [38]):

gMDL(m) = log(S) + \frac{df(m)}{n} log(F),

S = \frac{n σ̂^2}{n − df(m)},   F = \frac{\sum_{i=1}^n Y_i^2 − n σ̂^2}{df(m) S}.

The gMDL criterion bridges the AIC and BIC in a data-driven way: it is an attempt to adaptively select the better among the two.

When using L2Boosting for binary classification (see also the end of Section 3.2 and the illustration in Section 5.2), we prefer to work with the binomial log-likelihood in AIC,

(5.8)   AIC(m) = −2 \sum_{i=1}^n [Y_i log((B_m Y)_i) + (1 − Y_i) log(1 − (B_m Y)_i)] + 2 df(m),

or for BIC(m) with the penalty term log(n) df(m). (If (B_m Y)_i ∉ [0, 1], we truncate by max(min((B_m Y)_i, 1 − δ), δ) for some small δ > 0, for example, δ = 10^{-5}.)
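The recursion (5.6) and the criteria above are straightforward to code directly. The following is a minimal sketch (not part of mboost; the simulated data, the step-length ν and the number of iterations are hypothetical choices) that runs componentwise linear least squares L2Boosting on centered inputs, accumulates the approximate hat matrix B_m, and evaluates df(m) and AIC_c(m) along the way:

R> set.seed(7)
R> n <- 50; p <- 10; nu <- 0.1; msteps <- 100
R> X <- scale(matrix(rnorm(n * p), nrow = n))   ## centered and scaled predictors
R> y <- X[, 1] - 0.5 * X[, 2] + rnorm(n)
R> yc <- y - mean(y)                            ## centered response (offset = mean)
R> B <- matrix(0, n, n)                         ## approximate hat matrix B_m
R> f <- rep(0, n)                               ## current fit
R> aicc <- numeric(msteps)
R> for (m in 1:msteps) {
       U <- yc - f                              ## current residuals = negative gradient
       ## componentwise least squares: pick the predictor reducing the RSS most
       bhat <- crossprod(X, U) / colSums(X^2)
       rss <- sapply(1:p, function(j) sum((U - X[, j] * bhat[j])^2))
       S <- which.min(rss)
       H_S <- tcrossprod(X[, S]) / sum(X[, S]^2)   ## hat matrix H^(S) of the selected predictor
       f <- f + nu * H_S %*% U
       B <- B + nu * H_S %*% (diag(n) - B)         ## update (5.6)
       df <- sum(diag(B))                          ## df(m) = trace(B_m)
       sigma2 <- mean((yc - B %*% yc)^2)
       aicc[m] <- log(sigma2) + (1 + df / n) / (1 - (df + 2) / n)
   }
R> which.min(aicc)                                 ## estimated stopping iteration

For the binomial criterion (5.8) one would replace the Gaussian residual sum of squares by the binomial log-likelihood evaluated at the (truncated) fitted values (B_m Y)_i.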

6. BOOSTING FOR VARIABLE SELECTION

We address here the question whether boosting is a good variable selection scheme. For problems with many predictor variables, boosting is computationally much more efficient than classical all subset selection schemes. The mathematical properties of boosting for variable selection are still open questions, for example, whether it leads to a consistent model selection method.

6.1 L2Boosting

When borrowing from the analogy of L2Boosting with the Lasso (see Section 5.2.1), the following is relevant. Consider a linear model as in (5.4), allowing for p ≫ n but being sparse. Then, there is a sufficient and "almost" necessary neighborhood stability condition (the word "almost" refers to a strict inequality "<" whereas "≤" suffices for sufficiency) such that for some suitable penalty parameter λ in (5.5), the Lasso finds the true underlying submodel (the predictor variables with corresponding regression coefficients ≠ 0) with probability tending quickly to 1 as n → ∞ [65]. It is important to note the role of the sufficient and "almost" necessary condition of the Lasso for model selection: Zhao and Yu [94] call it the "irrepresentable condition" which has (mainly) implications on the "degree of collinearity" of the design (predictor variables), and they give examples where it holds and where it fails to be true. A further complication is the fact that when tuning the Lasso for prediction optimality, that is, choosing the penalty parameter λ in (5.5) such that the mean squared error is minimal, the probability for estimating the true submodel converges to a number which is less than 1 or even zero if the problem is high-dimensional [65]. In fact, the prediction optimal tuned Lasso selects asymptotically too large models.

The bias of the Lasso mainly causes the difficulties mentioned above. We often would like to construct estimators which are less biased. It is instructive to look at regression with orthonormal design, that is, the model (5.4) with \sum_{i=1}^n X_i^{(j)} X_i^{(k)} = δ_{jk}. Then, the Lasso and also L2Boosting with componentwise linear least squares and using very small ν (in step 4 of L2Boosting; see Section 3.3.1) yield the soft-threshold estimator [23, 28]; see Figure 8. It exhibits the same amount of bias regardless by how much the observation (the variable z in Figure 8) exceeds the threshold. This is in contrast to the hard-threshold estimator and the adaptive Lasso in (6.1) which are much better in terms of bias.

Fig. 8. Hard-threshold (dotted-dashed), soft-threshold (dotted) and adaptive Lasso (solid) estimator in a linear model with orthonormal design. For this design, the adaptive Lasso coincides with the nonnegative garrote [13]. The value on the x-abscissa, denoted by z, is a single component of X^⊤Y.

Nevertheless, the (computationally efficient) Lasso seems to be a very useful method for variable filtering: for many cases, the prediction optimal tuned Lasso selects a submodel which contains the true model with high probability. A nice proposal to correct Lasso's overestimation behavior is the adaptive Lasso, given by Zou [96]. It is based on reweighting the penalty function. Instead of (5.5), the adaptive Lasso estimator is

(6.1)   β̂(λ) = argmin_β n^{-1} \sum_{i=1}^n \Big(Y_i − β_0 − \sum_{j=1}^p β^{(j)} X_i^{(j)}\Big)^2 + λ \sum_{j=1}^p \frac{|β^{(j)}|}{|β̂_{init}^{(j)}|},

where β̂_{init} is an initial estimator, for example, the Lasso (from a first stage of Lasso estimation). Consistency of the adaptive Lasso for variable selection has been proved for the case with fixed predictor-dimension p [96] and also for the high-dimensional case with p = p_n ≫ n [48].

We do not expect that boosting is free from the difficulties which occur when using the Lasso for variable selection. The hope is, though, that boosting would also produce an interesting set of submodels when varying the number of iterations.

6.2 Twin Boosting

Twin Boosting [19] is the boosting analogue to the adaptive Lasso. It consists of two stages of boosting: the first stage is as usual, and the second stage is enforced to resemble the first boosting round. For example, if a variable has not been selected in the first round of boosting, it will not be selected in the second; this property also holds for the adaptive Lasso in (6.1), that is, β̂_{init}^{(j)} = 0 enforces β̂^{(j)} = 0. Moreover, Twin Boosting with componentwise linear least squares is proved to be equivalent to the adaptive Lasso for the case of an orthonormal linear model, and it is empirically shown, in general and for various base procedures and models, that it has much better variable selection properties than the corresponding boosting algorithm [19]. In special settings, similar results can be obtained with Sparse Boosting [23]; however, Twin Boosting is much more generically applicable.
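For the orthonormal-design comparison underlying Figure 8, the three estimators act componentwise on z = (X^⊤Y)_j and can be written down directly. The following small sketch (the threshold value λ is a hypothetical choice, and parametrizations of λ differ between the methods) reproduces the qualitative behavior discussed above:

R> soft <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)   ## Lasso / FSLR limit
R> hard <- function(z, lambda) z * (abs(z) > lambda)                ## hard-thresholding
R> garrote <- function(z, lambda) z * pmax(1 - lambda / z^2, 0)     ## adaptive Lasso = nonnegative garrote here
R> z <- seq(-4, 4, length = 401)
R> lambda <- 1
R> ## soft-thresholding keeps a constant bias of size lambda for large |z|,
R> ## whereas hard-thresholding and the garrote approach the identity line
R> matplot(z, cbind(soft(z, lambda), hard(z, lambda), garrote(z, lambda)),
           type = "l", lty = c(3, 4, 1), ylab = "estimate")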
7. BOOSTING FOR EXPONENTIAL FAMILY MODELS

For exponential family models with general loss functions, we can use the generic FGD algorithm as described in Section 2.1.

First, we address the issue about omitting a line search between steps 3 and 4 of the generic FGD algorithm. Consider the empirical risk at iteration m,

(7.1)   n^{-1} \sum_{i=1}^n ρ(Y_i, f̂^{[m]}(X_i)) ≈ n^{-1} \sum_{i=1}^n ρ(Y_i, f̂^{[m−1]}(X_i)) − ν n^{-1} \sum_{i=1}^n U_i ĝ^{[m]}(X_i),

using a first-order Taylor expansion and the definition of U_i. Consider the case with the componentwise linear least squares base procedure and, without loss of generality, with standardized predictor variables [i.e., n^{-1} \sum_{i=1}^n (X_i^{(j)})^2 = 1 for all j]. Then,

ĝ^{[m]}(x) = n^{-1} \sum_{i=1}^n U_i X_i^{(Ŝ_m)} x^{(Ŝ_m)},

and the expression in (7.1) becomes

(7.2)   n^{-1} \sum_{i=1}^n ρ(Y_i, f̂^{[m]}(X_i)) ≈ n^{-1} \sum_{i=1}^n ρ(Y_i, f̂^{[m−1]}(X_i)) − ν \Big(n^{-1} \sum_{i=1}^n U_i X_i^{(Ŝ_m)}\Big)^2.

In case of the squared error loss ρ_{L2}(y, f) = |y − f|^2/2, we obtain the exact identity:

n^{-1} \sum_{i=1}^n ρ_{L2}(Y_i, f̂^{[m]}(X_i)) = n^{-1} \sum_{i=1}^n ρ_{L2}(Y_i, f̂^{[m−1]}(X_i)) − ν(1 − ν/2) \Big(n^{-1} \sum_{i=1}^n U_i X_i^{(Ŝ_m)}\Big)^2.

Comparing this with (7.2), we see that functional gradient descent with a general loss function and without additional line-search behaves very similarly to L2Boosting (since ν is small) with respect to optimizing the empirical risk; for L2Boosting, the numerical convergence rate is n^{-1} \sum_{i=1}^n ρ_{L2}(Y_i, f̂^{[m]}(X_i)) = O(m^{-1/6}) (m → ∞) [81]. This completes our reasoning why the line-search in the general functional gradient descent algorithm can be omitted, of course at the price of doing more iterations but not necessarily more computing time (since the line-search is omitted in every iteration).

7.1 BinomialBoosting

For binary classification with Y ∈ {0, 1}, BinomialBoosting uses the negative binomial log-likelihood from (3.1) as loss function. The algorithm is described in Section 3.3.2. Since the population minimizer is f^∗(x) = log[p(x)/(1 − p(x))]/2, estimates from BinomialBoosting are on half of the logit-scale: the componentwise linear least squares base procedure yields a logistic linear model fit while using componentwise smoothing splines fits a logistic additive model. Many of the concepts and facts from Section 5 about L2Boosting become useful heuristics for BinomialBoosting.

One principal difference is the derivation of the boosting hat matrix. Instead of (5.6), a linearization argument leads to the following recursion [assuming f̂^{[0]}(·) ≡ 0] for an approximate hat matrix B_m:

B_1 = 4νW^{[0]} H^{(Ŝ_1)},
(7.3)   B_m = B_{m−1} + 4νW^{[m−1]} H^{(Ŝ_m)} (I − B_{m−1})   (m ≥ 2),
W^{[m]} = diag(p̂^{[m]}(X_i)(1 − p̂^{[m]}(X_i)); 1 ≤ i ≤ n).

A derivation is given in Appendix A.2. Degrees of freedom are then defined as in Section 5.3,

df(m) = trace(B_m),

and they can be used for information criteria, for example,

AIC(m) = −2 \sum_{i=1}^n [Y_i log(p̂^{[m]}(X_i)) + (1 − Y_i) log(1 − p̂^{[m]}(X_i))] + 2 df(m),

or for BIC(m) with the penalty term log(n) df(m). In mboost, this AIC criterion can be computed via AIC(x, method = "classical") (with x being an object returned by glmboost or gamboost called with family = Binomial()).
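The "half of the logit-scale" statement can be checked empirically on simulated data. This is a hypothetical sketch, not from the original analyses: it assumes the formula interface with a two-level factor response as used below, and agreement is only approximate (and up to the factor coding) for a sufficiently large mstop; as noted below, the offset f̂^{[0]} would have to be added to the intercept before comparing it.

R> set.seed(1)
R> nsim <- 500
R> x1 <- rnorm(nsim); x2 <- rnorm(nsim)
R> prob <- plogis(1 + 2 * x1 - x2)
R> d <- data.frame(y = factor(rbinom(nsim, size = 1, prob = prob)),
                   x1 = x1, x2 = x2)
R> fit_glm <- glm(y ~ x1 + x2, data = d, family = binomial())
R> fit_boost <- glmboost(y ~ x1 + x2, data = d, family = Binomial(),
                         control = boost_control(mstop = 1000))
R> coef(fit_glm)[c("x1", "x2")]
R> 2 * coef(fit_boost)[c("x1", "x2")]   ## roughly the logistic regression slopes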
Illustration: Wisconsin prognostic breast cancer. Prediction models for recurrence events in breast cancer patients based on covariates which have been computed from a digitized image of a fine needle aspirate of breast tissue (those measurements describe characteristics of the cell nuclei present in the image) have been studied by Street, Mangasarian and Wolberg [80] (the data are part of the UCI repository [11]).

We first analyze these data as a binary prediction problem (recurrence vs. nonrecurrence) and later in Section 8 by means of survival models. We are faced with many covariates (p = 32) for a limited number of observations without missing values (n = 194), and variable selection is an important issue. We can choose a classical logistic regression model via AIC in a stepwise algorithm as follows:

R> cc <- complete.cases(wpbc)
R> wpbc2 <- wpbc[cc, colnames(wpbc) != "time"]
R> wpbc_step <- step(glm(status ~ .,
                         data = wpbc2,
                         family = binomial()), trace = 0)

The final model consists of 16 parameters with

R> logLik(wpbc_step)
'log Lik.' -80.13 (df=16)
R> AIC(wpbc_step)
[1] 192.26

and we want to compare this model to a logistic regression model fitted via gradient boosting. We simply select the Binomial family [with default offset of 1/2 log(p̂/(1 − p̂)), where p̂ is the empirical proportion of recurrences] and we initially use mstop = 500 boosting iterations:

R> ctrl <- boost_control(mstop = 500,
                         center = TRUE)
R> wpbc_glm <- glmboost(status ~ .,
                        data = wpbc2,
                        family = Binomial(),
                        control = ctrl)

The classical AIC criterion (−2 log-likelihood + 2 df) suggests to stop after

R> aic <- AIC(wpbc_glm, "classical")
R> aic
[1] 199.54
Optimal number of boosting iterations: 465
Degrees of freedom (for mstop = 465): 9.147

boosting iterations. We now restrict the number of boosting iterations to mstop = 465 and then obtain the estimated coefficients via

R> wpbc_glm <- wpbc_glm[mstop(aic)]
R> coef(wpbc_glm)[abs(coef(wpbc_glm)) > 0]
    (Intercept)     mean_radius    mean_texture
    -1.2511e-01     -5.8453e-03     -2.4505e-02
mean_smoothness   mean_symmetry mean_fractaldim
     2.8513e+00     -3.9307e+00     -2.8253e+01
     SE_texture    SE_perimeter  SE_compactness
    -8.7553e-02      5.4917e-02      1.1463e+01
   SE_concavity SE_concavepoints     SE_symmetry
    -6.9238e+00     -2.0454e+01      5.2125e+00
  SE_fractaldim    worst_radius worst_perimeter
     5.2187e+00      1.3468e-02      1.2108e-03
     worst_area worst_smoothness worst_compactness
     1.8646e-04      9.9560e+00     -1.9469e-01
          tsize          pnodes
     4.1561e-02      2.4445e-02

(Because of using the offset-value f̂^{[0]}, we have to add the value f̂^{[0]} to the reported intercept estimate above for the logistic regression model.)

A generalized additive model adds more flexibility to the regression function but is still interpretable. We fit a logistic additive model to the wpbc data as follows:

R> wpbc_gam <- gamboost(status ~ .,
                        data = wpbc2,
                        family = Binomial())
R> mopt <- mstop(aic <- AIC(wpbc_gam, "classical"))
R> aic
[1] 199.76
Optimal number of boosting iterations: 99
Degrees of freedom (for mstop = 99): 14.583

This model selected 16 out of 32 covariates. The partial contributions of the four most important variables are depicted in Figure 9 indicating a remarkable degree of nonlinearity.

7.2 PoissonBoosting

For count data with Y ∈ {0, 1, 2, . . .}, we can use Poisson regression: we assume that Y | X = x has a Poisson(λ(x)) distribution and the goal is to estimate the function f(x) = log(λ(x)). The negative log-likelihood yields then the loss function

ρ(y, f) = −yf + exp(f),   f = log(λ),

which can be used in the functional gradient descent algorithm in Section 2.1, and it is implemented in mboost as the Poisson() family.

Similarly to (7.3), the approximate boosting hat matrix is computed by the following recursion:

B_1 = νW^{[0]} H^{(Ŝ_1)},
(7.4)   B_m = B_{m−1} + νW^{[m−1]} H^{(Ŝ_m)} (I − B_{m−1})   (m ≥ 2),
W^{[m]} = diag(λ̂^{[m]}(X_i); 1 ≤ i ≤ n).
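In analogy to the hand-made L2fm family in Section 5.2, the Poisson ingredients above could also be plugged into Family() by hand. A hedged sketch follows (the built-in Poisson() family is the supported route; the argument conventions of Family() are assumed to be exactly as in the L2fm example, and the offset choice is an assumption):

R> poisfm <- Family(ngradient = function(y, f, w = 1)
                        y - exp(f),                       ## negative gradient y - lambda
                    loss = function(y, f, w = 1)
                        -y * f + exp(f),                  ## negative Poisson log-likelihood (up to a constant)
                    offset = function(y, w)
                        log(weighted.mean(y, w)))

Such an object could then be passed to glmboost or gamboost via the family argument, just as L2fm was used above.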

Fig. 9. wpbc data: Partial contributions of four selected covariates in an additive logistic model (without centering of estimated functions to mean zero).

7.3 Initialization of Boosting

We have briefly described in Sections 2.1 and 4.1 the issue of choosing an initial value f̂^{[0]}(·) for boosting. This can be quite important for applications where we would like to estimate some parts of a model in an unpenalized (nonregularized) fashion, with others being subject to regularization.

For example, we may think of a parametric form of f̂^{[0]}(·), estimated by maximum likelihood, and deviations from the parametric model would be built in by pursuing boosting iterations (with a nonparametric base procedure). A concrete example would be: f̂^{[0]}(·) is the maximum likelihood estimate in a generalized linear model and boosting would be done with componentwise smoothing splines to model additive deviations from a generalized linear model. A related strategy has been used in [4] for modeling multivariate volatility in financial time series.

Another example would be a linear model Y = Xβ + ε as in (5.4) where some of the predictor variables, say the first q predictor variables X^{(1)}, . . . , X^{(q)}, enter the estimated linear model in an unpenalized way. We propose to do ordinary least squares regression on X^{(1)}, . . . , X^{(q)}: consider the projection P_q onto the linear span of X^{(1)}, . . . , X^{(q)} and use L2Boosting with componentwise linear least squares on the new response (I − P_q)Y and the new (p − q)-dimensional predictor (I − P_q)X. The final model estimate is then

\sum_{j=1}^q β̂_{OLS,j} x^{(j)} + \sum_{j=q+1}^p β̂_j^{[m_{stop}]} x̃^{(j)},

where the latter part is from L2Boosting and x̃^{(j)} is the residual when linearly regressing x^{(j)} to x^{(1)}, . . . , x^{(q)}. A special case which is used in most applications is with q = 1 and X^{(1)} ≡ 1 encoding for an intercept. Then, (I − P_1)Y = Y − Ȳ and (I − P_1)X^{(j)} = X^{(j)} − n^{-1} \sum_{i=1}^n X_i^{(j)}. This is exactly the proposal at the end of Section 4.1. For generalized linear models, analogous concepts can be used.

8. SURVIVAL ANALYSIS

The negative gradient of Cox's partial likelihood can be used to fit proportional hazards models to censored response variables with boosting algorithms [71]. Of course, all types of base procedures can be utilized; for example, componentwise linear least squares fits a Cox model with a linear predictor.

Alternatively, we can use the weighted least squares framework with weights arising from inverse probability censoring. We sketch this approach in the sequel; details are given in [45]. We assume complete data of the following form: survival times T_i ∈ R^+ (some of them right-censored) and predictors X_i ∈ R^p, i = 1, . . . , n. We transform the survival times to the log-scale, but this step is not crucial for what follows: Y_i = log(T_i). What we observe is

O_i = (Ỹ_i, X_i, ∆_i),   Ỹ_i = log(T̃_i),   T̃_i = min(T_i, C_i),

where ∆_i = I(T_i ≤ C_i) is a censoring indicator and C_i is the censoring time. Here, we make a restrictive assumption that C_i is conditionally independent of T_i given X_i (and we assume independence among different indices i); this implies that the coarsening at random assumption holds [89].

We consider the squared error loss for the complete data, ρ(y, f) = |y − f|^2 (without the irrelevant factor 1/2). For the observed data, the following weighted version turns out to be useful:

ρ_{obs}(o, f) = (ỹ − f)^2 \frac{∆}{G(t̃ | x)},   G(c | x) = P[C > c | X = x].

Thus, the observed data loss function is weighted by the inverse probability for censoring ∆ G(t̃ | x)^{-1} (the weights are inverse probabilities of censoring; IPC). Under the coarsening at random assumption, it then holds that

E_{Y,X}[(Y − f(X))^2] = E_O[ρ_{obs}(O, f(X))];

see van der Laan and Robins [89]. The strategy is then to estimate G(· | x), for example, by the Kaplan-Meier estimator, and do weighted L2Boosting using the weighted squared error loss:

\sum_{i=1}^n \frac{∆_i}{Ĝ(T̃_i | X_i)} (Ỹ_i − f(X_i))^2,

where the weights are of the form ∆_i Ĝ(T̃_i | X_i)^{-1} (the specification of the estimator Ĝ(t | x) may play a substantial role in the whole procedure). As demonstrated in the previous sections, we can use various base procedures as long as they allow for weighted least squares fitting. Furthermore, the concepts of degrees of freedom and information criteria are analogous to Sections 5.3 and 5.4. Details are given in [45].

Illustration: Wisconsin prognostic breast cancer (cont.). Instead of the binary response variable describing the recurrence status, we make use of the additionally available time information for modeling the time to recurrence; that is, all observations with nonrecurrence are censored. First, we calculate IPC weights:

R> censored <- wpbc$status == "R"
R> iw <- IPCweights(Surv(wpbc$time, censored))
R> wpbc3 <- wpbc[, names(wpbc) != "status"]

and fit a weighted linear model by boosting with componentwise linear weighted least squares as base procedure:

R> ctrl <- boost_control(mstop = 500, center = TRUE)
R> wpbc_surv <- glmboost(log(time) ~ ., data = wpbc3,
                         control = ctrl, weights = iw)
R> mstop(aic <- AIC(wpbc_surv))
[1] 122
R> wpbc_surv <- wpbc_surv[mstop(aic)]

The following variables have been selected for fitting:

R> names(coef(wpbc_surv)[abs(coef(wpbc_surv)) > 0])
 [1] "mean_radius"      "mean_texture"
 [3] "mean_perimeter"   "mean_smoothness"
 [5] "mean_symmetry"    "SE_texture"
 [7] "SE_smoothness"    "SE_concavepoints"
 [9] "SE_symmetry"      "worst_concavepoints"

and the fitted values are depicted in Figure 10, showing a reasonable model fit.

Alternatively, a Cox model with linear predictor can be fitted using L2Boosting by implementing the negative gradient of the partial likelihood (see [71]) via

R> ctrl <- boost_control(center = TRUE)
R> glmboost(Surv(wpbc$time,
                 wpbc$status == "N") ~ .,
            data = wpbc,
            family = CoxPH(),
            control = ctrl)

For more examples, such as fitting an additive Cox model using mboost, see [44].
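The IPC weights used above can be understood, and roughly reproduced, from a Kaplan-Meier estimate of the censoring distribution G. This is a minimal sketch assuming the survival package; it ignores details such as ties and the left-continuity convention that IPCweights() may use, so it is only meant to illustrate the construction:

R> library("survival")
R> event <- wpbc$status == "R"                      ## recurrence = uncensored observation
R> km_cens <- survfit(Surv(wpbc$time, !event) ~ 1)  ## Kaplan-Meier estimate of G(t) = P(C > t)
R> Ghat <- stepfun(km_cens$time, c(1, km_cens$surv))
R> ## weights Delta_i / G-hat(T~_i); guard against division by (near) zero
R> w_manual <- as.numeric(event) / pmax(Ghat(wpbc$time), 1e-6)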

Fig. 10. wpbc data: Fitted values of an IPC-weighted linear model, taking both time to recurrence and censoring information into account. The radius of the circles is proportional to the IPC weight of the corresponding observation; censored observations with IPC weight zero are not plotted.

9. OTHER WORKS

We briefly summarize here some other works which have not been mentioned in the earlier sections. A very different exposition than ours is the overview of boosting by Meir and Rätsch [66].

9.1 Methodology and Applications

Boosting methodology has been used for various other statistical models than what we have discussed in the previous sections. Models for multivariate responses are studied in [20, 59]; some multiclass boosting methods are discussed in [33, 95]. Other works deal with boosting approaches for generalized linear and nonparametric models [55, 56, 85, 86], for flexible semiparametric mixed models [88] or for nonparametric models with quality constraints [54, 87]. Boosting methods for estimating propensity scores, a special weighting scheme for modeling observational data, are proposed in [63].

There are numerous applications of boosting methods to real data problems. We mention here classification of tumor types from gene expressions [25, 26], multivariate financial time series [2, 3, 4], text classification [78], document routing [50] or survival analysis [8] (different from the approach in Section 8).

9.2 Asymptotic Theory

The asymptotic analysis of boosting algorithms includes consistency and minimax rate results. The first consistency result for AdaBoost has been given by Jiang [51], and a different constructive proof with a range for the stopping value mstop = mstop,n is given in [7]. Later, Zhang and Yu [92] generalized the results for a functional gradient descent with an additional relaxation scheme, and their theory covers also more general loss functions than the exponential loss in AdaBoost. For L2Boosting, the first minimax rate result has been established by Bühlmann and Yu [22]. This has been extended to much more general settings by Yao, Rosasco and Caponnetto [91] and Bissantz et al. [10].

In the machine learning community, there has been a substantial focus on estimation in the convex hull of function classes (cf. [5, 6, 58]). For example, one may want to estimate a regression or probability function by using

\sum_{k=1}^∞ ŵ_k ĝ^{[k]}(·),   ŵ_k ≥ 0,   \sum_{k=1}^∞ ŵ_k = 1,

where the ĝ^{[k]}(·)'s belong to a function class such as stumps or trees with a fixed number of terminal nodes. The estimator above is a convex combination of individual functions, in contrast to boosting which pursues a linear combination. By scaling, which is necessary in practice and theory (cf. [58]), one can actually look at this as a linear combination of functions whose coefficients satisfy \sum_k ŵ_k = λ. This then represents an ℓ1-constraint as in the Lasso, a relation which we have already seen from another perspective in Section 5.2.1. Consistency of such convex combination or ℓ1-regularized "boosting" methods has been given by Lugosi and Vayatis [58]. Mannor, Meir and Zhang [61] and Blanchard, Lugosi and Vayatis [12] derived results for rates of convergence of (versions of) convex combination schemes.

APPENDIX A.1: SOFTWARE

The data analyses presented in this paper have been performed using the mboost add-on package to the R system of statistical computing. The theoretical ingredients of boosting algorithms, such as loss functions and their negative gradients, base learners and internal stopping criteria, find their computational counterparts in the mboost package. Its implementation and user-interface reflect our statistical perspective of boosting as a tool for estimation in structured models. For example, and extending the reference implementation of tree-based gradient boosting from the gbm package [74], mboost allows one to fit potentially high-dimensional linear or smooth additive models, and it has methods to compute degrees of freedom which in turn allow for the use of information criteria such as AIC or BIC or for estimation of variance. Moreover, for high-dimensional (generalized) linear models, our implementation is very fast to fit models even when the dimension of the predictor space is in the ten-thousands.

The Family function in mboost can be used to create an object of class boost family implementing the negative gradient for general surrogate loss functions. Such an object can later be fed into the fitting procedure of a linear or additive model which optimizes the corresponding empirical risk (an example is given in Section 5.2). Therefore, we are not limited to already implemented boosting algorithms, but can easily set up our own boosting procedure by implementing the negative gradient of the surrogate loss function of interest.

Both the source version as well as binaries for several operating systems of the mboost [43] package are freely available from the Comprehensive R Archive Network (http://CRAN.R-project.org). The reader can install our package directly from the R prompt via

R> install.packages("mboost",
                    dependencies = TRUE)
R> library("mboost")

All analyses presented in this paper are contained in a package vignette. The rendered output of the analyses is available by the R-command

R> vignette("mboost_illustrations",
            package = "mboost")

whereas the R code for reproducibility of our analyses can be accessed by

R> edit(vignette("mboost_illustrations",
                 package = "mboost"))

There are several alternative implementations of boosting techniques available as R add-on packages. The reference implementation for tree-based gradient boosting is gbm [74]. Boosting for additive models based on penalized B-splines is implemented in GAMBoost [9, 84].

APPENDIX A.2: DERIVATION OF BOOSTING HAT MATRICES

Derivation of (7.3). The negative gradient is

−\frac{∂}{∂f} ρ(y, f) = 2(y − p),   p = \frac{exp(f)}{exp(f) + exp(−f)}.

Next, we linearize p̂^{[m]}: we denote p̂^{[m]} = (p̂^{[m]}(X_1), . . . , p̂^{[m]}(X_n))^⊤ and analogously for f̂^{[m]}. Then,

(A.1)   p̂^{[m]} ≈ p̂^{[m−1]} + \frac{∂p}{∂f}\Big|_{f = f̂^{[m−1]}} (f̂^{[m]} − f̂^{[m−1]}) = p̂^{[m−1]} + 2W^{[m−1]} νH^{(Ŝ_m)} 2(Y − p̂^{[m−1]}),

where W^{[m]} = diag(p̂(X_i)(1 − p̂(X_i)); 1 ≤ i ≤ n). Since for the hat matrix, B_m Y = p̂^{[m]}, we obtain from (A.1)

B_1 ≈ 4νW^{[0]} H^{(Ŝ_1)},
B_m ≈ B_{m−1} + 4νW^{[m−1]} H^{(Ŝ_m)} (I − B_{m−1})   (m ≥ 2),

which shows that (7.3) is approximately true.

Derivation of formula (7.4). The arguments are analogous to those for the binomial case above. Here, the negative gradient is

−\frac{∂}{∂f} ρ(y, f) = y − λ,   λ = exp(f).

When linearizing λ̂^{[m]} = (λ̂^{[m]}(X_1), . . . , λ̂^{[m]}(X_n))^⊤ we get, analogously to (A.1),

λ̂^{[m]} ≈ λ̂^{[m−1]} + \frac{∂λ}{∂f}\Big|_{f = f̂^{[m−1]}} (f̂^{[m]} − f̂^{[m−1]}) = λ̂^{[m−1]} + W^{[m−1]} νH^{(Ŝ_m)} (Y − λ̂^{[m−1]}),

where W^{[m]} = diag(λ̂(X_i); 1 ≤ i ≤ n). We then complete the derivation of (7.4) as in the binomial case above.

ACKNOWLEDGMENTS

We would like to thank Axel Benner, Florian Leitenstorfer, Roman Lutz and Lukas Meier for discussions and detailed remarks. Moreover, we thank four referees, the editor and the executive editor Ed George for constructive comments. The work of T. Hothorn was supported by Deutsche Forschungsgemeinschaft (DFG) under grant HO 3242/1-3.

REFERENCES

[1] Amit, Y. and Geman, D. (1997). Shape quantization and recognition with randomized trees. Neural Computation 9 1545–1588.
[2] Audrino, F. and Barone-Adesi, G. (2005). Functional gradient descent for financial time series with an application to the measurement of market risk. J. Banking and Finance 29 959–977.
[3] Audrino, F. and Barone-Adesi, G. (2005). A multivariate FGD technique to improve VaR computation in equity markets. Comput. Management Sci. 2 87–106.
[4] Audrino, F. and Bühlmann, P. (2003). Volatility estimation with functional gradient descent for very high-dimensional financial time series. J. Comput. Finance 6 65–89.
[5] Bartlett, P. (2003). Prediction algorithms: Complexity, concentration and convexity. In Proceedings of the 13th IFAC Symp. on System Identification.
[6] Bartlett, P. L., Jordan, M. and McAuliffe, J. (2006). Convexity, classification, and risk bounds. J. Amer. Statist. Assoc. 101 138–156. MR2268032
[7] Bartlett, P. and Traskin, M. (2007). AdaBoost is consistent. J. Mach. Learn. Res. 8 2347–2368.
[8] Benner, A. (2002). Application of "aggregated classifiers" in survival time studies. In Proceedings in Computational Statistics (COMPSTAT) (W. Härdle and B. Rönz, eds.) 171–176. Physica-Verlag, Heidelberg. MR1973489
[9] Binder, H. (2006). GAMBoost: Generalized additive models by likelihood based boosting. R package version 0.9-3. Available at http://CRAN.R-project.org.
[10] Bissantz, N., Hohage, T., Munk, A. and Ruymgaart, F. (2007). Convergence rates of general regularization methods for statistical inverse problems and applications. SIAM J. Numer. Anal. 45 2610–2636.
[11] Blake, C. L. and Merz, C. J. (1998). UCI repository of machine learning databases. Available at http://www.ics.uci.edu/~mlearn/MLRepository.html.
[12] Blanchard, G., Lugosi, G. and Vayatis, N. (2003). On the rate of convergence of regularized boosting classifiers. J. Machine Learning Research 4 861–894. MR2076000
[13] Breiman, L. (1995). Better subset regression using the nonnegative garrote. Technometrics 37 373–384. MR1365720
[14] Breiman, L. (1996). Bagging predictors. Machine Learning 24 123–140.
[15] Breiman, L. (1998). Arcing classifiers (with discussion). Ann. Statist. 26 801–849. MR1635406
[16] Breiman, L. (1999). Prediction games and arcing algorithms. Neural Computation 11 1493–1517.
[17] Breiman, L. (2001). Random forests. Machine Learning 45 5–32.
[18] Bühlmann, P. (2006). Boosting for high-dimensional linear models. Ann. Statist. 34 559–583. MR2281878
[19] Bühlmann, P. (2007). Twin boosting: Improved feature selection and prediction. Technical report, ETH Zürich. Available at ftp://ftp.stat.math.ethz.ch/Research-Reports/Other-Manuscripts/buhlmann/TwinBoosting1.pdf.
[20] Bühlmann, P. and Lutz, R. (2006). Boosting algorithms: With an application to bootstrapping multivariate time series. In The Frontiers in Statistics (J. Fan and H. Koul, eds.) 209–230. Imperial College Press, London. MR2326003
[21] Bühlmann, P. and Yu, B. (2000). Discussion on "Additive logistic regression: A statistical view," by J. Friedman, T. Hastie and R. Tibshirani. Ann. Statist. 28 377–386.
[22] Bühlmann, P. and Yu, B. (2003). Boosting with the L2 loss: Regression and classification. J. Amer. Statist. Assoc. 98 324–339. MR1995709
[23] Bühlmann, P. and Yu, B. (2006). Sparse boosting. J. Machine Learning Research 7 1001–1024. MR2274395
[24] Buja, A., Stuetzle, W. and Shen, Y. (2005). Loss functions for binary class probability estimation: Structure and applications. Technical report, Univ. Washington. Available at http://www.stat.washington.edu/wxs/Learning-papers/paper-proper-scoring.pdf.
[25] Dettling, M. (2004). BagBoosting for tumor classification with gene expression data. Bioinformatics 20 3583–3593.
[26] Dettling, M. and Bühlmann, P. (2003). Boosting for tumor classification with gene expression data. Bioinformatics 19 1061–1069.
[27] DiMarzio, M. and Taylor, C. (2008). On boosting kernel regression. J. Statist. Plann. Inference. To appear.
[28] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression (with discussion). Ann. Statist. 32 407–499. MR2060166
[29] Freund, Y. and Schapire, R. (1995). A decision-theoretic generalization of on-line learning and an application to boosting. In Proceedings of the Second European Conference on Computational Learning Theory. Springer, Berlin.
[30] Freund, Y. and Schapire, R. (1996). Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning. Morgan Kaufmann, San Francisco, CA.
[31] Freund, Y. and Schapire, R. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. System Sci. 55 119–139. MR1473055
[32] Friedman, J. (2001). Greedy function approximation: A gradient boosting machine. Ann. Statist. 29 1189–1232. MR1873328
[33] Friedman, J., Hastie, T. and Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting (with discussion). Ann. Statist. 28 337–407. MR1790002
[34] Garcia, A. L., Wagner, K., Hothorn, T., Koebnick, C., Zunft, H. J. and Trippo, U. (2005). Improved prediction of body fat by measuring skinfold thickness, circumferences, and bone breadths. Obesity Research 13 626–634.
[35] Gentleman, R. C., Carey, V. J., Bates, D. M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J., Hornik, K., Hothorn, T., Huber, M., Iacus, S., Irizarry, R., Leisch, F., Li, C., Mächler, M., Rossini, A. J., Sawitzki, G., Smith, C., Smyth, G., Tierney, L., Yang, J. Y. and Zhang, J. (2004). Bioconductor: Open software development for computational biology and bioinformatics. Genome Biology 5 R80.
[36] Green, P. and Silverman, B. (1994). Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman and Hall, New York. MR1270012
[37] Greenshtein, E. and Ritov, Y. (2004). Persistence in high-dimensional predictor selection and the virtue of over-parametrization. Bernoulli 10 971–988. MR2108039
[38] Hansen, M. and Yu, B. (2001). Model selection and minimum description length principle. J. Amer. Statist. Assoc. 96 746–774. MR1939352
[39] Hastie, T. and Efron, B. (2004). Lars: Least angle regression, lasso and forward stagewise. R package version 0.9-7. Available at http://CRAN.R-project.org.
[40] Hastie, T. and Tibshirani, R. (1986). Generalized additive models (with discussion). Statist. Sci. 1 297–318. MR0858512
[41] Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models. Chapman and Hall, London. MR1082147
[42] Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning; Data Mining, Inference and Prediction. Springer, New York. MR1851606
[43] Hothorn, T. and Bühlmann, P. (2007). Mboost: Model-based boosting. R package version 0.5-8. Available at http://CRAN.R-project.org/.
[44] Hothorn, T. and Bühlmann, P. (2006). Model-based boosting in high dimensions. Bioinformatics 22 2828–2829.
[45] Hothorn, T., Bühlmann, P., Dudoit, S., Molinaro, A. and van der Laan, M. (2006). Survival ensembles. Biostatistics 7 355–373.
[46] Hothorn, T., Hornik, K. and Zeileis, A. (2006). Party: A laboratory for recursive part(y)itioning. R package version 0.9-11. Available at http://CRAN.R-project.org/.
[47] Hothorn, T., Hornik, K. and Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference framework. J. Comput. Graph. Statist. 15 651–674. MR2291267
[48] Huang, J., Ma, S. and Zhang, C.-H. (2008). Adaptive Lasso for sparse high-dimensional regression. Statist. Sinica. To appear.
[49] Hurvich, C., Simonoff, J. and Tsai, C.-L. (1998). Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion. J. Roy. Statist. Soc. Ser. B 60 271–293. MR1616041
[50] Iyer, R., Lewis, D., Schapire, R., Singer, Y. and Singhal, A. (2000). Boosting for document routing. In Proceedings of CIKM-00, 9th ACM Int. Conf. on Information and Knowledge Management (A. Agah, J. Callan and E. Rundensteiner, eds.). ACM Press, New York.
[51] Jiang, W. (2004). Process consistency for AdaBoost (with discussion). Ann. Statist. 32 13–29, 85–134. MR2050999
[52] Kearns, M. and Valiant, L. (1994). Cryptographic limitations on learning Boolean formulae and finite automata. J. Assoc. Comput. Machinery 41 67–95. MR1369194
[53] Koltchinskii, V. and Panchenko, D. (2002). Empirical margin distributions and bounding the generalization error of combined classifiers. Ann. Statist. 30 1–50. MR1892654
[54] Leitenstorfer, F. and Tutz, G. (2006). Smoothing with curvature constraints based on boosting techniques. In Proceedings in Computational Statistics (COMPSTAT) (A. Rizzi and M. Vichi, eds.). Physica-Verlag, Heidelberg.
[55] Leitenstorfer, F. and Tutz, G. (2007). Generalized monotonic regression based on B-splines with an application to air pollution data. Biostatistics 8 654–673.
[56] Leitenstorfer, F. and Tutz, G. (2007). Knot selection by boosting techniques. Comput. Statist. Data Anal. 51 4605–4621.
[57] Lozano, A., Kulkarni, S. and Schapire, R. (2006). Convergence and consistency of regularized boosting algorithms with stationary β-mixing observations. In Advances in Neural Information Processing Systems (Y. Weiss, B. Schölkopf and J. Platt, eds.) 18. MIT Press.
[58] Lugosi, G. and Vayatis, N. (2004). On the Bayes-risk consistency of regularized boosting methods (with discussion). Ann. Statist. 32 30–55, 85–134. MR2051000
[59] Lutz, R. and Bühlmann, P. (2006). Boosting for high-multivariate responses in high-dimensional linear regression. Statist. Sinica 16 471–494. MR2267246
[60] Mallat, S. and Zhang, Z. (1993). Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing 41 3397–3415.
[61] Mannor, S., Meir, R. and Zhang, T. (2003). Greedy algorithms for classification: consistency, convergence rates, and adaptivity. J. Machine Learning Research 4 713–741. MR2072266
[62] Mason, L., Baxter, J., Bartlett, P. and Frean, M. (2000). Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers (A. Smola, P. Bartlett, B. Schölkopf and D. Schuurmans, eds.) 221–246. MIT Press, Cambridge.
[63] McCaffrey, D. F., Ridgeway, G. and Morral, A. R. G. (2004). Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychological Methods 9 403–425.
[64] Mease, D., Wyner, A. and Buja, A. (2007). Cost-weighted boosting with jittering and over/under-sampling: JOUS-boost. J. Machine Learning Research 8 409–439.
[65] Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the Lasso. Ann. Statist. 34 1436–1462. MR2278363
[66] Meir, R. and Rätsch, G. (2003). An introduction to boosting and leveraging. In Advanced Lectures on Machine Learning (S. Mendelson and A. Smola, eds.). Springer, Berlin.
[67] Osborne, M., Presnell, B. and Turlach, B. (2000). A new approach to variable selection in least squares problems. IMA J. Numer. Anal. 20 389–403. MR1773265
[68] Park, M.-Y. and Hastie, T. (2007). An L1 regularization-path algorithm for generalized linear models. J. Roy. Statist. Soc. Ser. B 69 659–677.
[69] R Development Core Team (2006). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Available at http://www.R-project.org.
[70] Rätsch, G., Onoda, T. and Müller, K. (2001). Soft margins for AdaBoost. Machine Learning 42 287–320.
[71] Ridgeway, G. (1999). The state of boosting. Comput. Sci. Statistics 31 172–181.
[72] Ridgeway, G. (2000). Discussion on "Additive logistic regression: A statistical view of boosting," by J. Friedman, T. Hastie, R. Tibshirani. Ann. Statist. 28 393–400.
[73] Ridgeway, G. (2002). Looking for lumps: Boosting and bagging for density estimation. Comput. Statist. Data Anal. 38 379–392. MR1884870
[74] Ridgeway, G. (2006). Gbm: Generalized boosted regression models. R package version 1.5-7. Available at http://www.i-pensieri.com/gregr/gbm.shtml.
[75] Schapire, R. (1990). The strength of weak learnability. Machine Learning 5 197–227.
[76] Schapire, R. (2002). The boosting approach to machine learning: An overview. Nonlinear Estimation and Classification. Lecture Notes in Statist. 171 149–171. Springer, New York. MR2005788
[77] Schapire, R., Freund, Y., Bartlett, P. and Lee, W. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. Ann. Statist. 26 1651–1686. MR1673273
[78] Schapire, R. and Singer, Y. (2000). Boostexter: A boosting-based system for text categorization. Machine Learning 39 135–168.
[79] Southwell, R. (1946). Relaxation Methods in Theoretical Physics. Oxford, at the Clarendon Press. MR0018983
[80] Street, W. N., Mangasarian, O. L. and Wolberg, W. H. (1995). An inductive learning approach to prognostic prediction. In Proceedings of the Twelfth International Conference on Machine Learning. Morgan Kaufmann, San Francisco, CA.
[81] Temlyakov, V. (2000). Weak greedy algorithms. Adv. Comput. Math. 12 213–227. MR1745113
[82] Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. J. Roy. Statist. Soc. Ser. B 58 267–288. MR1379242
[83] Tukey, J. (1977). Exploratory Data Analysis. Addison-Wesley, Reading, MA.
[84] Tutz, G. and Binder, H. (2006). Generalized additive modelling with implicit variable selection by likelihood based boosting. Biometrics 62 961–971. MR2297666
[85] Tutz, G. and Binder, H. (2007). Boosting Ridge regression. Comput. Statist. Data Anal. 51 6044–6059.
[86] Tutz, G. and Hechenbichler, K. (2005). Aggregating classifiers with ordinal response structure. J. Statist. Comput. Simul. 75 391–408. MR2136546
[87] Tutz, G. and Leitenstorfer, F. (2007). Generalized smooth monotonic regression in additive modelling. J. Comput. Graph. Statist. 16 165–188.
[88] Tutz, G. and Reithinger, F. (2007). Flexible semiparametric mixed models. Statistics in Medicine 26 2872–2900.
[89] van der Laan, M. and Robins, J. (2003). Unified Methods for Censored Longitudinal Data and Causality. Springer, New York. MR1958123
[90] West, M., Blanchette, C., Dressman, H., Huang, E., Ishida, S., Spang, R., Zuzan, H., Olson, J., Marks, J. and Nevins, J. (2001). Predicting the clinical status of human breast cancer by using gene expression profiles. Proc. Natl. Acad. Sci. USA 98 11462–11467.
[91] Yao, Y., Rosasco, L. and Caponnetto, A. (2007). On early stopping in gradient descent learning. Constr. Approx. 26 289–315. MR2327601
[92] Zhang, T. and Yu, B. (2005). Boosting with early stopping: Convergence and consistency. Ann. Statist. 33 1538–1579. MR2166555
[93] Zhao, P. and Yu, B. (2007). Stagewise Lasso. J. Mach. Learn. Res. 8 2701–2726.
[94] Zhao, P. and Yu, B. (2006). On model selection consistency of Lasso. J. Machine Learning Research 7 2541–2563. MR2274449
[95] Zhu, J., Rosset, S., Zou, H. and Hastie, T. (2005). Multiclass AdaBoost. Technical report, Stanford Univ. Available at http://www-stat.stanford.edu/~hastie/Papers/samme.pdf.
[96] Zou, H. (2006). The adaptive Lasso and its oracle properties. J. Amer. Statist. Assoc. 101 1418–1429. MR2279469