Accelerating Gradient Boosting Machine

Haihao Lu * 1 2 Sai Praneeth Karimireddy * 3 Natalia Ponomareva 2 Vahab Mirrokni 2

Abstract

Gradient Boosting Machine (GBM) (Friedman, 2001) is an extremely powerful supervised learning algorithm that is widely used in practice. GBM routinely features as a leading algorithm in competitions such as Kaggle and the KDDCup. In this work, we propose Accelerated Gradient Boosting Machine (AGBM) by incorporating Nesterov's acceleration techniques into the design of GBM. The difficulty in accelerating GBM lies in the fact that weak (inexact) learners are commonly used, and therefore the errors can accumulate in the momentum term. To overcome this, we design a "corrected pseudo-residual" and fit the best weak learner to this corrected pseudo-residual in order to perform the momentum update. Thus, we are able to derive novel computational guarantees for AGBM. This is the first GBM-type algorithm with a theoretically justified accelerated convergence rate. Finally, we demonstrate with a number of numerical experiments the effectiveness of AGBM over conventional GBM in obtaining a model with good training and/or testing data fidelity.

1. Introduction

Gradient Boosting Machine (GBM) (Friedman, 2001) is a powerful algorithm that combines multiple weak learners into an ensemble with excellent prediction performance. GBM works very well for a number of tasks like spam filtering, online advertising, fraud detection, computational physics (e.g., the Higgs Boson discovery), etc., and has routinely featured as a top algorithm in Kaggle competitions and the KDDCup (Chen & Guestrin, 2016). GBM can naturally handle heterogeneous datasets (highly correlated data, missing data, categorical data, etc). It is also quite easy to use with several publicly available implementations: scikit-learn (Pedregosa et al., 2011), R gbm (Ridgeway et al., 2013), LightGBM (Ke et al., 2017), XGBoost (Chen & Guestrin, 2016), TF Boosted Trees (Ponomareva et al., 2017), etc.

In spite of the practical success of GBM, there is a considerable gap in its theoretical understanding. The traditional interpretation of GBM is to view it as a form of steepest descent in functional space (Mason et al., 2000; Friedman, 2001). While this interpretation serves as a good starting point, such a framework lacks rigorous non-asymptotic convergence guarantees, especially when compared to the growing body of literature on first-order convex optimization.

In the convex optimization literature, Nesterov's acceleration is a successful technique for speeding up the convergence of first-order methods. In this work, we show how to incorporate Nesterov momentum into the gradient boosting framework to obtain an accelerated gradient boosting machine.

1.1. Our contributions

We propose the first accelerated gradient boosting algorithm that comes with strong theoretical guarantees and can be used with any type of weak learner. In particular:

1. We propose a novel accelerated gradient boosting algorithm (AGBM) (Section 3) and prove (Section 4) that it reduces the empirical loss at a rate of O(1/m^2) after m iterations, improving upon the O(1/m) rate obtained by traditional gradient boosting methods.
2. We propose a computationally inexpensive practical variant of AGBM, taking advantage of strong convexity of the loss function, which achieves linear convergence (Section 5). We also list the conditions (on the loss function) under which AGBM would be beneficial.
3. With a number of numerical experiments with weak tree learners (one of the most popular types of GBM), we confirm the effectiveness of AGBM.
4. Our accelerated boosting framework will be open-sourced at this url¹.

Apart from these theoretical contributions, we pave the way to speeding up some practical applications of GBMs which currently require a large number of boosting iterations. For example, GBMs with boosted trees for multi-class problems are commonly implemented as a number of one-vs-rest learners, resulting in more complicated decision boundaries (Friedman et al., 1998) and potentially a larger number of boosting iterations. Additionally, it is common practice to build many very weak learners for problems where it is easy to overfit. Such large ensembles result not only in slow training, but also in slower inference. AGBM can be potentially beneficial for all of these applications.

*Equal contribution. 1MIT, Cambridge, MA, USA. 2Google, New York, NY, USA. 3EPFL, Lausanne, Switzerland. Correspondence to: Haihao Lu. Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s). ¹Link removed for anonymity; will be added upon acceptance.

1.2. Related Literature

Convergence Guarantees for GBM: After GBM was first introduced by Friedman (Friedman, 2001), several works established its guaranteed convergence, without explicitly stating the convergence rate (Collins et al., 2002; Mason et al., 2000). Subsequently, when the loss function is both smooth and strongly convex, (Bickel et al., 2006) proved an exponential convergence rate—more precisely, that O(exp(1/ε^2)) iterations are sufficient to ensure that the training loss is within ε of its optimal value. (Telgarsky, 2012) then studied the primal-dual structure of GBM and demonstrated that in fact only O(log(1/ε)) iterations are needed. However, the constants in their rate were non-standard and less intuitive. This result was recently improved upon by (Freund et al., 2017) and (Lu & Mazumder, 2018), who showed a similar convergence rate but with more transparent constants, such as the smoothness and strong convexity constants of the loss function and the density of the weak learners. Additionally, if the loss function is assumed to be smooth and convex (but not necessarily strongly convex), (Lu & Mazumder, 2018) also showed that O(1/ε) iterations are sufficient. We refer the reader to (Telgarsky, 2012), (Freund et al., 2017) and (Lu & Mazumder, 2018) for a more detailed literature review on the theoretical results of GBM convergence.

Accelerated Gradient Methods: For optimizing a smooth convex function, (Nesterov, 1983) showed that the standard gradient descent algorithm can be made much faster, resulting in the accelerated gradient descent method. While gradient descent requires O(1/ε) iterations, accelerated gradient descent only requires O(1/√ε). In general, this rate of convergence is optimal and cannot be improved upon (Nesterov, 2004). Since its introduction in 1983, the mainstream research community's interest in Nesterov's accelerated method started around 15 years ago; yet even today most researchers struggle to find basic intuition as to what is really going on in accelerated methods. Such lack of intuition about the estimation-sequence proof technique used by (Nesterov, 2004) has motivated many recent works trying to explain this acceleration phenomenon (Su et al., 2016; Wilson et al., 2016; Hu & Lessard, 2017; Lin et al., 2015; Frostig et al., 2015; Allen-Zhu & Orecchia, 2014; Bubeck et al., 2015). Some have recently attempted to give a physical explanation of acceleration techniques by studying the continuous-time interpretation of accelerated gradient descent via dynamical systems (Su et al., 2016; Wilson et al., 2016; Hu & Lessard, 2017).

Accelerated Greedy Coordinate and Matching Pursuit Methods: Recently, (Locatello et al., 2018) and (Lu et al., 2018) discussed how to accelerate matching pursuit and greedy coordinate descent algorithms, respectively. Their methods however require a random step and are hence only "semi-greedy", which does not fit in the boosting framework.

Accelerated GBM: Recently, (Biau et al., 2018) and (Fouillen et al., 2018) proposed an accelerated version of GBM by directly incorporating Nesterov's momentum in GBM; however, no theoretical justification was provided. Furthermore, as we argue in Section 5.2, their proposed algorithm may not converge to the optimum.

2. Gradient Boosting Machine

We consider a supervised learning problem with n training examples (x_i, y_i), i = 1, ..., n, such that x_i ∈ R^p is the feature vector of the i-th example and y_i is a label (in a classification problem) or a continuous response (in a regression problem). In the classical version of GBM (Friedman, 2001), we assume we are given a base class of learners B and that our target function class is the set of linear combinations of such base learners (denoted by lin(B)). Let B = {b_τ(x) ∈ R} be a family of learners parameterized by τ ∈ T. The prediction corresponding to a feature vector x is given by an additive model of the form:

    f(x) := \sum_{m=1}^{M} \beta_m b_{\tau_m}(x) \in \mathrm{lin}(B) ,    (1)

where b_{τ_m}(x) ∈ B is a weak learner and β_m is its corresponding additive coefficient. Here, β_m and τ_m are chosen in an adaptive fashion in order to improve the data-fidelity, as discussed below. Examples of learners commonly used in practice include wavelet functions, support vector machines, and classification and regression trees (Friedman et al., 2001). We assume the set of weak learners B is scalable, namely that the following assumption holds.

Assumption 2.1. If b(·) ∈ B, then λb(·) ∈ B for any λ > 0.

Assumption 2.1 holds for most sets of weak learners we are interested in. Indeed, scaling a weak learner is equivalent to modifying the coefficient of the weak learner, so it does not change the structure of B.

The goal of GBM is to obtain a good estimate of the function f that approximately minimizes the empirical loss:

    L^{*} = \min_{f \in \mathrm{lin}(B)} \Big\{ L(f) := \sum_{i=1}^{n} \ell(y_i, f(x_i)) \Big\}    (2)

where ℓ(y_i, f(x_i)) is a measure of the data-fidelity for the i-th sample for the loss function ℓ.
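To make the objective in (1)-(2) concrete, the following is a minimal Python sketch of how an ensemble in lin(B) produces predictions and how the empirical loss L(f) is evaluated. The choice of the least-squares loss and the class/function names here are ours for illustration; they are not part of the paper's released code.

import numpy as np

class AdditiveEnsemble:
    """An additive model f(x) = sum_m beta_m * b_{tau_m}(x), i.e. an element of lin(B)."""

    def __init__(self):
        self.learners = []   # fitted weak learners b_{tau_m}, each exposing .predict(X)
        self.coeffs = []     # additive coefficients beta_m

    def add(self, learner, coeff):
        self.learners.append(learner)
        self.coeffs.append(coeff)

    def predict(self, X):
        # f(X), the vector [f(x_i)]_i used throughout the paper
        pred = np.zeros(X.shape[0])
        for beta, b in zip(self.coeffs, self.learners):
            pred += beta * b.predict(X)
        return pred

def empirical_loss(y, pred):
    # L(f) = sum_i l(y_i, f(x_i)) with the least-squares loss l(y, f) = 0.5 * (y - f)^2
    return 0.5 * np.sum((y - pred) ** 2)

A weak learner here could be, for example, a shallow scikit-learn regression tree fit to a pseudo-residual, which is exactly the role played by b_τ in Algorithm 1 below.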

2.1. Best Fit Weak Learners

The original version of GBM by (Friedman, 2001), presented in Algorithm 1, can be viewed as minimizing the loss function (2) by applying an approximate steepest descent algorithm. GBM starts from a null function f^0 ≡ 0 and at each iteration computes the pseudo-residual r^m (namely, the negative gradient of the loss function with respect to the predictions so far, f^m):

    r_i^m = - \frac{\partial \ell(y_i, f^m(x_i))}{\partial f^m(x_i)} .    (3)

Then a weak learner that best fits the current pseudo-residual in terms of the least-squares loss is computed as follows:

    \tau_m = \arg\min_{\tau \in T} \sum_{i=1}^{n} \big( r_i^m - b_\tau(x_i) \big)^2 .    (4)

This weak learner is added to the model with a coefficient found via a line search. As the iterations progress, GBM leads to a sequence of functions {f^m}_{m ∈ [M]} (where [M] is a shorthand for the set {1, ..., M}). The usual intention of GBM is to stop early—before one is close to a minimum of Problem (2)—with the hope that such a model will lead to good predictive performance (Friedman, 2001; Freund et al., 2017; Zhang & Yu, 2005; Bühlmann et al., 2007).

Algorithm 1 Gradient Boosting Machine (GBM) (Friedman, 2001)
Initialization. Initialize with f^0(x) = 0.
For m = 0, ..., M − 1 do:
Perform Updates:
(1) Compute pseudo-residual: r^m = [ − ∂ℓ(y_i, f^m(x_i)) / ∂f^m(x_i) ]_{i=1,...,n}.
(2) Find the parameters of the best weak learner: τ_m = arg min_{τ∈T} Σ_{i=1}^n (r_i^m − b_τ(x_i))^2.
(3) Choose the step-size η_m by line search: η_m = arg min_η Σ_{i=1}^n ℓ(y_i, f^m(x_i) + η b_{τ_m}(x_i)).
(4) Update the model: f^{m+1}(x) = f^m(x) + η_m b_{τ_m}(x).
Output. f^M(x).

Perhaps the most popular set of learners are classification and regression trees (CART) (Breiman et al., 1984), resulting in Gradient Boosted Decision Tree (GBDT) models. These are the models that we use for our numerical experiments. At the same time, we would like to highlight that our algorithm is not tied to a particular type of weak learner and is a general algorithm.
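The classical GBM loop of Algorithm 1 is straightforward to express in code. The sketch below assumes the least-squares loss and shallow scikit-learn regression trees as the weak learners, and replaces the line search in step (3) with a fixed step-size; these choices are ours for illustration, not prescribed by the paper.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbm_fit(X, y, n_rounds=100, eta=0.1, max_depth=3):
    """Minimal GBM in the style of Algorithm 1: least-squares loss, fixed step-size eta."""
    n = X.shape[0]
    f_pred = np.zeros(n)              # f^m(x_i) on the training points
    learners = []
    for m in range(n_rounds):
        # (1) pseudo-residual: the negative gradient of 0.5*(y - f)^2 w.r.t. f is (y - f)
        residual = y - f_pred
        # (2) best-fit weak learner for the pseudo-residual (least-squares fit)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)
        # (3)-(4) add the learner with a fixed step-size instead of a line search
        f_pred += eta * tree.predict(X)
        learners.append(tree)
    return learners

def gbm_predict(learners, X, eta=0.1):
    return eta * sum(tree.predict(X) for tree in learners)

With a strong (complete) learner class, replacing this single f update with the three-sequence update in (5) below would already give the accelerated O(1/m^2) rate; Section 3.2 shows what changes when the learners are genuinely weak.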

Parameter        Dimension    Explanation
(x_i, y_i)       R^p × R      The features and the label of the i-th sample.
X                R^{p×n}      X = [x_1, x_2, ..., x_n] is the feature matrix for all training data.
b_τ(x)           function     Weak learner parameterized by τ.
f^m(x)           function     Ensemble of weak learners at the m-th iteration.
g^m(x), h^m(x)   functions    Auxiliary ensembles of weak learners at the m-th iteration.
r^m              R^n          Pseudo-residual at the m-th iteration.
c^m              R^n          Corrected pseudo-residual at the m-th iteration.
b_τ(X)           R^n          A vector of predictions [b_τ(x_i)]_i.
f(X)             R^n          A vector of [f(x_i)]_i for any function f(x).

Table 1. List of notations used.

3. Accelerated Gradient Boosting Machine (AGBM)

Given the success of accelerated gradient descent as a first-order optimization method, it seems natural to attempt to accelerate GBM. As a warm-up, we first look at how to obtain an accelerated boosting algorithm when our class of learners B is strong (complete) and can exactly fit any pseudo-residual. This assumption is quite unreasonable, but it will serve to illustrate the connection between boosting and first-order optimization. We then describe our actual algorithm, which works for any class of weak learners.

3.1. Boosting with strong learners

In this subsection, we assume the class of learners B is strong, i.e. for any pseudo-residual r ∈ R^n, there exists a learner b(x) ∈ B such that

    b(x_i) = r_i, \quad \forall\, i \in [n] .

Of course, the entire point of boosting is that the learners are weak and thus the class is not strong, therefore this is not a realistic assumption. Nevertheless, this section provides the intuition for how to develop AGBM.

In GBM we compute the pseudo-residual r^m in (3) as the negative gradient of the loss function over the predictions so far. A gradient descent step in functional space would try to find f^{m+1} such that for i ∈ {1, ..., n}

    f^{m+1}(x_i) = f^m(x_i) + \eta r_i^m .

Here η is the step-size of our algorithm. Since our class of learners is rich, we can choose b^m(x) ∈ B to exactly satisfy the above equation. Thus GBM (Algorithm 1) then has the following update:

    f^{m+1} = f^m + \eta b^m ,

where b^m(x_i) = r_i^m. In other words, GBM performs exactly functional gradient descent when the class of learners is strong, and so it converges at a rate of O(1/m). Akin to the above argument, we can perform functional accelerated gradient descent, which has the accelerated rate of O(1/m^2). In the accelerated method, we maintain three model ensembles: f, g, and h, of which f(x) is the only model which is finally used to make predictions during inference. h(x) is the momentum sequence and g(x) is a weighted average of f and h (refer to Table 1 for a list of all notations used). These sequences are updated as follows for a step-size η and {θ_m = 2/(m + 2)}:

    g^m = (1 - \theta_m) f^m + \theta_m h^m
    f^{m+1} = g^m + \eta b^m : \text{primary model}    (5)
    h^{m+1} = h^m + (\eta/\theta_m)\, b^m : \text{momentum model} ,

where b^m(x) satisfies, for i ∈ {1, ..., n},

    b^m(x_i) = - \frac{\partial \ell(y_i, g^m(x_i))}{\partial g^m(x_i)} .    (6)

Note that the pseudo-residual is computed w.r.t. g instead of f. The update above can be rewritten as

    f^{m+1} = f^m + \eta b^m + \theta_m (h^m - f^m) .

If θ_m = 0, we see that we recover standard functional gradient descent with step-size η. For θ_m ∈ (0, 1], there is an additional momentum in the direction of (h^m − f^m). The three sequences f, g, and h match exactly those used in typical accelerated gradient methods (see (Nesterov, 2004; Tseng, 2008) for details).

3.2. Boosting with weak learners

In this subsection, we consider the general case without assuming that the class of learners is strong. Indeed, the class of learners B is usually quite simple and it is very likely that for any τ ∈ T, it is impossible to exactly fit the residual r^m. We call this case boosting with weak learners. Our task then is to modify (5) to obtain a truly accelerated gradient boosting machine.

The full details are summarized in Algorithm 2, but we will highlight two key differences from (5).

First, the update to the f sequence is replaced with a weak learner which best approximates r^m, similar to (5). In particular, we compute the pseudo-residual r^m w.r.t. g as in (6) and find a parameter τ_{m,1} such that

    \tau_{m,1} = \arg\min_{\tau \in T} \sum_{i=1}^{n} \big( r_i^m - b_\tau(x_i) \big)^2 .

Secondly, and more crucially, the update to the momentum model h is decoupled from the update to the f sequence. We use an error-corrected pseudo-residual c^m instead of directly using r^m. Suppose that at iteration m − 1, a weak learner b_{τ_{m−1,2}} was added to h^{m−1}. Then the error-corrected residual is defined inductively as follows: for i ∈ {1, ..., n},

    c_i^m = r_i^m + \frac{m+1}{m+2} \big( c_i^{m-1} - b_{\tau_{m-1,2}}(x_i) \big) ,

and then we compute

    \tau_{m,2} = \arg\min_{\tau \in T} \sum_{i=1}^{n} \big( c_i^m - b_\tau(x_i) \big)^2 .

Thus at each iteration two weak learners are computed: b_{τ_{m,1}}(x), which approximates the residual r^m, and b_{τ_{m,2}}(x), which approximates the error-corrected residual c^m. Note that if our class of learners is complete, then c_i^{m−1} = b_{τ_{m−1,2}}(x_i), c^m = r^m and τ_{m,1} = τ_{m,2}. This would revert back to our accelerated gradient boosting algorithm for strong learners described in (5).

Algorithm 2 Accelerated Gradient Boosting Machine (AGBM)
Input. Starting function f^0(x) = 0, step-size η, momentum parameter γ ∈ (0, 1], and data (X, y) = {(x_i, y_i)}_{i∈[n]}.
Initialization. h^0(x) = f^0(x) and sequence θ_m = 2/(m+2).
For m = 0, ..., M − 1 do:
Perform Updates:
(1) Compute a linear combination of f and h: g^m(x) = (1 − θ_m) f^m(x) + θ_m h^m(x).
(2) Compute pseudo-residual: r^m = [ − ∂ℓ(y_i, g^m(x_i)) / ∂g^m(x_i) ]_{i=1,...,n}.
(3) Find the best weak learner for the pseudo-residual: τ_{m,1} = arg min_{τ∈T} Σ_{i=1}^n (r_i^m − b_τ(x_i))^2.
(4) Update the model: f^{m+1}(x) = g^m(x) + η b_{τ_{m,1}}(x).
(5) Update the corrected residual: c_i^m = r_i^m if m = 0; otherwise c_i^m = r_i^m + ((m+1)/(m+2)) (c_i^{m−1} − b_{τ_{m−1,2}}(x_i)).
(6) Find the best weak learner for the corrected residual: τ_{m,2} = arg min_{τ∈T} Σ_{i=1}^n (c_i^m − b_τ(x_i))^2.
(7) Update the momentum model: h^{m+1}(x) = h^m(x) + (γη/θ_m) b_{τ_{m,2}}(x).
Output. f^M(x).

4. Convergence Analysis of AGBM

We first formally define the assumptions required and then outline the computational guarantees for AGBM.

4.1. Assumptions

Let us introduce some standard regularity/continuity constraints on the loss function that we require in our analysis.

Definition 4.1. We denote by ∂ℓ(y, f)/∂f the derivative of the bivariate loss function ℓ w.r.t. the prediction f. We say that ℓ is σ-smooth if for any y and predictions f_1 and f_2, it holds that

    \ell(y, f_1) \le \ell(y, f_2) + \frac{\partial \ell(y, f_2)}{\partial f} (f_1 - f_2) + \frac{\sigma}{2} (f_1 - f_2)^2 .

We say ℓ is µ-strongly convex (with µ > 0) if for any y and predictions f_1 and f_2, it holds that

    \ell(y, f_1) \ge \ell(y, f_2) + \frac{\partial \ell(y, f_2)}{\partial f} (f_1 - f_2) + \frac{\mu}{2} (f_1 - f_2)^2 .

Note that µ ≤ σ always.
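As a concrete instance of Definition 4.1 (a worked example of our own, not taken from the paper), the constants for the two losses used in Section 6 can be read off from the second derivative of ℓ with respect to the prediction. For the least-squares loss,

    \ell(y, f) = \tfrac{1}{2}(y - f)^2, \qquad \frac{\partial^2 \ell}{\partial f^2} = 1, \qquad \text{so } \sigma = \mu = 1 .

For the logistic loss with y ∈ {−1, +1},

    \ell(y, f) = \log\big(1 + e^{-yf}\big), \qquad \frac{\partial^2 \ell}{\partial f^2} = p(1-p) \ \text{ with } \ p = \frac{1}{1 + e^{yf}}, \qquad \text{so } \sigma = \tfrac{1}{4} ,

while µ can be arbitrarily close to 0 for large |f|. This is consistent with the step-size η = 1/σ = 4 used for the logistic loss in Section 6, and with the σ ≫ µ regime discussed in Remark 5.1.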

Smoothness and strong convexity mean that the function ℓ is (respectively) upper and lower bounded by quadratic functions. Intuitively, smoothness implies that the gradient does not change abruptly and hence ℓ is never 'sharp'. Strong convexity implies that ℓ always has some 'curvature' and is never 'flat'.

The notion of Minimal Cosine Angle (MCA) introduced in (Lu & Mazumder, 2018) plays a central role in our convergence rate analysis of GBM. MCA measures how well the weak learner b_{τ(r)}(X) approximates the desired residual r.

Definition 4.2. Let r ∈ R^n be a vector. The Minimal Cosine Angle (MCA) is defined as the similarity between r and the output of the best-fit learner b_τ(X):

    \Theta := \min_{r \in \mathbb{R}^n} \max_{\tau \in T} \cos\big( r, b_\tau(X) \big) ,    (7)

where b_τ(X) ∈ R^n is the vector of predictions [b_τ(x_i)]_i.

The quantity Θ ∈ (0, 1] measures how "dense" the learners are in the prediction space. For strong learners (as in Section 3.1), the prediction space is complete, and Θ = 1. For a complex space of learners T, such as deep trees, we expect the prediction space to be dense and Θ ≈ 1. For a simpler class, such as tree stumps, Θ would be much smaller. Refer to (Lu & Mazumder, 2018) for a discussion of Θ.

4.2. Computational Guarantees

We are now ready to state the main theoretical result of our paper.

Theorem 4.1. Consider Accelerated Gradient Boosting Machine (Algorithm 2). Suppose ℓ is σ-smooth, the step-size η ≤ 1/σ and the momentum parameter γ ≤ Θ^4/(4 + Θ^2), where Θ is the MCA introduced in Definition 4.2. Then for all M ≥ 0, we have:

    L(f^M) - L(f^*) \le \frac{1}{2\eta\gamma (M+1)^2} \, \| f^*(X) \|_2^2 .

Proof Sketch. Here we only give an outline; the full proof can be found in the Appendix (Section B). We use the potential-function based analysis of accelerated methods (cf. (Tseng, 2008; Wilson et al., 2016)). Recall that θ_m = 2/(m+2). For the proof, we introduce the following vector sequence of auxiliary ensembles ĥ:

    \hat h^0(X) = 0, \qquad \hat h^{m+1}(X) = \hat h^m(X) + \frac{\eta\gamma}{\theta_m} r^m .

The sequence ĥ^m(X) is in fact closely tied to the sequence h^m(X), as we demonstrate in the Appendix (Lemma B.1). Let f^* be any function which obtains the optimal loss (2):

    f^* \in \arg\min_{f \in \mathrm{lin}(B)} \Big\{ L(f) := \sum_{i=1}^{n} \ell(y_i, f(x_i)) \Big\} .

Let us define the following sequence of potentials:

    V^m(f^*) = \begin{cases} \tfrac{1}{2} \big\| f^*(X) - \hat h^0(X) \big\|^2 & \text{if } m = 0 , \\ \tfrac{\eta\gamma}{\theta_{m-1}^2} \big( L(f^m) - L^{*} \big) + \tfrac{1}{2} \big\| f^*(X) - \hat h^m(X) \big\|^2 & \text{otherwise.} \end{cases}

Typical proofs of accelerated algorithms show that the potential V^m(f^*) is a decreasing sequence. In boosting, we use the weak learner that fits the pseudo-residual of the loss. This can guarantee sufficient decay of the first term of V^m(f^*), related to the loss L(f). However, there is no such guarantee that the same weak learner can also provide sufficient decay of the second term, as we do not know the optimal ensemble f^* a priori. That is the major challenge in the development of AGBM. We instead show that the potential decreases at least by δ_m:

    V^{m+1}(f^*) \le V^m(f^*) + \delta_m ,

where δ_m is an error term depending on Θ (see Lemma B.4 for the exact definition of δ_m and proof of the claim). By telescoping, it holds that

    \frac{\eta\gamma}{\theta_m^2} \big( L(f^{m+1}) - L(f^*) \big) \le V^{m+1}(f^*) \le \sum_{j=0}^{m} \delta_j + \frac{1}{2} \big\| f^*(X) - \hat h^0(X) \big\|^2 .

Finally, a careful analysis of the error term (Lemma B.6) shows that Σ_{j=0}^m δ_j ≤ 0 for any m ≥ 0. Therefore,

    L(f^{m+1}) - L(f^*) \le \frac{\theta_m^2}{2\eta\gamma} \| f^*(X) \|^2 ,

which furnishes the proof by letting m = M − 1 and substituting the value of θ_m.

Remark 4.1. Theorem 4.1 implies that to get a function f^M such that the error L(f^M) − L(f^*) ≤ ε, we need a number of iterations M ~ O(1/(Θ^2 √ε)). In contrast, standard gradient boosting machines, as proved in (Lu & Mazumder, 2018), require M ~ O(1/(Θ^2 ε)). This means that for small values of ε (i.e., for high accuracy), AGBM (Algorithm 2) can require far fewer weak learners than GBM (Algorithm 1).

5. Extensions and Variants

In this section we study two more practical variants of AGBM. First we see how to restart the algorithm to take advantage of strong convexity of the loss function. Then we study a straightforward approach to accelerated GBM, which we call vanilla accelerated gradient boosting machine (VAGBM), a variant of the recently proposed algorithm in (Biau et al., 2018), however without any theoretical guarantees.

5.1. Restart and Linear Convergence

It is more common to show a linear rate of convergence for GBM methods by additionally assuming that the loss function is µ-strongly convex (e.g. (Lu & Mazumder, 2018)). It is then relatively straightforward to recover an accelerated linear rate of convergence by restarting Algorithm 2.

Theorem 5.1. Consider Accelerated Gradient Boosting Machine with Restarts with Option 1 (Algorithm 3). Suppose that ℓ is σ-smooth and µ-strongly convex. If the step-size η ≤ 1/σ and the momentum parameter γ ≤ Θ^4/(4 + Θ^2), then for any p and optimal loss L(f^*),

    L(\tilde f^{p+1}) - L^{*} \le \frac{1}{2} \big( L(\tilde f^p) - L(f^*) \big) .

Proof. The loss function ℓ is µ-strongly convex, which implies that

    \frac{\mu}{2} \| f(X) - f^*(X) \|_2^2 \le L(f) - L(f^*) .

Substituting this in Theorem 4.1 gives us that

    L(f^M) - L(f^*) \le \frac{1}{\mu \eta \gamma (M+1)^2} \big( L(f^0) - L(f^*) \big) .

Recalling that f^0(x) = \tilde f^p(x), f^M(x) = \tilde f^{p+1}(x), and M^2 = 2/(\eta\mu\gamma) gives us the required statement.

The restart strategy in Option 1 requires knowledge of the strong-convexity constant µ. Alternatively, one can also use the adaptive restart strategy (Option 2), which is known to have good empirical performance (O'Donoghue & Candes, 2015).

Remark 5.1. Theorem 5.1 shows that M = O((1/Θ^2) √(σ/µ) log(1/ε)) weak learners are sufficient to obtain an error of ε using AGBMR (Algorithm 3). In contrast, standard GBM (Algorithm 1) requires M = O((1/Θ^2) (σ/µ) log(1/ε)) weak learners. Thus AGBMR is significantly better than GBM only if the condition number is large, i.e. σ/µ ≫ 1. When ℓ(y, f) is the least-squares loss (µ = σ = 1), we would see no advantage from acceleration. However, for more complicated loss functions with σ ≫ µ (e.g. logistic loss or exponential loss), AGBMR might result in an ensemble that is significantly better (e.g. obtaining lower training loss) than that of GBM for the same number of weak learners.

5.2. A Vanilla Accelerated Gradient Boosting Method

A natural question to ask is whether, instead of adding two learners at each iteration, we can get away with adding only one. Below we show what such an algorithm would look like and argue that it may not always converge.

Following the updates in Equation (5), we can get a direct acceleration of GBM by using the weak learner fitting the gradient. This leads to Algorithm 4.

Algorithm 4 is equivalent to the recently developed accelerated gradient boosting machine algorithm (Biau et al., 2018; Fouillen et al., 2018). Unfortunately, it may not always converge to the optimum, or may even diverge. This is because b_{τ_m} from Step (3) is only an approximate fit to r^m, meaning that we only take an approximate gradient descent step. While this is not an issue in the non-accelerated version, in Step (5) of Algorithm 4 the momentum term pushes the h sequence to take a large step along the approximate gradient direction. This exacerbates the effect of the approximate direction and can lead to an additive accumulation of error, as shown in (Devolder et al., 2014). In Section 6.1, we see that this is not just a theoretical concern, but that Algorithm 4 also diverges in practice in some situations.

Remark 5.2. Our corrected residual c^m in Algorithm 2 was crucial to the theoretical proof of convergence in Theorem 4.1. One extension could be to introduce γ ∈ (0, 1) in step (5) of Algorithm 4 just as in Algorithm 2.
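For concreteness, the following is a minimal Python sketch of the AGBM updates of Algorithm 2; it also serves as the inner loop of the restarted variant in Algorithm 3 below. It assumes the least-squares loss and scikit-learn regression trees as the weak learners; those choices, the helper name agbm_fit, and the init_pred parameter are ours for illustration, not the paper's released implementation.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def agbm_fit(X, y, n_rounds=50, eta=0.1, gamma=0.1, max_depth=3, init_pred=None):
    """Sketch of Algorithm 2 for the least-squares loss l(y,f) = 0.5*(y-f)^2, so r^m = y - g^m(X)."""
    n = X.shape[0]
    f_pred = np.zeros(n) if init_pred is None else init_pred.copy()   # f^0(X)
    h_pred = f_pred.copy()                                            # h^0(X) = f^0(X)
    c = np.zeros(n)        # corrected pseudo-residual c^m
    b_prev2 = np.zeros(n)  # b_{tau_{m-1,2}}(X), predictions of the previous momentum learner
    ensemble_f, ensemble_h = [], []
    for m in range(n_rounds):
        theta = 2.0 / (m + 2)
        g_pred = (1 - theta) * f_pred + theta * h_pred              # step (1)
        r = y - g_pred                                              # step (2): pseudo-residual
        b1 = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)   # step (3)
        f_pred = g_pred + eta * b1.predict(X)                       # step (4)
        # step (5): c^m = r^m (m = 0) or r^m + ((m+1)/(m+2)) * (c^{m-1} - b_{tau_{m-1,2}}(X))
        c = r if m == 0 else r + ((m + 1.0) / (m + 2.0)) * (c - b_prev2)
        b2 = DecisionTreeRegressor(max_depth=max_depth).fit(X, c)   # step (6)
        b_prev2 = b2.predict(X)
        h_pred = h_pred + (gamma * eta / theta) * b_prev2           # step (7)
        ensemble_f.append(b1)
        ensemble_h.append(b2)
    return ensemble_f, ensemble_h, f_pred

Note that predicting on new data requires replaying the same f/g/h recursions with the stored trees, since f^M is a particular linear combination of all 2M fitted learners.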

Algorithm 3 Accelerated Gradient Boosting Machine with Restart (AGBMR)
Input: Starting function f̃^0(x), step-size η, momentum parameter γ ∈ (0, 1], strong-convexity parameter µ.
For p = 0, ..., P − 1 do:
(1) Run AGBM (Algorithm 2) initialized with f^0(x) = f̃^p(x):
    Option 1: for M = √(2/(ηγµ)) iterations.
    Option 2: as long as L(f^{M+1}) ≤ L(f^M).
(2) Set f̃^{p+1}(x) = f^M(x).
Output: f̃^P(x).
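A sketch of the restart wrapper for Option 1 with a known strong-convexity constant µ, reusing the hypothetical agbm_fit helper from the previous listing (again an illustration under the same least-squares assumptions, not the paper's code):

import numpy as np

def agbmr_fit(X, y, eta=0.1, gamma=0.1, mu=1.0, n_restarts=5, max_depth=3):
    """Option 1 of Algorithm 3: run AGBM for M = ceil(sqrt(2/(eta*gamma*mu))) rounds per stage."""
    M = max(1, int(np.ceil(np.sqrt(2.0 / (eta * gamma * mu)))))
    f_pred = np.zeros(X.shape[0])   # \tilde f^0(X): start from the zero function
    stages = []
    for p in range(n_restarts):
        # each stage restarts the momentum: inside agbm_fit, h^0 is reset to f^0 = \tilde f^p
        ens_f, ens_h, f_pred = agbm_fit(X, y, n_rounds=M, eta=eta, gamma=gamma,
                                        max_depth=max_depth, init_pred=f_pred)
        stages.append((ens_f, ens_h))
    return stages, f_pred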

Algorithm 4 Vanilla Accelerated Gradient Boosting Machine (VAGBM)
Input. Starting function f^0(x) = 0, step-size η, momentum parameter γ ∈ (0, 1].
Initialization. h^0(x) = f^0(x), and sequence θ_m = 2/(m+2).
For m = 0, ..., M − 1 do:
Perform Updates:
(1) Compute a linear combination of f and h: g^m(x) = (1 − θ_m) f^m(x) + θ_m h^m(x).
(2) Compute pseudo-residual: r^m = [ − ∂ℓ(y_i, g^m(x_i)) / ∂g^m(x_i) ]_{i=1,...,n}.
(3) Find the best weak learner for the pseudo-residual: τ_m = arg min_{τ∈T} Σ_{i=1}^n (r_i^m − b_τ(x_i))^2.
(4) Update the model: f^{m+1}(x) = g^m(x) + η b_{τ_m}(x).
(5) Update the momentum model: h^{m+1}(x) = h^m(x) + (η/θ_m) b_{τ_m}(x).
Output. f^M(x).
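The only change relative to the AGBM sketch above is that the momentum model reuses the single learner fitted to r^m, so that learner's approximation error is amplified by the 1/θ_m factor. A minimal sketch of this inner loop, under the same illustrative assumptions as before (least-squares loss, scikit-learn trees; the function name is ours):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def vagbm_fit(X, y, n_rounds=50, eta=0.1, max_depth=3):
    """Sketch of Algorithm 4: one tree per round, shared by the f and h updates."""
    n = X.shape[0]
    f_pred = np.zeros(n)
    h_pred = np.zeros(n)
    trees = []
    for m in range(n_rounds):
        theta = 2.0 / (m + 2)
        g_pred = (1 - theta) * f_pred + theta * h_pred
        r = y - g_pred                                    # pseudo-residual for least squares
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
        b = tree.predict(X)
        f_pred = g_pred + eta * b                         # step (4)
        h_pred = h_pred + (eta / theta) * b               # step (5): the same approximate b
        trees.append(tree)
    return trees, f_pred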

6. Numerical Experiments

In this section, we present the results of computational experiments and discuss the performance of AGBM with trees as weak learners. Subsection 6.1 demonstrates that the algorithm described in Section 5.2 may diverge numerically; Subsection 6.2 shows training and testing performance for GBM and AGBM with different parameters; and Subsection 6.3 compares the performance of GBM and AGBM with best tuned parameters. The code for the numerical experiments will also be open-sourced.

Datasets: Table 2 summarizes the basic statistics of the LIBSVM datasets that were used.

Dataset      task            # samples    # features
a1a          classification  1605         123
w1a          classification  2477         300
diabetes     classification  768          8
german       classification  1000         24
housing      regression      506          13
eunite2001   regression      336          16

Table 2. Basic statistics of the (real) datasets used.

AGBM with CART trees: In our experiments, all algorithms use CART trees as the weak learners. For classification problems, we use the logistic loss function, and for regression problems, we use the least-squares loss. To reduce the computational cost, for each split and each feature, we consider 100 quantiles (instead of potentially all n values). These strategies are commonly used in implementations of GBM like (Chen & Guestrin, 2016; Ponomareva et al., 2017).

6.1. Evidence that VAGBM May Diverge

Figure 1 shows the training loss versus the number of trees for the housing dataset with step-sizes η = 1 and η = 0.3 for VAGBM and for AGBM with different parameters γ. The x-axis is the number of trees added to the ensemble (recall that our AGBM algorithm adds two trees to the ensemble per iteration, so the number of boosting iterations of VAGBM and AGBM is different). As we can see, when η is large, the training loss for VAGBM diverges very fast while our AGBM with a proper parameter γ converges. When η gets smaller, the training loss for VAGBM may decay faster than our AGBM at the beginning, but it gets stuck and never converges to the true optimal solution. Eventually the training loss of VAGBM may even diverge. On the other hand, our theory guarantees that AGBM always converges to the optimal solution.

Figure 1. Training loss versus number of trees for VAGBM (which doesn't converge) and AGBM with different parameters γ. (Two panels: η = 1 and η = 0.3; axes: training loss vs. number of trees.)

6.2. AGBM Sensitivity to the hyperparameters

In this section we look at how the two parameters η and γ affect the performance of AGBM. Figure 2 shows the training loss and the testing loss versus the number of trees for the a1a dataset with two different step-sizes, η = 4 and η = 0.1 (recall that AGBM adds two trees per iteration). When the step-size η is large (with the logistic loss, the largest step-size that guarantees convergence is η = 1/σ = 4), the training loss decays very fast, and the traditional GBM can converge even faster than our AGBM at the beginning. But the testing performance suffers, demonstrating that such fast convergence (due to the large step-size) can result in severe overfitting. In this case, our AGBM with a proper parameter γ has slightly better testing performance. When the step-size becomes smaller, the testing performance of all algorithms becomes more stable, though the training becomes slower. AGBM with a proper γ may require a smaller number of iterations/trees to reach good training/testing performance.

Figure 2. Training and testing loss versus number of trees for logistic regression on a1a. (Two panels: η = 4 and η = 0.1; rows: training loss and testing loss vs. number of trees.)

6.3. Experiments with Fine Tuning

In this section we look at the testing performance of GBM, VAGBM, and AGBM on the six datasets with hyperparameter tuning. We consider depth-5 trees as weak learners. We stop splitting early when the gain is smaller than 0.001 (roughly 1/n for these datasets). The hyper-parameters and the ranges we tuned are:

• step-size (η): 0.01, 0.03, 0.1, 0.3, 1 for the least-squares loss and 0.04, 0.12, 0.4, 1.2, 4 for the logistic loss;
• number of trees: 10, 11, ..., 100;
• momentum parameter γ (only for AGBM): 0.01, 0.02, 0.03, 0.05, 0.1, 0.2, 0.3, 0.5, 1.

For each dataset, we randomly choose 80% as the training dataset and the remainder is used as the final testing dataset. We use 5-fold cross-validation on the training dataset to tune the hyperparameters. Instead of going through all possible hyper-parameters, we utilize randomized search (RandomizedSearchCV in scikit-learn). As AGBM has more parameters (namely γ), we did proportionally more iterations of random search for AGBM. Table 3 presents the performance of GBM, VAGBM, and AGBM with the tuned parameters. As we can see, the accelerated methods (AGBM and VAGBM) in general require fewer iterations to get similar or slightly better testing performance than GBM. Compared with VAGBM, AGBM adds two trees per iteration, which can be more expensive, but the performance of AGBM can be more stable; for example, the testing error of VAGBM on the housing dataset is much larger than that of AGBM.

Dataset     Method   Training loss   Testing loss   # iterations   # trees
a1a         GBM      0.2187          0.3786         97             97
a1a         VAGBM    0.2454          0.3661         33             33
a1a         AGBM     0.1994          0.3730         33             66
w1a         GBM      0.0262          0.0578         84             84
w1a         VAGBM    0.0409          0.0578         32             32
w1a         AGBM     0.0339          0.0552         47             94
diabetes    GBM      0.297           0.462          87             87
diabetes    VAGBM    0.271           0.462          24             24
diabetes    AGBM     0.297           0.458          47             94
german      GBM      0.244           0.505          54             54
german      VAGBM    0.288           0.514          51             51
german      AGBM     0.305           0.485          35             70
housing     GBM      0.2152          4.6603         93             93
housing     VAGBM    0.5676          5.8090         73             73
housing     AGBM     0.215           4.5074         35             70
eunite2001  GBM      36.73           270.1          64             64
eunite2001  VAGBM    28.99           245.2          58             58
eunite2001  AGBM     26.74           245.4          24             48

Table 3. Performance after tuning hyper-parameters.

7. Conclusion

In this paper, we proposed a novel Accelerated Gradient Boosting Machine (AGBM), proved its rate of convergence, and introduced a computationally inexpensive practical variant of AGBM that takes advantage of strong convexity of the loss function and achieves linear convergence. Finally, we demonstrated with a number of numerical experiments the effectiveness of AGBM over conventional GBM in obtaining a model with good training and/or testing data fidelity.

References

Allen-Zhu, Z. and Orecchia, L. Linear coupling: An ultimate unification of gradient and mirror descent. arXiv preprint arXiv:1407.1537, 2014.

Biau, G., Cadre, B., and Rouvière, L. Accelerated gradient boosting. arXiv preprint arXiv:1803.02042, 2018.

Bickel, P. J., Ritov, Y., and Zakai, A. Some theory for generalized boosting algorithms. Journal of Machine Learning Research, 7(May):705–732, 2006.

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. Classification and Regression Trees. Wadsworth, 1984. ISBN 0-534-98053-8.

Bubeck, S., Lee, Y. T., and Singh, M. A geometric alternative to Nesterov's accelerated gradient descent. arXiv preprint arXiv:1506.08187, 2015.

Bühlmann, P., Hothorn, T., et al. Boosting algorithms: Regularization, prediction and model fitting. Statistical Science, 22(4):477–505, 2007.

Chen, T. and Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794. ACM, 2016.

Collins, M., Schapire, R. E., and Singer, Y. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 48(1-3):253–285, 2002.

Devolder, O., Glineur, F., and Nesterov, Y. First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 146(1-2):37–75, 2014.

Fouillen, E., Boyer, C., and Sangnier, M. Accelerated proximal boosting. arXiv preprint arXiv:1808.09670, 2018.

Freund, R. M., Grigas, P., Mazumder, R., et al. A new perspective on boosting in linear regression via subgradient optimization and relatives. The Annals of Statistics, 45(6):2328–2364, 2017.

Friedman, J., Hastie, T., and Tibshirani, R. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28:2000, 1998.

Friedman, J., Hastie, T., and Tibshirani, R. The elements of statistical learning, volume 1. Springer Series in Statistics, New York, NY, USA, 2001.

Friedman, J. H. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pp. 1189–1232, 2001.

Frostig, R., Ge, R., Kakade, S., and Sidford, A. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In International Conference on Machine Learning, 2015.

Hu, B. and Lessard, L. Dissipativity theory for Nesterov's accelerated method. arXiv preprint arXiv:1706.04381, 2017.

Karimireddy, S. P., Stich, S. U., and Jaggi, M. Global linear convergence of Newton's method without strong-convexity or Lipschitz gradients. arXiv preprint arXiv:1806.00413, 2018.

Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pp. 3146–3154, 2017.

Lin, H., Mairal, J., and Harchaoui, Z. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, 2015.

Locatello, F., Raj, A., Karimireddy, S. P., Rätsch, G., Schölkopf, B., Stich, S., and Jaggi, M. On matching pursuit and coordinate descent. In International Conference on Machine Learning, pp. 3204–3213, 2018.

Lu, H. and Mazumder, R. Randomized gradient boosting machine. arXiv preprint arXiv:1810.10158, 2018.

Lu, H., Freund, R. M., and Mirrokni, V. Accelerating greedy coordinate descent methods. arXiv preprint arXiv:1806.02476, 2018.

Mason, L., Baxter, J., Bartlett, P. L., and Frean, M. R. Boosting algorithms as gradient descent. In Advances in Neural Information Processing Systems, pp. 512–518, 2000.

Nesterov, Y. A method of solving a convex programming problem with convergence rate O(1/k^2). In Soviet Mathematics Doklady, volume 27, pp. 372–376, 1983.

Nesterov, Y. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2004.

Nesterov, Y. Accelerating the cubic regularization of Newton's method on convex problems. Mathematical Programming, 112(1):159–181, 2008.

Nesterov, Y. and Polyak, B. T. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.

O'Donoghue, B. and Candes, E. Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, 15(3):715–732, 2015.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.

Ponomareva, N., Radpour, S., Hendry, G., Haykal, S., Colthurst, T., Mitrichev, P., and Grushetsky, A. TF Boosted Trees: A scalable TensorFlow based framework for gradient boosting. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 423–427. Springer, 2017.

Ridgeway, G., Southworth, M. H., and RUnit, S. Package 'gbm'. Viitattu, 10(2013):40, 2013.

Sigrist, F. Gradient and Newton boosting for classification and regression. arXiv preprint arXiv:1808.03064, 2018.

Su, W., Boyd, S., and Candes, E. J. A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. Journal of Machine Learning Research, 17(153):1–43, 2016.

Sun, P., Zhang, T., and Zhou, J. A convergence rate analysis for LogitBoost, MART and their variant. In ICML, pp. 1251–1259, 2014.

Telgarsky, M. A primal-dual convergence analysis of boosting. Journal of Machine Learning Research, 13(Mar):561–606, 2012.

Tseng, P. On accelerated proximal gradient methods for convex–concave optimization. Preprint, 2008.

Wilson, A. C., Recht, B., and Jordan, M. I. A Lyapunov analysis of momentum methods in optimization. arXiv preprint arXiv:1611.02635, 2016.

Zhang, T. and Yu, B. Boosting with early stopping: Convergence and consistency. The Annals of Statistics, 33(4):1538–1579, 2005.

A. Additional Discussions

Below we include some additional discussions which could not fit into the main paper but which nevertheless help to understand the relevance of our results when applied to frameworks typically used in practice.

A.1. Line search in Boosting

Traditionally, the analysis of gradient boosting methods has focused on algorithms which use a line search to select the step-size η (e.g. Algorithm 1). The analysis of gradient descent suggests that this is not necessary: using a fixed step-size of 1/σ, where ℓ is σ-smooth, is sufficient (Lu & Mazumder, 2018). Our accelerated Algorithm 2 also adopts this fixed step-size strategy. In fact, even the standard boosting libraries (XGBoost and TFBT) typically use a fixed (but tuned) step-size and avoid an expensive line search.
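As an illustration of the two choices (our own sketch, assuming the least-squares loss and a generic weak-learner prediction vector b_pred on the training points), the per-iteration step-size can either be found by a one-dimensional line search or simply fixed at 1/σ:

import numpy as np
from scipy.optimize import minimize_scalar

def line_search_step(y, f_pred, b_pred):
    """Step (3) of Algorithm 1: eta = argmin_eta sum_i l(y_i, f(x_i) + eta * b(x_i))."""
    def loss(eta):
        return 0.5 * np.sum((y - (f_pred + eta * b_pred)) ** 2)
    return minimize_scalar(loss, bounds=(0.0, 10.0), method="bounded").x

def fixed_step(sigma=1.0):
    """Fixed step-size eta = 1/sigma used by Algorithm 2 (sigma = 1 for least squares)."""
    return 1.0 / sigma

For the least-squares loss the line search even has a closed form, η = ⟨r, b(X)⟩ / ‖b(X)‖^2, but the point of the fixed-step analysis is that neither is required for the convergence guarantees.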

A.2. Use of Hessian

Popular boosting libraries such as XGBoost (Chen & Guestrin, 2016) and TFBT (Ponomareva et al., 2017) compute the Hessian and perform a Newton boosting step instead of a gradient boosting step. Since the Newton step may not be well defined (e.g. if the Hessian is degenerate), an additional Euclidean regularizer is also added. This has been shown to improve performance and reduce the need for a line search for the η parameter sequence (Sun et al., 2014; Sigrist, 2018). For LogitBoost (i.e. when ℓ is the logistic loss), (Sun et al., 2014) demonstrate that a trust-region Newton method can indeed significantly improve the convergence. Leveraging similar results in second-order methods for convex optimization (e.g. (Nesterov & Polyak, 2006; Karimireddy et al., 2018)) and adapting accelerated second-order methods (Nesterov, 2008) would be an interesting direction for future work.
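To make the distinction concrete, the sketch below shows a single second-order (Newton-style) boosting step for the logistic loss with binary labels in {−1, +1}: the tree is fit by weighted least squares to the ratio of negative gradient to regularized Hessian. This is a generic illustration of Newton boosting under our own assumptions, not the exact update rule of any particular library.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def newton_boost_step(X, y, f_pred, lam=1.0, max_depth=3):
    """One Newton boosting step for the logistic loss l(y, f) = log(1 + exp(-y f)), y in {-1, +1}.

    g_i and h_i are the first and second derivatives of l at the current predictions;
    the weak learner is fit by weighted least squares to -g/(h + lam), mimicking a
    regularized Newton step.
    """
    p = 1.0 / (1.0 + np.exp(y * f_pred))      # p_i = sigmoid(-y_i * f_i)
    g = -y * p                                # first derivative of l w.r.t. f
    h = p * (1.0 - p)                         # second derivative, always in (0, 1/4]
    target = -g / (h + lam)
    tree = DecisionTreeRegressor(max_depth=max_depth)
    tree.fit(X, target, sample_weight=h + lam)
    return tree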

A.3. Out-of-sample Performance

Throughout this work we focus only on minimizing the empirical training loss L(f) (see Formula (2)). In reality, what we really care about is the out-of-sample error of the resulting ensemble f^M(x). A number of regularization tricks, such as i) early stopping (Zhang & Yu, 2005), ii) pruning (Chen & Guestrin, 2016; Ponomareva et al., 2017), iii) smaller step-sizes (Ponomareva et al., 2017), and iv) dropout (Ponomareva et al., 2017), are usually employed in practice to prevent over-fitting and improve generalization. Since AGBM requires far fewer iterations than GBM to achieve the same training loss, it outputs a much sparser set of learners. We believe this is partially the reason for its better out-of-sample performance. However, a joint theoretical study of the out-of-sample error along with the empirical error L(f) is much needed. It would also shed light on the effectiveness of the numerous ad-hoc regularization techniques currently employed.

B. Proof of Theorem 4.1

This section proves the major theoretical result of the paper:

Theorem 4.1. Consider Accelerated Gradient Boosting Machine (Algorithm 2). Suppose ℓ is σ-smooth, the step-size η ≤ 1/σ and the momentum parameter γ ≤ Θ^4/(4 + Θ^2). Then for all M ≥ 0, we have:

    L(f^M) - L(f^*) \le \frac{1}{2\eta\gamma (M+1)^2} \, \| f^*(X) \|_2^2 .

Let us start with some new notation. Define the scalar constants s := γ/Θ^2 and t := (1 − s)/2 ∈ (0, 1). We mostly only need s + t ≤ 1; the specific values of γ and t are needed only in Lemma B.6. Then define

    \alpha_m := \frac{\eta\gamma}{\theta_m} = \frac{\eta s \Theta^2}{\theta_m} ,

so that the definitions of the sequences {r^m}, {c^m}, ĥ^m(X) and {θ_m} from Algorithm 2 can be written as:

    \theta_m = \frac{2}{m+2}
    r^m = \Big[ - \frac{\partial \ell(y_i, g^m(x_i))}{\partial g^m(x_i)} \Big]_{i=1,\dots,n}
    c^m = r^m + (\alpha_{m-1}/\alpha_m) \big( c^{m-1} - b_{\tau_{m-1,2}}(X) \big)
    \hat h^{m+1}(X) = \hat h^m(X) + \alpha_m r^m .

The sequence ĥ^m(X) is in fact closely tied to the sequence h^m(X), as we show in the next lemma. For notational convenience, we define c^{-1} = b_{\tau_{-1,2}}(X) = 0 and similarly \alpha_{-1}/\theta_{-1} = 0 throughout the proof.

Lemma B.1.

    \hat h^{m+1}(X) = h^{m+1}(X) + \alpha_m \big( c^m - b_{\tau_{m,2}}(X) \big) .

Proof. Observe that

    \hat h^{m+1}(X) = \sum_{j=0}^{m} \alpha_j r^j \qquad \text{and} \qquad h^{m+1}(X) = \sum_{j=0}^{m} \alpha_j b_{\tau_{j,2}}(X) .

Then we have

    \hat h^{m+1}(X) - h^{m+1}(X) = \sum_{j=0}^{m} \alpha_j \big( r^j - b_{\tau_{j,2}}(X) \big)
    = \sum_{j=0}^{m} \alpha_j \Big( r^j - \frac{\alpha_{j-1}}{\alpha_j} b_{\tau_{j-1,2}}(X) \Big) - \alpha_m b_{\tau_{m,2}}(X)
    = \sum_{j=0}^{m} \alpha_j \Big( c^j - \frac{\alpha_{j-1}}{\alpha_j} c^{j-1} \Big) - \alpha_m b_{\tau_{m,2}}(X)
    = \sum_{j=0}^{m} \big( \alpha_j c^j - \alpha_{j-1} c^{j-1} \big) - \alpha_m b_{\tau_{m,2}}(X)
    = \alpha_m \big( c^m - b_{\tau_{m,2}}(X) \big) ,

where the third equality is due to the definition of c^m.

Lemma B.2 presents the fact that there is sufficient decay of the loss function:

Lemma B.2.

    L(f^{m+1}) \le L(g^m) - \frac{\eta\Theta^2}{2} \| r^m \|^2 .

Proof. Recall that τ_{m,1} is chosen such that

    \tau_{m,1} = \arg\min_{\tau \in T} \| b_\tau(X) - r^m \|^2 .

Since the class of learners B is scalable (Assumption 2.1), we have

    \| b_{\tau_{m,1}}(X) - r^m \|^2 = \min_{\tau \in T} \min_{\sigma \in \mathbb{R}} \| \sigma b_\tau(X) - r^m \|^2
    = \| r^m \|^2 \Big( 1 - \max_{\tau \in T} \cos\big( r^m, b_\tau(X) \big)^2 \Big)
    \le \| r^m \|^2 (1 - \Theta^2) ,    (8)

where the last inequality is because of the definition of Θ, and the second equality is due to the simple fact that for any two vectors a and b,

    \min_{\sigma \in \mathbb{R}} \| \sigma a - b \|^2 = \| b \|^2 - \max_{\sigma \in \mathbb{R}} \big( 2\sigma \langle a, b \rangle - \sigma^2 \| a \|^2 \big) = \| b \|^2 - \frac{\langle a, b \rangle^2}{\| a \|^2} = \| b \|^2 \Big( 1 - \frac{\langle a, b \rangle^2}{\| a \|^2 \| b \|^2} \Big) .

Now recall that L(f^{m+1}) = \sum_{i=1}^{n} \ell(y_i, f^{m+1}(x_i)) and that f^{m+1}(x) = g^m(x) + \eta b_{\tau_{m,1}}(x). Since the loss function ℓ(y_i, ·) is σ-smooth and the step-size satisfies η ≤ 1/σ, it holds that

    L(f^{m+1}) = \sum_{i=1}^{n} \ell\big( y_i, f^{m+1}(x_i) \big)
    = \sum_{i=1}^{n} \ell\big( y_i, g^m(x_i) + \eta b_{\tau_{m,1}}(x_i) \big)
    \le \sum_{i=1}^{n} \Big[ \ell\big( y_i, g^m(x_i) \big) + \frac{\partial \ell(y_i, g^m(x_i))}{\partial g^m(x_i)} \big( \eta b_{\tau_{m,1}}(x_i) \big) + \frac{\sigma}{2} \big( \eta b_{\tau_{m,1}}(x_i) \big)^2 \Big]
    \le \sum_{i=1}^{n} \Big[ \ell\big( y_i, g^m(x_i) \big) + \frac{\partial \ell(y_i, g^m(x_i))}{\partial g^m(x_i)} \big( \eta b_{\tau_{m,1}}(x_i) \big) + \frac{\eta}{2} \big( b_{\tau_{m,1}}(x_i) \big)^2 \Big]
    = \sum_{i=1}^{n} \Big[ \ell\big( y_i, g^m(x_i) \big) - r_i^m \big( \eta b_{\tau_{m,1}}(x_i) \big) + \frac{\eta}{2} \big( b_{\tau_{m,1}}(x_i) \big)^2 \Big]
    = L(g^m) - \eta \big\langle r^m, b_{\tau_{m,1}}(X) \big\rangle + \frac{\eta}{2} \| b_{\tau_{m,1}}(X) \|^2
    = L(g^m) + \frac{\eta}{2} \| b_{\tau_{m,1}}(X) - r^m \|^2 - \frac{\eta}{2} \| r^m \|^2
    \le L(g^m) - \frac{\Theta^2 \eta}{2} \| r^m \|^2 ,

where the final inequality follows from (8). This furnishes the proof of the lemma.

Lemma B.3 is a basic fact about convex functions, and it is commonly used in the convergence analysis of accelerated methods.

Lemma B.3. For any function f and m ≥ 0,

    L(g^m) + \theta_m \big\langle r^m, h^m(X) - f(X) \big\rangle \le \theta_m L(f) + (1 - \theta_m) L(f^m) .

Proof. For any function f, it follows from the convexity of the loss function ℓ that

    L(g^m) + \big\langle r^m, g^m(X) - f(X) \big\rangle = \sum_{i=1}^{n} \Big[ \ell\big( y_i, g^m(x_i) \big) + \frac{\partial \ell(y_i, g^m(x_i))}{\partial g^m(x_i)} \big( f(x_i) - g^m(x_i) \big) \Big] \le \sum_{i=1}^{n} \ell\big( y_i, f(x_i) \big) = L(f) .    (9)

Substituting f = f^m in (9), we get

    L(g^m) + \big\langle r^m, g^m(X) - f^m(X) \big\rangle \le L(f^m) .    (10)

Also recall that g^m(X) = (1 − θ_m) f^m(X) + θ_m h^m(X). This can be rewritten as

    \theta_m \big( g^m(X) - h^m(X) \big) = (1 - \theta_m) \big( f^m(X) - g^m(X) \big) .    (11)

Putting (9), (10), and (11) together:

    L(g^m) + \theta_m \big\langle r^m, h^m(X) - f(X) \big\rangle
    = L(g^m) + \theta_m \big\langle r^m, g^m(X) - f(X) \big\rangle + \theta_m \big\langle r^m, h^m(X) - g^m(X) \big\rangle
    = \theta_m \Big[ L(g^m) + \big\langle r^m, g^m(X) - f(X) \big\rangle \Big] + (1 - \theta_m) \Big[ L(g^m) + \big\langle r^m, g^m(X) - f^m(X) \big\rangle \Big]
    \le \theta_m L(f) + (1 - \theta_m) L(f^m) ,

which furnishes the proof.

We are ready to prove the key lemma which gives us the accelerated rate of convergence. Lemma B.4. Define the following potential function V (f) for any given output function f:

    V^m(f) = \frac{\alpha_{m-1}}{\theta_{m-1}} \big( L(f^m) - L(f) \big) + \frac{1}{2} \big\| f(X) - \hat h^m(X) \big\|^2 .    (12)

At every step, the potential decreases at least by δm:

    V^{m+1}(f) \le V^m(f) + \delta_m ,

where δ_m is defined as:

    \delta_m := \frac{s \alpha_{m-1}^2}{2t} \big\| c^{m-1} - b_{\tau_{m-1,2}}(X) \big\|^2 - (1 - s - t) \frac{\alpha_m^2}{2s} \| r^m \|^2 .    (13)

Proof. Recall that c^{-1} = b_{\tau_{-1,2}}(X) = 0 and \alpha_{-1}/\theta_{-1} = 0. It follows from Lemma B.2 that:

    L(f^{m+1}) - L(g^m) + \frac{(1-s)\eta\Theta^2}{2} \| r^m \|^2
    \le - \frac{s\eta\Theta^2}{2} \| r^m \|^2
    = - \alpha_m \theta_m \| r^m \|^2 + \frac{\alpha_m \theta_m}{2} \| r^m \|^2
    = \theta_m \big\langle r^m, \hat h^m(X) - \hat h^{m+1}(X) \big\rangle + \frac{\theta_m}{2\alpha_m} \big\| \hat h^m(X) - \hat h^{m+1}(X) \big\|^2
    = \theta_m \big\langle r^m, \hat h^m(X) - f(X) \big\rangle + \frac{\theta_m}{2\alpha_m} \Big( \big\| f(X) - \hat h^m(X) \big\|^2 - \big\| f(X) - \hat h^{m+1}(X) \big\|^2 \Big) ,

where the second equality is by the definition of ĥ^m(X) and the third is just a mathematical manipulation of the equation (it is also called the three-point property). By rearranging the above inequality, we have

    L(f^{m+1}) + \frac{(1-s)\eta\Theta^2}{2} \| r^m \|^2
    \le L(g^m) + \theta_m \big\langle r^m, \hat h^m(X) - f(X) \big\rangle + \frac{\theta_m}{2\alpha_m} \Big( \big\| f(X) - \hat h^m(X) \big\|^2 - \big\| f(X) - \hat h^{m+1}(X) \big\|^2 \Big)
    = L(g^m) + \theta_m \big\langle r^m, h^m(X) - f(X) \big\rangle + \frac{\theta_m}{2\alpha_m} \Big( \big\| f(X) - \hat h^m(X) \big\|^2 - \big\| f(X) - \hat h^{m+1}(X) \big\|^2 \Big) + \theta_m \big\langle r^m, \hat h^m(X) - h^m(X) \big\rangle
    \le \theta_m L(f) + (1 - \theta_m) L(f^m) + \frac{\theta_m}{2\alpha_m} \Big( \big\| f(X) - \hat h^m(X) \big\|^2 - \big\| f(X) - \hat h^{m+1}(X) \big\|^2 \Big) + \theta_m \alpha_{m-1} \big\langle r^m, c^{m-1} - b_{\tau_{m-1,2}}(X) \big\rangle ,

where the first inequality uses Lemma B.3 and the last inequality is due to the fact that ĥ^m(X) − h^m(X) = α_{m−1}( c^{m−1} − b_{τ_{m−1,2}}(X) ) from Lemma B.1. Rearranging the terms and multiplying by (α_m/θ_m) leads to

    \frac{\alpha_m}{\theta_m} \big( L(f^{m+1}) - L(f) \big) + \frac{1}{2} \big\| f(X) - \hat h^{m+1}(X) \big\|^2
    \le \underbrace{\frac{\alpha_m(1-\theta_m)}{\theta_m}}_{:=A} \big( L(f^m) - L(f) \big) + \frac{1}{2} \big\| f(X) - \hat h^m(X) \big\|^2 + \underbrace{\alpha_m \alpha_{m-1} \big\langle r^m, c^{m-1} - b_{\tau_{m-1,2}}(X) \big\rangle - \frac{(1-s)\eta\Theta^2 \alpha_m}{2\theta_m} \| r^m \|^2}_{:=B} .

Let us examine first the term A:

    \frac{\alpha_m(1-\theta_m)}{\theta_m} = (\eta\Theta^2 s) \frac{1-\theta_m}{\theta_m^2} \le (\eta\Theta^2 s) \frac{1}{\theta_{m-1}^2} = \frac{\alpha_{m-1}}{\theta_{m-1}} .

We have thus far shown that

    V^{m+1}(f) \le V^m(f) + B ,

and we now need to show that B ≤ δ_m. Using the mean-value inequality, the first term in B can be bounded as

    \alpha_m \alpha_{m-1} \big\langle r^m, c^{m-1} - b_{\tau_{m-1,2}}(X) \big\rangle \le \frac{\alpha_m^2 t}{2s} \| r^m \|^2 + \frac{\alpha_{m-1}^2 s}{2t} \big\| c^{m-1} - b_{\tau_{m-1,2}}(X) \big\|^2 .

Substituting this in B shows:

    B = \alpha_m \alpha_{m-1} \big\langle r^m, c^{m-1} - b_{\tau_{m-1,2}}(X) \big\rangle - \frac{(1-s)\eta\Theta^2 \alpha_m}{2\theta_m} \| r^m \|^2
    \le \frac{\alpha_m^2 t}{2s} \| r^m \|^2 + \frac{\alpha_{m-1}^2 s}{2t} \big\| c^{m-1} - b_{\tau_{m-1,2}}(X) \big\|^2 - \frac{(1-s)\alpha_m^2}{2s} \| r^m \|^2
    = \frac{\alpha_{m-1}^2 s}{2t} \big\| c^{m-1} - b_{\tau_{m-1,2}}(X) \big\|^2 - (1 - s - t) \frac{\alpha_m^2}{2s} \| r^m \|^2
    = \delta_m ,

which finishes the proof.

Unlike the typical proofs of accelerated algorithms, which usually show that the potential V^m(f) is a decreasing sequence, there is no guarantee that the potential V^m(f) is decreasing in the boosting setting, due to the use of weak learners. Instead, we are able to prove that:

Lemma B.5. For any given m, it holds that \sum_{j=0}^{m} \delta_j \le 0.

Proof. We can rewrite the statement of the lemma as:

    \sum_{j=0}^{m-1} \alpha_j^2 \big\| c^j - b_{\tau_{j,2}}(X) \big\|^2 \le \frac{t(1-s-t)}{s^2} \sum_{j=0}^{m} \alpha_j^2 \| r^j \|^2 .    (14)

Here, let us focus on the term ‖c^{j+1} − b_{τ_{j+1,2}}(X)‖^2 for a given j. We have that

    \big\| c^{j+1} - b_{\tau_{j+1,2}}(X) \big\|^2 \le (1 - \Theta^2) \big\| c^{j+1} \big\|^2
    = (1 - \Theta^2) \Big\| r^{j+1} + \frac{\theta_{j+1}}{\theta_j} \big( c^j - b_{\tau_{j,2}}(X) \big) \Big\|^2
    \le (1 - \Theta^2)(1 + \rho) \big\| r^{j+1} \big\|^2 + (1 - \Theta^2)(1 + 1/\rho) \Big\| \frac{\theta_{j+1}}{\theta_j} \big( c^j - b_{\tau_{j,2}}(X) \big) \Big\|^2
    \le (1 + \rho)(1 - \Theta^2) \big\| r^{j+1} \big\|^2 + (1 - \Theta^2)(1 + 1/\rho) \big\| c^j - b_{\tau_{j,2}}(X) \big\|^2 ,

where the first inequality follows from our assumption about the density of the weak-learner class B (the same argument as in (8)), the second inequality holds for any ρ ≥ 0 due to the mean-value inequality, and the last inequality is from θ_{j+1} ≤ θ_j. This gives a recursive bound on the summands on the left side of (14). From this, (14) follows from an elementary fact about recursive sequences, as stated in Lemma B.6, applied with a_j = \alpha_j^2 \| c^j - b_{\tau_{j,2}}(X) \|^2 and c_j = \alpha_j^2 \| r^j \|^2.

Remark B.1. If c^m = b_{\tau_{m,2}}(X) (i.e. our class of learners B is strong), then \delta_m = -(1 - s - t) \frac{\alpha_m^2}{2s} \| r^m \|^2 \le 0.

Lemma B.6 is an elementary fact about recursive sequences used in the proof of Lemma B.5.

Lemma B.6. Given two sequences {aj ≥ 0} and {cj ≥ 0} such that the following holds for any ρ ≥ 0,

    a_{j+1} \le (1 - \Theta^2) \big[ (1 + 1/\rho)\, a_j + (1 + \rho)\, c_{j+1} \big] ,

then the sum of the terms a_j can be bounded as

    \sum_{j=0}^{m} a_j \le \frac{t(1-s-t)}{s^2} \sum_{j=0}^{m} c_j .

Proof. The recursive bound on a_j implies that

    a_j \le (1 - \Theta^2) \big[ (1 + 1/\rho)\, a_{j-1} + (1 + \rho)\, c_j \big]
    \le \sum_{k=0}^{j} \big[ (1 + 1/\rho)(1 - \Theta^2) \big]^{j-k} (1 + \rho)(1 - \Theta^2)\, c_k .

Summing both terms gives

    \sum_{j=0}^{m} a_j \le \sum_{j=0}^{m} \sum_{k=0}^{j} \big[ (1 + 1/\rho)(1 - \Theta^2) \big]^{j-k} (1 + \rho)(1 - \Theta^2)\, c_k
    = \sum_{k=0}^{m} \sum_{j=k}^{m} \big[ (1 + 1/\rho)(1 - \Theta^2) \big]^{j-k} (1 + \rho)(1 - \Theta^2)\, c_k
    \le \sum_{k=0}^{m} \Big( \sum_{j=0}^{\infty} \big[ (1 + 1/\rho)(1 - \Theta^2) \big]^{j} \Big) (1 + \rho)(1 - \Theta^2)\, c_k
    = \frac{(1 + \rho)(1 - \Theta^2)}{1 - (1 + 1/\rho)(1 - \Theta^2)} \sum_{k=0}^{m} c_k
    = \frac{(1 + \rho)(1 - \Theta^2)}{\Theta^2 - (1 - \Theta^2)/\rho} \sum_{k=0}^{m} c_k
    = \frac{2(1 + \rho)(1 - \Theta^2)}{\Theta^2} \sum_{k=0}^{m} c_k
    = \frac{2(2 - \Theta^2)(1 - \Theta^2)}{\Theta^4} \sum_{k=0}^{m} c_k ,

where in the last two equalities we chose \rho = \frac{2(1 - \Theta^2)}{\Theta^2}. Now recall that s \le \frac{\Theta^2}{4 + \Theta^2} \in (0, 1) and that t = (1 - s)/2:

    \sum_{j=0}^{m} a_j \le \frac{2(2 - \Theta^2)(1 - \Theta^2)}{\Theta^4} \sum_{k=0}^{m} c_k
    \le \frac{4}{\Theta^4} \sum_{k=0}^{m} c_k
    = \Big( \frac{4 + \Theta^2}{\Theta^2} - 1 \Big)^2 \frac{1}{4} \sum_{k=0}^{m} c_k
    \le \Big( \frac{1}{s} - 1 \Big)^2 \frac{1}{4} \sum_{k=0}^{m} c_k
    = \frac{(1 - s)^2}{4 s^2} \sum_{k=0}^{m} c_k
    = \frac{t(1 - s - t)}{s^2} \sum_{k=0}^{m} c_k .

Lemma B.4 and Lemma B.5 directly result in our major theorem.

Proof of Theorem 4.1. It follows from Lemma B.4 and Lemma B.5 that

    V^M(f^*) \le V^{M-1}(f^*) + \delta_{M-1} \le V^0(f^*) + \sum_{j=0}^{M-1} \delta_j \le \frac{1}{2} \big\| f^0(X) - f^*(X) \big\|^2 .

Notice that V^M(f^*) \ge \frac{\alpha_{M-1}}{\theta_{M-1}} \big( L(f^M) - L(f^*) \big), as the term \frac{1}{2} \| f^*(X) - \hat h^M(X) \|^2 \ge 0, which implies that

    L(f^M) - L(f^*) \le \frac{\theta_{M-1}}{2\alpha_{M-1}} \big\| f^0(X) - f^*(X) \big\|^2 = \frac{\theta_{M-1}^2}{2\gamma\eta} \big\| f^0(X) - f^*(X) \big\|^2 .