Accelerating Gradient Boosting Machine

Haihao Lu * 1 2 Sai Praneeth Karimireddy * 3 Natalia Ponomareva 2 Vahab Mirrokni 2

Abstract

Gradient Boosting Machine (GBM) (Friedman, 2001) is an extremely powerful supervised learning algorithm that is widely used in practice. GBM routinely features as a leading algorithm in competitions such as Kaggle and the KDDCup. In this work, we propose Accelerated Gradient Boosting Machine (AGBM) by incorporating Nesterov's acceleration techniques into the design of GBM. The difficulty in accelerating GBM lies in the fact that weak (inexact) learners are commonly used, and therefore the errors can accumulate in the momentum term. To overcome this, we design a "corrected pseudo-residual" and fit the best weak learner to this corrected pseudo-residual in order to perform the momentum update. Thus, we are able to derive novel computational guarantees for AGBM. This is the first GBM-type algorithm with a theoretically justified accelerated convergence rate. Finally, we demonstrate with a number of numerical experiments the effectiveness of AGBM over conventional GBM in obtaining a model with good training and/or testing data fidelity.

1. Introduction

Gradient Boosting Machine (GBM) (Friedman, 2001) is a powerful algorithm that combines multiple weak learners into an ensemble with excellent prediction performance. GBM works very well for a number of tasks like spam filtering, online advertising, fraud detection, computational physics (e.g., the Higgs Boson discovery), etc., and has routinely featured as a top algorithm in Kaggle competitions and the KDDCup (Chen & Guestrin, 2016). GBM can naturally handle heterogeneous datasets (highly correlated data, missing data, categorical data, etc). It is also quite easy to use with several publicly available implementations: scikit-learn (Pedregosa et al., 2011), R gbm (Ridgeway et al., 2013), LightGBM (Ke et al., 2017), XGBoost (Chen & Guestrin, 2016), TF Boosted Trees (Ponomareva et al., 2017), etc.

In spite of the practical success of GBM, there is a considerable gap in its theoretical understanding. The traditional interpretation of GBM is to view it as a form of steepest descent in functional space (Mason et al., 2000; Friedman, 2001). While this interpretation serves as a good starting point, such a framework lacks rigorous non-asymptotic convergence guarantees, especially when compared to the growing body of literature on first-order convex optimization.

In the convex optimization literature, Nesterov's acceleration is a successful technique for speeding up the convergence of first-order methods. In this work, we show how to incorporate Nesterov momentum into the gradient boosting framework to obtain an accelerated gradient boosting machine.

1.1. Our contributions

We propose the first accelerated gradient boosting algorithm that comes with strong theoretical guarantees and can be used with any type of weak learner. In particular:

1. We propose a novel accelerated gradient boosting algorithm (AGBM) (Section 3) and prove (Section 4) that it reduces the empirical loss at a rate of O(1/m^2) after m iterations, improving upon the O(1/m) rate obtained by traditional gradient boosting methods.
2. We propose a computationally inexpensive practical variant of AGBM, taking advantage of strong convexity of the loss function, which achieves linear convergence (Section 5). We also list the conditions (on the loss function) under which AGBM would be beneficial.
3. With a number of numerical experiments with weak tree learners (one of the most popular types of GBM), we confirm the effectiveness of AGBM.
4. Our accelerated boosting framework will be open-sourced at this url¹.

Apart from these theoretical contributions, we pave the way to speeding up some practical applications of GBMs which currently require a large number of boosting iterations. For example, GBMs with boosted trees for multi-class problems are commonly implemented as a number of one-vs-rest learners, resulting in more complicated decision boundaries (Friedman et al., 1998) and potentially a larger number of boosting iterations. Additionally, it is common practice to build many very weak learners for problems where it is easy to overfit. Such large ensembles result not only in slow training, but also in slower inference. AGBM can be potentially beneficial for all of these applications.

*Equal contribution. 1MIT, Cambridge, MA, USA. 2Google, New York, NY, USA. 3EPFL, Lausanne, Switzerland. Correspondence to: Haihao Lu. Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s). ¹Link removed for anonymity; will be added upon acceptance.

1.2. Related Literature

Convergence Guarantees for GBM: After GBM was first introduced by Friedman (Friedman, 2001), several works established its guaranteed convergence, without explicitly stating the convergence rate (Collins et al., 2002; Mason et al., 2000). Subsequently, when the loss function is both smooth and strongly convex, (Bickel et al., 2006) proved an exponential convergence rate—more precisely, that O(exp(1/ε^2)) iterations are sufficient to ensure that the training loss is within ε of its optimal value. (Telgarsky, 2012) then studied the primal-dual structure of GBM and demonstrated that in fact only O(log(1/ε)) iterations are needed. However, the constants in their rate were non-standard and less intuitive. This result was recently improved upon by (Freund et al., 2017) and (Lu & Mazumder, 2018), who showed a similar convergence rate but with more transparent constants, such as the smoothness and strong convexity constants of the loss function and the density of the weak learners. Additionally, if the loss function is assumed to be smooth and convex (but not necessarily strongly convex), (Lu & Mazumder, 2018) also showed that O(1/ε) iterations are sufficient. We refer the reader to (Telgarsky, 2012), (Freund et al., 2017) and (Lu & Mazumder, 2018) for a more detailed literature review on the theoretical results of GBM convergence.

Accelerated Gradient Methods: For optimizing a smooth convex function, (Nesterov, 1983) showed that the standard gradient descent algorithm can be made much faster, resulting in the accelerated gradient descent method. While gradient descent requires O(1/ε) iterations, accelerated gradient descent only requires O(1/√ε). In general, this rate of convergence is optimal and cannot be improved upon (Nesterov, 2004). Since its introduction in 1983, the mainstream research community's interest in Nesterov's accelerated method started around 15 years ago; yet even today most researchers struggle to find basic intuition as to what is really going on in accelerated methods. Such lack of intuition about the estimation-sequence proof technique used by (Nesterov, 2004) has motivated many recent works trying to explain this acceleration phenomenon (Su et al., 2016; Wilson et al., 2016; Hu & Lessard, 2017; Lin et al., 2015; Frostig et al., 2015; Allen-Zhu & Orecchia, 2014; Bubeck et al., 2015). Some have recently attempted to give a physical explanation of acceleration techniques by studying the continuous-time interpretation of accelerated gradient descent via dynamical systems (Su et al., 2016; Wilson et al., 2016; Hu & Lessard, 2017).

Accelerated Greedy Coordinate and Matching Pursuit Methods: Recently, (Locatello et al., 2018) and (Lu et al., 2018) discussed how to accelerate matching pursuit and greedy coordinate descent algorithms, respectively. Their methods however require a random step and are hence only "semi-greedy", which does not fit in the boosting framework.

Accelerated GBM: Recently, (Biau et al., 2018) and (Fouillen et al., 2018) proposed an accelerated version of GBM by directly incorporating Nesterov's momentum in GBM; however, no theoretical justification was provided. Furthermore, as we argue in Section 5.2, their proposed algorithm may not converge to the optimum.

2. Gradient Boosting Machine

We consider a supervised learning problem with n training examples (x_i, y_i), i = 1, ..., n, such that x_i ∈ R^p is the feature vector of the i-th example and y_i is a label (in a classification problem) or a continuous response (in a regression problem). In the classical version of GBM (Friedman, 2001), we assume we are given a base class of learners B and that our target function class is the set of linear combinations of such base learners (denoted by lin(B)). Let B = {b_τ(x) ∈ R} be a family of learners parameterized by τ ∈ T. The prediction corresponding to a feature vector x is given by an additive model of the form:

    f(x) := \sum_{m=1}^{M} \beta_m b_{\tau_m}(x) \in \mathrm{lin}(B) ,    (1)

where b_{τ_m}(x) ∈ B is a weak learner and β_m is its corresponding additive coefficient. Here, β_m and τ_m are chosen in an adaptive fashion in order to improve the data-fidelity, as discussed below. Examples of learners commonly used in practice include wavelet functions, support vector machines, and classification and regression trees (Friedman et al., 2001). We assume the set of weak learners B is scalable, namely that the following assumption holds.

Assumption 2.1. If b(·) ∈ B, then λb(·) ∈ B for any λ > 0.

Assumption 2.1 holds for most sets of weak learners we are interested in. Indeed, scaling a weak learner is equivalent to modifying the coefficient of the weak learner, so it does not change the structure of B.

The goal of GBM is to obtain a good estimate of the function f that approximately minimizes the empirical loss:

    L^{*} = \min_{f \in \mathrm{lin}(B)} \Big\{ L(f) := \sum_{i=1}^{n} \ell(y_i, f(x_i)) \Big\}    (2)

where ℓ(y_i, f(x_i)) is a measure of the data-fidelity for the i-th sample for the loss function ℓ.
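To make the objective in (1)-(2) concrete, the following is a minimal Python sketch of how an ensemble in lin(B) produces predictions and how the empirical loss L(f) is evaluated. The choice of the least-squares loss and the class/function names here are ours for illustration; they are not part of the paper's released code.

import numpy as np

class AdditiveEnsemble:
    """An additive model f(x) = sum_m beta_m * b_{tau_m}(x), i.e. an element of lin(B)."""

    def __init__(self):
        self.learners = []   # fitted weak learners b_{tau_m}, each exposing .predict(X)
        self.coeffs = []     # additive coefficients beta_m

    def add(self, learner, coeff):
        self.learners.append(learner)
        self.coeffs.append(coeff)

    def predict(self, X):
        # f(X), the vector [f(x_i)]_i used throughout the paper
        pred = np.zeros(X.shape[0])
        for beta, b in zip(self.coeffs, self.learners):
            pred += beta * b.predict(X)
        return pred

def empirical_loss(y, pred):
    # L(f) = sum_i l(y_i, f(x_i)) with the least-squares loss l(y, f) = 0.5 * (y - f)^2
    return 0.5 * np.sum((y - pred) ** 2)

A weak learner here could be, for example, a shallow scikit-learn regression tree fit to a pseudo-residual, which is exactly the role played by b_τ in Algorithm 1 below.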

2.1. Best Fit Weak Learners

The original version of GBM by (Friedman, 2001), presented in Algorithm 1, can be viewed as minimizing the loss function (2) by applying an approximate steepest descent algorithm. GBM starts from a null function f^0 ≡ 0 and at each iteration computes the pseudo-residual r^m (namely, the negative gradient of the loss function with respect to the predictions so far, f^m):

    r_i^m = - \frac{\partial \ell(y_i, f^m(x_i))}{\partial f^m(x_i)} .    (3)

Then a weak learner that best fits the current pseudo-residual in terms of the least-squares loss is computed as follows:

    \tau_m = \arg\min_{\tau \in T} \sum_{i=1}^{n} \big( r_i^m - b_\tau(x_i) \big)^2 .    (4)

This weak learner is added to the model with a coefficient found via a line search. As the iterations progress, GBM leads to a sequence of functions {f^m}_{m ∈ [M]} (where [M] is a shorthand for the set {1, ..., M}). The usual intention of GBM is to stop early—before one is close to a minimum of Problem (2)—with the hope that such a model will lead to good predictive performance (Friedman, 2001; Freund et al., 2017; Zhang & Yu, 2005; Bühlmann et al., 2007).

Algorithm 1 Gradient Boosting Machine (GBM) (Friedman, 2001)
Initialization. Initialize with f^0(x) = 0.
For m = 0, ..., M − 1 do:
Perform Updates:
(1) Compute pseudo-residual: r^m = [ − ∂ℓ(y_i, f^m(x_i)) / ∂f^m(x_i) ]_{i=1,...,n}.
(2) Find the parameters of the best weak learner: τ_m = arg min_{τ∈T} Σ_{i=1}^n (r_i^m − b_τ(x_i))^2.
(3) Choose the step-size η_m by line search: η_m = arg min_η Σ_{i=1}^n ℓ(y_i, f^m(x_i) + η b_{τ_m}(x_i)).
(4) Update the model: f^{m+1}(x) = f^m(x) + η_m b_{τ_m}(x).
Output. f^M(x).

Perhaps the most popular set of learners are classification and regression trees (CART) (Breiman et al., 1984), resulting in Gradient Boosted Decision Tree (GBDT) models. These are the models that we use for our numerical experiments. At the same time, we would like to highlight that our algorithm is not tied to a particular type of weak learner and is a general algorithm.
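The classical GBM loop of Algorithm 1 is straightforward to express in code. The sketch below assumes the least-squares loss and shallow scikit-learn regression trees as the weak learners, and replaces the line search in step (3) with a fixed step-size; these choices are ours for illustration, not prescribed by the paper.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbm_fit(X, y, n_rounds=100, eta=0.1, max_depth=3):
    """Minimal GBM in the style of Algorithm 1: least-squares loss, fixed step-size eta."""
    n = X.shape[0]
    f_pred = np.zeros(n)              # f^m(x_i) on the training points
    learners = []
    for m in range(n_rounds):
        # (1) pseudo-residual: the negative gradient of 0.5*(y - f)^2 w.r.t. f is (y - f)
        residual = y - f_pred
        # (2) best-fit weak learner for the pseudo-residual (least-squares fit)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)
        # (3)-(4) add the learner with a fixed step-size instead of a line search
        f_pred += eta * tree.predict(X)
        learners.append(tree)
    return learners

def gbm_predict(learners, X, eta=0.1):
    return eta * sum(tree.predict(X) for tree in learners)

With a strong (complete) learner class, replacing this single f update with the three-sequence update in (5) below would already give the accelerated O(1/m^2) rate; Section 3.2 shows what changes when the learners are genuinely weak.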

Parameter        Dimension    Explanation
(x_i, y_i)       R^p × R      The features and the label of the i-th sample.
X                R^{p×n}      X = [x_1, x_2, ..., x_n] is the feature matrix for all training data.
b_τ(x)           function     Weak learner parameterized by τ.
f^m(x)           function     Ensemble of weak learners at the m-th iteration.
g^m(x), h^m(x)   functions    Auxiliary ensembles of weak learners at the m-th iteration.
r^m              R^n          Pseudo-residual at the m-th iteration.
c^m              R^n          Corrected pseudo-residual at the m-th iteration.
b_τ(X)           R^n          A vector of predictions [b_τ(x_i)]_i.
f(X)             R^n          A vector of [f(x_i)]_i for any function f(x).

Table 1. List of notations used.

3. Accelerated Gradient Boosting Machine (AGBM)

Given the success of accelerated gradient descent as a first-order optimization method, it seems natural to attempt to accelerate GBM. As a warm-up, we first look at how to obtain an accelerated boosting algorithm when our class of learners B is strong (complete) and can exactly fit any pseudo-residual. This assumption is quite unreasonable, but it will serve to illustrate the connection between boosting and first-order optimization. We then describe our actual algorithm, which works for any class of weak learners.

3.1. Boosting with strong learners

In this subsection, we assume the class of learners B is strong, i.e. for any pseudo-residual r ∈ R^n, there exists a learner b(x) ∈ B such that

    b(x_i) = r_i, \quad \forall\, i \in [n] .

Of course, the entire point of boosting is that the learners are weak and thus the class is not strong, therefore this is not a realistic assumption. Nevertheless, this section provides the intuition for how to develop AGBM.

In GBM we compute the pseudo-residual r^m in (3) as the negative gradient of the loss function over the predictions so far. A gradient descent step in functional space would try to find f^{m+1} such that for i ∈ {1, ..., n}

    f^{m+1}(x_i) = f^m(x_i) + \eta r_i^m .

Here η is the step-size of our algorithm. Since our class of learners is rich, we can choose b^m(x) ∈ B to exactly satisfy the above equation. Thus GBM (Algorithm 1) then has the following update:

    f^{m+1} = f^m + \eta b^m ,

where b^m(x_i) = r_i^m. In other words, GBM performs exactly functional gradient descent when the class of learners is strong, and so it converges at a rate of O(1/m). Akin to the above argument, we can perform functional accelerated gradient descent, which has the accelerated rate of O(1/m^2). In the accelerated method, we maintain three model ensembles: f, g, and h, of which f(x) is the only model which is finally used to make predictions during inference. h(x) is the momentum sequence and g(x) is a weighted average of f and h (refer to Table 1 for a list of all notations used). These sequences are updated as follows for a step-size η and {θ_m = 2/(m + 2)}:

    g^m = (1 - \theta_m) f^m + \theta_m h^m
    f^{m+1} = g^m + \eta b^m : \text{primary model}    (5)
    h^{m+1} = h^m + (\eta/\theta_m)\, b^m : \text{momentum model} ,

where b^m(x) satisfies, for i ∈ {1, ..., n},

    b^m(x_i) = - \frac{\partial \ell(y_i, g^m(x_i))}{\partial g^m(x_i)} .    (6)

Note that the pseudo-residual is computed w.r.t. g instead of f. The update above can be rewritten as

    f^{m+1} = f^m + \eta b^m + \theta_m (h^m - f^m) .

If θ_m = 0, we see that we recover standard functional gradient descent with step-size η. For θ_m ∈ (0, 1], there is an additional momentum in the direction of (h^m − f^m). The three sequences f, g, and h match exactly those used in typical accelerated gradient methods (see (Nesterov, 2004; Tseng, 2008) for details).

3.2. Boosting with weak learners

In this subsection, we consider the general case without assuming that the class of learners is strong. Indeed, the class of learners B is usually quite simple and it is very likely that for any τ ∈ T, it is impossible to exactly fit the residual r^m. We call this case boosting with weak learners. Our task then is to modify (5) to obtain a truly accelerated gradient boosting machine.

The full details are summarized in Algorithm 2, but we will highlight two key differences from (5).

First, the update to the f sequence is replaced with a weak learner which best approximates r^m, similar to (5). In particular, we compute the pseudo-residual r^m w.r.t. g as in (6) and find a parameter τ_{m,1} such that

    \tau_{m,1} = \arg\min_{\tau \in T} \sum_{i=1}^{n} \big( r_i^m - b_\tau(x_i) \big)^2 .

Secondly, and more crucially, the update to the momentum model h is decoupled from the update to the f sequence. We use an error-corrected pseudo-residual c^m instead of directly using r^m. Suppose that at iteration m − 1, a weak learner b_{τ_{m−1,2}} was added to h^{m−1}. Then the error-corrected residual is defined inductively as follows: for i ∈ {1, ..., n},

    c_i^m = r_i^m + \frac{m+1}{m+2} \big( c_i^{m-1} - b_{\tau_{m-1,2}}(x_i) \big) ,

and then we compute

    \tau_{m,2} = \arg\min_{\tau \in T} \sum_{i=1}^{n} \big( c_i^m - b_\tau(x_i) \big)^2 .

Thus at each iteration two weak learners are computed: b_{τ_{m,1}}(x), which approximates the residual r^m, and b_{τ_{m,2}}(x), which approximates the error-corrected residual c^m. Note that if our class of learners is complete, then c_i^{m−1} = b_{τ_{m−1,2}}(x_i), c^m = r^m and τ_{m,1} = τ_{m,2}. This would revert back to our accelerated gradient boosting algorithm for strong learners described in (5).

Algorithm 2 Accelerated Gradient Boosting Machine (AGBM)
Input. Starting function f^0(x) = 0, step-size η, momentum parameter γ ∈ (0, 1], and data (X, y) = {(x_i, y_i)}_{i∈[n]}.
Initialization. h^0(x) = f^0(x) and sequence θ_m = 2/(m+2).
For m = 0, ..., M − 1 do:
Perform Updates:
(1) Compute a linear combination of f and h: g^m(x) = (1 − θ_m) f^m(x) + θ_m h^m(x).
(2) Compute pseudo-residual: r^m = [ − ∂ℓ(y_i, g^m(x_i)) / ∂g^m(x_i) ]_{i=1,...,n}.
(3) Find the best weak learner for the pseudo-residual: τ_{m,1} = arg min_{τ∈T} Σ_{i=1}^n (r_i^m − b_τ(x_i))^2.
(4) Update the model: f^{m+1}(x) = g^m(x) + η b_{τ_{m,1}}(x).
(5) Update the corrected residual: c_i^m = r_i^m if m = 0; otherwise c_i^m = r_i^m + ((m+1)/(m+2)) (c_i^{m−1} − b_{τ_{m−1,2}}(x_i)).
(6) Find the best weak learner for the corrected residual: τ_{m,2} = arg min_{τ∈T} Σ_{i=1}^n (c_i^m − b_τ(x_i))^2.
(7) Update the momentum model: h^{m+1}(x) = h^m(x) + (γη/θ_m) b_{τ_{m,2}}(x).
Output. f^M(x).

4. Convergence Analysis of AGBM

We first formally define the assumptions required and then outline the computational guarantees for AGBM.

4.1. Assumptions

Let us introduce some standard regularity/continuity constraints on the loss function that we require in our analysis.

Definition 4.1. We denote by ∂ℓ(y, f)/∂f the derivative of the bivariate loss function ℓ w.r.t. the prediction f. We say that ℓ is σ-smooth if for any y and predictions f_1 and f_2, it holds that

    \ell(y, f_1) \le \ell(y, f_2) + \frac{\partial \ell(y, f_2)}{\partial f} (f_1 - f_2) + \frac{\sigma}{2} (f_1 - f_2)^2 .

We say ℓ is µ-strongly convex (with µ > 0) if for any y and predictions f_1 and f_2, it holds that

    \ell(y, f_1) \ge \ell(y, f_2) + \frac{\partial \ell(y, f_2)}{\partial f} (f_1 - f_2) + \frac{\mu}{2} (f_1 - f_2)^2 .

Note that µ ≤ σ always.
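As a concrete instance of Definition 4.1 (a worked example of our own, not taken from the paper), the constants for the two losses used in Section 6 can be read off from the second derivative of ℓ with respect to the prediction. For the least-squares loss,

    \ell(y, f) = \tfrac{1}{2}(y - f)^2, \qquad \frac{\partial^2 \ell}{\partial f^2} = 1, \qquad \text{so } \sigma = \mu = 1 .

For the logistic loss with y ∈ {−1, +1},

    \ell(y, f) = \log\big(1 + e^{-yf}\big), \qquad \frac{\partial^2 \ell}{\partial f^2} = p(1-p) \ \text{ with } \ p = \frac{1}{1 + e^{yf}}, \qquad \text{so } \sigma = \tfrac{1}{4} ,

while µ can be arbitrarily close to 0 for large |f|. This is consistent with the step-size η = 1/σ = 4 used for the logistic loss in Section 6, and with the σ ≫ µ regime discussed in Remark 5.1.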

Smoothness and strong convexity mean that the function ℓ is (respectively) upper and lower bounded by quadratic functions. Intuitively, smoothness implies that the gradient does not change abruptly and hence ℓ is never 'sharp'. Strong convexity implies that ℓ always has some 'curvature' and is never 'flat'.

The notion of Minimal Cosine Angle (MCA) introduced in (Lu & Mazumder, 2018) plays a central role in our convergence rate analysis of GBM. MCA measures how well the weak learner b_{τ(r)}(X) approximates the desired residual r.

Definition 4.2. Let r ∈ R^n be a vector. The Minimal Cosine Angle (MCA) is defined as the similarity between r and the output of the best-fit learner b_τ(X):

    \Theta := \min_{r \in \mathbb{R}^n} \max_{\tau \in T} \cos\big( r, b_\tau(X) \big) ,    (7)

where b_τ(X) ∈ R^n is the vector of predictions [b_τ(x_i)]_i.

The quantity Θ ∈ (0, 1] measures how "dense" the learners are in the prediction space. For strong learners (as in Section 3.1), the prediction space is complete, and Θ = 1. For a complex space of learners T, such as deep trees, we expect the prediction space to be dense and Θ ≈ 1. For a simpler class, such as tree stumps, Θ would be much smaller. Refer to (Lu & Mazumder, 2018) for a discussion of Θ.

4.2. Computational Guarantees

We are now ready to state the main theoretical result of our paper.

Theorem 4.1. Consider Accelerated Gradient Boosting Machine (Algorithm 2). Suppose ℓ is σ-smooth, the step-size η ≤ 1/σ and the momentum parameter γ ≤ Θ^4/(4 + Θ^2), where Θ is the MCA introduced in Definition 4.2. Then for all M ≥ 0, we have:

    L(f^M) - L(f^*) \le \frac{1}{2\eta\gamma (M+1)^2} \, \| f^*(X) \|_2^2 .

Proof Sketch. Here we only give an outline; the full proof can be found in the Appendix (Section B). We use the potential-function based analysis of accelerated methods (cf. (Tseng, 2008; Wilson et al., 2016)). Recall that θ_m = 2/(m+2). For the proof, we introduce the following vector sequence of auxiliary ensembles ĥ:

    \hat h^0(X) = 0, \qquad \hat h^{m+1}(X) = \hat h^m(X) + \frac{\eta\gamma}{\theta_m} r^m .

The sequence ĥ^m(X) is in fact closely tied to the sequence h^m(X), as we demonstrate in the Appendix (Lemma B.1). Let f^* be any function which obtains the optimal loss (2):

    f^* \in \arg\min_{f \in \mathrm{lin}(B)} \Big\{ L(f) := \sum_{i=1}^{n} \ell(y_i, f(x_i)) \Big\} .

Let us define the following sequence of potentials:

    V^m(f^*) = \begin{cases} \tfrac{1}{2} \big\| f^*(X) - \hat h^0(X) \big\|^2 & \text{if } m = 0 , \\ \tfrac{\eta\gamma}{\theta_{m-1}^2} \big( L(f^m) - L^{*} \big) + \tfrac{1}{2} \big\| f^*(X) - \hat h^m(X) \big\|^2 & \text{otherwise.} \end{cases}

Typical proofs of accelerated algorithms show that the potential V^m(f^*) is a decreasing sequence. In boosting, we use the weak learner that fits the pseudo-residual of the loss. This can guarantee sufficient decay of the first term of V^m(f^*), related to the loss L(f). However, there is no such guarantee that the same weak learner can also provide sufficient decay of the second term, as we do not know the optimal ensemble f^* a priori. That is the major challenge in the development of AGBM. We instead show that the potential decreases at least by δ_m:

    V^{m+1}(f^*) \le V^m(f^*) + \delta_m ,

where δ_m is an error term depending on Θ (see Lemma B.4 for the exact definition of δ_m and proof of the claim). By telescoping, it holds that

    \frac{\eta\gamma}{\theta_m^2} \big( L(f^{m+1}) - L(f^*) \big) \le V^{m+1}(f^*) \le \sum_{j=0}^{m} \delta_j + \frac{1}{2} \big\| f^*(X) - \hat h^0(X) \big\|^2 .

Finally, a careful analysis of the error term (Lemma B.6) shows that Σ_{j=0}^m δ_j ≤ 0 for any m ≥ 0. Therefore,

    L(f^{m+1}) - L(f^*) \le \frac{\theta_m^2}{2\eta\gamma} \| f^*(X) \|^2 ,

which furnishes the proof by letting m = M − 1 and substituting the value of θ_m.

Remark 4.1. Theorem 4.1 implies that to get a function f^M such that the error L(f^M) − L(f^*) ≤ ε, we need a number of iterations M ~ O(1/(Θ^2 √ε)). In contrast, standard gradient boosting machines, as proved in (Lu & Mazumder, 2018), require M ~ O(1/(Θ^2 ε)). This means that for small values of ε (i.e., for high accuracy), AGBM (Algorithm 2) can require far fewer weak learners than GBM (Algorithm 1).

5. Extensions and Variants

In this section we study two more practical variants of AGBM. First we see how to restart the algorithm to take advantage of strong convexity of the loss function. Then we study a straightforward approach to accelerated GBM, which we call vanilla accelerated gradient boosting machine (VAGBM), a variant of the recently proposed algorithm in (Biau et al., 2018), however without any theoretical guarantees.

5.1. Restart and Linear Convergence

It is more common to show a linear rate of convergence for GBM methods by additionally assuming that the loss function is µ-strongly convex (e.g. (Lu & Mazumder, 2018)). It is then relatively straightforward to recover an accelerated linear rate of convergence by restarting Algorithm 2.

Theorem 5.1. Consider Accelerated Gradient Boosting Machine with Restarts with Option 1 (Algorithm 3). Suppose that ℓ is σ-smooth and µ-strongly convex. If the step-size η ≤ 1/σ and the momentum parameter γ ≤ Θ^4/(4 + Θ^2), then for any p and optimal loss L(f^*),

    L(\tilde f^{p+1}) - L^{*} \le \frac{1}{2} \big( L(\tilde f^p) - L(f^*) \big) .

Proof. The loss function ℓ is µ-strongly convex, which implies that

    \frac{\mu}{2} \| f(X) - f^*(X) \|_2^2 \le L(f) - L(f^*) .

Substituting this in Theorem 4.1 gives us that

    L(f^M) - L(f^*) \le \frac{1}{\mu \eta \gamma (M+1)^2} \big( L(f^0) - L(f^*) \big) .

Recalling that f^0(x) = \tilde f^p(x), f^M(x) = \tilde f^{p+1}(x), and M^2 = 2/(\eta\mu\gamma) gives us the required statement.

The restart strategy in Option 1 requires knowledge of the strong-convexity constant µ. Alternatively, one can also use the adaptive restart strategy (Option 2), which is known to have good empirical performance (O'Donoghue & Candes, 2015).

Remark 5.1. Theorem 5.1 shows that M = O((1/Θ^2) √(σ/µ) log(1/ε)) weak learners are sufficient to obtain an error of ε using AGBMR (Algorithm 3). In contrast, standard GBM (Algorithm 1) requires M = O((1/Θ^2) (σ/µ) log(1/ε)) weak learners. Thus AGBMR is significantly better than GBM only if the condition number is large, i.e. σ/µ ≫ 1. When ℓ(y, f) is the least-squares loss (µ = σ = 1), we would see no advantage from acceleration. However, for more complicated loss functions with σ ≫ µ (e.g. logistic loss or exponential loss), AGBMR might result in an ensemble that is significantly better (e.g. obtaining lower training loss) than that of GBM for the same number of weak learners.

5.2. A Vanilla Accelerated Gradient Boosting Method

A natural question to ask is whether, instead of adding two learners at each iteration, we can get away with adding only one. Below we show what such an algorithm would look like and argue that it may not always converge.

Following the updates in Equation (5), we can get a direct acceleration of GBM by using the weak learner fitting the gradient. This leads to Algorithm 4.

Algorithm 4 is equivalent to the recently developed accelerated gradient boosting machine algorithm (Biau et al., 2018; Fouillen et al., 2018). Unfortunately, it may not always converge to the optimum, or may even diverge. This is because b_{τ_m} from Step (3) is only an approximate fit to r^m, meaning that we only take an approximate gradient descent step. While this is not an issue in the non-accelerated version, in Step (5) of Algorithm 4 the momentum term pushes the h sequence to take a large step along the approximate gradient direction. This exacerbates the effect of the approximate direction and can lead to an additive accumulation of error, as shown in (Devolder et al., 2014). In Section 6.1, we see that this is not just a theoretical concern, but that Algorithm 4 also diverges in practice in some situations.

Remark 5.2. Our corrected residual c^m in Algorithm 2 was crucial to the theoretical proof of convergence in Theorem 4.1. One extension could be to introduce γ ∈ (0, 1) in step (5) of Algorithm 4 just as in Algorithm 2.
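For concreteness, the following is a minimal Python sketch of the AGBM updates of Algorithm 2; it also serves as the inner loop of the restarted variant in Algorithm 3 below. It assumes the least-squares loss and scikit-learn regression trees as the weak learners; those choices, the helper name agbm_fit, and the init_pred parameter are ours for illustration, not the paper's released implementation.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def agbm_fit(X, y, n_rounds=50, eta=0.1, gamma=0.1, max_depth=3, init_pred=None):
    """Sketch of Algorithm 2 for the least-squares loss l(y,f) = 0.5*(y-f)^2, so r^m = y - g^m(X)."""
    n = X.shape[0]
    f_pred = np.zeros(n) if init_pred is None else init_pred.copy()   # f^0(X)
    h_pred = f_pred.copy()                                            # h^0(X) = f^0(X)
    c = np.zeros(n)        # corrected pseudo-residual c^m
    b_prev2 = np.zeros(n)  # b_{tau_{m-1,2}}(X), predictions of the previous momentum learner
    ensemble_f, ensemble_h = [], []
    for m in range(n_rounds):
        theta = 2.0 / (m + 2)
        g_pred = (1 - theta) * f_pred + theta * h_pred              # step (1)
        r = y - g_pred                                              # step (2): pseudo-residual
        b1 = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)   # step (3)
        f_pred = g_pred + eta * b1.predict(X)                       # step (4)
        # step (5): c^m = r^m (m = 0) or r^m + ((m+1)/(m+2)) * (c^{m-1} - b_{tau_{m-1,2}}(X))
        c = r if m == 0 else r + ((m + 1.0) / (m + 2.0)) * (c - b_prev2)
        b2 = DecisionTreeRegressor(max_depth=max_depth).fit(X, c)   # step (6)
        b_prev2 = b2.predict(X)
        h_pred = h_pred + (gamma * eta / theta) * b_prev2           # step (7)
        ensemble_f.append(b1)
        ensemble_h.append(b2)
    return ensemble_f, ensemble_h, f_pred

Note that predicting on new data requires replaying the same f/g/h recursions with the stored trees, since f^M is a particular linear combination of all 2M fitted learners.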

Algorithm 3 Accelerated Gradient Boosting Machine with Restart (AGBMR)
Input: Starting function f̃^0(x), step-size η, momentum parameter γ ∈ (0, 1], strong-convexity parameter µ.
For p = 0, ..., P − 1 do:
(1) Run AGBM (Algorithm 2) initialized with f^0(x) = f̃^p(x):
    Option 1: for M = √(2/(ηγµ)) iterations.
    Option 2: as long as L(f^{M+1}) ≤ L(f^M).
(2) Set f̃^{p+1}(x) = f^M(x).
Output: f̃^P(x).
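A sketch of the restart wrapper for Option 1 with a known strong-convexity constant µ, reusing the hypothetical agbm_fit helper from the previous listing (again an illustration under the same least-squares assumptions, not the paper's code):

import numpy as np

def agbmr_fit(X, y, eta=0.1, gamma=0.1, mu=1.0, n_restarts=5, max_depth=3):
    """Option 1 of Algorithm 3: run AGBM for M = ceil(sqrt(2/(eta*gamma*mu))) rounds per stage."""
    M = max(1, int(np.ceil(np.sqrt(2.0 / (eta * gamma * mu)))))
    f_pred = np.zeros(X.shape[0])   # \tilde f^0(X): start from the zero function
    stages = []
    for p in range(n_restarts):
        # each stage restarts the momentum: inside agbm_fit, h^0 is reset to f^0 = \tilde f^p
        ens_f, ens_h, f_pred = agbm_fit(X, y, n_rounds=M, eta=eta, gamma=gamma,
                                        max_depth=max_depth, init_pred=f_pred)
        stages.append((ens_f, ens_h))
    return stages, f_pred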

Algorithm 4 Vanilla Accelerated Gradient Boosting Machine (VAGBM)
Input. Starting function f^0(x) = 0, step-size η, momentum parameter γ ∈ (0, 1].
Initialization. h^0(x) = f^0(x), and sequence θ_m = 2/(m+2).
For m = 0, ..., M − 1 do:
Perform Updates:
(1) Compute a linear combination of f and h: g^m(x) = (1 − θ_m) f^m(x) + θ_m h^m(x).
(2) Compute pseudo-residual: r^m = [ − ∂ℓ(y_i, g^m(x_i)) / ∂g^m(x_i) ]_{i=1,...,n}.
(3) Find the best weak learner for the pseudo-residual: τ_m = arg min_{τ∈T} Σ_{i=1}^n (r_i^m − b_τ(x_i))^2.
(4) Update the model: f^{m+1}(x) = g^m(x) + η b_{τ_m}(x).
(5) Update the momentum model: h^{m+1}(x) = h^m(x) + (η/θ_m) b_{τ_m}(x).
Output. f^M(x).
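The only change relative to the AGBM sketch above is that the momentum model reuses the single learner fitted to r^m, so that learner's approximation error is amplified by the 1/θ_m factor. A minimal sketch of this inner loop, under the same illustrative assumptions as before (least-squares loss, scikit-learn trees; the function name is ours):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def vagbm_fit(X, y, n_rounds=50, eta=0.1, max_depth=3):
    """Sketch of Algorithm 4: one tree per round, shared by the f and h updates."""
    n = X.shape[0]
    f_pred = np.zeros(n)
    h_pred = np.zeros(n)
    trees = []
    for m in range(n_rounds):
        theta = 2.0 / (m + 2)
        g_pred = (1 - theta) * f_pred + theta * h_pred
        r = y - g_pred                                    # pseudo-residual for least squares
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
        b = tree.predict(X)
        f_pred = g_pred + eta * b                         # step (4)
        h_pred = h_pred + (eta / theta) * b               # step (5): the same approximate b
        trees.append(tree)
    return trees, f_pred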

6. Numerical Experiments

In this section, we present the results of computational experiments and discuss the performance of AGBM with trees as weak learners. Subsection 6.1 demonstrates that the algorithm described in Section 5.2 may diverge numerically; Subsection 6.2 shows training and testing performance for GBM and AGBM with different parameters; and Subsection 6.3 compares the performance of GBM and AGBM with best tuned parameters. The code for the numerical experiments will also be open-sourced.

Datasets: Table 2 summarizes the basic statistics of the LIBSVM datasets that were used.

Dataset      task            # samples    # features
a1a          classification  1605         123
w1a          classification  2477         300
diabetes     classification  768          8
german       classification  1000         24
housing      regression      506          13
eunite2001   regression      336          16

Table 2. Basic statistics of the (real) datasets used.

AGBM with CART trees: In our experiments, all algorithms use CART trees as the weak learners. For classification problems, we use the logistic loss function, and for regression problems, we use the least-squares loss. To reduce the computational cost, for each split and each feature, we consider 100 quantiles (instead of potentially all n values). These strategies are commonly used in implementations of GBM like (Chen & Guestrin, 2016; Ponomareva et al., 2017).

6.1. Evidence that VAGBM May Diverge

Figure 1 shows the training loss versus the number of trees for the housing dataset with step-sizes η = 1 and η = 0.3 for VAGBM and for AGBM with different parameters γ. The x-axis is the number of trees added to the ensemble (recall that our AGBM algorithm adds two trees to the ensemble per iteration, so the number of boosting iterations of VAGBM and AGBM is different). As we can see, when η is large, the training loss for VAGBM diverges very fast while our AGBM with a proper parameter γ converges. When η gets smaller, the training loss for VAGBM may decay faster than our AGBM at the beginning, but it gets stuck and never converges to the true optimal solution. Eventually the training loss of VAGBM may even diverge. On the other hand, our theory guarantees that AGBM always converges to the optimal solution.

Figure 1. Training loss versus number of trees for VAGBM (which doesn't converge) and AGBM with different parameters γ. (Two panels: η = 1 and η = 0.3; axes: training loss vs. number of trees.)

6.2. AGBM Sensitivity to the hyperparameters

In this section we look at how the two parameters η and γ affect the performance of AGBM. Figure 2 shows the training loss and the testing loss versus the number of trees for the a1a dataset with two different step-sizes, η = 4 and η = 0.1 (recall that AGBM adds two trees per iteration). When the step-size η is large (with the logistic loss, the largest step-size that guarantees convergence is η = 1/σ = 4), the training loss decays very fast, and the traditional GBM can converge even faster than our AGBM at the beginning. But the testing performance suffers, demonstrating that such fast convergence (due to the large step-size) can result in severe overfitting. In this case, our AGBM with a proper parameter γ has slightly better testing performance. When the step-size becomes smaller, the testing performance of all algorithms becomes more stable, though the training becomes slower. AGBM with a proper γ may require a smaller number of iterations/trees to reach good training/testing performance.

Figure 2. Training and testing loss versus number of trees for logistic regression on a1a. (Two panels: η = 4 and η = 0.1; rows: training loss and testing loss vs. number of trees.)

6.3. Experiments with Fine Tuning

In this section we look at the testing performance of GBM, VAGBM, and AGBM on the six datasets with hyperparameter tuning. We consider depth-5 trees as weak learners. We stop splitting early when the gain is smaller than 0.001 (roughly 1/n for these datasets). The hyper-parameters and the ranges we tuned are:

• step-size (η): 0.01, 0.03, 0.1, 0.3, 1 for the least-squares loss and 0.04, 0.12, 0.4, 1.2, 4 for the logistic loss;
• number of trees: 10, 11, ..., 100;
• momentum parameter γ (only for AGBM): 0.01, 0.02, 0.03, 0.05, 0.1, 0.2, 0.3, 0.5, 1.

For each dataset, we randomly choose 80% as the training dataset and the remainder is used as the final testing dataset. We use 5-fold cross-validation on the training dataset to tune the hyperparameters. Instead of going through all possible hyper-parameters, we utilize randomized search (RandomizedSearchCV in scikit-learn). As AGBM has more parameters (namely γ), we did proportionally more iterations of random search for AGBM. Table 3 presents the performance of GBM, VAGBM, and AGBM with the tuned parameters. As we can see, the accelerated methods (AGBM and VAGBM) in general require fewer iterations to get similar or slightly better testing performance than GBM. Compared with VAGBM, AGBM adds two trees per iteration, which can be more expensive, but the performance of AGBM can be more stable; for example, the testing error of VAGBM on the housing dataset is much larger than that of AGBM.

Dataset     Method   Training loss   Testing loss   # iterations   # trees
a1a         GBM      0.2187          0.3786         97             97
a1a         VAGBM    0.2454          0.3661         33             33
a1a         AGBM     0.1994          0.3730         33             66
w1a         GBM      0.0262          0.0578         84             84
w1a         VAGBM    0.0409          0.0578         32             32
w1a         AGBM     0.0339          0.0552         47             94
diabetes    GBM      0.297           0.462          87             87
diabetes    VAGBM    0.271           0.462          24             24
diabetes    AGBM     0.297           0.458          47             94
german      GBM      0.244           0.505          54             54
german      VAGBM    0.288           0.514          51             51
german      AGBM     0.305           0.485          35             70
housing     GBM      0.2152          4.6603         93             93
housing     VAGBM    0.5676          5.8090         73             73
housing     AGBM     0.215           4.5074         35             70
eunite2001  GBM      36.73           270.1          64             64
eunite2001  VAGBM    28.99           245.2          58             58
eunite2001  AGBM     26.74           245.4          24             48

Table 3. Performance after tuning hyper-parameters.

7. Conclusion

In this paper, we proposed a novel Accelerated Gradient Boosting Machine (AGBM), proved its rate of convergence, and introduced a computationally inexpensive practical variant of AGBM that takes advantage of strong convexity of the loss function and achieves linear convergence. Finally, we demonstrated with a number of numerical experiments the effectiveness of AGBM over conventional GBM in obtaining a model with good training and/or testing data fidelity.

References

Allen-Zhu, Z. and Orecchia, L. Linear coupling: An ultimate unification of gradient and mirror descent. arXiv preprint arXiv:1407.1537, 2014.

Biau, G., Cadre, B., and Rouvière, L. Accelerated gradient boosting. arXiv preprint arXiv:1803.02042, 2018.

Bickel, P. J., Ritov, Y., and Zakai, A. Some theory for generalized boosting algorithms. Journal of Machine Learning Research, 7(May):705–732, 2006.

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. Classification and Regression Trees. Wadsworth, 1984. ISBN 0-534-98053-8.

Bubeck, S., Lee, Y. T., and Singh, M. A geometric alternative to Nesterov's accelerated gradient descent. arXiv preprint arXiv:1506.08187, 2015.

Bühlmann, P., Hothorn, T., et al. Boosting algorithms: Regularization, prediction and model fitting. Statistical Science, 22(4):477–505, 2007.

Chen, T. and Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794. ACM, 2016.

Collins, M., Schapire, R. E., and Singer, Y. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 48(1-3):253–285, 2002.

Devolder, O., Glineur, F., and Nesterov, Y. First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 146(1-2):37–75, 2014.

Fouillen, E., Boyer, C., and Sangnier, M. Accelerated proximal boosting. arXiv preprint arXiv:1808.09670, 2018.

Freund, R. M., Grigas, P., Mazumder, R., et al. A new perspective on boosting in linear regression via subgradient optimization and relatives. The Annals of Statistics, 45(6):2328–2364, 2017.

Friedman, J., Hastie, T., and Tibshirani, R. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28:2000, 1998.

Friedman, J., Hastie, T., and Tibshirani, R. The elements of statistical learning, volume 1. Springer Series in Statistics, New York, NY, USA, 2001.

Friedman, J. H. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pp. 1189–1232, 2001.

Frostig, R., Ge, R., Kakade, S., and Sidford, A. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In International Conference on Machine Learning, 2015.

Hu, B. and Lessard, L. Dissipativity theory for Nesterov's accelerated method. arXiv preprint arXiv:1706.04381, 2017.

Karimireddy, S. P., Stich, S. U., and Jaggi, M. Global linear convergence of Newton's method without strong-convexity or Lipschitz gradients. arXiv preprint arXiv:1806.00413, 2018.

Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pp. 3146–3154, 2017.

Lin, H., Mairal, J., and Harchaoui, Z. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, 2015.

Locatello, F., Raj, A., Karimireddy, S. P., Rätsch, G., Schölkopf, B., Stich, S., and Jaggi, M. On matching pursuit and coordinate descent. In International Conference on Machine Learning, pp. 3204–3213, 2018.

Lu, H. and Mazumder, R. Randomized gradient boosting machine. arXiv preprint arXiv:1810.10158, 2018.

Lu, H., Freund, R. M., and Mirrokni, V. Accelerating greedy coordinate descent methods. arXiv preprint arXiv:1806.02476, 2018.

Mason, L., Baxter, J., Bartlett, P. L., and Frean, M. R. Boosting algorithms as gradient descent. In Advances in Neural Information Processing Systems, pp. 512–518, 2000.

Nesterov, Y. A method of solving a convex programming problem with convergence rate O(1/k^2). In Soviet Mathematics Doklady, volume 27, pp. 372–376, 1983.

Nesterov, Y. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2004.

Nesterov, Y. Accelerating the cubic regularization of Newton's method on convex problems. Mathematical Programming, 112(1):159–181, 2008.

Nesterov, Y. and Polyak, B. T. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.

O'Donoghue, B. and Candes, E. Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, 15(3):715–732, 2015.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.

Ponomareva, N., Radpour, S., Hendry, G., Haykal, S., Colthurst, T., Mitrichev, P., and Grushetsky, A. TF Boosted Trees: A scalable TensorFlow based framework for gradient boosting. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 423–427. Springer, 2017.

Ridgeway, G., Southworth, M. H., and RUnit, S. Package 'gbm'. Viitattu, 10(2013):40, 2013.

Sigrist, F. Gradient and Newton boosting for classification and regression. arXiv preprint arXiv:1808.03064, 2018.

Su, W., Boyd, S., and Candes, E. J. A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. Journal of Machine Learning Research, 17(153):1–43, 2016.

Sun, P., Zhang, T., and Zhou, J. A convergence rate analysis for LogitBoost, MART and their variant. In ICML, pp. 1251–1259, 2014.

Telgarsky, M. A primal-dual convergence analysis of boosting. Journal of Machine Learning Research, 13(Mar):561–606, 2012.

Tseng, P. On accelerated proximal gradient methods for convex–concave optimization. Preprint, 2008.

Wilson, A. C., Recht, B., and Jordan, M. I. A Lyapunov analysis of momentum methods in optimization. arXiv preprint arXiv:1611.02635, 2016.

Zhang, T. and Yu, B. Boosting with early stopping: Convergence and consistency. The Annals of Statistics, 33(4):1538–1579, 2005.

A. Additional Discussions

Below we include some additional discussions which could not fit into the main paper but which nevertheless help to understand the relevance of our results when applied to frameworks typically used in practice.

A.1. Line search in Boosting

Traditionally, the analysis of gradient boosting methods has focused on algorithms which use a line search to select the step-size η (e.g. Algorithm 1). The analysis of gradient descent suggests that this is not necessary: using a fixed step-size of 1/σ, where ℓ is σ-smooth, is sufficient (Lu & Mazumder, 2018). Our accelerated Algorithm 2 also adopts this fixed step-size strategy. In fact, even the standard boosting libraries (XGBoost and TFBT) typically use a fixed (but tuned) step-size and avoid an expensive line search.
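As an illustration of the two choices (our own sketch, assuming the least-squares loss and a generic weak-learner prediction vector b_pred on the training points), the per-iteration step-size can either be found by a one-dimensional line search or simply fixed at 1/σ:

import numpy as np
from scipy.optimize import minimize_scalar

def line_search_step(y, f_pred, b_pred):
    """Step (3) of Algorithm 1: eta = argmin_eta sum_i l(y_i, f(x_i) + eta * b(x_i))."""
    def loss(eta):
        return 0.5 * np.sum((y - (f_pred + eta * b_pred)) ** 2)
    return minimize_scalar(loss, bounds=(0.0, 10.0), method="bounded").x

def fixed_step(sigma=1.0):
    """Fixed step-size eta = 1/sigma used by Algorithm 2 (sigma = 1 for least squares)."""
    return 1.0 / sigma

For the least-squares loss the line search even has a closed form, η = ⟨r, b(X)⟩ / ‖b(X)‖^2, but the point of the fixed-step analysis is that neither is required for the convergence guarantees.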

A.2. Use of Hessian

Popular boosting libraries such as XGBoost (Chen & Guestrin, 2016) and TFBT (Ponomareva et al., 2017) compute the Hessian and perform a Newton boosting step instead of a gradient boosting step. Since the Newton step may not be well defined (e.g. if the Hessian is degenerate), an additional Euclidean regularizer is also added. This has been shown to improve performance and reduce the need for a line search for the η parameter sequence (Sun et al., 2014; Sigrist, 2018). For LogitBoost (i.e. when ℓ is the logistic loss), (Sun et al., 2014) demonstrate that a trust-region Newton method can indeed significantly improve the convergence. Leveraging similar results in second-order methods for convex optimization (e.g. (Nesterov & Polyak, 2006; Karimireddy et al., 2018)) and adapting accelerated second-order methods (Nesterov, 2008) would be an interesting direction for future work.
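To make the distinction concrete, the sketch below shows a single second-order (Newton-style) boosting step for the logistic loss with binary labels in {−1, +1}: the tree is fit by weighted least squares to the ratio of negative gradient to regularized Hessian. This is a generic illustration of Newton boosting under our own assumptions, not the exact update rule of any particular library.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def newton_boost_step(X, y, f_pred, lam=1.0, max_depth=3):
    """One Newton boosting step for the logistic loss l(y, f) = log(1 + exp(-y f)), y in {-1, +1}.

    g_i and h_i are the first and second derivatives of l at the current predictions;
    the weak learner is fit by weighted least squares to -g/(h + lam), mimicking a
    regularized Newton step.
    """
    p = 1.0 / (1.0 + np.exp(y * f_pred))      # p_i = sigmoid(-y_i * f_i)
    g = -y * p                                # first derivative of l w.r.t. f
    h = p * (1.0 - p)                         # second derivative, always in (0, 1/4]
    target = -g / (h + lam)
    tree = DecisionTreeRegressor(max_depth=max_depth)
    tree.fit(X, target, sample_weight=h + lam)
    return tree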

A.3. Out-of-sample Performance

Throughout this work we focus only on minimizing the empirical training loss L(f) (see Formula (2)). In reality, what we really care about is the out-of-sample error of the resulting ensemble f^M(x). A number of regularization tricks, such as i) early stopping (Zhang & Yu, 2005), ii) pruning (Chen & Guestrin, 2016; Ponomareva et al., 2017), iii) smaller step-sizes (Ponomareva et al., 2017), and iv) dropout (Ponomareva et al., 2017), are usually employed in practice to prevent over-fitting and improve generalization. Since AGBM requires far fewer iterations than GBM to achieve the same training loss, it outputs a much sparser set of learners. We believe this is partially the reason for its better out-of-sample performance. However, a joint theoretical study of the out-of-sample error along with the empirical error L(f) is much needed. It would also shed light on the effectiveness of the numerous ad-hoc regularization techniques currently employed.

B. Proof of Theorem 4.1

This section proves the major theoretical result of the paper:

Theorem 4.1. Consider Accelerated Gradient Boosting Machine (Algorithm 2). Suppose ℓ is σ-smooth, the step-size η ≤ 1/σ and the momentum parameter γ ≤ Θ^4/(4 + Θ^2). Then for all M ≥ 0, we have:

    L(f^M) - L(f^*) \le \frac{1}{2\eta\gamma (M+1)^2} \, \| f^*(X) \|_2^2 .

Let us start with some new notation. Define the scalar constants s := γ/Θ^2 and t := (1 − s)/2 ∈ (0, 1). We mostly only need s + t ≤ 1; the specific values of γ and t are needed only in Lemma B.6. Then define

    \alpha_m := \frac{\eta\gamma}{\theta_m} = \frac{\eta s \Theta^2}{\theta_m} ,

so that the definitions of the sequences {r^m}, {c^m}, ĥ^m(X) and {θ_m} from Algorithm 2 can be written as:

    \theta_m = \frac{2}{m+2}
    r^m = \Big[ - \frac{\partial \ell(y_i, g^m(x_i))}{\partial g^m(x_i)} \Big]_{i=1,\dots,n}
    c^m = r^m + (\alpha_{m-1}/\alpha_m) \big( c^{m-1} - b_{\tau_{m-1,2}}(X) \big)
    \hat h^{m+1}(X) = \hat h^m(X) + \alpha_m r^m .

The sequence ĥ^m(X) is in fact closely tied to the sequence h^m(X), as we show in the next lemma. For notational convenience, we define c^{-1} = b_{\tau_{-1,2}}(X) = 0 and similarly \alpha_{-1}/\theta_{-1} = 0 throughout the proof.

Lemma B.1.

    \hat h^{m+1}(X) = h^{m+1}(X) + \alpha_m \big( c^m - b_{\tau_{m,2}}(X) \big) .

Proof. Observe that

    \hat h^{m+1}(X) = \sum_{j=0}^{m} \alpha_j r^j \qquad \text{and} \qquad h^{m+1}(X) = \sum_{j=0}^{m} \alpha_j b_{\tau_{j,2}}(X) .

Then we have

    \hat h^{m+1}(X) - h^{m+1}(X) = \sum_{j=0}^{m} \alpha_j \big( r^j - b_{\tau_{j,2}}(X) \big)
    = \sum_{j=0}^{m} \alpha_j \Big( r^j - \frac{\alpha_{j-1}}{\alpha_j} b_{\tau_{j-1,2}}(X) \Big) - \alpha_m b_{\tau_{m,2}}(X)
    = \sum_{j=0}^{m} \alpha_j \Big( c^j - \frac{\alpha_{j-1}}{\alpha_j} c^{j-1} \Big) - \alpha_m b_{\tau_{m,2}}(X)
    = \sum_{j=0}^{m} \big( \alpha_j c^j - \alpha_{j-1} c^{j-1} \big) - \alpha_m b_{\tau_{m,2}}(X)
    = \alpha_m \big( c^m - b_{\tau_{m,2}}(X) \big) ,

where the third equality is due to the definition of c^m.

Lemma B.2 presents the fact that there is sufficient decay of the loss function:

Lemma B.2.

    L(f^{m+1}) \le L(g^m) - \frac{\eta\Theta^2}{2} \| r^m \|^2 .

Proof. Recall that τ_{m,1} is chosen such that

    \tau_{m,1} = \arg\min_{\tau \in T} \| b_\tau(X) - r^m \|^2 .

Since the class of learners B is scalable (Assumption 2.1), we have

    \| b_{\tau_{m,1}}(X) - r^m \|^2 = \min_{\tau \in T} \min_{\sigma \in \mathbb{R}} \| \sigma b_\tau(X) - r^m \|^2
    = \| r^m \|^2 \Big( 1 - \max_{\tau \in T} \cos\big( r^m, b_\tau(X) \big)^2 \Big)
    \le \| r^m \|^2 (1 - \Theta^2) ,    (8)

where the last inequality is because of the definition of Θ, and the second equality is due to the simple fact that for any two vectors a and b,

    \min_{\sigma \in \mathbb{R}} \| \sigma a - b \|^2 = \| b \|^2 - \max_{\sigma \in \mathbb{R}} \big( 2\sigma \langle a, b \rangle - \sigma^2 \| a \|^2 \big) = \| b \|^2 - \frac{\langle a, b \rangle^2}{\| a \|^2} = \| b \|^2 \Big( 1 - \frac{\langle a, b \rangle^2}{\| a \|^2 \| b \|^2} \Big) .

Now recall that L(f^{m+1}) = \sum_{i=1}^{n} \ell(y_i, f^{m+1}(x_i)) and that f^{m+1}(x) = g^m(x) + \eta b_{\tau_{m,1}}(x). Since the loss function ℓ(y_i, ·) is σ-smooth and the step-size satisfies η ≤ 1/σ, it holds that

    L(f^{m+1}) = \sum_{i=1}^{n} \ell\big( y_i, f^{m+1}(x_i) \big)
    = \sum_{i=1}^{n} \ell\big( y_i, g^m(x_i) + \eta b_{\tau_{m,1}}(x_i) \big)
    \le \sum_{i=1}^{n} \Big[ \ell\big( y_i, g^m(x_i) \big) + \frac{\partial \ell(y_i, g^m(x_i))}{\partial g^m(x_i)} \big( \eta b_{\tau_{m,1}}(x_i) \big) + \frac{\sigma}{2} \big( \eta b_{\tau_{m,1}}(x_i) \big)^2 \Big]
    \le \sum_{i=1}^{n} \Big[ \ell\big( y_i, g^m(x_i) \big) + \frac{\partial \ell(y_i, g^m(x_i))}{\partial g^m(x_i)} \big( \eta b_{\tau_{m,1}}(x_i) \big) + \frac{\eta}{2} \big( b_{\tau_{m,1}}(x_i) \big)^2 \Big]
    = \sum_{i=1}^{n} \Big[ \ell\big( y_i, g^m(x_i) \big) - r_i^m \big( \eta b_{\tau_{m,1}}(x_i) \big) + \frac{\eta}{2} \big( b_{\tau_{m,1}}(x_i) \big)^2 \Big]
    = L(g^m) - \eta \big\langle r^m, b_{\tau_{m,1}}(X) \big\rangle + \frac{\eta}{2} \| b_{\tau_{m,1}}(X) \|^2
    = L(g^m) + \frac{\eta}{2} \| b_{\tau_{m,1}}(X) - r^m \|^2 - \frac{\eta}{2} \| r^m \|^2
    \le L(g^m) - \frac{\Theta^2 \eta}{2} \| r^m \|^2 ,

where the final inequality follows from (8). This furnishes the proof of the lemma.

Lemma B.3 is a basic fact about convex functions, and it is commonly used in the convergence analysis of accelerated methods.

Lemma B.3. For any function f and m ≥ 0,

    L(g^m) + \theta_m \big\langle r^m, h^m(X) - f(X) \big\rangle \le \theta_m L(f) + (1 - \theta_m) L(f^m) .

Proof. For any function f, it follows from the convexity of the loss function ℓ that

    L(g^m) + \big\langle r^m, g^m(X) - f(X) \big\rangle = \sum_{i=1}^{n} \Big[ \ell\big( y_i, g^m(x_i) \big) + \frac{\partial \ell(y_i, g^m(x_i))}{\partial g^m(x_i)} \big( f(x_i) - g^m(x_i) \big) \Big] \le \sum_{i=1}^{n} \ell\big( y_i, f(x_i) \big) = L(f) .    (9)

Substituting f = f^m in (9), we get

    L(g^m) + \big\langle r^m, g^m(X) - f^m(X) \big\rangle \le L(f^m) .    (10)

Also recall that g^m(X) = (1 − θ_m) f^m(X) + θ_m h^m(X). This can be rewritten as

    \theta_m \big( g^m(X) - h^m(X) \big) = (1 - \theta_m) \big( f^m(X) - g^m(X) \big) .    (11)

Putting (9), (10), and (11) together:

    L(g^m) + \theta_m \big\langle r^m, h^m(X) - f(X) \big\rangle
    = L(g^m) + \theta_m \big\langle r^m, g^m(X) - f(X) \big\rangle + \theta_m \big\langle r^m, h^m(X) - g^m(X) \big\rangle
    = \theta_m \Big[ L(g^m) + \big\langle r^m, g^m(X) - f(X) \big\rangle \Big] + (1 - \theta_m) \Big[ L(g^m) + \big\langle r^m, g^m(X) - f^m(X) \big\rangle \Big]
    \le \theta_m L(f) + (1 - \theta_m) L(f^m) ,

which furnishes the proof.

We are ready to prove the key lemma which gives us the accelerated rate of convergence. Lemma B.4. Define the following potential function V (f) for any given output function f:

    V^m(f) = \frac{\alpha_{m-1}}{\theta_{m-1}} \big( L(f^m) - L(f) \big) + \frac{1}{2} \big\| f(X) - \hat h^m(X) \big\|^2 .    (12)

At every step, the potential decreases at least by δm:

    V^{m+1}(f) \le V^m(f) + \delta_m ,

where δ_m is defined as:

    \delta_m := \frac{s \alpha_{m-1}^2}{2t} \big\| c^{m-1} - b_{\tau_{m-1,2}}(X) \big\|^2 - (1 - s - t) \frac{\alpha_m^2}{2s} \| r^m \|^2 .    (13)

Proof. Recall that c^{-1} = b_{\tau_{-1,2}}(X) = 0 and \alpha_{-1}/\theta_{-1} = 0. It follows from Lemma B.2 that:

    L(f^{m+1}) - L(g^m) + \frac{(1-s)\eta\Theta^2}{2} \| r^m \|^2
    \le - \frac{s\eta\Theta^2}{2} \| r^m \|^2
    = - \alpha_m \theta_m \| r^m \|^2 + \frac{\alpha_m \theta_m}{2} \| r^m \|^2
    = \theta_m \big\langle r^m, \hat h^m(X) - \hat h^{m+1}(X) \big\rangle + \frac{\theta_m}{2\alpha_m} \big\| \hat h^m(X) - \hat h^{m+1}(X) \big\|^2
    = \theta_m \big\langle r^m, \hat h^m(X) - f(X) \big\rangle + \frac{\theta_m}{2\alpha_m} \Big( \big\| f(X) - \hat h^m(X) \big\|^2 - \big\| f(X) - \hat h^{m+1}(X) \big\|^2 \Big) ,

where the second equality is by the definition of ĥ^m(X) and the third is just a mathematical manipulation of the equation (it is also called the three-point property). By rearranging the above inequality, we have

    L(f^{m+1}) + \frac{(1-s)\eta\Theta^2}{2} \| r^m \|^2
    \le L(g^m) + \theta_m \big\langle r^m, \hat h^m(X) - f(X) \big\rangle + \frac{\theta_m}{2\alpha_m} \Big( \big\| f(X) - \hat h^m(X) \big\|^2 - \big\| f(X) - \hat h^{m+1}(X) \big\|^2 \Big)
    = L(g^m) + \theta_m \big\langle r^m, h^m(X) - f(X) \big\rangle + \frac{\theta_m}{2\alpha_m} \Big( \big\| f(X) - \hat h^m(X) \big\|^2 - \big\| f(X) - \hat h^{m+1}(X) \big\|^2 \Big) + \theta_m \big\langle r^m, \hat h^m(X) - h^m(X) \big\rangle
    \le \theta_m L(f) + (1 - \theta_m) L(f^m) + \frac{\theta_m}{2\alpha_m} \Big( \big\| f(X) - \hat h^m(X) \big\|^2 - \big\| f(X) - \hat h^{m+1}(X) \big\|^2 \Big) + \theta_m \alpha_{m-1} \big\langle r^m, c^{m-1} - b_{\tau_{m-1,2}}(X) \big\rangle ,

where the first inequality uses Lemma B.3 and the last inequality is due to the fact that ĥ^m(X) − h^m(X) = α_{m−1}( c^{m−1} − b_{τ_{m−1,2}}(X) ) from Lemma B.1. Rearranging the terms and multiplying by (α_m/θ_m) leads to

    \frac{\alpha_m}{\theta_m} \big( L(f^{m+1}) - L(f) \big) + \frac{1}{2} \big\| f(X) - \hat h^{m+1}(X) \big\|^2
    \le \underbrace{\frac{\alpha_m(1-\theta_m)}{\theta_m}}_{:=A} \big( L(f^m) - L(f) \big) + \frac{1}{2} \big\| f(X) - \hat h^m(X) \big\|^2 + \underbrace{\alpha_m \alpha_{m-1} \big\langle r^m, c^{m-1} - b_{\tau_{m-1,2}}(X) \big\rangle - \frac{(1-s)\eta\Theta^2 \alpha_m}{2\theta_m} \| r^m \|^2}_{:=B} .

Let us examine first the term A:

    \frac{\alpha_m(1-\theta_m)}{\theta_m} = (\eta\Theta^2 s) \frac{1-\theta_m}{\theta_m^2} \le (\eta\Theta^2 s) \frac{1}{\theta_{m-1}^2} = \frac{\alpha_{m-1}}{\theta_{m-1}} .

We have thus far shown that

    V^{m+1}(f) \le V^m(f) + B ,

and we now need to show that B ≤ δ_m. Using the mean-value inequality, the first term in B can be bounded as

    \alpha_m \alpha_{m-1} \big\langle r^m, c^{m-1} - b_{\tau_{m-1,2}}(X) \big\rangle \le \frac{\alpha_m^2 t}{2s} \| r^m \|^2 + \frac{\alpha_{m-1}^2 s}{2t} \big\| c^{m-1} - b_{\tau_{m-1,2}}(X) \big\|^2 .

Substituting this in B shows:

    B = \alpha_m \alpha_{m-1} \big\langle r^m, c^{m-1} - b_{\tau_{m-1,2}}(X) \big\rangle - \frac{(1-s)\eta\Theta^2 \alpha_m}{2\theta_m} \| r^m \|^2
    \le \frac{\alpha_m^2 t}{2s} \| r^m \|^2 + \frac{\alpha_{m-1}^2 s}{2t} \big\| c^{m-1} - b_{\tau_{m-1,2}}(X) \big\|^2 - \frac{(1-s)\alpha_m^2}{2s} \| r^m \|^2
    = \frac{\alpha_{m-1}^2 s}{2t} \big\| c^{m-1} - b_{\tau_{m-1,2}}(X) \big\|^2 - (1 - s - t) \frac{\alpha_m^2}{2s} \| r^m \|^2
    = \delta_m ,

which finishes the proof.

Unlike the typical proofs of accelerated algorithms, which usually show that the potential V^m(f) is a decreasing sequence, there is no guarantee that the potential V^m(f) is decreasing in the boosting setting, due to the use of weak learners. Instead, we are able to prove that:

Lemma B.5. For any given m, it holds that \sum_{j=0}^{m} \delta_j \le 0.

Proof. We can rewrite the statement of the lemma as:

    \sum_{j=0}^{m-1} \alpha_j^2 \big\| c^j - b_{\tau_{j,2}}(X) \big\|^2 \le \frac{t(1-s-t)}{s^2} \sum_{j=0}^{m} \alpha_j^2 \| r^j \|^2 .    (14)

Here, let us focus on the term ‖c^{j+1} − b_{τ_{j+1,2}}(X)‖^2 for a given j. We have that

    \big\| c^{j+1} - b_{\tau_{j+1,2}}(X) \big\|^2 \le (1 - \Theta^2) \big\| c^{j+1} \big\|^2
    = (1 - \Theta^2) \Big\| r^{j+1} + \frac{\theta_{j+1}}{\theta_j} \big( c^j - b_{\tau_{j,2}}(X) \big) \Big\|^2
    \le (1 - \Theta^2)(1 + \rho) \big\| r^{j+1} \big\|^2 + (1 - \Theta^2)(1 + 1/\rho) \Big\| \frac{\theta_{j+1}}{\theta_j} \big( c^j - b_{\tau_{j,2}}(X) \big) \Big\|^2
    \le (1 + \rho)(1 - \Theta^2) \big\| r^{j+1} \big\|^2 + (1 - \Theta^2)(1 + 1/\rho) \big\| c^j - b_{\tau_{j,2}}(X) \big\|^2 ,

where the first inequality follows from our assumption about the density of the weak-learner class B (the same argument as in (8)), the second inequality holds for any ρ ≥ 0 due to the mean-value inequality, and the last inequality is from θ_{j+1} ≤ θ_j. This gives a recursive bound on the summands on the left side of (14). From this, (14) follows from an elementary fact about recursive sequences, as stated in Lemma B.6, applied with a_j = \alpha_j^2 \| c^j - b_{\tau_{j,2}}(X) \|^2 and c_j = \alpha_j^2 \| r^j \|^2.

Remark B.1. If c^m = b_{\tau_{m,2}}(X) (i.e. our class of learners B is strong), then \delta_m = -(1 - s - t) \frac{\alpha_m^2}{2s} \| r^m \|^2 \le 0.

Lemma B.6 is an elementary fact about recursive sequences used in the proof of Lemma B.5.

Lemma B.6. Given two sequences {aj ≥ 0} and {cj ≥ 0} such that the following holds for any ρ ≥ 0,

    a_{j+1} \le (1 - \Theta^2) \big[ (1 + 1/\rho)\, a_j + (1 + \rho)\, c_{j+1} \big] ,

then the sum of the terms a_j can be bounded as

    \sum_{j=0}^{m} a_j \le \frac{t(1-s-t)}{s^2} \sum_{j=0}^{m} c_j .

Proof. The recursive bound on a_j implies that

    a_j \le (1 - \Theta^2) \big[ (1 + 1/\rho)\, a_{j-1} + (1 + \rho)\, c_j \big]
    \le \sum_{k=0}^{j} \big[ (1 + 1/\rho)(1 - \Theta^2) \big]^{j-k} (1 + \rho)(1 - \Theta^2)\, c_k .

Summing both terms gives

    \sum_{j=0}^{m} a_j \le \sum_{j=0}^{m} \sum_{k=0}^{j} \big[ (1 + 1/\rho)(1 - \Theta^2) \big]^{j-k} (1 + \rho)(1 - \Theta^2)\, c_k
    = \sum_{k=0}^{m} \sum_{j=k}^{m} \big[ (1 + 1/\rho)(1 - \Theta^2) \big]^{j-k} (1 + \rho)(1 - \Theta^2)\, c_k
    \le \sum_{k=0}^{m} \Big( \sum_{j=0}^{\infty} \big[ (1 + 1/\rho)(1 - \Theta^2) \big]^{j} \Big) (1 + \rho)(1 - \Theta^2)\, c_k
    = \frac{(1 + \rho)(1 - \Theta^2)}{1 - (1 + 1/\rho)(1 - \Theta^2)} \sum_{k=0}^{m} c_k
    = \frac{(1 + \rho)(1 - \Theta^2)}{\Theta^2 - (1 - \Theta^2)/\rho} \sum_{k=0}^{m} c_k
    = \frac{2(1 + \rho)(1 - \Theta^2)}{\Theta^2} \sum_{k=0}^{m} c_k
    = \frac{2(2 - \Theta^2)(1 - \Theta^2)}{\Theta^4} \sum_{k=0}^{m} c_k ,

where in the last two equalities we chose \rho = \frac{2(1 - \Theta^2)}{\Theta^2}. Now recall that s \le \frac{\Theta^2}{4 + \Theta^2} \in (0, 1) and that t = (1 - s)/2:

    \sum_{j=0}^{m} a_j \le \frac{2(2 - \Theta^2)(1 - \Theta^2)}{\Theta^4} \sum_{k=0}^{m} c_k
    \le \frac{4}{\Theta^4} \sum_{k=0}^{m} c_k
    = \Big( \frac{4 + \Theta^2}{\Theta^2} - 1 \Big)^2 \frac{1}{4} \sum_{k=0}^{m} c_k
    \le \Big( \frac{1}{s} - 1 \Big)^2 \frac{1}{4} \sum_{k=0}^{m} c_k
    = \frac{(1 - s)^2}{4 s^2} \sum_{k=0}^{m} c_k
    = \frac{t(1 - s - t)}{s^2} \sum_{k=0}^{m} c_k .

Lemma B.4 and Lemma B.5 directly result in our major theorem.

Proof of Theorem 4.1. It follows from Lemma B.4 and Lemma B.5 that

    V^M(f^*) \le V^{M-1}(f^*) + \delta_{M-1} \le V^0(f^*) + \sum_{j=0}^{M-1} \delta_j \le \frac{1}{2} \big\| f^0(X) - f^*(X) \big\|^2 .

Notice that V^M(f^*) \ge \frac{\alpha_{M-1}}{\theta_{M-1}} \big( L(f^M) - L(f^*) \big), as the term \frac{1}{2} \| f^*(X) - \hat h^M(X) \|^2 \ge 0, which implies that

    L(f^M) - L(f^*) \le \frac{\theta_{M-1}}{2\alpha_{M-1}} \big\| f^0(X) - f^*(X) \big\|^2 = \frac{\theta_{M-1}^2}{2\gamma\eta} \big\| f^0(X) - f^*(X) \big\|^2 .