
XGBoost: A Scalable Tree Boosting System

Tianqi Chen, University of Washington, [email protected]
Carlos Guestrin, University of Washington, [email protected]

ABSTRACT
Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and a weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.

Keywords
Large-scale Machine Learning

1. INTRODUCTION
Machine learning and data-driven approaches are becoming very important in many areas. Smart spam classifiers protect our email by learning from massive amounts of spam data and user feedback; advertising systems learn to match the right ads with the right context; fraud detection systems protect banks from malicious attackers; anomaly event detection systems help experimental physicists to find events that lead to new physics. There are two important factors that drive these successful applications: usage of effective (statistical) models that capture the complex data dependencies, and scalable learning systems that learn the model of interest from large datasets.

Among the machine learning methods used in practice, gradient tree boosting [10] (also known as gradient boosting machine, GBM, or gradient boosted regression tree, GBRT) is one technique that shines in many applications. Tree boosting has been shown to give state-of-the-art results on many standard classification benchmarks [16]. LambdaMART [5], a variant of tree boosting for ranking, achieves state-of-the-art results for ranking problems. Besides being used as a stand-alone predictor, it is also incorporated into real-world production pipelines for ad click-through rate prediction [15]. Finally, it is the de-facto choice of ensemble method and is used in challenges such as the Netflix prize [3].

In this paper, we describe XGBoost, a scalable machine learning system for tree boosting. The system is available as an open source package (https://github.com/dmlc/xgboost). The impact of the system has been widely recognized in a number of machine learning and data mining challenges. Take the challenges hosted by the machine learning competition site Kaggle for example. Among the 29 challenge winning solutions published at Kaggle's blog during 2015 (solutions from the top-3 teams of each competition), 17 used XGBoost. Among these solutions, eight solely used XGBoost to train the model, while most others combined XGBoost with neural nets in ensembles. For comparison, the second most popular method, deep neural nets, was used in 11 solutions. The success of the system was also witnessed in KDDCup 2015, where XGBoost was used by every winning team in the top 10. Moreover, the winning teams reported that ensemble methods outperform a well-configured XGBoost by only a small amount [1].
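As a brief illustration of how the open-source package above is typically invoked, the following is a minimal usage sketch with its Python interface; the toy data and parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np
import xgboost as xgb

# Hypothetical toy data: 1000 examples, 20 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# DMatrix is XGBoost's data container for training.
dtrain = xgb.DMatrix(X, label=y)

# Illustrative parameters: tree depth, learning rate (eta), and the
# regularization terms gamma and lambda from the objective in Sec. 2.1.
params = {
    "max_depth": 6,
    "eta": 0.3,
    "gamma": 1.0,
    "lambda": 1.0,
    "objective": "binary:logistic",
}

# Train an ensemble of 50 boosted trees and predict on the training data.
bst = xgb.train(params, dtrain, num_boost_round=50)
preds = bst.predict(dtrain)
```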
These results demonstrate that our system gives state-of-the-art results on a wide range of problems. Examples of the problems in these winning solutions include: store sales prediction; high energy physics event classification; web text classification; customer behavior prediction; motion detection; ad click-through rate prediction; malware classification; product categorization; hazard risk prediction; massive online course dropout rate prediction. While domain-dependent data analysis and feature engineering play an important role in these solutions, the fact that XGBoost is the consensus choice of learner shows the impact and importance of our system and tree boosting.

The most important factor behind the success of XGBoost is its scalability in all scenarios. The system runs more than ten times faster than existing popular solutions on a single machine and scales to billions of examples in distributed or memory-limited settings. The scalability of XGBoost is due to several important systems and algorithmic optimizations. These innovations include a novel tree learning algorithm for handling sparse data and a theoretically justified weighted quantile sketch procedure that enables handling instance weights in approximate tree learning. Parallel and distributed computing makes learning faster, which enables quicker model exploration. More importantly, XGBoost exploits out-of-core computation and enables data scientists to process hundreds of millions of examples on a desktop. Finally, it is even more exciting to combine these techniques to make an end-to-end system that scales to even larger data with the least amount of cluster resources. The major contributions of this paper are listed as follows:

• We design and build a highly scalable end-to-end tree boosting system.
• We propose a theoretically justified weighted quantile sketch for efficient proposal calculation.
• We introduce a novel sparsity-aware algorithm for parallel tree learning.
• We propose an effective cache-aware block structure for out-of-core tree learning.

While there are some existing works on parallel tree boosting [22, 23, 19], directions such as out-of-core computation, cache-aware learning and sparsity-aware learning have not been explored. More importantly, an end-to-end system that combines all of these aspects gives a novel solution for real-world use-cases. This enables data scientists as well as researchers to build powerful variants of tree boosting algorithms [7, 8]. Besides these major contributions, we also make additional improvements in proposing a regularized learning objective, which we include for completeness.

The remainder of the paper is organized as follows. We first review tree boosting and introduce a regularized objective in Sec. 2. We then describe the split finding methods in Sec. 3 as well as the system design in Sec. 4, including experimental results when relevant to provide quantitative support for each optimization we describe. Related work is discussed in Sec. 5. Detailed end-to-end evaluations are included in Sec. 6. Finally, we conclude the paper in Sec. 7.

2. TREE BOOSTING IN A NUTSHELL
We review gradient tree boosting algorithms in this section. The derivation follows the same idea as the existing literature on gradient boosting. Specifically, the second-order method originates from Friedman et al. [12]. We make minor improvements to the regularized objective, which were found helpful in practice.

2.1 Regularized Learning Objective
For a given data set with $n$ examples and $m$ features, $D = \{(x_i, y_i)\}$ $(|D| = n,\; x_i \in \mathbb{R}^m,\; y_i \in \mathbb{R})$, a tree ensemble model (shown in Fig. 1) uses $K$ additive functions to predict the output:

$$\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}, \qquad (1)$$

where $\mathcal{F} = \{f(x) = w_{q(x)}\}$ $(q : \mathbb{R}^m \to T,\; w \in \mathbb{R}^T)$ is the space of regression trees (also known as CART). Here $q$ represents the structure of each tree that maps an example to the corresponding leaf index, $T$ is the number of leaves, and $w$ gives the leaf scores. For a given example, we use the decision rules in the trees (given by $q$) to classify it into the leaves and calculate the final prediction by summing up the scores in the corresponding leaves (given by $w$).

Figure 1: Tree Ensemble Model. The final prediction for a given example is the sum of predictions from each tree.

To learn the set of functions used in the model, we minimize the following regularized objective:

$$\mathcal{L}(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k), \quad \text{where} \quad \Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^2. \qquad (2)$$

Here $l$ is a differentiable convex loss function that measures the difference between the prediction $\hat{y}_i$ and the target $y_i$. The second term $\Omega$ penalizes the complexity of the model (i.e., the regression tree functions). The additional regularization term helps to smooth the final learnt weights to avoid over-fitting. Intuitively, the regularized objective tends to select a model employing simple and predictive functions. A similar regularization technique has been used in the Regularized Greedy Forest (RGF) [25] model. Our objective and the corresponding learning algorithm are simpler than RGF and easier to parallelize. When the regularization parameter is set to zero, the objective falls back to traditional gradient tree boosting.
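To make Eqs. (1) and (2) concrete, the following is a minimal sketch that evaluates the ensemble prediction and the regularized objective for a toy ensemble of single-split trees; the tree representation and the squared-error loss are assumptions made only for this example, not the system's implementation.

```python
import numpy as np

# A toy regression tree: q(x) maps an example to a leaf index, and
# w holds the leaf scores. This flat representation is only illustrative.
class ToyTree:
    def __init__(self, feature, threshold, w):
        self.feature = feature      # feature index used by the single split
        self.threshold = threshold  # split threshold
        self.w = np.asarray(w)      # leaf scores, w in R^T with T = 2 here

    def leaf_index(self, x):        # q(x)
        return 0 if x[self.feature] < self.threshold else 1

    def predict(self, x):           # f(x) = w_{q(x)}
        return self.w[self.leaf_index(x)]

def ensemble_predict(trees, x):
    # Eq. (1): y_hat = sum_k f_k(x)
    return sum(t.predict(x) for t in trees)

def regularized_objective(trees, X, y, gamma=1.0, lam=1.0):
    # Eq. (2) with the assumed loss l(y_hat, y) = (y_hat - y)^2 and
    # Omega(f) = gamma * T + 0.5 * lambda * ||w||^2 summed over trees.
    preds = np.array([ensemble_predict(trees, x) for x in X])
    loss = np.sum((preds - y) ** 2)
    penalty = sum(gamma * len(t.w) + 0.5 * lam * np.sum(t.w ** 2) for t in trees)
    return loss + penalty

# Tiny example with two stumps.
trees = [ToyTree(0, 0.5, [-0.4, 0.6]), ToyTree(1, 0.0, [0.1, -0.2])]
X = np.array([[0.2, -1.0], [0.8, 1.0]])
y = np.array([-0.5, 0.5])
print(regularized_objective(trees, X, y))
```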
2.2 Gradient Tree Boosting
The tree ensemble model in Eq. (2) includes functions as parameters and cannot be optimized using traditional optimization methods in Euclidean space. Instead, the model is trained in an additive manner. Formally, let $\hat{y}_i^{(t)}$ be the prediction of the $i$-th instance at the $t$-th iteration; we need to add $f_t$ to minimize the following objective:

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\bigl(y_i,\; \hat{y}_i^{(t-1)} + f_t(x_i)\bigr) + \Omega(f_t).$$

This means we greedily add the $f_t$ that most improves our model according to Eq. (2). A second-order approximation can be used to quickly optimize the objective in the general setting [12]:

$$\mathcal{L}^{(t)} \simeq \sum_{i=1}^{n} \Bigl[ l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \Bigr] + \Omega(f_t),$$

where $g_i = \partial_{\hat{y}^{(t-1)}} l\bigl(y_i, \hat{y}^{(t-1)}\bigr)$ and $h_i = \partial^2_{\hat{y}^{(t-1)}} l\bigl(y_i, \hat{y}^{(t-1)}\bigr)$ are the first- and second-order gradient statistics of the loss.
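The sketch below illustrates the gradient statistics used in the second-order expansion above for two common loss choices, together with one abstract boosting step. The loss functions and the fit_tree helper are assumptions for illustration only; they do not reflect the actual tree learner (split finding is described in Sec. 3).

```python
import numpy as np

def grad_hess_squared_error(y, y_prev):
    # l(y, y_hat) = (y - y_hat)^2  =>  g = 2 * (y_hat - y), h = 2
    g = 2.0 * (y_prev - y)
    h = np.full_like(y_prev, 2.0)
    return g, h

def grad_hess_logistic(y, y_prev):
    # Logistic loss with labels y in {0, 1} and raw score y_hat:
    # g = p - y, h = p * (1 - p), where p = sigmoid(y_hat).
    p = 1.0 / (1.0 + np.exp(-y_prev))
    return p - y, p * (1.0 - p)

def boosting_step(y, y_prev, fit_tree, grad_hess=grad_hess_squared_error):
    # One round of additive training: compute (g_i, h_i) at the current
    # predictions y_hat^(t-1), then fit the next tree f_t to them.
    g, h = grad_hess(y, y_prev)
    # fit_tree is a hypothetical helper that returns the new tree's
    # predictions f_t(x_i) on the training examples.
    f_t = fit_tree(g, h)
    return y_prev + f_t  # updated predictions y_hat^(t)
```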