
GBDT-MO: Gradient Boosted Decision Trees for Multiple Outputs

Zhendong Zhang and Cheolkon Jung, Member, IEEE

Abstract—Gradient boosted decision trees (GBDTs) are widely used in machine learning, and the output of current GBDT implementations is a single variable. When there are multiple outputs, GBDT constructs multiple trees corresponding to the output variables. The correlations between variables are ignored by such a strategy, causing redundancy of the learned tree structures. In this paper, we propose a general method to learn GBDT for multiple outputs, called GBDT-MO. Each leaf of GBDT-MO constructs predictions for all variables or for a subset of automatically selected variables. This is achieved by considering the summation of objective gains over all output variables. Moreover, we extend histogram approximation to the multiple output case to speed up the training process. Various experiments on synthetic and real-world datasets verify that GBDT-MO achieves outstanding performance in terms of both accuracy and training speed. Our code is available online.

Index Terms—gradient boosting, decision tree, multiple outputs, variable correlations, indirect regularization.

(This work was supported by the National Natural Science Foundation of China (No. 61872280) and the International S&T Cooperation Program of China (No. 2014DFG12780). Z. Zhang and C. Jung (corresponding author) are with the School of Electronic Engineering, Xidian University, Xi'an 710071, China. E-mail: [email protected], [email protected].)

I. INTRODUCTION

MACHINE learning and data-driven approaches have achieved great success in recent years. Gradient boosted decision tree (GBDT) [1] [2] is a powerful machine learning tool widely used in many applications, including multi-class classification [3], flocculation process modeling [4], learning to rank [5] and click prediction [6]. It also produces state-of-the-art results in many competitions such as the Netflix prize [7]. GBDT uses decision trees as the base learner and sums the predictions of a series of trees. At each step, a new decision tree is trained to fit the residual between the ground truth and the current prediction. GBDT is popular due to its accuracy, efficiency and interpretability.

Many improvements have been proposed after [1]. XGBoost [8] used the second order gradient to guide the boosting process and improve the accuracy. LightGBM [9] aggregated gradient information into histograms and significantly improved the training efficiency. CatBoost [10] proposed a novel strategy to deal with categorical features.

A limitation of current GBDT implementations is that the output of each decision tree is a single variable. This is because each leaf of a decision tree produces a single variable. However, multiple outputs are required for many machine learning problems, including but not limited to multi-class classification, multi-label classification [11] and multi-output regression [12]. Other methods, such as neural networks [13], can adapt to any output dimension straightforwardly by changing the number of neurons in the last layer. This flexibility in the output dimension may be one of the reasons why neural networks are popular. However, it is somewhat awkward to handle multiple outputs with current GBDT implementations. At each step, they construct multiple decision trees, each of which corresponds to an individual output variable, and then concatenate the predictions of all trees to obtain multiple outputs. This strategy is used in the most popular open-sourced GBDT libraries: XGBoost [8], LightGBM [9] and CatBoost [10].

The major drawback of the abovementioned strategy is that correlations between variables are ignored during the training process, because those variables are treated in isolation and learned independently. However, correlations more or less exist between output variables. For example, there are correlations between classes for multi-class classification. It is verified in [14] that such correlations improve the generalization ability of neural networks. Ignoring variable correlations also leads to redundancy of the learned tree structures. Thus, it is necessary to learn GBDT for multiple outputs via better strategies. Up to now, only a few works have explored this. Geurts et al. [15] transformed the multiple output problem into a single output problem by kernelizing the output space. However, this method was not scalable because the space complexity of its kernel matrix was n^2, where n is the number of training samples. Si et al. [16] proposed GBDT for sparse output (GBDT-sparse). They mainly focused on extreme multi-label classification problems, and the outputs were represented in sparse format. A sparse split finding algorithm was designed for the square hinge loss. [15] and [16] worked for specific losses, and they did not employ the second order gradient and histogram approximation.

In this paper, we propose a novel and general method to learn GBDT for multiple outputs, which is scalable and efficient, named GBDT-MO. Unlike previous works, we employ the second order gradient and histogram approximation to improve GBDT-MO. The learning mechanism is designed based on them to jointly fit all variables in a single tree. Each leaf of a decision tree constructs multiple outputs at once. This is achieved by maximizing the summation of objective gains over all output variables. Sometimes, only a subset of the output variables is correlated. It is expected that the proposed method automatically selects those variables and constructs predictions for them at a leaf. We achieve this by adding an L0 constraint to the objective function. Since the learning mechanism of GBDT-MO enforces the learned trees to capture variable correlations, it plays the role of an indirect regularization. Experiments on both synthetic and real-world datasets show that GBDT-MO achieves better generalization ability than standard GBDT. Moreover, GBDT-MO achieves a fast training speed, especially when the number of outputs is large.

Compared with existing methods, the main contributions of this paper are as follows:
• We formulate the problem of learning multiple outputs for GBDT, and propose a split finding algorithm by deriving a general approximate objective for this problem.
• To learn a subset of outputs, we add a sparse constraint to the objective. Based on it, we develop two sparse split finding algorithms.
• We extend histogram approximation [17] to the multiple output case to speed up the training process.

The rest of this paper is organized as follows. First, we review GBDT for single output and introduce basic definitions in Section II. Then, we describe the details of GBDT-MO in Section III. We address related work in Section IV. Finally, we perform experiments and conclude in Sections V and VI, respectively.

II. GBDT FOR SINGLE OUTPUT

In this section, we review GBDT for single output. First, we describe the work flow of GBDT. Then, we show how to derive the objective of GBDT based on the second order Taylor expansion of the loss, as used in XGBoost. This objective will be generalized to the multiple variable case in Section III. Finally, we explain the split finding algorithms which exactly or approximately minimize the objective.

A. Work Flow

Denote D = {(x_i, y_i)}_{i=1}^{n} as a dataset with n samples, where x ∈ R^m is an m dimensional input. Denote f : R^m → R as the function of a decision tree, which maps x into a scalar. Since GBDT integrates t decision trees in an additive manner, the prediction of GBDT is ŷ_i = Σ_{k=1}^{t} f_k(x_i), where f_k is the function of the k-th decision tree. GBDT aims at constructing a series of trees given a dataset. At each boosting round, it first calculates the gradient based on the current prediction, then constructs a new tree guided by the gradient, and finally updates the prediction using the new tree. The most important part of GBDT is to construct trees based on the gradient.

B. Objective

Based on the construction mechanism of decision trees, f can be further expressed as follows:

    f(x) = w_{q(x)}, \quad q : \mathbb{R}^m \to [1, L], \quad w \in \mathbb{R}^L    (1)

where L is the number of leaves of a decision tree, q is a function which selects a leaf given x, and w_i is the value of the i-th leaf. That is, once a decision tree is constructed, it first maps an input into a leaf, then returns the value of that leaf. Now, we consider the objective of the (t+1)-th decision tree given the prediction ŷ of the first t trees:

    \sum_{i=1}^{n} l(\hat{y}_i + f(x_i), y_i) + \lambda R(f)    (2)

where the first term is the fidelity term and R is the regularization term of f. In this work, λ is a positive number that controls the trade-off between the fidelity term and the regularization term. We suppose l is a second order differentiable loss. Based on the space of f, i.e. a constant value for each leaf, the fidelity term of (2) is separable w.r.t. each leaf. Then, (2) is rewritten as follows:

    \sum_{j=1}^{L} \left[ \sum_{i \in \text{leaf}_j} l(\hat{y}_i + w_j, y_i) \right] + \lambda R(w)    (3)

Although there are many choices of R, we set R(w) = \frac{1}{2}\|w\|_2^2, which is commonly used. Because (3) is separable w.r.t. each leaf, we only consider the objective of a single leaf as follows:

    \mathcal{L} = \sum_i l(\hat{y}_i + w, y_i) + \frac{\lambda}{2} w^2    (4)

where w is the value of a leaf and i is enumerated over the samples belonging to that leaf. l(ŷ_i + w, y_i) can be approximated by the second order Taylor expansion of l(ŷ_i, y_i). Then, we have

    \mathcal{L} = \sum_i \left[ l(\hat{y}_i, y_i) + g_i w + \frac{1}{2} h_i w^2 \right] + \frac{\lambda}{2} w^2    (5)

where g_i and h_i are the first and second order derivatives of l(ŷ_i, y_i) w.r.t. ŷ_i. By setting ∂L/∂w to 0, we obtain the optimal value of w as follows:

    w^* = -\frac{\sum_i g_i}{\sum_i h_i + \lambda}    (6)

Substituting (6) into (5), we get the optimal objective as follows:

    \mathcal{L}^* = -\frac{1}{2} \frac{\left(\sum_i g_i\right)^2}{\sum_i h_i + \lambda}    (7)

We ignore l(ŷ, y) since it is a constant term given ŷ.
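To make the derivation concrete, the following Python fragment is a minimal sketch (ours, not part of the paper and independent of its C++ implementation) that evaluates the optimal leaf value (6) and the optimal objective (7) from per-sample gradients and hessians; the function name and the use of NumPy are illustrative assumptions.

```python
import numpy as np

def leaf_value_and_objective(g, h, lam=1.0):
    """Optimal leaf value (6) and optimal objective (7) for a single-output leaf.

    g, h : 1-D arrays of first/second order derivatives of the loss,
           one entry per sample falling into this leaf.
    lam  : L2 regularization strength (lambda).
    """
    G, H = g.sum(), h.sum()
    w_star = -G / (H + lam)                # Eq. (6)
    obj_star = -0.5 * G * G / (H + lam)    # Eq. (7)
    return w_star, obj_star

# Example with squared error loss l = 0.5*(y_hat - y)^2, so g = y_hat - y and h = 1.
y = np.array([1.0, 2.0, 0.5])
y_hat = np.array([0.8, 1.5, 0.7])
g, h = y_hat - y, np.ones_like(y)
print(leaf_value_and_objective(g, h))
```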

C. Split Finding

One of the most important problems in decision tree learning is to find the best split given a set of samples. Specifically, samples are divided into left and right parts based on the following rule:

    x_i \in \begin{cases} \text{left}, & x_{ij} \le T \\ \text{right}, & x_{ij} > T \end{cases}    (8)

where x_{ij} is the j-th element of x_i and T is the threshold. The goal of split finding algorithms is to find the best column j and threshold T such that the gain between the optimal objectives before and after the split is maximized. The optimal objective after the split is defined as the sum of the optimal objectives of the left and right parts:

    \text{gain} = \mathcal{L}^* - (\mathcal{L}^*_{\text{left}} + \mathcal{L}^*_{\text{right}})    (9)

Maximizing gain is equivalent to minimizing L*_left + L*_right because L* is fixed for a given set of samples. gain is also used to determine whether a tree is grown: if gain is smaller than a threshold, we stop the growth to avoid over-fitting.

Exact and approximate split finding algorithms have been developed. The exact algorithm enumerates over all possible splits on all columns.
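The exact algorithm can be sketched as a sort-and-scan over one input column; the following Python fragment is our own illustrative sketch (not the paper's implementation), assuming per-sample gradients g and hessians h are already available.

```python
import numpy as np

def exact_best_split(x_col, g, h, lam=1.0):
    """Exact split finding on one input column (Eqs. (6)-(9), single output).

    Sorting plus prefix sums lets every candidate threshold be evaluated in
    O(n log n) total time for this column. Returns (best_gain, best_threshold).
    """
    order = np.argsort(x_col)
    xs, gs, hs = x_col[order], g[order], h[order]
    G, H = gs.sum(), hs.sum()
    parent = -0.5 * G * G / (H + lam)            # objective before the split
    Gl = Hl = 0.0
    best_gain, best_thr = 0.0, None
    for i in range(len(xs) - 1):
        Gl += gs[i]; Hl += hs[i]
        if xs[i] == xs[i + 1]:                   # cannot split between equal values
            continue
        Gr, Hr = G - Gl, H - Hl
        children = -0.5 * (Gl * Gl / (Hl + lam) + Gr * Gr / (Hr + lam))
        gain = parent - children                 # Eq. (9)
        if gain > best_gain:
            best_gain, best_thr = gain, 0.5 * (xs[i] + xs[i + 1])
    return best_gain, best_thr
```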

[Fig. 1] Illustration of the learned trees of GBDT-SO and GBDT-MO on Yeast with 10 output variables. The first and second columns show GBDT-SO for the first and second output variables, respectively; the last column shows GBDT-MO for the corresponding variables. Tree depth is limited to 3 for illustration. Panels: (a) Round 1, variable 1; (b) Round 1, variable 2; (c) Round 1; (d) Round 2, variable 1; (e) Round 2, variable 2; (f) Round 2. Internal nodes show the split feature X[j] and threshold; leaves show the predicted values.

For efficiency, it first sorts the samples according to the values of each column and then visits the samples in the sorted order to accumulate g and h, which are used to compute the optimal objective. The exact split finding algorithm is accurate. However, when the number of samples is large, it is time-consuming to enumerate over all possible splits, so approximate algorithms become necessary. The key idea of approximate algorithms is that they divide samples into buckets and enumerate over these buckets instead of individual samples. The gradient statistics are accumulated within each bucket, so the complexity of the enumeration process for a split is independent of the number of samples. In the literature, there are two strategies for bucketing: quantile-based and histogram-based. Since the latter is significantly faster than the former [9], we focus on the latter in this work. For histogram-based bucketing, a bucket is called a bin. Samples are divided into b bins by b adjacent intervals (s_0, s_1, s_2, ..., s_b), where s_0 and s_b are usually set to −∞ and +∞, respectively. These intervals are constructed based on the distribution of the j-th input column over the whole dataset. Once constructed, they are kept unchanged. Given a sample x_i, it belongs to the k-th bin if and only if s_{k−1} < x_{ij} ≤ s_k. The bin value of x_{ij} is obtained by binary search with complexity O(log b). Given bin values, the histogram of the j-th input column is constructed by a single fast scan. When samples are divided into two parts, it may be unnecessary to construct the histograms for both parts: one can store the histogram of the parent node in memory and construct the histogram of only one part. Then, the histogram of the other part is obtained by subtracting the constructed part from the parent histogram. This trick reduces the running time of histogram construction by at least half.

III. GBDT FOR MULTIPLE OUTPUTS

In this section, we describe GBDT-MO in detail. We first formulate the general problem of learning GBDT for multiple outputs. Specifically, we derive the objective for learning multiple outputs based on the second order Taylor expansion of the loss. We approximate this objective and connect it with the objective for single output. We also formulate the problem of learning a subset of variables and derive its objective, which is achieved by adding L_0 constraints. Then, we propose split finding algorithms that minimize the corresponding objectives. Finally, we discuss our implementation and analyze the complexity of the proposed split finding algorithms. In this work, we denote X_i as the i-th row of a matrix X, X_{.j} as its j-th column and X_{ij} as its element in the i-th row and j-th column.

A. Objective

We derive the objective of GBDT-MO. Each leaf of a decision tree constructs multiple outputs. Denote D = {(x_i, y_i)}_{i=1}^{n} as a dataset with n samples, where x ∈ R^m is an m dimensional input and y ∈ R^d is a d dimensional output instead of a scalar. Denote f : R^m → R^d as the function of a decision tree which maps x into the output space. Based on the construction mechanism of decision trees, f can be further expressed as follows:

    f(x) = W_{q(x)}, \quad q : \mathbb{R}^m \to [1, L], \quad W \in \mathbb{R}^{L \times d}    (10)

where L is the number of leaves of a decision tree, q is a function which selects a leaf given x, and W_i ∈ R^d is the value vector of the i-th leaf. That is, once a decision tree is constructed, it first maps an input into a leaf, then returns the d dimensional vector of that leaf. The prediction of the first t trees is ŷ_i = Σ_{k=1}^{t} f_k(x_i).

We consider the objective of the (t+1)-th tree given ŷ. Because it is separable w.r.t. each leaf (see (2) and (3)), we only consider the objective of a single leaf as follows:

    \mathcal{L} = \sum_i l(\hat{y}_i + w, y_i) + \lambda R(w)    (11)

We highlight that w ∈ R^d is now a vector with d elements belonging to a leaf. Again, we suppose l is a second order differentiable loss. l(ŷ_i + w, y_i) can be approximated by the second order Taylor expansion of l(ŷ_i, y_i). Setting R(w) = \frac{1}{2}\|w\|_2^2, we have:

    \mathcal{L} = \sum_i \left[ l(\hat{y}_i, y_i) + (g)_i^T w + \frac{1}{2} w^T (H)_i w \right] + \frac{\lambda}{2} \|w\|_2^2    (12)

where (g)_i = ∂l/∂ŷ_i and (H)_i = ∂²l/∂ŷ_i². To avoid notation conflicts with the subscripts of vectors or matrices, we use (·)_i to indicate that an object belongs to the i-th sample; this notation is omitted when there is no ambiguity. By setting ∂L/∂w = 0 for (12), we obtain the optimal leaf values:

    w^* = -\left( \sum_i (H)_i + \lambda I \right)^{-1} \left( \sum_i (g)_i \right)    (13)

where I is an identity matrix. By substituting w^* into (12) and ignoring the constant term l(ŷ_i, y_i), we get the optimal objective as follows:

    \mathcal{L}^* = -\frac{1}{2} \left( \sum_i (g)_i \right)^T \left( \sum_i (H)_i + \lambda I \right)^{-1} \left( \sum_i (g)_i \right)    (14)

We have derived the optimal leaf values and the optimal objective for multiple outputs. Comparing (13) with (6) and (14) with (7), it is easy to see that this is a natural generalization of the single output case. In fact, when l is separable w.r.t. different output dimensions, or equivalently, when its hessian matrix H is diagonal, each element of w^* is obtained in the same way as in (6):

    \tilde{w}^*_j = -\frac{\sum_i (g_j)_i}{\sum_i (h_j)_i + \lambda}    (15)

where h ∈ R^d contains the diagonal elements of H. Moreover, the optimal objective in (14) can then be expressed as a sum of objectives over all output dimensions:

    \tilde{\mathcal{L}}^* = -\frac{1}{2} \sum_{j=1}^{d} \frac{\left(\sum_i (g_j)_i\right)^2}{\sum_i (h_j)_i + \lambda}    (16)

GBDT-MO and GBDT are still different even when H is diagonal, because GBDT-MO considers the objectives of all output variables at the same time.

However, it is problematic when l is not separable, or equivalently when H is non-diagonal. First, it is difficult to store H for every sample when the output dimension d is large. Second, to get the optimal objective for each possible split, it is required to compute the inverse of a d × d matrix, which is time-consuming. Thus, it is impractical to learn GBDT-MO using the exact objective in (14) and the exact leaf values in (13). Fortunately, it is shown in [18] that \tilde{w}^* and \tilde{\mathcal{L}}^* are good approximations of the exact leaf values and the exact objective when the diagonal elements of H are dominant. \tilde{w}^* and \tilde{\mathcal{L}}^* are derived from an upper bound of l(ŷ, y); see Appendix A for details and further discussion. Although it is possible to derive better approximations for some specific loss functions, we use the above diagonal approximation in this work and leave better approximations as future work.
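As an illustration of the diagonal approximation, the following Python sketch (ours, not from the paper; NumPy-based for clarity, with illustrative function names) computes the approximate leaf vector (15) and objective (16) for a leaf with d output variables.

```python
import numpy as np

def mo_leaf_diag(G, H, lam=1.0):
    """Diagonal-approximation leaf values (15) and objective (16).

    G : shape (d,), per-output sums of first order gradients over the leaf.
    H : shape (d,), per-output sums of the hessian diagonals over the leaf.
    """
    w = -G / (H + lam)                      # Eq. (15), one value per output
    obj = -0.5 * np.sum(G * G / (H + lam))  # Eq. (16), summed over outputs
    return w, obj

# Example: a leaf with d = 3 outputs.
G = np.array([4.0, -1.0, 0.5])    # sums of gradients per output
H = np.array([10.0, 8.0, 6.0])    # sums of hessian diagonals per output
w, obj = mo_leaf_diag(G, H)
```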
B. Sparse Objective

In Section III-A, we defined the objective as a sum over all output variables because there are correlations among variables. In practice, however, only a subset of the variables is correlated, not all of them. Thus, it is desirable to learn the values of a suitable subset of w in a leaf. Although only a subset of variables is covered in a single leaf, all variables can still be covered because there are many leaves. Moreover, the sparse objective reduces the number of parameters. We obtain the sparse objective by adding an L_0 constraint to the non-sparse one. Based on the diagonal approximation, the optimal sparse leaf values are as follows:

    w^*_{sp} = \arg\min_{w} \; G^T w + \frac{1}{2} w^T (\mathrm{diag}(H) + \lambda I)\, w, \quad \text{s.t. } \|w\|_0 \le k    (17)

where k is the maximum number of non-zero elements of w. We denote G = Σ_i (g)_i and H = Σ_i (h)_i ∈ R^d to simplify the notation; this summed-diagonal vector H should not be confused with the full hessian matrix (H)_i in (12).

For the j-th element of w^*_{sp}, its value is either −G_j/(H_j + λ) or 0. Accordingly, the objective contributed by the j-th column is either −(1/2) G_j^2/(H_j + λ) or 0. Thus, to minimize (17), we select the k columns with the largest v_j = G_j^2/(H_j + λ). Let π be the sorted order of v such that

    v_{\pi^{-1}(1)} \ge v_{\pi^{-1}(2)} \ge \cdots \ge v_{\pi^{-1}(d)}    (18)

Then, the solution of (17) is as follows:

    (w^*_{sp})_j = \begin{cases} -\frac{G_j}{H_j + \lambda}, & \pi(j) \le k \\ 0, & \pi(j) > k \end{cases}    (19)

That is, the columns with the k largest v keep their values while the others are set to 0. The corresponding optimal objective is as follows:

    \mathcal{L}^*_{sp} = -\frac{1}{2} \sum_{j:\, \pi(j) \le k} \frac{G_j^2}{H_j + \lambda}    (20)

Our sparse objective is similar to the objective in [16]. Since GBDT-sparse only uses the first order derivative, its second order derivative is set to a constant (1.0); thus, H_j in (20) becomes the number of samples belonging to the leaf. The final solution of the sparse objective only contains a subset of variables. However, to find the optimal subset, all variables are involved in the comparisons. That is, all variables are used indirectly.
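A small Python sketch of the top-k selection in (18)-(20) (again ours, with illustrative names; np.argpartition is just one of several ways to pick the k largest scores):

```python
import numpy as np

def sparse_leaf(G, H, lam=1.0, k=2):
    """Sparse leaf values (19) and objective (20) under the L0 constraint ||w||_0 <= k."""
    v = G * G / (H + lam)                    # per-output scores v_j
    keep = np.argpartition(-v, k - 1)[:k]    # indices of the k largest scores
    w = np.zeros_like(G)
    w[keep] = -G[keep] / (H[keep] + lam)     # Eq. (19)
    obj = -0.5 * v[keep].sum()               # Eq. (20)
    return w, obj

G = np.array([4.0, -1.0, 0.5, -3.0])
H = np.array([10.0, 8.0, 6.0, 9.0])
w, obj = sparse_leaf(G, H, k=2)   # keeps the two outputs with largest G_j^2/(H_j+lam)
```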

C. Split Finding

Split finding algorithms for single output maximize the gain of the objective before and after a split. When dealing with multiple outputs, the objective is defined over all output variables, so to find the maximum gain it is required to scan all columns of the outputs. The exact algorithm is inefficient because it enumerates all possible splits. In this work, we use the histogram approximation based algorithm to speed up the training process. To deal with multiple outputs, one should extend the histogram construction algorithm from the single variable case to the multiple variable case. Such an extension is straightforward. Denote b as the number of bins of a histogram. Then, the gradient information is stored in a b × d matrix whose j-th column corresponds to the gradient information of the j-th output variable. We describe the histogram construction algorithm for multiple outputs in Algorithm 1. Once the histogram is constructed, we scan its bins to find the best split. We describe this histogram based split finding algorithm in Algorithm 2. Compared with the single output case, the objective gain is the sum of gains over all outputs.

Algorithm 1: Histogram for Multiple Outputs
Input: the set of used samples S, column index k, output dimension d and number of bins b
Output: histogram of the k-th column
    Initialize Hist.g ∈ R^{b×d}, Hist.h ∈ R^{b×d} and Hist.cnt ∈ R^b
    for i in S do
        bin ← bin value of x_{ik}
        Hist.cnt[bin] ← Hist.cnt[bin] + 1
        for j = 1 to d do
            Hist.g[bin][j] ← Hist.g[bin][j] + (g_j)_i
            Hist.h[bin][j] ← Hist.h[bin][j] + (h_j)_i
        end for
    end for

Algorithm 2: Approximate Split Finding for Multiple Outputs
Input: histograms of the current node, input dimension m and output dimension d
Output: split with the maximum gain
    gain ← 0
    for k = 1 to m do
        Hist ← histogram of the k-th column
        b ← number of bins of Hist
        G ← Σ_{i=1}^{b} Hist.g[i],  H ← Σ_{i=1}^{b} Hist.h[i]
        G^l ← 0,  H^l ← 0
        for i = 1 to b do
            G^l ← G^l + Hist.g[i],  H^l ← H^l + Hist.h[i]
            G^r ← G − G^l,  H^r ← H − H^l
            score ← Σ_{j=1}^{d} [ (G^l_j)^2/(H^l_j+λ) + (G^r_j)^2/(H^r_j+λ) − (G_j)^2/(H_j+λ) ]   (cf. (16))
            gain ← max(gain, score)
        end for
    end for

D. Sparse Split Finding

We propose histogram approximation based sparse split finding algorithms. Compared with the non-sparse algorithm, the key difference is how the objective gain of a candidate split is computed:

    \frac{1}{2}\left( \sum_{i:\, \pi^l(i)\le k} \frac{(G^l_i)^2}{H^l_i+\lambda} + \sum_{j:\, \pi^r(j)\le k} \frac{(G^r_j)^2}{H^r_j+\lambda} \right) - \text{const}    (21)

where π^l is the sorted order of (G^l)^2/(H^l+λ) and π^r is the sorted order of (G^r)^2/(H^r+λ), respectively. const denotes the objective before the split, which is fixed for every possible split. When we scan over columns, we maintain, for each part, the top-k columns whose (G)^2/(H+λ) is the largest. This can be achieved with a top-k priority queue. We describe the gain computation in Algorithm 3.

Algorithm 3: Gain for Sparse Split Finding
Input: gradient statistics of the current split, output dimension d and sparse constraint k
Output: gain of the current split
    Q^l, Q^r ← top-k priority queues
    for j = 1 to d do
        append (G^l_j)^2/(H^l_j+λ) to Q^l, append (G^r_j)^2/(H^r_j+λ) to Q^r
    end for
    score = Σ_{v^l ∈ Q^l} v^l + Σ_{v^r ∈ Q^r} v^r   (cf. (21))

In Algorithm 3, the sets of selected columns of the left and right parts do not necessarily overlap. We can instead restrict those two sets to be completely overlapping; in other words, the selected columns of the two parts are shared. Then, the objective gain becomes:

    \frac{1}{2} \sum_{i:\, \pi(i)\le k} \left( \frac{(G^l_i)^2}{H^l_i+\lambda} + \frac{(G^r_i)^2}{H^r_i+\lambda} \right) - \text{const}    (22)

where π is the sorted order of (G^l)^2/(H^l+λ) + (G^r)^2/(H^r+λ). We call this the restricted sparse split finding algorithm and describe its gain computation in Algorithm 4.

Algorithm 4: Gain for Restricted Sparse Split Finding
Input: gradient statistics of the current split, output dimension d and sparse constraint k
Output: gain of the current split
    Q ← top-k priority queue
    for j = 1 to d do
        append (G^l_j)^2/(H^l_j+λ) + (G^r_j)^2/(H^r_j+λ) to Q
    end for
    score = Σ_{v ∈ Q} v   (cf. (22))

There are two advantages of the restricted algorithm:
• It has lower computational complexity because it only maintains a single top-k priority queue.
• It introduces a smoothness prior into the function space because it makes two child nodes with the same parent more similar.

We provide an example of the differences between non-sparse split finding, sparse split finding and restricted sparse split finding in Fig. 2.
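To make Algorithms 2-4 concrete, here is a hedged Python sketch (our illustration, not the paper's C++ code) of the per-split gain computations: the non-sparse gain summed over all outputs, and the sparse/restricted sparse gains that keep only the k best columns.

```python
import numpy as np

def split_gain(Gl, Hl, Gr, Hr, lam=1.0, k=None, restricted=True):
    """Objective contribution of one candidate split for d outputs.

    Gl, Hl, Gr, Hr : shape (d,) sums of gradients / hessian diagonals of the
                     left and right parts (e.g. accumulated from histogram bins).
    k              : None for the non-sparse gain; otherwise the sparse constraint.
    """
    vl = Gl * Gl / (Hl + lam)
    vr = Gr * Gr / (Hr + lam)
    if k is None:                       # non-sparse: sum over all outputs (Algorithm 2)
        return 0.5 * (vl.sum() + vr.sum())
    if restricted:                      # shared top-k columns, Eq. (22) / Algorithm 4
        return 0.5 * np.sort(vl + vr)[-k:].sum()
    # unrestricted: independent top-k for each part, Eq. (21) / Algorithm 3
    return 0.5 * (np.sort(vl)[-k:].sum() + np.sort(vr)[-k:].sum())

# The constant objective of the parent node cancels when ranking candidate splits
# of the same node, so it is omitted here; a real implementation subtracts it when
# comparing the gain against the growth-stopping threshold.
```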

[Fig. 2] Examples of different split algorithms for 4 samples (rows) and 3 outputs (columns): (a) non-sparse split finding, (b) sparse split finding, (c) restricted sparse split finding. The sparse constraint is set to 2. A dashed line divides the samples into two parts; the numbers (1.0, 1.5, 2.0 and 2.5, 1.5, 0.5 in the two parts) denote the objectives for each column; selected columns of the left part are marked in blue and selected columns of the right part in red.

E. Implementation Details

We implement GBDT-MO from scratch in C++. First, we implement the core functionality of LightGBM ourselves, called GBDT-SO. Then, we integrate the learning mechanism for multiple outputs into it, i.e. GBDT-MO¹. We also provide a Python interface. We speed up GBDT-MO using multi-core parallelism, implemented with OpenMP. A decision tree grows in a best-first manner. Specifically, we store the information of nodes that have not yet been divided in memory. Each time we need to add a node, we select the node whose objective gain is the maximum among all stored nodes. In practice, we store up to 48 nodes in memory to reduce the memory cost. We describe the algorithm for growing a tree in Algorithm 5.

¹Our code is available at https://github.com/zzd1992/GBDTMO.

Algorithm 5: Tree Growth
Input: set of samples S, gradient statistics, and hyper-parameters for tree learning
Output: a decision tree
    Hist ← Histograms of S
    Split ← SplitFinding(Hist)
    Q ← priority queue sorted by Split.gain
    Q.push(Split, Hist, S)
    repeat
        Split, Hist, S ← Q.pop()
        Hist^l, Hist^r, S^l, S^r ← ApplySplit(Split, Hist, S)
        if stop condition is not met then
            Split^l ← SplitFinding(Hist^l)
            Q.push(Split^l, Hist^l, S^l)
        end if
        if stop condition is not met then
            Split^r ← SplitFinding(Hist^r)
            Q.push(Split^r, Hist^r, S^r)
        end if
    until Q is empty

F. Complexity Analysis

We analyze the training and inference complexity of GBDT-SO and GBDT-MO. For the training complexity, we focus on the complexity of the split finding algorithms. Recall that the input dimension is m, the output dimension is d, the sparse constraint is k, and b is the number of bins. Suppose that b is fixed for all input dimensions, t is the number of boosting rounds and h is the maximum tree depth. Table I shows the training complexity. The complexity of non-sparse split finding is O(bmd) because it enumerates over histogram bins, input dimensions and output dimensions. For the sparse case, it is multiplied by log k because inserting an element into a top-k priority queue requires O(log k) comparisons (see Algorithm 4). When the exact hessian is used (see (14)), the complexity becomes O(bmd^3) because calculating the inverse of a d × d matrix requires O(d^3) operations. GBDT-SO has the same split finding complexity as GBDT-MO. However, this does not mean that GBDT-SO is as fast as GBDT-MO: beyond split finding, GBDT-SO has more overhead that slows its training down. For example, samples are divided into two parts d times for GBDT-SO, but only once for GBDT-MO. Thus, the training speed of GBDT-MO is faster than that of GBDT-SO in practice. Table II shows the inference complexity. It can be observed that GBDT-SO needs more comparisons than GBDT-MO.

TABLE I
COMPLEXITY OF SPLIT FINDING

    Methods                    Complexity
    GBDT-SO                    O(bmd)
    GBDT-MO                    O(bmd)
    GBDT-MO (sparse)           O(bmd log k)
    GBDT-MO (exact hessian)    O(bmd^3)
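Returning to Algorithm 5, the best-first growth loop can be sketched in Python as below. This is an illustrative skeleton under assumed helper functions find_best_split and apply_split (and an assumed .gain attribute on split objects); Python's heapq is a min-heap, so gains are negated to pop the largest gain first, and the memory bound on stored nodes is our own simplified reading of the 48-node limit.

```python
import heapq
import itertools

def grow_tree(root_samples, find_best_split, apply_split, min_gain=1e-3, max_nodes=48):
    """Best-first growth sketch: repeatedly expand the stored node with the largest gain."""
    tie = itertools.count()               # tie-breaker so heapq never compares custom objects
    heap = []
    split = find_best_split(root_samples)
    heapq.heappush(heap, (-split.gain, next(tie), split, root_samples))

    while heap:
        neg_gain, _, split, samples = heapq.heappop(heap)
        if -neg_gain < min_gain:          # stop condition: do not grow this node further
            continue
        left, right = apply_split(split, samples)
        for child in (left, right):
            if len(heap) < max_nodes:     # bound on nodes kept in memory
                child_split = find_best_split(child)
                heapq.heappush(heap, (-child_split.gain, next(tie), child_split, child))
    # a real implementation also caps tree depth and the total number of leaves
```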

TABLE II
COMPLEXITY OF INFERENCE

    Methods             Comparisons    Additions
    GBDT-SO             O(tdh)         O(td)
    GBDT-MO             O(th)          O(td)
    GBDT-MO (sparse)    O(th)          O(tk)

TABLE III
RMSE ON SYNTHETIC DATASETS

    Method      friedman1    random projection
    GBDT-SO     0.1540       0.0204
    GBDT-MO     0.1429       0.0180

IV. RELATED WORK

Gradient boosted decision tree (GBDT), proposed in [1], has received much attention due to its accuracy, efficiency and interpretability. GBDT has two characteristics: it uses decision trees as the base learner, and its boosting process is guided by the gradient of some loss function. Many variants of [1] have been proposed. Instead of only the first order gradient, XGBoost [8] also uses the second order gradient to guide its boosting process and derives the corresponding objective for split finding. The histogram approximation of split finding is proposed in [17] and is used as the base algorithm in LightGBM [9]. Because the second order gradient improves the accuracy and histogram approximation improves the training efficiency, those two improvements are also used in the proposed GBDT-MO.

Since machine learning problems with multiple outputs have become common, many tree based or boosting based methods have been proposed to deal with multiple outputs. [19] [20] generalize the impurity measures defined for binary classification and ranking tasks to a multi-label scenario for splitting a node. However, they are not boosting based methods; that is, new trees are not constructed in a boosting manner. Several works extend adaptive boosting (AdaBoost) to multi-label cases, such as AdaBoost.MH [21] and AdaBoost.LC [22]. The spirit of AdaBoost.MH and AdaBoost.LC is different from that of GBDT: at each step, a new base learner is trained from scratch on the re-weighted samples. Moreover, AdaBoost.MH only works for the Hamming loss, while AdaBoost.LC only works for the covering loss [22]. They do not belong to the GBDT family.

Two works which belong to the GBDT family have been proposed for learning multiple outputs [15] [16]. [15] transforms the multiple output problem into a single output problem by kernelizing the output space. To achieve this, an n × n kernel matrix should be constructed, where n is the number of training samples. Thus, this method is not scalable. Moreover, it works only for the square loss. GBDT for sparse output (GBDT-sparse) is proposed in [16]. The outputs are represented in sparse format, and a sparse split finding algorithm is designed by adding an L_0 constraint to the objective. The sparse split finding algorithms of GBDT-MO are inspired by this work. There are several differences between GBDT-MO and GBDT-sparse:
• GBDT-sparse focuses on extreme multi-label classification problems, whereas GBDT-MO focuses on general multiple output problems. GBDT-sparse requires the loss to be separable over output variables. It also requires its gradient to be sparse, i.e. ∂l(ŷ, y)/∂ŷ = 0 if ŷ = y. During training, it introduces a clipping operator into the loss to maintain the sparsity of the gradient. These facts limit the admissible types of loss.
• GBDT-sparse does not employ the second order gradient. Its objective for split finding is derived from the first order Taylor expansion of the loss, as in [1].
• GBDT-sparse does not employ histogram approximation to speed up the training process. This is understandable because it is not clear how to construct sparse histograms, especially when combining them with the second order gradient.
• The main motivation of GBDT-sparse is to reduce the space complexity of training and the size of models, whereas the main motivation of GBDT-MO is to improve generalization ability by capturing correlations between output variables.

It may be hard to store the outputs in memory without a sparse format when the output dimension is very large. In such a situation, GBDT-sparse is a better choice.

V. EXPERIMENTS

We evaluate GBDT-MO on multi-output regression, multi-class classification and multi-label classification problems. First, we show the benefits of GBDT-MO using two synthetic problems. Then, we evaluate GBDT-MO on six real-world datasets. We further evaluate our sparse split finding algorithms. Finally, we analyze the impact of the diagonal approximation on the performance. Recall that GBDT-SO is our own implementation of GBDT for single output; except for the split finding algorithm, all implementation details are the same as GBDT-MO for a fair comparison. The purpose of our experiments is not to push state-of-the-art results on specific datasets. Instead, we show that GBDT-MO has better generalization ability than GBDT-SO. On synthetic datasets, we compare GBDT-MO with GBDT-SO. On real-world datasets, we compare GBDT-MO with GBDT-SO, XGBoost, LightGBM and GBDT-sparse². XGBoost and LightGBM are state-of-the-art GBDT implementations for single output, while GBDT-sparse is the most relevant work to GBDT-MO. All experiments are conducted on a workstation with an Intel Xeon CPU E5-2698 v4. We use 4 threads on synthetic datasets and 8 threads on real-world datasets. We provide hyper-parameter settings in Appendix B. The training process is terminated when the performance does not improve within 25 rounds.

²GBDT-sparse is implemented by ourselves, and we do not report its training time due to its slow speed.

A. Synthetic Datasets

The first dataset is derived from the friedman1 regression problem [23]. Its target y is generated by:

    f(x) = \sin(\pi x_1 x_2) + 2(x_3 - 0.5)^2 + x_4 + 0.5 x_5    (23)
    y = f(x) + 0.1\varepsilon    (24)

where x ∈ R^{10} and ε ∼ N(0, 1). Each element of x is sampled from U(−1, 1). The last five elements of x are irrelevant to the target. We extend this problem into a multiple output case by adding independent noise to f(x), i.e. y_j = f(x) + 0.1 ε_j, where y ∈ R^5.

The second dataset is generated by random projection:

    y = W^T x    (25)

where x ∈ R^4, W ∈ R^{4×8} and y ∈ R^8. Each element of x and W is independently sampled from U(−1, 1).

For both datasets, we generate 10,000 samples for training and 10,000 samples for test. We train with the mean square error (MSE) loss and evaluate the performance on test samples via the root mean square error (RMSE). We repeat the experiments 5 times with different seeds and average the RMSE. As shown in Table III, GBDT-MO is better than GBDT-SO. We provide the training curves in Fig. 3. For fairness, the curves of different methods are plotted with the same learning rate and maximum tree depth. It can be observed that GBDT-MO has better generalization ability on both datasets: its RMSE on test samples is lower and its performance gap is smaller. Here, the performance gap means the difference in performance between training samples and unseen samples, which is usually used to measure the generalization ability of machine learning algorithms. The output variables of friedman1 are correlated because they are observations of the same underlying variable corrupted by Gaussian noise. The output variables of random projection are also correlated because the projection is over-complete. This supports our claim that GBDT-MO generalizes better because its learning mechanism encourages it to capture variable correlations. However, GBDT-MO suffers from a slow convergence speed.
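For concreteness, here is a small Python sketch of how such synthetic data could be generated (our own illustration of the setup above; the exact random seeds and data handling used in the paper's experiments are not specified here).

```python
import numpy as np

def friedman1_multi(n=10000, d_out=5, rng=np.random.default_rng(0)):
    """Multi-output friedman1: d_out noisy observations of the same f(x), Eqs. (23)-(24)."""
    x = rng.uniform(-1.0, 1.0, size=(n, 10))
    f = (np.sin(np.pi * x[:, 0] * x[:, 1]) + 2.0 * (x[:, 2] - 0.5) ** 2
         + x[:, 3] + 0.5 * x[:, 4])
    y = f[:, None] + 0.1 * rng.standard_normal((n, d_out))
    return x, y

def random_projection(n=10000, rng=np.random.default_rng(0)):
    """Over-complete random projection target, Eq. (25)."""
    W = rng.uniform(-1.0, 1.0, size=(4, 8))   # fixed projection matrix
    x = rng.uniform(-1.0, 1.0, size=(n, 4))
    y = x @ W                                  # W^T x applied row-wise
    return x, y
```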

[Fig. 3] RMSE curves of GBDT-SO and GBDT-MO on friedman1 (a) and random projection (b) over boosting rounds. RMSE curves for training samples are drawn in solid lines, while RMSE curves for test samples are drawn in dashed lines.

[Fig. 4] Cross entropy curves and accuracy curves of GBDT-SO and GBDT-MO on MNIST (a) and Caltech101 (b). Cross entropy curves for training samples are drawn in solid lines, while accuracy curves for test samples are drawn in dashed lines.

B. Real-world Datasets

In this subsection, we evaluate GBDT-MO on six real-world datasets and compare it with related methods in terms of test performance and training speed.

MNIST³ is widely used for classification. Its samples are 28 × 28 gray images of handwritten digits from 0 to 9. Each sample is converted into a vector with 784 elements.

Yeast⁴ has 8 input attributes, each of which is a measurement of the protein sequence. The goal is to predict protein localization sites with 10 possible choices.

³yann.lecun.com/exdb/mnist/
⁴archive.ics.uci.edu/ml/datasets/Yeast

TABLE IV
DATASET STATISTICS. * MEANS THE FEATURES ARE PRE-PROCESSED.

    Dataset            # Training samples   # Test samples   # Features   # Outputs   Problem type
    MNIST              50,000               10,000           784          10          classification
    Yeast              1,038                446              8            10          classification
    Caltech101         6,073                2,604            324*         101         classification
    MNIST-inpainting   50,000               10,000           200*         24          regression
    Student-por        454                  159              41*          3           regression
    NUS-WIDE           161,789              107,859          128*         81          multi-label

TABLE V
PERFORMANCE ON REAL-WORLD DATASETS

                  MNIST      Yeast      Caltech101   NUS-WIDE         MNIST-inpainting   Student-por
                  accuracy   accuracy   accuracy     top-1 accuracy   RMSE               RMSE
    XGBoost       97.86      62.94      56.52        43.72            0.26088            0.24623
    LightGBM      98.03      61.97      55.94        43.99            0.26090            0.24466
    GBDT-sparse   96.41      62.83      43.93        44.05            -                  -
    GBDT-SO       98.08      61.97      56.62        44.10            0.26157            0.24408
    GBDT-MO       98.30      62.29      57.49        44.21            0.26025            0.24392

[Fig. 5] Average training time in seconds for each boosting round of XGBoost, LightGBM, GBDT-SO and GBDT-MO on the four relatively larger datasets (MNIST, Caltech101, MNIST-inpainting, NUS-WIDE). The Y axis is plotted in log scale.

TABLE VI
TRAINING TIME (SECONDS) FOR SPARSE SPLIT FINDING

    Restricted/Unrestricted   MNIST         MNIST-inpainting   Caltech101    NUS-WIDE
    k = 2                     0.619/0.599   -                  -             -
    k = 4                     0.737/0.758   0.149/0.179        -             -
    k = 8                     0.721/0.778   0.161/0.195        0.578/0.663   1.248/1.266
    k = 16                    -             0.153/0.208        0.620/0.738   1.237/1.365
    k = 32                    -             -                  0.628/0.866   1.256/1.453
    k = 64                    -             -                  0.650/0.937   1.263/1.493

TABLE VII
PERFORMANCE FOR SPARSE SPLIT FINDING

                              MNIST         MNIST-inpainting   Caltech101    NUS-WIDE
    Restricted/Unrestricted   accuracy      RMSE               accuracy      top-1 accuracy
    k = 2                     97.54/97.79   -                  -             -
    k = 4                     98.19/98.18   0.26424/0.26368    -             -
    k = 8                     98.07/98.06   0.26245/0.26391    53.57/54.11   44.20/44.20
    k = 16                    -             0.26232/0.26248    55.98/54.69   44.22/44.27
    k = 32                    -             -                  56.71/56.64   44.07/44.04
    k = 64                    -             -                  58.09/57.58   44.23/44.22

Caltech101⁵ contains images of objects belonging to 101 categories. Most categories have about 50 samples. The size of each image is roughly 300 × 200. To obtain fixed length features, we resize each image to 64 × 64 and compute the HOG descriptor [24] of the resized image. Each sample is finally converted to a vector with 324 elements.

MNIST-inpainting is a regression task based on MNIST. We crop the central 20 × 20 patch from the original 28 × 28 image because most boundary pixels are 0. Then, the cropped image is divided into upper and lower halves, each of which is a 10 × 20 patch. The upper half is used as the input. We further crop a 4 × 6 small patch at the top center of the lower half; this small patch is used as the target. The task is to predict the pixels of this 4 × 6 patch given the upper half image.

Student-por⁶ predicts the Portuguese language scores in three different grades of students based on their demographic, social and school related features. The original scores range from 0 to 20; we linearly transform them into [−1, 1]. We use one-hot coding to deal with the categorical features of this dataset.

NUS-WIDE⁷ is a dataset for real-world web image retrieval [25]. Tsoumakas et al. [11] selected a label subset of this dataset and used it for multi-label classification. Images are represented using 128-D cVLAD+ features described in [26].

We summarize the statistics of the above datasets in Table IV. Those datasets are diverse in terms of scale and complexity. For MNIST, MNIST-inpainting and NUS-WIDE, we use the official training-test split. For the others, we randomly select 70% of the samples for training and the rest for test; we repeat this strategy 10 times with different seeds and report the average results. For multi-class classification, we use the cross-entropy loss. For regression and multi-label classification, we use the MSE loss. For regression, the performance is measured by RMSE. For multi-class and multi-label classification, the performance is measured by top-1 accuracy.

⁵www.vision.caltech.edu/Image Datasets/Caltech101/
⁶archive.ics.uci.edu/ml/datasets/Student+Performance
⁷mulan.sourceforge.net/datasets-mlc.html

RMSE and accuracy on the real-world datasets are shown in Table V. The overall performance of GBDT-MO is better than that of the other methods. We provide statistical tests in Appendix C to assess the performance of GBDT-MO on the datasets with random training and testing splits. We compare the loss curves and the accuracy curves of GBDT-MO and GBDT-SO on MNIST and Caltech101 in Fig. 4. We conclude that GBDT-MO has better generalization ability than GBDT-SO, because its training loss is higher while its test performance is better.

We also compare the training speed. Specifically, we run 10 boosting rounds and record the average training time for each round. We repeat this process three times and report the average training time in seconds, as shown in Fig. 5. The training speed on Yeast and Student-por is not reported because those two datasets are too small. GBDT-MO is remarkably faster than GBDT-SO and XGBoost, especially when the number of outputs is large, and it is slightly faster than LightGBM. Since we eliminate the interference of outside factors, the comparisons between GBDT-SO and GBDT-MO show the effects of the proposed method objectively.

C. Sparse Split Finding

All experiments of GBDT-MO in Sections V-A and V-B use the non-sparse split finding algorithm. In this subsection, we evaluate our unrestricted sparse split finding algorithm (Algorithm 3) and restricted sparse split finding algorithm (Algorithm 4). We compare their performance and training speed on MNIST, MNIST-inpainting, Caltech101 and NUS-WIDE with different sparse factors k. The hyper-parameters used here are the same as those of the corresponding non-sparse experiments.

The training speed in seconds is shown in Table VI. The restricted algorithm is faster than the unrestricted one, and the speed differences are remarkable when k is large. The performance on test samples is shown in Table VII. The restricted algorithm is slightly better than the unrestricted one. Note that the performance of our sparse split algorithms is sometimes better than that of the non-sparse one with a proper k (for example, k = 64 on Caltech101).

D. Objective with Exact Hessian

GBDT-MO replaces the exact hessian with the diagonal hessian to reduce computational complexity. We analyze the impact of the diagonal approximation on MNIST and Yeast because their output dimensions are relatively small. Table VIII shows accuracy and training time for a single round. The two methods have similar accuracy, but the diagonal approximation is much faster. Thus, the diagonal approximation of the hessian achieves a good trade-off between accuracy and speed.

TABLE VIII
IMPACT OF DIAGONAL HESSIAN

    Accuracy/Time       MNIST         Yeast
    diagonal hessian    98.30/0.79s   62.29/0.0011s
    exact hessian       98.17/3.96s   62.89/0.0094s

VI. CONCLUSIONS

In this paper, we have proposed a general method to learn GBDT for multiple outputs. The motivation of GBDT-MO is to capture the correlations between output variables and to reduce the redundancy of tree structures. We have derived the approximate learning objective for both the non-sparse case and the sparse case based on the second order Taylor expansion of the loss. For the sparse case, we have proposed the restricted algorithm, which restricts the selected subsets of the left and right parts to be the same, and the unrestricted algorithm, which has no such restriction. We have extended histogram approximation to the multiple output case to speed up the training process. We have shown that GBDT-MO is remarkably and consistently better in generalization ability and faster in training speed compared with GBDT for single output. We have also shown that the restricted sparse split finding algorithm is slightly better than the unrestricted one when both performance and training speed are considered. However, GBDT-MO suffers from slower convergence, especially at the beginning of training. In future work, we would like to improve its convergence speed.

REFERENCES

[1] Jerome H Friedman, "Greedy function approximation: A gradient boosting machine," Annals of Statistics, vol. 29, no. 5, pp. 1189–1232, 2001.
[2] Jerome H Friedman, "Stochastic gradient boosting," Computational Statistics and Data Analysis, vol. 38, no. 4, pp. 367–378, 2002.
[3] Ping Li, "Robust logitboost and adaptive base class (abc) logitboost," in Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, 2010, pp. 302–311.
[4] Chongchong Qi, Andy Fourie, Qiusong Chen, Xiaolin Tang, Qinli Zhang, and Rugao Gao, "Data-driven modelling of the flocculation process on mineral processing tailings treatment," Journal of Cleaner Production, vol. 196, pp. 505–516, 2018.
[5] Christopher J C Burges, "From ranknet to lambdarank to lambdamart: An overview," Learning, vol. 11, no. 23-581, pp. 81, 2010.
[6] Matthew Richardson, Ewa Dominowska, and Robert J Ragno, "Predicting clicks: estimating the click-through rate for new ads," in Proceedings of the International Conference on World Wide Web, 2007, pp. 521–530.
[7] James Bennett, Stan Lanning, et al., "The netflix prize," in Proceedings of KDD Cup and Workshop, New York, NY, USA, 2007, vol. 2007, p. 35.
[8] Tianqi Chen and Carlos Guestrin, "Xgboost: A scalable tree boosting system," in Knowledge Discovery and Data Mining, 2016, pp. 785–794.
[9] Guolin Ke, Qi Meng, Thomas William Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tieyan Liu, "Lightgbm: a highly efficient gradient boosting decision tree," in Advances in Neural Information Processing Systems, 2017, pp. 3149–3157.
[10] Liudmila Ostroumova Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin, "Catboost: unbiased boosting with categorical features," in Advances in Neural Information Processing Systems, 2018, pp. 6638–6648.
[11] Grigorios Tsoumakas, Eleftherios Spyromitros-Xioufis, Jozef Vilcek, and Ioannis Vlahavas, "Mulan: A java library for multi-label learning," Journal of Machine Learning Research, vol. 12, pp. 2411–2414, 2011.
[12] Hanen Borchani, Gherardo Varando, Concha Bielza, and Pedro Larranaga, "A survey on multi-output regression," Data Mining and Knowledge Discovery, vol. 5, no. 5, pp. 216–233, 2015.
[13] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016, http://www.deeplearningbook.org.
[14] Geoffrey E Hinton, Oriol Vinyals, and Jeffrey Dean, "Distilling the knowledge in a neural network," arXiv: Machine Learning, 2015.
[15] Pierre Geurts, Louis Wehenkel, and Florence d'Alché-Buc, "Gradient boosting for kernelized output spaces," in Proceedings of the International Conference on Machine Learning, 2007, pp. 289–296.
[16] Si Si, Huan Zhang, S Sathiya Keerthi, Dhruv Mahajan, Inderjit S Dhillon, and Cho-Jui Hsieh, "Gradient boosted decision trees for high dimensional sparse output," in Proceedings of the International Conference on Machine Learning, 2017, pp. 3182–3190.
[17] Stephen Tyree, Kilian Q Weinberger, Kunal Agrawal, and Jennifer Paykin, "Parallel boosted regression trees for web search ranking," in Proceedings of the International Conference on World Wide Web, 2011, pp. 387–396.
[18] Tianqi Chen, Sameer Singh, Ben Taskar, and Carlos Guestrin, "Efficient second-order gradient boosting for conditional random fields," in Proceedings of the International Conference on Artificial Intelligence and Statistics, 2015, pp. 147–155.
[19] Rahul Agrawal, Archit Gupta, Yashoteja Prabhu, and Manik Varma, "Multi-label learning with millions of labels: recommending advertiser bid phrases for web pages," in Proceedings of the International Conference on World Wide Web, 2013, pp. 13–24.
[20] Yashoteja Prabhu and Manik Varma, "Fastxml: a fast, accurate and stable tree-classifier for extreme multi-label learning," in Proceedings of the International Conference on Knowledge Discovery and Data Mining, 2014, pp. 263–272.
[21] Robert E Schapire and Yoram Singer, "Improved boosting algorithms using confidence-rated predictions," in Proceedings of the International Conference on Learning Theory, 1998, vol. 37, pp. 80–91.
[22] Yonatan Amit, Ofer Dekel, and Yoram Singer, "A boosting algorithm for label covering in multilabel problems," in Proceedings of the International Conference on Artificial Intelligence and Statistics, 2007, pp. 27–34.
[23] Jerome H Friedman, "Multivariate adaptive regression splines," Annals of Statistics, vol. 19, no. 1, pp. 1–67, 1991.
[24] Navneet Dalal and Bill Triggs, "Histograms of oriented gradients for human detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2005, vol. 1, pp. 886–893.
[25] Tat-Seng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, and Yan-Tao Zheng, "Nus-wide: A real-world web image database from national university of singapore," in Proceedings of the ACM Conference on Image and Video Retrieval, Santorini, Greece, July 8-10, 2009.
[26] Eleftherios Spyromitros-Xioufis, Symeon Papadopoulos, Ioannis Kompatsiaris, Grigorios Tsoumakas, and Ioannis P Vlahavas, "A comprehensive study over vlad and product quantization in large-scale image retrieval," IEEE Transactions on Multimedia, vol. 16, no. 6, pp. 1713–1728, 2014.

APPENDIX A
APPROXIMATED OBJECTIVE

We suppose there exists γ > 0 such that

    \gamma H_{ii} \ge \sum_j |H_{ij}|, \quad \forall i    (26)

That is, H is dominated by its diagonal elements. Then, we have:

    w^T H w = \sum_i \sum_j H_{ij} w_i w_j
            \le \frac{1}{2} \sum_i \sum_j |H_{ij}| (w_i^2 + w_j^2)
            = \sum_i \sum_j |H_{ij}| w_i^2
            \le \gamma \sum_i H_{ii} w_i^2    (27)

The last inequality holds based on the assumption in (26). Substituting this result into the Taylor expansion of the loss,

    l(\hat{y} + w, y) \approx l(\hat{y}, y) + g^T w + \frac{1}{2} w^T H w
                      \le l(\hat{y}, y) + \sum_i g_i w_i + \frac{\gamma}{2} \sum_i H_{ii} w_i^2    (28)

We ignore the remainder term of the Taylor expansion [8]. Substituting this upper bound of l(ŷ + w, y) into (12), we get the optimal value of w:

    w^*_j = -\frac{\sum_i (g_j)_i}{\gamma \sum_i (h_j)_i + \lambda}    (29)

Recall that h contains the diagonal elements of H. This solution is the same as (15), except for the coefficient γ. In practice, the actual leaf weight is α w^*_j, where α is the so-called learning rate. Then, the effect of γ can be canceled out by adjusting α and λ. Thus, we do not need to consider the exact value of γ in practice.
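As a quick numerical illustration of (26)-(27) (our own example, not from the paper), consider a 2 × 2 hessian:

    H = \begin{pmatrix} 2 & 1 \\ 1 & 3 \end{pmatrix}, \qquad
    \sum_j |H_{1j}| = 3 \le \gamma H_{11}, \quad \sum_j |H_{2j}| = 4 \le \gamma H_{22}
    \;\Rightarrow\; \gamma = \tfrac{3}{2} \text{ suffices.}

    \text{For } w = (1, 1)^T: \quad w^T H w = 2 + 3 + 2 \cdot 1 = 7
    \;\le\; \gamma \sum_i H_{ii} w_i^2 = \tfrac{3}{2}(2 + 3) = 7.5,

so the diagonal upper bound indeed holds for this choice of γ.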

APPENDIX B
HYPER-PARAMETER SETTINGS

We discuss our hyper-parameter settings. We find the best maximum depth d and learning rate via a grid search: d is searched from {4, 5, 6, 7, 8, 9, 10} and the learning rate is searched from {0.05, 0.1, 0.25, 0.5}. We provide the selected d and learning rate in Table IX. We set the L2 regularization λ to 1.0 and the maximum number of leaves to 0.75 × 2^d in all experiments. We preliminarily search the other hyper-parameters and fix them for all methods; we list them for the real-world datasets in Table XI. Note that the hyper-parameters in Table XI may not be optimal. When comparing with the gain threshold, we use the average gain over the involved output variables. We provide the number of trees of the best models in Table X. For single variable GBDT models, the number of trees is the summation of the number of trees for each output variable.

APPENDIX C
STATISTICAL TEST OF RESULTS

There are no standard training and testing splits for Yeast, Caltech101 and Student-por. We obtain the results in Table V by averaging the results of 10 random trials. Thus, statistical tests on how likely GBDT-MO is better than the others are necessary. Denote by X a random variable that represents the performance difference between GBDT-MO and another method. X is commonly supposed to follow a Student's t-distribution because its true variance is unknown. We estimate the mean and standard deviation of X from the 10 random trials. Then, the confidence that GBDT-MO performs better than the other method is defined as P(X > 0) for classification and P(X < 0) for regression, as reported in Table XII. Most confidence scores are significantly higher than 0.5. Thus, it can be expected that GBDT-MO performs better than the others.
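One possible way to compute such a confidence score in Python is sketched below (our sketch; the paper does not specify the exact estimator, so the use of scipy.stats.t and of the standard error of the mean here are assumptions, and the example numbers are hypothetical).

```python
import numpy as np
from scipy import stats

def superiority_confidence(ours, baseline, higher_is_better=True):
    """Confidence that 'ours' beats 'baseline', given paired results of repeated trials."""
    diff = np.asarray(ours) - np.asarray(baseline)
    if not higher_is_better:          # e.g. RMSE: smaller is better
        diff = -diff
    n = len(diff)
    mean = diff.mean()
    sem = diff.std(ddof=1) / np.sqrt(n)
    # P(X > 0) under a Student's t model for the mean difference
    return 1.0 - stats.t.cdf(0.0, df=n - 1, loc=mean, scale=sem)

# Hypothetical accuracies from 10 random splits (illustration only).
acc_mo = [62.3, 62.5, 61.9, 62.8, 62.1, 62.6, 62.0, 62.4, 62.7, 62.2]
acc_so = [61.9, 62.0, 61.8, 62.1, 61.7, 62.2, 61.6, 62.0, 62.3, 61.9]
print(superiority_confidence(acc_mo, acc_so))
```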

TABLE IX
GRID SEARCHED LEARNING RATE AND MAXIMUM DEPTH

    Learning rate/Maximum depth   MNIST     Yeast    Caltech101   NUS-WIDE   MNIST-inpainting   Student-por
    XGBoost                       0.10/8    0.10/4   0.10/8       0.10/8     0.10/8             0.10/4
    LightGBM                      0.25/6    0.10/4   0.10/9       0.05/8     0.10/8             0.05/4
    GBDT-sparse                   0.05/8    0.10/5   0.10/9       0.10/8     -                  -
    GBDT-SO                       0.10/6    0.10/4   0.05/8       0.05/8     0.10/8             0.05/4
    GBDT-MO                       0.10/8    0.10/5   0.10/10      0.10/8     0.10/7             0.10/4

TABLE X
NUMBER OF TREES OF LEARNED MODELS

    Method        MNIST   Yeast   Caltech101   NUS-WIDE   MNIST-inpainting   Student-por
    XGBoost       1700    299.0   2234.0       5921       4880               133.4
    LightGBM      1760    334.0   2692.0       11636      5618               219.6
    GBDT-sparse   645     21.4    237.4        494        -                  -
    GBDT-SO       2180    183.0   3697.0       12051      6373               373.7
    GBDT-MO       615     85.1    517.7        3546       2576               125.0

TABLE XI
PRELIMINARILY SEARCHED HYPER-PARAMETERS

    Hyper-parameter       MNIST   Yeast   Caltech101   NUS-WIDE   MNIST-inpainting   Student-por
    Gain threshold        1e-3    1e-3    1e-3         1e-6       1e-6               1e-6
    Min samples in leaf   16      16      16           4          4                  4
    Max bins              8       32      32           64         16                 8

TABLE XII
CONFIDENCE OF THE SUPERIORITY OF GBDT-MO

    Confidence   Yeast   Caltech101   Student-por
    XGBoost      0.067   0.930        0.888
    LightGBM     0.745   0.988        0.685
    GBDT-SO      0.812   0.955        0.561