arXiv:1909.12098v2 [cs.LG] 27 Jan 2021 fcasfir r n ftems fetv ehius clea techniques, effective [29]. most networks the ral contrast of In one 25]. are etc.[18, classifiers speech, of images, as e performan such can outstanding data that tured shown algorithms have find architectures to Deep importance classi data. capital or of is [27] It prediction [21]. energy renewable [1], suc prediction application the novel of for examples piece Some day. fundamental every applications a becoming is learning Machine Introduction 1 Ns[2.I eetyas rdetbotn 1] fairly a [12], as boosting such gradient models years, complex recent even In or [22]. 6] CNNs [10, models simple multiple sul oi´ciaSpro,UiesddAtooad M Aut´onoma de Universidad Polit´ectica Superior, Escuela nebemtosaevr fetv tipoigtegenera the improving at effective very are methods Ensemble eunilTann fNua ewrswith Networks Neural of Training Sequential hn tec tp oto ftentok opsdof composed network, the of portion a step, a each the minimizes at that Then, approximation constant a with tialized T trtos ial,the comp Finally, data iterations. training the on pseudo-residuals the approximate h dpaino h oe odfeetcasfiainspee classification different to model fly. generalizat the its of of (th adaptation test reduction the during significant units a hidden without of number trained) a off proposed stan the switch that to different show permits we with Furthermore, SGD. trained and networks L-BFGS Adam, neural to gene competitive respect a with showing out carried are tasks regression ftentok nta ftann h ewr sawoe th whole, prov a in weights as sequentially its network network the the and trains training layer th of hidden Instead where network. last model the additive the of an of as responses seen the be of also approxi can to network sequentially trained neural are A models ex of additive series an a is which boosting Gradient (NN). network neural low × hspprpeet oe ehiu ae ngain boos gradient on based technique novel a presents paper This J ern ntehde ae.Etnieeprmnsi class in experiments Extensive layer. hidden the in neurons eesmnEai ozl Mart´ınez-Mu˜noz Gonzalo Emami, Seyedsaman T rdetBoosting Gradient ata oesadba r nertda igeN with NN single a as integrated are bias and models partial T tp.Frt h istr ftentoki ini- is network the of term bias the First, steps. Abstract 1 o blt.Ti permits This ability. ion l upromn epneu- deep outperforming rly o aua aa ensembles data, tabular for , cto fglci sources galactic of fication nld iatv molecule bioactive include s J eaels ftedata. the of loss verage aiainperformance ralization tdfo h previous the from uted eurmnso the on requirements d aeagvnfunction. given a mate adsles uhas such solvers, dard l ehiu,hsgained has technique, old fiinl adecomplex handle fficiently e seilywt struc- with especially ces rpsdalgorithm proposed e ern,i rie to trained is neurons, aso loih in algorithm pansion nt htwr last were that units e d h nloutput final the ide eso oeadmore and more of cess igt ri shal- a train to ting clrproduct scalar e Ls[6 rdeep- or [26] MLPs iainacrc of accuracy lization ehddesign method fiainand ification di,Spain adrid, much attention, specially due to the novel and computationally efficient version of gra- dient boosting called eXtreme Gradient Boosting or XGBoost [7]. Gradient boosting builds an additive model as a series of regressors to gradually minimize a given . When gradient boosting is combined with several stochastic techniques, as bootstrapping or feature sampling from [5], its performance generally improves [13]. In fact, this combination of randomization techniques and optimization has placed XGBoost among the top contenders in Kaggle competitions [7] and pro- vide excellent performance in a variety of applications as in the ones mentioned above. Based on the success of XGBoost, other techniques have been proposed like [24] and LightGBM [14], which propose improvements in training speed and general- ization performance. More details about these methods can be seen in the comparative analysis of Bent´ejac et al. [2]. The objective of this study is to combine the stage-wise optimization of gradient boosting into the training procedure of a shallow neural network. The result of the pro- posed algorithm is an alternative for training a single neural network (not an ensemble of networks). Several related studies propose hybrid algorithms that, for instance, trans- form a decision forestinto a single neuralnetwork [28, 3] or that use a deep architecture to train a tree forest [16, 17]. In [28], it is shown that a pre-trained tree forest can be cast into a two-layer neural network with the same predictive outputs. First, each tree is converted into a neural network. To do so, each split in the tree is transformed into an individual neuron that is connected to a single input attribute (split attribute) and whose activation threshold is set to the split threshold. In this way, and by a proper combination of the outputs of these neurons (splits) the network mimics the behaviour of the decision tree. Finally, all neurons are combined through a second layer, which recovers the forest decision. The weights of this network can be later retrained to obtain further improvements [3]. In [16, 17], a decision forest is trained jointly by means of a deep neural network that learns all splits of all trees of the forest. To guide the network to learn the splits of the trees, a procedure that trains the trees using back-propagation is proposed. The final output of the algorithm is a decision forest whose performance is remarkable in image classification tasks. However, we could not find any proposal that applies the training procedures of ensembles for the creation of a single neural network. In this paper, we propose a combination of ensembles and neural networks that is somehow complementaryto the work of [16, 17], that is, a single neural network that is trained using an ensemble training algorithm. Specifically, we propose to train a neural network iteratively as an additive ex- pansion of simpler models. The algorithm is equivalent to gradient boosting: first, a constant approximation is computed (assigned to the bias term of the neural network), then at each step, a regression neural network with a single (or very few) neuron(s) in the hidden layer is trained to fit the residuals of the previous models. All these models are then combined to form a single neural network with one hidden layer. This training procedure provides an alternative to standard training solvers (such as Adam, L-BFGS or SGD) for training a neural network. In addition, the proposed method has an addi- tive neural architecture in which the latest computed neurons contribute less to the final decision. This can be useful in computationally intensive applications as the number of active models (or neurons) can be gauged on the fly to the available computational resources without a significant loss in generalization accuracy. The proposed model is

2 tested on multiple classification and regression problems. These experiments show that the proposed method for training a shallow neural network is a good alternative to other standard methods. The paper is organizedas follows: Section 2 describes gradient boosting and how to apply it to train a single neural network; In section 3 the results of several experimental analysis are shown; Finally, the conclusions are summarized in the last section.

2 Methodology

In this section, we show the gradient boosting mathematical framework [12] and the applied modifications in order to use it for training a neural network sequentially. The proposed algorithm is valid for multi-class and binary classification, and for regression. Finally, an illustrative example is given.

2.1 Gradient boosting

N Given a training dataset D = xi,yi 1 , the goal of algorithms is to find an approximation, Fˆ(x),{ of the} objective function F ∗(x), which maps instances x to their output values y. In general, the learning process can be posed as an opti- mization problem in which the expected value of a given loss function, E[L(y, F (x))], is minimized. A data-based estimate can be used to approximate this expected loss: N i=1 L(yi, F (xi)). In the specific case of gradient boosting, the model is built using an additive expan- sionP Ft(x)= Ft−1(x)+ ρtht(x), (1) th where ρt is the weight of the t function, ht(x). The approximation is constructed stage-wise in the sense that at each step, a new model ht is built without modifying any of the previously created models included in Ft−1(x). First, the additive model is initialized with a constant approximation

N

F0(x) = argmin L(yi, α) (2) α i=1 X and the following models are built in order to minimize

N

(ρt,ht(x)) = argmin L(yi, Ft−1(xi)+ ρht(xi)) . (3) ρ,h i=1 X However, instead of jointly solve the optimization for ρ and ht, the problem is split into two steps. First, each model ht is trained to learn the data-based gradient vec- tor of the loss-function. For that, each model, ht, is trained on a new dataset D = N xi, rti , where the pseudo-residuals, rti, are the negativegradient of the loss func- { }i=1 tion at Ft−1(xi) ∂L(yi, F (xi)) rti = (4) − ∂F (xi) x x F ( )=Ft−1( )

3 The function, ht, is expected to output values close to the pseudo-residuals at the given data points, which are parallel to the gradient of L at Ft−1(x). Note, however, that the training process of h is generally guided by square-error loss, which may be different from the given objective loss function. Notwithstanding, the value of ρt is subsequently computed by solving a line search optimization problem on the given loss function

N

ρt = argmin L(yi, Ft−1(xi)+ ρht(xi)) . (5) ρ i=1 X The original formulation of gradient boosting is, in some final derivations, only valid for decision trees. Here, we present an extension of the formulation of gradient boosting to be able to use any possible regressor as base model and we describe how to integrate this process to train a single neural network, which is the focus of the paper.

2.2 Binary classification For binary classification, in which y 1, 1 , we will consider the logistic loss ∈ {− } L(y, F ( )) = ln(1+exp( 2yF ( ))) , (6) · − · 1 p(y=1|x) which is optimized by the logit function F (x)= 2 ln p(y=−1|x) . For this loss function the constant approximation of Eq. 2 is given by

N

F0 = argmin ln(1 + exp( 2yiα)) = α − i=1 X 1 p(y = 1) 1 1 y = ln = ln − , 2 p(y = 1) 2 1+ y − where y is the mean value of the class labels yi. The pseudo-residuals given by Eq.4 on which the model ht is trained for the logistic loss can be calculated as

rti =2yi/(1 + exp(2yiFt−1(xi))) . (7)

Once ht is built, the value of ρt is computed using Eq. 5 by minimizing

N f(ρ)= ln(1 + exp( 2yi(Ft (xi)+ ρht(xi)))) . − −1 i=1 X There is no close form solution for this equation. However, the value of ρ can be ap- proximated by a single Newton-Raphson step

′ N f (ρ = 0) i=1 rtiht(xi) ρt = . (8) ′′ N 2 ≈−f (ρ = 0) rti(2yi rti)h (xi) i=1P − t This equation is valid for any base regressorP used as additive model and not only for decision trees as in the general gradient boosting framework [12]. The formulation in

4 [12] is adapted to the fact that decision trees can be seen as piecewise-constant additive models, which allows for the use of a different ρ for each tree leaf. Finally, the output for binary classification of gradient boosting composed of T models for instance x is given by the probability of y =1 x | p(y =1 x)=1/(1 + exp( 2FT (x))) . (9) | − 2.3 Multi-class classification For multi-class classification with K > 2 classes, the labels are defined with 1-of-K vectors, y, such that yk = 1 if the instance belongs to class k and yk = 0, otherwise. In this context the output for x of the additive model is also a K-dimensional vector F(x). The cross-entropy loss function is used in this context

K L(y, F( )) = yk ln pk( ) , (10) · − · k X=1 where pk( ) is the probability of a given instance of being of class k · exp(Fk( )) pk( )= K · (11) · exp(Fl( )) l=1 · The additive model is initialized with constantP value 0 as Fk,0 =0 k, that correspond to a probability equal to 1/K for all classes and instances. ∀ The pseudo-residuals, given by Eq. 4, on which the model ht is trained for the multi-class loss are the derivative of Eq. 10 with respect to Fk evaluated at Fk,t−1

K yij ∂pj (xi) rtik = = (12) pj (xi) ∂Fk(xi) j=1 x x X Fk( )=Fk,t−1( )

K

= yij (δkj pk,t (xi)) = yik pk,t (xi) , (13) − −1 − −1 j=1 X K where δkj is Kronecker delta and the fact that yij =1 i is used in the final step. j=1 ∀ In this study, a single model, ht is going to be trained per iteration to fit the residuals for all K classes. In contrast to the K decision treesP per iteration that are built in gradient boosting. Then a line search for each of the K outputs of the model is computed by minimizing

N K exp(Fj,t−1(xi)+ ρj hj,t(xi)) f(ρk)= yij ln (14) − K i=1 j=1 " l=1 exp(Fl,t−1(xi)+ ρlhl,t(xi))# X X with a Newton-Raphson step P

′ N f (ρk = 0) i=1 hk,t(xi)(yik pk,t−1(xi)) ρk,t = = − . (15) ′′ N 2 −f (ρk = 0) − " h (xi)pk,t−1(xi)(pk,t−1(xi) 1)# i=1P k,t − P 5 hidden hidden layers layers

......

......

hidden hidden layers layers

......

......

Figure 1: Illustration of binary and multi-class neural networks and their parameters (left diagrams) and neural networks with one unit highlighted in black that represents the tth model trained in gradient boosted neural network (right diagrams).

In the same way as for Eq. 8, this equation is valid for all type of base learners and not specific for decision trees as in the original formulation [12], which opens the possibility to apply gradient boosting to other base learners. The final output of gradient boosting composed of T models for multi-class tasks is the probability yk =1 x |

exp(Fk,T (xi)) p(yk =1 x)= (16) | K l=1 exp(Fl,T (xi)) P 2.4 Neural network as an additive model A multi-layeredneural network can be seen as an additive model of its last hidden layer. The output of the last hidden layer for a fully connected neural network for binary tasks is (using the parametrization shown in Fig 1 top left)

T p(y =1 x)= σ ωtzt (17) | t=0 ! X

6 and for multi-class (bottom left in Fig 1)

T p(y = k x)= σ ωtkzt (18) | t=0 ! X with zt being the outputs of the last hidden layer, ωt and ωtk the weights of the last layer and σ the activation function (for classification). For regression, no activation function is used. It is not straightforward to adapt this process to train deeper layers as we take advantage of the additive nature of the final output of the network. In the standard NN training procedure all parameters of the model (i.e. vt and ωk or ωtk as show in Fig. 1 left diagrams) are fitted jointly with back-propagation. Moreover, to optimize the loss function of the neural network, there are different methods, such as Stochastic (SGD) [8], Adam, which is an algorithm for the first-order gradient-based optimization of stochastic function [15], or L-BFGS, a quasi-Newton based method [20]. In the proposed method, instead of learning the weights jointly, the parameters are trained sequentially using gradient boosting. To do so, after computing F0, a fully connected regression neural network with a single (or few) neuron(s) in the last hidden layer is trained. This neural network is trained on the residuals given by the previous iteration as given by Eq. 7 for binary classification and Eq. 12 for multi-class tasks. Figure 1 (right diagrams) shows highlighted in black the regression neural network to be trained in iteration t corresponding to model ht when one hidden unit is used. After model t has been trained, the value of ρt is computed using Eq. 8 for binary problems and ρk,t (Eq. 15) for multi-class classification. Once all T models have been trained, a neural network as shown in Fig. 1 (left diagrams), with T units in the last hidden layer ′ is obtained by assigning all the weights necessary to compute the zt variables (i.e. vt in Fig. 1 right) to the corresponding weights in the final NN (i.e. vt in Fig. 1 left) for binary and multi-class tasks

′ vt = vt t =1,...,T and the weights ωt of the output layer for binary classification to

T ′ ω0 = F0 + 2ρtω t0 t=1 ′X ωt =2ρtωt t =1,...,T and for multi-class to

T ′ ω0k = ρk,tω t0k t=1 X ′ ωtk = ρk,tωtk t =1,...,T k =1,...,K

Finally, to recover the probability of y = k x in the outputof the NN as given by Eq. 9 and Eq. 16, the activation function should be| a sigmoid (i.e. σ(x)=1/(1 + exp( x)) −

7 T=1 T=2 T=3 T=4 T=100

1.5 1.5 1.5 1.5 1.5

1.0 1.0 1.0 1.0 1.0

0.5 0.5 0.5 0.5 0.5

0.0 0.0 0.0 0.0 0.0

− 0.5 − 0.5 − 0.5 − 0.5 − 0.5

− 1.0 − 1.0 − 1.0 − 1.0 − 1.0

− 1.5 − 1.5 − 1.5 − 1.5 − 1.5

− 1 0 1 − 1 0 1 − 1 0 1 − 1 0 1 − 1 0 1

2.0 2.0 2.0 2.0 2.0

1.5 1.5 1.5 1.5 1.5

1.0 1.0 1.0 1.0 1.0

0.5 0.5 0.5 0.5 0.5

0.0 0.0 0.0 0.0 0.0

− 0.5 − 0.5 − 0.5 − 0.5 − 0.5

− 1.0 − 1.0 − 1.0 − 1.0 − 1.0

− 1.5 − 1.5 − 1.5 − 1.5 − 1.5

− 2.0 − 2.0 − 2.0 − 2.0 − 2.0 − 2 − 1 0 1 2 − 2 − 1 0 1 2 − 2 − 1 0 1 2 − 2 − 1 0 1 2 − 2 − 1 0 1 2

Figure 2: Classification boundaries for gradient boosted neural network (top row) and for a standard neural network (bottom row). Each column shows the results for a com- bination of a different number of models (units). Top plots are the sequential results of a single GBNN model, whether bottom plots are independent neural networks models for binary classification with the logistic loss (Eq. 6), and the soft-max activation func- tion (i.e. σ(x) = exp(xk)/ k exp(xk)) for multi-class tasks when using the cross- entropy loss (Eq. 10). This training procedure can be easily modified to larger incre- ments of neurons, so that insteadP of a single neuron per step (a linear model), a more flexible model, comprising more units (J), can be trained at each iteration. The proposed training procedure can be further tuned by applying subsampling and/or shrinking, as generally used in gradient boosting [13, 11, 7]. In shrinking, the additive expansion process is regularized by multiplying each term ρtht by a constant , ν (0, 1], to prevent overfitting when multiple models are combined [11]. Subsampling∈ consists of training each model on a random subsample without replacement from the original training data. Subsampling has shown to improve the performance of gradient boosting [13]. The overall computational time complexity to train the proposed algorithm is the same as that of a standard NN with the same number of hidden neurons and training epochs. Note that, at each step, one NN with one (or few) neurons in the hidden layer is trained. Hence, at the end of the process, the same number of weights updates are performed. Note, however, that the proposed method is sequential and cannot be easily parallelized. Once the model is trained, the computational complexity for classifying newinstancesis equivalentto that of a standardNN as the generated model is a standard neural network.

2.5 Illustrative example To illustrate the workings of this algorithm, we show its performance in a toy classifica- tion problem. The toy problem task consists in a 2D version of the ringnorm problem [4]: where both classes are 2D gaussian distribution, one with < 0, 0 > mean and with covariance four times the identity matrix, and the second class with mean value at < 2/√2, 2/√2 > and the identity matrix as covariance.

8 The proposed gradient boosted neural network (GBNN) with T = 100 and one hidden layer is trained on 200 randomly generated instances of this problem. In ad- dition, 100 independent neural nets with hidden units in the range [1, 100] are also trained using the same training set. In Fig. 2, the boundaries for the different stages of the process are shown graphically. In detail, the first and second rows show the results for GBNN and NN, respectively. Each column shows the results for T = 1, T = 2, T = 3, T = 4 and T = 100 neurons in the hidden layer respectively. Note that the plots for GBNN are sequential. That is, the first column shows the first trained model; the second column, the first two models combined,and so on. For the NN, each column corresponds to a different NN with a different number of neurons in the hidden layer. For each column, the architecture of the networks and the number of weights (but not their values) are the same. The color of the plots represent the probability p(y = 1 x) given by the models (Eq. 9) using the viridis colormap. In addition, all plots show| the training points. As we can see both GBNN and NN start, as expected, with very similar models (column T =1). As the number of models (neurons for NN) increases, GBNN builds up the boundaryfrom previousmodels. On the other hand, the standard neural network, as it creates a new model for each size, is able to adjust faster to the data. However, as the number of neurons increases, NN tends to overfit in this problem (as shown in the bottom right-most plot). On the contrary, GBNN tends to focus on the unsolved parts of the problem: the decision boundary becomes defined only asymptotically, as the number of models (neurons) becomes large.

3 Experimental results

In this section, an analysis of the efficiency of the proposed neural network train- ing method based on gradient boosting is tested on twelve binary classification tasks, six multi-class problems, and seven regression datasets from the UCI repository [19]. These datasets, shown in Table 1, have a different number of instances and attributes and come from different fields of application. The Energy dataset has two different target columns (cooling and heating respectively), so the different algorithms were ex- ecuted for both objectives separately. We modified some of the datasets. For Diabetes duplicated instances and instances with missing values were removed.In addition, cate- gorical values were substituted by dummy variables in: German Credit Data, Hepatitis, Indian Liver Patient, MAGIC and Tic-tac-toe. The implementation of the proposed method (gradient boosted neural network) was done in python following the standards of scikit-learn. The implementa- tion of the algorithm will be available at https://github.com/GAA-UAM/. The proposed method is compared with respect to standard neural network training procedures using three different solvers: Adam, L-BFGS and SGD. The scikit-learn package [23] was used in all the experiments. For the classification and regression experiments, single hidden layer networks were trained with three different standard solvers and using the proposed gradient boosting approach. The comparison was carried out using 10-fold cross-validation. For the Waveform dataset we also considered 10 random train-test partitions using 300

9 Dataset Area Instances Attribs. Binary classification AustralianCreditApproval Financial 690 14 Banknote Computer 1372 5 Breastcancer Life 569 32 Diabetes Life 381 9 GermanCreditData Financial 1000 20 Hepatitis Life 155 19 Ionosphere Physical 351 34 IndianLiverPatient Life 583 10 MAGICGammaTelescope Physical 19020 11 Sonar Physical 208 60 Spambase Computer 4601 57 Tic-tac-toe Game 958 9 Multi-class classification Digits Computer 1797 64 Iris Life 150 4 Vehicle Transportation 946 18 Vowel notdefined 990 11 Waveform Physical 5000 21 Wine Physical 178 13 Regression Concrete Physical 1030 9 Energy Computer 768 8 Boston Housing Business 506 14 Power Computer 9568 4 Winequality-red Business 1599 12 Wine quality-white Business 4898 12

Table 1: The details of datasets, which used in experimental analysis instances for training the models and the rest for test as experiments with this dataset generally consider this experimental setup [4]. The optimum hyper-parameter setting for each method was estimated using within-train 10-fold cross validation. For the stan- dard neural network and all solvers, the grid with the number of units in the hidden layer was set to [1, 3, 5, 7, 11, 12, 17, 22, 27, 32, 37, 42, 47, 52, 60, 70, 80, 90, 100, 150, 200]. In addition, for the SGD solver, the learning policies [Adaptive, Constant] were also considered in their grid searches. The rest of the hyper-parameter were set to their default values. For GBNN, a sequentially trained neural network with 200 hidden units was built in steps of J units per iteration. The within train hyper-parameter search was carried out using the following values for binary classification: [0.1, 0.25, 0.5, 1.0]

10 for the learning rate, [0.5, 0.75, 1.0] for subsample rate and J [1, 2, 3]. For the multi- class and regression problems the hyper-parameter grid was∈extended a bit because in some datasets the extreme values were always selected. Hence, the grid for multi- class and regression problems is set to: [0.025, 0.05, 0.1, 0.5, 1] for the learning rate, [0.25, 0.5, 0.75, 1.0] for subsample rate and J [1, 2, 3, 4]. Moreover, for the proposed procedure, the sub-networks are trained using∈ the L-BFGS solver for all datasets. This decision was made after some preliminary experiments that showed that L-BFGS pro- vided good results in general for the small networks we are training. Subsequently, the best sets of hyper-parameters obtained in the grid search for each method were used to train, on the whole training set, one standard neural network for each solver and the proposed gradient boosted neural network. Finally, the average generalization performance of the four neural networks was estimated in the left-out test set. For all analyzed datasets, the average generalization performance and standard de- viations are shown in Table 2 for the proposed neural network training method with 200 hidden units (column GBNN), and standard neural networks trained with Adam (column labeled NN–Adam), L-BFGS (column NN–L-BFGS) and SGD (column NN– SGD). For classification problems the generalization performance is given as average accuracy and, for regression, as root mean square error. The best result for each dataset is highlighted with a light yellow background.The table is structured in three blocks de- pending on the problem type: binary classification tasks, multi-class classification, and regression. An overall comparison of these results is shown graphically in Fig. 3 using the methodology described in [9]. These plots show the average rank for the studied methods across the analyzed datasets where a higher rank (i.e. lower values) indicates better results. The figure shows the Demsˇar plots for the classifications tasks only (top left plot), for regression only (top right plot), and for all analyzed tasks (bottom plot). The statistical differences between methods are determined using a Nemenyi test. In the plot, the differencein the average rank of the two methods is statistically significant if the methods are not connected with a horizontal solid line. The critical distance (CD) in average rank over which the performance of two methods is considered significant is shown in the plot for reference (CD = 1.07, 1.77, and 0.91 for 19 classification, 7 regression, and all datasets respectively considering 4 methods and p-value < 0.05). From Table 2 and Fig. 3, it can be observed that, for the studied datasets, the best overall performing method is GBNN. This method obtained the best results in 14 out of 26 tasks. The neural network method with Adam as its solver obtained the best result in seven datasets. SGD and L-BFGS solver obtained both the best outcome in four datasets. In classification, the differences in average accuracy among the differ- ent methods are generally favorable to NN–Adam and GBNN. Although the differ- ences between these two methods is generally small. This issue is especially evident in Magic, where the accuracy of GBNN is only 0.33 percent points worse than that NN–Adam. Although, small differences are not always the case. The most notable dif- ference between these two methods is in Tic-tac-toe where NN–Adam is 8.35% worse than the result obtained by GBNN. In contrast, the most favorable outcome for NN– Adam is obtained in Hepatitis and Vowel where its accuracy is 3.48% and 3.84% bet- ter than that of GBNN respectively. The results in classification for NN–L-BFGS and NN–SGD are generally worse than those of the other two methods. Although, for most datasets the differences are small. In regression, the results are more clearly in favor

11 Dataset GBNN NN–Adam NN–L-BFGS NN–SGD Binary classification Australian Credit Approval 85.51% 4.40 85.80% 2.88 84.35% 6.21 87.25% 3.6 Banknote 100.00%± 0.00 100.00%± 0.00 100.00%± 0.00 97.37% ±1.88 Breast cancer 97.89% ±2.05 97.72% ±1.12 96.66% ±2.28 96.48%±1.92 Diabetes 77.42%±6.06 73.99% ±12.89 73.24%±6.99 77.42%± 5.83 GermanCreditData 73.20%±4.12 73.20%± 3.34 73.4% ±4.43 73.6% ±3.20 Hepatitis 83.81%±9.57 87.29%±6.26 82.43%± 5.61 85.29%± 6.49 Ionosphere 92.59%±4.28 92.04%±3.50 91.44%±4.62 90.02%± 5.47 Indian Liver Patient 72.03%±4.10 69.75%±5.12 67.71%±5.93 70.98% ±1.38 MAGICGammaTelescope 87.41%±0.60 87.74%±0.84 87.68%±0.85 86.44%± 0.56 Connectionist Bench (Sonar) 80.29%±8.89 83.12%±6.31 84.10%±5.37 81.81%± 10.2 Spambase 94.33%±1.42 94.65%±0.67 93.20%±0.84 93.52%± 1.22 Tic-tac-toe 99.06%±1.09 90.71%±3.22 94.26%±2.43 71.82%± 3.86 Multi-class classification ± ± ± ± Digits 96.94% 1.15 98.16% 1.48 97.55% 1.76 96.55% 1.18 Iris 96.67% ±4.47 94.67%±4.99 95.33%±5.21 87.33%±8.14 Vehicle 81.04%± 1.98 83.34%±4.63 82.39%±3.76 74.11%±3.75 Vowel 90.81%±4.27 94.65%±2.96 92.93%±2.78 55.76%±4.09 Waveform 86.88% ±1.03 86.46% ±0.96 85.98%±1.28 86.72% ±1.26 Waveform-300 83.32%± 0.78 83.76%± 0.82 86.18%±1.72 84.43%± 0.60 Wine 98.86%± 2.29 98.33%± 3.56 96.60%±4.48 97.19%± 3.75 ± ± ± ± Regression Concrete 6.63 2.32 11.34 1.83 8.26 3.13 7.57 2.01 Energy cooling 1.17±0.61 3.53 ±0.61 1.49±0.66 3.19 ±0.42 Energyheating 0.88 ±1.08 3.09±0.95 0.83±0.82 2.95 ±0.88 BostonHousing 4.21±2.31 4.69±1.96 4.92±2.01 3.94±2.01 Power 3.83± 0.18 4.24±0.21 4.12±0.18 4.12 ±0.18 Wine quality-red 0.64±0.05 0.68±0.04 0.66±0.03 0.663± 0.04 Wine quality-white 0.71±0.06 0.726± 0.06 0.725± 0.06 0.718 ±0.05 ± ± ± ± Table 2: Average generalization performance and standard deviation for gradient boosted neural network (GBNN) and neural networks trained with solvers Adam, L- BFGS and SGD. The performance is measured using accuracy for the classification tasks and using root mean square error (RMSE) for regression problems. The best re- sults for each dataset are highlighted with a light yellow background. of GBNN. The proposed method obtains the best performance in all tested datasets except for two datasets (Boston Housing and Energy heating) in which GBNN get the second best result. The performance of NN–Adam in regression is suboptimal getting the worst performances in general. Solvers L-BFGS and SGD perform more closely to the performance of GBNN. These results can also be observed from Fig. 3, the performance of NN–L-BFGS and NN–SGD in the classification tasks is worse than that of NN-Adam and GBNN although the differences are not statistically significant among none of the methods (subplot a). For regression (subplot b), the best performing method is GBNN. The per- formance of GBNN is significantly better than that of NN-Adam, which has the worst performance for these tasks. In addition, GBNN for regression performs generally bet- ter than NN–L-BFGS and NN–SGD although the differences are not statistically sig-

12 CD

GBNN NN−L−BFGS NN−Adam NN−SGD

1 2 3 4 5 (a) Classification CD

NN−SGD NN−L−BFGS GBNN NN−Adam

1 2 3 4 5 (b) Regression

CD

NN−Adam NN−L−BFGS

GBNN NN−SGD 4¢ £ 1.0 1.5 2.0 2.5 3 ¡ 3.5 (c) All of datasets

Figure 3: Average ranks (higher rank is better) for GBNN, NN(adam), NN(lbfgs) and NN(sgd) for 26 datasets. a) The Demsar plot for binary and multiclass datasets. b) The Demsar plot for regression datasets. c) The Demsar plot considering all of the datasets. nificant. The Demsar plot for all datasets (subplot c) shows that GBNN has the best rank followed by NN–Adam, NN–L-BFGS and NN–SGD. In order to analyze the evolution of the networks generated by the proposed method as more hidden units are included, we have carried out another experiment that builds a GBNN with 1000 hidden units. The evolution is shown in Fig. 4 for Tic-tac-toe (left plot) and Spambase (right plot). The plots show the average test accuracy (blue curve) and average loss in train (red curve) with respect to the number of hidden units. For this experiment, 10-fold cross-validation was used. In this case, the hyper-parameters were not tuned for each partition. Instead, they were set to the values more often selected in the previous experiment for the each dataset. Specifically, they were set to J = 2, learning rate to 0.5 and subsampling to 1 for Tic-tac-toe and to J =1, learning rate to 0.5 and subsampling to 0.75 for Spambase for all partitions. From these plots, we can observe that the GBNN model steadily reduces the average loss as more hidden units are included into the network. The average test accuracy (see blue curves in Fig 4) also improves as more units are considered although the evolution is more irregular than that of the average loss in train. More importantly, we can observe that the curves tend to stabilize with the number of units. Hence, if the latter units are removed, the

13 Loss and Accuracy - Dataset: tic-tac-toe-endgame Loss and Accuracy - Dataset: spambase 0.70 0.65 0.65 0.60 0.60 0.945 0.55 0.95 0.55 0.50 0.50 0.45 0.90 0.45 0.940 0.40 0.40 0.35 0.35 0.30 0.85 0.30 0.935 0.25 0.25 0.20 0.20 0.15 0.80 0.15 0.930 0.10 0.10 0.05 0.75 0.05 0.00 0.925 0.00 0 200 400 600 800 1000 0 200 400 600 800 1000 Number of units Number of units

Figure 4: Average generalization accuracy of gradient boosted neural network concern- ing the number of models combined (blue curve) and average loss in train with respect to the number of hidden units (red curve) for Tic-tac-toe (left plot) and Spambase (right plot) performance in accuracy of the model is not damaged to a great extent. For instance, in Spambase, if the number of units is reduced from T = 1000 to T 50 the model accuracy only drops from 94.63% to 94.17%.In Tic-tac-toe a reduction≈ in size would reduce the accuracy of the model by≈ approximately 1% (from 98.54% to 97.39%). This property can be particularly useful, as one can adapt a single≈ model to≈ different computational requirements on the fly.

4 Conclusions

In this paper, we present a novel iterative method to train a neural network based on gradient boosting. The proposed algorithm builds at each step a regression neural net- work with one or few (J) hidden unit(s) fitted to the data residuals. The weights of the network are then updated with a Newton-Raphson step to minimize a given loss func- tion. Then, the data residuals are updated to train the model of the next iteration. The resulting T regressors constitute a single neural network with T J hidden units. In addition, the formulation derived for this works opens the possibility× to create gradient boosting ensembles composed of base models different from decision trees, as the all previous implementations. In the analyzed problems, the proposed method achieves a generalization accuracy that converges with the number of combined regressors (or hidden units). This quality of the proposed method allows us to use the combined model fully or partially by deactivating the units in order inverse to their creation, depending on classification speed requirements. This can be done on the fly during test. The proposed method has been tested on a variety of classification and regression tasks. The results show a performance favorable to the proposed method in general although the overall differences are not statistically significant with respect to solvers

14 Adam and L-BFGS (they are with respect to SGD). Notwithstanding, the proposed iterative training procedure opens novel alternatives for training neural networks. This is particularly evident for regression tasks where the proposed method achieved the best result in most of the analyzed datasets.

References

[1] Ismail Babajide Mustapha and Faisal Saeed. Bioactive molecule prediction using extreme gradient boosting. Molecules, 21(8), 2016. [2] Candice Bent´ejac, Anna Cs¨org˝o, and Gonzalo Mart´ınez-Mu˜noz. A comparative analysis of gradient boosting algorithms. Artificial Intelligence Review, (in press), 2020. [3] G´erard Biau, Erwan Scornet, and Johannes Welbl. Neural random forests. Sankhya A, In press, 2018. [4] L. Breiman. Bias, variance,and arcing classifiers. Technical Report 460, Statistics Department, University of California, 1996. [5] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001. [6] Rich Caruana and Alexandru Niculescu-Mizil. An empirical comparison of su- pervised learning algorithms. In ICML ’06: Proceedings of the 23rd international conference on Machine learning, pages 161–168, New York, NY, USA, 2006. ACM Press. [7] Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and , KDD ’16, pages 785–794, New York, NY, USA, 2016. ACM. [8] Amit Daniely. Sgd learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems, pages 2422–2430, 2017.

[9] Janez Demˇsar. Statistical comparisons of classifiers over multiple data sets. Jour- nal of Machine Learning Research, 7:1–30, 2006. [10] Manuel Fern´andez-Delgado, Eva Cernadas, Sen´en Barro, and Dinani Amorim. Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15:3133–3181, 2014.

[11] Jerome Friedman, Trevor Hastie, Robert Tibshirani, et al. Additive logistic re- gression: a statistical view of boosting (with discussion and a rejoinder by the authors). The annals of statistics, 28(2):337–407, 2000. [12] Jerome H. Friedman. Greedy function approximation: a Gradient Boosting ma- chine. The Annals of Statistics, 29(5):1189 – 1232, 2001.

15 [13] Jerome H. Friedman. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367 – 378, 2002. Nonlinear Methods and Data Mining. [14] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qi- wei Ye, and Tie-Yan Liu. Lightgbm: A highly efficient gradient boosting decision tree. In Advances in neural information processing systems, pages 3146–3154, 2017. [15] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. [16] Peter Kontschieder, Madalina Fiterau, Antonio Criminisi, and Samuel Rota Bulo. Deep neural decision forests. In The IEEE International Conference on Computer Vision (ICCV), December 2015. [17] Peter Kontschieder, Madalina Fiterau, Antonio Criminisi, and Samuel Rota Bulo. Deep neural decision forests. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16), 2016. [18] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. . Nature, 521:436–444, 2015. [19] M. Lichman. UCI machine learning repository, 2013. [20] Dong C Liu and Jorge Nocedal. On the limited memory bfgs method for large scale optimization. Mathematical programming, 45(1-3):503–528, 1989. [21] N. Mirabal, E. Charles, E. C. Ferrara, P. L. Gonthier, A. K. Harding, M. A. S´anchez-Conde, and D. J. Thompson. 3fgl demographics outside the galactic plane using supervised machine learning: Pulsar and dark matter subhalo inter- pretations. The Astrophysical Journal, 825(1):69, 2016. [22] Mohammad Moghimi, Mohammad Saberian, Jian Yang, Li-Jia Li, Nuno Vascon- celos, and Serge Belongie. Boosted convolutional neural networks. In British Machine Vision Conference (BMVC), York, UK, 2016.

[23] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830,2011. [24] Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Doro- gush, and Andrey Gulin. Catboost: unbiased boosting with categorical features. In Advances in neural information processing systems, pages 6638–6648, 2018. [25] J¨urgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85 – 117, 2015. [26] Holger Schwenk and Yoshua Bengio. Boosting neural networks. Neural Compu- tation, 12(8):1869–1887, 2000.

16 [27] Alberto Torres-Barr´an, Alvaro´ Alonso, and Jos´eR. Dorronsoro. Regression tree ensembles for wind energy and solar radiation prediction. Neurocomputing (2017), 2017. [28] Johannes Welbl. Casting random forests as artificial neural networks (and prof- iting from it). In Xiaoyi Jiang, Joachim Hornegger, and Reinhard Koch, editors, Pattern Recognition. Springer International Publishing, 2014. [29] Chongsheng Zhang, Changchang Liu, Xiangliang Zhang, and George Almpani- dis. An up-to-date comparison of state-of-the-art classification algorithms. Expert Systems with Applications, 82:128–150, 2017.

17 ...

...

......

...

...