Boosting Neural Networks

Holger Schwenk

LIMSI-CNRS, Orsay cedex, FRANCE

Yoshua Bengio

DIRO, University of Montreal, Succ. Centre-Ville,
Montreal, Qc, CANADA

To appear in Neural Computation

Abstract

Boosting is a general method for improving the performance of learning algorithms. A recently proposed boosting algorithm is AdaBoost. It has been applied with great success to several benchmark machine learning problems, using mainly decision trees as base classifiers. In this paper we investigate whether AdaBoost also works as well with neural networks, and we discuss the advantages and drawbacks of different versions of the AdaBoost algorithm. In particular, we compare training methods based on sampling the training set and on weighting the cost function. The results suggest that random resampling of the training data is not the main explanation of the success of the improvements brought by AdaBoost. This is in contrast to Bagging, which directly aims at reducing variance and for which random resampling is essential to obtain the reduction in generalization error. Our system achieves a low error rate on a data set of online handwritten digits written by many different writers, and boosted multilayer networks achieved low error rates on the UCI Letters and satellite data sets, significantly better than boosted decision trees.

Keywords: AdaBoost, boosting, Bagging, ensemble learning, multilayer neural networks, generalization

Introduction

Boosting is a general method for improving the performance of a learning algorithm. It is a method for finding a highly accurate classifier on the training set by combining "weak hypotheses" (Schapire), each of which needs only to be moderately accurate on the training set. (See an earlier overview of different ways to combine neural networks in Perrone.) A recently proposed boosting algorithm is AdaBoost (Freund and Schapire), which stands for "Adaptive Boosting". During the last two years, many empirical studies have been published that use decision trees as base classifiers for AdaBoost (Breiman; Drucker and Cortes; Freund and Schapire (a); Quinlan; Maclin and Opitz; Bauer and Kohavi; Dietterich (b); Grove and Schuurmans). All these experiments have shown impressive improvements in the generalization behavior and suggest that AdaBoost tends to be robust to overfitting. In fact, in many experiments it has been observed that the generalization error continues to decrease towards an apparent asymptote after the training error has reached zero. Schapire et al. suggest a possible explanation for this unusual behavior, based on the definition of the margin of classification. Other attempts to understand boosting theoretically can be found in Schapire et al., Breiman (a), Breiman, Friedman et al., and Schapire. AdaBoost has also been linked with game theory (Freund and Schapire (b); Breiman (b); Grove and Schuurmans; Freund and Schapire), in order to understand the behavior of AdaBoost and to propose alternative algorithms. Mason and Baxter propose a new variant of boosting based on the direct optimization of margins.

Additionally, there is recent evidence that AdaBoost may very well overfit if we combine several hundred thousand classifiers (Grove and Schuurmans). It also seems that the performance of AdaBoost degrades a lot in the presence of significant amounts of noise (Dietterich (b); Ratsch et al.). Although much useful work has been done, both theoretically and experimentally, there is still a lot that is not well understood about the impressive generalization behavior of AdaBoost. To the best of our knowledge, applications of AdaBoost have all been to decision trees, and no applications to multilayer artificial neural networks have been reported in the literature. This paper extends and provides a deeper experimental analysis of our first experiments with the application of AdaBoost to neural networks (Schwenk and Bengio).

In this paper we consider the following questions. Does AdaBoost work as well for neural networks as for decision trees? (Short answer: yes, sometimes even better.) Does it behave in a similar way as was observed previously in the literature? (Short answer: yes.) Furthermore, are there particulars in the way neural networks are trained with gradient back-propagation which should be taken into account when choosing a particular version of AdaBoost? (Short answer: yes, because it is possible to directly weight the cost function of neural networks.) Is overfitting of the individual neural networks a concern? (Short answer: not as much as when not using boosting.) Is the random resampling used in previous implementations of AdaBoost critical, or can we get similar performance by weighting the training criterion, which can easily be done with neural networks? (Short answer: it is not critical for generalization, but it helps to obtain faster convergence of the individual networks when coupled with stochastic gradient descent.)

The paper is organized as follows. In the next section we first describe the AdaBoost algorithm, and we discuss several implementation issues when using neural networks as base classifiers. We then present results that we have obtained on three medium-sized tasks: a data set of handwritten online digits, and the Letter and satimage data sets of the UCI repository. The paper finishes with a conclusion and perspectives for future research.

AdaBoost

It is well known that it is often possible to increase the accuracy of a classifier by averaging the decisions of an ensemble of classifiers (Perrone; Krogh and Vedelsby). In general, more improvement can be expected when the individual classifiers are diverse and yet accurate. One can try to obtain this result by taking a base learning algorithm and invoking it several times on different training sets. Two popular techniques exist that differ in the way they construct these training sets: Bagging (Breiman) and boosting (Freund; Freund and Schapire). In Bagging, each classifier is trained on a bootstrap replicate of the original training set. Given a training set S of N examples, the new training set is created by resampling N examples uniformly with replacement. Note that some examples may occur several times, while others may not occur in the sample at all; one can show that, on average, only about 63% of the examples occur in each bootstrap replicate (a short simulation of this is sketched below). Note also that the individual training sets are independent, so the classifiers could be trained in parallel. Bagging is known to be particularly effective when the classifiers are unstable, i.e., when perturbing the learning set can cause significant changes in the classification behavior of the classifiers. Formulated in the context of the bias-variance decomposition (Geman et al.), Bagging improves generalization performance due to a reduction in variance while maintaining or only slightly increasing bias. Note, however, that there is no unique bias-variance decomposition for classification tasks (Kong and Dietterich; Breiman; Kohavi and Wolpert; Tibshirani).
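For concreteness, the following minimal sketch (our own illustration, not code from the paper) draws a bootstrap replicate and checks the roughly 63% figure empirically; it only assumes NumPy.

    # Sketch: a Bagging bootstrap replicate and the ~63% (1 - 1/e) coverage claim.
    import numpy as np

    def bootstrap_replicate(n_examples, rng):
        """Indices of a bootstrap replicate: n samples drawn uniformly with replacement."""
        return rng.integers(0, n_examples, size=n_examples)

    rng = np.random.default_rng(0)
    n = 10000
    idx = bootstrap_replicate(n, rng)
    unique_fraction = len(np.unique(idx)) / n
    print(f"fraction of distinct examples in the replicate: {unique_fraction:.3f}")
    # prints a value close to 1 - 1/e ~ 0.632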

AdaBoost, on the other hand, constructs a composite classifier by sequentially training classifiers, while putting more and more emphasis on certain patterns. For this, AdaBoost maintains a probability distribution D_t(i) over the original training set. In each round t, the classifier is trained with respect to this distribution. Some learning algorithms do not allow training with respect to a weighted cost function. In this case, sampling with replacement according to the probability distribution D_t can be used to approximate a weighted cost function: examples with high probability would then occur more often than those with low probability, while some examples may not occur in the sample at all, although their probability is not zero.
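This resampling approximation takes only a few lines; the sketch below is ours, with hypothetical variable names, and simply draws a training set of the same size according to D_t.

    # Sketch: approximate a weighted cost function by resampling according to D_t.
    import numpy as np

    def resample_from_distribution(D_t, rng):
        """Return indices of a training set of the same size, sampled with replacement ~ D_t."""
        n = len(D_t)
        return rng.choice(n, size=n, replace=True, p=D_t)

    rng = np.random.default_rng(0)
    D_t = np.array([0.05, 0.05, 0.4, 0.3, 0.2])   # toy distribution over 5 examples
    print(resample_from_distribution(D_t, rng))   # high-probability examples occur more often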

Input: a sequence of N examples (x_1, y_1), ..., (x_N, y_N) with labels y_i in Y = {1, ..., k}.

Init: let B = {(i, y) : i in {1, ..., N}, y != y_i} and D_1(i, y) = 1/|B| for all (i, y) in B.

Repeat for t = 1, 2, ...:

1. Train the neural network with respect to the distribution D_t and obtain a hypothesis h_t : X x Y -> [0, 1].

2. Calculate the pseudo-loss of h_t:
   $\epsilon_t = \frac{1}{2} \sum_{(i,y) \in B} D_t(i, y)\left(1 - h_t(x_i, y_i) + h_t(x_i, y)\right)$

3. Set $\beta_t = \epsilon_t / (1 - \epsilon_t)$.

4. Update the distribution:
   $D_{t+1}(i, y) = \frac{D_t(i, y)}{Z_t}\, \beta_t^{\frac{1}{2}\left(1 + h_t(x_i, y_i) - h_t(x_i, y)\right)}$
   where Z_t is a normalization constant (such that D_{t+1} is a distribution).

Output: the final hypothesis
   $f(x) = \arg\max_{y \in Y} \sum_t \left(\log \frac{1}{\beta_t}\right) h_t(x, y)$

Table 1: Pseudo-loss AdaBoost (AdaBoost.M2).
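To make the steps of Table 1 concrete, the following Python sketch implements the same loop around a generic base learner. It is our own illustration, not the authors' implementation; `train_base_learner` is a placeholder that must return a hypothesis giving a confidence score in [0, 1] for every class.

    # Sketch of the pseudo-loss AdaBoost (AdaBoost.M2) loop of Table 1.
    import numpy as np

    def adaboost_m2(X, y, n_classes, n_rounds, train_base_learner):
        N = len(y)
        # D has shape (N, k); it is a distribution over mislabels, zero on the true label.
        D = np.ones((N, n_classes)) / (N * (n_classes - 1))
        D[np.arange(N), y] = 0.0
        hypotheses, log_inv_betas = [], []
        for t in range(n_rounds):
            h = train_base_learner(X, y, D)          # returns a callable: h(X) -> (N, k) scores
            scores = h(X)
            correct = scores[np.arange(N), y]        # h_t(x_i, y_i)
            # pseudo-loss: eps_t = 1/2 * sum D(i,y) * (1 - h(x_i,y_i) + h(x_i,y))
            eps = 0.5 * np.sum(D * (1.0 - correct[:, None] + scores))
            beta = eps / (1.0 - eps)
            # update: D(i,y) *= beta ** (1/2 * (1 + h(x_i,y_i) - h(x_i,y))), then normalize by Z_t
            D = D * beta ** (0.5 * (1.0 + correct[:, None] - scores))
            D[np.arange(N), y] = 0.0                 # mass stays on mislabels (already zero)
            D /= D.sum()
            hypotheses.append(h)
            log_inv_betas.append(np.log(1.0 / beta))
        def final_hypothesis(X_new):
            votes = sum(w * h(X_new) for w, h in zip(log_inv_betas, hypotheses))
            return np.argmax(votes, axis=1)          # arg max_y sum_t log(1/beta_t) h_t(x, y)
        return final_hypothesis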

After each AdaBoost round, the probability of incorrectly labeled examples is increased, and the probability of correctly labeled examples is decreased. In the basic algorithm, the result of training the t-th classifier is a hypothesis h_t : X -> Y, where Y = {1, ..., k} is the space of labels and X is the space of input features. After the t-th round, the weighted error $\epsilon_t = \sum_{i:\, h_t(x_i) \neq y_i} D_t(i)$ of the resulting classifier is calculated, and the distribution D_{t+1} is computed from D_t by increasing the probability of incorrectly labeled examples. The probabilities are changed so that the error of the t-th classifier using these new weights D_{t+1} would be 1/2. In this way the classifiers are optimally decoupled. The global decision f is obtained by weighted voting. This basic AdaBoost algorithm converges (learns the training set) if each classifier yields a weighted error that is less than 1/2, i.e., better than chance in the two-class case.
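The statement that the classifiers are "optimally decoupled" can be verified with a short calculation (our addition, not spelled out in the text), using the standard discrete AdaBoost update in which correctly classified examples are scaled by $\beta_t = \epsilon_t/(1-\epsilon_t)$ and incorrectly classified ones are left unchanged before normalization:

$$D_{t+1}(i) = \frac{D_t(i)}{Z_t}\,\beta_t^{\,1 - [h_t(x_i) \neq y_i]},
\qquad
Z_t = \sum_i D_t(i)\,\beta_t^{\,1 - [h_t(x_i) \neq y_i]} = (1-\epsilon_t)\,\beta_t + \epsilon_t = 2\epsilon_t,$$

so the weighted error of $h_t$ under the new distribution is

$$\sum_{i:\,h_t(x_i)\neq y_i} D_{t+1}(i) = \frac{\epsilon_t}{Z_t} = \frac{1}{2},$$

i.e., the previous hypothesis is no better than chance with respect to the new weights.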

In general, neural network classifiers provide more information than just a class label. It can be shown that the network outputs approximate the a posteriori probabilities of the classes, and it might be useful to use this information rather than to perform a hard decision for one recognized class. This issue is addressed by another version of AdaBoost, called AdaBoost.M2 (Freund and Schapire). It can be used when the classifier computes confidence scores for each class. The result of training the t-th classifier is now a hypothesis h_t : X x Y -> [0, 1]. (The scores do not need to sum to one.) Furthermore, we use a distribution D_t(i, y) over the set of all mislabels, B = {(i, y) : i in {1, ..., N}, y != y_i}, where N is the number of training examples; therefore |B| = N(k - 1). AdaBoost modifies this distribution so that the next learner focuses not only on the examples that are hard to classify, but more specifically on improving the discrimination between the correct class and the incorrect class that competes with it. Note that the mislabel distribution D_t induces a distribution over the examples, $P_t(i) = W_i^t / \sum_i W_i^t$ where $W_i^t = \sum_{y \neq y_i} D_t(i, y)$; P_t(i) may be used for resampling the training set. Freund and Schapire define the pseudo-loss of a learning machine as

$\epsilon_t = \frac{1}{2} \sum_{(i,y) \in B} D_t(i, y)\left(1 - h_t(x_i, y_i) + h_t(x_i, y)\right).$

It is minimized if the confidence scores of the correct labels are 1 and the confidence scores of all the wrong labels are 0. The final decision f is obtained by adding together the weighted confidence scores of all the machines (all the hypotheses h_t). Table 1 summarizes the AdaBoost.M2 algorithm. This multi-class boosting algorithm converges if each classifier yields a pseudo-loss that is less than 1/2, i.e., better than any constant hypothesis.

AdaBoost has very interesting theoretical properties; in particular, it can be shown that the error of the composite classifier on the training data decreases exponentially fast to zero as the number of combined classifiers is increased (Freund and Schapire). Many empirical evaluations of AdaBoost also provide an analysis of the so-called margin distribution. The margin is defined as the difference between the ensemble score of the correct class and the strongest ensemble score of a wrong class. In the case in which there are just two possible labels {-1, +1}, this is y f(x), where f is the output of the composite classifier and y is the correct label. The classification is correct if the margin is positive. Discussions about the relevance of the margin distribution for the generalization behavior of ensemble techniques can be found in Freund and Schapire (b), Schapire et al., Breiman (a), Breiman (b), Grove and Schuurmans, and Ratsch et al.
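The margin and its cumulative distribution, which are plotted in the figures later in the paper, can be computed directly from the ensemble scores. The following sketch is ours; it assumes that the scores have been normalized so that margins lie in [-1, 1].

    # Sketch: margin = correct-class score minus the strongest wrong-class score.
    import numpy as np

    def margins(scores, y):
        """scores: (n, k) ensemble scores; y: (n,) true labels."""
        n = len(y)
        correct = scores[np.arange(n), y]
        wrong = scores.copy()
        wrong[np.arange(n), y] = -np.inf
        return correct - wrong.max(axis=1)

    def cumulative_margin_distribution(m, xs):
        """Fraction of examples whose margin is at most x, for each x in xs."""
        return np.array([(m <= x).mean() for x in xs])

    # usage sketch:
    # m = margins(normalized_scores, y_test)
    # cdf = cumulative_margin_distribution(m, np.linspace(-1.0, 1.0, 201))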

In this paper, an important focus is on whether the good generalization performance of AdaBoost is partially explained by the random resampling of the training sets generally used in its implementation. This issue will be addressed by comparing three versions of AdaBoost, as described in the next section, in which randomization is used, or not used, in three different ways.

Applying AdaBoost to neural networks

In this paper we investigate different techniques for using neural networks as base classifiers for AdaBoost. In all cases we have trained the neural networks by minimizing a quadratic criterion that is a weighted sum of the squared differences $(z_{ij} - \hat{z}_{ij})^2$, where $z_i = (z_{i1}, \ldots, z_{ik})$ is the desired output vector, with a low target value everywhere except at the position corresponding to the target class, and $\hat{z}_i$ is the output vector of the network. A score for class j for pattern i can be directly obtained from the j-th element $\hat{z}_{ij}$ of the output vector $\hat{z}_i$. When a class must be chosen, the one with the highest score is selected. Let $V_t(i, j) = D_t(i, j) / \max_{k \neq y_i} D_t(i, k)$ for $j \neq y_i$, and $V_t(i, y_i) = 1$. These weights are used to give more emphasis to certain incorrect labels, according to the pseudo-loss AdaBoost.

What we call an epoch is a pass of the training algorithm through all the examples in the training set. In this paper we compare three different versions of AdaBoost:

(R) Training the t-th classifier with a fixed training set obtained by resampling with replacement once from the original training set: before starting to train the t-th network, we sample N patterns from the original training set, each time with a probability P_t(i) of picking pattern i. Training is performed for a fixed number of iterations, always using this same resampled training set. This is basically the scheme that has been used in the past when applying AdaBoost to decision trees, except that we used the pseudo-loss AdaBoost. To approximate the pseudo-loss, the training cost that is minimized for a pattern that is the i-th one of the original training set is $\sum_j V_t(i, j)\,(z_{ij} - \hat{z}_{ij})^2$.

(E) Training the t-th classifier using a different training set at each epoch, by resampling with replacement after each training epoch: after each epoch, a new training set is obtained by sampling from the original training set with probabilities P_t(i). Since we used an online stochastic gradient in this case, this is equivalent to sampling a new pattern from the original training set with probability P_t(i) before each forward-backward pass through the neural network. Training continues until a fixed number of pattern presentations has been performed. As for R, the training cost that is minimized for a pattern that is the i-th one of the original training set is $\sum_j V_t(i, j)\,(z_{ij} - \hat{z}_{ij})^2$.

(W) Training the t-th classifier by directly weighting the cost function (here the squared error) of the t-th neural network, i.e., all the original training patterns are in the training set, but the cost is weighted by the probability of each example: $\sum_j D_t(i, j)\,(z_{ij} - \hat{z}_{ij})^2$. If we used this formula directly, the gradients would be very small, even when all probabilities $D_t(i, j)$ are identical. To avoid having to scale the learning rates differently depending on the number of examples, the following normalized error function was used:

$\frac{P_t(i)}{\max_k P_t(k)} \sum_j V_t(i, j)\,(z_{ij} - \hat{z}_{ij})^2$
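The per-pattern training costs of the three versions can be written compactly as follows. This is our own sketch with hypothetical names (`D_t`, `P_t`, `z`, `z_hat`), not the authors' code; it only illustrates how the weightings enter the squared-error criterion.

    # Sketch of the per-pattern costs used by the R, E, and W versions.
    import numpy as np

    def pseudo_loss_weights(D_t, y):
        """V_t(i, j) = D_t(i, j) / max_{k != y_i} D_t(i, k), and V_t(i, y_i) = 1."""
        V = D_t / D_t.max(axis=1, keepdims=True)   # true-label column of D_t is zero
        V[np.arange(len(y)), y] = 1.0
        return V

    def cost_R_or_E(V_t, i, z, z_hat):
        # R and E: patterns are *sampled* according to P_t, and each sampled pattern i
        # contributes sum_j V_t(i, j) (z_ij - z_hat_ij)^2
        return np.sum(V_t[i] * (z[i] - z_hat[i]) ** 2)

    def cost_W(V_t, P_t, i, z, z_hat):
        # W: every original pattern is kept, with its cost scaled by the normalized
        # probability P_t(i) / max_k P_t(k)
        return (P_t[i] / P_t.max()) * np.sum(V_t[i] * (z[i] - z_hat[i]) ** 2)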

In E and W, what makes the combined networks essentially different from each other is the fact that they are trained with respect to different weightings D_t of the original training set. In R, instead, an additional element of diversity is built in, because the criterion used for the t-th network is not exactly the errors weighted by P_t(i): more emphasis is put on certain patterns while others are completely ignored, because of the initial random sampling of the training set. The E version can be seen as a stochastic version of the W version, i.e., as the number of iterations through the data increases and the learning rate decreases, E becomes a very good approximation of W. W itself is closest to the recipe mandated by the AdaBoost algorithm, but as we will see below, it suffers from numerical problems. Note that E is a better approximation of the weighted cost function than R, in particular when many epochs are performed. If random resampling of the training data explained a good part of the generalization performance of AdaBoost, then the weighted training version W should perform worse than the resampling versions, and the fixed-sample version R should perform better than the continuously resampled version E. Note that for Bagging, which directly aims at reducing variance, random resampling is essential to obtain the reduction in generalization error.

Results

Experiments have been performed on three data sets: a data set of online handwritten digits, the UCI Letters data set of offline machine-printed alphabetical characters, and the UCI satellite data set, which is generated from Landsat Multi-Spectral Scanner image data. All data sets have a predefined training and test set. All the p-values that are given in this section concern a pair $(\hat{p}_1, \hat{p}_2)$ of test performance results, on n test points, for two classification systems with unknown true error rates $p_1$ and $p_2$. The null hypothesis is that the true expected performance of the two systems is not different, i.e., $p_1 = p_2$. Let $\hat{p} = (\hat{p}_1 + \hat{p}_2)/2$ be the estimator of the common error rate under the null hypothesis. The alternative hypothesis is that $p_1 > p_2$, so the p-value is obtained as the probability of observing such a large difference under the null hypothesis, i.e., $P(Z > z)$ for a Normal Z, with

$z = \frac{(\hat{p}_1 - \hat{p}_2)\sqrt{n}}{\sqrt{2\,\hat{p}\,(1-\hat{p})}}.$

This is based on the Normal approximation of the Binomial, which is appropriate for large n (however, see Dietterich (a) for a discussion of this and other tests to compare algorithms).
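In code, this test reads as follows. The sketch is ours; it assumes SciPy is available for the Normal tail probability, and the error rates in the example call are hypothetical figures, not results from the paper.

    # Sketch: one-sided two-proportion z-test on the test-set error rates of two classifiers.
    from math import sqrt
    from scipy.stats import norm

    def p_value(p1_hat, p2_hat, n):
        """p-value for H0: p1 = p2 against H1: p1 > p2 (Normal approximation)."""
        p_hat = 0.5 * (p1_hat + p2_hat)           # pooled error rate under H0
        z = (p1_hat - p2_hat) * sqrt(n) / sqrt(2.0 * p_hat * (1.0 - p_hat))
        return 1.0 - norm.cdf(z)                  # P(Z > z)

    # hypothetical example: 2.0% vs 1.5% error on n = 4000 test points
    print(p_value(0.020, 0.015, 4000))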

Results on the online data set

The online data set was collected at Paris University (Schwenk and Milgram). A WACOM tablet with a cordless pen was used, in order to allow natural writing. Since we wanted to build a writer-independent recognition system, we tried to use many writers and to impose as few constraints as possible on the writing style. In total, the students who participated wrote down isolated digits, which were divided into a learning set and a test set. Note that the writers of the training and test sets are completely distinct. A particular property of this data set is the notable variety of writing styles, which are not all equally frequent: there are, for instance, many zeros written counterclockwise but only few written clockwise. Figure 1 gives an idea of the great variety of writing styles of this data set. We applied only a simple preprocessing: the characters were resampled to a fixed number of points, centered, and size-normalized to an (x, y)-coordinate sequence.
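A possible implementation of this preprocessing is sketched below. It is our illustration, not the authors' code; the number of resampled points is a parameter, set here to 11 only because the 22-h-10 networks used later have 22 inputs, i.e., 11 (x, y) pairs.

    # Sketch: resample a pen trajectory to a fixed number of points, center, size-normalize.
    import numpy as np

    def preprocess(stroke, n_points=11):
        """stroke: array of shape (T, 2) with the raw (x, y) pen positions."""
        # arc-length parameterization, then uniform resampling along the trajectory
        seg = np.linalg.norm(np.diff(stroke, axis=0), axis=1)
        s = np.concatenate([[0.0], np.cumsum(seg)])
        s_new = np.linspace(0.0, s[-1], n_points)
        pts = np.column_stack([np.interp(s_new, s, stroke[:, d]) for d in range(2)])
        pts -= pts.mean(axis=0)                      # center
        pts /= np.abs(pts).max() + 1e-12             # size-normalize the coordinates
        return pts.ravel()                           # fixed-length feature vector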

Table 2 summarizes the results on the test set before using AdaBoost. Note that the differences among the test results of the last three networks are not statistically significant, whereas the difference with the first network is significant. Cross-validation within the training set was used to find the optimal number of training epochs. Note that if training is continued for many more epochs, the test error increases.

Figure 1: Some examples of the online handwritten digits data set (test set).

Table 2: Online digits data set: error rates for fully connected MLPs (not boosted). For each architecture, the training and test error rates are reported.

Table 3 shows the results of bagged and boosted multilayer perceptrons with 10, 30, or 50 hidden units, trained for different numbers of epochs, and using either the ordinary resampling scheme (R), resampling with different random selections at each epoch (E), or training with weights D_t on the squared-error criterion for each pattern (W). In all cases, up to 100 neural networks were combined. (The notation 22-h-10 designates a fully connected neural network with 22 input nodes, one hidden layer with h neurons, and a 10-dimensional output layer.)

AdaBoost improved the generalization error of the MLPs in all cases, and the improvement over the corresponding unboosted network is statistically significant despite the small number of test examples. Boosting was also always superior to Bagging, although the differences are not always very significant because of the small number of test examples.

Table 3: Online digits, test error rates for boosted MLPs. Columns: the three architectures (22-10-10, 22-30-10, 22-50-10), each with the versions R, E, and W; rows: Bagging and AdaBoost for several numbers of training epochs per network.

Furthermore, it seems that the number of training epochs of each individual classifier has no significant impact on the results of the combined classifier, at least on this data set. AdaBoost with weighted training of the MLPs (the W version), however, does not work as well if the learning of each individual MLP is stopped too early: the networks did not learn the weighted examples well enough, and the pseudo-loss rapidly approached 1/2. When training each MLP for more epochs, however, the weighted training version W achieved the same low test error rate.

AdaBoost is less useful for very large networks on this data, since an individual classifier can then achieve zero error on the original training set using the E or W method. Such large networks probably have a very low bias but high variance. This may explain why Bagging, a pure variance reduction method, can do as well as AdaBoost, which is believed to reduce bias and variance. Note, however, that AdaBoost can achieve the same low error rates with the smaller networks.

Figure 2 shows the error rates of some of the boosted classifiers as the number of networks is increased. AdaBoost brings the training error to zero after only a few steps, even for the MLP with only 10 hidden units. The generalization error is also considerably improved, and it continues to decrease towards an apparent asymptote after zero training error has been reached. The surprising effect of continuously decreasing generalization error, even after the training error reaches zero, has already been observed by others (Breiman; Drucker and Cortes; Freund and Schapire (a); Quinlan). This seems to contradict Occam's razor, but a recent theorem (Schapire et al.) suggests that the margin distribution may be relevant to the generalization error. Although previous empirical results (Schapire et al.) indicate that pushing the margin cumulative distribution to the right may improve generalization, other recent results (Breiman (a); Breiman (b); Grove and Schuurmans) show that improving the whole margin distribution can also yield worse generalization. Figures 3 and 4 show several margin cumulative distributions, i.e., the fraction of examples whose margin is at most x, as a function of x. The networks had been trained for a fixed number of epochs (with the larger number of epochs for the W version).

Figure 2: Error rates of the boosted classifiers for an increasing number of networks (log scale), for the MLPs 22-10-10, 22-30-10, and 22-50-10. Each panel shows the training and test error (in %) of Bagging, AdaBoost (R), AdaBoost (E), and AdaBoost (W). For clarity, the training error of Bagging is not shown; it overlaps with the test error rates of AdaBoost. The dotted constant horizontal line corresponds to the test error of the unboosted classifier. Small oscillations are not significant since they correspond to few examples.

Figure 3: Margin cumulative distributions for AdaBoost (R) (left column) and AdaBoost (E) (right column) of the three MLP architectures, using 1, 2, 5, 10, 50, and 100 networks, respectively.

Figure 4: Margin cumulative distributions for AdaBoost (W) (left column) and Bagging (right column) of the three MLP architectures, using 1, 2, 5, 10, 50, and 100 networks, respectively.

It is clear in Figures 3 and 4 that the number of examples with a high margin increases when more classifiers are combined by boosting. When boosting the neural networks with 10 hidden units, for instance, there are some examples with small or even negative margins when only two networks are combined; however, all examples have a positive margin when enough networks are combined, and the minimal margin continues to increase as further networks are added. Bagging, on the other hand, has no significant influence on the margin distributions. There is almost no difference between the margin distributions of the R, E, or W versions of AdaBoost either. Note, however, that there is a difference between the margin distributions (and the test set errors) when the complexity of the neural networks is varied (hidden layer size). Finally, it seems that sometimes AdaBoost must allow some examples with very high margins in order to improve the minimal margin; this can best be seen for the larger architectures.

One should keep in mind that this data set contains only small amounts of noise. In application domains with high amounts of noise, it may be less advantageous to improve the minimal margin at any price (Grove and Schuurmans; Ratsch et al.), since this would mean putting too much weight on noisy or wrongly labeled examples.

Results on the UCI Letters and Satimage Data Sets

Similar experiments were performed with MLPs on the Letters data set from the UCI Machine Learning repository. It has 16,000 training and 4,000 test patterns, 16 input features, and 26 classes (A-Z) of distorted machine-printed characters from a variety of different fonts. A few preliminary experiments, on the training set only, were used to choose the network architecture. Each input feature was normalized according to its mean and variance on the training set.

Two types of experiments were performed: resampling after each epoch (E), using stochastic gradient descent, and, without resampling, reweighting of the squared error (W), using conjugate gradient descent. In both cases a fixed number of training epochs was used. The plain, bagged, and boosted networks are compared to decision trees in Table 4.

Table 4: Test error rates on the UCI data sets (letter and satellite) for CART (results from Breiman), C4.5 (results from Freund and Schapire (a)), and MLPs, each used alone, bagged, and boosted.

In both cases, E and W, the same final generalization error was obtained, but the training time using the weighted squared error (W) was much greater. This shows that using random resampling, as in E or R, is not necessary to obtain good generalization, whereas it is clearly necessary for Bagging. However, the experiments show that it is still preferable to use a random sampling method such as R or E for numerical reasons: convergence of each network is faster. (One may also note that the W and E versions achieve slightly higher margins than R.)

Figure 5: Error rates of the bagged and boosted neural networks for the UCI Letter data set (log scale; training and test error in % versus the number of networks). SG+E denotes stochastic gradient descent with resampling after each epoch; CG+W means conjugate gradient descent with weighting of the squared error. For clarity, the training error of Bagging is not shown; it flattens out. The dotted constant horizontal line corresponds to the test error of the unboosted classifier.

For this reason, for the E experiments with stochastic gradient descent, the full ensemble of networks was boosted, whereas we stopped training of the W version after fewer networks, when the generalization error seemed to have flattened out; this already took more than a week on a fast processor (SGI Origin). We believe that the main reason for this difference in training time is that the conjugate gradient method is a batch method and is therefore slower than stochastic gradient descent on redundant data sets with many thousands of examples, such as this one (see comparisons between batch and online methods in Bourrely, and of conjugate gradients for classification tasks in particular in Moller).

For the W version with stochastic gradient descent, the weighted training error of the individual networks does not decrease as much as when using conjugate gradient descent, so that AdaBoost itself did not work as well. We believe that this is due to the fact that it is difficult for stochastic gradient descent to approach a minimum when the output error is weighted with very different weights for different patterns: on the patterns with small weights, almost no progress can be made. The conjugate gradient descent method, on the other hand, can approach a minimum of the weighted cost function more precisely, but it does so inefficiently when there are thousands of training examples.

The error rates obtained with the boosted networks are extremely good, whether using the W version with conjugate gradients or the E version with stochastic gradient descent, and are, to the best of our knowledge, the best published to date for this data set. In a comparison with the error of the boosted decision trees, the p-value of the null hypothesis is very small. The best performance reported in STATLOG (Feng et al.) is considerably worse. Note also that we need to combine only a few neural networks to get important improvements: with the E version, a small number of networks suffices for the test error to fall well below that of the unboosted network, whereas boosted decision trees typically converge later. The W version of AdaBoost converged faster in terms of the number of networks (Figure 5), reaching its apparent asymptote after fewer networks, but converged much more slowly in terms of training time. Figure 6 shows the margin distributions for Bagging and AdaBoost applied to this data set. Again, Bagging has no effect on the margin distribution, whereas AdaBoost clearly increases the number of examples with large margins.

Figure 6: Margin cumulative distributions for the UCI Letter data set: Bagging (left) and AdaBoost (SG+E) (right), using 1, 2, 5, 10, 50, and 100 networks.

Similar conclusions hold for the UCI satellite data set (Table 4), although the improvements are not as dramatic as in the case of the Letter data set. The improvement due to AdaBoost is statistically significant, but the difference in performance between the boosted MLPs and the boosted decision trees is not. This data set has 6,435 examples, the first 4,435 of which are used for training and the last 2,000 for testing generalization. There are 36 inputs and 6 classes, and a fully connected MLP was used. Again, the two best training methods are epoch resampling (E) with stochastic gradient descent and the weighted squared error (W) with conjugate gradient descent.

Conclusion

As demonstrated here on three real-world applications, AdaBoost can significantly improve neural classifiers. In particular, the results obtained on the UCI Letters data set are, to the best of our knowledge, significantly better than the best published results to date. The behavior of AdaBoost for neural networks confirms previous observations on other learning algorithms (e.g., Breiman; Drucker and Cortes; Freund and Schapire (a); Quinlan; Schapire et al.), such as the continued generalization improvement after zero training error has been reached, and the associated improvement in the margin distribution. It also seems that AdaBoost is not very sensitive to overtraining of the individual classifiers, so that the neural networks can be trained for a fixed (preferably high) number of training epochs. A similar observation was recently made with decision trees (Breiman (b)). This apparent insensitivity to overtraining of the individual classifiers simplifies the choice of the neural network design parameters.

Another interesting finding of this paper is that the weighted training version (W) of AdaBoost gives good generalization results for MLPs, but requires many more training epochs or the use of a second-order (and unfortunately batch) method such as conjugate gradients. We conjecture that this happens because of the weights on the cost function terms, especially when the weights are small, which could worsen the conditioning of the Hessian matrix. So, in terms of generalization error, all three methods (R, E, W) gave similar results, but training time was lowest with the E method with stochastic gradient descent, which samples each new training pattern from the original data with the AdaBoost weights. Although our experiments are insufficient to conclude, it is possible that the weighted training method W with conjugate gradients might be faster than the others for small training sets (a few hundred examples).

There are various ways to define variance for classifiers (e.g., Kong and Dietterich; Breiman; Kohavi and Wolpert; Tibshirani). It basically represents how the resulting classifier will vary when a different training set is sampled from the true generating distribution of the data. Our comparative results on the R, E, and W versions add credence to the view that the randomness induced by resampling the training data is not the main reason for AdaBoost's reduction of the generalization error. This is in contrast to Bagging, which is a pure variance reduction method: for Bagging, random resampling is essential to obtain the observed variance reduction.

Another interesting issue is whether the boosted neural networks could be trained with a criterion other than the mean squared error, one that would better approximate the goal of the AdaBoost criterion, i.e., minimizing a weighted classification error. See Schapire and Singer for recent work that addresses this issue.

Acknowledgments

Most of this work was done while the first author was doing a postdoctorate at the University of Montreal. The authors would like to thank the National Science and Engineering Research Council of Canada and the Government of Quebec for financial support.

References

Bauer, E. and Kohavi, R. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. To appear in Machine Learning.

Bourrely, J. Parallelization of a neural learning algorithm on a hypercube. In Hypercube and Distributed Computers, Elsevier Science Publishing, North Holland.

Breiman, L. Bagging predictors. Machine Learning.

Breiman, L. Bias, variance, and arcing classifiers. Technical report, Statistics Department, University of California at Berkeley.

Breiman, L. (a). Arcing the edge. Technical report, Statistics Department, University of California at Berkeley.

Breiman, L. (b). Prediction games and arcing classifiers. Technical report, Statistics Department, University of California at Berkeley.

Breiman, L. Arcing classifiers. Annals of Statistics.

Dietterich, T. (a). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation.

Dietterich, T. G. (b). An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Submitted to Machine Learning.

Drucker, H. and Cortes, C. Boosting decision trees. In Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E., editors, Advances in Neural Information Processing Systems, MIT Press.

Feng, C., Sutherland, A., King, R., Muggleton, S., and Henery, R. Comparison of machine learning classifiers to statistics and neural networks. In Proceedings of the Fourth International Workshop on Artificial Intelligence and Statistics.

Freund, Y. Boosting a weak learning algorithm by majority. Information and Computation.

Freund, Y. and Schapire, R. E. (a). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference.

Freund, Y. and Schapire, R. E. (b). Game theory, on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory.

Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences.

Freund, Y. and Schapire, R. E. Adaptive game playing using multiplicative weights. Games and Economic Behavior, to appear.

Friedman, J., Hastie, T., and Tibshirani, R. Additive logistic regression: a statistical view of boosting. Technical report, Department of Statistics, Stanford University.

Geman, S., Bienenstock, E., and Doursat, R. Neural networks and the bias/variance dilemma. Neural Computation.

Grove, A. J. and Schuurmans, D. Boosting in the limit: maximizing the margin of learned ensembles. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, to appear.

Kohavi, R. and Wolpert, D. H. Bias plus variance decomposition for zero-one loss functions. In Machine Learning: Proceedings of the Thirteenth International Conference.

Kong, E. B. and Dietterich, T. G. Error-correcting output coding corrects bias and variance. In Machine Learning: Proceedings of the Twelfth International Conference.

Krogh, A. and Vedelsby, J. Neural network ensembles, cross validation, and active learning. In Tesauro, G., Touretzky, D. S., and Leen, T. K., editors, Advances in Neural Information Processing Systems, MIT Press.

Maclin, R. and Opitz, D. An empirical evaluation of bagging and boosting. In Proceedings of the Fourteenth National Conference on Artificial Intelligence.

Mason, L., Bartlett, P., and Baxter, J. Direct optimization of margins improves generalization in combined classifiers. In Advances in Neural Information Processing Systems, MIT Press, in press.

Moller, M. Supervised learning on large redundant training sets. In Neural Networks for Signal Processing, IEEE Press.

Moller, M. Efficient Training of Feed-Forward Neural Networks. PhD thesis, Aarhus University, Aarhus, Denmark.

Perrone, M. P. Improving Regression Estimation: Averaging Methods for Variance Reduction with Extensions to General Convex Measure Optimization. PhD thesis, Brown University, Institute for Brain and Neural Systems.

Perrone, M. P. Putting it all together: methods for combining neural networks. In Cowan, J. D., Tesauro, G., and Alspector, J., editors, Advances in Neural Information Processing Systems, Morgan Kaufmann Publishers.

Quinlan, J. R. Bagging, boosting, and C4.5. In Machine Learning: Proceedings of the Fourteenth International Conference.

Ratsch, G., Onoda, T., and Muller, K.-R. Soft margins for AdaBoost. Technical report, Royal Holloway College.

Schapire, R. E. The strength of weak learnability. Machine Learning.

Schapire, R. E. Theoretical views of boosting. In Computational Learning Theory: Fourth European Conference, EuroCOLT, to appear.

Schapire, R. E., Freund, Y., Bartlett, P., and Lee, W. S. Boosting the margin: a new explanation for the effectiveness of voting methods. In Machine Learning: Proceedings of the Fourteenth International Conference.

Schapire, R. E. and Singer, Y. Improved boosting algorithms using confidence-rated predictions. In Proceedings of the Annual Conference on Computational Learning Theory.

Schwenk, H. and Bengio, Y. AdaBoosting neural networks: application to on-line character recognition. In International Conference on Artificial Neural Networks (ICANN), Springer Verlag.

Schwenk, H. and Bengio, Y. Training methods for adaptive boosting of neural networks. In Jordan, M. I., Kearns, M. J., and Solla, S. A., editors, Advances in Neural Information Processing Systems, MIT Press.

Schwenk, H. and Milgram, M. Constraint tangent distance for on-line character recognition. In International Conference on Pattern Recognition.

Tibshirani, R. Bias, variance and prediction error for classification rules. Technical report, Department of Statistics, University of Toronto.