Boosting Neural Networks

Holger Schwenk

LIMSI-CNRS, Orsay cedex, FRANCE

Yoshua Bengio

DIRO, University of Montreal, Succ. Centre-Ville,
Montreal, Qc, CANADA

To appear in Neural Computation

Abstract

Boosting is a general method for improving the performance of learning algorithms. A recently proposed boosting algorithm is AdaBoost. It has been applied with great success to several benchmark machine learning problems, using mainly decision trees as base classifiers. In this paper we investigate whether AdaBoost also works as well with neural networks, and we discuss the advantages and drawbacks of different versions of the AdaBoost algorithm. In particular, we compare training methods based on sampling the training set and on weighting the cost function. The results suggest that random resampling of the training data is not the main explanation of the success of the improvements brought by AdaBoost. This is in contrast to Bagging, which directly aims at reducing variance and for which random resampling is essential to obtain the reduction in generalization error. Our system achieves a low error rate on a data set of online handwritten digits written by many different writers, and boosted multilayer networks achieved low error rates on the UCI Letters and satellite data sets, significantly better than boosted decision trees.

Keywords: AdaBoost, boosting, Bagging, ensemble learning, multilayer neural networks, generalization

Introduction

Boosting is a general method for improving the performance of a learning algorithm. It is a method for finding a highly accurate classifier on the training set by combining "weak hypotheses" (Schapire), each of which needs only to be moderately accurate on the training set. (See an earlier overview of different ways to combine neural networks in Perrone.) A recently proposed boosting algorithm is AdaBoost (Freund and Schapire), which stands for "Adaptive Boosting". During the last two years, many empirical studies have been published that use decision trees as base classifiers for AdaBoost (Breiman; Drucker and Cortes; Freund and Schapire (a); Quinlan; Maclin and Opitz; Bauer and Kohavi; Dietterich (b); Grove and Schuurmans). All these experiments have shown impressive improvements in the generalization behavior and suggest that AdaBoost tends to be robust to overfitting. In fact, in many experiments it has been observed that the generalization error continues to decrease towards an apparent asymptote after the training error has reached zero. Schapire et al. suggest a possible explanation for this unusual behavior, based on the definition of the margin of classification. Other attempts to understand boosting theoretically can be found in Schapire et al., Breiman (a), Breiman, Friedman et al., and Schapire. AdaBoost has also been linked with game theory (Freund and Schapire (b); Breiman (b); Grove and Schuurmans; Freund and Schapire), in order to understand the behavior of AdaBoost and to propose alternative algorithms. Mason and Baxter propose a new variant of boosting based on the direct optimization of margins.

Additionally, there is recent evidence that AdaBoost may very well overfit if we combine several hundred thousand classifiers (Grove and Schuurmans). It also seems that the performance of AdaBoost degrades a lot in the presence of significant amounts of noise (Dietterich (b); Ratsch et al.). Although much useful work has been done, both theoretically and experimentally, there is still a lot that is not well understood about the impressive generalization behavior of AdaBoost. To the best of our knowledge, applications of AdaBoost have all been to decision trees, and no applications to multilayer artificial neural networks have been reported in the literature. This paper extends and provides a deeper experimental analysis of our first experiments with the application of AdaBoost to neural networks (Schwenk and Bengio).

In this paper we consider the following questions. Does AdaBoost work as well for neural networks as for decision trees? (Short answer: yes, sometimes even better.) Does it behave in a similar way as was observed previously in the literature? (Short answer: yes.) Furthermore, are there particulars in the way neural networks are trained with gradient back-propagation which should be taken into account when choosing a particular version of AdaBoost? (Short answer: yes, because it is possible to directly weight the cost function of neural networks.) Is overfitting of the individual neural networks a concern? (Short answer: not as much as when not using boosting.) Is the random resampling used in previous implementations of AdaBoost critical, or can we get similar performance by weighting the training criterion, which can easily be done with neural networks? (Short answer: it is not critical for generalization, but it helps to obtain faster convergence of the individual networks when coupled with stochastic gradient descent.)

The paper is organized as follows. In the next section we first describe the AdaBoost algorithm, and we discuss several implementation issues when using neural networks as base classifiers. We then present results that we have obtained on three medium-sized tasks: a data set of handwritten online digits, and the Letter and satimage data sets of the UCI repository. The paper finishes with a conclusion and perspectives for future research.

AdaBoost

It is well known that it is often possible to increase the accuracy of a classifier by averaging the decisions of an ensemble of classifiers (Perrone; Krogh and Vedelsby). In general, more improvement can be expected when the individual classifiers are diverse and yet accurate. One can try to obtain this result by taking a base learning algorithm and invoking it several times on different training sets. Two popular techniques exist that differ in the way they construct these training sets: Bagging (Breiman) and boosting (Freund; Freund and Schapire). In Bagging, each classifier is trained on a bootstrap replicate of the original training set. Given a training set S of N examples, the new training set is created by resampling N examples uniformly with replacement. Note that some examples may occur several times, while others may not occur in the sample at all; one can show that, on average, only about 63% of the examples occur in each bootstrap replicate (a short simulation of this is sketched below). Note also that the individual training sets are independent, so the classifiers could be trained in parallel. Bagging is known to be particularly effective when the classifiers are unstable, i.e., when perturbing the learning set can cause significant changes in the classification behavior of the classifiers. Formulated in the context of the bias-variance decomposition (Geman et al.), Bagging improves generalization performance due to a reduction in variance while maintaining or only slightly increasing bias. Note, however, that there is no unique bias-variance decomposition for classification tasks (Kong and Dietterich; Breiman; Kohavi and Wolpert; Tibshirani).
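For concreteness, the following minimal sketch (our own illustration, not code from the paper) draws a bootstrap replicate and checks the roughly 63% figure empirically; it only assumes NumPy.

    # Sketch: a Bagging bootstrap replicate and the ~63% (1 - 1/e) coverage claim.
    import numpy as np

    def bootstrap_replicate(n_examples, rng):
        """Indices of a bootstrap replicate: n samples drawn uniformly with replacement."""
        return rng.integers(0, n_examples, size=n_examples)

    rng = np.random.default_rng(0)
    n = 10000
    idx = bootstrap_replicate(n, rng)
    unique_fraction = len(np.unique(idx)) / n
    print(f"fraction of distinct examples in the replicate: {unique_fraction:.3f}")
    # prints a value close to 1 - 1/e ~ 0.632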

AdaBoost, on the other hand, constructs a composite classifier by sequentially training classifiers, while putting more and more emphasis on certain patterns. For this, AdaBoost maintains a probability distribution D_t(i) over the original training set. In each round t, the classifier is trained with respect to this distribution. Some learning algorithms do not allow training with respect to a weighted cost function. In this case, sampling with replacement according to the probability distribution D_t can be used to approximate a weighted cost function: examples with high probability would then occur more often than those with low probability, while some examples may not occur in the sample at all, although their probability is not zero.
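This resampling approximation takes only a few lines; the sketch below is ours, with hypothetical variable names, and simply draws a training set of the same size according to D_t.

    # Sketch: approximate a weighted cost function by resampling according to D_t.
    import numpy as np

    def resample_from_distribution(D_t, rng):
        """Return indices of a training set of the same size, sampled with replacement ~ D_t."""
        n = len(D_t)
        return rng.choice(n, size=n, replace=True, p=D_t)

    rng = np.random.default_rng(0)
    D_t = np.array([0.05, 0.05, 0.4, 0.3, 0.2])   # toy distribution over 5 examples
    print(resample_from_distribution(D_t, rng))   # high-probability examples occur more often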

Input: a sequence of N examples (x_1, y_1), ..., (x_N, y_N) with labels y_i in Y = {1, ..., k}.

Init: let B = {(i, y) : i in {1, ..., N}, y != y_i} and D_1(i, y) = 1/|B| for all (i, y) in B.

Repeat for t = 1, 2, ...:

1. Train the neural network with respect to the distribution D_t and obtain a hypothesis h_t : X x Y -> [0, 1].

2. Calculate the pseudo-loss of h_t:
   $\epsilon_t = \frac{1}{2} \sum_{(i,y) \in B} D_t(i, y)\left(1 - h_t(x_i, y_i) + h_t(x_i, y)\right)$

3. Set $\beta_t = \epsilon_t / (1 - \epsilon_t)$.

4. Update the distribution:
   $D_{t+1}(i, y) = \frac{D_t(i, y)}{Z_t}\, \beta_t^{\frac{1}{2}\left(1 + h_t(x_i, y_i) - h_t(x_i, y)\right)}$
   where Z_t is a normalization constant (such that D_{t+1} is a distribution).

Output: the final hypothesis
   $f(x) = \arg\max_{y \in Y} \sum_t \left(\log \frac{1}{\beta_t}\right) h_t(x, y)$

Table 1: Pseudo-loss AdaBoost (AdaBoost.M2).
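To make the steps of Table 1 concrete, the following Python sketch implements the same loop around a generic base learner. It is our own illustration, not the authors' implementation; `train_base_learner` is a placeholder that must return a hypothesis giving a confidence score in [0, 1] for every class.

    # Sketch of the pseudo-loss AdaBoost (AdaBoost.M2) loop of Table 1.
    import numpy as np

    def adaboost_m2(X, y, n_classes, n_rounds, train_base_learner):
        N = len(y)
        # D has shape (N, k); it is a distribution over mislabels, zero on the true label.
        D = np.ones((N, n_classes)) / (N * (n_classes - 1))
        D[np.arange(N), y] = 0.0
        hypotheses, log_inv_betas = [], []
        for t in range(n_rounds):
            h = train_base_learner(X, y, D)          # returns a callable: h(X) -> (N, k) scores
            scores = h(X)
            correct = scores[np.arange(N), y]        # h_t(x_i, y_i)
            # pseudo-loss: eps_t = 1/2 * sum D(i,y) * (1 - h(x_i,y_i) + h(x_i,y))
            eps = 0.5 * np.sum(D * (1.0 - correct[:, None] + scores))
            beta = eps / (1.0 - eps)
            # update: D(i,y) *= beta ** (1/2 * (1 + h(x_i,y_i) - h(x_i,y))), then normalize by Z_t
            D = D * beta ** (0.5 * (1.0 + correct[:, None] - scores))
            D[np.arange(N), y] = 0.0                 # mass stays on mislabels (already zero)
            D /= D.sum()
            hypotheses.append(h)
            log_inv_betas.append(np.log(1.0 / beta))
        def final_hypothesis(X_new):
            votes = sum(w * h(X_new) for w, h in zip(log_inv_betas, hypotheses))
            return np.argmax(votes, axis=1)          # arg max_y sum_t log(1/beta_t) h_t(x, y)
        return final_hypothesis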

After each AdaBoost round, the probability of incorrectly labeled examples is increased, and the probability of correctly labeled examples is decreased. In the basic algorithm, the result of training the t-th classifier is a hypothesis h_t : X -> Y, where Y = {1, ..., k} is the space of labels and X is the space of input features. After the t-th round, the weighted error $\epsilon_t = \sum_{i:\, h_t(x_i) \neq y_i} D_t(i)$ of the resulting classifier is calculated, and the distribution D_{t+1} is computed from D_t by increasing the probability of incorrectly labeled examples. The probabilities are changed so that the error of the t-th classifier using these new weights D_{t+1} would be 1/2. In this way the classifiers are optimally decoupled. The global decision f is obtained by weighted voting. This basic AdaBoost algorithm converges (learns the training set) if each classifier yields a weighted error that is less than 1/2, i.e., better than chance in the two-class case.
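The statement that the classifiers are "optimally decoupled" can be verified with a short calculation (our addition, not spelled out in the text), using the standard discrete AdaBoost update in which correctly classified examples are scaled by $\beta_t = \epsilon_t/(1-\epsilon_t)$ and incorrectly classified ones are left unchanged before normalization:

$$D_{t+1}(i) = \frac{D_t(i)}{Z_t}\,\beta_t^{\,1 - [h_t(x_i) \neq y_i]},
\qquad
Z_t = \sum_i D_t(i)\,\beta_t^{\,1 - [h_t(x_i) \neq y_i]} = (1-\epsilon_t)\,\beta_t + \epsilon_t = 2\epsilon_t,$$

so the weighted error of $h_t$ under the new distribution is

$$\sum_{i:\,h_t(x_i)\neq y_i} D_{t+1}(i) = \frac{\epsilon_t}{Z_t} = \frac{1}{2},$$

i.e., the previous hypothesis is no better than chance with respect to the new weights.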

In general, neural network classifiers provide more information than just a class label. It can be shown that the network outputs approximate the a posteriori probabilities of the classes, and it might be useful to use this information rather than to perform a hard decision for one recognized class. This issue is addressed by another version of AdaBoost, called AdaBoost.M2 (Freund and Schapire). It can be used when the classifier computes confidence scores for each class. The result of training the t-th classifier is now a hypothesis h_t : X x Y -> [0, 1]. (The scores do not need to sum to one.) Furthermore, we use a distribution D_t(i, y) over the set of all mislabels, B = {(i, y) : i in {1, ..., N}, y != y_i}, where N is the number of training examples; therefore |B| = N(k - 1). AdaBoost modifies this distribution so that the next learner focuses not only on the examples that are hard to classify, but more specifically on improving the discrimination between the correct class and the incorrect class that competes with it. Note that the mislabel distribution D_t induces a distribution over the examples, $P_t(i) = W_i^t / \sum_i W_i^t$ where $W_i^t = \sum_{y \neq y_i} D_t(i, y)$; P_t(i) may be used for resampling the training set. Freund and Schapire define the pseudo-loss of a learning machine as

$\epsilon_t = \frac{1}{2} \sum_{(i,y) \in B} D_t(i, y)\left(1 - h_t(x_i, y_i) + h_t(x_i, y)\right).$

It is minimized if the confidence scores of the correct labels are 1 and the confidence scores of all the wrong labels are 0. The final decision f is obtained by adding together the weighted confidence scores of all the machines (all the hypotheses h_t). Table 1 summarizes the AdaBoost.M2 algorithm. This multi-class boosting algorithm converges if each classifier yields a pseudo-loss that is less than 1/2, i.e., better than any constant hypothesis.

AdaBoost has very interesting theoretical properties; in particular, it can be shown that the error of the composite classifier on the training data decreases exponentially fast to zero as the number of combined classifiers is increased (Freund and Schapire). Many empirical evaluations of AdaBoost also provide an analysis of the so-called margin distribution. The margin is defined as the difference between the ensemble score of the correct class and the strongest ensemble score of a wrong class. In the case in which there are just two possible labels {-1, +1}, this is y f(x), where f is the output of the composite classifier and y is the correct label. The classification is correct if the margin is positive. Discussions about the relevance of the margin distribution for the generalization behavior of ensemble techniques can be found in Freund and Schapire (b), Schapire et al., Breiman (a), Breiman (b), Grove and Schuurmans, and Ratsch et al.
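The margin and its cumulative distribution, which are plotted in the figures later in the paper, can be computed directly from the ensemble scores. The following sketch is ours; it assumes that the scores have been normalized so that margins lie in [-1, 1].

    # Sketch: margin = correct-class score minus the strongest wrong-class score.
    import numpy as np

    def margins(scores, y):
        """scores: (n, k) ensemble scores; y: (n,) true labels."""
        n = len(y)
        correct = scores[np.arange(n), y]
        wrong = scores.copy()
        wrong[np.arange(n), y] = -np.inf
        return correct - wrong.max(axis=1)

    def cumulative_margin_distribution(m, xs):
        """Fraction of examples whose margin is at most x, for each x in xs."""
        return np.array([(m <= x).mean() for x in xs])

    # usage sketch:
    # m = margins(normalized_scores, y_test)
    # cdf = cumulative_margin_distribution(m, np.linspace(-1.0, 1.0, 201))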

In this paper, an important focus is on whether the good generalization performance of AdaBoost is partially explained by the random resampling of the training sets generally used in its implementation. This issue will be addressed by comparing three versions of AdaBoost, as described in the next section, in which randomization is used, or not used, in three different ways.

Applying AdaBoost to neural networks

In this paper we investigate different techniques for using neural networks as base classifiers for AdaBoost. In all cases we have trained the neural networks by minimizing a quadratic criterion that is a weighted sum of the squared differences $(z_{ij} - \hat{z}_{ij})^2$, where $z_i = (z_{i1}, \ldots, z_{ik})$ is the desired output vector, with a low target value everywhere except at the position corresponding to the target class, and $\hat{z}_i$ is the output vector of the network. A score for class j for pattern i can be directly obtained from the j-th element $\hat{z}_{ij}$ of the output vector $\hat{z}_i$. When a class must be chosen, the one with the highest score is selected. Let $V_t(i, j) = D_t(i, j) / \max_{k \neq y_i} D_t(i, k)$ for $j \neq y_i$, and $V_t(i, y_i) = 1$. These weights are used to give more emphasis to certain incorrect labels, according to the pseudo-loss AdaBoost.

What we call an epoch is a pass of the training algorithm through all the examples in the training set. In this paper we compare three different versions of AdaBoost:

(R) Training the t-th classifier with a fixed training set obtained by resampling with replacement once from the original training set: before starting to train the t-th network, we sample N patterns from the original training set, each time with a probability P_t(i) of picking pattern i. Training is performed for a fixed number of iterations, always using this same resampled training set. This is basically the scheme that has been used in the past when applying AdaBoost to decision trees, except that we used the pseudo-loss AdaBoost. To approximate the pseudo-loss, the training cost that is minimized for a pattern that is the i-th one of the original training set is $\sum_j V_t(i, j)\,(z_{ij} - \hat{z}_{ij})^2$.

(E) Training the t-th classifier using a different training set at each epoch, by resampling with replacement after each training epoch: after each epoch, a new training set is obtained by sampling from the original training set with probabilities P_t(i). Since we used an online stochastic gradient in this case, this is equivalent to sampling a new pattern from the original training set with probability P_t(i) before each forward-backward pass through the neural network. Training continues until a fixed number of pattern presentations has been performed. As for R, the training cost that is minimized for a pattern that is the i-th one of the original training set is $\sum_j V_t(i, j)\,(z_{ij} - \hat{z}_{ij})^2$.

(W) Training the t-th classifier by directly weighting the cost function (here the squared error) of the t-th neural network, i.e., all the original training patterns are in the training set, but the cost is weighted by the probability of each example: $\sum_j D_t(i, j)\,(z_{ij} - \hat{z}_{ij})^2$. If we used this formula directly, the gradients would be very small, even when all probabilities $D_t(i, j)$ are identical. To avoid having to scale the learning rates differently depending on the number of examples, the following normalized error function was used:

$\frac{P_t(i)}{\max_k P_t(k)} \sum_j V_t(i, j)\,(z_{ij} - \hat{z}_{ij})^2$
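The per-pattern training costs of the three versions can be written compactly as follows. This is our own sketch with hypothetical names (`D_t`, `P_t`, `z`, `z_hat`), not the authors' code; it only illustrates how the weightings enter the squared-error criterion.

    # Sketch of the per-pattern costs used by the R, E, and W versions.
    import numpy as np

    def pseudo_loss_weights(D_t, y):
        """V_t(i, j) = D_t(i, j) / max_{k != y_i} D_t(i, k), and V_t(i, y_i) = 1."""
        V = D_t / D_t.max(axis=1, keepdims=True)   # true-label column of D_t is zero
        V[np.arange(len(y)), y] = 1.0
        return V

    def cost_R_or_E(V_t, i, z, z_hat):
        # R and E: patterns are *sampled* according to P_t, and each sampled pattern i
        # contributes sum_j V_t(i, j) (z_ij - z_hat_ij)^2
        return np.sum(V_t[i] * (z[i] - z_hat[i]) ** 2)

    def cost_W(V_t, P_t, i, z, z_hat):
        # W: every original pattern is kept, with its cost scaled by the normalized
        # probability P_t(i) / max_k P_t(k)
        return (P_t[i] / P_t.max()) * np.sum(V_t[i] * (z[i] - z_hat[i]) ** 2)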

In E and W, what makes the combined networks essentially different from each other is the fact that they are trained with respect to different weightings D_t of the original training set. In R, instead, an additional element of diversity is built in, because the criterion used for the t-th network is not exactly the errors weighted by P_t(i): more emphasis is put on certain patterns while others are completely ignored, because of the initial random sampling of the training set. The E version can be seen as a stochastic version of the W version, i.e., as the number of iterations through the data increases and the learning rate decreases, E becomes a very good approximation of W. W itself is closest to the recipe mandated by the AdaBoost algorithm, but as we will see below, it suffers from numerical problems. Note that E is a better approximation of the weighted cost function than R, in particular when many epochs are performed. If random resampling of the training data explained a good part of the generalization performance of AdaBoost, then the weighted training version W should perform worse than the resampling versions, and the fixed-sample version R should perform better than the continuously resampled version E. Note that for Bagging, which directly aims at reducing variance, random resampling is essential to obtain the reduction in generalization error.

Results

Experiments have been performed on three data sets: a data set of online handwritten digits, the UCI Letters data set of offline machine-printed alphabetical characters, and the UCI satellite data set, which is generated from Landsat Multi-Spectral Scanner image data. All data sets have a predefined training and test set. All the p-values that are given in this section concern a pair $(\hat{p}_1, \hat{p}_2)$ of test performance results, on n test points, for two classification systems with unknown true error rates $p_1$ and $p_2$. The null hypothesis is that the true expected performance of the two systems is not different, i.e., $p_1 = p_2$. Let $\hat{p} = (\hat{p}_1 + \hat{p}_2)/2$ be the estimator of the common error rate under the null hypothesis. The alternative hypothesis is that $p_1 > p_2$, so the p-value is obtained as the probability of observing such a large difference under the null hypothesis, i.e., $P(Z > z)$ for a Normal Z, with

$z = \frac{(\hat{p}_1 - \hat{p}_2)\sqrt{n}}{\sqrt{2\,\hat{p}\,(1-\hat{p})}}.$

This is based on the Normal approximation of the Binomial, which is appropriate for large n (however, see Dietterich (a) for a discussion of this and other tests to compare algorithms).
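In code, this test reads as follows. The sketch is ours; it assumes SciPy is available for the Normal tail probability, and the error rates in the example call are hypothetical figures, not results from the paper.

    # Sketch: one-sided two-proportion z-test on the test-set error rates of two classifiers.
    from math import sqrt
    from scipy.stats import norm

    def p_value(p1_hat, p2_hat, n):
        """p-value for H0: p1 = p2 against H1: p1 > p2 (Normal approximation)."""
        p_hat = 0.5 * (p1_hat + p2_hat)           # pooled error rate under H0
        z = (p1_hat - p2_hat) * sqrt(n) / sqrt(2.0 * p_hat * (1.0 - p_hat))
        return 1.0 - norm.cdf(z)                  # P(Z > z)

    # hypothetical example: 2.0% vs 1.5% error on n = 4000 test points
    print(p_value(0.020, 0.015, 4000))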

Results on the online data set

The online data set was collected at Paris University (Schwenk and Milgram). A WACOM tablet with a cordless pen was used, in order to allow natural writing. Since we wanted to build a writer-independent recognition system, we tried to use many writers and to impose as few constraints as possible on the writing style. In total, the students who participated wrote down isolated digits, which were divided into a learning set and a test set. Note that the writers of the training and test sets are completely distinct. A particular property of this data set is the notable variety of writing styles, which are not all equally frequent: there are, for instance, many zeros written counterclockwise but only few written clockwise. Figure 1 gives an idea of the great variety of writing styles of this data set. We applied only a simple preprocessing: the characters were resampled to a fixed number of points, centered, and size-normalized to an (x, y)-coordinate sequence.
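A possible implementation of this preprocessing is sketched below. It is our illustration, not the authors' code; the number of resampled points is a parameter, set here to 11 only because the 22-h-10 networks used later have 22 inputs, i.e., 11 (x, y) pairs.

    # Sketch: resample a pen trajectory to a fixed number of points, center, size-normalize.
    import numpy as np

    def preprocess(stroke, n_points=11):
        """stroke: array of shape (T, 2) with the raw (x, y) pen positions."""
        # arc-length parameterization, then uniform resampling along the trajectory
        seg = np.linalg.norm(np.diff(stroke, axis=0), axis=1)
        s = np.concatenate([[0.0], np.cumsum(seg)])
        s_new = np.linspace(0.0, s[-1], n_points)
        pts = np.column_stack([np.interp(s_new, s, stroke[:, d]) for d in range(2)])
        pts -= pts.mean(axis=0)                      # center
        pts /= np.abs(pts).max() + 1e-12             # size-normalize the coordinates
        return pts.ravel()                           # fixed-length feature vector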

Table 2 summarizes the results on the test set before using AdaBoost. Note that the differences among the test results of the last three networks are not statistically significant, whereas the difference with the first network is significant. Cross-validation within the training set was used to find the optimal number of training epochs. Note that if training is continued for many more epochs, the test error increases.

Figure 1: Some examples of the online handwritten digits data set (test set).

Table 2: Online digits data set: error rates for fully connected MLPs (not boosted). For each architecture, the training and test error rates are reported.

Table 3 shows the results of bagged and boosted multilayer perceptrons with 10, 30, or 50 hidden units, trained for different numbers of epochs, and using either the ordinary resampling scheme (R), resampling with different random selections at each epoch (E), or training with weights D_t on the squared-error criterion for each pattern (W). In all cases, up to 100 neural networks were combined. (The notation 22-h-10 designates a fully connected neural network with 22 input nodes, one hidden layer with h neurons, and a 10-dimensional output layer.)

AdaBoost improved the generalization error of the MLPs in all cases, and the improvement over the corresponding unboosted network is statistically significant despite the small number of test examples. Boosting was also always superior to Bagging, although the differences are not always very significant because of the small number of test examples.

Table 3: Online digits, test error rates for boosted MLPs. Columns: the three architectures (22-10-10, 22-30-10, 22-50-10), each with the versions R, E, and W; rows: Bagging and AdaBoost for several numbers of training epochs per network.

Furthermore, it seems that the number of training epochs of each individual classifier has no significant impact on the results of the combined classifier, at least on this data set. AdaBoost with weighted training of the MLPs (the W version), however, does not work as well if the learning of each individual MLP is stopped too early: the networks did not learn the weighted examples well enough, and the pseudo-loss rapidly approached 1/2. When training each MLP for more epochs, however, the weighted training version W achieved the same low test error rate.

AdaBoost is less useful for very large networks on this data, since an individual classifier can then achieve zero error on the original training set using the E or W method. Such large networks probably have a very low bias but high variance. This may explain why Bagging, a pure variance reduction method, can do as well as AdaBoost, which is believed to reduce bias and variance. Note, however, that AdaBoost can achieve the same low error rates with the smaller networks.

Figure 2 shows the error rates of some of the boosted classifiers as the number of networks is increased. AdaBoost brings the training error to zero after only a few steps, even for the MLP with only 10 hidden units. The generalization error is also considerably improved, and it continues to decrease towards an apparent asymptote after zero training error has been reached. The surprising effect of continuously decreasing generalization error, even after the training error reaches zero, has already been observed by others (Breiman; Drucker and Cortes; Freund and Schapire (a); Quinlan). This seems to contradict Occam's razor, but a recent theorem (Schapire et al.) suggests that the margin distribution may be relevant to the generalization error. Although previous empirical results (Schapire et al.) indicate that pushing the margin cumulative distribution to the right may improve generalization, other recent results (Breiman (a); Breiman (b); Grove and Schuurmans) show that improving the whole margin distribution can also yield worse generalization. Figures 3 and 4 show several margin cumulative distributions, i.e., the fraction of examples whose margin is at most x, as a function of x. The networks had been trained for a fixed number of epochs (with the larger number of epochs for the W version).

Figure 2: Error rates of the boosted classifiers for an increasing number of networks (log scale), for the MLPs 22-10-10, 22-30-10, and 22-50-10. Each panel shows the training and test error (in %) of Bagging, AdaBoost (R), AdaBoost (E), and AdaBoost (W). For clarity, the training error of Bagging is not shown; it overlaps with the test error rates of AdaBoost. The dotted constant horizontal line corresponds to the test error of the unboosted classifier. Small oscillations are not significant since they correspond to few examples.

Figure 3: Margin cumulative distributions for AdaBoost (R) (left column) and AdaBoost (E) (right column) of the three MLP architectures, using 1, 2, 5, 10, 50, and 100 networks, respectively.

Figure 4: Margin cumulative distributions for AdaBoost (W) (left column) and Bagging (right column) of the three MLP architectures, using 1, 2, 5, 10, 50, and 100 networks, respectively.

It is clear in Figures 3 and 4 that the number of examples with a high margin increases when more classifiers are combined by boosting. When boosting the neural networks with 10 hidden units, for instance, there are some examples with small or even negative margins when only two networks are combined; however, all examples have a positive margin when enough networks are combined, and the minimal margin continues to increase as further networks are added. Bagging, on the other hand, has no significant influence on the margin distributions. There is almost no difference between the margin distributions of the R, E, or W versions of AdaBoost either. Note, however, that there is a difference between the margin distributions (and the test set errors) when the complexity of the neural networks is varied (hidden layer size). Finally, it seems that sometimes AdaBoost must allow some examples with very high margins in order to improve the minimal margin; this can best be seen for the larger architectures.

One should keep in mind that this data set contains only small amounts of noise. In application domains with high amounts of noise, it may be less advantageous to improve the minimal margin at any price (Grove and Schuurmans; Ratsch et al.), since this would mean putting too much weight on noisy or wrongly labeled examples.

Results on the UCI Letters and Satimage Data Sets

Similar experiments were performed with MLPs on the Letters data set from the UCI Machine Learning repository. It has 16,000 training and 4,000 test patterns, 16 input features, and 26 classes (A-Z) of distorted machine-printed characters from a variety of different fonts. A few preliminary experiments, on the training set only, were used to choose the network architecture. Each input feature was normalized according to its mean and variance on the training set.

Two types of experiments were performed: resampling after each epoch (E), using stochastic gradient descent, and, without resampling, reweighting of the squared error (W), using conjugate gradient descent. In both cases a fixed number of training epochs was used. The plain, bagged, and boosted networks are compared to decision trees in Table 4.

Table 4: Test error rates on the UCI data sets (letter and satellite) for CART (results from Breiman), C4.5 (results from Freund and Schapire (a)), and MLPs, each used alone, bagged, and boosted.

In both cases, E and W, the same final generalization error was obtained, but the training time using the weighted squared error (W) was much greater. This shows that using random resampling, as in E or R, is not necessary to obtain good generalization, whereas it is clearly necessary for Bagging. However, the experiments show that it is still preferable to use a random sampling method such as R or E for numerical reasons: convergence of each network is faster. (One may also note that the W and E versions achieve slightly higher margins than R.)

Figure 5: Error rates of the bagged and boosted neural networks for the UCI Letter data set (log scale; training and test error in % versus the number of networks). SG+E denotes stochastic gradient descent with resampling after each epoch; CG+W means conjugate gradient descent with weighting of the squared error. For clarity, the training error of Bagging is not shown; it flattens out. The dotted constant horizontal line corresponds to the test error of the unboosted classifier.

For this reason, for the E experiments with stochastic gradient descent, the full ensemble of networks was boosted, whereas we stopped training of the W version after fewer networks, when the generalization error seemed to have flattened out; this already took more than a week on a fast processor (SGI Origin). We believe that the main reason for this difference in training time is that the conjugate gradient method is a batch method and is therefore slower than stochastic gradient descent on redundant data sets with many thousands of examples, such as this one (see comparisons between batch and online methods in Bourrely, and of conjugate gradients for classification tasks in particular in Moller).

For the W version with stochastic gradient descent, the weighted training error of the individual networks does not decrease as much as when using conjugate gradient descent, so that AdaBoost itself did not work as well. We believe that this is due to the fact that it is difficult for stochastic gradient descent to approach a minimum when the output error is weighted with very different weights for different patterns: on the patterns with small weights, almost no progress can be made. The conjugate gradient descent method, on the other hand, can approach a minimum of the weighted cost function more precisely, but it does so inefficiently when there are thousands of training examples.

The error rates obtained with the boosted networks are extremely good, whether using the W version with conjugate gradients or the E version with stochastic gradient descent, and are, to the best of our knowledge, the best published to date for this data set. In a comparison with the error of the boosted decision trees, the p-value of the null hypothesis is very small. The best performance reported in STATLOG (Feng et al.) is considerably worse. Note also that we need to combine only a few neural networks to get important improvements: with the E version, a small number of networks suffices for the test error to fall well below that of the unboosted network, whereas boosted decision trees typically converge later. The W version of AdaBoost converged faster in terms of the number of networks (Figure 5), reaching its apparent asymptote after fewer networks, but converged much more slowly in terms of training time. Figure 6 shows the margin distributions for Bagging and AdaBoost applied to this data set. Again, Bagging has no effect on the margin distribution, whereas AdaBoost clearly increases the number of examples with large margins.

Figure 6: Margin cumulative distributions for the UCI Letter data set: Bagging (left) and AdaBoost (SG+E) (right), using 1, 2, 5, 10, 50, and 100 networks.

Similar conclusions hold for the UCI satellite data set (Table 4), although the improvements are not as dramatic as in the case of the Letter data set. The improvement due to AdaBoost is statistically significant, but the difference in performance between the boosted MLPs and the boosted decision trees is not. This data set has 6,435 examples, the first 4,435 of which are used for training and the last 2,000 for testing generalization. There are 36 inputs and 6 classes, and a fully connected MLP was used. Again, the two best training methods are epoch resampling (E) with stochastic gradient descent and the weighted squared error (W) with conjugate gradient descent.

Conclusion

As demonstrated here on three real-world applications, AdaBoost can significantly improve neural classifiers. In particular, the results obtained on the UCI Letters data set are, to the best of our knowledge, significantly better than the best published results to date. The behavior of AdaBoost for neural networks confirms previous observations on other learning algorithms (e.g., Breiman; Drucker and Cortes; Freund and Schapire (a); Quinlan; Schapire et al.), such as the continued generalization improvement after zero training error has been reached, and the associated improvement in the margin distribution. It also seems that AdaBoost is not very sensitive to overtraining of the individual classifiers, so that the neural networks can be trained for a fixed (preferably high) number of training epochs. A similar observation was recently made with decision trees (Breiman (b)). This apparent insensitivity to overtraining of the individual classifiers simplifies the choice of the neural network design parameters.

Another interesting finding of this paper is that the weighted training version (W) of AdaBoost gives good generalization results for MLPs, but requires many more training epochs or the use of a second-order (and unfortunately batch) method such as conjugate gradients. We conjecture that this happens because of the weights on the cost function terms, especially when the weights are small, which could worsen the conditioning of the Hessian matrix. So, in terms of generalization error, all three methods (R, E, W) gave similar results, but training time was lowest with the E method with stochastic gradient descent, which samples each new training pattern from the original data with the AdaBoost weights. Although our experiments are insufficient to conclude, it is possible that the weighted training method W with conjugate gradients might be faster than the others for small training sets (a few hundred examples).

There are various ways to define variance for classifiers (e.g., Kong and Dietterich; Breiman; Kohavi and Wolpert; Tibshirani). It basically represents how the resulting classifier will vary when a different training set is sampled from the true generating distribution of the data. Our comparative results on the R, E, and W versions add credence to the view that the randomness induced by resampling the training data is not the main reason for AdaBoost's reduction of the generalization error. This is in contrast to Bagging, which is a pure variance reduction method: for Bagging, random resampling is essential to obtain the observed variance reduction.

Another interesting issue is whether the boosted neural networks could be trained with a criterion other than the mean squared error, one that would better approximate the goal of the AdaBoost criterion, i.e., minimizing a weighted classification error. See Schapire and Singer for recent work that addresses this issue.

Acknowledgments

Most of this work was done while the first author was doing a postdoctorate at the University of Montreal. The authors would like to thank the National Science and Engineering Research Council of Canada and the Government of Quebec for financial support.

References

Bauer, E. and Kohavi, R. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. To appear in Machine Learning.

Bourrely, J. Parallelization of a neural learning algorithm on a hypercube. In Hypercube and Distributed Computers, Elsevier Science Publishing, North Holland.

Breiman, L. Bagging predictors. Machine Learning.

Breiman, L. Bias, variance, and arcing classifiers. Technical report, Statistics Department, University of California at Berkeley.

Breiman, L. (a). Arcing the edge. Technical report, Statistics Department, University of California at Berkeley.

Breiman, L. (b). Prediction games and arcing classifiers. Technical report, Statistics Department, University of California at Berkeley.

Breiman, L. Arcing classifiers. Annals of Statistics.

Dietterich, T. (a). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation.

Dietterich, T. G. (b). An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Submitted to Machine Learning.

Drucker, H. and Cortes, C. Boosting decision trees. In Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E., editors, Advances in Neural Information Processing Systems, MIT Press.

Feng, C., Sutherland, A., King, R., Muggleton, S., and Henery, R. Comparison of machine learning classifiers to statistics and neural networks. In Proceedings of the Fourth International Workshop on Artificial Intelligence and Statistics.

Freund, Y. Boosting a weak learning algorithm by majority. Information and Computation.

Freund, Y. and Schapire, R. E. (a). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference.

Freund, Y. and Schapire, R. E. (b). Game theory, on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory.

Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences.

Freund, Y. and Schapire, R. E. Adaptive game playing using multiplicative weights. Games and Economic Behavior, to appear.

Friedman, J., Hastie, T., and Tibshirani, R. Additive logistic regression: a statistical view of boosting. Technical report, Department of Statistics, Stanford University.

Geman, S., Bienenstock, E., and Doursat, R. Neural networks and the bias/variance dilemma. Neural Computation.

Grove, A. J. and Schuurmans, D. Boosting in the limit: maximizing the margin of learned ensembles. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, to appear.

Kohavi, R. and Wolpert, D. H. Bias plus variance decomposition for zero-one loss functions. In Machine Learning: Proceedings of the Thirteenth International Conference.

Kong, E. B. and Dietterich, T. G. Error-correcting output coding corrects bias and variance. In Machine Learning: Proceedings of the Twelfth International Conference.

Krogh, A. and Vedelsby, J. Neural network ensembles, cross validation, and active learning. In Tesauro, G., Touretzky, D. S., and Leen, T. K., editors, Advances in Neural Information Processing Systems, MIT Press.

Maclin, R. and Opitz, D. An empirical evaluation of bagging and boosting. In Proceedings of the Fourteenth National Conference on Artificial Intelligence.

Mason, L., Bartlett, P., and Baxter, J. Direct optimization of margins improves generalization in combined classifiers. In Advances in Neural Information Processing Systems, MIT Press, in press.

Moller, M. Supervised learning on large redundant training sets. In Neural Networks for Signal Processing, IEEE Press.

Moller, M. Efficient Training of Feed-Forward Neural Networks. PhD thesis, Aarhus University, Aarhus, Denmark.

Perrone, M. P. Improving Regression Estimation: Averaging Methods for Variance Reduction with Extensions to General Convex Measure Optimization. PhD thesis, Brown University, Institute for Brain and Neural Systems.

Perrone, M. P. Putting it all together: methods for combining neural networks. In Cowan, J. D., Tesauro, G., and Alspector, J., editors, Advances in Neural Information Processing Systems, Morgan Kaufmann Publishers.

Quinlan, J. R. Bagging, boosting, and C4.5. In Machine Learning: Proceedings of the Fourteenth International Conference.

Ratsch, G., Onoda, T., and Muller, K.-R. Soft margins for AdaBoost. Technical report, Royal Holloway College.

Schapire, R. E. The strength of weak learnability. Machine Learning.

Schapire, R. E. Theoretical views of boosting. In Computational Learning Theory: Fourth European Conference, EuroCOLT, to appear.

Schapire, R. E., Freund, Y., Bartlett, P., and Lee, W. S. Boosting the margin: a new explanation for the effectiveness of voting methods. In Machine Learning: Proceedings of the Fourteenth International Conference.

Schapire, R. E. and Singer, Y. Improved boosting algorithms using confidence-rated predictions. In Proceedings of the Annual Conference on Computational Learning Theory.

Schwenk, H. and Bengio, Y. AdaBoosting neural networks: application to on-line character recognition. In International Conference on Artificial Neural Networks (ICANN), Springer Verlag.

Schwenk, H. and Bengio, Y. Training methods for adaptive boosting of neural networks. In Jordan, M. I., Kearns, M. J., and Solla, S. A., editors, Advances in Neural Information Processing Systems, MIT Press.

Schwenk, H. and Milgram, M. Constraint tangent distance for on-line character recognition. In International Conference on Pattern Recognition.

Tibshirani, R. Bias, variance and prediction error for classification rules. Technical report, Department of Statistics, University of Toronto.