A Brief Introduction to Boosting
Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, 1999.

Robert E. Schapire
AT&T Labs, Shannon Laboratory, Park Avenue, Florham Park, NJ, USA
www.research.att.com/~schapire
schapire@research.att.com

Abstract

Boosting is a general method for improving the accuracy of any given learning algorithm. This short paper introduces the boosting algorithm AdaBoost, and explains the underlying theory of boosting, including an explanation of why boosting often does not suffer from overfitting. Some examples of recent applications of boosting are also described.

Background

Boosting is a general method which attempts to "boost" the accuracy of any given learning algorithm. Boosting has its roots in a theoretical framework for studying machine learning called the "PAC" learning model, due to Valiant; see Kearns and Vazirani for a good introduction to this model. Kearns and Valiant were the first to pose the question of whether a "weak" learning algorithm, which performs just slightly better than random guessing in the PAC model, can be "boosted" into an arbitrarily accurate "strong" learning algorithm. Schapire came up with the first provable polynomial-time boosting algorithm; a year later, Freund developed a much more efficient boosting algorithm which, although optimal in a certain sense, nevertheless suffered from certain practical drawbacks. The first experiments with these early boosting algorithms were carried out by Drucker, Schapire and Simard on an OCR task.

AdaBoost

The AdaBoost algorithm, introduced by Freund and Schapire, solved many of the practical difficulties of the earlier boosting algorithms, and is the focus of this paper. Pseudocode for AdaBoost is given in Fig. 1. The algorithm takes as input a training set $(x_1, y_1), \ldots, (x_m, y_m)$ where each $x_i$ belongs to some domain or instance space $X$, and each label $y_i$ is in some label set $Y$. For most of this paper we assume $Y = \{-1, +1\}$; later, we discuss extensions to the multiclass case. AdaBoost calls a given weak or base learning algorithm repeatedly in a series of rounds $t = 1, \ldots, T$.

  Given: $(x_1, y_1), \ldots, (x_m, y_m)$ where $x_i \in X$, $y_i \in Y = \{-1, +1\}$.
  Initialize $D_1(i) = 1/m$.
  For $t = 1, \ldots, T$:
    - Train the weak learner using distribution $D_t$.
    - Get a weak hypothesis $h_t : X \to \{-1, +1\}$ with error
        $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i]$.
    - Choose $\alpha_t = \frac{1}{2} \ln\!\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$.
    - Update:
        $D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } h_t(x_i) = y_i \\ e^{\alpha_t} & \text{if } h_t(x_i) \neq y_i \end{cases} = \frac{D_t(i)\,\exp(-\alpha_t\, y_i\, h_t(x_i))}{Z_t}$,
      where $Z_t$ is a normalization factor (chosen so that $D_{t+1}$ will be a distribution).
  Output the final hypothesis:
    $H(x) = \operatorname{sign}\!\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$.

Figure 1: The boosting algorithm AdaBoost.

One of the main ideas of the algorithm is to maintain a distribution or set of weights over the training set. The weight of this distribution on training example $i$ on round $t$ is denoted $D_t(i)$. Initially, all weights are set equally, but on each round the weights of incorrectly classified examples are increased so that the weak learner is forced to focus on the hard examples in the training set.

The weak learner's job is to find a weak hypothesis $h_t : X \to \{-1, +1\}$ appropriate for the distribution $D_t$. The goodness of a weak hypothesis is measured by its error
  $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i] = \sum_{i : h_t(x_i) \neq y_i} D_t(i)$.
Notice that the error is measured with respect to the distribution $D_t$ on which the weak learner was trained. In practice, the weak learner may be an algorithm that can use the weights $D_t$ on the training examples. Alternatively, when this is not possible, a subset of the training examples can be sampled according to $D_t$, and these (unweighted) resampled examples can be used to train the weak learner.

Once the weak hypothesis $h_t$ has been received, AdaBoost chooses a parameter $\alpha_t$ as in the figure. Intuitively, $\alpha_t$ measures the importance that is assigned to $h_t$. Note that $\alpha_t \geq 0$ if $\epsilon_t \leq 1/2$ (which we can assume without loss of generality), and that $\alpha_t$ gets larger as $\epsilon_t$ gets smaller.

The distribution $D_t$ is next updated using the rule shown in the figure. The effect of this rule is to increase the weight of examples misclassified by $h_t$, and to decrease the weight of correctly classified examples. Thus, the weight tends to concentrate on "hard" examples.

The final hypothesis $H$ is a weighted majority vote of the $T$ weak hypotheses, where $\alpha_t$ is the weight assigned to $h_t$.
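To make the steps in Fig. 1 concrete, here is a minimal Python sketch of the algorithm. It is not from the paper: NumPy, the decision-stump weak learner, and the names train_stump, adaboost, and predict are illustrative assumptions, and the stump simply stands in for whatever weak learner is actually used.

```python
# Minimal AdaBoost sketch following the pseudocode in Fig. 1.
# Assumptions (not from the paper): labels are in {-1, +1} and the weak
# learner is a one-feature threshold "decision stump".
import numpy as np

def train_stump(X, y, D):
    """Weak learner: pick the (feature, threshold, polarity) stump that
    minimizes the weighted error under the current distribution D."""
    m, n = X.shape
    best, best_err = None, np.inf
    for j in range(n):
        for thresh in np.unique(X[:, j]):
            for polarity in (+1, -1):
                pred = np.where(X[:, j] <= thresh, polarity, -polarity)
                err = np.sum(D[pred != y])
                if err < best_err:
                    best_err, best = err, (j, thresh, polarity)
    j, thresh, polarity = best
    h = lambda Z: np.where(Z[:, j] <= thresh, polarity, -polarity)
    return h, best_err

def adaboost(X, y, T=50):
    """Run T rounds of AdaBoost; return the weak hypotheses h_t and weights alpha_t."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    m = len(y)
    D = np.full(m, 1.0 / m)                    # D_1(i) = 1/m
    hypotheses, alphas = [], []
    for t in range(T):
        h, eps = train_stump(X, y, D)          # weak hypothesis with error eps_t
        eps = max(eps, 1e-10)                  # guard against division by zero
        alpha = 0.5 * np.log((1 - eps) / eps)  # alpha_t = (1/2) ln((1 - eps_t) / eps_t)
        D = D * np.exp(-alpha * y * h(X))      # D_{t+1}(i) proportional to D_t(i) exp(-alpha_t y_i h_t(x_i))
        D /= D.sum()                           # normalize by Z_t
        hypotheses.append(h)
        alphas.append(alpha)
    return hypotheses, alphas

def predict(hypotheses, alphas, X):
    """Final hypothesis: H(x) = sign(sum_t alpha_t h_t(x))."""
    X = np.asarray(X, dtype=float)
    agg = sum(a * h(X) for a, h in zip(alphas, hypotheses))
    return np.sign(agg)
```

The sketch takes the reweighting route described above (the stump minimizes the weighted error under $D_t$ directly); the resampling alternative mentioned in the text would instead draw a sample from $D_t$ and train an unweighted learner on it.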
Schapire and Singer show how AdaBoost and its analysis can be extended to handle weak hypotheses which output real-valued or confidence-rated predictions. That is, for each instance $x$, the weak hypothesis $h_t$ outputs a prediction $h_t(x) \in \mathbb{R}$ whose sign is the predicted label ($-1$ or $+1$) and whose magnitude $|h_t(x)|$ gives a measure of "confidence" in the prediction.

Analyzing the training error

The most basic theoretical property of AdaBoost concerns its ability to reduce the training error. Let us write the error $\epsilon_t$ of $h_t$ as $\frac{1}{2} - \gamma_t$. Since a hypothesis that guesses each instance's class at random has an error rate of $1/2$ (on binary problems), $\gamma_t$ thus measures how much better than random $h_t$'s predictions are. Freund and Schapire prove that the training error (the fraction of mistakes on the training set) of the final hypothesis $H$ is at most
  $\prod_t \left[ 2 \sqrt{\epsilon_t (1 - \epsilon_t)} \right] = \prod_t \sqrt{1 - 4\gamma_t^2} \leq \exp\!\left(-2 \sum_t \gamma_t^2\right). \quad (1)$
Thus, if each weak hypothesis is slightly better than random, so that $\gamma_t \geq \gamma$ for some $\gamma > 0$, then the training error drops exponentially fast.
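As a quick check of Eq. (1), the following expansion (my own unpacking, which follows directly from the bound rather than being spelled out in the paper) shows why the product collapses to an exponential and how fast it shrinks for a constant edge:

```latex
% Substituting eps_t = 1/2 - gamma_t into one factor of the product in Eq. (1):
\[
2\sqrt{\epsilon_t(1-\epsilon_t)}
  = 2\sqrt{\left(\tfrac{1}{2}-\gamma_t\right)\left(\tfrac{1}{2}+\gamma_t\right)}
  = \sqrt{1-4\gamma_t^2}
  \le e^{-2\gamma_t^2},
\]
% using 1 - x <= e^{-x} with x = 4 gamma_t^2; multiplying over t = 1, ..., T gives
\[
\prod_{t=1}^{T} 2\sqrt{\epsilon_t(1-\epsilon_t)}
  \le \exp\Bigl(-2\sum_{t=1}^{T}\gamma_t^2\Bigr).
\]
% A uniform edge gamma_t >= gamma therefore bounds the training error by
\[
e^{-2\gamma^2 T},
\qquad\text{e.g. } \gamma = 0.1,\ T = 500
  \;\Longrightarrow\; e^{-10} \approx 4.5\times 10^{-5}.
\]
```

In particular, once this bound falls below $1/m$ the training error must be exactly zero, since it is always a multiple of $1/m$.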
A similar property is enjoyed by previous boosting algorithms. However, previous algorithms required that such a lower bound $\gamma$ be known a priori before boosting begins. In practice, knowledge of such a bound is very difficult to obtain. AdaBoost, on the other hand, is adaptive in that it adapts to the error rates of the individual weak hypotheses. This is the basis of its name: "Ada" is short for "adaptive."

The bound given in Eq. (1), combined with the bounds on generalization error given below, proves that AdaBoost is indeed a boosting algorithm in the sense that it can efficiently convert a weak learning algorithm (which can always generate a hypothesis with a weak edge for any distribution) into a strong learning algorithm (which can generate a hypothesis with an arbitrarily low error rate, given sufficient data).

Generalization error

Freund and Schapire showed how to bound the generalization error of the final hypothesis in terms of its training error, the size $m$ of the sample, the VC-dimension $d$ of the weak hypothesis space, and the number of rounds $T$ of boosting. (The VC-dimension is a standard measure of the "complexity" of a space of hypotheses; see, for instance, Blumer et al.) Specifically, they used techniques from Baum and Haussler to show that the generalization error, with high probability, is at most
  $\hat{\Pr}[H(x) \neq y] + \tilde{O}\!\left(\sqrt{\frac{Td}{m}}\right)$
where $\hat{\Pr}[\cdot]$ denotes empirical probability on the training sample. This bound suggests that boosting will overfit if run for too many rounds, i.e., as $T$ becomes large. In fact, this sometimes does happen. However, in early experiments, several authors observed empirically that boosting often does not overfit, even when run for thousands of rounds. Moreover, it was observed that AdaBoost would sometimes continue to drive down the generalization error long after the training error had reached zero, clearly contradicting the spirit of the bound above. For instance, the left side of Fig. 2 shows the training and test curves of running boosting on top of Quinlan's C4.5 decision-tree learning algorithm on the "letter" dataset.

Figure 2: Error curves and the margin distribution graph for boosting C4.5 on the letter dataset, as reported by Schapire et al. Left: the training and test error curves (lower and upper curves, respectively) of the combined classifier as a function of the number of rounds of boosting (error, in percent, versus rounds). The horizontal lines indicate the test error rate of the base classifier as well as the test error of the final combined classifier. Right: the cumulative distribution of margins of the training examples at three stages of boosting, indicated by the short-dashed, long-dashed (mostly hidden), and solid curves, respectively (cumulative distribution versus margin).

Figure 3: Comparison of C4.5 versus boosting "stumps" and boosting C4.5 on a set of benchmark problems, as reported by Freund and Schapire. Each point in each scatterplot shows the test error rate of the two competing algorithms on a single benchmark. The y-coordinate of each point gives the test error rate (in percent) of C4.5 on the given benchmark, and the x-coordinate gives the error rate of boosting stumps (left plot) or boosting C4.5 (right plot); both axes range from 0 to 30 percent. All error rates have been averaged over multiple runs.

In response to these empirical findings, Schapire et al. gave an alternative analysis in terms of the margins of the training examples, where the margin of example $(x, y)$ is defined to be
  $\operatorname{margin}(x, y) = \frac{y \sum_t \alpha_t h_t(x)}{\sum_t \alpha_t}.$
It is a number in $[-1, +1]$ which is positive if and only if $H$ correctly classifies the example. Moreover, the magnitude of the margin can be interpreted as a measure of confidence in the prediction. Schapire et al. proved that larger margins on the training set translate into a superior upper bound on the generalization error. Specifically, the generalization error is at most
  $\hat{\Pr}[\operatorname{margin}(x, y) \leq \theta] + \tilde{O}\!\left(\sqrt{\frac{d}{m\theta^2}}\right)$
for any $\theta > 0$, with high probability. Note that this bound is entirely independent of $T$, the number of rounds of boosting. In addition, Schapire et al. proved that boosting is particularly aggressive at reducing the margin (in a quantifiable sense) since it concentrates on the examples with the smallest margins (whether positive or negative). Boosting's effect on the margins can be seen empirically, for instance, on the right side of Fig. 2, which shows the cumulative distribution of margins of the training examples.
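As a companion to the margin analysis above, here is a small sketch (again illustrative Python, not from the paper) that computes the margins of the training examples for the ensemble returned by the earlier adaboost sketch, along with the cumulative margin distribution of the kind plotted on the right of Fig. 2; the names margins and cumulative_margin_distribution are assumptions.

```python
import numpy as np

def margins(hypotheses, alphas, X, y):
    """margin(x_i, y_i) = y_i * sum_t alpha_t h_t(x_i) / sum_t alpha_t.

    With eps_t <= 1/2 every alpha_t >= 0, so the result lies in [-1, +1] and is
    positive exactly when the combined classifier H gets the example right."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    agg = sum(a * h(X) for a, h in zip(alphas, hypotheses))
    return y * agg / np.sum(alphas)

def cumulative_margin_distribution(margin_values, thetas):
    """Fraction of training examples whose margin is at most theta, for each theta;
    this is the empirical quantity Pr-hat[margin(x, y) <= theta] in the bound above."""
    margin_values = np.asarray(margin_values)
    return np.array([np.mean(margin_values <= theta) for theta in thetas])
```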