Additive Logistic Regression: a Statistical View of Boosting
Jerome Friedman†, Trevor Hastie‡ and Robert Tibshirani‡

August; revision April; second revision November

Department of Statistics, Sequoia Hall, Stanford University, California. {jhf,hastie,tibs}@stat.stanford.edu
† Stanford Linear Accelerator Center, Stanford, CA
‡ Division of Biostatistics, Department of Health Research and Policy, Stanford University, Stanford, CA

Abstract

Boosting (Schapire 1990; Freund and Schapire 1996) is one of the most important recent developments in classification methodology. Boosting works by sequentially applying a classification algorithm to reweighted versions of the training data, and then taking a weighted majority vote of the sequence of classifiers thus produced. For many classification algorithms, this simple strategy results in dramatic improvements in performance. We show that this seemingly mysterious phenomenon can be understood in terms of well known statistical principles, namely additive modeling and maximum likelihood. For the two-class problem, boosting can be viewed as an approximation to additive modeling on the logistic scale using maximum Bernoulli likelihood as a criterion. We develop more direct approximations and show that they exhibit nearly identical results to boosting. Direct multi-class generalizations based on multinomial likelihood are derived that exhibit performance comparable to other recently proposed multi-class generalizations of boosting in most situations, and far superior in some. We suggest a minor modification to boosting that can reduce computation, often by factors of 10 to 50. Finally, we apply these insights to produce an alternative formulation of boosting decision trees. This approach, based on best-first truncated tree induction, often leads to better performance, and can provide interpretable descriptions of the aggregate decision rule. It is also much faster computationally, making it more suitable to large scale data mining applications.

1 Introduction

The starting point for this paper is an interesting procedure called "boosting", which is a way of combining the performance of many "weak" classifiers to produce a powerful "committee". Boosting was proposed in the Computational Learning Theory literature (Schapire 1990; Freund 1995; Freund and Schapire 1996) and has since received much attention. While boosting has evolved somewhat over the years, we describe the most commonly used version of the AdaBoost procedure (Freund and Schapire 1996b), which we call Discrete AdaBoost (essentially the same as AdaBoost.M1 for binary data).

Here is a concise description of AdaBoost in the two-class classification setting. We have training data $(x_1, y_1), \ldots, (x_N, y_N)$, with $x_i$ a vector-valued feature and $y_i = -1$ or $1$. We define $F(x) = \sum_{m=1}^{M} c_m f_m(x)$, where each $f_m(x)$ is a classifier producing values $\pm 1$ and the $c_m$ are constants; the corresponding prediction is $\mathrm{sign}(F(x))$. The AdaBoost procedure trains the classifiers $f_m(x)$ on weighted versions of the training sample, giving higher weight to cases that are currently misclassified. This is done for a sequence of weighted samples, and then the final classifier is defined to be a linear combination of the classifiers from each stage. A detailed description of Discrete AdaBoost is given in the boxed display titled Algorithm 1.

Discrete AdaBoost (Freund and Schapire 1996b)

1. Start with weights $w_i = 1/N$, $i = 1, \ldots, N$.
2. Repeat for $m = 1, 2, \ldots, M$:
   (a) Fit the classifier $f_m(x) \in \{-1, 1\}$ using weights $w_i$ on the training data.
   (b) Compute $\mathrm{err}_m = E_w[1_{(y \neq f_m(x))}]$ and $c_m = \log\big((1 - \mathrm{err}_m)/\mathrm{err}_m\big)$.
   (c) Set $w_i \leftarrow w_i \exp\big[c_m \, 1_{(y_i \neq f_m(x_i))}\big]$, $i = 1, 2, \ldots, N$, and renormalize so that $\sum_i w_i = 1$.
3. Output the classifier $\mathrm{sign}\big[\sum_{m=1}^{M} c_m f_m(x)\big]$.

Algorithm 1: $E_w$ represents expectation over the training data with weights $w = (w_1, w_2, \ldots, w_N)$, and $1_{(S)}$ is the indicator of the set $S$. At each iteration AdaBoost increases the weights of the observations misclassified by $f_m(x)$ by a factor that depends on the weighted training error.
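To make the weighting and voting steps concrete, here is a minimal sketch of Algorithm 1 in Python. It is only an illustration, assuming NumPy and scikit-learn are available, using a single-split decision tree (a stump) as the weak classifier $f_m(x)$, with labels coded as $-1/+1$; the function name and the clipping of the weighted error are illustrative safeguards, not part of the algorithm.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def discrete_adaboost(X, y, M=100):
    """Sketch of Discrete AdaBoost (Algorithm 1); y is a NumPy array coded -1/+1."""
    N = len(y)
    w = np.full(N, 1.0 / N)                       # 1. start with uniform weights
    classifiers, cs = [], []
    for _ in range(M):                            # 2. repeat for m = 1, ..., M
        f = DecisionTreeClassifier(max_depth=1)   # a stump plays the role of f_m(x)
        f.fit(X, y, sample_weight=w)              # (a) fit using the current weights w_i
        miss = (f.predict(X) != y)                # indicator 1_(y != f_m(x))
        err = np.clip((w * miss).sum(), 1e-10, 1 - 1e-10)   # (b) weighted error err_m
        c = np.log((1 - err) / err)               #     c_m = log((1 - err_m) / err_m)
        w = w * np.exp(c * miss)                  # (c) up-weight the misclassified cases
        w = w / w.sum()                           #     renormalize so the weights sum to 1
        classifiers.append(f)
        cs.append(c)

    def F(X_new):                                 # 3. weighted vote: sign(sum_m c_m f_m(x))
        return np.sign(sum(c * f.predict(X_new) for c, f in zip(cs, classifiers)))

    return F
```

In this form, step (c) multiplies the weight of each misclassified observation by $e^{c_m} = (1 - \mathrm{err}_m)/\mathrm{err}_m$ and leaves the others unchanged before renormalizing, which is the factor referred to in the caption of Algorithm 1.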
Much has been written about the success of AdaBoost in producing accurate classifiers. Many authors have explored the use of a tree-based classifier for $f_m(x)$, and have demonstrated that it consistently produces significantly lower error rates than a single decision tree. In fact, Breiman (NIPS workshop) called AdaBoost with trees the "best off-the-shelf classifier in the world" (see also Breiman 1998). Interestingly, in many examples the test error seems to consistently decrease and then level off as more classifiers are added, rather than ultimately increase. For some reason, it seems that AdaBoost is resistant to overfitting.

Figure 1 shows the performance of Discrete AdaBoost on a synthetic classification task, using an adaptation of CART (Breiman, Friedman, Olshen and Stone 1984) as the base classifier. This adaptation grows fixed-size trees in a "best-first" manner (described in a later section). Included in the figure is the bagged tree (Breiman 1996), which averages trees grown on bootstrap-resampled versions of the training data. Bagging is purely a variance-reduction technique, and since trees tend to have high variance, bagging often produces good results.

Early versions of AdaBoost used a resampling scheme to implement step 2(a) of Algorithm 1, by weighted sampling from the training data. This suggested a connection with bagging, and that a major component of the success of boosting has to do with variance reduction. However, boosting performs comparably well when:

- a weighted tree-growing algorithm is used in step 2(a), rather than weighted resampling, where each training observation is assigned its weight $w_i$. This removes the randomization component essential in bagging;
- "stumps" are used for the weak learners. Stumps are single-split trees with only two terminal nodes. These typically have low variance but high bias, and bagging performs very poorly with stumps (Figure 1, top right panel).

These observations suggest that boosting is capable of both bias and variance reduction, and thus differs fundamentally from bagging.

[Figure 1: Test error for Bagging, Discrete AdaBoost and Real AdaBoost on a simulated two-class nested-spheres problem; the Bayes error rate is zero. All trees are grown best-first without pruning, and the leftmost iteration corresponds to a single tree.]

The base classifier in Discrete AdaBoost produces a classification rule $f_m(x): \mathcal{X} \to \{-1, 1\}$, where $\mathcal{X}$ is the domain of the predictive features $x$. Freund and Schapire (1996b), Breiman (1998) and Schapire and Singer (1998) have suggested various modifications to improve the boosting algorithms. A generalization of Discrete AdaBoost appeared in Freund and Schapire (1996b), and was developed further in Schapire and Singer (1998), that uses real-valued "confidence-rated" predictions rather than the $\{-1, 1\}$ of Discrete AdaBoost. The weak learner for this generalized boosting produces a mapping $f_m(x): \mathcal{X} \to \mathbb{R}$; the sign of $f_m(x)$ gives the classification, and $|f_m(x)|$ a measure of the "confidence" in the prediction. This real-valued contribution is combined with the previous contributions with a multiplier $c_m$ as before, and a slightly different recipe for $c_m$ is provided.

We present a generalized version of AdaBoost, which we call Real AdaBoost (Algorithm 2), in which the weak learner returns a class probability estimate $p_m(x) = \hat{P}_w(y = 1 \mid x)$. The contribution to the final classifier is half the logit-transform of this probability estimate. One form of Schapire and Singer's generalized AdaBoost coincides with Real AdaBoost in the special case where the weak learner is a decision tree. Real AdaBoost tends to perform the best in our simulated examples in Figure 1, especially with stumps, although we see that with larger trees Discrete AdaBoost overtakes Real AdaBoost after enough iterations.

Real AdaBoost

1. Start with weights $w_i = 1/N$, $i = 1, 2, \ldots, N$.
2. Repeat for $m = 1, 2, \ldots, M$:
   (a) Fit the classifier to obtain a class probability estimate $p_m(x) = \hat{P}_w(y = 1 \mid x) \in [0, 1]$, using weights $w_i$ on the training data.
   (b) Set $f_m(x) \leftarrow \tfrac{1}{2} \log\big[p_m(x)/(1 - p_m(x))\big] \in \mathbb{R}$.
   (c) Set $w_i \leftarrow w_i \exp[-y_i f_m(x_i)]$, $i = 1, 2, \ldots, N$, and renormalize so that $\sum_i w_i = 1$.
3. Output the classifier $\mathrm{sign}\big[\sum_{m=1}^{M} f_m(x)\big]$.

Algorithm 2: The Real AdaBoost algorithm uses class probability estimates $p_m(x)$ to construct real-valued contributions $f_m(x)$.
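The same scaffolding adapts to Algorithm 2. Again this is only a sketch, assuming a weighted scikit-learn stump whose `predict_proba` output serves as the class probability estimate $p_m(x)$; the constant `eps`, which keeps the estimates away from 0 and 1 before the logit-transform, is an illustrative safeguard rather than part of the algorithm.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def real_adaboost(X, y, M=100, eps=1e-5):
    """Sketch of Real AdaBoost (Algorithm 2); y is a NumPy array coded -1/+1."""
    N = len(y)
    w = np.full(N, 1.0 / N)                       # 1. start with uniform weights
    trees = []

    def half_logit(tree, X_):
        # p_m(x) = P_w(y = 1 | x), clipped away from 0 and 1, then half its logit-transform
        p = tree.predict_proba(X_)[:, list(tree.classes_).index(1)]
        p = np.clip(p, eps, 1 - eps)
        return 0.5 * np.log(p / (1 - p))          # (b) real-valued contribution f_m(x)

    for _ in range(M):                            # 2. repeat for m = 1, ..., M
        tree = DecisionTreeClassifier(max_depth=1)
        tree.fit(X, y, sample_weight=w)           # (a) weighted fit giving p_m(x)
        fm = half_logit(tree, X)
        w = w * np.exp(-y * fm)                   # (c) w_i <- w_i exp(-y_i f_m(x_i))
        w = w / w.sum()                           #     renormalize so the weights sum to 1
        trees.append(tree)

    def F(X_new):                                 # 3. sign of the summed contributions
        return np.sign(sum(half_logit(t, X_new) for t in trees))

    return F
```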
In this paper we analyze the AdaBoost procedures from a statistical perspective. The main result of our paper rederives AdaBoost as a method for fitting an additive model $\sum_m f_m(x)$ in a forward stagewise manner. This simple fact largely explains why it tends to outperform a single base learner. By fitting an additive model of different and potentially simple functions, it expands the class of functions that can be approximated.

Given this fact, Discrete and Real AdaBoost appear unnecessarily complicated. A much simpler way to fit an additive model would be to minimize squared-error loss $E\big(y - \sum_m f_m(x)\big)^2$ in a forward stagewise manner. At the $m$th stage we fix $f_1(x), \ldots, f_{m-1}(x)$ and minimize squared error to obtain $f_m(x) = E\big(y - \sum_{j=1}^{m-1} f_j(x) \mid x\big)$. This is just "fitting of residuals", and is commonly used in linear regression and additive modeling (Hastie and Tibshirani 1990); a code sketch of this scheme is given at the end of this section. However, squared-error loss is not a good choice for classification (see the comparison of loss functions in a later figure), and hence fitting of residuals doesn't work very well in that case. We show that AdaBoost fits an additive model using a better loss function for classification. Specifically, we show that AdaBoost fits an additive logistic regression model, using a criterion similar to, but not the same as, the binomial log-likelihood. (If $p(x)$ is the class probability, an additive logistic regression approximates $\log\big[p(x)/(1 - p(x))\big]$ by an additive function $\sum_m f_m(x)$.) We then go on to derive a new boosting procedure (LogitBoost) that directly optimizes the binomial log-likelihood.

The original boosting techniques (Schapire 1990; Freund 1995) provably improved, or "boosted", the performance of a single classifier.
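Referring back to the forward stagewise least-squares fit described above, here is a minimal sketch of that residual-fitting scheme, again only as an illustration assuming NumPy and a scikit-learn regression stump as the simple function fitted at each stage (the function name is not from the paper).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def stagewise_least_squares(X, y, M=100):
    """Forward stagewise additive fit under squared-error loss: each stage fits the residual."""
    current = np.zeros(len(y))                    # running fit sum_{j<m} f_j(x)
    stages = []
    for _ in range(M):
        resid = y - current                       # residual y - sum_{j<m} f_j(x)
        f = DecisionTreeRegressor(max_depth=1)    # simple approximation to E(residual | x)
        f.fit(X, resid)
        current = current + f.predict(X)          # fix the new term f_m and continue
        stages.append(f)

    def F(X_new):
        return sum(f.predict(X_new) for f in stages)

    return F
```

As noted above, though, squared error is a poor loss for a $\pm 1$ response, which is what motivates the logistic view and the LogitBoost procedure developed in the rest of the paper.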