Additive Logistic Regression: a Statistical View of Boosting

Jerome Friedman*        Trevor Hastie*        Robert Tibshirani†

July 23, 1998

Abstract

Boosting (Freund & Schapire 1996, Schapire & Singer 1998) is one of the most important recent developments in classification methodology. The performance of many classification algorithms can often be dramatically improved by sequentially applying them to reweighted versions of the input data, and then taking a weighted majority vote of the sequence of classifiers thereby produced. We show that this seemingly mysterious phenomenon can be understood in terms of well-known statistical principles, namely additive modeling and maximum likelihood. For the two-class problem, boosting can be viewed as an approximation to additive modeling on the logistic scale using maximum Bernoulli likelihood as a criterion. We develop more direct approximations and show that they exhibit nearly identical results to those of boosting. Direct multi-class generalizations based on multinomial likelihood are derived that exhibit performance comparable to other recently proposed multi-class generalizations of boosting in most situations, and far superior in some. We suggest a minor modification to boosting that can reduce computation, often by factors of 10 to 50. Finally, we apply these insights to produce an alternative formulation of boosting decision trees. This approach, based on best-first truncated tree induction, often leads to better performance, and can provide interpretable descriptions of the aggregate decision rule. It is also much faster computationally, making it more suitable to large-scale applications.



* Department of Statistics, Sequoia Hall, Stanford University, Stanford, California 94305; {jhf,trevor}@stat.stanford.edu

† Department of Public Health Sciences, and Department of Statistics, University of Toronto; [email protected]

1 Introduction

The starting point for this paper is an interesting procedure called "boosting", which is a way of combining or boosting the performance of many "weak" classifiers to produce a powerful "committee". Boosting was proposed by Freund & Schapire (1996) and has since received much attention.

While boosting has evolved somewhat over the years, we first describe the most commonly used version of the AdaBoost procedure (Freund & Schapire 1996), which we call "Discrete" AdaBoost. Here is a concise description of AdaBoost in the two-class classification setting. We have training data $(x_1, y_1), \ldots, (x_N, y_N)$ with $x_i$ a vector-valued feature and $y_i = -1$ or $1$. We define $F(x) = \sum_{m=1}^{M} c_m f_m(x)$, where each $f_m(x)$ is a classifier producing values $\pm 1$ and the $c_m$ are constants; the corresponding prediction is $\mathrm{sign}(F(x))$. The AdaBoost procedure trains the classifiers $f_m(x)$ on weighted versions of the training sample, giving higher weight to cases that are currently misclassified. This is done for a sequence of weighted samples, and the final classifier is defined to be a linear combination of the classifiers from each stage. We describe the procedure in more detail in Algorithm 1.

Discrete AdaBoost (Freund & Schapire 1996)

1. Start with weights $w_i = 1/N$, $i = 1, \ldots, N$.

2. Repeat for $m = 1, 2, \ldots, M$:

   (a) Estimate the classifier $f_m(x)$ from the training data with weights $w_i$.

   (b) Compute $e_m = E_w[1_{(y \neq f_m(x))}]$ and $c_m = \log((1 - e_m)/e_m)$.

   (c) Set $w_i \leftarrow w_i \exp[c_m \cdot 1_{(y_i \neq f_m(x_i))}]$, $i = 1, 2, \ldots, N$, and renormalize so that $\sum_i w_i = 1$.

3. Output the classifier $\mathrm{sign}[\sum_{m=1}^{M} c_m f_m(x)]$.

Algorithm 1: $E_w$ is the expectation with respect to the weights $w = (w_1, w_2, \ldots, w_N)$. At each iteration AdaBoost increases the weights of the observations misclassified by $f_m(x)$ by a factor that depends on the weighted training error.
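To make the weight and coefficient updates in Algorithm 1 concrete, here is a minimal sketch in Python/NumPy, assuming a scikit-learn-style weak learner (a decision stump) that accepts sample weights; the helper names and the choice of weak learner are ours, not part of the original procedure.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier  # stump used as the weak learner

    def discrete_adaboost(X, y, M=100):
        """Sketch of Algorithm 1 (Discrete AdaBoost); y must take values in {-1, +1}."""
        N = len(y)
        w = np.full(N, 1.0 / N)                     # step 1: uniform weights
        learners, coefs = [], []
        for m in range(M):
            # (a) fit the weak classifier on the weighted sample
            f = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
            pred = f.predict(X)
            # (b) weighted training error and coefficient c_m = log((1 - e_m)/e_m)
            err = np.clip(np.sum(w * (pred != y)), 1e-12, 1 - 1e-12)
            c = np.log((1 - err) / err)
            # (c) up-weight misclassified cases and renormalize
            w *= np.exp(c * (pred != y))
            w /= w.sum()
            learners.append(f)
            coefs.append(c)
        def F(Xnew):
            return sum(c * f.predict(Xnew) for c, f in zip(coefs, learners))
        return lambda Xnew: np.sign(F(Xnew))        # step 3: sign of the weighted vote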

Much has been written about the success of AdaBoost in producing accurate classifiers. Many authors have explored the use of a tree-based classifier for $f_m(x)$ and have demonstrated that it consistently produces significantly lower error rates than a single decision tree. In fact, Breiman (NIPS workshop, 1996) called AdaBoost with trees the "best off-the-shelf classifier in the world". Interestingly, the test error seems to consistently decrease and then level off as more classifiers are added, rather than ultimately increase. For some reason, AdaBoost seems to be immune to overfitting.

Figure 1 shows the performance of Discrete AdaBoost on a synthetic classification task, using an adaptation of CART (Breiman, Friedman, Olshen & Stone 1984) as the base classifier. This adaptation grows fixed-size trees in a "best-first" manner (see Section 7). Included in the figure is the bagged tree (Breiman 1996), which averages trees grown on bootstrap-resampled versions of the training data. Bagging is purely a variance-reduction technique, and since trees tend to have high variance, bagging often produces good results.

Early versions of AdaBoost used a resampling scheme to implement step 2 of Algorithm 1, by weighted importance sampling from the training data. This suggested a connection with bagging, and that a major component of the success of boosting has to do with variance reduction. However, boosting performs comparably well when:

- a weighted tree-growing algorithm is used in step 2 rather than weighted resampling, where each training observation is assigned its weight $w_i$. This removes the randomization component essential in bagging;

- "stumps" are used for the weak learners. Stumps are single-split trees with only two terminal nodes. These typically have low variance but high bias. Bagging performs very poorly with stumps (Fig. 1 [top-right panel]).

These observations suggest that boosting is capable of both bias and variance reduction, and thus differs fundamentally from bagging.

The base classifier in Discrete AdaBoost produces a classification rule $f_m(x): \mathcal{X} \mapsto \{-1, 1\}$, where $\mathcal{X}$ is the domain of the predictive features $x$. If the implementation of the base classifier cannot deal with observation weights, weighted resampling is used instead. Freund & Schapire (1996) and Schapire & Singer (1998) have suggested various modifications to improve the boosting algorithms; here we focus on a version due to Schapire & Singer (1998), which we call "Real AdaBoost", that uses real-valued "confidence-rated" predictions rather than the $\{-1, 1\}$ of Discrete AdaBoost. The base classifier for this generalized boosting produces a mapping $f_m(x): \mathcal{X} \mapsto \mathbb{R}$; the sign of $f_m(x)$ gives the classification, and $|f_m(x)|$ a measure of the "confidence" in the prediction. This real-valued boosting tends to perform the best in our simulated examples in Fig. 1, especially with stumps, although we see that with 100-node trees Discrete AdaBoost overtakes Real AdaBoost after 200 iterations.

[Figure 1 (three panels: Stumps, 10 Node Trees, 100 Node Trees; test error versus number of terms): Test error for Bagging (BAG), Discrete AdaBoost (DAB) and Real AdaBoost (RAB) on a simulated two-class nested spheres problem (see Section 5). There are 2000 training data points in 10 dimensions, and the Bayes error rate is zero. All trees are grown "best-first" without pruning. The left-most iteration corresponds to a single tree.]

Real AdaBoost (Schapire & Singer 1998)

1. Start with weights $w_i = 1/N$, $i = 1, 2, \ldots, N$.

2. Repeat for $m = 1, 2, \ldots, M$:

   (a) Estimate the "confidence-rated" classifier $f_m(x): \mathcal{X} \mapsto \mathbb{R}$ and the constant $c_m$ from the training data with weights $w_i$.

   (b) Set $w_i \leftarrow w_i \exp[-c_m \, y_i f_m(x_i)]$, $i = 1, 2, \ldots, N$, and renormalize so that $\sum_i w_i = 1$.

3. Output the classifier $\mathrm{sign}[\sum_{m=1}^{M} c_m f_m(x)]$.

Algorithm 2: The Real AdaBoost algorithm allows the estimator $f_m(x)$ to range over $\mathbb{R}$. In the special case that $f_m(x) \in \{-1, 1\}$ it reduces to AdaBoost, since $y_i f_m(x_i)$ is $1$ for a correct and $-1$ for an incorrect classification. In the general case the constant $c_m$ is absorbed into $f_m(x)$. We describe the Schapire-Singer estimate for $f_m(x)$ in Section 3.
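As an illustration only, here is a minimal sketch of Algorithm 2 in Python. It anticipates the terminal-node estimate derived in Section 3 (Proposition 2), namely half the log-ratio of the weighted class proportions within each region, and it uses a weighted regression stump purely as a stand-in for the tree-growing step; all helper names are ours.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def real_adaboost(X, y, M=100, eps=1e-6):
        """Sketch of Algorithm 2; y takes values in {-1, +1}.
        Terminal-node value: 0.5 * log(P_w(y=1|region) / P_w(y=-1|region))."""
        N = len(y)
        w = np.full(N, 1.0 / N)
        stumps = []
        for m in range(M):
            stump = DecisionTreeRegressor(max_depth=1).fit(X, y, sample_weight=w)
            regions = stump.apply(X)                    # terminal-node id per observation
            scores = {}
            for r in np.unique(regions):
                in_r = regions == r
                p1 = w[in_r & (y == 1)].sum() + eps     # weighted proportion of class +1
                p0 = w[in_r & (y == -1)].sum() + eps
                scores[r] = 0.5 * np.log(p1 / p0)       # confidence-rated output
            f = np.array([scores[r] for r in regions])
            w *= np.exp(-y * f)                         # step (b): reweight, then renormalize
            w /= w.sum()
            stumps.append((stump, scores))
        def F(Xnew):
            total = np.zeros(len(Xnew))
            for stump, scores in stumps:
                total += np.array([scores[r] for r in stump.apply(Xnew)])
            return total
        return F                                        # classify with sign(F(x))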

Freund & Schapire (1996) and Schapire & Singer (1998) provide some theory to support their algorithms, in the form of upper bounds on generalization error. This theory (Schapire 1990) has evolved in the machine learning community, initially based on the concepts of PAC learning (Kearns & Vazirani 1994), and later from game theory (Freund 1995, Breiman 1997). Early versions of boosting "weak learners" (Schapire 1990) are far simpler than those described here, and the theory is more precise. The bounds and the theory associated with the AdaBoost algorithms are interesting, but tend to be too loose to be of practical importance. In practice boosting achieves results far more impressive than the bounds would imply.

In this paper we analyze the AdaBoost procedures from a statistical perspective. We show that the underlying model they are fitting is additive logistic regression. The AdaBoost algorithms are Newton methods for optimizing a particular exponential loss function, a criterion which behaves much like the log-likelihood on the logistic scale. We also derive new boosting-like procedures for classification.

In Section 2 we briefly review additive modeling. Section 3 shows how boosting can be viewed as an additive model estimator, and proposes some new boosting methods for the two-class case. The multiclass problem is studied in Section 4. Simulated and real data experiments are discussed in Sections 5 and 6. Our tree-growing implementation, using truncated best-first trees, is described in Section 7. Weight trimming to speed up computation is discussed in Section 8, and we end with a discussion in Section 9.

2 Additive Models

AdaBoost produces an additive model $F(x) = \sum_{m=1}^{M} c_m f_m(x)$, although "weighted committee" or "ensemble" sound more glamorous. Additive models have a long history in statistics, and we give some examples here.

2.1 Additive Regression Models

We initially focus on the regression problem, where the response $y$ is quantitative, and we are interested in modeling the mean $E(Y|x) = F(x)$. The additive model has the form

    F(x) = \sum_{j=1}^{p} f_j(x_j).     (1)

Here there is a separate function $f_j(x_j)$ for each of the $p$ input variables $x_j$. More generally, each component $f_j$ is a function of a small, pre-specified subset of the input variables. The backfitting algorithm (Friedman & Stuetzle 1981, Buja, Hastie & Tibshirani 1989) is a convenient modular algorithm for fitting additive models. A backfitting update is

    f_j(x_j) \leftarrow E\Big[ y - \sum_{k \neq j} f_k(x_k) \,\Big|\, x_j \Big].     (2)

Any method or algorithm for estimating a function of $x_j$ can be used to obtain an estimate of the conditional expectation in (2). In particular, this can include nonparametric smoothing algorithms, such as local regression or smoothing splines. On the right-hand side, all the latest versions of the functions $f_k$ are used in forming the partial residuals. The backfitting cycles are repeated until convergence. Under fairly general conditions, backfitting can be shown to converge to the minimizer of $E(y - F(x))^2$ (Buja et al. 1989).
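As a concrete illustration of the backfitting update (2), here is a minimal sketch; the crude bin-average smoother and the function names are our own choices for illustration, standing in for the local regression or smoothing splines mentioned above.

    import numpy as np

    def smooth_1d(x, r, bins=20):
        """Crude univariate smoother: average the partial residual r within bins of x."""
        edges = np.quantile(x, np.linspace(0, 1, bins + 1))
        idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, bins - 1)
        means = np.array([r[idx == b].mean() if np.any(idx == b) else 0.0
                          for b in range(bins)])
        return means[idx]

    def backfit(X, y, n_cycles=10):
        """Backfitting for the additive model (1); assumes y has been centered."""
        N, p = X.shape
        f = np.zeros((N, p))                          # fitted components at the data points
        for _ in range(n_cycles):
            for j in range(p):
                partial = y - f.sum(axis=1) + f[:, j]  # y minus all the other components
                f[:, j] = smooth_1d(X[:, j], partial)  # update (2) with the smoother
                f[:, j] -= f[:, j].mean()              # center each component for identifiability
        return f                                       # F(x_i) = f.sum(axis=1)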

2.2 Extended Additive Models

More generally, one can consider additive models whose elements $\{f_m(x)\}_1^M$ are functions of potentially all of the input features $x$. Usually, in this context, the $f_m(x)$ are taken to be simple functions characterized by a set of parameters $\gamma$ and a multiplier $\beta_m$,

    f_m(x) = \beta_m b(x; \gamma_m).     (3)

The additive model then becomes

    F_M(x) = \sum_{m=1}^{M} \beta_m b(x; \gamma_m).     (4)

For example, in single hidden layer neural networks $b(x; \gamma) = \sigma(\gamma^t x)$, where $\sigma(\cdot)$ is a sigmoid function and $\gamma$ parameterizes a linear combination of the input features. In signal processing, wavelets are a popular choice, with $\gamma$ parameterizing the location and scale of a "mother" wavelet $b(x; \gamma)$. In these applications $\{b(x; \gamma_m)\}_1^M$ are generally called "basis functions", since they span a function subspace.

If least-squares is used as a fitting criterion, one can solve for an optimal set of parameters through a generalized backfitting algorithm with updates

    \{\beta_m, \gamma_m\} \leftarrow \arg\min_{\beta, \gamma} E\Big[ y - \sum_{k \neq m} \beta_k b(x; \gamma_k) - \beta\, b(x; \gamma) \Big]^2.     (5)

Alternatively, one can use a "greedy" forward stepwise approach

    \{\beta_m, \gamma_m\} \leftarrow \arg\min_{\beta, \gamma} E\,[\, y - F_{m-1}(x) - \beta\, b(x; \gamma) \,]^2,     (6)

where $\{\beta_k, \gamma_k\}_1^{m-1}$ are fixed at their corresponding solution values at earlier iterations. This is the approach used by Mallat & Zhang (1993) in "matching pursuit", where the $b(x; \gamma)$ represent an over-complete wavelet-like basis. In the language of boosting, $f(x) = \beta b(x; \gamma)$ would be called a "weak learner" and $F_M(x)$ (4) the "committee". If decision trees were used as the weak learner, the parameters $\gamma$ would represent the splitting variables, split points, the constants in each terminal node, and the number of terminal nodes of each tree.

Note that the backfitting procedure (5) or its greedy cousin (6) only require an algorithm for fitting a single weak learner (3) to data. This base algorithm is simply applied repeatedly to modified versions of the original data

    y \leftarrow y - \sum_{k \neq m} f_k(x).

In the forward stepwise procedure (6), the modified output $y_m$ at the $m$th iteration depends only on its value $y_{m-1}$ and the solution $f_{m-1}(x)$ at the previous iteration:

    y_m = y_{m-1} - f_{m-1}(x).     (7)

At each step $m$, the previous output values $y_{m-1}$ are modified (7) so that the previous model $f_{m-1}(x)$ has no explanatory power on the new outputs $y_m$. One can therefore view this as a procedure for boosting a weak learner $f(x) = \beta b(x; \gamma)$ to form a powerful committee $F_M(x)$ (4).

2.3 Classification problems

For the classification problem, we learn from Bayes theorem that all we need is $P(y = j | x)$, the posterior or conditional class probabilities. One could transfer all the above regression machinery across to the classification domain by simply noting that $E(1_{(y=j)} | x) = P(y = j | x)$, where $1_{(y=j)}$ is the 0/1 indicator variable representing class $j$. While this works fairly well in general, several problems have been noted (Hastie, Tibshirani & Buja 1994) for constrained regression methods. The estimates are typically not confined to $[0, 1]$, and severe masking problems can occur. A notable exception is when trees are used as the regression method, and in fact this is the approach used by Breiman et al. (1984).

Logistic regression is a popular approach used in statistics for overcoming these problems. For a two-class problem, the model is

    \log \frac{P(y = 1 | x)}{P(y = 0 | x)} = \sum_{m=1}^{M} f_m(x).     (8)

The monotone logit transformation on the left guarantees that for any values of $F(x) = \sum_{m=1}^{M} f_m(x) \in \mathbb{R}$, the probability estimates lie in $[0, 1]$; inverting, we get

    p(x) = P(y = 1 | x) = \frac{e^{F(x)}}{1 + e^{F(x)}}.     (9)

Here we have given a general additive form for $F(x)$; special cases exist that are well known in statistics. In particular, the linear logistic regression model (McCullagh & Nelder 1989, for example) and the additive logistic regression model (Hastie & Tibshirani 1990) are popular. These models are usually fit by maximizing the binomial log-likelihood, and enjoy all the associated asymptotic optimality features of maximum likelihood estimation.

A generalized version of backfitting (2), called "Local Scoring" in Hastie & Tibshirani (1990), is used to fit the additive logistic model. Starting with guesses $f_1(x_1), \ldots, f_p(x_p)$, $F(x) = \sum_k f_k(x_k)$ and $p(x)$ defined in (9), we form the working response:

    z = F(x) + \frac{y - p(x)}{p(x)(1 - p(x))}.     (10)

We then apply backfitting to the response $z$ with observation weights $p(x)(1 - p(x))$ to obtain new $f_k(x_k)$. This process is repeated until convergence. The forward stage-wise version (6) of this procedure bears a close similarity to the LogitBoost algorithm described later in the paper.
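A minimal sketch of one step based on the working response (10), shown for a linear $F(x)$ for simplicity; the function name is ours, and the additive Local Scoring version would replace the weighted least-squares solve with a cycle of weighted backfitting.

    import numpy as np

    def local_scoring_step(X, y, F):
        """One IRLS-style step using the working response (10).
        X should include a column of ones if an intercept is desired."""
        p = 1.0 / (1.0 + np.exp(-F))                 # current probabilities, from (9)
        w = np.clip(p * (1 - p), 1e-6, None)         # observation weights
        z = F + (y - p) / w                          # working response (10)
        Xw = X * np.sqrt(w)[:, None]                 # weighted least squares of z on X
        zw = z * np.sqrt(w)
        beta, *_ = np.linalg.lstsq(Xw, zw, rcond=None)
        return X @ beta                              # updated F(x)

Iterating this step to convergence reproduces ordinary logistic regression; substituting a weighted additive fit for the least-squares solve yields the additive logistic model of Hastie & Tibshirani (1990).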

3 Boosting: an Additive Logistic Regression Model

In this section we show that the boosting algorithms are stage-wise estimation procedures for fitting an additive logistic regression model. They optimize an exponential criterion which to second order is equivalent to the binomial log-likelihood criterion. We then propose a more standard likelihood-based procedure.

3.1 An Exponential Criterion

Consider minimizing the criterion

    J(F) = E(e^{-yF(x)})     (11)

for estimation of $F(x)$. (Footnote 1: $E$ represents expectation; depending on the context, this may be an $L_2$ population expectation, or else a sample average. $E_w$ means a weighted expectation.)

Lemma 1 shows that the function $F(x)$ that minimizes the $L_2$ version of the exponential criterion is the symmetric logistic transform of $P(y = 1 | x)$.

Lemma 1  $E(e^{-yF(x)})$ is minimized at

    F(x) = \frac{1}{2} \log \frac{P(y = 1 | x)}{P(y = -1 | x)}.     (12)

Hence

    P(y = 1 | x) = \frac{e^{F(x)}}{e^{-F(x)} + e^{F(x)}}     (13)

    P(y = -1 | x) = \frac{e^{-F(x)}}{e^{-F(x)} + e^{F(x)}}.     (14)

Proof
While $E$ entails expectation over the joint distribution of $y$ and $x$, it is sufficient to minimize the criterion conditional on $x$:

    E(e^{-yF(x)} | x) = P(y = 1 | x)\, e^{-F(x)} + P(y = -1 | x)\, e^{F(x)}

    \frac{\partial E(e^{-yF(x)} | x)}{\partial F(x)} = -P(y = 1 | x)\, e^{-F(x)} + P(y = -1 | x)\, e^{F(x)}.

Setting the derivative to zero, the result follows. □

The usual logistic transform does not have the factor $\frac{1}{2}$ in (12); by multiplying the numerator and denominator in (13) by $e^{F(x)}$, we get the usual logistic model

    p(x) = \frac{e^{2F(x)}}{1 + e^{2F(x)}}.     (15)

Hence the two models are equivalent.

Corollary 1  If $E$ is replaced by averages over regions of $x$ where $F(x)$ is constant (as in the terminal node of a decision tree), the same result applies to the sample proportions of $y = 1$ and $y = -1$.

In Proposition 1 we show that the Discrete AdaBoost increments in Algorithm 1 are Newton-style updates for minimizing the exponential criterion. This can be interpreted as a stage-wise estimation procedure for fitting an additive logistic regression model.

Proposition 1  The Discrete AdaBoost algorithm produces adaptive Newton updates for minimizing $E(e^{-yF(x)})$, which are stage-wise contributions to an additive logistic regression model.

Proof
Let $J(F) = E[e^{-yF(x)}]$. Suppose we have a current estimate $F(x)$ and seek an improved estimate $F(x) + cf(x)$. For fixed $c$ (and $x$), we expand $J(F(x) + cf(x))$ to second order about $f(x) = 0$:

    J(F + cf) = E[e^{-y(F(x) + cf(x))}]
              \approx E[e^{-yF(x)} (1 - y c f(x) + c^2 f(x)^2 / 2)].

Minimizing pointwise with respect to $f(x) \in \{-1, 1\}$, we find

    \hat{f}(x) = \arg\min_f E_w(1 - y c f(x) + c^2 f(x)^2 / 2 \mid x)
               = \arg\min_f E_w[(y - c f(x))^2 \mid x]     (16)
               = \arg\min_f E_w[(y - f(x))^2 \mid x],     (17)

where $w(y|x) = \exp(-yF(x)) / E\exp(-yF(x))$, and (17) follows from (16) by considering the two possible choices for $f(x)$.

Given $\hat{f} \in \{-1, 1\}$, we can directly minimize $J(F + c\hat{f})$ to determine $c$:

    \hat{c} = \arg\min_c E_w e^{-cy\hat{f}(x)} = \frac{1}{2} \log \frac{1 - e}{e},

where $e = E_w[1_{(y \neq \hat{f}(x))}]$. Combining these steps we get the update for $F(x)$:

    F(x) \leftarrow F(x) + \frac{1}{2} \log \frac{1 - e}{e} \, \hat{f}(x).

In the next iteration the new contribution $\hat{c}\hat{f}(x)$ to $F(x)$ augments the weights:

    w(y|x) \leftarrow w(y|x) \cdot e^{-\hat{c}\hat{f}(x)y},

followed by a normalization. Since $-y\hat{f}(x) = 2 \cdot 1_{(y \neq \hat{f}(x))} - 1$, we see that the update is equivalent to

    w(y|x) \leftarrow w(y|x) \cdot \exp\Big( \log\frac{1 - e}{e} \, 1_{(y \neq \hat{f}(x))} \Big).

Thus the function and weight updates are identical to those used in Discrete AdaBoost. □

Parts of this derivation for AdaBoost can be found in Schapire & Singer (1998). Newton algorithms repeatedly optimize a quadratic approximation to a nonlinear criterion, which is the first part of the update in AdaBoost. This $L_2$ version of AdaBoost translates naturally to a data version using trees. The weighted least-squares criterion (17) is used to grow the tree-based classifier $\hat{f}(x)$, and given $\hat{f}(x)$, the constant $c$ is based on the weighted training error.

Note that after each Newton step, the weights change, and hence the tree configuration will change as well. This adds a nonlinear twist to the Newton algorithm.

Corollary 2  After each update to the weights, the weighted misclassification error of the most recent weak learner is 50%.

Proof
This follows by noting that the $c$ that minimizes $J(F + cf)$ satisfies

    \frac{\partial J(F + cf)}{\partial c} = -E[e^{-y(F(x) + cf(x))} y f(x)] = 0.     (18)

The result follows since $yf(x)$ is $1$ for a correct classification, and $-1$ for a misclassification. □

Schapire & Singer (1998) give the interpretation that the weights are updated to make the new weighted problem maximally difficult for the next weak learner.

The Discrete AdaBoost algorithm expects the tree or other "weak learner" to deliver a classifier $f(x) \in \{-1, 1\}$. We now show that the Real AdaBoost algorithm uses the weighted probability estimates in the terminal nodes of the tree to update the additive logistic model, rather than the classifications themselves. Again we derive the population algorithm, and then apply it to data.

Proposition 2  The Real AdaBoost algorithm fits an additive logistic regression model by stage-wise optimization of $J(F) = E[e^{-yF(x)}]$.

Proof
Suppose we have a current estimate $F(x)$ and seek an improved estimate $F(x) + f(x)$ by minimizing $J(F(x) + f(x))$:

    \frac{\partial J(F(x) + f(x))}{\partial f(x)} = -E(e^{-yF(x)} y e^{-yf(x)} | x)
      = -E[e^{-yF(x)} 1_{(y=1)} e^{-f(x)} | x] + E[e^{-yF(x)} 1_{(y=-1)} e^{f(x)} | x].

Dividing through by $E(e^{-yF(x)} | x)$ and setting the derivative to zero, we get

    \hat{f}(x) = \frac{1}{2} \log \frac{E_w[1_{(y=1)} | x]}{E_w[1_{(y=-1)} | x]}     (19)
               = \frac{1}{2} \log \frac{P_w(y = 1 | x)}{P_w(y = -1 | x)},     (20)

where $w(y|x) = \exp(-yF(x)) / E(\exp(-yF(x)) | x)$. □

Careful examination of Schapire & Singer (1998) shows that this update matches theirs (and the $c_m$ in Algorithm 2 are redundant). The weights get updated by

    w(y|x) \leftarrow w(y|x) \cdot e^{-y\hat{f}(x)}.

Corollary 3  At the optimal $F(x)$, the weighted conditional mean of $y$ is 0.

Proof
If $F(x)$ is optimal, we have

    \frac{\partial J(F(x))}{\partial F(x)} = -E\, e^{-yF(x)} y = 0.     (21)

□

We can think of the weights as providing an alternative to residuals for the binary classification problem. At the optimal function $F$, there is no further information about $F$ in the weighted conditional distribution of $y$. If there is, we use it to update $F$.

At iteration $M$ in either the Discrete or Real AdaBoost algorithms, we have composed an additive function of the form

    F(x) = \sum_{m=1}^{M} f_m(x),     (22)

where each of the components is found in a greedy forward stage-wise fashion, fixing the earlier components. Our term "stage-wise" refers to a similar approach in statistics:

- Variables are included sequentially in a stepwise regression.

- The coefficients of variables already included receive no further adjustment.

3.2 Why $E e^{-yF(x)}$?

So far the only justification for this exponential criterion is that it has a sensible population minimizer, and the algorithm described above performs well on real data. In addition:

- Schapire & Singer (1998) motivate $e^{-yF(x)}$ as a differentiable upper bound to misclassification error (see Fig. 2);

- the AdaBoost algorithm that it generates is extremely modular, requiring at each iteration the retraining of a classifier on a weighted training database.

Let $y^* = (y + 1)/2$, taking values 0, 1, and parametrize the binomial probabilities by

    p(x) = \frac{e^{F(x)}}{e^{F(x)} + e^{-F(x)}}.

The expected binomial log-likelihood is

    E\, \ell(y^*, p(x)) = E[y^* \log(p(x)) + (1 - y^*) \log(1 - p(x))]     (23)
                        = -E \log(1 + e^{-2yF(x)}).     (24)

- The population minimizers of $-E\, \ell(y^*, p(x))$ and $E e^{-yF(x)}$ coincide. In fact, the exponential criterion and the (negative) log-likelihood are equivalent to second order in a Taylor series around $F = 0$:

    -\ell(y^*, p) \approx \exp(-yF) + \log(2) - 1.     (25)

  Graphs of $\exp(-yF)$ and $\log(1 + e^{-2yF(x)})$ (suitably scaled) are shown in Fig. 2, as a function of $yF$; positive values of $yF$ imply correct classification. Note that $\exp(-yF)$ itself is not a proper log-likelihood, as it does not equal the log of any probability mass function on $\pm 1$.

- Also shown in Fig. 2 is the indicator function $1_{(yF < 0)}$, which gives misclassification error.

- There is another way to view the criterion $J(F)$. It is easy to show that

    e^{-yF(x)} = \frac{|y^* - p(x)|}{\sqrt{p(x)(1 - p(x))}},     (26)

  with $F(x) = \frac{1}{2}\log(p(x)/(1 - p(x)))$. The right-hand side is known as the Chi statistic in the statistical literature.

One feature of both the exponential and log-likelihood criteria is that they are monotone and smooth. Even if the training error is zero, the criteria will drive the estimates towards purer solutions (in terms of probability estimates).

Why not estimate the $f_m$ by minimizing the squared error $E(y - F(x))^2$? If $F_{m-1}(x) = \sum_1^{m-1} f_j(x)$ is the current prediction, this leads to a forward stage-wise procedure that does an unweighted fit to the response $y - F_{m-1}(x)$ at step $m$ (6). Empirically we have found that this approach works quite well, but is dominated by those that use monotone loss criteria. We believe that the non-monotonicity of squared error loss (Fig. 2) is the reason. Correct classifications ($yF(x) > 1$) incur increasing loss for increasing values of $|F(x)|$. This makes squared-error loss an especially poor approximation to misclassification error rate. Classifications that are "too correct" are penalized as much as misclassification errors.

[Figure 2: Losses as approximations to misclassification error. A variety of loss functions for estimating a function F(x) for classification (misclassification, Schapire-Singer exponential, log-likelihood, squared error in p, squared error in F), plotted against yF. The horizontal axis is yF, which is negative for errors and positive for correct classifications. All the loss functions are monotone in yF. The curve labeled "Squared Error(p)" is (y* - p)^2, and gives a uniformly better approximation to misclassification loss than the exponential criterion (Schapire-Singer). The curve labeled "Squared Error(F)" is (y - F)^2, and increases once yF exceeds 1, thereby increasingly penalizing classifications that are "too correct".]

3.3 Using the log-likelihood criterion

In this section we explore algorithms for fitting additive logistic regression models by stage-wise optimization of the Bernoulli log-likelihood. Here we focus again on the two-class case, and will use a 0/1 response $y^*$ to represent the outcome. We represent the probability of $y^* = 1$ by $p(x)$, where

    p(x) = \frac{e^{F(x)}}{e^{F(x)} + e^{-F(x)}}.     (27)

Algorithm 3 gives the details.

LogitBoost (two classes)

1. Start with weights $w_i = 1/N$, $i = 1, 2, \ldots, N$, $F(x) = 0$ and probability estimates $p_i = \frac{1}{2}$.

2. Repeat for $m = 1, 2, \ldots, M$:

   (a) Compute the working response and weights

       z_i = \frac{y_i^* - p_i}{p_i(1 - p_i)}, \qquad w_i = p_i(1 - p_i).

   (b) Estimate $f_m(x)$ by a weighted least-squares fit of $z_i$ to $x_i$.

   (c) Update $F(x) \leftarrow F(x) + \frac{1}{2} f_m(x)$ and $p(x)$ via (27).

3. Output the classifier $\mathrm{sign}[F(x)] = \mathrm{sign}[\sum_{m=1}^{M} f_m(x)]$.

Algorithm 3: An adaptive Newton algorithm for fitting an additive logistic regression model.
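Here is a minimal sketch of Algorithm 3 in Python, using a regression stump as the weighted least-squares learner; the clipping constants follow the implementation protections discussed after Proposition 3, and the learner choice and helper names are ours.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def logitboost_two_class(X, ystar, M=100, z_max=4.0):
        """Sketch of Algorithm 3 (two-class LogitBoost); ystar takes values in {0, 1}."""
        N = len(ystar)
        F = np.zeros(N)
        trees = []
        for m in range(M):
            p = 1.0 / (1.0 + np.exp(-2.0 * F))           # p(x) from (27)
            w = np.clip(p * (1 - p), 1e-10, None)        # Newton weights, floored
            z = np.clip((ystar - p) / w, -z_max, z_max)  # working response, thresholded
            # (b) weighted least-squares fit of z on x (here: a regression stump)
            tree = DecisionTreeRegressor(max_depth=1).fit(X, z, sample_weight=w)
            F += 0.5 * tree.predict(X)                   # (c) update F(x)
            trees.append(tree)
        def F_fit(Xnew):
            return 0.5 * sum(t.predict(Xnew) for t in trees)
        return F_fit                                     # classify with sign(F_fit(x))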

Proposition 3  The LogitBoost algorithm uses adaptive Newton steps for fitting an additive symmetric logistic model by maximum likelihood.

Proof
Consider the update $F(x) + f(x)$ and the expected log-likelihood

    \ell(F + f) = E\big[ 2y^*(F(x) + f(x)) - \log(1 + \exp(2(F(x) + f(x)))) \big].     (28)

Conditioning on $x$, we compute the first and second derivatives at $f(x) = 0$:

    s(x) = \frac{\partial \ell(F(x) + f(x))}{\partial f(x)} \Big|_{f(x)=0} = 2\, E(y^* - p(x) | x)     (29)

    H(x) = \frac{\partial^2 \ell(F(x) + f(x))}{\partial f(x)^2} \Big|_{f(x)=0} = -4\, E(p(x)(1 - p(x)) | x),     (30)

where $p(x)$ is defined in terms of $F(x)$. The Newton update is then

    F(x) \leftarrow F(x) - H(x)^{-1} s(x)
         = F(x) + \frac{1}{2} \frac{E(y^* - p(x) | x)}{E(p(x)(1 - p(x)) | x)}     (31)
         = F(x) + \frac{1}{2} E_w\Big( \frac{y^* - p(x)}{p(x)(1 - p(x))} \,\Big|\, x \Big),     (32)

where $w(x) = p(x)(1 - p(x))$. Equivalently, the Newton update solves the weighted least-squares criterion

    \min_{f(x)} E_{w(x)}\Big( f(x) - \frac{1}{2} \frac{y^* - p(x)}{p(x)(1 - p(x))} \Big)^2.     (33)

□

The population algorithm described here translates immediately to an implementation on data when $E(\cdot|x)$ is replaced by a regression method, such as regression trees (Breiman et al. 1984). While the role of the weights is somewhat artificial in the $L_2$ case, it is not in any implementation; $w(x)$ is constant when conditioned on $x$, but the $w(x_i)$ in a terminal node of a tree, for example, depend on the current values $F(x_i)$, and will typically not be constant.

Sometimes the $w(x)$ get very small in regions of $x$ perceived (by $F(x)$) to be pure, that is, when $p(x)$ is close to 0 or 1. This can cause numerical problems in the construction of $z$, and led to the following crucial implementation protections:



- If $y^* = 1$, then compute $z = \frac{y^* - p}{p(1-p)} = \frac{1}{p}$. Since this number can get large if $p$ is small, threshold this ratio at $z_{\max}$. The particular value chosen for $z_{\max}$ is not crucial; we have found empirically that $z_{\max} \in [2, 4]$ works well. Likewise, if $y^* = 0$, compute $z = -\frac{1}{1-p}$ with a lower threshold of $-z_{\max}$.

- Enforce a lower threshold on the weights: $w = \max(w, 2 \times$ machine-zero$)$.

3.4 Optimizing $E e^{-yF(x)}$ by Newton stepping

The $L_2$ Real AdaBoost procedure (Algorithm 2) optimizes $E e^{-y(F(x)+f(x))}$ exactly with respect to $f$ at each iteration. Here we explore a "gentler" version that instead takes adaptive Newton steps, much like the LogitBoost algorithm just described.

Gentle AdaBoost

1. Start with weights $w_i = 1/N$, $i = 1, 2, \ldots, N$, $F(x) = 0$.

2. Repeat for $m = 1, 2, \ldots, M$:

   (a) Estimate $f_m(x)$ by a weighted least-squares fit of $y$ to $x$.

   (b) Update $F(x) \leftarrow F(x) + f_m(x)$.

   (c) Update $w_i \leftarrow w_i e^{-y_i f_m(x_i)}$ and renormalize.

3. Output the classifier $\mathrm{sign}[F(x)] = \mathrm{sign}[\sum_{m=1}^{M} f_m(x)]$.

Algorithm 4: A modified version of the Real AdaBoost algorithm, using Newton stepping rather than exact optimization at each step.
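A minimal sketch of Algorithm 4 follows; as before, the regression stump stands in for the weighted least-squares learner and the helper names are ours.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def gentle_adaboost(X, y, M=100):
        """Sketch of Algorithm 4 (Gentle AdaBoost); y takes values in {-1, +1}."""
        N = len(y)
        w = np.full(N, 1.0 / N)
        trees = []
        for m in range(M):
            # (a) weighted least-squares fit of y on x (regression stump here)
            tree = DecisionTreeRegressor(max_depth=1).fit(X, y, sample_weight=w)
            f = tree.predict(X)
            # (b)-(c) the update to F is the sum of the stored trees; reweight and renormalize
            w *= np.exp(-y * f)
            w /= w.sum()
            trees.append(tree)
        def F(Xnew):
            return sum(t.predict(Xnew) for t in trees)
        return lambda Xnew: np.sign(F(Xnew))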

Proposition 4  The Gentle AdaBoost algorithm uses adaptive Newton steps for minimizing $E e^{-yF(x)}$.

Proof

    \frac{\partial J(F(x) + f(x))}{\partial f(x)} \Big|_{f(x)=0} = -E(e^{-yF(x)} y | x)

    \frac{\partial^2 J(F(x) + f(x))}{\partial f(x)^2} \Big|_{f(x)=0} = E(e^{-yF(x)} | x) \quad \text{since } y^2 = 1.

Hence the Newton update is

    F(x) \leftarrow F(x) + \frac{E(e^{-yF(x)} y | x)}{E(e^{-yF(x)} | x)} = F(x) + E_w(y | x),

where $w(y|x) = e^{-yF(x)} / E(e^{-yF(x)} | x)$. □

The main difference between this and the Real AdaBoost algorithm is how it uses its estimates of the weighted class probabilities to update the functions. Here the update is $f_m(x) = P_w(y = 1|x) - P_w(y = -1|x)$, rather than half the log-ratio as in (20): $f_m(x) = \frac{1}{2}\log \frac{P_w(y=1|x)}{P_w(y=-1|x)}$. Log-ratios can be numerically unstable, leading to very large updates in pure regions, while the update here lies in the range $[-1, 1]$. Empirical evidence suggests (see Section 6) that this more conservative algorithm has performance similar to both the Real AdaBoost and LogitBoost algorithms, and often outperforms them both, especially when stability is an issue.

Freund & Schapire (1996) also propose an AdaBoost algorithm similar to Gentle AdaBoost, except that the function $f_m(x)$ is multiplied by a constant ($c_m f_m(x)$) which is estimated in a global fashion, much as in Discrete AdaBoost. We do not pursue this particular variant further here.

There is a strong similarity between the updates for the Gentle AdaBoost algorithm and those for the LogitBoost algorithm. Let $p = P(y = 1 | x)$ and $p_m = \frac{e^{F(x)}}{e^{F(x)} + e^{-F(x)}}$. Then

    \frac{E(e^{-yF(x)} y | x)}{E(e^{-yF(x)} | x)} = \frac{e^{-F(x)} p - e^{F(x)}(1 - p)}{e^{-F(x)} p + e^{F(x)}(1 - p)} = \frac{p - p_m}{(1 - p_m)p + p_m(1 - p)}.     (34)

The analogous expression for LogitBoost, from (31), is

    \frac{1}{2} \frac{p - p_m}{p_m(1 - p_m)}.     (35)

At $p_m \approx \frac{1}{2}$ these are nearly the same, but they differ as the $p_m$ become extreme. For example, if $p \approx 1$ and $p_m \approx 0$, (35) blows up, while (34) is about 1 (and always falls in $[-1, 1]$).

4 Multiclass procedures

Here we explore extensions of boosting to classification with multiple classes. We start off by proposing a natural generalization of the two-class symmetric logistic transformation, and then consider specific algorithms. In this context Schapire & Singer (1998) define $J$ responses $y_j$ for a $J$-class problem, each taking values in $\{-1, 1\}$. Similarly, the indicator response vector with elements $y_j^*$ is more standard in the statistics literature. Assume the classes are mutually exclusive.

Definition 1  For a $J$-class problem let $P_j(x) = P(y_j = 1 | x)$. We define the symmetric multiple logistic transformation

    F_j(x) = \log P_j(x) - \frac{1}{J} \sum_{k=1}^{J} \log P_k(x).     (36)

Equivalently,

    P_j(x) = \frac{e^{F_j(x)}}{\sum_{k=1}^{J} e^{F_k(x)}}, \qquad \sum_{k=1}^{J} F_k(x) = 0.     (37)

The centering condition in (37) is for numerical stability only; it simply pins the $F_j$ down, else we could add an arbitrary constant to each $F_j$ and the probabilities would remain the same. The equivalence of these two definitions is easily established, as well as the equivalence with the two-class case.
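A minimal sketch of the transformation (36) and its inverse (37), written as plain NumPy functions whose names are ours:

    import numpy as np

    def symmetric_logistic(P):
        """Map class probabilities P (shape n x J, rows strictly positive and summing to 1)
        to F via (36): subtract the mean log-probability."""
        logP = np.log(P)
        return logP - logP.mean(axis=1, keepdims=True)

    def inverse_symmetric_logistic(F):
        """Map F back to probabilities via (37)."""
        E = np.exp(F - F.max(axis=1, keepdims=True))   # subtract the max for numerical stability
        return E / E.sum(axis=1, keepdims=True)

Round-tripping probabilities through these two functions returns them unchanged, and adding the same constant to every $F_j(x)$ leaves the probabilities in (37) unaffected, which is why the centering in (36) is purely a normalization.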

Schapire & Singer (1998) provide several generalizations of AdaBoost for the multiclass case; we describe their AdaBoost.MH algorithm (Algorithm 5), since it seemed to dominate the others in their empirical studies. We then connect it to the models presented here. We will refer to the augmented variable in Algorithm 5 as the "class" variable $C$.

AdaBoost.MH (Schapire & Singer 1998)

The original $N$ observations are expanded into $N \times J$ pairs

    ((x_i, 1), y_{i1}), ((x_i, 2), y_{i2}), \ldots, ((x_i, J), y_{iJ}), \quad i = 1, \ldots, N.

1. Start with weights $w_{ij} = 1/(NJ)$, $i = 1, \ldots, N$, $j = 1, \ldots, J$.

2. Repeat for $m = 1, 2, \ldots, M$:

   (a) Estimate the "confidence-rated" classifier $f_m(x, j): \mathcal{X} \times (1, \ldots, J) \mapsto \mathbb{R}$ from the training data with weights $w_{ij}$.

   (b) Set $w_{ij} \leftarrow w_{ij} \exp[-y_{ij} f_m(x_i, j)]$, $i = 1, 2, \ldots, N$, $j = 1, \ldots, J$, and renormalize so that $\sum_{i,j} w_{ij} = 1$.

3. Output the classifier $\arg\max_j F(x, j)$, where $F(x, j) = \sum_m f_m(x, j)$.

Algorithm 5: The AdaBoost.MH algorithm converts the $J$-class problem into that of estimating a two-class classifier on a training set $J$ times as large, with an additional "feature" defined by the set of class labels.

We make a few observations:

- The $L_2$ version of this algorithm minimizes $\sum_{j=1}^{J} E e^{-y_j F_j(x)}$, which is equivalent to running separate $L_2$ boosting algorithms on each of the $J$ problems of size $N$ obtained by partitioning the $N \times J$ samples in the obvious fashion. This is seen trivially by first conditioning on $C = j$, and then on $x | C = j$, when computing conditional expectations.

- The same is almost true for their tree-based algorithm. We see this because:

  1. If the first split is on $C$, either a $J$-nary split if permitted, or else $J - 1$ binary splits, then the sub-trees are identical to separate trees grown for each of the $J$ groups. This will always be the case for the first tree.

  2. If a tree does not split on $C$ anywhere on the path to a terminal node, then that node returns a function $f_m(x, j) = g_m(x)$ that contributes nothing to the classification decision. However, as long as a tree includes a split on $C$ at least once on every path to a terminal node, it will make a contribution to the classifier for all input feature values.

The advantage or disadvantage of building one large tree using the class label as an additional input feature is not clear. No motivation is provided. We therefore implement AdaBoost.MH using the more traditional direct approach of building $J$ separate trees to minimize $\sum_{j=1}^{J} E e^{-y_j F_j(x)}$. We have thus shown

Proposition 5  The AdaBoost.MH algorithm for a $J$-class problem fits $J$ uncoupled additive logistic models, $G_j(x) = \frac{1}{2} \log P_j(x)/(1 - P_j(x))$, each class against the rest.

In principle this parametrization is fine, since $G_j(x)$ is monotone in $P_j(x)$. However, we are estimating the $G_j(x)$ in an uncoupled fashion, and there is no guarantee that the implied probabilities sum to 1. We give some examples where this makes a difference, and where AdaBoost.MH performs more poorly than an alternative coupled likelihood procedure.

Schapire and Singer's AdaBoost.MH was also intended to cover situations where observations can belong to more than one class. The "MH" stands for "Multi-Label Hamming", Hamming loss being used to measure the errors in the space of $2^J$ possible class labels. In this context fitting a separate classifier for each label is a reasonable strategy. Schapire and Singer also propose using AdaBoost.MH when the class labels are mutually exclusive, which is the focus in this paper.

Algorithm 6 is a natural generalization of Algorithm 3 for fitting the $J$-class logistic regression model (37).

LogitBoost (J classes)

1. Start with weights $w_{ij} = 1/N$, $i = 1, \ldots, N$, $j = 1, \ldots, J$, $F_j(x) = 0$ and $P_j(x) = 1/J$ for all $j$.

2. Repeat for $m = 1, 2, \ldots, M$:

   (a) Repeat for $j = 1, \ldots, J$:

       i. Compute working responses and weights in the $j$th class

          z_{ij} = \frac{y_{ij}^* - p_{ij}}{p_{ij}(1 - p_{ij})}, \qquad w_{ij} = p_{ij}(1 - p_{ij}).

       ii. Estimate $f_{mj}(x)$ by a weighted least-squares fit of $z_{ij}$ to $x_i$.

   (b) Set $f_{mj}(x) \leftarrow \frac{J-1}{J}\Big( f_{mj}(x) - \frac{1}{J}\sum_{k=1}^{J} f_{mk}(x) \Big)$, and $F_j(x) \leftarrow F_j(x) + f_{mj}(x)$.

   (c) Update $P_j(x)$ via (37).

3. Output the classifier $\arg\max_j F_j(x)$.

Algorithm 6: An adaptive Newton algorithm for fitting an additive multiple logistic regression model.
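A compact sketch of Algorithm 6, again using regression stumps for the weighted least-squares fits and the same working-response clipping as in the two-class case; these choices and all names are ours.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def logitboost_multiclass(X, labels, J, M=100, z_max=4.0):
        """Sketch of Algorithm 6; labels are integers in {0, ..., J-1}."""
        N = len(labels)
        Y = np.eye(J)[labels]                          # indicator responses y*_{ij}
        F = np.zeros((N, J))
        rounds = []                                    # rounds[m][j] holds the tree for f_{mj}
        for m in range(M):
            P = np.exp(F - F.max(axis=1, keepdims=True))
            P /= P.sum(axis=1, keepdims=True)          # P_j(x) via (37)
            fits, round_trees = np.zeros((N, J)), []
            for j in range(J):
                w = np.clip(P[:, j] * (1 - P[:, j]), 1e-10, None)
                z = np.clip((Y[:, j] - P[:, j]) / w, -z_max, z_max)
                tree = DecisionTreeRegressor(max_depth=1).fit(X, z, sample_weight=w)
                fits[:, j] = tree.predict(X)
                round_trees.append(tree)
            # step (b): symmetrize the J fits and update F
            F += (J - 1) / J * (fits - fits.mean(axis=1, keepdims=True))
            rounds.append(round_trees)
        def F_fit(Xnew):
            Fn = np.zeros((len(Xnew), J))
            for round_trees in rounds:
                f = np.column_stack([t.predict(Xnew) for t in round_trees])
                Fn += (J - 1) / J * (f - f.mean(axis=1, keepdims=True))
            return Fn
        return lambda Xnew: F_fit(Xnew).argmax(axis=1)  # step 3: argmax_j F_j(x)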

Proposition 6  The LogitBoost algorithm (Algorithm 6) uses adaptive quasi-Newton steps for fitting an additive symmetric logistic model by maximum likelihood.

We sketch an informal proof.

Proof

- We first give the $L_2$ score and Hessian for the Newton algorithm corresponding to a standard multi-logit parametrization $G_j(x)$, with, say, $G_J(x) = 0$. The expected log-likelihood is

    \ell(G + g) = E\Big[ \sum_{j=1}^{J-1} y_j^*(G_j(x) + g_j(x)) - \log\Big(1 + \sum_{k=1}^{J-1} e^{G_k(x) + g_k(x)}\Big) \Big]

    s_j(x) = E(y_j^* - p_j(x) | x), \quad j = 1, \ldots, J - 1

    H_{j,k}(x) = -p_j(x)(\delta_{jk} - p_k(x)), \quad j, k = 1, \ldots, J - 1.

- Our quasi-Newton update amounts to using a diagonal approximation to the Hessian:

    g_j(x) = \frac{E(y_j^* - p_j(x) | x)}{p_j(x)(1 - p_j(x))}, \quad j = 1, \ldots, J - 1.

- To convert to the symmetric parametrization, we would note that $g_J = 0$, and set $f_j(x) = g_j(x) - \frac{1}{J}\sum_{k=1}^{J} g_k(x)$. However, this procedure could be applied using any class as the base, not just the $J$th. By averaging over all choices of the base class, we get the update

    f_j(x) = \frac{J-1}{J} \left( \frac{E(y_j^* - p_j(x) | x)}{p_j(x)(1 - p_j(x))} - \frac{1}{J} \sum_{k=1}^{J} \frac{E(y_k^* - p_k(x) | x)}{p_k(x)(1 - p_k(x))} \right).

□

5 Simulation studies

In this section the four flavors of boosting outlined above are applied to several artificially constructed problems. Comparisons based on real data are presented in Section 6.

An advantage of comparisons made in a simulation setting is that all aspects of each example are known, including the Bayes error rate and the complexity of the decision boundary. In addition, the population expected error rates achieved by each of the respective methods can be estimated to arbitrary accuracy by averaging over a large number of different training and test data sets drawn from the population. The four boosting methods compared here are:

DAB: Discrete AdaBoost (Algorithm 1).

RAB: Real AdaBoost (Algorithm 2 for two classes, and Algorithm 5 (AdaBoost.MH) for more than two classes).

LB: LogitBoost (Algorithms 3 and 6).

GAB: Gentle AdaBoost (Algorithm 4).

In an attempt to differentiate performance, all of the simulated examples involve fairly complex decision boundaries. The ten predictive features for all examples are randomly drawn from a ten-dimensional standard normal distribution $x \sim N^{10}(0, I)$. For the first three examples the decision boundaries separating successive classes are nested concentric ten-dimensional spheres constructed by thresholding the squared radius from the origin

    r^2 = \sum_{j=1}^{10} x_j^2.     (38)

Each class $C_k$ ($1 \le k \le K$) is defined as the subset of observations

    C_k = \{x_i \mid t_{k-1} \le r_i^2 < t_k\},     (39)

with $t_0 = 0$ and $t_K = \infty$. The $\{t_k\}_1^{K-1}$ for each example were chosen so as to put approximately equal numbers of observations in each class. The training sample size is $N = K \times 1000$, so that approximately 1000 training observations are in each class. An independently drawn test set of 10000 observations was used to estimate error rates for each training set. Averaged results over ten such independently drawn training/test set combinations were used for the final error rate estimates. The corresponding statistical uncertainties (standard errors) of these final estimates (averages) are approximately a line width on each plot.
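For illustration, a minimal sketch that generates such nested-spheres data; using the chi-squared quantiles as the population thresholds is our choice for equalizing the class probabilities (the paper chooses thresholds to balance the observed class counts), and the SciPy dependency and function names are ours.

    import numpy as np
    from scipy.stats import chi2

    def nested_spheres(K, n_per_class=1000, p=10, seed=0):
        """Nested-spheres data of Section 5: x ~ N(0, I_p), classes defined by
        thresholding r^2 = sum_j x_j^2 as in (38)-(39)."""
        rng = np.random.default_rng(seed)
        N = K * n_per_class
        X = rng.standard_normal((N, p))
        r2 = (X ** 2).sum(axis=1)
        t = chi2.ppf(np.arange(1, K) / K, df=p)   # population thresholds t_1, ..., t_{K-1}
        y = np.searchsorted(t, r2)                # class k such that t_{k-1} <= r^2 < t_k
        return X, y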

Figure 3 [top-left] compares the four algorithms in the two-class ($K = 2$) case using a two-terminal-node decision tree ("stump") as the base classifier. Shown is the test error rate as a function of the number of boosting iterations. The upper (black) line represents DAB and the other three nearly coincident lines are the other three methods (dotted red = RAB, short-dashed green = LB, and long-dashed blue = GAB). Note that the somewhat erratic behavior of DAB, especially for less than 200 iterations, is not due to statistical uncertainty. For less than 400 iterations LB has a minuscule edge; after that it is a dead heat with RAB and GAB. DAB shows substantially inferior performance here, with roughly twice the error rate at all iterations.

Figure 3 [lower-left] shows the corresponding results for three classes ($K = 3$), again with two-terminal-node trees. Here the problem is more difficult, as represented by increased error rates for all four methods, but their relationship is roughly the same: the upper (black) line represents DAB and the other three nearly coincident lines are the other three methods. The situation is somewhat different for larger numbers of classes ($K \ge 4$). Figure 3 [lower-right] shows results for $K = 5$, which are typical for $K \ge 4$: as before, DAB incurs much higher error rates than all the others, and RAB and GAB have nearly identical performance. However, the performance of LB relative to RAB and GAB has changed. Up to about 40 iterations it has the same error rate. From 40 to about 100 iterations LB's error rates are somewhat higher than the other two. After 100 iterations the error rate for LB continues to improve, whereas that for RAB and GAB levels off, decreasing much more slowly. By 800 iterations the error rate for LB is 0.19 whereas that for RAB and GAB is 0.32. Speculation as to the reason for LB's performance gain in these situations is presented below.

[Figure 3: Additive Decision Boundary. Four panels (Stumps - 2 Classes, Eight Node Trees - 2 Classes, Stumps - 3 Classes, Stumps - 5 Classes) show test error versus number of terms for DAB, RAB, LB and GAB. In all panels except the top right, the solid curve (representing Discrete AdaBoost) lies alone above the other three curves.]

In the above examples a two-terminal-node tree (stump) was used as the base classifier. One might expect the use of larger trees to do better for these rather complex problems. Figure 3 [top-right] shows results for the two-class problem, here boosting trees with eight terminal nodes. These results can be compared to those for stumps in Fig. 3 [top-left]. Initially, error rates for boosting eight-node trees decrease much more rapidly than for stumps, with each successive iteration, for all methods. However, the error rates quickly level off and improvement is very slow after about 100 iterations. The overall performance of DAB is much improved with the bigger trees, coming close to that of the other three methods. As before, RAB, GAB, and LB exhibit nearly identical performance. Note that at each iteration the eight-node tree model consists of four times the number of additive terms as does the corresponding stump model. This is why the error rates decrease so much more rapidly in the early iterations. In terms of model complexity (and training time), a 100-iteration model using eight-terminal-node trees is equivalent to a 400-iteration stump model.

Comparing the top two panels in Fig. 3, one sees that for RAB, GAB, and LB the error rate using the bigger trees (0.072) is in fact 33% higher than that for stumps (0.054) at 800 iterations, even though the former is four times more complex. This seemingly mysterious behavior is easily understood by examining the nature of the decision boundary separating the classes. In general, the decision boundary between any two classes can be defined by an equation of the form $B(x) = c$; using $\mathrm{sign}[B(x) - c]$ to discriminate between the two classes results in the optimal Bayes rule. It is the goal of any classifier to approximate $B(x)$ (which is not necessarily unique) as closely as possible under whatever constraints (concept bias) are associated with the classifier. As discussed above, boosting produces an additive model whose components (basis functions) are represented by the base classifier. If a boundary function $B(x)$ for a particular problem happens to be well approximated by such an additive model then performance is likely to be good; if not, high error rates are a likely result.

With stumps as the base classifier, each basis function involves only a single (splitting) predictor variable:

    f_k(x) = c_k\, 1_{(s_k(x_{j(k)} - t_k) > 0)}.     (40)

Here $j(k)$ is the split variable, $t_k$ the split point, and $s_k = \pm 1$ the direction (+ = right son, - = left son) associated with the $k$th term. The boosted model is a linear combination of such single-variable functions. Let

    g_j(x_j) = \sum_{j(k) = j} f_k(x_j).     (41)

That is, each $g_j(x_j)$ is the sum of all (stump) functions involving the same ($j$th) predictor, with $g_j(x_j) = 0$ if none exist. Then the model produced by boosting stumps is additive in the original features:

    F(x) = \sum_{j=1}^{p} g_j(x_j).     (42)
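To illustrate (41)-(42), here is a small sketch that collects the terms of a boosted-stump model by splitting variable so the fitted function can be inspected one coordinate at a time; the stump representation (a list of (j, t, s, c) tuples) is our own simplified encoding, not something defined in the paper.

    from collections import defaultdict

    def per_feature_components(stumps):
        """Group stump terms f_k(x) = c_k * 1{s_k (x_{j(k)} - t_k) > 0} by split variable j(k),
        returning for each feature j a function g_j(x_j) as in (41)."""
        by_feature = defaultdict(list)
        for j, t, s, c in stumps:            # split variable, split point, direction, constant
            by_feature[j].append((t, s, c))
        def make_g(terms):
            return lambda xj: sum(c for (t, s, c) in terms if s * (xj - t) > 0)
        return {j: make_g(terms) for j, terms in by_feature.items()}

    # F(x) = sum over j of g_j(x_j), the additive decomposition (42)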

Examination of (38) and (39) reveals that an optimal decision boundary for the above examples is also additive in the original features, with $f_j(x_j) = x_j^2 + \mathrm{const}$. Thus, in the context of decision trees, stumps are ideally matched to these problems; larger trees are not needed. However, boosting larger trees need not be counterproductive in this case if all of the splits in each individual tree are made on the same predictor variable. This would also produce an additive model in the original features (42). However, due to the forward greedy stage-wise strategy used by boosting, this is not likely to happen if the decision boundary function involves more than one predictor; each individual tree will try to do its best to involve all of the important predictors. Owing to the nature of decision trees, this will produce models with interaction effects; most terms in the model will involve products in more than one variable. Such non-additive models are not as well suited for approximating truly additive decision boundaries such as (38) and (39). This is reflected in the increased error rate observed in Fig. 3.

It should be noted that if $B(x) = c$ describes a decision boundary, any monotone transformation of both sides describes the same boundary. This fact makes boosting stumps more general than one might at first think. For example, suppose $B(x) = \prod_{j=1}^{p} f_j(x_j)$ with all quantities being strictly positive. Such a boundary is very complex, involving interactions to the highest order. However, $\tilde{B}(x) = \log B(x) = \sum_{j=1}^{p} \log f_j(x_j)$ is an additive function (in the original variables) and $\tilde{B}(x) = \log c$ describes the same boundary. Thus, boosting stumps would be optimal in spite of the apparent complexity of this $B(x)$. More generally, if any monotone transformation of the decision boundary definition renders it approximately additive, boosting stumps should be competitive. Also note that the "naive" Bayes procedure, which is often highly competitive, produces decision boundary estimates that, after logarithmic transformation, are additive in the original predictors. Boosting decision stumps is considerably more flexible than most implementations of "naive" Bayes.

The above discussion suggests that if the decision boundary separating pairs of classes were inherently non-additive in the original predictor variables, then boosting stumps would be less advantageous than using larger trees. A tree with $m$ terminal nodes can produce basis functions with a maximum interaction order of $\min(m - 1, p)$, where $p$ is the number of predictor features. These higher-order basis functions provide the possibility to more accurately estimate those $B(x)$ with high-order interactions. The purpose of the next example is to verify this intuition. There are two classes ($K = 2$) and 5000 training observations, with the $\{x_i\}_1^{5000}$ drawn from a ten-dimensional normal distribution as in the previous examples. Class labels were randomly assigned to each observation with log-odds

    \log \frac{\Pr[y = 1 \mid x]}{\Pr[y = -1 \mid x]} = 10 \sum_{j=1}^{6} x_j \Big( 1 + \sum_{l=1}^{6} (-1)^l x_l \Big).

Approximately equal numbers of observations are assigned to each of the two classes, and the Bayes error rate is 0.046. The decision boundary for this problem is a complicated function of the first six predictor variables, involving all of them in second-order interactions of equal strength. As in the above examples, test sets of 10000 observations were used to estimate error rates for each training set, and final estimates were averages over ten replications.

Figure 4 [top-left] shows test error rate as a function of iteration number for each of the four boosting methods using stumps. As in the previous examples, RAB and GAB track each other very closely. DAB begins very slowly, being dominated by all of the others until around 180 iterations, where it passes below RAB and GAB. LB mostly dominates, having the lowest error rate until about 650 iterations. At that point DAB catches up, and by 800 iterations it may have a very slight edge. However, none of these boosting methods perform well with stumps on this problem, the best error rate being 0.35.

[Figure 4: Non-Additive (Interactive) Decision Boundary. Three panels (Stumps - 2 Classes, Four Node Trees - 2 Classes, Eight Node Trees - 2 Classes) show test error versus number of terms for DAB, RAB, LB and GAB.]

Figure 4 [top-right] shows the corresponding plot when four-terminal-node trees are boosted. Here there is a dramatic improvement for all four methods. For the first time there is some small differentiation between RAB and GAB. At nearly all iterations the performance ranking is LB best, followed by GAB, RAB, and DAB in order. At 800 iterations LB achieves an error rate of 0.134. Figure 4 [lower-left] shows results when eight-terminal-node trees are boosted. Here, error rates are generally further reduced, with LB improving the least (0.130) but still dominating. The performance ranking among the other three methods changes with increasing iterations; DAB overtakes RAB at around 150 iterations and GAB at about 230, becoming fairly close to LB by 800 iterations with an error rate of 0.138.

Although limited in scope, these simulation studies suggest several trends. They explain why boosting stumps can sometimes be superior to using larger trees, and suggest situations where this is likely to be the case: namely, when the decision boundaries $B(x)$ can be closely approximated by functions that are additive in the original predictor features. When higher-order interactions are required, stumps exhibit poor performance. These examples illustrate the close similarity between RAB and GAB. In all cases the difference in performance between DAB and the others decreases when larger trees and more iterations are used, sometimes overtaking the others. More generally, the relative performance of these four methods depends on the problem at hand in terms of the nature of the decision boundaries, the complexity of the base classifier, and the number of boosting iterations.

The superior performance of LB in Fig. 3 [lower-right] appears to be a consequence of the multi-class logistic model (Algorithm 6). All of the other methods use the asymmetric AdaBoost.MH strategy (Algorithm 5) of building separate two-class models for each individual class against the pooled complement classes. Even if the decision boundaries separating all class pairs are relatively simple, pooling classes can produce complex decision boundaries that are difficult to approximate (Friedman 1996). By considering all of the classes simultaneously, the symmetric multi-class model is better able to take advantage of simple pairwise boundaries when they exist (Hastie & Tibshirani 1998). As noted above, the pairwise boundaries induced by (38) and (39) are simple when viewed in the context of additive modeling, whereas the pooled boundaries are more complex; they cannot be well approximated by functions that are additive in the original predictor variables.

The decision boundaries associated with these examples were deliberately chosen to be geometrically complex in an attempt to elicit performance differences among the methods being tested. Such complicated boundaries are not likely to occur often in practice. Many practical problems involve comparatively simple boundaries (Holte 1993); in such cases performance differences will still be situation dependent, but correspondingly less pronounced.

6 Some experiments with data

In this section we show the results of running the four fitting methods LogitBoost, Discrete AdaBoost, Real AdaBoost, and Gentle AdaBoost on a collection of datasets from the UC-Irvine machine learning archive, plus a popular simulated dataset. The base learner is a tree in each case, with either 2 terminal nodes ("stumps") or 8 terminal nodes. For comparison, a single CART decision tree was also fit, with the tree size determined by 5-fold cross-validation.

The datasets are summarized in Table 1. The test error rates are shown in Table 2 for the smaller datasets, and in Table 3 for the larger ones. The vowel, sonar, satimage and letter datasets come with a pre-specified test set. The waveform data is simulated, as described in Breiman et al. (1984). For the others, 5-fold cross-validation was used to estimate the test error.

It is difficult to discern trends on the small data sets (Table 2) because all but quite large observed differences in performance could be attributed to sampling fluctuations. On the vowel, breast cancer, ionosphere, sonar, and waveform data, purely additive (two-terminal-node tree) models seem to perform comparably to the larger (eight-node) trees. The glass data seem to benefit a little from larger trees. There is no clear differentiation in performance among the boosting methods.

On the larger data sets (Table 3) clearer trends are discernible. For the satimage data the eight-node tree models are only slightly, but significantly, more accurate than the purely additive models. For the letter data there is no contest. Boosting stumps is clearly inadequate. There is no clear differentiation among the boosting methods for eight-node trees. For the stumps, LogitBoost, Real AdaBoost, and Gentle AdaBoost have comparable performance, distinctly superior to Discrete AdaBoost. This is consistent with the results of the simulation study (Section 5).

Table 1: Datasets used in the experiments

Data set        # Train    # Test       # Inputs   # Classes
vowel           528        462          10         11
breast cancer   699        5-fold CV    9          2
ionosphere      351        5-fold CV    34         2
glass           214        5-fold CV    10         7
sonar           210        5-fold CV    60         2
waveform        300        5000         21         3
satimage        4435       2000         36         6
letter          16000      4000         16         26

Except perhaps for Discrete AdaBoost, the real data examples fail to demonstrate performance differences between the various boosting methods. This is in contrast to the simulated data sets of Section 5, where LogitBoost generally dominated, although often by a small margin. The inability of the real data examples to discriminate may reflect statistical difficulties in estimating subtle differences with small samples. Alternatively, it may be that their underlying decision boundaries are all relatively simple (Holte 1993), so that all reasonable methods exhibit similar performance.

7 Additive Logistic Trees

In most applications of boosting the base classifier is considered to be a primitive, repeatedly called by the boosting procedure as iterations proceed. The operations performed by the base classifier are the same as they would be in any other context, given the same data and weights. The fact that the final model is going to be a linear combination of a large number of such classifiers is not taken into account. In particular, when using decision trees, the same tree growing and pruning algorithms are generally employed. Sometimes alterations are made (such as no pruning) for programming convenience and speed.

When boosting is viewed in the light of additive modeling, however, this greedy approach can be seen to be far from optimal in many situations. As discussed in Section 5, the goal of the final classifier is to produce an accurate approximation to the decision boundary function B(x). In the context of boosting, this goal applies to the final additive model, not to the individual terms (base classifiers) at the time they were constructed.

Table 2: Test error rates on small real examples

                        2 Terminal Nodes      8 Terminal Nodes
Method / Iterations      50    100    200      50    100    200

Vowel               CART error = .642
LogitBoost              .532   .524   .511     .517   .517   .517
Real AdaBoost           .565   .561   .548     .496   .496   .496
Gentle AdaBoost         .556   .571   .584     .515   .496   .496
Discrete AdaBoost       .563   .535   .563     .511   .500   .500

Breast              CART error = .045
LogitBoost              .028   .031   .029     .034   .038   .038
Real AdaBoost           .038   .038   .040     .032   .034   .034
Gentle AdaBoost         .037   .037   .041     .032   .031   .031
Discrete AdaBoost       .042   .040   .040     .032   .035   .037

Ion                 CART error = .076
LogitBoost              .074   .077   .071     .068   .063   .063
Real AdaBoost           .068   .066   .068     .054   .054   .054
Gentle AdaBoost         .085   .074   .077     .066   .063   .063
Discrete AdaBoost       .088   .080   .080     .068   .063   .063

Glass               CART error = .400
LogitBoost              .266   .257   .266     .243   .238   .238
Real AdaBoost           .276   .247   .257     .234   .234   .234
Gentle AdaBoost         .276   .261   .252     .219   .233   .238
Discrete AdaBoost       .285   .285   .271     .238   .234   .243

Sonar               CART error = .596
LogitBoost              .231   .231   .202     .163   .154   .154
Real AdaBoost           .183   .183   .173     .154   .154   .154
Gentle AdaBoost         .154   .163   .202     .173   .173   .173
Discrete AdaBoost       .154   .144   .183     .163   .144   .144

Waveform            CART error = .364
LogitBoost              .196   .195   .206     .192   .191   .191
Real AdaBoost           .193   .197   .195     .185   .182   .182
Gentle AdaBoost         .190   .188   .193     .185   .185   .186
Discrete AdaBoost       .188   .185   .191     .186   .183   .183

Table 3: Test error rates on larger data examples

                      Terminal         Iterations
Method                 Nodes     20     50    100    200    Fraction

Satimage            CART error = .148
LogitBoost               2      .140   .120   .112   .102
Real AdaBoost            2      .148   .126   .117   .119
Gentle AdaBoost          2      .148   .129   .119   .119
Discrete AdaBoost        2      .174   .156   .140   .128
LogitBoost               8      .096   .095   .092   .088
Real AdaBoost            8      .105   .102   .092   .091
Gentle AdaBoost          8      .106   .103   .095   .089
Discrete AdaBoost        8      .122   .107   .100   .099

Letter              CART error = .124
LogitBoost               2      .250   .182   .159   .145     .06
Real AdaBoost            2      .244   .181   .160   .150     .12
Gentle AdaBoost          2      .246   .187   .157   .145     .14
Discrete AdaBoost        2      .310   .226   .196   .185     .18
LogitBoost               8      .075   .047   .036   .033     .03
Real AdaBoost            8      .068   .041   .033   .032     .03
Gentle AdaBoost          8      .068   .040   .030   .028     .03
Discrete AdaBoost        8      .080   .045   .035   .029     .03

For example, it was seen in Section 5 that if B(x) was close to being additive in the original predictive features, then boosting stumps was optimal since it produced an approximation with the same structure. Building larger trees increased the error rate of the final model because the resulting approximation involved high order interactions among the features. The larger trees optimized the error rates of the individual base classifiers, given the weights at that step, and even produced lower unweighted error rates in the early stages. But, after a sufficient number of boosts, the stump-based model achieved superior performance.

More generally, one can consider an expansion of the decision boundary function in a functional ANOVA decomposition (Friedman 1991)

    B(x) = \sum_j f_j(x_j) + \sum_{j,k} f_{jk}(x_j, x_k) + \sum_{j,k,l} f_{jkl}(x_j, x_k, x_l) + \cdots        (43)

The first sum represents the closest function to B(x) that is additive in the original features, the first two sums represent the closest approximation involving at most two-feature interactions, the first three represent approximations with at most three-feature interactions, and so on. If B(x) can be accurately approximated by such an expansion, truncated at low interaction order, then allowing the base classifier to produce higher order interactions can reduce the accuracy of the final boosted model. In the context of decision trees, higher order interactions are produced by deeper trees.

In situations where the true underlying decision boundary function admits a low order ANOVA decomposition, one can take advantage of this structure to improve accuracy by restricting the depth of the base decision trees to be not much larger than the actual interaction order of B(x). Since this is not likely to be known in advance for any particular problem, this maximum depth becomes a "meta-parameter" of the procedure, to be estimated by some model selection technique such as cross-validation.

One can restrict the depth of an induced decision tree by using its standard pruning procedure, starting from the largest possible tree, but requiring it to delete enough splits to achieve the desired maximum depth. This can be computationally wasteful when this depth is small. The time required to build the tree is proportional to the depth of the largest possible tree before pruning. Therefore, dramatic computational savings can be achieved by simply stopping the growing process at the maximum depth, or alternatively at a maximum number of terminal nodes. The standard heuristic arguments in favor of growing large trees and then pruning do not apply in the context of boosting. Shortcomings in any individual tree can be compensated for by trees grown later in the boosting sequence.

If a truncation strategy based on the number of terminal nodes is to be employed, it is necessary to define an order in which splitting takes place. We adopt a "best-first" strategy. An optimal split is computed for each currently terminal node. The node whose split would achieve the greatest reduction in the tree building criterion is then actually split. This increases the number of terminal nodes by one. This continues until a maximum number M of terminal nodes is induced. Standard computational tricks can be employed so that inducing trees in this order requires no more computation than other orderings commonly used in decision tree induction.
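A minimal, unoptimized sketch of this best-first growing strategy for a weighted least-squares regression tree (the kind of fit used inside LogitBoost-style steps). The function names and the exhaustive threshold search are ours, standing in for the standard sorted-feature tricks mentioned above.

import numpy as np

def _best_split(X, y, w, idx):
    """Best weighted least-squares split of the observations in idx, as
    (reduction in weighted SSE, feature, threshold), or None if no split exists."""
    y_i, w_i = y[idx], w[idx]
    parent_sse = np.sum(w_i * (y_i - np.average(y_i, weights=w_i)) ** 2)
    best = None
    for j in range(X.shape[1]):
        x = X[idx, j]
        for t in np.unique(x)[:-1]:            # candidate thresholds
            left = x <= t
            sse = 0.0
            for mask in (left, ~left):
                yw, ww = y_i[mask], w_i[mask]
                sse += np.sum(ww * (yw - np.average(yw, weights=ww)) ** 2)
            gain = parent_sse - sse
            if best is None or gain > best[0]:
                best = (gain, j, t)
    return best

def best_first_tree(X, y, w, max_leaves):
    """Repeatedly split the terminal node whose best split most reduces the
    weighted squared error, until max_leaves is reached.
    Returns a list of leaves as (observation indices, fitted value)."""
    leaves = [np.arange(len(y))]               # start with the root node
    while len(leaves) < max_leaves:
        candidates = [(_best_split(X, y, w, idx), k)
                      for k, idx in enumerate(leaves)]
        candidates = [(c, k) for c, k in candidates if c is not None]
        if not candidates:
            break
        (gain, j, t), k = max(candidates, key=lambda c: c[0][0])
        idx = leaves.pop(k)                    # split the chosen leaf in two
        leaves.append(idx[X[idx, j] <= t])
        leaves.append(idx[X[idx, j] > t])
    return [(idx, np.average(y[idx], weights=w[idx])) for idx in leaves]

In practice the split search for each leaf is computed once and cached, so growing to M leaves in this order costs no more than depth-first growth.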

The truncation limit M is applied to all trees in the boosting sequence. It is thus a meta-parameter of the entire boosting procedure. An optimal value can be estimated through standard model selection techniques, such as minimizing the cross-validated error rate of the final boosted model.

Figure 5: Coordinate functions for the additive logistic tree obtained by boosting with stumps, for the two-class nested sphere example from Section 5. [Ten panels, one for each coordinate function f(x1), ..., f(x10).]

We refer to this combination of truncated best-first trees, with boosting, as "additive logistic trees" (ALT). This is the procedure used in all of the simulated

and real examples. One can compare results on the latter (Tables 2 and 3) to corresponding results reported by Dietterich (1998, Table 1) on common data sets. Error rates achieved by ALT with very small truncation values are seen to compare quite favorably with other committee approaches using much larger trees at each boosting step. Even when error rates are the same, the computational savings associated with ALT can be quite important in data mining contexts where large data sets cause computation time to become an issue.

Another advantage of low order approximations is model visualization. In particular, for models additive in the original features (42), the contribution of each feature x_j can be viewed as a graph of g_j(x_j) (41) plotted against x_j. Figure 5 shows such plots for the ten features of the two-class nested spheres example of Fig. 3. The functions are shown for the first class, concentrated near the origin; the corresponding functions for the other class are the negatives of these functions.

The plots in Fig. 5 clearly show that the contribution to the log-odds of each individual feature, for given values of the other features, is approximately quadratic, except perhaps at the extreme values where there is little data. They indicate that observations with feature values closer to the origin have increased odds of being from the first class. Their sum is clearly indicative of the additive quadratic nature of the decision boundary (38) and (39).
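A display such as Fig. 5 can be assembled directly from a fitted stump model. The sketch below assumes the boosted model is available as a list of stumps, each recorded as a feature index, a threshold, the two fitted values, and a multiplier; this representation and the function name are ours, chosen only for illustration.

import numpy as np
from collections import namedtuple

# one term of the additive expansion: c * (left if x[j] <= t else right)
Stump = namedtuple("Stump", "j t left right c")

def coordinate_function(stumps, j, grid):
    """Sum the contributions of all stumps that split on feature j, evaluated
    on a grid of values for x_j; this is g_j(x_j) up to an additive constant."""
    g = np.zeros_like(grid, dtype=float)
    for s in stumps:
        if s.j == j:
            g += s.c * np.where(grid <= s.t, s.left, s.right)
    return g - g.mean()        # center, since the constant is not identified

# usage (hypothetical): plot coordinate_function(fitted_stumps, j, grid)
# against grid for j = 0, ..., 9 to reproduce a display like Fig. 5.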

When there are more than two classes, plots similar to Fig. 5 can be made for each class and analogously interpreted. Higher order interaction models are more difficult to visualize. If there are at most two-feature interactions, the two-variable contributions can be visualized using contour or perspective mesh plots. Beyond two-feature interactions, visualization techniques are even less effective. Even when non-interaction (stump) models do not achieve the highest accuracy, they can be very useful as descriptive statistics, owing to the interpretability of the resulting model.

8 Weight trimming

In this section we propose a simple idea and show that it can dramatically reduce computation for boosted models without sacrificing accuracy. Despite its apparent simplicity, this approach does not appear to be in common use. At each boosting iteration there is a distribution of weights over the training sample. As iterations proceed this distribution tends to become highly skewed towards smaller weight values. A larger fraction of the training sample becomes correctly classified with increasing confidence, thereby receiving smaller weights. Observations with very low relative weight have little impact on the training of the base classifier; only those that carry the dominant proportion of the weight mass are influential. The fraction of such high weight observations can become very small in later iterations. This suggests that at any iteration one can simply delete from the training sample the large fraction of observations with very low weight without having much effect on the resulting induced classifier. However, computation is reduced since it tends to be proportional to the size of the training sample, regardless of weights.

At each boosting iteration, training observations i whose weight w_i is less than a threshold, w_i < t(α), are not used to train the classifier. We take the value of t(α) to be the α-th quantile of the weight distribution over the training data at the corresponding iteration. That is, only those observations that carry the fraction 1 − α of the total weight mass are used for training. Typically α ∈ [0.01, 0.1], so that the data used for training carries from 90 to 99 percent of the total weight mass. Note that the weights for all training observations are recomputed at each iteration. Observations deleted at a particular iteration may therefore re-enter at later iterations if

their weights subsequently increase relative to other observations.
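A minimal sketch of one way to implement this trimming rule, dropping the observations that make up the lowest fraction α of the total weight mass (the function name and the numpy dependency are ours):

import numpy as np

def trimmed_indices(w, alpha=0.1):
    """Indices of the observations kept for training: the highest-weight
    observations carrying a fraction 1 - alpha of the total weight mass."""
    order = np.argsort(w)                  # ascending weight
    cum = np.cumsum(w[order]) / np.sum(w)  # cumulative weight mass
    keep = order[cum > alpha]              # drop the lowest alpha of the mass
    return np.sort(keep)

# at each boosting iteration:
#   idx = trimmed_indices(w, alpha=0.1)
#   fit the base classifier to (X[idx], y[idx]) with weights w[idx];
#   all N weights are still updated afterwards, so deleted observations
#   can re-enter at later iterations.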

Figure 6: The left panel shows the test error for the letter recognition problem as a function of iteration number; the black solid curve uses all the training data, the blue dashed curve uses a subset based on weight thresholding. The right panel shows the percent of training data used by both approaches. [Left panel: test error (0.1 to 0.3) versus number of terms (0 to 400); right panel: percent of training data used (0 to 100) versus number of terms (0 to 400).]

Figure 6 [left panel] shows the test-error rate as a function of iteration number for the letter recognition problem described in Section 6, here using Gentle AdaBoost and eight-node trees as the base classifier. Two error rate curves are shown. The black solid one represents using the full training sample at each iteration (α = 0), whereas the blue dashed curve represents the corresponding error rate for α = 0.1. The two curves track each other very closely, especially at the later iterations. Figure 6 [right panel] shows the corresponding fraction of observations used to train the base classifier as a function of iteration number. Here the two curves are not similar. With α = 0.1 the number of observations used for training drops very rapidly, reaching roughly 5% of the total at 20 iterations. By 50 iterations it is down to about 3%, where it stays throughout the rest of the boosting procedure. Thus, computation is reduced by over a factor of 30 with no apparent loss in classification accuracy. The reason why the sample size decreases for α = 0 after 150 iterations is that if all of the observations in a particular class are classified correctly with very high confidence (F_k > 15 + log(N)), training for that class stops, and continues only for the remaining classes. At 200 iterations, 10 of the original 26 classes remained.

The last column of Table 3, labeled fraction, shows the average fraction of observations used in training the base classifiers over the 200 iterations, for all boosting methods and tree sizes. For eight-node trees, all methods behave as shown in Fig. 6. With stumps, LogitBoost uses considerably less data than the others and is thereby correspondingly faster.

This is a genuine property of LogitBoost that sometimes gives it an advantage with weight trimming. Unlike the other methods, the LogitBoost weights w_i = p_i(1 − p_i) do not in any way involve the class outputs y_i; they simply measure nearness to the currently estimated decision boundary F_M(x) = 0. Discarding small weights thus retains only those training observations that are estimated to be close to the boundary. For the other three procedures the weight is monotone in −y_i F_M(x_i). This gives highest weight to currently misclassified training observations, especially those far from the boundary. If after trimming the fraction of observations remaining is less than the error rate, the subsample passed to the base learner will be highly unbalanced, containing very few correctly classified observations. This imbalance seems to inhibit learning. No such imbalance occurs with LogitBoost since, near the decision boundary, correctly classified and misclassified observations appear in roughly equal numbers.
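The contrast can be seen directly from the two weight formulas. Below is a small numerical illustration of our own, assuming the half log-odds convention F(x) = (1/2) log(p(x)/(1 − p(x))) used earlier in the paper, so that p = 1/(1 + e^{-2F}); the exponential form exp(−yF) stands in for the AdaBoost-family weights, which are monotone in −yF(x).

import numpy as np

def logitboost_weights(F):
    """w = p(1 - p): largest where p is near 1/2, i.e. near F(x) = 0,
    regardless of the observed class label."""
    p = 1.0 / (1.0 + np.exp(-2.0 * F))
    return p * (1.0 - p)

def adaboost_weights(F, y):
    """Exponential-criterion weights, monotone in -yF(x): largest for
    badly misclassified observations far from the boundary."""
    return np.exp(-y * F)

F = np.array([-3.0, -0.1, 0.1, 3.0])
print(logitboost_weights(F))             # large only for the two points near F = 0
print(adaboost_weights(F,  np.ones(4)))  # with y = +1: huge weight on F = -3
print(adaboost_weights(F, -np.ones(4)))  # with y = -1: huge weight on F = +3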

As this example illustrates, very large reductions in computation for boosting can be achieved by this simple trick. A variety of other examples (not shown) exhibit similar behavior with all boosting methods. Note that other committee approaches to classification, such as bagging (Breiman 1996) and randomized trees (Dietterich 1998), while admitting parallel implementations, cannot take advantage of this approach to reduce computation.

9 Concluding remarks

In order to understand a learning procedure statistically it is necessary to identify two important aspects: its structural model and its error model. The former is most important since it determines the function (concept) space of the approximator, thereby characterizing the class of concepts that can be accurately approximated (learned) with it. The error model specifies the distribution of random departures of sampled data from the structural model. It thereby defines the criterion to be optimized in the estimation of the structural model.

We have shown that the structural model for boosting is additive on the logistic scale with the base (weak) learner as basis elements. This understanding alone explains many of the properties of boosting. It is no surprise that a large number of such (jointly optimized) basis elements defines a much richer class of learners than one of them alone. It reveals that in the context of boosting all base (weak) learners are not equivalent, and there is no universally best choice over all situations. As illustrated in Section 5, the base learners need to be chosen so that the resulting additive expansion matches the particular decision boundary encountered. Even in the limited context of boosting decision trees, the interaction order, as characterized by the number of terminal nodes, needs to be chosen with care. Purely additive models induced by decision stumps are sometimes, but not always, the best. However, we conjecture that boundaries involving very high order interactions are likely to be encountered rarely in practice. This motivates our additive logistic trees (ALT) procedure described in Section 7.

The error model for two-class boosting is the obvious one for binary variables, namely the Bernoulli distribution. We showed that the AdaBoost procedures maximize a criterion that is closely related to expected log-Bernoulli likelihood, having the identical solution in the distributional (L_2) limit of infinite data. We derived a more direct procedure for maximizing this log-likelihood (LogitBoost) and showed that it exhibits properties nearly identical to those of Real AdaBoost.

In the multi-class case, the AdaBoost procedures maximize a separate Bernoulli likelihood for each class versus the others. This is a natural choice and is especially appropriate when observations can belong to more than one class (Schapire & Singer 1998). In the more usual setting of a unique class label for each observation, the symmetric multinomial distribution is a more appropriate error model. We developed a multi-class LogitBoost procedure that maximizes the corresponding log-likelihood by quasi-Newton stepping. We showed through simulated examples that there exist settings where this approach leads to superior performance, although none of these situations seems to have been encountered in the set of real data examples used for illustration; the two approaches had quite similar performance over these examples.

The concepts developed in this paper suggest that there is very little, if any, connection between (deterministic) weighted boosting and other (randomized) ensemble methods such as bagging (Breiman 1996) and randomized trees (Dietterich 1998). In the language of least squares regression, the latter are purely "variance" reducing procedures intended to mitigate instability, especially that associated with (large) decision trees. Boosting, on the other hand, seems fundamentally different. It appears to be a purely "bias" reducing procedure, intended to increase the flexibility of stable (highly biased) weak learners by incorporating them in a jointly fitted additive expansion.

The distinction becomes less clear when boosting is implemented by finite random importance sampling instead of weights. The advantages and disadvantages of introducing randomization into boosting by drawing finite samples are not clear. If there turns out to be an advantage to randomization in some situations, then the degree of randomization, as reflected by the sample size, is an open question. It is not obvious that the common choice of using the size of the original training sample is optimal in all (or any) situations.

One fascinating issue not covered in this paper is the fact that boosting, whatever the flavor, seldom seems to overfit, no matter how many terms are included in the additive expansion. Some possible explanations are:

- As the LogitBoost iterations proceed, the overall impact of the changes introduced by f_m(x) decreases. Only observations with appreciable weight, namely those near the decision boundary, determine the new functions. By definition these observations have F(x) near zero and can be affected by changes, while those in pure regions have large values of |F(x)| and are less likely to be modified.

- The stage-wise nature of the boosting algorithms does not allow the full collection of parameters to be jointly fit, and thus the models have far lower variance than the full parameterization might suggest. In the machine learning literature this is explained in terms of the VC dimension of the ensemble compared to that of each weak learner.

- Classifiers are hurt less by overfitting than other function estimators (e.g. the famous risk bound of the 1-nearest-neighbor classifier (Cover & Hart 1967)).

Whatever the explanation, the empirical evidence is strong; the introduction of boosting by Schapire, Freund and colleagues has brought an exciting and important set of new ideas to the table.

Acknowledgements

We thank Andreas Buja for alerting us to the recent work on text classification at AT&T laboratories, and Bogdan Popescu for illuminating discussions on PAC learning theory. Jerome Friedman was partially supported by the Department of Energy under contract number DE-AC03-76SF00515 and by grant DMS-9764431 of the National Science Foundation. Trevor Hastie was partially supported by grants DMS-9504495 and DMS-9803645 from the National Science Foundation, and grant ROI-CA-72028-01 from the National Institutes of Health. Robert Tibshirani was supported by the Natural Sciences and Engineering Research Council of Canada.

References

Breiman, L. (1996), 'Bagging predictors', Machine Learning 26.

Breiman, L. (1997), Prediction games and arcing algorithms, Technical Report 504, Statistics Department, University of California, Berkeley. Submitted to Neural Computing.

Breiman, L., Friedman, J., Olshen, R. & Stone, C. (1984), Classification and Regression Trees, Wadsworth, Belmont, California.

Buja, A., Hastie, T. & Tibshirani, R. (1989), 'Linear smoothers and additive models (with discussion)', Annals of Statistics 17, 453-555.

Cover, T. & Hart, P. (1967), 'Nearest neighbor pattern classification', Proc. IEEE Trans. Inform. Theory pp. 21-27.

Dietterich, T. (1998), 'An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization', Machine Learning ?, 1-22.

Freund, Y. (1995), 'Boosting a weak learning algorithm by majority', Information and Computation 121(2), 256-285.

Freund, Y. & Schapire, R. (1996), Experiments with a new boosting algorithm, in 'Machine Learning: Proceedings of the Thirteenth International Conference', pp. 148-156.

Friedman, J. (1991), 'Multivariate adaptive regression splines (with discussion)', Annals of Statistics 19(1), 1-141.

Friedman, J. (1996), Another approach to polychotomous classification, Technical report, Stanford University.

Friedman, J. & Stuetzle, W. (1981), 'Projection pursuit regression', Journal of the American Statistical Association 76, 817-823.

Hastie, T. & Tibshirani, R. (1990), Generalized Additive Models, Chapman and Hall.

Hastie, T. & Tibshirani, R. (1998), 'Classification by pairwise coupling', Annals of Statistics. (To appear.)

Hastie, T., Tibshirani, R. & Buja, A. (1994), 'Flexible discriminant analysis by optimal scoring', Journal of the American Statistical Association 89, 1255-1270.

Holte, R. (1993), 'Very simple classification rules perform well on most commonly used datasets', Machine Learning 11, 63-90.

Kearns, M. & Vazirani, U. (1994), An Introduction to Computational Learning Theory, MIT Press.

Mallat, S. & Zhang, Z. (1993), 'Matching pursuits with time-frequency dictionaries', IEEE Transactions on Signal Processing 41, 3397-3415.

McCullagh, P. & Nelder, J. (1989), Generalized Linear Models, Chapman and Hall.

Schapire, R. (1990), 'The strength of weak learnability', Machine Learning 5(2), 197-227.

Schapire, R. & Singer, Y. (1998), Improved boosting algorithms using confidence-rated predictions, in 'Proceedings of the Eleventh Annual Conference on Computational Learning Theory'.