Ensemble Methods - Boosting
Bin Li, IIT Lecture Series

Boosting a weak learner

- A weak learner L produces a hypothesis h with error rate $\beta = \frac{1}{2} - \epsilon < \frac{1}{2}$, with probability at least $1 - \delta$, for any distribution $\mathcal{D}$. L has access to a continuous stream of training data and a class oracle.
  1. L learns $h_1$ on the first N training points.
  2. L randomly filters the next batch of training points, extracting N/2 points correctly classified by $h_1$ and N/2 incorrectly classified, and produces $h_2$.
  3. L builds a third training set of N points on which $h_1$ and $h_2$ disagree, and produces $h_3$.
  4. L outputs $h = \text{Majority Vote}(h_1, h_2, h_3)$.
- Theorem (Schapire, 1990, "The Strength of Weak Learnability"):
  $\text{error}_{\mathcal{D}}(h) \le 3\beta^2 - 2\beta^3 < \beta$

Plot of error rate $3\beta^2 - 2\beta^3$

[Figure: the boosted error bound $3\beta^2 - 2\beta^3$ plotted against $\beta \in [0, 0.5]$.]

What is AdaBoost? (figure from HTF 2009)

[Figure: schematic of boosting. Classifiers $G_1(x), G_2(x), \ldots, G_M(x)$ are fit to successively reweighted training samples; the final classifier is $G(x) = \text{sign}\left[\sum_{m=1}^{M} \alpha_m G_m(x)\right]$.]

AdaBoost (Freund & Schapire, 1996)

1. Initialize the observation weights: $w_i = 1/N$, $i = 1, 2, \ldots, N$.
2. For $m = 1$ to $M$ repeat steps (a)-(d):
   (a) Fit a classifier $G_m(x)$ to the training data using weights $w_i$.
   (b) Compute
       $\text{err}_m = \dfrac{\sum_{i=1}^{N} w_i I(y_i \ne G_m(x_i))}{\sum_{i=1}^{N} w_i}$
   (c) Compute $\alpha_m = \log((1 - \text{err}_m)/\text{err}_m)$.
   (d) Update the weights for $i = 1, \ldots, N$:
       $w_i \leftarrow w_i \cdot \exp\left[\alpha_m \cdot I(y_i \ne G_m(x_i))\right]$
       and renormalize the $w_i$ to sum to 1.
3. Output $G(x) = \text{sign}\left[\sum_{m=1}^{M} \alpha_m G_m(x)\right]$.

An example of AdaBoost (taken from R. Schapire)

[Figure (M. Riedmiller's slides): the first weak classifier on a toy data set; $\epsilon_1 = 0.3$, $\alpha_1 = 0.42$ in the figure's convention $\alpha = \frac{1}{2}\log\frac{1-\epsilon}{\epsilon}$; misclassified points get weight $d_i = 0.166$, correct points $d_i = 0.07$.]

$\text{err}_1 = \sum_i w_i I(G_1(x_i) \ne y_i) = 0.3$
$\alpha_1 = \log\frac{1 - \text{err}_1}{\text{err}_1} = 0.847$
$w_i$ for misclassified points: $0.1 \times \exp(0.847) = 0.233$.
After normalizing, $w_i$ for wrong points is $\frac{0.233}{0.233 \times 3 + 0.1 \times 7} = 0.1667$.
For correct points, $w_i$ is $\frac{0.1}{0.233 \times 3 + 0.1 \times 7} = 0.0714$.

An example of AdaBoost (cont.)

[Figure: the second weak classifier; $\epsilon_2 = 0.21$, $\alpha_2 = 0.65$ in the figure's convention.]

$\text{err}_2 = \sum_i w_i I(G_2(x_i) \ne y_i) = 3 \times 0.07 = 0.21$
$\alpha_2 = \log\frac{1 - \text{err}_2}{\text{err}_2} \approx 1.3$

An example of AdaBoost (cont.)

[Figure: the third weak classifier; $\epsilon_3 = 0.14$, $\alpha_3 = 0.92$ in the figure's convention.]

$\text{err}_3 = 0.14$
$\alpha_3 = \log\frac{1 - \text{err}_3}{\text{err}_3} \approx 1.84$

An example of AdaBoost - final classifier

[Figure: the final classifier, the sign of the weighted vote of the three weak classifiers.]

A simulation - "nested spheres" example

Ten features $X_1, \ldots, X_{10}$ are independent N(0, 1) random variables. The deterministic target Y is defined by
$Y = \begin{cases} 1 & \text{if } \sum_{j=1}^{10} X_j^2 > \chi^2_{10}(0.5) = 9.34 \\ -1 & \text{otherwise} \end{cases}$

- 2000 training observations (about 1000 cases in each class), and 10,000 test observations.
- Bayes error is 0% (noise-free).
- A stump is a two-node tree, obtained after a single split.
- We use the stump as the "weak learner" in the AdaBoost algorithm.

Nested-spheres example (figure from HTF 2009)

[Figure (Boosting Stumps): test error versus boosting iterations for boosted stumps, compared with a single stump and a 400-node tree.]

Boosting stumps works remarkably well on the nested-spheres problem.
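The AdaBoost steps and the nested-spheres simulation above can be combined into a short script. The following is a minimal sketch (not the lecture's own code), assuming NumPy and scikit-learn are available and using a depth-1 DecisionTreeClassifier as the stump; the helper names make_nested_spheres, adaboost and predict are invented for this illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def make_nested_spheres(n, p=10, seed=0):
    """Nested-spheres data: X_j ~ N(0,1), Y = 1 if sum_j X_j^2 > 9.34, else -1."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    y = np.where((X ** 2).sum(axis=1) > 9.34, 1, -1)
    return X, y

def adaboost(X, y, M=400):
    """AdaBoost as on the algorithm slide (assumes 0 < err_m < 1 for every stump)."""
    N = len(y)
    w = np.full(N, 1.0 / N)                  # 1. initialize w_i = 1/N
    stumps, alphas = [], []
    for m in range(M):                       # 2. for m = 1..M
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)     # (a) fit G_m using weights w_i
        miss = stump.predict(X) != y
        err = np.sum(w * miss) / np.sum(w)   # (b) weighted error err_m
        alpha = np.log((1 - err) / err)      # (c) alpha_m
        w = w * np.exp(alpha * miss)         # (d) up-weight misclassified points
        w = w / w.sum()                      #     and renormalize to sum to 1
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)

def predict(stumps, alphas, X):
    """3. G(x) = sign(sum_m alpha_m G_m(x))."""
    votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(votes)

X_tr, y_tr = make_nested_spheres(2000)
X_te, y_te = make_nested_spheres(10000, seed=1)
stumps, alphas = adaboost(X_tr, y_tr, M=400)
print("test error:", np.mean(predict(stumps, alphas, X_te) != y_te))
```

Run end to end, the boosted stumps' test error should come out far below that of a single stump, in line with the figure above.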
Training and testing errors (figure from HTF 2009)

Nested spheres in $R^{10}$; Bayes error is 0%.

[Figure (Boosting & Training Error): training and test error of boosted stumps versus the number of terms.]

Boosting drives the training error to zero. Further iterations continue to improve the test error in many examples.

Boosting for noisy problems (figure from HTF 2009)

Nested spheres example with added noise (nested Gaussians in $R^{10}$); Bayes error is 25%.

[Figure (Boosting Noisy Problems): training and test error of boosted stumps versus the number of terms, with the Bayes error rate of 25% marked.]

Here the test error does increase, but quite slowly.

Weak classifiers in AdaBoost

In the illustrative example, after fitting the first classifier the normalized weights $w_i$ for wrong and correct points are 0.1667 and 0.0714. Then what is the sum of weights for the wrong and the correct points separately?

Weak classifiers in AdaBoost (cont.)

Sum of the updated weights for wrongly classified points (before renormalization):

$\sum_{i: y_i \ne G_m(x_i)} w_i \exp(\alpha_m)$
$\quad = \sum_{i: y_i \ne G_m(x_i)} w_i \cdot \frac{1 - \text{err}_m}{\text{err}_m}$
$\quad = \sum_{i: y_i \ne G_m(x_i)} w_i \cdot \frac{\sum_i w_i I(y_i = G_m(x_i))}{\sum_i w_i I(y_i \ne G_m(x_i))}$
$\quad = \sum_i w_i I(y_i = G_m(x_i)) = \sum_{i: y_i = G_m(x_i)} w_i$

So after renormalization the wrongly and correctly classified points each carry half of the total weight: the error rate of $G_m$ is 1/2 under the new weighting scheme, and the next weak classifier is "independent" of the previous one (this is checked numerically in the sketch following the exponential-loss figure below).

Boosting as an additive model

- Boosting builds an additive model: $f(x) = \sum_{m=1}^{M} \beta_m b(x; \gamma_m)$. For example, in boosting trees $b(x; \gamma_m)$ is a tree, and $\gamma_m$ parametrizes the splits.
- We use additive models in statistics all the time!
  - Multiple linear regression: $f(x) = \sum_j \beta_j x_j$
  - Generalized additive model: $f(x) = \sum_j f_j(x_j)$
  - Basis expansions in spline models: $f(x) = \sum_{m=1}^{M} \theta_m h_m(x)$
- Traditionally, the parameters $f_m$, $\theta_m$ are fit jointly (e.g. by least squares or maximum likelihood).
- Friedman et al. (2000) have shown that AdaBoost is equivalent to stagewise additive logistic regression using the exponential loss criterion $L(y, f(x)) = \exp(-y f(x))$.
- $y f(x)$ is called the "margin" ($> 0$: classified correctly, $< 0$: misclassified).

Stagewise additive modeling

Based on Friedman, Hastie and Tibshirani (2000)'s result, AdaBoost is equivalent to the following modeling strategy.

Forward Stagewise Additive Modeling
1. Initialize $f_0(x) = 0$.
2. For $m = 1$ to $M$:
   (a) Compute
       $(\beta_m, \gamma_m) = \arg\min_{\beta, \gamma} \sum_{i=1}^{N} L\left(y_i, f_{m-1}(x_i) + \beta b(x_i; \gamma)\right)$
   (b) Set $f_m(x) = f_{m-1}(x) + \beta_m b(x; \gamma_m)$.

Compared with the stepwise approach, stagewise additive modeling slows the process down and tends to overfit less quickly.

Exponential loss vs. misclassification loss (figure from HTF, 2009)

[Figure 10.3 from HTF 2009: simulated data, boosting with stumps; misclassification error rate on the training set and average exponential loss $(1/N) \sum_{i=1}^{N} \exp(-y_i f(x_i))$, versus boosting iterations. After about 250 iterations the misclassification error is zero, while the exponential loss continues to decrease.]

HTF (Section 10.5) note that AdaBoost's equivalence to forward stagewise additive modeling based on exponential loss was only discovered five years after its inception, and that the principal attraction of exponential loss in the context of additive modeling is computational: it leads to the simple modular reweighting AdaBoost algorithm.
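As a rough numerical companion to Figure 10.3, and to the weight-sum question on the "Weak classifiers in AdaBoost" slides, one can track the training misclassification rate and the average exponential loss while boosting. This is a sketch under the same assumptions as the earlier one (NumPy, scikit-learn stumps; boost_and_track is an invented name); following the Friedman et al. (2000) equivalence, the stagewise additive fit is accumulated with coefficients $\beta_m = \alpha_m / 2$.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost_and_track(X, y, M=400):
    """Per-iteration training misclassification rate and average exponential loss,
    plus a check that after each reweighting the misclassified points carry
    half of the total weight (assumes 0 < err_m < 1 for every stump)."""
    N = len(y)
    w = np.full(N, 1.0 / N)     # AdaBoost observation weights
    f = np.zeros(N)             # stagewise additive fit f_m(x_i)
    miscls_rate, exp_loss = [], []
    for m in range(M):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        miss = pred != y
        err = np.sum(w * miss) / np.sum(w)
        alpha = np.log((1 - err) / err)
        w = w * np.exp(alpha * miss)
        w = w / w.sum()
        # After renormalization the points misclassified by G_m hold weight 1/2,
        # so G_m has error rate 1/2 under the new weights.
        assert np.isclose(w[miss].sum(), 0.5)
        f = f + 0.5 * alpha * pred                    # beta_m = alpha_m / 2
        miscls_rate.append(np.mean(np.sign(f) != y))  # training misclassification rate
        exp_loss.append(np.mean(np.exp(-y * f)))      # (1/N) sum_i exp(-y_i f(x_i))
    return miscls_rate, exp_loss

# Nested-spheres training data, as in the simulation slide.
rng = np.random.default_rng(0)
X_tr = rng.standard_normal((2000, 10))
y_tr = np.where((X_tr ** 2).sum(axis=1) > 9.34, 1, -1)
rates, losses = boost_and_track(X_tr, y_tr, M=400)
```

The misclassification rate should hit zero after a few hundred iterations while the exponential loss keeps decreasing, mirroring the behaviour described in the figure caption.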
Why exponential loss?

- The AdaBoost algorithm was originally motivated from a very different perspective.
- What are the statistical properties of the exponential loss?
- What does it estimate? The population minimizer of $e^{-Y f(x)}$ is
  $f^*(x) = \arg\min_{f(x)} E_{Y \mid x}\left[e^{-Y f(x)}\right] = \frac{1}{2} \log \frac{\Pr(Y = 1 \mid x)}{\Pr(Y = -1 \mid x)}$
  (a short derivation is sketched at the end of this section).
- AdaBoost estimates one-half the log-odds of $\Pr(Y = 1 \mid x)$ and uses its sign as the classification rule.
- The binomial negative log-likelihood, or deviance, has the same population minimizer:
  $-l(Y, f(x)) = \log\left(1 + e^{-2 Y f(x)}\right)$
- Note that $e^{-Y f}$ is not a proper log-likelihood.

Loss functions for classification (figure from HTF, 2009)

[Figure 10.4 from HTF 2009: misclassification, exponential, binomial deviance, squared error, and support vector losses, plotted as functions of the margin $y \cdot f$.]
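A short derivation of the population minimizer quoted on the "Why exponential loss?" slide (this step is not spelled out on the slides, but it uses only the definitions above). Writing $p = \Pr(Y = 1 \mid x)$ and minimizing pointwise in $f(x)$:

$E_{Y \mid x}\left[e^{-Y f(x)}\right] = p\, e^{-f(x)} + (1 - p)\, e^{f(x)}$

Setting the derivative with respect to $f(x)$ to zero gives $-p\, e^{-f(x)} + (1 - p)\, e^{f(x)} = 0$, i.e. $e^{2 f(x)} = p/(1 - p)$, so

$f^*(x) = \frac{1}{2} \log \frac{p}{1 - p} = \frac{1}{2} \log \frac{\Pr(Y = 1 \mid x)}{\Pr(Y = -1 \mid x)}$

The same pointwise calculation applied to the binomial deviance $\log\left(1 + e^{-2 Y f(x)}\right)$ yields the same minimizer, even though only the deviance is a proper (negative) log-likelihood.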