Obtaining Calibrated Probabilities from Boosting∗

Alexandru Niculescu-Mizil and Rich Caruana
Department of Computer Science, Cornell University, Ithaca, NY 14853
[email protected], [email protected]

Abstract

Boosted decision trees typically yield good accuracy, precision, and ROC area. However, because the outputs from boosting are not well calibrated posterior probabilities, boosting yields poor squared error and cross-entropy. We empirically demonstrate why AdaBoost predicts distorted probabilities and examine three calibration methods for correcting this distortion: Platt Scaling, Isotonic Regression, and Logistic Correction. We also experiment with boosting using log-loss instead of the usual exponential loss. Experiments show that Logistic Correction and boosting with log-loss work well when boosting weak models such as decision stumps, but yield poor performance when boosting more complex models such as full decision trees. Platt Scaling and Isotonic Regression, however, significantly improve the probabilities predicted by both boosted stumps and boosted trees. After calibration, boosted full decision trees predict better probabilities than other learning methods such as SVMs, neural nets, bagged decision trees, and KNNs, even after these methods are calibrated.

∗ A slightly extended version of this paper appeared in UAI '05. Copyright © 2007, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Introduction

In a recent evaluation of learning algorithms (Caruana & Niculescu-Mizil 2006), boosted decision trees had excellent performance on metrics such as accuracy, lift, area under the ROC curve, average precision, and precision/recall break even point. However, boosted decision trees had poor squared error and cross-entropy because AdaBoost does not produce good probability estimates.

Friedman, Hastie, and Tibshirani (2000) provide an explanation for why boosting makes poorly calibrated predictions. They show that boosting can be viewed as an additive logistic regression model. A consequence of this is that the predictions made by boosting are trying to fit a logit of the true probabilities, as opposed to the true probabilities themselves. To get back the probabilities, the logit transformation must be inverted.

In their treatment of boosting as a large margin classifier, Schapire et al. (1998) observed that in order to obtain a large margin on cases close to the decision surface, AdaBoost will sacrifice the margin of the easier cases. This results in a shifting of the predicted values away from 0 and 1, hurting calibration. This shifting is also consistent with Breiman's interpretation of boosting as an equalizer (see Breiman's discussion in (Friedman, Hastie, & Tibshirani 2000)). In the next section we demonstrate this probability shifting on real data.

To correct for boosting's poor calibration, we experiment with boosting with log-loss, and with three methods for calibrating the predictions made by boosted models to convert them to well-calibrated posterior probabilities. The three post-training calibration methods are:

Logistic Correction: a method based on Friedman et al.'s analysis of boosting as an additive model.
Platt Scaling: the method used by Platt (1999) to transform SVM outputs from [−∞, +∞] to posterior probabilities.
Isotonic Regression: the method used by Zadrozny and Elkan (2002; 2001) to calibrate predictions from boosted naive Bayes, SVM, and decision tree models.

Logistic Correction and Platt Scaling convert predictions to probabilities by transforming them with a sigmoid. With Logistic Correction, the sigmoid parameters are derived from Friedman et al.'s analysis. With Platt Scaling, the parameters are fitted to the data using gradient descent. Isotonic Regression is a general-purpose non-parametric calibration method that assumes probabilities are a monotonic transformation (not just a sigmoid) of the predictions.

An alternative to training boosted models with AdaBoost and then correcting their outputs via post-training calibration is to use a variant of boosting that directly optimizes cross-entropy (log-loss). Collins, Schapire, and Singer (2002) show that a boosting algorithm that optimizes log-loss can be obtained by a simple modification to the AdaBoost algorithm. Collins et al. briefly evaluate this new algorithm on a synthetic data set, but acknowledge that a more thorough evaluation on real data sets is necessary.

Lebanon and Lafferty (2001) show that Logistic Correction applied to boosting with exponential loss should behave similarly to boosting with log-loss, and then demonstrate this by examining the performance of boosted stumps on a variety of data sets. Our results confirm their findings for boosted stumps, and show the same effect for boosted trees.

Our experiments show that boosting full decision trees usually yields better models than boosting weaker stumps. Unfortunately, our results also show that boosting to directly optimize log-loss, or applying Logistic Correction to models boosted with exponential loss, is only effective when boosting weak models such as stumps. Neither of these methods is effective when boosting full decision trees. Significantly better performance is obtained by boosting full decision trees with exponential loss, and then calibrating their predictions using either Platt Scaling or Isotonic Regression. Calibration with Platt Scaling or Isotonic Regression is so effective that after calibration boosted decision trees predict better probabilities than any other learning method we have compared them to, including neural nets, bagged trees, random forests, and calibrated SVMs.

Boosting and Calibration

In this section we empirically examine the relationship between boosting's predictions and posterior probabilities. One way to visualize this relationship when the true posterior probabilities are not known is through reliability diagrams (DeGroot & Fienberg 1982). To construct a reliability diagram, the prediction space is discretized into ten bins. Cases with predicted value between 0 and 0.1 are put in the first bin, those between 0.1 and 0.2 in the second bin, etc. For each bin, the mean predicted value is plotted against the true fraction of positive cases in the bin. If the model is well calibrated, the points will fall near the diagonal line.

The bottom row of Figure 1 shows reliability plots on a large test set after 1, 4, 8, 32, 128, and 1024 stages of boosting Bayesian smoothed decision trees (Buntine 1992). The top of the figure shows histograms of the predicted values for the same models. The histograms show that as the number of steps of boosting increases, the predicted values are pushed away from 0 and 1 and tend to collect on either side of the decision surface. This shift away from 0 and 1 hurts calibration and yields sigmoid-shaped reliability plots.

Figure 2 shows histograms and reliability diagrams for boosted decision trees after 1024 steps of boosting on eight test problems. (See the Empirical Results section for more detail about these problems.) The figures present results measured on large independent test sets not used for training. For seven of the eight data sets the predicted values after boosting do not approach 0 or 1. The one exception is LETTER.P1, a highly skewed data set that has only 3% positive class. On this problem some of the predicted values do approach 0, though careful examination of the histogram shows that there is a sharp drop in the number of cases predicted to have probability near 0.

All the reliability plots in Figure 2 display sigmoid-shaped reliability diagrams, motivating the use of a sigmoid to map the predictions to calibrated probabilities. The functions fitted with Platt's method and Isotonic Regression are shown in the middle and bottom rows of the figure.

Calibration

In this section we describe three methods for calibrating predictions from AdaBoost: Logistic Correction, Platt Scaling, and Isotonic Regression.

Logistic Correction

Before describing Logistic Correction, it is useful to briefly review AdaBoost. Start with each example in the train set (x_i, y_i) having equal weight. At each step i a weak learner h_i is trained on the weighted train set. The error of h_i determines the model weight α_i and the future weight of each training example. There are two equivalent formulations. The first formulation, also used by Friedman, Hastie, and Tibshirani (2000), assumes y_i ∈ {−1, 1} and h_i ∈ {−1, 1}. The output of the boosted model is:

    F(x) = \sum_{i=1}^{T} \alpha_i h_i(x)                                  (1)

Friedman et al. show that AdaBoost builds an additive model for minimizing E(exp(−yF(x))). They show that E(exp(−yF(x))) is minimized by:

    F(x) = \frac{1}{2} \log \frac{P(y=1|x)}{P(y=-1|x)}                     (2)

This suggests applying a logistic correction in order to get back the conditional probability:

    P(y=1|x) = \frac{1}{1 + \exp(-2F(x))}                                  (3)

As we will see in the Empirical Results section, this logistic correction works well when boosting simple base learners such as decision stumps. However, if the base learners are powerful enough that the training data becomes fully separable, after correction the predictions will become only 0's and 1's (Rosset, Zhu, & Hastie 2004) and thus have poor calibration.
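For concreteness, the following is a minimal sketch of discrete AdaBoost on {−1, 1} labels with decision stumps, followed by the logistic correction of equation 3. The function names, the exhaustive stump search, and the number of rounds are illustrative assumptions, not the IND trees or experimental code used in this paper.

```python
import numpy as np

def fit_stump(X, y, w):
    """Exhaustive search for the decision stump (feature, threshold, sign)
    with the lowest weighted classification error; predictions are in {-1, +1}."""
    best_err, best = np.inf, None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] > thr, sign, -sign)
                err = np.sum(w[pred != y])
                if err < best_err:
                    best_err, best = err, (j, thr, sign)
    return best_err, best

def stump_predict(stump, X):
    j, thr, sign = stump
    return np.where(X[:, j] > thr, sign, -sign)

def adaboost_stumps(X, y, T=128):
    """Discrete AdaBoost with y in {-1, +1}; returns (alpha_i, h_i) pairs."""
    n = len(y)
    w = np.full(n, 1.0 / n)                       # start with equal example weights
    ensemble = []
    for _ in range(T):
        err, stump = fit_stump(X, y, w)
        err = np.clip(err, 1e-12, 1 - 1e-12)      # guard against separable data
        alpha = 0.5 * np.log((1.0 - err) / err)   # model weight alpha_i
        pred = stump_predict(stump, X)
        w *= np.exp(-alpha * y * pred)            # up-weight misclassified examples
        w /= w.sum()
        ensemble.append((alpha, stump))
    return ensemble

def boosted_score(ensemble, X):
    """F(x) of equation 1: weighted sum of the weak learners' {-1, +1} votes."""
    return sum(alpha * stump_predict(stump, X) for alpha, stump in ensemble)

def logistic_correction(F):
    """Equation 3: P(y = 1 | x) = 1 / (1 + exp(-2 F(x)))."""
    return 1.0 / (1.0 + np.exp(-2.0 * F))
```

Note that on separable training data the alphas grow without bound, so the corrected probabilities collapse toward 0 and 1, which is exactly the failure mode described above.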
Platt Calibration

An equivalent formulation of AdaBoost assumes y_i ∈ {0, 1} and h_i ∈ {0, 1}. The output of the boosted model is

    f(x) = \frac{\sum_{i=1}^{T} \alpha_i h_i(x)}{\sum_{i=1}^{T} \alpha_i}   (4)

We use this model for Platt Scaling and Isotonic Regression, and treat f(x) as the raw (uncalibrated) prediction.

Platt (1999) uses a sigmoid to map SVM outputs on [−∞, +∞] to posterior probabilities on [0, 1]. The sigmoidal shape of the reliability diagrams in Figure 2 suggests that the same calibration method should be effective at calibrating boosting predictions. In this section we closely follow the description of Platt's method in (Platt 1999).

Let the output of boosting be f(x) given by equation 4. To get calibrated probabilities, pass the output of boosting through a sigmoid:

    P(y=1|f) = \frac{1}{1 + \exp(Af + B)}                                   (5)

where the parameters A and B are fitted using maximum likelihood estimation from a calibration set (f_i, y_i). Gradient descent is used to find A and B that are the solution to:

    \operatorname*{argmin}_{A,B} \Big\{ -\sum_i \big[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \big] \Big\},   (6)

where

    p_i = \frac{1}{1 + \exp(Af_i + B)}                                      (7)

Two questions arise: where does the sigmoid training set (f_i, y_i) come from, and how can we avoid overfitting to it?
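Given a calibration set, a minimal sketch of the gradient descent fit of A and B described above is shown below. The function names, fixed learning rate, and iteration count are arbitrary assumptions made for illustration; this is not Platt's original optimizer.

```python
import numpy as np

def fit_platt(f, y, lr=0.1, iters=10000):
    """Fit the sigmoid parameters A and B of equation 5 by gradient descent
    on the negative log-likelihood of equation 6.

    f : raw boosting outputs f(x_i) on the calibration set (equation 4)
    y : calibration targets in [0, 1]; 0/1 labels, or the smoothed y+/y-
        target values of equation 8 below, which is what the paper uses
    """
    f = np.asarray(f, dtype=float)
    y = np.asarray(y, dtype=float)
    A, B = 0.0, 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(A * f + B))   # equation 7
        # For p = 1/(1+exp(z)) with z = A*f + B, d(NLL)/dz = (y - p).
        A -= lr * np.mean((y - p) * f)
        B -= lr * np.mean(y - p)
    return A, B

def platt_predict(f, A, B):
    """Equation 5: calibrated probability for a raw prediction f."""
    return 1.0 / (1.0 + np.exp(A * np.asarray(f, dtype=float) + B))
```

The fitted A and B are then applied unchanged to the raw predictions on the test set via platt_predict.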

Figure 1: Effect of boosting on the predicted values. Histograms of the predicted values (top) and reliability diagrams (bottom, fraction of positives vs. mean predicted value) on the test set for boosted trees at different steps of boosting (1, 4, 8, 32, 128, and 1024 steps) on the COV_TYPE problem.
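The reliability diagrams in Figures 1 and 2 (and in Figures 3 and 4 below) follow the ten-bin construction described in the previous section. A minimal sketch, with a hypothetical function name:

```python
import numpy as np

def reliability_diagram(pred, y, n_bins=10):
    """Mean predicted value vs. true fraction of positives in each of n_bins
    equal-width bins over [0, 1]; empty bins are skipped."""
    pred = np.asarray(pred, dtype=float)
    y = np.asarray(y, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    points = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        if hi < 1.0:
            in_bin = (pred >= lo) & (pred < hi)
        else:
            in_bin = (pred >= lo) & (pred <= hi)   # include 1.0 in the last bin
        if in_bin.any():
            points.append((pred[in_bin].mean(), y[in_bin].mean()))
    return points  # a well-calibrated model yields points near the diagonal
```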

Figure 2: Histograms of predicted values and reliability diagrams for boosted decision trees (panels: COV_TYPE, ADULT, LETTER.P1, LETTER.P2, MEDIS, SLAC, HS, MG).

One answer to the first question is to use the same train set as boosting: for each example (x_i, y_i) in the boosting train set, use (f(x_i), y_i) as a training example for the sigmoid. Unfortunately, this can introduce unwanted bias in the sigmoid training set and can lead to poor results (Platt 1999).

An alternate solution is to split the data into a training and a validation set. After boosting is trained on the training set, the predictions on the validation set are used to fit the sigmoid. Cross-validation can be used to allow both boosting and the sigmoid to be trained on the full data set. The data is split into C parts. For each fold, one part is held aside for use as an independent calibration validation set while boosting is performed using the other C−1 parts. The union of the C validation sets is then used to fit the sigmoid parameters. Following Platt, experiments in this paper use 3-fold cross-validation to estimate the sigmoid parameters.

As for the second question, an out-of-sample model is used to avoid overfitting to the calibration set. If there are N+ positive examples and N− negative examples in the calibration set, for each example Platt Calibration uses target values y+ and y− (instead of 1 and 0, respectively), where

    y_+ = \frac{N_+ + 1}{N_+ + 2}; \qquad y_- = \frac{1}{N_- + 2}           (8)

For a more detailed treatment, and a justification of these target values, see (Platt 1999). The middle row of Figure 2 shows the sigmoids fitted with Platt Scaling.

Isotonic Regression

An alternative to Platt Scaling is to use Isotonic Regression (Robertson, Wright, & Dykstra 1988). Zadrozny and Elkan (2002; 2001) successfully used Isotonic Regression to calibrate predictions from SVMs, Naive Bayes, boosted Naive Bayes, and decision trees. Isotonic Regression assumes only that:

    y_i = m(f_i) + \epsilon_i                                               (9)

where m is an isotonic (monotonically increasing) function. Given a training set (f_i, y_i), the Isotonic Regression problem is finding the isotonic function \hat{m} such that:

    \hat{m} = \operatorname*{argmin}_{z} \sum_i (y_i - z(f_i))^2            (10)

A piecewise constant solution \hat{m} can be found in linear time by using the pair-adjacent violators (PAV) algorithm (Ayer et al. 1955) presented in Table 1.

As in the case of Platt calibration, if we use the boosting train set (x_i, y_i) to get the train set (f(x_i), y_i) for Isotonic Regression, we introduce unwanted bias; as with Platt Scaling, we instead use 3-fold cross-validation so that the isotonic function is fit on predictions for an independent test set.
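The following is a minimal sketch of the pair-adjacent violators algorithm of Table 1 and of the resulting stepwise-constant predictor. The function names are hypothetical; this is not the Zadrozny–Elkan code used in the experiments.

```python
import numpy as np

def pav_fit(f, y):
    """Pair-adjacent violators: returns (right_edges, values) describing the
    stepwise-constant isotonic fit m-hat, pooling adjacent blocks whenever the
    left block's value is >= the right block's (step 3 of Table 1)."""
    f = np.asarray(f, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(f)
    f, y = f[order], y[order]
    vals, wts, rights = [], [], []          # per-block mean, weight, right edge in f
    for fi, yi in zip(f, y):
        vals.append(yi); wts.append(1.0); rights.append(fi)
        while len(vals) > 1 and vals[-2] >= vals[-1]:
            w = wts[-2] + wts[-1]
            v = (wts[-2] * vals[-2] + wts[-1] * vals[-1]) / w
            vals[-2:] = [v]
            wts[-2:] = [w]
            rights[-2:] = [rights[-1]]      # merged block keeps the larger right edge
    return np.asarray(rights), np.asarray(vals)

def pav_predict(f_new, rights, vals):
    """Step-function lookup (step 4 of Table 1): the value of the first block
    whose right edge is >= f; inputs beyond the last edge get the last value."""
    idx = np.searchsorted(rights, np.asarray(f_new, dtype=float), side="left")
    return vals[np.clip(idx, 0, len(vals) - 1)]
```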

Table 2: Cross-entropy performance of boosted decision stumps and boosted decision trees before and after calibration. Qualitatively similar results are obtained for squared error.

                 BOOSTED STUMPS                                      BOOSTED TREES
PROBLEM      RAW     PLATT   ISO     LOGIST  LOGLOSS     RAW     PLATT   ISO     LOGIST  LOGLOSS
COV_TYPE     0.7571  0.6705  0.6714  0.6785  0.6932      0.6241  0.5113  0.5179  0.7787  0.7062
ADULT        0.5891  0.4570  0.4632  0.4585  0.4905      0.5439  0.4957  0.5072  0.5330  0.5094
LETTER.P1    0.2825  0.0797  0.0837  0.0820  0.0869      0.0975  0.0375  0.0378  0.1172  0.0799
LETTER.P2    0.8018  0.5437  0.5485  0.5452  0.5789      0.3901  0.1451  0.1412  0.5255  0.3461
MEDIS        0.4573  0.3829  0.3754  0.3836  0.4411      0.4757  0.3877  0.3885  0.5596  0.5462
SLAC         0.8698  0.8304  0.8351  0.8114  0.8435      0.8270  0.7888  0.7786  0.8626  0.8450
HS           0.6110  0.3351  0.3365  0.3824  0.3407      0.4084  0.2429  0.2501  0.5988  0.4780
MG           0.4868  0.4107  0.4125  0.4096  0.4568      0.5021  0.4286  0.4297  0.5026  0.4830
MEAN         0.6069  0.4637  0.4658  0.4689  0.4914      0.4836  0.3797  0.3814  0.5597  0.4992
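For reference, the two metrics used here can be computed from predicted probabilities and 0/1 labels as follows. This is a generic sketch: the clipping constant and the averaging over cases are assumptions, since the exact normalization behind the numbers in Table 2 is not restated here.

```python
import numpy as np

def cross_entropy(p, y, eps=1e-12):
    """Mean negative log-likelihood of the 0/1 labels under predicted probabilities."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)   # avoid log(0)
    y = np.asarray(y, dtype=float)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def squared_error(p, y):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    p = np.asarray(p, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.mean((p - y) ** 2)
```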

Table 1: PAV Algorithm

Algorithm 1. PAV algorithm for estimating posterior probabilities from uncalibrated model predictions.
1. Input: training set (f_i, y_i) sorted according to f_i
2. Initialize \hat{m}_{i,i} = y_i, w_{i,i} = 1
3. While ∃ i s.t. \hat{m}_{k,i-1} ≥ \hat{m}_{i,l}
       Set w_{k,l} = w_{k,i-1} + w_{i,l}
       Set \hat{m}_{k,l} = (w_{k,i-1}\hat{m}_{k,i-1} + w_{i,l}\hat{m}_{i,l}) / w_{k,l}
       Replace \hat{m}_{k,i-1} and \hat{m}_{i,l} with \hat{m}_{k,l}
4. Output the stepwise constant function: \hat{m}(f) = \hat{m}_{i,j}, for f_i < f ≤ f_{i+1}

Empirical Results

Because decision stumps are the weak learners analyzed by Friedman, Hastie, and Tibshirani (2000), and because many iterations of boosting can make calibration worse (see Figure 1), we consider boosted stumps after 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096 and 8192 steps of boosting. The left side of Table 2 shows the cross-entropy on large final test sets for the best boosted stump models before and after calibration. The results show that the performance of boosted stumps is dramatically improved by calibration or by optimizing to log-loss. On average, calibration reduces cross-entropy by about 23% and squared error by about 14%. The three post-training calibration methods (PLATT, ISO, and LOGIST) yield comparable cross-entropy, with direct optimization of log-loss (LOGLOSS) trailing them slightly.
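As described in the Platt Calibration section, the PLATT and ISO models are fit on out-of-sample predictions obtained with 3-fold cross-validation. A minimal sketch of assembling such a calibration set is given below; the train_boosting and predict_raw interfaces are hypothetical placeholders, not the paper's experimental code.

```python
import numpy as np

def cv_calibration_set(X, y, train_boosting, predict_raw, n_folds=3, seed=0):
    """Union of out-of-fold predictions: each f(x_i) comes from a boosted model
    that did not see x_i during training."""
    X = np.asarray(X)
    y = np.asarray(y)
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, n_folds)
    f_cal, y_cal = [], []
    for k in range(n_folds):
        held_out = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        model = train_boosting(X[train], y[train])      # boost on the other C-1 parts
        f_cal.append(predict_raw(model, X[held_out]))   # raw predictions f(x), equation 4
        y_cal.append(y[held_out])
    return np.concatenate(f_cal), np.concatenate(y_cal)
```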

Figure 3: Histograms of predicted values and reliability diagrams for boosted trees calibrated with Platt's method (panels: COV_TYPE, ADULT, LETTER.P1, LETTER.P2, MEDIS, SLAC, HS, MG).

Figure 4: Histograms of predicted values and reliability diagrams for boosted trees calibrated with Isotonic Regression (same eight panels).

The right side of Table 2 presents the performance of the best boosted trees before and after calibration. To prevent the results from depending on one specific tree style, we boost ten different styles of trees. We use all the tree types in Buntine's IND package (1991) (ID3, CART, CART0, C4.5, MML, SMML, BAYES) as well as three new tree types that should predict better probabilities: unpruned C4.5 with Laplacian smoothing (Provost & Domingos 2003); unpruned C4.5 with Bayesian smoothing; and MML trees with Laplacian smoothing. We consider the boosted models after 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024 and 2048 steps of boosting.

As expected, Logistic Correction is not competitive when boosting full decision trees. The other two calibration methods, Platt Scaling and Isotonic Regression, provide a significant improvement in the quality of the predicted probabilities. Both methods reduce cross-entropy by about 21% and squared error by about 13%.

Surprisingly, boosted trees perform poorly when boosting directly optimizes log-loss. Because the base-level models are so expressive, boosting with log-loss quickly separates the two classes on the train set and pushes predictions toward 0 and 1, hurting calibration. This happens despite the fact that we regularize boosting by varying the number of iterations of boosting (Rosset, Zhu, & Hastie 2004).

Comparing the results in the left and right sides of Table 2, we see that boosted trees significantly outperform boosted stumps on five of the eight problems. On average over the eight problems, boosted trees yield about 13% lower squared error and 18% lower cross-entropy than boosted stumps. Boosted stumps, however, do win by a small margin on three problems, and have the nice property that the theoretically suggested Logistic Correction works well.2

To illustrate how calibration transforms the predictions, we show histograms and reliability diagrams for the eight problems for boosted trees after 1024 steps of boosting, after Platt Scaling (Figure 3) and after Isotonic Regression (Figure 4). The figures show that calibration undoes the shift in probability mass caused by boosting: after calibration many more cases have predicted probabilities near 0 and 1. The reliability plots are closer to the diagonal, and the S shape characteristic of boosting's predictions is gone. On each problem, transforming the predictions using either Platt Scaling or Isotonic Regression yields a significant improvement in the quality of the predicted probabilities, leading to much lower squared error and cross-entropy. The main difference between Isotonic Regression and Platt Scaling for boosting can be seen by comparing the histograms in the two figures. Because Isotonic Regression generates a piecewise constant function, its histograms are quite coarse, while the histograms generated by Platt Scaling are smooth and easier to interpret.

To determine how the performance of Platt Scaling and Isotonic Regression changes with the amount of data available for calibration, we vary the size of the calibration set from 32 to 8192 by factors of two.

2 We also tried boosting 2-level decision trees. Boosted 2-level trees outperformed boosted 1-level stumps, but did not perform as well as boosting full trees. Moreover, 2-level trees are complex enough base-level models that Logistic Correction is no longer as effective as Platt Scaling or Isotonic Regression.

Figure 5: Learning curves for Platt Scaling and Isotonic Regression for boosted trees (average across 8 problems). Squared error (roughly 0.27–0.32) vs. calibration set size (10 to 10000, log scale) for Unscaled, Platt, and Isotonic.

Figure 5 shows the average squared error over the eight test problems for the best uncalibrated and the best calibrated boosted trees. We perform ten trials for each problem. The nearly horizontal line in the graph shows the squared error prior to calibration. This line is not perfectly horizontal only because the test sets change as more data is moved into the calibration sets. The plot shows the squared error after calibration with Platt's method or Isotonic Regression as the size of the calibration set varies from small to large. The learning curves show that the performance of boosted trees is improved by calibration even for very small calibration set sizes. When the calibration set is small (less than about 2000 cases), Platt Scaling outperforms Isotonic Regression. This happens because Isotonic Regression is less constrained than Platt Scaling, so it is easier for it to overfit when the calibration set is small. Platt's method also has some overfitting control built in.

Conclusions

In this paper we empirically demonstrated why AdaBoost yields poorly calibrated predictions. To correct this problem, we experimented with using a variant of boosting that directly optimizes log-loss, as well as with three post-training calibration methods: Logistic Correction, justified by Friedman et al.'s analysis; Platt Scaling, justified by the sigmoidal shape of the reliability plots; and Isotonic Regression, a general non-parametric calibration method.

One surprising result is that boosting with log-loss instead of the usual exponential loss works well when the base-level models are weak models such as 1-level stumps, but is not competitive when boosting more powerful models such as full decision trees. Similarly, Logistic Correction works well for boosted stumps, but gives poor results when boosting full decision trees (or even 2-level trees).

The other two calibration methods, Platt Scaling and Isotonic Regression, are very effective at mapping boosting's predictions to calibrated posterior probabilities, regardless of how complex the base-level models are. After calibration with either of these two methods, boosted decision trees have better reliability diagrams, and significantly improved squared error and cross-entropy.

When compared to other learning algorithms, boosted decision trees that have been calibrated with Platt Scaling or Isotonic Regression yield the best average squared error and cross-entropy performance (results are not included due to lack of space). This, combined with their excellent performance on other measures such as accuracy, precision, and area under the ROC curve (Caruana & Niculescu-Mizil 2006), may make calibrated boosted decision trees the model of choice for many applications.

Acknowledgments

Thanks to Bianca Zadrozny and Charles Elkan for use of their Isotonic Regression code. Thanks to Charles Young et al. at SLAC (Stanford Linear Accelerator) for the SLAC data set. Thanks to Tony Gualtieri at Goddard Space Center for help with the Indian Pines data set. This work was supported by NSF Award 0412930.

References

Ayer, M.; Brunk, H.; Ewing, G.; Reid, W.; and Silverman, E. 1955. An empirical distribution function for sampling with incomplete information. Annals of Mathematical Statistics 5(26):641–647.
Blake, C., and Merz, C. 1998. UCI repository of machine learning databases.
Buntine, W., and Caruana, R. 1991. Introduction to IND and recursive partitioning. Technical Report FIA-91-28, NASA Ames Research Center.
Buntine, W. 1992. Learning classification trees. Statistics and Computing 2:63–73.
Caruana, R., and Niculescu-Mizil, A. 2006. An empirical comparison of supervised learning algorithms. In ICML '06: Proceedings of the 23rd International Conference on Machine Learning.
Collins, M.; Schapire, R. E.; and Singer, Y. 2002. Logistic regression, AdaBoost and Bregman distances. Machine Learning 48(1-3).
DeGroot, M., and Fienberg, S. 1982. The comparison and evaluation of forecasters. Statistician 32(1):12–22.
Friedman, J.; Hastie, T.; and Tibshirani, R. 2000. Additive logistic regression: a statistical view of boosting. The Annals of Statistics 28(2).
Gualtieri, A.; Chettri, S. R.; Cromp, R.; and Johnson, L. 1999. Support vector machine classifiers as applied to AVIRIS data. In Proc. Eighth JPL Airborne Geoscience Workshop.
Lebanon, G., and Lafferty, J. 2001. Boosting and maximum likelihood for exponential models. In Advances in Neural Information Processing Systems.
Platt, J. 1999. Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In Advances in Large Margin Classifiers.
Provost, F., and Domingos, P. 2003. Tree induction for probability-based ranking. Machine Learning.
Robertson, T.; Wright, F.; and Dykstra, R. 1988. Order Restricted Statistical Inference. New York: John Wiley and Sons.
Rosset, S.; Zhu, J.; and Hastie, T. 2004. Boosting as a regularized path to a maximum margin classifier. Journal of Machine Learning Research 5.
Schapire, R.; Freund, Y.; Bartlett, P.; and Lee, W. 1998. Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics 26(5):1651–1686.
Zadrozny, B., and Elkan, C. 2001. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In ICML.
Zadrozny, B., and Elkan, C. 2002. Transforming classifier scores into accurate multiclass probability estimates. In KDD.
