Obtaining Calibrated Probabilities from Boosting∗

Alexandru Niculescu-Mizil and Rich Caruana
Department of Computer Science, Cornell University, Ithaca, NY 14853
[email protected], [email protected]

Abstract

Boosted decision trees typically yield good accuracy, precision, and ROC area. However, because the outputs from boosting are not well calibrated posterior probabilities, boosting yields poor squared error and cross-entropy. We empirically demonstrate why AdaBoost predicts distorted probabilities and examine three calibration methods for correcting this distortion: Platt Scaling, Isotonic Regression, and Logistic Correction. We also experiment with boosting using log-loss instead of the usual exponential loss. Experiments show that Logistic Correction and boosting with log-loss work well when boosting weak models such as decision stumps, but yield poor performance when boosting more complex models such as full decision trees. Platt Scaling and Isotonic Regression, however, significantly improve the probabilities predicted by both boosted stumps and boosted trees. After calibration, boosted full decision trees predict better probabilities than other learning methods such as SVMs, neural nets, bagged decision trees, and KNNs, even after these methods are calibrated.

∗ A slightly extended version of this paper appeared in UAI '05. Copyright © 2007, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Introduction

In a recent evaluation of learning algorithms (Caruana & Niculescu-Mizil 2006), boosted decision trees had excellent performance on metrics such as accuracy, lift, area under the ROC curve, average precision, and precision/recall break even point. However, boosted decision trees had poor squared error and cross-entropy because AdaBoost does not produce good probability estimates.

Friedman, Hastie, and Tibshirani (2000) provide an explanation for why boosting makes poorly calibrated predictions. They show that boosting can be viewed as an additive logistic regression model. A consequence of this is that the predictions made by boosting are trying to fit a logit of the true probabilities, as opposed to the true probabilities themselves. To get back the probabilities, the logit transformation must be inverted.

In their treatment of boosting as a large margin classifier, Schapire et al. (1998) observed that in order to obtain a large margin on cases close to the decision surface, AdaBoost will sacrifice the margin of the easier cases. This results in a shifting of the predicted values away from 0 and 1, hurting calibration. This shifting is also consistent with Breiman's interpretation of boosting as an equalizer (see Breiman's discussion in (Friedman, Hastie, & Tibshirani 2000)). In the next section we demonstrate this probability shifting on real data.

To correct for boosting's poor calibration, we experiment with boosting with log-loss, and with three methods for calibrating the predictions made by boosted models to convert them to well-calibrated posterior probabilities. The three post-training calibration methods are:

Logistic Correction: a method based on Friedman et al.'s analysis of boosting as an additive model.
Platt Scaling: the method used by Platt (1999) to transform SVM outputs from [−∞, +∞] to posterior probabilities.
Isotonic Regression: the method used by Zadrozny and Elkan (2002; 2001) to calibrate predictions from boosted naive Bayes, SVM, and decision tree models.

Logistic Correction and Platt Scaling convert predictions to probabilities by transforming them with a sigmoid. With Logistic Correction, the sigmoid parameters are derived from Friedman et al.'s analysis. With Platt Scaling, the parameters are fitted to the data using gradient descent. Isotonic Regression is a general-purpose non-parametric calibration method that assumes probabilities are a monotonic transformation (not just a sigmoid) of the predictions.

An alternative to training boosted models with AdaBoost and then correcting their outputs via post-training calibration is to use a variant of boosting that directly optimizes cross-entropy (log-loss). Collins, Schapire, and Singer (2002) show that a boosting algorithm that optimizes log-loss can be obtained by a simple modification to the AdaBoost algorithm. Collins et al. briefly evaluate this new algorithm on a synthetic data set, but acknowledge that a more thorough evaluation on real data sets is necessary.

Lebanon and Lafferty (2001) show that Logistic Correction applied to boosting with exponential loss should behave similarly to boosting with log-loss, and then demonstrate this by examining the performance of boosted stumps on a variety of data sets. Our results confirm their findings for boosted stumps, and show the same effect for boosted trees.

Our experiments show that boosting full decision trees usually yields better models than boosting weaker stumps. Unfortunately, our results also show that boosting to directly optimize log-loss, or applying Logistic Correction to models boosted with exponential loss, is only effective when boosting weak models such as stumps. Neither of these methods is effective when boosting full decision trees. Significantly better performance is obtained by boosting full decision trees with exponential loss, and then calibrating their predictions using either Platt Scaling or Isotonic Regression. Calibration with Platt Scaling or Isotonic Regression is so effective that after calibration boosted decision trees predict better probabilities than any other learning method we have compared them to, including neural nets, bagged trees, random forests, and calibrated SVMs.

Boosting and Calibration

In this section we empirically examine the relationship between boosting's predictions and posterior probabilities. One way to visualize this relationship when the true posterior probabilities are not known is through reliability diagrams (DeGroot & Fienberg 1982). To construct a reliability diagram, the prediction space is discretized into ten bins. Cases with predicted value between 0 and 0.1 are put in the first bin, those between 0.1 and 0.2 in the second bin, etc. For each bin, the mean predicted value is plotted against the true fraction of positive cases in the bin. If the model is well calibrated, the points will fall near the diagonal line.

The bottom row of Figure 1 shows reliability plots on a large test set after 1, 4, 8, 32, 128, and 1024 stages of boosting Bayesian smoothed decision trees (Buntine 1992). The top of the figure shows histograms of the predicted values for the same models. The histograms show that as the number of steps of boosting increases, the predicted values are pushed away from 0 and 1 and tend to collect on either side of the decision surface. This shift away from 0 and 1 hurts calibration and yields sigmoid-shaped reliability plots.

Figure 2 shows histograms and reliability diagrams for boosted decision trees after 1024 steps of boosting on eight test problems. (See the Empirical Results section for more detail about these problems.) The figures present results measured on large independent test sets not used for training. For seven of the eight data sets the predicted values after boosting do not approach 0 or 1. The one exception is LETTER.P1, a highly skewed data set that has only 3% positive class. On this problem some of the predicted values do approach 0, though careful examination of the histogram shows that there is a sharp drop in the number of cases predicted to have probability near 0.

All the reliability plots in Figure 2 display sigmoid-shaped reliability diagrams, motivating the use of a sigmoid to map the predictions to calibrated probabilities. The functions fitted with Platt's method and Isotonic Regression are shown in the middle and bottom rows of the figure.

Calibration

In this section we describe three methods for calibrating predictions from AdaBoost: Logistic Correction, Platt Scaling, and Isotonic Regression.

Logistic Correction

Before describing Logistic Correction, it is useful to briefly review AdaBoost. Start with each example in the train set (x_i, y_i) having equal weight. At each step i a weak learner h_i is trained on the weighted train set. The error of h_i determines the model weight α_i and the future weight of each training example. There are two equivalent formulations. The first formulation, also used by Friedman, Hastie, and Tibshirani (2000), assumes y_i ∈ {−1, 1} and h_i ∈ {−1, 1}. The output of the boosted model is:

    F(x) = \sum_{i=1}^{T} \alpha_i h_i(x)                                  (1)

Friedman et al. show that AdaBoost builds an additive model for minimizing E(exp(−yF(x))). They show that E(exp(−yF(x))) is minimized by:

    F(x) = \frac{1}{2} \log \frac{P(y=1|x)}{P(y=-1|x)}                     (2)

This suggests applying a logistic correction in order to get back the conditional probability:

    P(y=1|x) = \frac{1}{1 + \exp(-2F(x))}                                  (3)

As we will see in the Empirical Results section, this logistic correction works well when boosting simple base learners such as decision stumps. However, if the base learners are powerful enough that the training data becomes fully separable, after correction the predictions will become only 0's and 1's (Rosset, Zhu, & Hastie 2004) and thus have poor calibration.
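For concreteness, the following is a minimal sketch of discrete AdaBoost on {−1, 1} labels with decision stumps, followed by the logistic correction of equation 3. The function names, the exhaustive stump search, and the number of rounds are illustrative assumptions, not the IND trees or experimental code used in this paper.

```python
import numpy as np

def fit_stump(X, y, w):
    """Exhaustive search for the decision stump (feature, threshold, sign)
    with the lowest weighted classification error; predictions are in {-1, +1}."""
    best_err, best = np.inf, None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] > thr, sign, -sign)
                err = np.sum(w[pred != y])
                if err < best_err:
                    best_err, best = err, (j, thr, sign)
    return best_err, best

def stump_predict(stump, X):
    j, thr, sign = stump
    return np.where(X[:, j] > thr, sign, -sign)

def adaboost_stumps(X, y, T=128):
    """Discrete AdaBoost with y in {-1, +1}; returns (alpha_i, h_i) pairs."""
    n = len(y)
    w = np.full(n, 1.0 / n)                       # start with equal example weights
    ensemble = []
    for _ in range(T):
        err, stump = fit_stump(X, y, w)
        err = np.clip(err, 1e-12, 1 - 1e-12)      # guard against separable data
        alpha = 0.5 * np.log((1.0 - err) / err)   # model weight alpha_i
        pred = stump_predict(stump, X)
        w *= np.exp(-alpha * y * pred)            # up-weight misclassified examples
        w /= w.sum()
        ensemble.append((alpha, stump))
    return ensemble

def boosted_score(ensemble, X):
    """F(x) of equation 1: weighted sum of the weak learners' {-1, +1} votes."""
    return sum(alpha * stump_predict(stump, X) for alpha, stump in ensemble)

def logistic_correction(F):
    """Equation 3: P(y = 1 | x) = 1 / (1 + exp(-2 F(x)))."""
    return 1.0 / (1.0 + np.exp(-2.0 * F))
```

Note that on separable training data the alphas grow without bound, so the corrected probabilities collapse toward 0 and 1, which is exactly the failure mode described above.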
Platt Calibration

An equivalent formulation of AdaBoost assumes y_i ∈ {0, 1} and h_i ∈ {0, 1}. The output of the boosted model is

    f(x) = \frac{\sum_{i=1}^{T} \alpha_i h_i(x)}{\sum_{i=1}^{T} \alpha_i}   (4)

We use this model for Platt Scaling and Isotonic Regression, and treat f(x) as the raw (uncalibrated) prediction.

Platt (1999) uses a sigmoid to map SVM outputs on [−∞, +∞] to posterior probabilities on [0, 1]. The sigmoidal shape of the reliability diagrams in Figure 2 suggests that the same calibration method should be effective at calibrating boosting predictions. In this section we closely follow the description of Platt's method in (Platt 1999).

Let the output of boosting be f(x) given by equation 4. To get calibrated probabilities, pass the output of boosting through a sigmoid:

    P(y=1|f) = \frac{1}{1 + \exp(Af + B)}                                   (5)

where the parameters A and B are fitted using maximum likelihood estimation from a calibration set (f_i, y_i). Gradient descent is used to find A and B that are the solution to:

    \operatorname*{argmin}_{A,B} \Big\{ -\sum_i \big[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \big] \Big\},   (6)

where

    p_i = \frac{1}{1 + \exp(Af_i + B)}                                      (7)

Two questions arise: where does the sigmoid training set (f_i, y_i) come from, and how can we avoid overfitting to it?
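Given a calibration set, a minimal sketch of the gradient descent fit of A and B described above is shown below. The function names, fixed learning rate, and iteration count are arbitrary assumptions made for illustration; this is not Platt's original optimizer.

```python
import numpy as np

def fit_platt(f, y, lr=0.1, iters=10000):
    """Fit the sigmoid parameters A and B of equation 5 by gradient descent
    on the negative log-likelihood of equation 6.

    f : raw boosting outputs f(x_i) on the calibration set (equation 4)
    y : calibration targets in [0, 1]; 0/1 labels, or the smoothed y+/y-
        target values of equation 8 below, which is what the paper uses
    """
    f = np.asarray(f, dtype=float)
    y = np.asarray(y, dtype=float)
    A, B = 0.0, 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(A * f + B))   # equation 7
        # For p = 1/(1+exp(z)) with z = A*f + B, d(NLL)/dz = (y - p).
        A -= lr * np.mean((y - p) * f)
        B -= lr * np.mean(y - p)
    return A, B

def platt_predict(f, A, B):
    """Equation 5: calibrated probability for a raw prediction f."""
    return 1.0 / (1.0 + np.exp(A * np.asarray(f, dtype=float) + B))
```

The fitted A and B are then applied unchanged to the raw predictions on the test set via platt_predict.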

Figure 1: Effect of boosting on the predicted values. Histograms of the predicted values (top) and reliability diagrams (bottom, fraction of positives vs. mean predicted value) on the test set for boosted trees at different steps of boosting (1, 4, 8, 32, 128, and 1024 steps) on the COV_TYPE problem.
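The reliability diagrams in Figures 1 and 2 (and in Figures 3 and 4 below) follow the ten-bin construction described in the previous section. A minimal sketch, with a hypothetical function name:

```python
import numpy as np

def reliability_diagram(pred, y, n_bins=10):
    """Mean predicted value vs. true fraction of positives in each of n_bins
    equal-width bins over [0, 1]; empty bins are skipped."""
    pred = np.asarray(pred, dtype=float)
    y = np.asarray(y, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    points = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        if hi < 1.0:
            in_bin = (pred >= lo) & (pred < hi)
        else:
            in_bin = (pred >= lo) & (pred <= hi)   # include 1.0 in the last bin
        if in_bin.any():
            points.append((pred[in_bin].mean(), y[in_bin].mean()))
    return points  # a well-calibrated model yields points near the diagonal
```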

Figure 2: Histograms of predicted values and reliability diagrams for boosted decision trees (panels: COV_TYPE, ADULT, LETTER.P1, LETTER.P2, MEDIS, SLAC, HS, MG).

One answer to the first question is to use the same train set as boosting: for each example (x_i, y_i) in the boosting train set, use (f(x_i), y_i) as a training example for the sigmoid. Unfortunately, this can introduce unwanted bias in the sigmoid training set and can lead to poor results (Platt 1999).

An alternate solution is to split the data into a training and a validation set. After boosting is trained on the training set, the predictions on the validation set are used to fit the sigmoid. Cross-validation can be used to allow both boosting and the sigmoid to be trained on the full data set. The data is split into C parts. For each fold, one part is held aside for use as an independent calibration validation set while boosting is performed using the other C−1 parts. The union of the C validation sets is then used to fit the sigmoid parameters. Following Platt, experiments in this paper use 3-fold cross-validation to estimate the sigmoid parameters.

As for the second question, an out-of-sample model is used to avoid overfitting to the calibration set. If there are N+ positive examples and N− negative examples in the calibration set, for each example Platt Calibration uses target values y+ and y− (instead of 1 and 0, respectively), where

    y_+ = \frac{N_+ + 1}{N_+ + 2}; \qquad y_- = \frac{1}{N_- + 2}           (8)

For a more detailed treatment, and a justification of these target values, see (Platt 1999). The middle row of Figure 2 shows the sigmoids fitted with Platt Scaling.

Isotonic Regression

An alternative to Platt Scaling is to use Isotonic Regression (Robertson, Wright, & Dykstra 1988). Zadrozny and Elkan (2002; 2001) successfully used Isotonic Regression to calibrate predictions from SVMs, Naive Bayes, boosted Naive Bayes, and decision trees. Isotonic Regression assumes only that:

    y_i = m(f_i) + \epsilon_i                                               (9)

where m is an isotonic (monotonically increasing) function. Given a training set (f_i, y_i), the Isotonic Regression problem is finding the isotonic function \hat{m} such that:

    \hat{m} = \operatorname*{argmin}_{z} \sum_i (y_i - z(f_i))^2            (10)

A piecewise constant solution \hat{m} can be found in linear time by using the pair-adjacent violators (PAV) algorithm (Ayer et al. 1955) presented in Table 1.

As in the case of Platt calibration, if we use the boosting train set (x_i, y_i) to get the train set (f(x_i), y_i) for Isotonic Regression, we introduce unwanted bias; as with Platt Scaling, we instead use 3-fold cross-validation so that the isotonic function is fit on predictions for an independent test set.
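The following is a minimal sketch of the pair-adjacent violators algorithm of Table 1 and of the resulting stepwise-constant predictor. The function names are hypothetical; this is not the Zadrozny–Elkan code used in the experiments.

```python
import numpy as np

def pav_fit(f, y):
    """Pair-adjacent violators: returns (right_edges, values) describing the
    stepwise-constant isotonic fit m-hat, pooling adjacent blocks whenever the
    left block's value is >= the right block's (step 3 of Table 1)."""
    f = np.asarray(f, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(f)
    f, y = f[order], y[order]
    vals, wts, rights = [], [], []          # per-block mean, weight, right edge in f
    for fi, yi in zip(f, y):
        vals.append(yi); wts.append(1.0); rights.append(fi)
        while len(vals) > 1 and vals[-2] >= vals[-1]:
            w = wts[-2] + wts[-1]
            v = (wts[-2] * vals[-2] + wts[-1] * vals[-1]) / w
            vals[-2:] = [v]
            wts[-2:] = [w]
            rights[-2:] = [rights[-1]]      # merged block keeps the larger right edge
    return np.asarray(rights), np.asarray(vals)

def pav_predict(f_new, rights, vals):
    """Step-function lookup (step 4 of Table 1): the value of the first block
    whose right edge is >= f; inputs beyond the last edge get the last value."""
    idx = np.searchsorted(rights, np.asarray(f_new, dtype=float), side="left")
    return vals[np.clip(idx, 0, len(vals) - 1)]
```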

Table 2: Cross-entropy performance of boosted decision stumps and boosted decision trees before and after calibration. Qualitatively similar results are obtained for squared error.

                 BOOSTED STUMPS                                      BOOSTED TREES
PROBLEM      RAW     PLATT   ISO     LOGIST  LOGLOSS     RAW     PLATT   ISO     LOGIST  LOGLOSS
COV_TYPE     0.7571  0.6705  0.6714  0.6785  0.6932      0.6241  0.5113  0.5179  0.7787  0.7062
ADULT        0.5891  0.4570  0.4632  0.4585  0.4905      0.5439  0.4957  0.5072  0.5330  0.5094
LETTER.P1    0.2825  0.0797  0.0837  0.0820  0.0869      0.0975  0.0375  0.0378  0.1172  0.0799
LETTER.P2    0.8018  0.5437  0.5485  0.5452  0.5789      0.3901  0.1451  0.1412  0.5255  0.3461
MEDIS        0.4573  0.3829  0.3754  0.3836  0.4411      0.4757  0.3877  0.3885  0.5596  0.5462
SLAC         0.8698  0.8304  0.8351  0.8114  0.8435      0.8270  0.7888  0.7786  0.8626  0.8450
HS           0.6110  0.3351  0.3365  0.3824  0.3407      0.4084  0.2429  0.2501  0.5988  0.4780
MG           0.4868  0.4107  0.4125  0.4096  0.4568      0.5021  0.4286  0.4297  0.5026  0.4830
MEAN         0.6069  0.4637  0.4658  0.4689  0.4914      0.4836  0.3797  0.3814  0.5597  0.4992
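For reference, the two metrics used here can be computed from predicted probabilities and 0/1 labels as follows. This is a generic sketch: the clipping constant and the averaging over cases are assumptions, since the exact normalization behind the numbers in Table 2 is not restated here.

```python
import numpy as np

def cross_entropy(p, y, eps=1e-12):
    """Mean negative log-likelihood of the 0/1 labels under predicted probabilities."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)   # avoid log(0)
    y = np.asarray(y, dtype=float)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def squared_error(p, y):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    p = np.asarray(p, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.mean((p - y) ** 2)
```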

Table 1: PAV Algorithm

Algorithm 1. PAV algorithm for estimating posterior probabilities from uncalibrated model predictions.
1. Input: training set (f_i, y_i) sorted according to f_i
2. Initialize \hat{m}_{i,i} = y_i, w_{i,i} = 1
3. While ∃ i s.t. \hat{m}_{k,i-1} ≥ \hat{m}_{i,l}
       Set w_{k,l} = w_{k,i-1} + w_{i,l}
       Set \hat{m}_{k,l} = (w_{k,i-1}\hat{m}_{k,i-1} + w_{i,l}\hat{m}_{i,l}) / w_{k,l}
       Replace \hat{m}_{k,i-1} and \hat{m}_{i,l} with \hat{m}_{k,l}
4. Output the stepwise constant function: \hat{m}(f) = \hat{m}_{i,j}, for f_i < f ≤ f_{i+1}

Empirical Results

Because decision stumps are the weak learners analyzed by Friedman, Hastie, and Tibshirani (2000), and because many iterations of boosting can make calibration worse (see Figure 1), we consider boosted stumps after 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096 and 8192 steps of boosting. The left side of Table 2 shows the cross-entropy on large final test sets for the best boosted stump models before and after calibration. The results show that the performance of boosted stumps is dramatically improved by calibration or by optimizing to log-loss. On average, calibration reduces cross-entropy by about 23% and squared error by about 14%. The three post-training calibration methods (PLATT, ISO, and LOGIST) yield comparable cross-entropy, with direct optimization of log-loss (LOGLOSS) trailing them slightly.
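As described in the Platt Calibration section, the PLATT and ISO models are fit on out-of-sample predictions obtained with 3-fold cross-validation. A minimal sketch of assembling such a calibration set is given below; the train_boosting and predict_raw interfaces are hypothetical placeholders, not the paper's experimental code.

```python
import numpy as np

def cv_calibration_set(X, y, train_boosting, predict_raw, n_folds=3, seed=0):
    """Union of out-of-fold predictions: each f(x_i) comes from a boosted model
    that did not see x_i during training."""
    X = np.asarray(X)
    y = np.asarray(y)
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, n_folds)
    f_cal, y_cal = [], []
    for k in range(n_folds):
        held_out = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        model = train_boosting(X[train], y[train])      # boost on the other C-1 parts
        f_cal.append(predict_raw(model, X[held_out]))   # raw predictions f(x), equation 4
        y_cal.append(y[held_out])
    return np.concatenate(f_cal), np.concatenate(y_cal)
```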

Figure 3: Histograms of predicted values and reliability diagrams for boosted trees calibrated with Platt's method (panels: COV_TYPE, ADULT, LETTER.P1, LETTER.P2, MEDIS, SLAC, HS, MG).

Figure 4: Histograms of predicted values and reliability diagrams for boosted trees calibrated with Isotonic Regression (same eight panels).

The right side of Table 2 presents the performance of the best boosted trees before and after calibration. To prevent the results from depending on one specific tree style, we boost ten different styles of trees. We use all the tree types in Buntine's IND package (1991) (ID3, CART, CART0, C4.5, MML, SMML, BAYES) as well as three new tree types that should predict better probabilities: unpruned C4.5 with Laplacian smoothing (Provost & Domingos 2003); unpruned C4.5 with Bayesian smoothing; and MML trees with Laplacian smoothing. We consider the boosted models after 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024 and 2048 steps of boosting.

As expected, Logistic Correction is not competitive when boosting full decision trees. The other two calibration methods, Platt Scaling and Isotonic Regression, provide a significant improvement in the quality of the predicted probabilities. Both methods reduce cross-entropy by about 21% and squared error by about 13%.

Surprisingly, boosted trees perform poorly when boosting directly optimizes log-loss. Because the base-level models are so expressive, boosting with log-loss quickly separates the two classes on the train set and pushes predictions toward 0 and 1, hurting calibration. This happens despite the fact that we regularize boosting by varying the number of iterations of boosting (Rosset, Zhu, & Hastie 2004).

Comparing the results in the left and right sides of Table 2, we see that boosted trees significantly outperform boosted stumps on five of the eight problems. On average over the eight problems, boosted trees yield about 13% lower squared error and 18% lower cross-entropy than boosted stumps. Boosted stumps, however, do win by a small margin on three problems, and have the nice property that the theoretically suggested Logistic Correction works well.2

To illustrate how calibration transforms the predictions, we show histograms and reliability diagrams for the eight problems for boosted trees after 1024 steps of boosting, after Platt Scaling (Figure 3) and after Isotonic Regression (Figure 4). The figures show that calibration undoes the shift in probability mass caused by boosting: after calibration many more cases have predicted probabilities near 0 and 1. The reliability plots are closer to the diagonal, and the S shape characteristic of boosting's predictions is gone. On each problem, transforming the predictions using either Platt Scaling or Isotonic Regression yields a significant improvement in the quality of the predicted probabilities, leading to much lower squared error and cross-entropy. The main difference between Isotonic Regression and Platt Scaling for boosting can be seen by comparing the histograms in the two figures. Because Isotonic Regression generates a piecewise constant function, its histograms are quite coarse, while the histograms generated by Platt Scaling are smooth and easier to interpret.

To determine how the performance of Platt Scaling and Isotonic Regression changes with the amount of data available for calibration, we vary the size of the calibration set from 32 to 8192 by factors of two.

2 We also tried boosting 2-level decision trees. Boosted 2-level trees outperformed boosted 1-level stumps, but did not perform as well as boosting full trees. Moreover, 2-level trees are complex enough base-level models that Logistic Correction is no longer as effective as Platt Scaling or Isotonic Regression.

Figure 5: Learning curves for Platt Scaling and Isotonic Regression for boosted trees (average across 8 problems). Squared error (roughly 0.27–0.32) vs. calibration set size (10 to 10000, log scale) for Unscaled, Platt, and Isotonic.

Figure 5 shows the average squared error over the eight test problems for the best uncalibrated and the best calibrated boosted trees. We perform ten trials for each problem. The nearly horizontal line in the graph shows the squared error prior to calibration. This line is not perfectly horizontal only because the test sets change as more data is moved into the calibration sets. The plot shows the squared error after calibration with Platt's method or Isotonic Regression as the size of the calibration set varies from small to large. The learning curves show that the performance of boosted trees is improved by calibration even for very small calibration set sizes. When the calibration set is small (less than about 2000 cases), Platt Scaling outperforms Isotonic Regression. This happens because Isotonic Regression is less constrained than Platt Scaling, so it is easier for it to overfit when the calibration set is small. Platt's method also has some overfitting control built in.

Conclusions

In this paper we empirically demonstrated why AdaBoost yields poorly calibrated predictions. To correct this problem, we experimented with using a variant of boosting that directly optimizes log-loss, as well as with three post-training calibration methods: Logistic Correction, justified by Friedman et al.'s analysis; Platt Scaling, justified by the sigmoidal shape of the reliability plots; and Isotonic Regression, a general non-parametric calibration method.

One surprising result is that boosting with log-loss instead of the usual exponential loss works well when the base-level models are weak models such as 1-level stumps, but is not competitive when boosting more powerful models such as full decision trees. Similarly, Logistic Correction works well for boosted stumps, but gives poor results when boosting full decision trees (or even 2-level trees).

The other two calibration methods, Platt Scaling and Isotonic Regression, are very effective at mapping boosting's predictions to calibrated posterior probabilities, regardless of how complex the base-level models are. After calibration with either of these two methods, boosted decision trees have better reliability diagrams, and significantly improved squared error and cross-entropy.

When compared to other learning algorithms, boosted decision trees that have been calibrated with Platt Scaling or Isotonic Regression yield the best average squared error and cross-entropy performance (results are not included due to lack of space). This, combined with their excellent performance on other measures such as accuracy, precision, and area under the ROC curve (Caruana & Niculescu-Mizil 2006), may make calibrated boosted decision trees the model of choice for many applications.

Acknowledgments

Thanks to Bianca Zadrozny and Charles Elkan for use of their Isotonic Regression code. Thanks to Charles Young et al. at SLAC (Stanford Linear Accelerator) for the SLAC data set. Thanks to Tony Gualtieri at Goddard Space Center for help with the Indian Pines data set. This work was supported by NSF Award 0412930.

References

Ayer, M.; Brunk, H.; Ewing, G.; Reid, W.; and Silverman, E. 1955. An empirical distribution function for sampling with incomplete information. Annals of Mathematical Statistics 5(26):641–647.
Blake, C., and Merz, C. 1998. UCI repository of machine learning databases.
Buntine, W., and Caruana, R. 1991. Introduction to IND and recursive partitioning. Technical Report FIA-91-28, NASA Ames Research Center.
Buntine, W. 1992. Learning classification trees. Statistics and Computing 2:63–73.
Caruana, R., and Niculescu-Mizil, A. 2006. An empirical comparison of supervised learning algorithms. In ICML '06: Proceedings of the 23rd International Conference on Machine Learning.
Collins, M.; Schapire, R. E.; and Singer, Y. 2002. Logistic regression, AdaBoost and Bregman distances. Machine Learning 48(1-3).
DeGroot, M., and Fienberg, S. 1982. The comparison and evaluation of forecasters. Statistician 32(1):12–22.
Friedman, J.; Hastie, T.; and Tibshirani, R. 2000. Additive logistic regression: a statistical view of boosting. The Annals of Statistics 28(2).
Gualtieri, A.; Chettri, S. R.; Cromp, R.; and Johnson, L. 1999. Support vector machine classifiers as applied to AVIRIS data. In Proc. Eighth JPL Airborne Geoscience Workshop.
Lebanon, G., and Lafferty, J. 2001. Boosting and maximum likelihood for exponential models. In Advances in Neural Information Processing Systems.
Platt, J. 1999. Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In Advances in Large Margin Classifiers.
Provost, F., and Domingos, P. 2003. Tree induction for probability-based ranking. Machine Learning.
Robertson, T.; Wright, F.; and Dykstra, R. 1988. Order Restricted Statistical Inference. New York: John Wiley and Sons.
Rosset, S.; Zhu, J.; and Hastie, T. 2004. Boosting as a regularized path to a maximum margin classifier. Journal of Machine Learning Research 5.
Schapire, R.; Freund, Y.; Bartlett, P.; and Lee, W. 1998. Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics 26(5):1651–1686.
Zadrozny, B., and Elkan, C. 2001. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In ICML.
Zadrozny, B., and Elkan, C. 2002. Transforming classifier scores into accurate multiclass probability estimates. In KDD.
