Obtaining Calibrated Probabilities from Boosting

Alexandru Niculescu-Mizil
Department of Computer Science
Cornell University, Ithaca, NY 14853
[email protected]

Rich Caruana
Department of Computer Science
Cornell University, Ithaca, NY 14853
[email protected]

Abstract

Boosted decision trees typically yield good accuracy, precision, and ROC area. However, because the outputs from boosting are not well calibrated posterior probabilities, boosting yields poor squared error and cross-entropy. We empirically demonstrate why AdaBoost predicts distorted probabilities and examine three calibration methods for correcting this distortion: Platt Scaling, Isotonic Regression, and Logistic Correction. We also experiment with boosting using log-loss instead of the usual exponential loss. Experiments show that Logistic Correction and boosting with log-loss work well when boosting weak models such as decision stumps, but yield poor performance when boosting more complex models such as full decision trees. Platt Scaling and Isotonic Regression, however, significantly improve the probabilities predicted by both boosted stumps and boosted trees. After calibration, boosted full decision trees predict better probabilities than other learning methods such as SVMs, neural nets, bagged decision trees, and KNNs, even after these methods are calibrated.

1 Introduction

In a recent evaluation of learning algorithms [Caruana and Niculescu-Mizil, 2005], boosted decision trees had excellent performance on metrics such as accuracy, lift, area under the ROC curve, average precision, and precision/recall break even point. However, boosted decision trees had poor squared error and cross-entropy because AdaBoost does not produce good probability estimates.

Friedman, Hastie, and Tibshirani [2000] provide an explanation for why boosting makes poorly calibrated predictions. They show that boosting can be viewed as an additive model. A consequence of this is that the predictions made by boosting are trying to fit a logit of the true probabilities, as opposed to the true probabilities themselves. To get back the probabilities, the logit transformation must be inverted.

In their treatment of boosting as a large margin classifier, Schapire et al. [1998] observed that in order to obtain a large margin on cases close to the decision surface, AdaBoost will sacrifice the margin of the easier cases. This results in a shifting of the predicted values away from 0 and 1, hurting calibration. This shifting is also consistent with Breiman's interpretation of boosting as an equalizer (see Breiman's discussion in [Friedman et al., 2000]). In Section 2 we demonstrate this probability shifting on real data.

To correct for boosting's poor calibration, we experiment with boosting with log-loss, and with three methods for calibrating the predictions made by boosted models to convert them to well-calibrated posterior probabilities. The three post-training calibration methods are:

Logistic Correction: a method based on Friedman et al.'s analysis of boosting as an additive model.

Platt Scaling: the method used by Platt to transform SVM outputs from [−∞, +∞] to posterior probabilities [1999].

Isotonic Regression: the method used by Zadrozny and Elkan to calibrate predictions from boosted naive Bayes, SVM, and decision tree models [2002; 2001].

Logistic Correction and Platt Scaling convert predictions to probabilities by transforming them with a sigmoid. With Logistic Correction, the sigmoid parameters are derived from Friedman et al.'s analysis. With Platt Scaling, the parameters are fitted to the data using gradient descent. Isotonic Regression is a general-purpose non-parametric calibration method that assumes the probabilities are a monotonic transformation (not just a sigmoid) of the predictions.

An alternative to training boosted models with AdaBoost and then correcting their outputs via post-training calibration is to use a variant of boosting that directly optimizes cross-entropy (log-loss). Collins, Schapire and Singer [2002] show that a boosting algorithm that optimizes log-loss can be obtained by a simple modification to the AdaBoost algorithm. Collins et al. briefly evaluate this new algorithm on a synthetic data set, but acknowledge that a more thorough evaluation on real data sets is necessary.

Lebanon and Lafferty [2001] show that Logistic Correction applied to boosting with exponential loss should behave similarly to boosting with log-loss, and then demonstrate this by examining the performance of boosted stumps on a variety of data sets. Our results confirm their findings for boosted stumps, and show the same effect for boosted trees.

Our experiments show that boosting full decision trees usually yields better models than boosting weaker stumps. Unfortunately, our results also show that boosting to directly optimize log-loss, or applying Logistic Correction to models boosted with exponential loss, is only effective when boosting weak models such as stumps. Neither of these methods is effective when boosting full decision trees. Significantly better performance is obtained by boosting full decision trees with exponential loss, and then calibrating their predictions using either Platt Scaling or Isotonic Regression. Calibration with Platt Scaling or Isotonic Regression is so effective that after calibration boosted decision trees predict better probabilities than any other learning method we have compared them to, including neural nets, bagged trees, random forests, and calibrated SVMs.

In Section 2 we analyze the predictions from boosted trees from a qualitative point of view. We show that boosting distorts the probabilities in a consistent way, generating sigmoid-shaped reliability diagrams. This analysis motivates the use of a sigmoid to map predictions to well-calibrated probabilities. Section 3 describes the three calibration methods. Section 4 presents an empirical comparison of the three calibration methods and the log-loss version of boosting. Section 5 compares the performance of boosted trees and stumps to other learning methods.
2 Boosting and Calibration

In this section we empirically examine the relationship between boosting's predictions and posterior probabilities. One way to visualize this relationship when the true posterior probabilities are not known is through reliability diagrams [DeGroot and Fienberg, 1982]. To construct a reliability diagram, the prediction space is discretized into ten bins. Cases with predicted value between 0 and 0.1 are put in the first bin, cases with predicted value between 0.1 and 0.2 in the second bin, and so on. For each bin, the mean predicted value is plotted against the true fraction of positive cases in the bin. If the model is well calibrated the points will fall near the diagonal line.

The bottom row of Figure 1 shows reliability plots on a large test set after 1, 4, 8, 32, 128, and 1024 stages of boosting Bayesian smoothed decision trees [Buntine, 1992]. The top of the figure shows histograms of the predicted values for the same models. The histograms show that as the number of steps of boosting increases, the predicted values are pushed away from 0 and 1 and tend to collect on either side of the decision surface. This shift away from 0 and 1 hurts calibration and yields sigmoid-shaped reliability plots.

Figure 2 shows histograms and reliability diagrams for boosted decision trees after 1024 steps of boosting on eight test problems. (See Section 4 for more detail about these problems.) The figures present results measured on large independent test sets not used for training. For seven of the eight data sets the predicted values after boosting do not approach 0 or 1. The one exception is LETTER.P1, a highly skewed data set that has only 3% positive class. On this problem some of the predicted values do approach 0, though careful examination of the histogram shows that there is a sharp drop in the number of cases predicted to have probability near 0.

All the reliability plots in Figure 2 display sigmoid-shaped reliability diagrams, motivating the use of a sigmoid to map the predictions to calibrated probabilities. The functions fitted with Platt's method and Isotonic Regression are shown in the middle and bottom rows of the figure.
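As a concrete sketch of the binning procedure just described (not from the paper; Python/NumPy, with a function name and interface of our own choosing), the points of a ten-bin reliability diagram can be computed as follows:

```python
import numpy as np

def reliability_diagram(pred, y, n_bins=10):
    """Return (mean predicted value, fraction of positives, count) per bin.

    pred : array of predicted values in [0, 1]
    y    : array of 0/1 labels
    """
    pred = np.asarray(pred, dtype=float)
    y = np.asarray(y, dtype=float)
    # bin 0 covers [0, 0.1), bin 1 covers [0.1, 0.2), ..., bin 9 covers [0.9, 1.0]
    bins = np.minimum((pred * n_bins).astype(int), n_bins - 1)
    mean_pred, frac_pos, counts = [], [], []
    for b in range(n_bins):
        mask = bins == b
        if not np.any(mask):
            continue  # empty bins are simply not plotted
        mean_pred.append(pred[mask].mean())
        frac_pos.append(y[mask].mean())
        counts.append(int(mask.sum()))
    return np.array(mean_pred), np.array(frac_pos), np.array(counts)
```

Plotting the mean predicted value against the fraction of positives and comparing with the diagonal reproduces the style of diagram used in Figures 1 and 2; a well-calibrated model's points lie near the diagonal.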
3 Calibration

In this section we describe three methods for calibrating predictions from AdaBoost: Logistic Correction, Platt Scaling, and Isotonic Regression.

3.1 Logistic Correction

Before describing Logistic Correction, it is useful to briefly review AdaBoost. Start with each example in the train set (x_i, y_i) having equal weight. At each step i a weak learner h_i is trained on the weighted train set. The error of h_i determines the model weight α_i and the future weight of each training example. There are two equivalent formulations. The first formulation, also used by Friedman, Hastie, and Tibshirani [2000], assumes y_i ∈ {−1, 1} and h_i ∈ {−1, 1}. The output of the boosted model is:

    F(x) = \sum_{i=1}^{T} \alpha_i h_i(x)                                   (1)

Friedman et al. show that AdaBoost builds an additive logistic regression model for minimizing E(exp(−yF(x))). They show that E(exp(−yF(x))) is minimized by:

    F(x) = \frac{1}{2} \log \frac{P(y=1 \mid x)}{P(y=-1 \mid x)}            (2)

This suggests applying a logistic correction in order to get back the conditional probability:

    P(y=1 \mid x) = \frac{1}{1 + \exp(-2F(x))}                              (3)
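The correction in Equation 3 is a fixed transformation of the unnormalized vote F(x), so applying it requires no fitting. A minimal sketch (our own Python/NumPy code, not the paper's; the array layout and names are hypothetical) assuming we already have the per-round weak-learner outputs in {−1, +1} and the stage weights α_i:

```python
import numpy as np

def logistic_correction(weak_outputs, alphas):
    """Convert AdaBoost's weighted vote to probabilities via Eq. 3.

    weak_outputs : (n_examples, T) array with entries in {-1, +1},
                   column t holding h_t(x) for every example
    alphas       : (T,) array of stage weights alpha_t
    """
    F = weak_outputs @ np.asarray(alphas, dtype=float)   # Eq. 1: F(x) = sum_t alpha_t h_t(x)
    return 1.0 / (1.0 + np.exp(-2.0 * F))                # Eq. 3: P(y = 1 | x)
```

As Section 4 shows, this works well for boosted stumps but breaks down when the base learners are expressive enough to separate the training data, since F(x) then grows without bound and the corrected probabilities collapse to 0 and 1.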

[Figure 1 appears here: six panels labeled 1 STEP, 4 STEPS, 8 STEPS, 32 STEPS, 128 STEPS, and 1024 STEPS, each showing a histogram of predicted values (top) and a reliability diagram (bottom) with Mean Predicted Value on the x-axis and Fraction of Positives on the y-axis.]

Figure 1: Effect of boosting on the predicted values. Histograms of the predicted values (top) and reliability diagrams (bottom) on the test set for boosted trees at different steps of boosting on the COV_TYPE problem.

[Figure 2 appears here: one column per problem (COV_TYPE, ADULT, LETTER.P1, LETTER.P2, MEDIS, SLAC, HS, MG), each showing a histogram of predicted values (top) and two reliability diagrams with the fitted calibration functions (middle and bottom), with Mean Predicted Value on the x-axis and Fraction of Positives on the y-axis.]

Figure 2: Histograms of predicted values and reliability diagrams for boosted decision trees.

As we will see in Section 4, this logistic correction works well when boosting simple base learners such as decision stumps. However, if the base learners are powerful enough that the training data becomes fully separable, after correction the predictions become only 0's and 1's [Rosset et al., 2004] and thus have poor calibration.

3.2 Platt Calibration

An equivalent formulation of AdaBoost assumes y_i ∈ {0, 1} and h_i ∈ {0, 1}. The output of the boosted model is:

    f(x) = \frac{\sum_{i=1}^{T} \alpha_i h_i(x)}{\sum_{i=1}^{T} \alpha_i}    (4)

We use this model for Platt Scaling and Isotonic Regression, and treat f(x) as the raw (uncalibrated) prediction.
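A short sketch of the normalized score in Equation 4 (our own Python/NumPy code, with hypothetical array names); it is just an affine rescaling of the weighted vote of Equation 1 into [0, 1]:

```python
import numpy as np

def raw_boosting_score(weak_outputs01, alphas):
    """Eq. 4: normalized AdaBoost output with h_i in {0, 1}.

    weak_outputs01 : (n_examples, T) array with entries in {0, 1}
    alphas         : (T,) array of stage weights alpha_t
    """
    alphas = np.asarray(alphas, dtype=float)
    return (weak_outputs01 @ alphas) / alphas.sum()   # f(x) in [0, 1]
```

This f(x) is the uncalibrated prediction that Platt Scaling and Isotonic Regression map to probabilities below.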

Platt [1999] uses a sigmoid to map SVM outputs on [−∞, +∞] to posterior probabilities on [0, 1]. The sigmoidal shape of the reliability diagrams in Figure 2 suggests that the same calibration method should be effective at calibrating boosting predictions. In this section we closely follow the description of Platt's method in [Platt, 1999].

Let the output of boosting be f(x) given by Equation 4. To get calibrated probabilities, pass the output of boosting through a sigmoid:

    P(y=1 \mid f) = \frac{1}{1 + \exp(Af + B)}                               (5)

where the parameters A and B are fitted using maximum likelihood estimation from a calibration set (f_i, y_i). Gradient descent is used to find the A and B that are the solution to:

    \operatorname*{argmin}_{A,B} \Big\{ -\sum_i \big[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \big] \Big\}    (6)

where

    p_i = \frac{1}{1 + \exp(Af_i + B)}                                       (7)

Two questions arise: where does the sigmoid training set (f_i, y_i) come from, and how can we avoid overfitting to it?

One answer to the first question is to use the same train set as boosting: for each example (x_i, y_i) in the boosting train set, use (f(x_i), y_i) as a training example for the sigmoid. Unfortunately, this can introduce unwanted bias in the sigmoid training set and can lead to poor results [Platt, 1999].

An alternate solution is to split the data into a training and a validation set. After boosting is trained on the training set, the predictions on the validation set are used to fit the sigmoid. Cross-validation can be used to allow both boosting and the sigmoid to be trained on the full data set. The data is split into C parts. For each fold, one part is held aside for use as an independent calibration validation set while boosting is performed using the other C−1 parts. The union of the C validation sets is then used to fit the sigmoid parameters. Following Platt, experiments in this paper use 3-fold cross-validation to estimate the sigmoid parameters.

As for the second question, an out-of-sample model is used to avoid overfitting to the calibration set. If there are N_+ positive examples and N_− negative examples in the calibration set, for each example Platt Calibration uses target values y_+ and y_− (instead of 1 and 0, respectively), where

    y_+ = \frac{N_+ + 1}{N_+ + 2}; \qquad y_- = \frac{1}{N_- + 2}            (8)

For a more detailed treatment, and a justification of these target values, see [Platt, 1999]. The middle row of Figure 2 shows the sigmoids fitted with Platt Scaling.
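A hedged sketch of the fit: the following Python/NumPy code (ours, not the paper's, and using plain full-batch gradient descent rather than the Newton-style routine Platt actually describes) estimates A and B of Equation 5 on a calibration set, using the smoothed targets of Equation 8:

```python
import numpy as np

def fit_platt(f, y, lr=0.1, n_iter=20000):
    """Fit A, B of Eq. 5 by gradient descent on the negative log-likelihood of Eq. 6.

    f : (n,) uncalibrated predictions on the calibration set
    y : (n,) 0/1 labels on the calibration set
    """
    f = np.asarray(f, dtype=float)
    y = np.asarray(y, dtype=int)
    n_pos, n_neg = int((y == 1).sum()), int((y == 0).sum())
    # Eq. 8: out-of-sample target values instead of hard 0/1 labels
    t = np.where(y == 1, (n_pos + 1.0) / (n_pos + 2.0), 1.0 / (n_neg + 2.0))
    A, B = 0.0, 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(A * f + B))    # Eq. 7
        g = t - p                              # d(NLL)/dz with z = A*f + B
        A -= lr * np.mean(g * f)               # chain rule: dz/dA = f
        B -= lr * np.mean(g)                   # chain rule: dz/dB = 1
    return A, B

def platt_transform(f, A, B):
    return 1.0 / (1.0 + np.exp(A * np.asarray(f, dtype=float) + B))   # Eq. 5
```

In practice a second-order optimizer converges much faster; the gradient-descent loop is kept here only because it matches the description in the text.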
3.3 Isotonic Regression

An alternative to Platt Scaling is to use Isotonic Regression [Robertson et al., 1988]. Zadrozny and Elkan [2002; 2001] successfully used Isotonic Regression to calibrate predictions from SVMs, Naive Bayes, boosted Naive Bayes, and decision trees. Isotonic Regression assumes only that:

    y_i = m(f_i) + \epsilon_i                                                (9)

where m is an isotonic (monotonically increasing) function. Given a training set (f_i, y_i), the Isotonic Regression problem is finding the isotonic function m̂ such that:

    \hat{m} = \operatorname*{argmin}_{z} \sum_i (y_i - z(f_i))^2             (10)

A piecewise constant solution m̂ can be found in linear time by using the pair-adjacent violators (PAV) algorithm [Ayer et al., 1955] presented in Table 1.

Table 1: PAV Algorithm

Algorithm 1. PAV algorithm for estimating posterior probabilities from uncalibrated model predictions.
1  Input: training set (f_i, y_i) sorted according to f_i
2  Initialize m̂_{i,i} = y_i, w_{i,i} = 1
3  While ∃ i s.t. m̂_{k,i−1} ≥ m̂_{i,l}
       Set w_{k,l} = w_{k,i−1} + w_{i,l}
       Set m̂_{k,l} = (w_{k,i−1} m̂_{k,i−1} + w_{i,l} m̂_{i,l}) / w_{k,l}
       Replace m̂_{k,i−1} and m̂_{i,l} with m̂_{k,l}
4  Output the stepwise constant function:
       m̂(f) = m̂_{i,j}, for f_i < f ≤ f_j

As in the case of Platt Calibration, if we use the boosting train set (x_i, y_i) to get the train set (f(x_i), y_i) for Isotonic Regression, we introduce unwanted bias: in the fully separable case, boosting will order all the negative examples before the positive examples, so Isotonic Regression will output just a 0,1 function. An unbiased calibration set can be obtained using the methods discussed in Section 3.2. For the Isotonic Regression experiments we use the same 3-fold CV methodology used with Platt Scaling. The bottom row of Figure 2 shows functions fitted with Isotonic Regression.
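A hedged sketch of the PAV procedure in Table 1 (our own Python/NumPy code, using a simple stack of merged blocks rather than the paper's index notation):

```python
import numpy as np

def pav_fit(f, y):
    """Pair-adjacent violators: fit a stepwise-constant isotonic function to (f, y)."""
    order = np.argsort(f)
    f_sorted = np.asarray(f, dtype=float)[order]
    y_sorted = np.asarray(y, dtype=float)[order]
    means, weights, rights = [], [], []   # one entry per merged block
    for fi, yi in zip(f_sorted, y_sorted):
        means.append(yi); weights.append(1.0); rights.append(fi)
        # merge adjacent blocks while the isotonic constraint is violated
        while len(means) > 1 and means[-2] >= means[-1]:
            w = weights[-2] + weights[-1]
            m = (weights[-2] * means[-2] + weights[-1] * means[-1]) / w
            right = rights[-1]
            del means[-1], weights[-1], rights[-1]
            means[-1], weights[-1], rights[-1] = m, w, right
    return np.array(rights), np.array(means)

def pav_predict(rights, means, f_new):
    """Evaluate the stepwise-constant function: use the block whose right edge covers f_new."""
    idx = np.searchsorted(rights, np.asarray(f_new, dtype=float), side="left")
    return means[np.clip(idx, 0, len(means) - 1)]
```

In practice one would normally use an existing implementation such as scikit-learn's sklearn.isotonic.IsotonicRegression, which fits the same least-squares problem (and interpolates between thresholds at prediction time).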

4 Empirical Results

In this section we apply the three calibration methods to predictions from boosted trees and boosted stumps on eight binary classification problems. The ADULT, COV_TYPE and LETTER problems are from the UCI Repository [Blake and Merz, 1998]. COV_TYPE was converted to a binary problem by treating the largest class as positive and the rest as negative. LETTER was converted to boolean in two ways. LETTER.P1 treats the letter "O" as positive and the remaining letters as negative, yielding a very unbalanced problem. LETTER.P2 uses letters A-M as positives and N-Z as negatives, yielding a well balanced problem. HS is the IndianPine92 data set [Gualtieri et al., 1999] where the class Soybean-mintill is the positive class. SLAC is a particle physics problem from collaborators at the Stanford Linear Accelerator, and MEDIS and MG are medical data sets.

First we use boosted stumps to examine the case when the data is not separable in the span of the base learners. We boost five different types of stumps by using all of the splitting criteria in Buntine's IND package [1991]. Because boosting can overfit [Rosset et al., 2004; Friedman et al., 2000], and because many iterations of boosting can make calibration worse (see Figure 1), we consider boosted stumps after 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096 and 8192 steps of boosting. The left side of Table 2 shows the cross-entropy on large final test sets for the best boosted stump models before and after calibration.¹ The results show that the performance of boosted stumps is dramatically improved by calibration or by optimizing to log-loss. On average, calibration reduces cross-entropy by about 23% and squared error by about 14% (see Figure 6).

¹To protect against an infinite cross-entropy we prevent the models from predicting exactly 0 or 1.

Table 2: Cross-entropy performance of boosted decision stumps and boosted decision trees before and after calibration. Qualitatively similar results are obtained for squared error.

                          BOOSTED STUMPS                                 BOOSTED TREES
PROBLEM      RAW     PLATT   ISO     LOGIST  LOGLOSS    RAW     PLATT   ISO     LOGIST  LOGLOSS
COV_TYPE     0.7571  0.6705  0.6714  0.6785  0.6932     0.6241  0.5113  0.5179  0.7787  0.7062
ADULT        0.5891  0.4570  0.4632  0.4585  0.4905     0.5439  0.4957  0.5072  0.5330  0.5094
LETTER.P1    0.2825  0.0797  0.0837  0.0820  0.0869     0.0975  0.0375  0.0378  0.1172  0.0799
LETTER.P2    0.8018  0.5437  0.5485  0.5452  0.5789     0.3901  0.1451  0.1412  0.5255  0.3461
MEDIS        0.4573  0.3829  0.3754  0.3836  0.4411     0.4757  0.3877  0.3885  0.5596  0.5462
SLAC         0.8698  0.8304  0.8351  0.8114  0.8435     0.8270  0.7888  0.7786  0.8626  0.8450
HS           0.6110  0.3351  0.3365  0.3824  0.3407     0.4084  0.2429  0.2501  0.5988  0.4780
MG           0.4868  0.4107  0.4125  0.4096  0.4568     0.5021  0.4286  0.4297  0.5026  0.4830
MEAN         0.6069  0.4637  0.4658  0.4689  0.4914     0.4836  0.3797  0.3814  0.5597  0.4992
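The two metrics reported throughout can be computed directly; a small sketch (our own Python/NumPy code) of cross-entropy with the clipping mentioned in footnote 1, and of squared error (the Brier score). The specific clipping threshold is our choice; the paper does not state one.

```python
import numpy as np

def cross_entropy(y, p, eps=1e-12):
    """Mean negative log-likelihood; predictions are clipped away from 0 and 1
    (footnote 1) so the cross-entropy cannot become infinite."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    y = np.asarray(y, dtype=float)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def squared_error(y, p):
    """Mean squared error between 0/1 labels and predicted probabilities (Brier score)."""
    return np.mean((np.asarray(y, dtype=float) - np.asarray(p, dtype=float)) ** 2)
```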

The three post-training calibration methods (PLATT, ISO, and LOGIST in Table 2) work equally well. Logistic Correction has the advantage that no extra time is required to fit the calibration model and no cross-validation is needed to create independent calibration sets. The LOGLOSS column shows the performance of boosted stumps when trained to optimize log-loss as opposed to the usual exponential loss. When directly optimizing log-loss, the performance of boosted stumps is a few percent worse than with the other calibration methods. This is consistent with the results reported in [Lebanon and Lafferty, 2001]. As with Logistic Correction, there is no need for an independent calibration set and no extra time is spent training the calibration models.

Logistic Correction and directly optimizing log-loss are effective when using very weak base learners such as 1-level stumps. Unfortunately, because the base learners are so simple, boosted stumps are not able to capture all the structure of most real data sets. Boosted 1-level stumps are optimal when the true model is additive in the features, but can not model non-additive interactions between two or more features [Friedman et al., 2000]. As we see in Table 2, boosting full decision trees yields significantly better performance on five of the eight test problems.

Unlike 1-level stumps, decision trees are complex enough base models to make many data sets separable (in their span). This means that boosted decision trees are expressive enough to capture the full complexity of most data sets. Unfortunately, this also means they are expressive enough to perfectly separate the training set, pushing the probabilities predicted by Logistic Correction to 0 or 1. Although Logistic Correction is no longer effective, Figure 2 shows that good posterior estimates can still be found by fitting a sigmoid or an isotonic function on an independent calibration set.

The right side of Table 2 presents the performance of the best boosted trees before and after calibration. To prevent the results from depending on one specific tree style, we boost ten different styles of trees. We use all the tree types in Buntine's IND package [1991] (ID3, CART, CART0, C4.5, MML, SMML, BAYES) as well as three new tree types that should predict better probabilities: unpruned C4.5 with Laplacian smoothing [Provost and Domingos, 2003]; unpruned C4.5 with Bayesian smoothing; and MML trees with Laplacian smoothing. We consider the boosted models after 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024 and 2048 steps of boosting.

As expected, Logistic Correction is not competitive when boosting full decision trees. The other two calibration methods (Platt Scaling and Isotonic Regression) provide a significant improvement in the quality of the predicted probabilities. Both methods reduce cross-entropy by about 21% and squared error by about 13% (see Figure 6).

Surprisingly, when boosting directly optimizes log-loss, the performance of boosted trees is very poor. Because the base-level models are so expressive, boosting with log-loss quickly separates the two classes on the train set and pushes predictions toward 0 and 1, hurting calibration. This happens despite the fact that we regularize boosting by varying the number of iterations of boosting [Rosset et al., 2004].

Comparing the left and right sides of Table 2, we see that boosted trees significantly outperform boosted stumps on five of the eight problems. On average over the eight problems, boosted trees yield about 13% lower squared error and 18% lower cross-entropy than boosted stumps. Boosted stumps, however, do win by a small margin on three problems, and have the nice property that the theoretically suggested Logistic Correction works well.²

²We also tried boosting 2-level decision trees. Boosted 2-level trees outperformed boosted 1-level stumps, but did not perform as well as boosting full trees. Moreover, 2-level trees are complex enough base-level models that Logistic Correction is no longer as effective as Platt Scaling or Isotonic Regression.

To determine how the performance of Platt Scaling and Isotonic Regression changes with the amount of data available for calibration, we vary the size of the calibration set from 32 to 8192 by factors of two. Figure 5 shows the average squared error over the eight test problems for the best uncalibrated and the best calibrated boosted trees. We perform ten trials for each problem.

The nearly horizontal line in the graph shows the squared error prior to calibration. This line is not perfectly horizontal only because the test sets change as more data is moved into the calibration sets. The plot shows the squared error after calibration with Platt's method or Isotonic Regression as the size of the calibration set varies from small to large. The learning curves show that the performance of boosted trees is improved by calibration even for very small calibration set sizes. When the calibration set is small (less than about 2000 cases), Platt Scaling outperforms Isotonic Regression. This happens because Isotonic Regression is less constrained than Platt Scaling, so it is easier for it to overfit when the calibration set is small. Platt's method also has some overfitting control built in (see Section 3).

To illustrate how calibration transforms the predictions, we show histograms and reliability diagrams for the eight problems for boosted trees after 1024 steps of boosting, after Platt Scaling (Figure 3) and after Isotonic Regression (Figure 4). The figures show that calibration undoes the shift in probability mass caused by boosting: after calibration many more cases have predicted probabilities near 0 and 1. The reliability plots are closer to the diagonal, and the S shape characteristic of boosting's predictions is gone. On each problem, transforming the predictions using either Platt Scaling or Isotonic Regression yields a significant improvement in the quality of the predicted probabilities, leading to much lower squared error and cross-entropy. The main difference between Isotonic Regression and Platt Scaling for boosting can be seen when comparing the histograms in the two figures. Because Isotonic Regression generates a piecewise constant function, the histograms are quite coarse, while the histograms generated by Platt Scaling are smooth and easier to interpret.
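To make the experimental recipe concrete, here is a hedged end-to-end sketch using scikit-learn (not used in the paper; the synthetic data, depth-4 trees, and other hyperparameters are our own arbitrary choices standing in for the paper's IND tree styles). It boosts trees with exponential-loss AdaBoost, then calibrates with a sigmoid (Platt) or isotonic fit and compares cross-entropy and squared error before and after. Note that scikit-learn's AdaBoost probabilities are its own transform of the weighted vote rather than exactly Equation 4, and CalibratedClassifierCV's cv=3 ensemble differs in detail from the paper's pooled 3-fold scheme; passing the base learner positionally sidesteps the estimator/base_estimator naming difference across scikit-learn versions.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import brier_score_loss, log_loss
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy binary problem; the paper uses eight real data sets instead.
X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# Boosted (shallow) trees with the usual exponential-loss AdaBoost.
boosted_trees = AdaBoostClassifier(DecisionTreeClassifier(max_depth=4), n_estimators=64)

for method in ("sigmoid", "isotonic"):
    calibrated = CalibratedClassifierCV(boosted_trees, method=method, cv=3)
    calibrated.fit(X_train, y_train)
    p = calibrated.predict_proba(X_test)[:, 1]
    print(method, "cross-entropy:", log_loss(y_test, p),
          "squared error:", brier_score_loss(y_test, p))

# Uncalibrated baseline for comparison.
boosted_trees.fit(X_train, y_train)
p_raw = boosted_trees.predict_proba(X_test)[:, 1]
print("raw cross-entropy:", log_loss(y_test, p_raw),
      "squared error:", brier_score_loss(y_test, p_raw))
```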

[Figure 3 appears here: one column per problem (COV_TYPE, ADULT, LETTER.P1, LETTER.P2, MEDIS, SLAC, HS, MG), each showing a histogram of predicted values (top) and a reliability diagram (bottom) with Mean Predicted Value on the x-axis and Fraction of Positives on the y-axis.]

Figure 3: Histograms of predicted values and reliability diagrams for boosted trees calibrated with Platt's method.

[Figure 4 appears here: one column per problem (COV_TYPE, ADULT, LETTER.P1, LETTER.P2, MEDIS, SLAC, HS, MG), each showing a histogram of predicted values (top) and a reliability diagram (bottom) with Mean Predicted Value on the x-axis and Fraction of Positives on the y-axis.]

Figure 4: Histograms of predicted values and reliability diagrams for boosted trees calibrated with Isotonic Regression.

[Figure 5 appears here: learning curves plotting Squared Error (y-axis, roughly 0.27 to 0.32) against Calibration Set Size on a log scale (x-axis, 10 to 10000), with lines for Unscaled, Platt, and Isotonic.]

Figure 5: Learning curves for Platt Scaling and Isotonic Regression for boosted trees (average across 8 problems).

5 Boosted Trees vs. Other Methods

Boosted trees have such poor initial calibration that it is not surprising that Platt Calibration and Isotonic Regression significantly improve their squared error and cross-entropy. But this does not necessarily mean that after calibration boosted trees will yield good probabilistic predictions. Figure 6 compares the performance of boosted trees and stumps with eight other learning methods before and after calibration. The figure shows the performance for boosted decision trees (BST-DT), SVMs, random forests (RF), bagged decision trees (BAG-DT), neural nets (ANN), memory based learning methods (KNN), boosted stumps (BST-STMP), vanilla decision trees (DT), logistic regression (LOGREG), and naive bayes (NB). The parameters for each learning method were carefully optimized to ensure a fair comparison.³ To obtain the performance for pre-calibrated SVMs, we scaled SVM predictions to [0, 1] by (x − min)/(max − min).⁴ Because of computational issues, all the models are calibrated on a held-out 1K validation set instead of performing the 3-fold CV used until now. For each problem and each calibration method, we select the best model trained with each learning algorithm using the same 1K validation set used for calibration, and report its performance on the final test set.

³See [Caruana and Niculescu-Mizil, 2005] for a description of the different models and parameter settings that were evaluated.
⁴This is not a good way to scale SVM predictions. Platt Scaling and Isotonic Regression are much better. We use this simple scaling only to provide a baseline performance for SVMs.

[Figure 6 appears here: bar chart of squared error (y-axis, roughly 0.26 to 0.38) for BST-DT, SVM, RF, ANN, BAG-DT, KNN, BST-STMP, DT, LOGREG, and NB, with bars for Uncalibrated, Platt Scaling, Isotonic Regression, and Log-loss.]

Figure 6: Squared error performance of the ten learning algorithms (on large test sets) before and after calibration. A qualitatively similar figure is obtained if one examines cross-entropy instead of squared error. Error bars are shown, but the confidence intervals are small so they may be difficult to see. The performance of boosting to optimize log-loss is shown for boosted trees and stumps.

After calibration with Platt Scaling or Isotonic Regression, boosted decision trees have better squared error and cross-entropy than the other nine learning methods. After calibration, boosting full decision trees can yield state-of-the-art probabilistic prediction. Boosting full trees to optimize log-loss directly, however, does not yield performance comparable to post-training calibration with either Platt Scaling or Isotonic Regression. Boosting to optimize log-loss directly is nearly as effective as calibration when boosting stumps, but boosted stumps perform much worse than boosted full decision trees.

After boosted decision trees, the next best methods are SVMs, random forests, neural nets, and bagged decision trees. While Platt Scaling and Isotonic Regression significantly improve the performance of the SVM models, and yield a small improvement with random forests, they have little effect on the performance of bagged trees and neural nets. Both bagged trees and neural nets yield well calibrated predictions that do not benefit from post-calibration. In fact, post-calibration hurts the performance of neural nets slightly. It is interesting to note that neural networks with a single sigmoid output unit can be viewed as a linear classifier (in the span of its hidden units) with a sigmoid at the output that calibrates the predictions. In this respect neural nets are similar to SVMs and boosted decision trees after they have been calibrated using Platt's method.

Platt Scaling may be slightly less effective than Isotonic Regression with bagged trees, neural nets, memory based learning, vanilla decision trees, logistic regression, and naive bayes. This probably is because a sigmoid is not the correct function for mapping the raw predictions generated by these learning algorithms to posterior probabilities. In these cases, Isotonic Regression yields somewhat better squared error and cross-entropy than Platt Scaling, but still not enough to make these learning methods perform as well as calibrated boosted decision trees. The four learning methods that benefit the most from post-training calibration are boosted trees, SVMs, boosted stumps, and naive bayes. Calibration is much less important with the other learning methods. The models that predict the best probabilities prior to calibration are random forests, neural nets, and bagged trees. For a more detailed discussion on predicting well calibrated probabilities with supervised learning see [Niculescu-Mizil and Caruana, 2005].
6 Conclusions

In this paper we empirically demonstrated why AdaBoost yields poorly calibrated predictions. To correct this problem, we experimented with using a variant of boosting that directly optimizes log-loss, as well as with three post-training calibration methods: Logistic Correction, justified by Friedman et al.'s analysis; Platt Scaling, justified by the sigmoidal shape of the reliability plots; and Isotonic Regression, a general non-parametric calibration method.

One surprising result is that boosting with log-loss instead of the usual exponential loss works well when the base level models are weak models such as 1-level stumps, but is not competitive when boosting more powerful models such as full decision trees. Similarly, Logistic Correction works well for boosted stumps, but gives poor results when boosting full decision trees (or even 2-level trees).

The other two calibration methods, Platt Scaling and Isotonic Regression, are very effective at mapping boosting's predictions to calibrated posterior probabilities, regardless of how complex the base level models are. After calibration with either of these two methods, boosted decision trees have better reliability diagrams, and significantly improved squared error and cross-entropy.

When compared to other learning algorithms, boosted decision trees that have been calibrated with Platt Scaling or Isotonic Regression yield the best average squared error and cross-entropy performance. This, combined with their excellent performance on other measures such as accuracy, precision, and area under the ROC curve [Caruana and Niculescu-Mizil, 2005], may make calibrated boosted decision trees the model of choice for many applications.

Acknowledgments

Thanks to Bianca Zadrozny and Charles Elkan for use of their Isotonic Regression code. Thanks to Charles Young et al. at SLAC (Stanford Linear Accelerator) for the SLAC data set. Thanks to Tony Gualtieri at Goddard Space Center for help with the Indian Pines data set. This work was supported by NSF Award 0412930.

References

[Ayer et al., 1955] M. Ayer, H. Brunk, G. Ewing, W. Reid, and E. Silverman. An empirical distribution function for sampling with incomplete information. Annals of Mathematical Statistics, 5(26):641–647, 1955.

[Blake and Merz, 1998] C. Blake and C. Merz. UCI repository of machine learning databases, 1998.

[Buntine and Caruana, 1991] W. Buntine and R. Caruana. Introduction to IND and recursive partitioning. Technical Report FIA-91-28, NASA Ames Research Center, October 1991.

[Buntine, 1992] W. Buntine. Learning classification trees. Statistics and Computing, 2:63–73, 1992.

[Caruana and Niculescu-Mizil, 2005] R. Caruana and A. Niculescu-Mizil. An empirical comparison of supervised learning algorithms using different performance metrics. Technical Report TR2005-1973, Cornell University, 2005.

[Collins et al., 2002] Michael Collins, Robert E. Schapire, and Yoram Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 48(1-3), 2002.

[DeGroot and Fienberg, 1982] M.H. DeGroot and S.E. Fienberg. The comparison and evaluation of forecasters. Statistician, 32(1):12–22, 1982.

[Friedman et al., 2000] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28(2), 2000.

[Gualtieri et al., 1999] A. Gualtieri, S. R. Chettri, R. F. Cromp, and L. F. Johnson. Support vector machine classifiers as applied to AVIRIS data. In Proc. Eighth JPL Airborne Geoscience Workshop, 1999.

[Lebanon and Lafferty, 2001] G. Lebanon and J. Lafferty. Boosting and maximum likelihood for exponential models. In Advances in Neural Information Processing Systems, 2001.

[Niculescu-Mizil and Caruana, 2005] A. Niculescu-Mizil and R. Caruana. Predicting good probabilities with supervised learning. In Proc. 22nd International Conference on Machine Learning (ICML'05), 2005.

[Platt, 1999] J. Platt. Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In Adv. in Large Margin Classifiers, 1999.

[Provost and Domingos, 2003] F. Provost and P. Domingos. Tree induction for probability-based rankings. Machine Learning, 2003.

[Robertson et al., 1988] T. Robertson, F. Wright, and R. Dykstra. Order Restricted Statistical Inference. John Wiley and Sons, New York, 1988.

[Rosset et al., 2004] Saharon Rosset, Ji Zhu, and Trevor Hastie. Boosting as a regularized path to a maximum margin classifier. J. Mach. Learn. Res., 5, 2004.

[Schapire et al., 1998] R. Schapire, Y. Freund, P. Bartlett, and W. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, 26(5):1651–1686, 1998.

[Zadrozny and Elkan, 2001] B. Zadrozny and C. Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In ICML, 2001.

[Zadrozny and Elkan, 2002] B. Zadrozny and C. Elkan. Transforming classifier scores into accurate multiclass probability estimates. In KDD, 2002.