1.6A PREDICTING GOOD PROBABILITIES WITH SUPERVISED LEARNING

Rich Caruana and Alexandru Niculescu-Mizil, Computer Science, Cornell University, Ithaca, New York

1. INTRODUCTION

This paper presents the results of an empirical evaluation of the probabilities predicted by seven supervised learning algorithms. The algorithms are SVMs, neural nets, decision trees, memory-based learning, bagged trees, boosted trees, and boosted stumps. For each algorithm we test many different variants and parameter settings: we compare ten styles of decision trees, neural nets of many sizes, SVMs using different kernels, etc. A total of 2000 models are tested on each problem.

Experiments with seven classification problems suggest that neural nets and bagged decision trees are the best learning methods for predicting well-calibrated probabilities. However, while SVMs and boosted trees are not well calibrated, they have excellent performance on other metrics such as accuracy and area under the ROC curve (AUC). We analyze the predictions made by these models and show that they are distorted in a specific and consistent way. To correct for this distortion, we experiment with two methods for calibrating probabilities:

Platt Scaling: a method for transforming SVM outputs from [−∞, +∞] to posterior probabilities (Platt, 1999).

Isotonic Regression: the method used by Elkan and Zadrozny to calibrate predictions from boosted naive bayes, SVM, and decision tree models (Zadrozny & Elkan, 2002; Zadrozny & Elkan, 2001).

Comparing the performance of the learning algorithms before and after calibration, we see that calibration significantly improves the performance of boosted trees and SVMs. After calibration, these two learning methods outperform neural nets and bagged decision trees and become the best learning methods for predicting calibrated posterior probabilities. Boosted stumps also benefit significantly from calibration, but their overall performance is not competitive. Not surprisingly, the two model types that were well calibrated to start with, neural nets and bagged trees, do not benefit from calibration.

2. METHODOLOGY

2.1. Learning Algorithms

This section summarizes the parameters used with each learning algorithm.

KNN: we use 26 values of K ranging from K = 1 to K = |trainset|. We use KNN with Euclidean distance and distance weighted by gain ratio. We also use distance weighted KNN, and locally weighted averaging.

ANN: we train neural nets with backprop, varying the number of hidden units {1,2,4,8,32,128} and the momentum {0,0.2,0.5,0.9}. We don't use validation sets to do weight decay or early stopping. Instead, we stop the nets at many different epochs so that some nets underfit or overfit.

Decision trees (DT): we vary the splitting criterion, pruning options, and smoothing (Laplacian or Bayesian smoothing). We use all of the tree models in Buntine's IND package: BAYES, ID3, CART, CART0, C4, MML, and SMML. We also generate trees of type C44LS (C4 with no pruning and Laplacian smoothing) (Provost & Domingos, 2003), C44BS (C44 with Bayesian smoothing), and MMLLS (MML with Laplacian smoothing).

Bagged trees (BAG-DT): we bag 25-100 trees of each tree type. Boosted trees (BST-DT): we boost each tree type. Boosting can overfit, so we use 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024 and 2048 steps of boosting. Boosted stumps (BST-STMP): we use stumps (single level decision trees) generated with 5 different splitting criteria, boosted for 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192 steps.

SVMs: we use the following kernels in SVM-Light (Joachims, 1999): linear, polynomial degree 2 & 3, and radial with width {0.001,0.005,0.01,0.05,0.1,0.5,1,2}, and vary the regularization parameter by factors of ten from 10^-7 to 10^3.

With ANNs, SVMs and KNNs we scale attributes to 0 mean 1 std. With DT, BAG-DT, BST-DT and BST-STMP we don't scale the data. In total, we train about 2000 different models on each test problem.
Table 1. Description of the test problems

PROBLEM     #ATTR   TRAIN SIZE  TEST SIZE  %POZ
ADULT       14/104  4000        35222      25%
COV_TYPE    54      4000        25000      36%
LETTER.P1   16      4000        14000      3%
LETTER.P2   16      4000        14000      53%
MEDIS       63      4000        8199       11%
SLAC        59      4000        25000      50%
HS          200     4000        4366       24%

2.2. Performance Metrics

Finding models that predict the true underlying probability for each test case would be optimal. Unfortunately, we usually do not know how to train models to predict true underlying probabilities. Either the correct parametric model type is not known, or the training sample is too small for model parameters to be estimated accurately, or there is noise in the data. Typically, all of these problems occur to varying degrees. Moreover, we usually don't have access to the true underlying probabilities. We only know if a case is positive or not, making it difficult to detect when a model predicts the true underlying probabilities.

Some performance metrics are minimized (in expectation) when the predicted value for each case is the true underlying probability of that case being positive. We call these probability metrics. The probability metrics we use are squared error (RMS), cross-entropy (MXE) and calibration (CAL). CAL measures the calibration of a model: if the model predicts 0.85 for a number of cases, it is well calibrated if 85% of those cases are positive. CAL is calculated as follows: order all cases by their predictions and put cases 1-100 in the same bin. Calculate the percentage of these cases that are true positives to estimate the true probability that these cases are positive. Then calculate the mean prediction for these cases. The absolute value of the difference between the observed frequency and the mean prediction is the calibration error for these 100 cases. Now take cases 2-101, 3-102, ..., and compute the errors in the same way. CAL is the mean of all these binned calibration errors.
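To make the three probability metrics concrete, the following short Python sketch computes RMS, MXE, and the sliding-window CAL measure exactly as described above (bins of 100 cases, slid one case at a time). This is our own minimal illustration, not code from the paper; the function names and the NumPy dependency are assumptions.

import numpy as np

def rms(p, y):
    # Root of the mean squared error between predictions p and 0/1 labels y.
    p, y = np.asarray(p, float), np.asarray(y, float)
    return float(np.sqrt(np.mean((p - y) ** 2)))

def mxe(p, y, eps=1e-12):
    # Mean cross-entropy between predictions p and 0/1 labels y.
    p = np.clip(np.asarray(p, float), eps, 1 - eps)
    y = np.asarray(y, float)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def cal(p, y, bin_size=100):
    # Order cases by prediction, take cases 1-100, 2-101, ..., and average
    # |mean prediction - observed fraction of positives| over all windows.
    order = np.argsort(p)
    p = np.asarray(p, float)[order]
    y = np.asarray(y, float)[order]
    errors = [abs(p[i:i + bin_size].mean() - y[i:i + bin_size].mean())
              for i in range(len(p) - bin_size + 1)]
    return float(np.mean(errors))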
Other metrics don't treat predicted values as probabilities, but still give insight into model quality. Two commonly used metrics are accuracy (ACC) and area under the ROC curve (AUC). Accuracy measures how well the model discriminates between classes. AUC is a measure of how good a model is at ordering the cases, i.e. predicting higher values for instances that have a higher probability of being positive. See (Provost & Fawcett, 1997) for a discussion of ROC from a machine learning perspective. AUC depends only on the ordering of the predictions, not the actual predicted values. If the ordering is preserved it makes no difference if the predicted values are between 0 and 1 or between 0.49 and 0.51.

2.3. Data Sets

We compare the algorithms on 7 binary classification problems. The data sets are summarized in Table 1.*

* Unfortunately, none of these are meteorology data.

3. CALIBRATION METHODS

3.1. Platt Calibration

Let the output of a learning method be f(x). To get calibrated probabilities, pass the output through a sigmoid:

    P(y = 1|f) = 1 / (1 + exp(Af + B))                                   (1)

where the parameters A and B are fitted using maximum likelihood estimation from a fitting training set (f_i, y_i). Gradient descent is used to find A and B such that they are the solution to:

    argmin_{A,B} { − Σ_i [ y_i log(p_i) + (1 − y_i) log(1 − p_i) ] },    (2)

where

    p_i = 1 / (1 + exp(A f_i + B))                                       (3)

Two questions arise: 1) where does the sigmoid training set (f_i, y_i) come from? 2) how to avoid overfitting to this training set?

One possible answer to question 1 is to use the same training set used for training the model: for each example (x_i, y_i) in the training set, use (f(x_i), y_i) as a training example for the sigmoid. Unfortunately, if the learning algorithm can learn complex models, this introduces unwanted bias in the sigmoid training set that can lead to poor results (Platt, 1999).

An alternate solution is to split the training data into a model training set and a calibration validation set. After the model is trained on the first set, the predictions on the validation set are used to fit the sigmoid. Cross-validation can be used to allow both the model and the sigmoid to be trained on the full data set. The training data is split into C parts. The model is learned using C-1 parts, while the C-th part is held aside for use as a calibration validation set. From each of the C validation sets we obtain a sigmoid training set that does not overlap with the model training set. The union of these C validation sets is used to fit the sigmoid parameters. Following Platt, all experiments in this paper use 3-fold cross-validation to estimate the sigmoid parameters.

As for the second question, an out-of-sample model is used to avoid overfitting to the sigmoid train set. If there are N+ positive examples and N− negative examples in the train set, Platt Calibration uses target values y+ and y− (instead of 1 and 0, respectively) for each training example, where

    y+ = (N+ + 1) / (N+ + 2);    y− = 1 / (N− + 2)                       (4)

For a more detailed treatment, and a justification of these particular target values, see (Platt, 1999). The middle row of Figure 1 shows sigmoids fitted with Platt Scaling on the seven test problems using 3-fold CV.
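As a concrete illustration of equations (1)-(4), the sketch below fits the sigmoid parameters A and B on a calibration set of raw model outputs and 0/1 labels, using the smoothed targets y+ and y−. It is a minimal version written by us with NumPy/SciPy; the paper does not prescribe this particular optimizer, and in the paper's setup the inputs would be the pooled predictions from the three held-out cross-validation folds.

import numpy as np
from scipy.optimize import minimize

def fit_platt(f, y):
    # Fit A, B of eq. (1) by minimizing the negative log-likelihood of eq. (2),
    # with the out-of-sample target values y+ and y- of eq. (4).
    f = np.asarray(f, float)
    y = np.asarray(y, float)
    n_pos = y.sum()
    n_neg = len(y) - n_pos
    t = np.where(y == 1, (n_pos + 1.0) / (n_pos + 2.0), 1.0 / (n_neg + 2.0))

    def nll(params):
        A, B = params
        p = 1.0 / (1.0 + np.exp(A * f + B))          # eq. (3)
        p = np.clip(p, 1e-12, 1 - 1e-12)
        return -np.sum(t * np.log(p) + (1 - t) * np.log(1 - p))

    A, B = minimize(nll, x0=[0.0, 0.0], method="Nelder-Mead").x
    return A, B

def platt_transform(f, A, B):
    # Map raw model outputs to calibrated probabilities with eq. (1).
    return 1.0 / (1.0 + np.exp(A * np.asarray(f, float) + B))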

3.2. Isotonic Regression

An alternative to Platt Calibration is Isotonic Regression (Robertson et al., 1988). Zadrozny and Elkan used Isotonic Regression to calibrate predictions made by SVMs, Naive Bayes, boosted Naive Bayes, and decision trees (Zadrozny & Elkan, 2002; Zadrozny & Elkan, 2001).

The basic assumption in Isotonic Regression is:

    y_i = m(f_i) + ε_i                                                   (5)

where m is an isotonic (monotonically increasing) function. Then, given a train set (f_i, y_i), the Isotonic Regression problem is finding the isotonic function m̂ such that

    m̂ = argmin_z Σ_i (y_i − z(f_i))^2                                    (6)

One algorithm for Isotonic Regression is pair-adjacent violators (PAV) (Ayer et al., 1955), presented in Table 3. PAV finds a stepwise constant solution for the Isotonic Regression problem.

Table 3. PAV Algorithm
Algorithm 1. PAV algorithm for estimating posterior probabilities from uncalibrated model predictions.
1  Input: training set (f_i, y_i) sorted according to f_i
2  Initialize m̂_{i,i} = y_i, w_{i,i} = 1
3  While ∃ i s.t. m̂_{k,i−1} ≥ m̂_{i,l}
       Set w_{k,l} = w_{k,i−1} + w_{i,l}
       Set m̂_{k,l} = (w_{k,i−1} m̂_{k,i−1} + w_{i,l} m̂_{i,l}) / w_{k,l}
       Replace m̂_{k,i−1} and m̂_{i,l} with m̂_{k,l}
4  Output the stepwise constant function generated by m̂

As in the case of Platt Calibration, if we use the model training set (x_i, y_i) to get the training set (f(x_i), y_i) for Isotonic Regression, we introduce unwanted bias. The same methods discussed in Section 3.1 can be used to get an unbiased training set. For the experiments with Isotonic Regression we again use the 3-fold CV methodology used with Platt Scaling. The bottom row of Figure 1 shows functions fitted with Isotonic Regression for the seven test problems.
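The PAV procedure in Table 3 can be implemented as a single pooling pass over the sorted calibration set. The sketch below is our own reading of the algorithm (the helper names are ours): adjacent blocks that violate monotonicity are merged into weighted averages, and the result is returned as a stepwise-constant function that can be applied to new predictions.

import numpy as np

def pav_fit(f, y):
    # Pair-adjacent violators (Table 3): pool adjacent blocks until the
    # fitted values are monotonically non-decreasing in f.
    order = np.argsort(f)
    f = np.asarray(f, float)[order]
    y = np.asarray(y, float)[order]
    blocks = [[1.0, v] for v in y]                   # [weight, mean] per block
    i = 0
    while i < len(blocks) - 1:
        if blocks[i][1] >= blocks[i + 1][1]:         # adjacent violator: pool
            w = blocks[i][0] + blocks[i + 1][0]
            m = (blocks[i][0] * blocks[i][1]
                 + blocks[i + 1][0] * blocks[i + 1][1]) / w
            blocks[i:i + 2] = [[w, m]]
            i = max(i - 1, 0)                        # re-check the previous pair
        else:
            i += 1
    weights = np.array([w for w, _ in blocks], dtype=int)
    values = np.array([m for _, m in blocks])
    starts = f[np.cumsum(weights)[:-1]]              # f-value where each later block begins

    def transform(f_new):
        # Apply the fitted stepwise-constant function to new raw outputs.
        idx = np.searchsorted(starts, np.asarray(f_new, float), side="right")
        return values[idx]

    return transform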
4. EMPIRICAL RESULTS

Table 2 shows the average performance of the learning algorithms on the seven test problems. For each problem, we select the best model trained with each learning algorithm using a 1K validation set and report its performance on large final test sets. The learning methods with the best performance on the probability metrics (RMS, MXE, and CAL) are neural nets and bagged decision trees. The learning methods with the poorest performance are SVMs†, boosted stumps, and boosted decision trees. Interestingly, although SVMs and the boosted models predict poor probabilities, they outperform neural nets and bagged trees on accuracy and AUC. This suggests that SVMs and the boosted models are learning good models, but their predictions are distorted and thus have poor calibration.

† SVM predictions are scaled to [0,1] by (x − min)/(max − min).

Table 2. Performance of learning algorithms prior to calibration

MODEL      ACC     AUC     RMS     MXE     CAL
ANN        0.8720  0.9033  0.2805  0.4143  0.0233
BAG-DT     0.8728  0.9089  0.2818  0.4050  0.0314
KNN        0.8688  0.8970  0.2861  0.4367  0.0270
DT         0.8433  0.8671  0.3211  0.5019  0.0346
SVM        0.8745  0.9067  0.3390  0.5767  0.0765
BST-STMP   0.8470  0.8866  0.3659  0.6241  0.0502
BST-DT     0.8828  0.9138  0.3050  0.4810  0.0542

Model calibration can be visualized through reliability diagrams (DeGroot & Fienberg, 1982). To construct a reliability diagram, the prediction space is discretized into ten bins. Cases with predicted value between 0 and 0.1 fall in the first bin, between 0.1 and 0.2 in the second bin, etc. For each bin, the mean predicted value is plotted against the true fraction of positive cases. If the model is well calibrated the points will fall near the diagonal line (a sketch of this construction follows the next paragraph).

Figure 1 shows histograms of predicted values and reliability diagrams for boosted trees after 1024 steps of boosting on the seven test problems. The results are for large test sets not used for training or validation. For six of the seven data sets the predicted values after boosting do not approach 0 or 1. The one exception is LETTER.P1, a highly skewed data set that has only 3% positive class. On this problem some of the predicted values do approach 0, though careful examination of the histogram shows that even on this problem there is a sharp drop in the number of cases predicted to have probability near 0.
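The ten-bin construction described above is easy to reproduce; the following sketch (ours, with a hypothetical function name) returns the points that would be plotted in a reliability diagram, skipping empty bins. A well-calibrated model yields points close to the diagonal, i.e. mean predicted value roughly equal to the fraction of positives in each bin.

import numpy as np

def reliability_diagram(p, y, n_bins=10):
    # Mean predicted value vs. observed fraction of positives in each of
    # n_bins equal-width bins over [0, 1]; empty bins are skipped.
    p = np.asarray(p, float)
    y = np.asarray(y, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    mean_pred, frac_pos = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (p >= lo) & ((p < hi) if hi < 1.0 else (p <= hi))
        if in_bin.any():
            mean_pred.append(p[in_bin].mean())
            frac_pos.append(y[in_bin].mean())
    return np.array(mean_pred), np.array(frac_pos)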

Figure 1. Histograms of predicted values and reliability diagrams for boosted decision trees on the seven test problems (COV_TYPE, ADULT, LETTER.P1, LETTER.P2, MEDIS, SLAC, HS). Reliability diagrams plot Mean Predicted Value against Fraction of Positives.

Table 4. Squared error and cross-entropy performance of learning algorithms

                 SQUARED ERROR               CROSS-ENTROPY
ALGORITHM    RAW     PLATT   ISOTONIC     RAW     PLATT   ISOTONIC
BST-DT       0.3050  0.2650  0.2652       0.4810  0.3727  0.3745
SVM          0.3303  0.2727  0.2719       0.5767  0.3988  0.3984
BAG-DT       0.2818  0.2815  0.2799       0.4050  0.4082  0.3996
ANN          0.2805  0.2821  0.2806       0.4143  0.4229  0.4120
KNN          0.2861  0.2871  0.2839       0.4367  0.4300  0.4186
BST-STMP     0.3659  0.3098  0.3096       0.6241  0.4713  0.4734
DT           0.3211  0.3212  0.3145       0.5019  0.5091  0.4865

The reliability plots in Figure 1 display roughly sigmoid-shaped reliability diagrams, motivating the use of a sigmoid to transform predictions into calibrated probabilities. The reliability plots in the middle row of the figure also show sigmoids fitted using Platt's method. The reliability plots in the bottom of the figure show the function fitted with Isotonic Regression.

To show how calibration transforms the predictions, we plot histograms and reliability diagrams for the seven problems for boosted trees after 1024 steps of boosting, after Platt Calibration (Figure 2) and after Isotonic Regression (Figure 3). The reliability diagrams for Isotonic Regression are very similar to the ones for Platt Scaling, so we omit them in the interest of space. The figures show that calibration undoes the shift in probability mass caused by boosting: after calibration many more cases have predicted probabilities near 0 and 1. The reliability diagrams are closer to the diagonal, and the S shape characteristic of boosting's predictions is gone. On each problem, transforming the predictions using either Platt Scaling or Isotonic Regression yields a significant improvement in the quality of the predicted probabilities, leading to much lower squared error and cross-entropy. The main difference between Isotonic Regression and Platt Scaling for boosting can be seen when comparing the histograms in the two figures. Because Isotonic Regression generates a piecewise constant function, the histograms are quite coarse, while the histograms generated by Platt Scaling are smooth and easier to interpret.

Table 4 compares the RMS and MXE performance of the learning methods before and after calibration. Figure 4 shows the squared error results from Table 4 graphically. After calibration with Platt Scaling or Isotonic Regression, boosted decision trees have better squared error and cross-entropy than the other learning methods. The next best methods are SVMs, bagged decision trees and neural nets. While Platt Scaling and Isotonic Regression significantly improve the performance of the SVM models, they have little or no effect on the performance of bagged decision trees and neural nets. While neural nets and bagged trees yield better probabilities before calibration, Platt Scaling or Isotonic Regression improve the calibration of the maximum margin methods enough for boosted trees and SVMs to become the best methods for predicting good probabilities once calibrated.
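For readers who want to reproduce this kind of before/after comparison without implementing the calibration methods themselves, the sketch below uses scikit-learn, which packages 3-fold sigmoid (Platt) and isotonic calibration behind CalibratedClassifierCV. This is our own illustration on synthetic data with scikit-learn's gradient boosting, not the boosted trees, data sets, or code used in the paper, so the numbers it prints will not match Table 4.

from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for one of the binary problems (4000 training cases).
X, y = make_classification(n_samples=8000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=4000, random_state=0)

models = {
    "raw": GradientBoostingClassifier(n_estimators=256, random_state=0).fit(X_tr, y_tr),
    "platt": CalibratedClassifierCV(
        GradientBoostingClassifier(n_estimators=256, random_state=0),
        method="sigmoid", cv=3).fit(X_tr, y_tr),
    "isotonic": CalibratedClassifierCV(
        GradientBoostingClassifier(n_estimators=256, random_state=0),
        method="isotonic", cv=3).fit(X_tr, y_tr),
}
for name, model in models.items():
    p = model.predict_proba(X_te)[:, 1]
    print(name, "squared error:", round(brier_score_loss(y_te, p), 4))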

Figure 2. Histograms of predicted values and reliability diagrams for boosted trees calibrated with Platt’s method.


Figure 3. Histograms of predicted values for boosted trees calibrated with Isotonic Regression.

Figure 4. Squared error performance of the learning algorithms (BST-DT, SVM, BAG-DT, ANN, KNN, BST-STMP, DT) with raw predictions, Platt Scaling, and Isotonic Regression.

Acknowledgements

Thanks to B. Zadrozny and C. Elkan for the Isotonic Regression code, to C. Young at Stanford Linear Accelerator for the SLAC data, and to T. Gualtieri at Goddard Space Center for help with the Indian Pines Data. This work was supported by NSF Grant IIS-0412930.

References

Ayer, M., Brunk, H., Ewing, G., Reid, W., & Silverman, E. (1955). An empirical distribution function for sampling with incomplete information. Annals of Mathematical Statistics, 5, 641–647.

DeGroot, M., & Fienberg, S. (1982). The comparison and evaluation of forecasters. Statistician, 32, 12–22.

Joachims, T. (1999). Making large-scale SVM learning practical. Advances in Kernel Methods.

Platt, J. (1999). Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. Advances in Large Margin Classifiers (pp. 61–74).

Provost, F., & Domingos, P. (2003). Tree induction for probability-based rankings. Machine Learning, 52.

Provost, F. J., & Fawcett, T. (1997). Analysis and visualization of classifier performance: Comparison under imprecise class and cost distributions. Knowledge Discovery and Data Mining (pp. 43–48).

Robertson, T., Wright, F., & Dykstra, R. (1988). Order restricted statistical inference. New York: John Wiley and Sons.

Zadrozny, B., & Elkan, C. (2001). Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers. ICML (pp. 609–616).

Zadrozny, B., & Elkan, C. (2002). Transforming classifier scores into accurate multiclass probability estimates. KDD (pp. 694–699).