1.6A PREDICTING GOOD PROBABILITIES WITH SUPERVISED LEARNING

Rich Caruana and Alexandru Niculescu-Mizil, Computer Science, Cornell University, Ithaca, New York

1. INTRODUCTION

This paper presents the results of an empirical evaluation of the probabilities predicted by seven supervised learning algorithms. The algorithms are SVMs, neural nets, decision trees, memory-based learning, bagged trees, boosted trees, and boosted stumps. For each algorithm we test many different variants and parameter settings: we compare ten styles of decision trees, neural nets of many sizes, SVMs using different kernels, etc. A total of 2000 models are tested on each problem.

Experiments with seven classification problems suggest that neural nets and bagged decision trees are the best learning methods for predicting well-calibrated probabilities. However, while SVMs and boosted trees are not well calibrated, they have excellent performance on other metrics such as accuracy and area under the ROC curve (AUC). We analyze the predictions made by these models and show that they are distorted in a specific and consistent way. To correct for this distortion, we experiment with two methods for calibrating probabilities:

Platt Scaling: a method for transforming SVM outputs from [−∞, +∞] to posterior probabilities (Platt, 1999).

Isotonic Regression: the method used by Elkan and Zadrozny to calibrate predictions from boosted naive bayes, SVM, and decision tree models (Zadrozny & Elkan, 2002; Zadrozny & Elkan, 2001).

Comparing the performance of the learning algorithms before and after calibration, we see that calibration significantly improves the performance of boosted trees and SVMs. After calibration, these two learning methods outperform neural nets and bagged decision trees and become the best learning methods for predicting calibrated posterior probabilities. Boosted stumps also benefit significantly from calibration, but their overall performance is not competitive. Not surprisingly, the two model types that were well calibrated to start with, neural nets and bagged trees, do not benefit from calibration.

2. METHODOLOGY

2.1. Learning Algorithms

This section summarizes the parameters used with each learning algorithm.

KNN: we use 26 values of K ranging from K = 1 to K = |trainset|. We use KNN with Euclidean distance and distance weighted by gain ratio. We also use distance weighted KNN, and locally weighted averaging.

ANN: we train neural nets with backprop, varying the number of hidden units {1,2,4,8,32,128} and the momentum {0,0.2,0.5,0.9}. We don't use validation sets to do weight decay or early stopping. Instead, we stop the nets at many different epochs so that some nets underfit or overfit.

Decision trees (DT): we vary the splitting criterion, pruning options, and smoothing (Laplacian or Bayesian smoothing). We use all of the tree models in Buntine's IND package: BAYES, ID3, CART, CART0, C4, MML, and SMML. We also generate trees of type C44LS (C4 with no pruning and Laplacian smoothing) (Provost & Domingos, 2003), C44BS (C44 with Bayesian smoothing), and MMLLS (MML with Laplacian smoothing).

Bagged trees (BAG-DT): we bag 25-100 trees of each tree type. Boosted trees (BST-DT): we boost each tree type. Boosting can overfit, so we use 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024 and 2048 steps of boosting. Boosted stumps (BST-STMP): we use stumps (single level decision trees) generated with 5 different splitting criteria, boosted for 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192 steps.

SVMs: we use the following kernels in SVM-Light (Joachims, 1999): linear, polynomial degree 2 & 3, and radial with width {0.001,0.005,0.01,0.05,0.1,0.5,1,2}, and vary the regularization parameter by factors of ten from 10^-7 to 10^3.

With ANNs, SVMs and KNNs we scale attributes to 0 mean 1 std. With DT, BAG-DT, BST-DT and BST-STMP we don't scale the data. In total, we train about 2000 different models on each test problem.
Table 1. Description of the test problems

PROBLEM     #ATTR   TRAIN SIZE  TEST SIZE  %POZ
ADULT       14/104  4000        35222      25%
COV_TYPE    54      4000        25000      36%
LETTER.P1   16      4000        14000      3%
LETTER.P2   16      4000        14000      53%
MEDIS       63      4000        8199       11%
SLAC        59      4000        25000      50%
HS          200     4000        4366       24%

2.2. Performance Metrics

Finding models that predict the true underlying probability for each test case would be optimal. Unfortunately, we usually do not know how to train models to predict true underlying probabilities. Either the correct parametric model type is not known, or the training sample is too small for model parameters to be estimated accurately, or there is noise in the data. Typically, all of these problems occur to varying degrees. Moreover, we usually don't have access to the true underlying probabilities. We only know if a case is positive or not, making it difficult to detect when a model predicts the true underlying probabilities.

Some performance metrics are minimized (in expectation) when the predicted value for each case is the true underlying probability of that case being positive. We call these probability metrics. The probability metrics we use are squared error (RMS), cross-entropy (MXE) and calibration (CAL). CAL measures the calibration of a model: if the model predicts 0.85 for a number of cases, it is well calibrated if 85% of those cases are positive. CAL is calculated as follows: order all cases by their predictions and put cases 1-100 in the same bin. Calculate the percentage of these cases that are true positives to estimate the true probability that these cases are positive. Then calculate the mean prediction for these cases. The absolute value of the difference between the observed frequency and the mean prediction is the calibration error for these 100 cases. Now take cases 2-101, 3-102, ..., and compute the errors in the same way. CAL is the mean of all these binned calibration errors.
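To make the three probability metrics concrete, the following short Python sketch computes RMS, MXE, and the sliding-window CAL measure exactly as described above (bins of 100 cases, slid one case at a time). This is our own minimal illustration, not code from the paper; the function names and the NumPy dependency are assumptions.

import numpy as np

def rms(p, y):
    # Root of the mean squared error between predictions p and 0/1 labels y.
    p, y = np.asarray(p, float), np.asarray(y, float)
    return float(np.sqrt(np.mean((p - y) ** 2)))

def mxe(p, y, eps=1e-12):
    # Mean cross-entropy between predictions p and 0/1 labels y.
    p = np.clip(np.asarray(p, float), eps, 1 - eps)
    y = np.asarray(y, float)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def cal(p, y, bin_size=100):
    # Order cases by prediction, take cases 1-100, 2-101, ..., and average
    # |mean prediction - observed fraction of positives| over all windows.
    order = np.argsort(p)
    p = np.asarray(p, float)[order]
    y = np.asarray(y, float)[order]
    errors = [abs(p[i:i + bin_size].mean() - y[i:i + bin_size].mean())
              for i in range(len(p) - bin_size + 1)]
    return float(np.mean(errors))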
Other metrics don't treat predicted values as probabilities, but still give insight into model quality. Two commonly used metrics are accuracy (ACC) and area under the ROC curve (AUC). Accuracy measures how well the model discriminates between classes. AUC is a measure of how good a model is at ordering the cases, i.e. predicting higher values for instances that have a higher probability of being positive. See (Provost & Fawcett, 1997) for a discussion of ROC from a machine learning perspective. AUC depends only on the ordering of the predictions, not the actual predicted values. If the ordering is preserved it makes no difference if the predicted values are between 0 and 1 or between 0.49 and 0.51.

2.3. Data Sets

We compare the algorithms on 7 binary classification problems. The data sets are summarized in Table 1.*

* Unfortunately, none of these are meteorology data.

3. CALIBRATION METHODS

3.1. Platt Calibration

Let the output of a learning method be f(x). To get calibrated probabilities, pass the output through a sigmoid:

    P(y = 1|f) = 1 / (1 + exp(Af + B))                                   (1)

where the parameters A and B are fitted using maximum likelihood estimation from a fitting training set (f_i, y_i). Gradient descent is used to find A and B such that they are the solution to:

    argmin_{A,B} { − Σ_i [ y_i log(p_i) + (1 − y_i) log(1 − p_i) ] },    (2)

where

    p_i = 1 / (1 + exp(A f_i + B))                                       (3)

Two questions arise: 1) where does the sigmoid training set (f_i, y_i) come from? 2) how to avoid overfitting to this training set?

One possible answer to question 1 is to use the same training set used for training the model: for each example (x_i, y_i) in the training set, use (f(x_i), y_i) as a training example for the sigmoid. Unfortunately, if the learning algorithm can learn complex models, this introduces unwanted bias in the sigmoid training set that can lead to poor results (Platt, 1999).

An alternate solution is to split the training data into a model training set and a calibration validation set. After the model is trained on the first set, the predictions on the validation set are used to fit the sigmoid. Cross-validation can be used to allow both the model and the sigmoid to be trained on the full data set. The training data is split into C parts. The model is learned using C-1 parts, while the C-th part is held aside for use as a calibration validation set. From each of the C validation sets we obtain a sigmoid training set that does not overlap with the model training set. The union of these C validation sets is used to fit the sigmoid parameters. Following Platt, all experiments in this paper use 3-fold cross-validation to estimate the sigmoid parameters.

As for the second question, an out-of-sample model is used to avoid overfitting to the sigmoid train set. If there are N+ positive examples and N− negative examples in the train set, Platt Calibration uses target values y+ and y− (instead of 1 and 0, respectively) for each training example, where

    y+ = (N+ + 1) / (N+ + 2);    y− = 1 / (N− + 2)                       (4)

For a more detailed treatment, and a justification of these particular target values, see (Platt, 1999). The middle row of Figure 1 shows sigmoids fitted with Platt Scaling on the seven test problems using 3-fold CV.
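As a concrete illustration of equations (1)-(4), the sketch below fits the sigmoid parameters A and B on a calibration set of raw model outputs and 0/1 labels, using the smoothed targets y+ and y−. It is a minimal version written by us with NumPy/SciPy; the paper does not prescribe this particular optimizer, and in the paper's setup the inputs would be the pooled predictions from the three held-out cross-validation folds.

import numpy as np
from scipy.optimize import minimize

def fit_platt(f, y):
    # Fit A, B of eq. (1) by minimizing the negative log-likelihood of eq. (2),
    # with the out-of-sample target values y+ and y- of eq. (4).
    f = np.asarray(f, float)
    y = np.asarray(y, float)
    n_pos = y.sum()
    n_neg = len(y) - n_pos
    t = np.where(y == 1, (n_pos + 1.0) / (n_pos + 2.0), 1.0 / (n_neg + 2.0))

    def nll(params):
        A, B = params
        p = 1.0 / (1.0 + np.exp(A * f + B))          # eq. (3)
        p = np.clip(p, 1e-12, 1 - 1e-12)
        return -np.sum(t * np.log(p) + (1 - t) * np.log(1 - p))

    A, B = minimize(nll, x0=[0.0, 0.0], method="Nelder-Mead").x
    return A, B

def platt_transform(f, A, B):
    # Map raw model outputs to calibrated probabilities with eq. (1).
    return 1.0 / (1.0 + np.exp(A * np.asarray(f, float) + B))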

3.2. Isotonic Regression

An alternative to Platt Calibration is Isotonic Regression (Robertson et al., 1988). Zadrozny and Elkan used Isotonic Regression to calibrate predictions made by SVMs, Naive Bayes, boosted Naive Bayes, and decision trees (Zadrozny & Elkan, 2002; Zadrozny & Elkan, 2001).

The basic assumption in Isotonic Regression is:

    y_i = m(f_i) + ε_i                                                   (5)

where m is an isotonic (monotonically increasing) function. Then, given a train set (f_i, y_i), the Isotonic Regression problem is finding the isotonic function m̂ such that

    m̂ = argmin_z Σ_i (y_i − z(f_i))^2                                    (6)

One algorithm for Isotonic Regression is pair-adjacent violators (PAV) (Ayer et al., 1955), presented in Table 3. PAV finds a stepwise constant solution for the Isotonic Regression problem.

Table 3. PAV Algorithm
Algorithm 1. PAV algorithm for estimating posterior probabilities from uncalibrated model predictions.
1  Input: training set (f_i, y_i) sorted according to f_i
2  Initialize m̂_{i,i} = y_i, w_{i,i} = 1
3  While ∃ i s.t. m̂_{k,i−1} ≥ m̂_{i,l}
       Set w_{k,l} = w_{k,i−1} + w_{i,l}
       Set m̂_{k,l} = (w_{k,i−1} m̂_{k,i−1} + w_{i,l} m̂_{i,l}) / w_{k,l}
       Replace m̂_{k,i−1} and m̂_{i,l} with m̂_{k,l}
4  Output the stepwise constant function generated by m̂

As in the case of Platt Calibration, if we use the model training set (x_i, y_i) to get the training set (f(x_i), y_i) for Isotonic Regression, we introduce unwanted bias. The same methods discussed in Section 3.1 can be used to get an unbiased training set. For the experiments with Isotonic Regression we again use the 3-fold CV methodology used with Platt Scaling. The bottom row of Figure 1 shows functions fitted with Isotonic Regression for the seven test problems.
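The PAV procedure in Table 3 can be implemented as a single pooling pass over the sorted calibration set. The sketch below is our own reading of the algorithm (the helper names are ours): adjacent blocks that violate monotonicity are merged into weighted averages, and the result is returned as a stepwise-constant function that can be applied to new predictions.

import numpy as np

def pav_fit(f, y):
    # Pair-adjacent violators (Table 3): pool adjacent blocks until the
    # fitted values are monotonically non-decreasing in f.
    order = np.argsort(f)
    f = np.asarray(f, float)[order]
    y = np.asarray(y, float)[order]
    blocks = [[1.0, v] for v in y]                   # [weight, mean] per block
    i = 0
    while i < len(blocks) - 1:
        if blocks[i][1] >= blocks[i + 1][1]:         # adjacent violator: pool
            w = blocks[i][0] + blocks[i + 1][0]
            m = (blocks[i][0] * blocks[i][1]
                 + blocks[i + 1][0] * blocks[i + 1][1]) / w
            blocks[i:i + 2] = [[w, m]]
            i = max(i - 1, 0)                        # re-check the previous pair
        else:
            i += 1
    weights = np.array([w for w, _ in blocks], dtype=int)
    values = np.array([m for _, m in blocks])
    starts = f[np.cumsum(weights)[:-1]]              # f-value where each later block begins

    def transform(f_new):
        # Apply the fitted stepwise-constant function to new raw outputs.
        idx = np.searchsorted(starts, np.asarray(f_new, float), side="right")
        return values[idx]

    return transform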
4. EMPIRICAL RESULTS

Table 2 shows the average performance of the learning algorithms on the seven test problems. For each problem, we select the best model trained with each learning algorithm using a 1K validation set and report its performance on large final test sets. The learning methods with the best performance on the probability metrics (RMS, MXE, and CAL) are neural nets and bagged decision trees. The learning methods with the poorest performance are SVMs†, boosted stumps, and boosted decision trees. Interestingly, although SVMs and the boosted models predict poor probabilities, they outperform neural nets and bagged trees on accuracy and AUC. This suggests that SVMs and the boosted models are learning good models, but their predictions are distorted and thus have poor calibration.

† SVM predictions are scaled to [0,1] by (x − min)/(max − min).

Table 2. Performance of learning algorithms prior to calibration

MODEL      ACC     AUC     RMS     MXE     CAL
ANN        0.8720  0.9033  0.2805  0.4143  0.0233
BAG-DT     0.8728  0.9089  0.2818  0.4050  0.0314
KNN        0.8688  0.8970  0.2861  0.4367  0.0270
DT         0.8433  0.8671  0.3211  0.5019  0.0346
SVM        0.8745  0.9067  0.3390  0.5767  0.0765
BST-STMP   0.8470  0.8866  0.3659  0.6241  0.0502
BST-DT     0.8828  0.9138  0.3050  0.4810  0.0542

Model calibration can be visualized through reliability diagrams (DeGroot & Fienberg, 1982). To construct a reliability diagram, the prediction space is discretized into ten bins. Cases with predicted value between 0 and 0.1 fall in the first bin, between 0.1 and 0.2 in the second bin, etc. For each bin, the mean predicted value is plotted against the true fraction of positive cases. If the model is well calibrated the points will fall near the diagonal line (a sketch of this construction follows the next paragraph).

Figure 1 shows histograms of predicted values and reliability diagrams for boosted trees after 1024 steps of boosting on the seven test problems. The results are for large test sets not used for training or validation. For six of the seven data sets the predicted values after boosting do not approach 0 or 1. The one exception is LETTER.P1, a highly skewed data set that has only 3% positive class. On this problem some of the predicted values do approach 0, though careful examination of the histogram shows that even on this problem there is a sharp drop in the number of cases predicted to have probability near 0.
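The ten-bin construction described above is easy to reproduce; the following sketch (ours, with a hypothetical function name) returns the points that would be plotted in a reliability diagram, skipping empty bins. A well-calibrated model yields points close to the diagonal, i.e. mean predicted value roughly equal to the fraction of positives in each bin.

import numpy as np

def reliability_diagram(p, y, n_bins=10):
    # Mean predicted value vs. observed fraction of positives in each of
    # n_bins equal-width bins over [0, 1]; empty bins are skipped.
    p = np.asarray(p, float)
    y = np.asarray(y, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    mean_pred, frac_pos = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (p >= lo) & ((p < hi) if hi < 1.0 else (p <= hi))
        if in_bin.any():
            mean_pred.append(p[in_bin].mean())
            frac_pos.append(y[in_bin].mean())
    return np.array(mean_pred), np.array(frac_pos)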

Figure 1. Histograms of predicted values and reliability diagrams for boosted decision trees on the seven test problems (COV_TYPE, ADULT, LETTER.P1, LETTER.P2, MEDIS, SLAC, HS). Reliability diagrams plot Mean Predicted Value against Fraction of Positives.

Table 4. Squared error and cross-entropy performance of learning algorithms

                 SQUARED ERROR               CROSS-ENTROPY
ALGORITHM    RAW     PLATT   ISOTONIC     RAW     PLATT   ISOTONIC
BST-DT       0.3050  0.2650  0.2652       0.4810  0.3727  0.3745
SVM          0.3303  0.2727  0.2719       0.5767  0.3988  0.3984
BAG-DT       0.2818  0.2815  0.2799       0.4050  0.4082  0.3996
ANN          0.2805  0.2821  0.2806       0.4143  0.4229  0.4120
KNN          0.2861  0.2871  0.2839       0.4367  0.4300  0.4186
BST-STMP     0.3659  0.3098  0.3096       0.6241  0.4713  0.4734
DT           0.3211  0.3212  0.3145       0.5019  0.5091  0.4865

The reliability plots in Figure 1 display roughly sigmoid-shaped reliability diagrams, motivating the use of a sigmoid to transform predictions into calibrated probabilities. The reliability plots in the middle row of the figure also show sigmoids fitted using Platt's method. The reliability plots in the bottom of the figure show the function fitted with Isotonic Regression.

To show how calibration transforms the predictions, we plot histograms and reliability diagrams for the seven problems for boosted trees after 1024 steps of boosting, after Platt Calibration (Figure 2) and after Isotonic Regression (Figure 3). The reliability diagrams for Isotonic Regression are very similar to the ones for Platt Scaling, so we omit them in the interest of space. The figures show that calibration undoes the shift in probability mass caused by boosting: after calibration many more cases have predicted probabilities near 0 and 1. The reliability diagrams are closer to the diagonal, and the S shape characteristic of boosting's predictions is gone. On each problem, transforming the predictions using either Platt Scaling or Isotonic Regression yields a significant improvement in the quality of the predicted probabilities, leading to much lower squared error and cross-entropy. The main difference between Isotonic Regression and Platt Scaling for boosting can be seen when comparing the histograms in the two figures. Because Isotonic Regression generates a piecewise constant function, the histograms are quite coarse, while the histograms generated by Platt Scaling are smooth and easier to interpret.

Table 4 compares the RMS and MXE performance of the learning methods before and after calibration. Figure 4 shows the squared error results from Table 4 graphically. After calibration with Platt Scaling or Isotonic Regression, boosted decision trees have better squared error and cross-entropy than the other learning methods. The next best methods are SVMs, bagged decision trees and neural nets. While Platt Scaling and Isotonic Regression significantly improve the performance of the SVM models, they have little or no effect on the performance of bagged decision trees and neural nets. While neural nets and bagged trees yield better probabilities before calibration, Platt Scaling or Isotonic Regression improve the calibration of the maximum margin methods enough for boosted trees and SVMs to become the best methods for predicting good probabilities once calibrated.
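For readers who want to reproduce this kind of before/after comparison without implementing the calibration methods themselves, the sketch below uses scikit-learn, which packages 3-fold sigmoid (Platt) and isotonic calibration behind CalibratedClassifierCV. This is our own illustration on synthetic data with scikit-learn's gradient boosting, not the boosted trees, data sets, or code used in the paper, so the numbers it prints will not match Table 4.

from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for one of the binary problems (4000 training cases).
X, y = make_classification(n_samples=8000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=4000, random_state=0)

models = {
    "raw": GradientBoostingClassifier(n_estimators=256, random_state=0).fit(X_tr, y_tr),
    "platt": CalibratedClassifierCV(
        GradientBoostingClassifier(n_estimators=256, random_state=0),
        method="sigmoid", cv=3).fit(X_tr, y_tr),
    "isotonic": CalibratedClassifierCV(
        GradientBoostingClassifier(n_estimators=256, random_state=0),
        method="isotonic", cv=3).fit(X_tr, y_tr),
}
for name, model in models.items():
    p = model.predict_proba(X_te)[:, 1]
    print(name, "squared error:", round(brier_score_loss(y_te, p), 4))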

Figure 2. Histograms of predicted values and reliability diagrams for boosted trees calibrated with Platt’s method.


Figure 3. Histograms of predicted values for boosted trees calibrated with Isotonic Regression.

Figure 4. Squared error performance of the learning algorithms (BST-DT, SVM, BAG-DT, ANN, KNN, BST-STMP, DT) with raw predictions, Platt Scaling, and Isotonic Regression.

Acknowledgements

Thanks to B. Zadrozny and C. Elkan for the Isotonic Regression code, to C. Young at Stanford Linear Accelerator for the SLAC data, and to T. Gualtieri at Goddard Space Center for help with the Indian Pines Data. This work was supported by NSF Grant IIS-0412930.

References

Ayer, M., Brunk, H., Ewing, G., Reid, W., & Silverman, E. (1955). An empirical distribution function for sampling with incomplete information. Annals of Mathematical Statistics, 5, 641–647.

DeGroot, M., & Fienberg, S. (1982). The comparison and evaluation of forecasters. Statistician, 32, 12–22.

Joachims, T. (1999). Making large-scale SVM learning practical. Advances in Kernel Methods.

Platt, J. (1999). Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. Advances in Large Margin Classifiers (pp. 61–74).

Provost, F., & Domingos, P. (2003). Tree induction for probability-based rankings. Machine Learning, 52.

Provost, F. J., & Fawcett, T. (1997). Analysis and visualization of classifier performance: Comparison under imprecise class and cost distributions. Knowledge Discovery and Data Mining (pp. 43–48).

Robertson, T., Wright, F., & Dykstra, R. (1988). Order restricted statistical inference. New York: John Wiley and Sons.

Zadrozny, B., & Elkan, C. (2001). Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers. ICML (pp. 609–616).

Zadrozny, B., & Elkan, C. (2002). Transforming classifier scores into accurate multiclass probability estimates. KDD (pp. 694–699).