Beta Calibration: a Well-Founded and Easily Implemented Improvement on Logistic Calibration for Binary Classiﬁers

Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers Meelis Kull Telmo de Menezes e Silva Filho Peter Flach University of Bristol Universidade Federal de Pernambuco University of Bristol University of Tartu Centro de Informatica´ Abstract the classifier approximates, in some sense, the class posterior, although the approximation can be crude: for example, For optimal decision making under variable class a constant classifier predicting the overall class distribution distributions and misclassification costs a classi- for every instance is perfectly calibrated in this sense. Cal- fier needs to produce well-calibrated estimates of ibration is closely related to optimal decision making and the posterior probability. Isotonic calibration is a cost-sensitive classification, where we wish to determine powerful non-parametric method that is however the predicted class that minimises expected misclassifica- prone to overfitting on smaller datasets; hence a tion cost averaged over all possible true classes. The better parametric method based on the logistic curve is our estimates of the class posterior are, the closer we get to commonly used. While logistic calibration is de- the (irreducible) Bayes risk. A sufficiently calibrated clas- signed for normally distributed per-class scores, sifier can be simply thresholded at a threshold directly de- we demonstrate experimentally that many clas- rived from the misclassification costs. Thresholds can also sifiers including Naive Bayes and Adaboost suf- be derived to optimally adapt to a change in class prior, or fer from a particular distortion where these score to a combination of both. In contrast, for a poorly cali- distributions are heavily skewed. In such cases brated classifier the optimal thresholds cannot be obtained logistic calibration can easily yield probability without optimisation. estimates that are worse than the original scores. Some learning algorithms are designed to yield well- Moreover, the logistic curve family does not in- calibrated probabilities. These include decision trees, clude the identity function, and hence logistic whose leaf probabilities are optimal on the training set calibration can easily uncalibrate a perfectly cal- (Provost and Domingos, 2003); as trees suffer from high ibrated classifier. variance, using Laplace smoothing and no pruning is rec- In this paper we solve all these problems with ommended (Ferri et al., 2003). Logistic regression is an- a richer class of calibration maps based on the other example of a learning algorithm that often produces beta distribution. We derive the method from well-calibrated probabilities; as we show in this paper, this first principles and show that fitting it is as easy only holds if the specific parametric assumptions made by as fitting a logistic curve. Extensive experiments logistic regression are met, which cannot be guaranteed in show that beta calibration is superior to logistic general. Many other learning algorithms do not take suf- calibration for Naive Bayes and Adaboost. ficient account of distributional factors (e.g., support vector machines) or make unrealistic assumptions (e.g., Naive Bayes) and need to be calibrated in post-processing. Well- 1 INTRODUCTION established calibration methods include logistic calibration, also known as ‘Platt scaling’ in reference to the author who A predictive model can be said to be well-calibrated if its introduced it for support vector machines (Platt, 2000) and predictions match observed distributions in the data. In par- isotonic calibration, also known as the ROC convex hull ticular, a probabilistic classifier is well-calibrated if, among method and pair-adjacent-violators (Zadrozny and Elkan, the instances receiving a predicted probability vector p , the 2002; Fawcett and Niculescu-Mizil, 2007). class distribution is approximately distributed as p. Hence Isotonic calibration is a non-parametric method that uses th the convex hull of the model’s ROC curve to discretise Proceedings of the 20 International Conference on Artificial the scores into bins; the slope of each segment of the con- Intelligence and Statistics (AISTATS) 2017, Fort Lauderdale, Florida, USA. JMLR: W&CP volume 54. Copyright 2017 by vex hull can be interpreted as an empirical likelihood ratio, the author(s). from which a calibrated posterior probability for the corre- Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers sponding bin can be derived. Hence the resulting calibra- γ = 1 γ = 4 γ = 20 ) tion map is a non-decreasing, piecewise constant function. δ 1 , Logistic calibration is a parametric method which assumes γ that the scores within each class are normally distributed s; 0.5 ( c i with the same variance, from which the familiar sigmoidal t s i g calibration map can be derived with two parameters: a lo- o l 0 cation parameter m specifying the midpoint of the sigmoid µ 0 0.5 1 0 0.5 1 0 0.5 1 at which the calibrated score is 0:5; and a shape parameter s g specifying the slope of the sigmoid at this midpoint. Be- ing a parametric model, logistic calibration tends to require Figure 1: Examples of logistic curves with parameters g 2 less labelled data to fit the sigmoid; however, it can produce f1;4;20g, m 2 f0:25;0:5;0:75g and d = −mg. bad results due to model mismatch. The main contributions of this paper are (i) a demonstration do not want to learn perfect calibration maps on the train- that such model mismatch is a real danger for a range of ing data, because these would overfit and would be far from widely used machine learning models and can make clas- being calibrated on the test data. For example, if the model sifiers less calibrated; and (ii) the derivation of a new and f outputs a different score on each training instance, then richer parametric family which fixes this in a principled and the training-perfect calibration map would produce only the flexible way. The outline of the paper is as follows. In Sec- 0/1-probabilities m(si) = yi, which in most cases would be tion 2 we discuss the logistic calibration method and its overly confident and overfitting. properties. Section 3 introduces our new beta calibration method. In Section 4 we report on a wide range of exper- 2.2 Logistic Family of Calibration Maps iments showing that beta calibration is superior to logistic calibration and the preferred calibration method for smaller Logistic calibration was proposed by (Platt, 2000) to reduce datasets. Section 5 concludes. overfitting by introducing a strong inductive bias which considers only calibration maps of the following form: 2 LOGISTIC CALIBRATION 1 mlogistic(s;g;d) = 2.1 What Is Calibration? 1 + 1=exp(g · s + d) where g;d are real-valued parameters with g ≥ 0 to ensure The aim of calibration in binary classification is to take an that the calibration map is monotonically non-decreasing. uncalibrated scoring classifier s = f (x) and apply a cali- Monotonicity is enforced assuming that higher model out- bration map m on top of it to produce calibrated probabili- put scores suggest higher probability to be positive. For ties m( f (x)). Formally, a scoring classifier is perfectly cal- easier interpretability we introduce the parameter m = ibrated on a dataset if for each of its output scores s the −d=g. This implies d = −mg, yielding the following al- proportion of positives within instances with model output ternative parametrisation: score s is equal to s. Denoting the instances in the dataset by x1;:::;xn and their binary labels by y1;:::;yn, a model 1 m (s;g;−mg) = (1) f is calibrated on this dataset if for each of its possible out- logistic 1 + 1=exp(g · (s − m)) puts si = f (xi) the following holds: The parameter m determines the value of s for which the si = E[Yj f (X) = si] calibrated score is 1=2; the slope of the calibration map at s = m is g=4. Figure 1 shows a variety of shapes that the where the random variables X;Y denote respectively the logistic calibration map can take. features and label of a uniformly randomly drawn instance from the dataset, where the labels Y = 1 and Y = 0 stand 2.3 Fitting the Logistic Calibration Maps for a positive and negative, respectively. This expectation can be rewritten as follows (I[·] is the indicator function): In order to fit the parameters of logistic regression we n need to decide how we measure the goodness of fit. I.e., ∑ j=1 y j · I[ f (x j) = si] E[Yj f (X) = si] = n we need to measure how good the probability estimates ∑ I[ f (x j) = si] j=1 pî = mlogistic(si;g;d) are, given the actual labels yi. A well- known method to evaluate any estimatesp î is to use log- For any fixed model f there exists a uniquely determined loss, which penalises predictingp î for a positive instance calibration map which produces perfectly calibrated prob- with a loss of −lnp î and for a negative instance with a loss abilities on the given dataset. That calibration map can be of −ln(1 − pî). E.g., the fully confident predictionsp î = 0 defined as m(si) = E[Yj f (X) = si]. However, usually we andp î = 1 incur loss 0 if correct, and loss ¥ if wrong, Meelis Kull, Telmo de Menezes e Silva Filho, Peter Flach figure. The red line shows that our proposed beta calibra- 1.0 1.0 empirical empirical sigmoid sigmoid tion method provides a similar fit to isotonic calibration on isotonic isotonic 0.8 0.8 beta beta this dataset. Note that this calibration map has an inverse- 0.6 0.6 p p sigmoid shape, which is outside of the logistic family of ˆ ˆ 0.4 0.4 calibration maps.

Load more