Obtaining Calibrated Probabilities from Boosting

Total Page:16

File Type:pdf, Size:1020Kb

Obtaining Calibrated Probabilities from Boosting Obtaining Calibrated Probabilities from Boosting Alexandru Niculescu-Mizil Rich Caruana Department of Computer Science Department of Computer Science Cornell University, Ithaca, NY 14853 Cornell University, Ithaca, NY 14853 [email protected] [email protected] Abstract the predictions made by boosting are trying to fit a logit of the true probabilities, as opposed to the true probabilities themselves. To get back the probabilities, the logit trans- Boosted decision trees typically yield good ac- formation must be inverted. curacy, precision, and ROC area. However, be- cause the outputs from boosting are not well In their treatment of boosting as a large margin classifier, calibrated posterior probabilities, boosting yields Schapire et al. [1998] observed that in order to obtain large poor squared error and cross-entropy. We empir- margin on cases close to the decision surface, AdaBoost ically demonstrate why AdaBoost predicts dis- will sacrifice the margin of the easier cases. This results in a torted probabilities and examine three calibra- shifting of the predicted values away from 0 and 1, hurting tion methods for correcting this distortion: Platt calibration. This shifting is also consistent with Breiman’s Scaling, Isotonic Regression, and Logistic Cor- interpretation of boosting as an equalizer (see Breiman’s rection. We also experiment with boosting us- discussion in [Friedman et al., 2000]). In Section 2 we ing log-loss instead of the usual exponential loss. demonstrate this probability shifting on real data. Experiments show that Logistic Correction and To correct for boosting’s poor calibration, we experiment boosting with log-loss work well when boosting with boosting with log-loss, and with three methods for cal- weak models such as decision stumps, but yield ibrating the predictions made by boosted models to convert poor performance when boosting more complex them to well-calibrated posterior probabilities. The three models such as full decision trees. Platt Scal- post-training calibration methods are: ing and Isotonic Regression, however, signif- icantly improve the probabilities predicted by Logistic Correction: a method based on Friedman et al.’s both boosted stumps and boosted trees. After cal- analysis of boosting as an additive model ibration, boosted full decision trees predict better Platt Scaling: the method used by Platt to transform SVM probabilities than other learning methods such as outputs from [−∞; +1] to posterior probabilities [1999] SVMs, neural nets, bagged decision trees, and KNNs, even after these methods are calibrated. Isotonic Regression: the method used by Zadrozny and Elkan to calibrate predictions from boosted naive Bayes, SVM, and decision tree models [2002; 2001] 1 Introduction Logistic Correction and Platt Scaling convert predictions to probabilities by transforming them with a sigmoid. With In a recent evaluation of learning algorithms [Caruana and Logistic Correction, the sigmoid parameters are derived Niculescu-Mizil, 2005], boosted decision trees had excel- from Friedman et al.’s analysis. With Platt Scaling, the pa- lent performance on metrics such as accuracy, lift, area un- rameters are fitted to the data using gradient descent. Iso- der the ROC curve, average precision, and precision/recall tonic Regression is a general-purpose non-parametric cali- break even point. However, boosted decision trees had poor bration method that assumes probabilities are a monotonic squared error and cross-entropy because AdaBoost does transformation (not just sigmoid) of the predictions. not produce good probability estimates. An alternative to training boosted models with AdaBoost Friedman, Hastie, and Tibshirani [2000] provide an expla- and then correcting their outputs via post-training cali- nation for why boosting makes poorly calibrated predic- bration is to use a variant of boosting that directly op- tions. They show that boosting can be viewed as an addi- timizes cross-entropy (log-loss). Collins, Schapire and tive logistic regression model. A consequence of this is that Singer [2002] show that a boosting algorithm that opti- mizes log-loss can be obtained by simple modification to for the same models. The histograms show that as the num- the AdaBoost algorithm. Collins et al. briefly evaluate this ber of steps of boosting increases, the predicted values are new algorithm on a synthetic data set, but acknowledge that pushed away from 0 and 1 and tend to collect on either side a more thorough evaluation on real data sets is necessary. of the decision surface. This shift away from 0 and 1 hurts calibration and yields sigmoid-shaped reliability plots. Lebanon and Lafferty [2001] show that Logistic Correction applied to boosting with exponential loss should behave Figure 2 shows histograms and reliability diagrams for similarly to boosting with log-loss, and then demonstrate boosted decision trees after 1024 steps of boosting on eight this by examining the performance of boosted stumps on a test problems. (See Section 4 for more detail about these variety of data sets. Our results confirm their findings for problems.) The figures present results measured on large boosted stumps, and show the same effect for boosted trees. independent test sets not used for training. For seven of the eight data sets the predicted values after boosting do Our experiments show that boosting full decision trees usu- not approach 0 or 1. The one exception is LETTER.P1, ally yields better models than boosting weaker stumps. Un- a highly skewed data set that has only 3% positive class. fortunately, our results also show that boosting to directly On this problem some of the predicted values do approach optimize log-loss, or applying Logistic Correction to mod- 0, though careful examination of the histogram shows that els boosted with exponential loss, is only effective when there is a sharp drop in the number of cases predicted to boosting weak models such as stumps. Neither of these have probability near 0. methods is effective when boosting full decision trees. Sig- nificantly better performance is obtained by boosting full All the reliability plots in Figure 2 display sigmoid-shaped decision trees with exponential loss, and then calibrating reliability diagrams, motivating the use of a sigmoid to their predictions using either Platt Scaling or Isotonic Re- map the predictions to calibrated probabilities. The func- gression. Calibration with Platt Scaling or Isotonic Regres- tions fitted with Platt’s method and Isotonic Regression are sion is so effective that after calibration boosted decision shown in the middle and bottom rows of the figure. trees predict better probabilities than any other learning method we have compared them to, including neural nets, 3 Calibration bagged trees, random forests, and calibrated SVMs. In Section 2 we analyze the predictions from boosted trees In this section we describe three methods for calibrat- from a qualitative point of view. We show that boost- ing predictions from AdaBoost: Logistic Correction, Platt ing distorts the probabilities in a consistent way, gener- Scaling, and Isotonic Regression. ating sigmoid-shaped reliability diagrams. This analysis motivates the use of a sigmoid to map predictions to well- 3.1 Logistic Correction calibrated probabilities. Section 3 describes the three cali- bration methods. Section 4 presents an empirical compar- Before describing Logistic Correction, it is useful to briefly ison of the three calibration methods and the log-loss ver- review AdaBoost. Start with each example in the train set sion of boosting. Section 5 compares the performance of (xi; yi) having equal weight. At each step i a weak learner boosted trees and stumps to other learning methods. hi is trained on the weighted train set. The error of hi deter- mines the model weight αi and the future weight of each training example. There are two equivalent formulations. 2 Boosting and Calibration The first formulation, also used by Friedman, Hastie, and Tibshirani [2000] assumes yi 2 {−1; 1g and hi 2 {−1; 1g. In this section we empirically examine the relationship be- The output of the boosted model is: tween boosting’s predictions and posterior probabilities. T One way to visualize this relationship when the true pos- F (x) = X αihi(x) (1) terior probabilities are not known is through reliability di- i=1 agrams [DeGroot and Fienberg, 1982]. To construct a reli- ability diagram, the prediction space is discretized into ten Friedman et al. show that AdaBoost builds an additive lo- bins. Cases with predicted value between 0 and 0.1 are put gistic regression model for minimizing E(exp(−yF (x))). in the first bin, between 0.1 and 0.2 in the second bin, etc. They show that E(exp(−yF (x))) is minimized by: For each bin, the mean predicted value is plotted against 1 P (y = 1jx) the true fraction of positive cases in the bin. If the model is F (x) = log (2) 2 P (y = −1jx) well calibrated the points will fall near the diagonal line. This suggests applying a logistic correction in order to get The bottom row of Figure 1 shows reliability plots on a back the conditional probability: large test set after 1,4,8,32,128, and 1024 stages of boost- ing Bayesian smoothed decision trees [Buntine, 1992]. The 1 P (y = 1jx) = (3) top of the figure shows histograms of the predicted values 1 + exp(−2F (x)) 1 STEP 4 STEPS 8 STEPS 32 STEPS 128 STEPS 1024 STEPS 0.16 0.16 0.16 0.16 0.16 0.16 0.14 0.14 0.14 0.14 0.14 0.14 0.12 0.12 0.12 0.12 0.12 0.12 0.1 0.1 0.1 0.1 0.1 0.1 0.08 0.08 0.08 0.08 0.08 0.08 0.06 0.06 0.06 0.06 0.06 0.06 0.04 0.04 0.04 0.04 0.04 0.04 0.02 0.02 0.02 0.02 0.02 0.02 0 0 0 0 0 0 1 1 1 1 1 1 0.8 0.8 0.8 0.8 0.8 0.8 0.6 0.6 0.6 0.6 0.6 0.6 0.4 0.4 0.4 0.4 0.4 0.4 Fraction of Positives 0.2 Fraction of Positives 0.2 Fraction of Positives 0.2 Fraction of Positives 0.2 Fraction of Positives 0.2 Fraction of Positives 0.2 0 0 0 0 0 0 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 Mean Predicted Value Mean Predicted Value Mean Predicted Value Mean Predicted Value Mean Predicted Value Mean Predicted Value Figure 1: Effect of boosting on the predicted values.
Recommended publications
  • Obtaining Well Calibrated Probabilities Using Bayesian Binning
    Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence Obtaining Well Calibrated Probabilities Using Bayesian Binning Mahdi Pakdaman Naeini1, Gregory F. Cooper1;2, and Milos Hauskrecht1;3 1Intelligent Systems Program, University of Pittsburgh, PA, USA 2Department of Biomedical Informatics, University of Pittsburgh, PA, USA 3Computer Science Department, University of Pittsburgh, PA, USA Abstract iments to perform), medicine (e.g., deciding which therapy Learning probabilistic predictive models that are well cali- to give a patient), business (e.g., making investment deci- brated is critical for many prediction and decision-making sions), and others. However, model calibration and the learn- tasks in artificial intelligence. In this paper we present a new ing of well-calibrated probabilistic models have not been non-parametric calibration method called Bayesian Binning studied in the machine learning literature as extensively as into Quantiles (BBQ) which addresses key limitations of ex- for example discriminative machine learning models that isting calibration methods. The method post processes the are built to achieve the best possible discrimination among output of a binary classification algorithm; thus, it can be classes of objects. One way to achieve a high level of model readily combined with many existing classification algo- calibration is to develop methods for learning probabilistic rithms. The method is computationally tractable, and em- models that are well-calibrated, ab initio. However, this ap- pirically accurate, as evidenced by the set of experiments proach would require one to modify the objective function reported here on both real and simulated datasets. used for learning the model and it may increase the com- putational cost of the associated optimization task.
    [Show full text]
  • Calibrated Model-Based Deep Reinforcement Learning
    Calibrated Model-Based Deep Reinforcement Learning and intuition on how to apply calibration in reinforcement Probabilistic Models This paper focuses on probabilistic learning. dynamics models T (s0|s, a) that take a current state s 2S and action a 2A, and output a probability distribution over We validate our approach on benchmarks for contextual future states s0. Web represent the output distribution over the bandits and continuous control (Li et al., 2010; Todorov next states, T (·|s, a), as a cumulative distribution function et al., 2012), as well as on a planning problem in inventory F : S![0, 1], which is defined for both discrete and management (Van Roy et al., 1997). Our results show that s,a continuous Sb. calibration consistently improves the cumulative reward and the sample complexity of model-based agents, and also en- hances their ability to balance exploration and exploitation 2.2. Calibration, Sharpness, and Proper Scoring Rules in contextual bandit settings. Most interestingly, on the A key desirable property of probabilistic forecasts is calibra- HALFCHEETAH task, our system achieves state-of-the-art tion. Intuitively, a transition model T (s0|s, a) is calibrated if performance, using 50% fewer samples than the previous whenever it assigns a probability of 0.8 to an event — such leading approach (Chua et al., 2018). Our results suggest as a state transition (s, a, s0) — thatb transition should occur that calibrated uncertainties have the potential to improve about 80% of the time. model-based reinforcement learning algorithms with mini- 0 mal computational and implementation overhead. Formally, for a discrete state space S and when s, a, s are i.i.d.
    [Show full text]
  • Offline Deep Models Calibration with Bayesian Neural Networks
    Under review as a conference paper at ICLR 2019 OFFLINE DEEP MODELS CALIBRATION WITH BAYESIAN NEURAL NETWORKS Anonymous authors Paper under double-blind review ABSTRACT In this work the authors show that Bayesian Neural Networks (BNNs) can be efficiently applied to calibrate state-of-the-art Deep Neural Networks (DNN). Our approach acts offline, i.e., it is decoupled from the training of the DNN to be cali- brated. This offline approach allow us to apply our BNN calibration to any model regardless of the limitations that the model may present during training. Note that this offline setting is also appropriate in order to deal with privacy concerns of DNN training data or implementation, among others. We show that our approach clearly outperforms other simple maximum likelihood based solutions that have recently shown very good performance, as temperature scaling (Guo et al., 2017). As an example, we reduce the Expected Calibration Error (ECE%) from 0.52 to 0.24 on CIFAR-10 and from 4.28 to 2.46 on CIFAR-100 on two Wide ResNet with 96.13% and 80.39% accuracy respectively, which are among the best results published for these tasks. Moreover, we show that our approach improves the per- formance of online methods directly applied to the DNN, e.g. Gaussian processes or Bayesian Convolutional Neural Networks. Finally, this decoupled approach allows us to apply any further improvement to the BNN without considering the computational restrictions imposed by the deep model. In this sense, this offline setting is a practical application where BNNs can be considered, which is one of the main criticisms to these techniques.
    [Show full text]
  • Discovering General-Purpose Active Learning Strategies
    1 Discovering General-Purpose Active Learning Strategies Ksenia Konyushkova, Raphael Sznitman, and Pascal Fua, Fellow, IEEE Abstract—We propose a general-purpose approach to discovering active learning (AL) strategies from data. These strategies are transferable from one domain to another and can be used in conjunction with many machine learning models. To this end, we formalize the annotation process as a Markov decision process, design universal state and action spaces and introduce a new reward function that precisely model the AL objective of minimizing the annotation cost. We seek to find an optimal (non-myopic) AL strategy using reinforcement learning. We evaluate the learned strategies on multiple unrelated domains and show that they consistently outperform state-of-the-art baselines. Index Terms—Active learning, meta-learning, Markov decision process, reinforcement learning. F 1 INTRODUCTION ODERN supervised machine learning (ML) methods is still greedy and therefore could be a subject to finding M require large annotated datasets for training purposes suboptimal solutions. and the cost of producing them can easily become pro- Our goal is therefore to devise a general-purpose data- hibitive. Active learning (AL) mitigates the problem by se- driven AL method that is non-myopic and applicable to lecting intelligently and adaptively a subset of the data to be heterogeneous datasets. To this end, we build on earlier annotated. To do so, AL typically relies on informativeness work that showed that AL problems could be naturally measures that identify unlabelled data points whose labels reworded in Markov Decision Process (MDP) terms [4], [10]. are most likely to help to improve the performance of the In a typical MDP, an agent acts in its environment.
    [Show full text]
  • On Calibration of Modern Neural Networks
    On Calibration of Modern Neural Networks Chuan Guo * 1 Geoff Pleiss * 1 Yu Sun * 1 Kilian Q. Weinberger 1 Abstract LeNet (1998) ResNet (2016) CIFAR-100 CIFAR-100 1:0 Confidence calibration – the problem of predict- ing probability estimates representative of the 0:8 true correctness likelihood – is important for 0:6 classification models in many applications. We Accuracy Accuracy discover that modern neural networks, unlike 0:4 those from a decade ago, are poorly calibrated. Avg. confidence Avg. confidence Through extensive experiments, we observe that % of Samples 0:2 depth, width, weight decay, and Batch Normal- 0:0 ization are important factors influencing calibra- 0:0 0:2 0:4 0:6 0:8 1:0 0:0 0:2 0:4 0:6 0:8 1:0 tion. We evaluate the performance of various 1:0 Outputs Outputs post-processing calibration methods on state-of- 0:8 Gap Gap the-art architectures with image and document classification datasets. Our analysis and exper- 0:6 iments not only offer insights into neural net- 0:4 work learning, but also provide a simple and Accuracy straightforward recipe for practical settings: on 0:2 Error=44.9 Error=30.6 most datasets, temperature scaling – a single- 0:0 parameter variant of Platt Scaling – is surpris- 0:0 0:2 0:4 0:6 0:8 1:0 0:0 0:2 0:4 0:6 0:8 1:0 ingly effective at calibrating predictions. Confidence Figure 1. Confidence histograms (top) and reliability diagrams (bottom) for a 5-layer LeNet (left) and a 110-layer ResNet (right) 1.
    [Show full text]
  • Calibration Techniques for Binary Classification Problems
    Calibration Techniques for Binary Classification Problems: A Comparative Analysis Alessio Martino a, Enrico De Santis b, Luca Baldini c and Antonello Rizzi d Department of Information Engineering, Electronics and Telecommunications, University of Rome ”La Sapienza”, Via Eudossiana 18, 00184 Rome, Italy Keywords: Calibration, Classification, Supervised Learning, Support Vector Machine, Probability Estimates. Abstract: Calibrating a classification system consists in transforming the output scores, which somehow state the con- fidence of the classifier regarding the predicted output, into proper probability estimates. Having a well- calibrated classifier has a non-negligible impact on many real-world applications, for example decision mak- ing systems synthesis for anomaly detection/fault prediction. In such industrial scenarios, risk assessment is certainly related to costs which must be covered. In this paper we review three state-of-the-art calibration techniques (Platt’s Scaling, Isotonic Regression and SplineCalib) and we propose three lightweight proce- dures based on a plain fitting of the reliability diagram. Computational results show that the three proposed techniques have comparable performances with respect to the three state-of-the-art approaches. 1 INTRODUCTION tion from X . The seminal example is data clustering, where aim of the learning system is to return groups Classification is one of the most important problems (clusters) of data in such a way that patterns belong- falling under the machine learning and, specifically, ing to the same cluster are more similar with respect to under the supervised learning umbrella. Generally patterns belonging to other clusters (Jain et al., 1999; speaking, it is possible to sketch three main families: Martino et al., 2017b; Martino et al., 2018b; Martino clustering, regression/function approximation, classi- et al., 2019; Di Noia et al., 2019).
    [Show full text]
  • Radar-Detection Based Classification of Moving Objects Using Machine Learning Methods
    Radar-detection based classification of moving objects using machine learning methods VICTOR NORDENMARK ADAM FORSGREN Master of Science Thesis Stockholm, Sweden 2015 Radar-detection based classification of moving objects using machine learning methods Victor Nordenmark Adam Forsgren Master of Science Thesis MMK 2015:77 MDA 520 KTH Industrial Engineering and Management Machine Design SE-100 44 STOCKHOLM ii Examensarbete MMK 2015:77 MDA 520 Radar-detection based classification of moving objects using machine learning methods Victor Nordenmark Adam Forsgren Godkänt Examinator Handledare 2015-06-17 Martin Grimheden De-Jiu Chen Uppdragsgivare Kontaktperson Scania Södertälje AB Kristian Lundh Sammanfattning I detta examensarbete undersöks möjligheten att klassificera rörliga objekt baserat på data från Dopplerradardetektioner. Slutmålet är ett system som använder billig hårdvara och utför beräkningar av låg komplexitet. Scania, företaget som har beställt detta projekt, är intresserat av användningspotentialen för ett sådant system i applikationer för autonoma fordon. Specifikt vill Scania använda klassinformationen för att lättare kunna följa rörliga objekt, vilket är en väsentlig färdighet för en autonomt körande lastbil. Objekten delas in i fyra klasser: fotgängare, cyklist, bil och lastbil. Indatan till systemet består väsentligen av en plattform med fyra stycken monopulsdopplerradars som arbetar med en vågfrekvens på 77 GHz. Ett klassificeringssystem baserat på maskininlärningskonceptet Support vector machines har skapats. Detta system har tränats och validerats på ett dataset som insamlats för projektet, innehållandes datapunkter med klassetiketter. Ett antal stödfunktioner till detta system har också skapats och testats. Klassificeraren visas kunna skilja väl på de fyra klasserna i valideringssteget. Simuleringar av det kompletta systemet gjort på inspelade loggar med radardata visar lovande resultat, även i situationer som inte finns representerade i träningsdatan.
    [Show full text]
  • Learning to Plan with Portable Symbols
    Learning to Plan with Portable Symbols Steven James 1 Benjamin Rosman 1 2 George Konidaris 3 Abstract the state space remains complex and planning continues to We present a framework for autonomously learn- be challenging. ing a portable symbolic representation that de- On the other hand, the classical planning approach is to rep- scribes a collection of low-level continuous en- resent the world using abstract symbols, with actions repre- vironments. We show that abstract representa- sented as operators that manipulate these symbols (Ghallab tions can be learned in a task-independent space et al., 2004). Such representations use only the minimal specific to the agent that, when combined with amount of state information necessary for task-level plan- problem-specific information, can be used for ning. This is appealing since it mitigates the issue of reward planning. We demonstrate knowledge transfer sparsity and admits solutions to long-horizon tasks, but in a video game domain where an agent learns raises the question of how to build the appropriate abstract portable, task-independent symbolic rules, and representation of a problem. This is often resolved manu- then learns instantiations of these rules on a per- ally, requiring substantial effort and expertise. Fortunately, task basis, reducing the number of samples re- recent work demonstrates how to learn a provably sound quired to learn a representation of a new task. symbolic representation autonomously, given only the data obtained by executing the high-level actions available to the agent (Konidaris et al., 2018). 1. Introduction A major shortcoming of that framework is the lack of On the surface, the learning and planning communities op- generalisability—an agent must relearn the appropriate sym- erate in very different paradigms.
    [Show full text]
  • Materials Genomics and Machine Learning Kevin Maik Jablonka, Daniele Ongari, Seyed Mohamad Moosavi, and Berend Smit*
    This is an open access article published under an ACS AuthorChoice License, which permits copying and redistribution of the article or any adaptations for non-commercial purposes. pubs.acs.org/CR Review Big-Data Science in Porous Materials: Materials Genomics and Machine Learning Kevin Maik Jablonka, Daniele Ongari, Seyed Mohamad Moosavi, and Berend Smit* Cite This: Chem. Rev. 2020, 120, 8066−8129 Read Online ACCESS Metrics & More Article Recommendations ABSTRACT: By combining metal nodes with organic linkers we can potentially synthesize millions of possible metal−organic frameworks (MOFs). The fact that we have so many materials opens many exciting avenues but also create new challenges. We simply have too many materials to be processed using conventional, brute force, methods. In this review, we show that having so many materials allows us to use big-data methods as a powerful technique to study these materials and to discover complex correlations. The first part of the review gives an introduction to the principles of big-data science. We show how to select appropriate training sets, survey approaches that are used to represent these materials in feature space, and review different learning architectures, as well as evaluation and interpretation strategies. In the second part, we review how the different approaches of machine learning have been applied to porous materials. In particular, we discuss applications in the field of gas storage and separation, the stability of these materials, their electronic properties, and their synthesis. Given the increasing interest of the scientific community in machine learning, we expect this list to rapidly expand in the coming years.
    [Show full text]
  • Measuring Calibration in Deep Learning
    Under review as a conference paper at ICLR 2020 MEASURING CALIBRATION IN DEEP LEARNING Anonymous authors Paper under double-blind review ABSTRACT Overconfidence and underconfidence in machine learning classifiers is measured by calibration: the degree to which the probabilities predicted for each class match the accuracy of the classifier on that prediction. We propose new measures for calibration, the Static Calibration Error (SCE) and Adaptive Calibration Error (ACE). These measures taks into account every prediction made by a model, in contrast to the popular Expected Calibration Error. 1 INTRODUCTION The reliability of a machine learning model’s confidence in its predictions is critical for high risk applications, such as deciding whether to trust a medical diagnosis prediction (Crowson et al., 2016; Jiang et al., 2011; Raghu et al., 2018). One mathematical formulation of the reliability of confidence is calibration (Murphy & Epstein, 1967; Dawid, 1982). Intuitively, for class predictions, calibration means that if a model assigns a class with 90% probability, that class should appear 90% of the time. Calibration is not directly measured by proper scoring rules like negative log likelihood or the quadratic loss. The reason that we need to assess the quality of calibrated probabilities is that presently, we optimize against a proper scoring rule and our models often yield uncalibrated probabilities. Recent work proposed Expected Calibration Error (ECE; Naeini et al., 2015), a measure of calibration error which has lead to a surge of works developing methods for calibrated deep neural networks (e.g., Guo et al., 2017; Kuleshov et al., 2018). In this paper, we show that ECE has numerous pathologies, and that recent calibration methods, which have been shown to successfully recalibrate models according to ECE, cannot be properly evaluated via ECE.
    [Show full text]
  • Arxiv:1808.00111V2 [Cs.LG] 14 Sep 2018
    JMLR: Workshop and Conference Proceedings 77:145{160, 2017 ACML 2017 Probability Calibration Trees Tim Leathart [email protected] Eibe Frank [email protected] Geoffrey Holmes [email protected] Department of Computer Science, University of Waikato Bernhard Pfahringer [email protected] Department of Computer Science, University of Auckland Editors: Yung-Kyun Noh and Min-Ling Zhang Abstract Obtaining accurate and well calibrated probability estimates from classifiers is useful in many applications, for example, when minimising the expected cost of classifications. Ex- isting methods of calibrating probability estimates are applied globally, ignoring the po- tential for improvements by applying a more fine-grained model. We propose probability calibration trees, a modification of logistic model trees that identifies regions of the input space in which different probability calibration models are learned to improve performance. We compare probability calibration trees to two widely used calibration methods|isotonic regression and Platt scaling|and show that our method results in lower root mean squared error on average than both methods, for estimates produced by a variety of base learners. Keywords: Probability calibration, logistic model trees, logistic regression, LogitBoost 1. Introduction In supervised classification, assuming uniform misclassification costs, it is sufficient to pre- dict the most likely class for a given test instance. However, in some applications, it is important to produce an accurate probability distribution over the classes for each exam- ple. While most classifiers can produce a probability distribution for a given test instance, these probabilities are often not well calibrated, i.e., they may not be representative of the true probability of the instance belonging to a particular class.
    [Show full text]
  • Multi-Class Probabilistic Classification Using Inductive and Cross Venn–Abers Predictors
    Proceedings of Machine Learning Research 60:1{13, 2017 Conformal and Probabilistic Prediction and Applications Multi-class probabilistic classification using inductive and cross Venn{Abers predictors Valery Manokhin [email protected] Royal Holloway, University of London, Egham, Surrey, UK Editor: Alex Gammerman, Vladimir Vovk, Zhiyuan Luo and Harris Papadopoulos Abstract Inductive (IVAP) and cross (CVAP) Venn{Abers predictors are computationally efficient algorithms for probabilistic prediction in binary classification problems. We present a new approach to multi-class probability estimation by turning IVAPs and CVAPs into multi- class probabilistic predictors. The proposed multi-class predictors are experimentally more accurate than both uncalibrated predictors and existing calibration methods. Keywords: Probabilistic classification, uncertainty quantification, calibration method, on-line compression modeling, measures of confidence 1. Introduction Multi-class classification is the problem of classifying objects into one of the more than two classes. The goal of classification is to construct a classifier which, given a new test object, will predict the class label from the set of possible k classes. In an \ordinary" classification problem, the goal is to minimize the loss function by predicting correct labels on the test set. In contrast, a probabilistic classifier outputs, for a given test object, the probability distribution over the set of k classes. Probabilistic classifiers allow to express a degree of confidence about the classification of the test object. This is useful in a range of applications and industries, including life sciences, robotics and artificial intelligence. A number of techniques are available to solve binary classification problems. In logistic regression, the dependent variable is a categorical variable indicating one of the two classes.
    [Show full text]