Non-Parametric Calibration for Classification

Jonathan Wenger (University of Tübingen, TU Munich), Hedvig Kjellström (KTH Royal Institute of Technology), Rudolph Triebel (German Aerospace Center (DLR), TU Munich, KTH Royal Institute of Technology)
[email protected]  [email protected]  [email protected]

Abstract

Many applications of classification methods not only require high accuracy but also reliable estimation of predictive uncertainty. However, while many current classification frameworks, in particular deep neural networks, achieve high accuracy, they tend to incorrectly estimate uncertainty. In this paper, we propose a method that adjusts the confidence estimates of a general classifier such that they approach the probability of classifying correctly. In contrast to existing approaches, our calibration method employs a non-parametric representation using a latent Gaussian process, and is specifically designed for multi-class classification. It can be applied to any classifier that outputs confidence estimates and is not limited to neural networks. We also provide a theoretical analysis regarding the over- and underconfidence of a classifier and its relationship to calibration, as well as an empirical outlook for calibrated active learning. In experiments we show the universally strong performance of our method across different classifiers and benchmark data sets, in particular for state-of-the-art neural network architectures.

1 INTRODUCTION

With the recent achievements in machine learning, in particular in the area of deep learning, the application range for learning methods has increased significantly. Especially in challenging fields such as computer vision or speech recognition, important advancements have been made using powerful and complex network architectures, trained on very large data sets. Most of these techniques are used for classification tasks, e.g. object recognition. We also consider classification in our work. However, in addition to achieving high classification accuracy, our goal is to provide reliable prediction uncertainty estimates. This is particularly relevant in safety-critical applications, such as autonomous driving and robotics (Amodei et al., 2016). Reliable uncertainties can be used to increase a classifier's precision by reporting only class labels that are predicted with low uncertainty, or for information-theoretic analyses of what was learned and what was not. The latter is especially interesting in active learning, where the model actively selects the most relevant data samples for training via a query function based on the predictive uncertainty of the model (Settles, 2010).

Unfortunately, current probabilistic classification approaches that inherently provide good uncertainty estimates, such as Gaussian processes (GP), often suffer from lower accuracy and higher computational complexity on high-dimensional classification tasks compared to state-of-the-art convolutional neural networks (CNN). It was recently observed that many modern CNNs are overconfident (Lakshminarayanan et al., 2017, Hein et al., 2019) and miscalibrated (Guo et al., 2017). Calibration refers to how well confidence estimates of a classifier match the probability of the associated prediction being correct. Originally developed in the context of forecasting (Murphy, 1973, DeGroot and Fienberg, 1983), uncertainty calibration has seen an increased interest in recent years (Naeini et al., 2015, Guo et al., 2017, Vaicenavicius et al., 2019), partly because of the popularity of CNNs, which generally lack an inherent uncertainty representation. Earlier studies show that also classical methods such as decision trees, boosting, SVMs and naive Bayes classifiers tend to be miscalibrated (Zadrozny and Elkan, 2001, Niculescu-Mizil and Caruana, 2005a,b, Naeini et al., 2015).
Based on these observations, we claim that training and calibrating a classifier can be two different objectives that benefit from being considered separately, as shown in a toy example in Figure 1.

[Figure 1: classification error, negative log-likelihood (NLL) and expected calibration error (ECE1) on MNIST over 60 training epochs; curves shown for the training set, the test set, and the test set after calibration.]

Figure 1: Motivating example for calibration. We trained a neural network with one hidden layer on MNIST (LeCun et al., 1998) and computed the classification error, the negative log-likelihood (NLL) and the expected calibration error (ECE1) for each training epoch. While accuracy continues to improve on the test set, the ECE1 increases after 20 epochs. This differs from classical overfitting as the test error continues to decrease. This indicates that improving both accuracy and uncertainty estimation can be conflicting objectives. However, we can mitigate this post-hoc via our calibration method (red dot): after training and calibration, uncertainty estimation is improved while classification accuracy is maintained.

Here, a simple neural network continually improves its accuracy on the test set during training, but eventually overfits in terms of NLL and calibration error. A similar phenomenon was observed by Guo et al. (2017) for more complex models.

Calibration methods approach this problem by performing a post-hoc improvement to uncertainty estimation using a small subset of the training data. Our goal in this paper is to develop a multi-class calibration method for arbitrary classifiers, to provide reliable predictive uncertainty estimates in addition to maintaining high accuracy. In contrast to recent approaches, which strive to improve uncertainty estimation only for neural networks, including Bayesian neural networks (MacKay, 1992, Gal, 2016) and Laplace approximations (LA) (Martens and Grosse, 2015, Ba et al., 2017), our aim is a framework that is not based on tuning a specific classification method. This has the advantage that our method operates independently of the training process.

Contribution. In this work we develop a new multi-class and model-agnostic approach to calibration, based on a latent Gaussian process inferred using variational inference. We replicate and extend previous findings that popular classification models are generally not calibrated and demonstrate the superior performance of our method for deep neural networks. Finally, we study the relationship between active learning and calibration from a theoretical perspective and give an empirical outlook.

Related Work. Estimation of uncertainty, in particular in deep learning (Kendall and Gal, 2017), is of considerable interest in the machine learning community. There are two main approaches in classification. The first chooses a model and a (regularized) loss function for a particular problem to inherently learn a good representation, and the second performs post-hoc calibration by transforming the output of the underlying model. For example, Pereyra et al. (2017) propose to penalize low-entropy output distributions, Kumar et al. (2018) suggest a trainable measure of calibration as a regularizer and Maddox et al. (2019) employ an approximate Bayesian inference technique using stochastic weight averaging. Milios et al. (2018) approximate Gaussian process classifiers by GP regression on transformed labels for better scalability and Wilson et al. (2016) combine additive Gaussian processes with deep neural network architectures. Research on calibration goes back to statistical forecasting (Murphy, 1973, DeGroot and Fienberg, 1983) and approaches to provide uncertainty estimates for non-probabilistic binary classifiers (Platt, 1999, Lin et al., 2007, Zadrozny and Elkan, 2002). More recently, Bayesian binning into quantiles (Naeini et al., 2015) and beta calibration (Kull et al., 2017a) for binary classification and temperature scaling (Guo et al., 2017) for multi-class problems were proposed. A theoretical framework for evaluating calibration in classification was suggested by Vaicenavicius et al. (2019). Calibration was also previously considered in the online setting with potentially adversarial input (Kuleshov and Ermon, 2017). Calibration in a broader sense is also of interest outside of the classification setting, e.g. in regression (Kuleshov et al., 2018, Song et al., 2019), in the discovery of causal structure from observational data (Jabbari et al., 2017) and in the algorithmic fairness literature (Pleiss et al., 2017, Kleinberg, 2018).

2 BACKGROUND

Notation. Consider a data set $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^N$ assumed to consist of independent and identically distributed realizations of the random variable $(\mathbf{x}, y) \in \mathcal{X} \times \mathcal{Y}$ with $K := |\mathcal{Y}|$ classes. If not stated otherwise, any expectation is taken with respect to the law of new data $(\mathbf{x}, y)$. Let $f : \mathcal{X} \to \mathbb{R}^K$ be a classifier with output $\mathbf{z} = f(\mathbf{x})$, prediction $\hat{y} = \arg\max_i(z_i)$ and associated confidence score $\hat{z} = \max_i(z_i)$. Lastly, $v : \mathbb{R}^K \to \mathbb{R}^K$ denotes a calibration method.

Figure 2: Schematic diagram of calibration. A fraction of the training data is split off and the remaining data is used for training. The split-off calibration data is classified by the trained model and subsequently used to fit the calibration method (blue). Confidence estimates from the classifier for new data are then adjusted by the calibration method (orange).

2.1 Calibration

A classifier is called calibrated if the confidence in its class prediction matches the probability of its prediction being correct, i.e. $\mathbb{E}[\mathbf{1}_{\hat{y}=y} \mid \hat{z}] = \hat{z}$. In order to measure calibration, we define the expected calibration error following Naeini et al. (2015) for $1 \le p < \infty$ by
$$\mathrm{ECE}_p = \mathbb{E}\Big[\big|\hat{z} - \mathbb{E}[\mathbf{1}_{\hat{y}=y} \mid \hat{z}]\big|^p\Big]^{\frac{1}{p}} \tag{1}$$
and the maximum calibration error by $\mathrm{ECE}_\infty = \max_{z \in [0,1]} \big|z - \mathbb{E}[\mathbf{1}_{\hat{y}=y} \mid \hat{z} = z]\big|$. In practice, we estimate the calibration error using a fixed binning for $\hat{z}$ as described by Naeini et al. (2015). However, calibration alone is not sufficient for useful uncertainty estimates. A classifier on a balanced binary classification problem that always returns a confidence of 0.5 is perfectly calibrated, because this equals the probability of making a correct prediction. However, intuitively prediction confidence should be sufficiently close to 0 and 1 to be informative. This notion is known as sharpness or refinement (DeGroot and Fienberg, 1983, Murphy and Winkler, 1992, Cohen and Goldszmidt, 2004).
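As a concrete reference, below is a minimal sketch of the fixed-binning estimator of ECE1, using the 100 equal-width bins of the later experiments; function and variable names are ours, not taken from the authors' implementation.

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=100):
    """Binned estimate of ECE_1 = E[ |z_hat - E[1_{y_hat = y} | z_hat]| ].

    confidence : array of confidence scores z_hat (maximum predicted probability)
    correct    : boolean array, True where y_hat == y
    """
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(confidence, bins) - 1, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            # |mean confidence - empirical accuracy| in the bin, weighted by bin mass
            ece += mask.mean() * abs(confidence[mask].mean() - correct[mask].mean())
    return ece
```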

2.2 Over- and Underconfidence

We build on the notions of over- and underconfidence, as introduced previously in the context of active learning by Mund et al. (2015). The idea is to measure the average confidence of a classifier on its false predictions and the average uncertainty on its correct predictions:

$$o(f) = \mathbb{E}[\hat{z} \mid \hat{y} \ne y], \qquad u(f) = \mathbb{E}[1 - \hat{z} \mid \hat{y} = y]. \tag{2}$$

Over- and underconfidence are properties of a classifier independent of accuracy. They relate to query efficiency in active learning (Settles, 2010). Counter to intuition, both can be present to varying degrees simultaneously. We refer to Section S1 of the supplementary material for more details. We demonstrate that there is a direct link between calibration and these two notions.

Theorem 1 (Calibration, Over- and Underconfidence). Let $1 \le p < q \le \infty$, then the following relationship between overconfidence, underconfidence and the expected calibration error holds:
$$\big|o(f)\,\mathbb{P}(\hat{y} \ne y) - u(f)\,\mathbb{P}(\hat{y} = y)\big| \le \mathrm{ECE}_p \le \mathrm{ECE}_q.$$

A proof is given in Section S1.2 of the supplementary material. We see that the expected calibration error bounds the weighted absolute difference of over- and underconfidence. This implies that for perfect calibration, the odds of making a correct prediction equal the ratio between over- and underconfidence. Comparable statements exist in algorithmic fairness, where over- and underconfidence were termed generalized false positive and negative rates (Pleiss et al., 2017, Kleinberg, 2018).
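The quantities in Eq. (2) are straightforward to estimate from held-out predictions. The sketch below does so and checks the lower bound of Theorem 1 against the binned ECE1 estimator from Section 2.1; the synthetic scores are only for illustration.

```python
import numpy as np

def over_under_confidence(confidence, correct):
    """Empirical overconfidence o(f) and underconfidence u(f) from Eq. (2)."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    o = confidence[~correct].mean()         # E[z_hat | y_hat != y]
    u = (1.0 - confidence[correct]).mean()  # E[1 - z_hat | y_hat == y]
    return o, u

# Synthetic, miscalibrated predictions to illustrate the bound.
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=10_000)
correct = rng.uniform(size=conf.size) < conf ** 2

o, u = over_under_confidence(conf, correct)
p_correct = correct.mean()
lower_bound = abs(o * (1.0 - p_correct) - u * p_correct)
# Theorem 1: this lower bound never exceeds the (binned) ECE_1 estimate.
print(lower_bound, expected_calibration_error(conf, correct))
```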

2.3 Calibration Methods

The aim of calibrating a classifier is to transform its output to be closer to the true correctness probability. This is typically done by fitting a calibration method $v$ on a small hold-out set called the calibration data (see Figure 2). In the following, we describe the most prevalent methods.

2.3.1 Binary Calibration

Platt Scaling (Platt, 1999, Lin et al., 2007) was originally introduced to provide probabilistic output for SVMs. It is a parametric calibration method, where a logistic regressor is fit to the confidence scores of the positive class such that
$$v(\mathbf{z})_2 = \big(1 + \exp(-a z_2 - b)\big)^{-1}$$
for $a, b \in \mathbb{R}$. This parametric assumption is justified if the scores of each class are normally distributed with identical variance (Kull et al., 2017b).

Isotonic Regression (Zadrozny and Elkan, 2002) is a non-parametric approach. It assumes a non-decreasing relation between the model confidence $z_2$ of the positive class and its correctness probability $v(\mathbf{z})_2$. A piecewise-constant isotonic function $m$ is found by minimizing a squared loss function, resulting in the calibration function $v(\mathbf{z})_2 = m(z_2) + \epsilon$.
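Both binary methods have standard implementations, e.g. in scikit-learn. The sketch below fits them on synthetic positive-class scores standing in for a calibration set; the data and names are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
z2 = rng.uniform(size=1000)                          # positive-class confidence on the calibration set
y = (rng.uniform(size=1000) < z2 ** 2).astype(int)   # labels from a miscalibrated score

# Platt scaling: v(z)_2 = (1 + exp(-a z_2 - b))^-1, fit as a logistic regression on z_2
platt = LogisticRegression(C=1e6).fit(z2.reshape(-1, 1), y)   # large C ~= unregularized
v_platt = platt.predict_proba(z2.reshape(-1, 1))[:, 1]

# Isotonic regression: non-decreasing, piecewise-constant map v(z)_2 = m(z_2)
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip").fit(z2, y)
v_iso = iso.predict(z2)
```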

Beta Calibration (Kull et al., 2017a) was designed for probabilistic classifiers with output range $z_2 \in [0, 1]$. A family of calibration maps is defined based on the likelihood ratio between two Beta distributions. The calibration map is given by
$$v(\mathbf{z})_2 = \Big(1 + \exp(-c)\,(1 - z_2)^{b}\, z_2^{-a}\Big)^{-1},$$
where $a, b, c \in \mathbb{R}$ are fit on the calibration data.

Bayesian Binning into Quantiles (BBQ) (Naeini et al., 2015) scores multiple equal-frequency binning models and uses a score-weighted average of the accuracy in each bin as a calibration map. A binning model $\mathcal{M}$ is weighted by $\mathbb{P}(\mathcal{M})\,\mathbb{P}(\mathcal{D} \mid \mathcal{M})$, where $\mathbb{P}(\mathcal{M})$ is uniform and the marginal likelihood $\mathbb{P}(\mathcal{D} \mid \mathcal{M})$ can be computed in closed form given parametric assumptions on data generation.
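As a rough illustration, the beta calibration map can be fit by directly minimizing the NLL over its three parameters, reusing the synthetic z2 and y from the sketch above. Kull et al. (2017a) equivalently fit this family as a logistic regression in the features $\ln z_2$ and $-\ln(1 - z_2)$; the direct minimization below is only the most literal reading of the formula.

```python
import numpy as np
from scipy.optimize import minimize

def beta_map(z2, a, b, c):
    """Beta calibration map v(z)_2 = (1 + exp(-c) (1 - z_2)^b z_2^(-a))^(-1)."""
    z2 = np.clip(z2, 1e-12, 1.0 - 1e-12)
    return 1.0 / (1.0 + np.exp(-c) * (1.0 - z2) ** b * z2 ** (-a))

def nll(params, z2, y):
    a, b, c = params
    p = np.clip(beta_map(z2, a, b, c), 1e-12, 1.0 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# a, b >= 0 keeps the fitted map monotonically non-decreasing.
res = minimize(nll, x0=np.array([1.0, 1.0, 0.0]), args=(z2, y),
               bounds=[(0.0, None), (0.0, None), (None, None)])
a_hat, b_hat, c_hat = res.x
v_beta = beta_map(z2, a_hat, b_hat, c_hat)
```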

2.3.2 Multi-class Calibration

One-vs-All. In order to extend binary calibration methods to multi-class problems, Zadrozny and Elkan (2002) suggest a one-vs-all approach, training a binary classifier on each split and calibrating subsequently. As most modern classifiers are inherently multi-class, this approach is not feasible anymore. We instead use a one-vs-all approach for the output $\mathbf{z}$ of the multi-class classifier, train a calibration method on each split and average their predictions.

Temperature Scaling (Guo et al., 2017) was introduced as a multi-class extension to Platt scaling for neural networks. For an output logit vector $\mathbf{z}$ of a neural network and a temperature parameter $T > 0$, the calibrated confidence is defined as
$$v(\mathbf{z})_k = \sigma\Big(\frac{\mathbf{z}}{T}\Big)_k = \frac{\exp(z_k / T)}{\sum_{j=1}^K \exp(z_j / T)}. \tag{3}$$
The parameter $T$ is determined by optimizing the NLL. By construction, the accuracy of the classifier is unchanged after scaling. Variants of this method where the factor $\tfrac{1}{T}$ is replaced by an affine map were shown to be ineffective (Guo et al., 2017).
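Since Eq. (3) has a single scalar parameter, fitting temperature scaling amounts to a one-dimensional NLL minimization on the calibration set. A minimal sketch, where the bounded search interval and the placeholder names cal_logits and cal_labels are our choices, not from the paper or a specific library:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import log_softmax, softmax

def fit_temperature(logits, labels):
    """Fit T > 0 by minimizing the NLL of softmax(logits / T) on the calibration set."""
    labels = np.asarray(labels)
    def nll(t):
        log_probs = log_softmax(logits / t, axis=1)
        return -log_probs[np.arange(len(labels)), labels].mean()
    return minimize_scalar(nll, bounds=(1e-2, 1e2), method="bounded").x

# Hypothetical usage on held-out calibration logits and labels:
# T = fit_temperature(cal_logits, cal_labels)
# calibrated_probs = softmax(test_logits / T, axis=1)
```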
3 GAUSSIAN PROCESS CALIBRATION

In the following section, we will outline our non-parametric calibration approach. Our aim is to develop a calibration method, which is inherently multi-class, suitable for arbitrary classifiers, makes as few assumptions as possible on the shape of the calibration map and can take prior knowledge into account. These desired properties motivate the use of a latent GP.

Figure 3: Toy example of multi-class GP calibration. The top panel shows the resulting latent Gaussian process with prior mean $\mu(z_k) = \ln(z_k)$ when applying GP calibration to a synthetic calibration data set with four classes and 100 calibration samples. At the bottom, the class confidence scores making up the calibration data are plotted. The calibration uncertainty of the latent GP depends on their distribution.

Definition. Assume a one-dimensional Gaussian process prior over the latent function $g : \mathbb{R} \to \mathbb{R}$, i.e.
$$g \sim \mathcal{GP}\big(\mu(\cdot), k(\cdot, \cdot \mid \theta)\big)$$
with mean function $\mu$, kernel $k$ and kernel parameters $\theta$ (Rasmussen and Williams, 2005). Further, let the calibrated output be given by the softargmax inverse link function applied to the latent process evaluated at the model output
$$v(\mathbf{z})_k = \sigma\big(g(z_1), \dots, g(z_K)\big)_k = \frac{\exp(g(z_k))}{\sum_{j=1}^K \exp(g(z_j))}.$$
Note the similarity to multi-class Gaussian process classification, which has $K$ latent functions (Rasmussen and Williams, 2005). In contrast, we consider one shared latent function applied to each component $z_k$. We use the categorical likelihood
$$\mathrm{Cat}(y \mid v(\mathbf{z})) = \prod_{k=1}^K \sigma\big(g(z_1), \dots, g(z_K)\big)_k^{[y = k]} \tag{4}$$
to obtain a prior on the class prediction. We make the prior assumption that the classifier is already calibrated. This corresponds to either $\mu(z_k) = \ln(z_k)$ if the inputs are confidence estimates, or to $\mu(z_k) = z_k$ if the inputs are logits. For specific models other choices may be beneficial, e.g. a linear prior. An example of a latent function for a synthetic data set is shown in Figure 3. If the latent function $g$ is monotonically increasing in its domain, the accuracy of the underlying classifier is unchanged after calibration.
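The sketch below illustrates only the deterministic calibration map for confidence inputs, with the shared latent function $g$ treated as a fixed monotone function; in the method itself $g$ is a latent GP whose posterior is inferred in the next section. With the prior mean $g(z_k) = \ln(z_k)$ the classifier's output is left unchanged, which is exactly the stated prior assumption.

```python
import numpy as np
from scipy.special import softmax

def gp_calibration_map(probs, g):
    """v(z)_k = softargmax(g(z_1), ..., g(z_K))_k with one shared latent function g."""
    return softmax(g(np.clip(probs, 1e-12, 1.0)), axis=1)

probs = np.array([[0.70, 0.20, 0.10]])
# Prior mean g(z_k) = ln(z_k) for probability inputs leaves the prediction unchanged:
print(gp_calibration_map(probs, np.log))                 # [[0.7 0.2 0.1]]
# A flatter monotone latent function softens overconfident predictions instead:
print(gp_calibration_map(probs, lambda z: 0.5 * np.log(z)))
```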

Inference. In order to infer the calibration map, we need to fit the underlying GP based on the confidence scores or logits and labels in the calibration set. Given the likelihood (4), the posterior is not analytically tractable. We use variational inference to approximate the posterior (Girolami and Rogers, 2006, Paul et al., 2012). For our method to scale to large data sets we only retain a sparse representation of the inputs, making inference computationally less intensive. We extend an approach by Hensman et al. (2015) to our choice of likelihood. The joint distribution of the data $(\mathbf{z}_n, y_n)_{n=1}^N$ and latent variables $\mathbf{g}$ is given by
$$p(\mathbf{y}, \mathbf{g}) = p(\mathbf{y} \mid \mathbf{g})\, p(\mathbf{g}) = \prod_{n=1}^N p(y_n \mid \mathbf{g}_n)\, p(\mathbf{g}) = \prod_{n=1}^N \mathrm{Cat}\big(y_n \mid \sigma(\mathbf{g}_n)\big)\, \mathcal{N}(\mathbf{g} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}_{\mathbf{g}}),$$
where $\mathbf{y} \in \{1, \dots, K\}^N$, $\mathbf{g} = (\mathbf{g}_1, \mathbf{g}_2, \dots, \mathbf{g}_N)^\top \in \mathbb{R}^{NK}$ and $\mathbf{g}_n = (g(z_{n1}), \dots, g(z_{nK}))^\top \in \mathbb{R}^K$. The covariance matrix $\boldsymbol{\Sigma}_{\mathbf{g}}$ has block-diagonal structure by independence of the calibration data. If performance is important, a further diagonal assumption can be made. Note that we drop the explicit dependence on $\mathbf{z}_n$ and $\theta$ throughout to lighten the notation. We want to compute the posterior $p(\mathbf{g} \mid \mathbf{y})$. In order to reduce the computational complexity $\mathcal{O}((NK)^3)$, we define $M$ inducing inputs $\mathbf{w} \in \mathbb{R}^M$ and inducing variables $\mathbf{u} \in \mathbb{R}^M$. The joint distribution is given by
$$p(\mathbf{g}, \mathbf{u}) = \mathcal{N}\!\left(\begin{pmatrix} \mathbf{g} \\ \mathbf{u} \end{pmatrix} \,\middle|\, \begin{pmatrix} \boldsymbol{\mu}_{\mathbf{g}} \\ \boldsymbol{\mu}_{\mathbf{u}} \end{pmatrix}, \begin{pmatrix} \boldsymbol{\Sigma}_{\mathbf{g}} & \boldsymbol{\Sigma}_{\mathbf{g},\mathbf{u}} \\ \boldsymbol{\Sigma}_{\mathbf{g},\mathbf{u}}^\top & \boldsymbol{\Sigma}_{\mathbf{u}} \end{pmatrix}\right). \tag{5}$$
The joint distribution factorizes as $p(\mathbf{y}, \mathbf{g}, \mathbf{u}) = p(\mathbf{y} \mid \mathbf{g})\, p(\mathbf{g} \mid \mathbf{u})\, p(\mathbf{u})$. We aim to find a variational approximation $q(\mathbf{u}) = \mathcal{N}(\mathbf{u} \mid \mathbf{m}, \mathbf{S})$ to the posterior $p(\mathbf{u} \mid \mathbf{y})$. For general treatments on variational inference we refer to Blei et al. (2017), Zhang et al. (2018). We find the variational parameters $\mathbf{m}$ and $\mathbf{S}$, the locations of the inducing inputs $\mathbf{w}$ and the kernel parameters $\theta$ by maximizing a lower bound to the marginal log-likelihood
$$\begin{aligned} \ln p(\mathbf{y}) &\ge \mathrm{ELBO}(q(\mathbf{u})) = \mathbb{E}_{q(\mathbf{u})}[\ln p(\mathbf{y} \mid \mathbf{u})] - \mathrm{KL}[q(\mathbf{u}) \,\|\, p(\mathbf{u})] \\ &\ge \mathbb{E}_{q(\mathbf{u})}\big[\mathbb{E}_{p(\mathbf{g} \mid \mathbf{u})}[\ln p(\mathbf{y} \mid \mathbf{g})]\big] - \mathrm{KL}[q(\mathbf{u}) \,\|\, p(\mathbf{u})] \\ &= \mathbb{E}_{q(\mathbf{g})}[\ln p(\mathbf{y} \mid \mathbf{g})] - \mathrm{KL}[q(\mathbf{u}) \,\|\, p(\mathbf{u})] \\ &= \sum_{n=1}^N \mathbb{E}_{q(\mathbf{g}_n)}[\ln p(y_n \mid \mathbf{g}_n)] - \mathrm{KL}[q(\mathbf{u}) \,\|\, p(\mathbf{u})], \end{aligned} \tag{6}$$
where $q(\mathbf{g}) := \int p(\mathbf{g} \mid \mathbf{u})\, q(\mathbf{u})\, d\mathbf{u}$ is Gaussian and only its $K$-dimensional marginals $q(\mathbf{g}_n) = \mathcal{N}(\mathbf{g}_n \mid \boldsymbol{\phi}_n, \mathbf{C}_n)$ are required to compute the expectation terms. To do so, we use a second-order Taylor approximation for $\ln p(y_n \mid \mathbf{g}_n)$ and obtain
$$\mathbb{E}_{q(\mathbf{g}_n)}[\ln p(y_n \mid \mathbf{g}_n)] \approx \ln p(y_n \mid \boldsymbol{\phi}_n) + \tfrac{1}{2}\Big(\sigma(\boldsymbol{\phi}_n)^\top \mathbf{C}_n\, \sigma(\boldsymbol{\phi}_n) - \mathrm{diag}(\mathbf{C}_n)^\top \sigma(\boldsymbol{\phi}_n)\Big),$$
which can be computed in $\mathcal{O}(K^2)$. Computing the KL-divergence term is in $\mathcal{O}(M^3)$. Therefore, computing the objective (6) has complexity $\mathcal{O}(NK^2 + M^3)$. Note that this can be remedied through parallelization as all $N$ expectation terms can be computed independently. The optimization is performed via a gradient-based optimizer and automatic differentiation. We refer to Section S2 of the supplementary material for a more detailed treatment of inference.
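A direct transcription of the Taylor-approximated expectation term for a single calibration point is sketched below; in the full objective it is summed over the $N$ calibration points and combined with the KL term, and $\mathbf{C}_n$ is restricted to a diagonal in the experiments reported later.

```python
import numpy as np
from scipy.special import softmax, log_softmax

def approx_expected_log_lik(phi_n, C_n, y_n):
    """Second-order Taylor approximation of E_q[ln p(y_n | g_n)].

    phi_n : (K,) mean of the marginal q(g_n)
    C_n   : (K, K) covariance of the marginal q(g_n)
    y_n   : observed class index in {0, ..., K-1}
    """
    s = softmax(phi_n)
    log_lik_at_mean = log_softmax(phi_n)[y_n]            # ln p(y_n | phi_n)
    correction = 0.5 * (s @ C_n @ s - np.diag(C_n) @ s)  # curvature term from the covariance
    return log_lik_at_mean + correction
```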

Calibration. Given the approximate posterior $p(\mathbf{g}, \mathbf{u} \mid \mathbf{y}) \approx p(\mathbf{g} \mid \mathbf{u})\,q(\mathbf{u})$, calibrated predictions at new inputs $\mathbf{z}_1^*, \dots, \mathbf{z}_L^* \in \mathbb{R}^K$ are obtained via
$$p(\mathbf{g}_* \mid \mathbf{y}) = \int p(\mathbf{g}_* \mid \mathbf{g}, \mathbf{u})\, p(\mathbf{g}, \mathbf{u} \mid \mathbf{y})\, d\mathbf{g}\, d\mathbf{u} \approx \int p(\mathbf{g}_* \mid \mathbf{u})\, q(\mathbf{u})\, d\mathbf{u},$$
which is Gaussian. Mean $\boldsymbol{\mu}_{\mathbf{g}_*}$ and variance of a latent value $\mathbf{g}_* \in \mathbb{R}^K$ can be computed in $\mathcal{O}(KM^2)$. The class predictions $y_*$ are then obtained by marginalization
$$p(y_* \mid \mathbf{y}) = \int p(y_* \mid \mathbf{g}_*)\, p(\mathbf{g}_* \mid \mathbf{y})\, d\mathbf{g}_*$$
via Monte-Carlo integration. While inference and calibration have higher computational cost than in other methods, it is orders of magnitude less than the training cost of the classifier. Furthermore, calibration can be performed in parallel with training in the online setting. We can speed up calibration by approximating the predictive distribution via the GP mean, i.e. $p(y_* \mid \mathbf{y}) \approx p(y_* \mid \boldsymbol{\mu}_{\mathbf{g}_*}) = \sigma(\boldsymbol{\mu}_{\mathbf{g}_*})$.
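A minimal sketch of the Monte-Carlo prediction step and of the mean approximation, assuming the posterior mean and covariance of $\mathbf{g}_*$ at a new input have already been computed; the default sample count matches the $Q = 100$ noted in Table 2.

```python
import numpy as np
from scipy.special import softmax

def predict_calibrated(mu_star, cov_star, n_samples=100, rng=None):
    """Monte-Carlo estimate of p(y_* | y) from the Gaussian posterior over g_*.

    mu_star  : (K,) posterior mean of the latent values at a new input z_*
    cov_star : (K, K) posterior covariance
    """
    rng = np.random.default_rng() if rng is None else rng
    g_samples = rng.multivariate_normal(mu_star, cov_star, size=n_samples)
    return softmax(g_samples, axis=1).mean(axis=0)

def predict_calibrated_mean(mu_star):
    """Mean approximation p(y_* | y) ~= sigma(mu_g*), used to speed up calibration."""
    return softmax(mu_star)
```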

4 EXPERIMENTS

We experimentally evaluate our approach against the calibration methods presented in Section 2.3, applied to different classifiers on a range of binary and multi-class computer vision benchmark data sets. Besides CNNs, we are also interested in ensemble methods. All methods and experiments were implemented in Python 3.6. GPcalib was developed based on gpflow (Matthews et al., 2017). Any results reported used a sum kernel consisting of an RBF and a white noise kernel and a diagonal covariance matrix $\boldsymbol{\Sigma}_{\mathbf{g}}$. An implementation of GP calibration and code replicating the experiments is available at https://github.com/JonathanWenger/pycalib.

4.1 Calibration Results

We report the average ECE1 estimated with 100 bins over 10 Monte-Carlo cross validation runs. We chose a larger number of bins than in previous works, as too few bins typically underestimate the ECE1 (Kumar et al., 2019). See Section S3.1 of the supplementary material for details. We used the following data sets with indicated train, calibration and test splits:

- KITTI (Geiger et al., 2012, Narr et al., 2016): Stream-based urban traffic scenes with features (Himmelsbach et al., 2009) from segmented 3D point clouds. 8 or 2 classes, train: 16000, calibration: 1000, test: 8000.
- PCam (Veeling et al., 2018): Histopathologic scans of (metastatic) tissue from lymph node sections converted to grayscale. 2 classes, train: 22768, calibration: 1000, test: 9000.
- MNIST (LeCun et al., 1998): Handwritten digit recognition. 10 classes, train: 60000, calibration: 1000, test: 9000.
- CIFAR-100 (Krizhevsky, 2009): Image database of tiny color images from the web. 100 classes, train: 50000, calibration: 1000, test: 9000.
- ImageNet 2012 (Russakovsky et al., 2015): Image database of natural objects. 1000 classes, train: 1.2 million, calibration: 1000, test: 10000.

Binary Classification. We trained two boosting variants, AdaBoost (Freund and Schapire, 1997, Hastie et al., 2009) and XGBoost (Chen and Guestrin, 2016), two forest variants, Mondrian Forests (Lakshminarayanan et al., 2014) and Random Forests (Breiman, 2001), and a one layer neural network on the binary KITTI and PCam data sets. We report the average ECE1 in Table S3 in the supplementary material. For binary problems all calibration methods perform similarly with the exception of isotonic regression, which has particularly low calibration error on the KITTI data set. However, due to its piecewise-constant calibration map the resulting confidence distribution has a set of singular peaks instead of a smooth distribution. While GPcalib is competitive across data sets and classifiers, it does not outperform isotonic regression. Hence, if exclusively binary problems are of interest, a simple calibration method should be preferred. Interestingly, the 1-layer NN trained on KITTI is already well calibrated, however all calibration methods except isotonic regression and GPcalib increase the ECE1.

Multi-class Classification. Besides the aforementioned classification models, which were trained on MNIST, we also calibrated pre-trained CNN architectures on CIFAR-100 and ImageNet. Pre-trained CNNs were obtained from https://github.com/bearpaw/pytorch-classification and https://github.com/Cadene/pretrained-models.pytorch. The following CNNs were used: AlexNet (Krizhevsky et al., 2012), VGG19 (Simonyan and Zisserman, 2014, Liu and Deng, 2015), ResNet-50, ResNet-152 (He et al., 2016), WideResNet (Zagoruyko and Komodakis, 2016), DenseNet-121, DenseNet-BC-190, DenseNet-201 (Huang et al., 2017), InceptionV4 (Szegedy et al., 2016), ResNeXt-29, SE-ResNeXt-50, SE-ResNeXt-101 (Xie et al., 2017, Hu et al., 2018), PolyNet (Zhang et al., 2017), SENet-154 (Hu et al., 2018), PNASNet-5-Large (Liu et al., 2018) and NASNet-A-Large (Zoph et al., 2018). All binary calibration methods were extended to the multi-class setting in a one-vs-all manner. Temperature scaling and GPcalib were applied to logits for all CNNs and otherwise directly to probability scores. The average ECE1 is shown in Table 1. While binary methods still perform reasonably well for 10 classes in the case of MNIST and CIFAR-100, they worsen calibration considerably in the case of 1000 classes on ImageNet. Moreover, they also skew the posterior distribution so much that accuracy is heavily affected, disqualifying them from use. Temperature scaling preserves the underlying accuracy of the classifier by definition. Even though GP calibration has no such guarantees, our experiments show little effect on accuracy (see Table S4 in the supplementary material). GP calibration performs comparably to other methods on MNIST except for the simple NN and AdaBoost. It does not improve upon calibration for the simple NN, but it is the only method able to handle the large ECE1 of AdaBoost. GPcalib calibrates significantly better on CIFAR-100 than all other methods for all CNNs, except AlexNet. On ImageNet GPcalib demonstrates low ECE1 within one to two standard deviations of temperature scaling on four CNNs, but outperforms all other calibration methods on the remaining nine evaluated architectures. In particular on higher accuracy CNNs (see Table S4), GPcalib calibrates better. For CNNs which already demonstrate low ECE1, such as InceptionV4 and SE-ResNeXt-50, most methods worsen calibration, whereas GPcalib does not. We attribute this desirable behavior, also seen in the binary case, to its prior assumption that the underlying classifier is already calibrated. The increased flexibility of the non-parametric latent map and its prior assumptions allow GPcalib to adapt to various classifiers and data sets.

Table 1: Multi-class calibration experiments. Average ECE1 of 10 Monte-Carlo cross validation folds on multi-class benchmark data sets. Calibration errors (ECE1) within one standard deviation of lowest per data set and model are printed in bold.

(Platt, Isotonic, Beta and BBQ are applied one-vs-all.)

Data Set | Model | Uncal. | Platt | Isotonic | Beta | BBQ | Temp. | GPcalib
MNIST | AdaBoost | .6121 | .2267 | .1319 | .2222 | .1384 | .1567 | .0414
MNIST | XGBoost | .0740 | .0449 | .0176 | .0184 | .0207 | .0222 | .0180
MNIST | Mondrian Forest | .2163 | .0357 | .0282 | .0383 | .0762 | .0208 | .0213
MNIST | Random Forest | .1178 | .0273 | .0207 | .0259 | .1233 | .0121 | .0148
MNIST | 1 layer NN | .0262 | .0126 | .0140 | .0168 | .0186 | .0195 | .0239
CIFAR-100 | AlexNet | .2751 | .0720 | .1232 | .0784 | .0478 | .0365 | .0369
CIFAR-100 | WideResNet | .0664 | .0838 | .0661 | .0539 | .0384 | .0444 | .0283
CIFAR-100 | ResNeXt-29 (8x64) | .0495 | .0882 | .0599 | .0492 | .0392 | .0424 | .0251
CIFAR-100 | ResNeXt-29 (16x64) | .0527 | .0900 | .0620 | .0520 | .0365 | .0465 | .0266
CIFAR-100 | DenseNet-BC-190 | .0717 | .0801 | .0665 | .0543 | .0376 | .0377 | .0237
ImageNet | AlexNet | .0353 | .1132 | .2937 | .2290 | .1307 | .0342 | .0357
ImageNet | VGG19 | .0377 | .0965 | .2810 | .2416 | .1617 | .0342 | .0364
ImageNet | ResNet-50 | .0441 | .0875 | .2724 | .2250 | .1635 | .0341 | .0335
ImageNet | ResNet-152 | .0545 | .0879 | .2761 | .2201 | .1675 | .0323 | .0283
ImageNet | DenseNet-121 | .0380 | .0949 | .2682 | .2297 | .1512 | .0329 | .0357
ImageNet | DenseNet-201 | .0410 | .0898 | .2706 | .2189 | .1614 | .0324 | .0367
ImageNet | InceptionV4 | .0318 | .0865 | .2900 | .1653 | .1593 | .0462 | .0269
ImageNet | SE-ResNeXt-50 | .0440 | .0889 | .2684 | .1789 | .1990 | .0482 | .0279
ImageNet | SE-ResNeXt-101 | .0574 | .0853 | .2844 | .1631 | .1496 | .0415 | .0250
ImageNet | PolyNet | .0823 | .0806 | .2590 | .2006 | .1787 | .0369 | .0283
ImageNet | SENet-154 | .0612 | .0809 | .3003 | .1582 | .1502 | .0497 | .0309
ImageNet | PNASNet-5-Large | .0702 | .0796 | .3063 | .1430 | .1355 | .0486 | .0270
ImageNet | NASNet-A-Large | .0530 | .0826 | .3265 | .1437 | .1268 | .0516 | .0255

Latent Function Visualization. In order to illustrate the benefit of a non-linear latent function when calibrating, we show some latent functions from our experiments on ImageNet. We compare GPcalib with temperature scaling and no calibration, corresponding to the identity map. Figure 4 illustrates in logit space how a non-linear latent function allows for lower ECE1 when the calibration data necessitates it. When comparing latent functions, note that $\sigma(g(\mathbf{z}) + \mathrm{const}) = \sigma(g(\mathbf{z}))$, i.e. an arbitrary shift of the latent functions on the y-axis corresponds to the same uncertainty estimates. We can also see how the latent GP gives information via its covariance on where the latent function's shape is more certain based on the seen calibration data. For more examples, also for probability scores, we refer to Section S3.7 of the supplementary material.

Computation Time. We give a complexity analysis for temperature scaling and GPcalib in Table 2. The cost of evaluation of the optimization objective and calibration for different variants of GP calibration are shown. For a diagonal covariance matrix, evaluating the optimization objective is of similar complexity to temperature scaling, since in general NK dominates the cubed number of inducing points. However, in our experiments the optimizer converged more slowly for GPcalib than for temperature scaling. We provide wall-clock inference and prediction runtime averaged across models per benchmark data set in Section S3.6 of the supplementary material. In practice, for most classifiers the training time is orders of magnitude larger than the time for inferring the latent GP in our calibration method. Calibration is computationally more expensive for GPcalib compared to other methods, in part due to the marginalization of the calibration uncertainty. This can be reduced by one order of magnitude for data sets with a large number of classes via the mean approximation with practically no effect on the ECE1. The resulting time taken for calibration is about one order of magnitude more than temperature scaling. In our experiments this was at least two orders of magnitude less than the prediction time of the classifier.
Table 2: Computational complexity. Complexity analysis of evaluation of the optimization objective for parameter inference of the calibration methods and complexity of calibration. N denotes the size of the calibration data, K the number of classes, M the number of inducing points and Q the number of MC samples. For the mean approximation a = 1 with a diagonal covariance and a = 2 for the full covariance. Implementation choices are M = 10 and Q = 100.

Method | Optim. obj. | Calibration
Temp. scal. | O(NK) | O(K)
GPcalib (diag. cov.) | O(NK + M^3) | O(K(M^2 + Q))
GPcalib (full cov.) | O(NK^2 + M^3) | O(K^2(M^2 + Q))
GPcalib (mean appr.) | O(NK^a + M^3) | O(K^a M^2)

[Figure 4: latent functions g plotted over logits for ResNet-152, PolyNet and PNASNet-5-Large, comparing the uncalibrated identity map, temperature scaling and GPcalib.]

Figure 4: Non-linear calibration maps in logit-space. The plot shows latent functions of temperature scaling and GPcalib from a single CV run of our experiments on ImageNet. For PolyNet and PNASNet GPcalib shows a significant decrease in ECE1 in Table 1, corresponding to a higher degree of non-linearity in the latent GP.

Active Learning. We hypothesize that calibration could improve active learning when querying based on uncertainty. We trained two Mondrian forests on the multi-class KITTI data set. These are well suited for the online setting as they have the same distribution whether trained online or in batch. We randomly shuffled the data 10 times and requested samples based on an entropy query strategy with a threshold of 0.25. Any samples above the threshold are used for training or calibration. Both forests are trained for 500 samples and subsequently one uses 250 samples exclusively for calibration in regularly spaced intervals. We report the ECE1 and classification error in Figure 5. Calibration initially incurs a penalty on accuracy, as fewer samples are used for training. This is remedied over time through more efficient querying. The same accuracy is reached after a pass through the entire data set while querying fewer samples overall. This can be explained by calibration adjusting over- and underconfidence to the ratio determined by Theorem 1. Here, underconfidence is reduced, leading to less querying of uninformative samples. This is a promising result, but further research questions arise regarding the size of the pre-training batch, the condition when calibration should be done, and the number of samples used for calibration. We believe that there is a trade-off, similar to an explore-exploit strategy, between classifier training and uncertainty calibration.

Figure 5: Active learning and calibration. ECE1 and classification error for two Mondrian forests trained online on labels obtained via an entropy query strategy on the KITTI data set. One forest is calibrated in regularly spaced intervals with GPcalib (gray). A GP regression up to the average number of queried samples across folds is shown. The calibrated forest queries ~10% less labels, while reaching comparable accuracy.
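For reference, a minimal sketch of the entropy query rule used in this experiment. Whether the entropy is normalized by ln K is our assumption; the text only specifies an entropy query strategy with a threshold of 0.25.

```python
import numpy as np

def should_query(probs, threshold=0.25):
    """Entropy-based query rule: request a label when predictive entropy is high.

    probs : (K,) predicted class probabilities for one streamed sample.
    Normalizing by ln(K) is an assumption not stated in the experiment description.
    """
    p = np.clip(probs, 1e-12, 1.0)
    entropy = -(p * np.log(p)).sum() / np.log(len(p))
    return entropy > threshold
```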

Future Directions. Our proposed calibration approach is worth extending in the following directions: (a) forcing a monotone latent Gaussian process (Riihimäki and Vehtari, 2010, Agrell, 2019) provably preserves the accuracy of the underlying classifier; (b) extending our method to the online setting (Bui et al., 2017) allows for continuous calibration; (c) using our method for what we call active calibration, the concept of using an active learning query strategy which switches between requesting samples for model training and uncertainty calibration based on the uncertainty of the latent Gaussian process.

5 CONCLUSION

In this paper we proposed a novel multi-class calibration method for arbitrary classifiers based on a latent Gaussian process, inferred via variational inference. We evaluated different calibration methods for a range of classifiers often employed in computer vision and robotics on a collection of benchmark data sets. Our method demonstrated strong performance across different model and data set combinations and performed particularly well for large-scale neural networks. In Theorem 1 we linked calibration to concepts from active learning. We conclude with a motivating example for the possible impact of calibration on active learning and outline future research directions.

Acknowledgements

JW gratefully acknowledges financial support by the European Research Council through ERC StG Action 757275 / PANAMA; the DFG Cluster of Excellence "Machine Learning - New Perspectives for Science", EXC 2064/1, project number 390727645; the German Federal Ministry of Education and Research (BMBF) through the Tübingen AI Center (FKZ: 01IS18039A); and funds from the Ministry of Science, Research and Arts of the State of Baden-Württemberg. JW is grateful to the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for support. RT gratefully acknowledges the Helmholtz Artificial Intelligence Cooperation Unit (HAICU), which partly supported this work.

The authors thank the anonymous reviewers for helpful comments on an earlier version of this manuscript. In particular we appreciate the suggestions on inference using the mean estimate of the latent Gaussian process and regarding the definition of over- and underconfidence.

References

Dario Amodei, Chris Olah, Jacob Steinhardt, Paul F. Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.
Burr Settles. Active learning literature survey. Technical Report 55-66, University of Wisconsin, Madison, 2010.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, volume 86/11, pages 2278–2324, 1998.
Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems (NeurIPS), pages 6402–6413, 2017.
Morris H. DeGroot and Stephen E. Fienberg. The comparison and evaluation of forecasters. Journal of the Royal Statistical Society. Series D (The Statistician), 32:12–22, 1983.
Mahdi Pakdaman Naeini, Gregory F. Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using bayesian binning. In Blai Bonet and Sven Koenig, editors, Proceedings of the Conference on Artificial Intelligence (AAAI), pages 2901–2907, 2015.
Juozas Vaicenavicius, David Widmann, Carl Andersson, Fredrik Lindsten, Jacob Roll, and Thomas Schön. Evaluating model calibration in classification. In Proceedings of the 22th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 3459–3467, 2019.
Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive bayesian classifiers. In Proceedings of the 18th International Conference on Machine Learning (ICML), pages 609–616, 2001.
Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning (ICML), pages 625–632, 2005a.
Alexandru Niculescu-Mizil and Rich Caruana. Obtaining calibrated probabilities from boosting. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), page 413, 2005b.
David J. C. MacKay. A practical bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.
Yarin Gal. Uncertainty in Deep Learning. PhD thesis, University of Cambridge, 2016.
James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In Proceedings of the International Conference on Machine Learning (ICML), pages 2408–2417, 2015.
Jimmy Ba, Roger Grosse, and James Martens. Distributed second-order optimization using kronecker-factored approximations. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.
Matthias Hein, Maksym Andriushchenko, and Julian Bitterwolf. Why relu networks yield high-confidence predictions far away from the training data and how to mitigate the problem. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems (NeurIPS), pages 5574–5584, 2017.
Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017.
Gabriel Pereyra, George Tucker, Jan Chorowski, Lukasz Kaiser, and Geoffrey E. Hinton. Regularizing neural networks by penalizing confident output distributions. In Proceedings of the 5th International Conference on Learning Representations (ICLR), 2017.
Allan H. Murphy. A new vector partition of the probability score. Journal of Applied Meteorology, 12(4):595–600, 1973.

Aviral Kumar, Sunita Sarawagi, and Ujjwal Jain. Trainable calibration measures for neural networks from kernel mean embeddings. In Proceedings of the 35th International Conference on Machine Learning (ICML), pages 2810–2819, 2018.
Wesley Maddox, Timur Garipov, Pavel Izmailov, Dmitry Vetrov, and Andrew Gordon Wilson. A simple baseline for bayesian uncertainty in deep learning. arXiv preprint arXiv:1902.02476, 2019.
Dimitrios Milios, Raffaello Camoriano, Pietro Michiardi, Lorenzo Rosasco, and Maurizio Filippone. Dirichlet-based gaussian processes for large-scale calibrated classification. In Advances in Neural Information Processing Systems (NeurIPS), pages 6008–6018, 2018.
Andrew G Wilson, Zhiting Hu, Ruslan R Salakhutdinov, and Eric P Xing. Stochastic variational deep kernel learning. In Advances in Neural Information Processing Systems (NeurIPS), pages 2586–2594, 2016.
John C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large-Margin Classifiers, pages 61–74. MIT Press, 1999.
Hsuan-Tien Lin, Chih-Jen Lin, and Ruby C Weng. A note on platt's probabilistic outputs for support vector machines. Machine Learning, 68(3):267–276, 2007.
Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the 8th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 694–699, 2002.
Meelis Kull, Telmo Silva Filho, and Peter Flach. Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 54, pages 623–631, 2017a.
Fattaneh Jabbari, Mahdi Pakdaman Naeini, and Gregory Cooper. Obtaining accurate probabilistic causal inference by post-processing calibration. In Workshop on Causal Inference and Machine Learning (NeurIPS), 2017.
Geoff Pleiss, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Q Weinberger. On fairness and calibration. In Advances in Neural Information Processing Systems (NeurIPS), pages 5680–5689, 2017.
Jon Kleinberg. Inherent trade-offs in algorithmic fairness. ACM SIGMETRICS Performance Evaluation Review, 46:40–40, 2018.
Allan H. Murphy and Robert L. Winkler. Diagnostic verification of probability forecasts. International Journal of Forecasting, 7:435–455, 1992.
Ira Cohen and Moises Goldszmidt. Properties and benefits of calibrated classifiers. In Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), pages 125–136. Springer, 2004.
D. Mund, R. Triebel, and D. Cremers. Active online confidence boosting for efficient object classification. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 1367–1373, 2015.
Meelis Kull, Telmo M Silva Filho, Peter Flach, et al. Beyond sigmoids: How to obtain well-calibrated probabilities from binary classifiers with beta calibration. Electronic Journal of Statistics, 11(2):5052–5080, 2017b.
Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2005.
M. Girolami and S. Rogers. Variational bayesian multinomial probit regression with gaussian process priors. Neural Computation, 18(8):1790–1817, 2006.
R. Paul, R. Triebel, D. Rus, and P. Newman. Semantic categorization of outdoor scenes with uncertainty estimates using multi-class gaussian process classification. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems (IROS), 2012.
James Hensman, Alexander G. de G. Matthews, and Zoubin Ghahramani. Scalable variational gaussian process classification. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS), 2015.
David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
Volodymyr Kuleshov and Stefano Ermon. Estimating uncertainty online against an adversary. In Proceedings of the Conference on Artificial Intelligence (AAAI), pages 2110–2116, 2017.
Volodymyr Kuleshov, Nathan Fenner, and Stefano Ermon. Accurate uncertainties for deep learning using calibrated regression. In Proceedings of the 35th International Conference on Machine Learning (ICML), volume 80, pages 2796–2804, 2018.
Hao Song, Tom Diethe, Meelis Kull, and Peter Flach. Distribution calibration for regression. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.

Cheng Zhang, Judith Bütepage, Hedvig Kjellström, and Stephan Mandt. Advances in variational inference. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2018.
Alexander G. de G. Matthews, Mark van der Wilk, Tom Nickson, Keisuke Fujii, Alexis Boukouvalas, Pablo León-Villagrá, Zoubin Ghahramani, and James Hensman. GPflow: A Gaussian process library using TensorFlow. Journal of Machine Learning Research, 18(40):1–6, 2017.
Ananya Kumar, Percy Liang, and Tengyu Ma. Variance reduced uncertainty calibration. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
Alexander Narr, Rudolph Triebel, and Daniel Cremers. Stream-based active learning for efficient and adaptive classification of 3d objects. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 227–233. IEEE, 2016.
Michael Himmelsbach, Thorsten Luettel, and H-J Wuensche. Real-time object classification in 3d point clouds using point feature histograms. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems (IROS), pages 994–1000. IEEE, 2009.
Bastiaan S Veeling, Jasper Linmans, Jim Winkens, Taco Cohen, and Max Welling. Rotation equivariant CNNs for digital pathology. In Proceedings of the International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), pages 210–218. Springer, 2018.
Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), KDD '16, pages 785–794, 2016.
Balaji Lakshminarayanan, Daniel M. Roy, and Yee Whye Teh. Mondrian forests: Efficient online random forests. In Advances in Neural Information Processing Systems (NeurIPS), pages 3140–3148, 2014.
Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 1097–1105, 2012.
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
S. Liu and W. Deng. Very deep convolutional neural network based image classification using small training sample size. In 3rd IAPR Asian Conference on Pattern Recognition (ACPR), pages 730–734, 2015.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Proceedings of the British Machine Vision Conference (BMVC), 2016.
Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
Christian Szegedy, Sergey Ioffe, and Vincent Vanhoucke. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the Conference on Artificial Intelligence (AAAI), 2016.
Saining Xie, Ross B. Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5987–5995, 2017.
Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
Trevor Hastie, Saharon Rosset, Ji Zhu, and Hui Zou. Multi-class adaboost. Statistics and its Interface, 2(3):349–360, 2009.

X. Zhang, Z. Li, C. C. Loy, and D. Lin. Polynet: A pursuit of structural diversity in very deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3900–3908, July 2017.
Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pages 19–34, 2018.
Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8697–8710, 2018.
Jaakko Riihimäki and Aki Vehtari. Gaussian processes with monotonicity information. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 645–652, 2010.
Christian Agrell. Gaussian processes with linear operator inequality constraints. Journal of Machine Learning Research, 20(135):1–36, 2019.
Thang D Bui, Cuong Nguyen, and Richard E Turner. Streaming sparse gaussian process approximations. In Advances in Neural Information Processing Systems (NeurIPS), pages 3299–3307, 2017.