Trainable Calibration Measures for Neural Networks from Kernel Mean Embeddings

Aviral Kumar 1 Sunita Sarawagi 1 Ujjwal Jain 1

Abstract

Modern neural networks have recently been found to be poorly calibrated, primarily in the direction of over-confidence. Methods like entropy penalty and temperature smoothing improve calibration by clamping confidence, but in doing so compromise the many legitimately confident predictions. We propose a more principled fix that minimizes an explicit calibration error during training. We present MMCE, an RKHS kernel based measure of calibration that is efficiently trainable alongside the negative likelihood loss without careful hyper-parameter tuning. Theoretically too, MMCE is a sound measure of calibration that is minimized at perfect calibration, and whose finite sample estimates are consistent and enjoy fast convergence rates. Extensive experiments on several network architectures demonstrate that MMCE is a fast, stable, and accurate method to minimize calibration error metrics while maximally preserving the number of high confidence predictions.

1. Introduction

Recently, (Guo et al., 2017) made the surprising observation that highly accurate, negative log likelihood trained, deep neural networks predict poorly calibrated confidence probabilities, unlike traditional models trained with the same objective (Niculescu-Mizil & Caruana, 2005). Poor calibration implies that if the network makes a prediction with more than 0.99 confidence (which it often does!), the predicted label may be correct much less than 99% of the time. Such lack of calibration is a serious problem in applications like medical diagnosis (Caruana et al., 2015; Crowson et al., 2016; Jiang et al., 2012), obstacle detection in self-driving vehicles (Bojarski et al., 2016), and other applications where learned models feed into decision systems or are human interpreted. Calibration is also useful for detecting out-of-sample examples (Hendrycks & Gimpel, 2017; Lee et al., 2018; Liang et al., 2018) and in fairness (Pleiss et al., 2017).

A primary reason for poor calibration of modern neural networks is that, due to their high capacity, the negative log likelihood (NLL) overfits without overfitting 0/1 error (Zhang et al., 2017). This manifests as overly confident predictions. Recently, (Guo et al., 2017) experimented with several known calibration fixes applied post training, and found a simple temperature scaling of logits to be most effective. A second option is to plan for calibration during training. Pereyra et al. (2017) propose to add an entropy regularizer to the NLL objective to clamp over-confidence. We show that both temperature scaling and entropy regularization manage to reduce aggregate calibration error but in the process needlessly clamp down legitimate high confidence predictions. A third set of approaches model full prediction uncertainty via variational Bayesian networks (Louizos & Welling, 2017), or their committee counterparts (Lakshminarayanan et al., 2017), but their training is too resource-intensive.

We propose a practical and principled fix by minimizing calibration error during training along with classification error. We depend on the power of RKHS functions induced by a universal kernel to express calibration error as a tractable integral probability measure, which we call Maximum Mean Calibration Error (MMCE). This is analogous to the way MMD over RKHS kernels expresses the distance between two probability distributions (Muandet et al., 2017; Li et al., 2015). We show that MMCE is a consistent estimator of calibration and converges at rate 1/sqrt(m) to its expected value. Furthermore, MMCE is easy to optimize in the existing mini-batch gradient descent framework.

Our experiments spanning seven datasets show that training with MMCE achieves significant reduction in calibration error while also providing a modest accuracy increase. MMCE achieves this without throttling high confidence predictions: for example, it predicts 72% of instances at 99.7% confidence, whereas temperature scaling predicts only 40% at 99.6% and entropy scaling only 7% at 96.9%. This is important in applications like medical diagnosis where only highly confident predictions result in saving the cost of manual screening. This study demonstrates that well-designed training methods can simultaneously improve calibration and accuracy without severely reducing the number of high-confidence predictions.

1 Department of Computer Science and Engineering, IIT Bombay, Mumbai, India. Correspondence to: Aviral Kumar.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

2. Problem Setup

Our focus is improving the calibration of multi-class classification models. Let Y = {1, 2, ..., K} denote the set of class labels and X denote a space of inputs. Let N_θ(y|x) denote the probability distribution the neural network predicts on an input x ∈ X, and let θ denote the network parameters. For an instance x_i with correct label y_i, the network predicts label ŷ_i = argmax_{y∈Y} N_θ(y|x_i). The prediction gets a correctness score c_i = 1 if ŷ_i = y_i and 0 otherwise, and a confidence score r_i = N_θ(ŷ_i|x_i). The model N_θ(y|x) is well-calibrated over a data distribution D when, over all (x_i, y_i) ∈ D with r_i = α, the probability that c_i = 1 is α. For example, out of a sample from D, if 100 examples are predicted with confidence 0.7, then we expect 70 of these to be correct when N_θ(y|x) is well-calibrated on D. More formally, we use P_{θ,D}(r, c) to denote the distribution over r and c values of the predictions of N_θ(y|x) on D. When N_θ(y|x) is well calibrated on data distribution D,

    P_{θ,D}(c = 1 | r ∈ I_α) = α    ∀α ∈ [0, 1]        (1)

where I_α denotes a small non-zero interval around α. Using this we can define an expected calibration error (ECE) as

    ECE(P_{θ,D}) = E_{P_{θ,D}(r)} [ | E_{P_{θ,D}(c|r)}[c] − r | ]        (2)

To estimate ECE on a data sample D ∼ P_{θ,D} we partition the [0, 1] range of r into B equal bins. We then sum up over each bin B_j = [j/B, (j+1)/B] the difference between the correctness and confidence scores over examples in that bin:

    ECE(D) = (1/|D|) Σ_{j=0}^{B−1} | Σ_{i∈D_j} c_i − Σ_{i∈D_j} r_i |,   where D_j = {i ∈ D : r_i ∈ [j/B, (j+1)/B]}        (3)

We are interested in models with low ECE and high accuracy. Note that a model that minimizes ECE may not necessarily have high accuracy. For example, a model that always predicts the majority class with confidence equal to the class's prior probability will have ECE 0 but is not accurate.

Fortunately, the negative log-likelihood loss (NLL = −Σ_{(x,y)∼D} log N_θ(y|x)) used for training neural networks optimizes for accuracy and calibration indirectly. NLL is minimized when N_θ(y|x) matches the true data distribution D, and is therefore trivially well-calibrated (Hastie et al., 2001). Popular classifiers like linear logistic regression and calibration methods like Platt scaling that optimize the NLL objective lead to well-calibrated models (Niculescu-Mizil & Caruana, 2005). Unfortunately, on high capacity neural networks NLL fails to minimize calibration error because of over-fitting.

We explore ways of training so as to directly optimize for calibration alongside NLL. We need to choose a trainable function CE(D, θ) to measure calibration error for joint optimization with NLL as follows:

    min_θ NLL(D, θ) + λ CE(D, θ)        (4)

We cannot use ECE(D) here because it is highly discontinuous in r and consequently in θ. In the next section we propose a new calibration measure called MMCE that is trainable and satisfies other properties of sound measures that we discuss in Section 4.

3. A Trainable Calibration Measure from Kernel Embeddings

Our goal is to design a function to serve as an optimizable surrogate for the calibration error. In this section we propose such a measure that is zero if and only if the model is calibrated and whose finite sample estimates are consistent and enjoy fast convergence rates. Further, we show empirically that it can be optimized over the network parameters θ using existing batch stochastic gradient algorithms.

Our approach is based on defining an integral probability measure over functions from a reproducing kernel Hilbert space (RKHS). Such approaches have emerged as a powerful tool in machine learning and have been successfully used in tasks like comparing two distributions (Gretton et al., 2012; Li et al., 2015), goodness-of-fit tests, and class ratio estimation (Iyer et al., 2014) (see (Muandet et al., 2017) for a survey). In this paper, we show their usage in defining a tractable measure of calibration error.

Let H denote a reproducing kernel Hilbert space (RKHS) induced by a universal kernel k(·,·) with canonical feature map φ : [0, 1] → H. We define a measure called maximum mean calibration error (MMCE) over P_{θ,D}(r, c) as:

    MMCE(P(r, c)) = ‖ E_{(r,c)∼P} [ (c − r) φ(r) ] ‖_H        (5)

where ‖·‖_H denotes the norm in the Hilbert space H. For ease of notation, we use P(r, c) for P_{θ,D}(r, c). Theorem 1 in Section 4 shows that MMCE is zero if and only if P(r, c) is calibrated over D, provided the kernel k(·,·) is universal. The finite sample estimate over a sample D ∼ P with D = {(r_1, c_1), ..., (r_m, c_m)} becomes:

    MMCE_m(D) = ‖ (1/m) Σ_{(r_i,c_i)∈D} (c_i − r_i) φ(r_i) ‖_H        (6)
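As a concrete illustration (a minimal sketch, not the authors' released implementation), the binned ECE of Eq 3 and the finite-sample MMCE of Eq 6 can be computed as follows. The 20-bin ECE and the Laplacian kernel of width 0.4 are taken from the experimental setup of Section 6, and MMCE_m is evaluated through the kernel expansion that Eq 7 below makes explicit.

```python
# Illustrative NumPy sketch of the binned ECE (Eq. 3) and the finite-sample
# MMCE (Eq. 6), the latter via the kernel expansion of Eq. 7:
#   MMCE_m(D)^2 = (1/m^2) sum_{i,j} (c_i - r_i)(c_j - r_j) k(r_i, r_j).
import numpy as np

def ece(confidences, correct, num_bins=20):
    """Binned ECE of Eq. 3: confidences r_i in [0,1], correctness c_i in {0,1}."""
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, num_bins + 1)
    total = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        # include the right edge only for the last bin
        in_bin = (conf >= lo) & (conf < hi) if hi < 1.0 else (conf >= lo) & (conf <= hi)
        if in_bin.any():
            total += abs(corr[in_bin].sum() - conf[in_bin].sum())
    return total / len(conf)

def mmce(confidences, correct, width=0.4):
    """Finite-sample MMCE of Eq. 6 with a Laplacian kernel k(r,r') = exp(-|r-r'|/width)."""
    r = np.asarray(confidences, dtype=float)
    c = np.asarray(correct, dtype=float)
    m = len(r)
    kernel = np.exp(-np.abs(r[:, None] - r[None, :]) / width)
    diff = c - r
    return np.sqrt(np.maximum(diff @ kernel @ diff, 0.0)) / m
```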

In Theorem 2 in Section 4 we show that the above estimate is consistent and converges at rate 1/sqrt(m) to MMCE(P(r, c)).

The above can be rewritten in terms of kernels as:

    MMCE²_m(D) = (1/m²) Σ_{i,j∈D} (c_i − r_i)(c_j − r_j) k(r_i, r_j)        (7)

3.1. Minimizing MMCE During Training

When training θ on a dataset D we minimize a weighted combination of NLL, to reduce classification errors, and MMCE, to reduce calibration errors. Training is done over mini-batches of examples D_b; we found that a D_b of size ≈ 100 suffices.

    min_θ Σ_{(x_i,y_i)∈D_b} −log N_θ(y_i|x_i) + λ ( MMCE²_m(D_b, θ) )^{1/2}        (8)

MMCE is differentiable in r but not strictly in θ, due to the argmax step for computing the predicted label. During backpropagation we pass the gradient through the r terms but not through the prediction step.

One issue with MMCE during training is that the number of incorrect samples (i.e., c = 0) is generally smaller than on test data. We obtained better results by re-weighting examples so that correct and incorrect samples have equal total weight. If n denotes the number of correct examples in a batch of size m, we assign a weight of m/n to each correct example and m/(m−n) to each incorrect example. Plugging these weights into Eq 7 (each pair contributes w_i w_j / m²) gives the weighted estimate

    MMCE²_w(D) = Σ_{c_i=c_j=0} r_i r_j k(r_i, r_j) / (m−n)²  +  Σ_{c_i=c_j=1} (1−r_i)(1−r_j) k(r_i, r_j) / n²  −  2 Σ_{c_i=1, c_j=0} (1−r_i) r_j k(r_i, r_j) / ((m−n) n)        (9)

In Section 6 we present an empirical comparison of these two forms of MMCE.

3.2. Why Does MMCE Work?

Theoretically, MMCE goes to zero only at perfect calibration, so minimizing it is sensible. We further try to explain why MMCE works in practice. Note from Eq 6 that MMCE attempts confidence calibration by comparing pairs of instances, whereas NLL works on instances individually. To understand why this could prevent overconfident predictions, consider the last term in Eq 9 and a simple example: say a batch has an instance x that is misclassified with a high confidence of 0.99, and all other examples X are correctly classified with confidence ∼ 1. The third term in Eq 9 will pair up x with these high-confidence correctly classified examples X. This will exert a downward pressure on X's confidence to make it less than 1. In contrast, NLL will continue to push the confidence of X up towards 1 to the point of over-fitting. We present empirical evidence to support that MMCE does indeed prevent overfitting of NLL: Figure 1 shows test NLL for MMCE and the baseline with increasing training epochs, and the baseline model overfits on test NLL more easily than the MMCE trained model. MMCE is most effective when a batch has a mix of correctly and incorrectly classified examples; if training accuracy is 100%, MMCE aligns perfectly with NLL and cannot prevent over-fitting.

4. Analyzing Properties of MMCE

In this section, we prove theoretical properties of MMCE and its finite sample estimate. First, we show that MMCE(P) is a faithful measure of calibration error that is zero if and only if P is perfectly calibrated. Next, we present large deviation bounds of MMCE with finite samples. Finally, we relate MMCE to ECE.

4.1. MMCE and Perfect Calibration

Theorem 1. Let P(r, c) be a probability measure defined on the space X = ([0, 1] × {0, 1}) such that P(r|c = 1) and P(r|c = 0) are Borel probability measures, and let k be a universal kernel. The MMCE measure (Equation 5) is 0 if and only if P is perfectly calibrated. This implies that MMCE is a proper scoring rule (Parry et al., 2011).

Proof. We start by defining an intractable integral probability measure for calibration that is more general than MMCE. Let C(r) denote the space of all continuous bounded functions over r ∈ [0, 1]. The measure is defined as

    M(C, P) = sup_{f∈C} E_{(r,c)∼P(r,c)} [ (c − r) f(r) ]        (10)

We first show that for calibrated P, M(C, P) is zero. The RHS of the above equation is ∫_{r=0}^{1} f(r) [ (1 − r) dP(r, c = 1) − r dP(r, c = 0) ]. This is zero because Eq 1 implies that P(I_r, c = 1) · (1 − r) = r · P(I_r, c = 0), and the integration is just a continuous form of this equality.

We then prove the converse, showing that when P is not calibrated, M is non-zero. For uncalibrated P, Eq 1 will not hold on at least one interval I ⊆ R of non-zero measure; call it I_0 = [r_0, r_0 + δ] (δ > 0). Further, since the P(r|c) are Borel measures, we can assume that the function P(I_0, c = 1) · (1 − r_0) − r_0 · P(I_0, c = 0) is right continuous and converges to some s_0 ≠ 0 as δ → 0. We can then choose f_0(r) ∈ C(r) to be a positive ramp function that is peaked at the center of I_0 and drops to zero outside it. For this choice E_P((c − r) f_0(r)) is guaranteed to be non-zero.

[Figure 1: Test NLL vs. train epochs for three datasets (20 Newsgroups, SST Binary, IMDB), comparing MMCE against the Baseline.]

Figure 1. Plots of test NLL as training progresses for three dataset/model combinations with the MMCE and baseline approaches. Note that the baseline tends to overfit on the training data, whereas MMCE is stable and overfits much less.
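The comparison in Figure 1 comes from optimizing the joint objective of Eq 8 with the weighted estimator of Eq 9. As a minimal illustrative sketch (not the authors' released code), that penalty can be written as a differentiable batch loss as below; the Laplacian kernel width of 0.4 follows Section 6, and the fallback to the unweighted form for batches with no misclassified example is an assumption made here for completeness.

```python
import torch

def laplacian_kernel(r1, r2, width=0.4):
    # k(r, r') = exp(-|r - r'| / width), computed pairwise
    return torch.exp(-torch.abs(r1.unsqueeze(1) - r2.unsqueeze(0)) / width)

def weighted_mmce_sq(r, c, width=0.4):
    """Weighted MMCE^2 of Eq. 9. r: predicted confidences, c: 0/1 correctness."""
    correct = c > 0.5
    r_cor, r_inc = r[correct], r[~correct]
    n = r_cor.numel()          # number of correctly classified examples
    m_minus_n = r_inc.numel()  # number of incorrectly classified examples
    if n == 0 or m_minus_n == 0:
        # degenerate batch: fall back to the unweighted estimator of Eq. 7
        k = laplacian_kernel(r, r, width)
        d = c - r
        return (d.unsqueeze(0) @ k @ d.unsqueeze(1)).squeeze() / (r.numel() ** 2)
    term_inc = (r_inc.unsqueeze(1) * r_inc.unsqueeze(0)
                * laplacian_kernel(r_inc, r_inc, width)).sum() / (m_minus_n ** 2)
    term_cor = ((1 - r_cor).unsqueeze(1) * (1 - r_cor).unsqueeze(0)
                * laplacian_kernel(r_cor, r_cor, width)).sum() / (n ** 2)
    cross = ((1 - r_cor).unsqueeze(1) * r_inc.unsqueeze(0)
             * laplacian_kernel(r_cor, r_inc, width)).sum() / (n * m_minus_n)
    return term_inc + term_cor - 2 * cross

def joint_loss(logits, labels, lam=1.0):
    """Eq. 8: summed NLL plus lambda times the square root of the weighted MMCE^2."""
    nll = torch.nn.functional.cross_entropy(logits, labels, reduction="sum")
    probs = torch.softmax(logits, dim=1)
    r, pred = probs.max(dim=1)        # confidence of the predicted label (differentiable)
    c = (pred == labels).float()      # correctness; no gradient flows through the argmax
    return nll + lam * torch.sqrt(weighted_mmce_sq(r, c) + 1e-10)
```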

We next proceed to replace the impractical function class C with functions F defined over the unit ball in an RKHS space H with the kernel k(·,·). If k is universal, for every f ∈ C and ε > 0 there exists a function f′ ∈ F such that max_r |f(r) − f′(r)| < ε. This property can be combined with the above claims to show that M(F, P) is zero if and only if P is calibrated, provided k(·,·) is universal.

Finally we show that M(F, P) = MMCE(P):

    M(F, P) = sup_{f∈F} E_{P(r,c)} [ (c − r) f(r) ]
            = sup_{f∈F} E_{(r,c)∼P(r,c)} [ (c − r) ⟨φ(r), f⟩ ]
            = ‖ E_{(r,c)∼P} [ (c − r) φ(r) ] ‖_H = MMCE(P)        (11)

The first and second equalities in the above are due to the feature mapping properties of RKHS functions.

4.2. Large Deviation Bounds of MMCE

Theorem 2. Let K = max_{r∈[0,1]} k(r, r) and let m be the size of sample D. With probability more than 1 − δ,

    |MMCE_m(D) − MMCE(P)| < sqrt(K/m) ( 4 + sqrt(−2 log δ) )

Proof in Section 1 of the supplementary material.

4.3. Relationship with ECE

Theorem 3. MMCE(P(r, c)) ≤ K^{1/2} ECE(P(r, c)), where K = max_r k(r, r).

Proof in Section 2 of the supplementary material.

5. Related Work

In classical decision theory, calibration of probabilistic predictions has long been studied under the topic of scoring rules (Parry et al., 2011). In machine learning, many different methods have been proposed for enhancing calibration of classifiers, including Platt's scaling (Platt, 1999), isotonic regression (Zadrozny & Elkan, 2002), and Bayesian binning (Naeini et al., 2015), to name a few. A systematic study in (Niculescu-Mizil & Caruana, 2005) found the logistic regression classifier and neural networks (of 2005) to be well-calibrated, and the ones that were not calibrated (e.g. SVMs) were best fixed via Platt scaling. Candela et al. (2005) discuss ways of measuring calibration error and highlight the potential of over-fitting of NLL. Kuleshov & Liang (2015) studied calibration for structured prediction, and Kuleshov & Ermon (2017) show how to calibrate in an online adversarial setting.

For modern neural networks, a recent systematic study (Guo et al., 2017) finds them to be surprisingly poorly calibrated. They compare several conventional post-facto fixes and find temperature scaling to provide the best calibration. Another recent work proposes an entropy regularizer during training to penalize over-confident predictions (Pereyra et al., 2017). A shortcoming of both methods is that they indiscriminately clamp the confidence of all predictions, robbing an end application of the benefit of high confidence predictions. Bayesian inference is a principled approach to get better prediction uncertainties (Louizos & Welling, 2017) but incurs significantly higher training cost. Lakshminarayanan et al. (2017) propose to average predictions from a committee of models, but this is resource-intensive.

Our use of kernel embeddings to measure calibration error is similar to their use in MMD to measure distance between distributions (Gretton et al., 2012). MMD has recently been used for training various types of neural networks, including generative models in (Li et al., 2015) and unsupervised domain-adaptation models in (Yan et al., 2017). For other measures defined on kernel embeddings see (Muandet et al., 2017). We are aware of no prior work on defining calibration errors using kernel embeddings.

6. Experiments

We compare our method of calibrating using MMCE with existing calibration methods on seven datasets spanning image, NLP, and time-series data, and six network architectures. We establish that a model trained with MMCE not only makes better calibrated and accurate predictions but does so at a higher confidence level than existing approaches. We also evaluate the computational overhead of training with the quadratic MMCE loss, and show that the increase in running time is no more than 10%.

Datasets The datasets used in our experiments are:
1. CIFAR-10 (Krizhevsky et al., a): Color images (32×32) from 10 classes. 45,000/5,000/10,000 images for train/validation/test.
2. CIFAR-100 (Krizhevsky et al., b): Same as above but with 100 classes.
3. Caltech Birds 200 (Welinder et al., 2010): Images of 200 bird species drawn from Imagenet. 5994/2897/2897 images for train/validation/test sets.
4. 20 Newsgroups: News articles partitioned into 20 categories by content. 15098/900/3999 documents for train/validation/test.
5. IMDB reviews (Maas et al., 2011): Polar movie reviews for sentiment classification. 25000/5000/20000 for train/validation/test.
6. UC Irvine Human Activity Recognition (HAR) (Anguita et al., 2013): Time series from phones corresponding to 6 human actions. 6653/699/2947 instances for train/validation/test.
7. Stanford Sentiment Treebank (SST) (Socher et al., 2012): Movie reviews, represented as parse trees that are annotated by sentiment. Each sample includes a binary label and a fine-grained 5-class label; we used the binary version. Training/validation/test sets contain 6920/872/1821 documents.

Models For the first three image datasets we used state-of-the-art convolutional networks like Resnet (He et al., 2016), Wide Resnet (Zagoruyko & Komodakis, 2016) and Inception v3 (Krause et al., 2016). We use different sizes (in terms of depth/number of layers) for each of the models. On 20 Newsgroups, we train a global pooling convolutional network (Lin et al., 2013). On IMDB reviews, we use a version of hierarchical attention networks (Yang et al., 2016). On UCI HAR, we use an LSTM. We used the TreeLSTM model for SST binary (Tai et al., 2015). We used publicly available models: CIFAR 10 Resnet: (Tensorflow, 2018), CIFAR 10 wide resnet and CIFAR 100 all models: (Xin Pan, 2018), Birds CUB dataset: (Vispedia, 2018), IMDB dataset: (Ilya Ivanov, 2018), 20 Newsgroups: (Keras Team, 2018), HAR dataset: (Guillaume Chevalier, 2017) and SST TreeLSTM: (Nicolas Pinchaud, 2017).

1 Code is partially available and will be made fully available at https://github.com/aviralkumar2907/MMCE.

Experiment setup The λ for weighting MMCE with respect to NLL is chosen via cross-validation. The same kernel k(r, r′) was used throughout, since r, r′ are probabilities. We chose the Laplacian kernel k(r, r′) = exp(−|r − r′| / 0.4), a universal kernel, with a width of 0.4. For measuring calibration error we use ECE with 20 bins, each of size 0.05. We use a batch size of 128, except when the default batch size in the code base was higher. For example, in the IMDB HAN codebase the default batch size was 256, and for UCI HAR the batch size was 1500. Other details about the optimizer and hyper-parameters were kept unchanged from the base models from the downloaded source. As an exception, the batch size for SST was kept fixed to 25 (the default value).

Methods Compared We compare the following methods:
Baseline The baseline models in all tasks were trained using negative log likelihood. All baseline models were downloaded from public sources and came with their best performing regularizers like dropout and weight penalty.
Baseline+T Guo et al. (2017) compared six methods of calibrating a baseline model using a validation dataset. Of these, temperature scaling was found to be the best in terms of ECE and the only method to not drop baseline accuracy. We compare with this approach and use Baseline+T to denote a temperature calibrated baseline model.
MMCE, MMCEm, MMCE+T These are variants of our MMCE-based approach. MMCE refers to the weighted version MMCE_w in Eq 9 and MMCEm to the unweighted version in Eq 7. Temperature scaling can also be applied on models trained using MMCE; we call this the MMCE+T method.
Entropy penalty We also compare with (Pereyra et al., 2017), which adds an entropy penalty as a regularizer to reduce over-confidence, a primary cause of poor calibration in modern neural networks.
Kernel regression MMCE uses kernels to measure calibration. Another conventional kernel-based measure is the Nadaraya-Watson kernel regressor. We use it to get a smoothed estimate of P(c = 1|r) and then minimize its distance from r to achieve calibration as per Eq 1. This gives us a trainable measure of calibration error,

    Σ_{i∈D} ( r_i − Σ_{j∈D} k(r_i, r_j) c_j / Σ_{j∈D} k(r_i, r_j) )²        (12)

that we use as CE(D, θ) in Equation 4.
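For reference, minimal sketches (not the compared papers' implementations) of the two alternative regularizers used as CE(D, θ) in Eq 4 are given below: the entropy penalty of Pereyra et al. (2017), which adds the negative predictive entropy to the loss, and the Nadaraya-Watson kernel-regression penalty of Eq 12. The kernel width of 0.4 mirrors the MMCE setup above.

```python
import torch

def entropy_penalty(logits):
    """Negative mean entropy of the predictive distribution; adding it to the
    training loss penalizes over-confident (low-entropy) predictions."""
    log_p = torch.log_softmax(logits, dim=1)
    entropy = -(log_p.exp() * log_p).sum(dim=1)
    return -entropy.mean()

def kernel_regression_penalty(r, c, width=0.4):
    """Eq. 12: squared distance between each confidence r_i and the
    Nadaraya-Watson smoothed estimate of P(c = 1 | r_i)."""
    k = torch.exp(-torch.abs(r.unsqueeze(1) - r.unsqueeze(0)) / width)
    smoothed = (k * c.unsqueeze(0)).sum(dim=1) / k.sum(dim=1)
    return ((r - smoothed) ** 2).sum()
```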

Figure 2. Reliability diagrams showing the calibration of Baseline, MMCE+T, and Baseline+T. The x-axis is prediction confidence in ten bins of equal size and the y-axis is accuracy at that bin's confidence; a perfectly calibrated model should touch the diagonal. While plotting, we dropped bins with less than 0.5% of the total data points falling into them.
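A sketch of how the reliability-diagram points in Figure 2 can be computed (an illustration, not the authors' plotting code): per-bin mean confidence against per-bin accuracy, with sparsely populated bins dropped as stated in the caption.

```python
import numpy as np

def reliability_points(confidences, correct, num_bins=10, min_frac=0.005):
    """Return (mean confidence, accuracy) pairs, one per sufficiently populated bin."""
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    points = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf >= lo) & ((conf < hi) | (hi == 1.0))
        if in_bin.sum() >= min_frac * len(conf):   # drop bins with < 0.5% of the points
            points.append((conf[in_bin].mean(), corr[in_bin].mean()))
    return points
```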

Evaluation Metrics We compare the methods on three different measures of calibration error (Candela et al., 2005): Expected Calibration Error (ECE), as described in Section 2; Brier score, (1/|D|) Σ_{(x_i,y_i)∈D} Σ_v (c_{i,v} − r_{i,v})²; and NLL. We also measure the fraction of highly confident calibrated predictions, to reward methods that achieve calibration without throttling correct high confidence scores.

6.1. Results

MMCE gains over baseline We first evaluate the efficacy of MMCE in reducing calibration errors over the baseline without dropping accuracy. Table 1 and Table 2 summarize the results over different dataset-architecture combinations. From columns 3 and 4 of Table 1 we observe that MMCE training consistently gives lower ECE values. For example, on CIFAR-10 all three models get almost a factor of 4 reduction in ECE. The reduction in calibration error is also evident when measured on Brier score and NLL (Table 2). MMCE also boosts baseline accuracy, showing that accuracy (as optimized by NLL) is not orthogonal to calibration (as optimized by MMCE).

All the models here were trained from scratch with MMCE, but MMCE can also be used for fine-tuning a pre-trained model. Table 1 in the supplementary material shows the ECE and accuracy comparisons with a baseline model fine-tuned using our NLL+λ MMCE objective. We find that fine-tuning with MMCE helps improve calibration, although not as much as training from scratch.

Comparison with temperature scaling In the last two columns of Table 1 we compare the effect on ECE of temperature scaling (TS) on both the baseline and the MMCE trained model. Accuracy stays unchanged with TS. We find that both models get an improvement in calibration (drop in ECE) with TS. Calibration is best overall with MMCE+T. On 20 Newsgroups, MMCE+T provides a particularly big improvement over Baseline+T. From Table 2 we observe that on Brier score and NLL too, MMCE (even without TS) provides a greater reduction in calibration error than Baseline+T. One natural question is why temperature scaling should improve MMCE. The reason we suspect is the injection of a different validation dataset. TS using MMCE (instead of NLL) on the validation dataset gives similar drops in ECE. But MMCE contributes a much greater fraction of the improvement over the baseline than TS: 70% of the ECE reduction of MMCE+T is due to MMCE training, with only the additional 30% coming from TS on the validation dataset.

We zoom beyond aggregated ECE values and compare Baseline, Baseline+T, and MMCE+T on reliability plots in Figure 2. In a reliability plot, a perfectly calibrated model should touch the diagonal; models below the diagonal are over-confident and models above it under-estimate confidence. The Baseline is overly confident and temperature scaling (TS) corrects that significantly. But on 20 Newsgroups, TS causes the Baseline model to swing in the opposite direction of under-confidence!

We further show that TS incurs confidence loss on several other datasets, but the loss happens at the very high end and does not get reflected in aggregate ECE numbers. Since over-confidence is the problem, we focus on the subset of predictions made with very high confidence, specifically above 0.99. For each method, we define a confident set (CS99) consisting of predictions with confidence above 0.99. We desire the largest possible CS99 as long as accuracy within CS99 is above 99%. We show the result in Table 4. The over-confident baseline, of course, gets the largest CS99, but its accuracy is below the contract of 99% in all five cases. Temperature scaling on the baseline does correct accuracy, but the size of the confident set is much smaller now. In contrast, MMCE (both with and without TS) provides a much larger number of confident predictions. For example, on CIFAR 10 (Resnet 110), temperature scaling reduced the size of CS99 from 89% to 40%, whereas MMCE lifted it to 72% without dropping accuracy. As pointed out earlier, fewer (or no) highly confident predictions is not good from the perspective of practical deployment of neural networks. Some applications rely on highly confident predictions, for example, applications of neural networks in medicine and healthcare. The ideal goal of confidence calibration must be to find not only accurate but highly confident predictions. Clamping down large confidence values is not the best way to tap the potential of modern high-capacity networks, and we need training-time methods like MMCE to weed out incorrect confident predictions.
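The confident-set statistics reported in Table 4 can be computed with a few lines; the following is an illustrative sketch of the CS99 metric (fraction of predictions above 0.99 confidence, and accuracy within that set), not the authors' evaluation code.

```python
import numpy as np

def confident_set(confidences, correct, threshold=0.99):
    """Return (|CS99| as a fraction of the test set, accuracy within CS99)."""
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correct, dtype=float)
    mask = conf > threshold
    size = mask.mean()
    acc = corr[mask].mean() if mask.any() else None
    return size, acc
```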

E#  Dataset          Model               ECE (Baseline)  ECE (MMCE)  Acc. (Baseline)  Acc. (MMCE)  ECE (Baseline+T)  ECE (MMCE+T)
1   MNIST            LeNet 5             0.5%            0.2%        99.24%           99.26%       −                 −
2   CIFAR 10         Resnet 50           4.3%            1.2%        93.1%            93.4%        0.9%              1.1%
3   CIFAR 10         Resnet 110          4.6%            1.1%        93.7%            94.0%        1.21%             1.19%
4   CIFAR 10         Wide Resnet 28-10   4.5%            1.6%        94.1%            94.2%        1.0%              1.1%
5   CIFAR 100        Resnet 32           19.6%           6.9%        67.0%            67.7%        2.5%              1.4%
6   CIFAR 100        Wide Resnet 28-10   15.0%           8.9%        74.0%            76.6%        2.5%              2.3%
7   Birds CUB 200    Inception-v3        2.6%            2.3%        78.2%            77.9%        2.0%              1.4%
8   20 Newsgroups    Global Pooling CNN  16.5%           6.5%        74.2%            73.9%        15.8%             6.2%
9   IMDB Reviews     HAN                 4.9%            0.4%        86.8%            86.3%        1.0%              0.2%
10  SST Binary       Tree LSTM           7.4%            5.9%        88.6%            88.7%        1.8%              3.6%
11  HAR time series  LSTM                7.6%            5.9%        89.4%            90.3%        3.8%              3.7%

Table 1. ECE and Accuracy for Baseline and MMCE after training from scratch and after temperature scaling using a validation dataset. Here a model is referenced by the name and the size of its network.
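For completeness, a minimal sketch of the post-hoc temperature scaling used for the Baseline+T and MMCE+T columns (following Guo et al., 2017, but with an illustrative optimizer choice; variable names are assumptions): a single scalar T is fit on held-out validation logits by minimizing NLL, and test logits are divided by T before the softmax.

```python
import torch

def fit_temperature(val_logits, val_labels, steps=200, lr=0.01):
    """Fit a scalar temperature T on pre-computed, detached validation logits."""
    log_t = torch.zeros(1, requires_grad=True)   # optimize log T so T stays positive
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        opt.step()
    return log_t.exp().item()

# Usage sketch:
#   T = fit_temperature(val_logits, val_labels)
#   calibrated_probs = torch.softmax(test_logits / T, dim=1)
```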

E#  Brier (Base)  Brier (MMCE)  Brier (B+T)  Test NLL (Base)  Test NLL (MMCE)  Test NLL (B+T)
2   0.121         0.090         0.111        0.31             0.21             0.23
4   0.087         0.085         0.082        0.24             0.18             0.18
5   0.522         0.445         0.467        1.74             1.24             1.26
6   0.340         0.338         0.338        0.96             0.94             0.93
7   0.318         0.312         0.317        0.87             0.86             0.86
8   0.401         0.365         0.380        1.34             0.97             1.02
9   0.197         0.189         0.190        0.33             0.30             0.31
10  0.220         0.209         0.218        0.42             0.40             0.41
11  0.183         0.172         0.178        0.49             0.34             0.38

Table 2. Brier score and NLL values for Baseline, MMCE, and Baseline+T (B+T) after training from scratch and after temperature scaling using a validation dataset. Experiment numbers (E#) can be looked up from Table 1.

Figure 3. Variation of ECE with regularizer weight λ for the Entropy penalty and MMCE regularizers. Observe the sharp peaks of the Entropy penalty as against the stable trend of MMCE. The values on the y-axis were clipped at ∼30 for visual clarity.

Comparison with Alternative Regularizers We next compare MMCE with the Entropy penalty and find that this method indeed reduces ECE significantly, even though its goal was to fix over-confidence and not calibration. However, the method suffers from the same flaw as temperature scaling of indiscriminately clamping confidence, leading to only a few confident predictions, as seen in Table 4. For example, on CIFAR 10 with a threshold of 0.99 on confidence, MMCE predicts 72% of instances at accuracy 99.7%, but with the Entropy penalty we get only 7%, and that too at 97% accuracy, indicating miscalibration in the above-0.99 bucket.

The kernel regression regularizer (Eq 12) was not effective in improving calibration. This was surprising because the Nadaraya-Watson regressor has a better convergence rate (m^{−4/5}) than ours (m^{−1/2}). A possible reason could be that the gradient with respect to θ is not conducive to optimizing the objective, perhaps because of the kernels in the denominator. In contrast, MMCE, with kernels only in the numerator, is conducive to gradient-based optimization.

MMCE: weighted vs unweighted In Section 3 we motivated the design of the weighted MMCE estimator (Eq 9) to surmount the natural high-accuracy bias in the training data. In Table 3 we observe that this version (last column) indeed provides a significant boost in calibration over the unweighted MMCE version in most cases.

Dataset          Model              MMCEm (ECE / Acc%)  Kernel Regression (ECE / Acc%)  Entropy penalty (ECE / Acc%)  MMCE (ECE / Acc%)
CIFAR 10         Resnet 50          3.2% / 93.2%        4.9% / 93.1%                    1.8% / 93.0%                  1.2% / 93.4%
CIFAR 10         Resnet 110         4.1% / 93.6%        4.5% / 93.6%                    2.1% / 93.8%                  1.2% / 94.0%
CIFAR 10         Wide Resnet 28-10  3.0% / 95.0%        3.1% / 95.1%                    2.8% / 94.7%                  1.6% / 94.2%
CIFAR 100        Wide Resnet 28-10  10.4% / 77.1%       12.2% / 78.6%                   9.3% / 77.9%                  8.9% / 76.6%
HAR time series  LSTM               5.5% / 89.8%        7.0% / 90.2%                    2.7% / 91.8%                  5.9% / 90.3%
20 Newsgroups    Global Pool CNN    15.8% / 75.1%       18.8% / 75.1%                   2.3% / 74.2%                  6.5% / 73.9%
IMDB Reviews     HAN                3.6% / 86.6%        3.2% / 85.7%                    0.6% / 86.6%                  0.4% / 86.3%

Table 3. ECE and accuracy over different methods that train using NLL+λ CE. Observe that entropy penalty reduces ECE without dropping accuracy. Kernel regression helps only in some cases and weighted MMCE is much better than the unweighted one (MMCEm).

Dataset    Model         Baseline (|CS99| / Acc.)  Baseline+T (|CS99| / Acc.)  MMCE (|CS99| / Acc.)  MMCE+T (|CS99| / Acc.)  Entropy (|CS99| / Acc.)
CIFAR 10   Resnet 110    89 / 97.65                40 / 99.57                  72 / 99.69            68 / 99.79              7 / 96.96
CIFAR 10   Resnet 50     86 / 98.04                51 / 99.68                  67 / 99.68            66 / 99.71              26 / 98.04
IMDB       HAN           48 / 98.28                12 / 99.48                  19.3 / 99.40          19.1 / 99.24            18.6 / 99.20
20 NewsG   Global Pool   61 / 92.07                0 / −                       45 / 96.26            42 / 96.73              7 / 97.14
Birds      Inception v3  55 / 98.68                40 / 99.31                  48 / 99.34            41 / 99.40              37 / 99.01

Table 4. The fraction of test examples predicted with a confidence more than 99% and their accuracy. Observe that fixes like temperature scaling and entropy penalty move many more points out of the high confidence zone than MMCE.

Tuning Calibration Weight λ In Figure 3 we show the effect on ECE of different values of λ in the NLL+λ CE training objective, with MMCE and the Entropy regularizer as CE. We find that, compared to Entropy, MMCE is significantly more stable with changing λ. For example, on 20 Newsgroups the entropy regularizer has a minimum at λ = 6 and rises sharply on both sides, whereas MMCE moves gradually towards its flat minimum. The accuracy values too (not shown in the figure) do not change much with λ for MMCE.

Computational Efficiency We compared the running times per epoch of the baseline and the MMCE trained model on an NVIDIA Titan X GPU (Table 2 of the supplementary material). Although MMCE scales quadratically with the number of examples, we found that the wall-clock running time per epoch for the MMCE algorithm was no more than 10% over the baseline. This is because MMCE is reliably estimated even on the ∼ 100 examples that comprise a batch, owing to its fast convergence properties.

In summary, our experiments demonstrate that training with MMCE is a fast, stable, and accurate method to minimize calibration error while maximally preserving high confidence predictions.

7. Conclusion

We proposed MMCE, a new measure of calibration error defined in terms of universal kernels from an RKHS space. We analyzed MMCE theoretically and showed that it is minimized only when the model is calibrated, and that its finite sample estimate is consistent and converges at rate 1/sqrt(m). The measure can be used to jointly train the network parameters to minimize both classification and calibration error. The joint objective is easy to train and does not require very careful hyper-parameter tuning, unlike other similar regularizers, e.g. the entropy regularizer. Also, the running time overhead is minimal, unlike methods such as Bayesian networks which model prediction uncertainty at a huge performance penalty. Our model not only provides low ECE and high accuracy, but also produces more predictions at high confidence levels, unlike previous smoothing approaches that indiscriminately clamp large confidence scores. Our experiments showed that modern neural networks suffer more from poor calibration than from over-confidence. Careful training objectives, like MMCE, can fix poor calibration without reducing confidence.

In future, we plan to apply MMCE to calibrate structured prediction tasks, for example, the sequence-to-sequence models used for translation.

Acknowledgements We thank all anonymous reviewers for their comments and for pointing to the work on scoring rules in statistics. We thank members of our group at IIT Bombay, specifically Shiv Shankar, for discussions. We gratefully acknowledge the support of NVIDIA Corporation for Titan X GPUs.

References

Anguita, D., Ghio, A., Oneto, L., Parra, X., and Reyes-Ortiz, J. L. A public domain dataset for human activity recognition using smartphones, 2013.

Bojarski, M., Testa, D. D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel, L. D., Monfort, M., Muller, U., Zhang, J., Zhang, X., Zhao, J., and Zieba, K. End to end learning for self-driving cars. CoRR, abs/1604.07316, 2016. URL http://arxiv.org/abs/1604.07316.

Candela, J. Q., Rasmussen, C. E., Sinz, F. H., Bousquet, O., and Schölkopf, B. Evaluating predictive uncertainty challenge. In Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification and Recognizing Textual Entailment, First PASCAL Machine Learning Challenges Workshop, MLCW 2005, Southampton, UK, April 11-13, 2005, Revised Selected Papers, pp. 1–27, 2005.

Caruana, R., Lou, Y., Gehrke, J., Koch, P., Sturm, M., and Elhadad, N. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21st KDD 2015, KDD '15, pp. 1721–1730, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3664-2. doi: 10.1145/2783258.2788613.

Crowson, C. S., Atkinson, E. J., and Therneau, T. M. Assessing calibration of prognostic risk scores. Statistical Methods in Medical Research, 25(4):1692–1706, 2016.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. J. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, 2012.

Guillaume Chevalier. LSTMs for Human Activity Recognition. https://github.com/guillaume-chevalier/LSTM-Human-Activity-Recognition, 2017. Online; accessed 27 January 2018.

Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 1321–1330, 2017.

Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA, 2001.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. CVPR, 2016.

Hendrycks, D. and Gimpel, K. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In ICLR, 2017.

Ilya Ivanov. IMDB HAN model. https://github.com/ilivans/tf-rnn-attention, 2018. Online; accessed 27 January 2018.

Iyer, A., Nath, S., and Sarawagi, S. Maximum mean discrepancy for class ratio estimation: Convergence bounds and kernel selection. In ICML, 2014.

Jiang, X., Osl, M., Kim, J., and Ohno-Machado, L. Calibrating predictive model estimates to support personalized medicine. Journal of the American Medical Informatics Association, 19(2):263–274, 2012. doi: 10.1136/amiajnl-2011-000291.

Keras Team. Keras pre-trained word embeddings example. https://github.com/keras-team/keras/blob/master/examples/pretrained_word_embeddings.py, 2018. Online; accessed 24 January 2018.

Krause, J., Sapp, B., Howard, A., Zhou, H., Toshev, A., Duerig, T., Philbin, J., and Fei-Fei, L. The unreasonable effectiveness of noisy data for fine-grained recognition. In Leibe, B., Matas, J., Sebe, N., and Welling, M. (eds.), Computer Vision – ECCV 2016, pp. 301–320, Cham, 2016. Springer International Publishing.

Krizhevsky, A., Nair, V., and Hinton, G. CIFAR-10 (Canadian Institute for Advanced Research). URL http://www.cs.toronto.edu/~kriz/cifar.html.

Krizhevsky, A., Nair, V., and Hinton, G. CIFAR-100 (Canadian Institute for Advanced Research). URL http://www.cs.toronto.edu/~kriz/cifar.html.

Kuleshov, V. and Ermon, S. Estimating uncertainty online against an adversary. In AAAI, 2017.

Kuleshov, V. and Liang, P. S. Calibrated structured prediction. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R. (eds.), NIPS, pp. 3474–3482, 2015.

Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In NIPS, pp. 6405–6416, 2017.

Lee, K., Lee, H., Lee, K., and Shin, J. Training confidence-calibrated classifiers for detecting out-of-distribution samples. In ICLR, 2018.

Li, Y., Swersky, K., and Zemel, R. Generative moment matching networks. In Bach, F. and Blei, D. (eds.), ICML, volume 37 of PMLR, pp. 1718–1727, Lille, France, 07–09 Jul 2015. PMLR.

Liang, S., Li, Y., and Srikant, R. Enhancing the reliability of out-of-distribution image detection in neural networks. ICLR, 2018.

Lin, M., Chen, Q., and Yan, S. Network in network. CoRR, abs/1312.4400, 2013.

Louizos, C. and Welling, M. Multiplicative normalizing flows for variational Bayesian neural networks. In ICML, volume 70, pp. 2218–2227, 2017.

Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C. Learning word vectors for sentiment analysis. In Proceedings of ACL: HLT, June 2011.

Muandet, K., Fukumizu, K., Sriperumbudur, B. K., and Schölkopf, B. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1–141, 2017.

Naeini, M. P., Cooper, G. F., and Hauskrecht, M. Obtaining well calibrated probabilities using Bayesian binning. In AAAI, 2015.

Nicolas Pinchaud. Fast minibatch version of Tree LSTM. https://github.com/nicolaspi/treelstm, 2017. Online; accessed 31 January 2018.

Niculescu-Mizil, A. and Caruana, R. Predicting good probabilities with supervised learning. In ICML, 2005.

Parry, M., Dawid, A. P., and Lauritzen, S. Proper local scoring rules. ArXiv e-prints, January 2011.

Pereyra, G., Tucker, G., Chorowski, J., Kaiser, L., and Hinton, G. E. Regularizing neural networks by penalizing confident output distributions. ICLR workshop, 2017.

Platt, J. C. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pp. 61–74. MIT Press, 1999.

Pleiss, G., Raghavan, M., Wu, F., Kleinberg, J. M., and Weinberger, K. Q. On fairness and calibration. In NIPS, pp. 5684–5693, 2017.

Socher, R., Huval, B., Manning, C. D., and Ng, A. Y. Semantic compositionality through recursive matrix-vector spaces. In EMNLP, 2012.

Tai, K. S., Socher, R., and Manning, C. D. Improved semantic representations from tree-structured long short-term memory networks. In ACL, 2015.

Tensorflow. CIFAR 10 model. https://github.com/tensorflow/models/tree/master/official/resnet, 2018. Online; accessed 21 January 2018.

Vispedia. Classification - Birds CUB 200 dataset. https://github.com/visipedia/tf_classification, 2018. Online; accessed 26 January 2018.

Welinder, P., Branson, S., Mita, T., Wah, C., Schroff, F., Belongie, S., and Perona, P. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.

Xin Pan. CIFAR 100 model. https://github.com/tensorflow/models/tree/master/research/resnet, 2018. Online; accessed 21 January 2018.

Yan, H., Ding, Y., Li, P., Wang, Q., Xu, Y., and Zuo, W. Mind the class weight bias: Weighted maximum mean discrepancy for unsupervised domain adaptation. In 2017 IEEE CVPR, Honolulu, HI, USA, July 21-26, 2017, pp. 945–954, 2017.

Yang, Z., Yang, D., Dyer, C., He, X., Smola, A. J., and Hovy, E. H. Hierarchical attention networks for document classification. In HLT-NAACL, 2016.

Zadrozny, B. and Elkan, C. Transforming classifier scores into accurate multiclass probability estimates. In ACM SIGKDD, 2002.

Zagoruyko, S. and Komodakis, N. Wide residual networks. CoRR, abs/1605.07146, 2016.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. ICLR, 2017.