Measuring Calibration in Deep Learning

Abstract

The reliability of a model's confidence in its predictions is critical for high-risk applications. Calibration, the idea that a model's predicted probabilities reflect true probabilities, formalizes this notion. While analyzing the calibration of deep neural networks, we've identified core problems with the way calibration is currently measured. For example, trained networks often predict a class label with very high confidence. This causes existing metrics to measure calibration error only within a small probability interval, which ultimately leads to misleading conclusions about whether a model is well-calibrated. The Thresholded Adaptive Calibration Error (TACE) metric is designed to resolve these pathologies. We show that while ECE becomes a worse approximation of true calibration error as class predictions beyond the maximum prediction matter more, TACE continues to perform well.

1 Introduction

The reliability of a machine learning model's confidence in its predictions is critical for high-risk applications, such as deciding whether to trust a medical diagnosis prediction (Crowson et al., 2016; Jiang et al., 2011; Raghu et al., 2018). One mathematical formulation of the reliability of confidence is calibration (Murphy and Epstein, 1967; Dawid, 1982). Intuitively, for class predictions, calibration means that if a model assigns a class with 90% probability, that class should appear 90% of the time.

Recent work proposed Expected Calibration Error (ECE; Naeini et al., 2015), a measure of calibration error which has led to a surge of works developing methods for calibrated deep neural networks (e.g., Guo et al., 2017; Kuleshov et al., 2018). We show that ECE has numerous pathologies, and that recent calibration methods, which have been shown to successfully recalibrate models according to ECE, do not lead to truly calibrated models. We examine these pathologies and propose the Thresholded Adaptive Calibration Error (TACE) metric, which is designed to resolve them. We also propose a dynamic programming based metric for estimating the calibration error of specific predictions.

2 Background & Related Work

Assume the dataset of features and outcomes {(x, y)} are i.i.d. realizations of the random variables X, Y ∼ P. We focus on class predictions. Suppose a model predicts a class y with probability p̂. The model is calibrated if p̂ is always the true probability. Formally,

P(Y = y | p̂ = p) = p

for all probability values p ∈ [0, 1] and class labels y ∈ {0, ..., K − 1}. The left-hand side denotes the true data distribution's probability of a label given that the model predicts p̂ = p; the right-hand side denotes that value.

Expected Calibration Error (ECE). To measure the deviation from calibration, ECE discretizes the probability interval into a fixed number of bins and assigns each predicted probability to the bin that encompasses it. The calibration error is the difference between the fraction of predictions in the bin that are correct (accuracy) and the mean of the probabilities in the bin (confidence). Intuitively, the accuracy estimates P(Y = y | p̂ = p), and the average confidence is a setting of p. ECE computes a weighted average of this error across bins:

ECE = \sum_{b=1}^{B} (n_b / N) |acc(b) − conf(b)|,

where n_b is the number of predictions in bin b, N is the total number of data points, and acc(b) and conf(b) are the accuracy and confidence of bin b, respectively. ECE as framed in Naeini et al. (2015) leaves ambiguity in both its binning implementation and how to compute calibration for multiple classes. Guo et al. (2017) bin the probability interval [0, 1] into equally spaced subintervals and take the maximum probability output for each datapoint (i.e., the predicted class's probability). We use this for our ECE implementation.
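For concreteness, the following is a minimal NumPy sketch of this ECE implementation (equal-width bins over the maximum predicted probability); the function and variable names are ours, not taken from any particular library.

import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """probs: (N, K) softmax outputs; labels: (N,) integer class labels."""
    confidences = probs.max(axis=1)              # predicted class's probability
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)    # equally spaced subintervals of [0, 1]
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = accuracies[in_bin].mean()          # acc(b)
            conf = confidences[in_bin].mean()        # conf(b)
            ece += in_bin.mean() * abs(acc - conf)   # weight n_b / N
    return ece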

Other measures of probabilistic accuracy. Many classic methods exist to measure the accuracy of predicted probabilities. For example, the Brier score measures the mean squared difference between the predicted probability and the actual outcome (Gneiting and Raftery, 2007). This score can be shown to decompose into a sum of metrics, including calibration error. The Hosmer-Lemeshow test is a popular hypothesis test for assessing whether a model's predictions significantly deviate from perfect calibration (Hosmer and Lemesbow, 1980). The reliability diagram provides a visualization of how well-calibrated a model is (DeGroot and Fienberg, 1983). Kuleshov et al. (2018) extends ECE to the regression setting. Unlike these methods, we'd like the metric to be scalar-valued in order to easily benchmark methods, and to only measure calibration.

Figure 1: Per-class Thresholded Adaptive Calibration Error (TACE) for a trained 110-layer ResNet on the CIFAR-100 test set. We observe that calibration error is non-uniform across classes, which is difficult to express with a scalar metric and when measuring error only on the maximum probabilities.

3 Issues With Calibration Metrics

3.1 Not Computing Calibration Across All Predictions

Expected Calibration Error was crafted to mirror reliability diagrams, which are structured around binary classification such as rain vs. not rain (DeGroot and Fienberg, 1983). A consequence is that the error metric is reductive in a multi-class setting. In particular, ECE is computed using only the predicted class's probability, which implies the metric does not assess how accurate a model is with respect to the K − 1 other probabilities.

In Figure 1, we observe that calibration error, when measured in a per-class manner, is non-uniform across classes. Guaranteeing calibration for all class probabilities is an important property for many applications. For example, a physician may care deeply about a secondary or tertiary prediction of a disease. Decision costs along both ethical and monetary axes are non-uniform, and thus clinical decision policies necessitate a comprehensive view of the likelihoods of potential outcomes (e.g., Tsoukalas et al., 2015). Alternatively, if a true data distribution exhibits high data noise (also known as high aleatoric uncertainty, or weak labels), then we'd like models to be calibrated across all predictions, as no true class label is likely.

3.2 Fixed Calibration Ranges

One major weakness of evenly spaced binning metrics is caused by the dispersion of data across ranges. In computing ECE, there is often a large leftward skew in the output probabilities, with the left end of the region being sparsely populated and the rightward end being densely populated. (That is, network predictions are typically very confident.) This causes only a few bins to contribute the most to expected calibration error, typically one or two, as bin counts are 10-20 in practice (Guo et al., 2017).

More broadly, sharpness, the desire for models to always predict with high confidence, i.e., for predicted probabilities to concentrate at 0 or 1, is a fundamental property (Gneiting et al., 2007). Because of the above behavior, ECE conflates calibration and sharpness. An extreme example is an overfit model: predicted probabilities are pushed upwards of 0.99, which places all predictions into one bin. This leads to zero calibration error, even though such sharp models are not necessarily calibrated.

3.3 Bias-Variance Tradeoff

Selecting the number of bins has a bias-variance tradeoff, as it determines how many data points fall into each bin and therefore the quality of the estimate of calibration from that bin's range. In particular, a larger number of bins yields a more granular measure of calibration error (low bias) but also a higher variance in each bin's measurement, as bins become sparsely populated. This tradeoff compounds particularly with the problem of fixed calibration ranges, as certain bins have many more data points than others.

4 Challenges in Calibration

Before describing new metrics for calibration, we first outline broad challenges with designing such metrics.

4.1 Ground Truth & Comparing Calibration Metrics

There are many challenges in measuring a network's calibration, beginning with the absence of ground truth. In principle, one can limit comparisons to controlled, simulated experiments where ground truth is available by drawing infinitely many samples from the true data distribution. However, even with ground truth, any estimator property, such as bias, remains difficult to compare across estimators, as "ground truth error" is multi-valued and estimators may make measurements for different elements of these values. Specifically, calibration is a guarantee for all predicted probabilities p ∈ [0, 1] and class labels y ∈ {0, ..., K − 1}. An estimator may have lower bias and/or variance in estimating the error for specific ranges of p and y but higher bias and/or variance in other ranges. For example, adaptive metrics use different binning strategies and therefore measure error with different settings of p (the average confidence of datapoints in a bin).

Figure 2: Reliability diagram over adaptive metrics for a CNN on Fashion-MNIST validation data. TACE and ACE with 100 bins, SCE with 10.

4.2 Weighting

The question of how to weight probability values (where one can see thresholding as a 0/1 weighting on datapoints below / above the threshold) creates a set of differences in calibration error metrics that choose to emphasize different aspects of calibration performance. In many contexts what really matters is the rare event: the network's classifications leading up to an accident, the presence of a planet, or the presence of a rare disease. Knowing the difference between whether a class's true probability is .01 or .001 (the difference between 1 in 100 and 1 in 1000) is both extremely difficult to discern and may be much more relevant than the difference between .3 and .301, which these calibration metrics would treat as equivalent. In these contexts, weighting the ends of the interval close to 0 and 1 would be ideal. Additionally, in the context of out-of-distribution detection, we would prefer to be well-calibrated in the middle of the spectrum, where a prediction of .5 (high uncertainty) is more likely to happen.

4.3 Visualization and Interpretability

Adaptive calibration metrics, due to the variability of their binning scheme, make it difficult to compare two models side-by-side. The calibration ranges that they choose will differ from one another, and those differences create a reliability diagram that is more difficult to interpret than the standard diagram over evenly spaced calibration metrics.

5 New Calibration Metrics

5.1 Multiclass & Static Calibration Error

There are a few implications to making a multi-class calibration metric. One natural change is in the distribution of the predictions: as more classes are added, in order to remain accurate, classifiers need to output small values for those classes' predictions (see the sharpness of CNN predictions in Figure 2).

We introduce Static Calibration Error (SCE), which is a standard extension of Expected Calibration Error to every probability in the multiclass setting. SCE is an evenly spaced calibration range scheme that considers all softmax values.
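As a sketch of one way to realize this description (our reading, since the section does not spell out a formula): apply the ECE-style binning to every class's probability rather than only the predicted class's, and average the resulting error over classes.

import numpy as np

def static_calibration_error(probs, labels, n_bins=15):
    """probs: (N, K) softmax outputs; labels: (N,) integer class labels."""
    num_classes = probs.shape[1]
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    per_class_error = []
    for c in range(num_classes):
        conf_c = probs[:, c]                       # probability assigned to class c
        correct_c = (labels == c).astype(float)    # did class c actually occur?
        err = 0.0
        for lo, hi in zip(edges[:-1], edges[1:]):
            in_bin = (conf_c > lo) & (conf_c <= hi)
            if in_bin.any():
                err += in_bin.mean() * abs(correct_c[in_bin].mean() - conf_c[in_bin].mean())
        per_class_error.append(err)
    return float(np.mean(per_class_error))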

5.2 Adaptivity & Adaptive Calibration Error

Adaptive calibration ranges are motivated by the bias-variance tradeoff in the choice of ranges, suggesting that in order to get the best estimate of the overall calibration error, the metric should focus on the regions where the predictions are made (and focus less on regions with few predictions). We chose an adaptive scheme which spaces the bin intervals so that each contains an equal number of predictions. The resulting Adaptive Calibration Error (ACE) can be implemented as TACE with the threshold set to 0, and so one can follow the implementation in that section for more details.
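As an illustration of the equal-count spacing (a sketch under our own naming, not the authors' code), the range edges are simply empirical quantiles of the predictions:

import numpy as np

def adaptive_bin_edges(probs, n_ranges=15):
    # Edges at empirical quantiles, so each range holds roughly len(probs)/n_ranges predictions.
    return np.quantile(np.asarray(probs), np.linspace(0.0, 1.0, n_ranges + 1))

def assign_ranges(probs, edges):
    # Index of the adaptive range that each prediction falls into.
    return np.clip(np.searchsorted(edges, probs, side="right") - 1, 0, len(edges) - 2)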

5.3 Thresholding & Thresholded Adaptive Calibration Error

One initial challenge is that the vast majority of softmax predictions become infinitesimal. Without thresholding, these tiny values dominate the binning schemes, introducing huge amounts of bias into the estimate of the correct output for the huge bins generated over the space where the softmax has relatively sparse outputs (including the last bin, containing extreme values). These artifacts dominate the calibration score by default.

One natural solution to that challenge is thresholding: only including predictions above some epsilon (e.g., .01 or .001) in the calibration score. This allows the metric to focus the preponderance of its calibration ranges on higher values, typically large predictions from the softmax.

In detail, TACE takes as input the predictions p (usually out of a softmax), correct labels ℓ, a number of ranges r, and the threshold value t. First, the sorted and thresholded values s are computed as

s = sort(p)[p >= t].

Given that, we compute the calibration range boundaries

c = s[⌊range(r) · len(s)/r⌋],

and use them to assign range indices to our predictions, giving a digitized prediction matrix d. For each range i we compute the mean probability

m_i = mean(p[d == i]),

and the fraction correct f_i, equal to the number of values that are correct in each range divided by the total number of predictions that fall in that range. Finally, we compute the Thresholded Adaptive Calibration Error

TACE = (1/r) Σ_i |f_i − m_i|.

Figure 3: Top Left: Lower bounds of calibration ranges over the course of training for adaptive calibration error on Fashion-MNIST, focusing almost entirely on small ranges and motivating thresholding. Top Right: On the MNIST training set with thresholding, so few values are small that the bottom of the lowest range often spikes to .99 and higher due to every datapoint being fit. Bottom Left: ACE on Fashion-MNIST validation with 100 calibration ranges. Bottom Right: Thresholded adaptive calibration with 50 calibration ranges over the course of training on Fashion-MNIST's validation set.

5.4 Criticisms & Limitations of Adaptive Calibration Metrics

With a static scheme it can be straightforward to compare the outputs of two very different models through a reliability diagram. With an adaptive scheme the ranges may fall in very different regions from algorithm to algorithm based on the density of its predictions, making comparisons much less intuitive.

These adaptive calibration metrics can create excessively large ranges if there is very sparse output in a region. In those cases, it may be better to have a higher-variance estimate of the calibration but only compare datapoints that are closer together (dropping datapoints that are farther away). This level of granularity can be replicated with an adaptive scheme that has a very large number of ranges, but that will come at the cost of easy interpretability and will increase the within-range variance across all ranges, not just the sparse regions.

This scheme also does not transfer smoothly between related predictions, nor does it approximate that smoothness through ensembling or hierarchical calibration ranges. This limitation does suggest an ensemble-based metric, which could leverage dynamic programming to look at very many range schemes efficiently.
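The following is a minimal NumPy sketch assembling the TACE computation described in Section 5.3 (setting the threshold to 0 recovers ACE); the per-class flattening of the softmax and the variable names are our reading of the text, not the authors' reference code.

import numpy as np

def thresholded_adaptive_calibration_error(probs, labels, n_ranges=15, threshold=0.01):
    """probs: (N, K) softmax outputs; labels: (N,) integer class labels."""
    num_classes = probs.shape[1]
    # Pair every class probability with whether that class actually occurred,
    # keeping only predictions above the threshold.
    flat_probs = probs.reshape(-1)
    flat_correct = (labels[:, None] == np.arange(num_classes)[None, :]).reshape(-1).astype(float)
    keep = flat_probs >= threshold
    p, correct = flat_probs[keep], flat_correct[keep]
    order = np.argsort(p)                    # sorted, thresholded values s
    p, correct = p[order], correct[order]
    # Split the sorted predictions into n_ranges groups of (nearly) equal size.
    groups = np.array_split(np.arange(len(p)), n_ranges)
    err = 0.0
    for g in groups:
        if len(g) == 0:
            continue
        err += abs(correct[g].mean() - p[g].mean())   # |f_i - m_i|
    return err / n_ranges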
6 Post-Processing Methods

Method           ECE       TACE      SCE      ACE
Uncalibrated     19.638%   10.093%   0.412%   0.131%
Temp. Scaling     2.155%    0.518%   0.057%   0.057%
Vector Scaling    2.272%    0.613%   0.037%   0.0133%
Matrix Scaling   12.112%    3.983%   0.264%   0.257%
Isotonic Regr.   17.851%    2.827%   0.353%   0.118%

Table 1: ECE, TACE, SCE, and ACE (with 15 bins) on a ResNet-110 applied to CIFAR-100 before calibration, and after the application of post-processing methods.

6.1 Standard Approaches for Multiclass Classification

One common approach to calibration is to apply post-processing methods to the output of classifiers without retraining. These methods can be applied to any existing classifier, thus freeing model design from the need to account for calibration-related measures at training time. The two most popular post-processing methods are the parametric approach of Platt scaling (Platt et al., 1999) and the non-parametric approach of isotonic regression (Zadrozny and Elkan, 2002).

Platt scaling (Platt et al., 1999) fits a model to the logits of a classifier on the validation set, which can be used to compute calibrated predictions at test time. The original formulation of Platt scaling for neural networks (Niculescu-Mizil and Caruana, 2005) involves learning scalar parameters a, b ∈ R on a held-out validation set and then computing calibrated probabilities p̂_i given the uncalibrated logits vector z_i on the test set as p̂_i = σ(a z_i + b). These parameters are typically estimated by minimizing the negative log likelihood.

Platt scaling can be extended to the multiclass setting by considering higher-dimensional parameters. In matrix scaling, a is replaced with W ∈ R^(K×K) while we consider b ∈ R^K; for vector scaling, W ∈ R^K. Calibrated probabilities for either of these extensions can then be computed as

p̂_i = max_k σ(W z_i + b).

An even simpler extension is temperature scaling (Guo et al., 2017), which reduces the set of regression parameters to the inverse of a single scalar T > 0 such that

p̂_i = max_k σ(z_i / T).

On the other hand, isotonic regression (Zadrozny and Elkan, 2002) is a common non-parametric post-processing method that finds the stepwise-constant, non-decreasing (isotonic) function f that best fits the data according to the mean-squared loss Σ_i (f(p_i) − y_i)^2, where p_i are the uncalibrated probabilities and y_i the labels. Refer to Zadrozny and Elkan (2002) for the exact formulation of this regression problem. The standard approach for extending isotonic regression to the multiclass setting is to break the problem into many binary classification problems (e.g., one-versus-all problems), to calibrate each problem separately, and then to combine the calibrated probabilities (Zadrozny and Elkan, 2002).

6.2 Evaluation Challenges

When designing post-processing methods, not accounting for the properties detailed in Section 5 during evaluation can lead to misleading conclusions about a post-processing method's success.

For example, temperature scaling (Guo et al., 2017) has been shown to minimize expected calibration error more effectively than alternative techniques such as isotonic regression and Platt scaling (Platt et al., 1999). Whether these scaling techniques also effectively minimize more sophisticated error metrics is an important standard for their efficacy.

In fact, by design of having only a single scalar parameter which uniformly scales predicted probabilities across all classes, temperature scaling is likely to perform worse when accounting for, for example, calibration errors across all predictions. We hypothesize that this would be the case with respect to adaptive calibration metrics (e.g., ACE) as well as calibration metrics that account for the probabilities of all classes instead of that of only the top predicted class (e.g., SCE).

In the experiments section, we validate this hypothesis by comparing, from Table 1, the three extensions of Platt scaling and isotonic regression on ECE as well as SCE, ACE, and TACE.
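A minimal sketch of fitting the temperature parameter of Section 6.1 by minimizing validation NLL (the bounded scalar optimizer is our implementation choice; Guo et al. (2017) describe the method but we do not reproduce their exact code):

import numpy as np
from scipy.optimize import minimize_scalar

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(val_logits, val_labels):
    """Return T > 0 minimizing the negative log likelihood of softmax(logits / T)."""
    def nll(log_t):
        t = np.exp(log_t)  # optimize log T so that T stays positive
        log_probs = np.log(softmax(val_logits / t) + 1e-12)
        return -log_probs[np.arange(len(val_labels)), val_labels].mean()
    res = minimize_scalar(nll, bounds=(-3.0, 3.0), method="bounded")
    return float(np.exp(res.x))

# At test time, calibrated probabilities are softmax(test_logits / T) for the fitted T.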
7 Datapoint Specific Calibration Estimates via Dynamic Programming

In many contexts the calibration of a specific prediction is much more interesting to the consumer of a machine learning model than the model's overall calibration. A physician using a model to diagnose a patient cares about the calibration of that prediction in particular, rather than the model's average calibration error in general.

With that motivation, we propose a metric that uses a weighted exponential decay on the calibration of nearby datapoints on the predicted class to estimate the calibration of a prediction at a particular point. We use dynamic programming to efficiently compute the weighted impact of every datapoint's calibration error on the calibration at a particular point, where due to the exponential decay the contribution of a datapoint to the calibration error dies out smoothly with distance from the point at which the calibration is being measured (Figure 4).

One challenge with a method like this is that near the edges of the prediction space, there are more datapoints on the side that favors the rest of the dataset. This leads to a calibration score that is biased towards larger values near 0 and smaller values near 1. The sharpness of the network's predictions ameliorates this issue in our experiments.

8 Experiments

We designed and ran experiments to measure and explore the various calibration metrics on a variety of tasks and data, including simulations, MNIST (LeCun et al., 2010), Fashion-MNIST (Xiao et al., 2017), CIFAR-10/CIFAR-100 (Krizhevsky and Hinton, 2009), and ImageNet 2012 (Deng et al., 2009). For the MNIST and Fashion-MNIST experiments, we trained two-layer convolutional models. For the CIFAR-10/CIFAR-100 experiments, we trained 110-layer ResNet models (He et al., 2016). Finally, for the ImageNet experiments, we trained ResNet-50 models (He et al., 2016). Additional details beyond those found here and in the following subsections can be found in the supplementary material.

8.1 Comparisons of the Calibration Metrics

On the training data, the network trained on Fashion-MNIST overfits (Figure 9). For most of our calibration metrics this means comparing a mean prediction that is near 1 to a correct prediction that is exactly 1, leading to near-zero calibration error (Figure 5). On the validation data, as the network overfits (making predictions that are more and more confident), the error tends to diverge from zero, given that the network's true prediction accuracy is less than one.

Figure 4: Top Left: Exponential decay DP metric reliability diagram for the Fashion-MNIST pullover class. Top Right: Exponential decay sensitivity to decay rates for Fashion-MNIST's predictions of the coat class. Bottom Left: Sharpness of predictions on Fashion-MNIST across the entire softmax. Bottom Right: As above, for the shirt class.

8.2 Per-Class Calibration Error

To explore per-class calibration error, we trained 110-layer ResNet models on CIFAR-10 and CIFAR-100. Figure 1 shows the calibration error of each of the 100 classes on the CIFAR-100 test set. We observe that calibration error is non-uniform across classes, which is difficult to express with a scalar metric and when measuring error only on the maximum probabilities.

8.3 Variance of the Calibration Metrics

We first use simulated data in order to understand the calibration metrics' properties in comparison to the true calibration error, estimated via draws from the true data distribution (Figure 7).

8.4 Initialization

There is a desire for well-calibrated output from the softmax of a neural network. That softmax processes inputs from a previous layer, and the magnitude of that layer's outputs is a strong determinant of the range of outputs that will be seen in the softmax.

One intervention that leads to large differences in the magnitude of the values in the network throughout training is the quality of the initialization. It has been shown that initialization can have a dramatic impact on training speed and stability. Here we show the impact of initialization quality on the calibration of the resulting network.

To the degree that the magnitude of the initialized weights is in a range amenable to the softmax, we see substantially different calibration scores. With an initialization of weights with smaller values, the quality of network calibration scores increases dramatically.

Figure 5: Calibration metrics measured over the course of training a convolutional model on Fashion-MNIST. Blue is training set, Red is validation set. Left: Expected Calibration Error. Middle Left: Static Calibration Error. Middle Right: Adaptive Calibration Error. Right: Thresholded Adaptive Calibration Error.

Figure 6: Calibration metrics measured over the course of training a 110-layer ResNet on CIFAR-100. Blue is training set, Red is test set. Left: Expected Calibration Error. Middle Left: Static Calibration Error. Middle Right: Adaptive Calibration Error. Right: Thresholded Adaptive Calibration Error.

Figure 7: Convergence speed and stability of each metric, measuring variance while increasing the number of datapoints over which each metric is measured. Note that the average value of ECE is higher than the average value of TACE and ACE.

Figure 8: As label noise increases, ECE is outperformed by ACE and TACE. This shows that ECE becomes a worse approximation of true calibration error as class predictions beyond the predicted one matter more. (The x = 0 extreme has a true data distribution with deterministic y; the x → ∞ extreme has a true data distribution with uniform y.)

8.5 Post-Processing Method Evaluation

We are interested in comparing the performance of temperature scaling, claimed to perform the best in Guo et al. (2017), to that of vector scaling, matrix scaling, and isotonic regression. This performance is measured by the standard ECE metric as well as the novel metrics presented in Section 5: SCE, ACE, and TACE.

We hypothesized, in Section 6, that having more degrees of freedom than temperature scaling would allow a post-processing method to more flexibly calibrate for probabilities across all classes. However, the ECE metric only captures the calibration for the top predicted class and is thus a suboptimal criterion for selecting a post-processing method in the multiclass setting.

We trained a ResNet-110 model (He et al., 2016) on CIFAR-100 (Krizhevsky and Hinton, 2009). We then estimated the regression parameters on the held-out validation set. Finally, we used the regression parameter estimates to compute calibrated probabilities on the test set. The results are reported in Table 1. A similar experiment with a ResNet-50 model on ImageNet 2012 is reported in Table 2 in the supplements and echoes the findings of Table 1.

As seen in Table 1, temperature scaling outperforms the other three methods on the ECE metric, as reported in Guo et al. (2017). However, temperature scaling falls short in comparison to vector scaling on the SCE and ACE metrics. In fact, the relative gap on SCE is the largest among all metrics and across methods. This is unsurprising given that SCE is a multiclass extension of ECE which considers all probabilities instead of only the top one. On the other hand, the performance of temperature scaling on the TACE metric could be attributed to thresholding leading to the consideration of only a fraction of the classes per datapoint. In this case, one would expect a closer result to ECE in terms of the relative performance of each post-processing method.

9 Discussion

We show that current ways to measure calibration can be problematic, and we investigate these pathologies. We design several calibration metrics to overcome them, and analyze architectures and recalibration methods. We show that while ECE becomes a worse approximation of true calibration error as class predictions beyond the maximum prediction matter more, TACE continues to perform well.

Ideally, machine learning models would be calibrated by default, without requiring post-processing on an additional data set as current recalibration methods do. One idea is to explicitly add a term to the training loss penalizing calibration error. For training, the aforementioned calibration measurements are poor choices as they are not differentiable. In future work, we aim to investigate the role of existing regularizers such as label smoothing, as well as proper scoring rules and broader methods such as Bayesian neural networks.

Figure 9: Top Left: Calibration scores for each of 10 bins (0.0-0.1, ..., 0.9-1.0) for a CNN over the course of training on Fashion-MNIST. The 90% bracket is well calibrated, but the 80% bracket and below become less calibrated over the course of training, as softmax prediction probabilities become overconfident. Top Right: With poor initialization, the diversity of probabilities output degrades dramatically. Bottom Left: The logit norm for the poor initialization is orders of magnitude larger than for the quality initialization. Bottom Right: Accuracy differences between the initializations are minor, despite dramatic differences in calibration.

References

Crowson, C. S., Atkinson, E. J., and Therneau, T. M. (2016). Assessing calibration of prognostic risk scores. Statistical Methods in Medical Research, 25(4):1692–1706.

Dawid, A. P. (1982). The well-calibrated Bayesian. Journal of the American Statistical Association, 77(379):605–610.

DeGroot, M. H. and Fienberg, S. E. (1983). The comparison and evaluation of forecasters. Journal of the Royal Statistical Society: Series D (The Statistician), 32(1-2):12–22.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database.

Gneiting, T., Balabdaoui, F., and Raftery, A. E. (2007). Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(2):243–268.

Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378.

Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. (2017). On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1321–1330. JMLR.org.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Hosmer, D. W. and Lemesbow, S. (1980). Goodness of fit tests for the multiple logistic regression model. Communications in Statistics - Theory and Methods, 9(10):1043–1069.

Jiang, X., Osl, M., Kim, J., and Ohno-Machado, L. (2011). Calibrating predictive model estimates to support personalized medicine. Journal of the American Medical Informatics Association, 19(2):263–274.

Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Technical report, Citeseer.

Kuleshov, V., Fenner, N., and Ermon, S. (2018). Accurate uncertainties for deep learning using calibrated regression. arXiv preprint arXiv:1807.00263.

LeCun, Y., Cortes, C., and Burges, C. (2010). MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2:18.

Murphy, A. H. and Epstein, E. S. (1967). Verification of probabilistic predictions: A brief review. Journal of Applied Meteorology, 6(5):748–755.

Naeini, M. P., Cooper, G., and Hauskrecht, M. (2015). Obtaining well calibrated probabilities using Bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence.

Niculescu-Mizil, A. and Caruana, R. (2005). Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, pages 625–632. ACM.

Platt, J. et al. (1999). Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61–74.

Raghu, M., Blumer, K., Sayres, R., Obermeyer, Z., Kleinberg, R., Mullainathan, S., and Kleinberg, J. (2018). Direct uncertainty prediction for medical second opinions.

Tsoukalas, A., Albertson, T., and Tagkopoulos, I. (2015). From data to optimal decision making: A data-driven, probabilistic machine learning approach to decision support for patients with sepsis. JMIR Medical Informatics, 3(1):e11.

Xiao, H., Rasul, K., and Vollgraf, R. (2017). Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.

Zadrozny, B. and Elkan, C. (2002). Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 694–699. ACM.

Figure 10: Calibration metrics measured over the course of training a 110-layer ResNet on CIFAR-10. Blue is training set, Red is test set. Top Left: Expected Calibration Error. Top Right: Static Calibration Error. Bottom Left: Adaptive Calibration Error. Bottom Right: Thresholded Adaptive Calibration Error.

A Replication Details

MNIST & Fashion-MNIST. Experiments on MNIST and Fashion-MNIST are run with a 2-layer CNN with max pooling after each layer (filter counts are 32 followed by 64, with 5x5 filters and 2x2 pooling layers), followed by two affine transformations (neuron counts are 3136 to 1024, followed by 1024 to 10). ReLU activation functions are used throughout. A softmax is run over the output layer to produce the predictions. The good initialization is Glorot, which we use for all experiments other than the initialization comparison, where the poor initialization is a normal distribution with mean 0 and standard deviation 1 over the weights and biases. The model is optimized with Adam, using a learning rate of 0.001, a momentum parameter of 0.9, and an RMS parameter of 0.999. The epsilon term is 1e-08. Our loss function is a cross-entropy loss. Runs go to 100,000 steps with calibration metrics computed at every 10th step. The minibatch size was 64.
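For reference, a sketch of this architecture and optimizer in PyTorch (our transcription of the description above; the 'same' padding on the convolutions is an assumption, inferred so that the flattened feature size comes out to 3136):

import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),   # 28x28 -> 14x14
            nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),   # 14x14 -> 7x7, so 64 * 7 * 7 = 3136 features
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(3136, 1024), nn.ReLU(),
            nn.Linear(1024, num_classes),   # softmax is applied by the loss at train time
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SmallCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8)
criterion = nn.CrossEntropyLoss()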

CIFAR-100 & CIFAR-10. We follow the training setup presented in He et al. (2016) for training a 110-layer ResNet for CIFAR-10/CIFAR-100. For experiments measuring calibration error, we train on the full training split, and report on the test set. For the recalibration experiments, we perform an 80%/20% train/validation split of the training data, train the model on the training split, train the recalibration method on the validation dataset, and report results on the test set.

ImageNet. We train a ResNet-50 as presented in He et al. (2016) along with label smoothing. In all experiments, models are trained on the full training split, and the validation dataset is split into 25k/25k validation/test splits. Recalibration methods are trained on the 25k validation split and results are reported on the 25k test split.
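As a sketch of the label smoothing used here (the smoothing value 0.1 is a common default and an assumption on our part, since the text does not state it):

import numpy as np

def smoothed_targets(labels, num_classes, epsilon=0.1):
    # One common variant of label smoothing: the true class keeps 1 - epsilon of the
    # mass, and epsilon is spread uniformly over all classes.
    targets = np.full((len(labels), num_classes), epsilon / num_classes)
    targets[np.arange(len(labels)), labels] += 1.0 - epsilon
    return targets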

Specific Calibration Estimates via Dynamic Programming. For well-calibration of a prediction of class c at probability p, we can look at all the predictions of c near p, weighted by their distance away from p. Our estimate of the actual accuracy at prediction p is then simply (weighted count of times the actual class was c near p) / (weighted count of times we made a prediction near p). The weighting decays with distance from p: a datapoint .01 away from p is weighted decay_rate times as much as a datapoint at p (and one .02 away is worth decay_rate^2, etc.).

This can easily get computationally expensive if we query the calibration at many different points, so we use dynamic programming to get the "canonical calibration" at 101 evenly spaced points from 0.00 to 1.00 (initial experiments show that this is not sensitive to the number of buckets). A further optimization is that we make two passes (left-to-right and right-to-left) and reuse earlier work (e.g., to get the canonical left-to-right calibration at .51, we only need to calculate the calibration from .50 to .51 and then multiply the left-to-right value at .50 by the decay rate). With the canonical calibration, we can then query any given probability by looking at the two nearby canonical calibration points and linearly interpolating. A final (unprincipled) design choice we made is to place 50-50 weights on the points below and above a canonical calibration point, instead of weighting by the total number of points. This helps ameliorate the issue with having very high or very low probabilities.
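A minimal NumPy sketch of the two-pass scheme described above (the grid binning and the 50-50 combination follow our reading of the text; decay=0.9 is an illustrative value, not one stated in the paper):

import numpy as np

def canonical_calibration(probs, correct, decay=0.9, n_points=101):
    """probs: predicted probability of a class per datapoint; correct: 1 if that
    class was the true label, else 0. Returns the weighted accuracy estimate at
    the evenly spaced points 0.00, 0.01, ..., 1.00."""
    probs = np.asarray(probs, dtype=float)
    correct = np.asarray(correct, dtype=float)
    grid = np.linspace(0.0, 1.0, n_points)
    idx = np.clip(np.round(probs * (n_points - 1)).astype(int), 0, n_points - 1)
    w_local = np.bincount(idx, minlength=n_points).astype(float)
    c_local = np.bincount(idx, weights=correct, minlength=n_points).astype(float)

    # Left-to-right pass: decayed contributions from points at or below each grid point.
    num_l, den_l = np.zeros(n_points), np.zeros(n_points)
    num = den = 0.0
    for i in range(n_points):
        num = num * decay + c_local[i]
        den = den * decay + w_local[i]
        num_l[i], den_l[i] = num, den

    # Right-to-left pass: decayed contributions from points strictly above each grid point.
    num_r, den_r = np.zeros(n_points), np.zeros(n_points)
    num = den = 0.0
    for i in range(n_points - 1, -1, -1):
        num_r[i], den_r[i] = num, den
        num = num * decay + c_local[i]
        den = den * decay + w_local[i]

    # 50-50 weighting of the two sides where both have data.
    acc_l = np.divide(num_l, den_l, out=np.zeros_like(num_l), where=den_l > 0)
    acc_r = np.divide(num_r, den_r, out=np.zeros_like(num_r), where=den_r > 0)
    both = (den_l > 0) & (den_r > 0)
    return grid, np.where(both, 0.5 * (acc_l + acc_r), acc_l + acc_r)

Queried probabilities between grid points can then be obtained with np.interp over the returned grid, as described above.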

B Additional Figures

Method           ECE      TACE     SCE      ACE
Uncalibrated     6.626%   2.511%   0.016%   0.0146%
Temp. Scaling    5.420%   2.637%   0.010%   0.0007%
Vector Scaling   1.443%   1.249%   0.002%   0.0044%
Matrix Scaling   5.061%   1.980%   0.010%   0.0005%
Isotonic Regr.   3.474%   1.862%   0.008%   0.0000%

Table 2: ECE, TACE, SCE, and ACE (with 15 bins) on a ResNet-50 applied to ImageNet before calibration, and after the application of various extensions to Platt scaling and isotonic regression.

Figure 11: Per-class calibration error for a trained 110-layer ResNet on the CIFAR-100 test set. We observe that calibration error is non-uniform across classes, which is difficult to express with a scalar metric and when measuring error only on the maximum probabilities. Top: Per-class Static Calibration Error. Middle: Per-class Adaptive Calibration Error. Bottom: Per-class Thresholded Adaptive Calibration Error.