On Calibration of Modern Neural Networks
Chuan Guo*, Geoff Pleiss*, Yu Sun*, Kilian Q. Weinberger

*Equal contribution, alphabetical order. Cornell University. Correspondence to: Chuan Guo <[email protected]>, Geoff Pleiss <[email protected]>, Yu Sun <[email protected]>.

Abstract

Confidence calibration – the problem of predicting probability estimates representative of the true correctness likelihood – is important for classification models in many applications. We discover that modern neural networks, unlike those from a decade ago, are poorly calibrated. Through extensive experiments, we observe that depth, width, weight decay, and Batch Normalization are important factors influencing calibration. We evaluate the performance of various post-processing calibration methods on state-of-the-art architectures with image and document classification datasets. Our analysis and experiments not only offer insights into neural network learning, but also provide a simple and straightforward recipe for practical settings: on most datasets, temperature scaling – a single-parameter variant of Platt Scaling – is surprisingly effective at calibrating predictions.

[Figure 1. Confidence histograms (top) and reliability diagrams (bottom) for a 5-layer LeNet (left, error = 44.9) and a 110-layer ResNet (right, error = 30.6) on CIFAR-100. Refer to the text below for detailed illustration.]

1. Introduction

Recent advances in deep learning have dramatically improved neural network accuracy (Simonyan & Zisserman, 2015; Srivastava et al., 2015; He et al., 2016; Huang et al., 2016; 2017). As a result, neural networks are now entrusted with making complex decisions in applications such as object detection (Girshick, 2015), speech recognition (Hannun et al., 2014), and medical diagnosis (Caruana et al., 2015). In these settings, neural networks are an essential component of larger decision making pipelines.

In real-world decision making systems, classification networks must not only be accurate, but also should indicate when they are likely to be incorrect. As an example, consider a self-driving car that uses a neural network to detect pedestrians and other obstructions (Bojarski et al., 2016). If the detection network is not able to confidently predict the presence or absence of immediate obstructions, the car should rely more on the output of other sensors for braking. Alternatively, in automated health care, control should be passed on to human doctors when the confidence of a disease diagnosis network is low (Jiang et al., 2012). Specifically, a network should provide a calibrated confidence measure in addition to its prediction. In other words, the probability associated with the predicted class label should reflect its ground truth correctness likelihood.

Calibrated confidence estimates are also important for model interpretability. Humans have a natural cognitive intuition for probabilities (Cosmides & Tooby, 1996). Good confidence estimates provide a valuable extra bit of information to establish trustworthiness with the user – especially for neural networks, whose classification decisions are often difficult to interpret.
Further, good probability estimates can be used to incorporate neural networks into other probabilistic models. For example, one can improve performance by combining network outputs with a language model in speech recognition (Hannun et al., 2014; Xiong et al., 2016), or with camera information for object detection (Kendall & Cipolla, 2016).

In 2005, Niculescu-Mizil & Caruana (2005) showed that neural networks typically produce well-calibrated probabilities on binary classification tasks. While neural networks today are undoubtedly more accurate than they were a decade ago, we discover with great surprise that modern neural networks are no longer well-calibrated. This is visualized in Figure 1, which compares a 5-layer LeNet (left) (LeCun et al., 1998) with a 110-layer ResNet (right) (He et al., 2016) on the CIFAR-100 dataset. The top row shows the distribution of prediction confidence (i.e. probabilities associated with the predicted label) as histograms. The average confidence of LeNet closely matches its accuracy, while the average confidence of the ResNet is substantially higher than its accuracy. This is further illustrated in the bottom-row reliability diagrams (DeGroot & Fienberg, 1983; Niculescu-Mizil & Caruana, 2005), which show accuracy as a function of confidence. We see that LeNet is well-calibrated, as confidence closely approximates the expected accuracy (i.e. the bars align roughly along the diagonal). On the other hand, the ResNet's accuracy is better, but does not match its confidence.

Our goal is not only to understand why neural networks have become miscalibrated, but also to identify what methods can alleviate this problem. In this paper, we demonstrate on several computer vision and NLP tasks that neural networks produce confidences that do not represent true probabilities. Additionally, we offer insight and intuition into network training and architectural trends that may cause miscalibration. Finally, we compare various post-processing calibration methods on state-of-the-art neural networks, and introduce several extensions of our own. Surprisingly, we find that a single-parameter variant of Platt scaling (Platt et al., 1999) – which we refer to as temperature scaling – is often the most effective method at obtaining calibrated probabilities. Because this method is straightforward to implement with existing deep learning frameworks, it can be easily adopted in practical settings; a minimal sketch is given below.
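As a rough illustration of how little machinery temperature scaling requires, the following PyTorch sketch fits the single temperature parameter by minimizing validation NLL. It is our own minimal sketch under stated assumptions, not the authors' reference implementation: the names fit_temperature, val_logits, and val_labels are placeholders, and parameterizing the temperature as exp(log T) to keep it positive is a choice of ours, not something the paper prescribes.

```python
import torch
import torch.nn as nn

def fit_temperature(val_logits, val_labels, max_iter=50):
    """Fit a scalar temperature T > 0 so that softmax(z / T) is calibrated,
    by minimizing NLL on held-out validation logits z (temperature scaling).

    val_logits: (n, K) float tensor of pre-softmax network outputs
    val_labels: (n,) long tensor of true class labels
    """
    val_logits = val_logits.detach()  # the trained network itself stays fixed
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T, so T = exp(log_t) > 0
    nll = nn.CrossEntropyLoss()
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=max_iter)

    def closure():
        optimizer.zero_grad()
        loss = nll(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()  # calibrated confidences: softmax(z / T)
```

Because dividing the logits by a single positive scalar does not change their argmax, this procedure leaves the network's predictions, and hence its accuracy, unchanged; it only softens (T > 1) or sharpens (T < 1) the confidence estimates.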
2. Definitions

The problem we address in this paper is supervised multi-class classification with neural networks. The input $X \in \mathcal{X}$ and label $Y \in \mathcal{Y} = \{1, \ldots, K\}$ are random variables that follow a ground truth joint distribution $\pi(X, Y) = \pi(Y \mid X)\pi(X)$. Let $h$ be a neural network with $h(X) = (\hat{Y}, \hat{P})$, where $\hat{Y}$ is a class prediction and $\hat{P}$ is its associated confidence, i.e. probability of correctness. We would like the confidence estimate $\hat{P}$ to be calibrated, which intuitively means that $\hat{P}$ represents a true probability. For example, given 100 predictions, each with confidence of 0.8, we expect that 80 should be correctly classified. More formally, we define perfect calibration as

$$\mathbb{P}\left(\hat{Y} = Y \mid \hat{P} = p\right) = p, \quad \forall p \in [0, 1] \tag{1}$$

where the probability is over the joint distribution. In all practical settings, achieving perfect calibration is impossible. Additionally, the probability in (1) cannot be computed using finitely many samples since $\hat{P}$ is a continuous random variable. This motivates the need for empirical approximations that capture the essence of (1).

Reliability Diagrams (e.g. Figure 1 bottom) are a visual representation of model calibration (DeGroot & Fienberg, 1983; Niculescu-Mizil & Caruana, 2005). These diagrams plot expected sample accuracy as a function of confidence. If the model is perfectly calibrated – i.e. if (1) holds – then the diagram should plot the identity function. Any deviation from a perfect diagonal represents miscalibration.

To estimate the expected accuracy from finite samples, we group predictions into $M$ interval bins (each of size $1/M$) and calculate the accuracy of each bin. Let $B_m$ be the set of indices of samples whose prediction confidence falls into the interval $I_m = \left(\frac{m-1}{M}, \frac{m}{M}\right]$. The accuracy of $B_m$ is

$$\mathrm{acc}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \mathbf{1}(\hat{y}_i = y_i),$$

where $\hat{y}_i$ and $y_i$ are the predicted and true class labels for sample $i$. Basic probability tells us that $\mathrm{acc}(B_m)$ is an unbiased and consistent estimator of $\mathbb{P}(\hat{Y} = Y \mid \hat{P} \in I_m)$. We define the average confidence within bin $B_m$ as

$$\mathrm{conf}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \hat{p}_i,$$

where $\hat{p}_i$ is the confidence for sample $i$. $\mathrm{acc}(B_m)$ and $\mathrm{conf}(B_m)$ approximate the left-hand and right-hand sides of (1) respectively for bin $B_m$. Therefore, a perfectly calibrated model will have $\mathrm{acc}(B_m) = \mathrm{conf}(B_m)$ for all $m \in \{1, \ldots, M\}$. Note that reliability diagrams do not display the proportion of samples in a given bin, and thus cannot be used to estimate how many samples are calibrated.

Expected Calibration Error (ECE). While reliability diagrams are useful visual tools, it is more convenient to have a scalar summary statistic of calibration. Since statistics comparing two distributions cannot be comprehensive, previous works have proposed variants, each with a unique emphasis. One notion of miscalibration is the difference in expectation between confidence and accuracy, i.e.

$$\mathbb{E}_{\hat{P}}\left[\left|\mathbb{P}\left(\hat{Y} = Y \mid \hat{P} = p\right) - p\right|\right] \tag{2}$$

Expected Calibration Error (Naeini et al., 2015) – or ECE – approximates (2) by partitioning predictions into $M$ equally-spaced bins (similar to the reliability diagrams) and taking a weighted average of the bins' accuracy/confidence difference:

$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left|\mathrm{acc}(B_m) - \mathrm{conf}(B_m)\right|,$$

where $n$ is the total number of samples.

[Figure 2. Test error and ECE on CIFAR-100 as a function of varying depth (ResNet), width in filters per layer (ResNet-14), the use of Batch Normalization (ConvNet, without vs. with), and weight decay (ResNet-110).]
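To make the binning procedure above concrete, here is a minimal NumPy sketch that computes $\mathrm{acc}(B_m)$, $\mathrm{conf}(B_m)$, and ECE from a model's predicted labels and confidences. It is our own illustration; the function and argument names are not from the paper.

```python
import numpy as np

def calibration_bins(confidences, predictions, labels, n_bins=15):
    """Group predictions into M equal-width confidence bins and compute
    per-bin accuracy acc(B_m), average confidence conf(B_m), and ECE.

    confidences: (n,) predicted-class probability p_hat_i for each sample
    predictions: (n,) predicted class labels y_hat_i
    labels:      (n,) true class labels y_i
    """
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    accuracies = np.zeros(n_bins)
    avg_confidences = np.zeros(n_bins)
    ece = 0.0
    n = len(confidences)
    correct = (predictions == labels)

    for m in range(n_bins):
        # B_m: samples with confidence in I_m = ((m-1)/M, m/M], half-open on the left
        in_bin = (confidences > bin_edges[m]) & (confidences <= bin_edges[m + 1])
        if in_bin.any():
            accuracies[m] = correct[in_bin].mean()           # acc(B_m)
            avg_confidences[m] = confidences[in_bin].mean()  # conf(B_m)
            # ECE accumulates (|B_m| / n) * |acc(B_m) - conf(B_m)|
            ece += (in_bin.sum() / n) * abs(accuracies[m] - avg_confidences[m])

    return accuracies, avg_confidences, ece
```

Plotting the per-bin accuracies against the bin midpoints reproduces a reliability diagram like the bottom row of Figure 1; the paper's experiments use M = 15 bins.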