Accurate Uncertainties for Deep Learning Using Calibrated Regression

Volodymyr Kuleshov 1 2 Nathan Fenner 2 Stefano Ermon 1

Abstract

Methods for reasoning under uncertainty are a key building block of accurate and reliable machine learning systems. Bayesian methods provide a general framework to quantify uncertainty. However, because of model misspecification and the use of approximate inference, Bayesian uncertainty estimates are often inaccurate — for example, a 90% credible interval may not contain the true outcome 90% of the time. Here, we propose a simple procedure for calibrating any regression algorithm; when applied to Bayesian and probabilistic models, it is guaranteed to produce calibrated uncertainty estimates given enough data. Our procedure is inspired by Platt scaling and extends previous work on classification. We evaluate this approach on Bayesian linear regression, feedforward, and recurrent neural networks, and find that it consistently outputs well-calibrated credible intervals while improving performance on time series forecasting and model-based reinforcement learning tasks.

[Figure 1: two panels, "Raw 90% Confidence Interval from a Bayesian Neural Network" (top) and "Same Confidence Interval, Recalibrated" (bottom), each plotting true and predicted Sales (units) against Time (hours).]

Figure 1. Top: Time series forecasting using a Bayesian neural network. Because the model is Bayesian, we may obtain a 90% credible interval around the forecast (red). However, the interval fails to capture the true data distribution: most points fall outside of it. Bottom: We propose a recalibration method that enables the original model to output a 90% credible interval (green) that correctly contains 9/10 points.

1 Stanford University, Stanford, California. 2 Afresh Technologies, San Francisco, California. Correspondence to: Volodymyr Kuleshov.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

1. Introduction

Methods for reasoning and making decisions under uncertainty are an important building block of accurate, reliable, and interpretable machine learning systems. In many applications — ranging from supply chain planning to medical diagnosis to autonomous driving — faithfully assessing uncertainty can be as important as obtaining high accuracy. This paper explores uncertainty estimation over continuous variables in the context of modern deep learning models.

Bayesian approaches provide a general framework for dealing with uncertainty (Gal, 2016). Bayesian methods define a probability distribution over model parameters and derive uncertainty estimates by integrating over all possible model weights. Recent advances in variational inference have greatly increased the scalability and usefulness of these approaches (Blundell et al., 2015).

In practice, however, Bayesian uncertainty estimates often fail to capture the true data distribution (Lakshminarayanan et al., 2017) — e.g., a 90% posterior credible interval generally does not contain the true outcome 90% of the time (Figure 1). In such cases, we say that the model is miscalibrated. This problem arises because of model bias: a predictor may not be sufficiently expressive to assign the right probability to every credible interval, just as it may not be able to always assign the right label to a datapoint.

Recently, Gal et al. (2017) and Lakshminarayanan et al. (2017) proposed uncertainty estimation techniques for deep neural networks, which include ensemble methods, heteroscedastic regression, and concrete dropout. These methods require modifying the model and may not always produce perfectly calibrated forecasts. Calibration has been extensively studied in the weather forecasting literature (Gneiting and Raftery, 2005); however, these techniques tend to be specialized and difficult to generalize beyond applications in climate science.

[Figure 2: three panels, "2D Classification Problem", "Estimating Density of Brown Class" (x-axis: Class-Separating Feature; curves: density estimate, uncalibrated, recalibrated), and "Calibration Plot" (Mean Predicted Value vs. Empirical Probability of Brown Class, with a perfectly calibrated reference line).]

Figure 2. Calibrated classification. Left: Two classes are separated by a hyperplane in 2D. The x-axis is especially useful for separating the two classes. Middle: We project data onto the x-axis and fit a histogram (blue) or an isotonic regression model (green) to estimate the empirical probability of observing the brown class as a function of x. We may use these probabilities as approximately calibrated predictions. Right: The calibration of the original linear model and its recalibrated version are assessed by binning the predictions into ten intervals ([0, 0.1], (0.1, 0.2], ...), and plotting the predicted vs. the observed frequency of the brown class in each interval.

An alternative way to calibrate models has been explored in the support vector classification literature. These techniques — of which Platt scaling (Platt, 1999) is the most well-known — recalibrate the predictions of a pre-trained classifier in a post-processing step. As a result, these methods are classifier-agnostic and also typically very simple.

Here, we propose a new procedure for recalibrating any regression algorithm that is inspired by Platt scaling for classification. When applied to Bayesian and probabilistic deep learning models, it always produces calibrated credible intervals given a sufficient amount of i.i.d. data.

We evaluate our proposed algorithm on a range of Bayesian models, including Bayesian linear regression as well as feedforward and recurrent Bayesian neural networks. Our method consistently produces well-calibrated confidence estimates, which are in turn useful for several tasks in time series forecasting and model-based reinforcement learning.

Contributions. In summary, we introduce a simple technique for recalibrating the output of any regression algorithm, extending recalibration methods such as Platt scaling that were previously applicable only to classification. We then use this technique to solve an important problem in Bayesian deep learning: the miscalibration of credible intervals. We show that our results are useful in time series forecasting and in model-based reinforcement learning.

2. Calibrated Classification

This section is a concise overview of calibrated classification (Platt, 1999), and offers a reinterpretation of existing techniques that will be useful for deriving an extension to the regression and Bayesian settings in the next section.

Notation. We are given a labeled dataset x_t, y_t ∈ X × Y for t = 1, 2, ..., T of i.i.d. realizations of random variables X, Y ∼ P, where P is the data distribution. Given x_t, a forecaster H : X → (Y → [0, 1]) outputs a probability distribution F_t(y) targeting the label y_t. When Y is continuous, F_t is a cumulative distribution function (CDF). In this section, we assume for simplicity that Y = {0, 1}.

2.1. Calibration

Intuitively, calibration means that whenever a forecaster assigns a probability of 0.8 to an event, that event should occur about 80% of the time. In binary classification, we have Y = {0, 1}, and we say that H is calibrated if

\frac{\sum_{t=1}^{T} y_t \, \mathbb{I}\{H(x_t) = p\}}{\sum_{t=1}^{T} \mathbb{I}\{H(x_t) = p\}} \to p \quad \text{for all } p \in [0, 1]    (1)

as T → ∞. Here, for simplicity, we use H(x_t) to denote the probability of the event y_t = 1. When the x_t, y_t are i.i.d. realizations of random variables X, Y ∼ P, a sufficient condition for calibration is

\mathbb{P}(Y = 1 \mid H(X) = p) = p \quad \text{for all } p \in [0, 1].    (2)

Calibration vs. Sharpness. By itself, calibration is not enough to guarantee a useful forecast. For example, a forecaster that always predicts E[Y] is calibrated, but not very useful. Good predictions also need to be sharp, which intuitively means that probabilities should be close to zero or one. Note that an ideal forecaster is both calibrated and predicts outcomes with 100% confidence.
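As an illustration of the empirical check in Equation (1), the sketch below (our own illustration, not code from the paper; all function and variable names are ours) bins predicted probabilities and compares each bin's average prediction to the observed frequency of positive labels, which is what the calibration plot in Figure 2 (right) visualizes.

```python
import numpy as np

def classifier_calibration_curve(probs, labels, n_bins=10):
    """Compare predicted probabilities to empirical frequencies, bin by bin (cf. Eq. 1).

    probs  : predicted probabilities H(x_t) of the event y_t = 1
    labels : observed binary outcomes y_t in {0, 1}
    Returns a list of (mean predicted probability, observed frequency, count) per bin.
    """
    probs, labels = np.asarray(probs, float), np.asarray(labels, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each prediction to one of the intervals [0, 0.1], (0.1, 0.2], ...
    bins = np.clip(np.digitize(probs, edges[1:-1]), 0, n_bins - 1)
    curve = []
    for j in range(n_bins):
        mask = bins == j
        if mask.any():
            curve.append((probs[mask].mean(), labels[mask].mean(), int(mask.sum())))
    return curve

# For a calibrated forecaster, the first two coordinates match in every bin.
```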

[Figure 3: panels "Forecasts with Uncalibrated Confidence Intervals" and "Forecasts with Calibrated Confidence Intervals" (Sales vs. Date), "Estimating Cumulative Density of Forecast" (Predicted vs. Empirical Cumulative Distribution), and "Calibration Plot" (Expected vs. Observed Confidence Level, calibrated and uncalibrated curves).]

Figure 3. Calibrated regression. Left: A Bayesian neural network outputs probabilistic forecasts F_t of future time series values y_t. The credible intervals do not always represent the true frequency of the prediction falling in the interval. Middle: For each credible interval, we plot the observed number of times the prediction falls in the interval (i.e., we estimate P(F_X(Y) ≤ p)). We fit this function and use it to output the actual probability of any given interval. Right: Forecast calibration can be assessed by plotting expected vs. empirical rates of observing an outcome y_t in a set of ten intervals (−∞, F_t^{-1}(p)] for p = 0, 0.1, ..., 1.

2.2. Training Calibrated Classifiers

Most classification algorithms — including logistic regression, Naive Bayes, and support vector machines (SVMs) — are not calibrated out-of-the-box. Recalibration methods train an auxiliary model R : [0, 1] → [0, 1] on top of a pre-trained forecaster H such that R ∘ H is calibrated.

Estimating a Probability Distribution. When the x_t, y_t are sampled i.i.d. from P, choosing R(p) = P(Y = 1 | H(X) = p) yields a well-calibrated classifier R ∘ H, according to the definition in Equation (2). Thus, recalibration can be framed as estimating the above conditional density. Platt scaling (Platt, 1999) — one of the most widely used recalibration techniques — can be seen as approximating P(Y = 1 | H(X) = p) with a sigmoid (a valid assumption, in practice, when dealing with SVMs). Other recalibration methods fit this density with isotonic regression or kernel density estimation.

Projections and Features. A base classifier H : X → Φ may also output features φ ∈ Φ ⊆ R^d that do not correspond to probabilities. For example, an SVM outputs the margin between x_t and the separating hyperplane. Such non-probabilistic H can be similarly recalibrated by fitting R : Φ → [0, 1] to P(Y = 1 | H(X) = φ) (see Figure 2). To gain further intuition, note that H can be seen as projecting the x_t into a low-dimensional space Φ (e.g., the SVM margin) such that the data is well separated in Φ. The recalibrator R : Φ → [0, 1] then performs density estimation to learn the Bayes-optimal classifier P(Y = 1 | H(X) = φ). When φ is low-dimensional, this is tractable; furthermore, R ∘ H is accurate because the classes Y are well-separated in φ. Because P(Y = 1 | H(X) = φ) is Bayes-optimal, R ∘ H is also calibrated.

Diagnostic Tools. The calibration of a classifier is typically assessed using calibration curves (Figure 2). Given a dataset {(x_t, y_t)}_{t=1}^T, let p_t = H(x_t) ∈ [0, 1] be the forecasted probability. We group the p_t into intervals I_j for j = 1, 2, ..., m that form a partition of [0, 1] (e.g., [0, 0.1], (0.1, 0.2], etc.). A calibration curve plots the predicted average p_j = T_j^{-1} \sum_{t : p_t ∈ I_j} p_t in each interval I_j against the observed empirical average \hat{p}_j = T_j^{-1} \sum_{t : p_t ∈ I_j} y_t, where T_j = |{t : p_t ∈ I_j}|. Perfect calibration corresponds to a straight line.

We can also assess sharpness by looking at the distribution of model predictions. When forecasts are sharp, most predictions are close to 0 or 1; unsharp forecasters make predictions closer to 0.5.

3. Calibrated Regression

In this section, we extend recalibration methods for classification to regression (Y = R), and apply the resulting algorithm to Bayesian deep learning models. Recall that in regression, the forecaster H outputs at each step t a CDF F_t targeting y_t. We will use F_t^{-1} : [0, 1] → Y to denote the quantile function F_t^{-1}(p) = inf{y : p ≤ F_t(y)}.

3.1. Calibration

Intuitively, in a regression setting, calibration means that y_t should fall in a 90% confidence interval approximately 90% of the time. Formally, we say that the forecaster H is calibrated if

\frac{\sum_{t=1}^{T} \mathbb{I}\{y_t \le F_t^{-1}(p)\}}{T} \to p \quad \text{for all } p \in [0, 1]    (3)

as T → ∞. In other words, the empirical and the predicted CDFs should match as the dataset size goes to infinity.
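The left-hand side of Equation (3) can be estimated directly from held-out data. A minimal sketch (ours, not from the paper) that uses the identity y_t ≤ F_t^{-1}(p) if and only if F_t(y_t) ≤ p, so only the values F_t(y_t) are needed:

```python
import numpy as np

def empirical_coverage(cdf_at_y, levels):
    """Empirical counterpart of Equation (3).

    cdf_at_y : array of F_t(y_t), each forecast CDF evaluated at its own outcome
    levels   : confidence levels p at which calibration is checked
    Returns the fraction of outcomes with F_t(y_t) <= p for each p; for a
    calibrated forecaster this fraction approaches p itself as T grows.
    """
    cdf_at_y = np.asarray(cdf_at_y, float)
    return np.array([(cdf_at_y <= p).mean() for p in levels])

# Example: empirical_coverage(F_of_y, [0.1, 0.5, 0.9]) should be close to [0.1, 0.5, 0.9].
```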

[Figure 4: panels "Calibrated But Unsharp Forecaster" and "Calibrated and Sharp Forecaster", each plotting Output Y against Input Features X with the mean prediction and 60%/100% confidence intervals.]

Figure 4. Calibrated forecasts with different sharpness. Left: A prediction (blue line) is surrounded by confidence intervals of uniform width; by counting the number of points falling in each interval, we may form calibrated confidence estimates. Right: A forecast with wider intervals in uncertain areas. The 100% confidence region is on average closer to the mean: this forecast is sharper and more useful.

When the x_t, y_t are i.i.d. realizations of random variables X, Y ∼ P, a sufficient condition for this is

\mathbb{P}(Y \le F_X^{-1}(p)) = p \quad \text{for all } p \in [0, 1],    (4)

where we use F_X = H(X) to denote the forecast at X. This formulation is related to the notion of probabilistic calibration of Gneiting et al. (2007).

Note that our definition also implies that

\frac{\sum_{t=1}^{T} \mathbb{I}\{F_t^{-1}(p_1) \le y_t \le F_t^{-1}(p_2)\}}{T} \to p_2 - p_1    (5)

for all p_1, p_2 ∈ [0, 1] as T → ∞. This extends our notion of calibration to general confidence intervals.

Calibration and Sharpness. As in classification, calibration by itself is not sufficient to produce a useful forecast. For example, it is easy to see that the forecast F(y) = P(Y ≤ y) is calibrated; however, it does not even account for the features X and thus cannot be accurate. In order to be useful, forecasts must also be sharp. In a regression context, this means that the confidence intervals should all be as tight as possible around a single value. More formally, we want the variance var(F_t) of the random variable whose CDF is F_t to be small.

3.2. Training Calibrated Regression Models

We propose a simple recalibration scheme for producing calibrated forecasts that is closely inspired by classification techniques such as Platt scaling. Given a pre-trained forecaster H, we train an auxiliary model R : [0, 1] → [0, 1] such that the forecasts R ∘ F_t are calibrated (Algorithm 1).

Algorithm 1 Recalibration of Regression Models.
  Input: Uncalibrated model H : X → (Y → [0, 1]) and calibration set S = {(x_t, y_t)}_{t=1}^T.
  Output: Auxiliary recalibration model R : [0, 1] → [0, 1].
  1. Construct a recalibration dataset
     D = { ( [H(x_t)](y_t), \hat{P}([H(x_t)](y_t)) ) }_{t=1}^T,
     where \hat{P}(p) = |{ y_t | [H(x_t)](y_t) ≤ p, t = 1, ..., T }| / T.
  2. Train a model R (e.g., isotonic regression) on D.

This approach is simple, produces calibrated forecasts given enough i.i.d. data, and can be applied to any regression model, including recent Bayesian deep learning algorithms. Existing methods (Gal et al., 2017; Lakshminarayanan et al., 2017) require modifying the forecaster and may not produce calibrated forecasts even given large amounts of data.

Estimating a Probability Distribution. Note that setting every forecast F_t to R ∘ F_t, where R(p) := P(Y ≤ F_X^{-1}(p)), yields a perfectly calibrated forecaster according to the definition in Equation (4). Thus, recalibration can be formulated as estimating the above cumulative probability distribution. This is similar to the classification setting, where we needed to estimate the conditional density P(Y = 1 | H(X) = p).

The intuition behind recalibration is that for any confidence level p, we may estimate from data the true probability P(Y ≤ F_X^{-1}(p)) of a random Y falling in the credible region (−∞, F_X^{-1}(p)] below the p-th quantile of F_X. For example, we may count the fraction of points (x_t, y_t) in a dataset that have this property, or fit a regressor R : [0, 1] → [0, 1] to P(Y ≤ F_X^{-1}(p)), such that R(p) estimates this probability for every p. Then, given a new forecast F, we may adjust the predicted probability F(y) for the credible interval (−∞, y] to the true calibrated probability estimated empirically from data and given by R ∘ F(y). For example, if p = 95%, but only 80/100 observed y_t fall below the 95% quantile of F_t, then we adjust the 95% quantile to 80% (see Figure 3).
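A minimal sketch of Algorithm 1 using isotonic regression from scikit-learn; it assumes the forecast CDF values [H(x_t)](y_t) have already been computed on a calibration set (the helper name and interface are ours):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_recalibrator(cdf_at_y):
    """Algorithm 1 (sketch): fit R on the dataset D = {(F_t(y_t), P_hat(F_t(y_t)))}.

    cdf_at_y : array of [H(x_t)](y_t) on the calibration set
    Returns an isotonic model R; R.predict(F(y)) is the recalibrated probability.
    """
    cdf_at_y = np.asarray(cdf_at_y, float)
    # Step 1: empirical CDF of the forecast CDF values, P_hat(p) = |{t : F_t(y_t) <= p}| / T.
    p_hat = np.array([(cdf_at_y <= p).mean() for p in cdf_at_y])
    # Step 2: fit a monotone map R : [0, 1] -> [0, 1] on D.
    recalibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    recalibrator.fit(cdf_at_y, p_hat)
    return recalibrator

# Usage: R = fit_recalibrator(cdf_calib); calibrated_probs = R.predict(cdf_test)
```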

Specifically, given a dataset {(x_t, y_t)}_{t=1}^T, we may learn P(Y ≤ F_X^{-1}(p)) by fitting any regression algorithm to the recalibration set defined by {(F_t(y_t), \hat{P}(F_t(y_t)))}_{t=1}^T, where

\hat{P}(p) = \frac{|\{ y_t \mid F_t(y_t) \le p,\ t = 1, \dots, T \}|}{T}    (6)

denotes the fraction of the data for which y_t lies below the p-th quantile of F_t.

We recommend using isotonic regression (Niculescu-Mizil and Caruana, 2005) as the regression model on this dataset. This method accounts for the fact that the true function P(Y ≤ F_X^{-1}(p)) is monotonically increasing; it is also non-parametric, hence it can learn the true distribution given enough i.i.d. data.

As in classifier recalibration, it is advisable to fit R on a separate calibration set in order to reduce overfitting. Alternatively, one may break the data into K folds and train K models in a way that is reminiscent of cross-validation: the hold-out fold serves as the calibration set and the model is trained on the remaining folds; at prediction time, the output is the average of the K models.

3.3. Recalibrating Bayesian Models

Probabilistic forecasts F_t are often obtained using Bayesian methods such as Bayesian neural networks or Gaussian processes. In practice, one often uses the mean and the variance µ, σ² of dropout samples from a Bayesian neural network evaluated at x_t to obtain a principled estimate of the predictive distribution over y_t (Gal and Ghahramani, 2016a). The result is a probabilistic forecast F_t(x_t) taking the form of a Gaussian N(µ(x_t), σ²(x_t)).

However, if the true data distribution P(Y | X) is not Gaussian, uncertainty estimates derived from the Bayesian model will not be calibrated (see Figures 1 and 3). Algorithm 1 recalibrates uncertainty estimates from any black-box Bayesian model, making them accurate and useful.

3.4. Features for Recalibration

We may also use Algorithm 1 to recalibrate non-probabilistic forecasters, just as Platt scaling recalibrates SVMs. We may generalize the forecast to any increasing function F(y) : Y → Φ, where Φ ⊆ R defines a "feature" that correlates with the confidence of the classifier. We transform features into probability estimates by fitting a recalibrator R : Φ → [0, 1] to the following CDF:

\mathbb{P}(Y \le F_X^{-1}(\phi)).    (7)

The simplest feature φ ∈ Φ is the distance from the mean prediction, i.e., [H(x)](y) = F_x(y) = y − µ(x), where µ(x) is any point estimate of Y. Fitting R essentially means counting the fraction of points that lie at any given distance of µ(x). Interestingly, this produces calibrated probabilistic forecasts even for an arbitrary (non-probabilistic) regressor H. However, confidence intervals will have the same width everywhere independently of x (e.g., Figure 4, left); this makes them less useful at identifying points x where the model is uncertain.

A better feature should account for uncertainty as a function of x. For example, we may use heteroscedastic regression to directly fit a mean and standard deviation µ(x), σ(x) and use F_x(y) = (y − µ(x))/σ(x). Combining features can further improve the sharpness of forecasts.

3.5. Diagnostic Tools

Next, we propose a set of diagnostic measures and visualizations in order to assess calibration and sharpness.

Calibration. We propose a calibration plot for regression inspired by the one for classification. This plot displays the true frequency of points in each confidence interval relative to the predicted fraction of points in that interval.

More formally, we choose m confidence levels 0 ≤ p_1 < p_2 < ... < p_m ≤ 1; for each threshold p_j, we compute the empirical frequency

\hat{p}_j = \frac{|\{ y_t \mid F_t(y_t) \le p_j,\ t = 1, \dots, T \}|}{T}.    (8)

To visualize calibration, we plot {(p_j, \hat{p}_j)}_{j=1}^m; calibrated forecasts correspond to a straight line. Note that for best results, the diagnostic dataset should be distinct from the calibration and training sets.

Finally, we propose using the calibration error as a numerical score describing the quality of forecast calibration:

\mathrm{cal}(F_1, y_1, \dots, F_T, y_T) = \sum_{j=1}^{m} w_j \cdot (p_j - \hat{p}_j)^2.    (9)

The scalars w_j are weights. We used w_j ≡ 1 in our experiments; alternatively, choosing w_j ∝ |{ y_t | F_t(y_t) ≤ p_j, t = 1, ..., T }| decreases the importance of intervals that contain fewer data and that are more difficult to calibrate.

Sharpness. We propose measuring sharpness using the variance var(F_t) of the random variable whose CDF is F_t. Low-variance predictions are tightly centered around one value. A sharpness score can be defined by

\mathrm{sha}(F_1, \dots, F_T) = \frac{1}{T} \sum_{t=1}^{T} \mathrm{var}(F_t).    (10)

Note that this definition also applies to categorical variables; for a binary Y with probability mass function f, we have var(f) = f(1)(1 − f(1)). This value is minimized when f(1) is 0 or 1, and corresponds to the "refinement" term in the classical decomposition of the Brier score (Murphy, 1973).
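Both diagnostics can be computed in a few lines. A sketch of Equations (9) and (10), ours rather than the paper's implementation, using uniform weights w_j = 1 and assuming the forecast variances (e.g., σ²(x_t) for Gaussian forecasts) are available:

```python
import numpy as np

def calibration_error(cdf_at_y, levels=None, weights=None):
    """Equation (9): weighted squared gap between expected and observed confidence levels."""
    cdf_at_y = np.asarray(cdf_at_y, float)
    if levels is None:
        levels = np.linspace(0.0, 1.0, 11)      # p_j = 0, 0.1, ..., 1
    if weights is None:
        weights = np.ones_like(levels)          # w_j = 1, matching the choice in the text
    observed = np.array([(cdf_at_y <= p).mean() for p in levels])  # p_hat_j, Equation (8)
    return float(np.sum(weights * (levels - observed) ** 2))

def sharpness(forecast_variances):
    """Equation (10): average variance of the predictive distributions."""
    return float(np.mean(forecast_variances))
```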

              Bayesian Linear Regression    Approx. Bayesian Neural Net    Concrete Dropout    Deep Ensemble
  dataset     MAPE    Calibr.  Recal.       MAPE    Calibr.  Recal.        MAPE    Calibr.     MAPE    Calibr.
  mpg         0.107   0.053    0.057        0.091   0.102    0.021         0.081   0.068       0.079   0.087
  auto        0.002   0.175    0.029        0.027   0.205    0.017         0.009   0.103       0.014   0.242
  crime       0.086   0.030    0.016        0.086   0.070    0.015         0.084   0.182       0.082   0.050
  kinematics  0.267   0.024    0.006        0.110   0.043    0.016         0.097   0.228       0.107   0.027
  stocks      0.005   0.163    0.052        0.020   0.183    0.024         0.011   0.039       0.013   0.229
  cpu         0.395   0.074    0.025        0.351   0.163    0.065         0.294   0.166       0.319   0.147
  bank        0.572   0.073    0.083        0.395   0.134    0.057         0.410   0.177       0.390   0.126
  wine        0.101   0.024    0.022        0.097   0.096    0.028         0.099   0.207       0.099   0.096

Table 1. Mean absolute percent error (MAPE) and calibration error (Equation 9) for two regression algorithms (Bayesian linear regression and a dense neural network) and two baselines. Recalibrating the regressors improves calibration and outperforms the baselines.
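For concreteness, the accuracy metric in Table 1 can be computed as follows (a small helper of ours; it assumes point predictions such as forecast means or medians and strictly positive targets):

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percent error, the accuracy metric reported in Table 1."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))
```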

4. Experiments

4.1. Setup

Datasets. We use eight UCI datasets varying in size from 194 to 8192 examples; examples carry between 6 and 159 continuous features. There is generally no standard train/test split, hence we randomly assign 25% of each dataset for testing and use the rest for training. We report averages over 5 random splits. We also perform depth estimation on the larger Make3D dataset (Saxena et al., 2009), using the setup of Kendall and Gal (2017).

We also test our method on time series forecasting and reinforcement learning tasks. We use daily grocery sales from the Corporacion Favorita Kaggle dataset; we forecast the highest-selling item (#1503844) and use data from 2014-01-01 to 2016-05-31 in stores #1-4 for training and data from 2016-06-01 to 2016-12-31 for testing. We use auto-regressive features from the past four days as well as binary indicators for the day of the week and the week of the year.

[Figure 5: calibration plot, Expected vs. Observed Confidence Level, for the recalibrated (rms = 3.91) and uncalibrated (rms = 3.92) models.]

Figure 5. Calibration curves of a DenseNet model for depth estimation on the Make3D dataset, before and after calibration.

Models. The simplest model we study is Bayesian Ridge Regression (MacKay, 1992). The prior over the weights is a spherical Gaussian with a Gamma prior over the precision parameter. Posterior inference can be performed in closed form because the prior is conjugate.

We also consider feedforward and recurrent neural networks, and we use the dropout approximation to variational inference of Gal and Ghahramani (2016a) to produce uncalibrated forecasts. In UCI experiments, the feedforward neural network has two layers of 128 hidden units with a dropout rate of 0.5 and parametric ReLU non-linearities. Recurrent networks are based on a standard GRU architecture with two stacked layers and a recurrent dropout of 0.5 (Gal and Ghahramani, 2016b). We use the DenseNet architecture of Jégou et al. (2017) for the depth regression task.

To perform recalibration, we first fit a base model on the training set, and then use isotonic regression as the recalibrator on the same training set. We did not observe significant overfitting and did not use a distinct calibration set.

Baselines. We compare our approach against two recently proposed methods for improving the calibration of deep learning models: concrete dropout (Gal et al., 2017) and deep ensembles (Lakshminarayanan et al., 2017). Concrete dropout is a technique for learning the dropout probabilities based on the concrete distribution (Maddison et al., 2016); we use it to replace standard dropout in our neural network models. Deep ensembles train multiple models with heteroscedastic regression and average their predictive distributions at runtime; in our experiments, we use an ensemble of 5 instances of the same neural network that we use as the base predictor (except we add σ(x) as an output).
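The neural network forecasters above output a predictive Gaussian obtained from Monte Carlo dropout samples (Section 3.3). A minimal sketch of that step, assuming a user-supplied predict_stochastic(x) callable that runs the network once with dropout active (a placeholder of ours, not an API from the paper):

```python
import numpy as np
from scipy.stats import norm

def dropout_gaussian_forecast(predict_stochastic, x, n_samples=100):
    """Form a Gaussian forecast CDF F_x from Monte Carlo dropout samples.

    predict_stochastic : callable returning one stochastic prediction for input x
                         (dropout kept active at test time); assumed to exist.
    Returns a function F with F(y) = P(Y <= y | x) under N(mu(x), sigma^2(x)).
    """
    samples = np.array([predict_stochastic(x) for _ in range(n_samples)])
    mu, sigma = samples.mean(), samples.std() + 1e-8   # guard against zero variance
    return lambda y: norm.cdf(y, loc=mu, scale=sigma)

# The values F(y_t) produced this way are exactly the inputs to Algorithm 1.
```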

4.2. UCI Experiments

Table 1 reports the accuracy (in terms of mean absolute percent error) and the test set calibration error (Equation 9) of Bayesian linear regression, a dense neural network, and two baselines on eight UCI datasets. We also report the recalibrated error, which is significantly lower than that of the original model. Concrete dropout and deep ensembles are often better calibrated than the regular neural network, but not as much as the recalibrated models. Recalibrated forecasts achieve similar accuracies to the baselines, even though the latter have more parameters.

4.3. Depth Estimation

We follow the setup of Kendall and Gal (2017). We compute per-pixel uncertainty estimates using dropout and measure calibration over all the individual pixels. We recalibrate on all the pixels in the training set. This yields an improvement in calibration error from 5.50E-02 to 2.41E-02, while preserving accuracy. The full calibration plot is given in Figure 5.

4.4. Time Series Forecasting

Next, we fit a recurrent network to forecast daily sales of item #1503844; we obtain mean absolute percent errors of 17.3-21.8% on the test set across the four stores. In Figure 3, we show that an uncalibrated 90% confidence interval misses most of the data points; however, the recalibrated confidence interval correctly contains about 90% of the true values.

Furthermore, we report in Figure 6 true and forecasted sales in store #1, as well as the calibration curves for both methods. The two baselines improve on the calibration of the original model, but our recalibration technique is the only one to achieve almost perfectly calibrated forecasts.

Figure 6. Time series forecasting. Top: Our recalibrated forecast (blue) predicts produce sales (gray). Bottom: Calibration plot comparing our method to two baselines.

  Concrete Dropout   Deep Ensemble   Ours
  $60,082            $60,894         $61,690

Table 2. Cumulative reward (in dollars, averaged across 4 stores) of a model-based RL agent performing inventory management, using different algorithms to learn the state transition model.

4.5. Model-Based Reinforcement Learning

Uncertainty estimation is important in reinforcement learning to balance exploration and exploitation, as well as to improve model-based planning (Ghavamzadeh et al., 2015). Here, we focus on the latter task, and show how model-based reinforcement learning can be improved by calibrating the learned transition functions between states.

Our task is inventory management, a classical application of reinforcement learning (Van Roy et al., 1997): on a set of days, an agent calculates order quantities for a perishable item in order to maximize store profits; transitions between states are defined by the probabilistic demand forecasts obtained in the previous section.

We formalize this task as a Markov decision process (S, A, P, R). States s ∈ S are sets of tuples {(q, ℓ); ℓ = 1, 2, ..., L}; each (q, ℓ) indicates that the store carries q units of the item that expire in ℓ days (L being the maximum shelf-life). Transition probabilities P are defined through the following process: on each day the store sells d units (a random quantity sampled from historical data not seen during training), which are removed from the inventory in s (items leave in a first-in first-out manner); the shelf-life of the remaining items is decreased (spoiled items are thrown away). Actions a ∈ A correspond to orders: the store receives a items with a shelf-life of L before entering the next state s'. Finally, rewards are store profits, defined as sales revenue minus ordering costs.
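The following sketch illustrates one transition of the inventory process described above; the dictionary-based state representation and function name are ours, not the paper's implementation:

```python
def inventory_step(state, order, demand, max_shelf_life=5):
    """One transition of the inventory MDP: FIFO sales, aging, spoilage, then restocking.

    state  : dict mapping remaining shelf-life (in days) to units in stock
    order  : units ordered today (arrive with shelf-life max_shelf_life)
    demand : units requested by customers today
    Returns (next_state, units_sold).
    """
    sold, remaining = 0, demand
    next_state = {}
    # Items leave in a first-in first-out manner: oldest stock (lowest shelf-life) sells first.
    for life in sorted(state):
        take = min(state[life], remaining)
        sold += take
        remaining -= take
        leftover = state[life] - take
        # Unsold items age by one day; items with one day left spoil and are discarded.
        if leftover > 0 and life > 1:
            next_state[life - 1] = next_state.get(life - 1, 0) + leftover
    # The order arrives with full shelf-life before the next state is entered.
    next_state[max_shelf_life] = next_state.get(max_shelf_life, 0) + order
    return next_state, sold
```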

We perform a simulation experiment on the grocery dataset: we use the Bayesian neural network from the previous section as our learned model of the environment (i.e., the state transition function). We then use dynamic programming with a 14-day horizon to determine the best action at each step. We evaluate the agent on the test portion of the Kaggle dataset, using the historical sales to define the state transitions. Item prices and costs are set to 1.99 and 1.29 respectively; items can be ordered three days a week in packs of 12 and arrive on the next day; the shelf-life of new items is always five days.

Table 2 shows the results. A calibrated state transition model allows the agent to better plan its actions and obtain a higher reward. This suggests that our method is useful for planning in model-based reinforcement learning.

5. Discussion

Calibrated Bayesian Forecasts. Our work proposes techniques for adjusting Bayesian models in a way that matches true empirical frequencies. This approach mixes frequentist and Bayesian ideas; a pure Bayesian would have instead defined a single model family and tried integrating over all possible models. Instead, we take an approach reminiscent of model criticism techniques (Box and Hunter, 1962) such as posterior predictive checking and its variants (Gelman and Hill, 2007; Kucukelbir et al., 2017). We fit a model to a dataset, and compare its predictions to real-world data, making adjustments if necessary.

Interestingly, our work suggests that a fully probabilistic model is not necessary to obtain confidence estimates: we may instead use simple features such as the distance from a point forecast. The advantage of more complex models is to provide a richer raw signal about their uncertainty, which may be recalibrated into sharper forecasts (Figure 4).

Probabilistic Forecasting. Gneiting et al. (2007) proposed several notions of calibration for continuous variables; our definition is most closely related to their concept of probabilistic calibration. The main difference is that Gneiting et al. (2007) define it relative to latent generative distributions, whereas our most general definition only considers empirical frequencies. We found that their notion of marginal calibration was too weak for our purposes, since it only preserves guarantees relative to the average distribution of the y_t (which may be too high-variance to be useful).

Gneiting et al. (2007) also proposed that a probabilistic forecast should maximize sharpness subject to calibration. Interestingly, we take somewhat of an opposite approach: given a pre-trained model, we maximize for its calibration. This form of recalibration preserves the accuracy of point estimates from the model. A recalibrated estimate of the median (or any other quantile) only becomes worse if there is insufficient data for recalibration, or there is a shift in the data distribution. However, the forecasts may become less sharp if the original forecaster was underestimating its uncertainty and producing credible intervals that were too tight around the mean.

Applications. Calibrated confidence estimates have been extensively studied by practitioners in medicine (Jiang et al., 2012), meteorology (Raftery et al., 2005), natural language processing (Nguyen and O'Connor, 2015), and other fields. Confidence estimates for continuous variables are important in computer vision applications, such as depth estimation (Kendall and Gal, 2017). Calibrated probabilities offer significant improvements over ordinary uncertainty estimates because they correspond to real-world empirical frequencies, and hence are interpretable.

6. Previous Work

Calibrated Classification. In the binary classification setting, Platt scaling (Platt, 1999) and isotonic regression (Niculescu-Mizil and Caruana, 2005) are effective and widely used to perform recalibration. They admit extensions to the multi-class setting (Zadrozny and Elkan, 2002) and to structured prediction (Kuleshov and Liang, 2015). They have also been studied in the context of modern neural networks (Guo et al., 2017; Gal et al., 2017; Lakshminarayanan et al., 2017). Recently, Kuleshov and Ermon (2017) proposed methods that can quantify uncertainty via calibrated probabilities without any i.i.d. assumptions on the data, allowing inputs to potentially be chosen by a malicious adversary.

Probabilistic Forecasting. The study of calibration originates in the statistics literature (Murphy, 1973; Dawid, 1984), mainly in the context of evaluating probabilistic forecasts using proper loss functions (Gneiting and Raftery, 2007). Proper losses decompose into a calibration and a sharpness term (Murphy, 1973); this decomposition also extends to probabilistic forecasts (Hersbach, 2000). Dawid (1984) also studied calibration in a Bayesian framework.

More recently, calibration has been studied in the literature on probabilistic forecasting, especially in the context of meteorology (Gneiting and Raftery, 2005). This resulted in specialized calibration systems (Raftery et al., 2005). Although most work on calibration focuses on classification, Gneiting et al. (2007) proposed several definitions of calibration for continuous variables. Their paper does not explore techniques for generating calibrated forecasts; we focus on the study of such algorithms in our work.

7. Conclusion

In summary, our paper formalized a notion of calibration for continuous variables, drawing close connections to work in calibrated classification. Inspired by these methods, we proposed a simple recalibration technique that generates calibrated probabilistic forecasts given enough i.i.d. data. Furthermore, we introduced visualizations and metrics for evaluating calibration and sharpness. Finally, we demonstrated the practical importance of calibration by applying our method to Bayesian neural networks. Our method consistently produces well-calibrated uncertainty estimates; this result is useful in time series forecasting, reinforcement learning, as well as more generally to construct reliable, interpretable, and interactive machine learning systems.

References

Yarin Gal. Uncertainty in Deep Learning. PhD thesis, University of Cambridge, 2016.

Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.

Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. arXiv preprint arXiv:1612.01474, 2017.

Yarin Gal, Jiri Hron, and Alex Kendall. Concrete dropout. In Advances in Neural Information Processing Systems, pages 3584-3593, 2017.

Tilmann Gneiting and Adrian E. Raftery. Weather forecasting with ensemble methods. Science, 310(5746):248-249, 2005.

J. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61-74, 1999.

T. Gneiting, F. Balabdaoui, and A. E. Raftery. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(2):243-268, 2007.

A. Niculescu-Mizil and R. Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, pages 625-632, 2005.

Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML-16), 2016a.

A. H. Murphy. A new vector partition of the probability score. Journal of Applied Meteorology, 12(4):595-600, 1973.

Ashutosh Saxena, Min Sun, and Andrew Y. Ng. Make3D: Learning 3D scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5):824-840, 2009.

Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems, pages 5580-5590, 2017.

David J. C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415-447, 1992.

Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems 29 (NIPS), 2016b.

Simon Jégou, Michal Drozdzal, David Vazquez, Adriana Romero, and Yoshua Bengio. The one hundred layers tiramisu: Fully convolutional DenseNets for semantic segmentation. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pages 1175-1183. IEEE, 2017.

Mohammad Ghavamzadeh, Shie Mannor, Joelle Pineau, Aviv Tamar, et al. Bayesian reinforcement learning: A survey. Foundations and Trends in Machine Learning, 8(5-6):359-483, 2015.

Benjamin Van Roy, Dimitri P. Bertsekas, Yuchun Lee, and John N. Tsitsiklis. A neuro-dynamic programming approach to retailer inventory management. In Proceedings of the 36th IEEE Conference on Decision and Control, volume 4, pages 4052-4057. IEEE, 1997.

George E. P. Box and William G. Hunter. A useful method for model-building. Technometrics, 4(3):301-318, 1962.

Andrew Gelman and Jennifer Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models, volume 1. Cambridge University Press, New York, NY, USA, 2007.

Alp Kucukelbir, Yixin Wang, and David M. Blei. Evaluating Bayesian models with posterior dispersion indices. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1925-1934, International Convention Centre, Sydney, Australia, 2017.

X. Jiang, M. Osl, J. Kim, and L. Ohno-Machado. Calibrating predictive model estimates to support personalized medicine. Journal of the American Medical Informatics Association, 19(2):263-274, 2012.

Adrian E. Raftery, Tilmann Gneiting, Fadoua Balabdaoui, and Michael Polakowski. Using Bayesian model averaging to calibrate forecast ensembles. Monthly Weather Review, 133(5):1155-1174, 2005.

K. Nguyen and B. O'Connor. Posterior calibration and exploratory analysis for natural language processing models. In Empirical Methods in Natural Language Processing (EMNLP), pages 1587-1598, 2015.

B. Zadrozny and C. Elkan. Transforming classifier scores into accurate multiclass probability estimates. In International Conference on Knowledge Discovery and Data Mining (KDD), pages 694-699, 2002.

V. Kuleshov and P. Liang. Calibrated structured prediction. In Advances in Neural Information Processing Systems (NIPS), 2015.

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. arXiv preprint arXiv:1706.04599, 2017.

Volodymyr Kuleshov and Stefano Ermon. Estimating uncertainty online against an adversary. In AAAI, pages 2110-2116, 2017.

A. P. Dawid. Present position and potential developments: Some personal views: Statistical theory: The prequential approach. Journal of the Royal Statistical Society, Series A (General), 147:278-292, 1984.

Tilmann Gneiting and Adrian E. Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359-378, 2007.

Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.

Hans Hersbach. Decomposition of the continuous ranked probability score for ensemble prediction systems. Weather and Forecasting, 15(5):559-570, 2000.