Amortized Conditional Normalized Maximum Likelihood: Reliable Out of Distribution Uncertainty Estimation

Aurick Zhou, Sergey Levine
arXiv:2011.02696v2 [cs.LG] 1 Mar 2021

Abstract

While deep neural networks provide good performance for a range of challenging tasks, calibration and uncertainty estimation remain major challenges, especially under distribution shift. In this paper, we propose the amortized conditional normalized maximum likelihood (ACNML) method as a scalable general-purpose approach for uncertainty estimation, calibration, and out-of-distribution robustness with deep networks. Our algorithm builds on the conditional normalized maximum likelihood (CNML) coding scheme, which has minimax optimal properties according to the minimum description length principle, but is computationally intractable to evaluate exactly for all but the simplest of model classes. We propose to use approximate Bayesian inference techniques to produce a tractable approximation to the CNML distribution. Our approach can be combined with any approximate inference algorithm that provides tractable posterior densities over model parameters. We demonstrate that ACNML compares favorably to a number of prior techniques for uncertainty estimation in terms of calibration on out-of-distribution inputs.

1. Introduction

Current machine learning methods provide unprecedented accuracy across a range of domains, from computer vision to natural language processing. However, in many high-stakes applications, such as medical diagnosis or autonomous driving, rare mistakes can be extremely costly. Thus, effective deployment of learned models requires not only high accuracy, but also a way to measure the certainty in a model's predictions in order to assess risk and allow the model to abstain from making decisions when there is low confidence in the prediction. While deep networks offer excellent prediction accuracy, they generally do not provide the means to accurately quantify their uncertainty. This is especially true on out-of-distribution inputs, where deep networks tend to make overconfident incorrect predictions (Ovadia et al., 2019). In this paper, we tackle the problem of obtaining reliable uncertainty estimates under distribution shift, with the aim of producing models that can reliably report their uncertainty even when presented with unexpected inputs.

Most prior work approaches the problem of uncertainty estimation from the standpoint of Bayesian inference. By treating parameters as random variables with some prior distribution, Bayesian inference can compute posterior distributions that capture a notion of epistemic uncertainty and allow us to quantitatively reason about uncertainty in model predictions. However, computing accurate posterior distributions becomes intractable for very complex models such as deep neural networks, and current approaches require highly approximate inference methods that fall short of the promise of full Bayesian modeling in practice.

Bayesian methods also have a deep connection with the minimum description length (MDL) principle, a formalization of Occam's razor that casts learning as performing efficient data compression and has been widely used as a motivation for model selection techniques. Codes corresponding to maximum-a-posteriori estimators and Bayes marginalization have been commonly used within the MDL framework. However, other coding schemes have been proposed in MDL centered around achieving different notions of minimax optimality. Interpreting coding schemes as predictive distributions, such methods can directly inspire prediction strategies that give conservative predictions and, owing to their minimax formulation, do not suffer from excessive overconfidence.

One such predictive distribution is the conditional normalized maximum likelihood (CNML) model (Grünwald, 2007; Rissanen and Roos, 2007; Roos et al., 2008), also known as sequential NML or predictive NML (Fogel and Feder, 2018b). To make a prediction on a new input, CNML considers every possible label and finds the model that best explains that label for the query point together with the training set. It then uses that corresponding model to assign a probability to each label and normalizes to obtain a valid probability distribution. We will argue that the CNML prediction strategy can be useful for providing reliable uncertainty estimates on out-of-distribution inputs. Intuitively, instead of relying on a learned model to extrapolate from the training set to the new (potentially out-of-distribution) input, CNML can obtain more reasonable predictive distributions by explicitly updating a model for each potential label of the particular test input and then asking "given the training data, which labels would make sense for this input?"

While CNML provides compelling minimax regret guarantees, practical instantiations have been exceptionally difficult, because computing predictions for a test point requires retraining the model on the test point concatenated with the entire training set. With large models like deep neural networks, this can require hours of training for every prediction, rendering naive CNML schemes infeasible for practical use.

In this paper, we argue that prediction strategies inspired by CNML, which output conservative predictions that depend on models explicitly trained on the test input, can provide reasonable uncertainty estimates even when faced with out-of-distribution data. To instantiate such a strategy tractably, we propose amortized CNML (ACNML), a practical algorithm for approximating CNML using approximate Bayesian inference. ACNML avoids the need to optimize over large datasets during inference by using an approximate posterior in place of the training set. We show that our proposed approach compares favorably to a number of prior techniques for uncertainty estimation on out-of-distribution inputs, and is substantially more feasible and computationally efficient than prior techniques for using CNML predictions with deep neural networks.

2. Conditional Normalized Maximum Likelihood

ACNML is motivated by the minimum description length (MDL) principle, which states that any regularities in a dataset can be exploited to compress it, so that learning is reformulated as encoding the data as efficiently as possible (Rissanen, 1989; Grünwald, 2007). While MDL is typically described in terms of code lengths, we can associate codes with probability distributions, with the code length of an object corresponding to its negative log-likelihood under that probability distribution. MDL was originally formulated in a generative setting where the goal is to code arbitrary data; we focus here on a supervised learning setting, where we assume the inputs are already known and our goal is only to encode/predict the labels.

Normalized Maximum Likelihood. Suppose we have a model class Θ, where each θ ∈ Θ corresponds to a conditional distribution p_θ(y|x). Let θ̂(y_{1:n}|x_{1:n}) denote the maximum likelihood estimator over all θ ∈ Θ for a sequence of labels y_{1:n} corresponding to inputs x_{1:n}. Given a sequence of inputs x_{1:n} and labels y_{1:n}, we can define the regret of a distribution over labels q as

R(q; y_{1:n}, x_{1:n}, \Theta) \overset{\text{def}}{=} \log p_{\hat{\theta}(y_{1:n}|x_{1:n})}(y_{1:n} \mid x_{1:n}) - \log q(y_{1:n}).   (1)

In relation to the MDL principle, this regret corresponds to the excess number of bits q uses to encode the labels y_{1:n} compared to the best distribution in the model class Θ. For any fixed input sequence, we can then define the normalized maximum likelihood (NML) distribution as

p^{\mathrm{NML}}(y_{1:n} \mid x_{1:n}) = \frac{p_{\hat{\theta}(y_{1:n}|x_{1:n})}(y_{1:n} \mid x_{1:n})}{\sum_{\tilde{y}_{1:n} \in \mathcal{Y}^n} p_{\hat{\theta}(\tilde{y}_{1:n}|x_{1:n})}(\tilde{y}_{1:n} \mid x_{1:n})}.   (2)

The NML distribution can be shown to achieve minimax regret (Shtarkov, 1987; Rissanen, 1996), as it attains the same regret for every label sequence:

p^{\mathrm{NML}} = \operatorname*{argmin}_{q} \; \max_{y_{1:n} \in \mathcal{Y}^n} R(q; y_{1:n}, x_{1:n}, \Theta).   (3)

This corresponds, in a sense, to an optimal coding scheme for sequences of labels of known fixed length n.
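The claim that NML equalizes the regret can be checked directly; the following one-line derivation is a standard argument, included here only for completeness. Substituting Equation 2 into Equation 1 gives

R(p^{\mathrm{NML}}; y_{1:n}, x_{1:n}, \Theta) = \log \frac{p_{\hat{\theta}(y_{1:n}|x_{1:n})}(y_{1:n} \mid x_{1:n})}{p^{\mathrm{NML}}(y_{1:n} \mid x_{1:n})} = \log \sum_{\tilde{y}_{1:n} \in \mathcal{Y}^n} p_{\hat{\theta}(\tilde{y}_{1:n}|x_{1:n})}(\tilde{y}_{1:n} \mid x_{1:n}),

which does not depend on y_{1:n}, so every label sequence incurs exactly the same regret under p^NML. Any other distribution q must assign some sequence lower probability than p^NML does, and therefore has strictly larger worst-case regret.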
Conditional NML. Instead of making predictions for entire sequences of labels at once, NML can be adapted to the setting where we predict only the next label based on the previously seen data, resulting in conditional NML (CNML) (Rissanen and Roos, 2007; Grünwald, 2007; Fogel and Feder, 2018a). While several variations on CNML exist, we consider the following:

p^{\mathrm{CNML}}(y_n \mid x_n; x_{1:n-1}, y_{1:n-1}) \;\propto\; p_{\hat{\theta}(y_{1:n}|x_{1:n})}(y_n \mid x_n),   (4)

which solves the minimax problem

p^{\mathrm{CNML}} = \operatorname*{argmin}_{q} \; \max_{y_n} \; \log p_{\hat{\theta}(y_{1:n}|x_{1:n})}(y_n \mid x_n) - \log q(y_n).   (5)

We note that the inner maximization is only over the next label y_n that we are predicting, rather than over the full sequence as before. This prediction strategy is now amenable to the typical supervised learning setting, where (x_{1:n-1}, y_{1:n-1}) is our training set and we want to output a predictive distribution over labels y_n for a new test input x_n.
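As a concrete illustration of Equation 4, the sketch below computes the naive CNML distribution for a small logistic regression model class: for each candidate label, the model is refit on the training set augmented with the labeled query point, the refit model's probability of that label at the query point is recorded, and the resulting scores are normalized over labels. The function name and interface are ours, purely for illustration; note also that scikit-learn's LogisticRegression applies L2 regularization, so the inner fit is strictly a MAP rather than an exact maximum likelihood estimate (a large C approximates the MLE).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def naive_cnml_predict(X_train, y_train, x_query, num_classes, C=1e4):
    """Naive CNML (Eq. 4) for a logistic regression model class.

    For each candidate label k, refit on the training set augmented with
    (x_query, k), evaluate the refit model's probability of k at x_query,
    then normalize over k. Cost: num_classes full refits per query point.
    """
    X_aug = np.vstack([X_train, x_query[None, :]])
    scores = np.zeros(num_classes)
    for k in range(num_classes):
        y_aug = np.append(y_train, k)
        # Approximate maximum likelihood fit on train + (x_query, k);
        # large C means weak L2 regularization.
        model = LogisticRegression(C=C, max_iter=5000).fit(X_aug, y_aug)
        col = list(model.classes_).index(k)
        scores[k] = model.predict_proba(x_query[None, :])[0, col]
    return scores / scores.sum()
```

This mirrors exactly why exact CNML is expensive for deep networks: every query requires one full retraining on the augmented training set per candidate label.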
CNML provides conservative predictions. Here we motivate why CNML can provide reasonable uncertainty estimates for out-of-distribution inputs. For each query point, CNML considers each potential label and finds the model that would be most consistent with that label and with the training set. If that model assigns high probability to the label, then minimizing the worst-case regret forces CNML to assign relatively high probability to it as well. Compared to simply letting a model trained only on the training set extrapolate to OOD inputs, we expect CNML to give more conservative predictions on OOD inputs, since it explicitly considers what would have happened if the new data point had been labeled with each possible label.

We show heatmaps of CNMAP predictions (the variant of CNML that uses MAP estimates in place of maximum likelihood) in Figure 3, adding different amounts of L2 regularization to the logistic regression weights. As we add more regularization, the model class becomes effectively less expressive, and the CNMAP predictions become less conservative.
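Continuing the sketch above, the effect of regularization strength on a toy problem can be probed by calling naive_cnml_predict with different values of C (in scikit-learn, smaller C means stronger L2 regularization). The synthetic clusters and query point below are invented purely for illustration; the exact probabilities will depend on the data and are not claimed to match the paper's figures.

```python
rng = np.random.default_rng(0)
# Two well-separated 2-D Gaussian clusters as a toy training set.
X_train = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)),
                     rng.normal(2.0, 1.0, size=(50, 2))])
y_train = np.array([0] * 50 + [1] * 50)
x_query = np.array([6.0, 6.0])  # far from the training data (out of distribution)

for C in [1e4, 1.0, 1e-2]:  # weak -> strong L2 regularization
    probs = naive_cnml_predict(X_train, y_train, x_query, num_classes=2, C=C)
    print(f"C={C:g}: p(y | x_query) = {probs}")
```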
