Quantifying Intrinsic Uncertainty in Classification via Deep Dirichlet Mixture Networks

Qingyang Wu 1 He Li 2 Lexin Li 3 Zhou Yu 1

Abstract

With the widespread success of deep neural networks in science and technology, it is becoming increasingly important to quantify the uncertainty of the predictions produced by deep learning. In this paper, we introduce a new method that attaches an explicit uncertainty statement to the probabilities of classification using deep neural networks. Precisely, we view the classification probabilities as sampled from an unknown distribution, and we propose to learn this distribution through a Dirichlet mixture that is flexible enough to approximate any continuous distribution on the simplex. We then construct credible intervals from the learned distribution to assess the uncertainty of the classification probabilities. Our approach is easy to implement, computationally efficient, and can be coupled with any deep neural network architecture. Our method leverages the crucial observation that, in many classification applications such as medical diagnosis, more than one class label is available for each observational unit. We demonstrate the usefulness of our approach through simulations and a real data example.

1. Introduction

Deep neural networks have been achieving remarkable success in a wide range of classification tasks in recent years. Accompanying the increasingly accurate prediction of the classification probability, it is of equal importance to quantify the uncertainty of the classification probability produced by deep neural networks. Without a careful characterization of such an uncertainty, the prediction of deep neural networks can be questionable, unusable, and in the extreme case incur considerable loss (Wang et al., 2016). For example, deep reinforcement learning suffers from a strikingly low reproducibility due to high uncertainty of the predictions (Henderson et al., 2017). Uncertainty quantification can be challenging though; for instance, Guo et al. (2017) argued that modern neural network architectures are poor at producing well-calibrated probabilities in binary classification. Recognizing such challenges, there have been recent proposals to estimate and quantify the uncertainty of the output from deep neural networks, and we review those methods in Section 1.1. Despite the progress, however, uncertainty quantification of deep neural networks remains relatively underdeveloped (Kendall & Gal, 2017).

In this paper, we propose deep Dirichlet mixture networks to produce, in addition to a point estimator of the classification probabilities, an associated credible interval (region) that covers the true probabilities at a desired level. We begin with the binary classification problem and employ the Beta mixture model to approximate the distribution of the true but random probability. We then extend to the general multi-class classification using the Dirichlet mixture model. Our key idea is to view the classification probability as a random quantity, rather than a deterministic value in [0, 1]. We seek to estimate the distribution of this random quantity using the Beta or the Dirichlet mixture, which we show is flexible enough to model any continuous distribution on [0, 1]. We achieve the estimation by adding an extra layer in a typical deep neural network architecture, without having to substantially modify the overall structure of the network. Then, based on the estimated distribution, we produce both a point estimate and a credible interval for the classification probability. This credible interval provides an explicit quantification of the classification variability, and can greatly facilitate our decision making. For instance, a point estimate of a high probability of having a disease may be regarded as lacking confidence if the corresponding credible interval is wide. By contrast, a point estimate with a narrow credible interval may be seen as a more convincing diagnosis.

The feasibility of our proposal is built upon a crucial observation that, in many classification applications such as medical diagnosis, there exist more than one class label. For instance, a patient's computed tomography image may be evaluated by two doctors, each giving a binary diagnosis of the existence of cancer. In Section 4, we illustrate with an example of diagnosis of Alzheimer's disease (AD) using patients' anatomical magnetic resonance imaging. For each patient, there is a binary diagnosis status as AD or healthy control, along with additional cognitive scores that are strongly correlated with and carry crucial information about one's AD status. We thus consider the dichotomized version of the cognitive scores, combine them with the diagnosis status, and feed them together into our deep Dirichlet mixture networks to obtain a credible interval of the classification probability. We remark that the existence of multiple labels is common rather than an exception in a variety of real world applications.

Our proposal provides a useful addition to the essential yet currently still growing inferential machinery for deep neural network learning. Our method is simple, fast, effective, and can couple with any existing deep neural network structure. In particular, it adopts a frequentist perspective, but produces a Bayesian-style outcome of credible intervals.

1 Computer Science Department, University of California, Davis. 2 Stern School of Business, New York University. 3 Department of …, University of California, Berkeley. Correspondence to: <>. arXiv:1906.04450v2 [cs.LG] 14 Aug 2019.

1.1. Related Work

There has been development of uncertainty quantification for artificial neural networks since two decades ago. Early examples include the delta method (Hwang & Ding, 1997) and the bootstrap methods (Efron & Tibshirani, 1994; Heskes, 1997; Carney et al., 1999). However, the former requires computing the Hessian matrix and is computationally expensive, whereas the latter hinges on an unbiased prediction. When the prediction is biased, the total variance is to be underestimated, which would in turn result in a narrower credible interval.

Another important line of research is Bayesian neural networks (MacKay, 1992a;b), which treat model parameters as distributions, and thus can produce an explicit uncertainty quantification in addition to a point estimate. The main drawback is the prohibitive computational cost of running MCMC algorithms. There have been some recent proposals aiming to address this issue, most notably (Gal & Ghahramani, 2016; Li & Gal, 2017), which used the dropout trick. Our proposal, however, is a frequentist solution, and thus we have chosen not to numerically compare with those Bayesian approaches.

Another widely used uncertainty quantification method is the mean variance estimation (MVE) approach (Nix & Weigend, 1994). It models the data noise using a normal distribution, and employs a neural network to output the mean and variance. The optimization is done by minimizing the negative log-likelihood. It has mainly been designed for regression tasks, and is less suitable for classification.

There are some more recent proposals for uncertainty quantification. One is the lower and upper bound estimation (LUBE) (Khosravi et al., 2011; Quan et al., 2014). LUBE has been proven successful in numerous applications. However, its loss function is non-differentiable and gradient descent cannot be applied for optimization. The quality-driven prediction interval method (QD) has recently been proposed to improve LUBE (Pearce et al., 2018). It is a distribution-free method that outputs the prediction's upper and lower bounds. The uncertainty can be estimated by measuring the distance between the two bounds. Unlike LUBE, the objective function of QD can be optimized by gradient descent. But similar to MVE, it is designed for regression tasks. The confidence network is another method to estimate confidence, by adding an output node next to the softmax probabilities (DeVries & Taylor, 2018). This method is suitable for classification. Although its original goal was out-of-distribution detection, its confidence can be used to represent the intrinsic uncertainty. Later, in Section 3.2, we numerically compare our method with MVE, QD, and the confidence network.

We also clarify that our proposed framework is different from the mixture density network (Bishop, 1994). The latter trains a neural network to model the distribution of the outcome using a mixture distribution. By contrast, we aim to learn the distribution of the classification probabilities and to quantify their variations.

2. Dirichlet Mixture Networks

In this section, we describe our proposed Dirichlet mixture networks. We begin with the case of binary classification, where the Dirichlet mixture models reduce to the simpler Beta mixture models. Although a simpler case, the binary classification is sufficient to capture all the key ingredients of our general approach and thus loses no generality. At the end of this section, we discuss the extension to the multi-class case.

2.1. Loss Function

We begin with a description of the key idea of our proposal. Let {1, 2} denote the two classes. Given an observational unit x, e.g., an image, we view the probability p_x that x belongs to class 1 as a random variable, instead of a deterministic value in [0, 1]. We then seek to estimate the probability density function f(p; x) of p_x. This function encodes the intrinsic uncertainty of the classification problem. A point estimate of the classification probability only focuses on the mean, $\int_0^1 p f(p; x)\, dp$, which is not sufficient for an informed decision making without an explicit quantification of its variability. For example, it can happen that, for two observational units x and x', their mean probabilities, and thus their point estimates of the classification probability, are the same. However, the densities are far apart from each other, leading to completely different variabilities, and different interpretations of the classification results. Figure 1 shows an illustration. Our proposal then seeks to estimate the density function f(p; x) for each x.

Figure 1. Illustration of the setting. Two or more labels are generated with the same probability p_x, which is randomly drawn from a distribution that we wish to estimate.

A difficulty arising from this estimation problem is that f in general can be any density function on [0, 1]. To address this, we propose to simplify the problem by restricting to the case where f is a Beta mixture; i.e.,

$$f(p; x) = \sum_{k=1}^{K} w^k \frac{p^{\alpha_1^k - 1} (1 - p)^{\alpha_2^k - 1}}{\mathrm{Beta}(\alpha_1^k, \alpha_2^k)}, \qquad (1)$$

where Beta(·, ·) is the Beta function, and the parameters $w^k, \alpha^k = (\alpha_1^k, \alpha_2^k)$ are smooth functions of x, k = 1, ..., K. The weights $w^k$ satisfy $w^1 + \cdots + w^K = 1$. Later we show that this Beta mixture distribution is flexible enough to adequately model almost any distribution f on [0, 1].

With the form of the density function (1) in place, our goal then turns to estimating the positive parameters $\alpha_1^k$, $\alpha_2^k$, and $w^k$. To do so, we derive the loss function that is to be minimized by deep neural networks.

We employ the negative log-likelihood function from (1) as the loss function. For the jth observational unit of the training data, j = 1, ..., n, let $x_j$ denote the input, e.g., the subject's image scan, and $y_j = (y_j^{(1)}, \ldots, y_j^{(m_j)})$ denote the vector of labels taking values from {1, 2}. Here we assume $m_j \ge 2$, reflecting that there is more than one class label for each observational unit. Write $w = (w^1, \ldots, w^K)$ and $\alpha = (\alpha^1, \ldots, \alpha^K)$. By integrating out p, the likelihood function for the observed pair $(x_j, y_j)$ is

$$L_j(w, \alpha; x_j, y_j) = \int_0^1 p^{\sum_{l=1}^{m_j} 1(y_j^{(l)} = 1)} (1 - p)^{\sum_{l=1}^{m_j} 1(y_j^{(l)} = 2)} f(p; x_j)\, dp.$$

Write $S_{ij} = \sum_{l=1}^{m_j} 1(y_j^{(l)} = i)$, where 1(·) is the indicator function, i = 1, 2, j = 1, ..., n; this term counts the number of times $x_j$ is labeled i. Plugging (1) into $L_j$, we get

$$L_j(w, \alpha; x_j, y_j) = \int_0^1 \sum_{k=1}^{K} \frac{w^k p^{\alpha_1^k - 1} (1 - p)^{\alpha_2^k - 1}}{\mathrm{Beta}(\alpha_1^k, \alpha_2^k)}\, p^{S_{1j}} (1 - p)^{S_{2j}}\, dp = \sum_{k=1}^{K} \frac{w^k}{\mathrm{Beta}(\alpha_1^k, \alpha_2^k)} \int_0^1 p^{\alpha_1^k - 1 + S_{1j}} (1 - p)^{\alpha_2^k - 1 + S_{2j}}\, dp.$$

By a basic property of Beta functions, we further get

$$L_j(w, \alpha; x_j, y_j) = \sum_{k=1}^{K} \frac{w^k\, \mathrm{Beta}(\alpha_1^k + S_{1j}, \alpha_2^k + S_{2j})}{\mathrm{Beta}(\alpha_1^k, \alpha_2^k)}.$$

Aggregating all n observational units, we obtain the full negative log-likelihood function,

$$-\ell(w, \alpha; x_1, y_1, \ldots, x_n, y_n) = -\sum_{j=1}^{n} \log \left[ \sum_{k=1}^{K} \frac{w^k\, \mathrm{Beta}(\alpha_1^k + S_{1j}, \alpha_2^k + S_{2j})}{\mathrm{Beta}(\alpha_1^k, \alpha_2^k)} \right]. \qquad (2)$$

We then propose to employ a deep neural network learner to estimate w and $\alpha$.

2.2. Credible Intervals

To train our model, we simply replace the existing loss function of a deep neural network, e.g., the cross-entropy, with the negative log-likelihood function given in (2). Therefore, we can take advantage of current deep learning frameworks such as PyTorch for automatic gradient calculation. Then we use mini-batch gradient descent to optimize the entire neural network's weights. Once the training is finished, we obtain the estimates of the parameters of the mixture distribution, $\{w, \alpha\}$.

One implementation detail to notice is that the Beta function has no closed-form derivative. To address this issue, we used the fast log-gamma algorithm, which is available in PyTorch, to obtain an approximation of the Beta function. Also, we applied the softmax function to the weights of the mixtures to ensure that $w^1 + \cdots + w^K = 1$, and took the exponential of $\alpha^1, \ldots, \alpha^K$ to ensure that these parameters remain positive as required.

Given the estimated parameters $\hat w, \hat\alpha$ from the deep mixture networks, we next construct the credible interval for explicit uncertainty quantification. For a new observation $x_0$, the estimated distribution of the classification probability $p_{x_0}$ takes the form

$$\hat f(p; x_0) = \sum_{k=1}^{K} \hat w^k(x_0)\, \frac{p^{\hat\alpha_1^k(x_0) - 1} (1 - p)^{\hat\alpha_2^k(x_0) - 1}}{\mathrm{Beta}(\hat\alpha_1^k(x_0), \hat\alpha_2^k(x_0))},$$

where we write $\hat w^k, \hat\alpha_1^k, \hat\alpha_2^k$ as explicit functions of $x_0$. The expectation of this estimated density, $\int_0^1 p \hat f(p; x_0)\, dp$, is an approximately unbiased estimator of $p_{x_0}$. Meanwhile, we can construct the two-sided credible interval of $p_{x_0}$ with the nominal level $\alpha \in (0, 1)$ as

$$\left[ \hat Q_{\alpha/2},\ \hat Q_{1 - \alpha/2} \right],$$

where $\hat Q_{\alpha/2}$ and $\hat Q_{1-\alpha/2}$ are the $\alpha/2$ and $1 - \alpha/2$ quantiles of the estimated density $\hat f(p; x_0)$. Similarly, we can construct the upper and lower credible intervals as

$$\left[ 0,\ \hat Q_{1-\alpha} \right] \quad \text{and} \quad \left[ \hat Q_{\alpha},\ 1 \right],$$

respectively, where $\hat Q_{\alpha}$ and $\hat Q_{1-\alpha}$ are the $\alpha$ and $1 - \alpha$ quantiles of the estimated density $\hat f(p; x_0)$.

Next we justify our choice of the Beta mixture for the distribution of the classification probability, by showing that any density function under certain regularity conditions can be approximated well by a Beta mixture. Specifically, denote by P the set of all probability density functions f on [0, 1] with at most countably many discontinuities that satisfy

$$\int_0^1 f(p)\, |\log f(p)|\, dp < \infty.$$

It is shown in (Robert & Rousseau, 2003) that any $f \in P$ can be approximated arbitrarily well by a sequence of Beta mixtures. That is, for any $f \in P$ and any $\epsilon > 0$, there exists a Beta mixture distribution $f_{\mathrm{Beta}}$ such that

$$D_{\mathrm{KL}}(f \,\|\, f_{\mathrm{Beta}}) \le \epsilon,$$

where $D_{\mathrm{KL}}(\cdot \,\|\, \cdot)$ denotes the Kullback-Leibler divergence. This result establishes the validity of approximating a general distribution function using a Beta mixture. The proof of this result starts by recognizing that f can be accurately approximated by piecewise constant functions on [0, 1] due to its countable number of discontinuities. Next, each constant piece is a limit of a sequence of Bernstein polynomials, which are infinite Beta mixtures with integer parameters (Verdinelli et al., 1998; Petrone & Wasserman, 2002).

2.3. Multiple-class Classification

We next extend our method to the general case of multi-class classification. It follows seamlessly from the prior development, except that now the labels $y_j = (y_j^{(1)}, \ldots, y_j^{(m_j)})$ take values from {1, 2, ..., d}, where d is the total number of classes. Given an observation x, the multinomial distribution over {1, 2, ..., d} is represented by $p = (p_1, \ldots, p_d)$, which, as a point in the simplex $\Delta = \{(c_1, \ldots, c_d) : c_i \ge 0,\ c_1 + \cdots + c_d = 1\}$, is assumed to follow a Dirichlet mixture

$$f(p; x) = \sum_{k=1}^{K} w^k \frac{1}{\mathrm{Beta}(\alpha^k)} \prod_{i=1}^{d} p_i^{\alpha_i^k - 1},$$

where the generalized Beta function takes the form

$$\mathrm{Beta}(\alpha) = \frac{\prod_{i=1}^{d} \Gamma(\alpha_i)}{\Gamma(\alpha_1 + \cdots + \alpha_d)}.$$

The likelihood of the jth observation is

$$L_j = \int_{\Delta} \left( \prod_{i=1}^{d} p_i^{S_{ij}} \right) \sum_{k=1}^{K} w^k \frac{1}{\mathrm{Beta}(\alpha^k)} \prod_{i=1}^{d} p_i^{\alpha_i^k - 1}\, dp,$$

where $S_{ij} = \sum_{l=1}^{m_j} 1(y_j^{(l)} = i)$. Accordingly, the negative log-likelihood function is

$$-\ell(w, \alpha; x_1, y_1, \ldots, x_n, y_n) = -\sum_{j=1}^{n} \log \left[ \sum_{k=1}^{K} \frac{w^k\, \mathrm{Beta}(\alpha_1^k + S_{1j}, \ldots, \alpha_d^k + S_{dj})}{\mathrm{Beta}(\alpha^k)} \right].$$

This is the loss function to be minimized in the Dirichlet mixture networks.

3. Simulations

3.1. Simulations on Coverage Proportion

We first investigate the empirical coverage of the proposed credible interval. We used the MNIST handwritten digits data, and converted the ten outcomes (0-9) first to two classes (0-4 as Class 1, and 5-9 as Class 2), then to three classes (0-2 as Class 1, 3-6 as Class 2, and 7-9 as Class 3). In order to create multiple labels for each image, we trained a LeNet-5 (LeCun et al., 1998) to output the classification probability $p_i$, then sampled multiple labels for the same input image from a binomial or multinomial distribution with $p_i$ as the parameter. We further divided the simulated data into training and testing sets. We calculated the empirical coverage as the proportion of the testing set for which the corresponding $p_i$ falls in the constructed credible interval. We assessed the coverage performance by examining how close the empirical coverage is to the nominal coverage over the interval of 75% to 95%. Ideally, the empirical coverage should be the same as the nominal level.

Figure 2 reports the simulation results for the two-class classification task, where panel (a) is when there are two labels available for each input, and panel (b) is when there are three labels available. The orange 45-degree line represents the ideal coverage. The blue line represents the empirical coverage of the credible interval produced by our method. It is seen that our constructed credible interval covers 98.19% of the truth at the 95% nominal level for the two-label scenario, and 98.17% for the three-label scenario. In general, the empirical coverage is close to or slightly larger than the nominal value, suggesting that the credible interval is reasonably accurate. Moreover, the interval becomes more accurate with more labels on each input.
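The training objective used in these simulations is the negative log-likelihood in (2). A minimal plain-Python sketch of this loss for one observational unit in the binary (Beta mixture) case is given below; the function names are our own, and the paper's actual implementation performs the same computation with PyTorch's log-gamma so that gradients flow through the network outputs:

```python
import math

def log_beta(a, b):
    # log Beta(a, b) via log-gamma, mirroring the paper's use of the
    # fast log-gamma algorithm to approximate the Beta function
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def softmax(z):
    # numerically stable softmax over a list of raw scores
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [v / s for v in exps]

def beta_mixture_nll(raw_w, raw_a1, raw_a2, s1, s2):
    """Negative log-likelihood of one observational unit with label counts
    (s1, s2) = (S_1j, S_2j), given raw network outputs for a K-component
    Beta mixture. As in Section 2.2, softmax keeps the weights on the
    simplex and exponentiation keeps the alpha parameters positive."""
    w = softmax(raw_w)
    a1 = [math.exp(v) for v in raw_a1]
    a2 = [math.exp(v) for v in raw_a2]
    # L_j = sum_k w^k * Beta(a1^k + S_1j, a2^k + S_2j) / Beta(a1^k, a2^k)
    lik = sum(
        wk * math.exp(log_beta(a + s1, b + s2) - log_beta(a, b))
        for wk, a, b in zip(w, a1, a2)
    )
    return -math.log(lik)
```

As a sanity check, with K = 1 and all raw outputs zero (so the single component is the uniform Beta(1, 1)), a single class-1 label gives likelihood Beta(2, 1)/Beta(1, 1) = 1/2 and hence a loss of log 2. The Dirichlet case of Section 2.3 replaces the two-argument Beta with the generalized Beta function over d arguments.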


Figure 2. Empirical coverage of the estimated credible interval for a two-class classification task, with the two-label setting shown in (a), and the three-label setting in (b). The blue line represents the empirical coverage of the estimated credible interval. The orange 45-degree line represents the ideal estimation. The closer the two lines, the better the estimation.
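Computing the credible intervals evaluated in this coverage study requires the quantiles of the estimated Beta mixture, which have no closed form. A simple grid-based numerical sketch follows; the component parameters at the bottom are illustrative stand-ins, not fitted values from the experiment:

```python
import math

def beta_mixture_pdf(p, w, a1, a2):
    # density of a K-component Beta mixture at p in (0, 1)
    total = 0.0
    for wk, a, b in zip(w, a1, a2):
        log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
        total += wk * math.exp((a - 1.0) * math.log(p)
                               + (b - 1.0) * math.log(1.0 - p) - log_beta)
    return total

def mixture_quantile(w, a1, a2, q, n_grid=20000):
    """Approximate the q-quantile of the mixture by accumulating its
    density over a uniform grid on (0, 1) until the CDF reaches q."""
    cdf = 0.0
    for i in range(1, n_grid):
        p = i / n_grid
        cdf += beta_mixture_pdf(p, w, a1, a2) / n_grid
        if cdf >= q:
            return p
    return 1.0

# 95% equal-tailed credible interval for an illustrative two-component mixture
w, a1, a2 = [0.5, 0.5], [4.0, 8.0], [8.0, 4.0]
lower = mixture_quantile(w, a1, a2, 0.025)
upper = mixture_quantile(w, a1, a2, 0.975)
```

The empirical coverage reported above is then simply the fraction of testing points whose true probability lands inside [lower, upper].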

Figure 3 reports the simulation results for the three-class classification task, where panels (a) and (b) are when there are two labels available, and panels (c) and (d) are when there are three labels available. A similar qualitative pattern is observed in Figure 3 as in Figure 2, indicating that our method works well for the three-class classification problem.

Figure 3. Empirical coverage of the estimated credible interval for a three-class classification task, with the two-label setting shown in (a) and (b), and the three-label setting in (c) and (d). The blue line represents the empirical coverage of the estimated credible interval. The orange 45-degree line represents the ideal coverage. The closer the two lines, the better the estimation. For each graph, the probability is calculated in the one-vs-all fashion; e.g., (a) represents the credible interval of Class 1 versus Classes 2 and 3 combined.

3.2. Comparison with Alternative Methods

We next compare our method with three alternatives that serve as the baselines: the confidence network (DeVries & Taylor, 2018), the mean variance estimation (MVE) (Nix & Weigend, 1994), and the quality-driven prediction interval method (QD) (Pearce et al., 2018). We have chosen these methods as baselines, as they also target the intrinsic variability and represent the most recent state-of-the-art solutions to this problem.

To facilitate graphical presentation of the results, we simulated the input data x from two-dimensional Gaussian mixtures. Specifically, we first sampled x from a mixture of two Gaussians with means at (−2, 2) and (2, −2), and denote its probability density function by ψ1. We then sampled x from another mixture of two Gaussians with means at (2, 2) and (−2, −2), and denote its probability density function by ψ2. For each Gaussian component, the variance is set at 0.7. We then sampled the probability p of belonging to Class 1 from a Beta distribution with the parameters ψ1/ψ2 + 1 and ψ2/ψ1 + 1. Finally, we sampled the class labels from a Bernoulli distribution with the probability of success p. At each input sample x, we sampled two class labels. For a fair comparison, we duplicated the data for the baseline methods that only use one class label. Figure 4(c) shows a scatter plot of 1,000 samples, each with two labels. The green dots correspond to the samples whose class labels are 0 in both replications, the red dots are 1 in both replications, and the yellow dots are those samples whose class labels are different in the two replications. Most of the yellow dots are located along the two axes that separate the four quadrants.

Figure 5 reports the contour of the estimated variance. Panel (a) is the true variance contour for the simulated data, obtained numerically from the data generation. It shows that the largest variance occurs along the two axes that separate the four quadrants. Panel (b) is the result of our approach. We used ten mixture components here. The predicted mean and variance were calculated using the laws of total expectation and total variance. Our method achieved a 98.4% classification accuracy. More importantly, it successfully captured the variability of the classification probability and produced a variance contour that looks similar to (a). Panel (c) is the result of the mean variance estimation (Nix & Weigend, 1994). It also achieved a 98.4% classification accuracy, but it failed to correctly characterize the variability. This is partly because it models the variability as Gaussian. Panel (d) is the result of the quality-driven prediction interval method (Pearce et al., 2018). It only obtained an 89.1% classification accuracy. As a distribution-free method, it predicted a higher variability in the center, but ignored other highly variable regions. Panel (e) is the result of the confidence network (DeVries & Taylor, 2018). It achieved a 98.1% classification accuracy and a reasonably good variability estimation.
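The predicted mean and variance of a Beta mixture, of the kind used for the variance contours above, follow in closed form from the laws of total expectation and total variance. A minimal sketch (the function name is ours):

```python
def beta_mixture_mean_var(w, a1, a2):
    """Mean and variance of a Beta mixture, computed with the laws of
    total expectation and total variance."""
    means = [a / (a + b) for a, b in zip(a1, a2)]
    variances = [a * b / ((a + b) ** 2 * (a + b + 1.0))
                 for a, b in zip(a1, a2)]
    mean = sum(wk * mk for wk, mk in zip(w, means))
    # Var = E[Var(p | component)] + Var(E[p | component])
    var = sum(wk * (vk + (mk - mean) ** 2)
              for wk, vk, mk in zip(w, variances, means))
    return mean, var
```

For a single Beta(2, 2) component this gives mean 0.5 and variance 0.05; for two well-separated components the between-component term inflates the variance, which is exactly the kind of heightened variability the contour plots visualize.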

Overall, our method achieved the best performance while maintaining a high classification accuracy. Figure 6 shows the density functions of the outputted distributions. At point (0,0), the density indeed has a higher variance.

Figure 4. Data are generated from a Bernoulli distribution whose parameter is sampled from a Beta distribution with parameters (f1, f2), where f1 = ψ1/ψ2 + 1 and f2 = ψ2/ψ1 + 1. (a) and (b) show the 3D landscapes of f1 and f2. (c) shows a scatter plot of 1,000 samples from this distribution with two labels for each data point. Green means all labels are 1. Red means all labels are 2. Yellow means that the labels are a mix of 1 and 2.

Figure 5. Variance contour plots of our approach and the baselines: (a) ideal, (b) our approach, (c) MVE, (d) QD, (e) confidence network. Blue means low data-noise, and yellow means high data-noise. From the results, (b) our approach looks most similar to the ideal.

4. Real Data Analysis

4.1. Data Description

We illustrate our proposed method on a medical imaging diagnosis application. We remark that, although the example dataset is small in size, with only thousands of image scans, our method is equally applicable to both small and large datasets.

Alzheimer's disease (AD) is the leading form of dementia in elderly subjects, and is characterized by progressive and irreversible impairment of cognitive and memory functions. With the aging of the worldwide population, it has become an international imperative to understand, diagnose, and treat this disorder. The goal of the analysis is to diagnose patients with AD based on their anatomical magnetic resonance imaging (MRI) scans. Being able to provide an explicit uncertainty quantification for this classification task, which is potentially challenging and of high risk, is especially meaningful. The dataset we analyzed was obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI). For each patient, in addition to his or her diagnosis status as AD or normal control, two cognitive scores were also recorded. One is the Mini-Mental State Examination (MMSE) score, which examines orientation to time and place, immediate and delayed recall of three words, attention and calculation, language, and vision-constructional functions. The other is the Global Clinical Dementia Rating (CDR-global) score, which is a combination of assessments in six domains, including memory, orientation, judgment and problem solving, community affairs, home and hobbies, and personal care. Although MMSE and CDR-global are not used directly for diagnosis, their values are strongly correlated with and carry crucial information about one's AD status. Therefore, we took the dichotomized cognitive scores, and used them as labels in addition to the diagnosis status.
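The multi-label construction just described can be sketched as follows. The dichotomization cutoffs are the ones reported in this section (CDR-global: 0 for NC, 0.5 for MCI, 1 for AD; MMSE: 28-30 NC, 24-27 MCI, 0-23 AD); the binary coding (1 for AD, 0 for NC+MCI) and the function name are our own choices for illustration:

```python
def labels_for_patient(diagnosis, cdr_global, mmse):
    """Build the three binary labels fed to the mixture network for one
    ADNI subject, dichotomizing the two cognitive scores with the
    cutoffs described in Section 4.1 (1 = AD, 0 = NC+MCI)."""
    y_diagnosis = 1 if diagnosis == "AD" else 0   # doctor's assessment
    y_cdr = 1 if cdr_global >= 1.0 else 0         # 0 NC, 0.5 MCI, 1 AD
    y_mmse = 1 if mmse <= 23 else 0               # 28-30 NC, 24-27 MCI, 0-23 AD
    return [y_diagnosis, y_cdr, y_mmse]
```

A subject sitting near the MMSE boundary of 23 can receive conflicting labels from the three assessments, and the network then reflects that disagreement as a wider credible interval, as seen for Subject 4 in the results below.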

Figure 6. Beta mixture density functions outputted by the neural network. (a) is the result at point (0,0). (b) is the result at point (1,1). Point (0,0) clearly has a higher variance.

Figure 7. Architecture of the neural network used in the real data experiment.

Table 1. Number of patients in each class under the three assessments.

          Diagnosis   CDR-global   MMSE
  NC          500          664      785
  MCI         822          830      570
  AD          338          166      305
  Total      1660         1660     1660

We used the ADNI 1-Year 1.5T dataset, with 1,660 images in total. We resized all the images to the dimension 96 × 96 × 80. The diagnosis contains three classes: normal control (NC), mild cognitive impairment (MCI), and Alzheimer's disease (AD). Among them, MCI is a prodromal stage of AD. Since the main motivation is to identify patients with AD, we combined NC and MCI as one class, referred to as NC+MCI, and AD as the other class, and formulated the problem as a binary classification task. We used three types of assessments to obtain the three classification labels: the doctor's diagnostic assessment, the CDR-global score, and the MMSE score. For the CDR-global score, we used 0 for NC, 0.5 for MCI, and 1 for AD. For the MMSE score, we used 28-30 as NC, 24-27 as MCI, and 0-23 as AD. Table 1 summarizes the number of patients in each class with respect to the three different assessments.

4.2. Classifier and Results

Figure 7 describes the architecture of our neural network based classifier. We used two consecutive 3D convolutional filters followed by max pooling layers. The input is a 96 × 96 × 80 image. The first convolutional kernel size is 5 × 5 × 5, and the max pooling kernel is 5 × 5 × 5. The second convolutional kernel is 3 × 3 × 3, and the following max pooling kernel is 2 × 2 × 2. We chose sixteen as the batch size, and 1e−6 as the learning rate. We chose a K = 3-component Beta mixture.

We randomly selected 90% of the data for training and the remaining 10% for testing. We plotted the credible intervals of all the 166 testing subjects with respect to their predicted probability of having AD in Figure 8(a). We then separated the testing data into three groups: the subjects with their assessments unanimously labeled as NC+MCI (green dots), the subjects with their assessments unanimously labeled as AD (red dots), and the subjects with their assessments a mix of NC+MCI and AD (blue dots, referred to as MIX).

We observe that, for the patients in the NC+MCI category, 95% of them were estimated to have a smaller than 0.1 probability of being AD, and a tight credible interval with width smaller than 0.15. We further randomly selected five patients in the NC+MCI category and plotted their credible intervals in Figure 8(b). Each has a close to 0 probability of having AD, and each with a tight credible interval. For patients in the AD category, most exhibit the same pattern of having a tight credible interval, with a few potential outliers.

For the patients in the MIX category, we randomly selected five patients and plotted their predicted classification probabilities with the associated credible intervals in Figure 8(c). We see that Subject 4 was classified as AD with only 0.45 probability but has a large credible interval of width 0.3. We took a closer look at this subject, and found that the wide interval may be due to inaccurate labeling. The threshold value we applied to dichotomize the MMSE score was 23, in that a subject with the MMSE below or equal to 23 is classified as AD. Subject 4 happens to be on the boundary line of 23. This explains why the classifier produced a wide credible interval. In Figure 8(a), we also observe that the classifier is less confident in classifying the patients in the MIX category, in that almost all the blue dots have credible intervals wider than 0.15. We again randomly selected five patients in the MIX category and plotted their predicted classification probabilities with the corresponding credible intervals in Figure 8(d). Compared to Figure 8(b), the credible intervals for patients in the MIX category are much wider than those in the unanimous NC+MCI category.

Figure 8. Credible intervals constructed in the real data experiment. (a) Credible intervals for the 166 testing subjects. (b) Patients with all NC+MCI labels. (c) Patients with all AD labels. (d) Patients with mixed AD and NC+MCI labels.

5. Conclusion

We present a new approach, deep Dirichlet mixture networks, to explicitly quantify the uncertainty of the classification probability produced by deep neural networks. Our approach, simple but effective, takes advantage of the availability of multiple class labels for the same input sample, which is common in numerous scientific applications. It provides a useful addition to the inferential machinery for deep neural network based learning.

There remain several open questions for future investigation. Methodologically, we currently assume that the multiple class labels for each observational sample are of the same quality. In practice, different sources of information may have different levels of accuracy. It is warranted to investigate how to take this into account in our approach. Theoretically, Petrone and Wasserman (Petrone & Wasserman, 2002) obtained the convergence rate of the Bernstein polynomials. Our Dirichlet mixture distribution should at least have a comparable convergence rate. This rate can guide us theoretically on how many components we need in the mixture. We leave these problems as our future research.

References

Bishop, C. M. Mixture density networks. 1994.

Carney, J. G., Cunningham, P., and Bhagwan, U. Confidence and prediction intervals for neural network ensembles. In Neural Networks, 1999. IJCNN'99. International Joint Conference on, volume 2, pp. 1215–1218. IEEE, 1999.

DeVries, T. and Taylor, G. W. Learning confidence for out-of-distribution detection in neural networks. arXiv preprint arXiv:1802.04865, 2018.

Efron, B. and Tibshirani, R. J. An introduction to the bootstrap. CRC Press, 1994.

Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059, 2016.

Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. arXiv preprint arXiv:1706.04599, 2017.

Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. Deep reinforcement learning that matters. arXiv preprint arXiv:1709.06560, 2017.

Heskes, T. Practical confidence and prediction intervals. In Advances in Neural Information Processing Systems, pp. 176–182, 1997.

Hwang, J. G. and Ding, A. A. Prediction intervals for artificial neural networks. Journal of the American Statistical Association, 92(438):748–757, 1997.

Kendall, A. and Gal, Y. What uncertainties do we need in Bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems, pp. 5574–5584, 2017.

Khosravi, A., Nahavandi, S., Creighton, D., and Atiya, A. F. Lower upper bound estimation method for construction of neural network-based prediction intervals. IEEE Transactions on Neural Networks, 22(3):337–346, 2011.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Li, Y. and Gal, Y. Dropout inference in Bayesian neural networks with alpha-divergences. arXiv preprint arXiv:1703.02914, 2017.

MacKay, D. J. The evidence framework applied to classification networks. Neural Computation, 4(5):720–736, 1992a.

MacKay, D. J. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992b.

Nix, D. A. and Weigend, A. S. Estimating the mean and variance of the target probability distribution. In Neural Networks, 1994. IEEE World Congress on Computational Intelligence., 1994 IEEE International Conference On, volume 1, pp. 55–60. IEEE, 1994.

Pearce, T., Zaki, M., Brintrup, A., and Neely, A. High-quality prediction intervals for deep learning: A distribution-free, ensembled approach. arXiv preprint arXiv:1802.07167, 2018.

Petrone, S. and Wasserman, L. Consistency of Bernstein polynomial posteriors. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(1):79–100, 2002.

Quan, H., Srinivasan, D., and Khosravi, A. Short-term load and wind power forecasting using neural network-based prediction intervals. IEEE Transactions on Neural Networks and Learning Systems, 25(2):303–315, 2014.

Robert, C. and Rousseau, J. A mixture approach to Bayesian goodness of fit. 2003.

Verdinelli, I. and Wasserman, L. Bayesian goodness-of-fit testing using infinite-dimensional exponential families. The Annals of Statistics, 26(4):1215–1241, 1998.

Wang, D., Khosla, A., Gargeya, R., Irshad, H., and Beck, A. H. Deep learning for identifying metastatic breast cancer. arXiv preprint arXiv:1606.05718, 2016.