Being Bayesian About Categorical Probability

Being Bayesian about Categorical Probability Taejong Joo 1 Uijung Chung 1 Min-Gwan Seo 1 Abstract Bayesian NNs (BNNs; MacKay, 1992) can address the aforementioned issues of softmax. BNNs provide quan- Neural networks utilize the softmax as a building tifiable measures of uncertainty such as predictive entropy block in classification tasks, which contains an and mutual information (Gal, 2016) and enable automatic overconfidence problem and lacks an uncertainty embodiment of Occam’s razor (MacKay, 1995). However, representation ability. As a Bayesian alternative some practical obstacles have impeded the wide adoption of to the softmax, we consider a random variable BNNs. First, the intractable posterior inference in BNNs de- of a categorical probability over class labels. In mands approximate methods such as variation inference (VI; this framework, the prior distribution explicitly Graves, 2011; Blundell et al., 2015) and Monte Carlo (MC) models the presumed noise inherent in the ob- dropout (Gal & Ghahramani, 2016). Even with such novel served label, which provides consistent gains in approximation methods, concerns arise regarding both the generalization performance in multiple challeng- degree of approximation and the computational expensive ing tasks. The proposed method inherits advan- posterior inference (Wu et al., 2019a; Osawa et al., 2019). In tages of Bayesian approaches that achieve bet- addition, under extreme non-linearity between parameters ter uncertainty estimation and model calibration. and outputs in the NNs, determining a meaningful weight Our method can be implemented as a plug-and- prior distribution is challenging (Sun et al., 2019). Last but play loss function with negligible computational not least, BNNs often require considerable modifications to overhead compared to the softmax with the cross- existing baselines, or they result in performance degradation entropy loss function. (Lakshminarayanan et al., 2017). In this paper, we apply the Bayesian principle to construct 1. Introduction the target distribution for learning classifiers. Specifically, we regard a categorical probability as a random variable, Softmax (Bridle, 1990) is the de facto standard for post pro- and construct the target distribution over the categorical cessing of logits of neural networks (NNs) for classification. probability by means of the Bayesian inference, which is When combined with the maximum likelihood objective, it approximated by NNs. The resulting target distribution can enables efficient gradient computation with respect to log- be thought of as being regularized via the prior belief whose its and has achieved state-of-the-art performances on many impact is controlled by the number of observations. By benchmark datasets. However, softmax lacks the ability considering only the random variable of categorical proba- to represent the uncertainty of predictions (Blundell et al., bility, the Bayesian principle can be efficiently adopted to 2015; Gal & Ghahramani, 2016) and has poorly calibrated existing deep learning building blocks without huge modifi- behavior (Guo et al., 2017). For instance, the NN with cations. Our extensive experiments show effectiveness of softmax can easily be fooled to confidently produce wrong being Bayesian about the categorical probability in improv- outputs; when rotating digit 3, it will predict it as the digit ing generalization performances, uncertainty estimation, 8 or 4 with high confidence (Louizos & Welling, 2017). and calibration property. Another concern of the softmax is its confident predictive Our contributions can be summarized as follows: 1) we behavior makes NNs to be subject to overfitting (Xie et al., show the importance of considering categorical probabil- 2016; Pereyra et al., 2017). This issue raises the need for ity as a random variable instead of being determined by effective regularization techniques for improving general- the label; 2) we provide experimental results showing the ization performance. usefulness of the Bayesian principle in improving general- 1ESTsoft, Republic of Korea. Correspondence to: Taejong Joo ization performance of large models on standard benchmark <[email protected]>. datasets, e.g., ResNext-101 on ImageNet; 3) we enable NNs th to inherit the advantages of the Bayesian methods in better Proceedings of the 37 International Conference on Machine uncertainty representation and well-calibrated behavior with Learning, Online, PMLR 119, 2020. Copyright 2020 by the au- thor(s). a negligible increase in computational complexity. Being Bayesian about Categorical Probability Neural Neural Network Network Minimize Minimize KL-Divergence “cat” KL-Divergence One-Hot Bayes’ Rule “cat” Encoding Prior Posterior (a) Softmax cross-entropy loss (b) Belief matching framework Figure 1. Illustration of the difference between softmax cross-entropy loss and belief matching framework when each image is unique in the training set. In softmax cross-entropy loss, the label “cat” is directly transformed into the target categorical distribution. In belief matching framework, the label “cat” is combined with the prior Dirichlet distribution over the categorical probability. Then, the Bayes’ rule updates the belief about categorical probability, which produces the target distribution. 2. Preliminary x 2 A and 0 otherwise. Then, the empirical risk on D can be expressed as follows: This paper focuses on classification problems in which, (i) (i) N N given i.i.d. training samples D = x ; y i=1 2 1 X W (i) L^ (W ) = − log φ (i) (f (x )) (X × Y)N , we construct a classifier F : X!Y. Here, X D N y is an input space and Y = f1; ··· ;Kg is a set of labels. We i=1 X P cD(x) denote x and y as random variables whose unknown proba- = i i lD(W ) + C (3) N x bility distributions generate inputs and labels, respectively. x2G(D) Also, we let y~ be a one-hot representation of y. where G(D) is a set of unique values in D, e.g., Let f W : X!X 0 be a NN with parameters W where G(f1; 2; 2g) = f1; 2g, and C is a constant with respect X 0 = K D R is a logit space. In this paper, we assume to W ; lx (W ) measures the KL divergence between the W W arg maxj fj ⊆ F is the classification model where fj empirical target distribution and the categorical distribution denotes the j-th output basis of f W , and we concentrate modeled by the NN at location x, which is given by: W ((x; y) 2 D; f W ) on the problem of learning . Given , cD(x) a standard minimization loss function is the softmax cross- lD(W ) = KL PC k PC φ(f W (x)) x P cD(x) entropy loss, which applies the softmax to logit and then i i (4) computes the cross-entropy between a one-hot encoded la- Therefore, the normalized value of cD(x) becomes the esti- bel and a softmax output (Figure 1(a)). Specifically, the mator of a categorical probability of the target distribution softmax, denoted by φ : X 0 ! 4K−1, transforms a logit at location x. However, directly approximating this target f W (x) into a normalized exponential form: distribution can be problematic. This is because the estima- exp(f W (x)) tor uses single or very few samples since most of the inputs φ (f W (x)) = k (1) k P W are unique or very rare in the training set. j exp(fj (x)) , and then the cross-entropy loss can be computed by One simple heuristic to handle this problem is label smooth- W P W ing (Szegedy et al., 2016) that constructs a regularized lCE(y~; φ(f (x))) = − k y~k log φk(f (x)). Here, note that the softmax output can be viewed as a parame- target estimator, in which a one-hot encoded label y~ is λ ter of the categorical distribution, which can be denoted by relaxed by (1 − λ)y~ + K 1 with hyperparameter λ. Un- PC (φ(f W (x))). der the smoothing operation, the target estimator is regularized by a mixture of the empirical counts and the pa- We can formulate the minimization of the softmax cross- rameter of the discrete uniform distribution PU such that D entropy loss over D into a collection of distribution match- C c (x) U (1 − λ)P P cD (x) + λP . One concern is that the ing problems. To this end, let cD(x) be a vector-valued i i mixing coefficient is constant with respect to the number of function that counts label frequency at x 2 X in D, which observations, which can possibly prevent the exploitation of is defined as: the empirical counting information when it is needed. X cD(x) = y~01 (x0) (2) fxg Another more principled approach is BNNs, which pre- 0 0 (x ;y )2D vents full exploitation of the noisy estimation by bal- where 1A(x) is an indicator function that takes 1 when ancing the distance to the target distribution with model Being Bayesian about Categorical Probability complexity and maintaining the weight ensemble instead which the prior distribution acts as adding pseudo counts. of choosing a single best configuration. Specifically, in We note that the relative strength between the prior belief BNNs with the Gaussian weight prior N (0; τ −1I), the and the empirical count information becomes adaptive with score of configuration W is measured by the posterior respect to each data point. density pW(W jD) / p(DjW )pW(W ) where we have 2 log pW(W ) / −τ k W k2. Therefore, the complexity 3.2. Representing Approximate Distribution penalty term induced by the prior prevents the softmax output from exactly matching a one-hot encoded target. In Now, we specify the approximate posterior distribution mod- 2 eled by the NNs, which aims to approximate pzjx;y. In this modern deep NNs, however, k W k2 may be poor proxy for the model complexity due to extreme non-linear relation- paper, we model the approximate posterior as the Dirichlet ship between weights and outputs (Hafner et al., 2018; Sun distribution.

Being Bayesian About Categorical Probability

Hyperprior on Symmetric Dirichlet Distribution

A Dirichlet Process Mixture Model of Discrete Choice Arxiv:1801.06296V1

Part 2: Basics of Dirichlet Processes 2.1 Motivation

Maximum Likelihood Estimation of Dirichlet Distribution Parameters

A Dirichlet Process Model for Detecting Positive Selection in Protein-Coding DNA Sequences

Variability in Encoding Precision Accounts for Visual Short-Term Memory Limitations

Kernel Analysis Based on Dirichlet Processes Mixture Models

Estimating the Mutual Information Between Two Discrete, Asymmetric Variables with Limited Samples

Nonparametric Density Estimation for Learning Noise Distributions in Mobile Robotics

Dirichlet Process Gaussian Mixture Models: Choice of the Base Distribution

Robust Estimation of Risks from Small Samples

Paper: a Tutorial on Dirichlet Process Mixture Modeling