Learnable Bernoulli Dropout for Bayesian Deep Learning

Learnable Bernoulli Dropout for Bayesian Deep Learning Shahin Bolukiy Randy Ardywibowoy Siamak Zamani Dadanehy Mingyuan Zhou? Xiaoning Qiany y Texas A&M University ?The University of Texas at Austin Abstract safety-critical applications such as autonomous driving and healthcare (Ardywibowo et al., 2019). To address this, Bayesian methods attempt to principally regular- In this work, we propose learnable Bernoulli ize and estimate the prediction uncertainty of DNNs. dropout (LBD), a new model-agnostic They introduce model uncertainty by placing prior dis- dropout scheme that considers the dropout tributions on the weights and biases of the networks. rates as parameters jointly optimized with Since exact Bayesian inference is computationally in- other model parameters. By probabilistic tractable, many approximation methods have been modeling of Bernoulli dropout, our method developed such as Laplace approximation (MacKay, enables more robust prediction and uncer- 1992a), Markov chain Monte Carlo (MCMC) (Neal, tainty quantification in deep models. Es- 2012), stochastic gradient MCMC (Welling and Teh, pecially, when combined with variational 2011; Ma et al., 2015; Springenberg et al., 2016), and auto-encoders (VAEs), LBD enables flexible variational inference methods (Blei et al., 2017; Hoff- semi-implicit posterior representations, lead- man et al., 2013; Blundell et al., 2015; Graves, 2011). In ing to new semi-implicit VAE (SIVAE) mod- practice, these methods are significantly slower to train els. We solve the optimization for training compared to non-Bayesian methods for DNNs, such as with respect to the dropout parameters using calibrated regression (Kuleshov et al., 2018), deep en- Augment-REINFORCE-Merge (ARM), an un- semble methods (Lakshminarayanan et al., 2017), and biased and low-variance gradient estimator. more recent prior networks (Malinin and Gales, 2018), Our experiments on a range of tasks show which have their own limitations including training the superior performance of our approach instability (Blum et al., 2019). compared with other commonly used dropout schemes. Overall, LBD leads to improved Although dropout, a commonly used technique to al- accuracy and uncertainty estimates in im- leviate overfitting in DNNs, was initially used as a age classification and semantic segmentation. regularization technique during training (Hinton et al., Moreover, using SIVAE, we can achieve state- 2012), Gal and Ghahramani (2016b) showed that when of-the-art performance on collaborative filter- used at test time, it enables uncertainty quantifica- ing for implicit feedback on several public tion with Bayesian interpretation of the network out- datasets. puts as Monte Carlo samples of its predictive distri- arXiv:2002.05155v1 [cs.LG] 12 Feb 2020 bution. Considering the original dropout scheme as multiplying the output of each neuron by a binary mask 1 INTRODUCTION drawn from a Bernoulli distribution, several dropout variations with other distributions for random mul- Deep neural networks (DNNs) are a flexible family of tiplicative masks have been studied, including Gaus- models that usually contain millions of free parameters. sian dropout (Kingma et al., 2015; Srivastava et al., Growing concerns on overfitting of DNNs (Szegedy 2014). Among them, Bernoulli dropout and extensions et al., 2013; Nguyen et al., 2015; Zhang et al., 2016; are most commonly used in practice due to their ease Bozhinoski et al., 2019) arise especially when consider- of implementation in existing deep architectures and ing their robustness and generalizability in real-world their computation speed. Its simplicity and compu- tational tractability has made Bernoulli dropout the current most popular method to introduce uncertainty Proceedings of the 23rdInternational Conference on Artificial in DNNs. Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. PMLR: Volume 108. Copyright 2020 by the author(s). It has been shown that both the level of prediction Learnable Bernoulli Dropout for Bayesian Deep Learning accuracy and quality of uncertainty estimation are de- properly introduces learnable feature sparsity regular- pendent on the network weight configuration as well ization to the deep network, improving the performance as the dropout rate (Gal, 2016). Traditional dropout of deep architectures that rely on it. Moreover, our mechanism with fixed dropout rate may limit model formulation allows each neuron to have its own learned expressiveness or require tedious hand-tuning. Allow- dropout probability. We provide an interpretation of ing the dropout rate to be estimated along with the LBD as a more flexible variational Bayesian approxi- other network parameters increases model flexibility mation method for learning Bayesian DNNs compared and enables feature sparsity patterns to be identified. to Monte Carlo (MC) dropout. We combine this learn- An early approach to learning dropout rates overlays able dropout module with variational autoencoders a binary belief network on top of DNNs to determine (VAEs) (Kingma and Welling, 2013; Rezende et al., dropout rates (Ba and Frey, 2013). Unfortunately, 2014), which naturally leads to a flexible semi-implicit this approach does not scale well due to the significant variational inference framework with VAEs (SIVAE). model complexity increase. Our experiments show that LBD results in improved accuracy and uncertainty quantification in DNNs for Other dropout formulations instead attempt to replace image classification and semantic segmentation com- the Bernoulli dropout with a different distribution. pared with regular dropout, MC dropout (Gal and Following the variational interpretation of Gaussian Ghahramani, 2016b), Gaussian dropout (Kingma et al., dropout, Kingma et al. (2015) proposed to optimize 2015), and Concrete dropout (Gal et al., 2017). More the variance of the Gaussian distributions used for importantly, by performing optimization of the dropout the multiplicative masks. However, in practice, op- rates in SIVAE, we achieve state-of-the-art performance timization of the Gaussian variance is difficult. For in multiple different collaborative filtering benchmarks. example, the variance should be capped at 1 in order to prevent the optimization from diverging. This as- sumption limits the dropout rate to at most 0.5, and 2 METHODOLOGY is not suitable for regularizing architectures with po- tentially redundant features, which should be dropped 2.1 Learnable Bernoulli Dropout (LBD) at higher rates. Also, Hron et al. (2017) showed that N approximate Bayesian inference of Gaussian dropout is Given a training dataset D = f(xi; yi)gi=1, where x ill-posed since the improper log-uniform prior adopted and y denote the input and target of interest respec- in (Kingma et al., 2015) does not usually result in a tively, a neural network is a function f(x; θ) from the proper posterior. Recently, a relaxed Concrete (Maddi- input space to the target space with parameters θ. son et al., 2016) (Gumbell-Softmax (Jang et al., 2016)) The parameters are learned by minimizing an objective distribution has been adopted in (Gal et al., 2017) function L, which is usually comprised of an empirical to replace the Bernoulli mask for learnable dropout loss E with possibly additional regularization terms R, rate (Gal, 2016). However, the continuous relaxation by stochastic gradient descent (SGD): introduces bias to the gradients which reduces its per- M N X formance. L(θjD) ≈ E(f(x ; θ); y ) + R(θ); (1) M i i Motivated by recent efforts on gradient estimates for i=1 optimization with binary (discrete) variables (Yin and where M is the mini-batch size. Zhou, 2019; Tucker et al., 2017; Grathwohl et al., 2017), we propose a learnable Bernoulli dropout (LBD) mod- Consider a neural network with L fully connected lay- th ule for general DNNs. In LBD, the dropout probabil- ers. The j fully connected layer with Kj neurons th ities are considered as variational parameters jointly takes the output of the (j − 1) layer with Kj−1 neu- optimized with the other parameters of the model. rons as input. We denote the weight matrix connecting Kj−1×Kj We emphasize that LBD exactly optimizes the true layer j − 1 to j by Wj 2 R . Dropout takes Bernoulli distribution of regular dropout, instead of the output to each layer and multiplies it with a ran- replacing it by another distribution such as Concrete or dom variable zj ∼ p(zj) element-wise (channel-wise Gaussian. LBD accomplishes this by taking advantage for convolutional layers). The most common choice of a recent unbiased low-variance gradient estimator| for p(zj) is the Bernoulli distribution Ber(σ(αj)) with Augment-REINFORCE-Merge (ARM) (Yin and Zhou, dropout rate 1 − σ(αj), where we have reparameter- 2019)|to optimize the loss function of the deep neu- ized the dropout rate using the sigmoid function σ(·). L ral network with respect to the dropout layer. This With this notation, let α = fαjgj=1 denote the col- allows us to backpropagate through the binary random lection of all logits of the dropout parameters, and L masks and compute unbiased, low-variance gradients let z = fzjgj=1 denote the collection of all dropout with respect to the dropout parameters. This approach masks. Dropout in this form is one of the most common regularization techniques in training DNNs to avoid Shahin Bolukiy, Randy Ardywibowoy, Siamak Zamani Dadanehy overfitting and improve generalization and accuracy on Under this formulation, we propose learnable Bernoulli unseen data. This can also be considered as using a dropout (LBD) as a variational

Learnable Bernoulli Dropout for Bayesian Deep Learning

Risk Bounds for Classification and Regression Rules That Interpolate

Probabilistic Circuits: Representations, Inference, Learning and Theory

Risk Bounds for Classification and Regression Rules That Interpolate

Part I Officers in Institutions Placed Under the Supervision of the General Board

Machine Learning Conference Report

AAAI News AAAI News

Probabilistic Machine Learning: Foundations and Frontiers

Bayesian Activity Recognition Using Variational Inference

Automatic Opioid User Detection from Twitter

Deep Learning

Uncertainty in Deep Learning

Arxiv:2012.12294V2 [Stat.ML] 16 Apr 2021 1 Introduction