
Challenges in Markov chain Monte Carlo for Bayesian neural networks

Theodore Papamarkou, Jacob Hinkle, M. Todd Young and David Womble

Abstract. Markov chain Monte Carlo (MCMC) methods have not been broadly adopted in Bayesian neural networks (BNNs). This paper first reviews the main challenges in sampling from the parameter posterior of a neural network via MCMC. Such challenges culminate in lack of convergence to the parameter posterior. Nevertheless, this paper shows that a non-converged Markov chain, generated via MCMC sampling from the parameter space of a neural network, can yield via Bayesian marginalization a valuable posterior predictive distribution of the output of the neural network. Classification examples based on multilayer perceptrons showcase highly accurate posterior predictive distributions. The postulate of limited scope for MCMC developments in BNNs is partially valid; an asymptotically exact parameter posterior seems less plausible, yet an accurate posterior predictive distribution is a tenable research avenue.

Key words and phrases: Bayesian neural networks, convergence diagnostics, Markov chain Monte Carlo, posterior predictive distribution.

Department of Mathematics, The University of Manchester, Manchester, UK, and Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA

arXiv:1910.06539v5 [stat.ML] 11 Aug 2021

1. MOTIVATION

The universal approximation theorem (Cybenko, 1989) and its subsequent extensions (Hornik, 1991; Lu et al., 2017) state that feedforward neural networks with exponentially large width and width-bounded deep neural networks can approximate any continuous function arbitrarily well. This universal approximation capacity of neural networks, along with available computing power, explains the widespread use of deep learning nowadays.

Bayesian inference for neural networks is typically performed via stochastic Bayesian optimization, stochastic variational inference (Polson and Sokolov, 2017) or ensemble methods (Ashukha et al., 2020; Wilson and Izmailov, 2020). MCMC methods have been explored in the context of neural networks, but have not become part of the Bayesian deep learning toolbox.

The slower evolution of MCMC methods for neural networks is partly attributed to the lack of scalability of existing MCMC algorithms for big data and for high-dimensional parameter spaces. Furthermore, additional factors hinder the adaptation of existing MCMC methods in deep learning, including the hierarchical structure of neural networks and the associated covariance between parameters, lack of identifiability arising from weight symmetries, lack of a priori knowledge about the parameter space, and ultimately lack of convergence.

The purpose of this paper is twofold. Initially, a literature review is conducted to identify inferential challenges in MCMC developments for neural networks. Subsequently, Bayesian marginalization based on MCMC samples of neural network parameters is used for attaining accurate posterior predictive distributions of the respective neural network output, despite the lack of convergence of the MCMC samples to the parameter posterior.

An outline of the paper layout follows. Section 2 reviews the inferential challenges arising from the application of MCMC to neural networks. Section 3 provides an overview of the employed inferential framework, including the multilayer perceptron (MLP) model and its likelihood for binary and multiclass classification, the MCMC algorithms for sampling from MLP parameters, the multivariate MCMC diagnostics for assessing convergence and sampling effectiveness, and the Bayesian marginalization for attaining posterior predictive distributions of MLP outputs. Section 4 showcases Bayesian parameter estimation via MCMC and Bayesian predictions via marginalization by fitting different MLPs to four datasets. Section 5 posits predictive inference for neural networks, among other things by combining Bayesian marginalization with approximate MCMC sampling or with ensemble training.

2. PARAMETER INFERENCE CHALLENGES

A literature review of inferential challenges in the application of MCMC methods to neural networks is conducted in this section thematically, with each subsection being focused on a different challenge.

2.1 Computational cost

Existing MCMC algorithms do not scale with increasing number of parameters or of data points. For this reason, approximate inference methods, including variational inference (VI), are preferred in high-dimensional parameter spaces or in big data problems from a time complexity standpoint (MacKay, 1995; Blei, Kucukelbir and McAuliffe, 2017; Blier and Ollivier, 2018). On the other hand, MCMC methods are better than VI in terms of approximating the log-likelihood (Dupuy and Bach, 2017).

Literature on MCMC methods for neural networks is limited due to associated computational complexity implications. Sequential Monte Carlo and reversible jump MCMC have been applied to two types of neural network architectures, namely MLPs and radial basis function networks (RBFs), see for instance Andrieu, de Freitas and Doucet (1999); de Freitas (1999); Andrieu, de Freitas and Doucet (2000); de Freitas et al. (2001). For a review of Bayesian approaches to neural networks, see Titterington (2004).

Many research developments have been made to scale MCMC algorithms to big data. The main focus has been on designing Metropolis-Hastings or Gibbs sampling variants that evaluate a costly log-likelihood on a subset (minibatch) of the data rather than on the entire data set (Welling and Teh, 2011; Chen, Fox and Guestrin, 2014; Ma, Foti and Fox, 2017; Mandt, Hoffman and Blei, 2017; De Sa, Chen and Wong, 2018; Nemeth and Sherlock, 2018; Robert et al., 2018; Seita et al., 2018; Quiroz et al., 2019).

Among minibatch MCMC algorithms for big data applications, there exists a subset of studies applying such algorithms to neural networks (Chen, Fox and Guestrin, 2014; Gu, Ghahramani and Turner, 2015; Gong, Li and Hernández-Lobato, 2019). Minibatch MCMC approaches to neural networks pave the way towards data-parallel deep learning. On the other hand, to the best of the authors' knowledge, there is no published research on MCMC methods that evaluate the log-likelihood on a subset of neural network parameters rather than on the whole set of parameters, and therefore no reported research on model-parallel deep learning via MCMC.

Minibatch MCMC has been studied analytically by Johndrow, Pillai and Smith (2020). Their theoretical findings point out that some minibatching schemes can yield inexact approximations and that minibatch MCMC cannot greatly expedite the rate of convergence.

2.2 Model structure

A neural network with $\rho$ layers can be viewed as a hierarchical model with $\rho$ levels, each network layer representing a level (Williams, 2000). Due to its nested layers and its non-linear activations, a neural network is a non-linear hierarchical model.

MCMC methods for non-linear hierarchical models have been developed, see for example Bennett, Racine-Poon and Wakefield (1996); Gilks and Roberts (1996); Daniels and Kass (1998); Sargent, Hodges and Carlin (2000). However, existing MCMC methods for non-linear hierarchical models have not harnessed neural networks due to time complexity and convergence implications.

Although not designed to mirror the hierarchical structure of a neural network, recent hierarchical VI (Ranganath, Tran and Blei, 2016; Esmaeili et al., 2019; Huang et al., 2019; Titsias and Ruiz, 2019) provides more general variational approximations of the parameter posterior of the neural network than mean-field VI. Introducing a hierarchical structure in the variational distribution induces correlation among parameters, in contrast to the mean-field variational distribution that assumes independent parameters. So, one of the Bayesian inference strategies for neural networks is to approximate the covariance structure among network parameters. In fact, there are published comparisons between MCMC and VI in terms of speed and accuracy of convergence to the posterior covariance, both for linear or mixture models (Giordano, Broderick and Jordan, 2015; Mandt, Hoffman and Blei, 2017; Ong, Nott and Smith, 2018) and for neural networks (Zhang et al., 2018a).

2.3 Weight symmetries

The output of a feedforward neural network given some fixed input remains unchanged under a set of transformations determined by the choice of activations and by the network architecture more generally. For instance, certain weight permutations and sign flips in MLPs with hyperbolic tangent activations leave the output unchanged (Chen, Lu and Hecht-Nielsen, 1993).

If a parameter transformation leaves the output of a neural network unchanged given some fixed input, then the likelihood is invariant under the transformation. In other words, transformations, such as weight permutations and sign-flips, render neural networks non-identifiable (Pourzanjani, Jiang and Petzold, 2017).

It is known that the set of linear invertible parameter transformations that leaves the output unchanged is a subgroup $T$ of the group of invertible linear mappings from the parameter space $\mathbb{R}^n$ to itself (Hecht-Nielsen, 1990). $T$ is a transformation group acting on the parameter space $\mathbb{R}^n$. It can be shown that for each permutable feedforward neural network, there exists a cone $H \subset \mathbb{R}^n$ dependent only on the network architecture such that for any parameter $\theta \in \mathbb{R}^n$ there exist $\eta \in H$ and $\tau \in T$ such that $\tau\eta = \theta$. This relation means that every network parameter is equivalent to a parameter in the proper subset $H$ of $\mathbb{R}^n$ (Hecht-Nielsen, 1990). Neural networks with convolutions, max-pooling and batch-normalization contain more types of weight symmetries than MLPs (Badrinarayanan, Mishra and Cipolla, 2015).

In practice, the parameter space of a neural network is set to be the whole of $\mathbb{R}^n$ rather than a cone $H$ of $\mathbb{R}^n$. Since a neural network likelihood with support in the non-reduced parameter space of $\mathbb{R}^n$ is invariant under weight permutations, sign-flips or other transformations, the posterior landscape includes multiple equally likely modes. This implies low acceptance rate, entrapment in local modes and convergence challenges for MCMC. Additionally, computational time is wasted during MCMC, since posterior modes represent equivalent solutions (Nalisnick, 2018). Such challenges manifest themselves in the MLP examples of section 4. For neural networks with higher number $n$ of parameters in $\mathbb{R}^n$, the topology of the likelihood is characterized by local optima embedded in high-dimensional flat plateaus (Brea et al., 2019). Thereby, larger neural networks lead to a multimodal target density with symmetric modes for MCMC.

Seeking parameter symmetries in neural networks can lead to a variety of NP-hard problems (Ensign et al., 2017). Moreover, symmetries in neural networks pose identifiability and associated inferential challenges in Bayesian inference, but they also provide opportunities to develop inferential methods with reduced computational cost (Hu, Zagoruyko and Komodakis, 2019) or with improved predictive performance (Moore, 2016). Empirical evidence from stochastic optimization simulations suggests that removing weight symmetries has a negative effect on prediction accuracy in smaller and shallower convolutional neural networks (CNNs), but has no effect on prediction accuracy in larger and deeper CNNs (Maddison et al., 2015).

Imposing constraints on neural network weights is one way of removing symmetries, leading to better mixing for MCMC (Sen, Papamarkou and Dunson, 2020). More generally, exploitation of weight symmetries provides scope for scalable Bayesian inference in deep learning by reducing the measure or dimension of parameter space. Bayesian inference in subspaces of parameter space for deep learning has been proposed before (Izmailov et al., 2020).

Lack of identifiability is not unique to neural networks. For instance, the likelihood of mixture models is invariant under relabelling of the mixture components, a condition known as the label switching problem (Stephens, 2000).

The high-dimensional parameter space of neural networks is another source of non-identifiability. A necessary condition for identifiability is that the number of data points must be larger than the number of parameters. This is one reason why big datasets are required for training neural networks.

2.4 Prior specification

Parameter priors have been used for generating Bayesian smoothing or regularization effects. For instance, de Freitas (1999) develops sequential Monte Carlo methods with smoothing priors for MLPs and Williams (1995) introduces Bayesian regularization and pruning for neural networks via a Laplace prior.

When parameter prior specification for a neural network is not driven by smoothing or regularization, the question becomes how to choose the prior. The choice of parameter prior for a neural network is crucial in that it affects the parameter posterior (Lee, 2004), and consequently the posterior predictive distribution (Lee, 2005).

Neural networks are commonly applied to big data. For large amounts of data, practitioners may not have intuition about the relationship between input and output variables. Furthermore, it is an open research question how to interpret neural network weights and biases. As a priori knowledge about big datasets and about neural network parameters is typically not available, prior elicitation from experts is not applicable to neural networks.

It seems logical to choose a prior that reflects a priori ignorance about the parameters. A constant-valued prior is a possible candidate, with the caveat of being improper for unbounded parameter spaces, such as $\mathbb{R}^n$. However, for neural networks, an improper prior can result in an improper parameter posterior (Lee, 2005).

Typically, a truncated flat prior for neural networks is sufficient for ensuring a valid parameter posterior (Lee, 2005). At the same time, the choice of truncation bounds depends on weight symmetry and consequently on the allocation of equivalent points in the parameter space. Lee (2003) proposes a restricted flat prior for feedforward neural networks by bounding some of the parameters and by imposing constraints that guarantee layer-wise linear independence between activations, while Lee (2000) shows that this prior is asymptotically consistent for the posterior. Moreover, Lee (2003) demonstrates that such a restricted flat prior enables more effective MCMC sampling in comparison to alternative prior choices.

Objective prior specification is an area of statistics that has not infiltrated Bayesian inference for neural networks. Alternative ideas for constructing objective priors with minimal effect on posterior inference exist in the statistics literature. For example, Jeffreys priors are invariant to differentiable one-to-one transformations of the parameters (Jeffreys, 1962), maximum entropy priors maximize the Shannon entropy and therefore provide the least possible information (Jaynes, 1968), reference priors maximize the expected Kullback-Leibler divergence from the associated posteriors and in that sense are the least informative priors (Bernardo, 1979), and penalised complexity priors penalise the complexity induced by deviating from a simpler base model (Simpson et al., 2017).

To the best of the authors' knowledge, there are only two published lines of research on objective priors for neural networks: a theoretical derivation of Jeffreys and reference priors for feedforward neural networks by Lee (2007), and an approximation of reference priors via Monte Carlo sampling of a differentiable non-centered parameterization of MLPs and CNNs by Nalisnick (2018).

More broadly, research on prior specification for BNNs has been published recently (Pearce et al., 2019; Vladimirova et al., 2019). For a more thorough review of prior specification for BNNs, see Lee (2005).

2.5 Convergence

MCMC convergence depends on the target density, namely on its multi-modality and level of smoothness. An MLP with fewer than a hundred parameters fitted to a non-linearly separable dataset makes convergence in fixed MCMC sampling time challenging (see subsection 4.3).

Attaining MCMC convergence is not the only challenge. Assessing whether a finite sample from an MCMC algorithm represents an underlying target density cannot be done with certainty (Cowles and Carlin, 1996). MCMC diagnostics can fail to detect the type of convergence failure they were designed to identify. Combinations of diagnostics are thus used in practice to evaluate MCMC convergence with reduced risk of false diagnosis.

MCMC diagnostics were initially designed for asymptotically exact MCMC. Research activity on approximate MCMC has emerged recently. Minibatch MCMC methods (see subsection 2.1) are one class of approximate MCMC methods. Alternative approximate MCMC techniques without minibatching have been developed (Rudolf and Schweizer, 2018; Chen et al., 2019) along with new approaches to quantify convergence (Chwialkowski, Strathmann and Gretton, 2016).

Quantization and discrepancy are two notions pertinent to approximate MCMC methods. The quantization of a target density $p$ by an empirical measure $\hat{p}$ provides an approximation to the target $p$ (Graf and Luschgy, 2007), while the notion of discrepancy quantifies how well the empirical measure $\hat{p}$ approximates the target $p$ (Chen et al., 2019). The kernel Stein discrepancy (KSD) and the maximum mean discrepancy (MMD) constitute two instances of discrepancy; for more details, see Chen et al. (2019) and Gretton et al. (2012), respectively. Rudolf and Schweizer (2018) provide an alternative way of assessing the quality of approximation of a target density $p$ by an empirical measure $\hat{p}$ in the context of approximate MCMC using the notion of Wasserstein distance between $p$ and $\hat{p}$.

3. INFERENTIAL FRAMEWORK OVERVIEW

An overview of the inferential framework used in this paper follows, including the MLP model and its likelihood for classification, MCMC samplers for parameter estimation, MCMC diagnostics for assessing convergence and sampling effectiveness, and Bayesian marginalization for prediction.

3.1 The MLP model

MLPs have been chosen as a more tractable class of neural networks. CNNs are the most widely used deep learning models. However, even small CNNs, such as AlexNet (Krizhevsky, Sutskever and Hinton, 2012), SqueezeNet (Iandola et al., 2016), Xception (Chollet, 2017), MobileNet (Howard et al., 2017), ShuffleNet (Zhang et al., 2018b), EffNet (Freeman, Roese-Koerner and Kummert, 2018) or DCTI (Truong, Nguyen and Tran, 2018), have at least two orders of magnitude more parameters, thus amplifying issues of computational complexity, model structure, weight symmetry, prior specification, posterior shape, MCMC convergence and sampling effectiveness.

3.1.1 Model definition. An MLP is a feedforward neural network consisting of an input layer, one or more hidden layers and an output layer (Rosenblatt, 1958; Minsky and Papert, 1988; Hastie, Tibshirani and Friedman, 2016). Let $\rho \ge 2$ be a natural number. Consider an index $j \in \{0, 1, \ldots, \rho\}$ indicating the layer, where $j = 0$ refers to the input layer, $j = 1, 2, \ldots, \rho - 1$ to one of the $\rho - 1$ hidden layers and $j = \rho$ to the output layer. Let $\kappa_j$ be the number of neurons in layer $j$ and use $\kappa_{0:\rho} = (\kappa_0, \kappa_1, \ldots, \kappa_\rho)$ as a shorthand for the sequence of neuron counts per layer. Under such notation, MLP($\kappa_{0:\rho}$) refers to an MLP with $\rho - 1$ hidden layers and $\kappa_j$ neurons at layer $j$.

An MLP($\kappa_{0:\rho}$) with $\rho - 1 \ge 1$ hidden layers and $\kappa_j$ neurons at layer $j$ is defined recursively as

$$g_j(x_i, \theta_{1:j}) = W_j h_{j-1}(x_i, \theta_{1:j-1}) + b_j, \tag{3.1}$$

$$h_j(x_i, \theta_{1:j}) = \phi_j(g_j(x_i, \theta_{1:j})), \tag{3.2}$$

for $j = 1, 2, \ldots, \rho$. A data point $x_i \in \mathbb{R}^{\kappa_0}$ corresponds to the input layer $h_0(x_i) = x_i$, yielding the sequence $g_1(x_i, \theta_1) = W_1 x_i + b_1$ in the first hidden layer. $W_j$ and $b_j$ are the respective weights and biases at layer $j = 1, 2, \ldots, \rho$, which constitute the parameters $\theta_j = (W_j, b_j)$ at layer $j$. The shorthand $\theta_{1:j} = (\theta_1, \theta_2, \ldots, \theta_j)$ denotes all weights and biases up to layer $j$. Functions $\phi_j$, known as activations, are applied elementwise to their input $g_j$.

The default recommendation of activation in neural networks is a rectified linear unit (ReLU), see for instance Jarrett et al. (2009); Nair and Hinton (2009); Goodfellow, Bengio and Courville (2016). Other activations are the ELU, leaky ReLU, tanh and sigmoid (Nwankpa et al., 2018). If an activation is not present at layer $j$, then the identity function $\phi_j(g_j) = g_j$ is used as $\phi_j$ in (3.2).

The weight matrix $W_j$ in (3.1) has $\kappa_j$ rows and $\kappa_{j-1}$ columns, while the vector $b_j$ of biases has length $\kappa_j$. Concatenating all $\theta_j$ across hidden and output layers gives a parameter vector $\theta = \theta_{1:\rho} \in \mathbb{R}^n$ of length $n = \sum_{j=1}^{\rho} \kappa_j(\kappa_{j-1} + 1)$. To define $\theta$ uniquely, the convention to traverse weight matrix elements row-wise is made. Evidently, each of $g_j$ in (3.1) and $h_j$ in (3.2) has length $\kappa_j$.
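The recursion (3.1)-(3.2) and the parameter count $n$ can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in the paper. The function names (`num_params`, `unpack`, `forward`), the flat layout of $\theta$ (each $W_j$ read row-wise and followed by $b_j$, layer by layer), and the choice of tanh and sigmoid activations are all assumptions made for the example.

```python
import math
import random

def num_params(kappa):
    """n = sum over layers j >= 1 of kappa_j * (kappa_{j-1} + 1)."""
    return sum(kappa[j] * (kappa[j - 1] + 1) for j in range(1, len(kappa)))

def unpack(theta, kappa):
    """Split a flat parameter list into per-layer (W_j, b_j).

    Assumed layout: W_j traversed row-wise, then b_j, for j = 1, ..., rho.
    """
    layers, pos = [], 0
    for j in range(1, len(kappa)):
        rows, cols = kappa[j], kappa[j - 1]
        W = [theta[pos + k * cols: pos + (k + 1) * cols] for k in range(rows)]
        pos += rows * cols
        b = theta[pos: pos + rows]
        pos += rows
        layers.append((W, b))
    return layers

def forward(x, theta, kappa, activations):
    """Return h_rho(x, theta) by iterating (3.1) and (3.2)."""
    h = list(x)  # h_0(x) = x
    for (W, b), phi in zip(unpack(theta, kappa), activations):
        # (3.1): g_j = W_j h_{j-1} + b_j
        g = [sum(w * v for w, v in zip(row, h)) + b_k for row, b_k in zip(W, b)]
        # (3.2): h_j = phi_j(g_j), applied elementwise
        h = [phi(g_k) for g_k in g]
    return h

def sigmoid(g):
    return 1.0 / (1.0 + math.exp(-g))

# Parameter counts for the three architectures listed in Table 1.
for kappa in [(2, 2, 1), (8, 2, 2, 1), (6, 2, 2, 3)]:
    print(kappa, num_params(kappa))

# One forward pass through an MLP(2, 2, 1) with arbitrary parameters.
random.seed(0)
kappa = (2, 2, 1)
theta = [random.gauss(0.0, 1.0) for _ in range(num_params(kappa))]
output = forward([0.0, 1.0], theta, kappa, [math.tanh, sigmoid])
print(output)
```

For the architectures MLP(2, 2, 1), MLP(8, 2, 2, 1) and MLP(6, 2, 2, 3), the formula gives $n = 9$, $27$ and $29$, which agrees with the parameter counts reported in Table 1.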

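With the MLP defined, the weight symmetries discussed in subsection 2.3 can be checked numerically. The sketch below assumes an MLP(2, 2, 1) with tanh hidden activations and an identity output activation. The helper name `mlp_2_2_1` and the particular input and random parameters are hypothetical. It compares the output before and after a sign flip of one hidden neuron and after a permutation of the two hidden neurons.

```python
import math
import random

def mlp_2_2_1(x, W1, b1, W2, b2):
    """MLP(2, 2, 1): tanh hidden layer, identity output activation."""
    h1 = [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
          for row, b in zip(W1, b1)]
    return sum(w * h for w, h in zip(W2[0], h1)) + b2[0]

random.seed(1)
W1 = [[random.gauss(0.0, 1.0) for _ in range(2)] for _ in range(2)]
b1 = [random.gauss(0.0, 1.0) for _ in range(2)]
W2 = [[random.gauss(0.0, 1.0) for _ in range(2)]]
b2 = [random.gauss(0.0, 1.0)]
x = [0.3, -1.2]

base = mlp_2_2_1(x, W1, b1, W2, b2)

# Sign flip: negate the incoming weights and bias of hidden neuron 1,
# and negate its outgoing weight. Since tanh is odd, the output is unchanged.
W1_flip = [[-w for w in W1[0]], W1[1]]
b1_flip = [-b1[0], b1[1]]
W2_flip = [[-W2[0][0], W2[0][1]]]
flipped = mlp_2_2_1(x, W1_flip, b1_flip, W2_flip, b2)

# Permutation: swap the two hidden neurons together with their outgoing weights.
W1_perm = [W1[1], W1[0]]
b1_perm = [b1[1], b1[0]]
W2_perm = [[W2[0][1], W2[0][0]]]
permuted = mlp_2_2_1(x, W1_perm, b1_perm, W2_perm, b2)

print(math.isclose(base, flipped, abs_tol=1e-12))
print(math.isclose(base, permuted, abs_tol=1e-12))
```

Three distinct parameter vectors produce the same output here, so they also share the same likelihood value. This is the non-identifiability that leads to multiple equivalent posterior modes.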
The notation Wj,k,l is introduced to point to the take κρ ≥ 2 values. Moreover, consider an MLP(κ0:ρ) (k, l)-the element of weight matrix Wj at layer j. with κρ neurons in its output layer. Analogously, bj,k points to the k-th coordinate of Initially, a softmax activation function φρ(gρ) = Pκρ bias vector bj at layer j. exp (gρ)/ k=1 exp (gρ,k) is applied at the output layer of the MLP, where g denotes the k-th co- 3.1.2 Likelihood for binary classification. Con- ρ,j ordinate of the κρ-length vector gρ. Thus, the event sider s samples (xi, yi), i = 1, 2, . . . , s, consisting κ probabilities Pr(yi = k|xi, θ) are of some input xi ∈ R 0 and of a binary output y ∈ {0, 1}. An MLP(κ , κ , . . . , κ = 1) with a sin- i 0 1 ρ Pr(yi = k|xi, θ) = hρ,k(xi, θ) gle neuron in its output layer can be used for setting = φρ(gρ,k(xi, θ)) the L(y1:s|x1:s, θ) of labels y1:s = (3.6) exp (g (x(i), θ)) (y1, y2, . . . , ys) given the input x1:s = (x1, x2, . . . , xs) = ρ,k . Pκρ and MLP parameters θ. r=1 exp (gρ,r(xi, θ)) Firstly, the sigmoid activation function φρ(gρ) = hρ,k(xi, θ) denotes the k-th coordinate of the MLP 1/(1 + exp (−gρ)) is applied at the output layer of output hρ(xi, θ). the MLP. So, the event probabilities Pr(yi = 1|xi, θ) are set to It is assumed that the labels are outcomes of s in- dependent draws from categorical probability mass Pr(yi = 1|xi, θ) = hρ(xi, θ) = φρ(gρ(xi, θ)) functions with event probabilities given by (3.6), so (3.3) 1 the likelihood is = . 1 + exp (−g(ρ)(x(i), θ)) s κρ Y Y 1{y =k} (3.7) L(y1:s|x1:s, θ) = (hρ,k(xi, θ)) i . Assuming that the labels are outcomes of s in- i=1 k=1 dependent draws from Bernoulli probability mass functions with event probabilities given by (3.3), the The log-likelihood follows as likelihood becomes s κρ s 2 X 1 (3.8) `(y1:s|x1:s, θ) = {yi=k} log (hρ,k(xi, θ)). Y Y 1{y =k−1} (3.4) L(y1:s|x1:s, θ) = (zρ,k(xi, θ)) i . 
i=1 k=1 i=1 k=1 The negative value of log-likelihood (3.8) is known z (xi, θ), k = 1, 2, denotes the k-th coordinate ρ,k as cross entropy, and it is used as loss function for of the vector zρ(xi, θ) = (1 − hρ(xi, θ), hρ(xi, θ)) of stochastic optimization in multiclass classification event probabilities for sample i = 1, 2, . . . , s. Fur- MLPs. thermore, 1 denotes the indicator function, that is An MLP(κ0, κ1, . . . , κρ = 2) with two neurons at 1 = 1 if yi = k − 1, and 1 = 0 oth- {yi=k−1} {yi=k−1} the output layer, event probabilities given by soft- erwise. The log-likelihood follows as max activation (3.6) and log-likelihood (3.8) can be s used for binary classification. Such a formulation is 2 X 1 an alternative to an MLP(κ0, κ1, . . . , κρ = 1) with (3.5) `(y1:s|x1:s, θ) = {yi=k−1} log (zρ,k(xi, θ)). i=1 one neuron at the output layer, event probabilities k=1 given by sigmoid activation (3.3) and log-likelihood The negative value of log-likelihood (3.5) is known (3.5). The difference between the two MLP models is as the binary cross entropy (BCE). To infer the pa- the parameterization of event probabilities, since a categorical distribution with κρ = 2 levels otherwise rameters θ of MLP(κ0, κ1, . . . , κρ = 1), the binary cross entropy or a different loss function is mini- coincides with a Bernoulli distribution. mized using stochastic optimization methods, such 3.2 MCMC sampling for parameter estimation as stochastic gradient descent (SGD). Interest is in sampling from the parameter pos- 3.1.3 Likelihood for multiclass classification. Let terior p(θ|x1:s, y1:s) ∝ L(y1:s|x1:s, θ)π(θ) of a neu- yi ∈ {1, 2, . . . , κρ} be an output variable, which can ral network given the neural network likelihood CHALLENGES IN MCMC FOR BAYESIAN NEURAL NETWORKS 7

L(y1:s|x1:s, θ) and parameter prior π(θ). For MLPs, and subsequently states between pairs of chains are the likelihood L(y1:s|x1:s, θ) for binary and multi- swapped according to an MH algorithm. For the i-th class classification is provided by (3.4) and (3.7), re- chain, a sample j is drawn from a probability mass spectively. function pi with probability pi(j), in order to deter- The parameter posterior p(θ|x1:s, y1:s) is alter- mine the pair (i, j) for a possible swap. t natively denoted by p(θ|D1:s) for brevity. D1:s = Power posteriors p i (θ|D1:s), ti < tm, are smooth tm (x1:s, y1:s) is a dataset of size s consisting of input approximations of the target density p (θ|D1:s) = x1:s and output y1:s. p(θ|D1:s), facilitating exploration of the parame- This subsection provides an introduction to the ter space via state transitions between chains of t MCMC algorithms and MCMC diagnostics used in p i (θ|D1:s) and of p(θ|D1:s). In this paper, a cate- the examples of section4. Three MCMC algorithms gorical probability mass function pi is used in PP are outlined, namely Metropolis-Hastings, Hamilto- sampling for determining candidate pairs of chains nian Monte Carlo, and power posterior sampling. for state swaps (see Appendix A). Two MCMC diagnostics are described, the multi- variate potential scale reduction factor (PSRF) and 3.2.4 Multivariate PSRF. PSRF, commonly de- ˆ the multivariate effective sample size (ESS). noted by R, is an MCMC diagnostic of convergence conceived by Gelman and Rubin(1992) and ex- 3.2.1 Metropolis-Hastings algorithm. One of the tended to its multivariate version by Brooks and most general algorithms for sampling from a poste- Gelman(1998). This paper uses the multivariate rior p(θ|D1:s) is the Metropolis-Hastings (MH) al- PSRF by Brooks and Gelman(1998), which provides gorithm (Metropolis et al., 1953; Hastings, 1970). 
a single-number summary of convergence across the Given the current state θ, the MH algorithm ini- n dimensions of a parameter, requiring a Monte ∗ tially samples a state θ from a proposal density gθ Carlo covariance matrix for the param- ∗ and subsequently accepts the proposed state θ with eter. probability To acquire the multivariate PSRF, the multivari-  n ∗ o ate initial monotone sequence estimator (MINSE) p(θ |D1:s)gθ∗ (θ) ∗  min ∗ , 1 if p(θ|D1:s)gθ(θ ) > 0, p(θ|D1:s)gθ(θ ) of Monte Carlo covariance is employed (Dai and  1 otherwise. Jones, 2017). In a Bayesian setting, the MINSE esti- mates the covariance matrix of a parameter posterior Typically, a normal proposal density gθ = N (θ, Λ) p(θ|D1:s). with a constant covariance matrix Λ is used. For such To compute PSRF, several independent Markov a normal gθ, the acceptance probability simplifies to chains are simulated. Gelman et al.(2004) recom- ∗ min {p(θ |D1:s)/p(θ|D1:s), 1}, yielding the so called mend terminating MCMC sampling as soon as Rˆ < random walk Metropolis algorithm. 1.1. More recently, Vats and Knudson(2018) make 3.2.2 . Hamiltonian an argument based on ESS that a cut-off of 1.1 for ˆ Monte Carlo (HMC) draws samples from an aug- R is too high to estimate a Monte Carlo mean with mented parameter space via Gibbs steps, by com- reasonable uncertainty. Vehtari et al.(2019) recom- puting a trajectory in the parameter space accord- mend simulating at least m = 4 chains to compute ˆ ˆ ing to Hamiltonian dynamics. For a more detailed R and using a threshold of R < 1.01. review of HMC, see Neal(2011). 3.2.5 Multivariate ESS. The ESS of an estimate 3.2.3 Power posterior sampling. Power posterior obtained from a Markov chain realization is inter- (PP) sampling by Friel and Pettitt(2008) is a pop- preted as the number of independent samples that ulation Monte Carlo algorithm. 
It involves m + 1 provide an estimate with equal to the vari- t chains drawn from tempered versions p i (θ|D1:s) of ance of the estimate obtained from the Markov chain a target posterior p(θ|D1:s) for a temperature sched- realization. For a more extensive treatment entailing ule ti ∈ [0, 1], i ∈ {0, 1, . . . , m}, where tm = 1. At univariate approaches to ESS, see Vats and Flegal each iteration, the state of each chain is updated us- (2018); Gong and Flegal(2016); Kass et al.(1998). ing an MCMC sampler associated with that chain Rˆ and its variants can fail to diagnose poor mixing 8 T. PAPAMARKOU, J. HINKLE, M. T. YOUNG AND D. WOMBLE of a Markov chain, whereas low values of ESS are an (3.10) states the posterior predictive distribution indicator of poor mixing. It is thus recommended to p(y|x, D1:s) as an expectation of the likelihood check both Rˆ and ESS (Vehtari et al., 2019). For a p(y|x, θ) evaluated at the test output y with respect theoretical treatment of the relation between Rˆ and to the parameter posterior p(θ|D1:s) learnt from the ESS, see Vats and Knudson(2018). training set D1:s. Univariate ESS pertains to a single coordinate The expectation in (3.10) can be approximated of an n-dimensional parameter. Vats, Flegal and via . More specifically, a Jones(2019) introduce a multivariate version of Monte Carlo approximation of the posterior predic- ESS, which provides a single-number summary of tive distribution is given by sampling effectiveness across the n dimensions of a v parameter. Similarly to multivariate PSRF (Brooks X (3.11) p(y|x, D ) ' p(y|x, ω ). and Gelman, 1998), multivariate ESS (Vats, Flegal 1:s k k=1 and Jones, 2019) requires a Monte Carlo covariance matrix estimator for the parameter. The sum in (3.11) involves evaluations of the like- Given a single Markov chain realization of length lihood across v iterations ωk, k = 1, 2, . . . 
v for an n-dimensional parameter, Vats, Flegal and Jones (2019) define multivariate ESS as

\[
\hat{S} = v \left( \frac{\det(E)}{\det(C)} \right)^{1/n}.
\]

$\det(E)$ is the determinant of the empirical covariance matrix $E$ and $\det(C)$ is the determinant of a Monte Carlo covariance matrix estimate $C$ for the chain. In this paper, the multivariate ESS by Vats, Flegal and Jones (2019) is used, setting $C$ to be the MINSE for the chain.

3.3 Bayesian marginalization for prediction

This subsection briefly reviews the notion of posterior predictive distribution based on Bayesian marginalization, posterior predictive distribution approximation via Monte Carlo integration, and associated binary and multiclass classification.

3.3.1 Posterior predictive distribution. Consider a set $D_{1:s} = (x_{1:s}, y_{1:s})$ of $s$ training data points and a single test data point $(x, y)$ consisting of some test input $x$ and test output $y$. Integrating out the parameters $\theta$ of a model fitted to $D_{1:s}$ yields the posterior predictive distribution

\[
\underbrace{p(y \mid x, D_{1:s})}_{\text{Predictive distribution}} = \int \underbrace{p(y \mid x, \theta)}_{\text{Likelihood}} \, \underbrace{p(\theta \mid D_{1:s})}_{\text{Parameter posterior}} \, d\theta. \tag{3.9}
\]

Appendix B provides a derivation of (3.9).

3.3.2 Monte Carlo approximation. (3.9) can be written as

\[
p(y \mid x, D_{1:s}) = \mathrm{E}_{\theta \mid D_{1:s}} [p(y \mid x, \theta)]. \tag{3.10}
\]

, v, of a Markov chain realization $\omega_{1:v}$ obtained from the parameter posterior $p(\theta \mid D_{1:s})$.

3.3.3 Classification rule. In the case of binary classification, the prediction $\hat{y}$ for the test label $y \in \{0, 1\}$ is

\[
\hat{y} =
\begin{cases}
1 & \text{if } p(y \mid x, D_{1:s}) \geq 0.5, \\
0 & \text{otherwise.}
\end{cases} \tag{3.12}
\]

For multiclass classification, the prediction label $\hat{y}$ for the test label $y \in \{1, 2, \ldots, \kappa_\rho\}$ is

\[
\hat{y} = \arg\max_{y} \{ p(y \mid x, D_{1:s}) \}. \tag{3.13}
\]

The classification rules (3.12) and (3.13) for binary and multiclass classification maximize the posterior predictive distribution. This way, predictions are made based on the Bayesian principle. The uncertainty of predictions is quantified, since the posterior predictive probability $p(y \mid x, D_{1:s})$ of each predicted label $\hat{y}$ is available.

4. EXAMPLES

Four examples of Bayesian inference for MLPs based on MCMC are presented. A different dataset is used for each example. The four datasets entail simulated noisy data from the exclusive-or (XOR) function, and observations collected from Pima Indians, penguins and hawks. Section 4.1 introduces the four datasets. Each of the four datasets is split into a training and a test set for parameter inference and for predictions, respectively. MLPs with one neuron

in the output layer are fitted to the noisy XOR and Pima datasets to perform binary classification, while MLPs with three neurons in the output layer are fitted to the penguin and hawk datasets to perform multiclass classification with three classes. Table 1 shows the training and test sample sizes of the four datasets, and the fitted MLP models with their associated number $n$ of parameters.

Table 1
Training and test sample sizes of the four datasets of section 4, architectures of fitted MLP models and associated number $n$ of MLP parameters.

Dataset      Training size   Test size   Model              n
Noisy XOR    500             120         MLP(2, 2, 1)       9
Pima         262             130         MLP(8, 2, 2, 1)    27
Penguins     223             110         MLP(6, 2, 2, 3)    29
Hawks        596             295         MLP(6, 2, 2, 3)    29

In the examples, samples are drawn via MCMC from the unnormalized log-posterior

\[
\log (p(\theta \mid x_{1:s}, y_{1:s})) = \ell(y_{1:s} \mid x_{1:s}, \theta) + \log (\pi(\theta))
\]

of MLP parameters. The log-likelihood $\ell(y_{1:s} \mid x_{1:s}, \theta)$ for binary or multiclass classification corresponds to (3.5) or (3.8). $\log (\pi(\theta))$ is the log-prior of MLP parameters.

4.1 Datasets

An introduction to the four datasets used in this paper follows. The simulated noisy XOR dataset does not contain missing values, while the real datasets for Pima, penguins and hawks come with missing values. Data points containing missing values in the chosen variables have been dropped from the three real datasets. All features (input variables) in the three real datasets have been standardized. The four datasets, in their final form used for inference and prediction, are available at https://github.com/papamarkou/bnn_mcmc_examples.

4.1.1 XOR dataset. The so-called XOR function $f : \{0, 1\} \times \{0, 1\} \to \{0, 1\}$ returns 1 if exactly one of its binary input values is equal to 1, otherwise it returns 0. The $s = 4$ data points defining XOR are $(x_1, y_1) = ((0, 0), 0)$, $(x_2, y_2) = ((0, 1), 1)$, $(x_3, y_3) = ((1, 0), 1)$ and $(x_4, y_4) = ((1, 1), 0)$.

A perceptron without a hidden layer cannot learn the XOR function (Minsky and Papert, 1988). On the other hand, an MLP(2, 2, 1) with a single hidden layer of two neurons can learn the XOR function (Goodfellow, Bengio and Courville, 2016).

An MLP(2, 2, 1) has a parameter vector $\theta$ of length $n = 9$, as $W_1$, $b_1$, $W_2$ and $b_2$ have respective dimensions $2 \cdot 2$, $2 \cdot 1$, $2 \cdot 1$ and $1 \cdot 1$. Since the number $s = 4$ of data points defined by the exact XOR function is less than the number $n = 9$ of parameters in the fitted MLP(2, 2, 1), the parameters cannot be fully identified.

To circumvent the lack of identifiability arising from the limited number of data points, a larger dataset is simulated by introducing a noisy version of XOR. Firstly, consider the auxiliary function $\psi : [-c, 1 + c] \times [-c, 1 + c] \to \{0, 1\} \times \{0, 1\}$ given by

\[
\begin{aligned}
\psi(u - c, u - c) &= (0, 0), \\
\psi(u - c, u + c) &= (0, 1), \\
\psi(u + c, u - c) &= (1, 0), \\
\psi(u + c, u + c) &= (1, 1).
\end{aligned}
\]

$\psi$ is presented in parametrized form, in terms of a constant $c \in (0.5, 1)$ and a uniformly distributed random variable $u \sim U(0, 1)$. The noisy XOR function is then defined as the function composition $f \circ \psi$.

A training and a test set of noisy XOR points, generated using $f \circ \psi$ and $c = 0.55$, are shown in figure 2a. 125 and 30 noisy XOR points per exact XOR point $(x_i, y_i)$, $i = 1, 2, 3, 4$, are contained in the training and test set, respectively. So, the training and test sample sizes are 500 and 120, as reported in table 1 and as visualized in figure 2a.

In figure 2a, the training and test sets of noisy XOR points consist of two input variables $(u \pm 0.55, u \pm 0.55) \in [-0.55, 1.55] \times [-0.55, 1.55]$ and of one output variable $f \circ \psi(u \pm 0.55, u \pm 0.55) \in \{0, 1\}$. The four colours classify noisy XOR input $(u \pm 0.55, u \pm 0.55)$ with respect to the corresponding exact XOR input $\psi(u \pm 0.55, u \pm 0.55) \in \{(0, 0), (0, 1), (1, 0), (1, 1)\}$; the two different shapes classify noisy XOR output, with circle and triangle corresponding to 0 and 1.

4.1.2 Pima dataset. The Pima dataset contains observations taken from female patients of Pima Indian heritage. The binary output variable indicates whether or not a patient has diabetes. Eight features are used as diagnostics of diabetes, namely the number of pregnancies, plasma glucose concentration, diastolic blood pressure, triceps skinfold thickness, insulin level, body mass index, diabetes pedigree function and age.

For more information about the Pima dataset, see Smith et al. (1988). The original data, prior to removal of missing values and feature standardization, are available as the PimaIndiansDiabetes2 data frame of the mlbench R package.

4.1.3 Penguin dataset. The penguin dataset consists of body measurements for three penguin species observed on three islands in the Palmer Archipelago, Antarctica. Adélie, Chinstrap and Gentoo penguins are the three observed species. Four body measurements per penguin are taken, specifically body mass, flipper length, bill length and bill depth. The four body measurements, sex and location (island) make up a total of six features utilized for deducing the species to which a penguin belongs. Thus, the penguin species is used as output variable.

Horst, Hill and Gorman (2020) provide more details about the penguin dataset. In their original form, prior to data filtering, the data are available at https://github.com/allisonhorst/palmerpenguins.

4.1.4 Hawk dataset. The hawk dataset is composed of observations for three hawk species collected from Lake MacBride near Iowa City, Iowa. Cooper's, red-tailed and sharp-shinned hawks are the three observed species. Age, wing length, body weight, culmen length, hallux length and tail length are the six hawk features employed in this paper for deducing the species to which a hawk belongs. So, the hawk species is used as output variable.

Cannon et al. (2019) mention that Emeritus Professor Bob Black at Cornell College shared the hawk dataset publicly. The original data, prior to data filtering, are available as the Hawks data frame of the Stat2Data R package.

4.2 Experimental configuration

To fully specify the MLP models of table 1, their activations are listed. A sigmoid activation function is applied at each hidden layer of each MLP. Additionally, a sigmoid activation function is applied at the output layer of MLP(2, 2, 1) and of MLP(8, 2, 2, 1), conforming to log-likelihood (3.5) for binary classification. A softmax activation function is applied at the output layer of MLP(6, 2, 2, 3), in accordance with log-likelihood (3.8) for multiclass classification. The same MLP(6, 2, 2, 3) model is fitted to the penguin and hawk datasets.

A normal prior $\pi(\theta) = \mathcal{N}(0, 10I)$ is adopted for the parameters $\theta \in \mathbb{R}^n$ of each MLP model shown in table 1. An isotropic covariance matrix $10I$ assigns relatively high prior variance, equal to 10, to each coordinate of $\theta$, thus setting empirically a seemingly non-informative prior.

MH and HMC are run for each of the four examples of table 1. PP sampling incurs higher computational cost than MH and HMC; for this reason, PP sampling is run only for noisy XOR. Ten power posteriors are employed for PP sampling, and MH is used for within-chain moves. On the basis of pilot runs, the PP temperature schedule is set to $t_i = 1$, $i = 0, 1, \ldots, 9$; this implies that each power posterior is set to be the parameter posterior and consequently between-chain moves are made among ten chains realized from the parameter posterior.

Empirical tuning for MH, HMC and PP is carried out. The chosen MH proposal variance, HMC number of leapfrog steps and HMC leapfrog step size for each example can be found in https://github.com/papamarkou/bnn_mcmc_examples.

$m = 10$ Markov chains are realized for each combination of training dataset shown in table 1 and of MCMC sampler. 110,000 iterations are run per chain realization, 10,000 of which are discarded as burn-in. Thereby, $v = 100,000$ post-burnin iterations are retained per chain realization.

MINSE computation, required by multivariate PSRF and multivariate ESS, is carried out using $v = 100,000$ post-burnin iterations per realized chain. The multivariate PSRF for each dataset-sampler setting is computed across the $m = 10$ realized chains for the setting. On the other hand, the multivariate ESS is computed for each realized chain, and the mean across $m = 10$ ESSs is reported for each dataset-sampler setting.

Monte Carlo approximations of posterior predictive distributions are computed according to (3.11) for each data point of each test set. To reduce the computational cost, the last $v = 10,000$ iterations of each realized chain are used in (3.11).

Predictions for binary and multiclass classification are made using (3.12) and (3.13), respectively. Given a single chain realization from an MCMC sampler, predictions are made for every point in a test set; the predictive accuracy is then computed as the number of correct predictions over the total number of points in the test set. Subsequently, the mean of predictive accuracies across the $m = 10$ chains realized from the sampler is reported for the test set.

4.3 Numerical summaries

Table 2 shows numerical summaries for each set of $m = 10$ Markov chains realized by an MCMC sampler for a dataset-MLP combination of table 1. Multivariate PSRF and multivariate ESS diagnose the capacity of MCMC sampling to perform parameter inference. Predictive accuracy via Bayesian marginalization (3.11), based on classification rules (3.12) and (3.13) for binary and multiclass classification, demonstrates the predictive performance of MCMC sampling. The last column of table 2 displays the predictive accuracy via (3.11) with samples $\omega_k$, $k = 1, 2, \ldots, v$, drawn from the prior $\pi(\theta) = \mathcal{N}(0, 10I)$, thus providing an approximation of the expected posterior predictive probability

\[
\mathrm{E}_{\theta}[p(y \mid x, \theta)] = \int p(y \mid x, \theta) \pi(\theta) \, d\theta \tag{4.1}
\]

with respect to prior $\pi(\theta)$.

Table 2
Multivariate PSRF, multivariate ESS and predictive accuracy for each set of ten Markov chains realized by an MCMC sampler for a dataset-MLP combination. Predictive accuracies based on samples from the prior are reported as model-agnostic baselines. The prior-based accuracy is a single value per dataset, repeated on each row of that dataset.

Sampler   PSRF      ESS     Accuracy, MCMC (%)   Accuracy, prior (%)
Noisy XOR, MLP(2, 2, 1)
MH        1.2057    540     75.92                48.33
HMC       13.8689   25448   74.75                48.33
PP        2.2885    4083    87.58                48.33
Pima, MLP(8, 2, 2, 1)
MH        1.0007    93      79.31                51.69
HMC       1.0001    718     80.38                51.69
Penguins, MLP(6, 2, 2, 3)
MH        1.0229    217     100.00               36.45
HMC       1.6082    3127    100.00               36.45
Hawks, MLP(6, 2, 2, 3)
MH        1.0319    168     97.97                28.85
HMC       1.4421    1838    98.03                28.85

PSRF is above 1.01 (Vehtari et al., 2019), indicating lack of convergence, in three out of four datasets. ESS is low considering the post-burnin length of $v = 100,000$ of each chain realization, indicating slow mixing. MCMC sampling for Pima data is the only case of attaining PSRF less than 1.01, yet the ESS values for Pima are the lowest among the four datasets. Overall, simultaneous low PSRF and high ESS are not reached in any of the examples.

The predictive accuracy is high in multiclass classification, despite the lack of convergence and slow mixing. Bayesian marginalization based on HMC samples yields 100% and 98.03% predictive accuracy on the penguin and hawk test datasets, despite the PSRF values of 1.6082 and 1.4421 on the penguin and hawk training datasets.

PP sampling for the binary classification problem of noisy XOR leads to higher predictive accuracy (87.58%) than MH (75.92%) or HMC (74.75%). The 87.58% predictive accuracy is attained by PP sampling despite the associated PSRF value of 2.2885.

Bayesian marginalization based on MCMC sampling outperforms prior beliefs or random guesses in terms of predictive inference, despite MCMC diagnostic failures. For instance, Bayesian marginalization via non-converged HMC chain realizations yields 74.75%, 100% and 98.03% predictive accuracy on the noisy XOR, penguin and hawk datasets. Approximating the posterior predictive distribution with samples from the parameter prior yields 48.33%, 36.45% and 28.85% predictive accuracy on the same datasets. It is noted that 48.33% is close to a 50/50 random guess for binary classification, while 36.45% and 28.85% are close to a 1/3 random guess for multiclass classification with three classes.

4.4 Visual summaries for parameters

Visual summaries for MLP parameters are presented in this subsection. In particular, Markov chain traceplots and a comparison between MCMC sampling and ensemble training are displayed.

4.4.1 Non-converged chain realizations. Figure 1 shows chain traceplots of four parameters of MLP

models introduced in table 1. These traceplots visually demonstrate entrapment in local modes, mode switching and more generally lack of convergence.

Fig 1: Markov chain traceplots of four parameter coordinates of MLP models introduced in table 1. The vertical dotted lines indicate the end of burnin.

All 110,000 iterations per realized chain, which include burnin, are shown in the traceplots of figure 1. The vertical dotted lines delineate the first 10,000 burnin iterations.

Two realized MH chains for parameter $\theta_8$ of the MLP(2, 2, 1) model fitted to the noisy XOR training data are plotted. The traces in orange and in blue gravitate during burnin towards modes in the vicinity of 8 and −8, respectively, and then get entrapped for the entire simulation time in these modes. Parameter $\theta_8$ corresponds to a weight connecting a neuron in the hidden layer with the neuron of the output layer of MLP(2, 2, 1). The two realized chains for $\theta_8$ explore two regions symmetric about zero associated with symmetries of weight $\theta_8$.

Two realized MH chains for parameter $\theta_{18}$ of the MLP(6, 2, 2, 3) model fitted to the penguin training data are plotted, one shown in orange and one in blue. Each of these two traces initially explores a mode, transits to a seemingly symmetric mode about halfway through the simulation time (post-burnin) and explores the symmetric mode in the second half of the simulation.

One HMC chain traceplot for parameter $\theta_{23}$ and one HMC chain traceplot for parameter $\theta_{26}$ of the MLP(6, 2, 2, 3) model fitted to the penguin and hawk training data, respectively, are shown. The traces of these two parameters exhibit similar behaviour, each of them switching between two symmetric regions about zero.

Switching between symmetric modes, as seen in the displayed traceplots, manifests weight symmetries. These traceplots exemplify how computational time is wasted during MCMC to explore equivariant parameter posterior modes of a neural network (Nalisnick, 2018). Consequently, the realized chains do not converge.

4.4.2 MCMC sampling vs ensemble training. An exemplified comparison between MCMC sampling and ensemble training for neural networks follows. To this end, the same noisy XOR training data and the same MLP(2, 2, 1) model, previously used for MCMC sampling, are used for ensemble training. To recap, the noisy XOR dataset is introduced in subsection 4.1 and is displayed in figure 2a; a sigmoid activation function is applied to the hidden and output layer of MLP(2, 2, 1), and the BCE loss function is employed, which is the negative value of log-likelihood (3.5).

Ensemble learning is conducted by training the MLP(2, 2, 1) model on the noisy XOR training set multiple times. At each training session, SGD is used for minimizing the BCE loss. SGD is initialized by drawing a sample from $\pi(\theta) = \mathcal{N}(0, 10I)$, which is the same density used as prior for MCMC sampling. 2,000 epochs are run per training session, with a batch size of 50 and a learning rate of 0.002. The SGD solution from the training session is accepted if its predictive accuracy on the noisy XOR test set is above 85%, otherwise it is rejected. Ensemble learning is terminated as soon as 1,000 SGD solutions with the required level of accuracy are obtained.
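The ensemble training procedure for MLP(2, 2, 1) on noisy XOR can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the code of the paper, which relies on PyTorch: the function names (noisy_xor, forward, sgd_session, accuracy, ensemble), the parameter ordering inside theta, the independent draw of u per input coordinate and the max_sessions cap are all choices made for this sketch. The gradient of the mean BCE loss is coded by hand for the 2-2-1 sigmoid network.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def noisy_xor(n_per_point, c=0.55, rng=random):
    # Each coordinate is u - c for bit 0 or u + c for bit 1, with u ~ U(0, 1).
    # Assumption of this sketch: u is drawn independently per coordinate.
    data = []
    for a in (0, 1):
        for b in (0, 1):
            for _ in range(n_per_point):
                x1 = rng.random() + (c if a else -c)
                x2 = rng.random() + (c if b else -c)
                data.append(((x1, x2), a ^ b))
    return data

def forward(theta, x):
    # Assumed layout of theta: W1 (4 entries, row-major), b1 (2), W2 (2), b2 (1).
    h = [sigmoid(theta[2 * k] * x[0] + theta[2 * k + 1] * x[1] + theta[4 + k])
         for k in (0, 1)]
    p = sigmoid(theta[6] * h[0] + theta[7] * h[1] + theta[8])
    return h, p

def sgd_session(data, epochs, batch_size, lr, rng):
    # Initialize from N(0, 10 I), the density also used as prior for MCMC.
    theta = [rng.gauss(0.0, math.sqrt(10.0)) for _ in range(9)]
    for _ in range(epochs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        for start in range(0, len(shuffled), batch_size):
            batch = shuffled[start:start + batch_size]
            grad = [0.0] * 9
            for x, y in batch:
                h, p = forward(theta, x)
                d_out = p - y  # d(BCE)/d(output pre-activation) for a sigmoid output
                for k in (0, 1):
                    grad[6 + k] += d_out * h[k]
                    d_hid = d_out * theta[6 + k] * h[k] * (1.0 - h[k])
                    grad[2 * k] += d_hid * x[0]
                    grad[2 * k + 1] += d_hid * x[1]
                    grad[4 + k] += d_hid
                grad[8] += d_out
            # Mean BCE over the batch, hence the division by the batch length.
            theta = [t - lr * g / len(batch) for t, g in zip(theta, grad)]
    return theta

def accuracy(theta, data):
    # Classification rule: predict 1 if the output probability is at least 0.5.
    return sum((forward(theta, x)[1] >= 0.5) == (y == 1) for x, y in data) / len(data)

def ensemble(train, test, n_solutions, threshold=0.85, epochs=2000,
             batch_size=50, lr=0.002, max_sessions=10**6, seed=0):
    # Accept an SGD solution only if its test accuracy exceeds the threshold.
    # max_sessions is a safeguard added in this sketch, not part of the paper.
    rng = random.Random(seed)
    accepted, sessions = [], 0
    while len(accepted) < n_solutions and sessions < max_sessions:
        theta = sgd_session(train, epochs, batch_size, lr, rng)
        sessions += 1
        if accuracy(theta, test) > threshold:
            accepted.append(theta)
    return accepted
```

With the settings stated in the text, a call would look like ensemble(noisy_xor(125), noisy_xor(30), n_solutions=1000). Pure Python is slow at that scale, so the sketch is meant for reading and for small trial runs rather than for reproducing figure 2.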

(a) Noisy XOR training set (left) and test set (right).
(b) 100 SGD solutions from training MLP(2, 2, 1).
(c) Histograms of parameter $\theta_3$ of MLP(2, 2, 1).

Fig 2: Comparison between MH sampling and ensemble training of an MLP(2, 2, 1) model fitted to noisy XOR data. SGD is used for ensemble training. Each accepted SGD solution has predictive accuracy above 85% on the noisy XOR test set.

Figure 2b shows a parallel coordinates plot of 100 SGD solutions. Each line connects the nine coordinates of a solution. Overlaying lines of different SGD solutions visualizes parameter symmetries.

Figure 2c displays histograms associated with parameter $\theta_3$ of MLP(2, 2, 1). The green histogram represents all 1,000 SGD solutions for $\theta_3$ obtained from ensemble training based on noisy XOR. These 1,000 modes cluster in two regions approximately symmetric about zero. The orange histogram belongs to one of ten realized MH chains for $\theta_3$ based on noisy XOR. This realized chain is entrapped in a local mode in the vicinity of 5, where the orange histogram concentrates its mass. The overlaid green and orange histograms show that MH sampling explores a region of the marginal posterior of $\theta_3$ also explored by ensemble training.

The blue histogram in figure 2c comes from a chain realization for $\theta_3$ obtained by MH sampling when MLP(2, 2, 1) is fitted to the four exact XOR data points. The pink line in figure 2c shows the marginal prior $\pi(\theta_3) = \mathcal{N}(0, \sigma^2 = 10)$. Four data points are not sufficient to learn from them, given that MLP(2, 2, 1) has nine parameters. For this reason, the blue histogram coincides with the pink line, which means that the marginal posterior $p(\theta_3)$ obtained from exact XOR via MH sampling and the marginal prior $\pi(\theta_3)$ coincide.

4.5 Visual summaries for predictions

Visual summaries for MLP predictions and for MLP posterior predictive probabilities are presented in this section. MLP posterior predictive probabilities are visually shown to quantify predictive uncertainty in classification.

4.5.1 Predictive accuracy. Figure 3 shows boxplots of predictive accuracies, hereinafter referred to as accuracies, for the examples introduced in table 1. Each boxplot summarizes $m = 10$ accuracies associated with the ten chains realized per sampler for a test set. Accuracy computation is based on Bayesian marginalization, as outlined in subsections 3.3 and 4.2. Horizontal red lines represent accuracy medians. Figure 3 and table 2 provide complementary summaries, as they present respective quartiles and means of accuracies across chains per sampler.

Boxplot medians show high accuracy on the penguin and hawk test sets. Moreover, narrow boxplots indicate accuracies with small variation on the penguin and hawk test sets. Thereby, Bayesian marginalization based on non-converged chain realizations attains high accuracy with small variability on the two multiclass classification examples.

Figure 3 also displays boxplots of accuracies based on expected posterior predictive distribution approximation (4.1) with respect to the prior. For all four test sets and regardless of Markov chain convergence, Bayesian marginalization outperforms agnostic prior-based baseline (4.1).

The PP boxplot has more elevated median and is narrower than its MH and HMC counterparts for the

noisy XOR test set. This implies that PP sampling attains higher accuracy with smaller variation than MH and HMC sampling on the noisy XOR test set.

Fig 3: Boxplots of predictive accuracies for the examples introduced in table 1. Each boxplot summarizes $m = 10$ predictive accuracies associated with the ten chains realized by an MCMC sampler for a test set.

4.5.2 Uncertainty quantification on a grid. Figure 4 visualizes heatmaps of the ground truth and of posterior predictive distribution approximations for noisy XOR. More specifically, the posterior predictive probability $p(y = 1 \mid (x_1, x_2), D_{1:500})$ is approximated at the centre $(x_1, x_2)$ of each square cell of a 22 × 22 grid in $[-0.5, 1.5] \times [-0.5, 1.5]$. $D_{1:500}$ refers to the noisy XOR training dataset of size $s = 500$ introduced in subsection 4.1. (3.11) is used for approximating $p(y = 1 \mid (x_1, x_2), D_{1:500})$. Previously acquired Markov chain realizations (subsection 4.3) via MCMC sampling of MLP(2, 2, 1) parameters, using the noisy XOR training dataset $D_{1:500}$, are passed to (3.11).

Fig 4: Heatmaps of ground truth and of posterior predictive probabilities $p(y = 1 \mid (x_1, x_2), D_{1:500}) = c$ on a grid of noisy XOR features $(x_1, x_2)$. The heatmap colour palette represents values of $c$. The ground truth heatmap visualizes true labels, while the other three heatmaps use approximate Bayesian marginalization based on HMC and PP chain realizations.

The approximation $p(y = 1 \mid (x_1, x_2), D_{1:500}) = c$ at the centre $(x_1, x_2)$ of a square cell determines the colour of the cell in figure 4. If $c$ is closer to 1, 0, or 0.5, the cell is plotted with a shade of red, blue or white, respectively. So, darker shades of red indicate that $y = 1$ with higher certainty, darker shades of blue indicate that $y = 0$ with higher certainty, and shades of white indicate high uncertainty about the binary label of noisy XOR.

Two posterior predictive distribution approximations based on two HMC chain realizations learn different regions of the exact posterior predictive distribution. Each of the two HMC chain realizations uncovers about half of the ground truth of grid labels, while it remains highly uncertain for the other half of grid labels. Moreover, both HMC chain realizations exhibit higher uncertainty closer to the decision boundaries of ground truth. These decision boundaries are the vertical straight line $x_1 = 0.5$ and horizontal straight line $x_2 = 0.5$.

A posterior predictive distribution approximation based on a PP chain realization is displayed. PP sampling uncovers larger regions of the ground truth of grid labels than HMC sampling in the considered grid of noisy XOR features $(x_1, x_2)$. Although HMC and PP samples do not converge to the parameter posterior of MLP(2, 2, 1), approximate Bayesian marginalization using these samples predicts a subset of noisy XOR labels.

4.5.3 Uncertainty quantification on a test set. Figures 5 and 6 show approximations of posterior predictive probabilities for a binary classification (noisy XOR) and a multiclass classification (hawks) example. Two posterior predictive probabilities are

interpreted contextually in each example to quantify predictive uncertainty.

(a) Scatterplot of noisy XOR features $(x_1, x_2)$.
(b) Posterior predictive probabilities for noisy XOR.

Fig 5: Quantification of uncertainty in predictions for the noisy XOR test set. Approximate Bayesian marginalization via MH sampling is used for computing posterior predictive probabilities.

Figure 5a visualizes the noisy XOR test set of subsection 4.1. This is the same test set shown in figure 2a, but with test points coloured according to their labels. Figure 5b shows the posterior predictive probability $p(y = c \mid (x_1, x_2), D_{1:500})$ of true label $c \in \{0, 1\}$ for each noisy XOR test point $((x_1, x_2), y = c)$ given noisy XOR training set $D_{1:500}$ of subsection 4.1. The posterior probabilities $p(y = c \mid (x_1, x_2), D_{1:500})$ of predicting true class $c$ are ordered within class $c$. Moreover, each $p(y = c \mid (x_1, x_2), D_{1:500})$ is coloured as red or pale green depending on whether the resulting prediction is correct or not. One of the ten MH chain realizations for MLP(2, 2, 1) parameter inference from noisy XOR data is used for approximating $p(y = c \mid (x_1, x_2), D_{1:500})$ via (3.11) and for making predictions via (3.12).

Two points in the noisy XOR test set are marked in figure 5 using a square and a rhombus. These two points have the same true label $c = 1$. Given posterior predictive probabilities 0.5269 and 0.9750 for the rhombus and square-shaped test points, the label $c = 1$ is correctly predicted for both points. However, the rhombus-shaped point is closer to the decision boundary $x_2 = 0.5$ than the square-shaped point, so classifying the former entails higher uncertainty. As 0.5269 < 0.9750, Bayesian marginalization quantifies the increased predictive uncertainty associated with the rhombus-shaped point despite using a non-converged MH chain realization.

Figure 6a shows a scatterplot of weight against tail length for the hawk test set of subsection 4.1. Blue, red and green test points belong to Cooper's, red-tailed and sharp-shinned hawk classes. Figure 6b shows the posterior predictive probabilities $p(y = c \mid x, D_{1:596})$ for a subset of 100 hawk test points, where $c \in \{\text{Cooper's}, \text{red-tailed}, \text{sharp-shinned}\}$ denotes the true label of test point $(x, y = c)$ and $D_{1:596}$ denotes the hawk training set of subsection 4.1. These posterior predictive probabilities are shown ordered within each class, and are coloured red or pale green depending on whether they yield correct or wrong predictions. One of the ten MH chain realizations for MLP(6, 2, 2, 3) parameter inference is used for approximating $p(y = c \mid x, D_{1:596})$ via (3.11) and for making predictions via (3.13).

Two points in the hawk test set are marked in figure 6 using a square and a rhombus. Each of these two points represents weight and tail length measurements from a red-tailed hawk. The red-tailed hawk class is correctly predicted for both points. The square-shaped observation belongs to the main cluster of red-tailed hawks in figure 6a and it is predicted with high posterior predictive probability (0.9961). On the other hand, the rhombus-shaped observation, which falls in the cluster of Cooper's hawk, is correctly predicted with a lower posterior predictive probability (0.5271). Bayesian marginalization provides approximate posterior predictive probabilities that signify the level of uncertainty in predictions despite using a non-converged MH chain realization.

4.6 Source code

The source code for this paper is split into three Python packages, namely eeyore, kanga and

bnn_mcmc_examples. eeyore implements MCMC algorithms for Bayesian neural networks. kanga implements MCMC diagnostics. bnn_mcmc_examples includes the examples of this paper.

(a) Scatterplot of hawks' weight against tail length.
(b) Posterior predictive probabilities for hawks.

Fig 6: Quantification of uncertainty in predictions for the hawk test set. Bayesian marginalization via MH sampling is used for approximating posterior predictive probabilities.

eeyore is available via pip, via conda and at https://github.com/papamarkou/eeyore. eeyore implements the MLP model, as defined by (3.1)-(3.2), using PyTorch. An MLP class is set to be a subclass of torch.nn.Module, with log-likelihood (3.5) for binary classification equal to the negative value of torch.nn.BCELoss and with log-likelihood (3.8) for multiclass classification equal to the negative value of torch.nn.CrossEntropyLoss. Each MCMC algorithm takes an instance of torch.nn.Module as input, with the logarithm of the target density being a log_target method of the instance. Log-target density gradients for HMC are computed via the automatic differentiation functionality of the torch.autograd package of PyTorch. The MLP class of eeyore provides a predictive_posterior method, which implements the posterior predictive distribution approximation (3.11) given a realized Markov chain.

kanga is available via pip, via conda and at https://github.com/papamarkou/kanga. kanga is a collection of MCMC diagnostics implemented using numpy. MINSE, multivariate PSRF and multivariate ESS are available in kanga.

bnn_mcmc_examples organizes the examples of this paper in a package. bnn_mcmc_examples relies on eeyore for MCMC simulations and posterior predictive distribution approximations, and on kanga for MCMC diagnostics. For more details, see https://github.com/papamarkou/bnn_mcmc_examples.

Optimization via SGD for the example involving MLP(2, 2, 1) and noisy XOR data (figure 2) is run using PyTorch. The loss function for optimization is computed via torch.nn.BCELoss. This loss function corresponds to the negative log-likelihood function (3.5) involved in MCMC, thus linking the SGD and MH simulations shown in figure 2c. SGD is coded manually instead of calling an optimization algorithm of the torch.optim package of PyTorch. Gradients for optimization are computed calling the backward method. The SGD code related to the example of figure 2 is available at https://github.com/papamarkou/bnn_mcmc_examples.

4.7 Hardware

Pilot MCMC runs indicated a three-fold increase in speed by using CPUs instead of GPUs; accordingly, computations were performed on CPUs for this paper. The GPU slowdown is explained by the overhead of copying PyTorch tensors between GPUs and CPUs for small neural networks, such as the ones used in section 4.

The computations for section 4 were run on Google Cloud Platform (GCP). Eleven virtual machine (VM) instances with virtual CPUs were created on GCP to spread the workload.

Setting aside heterogeneities in hardware configuration between GCP VM instances and in order to provide an indication of computational cost, MCMC simulation runtimes are provided for the example of applying an MLP(6, 2, 2, 3) to the hawk training dataset. The mean runtimes across the ten realized chains per MH and HMC are 0:42:54 and 1:10:48, respectively (runtimes are formatted as

‘hours:minutes:seconds’).

5. PREDICTIVE INFERENCE SCOPE

Bayesian marginalization can attain high predictive accuracy and can quantify predictive uncertainty using non-converged MCMC samples of neural network parameters. Thus, MCMC sampling elicits some information about the parameter posterior of a neural network and conveys such information to the posterior predictive distribution. It is possible that MCMC sampling learns about the statistical dependence among neural network parameters. Along these lines, groups of weights or biases can be formed, with strong within-group and weak between-group dependence, to investigate scalable block Gibbs sampling methods for neural networks.

Another possibility of MCMC developments for neural networks entails shifting attention from the parameter space to the output space, since the latter is related to predictive inference directly. Approximate MCMC methods that measure the discrepancy or Wasserstein distance between neural network predictions and output data (Rudolf and Schweizer, 2018) can be investigated.

Bayesian marginalization provides scope to develop predictive inference for neural networks. For instance, Bayesian marginalization can be examined in the context of approximate MCMC sampling from a neural network parameter posterior, regardless of convergence to the parameter posterior and in analogy to the workings of this paper. Moreover, the idea of Wilson and Izmailov (2020) to interpret ensemble training of neural networks from a viewpoint of Bayesian marginalization can be studied using the notion of quantization of probability distributions.

APPENDIX A: POWER POSTERIORS

This appendix provides the probability mass function $p_i(j)$ for proposing a chain $j$ for a possible swap of states between chains $i$ and $j$ in PP sampling. Assuming $m + 1$ power posteriors, a neighbouring chain $j$ of $i$ is chosen randomly from the categorical probability mass function $p_i = \mathcal{C}(\alpha_i(0), \alpha_i(1), \ldots, \alpha_i(i - 1), \alpha_i(i + 1), \ldots, \alpha_i(m))$ with event probabilities

\[
\alpha_i(j) = \frac{\exp(-\beta |j - i|)}{\gamma_i},
\]

where $i \in \{0, 1, \ldots, m\}$, $j \in \{0, 1, \ldots, m\} \setminus \{i\}$, $\beta$ is a hyperparameter and $\gamma_i$ is a normalizing constant. The hyperparameter $\beta$ is typically set to $\beta = 0.5$, a value which makes a jump to $j = i \pm 1$ roughly three times more likely than a jump to $j = i \pm 3$ (Friel and Pettitt, 2008).

The normalizing constant $\gamma_i$ is given by

\[
\gamma_i = \frac{\exp(-\beta)(2 - \exp(-\beta i) - \exp(-\beta(m - i)))}{1 - \exp(-\beta)}.
\]

Starting from the fact that the event probabilities $\alpha_i(j)$ add up to one, $\gamma_i$ is derived as follows:

\[
\begin{aligned}
1 = \sum_{\substack{j = 0 \\ j \neq i}}^{m} \alpha_i(j) \Rightarrow \gamma_i
&= \sum_{j=0}^{i-1} \exp(-\beta(i - j)) + \sum_{j=i+1}^{m} \exp(-\beta(j - i)) \\
&= \sum_{j=1}^{i} \exp(-\beta j) + \sum_{j=1}^{m-i} \exp(-\beta j) \\
&= \exp(-\beta) \frac{1 - \exp(-\beta i)}{1 - \exp(-\beta)} + \exp(-\beta) \frac{1 - \exp(-\beta(m - i))}{1 - \exp(-\beta)} \\
&= \frac{\exp(-\beta)(2 - \exp(-\beta i) - \exp(-\beta(m - i)))}{1 - \exp(-\beta)}.
\end{aligned}
\]

APPENDIX B: PREDICTIVE DISTRIBUTION

This appendix derives the posterior predictive distribution (3.9). Applying the law of total probability and the definition of conditional probability yields

\[
\begin{aligned}
p(y \mid x, D_{1:s}) &= \int p(y, \theta \mid x, D_{1:s}) \, d\theta \\
&= \int p(y \mid x, D_{1:s}, \theta) \, p(\theta \mid x, D_{1:s}) \, d\theta.
\end{aligned}
\]

$p(y \mid x, D_{1:s}, \theta)$ is equal to the likelihood $p(y \mid x, \theta)$:

\[
\begin{aligned}
p(y \mid x, D_{1:s}, \theta) &= \frac{p(y, D_{1:s} \mid x, \theta)}{p(D_{1:s} \mid x, \theta)} \\
&= \frac{p(y \mid x, \theta) \, p(D_{1:s} \mid x, \theta)}{p(D_{1:s} \mid x, \theta)} \\
&= p(y \mid x, \theta).
\end{aligned}
\]
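The proposal probabilities of Appendix A can be checked numerically with a short Python sketch. It is an illustration only and is not taken from eeyore: the function names normalizing_constant, proposal_pmf and propose_chain are hypothetical, and the pmf is returned as a dictionary for readability. The closed form of $\gamma_i$ is used, so the event probabilities $\alpha_i(j)$ should add up to one for every $i$.

```python
import math
import random

def normalizing_constant(i, m, beta=0.5):
    # Closed form of gamma_i for chains indexed 0, 1, ..., m.
    decay = math.exp(-beta)
    return decay * (2.0 - math.exp(-beta * i) - math.exp(-beta * (m - i))) / (1.0 - decay)

def proposal_pmf(i, m, beta=0.5):
    # Event probabilities alpha_i(j) for every chain j different from i.
    gamma_i = normalizing_constant(i, m, beta)
    return {j: math.exp(-beta * abs(j - i)) / gamma_i for j in range(m + 1) if j != i}

def propose_chain(i, m, beta=0.5, rng=random):
    # Draw the chain j with which chain i proposes to swap states.
    pmf = proposal_pmf(i, m, beta)
    chains = list(pmf)
    return rng.choices(chains, weights=[pmf[j] for j in chains], k=1)[0]
```

For ten power posteriors, $m = 9$. The ratio $\alpha_i(i \pm 1) / \alpha_i(i \pm 3)$ equals $\exp(2\beta)$, which is about 2.72 for $\beta = 0.5$, in line with the "roughly three times" statement of Appendix A.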

Furthermore, $p(\theta \mid x, D_{1:s})$ is equal to the parameter posterior $p(\theta \mid D_{1:s})$:

\[
\begin{aligned}
p(\theta \mid x, D_{1:s}) &= \frac{p(\theta, x, D_{1:s})}{p(x, D_{1:s})} \\
&= \frac{p(\theta, x \mid D_{1:s})\,p(D_{1:s})}{p(x)\,p(D_{1:s})} \\
&= \frac{p(\theta \mid D_{1:s})\,p(x \mid D_{1:s})}{p(x)} \\
&= \frac{p(\theta \mid D_{1:s})\,p(x, D_{1:s})}{p(x)\,p(D_{1:s})} \\
&= \frac{p(\theta \mid D_{1:s})\,p(x)\,p(D_{1:s})}{p(x)\,p(D_{1:s})} \\
&= p(\theta \mid D_{1:s}).
\end{aligned}
\]

ACKNOWLEDGEMENTS

Research sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory, managed by UT-Battelle, LLC, for the US Department of Energy under contract DE-AC05-00OR22725.

The first author would like to thank Google for the provision of free credit on Google Cloud Platform.

REFERENCES

Andrieu, C., de Freitas, J. F. G. and Doucet, A. (1999). Sequential Bayesian estimation and model selection applied to neural networks.
Andrieu, C., de Freitas, N. and Doucet, A. (2000). Reversible jump MCMC simulated annealing for neural networks. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence 11–18.
Ashukha, A., Lyzhov, A., Molchanov, D. and Vetrov, D. (2020). Pitfalls of in-domain uncertainty estimation and ensembling in deep learning. In International Conference on Learning Representations.
Badrinarayanan, V., Mishra, B. and Cipolla, R. (2015). Symmetry-invariant optimization in deep networks. arXiv.
Bennett, J. E., Racine-Poon, A. and Wakefield, J. C. (1996). MCMC for nonlinear hierarchical models. In Markov chain Monte Carlo in practice (W. R. Gilks, S. Richardson and D. Spiegelhalter, eds.) 339–358. Chapman and Hall/CRC.
Bernardo, J. M. (1979). Reference posterior distributions for Bayesian inference. Journal of the Royal Statistical Society. Series B (Methodological) 41 113–147.
Blei, D. M., Kucukelbir, A. and McAuliffe, J. D. (2017). Variational inference: a review for statisticians. Journal of the American Statistical Association 112 859–877.
Blier, L. and Ollivier, Y. (2018). The description length of deep learning models. In Advances in Neural Information Processing Systems 31.
Brea, J., Simsek, B., Illing, B. and Gerstner, W. (2019). Weight-space symmetry in deep networks gives rise to permutation saddles, connected by equal-loss valleys across the loss landscape. arXiv.
Brooks, S. P. and Gelman, A. (1998). General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics 7 434–455.
Cannon, A., Cobb, G., Hartlaub, B., Legler, J., Lock, R., Moore, T., Rossman, A. and Witmer, J. (2019). Stat2Data: datasets for Stat2. R package version 2.0.0.
Chen, T., Fox, E. and Guestrin, C. (2014). Stochastic gradient Hamiltonian Monte Carlo. In Proceedings of the 31st International Conference on Machine Learning 32 1683–1691.
Chen, A. M., Lu, H. and Hecht-Nielsen, R. (1993). On the geometry of feedforward neural network error surfaces. Neural Computation 5 910–927.
Chen, W. Y., Barp, A., Briol, F.-X., Gorham, J., Girolami, M., Mackey, L. and Oates, C. (2019). Stein point Markov chain Monte Carlo. In Proceedings of the 36th International Conference on Machine Learning 97 1011–1021.
Chollet, F. (2017). Xception: deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 1251–1258.
Chwialkowski, K., Strathmann, H. and Gretton, A. (2016). A kernel test of goodness of fit. In Proceedings of the 33rd International Conference on Machine Learning 48 2606–2615.
Cowles, M. K. and Carlin, B. P. (1996). Markov chain Monte Carlo convergence diagnostics: a comparative review. Journal of the American Statistical Association 91 883–904.
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 2 303–314.
Dai, N. and Jones, G. L. (2017). Multivariate initial sequence estimators in Markov chain Monte Carlo. Journal of Multivariate Analysis 159 184–199.
Daniels, M. J. and Kass, R. E. (1998). A note on first-stage approximation in two-stage hierarchical models. Sankhyā: The Indian Journal of Statistics, Series B (1960-2002) 60 19–30.
de Freitas, N. (1999). Bayesian methods for neural networks, PhD thesis, University of Cambridge.
de Freitas, N., Andrieu, C., Højen-Sørensen, P., Niranjan, M. and Gee, A. (2001). Sequential Monte Carlo methods for neural networks. In Sequential Monte Carlo Methods in Practice 359–379.
De Sa, C., Chen, V. and Wong, W. (2018). Minibatch Gibbs sampling on large graphical models. In Proceedings of the 35th International Conference on Machine Learning 80 1165–1173.
Dupuy, C. and Bach, F. (2017). Online but accurate inference for latent variable models with local Gibbs sampling. Journal of Machine Learning Research 18 1–45.
Ensign, D., Neville, S., Paul, A. and Venkatasubramanian, S. (2017). The complexity of explaining neural networks through (group) invariants. In Proceedings of the 28th International Conference on Algorithmic Learning Theory 76 341–359.
Esmaeili, B., Wu, H., Jain, S., Bozkurt, A., Siddharth, N., Paige, B., Brooks, D. H., Dy, J. and van de Meent, J.-W. (2019). Structured disentangled representations. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics 89 2525–2534.
Freeman, I., Roese-Koerner, L. and Kummert, A. (2018). Effnet: an efficient structure for convolutional neural networks. In 25th IEEE International Conference on Image Processing 6–10.
Friel, N. and Pettitt, A. N. (2008). Marginal likelihood estimation via power posteriors. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70 589–607.
Gelman, A. and Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science 7 457–472.
Gelman, A., Carlin, J. B., Stern, H. S. and Rubin, D. B. (2004). Bayesian data analysis, 2nd ed. Chapman and Hall/CRC.
Gilks, W. R., Richardson, S. and Spiegelhalter, D., eds. (1996). Markov chain Monte Carlo in practice. Chapman and Hall/CRC.
Gilks, W. R. and Roberts, G. O. (1996). Strategies for improving MCMC. In Markov chain Monte Carlo in practice (W. R. Gilks, S. Richardson and D. Spiegelhalter, eds.) 89–114. Chapman and Hall/CRC.
Giordano, R. J., Broderick, T. and Jordan, M. I. (2015). Linear response methods for accurate covariance estimates from mean field variational Bayes. In Advances in Neural Information Processing Systems 28 1441–1449.
Gong, L. and Flegal, J. M. (2016). A practical sequential stopping rule for high-dimensional Markov chain Monte Carlo. Journal of Computational and Graphical Statistics 25 684–700.
Gong, W., Li, Y. and Hernández-Lobato, J. M. (2019). Meta-learning for stochastic gradient MCMC. In International Conference on Learning Representations.
Goodfellow, I., Bengio, Y. and Courville, A. (2016). Deep learning. MIT Press.
Graf, S. and Luschgy, H. (2007). Foundations of quantization for probability distributions. Springer.
Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B. and Smola, A. (2012). A kernel two-sample test. Journal of Machine Learning Research 13 723–773.
Gu, S. S., Ghahramani, Z. and Turner, R. E. (2015). Neural adaptive sequential Monte Carlo. In Advances in Neural Information Processing Systems 28 2629–2637.
Hastie, T., Tibshirani, R. and Friedman, J. (2016). The elements of statistical learning: data mining, inference and prediction, 2nd ed. Springer.
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57 97–109.
Hecht-Nielsen, R. (1990). On the algebraic structure of feedforward network weight spaces. In Advanced Neural Computers 129–135.
Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks 4 251–257.
Horst, A. M., Hill, A. P. and Gorman, K. B. (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0.
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M. and Adam, H. (2017). Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv.
Hu, S. X., Zagoruyko, S. and Komodakis, N. (2019). Exploring weight symmetry in deep neural networks. Computer Vision and Image Understanding 187 102786.
Huang, C.-W., Sankaran, K., Dhekane, E., Lacoste, A. and Courville, A. (2019). Hierarchical importance weighted autoencoders. In Proceedings of the 36th International Conference on Machine Learning 97 2869–2878.
Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J. and Keutzer, K. (2016). SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv.
Izmailov, P., Maddox, W. J., Kirichenko, P., Garipov, T., Vetrov, D. and Wilson, A. G. (2020). Subspace inference for Bayesian deep learning. In Proceedings of the 35th Uncertainty in Artificial Intelligence Conference 115 1169–1179.
Jarrett, K., Kavukcuoglu, K., Ranzato, M. and LeCun, Y. (2009). What is the best multi-stage architecture for object recognition? In IEEE 12th International Conference on Computer Vision 2146–2153.
Jaynes, E. T. (1968). Prior probabilities. IEEE Transactions on Systems Science and Cybernetics 4 227–241.
Jeffreys, H. (1962). The theory of probability, 3rd ed. OUP Oxford.
Johndrow, J. E., Pillai, N. S. and Smith, A. (2020). No free lunch for approximate MCMC. arXiv.
Kass, R. E., Carlin, B. P., Gelman, A. and Neal, R. M. (1998). Markov chain Monte Carlo in practice: a roundtable discussion. The American Statistician 52 93–100.
Krizhevsky, A., Sutskever, I. and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 1097–1105.
Lee, H. K. H. (2000). Consistency of posterior distributions for neural networks. Neural Networks 13 629–642.
Lee, H. K. H. (2003). A noninformative prior for neural networks. Machine Learning 50 197–212.
Lee, H. K. (2004). Priors for neural networks. In Classification, Clustering, and Data Mining Applications (D. Banks, F. R. McMorris, P. Arabie and W. Gaul, eds.) 141–150.
Lee, H. K. (2005). Neural networks and default priors. In Proceedings of the American Statistical Association, Section on Bayesian Statistical Science.

Lee, H. K. (2007). Default priors for neural network classification. Journal of Classification 24 53–70.
Lu, Z., Pu, H., Wang, F., Hu, Z. and Wang, L. (2017). The expressive power of neural networks: a view from the width. In Advances in Neural Information Processing Systems 30 6231–6239.
Ma, Y.-A., Foti, N. J. and Fox, E. B. (2017). Stochastic gradient MCMC methods for hidden Markov models. In Proceedings of the 34th International Conference on Machine Learning 70 2265–2274.
MacKay, D. J. (1995). Developments in probabilistic modelling with neural networks: ensemble learning. In Neural Networks: Artificial Intelligence and Industrial Applications 191–198.
Maddison, C. J., Huang, A., Sutskever, I. and Silver, D. (2015). Move evaluation in Go using deep convolutional neural networks. In International Conference on Learning Representations.
Mandt, S., Hoffman, M. D. and Blei, D. M. (2017). Stochastic gradient descent as approximate Bayesian inference. Journal of Machine Learning Research 18 1–35.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics 21 1087–1092.
Minsky, M. L. and Papert, S. A. (1988). Perceptrons: expanded edition. MIT Press.
Moore, D. A. (2016). Symmetrized variational inference. In NIPS Workshop on Advances in Approximate Bayesian Inference.
Nair, V. and Hinton, G. E. (2009). 3D object recognition with deep belief nets. In Advances in Neural Information Processing Systems 22 1339–1347.
Nalisnick, E. T. (2018). On priors for Bayesian neural networks, PhD thesis, UC Irvine.
Neal, R. M. (2011). MCMC using Hamiltonian dynamics. In Handbook of Markov Chain Monte Carlo 5. CRC Press.
Nemeth, C. and Sherlock, C. (2018). Merging MCMC subposteriors through Gaussian-process approximations. Bayesian Analysis 13 507–530.
Nwankpa, C., Ijomah, W., Gachagan, A. and Marshall, S. (2018). Activation functions: comparison of trends in practice and research for deep learning. arXiv.
Ong, V. M. H., Nott, D. J. and Smith, M. S. (2018). Gaussian variational approximation with a factor covariance structure. Journal of Computational and Graphical Statistics 27 465–478.
Pearce, T., Zaki, M., Brintrup, A. and Neely, A. (2019). Expressive priors in Bayesian neural networks: kernel combinations and periodic functions. In Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence.
Polson, N. G. and Sokolov, V. (2017). Deep learning: a Bayesian perspective. Bayesian Analysis 12 1275–1304.
Pourzanjani, A. A., Jiang, R. M. and Petzold, L. R. (2017). Improving the identifiability of neural networks for Bayesian inference. In NIPS Workshop on Bayesian Deep Learning.
Quiroz, M., Kohn, R., Villani, M. and Tran, M.-N. (2019). Speeding up MCMC by efficient data subsampling. Journal of the American Statistical Association 114 831–843.
Ranganath, R., Tran, D. and Blei, D. (2016). Hierarchical variational models. In Proceedings of the 33rd International Conference on Machine Learning 48 324–333.
Robert, C. P., Elvira, V., Tawn, N. and Wu, C. (2018). Accelerating MCMC algorithms. Wiley Interdisciplinary Reviews: Computational Statistics 10 e1435.
Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review 65 386.
Rudolf, D. and Schweizer, N. (2018). Perturbation theory for Markov chains via Wasserstein distance. Bernoulli 24 2610–2639.
Sargent, D. J., Hodges, J. S. and Carlin, B. P. (2000). Structured Markov chain Monte Carlo. Journal of Computational and Graphical Statistics 9 217–234.
Seita, D., Pan, X., Chen, H. and Canny, J. (2018). An efficient minibatch acceptance test for Metropolis-Hastings. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence 5359–5363.
Sen, D., Papamarkou, T. and Dunson, D. (2020). Bayesian neural networks and dimensionality reduction. arXiv.
Simpson, D., Rue, H., Riebler, A., Martins, T. G. and Sørbye, S. H. (2017). Penalising model component complexity: a principled, practical approach to constructing priors. Statistical Science 32 1–28.
Smith, J. W., Everhart, J., Dickson, W., Knowler, W. and Johannes, R. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Annual Symposium on Computer Application in Medical Care 261.
Stephens, M. (2000). Dealing with label switching in mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 62 795–809.
Titsias, M. K. and Ruiz, F. (2019). Unbiased implicit variational inference. In Proceedings of Machine Learning Research 89 167–176.
Titterington, D. M. (2004). Bayesian methods for neural networks and related models. Statistical Science 19 128–139.
Truong, T.-D., Nguyen, V.-T. and Tran, M.-T. (2018). Lightweight deep convolutional network for tiny object recognition. In Proceedings of the 7th International Conference on Pattern Recognition Applications and Methods 675–682.
Vats, D. and Flegal, J. M. (2018). Lugsail lag windows and their application to MCMC. arXiv.
Vats, D., Flegal, J. M. and Jones, G. L. (2019). Multivariate output analysis for Markov chain Monte Carlo. Biometrika 106 321–337.
Vats, D. and Knudson, C. (2018). Revisiting the Gelman-Rubin diagnostic. arXiv.
Vehtari, A., Gelman, A., Simpson, D., Carpenter, B. and Bürkner, P.-C. (2019). Rank-normalization, folding, and localization: an improved $\widehat{R}$ for assessing convergence of MCMC. arXiv.

Vladimirova, M., Verbeek, J., Mesejo, P. and Arbel, J. (2019). Understanding priors in Bayesian neural networks at the unit level. In Proceedings of the 36th International Conference on Machine Learning 97 6458–6467.
Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning 681–688.
Williams, P. M. (1995). Bayesian regularization and pruning using a Laplace prior. Neural Computation 7 117–143.
Williams, C. K. I. (2000). An MCMC approach to hierarchical mixture modelling. In Advances in Neural Information Processing Systems 12 680–686.
Wilson, A. G. and Izmailov, P. (2020). Bayesian deep learning and a probabilistic perspective of generalization. arXiv.
Zhang, G., Sun, S., Duvenaud, D. and Grosse, R. (2018a). Noisy natural gradient as variational inference. In Proceedings of the 35th International Conference on Machine Learning 80 5852–5861.
Zhang, X., Zhou, X., Lin, M. and Sun, J. (2018b). Shufflenet: an extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 6848–6856.