Challenges in Markov chain Monte Carlo for Bayesian neural networks

Theodore Papamarkou, Jacob Hinkle, M. Todd Young and David Womble

Department of Mathematics, The University of Manchester, Manchester, UK, and Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA

Abstract. Markov chain Monte Carlo (MCMC) methods have not been broadly adopted in Bayesian neural networks (BNNs). This paper initially reviews the main challenges in sampling from the parameter posterior of a neural network via MCMC. Such challenges culminate in a lack of convergence to the parameter posterior. Nevertheless, this paper shows that a non-converged Markov chain, generated via MCMC sampling from the parameter space of a neural network, can yield via Bayesian marginalization a valuable posterior predictive distribution of the output of the neural network. Classification examples based on multilayer perceptrons showcase highly accurate posterior predictive distributions. The postulate of limited scope for MCMC developments in BNNs is partially valid; an asymptotically exact parameter posterior seems less plausible, yet an accurate posterior predictive distribution is a tenable research avenue.

Key words and phrases: Bayesian inference, Bayesian neural networks, convergence diagnostics, Markov chain Monte Carlo, posterior predictive distribution.

1. MOTIVATION

The universal approximation theorem (Cybenko, 1989) and its subsequent extensions (Hornik, 1991; Lu et al., 2017) state that feedforward neural networks with exponentially large width and width-bounded deep neural networks can approximate any continuous function arbitrarily well. This universal approximation capacity of neural networks, along with available computing power, explains the widespread use of deep learning nowadays.

Bayesian inference for neural networks is typically performed via stochastic Bayesian optimization, stochastic variational inference (Polson and Sokolov, 2017) or ensemble methods (Ashukha et al., 2020; Wilson and Izmailov, 2020). MCMC methods have been explored in the context of neural networks, but have not become part of the Bayesian deep learning toolbox.

The slower evolution of MCMC methods for neural networks is partly attributed to the lack of scalability of existing MCMC algorithms for big data and for high-dimensional parameter spaces. Furthermore, additional factors hinder the adaptation of existing MCMC methods to deep learning, including the hierarchical structure of neural networks and the associated covariance between parameters, lack of identifiability arising from weight symmetries, lack of a priori knowledge about the parameter space, and ultimately lack of convergence.

The purpose of this paper is twofold. Initially, a literature review is conducted to identify inferential challenges in MCMC developments for neural networks. Subsequently, Bayesian marginalization based on MCMC samples of neural network parameters is used for attaining accurate posterior predictive distributions of the respective neural network output, despite the lack of convergence of the MCMC samples to the parameter posterior.
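To fix ideas, the sketch below illustrates the kind of Bayesian marginalization referred to here: the posterior predictive distribution $p(y \mid x, D) = \int p(y \mid x, \theta)\, p(\theta \mid D)\, d\theta$ is approximated by averaging the likelihood of the output over MCMC draws of the parameters. This is a minimal illustration rather than the authors' implementation; the function `mlp_predict` and the array `theta_samples` are hypothetical placeholders.

```python
import numpy as np

def posterior_predictive(x, theta_samples, mlp_predict):
    """Monte Carlo approximation of the posterior predictive p(y | x, D).

    p(y | x, D) = ∫ p(y | x, θ) p(θ | D) dθ ≈ (1/M) Σ_m p(y | x, θ_m),
    where θ_1, ..., θ_M are MCMC draws from the parameter posterior.
    """
    # mlp_predict(x, theta) returns the class probabilities p(y | x, theta)
    # of the neural network evaluated at input x with parameters theta.
    probs = np.stack([mlp_predict(x, theta) for theta in theta_samples])
    return probs.mean(axis=0)  # average over the M parameter draws
```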
An outline of the paper layout follows. Section 2 reviews the inferential challenges arising from the application of MCMC to neural networks. Section 3 provides an overview of the employed inferential framework, including the multilayer perceptron (MLP) model and its likelihood for binary and multiclass classification, the MCMC algorithms for sampling from MLP parameters, the multivariate MCMC diagnostics for assessing convergence and sampling effectiveness, and the Bayesian marginalization for attaining posterior predictive distributions of MLP outputs. Section 4 showcases Bayesian parameter estimation via MCMC and Bayesian predictions via marginalization by fitting different MLPs to four datasets. Section 5 posits predictive inference for neural networks, among other approaches, by combining Bayesian marginalization with approximate MCMC sampling or with ensemble training.

2. PARAMETER INFERENCE CHALLENGES

A literature review of inferential challenges in the application of MCMC methods to neural networks is conducted in this section thematically, with each subsection being focused on a different challenge.

2.1 Computational cost

Existing MCMC algorithms do not scale with an increasing number of parameters or of data points. For this reason, approximate inference methods, including variational inference (VI), are preferred in high-dimensional parameter spaces or in big data problems from a time complexity standpoint (MacKay, 1995; Blei, Kucukelbir and McAuliffe, 2017; Blier and Ollivier, 2018). On the other hand, MCMC methods are better than VI in terms of approximating the log-likelihood (Dupuy and Bach, 2017).

Literature on MCMC methods for neural networks is limited due to the associated computational complexity. Sequential Monte Carlo and reversible jump MCMC have been applied to two types of neural network architectures, namely MLPs and radial basis function networks (RBFs); see for instance Andrieu, de Freitas and Doucet (1999); de Freitas (1999); Andrieu, de Freitas and Doucet (2000); de Freitas et al. (2001). For a review of Bayesian approaches to neural networks, see Titterington (2004).

Many research developments have been made to scale MCMC algorithms to big data. The main focus has been on designing Metropolis-Hastings or Gibbs sampling variants that evaluate a costly log-likelihood on a subset (minibatch) of the data rather than on the entire data set (Welling and Teh, 2011; Chen, Fox and Guestrin, 2014; Ma, Foti and Fox, 2017; Mandt, Hoffman and Blei, 2017; De Sa, Chen and Wong, 2018; Nemeth and Sherlock, 2018; Robert et al., 2018; Seita et al., 2018; Quiroz et al., 2019). Among minibatch MCMC algorithms for big data applications, there exists a subset of studies applying such algorithms to neural networks (Chen, Fox and Guestrin, 2014; Gu, Ghahramani and Turner, 2015; Gong, Li and Hernández-Lobato, 2019). Minibatch MCMC approaches to neural networks pave the way towards data-parallel deep learning. On the other hand, to the best of the authors' knowledge, there is no published research on MCMC methods that evaluate the log-likelihood on a subset of neural network parameters rather than on the whole set of parameters, and therefore no reported research on model-parallel deep learning via MCMC.
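As a concrete illustration of the minibatch idea, the following sketch outlines stochastic gradient Langevin dynamics (Welling and Teh, 2011), the earliest of the variants cited above. It is a simplified rendition, not the authors' code: the functions `grad_log_prior` and `grad_log_lik` are hypothetical, and a fixed step size is used, whereas the original algorithm decreases the step size over iterations.

```python
import numpy as np

def sgld(theta0, data, grad_log_prior, grad_log_lik, step_size=1e-5,
         batch_size=32, n_iters=10_000, rng=None):
    """Sketch of stochastic gradient Langevin dynamics (Welling and Teh, 2011).

    Each iteration forms an unbiased minibatch estimate of the gradient of the
    log-posterior and adds Gaussian noise whose variance equals the step size.
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = np.array(theta0, dtype=float)
    n = len(data)  # data: array whose first axis indexes observations
    samples = []
    for _ in range(n_iters):
        idx = rng.choice(n, size=batch_size, replace=False)
        # Rescale the minibatch log-likelihood gradient by n / batch_size.
        grad = grad_log_prior(theta) + (n / batch_size) * grad_log_lik(theta, data[idx])
        noise = rng.normal(scale=np.sqrt(step_size), size=theta.shape)
        theta = theta + 0.5 * step_size * grad + noise
        samples.append(theta.copy())
    return np.array(samples)
```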
Minibatch MCMC has been studied analytically by Johndrow, Pillai and Smith (2020). Their theoretical findings point out that some minibatching schemes can yield inexact approximations and that minibatch MCMC cannot greatly expedite the rate of convergence.

2.2 Model structure

A neural network with ρ layers can be viewed as a hierarchical model with ρ levels, each network layer representing a level (Williams, 2000). Due to its nested layers and its non-linear activations, a neural network is a non-linear hierarchical model.

MCMC methods for non-linear hierarchical models have been developed; see for example Bennett, Racine-Poon and Wakefield (1996); Gilks and Roberts (1996); Daniels and Kass (1998); Sargent, Hodges and Carlin (2000). However, existing MCMC methods for non-linear hierarchical models have not been harnessed for neural networks, due to time complexity and convergence implications.

Although not designed to mirror the hierarchical structure of a neural network, recent hierarchical VI (Ranganath, Tran and Blei, 2016; Esmaeili et al., 2019; Huang et al., 2019; Titsias and Ruiz, 2019) provides more general variational approximations of the parameter posterior of the neural network than mean-field VI. Introducing a hierarchical structure in the variational distribution induces correlation among parameters, in contrast to the mean-field variational distribution that assumes independent parameters. So, one of the Bayesian inference strategies for neural networks is to approximate the covariance structure among network parameters. In fact, there are published comparisons between MCMC and VI in terms of speed and accuracy of convergence to the posterior covariance, both for linear or mixture models (Giordano, Broderick and Jordan, 2015; Mandt, Hoffman and Blei, 2017; Ong, Nott and Smith, 2018) and for neural networks (Zhang et al., 2018a).

2.3 Weight symmetries

The output of a feedforward neural network given some fixed input remains unchanged under a set of transformations determined by the choice of activations and by the network architecture. In practice, the parameter space of a neural network is set to be the whole of $\mathbb{R}^n$ rather than a cone $H$ of $\mathbb{R}^n$. Since a neural network likelihood with support in the non-reduced parameter space of $\mathbb{R}^n$ is invariant under weight permutations, sign-flips or other transformations, the posterior landscape includes multiple equally likely modes. This implies low acceptance rates, entrapment in local modes and convergence challenges for MCMC. Additionally, computational time is wasted during MCMC, since posterior modes represent equivalent solutions (Nalisnick, 2018). Such challenges manifest themselves in the MLP examples of Section 4. For neural networks with a larger number $n$ of parameters in $\mathbb{R}^n$, the topology of the likelihood is characterized by local optima embedded in high-dimensional flat plateaus (Brea et al., 2019). Thereby, larger neural networks lead to a multimodal target density with symmetric modes for MCMC.
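The invariances underlying these symmetric modes can be checked numerically. The snippet below, a self-contained example not taken from the paper, builds a one-hidden-layer tanh MLP and verifies that permuting the hidden units consistently across both weight layers, or flipping the sign of one hidden unit's incoming and outgoing weights, leaves the output unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# One-hidden-layer MLP with tanh activations: y = W2 @ tanh(W1 @ x + b1) + b2.
x = rng.normal(size=4)
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)
W2, b2 = rng.normal(size=(3, 8)), rng.normal(size=3)

def mlp(x, W1, b1, W2, b2):
    return W2 @ np.tanh(W1 @ x + b1) + b2

y = mlp(x, W1, b1, W2, b2)

# Symmetry 1: permute the hidden units consistently across both layers.
perm = rng.permutation(8)
y_perm = mlp(x, W1[perm], b1[perm], W2[:, perm], b2)

# Symmetry 2: flip the sign of one hidden unit's incoming and outgoing
# weights; tanh is odd, so tanh(-z) = -tanh(z) and the output is unchanged.
W1_flip, b1_flip, W2_flip = W1.copy(), b1.copy(), W2.copy()
W1_flip[0], b1_flip[0], W2_flip[:, 0] = -W1[0], -b1[0], -W2[:, 0]
y_flip = mlp(x, W1_flip, b1_flip, W2_flip, b2)

print(np.allclose(y, y_perm), np.allclose(y, y_flip))  # True True
```

Each such transformation maps one posterior mode to another mode of equal density, which is the source of the multimodality discussed above.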