Arxiv:1811.08687V3 [Cs.LG] 14 May 2020 Ing Process

Surrogate-assisted parallel tempering for Bayesian neural learning Rohitash Chandraa, Konark Jaind,b, Arpit Kapoorb,c, Ashray Amanb,e aSchool of Mathematics and Statistics, University of New South Wales, Sydney, NSW 2052, Australia bCentre for Translational Data Science, The University of Sydney, NSW 2006, Australia cDepartment of Computer Science and Engineering, SRM Institute of Science and Technology, Chennai, Tamil Nadu, India d Department of Electronics and Electrical Engineering, Indian Institute of Technology Guwahati, Assam, India e Department of Mathematics, Indian Institute of Technology Delhi, Delhi, India Abstract Due to the need for robust uncertainty quantification, Bayesian neural learning has gained attention in the era of deep learning and big data. Markov Chain Monte-Carlo (MCMC) methods typically implement Bayesian inference which faces several challenges given a large number of parameters, complex and multimodal posterior distributions, and computational complexity of large neural network models. Parallel tempering MCMC addresses some of these limitations given that they can sample multimodal posterior distributions and utilize high-performance computing. However, certain challenges remain given large neural network models and big data. Surrogate-assisted optimization features the estimation of an objective function for models which are computationally expensive. In this paper, we address the inefficiency of parallel tempering MCMC for large-scale problems by combining parallel computing features with surrogate assisted likelihood estimation that describes the plausibility of a model parameter value, given specific observed data. Hence, we present surrogate-assisted parallel tempering for Bayesian neural learning for simple to computationally expensive models. Our results demonstrate that the methodology significantly lowers the computational cost while maintaining quality in decision making with Bayesian neural networks. The method has applications for a Bayesian inversion and uncertainty quantification for a broad range of numerical models. Keywords: Bayesian neural networks, parallel tempering, MCMC, Surrogate-assisted optimization, parallel computing 1. Introduction gradient-based counterparts since convergence becomes computationally expensive for a large number of model parameters Although neural networks have gained significant attention and multimodal posteriors [11]. MCMC methods typically re- due to the deep learning revolution [1], several limitations ex- quire thousands of samples to be drawn depending on the model ist. The challenge widens for uncertainty quantification in de- which becomes a major limitation in applications such as deep cision making given the development of new neural network learning [1, 12, 13]. Hence, other methods for implementing architectures and learning algorithms. Bayesian neural learn- Bayesian inference exist, such as variational inference [14, 15] ing provides a probabilistic viewpoint with the representation which has been used for deep learning [16]. Variational infer- of neural network weights as probability distributions [2, 3]. ence has the advantage of faster convergence when compared Rather than single-point estimates by gradient-based learning to MCMC methods for large models. However, in computa- methods, the probability distributions naturally account for un- tionally expensive models, variational inference methods would certainty in parameter estimates. have a similar problem as MCMC, since both would need to Through Bayesian neural learning, uncertainty regarding the evaluate model samples for the likelihood; hence, both would data and model can be propagated into the decision mak- benefit from surrogate-based models. arXiv:1811.08687v3 [cs.LG] 14 May 2020 ing process. Markov Chain Monte Carlo sampling methods Parallel tempering is an MCMC method that [17, 18, 19] fea- (MCMC) implement Bayesian inference [4, 5, 6, 7] by con- tures multiple replicas to provide global and local exploration structing a Markov chain, such that the desired distribution which makes them suitable for irregular and multi-modal dis- becomes the equilibrium distribution after a number of steps tributions [20, 21]. During sampling, parallel tempering fea- [8, 9]. MCMC methods provide numerical approximations of tures the exchange of neighbouring replicas to feature explo- multi-dimensional integrals [10]. MCMC methods have not ration and exploitation in the search space. The replicas with gained as much attention in neural learning when compared to higher temperature values make sure that there is enough exploration, while replicas with lower temperature values exploit the promising areas found during the exploration. In contrast Email addresses: [email protected] (Rohitash to canonical MCMC sampling methods, parallel tempering is Chandra), [email protected] (Konark Jain), more easily implemented in a multi-core or parallel computing [email protected] (Arpit Kapoor), [email protected] (Ashray Aman) architecture [22]. In the case of neural networks, parallel tem- URL: rohitash-chandra.github.io (Rohitash Chandra) pering was used for inference of restricted Boltzmann machines Preprint submitted to Eng. App. AI May 15, 2020 (RBMs) [23, 24] where it was shown that [25] parallel temper- ally expensive models, we demonstrate its effectiveness using ing is more effective than Gibbs sampling by itself. Parallel a neural network model for classification problems. The major tempering for RBMs has been improved by featuring the effi- contribution of this paper is to address the limitations of parallel cient exchange of information among the replicas [26]. These tempering given computationally expensive models. studies motivated the use of parallel tempering in Bayesian neu- The rest of the paper is organised as follows. Section 2 pro- ral learning for pattern classification and regression tasks [27]. vides background and related work, while Section 3 presents Surrogate assisted optimisation [28, 29] considers the use of the proposed methodology. Section 4 presents experiments and machine learning methods such as Gaussian process and neural results and Section 5 concludes the paper with discussion for network models to estimate the objective function during op- future research. timisation. This is handy when the evaluation of the objective function is too time-consuming. In the past, metaheuristic and 2. Related work evolutionary optimization methods have been used in surrogate assisted optimization [30, 31]. Surrogate assisted optimization 2.1. Bayesian neural learning has been useful for the fields of engine and aerospace design In Bayesian inference, we update the probability for a hy- to replicate computationally expensive models [32, 33, 34, 28]. pothesis as more evidence or information becomes available The optimisation literature motivates improving parallel tem- [39]. We estimate the posterior distribution by sampling us- pering using a low-cost replica of the actual model via a surro- ing prior distribution and a ’likelihood function’ that evalu- gate to lower the computational costs. ates the model with observed data. A probabilistic perspective In the case of conventional Bayesian neural learning, much of treats learning as equivalent to maximum likelihood estimation the literature concentrated on smaller problems such as datasets (MLE) [40]. Given that the neural network is the model, we and network architecture [35, 36, 2, 3] due to computational base the prior distribution on belief or expert opinion without efficiency of MCMC methods. Therefore, parallel computing observing the evidence or training data [35]. An example of in- has been used in the implementation of parallel tempering for formation or belief for the prior in the case of neural networks Bayesian neural learning [27], where the computational time is the concept of weight decay that states that smaller weights was significantly decreased due to parallelisation. Besides, the are better for generalization [41, 2, 2, 42, 43]. method achieved better prediction accuracy and convergence Due to limitations in MCMC sampling methods, progress due to the exploration features of parallel tempering. We be- in development and applications of Bayesian neural learning lieve that this can be further improved through incorporating has been slow, especially when considering larger neural net- notions from surrogate assisted optimisation in parallel temper- work architectures, big data and deep learning. Several tech- ing, where the likelihood function at certain times is estimated niques have been applied to improve MCMC sampling meth- rather than evaluated in a high-performance computing envi- ods by incorporating approaches from the optimisation litera- ronment. ture. Neal et al. [44] presented Hamiltonian dynamics that We note that some work has been done using surrogate as- involve using gradient information for constructing efficient sisted Bayesian inference. Zeng et. al [37] presented a method MCMC proposals during sampling. Gradient-based learning for material identification using surrogate assisted Bayesian in- using Langevin dynamics refer to use of gradient information ference for estimating the parameters of advanced high strength with Gaussian noise [45]; Chandra et al. employed Langevin steel used in vehicles. Ray and Myer [38] used Gaussian dynamics for Bayesian neural networks for time series predic- process-based surrogate models with MCMC for geophysical tion [46]. Hinton et al.[47] used complementary

Arxiv:1811.08687V3 [Cs.LG] 14 May 2020 Ing Process

Parallel-Tempered Stochastic Gradient Hamiltonian Monte Carlo for Approximate Multimodal Posterior Sampling

Chapter 2 Monte Carlo Simulations

LAMMPS – an Object Oriented Scientific Application

Desmond Users Guide Release 3.4.0 / 0.7

Hybrid Parallel Tempering and Simulated Annealing Method

The American Ceramic Society 38Th International Conference

Comparing Monte Carlo Methods for Finding Ground States of Ising Spin

Multiscale Implementation of Infinite-Swap Replica Exchange Molecular Dynamics

Parallel Tempering: Theory, Applications, and New Perspectives PCCP

PLUMED Tutorial

Conditions for Rapid Mixing of Parallel and Simulated Tempering on Multimodal Distributions

Qualitative Properties of Parallel Tempering and Its Infinite Swapping