Surrogate-assisted parallel tempering for Bayesian neural learning

Rohitash Chandraa, Konark Jaind,b, Arpit Kapoorb,c, Ashray Amanb,e

aSchool of Mathematics and Statistics, University of New South Wales, Sydney, NSW 2052, Australia bCentre for Translational Data Science, The University of Sydney, NSW 2006, Australia cDepartment of Computer Science and Engineering, SRM Institute of Science and Technology, Chennai, Tamil Nadu, India d Department of Electronics and Electrical Engineering, Indian Institute of Technology Guwahati, Assam, India e Department of Mathematics, Indian Institute of Technology Delhi, Delhi, India

Abstract Due to the need for robust uncertainty quantification, Bayesian neural learning has gained attention in the era of deep learning and big data. Monte-Carlo (MCMC) methods typically implement Bayesian inference which faces several challenges given a large number of parameters, complex and multimodal posterior distributions, and computational complexity of large neural network models. Parallel tempering MCMC addresses some of these limitations given that they can sample multimodal posterior distributions and utilize high-performance computing. However, certain challenges remain given large neural network models and big data. Surrogate-assisted optimization features the estimation of an objective function for models which are computationally expensive. In this paper, we address the inefficiency of parallel tempering MCMC for large-scale problems by combining parallel computing features with surrogate assisted likelihood estimation that describes the plausibility of a model parameter value, given specific observed data. Hence, we present surrogate-assisted parallel tempering for Bayesian neural learning for simple to com- putationally expensive models. Our results demonstrate that the methodology significantly lowers the computational cost while maintaining quality in decision making with Bayesian neural networks. The method has applications for a Bayesian inversion and uncertainty quantification for a broad range of numerical models. Keywords: Bayesian neural networks, parallel tempering, MCMC, Surrogate-assisted optimization, parallel computing

1. Introduction gradient-based counterparts since convergence becomes com- putationally expensive for a large number of model parameters Although neural networks have gained significant attention and multimodal posteriors [11]. MCMC methods typically re- due to the deep learning revolution [1], several limitations ex- quire thousands of samples to be drawn depending on the model ist. The challenge widens for uncertainty quantification in de- which becomes a major limitation in applications such as deep cision making given the development of new neural network learning [1, 12, 13]. Hence, other methods for implementing architectures and learning algorithms. Bayesian neural learn- Bayesian inference exist, such as variational inference [14, 15] ing provides a probabilistic viewpoint with the representation which has been used for deep learning [16]. Variational infer- of neural network weights as probability distributions [2, 3]. ence has the advantage of faster convergence when compared Rather than single-point estimates by gradient-based learning to MCMC methods for large models. However, in computa- methods, the probability distributions naturally account for un- tionally expensive models, variational inference methods would certainty in parameter estimates. have a similar problem as MCMC, since both would need to Through Bayesian neural learning, uncertainty regarding the evaluate model samples for the likelihood; hence, both would data and model can be propagated into the decision mak- benefit from surrogate-based models. arXiv:1811.08687v3 [cs.LG] 14 May 2020 ing process. sampling methods Parallel tempering is an MCMC method that [17, 18, 19] fea- (MCMC) implement Bayesian inference [4, 5, 6, 7] by con- tures multiple replicas to provide global and local exploration structing a Markov chain, such that the desired distribution which makes them suitable for irregular and multi-modal dis- becomes the equilibrium distribution after a number of steps tributions [20, 21]. During sampling, parallel tempering fea- [8, 9]. MCMC methods provide numerical approximations of tures the exchange of neighbouring replicas to feature explo- multi-dimensional integrals [10]. MCMC methods have not ration and exploitation in the search space. The replicas with gained as much attention in neural learning when compared to higher temperature values make sure that there is enough ex- ploration, while replicas with lower temperature values exploit the promising areas found during the exploration. In contrast Email addresses: [email protected] (Rohitash to canonical MCMC sampling methods, parallel tempering is Chandra), [email protected] (Konark Jain), more easily implemented in a multi-core or parallel computing [email protected] (Arpit Kapoor), [email protected] (Ashray Aman) architecture [22]. In the case of neural networks, parallel tem- URL: rohitash-chandra.github.io (Rohitash Chandra) pering was used for inference of restricted Boltzmann machines

Preprint submitted to Eng. App. AI May 15, 2020 (RBMs) [23, 24] where it was shown that [25] parallel temper- ally expensive models, we demonstrate its effectiveness using ing is more effective than Gibbs sampling by itself. Parallel a neural network model for classification problems. The major tempering for RBMs has been improved by featuring the effi- contribution of this paper is to address the limitations of parallel cient exchange of information among the replicas [26]. These tempering given computationally expensive models. studies motivated the use of parallel tempering in Bayesian neu- The rest of the paper is organised as follows. Section 2 pro- ral learning for pattern classification and regression tasks [27]. vides background and related work, while Section 3 presents Surrogate assisted optimisation [28, 29] considers the use of the proposed methodology. Section 4 presents experiments and machine learning methods such as Gaussian process and neural results and Section 5 concludes the paper with discussion for network models to estimate the objective function during op- future research. timisation. This is handy when the evaluation of the objective function is too time-consuming. In the past, metaheuristic and 2. Related work evolutionary optimization methods have been used in surrogate assisted optimization [30, 31]. Surrogate assisted optimization 2.1. Bayesian neural learning has been useful for the fields of engine and aerospace design In Bayesian inference, we update the probability for a hy- to replicate computationally expensive models [32, 33, 34, 28]. pothesis as more evidence or information becomes available The optimisation literature motivates improving parallel tem- [39]. We estimate the posterior distribution by sampling us- pering using a low-cost replica of the actual model via a surro- ing prior distribution and a ’likelihood function’ that evalu- gate to lower the computational costs. ates the model with observed data. A probabilistic perspective In the case of conventional Bayesian neural learning, much of treats learning as equivalent to maximum likelihood estimation the literature concentrated on smaller problems such as datasets (MLE) [40]. Given that the neural network is the model, we and network architecture [35, 36, 2, 3] due to computational base the prior distribution on belief or expert opinion without efficiency of MCMC methods. Therefore, parallel computing observing the evidence or training data [35]. An example of in- has been used in the implementation of parallel tempering for formation or belief for the prior in the case of neural networks Bayesian neural learning [27], where the computational time is the concept of weight decay that states that smaller weights was significantly decreased due to parallelisation. Besides, the are better for generalization [41, 2, 2, 42, 43]. method achieved better prediction accuracy and convergence Due to limitations in MCMC sampling methods, progress due to the exploration features of parallel tempering. We be- in development and applications of Bayesian neural learning lieve that this can be further improved through incorporating has been slow, especially when considering larger neural net- notions from surrogate assisted optimisation in parallel temper- work architectures, big data and deep learning. Several tech- ing, where the likelihood function at certain times is estimated niques have been applied to improve MCMC sampling meth- rather than evaluated in a high-performance computing envi- ods by incorporating approaches from the optimisation litera- ronment. ture. Neal et al. [44] presented Hamiltonian dynamics that We note that some work has been done using surrogate as- involve using gradient information for constructing efficient sisted Bayesian inference. Zeng et. al [37] presented a method MCMC proposals during sampling. Gradient-based learning for material identification using surrogate assisted Bayesian in- using Langevin dynamics refer to use of gradient information ference for estimating the parameters of advanced high strength with Gaussian noise [45]; Chandra et al. employed Langevin steel used in vehicles. Ray and Myer [38] used Gaussian dynamics for Bayesian neural networks for time series predic- process-based surrogate models with MCMC for geophysical tion [46]. Hinton et al.[47] used complementary priors for deep inversion problems. The benefits of surrogate assisted meth- belief networks to form an undirected associative memory for ods for computationally expensive optimisation problems mo- handwriting recognition. Furthermore, parallel tempering has tivate parallel tempering for computationally expensive mod- been used for the Gaussian Bernoulli Restricted Boltzmann Ma- els. To our knowledge, there is no work on parallel temper- chines (RBMs) [48]. Prior to this, Cho et al. [49] demonstrated ing with surrogate-models implemented via parallel computing the efficiency of parallel tempering in RBMs. Desjardins et al. for machine learning problems. In the case of parallel tem- [50] utilized parallel tempering for maximum likelihood train- pering that uses parallel computing, the challenge would be in ing of RBMs and later used it for deep learning using RBMs developing a paradigm where different replicas can communi- [51]. cate efficiently. Besides, the task of training the surrogate model In Bayesian neural learning, we estimate the posterior distri- from data across multiple replicas in parallel poses further chal- bution by MCMC sampling using a ’likelihood function’ that lenges. evaluates the model given the observed data and prior distribu- In this paper, we present surrogate-assisted parallel temper- tion. Given input features or covariates (xt), we compute f (xt) ing for Bayesian neural learning where a surrogate is used to by a feedforward neural network with one hidden layer, estimate the likelihood rather than evaluating the actual model that feature a large number of parameters and datasets. We  XH  XI  present a framework that seamlessly incorporates the decision f (xt) = g δo + vhog δh + (wdhxt) (1) making by a master surrogate for parallel processing cores that h=1 d=1 execute the respective replicas of parallel tempering MCMC. where δo and δh are the bias for the output o and hidden h layer, Although the framework is intended for general computation- respectively. vho is the weight which maps the hidden layer h to 2 the output layer. wdh is the weight which maps xt to the hidden exploration, while replicas with lower temperature values ex- layer h and g is the activation function for the hidden and output ploit the promising areas found by exploration. Typically, layer units. the neighbouring replicas are exchanged given the Metropolis- 2 Let θ = (w, v, δ, τ ), with δ = (δo, δh), and L as the number Hastings criterion. In some implementations, non-neighboring of parameters that includes weights and biases. τ2 is a single replicas are also considered for exchange [55, 56]. In such parameter to represent the noise in the predictions given by the cases, we calculate the acceptance probabilities of all possi- neural network model. I, H, O refers to number of input, hidden ble moves a priori. The specific exchange move is then se- and output neurons, respectively. We assume a Gaussian prior lected, which is useful when a limited number of replicas are distribution using, available [57]. Although geometrically uniform temperature levels have been typically used for the respective replicas, de- termining the optimal tempering assigned for each of the repli- L log (p(θ)) = − log(σ ˆ 2) cas has been a challenge that attracted some attention in the 2 literature [58, 59, 60, 20]. Typically, gradient-free proposals  H I H  1 X X X  within chains are used for proposals for exploring multimodal −  w2 + (δ2 + v2) + δ2 2σ ˆ 2  dh h h o and discontinuous posteriors [61, 62]; however, it is possible to h=1 d=1 h=1 incorporate gradient-based information for developing effective (2) proposal distributions [27]. where the variance (σ2) is user-defined, gathered by infor- Although denoted ”parallel”, the replicas can be executed mation (prior belief) regarding the distribution of weights of sequentially in a single processing unit [63]; however, multi- trained neural networks in similar applications. core or high-performance computing systems can feature par- Given data (y), we use Bayes rule for the posterior p(θ|y) allel implementations improving the computational time [56, which is proportional to the likelihood p(y|θ) times the prior 64, 65]. There are challenges since parallel tempering fea- p(θ). tures exchange or transition between neighbouring replicas, and p(θ|y) ∝ p(y|θ) × p(θ) we need to consider efficient strategies that take into account inter-process communication [65]. Parallel tempering has been Then, the log-posterior transforms to used with high-performance computing for seismic full wave- form inversion (FWI) problems [64], which is amongst the most log (p(θ|y)) = log (p(θ)) + log (p(y|θ)) computationally intensive problems in earth science today. To implement the likelihood given in Equations 1 and 2, we Parallel tempering has also been implemented in a distributed use the multinomial likelihood function for the classification volunteer computing network via crowd-sourcing for multi- problems as shown in Equation 3. threading and graphic processing units [66]. Furthermore, im- plementation with field-programmable gate array (FPGA) has shown much better performance than multi-core and graphic X XK processing units (GPU) implementations [67]. log (p(y|θ)) = zt,k log πk (3) In a parallel tempering MCMC ensemble, given M replicas t∈T k=1 defined by multiple temperature levels, the state of the ensem- ble is specified by X = x1, x2, ..., xM; where xi is the ith replica, for classes k = 1,..., K, where πk is the output of the neural network after applying the transfer function, and T is the num- with temperature level Ti. The equilibrium distribution of the ber of training samples. In this case, the transfer function is the ensemble X is given by, softmax function [52], M 1 exp(− L(xi)) Y Ti exp( f (xp)) Π(X) = (6) Z(Ti) πk = PK (4) i=1 k=1 exp( f (xk)) where L(x ) is the log-likelihood function for the replica state for k = 1,..., K. z is an indicator variable for the given sam- i t,k at each temperature level (T ) and Z(T ) is an intractable nor- ple t and the class k as given in the dataset and defined by, i i malization constant. At every iteration of the replica state,  the replica can feature two types of transitions that includes a 1, if yt = k z =  (5) Metropolis transition and a replica transition. t,k  0, otherwise The Metropolis transition enables the replica to perform local Monte Carlo moves defined by the energy function E(xi). The 2.2. Parallel tempering MCMC ∗ local replica state xi is sampled using a proposal distribution Parallel tempering MCMC features an ensemble of chains qi(.|xi) and the Metropolis-Hastings acceptance probability α (known as replicas) executed at different temperature levels for the local replica Llocal is given as, defined by a temperature ladder that determine the extent of ! exploration and exploitation [53, 21, 54, 55]. The replicas  1 ∗  α = min 1, exp − (L(xi ) − L(xi)) (7) with higher temperature values ensure that there is enough Ti 3 The condition holds for each replica, and there- have been widely used in Earth sciences such as modelling wa- fore it holds for the ensemble system. ter resources [74]. Moreover, Daz-Manrquez et al. [75] pre- Typically, the Metropolis-Hastings update consists of a sin- sented a review of surrogate assisted multi-objective evolution- gle that evaluates the energy of the system ary algorithms that showed that the method has been successful and accepted is based on the temperature ladder. The selection in a wide range of application problems. of the temperature ladder for the replicas is done prior to sam- The search for the right surrogate model is a significant chal- pling; we use a geometric spacing methodology [68] as given, lenge given different types of likelihood or fitness landscape given by the actual model. Giunta et al. [76] presented a com- (i−1) parison of quadratic polynomial models with the least-square T = T (M−1) (8) i max method and interpolation models that featured Gaussian pro- cess regression (kriging). They discovered that the quadratic where i = 1,..., M and Tmax is the maximum temperature which is user-defined. polynomial models were more accurate in terms of errors for es- The Replica transition features the exchange of two neigh- timation for the optimisation problems. Jin et al. [77] presented another study that compared several surrogate models that in- bouring replica states ( xi ↔ xi+1) defined by their tempera- ture level, i and i + 1. The replica exchange is accepted by the clude polynomial regression, multivariate adaptive regression Metropolis-Hastings criterion with replica exchange probabil- splines, radial-basis functions, and kriging based on multiple ity β by, performance criteria using different classes of problems. The authors reported radial basis functions as one of the best for   1 1   scalability and robustness, given different types of problems. β = min 1, exp − L(xi+1) − L(xi) (9) They also reported kriging to be computationally expensive. Ti+1 Ti Based on the Metropolis criterion, the configuration (position in 3. Surrogate-assisted multi-core parallel tempering the replica ) of neighbouring replicas at different temperatures are exchanged. This results in a robust ensemble which can Surrogate models primarily learn to mimic actual or true sample both low and high energy configurations. models using their behaviour, i.e. how the true model responds The replica-exchange enables a replica that could be stuck to a set of input parameters. A surrogate model captures the at a local minimum with low-temperature level to exchange relationship between the input and output given by the true configuration with a higher neighbouring temperature level and model. The input is the set of proposals in parallel tempering hence improve exploration. In this way, the replica-exchange MCMC that features the weights and biases of the neural net- can shorten the sampling time required for convergence. The work model. Hence, we utilize the surrogate model to approx- frequency of determining the exchange and the temperature imate the likelihood of the true model. We define the approxi- level is user-defined. For further details about the derivation mation of the likelihood by the surrogate as pseudo-likelihood. for the equations in this section, see [55, 69]. We train the surrogate model on the data that is composed of the history of proposals for weights and biases of the neural 2.3. Surrogate-assisted optimization network with the corresponding true likelihood. We do not es- timate the output of the neural network by the surrogate (since Surrogate assistant optimisation refers to the use of machine in some problems there are many outputs); hence to limit the learning or statistical learning models to develop approximately number of variables, we directly estimate the likelihood using computationally inexpensive simulation of the actual model the surrogate. [29]. The major advantage is that the surrogate model provides We implement the neighbouring replica transition or ex- computationally efficiency when compared to the exact model change at regular intervals. The cost of inter-process communi- used for evolutionary algorithms and related optimisation meth- cation must be limited to avoid computational overhead, given ods [30, 31]. In the optimisation literature, such approaches are that we execute each replica on a separate processing core. The also known as response surface methodologies [70, 71] which swap interval defines the time (number of iterations or sam- have been applicable for a wide range of engineering problems ples) after which each replica pauses and awaits for neighbour- such as reliability analysis of laterally loaded piles [72]. ing replica exchange. After the exchange, the replica manager In the case of evolutionary computation methods, Ong et al. process enables local Metropolis transition and the process re- [30] presented parallel evolutionary optimisation for solving peats. We note that although we use a neural network model, computationally expensive functions with application to aero- the framework is general and we can use other models; which dynamic wing design where surrogate models used radial basis could include those from other domains such as Earth science functions. Zhou et al. [31] accelerated evolutionary optimiza- models that are computationally expensive [78, 79]. tion with global and local surrogate models while Lim et al. Bayesian neural learning employs parallel tempering MCMC [73] presented a generalised method that accounted for uncer- for inference; this can be viewed as a training procedure for tainty in estimation to unify diverse surrogate models. Jin [29] the neural network model. The goal of the surrogate model reviewed surrogate-assisted evolutionary computation that cov- is to save computational time taken for evaluation of the true ered single and multi-objective, dynamic, constrained, and mul- likelihood function associated with computationally expensive timodal optimisation problems. Furthermore, surrogate models models. 4 Given that the true model is represented as L = f (x), the purpose of the surrogate interval is to collect enough data for surrogate model provides an approximation in the form lˆ = the surrogate model during sampling. This also can be seen as fˆ(x), such that L = lˆ + e; where e represents the difference batch-based training where we update the model after we col- or error. The surrogate model provides an estimate by the lect the data at regular intervals. pseudo-likelihood for replacing true-likelihood when needed. Figure 2 shows how we use the manager processing unit for The surrogate model is constructed by training from experience the respective replicas running in parallel for the given surro- which is given by the set of input xi,s with corresponding true- gate interval. The manager process waits for all the replicas likelihoods Li,s; where s represents the sample and i represents to reach the surrogate interval. Then we calculate the replica the replica. Hence, input features (Φ) for the surrogate is de- transition probability for the possibility of swapping the neigh- veloped by combining xi,s using samples generated (θ) across bouring replicas. Figure 2 further highlights the information all the replicas for a given surrogate interval (ψ). The surrogate flow between the master process and the replica process via interval defines the batch size of the training data for the sur- inter-process communication 1. The information flows from rogate model which goes through incremental learning. All the the replica process to master process using signal() given by respective replicas sample until the surrogate interval is reached the replica process as shown in Stage 2.2 and 5.0 of Algorithm and then the manager process collects the sampled data to cre- 1. ate a training data batch for the surrogate model. This can be The surrogate model is re-trained for remaining surrogate in- formulated as follows, terval blocks until the maximum iteration (Rmax) is reached to enable better estimation for the pseudo-likelihood. The surro- gate model is trained only in the manger process, and we pass Φ = ([x ,..., x ],..., [x ,..., x ]) 1,s 1,s+ψ M,s M,s+ψ a copy of the surrogate model with the trained parameters to λ = ([L1,s,..., L1,s+ψ],..., [LM,s,..., LM,s+ψ]) the ensemble of replica processes for estimating the pseudo- Θ = [Φ, λ] (10) likelihood when needed. Note that only the samples associ- ated with the true-likelihood become part of the surrogate train- where xi,s represents the set of parameters proposed and s, ing dataset. The surrogate training can consume a significant yi,s is the output from the multinomial likelihood, and M is the portion of time which is dependent on the size of the problem total number of replicas. in terms of the number of parameters and the surrogate model The training surrogate dataset (Θ = [Φ, λ]) consists of input used. We evaluate the trade-off between quality of estimation features (Φ) and response (λ) for the span of each surrogate in- by pseudo-likelihood and overall cost of computation for the terval (s + ψ). Hence, we denote the pseudo-likelihoody ˆ by true likelihood function for different types of problems. ˆ ˆ yˆ = f (Θ); where f is the surrogate model. We amend the likeli- In Algorithm 1, Stage 1.4 predicts the pseudo-likelihood hood in training data for the temperature level since it has been ∗ (Lsurrogate) with given proposal θs. Stage 1.5 calculates the like- changed by taking L /T for given replica i. The likelihood local i lihood moving average of past three likelihood values, Lpast = is amended to reflect the true likelihood rather than that rep- mean(Ls−1, Ls−1, Ls−2). The motivation for doing this is to com- resented by taking into account the temperature level. All the bine univariate time series prediction approach (moving aver- respective replica Θi data is combined, Υ = [Θ1, Θ2, ...ΘN ] and age) with multivariate regression approach (surrogate model) trained using the neural network model given in Equation 1. for robust estimation where we take both current and past infor- Algorithm 1 presents surrogate-assisted multi-core parallel mation into account. In Stage 1.6, we combine the likelihood tempering for Bayesian neural learning. We implement the al- moving average with the pseudo-likelihood to give a prediction gorithm using (multi-core) parallel processing where a manager that considers the present replica proposal and the past behav- process takes care of the ensemble of replicas that run in sep- ior, Llocal = (0.5 * Lsurrogate) + 0.5 * Lpast. Note that although we arate processing cores. Given the parallel processing nature, it use the past three values for the moving average, this number is tricky to implement when to terminate sampling distributed can change for different types of problems. Once the swap in- amongst parallel processing replicas; hence, our termination terval (φ) is reached, Stage 2.0 prepares the replica transition in condition waits for all the replica processes to end. We monitor the manager process (highlighted in blue). Stage 3.0 executes the number of alive replica process in the master process by set- once we reach the surrogate interval where we use data col- ting the number of alive replicas in the ensemble (alive = M). lected from Stage 1.8 for creating surrogate training set batch We note that the highlighted region of Algorithm 1 shows dif- Θ as shown in Equation 10. Stage 4 shows how we execute the ferent processing cores as given in Figure 2. We highlight the global surrogate training in the Manager process with combined manager process in blue, and in pink we highlight the ensem- surrogate data using the neural network model in Equation 1. ble of replica processes running in parallel. We initially assign We use the trained knowledge from the global surrogate model the replicas that sample θ with values using the Gaussian prior n (Ψglobal) in the respective replicas local surrogate model (Ψi) as 2 distribution with user-defined variance (σ = 25) centred at the shown in Stage 1.3 and Figure 2. Once we reach the maximum mean of 0. We define the temperature level by geometric lad- number of samples for the given replica (Rmax, Stage 5.0 signals der (Equation 8), and other key parameters which includes the number of replica samples (Rmax), swap-interval which defines after how many samples to check for replica swap (Rswap), sur- 1Python multiprocessing library for implementation: rogate interval (ψ), and surrogate probability (S prob). The main https://docs.python.org/2/library/multiprocessing.html 5 the manager process to decrement the number of replicas alive for executing the termination criterion. Finally, we execute Stage 6 in the manager process where we combine the respective replica predictions and proposals (weights and biases) from the ensemble by concatenating the history of the samples to create the posterior distributions. This features the accepted samples and copies of the accepted sam- ples in cases of rejected samples, as shown in Stage 1.9 of Al- gorithm 1. Furthermore, the framework features parallel tempering in the first stage of sampling that transforms into a local mode or exploitation in the second stage where the temperature ladder is changed such that Ti = 1, for all replicas, i = 1, 2, ..., N as done in [27, 78]. We emphasize on exploration in the first phase and emphasize on exploitation in the second phase, as shown in Stage 1.9.1 of Algorithm 1. The duration of each phase is prob- lem dependent, which we determine from trial experiments. We validate the quality of the surrogate model prediction us- Figure 1: Data collection for training surrogate model ing the root mean squared error (RMSE)

tuv 1 XN 3.2. Surrogate model RMSE = (L − Lˆ )2 N i i i=1 The choice of the surrogate model needs to consider the com- putational resources taken for training the model during the ˆ where Li and Li are the true likelihood and the pseudo- sampling process. We note that Gaussian process models, neu- likelihood values, respectively. N is the number of cases we ral networks, and radial basis function models [81] have been employ the surrogate during sampling. popular choices for surrogates in the literature. Note that instead of swapping entire replica configuration, an In our case, we consider the inference problem that features alternative is to swap the respective temperatures. This could hundreds to thousands of parameters; hence, the model needs save the amount of information exchanged during inter-process to be efficiently trained without taking lots of computational communication as the temperature is only a double-precision resources. Moreover, the flexibility of the model to have incre- number, implemented by [38]. mental training is also needed. Therefore, we rule out Gaussian process models since they have imitations given large training 3.1. Langevin gradient-based proposal distribution datasets [82]. In our case, tens of thousands of samples could Apart from random-walk proposals, we utilize stochastic make the training data. Therefore, we use neural networks as gradient Langevin dynamics [45] for the proposal distribution. the choice of the surrogate model. The training data and neural It features additional noise with stochastic gradients to optimize network model can be formulated as follows. a differentiable objective function which has been very promis- The data given to the surrogate model is Φ and λ as in (10), ing for neural networks [80, 27]. The proposal distribution is where Φ is the input and λ is the desired output of the model. constructed as follows. The prediction of the model is denoted by λˆ. We explain the surrogate models used in the paper as follows. In our surrogate model, we consider a single hidden layer p [k] θ ∼ N(θ¯ , Σθ), where (11) feedforward neural network as shown in Equation (1). The θ¯[k] = θ[k] + r × ∇E [θ[k]], main difference is that we use the rectified linear unitary ac- yAD,T X tivation function, g(.). In this case, after forward propagation, E [θ[k]] = (y − f (x )[k])2, yAD,T t t the errors in the estimates are evaluated using the cross-entropy t∈AD,T cost function [83] which is formulated as: ! [k] ∂E ∂E ∇Ey [θ ] = ,..., AD,T ∂θ ∂θ 1 L ψ 1 X (i) (i) (i) (i) 2 J(W, b) = − (λ log(λˆ ) + (1 − λ )log(1 − λˆ ) (12) r is the learning rate, Σθ = σθ IL and IL is the L × L identity ψ matrix. So that the newly proposed value of θp, consists of 2 i=0 parts: The learning or optimization task then is to iteratively up- 1. An gradient descent based weight update given by Equa- date the weights and biases to minimize the cross-entropy loss, tion 11. J(W, b). We note that stochastic gradient descent maintains a single 2. Add an amount of noise, from N(0, Σθ). learning rate for all weight updates and typically the learning 6 Data: Classification Dataset Result: Posterior distributions for neural network weights * Set the number of replicas (M) in ensemble as alive; alive = M * Define geometric() temperature ladder , number of replica processes (M), surrogate interval (ψ), replica swap interval (Rswap), and maximum number of samples for each replica (Rmax). while (alive , 0 do Stage 0: Prepare manager process to execute each replica in parallel cores for each i until M do s = 0 first-phase: Ti = geometric() while (s < Rmax) do Stage 1.0: Metropolis Transition for each v until ψ do for each k until Rswap do ∗ 1.1 Random-walk, θs = θs +  1.2 Llocal calculate: Draw κ from a Uniform distribution [0,1] if κ < S prob and s > ψ then Estimate Llocal from local surrogate’s prediction, Lsurrogate 1.3 Copy global surrogate knowledge to local surrogate, Ψi ← Ψglobal ∗ 1.4 Predict Lsurrogate value with the proposed θi 1.5 Lpast = mean(Ls−1, Ls−1, Ls−2) 1.6 Assign Llocal = (0.5 * Lsurrogate) + 0.5 * Lpast

else 1.7 Llocal = true-likelihood, given by Likelihood function in Equation (3)) 1.8 Save Ls = Llocal (Equation 10) end 1.9 Calculate acceptance probability α and draw u from uniform distribution if u ≤ α then ∗ Accept replica state, θs ← θs end else ← ∗ Reject and retain previous state: θs θs−1 end 1.9.1 Switch sampling style by updating replica temperature if second-phase is true then Update temperature, Ti = 1 end Increment s end Stage 2.0: Replica Transition: 2.1 Calculate acceptance probability β and draw b from a Uniform distribution [0,1] if b ≤ β then 2.2 Signal() manager process 2.3 Exchange neighboring Replica, θi ↔ θs+1 end

end Stage 3.0: Set Θi which features history of proposals Φ (θ) and response λ ( Llocal ). Use data collected from Stage 1.8

Stage 4.0: Global Surrogate Training for each replica do 4.1 Get replica surrogate data Θi from Stage 3.0 end 4.2 Train global surrogate model with combined surrogate data, Υ = [Θ1, Θ2, ...ΘN ] using neural network model in Equation 1 4.3 Save global surrogate model parameters, Ψglobal end

Stage 5.0: Signal() manager process 5.1 decrement number of replica processes alive

end end Stage 6: Combine predictions and posterior from respective replicas in the ensemble, using second-phase MCMC samples. Algorithm 1: Surrogate-assisted parallel tempering for Bayesian neural networks. The highlighted regions of the algorithm shows different processing cores. The manager process is highlighted in blue while the ensemble of replica processes running in parallel is highlighted in pink.

7 Figure 2: Surrogate-assisted multi-core parallel tempering features surrogates to estimate the likelihood function at times rather than evaluating it. rate does not change during training. Adam (adaptive mo- 4.1. Experimental Design ment estimation) learning algorithm [84] differs from classical We select six benchmark pattern classification problems from stochastic gradient descent, as the learning rate is maintained the University of California Irvine machine learning repository for each network weight and separately adapted as learning un- [87]. The problems feature different levels of computational folds. Adam computes individual adaptive learning rates for complexity and learning difficultly; in terms of the number of different parameters from estimates of first and second mo- instances, the number of attributes, and the number of classes, ments of the gradients. Adam features the strengths of root as shown in Table 1. We use the multinomial likelihood given mean square propagation (RMSProp) [85], and adaptive gradi- in Equation 3 the selected classification problems. Moreover, ent algorithm (AdaGrad) [86]. Adam has shown better results we show the performance using Langevin-gradient proposals when compared to stochastic gradient descent, RMSprop and which takes more computational time due to cost of computing AdaGrad. Hence, we consider Adam as the designated algo- gradients when compared to random-walk proposals; however, rithm for the neural network-based surrogate model. it gives better prediction accuracy given the same number of We note that the surrogate model predictions do not provide samples [27]. The experimental design follows the following uncertainty quantification since they are not probabilistic mod- strategy in evaluating the performance of the selected parame- els. The proposed framework is general and hence different sur- ters from Algorithm 1. rogate types of models can be used in future. Different types of surrogate models would consume different computational time • Evaluate the effect of the surrogate probability (S prob on and have certain strengths and weaknesses [77]. the computational time and classification performance.

• Evaluate the effect of the surrogate interval (ψ on the com- putational time and classification performance. 4. Experiments and Results • Evaluate the effect of Langevin-gradients for proposals in parallel tempering MCMC. In this section, we present an experimental analysis of surrogate-assisted parallel tempering (SAPT) for Bayesian neu- . ral learning. The experiments consider a wide range of issues We provide the parameter setting for the respective exper- that test the accuracy of pseudo-likelihood by the surrogate, the iments as follows. A burn-in time Rburn = 0.50 (50 %) of quality in decision making given by the classification perfor- the samples for the respective replica. The burn-in strategy is mance, and the amount of computational time saved. standard practice for MCMC sampling which ensures that the 8 Table 1: Dataset description [87] gate neural network model architecture consists of [i, h1, h2, o]; where i refers to the number of inputs that consists of the to- Dataset Instances Attributes Classes Hidden Units tal number of weights and biases used in the Bayesian neu- Iris 150 4 3 12 ral network for the given problem. h1 and h2 refers to the Ionosphere 351 34 2 50 first and second hidden layers, and o represents the output that Cancer 569 9 2 12 predicts the likelihood. In our experiments, we use hidden Bank 11162 20 2 50 units h = 64, h = 16, for the Iris and Cancer problems. Pen-Digit 10992 16 10 30 1 2 In the Ionosphere and Bank problems, we use hidden units Chess 28056 6 18 25 h1 = 120, h2 = 40, and for Pen-Digit and Chess problems, we use hidden units h1 = 200, h2 = 50. All problems used one chain enters a high probability region, where the states of the output unit for the surrogate model, o = 1. Markov chain are more representative of the posterior distri- bution. Although MCMC methods more commonly use burn- 4.3. Results in time of 10-20 % in the literature, we use 50 % for getting prediction performance of preferred accuracy. The maximum We first present the results in terms of classification accu- sampling time, Fmax = 50, 000 for all the respective problems, racy of posterior samples for SAPT-RW with random-walk pro- and number of replicas, M = 10 which run on parallel process- posal distribution (Table 2) where we evaluate different com- ing cores. The other key parameters include replica swap inter- binations of selected values of surrogate interval and ψ surro- val (Rswap = 50), surrogate interval(ψ = 50), replica sampling gate probability S prob. Looking at the elapsed time, we find time (Rmax = Fmax/M), surrogate probability (S prob = 0.25, that SAPT-RW is more costly for the Iris and Cancer problems S prob = 0.50) and maximum temperature (Tmax = 5). We deter- when compared to PT-RW. These are smaller problems when mined the given values for the parameters in trial experiments. compared to rest given the size of the dataset shown in Table We selected the maximum temperature in trial experiments by 1. The Ionosphere problem saved computation time for both taking into account the performance accuracy given with a fixed instances of SAPT-RW. A larger dataset with a Bayesian neural number of samples. We provide details for the pattern classifi- network model implies that there is more chance for the surro- cation datasets with details of Bayesian neural network topol- gate to save time which is visible in the Pen-Digit and Chess ogy (number of hidden units) in Table 1. problems. The Bank problem does not save much time but re- In the case of random-walk proposal distribution, we add a tains the accuracy in classification performance. Furthermore, Gaussian noise to the weights and biases of the network from we find that instances of SAPT improve the classification ac- a normal distribution with mean of 0 and standard deviation of curacy of Iris, Ionosphere, Pen-Digit and Chess problems. In 0.025. Theuser defined constants for the priors (see Equation Cancer and Bank problems, the performance is similar. 2 2) are set as σ = 25, ν1 = 0 and ν2 = 0. We provide the results for Langevin-gradient proposals In the respective experiments, we compare SAPT with par- (SAPT-LG) in Table 3. In general, the classification perfor- allel tempering featuring random-walk proposals (PTRW) and mance improves when compared to random-walk proposal dis- Langevin-gradients (PTLG) taken from the literature [27]. tribution (SAPT-RW) in Table 2. By incorporating the gradient Surrogate-assisted parallel tempering also features Langevin- into the proposal, the results show faster convergence with bet- gradients (SAPT-LG) given in Equation (12). We use a learning ter accuracy given the same number of samples. In a compar- rate of 0.5 for computing the weight update via the gradients. ison of both forms of parallel tempering (PT-RW and PT-LG) Furthermore, we apply Langevin-gradient with a probability with the surrogate-assisted framework (SAPT-RW and SAPT- Lprob = 0.5, and random-walk proposal, otherwise. Note that LG), we observe that the elapsed time has not improved for Iris, the respective methods feature parallel tempering for the first Ionosphere and Cancer problems; however, it has improved in 50 per cent of the samples that make the burn-in period. After- the rest of the problems. wards, we use canonical MCMC where temperature T = 1 us- The accuracy of the surrogate in predicting the likelihood is ing parallel computing environment featuring replica-exchange shown in Table 4 for the smaller problems that feature Iono- via interprocess communication. sphere, Cancer and Iris. We notice that the RMSE for surrogate prediction is lower for the Iris problem when compared to the 4.2. Implementation others; however, as shown in Figure 4, Figure 5 and 6, this is We employ multi-core parallel tempering for neural networks relative to the range of log-likelihood. We observe that the log- 2 to implement surrogate assisted parallel tempering. We use likelihood prediction by the surrogate model is much better for one hidden layer in the Bayesian neural network for classifica- the smaller problems (Iris and Cancer) when compared to the tion problems. larger problems (Pen-Digit and Chess). We use the Keras machine learning library for implementing Figure shows the comparison of the posterior distribution the surrogate 3 with Adam learning algorithm [84]. The surro- of two selected weights from input to the hidden layer of the Bayesian neural network for the Cancer problem using SAPT- RW and PT-RW. We notice that although the chains converged 2Multi-core parallel tempering: https://github.com/sydney-machine- learning/parallel-tempering-neural-net to different sub-optimal models in the multimodal distributions, 3Keras: https://keras.io/ SAPT-RW does not depreciate in terms of exploration. 9 (a) Weight 1 (b) Weight 1

(c) Weight 2 (d) Weight 2

Figure 3: Posterior distribution and trace-plot comparing surrogate-assisted parallel tempering (Panel (a) and (c)) with parallel tempering (Panel (b) and (d)) MCMC for selected weights (1 and 2) for the Cancer problem.

Table 5 provides an evaluation of the number of replicas used for exchanging information between the different replicas. (cores) on the computational time (minutes) using SAPT-RW Inter-process communication is also used for collecting the his- for selected problems. We observe that in general, the com- tory of information in terms of proposals and associated likeli- putational time reduces as the number of replica increases by hood for creating training datasets for the surrogate model. The taking advantage of parallel processing. proposed method could be seen as a case for online learning that considers a sequence of predictions from previous tasks and currently available information [89]. This is because the 5. Discussion surrogate is trained at every surrogate interval and the surro- The results, in general, have shown that surrogate-assisted gate gives an estimation of the likelihood until the next interval parallel tempering can be beneficial for larger datasets and mod- is reached for further retraining based on accumulated data of els, demonstrated by Bayesian neural network architecture for proposal and true-likelihood for the previous interval. Pen-Digit and Chess classification problems. This implies that We observe that there is systematic under-estimation of the the method would be very useful for large scale models where true likelihood for some of the cases (Figures 4-6). One rea- computational time can be lowered while maintaining perfor- son could be that the variance is overestimated and the other mance in decision making such as classification accuracy. We is due to the global surrogate model, where training is done by observed that in general, the Langevin-gradients improves the one model in the manager process and the knowledge is used accuracy of the results. Although we used Bayesian neural net- in all the replicas. The surrogate accuracy depends on the na- works to demonstrate the challenge in using computationally ture of the problem, and the neural network model, in terms expensive models with large datasets, surrogate-assisted paral- of the number of parameters in the model and the size of the lel tempering can be used for a wide range of models across dataset. We find that smaller Bayesian neural network mod- different domains. We note that Langevin-gradients are limited els had good accuracy in surrogate estimation when compared to models where gradient information is available. In large and to others. In future work, we can consider training surrogate computationally expensive geoscientific models such as mod- model in the local replicas which could further improve the esti- elling landscape [78] and modelling reef evolution [88], it is mation. The global surrogate model has the advantage of com- difficult to obtain gradients from the models; hence, random- bining information across the different replicas, it faces chal- walk and other meta-heuristics are more applicable. lenges of dealing with large surrogate training dataset which The proposed method employs transfer learning for the sur- accumulates over sampling time. We also need to lower the rogate model, where we transfer the knowledge and refine the time taken for the exchange of knowledge needed for decision surrogate model in the forthcoming surrogate intervals with making by the local surrogate model and the master replica. We new data. Since each replica of parallel tempering is executed found that bigger problems (such as Pen-Digit and Chess) give on a separate processing core, inter-process communication is further challenges for surrogate estimation. In such cases, it 10 Table 2: Classification Results (Random-walk proposal distribution)

Dataset Method Train Accuracy Test Accuracy Elapsed Time SAPT(S prob) [mean, std, best] [mean, std, best] (minutes) Iris PT-RW 51.39 15.02 91.43 50.18 41.78 100.00 1.26 SAPT-RW (0.25) 69.31 2.05 80.00 65.24 4.04 78.38 1.93 SAPT-RW (0.50) 44.11 11.93 78.89 50.01 5.65 70.30 1.81 Ionosphere PT-RW 68.92 16.53 91.84 51.29 30.73 91.74 3.50 SAPT-RW (0.25) 76.60 4.42 86.94 61.23 8.05 86.24 2.93 SAPT-RW (0.50) 70.34 6.15 83.27 73.42 14.06 95.41 2.43 Cancer PT-RW 83.78 20.79 97.14 83.55 27.85 99.52 2.78 SAPT-RW(0.25) 89.75 6.91 96.32 92.80 4.57 99.05 3.41 SAPT-RW(0.50) 91.17 6.29 96.52 97.64 3.17 99.52 2.84 Bank PT-RW 78.39 1.34 80.11 77.49 0.90 79.45 27.71 SAPT-RW(0.25) 78.44 0.67 79.69 77.79 0.63 79.60 28.67 SAPT(0.50) 77.82 1.05 79.69 77.16 0.91 78.80 27.38 Pen Digit PT-RW 76.67 17.44 95.24 71.93 16.59 90.62 57.13 SAPT-RW(0.25) 88.74 1.94 92.87 83.60 2.14 88.74 49.25 SAPT-RW(0.50) 80.85 1.28 82.87 77.66 1.08 80.02 36.05 Chess PT-RW 89.48 17.46 100.00 90.06 15.93 100.00 252.56 SAPT-RW(0.25) 97.17 8.35 100.00 97.66 6.83 100.00 197.61 SAPT-RW(0.50) 90.87 13.35 100.00 90.71 13.31 100.00 143.75 would be worthwhile to take a time series approach, where the ing an error term for each surrogate model and integrating it history of the likelihood is treated as time series. Hence, a local out via MCMC sampling. Potentially, multi-fidelity modelling surrogate could learn from the past tend of the likelihood, rather where the synergy between low-fidelity and how fidelity data than the history of past proposals. This could help in address- and models could be employed for the surrogate model [91]. ing computational challenges given a large number of model parameters to be considered for surrogate training. 6. Conclusions and Future Work Although we ruled out Gaussian process models as the choice of the surrogate model due to computational complexity We presented surrogate-assisted parallel tempering for im- in training large surrogate data, we need to consider that Gaus- plementing Bayesian inference for computationally expensive sian process models naturally account for uncertainty quantifi- problems that harness the advantage of parallel processing. We cation in decision making. Recent techniques to address the is- used a Bayesian neural network model to demonstrate the ef- sue of training Gaussian process models for large datasets could fectiveness of the framework to depict computationally expen- be a way ahead in future studies [90]. Moreover, we note that sive problems. The results from the experiments reveal that the there has not been much work done in the literature that em- method gives a promising performance where computational ploys surrogate assisted machine learning. Most of the liter- time is reduced for larger problems. ature considered surrogate assisted optimization, whereas we The surrogate-based framework is flexible and can incor- considered inference for machine learning problems. The re- porate different surrogate models and be applied to problems sults open the road to use surrogate models for machine learn- across various domains that feature computationally expen- ing. Surrogates would be helpful in case of big data problems sive models which require parameter estimation and uncertainty and cases where there are inconsistencies or noisy data. Fur- quantification. In future work, we envision strategies to further thermore, other optimization methods could be used in conjunc- improve the surrogate estimation for large models and parame- tion with surrogates for big data problems rather than parallel ters. A way ahead is to utilize local surrogate via time series tempering. prediction could help in alleviating the challenges. Further- more, the framework could be applied to problems in different Finally, we use the surrogate model with a certain probabil- domains such as computationally expensive geoscientific mod- ity in all replicas, hence the resulting distribution is only an els used for landscape evolution. approximation of the true posterior distribution. An interest- ing question would then be whether there is a way to use sur- rogates while not introducing an approximation error. This is Software and Data trivially possible for sequential approaches; however, in paral- We provide an open-source implementation of the proposed lel cases, this could be an issue since multiple replicas are used algorithm in Python along with data and sample results 4. which all contribute to different levels of uncertainties. This would need to be addressed in future work in uncertainty quan- 4Surrogate-assisted multi-core parallel tempering: tification for the surrogate likelihood prediction, perhaps with https://github.com/Sydney-machine-learning/surrogate-assisted-parallel- the use of Gaussian process surrogate models or by introduc- tempering 11 Table 3: Classification Results (Langevin-gradient proposal distribution)

Dataset Method Train Accuracy Test Accuracy Elapsed Time [mean, std, best] [mean, std, best] (minutes) Iris PT-LG 97.32 0.92 99.05 96.76 0.96 99.10 2.09 SAPT-LG (0.25) 98.91 0.16 100.00 99.93 0.39 100.00 2.85 SAPT-LG (0.50) 99.09 0.43 100.00 98.63 1.57 100.00 2.49 Ionosphere PT-LG 98.55 0.55 99.59 92.19 2.92 98.17 5.07 SAPT-LG (0.25 ) 100.00 0.02 100.00 90.82 2.43 96.33 6.17 SAPT-LG (0.50 ) 99.51 0.60 100.00 91.24 1.82 97.25 4.76 Cancer PT-LG 97.00 0.29 97.75 98.77 0.32 99.52 5.09 SAPT-LG(0.25) 99.36 0.11 99.39 98.00 0.76 99.52 8.18 SAPT-LG(0.50) 99.37 0.12 99.59 98.61 0.65 100.00 6.64 Bank PT-LG 80.75 1.45 85.41 79.96 0.81 82.61 86.94 SAPT-LG(0.25) 79.86 0.15 80.30 80.53 0.28 79.22 75.96 SAPT-LG(0.50) 80.86 0.15 80.30 81.53 0.28 79.25 65.11 Pen Digit PT-LG 84.98 7.42 96.02 81.24 6.82 91.25 86.62 SAPT-LG(0.25) 82.12 7.42 94.02 82.24 6.82 93.25 66.62 SAPT-LG(0.50) 83.98 7.42 95.02 83.14 6.82 92.25 56.62 Chess PT-LG 100.00 0.00 100.00 100.00 0.00 100.00 323.10 SAPT-LG(0.25) 100.00 0.00 100.00 100.00 0.00 100.00 223.70 SAPT-LG(0.50) 100.00 0.00 100.00 100.00 0.00 100.00 173.10

Table 4: Surrogate Accuracy

Dataset Method Surrogate Prediction Surrogate Training RMSE RMSE [mean, std] Iris SAPT-RW 3.55 3.84e-05 8.40e-05 Ionosphere SAPT-RW 2.63 7.20e-05 1.62e-04 Cancer SAPT-RW 11.08 6.12e-05 1.58e-04 Bank SAPT-RW 131.26 3.09e-05 1.53e-04 Pen-Digit SAPT-RW 1246.60 3.93e-06 1.47e-05 Chess SAPT-RW 4026.44 1.34e-06 3.58e-06

Credit Author Statement [5] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, E. Teller, Equation of state calculations by fast computing machines, The Rohitash Chandra: conceptualization, methodology, experi- journal of chemical physics 21 (6) (1953) 1087–1092. [6] A. Tarantola, B. Valette, et al., Inverse problems= quest for information, mentation, writing and supervision; Konark Jain: initial soft- Journal of geophysics 50 (1) (1982) 159–170. ware development and experimentation; Arpit Kapoor: soft- [7] K. Mosegaard, A. Tarantola, Monte carlo sampling of solutions to inverse ware development and experimentation; Ashray Aman: soft- problems, Journal of Geophysical Research: Solid Earth 100 (B7) (1995) ware development, experimentation, and visualisation. 12431–12447. [8] A. E. Raftery, S. M. Lewis, Implementing mcmc, Markov chain Monte Carlo in practice (1996) 115–130. [9] D. van Ravenzwaaij, P. Cassey, S. D. Brown, A simple introduction to Acknowledgement markov chain monte–carlo sampling, Psychonomic bulletin & review (2016) 1–12. The authors would like to thank Prof. Dietmar Muller and [10] S. Banerjee, B. P. Carlin, A. E. Gelfand, Hierarchical modeling and anal- Danial Azam for discussions and support during the course of ysis for spatial data, Crc Press, 2014. this research project. We sincerely thank the editors and anony- [11] C. P. Robert, V. Elvira, N. Tawn, C. Wu, Accelerating mcmc algorithms, Wiley Interdisciplinary Reviews: Computational Statistics 10 (5) (2018) mous reviewers for their valuable comments. e1435. [12] Y. Gal, Z. Ghahramani, Dropout as a bayesian approximation: Repre- senting model uncertainty in deep learning, in: international conference References on machine learning, 2016, pp. 1050–1059. [13] A. Kendall, Y. Gal, What uncertainties do we need in bayesian deep learn- [1] J. Schmidhuber, Deep learning in neural networks: An overview, Neural ing for computer vision?, in: Advances in Neural Information Processing networks 61 (2015) 85–117. Systems, 2017, pp. 5580–5590. [2] D. J. MacKay, Probable networks and plausible predictionsa review of [14] D. M. Blei, A. Kucukelbir, J. D. McAuliffe, Variational inference: A practical bayesian methods for supervised neural networks, Network: review for statisticians, Journal of the American Statistical Association Computation in Neural Systems 6 (3) (1995) 469–505. 112 (518) (2017) 859–877. [3] C. Robert, Machine learning, a probabilistic perspective (2014). [15] A. C. Damianou, M. K. Titsias, N. D. Lawrence, Variational inference for [4] W. K. Hastings, Monte carlo sampling methods using markov chains and latent variables and uncertain inputs in gaussian processes, The Journal their applications, Biometrika 57 (1) (1970) 97–109. of Machine Learning Research 17 (1) (2016) 1425–1486. 12 Table 5: Effect of number of replica on the computational time (minutes)

Num. Replica Cancer Bank Pen-digit Chess 4 45.66 79.25 102.28 253.98 6 21.16 43.48 61.38 188.79 8 13.25 32.57 51.92 185.52 10 9.17 29.75 44.7 130.57

20 Surrogate 40 Surrogate True True

60 40

80

60 100

120 80 Log-Likelihood Log-Likelihood

140

100 160

120 180 0 5000 10000 15000 20000 0 5000 10000 15000 20000 Samples per Replica [R-1, R-2 ..., R-N] Samples per Replica [R-1, R-2 ..., R-N]

50 1000 Surrogate Surrogate True True 1500 100

2000

150 2500

3000 200 Log-Likelihood Log-Likelihood 3500

250 4000

300 4500

0 5000 10000 15000 20000 0 5000 10000 15000 20000 Samples per Replica [R-1, R-2 ..., R-N] Samples per Replica [R-1, R-2 ..., R-N]

Figure 4: The Iris (top) and Cancer (bottom) surrogate accuracy. The dashed Figure 5: The Ionosphere (top) and Bank (bottom) surrogate accuracy. The line denotes the real likelihood function while the blue line gives the surrogate dashed line denotes the real likelihood function while the blue line gives the likelihood estimation. surrogate likelihood estimation.

13 [16] C. Blundell, J. Cornebise, K. Kavukcuoglu, D. Wierstra, Weight uncer- tainty in neural network, in: Proceedings of The 32nd International Con- ference on Machine Learning, 2015, pp. 1613–1622. [17] R. H. Swendsen, J.-S. Wang, Nonuniversal critical dynamics in monte carlo simulations, Physical review letters 58 (2) (1987) 86. [18] E. Marinari, G. Parisi, Simulated tempering: a new monte carlo scheme, EPL (Europhysics Letters) 19 (6) (1992) 451. [19] C. J. Geyer, E. A. Thompson, Annealing markov chain monte carlo with applications to ancestral inference, Journal of the American Statistical Association 90 (431) (1995) 909–920. Surrogate [20] A. Patriksson, D. van der Spoel, A temperature predictor for parallel tem- 4000 True pering simulations, Physical Chemistry Chemical Physics 10 (15) (2008) 2073–2077. [21] K. Hukushima, K. Nemoto, Exchange and applica- 6000 tion to spin glass simulations, Journal of the Physical Society of Japan 65 (6) (1996) 1604–1608. [22] L. Lamport, On interprocess communication, Distributed computing 1 (2) 8000 (1986) 86–101. [23] R. Salakhutdinov, A. Mnih, G. Hinton, Restricted boltzmann machines for collaborative filtering, in: Proceedings of the 24th international con- 10000 ference on Machine learning, ACM, 2007, pp. 791–798. [24] A. Fischer, C. Igel, A bound for the convergence rate of paral-

Log-Likelihood lel tempering for sampling restricted boltzmann machines, The- 12000 oretical Computer Science 598 (2015) 102 – 117. doi:https: //doi.org/10.1016/j.tcs.2015.05.019. URL http://www.sciencedirect.com/science/article/pii/ 14000 S0304397515004235 [25] G. Desjardins, A. Courville, Y. Bengio, P. Vincent, O. Delalleau, Tem- pered markov chain monte carlo for training of restricted boltzmann ma- 16000 chines, in: Proceedings of the thirteenth international conference on arti- ficial intelligence and statistics, 2010, pp. 145–152. 0 5000 10000 15000 20000 Samples per Replica [R-1, R-2 ..., R-N] [26] P. Brakel, S. Dieleman, B. Schrauwen, Training restricted boltzmann ma- chines with multi-tempering: harnessing parallelization, in: International Conference on Artificial Neural Networks, Springer, 2012, pp. 92–99. [27] R. Chandra, K. Jain, R. V. Deo, S. Cripps, Langevin-gradient parallel tempering for bayesian neural learning, Neurocomputing 359 (2019) 315 – 326. Surrogate [28] R. M. Hicks, P. A. Henne, Wing design by numerical optimization, Jour- 10000 True nal of Aircraft 15 (7) (1978) 407–412. [29] Y. Jin, Surrogate-assisted evolutionary computation: Recent advances and future challenges, Swarm and Evolutionary Computation 1 (2) (2011) 61– 15000 70. [30] Y. S. Ong, P. B. Nair, A. J. Keane, Evolutionary optimization of computa- 20000 tionally expensive problems via surrogate modeling, AIAA journal 41 (4) (2003) 687–696. [31] Z. Zhou, Y. S. Ong, P. B. Nair, A. J. Keane, K. Y. Lum, Combining global 25000 and local surrogate models to accelerate evolutionary optimization, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 37 (1) (2007) 66–76. 30000 [32] Y. S. Ong, P. Nair, A. Keane, K. Wong, Surrogate-assisted evolutionary Log-Likelihood optimization frameworks for high-fidelity engineering design problems, in: Knowledge Incorporation in Evolutionary Computation, Springer, 35000 2005, pp. 307–331. [33] S. Jeong, M. Murayama, K. Yamamoto, Efficient optimization design 40000 method using kriging model, Journal of aircraft 42 (2) (2005) 413–420. [34] A. Samad, K.-Y. Kim, T. Goel, R. T. Haftka, W. Shyy, Multiple surro- gate modeling for axial compressor blade shape optimization, Journal of 45000 Propulsion and Power 24 (2) (2008) 301–310. 0 2000 4000 6000 8000 10000 [35] M. D. Richard, R. P. Lippmann, Neural network classifiers estimate Samples per Replica [R-1, R-2 ..., R-N] bayesian a posteriori probabilities, Neural computation 3 (4) (1991) 461– 483. [36] D. J. MacKay, Hyperparameters: Optimize, or integrate out?, in: Maxi- mum entropy and Bayesian methods, Springer, 1996, pp. 43–59. Figure 6: The Pen-Digit (top) and Chess (bottom) surrogate accuracy. The [37] H. Wang, Y. Zeng, X. Yu, G. Li, E. Li, Surrogate-assisted bayesian infer- dashed line denotes the real likelihood function while the blue line gives the ence inverse material identification method and application to advanced surrogate likelihood estimation. high strength steel, Inverse Problems in Science and Engineering 24 (7) (2016) 1133–1161. [38] A. Ray, D. Myer, Bayesian geophysical inversion with trans-dimensional Gaussian process machine learning, Geophysical Journal International 217 (3) (2019) 1706–1726. [39] D. A. Freedman, On the asymptotic behavior of bayes’ estimates in the

14 discrete case, The Annals of Mathematical Statistics (1963) 1386–1403. [66] K. Karimi, N. Dickson, F. Hamze, High-performance physics simula- [40] H. White, Maximum likelihood estimation of misspecified models, tions using multi-core cpus and gpgpus in a volunteer computing context, Econometrica: Journal of the Econometric Society (1982) 1–25. The International Journal of High Performance Computing Applications [41] A. Krogh, J. A. Hertz, A simple weight decay can improve generalization, 25 (1) (2011) 61–69. in: Advances in neural information processing systems, 1992, pp. 950– [67] G. Mingas, L. Bottolo, C.-S. Bouganis, Particle mcmc algorithms and ar- 957. chitectures for accelerating inference in state-space models, International [42] R. M. Neal, Bayesian learning for neural networks, Vol. 118, Springer Journal of Approximate Reasoning 83 (2017) 413–433. Science & Business Media, 2012. [68] W. Vousden, W. M. Farr, I. Mandel, Dynamic temperature selection for [43] T. Auld, A. W. Moore, S. F. Gull, Bayesian neural networks for Internet parallel tempering in markov chain monte carlo simulations, Monthly No- traffic classification, IEEE Transactions on neural networks 18 (1) (2007) tices of the Royal Astronomical Society 455 (2) (2015) 1919–1937. 223–239. [69] D. J. Earl, M. W. Deem, Parallel tempering: Theory, applications, and [44] R. M. Neal, et al., Mcmc using hamiltonian dynamics, Handbook of new perspectives, Physical Chemistry Chemical Physics 7 (23) (2005) Markov Chain Monte Carlo 2 (11) (2011). 3910–3916. [45] M. Welling, Y. W. Teh, Bayesian learning via stochastic gradient langevin [70] D. C. Montgomery, J. Vernon M. Bettencourt, Multiple response surface dynamics, in: Proceedings of the 28th International Conference on Ma- methods in computer simulation, SIMULATION 29 (4) (1977) 113–121. chine Learning (ICML-11), 2011, pp. 681–688. [71] J. D. Letsinger, R. H. Myers, M. Lentner, Response surface methods for [46] R. Chandra, L. Azizi, S. Cripps, Bayesian neural learning via langevin bi-randomization structures, Journal of quality technology 28 (4) (1996) dynamics for chaotic time series prediction, in: D. Liu, S. Xie, Y. Li, 381–397. D. Zhao, E.-S. M. El-Alfy (Eds.), Neural Information Processing, 2017. [72] V. Tandjiria, C. I. Teh, B. K. Low, Reliability analysis of laterally loaded [47] G. E. Hinton, S. Osindero, Y.-W. Teh, A fast learning algorithm for deep piles using response surface methods, Structural Safety 22 (4) (2000) belief nets, Neural computation 18 (7) (2006) 1527–1554. 335–355. [48] K. Cho, A. Ilin, T. Raiko, Improved learning of gaussian-bernoulli re- [73] D. Lim, Y. Jin, Y.-S. Ong, B. Sendhoff, Generalizing surrogate-assisted stricted boltzmann machines, in: International conference on artificial evolutionary computation, IEEE Transactions on Evolutionary Computa- neural networks, Springer, 2011, pp. 10–17. tion 14 (3) (2010) 329–355. [49] K. Cho, T. Raiko, A. Ilin, Parallel tempering is efficient for learning re- [74] S. Razavi, B. A. Tolson, D. H. Burn, Review of surrogate modeling in stricted boltzmann machines, in: Neural Networks (IJCNN), The 2010 water resources, Water Resources Research 48 (7) (2012). International Joint Conference on, IEEE, 2010, pp. 1–8. [75]A.D ´ıaz-Manr´ıquez, G. Toscano, J. H. Barron-Zambrano, E. Tello-Leal, [50] G. Desjardins, A. Courville, Y. Bengio, Adaptive parallel tempering A review of surrogate assisted multiobjective evolutionary algorithms, for stochastic maximum likelihood learning of rbms, arXiv preprint Computational intelligence and neuroscience 2016 (2016). arXiv:1012.3476 (2010). [76] A. Giunta, L. Watson, A comparison of approximation model- [51] G. Desjardins, H. Luo, A. Courville, Y. Bengio, Deep tempering, arXiv ing techniques-polynomial versus interpolating models, in: 7th preprint arXiv:1410.0123 (2014). AIAA/USAF/NASA/ISSMO Symposium on Multidisciplinary Analysis [52] C. Bishop, C. M. Bishop, et al., Neural networks for pattern recognition, and Optimization, p. 4758. Oxford university press, 1995. [77] R. Jin, W. Chen, T. W. Simpson, Comparative studies of metamodelling [53] R. H. Swendsen, J.-S. Wang, Replica monte carlo simulation of spin- techniques under multiple modelling criteria, Structural and multidisci- glasses, Physical review letters 57 (21) (1986) 2607. plinary optimization 23 (1) (2001) 1–13. [54] U. H. Hansmann, Parallel tempering algorithm for conformational studies [78] R. Chandra, R. D. Muller,¨ D. Azam, R. Deo, N. Butterworth, T. Salles, of biological molecules, Chemical Physics Letters 281 (1-3) (1997) 140– S. Cripps, Multicore parallel tempering Bayeslands for basin and land- 150. scape evolution, Geochemistry, Geophysics, Geosystems 20 (11) (2019) [55] M. Sambridge, A parallel tempering algorithm for probabilistic sampling 5082–5104. and multimodal optimization, Geophysical Journal International 196 (1) [79] M. Sambridge, A parallel tempering algorithm for probabilistic sampling (2014) 357–374. and multimodal optimization, Geophysical Journal International 196 (1) [56] A. Ray, A. Sekar, G. M. Hoversten, U. Albertin, Frequency domain full (2013) 357–374. waveform elastic inversion of marine seismic data from the alba field us- [80] R. Chandra, L. Azizi, S. Cripps, Bayesian neural learning via langevin dy- ing a bayesian trans-dimensional algorithm, Geophysical Journal Interna- namics for chaotic time series prediction, in: Neural Information Process- tional 205 (2) (2016) 915–937. ing - 24th International Conference, ICONIP 2017, Guangzhou, China, [57] F. Calvo, All-exchanges parallel tempering, The Journal of chemical November 14-18, 2017, Proceedings, Part V, 2017, pp. 564–573. physics 123 (12) (2005) 124106. [81] D. S. Broomhead, D. Lowe, Radial basis functions, multi-variable func- [58] N. Rathore, M. Chopra, J. J. de Pablo, Optimal allocation of replicas in tional interpolation and adaptive networks, Tech. rep., Royal Signals and parallel tempering simulations, The Journal of chemical physics 122 (2) Radar Establishment Malvern (United Kingdom) (1988). (2005) 024111. [82] C. E. Rasmussen, Gaussian processes in machine learning, in: Advanced [59] H. G. Katzgraber, S. Trebst, D. A. Huse, M. Troyer, Feedback-optimized lectures on machine learning, Springer, 2004, pp. 63–71. parallel tempering monte carlo, Journal of Statistical Mechanics: Theory [83] P.-T. de Boer, D. P. Kroese, S. Mannor, R. Y. Rubinstein, A tutorial on and Experiment 2006 (03) (2006) P03018. the cross-entropy method, Annals of Operations Research 134 (1) (2005) [60] E. Bittner, A. Nußbaumer, W. Janke, Make life simple: Unleash the 19–67. doi:10.1007/s10479-005-5724-z. full power of the parallel tempering algorithm, Physical review letters URL https://doi.org/10.1007/s10479-005-5724-z 101 (13) (2008) 130603. [84] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv [61] M. K. Sen, P. L. Stoffa, Bayesian inference, gibbs’ sampler and un- preprint arXiv:1412.6980 (2014). certainty estimation in geophysical inversion, Geophysical Prospecting [85] T. Tieleman, G. Hinton, Lecture 6.5-rmsprop: Divide the gradient by a 44 (2) (1996) 313–350. running average of its recent magnitude, COURSERA: Neural networks [62] M. Maraschini, S. Foti, A monte carlo multimodal inversion of surface for machine learning 4 (2) (2012) 26–31. waves, Geophysical Journal International 182 (3) (2010) 1557–1566. [86] J. Duchi, E. Hazan, Y. Singer, Adaptive subgradient methods for online [63] A. Ray, D. L. Alumbaugh, G. M. Hoversten, K. Key, Robust and acceler- learning and stochastic optimization, Journal of Machine Learning Re- ated bayesian inversion of marine controlled-source electromagnetic data search 12 (Jul) (2011) 2121–2159. using parallel tempering, Geophysics 78 (6) (2013) E271–E280. [87] D. Dua, C. Graff, UCI machine learning repository (2017). [64] A. Ray, S. Kaplan, J. Washbourne, U. Albertin, Low frequency full wave- URL http://archive.ics.uci.edu/ml form seismic inversion within a tree based bayesian framework, Geophys- [88] J. Pall, R. Chandra, D. Azam, T. Salles, J. M. Webster, R. Scalzo, ical Journal International 212 (1) (2018) 522–542. S. Cripps, Bayesreef: A bayesian inference framework for modelling reef [65] Y. Li, M. Mascagni, A. Gorin, A decentralized parallel implementation growth in response to environmental change and biological dynamics, En- for parallel tempering algorithm, Parallel Computing 35 (5) (2009) 269 – vironmental Modelling & Software (2020) 104610. 283. [89] S. Shalev-Shwartz, et al., Online learning and online convex optimization,

15 Foundations and Trends R in Machine Learning 4 (2) (2012) 107–194. [90] C. J. Moore, A. J. Chua, C. P. Berry, J. R. Gair, Fast methods for training gaussian processes on large datasets, Royal Society open science 3 (5) (2016) 160125. [91] B. Peherstorfer, K. Willcox, M. Gunzburger, Survey of multifidelity meth- ods in uncertainty propagation, inference, and optimization, Siam Review 60 (3) (2018) 550–591.

16