Bayesian generalised ensemble Markov chain Monte Carlo

Jes Frellsen (University of Cambridge), Ole Winther (Technical University of Denmark), Zoubin Ghahramani (University of Cambridge), Jesper Ferkinghoff-Borg (University of Copenhagen)

Abstract

Bayesian generalised ensemble (BayesGE) is a new method that addresses two major drawbacks of standard Markov chain Monte Carlo algorithms for inference in high-dimensional probability models: inapplicability to estimate the partition function and poor mixing properties. BayesGE uses a Bayesian approach to iteratively update the belief about the density of states (distribution of the log likelihood under the prior) for the model, with the dual purpose of enhancing the sampling efficiency and making the estimation of the partition function tractable. We benchmark BayesGE on Ising and Potts systems and show that it compares favourably to existing state-of-the-art methods.

1 INTRODUCTION

For most probabilistic models, p(x), exact inference is intractable, and one has to resort to some form of approximation (Bishop 2006). Markov chain Monte Carlo (MCMC) methods constitute a particularly powerful and versatile approach for this purpose (Metropolis and Ulam 1949; Metropolis et al. 1953; Hastings 1970). In standard MCMC methods, p(x) is sampled through a Markov chain with a fixed transition kernel that is constructed to have the unique invariant distribution p(x). In the following we will broadly refer to this as canonical sampling.

While the canonical MCMC method has been the workhorse in statistics and physics for the last 50 years or so, it suffers from two primary deficiencies. First, it only samples from a narrow interval of log-likelihood values, which makes the method inapplicable for calculating key multivariate integrals, in particular the partition function (evidence, marginal likelihood) or the entropy associated with p(x) (Iba 2001; Bishop 2006; Ferkinghoff-Borg 2012). Secondly, canonical sampling is often hampered by a high degree of correlation between the generated states for standard choices of the transition kernel. This property, which is referred to as poor mixing or slow relaxation of the Markov chain, reduces the effective number of samples and may lead to results which are erroneously sensitive to the arbitrary initialisation of the chain.

In the past few decades, a variety of MCMC methods known as extended ensembles have been proposed to alleviate these two deficiencies (see reviews and references in Gelman and Meng (1998), Iba (2001), Murray (2007), and Ferkinghoff-Borg (2012)). The underlying idea of this approach is to build a "bridge" from the part of the probability distribution where the Markov chain suffers from slow relaxation to the part where the sampling is free from such problems. These methods include simulated tempering (Marinari and Parisi 1992; Lyubartsev et al. 1992; Irbäck and Potthast 1995), parallel tempering (Swendsen and Wang 1986; Geyer 1991) and generalised ensembles (Berg and Neuhaus 1992; Lee 1993; Hesselbo and Stinchcombe 1995). In extended ensembles the transition kernel is extended in such a way that it simultaneously allows for the calculation of the partition function and for the reconstruction of the desired statistics for the original target distribution p(x). However, this kernel depends on integral quantities of the model which are not known a priori. Therefore these techniques rely on an iterative approach, where estimates of these quantities obtained from previous iteration(s) are used to define the transition probability kernel for the next iteration (Murray 2007; Ferkinghoff-Borg 2012).

At present, however, extended ensemble techniques use rather heuristic approaches to define the required iteration procedure and are all based on frequentist estimators. Consequently, whilst they have shown promising results in a wide range of problems in machine learning and statistical physics (Salakhutdinov 2010; Landau and Binder 2014), these aspects compromise both the robustness and speed of the algorithms and limit their applicability for inference in more complex systems.

In this paper we focus on the generalised ensemble (GE) subclass of methods, due to its larger domain of application compared to the tempering based counterparts (Hansmann and Okamoto 1997). We propose a novel Bayesian generalised ensemble (BayesGE) method to address the problems pertaining to the traditional GE techniques. The proposed method is applicable to both discrete and continuous distributions but requires discretisation of the log-likelihood in the current formulation. We test the algorithm on two discrete 2D systems: an Ising model and a Potts model. Both models are canonical examples of systems displaying cooperative transitions and slow relaxation, for which tempering based inference methods in general are inadequate. Inference in Potts models is further complicated by their generic multimodal nature as opposed to the essentially bimodal nature of Ising systems. We demonstrate the robustness and accuracy of the algorithm in estimating the full density of states and partition function of the two models for different sizes and/or complexity, and show that BayesGE outperforms existing state-of-the-art methods: the Wang–Landau (WL) algorithm (Wang and Landau 2001), annealed importance sampling (AIS) (Neal 2001) and nested sampling (Skilling 2006; Murray et al. 2005).

In section 2, we outline the basic methodology of generalised ensembles. In section 3, we detail the elements of the BayesGE method. The inference results of the algorithm in comparison to existing advanced sampling methods on the two spin systems are presented in section 4. We conclude by discussing relevant extensions of the method for future work.

2 GENERALISED ENSEMBLES

Consider a posterior distribution of x ∈ X of the form

    pβ(x) = exp(−βE(x)) p0(x) / Zβ ,    (1)

where

    Zβ = ∫ exp(−βE(x)) p0(x) dx .    (2)

In the context of statistical physics, β represents the inverse temperature times the Boltzmann constant, E(x) is the energy and the normaliser Zβ is known as the partition function, which plays a central role due to its link to the thermodynamical free energy. Here, p0 = pβ=0 represents the normalised integration measure and consequently Z0 ≡ Zβ=0 = 1 (footnote 1). In the non-thermal context of Bayesian statistics, we set β = 1, in which case the "energy" equals minus the log likelihood, E(x) = − log p(data|x), x are model parameters and latent variables, p0 is the prior distribution and Z ≡ Zβ=1 is known as the model evidence or marginal likelihood. Another central object is the density of states, which is defined as

    g(E) = ∫ δ(E − E(x)) p0(x) dx    (3)

and measures the prior mass associated with energy E. In statistical physics, g(E) is related to the microcanonical entropy s(E) through Boltzmann's formula s(E) ≡ log g(E), from which all thermodynamic potentials can be calculated. Typically, the energy can be evaluated for any x, but the partition function Zβ and g are unknown.

A Markov chain with unique invariant distribution pβ can be constructed using the Metropolis–Hastings (MH) algorithm (Metropolis et al. 1953; Hastings 1970), in which a sequence of states {xi} is generated by sampling a trial state from a proposal distribution x′ ∼ q(·|xi) and accepting the state with probability a(x′|xi) = min{1, pβ(x′)q(xi|x′) / (pβ(xi)q(x′|xi))} at each time step. Typically, pβ only occupies an exponentially small part of the volume under p0, as illustrated in figure S1. For the Metropolis–Hastings algorithm, this implies that the normalisation constant Zβ (as well as other multivariate integrals) is not tractable, and furthermore that the sampling for more complex models may suffer from slow mixing of the Markov chain.

In the GE procedure, an artificial target distribution is constructed that facilitates a "bridging" between p0 and pβ. The starting point is to replace the log-Boltzmann weights −βE with a different weight function w(y(x)), where y(x) represents a set of functions of x of particular interest ("reaction coordinates" or "order parameters"). The most common choice is to set y = E, in which case the GE target distribution takes the form

    pGE(x) = p(x|w) ≡ exp(w(E(x))) p0(x) / Zw ,    (4)

where

    Zw = ∫ exp(w(E(x))) p0(x) dx .    (5)

Footnote 1: Note that the physical partition function Z0 is normally identified with the total volume of X, corresponding to p0 being an actual integration measure as opposed to a normalised one. However, for most statistical considerations it is the ratio Zβ/Z0 that is of primary importance, so this normalisation convention poses no loss of generality.
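As a concrete illustration of equation (4) (ours, not the authors' code), the following Python sketch computes the Metropolis–Hastings acceptance probability for the GE target when p0 is uniform, in which case the acceptance ratio reduces to exp(w(E′) − w(E)) times the proposal correction. Only NumPy is assumed; the function names are our own.

import numpy as np

def ge_log_accept(E_curr, E_prop, w, log_q_ratio=0.0):
    # Log Metropolis-Hastings acceptance probability for the GE target of
    # equation (4), p(x|w) proportional to exp(w(E(x))) p0(x), assuming a uniform p0.
    # `w` maps an energy to a weight; `log_q_ratio` = log q(x|x') - log q(x'|x)
    # (zero for a symmetric proposal such as a single spin flip).
    return min(0.0, w(E_prop) - w(E_curr) + log_q_ratio)

def accept(E_curr, E_prop, w, rng, log_q_ratio=0.0):
    # Draw the accept/reject decision.
    return np.log(rng.random()) < ge_log_accept(E_curr, E_prop, w, log_q_ratio)

# w(E) = -beta*E recovers canonical Metropolis sampling of equation (1); the
# generalised-ensemble weight functions discussed below are used in the same way.
rng = np.random.default_rng(0)
print(accept(E_curr=-10.0, E_prop=-8.0, w=lambda E: -1.0 * E, rng=rng))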


Sampling from p(x|w) is realised with MH by replacing pβ(x) with p(x|w) in a(x′|xi).

In order to ensure the target distribution has the desired properties, GE methods make use of the density of states to define the weights. Using the definition from equation (3), the marginal distribution of the energy p(E|w) = ∫ δ(E − E(x)) p(x|w) dx can be expressed as

    p(E|w) = exp(w(E) + s(E)) / Zw    (6)

and the partition function from equation (5) as

    Zw = ∫ exp(w(E) + s(E)) dE .    (7)

We can use this distribution to estimate canonical quantities for any choice of β, for example

    Zβ = Zw ∫ exp(−βE − w(E)) p(E|w) dE = ∫ exp(−βE + s(E)) dE .    (8)

So in principle, for any choice of w(E) we can collect samples from p(x|w) and thereby from p(E|w) to get posterior estimates.

If we know the entropy s(E) we can then define two prominent generalised ensembles: the multicanonical ensemble (MUCA) (Berg and Neuhaus 1992; Lee 1993) and the 1/k ensemble (Hesselbo and Stinchcombe 1995), with weights given respectively by

    wMUCA(E) = −s(E) ,    w1/k(E) = − log k(E) ,    (9)

where k(E) ≡ ∫_{−∞}^{E} g(E′) dE′. The multicanonical ensemble makes the marginal distribution p(E|w) flat, which implies that once the Markov chain has converged all energies are visited with equal probability. The 1/k ensemble is constructed to put roughly equal probability on all values of the entropy, which leads to more frequent sampling of the low-energy part compared to multicanonical sampling, c.f. section S1.

The primary obstacle of the GE approach is that s(E) is not known a priori. Indeed, had this been the case, the partition function Zβ could have been directly calculated according to the one-dimensional integral (8). The fact that the GE ensemble is defined in terms of quantities that are the primary aim to infer from the method in the first place implies that all GE methods rely on an iterative procedure in which weights are iteratively refined based on the data and weights from previous iteration(s). The different GE learning algorithms differ in the choice of data, estimators and iteration procedure, but are all based on count statistics and do not account for prior knowledge in any systematic manner, c.f. section S2.

3 BAYESIAN GENERALISED ENSEMBLE

We wish to address the problems of traditional GE approaches in a principled manner, by treating the entropy estimation as an inference problem in a Bayesian framework. In the following we will assume that we are considering a discrete or discretised system having J possible energy values. To this end, let s = (s1, . . . , sJ) be the entropy vector over the energy values E = (E1, . . . , EJ) with sj = log g(Ej).

Assume that for each weight vector w(τ) = (w1, . . . , wJ) in a given set of weights W = {w(τ)}_{τ=1}^{t} we perform a Metropolis–Hastings simulation with ν(τ) steps and target distribution

    P(x|w(τ)) = exp(wj(τ)) P0(x) / Zw(τ) ,    (10)

where E(x) = Ej. Based on the samples {xi(τ)}_{i=1}^{ν(τ)} from the τ'th simulation, we can compute an energy histogram n(τ) = (n1(τ), . . . , nJ(τ)), where nj(τ) = |{xi(τ) | E(xi(τ)) = Ej}|. This gives us a set of t energy histograms N = {n(τ)}_{τ=1}^{t}. Based on these t simulations we can write the posterior distribution of the entropy s as

    P(s|N, W, σ) = P(N|W, s) P(s|σ) / P(N|W, σ) ,    (11)

where σ are hyperparameters of the prior. We propose to use a Gaussian process prior for s and a product of multinomial distributions for the likelihood of N. As detailed below and in algorithm 1, this allows us to define an efficient and robust algorithm, where the posterior P(s|N, W, σ) is iteratively updated and used to define the weights w(t+1) for the next simulation.

3.1 Prior Specification

For continuous systems s(E) is typically a smooth and (mostly) concave function with a non-trivial shape. Similarly, for a discrete or discretised system s will be values of such a function. We propose to use a Gaussian process prior GP(0, κ) for s(E) with a suitable kernel κ (Rasmussen and Williams 2006). As s(E) is typically concave, the kernel needs to be non-stationary, because the entropy should not revert back to the mean function away from observed energies. Concavity cannot directly be modelled in a quadratic form, but a cubic spline extrapolates linearly and gives a quite flexible fit. Linear extrapolation appears to be a good choice because we need to be reasonably conservative in order to keep the iterative estimation of weights robust.
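In the discretised setting just introduced, the two ensembles of equation (9) become simple transformations of an entropy vector s. A minimal sketch (ours; NumPy and SciPy assumed, with made-up example values), using a cumulative log-sum-exp for the 1/k weights:

import numpy as np
from scipy.special import logsumexp

def muca_weights(s):
    # Multicanonical weights of equation (9): w_j = -s_j.
    return -np.asarray(s, dtype=float)

def one_over_k_weights(s):
    # 1/k weights of equation (9) in discrete form: w_j = -log sum_{j' <= j} exp(s_j'),
    # i.e. minus the log cumulative density of states, assuming E_1 < ... < E_J.
    s = np.asarray(s, dtype=float)
    return -np.array([logsumexp(s[: j + 1]) for j in range(len(s))])

# Toy example with a made-up entropy vector over J = 5 energy bins.
s = np.array([0.0, 3.0, 5.0, 6.0, 6.5])
print(muca_weights(s))
print(one_over_k_weights(s))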


The cubic spline prior penalises curvature, and by assuming the boundary conditions s(0) = s′(0) = 0 the kernel on the unit interval can be expressed as (section 6.3.1 in Rasmussen and Williams 2006; Wahba 1978)

    κ(E, E′|σ) = σ² (|E − E′| v²/2 + v³/3) ,    (12)

where v = min(E, E′). As these boundary conditions are not appropriate for our problem, we remove them by incorporating explicit linear basis functions u(E) = (1, E)ᵀ with a normal prior on the coefficients N(0, B), B ∈ R^{2×2}. By integrating out the coefficients we obtain the prior

    s|σ ∼ GP(0, κ(E, E′|σ) + u(E)ᵀ B u(E′)) ,    (13)

where we take the limit B⁻¹ → 0 to make the prior on the basis functions vague (section 2.7 in Rasmussen and Williams 2006).

3.2 Likelihood

The probability of observing a histogram n ∈ N generated with the weights w ∈ W is naturally expressed through the multinomial distribution (Ferkinghoff-Borg 2002)

    p(n|w, s) = ν! ∏_{j∈S} p(Ej|w, s)^{nj} / nj! ,    (14)

where ν = Σ_{j∈S} nj, S is the index set associated with the histogram (see below) and

    p(Ej|w, s) = exp(wj + sj) / Zw ,    (15)

where

    Zw = Σ_{j∈S} exp(wj + sj) .    (16)

Equation (15) is the discrete version of equation (6). If n represents a fully equilibrated sample from p(x|w), the sum in equation (16) should run over the full index set S = {1, . . . , J}. However, if we do not have a fully equilibrated sample, it was proposed by Ferkinghoff-Borg (2002) to assume that each histogram n only represents an equilibrated sampling within the observed support S̃ = {j | nj > 0}. In this case we would constrain the summation in the normalisation constant Zw to S̃ to reflect the assumption of local equilibration only, as detailed by Ferkinghoff-Borg (2002, 2012). As the weights and entropy function render the histograms independent, the probability for the combined set of observations N is then given by

    P(N|W, s) = ∏_{τ=1}^{t} P(n(τ)|w(τ), s) .    (17)

To compensate for the likelihood not scaling correctly with the number of independent samples, we scale the histograms n(τ) with the inverse of the number of degrees of freedom of the sampled model.

3.3 Posterior Inference

In order to make posterior inference analytically tractable, we approximate the log-likelihood function from equation (17) by a second order Taylor expansion. Using the maximum likelihood estimate (MLE) ŝ = arg max_s P(N|W, s) as expansion point, this approximation reads

    log P(N|W, s) ≈ log P(N|W, ŝ) − ½ (s − ŝ)ᵀ H (s − ŝ) ,    (18)

where ŝj is only well-defined for j ∈ S = ∪_{τ=1}^{t} S̃(τ). H is given by the negative Hessian

    H = − ∂² log P(N|W, s)/∂s² |_{s=ŝ} = Σ_{τ=1}^{t} ν(τ) [diag(p̂(τ)) − p̂(τ) (p̂(τ))ᵀ] ,    (19)

where p̂j(τ) = P(Ej|w(τ), ŝ). We note that third (and higher) order terms in the expansion are generally negligible, as these involve two (or more) outer products of p̂(τ), and for the broad target distributions of GE methods ‖p̂(τ)‖∞ ≪ 1. The MLE solution ŝ can be found using the generalised multi-histogram (GMH) equations (Ferkinghoff-Borg 2002), which involve a set of t nonlinear equations (S3) with unknowns {Zw(τ)}_{τ=1}^{t}. These can be solved effectively using the standard iterative Newton–Raphson method, and ŝ can be calculated from the solution using equation (S4); see section S3 for details.

Using this second order approximation, P(N|W, s) ≈ N(ŝ, H), we can perform posterior inference analytically, as both the prior and likelihood are Gaussian. This means that P(s|N, W, σ) ≈ N(s̄, V), and when taking the limit B⁻¹ → 0 in equation (13) we get (section 2.7 in Rasmussen and Williams 2006)

    s̄ = κ(E, ES|σ) KH⁻¹ ŝ + Rᵀ (US KH⁻¹ USᵀ)⁻¹ US KH⁻¹ ŝ    (20)

    V = κ(E, E|σ) − κ(E, ES|σ) KH⁻¹ κ(ES, E|σ) + Rᵀ (US KH⁻¹ USᵀ)⁻¹ R    (21)

where ES = (Ej | j ∈ S), HS = [Hjj′ | j, j′ ∈ S], KH = κ(ES, ES|σ) + HS⁻¹, U = u(E), US = u(ES) and R = U − US KH⁻¹ κ(ES, E|σ). An estimate σ̂ for the hyperparameter is obtained by optimising the marginal likelihood P(N|W, σ) w.r.t. σ. See section S4 for further details.

We have also tested other approximations to the posterior, including a quadratic approximation to the individual likelihood terms from equation (14) and a Laplace approximation to the posterior. Approximating the individual terms turned out to be too inaccurate. For the Laplace approximation we can update the posterior either sequentially, using the current histogram plus the previous Laplace approximation, or using all histograms plus the Gaussian process prior. The latter will in principle give the best approximation; however, the Laplace approximation involves solving a |S|-dimensional optimisation problem, while the MLE can be found by solving a system of t nonlinear equations with t unknowns, where t ≪ |S| typically.
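Two ingredients of this section are straightforward to compute directly: the cubic spline kernel of equation (12) and the negative Hessian of equation (19). The sketch below is ours, not the authors' implementation, and omits the linear basis functions and the GMH step; only NumPy is assumed.

import numpy as np

def cubic_spline_kernel(E1, E2, sigma):
    # Cubic spline kernel of equation (12) on the unit interval:
    # k(E, E') = sigma^2 * (|E - E'| v^2 / 2 + v^3 / 3), with v = min(E, E').
    # E1 and E2 are 1-D arrays of energies rescaled to [0, 1].
    E1 = np.asarray(E1, dtype=float)[:, None]
    E2 = np.asarray(E2, dtype=float)[None, :]
    v = np.minimum(E1, E2)
    return sigma ** 2 * (np.abs(E1 - E2) * v ** 2 / 2.0 + v ** 3 / 3.0)

def hessian(nu_list, p_hat_list):
    # Negative Hessian of equation (19): H = sum_tau nu_tau * (diag(p_hat) - p_hat p_hat^T),
    # where each p_hat is the vector of probabilities P(E_j | w^(tau), s_hat).
    return sum(nu * (np.diag(p) - np.outer(p, p)) for nu, p in zip(nu_list, p_hat_list))

# Toy usage: 4 energy bins and 2 simulations with made-up probabilities.
E = np.linspace(0.0, 1.0, 4)
K = cubic_spline_kernel(E, E, sigma=1.0)
p1, p2 = np.array([0.4, 0.3, 0.2, 0.1]), np.array([0.1, 0.2, 0.3, 0.4])
H = hessian([5000, 5000 * 2 ** 0.1], [p1, p2])
print(K.shape, H.shape)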

3.4 Setting the Weights

At each iteration of the algorithm, we can use the posterior mean estimate s̄, equation (20), to define the weights of the generalised ensemble in the next step. The simplest approach is to use the definition of the GE ensemble from equation (9) directly, which implies the updating rules for respectively multicanonical and 1/k weights:

    w̄[MUCA](t+1) = −s̄    (22)

    w̄[1/k]j(t+1) = − log Σ_{j′≤j} exp(s̄j′) .    (23)

However, this rule fails to account for the uncertainty of the posterior of s. A more robust approach is to define the weights according to expectation values of the marginal distribution P(Ej|w(t+1)) under the posterior distribution P(s|N, W, σ̂), where N and W are the sets of histograms and weights for the first t iterations. For any weight function w this expectation is given as

    ⟨P(Ej|w)⟩_{P(s|N,W)} = ⟨exp(wj + sj) / Zw⟩_{P(s|N,W)} .    (24)

In section S5 we show that, to first order in δs = s − s̄, the expected GE target distribution is realised by simply setting

    w̃(t+1) = w̄(t+1) − ½ diag(V) ,    (25)

where w̄(t+1) are the weights defined in equations (22) and (23) and V is the posterior covariance matrix (21).

3.5 Algorithmic Details

The BayesGE algorithm is outlined in algorithm 1.

Algorithm 1 The BayesGE algorithm
Input: ν(1), γ
1: w(1) ← 0
2: N, W ← {}, {}
3: for t ∈ (1, . . . , T) do
4:   Draw ν(t) samples {xi}_{i=1}^{ν(t)} ∼ P(x|w(t)) using Metropolis–Hastings
5:   Compute an energy histogram n(t) based on the samples {xi}_{i=1}^{ν(t)}
6:   Insert n(t) and w(t) into N and W respectively
7:   Optimise P(N|W, σ) w.r.t. σ and update the posterior P(s|N, W, σ̂)
8:   Set the weight for the next iteration w(t+1) using P(s|N, W, σ̂) and equation (25)
9:   Set the simulation time for the next iteration ν(t+1) using equation (26)
Output: P(s|N, W, σ̂)

We use a simple exponential scheme for setting the simulation time for each iteration of the algorithm, which has previously been shown to be efficient (Ferkinghoff-Borg 2002). The simulation time ν(t+1) for the (t+1)'th iteration of the algorithm is set to

    ν(t+1) = γ ν(t)   if n(t) and Σ_{τ=1}^{t−1} n(τ) have the same support,
    ν(t+1) = ν(t)     otherwise,    (26)

where the increase factor γ > 1. The motivation for this scheme is that if we did not see any new energies in the last simulation then we either need to run longer simulations to see new energies or we have seen all possible energies. If we have seen new energies in the last simulation, then the simulation time is already long enough to explore new energies. As default values we use ν(1) = 5000 and γ = 2^{1/10}. The algorithm is robust to changes in these values and they do not change the performance of the algorithm significantly.

3.6 Complexity and Convergence

The computational complexity of the BayesGE algorithm stems from ν = Σ_{τ=1}^{t} ν(τ) evaluations of the posterior target distribution, equation (1), and T = O(log ν) entropy inference steps, each of which is O(J³). Accordingly, in terms of MC steps ν the algorithm scales as O(νc + J³ log ν), where c is the complexity of evaluating the posterior target, equation (1). The latter term can be reduced to be linear in J using sparse GP approximations (Quiñonero-Candela and Rasmussen 2005). Note that the BayesGE algorithm does not introduce an additional cost at the individual MC step, as the weights are kept fixed for each iteration of the algorithm. For fully connected models (e.g. Boltzmann machines and polymer systems) the complexity of posterior target evaluations scales with the square of the number of degrees of freedom and linearly with the number of data points, in which case the overhead of the inference step is expected to be negligible, as empirically illustrated in section S7.

Convergence to the target ensemble is in principle guaranteed by the formal correctness of the inference procedure (in keeping with the preservation of detailed balance) and the exponential update rule, equation (26), which serves the dual purpose of ensuring accumulation of statistics as well as asymptotically unbiased sampling. As demonstrated on the chosen models, the BayesGE algorithm has a learning stage, where the error of an estimator scales sub-O(1/√ν), and a sampling stage, where the error displays a O(1/√ν) scaling, similar to the WL algorithm (Iba 2001).
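To make steps 8 and 9 of algorithm 1 concrete, the following sketch (ours; NumPy assumed, with w̄ and V taken as given from the posterior of section 3.3) applies the uncertainty correction of equation (25) and the simulation-time rule of equation (26).

import numpy as np

def corrected_weights(w_bar, V):
    # Equation (25): subtract half the posterior variances from the mean-based
    # weights w_bar of equation (22) (multicanonical) or (23) (1/k).
    return np.asarray(w_bar, dtype=float) - 0.5 * np.diag(V)

def next_sim_time(nu_t, hist_t, hist_prev_sum, gamma=2 ** (1 / 10)):
    # Equation (26): grow the simulation time by gamma when the latest histogram
    # has the same support as the accumulated previous histograms; otherwise keep it.
    same_support = np.array_equal(np.asarray(hist_t) > 0, np.asarray(hist_prev_sum) > 0)
    return gamma * nu_t if same_support else nu_t

# Toy usage with a 4-bin system and made-up posterior quantities.
w_bar = np.array([0.0, -3.0, -5.0, -6.0])
V = np.diag([0.5, 0.1, 0.2, 1.0])
print(corrected_weights(w_bar, V))
print(next_sim_time(5000, hist_t=np.array([10, 5, 0, 0]), hist_prev_sum=np.array([30, 12, 0, 0])))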


[Figure 1 appears here: three panels of s(E) against E, showing the posterior at 5·10³, 175·10³ and 494·10³ MC steps.]

Figure 1: Examples of the posterior distribution of the entropy function s for BayesGE with multicanonical weights at three different simulation lengths (corresponding to 1, 25 and 45 histograms) for the 2D Ising model of size 16 × 16. The orange line is the ground truth and the blue line is the posterior mean estimate s̄, equation (20), with the shaded area showing ± two standard deviations.

4 RESULTS

We have applied the method to two discrete Markov random field models, the 2D square lattice Ising model and Potts model (see Wu (1982) for a review), and compared it against three other state-of-the-art algorithms: the WL algorithm, AIS and nested sampling.

In the Ising model the state is x ∈ {−1, 1}^{L²} and in the Potts model x ∈ {1, 2, . . . , q}^{L²}, where L is the size of the lattice side and q is the number of colours. We use a homogeneous coupling constant of 1 and no external field, in which case the energy of a state is given respectively by

    EIsing(x) = − Σ_{⟨a,b⟩} xa xb    (27)

    EPotts(x) = − Σ_{⟨a,b⟩} δ(xa, xb) ,    (28)

where the sums run over neighbour pairs ⟨a, b⟩ in the lattice and δ is the Kronecker delta. For these models we want to estimate the partition function Zβ from equation (2) with a uniform prior (footnote 2). The Ising model has a second order phase transition, while the Potts model has a first order phase transition and is notoriously hard to sample for temperature based methods (Iba 2001). For all algorithms we use a single spin flip Metropolis proposal.

Footnote 2: The total number of states in the two models is easily calculated as Z0 = 2^{L²} and Z0 = q^{L²} respectively, using the natural (unnormalised) counting measure for p0. This implies that we could in these cases obtain absolute estimates for any partition function Zβ in the BayesGE methodology.
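For reference, the lattice energies of equations (27) and (28) can be evaluated with simple array shifts. The sketch below is ours, not the paper's implementation, and assumes periodic boundary conditions, which the text does not state explicitly; only NumPy is used.

import numpy as np

def ising_energy(x):
    # Equation (27): E(x) = -sum over nearest-neighbour pairs of x_a * x_b,
    # for x in {-1, +1}^(L x L); periodic boundaries are assumed here.
    return -(np.sum(x * np.roll(x, 1, axis=0)) + np.sum(x * np.roll(x, 1, axis=1)))

def potts_energy(x):
    # Equation (28): E(x) = -sum over nearest-neighbour pairs of delta(x_a, x_b),
    # for x in {1, ..., q}^(L x L); periodic boundaries are assumed here.
    return -(np.sum(x == np.roll(x, 1, axis=0)) + np.sum(x == np.roll(x, 1, axis=1)))

# Random 16 x 16 configurations (Ising, and Potts with q = 20).
rng = np.random.default_rng(0)
print(ising_energy(rng.choice([-1, 1], size=(16, 16))))
print(potts_energy(rng.integers(1, 21, size=(16, 16))))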


4.1 Posterior Distribution of Entropy

Examples of the posterior distribution of the entropy s for a simulation of the 16 × 16 Ising model using BayesGE with multicanonical weights are illustrated in figure 1 and compared to the correct values obtained by analytical means (Beale 1996). We see that the BayesGE algorithm initially learns about the entropy around the peak of the curve (corresponding to low β) and is very uncertain about the tails. As more samples are collected, the uncertainty drops in the tails of the entropy function. Similar plots for BayesGE with 1/k weights are shown in figure S2. A notable difference is that the multicanonical weights explore the whole phase space, whereas 1/k explores the left branch of the entropy curve only. As the Zβ values that we will consider below all correspond to sums with the majority of the weight on the left branch, this gives 1/k a natural advantage over the multicanonical weights. AIS and nested sampling also essentially only sample the left branch of the entropy curve.

[Figure 2 appears here: four panels (Ising 16 × 16 β = 1, Ising 32 × 32 β = 1, Potts 16 × 16 q = 20 β = 2, Potts 16 × 16 q = 100 β = 4) plotting the relative RMSE of log Zβ against the number of MC steps on logarithmic axes.]

Figure 2: The relative root mean square error (RMSE) of log Zβ as a function of the number of Monte Carlo (MC) steps for simulations on 2D Ising and Potts models with different sizes, number of colours (q) and values of β. For an unbiased MC estimator the RMSE scales with the number of MC steps ν as O(1/√ν), which is indicated with a dashed line. We show results for BayesGE with the multicanonical ensemble (BayesGE MUCA) and the 1/k ensemble (BayesGE 1/k), annealed importance sampling (AIS), the Wang and Landau (WL) algorithm and nested sampling.

4.2 Estimation of Partition Functions

In figure 2 we compare the performance of BayesGE to WL, AIS and nested sampling (see alternative illustrations in figures S3 and S4) in estimating the partition functions for Ising and Potts models. For each model we ran 50 independent simulations using each method, and the plots show the relative root mean square error (RMSE) of log Zβ as a function of the number of Monte Carlo (MC) steps (footnote 3). For the Ising model the reference log Zβ is calculated from the analytical entropy function (Beale 1996). For the Potts model we cannot compute log Zβ analytically, so as reference we use the average log Zβ over 50 independent WL simulations using 10¹⁰ MC steps. For each model, we selected β such that the mode of p(E|wβ) lies well below the phase transition.

Footnote 3: The relative RMSE is √⟨(log Ẑ − log Zref)²⟩ / log Zref, where Ẑ is the estimated partition function, Zref the reference partition function and the expectation ⟨·⟩ is taken over repeated simulations.

We ran BayesGE with multicanonical and 1/k weights using the standard settings from section 3, and calculated Zβ using the posterior mean estimate s̄ and equation (8). The implementation details of BayesGE and the other methods are discussed in section S6. The unknown hyperparameters of nested sampling were optimised prior to the simulation results presented in figure 2 in a laborious trial-and-error manner (see figure S5). Furthermore, both AIS and nested sampling require a full rerun for each choice of the total allocated simulation time. In contrast, the GE algorithms simply proceed with the iteration process, and as such provide a more flexible framework for increasing the precision of the estimators.

The first observation to make from figure 2 is that the BayesGE algorithm in all cases displays a sigmoidal type of behaviour, with a sharp decrease at intermediate times (learning stage) followed by a O(1/√ν) behaviour at large times (sampling stage), as previously discussed. This long time scaling behaviour testifies to the convergence properties of the algorithm.

If we compare the two multicanonical methods, BayesGE MUCA and WL, we see that they are somewhat on par on the two Ising models. On the Potts models, however, BayesGE MUCA does perform much better than WL, with a particularly pronounced speedup on the 100 colour problem. This testifies to the advantage of the Bayesian approach, since the target ensemble is the same for the two algorithms.
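Both quantities used throughout this section are cheap to evaluate once an entropy estimate is available: the discrete analogue of equation (8) gives log Zβ from s̄ via a log-sum-exp, and the relative RMSE of footnote 3 follows directly. A small sketch (ours; NumPy and SciPy assumed, with made-up numbers in the usage example):

import numpy as np
from scipy.special import logsumexp

def log_Z_from_entropy(s_bar, energies, beta):
    # Discrete analogue of equation (8): log Z_beta = log sum_j exp(s_j - beta * E_j).
    return logsumexp(np.asarray(s_bar) - beta * np.asarray(energies))

def relative_rmse_log_Z(log_Z_estimates, log_Z_ref):
    # Footnote 3: sqrt(<(log Z_hat - log Z_ref)^2>) / log Z_ref, with the average
    # taken over repeated independent simulations.
    err = np.asarray(log_Z_estimates, dtype=float) - log_Z_ref
    return np.sqrt(np.mean(err ** 2)) / log_Z_ref

# Toy usage: entropy estimate over 5 bins, evaluated at beta = 1.
s_bar = np.array([0.0, 3.0, 5.0, 6.0, 6.5])
E = np.array([-8.0, -4.0, 0.0, 4.0, 8.0])
print(log_Z_from_entropy(s_bar, E, beta=1.0))
print(relative_rmse_log_Z([101.2, 99.5, 100.4], log_Z_ref=100.0))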

On all four models AIS performs well at a low number of MC steps, and for the Ising models AIS is one of the best methods. However, for the Potts models, which have a first order phase transition, we see that AIS converges very slowly and performs the worst of all methods. This fits with the expectation that temperature based methods have notorious difficulties on systems with first order phase transitions (Iba 2001), due to trapping in metastable states. Nested sampling also performs well at a low number of MC steps, but on the Ising models it has the highest RMSE at the end of the simulations. As expected, nested sampling does not have the same problems as AIS with the first order phase transition in the Potts model. On the 20 colour Potts model the error is about an order of magnitude lower for nested sampling than for AIS at the same number of iterations. However, at the end of the simulation WL and the two BayesGE methods have a lower error than nested sampling. Nested sampling performs very well on the 100 colour Potts model, and it is only the two BayesGE methods that have a lower RMSE at 10⁹ MC steps.

BayesGE 1/k performs consistently well on all four models, and in all cases it has the lowest RMSE at the maximal number of MC steps. On the Ising models BayesGE 1/k has a similar performance to AIS, though AIS has a lower RMSE at low numbers of MC steps, whereas BayesGE 1/k has the lowest error at high numbers of MC steps. On the 20 colour Potts model BayesGE 1/k outperforms all the other methods: it is only AIS that has a slightly lower RMSE at low numbers of MC steps. On the 100 colour Potts model both AIS and nested sampling perform best at low numbers of iterations, but at the end the two BayesGE methods have the lowest RMSE. If we exclude the AIS results on the Ising system, the typical speed-up in time to reach a prescribed accuracy between BayesGE 1/k and any of the other algorithms is of one order of magnitude, as illustrated in figure S3.

Here we compared the performance of the algorithms in terms of the number of MC steps, as for most systems the actual simulation time will be dominated by the evaluation of the posterior target distribution, c.f. section 3.6. On the 20 colour 16 × 16 Potts model the running times for AIS and WL with 10⁹ MC steps are approximately 3 minutes (C++ implementation), whereas the BayesGE methods use approximately 6 minutes (simulation implemented in C++ and inference in Python). The time overhead for BayesGE can mainly be attributed to the posterior inference for the entropy. It has to be emphasised, however, that the Ising and Potts models have been specifically chosen due to the fact that the posterior can be evaluated in constant time, which is what makes the accurate comparison of the different inference algorithms tractable on a wide range of MC timescales. As a proof-of-principle we have also applied BayesGE and AIS to a binary restricted Boltzmann machine trained on the MNIST dataset (LeCun et al. 1998), with the dual purpose of demonstrating the application of our algorithm to a more computationally intensive model having a semi-continuous energy spectrum and verifying that BayesGE approaches the same scaling behaviour as AIS in this case, see section S7.

5 CONCLUSION

In conclusion, we have presented a new method for enhancing sampling efficiency and estimating key multivariate integrals in high-dimensional probability models. The robustness and accuracy of the method have been demonstrated and compared with existing methodologies on two classical spin systems displaying cooperative transitions and multimodality.

We note that both the WL and the BayesGE algorithm in their present formulation require a reasonable binning to be known prior to the simulation. However, in contrast to the WL algorithm, our update rule, equation (26), does not presume prior knowledge of the relevant energy range, which is a considerable advantage in systems where the ground state(s) or low temperature properties are unknown. Extensions of our method to online binning will be the subject of future work.

There are a number of other directions to be pursued in future work. First of all, we wish to obtain a more absolute interpretation of the covariance structure of the posterior, by ensuring that the likelihood function scales correctly with the number of independent observations. Secondly, it is of obvious interest to explore other prior functions and determine their proper domain of application. Thirdly, we aim to extend the method to larger systems as well as multivariate representations of the density function using approximate posterior inference. As our method can be applied to any statistical model, we believe it should find important applications in a wide range of computational fields, including machine learning, statistics, statistical physics and bioinformatics, where inference in high-dimensional systems forms a central problem.

Acknowledgements

JF acknowledges funding from the Danish Council for Independent Research 0602-02909B. ZG acknowledges funding from EPSRC EP/I036575/1 and Google.


References

Beale, P. D. (1996). "Exact Distribution of Energies in the Two-Dimensional Ising Model". In: Physical Review Letters 76 (1), pp. 78–81.
Berg, B. A. and Neuhaus, T. (1992). "Multicanonical ensemble: A new approach to simulate first-order phase transitions". In: Physical Review Letters 68 (1), p. 9.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Cambridge: Springer Verlag.
Ferkinghoff-Borg, J. (2002). "Optimized Monte Carlo analysis for generalized ensembles". In: The European Physical Journal B 29 (3), pp. 481–484.
Ferkinghoff-Borg, J. (2012). "Monte Carlo Methods for Inference in High-Dimensional Systems". In: Bayesian Methods in Structural Bioinformatics. Ed. by T. Hamelryck, K. Mardia, and J. Ferkinghoff-Borg. Statistics for Biology and Health. Springer Berlin Heidelberg, pp. 49–93.
Gelman, A. and Meng, X.-L. (1998). "Simulating normalizing constants: from importance sampling to bridge sampling to path sampling". In: Statistical Science 13 (2), pp. 163–185.
Geyer, C. J. (1991). "Markov Chain Monte Carlo Maximum Likelihood". In: Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface. Interface Foundation of North America.
Hansmann, U. H. E. and Okamoto, Y. (1997). "Generalized-ensemble Monte Carlo method for systems with rough energy landscape". In: Physical Review E 56 (2), p. 2228.
Hastings, W. (1970). "Monte Carlo sampling methods using Markov chains and their applications". In: Biometrika 57 (1), p. 97.
Hesselbo, B. and Stinchcombe, R. B. (1995). "Monte Carlo Simulation and Global Optimization without Parameters". In: Physical Review Letters 74 (12), p. 2151.
Iba, Y. (2001). "Extended Ensemble Monte Carlo". In: International Journal of Modern Physics C 12 (5), p. 623.
Irbäck, A. and Potthast, F. (1995). "Studies of an off-lattice model for protein folding: sequence dependence and improved sampling at finite temperature". In: Journal of Chemical Physics 103, p. 10298.
Landau, D. P. and Binder, K. (2014). A Guide to Monte Carlo Simulations in Statistical Physics. Cambridge University Press.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). "Gradient-based learning applied to document recognition". In: Proceedings of the IEEE 86 (11), pp. 2278–2324.
Lee, J. (1993). "New Monte Carlo algorithm: Entropic sampling". In: Physical Review Letters 71 (2), p. 211.
Lyubartsev, A. P., Martsinovski, A. A., Shevkunov, S. V., and Vorontsov-Velyaminov, P. N. (1992). "New approach to Monte Carlo calculation of the free energy: Method of expanded ensembles". In: The Journal of Chemical Physics 96 (3), p. 1776.
Marinari, E. and Parisi, G. (1992). "Simulated Tempering: A New Monte Carlo Scheme". In: Europhysics Letters 19, p. 451.
Metropolis, N. and Ulam, S. (1949). "The Monte Carlo Method". In: Journal of the American Statistical Association 44 (247), pp. 335–341.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). "Equation of State Calculations by Fast Computing Machines". In: The Journal of Chemical Physics 21 (6), p. 1087.
Murray, I. (2007). "Advances in Markov chain Monte Carlo methods". PhD thesis. Gatsby Computational Neuroscience Unit, University College London.
Murray, I., MacKay, D. J. C., Ghahramani, Z., and Skilling, J. (2005). "Nested sampling for Potts models". In: Advances in Neural Information Processing Systems 18. Ed. by Y. Weiss, B. Schölkopf, and J. Platt, pp. 947–954.
Neal, R. M. (2001). "Annealed importance sampling". In: Statistics and Computing 11 (2), pp. 125–139.
Quiñonero-Candela, J. and Rasmussen, C. E. (2005). "A Unifying View of Sparse Approximate Gaussian Process Regression". In: Journal of Machine Learning Research 6, pp. 1939–1959.
Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. The MIT Press.
Salakhutdinov, R. (2010). "Learning Deep Boltzmann Machines using Adaptive MCMC". In: 27th International Conference on Machine Learning, pp. 943–950.
Skilling, J. (2006). "Nested sampling for general Bayesian computation". In: Bayesian Analysis 1 (4), pp. 833–859.
Swendsen, R. H. and Wang, J.-S. (1986). "Replica Monte Carlo Simulation of Spin-Glasses". In: Physical Review Letters 57 (21), pp. 2607–2609.
Wahba, G. (1978). "Improper Priors, Spline Smoothing and the Problem of Guarding Against Model Errors in Regression". In: Journal of the Royal Statistical Society B 40 (3), pp. 364–372.
Wang, F. and Landau, D. P. (2001). "Efficient, Multiple-Range Random Walk Algorithm to Calculate the Density of States". In: Physical Review Letters 86 (10), p. 2050.
Wu, F. Y. (1982). "The Potts model". In: Reviews of Modern Physics 54 (1), pp. 235–268.
