Parallel tempering on optimized paths

Saifuddin Syed *1  Vittorio Romaniello *1  Trevor Campbell 1  Alexandre Bouchard-Côté 1

Abstract

Parallel tempering (PT) is a class of Monte Carlo algorithms that constructs a path of distributions annealing between a tractable reference and an intractable target, and then interchanges states along the path to improve mixing in the target. The performance of PT depends on how quickly a sample from the reference distribution makes its way to the target, which in turn depends on the particular path of annealing distributions. However, past work on PT has used only simple paths constructed from convex combinations of the reference and target log-densities. This paper begins by demonstrating that this path performs poorly in the setting where the reference and target are nearly mutually singular. To address this issue, we expand the framework of PT to general families of paths, formulate the choice of path as an optimization problem that admits tractable gradient estimates, and propose a flexible new family of spline interpolation paths for use in practice. Theoretical and empirical results both demonstrate that our proposed methodology breaks previously-established upper performance limits for traditional paths.

1. Introduction

Markov chain Monte Carlo (MCMC) methods are widely used to approximate intractable expectations with respect to un-normalized probability distributions over general state spaces. For hard problems, MCMC can suffer from poor mixing. For example, faced with well-separated modes, MCMC methods often get trapped exploring local regions of high probability. Parallel tempering (PT) is a widely applicable methodology (Geyer, 1991) to tackle poor mixing of MCMC algorithms.

Suppose we seek to approximate an expectation with respect to an intractable target density π1. Denote by π0 a reference density defined on the same space, which is assumed to be tractable in the sense of the availability of an efficient sampler. This work is motivated by the case where π0 and π1 are nearly mutually singular. A typical case is where the target is a Bayesian posterior distribution, the reference is the prior—for which i.i.d. sampling is typically possible—and the prior is misspecified.

PT methods are based on a specific continuum of densities πt ∝ π0^(1−t) π1^t, t ∈ [0, 1], bridging π0 and π1. This path of intermediate distributions is known as the power posterior path in the literature, but in our framework it will be more natural to think of these continua as linear paths, as they linearly interpolate between log-densities. PT algorithms discretize the path at some 0 = t0 < ··· < tN = 1 to obtain a sequence of densities πt0, πt1, ..., πtN. See Figure 1 (top) for an example of a linear path for two nearly mutually singular Gaussian distributions.

Given the path discretization, PT involves running N + 1 MCMC chains that together target the product distribution πt0 πt1 ··· πtN. Based on the assumption that the chain π0 can be sampled efficiently, PT uses swap-based interactions between neighbouring chains to propagate the exploration done in π0 into improved exploration in the chain of interest, π1. By designing these swaps as Metropolis–Hastings moves, PT guarantees that the marginal distribution of the Nth chain converges to π1; and in practice, the rate of convergence is often much faster compared to running a single chain (Woodard et al., 2009). PT algorithms are extensively used in hard sampling problems arising in statistics, physics, computational chemistry, phylogenetics, and machine learning (Desjardins et al., 2014; Ballnus et al., 2017; Kamberaj, 2020; Müller & Bouckaert, 2020).

Notwithstanding empirical and theoretical successes, existing PT algorithms also have well-understood theoretical limitations. Earlier work focusing on the theoretical analysis of reversible variants of PT has shown that adding too many intermediate chains can actually deteriorate performance (Lingenheil et al., 2009; Atchadé et al., 2011). Recent work (Syed et al., 2019) has shown that a nonreversible variant of PT (Okabe et al., 2001) is guaranteed to dominate its classical reversible counterpart, and moreover that in the nonreversible regime adding more chains does not lead to performance collapse. However, even with these more efficient nonreversible PT algorithms, Syed et al. (2019) established that the improvement brought by higher parallelism will asymptote to a fundamental limit known as the global communication barrier.

In this work, we show that by generalizing the class of paths interpolating between π0 and π1 from linear to nonlinear, the global communication barrier can be broken, leading to substantial performance improvements. Importantly, the nonlinear path used to demonstrate this breakage is computed using a practical algorithm that can be used in any situation where PT is applicable. An example of a path optimized using our algorithm is shown in Figure 1 (bottom).

We also present a detailed theoretical analysis of parallel tempering algorithms based on nonlinear paths. Using this analysis we prove that the performance gains obtained by going from linear to nonlinear path PT algorithms can be arbitrarily large. Our theoretical analysis also motivates a principled objective function used to optimize over a parametric family of paths.

Literature review. Beyond parallel tempering, several methods to approximate intractable integrals rely on a path of distributions from a reference to a target distribution, and there is a rich literature on the construction and optimization of nonlinear paths for annealed importance sampling type algorithms (Gelman & Meng, 1998; Rischard et al., 2018; Grosse et al., 2013; Brekelmans et al., 2020). These algorithms are highly parallel; however, for challenging problems, even when combined with adaptive step size procedures (Zhou et al., 2016), they typically suffer from particle degeneracy (Syed et al., 2019, Sec. 7.4). Moreover, these methods use different path optimization criteria which are not well motivated in the context of parallel tempering. Some special cases of nonlinear paths have been used in the PT literature (Whitfield et al., 2002; Tawn et al., 2020). Whitfield et al. (2002) construct a nonlinear path inspired by the concept of Tsallis entropy, a generalization of Boltzmann–Gibbs entropy, but do not provide algorithms to optimize over this path family. The work of Tawn et al. (2020) also considers a specific example of a nonlinear path, distinct from the ones explored in this paper. However, the construction of the nonlinear path in Tawn et al. (2020) requires knowledge of the location of the modes of π1 and hence makes their algorithm less broadly applicable than standard PT.

*Equal contribution. 1 Department of Statistics, University of British Columbia, Vancouver, Canada. Correspondence to: Saifuddin Syed, Vittorio Romaniello.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

2. Background

In this section, we provide a brief overview of parallel tempering (PT) (Geyer, 1991), as well as recent results on nonreversible communication (Okabe et al., 2001; Sakai & Hukushima, 2016; Syed et al., 2019). Define a reference unnormalized density function π0 for which sampling is tractable, and an unnormalized target density function π1 for which sampling is intractable; the goal is to obtain samples from π1.¹ Define a path of distributions πt ∝ π0^(1−t) π1^t for t ∈ [0, 1] from the reference to the target. Finally, define the annealing schedule TN to be a monotone sequence in [0, 1] satisfying

TN = (tn)_{n=0..N},  0 = t0 ≤ t1 ≤ ··· ≤ tN = 1,
‖TN‖ = max_{n ∈ {0,...,N−1}} (tn+1 − tn).

¹ We assume all distributions share a common state space X throughout, and will often suppress the arguments of (log-)density functions—i.e., π1 instead of π1(x)—for notational brevity.

The core idea of parallel tempering is to construct a Markov chain (Xm^0, ..., Xm^N), m = 1, 2, ..., that (1) has invariant distribution πt0 · πt1 ··· πtN—such that we can treat the Nth marginal chain X^N as samples from the target π1—and (2) swaps components of the state vector such that independent samples from component 0 (i.e., the reference π0) traverse along the annealing path and aid mixing in component N (i.e., the target π1). This is achieved by iteratively performing a local exploration move followed by a communication move, as shown in Algorithm 1.

Local Exploration. Given (Xm−1^0, ..., Xm−1^N), we obtain an intermediate state (X̃m^0, ..., X̃m^N) by updating the nth component using any MCMC move targeting πtn, for n = 0, ..., N. This move can be performed in parallel across components since each is updated independently.

Communication. Given the intermediate state (X̃m^0, ..., X̃m^N), we apply pairwise swaps of components n and n + 1, n ∈ Sm ⊂ {0, ..., N − 1}, for a swapped index set Sm. Formally, a swap is a move from (x^0, ..., x^n, x^(n+1), ..., x^N) to (x^0, ..., x^(n+1), x^n, ..., x^N), which is accepted with probability

αn = 1 ∧ (πtn(x^(n+1)) πtn+1(x^n)) / (πtn(x^n) πtn+1(x^(n+1))). (1)

Since each swap only depends on components n, n + 1, one can perform all of the swaps in Sm in parallel, as long as n ∈ Sm implies (n + 1) ∉ Sm. The largest such collections of non-interfering swaps are Sm ∈ {Seven, Sodd}, where Seven, Sodd are the even and odd subsets of {0, ..., N − 1} respectively. In non-reversible PT (NRPT) (Okabe et al., 2001), the swap set Sm at each step m is set to

Sm = Seven if m is even, Sodd if m is odd.

Algorithm 1 NRPT
Require: state x0, path πt, schedule TN, # iterations M
  rn ← 0 for all n ∈ {0, ..., N − 1}
  for m = 1 to M do
    x̃m ← LocalExploration(xm−1)
    Sm ← Seven if m is even, otherwise Sm ← Sodd
    for n = 0 to N − 1 do
      αn ← 1 ∧ (πtn(x̃m^(n+1)) πtn+1(x̃m^n)) / (πtn(x̃m^n) πtn+1(x̃m^(n+1)))
      rn ← rn + (1 − αn)
      Un ∼ Unif(0, 1)
      if n ∈ Sm and Un ≤ αn then
        (x̃m^n, x̃m^(n+1)) ← (x̃m^(n+1), x̃m^n)
      end if
    end for
    xm ← x̃m
  end for
  rn ← rn / M for all n ∈ {0, ..., N − 1}
  Return: {xm}_{m=1..M}, {rn}_{n=0..N−1}

Round trips. The performance of PT is sensitive to both the local exploration and communication moves. The quantity commonly used to evaluate the performance of MCMC algorithms is the effective sample size (ESS); however, ESS measures the combined performance of local exploration and communication, and is not able to distinguish between the two. Since the major difference between PT and standard MCMC is the presence of a communication step, we require a way to measure communication performance in isolation, such that we can compare PT methods without dependence on the details of the local exploration move. The round trip rate is a performance measure from the PT literature (Katzgraber et al., 2006; Lingenheil et al., 2009) that is designed to assess communication efficiency alone. We say a round trip has occurred when a new sample from the reference π0 travels to the target π1 and then back to π0; the round trip rate τ(TN) is the frequency at which round trips occur. Based on simplifying assumptions on the local exploration moves, the round trip rate τ(TN) may be expressed as (Syed et al., 2019, Section 3.5)

τ(TN) = ( 2 + 2 Σ_{n=0..N−1} r(tn, tn+1) / (1 − r(tn, tn+1)) )^(−1), (2)

where r(t, t′) is the expected probability of rejection between chains t, t′ ∈ [0, 1],

r(t, t′) = E[ 1 ∧ (πt(X′) πt′(X)) / (πt(X) πt′(X′)) ],

and X, X′ have distributions X ∼ πt, X′ ∼ πt′. Further, if TN is refined so that ‖TN‖ → 0 as N → ∞, we find that the asymptotic (in N) round trip rate is

τ∞ = lim_{N→∞} τ(TN) = (2 + 2Λ)^(−1), (3)

where Λ ≥ 0 is a constant associated with the pair π0, π1 called the global communication barrier (Syed et al., 2019). Note that Λ does not depend on the number of chains N or the discretization schedule TN.

Figure 1. Two annealing paths between π0 = N(−2, 0.2²) (light blue) and π1 = N(2, 0.2²) (dark blue): the traditional linear path (top) and an optimized nonlinear path (bottom). While the distributions in the linear path are nearly mutually singular, those in the optimized path overlap substantially, leading to faster round trips.

3. General annealing paths

In the following, we use the terminology annealing path to describe a continuum of distributions interpolating between π0 and π1; this definition will be formalized shortly. The previous work reviewed in the last section assumes that the annealing path has the form πt ∝ π0^(1−t) π1^t, i.e., that the annealing path linearly interpolates between the log-densities. A natural question is whether using other paths could lead to an improved round trip rate.

In this work we show that the answer to this question is positive. The following proposition demonstrates that the traditional path πt ∝ π0^(1−t) π1^t suffers from an arbitrarily suboptimal global communication barrier even in simple examples with Gaussian reference and target distributions.

Proposition 1. Suppose the reference and target distributions are π0 = N(µ0, σ²) and π1 = N(µ1, σ²), and define z = |µ1 − µ0|/σ. Then as z → ∞,

1. the path πt ∝ π0^(1−t) π1^t has τ∞ = Θ(1/z), and

2. there exists a path of Gaussian distributions with τ∞ = Ω(1/log z).

Therefore, upon decreasing the variance of the reference and target while holding their means fixed, the traditional linear annealing path obtains an exponentially smaller asymptotic round trip rate than the optimal path of Gaussian distributions. Figure 1 provides an intuitive explanation. The standard path (top) corresponds to a set of Gaussian distributions with mean interpolated between the reference and target. If one reduces the variance of the reference and target, the variance of the distributions along the path shrinks as well. For any fixed N, these distributions become nearly mutually singular, leading to arbitrarily low round trip rates. The solution to this issue (bottom) is to allow the distributions along the path to have increased variances, thereby maintaining mutual overlap and the ability to swap components with a reasonable probability. This motivates the need to design more general annealing paths. In the following, we introduce the precise general definition of an annealing path, an analysis of path communication efficiency in parallel tempering, and a rigorous formulation of—and solution to—the problem of tuning path parameters to maximize communication efficiency.

3.1. Assumptions

Let P(X) be the set of probability densities with full support on a state space X. For any collection of densities πt ∈ P(X) with index t ∈ [0, 1], associate to each a log-density function Wt such that

πt(x) = (1/Zt) exp(Wt(x)), x ∈ X, (4)

where Zt = ∫_X exp(Wt(x)) dx is the normalizing constant. Definition 1 provides the conditions necessary to form a path of distributions from a reference π0 to a target π1 that is well-behaved for use in parallel tempering.

Definition 1. An annealing path is a map π(·) : [0, 1] → P(X), denoted t ↦ πt, such that for all x ∈ X, πt(x) is continuous in t.

There are many ways to move beyond the standard linear path πt ∝ π0^(1−t) π1^t. For example, consider a nonlinear path πt ∝ π0^(η0(t)) π1^(η1(t)) where ηi : [0, 1] → R are continuous functions such that η0(0) = η1(1) = 1 and η0(1) = η1(0) = 0. As long as πt is a normalizable density for all t ∈ [0, 1], this is a valid annealing path between π0 and π1. Further, note that the path parameter does not necessarily have to appear as an exponent: consider for example the mixture path πt ∝ (1 − t)π0 + tπ1. Section 4 provides a more detailed example based on linear splines.

3.2. Communication efficiency analysis

Given a particular annealing path satisfying Definition 1, we require a method to characterize the round trip rate performance of parallel tempering based on that path. The results presented in this section form the basis of the objective function used to optimize over paths, as well as the foundation for the proof of Proposition 1.

We start with some notation for the rejection rates involved when Algorithm 1 is used with nonlinear paths (Equation (4)). For t, t′ ∈ [0, 1], define the rejection rate function r : [0, 1]² → [0, 1] to be

r(t, t′) = 1 − E[exp(min{0, A_{t,t′}(X, X′)})],
A_{t,t′}(x, x′) = (Wt′(x) − Wt(x)) − (Wt′(x′) − Wt(x′)),

where X ∼ πt and X′ ∼ πt′ are independent. Assuming that all chains have reached stationarity, and all chains undergo efficient local exploration—i.e., Wt(X), Wt(X̃) are independent when X ∼ πt and X̃ is generated by local exploration from X—then the round trip rate for a particular schedule TN has the form given earlier in Equation (2). This statement follows from Syed et al. (2019, Cor. 1) without modification, because the proof of that result does not depend on the form of the acceptance ratio.

Our next objective is to characterize the asymptotic communication efficiency of a nonlinear path in the regime where N → ∞ and ‖TN‖ → 0—which establishes its fundamental ability to take advantage of parallel computation. In other words, we require a generalization of the asymptotic result in Equation (3). In this case, previous theory relies on the particular form of the acceptance ratio for linear paths (Syed et al., 2019); in the remainder of this section, we provide a generalization of the asymptotic result for nonlinear paths by piecewise approximation with a linear spline.

For t, t′ ∈ [0, 1], define Λ(t, t′) to be the global communication barrier for the linear secant connecting πt and πt′,

Λ(t, t′) = (1/2) ∫₀¹ E[ |A_{t,t′}(Xs, Xs′)| ] ds, (5)
Xs, Xs′ ∼ i.i.d. (1/Zs(t, t′)) exp((1 − s)Wt + sWt′).

Lemma 1 shows that the global communication barrier Λ(t, t′) along the secant of the path connecting t to t′ is a good approximation of the true rejection rate r(t, t′), with O(|t − t′|³) error as t′ → t.

Lemma 1. Suppose that for all x ∈ X, Wt(x) is piecewise continuously differentiable in t, and that there exists V1 : X → [0, ∞) such that

∀x ∈ X, sup_{t∈[0,1]} |dWt/dt (x)| ≤ V1(x), (6)

and

sup_{t∈[0,1]} E_{πt}[V1³] < ∞. (7)

Then there exists a constant C < ∞ independent of t, t′ such that for all t, t′ ∈ [0, 1],

|r(t, t′) − Λ(t, t′)| ≤ C|t − t′|³. (8)
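To make the quantities above concrete, the rejection rate r(t, t′) can be estimated by straightforward Monte Carlo whenever πt and πt′ can be sampled and their unnormalized log-densities evaluated. The sketch below is illustrative (helper names are ours, not from the paper) and uses a pair of unit-variance Gaussians:

```python
import numpy as np

rng = np.random.default_rng(1)

def rejection_rate(sample_t, sample_tp, logW_t, logW_tp, n=100_000):
    """Monte Carlo estimate of r(t, t') = 1 - E[exp(min{0, A})].

    sample_t/sample_tp draw from pi_t and pi_t'; logW_t/logW_tp are the
    corresponding unnormalized log-densities (normalizing constants cancel
    in A, so they are not needed).
    """
    x, xp = sample_t(n), sample_tp(n)
    # A_{t,t'}(x, x') = (W_t'(x) - W_t(x)) - (W_t'(x') - W_t(x'))
    log_ratio = logW_t(xp) + logW_tp(x) - logW_t(x) - logW_tp(xp)
    return 1.0 - np.mean(np.exp(np.minimum(0.0, log_ratio)))

# Example: two unit-variance Gaussians a distance delta apart.
delta = 1.0
r_hat = rejection_rate(
    lambda n: rng.normal(0.0, 1.0, n),
    lambda n: rng.normal(delta, 1.0, n),
    lambda x: -0.5 * x**2,
    lambda x: -0.5 * (x - delta) ** 2,
)
```

Identical chains reject with probability zero, and the estimated rejection rate grows toward one as the two chains separate, which is exactly the regime the schedule and path tuning below is designed to avoid.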

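For equal-variance Gaussian pairs, both sides of Lemma 1 happen to be available in closed form, so the cubic error bound (8) can be checked directly. The derivation below is ours, not from the paper: with mean separation δ in units of σ (for the linear path, δ is proportional to |t − t′|), the swap log-ratio is N(−δ², 2δ²), which gives r = 1 − 2Φ(−δ/√2) = erf(δ/2) and Λ = δ/√π.

```python
import math

def secant_barrier(delta):
    # Lambda(t, t') = delta / sqrt(pi): hand-derived from Equation (5)
    # using E|N(0, 2)| = 2 / sqrt(pi); valid for equal-variance Gaussians.
    return delta / math.sqrt(math.pi)

def exact_rejection(delta):
    # r(t, t') = 1 - 2 * Phi(-delta / sqrt(2)), which simplifies to erf(delta / 2)
    return math.erf(delta / 2.0)

# The gap between r and Lambda shrinks at the cubic rate promised by Lemma 1:
# halving-by-10 the separation should shrink the gap by roughly 10^3.
gaps = {d: abs(exact_rejection(d) - secant_barrier(d)) for d in (0.1, 0.01)}
```

Expanding erf(δ/2) = δ/√π − δ³/(12√π) + ... makes the agreement explicit: the leading terms match and the discrepancy is O(δ³), consistent with the bound (8).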
A direct consequence of Lemma 1 is that for any fixed schedule TN,

| Σ_{n=0..N−1} r(tn, tn+1) − Λ(TN) | ≤ C‖TN‖², (9)

where Λ(TN) = Σ_{n=0..N−1} Λ(tn, tn+1). Intuitively, in the ‖TN‖ ≈ 0 regime where rejection rates are low,

r(tn, tn+1) / (1 − r(tn, tn+1)) ≈ r(tn, tn+1),

and we have that τ(TN) ≈ (2 + 2Λ(TN))^(−1). Therefore, Λ(TN) characterizes the communication efficiency of the path in a way that naturally extends the global communication barrier from the linear path case. Theorem 2 provides the precise statement: the convergence is uniform in TN and depends only on ‖TN‖, and Λ(TN) itself converges to a constant Λ in the asymptotic regime. We refer to Λ, defined below in Equation (13), as the global communication barrier for the general annealing path.

Theorem 2. Suppose that for all x ∈ X, Wt(x) is piecewise twice continuously differentiable in t, that there exists V1 : X → [0, ∞) satisfying (6) and (7), and that there exist V2 : X → [0, ∞), ε > 0 satisfying

∀x ∈ X, sup_{t∈[0,1]} |d²Wt/dt² (x)| ≤ V2(x), (10)

and

sup_{t∈[0,1]} E_{πt}[(1 + V1)² exp(εV2)] < ∞. (11)

Then,

lim_{δ→0} sup_{TN : ‖TN‖≤δ} | (2 + 2Λ(TN))^(−1) − τ(TN) | = 0 (12)

and

lim_{δ→0} sup_{TN : ‖TN‖≤δ} | Λ(TN) − Λ | = 0, (13)

where Λ = ∫₀¹ λ(t) dt for an instantaneous rejection rate function λ : [0, 1] → [0, ∞) given by

λ(t) = lim_{Δt→0} r(t + Δt, t)/|Δt|
     = (1/2) E[ |dWt/dt (Xt) − dWt/dt (Xt′)| ],  Xt, Xt′ ∼ i.i.d. πt.

3.3. Annealing path families and optimization

It is often the case that there is a set of candidate annealing paths in consideration for a particular target π1. For example, if a path has tunable parameters φ ∈ Φ that govern its shape, we can generate a collection of annealing paths that all target π1 by varying the parameter φ. We call such collections an annealing path family.

Definition 3. An annealing path family for target π1 is a collection of annealing paths {πt^φ}_{φ∈Φ} such that for all parameters φ ∈ Φ, π1^φ = π1.

There are many ways to construct useful annealing path families. For example, if one is provided a parametric family of variational distributions {qφ : φ ∈ Φ} for some parameter space Φ, one can construct the annealing path family of linear paths πt^φ = qφ^(1−t) π1^t from a variational reference qφ to the target π1. More generally, given ηi(t) satisfying the constraints in Section 3.1, πt^φ = qφ^(η0(t)) π1^(η1(t)) defines a nonlinear annealing path family. Another example of an annealing path family used in the context of PT is the family of q-paths {πt^q}_{q∈[0,1]} (Whitfield et al., 2002). Given a fixed reference and target π0, π1, the path πt^q interpolates between the mixture path (q = 0) and the linear path (q = 1) (Brekelmans et al., 2020). In Section 4, we provide a new flexible class of nonlinear paths based on splines that is designed specifically to enhance the performance of parallel tempering.

Since every path in an annealing path family has the desired target distribution π1, we are free to optimize the path over the tuning parameter space φ ∈ Φ in addition to optimizing the schedule TN. Motivated by the analysis of Section 3.2, a natural objective function for this optimization is the non-asymptotic round trip rate:

φ*, TN* = argmax_{φ∈Φ, TN} τ^φ(TN) (14)
        = argmin_{φ∈Φ, TN} Σ_{n=0..N−1} r^φ(tn, tn+1) / (1 − r^φ(tn, tn+1)), (15)

where now the round trip rate and rejection rates depend on both the schedule and the path parameter, denoted by superscript φ. We solve this optimization using an approximate coordinate-descent procedure, iterating between an update of the schedule TN for a fixed path parameter φ ∈ Φ, followed by a gradient step in φ based on a surrogate objective function and a fixed schedule. This is summarized in Algorithm 2. We outline the details of the schedule and path tuning procedures in the following.
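Before turning to the tuning procedures, the instantaneous rejection rate in Theorem 2 can be checked concretely. For the linear path between N(µ0, σ²) and N(µ1, σ²) one has dWt/dt = W1 − W0 and πt = N((1−t)µ0 + tµ1, σ²), which (by a short calculation of ours, not from the paper) gives λ(t) = z/√π for every t, hence Λ = z/√π and τ∞ = (2 + 2z/√π)^(−1) = Θ(1/z), matching part 1 of Proposition 1. A Monte Carlo sketch for the Figure 1 setup:

```python
import numpy as np

rng = np.random.default_rng(2)
mu0, mu1, sigma = -2.0, 2.0, 0.2
z = abs(mu1 - mu0) / sigma   # z = 20 for the Figure 1 reference/target

def lambda_linear(t, n=200_000):
    """MC estimate of lambda(t) = 0.5 * E|dW_t/dt(X) - dW_t/dt(X')|,
    X, X' i.i.d. from pi_t = N((1-t)*mu0 + t*mu1, sigma^2)."""
    mean_t = (1.0 - t) * mu0 + t * mu1
    x, xp = rng.normal(mean_t, sigma, (2, n))
    # For the linear path, dW_t/dt = W_1 - W_0 (unnormalized log-densities)
    dW = lambda y: -0.5 * ((y - mu1) / sigma) ** 2 + 0.5 * ((y - mu0) / sigma) ** 2
    return 0.5 * np.mean(np.abs(dW(x) - dW(xp)))

# lambda(t) is constant in t for this path; averaging a few t values
lam = np.mean([lambda_linear(t) for t in (0.25, 0.5, 0.75)])
```

Since λ(t) is constant here, the barrier-equalizing schedule of the next section is simply uniform for this path; the interest of the machinery is precisely that optimized nonlinear paths make λ(t) smaller overall.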

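The finite-N round trip rate in Equation (2), and the low-rejection approximation τ(TN) ≈ (2 + 2Λ(TN))^(−1) discussed above, can likewise be sketched in a few lines (helper name ours):

```python
import numpy as np

def round_trip_rate(r):
    """tau(T_N) from Equation (2), given neighbour rejection rates r_0..r_{N-1}."""
    r = np.asarray(r, dtype=float)
    return 1.0 / (2.0 + 2.0 * np.sum(r / (1.0 - r)))

# In the low-rejection regime r/(1-r) ~ r, so the sum of rejection rates
# (which Lemma 1 relates to Lambda(T_N)) nearly determines tau(T_N).
rates = np.full(10, 0.05)
tau = round_trip_rate(rates)                   # exact Equation (2)
tau_approx = 1.0 / (2.0 + 2.0 * rates.sum())   # barrier-style approximation
```

With all rejection rates at zero the rate attains its ceiling of 1/2, i.e., one round trip every two swap sweeps on average.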
The integrability condition (11) in Theorem 2 is required to control the tail behaviour of distributions formed by linearized approximations to the path πt. This condition is satisfied by a wide range of annealing paths, e.g., the linear spline paths proposed in this work—since in that case V2 = 0.

Tuning the schedule. Fix the value of φ, which fixes the path. We adapt a schedule tuning algorithm from past work to update the schedule TN = (tn)_{n=0..N} (Syed et al., 2019, Section 5.1). Based on the same argument as this previous work, we obtain that when ‖TN‖ ≈ 0, the non-asymptotic round trip rate is maximized when the rejection rates are all equal. The schedule that approximately achieves this satisfies

∀n ∈ {1, ..., N − 1}, (1/Λ^φ) ∫₀^{tn} λ^φ(s) ds = n/N. (16)

Following Syed et al. (2019), we use Monte Carlo estimates of the rejection rates r^φ(tn, tn+1) to approximate t ↦ ∫₀^t λ^φ(s) ds, t ∈ [0, 1], via a monotone cubic spline, and then use bisection search to solve for each tn according to Equation (16).

Optimizing the path. Fix the schedule TN; we now want to improve the path itself by modifying φ. However, in challenging problems this is not as simple as taking a gradient step for the objective in Equation (14). In particular, in early iterations—when the path is near its oft-poor initialization—the rejection rates satisfy r^φ(tn, tn+1) ≈ 1. As demonstrated empirically in Appendix F, gradient estimates in this regime exhibit a low signal-to-noise ratio that precludes their use for optimization.

We propose a surrogate, the symmetric KL divergence, motivated as follows. Consider first the global communication barrier Λ^φ(TN) for the linear spline approximation to the path; Theorem 2 guarantees that as long as ‖TN‖ is small enough, one can optimize Λ^φ(TN) in place of the round trip rate τ^φ(TN). By Jensen's inequality,

(1/N²) Λ^φ(TN)² ≤ (1/N) Σ_{n=0..N−1} Λ^φ(tn, tn+1)².

Next, we apply Jensen's inequality again to the definition of Λ^φ(tn, tn+1) from (5), which shows that

Λ^φ(tn, tn+1)² ≤ (1/4) ∫₀¹ E[ A_{tn,tn+1}(Xs, Xs′)² ] ds,

where the Xs, defined in (5), are drawn from the linear path between πtn^φ and πtn+1^φ. Finally, we note that the inner expectation is the path integral of the Fisher information metric along the linear path and evaluates to the symmetric KL divergence (Dabak & Johnson, 2002, Result 4),

∫₀¹ E[ (A_{tn,tn+1}(Xs, Xs′))² ] ds = 2 SKL(πtn^φ, πtn+1^φ).

Therefore we have

(2/N) Λ^φ(TN)² ≤ Σ_{n=0..N−1} SKL(πtn^φ, πtn+1^φ). (17)

The slack in the inequality in Equation (17) could potentially depend on φ even in the large N regime. Therefore, during optimization, we recommend monitoring the value of the original objective function (Equation (14)) to ensure that the optimization of the surrogate SKL objective indeed improves it, and hence the round trip rate performance of PT via Equation (2). In the experiments we display the values of both objective functions.

Algorithm 2 PathOptNRPT
Require: state x, path family πt^φ, parameter φ, # chains N, # PT iterations M, # tuning steps S, learning rate γ
  TN ← (0, 1/N, 2/N, ..., 1)
  for s = 1 to S do
    {xm}_{m=1..M}, (rn)_{n=0..N−1} ← NRPT(x, πt^φ, TN, M)
    λ^φ ← CommunicationBarrier(TN, {rn})
    TN ← UpdateSchedule(λ^φ, N)
    φ ← φ − γ ∇φ Σ_{n=0..N−1} SKL(πtn^φ, πtn+1^φ)
    x ← xM
  end for
  Return: φ, TN

² We assume that the optimization over φ ends after a finite number of iterations to sidestep the potential pitfalls of adaptive MCMC methods (Andrieu & Moulines, 2006).

4. Spline annealing path family

In this section, we develop a family of annealing paths—the spline annealing path family—that offers a practical and flexible improvement upon the traditional linear paths considered in past work. We first define a general family of annealing paths based on the exponential family, and then provide the specific details of the spline family with a discussion of its properties. Empirical results in Section 5 demonstrate that the spline annealing path family resolves the problematic Gaussian annealing example in Figure 1.

4.1. Exponential annealing path family

We begin with the practical desiderata for an annealing path family given a fixed reference π0 and target π1 distribution.³ First, the traditional linear path πt ∝ π0^(1−t) π1^t should be a member of the family, so that one can achieve at least the round trip rate provided by that path. Second, the family should be broadly applicable and not depend on particular details of either π0 or π1. Finally, using the Gaussian example from Figure 1 and Proposition 1 as insight, the family should enable the path to smoothly vary from π0 to π1 while inflating/deflating the variance as necessary.

³ A natural extension of this discussion would include parametrized variational reference distribution families. For simplicity we restrict to a fixed reference.

These desiderata motivate the design of the exponential annealing path family, in which each annealing path takes the form

πt ∝ π0^(η0(t)) π1^(η1(t)) = exp(η(t)ᵀ W(x)),

for some function η(t) = (η0(t), η1(t)) and reference/target log-densities W(x) = (W0(x), W1(x)). Intuitively, η0(t) and η1(t) represent the level of annealing for the reference and target respectively along the path. Proposition 2 shows that a broad collection of functions η indeed constructs a valid annealing path family including the linear path.

Proposition 2. Let Ω ⊆ R² be the set

Ω = { ξ ∈ R² : ∫ exp(ξᵀ W(x)) dx < ∞ }.

Suppose (0, 1) ∈ Ω and A is a set of piecewise twice continuously differentiable functions η : [0, 1] → Ω such that η(1) = (0, 1). Then

{ πt(x) ∝ exp(η(t)ᵀ W(x)) : η ∈ A }

is an annealing path family for target distribution π1. If additionally (1, 0) ∈ Ω, then the linear path η(t) = (1 − t, t) may be included in A. Finally, if for every η ∈ A there exists M > 0 such that sup_t max{‖η′(t)‖₂, ‖η″(t)‖₂} ≤ M and Equation (11) holds with V1 = V2 = M‖W‖₂, then each path in the family satisfies the conditions of Theorem 2.

4.2. Spline annealing path family

Proposition 2 reduces the problem of designing a general family of paths of probability distributions to the much simpler task of designing paths in R². We argue that a good choice can be constructed using linear spline paths connecting K knots φ = (φ0, ..., φK) ∈ (R²)^(K+1), i.e., for all k ∈ {1, ..., K} and t ∈ [(k−1)/K, k/K],

η^φ(t) = (k − Kt) φ_{k−1} + (Kt − k + 1) φ_k.

Let Ω be defined as in Proposition 2. The K-knot spline annealing path family is defined as the set of K-knot linear spline paths such that

φ0 = (1, 0), φK = (0, 1), and ∀k, φk ∈ Ω.

Validity: Since Ω is a convex set per the proof of Proposition 2, we are guaranteed that η^φ([0, 1]) ⊆ Ω, and so this collection of annealing paths is a subset of the exponential annealing family and hence forms a valid annealing path family targeting π1.

Convexity: Furthermore, convexity of Ω implies that tuning the knots φ ∈ Ω^(K+1) involves optimization within a convex constraint set. In practice, we also enforce that the knots are monotone in each component—i.e., the first component monotonically decreases, 1 = φ_{0,0} ≥ φ_{1,0} ≥ ··· ≥ φ_{K,0} = 0, and the second increases, 0 = φ_{0,1} ≤ φ_{1,1} ≤ ··· ≤ φ_{K,1} = 1—such that the path of distributions always moves from the reference to the target. Because monotonicity constraint sets are linear and hence convex, the overall monotonicity-constrained optimization problem has a convex domain.

Flexibility: Assuming the family is nonempty, it trivially contains the linear path. Further, given a large enough number of knots K, the spline annealing family well-approximates subsets A_M of the exponential annealing family for fixed M > 0. In particular,

sup_{η∈A_M} inf_{φ∈Ω^(K+1)} ‖η^φ − η‖∞ ≤ M / (4K²). (18)

Figure 2. The spline path for K = 1, 2, 3, 4, 5, 10 knots for the family generated by π0 = N(−1, 0.5) and π1 = N(1, 0.5).

Figure 2 provides an illustration of the behaviour of optimized spline paths for a Gaussian reference and target. The path takes a convex curved shape; starting at the bottom right point of the figure (reference), the path corresponds to increasing the variance of the reference, shifting the mean from reference to target, and finally decreasing the variance to match the target. With more knots, this process happens more smoothly.

Appendix D provides an explicit derivation of the stochastic gradient estimates we use to optimize the knots of the spline annealing path family. It is also worth making two practical notes. First, to enforce the monotonicity constraint, we developed an alternative to the usual projection approach, as projecting into the set of monotone splines can cause several knots to become superposed. Instead, we maintain monotonicity as follows: after each stochastic gradient step, we identify a monotone subsequence of knots containing the endpoints, remove the nonmonotone jumps, and linearly interpolate between the knot subsequence with an even spacing. Second, we take gradient steps in a log-transformed space so that knot components are always strictly positive.

5. Experiments

In this section, we study the empirical performance of non-reversible PT based on the spline annealing path family (K ∈ {2, 3, 4, 5, 10}) from Section 4, with knots and schedule optimized using the tuning method from Section 3. We compare this method to two PT methods based on standard linear paths.
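Before turning to the results, the spline construction of Section 4.2 can be made concrete. The sketch below evaluates η^φ(t) and the corresponding unnormalized log-density; the knot values are illustrative choices of ours, not taken from the paper:

```python
import numpy as np

def eta_spline(t, knots):
    """Evaluate the K-knot linear spline eta^phi(t) in R^2.

    knots: array of shape (K+1, 2) with knots[0] = (1, 0), knots[-1] = (0, 1).
    For t in [(k-1)/K, k/K]: eta(t) = (k - K*t)*phi_{k-1} + (K*t - k + 1)*phi_k.
    """
    K = len(knots) - 1
    k = min(int(np.ceil(K * t)), K) if t > 0 else 1   # active segment index
    w = K * t - (k - 1)                               # weight on the right knot
    return (1.0 - w) * knots[k - 1] + w * knots[k]

def log_pi_spline(x, t, knots, logW0, logW1):
    """Unnormalized log-density of the spline path: eta0(t)*W0(x) + eta1(t)*W1(x)."""
    eta0, eta1 = eta_spline(t, knots)
    return eta0 * logW0(x) + eta1 * logW1(x)

# K = 2 knots; a middle knot like (0.8, 0.8) keeps both exponents large
# mid-path, which fattens the intermediate distributions (cf. Figure 2).
knots = np.array([[1.0, 0.0], [0.8, 0.8], [0.0, 1.0]])
```

Setting the middle knot to (0.5, 0.5) recovers the linear path; moving it toward (1, 1) tempers both endpoint log-densities less aggressively in the middle of the path, which is the variance-inflation effect the spline family is designed to capture.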

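The schedule-update step of Algorithm 2 (Equation (16) in Section 3) equalizes rejection rates by inverting the normalized cumulative barrier. A piecewise-linear sketch (the paper uses a monotone cubic spline with bisection; `update_schedule` is our hypothetical helper):

```python
import numpy as np

def update_schedule(t_grid, lam, N):
    """Place t_n so that Lambda(t_n)/Lambda = n/N, as in Equation (16).

    t_grid -- grid over [0, 1]; lam -- estimates of lambda(t) on that grid.
    Uses trapezoidal integration and linear interpolation to invert.
    """
    cum = np.concatenate(
        [[0.0], np.cumsum(0.5 * (lam[1:] + lam[:-1]) * np.diff(t_grid))]
    )
    cum /= cum[-1]  # normalized cumulative barrier t -> Lambda(t) / Lambda
    # invert: for each level n/N, find t with cum(t) = n/N
    return np.interp(np.linspace(0.0, 1.0, N + 1), cum, t_grid)

t_grid = np.linspace(0.0, 1.0, 1001)
# Constant lambda -> uniform schedule; lambda peaked near t = 1 -> the
# schedule packs chains near the target, where communication is hardest.
uniform = update_schedule(t_grid, np.ones_like(t_grid), 10)
peaked = update_schedule(t_grid, 1.0 + 100.0 * t_grid**8, 10)
```

This is the sense in which schedule tuning and path tuning interact in Algorithm 2: the path determines λ(t), and the schedule then redistributes the chains so that each neighbouring pair carries an equal share of the barrier.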
Figure 3. Top: Cumulative round trips averaged over 10 runs for the spline path with K = 2, 3, 4, 5, 10 (solid blue), NRPT using a linear path (dashed green), and reversible PT with a linear path (dash/dot red). The slope of the lines represents the round trip rate. We observe large gains going from linear to non-linear paths (K > 1). For all values of K > 1, the optimized spline path substantially improves on the theoretical upper bound on the round trip rate possible using a linear path (dotted black). Bottom: Non-asymptotic communication barrier from Equation 15 (solid blue) and symmetric KL (dash orange) as a function of iteration for one run of PathOptNRPT + Spline (K = 4 knots).

based on standard linear paths: non-reversible PT with adaptive schedule (“NRPT+Linear”) (Syed et al., 2019), and reversible PT (“Reversible+Linear”) (Atchadé et al., 2011). Code for the experiments is available at https://github.com/vittrom/PT-pathoptim.

We use the terminology “scan” to denote one iteration of the for loop in Algorithm 2. The computational cost of a scan is comparable for all the methods, since the bottleneck is the local exploration step shared by all methods. These experiments demonstrate two primary conclusions: (1) tuned nonlinear paths provide a substantially higher round trip rate compared to linear paths for the examples considered; and (2) the symmetric KL sum objective (Equation 17) is a good proxy for the round trip rate as a tuning objective.

We run the following benchmark problems; see the supplement for details. Gaussian: a synthetic setup in which the reference distribution is π0 = N(−1, 0.01^2) and the target is π1 = N(1, 0.01^2). For this example we used N = 50 parallel chains and fixed the computational budget to 45000 samples. For Algorithm 2, the computational budget was divided equally over 150 scans, meaning 300 samples were used for every gradient update. “Reversible+Linear” performed 45000 local exploration steps with a communication step after every iteration, while for “NRPT+Linear” the computational budget was used to adapt the schedule. The gradient updates were performed using Adagrad (Duchi et al., 2011) with a learning rate equal to 0.2. Beta-binomial model: a conjugate Bayesian model with prior π0(p) = Beta(180, 840) and likelihood L(x|p) = Binomial(x|n, p). We simulated data x1, . . . , x2000 ∼ Binomial(100, 0.7), resulting in a posterior π1(p) = Beta(140180, 60840). The prior and posterior are heavily concentrated at 0.2 and 0.7 respectively. We used the same settings as for the Gaussian example. Galaxy data: a Bayesian Gaussian mixture model applied to the galaxy dataset of Roeder (1990). We used six mixture components with mixture proportions w_0, . . . , w_5, mixture component densities N(µ_i, 1) for mean parameters µ_0, . . . , µ_5, and a cluster label categorical variable for each data point. We placed a Dir(1_6) prior on the proportions, where 1_6 = (1, 1, 1, 1, 1, 1), and a N(150, 1) prior on each of the mean parameters. We did not marginalize the cluster indicators, creating a multi-modal posterior inference problem over 94 latent variables. In this experiment we used N = 35 chains and fixed the computational budget to 50000 samples, divided into 500 scans using 100 samples each. We optimized the path using Adagrad with a learning rate of 0.3. Exploring the full posterior distribution is challenging in this context due to a combination of misspecification of the prior and label switching. Label switching refers to the invariance of the likelihood under relabelling of the cluster labels. In Bayesian problems label switching can lead to increased difficulty of the sampling problem, as it generates symmetric multi-modal posterior distributions. High dimensional Gaussian: a similar setup to the
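For reference, the Beta-binomial posterior parameters reported above follow from the standard conjugate update: a Beta(a, b) prior combined with Binomial(n, p) observations x_1, . . . , x_m yields the posterior Beta(a + Σ x_i, b + Σ (n − x_i)). The snippet below is our own illustration (not code from the paper's repository), using the idealized assumption that every draw equals its expected count of 70 successes; with real simulated counts the totals would fluctuate around these values.

```python
# Conjugate update: Beta(a, b) prior on p with Binomial(n, p) observations
# x_1, ..., x_m gives the posterior Beta(a + sum(x_i), b + sum(n - x_i)).
a, b = 180, 840        # prior Beta(180, 840), concentrated near p = 0.2
n, m = 100, 2000       # m draws from Binomial(n = 100, p = 0.7)
xs = [70] * m          # idealized data: every draw at its expected value

a_post = a + sum(xs)
b_post = b + sum(n - x for x in xs)
print(a_post, b_post)                        # 140180 60840
print(round(a_post / (a_post + b_post), 3))  # posterior mean 0.697, near 0.7
```

The resulting Beta(140180, 60840) matches the posterior reported for this benchmark and makes the near-mutual singularity of prior and posterior concrete: both are sharply concentrated, at 0.2 and roughly 0.7 respectively.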

one-dimensional Gaussian experiment where the number of dimensions ranges from d = 1 to d = 256. The reference distribution is π0 = N(−1_d, 0.1^2 I_d) and the target is π1 = N(1_d, 0.1^2 I_d), where the subscript d indicates the dimensionality of the problem. The number of chains N is set to increase with dimension at the rate N = ⌈15√d⌉. We fixed the number of spline knots K to 4 and set the computational budget to 50000 samples, divided into 500 scans with 100 samples per gradient update. The gradient updates were performed using Adagrad with a learning rate equal to 0.2. For all the experiments we performed one local exploration step before each communication step.

The results of these experiments are shown in Figures 3 and 4. Examining the top row of Figure 3, which shows the number of round trips as a function of the number of scans, one can see that PT using the spline annealing family outperforms PT using the linear annealing path across all numbers of knots tested. Moreover, the slope of these curves demonstrates that PT with the spline annealing family exceeds the theoretical upper bound on the round trip rate for the linear annealing path (from Equation (12)). The largest gain is obtained from going from K = 1 (linear) to K = 2. For all the examples, increasing the number of knots beyond K = 2 leads to marginal improvements. In the case of the Gaussian example, note that since the global communication barrier Λ for the linear path is much larger than N, algorithms based on linear paths incurred rejection rates of nearly one for most chains, resulting in no round trips.

The bottom row of Figure 3 shows the value of the surrogate SKL objective and the non-asymptotic communication barrier from Equation (15). In particular, these figures demonstrate that the SKL provides a surrogate objective that is a reasonable proxy for the non-asymptotic communication barrier, but does not exhibit as large an estimation variance in early iterations, when there are pairs of chains with rejection rates close to one.

Figure 4 shows the round trip rate as a function of the dimensionality of the problem for Gaussian target distributions. As the dimensionality increases, the sampling problem becomes fundamentally more difficult, explaining the decay in performance of NRPT with both a linear path and an optimized path. Both sampling methods are provided with a fixed computational budget in all runs. Due to this fixed budget, NRPT is unable for d ≥ 64 to approach the optimal schedule in the allotted time. This leads to an increasing gap between the round trip rates of NRPT with the linear path and the spline path for d ≥ 64.

Figure 4. Round trip rate averaged over 5 runs for the spline path with K = 4 (solid blue), NRPT using a linear path (dashed green), and the theoretical upper bound on the round trip rate possible using a linear path (dotted black), as a function of the dimensionality of the target distribution.

6. Discussion

In this work, we identified the use of linear paths of distributions as a major bottleneck in the performance of parallel tempering algorithms. To address this limitation, we have provided a theory of parallel tempering on nonlinear paths, a methodology to tune parametrized paths, and finally a practical, flexible family of paths based on linear splines. Future work in this line of research includes extensions to estimate normalization constants, as well as the development of techniques and theory surrounding the use of variational reference distributions.

References

Andrieu, C. and Moulines, E. On the ergodicity properties of some adaptive MCMC algorithms. Annals of Applied Probability, 16(3):1462–1505, 2006.

Atchadé, Y. F., Roberts, G. O., and Rosenthal, J. S. Towards optimal scaling of Metropolis-coupled Markov chain Monte Carlo. Statistics and Computing, 21(4):555–568, 2011.

Ballnus, B., Hug, S., Hatz, K., Görlitz, L., Hasenauer, J., and Theis, F. J. Comprehensive benchmarking of Markov chain Monte Carlo methods for dynamical systems. BMC Systems Biology, 11(1):63, 2017.

Brekelmans, R., Masrani, V., Bui, T., Wood, F., Galstyan, A., Steeg, G. V., and Nielsen, F. Annealed importance sampling with q-paths. arXiv:2012.07823, 2020.

Costa, S. I., Santos, S. A., and Strapasson, J. E. Fisher information distance: A geometrical reading. Discrete Applied Mathematics, 197:59–69, 2015.

Dabak, A. G. and Johnson, D. H. Relations between Kullback-Leibler distance and Fisher information. Journal of The Iranian Statistical Society, 5:25–37, 2002.

Desjardins, G., Luo, H., Courville, A., and Bengio, Y. Deep tempering. arXiv:1410.0123, 2014.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011.

Gelman, A. and Meng, X.-L. Simulating normalizing constants: From importance sampling to bridge sampling to path sampling. Statistical Science, pp. 163–185, 1998.

Geyer, C. J. Markov chain Monte Carlo maximum likelihood. Interface Proceedings, 1991.

Grosse, R. B., Maddison, C. J., and Salakhutdinov, R. R. Annealing between distributions by averaging moments. In Advances in Neural Information Processing Systems, 2013.

Kamberaj, H. Molecular Dynamics Simulations in Statistical Physics: Theory and Applications. Springer, 2020.

Katzgraber, H. G., Trebst, S., Huse, D. A., and Troyer, M. Feedback-optimized parallel tempering Monte Carlo. Journal of Statistical Mechanics: Theory and Experiment, 2006(03):P03018, 2006.

Sakai, Y. and Hukushima, K. Irreversible simulated tempering. Journal of the Physical Society of Japan, 85(10):104002, 2016.

Syed, S., Bouchard-Côté, A., Deligiannidis, G., and Doucet, A. Non-reversible parallel tempering: an embarrassingly parallel MCMC scheme. arXiv:1905.02939, 2019.

Tawn, N. G., Roberts, G. O., and Rosenthal, J. S. Weight-preserving simulated tempering. Statistics and Computing, 30(1):27–41, 2020.

Whitfield, T., Bu, L., and Straub, J. Generalized parallel sampling. Physica A: Statistical Mechanics and its Applications, 305(1-2):157–171, 2002.

Woodard, D. B., Schmidler, S. C., Huber, M., et al. Conditions for rapid mixing of parallel and simulated tempering on multimodal distributions. The Annals of Applied Probability, 19(2):617–640, 2009.

Zhou, Y., Johansen, A. M., and Aston, J. A. Toward automatic model comparison: An adaptive sequential Monte Carlo approach. Journal of Computational and Graphical Statistics, 25(3):701–726, 2016.

Lingenheil, M., Denschlag, R., Mathias, G., and Tavan, P. Efficiency of exchange schemes in replica exchange. Chemical Physics Letters, 478(1-3):80–84, 2009.

Müller, N. F. and Bouckaert, R. R. Adaptive parallel tempering for BEAST 2. bioRxiv, 2020. doi: 10.1101/603514.

Okabe, T., Kawata, M., Okamoto, Y., and Mikami, M. Replica-exchange Monte Carlo method for the isobaric–isothermal ensemble. Chemical Physics Letters, 335(5-6):435–439, 2001.

Predescu, C., Predescu, M., and Ciobanu, C. V. The incomplete beta function law for parallel tempering sampling of classical canonical systems. The Journal of Chemical Physics, 120(9):4119–4128, 2004.

Rainforth, T., Kosiorek, A. R., Le, T. A., Maddison, C. J., Igl, M., Wood, F., and Teh, Y. W. Tighter variational bounds are not necessarily better. In International Conference on Machine Learning, 2018.

Rischard, M., Jacob, P. E., and Pillai, N. Unbiased estimation of log normalizing constants with applications to Bayesian cross-validation. arXiv:1810.01382, 2018.

Roeder, K. Density estimation with confidence sets exemplified by superclusters and voids in the galaxies. Journal of the American Statistical Association, 85(411):617–624, 1990.