Parallel tempering on optimized paths

Saifuddin Syed * 1 Vittorio Romaniello * 1 Trevor Campbell 1 Alexandre Bouchard-Côté 1

Abstract

Parallel tempering (PT) is a class of Markov chain Monte Carlo algorithms that constructs a path of distributions annealing between a tractable reference and an intractable target, and then interchanges states along the path to improve mixing in the target. The performance of PT depends on how quickly a sample from the reference distribution makes its way to the target, which in turn depends on the particular path of annealing distributions. However, past work on PT has used only simple paths constructed from convex combinations of the reference and target log-densities. This paper begins by demonstrating that this path performs poorly in the setting where the reference and target are nearly mutually singular. To address this issue, we expand the framework of PT to general families of paths, formulate the choice of path as an optimization problem that admits tractable gradient estimates, and propose a flexible new family of spline interpolation paths for use in practice. Theoretical and empirical results both demonstrate that our proposed methodology breaks previously-established upper performance limits for traditional paths.

*Equal contribution. 1 Department of Statistics, University of British Columbia, Vancouver, Canada. Correspondence to: Saifuddin Syed, Vittorio Romaniello.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

arXiv:2102.07720v2 [stat.CO] 25 Jun 2021

1. Introduction

Markov chain Monte Carlo (MCMC) methods are widely used to approximate intractable expectations with respect to un-normalized probability distributions over general state spaces. For hard problems, MCMC can suffer from poor mixing. For example, faced with well-separated modes, MCMC methods often get trapped exploring local regions of high probability. Parallel tempering (PT) is a widely applicable methodology (Geyer, 1991) to tackle poor mixing of MCMC algorithms.

Suppose we seek to approximate an expectation with respect to an intractable target density π_1. Denote by π_0 a reference density defined on the same space, which is assumed to be tractable in the sense of the availability of an efficient sampler. This work is motivated by the case where π_0 and π_1 are nearly mutually singular. A typical case is where the target is a Bayesian posterior distribution, the reference is the prior—for which i.i.d. sampling is typically possible—and the prior is misspecified.

PT methods are based on a specific continuum of densities π_t ∝ π_0^{1−t} π_1^t, t ∈ [0, 1], bridging π_0 and π_1. This path of intermediate distributions is known as the power posterior path in the literature, but in our framework it will be more natural to think of these continua as linear paths, as they linearly interpolate between log-densities. PT algorithms discretize the path at some 0 = t_0 < ··· < t_N = 1 to obtain a sequence of densities π_{t_0}, π_{t_1}, ..., π_{t_N}. See Figure 1 (top) for an example of a linear path for two nearly mutually singular Gaussian distributions.

Given the path discretization, PT involves running N + 1 MCMC chains that together target the product distribution π_{t_0} π_{t_1} ··· π_{t_N}. Based on the assumption that the chain π_0 can be sampled efficiently, PT uses swap-based interactions between neighbouring chains to propagate the exploration done in π_0 into improved exploration in the chain of interest π_1. By designing these swaps as Metropolis–Hastings moves, PT guarantees that the marginal distribution of the Nth chain converges to π_1; and in practice, the rate of convergence is often much faster compared to running a single chain (Woodard et al., 2009). PT algorithms are extensively used in hard sampling problems arising in statistics, physics, computational chemistry, phylogenetics, and machine learning (Desjardins et al., 2014; Ballnus et al., 2017; Kamberaj, 2020; Müller & Bouckaert, 2020).

Notwithstanding empirical and theoretical successes, existing PT algorithms also have well-understood theoretical limitations. Earlier work focusing on the theoretical analysis of reversible variants of PT has shown that adding too many intermediate chains can actually deteriorate performance (Lingenheil et al., 2009; Atchadé et al., 2011). Recent work (Syed et al., 2019) has shown that a nonreversible variant of PT (Okabe et al., 2001) is guaranteed to dominate its classical reversible counterpart, and moreover that in the nonreversible regime adding more chains does not lead to performance collapse. However, even with these more efficient nonreversible PT algorithms, Syed et al. (2019) established that the improvement brought by higher parallelism will asymptote to a fundamental limit known as the global communication barrier.

In this work, we show that by generalizing the class of paths interpolating between π_0 and π_1 from linear to nonlinear, the global communication barrier can be broken, leading to substantial performance improvements. Importantly, the nonlinear path used to demonstrate this breakage is computed using a practical algorithm that can be used in any situation where PT is applicable. An example of a path optimized using our algorithm is shown in Figure 1 (bottom).

We also present a detailed theoretical analysis of parallel tempering algorithms based on nonlinear paths. Using this analysis we prove that the performance gains obtained by going from linear to nonlinear path PT algorithms can be arbitrarily large. Our theoretical analysis also motivates a principled objective function used to optimize over a parametric family of paths.

Literature review. Beyond parallel tempering, several methods to approximate intractable integrals rely on a path of distributions from a reference to a target distribution, and there is a rich literature on the construction and optimization of nonlinear paths for annealed importance sampling type algorithms (Gelman & Meng, 1998; Rischard et al., 2018; Grosse et al., 2013; Brekelmans et al., 2020). These algorithms are highly parallel; however, for challenging problems, even when combined with adaptive step size procedures (Zhou et al., 2016) they typically suffer from particle degeneracy (Syed et al., 2019, Sec. 7.4). Moreover, these methods use different path optimization criteria which are not well motivated in the context of parallel tempering.

Some special cases of non-linear paths have been used in the PT literature (Whitfield et al., 2002; Tawn et al., 2020). Whitfield et al. (2002) construct a non-linear path inspired by the concept of Tsallis entropy, a generalization of Boltzmann–Gibbs entropy, but do not provide algorithms to optimize over this path family. The work of Tawn et al. (2020) also considers a specific example of a nonlinear path distinct from the ones explored in this paper. However, the construction of the nonlinear path in Tawn et al. (2020) requires knowledge of the location of the modes of π_1 and hence makes their algorithm less broadly applicable than standard PT.

2. Background

In this section, we provide a brief overview of parallel tempering (PT) (Geyer, 1991), as well as recent results on nonreversible communication (Okabe et al., 2001; Sakai & Hukushima, 2016; Syed et al., 2019). Define a reference unnormalized density function π_0 for which sampling is tractable, and an unnormalized target density function π_1 for which sampling is intractable; the goal is to obtain samples from π_1.¹ Define a path of distributions π_t ∝ π_0^{1−t} π_1^t for t ∈ [0, 1] from the reference to the target. Finally, define the annealing schedule T_N to be a monotone sequence in [0, 1] satisfying

  T_N = (t_n)_{n=0}^N,  0 = t_0 ≤ t_1 ≤ ··· ≤ t_N = 1,  ||T_N|| = max_{n ∈ {0,...,N−1}} t_{n+1} − t_n.

¹We assume all distributions share a common state space X throughout, and will often suppress the arguments of (log-)density functions—i.e., π_1 instead of π_1(x)—for notational brevity.

The core idea of parallel tempering is to construct a Markov chain (X_m^0, ..., X_m^N), m = 1, 2, ..., that (1) has invariant distribution π_{t_0} · π_{t_1} ··· π_{t_N}—such that we can treat the marginal chain X_m^N as samples from the target π_1—and (2) swaps components of the state vector such that independent samples from component 0 (i.e., the reference π_0) traverse along the annealing path and aid mixing in component N (i.e., the target π_1). This is possible to achieve by iteratively performing a local exploration move followed by a communication move, as shown in Algorithm 1.

Local exploration. Given (X_{m−1}^0, ..., X_{m−1}^N), we obtain an intermediate state (X̃_m^0, ..., X̃_m^N) by updating the nth component using any MCMC move targeting π_{t_n}, for n = 0, ..., N. This move can be performed in parallel across components since each is updated independently.

Communication. Given the intermediate state (X̃_m^0, ..., X̃_m^N), we apply pairwise swaps of components n and n + 1, n ∈ S_m ⊂ {0, ..., N − 1}, for swapped index set S_m. Formally, a swap is a move from (x^0, ..., x^N) to (x^0, ..., x^{n+1}, x^n, ..., x^N), which is accepted with probability

  α_n = 1 ∧ [π_{t_n}(x^{n+1}) π_{t_{n+1}}(x^n)] / [π_{t_n}(x^n) π_{t_{n+1}}(x^{n+1})].  (1)

Since each swap only depends on components n, n + 1, one can perform all of the swaps in S_m in parallel, as long as n ∈ S_m implies (n + 1) ∉ S_m. The largest collection of such non-interfering swaps is S_m ∈ {S_even, S_odd}, where S_even, S_odd are the even and odd subsets of {0, ..., N − 1} respectively. In non-reversible PT (NRPT) (Okabe et al., 2001), the swap set S_m at each step m is set to

  S_m = S_even if m is even, S_odd if m is odd.

Algorithm 1 NRPT
  Require: state x_0, path π_t, schedule T_N, # iterations M
  r_n ← 0 for all n ∈ {0, ..., N − 1}
  for m = 1 to M do
    x̃_m ← LocalExploration(x_{m−1})
    S_m ← S_even if m is even, otherwise S_m ← S_odd
    for n = 0 to N − 1 do
      α_n ← 1 ∧ [π_{t_n}(x^{n+1}) π_{t_{n+1}}(x^n)] / [π_{t_n}(x^n) π_{t_{n+1}}(x^{n+1})]
      r_n ← r_n + (1 − α_n)
      U_n ∼ Unif(0, 1)
      if n ∈ S_m and U_n ≤ α_n then
        (x̃_m^n, x̃_m^{n+1}) ← (x̃_m^{n+1}, x̃_m^n)
      end if
    end for
    x_m ← x̃_m
  end for
  r_n ← r_n / M for n ∈ {0, ..., N − 1}
  Return: {x_m}_{m=1}^M, {r_n}_{n=0}^{N−1}

Round trips. The performance of PT is sensitive to both the local exploration and communication moves. The quantity commonly used to evaluate the performance of MCMC algorithms is the effective sample size (ESS); however, ESS measures the combined performance of local exploration and communication, and is not able to distinguish between the two. Since the major difference between PT and standard MCMC is the presence of a communication step, we require a way to measure communication performance in isolation, such that we can compare PT methods without dependence on the details of the local exploration move.

The round trip rate is a performance measure from the PT literature (Katzgraber et al., 2006; Lingenheil et al., 2009) that is designed to assess communication efficiency alone. We say a round trip has occurred when a new sample from the reference π_0 travels to the target π_1 and then back to π_0; the round trip rate τ(T_N) is the frequency at which round trips occur. Based on simplifying assumptions on the local exploration moves, the round trip rate τ(T_N) may be expressed as (Syed et al., 2019, Section 3.5)

  τ(T_N) = ( 2 + 2 Σ_{n=0}^{N−1} r(t_n, t_{n+1}) / (1 − r(t_n, t_{n+1})) )^{−1},  (2)

where r(t, t') is the expected probability of rejection between chains t, t' ∈ [0, 1],
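To make the two moves concrete, the following is a minimal Python sketch of Algorithm 1 (not the authors' implementation): local exploration is a single random-walk Metropolis step per chain, standing in for any MCMC move targeting π_{t_n}, and `log_pi(t, x)` is an assumed user-supplied annealed log-density, here the linear path between the two nearly mutually singular Gaussians of Figure 1.

```python
import numpy as np

def nrpt(log_pi, schedule, n_iters, step=0.5, seed=0):
    """Sketch of NRPT (Algorithm 1): local exploration followed by
    deterministic even/odd swaps.  Returns the final states and the
    average rejection estimates r_n accumulated as in Algorithm 1."""
    rng = np.random.default_rng(seed)
    N = len(schedule) - 1
    x = np.zeros(N + 1)          # one (here scalar) state per chain
    rej = np.zeros(N)            # accumulates 1 - alpha_n for every pair
    for m in range(1, n_iters + 1):
        # local exploration: one random-walk Metropolis step per chain
        for n, t in enumerate(schedule):
            prop = x[n] + step * rng.normal()
            if np.log(rng.uniform()) < log_pi(t, prop) - log_pi(t, x[n]):
                x[n] = prop
        # communication: compute all alpha_n from the current state,
        # then attempt only the even or odd pairs (S_even / S_odd)
        log_alpha = np.array([
            log_pi(schedule[n], x[n + 1]) + log_pi(schedule[n + 1], x[n])
            - log_pi(schedule[n], x[n]) - log_pi(schedule[n + 1], x[n + 1])
            for n in range(N)])
        alpha = np.exp(np.minimum(0.0, log_alpha))
        rej += 1.0 - alpha
        parity = 0 if m % 2 == 0 else 1
        for n in range(parity, N, 2):
            if rng.uniform() <= alpha[n]:
                x[n], x[n + 1] = x[n + 1], x[n]
    return x, rej / n_iters

# linear path between two nearly mutually singular Gaussians (Figure 1, top)
def log_pi_linear(t, x):
    w0 = -0.5 * ((x + 2.0) / 0.2) ** 2    # log N(-2, 0.2^2), up to a constant
    w1 = -0.5 * ((x - 2.0) / 0.2) ** 2    # log N(+2, 0.2^2), up to a constant
    return (1 - t) * w0 + t * w1
```

Run on this path with a coarse schedule, the estimated rejection rates r_n near the middle of the path are close to one, which is exactly the failure mode analyzed in the rest of the paper.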

  r(t, t') = E[ 1 ∧ (π_t(X') π_{t'}(X)) / (π_t(X) π_{t'}(X')) ],
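Since the expectation above involves only independent draws from π_t and π_{t'}, r(t, t') can be estimated by plain Monte Carlo whenever both chains can be sampled. A small sketch follows; the sampler and log-density arguments are assumptions supplied by the caller, and normalizing constants cancel inside the ratio.

```python
import numpy as np

def rejection_rate(sample_t, sample_tp, log_pi_t, log_pi_tp, n=10_000, seed=0):
    """Monte Carlo estimate of r(t, t') = E[1 - (1 ^ ratio)] with
    X ~ pi_t and X' ~ pi_{t'} drawn independently."""
    rng = np.random.default_rng(seed)
    x, xp = sample_t(rng, n), sample_tp(rng, n)
    # log of pi_t(x') pi_{t'}(x) / (pi_t(x) pi_{t'}(x')); constants cancel
    log_ratio = log_pi_t(xp) + log_pi_tp(x) - log_pi_t(x) - log_pi_tp(xp)
    return float(np.mean(1.0 - np.exp(np.minimum(0.0, log_ratio))))
```

For identical chains the ratio is exactly one and the estimate is zero; for the nearly mutually singular pair N(−2, 0.2²), N(2, 0.2²) it is essentially one.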

and X, X' have distributions X ∼ π_t, X' ∼ π_{t'}. Further, if T_N is refined so that ||T_N|| → 0 as N → ∞, we find that the asymptotic (in N) round trip rate is

  τ_∞ = lim_{N→∞} τ(T_N) = (2 + 2Λ)^{−1},  (3)

where Λ ≥ 0 is a constant associated with the pair π_0, π_1 called the global communication barrier (Syed et al., 2019). Note that Λ does not depend on the number of chains N or the discretization schedule T_N.

Figure 1. Two annealing paths between π_0 = N(−2, 0.2²) (light blue) and π_1 = N(2, 0.2²) (dark blue): the traditional linear path (top) and an optimized nonlinear path (bottom). While the distributions in the linear path are nearly mutually singular, those in the optimized path overlap substantially, leading to faster round trips.

3. General annealing paths

In the following, we use the terminology annealing path to describe a continuum of distributions interpolating between π_0 and π_1; this definition will be formalized shortly. The previous work reviewed in the last section assumes that the annealing path has the form π_t ∝ π_0^{1−t} π_1^t, i.e., that the annealing path linearly interpolates between the log-densities. A natural question is whether using other paths could lead to an improved round trip rate.

In this work we show that the answer to this question is positive. The following proposition demonstrates that the traditional path π_t ∝ π_0^{1−t} π_1^t suffers from an arbitrarily suboptimal global communication barrier even in simple examples with Gaussian reference and target distributions.

Proposition 1. Suppose the reference and target distributions are π_0 = N(μ_0, σ²) and π_1 = N(μ_1, σ²), and define z = |μ_1 − μ_0|/σ. Then as z → ∞,

1. the path π_t ∝ π_0^{1−t} π_1^t has τ_∞ = Θ(1/z), and
2. there exists a path of Gaussian distributions with τ_∞ = Ω(1/log z).

Therefore, upon decreasing the variance of the reference and target while holding their means fixed, the traditional linear annealing path obtains an exponentially smaller asymptotic round trip rate than the optimal path of Gaussian distributions. Figure 1 provides an intuitive explanation. The standard path (top) corresponds to a set of Gaussian distributions with mean interpolated between the reference and target. If one reduces the variance of the reference and target, so does the variance of the distributions along the path. For any fixed N, these distributions become nearly mutually singular, leading to arbitrarily low round trip rates. The solution to this issue (bottom) is to allow the distributions along the path to have increased variances, thereby maintaining mutual overlap and the ability to swap components with a reasonable probability. This motivates the need to design more general annealing paths. In the following, we introduce the precise general definition of an annealing path, an analysis of path communication efficiency in parallel tempering, and a rigorous formulation of—and solution to—the problem of tuning path parameters to maximize communication efficiency.

3.1. Assumptions

Let P(X) be the set of probability densities with full support on a state space X. For any collection of densities π_t ∈ P(X) with index t ∈ [0, 1], associate to each a log-density function W_t such that

  π_t(x) = (1/Z_t) exp(W_t(x)),  x ∈ X,  (4)

where Z_t = ∫_X exp(W_t(x)) dx is the normalizing constant. Definition 1 provides the conditions necessary to form a path of distributions from a reference π_0 to a target π_1 that is well-behaved for use in parallel tempering.

Definition 1. An annealing path is a map π_(·) : [0, 1] → P(X), denoted t ↦ π_t, such that for all x ∈ X, π_t(x) is continuous in t.

There are many ways to move beyond the standard linear path π_t ∝ π_0^{1−t} π_1^t. For example, consider a nonlinear path π_t ∝ π_0^{η_0(t)} π_1^{η_1(t)}, where η_i : [0, 1] → R are continuous functions such that η_0(0) = η_1(1) = 1 and η_0(1) = η_1(0) = 0. As long as π_t is a normalizable density for all t ∈ [0, 1], this is a valid annealing path between π_0 and π_1. Further, note that the path parameter does not necessarily have to appear as an exponent: consider for example the mixture path π_t ∝ (1 − t)π_0 + tπ_1. Section 4 provides a more detailed example based on linear splines.

3.2. Communication efficiency analysis

Given a particular annealing path satisfying Definition 1, we require a method to characterize the round trip rate performance of parallel tempering based on that path. The results presented in this section form the basis of the objective function used to optimize over paths, as well as the foundation for the proof of Proposition 1.

We start with some notation for the rejection rates involved when Algorithm 1 is used with nonlinear paths (Equation (4)). For t, t' ∈ [0, 1] and x ∈ X, define the rejection rate function r : [0, 1]² → [0, 1] to be

  r(t, t') = 1 − E[ exp(min{0, A_{t,t'}(X, X')}) ],
  A_{t,t'}(x, x') = (W_{t'}(x) − W_t(x)) − (W_{t'}(x') − W_t(x')),

where X ∼ π_t and X' ∼ π_{t'} are independent. Assuming that all chains have reached stationarity, and all chains undergo efficient local exploration—i.e., W_t(X) and W_t(X̃) are independent when X ∼ π_t and X̃ is generated by local exploration from X—the round trip rate for a particular schedule T_N has the form given earlier in Equation (2). This statement follows from Syed et al. (2019, Cor. 1) without modification, because the proof of that result does not depend on the form of the acceptance ratio.

Our next objective is to characterize the asymptotic communication efficiency of a nonlinear path in the regime where N → ∞ and ||T_N|| → 0—which establishes its fundamental ability to take advantage of parallel computation. In other words, we require a generalization of the asymptotic result in Equation (3). In this case, previous theory relies on the particular form of the acceptance ratio for linear paths (Syed et al., 2019); in the remainder of this section, we provide a generalization of the asymptotic result for nonlinear paths by piecewise approximation with a linear spline.

For t, t' ∈ [0, 1], define Λ(t, t') to be the global communication barrier for the linear secant connecting π_t and π_{t'},

  Λ(t, t') = (1/2) ∫_0^1 E[ |A_{t,t'}(X_s, X'_s)| ] ds,  X_s, X'_s ~iid (1/Z_s(t, t')) exp((1 − s)W_t + s W_{t'}).  (5)

Lemma 1 shows that the global communication barrier Λ(t, t') along the secant of the path connecting t to t' is a good approximation of the true rejection rate r(t, t'), with O(|t − t'|³) error as t' → t.

Lemma 1. Suppose that for all x ∈ X, W_t(x) is piecewise continuously differentiable in t, and that there exists V_1 : X → [0, ∞) such that

  ∀x ∈ X,  sup_{t∈[0,1]} | dW_t/dt (x) | ≤ V_1(x),  (6)

and

  sup_{t∈[0,1]} E_{π_t}[V_1³] < ∞.  (7)

Then there exists a constant C < ∞ independent of t, t' such that for all t, t' ∈ [0, 1],

  |r(t, t') − Λ(t, t')| ≤ C|t − t'|³.  (8)
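As a concrete check of these quantities, consider the linear path between the equal-variance Gaussians of Proposition 1. There π_s = N((1−s)μ_0 + sμ_1, σ²) is exactly the normalized geometric mixture appearing in Equation (5), and a short calculation (our own derivation for this special case, not a formula from the paper) gives E|A_{0,1}(X_s, X'_s)| = 2z/√π independently of s, hence Λ(0, 1) = z/√π with z = |μ_1 − μ_0|/σ. The sketch below verifies this with Monte Carlo and trapezoidal integration over s:

```python
import numpy as np

def barrier_linear_gaussian(mu0=-2.0, mu1=2.0, sigma=0.2, n=100_000, n_s=51, seed=0):
    """MC estimate of Lambda(0, 1) from Eq. (5) for the linear path between
    N(mu0, sigma^2) and N(mu1, sigma^2).  Along this path pi_s = N(mu_s, sigma^2)
    with mu_s = (1-s) mu0 + s mu1, and A_{0,1}(x, x') = dW(x) - dW(x')
    where dW = W_1 - W_0."""
    rng = np.random.default_rng(seed)
    dW = lambda v: 0.5 * (((v - mu0) / sigma) ** 2 - ((v - mu1) / sigma) ** 2)
    vals = []
    for s in np.linspace(0.0, 1.0, n_s):
        mu_s = (1 - s) * mu0 + s * mu1
        x, xp = mu_s + sigma * rng.normal(size=(2, n))   # X_s, X'_s iid pi_s
        vals.append(0.5 * np.mean(np.abs(dW(x) - dW(xp))))
    w = np.full(n_s, 1.0 / (n_s - 1))                    # trapezoid weights
    w[0] *= 0.5
    w[-1] *= 0.5
    return float(np.dot(w, vals))

# closed form for this special case (our derivation): Lambda = z / sqrt(pi)
z = abs(2.0 - (-2.0)) / 0.2
closed_form = z / np.sqrt(np.pi)
```

For z = 20 (the Figure 1 setting) this gives Λ ≈ 11.3 and τ_∞ = 1/(2 + 2Λ) ≈ 0.04, and Λ grows linearly in z, consistent with the Θ(1/z) rate in Proposition 1.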

A direct consequence of Lemma 1 is that for any fixed schedule T_N,

  | Σ_{n=0}^{N−1} r(t_n, t_{n+1}) − Λ(T_N) | ≤ C ||T_N||²,  (9)

where Λ(T_N) = Σ_{n=0}^{N−1} Λ(t_n, t_{n+1}). Intuitively, in the ||T_N|| ≈ 0 regime where rejection rates are low,

  r(t_n, t_{n+1}) / (1 − r(t_n, t_{n+1})) ≈ r(t_n, t_{n+1}),

and we have that τ(T_N) ≈ (2 + 2Λ(T_N))^{−1}. Therefore, Λ(T_N) characterizes the communication efficiency of the path in a way that naturally extends the global communication barrier from the linear path case. Theorem 2 provides the precise statement: the convergence is uniform in T_N and depends only on ||T_N||, and Λ(T_N) itself converges to a constant Λ in the asymptotic regime. We refer to Λ, defined below in Equation (13), as the global communication barrier for the general annealing path.

Theorem 2. Suppose that for all x ∈ X, W_t(x) is piecewise twice continuously differentiable in t, that there exists V_1 : X → [0, ∞) satisfying (6) and (7), and that there exist V_2 : X → [0, ∞) and ε > 0 satisfying

  ∀x ∈ X,  sup_{t∈[0,1]} | d²W_t/dt² (x) | ≤ V_2(x),  (10)

and

  sup_{t∈[0,1]} E_{π_t}[ (1 + V_1)² exp(ε V_2) ] < ∞.  (11)

Then

  lim_{δ→0} sup_{T_N : ||T_N|| ≤ δ} | (2 + 2Λ(T_N))^{−1} − τ(T_N) | = 0  (12)

and

  lim_{δ→0} sup_{T_N : ||T_N|| ≤ δ} | Λ(T_N) − Λ | = 0,  (13)

where Λ = ∫_0^1 λ(t) dt for an instantaneous rejection rate function λ : [0, 1] → [0, ∞) given by

  λ(t) = lim_{Δt→0} r(t + Δt, t)/|Δt| = (1/2) E[ | dW_t/dt (X_t) − dW_t/dt (X'_t) | ],  X_t, X'_t ~iid π_t.

3.3. Annealing path families and optimization

It is often the case that there is a set of candidate annealing paths in consideration for a particular target π_1. For example, if a path has tunable parameters φ ∈ Φ that govern its shape, we can generate a collection of annealing paths that all target π_1 by varying the parameter φ. We call such collections an annealing path family.

Definition 3. An annealing path family for target π_1 is a collection of annealing paths {π_t^φ}_{φ∈Φ} such that for all parameters φ ∈ Φ, π_1^φ = π_1.

There are many ways to construct useful annealing path families. For example, if one is provided a parametric family of variational distributions {q_φ : φ ∈ Φ} for some parameter space Φ, one can construct the annealing path family of linear paths π_t^φ ∝ q_φ^{1−t} π_1^t from a variational reference q_φ to the target π_1. More generally, given η_i(t) satisfying the constraints in Section 3.1, π_t^φ ∝ q_φ^{η_0(t)} π_1^{η_1(t)} defines a nonlinear annealing path family. Another example of an annealing path family used in the context of PT are the q-paths {π_t^q}_{q∈[0,1]} (Whitfield et al., 2002). Given a fixed reference and target π_0, π_1, the path π_t^q interpolates between the mixture path (q = 0) and the linear path (q = 1) (Brekelmans et al., 2020). In Section 4, we provide a new flexible class of nonlinear paths based on splines that is designed specifically to enhance the performance of parallel tempering.

Since every path in an annealing path family has the desired target distribution π_1, we are free to optimize the path over the tuning parameter space φ ∈ Φ in addition to optimizing the schedule T_N.² Motivated by the analysis of Section 3.2, a natural objective function for this optimization to consider is the non-asymptotic round trip rate:

  φ*, T_N* = argmax_{φ∈Φ, T_N} τ^φ(T_N)  (14)
           = argmin_{φ∈Φ, T_N} Σ_{n=0}^{N−1} r^φ(t_n, t_{n+1}) / (1 − r^φ(t_n, t_{n+1})),  (15)

where now the round trip rate and rejection rates depend on both the schedule and the path parameter, denoted by superscript φ. We solve this optimization using an approximate coordinate-descent procedure, iterating between an update of the schedule T_N for a fixed path parameter φ ∈ Φ, followed by a gradient step in φ based on a surrogate objective function and a fixed schedule. This is summarized in Algorithm 2. We outline the details of the schedule and path tuning procedures in the following.
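The schedule half of this coordinate descent can be sketched by treating the estimated per-interval rejection rates as increments of the cumulative barrier and inverting it at equally spaced levels. The paper fits a monotone cubic spline and uses bisection; the sketch below uses piecewise-linear interpolation instead, which is simpler but implements the same idea (it assumes every estimated rejection rate is strictly positive so the cumulative barrier is invertible).

```python
import numpy as np

def update_schedule(ts, rej, N):
    """Approximate Eq. (16): choose the new grid so each interval carries an
    equal share of the cumulative barrier.  `rej[n]` is a Monte Carlo estimate
    of r(t_n, t_{n+1}), used in place of Lambda(t_n, t_{n+1}) per Lemma 1
    (valid for fine grids).  Piecewise-linear stand-in for the paper's
    monotone-cubic-spline fit followed by bisection."""
    ts, rej = np.asarray(ts, float), np.asarray(rej, float)
    cum = np.concatenate([[0.0], np.cumsum(rej)])   # cumulative barrier at ts
    cum /= cum[-1]                                  # normalize: Lambda(t)/Lambda
    # invert: the new t_k solves Lambda(t_k)/Lambda = k/N
    return np.interp(np.linspace(0.0, 1.0, N + 1), cum, ts)
```

With equal rejection rates the uniform grid is a fixed point of this update; when rejections concentrate near the reference, the new grid packs points there.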

The integrability condition (11) is required to control the tail behaviour of distributions formed by linearized approximations to the path π_t. This condition is satisfied by a wide range of annealing paths, e.g., the linear spline paths proposed in this work, since in that case V_2 = 0.

Algorithm 2 PathOptNRPT
  Require: state x, path family π_t^φ, parameter φ, # chains N, # PT iterations M, # tuning steps S, learning rate γ
  T_N ← (0, 1/N, 2/N, ..., 1)
  for s = 1 to S do
    {x_m}_{m=1}^M, {r_n}_{n=0}^{N−1} ← NRPT(x, π_t^φ, T_N, M)
    λ^φ ← CommunicationBarrier(T_N, {r_n})
    T_N ← UpdateSchedule(λ^φ, N)
    φ ← φ − γ ∇_φ Σ_{n=0}^{N−1} SKL(π_{t_n}^φ, π_{t_{n+1}}^φ)
    x ← x_M
  end for
  Return: φ, T_N

²We assume that the optimization over φ ends after a finite number of iterations to sidestep the potential pitfalls of adaptive MCMC methods (Andrieu & Moulines, 2006).

Tuning the schedule. Fix the value of φ, which fixes the path. We adapt a schedule tuning algorithm from past work to update the schedule T_N = (t_n)_{n=0}^N (Syed et al., 2019, Section 5.1). Based on the same argument as this previous work, we obtain that when ||T_N|| ≈ 0, the non-asymptotic round trip rate is maximized when the rejection rates are all equal. The schedule that approximately achieves this satisfies

  ∀n ∈ {1, ..., N − 1},  (1/Λ^φ) ∫_0^{t_n} λ^φ(s) ds = n/N.  (16)

Following Syed et al. (2019), we use Monte Carlo estimates of the rejection rates r^φ(t_n, t_{n+1}) to approximate t ↦ ∫_0^t λ^φ(s) ds, t ∈ [0, 1], via a monotone cubic spline, and then use bisection search to solve for each t_n according to Equation (16).

Optimizing the path. Fix the schedule T_N; we now want to improve the path itself by modifying φ. However, in challenging problems this is not as simple as taking a gradient step for the objective in Equation (14). In particular, in early iterations—when the path is near its oft-poor initialization—the rejection rates satisfy r^φ(t_n, t_{n+1}) ≈ 1. As demonstrated empirically in Appendix F, gradient estimates in this regime exhibit a low signal-to-noise ratio that precludes their use for optimization.

We propose a surrogate, the symmetric KL divergence, motivated as follows. Consider first the global communication barrier Λ^φ(T_N) for the linear spline approximation to the path; Theorem 2 guarantees that as long as ||T_N|| is small enough, one can optimize Λ^φ(T_N) in place of the round trip rate τ^φ(T_N). By Jensen's inequality,

  (1/N²) Λ^φ(T_N)² ≤ (1/N) Σ_{n=0}^{N−1} Λ^φ(t_n, t_{n+1})².

Next, we apply Jensen's inequality again to the definition of Λ^φ(t_n, t_{n+1}) from (5), which shows that

  Λ^φ(t_n, t_{n+1})² ≤ (1/4) ∫_0^1 E[ A_{t_n,t_{n+1}}(X_s, X'_s)² ] ds,

where X_s, X'_s are defined in (5) and are drawn from the linear path between π_{t_n}^φ and π_{t_{n+1}}^φ. Finally, we note that the inner expectation is the path integral of the Fisher information metric along the linear path and evaluates to the symmetric KL divergence (Dabak & Johnson, 2002, Result 4),

  ∫_0^1 E[ (A_{t_n,t_{n+1}}(X_s, X'_s))² ] ds = 2 SKL(π_{t_n}^φ, π_{t_{n+1}}^φ).

Therefore we have

  (2/N) Λ^φ(T_N)² ≤ Σ_{n=0}^{N−1} SKL(π_{t_n}^φ, π_{t_{n+1}}^φ).  (17)

The slack in the inequality in Equation (17) could potentially depend on φ even in the large N regime. Therefore, during optimization, we recommend monitoring the value of the original objective function (Equation (14)) to ensure that the optimization of the surrogate SKL objective indeed improves it, and hence the round trip rate performance of PT via Equation (2). In the experiments we display the values of both objective functions.

4. Spline annealing path family

In this section, we develop a family of annealing paths—the spline annealing path family—that offers a practical and flexible improvement upon the traditional linear paths considered in past work. We first define a general family of annealing paths based on the exponential family, and then provide the specific details of the spline family with a discussion of its properties. Empirical results in Section 5 demonstrate that the spline annealing path family resolves the problematic Gaussian annealing example in Figure 1.

4.1. Exponential annealing path family

We begin with the practical desiderata for an annealing path family given a fixed reference π_0 and target π_1 distribution.³ First, the traditional linear path π_t ∝ π_0^{1−t} π_1^t should be a member of the family, so that one can achieve at least the round trip rate provided by that path. Second, the family should be broadly applicable and not depend on particular details of either π_0 or π_1. Finally, using the Gaussian example from Figure 1 and Proposition 1 as insight, the family should enable the path to smoothly vary from π_0 to π_1 while inflating / deflating the variance as necessary.

³A natural extension of this discussion would include parametrized variational reference distribution families. For simplicity we restrict to a fixed reference.

These desiderata motivate the design of the exponential annealing path family, in which each annealing path takes the form

  π_t ∝ π_0^{η_0(t)} π_1^{η_1(t)} = exp(η(t)ᵀ W(x)),

for some function η(t) = (η_0(t), η_1(t)) and reference/target log-densities W(x) = (W_0(x), W_1(x)). Intuitively, η_0(t) and η_1(t) represent the level of annealing for the reference and target respectively along the path. Proposition 2 shows that a broad collection of functions η indeed constructs a valid annealing path family including the linear path.

Proposition 2. Let Ω ⊆ R² be the set

  Ω = { ξ ∈ R² : ∫ exp(ξᵀ W(x)) dx < ∞ }.

Suppose (0, 1) ∈ Ω and A is a set of piecewise twice continuously differentiable functions η : [0, 1] → Ω such that η(1) = (0, 1). Then

  { π_t(x) ∝ exp(η(t)ᵀ W(x)) : η ∈ A }

is an annealing path family for target distribution π_1. If additionally (1, 0) ∈ Ω, then the linear path η(t) = (1 − t, t) may be included in A. Finally, if for every η ∈ A there exists M > 0 such that sup_t max{||η′(t)||_2, ||η″(t)||_2} ≤ M and Equation (11) holds with V_1 = V_2 = M||W||_2, then each path in the family satisfies the conditions of Theorem 2.

Figure 2. The spline path for K = 1, 2, 3, 4, 5, 10 knots for the family generated by π_0 = N(−1, 0.5) and π_1 = N(1, 0.5).

4.2. Spline annealing path family

Proposition 2 reduces the problem of designing a general family of paths of probability distributions to the much simpler task of designing paths in R². We argue that a good choice can be constructed using linear spline paths connecting K knots φ = (φ_0, ..., φ_K) ∈ (R²)^{K+1}, i.e., for all k ∈ {1, ..., K} and t ∈ [(k−1)/K, k/K],

  η^φ(t) = (k − Kt) φ_{k−1} + (Kt − k + 1) φ_k.

Let Ω be defined as in Proposition 2. The K-knot spline annealing path family is defined as the set of K-knot linear spline paths such that

  φ_0 = (1, 0),  φ_K = (0, 1),  and ∀k, φ_k ∈ Ω.

Validity: Since Ω is a convex set per the proof of Proposition 2, we are guaranteed that η^φ([0, 1]) ⊆ Ω, and so this collection of annealing paths is a subset of the exponential annealing family and hence forms a valid annealing path family targeting π_1.

Convexity: Furthermore, convexity of Ω implies that tuning the knots φ ∈ Ω^{K+1} involves optimization within a convex constraint set. In practice, we also enforce that the knots are monotone in each component—i.e., the first component monotonically decreases, 1 = φ_{0,0} ≥ φ_{1,0} ≥ ··· ≥ φ_{K,0} = 0, and the second increases, 0 = φ_{0,1} ≤ φ_{1,1} ≤ ··· ≤ φ_{K,1} = 1—such that the path of distributions always moves from the reference to the target. Because monotonicity constraint sets are linear and hence convex, the overall monotonicity-constrained optimization problem has a convex domain.

Flexibility: Assuming the family is nonempty, it trivially contains the linear path. Further, given a large enough number of knots K, the spline annealing family well-approximates the subset A_M of the exponential annealing family with derivatives bounded by a fixed M > 0. In particular,

  sup_{η ∈ A_M} inf_{φ ∈ Ω^{K+1}} ||η^φ − η||_∞ ≤ M / (4K²).  (18)

Figure 2 provides an illustration of the behaviour of optimized spline paths for a Gaussian reference and target. The path takes a convex curved shape; starting at the bottom right point of the figure (reference), this path corresponds to increasing the variance of the reference, shifting the mean from reference to target, and finally decreasing the variance to match the target. With more knots, this process happens more smoothly.

Appendix D provides an explicit derivation of the stochastic gradient estimates we use to optimize the knots of the spline annealing path family. It is also worth making two practical notes. First, to enforce the monotonicity constraint, we developed an alternative to the usual projection approach, as projecting into the set of monotone splines can cause several knots to become superposed. Instead, we maintain monotonicity as follows: after each stochastic gradient step, we identify a monotone subsequence of knots containing the endpoints, remove the nonmonotone jumps, and linearly interpolate between the knot subsequence with an even spacing. Second, we take gradient steps in a log-transformed space so that knot components are always strictly positive.

5. Experiments

In this section, we study the empirical performance of non-reversible PT based on the spline annealing path family (K ∈ {2, 3, 4, 5, 10}) from Section 4, with knots and schedule optimized using the tuning method from Section 3. We compare this method to two PT methods based on standard linear paths: non-reversible PT with adaptive schedule ("NRPT+Linear") (Syed et al., 2019), and reversible PT ("Reversible+Linear") (Atchadé et al., 2011). Code for the experiments is available at https://github.com/vittrom/PT-pathoptim.

Figure 3. Top: Cumulative round trips averaged over 10 runs for the spline path with K = 2, 3, 4, 5, 10 (solid blue), NRPT using a linear path (dashed green), and reversible PT with a linear path (dash/dot red). The slope of the lines represents the round trip rate. We observe large gains going from linear to non-linear paths (K > 1). For all values of K > 1, the optimized spline path substantially improves on the theoretical upper bound on the round trip rate possible using a linear path (dotted black). Bottom: Non-asymptotic communication barrier from Equation 15 (solid blue) and symmetric KL (dash orange) as a function of iteration for one run of PathOptNRPT + Spline (K = 4 knots).

We use the terminology "scan" to denote one iteration of the for loop in Algorithm 2. The computational cost of a scan is comparable for all the methods, since the bottleneck is the local exploration step shared by all methods. These experiments demonstrate two primary conclusions: (1) tuned nonlinear paths provide a substantially higher round trip rate compared to linear paths for the examples considered; and (2) the symmetric KL sum objective (Equation 17) is a good proxy for the round trip rate as a tuning objective.

We run the following benchmark problems; see the supplement for details.

Gaussian: a synthetic setup in which the reference distribution is π_0 = N(−1, 0.01²) and the target is π_1 = N(1, 0.01²). For this example we used N = 50 parallel chains and fixed the computational budget to 45000 samples. For Algorithm 2, the computational budget was divided equally over 150 scans, meaning 300 samples were used for every gradient update. "Reversible+Linear" performed 45000 local exploration steps with a communication step after every iteration, while for "NRPT+Linear" the computational budget was used to adapt the schedule. The gradient updates were performed using Adagrad (Duchi et al., 2011) with a learning rate equal to 0.2.

Beta-binomial model: a conjugate Bayesian model with prior π_0(p) = Beta(180, 840) and likelihood L(x|p) = Binomial(x|n, p). We simulated data x_1, ..., x_2000 ∼ Binomial(100, 0.7), resulting in a posterior π_1(p) = Beta(140180, 60840). The prior and posterior are heavily concentrated at 0.2 and 0.7 respectively. We used the same settings as for the Gaussian example.

Galaxy data: a Bayesian Gaussian mixture model applied to the galaxy dataset of Roeder (1990). We used six mixture components with mixture proportions w_0, ..., w_5, mixture component densities N(μ_i, 1) for mean parameters μ_0, ..., μ_5, and a cluster label categorical variable for each data point. We placed a Dir(1_6) prior on the proportions, where 1_6 = (1, 1, 1, 1, 1, 1), and a N(150, 1) prior on each of the mean parameters. We did not marginalize the cluster indicators, creating a multi-modal posterior inference problem over 94 latent variables. In this experiment we used N = 35 chains and fixed the computational budget to 50000 samples, divided into 500 scans using 100 samples each. We optimized the path using Adagrad with a learning rate of 0.3. Exploring the full posterior distribution is challenging in this context due to a combination of misspecification of the prior and label switching. Label switching refers to the invariance of the likelihood under relabelling of the cluster labels. In Bayesian problems label switching can lead to increased difficulty of the sampling problem, as it generates symmetric multi-modal posterior distributions.

High-dimensional Gaussian: a similar setup to the

able proxy for the non-asymptotic communication barrier, but does not exhibit as large estimation variance in early iterations when there are pairs of chains with rejection rates close to one. Figure4 shows the round trip rate as a function of the dimen- sionality of the problem for Gaussian target distributions. As the dimensionality increases, the sampling problem be- comes fundamentally more difficult, explaining the decay in performance of both NRPT with a linear path and an optimized path. Both sampling methods are provided with a fixed computational budget in all runs. Due to this fixed budget, NRPT is unable for d ≥ 64 to approach the optimal schedule in the alloted time. This leads to an increasing gap Figure 4. Round trips rate averaged over 5 runs for the spline path between the round trip rates of NRPT with linear path and with K = 4 (solid blue), NRPT using a linear path (dashed green) the spline path for d ≥ 64. and theoretical upper bound on round trip rate possible using linear path (dotted black) as a function of the dimensionality of the target 6. Discussion distribution. In this work, we identified the use of linear paths of distri- butions as a major bottleneck in the performance of parallel one-dimensional Gaussian experiment where the number tempering algorithms. To address this limitation, we have of dimensions ranges from d = 1 to d = 256. The refer- provided a theory of parallel tempering on nonlinear paths, 2 ence distribution is π0 = N(−1d, (0.1 )Id) and the target a methodology to tune parametrized paths, and finally a 2 is π1 = N(1d, (0.1 )Id) where the subscript d indicates the practical, flexible family of paths based on linear splines. dimensionality of the problem. The number of chains√ N Future work in this line of research includes extensions to es- is set to increase with dimension at the rate N = d15 de. 
timate normalization constants, as well as the development We fixed the number of spline knots K to 4 and set the of techniques and theory surrounding the use of variational computational budget to 50000 samples divided into 500 reference distributions. scans with 100 samples per gradient update. The gradient updates were performed using Adagrad with learning rate References equal to 0.2. For all the experiments we performed one local exploration step before each communication step. Andrieu, C. and Moulines, E. On the ergodicity properties of some adaptive MCMC algorithms. Annals of Applied The results of these experiments are shown in Figures3 Probability, 16(3):1462–1505, 2006. and4. Examining the top row of Figure3—which shows the number of round trips as a function of the number of Atchade,´ Y. F., Roberts, G. O., and Rosenthal, J. S. To- scans—one can see that PT using the spline annealing fam- wards optimal scaling of Metropolis-coupled Markov ily outperforms PT using the linear annealing path across chain Monte Carlo. Statistics and Computing, 21(4): all numbers of knots tested. Moreover, the slope of these 555–568, 2011. curves demonstrates that PT with the spline annealing fam- ily exceeds the theoretical upper bound of round trip rate for Ballnus, B., Hug, S., Hatz, K., Gorlitz,¨ L., Hasenauer, J., the linear annealing path (from Equation (12)). The largest and Theis, F. J. Comprehensive benchmarking of Markov gain is obtained from going from K = 1 (linear) to K = 2. chain Monte Carlo methods for dynamical systems. BMC For all the examples, increasing the number of knots to Systems Biology, 11(1):63, 2017. more than K > 2 leads to marginal improvements. In the Brekelmans, R., Masrani, V., Bui, T., Wood, F., Galstyan, case of the Gaussian example, note that since the global A., Steeg, G. V., and Nielsen, F. Annealed importance communication barrier Λ for the linear path is much larger sampling with q-paths. arXiv:2012.07823, 2020. 
than N, algorithms based on linear paths incurred rejection rates of nearly one for most chains, resulting in no round Costa, S. I., Santos, S. A., and Strapasson, J. E. Fisher trips. information distance: A geometrical reading. Discrete Applied Mathematics, 197:59–69, 2015. The bottom row of Figure3 shows the value of the surrogate SKL objective and non-asymptotic communication barrier Dabak, A. G. and Johnson, D. H. Relations between from Equation (15). In particular, these figures demonstrate Kullback-Leibler distance and Fisher information. Jour- that the SKL provides a surrogate objective that is a reason- nal of The Iranian Statistical Society, 5:25–37, 2002. Parallel tempering on optimized paths

Desjardins, G., Luo, H., Courville, A., and Bengio, Y. Deep tempering. arXiv:1410.0123, 2014.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011.

Gelman, A. and Meng, X.-L. Simulating normalizing constants: From importance sampling to bridge sampling to path sampling. Statistical Science, pp. 163–185, 1998.

Geyer, C. J. Markov chain Monte Carlo maximum likelihood. Interface Proceedings, 1991.

Grosse, R. B., Maddison, C. J., and Salakhutdinov, R. R. Annealing between distributions by averaging moments. In Advances in Neural Information Processing Systems, 2013.

Kamberaj, H. Molecular Dynamics Simulations in Statistical Physics: Theory and Applications. Springer, 2020.

Katzgraber, H. G., Trebst, S., Huse, D. A., and Troyer, M. Feedback-optimized parallel tempering Monte Carlo. Journal of Statistical Mechanics: Theory and Experiment, 2006(03):P03018, 2006.

Lingenheil, M., Denschlag, R., Mathias, G., and Tavan, P. Efficiency of exchange schemes in replica exchange. Chemical Physics Letters, 478(1-3):80–84, 2009.

Müller, N. F. and Bouckaert, R. R. Adaptive parallel tempering for BEAST 2. bioRxiv, 2020. doi: 10.1101/603514.

Okabe, T., Kawata, M., Okamoto, Y., and Mikami, M. Replica-exchange Monte Carlo method for the isobaric–isothermal ensemble. Chemical Physics Letters, 335(5-6):435–439, 2001.

Predescu, C., Predescu, M., and Ciobanu, C. V. The incomplete beta function law for parallel tempering sampling of classical canonical systems. The Journal of Chemical Physics, 120(9):4119–4128, 2004.

Rainforth, T., Kosiorek, A. R., Le, T. A., Maddison, C. J., Igl, M., Wood, F., and Teh, Y. W. Tighter variational bounds are not necessarily better. In International Conference on Machine Learning, 2018.

Rischard, M., Jacob, P. E., and Pillai, N. Unbiased estimation of log normalizing constants with applications to Bayesian cross-validation. arXiv:1810.01382, 2018.

Roeder, K. Density estimation with confidence sets exemplified by superclusters and voids in the galaxies. Journal of the American Statistical Association, 85(411):617–624, 1990.

Sakai, Y. and Hukushima, K. Irreversible simulated tempering. Journal of the Physical Society of Japan, 85(10):104002, 2016.

Syed, S., Bouchard-Côté, A., Deligiannidis, G., and Doucet, A. Non-reversible parallel tempering: an embarassingly parallel MCMC scheme. arXiv:1905.02939, 2019.

Tawn, N. G., Roberts, G. O., and Rosenthal, J. S. Weight-preserving simulated tempering. Statistics and Computing, 30(1):27–41, 2020.

Whitfield, T., Bu, L., and Straub, J. Generalized parallel sampling. Physica A: Statistical Mechanics and its Applications, 305(1-2):157–171, 2002.

Woodard, D. B., Schmidler, S. C., Huber, M., et al. Conditions for rapid mixing of parallel and simulated tempering on multimodal distributions. The Annals of Applied Probability, 19(2):617–640, 2009.

Zhou, Y., Johansen, A. M., and Aston, J. A. Toward automatic model comparison: An adaptive sequential Monte Carlo approach. Journal of Computational and Graphical Statistics, 25(3):701–726, 2016.

A. Proof of Proposition 1

Define π_0 = N(µ_0, σ²) and π_1 = N(µ_1, σ²) with W_i(x) ∝ −(1/(2σ²))(x − µ_i)² (throughout we use the proportionality symbol ∝ with log-densities to indicate an unspecified constant, additive with respect to W_i, multiplicative with respect to π_t). Suppose π_t is the linear path π_t(x) ∝ exp(W_t(x)) where W_t = (1 − t)W_0 + tW_1. Note that as a function of x,

    W_t(x) ∝ −((1 − t)/(2σ²))(x − µ_0)² − (t/(2σ²))(x − µ_1)²
           ∝ −(1/(2σ²))(x − µ_t)²,   µ_t = (1 − t)µ_0 + tµ_1,

and thus π_t = N(µ_t, σ²). Taking a derivative of W_t, we find that

    dW_t/dt = (µ_1 − µ_0)(x − (µ_0 + µ_1)/2)/σ².

We will now compute λ(t). If X_t, X′_t ∼ π_t, then

    λ(t) = (1/2) E[ |dW_t/dt(X_t) − dW_t/dt(X′_t)| ]
         = (|µ_1 − µ_0|/(2σ)) E[ |(X_t − (µ_0 + µ_1)/2)/σ − (X′_t − (µ_0 + µ_1)/2)/σ| ]
         = (|µ_1 − µ_0|/(2σ)) E[ |(X_t − µ_t)/σ − (X′_t − µ_t)/σ| ]
         = (|µ_1 − µ_0|/(2σ)) E[ |Z − Z′| ],

where Z, Z′ ∼ N(0, 1). Thus Z − Z′ ∼ N(0, 2), and |Z − Z′| has a folded normal distribution with expectation 2/√π. This implies λ(t) = z/√π where z = |µ_1 − µ_0|/σ, and Λ = ∫_0^1 λ(t) dt = z/√π. By Theorem 2, the asymptotic round trip rate τ_∞^linear for the linear path satisfies

    τ_∞^linear = 1/(2 + 2Λ) = 1/(2 + 2z/√π) = Θ(1/z).

We will now establish an upper bound for the communication barrier Λ for a general path π_t. If X_t, X′_t ∼ π_t, then Theorem 2 and Jensen's inequality imply the following:

    Λ = ∫_0^1 (1/2) E[ |dW_t/dt(X_t) − dW_t/dt(X′_t)| ] dt
      ≤ ∫_0^1 (1/2) √( E[ (dW_t/dt(X_t) − dW_t/dt(X′_t))² ] ) dt
      = (1/√2) ∫_0^1 √( Var_{π_t}(dW_t/dt) ) dt
      = (1/√2) Λ_F,

where Λ_F is the length of the path π_t with respect to the Fisher information metric. The geodesic path of Gaussians between π_0 and π_1 that minimizes Λ_F satisfies (Costa et al., 2015, Eq. 11, Sec. 2)

    Λ_F = √2 log( 1 + z²/4 + (z/4)√(8 + z²) ).   (19)

Again, by Theorem 2, the asymptotic round trip rate τ_∞^geodesic for the geodesic path satisfies

    τ_∞^geodesic = 1/(2 + 2Λ) ≥ 1/(2 + √2 Λ_F) = Θ(1/log z).

B. Proof of Lemma 1

Definition 4. Given a path π_t and measurable function f, we denote ‖f‖_π = sup_t E_{π_t}[f].

Following the computation in Predescu et al. (2004, Equation (6)), we have

    r(t, t′) = 1 − E[exp(−(1/2)|A_{t,t′}(X̃_{1/2}, X̃′_{1/2})|)] / E[exp(−(1/2) A_{t,t′}(X̃_{1/2}, X̃′_{1/2}))],   (20)

where X̃_s, X̃′_s ∼ π̃_s = (1/Z̃(s)) exp((1 − s)W_t + s W_{t′}) and

    A_{t,t′}(x, x′) = (W_{t′}(x) − W_t(x)) − (W_{t′}(x′) − W_t(x′)).

In particular, the path of distributions π̃_s for s ∈ [0, 1] is the linear path between π_t and π_{t′}.

Lemma 2. Suppose (6) and (7) hold. Then for all k ≤ 3, there is a constant C̃_k independent of t, t′, T_N such that

    sup_s E[ |A_{t,t′}(X̃_s, X̃′_s)|^k ] ≤ C̃_k |t′ − t|^k,

where X̃_s, X̃′_s ∼ π̃_s.

Proof. The mean-value theorem and (6) imply that W_t(x) is Lipschitz in t,

    |W_t(x) − W_{t′}(x)| ≤ V_1(x)|t′ − t|.

The triangle inequality therefore implies

    |A_{t,t′}(x, x′)| ≤ (V_1(x) + V_1(x′))|t′ − t|.

By taking expectations and using the fact |a + b|^k ≤ 2^{k−1}(|a|^k + |b|^k), we have that

    E[ |A_{t,t′}(X̃_s, X̃′_s)|^k ] ≤ 2^k E_{π̃_s}[V_1^k] |t′ − t|^k ≤ 2^k E_{π̃_s}[V_1^3] |t′ − t|^k,

where in the last line we use the fact that we can assume V_1 ≥ 1 without loss of generality. The result follows by taking the supremum on both sides and noting that C̃_k = 2^k ‖V_1^3‖_{π̃} is finite by (7).
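The two constants appearing in the proof of Proposition 1 above are easy to verify numerically: E|Z − Z′| = 2/√π for independent standard normals, and hence Λ = z/√π for the linear path. The sketch below is an illustrative check of our own, not part of the paper's experiments; it also evaluates the Theorem 2 round trip bound for the nearly mutually singular Gaussian setup of Section 5 (z = 200).

```python
import numpy as np

rng = np.random.default_rng(0)

# E|Z - Z'| for independent standard normals: Z - Z' ~ N(0, 2), so
# |Z - Z'| is folded normal with mean sqrt(2) * sqrt(2/pi) = 2/sqrt(pi).
z, zp = rng.standard_normal(10**6), rng.standard_normal(10**6)
mc_mean = np.abs(z - zp).mean()
exact = 2.0 / np.sqrt(np.pi)
assert abs(mc_mean - exact) < 1e-2

def linear_path_barrier(mu0, mu1, sigma):
    """Communication barrier of the linear path between N(mu0, sigma^2)
    and N(mu1, sigma^2): Lambda = z / sqrt(pi) with z = |mu1 - mu0| / sigma."""
    return abs(mu1 - mu0) / sigma / np.sqrt(np.pi)

# The setup from Section 5: reference N(-1, 0.01^2), target N(1, 0.01^2).
lam = linear_path_barrier(-1.0, 1.0, 0.01)
round_trip_rate = 1.0 / (2.0 + 2.0 * lam)  # asymptotic rate from Theorem 2
```

With z = 200 the bound gives a round trip rate below 0.5%, consistent with the near-total rejection reported for linear paths in Section 5.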

We now begin the proof of Lemma 1. Define λ̃(s) = (1/2) E[|A_{t,t′}(X̃_s, X̃′_s)|] for X̃_s, X̃′_s ∼ π̃_s. Then a third order Taylor expansion of Equation (20) (Predescu et al., 2004), which contains terms of the form E[|A_{t,t′}(X̃_s, X̃′_s)|^k] that can be controlled via Lemma 2, yields

    r(t, t′) = λ̃(1/2) + R(t, t′),   |R(t, t′)| ≤ C′|t − t′|³,

for some finite constant C′ independent of t, t′. By Syed et al. (2019, Prop. 2, Appendix C) we have that in addition, λ̃(s) is in C²([0, 1]), and thus there is a constant C′′ independent of t, t′ such that

    sup_s |d²λ̃/ds²(s)| ≤ C′′|t − t′|³.

The error bound for the midpoint rule implies

    |λ̃(1/2) − ∫_0^1 λ̃(s) ds| ≤ (1/24) sup_s |d²λ̃/ds²(s)| ≤ (C′′/24)|t − t′|³.

The result follows: there is a finite constant C independent of t, t′ such that

    |r(t, t′) − Λ(t, t′)| = |r(t, t′) − ∫_0^1 λ̃(s) ds| ≤ C|t′ − t|³.

C. Proof of Theorem 2

We first note that without loss of generality we can place an artificial schedule point t_n at each of the finitely many discontinuities in W_t or its first/second derivative. Thus we assume that W_t is C² on each interval [t_{n−1}, t_n]. Later in the proof it will become clear that the contribution of these artificial schedule points becomes negligible as ‖T_N‖ → 0.

Given a schedule T_N, define the path π̃_t = (1/Z̃_t) exp(W̃_t) with log-likelihood W̃_t satisfying, for each segment t_{n−1} ≤ t ≤ t_n,

    W̃_t = W_{t_{n−1}} + (ΔW_n/Δt_n)(t − t_{n−1}),

where ΔW_n = W_{t_n} − W_{t_{n−1}} and Δt_n = t_n − t_{n−1}. In particular, W̃_t agrees with W_t for t ∈ T_N, linearly interpolates between W_{t_{n−1}} and W_{t_n} for t ∈ [t_{n−1}, t_n], and for all x, from Taylor's theorem:

    |W̃_t(x) − W_t(x)| ≤ (1/2) sup_{t∈[t_{n−1},t_n]} |d²W_t/dt²(x)| Δt_n².   (21)

The following lemma shows that the normalization constant of, and expectations under, π̃_t are comparable to the same for π_t with an error bound that depends on ‖T_N‖ and converges to 0 as ‖T_N‖ → 0.

Lemma 3. For measurable functions f and s > 0, let

    E_t(f, s) = E_{π_t}[ |f| e^{s² V_2} ],

and define E_t(s) = E_t(1, s) for brevity.

(a) For any schedule T_N,

    |Z̃_t/Z_t − 1| ≤ E_t(‖T_N‖) − 1,

and if ‖T_N‖ is small enough that E_t(‖T_N‖) < 2,

    |Z_t/Z̃_t − 1| ≤ (E_t(‖T_N‖) − 1)/(2 − E_t(‖T_N‖)).

(b) For any schedule T_N and measurable function f, if ‖T_N‖ is small enough that E_t(‖T_N‖) < 2,

    |E_{π̃_t}[f] − E_{π_t}[f]| ≤ ((E_t(‖T_N‖) − 1)/(2 − E_t(‖T_N‖))) E_t(f, ‖T_N‖) + E_t(f, ‖T_N‖) − E_t(f, 0).

Proof. (a) We rewrite the expression

    Z̃_t/Z_t = (1/Z_t) ∫_X e^{W̃_t(x)} dx
             = ∫_X e^{W̃_t(x) − W_t(x)} π_t(x) dx
             = 1 + ∫_X ( e^{W̃_t(x) − W_t(x)} − 1 ) π_t(x) dx.

Thus using the inequality |e^x − 1| ≤ e^{|x|} − 1,

    |Z̃_t/Z_t − 1| ≤ ∫_X | e^{W̃_t(x) − W_t(x)} − 1 | π_t(x) dx
                  ≤ ∫_X ( e^{|W̃_t(x) − W_t(x)|} − 1 ) π_t(x) dx
                  ≤ ∫_X ( e^{V_2(x)‖T_N‖²} − 1 ) π_t(x) dx
                  = E_{π_t}[ e^{‖T_N‖² V_2} ] − 1
                  = E_t(‖T_N‖) − 1.

The bound on |Z_t/Z̃_t − 1| arises from straightforward algebraic manipulation of the above bound.

(b) We begin by rewriting E_{π̃_t}[f]:

    E_{π̃_t}[f] − E_{π_t}[f] = (1/Z̃_t) ∫_X f(x) e^{W̃_t(x)} dx − E_{π_t}[f]
    = (Z_t/Z̃_t − 1) ∫_X f(x) e^{W̃_t(x) − W_t(x)} π_t(x) dx + ∫_X f(x) ( e^{W̃_t(x) − W_t(x)} − 1 ) π_t(x) dx.

Therefore, again using |e^x − 1| ≤ e^{|x|} − 1 and the previous bound,

    |E_{π̃_t}[f] − E_{π_t}[f]| ≤ ((E_t(‖T_N‖) − 1)/(2 − E_t(‖T_N‖))) E_t(f, ‖T_N‖) + E_t(f, ‖T_N‖) − E_t(f, 0).

By changing variables via t = t_{n−1} + sΔt_n in (5), we can rewrite Λ(t_{n−1}, t_n) as

    Λ(t_{n−1}, t_n) = ∫_{t_{n−1}}^{t_n} (1/2) E[ |(ΔW_n/Δt_n)(X̃_t) − (ΔW_n/Δt_n)(X̃′_t)| ] dt,

where X̃_t, X̃′_t ∼ π̃_t. Note that by construction, for t ∈ (t_{n−1}, t_n), dW̃_t/dt exists and equals ΔW_n/Δt_n. So by summing over n we get

    Λ(T_N) = Σ_{n=1}^N Λ(t_{n−1}, t_n)
           = ∫_0^1 (1/2) E[ |dW̃_t/dt(X̃_t) − dW̃_t/dt(X̃′_t)| ] dt
           = ∫_0^1 λ̃(t) dt.

If we can show that sup_t |λ̃(t) − λ(t)| converges uniformly⁴ to 0 as ‖T_N‖ → 0, then by the dominated convergence theorem Λ(T_N) converges to Λ uniformly as ‖T_N‖ → 0. The round trip rate then uniformly converges to (2 + 2Λ)^{−1} by Theorem 3 of Syed et al. (2019).

Adding and subtracting E[ |dW̃_t/dt(X_t) − dW̃_t/dt(X′_t)| ] within the absolute difference 2|λ̃(t) − λ(t)| and using the triangle inequality, it can be shown that we require bounds on

    J_{1,t} = ∫ π_t(x)π_t(y) | |dW̃_t/dt(x) − dW̃_t/dt(y)| − |dW_t/dt(x) − dW_t/dt(y)| |

and

    J_{2,t} = ∫ |π_t(x)π_t(y) − π̃_t(x)π̃_t(y)| |dW̃_t/dt(x) − dW̃_t/dt(y)|.

For the first term, the mean value theorem implies that there exist s, s′ ∈ [t_{n−1}, t_n] (potentially functions of x and y, respectively) such that

    J_{1,t} = ∫ π_t(x)π_t(y) | |dW_s/dt(x) − dW_{s′}/dt(y)| − |dW_t/dt(x) − dW_t/dt(y)| |.

Split the integral into the set A of x, y ∈ X where the first term in the absolute value is larger; the same analysis with the same result applies in the other case on A^c. Here, Taylor's theorem and the triangle inequality yield

    |dW_s/dt(x) − dW_{s′}/dt(y)| ≤ |dW_t/dt(x) − dW_t/dt(y)| + (V_2(x) + V_2(y))‖T_N‖.

Using this and the same procedure for A^c, we have that

    J_{1,t} ≤ ∫ π_t(x)π_t(y)(V_2(x) + V_2(y))‖T_N‖ = 2 E_{π_t}[V_2] ‖T_N‖.

This converges to 0 as ‖T_N‖ → 0.

For the second term J_{2,t}, we can again use the mean value theorem to find s, s′ ∈ [t_{n−1}, t_n] where

    J_{2,t} = ∫ |π_t(x)π_t(y) − π̃_t(x)π̃_t(y)| |dW_s/dt(x) − dW_{s′}/dt(y)|,

and therefore, via the triangle inequality, symmetry, and the V_1(x) bound on the first path derivative,

    J_{2,t} ≤ 2 ∫ V_1(x) |π_t(x)π_t(y) − π̃_t(x)π̃_t(y)|.

We then add and subtract π_t(x)π̃_t(y) within the absolute value and use the triangle inequality again to find that

    J_{2,t} ≤ 2 ∫ (V_1(x) + E_{π_t}[V_1]) |π_t(x) − π̃_t(x)|
            = 2 ∫ π_t(x)(V_1(x) + E_{π_t}[V_1]) |1 − π̃_t(x)/π_t(x)|.

Note that by the triangle inequality and the bound |e^x − 1| ≤ e^{|x|} − 1,

    |1 − π̃_t(x)/π_t(x)| ≤ |Z_t/Z̃_t − 1| e^{‖T_N‖² V_2(x)} + e^{‖T_N‖² V_2(x)} − 1.

Assume that ‖T_N‖ is small enough that E_t(‖T_N‖) < 2, and let f = V_1 + E_{π_t}[V_1]. Then by Lemma 3,

    J_{2,t} ≤ 2 ((E_t(‖T_N‖) − 1)/(2 − E_t(‖T_N‖))) E_t(f, ‖T_N‖) + E_t(f, ‖T_N‖) − E_t(f, 0).

⁴ We say a(T_N) converges uniformly to a if for all ε > 0, ∃δ > 0 such that ‖T_N‖ < δ implies |a(T_N) − a| < ε.

By assumption we know that E_t(f, s) is finite for some s small enough. Therefore as ‖T_N‖ → 0, by monotone convergence E_t(f, ‖T_N‖) → E_t(f, 0), and in particular E_t(‖T_N‖) → 1. Therefore J_{1,t} + J_{2,t} → 0 as ‖T_N‖ → 0 and the proof is complete.

D. Objective and Gradient

Here we derive the gradient used to optimize the surrogate SKL objective in Equation 17. First we derive the gradient for the expectation of a general function in Section D.1. Next, in Section D.2, we show the result for the specific case of expectations of linear functions with respect to distributions in the exponential family. Lastly, we show how the result is related to our SKL objective in Sections D.3 and D.4.

D.1. Derivative of parameter-dependent expectation

Here we consider the problem of computing

    g_φ(x) = ∇_φ ∫_X π_φ(x) J_φ(x) dx,

where π_φ(x) = Z(φ)^{−1} exp(W_φ(x)), Z(φ) = ∫_X exp(W_φ(x)) dx, and J_φ(x) is a function depending on φ. Assuming we can interchange the gradient and the expectation and using the product rule we can rewrite:

    g_φ(x) = ∫_X ( J_φ(x) ∇_φ π_φ(x) + π_φ(x) ∇_φ J_φ(x) ) dx.

Using ∇_φ π_φ(x) = π_φ(x) ∇_φ log π_φ(x),

    g_φ(x) = ∫_X π_φ(x) ( J_φ(x) ∇_φ log π_φ(x) + ∇_φ J_φ(x) ) dx.

From the definition of π_φ(x), we can evaluate the score function as

    ∇_φ log π_φ(x) = −∇_φ log Z(φ) + ∇_φ W_φ(x) = −E[∇_φ W_φ(x)] + ∇_φ W_φ(x).

Substituting this into g_φ(x) we obtain

    g_φ(x) = ∫_X π_φ(x) J_φ(x)( −E[∇_φ W_φ(x)] + ∇_φ W_φ(x) ) dx + ∫_X π_φ(x) ∇_φ J_φ(x) dx
           = −E[J_φ(x)] E[∇_φ W_φ(x)] + E[J_φ(x) ∇_φ W_φ(x)] + E[∇_φ J_φ(x)]
           = Cov[∇_φ W_φ(x), J_φ(x)] + E[∇_φ J_φ(x)].

D.2. Exponential family and linear function

The gradient derived in the previous section can easily be applied to expectations of linear functions under distributions in the exponential family. Let J_φ(x) = ξ_J(φ)^T J(x) be a linear function in φ and suppose W_φ(x) = ξ_W(φ)^T W(x) for some functions ξ_J : R^d → R^n, J : X → R^n and ξ_W : R^d → R^m, W : X → R^m. Then

    g_φ(x) = Cov[∇_φ W_φ(x), J_φ(x)] + E[∇_φ J_φ(x)]
           = ∇_φ ξ_W(φ)^T Cov[W(x), J(x)] ξ_J(φ) + ∇_φ ξ_J(φ)^T E[J(x)],

where ∇_φ ξ(φ)^T is the transposed Jacobian of ξ.

D.3. Symmetric KL: general case

Next we show that the symmetric KL divergence of Equation 17 can be rewritten as a sum of expectations of functions parametrized by φ, hence falling in the framework presented above. For path parameter φ, the symmetric KL divergence is

    L_SKL(φ) = Σ_{n=0}^{N−1} SKL(π^φ_{t_n}, π^φ_{t_{n+1}})
             = Σ_{n=0}^{N−1} E[ log( π^φ_{t_{n+1}}(X_{n+1}) / π^φ_{t_n}(X_{n+1}) ) + log( π^φ_{t_n}(X_n) / π^φ_{t_{n+1}}(X_n) ) ],

where X_n ∼ π^φ_{t_n}. After cancellation of the normalization constants we obtain

    L_SKL(φ) = Σ_{n=0}^{N−1} E[ W^φ_{t_{n+1}}(X_{n+1}) − W^φ_{t_n}(X_{n+1}) + W^φ_{t_n}(X_n) − W^φ_{t_{n+1}}(X_n) ].

Collecting expectations under the same distribution and rearranging terms,

    L_SKL(φ) = E[ W^φ_{t_0}(X_0) − W^φ_{t_1}(X_0) ]
             + Σ_{n=1}^{N−1} E[ 2W^φ_{t_n}(X_n) − W^φ_{t_{n+1}}(X_n) − W^φ_{t_{n−1}}(X_n) ]
             + E[ W^φ_{t_N}(X_N) − W^φ_{t_{N−1}}(X_N) ].

Defining, for n = 1, ..., N − 1,

    J^φ_0(x) = W^φ_{t_0}(x) − W^φ_{t_1}(x)
    J^φ_n(x) = 2W^φ_{t_n}(x) − W^φ_{t_{n+1}}(x) − W^φ_{t_{n−1}}(x)
    J^φ_N(x) = W^φ_{t_N}(x) − W^φ_{t_{N−1}}(x),

we have that

    L_SKL(φ) = Σ_{n=0}^N E[ J^φ_n(X_n) ],

and then

    ∇_φ L_SKL(φ) = Σ_{n=0}^N ∇_φ E[ J^φ_n(X_n) ],

where ∇_φ E[J^φ_n(X_n)] can be computed using the formula derived in Section D.1.

D.4. Symmetric KL: exponential family case

For the spline family introduced in Section 4, the distributions π^φ_{t_n} are in the exponential family with

    W^φ_{t_n}(x) = η^φ(t_n)^T W(x),   n = 0, ..., N.

It follows that the functions J^φ_n are linear in φ with

    J^φ_n(x) = (z^φ_n)^T W(x),   n = 0, ..., N,

where

    z^φ_0 = η^φ(t_0) − η^φ(t_1)
    z^φ_n = 2η^φ(t_n) − η^φ(t_{n+1}) − η^φ(t_{n−1}),   n = 1, ..., N − 1
    z^φ_N = η^φ(t_N) − η^φ(t_{N−1}).

Given this relation, the stochastic gradient of Equation 17 can be evaluated using s samples from parallel tempering through the formula in Section D.2, defining:

    X = (X_0, ..., X_N)
    W(X) = [W_0(X_0), W_1(X_0), ..., W_0(X_N), W_1(X_N)]
    J(X) = W(X)
    ξ_W(φ) = [η^φ_0(t_0), η^φ_1(t_0), ..., η^φ_0(t_N), η^φ_1(t_N)]^T
    ξ_J(φ) = [z^φ_{0,0}, z^φ_{0,1}, ..., z^φ_{N,0}, z^φ_{N,1}]^T,

where X is the s × N matrix of samples from parallel tempering, W(X) is an s × 2N matrix evaluating X elementwise at the reference and target log-densities W_0 and W_1, ξ_W(φ) is a 2N × 1 vector of annealing coefficients, and ξ_J(φ) is a 2N × 1 vector of coefficients defining J^φ = [J^φ_0, ..., J^φ_N].

E. Proof of Proposition 2

For this annealing path family,

    W_t(x) = η(t)^T W(x).

Therefore, the piecewise twice continuous differentiability of η(t) and the endpoint conditions imply that Definition 1 is satisfied. Next, note that if

    sup_t max{ ‖η′(t)‖₂, ‖η′′(t)‖₂ } ≤ M,

then

    |dW_t/dt| = |η′(t)^T W(x)| ≤ M‖W(x)‖₂,
    |d²W_t/dt²| = |η′′(t)^T W(x)| ≤ M‖W(x)‖₂,

and thus by setting V_1(x) = V_2(x) = M‖W(x)‖₂ we satisfy Equations (6) and (10). Equation (11) implies Equation (7); so as long as Equation (11) holds, the path η satisfies all of the conditions of Theorem 2.

Finally, note that Ω is a convex subset of R²: for any nonnegative function G(x), vectors ξ_1, ξ_2 ∈ R², and λ ∈ [0, 1],

    exp((λξ_1 + (1 − λ)ξ_2)^T W(x)) G(x) = ( exp(ξ_1^T W(x)) G(x) )^λ ( exp(ξ_2^T W(x)) G(x) )^{1−λ},

and so Hölder's inequality ∫ f^λ g^{1−λ} ≤ (∫ f)^λ (∫ g)^{1−λ} yields log-convexity (and hence convexity). Therefore as long as the endpoints (0, 1) and (1, 0) are both in Ω, any convex combination of (0, 1) and (1, 0) is also in Ω, and therefore the linear path η(t) = (1 − t, t) creates a set of normalizable densities and may be included in A.

F. Empirical support for the SKL surrogate objective function

Two objective functions were discussed in Section 3: one based on rejection rate statistics, i.e. Equation (14), and the symmetric KL divergence (SKL). In this section we perform controlled experiments comparing the signal-to-noise ratio of Monte Carlo estimators of the gradient of these two objectives. Let G denote a Monte Carlo estimator of a partial derivative with respect to one of the parameters φ_i. Refer to Appendix D for details on the stochastic gradient estimators. In this experiment we use i.i.d. samples so that the Monte Carlo estimators are unbiased, justifying the use of the variance as a notion of noise. Hence, following Rainforth et al. (2018), we define the signal-to-noise ratio by SNR = |E[G]/σ[G]|, where σ[G] denotes the standard deviation of G. We use two chains, with one set to a standard Gaussian and the other to a Gaussian with mean φ and unit variance. We show the value of the two objective functions in Figure 5 (left). The label "Rejection" refers to the expected rejection of the swap proposal between the two chains, r. We also show the square root of half of the SKL ("SqrtHalfSKL"), to quantify the tightness of the bound in Equation (17), while "Ineff" shows the rejection odds, r/(1 − r), called the inefficiency in Syed et al. (2019).

Signal-to-noise ratio estimates were computed for each parameter φ_i ∈ {0, 1/5, 2/5, ..., 2}. Each gradient estimate uses 50 samples, and to approximate the signal-to-noise ratio, the estimation was repeated 1000 times for each φ_i and objective function.
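The covariance identity from Section D.1, g_φ = Cov[∇_φ W_φ(x), J_φ(x)] + E[∇_φ J_φ(x)], can be checked numerically on a toy exponential family. The sketch below is our own illustration (the choices π_φ = N(φ, 1) and J(x) = x² are assumptions for the example, not taken from the experiments); it compares a Monte Carlo estimate of the covariance term against the closed-form derivative of the expectation.

```python
import numpy as np

rng = np.random.default_rng(1)

def grad_expectation(phi, n_samples=10**6):
    """Estimate d/dphi E_{N(phi,1)}[x^2] via the covariance identity.

    Here W_phi(x) = -(x - phi)^2 / 2, so grad_phi W_phi(x) = x - phi,
    and J(x) = x^2 has no phi dependence, so the E[grad_phi J] term vanishes.
    """
    x = rng.normal(phi, 1.0, n_samples)
    score, j = x - phi, x**2
    # Cov[grad_phi W_phi(x), J(x)] estimated by Monte Carlo.
    return np.mean(score * j) - np.mean(score) * np.mean(j)

phi = 0.7
mc = grad_expectation(phi)
exact = 2 * phi  # since E_{N(phi,1)}[x^2] = phi^2 + 1
assert abs(mc - exact) < 0.05
```

The same estimator structure, with J built from the differences of annealed log-densities as in Section D.4, gives the stochastic gradient of the SKL objective.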

Figure 5. Left: objective functions for path optimization in a controlled experiment as a function of a variational parameter φ. Right: signal-to-noise ratio of the corresponding gradient estimators on the same range of parameters.

The results are shown in Figure 5 (right), and demonstrate that in the regime of small rejection (≲ 30%), the gradient estimator based on the rejection objective has a superior signal-to-noise ratio compared to its SKL counterpart. However, as φ increases and the two distributions become farther apart, the situation is reversed, providing empirical support for the surrogate objective for challenging path optimization problems.

G. Experimental details

All the experiments were conducted comparing reversible PT, non-reversible PT, and non-reversible PT based on the spline family with K ∈ {2, 3, 4, 5, 10}.

Every method was initialized at the linear path with an equally spaced schedule, i.e. π_{t_n} ∝ π_0^{1−n/N} π_1^{n/N} with N the number of parallel chains. All methods performed one local exploration step before a communication step.

To ensure a fair comparison of the different algorithms, we fixed the computational budget to a pre-determined number of samples in each experiment. Reversible PT used the budget to perform local exploration steps followed by communication steps. In non-reversible PT the computational budget was used to tune the schedule according to the procedure described in Syed et al. (2019, Section 5.1). For non-reversible PT with path optimization, the computational budget was divided equally over a fixed number of scans of Algorithm 2, where a scan corresponds to one iteration of the for loop.

Optimization of the spline annealing path family was performed using the SKL surrogate objective of Equation 17. Adagrad was used for the optimization. The gradient was scaled elementwise by its absolute value plus the value of the knot component. Such scaling was necessary to limit the gradient to the interval [−1, 1], stabilizing the optimization and avoiding possible exploding gradients due to the transformation to log space.

To mitigate variance in the results due to randomness, we performed 10 runs of each method and averaged the results across the runs.

G.1. Gaussian

This experiment optimized the path between the reference π_0 = N(−1, 0.01²) and the target π_1 = N(1, 0.01²). We used N = 50 parallel chains initialized at a state sampled from a standard Gaussian distribution. In this setting, π_t has a closed form that can be shown to be N( (η_1(t) − η_0(t))/(η_0(t) + η_1(t)), 0.01²/(η_0(t) + η_1(t)) ); therefore, in the local exploration step of parallel tempering we sampled i.i.d. from π_t. The computational budget was fixed at 45000 samples. Non-reversible PT with the optimized path divided the budget into 150 scans. Therefore, for every gradient step in Algorithm 2, the gradient was estimated with 300 samples. We used 0.2 as the learning rate for Adagrad.

G.2. Beta-binomial model

The second experiment was performed on a conjugate Bayesian model. The model prior was π_0(p) = Beta(180, 840). The likelihood was L(x|p) = Binomial(x|n, p). We simulated x_1, ..., x_2000 ∼ Binomial(100, 0.7), resulting in a posterior distribution π_1(p) = Beta(140180, 60840). The prior is concentrated at 0.176 with a standard deviation of 0.0119. The posterior distribution is concentrated at 0.697 with a standard deviation of 0.001. We used N = 50 parallel chains initialized at 0.5. Also in this experiment it is possible to compute π_t in closed form. Let S = Σ_{i=1}^{2000} x_i and R = 2000 × 100; then π_t(p) = Beta(179η_0(t) + (180 + S − 1)η_1(t) + 1, 839η_0(t) + (840 + R − S − 1)η_1(t) + 1). Hence, in the local exploration step of parallel tempering we sampled i.i.d. from π_t. The computational budget was fixed at 45000 samples. Non-reversible PT with the optimized path divided the budget into 150 scans. Therefore, for every gradient step in Algorithm 2, the gradient was estimated with 300 samples. We used 0.2 as the learning rate for Adagrad.
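The closed-form annealed distributions above are convenient to test: the Beta parameters are affine in the annealing weights (η_0(t), η_1(t)), and the endpoints must recover the prior and the posterior. A minimal sketch (an illustrative check of our own, using the linear path η(t) = (1 − t, t) for concreteness):

```python
S = 140000          # sum of the simulated counts (2000 observations, mean 70)
R = 2000 * 100      # total number of Bernoulli trials

def annealed_beta_params(eta0, eta1):
    """Beta(a, b) parameters of pi_t for annealing weights (eta0, eta1):
    pi_t(p) is proportional to
    p^(179*eta0 + (180+S-1)*eta1) * (1-p)^(839*eta0 + (840+R-S-1)*eta1)."""
    a = 179 * eta0 + (180 + S - 1) * eta1 + 1
    b = 839 * eta0 + (840 + R - S - 1) * eta1 + 1
    return a, b

def linear_path(t):
    return 1.0 - t, t

# Endpoints recover the prior Beta(180, 840) and the posterior Beta(140180, 60840).
assert annealed_beta_params(*linear_path(0.0)) == (180, 840)
assert annealed_beta_params(*linear_path(1.0)) == (140180, 60840)

# Intermediate distributions can then be sampled i.i.d., e.g. with
# numpy.random.Generator.beta(a, b), for the local exploration step.
```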

G.3. Galaxy data

The third experiment was a Bayesian Gaussian mixture model applied to the galaxy dataset of Roeder (1990). We used six mixture components with mixture proportions w_0, ..., w_5, mixture component densities N(µ_i, 1) for mean parameters µ_0, ..., µ_5, and a categorical cluster label for each data point. We placed a Dir(1_6) prior on the proportions, where 1_6 = (1, 1, 1, 1, 1, 1), and a N(150, 1) prior on each of the mean parameters. We did not marginalize the cluster indicators, creating a multi-modal posterior inference problem over 94 latent variables. In this experiment we used N = 35 chains. Mixture proportions were initialized at 1/6, mean parameters were initialized at 0, and cluster labels were initialized at 0. The local exploration step involved standard Gibbs steps for the means, indicators, and proportions. To improve local mixing, we also included an additional Metropolis-Hastings step for the proportions that approximates a Gibbs step when the indicators are marginalized. We fixed the computational budget to 50000 samples, divided into 500 scans using 100 samples each. We optimized the path using Adagrad with a learning rate of 0.3.

Figure 6. Top: Cumulative round trips averaged over 10 runs for the spline path with K = 2, 3, 4, 5, 10 (solid blue), NRPT using a linear path (dashed green), and reversible PT with a linear path (dash/dot red). The slope of the lines represents the round trip rate. Bottom: Non-asymptotic communication barrier from Equation 15 (solid blue) and symmetric KL (dashed orange) as a function of iteration for one run of PathOptNRPT + Spline (K = 4 knots).

G.4. Mixture model

The fourth experiment was a Bayesian Gaussian mixture model with mixture proportions w_0, w_1, mixture component densities N(µ_i, 10²) for mean parameters µ_0, µ_1, and a binary cluster label for each data point. We placed a Dir(1, 1) prior on the proportions, and a N(150, 1) prior on each of the two mean parameters. We simulated n = 1000 data points from the mixture 0.3 N(100, 10²) + 0.7 N(200, 10²). We did not marginalize the cluster indicators, creating a multi-modal posterior over 1004 latent variables. We used N = 35 chains. Mixture proportions were initialized at 0.5, mean parameters were initialized at 0, and cluster labels were initialized at 0. The local exploration step involved standard Gibbs steps for the means, indicator variables, and proportions. To improve local mixing, we also included an additional Metropolis-Hastings step for the proportions that approximates a Gibbs step when the indicators are marginalized. The computational budget was fixed at 25000 samples. Non-reversible PT with the optimized path divided the budget into 50 scans. Therefore, for every gradient step in Algorithm 2, the gradient was estimated with 500 samples. We used 0.3 as the learning rate for Adagrad. Results are shown in Figure 6.
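The cumulative round trip statistics reported in Figures 3 and 6 count, for each replica, alternating visits of its chain index between the reference (chain 0) and the target (chain N). A minimal sketch of this bookkeeping (our own illustration of the standard round trip count; not taken from the released code):

```python
def count_round_trips(index_path, n_chains):
    """Count round trips in one replica's chain-index trajectory:
    a trip from chain 0 up to chain n_chains - 1 and back to chain 0."""
    trips, phase = 0, None  # phase is None until the first visit to chain 0
    for idx in index_path:
        if phase is None and idx == 0:
            phase = 'up'                  # start at the reference
        elif phase == 'up' and idx == n_chains - 1:
            phase = 'down'                # reached the target
        elif phase == 'down' and idx == 0:
            trips += 1                    # returned to the reference
            phase = 'up'
    return trips

# Example: one full reference -> target -> reference excursion.
path = [0, 1, 2, 3, 2, 1, 0, 1, 2, 3]
assert count_round_trips(path, 4) == 1
```

Summing this count over replicas and dividing by the number of scans gives the round trip rate compared against the (2 + 2Λ)⁻¹ bound in the figures.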