Qualitative properties of parallel tempering and its infinite swapping limit

Paul Dupuis

Division of Applied Mathematics Brown University

CERMICS workshop: Computational statistics and molecular simulation

with Jim Doll (Department of Chemistry, Brown University)

Yufei Liu, Pierre Nyquist, Nuria Plattner, David Lipshutz

2 February 2016

Paul Dupuis (Brown University), 2 February 2016

Outline

1. General perspective
2. Problem of interest
3. Large deviation theory for the empirical measure
4. An accelerated algorithm: parallel tempering
5. Three uses of large deviations:
   - First large deviations analysis and the infinite swapping limit
   - Approximation of risk-sensitive functionals: a second rare event issue
   - The particle/temperature empirical measure: a diagnostic for convergence
6. Implementation issues and partial infinite swapping
7. References

General perspective

When using Monte Carlo, there are several ways “rare events” can adversely affect performance, or even the viability, of the method:

- Large relative variance of standard Monte Carlo when estimating small probabilities
- Poor communication/ergodicity properties when using MCMC for stationary distributions

For purposes of design and qualitative understanding of the methods, one needs some way to characterize the impact of these events. Large deviation theory gives such information.

- Advantages: generally works directly with quantities of interest, not a surrogate
- Disadvantages: it is an asymptotic theory (is it the right asymptotic for the problem at hand?), and often requires the solution of a variational problem

Problem of Interest

Compute the average potential energy, heat capacity, and other functionals with respect to a Gibbs measure of the form

\mu(dx) = e^{-V(x)/\tau}\,dx \big/ Z(\tau),

where V is the potential of a (relatively) complex physical system. Here the primary interest is in \mu as the marginal distribution, on the spatial variables, of the stationary distribution of a Hamiltonian system, e.g.,

\pi(dx,dp) \propto e^{-\left[V(x) + \sum_{j=1}^{n} p_j^2/2m\right]/\tau}\,dx\,dp.

However, many problems from other areas take this form, e.g., Bayesian inference, inverse problems, pattern theory, etc., where V depends on data. Here \tau = k_B T.

Problem of Interest

We use that \mu(dx) is the stationary distribution of the solution to

dX = -\nabla V(X)\,dt + \sqrt{2\tau}\,dW,

as well as a variety of related discrete time models. The function V(x) is defined on a large space and includes, e.g., various inter-molecular potentials. In general it may have a very complicated surface, with many deep and shallow local minima. Representative quantities of interest:

\text{average potential:}\quad \int V(x)\,\frac{e^{-V(x)/\tau}}{Z(\tau)}\,dx,

\text{heat capacity:}\quad \int \left[V(x) - \int V(y)\,\frac{e^{-V(y)/\tau}}{Z(\tau)}\,dy\right]^2 \frac{e^{-V(x)/\tau}}{Z(\tau)}\,dx.
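As a sanity check on these two functionals, they can be estimated by plain Monte Carlo whenever exact samples from the Gibbs measure are available. The sketch below is an illustration, not a method from the talk: the helper name `gibbs_estimates` is hypothetical, and it uses a harmonic potential V(x) = x^2/2, for which \mu is Gaussian with variance \tau, so the average potential is \tau/2 and the variance of V is \tau^2/2.

```python
import math
import random

def gibbs_estimates(V, sampler, n=200_000, seed=0):
    """Monte Carlo estimates of the average potential <V> and of
    Var(V) (the configurational part of the heat capacity) under
    the Gibbs measure mu(dx) proportional to exp(-V(x)/tau) dx."""
    rng = random.Random(seed)
    vals = [V(sampler(rng)) for _ in range(n)]
    mean = sum(vals) / n                            # estimates <V>
    var = sum((v - mean) ** 2 for v in vals) / n    # estimates Var(V)
    return mean, var

# Harmonic toy case: V(x) = x^2/2 gives mu = N(0, tau), so <V> = tau/2
# and Var(V) = tau^2/2; exact sampling makes the estimator easy to verify.
tau = 0.5
avg_V, var_V = gibbs_estimates(
    V=lambda x: 0.5 * x * x,
    sampler=lambda rng: rng.gauss(0.0, math.sqrt(tau)),
)
```

For a complicated V no exact sampler exists, which is exactly why the Langevin dynamics below are used instead.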

Problem of Interest

An example of a potential energy surface is the Lennard-Jones cluster of 38 atoms. This potential has about 10^14 local minima. The lowest 150 and their “connectivity” graph are as in the figure (taken from Doyle, Miller & Wales, JCP, 1999). Other examples may have discrete (but very large!) domains of integration.

Large deviation theory for empirical measure

LD theory for the empirical measure problem originates with Donsker-Varadhan and Gärtner. Consider

dX = b(X)\,dt + \sigma(X)\,dW, \qquad X(0) = x_0,

and for large T the empirical measure

\eta^T(dx) = \frac{1}{T}\int_0^T \delta_{X(t)}(dx)\,dt.

Then, considered as taking values in \mathcal{P}(\mathbb{R}^d), and for a small neighborhood N_\epsilon(\mu) with \epsilon > 0,

P\left\{\eta^T \in N_\epsilon(\mu)\right\} \approx e^{-T J_0(\mu)}.

Here J_0(\mu) \geq 0 measures deviations from the LLN limit (ergodic theorem) \bar\mu, where \bar\mu is the unique invariant probability for X. For diffusions satisfying detailed balance, J_0 takes an explicit form.
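To make \eta^T concrete, here is a minimal Euler-Maruyama sketch (an illustration with hypothetical names, under the assumptions b(x) = -x and \sigma = \sqrt{2\tau}, i.e., an Ornstein-Uhlenbeck process whose invariant measure is N(0, \tau)). The time average of x^2 along the path then converges, by the ergodic theorem, to \tau up to discretization bias.

```python
import math
import random

def empirical_second_moment(tau, h=0.01, n_steps=500_000, seed=1):
    """Simulate dX = -X dt + sqrt(2*tau) dW by Euler-Maruyama and
    return the integral of x^2 against the empirical measure
    eta^T = (1/T) * int_0^T delta_{X(t)} dt."""
    rng = random.Random(seed)
    x, acc = 0.0, 0.0
    for _ in range(n_steps):
        x += -x * h + math.sqrt(2.0 * tau * h) * rng.gauss(0.0, 1.0)
        acc += x * x
    return acc / n_steps   # approximates the invariant second moment tau

est = empirical_second_moment(tau=0.5)
```

Large deviations from this law-of-large-numbers limit are exactly what J_0 quantifies.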

Paul Dupuis (Brown University) 2 February 2016 We have the variational representation

T 1 T 1 2 T c log P  N() = inf E u(t) dt + 1N() ( ) ; T 2 u T 0 1 n o  Z  where u is progressively measurable with respect to W ,

dX = b(X )dt + (X )dW + (X )u; X (0) = x0

and T T 1  (dx) =   (dx)dt: T X (t) Z0

Large deviation theory for empirical measure

Form of the rate.

Large deviation theory for empirical measure

Form of the rate. We have the variational representation

-\frac{1}{T}\log P\left\{\eta^T \in N(\mu)\right\} = \inf_u E\left[\frac{1}{T}\int_0^T \frac{1}{2}\|u(t)\|^2\,dt + \infty\cdot 1_{N(\mu)^c}(\bar\eta^T)\right],

where u is progressively measurable with respect to W,

d\bar X = b(\bar X)\,dt + \sigma(\bar X)\,dW + \sigma(\bar X)u\,dt, \qquad \bar X(0) = x_0,

and

\bar\eta^T(dx) = \frac{1}{T}\int_0^T \delta_{\bar X(t)}(dx)\,dt.

Boué and Dupuis, Annals of Probab., 1998.

Large deviation theory for empirical measure

Form of the rate. Let \mathcal{A} be the generator of X, and†

\|u\|_{L^2(\mu)}^2 = \int_{\mathbb{R}^d} |u(x)|^2\,\mu(dx),

\mathcal{A}^u f = \mathcal{A} f + (\sigma u)\cdot\nabla f, \qquad f \in C_c^\infty,

\mathcal{S}(\mu) = \left\{u \in L^2(\mu) : \int_{\mathbb{R}^d} (\mathcal{A}^u f)(x)\,\mu(dx) = 0 \ \ \forall f \in C_c^\infty\right\}.

Using the representation and a weak convergence analysis, one can show, for \theta(x) = \frac{d\mu}{dx}(x) with \theta^{1/2} \in W^{1,2}, that

J_0(\mu) = \inf_{u \in \mathcal{S}(\mu)} \frac{1}{2}\|u\|_{L^2(\mu)}^2 \quad \text{if } \mathcal{S}(\mu) \neq \emptyset.

†Proof from Dupuis and Lipshutz, 2016.

Large deviation theory for empirical measure

One can explicitly identify the minimizer. Integration by parts in \int_{\mathbb{R}^d} (\mathcal{A}^u f)(x)\,\mu(dx) = 0 suggests we consider u = \sigma^T\nabla\varphi, with \varphi a weak-sense solution to

\Delta\varphi + \nabla\log\theta\cdot\nabla\varphi = \frac{1}{\theta}\,\mathcal{A}^*\theta.

Then completion of squares shows this u is optimal. If b = -\nabla U and \sigma = I, then this reduces to

\Delta\varphi + \nabla\log\theta\cdot\nabla\varphi = \Delta(\log\theta + U) + \nabla\log\theta\cdot\nabla(\log\theta + U).

In this case a solution by inspection is \varphi = \log\theta + U. Thus

J_0(\mu) = \frac{1}{2}\left\|\nabla(\log\theta + U)\right\|_{L^2(\mu)}^2.

An accelerated algorithm: parallel tempering

How to speed up a single particle. A key idea: “parallel tempering” (also called “replica exchange”; due to Geyer, and to Swendsen and Wang).

Idea of parallel tempering, two temperatures. Besides \tau_1 = \tau, introduce a higher temperature \tau_2 > \tau_1. Thus

dX_1 = -\nabla V(X_1)\,dt + \sqrt{2\tau_1}\,dW_1,
dX_2 = -\nabla V(X_2)\,dt + \sqrt{2\tau_2}\,dW_2,

with W_1 and W_2 independent. Then one obtains a Monte Carlo approximation to

\pi(x_1, x_2) = e^{-V(x_1)/\tau_1}\,e^{-V(x_2)/\tau_2} \big/ \left[Z(\tau_1)Z(\tau_2)\right].

An accelerated algorithm: parallel tempering

Now introduce swaps, i.e., X_1 and X_2 exchange locations with state-dependent intensity

a\,g(x_1, x_2) = a\left[1 \wedge \frac{\pi(x_2, x_1)}{\pi(x_1, x_2)}\right] = a\left[1 \wedge e^{\left[V(x_1) - V(x_2)\right]\left(1/\tau_1 - 1/\tau_2\right)}\right],

with a > 0 the “swap rate.”
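A discrete-time sketch of the two-temperature scheme with this swap mechanism (an illustration with hypothetical helper names, not code from the talk): the continuous-time intensity a·g is approximated by a per-step swap probability via thinning, and the gradient is taken numerically.

```python
import math
import random

def swap_factor(V, x1, x2, tau1, tau2):
    """g(x1, x2) = 1 ∧ pi(x2, x1)/pi(x1, x2)
                 = 1 ∧ exp{[V(x1) - V(x2)] * (1/tau1 - 1/tau2)}."""
    return min(1.0, math.exp((V(x1) - V(x2)) * (1.0 / tau1 - 1.0 / tau2)))

def pt_step(V, x, taus, h, a, rng):
    """One Euler step of each tempered Langevin diffusion, followed by
    a swap attempt with probability ~ a * g * h (thinning of the jump
    intensity); the gradient of V is approximated by finite differences."""
    grad = lambda y: (V(y + 1e-6) - V(y - 1e-6)) / 2e-6
    x = [xi - grad(xi) * h + math.sqrt(2.0 * t * h) * rng.gauss(0.0, 1.0)
         for xi, t in zip(x, taus)]
    if rng.random() < min(1.0, a * h * swap_factor(V, x[0], x[1], *taus)):
        x = [x[1], x[0]]   # the particles exchange locations
    return x

V = lambda x: (x * x - 1.0) ** 2   # a simple double-well potential
```

Note that g(x_1, x_2)\pi(x_1, x_2) = g(x_2, x_1)\pi(x_2, x_1) by construction, which is the detailed balance property invoked on the next slide.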

An accelerated algorithm: parallel tempering

We now have a Markov jump-diffusion. Easy to check: with this swapping intensity we still have detailed balance, and thus the stationary distribution is

\pi(x_1, x_2) = e^{-V(x_1)/\tau_1}\,e^{-V(x_2)/\tau_2} \big/ \left[Z(\tau_1)Z(\tau_2)\right].

Increased temperature gives higher diffusivity of X_2^a, hence easier communication for X_2^a, which is passed to X_1^a via swaps. This helps overcome the “rare event sampling problem.” As we will see, there is a second “rare event problem” of a different sort that it also helps overcome.

Large deviations analysis

What does theory say about parallel tempering? Suppose \mu is given such that \frac{d\mu}{d\pi}(x_1, x_2) = \theta(x_1, x_2) is smooth. Then we have the monotonic form

I^a(\mu) = J_0(\mu) + a\,J_1(\mu),

where J_0 is the rate for the “no swap” dynamics and

J_1(\mu) = \int_{\mathbb{R}^d\times\mathbb{R}^d} g(x_1, x_2)\,\ell\!\left(\frac{\theta(x_2, x_1)}{\theta(x_1, x_2)}\right)\mu(dx_1\,dx_2),

with \ell(z) = z\log z - z + 1, so that \ell(z) = 0 if z = 1 and \ell(z) > 0 if z \neq 1.

Paul Dupuis (Brown University) 2 February 2016 Large deviations analysis

Paul Dupuis (Brown University) 2 February 2016 V (x1)  If (dx1) = 1(dx1) = e 1 dx1 Z( 1), then for a (0; ) 6 2 1  a 0 I1 ( ) > I1 ( )

and I a( ) some …nite limit. 1 " Exponentially faster decay for probability to be in any nice set that does not contain the target 1.

Large deviations analysis

Rate for low temperature marginal. By contraction principle, for probability measure

I a( ) = inf I a(): …rst marginal of  is : 1 f g

Large deviations analysis

Rate for the low temperature marginal. By the contraction principle, for a probability measure \nu,

I_1^a(\nu) = \inf\left\{I^a(\mu) : \text{first marginal of } \mu \text{ is } \nu\right\}.

If \nu(dx_1) \neq \pi_1(dx_1) = e^{-V(x_1)/\tau_1}\,dx_1 \big/ Z(\tau_1), then for a \in (0, \infty)

I_1^a(\nu) > I_1^0(\nu),

and I_1^a(\nu) increases to some finite limit as a \to \infty. This means exponentially faster decay of the probability to be in any nice set that does not contain the target \pi_1.

Paul Dupuis (Brown University) 2 February 2016 if a is large but …nite almost all computational e¤ort is directed at swap attempts, rather than di¤usion dynamics, if a then limit process not well de…ned (no tightness). ! 1

An alternative perspective: rather than swap particles, swap temperatures, and use “weighted” empirical measure. Particle swapping. Process:

a a (X1 ; X2 ) ;

Approximation to (dx):

1 T  X a;X a (dx)dt T ( 1 2 ) Z0

In…nite swapping limit

This suggests one consider the in…nite swapping limit a , except ! 1

Paul Dupuis (Brown University) 2 February 2016 if a then limit process not well de…ned (no tightness). ! 1 An alternative perspective: rather than swap particles, swap temperatures, and use “weighted” empirical measure. Particle swapping. Process:

a a (X1 ; X2 ) ;

Approximation to (dx):

1 T  X a;X a (dx)dt T ( 1 2 ) Z0

In…nite swapping limit

This suggests one consider the in…nite swapping limit a , except ! 1 if a is large but …nite almost all computational e¤ort is directed at swap attempts, rather than di¤usion dynamics,

Paul Dupuis (Brown University) 2 February 2016 An alternative perspective: rather than swap particles, swap temperatures, and use “weighted” empirical measure. Particle swapping. Process:

a a (X1 ; X2 ) ;

Approximation to (dx):

1 T  X a;X a (dx)dt T ( 1 2 ) Z0

In…nite swapping limit

This suggests one consider the in…nite swapping limit a , except ! 1 if a is large but …nite almost all computational e¤ort is directed at swap attempts, rather than di¤usion dynamics, if a then limit process not well de…ned (no tightness). ! 1

Paul Dupuis (Brown University) 2 February 2016 An alternative perspective: rather than swap particles, swap temperatures, and use “weighted” empirical measure. Particle swapping. Process:

a a (X1 ; X2 ) ;

Approximation to (dx):

1 T  X a;X a (dx)dt T ( 1 2 ) Z0

In…nite swapping limit

This suggests one consider the in…nite swapping limit a , except ! 1 if a is large but …nite almost all computational e¤ort is directed at swap attempts, rather than di¤usion dynamics, if a then limit process not well de…ned (no tightness). ! 1

Paul Dupuis (Brown University) 2 February 2016 Particle swapping. Process:

a a (X1 ; X2 ) ;

Approximation to (dx):

1 T  X a;X a (dx)dt T ( 1 2 ) Z0

In…nite swapping limit

This suggests one consider the in…nite swapping limit a , except ! 1 if a is large but …nite almost all computational e¤ort is directed at swap attempts, rather than di¤usion dynamics, if a then limit process not well de…ned (no tightness). ! 1 An alternative perspective: rather than swap particles, swap temperatures, and use “weighted” empirical measure.

Infinite swapping limit

This suggests one consider the infinite swapping limit a \to \infty, except:

- if a is large but finite, almost all computational effort is directed at swap attempts rather than at the diffusion dynamics;
- if a \to \infty, then the limit process is not well defined (no tightness).

An alternative perspective: rather than swap particles, swap temperatures, and use a “weighted” empirical measure. Particle swapping. Process:

(X_1^a, X_2^a);

approximation to the stationary distribution:

\frac{1}{T}\int_0^T \delta_{\left(X_1^a(t),\,X_2^a(t)\right)}(dx)\,dt.

Paul Dupuis (Brown University) 2 February 2016 Process: a a a dY = V (Y )dt + 2r1(Z )dW1 1 r 1 a a a dY = V (Y )dt + p2r2(Z )dW2; 2 r 2 a a a where r(Z (t)) jumps between  1 and  2pwith intensity ag(Y1 (t); Y2 (t)).

Approximation to (dx): T 1 a a 1 (Z ) a a (dx) + 1 (Z ) a a (dx) dt: 0 (Y1 ;Y2 ) 1 (Y2 ;Y1 ) T 0 f g f g Z h i

In…nite swapping limit

Temperature swapping.

Paul Dupuis (Brown University) 2 February 2016 Approximation to (dx): T 1 a a 1 (Z ) a a (dx) + 1 (Z ) a a (dx) dt: 0 (Y1 ;Y2 ) 1 (Y2 ;Y1 ) T 0 f g f g Z h i

In…nite swapping limit

Temperature swapping. Process: a a a dY = V (Y )dt + 2r1(Z )dW1 1 r 1 a a a dY = V (Y )dt + p2r2(Z )dW2; 2 r 2 a a a where r(Z (t)) jumps between  1 and  2pwith intensity ag(Y1 (t); Y2 (t)).

Infinite swapping limit

Temperature swapping. Process:

dY_1^a = -\nabla V(Y_1^a)\,dt + \sqrt{2 r_1(Z^a)}\,dW_1,
dY_2^a = -\nabla V(Y_2^a)\,dt + \sqrt{2 r_2(Z^a)}\,dW_2,

where r(Z^a(t)) = (r_1, r_2) jumps between (\tau_1, \tau_2) and (\tau_2, \tau_1) with intensity a\,g(Y_1^a(t), Y_2^a(t)).

Approximation to the stationary distribution:

\frac{1}{T}\int_0^T \left[1_{\{Z^a = 0\}}\,\delta_{(Y_1^a, Y_2^a)}(dx) + 1_{\{Z^a = 1\}}\,\delta_{(Y_2^a, Y_1^a)}(dx)\right]dt.

Paul Dupuis (Brown University) 2 February 2016 dY1 = V (Y1)dt + 2 1 (Y1; Y2) + 2 2 (Y1; Y2)dW1 r 1 2 dY2 = V (Y2)dt + p2 2 (Y1; Y2) + 2 1 (Y1; Y2)dW2; r 1 2 p 1 T T (dx) =  (Y ; Y ) +  (Y ; Y ) ds; T 1 1 2 (Y1;Y2) 2 1 2 (Y2;Y1) Z0 and  

V (x1) + V (x2) V (x2) + V (x1) e 1 2 e 1 2 1(x1; x2) = h i ; 2(x1; x2) = h i : Z(x1; x2) Z(x1; x2)

T Theorem:  satis…es the large deviation principle with rate I 1. 

In…nite swapping limit

The advantage is a well de…ned weak limit as a : ! 1

Paul Dupuis (Brown University) 2 February 2016 T Theorem:  satis…es the large deviation principle with rate I 1. 

In…nite swapping limit

The advantage is a well de…ned weak limit as a : ! 1

dY1 = V (Y1)dt + 2 1 (Y1; Y2) + 2 2 (Y1; Y2)dW1 r 1 2 dY2 = V (Y2)dt + p2 2 (Y1; Y2) + 2 1 (Y1; Y2)dW2; r 1 2 p 1 T T (dx) =  (Y ; Y ) +  (Y ; Y ) ds; T 1 1 2 (Y1;Y2) 2 1 2 (Y2;Y1) Z0 and  

V (x1) + V (x2) V (x2) + V (x1) e 1 2 e 1 2 1(x1; x2) = h i ; 2(x1; x2) = h i : Z(x1; x2) Z(x1; x2)

Infinite swapping limit

The advantage is a well defined weak limit as a \to \infty:

dY_1 = -\nabla V(Y_1)\,dt + \sqrt{2\tau_1\rho_1(Y_1, Y_2) + 2\tau_2\rho_2(Y_1, Y_2)}\,dW_1,
dY_2 = -\nabla V(Y_2)\,dt + \sqrt{2\tau_2\rho_1(Y_1, Y_2) + 2\tau_1\rho_2(Y_1, Y_2)}\,dW_2,

\eta^T(dx) = \frac{1}{T}\int_0^T \left[\rho_1(Y_1, Y_2)\,\delta_{(Y_1, Y_2)}(dx) + \rho_2(Y_1, Y_2)\,\delta_{(Y_2, Y_1)}(dx)\right]ds,

where

\rho_1(x_1, x_2) = \frac{e^{-\left[V(x_1)/\tau_1 + V(x_2)/\tau_2\right]}}{Z(x_1, x_2)}, \qquad \rho_2(x_1, x_2) = \frac{e^{-\left[V(x_2)/\tau_1 + V(x_1)/\tau_2\right]}}{Z(x_1, x_2)},

with Z(x_1, x_2) the pointwise normalization.

Theorem: \eta^T satisfies the large deviation principle with rate I^\infty.
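A small numerical sketch of the limit dynamics (illustrative helper names, not code from the talk; the weights are computed with the log-sum-exp trick for stability). Two structural facts are easy to verify: \rho_1 + \rho_2 = 1 pointwise, and the two state-dependent squared diffusivities always sum to 2(\tau_1 + \tau_2).

```python
import math

def ins_weights(V, x1, x2, tau1, tau2):
    """rho_1, rho_2 of the infinite-swapping limit, normalized so that
    rho_1 + rho_2 = 1, computed stably via the log-sum-exp trick."""
    a = -(V(x1) / tau1 + V(x2) / tau2)
    b = -(V(x2) / tau1 + V(x1) / tau2)
    m = max(a, b)
    wa, wb = math.exp(a - m), math.exp(b - m)
    return wa / (wa + wb), wb / (wa + wb)

def ins_diffusivities(rho1, rho2, tau1, tau2):
    """Squared diffusion coefficients of Y1 and Y2 in the limit SDE:
    Y1 feels 2*(tau1*rho1 + tau2*rho2), Y2 the mirrored combination."""
    return (2.0 * (tau1 * rho1 + tau2 * rho2),
            2.0 * (tau2 * rho1 + tau1 * rho2))

V = lambda x: (x * x - 1.0) ** 2   # a simple double-well potential
r1, r2 = ins_weights(V, 0.0, 1.0, 0.2, 1.0)
d1, d2 = ins_diffusivities(r1, r2, 0.2, 1.0)
```

The symmetry \rho_1(x_1, x_2) = \rho_2(x_2, x_1) also holds by construction, which is what makes the weighted empirical measure consistent.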

Paul Dupuis (Brown University) 2 February 2016 In…nite swapping limit

Remarks

Prove a uniform result (can let a a [0; ], T in any order). ! 2 1 ! 1 The invariant distribution of (Y1; Y2) is the symmetrized measure 1 [(x ; x ) + (x ; x )] 2 1 2 2 1 1 V (x1) V (x2) V (x2) V (x1) = e 1 e 2 + e 1 e 2 : 2Z( )Z( ) 1 2   The “implied potential”

V (x1) V (x2) V (x2) V (x1) log e 1 e 2 + e 1 e 2   has lower energy barriers than the original V (x ) V (x ) 1 + 2 :  1  2
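The barrier-lowering claim is easy to check numerically. The helpers below are illustrative (hypothetical names); since -\log(e^{-a} + e^{-b}) \le \min(a, b) pointwise, the implied potential always sits strictly below the smaller of the two orderings of the product potential, and by at most \log 2.

```python
import math

def product_potential(V, x1, x2, tau1, tau2):
    """The original two-particle surface V(x1)/tau1 + V(x2)/tau2."""
    return V(x1) / tau1 + V(x2) / tau2

def implied_potential(V, x1, x2, tau1, tau2):
    """-log[e^{-V(x1)/tau1 - V(x2)/tau2} + e^{-V(x2)/tau1 - V(x1)/tau2}],
    evaluated stably as min(a, b) - log(1 + e^{-|a - b|})."""
    a = product_potential(V, x1, x2, tau1, tau2)
    b = product_potential(V, x2, x1, tau1, tau2)
    return min(a, b) - math.log(1.0 + math.exp(-abs(a - b)))

V = lambda x: (x * x - 1.0) ** 2   # double well with barrier V(0) = 1
w = implied_potential(V, 0.0, 1.0, 0.2, 1.0)
u = min(product_potential(V, 0.0, 1.0, 0.2, 1.0),
        product_potential(V, 1.0, 0.0, 0.2, 1.0))
```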

Infinite swapping limit

Densities when V(x) is a double well: the original product density and the density of the implied potential.

Infinite swapping limit

Remarks. To get the INS approximation \eta^T(dx), we simulate (Y_1, Y_2), form its empirical measure, and push this through a deterministic “re-weighting” map M to get \eta^T:

M[\gamma](A) = \int_A \left[\rho_1(y_1, y_2)\,\gamma(dy_1\,dy_2) + \rho_2(y_1, y_2)\,\gamma(dy_2\,dy_1)\right].

Two interpretations of the rate function, in terms of the rates J_0 of (X_1, X_2) and K of (Y_1, Y_2):

I^\infty(\mu) = \begin{cases} J_0(\mu) & \text{if } \theta(x_1, x_2) = \theta(x_2, x_1), \\ \infty & \text{else,} \end{cases} \qquad I^\infty(\mu) = \inf\left\{K(\gamma) : \mu = M[\gamma]\right\}.

The minimizing \gamma is always symmetric, and the form of M enforces the “weighted” symmetry on \mu.
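The map M is straightforward to apply to a stored trajectory: each sample (y_1, y_2) contributes mass \rho_1 at (y_1, y_2) and mass \rho_2 at (y_2, y_1). A sketch with a hypothetical helper name; since \rho_1 + \rho_2 = 1, M[\gamma] is again a probability measure, which can be checked by integrating f \equiv 1.

```python
def reweighted_estimate(f, samples, weights):
    """Estimate the mu-integral of f by pushing the empirical measure
    of (Y1, Y2) through M: each sample (y1, y2) with weights (r1, r2)
    contributes r1 * f(y1, y2) + r2 * f(y2, y1)."""
    total = 0.0
    for (y1, y2), (r1, r2) in zip(samples, weights):
        total += r1 * f(y1, y2) + r2 * f(y2, y1)
    return total / len(samples)

# Tiny hand-made trajectory; each weight pair sums to 1.
samples = [(0.0, 1.0), (2.0, -1.0)]
weights = [(0.3, 0.7), (0.5, 0.5)]
mass = reweighted_estimate(lambda a, b: 1.0, samples, weights)
first_coord = reweighted_estimate(lambda a, b: a, samples, weights)
```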

Paul Dupuis (Brown University) 2 February 2016 Suppose  1 …xed and assume interest is in this temperature. How to choose  2 (and  3; :::)? If only rare event issue sampling problem (long time correlation) tempted to choose  2 large. One cannot do this for parallel tempering since too few acceptances if gap too large. But in…nite swapping “hard codes” the swaps, so why not? One reason: for many interesting functionals (risk-sensitive functionals), there is another rare event problem.

A second rare event issue

A problem that has attracted a lot of attention: how to select the temperatures.

Paul Dupuis (Brown University) 2 February 2016 If only rare event issue sampling problem (long time correlation) tempted to choose  2 large. One cannot do this for parallel tempering since too few acceptances if gap too large. But in…nite swapping “hard codes” the swaps, so why not? One reason: for many interesting functionals (risk-sensitive functionals), there is another rare event problem.

A second rare event issue

A problem that has attracted a lot of attention: how to select the temperatures.

Suppose  1 …xed and assume interest is in this temperature. How to choose  2 (and  3; :::)?

Paul Dupuis (Brown University) 2 February 2016 One cannot do this for parallel tempering since too few acceptances if gap too large. But in…nite swapping “hard codes” the swaps, so why not? One reason: for many interesting functionals (risk-sensitive functionals), there is another rare event problem.

A second rare event issue

A problem that has attracted a lot of attention: how to select the temperatures.

Suppose  1 …xed and assume interest is in this temperature. How to choose  2 (and  3; :::)? If only rare event issue sampling problem (long time correlation) tempted to choose  2 large.

Paul Dupuis (Brown University) 2 February 2016 But in…nite swapping “hard codes” the swaps, so why not? One reason: for many interesting functionals (risk-sensitive functionals), there is another rare event problem.

A second rare event issue

A problem that has attracted a lot of attention: how to select the temperatures.

Suppose  1 …xed and assume interest is in this temperature. How to choose  2 (and  3; :::)? If only rare event issue sampling problem (long time correlation) tempted to choose  2 large. One cannot do this for parallel tempering since too few acceptances if gap too large.

Paul Dupuis (Brown University) 2 February 2016 One reason: for many interesting functionals (risk-sensitive functionals), there is another rare event problem.

A second rare event issue

A problem that has attracted a lot of attention: how to select the temperatures.

Suppose  1 …xed and assume interest is in this temperature. How to choose  2 (and  3; :::)? If only rare event issue sampling problem (long time correlation) tempted to choose  2 large. One cannot do this for parallel tempering since too few acceptances if gap too large. But in…nite swapping “hard codes” the swaps, so why not?

Paul Dupuis (Brown University) 2 February 2016 A second rare event issue

A problem that has attracted a lot of attention: how to select the temperatures.

Suppose  1 …xed and assume interest is in this temperature. How to choose  2 (and  3; :::)? If only rare event issue sampling problem (long time correlation) tempted to choose  2 large. One cannot do this for parallel tempering since too few acceptances if gap too large. But in…nite swapping “hard codes” the swaps, so why not? One reason: for many interesting functionals (risk-sensitive functionals), there is another rare event problem.
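The acceptance bottleneck can be seen directly from the parallel tempering swap rule: a proposed exchange of configurations between temperatures $\tau_1$ and $\tau_2$ is accepted with probability $\min\{1, e^{(1/\tau_1 - 1/\tau_2)(V(x_1) - V(x_2))}\}$. A minimal sketch (the energies and temperatures below are illustrative assumptions, not values from the talk):

```python
import math

def swap_accept_prob(v1, v2, tau1, tau2):
    """Metropolis acceptance probability for swapping the configurations
    held at temperatures tau1 and tau2, with energies v1 = V(x1), v2 = V(x2)."""
    return min(1.0, math.exp((1.0 / tau1 - 1.0 / tau2) * (v1 - v2)))

# With the colder replica deeper in energy (the typical situation), the
# acceptance probability collapses as the temperature gap grows.
p_small_gap = swap_accept_prob(1.0, 5.0, 1.0, 1.2)   # modest gap
p_large_gap = swap_accept_prob(1.0, 5.0, 1.0, 10.0)  # large gap: rarely accepted
print(p_small_gap, p_large_gap)
```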


A second rare event issue

What is an ordinary integral? One of the form

\frac{1}{Z(\tau)} \int F(x)\, e^{-V(x)/\tau}\, dx, \quad \text{e.g.,} \quad \frac{1}{Z(\tau)} \int V(x)\, e^{-V(x)/\tau}\, dx.

What is a risk-sensitive functional? One of the form

\frac{1}{Z(\tau)} \int e^{F(x)/\tau}\, e^{-V(x)/\tau}\, dx,

an integral heavily influenced by the tail of the distribution.

Examples: heat capacity, functionals arising in "free energy calculations" [Lelièvre, Stoltz, Rousset, Free Energy Computations: A Mathematical Perspective, 2010].

For ordinary functionals with INS, "bigger $\tau_2$ is better" holds, at least in some circumstances. Not so for risk-sensitive functionals.
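The contrast can be seen in a naive Monte Carlo experiment. A sketch under toy assumptions (all choices mine: $V(x) = x^2/2$, so the Gibbs measure is Gaussian, and $F(x) = x$): the ordinary average is well behaved, while the risk-sensitive average is dominated by a handful of tail samples, visible in its tiny effective sample size.

```python
import math, random

random.seed(7)
tau = 0.2
V = lambda x: 0.5 * x * x   # Gibbs density exp(-V/tau)/Z is N(0, tau)
F = lambda x: x             # observable appearing in the risk-sensitive exponent

xs = [random.gauss(0.0, math.sqrt(tau)) for _ in range(20000)]

# Ordinary functional: a plain average under the Gibbs measure (about tau/2 here).
ordinary = sum(V(x) for x in xs) / len(xs)

# Risk-sensitive functional: average of exp(F/tau), dominated by the tail.
w = [math.exp(F(x) / tau) for x in xs]
risk_sensitive = sum(w) / len(w)
ess_fraction = sum(w) ** 2 / (len(w) * sum(u * u for u in w))
print(ordinary, risk_sensitive, ess_fraction)
```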


A second rare event issue

To illustrate, we eliminate the time correlation aspect and assume iid samples $(Y_1, Y_2)$ drawn from the target symmetrized distribution

\frac{1}{2}\left[\mu(x_1, x_2) + \mu(x_2, x_1)\right] = \frac{1}{2 Z(\tau_1) Z(\tau_2)} \left[ e^{-V(x_1)/\tau_1} e^{-V(x_2)/\tau_2} + e^{-V(x_2)/\tau_1} e^{-V(x_1)/\tau_2} \right].

To simplify, consider the risk-sensitive quantity

\frac{1}{Z(\tau_1)} \int_A e^{-V(x)/\tau_1}\, dx = \frac{1}{Z(\tau_1)} \int e^{-\infty \cdot 1_{A^c}(x)/\tau_1}\, e^{-V(x)/\tau_1}\, dx,

where $A$ does not include the global minimum of $V$. The probability of interest has decay rate

-\tau \log\left( \frac{1}{Z(\tau)} \int_A e^{-V(x)/\tau}\, dx \right) \to \inf_{x \in A} V(x).
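This Laplace-principle limit can be checked numerically. A sketch with the illustrative choices $V(x) = x^2$ and $A = [1, 2]$ (so $\inf_{x \in A} V(x) = 1$, while the global minimum $x = 0$ lies outside $A$):

```python
import math

def decay_rate(tau, lo=-4.0, hi=4.0, n=200000):
    """-tau * log of the Gibbs probability of A = [1, 2] for V(x) = x**2,
    computed with a midpoint Riemann sum on [lo, hi]."""
    h = (hi - lo) / n
    z = 0.0       # normalization Z(tau)
    num = 0.0     # contribution from A = [1, 2]
    for i in range(n):
        x = lo + (i + 0.5) * h
        w = math.exp(-x * x / tau) * h
        z += w
        if 1.0 <= x <= 2.0:
            num += w
    return -tau * math.log(num / z)

# The rate should approach inf_{x in A} V(x) = 1 as tau decreases.
print(decay_rate(0.2), decay_rate(0.05))
```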


A second rare event issue

The estimate given by INS is

\hat{\eta}^{\tau_1, \tau_2} = 1_A(Y_1)\, \frac{e^{-V(Y_1)/\tau_1 - V(Y_2)/\tau_2}}{e^{-V(Y_1)/\tau_1 - V(Y_2)/\tau_2} + e^{-V(Y_2)/\tau_1 - V(Y_1)/\tau_2}} + 1_A(Y_2)\, \frac{e^{-V(Y_2)/\tau_1 - V(Y_1)/\tau_2}}{e^{-V(Y_1)/\tau_1 - V(Y_2)/\tau_2} + e^{-V(Y_2)/\tau_1 - V(Y_1)/\tau_2}}.

Now let $\tau_2 = a\tau_1 = a\tau$, $a \in [1, \infty)$, and consider the limit $\tau \downarrow 0$. The appropriate measure of performance is the decay rate of the variance/second moment. Using LD calculations,

-\tau \log E\big[(\hat{\eta}^{\tau_1, \tau_2})^2\big] \to \min\left\{1 + \frac{1}{a},\ 2 - \frac{1}{a}\right\} \inf_{x \in A} V(x).

The optimal decay rate is $\frac{3}{2} \inf_{x \in A} V(x)$, achieved when $a = 2$.

A second rare event issue

Remarks:
- Can extend to more temperatures.
- Current project: put the dynamics back in and consider the double limit $T \to \infty$ and $\tau_1 \downarrow 0$.


A diagnostic for convergence

For temperature swapped parallel tempering, let

\sigma_i(t) = \text{index of the process component assigned the dynamic with temperature } \tau_i,

so $\{\sigma_i(t),\ i = 1, \ldots, K\}$ is a permutation on $\{1, \ldots, K\}$, with $\sigma(0) = \mathrm{id}$ (the identity permutation). Then for parallel tempering,

\rho^T(\sigma) = \frac{1}{T} \int_0^T 1_{\{\sigma(t) = \sigma\}}\, dt

is the empirical measure on the particle/temperature associations; it is a (random) probability measure on $\Sigma_K = \{\text{permutations of } \{1, \ldots, K\}\}$. There is an analogue for INS.

A diagnostic for convergence

Lemma. For either PT or INS,

\rho^T(\sigma) \to \frac{1}{K!} \quad \text{for each } \sigma \in \Sigma_K, \text{ a.s.}

As a consequence, for any given particle,

\text{fraction of time the particle is assigned dynamic } k \to \frac{1}{K}.

This is easy to compute during the simulation.
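The diagnostic costs almost nothing to track alongside a run. A toy sketch (for illustration, every proposed adjacent-temperature swap is accepted, mimicking the frequent-swap regime; K and the step count are arbitrary choices): track how long each particle spends at each temperature and compare with the uniform prediction 1/K.

```python
import random

random.seed(0)
K = 3
steps = 100000
perm = list(range(K))  # perm[t] = particle currently assigned temperature t
counts = [[0] * K for _ in range(K)]  # counts[particle][temperature]

for _ in range(steps):
    i = random.randrange(K - 1)                  # pick an adjacent temperature pair
    perm[i], perm[i + 1] = perm[i + 1], perm[i]  # swap (always accepted here)
    for temp, particle in enumerate(perm):
        counts[particle][temp] += 1

fractions = [c / steps for c in counts[0]]
print(fractions)  # each entry should be near 1/K
```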


A diagnostic for convergence

For INS (in the notation of K = 2) the analogue of $\rho^T(\sigma)$ is

\eta^T(\{\tau_1, \tau_2\}) = \frac{1}{T}\int_0^T \rho_1(Y_1, Y_2)\, dt, \qquad \eta^T(\{\tau_2, \tau_1\}) = \frac{1}{T}\int_0^T \rho_2(Y_1, Y_2)\, dt,

defined in terms of the weights

\rho_1(x_1, x_2) = \frac{e^{-[V(x_1)/\tau_1 + V(x_2)/\tau_2]}}{Z(x_1, x_2)}, \qquad \rho_2(x_1, x_2) = \frac{e^{-[V(x_2)/\tau_1 + V(x_1)/\tau_2]}}{Z(x_1, x_2)}.

Rate function for the pair $(\mu^T, \eta^T)$:

I(\mu; (w_1, w_2)) = \inf\left\{ K(\gamma) : \mu = M\gamma,\ \int \rho_1(x_1, x_2)\, \gamma(dx_1, dx_2) = w_1 \right\},

where $K$ is the rate function for the empirical measure of $(Y_1, Y_2)$, and

M[\gamma](A) = \int_A \left[\rho_1(y_1, y_2)\, \gamma(dy_1\, dy_2) + \rho_2(y_1, y_2)\, \gamma(dy_2\, dy_1)\right].

Paul Dupuis (Brown University) 2 February 2016 Interpretation. By using the LDP (in the form of Gibbs conditioning principle), the minimizing  is the overwhelmingly most likely location of T T  given = (w1; w2). Thus if the particle/temperature association is not close to uniform, then T will not have converged. Remark. Generalizes to K temperatures. The converse (unfortunately) does not hold. However, the analysis of the variationals problem w I (; (w1; w2)) gives indicates other interesting aspects of INS. !

A diagnostic for convergence

Theorem

If (w1; w2) = (1=2; 1=2), then the minimizer in 6

 I (; (w1; w2)) ! is not  = .

Paul Dupuis (Brown University) 2 February 2016 Remark. Generalizes to K temperatures. The converse (unfortunately) does not hold. However, the analysis of the variationals problem w I (; (w1; w2)) gives indicates other interesting aspects of INS. !

A diagnostic for convergence

Theorem

If (w1; w2) = (1=2; 1=2), then the minimizer in 6

 I (; (w1; w2)) ! is not  = .

Interpretation. By using the LDP (in the form of Gibbs conditioning principle), the minimizing  is the overwhelmingly most likely location of T T  given = (w1; w2). Thus if the particle/temperature association is not close to uniform, then T will not have converged.

A diagnostic for convergence

Theorem. If $(w_1, w_2) \neq (1/2, 1/2)$, then the minimizer in

\mu \mapsto I(\mu; (w_1, w_2))

is not $\mu = \bar{\mu}$.

Interpretation. By the LDP (in the form of the Gibbs conditioning principle), the minimizing $\mu$ is the overwhelmingly most likely location of $\mu^T$ given $\eta^T = (w_1, w_2)$. Thus if the particle/temperature association is not close to uniform, then $\mu^T$ will not have converged.

Remark. This generalizes to K temperatures. The converse (unfortunately) does not hold. However, analysis of the variational problem $w \mapsto I(\mu; (w_1, w_2))$ indicates other interesting aspects of INS.

Paul Dupuis (Brown University) 2 February 2016 Straightforward extension of in…nite swapping to K temperatures  1 <  2 < <  K . Bene…ts of symmetrization/equilibration even greater, larger   rate for lowest marginal. But, coe¢ cients become complex, e.g., K! weights, and each involves many calculations. Not practical if K 7.  Need for computational feasibility leads to partial in…nite swapping.

Implementation issues and partial in…nite swapping

As noted applications of parallel tempering use many temperatures (e.g., K = 30 to 50) when V is complicated to overcome barriers of all di¤erent heights.

Paul Dupuis (Brown University) 2 February 2016 But, coe¢ cients become complex, e.g., K! weights, and each involves many calculations. Not practical if K 7.  Need for computational feasibility leads to partial in…nite swapping.

Implementation issues and partial in…nite swapping

As noted applications of parallel tempering use many temperatures (e.g., K = 30 to 50) when V is complicated to overcome barriers of all di¤erent heights. Straightforward extension of in…nite swapping to K temperatures  1 <  2 < <  K . Bene…ts of symmetrization/equilibration even greater, larger   rate for lowest marginal.

Paul Dupuis (Brown University) 2 February 2016 Need for computational feasibility leads to partial in…nite swapping.

Implementation issues and partial in…nite swapping

As noted applications of parallel tempering use many temperatures (e.g., K = 30 to 50) when V is complicated to overcome barriers of all di¤erent heights. Straightforward extension of in…nite swapping to K temperatures  1 <  2 < <  K . Bene…ts of symmetrization/equilibration even greater, larger   rate for lowest marginal. But, coe¢ cients become complex, e.g., K! weights, and each involves many calculations. Not practical if K 7. 

Implementation issues and partial infinite swapping

As noted, applications of parallel tempering use many temperatures (e.g., K = 30 to 50) when V is complicated, in order to overcome barriers of many different heights. The extension of infinite swapping to K temperatures $\tau_1 < \tau_2 < \cdots < \tau_K$ is straightforward, and the benefits of symmetrization/equilibration are even greater (a larger rate for the lowest marginal). But the coefficients become complex: there are K! weights, and each involves many calculations. This is not practical if $K \geq 7$. The need for computational feasibility leads to partial infinite swapping.

Paul Dupuis (Brown University) 2 February 2016 Two examples are Dynamics A and B in …gure:

Implementation issues and partial in…nite swapping

Partial in…nite swapping. Given any subgroup of set of permutations one can construct a corresponding partial in…nite swapping dynamic.

Implementation issues and partial infinite swapping

Partial infinite swapping. Given any subgroup of the set of permutations, one can construct a corresponding partial infinite swapping dynamic. Two examples are Dynamics A and B in the figure.

Paul Dupuis (Brown University) 2 February 2016 If one alternates between subgroups that generate full group of permutations, one approximates full in…nite swapping (convergence theorem in continuous time). However, particles lose their temperature identity in in…nite swapping limit (partial or otherwise). Cannot simply alternate–need a proper “hando¤” rule. Can identify the “distributionally correct” hando¤ rule, using that partial swappings are limits of “physically meaningful” processes. E.g., in a block of 4 locations xi associated with 4 temperatures  i , select a permutation  according to

V (x ) V (x ) V (x ) V (x ) V (x ) V (x ) V (x ) V (x ) (1) + (2) + (3) + (4)  (1) +  (2) +  (3) +  (4) 1 2 3 4 1 2 3 4 e   e  ; ,  X and assign  i to x(i). CHARMM codes (http://www.charmm.org) due to Plattner, Meuwly

Implementation issues and partial in…nite swapping

Using partial in…nite swapping one can control the complexity of the coe¢ cients and algorithm.

Paul Dupuis (Brown University) 2 February 2016 However, particles lose their temperature identity in in…nite swapping limit (partial or otherwise). Cannot simply alternate–need a proper “hando¤” rule. Can identify the “distributionally correct” hando¤ rule, using that partial swappings are limits of “physically meaningful” processes. E.g., in a block of 4 locations xi associated with 4 temperatures  i , select a permutation  according to

V (x ) V (x ) V (x ) V (x ) V (x ) V (x ) V (x ) V (x ) (1) + (2) + (3) + (4)  (1) +  (2) +  (3) +  (4) 1 2 3 4 1 2 3 4 e   e  ; ,  X and assign  i to x(i). CHARMM codes (http://www.charmm.org) due to Plattner, Meuwly

Implementation issues and partial in…nite swapping

Using partial in…nite swapping one can control the complexity of the coe¢ cients and algorithm. If one alternates between subgroups that generate full group of permutations, one approximates full in…nite swapping (convergence theorem in continuous time).

Paul Dupuis (Brown University) 2 February 2016 Can identify the “distributionally correct” hando¤ rule, using that partial swappings are limits of “physically meaningful” processes. E.g., in a block of 4 locations xi associated with 4 temperatures  i , select a permutation  according to

V (x ) V (x ) V (x ) V (x ) V (x ) V (x ) V (x ) V (x ) (1) + (2) + (3) + (4)  (1) +  (2) +  (3) +  (4) 1 2 3 4 1 2 3 4 e   e  ; ,  X and assign  i to x(i). CHARMM codes (http://www.charmm.org) due to Plattner, Meuwly

Implementation issues and partial in…nite swapping

Using partial in…nite swapping one can control the complexity of the coe¢ cients and algorithm. If one alternates between subgroups that generate full group of permutations, one approximates full in…nite swapping (convergence theorem in continuous time). However, particles lose their temperature identity in in…nite swapping limit (partial or otherwise). Cannot simply alternate–need a proper “hando¤” rule.

Paul Dupuis (Brown University) 2 February 2016 CHARMM codes (http://www.charmm.org) due to Plattner, Meuwly

Implementation issues and partial infinite swapping

Using partial infinite swapping one can control the complexity of the coefficients and the algorithm. If one alternates between subgroups that generate the full group of permutations, one approximates full infinite swapping (there is a convergence theorem in continuous time). However, particles lose their temperature identity in the infinite swapping limit (partial or otherwise), so one cannot simply alternate; a proper "handoff" rule is needed. One can identify the "distributionally correct" handoff rule, using the fact that partial swappings are limits of "physically meaningful" processes. E.g., in a block of 4 locations $x_i$ associated with 4 temperatures $\tau_i$, select a permutation $\sigma$ according to

\frac{e^{-\left[V(x_{\sigma(1)})/\tau_1 + V(x_{\sigma(2)})/\tau_2 + V(x_{\sigma(3)})/\tau_3 + V(x_{\sigma(4)})/\tau_4\right]}}{\sum_{\bar\sigma} e^{-\left[V(x_{\bar\sigma(1)})/\tau_1 + V(x_{\bar\sigma(2)})/\tau_2 + V(x_{\bar\sigma(3)})/\tau_3 + V(x_{\bar\sigma(4)})/\tau_4\right]}},

and assign $\tau_i$ to $x_{\sigma(i)}$.
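The handoff rule amounts to sampling a permutation of the block from a Gibbs distribution with these weights. A sketch for a block of 4 (the energies and temperatures are illustrative assumptions, and `sample_handoff` is my name for the routine):

```python
import math
import random
from itertools import permutations

def sample_handoff(v, taus, rng):
    """Sample a permutation sigma with probability proportional to
    exp(-sum_i v[sigma(i)] / taus[i]), as in the handoff rule."""
    perms = list(permutations(range(len(v))))
    logw = [-sum(v[s[i]] / taus[i] for i in range(len(v))) for s in perms]
    m = max(logw)                       # log-sum-exp shift for stability
    w = [math.exp(l - m) for l in logw]
    r = rng.random() * sum(w)
    acc = 0.0
    for s, wi in zip(perms, w):
        acc += wi
        if r <= acc:
            return s
    return perms[-1]

rng = random.Random(1)
v = [0.5, 2.0, 1.0, 3.0]     # V(x_1), ..., V(x_4)
taus = [0.1, 0.3, 0.6, 1.0]  # tau_1 < ... < tau_4
print(sample_handoff(v, taus, rng))
# The most likely permutation pairs the lowest energies with the lowest temperatures.
```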

CHARMM codes (http://www.charmm.org) due to Plattner, Meuwly.

Numerical examples

Relaxation study of convergence to equilibrium for LJ-38:
- Quantity of interest: average potential energy at various temperatures.
- Used 45 temperatures, with a 3–6–6–···–6 type dynamic for partial infinite swapping.
- Used Smart Monte Carlo for the particle dynamics.
- The lowest 1/3 of the temperatures were raised to push the process away from equilibrium (low-temperature components pushed away from the deep minima).
- The temperatures were then reduced to their correct values for 600 discrete time steps to study the return to equilibrium.
- Repeated 2000 times; we plot averages for the lowest (and hardest) temperature.


Figure: relaxation study of convergence to equilibrium for LJ-38, parallel tempering versus partial infinite swapping; only the lowest temperature is illustrated.


For this system, the reduction relative to parallel tempering: $10^{10}$ steps reduced to $10^{6}$ steps, with additional computational overhead of approximately 10%.

Figure: convergence of the empirical measure on temperatures to the uniform distribution.


Figure: convergence to equilibrium, single sample, 12 lowest temperatures.

References

Parallel tempering:
- "Replica Monte Carlo simulation of spin glasses", Swendsen and Wang, Phys. Rev. Lett., 57, 2607–2609, 1986.
- "Markov chain Monte Carlo maximum likelihood", Geyer, in Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, ASA, 156–163, 1991.

A paper suggesting it is good to swap a lot:
- "Exchange frequency in replica exchange molecular dynamics", Sindhikara, Meng and Roitberg, J. Chem. Phys., 128, 024104, 2008.

LD for the empirical measure and the control form of the rate:
- "On the large deviation rate for the empirical measure of a reversible pure jump Markov process", Dupuis and Liu, Annals of Probability, 43, 1121–1156, 2015.
- "Large deviations for the empirical measure of a diffusion via weak convergence methods", Dupuis and Lipshutz, preprint.


Infinite swapping:
- "On the infinite swapping limit for parallel tempering", Dupuis, Liu, Plattner and Doll, SIAM J. on MMS, 10, 986–1022, 2012.
- "An infinite swapping approach to the rare-event sampling problem", Plattner, Doll, Dupuis, Wang, Liu and Gubernatis, J. Chem. Phys., 135, 134111, 2011.
- "On performance measures for infinite swapping Monte Carlo methods", Doll and Dupuis, J. Chem. Phys., 142, 024111, 2015.
- "A large deviations analysis of certain qualitative properties of parallel tempering and infinite swapping algorithms", Doll, Dupuis and Nyquist, preprint.


Applications to biology:
- "Overcoming the rare-event sampling problem in biological systems with infinite swapping", Plattner, Doll and Meuwly, J. of Chem. Th. and Comp., 9, 4215–4224, 2013.
- "The structure and dynamics of alanine dipeptide and decapeptide at multiple thermodynamic states using the PINS method", Hédin, Plattner, Doll and Meuwly, preprint.

A paper using the LD rate to study a non-reversible single particle scheme:
- "Irreversible Langevin samplers and variance reduction: a large deviation approach", Rey-Bellet and Spiliopoulos, Nonlinearity, 28, 2081–2103, 2015.

A paper considering an alternative implementation of INS and PINS:
- "Infinite swapping replica exchange molecular dynamics leads to a simple simulation patch using mixture potentials", Lu and Vanden-Eijnden, J. Chem. Phys., 138, 084105, 2013.
