Monte Carlo Methods for Rough Energy Landscapes

Jon Machta, University of Massachusetts Amherst and Santa Fe Institute

SFI, August 9, 2013

Collaborators

• Stephan Burkhardt
• Chris Pond
• Jingran Li
• Richard Ellis

Motivation

• How to sample equilibrium states of systems with rugged free energy landscapes, e.g. spin glasses, configurational glasses, proteins, combinatorial optimization problems.

Problem

Markov chain Monte Carlo (MCMC) at a single temperature, such as the Metropolis algorithm, gets stuck in local minima.

Outline

• Population Annealing
• Parallel Tempering
• PA vs PT
• Conclusions

…spin glasses and perhaps other systems with rough free energy landscapes. By using weighted averages over an ensemble of runs, biases inherent in a single run can be made small and high-precision results can be obtained. The method can be optimized by minimizing the variance of the free energy estimator. If the variance of the free energy estimator is less than unity, high-precision results can be obtained from a relatively small ensemble of runs. The method is comparable in efficiency to parallel tempering and is well suited to parallelization.

FIG. 3: The variance of the free energy, Var(F̃), vs. the probability of a small overlap, for a sample of 25 disorder realizations.

FIG. 4: The histogram of the dimensionless free energy, F̃. The solid line is the best fit to a Gaussian distribution.

Acknowledgments: This work was supported in part by NSF grant DMR-0907235. I am grateful to Helmut Katzgraber and Burcu Yucesoy for helpful discussions.

[1] K. Hukushima and Y. Iba, in The Monte Carlo Method in the Physical Sciences: Celebrating the 50th Anniversary of the Metropolis Algorithm, edited by J. E. Gubernatis (AIP, 2003), vol. 690, pp. 200-206.
[2] R. H. Swendsen and J.-S. Wang, Phys. Rev. Lett. 57, 2607 (1986).
[3] K. Hukushima and K. Nemoto, J. Phys. Soc. Jpn. 65, 1604 (1996).
[4] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, Science 220, 671 (1983).
[5] A. M. Ferrenberg and R. H. Swendsen, Phys. Rev. Lett. 61, 2635 (1988).
[6] J. B. Anderson, J. Chem. Phys. 63, 1499 (1975).
[7] C. Jarzynski, Phys. Rev. E 56, 5018 (1997).
[8] M. E. J. Newman and G. T. Barkema, Monte Carlo Methods in Statistical Physics (Oxford University Press, Oxford, 1999).
[9] H. G. Katzgraber and A. P. Young, Phys. Rev. B 65, 214402 (2002).
[10] F. Krzakala and O. C. Martin, Phys. Rev. Lett. 85, 3013 (2000).
[11] M. Palassini and A. P. Young, Phys. Rev. Lett. 85, 3017 (2000).
[12] E. Marinari and F. Zuliani, J. Phys. A: Math. Gen. 32, 7447 (1999).

Population Annealing

• Modification of simulated annealing for equilibrium sampling.
• A population of replicas is cooled according to an annealing schedule. Each replica is acted on by an MCMC (e.g. Metropolis) at the current temperature.
• During each temperature step, the population is differentially resampled according to Boltzmann weights to maintain equilibrium, as sketched below.
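As a concrete reading of these bullets, here is a minimal Python sketch of the population annealing loop. It is illustrative only: `energy` and `mcmc_sweep` are assumed user-supplied functions (not defined in the talk), the annealing schedule is a list of increasing inverse temperatures, and the multinomial resampling described in the accompanying text is used; the resampling weights and the nearest-integer alternative are detailed on the following slides.

```python
import numpy as np

def population_annealing(configs, energy, mcmc_sweep, betas, sweeps_per_step, rng):
    """Sketch of population annealing: anneal a population of replicas along `betas` (hot to cold)."""
    configs = list(configs)
    for beta, beta_new in zip(betas[:-1], betas[1:]):
        E = np.array([energy(c) for c in configs])
        # Boltzmann reweighting factors for the temperature step beta -> beta_new.
        w = np.exp(-(beta_new - beta) * (E - E.min()))  # shift by E.min() for numerical stability
        # Multinomial resampling: replicas with larger weight get more copies on average.
        counts = rng.multinomial(len(configs), w / w.sum())
        configs = [c for c, n in zip(configs, counts) for _ in range(n)]
        # Equilibrate each replica at the new temperature with MCMC (e.g. Metropolis) sweeps.
        for _ in range(sweeps_per_step):
            configs = [mcmc_sweep(c, beta_new, rng) for c in configs]
        # Observables would be measured here by averaging over the population.
    return configs

# Usage (illustrative):
# rng = np.random.default_rng(0)
# final = population_annealing(init_pop, energy, mcmc_sweep, np.linspace(0.0, 1.0, 101), 10, rng)
```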

[Diagram: a population of R replicas with energies E1, E2, …, ER at inverse temperature β = 1/kT, carried to a lower temperature β′.]

Population annealing = simulated annealing with differential reproduction (resampling) of replicas.

Temperature Step

[Diagram: replicas with energies E1, E2, …, ER at β are resampled to the new inverse temperature β′.]

τ_j(β, β′) = exp[−(β′ − β)E_j] / Q(β, β′),   Q(β, β′) = (1/R) Σ_{j=1}^{R} exp[−(β′ − β)E_j]

Replica j is reproduced n_j times, where n_j is an integer random variate with mean τ_j.

Resampling example: n1 = 1, n2 = 0, n3 = 3, n4 = 1.
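A sketch of one temperature step under these definitions, in Python. The function names are illustrative; the integer counts follow the nearest-integer rule described just below, and log Q is returned because it is needed later for the free energy estimate.

```python
import numpy as np

def resample_weights(energies, beta, beta_new):
    """tau_j = exp[-(beta'-beta) E_j] / Q(beta, beta'), with Q the population mean of the numerator."""
    x = -(beta_new - beta) * np.asarray(energies, dtype=float)
    shift = x.max()                        # subtract the maximum exponent for numerical stability
    boltz = np.exp(x - shift)
    log_Q = shift + np.log(boltz.mean())   # log Q(beta, beta'), with the shift restored
    return boltz / boltz.mean(), log_Q

def nearest_integer_counts(tau, rng):
    """n_j = floor(tau_j) or ceil(tau_j), chosen so that E[n_j] = tau_j and |n_j - tau_j| < 1."""
    lo = np.floor(tau)
    return (lo + (rng.random(tau.shape) < tau - lo)).astype(int)

def temperature_step(replicas, energies, beta, beta_new, rng):
    """Resample the population from beta to beta_new; the population size fluctuates around R."""
    tau, log_Q = resample_weights(energies, beta, beta_new)
    counts = nearest_integer_counts(tau, rng)
    new_population = [r for r, n in zip(replicas, counts) for _ in range(n)]
    return new_population, log_Q
```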

Nearest integer resampling

Prob(n_j = ⌊τ_j⌋) = ⌈τ_j⌉ − τ_j
Prob(n_j = ⌈τ_j⌉) = τ_j − ⌊τ_j⌋
Prob(|n_j − τ_j| ≥ 1) = 0

• Minimizes the variance of n_j.
• Maximizes the number of ancestors with descendants.
• Population size fluctuates.

Population Annealing is related to...

✦ Simulated annealing
✦ Sequential Monte Carlo
  – See e.g. "Sequential Monte Carlo Methods in Practice", A. Doucet et al. (2001)
  – aka Particle Filters
  – Nested Sampling, Skilling
• Go with the Winners, Grassberger (2002)
• Diffusion (quantum) Monte Carlo
• Nonequilibrium Equality for Free Energy Differences, Jarzynski (1997)
• Histogram Re-weighting, Swendsen and Ferrenberg (1988)

Convergence to Equilibrium

Suppose the population at β is a correct, iid sample:

Y_i = exp[−(β′ − β)E_i + λA_i]

I(β, β′, λ) = E[ log( (1/R) Σ_{j=1}^{R} Y_j ) ] = log(EY) − (1/2R) Var(Y)/(EY)² + O(R^{−3/2})

⟨A⟩_{β′,R} = dI(β, β′, λ)/dλ |_{λ=0} = E_{β′}[A] + O(1/R)

A bias inversely proportional to the population size results from the re-weighting to β′.

JM, R. Ellis, JSP 144, 544 (2011)

Convergence to Equilibrium

Resampling complicates the analysis...

Conjecture: Fixing the other parameters of population annealing, observables converge to their equilibrium values inversely in the population size.
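A toy numerical illustration of this O(1/R) behaviour (not from the talk): take the population at β to be an iid unit-Gaussian energy sample and re-weight it to β′ = β + Δβ, for which the exact re-weighted mean energy is −Δβ; the bias of the weighted estimate shrinks roughly as 1/R.

```python
import numpy as np

rng = np.random.default_rng(0)
dbeta = 0.5                  # beta' - beta for the re-weighting step
exact = -dbeta               # exact <E> at beta' when E ~ N(0,1) at beta

for R in (50, 200, 800):
    estimates = []
    for _ in range(30000):
        E = rng.standard_normal(R)               # iid equilibrium sample at beta (toy assumption)
        w = np.exp(-dbeta * E)                   # re-weighting factors to beta'
        estimates.append(np.sum(w * E) / np.sum(w))
    print(R, np.mean(estimates) - exact)         # systematic error falls off roughly as 1/R
```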

Quantifying Convergence to Equilibrium

• Do many runs for each of many population sizes. Expensive.
• Is there an analog to the autocorrelation function and the exponential autocorrelation time?
• Idea: resample small populations from a single run with a large population (bootstrap), as sketched below.
  – Choose a small population of parents and measure the observable in their descendants.
  – Repeat many times for different population sizes.
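A sketch of this bootstrap idea in Python, assuming each replica in the final population carries the index of its ancestor (`family`) at the temperature where the sub-populations are drawn; all names are illustrative.

```python
import numpy as np

def bootstrap_convergence(obs, family, subset_sizes, n_boot, rng):
    """Estimate the population-size dependence of an observable from a single large run.

    obs    : observable measured on each replica of the final population
    family : index of each replica's ancestor at the chosen earlier temperature
    """
    obs, family = np.asarray(obs, dtype=float), np.asarray(family)
    parents = np.unique(family)
    estimates = {}
    for r in subset_sizes:                      # r plays the role of a small population size
        vals = []
        for _ in range(n_boot):
            chosen = rng.choice(parents, size=r, replace=False)   # a small population of parents
            mask = np.isin(family, chosen)
            if mask.any():
                vals.append(obs[mask].mean())   # observable averaged over their descendants
        estimates[r] = np.mean(vals)
    return estimates
```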

[Plot: log10 of the systematic error vs. log10 of the population size R for two disorder realizations, estimated from bootstrap subsets of a single large-population run.]

• Problem: A small sample within a large population is not the same as an isolated small population, because the sample within the large population exists in a more competitive environment.

[Plot: log10 of the systematic error, log10|q²_R − q²_exact|, vs. log10 of the population size R for one realization, comparing bootstrap subsets (100000 subsets) with many independent runs (100 runs).]

Direct Estimate of Free Energies

Q(β, β′) = (1/R) Σ_{j=1}^{R} exp[−(β′ − β)E_j]

−β_k F̃(β_k) = Σ_{ℓ=k}^{K−1} ln Q(β_{ℓ+1}, β_ℓ) − β_K F(β_K)

…the multinomial distribution p[R; n_1, …, n_R; ρ_1/R, …, ρ_R/R] for R trials. In this implementation the population size is fixed. Other valid resampling methods are available. For example, the number of copies of replica j can be chosen as a Poisson random variable with mean proportional to ρ_j(β, β′), in which case the population size fluctuates [13]. For large R and small (β′ − β), the resampled distribution is close to an equilibrium ensemble at the new temperature β′. However, the regions of the equilibrium distribution for β′ that differ significantly from the equilibrium distribution for β are not well sampled, leading to biases in the population at β′. In addition, due to resampling, the replicas are no longer independent. To mitigate both of these problems, the equilibrating subroutine is now applied. Finally, observables are measured by averaging over the population. The entire algorithm consists of S − 1 steps: in step k the temperature is lowered from β_{S−k} to β_{S−k−1} via resampling, followed by the application of the equilibrating subroutine and data collection at temperature β_{S−k−1}.

Population annealing permits one to estimate free energy differences. If the annealing schedule begins at infinite temperature, corresponding to β_S = 0, then it yields an estimate of the absolute free energy F̃(β_k) at every temperature β_k in the annealing schedule. The following calculation shows that the normalization factor Q(β, β′) is an estimator of the ratio of the partition functions at the two temperatures:

Derivation:

Z(β′)/Z(β) = (1/Z(β)) Σ_γ e^{−β′E_γ}
           = Σ_γ e^{−(β′−β)E_γ} e^{−βE_γ}/Z(β)
           = ⟨e^{−(β′−β)E}⟩_β
           ≈ (1/R) Σ_{j=1}^{R} e^{−(β′−β)E_j} = Q(β, β′).   (4)

The summation over γ is a sum over the microstates of the system, while the sum over j is a sum over the population of replicas in PA. The last approximate equality becomes exact in the limit R → ∞. From Eq. (4) the estimated free energy difference from β to β′ is found to be

−β′F̃(β′) = −βF(β) + log Q(β, β′),   (5)

where F(β) is the free energy at β and F̃ is the estimated free energy at β′. Given these free energy differences, if β_S = 0, then the PA estimator of the absolute free energy at each simulated temperature is

−β_k F̃(β_k) = Σ_{ℓ=k+1}^{S−1} log Q(β_ℓ, β_{ℓ−1}) + log Ω,   (6)

where Ω is the total number of microstates of the system; i.e., k_B log Ω is the infinite-temperature entropy.
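A sketch of the telescoping estimate, Eq. (6), in Python: accumulate log Q over the annealing steps starting from β = 0, where the free energy is fixed by the infinite-temperature entropy. The schedule ordering and the `log_Q_steps` input (one value per temperature step, e.g. as returned by the `temperature_step` sketch above) are assumptions of this sketch.

```python
import numpy as np

def free_energy_estimates(betas, log_Q_steps, log_omega):
    """-beta_k * F_tilde(beta_k) = log(Omega) + sum of log Q over the steps from beta = 0 down to beta_k.

    betas       : annealing schedule starting at beta = 0, ordered hot -> cold
    log_Q_steps : log Q(beta, beta') for each consecutive pair in `betas`
    log_omega   : log of the total number of microstates (infinite-temperature entropy / k_B)
    """
    minus_beta_F = log_omega + np.cumsum(log_Q_steps)    # one value per annealed temperature
    return {beta: -mbF / beta for beta, mbF in zip(betas[1:], minus_beta_F)}
```

For an N-spin Ising system, for example, log Ω = N log 2.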

III. PARALLEL TEMPERING

Parallel tempering, also known as replica exchange Monte Carlo, simultaneously equilibrates a set of R replicas of a system at S inverse temperatures

β_0 > β_1 > ... > β_{S−1}.   (7)

There is one replica at each temperature, so that R = S, in contrast to population annealing, where typically the number R of replicas greatly exceeds the number S of temperatures; i.e., R ≫ S. The equilibrating subroutine operates on each replica at its respective temperature. Replica exchange moves are implemented that allow replicas to diffuse in temperature space. The first step in a replica exchange move is to propose a pair of replicas (k, k−1) at neighboring temperatures β = β_k and β′ = β_{k−1}. The probability for accepting the replica exchange move is

p_swap = min[1, e^{(β−β′)(E−E′)}].   (8)

Here E and E′ are the respective energies of the replicas that were originally at β and β′. If the move is accepted, the replica equilibrating at β is now set to equilibrate at β′ and vice versa. Equation (8) ensures that the Markov chain defined by parallel tempering converges to a joint distribution whose marginals are equilibrium distributions at each temperature.

Weighted Averages

JM, PRE 82, 026704 (2010)

• Results using a small population size are biased.
• Results from independent runs can be combined and biases reduced using weighted averages.
• Observables from each run are weighted by the exponential of the free energy estimator for that run.

Ã(β) = Σ_{m=1}^{M} ω_m(β) Ã_m(β),   ω_m(β) = exp[−βF̃_m(β)] / Σ_{i=1}^{M} exp[−βF̃_i(β)]
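A sketch of this combination in Python; `A_runs` holds the per-run estimates Ã_m(β) and `minus_betaF_runs` holds −βF̃_m(β) for the same runs (illustrative names), with a shift applied before exponentiating for numerical stability.

```python
import numpy as np

def weighted_average(A_runs, minus_betaF_runs):
    """Combine M independent runs with weights proportional to exp(-beta * F_tilde_m)."""
    A = np.asarray(A_runs, dtype=float)
    logw = np.asarray(minus_betaF_runs, dtype=float)
    logw = logw - logw.max()        # shift so the largest weight is exactly 1
    w = np.exp(logw)
    return np.sum(w * A) / np.sum(w)
```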

Weighted Averages

• The number of runs needed for unbiased results is determined by the width of the distribution of the free energy estimator, an intrinsic measure of equilibration.

[Plot: systematic error vs. number of runs.]

Parallel Tempering

[Diagram: R replicas at inverse temperatures β1, β2, β3, …, βR, ranging from hard to equilibrate (low temperature, large β) to easy to equilibrate (high temperature, small β).]

• R replicas at inverse temperatures β1 > β2 > ... > βR (each with the same couplings).
• MCMC (e.g. Metropolis) on each replica.
• Exchange replicas with energies E and E′ at temperatures β and β′ with probability:

p_swap = min[1, e^{(β−β′)(E−E′)}]
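A minimal sketch of one parallel tempering round with this acceptance rule; `energy` and `mcmc_sweep` are assumed user-supplied (as in the earlier sketch), and neighboring pairs are proposed in a deterministic sweep for brevity.

```python
import numpy as np

def parallel_tempering_step(configs, betas, energy, mcmc_sweep, rng):
    """One round of parallel tempering: MCMC at each temperature, then neighbor swaps."""
    configs = [mcmc_sweep(c, b, rng) for c, b in zip(configs, betas)]
    E = [energy(c) for c in configs]
    for k in range(len(betas) - 1):
        beta, beta_p = betas[k], betas[k + 1]
        # Accept the exchange with probability min[1, exp((beta - beta') (E - E'))].
        if rng.random() < np.exp(min(0.0, (beta - beta_p) * (E[k] - E[k + 1]))):
            configs[k], configs[k + 1] = configs[k + 1], configs[k]
            E[k], E[k + 1] = E[k + 1], E[k]
    return configs
```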

Intuition

• Mixing is accelerated by "round trips" from low to high temperature and back.

[Diagram: replicas diffusing along the temperature ladder β1, β2, β3, …, βR.]

Population Annealing vs. Parallel Tempering

• Population annealing (sequential Monte Carlo): space ≈ # replicas (R >> 1); parallel time ≈ # temperature steps (T); work = RT; convergence (A(R) − A_eq) ~ R_0/R.
• Parallel tempering (Markov chain Monte Carlo): space ≈ # replicas (R); parallel time ≈ # sweeps (t >> 1); work = Rt; convergence (A(t) − A_eq) ~ e^(−t/τ).

Two-Well Landscape

• Consider a toy free energy landscape with two free energy minima separated by a high barrier.

– JM, PRE 80, 056706 (2009)

Two-Well Landscape

βF_σ(β) = −(1/2)(β − β_c)²(K + Hσ),   with wells labeled σ = 0 and σ = 1

ΔβF = βF_1(β) − βF_0(β) = −(1/2)(β − β_c)²H

Prob[σ = +1] = 1 / (1 + e^{ΔβF})

Dynamics of the two-well model

• Assumptions:
  – Fast equilibration within each well.
  – No transitions between wells except at β_c, where each well is equally probable.
  – Energy is normally distributed in each well; from thermodynamics:

⟨E⟩ = −(β − β_c)(K + Hσ),   Var(E) = K + Hσ

Replica exchange probabilities

• For the two-well model, replica exchange transition probabilities can be computed exactly. For symmetric wells (H = 0):

p_swap = min[1, e^{(β−β′)(E−E′)}]

⟨E⟩ = −(β − β_c)K,   Var(E) = K

p_swap = (1/2) Erfc[(β′ − β)√K / 2]

PT for the two-well model

• Diffusion of replicas in temperature space.
• Equilibration time τ is proportional to the mean first passage time for diffusion from β_0 to β_c with R equally spaced replicas: τ ∝ R² / p_swap.
• Optimum number of replicas balances diffusion time and replica exchange acceptance fraction:

R_opt = 1 + 0.594 (β_0 − β_c)√K
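A small numerical check of this trade-off (not from the talk), taking τ ∝ R²/p_swap with equally spaced temperatures and the acceptance formula reconstructed above; the constants are illustrative, so only the growth of the optimal R with K, roughly as √K, should be read off.

```python
import numpy as np
from math import erfc

def equilibration_time(R, K, dbeta_total):
    """tau ~ R^2 / p_swap for R equally spaced replicas between beta_0 and beta_c."""
    dbeta = dbeta_total / (R - 1)                # spacing of neighboring inverse temperatures
    p_swap = 0.5 * erfc(dbeta * np.sqrt(K) / 2)  # acceptance per the formula above
    return R**2 / p_swap

for K in (16, 64, 256):
    Rs = np.arange(2, 400)
    taus = [equilibration_time(R, K, dbeta_total=1.0) for R in Rs]
    print(K, Rs[int(np.argmin(taus))])           # the optimum grows roughly like sqrt(K)
```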

PA for the two-well model

Y_i = exp[−(β′ − β)E_i + λA_i]

I(β, β′, λ) = E[ log( (1/R) Σ_{j=1}^{R} Y_j ) ] = log(EY) − (1/2R) Var(Y)/(EY)² + O(R^{−3/2})

⟨A⟩_{β′,R} = dI(β, β′, λ)/dλ |_{λ=0} = E_{β′}[A] + O(1/R)

(A(R) − A_eq) ~ R_0/R,   with R_0 ~ T^a (a = 1?) and T ~ K^{1/2}

JM, R. Ellis, JSP 144, 544 (2011)

Population Annealing vs. Parallel Tempering

• Population annealing: space ≈ # replicas (R); parallel time ≈ # temperature steps (T); (A(R) − A_eq) ~ R_0/R with R_0 ~ T and T ~ K^{1/2}; work to equilibrate ~ R_0 T ~ K.
• Parallel tempering: space ≈ # replicas (R); parallel time ≈ # sweeps (t); (A(t) − A_eq) ~ e^{−t/τ} with τ ~ K and R ~ K^{1/2}; work to equilibrate ~ τR ~ K^{3/2}.

Compare PA and PT for the Two-Well Model

[Plots: deviation from equilibrium, G = Prob[σ = +1] − Prob[σ = +1]_eq, vs. t (for PT) and R (for PA), at H = 0.1 for K = 16, 64, and 256.]

PA vs PT for 3D Spin Glass

• For a small handful of disorder realizations, doing many runs for different population sizes (for PA) or run lengths (for PT), the convergence to equilibrium is comparable as measured in sweeps.

[Plot: log systematic error vs. sweeps (up to 1.2 × 10^6) for PA and PT.]

Conclusions

• Both parallel tempering and population annealing overcome the exponential slowing associated with large free energy barriers.
• Population annealing is comparably efficient to parallel tempering and has several features to recommend it:
  – Massively parallel
  – Direct measurement of free energies
  – Perhaps more efficient?