Hamiltonian Monte Carlo Without Detailed Balance
Jascha Sohl-Dickstein [email protected] Stanford University, Palo Alto. Khan Academy, Mountain View
Mayur Mudigonda [email protected] Redwood Center for Theoretical Neuroscience, University of California at Berkeley
Michael R. DeWeese [email protected] Redwood Center for Theoretical Neuroscience, University of California at Berkeley
Abstract

We present a method for performing Hamiltonian Monte Carlo that largely eliminates sample rejection. In situations that would normally lead to rejection, instead a longer trajectory is computed until a new state is reached that can be accepted. This is achieved using Markov chain transitions that satisfy the fixed point equation, but do not satisfy detailed balance. The resulting algorithm significantly suppresses the random walk behavior and wasted function evaluations that are typically the consequence of update rejection. We demonstrate a greater than factor of two improvement in mixing time on three test problems. We release the source code as Python and MATLAB packages.

1. Introduction

High dimensional and otherwise computationally expensive probabilistic models are of increasing importance for such diverse tasks as modeling the folding of proteins (Schütte & Fischer, 1999), the structure of natural images (Culpepper et al., 2011), or the activity of networks of neurons (Cadieu & Koepsell, 2010).

Sampling from the described distribution is typically the bottleneck when working with these probabilistic models. Sampling is commonly required when training a probabilistic model, when evaluating the model's performance, when performing inference, and when taking expectations (MacKay, 2003). Therefore, work that improves sampling is fundamentally important.

The most common way to guarantee that a sampling algorithm converges to the correct distribution is via a concept known as detailed balance. Sampling algorithms based on detailed balance are powerful because they allow samples from any target distribution to be generated from almost any proposal distribution, using for instance Metropolis-Hastings acceptance criteria (Hastings, 1970). However, detailed balance also suffers from a critical flaw. Precisely because the forward and reverse transitions occur with equal probability, detailed balance driven samplers go backwards exactly as often as they go forwards. The state space is thus explored via a random walk over distances longer than those traversed by a single draw from the proposal distribution. A random walk only travels a distance d N^(1/2) in N steps, where d is the characteristic step length.

The current state-of-the-art sampling algorithm for probability distributions with continuous state spaces is Hamiltonian Monte Carlo (HMC) (Duane et al., 1987; Neal, 2010). By extending the state space to include auxiliary momentum variables, and then using Hamiltonian dynamics to traverse long iso-probability contours in this extended state space, HMC is able to move long distances in state space in a single update step. However, HMC still relies on detailed balance to accept or reject steps, and as a result still behaves like a random walk, just a random walk with a longer step length. Previous attempts to address this have combined multiple Markov steps that individually satisfy detailed balance into a composite step that does not (Horowitz, 1991), with limited success (Kennedy & Pendleton, 1991).

The No-U-Turn Sampler (NUTS) sampling package (Hoffman & Gelman, 2011) and the windowed acceptance method of (Neal, 1994) both consider Markov transitions within a set of discrete states generated by repeatedly simulating Hamiltonian dynamics. NUTS generates a set of candidate states around the starting state by running Hamiltonian dynamics forwards and backwards until the trajectory doubles back on itself, or a slice variable constraint is violated.

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).
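The d N^(1/2) scaling of a random walk is easy to check numerically. The following sketch (ours, for illustration only; step length and walker counts are arbitrary choices) measures the root-mean-square displacement of many independent walks and confirms that quadrupling the number of steps only doubles the distance covered:

```python
import numpy as np

# Illustrative check of random-walk scaling: after N steps of characteristic
# length d, the RMS distance traveled grows as d * N**0.5, so quadrupling the
# number of steps should only double the distance covered.
rng = np.random.default_rng(0)
d = 1.0  # characteristic step length (arbitrary choice)

def rms_distance(n_steps, n_walkers=20000):
    # Sum n_steps of +/- d for each walker; return the RMS displacement.
    steps = rng.choice([-d, d], size=(n_walkers, n_steps))
    return np.sqrt(np.mean(steps.sum(axis=1) ** 2))

ratio = rms_distance(400) / rms_distance(100)
print(round(ratio, 1))  # close to 2, not 4
```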
It then chooses a new state uniformly at random from the candidate states. In windowed acceptance, a transition is proposed between a window of states at the beginning and end of a trajectory, rather than between the first state and last state. Within the selected window, a single state is then chosen using Boltzmann weightings. Both NUTS and the windowed acceptance method rely on detailed balance to choose the candidate state from the discrete set.

Here we present a novel discrete representation of the HMC state space and transitions. Using this representation, we derive a method for performing HMC while abandoning detailed balance altogether, by directly satisfying the fixed point equation restricted to the discrete state space. As a result, random walk behavior in the sampling algorithm is greatly reduced, and the mixing rate of the sampler is substantially improved.

2. Sampling

We begin by briefly reviewing some key concepts related to sampling. The goal of a sampling algorithm is to draw characteristic samples x ∈ R^N from a target probability distribution p(x). Without loss of generality, we will assume that p(x) is determined by an energy function E(x),

    p(x) = (1/Z) exp(-E(x)).    (1)

2.1. Markov Chain Monte Carlo

Markov Chain Monte Carlo (MCMC) (Neal, 1993) is commonly used to sample from probabilistic models. In MCMC a chain of samples is generated by repeatedly drawing new samples x' from a conditional probability distribution T(x'|x), where x is the previous sample. Since T(x'|x) is a probability density over x', ∫ T(x'|x) dx' = 1 and T(x'|x) ≥ 0.

2.2. Fixed Point Equation

An MCMC algorithm must satisfy two conditions in order to generate samples from the target distribution p(x). The first is mixing, which requires that repeated application of T(x'|x) must eventually explore the full state space of p(x). The second condition is that the target distribution p(x) must be a fixed point of T(x'|x). This second condition can be expressed by the fixed point equation,

    ∫ p(x) T(x'|x) dx = p(x'),    (2)

which requires that when T(x'|x) acts on p(x), the resulting distribution is unchanged.

2.3. Detailed Balance

Detailed balance is the most common way of guaranteeing that the Markov transition distribution T(x'|x) satisfies the fixed point equation (Equation 2). Detailed balance guarantees that if samples are drawn from the equilibrium distribution p(x), then for every pair of states x and x' the probability of transitioning from state x to state x' is identical to that of transitioning from state x' to x,

    p(x) T(x'|x) = p(x') T(x|x').    (3)

By substitution for T(x'|x) in the left side of Equation 2, it can be seen that if Equation 3 is satisfied, then the fixed point equation is also satisfied.

An appealing aspect of detailed balance is that a transition distribution satisfying it can be easily constructed from nearly any proposal distribution, using Metropolis-Hastings acceptance/rejection rules (Hastings, 1970). A primary drawback of detailed balance, and of Metropolis-Hastings, is that the resulting Markov chains always engage in random walk behavior, since by definition detailed balance depends on forward and reverse transitions happening with equal probability.

The primary advance in this paper is demonstrating how HMC sampling can be performed without resorting to detailed balance.

3. Hamiltonian Monte Carlo

Hamiltonian Monte Carlo (HMC) can traverse long distances in state space with single Markov transitions. It does this by extending the state space to include auxiliary momentum variables, and then simulating Hamiltonian dynamics to move long distances along iso-probability contours in the expanded state space.

3.1. Extended state space

The state space is extended by the addition of momentum variables v ∈ R^N, with identity-covariance Gaussian distribution,

    p(v) = (2π)^(-N/2) exp(-(1/2) v^T v).    (4)

We refer to the combined state space of x and v as ζ, such that ζ = {x, v}. The corresponding joint distribution is

    p(ζ) = p(x, v) = p(x) p(v) = ((2π)^(-N/2) / Z) exp(-H(ζ)),    (5)

    H(ζ) = H(x, v) = E(x) + (1/2) v^T v.    (6)

H(ζ) has the same form as the total energy in a physical system, where E(x) is the potential energy for position x and (1/2) v^T v is the kinetic energy for momentum v (mass is set to one).
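The Metropolis-Hastings construction reviewed in Section 2.3 can be checked concretely. The sketch below (ours, not part of the paper) builds the transition matrix for an arbitrary three-state target with a symmetric proposal, and verifies Equations 2 and 3 exactly:

```python
import numpy as np

# Metropolis-Hastings on a 3-state target: T[x2, x] is the probability of
# moving from state x to state x2. With a symmetric proposal Q, the
# acceptance rule min(1, p(x2)/p(x)) makes T satisfy detailed balance.
p = np.array([0.2, 0.3, 0.5])       # arbitrary target distribution
n = len(p)
Q = np.full((n, n), 1.0 / n)        # symmetric uniform proposal

T = np.zeros((n, n))
for x in range(n):
    for x2 in range(n):
        if x2 != x:
            T[x2, x] = Q[x2, x] * min(1.0, p[x2] / p[x])
    T[x, x] = 1.0 - T[:, x].sum()   # rejected proposals stay at x

balance = p[None, :] * T            # balance[x2, x] = p(x) T(x2|x)
assert np.allclose(balance, balance.T)   # detailed balance, Equation 3
assert np.allclose(T @ p, p)             # fixed point equation, Equation 2
```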
In HMC samples from p(x) are generated by drawing samples from the joint distribution p(x, v), and using only the x variables as samples from the desired distribution.

3.2. Hamiltonian dynamics

Hamiltonian dynamics govern how physical systems evolve with time. It might be useful to imagine the trajectory of a skateboarder rolling in an empty swimming pool. As she rolls downwards she exchanges potential energy for kinetic energy, and the magnitude of her velocity increases. As she rolls up again she exchanges kinetic energy back for potential energy. In this fashion she is able to traverse long distances across the swimming pool, while at the same time maintaining constant total energy over her entire trajectory.

In HMC, we treat H(ζ) as the total energy of a physical system, with spatial coordinate x, velocity v, potential energy E(x), and kinetic energy (1/2) v^T v. In an identical fashion to the case of the skateboarder in the swimming pool, running Hamiltonian dynamics on this system traverses long distances in x while maintaining constant total energy H(ζ). By Equation 5, moving along a trajectory with constant energy is identical to moving along a trajectory with constant probability.

Hamiltonian dynamics can be run exactly in reverse by reversing the velocity vector. They also preserve volume in ζ. As we will see, all these properties together mean that Hamiltonian dynamics can be used to propose update steps that move long distances in state space while retaining high acceptance probability.

Figure 1. (a) The action of operators involved in Hamiltonian Monte Carlo (HMC). The base of each red or green arrow represents the position x, and the length and direction of each of these arrows represents the momentum v. The flip operator F reverses the momentum. The leapfrog operator L approximately integrates Hamiltonian dynamics. The trajectory taken by L is indicated by the dotted line. The randomization operator R(β) corrupts the momentum with an amount of noise that depends on β. (b) The ladder of discrete states that are accessible by applying F and L starting at state ζ. Horizontal movement on the ladder occurs by flipping the momentum, whereas vertical movement occurs by integrating Hamiltonian dynamics.

3.3. Operators

The Markov transitions from which HMC is constructed can be understood in terms of several operators acting on ζ. These operators are illustrated in Figure 1a. This representation of the actions performed in HMC, and the corresponding state space, is unique to this paper and diverges from the typical presentation of HMC.

3.3.1. MOMENTUM FLIP

The momentum flip operator F reverses the direction of the momentum. It is its own inverse, leaves the total energy unchanged, and preserves volume in state space:

    Fζ = F{x, v} = {x, -v},    (7)
    F^(-1) ζ = Fζ,    (8)
    H(Fζ) = H(ζ),    (9)
    det(∂Fζ / ∂ζ^T) = 1.    (10)

The momentum flip operator F causes movement between the left and right sides of the state ladder in Figure 1b.

3.3.2. LEAPFROG INTEGRATOR

Leapfrog, or Störmer-Verlet, integration provides a discrete time approximation to Hamiltonian dynamics (Hairer et al., 2003). The operator L(ε, M) performs leapfrog integration for M leapfrog steps with step length ε. For conciseness, L(ε, M) will be written only as L,

    Lζ = (the state resulting from M steps of leapfrog integration of Hamiltonian dynamics with step length ε).    (11)

Like exact Hamiltonian dynamics, leapfrog dynamics are exactly reversible by reversing the velocity vector, and they also exactly preserve volume in state space. L can be inverted by reversing the sign of the momentum, tracing out the reverse trajectory, and then reversing the sign of the momentum again so that it points in the original direction;

    L^(-1) ζ = FLFζ,    (12)
    det(∂Lζ / ∂ζ^T) = 1.    (13)

Unlike for exact dynamics, the total energy H(ζ) is only approximately conserved by leapfrog integration, and the energy accumulates errors due to discretization. This discretization error in the energy is the source of all rejections of proposed updates in HMC.
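The reversibility property in Equation 12 is easy to verify numerically. The sketch below (ours) implements F and leapfrog integration for a quadratic energy; the matrix A, step length, and step count are illustrative choices, not values from the paper:

```python
import numpy as np

# Sketch of the flip and leapfrog operators for E(x) = x^T A x / 2.
# A, epsilon, and M are illustrative choices.
A = np.diag([1.0, 10.0])
grad_E = lambda x: A @ x

def F(z):                                 # momentum flip, Equation 7
    x, v = z
    return (x, -v)

def L(z, epsilon=0.1, M=20):              # leapfrog integration, Equation 11
    x, v = np.array(z[0]), np.array(z[1])
    for _ in range(M):
        v -= 0.5 * epsilon * grad_E(x)    # half step in momentum
        x += epsilon * v                  # full step in position
        v -= 0.5 * epsilon * grad_E(x)    # half step in momentum
    return (x, v)

z = (np.array([1.0, -0.5]), np.array([0.3, 0.7]))
z_back = F(L(F(L(z))))                    # apply L, then L^-1 = FLF (Equation 12)
assert np.allclose(z_back[0], z[0]) and np.allclose(z_back[1], z[1])
```

With these settings the energy H(ζ) drifts slightly along L, which, as noted above, is the source of all rejections in HMC.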
The leapfrog operator L causes movement up the right side of the state ladder in Figure 1b, and down the left side of the ladder.

3.3.3. MOMENTUM RANDOMIZATION

The momentum randomization operator R(β) mixes an amount of Gaussian noise determined by β ∈ [0, 1] into the velocity vector,

    R(β) ζ = R(β) {x, v} = {x, v'},    (14)
    v' = v √(1 - β) + n √β,    (15)
    n ~ N(0, I).    (16)

Unlike the previous two operators, the momentum randomization operator is not deterministic. R(β) is however a valid Markov transition operator for p(ζ) on its own, in that it satisfies both Equation 2 and Equation 3.

The momentum randomization operator R(β) causes movement off of the current state ladder and on to a new state ladder.

3.4. Discrete State Space

As illustrated in Figure 1b, the operators L and F generate a discrete state space ladder, with transitions only occurring between ζ and three other states. Note that every state on the ladder can be represented many different ways, depending on the series of operators used to reach it. For instance, the state in the upper left of the figure pane can be written L^(-1) Fζ = FLζ = LFLLζ = ···.

Standard HMC can be viewed in terms of transitions on this ladder. Additionally, we will see that this discrete state space view allows Equation 2 to be solved directly by replacing the integral over all states with a short sum.

3.5. Standard HMC

HMC as typically implemented consists of the following steps. Here, ζ^(t,s) represents the state at sampling step t, and sampling substep s. Each numbered item below corresponds to a valid Markov transition for p(ζ), satisfying detailed balance. A full sampling step consists of the composition of all three Markov transitions.

1. (a) Generate a proposed update,

        ζ' = FLζ^(t,0).    (17)

   On the state ladder in Figure 1b, this corresponds to moving up one rung (L), and then moving from the right to the left side (F).

   (b) Accept or reject the proposed update using Metropolis-Hastings rules,

        π_accept = min[1, p(ζ') / p(ζ^(t,0))],    (18)

        ζ^(t,1) = ζ' with probability π_accept,
                  ζ^(t,0) with probability 1 - π_accept.    (19)

   Note that since the transition FL is its own inverse, the forward and reverse proposal distribution probabilities cancel in the Metropolis-Hastings rule in Equation 18. On rejection, the computations performed in Equation 17 are discarded. In our new technique, this will no longer be true.

2. Flip the momentum,

        ζ^(t,2) = Fζ^(t,1).    (20)

   If the proposed update from Step 1 was accepted, then this moves ζ^(t,1) from the left back to the right side of the state ladder in Figure 1b, and prevents the trajectory from doubling back on itself. If the update was rejected however, and ζ^(t,1) is already on the right side of the ladder, then this causes it to move to the left side of the ladder, and the trajectory to double back on itself.

   Doubling back on an already computed trajectory is wasteful in HMC, both because it involves recomputing nearly redundant trajectories, and because the distance traveled before the sampler doubles back is the characteristic length scale beyond which HMC explores the state space by diffusion.

3. Corrupt the momentum with noise,

        ζ^(t+1,0) = R(β) ζ^(t,2).    (21)

   It is common to set β = 1, in which case the momentum is fully randomized every sampling step. In our experiments (Section 5) however, we found that smaller values of β produced large improvements in mixing time. This is therefore a hyperparameter that is probably worth adjusting.¹

¹ One method for choosing β (Culpepper et al., 2011) which we have found to be effective is to set it such that it randomizes a fixed fraction α of the momentum per unit simulation time,

    β = α^(1/(εM)).    (22)

4. Look Ahead HMC

Here we introduce an HMC algorithm that relies on Markov transitions that do not obey detailed balance, but still satisfy the fixed point equation. This algorithm eliminates much of the momentum flipping that occurs on rejection in HMC, and as a result greatly reduces random walk behavior. It also prevents the trajectory computations that would typically be discarded on proposal rejection from being wasted. We call our algorithm Look Ahead Hamiltonian Monte Carlo (LAHMC).

4.1. Intuition

In LAHMC, in situations that would correspond to a rejection in Step 1 of Section 3.5, we will instead attempt to travel even farther by applying the leapfrog operator L additional times. This section provides intuition for how this update rule was discovered, and how it can be seen to connect to standard HMC. A more mathematically precise description will follow in the next several sections.

LAHMC can be understood in terms of a series of modifications of standard HMC. The net effect of Steps 1 and 2 in Section 3.5 is to transition from state ζ into either state Lζ or state Fζ, depending on whether the update in Section 3.5 Step 1 was accepted or rejected.
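The standard HMC update that LAHMC modifies, i.e. the composition of the three transitions in Section 3.5, can be sketched as follows for a one dimensional standard normal target (a minimal illustration of ours; ε, M, β, and the chain length are illustrative choices, not tuned values):

```python
import numpy as np

# Minimal sketch of standard HMC (Section 3.5) for E(x) = x**2 / 2.
rng = np.random.default_rng(1)
E = lambda x: 0.5 * x * x
grad_E = lambda x: x

def leapfrog(x, v, epsilon=0.2, M=10):       # the operator L
    for _ in range(M):
        v = v - 0.5 * epsilon * grad_E(x)
        x = x + epsilon * v
        v = v - 0.5 * epsilon * grad_E(x)
    return x, v

def hmc_step(x, v, beta=1.0):
    H0 = E(x) + 0.5 * v * v
    x_new, v_new = leapfrog(x, v)
    v_new = -v_new                           # Step 1(a): propose FL zeta
    H1 = E(x_new) + 0.5 * v_new * v_new
    if rng.random() < min(1.0, np.exp(H0 - H1)):
        x, v = x_new, v_new                  # Step 1(b): accept, else reject
    v = -v                                   # Step 2: flip the momentum
    noise = rng.standard_normal()            # Step 3: randomize with R(beta)
    v = v * np.sqrt(1.0 - beta) + noise * np.sqrt(beta)
    return x, v

x, v = 0.0, rng.standard_normal()
xs = []
for _ in range(20000):
    x, v = hmc_step(x, v)
    xs.append(x)
xs = np.asarray(xs)  # mean and variance should approach 0 and 1
```

On acceptance the net effect of Steps 1 and 2 is the state Lζ, and on rejection it is Fζ, exactly the two cases discussed above.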
We wish to minimize the transitions into state Fζ. In LAHMC we do this by replacing as many transitions from ζ to Fζ as possible with transitions that instead go from ζ to L²ζ. This would seem to change the number of transitions into both state Fζ and state L²ζ, violating the fixed point equation. However, the changes in incoming transitions from ζ are exactly counteracted because the state FL²ζ is similarly modified, so that it makes fewer transitions into the state L²ζ = F(FL²ζ), and more transitions into the state Fζ = L²(FL²ζ).

For some states, after this modification there will still be transitions between the states ζ and Fζ. In order to further minimize these transitions, the process in the preceding paragraph is repeated for these remaining transitions and the state L³ζ. This process is then repeated again for states L⁴ζ, L⁵ζ, etc., up to some maximum number of leapfrog applications K.

Figure 2. Autocorrelation vs. number of gradient evaluations for standard HMC (full momentum randomization, β = 1), LAHMC with β = 1, persistent HMC (β = 0.1), and persistent LAHMC (β = 0.1) for (a) a two dimensional ill-conditioned Gaussian, (b) a one hundred dimensional ill-conditioned Gaussian, and (c) a two dimensional well conditioned energy function with a "rough" surface. In all cases the LAHMC sampler demonstrates faster mixing.

4.2. Algorithm

LAHMC consists of the following two steps,

1. Transition to a new state by applying the leapfrog operator L between 1 and K ∈ Z⁺ times, or by applying the momentum flip operator F,

        ζ^(t,1) = Lζ^(t,0) with probability π_L1(ζ^(t,0)),
                  L²ζ^(t,0) with probability π_L2(ζ^(t,0)),
                  ···
                  L^K ζ^(t,0) with probability π_LK(ζ^(t,0)),
                  Fζ^(t,0) with probability π_F(ζ^(t,0)).    (23)
Distribution     Sampler            Fζ      Lζ      L²ζ     L³ζ     L⁴ζ
2d Gaussian      HMC β = 1          0.079   0.921   0       0       0
2d Gaussian      LAHMC β = 1        0.000   0.921   0.035   0.044   0.000
2d Gaussian      HMC β = 0.1        0.080   0.920   0       0       0
2d Gaussian      LAHMC β = 0.1      0.000   0.921   0.035   0.044   0.000
100d Gaussian    HMC β = 1          0.147   0.853   0       0       0
100d Gaussian    LAHMC β = 1        0.047   0.852   0.059   0.035   0.006
100d Gaussian    HMC β = 0.1        0.147   0.853   0       0       0
100d Gaussian    LAHMC β = 0.1      0.047   0.852   0.059   0.035   0.006
2d Rough Well    HMC β = 1          0.446   0.554   0       0       0
2d Rough Well    LAHMC β = 1        0.292   0.554   0.099   0.036   0.019
2d Rough Well    HMC β = 0.1        0.446   0.554   0       0       0
2d Rough Well    LAHMC β = 0.1      0.292   0.554   0.100   0.036   0.019
Table 1. The fraction of transitions which occurred to each target state, for the conditions plotted in Figure 2. Note that LAHMC has far fewer momentum flips than standard HMC.
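For concreteness, the LAHMC update of Equations 23 and 24 can be sketched as follows for a one dimensional standard normal target (our illustration, not the released package). The leapfrog transition probabilities implement the cumulative min construction of Equation 25, with the complementary choice π_F(ζ) = 1 - Σ_a π_La(ζ) forced by normalization; ε, M, K, and β are illustrative values:

```python
import numpy as np

# Sketch of the LAHMC update (Equations 23-24) for a standard normal target,
# using the leapfrog transition probabilities of Equation 25 and
# pi_F = 1 - sum_a pi_La. epsilon, M, K, and beta are illustrative values.
rng = np.random.default_rng(2)
grad_E = lambda x: x
p_joint = lambda z: np.exp(-0.5 * z[0] ** 2 - 0.5 * z[1] ** 2)  # p(zeta), unnormalized

def F(z):                                  # momentum flip
    return (z[0], -z[1])

def L(z, epsilon=0.2, M=10):               # leapfrog operator
    x, v = z
    for _ in range(M):
        v = v - 0.5 * epsilon * grad_E(x)
        x = x + epsilon * v
        v = v - 0.5 * epsilon * grad_E(x)
    return (x, v)

def leap_probs(z, K):
    # Returns [pi_L1(z), ..., pi_LK(z)] via Equation 25:
    # pi_La(z) = min(1 - sum_{b<a} pi_Lb(z),
    #                p(F L^a z)/p(z) * (1 - sum_{b<a} pi_Lb(F L^a z)))
    za, pis = z, []
    for a in range(1, K + 1):
        za = L(za)                         # za = L^a z
        rev = F(za)                        # the state that transitions back to Fz
        rev_pis = leap_probs(rev, a - 1)
        pis.append(min(1.0 - sum(pis),
                       p_joint(rev) / p_joint(z) * (1.0 - sum(rev_pis))))
    return pis

def lahmc_step(z, K=4, beta=0.1):
    pis = leap_probs(z, K)
    probs = pis + [1.0 - sum(pis)]         # Equation 23; last entry is pi_F
    states, za = [], z
    for _ in range(K):
        za = L(za)
        states.append(za)
    states.append(F(z))
    z = states[rng.choice(len(states), p=probs)]
    x, v = z                               # Equation 24: momentum corruption R(beta)
    v = v * np.sqrt(1.0 - beta) + rng.standard_normal() * np.sqrt(beta)
    return (x, v)
```

Because FL^a is its own inverse, the product p(ζ) π_La(ζ) is symmetric under the exchange ζ ↔ FL^a ζ, which is the property that lets these transitions satisfy the fixed point equation without detailed balance.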
   Note that there is no longer a Metropolis-Hastings accept/reject step. The state update in Equation 23 is a valid Markov transition for p(ζ) on its own.

2. Corrupt the momentum with noise in an identical fashion as in Equation 21,

        ζ^(t+1,0) = R(β) ζ^(t,1).    (24)

4.3. Transition Probabilities

We choose the probabilities π_La(ζ) for the leapfrog transitions from state ζ to state L^a ζ to be

    π_La(ζ) = min[ 1 - Σ_{b<a} π_Lb(ζ),
                   (p(FL^a ζ) / p(ζ)) (1 - Σ_{b<a} π_Lb(FL^a ζ)) ].    (25)

4.4. Fixed Point Equation

We can substitute the transition rates from Section 4.3 into the left side of Equation 2, and verify that they satisfy the fixed point equation. Note that the integral over all states is transformed into a sum over all source states from which transitions into state ζ might be initiated,

    ∫ dζ' p(ζ') T(ζ|ζ')
        = ∫ dζ' p(ζ') [ Σ_a π_La(ζ') δ(ζ - L^a ζ') + π_F(ζ') δ(ζ - Fζ') ],    (28)