Hamiltonian Monte Carlo Without Detailed Balance

Jascha Sohl-Dickstein [email protected] Stanford University, Palo Alto. Khan Academy, Mountain View

Mayur Mudigonda [email protected] Redwood Center for Theoretical Neuroscience, University of California at Berkeley

Michael R. DeWeese [email protected] Redwood Center for Theoretical Neuroscience, University of California at Berkeley

Abstract

We present a method for performing Hamiltonian Monte Carlo that largely eliminates sample rejection. In situations that would normally lead to rejection, instead a longer trajectory is computed until a new state is reached that can be accepted. This is achieved using Markov chain transitions that satisfy the fixed point equation, but do not satisfy detailed balance. The resulting algorithm significantly suppresses the random walk behavior and wasted function evaluations that are typically the consequence of update rejection. We demonstrate a greater than factor of two improvement in mixing time on three test problems. We release the source code as Python and MATLAB packages.

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

1. Introduction

High dimensional and otherwise computationally expensive probabilistic models are of increasing importance for such diverse tasks as modeling the folding of proteins (Schütte & Fischer, 1999), the structure of natural images (Culpepper et al., 2011), or the activity of networks of neurons (Cadieu & Koepsell, 2010).

Sampling from the described distribution is typically the bottleneck when working with these probabilistic models. Sampling is commonly required when training a probabilistic model, when evaluating the model's performance, when performing inference, and when taking expectations (MacKay, 2003). Therefore, work that improves sampling is fundamentally important.

The most common way to guarantee that a sampling algorithm converges to the correct distribution is via a concept known as detailed balance. Sampling algorithms based on detailed balance are powerful because they allow samples from any target distribution to be generated from almost any proposal distribution, using for instance Metropolis-Hastings acceptance criteria (Hastings, 1970). However, detailed balance also suffers from a critical flaw. Precisely because the forward and reverse transitions occur with equal probability, detailed balance driven samplers go backwards exactly as often as they go forwards. The state space is thus explored via a random walk over distances longer than those traversed by a single draw from the proposal distribution. A random walk only travels a distance d N^(1/2) in N steps, where d is the characteristic step length.

The current state-of-the-art sampling algorithm for probability distributions with continuous state spaces is Hamiltonian Monte Carlo (HMC) (Duane et al., 1987; Neal, 2010). By extending the state space to include auxiliary momentum variables, and then using Hamiltonian dynamics to traverse long iso-probability contours in this extended state space, HMC is able to move long distances in state space in a single update step. However, HMC still relies on detailed balance to accept or reject steps, and as a result still behaves like a random walk – just a random walk with a longer step length. Previous attempts to address this have combined multiple Markov steps that individually satisfy detailed balance into a composite step that does not (Horowitz, 1991), with limited success (Kennedy & Pendleton, 1991).

The No-U-Turn Sampler (NUTS) sampling package (Hoffman & Gelman, 2011) and the windowed acceptance method of (Neal, 1994) both consider Markov transitions within a set of discrete states generated by repeatedly simulating Hamiltonian dynamics. NUTS generates a set of candidate states around the starting state by running Hamiltonian dynamics forwards and backwards until the trajectory doubles back on itself, or a slice variable constraint is violated.
It then chooses a new state at uniform from the candidate states. In windowed acceptance, a transition is proposed between a window of states at the beginning and end of a trajectory, rather than the first state and last state. Within the selected window, a single state is then chosen using Boltzmann weightings. Both NUTS and the windowed acceptance method rely on detailed balance to choose the candidate state from the discrete set.

Here we present a novel discrete representation of the HMC state space and transitions. Using this representation, we derive a method for performing HMC while abandoning detailed balance altogether, by directly satisfying the fixed point equation restricted to the discrete state space. As a result, random walk behavior in the sampling algorithm is greatly reduced, and the mixing rate of the sampler is substantially improved.

2. Sampling

We begin by briefly reviewing some key concepts related to sampling. The goal of a sampling algorithm is to draw characteristic samples x ∈ R^N from a target probability distribution p(x). Without loss of generality, we will assume that p(x) is determined by an energy function E(x),

    p(x) = (1/Z) exp(−E(x)).    (1)

2.1. Markov Chain Monte Carlo

Markov Chain Monte Carlo (MCMC) (Neal, 1993) is commonly used to sample from probabilistic models. In MCMC a chain of samples is generated by repeatedly drawing new samples x′ from a conditional probability distribution T(x′|x), where x is the previous sample. Since T(x′|x) is a probability density over x′, ∫ T(x′|x) dx′ = 1 and T(x′|x) ≥ 0.

2.2. Fixed Point Equation

An MCMC algorithm must satisfy two conditions in order to generate samples from the target distribution p(x). The first is mixing, which requires that repeated application of T(x′|x) must eventually explore the full state space of p(x). The second condition is that the target distribution p(x) must be a fixed point of T(x′|x). This second condition can be expressed by the fixed point equation,

    ∫ p(x) T(x′|x) dx = p(x′),    (2)

which requires that when T(x′|x) acts on p(x), the resulting distribution is unchanged.

2.3. Detailed Balance

Detailed balance is the most common way of guaranteeing that the Markov transition distribution T(x′|x) satisfies the fixed point equation (Equation 2). Detailed balance guarantees that if samples are drawn from the equilibrium distribution p(x), then for every pair of states x and x′ the probability of transitioning from state x to state x′ is identical to that of transitioning from state x′ to x,

    p(x) T(x′|x) = p(x′) T(x|x′).    (3)

By substituting Equation 3 for T(x′|x) in the left side of Equation 2, it can be seen that if Equation 3 is satisfied, then the fixed point equation is also satisfied.

An appealing aspect of detailed balance is that a transition distribution satisfying it can be easily constructed from nearly any proposal distribution, using Metropolis-Hastings acceptance/rejection rules (Hastings, 1970). A primary drawback of detailed balance, and of Metropolis-Hastings, is that the resulting Markov chains always engage in random walk behavior, since by definition detailed balance depends on forward and reverse transitions happening with equal probability.

The primary advance in this paper is demonstrating how HMC sampling can be performed without resorting to detailed balance.

3. Hamiltonian Monte Carlo

Hamiltonian Monte Carlo (HMC) can traverse long distances in state space with single Markov transitions. It does this by extending the state space to include auxiliary momentum variables, and then simulating Hamiltonian dynamics to move long distances along iso-probability contours in the expanded state space.

3.1. Extended state space

The state space is extended by the addition of momentum variables v ∈ R^N, with identity-covariance Gaussian distribution,

    p(v) = (2π)^(−N/2) exp(−(1/2) v^T v).    (4)

We refer to the combined state space of x and v as ζ, such that ζ = {x, v}. The corresponding joint distribution is

    p(ζ) = p(x, v) = p(x) p(v) = ((2π)^(−N/2) / Z) exp(−H(ζ)),    (5)
    H(ζ) = H(x, v) = E(x) + (1/2) v^T v.    (6)

H(ζ) has the same form as the total energy in a physical system, where E(x) is the potential energy for position x and (1/2) v^T v is the kinetic energy for momentum v (mass is set to one).
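The implication that detailed balance yields a fixed point can be checked concretely. For a finite state space the integrals in Equations 2 and 3 become sums, and a Metropolis-Hastings transition matrix built from any symmetric proposal satisfies both. A minimal sketch, where the five-state chain and the uniform proposal are illustrative assumptions rather than anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5                                  # hypothetical 5-state chain
E = rng.standard_normal(n)             # arbitrary energies
p = np.exp(-E) / np.exp(-E).sum()      # target distribution, Equation 1

# Metropolis-Hastings transitions from a uniform symmetric proposal.
T = np.zeros((n, n))                   # T[j, i] = T(x_j | x_i)
for i in range(n):
    for j in range(n):
        if j != i:
            T[j, i] = (1.0 / n) * min(1.0, p[j] / p[i])
    T[i, i] = 1.0 - T[:, i].sum()      # rejected mass stays at state i

# Detailed balance (Equation 3): p(x) T(x'|x) = p(x') T(x|x').
assert np.allclose(T * p[None, :], (T * p[None, :]).T)

# Fixed point equation (Equation 2): sum_x p(x) T(x'|x) = p(x').
assert np.allclose(T @ p, p)
```

The converse does not hold: the transitions constructed in Section 4 satisfy the second assertion while violating the first.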

In HMC samples from p(x) are generated by drawing samples from the joint distribution p(x, v), and using only the x variables as samples from the desired distribution.

3.2. Hamiltonian dynamics

Hamiltonian dynamics govern how physical systems evolve with time. It might be useful to imagine the trajectory of a skateboarder rolling in an empty swimming pool. As she rolls downwards she exchanges potential energy for kinetic energy, and the magnitude of her velocity increases. As she rolls up again she exchanges kinetic energy back for potential energy. In this fashion she is able to traverse long distances across the swimming pool, while at the same time maintaining constant total energy over her entire trajectory.

In HMC, we treat H(ζ) as the total energy of a physical system, with spatial coordinate x, velocity v, potential energy E(x), and kinetic energy (1/2) v^T v. In an identical fashion to the case of the skateboarder in the swimming pool, running Hamiltonian dynamics on this system traverses long distances in x while maintaining constant total energy H(ζ). By Equation 5, moving along a trajectory with constant energy is identical to moving along a trajectory with constant probability.

Hamiltonian dynamics can be run exactly in reverse by reversing the velocity vector. They also preserve volume in ζ. As we will see, all these properties together mean that Hamiltonian dynamics can be used to propose update steps that move long distances in state space while retaining high acceptance probability.

Figure 1. (a) The action of operators involved in Hamiltonian Monte Carlo (HMC). The base of each red or green arrow represents the position x, and the length and direction of each of these arrows represents the momentum v. The flip operator F reverses the momentum. The leapfrog operator L approximately integrates Hamiltonian dynamics. The trajectory taken by L is indicated by the dotted line. The randomization operator R(β) corrupts the momentum with an amount of noise that depends on β. (b) The ladder of discrete states that are accessible by applying F and L starting at state ζ. Horizontal movement on the ladder occurs by flipping the momentum, whereas vertical movement occurs by integrating Hamiltonian dynamics.

3.3. Operators

The Markov transitions from which HMC is constructed can be understood in terms of several operators acting on ζ. These operators are illustrated in Figure 1a. This representation of the actions performed in HMC, and the corresponding state space, is unique to this paper and diverges from the typical presentation of HMC.

3.3.1. MOMENTUM FLIP

The momentum flip operator F reverses the direction of the momentum. It is its own inverse, leaves the total energy unchanged, and preserves volume in state space:

    F ζ = F {x, v} = {x, −v},    (7)
    F^(−1) ζ = F ζ,    (8)
    H(F ζ) = H(ζ),    (9)
    det(∂F ζ / ∂ζ^T) = 1.    (10)

The momentum flip operator F causes movement between the left and right sides of the state ladder in Figure 1b.

3.3.2. LEAPFROG INTEGRATOR

Leapfrog, or Störmer-Verlet, integration provides a discrete time approximation to Hamiltonian dynamics (Hairer et al., 2003). The operator L(ε, M) performs leapfrog integration for M leapfrog steps with step length ε. For conciseness, L(ε, M) will be written only as L,

    L ζ = (the state resulting from M steps of leapfrog integration of Hamiltonian dynamics with step length ε).    (11)

Like exact Hamiltonian dynamics, leapfrog dynamics are exactly reversible by reversing the velocity vector, and they also exactly preserve volume in state space. L can be inverted by reversing the sign of the momentum, tracing out the reverse trajectory, and then reversing the sign of the momentum again so that it points in the original direction;

    L^(−1) ζ = F L F ζ,    (12)
    det(∂L ζ / ∂ζ^T) = 1.    (13)

Unlike for exact dynamics, the total energy H(ζ) is only approximately conserved by leapfrog integration, and the energy accumulates errors due to discretization. This discretization error in the energy is the source of all rejections of proposed updates in HMC.
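A minimal sketch of the leapfrog operator L(ε, M), assuming unit mass and a caller-supplied gradient of the energy; this is an illustration of Equation 11, not the released LAHMC package:

```python
import numpy as np

def leapfrog(x, v, grad_E, epsilon, M):
    """The operator L(epsilon, M): M leapfrog steps of length epsilon.

    With unit mass (Section 3.1), dx/dt = v and dv/dt = -grad_E(x).
    """
    x, v = np.array(x, dtype=float), np.array(v, dtype=float)
    v -= 0.5 * epsilon * grad_E(x)                 # initial half step in v
    for m in range(M):
        x += epsilon * v                           # full step in x
        # Full momentum steps between position steps, half step at the end.
        v -= (1.0 if m < M - 1 else 0.5) * epsilon * grad_E(x)
    return x, v
```

Reversibility (Equation 12) can be checked by flipping the momentum, integrating again, and flipping back: the starting state is recovered to machine precision, while the total energy is only approximately conserved.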
The leapfrog operator L causes movement up the right side of the state ladder in Figure 1b, and down the left side of the ladder.

3.3.3. MOMENTUM RANDOMIZATION

The momentum randomization operator R(β) mixes an amount of Gaussian noise determined by β ∈ [0, 1] into the velocity vector,

    R(β) ζ = R(β) {x, v} = {x, v′},    (14)
    v′ = v sqrt(1 − β) + n sqrt(β),    (15)
    n ∼ N(0, I).    (16)

Unlike the previous two operators, the momentum randomization operator is not deterministic. R(β) is however a valid Markov transition operator for p(ζ) on its own, in that it satisfies both Equation 2 and Equation 3.

The momentum randomization operator R(β) causes movement off of the current state ladder and on to a new state ladder.

3.4. Discrete State Space

As illustrated in Figure 1b, the operators L and F generate a discrete state space ladder, with transitions only occurring between ζ and three other states. Note that every state on the ladder can be represented many different ways, depending on the series of operators used to reach it. For instance, the state in the upper left of the figure pane can be written L^(−1) F ζ = F L ζ = L F L L ζ = ···.

Standard HMC can be viewed in terms of transitions on this ladder. Additionally, we will see that this discrete state space view allows Equation 2 to be solved directly by replacing the integral over all states with a short sum.

3.5. Standard HMC

HMC as typically implemented consists of the following steps. Here, ζ^(t,s) represents the state at sampling step t and sampling substep s. Each numbered item below corresponds to a valid Markov transition for p(ζ), satisfying detailed balance. A full sampling step consists of the composition of all three Markov transitions.

1. (a) Generate a proposed update,

        ζ′ = F L ζ^(t,0).    (17)

   On the state ladder in Figure 1b, this corresponds to moving up one rung (L), and then moving from the right to the left side (F).

   (b) Accept or reject the proposed update using Metropolis-Hastings rules,

        π_accept = min(1, p(ζ′) / p(ζ^(t,0))),    (18)

        ζ^(t,1) = ζ′       with probability π_accept,
                  ζ^(t,0)  with probability 1 − π_accept.    (19)

   Note that since the transition F L is its own inverse, the forward and reverse proposal distribution probabilities cancel in the Metropolis-Hastings rule in Equation 18. On rejection, the computations performed in Equation 17 are discarded. In our new technique, this will no longer be true.

2. Flip the momentum,

        ζ^(t,2) = F ζ^(t,1).    (20)

   If the proposed update from Step 1 was accepted, then this moves ζ^(t,1) from the left back to the right side of the state ladder in Figure 1b, and prevents the trajectory from doubling back on itself. If the update was rejected, however, and ζ^(t,1) is already on the right side of the ladder, then this causes it to move to the left side of the ladder, and the trajectory to double back on itself.

   Doubling back on an already computed trajectory is wasteful in HMC, both because it involves recomputing nearly redundant trajectories, and because the distance traveled before the sampler doubles back is the characteristic length scale beyond which HMC explores the state space by diffusion.

3. Corrupt the momentum with noise,

        ζ^(t+1,0) = R(β) ζ^(t,2).    (21)

   It is common to set β = 1, in which case the momentum is fully randomized every sampling step. In our experiments (Section 5), however, we found that smaller values of β produced large improvements in mixing time. This is therefore a hyperparameter that is probably worth adjusting.¹

¹ One method for choosing β (Culpepper et al., 2011) which we have found to be effective is to set it such that it randomizes a fixed fraction α of the momentum per unit simulation time,

    β = α^(1/(εM)).    (22)

4. Look Ahead HMC

Here we introduce an HMC algorithm that relies on Markov transitions that do not obey detailed balance, but still satisfy the fixed point equation. This algorithm eliminates much of the momentum flipping that occurs on rejection in HMC, and as a result greatly reduces random walk behavior. It also prevents the trajectory computations that would typically be discarded on proposal rejection from being wasted. We call our algorithm Look Ahead Hamiltonian Monte Carlo (LAHMC).

Figure 2. Autocorrelation vs. number of function evaluations for standard HMC (no momentum randomization, β = 1), LAHMC with β = 1, persistent HMC (β = 0.1), and persistent LAHMC (β = 0.1) for (a) a two dimensional ill-conditioned Gaussian, (b) a one hundred dimensional ill-conditioned Gaussian, and (c) a two dimensional well conditioned energy function with a "rough" surface. In all cases the LAHMC sampler demonstrates faster mixing.

4.1. Intuition

In LAHMC, in situations that would correspond to a rejection in Step 1 of Section 3.5, we will instead attempt to travel even farther by applying the leapfrog operator L additional times. This section provides intuition for how this update rule was discovered, and how it can be seen to connect to standard HMC. A more mathematically precise description will follow in the next several sections.

LAHMC can be understood in terms of a series of modifications of standard HMC. The net effect of Steps 1 and 2 in Section 3.5 is to transition from state ζ into either state Lζ or state Fζ, depending on whether the update in Section 3.5 Step 1 was accepted or rejected.

We wish to minimize the transitions into state Fζ. In LAHMC we do this by replacing as many transitions from ζ to Fζ as possible with transitions that instead go from ζ to L^2 ζ. This would seem to change the number of transitions into both state Fζ and state L^2 ζ, violating the fixed point equation. However, the changes in incoming transitions from ζ are exactly counteracted because the state F L^2 ζ is similarly modified, so that it makes fewer transitions into the state L^2 ζ = F(F L^2 ζ), and more transitions into the state Fζ = L^2 (F L^2 ζ).

For some states, after this modification there will still be transitions between the states ζ and Fζ. In order to further minimize these transitions, the process in the preceding paragraph is repeated for these remaining transitions and the state L^3 ζ. This process is then repeated again for the states L^4 ζ, L^5 ζ, etc., up to some maximum number of leapfrog applications K.

4.2. Algorithm

LAHMC consists of the following two steps.

1. Transition to a new state by applying the leapfrog operator L between 1 and K ∈ Z^+ times, or by applying the momentum flip operator F,

        ζ^(t,1) = L ζ^(t,0)    with probability π_{L^1}(ζ^(t,0)),
                  L^2 ζ^(t,0)  with probability π_{L^2}(ζ^(t,0)),
                  ···
                  L^K ζ^(t,0)  with probability π_{L^K}(ζ^(t,0)),
                  F ζ^(t,0)    with probability π_F(ζ^(t,0)).    (23)

Distribution     Sampler          Fζ      Lζ      L^2ζ    L^3ζ    L^4ζ
2d Gaussian      HMC β = 1        0.079   0.921   0       0       0
2d Gaussian      LAHMC β = 1      0.000   0.921   0.035   0.044   0.000
2d Gaussian      HMC β = 0.1      0.080   0.920   0       0       0
2d Gaussian      LAHMC β = 0.1    0.000   0.921   0.035   0.044   0.000
100d Gaussian    HMC β = 1        0.147   0.853   0       0       0
100d Gaussian    LAHMC β = 1      0.047   0.852   0.059   0.035   0.006
100d Gaussian    HMC β = 0.1      0.147   0.853   0       0       0
100d Gaussian    LAHMC β = 0.1    0.047   0.852   0.059   0.035   0.006
2d Rough Well    HMC β = 1        0.446   0.554   0       0       0
2d Rough Well    LAHMC β = 1      0.292   0.554   0.099   0.036   0.019
2d Rough Well    HMC β = 0.1      0.446   0.554   0       0       0
2d Rough Well    LAHMC β = 0.1    0.292   0.554   0.100   0.036   0.019

Table 1. The fraction of transitions which occurred to each target state for the conditions plotted in Figure 2. Note that LAHMC has far fewer momentum flips than standard HMC.

Note that there is no longer a Metropolis-Hastings accept/reject step. The state update in Equation 23 is a valid Markov transition for p(ζ) on its own.

2. Corrupt the momentum with noise in an identical fashion as in Equation 21,

        ζ^(t+1,0) = R(β) ζ^(t,1).    (24)

4.3. Transition Probabilities

We choose the probabilities π_{L^a}(ζ) for the leapfrog transitions from state ζ to state L^a ζ to be

    π_{L^a}(ζ) = min[ 1 − Σ_{b<a} π_{L^b}(ζ),
                      (p(F L^a ζ) / p(ζ)) (1 − Σ_{b<a} π_{L^b}(F L^a ζ)) ],    (25)

with the momentum flip probability set to the remaining probability mass, π_F(ζ) = 1 − Σ_{a=1..K} π_{L^a}(ζ).

4.4. Fixed Point Equation

We can substitute the transition rates from Section 4.3 into the left side of Equation 2, and verify that they satisfy the fixed point equation. Note that the integral over all states is transformed into a sum over all source states from which transitions into state ζ might be initiated,

    ∫ dζ′ p(ζ′) T(ζ|ζ′) = ∫ dζ′ p(ζ′) [ Σ_a π_{L^a}(ζ′) δ(ζ − L^a ζ′) + π_F(ζ′) δ(ζ − F ζ′) ].    (28)
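Putting Sections 4.2 and 4.3 together, one LAHMC sampling step can be sketched as follows. This is an illustrative reimplementation under the paper's definitions, not the released package. It uses the fact that p(F L^a ζ) = p(L^a ζ), since H is even in the momentum, so only the K + 1 energies along the forward ladder are needed; with K = 1 it reduces to standard HMC.

```python
import numpy as np

def leapfrog(x, v, grad_E, epsilon, M):
    # The operator L(epsilon, M) of Section 3.3.2, with unit mass.
    x, v = np.array(x, dtype=float), np.array(v, dtype=float)
    v -= 0.5 * epsilon * grad_E(x)
    for m in range(M):
        x += epsilon * v
        v -= (1.0 if m < M - 1 else 0.5) * epsilon * grad_E(x)
    return x, v

def leap_probs(Hs):
    # pi_{L^a}(zeta) for a = 1 .. len(Hs)-1, per Equation 25.  Hs[a] is
    # the energy of L^a zeta; Hs[a::-1][:a] is the energy ladder of the
    # flipped endpoint F L^a zeta, truncated to the a-1 leaps that matter.
    probs = []
    for a in range(1, len(Hs)):
        cum_rev = sum(leap_probs(Hs[a::-1][:a]))
        probs.append(min(1.0 - sum(probs),
                         np.exp(Hs[0] - Hs[a]) * (1.0 - cum_rev)))
    return probs

def lahmc_step(x, v, E, grad_E, epsilon, M, K, beta, rng):
    # One LAHMC sampling step: Equation 23 followed by Equation 24.
    states = [(x, v)]
    for _ in range(K):
        states.append(leapfrog(*states[-1], grad_E, epsilon, M))
    Hs = [E(xs) + 0.5 * vs @ vs for xs, vs in states]

    u, cum = rng.uniform(), 0.0
    for a, pa in enumerate(leap_probs(Hs), start=1):
        cum += pa
        if u < cum:
            x, v = states[a]           # leap to L^a zeta
            break
    else:
        v = -v                         # otherwise flip: F zeta
    # Corrupt the momentum with R(beta), Equation 24.
    n = rng.standard_normal(np.shape(v))
    return x, v * np.sqrt(1.0 - beta) + n * np.sqrt(beta)
```

Note that all K trajectory segments computed for the look-ahead remain candidates for the next state, so rejection no longer discards work.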

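The verification in Equation 28 can also be carried out numerically: the probability flowing into a state ζ, from each state L^(−a) ζ via an a-fold leap and from F ζ via a momentum flip, must sum to p(ζ). This rests on the pairwise balance p(ζ) π_{L^a}(ζ) = p(F L^a ζ) π_{L^a}(F L^a ζ), which the min construction in Equation 25 enforces. The sketch below checks this for arbitrary, purely hypothetical ladder energies:

```python
import numpy as np

def leap_probs(Hs):
    # pi_{L^a}(zeta) for a = 1 .. len(Hs)-1, per Equation 25.  Hs[a] is
    # the energy of L^a zeta; Hs[a::-1][:a] is the energy ladder of the
    # flipped endpoint F L^a zeta, truncated to the a-1 leaps needed.
    probs = []
    for a in range(1, len(Hs)):
        cum_rev = sum(leap_probs(Hs[a::-1][:a]))
        probs.append(min(1.0 - sum(probs),
                         np.exp(Hs[0] - Hs[a]) * (1.0 - cum_rev)))
    return probs

K = 4
rng = np.random.default_rng(2)
H = list(rng.standard_normal(2 * K + 1))   # energies of L^k zeta, k = -K..K
p = np.exp(-np.array(H))                   # unnormalized p(L^k zeta)

# Inflow from each L^{-a} zeta, which reaches zeta with its a-th leap.
inflow = 0.0
for a in range(1, K + 1):
    ladder = H[K - a : K - a + K + 1]      # forward ladder of L^{-a} zeta
    inflow += p[K - a] * leap_probs(ladder)[a - 1]

# Inflow from F zeta, which reaches zeta with a momentum flip.  Its
# forward ladder runs back down the original one, and p(F zeta) = p(zeta).
flip_ladder = H[K::-1]
inflow += p[K] * (1.0 - sum(leap_probs(flip_ladder)))

assert np.isclose(inflow, p[K])            # fixed point equation holds
```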

Figure 3. Images illustrating mixing time as a function of HMC hyperparameters for a two dimensional ill-conditioned Gaussian distribution. Pixel intensity indicates the number of gradient evaluations required to reach an autocorrelation of 0.5. LAHMC always outperforms HMC for the same hyperparameter settings. (a) LAHMC as a function of ε and β, for fixed M = 10. (b) HMC as a function of ε and β, for fixed M = 10. (c) LAHMC as a function of ε and M, for fixed β = 1. (d) HMC as a function of ε and M, for fixed β = 1.

5. Experiments

LAHMC mixed faster than standard HMC for the same setting of hyperparameters, often by more than a factor of 2.

The first two target distributions are 2 and 100 dimensional ill-conditioned Gaussian distributions. In both Gaussians, the eigenvalues of the covariance matrix are log-linearly distributed between 1 and 10^6.

The final target distribution was chosen to demonstrate that LAHMC is useful even for well conditioned distributions. The energy function used was the sum of an isotropic quadratic and sinusoids in each of two dimensions,

    E(x) = (1 / (2 σ1^2)) (x1^2 + x2^2) + cos(π x1 / σ2) + cos(π x2 / σ2),    (34)

where σ1 = 100 and σ2 = 2. Although this distribution is well conditioned, the sinusoids cause it to have a "rough" surface, such that traversing the quadratic well while maintaining a reasonable discretization error requires many leapfrog steps.

The fraction of the sampling steps resulting in each possible update for the samplers and energy functions in Figure 2 is illustrated in Table 1. The majority of momentum flips in standard HMC were eliminated by LAHMC. Note that the acceptance rate for HMC with these hyperparameter values is reasonably close to its optimal value of 65% (Neal, 2010).

Figure 3 shows several grid searches over hyperparameters for a two dimensional ill-conditioned Gaussian, and demonstrates that our technique outperforms standard HMC for all explored hyperparameter settings. Due to computational constraints, the eigenvalues of the covariance of the Gaussian are 1 and 10^5 in Figure 3, rather than 1 and 10^6 as in Figure 2a.

MATLAB and Python implementations of LAHMC are available at http://github.com/Sohl-Dickstein/LAHMC. Figure 2 and Table 1 can be reproduced by running generate_figure_2.m or generate_figure_2.py.

6. Future Directions

There are many powerful variations on standard HMC that are complementary to and could be combined naturally with the present work. These include Riemann manifold HMC (Girolami et al., 2011), quasi-Newton HMC (Zhang & Sutton, 2011), Hilbert space HMC (Beskos & Pinski, 2011), shadow Hamiltonian methods (Izaguirre & Hampton, 2004), parameter adaptation techniques (Wang et al., 2013), Hamiltonian annealed importance sampling (Sohl-Dickstein & Culpepper, 2012), split HMC (Shahbaba et al., 2011), and tempered trajectories (Neal, 2010).

It should be possible to further reduce random walk behavior by exploring new topologies and allowed state transitions. Two other schemes have already been explored, though with only marginal benefit. In one scheme as many flips as possible are replaced by identity transitions. This is described in the note (Sohl-Dickstein, 2012). In a second scheme, a state space is constructed with two sets of auxiliary momentum variables, and an additional momentum-swap operator which switches the two momenta with each other is included in the allowed transitions. In this scenario, in situations that would typically lead to momentum flipping, with high probability the two sets of momenta can instead be exchanged with each other. This leads to momentum randomization on rejection, rather than momentum reversal. Unfortunately, though this slightly improves mixing time, it still amounts to a random walk on a similar length scale. The exploration of other topologies and allowed transitions will likely prove fruitful.

Any deterministic, reversible, discrete stepped trajectory through a state space can be mapped onto the ladder structure in Figure 1. The Markov transition rules presented in this paper could therefore be applied to a wide range of problems. All that is required in addition to the mapping is an auxiliary variable indicating direction along that trajectory. In HMC, the momentum variable doubles as a direction indicator, but there could just as easily be an additional variable d ∈ {−1, 1}, p(d = 1) = 1/2, which indicates whether transitions are occurring up or down the ladder. The efficiency of the exploration then depends only on choosing a sensible, approximately energy conserving, trajectory.

References

Beskos, A and Pinski, FJ. Hybrid Monte Carlo on Hilbert spaces. Stochastic Processes and their Applications, 2011.

Cadieu, CF and Koepsell, K. Phase coupling estimation from multivariate phase statistics. Neural Computation, 2010.

Culpepper, BJ, Sohl-Dickstein, J, and Olshausen, BA. Building a better probabilistic model of images by factorization. International Conference on Computer Vision, 2011.

Duane, S, Kennedy, AD, Pendleton, BJ, and Roweth, D. Hybrid Monte Carlo. Physics Letters B, 1987.

Girolami, M, Calderhead, B, and Chin, SA. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, March 2011.

Hairer, E, Lubich, C, and Wanner, G. Geometric numerical integration illustrated by the Störmer-Verlet method. Acta Numerica, 2003.

Hastings, W. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, January 1970.

Hoffman, MD and Gelman, A. The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. arXiv preprint arXiv:1111.4246, 2011.

Horowitz, AM. A generalized guided Monte Carlo algorithm. Physics Letters B, 1991.

Izaguirre, JA and Hampton, SS. Shadow hybrid Monte Carlo: an efficient propagator in phase space of macromolecules. Journal of Computational Physics, 2004.

Kennedy, AD and Pendleton, B. Acceptances and autocorrelations in hybrid Monte Carlo. Nuclear Physics B - Proceedings Supplements, 1991.

MacKay, DJC. Information Theory, Inference and Learning Algorithms. 2003.

Neal, RM. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Dept. of Computer Science, University of Toronto, 1993.

Neal, RM. An improved acceptance procedure for the hybrid Monte Carlo algorithm. Journal of Computational Physics, 1994.

Neal, RM. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, January 2010.

Schütte, C and Fischer, A. A direct approach to conformational dynamics based on hybrid Monte Carlo. Journal of Computational Physics, 1999.

Shahbaba, B, Lan, S, Johnson, WO, and Neal, RM. Split Hamiltonian Monte Carlo. Statistics and Computing, 2011.

Sohl-Dickstein, J. Hamiltonian Monte Carlo with reduced momentum flips. arXiv:1205.1939, May 2012.

Sohl-Dickstein, J and Culpepper, BJ. Hamiltonian annealed importance sampling for partition function estimation. arXiv:1205.1925, May 2012.

Wang, Z, Mohamed, S, and de Freitas, N. Adaptive Hamiltonian and Riemann manifold Monte Carlo. Proceedings of the 30th International Conference on Machine Learning (ICML-13), 2013.

Zhang, Y and Sutton, C. Quasi-Newton Markov chain Monte Carlo. 2011.