Magnetic Hamiltonian Monte Carlo

Nilesh Tripuraneni 1 Mark Rowland 2 Zoubin Ghahramani 2 3 Richard Turner 2

Abstract

Hamiltonian Monte Carlo (HMC) exploits Hamiltonian dynamics to construct efficient proposals for Markov chain Monte Carlo (MCMC). In this paper, we present a generalization of HMC which exploits non-canonical Hamiltonian dynamics. We refer to this algorithm as magnetic HMC, since in 3 dimensions a subset of the dynamics map onto the mechanics of a charged particle coupled to a magnetic field. We establish a theoretical basis for the use of non-canonical Hamiltonian dynamics in MCMC, and construct a symplectic, leapfrog-like integrator allowing for the implementation of magnetic HMC. Finally, we exhibit several examples where these non-canonical dynamics can lead to improved mixing of magnetic HMC relative to ordinary HMC.

Figure 1. Example sample paths for standard HMC (left) and MHMC (right) for an isotropic Gaussian target distribution.

1. Introduction

Probabilistic inference in complex models generally requires the evaluation of intractable, high-dimensional integrals. One powerful and generic approach to inference is to use Markov chain Monte Carlo (MCMC) methods to generate asymptotically exact (but correlated) samples from a posterior distribution for inference and learning. Hamiltonian Monte Carlo (HMC) (Duane et al., 1987; Neal, 2011) is a state-of-the-art MCMC method which uses gradient information from an absolutely continuous target density to encourage efficient sampling and exploration. Crucially, HMC utilizes proposals inspired by Hamiltonian dynamics (corresponding to the classical mechanics of a point particle) which can traverse long distances in parameter space. HMC, and variants like NUTS (which eliminates the need to hand-tune the algorithm's hyperparameters), have been successfully applied to a large class of probabilistic inference problems where they are often the gold standard for (asymptotically) exact inference (Neal, 1996; Hoffman & Gelman, 2014; Carpenter et al., 2016).

In this paper, we first review important properties of Hamiltonian dynamics, namely energy-preservation, symplecticity, and time-reversibility, and derive a more general class of dynamics with these properties which we refer to as non-canonical Hamiltonian dynamics. We then discuss the relationship of non-canonical Hamiltonian dynamics to well-known variants of HMC and propose a novel extension of HMC. We refer to this method as magnetic HMC (see Algorithm 1) since it corresponds to a particular subset of the non-canonical dynamics that in 3 dimensions map onto the mechanics of a charged particle coupled to a magnetic field – see Figure 1 for an example of these dynamics. Furthermore, we construct an explicit, symplectic, leapfrog-like integrator for magnetic HMC which allows for an efficient numerical integration scheme comparable to that of ordinary HMC. Finally, we evaluate the performance of magnetic HMC on several sampling problems where we show how its non-canonical dynamics can lead to improved mixing. The proofs of all results in this paper are presented in the corresponding sections of the Appendix.

2. Markov chain Monte Carlo

¹UC Berkeley, USA. ²University of Cambridge, UK. ³Uber AI Labs, USA. Correspondence to: Nilesh Tripuraneni.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Given an unnormalized target density ρ(θ) defined on R^d, an MCMC algorithm constructs an ergodic Markov chain (Θn)n∈N such that the distribution of Θn converges to ρ (e.g. in total variation) (Robert & Casella, 2004). Often, the transition kernel of such a Markov chain is specified by the Metropolis-Hastings (MH) algorithm, which (i) given the current state Θn = θ, proposes a new state θ̃ by sampling from a proposal distribution Q(·|θ), and (ii) sets Θn+1 = θ̃ with probability min(1, ρ(θ̃)Q(θ|θ̃) / (ρ(θ)Q(θ̃|θ))), and Θn+1 = θ otherwise. The role of the acceptance step is to enforce reversibility (or detailed balance) of the Markov chain with respect to ρ – which implies ρ is a stationary distribution of the transition kernel.
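As a concrete illustration, the MH transition above can be sketched in a few lines of Python. This is a minimal sketch only: the unnormalized target `rho` and the proposal routines `q_sample`/`q_density` are placeholders for problem-specific choices, not part of the paper.

```python
import numpy as np

def mh_step(theta, rho, q_sample, q_density, rng):
    """One Metropolis-Hastings transition targeting the
    unnormalized density rho, with proposal Q(.|theta)."""
    theta_prop = q_sample(theta)  # draw theta~ ~ Q(.|theta)
    # acceptance ratio rho(theta~) Q(theta|theta~) / (rho(theta) Q(theta~|theta))
    ratio = (rho(theta_prop) * q_density(theta, theta_prop)) / \
            (rho(theta) * q_density(theta_prop, theta))
    if rng.uniform() < min(1.0, ratio):
        return theta_prop  # accept the proposed state
    return theta           # reject: the chain stays at theta
```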

Heuristically, a good MH algorithm should have low inter-sample correlation while maintaining a high acceptance ratio. Hamiltonian Monte Carlo provides an elegant mechanism to do this by simulating a particle moving along the contour lines of a dynamical system, constructed from the target density, to use as an MCMC proposal.

2.1. Hamiltonian Monte Carlo

In Hamiltonian Monte Carlo, the target distribution is augmented with "momentum" variables p which are independent of the θ variables but of equal dimension. For the remainder of the paper, we take the distribution over the momentum variables to be Gaussian, as is common in the literature (indeed, there is evidence that in many cases, the choice of a Gaussian distribution may be optimal (Betancourt, 2017)). The joint target distribution is therefore:

ρ(θ, p) ∝ e^(−U(θ) − p⊤p/2) ≡ e^(−H(θ,p)).  (1)

Crucially, this augmentation allows Hamiltonian dynamics to be used as a proposal for an MCMC algorithm over the space (θ, p), where we interpret θ (resp. p) as the position (resp. momentum) coordinates of a physical particle with total energy H(θ, p), given by the sum of its potential energy U(θ) and kinetic energy p⊤p/2. We briefly review the Markov chain construction below; see (Neal, 2011) or (Duane et al., 1987) for a more detailed description.

Given the Markov chain state (θn, pn) at time n, the new state for time n + 1 is obtained by first resampling momentum pn ∼ N(0, I), and then proposing a new state according to the following steps: (i) Simulate the deterministic Hamiltonian flow defined by the differential equation

d/dt [θ(t); p(t)] = [0, I; −I, 0] [∇θH(θ(t), p(t)); ∇pH(θ(t), p(t))] = [p(t); −∇θU(θ(t))]  (2)

(we denote the block matrix [0, I; −I, 0] by A) for time τ, with initial condition (θn, pn), to obtain (θ′n, p′n) = Φτ,H(θn, pn)¹; (ii) Flip the resulting momentum component with the map Φp(θ, p) = (θ, −p) to obtain (θ̃n+1, p̃n+1) = Φp(θ′n, p′n) = Φ̃τ,H(θn, pn); (iii) Apply an MH-type accept/reject step to enforce detailed balance with respect to the target distribution; (iv) Flip the momentum again with Φp so it points in the original direction.

¹ Throughout this paper, we use Φτ,H to denote the map that takes a given position-momentum pair as initial conditions for the Hamiltonian flow associated with H for time τ. In addition, Φ̃τ,H denotes the composition of Φτ,H with the momentum flip map Φp.

Note that because the map Φτ,H is time-reversible (in the sense that if the path (θ(t), p(t)) is a solution to (2), then the path with negated momentum traversed in reverse, (θ(−t), −p(−t)), is also a solution), the map Φ̃τ,H is self-inverse. From this, the acceptance ratio in step (iii) enforcing detailed balance can be shown (see e.g. (Green, 1995)) to have the form:

min(1, [exp(−H(θ̃n+1, p̃n+1)) / exp(−H(θn, pn))] · |det ∇θ,p Φ̃τ,H(θn, pn)|).  (3)

Note that the Hamiltonian flow & momentum flip operator Φ̃τ,H is volume-preserving², which immediately yields that the Jacobian term in the acceptance ratio (3) is simply 1. The acceptance probability therefore reduces to min(1, exp(H(θn, pn) − H(θ̃n+1, p̃n+1))). Furthermore, since the Hamiltonian flow defined in (2) is energy-preserving (i.e. H(θ̃n+1, p̃n+1) = H(θn, pn)), the acceptance ratio is identically 1. Moreover, the momentum resampling in (i) and momentum flip in (iv) both leave the joint distribution invariant.

² In fact the Hamiltonian flow satisfies the stronger condition of symplecticity with respect to the A matrix ([∇θ,p Φτ,H(θ, p)]⊤ A⁻¹ [∇θ,p Φτ,H(θ, p)] = A⁻¹), which immediately implies it is volume-preserving by taking determinants of this relation.

While the momentum resampling ensures the Markov chain explores the joint (θ, p) space, the proposals inspired by Hamiltonian dynamics can traverse long distances in parameter space θ, reducing the random-walk behavior of MH that often results in highly correlated samples (Neal, 2011).
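To make the construction concrete, one full HMC transition (steps (i)–(iv) together with the resulting acceptance probability) might be sketched as follows. This is only an illustrative sketch: `simulate_flow` is a stand-in for the (approximate) integrator of (2) introduced in Section 2.2, and `U` is the potential energy.

```python
import numpy as np

def hmc_transition(theta, U, simulate_flow, rng):
    """One HMC transition following steps (i)-(iv) of Section 2.1."""
    p = rng.standard_normal(theta.shape)        # resample p ~ N(0, I)
    H_old = U(theta) + 0.5 * p @ p
    theta_new, p_new = simulate_flow(theta, p)  # (i) simulate the flow (2)
    p_new = -p_new                              # (ii) flip the momentum
    H_new = U(theta_new) + 0.5 * p_new @ p_new
    # (iii) MH accept/reject; the Jacobian term is 1 by volume preservation
    if rng.uniform() < min(1.0, np.exp(H_old - H_new)):
        theta = theta_new
    return theta  # (iv) the final momentum flip is moot once p is resampled
```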

2.2. Symplectic Numerical Integration

Unfortunately, it is rarely possible to integrate the flow defined in (2) analytically; instead an efficient numerical integration scheme must be used to generate a proposal for the MH-type accept/reject test. Typically, the leapfrog (Störmer-Verlet) integrator is used since it is an explicit method that is both symplectic and time-reversible (Neal, 2011). One elegant way to motivate this integrator is by decomposing the Hamiltonian into a symmetric splitting:

H(θ, p) = U(θ)/2 + p⊤p/2 + U(θ)/2,  with H₁(θ) ≡ U(θ)/2 and H₂(p) ≡ p⊤p/2,  (4)

and then defining Φϵ,H₁(θ) and Φϵ,H₂(p) to be the exactly-integrated flows for the sub-Hamiltonians H₁(θ) and H₂(p), respectively. These updates (which are equivalent to Euler translations) can be written:

Φϵ,H₁(θ): (θ, p) ↦ (θ, p − (ϵ/2)∇θU(θ)),   Φϵ,H₂(p): (θ, p) ↦ (θ + ϵp, p),  (5)

since the Hamilton equations (2) for the sub-Hamiltonians H₁(θ) and H₂(p) are linear, and hence analytically integrable. One leapfrog step is then defined as:

Φ^frog_{ϵ,H(θ,p)} = Φϵ,H₁(θ) ∘ Φϵ,H₂(p) ∘ Φϵ,H₁(θ),  (6)

with the overall proposal given by L leapfrog steps, followed by the momentum flip operator Φp as before:

Φ̃^frog_{L,ϵ,H} = Φp ∘ (Φ^frog_{ϵ,H(θ,p)})^L.

As each of the flows Φϵ,H₁(θ), Φϵ,H₂(p) exactly integrates a sub-Hamiltonian, they inherit the symplecticity, volume-preservation, and time-reversibility of the exact dynamics. Moreover, since the composition of symplectic flows is also symplectic and the splitting scheme is symmetric (implying the composition of time-reversible flows is also time-reversible), the Jacobian term in the acceptance probability (3) is exactly 1, as in the case of perfect simulation.

The leapfrog scheme will not exactly preserve the Hamiltonian H, so the remaining acceptance ratio exp(H(θn, pn) − H(θ̃n+1, p̃n+1)) must be calculated. However, the leapfrog integrator has error O(ϵ³) in one leapfrog step (Hairer et al., 2006). This error scaling will lead to good energy conservation properties (and thus high acceptance rates in the MH step), even when simulating over long trajectories.
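In code, one leapfrog step (6), composed of the three exact sub-flows in (5), looks as follows (a minimal sketch; `grad_U` is the gradient of the potential and `eps` is the step size ϵ):

```python
def leapfrog_step(theta, p, grad_U, eps):
    """Symmetric splitting: half-step in p, full step in theta,
    half-step in p, as in eqs. (5)-(6)."""
    p = p - 0.5 * eps * grad_U(theta)  # flow of H1 = U(theta)/2 for time eps
    theta = theta + eps * p            # flow of H2 = p^T p / 2 for time eps
    p = p - 0.5 * eps * grad_U(theta)  # flow of H1 again
    return theta, p

def leapfrog_proposal(theta, p, grad_U, eps, L):
    """L leapfrog steps followed by the momentum flip Phi_p."""
    for _ in range(L):
        theta, p = leapfrog_step(theta, p, grad_U, eps)
    return theta, -p
```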

3. Non-Canonical Hamiltonian Monte Carlo

In Section 2, we noted the role that time-reversibility, volume-preservation, and energy conservation of canonical Hamiltonian dynamics play in making them useful candidates for MCMC. In this section, we develop the parallel properties of a general class of flows we refer to as non-canonical Hamiltonian systems, which we use to construct our method, magnetic HMC (see Algorithm 1):

Lemma 1. The map Φ^A_{τ,H}(θ, p), defined by integrating the non-canonical Hamiltonian system

d/dt [θ(t); p(t)] = A ∇θ,p H(θ(t), p(t))  (7)

with initial conditions (θ, p) for time τ, where A ∈ M_{2n×2n} is any invertible, antisymmetric matrix, induces a flow on the coordinates (θ, p) that is still energy-conserving (∂τ H(Φ^A_{τ,H}(θ, p)) = 0) and symplectic with respect to A ([∇θ,p Φ^A_{τ,H}(θ, p)]⊤ A⁻¹ [∇θ,p Φ^A_{τ,H}(θ, p)] = A⁻¹), which also implies volume-preservation of the flow.

Within the formal construction of classical mechanics, it is known that any Hamiltonian flow defined on the cotangent bundle (θ, p) of a configuration manifold, which is equipped with an arbitrary symplectic 2-form, will preserve its symplectic structure and admit the corresponding Hamiltonian as a first integral invariant (Arnold, 1989). The statement of Lemma 1 is simply a restatement of this fact grounded in a coordinate system. Similar arbitrary, antisymmetric terms have also appeared in the study of MCMC algorithms based on diffusion processes; such samplers often do not enforce detailed balance with respect to the target density and are often implemented as discretizations of stochastic differential equations (Rey-Bellet & Spiliopoulos, 2015; Ma et al., 2015), in contrast to the approach taken here.

Our second observation is that the dynamics in (7) are not time-reversible in the traditional sense. Instead, if we consider the parametrization of A as:

A = [E, F; −F⊤, G]  (8)

where E, G are antisymmetric and F is taken to be general such that A is invertible, then the non-canonical dynamics have a (pseudo) time-reversibility symmetry:

Lemma 2. If (θ(t), p(t)) is a solution to the non-canonical dynamics:

d/dt [θ(t); p(t)] = [E, F; −F⊤, G] [∇θH(θ(t), p(t)); ∇pH(θ(t), p(t))]  (9)

then (θ̃(t), p̃(t)) = (θ(−t), −p(−t)) is a solution to the modified non-canonical dynamics:

d/dt [θ̃(t); p̃(t)] = [−E, F; −F⊤, −G] [∇θ̃H(θ̃(t), p̃(t)); ∇p̃H(θ̃(t), p̃(t))]  (10)

with modified matrix Ã = [−E, F; −F⊤, −G], provided H(θ, p) = H(θ, −p). In particular, if E = G = 0 then Ã = A, which reduces to the traditional time-reversal symmetry of canonical Hamiltonian dynamics.

Lemma 1 suggests a generalization of HMC that can utilize an arbitrary invertible antisymmetric A matrix in its dynamics; however, Lemma 2 indicates the non-canonical dynamics lack a traditional time-reversibility symmetry, which poses a potential difficulty to satisfying detailed balance. In particular, we cannot compose Φ^A_{τ,H} with an exact/approximate simulation of Φ^A_{τ,H} to make Φ̃^A_{τ,H} = Φp ∘ Φ^A_{τ,H} self-inverse.
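Lemma 1's energy conservation is easy to check numerically. Below is a small sketch under the assumption of a standard Gaussian target, so that H(θ, p) = θ⊤θ/2 + p⊤p/2 and ∇H(z) = z; SciPy's `solve_ivp` is used only as a high-accuracy reference integrator, not as the sampler's integrator.

```python
import numpy as np
from scipy.integrate import solve_ivp

d = 2
rng = np.random.default_rng(0)
M = rng.standard_normal((2 * d, 2 * d))
A = M - M.T  # antisymmetric, and invertible with probability 1 in even dimension

# H(z) = ||z||^2 / 2 for a standard Gaussian target, so grad H(z) = z
sol = solve_ivp(lambda t, z: A @ z, (0.0, 10.0),
                rng.standard_normal(2 * d), rtol=1e-10, atol=1e-10)
H = 0.5 * np.sum(sol.y ** 2, axis=0)  # Hamiltonian along the trajectory
print(np.max(np.abs(H - H[0])))       # ~0: energy is conserved, as Lemma 1 states
```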

Our solution to obtaining a time-reversible proposal is simply to flip the elements of the E and G matrices, just as ordinary HMC flips the auxiliary variable p, (i) at the end of the Hamiltonian flow in the proposal and (ii) once again after the MH acceptance step to return p to its original direction. In this vein, we view the parameters E and G as auxiliary variables in the state space, and simultaneously flip p, E, and G after having simulated the dynamics, rendering the proposal time-reversible according to Lemma 2 – see Section 2 in the Appendix for full details of this construction. This ensures that detailed balance is satisfied for this entire proposal. To avoid random-walk behaviour in the resulting Markov chain, we can apply a sign flip to E and G, in addition to p, to return them to their original directions after the MH acceptance step.

The validity of this construction relies on equipping E and G with symmetric auxiliary distributions. For the remainder of this paper, we further restrict to binary symmetric auxiliary distributions supported on a given antisymmetric matrix V₀ and its sign flip −V₀ – see Appendix 1.1 for full details. This restriction is not necessary, but gives rise to a simple and interpretable class of algorithms, which is in spirit closest to using fixed parameters E and G, whilst ensuring the proposal satisfies detailed balance. This construction is also reminiscent of lifting constructions prevalent in the discrete Markov chain literature (Chen et al., 1999); heuristically, the signed variables E and G favour proposals in opposing directions.

3.1. Symplectic Numerical Integration for Non-Canonical Dynamics

As with standard HMC, exactly simulating the flow Φ^A_{τ,H} is rarely tractable, and a numerical integrator is required to approximate the flow. It is not absolutely necessary to use an explicit, symplectic integration scheme; indeed, implicit integrators are used in Riemannian HMC to maintain symplecticity of the proposal, which comes at greater complexity and computational cost (Girolami et al., 2009). However, explicit, symplectic integrators are simple, have good energy-conservation properties, and are volume-preserving/time-reversible (Hairer et al., 2006), so for the present discussion we restrict our attention to investigating leapfrog-like schemes.

We begin, as in Section 2.2, by considering the symmetric splitting (4), yielding the sub-Hamiltonians H₁(θ) = U(θ)/2 and H₂(p) = p⊤p/2. The corresponding non-canonical dynamics for the sub-Hamiltonians H₁(θ) and H₂(p) are:

d/dt [θ; p] = [E, F; −F⊤, G] [∇θU(θ)/2; 0] = [E∇θU(θ)/2; −F⊤∇θU(θ)/2]

and:

d/dt [θ; p] = [E, F; −F⊤, G] [0; p] = [Fp; Gp].

We denote the corresponding flows by Φ^A_{ϵ,H₁(θ)} and Φ^A_{ϵ,H₂(p)} respectively. The flow Φ^A_{ϵ,H₁(θ)} is generally not explicitly tractable unless we take E = 0 – in which case it is solved by an Euler translation as before. Crucially, the flow Φ^A_{ϵ,H₂(p)} is a linear differential equation and hence analytically integrable. If G is invertible (and F = I) then:

Φ^A_{ϵ,H₂(p)}: (θ, p) ↦ (θ + G⁻¹(exp(ϵG) − I)p, exp(ϵG)p).  (11)

See the Appendix for a detailed derivation, which also handles the general case where G is not invertible. Thus when E = 0, the flows Φ^A_{ϵ,H₁(θ)} and Φ^A_{ϵ,H₂(p)} are analytically tractable and will inherit the generalized symplecticity and (pseudo) time-reversibility of the exact dynamics in (7). Therefore, if we use the symmetric splitting (4) to construct a leapfrog-like step:

Φ^{frog,A}_{ϵ,H(θ,p)} = Φ^A_{ϵ,H₁(θ)} ∘ Φ^A_{ϵ,H₂(p)} ∘ Φ^A_{ϵ,H₁(θ)},  (12)

we can construct a total proposal that consists of several leapfrog steps, followed by a flip of the momentum and G, Φp ∘ ΦG, which will be a volume-preserving, self-inverse map:

Φ̃^{frog,A}_{ϵ,H(θ,p)} = Φp ∘ ΦG ∘ (Φ^A_{ϵ,H₁(θ)} ∘ Φ^A_{ϵ,H₂(p)} ∘ Φ^A_{ϵ,H₁(θ)})^L.  (13)

Henceforth we will always take E = 0 when we use Φ^{frog,A}_{ϵ,H(θ,p)} to generate leapfrog proposals, which interestingly corresponds to magnetic dynamics, as discussed in Lemma 4. A full description of the magnetic HMC algorithm using this numerical integrator is given in Section 5.
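Under the stated assumptions (E = 0, F = I, G invertible), the exact flow (11) and the leapfrog-like step (12) can be implemented directly. The following is a sketch using `scipy.linalg.expm`; in practice exp(ϵG) and the G⁻¹ factor would be computed once and cached, and the appendix's derivation covers singular G, which this sketch does not.

```python
import numpy as np
from scipy.linalg import expm

def flow_H2_magnetic(theta, p, G, eps):
    """Exact flow (11) of H2(p) = p^T p / 2 under the magnetic
    dynamics with E = 0, F = I, and invertible antisymmetric G."""
    R = expm(eps * G)  # rotation exp(eps * G) of momentum space
    theta = theta + np.linalg.solve(G, (R - np.eye(len(p))) @ p)
    return theta, R @ p

def magnetic_leapfrog_step(theta, p, grad_U, G, eps):
    """Leapfrog-like step (12): exact H1 and H2 sub-flows composed symmetrically."""
    p = p - 0.5 * eps * grad_U(theta)
    theta, p = flow_H2_magnetic(theta, p, G, eps)
    p = p - 0.5 * eps * grad_U(theta)
    return theta, p
```

Setting G = 0 (and replacing the H₂ sub-flow by θ ← θ + ϵp) recovers the standard leapfrog step of Section 2.2.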

4. Special Cases

Here, we describe several tractable subcases of the general formulation of non-canonical Hamiltonian dynamics, since they have interesting physical interpretations.

4.1. Mass Preconditioned Dynamics

One simple variant of HMC is preconditioned HMC, where p ∼ N(0, M) (Neal, 2011), which can be implemented nearly identically to ordinary HMC. We note that preconditioning can be recovered within our framework using a simple form for the non-canonical A matrix:

Lemma 3. i) Preconditioned HMC with momentum variable p ∼ N(0, M) in the (θ, p) coordinates is exactly equivalent to simulating non-canonical HMC with p′ = M^(−1/2)p ∼ N(0, I) and the non-canonical matrix A = [0, M^(1/2); −(M^(1/2))⊤, 0], and then transforming back to (θ, p) coordinates using p = M^(1/2)p′. Here M^(1/2) is a Cholesky factor for M. ii) On the other hand, if we apply a change of basis (via an invertible matrix F) to our coordinates, θ′ = F⁻¹θ, simulate HMC in the (θ′, p) coordinates, and transform back to the original basis using F, this is exactly equivalent to non-canonical HMC with A = [0, F; −F⊤, 0].

Lemma 3 illustrates a fact alluded to in (Neal, 2011): using a mass preconditioning matrix M together with a change of basis given by the matrix (M^(−1/2))⊤ acting on θ leaves the HMC algorithm invariant.

4.2. Magnetic Dynamics

The primary focus of this paper is to investigate the subcase of the dynamics where:

A = [0, I; −I, G]  (14)

for two important reasons³. Firstly, for this choice of A we can construct an explicit, symplectic, leapfrog-like integration scheme, which is important for developing an efficient HMC sampler, as discussed in Section 3.1. Secondly, the dynamics have an interesting physical interpretation quite distinct from mass preconditioning and other HMC variants like Riemannian HMC (Girolami et al., 2009):

Lemma 4. In 3 dimensions, the non-canonical Hamiltonian dynamics corresponding to the Hamiltonian H(θ, p) = U(θ) + ½p⊤p and the A matrix in (14) are equivalent to the Newtonian mechanics of a charged particle (with unit mass and charge) coupled to a magnetic field B⃗ (given by a particular function of G – see Appendix): d²θ/dt² = −∇θU(θ) + B⃗ × dθ/dt.

This interpretation is perhaps surprising, since Hamiltonian formulations of classical magnetism are uncommon, although the quantum mechanical treatment naturally incorporates a Hamiltonian framework. However, in light of Lemma 3 we might wonder if, by a clever rewriting of the Hamiltonian, we can reproduce this system of ODEs using the canonical A matrix (i.e. E = G = 0, F = I). This is not the case:

Lemma 5. The non-canonical Hamiltonian dynamics with magnetic A and Hamiltonian H(θ, p) = U(θ) + ½p⊤p cannot be obtained using canonical Hamiltonian dynamics for any choice of smooth Hamiltonian. (See Appendix.)

³ Note that the effect of a non-identity F matrix can be achieved by simply composing these magnetic dynamics with a coordinate transformation, as suggested in Lemma 3.

5. The Magnetic HMC (MHMC) Algorithm

Using the results discussed in Section 3 and Section 3.1, we can now propose magnetic HMC – see Algorithm 1.

Algorithm 1 Magnetic HMC (MHMC)
Input: H, G, L, ϵ
Initialize (θ₀, p₀), and set G₀ ← G
for n = 1, . . . , N do
  Resample pn−1 ∼ N(0, I)
  Set (θ̃n, p̃n) ← LF(H, L, ϵ, (θn−1, pn−1, Gn−1)) with Φ^A_{ϵ,H₂(p)} as in (11)
  Flip momentum (θ̃n, p̃n) ← (θ̃n, −p̃n) and set G̃n ← −Gn−1
  if Unif([0, 1]) < min(1, exp(H(θn−1, pn−1) − H(θ̃n, p̃n))) then
    Set (θn, pn, Gn) ← (θ̃n, p̃n, G̃n)
  else
    Set (θn, pn, Gn) ← (θn−1, pn−1, Gn−1)
  end if
  Flip momentum pn ← −pn and flip Gn ← −Gn
end for
Output: (θn)ᴺn=0

One further remark is that, by construction, the integrator for magnetic HMC is expected to have similarly good energy conservation properties to the integrator of standard HMC:

Lemma 6. The symplectic leapfrog-like integrator for magnetic HMC will have the same local (∼O(ϵ³)) and global (∼O(ϵ²)) error scaling (over ∼τ/ϵ steps) as the canonical leapfrog integrator of standard HMC if the Hamiltonian is separable. (See Appendix.)

It is worthwhile to contrast the algorithmic differences between magnetic HMC and ordinary HMC. Intuitively, the role of the flow Φ^A_{ϵ,H₂(p)} – which reduces to the standard Euler translation update of ordinary HMC when G = 0 – is to introduce a rotation into the momentum space of the flow. In particular, a non-zero element Gij will allow momentum to periodically flow between pi and pj. If we regard G as an element in the Lie algebra of antisymmetric matrices, which can be thought of as infinitesimal rotations, then the exponential map exp(G) will project this transformation into the Lie group of real orthogonal linear maps.

With respect to computational cost, although magnetic HMC requires matrix exponentiation/diagonalization to simulate Φ^A_{ϵ,H₂(p)}, this only needs to be computed once upfront for ±G and cached; moreover, as G is diagonalizable, the exact exponential can be calculated in O(d³) time. Additionally, there is an O(d²) cost for the matrix-vector products needed to implement the flow Φ^A_{ϵ,H₂(p)}, as with preconditioning. However, it is possible to design sparsified matrix representations of A, which translate into sparsified rotations if we only wish to "curl" in a specific subspace of dimension d₀ – reducing these costs to O(d₀³) and O(d₀²) respectively.
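A compact rendering of Algorithm 1 in Python might look as follows. This is a sketch only: it reuses the hypothetical `magnetic_leapfrog_step` from Section 3.1's sketch, and the sign bookkeeping for G follows the flips prescribed in Algorithm 1 (in particular, on rejection the returned G has its direction reversed, as in lifting constructions).

```python
import numpy as np

def mhmc_step(theta, U, grad_U, G, eps, L, rng):
    """One magnetic HMC transition (a sketch of Algorithm 1)."""
    p = rng.standard_normal(theta.shape)  # resample p ~ N(0, I)
    H_old = U(theta) + 0.5 * p @ p
    theta_new, p_new = theta, p
    for _ in range(L):                    # L leapfrog-like steps, as in (13)
        theta_new, p_new = magnetic_leapfrog_step(theta_new, p_new,
                                                  grad_U, G, eps)
    p_new, G_new = -p_new, -G             # flip p and G: proposal is self-inverse
    H_new = U(theta_new) + 0.5 * p_new @ p_new
    if rng.uniform() < min(1.0, np.exp(H_old - H_new)):  # MH accept/reject
        theta, G = theta_new, G_new
    return theta, -G  # final flip of G (and of p, which is resampled next step)
```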

An important problem to address is the selection of the G matrix, which affords a great deal of flexibility to MHMC relative to HMC; this point is further discussed in the Experiments section, where we argue that in certain cases intuitive heuristics can be used to select the G matrix.

6. Experiments

Here we investigate the performance of magnetic HMC against standard HMC in several examples, in each case commenting on our choice of the magnetic field term G. Step sizes (ϵ) and numbers of leapfrog steps (L) were tuned to achieve an acceptance rate between .7–.8, after which the norm of the non-zero elements in G was set to ∼ .1–.2, which was found to work well.

In the Appendix we also display illustrations of different MHMC proposals across several targets in order to provide more intuition for MHMC's dynamics. Further experimental details and an additional experiment on a Gaussian funnel target are also provided in the Appendix.

6.1. Multiscale Gaussians

We consider two highly ill-conditioned Gaussians, similar to those in (Sohl-Dickstein et al., 2014), to illustrate a heuristic for G matrix selection and to demonstrate properties of the magnetic dynamics. In particular, we consider a centered, uncorrelated 2D Gaussian with covariance eigenvalues of 10⁶ and 1, as well as a centered, uncorrelated 10D Gaussian with two large covariance eigenvalues of 10⁶ and remaining eigenvalues of 1. We denote their coordinates as x = (x₁, x₂) ∈ R² and x = (x₁, . . . , x₁₀) ∈ R¹⁰ respectively. HMC will have difficulty exploring the directions of large marginal variance, since its exploration will often be limited by the smaller variance directions. Accordingly, in order to induce a periodic momentum flow between the directions of small and large variance, we introduce nonzero components Gij into the subspaces spanned directly between the large- and small-eigenvalue directions. Indeed, we find that the magnetic G term encourages more efficient exploration, as we can see from the averaged autocorrelation of samples generated from the HMC/MHMC chains – see Figure 2. Further, by running 50 parallel chains for 10⁷ timesteps, we computed both the bias and Monte Carlo standard errors (MCSE) of the estimators of the target moments, as shown in Table 1 and Table 2.

Figure 2. Averaged autocorrelation of HMC vs. MHMC on a 10D ill-conditioned Gaussian (left) and on a 2D ill-conditioned Gaussian (right).

Table 1. Absolute Bias ± MCSE for 2D ill-conditioned Gaussian moments for HMC vs. MHMC. Note that E[x₁²] = 10⁶ and E[x₂²] = 1.

algorithm | x₁² (Bias ± MCSE) | x₂² (Bias ± MCSE)
HMC       | .947 ± 41.7       | .00108 ± .0247
MHMC      | .657 ± 22.5       | .000280 ± .00438

Table 2. Absolute Bias ± MCSE for 10D ill-conditioned Gaussian moments for HMC vs. MHMC. Note that E[x₁²] = 10⁶ and E[x₁₀²] = 1.

algorithm | x₁² (Bias ± MCSE) | x₁₀² (Bias ± MCSE)
HMC       | 24.7 ± 46.6       | 0.00376 ± 0.0249
MHMC      | 9.68 ± 32.7       | 0.00127 ± 0.0132
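The heuristic used here can be written down directly: place antisymmetric couplings between the large-variance and small-variance coordinates. The sketch below is illustrative (the helper `build_multiscale_G` is our own naming); the index split and the magnitude g ≈ 0.1 follow the experimental settings above.

```python
import numpy as np

def build_multiscale_G(d, large_idx, small_idx, g=0.1):
    """Antisymmetric G coupling each large-variance coordinate
    to each small-variance coordinate with magnitude g."""
    G = np.zeros((d, d))
    for i in large_idx:
        for j in small_idx:
            G[i, j], G[j, i] = g, -g  # momentum rotates between p_i and p_j
    return G

# e.g. the 10D target with covariance eigenvalues (1e6, 1e6, 1, ..., 1):
G = build_multiscale_G(10, large_idx=[0, 1], small_idx=range(2, 10))
```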

6.2. Mixture of Gaussians

We compare MHMC vs. HMC on a simple, but interesting, 2D density over x = (x, y) ∈ R², comprised of an evenly weighted mixture of isotropic Gaussians: p(x) = ½N(x; µ, Σ) + ½N(x; −µ, Σ) for σx² = σy² = 1, ρxy = 0 and µ = (2.5, 2.5). This problem is challenging for HMC because the gradients in canonical Hamiltonian dynamics force it to one of the two modes. We tuned HMC to achieve an acceptance rate of ∼ .75 and used the same ϵ, L for MHMC, generating 15000 samples from both HMC and MHMC with these settings. The addition of the magnetic field term G – which has one degree of freedom in this case – introduces an asymmetric "curl" into the dynamics that pushes the sampler across the saddlepoint to the other mode, allowing it to efficiently mix both around the two modes and between them – see Figure 3. The maximum mean discrepancy (MMD) between exact samples generated from the target density and samples generated from both the HMC and MHMC chains was also estimated for various magnitudes of G, using a quadratic kernel k(x, x′) = (1 + ⟨x, x′⟩)², averaged over 100 runs of the Markov chains (Borgwardt et al., 2006).
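For this 2D target, G has a single free parameter g, and the components needed by a sampler are simple to specify. Below is a minimal sketch (the function names are illustrative; a routine like the earlier `mhmc_step` would consume `U`, `grad_U`, and `G`):

```python
import numpy as np

mu = np.array([2.5, 2.5])

def U(x):
    """Negative log of the evenly weighted isotropic mixture (up to a constant)."""
    a = -0.5 * np.sum((x - mu) ** 2)
    b = -0.5 * np.sum((x + mu) ** 2)
    return -np.logaddexp(a, b)  # -log(0.5 e^a + 0.5 e^b), up to a constant

def grad_U(x):
    a = -0.5 * np.sum((x - mu) ** 2)
    b = -0.5 * np.sum((x + mu) ** 2)
    w = 1.0 / (1.0 + np.exp(b - a))        # responsibility of the mode at +mu
    return w * (x - mu) + (1 - w) * (x + mu)

g = 0.1
G = np.array([[0.0, g], [-g, 0.0]])        # the single degree of freedom in 2D
```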

Figure 3. Left: 500 samples from HMC; Middle: 500 samples from MHMC; Right: MMD between HMC/MHMC samples for various magnitudes of the non-zero component of the magnetic field, denoted g. Note g = 0 corresponds to standard HMC.

Here, we clearly see that for various values of the nonzero component of G, denoted g, the samples generated by MHMC more faithfully reflect the structure of the posterior. As before, we ran 50 parallel chains for 10⁷ timesteps to compute both the bias and Monte Carlo standard errors (MCSE) of the estimators of the target moments, as shown in Table 3. Additional experiments over a range of ϵ, L (and corresponding acceptance rates) are included in the Appendix for this example, demonstrating similar behavior.

Table 3. Bias ± MCSE for the 2D Mixture of Gaussians for HMC vs. MHMC with g = 0.1. Note that E[x] = 0 and E[x²] = 7.25.

algorithm | x (Bias ± MCSE) | x² (Bias ± MCSE)
HMC       | .0132 ± .0644   | 0.00264 ± 0.0114
MHMC      | .00239 ± .012   | 0.000596 ± 0.00365

6.3. FitzHugh-Nagumo Model

Finally, we consider the problem of Bayesian inference over the parameters of the FitzHugh-Nagumo model (a set of nonlinear ordinary differential equations, originally developed to model the behavior of axial spiking potentials in neurons), as in (Ramsay et al., 2007; Girolami & Calderhead, 2011). The FitzHugh-Nagumo model is a dynamical system (V(t), R(t)) defined by the following coupled differential equations:

V̇(t) = c(V(t) − V(t)³/3 + R(t))
Ṙ(t) = −(V(t) − a + bR(t))/c  (15)

We consider the problem where the initial conditions (V(0), R(0)) of the system (15) are known, and a set of noise-corrupted observations (Ṽ(tn), R̃(tn))ᴺn=0 = (Va,b,c(tn) + εVn, Ra,b,c(tn) + εRn)ᴺn=0 at discrete time points 0 = t₀ < t₁ < · · · < tN is available – note that we indicate the dependence of the trajectories on the model parameters explicitly via subscripts. It is not possible to recover the true parameter values of the model from these observations, but we can obtain a posterior distribution over them by specifying a model for the observation noise and a prior distribution over the model parameters.

Similar to (Ramsay et al., 2007; Girolami & Calderhead, 2011), we assume that the observation noise variables (εVn)ᴺn=0 and (εRn)ᴺn=0 are iid N(0, 0.1²), and take an independent N(0, 1) prior over each parameter a, b, and c. This yields a posterior distribution of the form

p(a, b, c) ∝ N(a; 0, 1) N(b; 0, 1) N(c; 0, 1) × ∏ᴺn=0 N(Ṽ(tn); Va,b,c(tn), 0.1²) N(R̃(tn); Ra,b,c(tn), 0.1²).  (16)

Importantly, the highly non-linear dependence of the trajectory on the parameters a, b and c yields a complex posterior distribution – see Figure 4. Full details of the model set-up can be found in (Ramsay et al., 2007; Girolami & Calderhead, 2011).

For our experiments, we used fixed parameter settings of a = 0.2, b = 0.2, c = 3.0 to generate 200 evenly-spaced noise-corrupted observations over the time interval t = [0, 20] (as in (Ramsay et al., 2007; Girolami & Calderhead, 2011)). We performed inference over the posterior distribution of the parameters (a, b, c) with this set of observations using both the HMC and MHMC algorithms, where MHMC was perturbed with a magnetic field in each of the 3 axial planes of parameters – along the ab, ac, and bc axes – with magnitude g = 0.1.
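The forward model (15), which underlies every likelihood evaluation in (16), can be simulated with a standard ODE solver. A minimal sketch follows; the parameter settings mirror the experiment (a = 0.2, b = 0.2, c = 3.0, 200 observations on t ∈ [0, 20]), while the initial condition (−1, 1) is an illustrative assumption, not taken from the text.

```python
import numpy as np
from scipy.integrate import solve_ivp

def fitzhugh_nagumo(t, state, a, b, c):
    """Right-hand side of the coupled ODEs in (15)."""
    V, R = state
    dV = c * (V - V ** 3 / 3 + R)
    dR = -(V - a + b * R) / c
    return [dV, dR]

t_obs = np.linspace(0.0, 20.0, 200)        # 200 evenly spaced observation times
sol = solve_ivp(fitzhugh_nagumo, (0.0, 20.0), [-1.0, 1.0],  # assumed (V(0), R(0))
                args=(0.2, 0.2, 3.0), t_eval=t_obs, rtol=1e-8)
# noise-corrupted observations of V with sd 0.1, as in the noise model above
V_noisy = sol.y[0] + 0.1 * np.random.default_rng(0).standard_normal(len(t_obs))
```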
The chains were run to generate 1000 samples over 100 repetitions with settings of ϵ = 0.015, L = 10, which resulted in an average acceptance rate of ∼ .8. The effective sample size (ESS) of each of the chains, normalized per unit time, was then computed for each chain.

Figure 4. Marginal posterior density contour plot over a, b, with c = 3.

Since each query to the posterior log-likelihood or its gradient requires solving an augmented set of differential equations as in (15), the computation time (∼ 238s) of all the methods was nearly identical.

Moreover, note that all methods achieved nearly perfect mixing over the first coordinate, so their effective sample sizes were truncated at 1000 for the a coordinate.

Table 4. HMC vs. MHMC performance targeting the FitzHugh-Nagumo posterior parameters.

algorithm | ESS (a, b, c)  | ESS(a, b, c)/time (s)
HMC       | 1000, 318, 606 | 4.20, 1.33, 2.55
MHMC ab   | 1000, 349, 658 | 4.20, 1.47, 2.77
MHMC ac   | 1000, 336, 628 | 4.20, 1.42, 2.64
MHMC bc   | 1000, 326, 649 | 4.20, 1.37, 2.73

In this example, we can see that all magnetic perturbations slightly increase the mixing rate of the sampler over each of the (b, c) coordinates, with the ab perturbation performing best.

7. Discussion and Conclusion

We have investigated a framework for MCMC algorithms based on non-canonical Hamiltonian dynamics and have given a construction for an explicit, symplectic integrator that is used to implement a generalization of HMC we refer to as magnetic HMC. We have also shown several examples where the non-canonical dynamics of MHMC can improve upon the sampling performance of standard HMC. Important directions for further research include finding more automated, adaptive mechanisms to set the matrix G, as well as investigating positionally-dependent magnetic field components, similar to how Riemannian HMC corresponds to local preconditioning. We believe that exploiting more general deterministic flows (such as also maintaining a non-zero E in the top-left block of a general A matrix) could form a fruitful area for further research on MCMC methods.

Acknowledgements

The authors thank John Aston, Adrian Weller, Maria Lomeli, Yarin Gal and the anonymous reviewers for helpful comments. MR acknowledges support by the UK Engineering and Physical Sciences Research Council (EPSRC) grant EP/L016516/1 for the University of Cambridge Centre for Doctoral Training, the Cambridge Centre for Analysis. RET thanks EPSRC grants EP/M0269571 and EP/L000776/1 as well as Google for funding.

References

Arnold, Vladimir. Mathematical Methods of Classical Mechanics. 1989. ISBN 0387968903.

Betancourt, Michael. A Conceptual Introduction to Hamiltonian Monte Carlo. arXiv:1701.02434, 2017.

Borgwardt, Karsten M., Gretton, Arthur, Rasch, Malte J., Kriegel, Hans-Peter, Schölkopf, Bernhard, and Smola, Alex J. Integrating structured biological data by Kernel Maximum Mean Discrepancy. Bioinformatics, volume 22, 2006.

Carpenter, Bob, Gelman, Andrew, Hoffman, Matt, Lee, Daniel, Goodrich, Ben, Betancourt, Michael, Brubaker, Marcus A., Li, Peter, and Riddell, Allen. Stan: A Probabilistic Programming Language. Journal of Statistical Software, 2016.

Chen, Fang, Lovász, László, and Pak, Igor. Lifting Markov chains to speed up mixing. In Proceedings of the Thirty-first Annual ACM Symposium on Theory of Computing, STOC '99, pp. 275–281, New York, NY, USA, 1999. ACM. ISBN 1-58113-067-8. doi: 10.1145/301250.301315.

Duane, Simon, Kennedy, A. D., Pendleton, Brian J., and Roweth, Duncan. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.

Girolami, Mark and Calderhead, Ben. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, March 2011.

Girolami, Mark, Calderhead, Ben, and Chin, Siu A. Riemannian Manifold Hamiltonian Monte Carlo. 2009.

Green, Peter J. Reversible Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination. Biometrika, 82(4):711–732, 1995.

Hairer, Ernst, Hochbruck, Marlis, Iserles, Arieh, and Lubich, Christian. Geometric Numerical Integration. Oberwolfach Reports, pp. 805–882, 2006.

Hoffman, Matt and Gelman, Andrew. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15, 2014.

Ma, Yi-An, Chen, Tianqi, and Fox, Emily B. A Complete Recipe for Stochastic Gradient MCMC. NIPS, pp. 1–19, 2015.

Neal, Radford. Bayesian Learning for Neural Networks, volume 118. 1996.

Neal, Radford M. MCMC using Hamiltonian Dynamics. Handbook of Markov Chain Monte Carlo, pp. 113–162, 2011.

Ramsay, J. O., Hooker, G., Campbell, D., and Cao, J. Parameter estimation for differential equations: A generalized smoothing approach. Journal of the Royal Statistical Society, Series B, 2007.

Rey-Bellet, Luc and Spiliopoulos, Konstantinos. Irreversible Langevin Samplers and Variance Reduction: a Large Deviations Approach. Nonlinearity, 28(7):2081, 2015.

Robert, Christian P. and Casella, George. Monte Carlo Statistical Methods, volume 95. 2004.

Sohl-Dickstein, J., Mudigonda, Mayur, and DeWeese, M. Hamiltonian Monte Carlo Without Detailed Balance. Proceedings of The 31st International Conference on Machine Learning, 32:9, 2014.