Shadow Manifold Hamiltonian Monte Carlo

Chris van der Heide (UQ), Liam Hodgkinson (UC Berkeley, ICSI), Fred Roosta (UQ, ICSI), Dirk P. Kroese (UQ)

Abstract

Hamiltonian Monte Carlo and its descendants have found success in machine learning and computational statistics due to their ability to draw samples in high dimensions with greater efficiency than classical MCMC. One of these derivatives, Riemannian manifold Hamiltonian Monte Carlo (RMHMC), better adapts the sampler to the geometry of the target density, allowing for improved performance in sampling problems with complex geometric features. Other approaches have boosted acceptance rates by sampling from an integrator-dependent "shadow density" and compensating for the induced bias via importance sampling. We combine the benefits of RMHMC with those attained by sampling from the shadow density, by deriving the shadow Hamiltonian corresponding to the generalized leapfrog integrator used in RMHMC. This leads to a new algorithm, shadow manifold Hamiltonian Monte Carlo, that shows improved performance over RMHMC, and leaves the target density invariant.

1 Introduction

Hamiltonian Monte Carlo (HMC) was first introduced by Duane et al. (1987) in the context of simulating lattice field theories in quantum chromodynamics. It gained further mainstream success following its introduction to the machine learning and computational statistics communities for training Bayesian neural networks (Neal, 1996). It has since become an indispensable tool for Bayesian inference at large, finding a plethora of applications in scientific and industrial fields.

The key advantage of HMC over many of its competitors is its ability to draw samples that are large distances apart by evolving them via Hamiltonian dynamics. The acceptance rate depends on the error accumulated along the sample trajectory, and remains large even in moderately high dimensions. This is in no small part due to its deep connection with geometry and physics, which inform its rich theoretical foundations (Betancourt et al., 2017; Barp et al., 2018). However, HMC's performance deteriorates if either the step size or the size of the system becomes too large, or the target density is poorly behaved. More numerical error leads to lower sample acceptance, which induces heavy autocorrelation, necessitating a larger sample size and thus higher computational costs.

One approach to ease this burden is to exploit the structure of the numerical integrator error and instead target the density corresponding to a modified, or shadow, Hamiltonian. This leads to a higher sample acceptance rate, shown in Figure 1, at the cost of some induced bias. This bias is easily quantified, and importance sampling is then performed to compensate (Radivojević & Akhmatskaya, 2019). Due to the structure of the shadow Hamiltonian, momentum refreshment requires another Metropolis–Hastings step to ensure sampler consistency. This results in a non-reversible sampler, and allows for partial momentum retention (Horowitz, 1991; Kennedy & Pendleton, 2001; Akhmatskaya & Reich, 2006; Sohl-Dickstein, 2012; Sohl-Dickstein et al., 2014), which may be of interest on its own.

In their landmark paper, Girolami & Calderhead (2011) proposed Riemannian manifold Hamiltonian Monte Carlo (RMHMC). This algorithm draws upon ideas from information geometry (Amari, 2016) to generalize HMC by traversing a Riemannian manifold. While other special classes of Riemannian manifolds have recently been studied (Barp et al., 2018, 2019), the original implementations target Bayesian posterior densities, and excel in the presence of complex geometric features. However, to date, there have been no efforts to develop a shadow Hamiltonian corresponding to the generalized leapfrog algorithm used in RMHMC.

Figure 1: Error due to Hamiltonian drift (left), and corresponding Metropolis–Hastings acceptance probabilities (right), in posterior samples from a Bayesian logistic regression model using the Australian credit dataset. 2000 proposed parameter samples were produced via RMHMC and our proposed SMHMC with h = 0.5, L = 6.

Contributions. Our contributions in this work can be summarized as follows.

I. A fourth-order shadow Hamiltonian that is preserved by the generalized leapfrog integrator up to O(h^4) in the step size h is derived.

II. We prove the existence of shadow Hamiltonians of any even order for arbitrary non-separable Hamiltonians and reversible symplectic integrators.

III. We introduce the shadow manifold Hamiltonian Monte Carlo algorithm (SMHMC), which is guaranteed to leave the target density invariant.

IV. Numerical examples are provided demonstrating advantages over existing alternatives.

V. A PyTorch implementation is made available at https://github.com/chrisvdh/shadowtorch.

2 Hamiltonian Monte Carlo

Markov chain Monte Carlo (MCMC) sampling techniques are a mainstay of computational statistics and machine learning. Many problems in scientific, medical, and industrial applications can be framed as Bayesian hierarchical problems, and their resolution often reduces to sampling from a posterior distribution in order to approximate some (often high-dimensional) analytically intractable integral. HMC is a highly successful MCMC technique that is somewhat resilient to the curse of dimensionality, without compromising convergence guarantees (Livingstone et al., 2016).

The goal of MCMC is to generate a set of correlated samples {θ_i}_{i=1}^n whose empirical distribution converges, as n → ∞, to a target measure with Lebesgue density π_θ on R^d. This enables the approximation of the integral E_π f = ∫_{R^d} f(θ) π_θ(θ) dθ of a π_θ-integrable function f : R^d → R via the Monte Carlo estimate n^{-1} Σ_{i=1}^n f(θ_i). In the sequel, we assume that all relevant functions are infinitely differentiable, although this can be relaxed in practice.

2.1 The Basic Setup

For a target density π_θ, let U : R^d → R be a potential energy function on the configuration space of a Hamiltonian system such that π_θ(θ) := e^{−U(θ)}/Z, where Z is a constant that ensures that π_θ integrates to unity. Under this interpretation, "energy wells" with low potential energy correspond to regions of high probability. In the basic HMC setup, the configuration space is augmented with a conjugate momentum variable p ∈ R^d, which is assigned a non-negative kinetic energy term K : R^d → R, typically taken to be K(p) = p^⊤ Σ^{−1} p / 2. Here, Σ is a real-valued positive-definite d × d matrix, commonly taken to be the identity or some other diagonal matrix. Under this choice, the conjugate momenta are assumed to be distributed via a multivariate Gaussian density π_p(p) := e^{−K(p)} / √((2π)^d |Σ|), where we denote by |Σ| the determinant of Σ. The Hamiltonian of the system is then defined by H(θ, p) := U(θ) + K(p), with the joint density of θ and p satisfying

π_H(θ, p) := e^{−H(θ,p)} / (Z √((2π)^d |Σ|)) = π_θ(θ) π_p(p).   (1)
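To fix ideas, the following is a minimal sketch of these energy functions in PyTorch (the language of our released implementation), for a standard Gaussian target with Σ = I. The function names here are illustrative only and are not the interface of the package linked above.

```python
import torch

def U(theta):
    """Potential energy: negative log-density, up to an additive constant,
    of a standard Gaussian target, so that pi_theta ∝ exp(-U)."""
    return 0.5 * theta.dot(theta)

def K(p):
    """Kinetic energy K(p) = p^T Sigma^{-1} p / 2, with Sigma = I."""
    return 0.5 * p.dot(p)

def H(theta, p):
    """Separable Hamiltonian H(theta, p) = U(theta) + K(p)."""
    return U(theta) + K(p)

theta, p = torch.randn(3), torch.randn(3)
print(H(theta, p))  # total energy of the joint state (theta, p)
```

The normalizing constant Z never needs to be computed, since it cancels in all the Metropolis–Hastings ratios below.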

RMHMC is a method introduced by Girolami & Calderhead (2011) that generalises HMC by endowing R^d with a Riemannian metric, and sampling from the resulting Riemannian manifold (M, G). In the basic implementation, the statistical manifold M is equipped with a metric G, taken to be a variant of the Fisher–Rao metric that incorporates the Bayesian prior. This metric gives a canonical way to measure the squared length of a momentum p at θ, which is incorporated into the kinetic energy term K(θ, p) = p^⊤ G^{−1}(θ) p / 2. This term is now position-dependent, and the corresponding conditional density becomes

π_{p|θ}(θ, p) := e^{−K(θ,p)} / √((2π)^d |G(θ)|).

The determinant term is then incorporated into the Hamiltonian

H(θ, p) := U(θ) + (1/2) log((2π)^d |G(θ)|) + K(θ, p),   (2)

and by construction the joint density is then given by

π_H(θ, p) := e^{−H(θ,p)} / Z = π_θ(θ) π_{p|θ}(θ, p).
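As a sketch of (2), assuming a user-supplied callable G that returns a positive-definite metric matrix at θ, the non-separable Hamiltonian can be assembled as follows; a single Cholesky factorization provides both the log-determinant and the solve with G(θ).

```python
import math
import torch

def rmhmc_hamiltonian(theta, p, U, G):
    """Non-separable RMHMC Hamiltonian (2):
    H(theta, p) = U(theta) + 0.5 * log((2*pi)^d |G(theta)|)
                + 0.5 * p^T G(theta)^{-1} p."""
    d = theta.shape[0]
    L = torch.linalg.cholesky(G(theta))               # G = L L^T
    half_logdet = torch.log(torch.diagonal(L)).sum()  # 0.5 * log|G(theta)|
    Ginv_p = torch.cholesky_solve(p.unsqueeze(-1), L).squeeze(-1)
    return (U(theta)
            + 0.5 * d * math.log(2.0 * math.pi) + half_logdet
            + 0.5 * p.dot(Ginv_p))
```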

Given some initial pair z_0 = (θ_0, p_0), (RM)HMC algorithms then draw a candidate sample from the joint density π_H by evolving the phase z_0 along a trajectory in phase space defined by Hamiltonian dynamics.

2.2 Hamiltonian Dynamics

The key component of HMC is Hamilton's equations of motion: for time t ∈ R and z(t) = (θ(t), p(t)), let

dθ_i/dt = ∂H/∂p_i,   dp_i/dt = −∂H/∂θ_i,   (3)

or, more succinctly,

dz/dt = J∇H(z),   where J := [ [0, I_{d×d}], [−I_{d×d}, 0] ],   (4)

and I_{d×d} is the d-dimensional identity matrix. The following three properties of Hamiltonian dynamics are key (see Neal (2010)).

1. Conservation of energy: It is immediate from equation (3) that the vector field dz/dt lies orthogonal to the gradient vector field of H; that is,

dH/dt = Σ_{i=1}^d [ (∂H/∂θ_i)(dθ_i/dt) + (dp_i/dt)(∂H/∂p_i) ] = (∇H)^⊤ J∇H = 0.

2. Reversibility: Let φ_t^H(z_0) denote the unique solution at time t to (3) with initial value z_0. Since H is time-homogeneous, there holds

φ_t^H ∘ φ_s^H(z_0) = φ_{s+t}^H(z_0);   i.e.,   φ_{−t}^H ∘ φ_t^H(z_0) = z_0,   (5)

or, in other words, φ_{−t}^H = (φ_t^H)^{−1}. Furthermore, if K is an even function, then we can obtain the inverse mapping by negating the sign of p, applying φ_t, and then negating p's sign again.

3. Volume preservation: Taking an open neighbourhood N ⊂ R^{2d} and evolving N by φ_t reveals m(φ_t(N)) = m(N), where m is the Lebesgue measure on R^{2d}. This is most easily seen by noting that the Hamiltonian vector field J∇H is divergence free; that is,

∇ · J∇H = Σ_{i=1}^d [ (∂/∂θ_i)(∂H/∂p_i) − (∂/∂p_i)(∂H/∂θ_i) ] = 0.

Given a candidate sample obtained by evolving the system (3) up to time T, these three properties allow for a simple Metropolis–Hastings acceptance criterion. Reversibility of the dynamics implies reversibility of the resultant Markov chain; therefore, the sampler leaves the target density invariant. Conservation of energy suggests a means of detecting numerically approximated trajectories that have not been faithful to (3) by computing H(z(T)) − H(z_0), and volume preservation removes the need to keep track of a Jacobian in this change of variables. This allows proposed samples to be accepted with probability

α = min{ 1, exp( H(θ, p) − H(θ(T), p(T)) ) },   (6)

and otherwise taking (θ, −p). This final momentum flip is to guarantee reversibility.

Finally, it is worth mentioning that while properties 1–3 are certainly desirable, they are by no means necessary for an effective sampling algorithm. Accurate samplers have been constructed that break each of these conditions (Lan et al., 2015; Radivojević & Akhmatskaya, 2019). In these scenarios, greater care must be taken to ensure invariance to the correct target density.
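In code, this accept/reject step is only a few lines. A sketch with our own naming conventions, taking the energies of the initial and proposed states:

```python
import torch

def metropolis_step(z0, zT, H):
    """Accept the trajectory endpoint zT = (theta_T, p_T) with probability
    (6), i.e. min{1, exp(H(z0) - H(zT))}; otherwise keep theta and flip p."""
    log_alpha = H(*z0) - H(*zT)
    if torch.log(torch.rand(())) < log_alpha:
        return zT
    theta0, p0 = z0
    return (theta0, -p0)  # momentum flip preserves reversibility
```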

2.3 Symplectic Integrators

Exact solutions to (3) are accepted with probability one, but are unavailable in non-trivial settings. Instead, it is necessary to approximate the solutions via numerical integration techniques. Doing so no longer preserves the Hamiltonian, and so changes the acceptance probability. Approximation errors and their corresponding Metropolis–Hastings acceptance probabilities for samples drawn from a Bayesian logistic regression posterior density on a standard dataset are shown in Figure 1.

Fortunately, there is a well-studied, highly accurate family of geometric numerical integrators that preserve volume. These are known as symplectic numerical integrators, whose one-step map Φ : R^{2d} → R^{2d} satisfies DΦ(z)^⊤ J DΦ(z) = J, where DΦ(z) denotes the 2d × 2d Jacobian of Φ at z (Bou-Rabee & Sanz-Serna, 2017). It is clear from this definition that the composition of two symplectic maps is also symplectic, which allows us to approximate dynamics along lengthy trajectories by composing chains of short symplectic steps. We will see in Section 3.1 that by enforcing our numerical integrator to be symplectic, we can compute a family of modified or shadow Hamiltonians that are conserved up to arbitrary order in the step size.

In the Euclidean case (HMC), the Hamiltonian is separable; that is, H(θ, p) = U(θ) + K(p). Systems of this type admit a certain type of decomposition or splitting, since the Hamiltonian vector field also decomposes into J∇H(z) = J∇U(θ) + J∇K(p). This induces the pair of split systems

dθ_i/dt = ∂K/∂p_i, dp_i/dt = 0   and   dp_i/dt = −∂U/∂θ_i, dθ_i/dt = 0.   (7)

Individually, these systems are analytically solvable for arbitrary times. However, doing so results in large errors, due to the effective decoupling caused by the splitting. By using a relatively small step size, we can mitigate this error by iterative updates, each of which exactly solves the split systems (7):

ψ_h^U(θ_{n+1}, p_n) := p_n − h (∂U/∂θ)(θ_{n+1}) = p_{n+1},
ψ_h^K(θ_n, p_n) := θ_n + h (∂K/∂p)(p_n) = θ_{n+1}.   (8)

While each of these steps is reversible, their composition is not. Instead, by composing (8) with its reversal, we obtain the palindromic Störmer–Verlet scheme

p_{n+1/2} = p_n − (h/2) (∂U/∂θ)(θ_n),
θ_{n+1} = θ_n + h (∂K/∂p)(p_{n+1/2}),   (9)
p_{n+1} = p_{n+1/2} − (h/2) (∂U/∂θ)(θ_{n+1}),

which is accurate up to O(h²). This integrator is reversible, since (ψ^U_{−h/2} ∘ ψ^K_{−h} ∘ ψ^U_{−h/2}) ∘ (ψ^U_{h/2} ∘ ψ^K_h ∘ ψ^U_{h/2})(z_0) = z_0. A number of higher-order integrators have also been studied (Yoshida, 1990; Radivojević et al., 2018), and more recently applied to problems in Bayesian statistics (Radivojević & Akhmatskaya, 2019).

In order to approximate a trajectory up to time T = hL for some L ∈ N, we simply compose (9) L times:

Φ := (ψ^U_{h/2} ∘ ψ^K_h ∘ ψ^U_{h/2}) ∘ ··· ∘ (ψ^U_{h/2} ∘ ψ^K_h ∘ ψ^U_{h/2})   (L times).   (10)

When considering Hamiltonians of the form (2), updates of the form (8) based on approximations given by (7) do not have unit Jacobian determinant, and do not provide reversible updates (Girolami & Calderhead, 2011). Instead, standard implementations employ an implicit numerical integrator called the generalized leapfrog algorithm to address these issues.

The generalized leapfrog method is constructed in the same way as the leapfrog method. By composing two partitioned Euler methods, a symplectic method is obtained (Hairer et al., 2006, VI.3). The updates with step size h are of the form

p_{n+1/2} = p_n − (h/2) ∇_θ H(θ_n, p_{n+1/2}),
θ_{n+1} = θ_n + (h/2) [ ∇_p H(θ_n, p_{n+1/2}) + ∇_p H(θ_{n+1}, p_{n+1/2}) ],   (11)
p_{n+1} = p_{n+1/2} − (h/2) ∇_θ H(θ_{n+1}, p_{n+1/2}).

This method is again accurate up to second order in h, and again, we compute the trajectory up to time T = hL by composing (11) L times, analogous to (10). The distinguishing feature here is that each variable on the left-hand side of the first two lines in (11) also appears on the right-hand side; that is, it is implicitly defined.

The immediate drawback of implicit numerical methods is that, in contrast to (8), the sub-problems can no longer be exactly solved. Fixed-point iterations are commonly used to approximate the updates, although other implicit methods for Euclidean HMC have been proposed that employ Newton's method (Pourzanjani & Petzold, 2019).
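The following is a sketch of one generalized leapfrog step (11), with each implicit sub-step approximated by fixed-point iteration as just described. Gradients of a user-supplied Hamiltonian are obtained via automatic differentiation; the iteration count is an illustrative default, not the setting used in our experiments.

```python
import torch

def grads(H, theta, p):
    """Return (dH/dtheta, dH/dp) for a smooth scalar Hamiltonian H."""
    theta = theta.detach().requires_grad_(True)
    p = p.detach().requires_grad_(True)
    dtheta, dp = torch.autograd.grad(H(theta, p), (theta, p))
    return dtheta, dp

def generalized_leapfrog_step(H, theta, p, h, n_fp=6):
    """One step of (11); each implicit equation is solved by n_fp
    fixed-point iterations, started from the explicit value."""
    # half-step momentum update, implicit in p_half
    p_half = p.clone()
    for _ in range(n_fp):
        dtheta, _ = grads(H, theta, p_half)
        p_half = p - 0.5 * h * dtheta
    # full position update, implicit in theta_new
    theta_new = theta.clone()
    _, dp_old = grads(H, theta, p_half)
    for _ in range(n_fp):
        _, dp_new = grads(H, theta_new, p_half)
        theta_new = theta + 0.5 * h * (dp_old + dp_new)
    # explicit half-step momentum update
    dtheta, _ = grads(H, theta_new, p_half)
    p_new = p_half - 0.5 * h * dtheta
    return theta_new.detach(), p_new.detach()
```

Composing this step l times yields the map Φ_h^l appearing in Algorithm 1 below.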
Recently, an explicit method has also been developed (Cobb et al., 2019), based on an approach for approximating the dynamics of systems with non-separable Hamiltonians in the physics literature (Tao, 2016; Pihajoki, 2015). This technique does so by simulating parallel problems on an extended phase space (the product space of two copies of the cotangent bundle), with a 'binding step' in between. The integrator is computationally cheaper than (11), and has a known fourth-order shadow Hamiltonian (Tao, 2016). However, it is not clear how to incorporate this into a shadow HMC algorithm in a way that ensures the resulting sampler leaves the target density invariant. Furthermore, it introduces an extra binding hyperparameter, which the stability of the trajectories appears to depend upon in a delicate way.

3 Shadow Manifold Hamiltonian Monte Carlo

HMC methods rely on the numerical integrators' accurate simulation of the trajectories dictated by Hamiltonian dynamics. However, leapfrog and generalized leapfrog integrators only preserve the Hamiltonian up to second order. In order to increase accuracy, one could design more accurate integrators, although these tend to be computationally expensive. Our approach relies on backward error analysis to instead derive a shadow Hamiltonian, whose energy is more accurately conserved by the generalized leapfrog algorithm. By instead targeting the corresponding shadow density, we see higher sample acceptance probabilities (Figure 1). Importance sampling is then employed to correct these samples towards the true density. This efficient combination of MCMC and importance sampling allows for even greater performance than either strategy alone.

3.1 Shadow Hamiltonians

Shadow Hamiltonians are commonly obtained by truncating the Baker–Campbell–Hausdorff (BCH) formula applied to Poisson brackets of the terms of a separable Hamiltonian (Barp et al., 2018). Since short-time solutions to Hamilton's equations are realized as an exponential of the Hamiltonian vector field, the BCH formula gives us a way to track the approximation error induced by iterative composition of the non-commutative split vector fields. The underlying algebraic relationship between Hamiltonian vector fields and Hamiltonians provides a clean representation in terms of an asymptotic expansion of nested Poisson brackets. This expansion, when truncated to an appropriate order, reveals the shadow Hamiltonian.

Recall that for the leapfrog integrator (8), the Hamiltonian vector field can be split into two orthogonal vector fields, J∇U and J∇K. By applying the BCH formula to each composition in (9), the fourth-order shadow Hamiltonian for the leapfrog integrator with step size h (Radivojević & Akhmatskaya, 2019) is obtained:

H_L = H + (h²/12) {K, {K, U}} − (h²/24) {U, {U, K}}   (12)
    = H + (h²/12) p^⊤ M^{−1} U_{θθ} M^{−1} p − (h²/24) ∇U^⊤ M^{−1} ∇U,

where the Poisson bracket satisfies {f, g} = (∂f/∂θ_i)(∂g/∂p_i) − (∂f/∂p_i)(∂g/∂θ_i). Here, we denote the Hessian of U by U_{θθ} and suppress dependence on θ and p.

However, the generalized leapfrog integrator is more delicate. The BCH formula can only be naively applied to time-homogeneous systems of ODEs, and the integrator's implicit nature means that it can no longer be split into time-homogeneous Hamiltonian vector fields. Taking a continuous extension of (11) in t, we have that (θ(t), p(t)) satisfies

q(t) = p_0 − (t/2) ∇_θ H(θ_0, q(t)),
θ(t) = θ_0 + (t/2) [ ∇_p H(θ_0, q(t)) + ∇_p H(θ(t), q(t)) ],   (13)
p(t) = q(t) − (t/2) ∇_θ H(θ(t), q(t)).

Differentiating and solving for (dθ/dt, dp/dt), we find that the vector field governing the system varies with time. This necessitates a different approach.

To compute the fourth-order shadow Hamiltonian, we must first identify its corresponding vector field up to third order. This is done by comparing the third-order Taylor expansions of the discretized solution (θ̂(h), p̂(h)) and the analytic solution (θ(h), p(h)). By appropriately integrating the vector field with respect to θ and p, we can then identify the fourth-order shadow Hamiltonian H^[4].

Theorem 1. Let M = R^d, and let H : R^d × R^d → R be a smooth Hamiltonian function. The fourth-order shadow Hamiltonian H^[4] : R^{2d} → R corresponding to the generalized leapfrog integrator is given by

H^[4](θ, p) = H(θ, p) + (h²/12) ( ∇_p H ∇_{θθ} H ∇_p H − (1/2) ∇_θ H ∇_{pp} H ∇_θ H + ∇_θ H ∇_{θp} H ∇_p H ).   (14)

Comparing equations (14) and (12), we note an additional term containing mixed partial derivatives of the Hamiltonian in θ and p. Consequently, when the Hamiltonian is separable (as in the Euclidean case), we recover (12).

The complete proof is tedious, and is relegated to the supplementary material. In our implementation, the second term in (14) is computed using quantities already required for the dynamics, while the other terms of the shadow Hamiltonian can be computed using automatic differentiation. Judicious use of matrix-vector products allows this to be done in a way that does not dominate the O(d³) complexity of matrix inversion or decomposition. Difference quotient approximations, such as those used by Radivojević & Akhmatskaya (2019), could also be applied.
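Each term in (14) requires only Hessian-vector products of H, which automatic differentiation provides without forming full Hessians. Below is a sketch of this computation; it is our own illustration of the idea, not necessarily how the released implementation organises it.

```python
import torch

def shadow_hamiltonian_4(H, theta, p, h):
    """Fourth-order shadow Hamiltonian (14) for the generalized leapfrog
    integrator, assembled from Hessian-vector products of H."""
    theta = theta.detach().requires_grad_(True)
    p = p.detach().requires_grad_(True)
    energy = H(theta, p)
    g_theta, g_p = torch.autograd.grad(energy, (theta, p), create_graph=True)
    v_theta, v_p = g_theta.detach(), g_p.detach()
    # Hessian-vector products, e.g. d/dtheta (grad_theta H . v_p) = H_qq v_p
    Hqq_vp = torch.autograd.grad(g_theta.dot(v_p), theta, retain_graph=True)[0]
    Hpp_vq = torch.autograd.grad(g_p.dot(v_theta), p, retain_graph=True)[0]
    Hqp_vp = torch.autograd.grad(g_p.dot(v_p), theta, retain_graph=True)[0]
    correction = (v_p.dot(Hqq_vp)             # grad_p H . H_qq . grad_p H
                  - 0.5 * v_theta.dot(Hpp_vq) # grad_q H . H_pp . grad_q H
                  + v_theta.dot(Hqp_vp))      # grad_q H . H_qp . grad_p H
    return energy.detach() + (h ** 2 / 12.0) * correction
```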
Proceeding further, this construction can be extended to obtain shadow Hamiltonians of arbitrary even order. Since modern statistical applications are increasingly concerned with more exotic spaces, it is natural to ask if such a construction could be considered on, for example, spheres and Stiefel manifolds. Hairer (2003) provides local results, and a construction when the manifold M is given by a level set of a known smooth function. For another broad class of manifolds, and any reversible symplectic numerical integrator, we have the following global result.

Theorem 2. Let M be a smooth simply connected Riemannian manifold with cotangent bundle T*M, and let H : T*M → R be a smooth Hamiltonian function. Then for any fixed reversible symplectic integrator, there exists a family of shadow Hamiltonians H^[2k] : T*M → R indexed by k ∈ N, such that H^[2k] is preserved by the integrator with step size h up to O(h^{2k}).

Here we have required the topological property that the base manifold is simply connected. Roughly speaking, this means that the manifold has no 'holes', which is satisfied by a wide class of manifolds that may be of interest in statistical applications. However, known HMC methods on these classes of manifolds are either constructed using constrained Hamiltonians (Brubaker et al., 2012), or directly exploit additional structure in their geometry (Barp et al., 2019). In either case, the generalized leapfrog integrator is not used, so Theorem 1 cannot be used directly. However, the procedure used in its proof can be applied to obtain a local shadow Hamiltonian, providing avenues for future research.

The proof of Theorem 2, also relegated to the supplementary material, is a straightforward consequence of de Rham's theorem (Warner, 2013). While this is a basic result in algebraic and differential topology, it is largely unknown to the statistics and machine learning communities. We remark that for fixed positive h, one may expect H^[2k] to diverge as k → ∞.

Upon inspection of (14), we notice that Hamiltonians of the form (2) may not amount to desirable tail behaviour, which can cause difficulties in integrability of the resulting shadow density, and may provide an obstruction to geometric ergodicity.
To remedy this, we can follow Izaguirre & Hampton (2004) and set

𝓗(θ, p) := max{ H^[4](θ, p) + c, H(θ, p) },   (15)

where c is a constant that is chosen by the practitioner. This guarantees that the tail behaviour of 𝓗 is always dictated by the Hamiltonian itself, which in turn controls the asymptotic decay of the density and the performance of subsequent importance sampling.

3.2 Momentum Refreshment

A key distinction between shadow Hamiltonian methods and regular HMC is that the shadow density's conditional density π_𝓗(p | θ) is no longer Gaussian. This means that naïvely sampling new Gaussian momenta and accepting with probability one no longer leaves the conditional density invariant. Moreover, we can no longer expect to be able to analytically marginalize out the momenta. This was initially circumvented via rejection sampling (Izaguirre & Hampton, 2004). At each refreshment step a new momentum proposal is generated as p̂ ∼ N(0, G(θ)), and accepted with probability min{1, exp(H(θ, p̂) − 𝓗(θ, p̂))}. If the new momentum is rejected, this is repeated until a sample is accepted. This results in a method that more accurately reflects the original implementation of HMC; however, repeated rejection can result in many computationally expensive shadow Hamiltonian evaluations.

More recent approaches utilize partial momentum refreshment via an additional Metropolis–Hastings step; see Kennedy & Pendleton (2001); Akhmatskaya & Reich (2006), and also Radivojević & Akhmatskaya (2019) for a brief review. In this regime, an auxiliary noise vector u ∼ N(0, G(θ)) is drawn and a momentum proposal is generated via the mapping R : R^d × R^d → R^d × R^d that satisfies

R(p, u) = ( ρ p + √(1−ρ²) u, −√(1−ρ²) p + ρ u ).

The new parameter ρ = ρ(θ, p, u) takes values in (0, 1] and controls the extent of the momentum retention. The proposals are then accepted according to the modified density corresponding to 𝓗(θ, p, u) = 𝓗(θ, p) + (1/2) u^⊤ G^{−1}(θ) u. The updated momentum is then taken to be

ρ p + √(1−ρ²) u   with probability γ,   and p otherwise,

where

γ := min{ 1, exp( 𝓗(θ, p, u) − 𝓗(θ, R(p, u)) ) }.   (16)

This procedure results in a Markov chain that preserves some of the dynamics between subsequent samples. Note that ρ introduces another degree of freedom into the sampler, and can be chosen to depend on θ and p.

3.3 The SMHMC Sampler

An algorithmic description of the SMHMC sampler is provided in Algorithm 1. Since the sampler now involves a composition of two reversible Metropolis–Hastings steps, the resulting Markov chain is no longer reversible. By breaking the detailed balance relations, it is no longer immediately clear that the target density is stationary, and so this must be demonstrated. To this end we have the following guarantee, the proof of which is taken from Radivojević & Akhmatskaya (2019) or Fang et al. (2014), after noting that the explicit form of the shadow density plays no role.

Theorem 3. The SMHMC sampler leaves the target density π_𝓗 invariant.
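A sketch of the partial refreshment step used in lines 5–6 of Algorithm 1 below; shadow_H and metric are assumed callables for 𝓗(θ, ·) and G(·), the fixed scalar rho is for illustration, and the extended energy 𝓗(θ, p, u) is as defined above.

```python
import torch

def refresh_momentum(theta, p, rho, shadow_H, metric):
    """Partial momentum refreshment: propose R(p, u) and accept with
    probability gamma from (16); the position theta is left unchanged."""
    G = metric(theta)
    L = torch.linalg.cholesky(G)
    u = L @ torch.randn_like(p)              # u ~ N(0, G(theta))
    c = (1.0 - rho ** 2) ** 0.5
    p_new, u_new = rho * p + c * u, -c * p + rho * u

    def ext(p_, u_):                         # extended energy H(theta, p, u)
        w = torch.cholesky_solve(u_.unsqueeze(-1), L).squeeze(-1)
        return shadow_H(theta, p_) + 0.5 * u_.dot(w)

    log_gamma = ext(p, u) - ext(p_new, u_new)
    if torch.log(torch.rand(())) < log_gamma:
        return p_new
    return p
```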

Algorithm 1: SMHMC
1: Input: maximum number of steps L, step size h, Riemannian metric G, momentum update parameter ρ, number of Monte Carlo samples n, initial values (θ_0, p_0).
2: for i = 1 to n do
3:   sample the number of steps l from {1, ..., L}.
4:   store (θ, p) ← (θ_{i−1}, p_{i−1}).
5:   sample momentum update proposal u ∼ N(0, G(θ)).
6:   update p ← ρ p + √(1−ρ²) u with probability γ, defined in (16).
7:   compute the shadow Hamiltonian 𝓗(θ, p).
8:   integrate Hamiltonian dynamics: (θ̂, p̂) ← Φ_h^l(θ, p).
9:   accept sample (θ_i, p_i) ← (θ̂, p̂) with probability β, and reject (θ_i, p_i) ← (θ, −p) otherwise. Here, β = min{1, exp(−Δ𝓗)}, where Δ𝓗 = 𝓗(θ̂, p̂) − 𝓗(θ, p).
10:  compute the importance sampling weight w_i = exp(𝓗(θ_i, p_i) − H(θ_i, p_i)).
11: end for
12: return (θ_i, p_i, w_i)_{i=0}^n

Figure 2: Trajectories of the first 20 samples of the banana-shaped density drawn by HMC, MMHMC, RMHMC and SMHMC. Accepted samples are denoted by red dots.

Of course, the target density of the SMHMC sampler is proportional to exp(−𝓗(θ, p)), and is therefore not the target density π_θ in general. In order to correct for this bias, samples can then be reweighted via importance sampling, with the i-th sample's weight given by w_i = exp(𝓗(θ_i, p_i) − H(θ_i, p_i)). Integrals of the form ∫_{R^d} f(θ) π_θ(θ) dθ = ∫_{R^d×R^d} f(θ) [π_H(θ, p)/π_𝓗(θ, p)] π_𝓗(θ, p) dθ dp can then be approximated via the importance sampling estimator I_n(f) = Σ_{i=1}^n w̄_i f(θ_i), where w̄_i = w_i / (Σ_{i=1}^n w_i) are the normalized weights.

One potential drawback of importance sampling is its deterioration in performance as the sampling and target densities grow far apart. Fortunately, the smoothness of (14) guarantees pointwise control over the behaviour of the weights in the continuous limit. A Taylor expansion of w_i in view of (14) shows that w_i = 1 + O(h²) almost surely as h → 0+. The performance in the importance sampling step should be adequate provided h is not too large.

4 Numerical Experiments

We consider three test problems to demonstrate the dynamics and performance of SMHMC. The first is a toy example that illustrates the dynamics of SMHMC compared to HMC, MMHMC (Radivojević & Akhmatskaya, 2019) and RMHMC. We then consider performance in a basic statistical problem, namely a Bayesian logistic regression model, over various benchmark datasets. This is a standard example where RMHMC outperforms HMC in low dimensions, but begins to struggle in higher dimensions. We then turn our attention to Neal's funnel problem (Neal, 2003), which is a simple example that captures the pathological geometric features typical of those found in Bayesian hierarchical models. Performance is measured via sample acceptance rate, minimum effective sample size (ESS), and ESS per second; details on their computation are provided in the supplementary material.

4.1 Banana-shaped Density

The banana-shaped density is a variant of the Rosenbrock function suggested by Bornn and Cornebise in their discussion of Girolami & Calderhead (2011) as an example with strong ridge-like geometric features typical of those found in non-identifiable models. We tested the dynamics of SMHMC, RMHMC, HMC and MMHMC in sampling the two-dimensional posterior density π(θ | y) based on the following model:

4 Numerical Experiments 2 2 2 y | θ ∼ N (θ1 + θ2, σy), θ1, θ2 ∼ N (0, σθ ). We consider three test problems to demonstrate the dy- Following Lan et al. (2015); Radivojevi´c & namics and performance of SMHMC. The first is a toy Akhmatskaya (2019), one hundred data points 100 2 example that illustrates the dynamics of SMHMC com- {yi}i=1 are generated with θ1 + θ2 = 1, σy = 2 and pared to HMC, MMHMC (Radivojevi´c& Akhmatskaya, σθ = 1. Samples are then drawn from the posterior 2019) and RMHMC. We then consider performance in density across our range of samplers. The dynamics of a basic statistical problem, namely a Bayesian Logistic the HMC, MMHMC, RMHMC and SMHMC samplers Shadow Manifold Hamiltonian Monte Carlo

Table 1: Summary of acceptance rates, minimum effective sample size, and minimum effective sample size per second for posterior samples of Bayesian logistic regression on four datasets. Reported values are averaged across 10 chains.

Data set      d    h    α  | Acceptance (RMHMC / SMHMC) | minESS (RMHMC / SMHMC) | minESS/s (RMHMC / SMHMC)
Australian   15  0.5  100  | 0.9237 / 0.9929            | 5167.19 / 6212.87      | 56.0732 / 52.6366
German       25  0.5  100  | 0.8933 / 0.9727            | 3758.42 / 5452.45      | 35.1639 / 40.9878
Parkinson's  46  0.3    1  | 0.9391 / 0.9933            | 1553.08 / 2469.34      | 14.4799 / 17.8577
Sonar        61  0.3    1  | 0.8898 / 0.9639            | 1371.66 / 2273.69      |  9.5294 / 12.9841

The dynamics of the HMC, MMHMC, RMHMC and SMHMC samplers are illustrated in Figure 2. This example clearly delineates the effect of the geometric adaptation on the sampler's dynamics, while the effects of the partial momentum retention are less pronounced.

4.2 Bayesian Logistic Regression

Bayesian logistic regression is a standard tool for binary classification that is ubiquitous in the medical, scientific, engineering, and financial fields, to name a few (Gelman et al., 2004). We present results from the analysis of four datasets, retrieved from the UCI Machine Learning Repository (Dua & Graff, 2017).

As is common practice, each dataset is centred and normalized to have zero mean and unit variance in each dimension. The potential U(θ) is then taken to be the negative log-likelihood of the data at θ, and a Gaussian prior with variance α on each of the parameters is imposed. We took the metric G to be the prior-adapted Fisher information from Girolami & Calderhead (2011), estimated via the (positive-definite) negative empirical Hessian. To this end, 10 chains of 5000 samples were drawn from the posterior of each dataset, after 500 samples of burn-in. We set ρ = 0.25 in the SMHMC sampler. Results are presented in Table 1.

4.3 Neal's Funnel Density

The funnel density was suggested by Neal (2003) as a sampling problem that exhibits behaviour typical of pathologies that arise in Bayesian hierarchical and latent variable models, particularly those with sparse datasets (Betancourt & Girolami, 2015). The model treats the variance of the parameters as a latent log-normal random variable, which leads to non-convex exponential cusping behaviour in the negative log-density. Neal (2003) considered a 10-dimensional funnel; here, we instead consider a 30-dimensional example. The target density is defined to satisfy π(θ, v) := N(v | 0, 9) Π_{i=1}^{29} N(θ_i | 0, e^v). This is a problem for which HMC typically demonstrates poor performance (Betancourt, 2013; Cobb et al., 2019), and struggles to draw samples from inside the 'throat' of the funnel. Since the negative log-density is highly non-convex, the SoftAbs metric is employed to give a positive-definite approximation of the expected Hessian, while retaining its eigenvectors (Betancourt, 2013).

We ran 10 chains of 2000 samples each for the HMC, RMHMC and SMHMC samplers on the 30-dimensional funnel. Following a few pilot runs, a step size of h = 0.3 with a maximum of 64 leapfrog steps per sample trajectory was chosen for the geometric algorithms. For HMC, these hyperparameters were set to 0.075 and 500.
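For reference, a sketch of the funnel's negative log-density under the parameterisation above, with v the log-variance and the last coordinate of the state vector:

```python
import torch

def funnel_potential(z):
    """Negative log-density (up to a constant) of the 30-dimensional funnel:
    v ~ N(0, 9), theta_i | v ~ N(0, exp(v)) for i = 1, ..., 29,
    with z = (theta_1, ..., theta_29, v)."""
    theta, v = z[:-1], z[-1]
    log_p_v = -0.5 * v ** 2 / 9.0
    log_p_theta = -0.5 * ((theta ** 2) * torch.exp(-v)).sum() - 0.5 * 29 * v
    return -(log_p_v + log_p_theta)

print(funnel_potential(torch.zeros(30)))
```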

Figure 3: Samples drawn via HMC, RMHMC and SMHMC on the first two dimensions of a 30-dimensional funnel. van der Heide, Hodgkinson, Roosta, Kroese

Table 2: Summary of acceptance rates, minimum effective sample size, and minimum effective sample size per second for samples drawn from the funnel density. Reported values are averaged across 10 chains.

           Accept   minESS   minESS/s
HMC        0.9771     7.32     0.0220
RMHMC      0.9576   397.48     1.0311
RMHMCρ     0.9640   550.01     1.3706
SMHMC      0.9878   545.29     1.3167
SMHMCρ     0.9843   578.25     1.4003

The momentum retention ρ was set to zero and 0.25 in the RMHMC/SMHMC and RMHMCρ/SMHMCρ variants, respectively. Figure 3 shows the locations of these samples. The SMHMC sampler exhibits comparable throat penetration to the RMHMC sampler. Acceptance rates, minimum ESS and minimum ESS per second are reported in Table 2.

5 Conclusion

We have extended shadow Hamiltonian Monte Carlo methods by deriving the shadow Hamiltonian generated by the generalized leapfrog numerical integrator. The resulting SMHMC algorithm shows promise as a means of pushing past the limitations of RMHMC while retaining the advantages of adapting the sampler to the geometry of the target density. In particular, SMHMC shows promise for deployment in complex higher-dimensional hierarchical Bayesian models, where the dimensionality causes RMHMC's performance to wane.

Acknowledgements

This work has been supported by the Australian Research Council Centre of Excellence for Mathematical & Statistical Frontiers (ACEMS), under grant number CE140100049.

References

Akhmatskaya, E. and Reich, S. The targeted shadowing hybrid Monte Carlo (TSHMC) method. In New Algorithms for Macromolecular Simulation, Lecture Notes in Computational Science and Engineering, vol. 49, pp. 1–51. Springer, 2006.

Amari, S.-I. Information Geometry and Its Applications. Springer, Berlin, 1st edition, 2016.

Barp, A., Briol, F.-X., Kennedy, A. D., and Girolami, M. Geometry and dynamics for Markov chain Monte Carlo. Annual Review of Statistics and Its Application, 5(1):451–471, 2018.

Barp, A., Kennedy, A., and Girolami, M. Hamiltonian Monte Carlo on symmetric and homogeneous spaces via symplectic reduction. arXiv preprint arXiv:1903.02699, 2019.

Betancourt, M. A general metric for Riemannian manifold Hamiltonian Monte Carlo. In Nielsen, F. and Barbaresco, F. (eds.), First International Conference on the Geometric Science of Information, volume 8085 of Lecture Notes in Computer Science, pp. 327–334, 2013.

Betancourt, M. and Girolami, M. Hamiltonian Monte Carlo for hierarchical models. Current Trends in Bayesian Methodology with Applications, 54:79–101, 2015.

Betancourt, M., Byrne, S., Livingstone, S., and Girolami, M. The geometric foundations of Hamiltonian Monte Carlo. Bernoulli, 23, 2017.

Bou-Rabee, N. and Sanz-Serna, J. M. Yet more elementary proofs that the determinant of a symplectic matrix is 1. Linear Algebra and its Applications, 515, 2017.

Brubaker, M., Salzmann, M., and Urtasun, R. A family of MCMC methods on implicitly defined manifolds. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, volume 22, pp. 161–172, La Palma, Canary Islands, 2012. PMLR.

Cobb, A., Baydin, A., Markham, A., and Roberts, S. Introducing an explicit symplectic integration scheme for Riemannian manifold Hamiltonian Monte Carlo. arXiv preprint arXiv:1910.06243, 2019.

Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

Duane, S., Kennedy, A., Pendleton, B. J., and Roweth, D. Hybrid Monte Carlo. Physics Letters B, 1987.

Fang, Y., Sanz-Serna, J., and Skeel, R. Compressible generalized hybrid Monte Carlo. The Journal of Chemical Physics, 140:174108, 2014.

Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. Bayesian Data Analysis. Chapman and Hall/CRC, 2nd edition, 2004.

Girolami, M. and Calderhead, B. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011.
Hairer, E. Global modified Hamiltonian for constrained symplectic integrators. Numerische Mathematik, 95(2):325–336, 2003.

Hairer, E., Lubich, C., and Wanner, G. Geometric Numerical Integration. Springer, Berlin, 2nd edition, 2006.

Hatcher, A. Algebraic Topology. Cambridge University Press, Cambridge, 2000.

Horowitz, A. M. A generalized guided Monte Carlo algorithm. Physics Letters B, 1991.

Izaguirre, J. and Hampton, S. Shadow hybrid Monte Carlo: An efficient propagator in phase space of macromolecules. Journal of Computational Physics, 200, 2004.

Kennedy, A. and Pendleton, B. Cost of the generalised hybrid Monte Carlo algorithm for free field theory. Nuclear Physics B, 2001.

Lan, S., Stathopoulos, V., Shahbaba, B., and Girolami, M. Markov chain Monte Carlo from Lagrangian dynamics. Journal of Computational and Graphical Statistics, 24, 2015.

Livingstone, S., Betancourt, M., Byrne, S., and Girolami, M. On the geometric ergodicity of Hamiltonian Monte Carlo. Bernoulli, 25:3109–3138, 2016.

Neal, R. M. Bayesian Learning for Neural Networks. Springer, Berlin, 1996.

Neal, R. M. Slice sampling. Annals of Statistics, 31(3):705–767, 2003.

Neal, R. M. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 54:113–162, 2010.

Pihajoki, P. Explicit methods in extended phase space for inseparable Hamiltonian problems. Celestial Mechanics and Dynamical Astronomy, 121(3), 2015.

Pourzanjani, A. and Petzold, L. Implicit Hamiltonian Monte Carlo for sampling multiscale distributions. arXiv preprint arXiv:1911.05754, 2019.

Radivojević, T. and Akhmatskaya, E. Modified Hamiltonian Monte Carlo for Bayesian inference. Statistics and Computing, 2019.

Radivojević, T., Fernández-Pendás, M., Sanz-Serna, J. M., and Akhmatskaya, E. Multi-stage splitting integrators for sampling with modified Hamiltonian Monte Carlo methods. Journal of Computational Physics, 373, 2018.

Sohl-Dickstein, J. Hamiltonian Monte Carlo with reduced momentum flips. arXiv preprint arXiv:1205.1939, 2012.

Sohl-Dickstein, J., Mudigonda, M., and DeWeese, M. Hamiltonian Monte Carlo without detailed balance. Proceedings of the 31st International Conference on Machine Learning, 32, 2014.

Tao, M. Explicit symplectic approximation of nonseparable Hamiltonians: Algorithm and long time performance. Physical Review E, 94, 2016.

Warner, F. W. Foundations of Differentiable Manifolds and Lie Groups. Graduate Texts in Mathematics. Springer, New York, 1983.

Warner, F. W. Foundations of Differentiable Manifolds and Lie Groups, volume 94. Springer Science & Business Media, 2013.

Whitehead, G. W. Elements of Homotopy Theory. Graduate Texts in Mathematics. Springer, New York, 1978.

Yoshida, H. Construction of higher order symplectic integrators. Physics Letters A, 150(5), 1990.

SUPPLEMENTARY MATERIALS

A Effective Sample Size

A common measure of the efficiency of a sampling algorithm is the effective sample size (ESS), representing the estimated equivalent number of iid samples drawn. For (unweighted) Markov chains, motivated by Markov chain central limit theory, it is typical to consider ESS in each dimension defined via

ESS_MC := n / ( 1 + 2 Σ_{i=1}^T ρ̂_i ),   (17)

where ρ̂_i is an estimate of the lag-i autocorrelation of the Markov chain, and T is the stopping time given in Section 11.5 of Gelman et al. (2004). Following Girolami & Calderhead (2011), we account for the worst-case performance in the chain by reporting the minimum of (17) taken over all dimensions in the chain. There is some ambiguity when comparing notions of effective sample size for Markov chains that have been reweighted via importance sampling. For n samples reweighted by importance sampling, Kish's approximation to the ESS, given by

ESS_K := ( Σ_{j=1}^n w̄_j² )^{−1},   (18)

where w̄_j = w_j / Σ_{k=1}^n w_k, is most common. Roughly speaking, this accounts for the shift in sampling efficiency caused by imbalances in the importance sampling weights. One may verify for equally weighted samples (w̄_j = 1/n for all j = 1, ..., n) that ESS_K = n. Motivated by this approximation, in order to account for the effects of both sample autocorrelation and reweighting via importance sampling, we approximated the effective sample size under importance sampling by

ESS := (ESS_K / n) ESS_MC = ( Σ_{j=1}^n w̄_j² )^{−1} / ( 1 + 2 Σ_{i=1}^T ρ̂_i ),   (19)

which also reduces to (17) when all the samples are weighted by unity. Once again, we report the minimum of (19) taken across all dimensions in the chain to account for worst case performance.
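A sketch of (17)–(19) in code, with one simplifying assumption: the autocorrelation sum is truncated at the first lag whose estimate turns negative, a common heuristic standing in for the Gelman et al. (2004) stopping time.

```python
import numpy as np

def ess_importance(samples, weights):
    """Per-dimension ESS (19): Kish factor (18) times the autocorrelation
    ESS (17), divided by n; the minimum across dimensions is reported."""
    n, d = samples.shape
    w_bar = weights / weights.sum()
    kish = 1.0 / (w_bar ** 2).sum()                    # (18)
    ess = np.empty(d)
    for j in range(d):
        x = samples[:, j] - samples[:, j].mean()
        acf = np.correlate(x, x, mode="full")[n - 1:] / (x @ x)
        T = next((t for t in range(1, n) if acf[t] < 0.0), n - 1)
        ess_mc = n / (1.0 + 2.0 * acf[1:T].sum())      # (17)
        ess[j] = (kish / n) * ess_mc                   # (19)
    return ess.min()
```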

B Proofs

In this section, we provide proofs of Theorem 1 and Theorem 2.

Proof of Theorem 1. As outlined in the main text, the strategy is as follows. First, we consider a third-order Taylor expansion of the symplectic transformation implicitly defined by varying the step size in the generalized leapfrog integrator. Since the integrator is reversible, Theorem 2.2 in Chapter IX of Hairer et al. (2006) guarantees that the fourth-order term vanishes. Theorem 2 then guarantees that this vector field corresponds to Hamilton's equations of a shadow Hamiltonian, which can be computed by integration. Throughout the proof, we will adopt the convention of summation over repeated indices.

Letting (θ(t), p(t)) be the analytic solution to (3) with initial values (θ_0, p_0), and supposing that (q(t), θ̂(t), p̂(t)) is a solution to the implicitly defined equations

q(t) = p_0 − (t/2) ∇_θ H(θ_0, q(t)),
θ̂(t) = θ_0 + (t/2) [ ∇_p H(θ_0, q(t)) + ∇_p H(θ̂(t), q(t)) ],   (20)
p̂(t) = q(t) − (t/2) ∇_θ H(θ̂(t), q(t)),

observe that (20) evaluated at t = h retrieves (11). Differentiating each coordinate of (20) with respect to t,

q_i'(t) = −(1/2) ∂H/∂θ_i (θ_0, q(t)) + O(t),
θ̂_i'(t) = (1/2) [ ∂H/∂p_i (θ_0, q(t)) + ∂H/∂p_i (θ̂(t), q(t)) ] + O(t),
p̂_i'(t) = q_i'(t) − (1/2) ∂H/∂θ_i (θ̂(t), q(t)) + O(t),

which, evaluated at t = 0, becomes

q_i'(0) = −(1/2) ∂H/∂θ_i (θ_0, p_0),   θ̂_i'(0) = ∂H/∂p_i (θ_0, p_0),   p̂_i'(0) = −∂H/∂θ_i (θ_0, p_0).

Similarly, differentiating a second time,

q_i''(t) = −∂²H/∂p_j∂θ_i (θ_0, q(t)) q_j'(t) + O(t),
θ̂_i''(t) = ∂²/∂p_j∂p_i [ H(θ_0, q(t)) + H(θ̂(t), q(t)) ] q_j'(t) + ∂²H/∂θ_j∂p_i (θ̂(t), q(t)) θ̂_j'(t) + O(t),
p̂_i''(t) = q_i''(t) − ∂²H/∂p_j∂θ_i (θ̂(t), q(t)) q_j'(t) − ∂²H/∂θ_j∂θ_i (θ̂(t), q(t)) θ̂_j'(t) + O(t),

which evaluated at t = 0 is just

q_i''(0) = (1/2) ∂²H/∂p_j∂θ_i ∂H/∂θ_j (θ_0, p_0),
θ̂_i''(0) = −∂²H/∂p_j∂p_i ∂H/∂θ_j (θ_0, p_0) + ∂²H/∂θ_j∂p_i ∂H/∂p_j (θ_0, p_0),
p̂_i''(0) = ∂²H/∂p_j∂θ_i ∂H/∂θ_j (θ_0, p_0) − ∂²H/∂θ_j∂θ_i ∂H/∂p_j (θ_0, p_0).

Moving on to the third-order terms,

q_i'''(t) = −(3/2) ∂²H/∂p_j∂θ_i (θ_0, q(t)) q_j''(t) − (3/2) ∂³H/∂p_k∂p_j∂θ_i (θ_0, q(t)) q_j'(t) q_k'(t) + O(t),

θ̂_i'''(t) = (3/2) ∂²/∂p_j∂p_i [ H(θ_0, q(t)) + H(θ̂(t), q(t)) ] q_j''(t) + (3/2) ∂²H/∂θ_j∂p_i (θ̂(t), q(t)) θ̂_j''(t)
  + (3/2) ∂³H/∂p_k∂θ_j∂p_i (θ̂(t), q(t)) θ̂_j'(t) q_k'(t) + (3/2) ∂³H/∂θ_k∂θ_j∂p_i (θ̂(t), q(t)) θ̂_j'(t) θ̂_k'(t)
  + (3/2) ∂³H/∂θ_k∂p_j∂p_i (θ̂(t), q(t)) q_j'(t) θ̂_k'(t) + (3/2) ∂³H/∂p_k∂p_j∂p_i (θ_0, q(t)) q_j'(t) q_k'(t)
  + (3/2) ∂³H/∂p_k∂p_j∂p_i (θ̂(t), q(t)) q_j'(t) q_k'(t) + O(t),

p̂_i'''(t) = q_i'''(t) − (3/2) ∂²H/∂θ_j∂θ_i (θ_0, q(t)) θ̂_j''(t) − (3/2) ∂²H/∂p_j∂θ_i (θ_0, q(t)) q_j''(t)
  − (3/2) ∂³H/∂θ_k∂θ_j∂θ_i (θ_0, q(t)) θ̂_j'(t) θ̂_k'(t) − (3/2) ∂³H/∂p_k∂θ_j∂θ_i (θ_0, q(t)) θ̂_j'(t) q_k'(t)
  − (3/2) ∂³H/∂θ_k∂p_j∂θ_i (θ_0, q(t)) q_j'(t) θ̂_k'(t) − (3/2) ∂³H/∂p_k∂p_j∂θ_i (θ_0, q(t)) q_j'(t) q_k'(t) + O(t),

which implies

q_i'''(0) = −(3/4) ∂²H/∂p_j∂θ_i ∂²H/∂p_k∂θ_j ∂H/∂θ_k (θ_0, p_0) − (3/8) ∂³H/∂p_k∂p_j∂θ_i ∂H/∂θ_j ∂H/∂θ_k (θ_0, p_0),

θ̂_i'''(0) = −(3/2) ∂²H/∂θ_j∂p_i ∂²H/∂p_k∂p_j ∂H/∂θ_k + (3/2) ∂²H/∂θ_j∂p_i ∂²H/∂θ_k∂p_j ∂H/∂p_k + (3/2) ∂²H/∂p_j∂p_i ∂²H/∂p_k∂θ_j ∂H/∂θ_k
  − (3/4) ∂³H/∂p_k∂θ_j∂p_i ∂H/∂p_j ∂H/∂θ_k + (3/2) ∂³H/∂θ_k∂θ_j∂p_i ∂H/∂p_j ∂H/∂p_k − (3/4) ∂³H/∂θ_k∂p_j∂p_i ∂H/∂θ_j ∂H/∂p_k
  + (3/4) ∂³H/∂p_k∂p_j∂p_i ∂H/∂θ_j ∂H/∂θ_k,

p̂_i'''(0) = −(3/2) ∂²H/∂p_j∂θ_i ∂²H/∂p_k∂θ_j ∂H/∂θ_k + (3/2) ∂²H/∂θ_j∂θ_i ∂²H/∂p_k∂p_j ∂H/∂θ_k − (3/2) ∂²H/∂θ_j∂θ_i ∂²H/∂θ_k∂p_j ∂H/∂p_k
  − (3/2) ∂³H/∂θ_k∂θ_j∂θ_i ∂H/∂p_j ∂H/∂p_k + (3/2) ∂³H/∂p_k∂θ_j∂θ_i ∂H/∂p_j ∂H/∂θ_k − (3/4) ∂³H/∂p_k∂p_j∂θ_i ∂H/∂θ_j ∂H/∂θ_k,

with all derivatives evaluated at (θ_0, p_0).

Turning now to the analytic solution (θ(t), p(t)) of (3), differentiating a second time gives

θ_i''(t) = ∂²H/∂p_j∂p_i (θ(t), p(t)) p_j'(t) + ∂²H/∂θ_j∂p_i (θ(t), p(t)) θ_j'(t)
        = −∂²H/∂p_j∂p_i ∂H/∂θ_j + ∂²H/∂θ_j∂p_i ∂H/∂p_j,

p_i''(t) = −∂²H/∂p_j∂θ_i (θ(t), p(t)) p_j'(t) − ∂²H/∂θ_j∂θ_i (θ(t), p(t)) θ_j'(t)
        = ∂²H/∂p_j∂θ_i ∂H/∂θ_j − ∂²H/∂θ_j∂θ_i ∂H/∂p_j,

and

θ_i'''(t) = −∂²H/∂θ_j∂p_i ∂²H/∂p_k∂p_j ∂H/∂θ_k + ∂²H/∂θ_j∂p_i ∂²H/∂θ_k∂p_j ∂H/∂p_k + ∂²H/∂p_j∂p_i ∂²H/∂p_k∂θ_j ∂H/∂θ_k
  − ∂²H/∂p_j∂p_i ∂²H/∂θ_k∂θ_j ∂H/∂p_k − 2 ∂³H/∂p_k∂θ_j∂p_i ∂H/∂p_j ∂H/∂θ_k + ∂³H/∂θ_k∂θ_j∂p_i ∂H/∂p_j ∂H/∂p_k
  + ∂³H/∂p_k∂p_j∂p_i ∂H/∂θ_j ∂H/∂θ_k,

p_i'''(t) = −∂²H/∂p_j∂θ_i ∂²H/∂p_k∂θ_j ∂H/∂θ_k + ∂²H/∂θ_j∂θ_i ∂²H/∂p_k∂p_j ∂H/∂θ_k − ∂²H/∂θ_j∂θ_i ∂²H/∂θ_k∂p_j ∂H/∂p_k
  + ∂²H/∂p_j∂θ_i ∂²H/∂θ_k∂θ_j ∂H/∂p_k − ∂³H/∂θ_k∂θ_j∂θ_i ∂H/∂p_j ∂H/∂p_k + 2 ∂³H/∂p_k∂θ_j∂θ_i ∂H/∂p_j ∂H/∂θ_k
  − ∂³H/∂p_k∂p_j∂θ_i ∂H/∂θ_j ∂H/∂θ_k,

with all quantities evaluated at (θ(t), p(t)). Invoking Taylor's theorem, as t ↓ 0,

θ(t) = θ_0 + t θ'(0) + (t²/2) θ''(0) + (t³/6) θ'''(0) + O(t⁴),   p(t) = p_0 + t p'(0) + (t²/2) p''(0) + (t³/6) p'''(0) + O(t⁴),
θ̂(t) = θ_0 + t θ̂'(0) + (t²/2) θ̂''(0) + (t³/6) θ̂'''(0) + O(t⁴),   p̂(t) = p_0 + t p̂'(0) + (t²/2) p̂''(0) + (t³/6) p̂'''(0) + O(t⁴).

Collecting terms, we see that

θ̂_i(t) − θ_i(t) = (t³/6) [ −(1/2) ∂²H/∂θ_j∂p_i ∂²H/∂p_k∂p_j ∂H/∂θ_k + (1/2) ∂²H/∂θ_j∂p_i ∂²H/∂θ_k∂p_j ∂H/∂p_k + (1/2) ∂²H/∂p_j∂p_i ∂²H/∂p_k∂θ_j ∂H/∂θ_k
  + (1/4) ∂³H/∂p_k∂θ_j∂p_i ∂H/∂p_j ∂H/∂θ_k + (1/2) ∂³H/∂θ_k∂θ_j∂p_i ∂H/∂p_j ∂H/∂p_k + (1/4) ∂³H/∂θ_k∂p_j∂p_i ∂H/∂θ_j ∂H/∂p_k
  − (1/4) ∂³H/∂p_k∂p_j∂p_i ∂H/∂θ_j ∂H/∂θ_k + ∂²H/∂p_j∂p_i ∂²H/∂θ_k∂θ_j ∂H/∂p_k ] + O(t⁴),

which can be written

θ̂_i(t) − θ_i(t) = (∂/∂p_i) { (t³/6) [ (1/2) ∂²H/∂θ_k∂θ_j ∂H/∂p_j ∂H/∂p_k − (1/4) ∂²H/∂p_k∂p_j ∂H/∂θ_j ∂H/∂θ_k + (1/2) ∂²H/∂θ_k∂p_j ∂H/∂θ_j ∂H/∂p_k ] } + O(t⁴)
  = (∂/∂p_i) { (t³/6) [ (1/2) ∇_p H ∇_{θθ} H ∇_p H − (1/4) ∇_θ H ∇_{pp} H ∇_θ H + (1/2) ∇_θ H ∇_{θp} H ∇_p H ] } + O(t⁴).

Analogous calculations give us

p̂_i(t) − p_i(t) = −(∂/∂θ_i) { (t³/6) [ (1/2) ∂²H/∂θ_k∂θ_j ∂H/∂p_j ∂H/∂p_k − (1/4) ∂²H/∂p_k∂p_j ∂H/∂θ_j ∂H/∂θ_k + (1/2) ∂²H/∂θ_k∂p_j ∂H/∂θ_j ∂H/∂p_k ] } + O(t⁴)
  = −(∂/∂θ_i) { (t³/6) [ (1/2) ∇_p H ∇_{θθ} H ∇_p H − (1/4) ∇_θ H ∇_{pp} H ∇_θ H + (1/2) ∇_θ H ∇_{θp} H ∇_p H ] } + O(t⁴).

Taken together,

ẑ(t) − z(t) = (t³/6) J∇ [ (1/2) ∇_p H ∇_{θθ} H ∇_p H − (1/4) ∇_θ H ∇_{pp} H ∇_θ H + (1/2) ∇_θ H ∇_{θp} H ∇_p H ] + O(t⁴).

Fixing t = h, and keeping in mind that the generalized leapfrog integrator is reversible, Theorems 1.2 and 2.2 from Chapter IX in Hairer et al. (2006) imply that

ẑ'(t) = z'(t) + ( ẑ'(t) − z'(t) )
      = J∇H + (h²/6) J∇ [ (1/2) ∇_p H ∇_{θθ} H ∇_p H − (1/4) ∇_θ H ∇_{pp} H ∇_θ H + (1/2) ∇_θ H ∇_{θp} H ∇_p H ] + O(h⁴)
      = J∇H^[4] + O(h⁴),

as required.
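Though not part of the proof, Theorem 1 admits a quick numerical sanity check: for a toy non-separable Hamiltonian, the one-step drift in H^[4] should scale roughly like h⁵, so halving h should shrink it by about a factor of 2⁵ = 32. A sketch with analytic derivatives, under these assumptions:

```python
# toy non-separable Hamiltonian H = 0.5*p^2*(1 + t^2) + 0.5*t^2, t = theta
Ht = lambda t, p: p * p * t + t          # dH/dtheta
Hp = lambda t, p: p * (1.0 + t * t)      # dH/dp

def glf_step(t0, p0, h, iters=50):
    """One generalized leapfrog step (11), via fixed-point iterations."""
    q = p0
    for _ in range(iters):
        q = p0 - 0.5 * h * Ht(t0, q)
    t1 = t0
    for _ in range(iters):
        t1 = t0 + 0.5 * h * (Hp(t0, q) + Hp(t1, q))
    return t1, q - 0.5 * h * Ht(t1, q)

def H4(t, p, h):
    """Fourth-order shadow Hamiltonian (14), written out for d = 1."""
    H = 0.5 * p * p * (1.0 + t * t) + 0.5 * t * t
    corr = (Hp(t, p) ** 2 * (p * p + 1.0)          # grad_p H . H_qq . grad_p H
            - 0.5 * Ht(t, p) ** 2 * (1.0 + t * t)  # grad_q H . H_pp . grad_q H
            + Ht(t, p) * 2.0 * p * t * Hp(t, p))   # grad_q H . H_qp . grad_p H
    return H + h * h / 12.0 * corr

z0 = (0.3, 0.8)
for h in (0.1, 0.05, 0.025):
    z1 = glf_step(*z0, h)
    print(h, abs(H4(*z1, h) - H4(*z0, h)))  # drift shrinks ~32x per halving
```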

Proof of Theorem 2. Assume M is a smooth, simply connected Riemannian manifold with metric g, and that φ_H^t is a symplectic integrator on T*M, smooth in its step size t. Denote by X the corresponding (and typically time-inhomogeneous) symplectic vector field on T*M. Note that this vector field can be locally computed by asymptotic expansion after differentiating φ_H^t in t. Since X is symplectic, it corresponds to a closed differential 1-form via the canonical isomorphism induced by the symplectic form. In order to show existence of a shadow Hamiltonian, we will now demonstrate that this closed form must be exact.

To do so, we first argue that the first de Rham cohomology group on the vector bundle T*M must be trivial (see Barp et al. (2018) for a statistical introduction). For (θ, p) ∈ T*M, define a deformation retraction onto the zero section of T*M via (θ, p) ↦ (θ, (1−t)p) for t ∈ [0, 1]. The homotopy invariance of the de Rham cohomology now guarantees that H¹_dR(T*M) ≅ H¹_dR(M).

Next, invoke de Rham's theorem (5.36 in Warner (1983)), which tells us that the first de Rham cohomology group is isomorphic to the first singular cohomology group with real coefficients (see Warner (1983) or Hatcher (2000)). That is, H¹_dR(M) ≅ H¹(M, R). From here, the universal coefficient theorem for cohomology (see Theorem 3.2 in Hatcher (2000)) gives that H¹(M, R) ≅ Hom(H₁(M), R), where H₁(M) is the first homology group of M. The Hurewicz theorem (see Theorem 4.1 in Whitehead (1978)) now implies that the Hurewicz map ρ : π₁(M) → H₁(M) is a homomorphism, where π₁(M) is the fundamental group of M; moreover, this homomorphism is an isomorphism whenever π₁(M) is abelian. Keeping in mind that M is simply connected, and so has trivial fundamental group, we can conclude that the first de Rham cohomology group of T*M is also trivial. This can be seen in the chain of isomorphisms

H¹_dR(T*M) ≅ H¹_dR(M) ≅ H¹(M, R) ≅ Hom(H₁(M), R) ≅ Hom(π₁(M), R) ≅ 0.

This shows that the symplectic vector field X is not only closed, but exact. That is, it is a Hamiltonian vector field corresponding to some Hamiltonian H_φ.

Now since the symplectic integrator is reversible, its maximal error must be of even order, say n, for some n ∈ 2N (see Chapter V of Hairer et al. (2006)). Symplecticity guarantees the Hamiltonian H is preserved up to this order. van der Heide, Hodgkinson, Roosta, Kroese

By taking the (n+1)-th order Taylor approximation of X, in the same vein as Theorem 1, one can integrate this vector field approximation to compute the (n+2)-th order shadow Hamiltonian H^[n+2].

Assume now that we have such a shadow Hamiltonian H^[2k] of order 2k, for some even integer 2k > n + 2; that is, the symplectic integrator φ_H^t's approximation of the Hamiltonian trajectories corresponding to H conserves the shadow Hamiltonian H^[2k], and tracks its Hamiltonian trajectories up to order 2k. We know that the integrator φ_H^t is reversible, so its error must be of even order. By computing the (2k+1)-th order Taylor approximation of X, we can integrate to find H^[2k+2]. The inductive hypothesis implies the theorem.

C Additional Numerical Experiments

C.1 Momentum Refreshment

To investigate the effects of varying the momentum refreshment parameter ρ, we ran ten independent chains of RMHMC and SMHMC on the 30-dimensional funnel for each ρ ∈ {0.0, 0.1, ..., 0.8}. Each chain comprises 1000 samples with 100 burn-in steps, but otherwise uses the same parameters as those in Section 4.3. The results are displayed in Figure 4. RMHMC shows a general trend towards improved acceptance and performance as ρ increases. The same trend is shared by SMHMC, but to a lesser extent, possibly due to its universally high acceptance probabilities.

Figure 4: (a) Acceptance probabilities; (b) minESS; and (c) minESS/s for 10 chains of RMHMC and SMHMC on the 30-dimensional funnel with varying choices of ρ. Shadow Manifold Hamiltonian Monte Carlo

C.2 Varying Tail Behaviour

In a similar vein, we also investigate the effects of the tail parameter c in (15). We ran ten independent chains of SMHMC on the Bayesian logistic regression problem with the German dataset. Each chain comprises 1000 samples with 100 burn-in steps; otherwise, we use the same parameters seen in Section 4.2. The results are displayed in Figure 5. We recall that increasing c interpolates between RMHMC and SMHMC, and this can be readily seen in the rising acceptance probabilities. Such monotonicity is not shared by the minESS. Indeed, worst-case performance appears when c = −0.5, where roughly half of the modified Hamiltonians 𝓗(θ, p) coincide with the classical Hamiltonians H(θ, p).

Figure 5: (a) Acceptance probabilities; (b) minESS; and (c) minESS/s for 10 chains of SMHMC on the German dataset with varying choices of c.