Shadow Manifold Hamiltonian Monte Carlo
Chris van der Heide (UQ), Liam Hodgkinson (UC Berkeley, ICSI), Fred Roosta (UQ, ICSI), Dirk P. Kroese (UQ)

Abstract

Hamiltonian Monte Carlo and its descendants have found success in machine learning and computational statistics due to their ability to draw samples in high dimensions with greater efficiency than classical MCMC. One of these derivatives, Riemannian manifold Hamiltonian Monte Carlo (RMHMC), better adapts the sampler to the geometry of the target density, allowing for improved performance in sampling problems with complex geometric features. Other approaches have boosted acceptance rates by sampling from an integrator-dependent "shadow density" and compensating for the induced bias via importance sampling. We combine the benefits of RMHMC with those attained by sampling from the shadow density, by deriving the shadow Hamiltonian corresponding to the generalized leapfrog integrator used in RMHMC. This leads to a new algorithm, shadow manifold Hamiltonian Monte Carlo, that shows improved performance over RMHMC, and leaves the target density invariant.

1 Introduction

Hamiltonian Monte Carlo (HMC) was first introduced by Duane et al. (1987) in the context of simulating lattice field theories in quantum chromodynamics. It gained further mainstream success following its introduction to the machine learning and computational statistics communities for training Bayesian neural networks (Neal, 1996). It has since become an indispensable tool for Bayesian inference at large, finding a plethora of applications in scientific and industrial fields.

The key advantage of HMC over many of its competitors is its ability to draw samples that are large distances apart by evolving them via Hamiltonian dynamics. The acceptance rate depends on the error accumulated along the sample trajectory, and remains large even in moderately high dimensions. This is in no small part due to its deep connection with geometry and physics, which informs its rich theoretical foundations (Betancourt et al., 2017; Barp et al., 2018). However, HMC's performance deteriorates if either the step size or the size of the system becomes too large, or if the target density is poorly behaved. Greater numerical error leads to lower sample acceptance, which induces heavy autocorrelation, necessitating a larger sample size and thus higher computational cost.

One approach to ease this burden is to exploit the structure of the numerical integrator error and instead target the density corresponding to a modified, or shadow, Hamiltonian. This leads to a higher sample acceptance rate, shown in Figure 1, at the cost of some induced bias. This bias is easily quantified, and importance sampling is then performed to compensate (Radivojević & Akhmatskaya, 2019); a sketch of this reweighting is given at the end of this section. Due to the structure of the shadow Hamiltonian, momentum refreshment requires another Metropolis–Hastings step to ensure sampler consistency. This results in a non-reversible sampler, and allows for partial momentum retention (Horowitz, 1991; Kennedy & Pendleton, 2001; Akhmatskaya & Reich, 2006; Sohl-Dickstein, 2012; Sohl-Dickstein et al., 2014), which may be of interest on its own.

In their landmark paper, Girolami & Calderhead (2011) proposed Riemannian manifold Hamiltonian Monte Carlo (RMHMC). This algorithm draws upon ideas from information geometry (Amari, 2016) to generalize HMC by traversing a Riemannian manifold. While other special classes of Riemannian manifolds have recently been studied (Barp et al., 2018, 2019), the original implementations target Bayesian posterior densities, and excel in the presence of complex geometric features. However, to date, there have been no efforts to develop a shadow Hamiltonian corresponding to the generalized leapfrog algorithm used in RMHMC.

Figure 1: Error due to Hamiltonian drift (left), and corresponding Metropolis–Hastings acceptance probabilities (right), in posterior samples from a Bayesian logistic regression model using the Australian credit dataset. 2000 proposed parameter samples were produced via RMHMC and our proposed SMHMC with h = 0.5, L = 6.

Contributions. Our contributions in this work can be summarized as follows.

I. A fourth-order shadow Hamiltonian that is preserved by the generalized leapfrog integrator up to $\mathcal{O}(h^4)$ in the step size $h$ is derived.

II. We prove the existence of shadow Hamiltonians of any even order for arbitrary non-separable Hamiltonians and reversible symplectic integrators.

III. We introduce the shadow manifold Hamiltonian Monte Carlo algorithm (SMHMC), which is guaranteed to leave the target density invariant.

IV. Numerical examples are provided demonstrating advantages over existing alternatives.

V. A PyTorch implementation is made available at https://github.com/chrisvdh/shadowtorch.
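The importance-sampling correction mentioned above is, at its core, self-normalized reweighting by the ratio of the target and shadow densities. The following is a minimal sketch of that step, assuming the true and shadow Hamiltonians have already been evaluated at each sample; the function and variable names are illustrative and not drawn from the paper's repository.

```python
import torch

def shadow_is_estimate(f_vals, H_vals, H_shadow_vals):
    # Samples z_i are assumed to target the shadow density ~ exp(-H_shadow),
    # so weighting by w_i ~ exp(H_shadow(z_i) - H(z_i)) yields a consistent,
    # self-normalized estimate of E_pi[f].
    log_w = H_shadow_vals - H_vals
    log_w = log_w - log_w.max()          # shift log-weights for stability
    w = torch.exp(log_w)
    return torch.sum(w * f_vals) / torch.sum(w)
```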
2 Hamiltonian Monte Carlo

Markov chain Monte Carlo (MCMC) sampling techniques are a mainstay of computational statistics and machine learning. Many problems in scientific, medical, and industrial applications can be framed as Bayesian hierarchical problems, and their resolution often reduces to sampling from a posterior distribution in order to approximate some (often high-dimensional) analytically intractable integral. HMC is a highly successful MCMC technique that is somewhat resilient to the curse of dimensionality, without compromising convergence guarantees (Livingstone et al., 2016).

The goal of MCMC is to generate a set of correlated samples $\{\theta_i\}_{i=1}^n$ whose empirical distribution converges, as $n \to \infty$, to a target measure with Lebesgue density $\pi_\theta$ on $\mathbb{R}^d$. This enables the approximation of the integral $\mathbb{E}_\pi f = \int_{\mathbb{R}^d} f(\theta)\,\pi_\theta(\theta)\,\mathrm{d}\theta$ of a $\pi_\theta$-integrable function $f : \mathbb{R}^d \to \mathbb{R}$ via the Monte Carlo estimate $n^{-1} \sum_{i=1}^n f(\theta_i)$. In the sequel, we assume that all relevant functions are infinitely differentiable, although this can be relaxed in practice.

2.1 The Basic Setup

For a target density $\pi_\theta$, let $U : \mathbb{R}^d \to \mathbb{R}$ be a potential energy function on the configuration space of a Hamiltonian system such that $\pi_\theta(\theta) := e^{-U(\theta)}/Z$, where $Z$ is a constant that ensures that $\pi_\theta$ integrates to unity. Under this interpretation, "energy wells" with low potential energy correspond to regions of high probability. In the basic HMC setup, the configuration space is augmented with a conjugate momentum variable $p \in \mathbb{R}^d$, which is assigned a non-negative kinetic energy term $K : \mathbb{R}^d \to \mathbb{R}$, typically taken to be $K(p) = p^\top \Sigma^{-1} p / 2$. Here, $\Sigma$ is a real-valued positive-definite $d \times d$ matrix, commonly taken to be the identity or some other diagonal matrix. Under this choice, the conjugate momenta are assumed to be distributed via a multivariate Gaussian density $\pi_p(p) := e^{-K(p)}/\sqrt{(2\pi)^d |\Sigma|}$, where we denote by $|\Sigma|$ the determinant of $\Sigma$. The Hamiltonian of the system is then defined by $H(\theta, p) := U(\theta) + K(p)$, with the joint density of $\theta$ and $p$ satisfying

$$\pi_H(\theta, p) := \frac{e^{-H(\theta, p)}}{Z \sqrt{(2\pi)^d |\Sigma|}} = \pi_\theta(\theta)\,\pi_p(p). \qquad (1)$$
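For concreteness, the following is a minimal sketch of a single HMC transition under this setup, using the standard leapfrog integrator for the separable Hamiltonian above together with the acceptance rule (6) stated in Section 2.2. The choice $\Sigma = I$ and all names are illustrative assumptions, not taken from the paper's implementation.

```python
import torch

def hmc_step(theta, U, h=0.1, L=10):
    # One basic HMC transition; U is a callable returning the scalar
    # potential energy, i.e. -log pi_theta up to an additive constant.
    def grad_U(x):
        x = x.detach().requires_grad_(True)
        return torch.autograd.grad(U(x), x)[0]

    p = torch.randn_like(theta)              # momentum refreshment: p ~ pi_p
    H0 = U(theta) + 0.5 * p.dot(p)           # H(theta, p) = U(theta) + K(p)

    q, r = theta.clone(), p.clone()
    r = r - 0.5 * h * grad_U(q)              # initial half step in momentum
    for i in range(L):
        q = q + h * r                        # full step in position
        step = h if i < L - 1 else 0.5 * h   # final update is a half step
        r = r - step * grad_U(q)
    H1 = U(q) + 0.5 * r.dot(r)               # energy at the trajectory's end

    # Accept with probability min{1, exp(H0 - H1)}, as in equation (6).
    if torch.log(torch.rand(())) < H0 - H1:
        return q.detach()
    return theta                             # on rejection, keep the old state
```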
RMHMC is a method introduced by Girolami & Calderhead (2011) that generalises HMC by endowing $\mathbb{R}^d$ with a Riemannian metric, and sampling from the resulting Riemannian manifold $(M, G)$. In the basic implementation, the statistical manifold $M$ is equipped with a metric $G$, taken to be a variant of the Fisher–Rao metric that incorporates the Bayesian prior. This metric gives a canonical way to measure the squared length of a momentum $p$ at $\theta$, which is incorporated into the kinetic energy term $K(\theta, p) = p^\top G^{-1}(\theta) p / 2$. This term is now position-dependent, and the corresponding conditional density becomes

$$\pi_{p|\theta}(\theta, p) := \frac{e^{-K(\theta, p)}}{\sqrt{(2\pi)^d |G(\theta)|}}.$$

The determinant term is then incorporated into the Hamiltonian

$$H(\theta, p) := U(\theta) + \frac{1}{2} \log\big((2\pi)^d |G(\theta)|\big) + K(\theta, p), \qquad (2)$$

and by construction the joint density is then given by

$$\pi_H(\theta, p) := \frac{e^{-H(\theta, p)}}{Z} = \pi_\theta(\theta)\,\pi_{p|\theta}(\theta, p).$$

Given some initial pair $z_0 = (\theta_0, p_0)$, (RM)HMC algorithms then draw a candidate sample from the joint density $\pi_H$ by evolving the phase $z_0$ along a trajectory in phase space defined by Hamiltonian dynamics.

2.2 Hamiltonian Dynamics

Writing $z = (\theta, p)$ for the phase, Hamiltonian dynamics evolve $z$ according to the system

$$\dot{z} = J \nabla H(z), \qquad J = \begin{pmatrix} 0 & I_d \\ -I_d & 0 \end{pmatrix}, \qquad (3)$$

whose flow enjoys three properties that make it an effective proposal mechanism.

1. Conservation of energy: The Hamiltonian is constant along solutions of (3); that is, $H(z(t)) = H(z_0)$ for all $t$.

2. Reversibility: Let $\varphi^H_t(z_0)$ denote the unique solution at time $t$ to (3) with initial value $z_0$. Since $H$ is time-homogeneous, there holds

$$\varphi^H_t \circ \varphi^H_s(z_0) = \varphi^H_{s+t}(z_0), \quad \text{i.e.,} \quad \varphi^H_{-t} \circ \varphi^H_t(z_0) = z_0, \qquad (5)$$

or, in other words, $\varphi^H_{-t} = (\varphi^H_t)^{-1}$. Furthermore, if $K$ is an even function, then we can obtain the inverse mapping by negating the sign of $p$, applying $\varphi_t$, and then negating the sign of $p$ again.

3. Volume preservation: Taking an open neighbourhood $N \subset \mathbb{R}^{2d}$ and evolving it by $\varphi_t$ reveals $m(\varphi_t(N)) = m(N)$, where $m$ is the Lebesgue measure on $\mathbb{R}^{2d}$. This is most easily seen by noting that the Hamiltonian vector field $J \nabla H$ is divergence free, that is,

$$\nabla \cdot J \nabla H = \sum_{i=1}^d \left( \frac{\partial}{\partial \theta_i} \frac{\partial H}{\partial p_i} - \frac{\partial}{\partial p_i} \frac{\partial H}{\partial \theta_i} \right) = 0.$$

Given a candidate sample obtained by evolving the system (3) up to time $T$, these three properties allow for a simple Metropolis–Hastings acceptance criterion. Reversibility of the dynamics implies reversibility of the resultant Markov chain; therefore, the sampler leaves the target density invariant. Conservation of energy suggests a means of detecting numerically approximated trajectories that have not been faithful to (3), by computing $H(z(T)) - H(z_0)$, and volume preservation removes the need to keep track of a Jacobian in this change of variables. This allows proposed samples to be accepted with probability

$$\alpha = \min\big\{1, \exp\big(H(\theta, p) - H(\theta(T), p(T))\big)\big\}, \qquad (6)$$

and otherwise taking $(\theta, -p)$.
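To make the position-dependent Hamiltonian concrete, the following is a minimal sketch of evaluating equation (2) in PyTorch. The callables `U` (potential energy) and `G` (positive-definite metric tensor, e.g. a Fisher-type metric) are hypothetical placeholders, not the interface of the paper's repository.

```python
import math
import torch

def rmhmc_hamiltonian(theta, p, U, G):
    # Equation (2): H = U(theta) + 0.5*log((2*pi)^d |G(theta)|)
    #                 + 0.5 * p^T G(theta)^{-1} p.
    G_theta = G(theta)                                   # d x d metric at theta
    d = theta.shape[0]
    log_det_term = 0.5 * (d * math.log(2 * math.pi)
                          + torch.linalg.slogdet(G_theta).logabsdet)
    kinetic = 0.5 * p @ torch.linalg.solve(G_theta, p)   # p^T G^{-1} p / 2
    return U(theta) + log_det_term + kinetic
```

Computing the kinetic term via a linear solve avoids forming $G^{-1}(\theta)$ explicitly, and `slogdet` evaluates the determinant term stably; this log-determinant is precisely what distinguishes (2) from the separable Hamiltonian of Section 2.1.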