Exponential Integration for Hamiltonian Monte Carlo
Wei-Lun Chao¹ [email protected]
Justin Solomon² [email protected]
Dominik L. Michels² [email protected]
Fei Sha¹ [email protected]

¹Department of Computer Science, University of Southern California, Los Angeles, CA 90089
²Department of Computer Science, Stanford University, 353 Serra Mall, Stanford, California 94305 USA

Abstract

We investigate numerical integration of ordinary differential equations (ODEs) for Hamiltonian Monte Carlo (HMC). High-quality integration is crucial for designing efficient and effective proposals for HMC. While the standard method is leapfrog (Störmer–Verlet) integration, we propose the use of an exponential integrator, which is robust to stiff ODEs with highly-oscillatory components. This oscillation is difficult to reproduce using leapfrog integration, even with carefully selected integration parameters and preconditioning. Concretely, we use a Gaussian distribution approximation to segregate stiff components of the ODE. We integrate this term analytically for stability and account for deviation from the approximation using variation of constants. We consider various ways to derive Gaussian approximations and conduct extensive empirical studies applying the proposed "exponential HMC" to several benchmark learning problems. We compare to state-of-the-art methods for improving leapfrog HMC and demonstrate the advantages of our method in generating many effective samples with high acceptance rates in short running times.

1. Introduction

Markov chain Monte Carlo (MCMC) is at the core of many methods in computational statistics (Gelman et al., 2004; Robert & Casella, 2004) for sampling from complex probability distributions for inference and learning. Metropolis–Hastings MCMC obtains such samples by constructing a Markov chain with help from proposal distributions that generate random walks in sample space. A fraction of these walks aligned with the target distribution is kept, while the remainder is discarded. Sampling efficiency and quality depend critically on the proposal distribution. For this reason, designing good proposals has long been a subject of intensive research.

For distributions over continuous variables, Hamiltonian Monte Carlo (HMC) is a state-of-the-art sampler that uses classical mechanics to generate proposal distributions (Neal, 2011). The method starts by introducing auxiliary sampling variables. It then treats the negative log density of the variables' joint distribution as the Hamiltonian of particles moving in sample space. Final positions on the motion trajectories are used as proposals.

The equations of motion are ordinary differential equations (ODEs) requiring numerical integration. Several factors affect the choice of integrator. Time reversibility is needed for the Markov chain to converge to the target distribution, and volume preservation ensures that the sample acceptance test is tractable. The acceptance rate of HMC is determined by how well the integrator preserves the energy of the physical system, so high-fidelity numerical solutions are preferable. In the rare case that the ODE can be integrated exactly, no samples would be rejected. More realistically, when the ODE is integrated approximately, especially over a long time period to encourage exploration of the sample space, integration error can impede energy conservation, slowing convergence. Thus, the time period for integration is usually subdivided into many shorter steps. Despite its relevance to sampling efficiency, the investigation of sophisticated numerical integrators for HMC has been scarce in the current machine learning literature.

The standard HMC integrator is the leapfrog (Störmer–Verlet) method. Leapfrog integration, however, is sensitive to stiff ODEs with highly-oscillatory dynamical components (Hairer et al., 2006). When the target density yields a stiff ODE, e.g., a multivariate Gaussian distribution with small variances in certain dimensions, the time step for leapfrog is limited to the scale of the stiff components to avoid perturbing the energy of the system and consequently lowering MCMC acceptance rates. This results in limited motion in sample space, requiring many integration steps to explore the space. More generally, stiff ODEs occur in HMC whenever target distributions peak tightly around their modes, further motivating the need for advanced integrators. Although preconditioning can reduce stiffness and partially alleviate this problem, it is often insufficient, as demonstrated in our empirical studies.

In this paper, we address the challenge of simulating stiff ODEs in HMC using an explicit exponential integrator. Exponential integrators are known in simulation and numerics for their ability to take large time steps with enhanced stability. They decompose an ODE into two terms: a linearization solvable in closed form and a nonlinear remainder. For HMC, the linear term encodes a Gaussian component of the distribution; when the small variances in the distribution are adequately summarized by this part, exponential integration outperforms generic counterparts like leapfrog. Our exponential integrator ("expHMC") is easily implemented and efficient, with high acceptance rates, broad exploration of sample space, and fast convergence due to the reduced restriction on the time step size. The flexibility to choose filter functions in exponential integrators further enhances the applicability of our method.

We validate the effectiveness of expHMC with extensive empirical studies on various types of distributions, including Bayesian logistic regression and independent component analysis. We also compare expHMC to alternatives, such as the conventional leapfrog method and the recently proposed Riemann manifold HMC, demonstrating desirable characteristics in scenarios where other methods suffer.

We describe basic HMC in §2 and our approach in §3. We then review related work in §4 and report several empirical studies in §5. We conclude in §7.

2. Hamiltonian Monte Carlo

Suppose we would like to sample a random variable q ∈ R^d ∼ p(q). Hamiltonian Monte Carlo (HMC) considers the joint distribution of q and an auxiliary random variable p ∈ R^d ∼ p(p) that is distributed independently of q. For simplicity, p(p) is often assumed to be a zero-mean Gaussian with covariance M.

The joint density p(q, p) is used to define a Hamiltonian

    H(q, p) = −log p(q) − log p(p) = U(q) + (1/2) p^T M^{-1} p + const.,    (1)

where M is referred to as the (preconditioning) mass matrix. The dynamics governed by this Hamiltonian are given by the following coupled system of ODEs:

    q̇ = ∇_p H = M^{-1} p,
    ṗ = −∇_q H = −∇_q U(q),    (2)

where dots denote derivatives in time.

The trajectories of q and p provide proposals to the Metropolis–Hastings MCMC procedure. Specifically, HMC applies the following steps: (i) starting from k = 1, draw p_k ∼ N(0, M); (ii) compute the position (q*, p*) by simulating (2) for time t with initial conditions (q_{k−1}, p_k); (iii) compute the change in Hamiltonian δH = H(q_{k−1}, p_k) − H(q*, p*); (iv) output q_k = q* with probability min(e^{δH}, 1).

If (q*, p*) comes from solving (2) exactly, then by conservation of H, δH = 0 and the new sample is always accepted. Closed-form solutions to (2) rarely exist, however. In practice, t is broken into L discrete steps, and integration is carried out numerically in increments of h = t/L. This discretization creates a chance of rejected samples, as H is no longer conserved exactly. The constructed chain {q_k} still converges to the target distribution, so long as the integrator is symplectic and time-reversible (Neal, 2011).
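To make steps (i)–(iv) concrete, here is a minimal Python sketch of one HMC transition with a pluggable integrator; the names (`hamiltonian`, `hmc_step`, `integrate`) are illustrative conventions of this sketch, not notation from the paper.

```python
import numpy as np

def hamiltonian(q, p, U, M_inv):
    # H(q, p) = U(q) + (1/2) p^T M^{-1} p, dropping the constant in (1)
    return U(q) + 0.5 * p @ M_inv @ p

def hmc_step(q_prev, U, grad_U, integrate, M, M_inv, rng):
    """One Metropolis-Hastings transition driven by Hamiltonian dynamics."""
    # (i) draw the auxiliary momentum p_k ~ N(0, M)
    p = rng.multivariate_normal(np.zeros(len(q_prev)), M)
    # (ii) simulate the dynamics (2) for time t from (q_{k-1}, p_k)
    q_star, p_star = integrate(q_prev, p, grad_U, M_inv)
    # (iii) change in Hamiltonian; exactly 0 for an exact solve of (2)
    delta_H = hamiltonian(q_prev, p, U, M_inv) - hamiltonian(q_star, p_star, U, M_inv)
    # (iv) accept q* with probability min(e^{delta_H}, 1)
    if np.log(rng.uniform()) < delta_H:
        return q_star
    return q_prev
```

Any symplectic, time-reversible scheme can serve as `integrate`; the leapfrog routine of Figure 1, sketched below, is the standard choice.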
An integrator satisfying these criteria is the leapfrog (Störmer–Verlet) method in Figure 1. It is easily implemented and explicit: no linear or nonlinear solvers are needed to generate (q_L, p_L) from (q_0, p_0). For large h, however, it suffers from instability that can lead to inaccurate solutions. This restriction on h is problematic when the ODE (2) is stiff, requiring small h and correspondingly large L to resolve high-frequency oscillations of (q, p). We demonstrate this instability empirically in §5.

    function LEAPFROG((q_0, p_0), h, L)
        p_{1/2} ← p_0 − (h/2) ∇_q U(q_0)
        for i ← 1, 2, …, L
            q_i ← q_{i−1} + h M^{-1} p_{i−1/2}
            if i ≠ L then
                p_{i+1/2} ← p_{i−1/2} − h ∇_q U(q_i)
        p_L ← p_{L−1/2} − (h/2) ∇_q U(q_L)
        return (q_L, p_L) as (q*, p*)

    Figure 1. Leapfrog integrator.
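As a sketch, Figure 1 translates almost line-for-line into Python (the array-based interface and default parameters are our own choices):

```python
import numpy as np

def leapfrog(q0, p0, grad_U, M_inv, h=0.1, L=20):
    """Leapfrog (Stormer-Verlet) integration of (2), following Figure 1."""
    q, p = np.array(q0, dtype=float), np.array(p0, dtype=float)
    p -= 0.5 * h * grad_U(q)       # initial momentum half-step
    for i in range(1, L + 1):
        q += h * (M_inv @ p)       # full position step
        if i != L:                 # interior steps merge two momentum half-steps
            p -= h * grad_U(q)
    p -= 0.5 * h * grad_U(q)       # final momentum half-step
    return q, p                    # the proposal (q*, p*)
```

For a Gaussian target with M = I and standard deviation σ in its stiffest direction, leapfrog is stable only for h < 2σ, which is precisely the time-step restriction the exponential integrator is designed to remove.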
3. Exponential HMC

The idea behind exponential integration is to consider dynamics at different scales explicitly. This is achieved by decomposing the Hamiltonian into two terms. The first encodes high-frequency, numerically-unstable oscillation, while the second encodes slower dynamics that are robust to explicit integration. The insight is that the first part can be solved analytically, leaving numerical integration for the more stable remainder. Assembling these two parts provides a stable solution to the original dynamics.

Here, we introduce the basic idea first and then propose practical strategies for decomposition.

3.1. Exponential Integrator

We consider the following structure for U:

    ∇_q U(q) = Σ^{-1} q + f(q),    (3)

for some positive definite matrix Σ ∈ R^{n×n} and a possibly nonlinear function f : R^n → R^n. This form can originate from a density with a "slow" portion f(q) modulated by a possibly stiff Gaussian distribution from Σ^{-1}, with stiffness from small variances.

    function EXPONENTIAL-STEP((q_0, p_0), h, L)
        Transform into the new variables: (r_0, ṙ_0) ← (M^{1/2} q_0, M^{-1/2} p_0)
        Precompute the following matrices and the filters: (C, S) ← (cos(hΩ), sin(hΩ))
        for i ← 0, 1, …, L − 1
            r_{i+1} ← C r_i + Ω^{-1} S ṙ_i − (1/2) h² ψ F(φ r_i)
            ṙ_{i+1} ← −Ω S r_i + C ṙ_i − (1/2) h (ψ_0 F(φ r_i) + ψ_1 F(φ r_{i+1}))
        return (M^{-1/2} r_L, M^{1/2} ṙ_L) as (q*, p*)

    Figure 2. Exponential integrator.

The filters are chosen to guarantee, among other properties:

• Consistency and accuracy to second order: φ(0) = ψ(0) = 1
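The excerpt ends before Ω, F, and the filter functions are specified, so the Python sketch below fills them in with standard assumptions for trigonometric integrators applied to (3) (Hairer et al., 2006): Ω = (M^{-1/2} Σ^{-1} M^{-1/2})^{1/2}, F(r) = M^{-1/2} f(M^{-1/2} r), and Deuflhard's filter choice φ = 1, ψ = sinc, ψ_0 = cos, ψ_1 = 1, which satisfies the consistency condition above. Treat it as an illustration of the update in Figure 2, not the paper's exact method.

```python
import numpy as np

def sym_sqrt(A):
    # square root of a symmetric positive-definite matrix via eigendecomposition
    w, V = np.linalg.eigh(A)
    return V @ np.diag(np.sqrt(w)) @ V.T

def exponential_step(q0, p0, f, Sigma_inv, M, h=0.5, L=10):
    """Exponential (trigonometric) integration in the spirit of Figure 2.

    Assumes Omega^2 = M^{-1/2} Sigma^{-1} M^{-1/2}, F(r) = M^{-1/2} f(M^{-1/2} r),
    and the filters phi = 1, psi = sinc, psi0 = cos, psi1 = 1.
    """
    M_sqrt = sym_sqrt(M)
    M_isqrt = np.linalg.inv(M_sqrt)
    Omega = sym_sqrt(M_isqrt @ Sigma_inv @ M_isqrt)
    w, V = np.linalg.eigh(h * Omega)                 # diagonalize h*Omega once
    filt = lambda g: V @ np.diag(g(w)) @ V.T         # scalar function -> matrix filter
    C, S = filt(np.cos), filt(np.sin)
    OinvS = filt(lambda x: h * np.sinc(x / np.pi))   # Omega^{-1} S = h sinc(h*Omega), safe at 0
    Psi, Psi0 = OinvS / h, C                         # psi = sinc, psi0 = cos; psi1 = identity

    F = lambda r: M_isqrt @ f(M_isqrt @ r)
    r, rdot = M_sqrt @ q0, M_isqrt @ p0              # transform into the new variables
    Fr = F(r)                                        # phi = 1, so F(phi r) = F(r)
    for _ in range(L):
        r_next = C @ r + OinvS @ rdot - 0.5 * h**2 * (Psi @ Fr)
        F_next = F(r_next)
        rdot = -(Omega @ S @ r) + C @ rdot - 0.5 * h * (Psi0 @ Fr + F_next)
        r, Fr = r_next, F_next
    return M_isqrt @ r, M_sqrt @ rdot                # back to (q*, p*)
```

When f ≡ 0, the loop reduces to the exact solution of the Gaussian part, so h is no longer limited by the smallest variances in Σ; wrapped as the `integrate` argument of `hmc_step` above, it yields a rough stand-in for one expHMC transition.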