Exponential Integration for Hamiltonian Monte Carlo

Wei-Lun Chao 1   [email protected]
Justin Solomon 2   [email protected]
Dominik L. Michels 2   [email protected]
Fei Sha 1   [email protected]

1 Department of Computer Science, University of Southern California, Los Angeles, CA 90089
2 Department of Computer Science, Stanford University, 353 Serra Mall, Stanford, California 94305 USA

Abstract

We investigate numerical integration of ordinary differential equations (ODEs) for Hamiltonian Monte Carlo (HMC). High-quality integration is crucial for designing efficient and effective proposals for HMC. While the standard method is leapfrog (Störmer-Verlet) integration, we propose the use of an exponential integrator, which is robust to stiff ODEs with highly-oscillatory components. This oscillation is difficult to reproduce using leapfrog integration, even with carefully selected integration parameters and preconditioning. Concretely, we use a Gaussian distribution approximation to segregate stiff components of the ODE. We integrate this term analytically for stability and account for deviation from the approximation using variation of constants. We consider various ways to derive Gaussian approximations and conduct extensive empirical studies applying the proposed "exponential HMC" to several benchmarked learning problems. We compare to state-of-the-art methods for improving leapfrog HMC and demonstrate the advantages of our method in generating many effective samples with high acceptance rates in short running times.

1. Introduction

Markov chain Monte Carlo (MCMC) is at the core of many methods in computational statistics (Gelman et al., 2004; Robert & Casella, 2004) to sample from complex probability distributions for inference and learning.

Metropolis-Hastings MCMC obtains such samples by constructing a Markov chain with help from proposal distributions that generate random walks in sample space. A fraction of these walks aligned with the target distribution is kept, while the remainder is discarded. Sampling efficiency and quality depend critically on the proposal distribution. For this reason, designing good proposals has long been a subject of intensive research.

For distributions over continuous variables, Hamiltonian Monte Carlo (HMC) is a state-of-the-art sampler using classical mechanics to generate proposal distributions (Neal, 2011). The method starts by introducing auxiliary sampling variables. It then treats the negative log density of the variables' joint distribution as the Hamiltonian of particles moving in sample space. Final positions on the motion trajectories are used as proposals.

The resulting equations of motion are ordinary differential equations (ODEs) requiring numerical integration. Several factors affect the choice of integrators. Time reversibility is needed for the Markov chain to converge to the target distribution, and volume preservation ensures that the sample acceptance test is tractable. The acceptance rate of HMC is determined by how well the integrator preserves the energy of the physical system, and hence high-fidelity numerical solutions are preferable. In the rare case that the ODE can be integrated exactly, no samples would be rejected. More realistically, however, when the ODE is integrated approximately, especially over a long time period to encourage exploration of the sample space, integration error can impede energy conservation, slowing convergence. Thus, the time period for integration usually is subdivided into many shorter steps.

Despite its relevance to sampling efficiency, investigation of sophisticated numerical integrators for HMC has been scarce in the current machine learning literature. The standard HMC integrator is the leapfrog (Störmer-Verlet) method. Leapfrog integration, however, is sensitive to stiff ODEs with highly-oscillatory dynamical components (Hairer et al., 2006).
When the target density yields a stiff ODE, e.g., a multivariate Gaussian distribution with small variances in certain dimensions, the time step for leapfrog is limited to the scale of the stiff components to avoid perturbing the energy of the system and consequently lowering MCMC acceptance rates. This results in limited motion in sample space, requiring many integration steps to explore the space. More generally, stiff ODEs occur in HMC whenever target distributions tightly peak around their modes, further motivating the need for advanced integrators. Although preconditioning can reduce stiffness and partially alleviate this problem, it is often insufficient, as demonstrated in our empirical studies.

In this paper, we address the challenge of simulating stiff ODEs in HMC using an explicit exponential integrator. Exponential integrators are known in simulation and numerics for their ability to take large time steps with enhanced stability. They decompose ODEs into two terms, a linearization solvable in closed form and a nonlinear remainder. For HMC, the linear term encodes a Gaussian component of the distribution; when the small variances in the distribution are adequately summarized by this part, exponential integration outperforms generic counterparts like leapfrog. Our exponential integrator ("expHMC") is easily implemented and efficient, with high acceptance rates, broad exploration of sample space, and fast convergence due to the reduced restriction on the time step size. The flexibility to choose filter functions in exponential integrators further enhances the applicability of our method.

We validate the effectiveness of expHMC with extensive empirical studies on various types of distributions, including Bayesian logistic regression and independent component analysis. We also compare expHMC to alternatives, such as the conventional leapfrog method and the recently proposed Riemann manifold HMC, demonstrating desirable characteristics in scenarios where other methods suffer.

We describe basic HMC in §2 and our approach in §3.
We then review related work in §4, report several empirical studies in §5, compare to splitting in §6, and conclude in §7.

2. Hamiltonian Monte Carlo

Suppose we would like to sample a random variable q ∈ R^d ∼ p(q). Hamiltonian Monte Carlo (HMC) considers the joint distribution between q and an auxiliary random variable p ∈ R^d ∼ p(p) that is distributed independently of q. For simplicity, p(p) is often assumed to be a zero-mean Gaussian with covariance M.

The joint density p(q, p) is used to define a Hamiltonian

H(q, p) = −log p(q) − log p(p) = U(q) + (1/2) p^T M^{-1} p + const.,   (1)

where M is referred to as the (preconditioning) mass matrix. The dynamics governed by this Hamiltonian are given by the following coupled system of ODEs:

q̇ = ∇_p H = M^{-1} p,
ṗ = −∇_q H = −∇_q U(q),   (2)

where dots denote derivatives in time.

The trajectories of q and p provide proposals to the Metropolis-Hastings MCMC procedure. Specifically, HMC applies the following steps: (i) starting from k = 1, draw p_k ∼ N(0, M); (ii) compute the position (q*, p*) by simulating (2) for time t with initial conditions (q_{k−1}, p_k); (iii) compute the change in Hamiltonian δH = H(q_{k−1}, p_k) − H(q*, p*); (iv) output q_k = q* with probability min(e^{δH}, 1).

If (q*, p*) comes from solving (2) exactly, then by conservation of H, δH = 0 and the new sample is always accepted. Closed-form solutions to (2) rarely exist, however. In practice, t is broken into L discrete steps, and integration is carried out numerically in increments of h = t/L. This discretization creates a chance of rejected samples, as H is no longer conserved exactly. The constructed chain of {q_k} still converges to the target distribution, so long as the integrator is symplectic and time-reversible (Neal, 2011).

An integrator satisfying these criteria is the leapfrog (Störmer-Verlet) method in Figure 1. It is easily implemented and explicit—no linear or nonlinear solvers are needed to generate (q_L, p_L) from (q_0, p_0). For large h, however, it suffers from instability that can lead to inaccurate solutions. This restriction on h is problematic when the ODE (2) is stiff, requiring small h and correspondingly large L to resolve high-frequency oscillations of (q, p). We demonstrate this instability empirically in §5.

function LEAPFROG((q_0, p_0); h, L)
    p_{1/2} ← p_0 − (h/2) ∇_q U(q_0)
    for i ← 1, 2, ..., L
        q_i ← q_{i−1} + h M^{-1} p_{i−1/2}
        if i ≠ L then
            p_{i+1/2} ← p_{i−1/2} − h ∇_q U(q_i)
    p_L ← p_{L−1/2} − (h/2) ∇_q U(q_L)
    return (q_L, p_L) as (q*, p*)

Figure 1. Leapfrog integrator.
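As a reference point for the comparisons below, here is a minimal sketch of the leapfrog proposal of Figure 1 wrapped in the Metropolis-Hastings test of steps (i)-(iv). It is written in Python/NumPy purely for illustration (the paper's experiments use Matlab), assumes M = I, and uses function names of our own choosing.

```python
import numpy as np

def leapfrog(q0, p0, grad_U, h, L):
    """Leapfrog (Stormer-Verlet) integrator of Figure 1, with M = I."""
    q, p = q0.copy(), p0.copy()
    p = p - 0.5 * h * grad_U(q)          # initial half step for the momentum
    for i in range(L):
        q = q + h * p                    # full position step (M^{-1} = I)
        if i != L - 1:
            p = p - h * grad_U(q)        # full momentum step except at the last iteration
    p = p - 0.5 * h * grad_U(q)          # final half step for the momentum
    return q, p

def hmc_iteration(q, U, grad_U, h, L, rng):
    """One Metropolis-Hastings HMC update following steps (i)-(iv) of Section 2."""
    p = rng.standard_normal(q.shape)                              # (i) draw p ~ N(0, I)
    q_star, p_star = leapfrog(q, p, grad_U, h, L)                 # (ii) simulate (2)
    dH = (U(q) + 0.5 * p @ p) - (U(q_star) + 0.5 * p_star @ p_star)   # (iii) change in H
    if np.log(rng.uniform()) < dH:                                # (iv) accept w.p. min(e^dH, 1)
        return q_star
    return q
```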
3. Exponential HMC

The idea behind exponential integration is to consider dynamics at different scales explicitly. This is achieved by decomposing the Hamiltonian into two terms. The first encodes high-frequency and numerically-unstable oscillation, while the second encodes slower dynamics that are robust to explicit integration. The insight is that the first part can be solved analytically, leaving numerical integration for the more stable remainder. Assembling these two parts will provide a stable solution to the original dynamics.

Here, we introduce the basic idea first and then propose practical strategies for decomposition.

3.1. Exponential Integrator

We consider the following structure for U:

∇_q U(q) = Σ^{-1} q + f(q)   (3)

for some positive definite matrix Σ ∈ R^{n×n} and a possibly nonlinear function f : R^n → R^n. This form can originate from a density with a "slow" portion f(q) modulated by a possibly stiff Gaussian distribution from Σ^{-1}, with stiffness arising from small variances.

The dynamical equations (2) are equivalent to

q̈ + M^{-1} (Σ^{-1} q + f(q)) = 0.   (4)

We substitute r = M^{1/2} q and rewrite (4) as

r̈ + Ω^2 r + F(r) = 0,   (5)

where Ω^2 = M^{-1/2} Σ^{-1} M^{-1/2} and F(r) = M^{-1/2} f(M^{-1/2} r).

When the nonlinear component f(·) (thus F(·)) vanishes, (5) is analytically solvable; in one dimension, the solution is r(t) = a cos(ωt + b), where a and b depend on the initial conditions. When the nonlinear component does not vanish, variation of constants shows

[ r(t) ; ṙ(t) ] = [ cos tΩ , Ω^{-1} sin tΩ ; −Ω sin tΩ , cos tΩ ] [ r(0) ; ṙ(0) ] − ∫_0^t [ Ω^{-1} sin((t−s)Ω) ; cos((t−s)Ω) ] F(r(s)) ds.   (6)

Cosine and sine here are matrix functions evaluated on eigenvalues with eigenvectors intact.

The solution (6) can be seen as a perturbation of the analytic solution (the first term on the right-hand side) due to the nonlinear component. Following Hairer & Lubich (2000), discretizing the integral leads to a class of explicit integrators advancing the pair (r_i, ṙ_i) forward for time h:

r_{i+1} = cos(hΩ) r_i + Ω^{-1} sin(hΩ) ṙ_i − (h^2/2) ψ(hΩ) F(φ(hΩ) r_i),
ṙ_{i+1} = −Ω sin(hΩ) r_i + cos(hΩ) ṙ_i − (h/2) [ψ_0(hΩ) F(φ(hΩ) r_i) + ψ_1(hΩ) F(φ(hΩ) r_{i+1})].   (7)

These integrators are parameterized by the filter functions φ, ψ, ψ_0, ψ_1 : R^{n×n} → R^{n×n}. For HMC to converge, we enforce the following criteria on the filters:

• Consistency and accuracy to second order: φ(0) = ψ(0) = ψ_0(0) = ψ_1(0) = 1
• Reversibility: ψ(·) = sinc(·) ψ_1(·) and ψ_0(·) = cos(·) ψ_1(·)
• Symplecticity: ψ(·) = sinc(·) φ(·)

These conditions permit several choices of filters. We adopt two basic options:

Simple filters (Deuflhard, 1979): φ = 1, ψ = sinc(·), ψ_0 = cos(·), ψ_1 = 1
Mollified filters (García-Archilla et al., 1998): φ = sinc(·), ψ = sinc^2(·), ψ_0 = cos(·) sinc(·), ψ_1 = sinc(·)

The mollified filters promote smooth evolution of the Hamiltonian. In contrast, the simple filters do not filter the argument of the nonlinearity. In this case, numerical error can oscillate with an amplitude growing with the step size, increasing the likelihood of rejected HMC samples.

Figure 2 lists the steps of our proposed exponential integrator for HMC—it is a drop-in replacement for leapfrog in Figure 1. The integrator is explicit and with many operations precomputable, so iterations are comparable in speed to those from leapfrog. See the supplementary material for more details. This method is stable, however, when F(r) is small relative to Ω^2 r. Correctness is given by the following proposition, proved in the supplementary material:

function EXPONENTIAL-STEP((q_0, p_0); h, L)
    ▷ Transform into the new variables
    (r_0, ṙ_0) ← (M^{1/2} q_0, M^{-1/2} p_0)
    ▷ Precompute the following matrices and the filters
    (C, S) ← (cos(hΩ), sin(hΩ))
    for i ← 0, 1, ..., L − 1
        r_{i+1} ← C r_i + Ω^{-1} S ṙ_i − (h^2/2) ψ F(φ r_i)
        ṙ_{i+1} ← −Ω S r_i + C ṙ_i − (h/2) (ψ_0 F(φ r_i) + ψ_1 F(φ r_{i+1}))
    return (M^{-1/2} r_L, M^{1/2} ṙ_L) as (q*, p*)

Figure 2. Exponential integrator.

Proposition 1. HMC using the integrator in Figure 2 is convergent with equilibrium distribution p(q).
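To make the update (7) and Figure 2 concrete, here is a minimal sketch in Python/NumPy (the paper's own implementation is in Matlab). It assumes M = I, so that r = q and Ω^2 = Σ^{-1}, and uses the simple filters φ = 1, ψ = sinc, ψ_0 = cos, ψ_1 = 1; the function and variable names are our own.

```python
import numpy as np

def exponential_hmc_step(q0, p0, grad_U, Sigma_inv, h, L):
    """One HMC proposal (q*, p*) using the exponential integrator of Figure 2.

    Sketch only: assumes the mass matrix M = I (so r = q, rdot = p) and the
    simple filters phi = 1, psi = sinc, psi0 = cos, psi1 = 1.  grad_U(q)
    returns the full gradient of U; the linear part Sigma_inv @ q is
    subtracted to obtain the nonlinear remainder F of eq. (3).
    """
    # Spectral decomposition Sigma_inv = V diag(w^2) V^T gives Omega = V diag(w) V^T.
    w2, V = np.linalg.eigh(Sigma_inv)
    w = np.sqrt(w2)                                   # eigenvalues of Omega (assumed > 0)
    mat = lambda g: (V * g(w)) @ V.T                  # apply a scalar function to Omega
    C = mat(lambda x: np.cos(h * x))                  # cos(h Omega)
    S = mat(lambda x: np.sin(h * x))                  # sin(h Omega)
    Om, Om_inv = mat(lambda x: x), mat(lambda x: 1.0 / x)
    psi = mat(lambda x: np.sin(h * x) / (h * x))      # sinc(h Omega)

    F = lambda r: grad_U(r) - Sigma_inv @ r           # nonlinearity f(q) from eq. (3)

    r, rdot = np.array(q0, dtype=float), np.array(p0, dtype=float)
    Fr = F(r)
    for _ in range(L):
        r_next = C @ r + Om_inv @ (S @ rdot) - 0.5 * h**2 * (psi @ Fr)
        Fr_next = F(r_next)
        # Velocity update: psi0 = cos(h Omega) = C, psi1 = identity.
        rdot = -Om @ (S @ r) + C @ rdot - 0.5 * h * (C @ Fr + Fr_next)
        r, Fr = r_next, Fr_next
    return r, rdot
```

For a Gaussian target (F ≡ 0) the loop reduces to the exact rotation in phase space, which is why proposals are then accepted with probability 1.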

3.2. Strategies for Decomposing

U(q) = −log p(q) may not be given in form (3) naturally. In this case, we separate out the Σ^{-1} q term by approximating p(q) with a Gaussian N(µ, Σ). Without loss of generality, we assume µ = 0 by shifting the distribution by µ. Our decomposition then becomes

∇_q U(q) = Σ^{-1} q + (−∇_q log p(q) − Σ^{-1} q),   (8)

where the term in parentheses plays the role of f(q).

The Gaussian approximation provides a natural decomposition for our integrator. The integrator is exact when f(q) = 0, implying that HMC for Gaussian distributions using this integrator never rejects a sample. Even if Σ^{-1} q induces high-frequency oscillatory behavior, e.g. when p(q) is sharply peaked, the numerical drift of this integrator is only a function of its behavior on f(q).

We consider a few Gaussian approximation strategies:

Laplace approximation   Let θ* be a local maximum (i.e., a mode) of p(θ). Then we can write

−log p(θ) ≈ −log p(θ*) + (1/2) (θ − θ*)^T H_{θ*} (θ − θ*),

where H_{θ*} is the Hessian of −log p(θ) at θ*. Hence, we can approximate p(θ) with a Gaussian N(θ*, H_{θ*}^{-1}). The inverse Hessian can then be used in our decomposition.

Empirical statistics   We run a probing MCMC sampler, such as HMC with leapfrog, to generate preliminary samples θ_1, θ_2, ..., θ_m. We then use the sample mean and covariance to form a Gaussian approximation to kickstart exponential HMC. We then accumulate a few more samples to refine our estimate of the mean and covariance and form a new approximating Gaussian. The process can be iterated and converges asymptotically to the true distribution.

Manifold structures   Inspired by Riemann manifold HMC (RMHMC) (Girolami & Calderhead, 2011), we can leverage geometric properties of the sample space. Similar to the strategy above using empirical statistics, we update our mean and covariance estimates after every m samples. We take the sample mean of the m samples as the updated mean and exploit the location-specific Fisher-Rao metric defined in (Girolami & Calderhead, 2011) to update the covariance. Namely, we set the covariance to be the inverse of the average of the metrics corresponding to the m samples. Note that if the distribution is Gaussian, this average simplifies to the inverse covariance matrix.

In contrast to the Laplace approximation, which provides a global approximation centered at a mode θ* of p(θ), manifold structures incorporate local information: each Fisher-Rao metric defines a local Gaussian approximation.

In §5, we extensively study the empirical effectiveness of our exponential integrator with these strategies.
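To make the Laplace-approximation strategy above concrete, the sketch below (our own Python illustration; names, optimizer choice, and the finite-difference Hessian are assumptions, not the paper's code) finds a mode of p(θ) numerically, takes Σ as the inverse Hessian of −log p at that mode, and returns the shifted decomposition (8).

```python
import numpy as np
from scipy.optimize import minimize

def laplace_decomposition(neg_log_p, grad_neg_log_p, theta_init, eps=1e-5):
    """Gaussian approximation N(theta*, H^{-1}) and the split of eq. (8).

    Returns (mu, Sigma, f) where, after shifting by mu, grad U(q) =
    Sigma^{-1} q + f(q) with f the non-Gaussian remainder.
    """
    # Mode of p(theta): minimize -log p.
    res = minimize(neg_log_p, theta_init, jac=grad_neg_log_p, method="BFGS")
    mu = res.x
    d = mu.size

    # Finite-difference Hessian of -log p at the mode (placeholder for an exact Hessian).
    H = np.zeros((d, d))
    for j in range(d):
        e = np.zeros(d)
        e[j] = eps
        H[:, j] = (grad_neg_log_p(mu + e) - grad_neg_log_p(mu - e)) / (2 * eps)
    H = 0.5 * (H + H.T)                       # symmetrize
    Sigma_inv = H
    Sigma = np.linalg.inv(H)

    # Remainder in the shifted coordinates q = theta - mu, as in eq. (8).
    f = lambda q: grad_neg_log_p(q + mu) - Sigma_inv @ q
    return mu, Sigma, f
```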
4. Related Work

Obtaining a high number of effective samples in a short amount of computation time is a core research problem in Hamiltonian Monte Carlo. Girolami & Calderhead (2011) adapt the mass matrix M to the manifold structure of the sample space, performing location-specific preconditioning. Hoffman & Gelman (2014) and Wang et al. (2013) automatically choose steps h and iteration counts L. Sohl-Dickstein et al. (2014) extend trajectories to accepting states to avoid rejection. For large datasets, Chen et al. (2014) formulate HMC using stochastic gradients to reduce computational cost. Many of those approaches treat the numerical integrator as a black box, often using the simple leapfrog method. Our aim is to establish the benefits of more sophisticated and computationally efficient integrators. As such, the integrator we propose can be substituted into any of those approaches.

Splitting is a common alternative that bears some similarity to our work (Hairer et al., 2006). It separates Hamiltonians into multiple terms and alternates between integrating each independently. In statistical physics, splitting is applied in lattice field theory (Kennedy & Clark, 2007; Clark et al., 2010) and in quantum chromodynamics (Takaishi, 2002). In statistics, Neal (2011) and Shahbaba et al. (2014) split over data or into Gaussian/non-Gaussian parts. The latter strategy, also used by Beskos et al. (2011), has similar structure to ours, never rejecting samples from Gaussians. Our approximations (§3.2) can be used in such splittings. But while their integrator for the non-Gaussian part has no knowledge of the Gaussian part, we use a stable, geometrically-motivated combination. Blanes et al. (2014) also split using Gaussian model problems; while they provide theoretical groundwork for stability and accuracy of HMC, they study variations of leapfrog. Pakman & Paninski (2014) report exact HMC integration for Gaussians and truncated Gaussians but do not consider the general case.

Our proposed exponential integrator represents a larger departure from leapfrog and splitting integrators. Exponential integration was motivated by the observation that traditional integrators do not exploit analytical solutions to linear ODEs (Hersch, 1958; Hochbruck et al., 1998). We employ a trigonometric formula introduced by Gautschi (1961). This was combined with the trapezoidal rule (Deuflhard, 1979) and extended using filters (García-Archilla et al., 1999). There exist many variations of the proposed exponential integrator; the one in this paper is chosen for its simplicity and effectiveness in HMC.

5. Experiments

We validate the effectiveness of exponential HMC on both synthetic and real data and compare to alternative methods. All the algorithms are implemented in Matlab to ensure a fair comparison¹, especially when we evaluate different approaches for their computational cost.

We present major results in what follows. Additional empirical studies are described in the supplementary material.

¹ Code: https://github.com/pujols/Exponential-HMC

5.1. Setup

Variants of exponential integrators   We use expHMC to denote the exponential integrator where the mean µ and covariance Σ are computed from parameters of the distribution, êxpHMC for integration with empirically estimated µ and Σ, and rmExpHMC for estimates obtained from manifold structures; see §3.2 for details. We only consider mollified filters (§3.1) in this section and defer comparison to the simple filters to the supplementary material. We take M = I for the mass matrix in (1), unless stated otherwise.

For expHMC, we directly use the parameters of Gaussian distributions on synthetic data, and apply the Laplace approximation on real data. For êxpHMC, we run HMC with leapfrog for burn-in and use the last N_1 burn-in samples to estimate µ and Σ. Both are updated after sampling every N_2 samples. rmExpHMC follows a similar procedure to êxpHMC, yet in updating, it uses information only from the N_2 new samples.

Evaluation criteria   We evaluate performance by computing acceptance rates (AR) and the minimum effective sample sizes (min ESS). The min ESS, suggested by Girolami & Calderhead (2011), measures sampling efficiency. To take computation time into consideration, we report min ESS/sec and the relative speed (RS) to the leapfrog method.
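The exact ESS estimator is not spelled out here; a common recipe estimates it from the chain's autocorrelations as ESS = N / (1 + 2 Σ_k ρ_k), truncating the sum at the first non-positive autocorrelation, and min ESS takes the minimum over dimensions. The sketch below (Python/NumPy, our own illustration rather than the authors' Matlab code) follows that assumed recipe.

```python
import numpy as np

def ess_1d(x):
    """Effective sample size of a single chain dimension.

    Uses ESS = N / (1 + 2 * sum of autocorrelations), truncating the sum
    at the first lag whose autocorrelation drops to zero or below.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    x = x - x.mean()
    # Empirical autocorrelation (direct convolution; fine for illustration).
    acf = np.correlate(x, x, mode="full")[n - 1:]
    acf /= acf[0]
    tau = 1.0
    for k in range(1, n):
        if acf[k] <= 0:
            break
        tau += 2.0 * acf[k]
    return n / tau

def min_ess(samples):
    """samples: array of shape (N, d); returns the minimum ESS over dimensions."""
    return min(ess_1d(samples[:, j]) for j in range(samples.shape[1]))
```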

Figure 3. On a 2D Gaussian, expHMC always accepts proposals, outperforming leapfrog under identical (h, L). (a) Accepted and rejected proposals of each method (AR = 0.445, 1, and 0.939 for leapfrogHMC, expHMC, and êxpHMC, respectively); (b) the accumulated acceptance rate of êxpHMC increases as samples are collected.

5.2. Synthetic Example: Gaussian Distribution

As a sanity check and to illustrate the basic properties of exponential integration, we first compare the proposed exponential integrator to the leapfrog (leapfrogHMC) integrator on the 2D Gaussian distribution N(µ_g, Σ_g), where Σ_g has eigenvalues {1, 0.1}. We run 200 iterations for burn-in and collect 1000 samples, with (h, L) = (0.6, 8) for all methods, a fairly large step to encourage exploration of sample space. For êxpHMC, we set (N_1, N_2) = (50, 20).

Samples and acceptance rates are shown in Figure 3. As expected, expHMC accepts samples unconditionally. The acceptance rate of êxpHMC increases as it updates the Gaussian approximation, since the sample statistics begin to estimate the mean and covariance accurately.
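For concreteness, the snippet below mimics this experiment with the `exponential_hmc_step` sketch from §3.1 (again our own Python illustration, not the authors' Matlab implementation): a zero-mean 2D Gaussian whose covariance has eigenvalues {1, 0.1}, integrated with the fairly large step (h, L) = (0.6, 8). Because the target is exactly Gaussian and its parameters are used directly (the expHMC setting), δH stays at machine precision and every proposal is accepted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: zero-mean 2D Gaussian with covariance eigenvalues {1, 0.1}.
Sigma = np.diag([1.0, 0.1])
Sigma_inv = np.linalg.inv(Sigma)
grad_U = lambda q: Sigma_inv @ q                 # gradient of U(q) = -log p(q) + const
U = lambda q: 0.5 * q @ Sigma_inv @ q

h, L = 0.6, 8
q = np.zeros(2)
accepted = 0
samples = []
for k in range(1000):
    p = rng.standard_normal(2)                   # momentum with M = I
    q_star, p_star = exponential_hmc_step(q, p, grad_U, Sigma_inv, h, L)
    dH = (U(q) + 0.5 * p @ p) - (U(q_star) + 0.5 * p_star @ p_star)
    if np.log(rng.uniform()) < dH:               # accept with probability min(e^{dH}, 1)
        q, accepted = q_star, accepted + 1
    samples.append(q)
print("acceptance rate:", accepted / 1000)       # ~1.0 in this exact-Gaussian setting
```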
Next, we examine how the eigenvalues of Σ_g affect acceptance. We construct Σ_g with eigenvalues {1, λ}, using identical parameters as above except (h, L) = (0.12, 10); the smaller h increases leapfrog acceptance rates. We vary λ ∈ [2^{-8}, 2^0], showing the acceptance rate of each method in Figure 4a. Again, expHMC maintains acceptance rate 1, while the leapfrog acceptance degrades as λ shrinks. The acceptance rate of êxpHMC is also affected by λ. As the sample size increases, however, the acceptance rate of êxpHMC increases and converges to 1, as in Figure 4b.

Preconditioning   One way to mitigate stiffness in ODEs of HMC (e.g., small λ in the previous example) is to adjust the mass matrix M (Neal, 2011). HMC with M = Σ_g^{-1} is equivalent to sampling from a homogeneous Gaussian. However, even with preconditioning, leapfrogHMC has time step restrictions on Gaussian distributions. We further demonstrate in §5.3 that preconditioning alone is insufficient to deal with stiff ODEs on more complicated distributions.

We include more experiments on synthetic examples (e.g., mixtures of Gaussians) in the supplementary material.

5.3. Real Data: Bayesian Logistic Regression

We apply the proposed methods to sampling from the posterior of Bayesian logistic regression (BLR), comparing to leapfrogHMC and Riemann manifold HMC (RMHMC) by Girolami & Calderhead (2011). We also compare to an accelerated version of RMHMC using Lagrangian dynamics (e-RMLMC) by Lan et al. (2012).

Model   Given a dataset {x_i, y_i}_{i=1}^N with N instances, where x_i ∈ R^d is a feature vector and y_i ∈ {−1, 1} is a binary label, the posterior distribution p(θ) of Bayesian logistic regression is defined as

p(θ) ∝ π(θ) ∏_{i=1}^N σ(y_i (θ^T x_i)),

where π(θ) is a prior on θ and σ(z) = (1 + e^{−z})^{-1}.
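For reference, the potential U(θ) = −log p(θ) and its gradient for this model with the Gaussian prior N(0, σI) used below take the following form; the sketch is our own Python illustration (the paper's implementation is in Matlab), with a numerically stable log(1 + e^{−z}) via np.logaddexp.

```python
import numpy as np

def make_blr_potential(X, y, sigma):
    """U(theta) = -log p(theta), up to constants, for Bayesian logistic regression
    with prior N(0, sigma * I); X is (N, d), y is in {-1, +1}^N."""
    def U(theta):
        z = y * (X @ theta)
        return 0.5 * theta @ theta / sigma + np.sum(np.logaddexp(0.0, -z))

    def grad_U(theta):
        z = y * (X @ theta)
        # d/dtheta of log(1 + exp(-z_i)) is -y_i * x_i * sigmoid(-z_i)
        s = 1.0 / (1.0 + np.exp(z))              # sigmoid(-z_i)
        return theta / sigma - X.T @ (y * s)

    return U, grad_U
```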

Figure 4. (a) On a 2D Gaussian, expHMC has acceptance rate 1 under varying λ (with h fixed); (b) the acceptance rate of êxpHMC (at λ = 2^{-8}) continuously increases as more samples are collected.

Figure 5. Acceptance rate and min ESS/sec of each method on the five datasets, with σ ∈ {0.01, 1, 100} (the variance of the prior in BLR). Panels (a), (b), and (c) correspond to σ = 100, 1, and 0.01; the horizontal axis is the acceptance rate and the vertical axis is min ESS/Time (s); methods shown are leapfrogHMC, RMHMC, expHMC, êxpHMC, expHMC (4h), and êxpHMC (4h). As illustrated, expHMC and êxpHMC achieve higher acceptance rates and/or higher min ESS/sec. Best viewed in color.

Dataset   We consider five datasets from the UCI repository (Bache & Lichman, 2013): Ripley, Heart, Pima Indian, German credit, and Australian credit. They have 250 to 1000 instances with feature dimensions between 2 and 24. Each dimension is normalized to have zero mean and unit variance to ensure that M = I is a fair choice for leapfrog and exponential integration.

Experimental setting   We consider the homogeneous Gaussian prior N(0, σI) with σ ∈ {0.01, 1, 100}. We take L = 100 and manually set h in leapfrogHMC to achieve acceptance rates in [0.6, 0.9] for each combination of dataset and σ, as suggested by Betancourt et al. (2014a). The same (h, L) is used for expHMC, êxpHMC, and rmExpHMC, along with (2h, L/2) and (4h, L/4) to test larger steps. We use parameters for RMHMC and e-RMLMC from their corresponding references. We set (N_1, N_2) = (500, 250) for êxpHMC and apply the same location-specific metric defined by Girolami & Calderhead (2011) along with (N_1, N_2) = (500, 500) for rmExpHMC. For expHMC, (µ, Σ) come from the Laplace approximation. We collect 5000 samples after 5000 iterations of burn-in as suggested in previous work, repeating for 10 trials. To ensure high effective sample size (ESS), for all the compared methods we uniformly sample the number of steps from {1, ..., L} in each iteration, as suggested by Neal (2011).

Table 1. Pima Indian dataset (d = 7, N = 532, and 8 regression coefficients). "AR" denotes acceptance rate, and "RS" denotes relative speed.

σ = 100
METHOD              TIME(S)  MIN ESS  S/MIN ESS  RS    AR
leapfrogHMC         6.3      3213     0.0019     1     0.82
RMHMC               28.3     4983     0.0057     0.34  0.95
e-RMLMC             19.8     4948     0.0040     0.48  0.95
expHMC (h, L)       7.8      3758     0.0021     0.94  0.95
expHMC (2h, L/2)    4.1      2694     0.0015     1.29  0.88
expHMC (4h, L/4)    2.2      2555     0.0008     2.30  0.88
êxpHMC (h, L)       7.7      3876     0.0020     0.98  0.95
êxpHMC (2h, L/2)    4.0      3025     0.0013     1.47  0.89
êxpHMC (4h, L/4)    2.1      2845     0.0008     2.58  0.85
rmExpHMC (h, L)     7.8      4143     0.0019     1.00  0.97
rmExpHMC (2h, L/2)  4.1      3229     0.0013     1.49  0.91
rmExpHMC (4h, L/4)  2.2      3232     0.0007     2.78  0.90

σ = 0.01
METHOD              TIME(S)  MIN ESS  S/MIN ESS  RS    AR
leapfrogHMC         6.0      3865     0.0015     1     0.89
RMHMC               28.3     4987     0.0057     0.27  0.95
e-RMLMC             20.2     4999     0.0040     0.38  0.95
expHMC (h, L)       7.3      4239     0.0017     0.89  0.99
expHMC (2h, L/2)    3.8      4164     0.0009     1.69  0.97
expHMC (4h, L/4)    2.0      4226     0.0005     3.21  0.97
êxpHMC (h, L)       7.1      4141     0.0017     0.90  0.98
êxpHMC (2h, L/2)    3.7      3771     0.0010     1.57  0.93
êxpHMC (4h, L/4)    2.0      3540     0.0006     2.79  0.90
rmExpHMC (h, L)     7.3      4316     0.0017     0.89  0.99
rmExpHMC (2h, L/2)  3.8      4284     0.0009     1.68  0.97
rmExpHMC (4h, L/4)  2.0      3758     0.0005     2.80  0.93

Experimental results   We summarize results on the Pima Indian dataset in Table 1; remaining results are in the supplementary material. Our three methods expHMC, êxpHMC, and rmExpHMC outperform leapfrogHMC in terms of min ESS and acceptance rate, with similar speed as leapfrogHMC. RMHMC and e-RMLMC have the highest min ESS but execute much slower than leapfrogHMC, leading to poor relative speeds.
Besides, expHMC, êxpHMC, and rmExpHMC achieve acceptance rates above 0.8 in the (2h, L/2) and (4h, L/4) settings. The high acceptance rates under larger step sizes allow them to take fewer steps during sampling, raising the relative speed. Enlarging the steps in leapfrogHMC, RMHMC, and e-RMLMC, however, leads to a significant drop in acceptance and thus is not presented.

As σ shrinks, the ODE becomes stiff and the relative speeds of all our methods improve, demonstrating the effectiveness of exponential integration on stiff problems. This is highlighted in Figure 5, where we plot each method by acceptance rate and min ESS/sec on the five datasets; each point corresponds to a method-dataset pair. expHMC and êxpHMC achieve high acceptance rates and high min ESS/sec (top right) compared to leapfrogHMC and RMHMC, especially for small σ. For clarity, rmExpHMC and e-RMLMC are not shown; they perform similarly to êxpHMC and RMHMC, respectively.

Table 2. MNIST dataset with digits 7 and 9 (d = 100, N = 12214). "AR" denotes acceptance rate, and "RS" denotes relative speed.

σ = 0.01
METHOD              TIME(S)  MIN ESS  S/MIN ESS  RS    AR
leapfrogHMC         1215     22164    0.055      1     0.88
expHMC (h, L)       1413     23720    0.060      0.92  0.96
expHMC (2h, L/2)    750      16643    0.045      1.22  0.87
êxpHMC (h, L)       1437     23080    0.062      0.88  0.94
êxpHMC (2h, L/2)    779      11919    0.065      0.84  0.73
rmExpHMC (h, L)     1450     24347    0.060      0.92  0.97
rmExpHMC (2h, L/2)  781      19623    0.040      1.38  0.90

Table 3. MEG dataset (d = 5, N = 17730, and 25 sampling dimensions). "AR" denotes acceptance rate, and "RS" denotes relative speed.

σ = 100
METHOD                 TIME(S)  MIN ESS  S/MIN ESS  RS    AR
leapfrogHMC (h, L)     196      2680     0.073      1     0.85
leapfrogHMC (2h, L/2)  109      498      0.220      0.33  0.26
êxpHMC (h, L)          207      3256     0.063      1.15  0.95
êxpHMC (2h, L/2)       109      2579     0.042      1.73  0.82
êxpHMC (4h, L/4)       62       1143     0.054      1.34  0.64

Scalability   Sophisticated integrators, like that for RMHMC, lose speed when dimensionality increases. To investigate this effect, we sample from the BLR posterior on digits 7 and 9 from the MNIST dataset (12214 training instances); each instance is reduced to 100 dimensions by PCA. We set (h, L) = (0.013, 30), with (N_1, N_2) = (2000, 500) for êxpHMC and (N_1, N_2) = (2000, 2000) for rmExpHMC. The prior σ = 0.01. Performance averaged over 10 trials is summarized in Table 2, for 30000 samples after 10000 iterations of burn-in.

expHMC, êxpHMC, and rmExpHMC have runtimes comparable to leapfrogHMC, scaling up well. High acceptance rates and better relative speed under larger step sizes also suggest more efficient sampling by our methods. RMHMC and e-RMLMC are impractical on this task as they take around 1 day to process 30000 samples. Specifically, the speed improvement of e-RMLMC over RMHMC diminishes for large-scale problems like this one, as shown by Lan et al. (2012) and analyzed by Betancourt et al. (2014b).

Preconditioning   In BLR, a suitable choice of (constant) M is the Hessian of −log p(θ) at the MAP estimate, i.e., Σ^{-1} by the Laplace approximation. We precondition both our method and leapfrog for fair comparison.² With step sizes chosen so that preconditioned leapfrog (PLHMC) has acceptance rate > 0.85, expHMC still always results in a higher acceptance rate. Further increasing the step sizes, acceptance rates of PLHMC drop significantly while ours do not suffer as strongly. More importantly, as σ shrinks (the ODE becomes stiffer), we observe the same performance gain as in Figure 5. These results demonstrate the advantage of our method even with preconditioning. Details are in the supplementary material.

² Both solve the same ODE (2), using different integrators.

5.4. Real Data: Independent Component Analysis

Given N d-dimensional observations X = {x_i ∈ R^d}_{i=1}^N, independent component analysis (ICA) aims to find the demixing matrix W ∈ R^{d×d} so as to recover the d mutually independent sources {y_i ∈ R^d}_{i=1}^N. We take the formulation introduced by Hyvärinen & Oja (2000) that defines a posterior distribution p(W | X). We then compare our methods to leapfrogHMC in sampling from the posterior. We do not compare to RMHMC and e-RMLMC as these methods do not scale well to this problem.

Model   The joint probability of (X, W) is

p(X, W) = |det(W)|^N { ∏_{i=1}^N ∏_{j=1}^d p_j(w_j^T x_i) } × ∏_{kl} N(W_kl; 0, σ),   (9)

where we use a Gaussian prior over W, with w_j^T the j-th row of W. We set p_j(y_ij) = {4 cosh^2((1/2) y_ij)}^{-1} with y_i = W x_i, as suggested by Korattikara et al. (2014).
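For reference, the potential U(W) = −log p(X, W) implied by (9) and its gradient can be written as follows; this is our own Python sketch (the paper's code is Matlab), and the gradient derivation—N W^{-T} from the determinant term, tanh(y/2) from the cosh source density, and W/σ from the prior—is ours rather than quoted from the paper.

```python
import numpy as np

def make_ica_potential(X, sigma):
    """U(W) = -log p(X, W) for the ICA posterior (9), up to constants.

    X is (N, d) with observations as rows; W is the d x d demixing matrix;
    p_j(y) = 1 / (4 cosh^2(y / 2)) and each entry of W has prior N(0, sigma).
    """
    N = X.shape[0]

    def U(W):
        Y = X @ W.T                                  # rows are y_i = W x_i
        sign, logdet = np.linalg.slogdet(W)          # log |det W|
        return (-N * logdet
                + 2.0 * np.sum(np.log(np.cosh(Y / 2.0)))
                + 0.5 * np.sum(W ** 2) / sigma)      # constants dropped

    def grad_U(W):
        Y = X @ W.T
        return (-N * np.linalg.inv(W).T
                + np.tanh(Y / 2.0).T @ X
                + W / sigma)

    return U, grad_U
```

For sampling, W is flattened into a d^2-dimensional vector (here 25 dimensions) and these functions are wrapped accordingly.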
Dataset   We experiment on the MEG dataset (Vigário et al., 1997), which contains 122 channels and 17730 time-points. We extract the first 5 channels for our experiment (i.e., W ∈ R^{5×5}), leading to samples with 25 dimensions.

Experimental setting   We set σ = 100 for the Gaussian prior in (9). We set L = 50 and manually adjust h in leapfrogHMC to achieve acceptance rates in [0.6, 0.9]. The same (h, L) is used for our methods, along with (2h, L/2) and (4h, L/4) to test larger steps. We only examine êxpHMC, given its tractable strategy for Gaussian approximation. The parameters are set to (N_1, N_2) = (500, 250). We collect 5000 samples after 5000 iterations of burn-in, repeating for 5 trials. In each iteration, we sample the number of steps uniformly from {1, ..., L} to ensure high ESS, as in §5.3.

Experimental results   We summarize the results in Table 3. With comparable computational times, êxpHMC produces a much higher min ESS and acceptance rate than leapfrogHMC. As the step size increases, acceptance rates of leapfrogHMC drop drastically, yet êxpHMC can still maintain high acceptance and ESS, leading to better relative speed for efficient sampling.

6. Comparison to Splitting

We contrast our approach with the Gaussian splitting by Shahbaba et al. (2014). After applying the Laplace approximation, they integrate by alternating between the nonlinear and linear terms. The linear part is integrated in closed form, while the nonlinear part is integrated using a "forward Euler" step.

In the notation of Figure 2, we can write an iteration of their integrator as:

ṙ⁰_{i+1} ← ṙ_i − (h/2) F(r_i)
[ r_{i+1} ; ṙ¹_{i+1} ] ← [ cos hΩ , Ω^{-1} sin hΩ ; −Ω sin hΩ , cos hΩ ] [ r_i ; ṙ⁰_{i+1} ]
ṙ_{i+1} ← ṙ¹_{i+1} − (h/2) F(r_{i+1})

The nonlinearity vanishes when f(·) = 0, providing exact integration for Gaussians.
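Read as code, one splitting step looks as follows (our own Python sketch mirroring the display above, with M = I and the same precomputed matrices as the exponential sketch in §3.1); the difference from `exponential_hmc_step` is that F enters only through separate momentum kicks before and after the exact Gaussian rotation, rather than through the filtered terms of (7).

```python
import numpy as np

def splitting_step(r, rdot, F, C, S, Om, Om_inv, h):
    """One step of the Gaussian-splitting integrator of Shahbaba et al. (2014),
    written with C = cos(h Omega), S = sin(h Omega), Om = Omega, Om_inv = Omega^{-1}."""
    rdot = rdot - 0.5 * h * F(r)                  # half kick from the nonlinearity
    # Exact Gaussian rotation of (r, rdot); both right-hand sides use the old values.
    r, rdot = C @ r + Om_inv @ (S @ rdot), -Om @ (S @ r) + C @ rdot
    rdot = rdot - 0.5 * h * F(r)                  # second half kick at the new position
    return r, rdot
```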

The key difference between our integrator and theirs is in the treatment of f(·). Their integrator treats f(·) independently from the Gaussian part, stepping variables using a simple explicit integrator. Variation of constants (6), on the other hand, models the deviation from Gaussian behavior directly and steps all variables simultaneously.

Figure 6 illustrates a simple setting where this difference leads to significantly different outcomes. Here, we show paths corresponding to a single iteration of HMC with L = 1 for a Gaussian. For leapfrog integration these paths would be straight lines, but both splitting and exponential integration follow elliptical paths. We estimate the mean incorrectly, however, so f(·) is a constant but nonzero function. Splitting provides a velocity impulse and then follows an ellipse centered about the incorrect mean. Our integrator handles constant functions f(·) exactly, recovering the correct path.

Figure 6. Comparison between splitting and exponential integration. (a) The acceleration vector field corresponding to a Gaussian distribution is shifted, leading to a nonzero f(q) in the decomposition ∇_q U(q) = Σ^{-1} q + f(q); (b) this simple but nonzero f(q) is enough to perturb the initial velocity (0, 1) of the elliptical path for splitting, shearing it to the right, while exponential integration recovers the estimated path q(t).

This example illustrates the robustness we expect from exponential integration. For nonlinear distributions, we can use the "empirical" approach in §3.2 to determine a Gaussian approximation. Empirical means and covariances are likely inexact, implying that even for Gaussian distributions, splitting with this approximation may reject samples. Our integrator better handles the coupling between the Gaussian and non-Gaussian parts of p(q) and hence can be more robust to approximation error.

A more extensive empirical comparison to splitHMC is given in the supplementary material. In general, splitHMC performs similarly to our methods with simple filters (cf. §3.1), as discussed in §7. Our methods with mollified filters, however, handle departure from the Gaussian approximation more stably, outperforming splitHMC.
7. Discussion and Conclusion

The integrator for HMC has a strong effect on acceptance rates, ESS, and other metrics of convergence. While large integration times encourage exploration, numerically solving the underlying ODE using leapfrog requires many substeps to avoid high rejection rates. Although RMHMC deals with this issue by locally adjusting the mass matrix M, it suffers from significantly increased computation costs and poor scalability. In contrast, our method is based on an efficient explicit integrator that is robust to stiffness, even without preconditioning.

The choice of numerical integrator is a "meta" parameter for HMC, since each integrator discretizes the same dynamical equations. Our extensive studies have revealed the following:

• When the step size is small, leapfrog suffices.
• When the step size is large enough that acceptance for leapfrog declines, splitting and exponential integration with simple filters maintain stability. In this case, mollification sacrifices too much accuracy for stability. Details are in the supplementary material.
• For large step sizes, exponential HMC with mollified filters avoids degrading performance.
• Empirical means and covariances yield stable Gaussian approximations for exponential integrators.

While our experiments focus on basic HMC, future studies can assess all options for HMC, accounting e.g. for extensions like (Hoffman & Gelman, 2014) and (Sohl-Dickstein et al., 2014). Additional research also may reveal better integrators specifically for HMC. For instance, the exponential integrators introduced by Rosenbrock (1963) use local rather than global approximations, which may be suitable for distributions like mixtures of Gaussians, whose Gaussian approximation should change depending on location; the challenge is to maintain symplecticity and reversibility. Exponential integration also can be performed on Lie groups, providing avenues for non-Euclidean sampling. Regardless, exponential integrators remain practical, straightforward, and computationally inexpensive alternatives for HMC with the potential to improve sampling.

Acknowledgments

F. S. and W. C. are supported by ARO Award #W911NF-12-1-0241, ONR #N000141210066, and an Alfred P. Sloan Fellowship. D. M. is supported by the Max Planck Center for Visual Computing and Communication.

References

Bache, K. and Lichman, M. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

Beskos, A., Pinski, F. J., Sanz-Serna, J. M., and Stuart, A. M. Hybrid Monte Carlo on Hilbert spaces. Stochastic Processes and their Applications, 121(10):2201–2230, 2011.

Betancourt, M. J., Byrne, S., and Girolami, M. Optimizing the integrator step size for Hamiltonian Monte Carlo. arXiv preprint arXiv:1411.6669, 2014a.

Betancourt, M. J., Byrne, S., Livingstone, S., and Girolami, M. The geometric foundations of Hamiltonian Monte Carlo. arXiv preprint arXiv:1410.5110, 2014b.

Blanes, S., Casas, F., and Sanz-Serna, J. M. Numerical integrators for the Hybrid Monte Carlo method. SIAM Journal on Scientific Computing, 36(4):A1556–A1580, 2014.

Chen, T., Fox, E. B., and Guestrin, C. Stochastic gradient Hamiltonian Monte Carlo. In Proc. ICML, June 2014.

Clark, M. A., Joó, B., Kennedy, A. D., and Silva, P. J. Better HMC integrators for dynamical simulations. In Proc. International Symposium on Lattice Field Theory, 2010.

Deuflhard, P. A study of extrapolation methods based on multistep schemes without parasitic solutions. Journal of Applied Mathematics and Physics, 30:177–189, 1979.

García-Archilla, B., Sanz-Serna, J. M., and Skeel, R. D. Long-time-step methods for oscillatory differential equations. SIAM Journal on Scientific Computing, 20(3):930–963, 1998.

García-Archilla, B., Sanz-Serna, J. M., and Skeel, R. D. Long-time-step methods for oscillatory differential equations. SIAM Journal on Scientific Computing, 20:930–963, 1999.

Gautschi, W. Numerical integration of ordinary differential equations based on trigonometric polynomials. Numerische Mathematik, 3:381–397, 1961.

Gelman, A., Carlin, J., Stern, H., and Rubin, D. B. Bayesian Data Analysis. Chapman and Hall/CRC, 2004.

Girolami, M. and Calderhead, B. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B, 73(2):123–214, 2011.

Hairer, E. and Lubich, C. Long-time energy conservation of numerical methods for oscillatory differential equations. SIAM J. Numer. Anal., 38(2):414–441, July 2000.

Hairer, E., Lubich, C., and Wanner, G. Geometric numerical integration: structure-preserving algorithms for ordinary differential equations, volume 31. Springer Science & Business Media, 2006.

Hersch, J. Contribution à la méthode des équations aux différences. Zeitschrift für Angewandte Mathematik und Physik, 9:129–180, 1958.

Hochbruck, M., Lubich, C., and Selfhofer, H. Exponential integrators for large systems of differential equations. SIAM Journal on Scientific Computing, 19(5):1552–1574, 1998.

Hoffman, M. D. and Gelman, A. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res., 15(1):1593–1623, January 2014.

Hyvärinen, A. and Oja, E. Independent component analysis: algorithms and applications. Neural Networks, 13(4):411–430, 2000.

Kennedy, A. D. and Clark, M. A. Speeding up HMC with better integrators. In Proc. International Symposium on Lattice Field Theory, 2007.

Korattikara, A., Chen, Y., and Welling, M. Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In Proc. ICML, pp. 181–189, 2014.

Lan, S., Stathopoulos, V., Shahbaba, B., and Girolami, M. Lagrangian dynamical Monte Carlo. arXiv preprint arXiv:1211.3759, 2012.

Neal, R. M. MCMC using Hamiltonian dynamics. In Brooks, S., Gelman, A., Jones, G., and Meng, X.-L. (eds.), Handbook of Markov Chain Monte Carlo. CRC Press, 2011.

Pakman, A. and Paninski, L. Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians. Journal of Computational and Graphical Statistics, 23(2):518–542, 2014.

Robert, C. and Casella, G. Monte Carlo Statistical Methods. Springer, 2004.

Rosenbrock, H. H. Some general implicit processes for the numerical solution of differential equations. The Computer Journal, 5(4):329–330, 1963.

Shahbaba, B., Lan, S., Johnson, W. O., and Neal, R. M. Split Hamiltonian Monte Carlo. Statistics and Computing, 24(3):339–349, 2014.

Sohl-Dickstein, J., Mudigonda, M., and DeWeese, M. Hamiltonian Monte Carlo without detailed balance. In Proc. ICML, pp. 719–726, 2014.

Takaishi, T. Higher order hybrid Monte Carlo at finite temperature. Physics Letters B, 540(1–2):159–165, 2002.

Vigário, R., Särelä, J., and Oja, E. MEG data for studies using independent component analysis, 1997. URL http://research.ics.aalto.fi/ica/eegmeg/MEG_data.html.

Wang, Z., Mohamed, S., and de Freitas, N. Adaptive Hamiltonian and Riemann manifold Monte Carlo. In Proc. ICML, volume 28, pp. 1462–1470, May 2013.