
Stochastic Gradient Hamiltonian Monte Carlo

Tianqi Chen ([email protected]), Emily B. Fox ([email protected]), Carlos Guestrin ([email protected])
MODE Lab, University of Washington, Seattle, WA.

Abstract

Hamiltonian Monte Carlo (HMC) sampling methods provide a mechanism for defining distant proposals with high acceptance probabilities in a Metropolis-Hastings framework, enabling more efficient exploration of the state space than standard random-walk proposals. The popularity of such methods has grown significantly in recent years. However, a limitation of HMC methods is the required gradient computation for simulation of the Hamiltonian dynamical system—such computation is infeasible in problems involving a large sample size or streaming data. Instead, we must rely on a noisy gradient estimate computed from a subset of the data. In this paper, we explore the properties of such a stochastic gradient HMC approach. Surprisingly, the natural implementation of the stochastic approximation can be arbitrarily bad. To address this problem we introduce a variant that uses second-order Langevin dynamics with a friction term that counteracts the effects of the noisy gradient, maintaining the desired target distribution as the invariant distribution. Results on simulated data validate our theory. We also provide an application of our methods to a classification task using neural networks and to online Bayesian matrix factorization.

arXiv:1402.4102v2 [stat.ME] 12 May 2014. Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

1. Introduction

Hamiltonian Monte Carlo (HMC) (Duane et al., 1987; Neal, 2010) sampling methods provide a powerful Markov chain Monte Carlo (MCMC) sampling algorithm. The methods define a Hamiltonian function in terms of the target distribution from which we desire samples—the potential energy—and a kinetic energy term parameterized by a set of "momentum" auxiliary variables. Based on simple updates to the momentum variables, one simulates from a Hamiltonian dynamical system that enables proposals of distant states. The target distribution is invariant under these dynamics; in practice, a discretization of the continuous-time system is needed, necessitating a Metropolis-Hastings (MH) correction, though still with high acceptance probability. Based on the attractive properties of HMC in terms of rapid exploration of the state space, HMC methods have grown in popularity recently (Neal, 2010; Hoffman & Gelman, 2011; Wang et al., 2013).

A limitation of HMC, however, is the necessity to compute the gradient of the potential energy function in order to simulate the Hamiltonian dynamical system. We are increasingly faced with datasets having millions to billions of observations, or where data come in as a stream and we need to make inferences online, such as in online advertising or recommender systems. In these ever-more-common scenarios of massive batch or streaming data, such gradient computations are infeasible since they utilize the entire dataset, and thus are not applicable to "big data" problems. Recently, in a variety of machine learning algorithms, we have witnessed the many successes of utilizing a noisy estimate of the gradient based on a minibatch of data to scale the algorithms (Robbins & Monro, 1951; Hoffman et al., 2013; Welling & Teh, 2011). A majority of these developments have been in optimization-based algorithms (Robbins & Monro, 1951; Nemirovski et al., 2009), and a question is whether similar efficiencies can be garnered by sampling-based algorithms that maintain many desirable theoretical properties for Bayesian inference. One attempt at applying such methods in a sampling context is the recently proposed stochastic gradient Langevin dynamics (SGLD) (Welling & Teh, 2011; Ahn et al., 2012; Patterson & Teh, 2013). This method builds on first-order Langevin dynamics that do not include the crucial momentum term of HMC.

In this paper, we explore the possibility of marrying the efficiencies in state space exploration of HMC with the big-data computational efficiencies of stochastic gradients. Such an algorithm would enable a large-scale and online Bayesian sampling algorithm with the potential to rapidly explore the posterior. As a first cut, we consider simply applying a stochastic gradient modification to HMC and assess the impact of the noisy gradient. We prove that the noise injected in the system by the stochastic gradient no longer leads to Hamiltonian dynamics with the desired target distribution as the stationary distribution. As such, even before discretizing the dynamical system, we need to correct for this effect. One can correct for the injected gradient noise through an MH step, though this itself requires costly computations on the entire dataset. In practice, one might propose long simulation runs before an MH correction, but this leads to low acceptance rates due to large deviations in the Hamiltonian from the injected noise. The efficiency of this MH step could potentially be improved using the recent results of (Korattikara et al., 2014; Bardenet et al., 2014). In this paper, we instead introduce a stochastic gradient HMC method with friction added to the momentum update. We assume the injected noise is Gaussian, appealing to the central limit theorem, and analyze the corresponding dynamics. We show that using such second-order Langevin dynamics enables us to maintain the desired target distribution as the stationary distribution. That is, the friction counteracts the effects of the injected noise. For discretized systems, we consider letting the step size ε tend to zero so that an MH step is not needed, giving us a significant computational advantage. Empirically, we demonstrate that we have good performance even for ε set to a small, fixed value. The theoretical computation versus accuracy tradeoff of this small-ε approach is provided in the Supplementary Material.

A number of simulated experiments validate our theoretical results and demonstrate the differences between (i) exact HMC, (ii) the naïve implementation of stochastic gradient HMC (simply replacing the gradient with a stochastic gradient), and (iii) our proposed method incorporating friction. We also compare to the first-order Langevin dynamics of SGLD. Finally, we apply our proposed methods to a classification task using Bayesian neural networks and to online Bayesian matrix factorization of a standard movie dataset. Our experimental results demonstrate the effectiveness of the proposed algorithm.

2. Hamiltonian Monte Carlo

Suppose we want to sample from the posterior distribution of θ given a set of independent observations x ∈ D:

    p(θ|D) ∝ exp(−U(θ)),    (1)

where the potential energy function U is given by

    U = − Σ_{x∈D} log p(x|θ) − log p(θ).    (2)

Hamiltonian (Hybrid) Monte Carlo (HMC) (Duane et al., 1987; Neal, 2010) provides a method for proposing samples of θ in a Metropolis-Hastings (MH) framework that efficiently explores the state space as compared to standard random-walk proposals. These proposals are generated from a Hamiltonian system based on introducing a set of auxiliary momentum variables, r. That is, to sample from p(θ|D), HMC considers generating samples from a joint distribution of (θ, r) defined by

    π(θ, r) ∝ exp(−U(θ) − ½ rᵀM⁻¹r).    (3)

If we simply discard the resulting r samples, the θ samples have marginal distribution p(θ|D). Here, M is a mass matrix, and together with r, defines a kinetic energy term. M is often set to the identity matrix, I, but can be used to precondition the sampler when we have more information about the target distribution. The Hamiltonian function is defined by H(θ, r) = U(θ) + ½ rᵀM⁻¹r. Intuitively, H measures the total energy of a physical system with position variables θ and momentum variables r.

To propose samples, HMC simulates the Hamiltonian dynamics

    dθ = M⁻¹r dt
    dr = −∇U(θ) dt.    (4)

To make Eq. (4) concrete, a common analogy in 2D is as follows (Neal, 2010). Imagine a hockey puck sliding over a frictionless ice surface of varying height. The potential energy term is based on the height of the surface at the current puck position, θ, while the kinetic energy is based on the momentum of the puck, r, and its mass, M. If the surface is flat (∇U(θ) = 0, ∀θ), the puck moves at a constant velocity. For positive slopes (∇U(θ) > 0), the kinetic energy decreases as the potential energy increases until the kinetic energy is 0 (r = 0). The puck then slides back down the hill, increasing its kinetic energy and decreasing its potential energy. Recall that in HMC, the position variables are those of direct interest whereas the momentum variables are artificial constructs (auxiliary variables).

Over any interval s, the Hamiltonian dynamics of Eq. (4) define a mapping from the state at time t to the state at time t + s. Importantly, this mapping is reversible, which is important in showing that the dynamics leave π invariant. Likewise, the dynamics preserve the total energy, H, so proposals are always accepted. In practice, however, we usually cannot simulate exactly from the continuous system of Eq. (4) and instead consider a discretized system. One common approach is the "leapfrog" method, which is outlined in Alg. 1. Because of inaccuracies introduced through the discretization, an MH step must be implemented (i.e., the acceptance rate is no longer 1). However, acceptance rates still tend to be high even for proposals that can be quite far from their last state.

Algorithm 1: Hamiltonian Monte Carlo

    Input: starting position θ(1) and step size ε
    for t = 1, 2, ... do
        Resample momentum r: r(t) ∼ N(0, M)
        (θ0, r0) = (θ(t), r(t))
        Simulate the discretization of the Hamiltonian dynamics in Eq. (4) (leapfrog):
            r0 ← r0 − (ε/2) ∇U(θ0)
            for i = 1 to m do
                θi ← θi−1 + ε M⁻¹ ri−1
                ri ← ri−1 − ε ∇U(θi)    (use a half step, ε/2, for the final update i = m)
            end
        (θ̂, r̂) = (θm, rm)
        Metropolis-Hastings correction:
            u ∼ Uniform[0, 1]
            ρ = exp(H(θ(t), r(t)) − H(θ̂, r̂))
            if u < min(1, ρ) then θ(t+1) = θ̂
    end
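To make Alg. 1 concrete, the following is a minimal numpy sketch of the same procedure, assuming M = I. Here `U` and `grad_U` are user-supplied stand-ins for the potential energy of Eq. (2) and its full-data gradient, and all settings are illustrative rather than taken from the paper:

```python
import numpy as np

def hmc(theta0, U, grad_U, eps=0.1, m=20, n_samples=1000, seed=0):
    """Leapfrog HMC with an MH correction, with mass matrix M = I."""
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    samples = []
    for _ in range(n_samples):
        r = rng.standard_normal(theta.shape)        # resample r ~ N(0, I)
        th, rr = theta.copy(), r.copy()
        rr -= 0.5 * eps * grad_U(th)                # initial half step
        for i in range(m):
            th += eps * rr                          # position update (M = I)
            step = eps if i < m - 1 else 0.5 * eps  # final update is a half step
            rr -= step * grad_U(th)
        H_old = U(theta) + 0.5 * r @ r              # H(theta, r) with M = I
        H_new = U(th) + 0.5 * rr @ rr
        if rng.uniform() < min(1.0, np.exp(H_old - H_new)):  # MH correction
            theta = th
        samples.append(theta.copy())
    return np.asarray(samples)
```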
There have been many recent developments of HMC to make the algorithm more flexible and applicable in a variety of settings. The "No U-Turn" sampler (Hoffman & Gelman, 2011) and the methods proposed by Wang et al. (2013) allow automatic tuning of the step size, ε, and number of simulation steps, m. Riemann manifold HMC (Girolami & Calderhead, 2011) makes use of the Riemann geometry to adapt the mass M, enabling the algorithm to make use of curvature information to perform more efficient sampling. We attempt to improve HMC in an orthogonal direction focused on computational complexity, but these adaptive HMC techniques could potentially be combined with our proposed methods to see further benefits.

3. Stochastic Gradient HMC

In this section, we study the implications of implementing HMC using a stochastic gradient and propose variants on the Hamiltonian dynamics that are more robust to the noise introduced by the stochastic gradient estimates. In all scenarios, instead of directly computing the costly gradient ∇U(θ) using Eq. (2), which requires examination of the entire dataset D, we consider a noisy estimate based on a minibatch D̃ sampled uniformly at random from D:

    ∇Ũ(θ) = −(|D|/|D̃|) Σ_{x∈D̃} ∇log p(x|θ) − ∇log p(θ),  D̃ ⊂ D.    (5)

We assume that our observations x are independent and, appealing to the central limit theorem, approximate this noisy gradient as

    ∇Ũ(θ) ≈ ∇U(θ) + N(0, V(θ)).    (6)

Here, V is the covariance of the stochastic gradient noise, which can depend on the current model parameters and sample size. Note that we use an abuse of notation in Eq. (6) where the addition of N(µ, Σ) denotes the introduction of a random variable that is distributed according to this multivariate Gaussian. As the size of D̃ increases, this Gaussian approximation becomes more accurate. Clearly, we want minibatches to be small to have our sought-after computational gains. Empirically, in a wide range of settings, simply considering a minibatch size on the order of hundreds of data points is sufficient for the central limit theorem approximation to be accurate (Ahn et al., 2012). In our applications of interest, minibatches of this size still represent a significant reduction in the computational cost of the gradient.
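As a sketch, the estimator of Eq. (5) can be computed as follows; `grad_log_lik(x, theta)` and `grad_log_prior(theta)` are assumed helper functions returning ∇log p(x|θ) and ∇log p(θ):

```python
import numpy as np

def stochastic_grad_U(theta, data, grad_log_lik, grad_log_prior, batch_size, rng):
    """Minibatch estimate of the gradient of U (Eq. (5))."""
    idx = rng.choice(len(data), size=batch_size, replace=False)   # draw D-tilde
    g = sum(grad_log_lik(data[i], theta) for i in idx)
    return -(len(data) / batch_size) * g - grad_log_prior(theta)  # |D|/|D~| rescaling
```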

3.1. Naïve Stochastic Gradient HMC

The most straightforward approach to stochastic gradient HMC is simply to replace ∇U(θ) in Alg. 1 by ∇Ũ(θ). Referring to Eq. (6), this introduces noise in the momentum update, which becomes ∆r = −ε∇Ũ(θ) = −ε∇U(θ) + N(0, ε²V). The resulting discrete time system can be viewed as an ε-discretization of the following continuous stochastic differential equation:

    dθ = M⁻¹r dt
    dr = −∇U(θ) dt + N(0, 2B(θ)dt).    (7)

Here, B(θ) = ½εV(θ) is the diffusion matrix contributed by gradient noise. As with the original HMC formulation, it is useful to return to a continuous time system in order to derive properties of the approach. To gain some intuition about this setting, consider the same hockey puck analogy of Sec. 2. Here, we can imagine the puck on the same ice surface, but with some random wind blowing as well. This wind may blow the puck further away than expected. Formally, as given by Corollary 3.1 of Theorem 3.1, when B is nonzero, π(θ, r) of Eq. (3) is no longer invariant under the dynamics described by Eq. (7).

Theorem 3.1. Let p_t(θ, r) be the distribution of (θ, r) at time t with dynamics governed by Eq. (7). Define the entropy of p_t as h(p_t) = −∫_{θ,r} f(p_t(θ, r)) dθ dr, where f(x) = x ln x. Assume p_t is a distribution with density and gradient vanishing at infinity. Furthermore, assume the gradient vanishes faster than 1/ln p_t. Then, the entropy of p_t increases over time with rate

    ∂_t h(p_t(θ, r)) = ∫_{θ,r} f''(p_t)(∇_r p_t(θ, r))ᵀ B(θ) ∇_r p_t(θ, r) dθ dr.    (8)

Eq. (8) implies that ∂_t h(p_t(θ, r)) ≥ 0 since B(θ) is a positive semi-definite matrix.

Intuitively, Theorem 3.1 is true because the noise-free Hamiltonian dynamics preserve entropy, while the additional noise term strictly increases entropy if we assume (i) B(θ) is positive definite (a reasonable assumption due to the normal full rank property of Fisher information) and (ii) ∇_r p_t(θ, r) ≠ 0 for all t. Then, jointly, the entropy strictly increases over time. This hints at the fact that the distribution p_t tends toward a uniform distribution, which can be very far from the target distribution π.

Corollary 3.1. The distribution π(θ, r) ∝ exp(−H(θ, r)) is no longer invariant under the dynamics in Eq. (7).

The proofs of Theorem 3.1 and Corollary 3.1 are in the Supplementary Material.

Because π is no longer invariant under the dynamics of Eq. (7), we must introduce a correction step even before considering errors introduced by the discretization of the dynamical system. For the correctness of an MH step (based on the entire dataset), we appeal to the same arguments made for the HMC data-splitting technique of Neal (2010). This approach likewise considers minibatches of data and simulates the (continuous) Hamiltonian dynamics on each batch sequentially. Importantly, Neal (2010) alludes to the fact that the resulting H from the split-data scenario may be far from that of the full-data scenario after simulation, which leads to lower acceptance rates and thereby reduces the apparent computational gains in simulation. Empirically, as we demonstrate in Fig. 2, we see that even finite-length simulations from the noisy system can diverge quite substantially from those of the noise-free system. Although the minibatch-based HMC technique considered herein is slightly different from that of Neal (2010), the theory we have developed in Theorem 3.1 surrounding the high-entropy properties of the resulting invariant distribution of Eq. (7) provides some intuition for the observed deviations in H, both in our experiments and those of Neal (2010).

The poorly behaved properties of the trajectory of H based on simulations using noisy gradients result in a complex computation versus efficiency tradeoff. On one hand, it is extremely computationally intensive in large datasets to insert an MH step after just short simulation runs (where deviations in H are less pronounced and acceptance rates should be reasonable). Each of these MH steps requires a costly computation using all of the data, thus defeating the computational gains of considering noisy gradients. On the other hand, long simulation runs between MH steps can lead to very low acceptance rates. Each rejection corresponds to a wasted (noisy) gradient computation and simulation using the proposed variant of Alg. 1. One possible direction of future research is to consider using the recent results of Korattikara et al. (2014) and Bardenet et al. (2014) that show that it is possible to do MH using a subset of data. However, we instead consider in Sec. 3.2 a straightforward modification to the Hamiltonian dynamics that alleviates the issues of the noise introduced by stochastic gradients. In particular, our modification allows us to again achieve the desired π as the invariant distribution of the continuous Hamiltonian dynamical system.
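For reference, the naïve approach of this subsection amounts to the following update, shown as a hedged numpy sketch (M = I, illustrative settings) purely to make the failure mode explicit:

```python
import numpy as np

def naive_sghmc_step(theta, grad_U_tilde, eps=0.1, m=20, rng=None):
    """One proposal of naive stochastic gradient HMC (Sec. 3.1): Alg. 1 with
    grad_U replaced by a minibatch estimate and the MH step dropped. Included
    only to illustrate the failure mode of Theorem 3.1; do not use as-is."""
    rng = np.random.default_rng() if rng is None else rng
    r = rng.standard_normal(theta.shape)      # r ~ N(0, I)
    for _ in range(m):
        theta = theta + eps * r
        r = r - eps * grad_U_tilde(theta)     # injects noise with no friction
    return theta
```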
3.2. Stochastic Gradient HMC with Friction

In Sec. 3.1, we showed that HMC with stochastic gradients requires a frequent costly MH correction step, or alternatively, long simulation runs with low acceptance probabilities. Ideally, instead, we would like to minimize the effect of the injected noise on the dynamics themselves to alleviate these problems. To this end, we consider a modification to Eq. (7) that adds a "friction" term to the momentum update:

    dθ = M⁻¹r dt
    dr = −∇U(θ) dt − BM⁻¹r dt + N(0, 2B dt).    (9)

Here and throughout the remainder of the paper, we omit the dependence of B on θ for simplicity of notation. Let us again make a hockey analogy. Imagine we are now playing street hockey instead of ice hockey, which introduces friction from the asphalt. There is still a random wind blowing, however the friction of the surface prevents the puck from running far away. That is, the friction term BM⁻¹r helps decrease the energy H(θ, r), thus reducing the influence of the noise. This type of dynamical system is commonly referred to as second-order Langevin dynamics in physics (Wang & Uhlenbeck, 1945). Importantly, we note that the Langevin dynamics used in SGLD (Welling & Teh, 2011) are first-order, which can be viewed as a limiting case of our second-order dynamics when the friction term is large. Further details on this comparison follow at the end of this section.

Theorem 3.2. π(θ, r) ∝ exp(−H(θ, r)) is the unique stationary distribution of the dynamics described by Eq. (9).

Proof. Let

    G = [ 0  −I ;  I  0 ],   D = [ 0  0 ;  0  B ],

where G is an anti-symmetric matrix, and D is the symmetric (diffusion) matrix. Eq. (9) can be written in the following decomposed form (Yin & Ao, 2006; Shi et al., 2012):

    d[θ; r] = −[ 0  −I ;  I  B ] [∇U(θ); M⁻¹r] dt + N(0, 2D dt)
            = −[D + G] ∇H(θ, r) dt + N(0, 2D dt).

The distribution evolution under this dynamical system is governed by a Fokker-Planck equation:

    ∂_t p_t(θ, r) = ∇ᵀ{[D + G][p_t(θ, r)∇H(θ, r) + ∇p_t(θ, r)]}.    (10)

See the Supplementary Material for details. We can verify that π(θ, r) is invariant under Eq. (10) by calculating [e^{−H(θ,r)}∇H(θ, r) + ∇e^{−H(θ,r)}] = 0. Furthermore, due to the existence of the diffusion noise, π is the unique stationary distribution of Eq. (10).

In summary, we have shown that the dynamics given by Eq. (9) have a similar invariance property to that of the original Hamiltonian dynamics of Eq. (4), even with noise present. The key was to introduce a friction term using second-order Langevin dynamics. Our revised momentum update can also be viewed as akin to partial momentum refreshment (Horowitz, 1991; Neal, 1993), which also corresponds to second-order Langevin dynamics. Such partial momentum refreshment was shown to not greatly improve HMC in the case of noise-free gradients (Neal, 2010). However, as we have demonstrated, the idea is crucial in our stochastic gradient scenario in order to counterbalance the effect of the noisy gradients. We refer to the resulting method as stochastic gradient HMC (SGHMC).

CONNECTION TO FIRST-ORDER LANGEVIN DYNAMICS

As we previously discussed, the dynamics introduced in Eq. (9) relate to the first-order Langevin dynamics used in SGLD (Welling & Teh, 2011). In particular, the dynamics of SGLD can be viewed as second-order Langevin dynamics with a large friction term. To intuitively demonstrate this connection, let BM⁻¹ = 1/dt in Eq. (9). Because the friction and momentum noise terms are very large, the momentum variable r changes much faster than θ. Thus, relative to the rapidly changing momentum, θ can be considered as fixed. We can study this case as simply:

    dr = −∇U(θ) dt − BM⁻¹r dt + N(0, 2B dt).    (11)

The fast evolution of r leads to a rapid convergence to the stationary distribution of Eq. (11), which is given by N(MB⁻¹∇U(θ), M). Let us now consider a change in θ, with r ∼ N(MB⁻¹∇U(θ), M). Recalling BM⁻¹ = 1/dt, we have

    dθ = −M⁻¹∇U(θ) dt² + N(0, 2M⁻¹dt²),    (12)

which exactly aligns with the dynamics of SGLD where M⁻¹ serves as the preconditioning matrix (Welling & Teh, 2011). Intuitively, this means that when the friction is large, the dynamics do not depend on the decaying series of past gradients represented by dr, reducing to first-order Langevin dynamics.

3.3. Stochastic Gradient HMC in Practice

In everything we have considered so far, we have assumed that we know the noise model B. Clearly, in practice this is not the case. Imagine instead that we simply have an estimate B̂. As will become clear, it is beneficial to instead introduce a user-specified friction term C ⪰ B̂ and consider the following dynamics:

    dθ = M⁻¹r dt
    dr = −∇U(θ) dt − CM⁻¹r dt + N(0, 2(C − B̂) dt) + N(0, 2B dt).    (13)

The resulting SGHMC algorithm is shown in Alg. 2. Note that the algorithm is purely in terms of user-specified or computable quantities. To understand our choice of dynamics, we begin with the unrealistic scenario of perfect estimation of B.

Algorithm 2: Stochastic Gradient HMC

    for t = 1, 2, ... do
        optionally, resample momentum r as r(t) ∼ N(0, M)
        (θ0, r0) = (θ(t), r(t))
        simulate dynamics in Eq. (13):
        for i = 1 to m do
            θi ← θi−1 + ε_t M⁻¹ ri−1
            ri ← ri−1 − ε_t ∇Ũ(θi) − ε_t CM⁻¹ ri−1 + N(0, 2(C − B̂)ε_t)
        end
        (θ(t+1), r(t+1)) = (θm, rm), no M-H step
    end

Proposition 3.1. If B̂ = B, then the dynamics of Eq. (13) yield the stationary distribution π(θ, r) ∝ e^{−H(θ,r)}.

Proof. The momentum update simplifies to dr = −∇U(θ) dt − CM⁻¹r dt + N(0, 2C dt), with friction term CM⁻¹ and noise term N(0, 2C dt). Noting that the proof of Theorem 3.2 only relied on a matching of the noise and friction terms, the result follows directly by using C in place of B in Theorem 3.2.

Now consider the benefit of introducing the C terms and revised dynamics in the more realistic scenario of inaccurate estimation of B.
For example, the simplest choice is B̂ = 0. Though the true stochastic gradient noise B is clearly non-zero, as the step size ε → 0, B = ½εV goes to 0 and C dominates. That is, the dynamics are again governed by the controllable injected noise N(0, 2C dt) and friction CM⁻¹. It is also possible to set B̂ = ½εV̂, where V̂ is estimated using empirical Fisher information as in (Ahn et al., 2012) for SGLD.
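Putting the pieces together, Alg. 2 can be sketched as below, again with M = I and scalar, user-chosen C and B̂; this is an illustrative rendering of the algorithm, not the authors' implementation:

```python
import numpy as np

def sghmc(theta0, grad_U_tilde, eps=0.01, C=1.0, B_hat=0.0, m=50,
          n_samples=1000, seed=0):
    """SGHMC as in Alg. 2 with M = I and scalar friction C and estimate B_hat."""
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    noise_sd = np.sqrt(2.0 * (C - B_hat) * eps)    # N(0, 2(C - B_hat) eps)
    samples = []
    for _ in range(n_samples):
        r = rng.standard_normal(theta.shape)       # optional momentum resampling
        for _ in range(m):
            theta = theta + eps * r
            r = (r - eps * grad_U_tilde(theta) - eps * C * r
                 + noise_sd * rng.standard_normal(theta.shape))
        samples.append(theta.copy())               # no MH step
    return np.asarray(samples)
```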

COMPUTATIONAL COMPLEXITY

The complexity of Alg. 2 depends on the choice of M, C and B̂, and the complexity for estimating ∇Ũ(θ)—denoted as g(|D̃|, d)—where d is the dimension of the parameter space. Assume we allow B̂ to be an arbitrary d × d positive definite matrix. Using empirical Fisher information estimation of B̂, the per-iteration complexity of this estimation step is O(d²|D̃|). Then, the time complexity for the (θ, r) update is O(d³), because the update is dominated by generating Gaussian noise with a full covariance matrix. In total, the per-iteration time complexity is O(d²|D̃| + d³ + g(|D̃|, d)). In practice, we restrict all of the matrices to be diagonal when d is large, resulting in time complexity O(d|D̃| + d + g(|D̃|, d)). Importantly, we note that our SGHMC time complexity is the same as that of SGLD (Welling & Teh, 2011; Ahn et al., 2012) in both parameter settings.

In practice, we must assume inaccurate estimation of B. For a decaying series of step sizes ε_t, an MH step is not required (Welling & Teh, 2011; Ahn et al., 2012). (We note that, just as in SGLD, an MH correction is not even possible because we cannot compute the probability of the reverse dynamics.) However, as the step size decreases, the efficiency of the sampler likewise decreases since proposals are increasingly close to their initial value. In practice, we may want to tolerate some errors in the sampling accuracy to gain efficiency. As in (Welling & Teh, 2011; Ahn et al., 2012) for SGLD, we consider using a small, non-zero ε leading to some bias. We explore an analysis of the errors introduced by such finite-ε approximations in the Supplementary Material.

CONNECTION TO SGD WITH MOMENTUM

Adding a momentum term to stochastic gradient descent (SGD) is common practice. In concept, there is a clear relationship between SGD with momentum and SGHMC, and here we formalize this connection. Letting v = εM⁻¹r, we first rewrite the update rule in Alg. 2 as

    ∆θ = v
    ∆v = −ε²M⁻¹∇Ũ(θ) − εM⁻¹Cv + N(0, 2ε³M⁻¹(C − B̂)M⁻¹).    (14)

Define η = ε²M⁻¹, α = εM⁻¹C, β̂ = εM⁻¹B̂. The update rule becomes

    ∆θ = v
    ∆v = −η∇Ũ(x) − αv + N(0, 2(α − β̂)η).    (15)

Comparing to an SGD with momentum method, it is clear from Eq. (15) that η corresponds to the learning rate and 1 − α the momentum term. When the noise is removed (via C = B̂ = 0), SGHMC naturally reduces to a stochastic gradient method with momentum. We can use the equivalent update rule of Eq. (15) to run SGHMC, and borrow experience from parameter settings of SGD with momentum to guide our choices of SGHMC settings. For example, we can set α to a fixed small number (e.g., 0.01 or 0.1), select the learning rate η, and then fix β̂ = ηV̂/2. A more sophisticated strategy involves using momentum scheduling (Sutskever et al., 2013). We elaborate upon how to select these parameters in the Supplementary Material.
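In code, the reparameterized update of Eq. (15) looks like a noisy SGD-with-momentum step; the following sketch makes the correspondence explicit (all names and defaults are illustrative):

```python
import numpy as np

def sghmc_momentum_step(theta, v, grad_U_tilde, eta=0.01, alpha=0.1,
                        beta_hat=0.0, rng=None):
    """One SGHMC update in the SGD-with-momentum form of Eq. (15)."""
    rng = np.random.default_rng() if rng is None else rng
    theta = theta + v                                        # Delta theta = v
    noise_sd = np.sqrt(2.0 * (alpha - beta_hat) * eta)
    v = ((1.0 - alpha) * v - eta * grad_U_tilde(theta)
         + noise_sd * rng.standard_normal(np.shape(theta)))  # Delta v of Eq. (15)
    return theta, v
```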
4. Experiments

4.1. Simulated Scenarios

To empirically explore the behavior of HMC using exact gradients relative to stochastic gradients, we conduct experiments on a simulated setup. As a baseline, we consider the standard HMC implementation of Alg. 1, both with and without the MH correction. We then compare to HMC with stochastic gradients, replacing ∇U in Alg. 1 with ∇Ũ, and consider this proposal with and without an MH correction. Finally, we compare to our proposed SGHMC, which does not use an MH correction. Fig. 1 shows the empirical distributions generated by the different sampling algorithms. We see that even without an MH correction, both the HMC and SGHMC algorithms provide results close to the true distribution, implying that any errors from considering non-zero ε are negligible. On the other hand, the results of naïve stochastic gradient HMC diverge significantly from the truth unless an MH correction is added. These findings validate our theoretical results; that is, both standard HMC and SGHMC maintain π as the invariant distribution as ε → 0, whereas naïve stochastic gradient HMC does not, though this can be corrected for using a (costly) MH step.

Figure 1 (plot not reproduced). Empirical distributions associated with various sampling algorithms relative to the true target distribution with U(θ) = −2θ² + θ⁴. We compare the HMC method of Alg. 1 with and without the MH step to: (i) a naïve variant that replaces the gradient with a stochastic gradient, again with and without an MH correction; (ii) the proposed SGHMC method, which does not use an MH correction. We use ∇Ũ(θ) = ∇U(θ) + N(0, 4) in the stochastic gradient based samplers and ε = 0.1 in all cases. Momentum is resampled every 50 steps in all variants of HMC.
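A self-contained sketch of this simulated study follows; the friction constant C is chosen arbitrarily for illustration (the paper does not report its value), and the rest of the settings follow the caption of Fig. 1:

```python
import numpy as np

rng = np.random.default_rng(0)
grad_U = lambda th: -4.0 * th + 4.0 * th ** 3        # U(th) = -2 th^2 + th^4
grad_U_tilde = lambda th: grad_U(th) + 2.0 * rng.standard_normal()  # + N(0, 4)

eps, C, m = 0.1, 3.0, 50                             # C chosen for illustration
theta, r, samples = 0.0, 0.0, []
for t in range(200_000):
    if t % m == 0:
        r = rng.standard_normal()                    # resample momentum
    theta += eps * r
    r += (-eps * grad_U_tilde(theta) - eps * C * r
          + np.sqrt(2.0 * C * eps) * rng.standard_normal())   # B_hat = 0
    samples.append(theta)
# A histogram of `samples` should track exp(2 th^2 - th^4) up to normalization.
```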

We also consider simply simulating from the discretized Hamiltonian dynamical systems associated with the various samplers compared. In Fig. 2, we compare the resulting trajectories and see that the path of (θ, r) from the noisy system without friction diverges significantly. The modification of the dynamical system by adding friction (corresponding to SGHMC) corrects this behavior. We can also correct for this divergence through periodic resampling of the momentum, though as we saw in Fig. 1, the corresponding MCMC algorithm ("Naive stochastic gradient HMC (no MH)") does not yield the correct target distribution. These results confirm the importance of the friction term in maintaining a well-behaved Hamiltonian and leading to the correct stationary distribution.

Figure 2 (plot not reproduced). Points (θ, r) simulated from discretizations of various Hamiltonian dynamics over 15000 steps using U(θ) = ½θ² and ε = 0.1. For the noisy scenarios, we replace the gradient by ∇Ũ(θ) = θ + N(0, 4). We see that noisy Hamiltonian dynamics lead to diverging trajectories when friction is not introduced. Resampling r helps control divergence, but the associated HMC stationary distribution is not correct, as illustrated in Fig. 1. (Trajectories shown: noisy Hamiltonian dynamics; noisy Hamiltonian dynamics with r resampled each 50 steps; noisy Hamiltonian dynamics with friction; noise-free Hamiltonian dynamics.)

It is known that a benefit of HMC over many other MCMC algorithms is the efficiency in sampling from correlated distributions (Neal, 2010)—this is where the introduction of the momentum variable shines. SGHMC inherits this property. Fig. 3 compares SGHMC and SGLD (Welling & Teh, 2011) when sampling from a bivariate Gaussian with positive correlation. For each method, we examine five different settings of the initial step size on a linearly decreasing scale and generate ten million samples. For each of these sets of samples (one set per step-size setting), we calculate the autocorrelation time of the samples (defined as 1 + Σ_{s=1}^∞ ρ_s, where ρ_s is the autocorrelation at lag s) and the average absolute error of the resulting sample covariance. Fig. 3(a) shows the autocorrelation versus estimation error for the five settings. As we decrease the stepsize, SGLD has reasonably low estimation error but high autocorrelation time, indicating an inefficient sampler. In contrast, SGHMC achieves even lower estimation error at very low autocorrelation times, from which we conclude that the sampler is indeed efficiently exploring the distribution. Fig. 3(b) shows the first 50 samples generated by the two samplers. We see that SGLD's random-walk behavior makes it challenging to explore the tails of the distribution. The momentum variable associated with SGHMC instead drives the sampler to move along the distribution contours.

Figure 3 (plots not reproduced). Contrasting sampling of a bivariate Gaussian with correlation using SGHMC versus SGLD. Here, U(θ) = ½θᵀΣ⁻¹θ, ∇Ũ(θ) = Σ⁻¹θ + N(0, I) with Σ₁₁ = Σ₂₂ = 1 and correlation ρ = Σ₁₂ = 0.9. Left: average absolute error of the covariance estimation using ten million samples versus autocorrelation time of the samples as a function of 5 step size settings. Right: first 50 samples of SGHMC and SGLD.
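The autocorrelation time used above can be estimated with a simple truncated-sum estimator; the truncation rule below is a common heuristic, not one specified in the paper:

```python
import numpy as np

def autocorr_time(x, max_lag=1000):
    """Estimate 1 + sum_s rho_s, truncating at the first negative rho_s."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    var = np.dot(x, x) / len(x)
    tau = 1.0
    for s in range(1, min(max_lag, len(x) - 1)):
        rho = np.dot(x[:-s], x[s:]) / (len(x) * var)
        if rho < 0:                                  # heuristic truncation
            break
        tau += rho
    return tau
```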

4.2. Bayesian Neural Networks for Classification

We also test our method on a handwritten digits classification task using the MNIST dataset (http://yann.lecun.com/exdb/mnist/). The dataset consists of 60,000 training instances and 10,000 test instances. We randomly split a validation set containing 10,000 instances from the training data in order to select training parameters, and use the remaining 50,000 instances for training. For classification, we consider a two layer Bayesian neural network with 100 hidden variables using a sigmoid unit and an output layer using softmax. We tested four methods: SGD, SGD with momentum, SGLD and SGHMC. For the optimization-based methods, we use the validation set to select the optimal regularizer λ of network weights (we also tried MAP inference for selecting λ, but found similar performance). For the sampling-based methods, we take a fully Bayesian approach and place a weakly informative gamma prior on each layer's weight regularizer λ. The sampling procedure is carried out by running SGHMC and SGLD using minibatches of 500 training instances, then resampling hyperparameters after an entire pass over the training set. We run the samplers for 800 iterations (each over the entire training dataset) and discard the initial 50 samples as burn-in.

The test error as a function of MCMC or optimization iteration (after burn-in) is reported for each of these methods in Fig. 4. From the results, we see that SGD with momentum converges faster than SGD. SGHMC also has an advantage over SGLD, converging to a low test error much more rapidly. In terms of runtime, in this case the gradient computation used in backpropagation dominates, so both have the same computational cost. The final results of the sampling based methods are better than the optimization-based methods, showing an advantage to Bayesian inference in this setting, thus validating the need for scalable and efficient algorithms such as SGHMC.

Figure 4 (plot not reproduced). Convergence of test error on the MNIST dataset using SGD, SGD with momentum, SGLD, and SGHMC to infer model parameters of a Bayesian neural net.

Table 1. Predictive RMSE estimated using 5-fold cross validation on the Movielens dataset for various approaches of inferring parameters of a Bayesian probabilistic matrix factorization model.

    METHOD               RMSE
    SGD                  0.8538 ± 0.0009
    SGD with momentum    0.8539 ± 0.0009
    SGLD                 0.8412 ± 0.0009
    SGHMC                0.8411 ± 0.0011

4.3. Online Bayesian Probabilistic Matrix Factorization for Movie Recommendations

Collaborative filtering is an important problem in web applications. The task is to predict a user's preference over a set of items (e.g., movies, music) and produce recommendations. Probabilistic matrix factorization (PMF) (Salakhutdinov & Mnih, 2008b) has proven effective for this task. Due to the sparsity in the ratings matrix (users versus items) in recommender systems, over-fitting is a severe issue, with Bayesian approaches providing a natural solution (Salakhutdinov & Mnih, 2008a).

We conduct an experiment in online Bayesian PMF on the Movielens dataset ml-1M (http://grouplens.org/datasets/movielens/). The dataset contains about 1 million ratings of 3,952 movies by 6,040 users. The number of latent dimensions is set to 20. In comparing our stochastic-gradient-based approaches, we use minibatches of 4,000 ratings to update the user and item latent matrices. We choose a significantly larger minibatch size in this application than that of the neural net because of the dramatically larger parameter space associated with the PMF model. For the optimization-based approaches, the hyperparameters are set using cross validation (again, we did not see a performance difference from considering MAP estimation).
For the sampling-based approaches, the hyperparameters are updated using a Gibbs step after every 2,000 steps of sampling model parameters. We run the sampler to generate 2,000,000 samples, with the first 100,000 samples discarded as burn-in. We use five-fold cross validation to evaluate the performance of the different methods.

The results are shown in Table 1. Both SGHMC and SGLD give better prediction results than the optimization-based methods. In this application, the results for SGLD and SGHMC are very similar. We also observed that the per-iteration running times of both methods are comparable. As such, the experiment suggests that SGHMC is an effective candidate for online Bayesian PMF.

5. Conclusion

Moving between modes of a distribution is one of the key challenges for MCMC-based inference algorithms. To address this problem in the large-scale or online setting, we proposed SGHMC, an efficient method for generating high-quality, "distant" steps in such sampling methods. Our approach builds on the fundamental framework of HMC, but uses stochastic estimates of the gradient to avoid the costly full gradient computation. Surprisingly, we discovered that the natural way to incorporate stochastic gradient estimates into HMC can lead to divergence and poor behavior both in theory and in practice. To address this challenge, we introduced second-order Langevin dynamics with a friction term that counteracts the effects of the noisy gradient, maintaining the desired target distribution as the invariant distribution of the continuous system. Our empirical results, both in a simulated experiment and on real data, validate our theory and demonstrate the practical value of introducing this simple modification. A natural next step is to explore combining adaptive HMC techniques with SGHMC. More broadly, we believe that the unification of efficient optimization and sampling techniques, such as those described herein, will enable a significant scaling of Bayesian methods.

Acknowledgements

This work was supported in part by the TerraSwarm Research Center sponsored by MARCO and DARPA, ONR Grant N00014-10-1-0746, DARPA Grant FA9550-12-1-0406 negotiated by AFOSR, NSF IIS-1258741 and Intel ISTC Big Data. We also appreciate the discussions with Mark Girolami, Nick Foti, Ping Ao and Hong Qian.

References

Ahn, S., Korattikara, A., and Welling, M. Bayesian posterior sampling via stochastic gradient Fisher scoring. In Proceedings of the 29th International Conference on Machine Learning (ICML'12), pp. 1591–1598, July 2012.

Bardenet, R., Doucet, A., and Holmes, C. Towards scaling up Markov chain Monte Carlo: An adaptive subsampling approach. In Proceedings of the 31st International Conference on Machine Learning (ICML'14), volume 32, pp. 405–413, February 2014.

Duane, S., Kennedy, A.D., Pendleton, B.J., and Roweth, D. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.

Girolami, M. and Calderhead, B. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society Series B, 73(2):123–214, 2011.

Hoffman, M.D. and Gelman, A. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. arXiv:1111.4246, 2011.

Hoffman, M.D., Blei, D.M., Wang, C., and Paisley, J. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, May 2013.

Horowitz, A.M. A generalized guided Monte Carlo algorithm. Physics Letters B, 268(2):247–252, 1991.

Korattikara, A., Chen, Y., and Welling, M. Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In Proceedings of the 31st International Conference on Machine Learning (ICML'14), volume 32, pp. 181–189, February 2014.

Levin, D.A., Peres, Y., and Wilmer, E.L. Markov Chains and Mixing Times. American Mathematical Society, 2008.

Neal, R.M. Bayesian learning via stochastic dynamics. In Advances in Neural Information Processing Systems 5 (NIPS'93), pp. 475–482, 1993.

Neal, R.M. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 54:113–162, 2010.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, January 2009.

Patterson, S. and Teh, Y.W. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In Advances in Neural Information Processing Systems 26 (NIPS'13), pp. 3102–3110, 2013.

Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.

Salakhutdinov, R. and Mnih, A. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th International Conference on Machine Learning (ICML'08), pp. 880–887, 2008a.

Salakhutdinov, R. and Mnih, A. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems 20 (NIPS'08), pp. 1257–1264, 2008b.

Shi, J., Chen, T., Yuan, R., Yuan, B., and Ao, P. Relation of a new interpretation of stochastic differential equations to Ito process. Journal of Statistical Physics, 148(3):579–590, 2012.

Sutskever, I., Martens, J., Dahl, G.E., and Hinton, G.E. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML'13), volume 28, pp. 1139–1147, May 2013.

Wang, M.C. and Uhlenbeck, G.E. On the theory of the Brownian motion II. Reviews of Modern Physics, 17(2-3):323, 1945.

Wang, Z., Mohamed, S., and de Freitas, N. Adaptive Hamiltonian and Riemann manifold Monte Carlo. In Proceedings of the 30th International Conference on Machine Learning (ICML'13), volume 28, pp. 1462–1470, May 2013.

Welling, M. and Teh, Y.W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML'11), pp. 681–688, June 2011.

Yin, L. and Ao, P. Existence and construction of dynamical potential in nonequilibrium processes without detailed balance. Journal of Physics A: Mathematical and General, 39(27):8593, 2006.

Supplementary Material

A. Background on the Fokker-Planck Equation

The Fokker-Planck equation (FPE) associated with a given stochastic differential equation (SDE) describes the time evolution of the distribution on the random variables under the specified stochastic dynamics. For example, consider the SDE

    dz = g(z) dt + N(0, 2D(z) dt),    (16)

where z ∈ Rⁿ, g(z) ∈ Rⁿ, D(z) ∈ Rⁿˣⁿ. The distribution of z governed by Eq. (16) (denoted by p_t(z)) evolves under the following equation:

    ∂_t p_t(z) = −Σᵢ ∂_{z_i}[g_i(z) p_t(z)] + Σᵢ Σⱼ ∂_{z_i}∂_{z_j}[D_{ij}(z) p_t(z)].

Here g_i(z) is the i-th entry of the vector g(z) and D_{ij}(z) is the (i, j) entry of the matrix D. In the dynamics considered in this paper, z = (θ, r) and

    D = [ 0  0 ;  0  B(θ) ].    (17)

That is, the random variables are the momentum r and position θ, with noise only added to r (though dependent upon θ). The FPE can be written in the following compact form:

    ∂_t p_t(z) = −∇ᵀ[g(z) p_t(z)] + ∇ᵀ[D(z) ∇p_t(z)],    (18)

where ∇ᵀ[g(z) p_t(z)] = Σᵢ ∂_{z_i}[g_i(z) p_t(z)], and

    Σ_{ij} ∂_{z_i}∂_{z_j}[D_{ij}(z) p_t(z)]
        = Σ_{ij} ∂_{z_i}[D_{ij}(z) ∂_{z_j} p_t(z)] + Σ_{ij} ∂_{z_i}[(∂_{z_j}D_{ij}(z)) p_t(z)]
        = ∇ᵀ[D(z)∇p_t(z)].

Note that ∂_{z_j}D_{ij}(z) = 0 for all i, j, since ∂_{r_j}B_{ij}(θ) = 0 (the noise is only added to r and only depends on the parameter θ), which justifies the last equality.

B. Proof of Theorem 3.1

Let

    G = [ 0  −I ;  I  0 ],   D = [ 0  0 ;  0  B(θ) ].

The noisy Hamiltonian dynamics of Eq. (7) can be written as

    d[θ; r] = −[ 0  −I ;  I  0 ] [∇U(θ); M⁻¹r] dt + N(0, 2D dt)
            = −G∇H(θ, r) dt + N(0, 2D dt).

Applying Eq. (18), defining g(z) = −G∇H, the corresponding FPE is given by

    ∂_t p_t(θ, r) = ∇ᵀ[G∇H(θ, r) p_t(θ, r)] + ∇ᵀ[D∇p_t(θ, r)].    (19)

We use z = (θ, r) to denote the joint variable of position and momentum. The entropy is defined by h(p_t(θ, r)) = −∫_{θ,r} f(p_t(θ, r)) dθ dr, where f(x) = x ln x is a strictly convex function defined on (0, +∞). The evolution of the entropy is governed by

    ∂_t h(p_t(z)) = −∂_t ∫_z f(p_t(z)) dz
                 = −∫_z f'(p_t(z)) ∂_t p_t(z) dz
                 = −∫_z f'(p_t(z)) ∇ᵀ[G∇H(z) p_t(z)] dz − ∫_z f'(p_t(z)) ∇ᵀ[D(z)∇p_t(z)] dz.

The entropy evolution can thus be described as the sum of two parts: the noise-free Hamiltonian dynamics and the stochastic gradient noise term. The Hamiltonian dynamics part does not change the entropy, since

    −∫_z f'(p_t(z)) ∇ᵀ[G∇H(z) p_t(z)] dz
        = −∫_z f'(p_t(z)) ∇ᵀ[G∇H(z)] p_t(z) dz − ∫_z f'(p_t(z)) (∇p_t(z))ᵀ[G∇H(z)] dz
        = −∫_z (∇f(p_t(z)))ᵀ[G∇H(z)] dz
        = ∫_z f(p_t(z)) ∇ᵀ[G∇H(z)] dz = 0.

In the second equality, we use the fact that ∇ᵀ[G∇H(z)] = −∂_θ∂_r H + ∂_r∂_θ H = 0, so the first term vanishes, together with f'(p)∇p = ∇f(p). The next-to-last equality is given by integration by parts, using the assumption that the probability density vanishes at infinity and f(x) → 0 as x → 0, such that f(p_t(z))[G∇H(z)] → 0 as z → ∞.

The contribution due to the stochastic gradient noise can be calculated as

    −∫_z f'(p_t(z)) ∇ᵀ[D(z)∇p_t(z)] dz
        = ∫_z (f''(p_t(z))∇p_t(z))ᵀ D(z) ∇p_t(z) dz
        = ∫_{θ,r} f''(p_t(θ, r)) (∇_r p_t(θ, r))ᵀ B(θ) ∇_r p_t(θ, r) dθ dr.

The first equality is again given by integration by parts, assuming that the gradient of p_t vanishes at infinity faster than 1/ln p_t(z); that is, f'(p_t(z))∇p_t(z) = (1 + ln p_t(z))∇p_t(z) → 0, such that f'(p_t(z))[D(z)∇p_t(z)] → 0 as z → ∞. The statement of Theorem 3.1 immediately follows.

C. Proof of Corollary 3.1

Assume π(θ, r) = exp(−H(θ, r))/Z is invariant under Eq. (7) and is a well-behaved distribution such that H(θ, r) → ∞ as ‖θ‖, ‖r‖ → ∞. Then it is straightforward to verify that π(θ, r) and ln π(θ, r)∇π(θ, r) ∝ exp(−H(θ, r)) H(θ, r)∇H(θ, r) vanish at infinity, such that π satisfies the conditions of Theorem 3.1. We also have ∇_r π(θ, r) = −(1/Z) exp(−H(θ, r)) M⁻¹r. Using the assumption that the Fisher information matrix B(θ) has full rank, and noting that f''(p) > 0 for p > 0, from Eq. (8) of Theorem 3.1 we conclude that the entropy increases over time: ∂_t h(p_t(θ, r))|_{p_t=π} > 0. This contradicts π being the invariant distribution.

D. FPE for Second-Order Langevin Dynamics

Second-order Langevin dynamics can be described by the following equation:

    d[θ; r] = −[ 0  −I ;  I  B ] [∇U(θ); M⁻¹r] dt + N(0, 2τD dt)
            = −[D + G] ∇H(θ, r) dt + N(0, 2τD dt),    (20)

where τ is a temperature (usually set to 1). In this paper, we use the following compact form of the FPE to calculate the distribution evolution under Eq. (20):

    ∂_t p_t(θ, r) = ∇ᵀ{[D + G][p_t(θ, r)∇H(θ, r) + τ∇p_t(θ, r)]}.    (21)

To derive this FPE, we apply Eq. (18) to Eq. (20), defining g(z) = −(D + G)∇H, which yields

    ∂_t p_t(θ, r) = ∇ᵀ{[D + G][∇H(θ, r) p_t(θ, r)]} + ∇ᵀ[τD∇p_t(θ, r)].

Using the fact that ∇ᵀ[G∇p_t(θ, r)] = −∂_θ∂_r p_t(θ, r) + ∂_r∂_θ p_t(θ, r) = 0, we get Eq. (21). This form of the FPE allows easy verification that the stationary distribution is given by π(θ, r) ∝ e^{−H(θ,r)/τ}. In particular, if we substitute the target distribution into Eq. (21), we note that

    e^{−H(θ,r)/τ} ∇H(θ, r) + τ∇e^{−H(θ,r)/τ} = 0,

such that ∂_t π(θ, r) = 0, implying that π is indeed the stationary distribution.

The compact form of Eq. (21) can also be used to construct other stochastic processes with the desired invariant distribution. A generalization of the FPE in Eq. (21) is given by Yin & Ao (2006). The system we have discussed in this paper considers cases where G = [ 0 −I ; I 0 ] and D only depends on θ. In practice, however, it might be helpful to make G depend on θ as well, for example, to make use of the Riemann geometry of the problem, as in Girolami & Calderhead (2011) and Patterson & Teh (2013), by adapting G according to the local curvature. For us to consider these more general cases, a correction term needs to be added during simulation (Shi et al., 2012). With that correction term, we still maintain the desired target distribution as the stationary distribution.

E. Reversibility of SGHMC Dynamics

The dynamics of SGHMC are not reversible in the conventional definition of reversibility. However, the dynamics satisfy the following property:

Theorem E.1. Assume P(θ_t, r_t|θ_0, r_0) is the distribution governed by the dynamics in Eq. (20), i.e. P(θ_t, r_t|θ_0, r_0) follows Eq. (21). Then for π(θ, r) ∝ exp(−H(θ, r)),

    π(θ_0, r_0) P(θ_t, r_t|θ_0, r_0) = π(θ_t, −r_t) P(θ_0, −r_0|θ_t, −r_t).    (22)

Proof. Assume π is the stationary distribution and let P* be the reverse-time Markov process associated with P, so that π(θ_0, r_0)P(θ_t, r_t|θ_0, r_0) = π(θ_t, r_t)P*(θ_0, r_0|θ_t, r_t). Let L(p) = ∇ᵀ{[D + G][p∇H(θ, r) + τ∇p]} be the generator of the Markov process described by Eq. (21). The generator of the reverse process is given by L*, which is the adjoint operator of L in the inner-product space ℓ²(π), with inner-product defined by ⟨p, q⟩_π = E_{x∼π(x)}[p(x)q(x)]. We can verify that L*(p) = ∇ᵀ{[D − G][p∇H(θ, r) + τ∇p]}. The corresponding SDE of the reverse process is given by

    d[θ; r] = −[D − G] ∇H(θ, r) dt + N(0, 2τD dt),

which is equivalent to

    d[θ; −r] = −[D + G] ∇H(θ, −r) dt + N(0, 2τD dt).

This means P*(θ_0, r_0|θ_t, r_t) = P(θ_0, −r_0|θ_t, −r_t). Recalling that we assume Gaussian momentum, r, centered about 0, we also have π(θ, r) = π(θ, −r). Together, we then have

    π(θ_0, r_0)P(θ_t, r_t|θ_0, r_0) = π(θ_t, r_t)P*(θ_0, r_0|θ_t, r_t)
                                  = π(θ_t, −r_t)P(θ_0, −r_0|θ_t, −r_t).

Theorem E.1 is not strictly detailed balance by the conventional definition, since L* ≠ L and P* ≠ P. However, it can be viewed as a kind of time reversibility: when we reverse time, the sign of the speed needs to be reversed to allow backward travel. This property is shared by the noise-free HMC dynamics of Neal (2010). Detailed balance can be enforced by the symmetry of r during the re-sampling step. However, we note that we do not rely on detailed balance to have π be the stationary distribution of our noisy Hamiltonian with friction (see Eq. (9)).

F. Convergence Analysis

In the paper, we have discussed that the efficiency of SGHMC decreases as the step size ε decreases. In practice, we usually want to trade a small amount of error for efficiency. In the case of SGHMC, we are interested in a small, nonzero ε and a fast approximation of B given by B̂. In this case, even under the continuous dynamics, the sampling procedure contains error that relates to ε, due to the inaccurate estimation of B by B̂. In this section, we investigate how the choice of ε can be related to the error in the final stationary distribution. The sampling procedure with inaccurate estimation of B can be described by the following dynamics:

    dθ = M⁻¹r dt
    dr = −∇U(θ) dt − CM⁻¹r dt + N(0, 2(C + δS) dt).

Here, δS = B − B̂ is the error term that is not accounted for by the sampling algorithm. Assume the setting where B̂ = 0; then we can let δ = ε and S = ½V. Let π̃ be the stationary distribution of the dynamics. In the special case when V = C, we can calculate π̃ exactly by
    π̃(θ, r) ∝ exp(−H(θ, r)/(1 + δ)).    (23)

This indicates that for small ε, our stationary distribution is indeed close to the true stationary distribution. In the general case, we consider the FPE of the distribution of this SDE, given by

    ∂_t p̃_t(θ, r) = [L + δS] p̃_t(θ, r).    (24)

Here, L(p) = ∇ᵀ{[D + G][p∇H(θ, r) + ∇p]} is the operator corresponding to the correct sampling process, and the operator S(p) = ∇_r[S∇_r p] corresponds to the error term introduced by the inaccurate B̂. Let us consider the χ²-divergence, defined by

    χ²(p, π) = E_{x∼π}[ (p(x) − π(x))² / π²(x) ] = E_{x∼π}[ p²(x) / π²(x) ] − 1,

which provides a measure of distance between the distribution p and the true distribution π. Theorem F.1 shows that the χ²-divergence decreases as δ becomes smaller.

Theorem F.1. Assume p_t evolves according to ∂_t p_t = Lp_t, and satisfies the following mixing rate λ with respect to the χ²-divergence at π̃: ∂_t χ²(p_t, π)|_{p_t=π̃} ≤ −λχ²(π̃, π). Further assume the process governed by S (∂_t q_t = Sq_t) has bounded divergence change |∂_t χ²(q_t, π)| < c. Then π̃ satisfies

    χ²(π̃, π) < δc/λ.    (25)

Proof. Consider the divergence change of p̃ governed by Eq. (24). It can be decomposed into two components: the change of divergence due to L, and the change of divergence due to δS:

    ∂_t χ²(p̃_t, π) = E_{x∼π}[ ([L + δS]p̃_t(x)) p̃_t(x) / π²(x) ]
                  = E_{x∼π}[ (Lp̃_t(x)) p̃_t(x) / π²(x) ] + δ E_{x∼π}[ (Sp̃_t(x)) p̃_t(x) / π²(x) ]
                  = ∂_t χ²(p_t, π)|_{p_t=p̃_t} + δ ∂_t χ²(q_t, π)|_{q_t=p̃_t}.

We then evaluate the above equation at the stationary distribution of the inaccurate dynamics, π̃. Since ∂_t χ²(p̃_t, π)|_{p̃_t=π̃} = 0, we have

    λχ²(π̃, π) ≤ δ (∂_t χ²(q_t, π)|_{q_t=π̃}) < δc.

This theorem can also be used to measure the error in SGLD, and it justifies the use of small finite step sizes in SGLD. We should note that the mixing rate bound λ at π̃ exists for SGLD and can be obtained using spectral analysis (Levin et al., 2008), but the corresponding bounds for SGHMC are unclear due to the irreversibility of the process. We leave this for future work.

Our proof relies on a contraction bound relating the error in the transition distribution to the error in the final stationary distribution. Although our argument is based on a continuous-time Markov process, we should note that a similar guarantee can also be proven in terms of a discrete-time Markov transition kernel. We refer the reader to (Korattikara et al., 2014) and (Bardenet et al., 2014) for further details.

G. Setting SGHMC Parameters

As we discussed in Sec. 3.3, we can connect SGHMC with SGD with momentum by rewriting the dynamics as (see Eq. (15))

    ∆θ = v
    ∆v = −η∇Ũ(x) − αv + N(0, 2(α − β̂)η).

In analogy to SGD with momentum, we call η the learning rate and 1 − α the momentum term. This equivalent update rule is cleaner, and we recommend parameterizing SGHMC in this form.

The β̂ term corresponds to the estimation of the noise that comes from the gradient. One simple choice is to ignore the gradient noise by setting β̂ = 0 and relying on a small ε. We can also set β̂ = ηV̂/2, where V̂ is estimated using empirical Fisher information as in (Ahn et al., 2012).

There are then three parameters: the learning rate η, the momentum decay α, and the minibatch size |D̃|. Define β = εM⁻¹B = ½ηV(θ) to be the exact term induced by the introduction of the stochastic gradient. Then, we have

    β = O(η (|D|/|D̃|) I),    (26)

where I is the Fisher information matrix of the gradient, |D| is the size of the training data, |D̃| is the size of the minibatch, and η is our learning rate. We want to keep β small so that the resulting dynamics are governed by the user-controlled term and the sampling algorithm has a stationary distribution close to the target distribution. From Eq. (26), we see that there is no free lunch here: as the training size gets bigger, we can either set a small learning rate η = O(1/|D|) or use a bigger minibatch size |D̃|. In practice, choosing η = O(1/|D|) gives better numerical stability, since we also need to multiply ∇Ũ by η; a large η can cause divergence, especially when we are not close to the mode of the distribution. We note that the same discussion holds for SGLD (Welling & Teh, 2011).

In practice, we find that using a minibatch size of hundreds (e.g. |D̃| = 500) and fixing α to a small number (e.g. 0.01 or 0.1) works well. The learning rate can be set as η = γ/|D|, where γ is the "per-batch learning rate", usually set to 0.1 or 0.01. This method of setting parameters is also commonly used for SGD with momentum (Sutskever et al., 2013).
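As a worked example of these recommendations, the following sketch (assuming M = I, with illustrative defaults) converts a per-batch learning rate γ and momentum decay α into the quantities used in Alg. 2; the relations follow the definitions η = ε²M⁻¹, α = εM⁻¹C and β̂ = εM⁻¹B̂:

```python
import numpy as np

def sghmc_settings(gamma=0.1, alpha=0.01, n_train=50_000, V_hat=None):
    """Map the (gamma, alpha) parameterization back to Alg. 2 quantities."""
    eta = gamma / n_train                      # eta = gamma / |D|
    eps = np.sqrt(eta)                         # from eta = eps^2 M^{-1}, M = I
    C = alpha / eps                            # from alpha = eps M^{-1} C
    beta_hat = 0.0 if V_hat is None else 0.5 * eta * V_hat  # beta_hat = eta V_hat / 2
    return {"eta": eta, "alpha": alpha, "eps": eps, "C": C, "beta_hat": beta_hat}
```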
H. Experimental Setup

H.1. Bayesian Neural Network

The Bayesian neural network model used in Sec. 4.2 can be described by the following equation:

    P(y = i|x) ∝ exp(A_iᵀ σ(Bᵀx + b) + a_i).    (27)

Here, y ∈ {1, 2, ..., 10} is the output label of a digit. A ∈ R^{10×100} contains the weights for the output layer, and we use A_i to indicate the i-th column of A. B ∈ R^{d×100} contains the weights for the first layer. We also introduce a ∈ R^{10} and b ∈ R^{100} as bias terms in the model. In the MNIST dataset, the input dimension is d = 784. We place a Gaussian prior on the model parameters:

    P(A) ∝ exp(−λ_A‖A‖²),  P(B) ∝ exp(−λ_B‖B‖²),
    P(a) ∝ exp(−λ_a‖a‖²),  P(b) ∝ exp(−λ_b‖b‖²).

We further place gamma priors on each of the precision terms λ:

    λ_A, λ_B, λ_a, λ_b ∼ i.i.d. Γ(α, β).

We simply set α and β to 1, since the results are usually insensitive to these parameters. We generate samples from the posterior distribution

    P(Θ|D) ∝ Π_{(y,x)∈D} P(y|x, Θ) P(Θ),    (28)

where the parameter set Θ = {A, B, a, b, λ_A, λ_B, λ_a, λ_b}. The sampling procedure is carried out by alternating the following steps:

• Sample the weights from P(A, B, a, b|λ_A, λ_B, λ_a, λ_b, D) using SGHMC or SGLD with minibatches of 500 instances. Sample for 100 steps before updating the hyperparameters.

• Sample λ from P(λ_A, λ_B, λ_a, λ_b|A, B, a, b) using a Gibbs step. Note that the posterior for λ is a gamma distribution by conditional conjugacy.

We used the validation set to select parameters for the various methods we compare. Specifically, for SGD and SGLD, we tried step-sizes ε ∈ {0.1, 0.2, 0.4, 0.8} × 10⁻⁴, and the best settings were found to be ε = 0.1 × 10⁻⁴ for SGD and ε = 0.2 × 10⁻⁴ for SGLD. We then further tested ε = 0.16 × 10⁻⁴ and ε = 0.06 × 10⁻⁴ for SGD, and found ε = 0.16 × 10⁻⁴ gave the best result, thus we used this setting for SGD. For SGD with momentum and SGHMC, we fixed α = 0.01 and β̂ = 0, and tried η ∈ {0.1, 0.2, 0.4, 0.8} × 10⁻⁵. The best settings were η = 0.4 × 10⁻⁵ for SGD with momentum, and η = 0.2 × 10⁻⁵ for SGHMC. For the optimization-based methods, we tried regularizer λ ∈ {0, 0.1, 1, 10, 100}, and λ = 1 was found to give the best performance.
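For illustration, the minibatch potential energy implied by Eqs. (27)-(28) can be written as below; labels are assumed to be 0-indexed, the |D|/|D̃| rescaling of the log-likelihood is included, and this is a sketch of one plausible implementation rather than the code used in the experiments:

```python
import numpy as np

def U_minibatch(x, y, A, B, a, b, lam, n_total):
    """Potential energy of Eqs. (27)-(28) on a minibatch.
    x: (n, 784) inputs; y: (n,) integer labels in {0, ..., 9};
    lam = (lam_A, lam_B, lam_a, lam_b) are the current precision samples."""
    lam_A, lam_B, lam_a, lam_b = lam
    h = 1.0 / (1.0 + np.exp(-(x @ B + b)))            # sigmoid hidden layer
    logits = h @ A.T + a
    logits = logits - logits.max(axis=1, keepdims=True)   # stable softmax
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_lik = log_p[np.arange(len(y)), y].sum() * (n_total / len(y))  # |D|/|D~|
    log_prior = -(lam_A * (A ** 2).sum() + lam_B * (B ** 2).sum()
                  + lam_a * (a ** 2).sum() + lam_b * (b ** 2).sum())
    return -(log_lik + log_prior)
```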

H.2. Online Bayesian Probabilistic Matrix Factorization

The Bayesian probabilistic matrix factorization (BPMF) model used in Sec. 4.3 can be described as:

    λ_U, λ_V, λ_a, λ_b ∼ i.i.d. Gamma(1, 1),
    U_ki ∼ N(0, λ_U⁻¹),  V_kj ∼ N(0, λ_V⁻¹),
    a_i ∼ N(0, λ_a⁻¹),  b_j ∼ N(0, λ_b⁻¹),
    Y_ij | U, V ∼ N(U_iᵀV_j + a_i + b_j, τ⁻¹).    (29)

Here U_i ∈ R^d and V_j ∈ R^d are latent vectors for user i and movie j, while a_i and b_j are bias terms. We use a slightly simplified model compared to the BPMF model considered in (Salakhutdinov & Mnih, 2008a), in that we only place priors on the precision variables λ = {λ_U, λ_V, λ_a, λ_b}. However, the model still benefits from Bayesian inference by integrating over the uncertainty in the crucial regularization parameter λ. We generate samples from the posterior distribution

    P(Θ|Y) ∝ P(Y|Θ) P(Θ),    (30)

with the parameter set Θ = {U, V, a, b, λ_U, λ_V, λ_a, λ_b}. The sampling procedure is carried out by alternating the following steps:

• Sample the weights from P(U, V, a, b|λ_U, λ_V, λ_a, λ_b, Y) using SGHMC or SGLD with a minibatch size of 4,000 ratings. Sample for 2,000 steps before updating the hyperparameters.

• Sample λ from P(λ_U, λ_V, λ_a, λ_b|U, V, a, b) using a Gibbs step.

The training parameters for this experiment were directly selected using cross-validation. Specifically, for SGD and SGLD, we tried step-sizes ε ∈ {0.1, 0.2, 0.4, 0.8, 1.6} × 10⁻⁵, and the best settings were found to be ε = 0.4 × 10⁻⁵ for SGD and ε = 0.8 × 10⁻⁵ for SGLD. For SGD with momentum and SGHMC, we fixed α = 0.05 and β̂ = 0, and tried η ∈ {0.1, 0.2, 0.4, 0.8} × 10⁻⁶. The best settings were η = 0.4 × 10⁻⁶ for SGD with momentum, and η = 0.4 × 10⁻⁶ for SGHMC.
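Finally, a sketch of the corresponding minibatch gradient for the BPMF model of Eq. (29); the layout of the factor matrices and the sparse update pattern are assumptions of this illustration, not a description of the authors' code:

```python
import numpy as np

def bpmf_stochastic_grads(Umat, Vmat, a, b, batch, tau, lam, n_total):
    """Minibatch gradient of U(Theta) for the BPMF model of Eq. (29).
    batch: array of (i, j, y) rating triples; lam = (lam_U, lam_V, lam_a, lam_b);
    Umat, Vmat have shape (d, n_users) and (d, n_movies)."""
    lam_U, lam_V, lam_a, lam_b = lam
    gU, gV = lam_U * Umat, lam_V * Vmat            # gradients of -log prior
    ga, gb = lam_a * a, lam_b * b
    scale = n_total / len(batch)                   # |D| / |D~| rescaling
    for i, j, y in batch:
        i, j = int(i), int(j)
        err = tau * (Umat[:, i] @ Vmat[:, j] + a[i] + b[j] - y)
        gU[:, i] += scale * err * Vmat[:, j]       # -d log-lik / d U_i
        gV[:, j] += scale * err * Umat[:, i]
        ga[i] += scale * err
        gb[j] += scale * err
    return gU, gV, ga, gb
```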