
Stochastic Gradient Hamiltonian Monte Carlo

Tianqi Chen ([email protected]), Emily B. Fox ([email protected]), Carlos Guestrin ([email protected])
MODE Lab, University of Washington, Seattle, WA.

Abstract

Hamiltonian Monte Carlo (HMC) sampling methods provide a mechanism for defining distant proposals with high acceptance probabilities in a Metropolis-Hastings framework, enabling more efficient exploration of the state space than standard random-walk proposals. The popularity of such methods has grown significantly in recent years. However, a limitation of HMC methods is the required gradient computation for simulation of the Hamiltonian dynamical system—such computation is infeasible in problems involving a large sample size or streaming data. Instead, we must rely on a noisy gradient estimate computed from a subset of the data. In this paper, we explore the properties of such a stochastic gradient HMC approach. Surprisingly, the natural implementation of the stochastic approximation can be arbitrarily bad. To address this problem we introduce a variant that uses second-order Langevin dynamics with a friction term that counteracts the effects of the noisy gradient, maintaining the desired target distribution as the invariant distribution. Results on simulated data validate our theory. We also provide an application of our methods to a classification task using neural networks and to online Bayesian matrix factorization.

arXiv:1402.4102v2 [stat.ME] 12 May 2014. Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

1. Introduction

Hamiltonian Monte Carlo (HMC) (Duane et al., 1987; Neal, 2010) sampling methods provide a powerful Markov chain Monte Carlo (MCMC) sampling algorithm. The methods define a Hamiltonian function in terms of the target distribution from which we desire samples—the potential energy—and a kinetic energy term parameterized by a set of "momentum" auxiliary variables. Based on simple updates to the momentum variables, one simulates from a Hamiltonian dynamical system that enables proposals of distant states. The target distribution is invariant under these dynamics; in practice, a discretization of the continuous-time system is needed, necessitating a Metropolis-Hastings (MH) correction, though still with high acceptance probability. Based on the attractive properties of HMC in terms of rapid exploration of the state space, HMC methods have grown in popularity recently (Neal, 2010; Hoffman & Gelman, 2011; Wang et al., 2013).

A limitation of HMC, however, is the necessity to compute the gradient of the potential energy function in order to simulate the Hamiltonian dynamical system. We are increasingly faced with datasets having millions to billions of observations, or where data come in as a stream and we need to make inferences online, such as in online advertising or recommender systems. In these ever-more-common scenarios of massive batch or streaming data, such gradient computations are infeasible since they utilize the entire dataset, and thus are not applicable to "big data" problems. Recently, in a variety of machine learning algorithms, we have witnessed the many successes of utilizing a noisy estimate of the gradient based on a minibatch of data to scale the algorithms (Robbins & Monro, 1951; Hoffman et al., 2013; Welling & Teh, 2011). A majority of these developments have been in optimization-based algorithms (Robbins & Monro, 1951; Nemirovski et al., 2009), and a question is whether similar efficiencies can be garnered by sampling-based algorithms that maintain many desirable theoretical properties for Bayesian inference. One attempt at applying such methods in a sampling context is the recently proposed stochastic gradient Langevin dynamics (SGLD) (Welling & Teh, 2011; Ahn et al., 2012; Patterson & Teh, 2013). This method builds on first-order Langevin dynamics that do not include the crucial momentum term of HMC.

In this paper, we explore the possibility of marrying the efficiencies in state space exploration of HMC with the big-data computational efficiencies of stochastic gradients. Such an algorithm would enable a large-scale and online Bayesian sampling algorithm with the potential to rapidly explore the posterior. As a first cut, we consider simply applying a stochastic gradient modification to HMC and assess the impact of the noisy gradient. We prove that the noise injected in the system by the stochastic gradient no longer leads to Hamiltonian dynamics with the desired target distribution as the stationary distribution. As such, even before discretizing the dynamical system, we need to correct for this effect. One can correct for the injected gradient noise through an MH step, though this itself requires costly computations on the entire dataset. In practice, one might propose long simulation runs before an MH correction, but this leads to low acceptance rates due to large deviations in the Hamiltonian from the injected noise. The efficiency of this MH step could potentially be improved using the recent results of (Korattikara et al., 2014; Bardenet et al., 2014). In this paper, we instead introduce a stochastic gradient HMC method with friction added to the momentum update. We assume the injected noise is Gaussian, appealing to the central limit theorem, and analyze the corresponding dynamics. We show that using such second-order Langevin dynamics enables us to maintain the desired target distribution as the stationary distribution. That is, the friction counteracts the effects of the injected noise. For discretized systems, we consider letting the step size ε tend to zero so that an MH step is not needed, giving us a significant computational advantage. Empirically, we demonstrate that we have good performance even for ε set to a small, fixed value. The theoretical computation versus accuracy tradeoff of this small-ε approach is provided in the Supplementary Material.

A number of simulated experiments validate our theoretical results and demonstrate the differences between (i) exact HMC, (ii) the naïve implementation of stochastic gradient HMC (simply replacing the gradient with a stochastic gradient), and (iii) our proposed method incorporating friction. We also compare to the first-order Langevin dynamics of SGLD. Finally, we apply our proposed methods to a classification task using Bayesian neural networks and to online Bayesian matrix factorization of a standard movie dataset. Our experimental results demonstrate the effectiveness of the proposed algorithm.

2. Hamiltonian Monte Carlo

Suppose we want to sample from the posterior distribution of θ given a set of independent observations x ∈ D:

    p(θ|D) ∝ exp(−U(θ)),    (1)

where the potential energy function U is given by

    U = − Σ_{x∈D} log p(x|θ) − log p(θ).    (2)

Hamiltonian (Hybrid) Monte Carlo (HMC) (Duane et al., 1987; Neal, 2010) provides a method for proposing samples of θ in a Metropolis-Hastings (MH) framework that efficiently explores the state space as compared to standard random-walk proposals. These proposals are generated from a Hamiltonian system based on introducing a set of auxiliary momentum variables, r. That is, to sample from p(θ|D), HMC considers generating samples from a joint distribution of (θ, r) defined by

    π(θ, r) ∝ exp(−U(θ) − ½ rᵀM⁻¹r).    (3)

If we simply discard the resulting r samples, the θ samples have marginal distribution p(θ|D). Here, M is a mass matrix, and together with r, defines a kinetic energy term. M is often set to the identity matrix, I, but can be used to precondition the sampler when we have more information about the target distribution. The Hamiltonian function is defined by H(θ, r) = U(θ) + ½ rᵀM⁻¹r. Intuitively, H measures the total energy of a physical system with position variables θ and momentum variables r.

To propose samples, HMC simulates the Hamiltonian dynamics

    dθ = M⁻¹r dt
    dr = −∇U(θ) dt.    (4)

To make Eq. (4) concrete, a common analogy in 2D is as follows (Neal, 2010). Imagine a hockey puck sliding over a frictionless ice surface of varying height. The potential energy term is based on the height of the surface at the current puck position, θ, while the kinetic energy is based on the momentum of the puck, r, and its mass, M. If the surface is flat (∇U(θ) = 0, ∀θ), the puck moves at a constant velocity. For positive slopes (∇U(θ) > 0), the kinetic energy decreases as the potential energy increases until the kinetic energy is 0 (r = 0). The puck then slides back down the hill, increasing its kinetic energy and decreasing its potential energy. Recall that in HMC, the position variables are those of direct interest whereas the momentum variables are artificial constructs (auxiliary variables).

Over any interval s, the Hamiltonian dynamics of Eq. (4) define a mapping from the state at time t to the state at time t + s. Importantly, this mapping is reversible, which is important in showing that the dynamics leave π invariant. Likewise, the dynamics preserve the total energy, H, so proposals are always accepted. In practice, however, we usually cannot simulate exactly from the continuous system of Eq. (4) and instead consider a discretized system. One common approach is the "leapfrog" method, which is outlined in Alg. 1. Because of inaccuracies introduced through the discretization, an MH step must be implemented (i.e., the acceptance rate is no longer 1). However, acceptance rates still tend to be high even for proposals that can be quite far from their last state.

Algorithm 1: Hamiltonian Monte Carlo

    Input: starting position θ(1) and step size ε
    for t = 1, 2, ... do
        Resample momentum r: r(t) ∼ N(0, M)
        (θ0, r0) = (θ(t), r(t))
        Simulate the discretization of the Hamiltonian dynamics in Eq. (4) (leapfrog):
            r0 ← r0 − (ε/2) ∇U(θ0)
            for i = 1 to m do
                θi ← θi−1 + ε M⁻¹ ri−1
                ri ← ri−1 − ε ∇U(θi)    (use a half step, ε/2, for the final update i = m)
            end
        (θ̂, r̂) = (θm, rm)
        Metropolis-Hastings correction:
            u ∼ Uniform[0, 1]
            ρ = exp(H(θ(t), r(t)) − H(θ̂, r̂))
            if u < min(1, ρ) then θ(t+1) = θ̂
    end
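To make Alg. 1 concrete, the following is a minimal numpy sketch of the same procedure, assuming M = I. Here `U` and `grad_U` are user-supplied stand-ins for the potential energy of Eq. (2) and its full-data gradient, and all settings are illustrative rather than taken from the paper:

```python
import numpy as np

def hmc(theta0, U, grad_U, eps=0.1, m=20, n_samples=1000, seed=0):
    """Leapfrog HMC with an MH correction, with mass matrix M = I."""
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    samples = []
    for _ in range(n_samples):
        r = rng.standard_normal(theta.shape)        # resample r ~ N(0, I)
        th, rr = theta.copy(), r.copy()
        rr -= 0.5 * eps * grad_U(th)                # initial half step
        for i in range(m):
            th += eps * rr                          # position update (M = I)
            step = eps if i < m - 1 else 0.5 * eps  # final update is a half step
            rr -= step * grad_U(th)
        H_old = U(theta) + 0.5 * r @ r              # H(theta, r) with M = I
        H_new = U(th) + 0.5 * rr @ rr
        if rng.uniform() < min(1.0, np.exp(H_old - H_new)):  # MH correction
            theta = th
        samples.append(theta.copy())
    return np.asarray(samples)
```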
There have been many recent developments of HMC to make the algorithm more flexible and applicable in a variety of settings. The "No U-Turn" sampler (Hoffman & Gelman, 2011) and the methods proposed by Wang et al. (2013) allow automatic tuning of the step size, ε, and number of simulation steps, m. Riemann manifold HMC (Girolami & Calderhead, 2011) makes use of the Riemann geometry to adapt the mass M, enabling the algorithm to make use of curvature information to perform more efficient sampling. We attempt to improve HMC in an orthogonal direction focused on computational complexity, but these adaptive HMC techniques could potentially be combined with our proposed methods to see further benefits.

3. Stochastic Gradient HMC

In this section, we study the implications of implementing HMC using a stochastic gradient and propose variants on the Hamiltonian dynamics that are more robust to the noise introduced by the stochastic gradient estimates. In all scenarios, instead of directly computing the costly gradient ∇U(θ) using Eq. (2), which requires examination of the entire dataset D, we consider a noisy estimate based on a minibatch D̃ sampled uniformly at random from D:

    ∇Ũ(θ) = −(|D|/|D̃|) Σ_{x∈D̃} ∇log p(x|θ) − ∇log p(θ),  D̃ ⊂ D.    (5)

We assume that our observations x are independent and, appealing to the central limit theorem, approximate this noisy gradient as

    ∇Ũ(θ) ≈ ∇U(θ) + N(0, V(θ)).    (6)

Here, V is the covariance of the stochastic gradient noise, which can depend on the current model parameters and sample size. Note that we use an abuse of notation in Eq. (6) where the addition of N(µ, Σ) denotes the introduction of a random variable that is distributed according to this multivariate Gaussian. As the size of D̃ increases, this Gaussian approximation becomes more accurate. Clearly, we want minibatches to be small to have our sought-after computational gains. Empirically, in a wide range of settings, simply considering a minibatch size on the order of hundreds of data points is sufficient for the central limit theorem approximation to be accurate (Ahn et al., 2012). In our applications of interest, minibatches of this size still represent a significant reduction in the computational cost of the gradient.
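As a sketch, the estimator of Eq. (5) can be computed as follows; `grad_log_lik(x, theta)` and `grad_log_prior(theta)` are assumed helper functions returning ∇log p(x|θ) and ∇log p(θ):

```python
import numpy as np

def stochastic_grad_U(theta, data, grad_log_lik, grad_log_prior, batch_size, rng):
    """Minibatch estimate of the gradient of U (Eq. (5))."""
    idx = rng.choice(len(data), size=batch_size, replace=False)   # draw D-tilde
    g = sum(grad_log_lik(data[i], theta) for i in idx)
    return -(len(data) / batch_size) * g - grad_log_prior(theta)  # |D|/|D~| rescaling
```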

3.1. Naïve Stochastic Gradient HMC

The most straightforward approach to stochastic gradient HMC is simply to replace ∇U(θ) in Alg. 1 by ∇Ũ(θ). Referring to Eq. (6), this introduces noise in the momentum update, which becomes ∆r = −ε∇Ũ(θ) = −ε∇U(θ) + N(0, ε²V). The resulting discrete time system can be viewed as an ε-discretization of the following continuous stochastic differential equation:

    dθ = M⁻¹r dt
    dr = −∇U(θ) dt + N(0, 2B(θ)dt).    (7)

Here, B(θ) = ½εV(θ) is the diffusion matrix contributed by gradient noise. As with the original HMC formulation, it is useful to return to a continuous time system in order to derive properties of the approach. To gain some intuition about this setting, consider the same hockey puck analogy of Sec. 2. Here, we can imagine the puck on the same ice surface, but with some random wind blowing as well. This wind may blow the puck further away than expected. Formally, as given by Corollary 3.1 of Theorem 3.1, when B is nonzero, π(θ, r) of Eq. (3) is no longer invariant under the dynamics described by Eq. (7).

Theorem 3.1. Let p_t(θ, r) be the distribution of (θ, r) at time t with dynamics governed by Eq. (7). Define the entropy of p_t as h(p_t) = −∫_{θ,r} f(p_t(θ, r)) dθ dr, where f(x) = x ln x. Assume p_t is a distribution with density and gradient vanishing at infinity. Furthermore, assume the gradient vanishes faster than 1/ln p_t. Then, the entropy of p_t increases over time with rate

    ∂_t h(p_t(θ, r)) = ∫_{θ,r} f''(p_t)(∇_r p_t(θ, r))ᵀ B(θ) ∇_r p_t(θ, r) dθ dr.    (8)

Eq. (8) implies that ∂_t h(p_t(θ, r)) ≥ 0 since B(θ) is a positive semi-definite matrix.

Intuitively, Theorem 3.1 is true because the noise-free Hamiltonian dynamics preserve entropy, while the additional noise term strictly increases entropy if we assume (i) B(θ) is positive definite (a reasonable assumption due to the normal full rank property of Fisher information) and (ii) ∇_r p_t(θ, r) ≠ 0 for all t. Then, jointly, the entropy strictly increases over time. This hints at the fact that the distribution p_t tends toward a uniform distribution, which can be very far from the target distribution π.

Corollary 3.1. The distribution π(θ, r) ∝ exp(−H(θ, r)) is no longer invariant under the dynamics in Eq. (7).

The proofs of Theorem 3.1 and Corollary 3.1 are in the Supplementary Material.

Because π is no longer invariant under the dynamics of Eq. (7), we must introduce a correction step even before considering errors introduced by the discretization of the dynamical system. For the correctness of an MH step (based on the entire dataset), we appeal to the same arguments made for the HMC data-splitting technique of Neal (2010). This approach likewise considers minibatches of data and simulates the (continuous) Hamiltonian dynamics on each batch sequentially. Importantly, Neal (2010) alludes to the fact that the resulting H from the split-data scenario may be far from that of the full-data scenario after simulation, which leads to lower acceptance rates and thereby reduces the apparent computational gains in simulation. Empirically, as we demonstrate in Fig. 2, we see that even finite-length simulations from the noisy system can diverge quite substantially from those of the noise-free system. Although the minibatch-based HMC technique considered herein is slightly different from that of Neal (2010), the theory we have developed in Theorem 3.1 surrounding the high-entropy properties of the resulting invariant distribution of Eq. (7) provides some intuition for the observed deviations in H, both in our experiments and those of Neal (2010).

The poorly behaved properties of the trajectory of H based on simulations using noisy gradients result in a complex computation versus efficiency tradeoff. On one hand, it is extremely computationally intensive in large datasets to insert an MH step after just short simulation runs (where deviations in H are less pronounced and acceptance rates should be reasonable). Each of these MH steps requires a costly computation using all of the data, thus defeating the computational gains of considering noisy gradients. On the other hand, long simulation runs between MH steps can lead to very low acceptance rates. Each rejection corresponds to a wasted (noisy) gradient computation and simulation using the proposed variant of Alg. 1. One possible direction of future research is to consider using the recent results of Korattikara et al. (2014) and Bardenet et al. (2014) that show that it is possible to do MH using a subset of data. However, we instead consider in Sec. 3.2 a straightforward modification to the Hamiltonian dynamics that alleviates the issues of the noise introduced by stochastic gradients. In particular, our modification allows us to again achieve the desired π as the invariant distribution of the continuous Hamiltonian dynamical system.
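For reference, the naïve approach of this subsection amounts to the following update, shown as a hedged numpy sketch (M = I, illustrative settings) purely to make the failure mode explicit:

```python
import numpy as np

def naive_sghmc_step(theta, grad_U_tilde, eps=0.1, m=20, rng=None):
    """One proposal of naive stochastic gradient HMC (Sec. 3.1): Alg. 1 with
    grad_U replaced by a minibatch estimate and the MH step dropped. Included
    only to illustrate the failure mode of Theorem 3.1; do not use as-is."""
    rng = np.random.default_rng() if rng is None else rng
    r = rng.standard_normal(theta.shape)      # r ~ N(0, I)
    for _ in range(m):
        theta = theta + eps * r
        r = r - eps * grad_U_tilde(theta)     # injects noise with no friction
    return theta
```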
3.2. Stochastic Gradient HMC with Friction

In Sec. 3.1, we showed that HMC with stochastic gradients requires a frequent costly MH correction step, or alternatively, long simulation runs with low acceptance probabilities. Ideally, instead, we would like to minimize the effect of the injected noise on the dynamics themselves to alleviate these problems. To this end, we consider a modification to Eq. (7) that adds a "friction" term to the momentum update:

    dθ = M⁻¹r dt
    dr = −∇U(θ) dt − BM⁻¹r dt + N(0, 2B dt).    (9)

Here and throughout the remainder of the paper, we omit the dependence of B on θ for simplicity of notation. Let us again make a hockey analogy. Imagine we are now playing street hockey instead of ice hockey, which introduces friction from the asphalt. There is still a random wind blowing, however the friction of the surface prevents the puck from running far away. That is, the friction term BM⁻¹r helps decrease the energy H(θ, r), thus reducing the influence of the noise. This type of dynamical system is commonly referred to as second-order Langevin dynamics in physics (Wang & Uhlenbeck, 1945). Importantly, we note that the Langevin dynamics used in SGLD (Welling & Teh, 2011) are first-order, which can be viewed as a limiting case of our second-order dynamics when the friction term is large. Further details on this comparison follow at the end of this section.

Theorem 3.2. π(θ, r) ∝ exp(−H(θ, r)) is the unique stationary distribution of the dynamics described by Eq. (9).

Proof. Let

    G = [ 0  −I ;  I  0 ],   D = [ 0  0 ;  0  B ],

where G is an anti-symmetric matrix, and D is the symmetric (diffusion) matrix. Eq. (9) can be written in the following decomposed form (Yin & Ao, 2006; Shi et al., 2012):

    d[θ; r] = −[ 0  −I ;  I  B ] [∇U(θ); M⁻¹r] dt + N(0, 2D dt)
            = −[D + G] ∇H(θ, r) dt + N(0, 2D dt).

The distribution evolution under this dynamical system is governed by a Fokker-Planck equation:

    ∂_t p_t(θ, r) = ∇ᵀ{[D + G][p_t(θ, r)∇H(θ, r) + ∇p_t(θ, r)]}.    (10)

See the Supplementary Material for details. We can verify that π(θ, r) is invariant under Eq. (10) by calculating [e^{−H(θ,r)}∇H(θ, r) + ∇e^{−H(θ,r)}] = 0. Furthermore, due to the existence of the diffusion noise, π is the unique stationary distribution of Eq. (10).

In summary, we have shown that the dynamics given by Eq. (9) have a similar invariance property to that of the original Hamiltonian dynamics of Eq. (4), even with noise present. The key was to introduce a friction term using second-order Langevin dynamics. Our revised momentum update can also be viewed as akin to partial momentum refreshment (Horowitz, 1991; Neal, 1993), which also corresponds to second-order Langevin dynamics. Such partial momentum refreshment was shown to not greatly improve HMC in the case of noise-free gradients (Neal, 2010). However, as we have demonstrated, the idea is crucial in our stochastic gradient scenario in order to counterbalance the effect of the noisy gradients. We refer to the resulting method as stochastic gradient HMC (SGHMC).

CONNECTION TO FIRST-ORDER LANGEVIN DYNAMICS

As we previously discussed, the dynamics introduced in Eq. (9) relate to the first-order Langevin dynamics used in SGLD (Welling & Teh, 2011). In particular, the dynamics of SGLD can be viewed as second-order Langevin dynamics with a large friction term. To intuitively demonstrate this connection, let BM⁻¹ = 1/dt in Eq. (9). Because the friction and momentum noise terms are very large, the momentum variable r changes much faster than θ. Thus, relative to the rapidly changing momentum, θ can be considered as fixed. We can study this case as simply:

    dr = −∇U(θ) dt − BM⁻¹r dt + N(0, 2B dt).    (11)

The fast evolution of r leads to a rapid convergence to the stationary distribution of Eq. (11), which is given by N(MB⁻¹∇U(θ), M). Let us now consider a change in θ, with r ∼ N(MB⁻¹∇U(θ), M). Recalling BM⁻¹ = 1/dt, we have

    dθ = −M⁻¹∇U(θ) dt² + N(0, 2M⁻¹dt²),    (12)

which exactly aligns with the dynamics of SGLD where M⁻¹ serves as the preconditioning matrix (Welling & Teh, 2011). Intuitively, this means that when the friction is large, the dynamics do not depend on the decaying series of past gradients represented by dr, reducing to first-order Langevin dynamics.

3.3. Stochastic Gradient HMC in Practice

In everything we have considered so far, we have assumed that we know the noise model B. Clearly, in practice this is not the case. Imagine instead that we simply have an estimate B̂. As will become clear, it is beneficial to instead introduce a user-specified friction term C ⪰ B̂ and consider the following dynamics:

    dθ = M⁻¹r dt
    dr = −∇U(θ) dt − CM⁻¹r dt + N(0, 2(C − B̂) dt) + N(0, 2B dt).    (13)

The resulting SGHMC algorithm is shown in Alg. 2. Note that the algorithm is purely in terms of user-specified or computable quantities. To understand our choice of dynamics, we begin with the unrealistic scenario of perfect estimation of B.

Algorithm 2: Stochastic Gradient HMC

    for t = 1, 2, ... do
        optionally, resample momentum r as r(t) ∼ N(0, M)
        (θ0, r0) = (θ(t), r(t))
        simulate dynamics in Eq. (13):
        for i = 1 to m do
            θi ← θi−1 + ε_t M⁻¹ ri−1
            ri ← ri−1 − ε_t ∇Ũ(θi) − ε_t CM⁻¹ ri−1 + N(0, 2(C − B̂)ε_t)
        end
        (θ(t+1), r(t+1)) = (θm, rm), no M-H step
    end

Proposition 3.1. If B̂ = B, then the dynamics of Eq. (13) yield the stationary distribution π(θ, r) ∝ e^{−H(θ,r)}.

Proof. The momentum update simplifies to dr = −∇U(θ) dt − CM⁻¹r dt + N(0, 2C dt), with friction term CM⁻¹ and noise term N(0, 2C dt). Noting that the proof of Theorem 3.2 only relied on a matching of the noise and friction terms, the result follows directly by using C in place of B in Theorem 3.2.

Now consider the benefit of introducing the C terms and revised dynamics in the more realistic scenario of inaccurate estimation of B.
For example, the simplest choice is B̂ = 0. Though the true stochastic gradient noise B is clearly non-zero, as the step size ε → 0, B = ½εV goes to 0 and C dominates. That is, the dynamics are again governed by the controllable injected noise N(0, 2C dt) and friction CM⁻¹. It is also possible to set B̂ = ½εV̂, where V̂ is estimated using empirical Fisher information as in (Ahn et al., 2012) for SGLD.
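Putting the pieces together, Alg. 2 can be sketched as below, again with M = I and scalar, user-chosen C and B̂; this is an illustrative rendering of the algorithm, not the authors' implementation:

```python
import numpy as np

def sghmc(theta0, grad_U_tilde, eps=0.01, C=1.0, B_hat=0.0, m=50,
          n_samples=1000, seed=0):
    """SGHMC as in Alg. 2 with M = I and scalar friction C and estimate B_hat."""
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    noise_sd = np.sqrt(2.0 * (C - B_hat) * eps)    # N(0, 2(C - B_hat) eps)
    samples = []
    for _ in range(n_samples):
        r = rng.standard_normal(theta.shape)       # optional momentum resampling
        for _ in range(m):
            theta = theta + eps * r
            r = (r - eps * grad_U_tilde(theta) - eps * C * r
                 + noise_sd * rng.standard_normal(theta.shape))
        samples.append(theta.copy())               # no MH step
    return np.asarray(samples)
```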

COMPUTATIONAL COMPLEXITY

The complexity of Alg. 2 depends on the choice of M, C and B̂, and the complexity for estimating ∇Ũ(θ)—denoted as g(|D̃|, d)—where d is the dimension of the parameter space. Assume we allow B̂ to be an arbitrary d × d positive definite matrix. Using empirical Fisher information estimation of B̂, the per-iteration complexity of this estimation step is O(d²|D̃|). Then, the time complexity for the (θ, r) update is O(d³), because the update is dominated by generating Gaussian noise with a full covariance matrix. In total, the per-iteration time complexity is O(d²|D̃| + d³ + g(|D̃|, d)). In practice, we restrict all of the matrices to be diagonal when d is large, resulting in time complexity O(d|D̃| + d + g(|D̃|, d)). Importantly, we note that our SGHMC time complexity is the same as that of SGLD (Welling & Teh, 2011; Ahn et al., 2012) in both parameter settings.

In practice, we must assume inaccurate estimation of B. For a decaying series of step sizes ε_t, an MH step is not required (Welling & Teh, 2011; Ahn et al., 2012). (We note that, just as in SGLD, an MH correction is not even possible because we cannot compute the probability of the reverse dynamics.) However, as the step size decreases, the efficiency of the sampler likewise decreases since proposals are increasingly close to their initial value. In practice, we may want to tolerate some errors in the sampling accuracy to gain efficiency. As in (Welling & Teh, 2011; Ahn et al., 2012) for SGLD, we consider using a small, non-zero ε leading to some bias. We explore an analysis of the errors introduced by such finite-ε approximations in the Supplementary Material.

CONNECTION TO SGD WITH MOMENTUM

Adding a momentum term to stochastic gradient descent (SGD) is common practice. In concept, there is a clear relationship between SGD with momentum and SGHMC, and here we formalize this connection. Letting v = εM⁻¹r, we first rewrite the update rule in Alg. 2 as

    ∆θ = v
    ∆v = −ε²M⁻¹∇Ũ(θ) − εM⁻¹Cv + N(0, 2ε³M⁻¹(C − B̂)M⁻¹).    (14)

Define η = ε²M⁻¹, α = εM⁻¹C, β̂ = εM⁻¹B̂. The update rule becomes

    ∆θ = v
    ∆v = −η∇Ũ(x) − αv + N(0, 2(α − β̂)η).    (15)

Comparing to an SGD with momentum method, it is clear from Eq. (15) that η corresponds to the learning rate and 1 − α the momentum term. When the noise is removed (via C = B̂ = 0), SGHMC naturally reduces to a stochastic gradient method with momentum. We can use the equivalent update rule of Eq. (15) to run SGHMC, and borrow experience from parameter settings of SGD with momentum to guide our choices of SGHMC settings. For example, we can set α to a fixed small number (e.g., 0.01 or 0.1), select the learning rate η, and then fix β̂ = ηV̂/2. A more sophisticated strategy involves using momentum scheduling (Sutskever et al., 2013). We elaborate upon how to select these parameters in the Supplementary Material.
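In code, the reparameterized update of Eq. (15) looks like a noisy SGD-with-momentum step; the following sketch makes the correspondence explicit (all names and defaults are illustrative):

```python
import numpy as np

def sghmc_momentum_step(theta, v, grad_U_tilde, eta=0.01, alpha=0.1,
                        beta_hat=0.0, rng=None):
    """One SGHMC update in the SGD-with-momentum form of Eq. (15)."""
    rng = np.random.default_rng() if rng is None else rng
    theta = theta + v                                        # Delta theta = v
    noise_sd = np.sqrt(2.0 * (alpha - beta_hat) * eta)
    v = ((1.0 - alpha) * v - eta * grad_U_tilde(theta)
         + noise_sd * rng.standard_normal(np.shape(theta)))  # Delta v of Eq. (15)
    return theta, v
```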
4. Experiments

4.1. Simulated Scenarios

To empirically explore the behavior of HMC using exact gradients relative to stochastic gradients, we conduct experiments on a simulated setup. As a baseline, we consider the standard HMC implementation of Alg. 1, both with and without the MH correction. We then compare to HMC with stochastic gradients, replacing ∇U in Alg. 1 with ∇Ũ, and consider this proposal with and without an MH correction. Finally, we compare to our proposed SGHMC, which does not use an MH correction. Fig. 1 shows the empirical distributions generated by the different sampling algorithms. We see that even without an MH correction, both the HMC and SGHMC algorithms provide results close to the true distribution, implying that any errors from considering non-zero ε are negligible. On the other hand, the results of naïve stochastic gradient HMC diverge significantly from the truth unless an MH correction is added. These findings validate our theoretical results; that is, both standard HMC and SGHMC maintain π as the invariant distribution as ε → 0, whereas naïve stochastic gradient HMC does not, though this can be corrected for using a (costly) MH step.

Figure 1 (plot not reproduced). Empirical distributions associated with various sampling algorithms relative to the true target distribution with U(θ) = −2θ² + θ⁴. We compare the HMC method of Alg. 1 with and without the MH step to: (i) a naïve variant that replaces the gradient with a stochastic gradient, again with and without an MH correction; (ii) the proposed SGHMC method, which does not use an MH correction. We use ∇Ũ(θ) = ∇U(θ) + N(0, 4) in the stochastic gradient based samplers and ε = 0.1 in all cases. Momentum is resampled every 50 steps in all variants of HMC.
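A self-contained sketch of this simulated study follows; the friction constant C is chosen arbitrarily for illustration (the paper does not report its value), and the rest of the settings follow the caption of Fig. 1:

```python
import numpy as np

rng = np.random.default_rng(0)
grad_U = lambda th: -4.0 * th + 4.0 * th ** 3        # U(th) = -2 th^2 + th^4
grad_U_tilde = lambda th: grad_U(th) + 2.0 * rng.standard_normal()  # + N(0, 4)

eps, C, m = 0.1, 3.0, 50                             # C chosen for illustration
theta, r, samples = 0.0, 0.0, []
for t in range(200_000):
    if t % m == 0:
        r = rng.standard_normal()                    # resample momentum
    theta += eps * r
    r += (-eps * grad_U_tilde(theta) - eps * C * r
          + np.sqrt(2.0 * C * eps) * rng.standard_normal())   # B_hat = 0
    samples.append(theta)
# A histogram of `samples` should track exp(2 th^2 - th^4) up to normalization.
```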

We also consider simply simulating from the discretized Hamiltonian dynamical systems associated with the various samplers compared. In Fig. 2, we compare the resulting trajectories and see that the path of (θ, r) from the noisy system without friction diverges significantly. The modification of the dynamical system by adding friction (corresponding to SGHMC) corrects this behavior. We can also correct for this divergence through periodic resampling of the momentum, though as we saw in Fig. 1, the corresponding MCMC algorithm ("Naive stochastic gradient HMC (no MH)") does not yield the correct target distribution. These results confirm the importance of the friction term in maintaining a well-behaved Hamiltonian and leading to the correct stationary distribution.

Figure 2 (plot not reproduced). Points (θ, r) simulated from discretizations of various Hamiltonian dynamics over 15000 steps using U(θ) = ½θ² and ε = 0.1. For the noisy scenarios, we replace the gradient by ∇Ũ(θ) = θ + N(0, 4). We see that noisy Hamiltonian dynamics lead to diverging trajectories when friction is not introduced. Resampling r helps control divergence, but the associated HMC stationary distribution is not correct, as illustrated in Fig. 1. (Trajectories shown: noisy Hamiltonian dynamics; noisy Hamiltonian dynamics with r resampled each 50 steps; noisy Hamiltonian dynamics with friction; noise-free Hamiltonian dynamics.)

It is known that a benefit of HMC over many other MCMC algorithms is the efficiency in sampling from correlated distributions (Neal, 2010)—this is where the introduction of the momentum variable shines. SGHMC inherits this property. Fig. 3 compares SGHMC and SGLD (Welling & Teh, 2011) when sampling from a bivariate Gaussian with positive correlation. For each method, we examine five different settings of the initial step size on a linearly decreasing scale and generate ten million samples. For each of these sets of samples (one set per step-size setting), we calculate the autocorrelation time of the samples (defined as 1 + Σ_{s=1}^∞ ρ_s, where ρ_s is the autocorrelation at lag s) and the average absolute error of the resulting sample covariance. Fig. 3(a) shows the autocorrelation versus estimation error for the five settings. As we decrease the stepsize, SGLD has reasonably low estimation error but high autocorrelation time, indicating an inefficient sampler. In contrast, SGHMC achieves even lower estimation error at very low autocorrelation times, from which we conclude that the sampler is indeed efficiently exploring the distribution. Fig. 3(b) shows the first 50 samples generated by the two samplers. We see that SGLD's random-walk behavior makes it challenging to explore the tails of the distribution. The momentum variable associated with SGHMC instead drives the sampler to move along the distribution contours.

Figure 3 (plots not reproduced). Contrasting sampling of a bivariate Gaussian with correlation using SGHMC versus SGLD. Here, U(θ) = ½θᵀΣ⁻¹θ, ∇Ũ(θ) = Σ⁻¹θ + N(0, I) with Σ₁₁ = Σ₂₂ = 1 and correlation ρ = Σ₁₂ = 0.9. Left: average absolute error of the covariance estimation using ten million samples versus autocorrelation time of the samples as a function of 5 step size settings. Right: first 50 samples of SGHMC and SGLD.
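The autocorrelation time used above can be estimated with a simple truncated-sum estimator; the truncation rule below is a common heuristic, not one specified in the paper:

```python
import numpy as np

def autocorr_time(x, max_lag=1000):
    """Estimate 1 + sum_s rho_s, truncating at the first negative rho_s."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    var = np.dot(x, x) / len(x)
    tau = 1.0
    for s in range(1, min(max_lag, len(x) - 1)):
        rho = np.dot(x[:-s], x[s:]) / (len(x) * var)
        if rho < 0:                                  # heuristic truncation
            break
        tau += rho
    return tau
```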

4.2. Bayesian Neural Networks for Classification

We also test our method on a handwritten digits classification task using the MNIST dataset (http://yann.lecun.com/exdb/mnist/). The dataset consists of 60,000 training instances and 10,000 test instances. We randomly split a validation set containing 10,000 instances from the training data in order to select training parameters, and use the remaining 50,000 instances for training. For classification, we consider a two layer Bayesian neural network with 100 hidden variables using a sigmoid unit and an output layer using softmax. We tested four methods: SGD, SGD with momentum, SGLD and SGHMC. For the optimization-based methods, we use the validation set to select the optimal regularizer λ of network weights (we also tried MAP inference for selecting λ, but found similar performance). For the sampling-based methods, we take a fully Bayesian approach and place a weakly informative gamma prior on each layer's weight regularizer λ. The sampling procedure is carried out by running SGHMC and SGLD using minibatches of 500 training instances, then resampling hyperparameters after an entire pass over the training set. We run the samplers for 800 iterations (each over the entire training dataset) and discard the initial 50 samples as burn-in.

The test error as a function of MCMC or optimization iteration (after burn-in) is reported for each of these methods in Fig. 4. From the results, we see that SGD with momentum converges faster than SGD. SGHMC also has an advantage over SGLD, converging to a low test error much more rapidly. In terms of runtime, in this case the gradient computation used in backpropagation dominates, so both have the same computational cost. The final results of the sampling based methods are better than the optimization-based methods, showing an advantage to Bayesian inference in this setting, thus validating the need for scalable and efficient algorithms such as SGHMC.

Figure 4 (plot not reproduced). Convergence of test error on the MNIST dataset using SGD, SGD with momentum, SGLD, and SGHMC to infer model parameters of a Bayesian neural net.

Table 1. Predictive RMSE estimated using 5-fold cross validation on the Movielens dataset for various approaches of inferring parameters of a Bayesian probabilistic matrix factorization model.

    METHOD               RMSE
    SGD                  0.8538 ± 0.0009
    SGD with momentum    0.8539 ± 0.0009
    SGLD                 0.8412 ± 0.0009
    SGHMC                0.8411 ± 0.0011

4.3. Online Bayesian Probabilistic Matrix Factorization for Movie Recommendations

Collaborative filtering is an important problem in web applications. The task is to predict a user's preference over a set of items (e.g., movies, music) and produce recommendations. Probabilistic matrix factorization (PMF) (Salakhutdinov & Mnih, 2008b) has proven effective for this task. Due to the sparsity in the ratings matrix (users versus items) in recommender systems, over-fitting is a severe issue, with Bayesian approaches providing a natural solution (Salakhutdinov & Mnih, 2008a).

We conduct an experiment in online Bayesian PMF on the Movielens dataset ml-1M (http://grouplens.org/datasets/movielens/). The dataset contains about 1 million ratings of 3,952 movies by 6,040 users. The number of latent dimensions is set to 20. In comparing our stochastic-gradient-based approaches, we use minibatches of 4,000 ratings to update the user and item latent matrices. We choose a significantly larger minibatch size in this application than that of the neural net because of the dramatically larger parameter space associated with the PMF model. For the optimization-based approaches, the hyperparameters are set using cross validation (again, we did not see a performance difference from considering MAP estimation).
For the sampling-based approaches, the hyperparameters are updated using a Gibbs step after every 2,000 steps of sampling model parameters. We run the sampler to generate 2,000,000 samples, with the first 100,000 samples discarded as burn-in. We use five-fold cross validation to evaluate the performance of the different methods.

The results are shown in Table 1. Both SGHMC and SGLD give better prediction results than the optimization-based methods. In this application, the results for SGLD and SGHMC are very similar. We also observed that the per-iteration running times of both methods are comparable. As such, the experiment suggests that SGHMC is an effective candidate for online Bayesian PMF.

5. Conclusion

Moving between modes of a distribution is one of the key challenges for MCMC-based inference algorithms. To address this problem in the large-scale or online setting, we proposed SGHMC, an efficient method for generating high-quality, "distant" steps in such sampling methods. Our approach builds on the fundamental framework of HMC, but uses stochastic estimates of the gradient to avoid the costly full gradient computation. Surprisingly, we discovered that the natural way to incorporate stochastic gradient estimates into HMC can lead to divergence and poor behavior both in theory and in practice. To address this challenge, we introduced second-order Langevin dynamics with a friction term that counteracts the effects of the noisy gradient, maintaining the desired target distribution as the invariant distribution of the continuous system. Our empirical results, both in a simulated experiment and on real data, validate our theory and demonstrate the practical value of introducing this simple modification. A natural next step is to explore combining adaptive HMC techniques with SGHMC. More broadly, we believe that the unification of efficient optimization and sampling techniques, such as those described herein, will enable a significant scaling of Bayesian methods.

Acknowledgements

This work was supported in part by the TerraSwarm Research Center sponsored by MARCO and DARPA, ONR Grant N00014-10-1-0746, DARPA Grant FA9550-12-1-0406 negotiated by AFOSR, NSF IIS-1258741 and Intel ISTC Big Data. We also appreciate the discussions with Mark Girolami, Nick Foti, Ping Ao and Hong Qian.

References

Ahn, S., Korattikara, A., and Welling, M. Bayesian posterior sampling via stochastic gradient Fisher scoring. In Proceedings of the 29th International Conference on Machine Learning (ICML'12), pp. 1591–1598, July 2012.

Bardenet, R., Doucet, A., and Holmes, C. Towards scaling up Markov chain Monte Carlo: An adaptive subsampling approach. In Proceedings of the 31st International Conference on Machine Learning (ICML'14), volume 32, pp. 405–413, February 2014.

Duane, S., Kennedy, A.D., Pendleton, B.J., and Roweth, D. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.

Girolami, M. and Calderhead, B. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society Series B, 73(2):123–214, 2011.

Hoffman, M.D. and Gelman, A. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. arXiv:1111.4246, 2011.

Hoffman, M.D., Blei, D.M., Wang, C., and Paisley, J. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, May 2013.

Horowitz, A.M. A generalized guided Monte Carlo algorithm. Physics Letters B, 268(2):247–252, 1991.

Korattikara, A., Chen, Y., and Welling, M. Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In Proceedings of the 31st International Conference on Machine Learning (ICML'14), volume 32, pp. 181–189, February 2014.

Levin, D.A., Peres, Y., and Wilmer, E.L. Markov Chains and Mixing Times. American Mathematical Society, 2008.

Neal, R.M. Bayesian learning via stochastic dynamics. In Advances in Neural Information Processing Systems 5 (NIPS'93), pp. 475–482, 1993.

Neal, R.M. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 54:113–162, 2010.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, January 2009.

Patterson, S. and Teh, Y.W. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In Advances in Neural Information Processing Systems 26 (NIPS'13), pp. 3102–3110, 2013.

Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.

Salakhutdinov, R. and Mnih, A. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th International Conference on Machine Learning (ICML'08), pp. 880–887, 2008a.

Salakhutdinov, R. and Mnih, A. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems 20 (NIPS'08), pp. 1257–1264, 2008b.

Shi, J., Chen, T., Yuan, R., Yuan, B., and Ao, P. Relation of a new interpretation of stochastic differential equations to Ito process. Journal of Statistical Physics, 148(3):579–590, 2012.

Sutskever, I., Martens, J., Dahl, G.E., and Hinton, G.E. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML'13), volume 28, pp. 1139–1147, May 2013.

Wang, M.C. and Uhlenbeck, G.E. On the theory of the Brownian motion II. Reviews of Modern Physics, 17(2-3):323, 1945.

Wang, Z., Mohamed, S., and de Freitas, N. Adaptive Hamiltonian and Riemann manifold Monte Carlo. In Proceedings of the 30th International Conference on Machine Learning (ICML'13), volume 28, pp. 1462–1470, May 2013.

Welling, M. and Teh, Y.W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML'11), pp. 681–688, June 2011.

Yin, L. and Ao, P. Existence and construction of dynamical potential in nonequilibrium processes without detailed balance. Journal of Physics A: Mathematical and General, 39(27):8593, 2006.

Supplementary Material

A. Background on the Fokker-Planck Equation

The Fokker-Planck equation (FPE) associated with a given stochastic differential equation (SDE) describes the time evolution of the distribution on the random variables under the specified stochastic dynamics. For example, consider the SDE

    dz = g(z) dt + N(0, 2D(z) dt),    (16)

where z ∈ Rⁿ, g(z) ∈ Rⁿ, D(z) ∈ Rⁿˣⁿ. The distribution of z governed by Eq. (16) (denoted by p_t(z)) evolves under the following equation:

    ∂_t p_t(z) = −Σᵢ ∂_{z_i}[g_i(z) p_t(z)] + Σᵢ Σⱼ ∂_{z_i}∂_{z_j}[D_{ij}(z) p_t(z)].

Here g_i(z) is the i-th entry of the vector g(z) and D_{ij}(z) is the (i, j) entry of the matrix D. In the dynamics considered in this paper, z = (θ, r) and

    D = [ 0  0 ;  0  B(θ) ].    (17)

That is, the random variables are the momentum r and position θ, with noise only added to r (though dependent upon θ). The FPE can be written in the following compact form:

    ∂_t p_t(z) = −∇ᵀ[g(z) p_t(z)] + ∇ᵀ[D(z) ∇p_t(z)],    (18)

where ∇ᵀ[g(z) p_t(z)] = Σᵢ ∂_{z_i}[g_i(z) p_t(z)], and

    Σ_{ij} ∂_{z_i}∂_{z_j}[D_{ij}(z) p_t(z)]
        = Σ_{ij} ∂_{z_i}[D_{ij}(z) ∂_{z_j} p_t(z)] + Σ_{ij} ∂_{z_i}[(∂_{z_j}D_{ij}(z)) p_t(z)]
        = ∇ᵀ[D(z)∇p_t(z)].

Note that ∂_{z_j}D_{ij}(z) = 0 for all i, j, since ∂_{r_j}B_{ij}(θ) = 0 (the noise is only added to r and only depends on the parameter θ), which justifies the last equality.

B. Proof of Theorem 3.1

Let

    G = [ 0  −I ;  I  0 ],   D = [ 0  0 ;  0  B(θ) ].

The noisy Hamiltonian dynamics of Eq. (7) can be written as

    d[θ; r] = −[ 0  −I ;  I  0 ] [∇U(θ); M⁻¹r] dt + N(0, 2D dt)
            = −G∇H(θ, r) dt + N(0, 2D dt).

Applying Eq. (18), defining g(z) = −G∇H, the corresponding FPE is given by

    ∂_t p_t(θ, r) = ∇ᵀ[G∇H(θ, r) p_t(θ, r)] + ∇ᵀ[D∇p_t(θ, r)].    (19)

We use z = (θ, r) to denote the joint variable of position and momentum. The entropy is defined by h(p_t(θ, r)) = −∫_{θ,r} f(p_t(θ, r)) dθ dr, where f(x) = x ln x is a strictly convex function defined on (0, +∞). The evolution of the entropy is governed by

    ∂_t h(p_t(z)) = −∂_t ∫_z f(p_t(z)) dz
                 = −∫_z f'(p_t(z)) ∂_t p_t(z) dz
                 = −∫_z f'(p_t(z)) ∇ᵀ[G∇H(z) p_t(z)] dz − ∫_z f'(p_t(z)) ∇ᵀ[D(z)∇p_t(z)] dz.

The entropy evolution can thus be described as the sum of two parts: the noise-free Hamiltonian dynamics and the stochastic gradient noise term. The Hamiltonian dynamics part does not change the entropy, since

    −∫_z f'(p_t(z)) ∇ᵀ[G∇H(z) p_t(z)] dz
        = −∫_z f'(p_t(z)) ∇ᵀ[G∇H(z)] p_t(z) dz − ∫_z f'(p_t(z)) (∇p_t(z))ᵀ[G∇H(z)] dz
        = −∫_z (∇f(p_t(z)))ᵀ[G∇H(z)] dz
        = ∫_z f(p_t(z)) ∇ᵀ[G∇H(z)] dz = 0.

In the second equality, we use the fact that ∇ᵀ[G∇H(z)] = −∂_θ∂_r H + ∂_r∂_θ H = 0, so the first term vanishes, together with f'(p)∇p = ∇f(p). The next-to-last equality is given by integration by parts, using the assumption that the probability density vanishes at infinity and f(x) → 0 as x → 0, such that f(p_t(z))[G∇H(z)] → 0 as z → ∞.

The contribution due to the stochastic gradient noise can be calculated as

    −∫_z f'(p_t(z)) ∇ᵀ[D(z)∇p_t(z)] dz
        = ∫_z (f''(p_t(z))∇p_t(z))ᵀ D(z) ∇p_t(z) dz
        = ∫_{θ,r} f''(p_t(θ, r)) (∇_r p_t(θ, r))ᵀ B(θ) ∇_r p_t(θ, r) dθ dr.

The first equality is again given by integration by parts, assuming that the gradient of p_t vanishes at infinity faster than 1/ln p_t(z); that is, f'(p_t(z))∇p_t(z) = (1 + ln p_t(z))∇p_t(z) → 0, such that f'(p_t(z))[D(z)∇p_t(z)] → 0 as z → ∞. The statement of Theorem 3.1 immediately follows.

C. Proof of Corollary 3.1

Assume π(θ, r) = exp(−H(θ, r))/Z is invariant under Eq. (7) and is a well-behaved distribution such that H(θ, r) → ∞ as ‖θ‖, ‖r‖ → ∞. Then it is straightforward to verify that π(θ, r) and ln π(θ, r)∇π(θ, r) ∝ exp(−H(θ, r)) H(θ, r)∇H(θ, r) vanish at infinity, such that π satisfies the conditions of Theorem 3.1. We also have ∇_r π(θ, r) = −(1/Z) exp(−H(θ, r)) M⁻¹r. Using the assumption that the Fisher information matrix B(θ) has full rank, and noting that f''(p) > 0 for p > 0, from Eq. (8) of Theorem 3.1 we conclude that the entropy increases over time: ∂_t h(p_t(θ, r))|_{p_t=π} > 0. This contradicts π being the invariant distribution.

D. FPE for Second-Order Langevin Dynamics

Second-order Langevin dynamics can be described by the following equation:

    d[θ; r] = −[ 0  −I ;  I  B ] [∇U(θ); M⁻¹r] dt + N(0, 2τD dt)
            = −[D + G] ∇H(θ, r) dt + N(0, 2τD dt),    (20)

where τ is a temperature (usually set to 1). In this paper, we use the following compact form of the FPE to calculate the distribution evolution under Eq. (20):

    ∂_t p_t(θ, r) = ∇ᵀ{[D + G][p_t(θ, r)∇H(θ, r) + τ∇p_t(θ, r)]}.    (21)

To derive this FPE, we apply Eq. (18) to Eq. (20), defining g(z) = −(D + G)∇H, which yields

    ∂_t p_t(θ, r) = ∇ᵀ{[D + G][∇H(θ, r) p_t(θ, r)]} + ∇ᵀ[τD∇p_t(θ, r)].

Using the fact that ∇ᵀ[G∇p_t(θ, r)] = −∂_θ∂_r p_t(θ, r) + ∂_r∂_θ p_t(θ, r) = 0, we get Eq. (21). This form of the FPE allows easy verification that the stationary distribution is given by π(θ, r) ∝ e^{−H(θ,r)/τ}. In particular, if we substitute the target distribution into Eq. (21), we note that

    e^{−H(θ,r)/τ} ∇H(θ, r) + τ∇e^{−H(θ,r)/τ} = 0,

such that ∂_t π(θ, r) = 0, implying that π is indeed the stationary distribution.

The compact form of Eq. (21) can also be used to construct other stochastic processes with the desired invariant distribution. A generalization of the FPE in Eq. (21) is given by Yin & Ao (2006). The system we have discussed in this paper considers cases where G = [ 0 −I ; I 0 ] and D only depends on θ. In practice, however, it might be helpful to make G depend on θ as well, for example, to make use of the Riemann geometry of the problem, as in Girolami & Calderhead (2011) and Patterson & Teh (2013), by adapting G according to the local curvature. For us to consider these more general cases, a correction term needs to be added during simulation (Shi et al., 2012). With that correction term, we still maintain the desired target distribution as the stationary distribution.

E. Reversibility of SGHMC Dynamics

The dynamics of SGHMC are not reversible in the conventional definition of reversibility. However, the dynamics satisfy the following property:

Theorem E.1. Assume P(θ_t, r_t|θ_0, r_0) is the distribution governed by the dynamics in Eq. (20), i.e. P(θ_t, r_t|θ_0, r_0) follows Eq. (21). Then for π(θ, r) ∝ exp(−H(θ, r)),

    π(θ_0, r_0) P(θ_t, r_t|θ_0, r_0) = π(θ_t, −r_t) P(θ_0, −r_0|θ_t, −r_t).    (22)

Proof. Assume π is the stationary distribution and let P* be the reverse-time Markov process associated with P, so that π(θ_0, r_0)P(θ_t, r_t|θ_0, r_0) = π(θ_t, r_t)P*(θ_0, r_0|θ_t, r_t). Let L(p) = ∇ᵀ{[D + G][p∇H(θ, r) + τ∇p]} be the generator of the Markov process described by Eq. (21). The generator of the reverse process is given by L*, which is the adjoint operator of L in the inner-product space ℓ²(π), with inner-product defined by ⟨p, q⟩_π = E_{x∼π(x)}[p(x)q(x)]. We can verify that L*(p) = ∇ᵀ{[D − G][p∇H(θ, r) + τ∇p]}. The corresponding SDE of the reverse process is given by

    d[θ; r] = −[D − G] ∇H(θ, r) dt + N(0, 2τD dt),

which is equivalent to

    d[θ; −r] = −[D + G] ∇H(θ, −r) dt + N(0, 2τD dt).

This means P*(θ_0, r_0|θ_t, r_t) = P(θ_0, −r_0|θ_t, −r_t). Recalling that we assume Gaussian momentum, r, centered about 0, we also have π(θ, r) = π(θ, −r). Together, we then have

    π(θ_0, r_0)P(θ_t, r_t|θ_0, r_0) = π(θ_t, r_t)P*(θ_0, r_0|θ_t, r_t)
                                  = π(θ_t, −r_t)P(θ_0, −r_0|θ_t, −r_t).

Theorem E.1 is not strictly detailed balance by the conventional definition, since L* ≠ L and P* ≠ P. However, it can be viewed as a kind of time reversibility: when we reverse time, the sign of the speed needs to be reversed to allow backward travel. This property is shared by the noise-free HMC dynamics of Neal (2010). Detailed balance can be enforced by the symmetry of r during the re-sampling step. However, we note that we do not rely on detailed balance to have π be the stationary distribution of our noisy Hamiltonian with friction (see Eq. (9)).

F. Convergence Analysis

In the paper, we have discussed that the efficiency of SGHMC decreases as the step size ε decreases. In practice, we usually want to trade a small amount of error for efficiency. In the case of SGHMC, we are interested in a small, nonzero ε and a fast approximation of B given by B̂. In this case, even under the continuous dynamics, the sampling procedure contains error that relates to ε, due to the inaccurate estimation of B by B̂. In this section, we investigate how the choice of ε can be related to the error in the final stationary distribution. The sampling procedure with inaccurate estimation of B can be described by the following dynamics:

    dθ = M⁻¹r dt
    dr = −∇U(θ) dt − CM⁻¹r dt + N(0, 2(C + δS) dt).

Here, δS = B − B̂ is the error term that is not accounted for by the sampling algorithm. Assume the setting where B̂ = 0; then we can let δ = ε and S = ½V. Let π̃ be the stationary distribution of the dynamics. In the special case when V = C, we can calculate π̃ exactly by
    π̃(θ, r) ∝ exp(−H(θ, r)/(1 + δ)).    (23)

This indicates that for small ε, our stationary distribution is indeed close to the true stationary distribution. In the general case, we consider the FPE of the distribution of this SDE, given by

    ∂_t p̃_t(θ, r) = [L + δS] p̃_t(θ, r).    (24)

Here, L(p) = ∇ᵀ{[D + G][p∇H(θ, r) + ∇p]} is the operator corresponding to the correct sampling process, and the operator S(p) = ∇_r[S∇_r p] corresponds to the error term introduced by the inaccurate B̂. Let us consider the χ²-divergence, defined by

    χ²(p, π) = E_{x∼π}[ (p(x) − π(x))² / π²(x) ] = E_{x∼π}[ p²(x) / π²(x) ] − 1,

which provides a measure of distance between the distribution p and the true distribution π. Theorem F.1 shows that the χ²-divergence decreases as δ becomes smaller.

Theorem F.1. Assume p_t evolves according to ∂_t p_t = Lp_t, and satisfies the following mixing rate λ with respect to the χ²-divergence at π̃: ∂_t χ²(p_t, π)|_{p_t=π̃} ≤ −λχ²(π̃, π). Further assume the process governed by S (∂_t q_t = Sq_t) has bounded divergence change |∂_t χ²(q_t, π)| < c. Then π̃ satisfies

    χ²(π̃, π) < δc/λ.    (25)

Proof. Consider the divergence change of p̃ governed by Eq. (24). It can be decomposed into two components: the change of divergence due to L, and the change of divergence due to δS:

    ∂_t χ²(p̃_t, π) = E_{x∼π}[ ([L + δS]p̃_t(x)) p̃_t(x) / π²(x) ]
                  = E_{x∼π}[ (Lp̃_t(x)) p̃_t(x) / π²(x) ] + δ E_{x∼π}[ (Sp̃_t(x)) p̃_t(x) / π²(x) ]
                  = ∂_t χ²(p_t, π)|_{p_t=p̃_t} + δ ∂_t χ²(q_t, π)|_{q_t=p̃_t}.

We then evaluate the above equation at the stationary distribution of the inaccurate dynamics, π̃. Since ∂_t χ²(p̃_t, π)|_{p̃_t=π̃} = 0, we have

    λχ²(π̃, π) ≤ δ (∂_t χ²(q_t, π)|_{q_t=π̃}) < δc.

This theorem can also be used to measure the error in SGLD, and it justifies the use of small finite step sizes in SGLD. We should note that the mixing rate bound λ at π̃ exists for SGLD and can be obtained using spectral analysis (Levin et al., 2008), but the corresponding bounds for SGHMC are unclear due to the irreversibility of the process. We leave this for future work.

Our proof relies on a contraction bound relating the error in the transition distribution to the error in the final stationary distribution. Although our argument is based on a continuous-time Markov process, we should note that a similar guarantee can also be proven in terms of a discrete-time Markov transition kernel. We refer the reader to (Korattikara et al., 2014) and (Bardenet et al., 2014) for further details.

G. Setting SGHMC Parameters

As we discussed in Sec. 3.3, we can connect SGHMC with SGD with momentum by rewriting the dynamics as (see Eq. (15))

    ∆θ = v
    ∆v = −η∇Ũ(x) − αv + N(0, 2(α − β̂)η).

In analogy to SGD with momentum, we call η the learning rate and 1 − α the momentum term. This equivalent update rule is cleaner, and we recommend parameterizing SGHMC in this form.

The β̂ term corresponds to the estimation of the noise that comes from the gradient. One simple choice is to ignore the gradient noise by setting β̂ = 0 and relying on a small ε. We can also set β̂ = ηV̂/2, where V̂ is estimated using empirical Fisher information as in (Ahn et al., 2012).

There are then three parameters: the learning rate η, the momentum decay α, and the minibatch size |D̃|. Define β = εM⁻¹B = ½ηV(θ) to be the exact term induced by the introduction of the stochastic gradient. Then, we have

    β = O(η (|D|/|D̃|) I),    (26)

where I is the Fisher information matrix of the gradient, |D| is the size of the training data, |D̃| is the size of the minibatch, and η is our learning rate. We want to keep β small so that the resulting dynamics are governed by the user-controlled term and the sampling algorithm has a stationary distribution close to the target distribution. From Eq. (26), we see that there is no free lunch here: as the training size gets bigger, we can either set a small learning rate η = O(1/|D|) or use a bigger minibatch size |D̃|. In practice, choosing η = O(1/|D|) gives better numerical stability, since we also need to multiply ∇Ũ by η; a large η can cause divergence, especially when we are not close to the mode of the distribution. We note that the same discussion holds for SGLD (Welling & Teh, 2011).

In practice, we find that using a minibatch size of hundreds (e.g. |D̃| = 500) and fixing α to a small number (e.g. 0.01 or 0.1) works well. The learning rate can be set as η = γ/|D|, where γ is the "per-batch learning rate", usually set to 0.1 or 0.01. This method of setting parameters is also commonly used for SGD with momentum (Sutskever et al., 2013).
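As a worked example of these recommendations, the following sketch (assuming M = I, with illustrative defaults) converts a per-batch learning rate γ and momentum decay α into the quantities used in Alg. 2; the relations follow the definitions η = ε²M⁻¹, α = εM⁻¹C and β̂ = εM⁻¹B̂:

```python
import numpy as np

def sghmc_settings(gamma=0.1, alpha=0.01, n_train=50_000, V_hat=None):
    """Map the (gamma, alpha) parameterization back to Alg. 2 quantities."""
    eta = gamma / n_train                      # eta = gamma / |D|
    eps = np.sqrt(eta)                         # from eta = eps^2 M^{-1}, M = I
    C = alpha / eps                            # from alpha = eps M^{-1} C
    beta_hat = 0.0 if V_hat is None else 0.5 * eta * V_hat  # beta_hat = eta V_hat / 2
    return {"eta": eta, "alpha": alpha, "eps": eps, "C": C, "beta_hat": beta_hat}
```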
H. Experimental Setup

H.1. Bayesian Neural Network

The Bayesian neural network model used in Sec. 4.2 can be described by the following equation:

    P(y = i|x) ∝ exp(A_iᵀ σ(Bᵀx + b) + a_i).    (27)

Here, y ∈ {1, 2, ..., 10} is the output label of a digit. A ∈ R^{10×100} contains the weights for the output layer, and we use A_i to indicate the i-th column of A. B ∈ R^{d×100} contains the weights for the first layer. We also introduce a ∈ R^{10} and b ∈ R^{100} as bias terms in the model. In the MNIST dataset, the input dimension is d = 784. We place a Gaussian prior on the model parameters:

    P(A) ∝ exp(−λ_A‖A‖²),  P(B) ∝ exp(−λ_B‖B‖²),
    P(a) ∝ exp(−λ_a‖a‖²),  P(b) ∝ exp(−λ_b‖b‖²).

We further place gamma priors on each of the precision terms λ:

    λ_A, λ_B, λ_a, λ_b ∼ i.i.d. Γ(α, β).

We simply set α and β to 1, since the results are usually insensitive to these parameters. We generate samples from the posterior distribution

    P(Θ|D) ∝ Π_{(y,x)∈D} P(y|x, Θ) P(Θ),    (28)

where the parameter set Θ = {A, B, a, b, λ_A, λ_B, λ_a, λ_b}. The sampling procedure is carried out by alternating the following steps:

• Sample the weights from P(A, B, a, b|λ_A, λ_B, λ_a, λ_b, D) using SGHMC or SGLD with minibatches of 500 instances. Sample for 100 steps before updating the hyperparameters.

• Sample λ from P(λ_A, λ_B, λ_a, λ_b|A, B, a, b) using a Gibbs step. Note that the posterior for λ is a gamma distribution by conditional conjugacy.

We used the validation set to select parameters for the various methods we compare. Specifically, for SGD and SGLD, we tried step-sizes ε ∈ {0.1, 0.2, 0.4, 0.8} × 10⁻⁴, and the best settings were found to be ε = 0.1 × 10⁻⁴ for SGD and ε = 0.2 × 10⁻⁴ for SGLD. We then further tested ε = 0.16 × 10⁻⁴ and ε = 0.06 × 10⁻⁴ for SGD, and found ε = 0.16 × 10⁻⁴ gave the best result, thus we used this setting for SGD. For SGD with momentum and SGHMC, we fixed α = 0.01 and β̂ = 0, and tried η ∈ {0.1, 0.2, 0.4, 0.8} × 10⁻⁵. The best settings were η = 0.4 × 10⁻⁵ for SGD with momentum, and η = 0.2 × 10⁻⁵ for SGHMC. For the optimization-based methods, we tried regularizer λ ∈ {0, 0.1, 1, 10, 100}, and λ = 1 was found to give the best performance.
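For illustration, the minibatch potential energy implied by Eqs. (27)-(28) can be written as below; labels are assumed to be 0-indexed, the |D|/|D̃| rescaling of the log-likelihood is included, and this is a sketch of one plausible implementation rather than the code used in the experiments:

```python
import numpy as np

def U_minibatch(x, y, A, B, a, b, lam, n_total):
    """Potential energy of Eqs. (27)-(28) on a minibatch.
    x: (n, 784) inputs; y: (n,) integer labels in {0, ..., 9};
    lam = (lam_A, lam_B, lam_a, lam_b) are the current precision samples."""
    lam_A, lam_B, lam_a, lam_b = lam
    h = 1.0 / (1.0 + np.exp(-(x @ B + b)))            # sigmoid hidden layer
    logits = h @ A.T + a
    logits = logits - logits.max(axis=1, keepdims=True)   # stable softmax
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_lik = log_p[np.arange(len(y)), y].sum() * (n_total / len(y))  # |D|/|D~|
    log_prior = -(lam_A * (A ** 2).sum() + lam_B * (B ** 2).sum()
                  + lam_a * (a ** 2).sum() + lam_b * (b ** 2).sum())
    return -(log_lik + log_prior)
```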

H.2. Online Bayesian Probabilistic Matrix Factorization

The Bayesian probabilistic matrix factorization (BPMF) model used in Sec. 4.3 can be described as:

    λ_U, λ_V, λ_a, λ_b ∼ i.i.d. Gamma(1, 1),
    U_ki ∼ N(0, λ_U⁻¹),  V_kj ∼ N(0, λ_V⁻¹),
    a_i ∼ N(0, λ_a⁻¹),  b_j ∼ N(0, λ_b⁻¹),
    Y_ij | U, V ∼ N(U_iᵀV_j + a_i + b_j, τ⁻¹).    (29)

Here U_i ∈ R^d and V_j ∈ R^d are latent vectors for user i and movie j, while a_i and b_j are bias terms. We use a slightly simplified model compared to the BPMF model considered in (Salakhutdinov & Mnih, 2008a), in that we only place priors on the precision variables λ = {λ_U, λ_V, λ_a, λ_b}. However, the model still benefits from Bayesian inference by integrating over the uncertainty in the crucial regularization parameter λ. We generate samples from the posterior distribution

    P(Θ|Y) ∝ P(Y|Θ) P(Θ),    (30)

with the parameter set Θ = {U, V, a, b, λ_U, λ_V, λ_a, λ_b}. The sampling procedure is carried out by alternating the following steps:

• Sample the weights from P(U, V, a, b|λ_U, λ_V, λ_a, λ_b, Y) using SGHMC or SGLD with a minibatch size of 4,000 ratings. Sample for 2,000 steps before updating the hyperparameters.

• Sample λ from P(λ_U, λ_V, λ_a, λ_b|U, V, a, b) using a Gibbs step.

The training parameters for this experiment were directly selected using cross-validation. Specifically, for SGD and SGLD, we tried step-sizes ε ∈ {0.1, 0.2, 0.4, 0.8, 1.6} × 10⁻⁵, and the best settings were found to be ε = 0.4 × 10⁻⁵ for SGD and ε = 0.8 × 10⁻⁵ for SGLD. For SGD with momentum and SGHMC, we fixed α = 0.05 and β̂ = 0, and tried η ∈ {0.1, 0.2, 0.4, 0.8} × 10⁻⁶. The best settings were η = 0.4 × 10⁻⁶ for SGD with momentum, and η = 0.4 × 10⁻⁶ for SGHMC.
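Finally, a sketch of the corresponding minibatch gradient for the BPMF model of Eq. (29); the layout of the factor matrices and the sparse update pattern are assumptions of this illustration, not a description of the authors' code:

```python
import numpy as np

def bpmf_stochastic_grads(Umat, Vmat, a, b, batch, tau, lam, n_total):
    """Minibatch gradient of U(Theta) for the BPMF model of Eq. (29).
    batch: array of (i, j, y) rating triples; lam = (lam_U, lam_V, lam_a, lam_b);
    Umat, Vmat have shape (d, n_users) and (d, n_movies)."""
    lam_U, lam_V, lam_a, lam_b = lam
    gU, gV = lam_U * Umat, lam_V * Vmat            # gradients of -log prior
    ga, gb = lam_a * a, lam_b * b
    scale = n_total / len(batch)                   # |D| / |D~| rescaling
    for i, j, y in batch:
        i, j = int(i), int(j)
        err = tau * (Umat[:, i] @ Vmat[:, j] + a[i] + b[j] - y)
        gU[:, i] += scale * err * Vmat[:, j]       # -d log-lik / d U_i
        gV[:, j] += scale * err * Umat[:, i]
        ga[i] += scale * err
        gb[j] += scale * err
    return gU, gV, ga, gb
```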