
Stochastic Gradient Hamiltonian Monte Carlo

Tianqi Chen [email protected]
Emily B. Fox [email protected]
Carlos Guestrin [email protected]
MODE Lab, University of Washington, Seattle, WA.

Abstract

Hamiltonian Monte Carlo (HMC) sampling methods provide a mechanism for defining distant proposals with high acceptance probabilities in a Metropolis-Hastings framework, enabling more efficient exploration of the state space than standard random-walk proposals. The popularity of such methods has grown significantly in recent years. However, a limitation of HMC methods is the required gradient computation for simulation of the Hamiltonian dynamical system—such computation is infeasible in problems involving a large sample size or streaming data. Instead, we must rely on a noisy gradient estimate computed from a subset of the data. In this paper, we explore the properties of such a stochastic gradient HMC approach. Surprisingly, the natural implementation of the stochastic approximation can be arbitrarily bad. To address this problem we introduce a variant that uses second-order Langevin dynamics with a friction term that counteracts the effects of the noisy gradient, maintaining the desired target distribution as the invariant distribution. Results on simulated data validate our theory. We also provide an application of our methods to a classification task using neural networks and to online Bayesian matrix factorization.

1. Introduction

Hamiltonian Monte Carlo (HMC) (Duane et al., 1987; Neal, 2010) sampling methods provide a powerful Markov chain Monte Carlo (MCMC) sampling algorithm. The methods define a Hamiltonian function in terms of the target distribution from which we desire samples—the potential energy—and a kinetic energy term parameterized by a set of "momentum" auxiliary variables. Based on simple updates to the momentum variables, one simulates from a Hamiltonian dynamical system that enables proposals of distant states. The target distribution is invariant under these dynamics; in practice, a discretization of the continuous-time system is needed, necessitating a Metropolis-Hastings (MH) correction, though still with high acceptance probability. Based on the attractive properties of HMC in terms of rapid exploration of the state space, HMC methods have grown in popularity recently (Neal, 2010; Hoffman & Gelman, 2011; Wang et al., 2013).

A limitation of HMC, however, is the necessity to compute the gradient of the potential energy function in order to simulate the Hamiltonian dynamical system. We are increasingly faced with datasets having millions to billions of observations, or where data come in as a stream and we need to make inferences online, such as in online advertising or recommender systems. In these ever-more-common scenarios of massive batch or streaming data, such gradient computations are infeasible since they utilize the entire dataset, and thus are not applicable to "big data" problems. Recently, in a variety of machine learning algorithms, we have witnessed the many successes of utilizing a noisy estimate of the gradient based on a minibatch of data to scale the algorithms (Robbins & Monro, 1951; Hoffman et al., 2013; Welling & Teh, 2011). A majority of these developments have been in optimization-based algorithms (Robbins & Monro, 1951; Nemirovski et al., 2009), and a question is whether similar efficiencies can be garnered by sampling-based algorithms that maintain many desirable theoretical properties for Bayesian inference. One attempt at applying such methods in a sampling context is the recently proposed stochastic gradient Langevin dynamics (SGLD) (Welling & Teh, 2011; Ahn et al., 2012; Patterson & Teh, 2013). This method builds on first-order Langevin dynamics that do not include the crucial momentum term of HMC.

In this paper, we explore the possibility of marrying the efficiencies in state space exploration of HMC with the big-data computational efficiencies of stochastic gradients. Such an algorithm would enable a large-scale and online

Bayesian sampling algorithm with the potential to rapidly explore the posterior. As a first cut, we consider simply applying a stochastic gradient modification to HMC and assess the impact of the noisy gradient. We prove that the noise injected in the system by the stochastic gradient no longer leads to Hamiltonian dynamics with the desired target distribution as the stationary distribution. As such, even before discretizing the dynamical system, we need to correct for this effect. One can correct for the injected gradient noise through an MH step, though this itself requires costly computations on the entire dataset. In practice, one might propose long simulation runs before an MH correction, but this leads to low acceptance rates due to large deviations in the Hamiltonian from the injected noise. The efficiency of this MH step could potentially be improved using the recent results of Korattikara et al. (2014) and Bardenet et al. (2014). In this paper, we instead introduce a stochastic gradient HMC method with friction added to the momentum update. We assume the injected noise is Gaussian, appealing to the central limit theorem, and analyze the corresponding dynamics. We show that using such second-order Langevin dynamics enables us to maintain the desired target distribution as the stationary distribution. That is, the friction counteracts the effects of the injected noise. For discretized systems, we consider letting the step size ε tend to zero so that an MH step is not needed, giving us a significant computational advantage. Empirically, we demonstrate that we have good performance even for ε set to a small, fixed value. The theoretical computation versus accuracy tradeoff of this small-ε approach is provided in the Supplementary Material.

A number of simulated experiments validate our theoretical results and demonstrate the differences between (i) exact HMC, (ii) the naïve implementation of stochastic gradient HMC (simply replacing the gradient with a stochastic gradient), and (iii) our proposed method incorporating friction. We also compare to the first-order Langevin dynamics of SGLD. Finally, we apply our proposed methods to a classification task using Bayesian neural networks and to online Bayesian matrix factorization of a standard movie dataset. Our experimental results demonstrate the effectiveness of the proposed algorithm.

2. Hamiltonian Monte Carlo

Suppose we want to sample from the posterior distribution of θ given a set of independent observations x ∈ D:

    p(θ|D) ∝ exp(−U(θ)),    (1)

where the potential energy function U is given by

    U = −∑_{x∈D} log p(x|θ) − log p(θ).    (2)

Hamiltonian (Hybrid) Monte Carlo (HMC) (Duane et al., 1987; Neal, 2010) provides a method for proposing samples of θ in a Metropolis-Hastings (MH) framework that efficiently explores the state space as compared to standard random-walk proposals. These proposals are generated from a Hamiltonian system based on introducing a set of auxiliary momentum variables, r. That is, to sample from p(θ|D), HMC considers generating samples from a joint distribution of (θ, r) defined by

    π(θ, r) ∝ exp(−U(θ) − (1/2) rᵀM⁻¹r).    (3)

If we simply discard the resulting r samples, the θ samples have marginal distribution p(θ|D). Here, M is a mass matrix that, together with r, defines a kinetic energy term. M is often set to the identity matrix, I, but can be used to precondition the sampler when we have more information about the target distribution. The Hamiltonian function is defined by H(θ, r) = U(θ) + (1/2) rᵀM⁻¹r. Intuitively, H measures the total energy of a physical system with position variables θ and momentum variables r.

To propose samples, HMC simulates the Hamiltonian dynamics

    dθ = M⁻¹r dt
    dr = −∇U(θ) dt.    (4)

To make Eq. (4) concrete, a common analogy in 2D is as follows (Neal, 2010). Imagine a hockey puck sliding over a frictionless ice surface of varying height. The potential energy term is based on the height of the surface at the current puck position, θ, while the kinetic energy is based on the momentum of the puck, r, and its mass, M. If the surface is flat (∇U(θ) = 0, ∀θ), the puck moves at a constant velocity. For positive slopes (∇U(θ) > 0), the kinetic energy decreases as the potential energy increases until the kinetic energy is 0 (r = 0). The puck then slides back down the hill, increasing its kinetic energy and decreasing potential energy. Recall that in HMC, the position variables are those of direct interest, whereas the momentum variables are artificial constructs (auxiliary variables).

Over any interval s, the Hamiltonian dynamics of Eq. (4) define a mapping from the state at time t to the state at time t + s. Importantly, this mapping is reversible, which is important in showing that the dynamics leave π invariant. Likewise, the dynamics preserve the total energy, H, so proposals are always accepted. In practice, however, we usually cannot simulate exactly from the continuous system of Eq. (4) and instead consider a discretized system. One common approach is the "leapfrog" method, which is outlined in Alg. 1. Because of inaccuracies introduced through the discretization, an MH step must be implemented (i.e., the acceptance rate is no longer 1). However, acceptance rates still tend to be high even for proposals that can be quite far from their last state.

Algorithm 1: Hamiltonian Monte Carlo

    Input: Starting position θ^(1) and step size ε
    for t = 1, 2, ... do
        Resample momentum r:
            r^(t) ∼ N(0, M)
            (θ_0, r_0) = (θ^(t), r^(t))
        Simulate discretization of Hamiltonian dynamics in Eq. (4):
            r_0 ← r_0 − (ε/2)∇U(θ_0)
            for i = 1 to m do
                θ_i ← θ_{i−1} + εM⁻¹r_{i−1}
                r_i ← r_{i−1} − ε∇U(θ_i)
            end
            r_m ← r_m − (ε/2)∇U(θ_m)
            (θ̂, r̂) = (θ_m, r_m)
        Metropolis-Hastings correction:
            u ∼ Uniform[0, 1]
            ρ = e^{H(θ^(t), r^(t)) − H(θ̂, r̂)}
            if u < min(1, ρ), then θ^(t+1) = θ̂
    end
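To make the leapfrog loop concrete, the following is a minimal numpy sketch of one iteration of Alg. 1 with a diagonal mass matrix. The callable interface for U and ∇U, and the half-step handling of the final momentum update, are our own illustrative choices rather than part of the paper.

```python
import numpy as np

def hmc_step(theta, U, grad_U, eps, m, M_diag, rng):
    """One iteration of Alg. 1 with a diagonal mass matrix M (sketch)."""
    r = rng.normal(0.0, np.sqrt(M_diag))            # resample momentum, r ~ N(0, M)
    theta_new, r_new = theta.copy(), r.copy()

    # Leapfrog discretization of the Hamiltonian dynamics in Eq. (4).
    r_new = r_new - 0.5 * eps * grad_U(theta_new)
    for i in range(m):
        theta_new = theta_new + eps * r_new / M_diag       # theta <- theta + eps M^{-1} r
        step = eps if i < m - 1 else 0.5 * eps             # final momentum update is a half step
        r_new = r_new - step * grad_U(theta_new)

    # Metropolis-Hastings correction with H(theta, r) = U(theta) + r' M^{-1} r / 2.
    H = lambda th, rr: U(th) + 0.5 * np.sum(rr ** 2 / M_diag)
    if rng.uniform() < min(1.0, np.exp(H(theta, r) - H(theta_new, r_new))):
        return theta_new                                   # accept the distant proposal
    return theta                                           # reject and keep current state
```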
Define the ∇U(θ) using Eq. (2), which requires examination of the R entropy of pt as h(pt) = − θ,r f(pt(θ, r))dθdr, where entire dataset D, we consider a noisy estimate based on a f(x) = x ln x. Assume p is a distribution with density ˜ t minibatch D sampled uniformly at random from D: and gradient vanishing at infinity. Furthermore, assume the gradient vanishes faster than 1 . Then, the entropy of ln pt |D| X pt increases over time with rate ∇U˜(θ) = − ∇ log p(x|θ) − ∇ log p(θ), D˜ ⊂ D. ˜ |D| ˜ x∈D ∂th(pt(θ, r)) = (5) Z 00 T We assume that our observations x are independent and, f (pt)(∇rpt(θ, r)) B(θ)∇rpt(θ, r)dθdr. (8) appealing to the central limit theorem, approximate this θ,r Stochastic Gradient Hamiltonian Monte Carlo

Eq. (8) implies that ∂_t h(p_t(θ, r)) ≥ 0, since B(θ) is a positive semi-definite matrix.

Intuitively, Theorem 3.1 is true because the noise-free Hamiltonian dynamics preserve entropy, while the additional noise term strictly increases entropy if we assume (i) B(θ) is positive definite (a reasonable assumption due to the normal full rank property of Fisher information) and (ii) ∇_r p_t(θ, r) ≠ 0 for all t. Then, jointly, the entropy strictly increases over time. This hints at the fact that the distribution p_t tends toward a uniform distribution, which can be very far from the target distribution π.

Corollary 3.1. The distribution π(θ, r) ∝ exp(−H(θ, r)) is no longer invariant under the dynamics in Eq. (7).

The proofs of Theorem 3.1 and Corollary 3.1 are in the Supplementary Material.

Because π is no longer invariant under the dynamics of Eq. (7), we must introduce a correction step even before considering errors introduced by the discretization of the dynamical system. For the correctness of an MH step (based on the entire dataset), we appeal to the same arguments made for the HMC data-splitting technique of Neal (2010). This approach likewise considers minibatches of data, simulating the (continuous) Hamiltonian dynamics on each batch sequentially. Importantly, Neal (2010) alludes to the fact that the resulting H from the split-data scenario may be far from that of the full-data scenario after simulation, which leads to lower acceptance rates and thereby reduces the apparent computational gains in simulation. Empirically, as we demonstrate in Fig. 2, we see that even finite-length simulations from the noisy system can diverge quite substantially from those of the noise-free system. Although the minibatch-based HMC technique considered herein is slightly different from that of Neal (2010), the theory we have developed in Theorem 3.1 surrounding the high-entropy properties of the resulting invariant distribution of Eq. (7) provides some intuition for the observed deviations in H, both in our experiments and those of Neal (2010).

The poorly behaved properties of the trajectory of H based on simulations using noisy gradients result in a complex computation versus efficiency tradeoff. On one hand, it is extremely computationally intensive in large datasets to insert an MH step after just short simulation runs (where deviations in H are less pronounced and acceptance rates should be reasonable). Each of these MH steps requires a costly computation using all of the data, thus defeating the computational gains of considering noisy gradients. On the other hand, long simulation runs between MH steps can lead to very low acceptance rates. Each rejection corresponds to a wasted (noisy) gradient computation and simulation using the proposed variant of Alg. 1. One possible direction of future research is to consider using the recent results of Korattikara et al. (2014) and Bardenet et al. (2014), which show that it is possible to do MH using a subset of data. However, we instead consider in Sec. 3.2 a straightforward modification to the Hamiltonian dynamics that alleviates the issues of the noise introduced by stochastic gradients. In particular, our modification allows us to again achieve the desired π as the invariant distribution of the continuous Hamiltonian dynamical system.

3.2. Stochastic Gradient HMC with Friction

In Sec. 3.1, we showed that HMC with stochastic gradients requires a frequent costly MH correction step, or alternatively, long simulation runs with low acceptance probabilities. Ideally, instead, we would like to minimize the effect of the injected noise on the dynamics themselves to alleviate these problems. To this end, we consider a modification to Eq. (7) that adds a "friction" term to the momentum update:

    dθ = M⁻¹r dt
    dr = −∇U(θ) dt − BM⁻¹r dt + N(0, 2B dt).    (9)

Here and throughout the remainder of the paper, we omit the dependence of B on θ for simplicity of notation. Let us again make a hockey analogy. Imagine we are now playing street hockey instead of ice hockey, which introduces friction from the asphalt. There is still a random wind blowing; however, the friction of the surface prevents the puck from running far away. That is, the friction term BM⁻¹r helps decrease the energy H(θ, r), thus reducing the influence of the noise. This type of dynamical system is commonly referred to as second-order Langevin dynamics in physics (Wang & Uhlenbeck, 1945). Importantly, we note that the Langevin dynamics used in SGLD (Welling & Teh, 2011) are first-order, which can be viewed as a limiting case of our second-order dynamics when the friction term is large. Further details on this comparison follow at the end of this section.
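The stabilizing role of the friction term is easy to see in a small simulation. The sketch below contrasts ε-discretizations of the noisy dynamics of Eq. (7) and the friction dynamics of Eq. (9) on the quadratic example of Fig. 2, with U(θ) = (1/2)θ², gradient noise N(0, 4), ε = 0.1, and M = I; the friction constant C = 1 and the initial state are assumed values for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
eps, n_steps, V = 0.1, 15000, 4.0      # step size, steps, gradient-noise variance (Fig. 2)
B = 0.5 * eps * V                      # scalar diffusion B = eps V / 2 from Sec. 3.1 (M = I)

def simulate(C):
    """eps-discretization of Eq. (9) with friction C; C = 0 recovers Eq. (7).

    The noisy gradient itself already injects N(0, 2 B eps) per step, so, as in
    Alg. 2, only the remainder 2 (C - B) eps is added explicitly."""
    theta, r = 0.0, 1.0                # assumed initial state
    traj = np.empty((n_steps, 2))
    for i in range(n_steps):
        theta = theta + eps * r
        grad = theta + np.sqrt(V) * rng.normal()    # noisy gradient of U = theta^2 / 2
        r = (r - eps * grad - eps * C * r
             + np.sqrt(max(2.0 * (C - B) * eps, 0.0)) * rng.normal())
        traj[i] = theta, r
    return traj

print(np.abs(simulate(0.0)[:, 0]).max())   # no friction: |theta| drifts to large values
print(np.abs(simulate(1.0)[:, 0]).max())   # friction C = 1 (assumed): stays bounded
```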

Theorem 3.2. π(θ, r) ∝ exp(−H(θ, r)) is the unique stationary distribution of the dynamics described by Eq. (9).

Proof. Let G = [0, −I; I, 0] and D = [0, 0; 0, B], where G is an anti-symmetric matrix and D is the symmetric (diffusion) matrix. Eq. (9) can be written in the following decomposed form (Yin & Ao, 2006; Shi et al., 2012):

    d[θ; r] = −[0, −I; I, B] [∇U(θ); M⁻¹r] dt + N(0, 2D dt)
            = −[D + G] ∇H(θ, r) dt + N(0, 2D dt).

The distribution evolution under this dynamical system is governed by a Fokker-Planck equation

    ∂_t p_t(θ, r) = ∇ᵀ{[D + G][p_t(θ, r)∇H(θ, r) + ∇p_t(θ, r)]}.    (10)

See the Supplementary Material for details. We can verify that π(θ, r) is invariant under Eq. (10) by calculating [e^{−H(θ,r)}∇H(θ, r) + ∇e^{−H(θ,r)}] = 0. Furthermore, due to the existence of diffusion noise, π is the unique stationary distribution of Eq. (10).

In summary, we have shown that the dynamics given by Eq. (9) have a similar invariance property to that of the original Hamiltonian dynamics of Eq. (4), even with noise present. The key was to introduce a friction term using second-order Langevin dynamics. Our revised momentum update can also be viewed as akin to partial momentum refreshment (Horowitz, 1991; Neal, 1993), which also corresponds to second-order Langevin dynamics. Such partial momentum refreshment was shown to not greatly improve HMC in the case of noise-free gradients (Neal, 2010). However, as we have demonstrated, the idea is crucial in our stochastic gradient scenario in order to counterbalance the effect of the noisy gradients. We refer to the resulting method as stochastic gradient HMC (SGHMC).

CONNECTION TO FIRST-ORDER LANGEVIN DYNAMICS

As we previously discussed, the dynamics introduced in Eq. (9) relate to the first-order Langevin dynamics used in SGLD (Welling & Teh, 2011). In particular, the dynamics of SGLD can be viewed as second-order Langevin dynamics with a large friction term. To intuitively demonstrate this connection, let BM⁻¹ = 1/dt in Eq. (9). Because the friction and momentum noise terms are very large, the momentum variable r changes much faster than θ. Thus, relative to the rapidly changing momentum, θ can be considered as fixed. We can study this case as simply:

    dr = −∇U(θ) dt − BM⁻¹r dt + N(0, 2B dt).    (11)

The fast evolution of r leads to a rapid convergence to the stationary distribution of Eq. (11), which is given by N(−MB⁻¹∇U(θ), M). Let us now consider a change in θ, with r ∼ N(−MB⁻¹∇U(θ), M). Recalling BM⁻¹ = 1/dt, we have

    dθ = −M⁻¹∇U(θ) dt² + N(0, 2M⁻¹dt²),    (12)

which exactly aligns with the dynamics of SGLD, where M⁻¹ serves as the preconditioning matrix (Welling & Teh, 2011). Intuitively, this means that when the friction is large, the dynamics do not depend on the decaying series of past gradients represented by dr, reducing to first-order Langevin dynamics.

3.3. Stochastic Gradient HMC in Practice

In everything we have considered so far, we have assumed that we know the noise model B. Clearly, in practice this is not the case. Imagine instead that we simply have an estimate B̂. As will become clear, it is beneficial to instead introduce a user-specified friction term C ⪰ B̂ and consider the following dynamics:

    dθ = M⁻¹r dt
    dr = −∇U(θ) dt − CM⁻¹r dt + N(0, 2(C − B̂) dt) + N(0, 2B dt).    (13)

The resulting SGHMC algorithm is shown in Alg. 2:

Algorithm 2: Stochastic Gradient HMC

    for t = 1, 2, ... do
        optionally, resample momentum r as r^(t) ∼ N(0, M)
        (θ_0, r_0) = (θ^(t), r^(t))
        simulate dynamics in Eq. (13):
        for i = 1 to m do
            θ_i ← θ_{i−1} + ε_t M⁻¹r_{i−1}
            r_i ← r_{i−1} − ε_t∇Ũ(θ_i) − ε_t CM⁻¹r_{i−1} + N(0, 2(C − B̂)ε_t)
        end
        (θ^(t+1), r^(t+1)) = (θ_m, r_m), no M-H step
    end

Note that the algorithm is purely in terms of user-specified or computable quantities. To understand our choice of dynamics, we begin with the unrealistic scenario of perfect estimation of B.

Proposition 3.1. If B̂ = B, then the dynamics of Eq. (13) yield the stationary distribution π(θ, r) ∝ e^{−H(θ,r)}.

Proof. The momentum update simplifies to dr = −∇U(θ) dt − CM⁻¹r dt + N(0, 2C dt), with friction term CM⁻¹ and noise term N(0, 2C dt). Noting that the proof of Theorem 3.2 only relied on a matching of noise and friction, the result follows directly by using C in place of B in Theorem 3.2.

Now consider the benefit of introducing the C terms and revised dynamics in the more realistic scenario of inaccurate estimation of B. For example, the simplest choice is B̂ = 0. Though the true stochastic gradient noise B is clearly non-zero, as the step size ε → 0, B = (1/2)εV goes to 0 and C dominates. That is, the dynamics are again governed by the controllable injected noise N(0, 2C dt) and friction CM⁻¹. It is also possible to set B̂ = (1/2)εV̂, where V̂ is estimated using empirical Fisher information as in (Ahn et al., 2012) for SGLD.
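Below is a minimal sketch of Alg. 2 for scalar C and B̂ with a diagonal mass matrix, including a minibatch estimate of ∇Ũ(θ) in the style of Eq. (5). The callback interface for the prior and per-observation likelihood gradients is our own assumption, not an API from the paper.

```python
import numpy as np

def stochastic_grad_U(theta, data, batch_size, grad_log_prior, grad_log_lik, rng):
    """Minibatch estimate of grad U per Eq. (5): rescale the minibatch sum by |D|/|D~|."""
    batch = data[rng.choice(len(data), size=batch_size, replace=False)]
    scale = len(data) / batch_size
    return -scale * grad_log_lik(theta, batch).sum(axis=0) - grad_log_prior(theta)

def sghmc_iteration(theta, data, eps, m, C, B_hat, batch_size,
                    grad_log_prior, grad_log_lik, rng, M_diag):
    """One outer iteration of Alg. 2: m noisy leapfrog-style steps, no MH correction."""
    r = rng.normal(0.0, np.sqrt(M_diag))            # (optionally) resample momentum
    noise_sd = np.sqrt(2.0 * (C - B_hat) * eps)     # requires the user choice C >= B_hat
    for _ in range(m):
        theta = theta + eps * r / M_diag
        g = stochastic_grad_U(theta, data, batch_size, grad_log_prior, grad_log_lik, rng)
        r = r - eps * g - eps * C * r / M_diag + noise_sd * rng.normal(size=r.shape)
    return theta, r                                 # (theta^(t+1), r^(t+1))
```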

COMPUTATIONAL COMPLEXITY

The complexity of Alg. 2 depends on the choice of M, C, and B̂, and on the complexity of estimating ∇Ũ(θ)—denoted as g(|D̃|, d)—where d is the dimension of the parameter space. Assume we allow B̂ to be an arbitrary d × d positive definite matrix. Using empirical Fisher information estimation of B̂, the per-iteration complexity of this estimation step is O(d²|D̃|). Then, the time complexity for the (θ, r) update is O(d³), because the update is dominated by generating Gaussian noise with a full covariance matrix. In total, the per-iteration time complexity is O(d²|D̃| + d³ + g(|D̃|, d)). In practice, we restrict all of the matrices to be diagonal when d is large, resulting in time complexity O(d|D̃| + d + g(|D̃|, d)). Importantly, we note that our SGHMC time complexity is the same as that of SGLD (Welling & Teh, 2011; Ahn et al., 2012) in both parameter settings.

In practice, we must assume inaccurate estimation of B. For a decaying series of step sizes ε_t, an MH step is not required (Welling & Teh, 2011; Ahn et al., 2012)¹. However, as the step size decreases, the efficiency of the sampler likewise decreases since proposals are increasingly close to their initial value. In practice, we may want to tolerate some errors in the sampling accuracy to gain efficiency. As in (Welling & Teh, 2011; Ahn et al., 2012) for SGLD, we consider using a small, non-zero ε leading to some bias. We explore an analysis of the errors introduced by such finite-ε approximations in the Supplementary Material.

¹We note that, just as in SGLD, an MH correction is not even possible because we cannot compute the probability of the reverse dynamics.

CONNECTION TO SGD WITH MOMENTUM

Adding a momentum term to stochastic gradient descent (SGD) is common practice. In concept, there is a clear relationship between SGD with momentum and SGHMC, and here we formalize this connection. Letting v = εM⁻¹r, we first rewrite the update rule in Alg. 2 as

    ∆θ = v
    ∆v = −ε²M⁻¹∇Ũ(θ) − εM⁻¹Cv + N(0, 2ε³M⁻¹(C − B̂)M⁻¹).    (14)

Define η = ε²M⁻¹, α = εM⁻¹C, and β̂ = εM⁻¹B̂. The update rule becomes

    ∆θ = v
    ∆v = −η∇Ũ(θ) − αv + N(0, 2(α − β̂)η).    (15)

Comparing to an SGD with momentum method, it is clear from Eq. (15) that η corresponds to the learning rate and 1 − α to the momentum term. When the noise is removed (via C = B̂ = 0), SGHMC naturally reduces to a stochastic gradient method with momentum. We can use the equivalent update rule of Eq. (15) to run SGHMC, and borrow experience from parameter settings of SGD with momentum to guide our choices of SGHMC settings. For example, we can set α to a fixed small number (e.g., 0.01 or 0.1), select the learning rate η, and then fix β̂ = ηV̂/2. A more sophisticated strategy involves using momentum scheduling (Sutskever et al., 2013). We elaborate upon how to select these parameters in the Supplementary Material.

4. Experiments

4.1. Simulated Scenarios

To empirically explore the behavior of HMC using exact gradients relative to stochastic gradients, we conduct experiments on a simulated setup. As a baseline, we consider the standard HMC implementation of Alg. 1, both with and without the MH correction. We then compare to HMC with stochastic gradients, replacing ∇U in Alg. 1 with ∇Ũ, and consider this proposal with and without an MH correction. Finally, we compare to our proposed SGHMC, which does not use an MH correction.

Figure 1. Empirical distributions associated with various sampling algorithms relative to the true target distribution with U(θ) = −2θ² + θ⁴. We compare the HMC method of Alg. 1 with and without the MH step to: (i) a naive variant that replaces the gradient with a stochastic gradient, again with and without an MH correction; (ii) the proposed SGHMC method, which does not use an MH correction. We use ∇Ũ(θ) = ∇U(θ) + N(0, 4) in the stochastic gradient based samplers and ε = 0.1 in all cases. Momentum is resampled every 50 steps in all variants of HMC.

Fig. 1 shows the empirical distributions generated by the different sampling algorithms. We see that even without an MH correction, both the HMC and SGHMC algorithms provide results close to the true distribution, implying that any errors from considering non-zero ε are negligible. On the other hand, the results of naïve stochastic gradient HMC diverge significantly from the truth unless an MH correction is added. These findings validate our theoretical results; that is, both standard HMC and SGHMC maintain π as the invariant distribution as ε → 0, whereas naïve stochastic gradient HMC does not, though this can be corrected for using a (costly) MH step.
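In the parameterization of Eq. (15), SGHMC amounts to a small change to an SGD-with-momentum loop. The sketch below runs this update on the double-well target of Fig. 1, U(θ) = −2θ² + θ⁴ with gradient noise N(0, 4); the settings η = 0.01, α = 0.1, β̂ = 0 follow the rule of thumb above, and the run length and thinning are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
grad_U = lambda th: -4.0 * th + 4.0 * th ** 3            # U(theta) = -2 theta^2 + theta^4
noisy_grad = lambda th: grad_U(th) + 2.0 * rng.normal()  # gradient noise N(0, 4), as in Fig. 1

eta, alpha, beta_hat = 0.01, 0.1, 0.0   # learning rate, momentum decay, B_hat = 0 (assumed)
theta, v, samples = 0.0, 0.0, []
for step in range(200_000):
    theta += v                                           # Delta theta = v
    v += (-eta * noisy_grad(theta) - alpha * v
          + np.sqrt(2.0 * (alpha - beta_hat) * eta) * rng.normal())   # Eq. (15)
    if step % 50 == 0:
        samples.append(theta)                            # thinning is an arbitrary choice
# A histogram of `samples` should concentrate around the two modes at theta = +/- 1.
```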

Figure 2. Points (θ, r) simulated from discretizations of various Hamiltonian dynamics over 15,000 steps using U(θ) = (1/2)θ² and ε = 0.1. For the noisy scenarios, we replace the gradient by ∇Ũ(θ) = θ + N(0, 4). Trajectories shown: noisy Hamiltonian dynamics; noisy Hamiltonian dynamics with r resampled every 50 steps; noisy Hamiltonian dynamics with friction; and the noise-free Hamiltonian dynamics. We see that noisy Hamiltonian dynamics lead to diverging trajectories when friction is not introduced. Resampling r helps control divergence, but the associated HMC stationary distribution is not correct, as illustrated in Fig. 1.

We also consider simply simulating from the discretized Hamiltonian dynamical systems associated with the various samplers compared. In Fig. 2, we compare the resulting trajectories and see that the path of (θ, r) from the noisy system without friction diverges significantly. The modification of the dynamical system by adding friction (corresponding to SGHMC) corrects this behavior. We can also correct for this divergence through periodic resampling of the momentum, though as we saw in Fig. 1, the corresponding MCMC algorithm ("Naive stochastic gradient HMC (no MH)") does not yield the correct target distribution. These results confirm the importance of the friction term in maintaining a well-behaved Hamiltonian and leading to the correct stationary distribution.

It is known that a benefit of HMC over many other MCMC algorithms is the efficiency in sampling from correlated distributions (Neal, 2010)—this is where the introduction of the momentum variable shines. SGHMC inherits this property. Fig. 3 compares SGHMC and SGLD (Welling & Teh, 2011) when sampling from a bivariate Gaussian with positive correlation. For each method, we examine five different settings of the initial step size on a linearly decreasing scale and generate ten million samples. For each of these sets of samples (one set per step-size setting), we calculate the autocorrelation time² of the samples and the average absolute error of the resulting sample covariance. Fig. 3(a) shows the autocorrelation versus estimation error for the five settings. As we decrease the step size, SGLD has reasonably low estimation error but high autocorrelation time, indicating an inefficient sampler. In contrast, SGHMC achieves even lower estimation error at very low autocorrelation times, from which we conclude that the sampler is indeed efficiently exploring the distribution. Fig. 3(b) shows the first 50 samples generated by the two samplers. We see that SGLD's random-walk behavior makes it challenging to explore the tails of the distribution. The momentum variable associated with SGHMC instead drives the sampler to move along the distribution contours.

²Autocorrelation time is defined as 1 + ∑_{s=1}^∞ ρ_s, where ρ_s is the autocorrelation at lag s.

Figure 3. Contrasting sampling of a bivariate Gaussian with correlation using SGHMC versus SGLD. Here, U(θ) = (1/2)θᵀΣ⁻¹θ, ∇Ũ(θ) = Σ⁻¹θ + N(0, I), with Σ₁₁ = Σ₂₂ = 1 and correlation ρ = Σ₁₂ = 0.9. Left: Mean absolute error of the covariance estimation using ten million samples versus autocorrelation time of the samples as a function of 5 step size settings. Right: First 50 samples of SGHMC and SGLD.
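For reference, the autocorrelation time of footnote 2 can be estimated from a sample path as follows; truncating the sum at the first non-positive estimated autocorrelation is a common heuristic that we assume here, not a choice specified in the paper.

```python
import numpy as np

def autocorrelation_time(x):
    """Estimate 1 + sum_{s >= 1} rho_s (footnote 2) from a 1-D sample path,
    truncating at the first lag whose estimated autocorrelation is non-positive."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    var = np.dot(x, x) / len(x)
    tau = 1.0
    for s in range(1, len(x) // 2):
        rho = np.dot(x[:-s], x[s:]) / (len(x) * var)   # biased estimator of rho_s
        if rho <= 0.0:
            break
        tau += rho
    return tau
```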

4.2. Bayesian Neural Networks for Classification

We also test our method on a handwritten digits classification task using the MNIST dataset³. The dataset consists of 60,000 training instances and 10,000 test instances. We randomly split a validation set containing 10,000 instances from the training data in order to select training parameters, and use the remaining 50,000 instances for training. For classification, we consider a two-layer Bayesian neural network with 100 hidden variables using a sigmoid unit and an output layer using softmax. We tested four methods: SGD, SGD with momentum, SGLD, and SGHMC. For the optimization-based methods, we use the validation set to select the optimal regularizer λ of network weights⁴. For the sampling-based methods, we take a fully Bayesian approach and place a weakly informative gamma prior on each layer's weight regularizer λ. The sampling procedure is carried out by running SGHMC and SGLD using minibatches of 500 training instances, then resampling hyperparameters after an entire pass over the training set. We run the samplers for 800 iterations (each over the entire training dataset) and discard the initial 50 samples as burn-in.

³http://yann.lecun.com/exdb/mnist/
⁴We also tried MAP inference for selecting λ in the optimization-based method, but found similar performance.

The test error as a function of MCMC or optimization iteration (after burn-in) is reported for each of these methods in Fig. 4. From the results, we see that SGD with momentum converges faster than SGD. SGHMC also has an advantage over SGLD, converging to a low test error much more rapidly. In terms of runtime, in this case the gradient computation used in backpropagation dominates, so both methods have the same computational cost.
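To illustrate the hyperparameter resampling step, suppose each layer's weights have prior N(0, λ⁻¹I) with a Gamma(a, b) hyperprior on the precision λ; this is one concrete reading of the "weakly informative gamma prior" above (the paper does not spell out the parameterization). Under that assumption the Gibbs update is conjugate:

```python
import numpy as np

def resample_precision(weights, a=1.0, b=1.0, rng=None):
    """Gibbs update for a layer's weight precision lambda (assumed model).

    Assumes w_i ~ N(0, 1/lambda) i.i.d. and lambda ~ Gamma(a, b) in shape/rate form,
    giving the conjugate posterior Gamma(a + n/2, b + ||w||^2 / 2)."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.ravel(weights)
    shape = a + 0.5 * w.size
    rate = b + 0.5 * np.dot(w, w)
    return rng.gamma(shape, 1.0 / rate)   # numpy's gamma takes scale = 1 / rate
```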

The final results of the sampling-based methods are better than those of the optimization-based methods, showing an advantage to Bayesian inference in this setting and thus validating the need for scalable and efficient Bayesian inference algorithms such as SGHMC.

Figure 4. Convergence of test error on the MNIST dataset using SGD, SGD with momentum, SGLD, and SGHMC to infer model parameters of a Bayesian neural net.

4.3. Online Bayesian Probabilistic Matrix Factorization for Movie Recommendations

Collaborative filtering is an important problem in web applications. The task is to predict a user's preference over a set of items (e.g., movies, music) and produce recommendations. Probabilistic matrix factorization (PMF) (Salakhutdinov & Mnih, 2008b) has proven effective for this task. Due to the sparsity in the ratings matrix (users versus items) in recommender systems, over-fitting is a severe issue, with Bayesian approaches providing a natural solution (Salakhutdinov & Mnih, 2008a).

We conduct an experiment in online Bayesian PMF on the Movielens dataset ml-1M⁵. The dataset contains about 1 million ratings of 3,952 movies by 6,040 users. The number of latent dimensions is set to 20. In comparing our stochastic-gradient-based approaches, we use minibatches of 4,000 ratings to update the user and item latent matrices. We choose a significantly larger minibatch size in this application than that of the neural net because of the dramatically larger parameter space associated with the PMF model. For the optimization-based approaches, the hyperparameters are set using cross validation (again, we did not see a performance difference from considering MAP estimation). For the sampling-based approaches, the hyperparameters are updated using a Gibbs step after every 2,000 steps of sampling model parameters. We run the sampler to generate 2,000,000 samples, with the first 100,000 samples discarded as burn-in. We use five-fold cross validation to evaluate the performance of the different methods.

⁵http://grouplens.org/datasets/movielens/

Table 1. Predictive RMSE estimated using 5-fold cross validation on the Movielens dataset for various approaches of inferring parameters of a Bayesian probabilistic matrix factorization model.

    METHOD                 RMSE
    SGD                    0.8538 ± 0.0009
    SGD WITH MOMENTUM      0.8539 ± 0.0009
    SGLD                   0.8412 ± 0.0009
    SGHMC                  0.8411 ± 0.0011

The results are shown in Table 1. Both SGHMC and SGLD give better prediction results than the optimization-based methods. In this experiment, the results for SGLD and SGHMC are very similar. We also observed that the per-iteration running times of both methods are comparable. As such, the experiment suggests that SGHMC is an effective candidate for online Bayesian PMF.

5. Conclusion

Moving between modes of a distribution is one of the key challenges for MCMC-based inference algorithms. To address this problem in the large-scale or online setting, we proposed SGHMC, an efficient method for generating high-quality, "distant" steps in such sampling methods. Our approach builds on the fundamental framework of HMC, but uses stochastic estimates of the gradient to avoid the costly full gradient computation. Surprisingly, we discovered that the natural way to incorporate stochastic gradient estimates into HMC can lead to divergence and poor behavior both in theory and in practice. To address this challenge, we introduced second-order Langevin dynamics with a friction term that counteracts the effects of the noisy gradient, maintaining the desired target distribution as the invariant distribution of the continuous system. Our empirical results, both in a simulated experiment and on real data, validate our theory and demonstrate the practical value of introducing this simple modification. A natural next step is to explore combining adaptive HMC techniques with SGHMC. More broadly, we believe that the unification of efficient optimization and sampling techniques, such as those described herein, will enable a significant scaling of Bayesian methods.

Acknowledgements

This work was supported in part by the TerraSwarm Research Center sponsored by MARCO and DARPA, ONR Grant N00014-10-1-0746, DARPA Grant FA9550-12-1-0406 negotiated by AFOSR, NSF IIS-1258741, and Intel ISTC Big Data. We also appreciate the discussions with Mark Girolami, Nick Foti, Ping Ao, and Hong Qian.

References

Ahn, S., Korattikara, A., and Welling, M. Bayesian posterior sampling via stochastic gradient Fisher scoring. In Proceedings of the 29th International Conference on Machine Learning (ICML'12), pp. 1591–1598, July 2012.

Bardenet, R., Doucet, A., and Holmes, C. Towards scaling up Markov chain Monte Carlo: An adaptive subsampling approach. In Proceedings of the 31st International Conference on Machine Learning (ICML'14), volume 32, pp. 405–413, February 2014.

Duane, S., Kennedy, A.D., Pendleton, B.J., and Roweth, D. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.

Girolami, M. and Calderhead, B. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society Series B, 73(2):123–214, 2011.

Hoffman, M.D. and Gelman, A. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. arXiv:1111.4246, 2011.

Hoffman, M.D., Blei, D.M., Wang, C., and Paisley, J. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, May 2013.

Horowitz, A.M. A generalized guided Monte Carlo algorithm. Physics Letters B, 268(2):247–252, 1991.

Korattikara, A., Chen, Y., and Welling, M. Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In Proceedings of the 31st International Conference on Machine Learning (ICML'14), volume 32, pp. 181–189, February 2014.

Levin, D.A., Peres, Y., and Wilmer, E.L. Markov Chains and Mixing Times. American Mathematical Society, 2008.

Neal, R.M. Bayesian learning via stochastic dynamics. In Advances in Neural Information Processing Systems 5 (NIPS'93), pp. 475–482, 1993.

Neal, R.M. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 54:113–162, 2010.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, January 2009.

Patterson, S. and Teh, Y.W. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In Advances in Neural Information Processing Systems 26 (NIPS'13), pp. 3102–3110, 2013.

Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.

Salakhutdinov, R. and Mnih, A. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th International Conference on Machine Learning (ICML'08), pp. 880–887, 2008a.

Salakhutdinov, R. and Mnih, A. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems 20 (NIPS'08), pp. 1257–1264, 2008b.

Shi, J., Chen, T., Yuan, R., Yuan, B., and Ao, P. Relation of a new interpretation of stochastic differential equations to Ito process. Journal of Statistical Physics, 148(3):579–590, 2012.

Sutskever, I., Martens, J., Dahl, G.E., and Hinton, G.E. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML'13), volume 28, pp. 1139–1147, May 2013.

Wang, M.C. and Uhlenbeck, G.E. On the theory of the Brownian motion II. Reviews of Modern Physics, 17(2-3):323, 1945.

Wang, Z., Mohamed, S., and de Freitas, N. Adaptive Hamiltonian and Riemann manifold Monte Carlo. In Proceedings of the 30th International Conference on Machine Learning (ICML'13), volume 28, pp. 1462–1470, May 2013.

Welling, M. and Teh, Y.W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML'11), pp. 681–688, June 2011.

Yin, L. and Ao, P. Existence and construction of dynamical potential in nonequilibrium processes without detailed balance. Journal of Physics A: Mathematical and General, 39(27):8593, 2006.