On the Convergence of Hamiltonian Monte Carlo with Stochastic Gradients

Difan Zou¹  Quanquan Gu¹

Abstract

Hamiltonian Monte Carlo (HMC), built upon Hamilton's equation, has achieved great success in sampling from high-dimensional posterior distributions. However, it also suffers from computational inefficiency, especially for large training datasets. One common idea to overcome this computational bottleneck is to use stochastic gradients, which only query a mini-batch of training data in each iteration. However, unlike the extensive studies on the convergence analysis of HMC using full gradients, few works focus on establishing the convergence guarantees of stochastic gradient HMC algorithms. In this paper, we propose a general framework for proving the convergence rate of HMC with stochastic gradient estimators, for sampling from strongly log-concave and log-smooth target distributions. We show that the convergence to the target distribution in 2-Wasserstein distance can be guaranteed as long as the stochastic gradient estimator is unbiased and its variance is upper bounded along the algorithm trajectory. We further apply the proposed framework to analyze the convergence rates of HMC with four standard stochastic gradient estimators: the mini-batch stochastic gradient (SG), the stochastic variance reduced gradient (SVRG), the stochastic average gradient (SAGA), and the control variate gradient (CVG). The theoretical results explain the inefficiency of mini-batch SG, and suggest that SVRG and SAGA perform better in tasks with high-precision requirements, while CVG performs better for large datasets. Experimental results verify our theoretical findings.

1. Introduction

Markov chain Monte Carlo (MCMC) methods have achieved great success in many machine learning applications such as Bayesian inference, reinforcement learning, and computer vision. In the past decades, many MCMC algorithms, such as Metropolis-Hastings (Mengersen et al., 1996), ball walk (Lovász & Simonovits, 1990), hit and run (Smith, 1984), Langevin dynamics (LD) based algorithms (Langevin, 1908; Parisi, 1981), and Hamiltonian Monte Carlo (HMC) (Duane et al., 1987), have been invented and studied. Among them, HMC has been recognized as the most effective MCMC algorithm due to its rapid mixing rate and small discretization error. In practice, HMC has been deployed as the default sampler in many open packages such as Stan (Carpenter et al., 2017) and TensorFlow (Abadi et al., 2016). Specifically, HMC simulates the trajectory of a particle in the Hamiltonian system, which is described by the following Hamilton's equation:

    dq(t)/dt = ∂H(q(t), p(t))/∂p,
    dp(t)/dt = −∂H(q(t), p(t))/∂q,                                    (1.1)

where q and p are the position and momentum variables, and H(q, p) is the so-called Hamiltonian function, which is typically defined as the sum of the potential energy f(q) and the kinetic energy ‖p‖₂²/2. In each step, HMC solves (1.1) using the sample generated in the last step as the initial position q(0) and an independently generated Gaussian random vector as the initial momentum p(0), and then outputs the solution at a certain time τ, i.e., q(τ), as the next sample. It is well known that if Hamilton's equation can be exactly solved and the potential energy function f(x) admits certain good properties, the sample sequence generated by HMC asymptotically converges to the target distribution π ∝ exp(−f(x)) (Lee et al., 2018; Chen & Vempala, 2019). However, it is generally intractable to exactly solve (1.1), and numerical integrators are needed to solve it approximately. One of the most popular HMC algorithms adopts the leapfrog integrator for solving (1.1), followed by a Metropolis-Hastings (MH) correction step (Neal et al., 2011). Aside from the algorithmic development, the convergence rate of HMC has also been extensively studied in the recent literature (Bou-Rabee et al., 2018; Lee et al., 2018; Mangoubi & Smith, 2017; Durmus et al., 2017; Mangoubi & Vishnoi, 2018; Chen & Vempala, 2019; Chen et al., 2019), which demonstrates its superior performance compared with other MCMC methods.

¹Department of Computer Science, UCLA. Correspondence to: Quanquan Gu.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

However, as the data size grows rapidly nowadays, the standard HMC algorithm suffers from a huge computational cost. For a Bayesian inference problem, the energy function f(x) (a.k.a. the negative log-posterior in a Bayesian learning problem) is formulated as the sum of the negative log-likelihood functions over all observations (i.e., f(x) = Σ_{i=1}^n f_i(x)). When the number of observations (i.e., n) becomes extremely large, the standard HMC algorithm may fail as it requires querying the entire dataset to compute the full gradient ∇f(x). To overcome the computational burden, a common idea is to leverage a stochastic gradient in each update, i.e., we only compute the gradient approximately using a mini-batch of training data, which gives rise to stochastic gradient HMC (SG-HMC) (Chen et al., 2014)¹. This idea has also triggered a line of work focusing on improving the scalability of other gradient-based MCMC methods (Welling & Teh, 2011; Chen et al., 2015; Ma et al., 2015; Baker et al., 2018). Despite the efficiency improvements for large-scale Bayesian inference problems, SG-HMC has many drawbacks. The variance of stochastic gradients may lead to inaccurate solutions to the Hamilton's equation (1.1). Additionally, it is no longer tractable to perform the MH correction step since (1) the proposal distribution of HMC does not have an explicit formula and is not time-reversible, and (2) one cannot exactly query the entire training dataset to compute the MH acceptance probability. These two shortcomings prevent SG-HMC from achieving sampling as accurate as the standard HMC (Betancourt, 2015; Bardenet et al., 2017; Dang et al., 2019), and hinder the application of SG-HMC in many sampling tasks with a high-precision requirement.

Despite the pros and cons of SG-HMC discussed in the aforementioned works, most of them are empirical studies. Unlike stochastic gradient Langevin dynamics (SGLD) and stochastic gradient underdamped Langevin dynamics (SG-ULD)², which have been extensively studied in theory, little work has been done to provide a theoretical understanding of SG-HMC. It remains elusive whether SG-HMC can be guaranteed to converge and how it performs in different regimes. Moreover, there have also emerged many other stochastic gradient estimators that exhibit smaller variance than the standard mini-batch stochastic gradient estimator, such as the stochastic variance reduced gradient (SVRG) (Johnson & Zhang, 2013), the stochastic average gradient (SAGA) (Defazio et al., 2014), and the control variate gradient (CVG) (Baker et al., 2018). These stochastic gradient estimators have been successfully incorporated into Langevin based algorithms for faster sampling (Dubey et al., 2016; Chatterji et al., 2018; Baker et al., 2018; Brosse et al., 2018; Zou et al., 2018a; Li et al., 2018). It is unclear whether these estimators can be adapted in the HMC algorithm to overcome the drawbacks of SG-HMC.

In this paper, we propose a general framework for proving the convergence rate of HMC with stochastic gradients for sampling from strongly log-concave and log-smooth distributions. At the core of our analysis is a sharp characterization of the solution to the Hamilton's equation (1.1) obtained using stochastic gradients, which is in the order of O(√η), where η is the step size of the numerical integrator. Under the proposed proof framework, the convergence rate of HMC with a variety of stochastic gradient estimators can be derived. We summarize the main contributions of our paper as follows:

• We develop a general framework for characterizing the convergence rate of the HMC algorithm when using stochastic gradients. In particular, we prove that as long as the stochastic gradient is unbiased and its variance along the algorithm trajectory is upper bounded (a uniform upper bound is not required), the stochastic gradient HMC algorithm provably converges to the target distribution π ∝ exp(−f(x)) in 2-Wasserstein distance with a sampling error up to O(√η).

• We apply four commonly used stochastic gradient estimators to the HMC algorithm for sampling from the target distribution of the form π ∝ exp(−Σ_{i=1}^n f_i(x)), which gives rise to four variants of stochastic gradient HMC algorithms: SG-HMC, SVRG-HMC, SAGA-HMC, and CVG-HMC. We establish their convergence guarantees under the proposed framework. Our analysis suggests that in order to achieve ε/√n sampling error in 2-Wasserstein distance, the gradient complexities³ of SG-HMC, CVG-HMC, SVRG-HMC, and SAGA-HMC are Õ(n/ε²), Õ(1/ε² + 1/ε), Õ(n^{2/3}/ε^{2/3} + 1/ε), and Õ(n^{2/3}/ε^{2/3} + 1/ε) respectively. This explains the inefficiency of SG-HMC observed in prior work, and reveals the prospects of CVG-HMC, SVRG-HMC, and SAGA-HMC for large-scale sampling problems.

• We carry out numerical experiments on both synthetic and real-world datasets. The results show that all stochastic gradient HMC algorithms converge, but SG-HMC has a significantly larger bias compared with the other algorithms. Additionally, SVRG-HMC performs the best when the sample size is small, while CVG-HMC becomes more efficient and effective when the sample size increases. This well corroborates our theoretical findings.

¹In fact, in addition to making use of stochastic gradients, Chen et al. (2014) also introduces a friction term and an additional Brownian term to mitigate the bias and variance brought by stochastic gradients.
²In some existing works this algorithm is also referred to as SGHMC (Zou et al., 2018a; Gao et al., 2018b;a). We highlight that their SGHMC algorithm is different from the SG-HMC algorithm studied in this paper, which we will clearly discuss in Section 2.
³The gradient complexity is defined as the number of stochastic gradient evaluations needed to achieve the target accuracy.

Notation. Given two scalars a and b, we use a ∧ b to denote min{a, b} and a ∨ b to denote max{a, b}. Given a vector x ∈ R^d, we define its Euclidean norm by ‖x‖₂ = (x₁² + ··· + x_d²)^{1/2}. We use the O(·) and Ω(·) notations to hide constant factors, and use Õ(·) to hide poly-logarithmic factors in the O(·) notation. Given two sequences {x_k} and {y_k}, we further define x_k = Θ(y_k) if x_k = Ω(y_k) and x_k = O(y_k). Given two distributions µ and ν, the 2-Wasserstein distance is defined by W₂²(µ, ν) = inf_{γ∈Γ(µ,ν)} ∫ ‖x − y‖₂² dγ(x, y), where Γ(µ, ν) denotes the set of all couplings of µ and ν.

2. Additional Related Work: Langevin Dynamics Based Algorithms

2.1. Mathematical Description of Langevin Dynamics

Aside from the HMC algorithm, another important family of MCMC methods is built upon the Langevin dynamics, including both the overdamped Langevin dynamics (LD) (Roberts & Tweedie, 1996) and the underdamped Langevin dynamics (ULD) (Chen et al., 2017). Formally, the overdamped Langevin dynamics can be described by the following stochastic differential equation (SDE):

    dx_t = −∇f(x_t) dt + √2 dB_t,                                    (2.1)

where B_t is the Brownian term. The underdamped Langevin dynamics takes the form of the following SDE:

    dv_t = −γ v_t dt − u ∇f(x_t) dt + √(2γu) dB_t,
    dx_t = v_t dt,                                                    (2.2)

where x_t and v_t are the position and velocity variables at time t respectively, γ > 0 is the "friction" parameter, and u > 0 is referred to as the "inverse mass" parameter. Notably, both the overdamped and underdamped Langevin dynamics converge to the (marginal) stationary distribution π ∝ e^{−f(x)}. The goal of Langevin dynamics based algorithms is to approximately solve the SDEs in (2.1) and (2.2). As a comparison, the focus of HMC based algorithms is to solve the ODE (1.1), which can be done more accurately or even exactly.
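To make the discretizations discussed below concrete, here is a minimal sketch (our illustration, not an algorithm from this paper) of one Euler step for (2.1): passing the full gradient of f gives LMC, while passing an unbiased mini-batch estimate gives SGLD. The function names and the toy target are assumptions for illustration only.

```python
import numpy as np

def langevin_step(x, grad, h, rng):
    # One Euler discretization step of (2.1):
    #   x_{k+1} = x_k - h * grad(x_k) + sqrt(2h) * N(0, I).
    # `grad` may be the full gradient of f (LMC) or an unbiased
    # stochastic estimate of it (SGLD).
    return x - h * grad(x) + np.sqrt(2.0 * h) * rng.standard_normal(x.shape)

# Toy usage: f(x) = ||x||_2^2 / 2, so grad f(x) = x and pi = N(0, I).
rng = np.random.default_rng(0)
x = np.zeros(3)
for _ in range(5000):
    x = langevin_step(x, grad=lambda x: x, h=1e-2, rng=rng)
```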
2.2. Existing Convergence Results of Langevin Dynamics Based Algorithms

The convergence rates of Langevin dynamics based algorithms have been widely studied for various machine learning problems such as sampling (Chen et al., 2015; Li et al., 2016; Dubey et al., 2016; Chen et al., 2017; Dalalyan & Karagulyan, 2019; Li et al., 2018; Cheng et al., 2018; Zou et al., 2018b;a; Shen & Lee, 2019; Zou et al., 2019; Dalalyan et al., 2020; Simsekli et al., 2020) and nonconvex optimization (Raginsky et al., 2017; Zhang et al., 2017; Xu et al., 2018; Ma et al., 2018; Gao et al., 2018a;b; Chau & Rasonyi, 2019; Deng et al., 2020; Zou et al., 2020). Among them, the works most relevant to this paper focus on establishing the convergence rate of Langevin dynamics based algorithms for sampling from strongly log-concave and log-smooth distributions (Dalalyan & Karagulyan, 2019; Dalalyan, 2017; Chen et al., 2017; Baker et al., 2018; Zou et al., 2018a; Chatterji et al., 2018). In particular, based on the overdamped Langevin dynamics, Dalalyan & Karagulyan (2019); Dalalyan (2017) established the convergence guarantees of Langevin Monte Carlo (LMC, the Euler discretization of (2.1) using full gradients) and stochastic gradient Langevin dynamics (SGLD, the Euler discretization of (2.1) using stochastic gradients). Zou et al. (2018b) further showed that using the SVRG or subsampled SVRG gradient estimator can help improve the convergence rate of both LMC and SGLD. Baker et al. (2018) proposed to use a control-variate gradient estimator in SGLD and also demonstrated its efficiency in terms of the sample size. Based on the underdamped Langevin dynamics, Chen et al. (2015) showed that a naive Euler discretization of (2.2) cannot give a faster convergence rate than LMC/SGLD, and that instead one may need to use a high-order discretization mechanism. However, Chen et al. (2017) showed that part of (2.2) can be solved analytically and proposed an accurate first-order discretization method for solving (2.2), which provably achieves a faster convergence rate than LMC. Following this line of research, Zou et al. (2018a); Chatterji et al. (2018) considered using the SVRG and CVG estimators in the algorithm developed by Chen et al. (2017), which can also help reduce the discretization error and thus lead to faster convergence rates.

2.3. Comparison between Langevin Dynamics and HMC Based Algorithms

We would like to highlight that these two types of algorithms (especially ULD based algorithms vs. HMC based algorithms) are different in terms of both their algorithm designs and their underlying SDE/ODE (see (1.1) and (2.2) for their formulas). In particular, HMC based algorithms focus on solving the Hamilton's equation (which is an ODE) in each proposal, while ULD based algorithms are derived from the discretization of an SDE. From the algorithmic perspective, HMC based algorithms have a double loop structure: the inner loop solves the ODE and makes a proposal, and the outer loop updates the proposals until convergence. ULD based algorithms exhibit a single loop structure and are designed as a discretization of the underlying SDE, as sketched below.
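The double-loop vs. single-loop contrast can be summarized in a short skeleton; this sketch is ours and only illustrates the control flow, with `leapfrog_proposal` and `uld_step` standing in for the respective inner updates.

```python
def hmc_style_sampler(x0, leapfrog_proposal, T):
    # Double loop: the outer loop makes T proposals; each call to
    # leapfrog_proposal internally runs K leapfrog steps to
    # approximately solve the ODE (1.1).
    x, samples = x0, []
    for _ in range(T):
        x = leapfrog_proposal(x)
        samples.append(x)
    return samples

def uld_style_sampler(x0, v0, uld_step, n_steps):
    # Single loop: every iteration is one discretization step of the SDE (2.2).
    x, v, samples = x0, v0, []
    for _ in range(n_steps):
        x, v = uld_step(x, v)
        samples.append(x)
    return samples
```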

To better position our algorithms and results, we also summarize the gradient complexities of the HMC based algorithms and the Langevin dynamics based algorithms in Table 1, where SG-ULD, SVRG-ULD, and CVG-ULD refer to the algorithms that apply the discretization approach of Chen et al. (2017) to (2.2) using the SG, SVRG, and CVG estimators respectively, and SVRG-LD and SAGA-LD refer to the algorithms that apply the Euler discretization to (2.1) using the SVRG and SAGA estimators respectively.

Table 1. Comparison of different stochastic sampling algorithms, where the target distribution is π ∝ e^{−Σ_{i=1}^n f_i(x)} and the target sampling error is ε/√n in 2-Wasserstein distance. The gradient complexities for SG-ULD, CV-ULD, SGLD, SVRG-LD, and SAGA-LD are derived in Chatterji et al. (2018); the gradient complexity of SVRG-ULD is derived in Zou et al. (2018a).

    Algorithm                          Complexity                       Type
    SGLD (Welling & Teh, 2011)         Õ(n/ε²)                          LD
    SVRG-LD (Dubey et al., 2016)       Õ(n/ε)                           LD
    SAGA-LD (Dubey et al., 2016)       Õ(n/ε)                           LD
    SG-ULD (Chen et al., 2017)         Õ(n/ε²)                          ULD
    SVRG-ULD (Zou et al., 2018a)       Õ(n^{2/3}/ε^{2/3} + 1/ε)         ULD
    CV-ULD (Chatterji et al., 2018)    Õ(1/ε³)                          ULD
    SG-HMC                             Õ(n/ε²)                          HMC
    SVRG-HMC                           Õ(n^{2/3}/ε^{2/3} + 1/ε)         HMC
    SAGA-HMC                           Õ(n^{2/3}/ε^{2/3} + 1/ε)         HMC
    CVG-HMC                            Õ(1/ε²)                          HMC

Here we calibrate the bounds proved in the original papers to fit into our setting (i.e., the target distribution is π ∝ e^{−Σ_{i=1}^n f_i(x)} and the target sampling error is ε/√n)⁴.

⁴This is to make the Wasserstein metric invariant under different scalings (Lee et al., 2018). Some related works consider averaged log-likelihood functions (i.e., π ∝ exp(−n^{−1} Σ_{i=1}^n f_i(x))) and set the target sampling error as ε.

First, when comparing among the different HMC based algorithms, it can be seen that SG-HMC has the worst dependency on both the dataset size n and the accuracy parameter ε. Moreover, if the dataset size satisfies n = Ω(ε^{−2}), CVG-HMC enjoys a better gradient complexity than both SVRG-HMC and SAGA-HMC, which suggests that CVG-HMC performs better for very large datasets. On the other hand, if ε = O(n^{−1/2}), SVRG-HMC and SAGA-HMC will outperform CVG-HMC, implying that SAGA-HMC and SVRG-HMC are better for sampling with high-precision requirements.

Next, we compare the HMC based algorithms with the Langevin dynamics based algorithms. Clearly, it can be seen that when using the same stochastic gradient estimator, the complexity bound of SG-HMC is the same as those of SG-ULD and SGLD, the complexity bounds of SVRG-HMC and SAGA-HMC are the same as that of SVRG-ULD and better than those of SAGA-LD and SVRG-LD, and the complexity bound of CVG-HMC is better than that of CV-ULD.

3. HMC with Stochastic Gradients

Let H(q, p) = f(q) + ‖p‖₂²/2 be the Hamiltonian function; then the Hamilton's equation (1.1) can be formulated as

    dq(t)/dt = p(t),    dp(t)/dt = −∇f(q(t)).                         (3.1)

Let {x^(0), x^(1), ..., x^(t), ...} be the sequence of samples generated by HMC. Given x^(t), an idealized HMC generates the next sample x^(t+1) by solving the differential equation (3.1) up to a certain time τ (i.e., q(τ)) with initial position q(0) = x^(t) and initial momentum p(0) ∼ N(0, I) independently drawn from the standard Gaussian distribution. In practice, one typically applies the leapfrog numerical integrator to solve (3.1). In particular, the numerical integrator first divides the time interval [0, τ] into K subintervals of length η = τ/K. Let p₀ = p(0) and q₀ = q(0); the one-step second-order leapfrog update for (q_k, p_k) is defined as follows:

    p_{k+1/2} = p_k − (η/2) ∇f(q_k),
    q_{k+1} = q_k + η p_{k+1/2},                                      (3.2)
    p_{k+1} = p_{k+1/2} − (η/2) ∇f(q_{k+1}).

Similarly, stochastic gradient HMC can be designed by replacing the full gradients ∇f(q_k) and ∇f(q_{k+1}) with stochastic gradient estimators. Let g(q, ξ) be an unbiased stochastic gradient estimator of ∇f(q), where ξ represents the randomness. Then reformulating (3.2) and replacing ∇f(q_k) and ∇f(q_{k+1}) with stochastic gradients yields

    q_{k+1} = q_k + η p_k − (η²/2) g(q_k, ξ_k),
    p_{k+1} = p_k − (η/2) g(q_k, ξ_k) − (η/2) g(q_{k+1}, ξ_{k+1/2}),  (3.3)

where η is the step size of the leapfrog integrator and the random variables ξ_k and ξ_{k+1/2} are independent. Here we use ξ_{k+1/2} rather than ξ_{k+1} to guarantee that the randomness at two subsequent leapfrog updates is independent. We summarize the entire algorithm in Algorithm 1. q_K can be seen as an approximate solution to (3.1), which will be passed to the next proposal of HMC.

Algorithm 1 Noisy Gradient Hamiltonian Monte Carlo
 1: input: Step size η, number of leapfrog steps K, number of HMC proposals T, initial point x^(0)
 2: for t = 0, ..., T do
 3:   Set q₀ = x^(t)
 4:   Sample p₀ from N(0, I)
 5:   for k = 0, ..., K − 1 do
 6:     q_{k+1} = q_k + η p_k − (η²/2) g(q_k, ξ_k)
 7:     p_{k+1} = p_k − (η/2) g(q_k, ξ_k) − (η/2) g(q_{k+1}, ξ_{k+1/2})
 8:   end for
 9:   Set x^(t+1) = q_K
10: end for
11: output: x^(T)
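The update (3.3) and the inner loop of Algorithm 1 translate directly into code. The sketch below is a minimal NumPy illustration (the names are ours): `stoch_grad` stands for any unbiased estimator g(q, ξ) and draws fresh randomness on every call, matching the independence of ξ_k and ξ_{k+1/2}.

```python
import numpy as np

def sg_hmc_proposal(x, stoch_grad, eta, K, rng):
    """One proposal of Algorithm 1: K leapfrog steps of (3.3), started
    from position x with a fresh Gaussian momentum."""
    q = x.copy()
    p = rng.standard_normal(x.shape)        # p_0 ~ N(0, I)
    for _ in range(K):
        g_k = stoch_grad(q)                 # g(q_k, xi_k)
        q_next = q + eta * p - 0.5 * eta**2 * g_k
        g_half = stoch_grad(q_next)         # g(q_{k+1}, xi_{k+1/2}), fresh randomness
        p = p - 0.5 * eta * g_k - 0.5 * eta * g_half
        q = q_next
    return q                                # q_K, passed to the next proposal

def sg_hmc(x0, stoch_grad, eta, K, T, rng):
    """T proposals of Algorithm 1; returns the final iterate x^(T)."""
    x = x0
    for _ in range(T):
        x = sg_hmc_proposal(x, stoch_grad, eta, K, rng)
    return x
```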

4. General Convergence Results for Stochastic Gradient HMC Algorithms

In this section, we present the general theoretical results on the convergence rate of HMC with stochastic gradients. Before presenting the main theory, we first make the following assumptions on the potential energy function f(·) and the stochastic gradient estimator g(·, ·).

Assumption 4.1 (Strongly Convex). There exists a positive constant µ such that for any x, y ∈ R^d,

    f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (µ/2) ‖y − x‖₂².

Assumption 4.2 (Smoothness). There exists a positive constant L such that for any x, y ∈ R^d,

    ‖∇f(x) − ∇f(y)‖₂ ≤ L ‖x − y‖₂.

The above two assumptions on the target distribution are commonly made in the convergence analysis of sampling algorithms (Chen et al., 2017; Erdogdu et al., 2018; Chen et al., 2019; Dalalyan & Karagulyan, 2019).

Assumption 4.3 (Bounded Variance). For any q_k in the leapfrog update, the variance of the unbiased stochastic gradient estimator g(q_k, ξ_k) is upper bounded by

    E‖g(q_k, ξ_k) − ∇f(q_k)‖₂² ≤ σ²,

where the expectation is taken over both q_k and ξ_k.

We remark that this assumption is made along the algorithm path, and we will verify it for several widely used stochastic gradient estimators in Section 5. In other words, this assumption is only needed for the general convergence analysis; when specializing to a specific algorithm with a certain stochastic gradient estimator, it can be proved.

Now we are ready to present our main result, which characterizes the convergence rate of Algorithm 1 for sampling from a strongly log-concave and log-smooth distribution π.

Theorem 4.4. Under Assumptions 4.1, 4.2, and 4.3, let x* = argmin_x f(x), D = ‖x^(0) − x*‖₂², and let µ_t be the distribution of the iterate x^(t). If the step size satisfies η = O(L^{1/2} σ^{−2} κ^{−1} ∧ L^{−1/2}) and K = 1/(4√L η), the output of Algorithm 1 satisfies

    W₂(µ_T, π) ≤ (1 − (128κ)^{−1})^{T/2} (2D + 2d/µ)^{1/2} + Γ₁ η^{1/2} + Γ₂ η,

where κ = L/µ is the condition number of f(x) and the constants Γ₁ and Γ₂ satisfy

    Γ₁² = O(L^{−3/2} σ² κ²),    Γ₂² = O(κ² L D + κ d + L^{−1/2} σ² η).

Remark 4.5. Theorem 4.4 provides a general convergence result for HMC with stochastic gradients. The first term in the upper bound represents the mixing of HMC, the second term represents the error brought by the variance of stochastic gradients, and the last term mainly comes from the discretization error of the numerical integrator. Clearly, stochastic gradient HMC provably converges with a sampling error governed by the step size and the variance of stochastic gradients along the Markov chain (at a rate of O(σ η^{1/2})). In order to achieve a target sampling error ε, one may have to choose a sufficiently small step size η, and run T = O(log(1/ε)) HMC proposals for mixing. The total number of iterations in Algorithm 1 is then TK = Õ(L^{−1/2} η^{−1}).

Note that the sampling error proved in Theorem 4.4 depends on the magnitude of the variance (Γ₁ ∼ σ), and different stochastic gradient estimators may lead to different convergence rates. Intuitively speaking, a smaller variance leads to a smaller parameter Γ₁, implying that we can use a larger step size to achieve the same accuracy. This in turn speeds up the convergence of HMC since the iteration complexity is proportional to η^{−1}. In the next section, we will apply the general convergence result in Theorem 4.4 to some specific stochastic gradient estimators and establish the corresponding convergence guarantees.
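To see how Theorem 4.4 translates into concrete parameter choices, the following is a brief sketch of our own (the constant 1/3 splitting is illustrative, not from the paper). Requiring each of the three terms to be at most ε/3 gives

\[
\Gamma_1 \eta^{1/2} \le \frac{\varepsilon}{3} \;\Rightarrow\; \eta = O\Big(\frac{\varepsilon^2}{\Gamma_1^2}\Big),
\qquad
\Gamma_2 \eta \le \frac{\varepsilon}{3} \;\Rightarrow\; \eta = O\Big(\frac{\varepsilon}{\Gamma_2}\Big),
\]

and, using \((1-x)^{T/2} \le e^{-Tx/2}\) with \(x = (128\kappa)^{-1}\),

\[
\big(1 - (128\kappa)^{-1}\big)^{T/2} (2D + 2d/\mu)^{1/2} \le \frac{\varepsilon}{3}
\;\Rightarrow\;
T \ge 256\,\kappa \log\frac{3\,(2D + 2d/\mu)^{1/2}}{\varepsilon}.
\]

Treating κ, D, d, and µ as constants recovers the T = O(log(1/ε)) proposals mentioned in Remark 4.5.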

5. Application to Commonly Used Stochastic Gradient Estimators

Note that given n observations, the target distribution can be described as π ∝ exp(−f(x)) with f(x) = Σ_{i=1}^n f_i(x), where f_i(·) corresponds to the i-th observation. In each step, we only query a subset of the observations to estimate the gradient and update the variables accordingly. In this section, we will prove the convergence rates of HMC with four commonly used stochastic gradient estimators, including the mini-batch stochastic gradient (SG), the stochastic variance reduced gradient (SVRG) (Johnson & Zhang, 2013), the stochastic average gradient (SAGA) (Defazio et al., 2014), and the control variate gradient (CVG) (Baker et al., 2018). We list these four stochastic gradient estimators in Algorithm 2.

5.1. Review of Stochastic Gradient Estimators

The mini-batch stochastic gradient samples a mini-batch of training examples I_k of size |I_k| = B to compute the stochastic gradient, and is identical to that used in SGLD (Welling & Teh, 2011). The SVRG and SAGA estimators follow from Dubey et al. (2016). The SVRG estimator adopts a reference gradient ∇f(q̃) associated with a reference point q̃, both of which are updated at a low frequency (every N leapfrog steps). In each update, we sample a fresh mini-batch of training examples and leverage ∇f(q̃) and q̃ as a control variate to help reduce the variance. The SAGA estimator maintains a table G that stores all stochastic gradients {G_i}_{i=1,...,n}. In each iteration, it queries a mini-batch of training examples I_k and computes the stochastic gradient by combining the mini-batch stochastic gradients on the new examples with the most recent history gradients in the table, including the stochastic gradients for new examples {G_i}_{i∈I_k} and the sum of all stochastic gradients in the table (i.e., g̃_k = Σ_{i=1}^n G_i). Afterwards, the newly computed mini-batch stochastic gradients are used to update the table. Similar to the SVRG estimator, the CVG estimator also maintains a reference point q̂, which is typically set to be an approximate minimizer of the function f(x), and queries a new mini-batch of training examples to compute the stochastic gradient jointly. Different from the SVRG estimator, which slowly updates the reference point, the reference point q̂ adopted in CVG is fixed during the entire algorithm.

Algorithm 2 Stochastic Gradient Estimators
 1: input: Current point q_k, index of the HMC proposal t, randomly sampled mini-batch I_k
    Mini-batch stochastic gradient
 2: g(q_k, ξ_k) = (n/B) Σ_{i∈I_k} ∇f_i(q_k)
    Stochastic variance reduced gradient
 3: if (k + Kt) mod N = 0 then
 4:   g(q_k, ξ_k) = ∇f(q_k), q̃ = q_k
 5: else
 6:   g(q_k, ξ_k) = (n/B) Σ_{i∈I_k} [∇f_i(q_k) − ∇f_i(q̃)] + ∇f(q̃)
 7: end if
    Stochastic average gradient
 8: if k + Kt = 0 then
 9:   g(q_k, ξ_k) = ∇f(q_k), G = {∇f_i(q_k)}_{i=1,...,n}
10: else
11:   g̃_k = Σ_{i=1}^n G_i
12:   g(q_k, ξ_k) = (n/B) Σ_{i∈I_k} [∇f_i(q_k) − G_i] + g̃_k, then G_i ← ∇f_i(q_k) for all i ∈ I_k
13: end if
    Control variate gradient
14: g(q_k, ξ_k) = ∇f(q̂) + (n/B) Σ_{i∈I_k} [∇f_i(q_k) − ∇f_i(q̂)]
15: output: g(q_k, ξ_k)
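As a concrete reading of Algorithm 2, the following NumPy sketch implements the four estimators for f(x) = Σ_i f_i(x). It is our simplified illustration (the state handling and the sampling of I_k are assumptions, not code from the paper), with grad_fi(q, i) returning ∇f_i(q).

```python
import numpy as np

rng = np.random.default_rng(0)

def sg_estimator(q, grad_fi, n, B):
    """Mini-batch SG: (n/B) * sum_{i in I_k} grad f_i(q)."""
    idx = rng.choice(n, size=B, replace=False)
    return (n / B) * sum(grad_fi(q, i) for i in idx)

def svrg_estimator(q, grad_fi, n, B, q_ref, full_grad_ref):
    """SVRG: (n/B) * sum_{i in I_k} [grad f_i(q) - grad f_i(q_ref)] + grad f(q_ref);
    q_ref and full_grad_ref are refreshed every N leapfrog steps."""
    idx = rng.choice(n, size=B, replace=False)
    corr = sum(grad_fi(q, i) - grad_fi(q_ref, i) for i in idx)
    return (n / B) * corr + full_grad_ref

def saga_estimator(q, grad_fi, n, B, table):
    """SAGA: (n/B) * sum_{i in I_k} [grad f_i(q) - G_i] + sum_i G_i,
    computed with the old table entries, which are then overwritten."""
    idx = rng.choice(n, size=B, replace=False)
    g_sum = table.sum(axis=0)
    corr = sum(grad_fi(q, i) - table[i] for i in idx)
    for i in idx:
        table[i] = grad_fi(q, i)
    return (n / B) * corr + g_sum

def cvg_estimator(q, grad_fi, n, B, q_hat, full_grad_hat):
    """CVG: grad f(q_hat) + (n/B) * sum_{i in I_k} [grad f_i(q) - grad f_i(q_hat)];
    the reference point q_hat is fixed for the entire run."""
    idx = rng.choice(n, size=B, replace=False)
    corr = sum(grad_fi(q, i) - grad_fi(q_hat, i) for i in idx)
    return full_grad_hat + (n / B) * corr
```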

5.2. Convergence Results of Specific Stochastic Gradient HMC Algorithms

Note that the convergence guarantee of stochastic gradient HMC in Theorem 4.4 is established based on Assumption 4.3. Therefore, in order to prove the convergence rates for HMC equipped with the aforementioned stochastic gradient estimators, it suffices to verify Assumption 4.3 and characterize the magnitude of the variance parameter σ. In the subsequent analysis, we will use a stronger version of Assumption 4.2 by requiring that all component functions {f_i(x)}_{i=1}^n are L/n-smooth.

Assumption 5.1. For any x, y ∈ R^d and i ∈ [n], there exists a positive constant L such that

    ‖∇f_i(x) − ∇f_i(y)‖₂ ≤ (L/n) ‖x − y‖₂.

This assumption has also been made in many prior works (Baker et al., 2018; Chatterji et al., 2018; Brosse et al., 2018) for studying the convergence of stochastic gradient Langevin MCMC algorithms. Note that Assumption 5.1 immediately implies Assumption 4.2 and thus the result in Theorem 4.4 applies. We would also like to point out that we only need all component functions to be smooth, but not necessarily strongly convex. Additionally, we follow a similar setting to Baker et al. (2018); Chatterji et al. (2018) and assume that L/n and µ/n are in the constant order, which implies that L, µ = O(n).

By combining Algorithm 1 and the corresponding stochastic gradient estimator presented in Algorithm 2, we can obtain four specific stochastic gradient HMC algorithms, namely SG-HMC, SVRG-HMC, SAGA-HMC, and CVG-HMC. We assume that the initial point x^(0) satisfies ‖x^(0) − x*‖₂² ≤ d/µ. Note that this can be achieved by running SGD for roughly O(n) steps (Baker et al., 2018; Brosse et al., 2017). In the sequel, we will provide the convergence guarantees for these four algorithms.

Mini-batch stochastic gradient HMC (SG-HMC). The following theorem characterizes the convergence results of SG-HMC in 2-Wasserstein distance.

Theorem 5.2. Under Assumptions 4.1 and 5.1, assume ‖x^(0) − x*‖₂² ≤ d/µ and let µ_t be the distribution of x^(t). Then if the step size satisfies η = O(L^{−1/2} ∧ d µ^{−1} Γ₁^{−1}) and we set K = 1/(4√L η), the output of SG-HMC satisfies

    W₂(µ_T, π) ≤ 2√(d/µ) (1 − (128κ)^{−1})^{T/2} + Γ₁ η^{1/2} + Γ₂ η,

where the constants Γ₁ and Γ₂ satisfy

    Γ₁² = O(L^{−1/2} B^{−1} κ³ d + L^{−3/2} B^{−1} κ² n² d),    Γ₂² = O(κ³ d + L^{−1/2} B^{−1} n² κ d η²).

Gradient complexity of SG-HMC. Similar to Chatterji et al. (2018); Baker et al. (2018); Brosse et al. (2018), we assume L = O(n) for simplicity. This further implies that K = O(n^{−1/2} η^{−1}) and η = O(L^{−1/2}) = O(n^{−1/2}). Then, ignoring the dependency on the condition number κ and the dimension d and only paying attention to the dependency on ε, B, and η, the sampling error (corresponding to the last two terms of the bound) is O(n^{1/4} B^{−1/2} η^{1/2}). Letting the target sampling error be ε/√n for arbitrary ε ∈ (0, 1), it suffices to set η = Θ(n^{−3/2} B ε²). Note that HMC requires T = O(log(1/ε)) proposals to ensure good mixing. As a result, the gradient complexity of SG-HMC is KTB = Õ(n^{−1/2} η^{−1} B) = Õ(n ε^{−2}).
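The calibration behind this count can be sketched in two lines (our own illustration; κ, d, and constant factors are suppressed). Matching the dominant error term to the target ε/√n gives the step size, and plugging it into KTB gives the complexity:

\[
n^{1/4} B^{-1/2} \eta^{1/2} = \frac{\varepsilon}{\sqrt{n}}
\;\Rightarrow\;
\eta = \Theta\big(n^{-3/2} B \varepsilon^2\big),
\]

\[
KTB = O\big(n^{-1/2}\eta^{-1}\big)\cdot \widetilde{O}(1)\cdot B
= \widetilde{O}\Big(n^{-1/2}\cdot \frac{n^{3/2}}{B\varepsilon^2}\cdot B\Big)
= \widetilde{O}\big(n\varepsilon^{-2}\big).
\]

Note that the batch size B cancels, which is why enlarging the mini-batch alone cannot improve the gradient complexity of SG-HMC.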

Figure 1. Sampling error of SG-HMC, CVG-HMC, SVRG-HMC, SAGA-HMC, SVRG-ULD, and CVG-ULD on synthetic data: (a) n = 500; (b) n = 5000. The X-axis represents the total number of steps (×100), and the Y-axis represents the error ‖x̂ − x̄‖₂ between the estimated mean x̂ and the true one x̄.

Stochastic variance reduced gradient HMC (SVRG-HMC). We deliver the convergence rate of SVRG-HMC in the following theorem.

Theorem 5.3. Under the same assumptions made in Theorem 5.2, let µ_t be the distribution of x^(t). Then if η = O(L^{−1} ∧ d µ^{−1} Γ₁^{−1}) and we set K = 1/(4√L η), the output of SVRG-HMC satisfies

    W₂(µ_T, π) ≤ 2√(d/µ) (1 − (128κ)^{−1})^{T/2} + Γ₁ η^{1/2} + Γ₂ η,

where the constants Γ₁ and Γ₂ satisfy

    Γ₁² = O(L^{1/2} B^{−1} N² κ³ d η²),    Γ₂² = O(κ³ d + L^{3/2} B^{−1} N² κ³ d η³).

Gradient complexity of SVRG-HMC. We first set BN = Θ(n); then Theorem 5.3 suggests that the sampling error of SVRG-HMC is O(n^{5/4} B^{−3/2} η^{3/2} + η). It then suffices to set the step size η = Θ(n^{−7/6} B ε^{2/3} ∧ n^{−1/2} ε) to guarantee ε/√n sampling error, which further implies that the gradient complexity of SVRG-HMC is KTB = Õ(n^{2/3} ε^{−2/3} + B ε^{−1}), where we use the facts that T = O(log(1/ε)) and K = O(n^{−1/2} η^{−1}). Then we can set the batch size B = O(n^{2/3} ε^{1/3} ∨ 1) and get an Õ(n^{2/3} ε^{−2/3} + ε^{−1}) gradient complexity for SVRG-HMC.

Stochastic average gradient HMC (SAGA-HMC). We present the convergence rate of SAGA-HMC in the following theorem.

Theorem 5.4. Under the same assumptions made in Theorem 5.2, let µ_t be the distribution of x^(t). Then if η = O(L^{−1} ∧ d µ^{−1} Γ₁^{−1}) and we set K = 1/(4√L η), the output of SAGA-HMC satisfies

    W₂(µ_T, π) ≤ 2√(d/µ) (1 − (128κ)^{−1})^{T/2} + Γ₁ η^{1/2} + Γ₂ η,

where the constants Γ₁ and Γ₂ satisfy

    Γ₁² = O(L^{1/2} B^{−3} n² κ³ d η²),    Γ₂² = O(κ³ d + L^{3/2} B^{−3} n² κ³ d η³).

Gradient complexity of SAGA-HMC. Theorem 5.4 suggests that the sampling error of SAGA-HMC is O(n^{5/4} B^{−3/2} η^{3/2} + η), which is identical to that of SVRG-HMC. We can then similarly set the step size η = Θ(n^{−7/6} B ε^{2/3} ∧ n^{−1/2} ε) to guarantee ε/√n sampling error, and further set B = O(n^{2/3} ε^{1/3} ∨ 1) to get an Õ(n^{2/3} ε^{−2/3} + ε^{−1}) gradient complexity for SAGA-HMC.

Control variate gradient HMC (CVG-HMC). Note that the control variate gradient adopts a fixed reference point q̂ in the entire algorithm. In the following analysis, we simply set q̂ = x^(0), which also satisfies ‖q̂ − x*‖₂² ≤ d/µ. The following theorem characterizes the convergence results of CVG-HMC in 2-Wasserstein distance.

Theorem 5.5. Under the same assumptions made in Theorem 5.2, let µ_t be the distribution of x^(t). Then if η = O(L^{−1} ∧ d µ^{−1} Γ₁^{−1}) and we set K = 1/(4√L η), the output of CVG-HMC satisfies

    W₂(µ_T, π) ≤ 2√(d/µ) (1 − (128κ)^{−1})^{T/2} + Γ₁ η^{1/2} + Γ₂ η,

where the constants Γ₁ and Γ₂ satisfy

    Γ₁² = O(L^{−1/2} B^{−1} κ³ d)  and  Γ₂² = O(κ³ d).

Gradient complexity of CVG-HMC. Theorem 5.5 shows that the sampling error of CVG-HMC is O(n^{−1/4} B^{−1/2} η^{1/2} + η), which implies that we can set the step size as η = Θ(n^{−1/2} B ε² ∧ n^{−1/2} ε) to achieve ε/√n sampling error in 2-Wasserstein distance. Then, similarly, by setting B = O(ε^{−1}) we can derive that the gradient complexity of CVG-HMC is KTB = Õ(ε^{−2}).

6. Experiments

In this section, we evaluate the empirical performance of the aforementioned four stochastic gradient HMC algorithms, including SG-HMC, SVRG-HMC, SAGA-HMC, and CVG-HMC, on both synthetic and real-world datasets. Moreover, we also include SVRG-ULD and CVG-ULD for comparison, since they have been demonstrated to perform well in both theory and experiments (Zou et al., 2018a; Chatterji et al., 2018).

Figure 2. Experimental results of Bayesian logistic regression on the Covtype dataset: (a) n = 500; (b) n = 5000; (c) n = 500; (d) n = 5000. (a)-(b): Sampling error ‖x̂ − x̄‖₂ of SG-HMC, CVG-HMC, SVRG-HMC, CVG-ULD, and SVRG-ULD as a function of the number of iterations. (c)-(d): Negative log-likelihood on the test dataset for SG-HMC, CVG-HMC, SVRG-HMC, CVG-ULD, and SVRG-ULD as a function of the number of iterations.

Table 2. Error of estimating the quantity z̄ = E_{x∼π}[x ⊙ x].

    Algorithm (HMC)      SG       SVRG     SAGA     CVG
    Error (‖ẑ − z̄‖₂)    0.070    0.0022   0.0018   0.0017

6.1. Sampling from a Multivariate Gaussian Distribution

We first evaluate the performance of these four stochastic gradient HMC algorithms for sampling from a multivariate Gaussian distribution. Specifically, given n mean vectors {µ_i}_{i=1}^n and positive definite matrices {Σ_i}_{i=1}^n, we set each component function as f_i(x) = (x − µ_i)^⊤ Σ_i (x − µ_i)/2. Then it can be seen that each component distribution, i.e., π_i ∝ exp(−f_i(x)), is a Gaussian distribution with mean µ_i and covariance matrix Σ_i^{−1}, and thus the target distribution π ∝ exp(−f(x)) is also a Gaussian distribution. In our experiments, we generate two synthetic datasets with sizes n = 500 and n = 5000. For the HMC based algorithms, we run all four algorithms using the same step size (η = {2×10⁻³, 3×10⁻⁴} for n = {500, 5000}) and mini-batch size B = 16 with 2×10⁴ steps (2000 proposals with 10 internal leapfrog steps each). For the ULD based algorithms, we follow the same configuration as in Chen et al. (2017); Chatterji et al. (2018) by setting the friction parameter as γ = 1/n and the inverse mass parameter as u = 2. The mini-batch size and iteration number are identical to those of the HMC based algorithms, and the step sizes are tuned such that the algorithms converge fast.

In order to characterize the convergence performance in terms of the distance between distributions, we run each of these four algorithms 10⁵ times in parallel, which gives 10⁵ independent samples at each iteration. Since it is not computationally efficient to exactly compute the 2-Wasserstein distance, we instead evaluate the error between the estimated mean x̂ and the true one x̄ (which can be exactly computed based on {µ_i}_{i=1}^n and {Σ_i}_{i=1}^n). We display the experimental results in Figure 1. Besides, we also report the estimation errors of the second moment z̄ = E_{x∼π}[x ⊙ x] in Table 2. It can be observed that all four stochastic gradient HMC algorithms converge, while the mini-batch stochastic gradient leads to a significantly larger sampling error than the other three stochastic gradient estimators. Additionally, when n increases, the performance of CVG becomes closer to those of SVRG-HMC and SAGA-HMC. These observations align well with our theoretical results on the HMC based algorithms stated in Table 1. Moreover, we also observe that SVRG-HMC and SAGA-HMC can outperform SVRG-ULD on the small dataset, even though in theory they have the same gradient complexity.

6.2. Bayesian Logistic Regression

We then perform Bayesian logistic regression to evaluate the empirical performance of all stochastic gradient HMC algorithms. In particular, let {z_i, y_i}_{i=1}^n be the observed training data, where z_i ∈ R^d and y_i are the feature vector and label of the i-th observation respectively. The likelihood function given the observation {z_i, y_i} is modeled by p(y_i | z_i, x) = 1/(1 + exp(−y_i x^⊤ z_i)). We then assume that the model parameter x follows a Gaussian prior p(x) = N(0, λ^{−1} I). We aim to sample from the posterior p(x | {z_i, y_i}_{i=1}^n) ∝ p(x) Π_{i=1}^n p(y_i | z_i, x). Therefore, it can be derived that the negative log-posterior function is f(x) = Σ_{i=1}^n f_i(x) with the component function f_i(x) defined by f_i(x) = log(1 + exp(−y_i x^⊤ z_i)) + λ ‖x‖₂² / (2n). We carry out the experiments on the Covtype dataset⁵, which has 581012 instances with 54 attributes. We further extract two training datasets with sizes n = {500, 5000} from the original dataset, and take the rest for testing. Similar to the experiments on the synthetic data, we use the same mini-batch size (B = 16) and step size (η = {2×10⁻³, 4×10⁻⁴} for n = {500, 5000}) for SG-HMC, SVRG-HMC, and CVG-HMC⁶. For the ULD based algorithms we use the same batch size and tune the step size such that they converge fast.

⁵Available at https://archive.ics.uci.edu/ml/datasets/covertype
⁶We point out that SAGA-HMC is extremely inefficient in generating independent samples in a parallel manner since it requires a huge memory cost, so we do not include SAGA-HMC in this part.
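For reference, the component function f_i and its gradient used in this experiment have a simple closed form. The sketch below (variable names are ours) can serve as the grad_fi oracle for any estimator in Algorithm 2.

```python
import numpy as np

def f_i(x, z_i, y_i, lam, n):
    """f_i(x) = log(1 + exp(-y_i x^T z_i)) + lam * ||x||_2^2 / (2n)."""
    margin = y_i * z_i.dot(x)
    # log(1 + exp(-margin)) computed stably via logaddexp
    return np.logaddexp(0.0, -margin) + lam * x.dot(x) / (2 * n)

def grad_fi(x, z_i, y_i, lam, n):
    """Gradient: -y_i * sigmoid(-y_i x^T z_i) * z_i + (lam / n) * x."""
    margin = y_i * z_i.dot(x)
    sigma = 1.0 / (1.0 + np.exp(margin))   # sigmoid(-margin)
    return -y_i * sigma * z_i + (lam / n) * x
```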
It can be observed that all 5 these four stochastic gradient HMC converges, while the Available at https://archive.ics.uci.edu/ml/datasets/covertype 6 mini-batch stochastic gradient leads to significantly larger We point out that SAGA-HMC is extremely inefficient in gen- erating independent samples in a parallel manner since it requires sampling error than other three stochastic gradient estima- huge memory cost. So we do not include SAGA-HMC in this part. On the Convergence of Hamiltonian Monte Carlo with Stochastic Gradients

Moreover, we run all algorithms 2×10⁴ times in parallel and obtain 2×10⁴ independent samples at each iteration. Given the generated samples, we compute the mean (in order to increase the precision of the estimation, we further apply a moving average with window size 100, so in total we use 2×10⁶ samples) and compare it to the ground truth, which is obtained by running the standard HMC algorithm (using full gradients and MH correction), and display the errors in Figures 2(a) and 2(b). It can be clearly observed that SG-HMC performs significantly worse than the other algorithms. Besides, we also observe that SVRG-HMC slightly outperforms CVG-HMC, SVRG-ULD, and CVG-ULD on the small dataset, which is consistent with the observation on the synthetic dataset. Moreover, given the estimated mean at different iterations, we evaluate the negative log-likelihood of all algorithms on the test dataset. The results are displayed in Figures 2(c) and 2(d). The plots show that the output of SG-HMC has a significantly larger bias than those of the other algorithms, while SVRG-HMC, CVG-HMC, CVG-ULD, and SVRG-ULD give similar results. This again explains the inefficiency of SG-HMC and verifies our theory.

7. Conclusion and Future Work

In this paper, we provided a general framework for proving the convergence rate of HMC with stochastic gradients. Our result shows that as long as the variance of the stochastic gradient is upper bounded along the Markov chain, stochastic gradient HMC algorithms with a properly chosen step size provably converge to the target distribution. We applied the general convergence result to four specific stochastic gradient HMC algorithms: SG-HMC, CVG-HMC, SVRG-HMC, and SAGA-HMC, and established their convergence guarantees. The results explain the inefficiency of SG-HMC, and reveal the potential prospects of the applications of CVG-HMC, SVRG-HMC, and SAGA-HMC.

One interesting future direction is to explore whether adding a Metropolis-Hastings (MH) correction in certain ways to the stochastic gradient HMC algorithm can help mitigate the bias caused by stochastic gradients in theory, which is supported by some empirical evidence (Dang et al., 2019).

Acknowledgement

We would like to thank the anonymous reviewers for their helpful comments. DZ is supported by the Bloomberg Data Science Ph.D. Fellowship. QG is partially supported by the National Science Foundation CAREER Award 1906169. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agencies.

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

Baker, J., Fearnhead, P., Fox, E. B., and Nemeth, C. Control variates for stochastic gradient MCMC. Statistics and Computing, 2018.

Bardenet, R., Doucet, A., and Holmes, C. On Markov chain Monte Carlo methods for tall data. The Journal of Machine Learning Research, 18(1):1515–1557, 2017.

Betancourt, M. The fundamental incompatibility of scalable Hamiltonian Monte Carlo and naive data subsampling. In International Conference on Machine Learning, pp. 533–540, 2015.

Bou-Rabee, N., Eberle, A., and Zimmer, R. Coupling and convergence for Hamiltonian Monte Carlo. arXiv preprint arXiv:1805.00452, 2018.

Brosse, N., Durmus, A., Moulines, É., and Pereyra, M. Sampling from a log-concave distribution with compact support with proximal Langevin Monte Carlo. In Conference on Learning Theory, pp. 319–342, 2017.

Brosse, N., Durmus, A., and Moulines, E. The promises and pitfalls of stochastic gradient Langevin dynamics. In Advances in Neural Information Processing Systems, pp. 8268–8278, 2018.

Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. Stan: A probabilistic programming language. Journal of Statistical Software, 76(1), 2017.
Chatterji, N. S., Flammarion, N., Ma, Y.-A., Bartlett, P. L., and Jordan, M. I. On the theory of variance reduction for stochastic gradient Monte Carlo. arXiv preprint arXiv:1802.05431, 2018.

Chau, H. N. and Rasonyi, M. Stochastic gradient Hamiltonian Monte Carlo for non-convex learning. arXiv preprint arXiv:1903.10328, 2019.

Chen, C., Ding, N., and Carin, L. On the convergence of stochastic gradient MCMC algorithms with high-order integrators. In Advances in Neural Information Processing Systems, pp. 2278–2286, 2015.

Chen, C., Wang, W., Zhang, Y., Su, Q., and Carin, L. A convergence analysis for a class of practical variance-reduction stochastic gradient MCMC. arXiv preprint arXiv:1709.01180, 2017.

Chen, T., Fox, E., and Guestrin, C. Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, pp. 1683–1691, 2014.

Chen, Y., Dwivedi, R., Wainwright, M. J., and Yu, B. Fast mixing of Metropolized Hamiltonian Monte Carlo: Benefits of multi-step gradients. arXiv preprint arXiv:1905.12247, 2019.

Chen, Z. and Vempala, S. S. Optimal convergence rate of Hamiltonian Monte Carlo for strongly logconcave distributions. arXiv preprint arXiv:1905.02313, 2019.

Cheng, X., Chatterji, N. S., Abbasi-Yadkori, Y., Bartlett, P. L., and Jordan, M. I. Sharp convergence rates for Langevin dynamics in the nonconvex setting. arXiv preprint arXiv:1805.01648, 2018.

Dalalyan, A. S. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017.

Dalalyan, A. S. and Karagulyan, A. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stochastic Processes and their Applications, 129(12):5278–5311, 2019.

Dalalyan, A. S., Riou-Durand, L., et al. On sampling from a log-concave density using kinetic Langevin diffusions. Bernoulli, 26(3):1956–1988, 2020.

Dang, K.-D., Quiroz, M., Kohn, R., Tran, M.-N., and Villani, M. Hamiltonian Monte Carlo with energy conserving subsampling. Journal of Machine Learning Research, 20(100):1–31, 2019.

Defazio, A., Bach, F., and Lacoste-Julien, S. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pp. 1646–1654, 2014.

Deng, W., Feng, Q., Gao, L., Liang, F., and Lin, G. Non-convex learning via replica exchange stochastic gradient MCMC. In International Conference on Machine Learning, pp. 2474–2483. PMLR, 2020.

Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.

Dubey, K. A., Reddi, S. J., Williamson, S. A., Poczos, B., Smola, A. J., and Xing, E. P. Variance reduction in stochastic gradient Langevin dynamics. In Advances in Neural Information Processing Systems, pp. 1154–1162, 2016.

Durmus, A., Moulines, E., and Saksman, E. On the convergence of Hamiltonian Monte Carlo. arXiv preprint arXiv:1705.00166, 2017.

Durmus, A., Moulines, E., et al. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Bernoulli, 25(4A):2854–2882, 2019.

Erdogdu, M. A., Mackey, L., and Shamir, O. Global non-convex optimization with discretized diffusions. In Advances in Neural Information Processing Systems, pp. 9671–9680, 2018.

Gao, X., Gurbuzbalaban, M., and Zhu, L. Breaking reversibility accelerates Langevin dynamics for global non-convex optimization. arXiv preprint arXiv:1812.07725, 2018a.

Gao, X., Gürbüzbalaban, M., and Zhu, L. Global convergence of stochastic gradient Hamiltonian Monte Carlo for non-convex stochastic optimization: Non-asymptotic performance bounds and momentum-based acceleration. arXiv preprint arXiv:1809.04618, 2018b.

Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pp. 315–323, 2013.

Langevin, P. On the theory of Brownian motion. CR Acad. Sci. Paris, 146:530–533, 1908.

Lee, Y. T., Song, Z., and Vempala, S. S. Algorithmic theory of ODEs and sampling from well-conditioned logconcave densities. arXiv preprint arXiv:1812.06243, 2018.

Li, C., Chen, C., Carlson, D., and Carin, L. Preconditioned stochastic gradient Langevin dynamics for deep neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.

Li, Z., Zhang, T., and Li, J. Stochastic gradient Hamiltonian Monte Carlo with variance reduction for Bayesian inference. arXiv preprint arXiv:1803.11159, 2018.

Lovász, L. and Simonovits, M. The mixing rate of Markov chains, an isoperimetric inequality, and computing the volume. In Proceedings [1990] 31st Annual Symposium on Foundations of Computer Science, pp. 346–354. IEEE, 1990.

Ma, Y.-A., Chen, T., and Fox, E. A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, pp. 2917–2925, 2015.

Ma, Y.-A., Chen, Y., Jin, C., Flammarion, N., and Jordan, M. I. Sampling can be faster than optimization. arXiv preprint arXiv:1811.08413, 2018.

Mangoubi, O. and Smith, A. Rapid mixing of Hamiltonian Monte Carlo on strongly log-concave distributions. arXiv preprint arXiv:1708.07114, 2017.

Mangoubi, O. and Vishnoi, N. Dimensionally tight bounds for second-order Hamiltonian Monte Carlo. In Advances in Neural Information Processing Systems, pp. 6027–6037, 2018.

Mengersen, K. L., Tweedie, R. L., et al. Rates of convergence of the Hastings and Metropolis algorithms. The Annals of Statistics, 24(1):101–121, 1996.

Neal, R. M. et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2:113–162, 2011.

Parisi, G. Correlation functions and computer simulations. Nuclear Physics B, 180(3):378–384, 1981.

Raginsky, M., Rakhlin, A., and Telgarsky, M. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In Conference on Learning The- ory, pp. 1674–1703, 2017.

Roberts, G. O. and Tweedie, R. L. Exponential convergence of Langevin distributions and their discrete approxima- tions. Bernoulli, pp. 341–363, 1996.

Shen, R. and Lee, Y. T. The randomized midpoint method for log-concave sampling. arXiv preprint arXiv:1909.05503, 2019.

Simsekli, U., Zhu, L., Teh, Y. W., and Gurbuzbalaban, M. Fractional underdamped Langevin dynamics: Retargeting SGD with momentum under heavy-tailed gradient noise. In International Conference on Machine Learning, pp. 8970–8980. PMLR, 2020.

Smith, R. L. Efficient Monte Carlo procedures for generating points uniformly distributed over bounded regions. Operations Research, 32(6):1296–1308, 1984.

Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning, pp. 681– 688, 2011.

Xu, P., Chen, J., Zou, D., and Gu, Q. Global convergence of Langevin dynamics based algorithms for nonconvex opti- mization. In Advances in Neural Information Processing Systems, pp. 3126–3137, 2018.

Zhang, Y., Liang, P., and Charikar, M. A hitting time analysis of stochastic gradient Langevin dynamics. In Conference on Learning Theory, pp. 1980–2022, 2017.

Zou, D., Xu, P., and Gu, Q. Stochastic variance-reduced Hamilton Monte Carlo methods. In Proceedings of the 35th International Conference on Machine Learning, pp. 6028–6037, 2018a.

Zou, D., Xu, P., and Gu, Q. Subsampled stochastic variance-reduced gradient Langevin dynamics. In Proceedings of International Conference on Uncertainty in Artificial Intelligence, 2018b.

Zou, D., Xu, P., and Gu, Q. Sampling from non-log-concave distributions via variance-reduced gradient Langevin dynamics. In Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pp. 2936–2945. PMLR, 2019.

Zou, D., Xu, P., and Gu, Q. Faster convergence of stochastic gradient Langevin dynamics for non-log-concave sampling. arXiv preprint arXiv:2010.09597, 2020.