Bridging the Gap between Stochastic Gradient MCMC and Stochastic Optimization

Changyou Chen† David Carlson‡ Zhe Gan† Chunyuan Li† Lawrence Carin†
†Department of Electrical and Computer Engineering, Duke University
‡Department of Statistics and Grossman Center for Statistics of Mind, Columbia University

Abstract

Stochastic gradient Markov chain Monte Carlo (SG-MCMC) methods are Bayesian analogs to popular stochastic optimization methods; however, this connection is not well studied. We explore this relationship by applying simulated annealing to an SG-MCMC algorithm. Furthermore, we extend recent SG-MCMC methods with two key components: i) adaptive preconditioners (as in Adagrad or RMSprop), and ii) adaptive element-wise momentum weights. The zero-temperature limit gives a novel stochastic optimization method with adaptive element-wise momentum weights, while conventional optimization methods only have a shared, static momentum weight. Under certain assumptions, our theoretical analysis suggests the proposed simulated annealing approach converges close to the global optimum. Experiments on several deep neural network models show state-of-the-art results compared to related stochastic optimization algorithms.

Appearing in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS) 2016, Cadiz, Spain. JMLR: W&CP volume 41. Copyright 2016 by the authors. arXiv:1512.07962v3 [stat.ML] 5 Aug 2016.

1 Introduction

Machine learning has made significant recent strides due to large-scale learning applied to "big data". Large-scale learning is typically performed with stochastic optimization, and the most common method is stochastic gradient descent (SGD) (Bottou, 2010). Stochastic optimization methods are devoted to obtaining a (local) optimum of an objective function. Alternatively, Bayesian methods aim to compute the expectation of a test function over the posterior distribution. At first glance, these methods appear to be distinct, independent approaches to learning. However, even the celebrated Gibbs sampler was first introduced to statistics as a simulated annealing method for maximum a posteriori estimation (i.e., finding an optimum) (Geman and Geman, 1984).

Recent work on large-scale Bayesian learning has focused on incorporating the speed and low-memory costs from stochastic optimization. These approaches are referred to as stochastic gradient Markov chain Monte Carlo (SG-MCMC) methods. Well-known SG-MCMC methods include stochastic gradient Langevin dynamics (SGLD) (Welling and Teh, 2011), stochastic gradient Hamiltonian Monte Carlo (SGHMC) (Chen et al., 2014), and stochastic gradient thermostats (SGNHT) (Ding et al., 2014). SG-MCMC has become increasingly popular in the literature due to practical successes, ease of implementation, and theoretical convergence properties (Teh et al., 2014; Vollmer et al., 2015; Chen et al., 2015).

There are obvious structural similarities between SG-MCMC algorithms and stochastic optimization methods. For example, SGLD resembles SGD with additive Gaussian noise. SGHMC resembles SGD with momentum (Rumelhart et al., 1986), adding Gaussian noise when updating the momentum terms (Chen et al., 2014). These similarities are detailed in Section 2. Despite these structural similarities, the theory is unclear on how additive Gaussian noise differentiates a Bayesian algorithm from its optimization analog.

Just as classical sampling methods were originally used for optimization (Geman and Geman, 1984), we directly address using SG-MCMC algorithms for optimization. A major benefit of adapting these schemes is that Bayesian learning is (in theory) able to fully explore the parameter space. Thus it may find a better local optimum, if not the global optimum, for a non-convex objective function.

Specifically, in this work we first extend the recently proposed multivariate stochastic gradient thermostat algorithm (Gan et al., 2015) with Riemannian information geometry, which results in an adaptive preconditioning and momentum scheme with analogs to Adam (Kingma and Ba, 2015) and RMSprop (Tieleman and Hinton, 2012). We propose an annealing scheme on the system temperature to move from a Bayesian method to a stochastic optimization method. We call the proposed algorithm Stochastic AnNealing Thermostats with Adaptive momentum (Santa).

We show that in the zero-temperature limit, Santa recovers the SGD with momentum algorithm except that: i) adaptive preconditioners are used when updating both model and momentum parameters; ii) each parameter has an individual, learned momentum parameter. Adaptive preconditioners and momentum weights are desirable in practice because of their ability to deal with uneven, dynamic curvature (Dauphin et al., 2015). For completeness, we first review related algorithms in Section 2, and present our novel algorithm in Section 3.

We develop theory to analyze convergence properties of our algorithm, suggesting that Santa is able to find a solution for a (non-convex) objective function close to its global optimum, shown in Section 4. The theory is based on the analysis from stochastic differential equations (Teh et al., 2014; Chen et al., 2015), and presents results on bias and variance of the annealed Markov chain. This is a fundamentally different approach from the traditional convergence explored in stochastic optimization, or the regret bounds used in online optimization. We note we can adapt the regret bound of Adam (Kingma and Ba, 2015) for our zero-temperature algorithm (with a few trivial modifications) for a convex problem, as shown in Supplementary Section F. However, this addresses neither non-convexity nor the annealing scheme, both of which our analysis covers.

In addition to theory, we demonstrate effective empirical performance on a variety of deep neural networks (DNNs), achieving the best performance compared to all competing algorithms for the same model size. This is shown in Section 5. The code is publicly available at https://github.com/cchangyou/Santa.

2 Preliminaries

Throughout this paper, we denote vectors as bold, lower-case letters, and matrices as bold, upper-case letters. We use $\odot$ for element-wise multiplication, and $\oslash$ as element-wise division; $\sqrt{\cdot}$ denotes the element-wise square root when applied to vectors or matrices. We reserve $(\cdot)^{1/2}$ for the standard matrix square root. $I_p$ is the $p \times p$ identity matrix, $\mathbf{1}$ is an all-ones vector.

The goal of an optimization algorithm is to minimize an objective function $U(\theta)$ that corresponds to a (non-convex) model of interest. In a Bayesian model, this corresponds to the potential energy defined as the negative log-posterior, $U(\theta) \triangleq -\log p(\theta) - \sum_{n=1}^{N} \log p(x_n | \theta)$. Here $\theta \in \mathbb{R}^p$ are the model parameters, and $\{x_n\}_{n=1,\ldots,N}$ are the $d$-dimensional observed data; $p(\theta)$ corresponds to the prior and $p(x_n | \theta)$ is a likelihood term for the $n$th observation. In optimization, $-\sum_{n=1}^{N} \log p(x_n | \theta)$ is typically referred to as the loss function, and $-\log p(\theta)$ as a regularizer.

In large-scale learning, $N$ is prohibitively large. This motivates the use of stochastic approximations. We denote the stochastic approximation $\tilde{U}_t(\theta) \triangleq -\log p(\theta) - \frac{N}{m} \sum_{j=1}^{m} \log p(x_{i_j} | \theta)$, where $(i_1, \cdots, i_m)$ is a random subset of the set $\{1, 2, \cdots, N\}$. The gradient on this minibatch is denoted as $\tilde{f}_t(\theta) = \nabla \tilde{U}_t(\theta)$, which is an unbiased estimate of the true gradient.

A standard approach to learning is SGD, where parameter updates are given by $\theta_t = \theta_{t-1} - \eta_t \tilde{f}_{t-1}(\theta)$ with $\eta_t$ the learning rate. This is guaranteed to converge to a local minimum under mild conditions (Bottou, 2010). The SG-MCMC analog to this is SGLD, with updates $\theta_t = \theta_{t-1} - \eta_t \tilde{f}_{t-1}(\theta) + \sqrt{2\eta_t}\, \zeta_t$. The additional term is a standard normal random vector, $\zeta_t \sim \mathcal{N}(0, I_p)$ (Welling and Teh, 2011). The SGLD method draws approximate posterior samples instead of obtaining a local minimum.

Using momentum in stochastic optimization is important in learning deep models (Sutskever et al., 2013). This motivates SG-MCMC algorithms with momentum. The standard SGD with momentum (SGD-M) approach introduces an auxiliary variable $u_t \in \mathbb{R}^p$ to represent the momentum. Given a momentum weight $\alpha$, the updates are $\theta_t = \theta_{t-1} + \eta_t u_t$ and $u_t = (1 - \alpha) u_{t-1} - \tilde{f}_{t-1}(\theta)$. A Bayesian analog is SGHMC (Chen et al., 2014) or multivariate SGNHT (mSGNHT) (Gan et al., 2015). In mSGNHT, each parameter has a unique momentum weight $\alpha_t \in \mathbb{R}^p$ that is learned during the sampling sequence. The momentum weights are updated to maintain the system temperature $1/\beta$. An inverse temperature of $\beta = 1$ corresponds to the posterior. This algorithm has updates $\theta_t = \theta_{t-1} + \eta_t u_t$, $u_t = (1 - \eta_t \alpha_{t-1}) \odot u_{t-1} - \eta_t \tilde{f}_{t-1}(\theta) + \sqrt{2\eta_t/\beta}\, \zeta_t$. The main difference is the additive Gaussian noise and step-size dependent momentum update. The weights have updates $\alpha_t = \alpha_{t-1} + \eta_t ((u_t \odot u_t) - 1/\beta)$, which matches the kinetic energy to the system temperature.

A recent idea in stochastic optimization is to use an adaptive preconditioner, also known as a variable metric, to improve convergence rates. Both Adagrad (Duchi et al., 2011) and Adam (Kingma and Ba, 2015) adapt to the local geometry with a regret bound of $O(\sqrt{N})$. Adam adds momentum as well through moment smoothing. RMSprop (Tieleman and Hinton, 2012), Adadelta (Zeiler, 2012), and RMSspectral (Carlson et al., 2015) are similar methods with preconditioners. Our method introduces adaptive momentum and preconditioners to SG-MCMC. This differs from stochastic optimization in implementation and theory, and is novel in SG-MCMC.
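As a rough illustration of the SGD and SGLD updates reviewed in this section, the NumPy sketch below runs both on a toy potential $U(\theta) = \theta^2/2$, whose posterior $e^{-U}$ is a standard normal. The function names, the toy potential, the use of an exact (noise-free) gradient, and the step size are assumptions made for this example; they are not taken from the paper.

```python
import numpy as np

def grad_U(theta):
    """Gradient of the toy potential U(theta) = theta**2 / 2 (an assumption)."""
    return theta

def sgd_step(theta, eta):
    """SGD: theta_t = theta_{t-1} - eta * f(theta_{t-1})."""
    return theta - eta * grad_U(theta)

def sgld_step(theta, eta, rng):
    """SGLD: the SGD step plus sqrt(2 * eta) * zeta, with zeta ~ N(0, I)."""
    zeta = rng.standard_normal(theta.shape)
    return theta - eta * grad_U(theta) + np.sqrt(2.0 * eta) * zeta

rng = np.random.default_rng(0)
eta, n_steps, n_chains = 0.01, 2000, 2000

# Run many independent chains in parallel, all started at theta = 3.
theta_sgd = np.full(n_chains, 3.0)
theta_sgld = np.full(n_chains, 3.0)
for _ in range(n_steps):
    theta_sgd = sgd_step(theta_sgd, eta)
    theta_sgld = sgld_step(theta_sgld, eta, rng)

sgd_spread = theta_sgd.std()    # SGD chains collapse onto the minimum at 0
sgld_spread = theta_sgld.std()  # SGLD chains keep a posterior-sized spread
print(f"SGD:  mean={theta_sgd.mean():.3f}, std={sgd_spread:.3f}")
print(f"SGLD: mean={theta_sgld.mean():.3f}, std={sgld_spread:.3f}")
```

In this toy setting the only difference between the two loops is the injected noise term: without it the iterates settle at the minimizer, and with it they spread out roughly according to the target distribution.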

Simulated annealing (Kirkpatrick et al., 1983; Černý, 1985) is well-established as a way of acquiring a local mode by moving from a high-temperature, flat surface to a low-temperature, peaky surface. It has been explored in the context of MCMC, including reversible jump MCMC (Andrieu et al., 2000), annealed importance sampling (Neal, 2001) and parallel tempering (Li et al., 2009). Traditional algorithms are based on Metropolis-Hastings sampling, which requires computationally expensive accept-reject steps. Recent work has applied simulated annealing to large-scale learning through mini-batch based annealing (van de Meent et al., 2014; Obermeyer et al., 2014). Our approach incorporates annealing into SG-MCMC with its inherent speed and mini-batch nature.

3 The Santa Algorithm

Santa extends the mSGNHT algorithm with preconditioners and a simulated annealing scheme. A simple pseudocode is shown in Algorithm 1, and a more complex but more accurate version is shown in Algorithm 2; we detail the steps below.

The first extension we consider is the use of adaptive preconditioners. Preconditioning has been proven critical for fast convergence in both stochastic optimization (Dauphin et al., 2015) and SG-MCMC algorithms (Patterson and Teh, 2013). In the MCMC literature, preconditioning is alternatively referred to as Riemannian information geometry (Patterson and Teh, 2013). We denote the preconditioner as $\{G_t \in \mathbb{R}^{p \times p}\}$. A popular choice in SG-MCMC is the Fisher information matrix (Girolami and Calderhead, 2011). Unfortunately, this approach is computationally prohibitive for many models of interest. To avoid this problem, we adopt the preconditioner from RMSprop and Adam, which uses a vector $\{g_t \in \mathbb{R}^p\}$ to approximate the diagonal of the Fisher information matrix (Li et al., 2016a). The construction sequentially updates the preconditioner based on current and historical gradients with a smoothing parameter $\sigma$, and is shown as part of Algorithm 1. While this approach will not capture the Riemannian geometry as effectively as the Fisher information matrix, it is computationally efficient.

Santa also introduces an annealing scheme on system temperatures. As discussed in Section 2, mSGNHT naturally accounts for a varying temperature by matching the particle momentum to the system temperature. We introduce $\beta = \{\beta_1, \beta_2, \cdots\}$, a sequence of inverse temperature variables with $\beta_i < \beta_j$ for $i < j$ and $\lim_{i \to \infty} \beta_i = \infty$. The infinite case corresponds to the zero-temperature limit, where SG-MCMCs become deterministic optimization methods.

The annealing scheme leads to two stages: the exploration and the refinement stages. The exploration stage updates all parameters based on an annealed sequence of stochastic dynamic systems (see Section 4 for more details). This stage is able to explore the parameter space efficiently, escape poor local modes, and finally converge close to the global mode. The refinement stage corresponds to the zero-temperature limit, i.e., $\beta_n \to \infty$. In this limit, the momentum weight updates vanish and Santa becomes a stochastic optimization algorithm.

We propose two update schemes to solve the corresponding stochastic differential equations: the Euler scheme and the symmetric splitting scheme (SSS). The Euler scheme has simpler updates, as detailed in Algorithm 1, while SSS offers increased accuracy (Chen et al., 2015) with a slight increase in overhead computation, as shown in Algorithm 2. Section 4.1 elaborates on the details of these two schemes. We recommend the use of SSS, but the Euler scheme is simpler to implement and compare to known algorithms.

Practical considerations According to Section 4, the exploration stage helps the algorithm traverse the parameter space following the posterior curve as accurately as possible. For optimization, slightly biased samples do not affect the final solution. As a result, the term consisting of $(1 - g_{t-1} \oslash g_t)$ in the algorithm (which is an approximation term, see Section 4.1) is ignored. We found no decrease in performance in our experiments. Furthermore, the term $g_{t-1}$ associated with the Gaussian noise could be replaced with a fixed constant without affecting the algorithm.

Algorithm 1: Santa with the Euler scheme
  Input: $\eta_t$ (learning rate), $\sigma$, $\lambda$, burnin, $\beta = \{\beta_1, \beta_2, \cdots\} \to \infty$, $\{\zeta_t \in \mathbb{R}^p\} \sim \mathcal{N}(0, I_p)$.
  Initialize $\theta_0$, $u_0 = \sqrt{\eta} \times \mathcal{N}(0, I)$, $\alpha_0 = \sqrt{\eta}\, C$, $v_0 = 0$;
  for $t = 1, 2, \ldots$ do
    Evaluate $\tilde{f}_t \triangleq \nabla_\theta \tilde{U}(\theta_{t-1})$ on the $t$-th mini-batch;
    $v_t = \sigma v_{t-1} + \frac{1-\sigma}{N^2} \tilde{f}_t \odot \tilde{f}_t$;
    $g_t = 1 \oslash \sqrt{\lambda + \sqrt{v_t}}$;
    if $t <$ burnin then
      /* exploration */
      $\alpha_t = \alpha_{t-1} + (u_{t-1} \odot u_{t-1} - \eta/\beta_t)$;
      $u_t = \frac{\eta}{\beta_t} (1 - g_{t-1} \oslash g_t) \oslash u_{t-1} + \sqrt{\frac{2\eta}{\beta_t} g_{t-1}} \odot \zeta_t$
    else
      /* refinement */
      $\alpha_t = \alpha_{t-1}$; $u_t = 0$;
    end
    $u_t = u_t + (1 - \alpha_t) \odot u_{t-1} - \eta\, g_t \odot \tilde{f}_t$;
    $\theta_t = \theta_{t-1} + g_t \odot u_t$;
  end
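The Euler scheme of Algorithm 1 can be sketched in NumPy roughly as follows. This is a minimal sketch under stated assumptions, not the authors' released implementation: the function and argument names are hypothetical, the $(1 - g_{t-1} \oslash g_t)$ term is dropped as the practical considerations allow, the noise vector is passed in explicitly so that a single step is reproducible, and the value used for $g_0$ (all ones) is a placeholder because Algorithm 1 does not specify it.

```python
import numpy as np

def santa_euler_step(theta, u, alpha, v, g_prev, grad, zeta, *,
                     eta, beta_t, exploring, sigma=0.99, lam=1e-8, n_data=1):
    """One Euler-scheme Santa update in the spirit of Algorithm 1.

    The (1 - g_{t-1} / g_t) correction term is omitted (see the practical
    considerations). All names are illustrative assumptions.
    """
    # RMSprop-style second-moment estimate and preconditioner g_t.
    v = sigma * v + (1.0 - sigma) / n_data**2 * grad * grad
    g = 1.0 / np.sqrt(lam + np.sqrt(v))

    if exploring:
        # Thermostat update of the element-wise momentum weights.
        alpha = alpha + (u * u - eta / beta_t)
        # Injected Gaussian noise, scaled by the previous preconditioner.
        u_new = np.sqrt(2.0 * eta / beta_t * g_prev) * zeta
    else:
        # Refinement: alpha stays frozen and no noise is injected.
        u_new = np.zeros_like(u)

    u_new = u_new + (1.0 - alpha) * u - eta * g * grad
    theta = theta + g * u_new
    return theta, u_new, alpha, v, g

def run_santa(grad_fn, theta0, n_iter, burnin, eta=1e-3, C=1.0, gamma=1.0, seed=0):
    """Driver loop using beta_t = t**gamma; initial values follow Algorithm 1."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    u = np.sqrt(eta) * rng.standard_normal(theta.shape)
    alpha = np.full(theta.shape, np.sqrt(eta) * C)
    v = np.zeros_like(theta)
    g = np.ones_like(theta)  # placeholder for g_0 (an assumption)
    for t in range(1, n_iter + 1):
        zeta = rng.standard_normal(theta.shape)
        theta, u, alpha, v, g = santa_euler_step(
            theta, u, alpha, v, g, grad_fn(theta), zeta,
            eta=eta, beta_t=float(t) ** gamma, exploring=t < burnin)
    return theta

# Toy usage: a few steps on U(theta) = 0.5 * ||theta - 1||^2 (an assumption).
theta_demo = run_santa(lambda th: th - 1.0, [0.0, 0.0], n_iter=10, burnin=5)
print(theta_demo)
```

Passing `zeta` into the step function is a design choice for this sketch: it keeps the update deterministic given its inputs, which makes individual exploration and refinement steps easy to check by hand.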

Algorithm 2: Santa with SSS
  Input: $\eta_t$ (learning rate), $\sigma$, $\lambda$, burnin, $\beta = \{\beta_1, \beta_2, \cdots\} \to \infty$, $\{\zeta_t \in \mathbb{R}^p\} \sim \mathcal{N}(0, I_p)$.
  Initialize $\theta_0$, $u_0 = \sqrt{\eta} \times \mathcal{N}(0, I)$, $\alpha_0 = \sqrt{\eta}\, C$, $v_0 = 0$;
  for $t = 1, 2, \ldots$ do
    Evaluate $\tilde{f}_t \triangleq \nabla_\theta \tilde{U}(\theta_{t-1})$ on the $t$-th mini-batch;
    $v_t = \sigma v_{t-1} + \frac{1-\sigma}{N^2} \tilde{f}_t \odot \tilde{f}_t$;
    $g_t = 1 \oslash \sqrt{\lambda + \sqrt{v_t}}$;
    $\theta_t = \theta_{t-1} + g_t \odot u_{t-1} / 2$;
    if $t <$ burnin then
      /* exploration */
      $\alpha_t = \alpha_{t-1} + (u_{t-1} \odot u_{t-1} - \eta/\beta_t) / 2$;
      $u_t = \exp(-\alpha_t / 2) \odot u_{t-1}$;
      $u_t = u_t - g_t \odot \tilde{f}_t\, \eta + \sqrt{2 g_{t-1}\, \eta/\beta_t} \odot \zeta_t + \frac{\eta}{\beta_t} (1 - g_{t-1} \oslash g_t) \oslash u_{t-1}$;
      $u_t = \exp(-\alpha_t / 2) \odot u_t$;
      $\alpha_t = \alpha_t + (u_t \odot u_t - \eta/\beta_t) / 2$;
    else
      /* refinement */
      $\alpha_t = \alpha_{t-1}$; $u_t = \exp(-\alpha_t / 2) \odot u_{t-1}$;
      $u_t = u_t - g_t \odot \tilde{f}_t\, \eta$; $u_t = \exp(-\alpha_t / 2) \odot u_t$;
    end
    $\theta_t = \theta_t + g_t \odot u_t / 2$;
  end

4 Theoretical Foundation

In this section we present the stochastic differential equations (SDEs) that correspond to the Santa algorithm. We first introduce the general SDE framework, then describe the exploration stage in Section 4.1 and the refinement stage in Section 4.2. We give the convergence properties of the numerical scheme in Section 4.3. This theory uses tools from the SDE literature and extends the mSGNHT theory (Ding et al., 2014; Gan et al., 2015).

The SDEs are presented with re-parameterized $p = u/\eta^{1/2}$, $\Xi = \mathrm{diag}(\alpha)/\eta^{1/2}$, as in Ding et al. (2014). The SDEs describe the motion of a particle in a system where $\theta$ is the location and $p$ is the momentum.

In mSGNHT, the particle is driven by a force $-\nabla_\theta \tilde{U}_t(\theta)$ at time $t$. The stationary distribution of $\theta$ corresponds to the model posterior (Gan et al., 2015). Our critical extension is the use of Riemannian information geometry, important for fast convergence (Patterson and Teh, 2013). Given an inverse temperature $\beta$, the system is described by the following SDEs^1:

$$
\begin{cases}
d\theta = G_1(\theta)\, p\, dt \\
dp = \left( -G_1(\theta) \nabla_\theta U(\theta) - \Xi p + \frac{1}{\beta} \nabla_\theta G_1(\theta) + G_1(\theta) (\Xi - G_2(\theta)) \nabla_\theta G_2(\theta) \right) dt + \left( \frac{2}{\beta} G_2(\theta) \right)^{1/2} dw \\
d\Xi = \left( Q - \frac{1}{\beta} I \right) dt ,
\end{cases}
\qquad (1)
$$

where $Q = \mathrm{diag}(p \odot p)$, $w$ is standard Brownian motion, $G_1(\theta)$ encodes geometric information of the potential energy $U(\theta)$, and $G_2(\theta)$ characterizes the manifold geometry of the Brownian motion. Note $G_2(\theta)$ may be the same as $G_1(\theta)$ for the same Riemannian manifold. We call $G_1(\theta)$ and $G_2(\theta)$ Riemannian metrics, which are commonly defined by the Fisher information matrix (Girolami and Calderhead, 2011). We use the RMSprop preconditioner (with updates from Algorithm 1) for computational feasibility. Using the Fokker-Planck equation (Risken, 1989), we show that the marginal stationary distribution of (1) corresponds to the posterior distribution.

Footnote 1: We abuse notation for conciseness. Here, $\nabla_\theta G(\theta)$ is a vector with the $i$-th element being $\sum_j \nabla_{\theta_j} G_{ij}(\theta)$.

Lemma 1. Denote $A : B \triangleq \mathrm{tr}\{A^T B\}$. The stationary distribution of (1) is:

$$
p_\beta(\theta, p, \Xi) \propto \exp\left( -\beta U(\theta) - \frac{\beta}{2} p^T p - \frac{\beta}{2} (\Xi - G_2(\theta)) : (\Xi - G_2(\theta)) \right) . \qquad (2)
$$

An inverse temperature $\beta = 1$ corresponds to the standard Bayesian posterior.

We note that $p$ in (1) has additional dependencies on $G_1$ and $G_2$ compared to Gan et al. (2015) that must be accounted for. $\Xi p$ introduces friction into the system so that the particle does not move too far away by the random force; the terms $\nabla_\theta G_1(\theta)$ and $\nabla_\theta G_2(\theta)$ penalize the influences of the Riemannian metrics so that the stationary distribution remains invariant.

4.1 Exploration

The first stage of Santa, exploration, explores the parameter space to obtain parameters near the global mode of an objective function^2. This approach applies ideas from simulated annealing (Kirkpatrick et al., 1983). Specifically, the inverse temperature $\beta$ is slowly annealed to temperature zero to freeze the particles at the global mode.

Footnote 2: This requires an ergodic algorithm. While ergodicity is not straightforward to check, we follow most MCMC work and assume it holds in our algorithm.

Minimizing $U(\theta)$ is equivalent to sampling from the zero-temperature limit $p_\beta(\theta) \triangleq \frac{1}{Z_\beta} e^{-\beta U(\theta)}$ (proportional to (2)), with $Z_\beta$ being the normalization constant such that $p_\beta(\theta)$ is a valid distribution. We construct a Markov chain that sequentially transits from high temperatures to low temperatures. At the state equilibrium, the chain reaches the temperature limit with marginal stationary distribution $\rho_0(\theta) \triangleq \lim_{\beta \to \infty} e^{-\beta U(\theta)}$, a point mass^3 located at the global mode of $U(\theta)$. Specifically, we first define a sequence of inverse temperatures, $(\beta_1, \beta_2, \cdots, \beta_L)$, such that $\beta_L$ is large enough^4. For each time $t$, we generate a sample according to the SDE system (1) with temperature $1/\beta_t$, conditioned on the sample from the previous temperature, $1/\beta_{t-1}$. We call this procedure annealing thermostats to denote the analog to simulated annealing.

Footnote 3: The sampler samples a uniform distribution over global modes, or a point mass if the mode is unique. We assume uniqueness and say point mass for clarity henceforth.

Footnote 4: Due to numerical issues, it is impossible to set $\beta_L$ to infinity; we thus assign a large enough value for it and handle the infinity case in the refinement stage.

Generating approximate samples Generating exact samples from (1) is infeasible for general models. One well-known numerical approach is the Euler scheme in Algorithm 1. The Euler scheme is a 1st-order method with relatively high approximation error (Chen et al., 2015). We increase accuracy by implementing the symmetric splitting scheme (SSS) (Chen et al., 2015; Li et al., 2016b). The idea of SSS is to split an infeasible SDE into several sub-SDEs, where each sub-SDE is analytically solvable; approximate samples are generated by sequentially evolving parameters via these sub-SDEs. Specifically, in Santa, we split (1) into the following three sub-SDEs:

$$
A: \begin{cases} d\theta = G_1(\theta)\, p\, dt \\ dp = 0 \\ d\Xi = \left( Q - \frac{1}{\beta} I \right) dt \end{cases}
\qquad
B: \begin{cases} d\theta = 0 \\ dp = -\Xi p\, dt \\ d\Xi = 0 \end{cases}
$$

$$
O: \begin{cases} d\theta = 0 \\ dp = \left( -G_1(\theta) \nabla_\theta U(\theta) + \frac{1}{\beta} \nabla_\theta G_1(\theta) + G_1(\theta) (\Xi - G_2(\theta)) \nabla_\theta G_2(\theta) \right) dt + \left( \frac{2}{\beta} G_2(\theta) \right)^{1/2} dw \\ d\Xi = 0 \end{cases}
$$

We then update the sub-SDEs in order A-B-O-B-A to generate approximate samples (Chen et al., 2015). This uses half-steps $h/2$ on the A and B updates^5, and full steps $h$ in the O update. This is analogous to the leapfrog steps in Hamiltonian Monte Carlo (Neal, 2011). Update equations are given in the Supplementary Section A. The resulting parameters then serve as an approximate sample from the posterior distribution with the inverse temperature of $\beta$. Replacing $G_1$ and $G_2$ with the RMSprop preconditioners gives Algorithm 2. These updates require approximations to $\nabla_\theta G_1(\theta)$ and $\nabla_\theta G_2(\theta)$, addressed below.

Footnote 5: As in Ding et al. (2014), we define $h = \sqrt{\eta}$.

Approximate calculation for $\nabla_\theta G_1(\theta)$ We propose a computationally efficient approximation for calculating the derivative vector $\nabla_\theta G_1(\theta)$ based on the definition. Specifically, for the $i$-th element of $\nabla_\theta G_1(\theta)$ at the $t$-th iteration, denoted as $(\nabla_\theta G_1^t(\theta))_i$, it is approximated as:

$$
(\nabla_\theta G_1^t(\theta))_i \overset{A_1}{\approx} \sum_j \frac{(G_1^t(\theta))_{ij} - (G_1^{t-1}(\theta))_{ij}}{\theta_{tj} - \theta_{(t-1)j}} \overset{A_2}{=} \sum_j \frac{(\Delta G_1^t)_{ij}}{(G_1^t(\theta)\, p_{t-1})_j\, h} = \sum_j \frac{(\Delta G_1^t)_{ij}}{(G_1^t(\theta)\, u_{t-1})_j} ,
$$

where $\Delta G_1^t \triangleq G_1^t - G_1^{t-1}$. Step $A_1$ follows by the definition of a derivative, and $A_2$ by using the update equation for $\theta_t$, i.e., $\theta_t = \theta_{t-1} + G_1^t(\theta)\, p_{t-1} h$. According to Taylor's theorem, the approximation error for $\nabla_\theta G_1(\theta)$ is $O(h)$, e.g.,

$$
\sum_i \left| \sum_j \frac{(\Delta G_1^t)_{ij}}{(G_1^t(\theta)\, u_{t-1})_j} - (\nabla_\theta G_1^t(\theta))_i \right| \le B_t h , \qquad (3)
$$

for some positive constant $B_t$. The approximation error is negligible in terms of convergence behavior because it can be absorbed into the stochastic gradient error. Formal theoretical analysis of convergence behavior with this approximation is given in later sections. Using similar methods, $\nabla_\theta G_2^t(\theta)$ is also approximately calculated.

4.2 Refinement

The refinement stage corresponds to the zero-temperature limit of the exploration stage, where $\Xi$ is learned. We show that in the limit Santa gives significantly simplified updates, leading to a stochastic optimization algorithm similar to Adam or SGD-M.

We assume that the Markov chain has reached its equilibrium after the exploration stage. In the zero-temperature limit, some terms in the SDE (1) vanish. First, as $\beta \to \infty$, the term $\frac{1}{\beta} \nabla_\theta G_1(\theta)$ and the variance term for the Brownian motion approach 0. As well, the thermostat variable $\Xi$ approaches $G_2(\theta)$, so the term $G_1(\theta)(\Xi - G_2(\theta)) \nabla_\theta G_2(\theta)$ vanishes. The stationary distribution in (2) implies $\mathbb{E} Q_{ii} \triangleq \mathbb{E} p_i^2 \to 0$, which makes the SDE for $\Xi$ in (1) vanish. As a result, in the refinement stage, only $\theta$ and $p$ need to be updated. The Euler scheme for this is shown in Algorithm 1, and the symmetric splitting scheme is shown in Algorithm 2.

Relation to stochastic optimization algorithms In the refinement stage Santa is a stochastic optimization algorithm. This relation is easier to see with the Euler scheme in Algorithm 1. Compared with SGD-M (Rumelhart et al., 1986), Santa has both adaptive gradient and adaptive momentum updates. Unlike Adagrad (Duchi et al., 2011) and RMSprop (Tieleman and Hinton, 2012), refinement Santa is a momentum-based algorithm.

The recently proposed Adam algorithm (Kingma and Ba, 2015) incorporates momentum and preconditioning in what is denoted as "adaptive moments." We show in Supplementary Section F that a constant step size combined with a change of variables nearly recovers the Adam algorithm with element-wise momentum weights. For these reasons, Santa serves as a more general stochastic optimization algorithm that extends all current algorithms. As well, for a convex problem and a few trivial algorithmic changes, the regret bound of Adam holds for refinement Santa, which is $O(\sqrt{T})$, as detailed in Supplementary Section F. However, our analysis is focused on non-convex problems that do not fit in the regret bound formulation.

4.3 Convergence properties

Our convergence properties are based on the framework of Chen et al. (2015). The proofs for all theorems are given in the Supplementary Material. We focus on the exploration stage of the algorithm. Using the Monotone Convergence argument (Schechter, 1997), the refinement stage convergence is obtained by taking the zero-temperature limit of the results for the exploration stage. We emphasize that our approach differs from conventional stochastic optimization or online optimization approaches. Our convergence rate is weaker than many stochastic optimization methods, including SGD; however, our analysis applies to non-convex problems, whereas traditionally convergence rates only apply to convex problems.

The goal of Santa is to obtain $\theta^*$ such that $\theta^* = \arg\min_\theta U(\theta)$. Let $\{\theta_1, \cdots, \theta_L\}$ be a sequence of parameters collected from the algorithm. Define $\hat{U} \triangleq \frac{1}{L} \sum_{t=1}^{L} U(\theta_t)$ as the sample average, and $\bar{U} \triangleq U(\theta^*)$ the global optimum of $U(\theta)$.

As in Chen et al. (2015), we require certain assumptions on the potential energy $U$. To show these assumptions, we first define a functional $\psi_t$ for each $t$ that solves the following Poisson equation:

$$
\mathcal{L}_t \psi_t(\theta_t) = U(\theta_t) - \bar{U} , \qquad (4)
$$

where $\mathcal{L}_t$ is the generator of the SDE system (1) in the $t$-th iteration, defined as $\mathcal{L}_t f(x_t) \triangleq \lim_{h \to 0^+} \frac{\mathbb{E}[f(x_{t+h})] - f(x_t)}{h}$, with $x_t \triangleq (\theta_t, p_t, \Xi_t)$ and $f : \mathbb{R}^{3p} \to \mathbb{R}$ a compactly supported twice differentiable function. The solution functional $\psi_t(\theta_t)$ characterizes the difference between $U(\theta_t)$ and the global optimum $\bar{U}$ for every $\theta_t$. As shown in Mattingly et al. (2010), (4) typically possesses a unique solution, which is at least as smooth as $U$ under the elliptic or hypoelliptic settings. We assume $\psi_t$ is bounded and smooth, as described below.

Assumption 1. $\psi_t$ and its up to 3rd-order derivatives, $D^k \psi_t$, are bounded by a function $V(\theta, p, \Xi)$, i.e., $\|D^k \psi\| \le C_k V^{r_k}$ for $k = (0, 1, 2, 3)$, $C_k, r_k > 0$. Furthermore, the expectation of $V$ is bounded: $\sup_t \mathbb{E} V^r(\theta, p, \Xi) < \infty$, and $V$ is smooth such that $\sup_{s \in (0,1)} V^r(s x + (1 - s) y) \le C (V^r(x) + V^r(y))$, $\forall x \in \mathbb{R}^{3p}$, $y \in \mathbb{R}^{3p}$, $r \le \max\{2 r_k\}$ for some $C > 0$.

Let $\Delta U(\theta) \triangleq U(\theta) - U(\theta^*)$. Further define an operator $\Delta V_t = \left( G_1(\theta)(\nabla_\theta \tilde{U}_t - \nabla_\theta U) + \frac{B_t}{\beta_t} \right) \cdot \nabla_p$ for each $t$, where $B_t$ is from (3). Theorem 2 depicts the closeness of $\hat{U}$ to the global optimum $\bar{U}$ in terms of bias and mean square error (MSE) defined below.

Theorem 2. Let $\|\cdot\|$ be the operator norm. Under Assumption 1, the bias and MSE of the exploration stage in Santa with respect to the global optimum for $L$ steps with stepsize $h$ are bounded, for some constants $C > 0$ and $D > 0$, with:

$$
\text{Bias:} \quad \left| \mathbb{E} \hat{U} - \bar{U} \right| \le C e^{-U(\theta^*)} \left( \frac{1}{L} \sum_{t=1}^{L} \int e^{-\beta_t \Delta U(\theta)} \, d\theta \right) + D \left( \frac{1}{Lh} + \frac{\sum_t \| \mathbb{E} \Delta V_t \|}{L} + h^2 \right) .
$$

$$
\text{MSE:} \quad \mathbb{E} \left( \hat{U} - \bar{U} \right)^2 \le C^2 e^{-2U(\theta^*)} \left( \frac{1}{L} \sum_{t=1}^{L} \int e^{-\beta_t \Delta U(\theta)} \, d\theta \right)^2 + D^2 \left( \frac{\frac{1}{L} \sum_t \mathbb{E} \| \Delta V_t \|^2}{L} + \frac{1}{Lh} + h^4 \right) .
$$

Both bounds for the bias and MSE have two parts. The first part contains integration terms, which characterize the distance between the global optimum, $e^{-U(\theta^*)}$, and the unnormalized annealing distributions, $e^{-\beta_t U(\theta)}$, decreasing to zero exponentially fast with increasing $\beta$; the remaining part characterizes the distance between the sample average and the annealing posterior average. This shares a similar form as in general SG-MCMC algorithms (Chen et al., 2015), and can be controlled to converge. Furthermore, the term $\frac{\sum_t \| \mathbb{E} \Delta V_t \|}{L}$ in the bias vanishes as long as the sum of the inverse annealing sequence $\{1/\beta_t\}$ is finite^6 (recall that $\Delta V_t$ contains the term $B_t/\beta_t$), indicating that the gradient approximation for $\nabla_\theta G_1(\theta)$ in Section 4.1 does not affect the bias of the algorithm. Similar arguments apply for the MSE bound.

Footnote 6: In practice we might not need to care about this constraint because a small bias in the exploration stage does not affect convergence of the refinement stage.

To get convergence results right before the refinement stage, let a sequence of functions $\{g_m\}$ be defined as $g_m \triangleq -\frac{1}{L} \sum_{l=m}^{L+m-1} e^{-\beta_l \hat{U}(\theta)}$; it is easy to see that $\{g_m\}$ satisfies $g_{m_1} < g_{m_2}$ for $m_1 < m_2$, and $\lim_{m \to \infty} g_m = 0$. According to the Monotone Convergence Theorem (Schechter, 1997), the bias and MSE in the limit exist, leading to Corollary 3.

4.0 to the global optima for L steps with stepsize h are SGD 1.4 3.5 bounded, for some constants D1 > 0, D2 > 0, as SGD-M SGLD 1.2 3.0 RMSprop  P  Adam 1.0 1 t kE∆Vtk 2 2.5 Bias: Uˆ − U¯ ≤ D1 + + h Santa E 0.8 2.0

Lh L Test Error (%) Test Error (%)

! 0.6  2 1 P k∆V k2 1.5 ˆ ¯ L t E t 1 4 MSE: E U − U ≤ D2 + + h 1.0 0.4 L Lh 0 20 40 60 80 100 0 20 40 60 80 100 Epochs Epochs

Corollary 3 implies that in the refinement stage, the discrepancy between the annealing distributions and the global optimum vanishes, leaving only errors from the discretized simulation of the SDEs, similar to the result for general SG-MCMC (Chen et al., 2015). We note that after exploration, Santa becomes a pure stochastic optimization algorithm, so convergence results in terms of regret bounds can also be derived; refer to Supplementary Section F for more details.

5 Experiments

5.1 Illustration

To demonstrate that Santa is able to reach the global mode of an objective function, we consider the double-well potential (Ding et al., 2014),

$$U(\theta) = (\theta + 4)(\theta + 1)(\theta - 1)(\theta - 3)/14 + 0.5 .$$

As shown in Figure 1 (left), the double-well potential has two modes, located at $\theta = -3$ and $\theta = 2$, with the global optimum at $\theta = -3$. We use a decreasing learning rate $h_t = t^{-0.3}/10$, and the annealing sequence is set to $\beta_t = t^2$. To make the optimization more challenging, we initialize the parameter at $\theta_0 = 4$, close to the local mode. The evolution of $\theta$ with respect to iterations is shown in Figure 1 (right). As can be seen, $\theta$ first moves to the local mode but quickly jumps out and moves to the global mode in the exploration stage (the first half of the iterations); in the refinement stage, $\theta$ quickly converges to the global mode and stays there afterwards. In contrast, RMSprop is trapped at the local optimum, and converges more slowly than Santa at the beginning.

Figure 1: (Left) Double-well potential. (Right) The evolution of θ using Santa and RMSprop algorithms.

5.2 Feedforward neural networks

We first test Santa on the Feedforward Neural Network (FNN) with rectified linear units (ReLU). We test two-layer models with network sizes 784-X-X-10, where X is the number of hidden units for each layer; 100 epochs are used. For variants of Santa, we denote Santa-E as Santa with an Euler scheme illustrated in Algorithm 1, and Santa-r as Santa running only on the refinement stage, but with updates on α as in the exploration stage. We compare Santa with SGD, SGD-M, RMSprop, Adam, SGD with dropout, SGLD and Bayes by Backprop (Blundell et al., 2015). We use a grid search to obtain good learning rates for each algorithm, resulting in $4 \times 10^{-6}$ for Santa, $5 \times 10^{-4}$ for RMSprop, $10^{-3}$ for Adam, and $5 \times 10^{-1}$ for SGD, SGD-M and SGLD. We choose an annealing schedule of $\beta_t = A t^{\gamma}$ with $A = 1$ and $\gamma$ selected from 0.1 to 1 with an interval of 0.1. For simplicity, the exploration is set to take half of the total iterations.

We test the algorithms on the standard MNIST dataset, which contains 28 × 28 handwritten digit images from 10 classes with 60,000 training samples and 10,000 test samples. The network size (X-X) is set to 400-400 and 800-800, and test classification errors are shown in Table 1. Santa achieves the lowest test error among all the algorithms compared. The Euler scheme shows a slight decrease in performance, due to the integration error when solving the SDE. Santa without exploration (i.e., Santa-r) still performs relatively well. Learning curves are plotted in Figure 2, showing that Santa converges as fast as other algorithms but to a better local optimum.7

Figure 2: Learning curves of different algorithms on MNIST. (Left) FNN with size of 400. (Right) CNN.

7 Learning curves of FNN with size of 800 are provided in Supplementary Section G.

5.3 Convolutional neural networks

We next test Santa on the Convolutional Neural Network (CNN). Following Jarrett et al. (2009), a standard network configuration with 2 convolutional layers followed by 2 fully-connected layers is adopted. Both convolutional layers use 5 × 5 filter size with 32 and 64
channels, respectively; 2 × 2 max pooling is used after each convolutional layer. The fully-connected layers have 200-200 hidden nodes with ReLU activation. The same parameter setting and dataset as in the FNN experiments are used. The test errors are shown in Table 1, and the corresponding learning curves are shown in Figure 2. Similar trends as in FNN are obtained. Santa significantly outperforms the other optimization algorithms, with an error of 0.47%. This result is comparable to or even better than some recent state-of-the-art CNN-based systems, which have much more complex architectures.

Algorithms             FNN-400   FNN-800   CNN
Santa                  1.21%     1.16%     0.47%
Santa-E                1.41%     1.27%     0.58%
Santa-r                1.45%     1.40%     0.49%
Adam                   1.53%     1.47%     0.59%
RMSprop                1.59%     1.43%     0.64%
SGD-M                  1.66%     1.72%     0.77%
SGD                    1.72%     1.47%     0.81%
SGLD                   1.64%     1.41%     0.71%
BPB (⋄)                1.32%     1.34%     −
SGD, Dropout (⋄)       1.51%     1.33%     −
Stoc. Pooling (▷)      −         −         0.47%
NIN, Dropout (◦)       −         −         0.47%
Maxout, Dropout (⋆)    −         −         0.45%

Table 1: Test error on MNIST classification using FNN and CNN. (⋄) taken from Blundell et al. (2015). (▷) taken from Zeiler and Fergus (2013). (◦) taken from Lin et al. (2014). (⋆) taken from Goodfellow et al. (2013).

5.4 Recurrent neural networks

We test Santa on the Recurrent Neural Network (RNN) for sequence modeling, where a model is trained to minimize the negative log-likelihood of training sequences:

$$\min_{\theta} \; \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} -\log p(x_t^n \mid x_1^n, \ldots, x_{t-1}^n; \theta) \qquad (5)$$

where $\theta$ is a set of model parameters and $\{x_t^n\}$ is the observed data. The conditional distributions in (5) are modeled by the RNN. The hidden units are set to gated recurrent units (Cho et al., 2014).

We consider the task of sequence modeling on four different polyphonic music datasets of piano, i.e., Piano-midi.de (Piano), Nottingham (Nott), MuseData (Muse) and JSB chorales (JSB). Each of these datasets is represented as a collection of 88-dimensional binary sequences that span the whole range of the piano from A0 to C8.

The number of hidden units is set to 200. Each model is trained for at most 100 epochs. According to the experiments and their results on the validation set, we use a learning rate of 0.001 for all the algorithms. For Santa, we consider an additional experiment using a learning rate of 0.0002, denoted Santa-s. The annealing coefficient $\gamma$ is set to 0.5. Gradients are clipped if the norm of the parameter vector exceeds 5. We do not perform any dataset-specific tuning other than early stopping on validation sets. Each update is done using a minibatch of one sequence.

Among the algorithms we ran, Santa achieves the best negative log-likelihood on the test set for all four datasets, as shown in Table 2. Learning curves on the Piano dataset are plotted in Figure 3. We observe that Santa achieves fast convergence, but is overfitting. This is straightforwardly addressed through early stopping. The learning curves for all the other datasets are provided in Supplementary Section G.

Figure 3: Learning curves of different algorithms on Piano using RNN. (Left) training set. (Right) validation set.

Algorithms    Piano.   Nott.   Muse.   JSB.
Santa         7.60     3.39    7.20    8.46
Adam          8.00     3.70    7.56    8.51
RMSprop       7.70     3.48    7.22    8.52
SGD-M         8.32     3.60    7.69    8.59
SGD           11.13    5.26    10.08   10.81
HF (⋄)        7.66     3.89    7.19    8.58
SGD-M (⋄)     8.37     4.46    8.13    8.71

Table 2: Test negative log-likelihood results on polyphonic music datasets using RNN. (⋄) taken from Boulanger-Lewandowski et al. (2012).

6 Conclusions

We propose Santa, an annealed SG-MCMC method for stochastic optimization. Santa is able to explore the parameter space efficiently and, through annealing, to locate a point close to the global optimum. At the zero-temperature limit, Santa gives a novel stochastic optimization algorithm where both model parameters and momentum are updated element-wise and adaptively. We provide theory on the convergence of Santa to the global optimum for a (non-convex) objective function. Experiments show the best results on several deep models compared to related stochastic optimization algorithms.

Acknowledgements

This research was supported in part by ARO, DARPA, DOE, NGA, ONR and NSF.

References

C. Andrieu, N. de Freitas, and A. Doucet. Reversible jump MCMC simulated annealing for neural networks. In UAI, 2000.
C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks. In ICML, 2015.
L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proc. COMPSTAT, 2010.
N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In ICML, 2012.
D. E. Carlson, E. Collins, Y.-P. Hsieh, L. Carin, and V. Cevher. Preconditioned spectral descent for deep learning. In Advances in Neural Information Processing Systems, pages 2953–2961, 2015.
C. Chen, N. Ding, and L. Carin. On the convergence of stochastic gradient MCMC algorithms with high-order integrators. In NIPS, 2015.
T. Chen, E. B. Fox, and C. Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In ICML, 2014.
K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In arXiv:1406.1078, 2014.
Y. N. Dauphin, H. de Vries, and Y. Bengio. Equilibrated adaptive learning rates for non-convex optimization. In NIPS, 2015.
N. Ding, Y. Fang, R. Babbush, C. Chen, R. D. Skeel, and H. Neven. Bayesian sampling using stochastic gradient thermostats. In NIPS, 2014.
J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. In JMLR, 2011.
Z. Gan, C. Chen, R. Henao, D. Carlson, and L. Carin. Scalable deep Poisson factor analysis for topic modeling. In ICML, 2015.
S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. In PAMI, 1984.
M. Girolami and B. Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. In JRSS, 2011.
I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In ICML, 2013.
K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In ICCV, 2009.
D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
S. Kirkpatrick, C. D. G. Jr, and M. P. Vecchi. Optimization by simulated annealing. In Science, 1983.
C. Li, C. Chen, D. Carlson, and L. Carin. Preconditioned stochastic gradient Langevin dynamics for deep neural networks. In AAAI, 2016a.
C. Li, C. Chen, K. Fan, and L. Carin. High-order stochastic gradient thermostats for Bayesian learning of deep models. In AAAI, 2016b.
Y. Li, V. A. Protopopescu, N. Arnold, X. Zhang, and A. Gorin. Hybrid parallel tempering and simulated annealing method. In Applied Mathematics and Computation, 2009.
M. Lin, Q. Chen, and S. Yan. Network in network. In ICLR, 2014.
J. C. Mattingly, A. M. Stuart, and M. V. Tretyakov. Construction of numerical time-average and stationary measures via Poisson equations. In SIAM J. Numer. Anal., 2010.
R. M. Neal. Annealed importance sampling. In Statistics and Computing, 2001.
R. M. Neal. MCMC using Hamiltonian dynamics. In Handbook of Markov Chain Monte Carlo, 2011.
F. Obermeyer, J. Glidden, and E. Jonas. Scaling nonparametric Bayesian inference via subsample-annealing. In AISTATS, 2014.
S. Patterson and Y. W. Teh. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In NIPS, 2013.
H. Risken. The Fokker-Planck equation. Springer-Verlag, New York, 1989.
D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. In Nature, 1986.
E. Schechter. Handbook of Analysis and Its Foundations. Elsevier, 1997.
I. Sutskever, J. Martens, G. Dahl, and G. E. Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013.
Y. W. Teh, A. H. Thiery, and S. J. Vollmer. Consistency and fluctuations for stochastic gradient Langevin dynamics. In arXiv:1409.0578, 2014.
T. Tieleman and G. E. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. In Coursera: Neural Networks for Machine Learning, 2012.
J. W. van de Meent, B. Paige, and F. Wood. Tempering by subsampling. In arXiv:1401.7145, 2014.
V. Černý. Thermodynamical approach to the traveling salesman problem: An efficient simulation algorithm. In J. Optimization Theory and Applications, 1985.
S. J. Vollmer, K. C. Zygalakis, and Y. W. Teh. (Non-)asymptotic properties of stochastic gradient Langevin dynamics. In arXiv:1501.00438, 2015.
M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In ICML, 2011.
M. Zeiler and R. Fergus. Stochastic pooling for regularization of deep convolutional neural networks. In ICLR, 2013.
M. D. Zeiler. Adadelta: An adaptive learning rate method. In arXiv:1212.5701, 2012.

Bridging the Gap between Stochastic Gradient MCMC and Stochastic Optimization: Supplementary Material

Changyou Chen† David Carlson‡ Zhe Gan† Chunyuan Li† Lawrence Carin† †Department of Electrical and Computer Engineering, Duke University ‡Department of Statistics and Grossman Center for Statistics of Mind, Columbia University

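A Solutions for the sub-SDEs

We provide analytic solutions for the split sub-SDEs in Section 4.1. For stepsize $h$, the solutions are given in (6).

$$
A: \begin{cases}
\theta_t = \theta_{t-1} + G_1(\theta)\, p\, h \\
p_t = p_{t-1} \\
\Xi_t = \Xi_{t-1} + \left(Q - \frac{1}{\beta} I\right) h
\end{cases}
\qquad
B: \begin{cases}
\theta_t = \theta_{t-1} \\
p_t = \exp(-\Xi h)\, p_{t-1} \\
\Xi_t = \Xi_{t-1}
\end{cases}
$$

$$
O: \begin{cases}
\theta_t = \theta_{t-1} \\
p_t = p_{t-1} + \left(-G_1(\theta)\nabla_\theta U(\theta) + \frac{1}{\beta}\nabla_\theta G_1(\theta) + G_1(\theta)(\Xi - G_2(\theta))\nabla_\theta G_2(\theta)\right) h + \left(\frac{2}{\beta} G_2(\theta)\right)^{1/2} \odot \zeta_t \\
\Xi_t = \Xi_{t-1}
\end{cases}
\qquad (6)
$$

B Proof of Lemma 1

Consider a general stochastic differential equation of the form

$$dx = F(x)\,dt + \sqrt{2}\, D^{1/2}(x)\, dw , \qquad (7)$$

where $x \in \mathbb{R}^N$, $F : \mathbb{R}^N \to \mathbb{R}^N$ and $D : \mathbb{R}^M \to \mathbb{R}^{N \times P}$ are measurable functions, and $w$ is standard $P$-dimensional Brownian motion. The SDE (1) is a special case of the general form (7) with

$$
x = (\theta, p, \Xi), \quad
F(x) = \begin{pmatrix}
G_1(\theta)\, p \\
-G_1(\theta)\nabla_\theta U(\theta) - \Xi p + \frac{1}{\beta}\nabla_\theta G_1(\theta) + G_1(\theta)(\Xi - G_2(\theta))\nabla_\theta G_2(\theta) \\
Q - \frac{1}{\beta} I
\end{pmatrix}, \quad
D(x) = \begin{pmatrix}
0 & 0 & 0 \\
0 & \frac{1}{\beta} G_2(\theta) & 0 \\
0 & 0 & 0
\end{pmatrix} . \qquad (8)
$$

We write the joint distribution of $x$ as

$$\rho(x) = \frac{1}{Z} \exp\{-H(x)\} \triangleq \frac{1}{Z} \exp\{-U(\theta) - E(\theta, p, \Xi)\} .$$

A reformulation of the main theorem in Ding et al. (2014) gives the following lemma, which is used to prove Lemma 1 in the main text.

Lemma 4. The stochastic process of $\theta$ generated by the stochastic differential equation (7) has the target distribution $p_\theta(\theta) = \frac{1}{\tilde{Z}} \exp\{-U(\theta)\}$ as its stationary distribution, if $\rho(x)$ satisfies the following marginalization condition:

$$\exp\{-U(\theta)\} \propto \int \exp\{-U(\theta) - E(\theta, p, \Xi)\}\, dp\, d\Xi , \qquad (9)$$

and if the following condition is also satisfied:

$$\nabla \cdot (\rho F) = \nabla \nabla^{\top} : (\rho D) , \qquad (10)$$

where $\nabla \triangleq (\partial/\partial\theta, \partial/\partial p, \partial/\partial\Xi)$, "$\cdot$" represents the vector inner product operator, and ":" represents a matrix double dot product, i.e., $X : Y \triangleq \mathrm{tr}(X^{\top} Y)$.

Proof of Lemma 1. We first reformulate (1) using the general SDE form of (7), resulting in (8). Lemma 1 states that the joint distribution of $(\theta, p, \Xi)$ is

$$\rho(x) = \frac{1}{Z} \exp\left\{ -\frac{1}{2} p^{\top} p - U(\theta) - \frac{1}{2} \mathrm{tr}\left\{ (\Xi - G_2(\theta))^{\top} (\Xi - G_2(\theta)) \right\} \right\} , \qquad (11)$$

with $H(x) = \frac{1}{2} p^{\top} p + U(\theta) + \frac{1}{2} \mathrm{tr}\left\{ (\Xi - G_2(\theta))^{\top} (\Xi - G_2(\theta)) \right\}$. The marginalization condition (9) is trivially satisfied, so we are left to verify condition (10). Substituting $\rho(x)$ and $F$ into (10), we have the left-hand side

$$
\begin{aligned}
\mathrm{LHS} &= \sum_i \frac{\partial}{\partial x_i} (\rho F_i)
= \sum_i \left( \frac{\partial \rho}{\partial x_i} F_i + \rho \frac{\partial F_i}{\partial x_i} \right)
= \sum_i \left( \frac{\partial F_i}{\partial x_i} - \frac{\partial H}{\partial x_i} F_i \right) \rho \\
&= \Big( \sum_i \nabla_{\theta_i} (G_1)_{i:}\, p - \sum_i \mathrm{diag}(\Xi)_i
- \beta \sum_i \Big( \nabla_{\theta_i} U - \sum_j (\Xi_{ij} - (G_2)_{ij}) \nabla_{\theta_i} (G_2)_{ij} \Big) (G_1 p)_i \\
&\qquad - \beta p^{T} \Big( -G_1 \nabla_\theta U - \Xi p + \frac{1}{\beta} \nabla_\theta G_1 + G_1 (\Xi - G_2) \nabla_\theta G_2 \Big)
- \beta \sum_i (\Xi_{ii} - (G_2)_{ii}) \Big( Q_{ii} - \frac{1}{\beta} \Big) \Big) \rho \\
&= \frac{1}{\beta} \mathrm{tr}\left( G_2 (p p^{T} - I) \right) \rho .
\end{aligned}
$$

It is easy to see that for the right-hand side

$$
\begin{aligned}
\mathrm{RHS} &= \sum_i \sum_j \frac{1}{\beta} (G_2)_{ij} \frac{\partial^2}{\partial x_i \partial x_j} \rho
= \frac{1}{\beta} \sum_i \sum_j (G_2)_{ij} \frac{\partial}{\partial p_j} \left( -\frac{\partial H}{\partial p_i} \rho \right) \\
&= \frac{1}{\beta} \sum_i (G_2)_{ii} \left( p_i^2 - 1 \right) \rho \equiv \mathrm{LHS} .
\end{aligned}
$$

According to Lemma 4, the joint distribution (11) is the equilibrium distribution of (1).

C Proof of Theorem 2

We start by proving the bias result of Theorem 2.

Proof of the bias. For our 2nd-order integrator, according to the definition, we have:

$$
\mathbb{E}[\psi(X_t)] = \tilde{P}_h^{l} \psi(X_{t-1}) = e^{h \tilde{\mathcal{L}}_t} \psi(X_{t-1}) + O(h^3)
= \left( I + h \tilde{\mathcal{L}}_t \right) \psi(X_{t-1}) + \frac{h^2}{2} \tilde{\mathcal{L}}_t^2 \psi(X_{t-1}) + O(h^3) , \qquad (12)
$$

where $\tilde{\mathcal{L}}_t$ is the generator of the SDE for the $t$-th iteration, i.e., using the stochastic gradient instead of the full gradient, and $I$ is the identity map. Compared to the proof of Chen et al. (2015), we need to consider the approximation error for $\nabla_\theta G_1(\theta)$. As a result, (12) needs to be rewritten as:

$$
\mathbb{E}[\psi(X_t)] = \left( I + h (\tilde{\mathcal{L}}_t + \mathcal{B}_t) \right) \psi(X_{t-1}) + \frac{h^2}{2} \tilde{\mathcal{L}}_t^2 \psi(X_{t-1}) + O(h^3) , \qquad (13)
$$

where $\mathcal{B}_t$ is from (3). Sum over $t = 1, \cdots, L$ in (13), take expectation on both sides, and use the relation $\tilde{\mathcal{L}}_t + \mathcal{B}_t = \mathcal{L}_{\beta_t} + \Delta V_t$ to expand the first order term. We obtain

$$
\sum_{t=1}^{L} \mathbb{E}[\psi(X_t)] = \psi(X_0) + \sum_{t=1}^{L-1} \mathbb{E}[\psi(X_t)]
+ h \sum_{t=1}^{L} \mathbb{E}[\mathcal{L}_{\beta_t} \psi(X_{t-1})]
+ h \sum_{t=1}^{L} \mathbb{E}[\Delta V_t \psi(X_{t-1})]
+ \frac{h^2}{2} \sum_{t=1}^{L} \mathbb{E}[\tilde{\mathcal{L}}_t^2 \psi(X_{t-1})] + O(L h^3) .
$$

We divide both sides by $Lh$, use the Poisson equation (4), and reorganize terms. We have:

$$
\begin{aligned}
\mathbb{E}\Big[ \frac{1}{L} \sum_t \big( \phi(X_t) - \bar{\phi}_{\beta_t} \big) \Big]
&= \frac{1}{L} \sum_{t=1}^{L} \mathbb{E}[\mathcal{L}_{\beta_t} \psi(X_{t-1})] \\
&= \frac{1}{Lh} \big( \mathbb{E}[\psi(X_L)] - \psi(X_0) \big)
- \frac{1}{L} \sum_t \mathbb{E}[\Delta V_t \psi(X_{t-1})]
- \frac{h}{2L} \sum_{t=1}^{L} \mathbb{E}[\tilde{\mathcal{L}}_t^2 \psi(X_{t-1})] + O(h^2) . \qquad (14)
\end{aligned}
$$

Now we bound the $\tilde{\mathcal{L}}_t^2$ term. Based on ideas from Mattingly et al. (2010), we apply the following procedure. First replace $\psi$ with $\tilde{\mathcal{L}}_t \psi$ from (13) to (14), and apply the same logic for $\tilde{\mathcal{L}}_t \psi$ as for $\psi$ in the above derivations, but this time expand (13) up to the order of $O(h^2)$, instead of the previous order $O(h^3)$. After simplification, we obtain:

$$\sum_t \mathbb{E}[\tilde{\mathcal{L}}_t^2 \psi(X_{t-1})] = O\Big( \frac{1}{h} + Lh \Big) . \qquad (15)$$

Substituting (15) into (14), after simplification, we have:

$$
\mathbb{E}\Big[ \frac{1}{L} \sum_t \big( \phi(X_t) - \bar{\phi}_{\beta_t} \big) \Big]
= \underbrace{\frac{1}{Lh} \big( \mathbb{E}[\psi(X_L)] - \psi(X_0) \big)}_{C_1}
- \frac{1}{L} \sum_t \mathbb{E}[\Delta V_t \psi(X_{t-1})]
- O\Big( \frac{h}{Lh} + h^2 \Big) + C_3 h^2 ,
$$

for some $C_3 \ge 0$. According to the assumption, the term $C_1$ is bounded. As a result, collecting low order terms, the bias can be expressed as:

$$
\begin{aligned}
\big| \mathbb{E}\hat{\phi} - \bar{\phi} \big|
&= \Big| \mathbb{E}\Big( \frac{1}{L} \sum_t \big( \phi(X_t) - \bar{\phi}_{\beta_t} \big) + \frac{1}{L} \sum_t \bar{\phi}_{\beta_t} - \bar{\phi} \Big) \Big| \\
&\le \Big| \frac{1}{L} \sum_t \bar{\phi}_{\beta_t} - \bar{\phi} \Big| + \Big| \mathbb{E}\Big( \frac{1}{L} \sum_t \big( \phi(X_t) - \bar{\phi}_{\beta_t} \big) \Big) \Big| \\
&\le C \phi(\theta^*) \Big( \frac{1}{L} \sum_{t=1}^{L} \int_{\theta \ne \theta^*} e^{-\beta_t \hat{U}(\theta)} d\theta \Big)
+ \Big| \frac{C_1}{Lh} - \frac{\sum_t \mathbb{E} \Delta V_t \psi(X_{t-1})}{L} + C_3 h^2 \Big| \\
&\le C \phi(\theta^*) \Big( \frac{1}{L} \sum_{t=1}^{L} \int_{\theta \ne \theta^*} e^{-\beta_t \hat{U}(\theta)} d\theta \Big)
+ \Big| \frac{C_1}{Lh} \Big| + \Big| \frac{\sum_t \mathbb{E} \Delta V_t \psi(X_{t-1})}{L} \Big| + \big| C_3 h^2 \big| \\
&\le C \phi(\theta^*) \Big( \frac{1}{L} \sum_{t=1}^{L} \int_{\theta \ne \theta^*} e^{-\beta_t \hat{U}(\theta)} d\theta \Big)
+ D \Big( \frac{1}{Lh} + \frac{\sum_t \| \mathbb{E} \Delta V_t \|}{L} + h^2 \Big) ,
\end{aligned}
$$

where the last inequality follows from the finiteness assumption of $\psi$, and $\| \cdot \|$ denotes the operator norm, which is bounded in the space of $\psi$ due to the assumptions. This completes the proof.

We will now prove the MSE result.

Proof of the MSE bound. Similar to the proof of the bias, for our 2nd-order integrator we have:

$$
\mathbb{E}\big( \psi_{\beta_t}(X_t) \big) = \big( I + h (\mathcal{L}_{\beta_t} + \Delta V_t) \big) \psi_{\beta_{t-1}}(X_{t-1}) + \frac{h^2}{2} \tilde{\mathcal{L}}_t^2 \psi_{\beta_{t-1}}(X_{t-1}) + O(h^3) .
$$

Summing over $t$ from 1 to $L$ and simplifying, we have:

$$
\sum_{t=1}^{L} \mathbb{E}\big( \psi_{\beta_t}(X_t) \big) = \sum_{t=1}^{L} \psi_{\beta_{t-1}}(X_{t-1})
+ h \sum_{t=1}^{L} \mathcal{L}_{\beta_t} \psi_{\beta_{t-1}}(X_{t-1})
+ h \sum_{t=1}^{L} \Delta V_t \psi_{\beta_{t-1}}(X_{t-1})
+ \frac{h^2}{2} \sum_{t=1}^{L} \tilde{\mathcal{L}}_t^2 \psi_{\beta_{t-1}}(X_{t-1}) + O(L h^3) .
$$

Substituting the Poisson equation (4) into the above equation, dividing both sides by $Lh$ and rearranging related terms, we have

$$
\begin{aligned}
\frac{1}{L} \sum_{t=1}^{L} \big( \phi(X_t) - \bar{\phi}_{\beta_t} \big)
&= \frac{1}{Lh} \big( \mathbb{E} \psi_{\beta_L}(X_{Lh}) - \psi_{\beta_0}(X_0) \big)
- \frac{1}{Lh} \sum_{t=1}^{L} \big( \mathbb{E} \psi_{\beta_{t-1}}(X_{t-1}) - \psi_{\beta_{t-1}}(X_{t-1}) \big) \\
&\quad - \frac{1}{L} \sum_{t=1}^{L} \Delta V_t \psi_{\beta_{t-1}}(X_{t-1})
- \frac{h}{2L} \sum_{t=1}^{L} \tilde{\mathcal{L}}_t^2 \psi_{\beta_{t-1}}(X_{t-1}) + O(h^2) .
\end{aligned}
$$

Taking the square of both sides, it is then easy to see that there exists some positive constant $C$, such that

$$
\begin{aligned}
\Big( \frac{1}{L} \sum_{t=1}^{L} \big( \phi(X_t) - \bar{\phi}_{\beta_t} \big) \Big)^2
\le C \Bigg(
& \underbrace{\frac{\big( \mathbb{E} \psi_{\beta_L}(X_{Lh}) - \psi_{\beta_0}(X_0) \big)^2}{L^2 h^2}}_{A_1}
+ \underbrace{\frac{1}{L^2 h^2} \sum_{t=1}^{L} \big( \mathbb{E} \psi_{\beta_{t-1}}(X_{t-1}) - \psi_{\beta_{t-1}}(X_{t-1}) \big)^2}_{A_2} \\
& + \frac{1}{L^2} \sum_{t=1}^{L} \Delta V_t^2 \psi_{\beta_{t-1}}(X_{t-1})
+ \underbrace{\frac{h^2}{2L^2} \Big( \sum_{t=1}^{L} \tilde{\mathcal{L}}_t^2 \psi_{\beta_{t-1}}(X_{t-1}) \Big)^2}_{A_3}
+ h^4 \Bigg) . \qquad (16)
\end{aligned}
$$

$A_1$ is easily bounded by the assumption that $\| \psi \| \le V^{r_0} < \infty$. $A_2$ is bounded because it can be shown that $\mathbb{E}( \psi_{\beta_t}(X_t) ) - \psi_{\beta_t}(X_t) \le C_1 \sqrt{h} + O(h)$ for $C_1 \ge 0$. Intuitively this is true because the only difference between $\mathbb{E}( \psi_{\beta_t}(X_t) )$ and $\psi_{\beta_t}(X_t)$ lies in the additional Gaussian noise with variance $h$. A formal proof is given in Chen et al. (2015). Furthermore, $A_3$ is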
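bounded by the following arguments:

$$
A_3 = \underbrace{\frac{h^2}{2L^2} \Big( \sum_{t=1}^{L} \mathbb{E}\big[ \tilde{\mathcal{L}}_t^2 \psi_{\beta_{t-1}}(X_{t-1}) \big] \Big)^2}_{B_1}
+ \underbrace{\frac{h^2}{2L^2} \mathbb{E}\Big( \sum_{t=1}^{L} \big( \tilde{\mathcal{L}}_t^2 \psi_{\beta_{t-1}}(X_{t-1}) - \mathbb{E} \tilde{\mathcal{L}}_t^2 \psi_{\beta_{t-1}}(X_{t-1}) \big) \Big)^2}_{B_2} .
$$

By (15), $B_1 = \frac{h^2}{2L^2}\, O\big( \frac{1}{h^2} + L^2 h^2 \big)$, and the centered sum $B_2$ is bounded term by term, which gives

$$A_3 = O\Big( \frac{1}{Lh} + h^4 \Big) .$$

Collecting low order terms we have:

$$
\mathbb{E}\Big( \frac{1}{L} \sum_{t=1}^{L} \big( \phi(X_t) - \bar{\phi}_{\beta_t} \big) \Big)^2
= O\Big( \frac{\frac{1}{L} \sum_t \mathbb{E} \| \Delta V_t \|^2}{L} + \frac{1}{Lh} + h^4 \Big) . \qquad (17)
$$

Finally, we have:

$$
\begin{aligned}
\mathbb{E}\big( \hat{\phi} - \bar{\phi} \big)^2
&\lesssim \mathbb{E}\Big( \frac{1}{L} \sum_{t=1}^{L} \bar{\phi}_{\beta_t} - \bar{\phi} \Big)^2
+ \mathbb{E}\Big( \frac{1}{L} \sum_{t=1}^{L} \big( \phi(X_t) - \bar{\phi}_{\beta_t} \big) \Big)^2 \\
&\le C \phi(\theta^*)^2 \Big( \frac{1}{L} \sum_{t=1}^{L} \int_{\theta \ne \theta^*} e^{-\beta_t \hat{U}(\theta)} d\theta \Big)^2
+ O\Big( \frac{\frac{1}{L} \sum_t \mathbb{E} \| \Delta V_t \|^2}{L} + \frac{1}{Lh} + h^4 \Big) \\
&\le C \phi(\theta^*)^2 \Big( \frac{1}{L} \sum_{t=1}^{L} \int_{\theta \ne \theta^*} e^{-\beta_t \hat{U}(\theta)} d\theta \Big)^2
+ D \Big( \frac{\frac{1}{L} \sum_t \mathbb{E} \| \Delta V_t \|^2}{L} + \frac{1}{Lh} + h^4 \Big) .
\end{aligned}
$$

D Proof of Corollary 3

Proof. The refinement stage corresponds to $\beta \to \infty$. We can prove that in this case, the integration terms in the bias and MSE in Theorem 2 converge to 0. To show this, define a sequence of functions $\{g_m\}$ as:

$$g_m \triangleq -\frac{1}{L} \sum_{l=m}^{L+m-1} e^{-\beta_l \hat{U}(\theta)} . \qquad (18)$$

It is easy to see that the sequence $\{g_m\}$ satisfies $g_{m_1} < g_{m_2}$ for $m_1 < m_2$, and $\lim_{m \to \infty} g_m = 0$. According to the monotone convergence theorem, we have

$$
\lim_{m \to \infty} \int g_m \triangleq \lim_{m \to \infty} \int -\frac{1}{L} \sum_{l=m}^{L+m-1} e^{-\beta_l \hat{U}(\theta)} d\theta = \int \lim_{m \to \infty} g_m = 0 .
$$

As a result, the integration terms in the bounds for the bias and MSE vanish, leaving only the terms stated in Corollary 3. This completes the proof.

E Reformulation of the Santa Algorithm

In this section we give a version of the Santa algorithm that more closely matches our actual implementation, shown in Algorithms 3–7.

Algorithm 3: Santa
Input: $\eta_t$ (learning rate), $\sigma$, $\lambda$, burnin, $\beta = \{\beta_1, \beta_2, \cdots\} \to \infty$, $\{\zeta_t \in \mathbb{R}^p\} \sim \mathcal{N}(0, I_p)$.
Initialize $\theta_0$, $u_0 = \sqrt{\eta} \times \mathcal{N}(0, I)$, $\alpha_0 = \sqrt{\eta}\, C$, $v_0 = 0$;
for $t = 1, 2, \ldots$ do
    Evaluate $\tilde{f}_t = \nabla_\theta \tilde{U}_t(\theta_{t-1})$ on the $t$-th minibatch;
    $v_t = \sigma v_{t-1} + \frac{1 - \sigma}{N^2} \tilde{f}_t \odot \tilde{f}_t$;
    $g_t = 1 \oslash \sqrt{\lambda + \sqrt{v_t}}$;
    if $t <$ burnin then
        /* exploration */
        $(\theta_t, u_t, \alpha_t)$ = Exploration_S$(\theta_{t-1}, u_{t-1}, \alpha_{t-1})$
        or
        $(\theta_t, u_t, \alpha_t)$ = Exploration_E$(\theta_{t-1}, u_{t-1}, \alpha_{t-1})$
    else
        /* refinement */
        $(\theta_t, u_t, \alpha_t)$ = Refinement_S$(\theta_{t-1}, u_{t-1}, \alpha_{t-1})$
        or
        $(\theta_t, u_t, \alpha_t)$ = Refinement_E$(\theta_{t-1}, u_{t-1}, \alpha_{t-1})$
    end
end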
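As a rough illustration of how the main loop of Algorithm 3 fits together with the Euler subroutines (Algorithms 6 and 7), the NumPy sketch below runs the exploration update while $t <$ burnin and the refinement update afterwards. It is a sketch under stated assumptions, not the authors' implementation: the function name `santa_euler`, the default values of `eta`, `sigma`, `lam` and `C`, the annealing schedule $\beta_t = t^2$, and the use of $g_1$ in place of the undefined $g_0$ in the first noise term are all our own choices.

```python
import numpy as np


def santa_euler(grad_fn, theta0, n_iter, burnin, eta=1e-3, sigma=0.99,
                lam=1e-3, C=1.0, n_scale=1.0,
                anneal=lambda t: float(t) ** 2, seed=0):
    """Sketch of Algorithm 3 with the Euler subroutines (Algorithms 6 and 7).

    grad_fn(theta) returns the (stochastic) gradient f_t; n_scale plays the
    role of N in the update of v_t. All default values are illustrative.
    """
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    p = theta.size
    u = np.sqrt(eta) * rng.standard_normal(p)   # u_0 = sqrt(eta) * N(0, I)
    alpha = np.full(p, np.sqrt(eta) * C)        # alpha_0 = sqrt(eta) * C
    v = np.zeros(p)                             # v_0 = 0
    g_prev = None                               # g_0 is not specified in Algorithm 3
    for t in range(1, n_iter + 1):
        f = np.asarray(grad_fn(theta), dtype=float)
        v = sigma * v + (1.0 - sigma) / n_scale**2 * f * f
        g = 1.0 / np.sqrt(lam + np.sqrt(v))     # element-wise preconditioner g_t
        if g_prev is None:
            g_prev = g                          # assumption: use g_1 in place of g_0
        if t < burnin:
            # Exploration_E (Algorithm 6): thermostat update plus annealed noise.
            beta = anneal(t)
            alpha = alpha + (u * u - eta / beta)
            noise = np.sqrt(2.0 * g_prev * eta**1.5 / beta) * rng.standard_normal(p)
            u = (1.0 - alpha) * u - eta * g * f + noise
        else:
            # Refinement_E (Algorithm 7): alpha is frozen and the noise is dropped.
            u = (1.0 - alpha) * u - eta * g * f
        theta = theta + g * u
        g_prev = g
    return theta
```

For instance, `santa_euler(lambda th: th - c, theta0, n_iter=2000, burnin=1)` with a fixed target vector `c` runs the refinement stage alone on a quadratic objective; in that regime the update is an element-wise preconditioned momentum method, which is the form compared with Adam in Section F.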

Algorithm 4: Exploration_S$(\theta_{t-1}, u_{t-1}, \alpha_{t-1})$
    $\theta_t = \theta_{t-1} + g_t \odot u_{t-1} / 2$;
    $\alpha_t = \alpha_{t-1} + (u_{t-1} \odot u_{t-1} - \eta/\beta_t) / 2$;
    $u_t = \exp(-\alpha_t / 2) \odot u_{t-1}$;
    $u_t = u_t - g_t \odot \tilde{f}_t \eta + \sqrt{2 g_{t-1} \eta^{3/2} / \beta_t} \odot \zeta_t$;
    $u_t = \exp(-\alpha_t / 2) \odot u_t$;
    $\alpha_t = \alpha_t + (u_t \odot u_t - \eta/\beta_t) / 2$;
    $\theta_t = \theta_t + g_t \odot u_t / 2$;
    Return $(\theta_t, u_t, \alpha_t)$

Algorithm 5: Refinement_S$(\theta_{t-1}, u_{t-1}, \alpha_{t-1})$
    $\alpha_t = \alpha_{t-1}$;
    $\theta_t = \theta_{t-1} + g_t \odot u_{t-1} / 2$;
    $u_t = \exp(-\alpha_t / 2) \odot u_{t-1}$;
    $u_t = u_t - g_t \odot \tilde{f}_t \eta$;
    $u_t = \exp(-\alpha_t / 2) \odot u_t$;
    $\theta_t = \theta_t + g_t \odot u_t / 2$;
    Return $(\theta_t, u_t, \alpha_t)$

Algorithm 6: Exploration_E$(\theta_{t-1}, u_{t-1}, \alpha_{t-1})$
    $\alpha_t = \alpha_{t-1} + (u_{t-1} \odot u_{t-1} - \eta/\beta_t)$;
    $u_t = (1 - \alpha_t) \odot u_{t-1} - \eta\, g_t \odot \tilde{f}_t + \sqrt{2 g_{t-1} \eta^{3/2} / \beta_t} \odot \zeta_t$;
    $\theta_t = \theta_{t-1} + g_t \odot u_t$;
    Return $(\theta_t, u_t, \alpha_t)$

Algorithm 7: Refinement_E$(\theta_{t-1}, u_{t-1}, \alpha_{t-1})$
    $\alpha_t = \alpha_{t-1}$;
    $u_t = (1 - \alpha_t) \odot u_{t-1} - \eta\, g_t \odot \tilde{f}_t$;
    $\theta_t = \theta_{t-1} + g_t \odot u_t$;
    Return $(\theta_t, u_t, \alpha_t)$

F Relationship of refinement Santa to Adam

In the Adam algorithm (see Algorithm 1 of Kingma and Ba (2015)), the key steps are:

$$
\begin{aligned}
\tilde{f}_t &\triangleq \nabla_\theta \tilde{U}(\theta_{t-1}) \\
v_t &= \sigma v_{t-1} + (1 - \sigma) \tilde{f}_t \odot \tilde{f}_t \\
g_t &= 1 \oslash \sqrt{\lambda + \sqrt{v_t}} \\
\tilde{u}_t &= (1 - b_1) \odot \tilde{u}_{t-1} + b_1 \odot \tilde{f}_t \\
\theta_t &= \theta_{t-1} + \eta (g_t \odot g_t) \odot \tilde{u}_t
\end{aligned}
$$

Here, we maintain the square root form of $g_t$, so the square is equivalent to the preconditioner used in Adam. As well, in Adam, the vector $b_1$ is set to the same constant between 0 and 1 for all entries. An equivalent formulation of this is:

$$
\begin{aligned}
\tilde{f}_t &\triangleq \nabla_\theta \tilde{U}(\theta_{t-1}) \\
v_t &= \sigma v_{t-1} + (1 - \sigma) \tilde{f}_t \odot \tilde{f}_t \\
g_t &= 1 \oslash \sqrt{\lambda + \sqrt{v_t}} \\
u_t &= (1 - b_1) \odot u_{t-1} - \eta (g_t \odot b_1 \odot \tilde{f}_t) \\
\theta_t &= \theta_{t-1} - g_t \odot u_t
\end{aligned}
$$

The only differences between these steps and the Euler integrator we present in our Algorithm 1 are that our $b_1$ has a separate constant for each entry, and the second term in $u$ does not include the $b_1$ in our formulation. If we modify our algorithm to multiply the gradient by $b_1$, then our algorithm, under the same assumptions as Adam, will have a similar regret bound of $O(\sqrt{T})$ for a convex problem.

Because the focus of this paper is not on the regret bound, we only briefly discuss the changes in the theory. We note that Lemma 10.4 from Kingma and Ba (2015) will hold with element-wise $b_1$.

Lemma 5. Let $\gamma_i \triangleq \frac{b_{1,i}^2}{\sqrt{\sigma}}$. For $b_{1,i}, \sigma \in [0, 1)$ that satisfy $\frac{b_{1,i}^2}{\sqrt{\sigma}} < 1$ and bounded $\tilde{f}_t$, $\| \tilde{f}_t \|_2 \le G$, $\| \tilde{f}_t \|_\infty \le G_\infty$, the following inequality holds

$$\sum_{t=1}^{T} \frac{u_i^2}{\sqrt{t g_i^2}} \le \frac{2}{1 - \gamma_i} \| \tilde{f}_{1:T,i} \|_2 ,$$

which contains an element-dependent $\gamma_i$ compared to Adam.

Theorem 10.5 of Kingma and Ba (2015) will hold with the same modifications and assumptions for a $b$ with distinct entries; the proof in Kingma and Ba (2015) is already element-wise, so it suffices to replace their global parameter $\gamma$ with distinct $\gamma_i \triangleq \frac{b_{1,i}^2}{\sqrt{\sigma}}$. This will give a regret of $O(\sqrt{T})$, the same as Adam.

G Additional Results

[Figure 4 plot: test error (%) against epochs for SGD, SGD-M, SGLD, RMSprop, Adam and Santa.]

Figure 4: MNIST using FNN with size of 800.

Learning curves of different algorithms on MNIST using FNN with size of 800 are plotted in Figure 4. Learning curves of different algorithms on four polyphonic music datasets using RNN are shown in Figure 6.

We additionally test Santa on the ImageNet dataset. We use the GoogleNet architecture, which is a 22-layer deep model, with the default setting defined in the Caffe package.8 We were not able to make other stochastic optimization algorithms work on this dataset, except SGD with momentum and the proposed Santa. Figure 5 shows the comparison on this dataset. We did not tune the parameter setting; note that the default setting favours SGD with momentum. Nevertheless, Santa still significantly outperforms SGD with momentum in terms of convergence speed.

[Figure 5 panels: Top-1 Accuracy on ImageNet and Top-5 Accuracy on ImageNet, each plotted against #iterations (×10^6) for Santa and SGD-M.]

Figure 5: Santa vs. SGD with momentum on ImageNet. We used ImageNet11 for training.

8 https://github.com/cchangyou/Santa/tree/master/caffe/models/bvlc_googlenet

[Figure 6 panels: train, validation and test curves for Piano, Nott, Muse and JSB; negative log-likelihood against epochs for SGD, SGD-M, RMSprop, Adam, Santa and Santa-s.]

Figure 6: Learning curves of different algorithms on four polyphonic music datasets using RNN.