
Stability and Convergence of Stochastic Gradient Clipping: Beyond Lipschitz Continuity and Smoothness

Vien V. Mai¹  Mikael Johansson¹

¹Division of Decision and Control Systems, EECS, KTH Royal Institute of Technology, Stockholm, Sweden. Correspondence to: V. V. Mai, M. Johansson.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

Stochastic gradient algorithms are often unstable when applied to functions that do not have Lipschitz-continuous and/or bounded gradients. Gradient clipping is a simple and effective technique to stabilize the training process for problems that are prone to the exploding gradient problem. Despite its widespread popularity, the convergence properties of the gradient clipping heuristic are poorly understood, especially for stochastic problems. This paper establishes both qualitative and quantitative convergence results for the clipped stochastic (sub)gradient method (SGD) for non-smooth convex functions with rapidly growing subgradients. Our analyses show that clipping enhances the stability of SGD and that the clipped SGD algorithm enjoys finite convergence rates in many cases. We also study the convergence of a clipped method with momentum, which includes clipped SGD as a special case, for weakly convex problems under standard assumptions. With a novel Lyapunov analysis, we show that the proposed method achieves the best-known rate for the considered class of problems, demonstrating the effectiveness of clipped methods also in this regime. Numerical results confirm our theoretical developments.

1. Introduction

We study stochastic optimization problems of the form

$$\underset{x\in\mathbb{R}^n}{\text{minimize}}\quad f(x) := \mathbb{E}_P[f(x;S)] = \int_{\mathcal{S}} f(x;s)\,dP(s), \tag{1}$$

where S ∼ P is a random variable and f(x; s) is the instantaneous loss parameterized by x on a sample s ∈ S. Such problems are at the core of many machine-learning applications, and are often solved using stochastic (sub)gradient methods. In spite of their successes, stochastic gradient methods can be sensitive to their parameters (Nemirovski et al., 2009; Asi & Duchi, 2019a) and have severe instability (unboundedness) problems when applied to functions that grow faster than quadratically in the decision vector x (Andradóttir, 1996; Asi & Duchi, 2019a). Consequently, a careful (and sometimes time-consuming) parameter tuning is often required for these methods to perform well in practice. Even so, a good parameter selection is not sufficient to circumvent the instability issue on steep functions.

Gradient clipping and the closely related gradient normalization technique are simple modifications of the underlying algorithm that control the step length an update can make relative to the current iterate. These techniques enhance the stability of the optimization process, while adding essentially no extra cost to the original update. As a result, gradient clipping has been a common choice in many applied domains of machine learning (Pascanu et al., 2013). In this work, we consider gradient clipping applied to the classical SGD method. Throughout the paper, we frequently use the clipping operator

$$\mathrm{clip}_\gamma : \mathbb{R}^n \to \mathbb{R}^n : x \mapsto \min\left\{1, \frac{\gamma}{\|x\|_2}\right\} x,$$

which is nothing else but the orthogonal projection onto the γ-ball. It is important to note that for noiseless gradients, clipping just changes the magnitude and does not affect the search direction. However, in the stochastic setting, the expected value of the clipped stochastic gradient may point in a completely different direction than the true gradient.

Clipped SGD  To solve problem (1), we use an iterative procedure that starts from x₀ ∈ R^n and g₀ ∈ ∂f(x₀, S₀) and generates a sequence of points x_k ∈ R^n by repeating the following steps for k = 0, 1, 2, . . .:

$$x_{k+1} = x_k - \alpha_k d_k, \qquad d_k = \mathrm{clip}_{\gamma_k}(g_k). \tag{2}$$

We refer to α_k as the kth stepsize and γ_k as the kth clipping threshold, while g_k = f′(x_k, S_k) is the kth stochastic subgradient, or its mini-batch version g_k = (1/m_k) ∑_{i=1}^{m_k} f′(x_k, S_k^i) if multiple samples are used in each iteration.
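To make the operator and update (2) concrete, the following is a minimal NumPy sketch (ours, not the authors' code); `subgrad_f` is a placeholder oracle returning a stochastic subgradient f′(x, s).

```python
import numpy as np

def clip_gamma(g, gamma):
    """Orthogonal projection of g onto the closed l2-ball of radius gamma."""
    norm = np.linalg.norm(g)
    return g if norm <= gamma else (gamma / norm) * g

def clipped_sgd_step(x, alpha, gamma, subgrad_f, sample):
    """One iteration of update (2): x_{k+1} = x_k - alpha_k * clip_{gamma_k}(g_k)."""
    g = subgrad_f(x, sample)  # stochastic subgradient f'(x_k, S_k)
    return x - alpha * clip_gamma(g, gamma)
```

Note that the branch in `clip_gamma` equals min{1, γ/‖g‖₂}·g while avoiding division by zero at g = 0.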

1.1. Related work

Our work is closely related to a number of topics which we briefly review below.

Gradient clipping  Gradient clipping and normalization were recognized early in the development of subgradient methods as a useful tool to obtain convergence for rapidly growing convex functions (Shor, 1985; Ermoliev, 1988; Alber et al., 1998). For the normalized method, the seminal work (Shor, 1985) establishes a convergence rate for the quantity ⟨g_k/‖g_k‖₂, x_k − x*⟩ without any assumptions on g_k. By only requiring that subgradients are bounded on bounded sets, which always holds for continuous functions, Alber et al. (1998) prove convergence in the objective value for the clipped subgradient method. The work (Hazan et al., 2015) considers normalized gradient methods for quasi-convex and locally smooth functions. Recently, the authors in (Zhang et al., 2019; 2020) analyze clipped methods for twice-differentiable functions satisfying a more relaxed condition than the traditional L-smoothness. However, much less is known in the stochastic setting. The work (Andradóttir, 1996) proposes a method that uses two independent samples in each iteration and proves its almost sure convergence under the same growth condition used in (Alber et al., 1998). Hazan et al. (2015) establish certain complexity results for a mini-batch stochastic normalized gradient method under some strong assumptions on the closeness of all the generated iterates to the optimal solution and the boundedness of all the individual mini-batch functions. In (Zhang et al., 2019; 2020), stochastic clipped SGD methods are analyzed under the assumption that the noise in the gradient estimate is bounded for every x almost surely. Such an assumption does not hold even for Gaussian noise in the gradients. Finally, we refer to (Cutkosky & Mehta, 2020; Curtis et al., 2019; Gorbunov et al., 2020) for recent theoretical developments for clipped and normalized methods on standard L-smooth problems.

Robustness and stability  The problems of robustness and stability in stochastic optimization have been emphasized in many studies (see, e.g., (Nemirovski et al., 2009; Andradóttir, 1996; Asi & Duchi, 2019a;b) and references therein). Much recent work on this topic concentrates around model-based algorithms that attempt to construct more accurate models of the objective than the linear one provided by the stochastic subgradient. When such models can be obtained and the resulting update steps can be performed efficiently, these methods often possess good stability properties and can be less sensitive to parameter selection than traditional stochastic subgradient methods. For example, the work (Asi & Duchi, 2019a) establishes almost sure convergence of stochastic (approximate) proximal point methods under the arbitrary growth condition used in (Alber et al., 1998; Andradóttir, 1996) and discussed above. In (Asi & Duchi, 2019b), almost sure convergence of the so-called truncated method is proven for convex functions that can grow polynomially from the set of solutions.

Weakly convex minimization  The class of weakly convex functions is broad, allowing for both non-smooth and non-convex objectives, and has favorable structures for algorithmic foundations and complexity theory. Earlier works on weakly convex minimization (Nurminskii, 1973; Ruszczyński, 1987; Ermol'ev & Norkin, 1998) establish qualitative convergence results for subgradient-based methods. With the recent advances in statistical learning and signal processing, there has been an emerging line of work on this topic (see, e.g., (Duchi & Ruan, 2018; Davis & Grimmer, 2019; Davis & Drusvyatskiy, 2019)). Convergence properties have been analyzed for many popular stochastic algorithms such as: model-based methods (Davis & Drusvyatskiy, 2019; Duchi & Ruan, 2018; Asi & Duchi, 2019b); momentum extensions (Mai & Johansson, 2020); adaptive methods (Alacaoglu et al., 2020); and more.

1.2. Contributions

The performance of stochastic (sub)gradient methods depends heavily on how rapidly the underlying objective function is allowed to grow. Much convergence theory for these methods hinges on the L-smoothness assumption for differentiable functions or uniformly bounded subgradients for non-smooth ones. These conditions restrict the corresponding convergence rates to functions with at most quadratic and linear growth, respectively. Beyond these well-behaved classes, there is abundant evidence that SGD and its relatives may fail to converge. It is our goal in this work to show, both theoretically and empirically, that the addition of the clipping step greatly improves the convergence properties of SGD. To that end, we make the following contributions:

• We establish stability and convergence guarantees for clipped SGD on convex problems with arbitrary growth (exponential, super-exponential, etc.) of the subgradients. We show that clipping coupled with standard mini-batching suffices to guarantee almost sure convergence. Even more, a finite convergence rate can also be obtained in this setting.

• We then turn to convex functions with polynomial growth and show that, without the need for mini-batching, clipped SGD can essentially achieve the same optimal convergence rate as for stochastic strongly convex and Lipschitz continuous functions.

• We consider a momentum extension of clipped SGD for weakly convex minimization under standard growth conditions. With a carefully constructed Lyapunov function, we are able to overcome the bias introduced by the clipping step and preserve the best-known sample complexity for this function class.

Our experiments on phase retrieval, absolute linear regression, and classification with neural networks reaffirm our theoretical findings that gradient clipping can: i) stabilize and guarantee convergence for problems with rapidly growing gradients; ii) retain and sometimes improve the best performance of their unclipped counterparts even on standard problems. We note also that none of the convergence results in this work require hard-to-estimate parameters to set the clipping threshold.

Notation  For any x, y ∈ R^n, we denote by ⟨x, y⟩ the Euclidean inner product of x and y. We denote by ∂f(x) the Fréchet subdifferential of f at x; f′(x) denotes any element of ∂f(x). The ℓ₂-norm is denoted by ‖·‖₂. For a closed and convex set X, the distance and the projection map are given respectively by dist(x, X) = min_{z∈X} ‖z − x‖₂ and Π_X(x) = argmin_{z∈X} ‖z − x‖₂. 1{E} denotes the indicator function of an event E; i.e., 1{E} = 1 if E is true and 0 otherwise. The closed ℓ₂-ball centered at x with radius r > 0 is denoted B(x, r). We denote by F_k := σ(S₀, . . . , S_{k−1}) the σ-field formed by the first k random variables S₀, . . . , S_{k−1}, so that x_k ∈ F_k. Finally, we will impose the following basic assumption throughout the paper.

Assumption A1. Let S be a sample drawn from P and f′(x, S) ∈ ∂f(x, S); then E[f′(x, S)] ∈ ∂f(x).

2. Stability and its consequences for convex minimization

In this section, we study the stability of the clipped SGD algorithm and its consequences for the minimization of (possibly non-smooth) convex functions. We first specify the assumptions needed for the results in this section, starting with the basic quadratic growth condition.

Assumption A2 (Quadratic growth). There exists a scalar µ > 0 such that

$$f(x) - f(x^\star) \ge \mu\,\mathrm{dist}(x, \mathcal{X}^\star)^2, \quad \forall x \in \mathrm{dom}(f).$$

Assumption A2 gives a lower bound on the speed at which the objective f grows away from the solution set X*. Since we are interested in problems that may exhibit exploding subgradients, this growth condition is a rather natural assumption. Note also that in many machine learning applications, the addition of a quadratic regularization term to improve generalization results in problems which fundamentally have quadratic growth.

Assumption A3 (Finite variance). There exists a scalar σ > 0 such that

$$\mathbb{E}\big[\|f'(x, S) - f'(x)\|_2^2\big] \le \sigma^2, \quad \forall x \in \mathrm{dom}(f),$$

where f′(x) = E[f′(x, S)] ∈ ∂f(x).

Finally, unless otherwise stated, we assume that the stepsizes α_k are square summable but not summable:

$$\alpha_k \ge 0, \qquad \sum_{k=0}^{\infty}\alpha_k = \infty, \qquad \text{and} \qquad \sum_{k=0}^{\infty}\alpha_k^2 < \infty.$$

Before detailing the stability and convergence analyses of clipped SGD, Example 1 shows that even with stepsizes that are as small as O(1/k), the vanilla SGD method may fail miserably when applied to a function satisfying Assumptions A2–A3. We refer to (Asi & Duchi, 2019a) for more examples of the potential instability of SGD.

Example 1 (Super-Exponential Divergence of SGD). Let f(x) = x⁴/4 + εx²/2 with ε > 0 and consider the SGD algorithm applied to f with the stepsizes α_k = α₁/k:

$$x_{k+1} = x_k - \frac{\alpha_1}{k}\big(x_k^3 + \epsilon x_k\big).$$

Then, if we let x₁ ≥ √(3/α₁), it holds for any k ≥ 1 that

$$|x_k| \ge |x_1|\,\sqrt[3]{k!}.$$

Despite its simplicity, the example highlights that, moving beyond standard (upper) quadratic models, SGD may fail to guarantee any convergence. Our goal in this section is to: (i) show that with a simple clipping step added to SGD, the resulting algorithm becomes much more stable; and (ii) prove strong convergence guarantees for clipped SGD in new settings.
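The blow-up in Example 1 is easy to reproduce numerically. The script below (our illustration; the threshold γ = 10 is an arbitrary choice) runs the noiseless recursion next to its clipped variant:

```python
import numpy as np

# Example 1: f(x) = x**4/4 + eps*x**2/2, so f'(x) = x**3 + eps*x,
# with stepsizes alpha_k = alpha1/k and x1 >= sqrt(3/alpha1).
eps, alpha1, gamma = 0.1, 0.1, 10.0
x_sgd = x_clip = np.sqrt(3.0 / alpha1) + 1.0

for k in range(1, 7):
    alpha = alpha1 / k
    g = x_sgd**3 + eps * x_sgd
    x_sgd -= alpha * g                              # vanilla SGD: |x_k| >= |x_1|*(k!)**(1/3)
    g = x_clip**3 + eps * x_clip
    x_clip -= alpha * min(1.0, gamma / abs(g)) * g  # clipped step has length <= alpha*gamma

print(f"SGD after six steps:     {x_sgd:.2e}")   # on the order of 1e152 and exploding
print(f"clipped after six steps: {x_clip:.2e}")  # stays moderate
```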
Next, we state the first of these results.

Proposition 1 (Stability). Let Assumptions A1, A2, and A3 hold. Let γ_k ≤ γ̄/√α_k for some γ̄ > 0 and let C = σ²/(2µ) + γ̄². Then, the iterates generated by the clipped SGD method satisfy

$$\mathbb{E}\big[\mathrm{dist}(x_k, \mathcal{X}^\star)^2\big] \le \mathrm{dist}(x_0, \mathcal{X}^\star)^2 + C\sum_{i=0}^{k-1}\alpha_i. \tag{3}$$

Some remarks on Proposition 1 are in order. First, unlike SGD, where the distance to the optimal set may grow super-exponentially, the clipped version will not diverge faster than the sum of the used stepsizes. For example, with the stepsizes O(1/k) in Example 1, the sum is only of order log(k). Such a guarantee will play a critical role in establishing all the convergence results in the subsequent sections. Second, the proposition is reported for time-varying clipping thresholds to facilitate the proofs of some subsequent results. We note however that a similar estimate holds for the constant scheme with a slightly different scaling constant. Finally, we mention that the bound (3) is similar to the classical results for the stochastic proximal point iteration (Ryu & Boyd, 2014, Theorem 6), but slightly weaker than the best bounds for that algorithm (Asi & Duchi, 2019a, Corollary 3.1).

2.1. Convergence under arbitrary growth

Having studied the stability of clipped SGD, we now turn to its consequences for the actual convergence guarantees. We first remark that on deterministic convex problems, the procedure (2) is known to be convergent under the very weak growth condition summarized in Assumption A4 below (Alber et al., 1998). Concretely, the subgradients can grow arbitrarily (exponentially, super-exponentially, etc.) as long as they are bounded on bounded sets. However, the situation is less clear as stochastic noise enters the problem. Under A4, similar convergence results have only been established for the stochastic proximal point method (Asi & Duchi, 2019a) and a scaled stochastic approximation algorithm proposed in (Andradóttir, 1996). Note that the former algorithm relies heavily on the ability to accurately model the objective and efficiently solve the resulting minimization problem in each iteration, while the latter one needs two independent search directions to construct its updates. Theorem 1 below demonstrates that gradient clipping coupled with mini-batching can also provide such a strong qualitative guarantee.

Assumption A4. There exists an increasing function G : R₊ → [0, ∞) such that

$$\mathbb{E}\big[\|f'(x, S)\|_2^2\big] \le G\big(\mathrm{dist}(x, \mathcal{X}^\star)\big), \quad \forall x \in \mathrm{dom}(f).$$

Theorem 1. Let Assumptions A1, A2, and A3 hold. Let γ_k = γ for all k. Consider for each k a batch of samples S_k^{1:m_k} and let x_k be generated by the clipped SGD method with g_k = (1/m_k) ∑_{i=1}^{m_k} f′(x_k, S_k^i). Define ϱ_k = min{1, γ/‖g_k‖₂} and e_k = dist(x_k, X*); then

$$\mathbb{E}\big[e_{k+1}^2 \mid \mathcal{F}_k\big] \le \big(1 - \mu\alpha_k\,\mathbb{E}[\varrho_k \mid \mathcal{F}_k]\big)e_k^2 + \frac{\sigma^2\alpha_k}{\mu m_k} + \alpha_k^2\gamma^2.$$

Suppose further that ∑_{k=0}^∞ α_k/m_k < ∞; then under Assumption A4, we have dist(x_k, X*) → 0 almost surely.

Theorem 1 highlights the importance of the clipping step, as no amount of samples in a batch can save SGD from divergence in this setting. In particular, it implies that clipped SGD converges for any growth function provided that sufficiently accurate estimates of the subgradients can be obtained. This is in stark contrast to SGD without clipping, where, as Example 1 shows, the iterates may diverge even in the noiseless setting when the objective function grows faster than the quadratic x². Since the stepsizes are square summable, taking m_k = 1/α_k suffices to guarantee ∑_{k=0}^∞ α_k/m_k < ∞.

It turns out that clipping can even provide a finite convergence rate in this setting, as stated in the next result.

Theorem 2. Let Assumptions A1, A2, A3, and A4 hold. Let α_k = (k+1)^{−τ} with τ ∈ (1/2, 1) and let x_k be generated by clipped SGD using batches of m_k = 1/α_k samples. Define e_k := dist(x_k, X*) and fix a failure probability δ ∈ (0, 1); then for any ε > 0, there exists a numerical constant c₀ > 0 such that

$$\Pr\big(e_K^2 \le \epsilon\big) \ge 1 - \delta - \frac{\delta(\sigma^2/\mu + \gamma^2)\sum_{k=0}^{K-1}\alpha_k^2}{e_0^2} - \frac{c_0\alpha_K}{\epsilon}.$$

Furthermore, if we take α_k = α = α₀K^{−τ} with α₀ ≤ 1/(µϱ), where ϱ = γ/(γ + G(dist(x₀, X*)/δ)) and η = σ²/µ + γ²/(µϱ), then for K ∈ N₊ satisfying µϱα₀K^{1−τ} ≥ log(e₀²K^τ/(ηα₀)), we have

$$\mathrm{dist}(x_K, \mathcal{X}^\star)^2 \le \frac{2\eta\alpha_0}{\delta K^\tau},$$

with probability at least

$$1 - 2\delta - \delta\cdot\frac{(\sigma^2/\mu + \gamma^2)\,\alpha_0}{\mathrm{dist}(x_0, \mathcal{X}^\star)^2\,K^{2\tau-1}}.$$

The first result in the theorem refines the asymptotic guarantee in Theorem 1 for general time-varying stepsizes, and the second one shows the iteration complexity for a constant stepsize and fixed mini-batch size. We have the following remarks: (i) Setting τ close to one in the second claim yields a bound with a similar order-dependence on K and δ as for strongly convex and Lipschitz continuous f (Lan, 2020, eq. (4.2.61)). Note that the last term in the last probability bound is negligible for large K; (ii) The proof of the theorem is motivated by a technique developed in (Davis et al., 2019, Lemma 3.3) to bound the escape probability of their algorithm's iterates.

2.2. Convergence under polynomial growth

For the final set of theoretical results of the section, we consider a more specific function class for which we derive the convergence rate of clipped SGD without the need for mini-batching. In particular, we impose the following conditions on the stochastic subgradients.

Assumption A5. There exist real numbers L₀, L₁, σ ≥ 0 and 2 ≤ p < ∞ such that for all x ∈ dom(f):

$$\mathbb{E}\big[\|f'(x, S)\|_2^2\big] \le L_0 + L_1\,\mathrm{dist}(x, \mathcal{X}^\star)^{2(p-1)},$$
$$\mathbb{E}\big[\|f'(x, S) - f'(x)\|_2^{2(p-1)}\big] \le \sigma^p,$$

where f′(x) = E[f′(x, S)] ∈ ∂f(x).

Note that when p = 2, we have the standard (upper) quadratic growth model (Polyak & Juditsky, 1992). For general values of p, Assumption A5 implies that

$$\|f'(x)\|_2^2 \le \mathbb{E}\big[\|f'(x, S)\|_2^2\big] \le L_0 + L_1\,\mathrm{dist}(x, \mathcal{X}^\star)^{2(p-1)},$$

which, since f is assumed to be convex, guarantees that

$$f(x) - f(x^\star) \le \sqrt{L_0}\,\mathrm{dist}(x, \mathcal{X}^\star) + \sqrt{L_1}\,\mathrm{dist}(x, \mathcal{X}^\star)^p.$$

We thus allow the function f to grow polynomially from the set of optimal solutions. For example, f(x) = x⁴/4 + εx²/2 satisfies the assumption with L₀ = L₁ = 2(1 + ε) and p = 4. The second condition in A5 requires that the 2(p−1)th central moment is bounded, which amounts to finite variance when p = 2. We mention that a closely related assumption has been used in (Asi & Duchi, 2019b, Assumption A3) to analyze a method analogous to the classical Polyak subgradient algorithm. The only difference is in the second condition, where they require boundedness of a quantity involving the objective values instead of the subgradients. This is because f(x, S) and f(x*) are used to construct their updates.

The next lemma explicitly bounds the expected norm of the subgradients and the distance between the iterates and the set of optimal solutions. The proof of this lemma follows the same arguments as in (Asi & Duchi, 2019b, Lemma B2) and is reported in Appendix E for completeness.

Lemma 2.1. Let Assumptions A1, A2, and A5 hold. Let x_k be generated by clipped SGD using α_k = α₀(k+1)^{−τ} with τ ∈ (1/2, 1) and γ_k = γ̄/√α_k. Then there exist positive real constants D₀, D₁, G₀, G₁ (independent of k) such that

$$\mathbb{E}\big[\|f'(x_k, S)\|_2^2\big] \le G_0 + G_1 k^{(p-1)(1-\tau)},$$
$$\mathbb{E}\big[\mathrm{dist}(x_k, \mathcal{X}^\star)^{4(p-1)}\big] \le D_0 + D_1 k^{2(p-1)(1-\tau)}.$$

More specifically, if we define for q ≥ 2 the quantities

$$P_0(q) := 2^{q/2}\,\mathrm{dist}(x_0, \mathcal{X}^\star)^q \quad \text{and} \quad P_1(q) := \Big((2\bar{\gamma})^q + \mu^{-q/2}\sigma^{q/4+1}\Big)\Big(\frac{2\alpha_0}{1-\tau}\Big)^{q/2},$$

then G₀, G₁, D₀, D₁ are given explicitly by

G₀ = L₀ + L₁P₀(2(p−1)),  G₁ = L₁P₁(2(p−1)),  D₀ = P₀(4(p−1)),  D₁ = P₁(4(p−1)).

The lemma reveals an attractive property: the subgradients at the iterates can be made small by setting τ close to one, no matter the value of p. This brings us to a position close to where we would have been if we had assumed Lipschitz continuity of f in the first place. The difference is, however, that the preceding guarantees hold w.r.t. the full expectation, while in the alternative case one is given a priori an upper bound on the quantity E[‖f′(x_k, S)‖₂² | F_k]. We can now state the main result of this subsection.

Theorem 3. Let Assumptions A1, A2, and A5 hold. Let x_k be generated by clipped SGD using α_k = α₀(k+1)^{−τ} with τ ∈ (1/2, 1) and γ_k = γ̄/√α_k. Then there exists a numerical constant C such that

$$\mathbb{E}\big[\mathrm{dist}(x_{k+1}, \mathcal{X}^\star)^2\big] \le \Big(1 - \frac{\mu\alpha_0}{(k+1)^\tau}\Big)\mathbb{E}\big[\mathrm{dist}(x_k, \mathcal{X}^\star)^2\big] + \frac{C}{(k+1)^{2(1-p(1-\tau))}}.$$

Furthermore, if we take τ = 1 − ε for some ε > 0, then

$$\mathbb{E}\big[\mathrm{dist}(x_k, \mathcal{X}^\star)^2\big] \le \frac{C}{\mu\alpha_0}\cdot\frac{1}{k^{1+(1-2p)\epsilon}} + o\Big(\frac{1}{k^{1+(1-2p)\epsilon}}\Big).$$

Some remarks on Theorem 3 are in order:

(i) The numerical constant C can be computed as C = (2γ̄²/µ)(L₀² + L₁(D₀ + D₁)) + G₀ + G₁, where D₀, D₁, G₀, and G₁ are given in Lemma 2.1. If f is Lipschitz continuous (L₁ = 0), then C reduces to C = (2γ̄²/µ)L₀² + L₀. Thus, by setting γ̄ = O(√(µ/L₀)) so that C = O(L₀), we recover a similar order-dependence on L₀ as in the standard bound for unclipped SGD (Lan, 2020).

(ii) Despite the polynomial growth condition, the first bound in the theorem is quite close to the classical estimate for SGD (τ = 1), when applied to Lipschitz continuous f using the small stepsize O(1/k) (Lan, 2020; Beck, 2017; Nemirovski et al., 2009). Note also that our guarantee is valid for a wide range of large stepsizes α_k = O(k^{−τ}) with τ ∈ (1/2, 1), which are more robust than the classical O(1/k) (Nemirovski et al., 2009).

(iii) The convergence result follows from a direct application of Chung's lemma (Chung, 1954, Lemma 4) to the first inequality in Theorem 3. The little-o term in the estimate vanishes exponentially fast with the sum of the used stepsizes in the manner of (Chung, 1954, p. 467) (see also (Bach & Moulines, 2011)). Since we are free to pick ε > 0, we can, in principle, guarantee a rate that is arbitrarily close to O(1/k). Recall that O(1/k) is also the optimal convergence rate for (stochastic) strongly convex and Lipschitz continuous functions (two contradicting conditions) (Nemirovski et al., 2009, Section 2.1). Hence, gradient clipping is able to essentially preserve the optimal rate of SGD while also supporting a broad class of functions on which SGD and its relatives would diverge super-exponentially.

3. Non-asymptotic convergence for weakly convex functions

The previous section demonstrates that gradient clipping can greatly improve the performance of SGD when applied to convex functions with rapidly growing subgradients. We now turn to non-asymptotic convergence analysis of clipped methods under standard growth conditions, but for a much wider class of weakly convex functions. Our goal is to show that the sample complexity of the clipped methods matches the best-known result for weakly convex problems, emphasizing the effectiveness of gradient clipping for a wide range of problem classes.

Recall that f : R^n → R ∪ {+∞} is called ρ-weakly convex if f + (ρ/2)‖·‖₂² is convex. Such functions satisfy the following inequality for any x, y ∈ R^n with g ∈ ∂f(x):

$$f(y) \ge f(x) + \langle g, y - x\rangle - \frac{\rho}{2}\|y - x\|_2^2.$$

Weakly convex optimization problems arise naturally in applications described by compositions of the form f(x) = h(c(x)), where h : R^m → R is convex and L_h-Lipschitz and c : R^n → R^m is a smooth map with L_c-Lipschitz Jacobian. Note also that all convex functions and all differentiable functions with Lipschitz continuous gradient are weakly convex. We refer to (Duchi & Ruan, 2018; Davis & Drusvyatskiy, 2019; Asi & Duchi, 2019a; Mai & Johansson, 2020) for practical applications as well as recent theoretical and algorithmic developments for this function class.

Algorithm  We consider the following momentum extension of clipped SGD:

$$x_{k+1} = x_k - \alpha_k d_k, \tag{4a}$$
$$d_{k+1} = \mathrm{clip}_\gamma\big((1 - \beta_k)d_k + \beta_k g_{k+1}\big). \tag{4b}$$

Here, g_{k+1} = f′(x_{k+1}, S_{k+1}), α_k > 0 is the stepsize, β_k ∈ (0, 1] is the momentum parameter, and γ > 0 is the clipping threshold. The algorithm is initialized from x₀ ∈ R^n and d₀ = clip_γ(g₀) with g₀ ∈ ∂f(x₀, S₀), and generates the sequences x_k ∈ R^n (iterates) and d_k ∈ R^n (search directions). This algorithm goes back to at least (Gupal & Bazhenov, 1972), and in the sequel, we refer to procedure (4) as the clipped stochastic heavy ball (SHB) method.

Next, we state the standing assumption in this section.

Assumption A6. There exists a positive real constant L such that E[‖f′(x, S)‖₂²] ≤ L², ∀x ∈ dom(f).

This is a very basic assumption for non-smooth optimization (Nemirovski et al., 2009; Davis & Drusvyatskiy, 2019).

As the function f is neither smooth nor convex, even measuring the progress to a stationary point for f is a challenging task. A common practice is then to use the norm of the gradient of the Moreau envelope as a proxy for near-stationarity (Davis & Drusvyatskiy, 2019). This is possible since weakly convex functions admit an implicit smooth approximation through the classical Moreau envelope:

$$f_\lambda(x) = \inf_{y\in\mathbb{R}^n}\Big\{f(y) + \frac{1}{2\lambda}\|x - y\|_2^2\Big\}. \tag{5}$$

For λ < ρ⁻¹, the point achieving f_λ(x) in (5), denoted by prox_{λf}(x), is unique and given by

$$\mathrm{prox}_{\lambda f}(x) = \underset{y\in\mathbb{R}^n}{\mathrm{argmin}}\Big\{f(y) + \frac{1}{2\lambda}\|x - y\|_2^2\Big\}. \tag{6}$$

With these definitions, for any x ∈ R^n, the point x̂ = prox_{λf}(x) satisfies

$$\|x - \hat{x}\|_2 = \lambda\|\nabla f_\lambda(x)\|_2, \qquad \mathrm{dist}\big(0, \partial f(\hat{x})\big) \le \|\nabla f_\lambda(x)\|_2. \tag{7}$$

Thus, a small gradient ‖∇f_λ(x)‖₂ implies that x is close to a point x̂ that is near-stationary for f.

As for most convergence analyses of subgradient-based methods, we aim to establish the following per-iterate estimate (see, e.g., (Nemirovski et al., 2009; Davis & Drusvyatskiy, 2019; Ghadimi & Lan, 2013)):

$$\mathbb{E}[V_{k+1}] \le \mathbb{E}[V_k] - c_0\alpha_k\,\mathbb{E}[e_k] + c_1\alpha_k^2. \tag{8}$$

Here e_k is some stationarity measure, V_k is a Lyapunov function, α_k is the stepsize, and c₀, c₁ are some real constants. As discussed above, for minimization of weakly convex functions it is natural to consider e_k = ‖∇f_λ(x_k)‖₂². It now remains to find an appropriate Lyapunov function V_k. To build up our V_k, we will go through a number of supporting lemmas. We begin with the one that concerns the search direction d_k.

Lemma 3.1. Let Assumptions A1 and A6 hold. Let β_k = να_k for some constant ν > 0 such that β_k ∈ (0, 1]. Let x_k be generated by the clipped SHB method. Then

$$f(x_k) + \frac{1-\beta_k}{2\nu}\,\mathbb{E}\big[\|d_k\|_2^2 \mid \mathcal{F}_k\big] \le f(x_{k-1}) + \frac{1-\beta_{k-1}}{2\nu}\|d_{k-1}\|_2^2 - \alpha_k\,\mathbb{E}\big[\|d_k\|_2^2 \mid \mathcal{F}_k\big] + \frac{\alpha_{k-1}}{2}\Big(\frac{\nu L^2}{1-\beta_0} + \rho\gamma^2\Big). \tag{9}$$

It is interesting to note that, despite the bias introduced by the clipping operator, the estimate in (9) is equivalent to that of (Mai & Johansson, 2020, Lemma 3.1). Moreover, our new proof is arguably simpler, more intuitive, and applicable to both constant and time-varying parameters.

The next lemma brings the gradient of the Moreau envelope to the stage.

Lemma 3.2. Let Assumptions A1 and A6 hold. Let β_k = να_k for some constant ν > 0 such that β_k ∈ (0, 1]. Let x_k be generated by clipped SHB with γ ≥ 2L and define

$$W_k = \frac{1}{2\nu}\|d_k - \nabla f_\lambda(x_k)\|_2^2 - \frac{1}{2\nu}\|\nabla f_\lambda(x_k)\|_2^2 + f(x_k).$$

Let C̄ = νL² + ργ²/2; then for any k ∈ N, we have

$$\mathbb{E}[W_k \mid \mathcal{F}_k] \le W_{k-1} - \alpha_{k-1}\mathbb{E}\big[\langle g_k, \nabla f_\lambda(x_k)\rangle \mid \mathcal{F}_k\big] + \alpha_{k-1}\langle d_{k-1}, \nabla f_\lambda(x_{k-1})\rangle + \frac{\alpha_{k-1}}{\lambda\nu}\|d_{k-1}\|_2^2 + \bar{C}\alpha_{k-1}^2. \tag{10}$$

To motivate the introduction of W_k, we take a step back and consider the problem where f is assumed to be L-smooth and no gradient clipping is applied. In this case, procedure (4) is identical to the algorithm which was analyzed in (Ruszczyński & Syski, 1983) using a Lyapunov function of the form

$$V_k = \nu f(x_k) + \frac{1}{2}\|d_k - \nabla f(x_k)\|_2^2 + \frac{1}{2}\|d_k\|_2^2.$$

The key insight here is to view the sequence of directions d_k as estimates of the true gradients ∇f(x_k). Under reasonable assumptions, the term E[‖d_k − ∇f(x_k)‖₂²] can indeed be driven to zero (Ruszczyński & Syski, 1983, Theorem 1). Since our f is non-smooth, this construction is not immediately applicable to our problem. Nevertheless, we observe that it is useful to view d_k as an estimate of the gradient of the Moreau envelope. This is the reason why ‖d_k − ∇f_λ(x_k)‖₂² appears in W_k, while the other terms arise from the algebraic manipulations needed to satisfy (10). Finally, due to the presence of the clipping step, some extra care is needed to make the intuition work. In particular, since the d_k's always belong to B(0, γ), we cannot expect them to approximate the ∇f_λ(x_k) unless these also belong to the γ-ball. It turns out that setting γ ≥ 2L suffices to ensure that ∇f_λ(x_k) ∈ B(0, γ).

We now have all the ingredients needed to construct the ultimate Lyapunov function.

Lemma 3.3. Assume the same setting of Lemma 3.2. Let λ > 0 be such that λ⁻¹ ≥ 2ρ and consider the function

$$V_k = f_\lambda(x_k) + W_k + \frac{f(x_k)}{\lambda\nu} + \Big(\frac{1-\beta_k}{2\lambda\nu^2} + \frac{\alpha_k}{\lambda\nu}\Big)\|d_k\|_2^2.$$

Then, for any k ∈ N₊,

$$\mathbb{E}[V_k \mid \mathcal{F}_k] \le V_{k-1} - \frac{\alpha_{k-1}}{2}\|\nabla f_\lambda(x_k)\|_2^2 + C\alpha_{k-1}^2, \tag{11}$$

where C = λ⁻¹γ²(1 + ρ/(2ν)) + νL²(1 + 1/(2λν(1 − β₀))).

Finally, the following complexity result follows by a standard argument from (11).

Theorem 4. Let Assumptions A1 and A6 hold. Let k* be sampled randomly from {1, . . . , K} with Pr(k* = k + 1) = α_k / ∑_{i=0}^{K−1} α_i. Let ∆ = f(x₀) − inf_x f(x) and let C be given in (11). Then, under the same setting as Lemma 3.3, we have

$$\mathbb{E}\big[\|\nabla f_\lambda(x_{k^*})\|_2^2\big] \le 2\cdot\frac{\xi\Delta + 2L^2/\nu + C\sum_{i=0}^{K-1}\alpha_i^2}{\sum_{i=0}^{K-1}\alpha_i},$$

where ξ = 2 + 1/(λν). Furthermore, if we set α = α₀/√K and ν = 1/α₀ for some real α₀ > 0, then

$$\mathbb{E}\big[\|\nabla f_{1/(2\rho)}(x_{k^*})\|_2^2\big] \le 2\cdot\frac{\xi\Delta + 2L^2/\nu + C\alpha_0^2}{\alpha_0\sqrt{K}}.$$

Finally, if α₀ is set to 1/ρ and K ≥ 2, we obtain

$$\mathbb{E}\big[\|\nabla f_{1/(2\rho)}(x_{k^*})\|_2^2\big] \le 6\cdot\frac{\rho\Delta + \gamma^2}{\sqrt{K}}.$$

The rate achieved by clipped SHB is of the same order as the best-known result for weakly convex stochastic problems (Davis & Drusvyatskiy, 2019, Theorem 1). By inspection, all the proofs and convergence results in this section can also be extended (often with significant simplifications) to the case of clipped SGD. The choice ν = 1/α₀ is just for simplicity; we can choose any value of ν as long as β_k = να_k ∈ (0, 1]. Since β_k = να_k = O(1/√K), one can put much more weight on the momentum term than on the fresh subgradient in the search directions d_k. As both α_k and β_k have the same scale, the algorithm can be seen as a single time-scale method (Ghadimi et al., 2020; Ruszczyński & Syski, 1983).

4. Experimental results

For the first two problems, we set up our experiments as follows. We fix m = 500, n = 50 and generate A ∈ R^{m×n} as A = QD, where Q is a matrix with standard normal distributed entries, and D is a diagonal matrix with linearly spaced elements between 1/κ and 1. Here, κ ≥ 1 represents a condition number, which we set to κ = 10 in all experiments. The algorithms are all randomly initialized at x₀ ∼ N(0, 1) and we use the stepsizes α_k = α₀(k + 1)^{−1/2}, where α₀ is an initial stepsize. We also refer to m stochastic iterations as one epoch (pass over the data). Within each individual run, we set the maximum number of epochs to 500. Each plot reports the results of 30 experiments, visualized as the median of the quantity of interest and the corresponding 90% confidence interval. Finally, the so-called epoch-to-ε-accuracy is defined as the smallest number of epochs q needed to reach f(x_{m·q}) − f(x*) ≤ ε.
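For reference, a sketch of the synthetic data generation described above (our code; the seeds and helper names are ours):

```python
import numpy as np

def make_matrix(m=500, n=50, kappa=10.0, seed=0):
    """A = QD, with Q standard normal and D diagonal with entries
    linearly spaced between 1/kappa and 1 (kappa acts as a condition number)."""
    rng = np.random.default_rng(seed)
    Q = rng.standard_normal((m, n))
    D = np.diag(np.linspace(1.0 / kappa, 1.0, n))
    return Q @ D

A = make_matrix()
x0 = np.random.default_rng(1).standard_normal(A.shape[1])  # x_0 ~ N(0, I)
stepsize = lambda k, alpha0: alpha0 * (k + 1) ** (-0.5)    # alpha_k = alpha0*(k+1)**(-1/2)
```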

500 500 500

400 400 400 =0.1 =0.25 =0.25    300 300 300

200 200 200

SGD SGD SGD Epochs to accuracy Epochs to accuracy 100 Epochs to accuracy 100 100 SHB SHB SHB clipSGD clipSGD clipSGD clipSHB clipSHB clipSHB 0 0 0 2 1 0 1 2 2 1 0 1 2 2 1 0 1 2 10− 10− 10 10 10 10− 10− 10 10 10 10− 10− 10 10 10 Initial stepsize α0 Initial stepsize α0 Initial stepsize α0 (a) 1 − β = 0.9,  = 0.1 (b) 1 − β = 0.9,  = 0.25 (c) 1 − β = 0.99,  = 0.25

Figure 1. The number of epochs to achieve -accuracy versus initial stepsize α0 for phase retrieval with γ = 10.

[Figure 2 (plots omitted). The function gap f(x_k) − f(x*) versus iteration count for phase retrieval with γ = 10. Panels: (a) α₀ = 0.139; (b) α₀ = 0.268; (c) α₀ = 0.518; (d) α₀ = 1.0. Curves: SGD, SHB, Clipped SGD, Clipped SHB; the function gap is shown on a logarithmic scale.]

4.1. Phase retrieval

Given m measurements (a_i, b_i) ∈ R^n × R, the (robust) phase retrieval problem seeks a vector x* such that ⟨a_i, x*⟩² ≈ b_i for most measurements i = 1, . . . , m by solving

$$\underset{x\in\mathbb{R}^n}{\text{minimize}}\quad \frac{1}{m}\sum_{i=1}^{m}\big|\langle a_i, x\rangle^2 - b_i\big|.$$

As for the vector b, in each problem instance, we select x* uniformly from the unit sphere and construct the elements b_i as b_i = ⟨a_i, x*⟩² + δζ_i, i = 1, . . . , m, where ζ_i ∼ N(0, 25) models corrupted measurements, and δ ∈ {0, 1} is a binary random variable taking the value 1 with probability p_fail = 0.1, so that p_fail · m measurements are noisy.
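A sample subgradient of the loss above is sign(⟨a_i, x⟩² − b_i) · 2⟨a_i, x⟩ a_i; the sketch below (our illustration, not the paper's code) implements this oracle and the measurement model:

```python
import numpy as np

def phase_retrieval_subgrad(x, a, b):
    """Subgradient of |<a,x>**2 - b| at x: sign(<a,x>**2 - b) * 2*<a,x> * a."""
    ax = np.dot(a, x)
    return np.sign(ax**2 - b) * 2.0 * ax * a

def make_measurements(A, p_fail=0.1, seed=0):
    """b_i = <a_i, x_star>**2 + delta_i*zeta_i, with zeta_i ~ N(0, 25) (std 5)
    and delta_i ~ Bernoulli(p_fail); x_star is uniform on the unit sphere."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x_star = rng.standard_normal(n)
    x_star /= np.linalg.norm(x_star)
    zeta = rng.normal(0.0, 5.0, size=m)
    delta = (rng.random(m) < p_fail).astype(float)
    return (A @ x_star) ** 2 + delta * zeta, x_star
```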

Figure 1 shows the improved robustness provided by gradient clipping for the SGD and SHB algorithms. These clipped methods achieve good accuracies (within the allowed number of epochs) for much wider ranges of initial stepsizes than their unclipped versions. To further elaborate on this, Figure 2 depicts the actual performance for 4 consecutive stepsizes (out of 15) used to produce Figure 1. We can see that these clipped methods always remain stable, while (started from the same initial point) SGD and SHB exhibit the problem of unboundedness when moving beyond their narrow ranges of working parameters.

4.2. Absolute linear regression

We consider the mean absolute error f(x) = (1/m)‖Ax − b‖₁. For each problem instance, we generate b = Ax* + σw for w ∼ N(0, 1) and σ = 0.01. The problem is well-behaved, as f is both convex and Lipschitz continuous; we did not observe instability of SGD and SHB. Figure 3 shows that gradient clipping does not harm and sometimes can significantly boost the performance of the unclipped counterparts. We can see that although all methods converge with a similar slope, clipped methods may achieve better final accuracies. One possible explanation for this result would be the connection between the used stepsizes and the final error of the (stochastic) subgradient method; we have E[f(x̄_k)] − f(x*) ≤ O(α_k) for α_k = O(1/√k) (Boyd et al., 2003; Duchi, 2018). With clipping, using a smaller γ (if permitted) might have some effect on reducing the effective stepsizes α_k min{1, γ/‖g_k‖₂}, thereby yielding smaller errors.
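For completeness, a sketch of the corresponding subgradient oracles (our code and naming, not the authors'):

```python
import numpy as np

def mae_subgrad(x, A, b):
    """Full subgradient of f(x) = (1/m)*||Ax - b||_1: (1/m) * A^T sign(Ax - b)."""
    return A.T @ np.sign(A @ x - b) / A.shape[0]

def mae_stochastic_subgrad(x, A, b, rng):
    """Single-sample stochastic subgradient: sign(<a_i, x> - b_i) * a_i, i uniform."""
    i = rng.integers(A.shape[0])
    return np.sign(A[i] @ x - b[i]) * A[i]
```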

[Figure 3 (plots omitted). The function gap f(x_k) − f(x*) versus iteration count for absolute linear regression with α₀ = 5 and 1 − β = 0.9. Panels: (a) γ = 0.5; (b) γ = 1.0; (c) γ = 10.0. Curves: SGD, SHB, Clipped SGD, Clipped SHB.]

[Figure 4 (plots omitted). The number of epochs to achieve ε training loss and test error versus initial stepsize α₀ for CIFAR10 with γ = 10. Panels: epochs to training loss ε = 0.1; epochs to test accuracy 85; epochs to test accuracy 90. Curves: SGD, SHB, Clipped SGD, Clipped SHB, Adam.]

4.3. Neural Networks

For our last set of experiments, we consider the image classification task on the CIFAR10 dataset (Krizhevsky et al., 2009) with the ResNet-18 architecture (He et al., 2016). Here, we also compare the previous methods with the Adam algorithm using its default parameters in PyTorch: β₁ = 0.9, β₂ = 0.99, and ε = 10⁻⁸.¹ Following common practice, we use mini-batch size 128, momentum parameter β = 0.9, and weight-decay coefficient 5 × 10⁻⁴ in all experiments. For each algorithm, we conduct 5 experiments (up to 200 epochs) and report the median of the training loss and test accuracy together with the 90% confidence intervals. For the stepsizes, we use constant values starting with α₀ and reduce them by a factor of 10 every 50 epochs. The initial stepsizes α₀ for Adam are scaled by 1/100 in actual runs (Asi & Duchi, 2019b).

¹https://pytorch.org

Figure 4 shows the minimum number of epochs required to reach desired values for various performance measures as a function of the initial stepsize. As the classification task on CIFAR10 is a rather well-conditioned problem, the results tell a very similar story to our absolute linear regression experiments. We also observe that Adam is more sensitive to stepsize selection and needs more time to achieve good test performance in this example. To further clarify this, Figure 5 shows that over the tested range of stepsizes, Adam is not able to reach the same best achievable test accuracies that the other methods do.

[Figure 5 (plot omitted). The best achievable accuracy versus initial stepsize α₀ for CIFAR10 with γ = 10. Axes: initial stepsize α₀ versus test accuracy (75 to 95). Curves: SGD, SHB, Clipped SGD, Clipped SHB, Adam.]

In summary, the results in this section reinforce our theoretical developments that gradient clipping can: i) stabilize and guarantee convergence for problems with rapidly growing gradients; ii) retain and sometimes improve the best performance of their unclipped counterparts even on standard ("easy") problems.

5. Conclusions

We analyzed clipped subgradient-based methods for solving stochastic convex and non-convex optimization problems. Moving beyond traditional quadratic models, we showed that these methods enjoy strong stability properties and attain classical convergence rates in settings where standard convergence theory does not apply. With a novel Lyapunov analysis, we also proved that the sample complexity of the methods matches the best-known result for weakly convex problems, emphasizing the effectiveness of gradient clipping on a wide range of problem classes.

Acknowledgement

This work was supported in part by the Knut and Alice Wallenberg Foundation, the Swedish Research Council and the Swedish Foundation for Strategic Research. The computations were enabled by resources provided by the Swedish National Infrastructure for Computing (SNIC), partially funded by the Swedish Research Council through grant agreement no. 2018-05973. Finally, we thank the anonymous reviewers for their detailed and valuable feedback.

References

Alacaoglu, A., Malitsky, Y., and Cevher, V. Convergence of adaptive algorithms for weakly convex constrained optimization. arXiv preprint arXiv:2006.06650, 2020.

Alber, Y. I., Iusem, A. N., and Solodov, M. V. On the projected subgradient method for nonsmooth convex optimization in a Hilbert space. Mathematical Programming, 81(1):23–35, 1998.

Andradóttir, S. A new algorithm for stochastic optimization. In Proceedings of the 22nd Conference on Winter Simulation, pp. 364–366, 1990.

Andradóttir, S. A scaled stochastic approximation algorithm. Management Science, 42(4):475–498, 1996.

Asi, H. and Duchi, J. C. Stochastic (approximate) proximal point methods: Convergence, optimality, and adaptivity. SIAM Journal on Optimization, 29(3):2257–2290, 2019a.

Asi, H. and Duchi, J. C. The importance of better models in stochastic optimization. Proceedings of the National Academy of Sciences, 116(46):22924–22930, 2019b.

Bach, F. and Moulines, E. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Neural Information Processing Systems (NIPS), pp. 451–459, 2011. URL https://hal.archives-ouvertes.fr/hal-00608041.

Beck, A. First-Order Methods in Optimization, volume 25. SIAM, 2017.

Boyd, S., Xiao, L., and Mutapcic, A. Subgradient methods. Lecture notes, 2003.

Chung, K. L. On a stochastic approximation method. The Annals of Mathematical Statistics, pp. 463–483, 1954.

Curtis, F. E., Scheinberg, K., and Shi, R. A stochastic trust region algorithm based on careful step normalization. INFORMS Journal on Optimization, 1(3):200–220, 2019.

Cutkosky, A. and Mehta, H. Momentum improves normalized SGD. In International Conference on Machine Learning, pp. 2260–2268. PMLR, 2020.

Davis, D. and Drusvyatskiy, D. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207–239, 2019.

Davis, D. and Grimmer, B. Proximally guided stochastic subgradient method for nonsmooth, nonconvex problems. SIAM Journal on Optimization, 29(3):1908–1930, 2019.

Davis, D., Drusvyatskiy, D., and Charisopoulos, V. Stochastic algorithms with geometric step decay converge linearly on sharp functions. arXiv preprint arXiv:1907.09547, 2019.

Duchi, J. C. Introductory lectures on stochastic optimization. The Mathematics of Data, 25:99, 2018.

Duchi, J. C. and Ruan, F. Stochastic methods for composite and weakly convex optimization problems. SIAM Journal on Optimization, 28(4):3229–3259, 2018.

Ermol'ev, Y. M. and Norkin, V. Stochastic generalized gradient method for nonconvex nonsmooth stochastic optimization. Cybernetics and Systems Analysis, 34(2):196–215, 1998.

Ermoliev, Y. Stochastic quasigradient methods. In Numerical Techniques for Stochastic Optimization, pp. 141–185. Springer, 1988.

Ghadimi, S. and Lan, G. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

Ghadimi, S., Ruszczyński, A., and Wang, M. A single time-scale stochastic approximation method for nested stochastic optimization. SIAM Journal on Optimization, 2020. Accepted for publication (arXiv preprint arXiv:1812.01094).

Gorbunov, E., Danilova, M., and Gasnikov, A. Stochastic optimization with heavy-tailed noise via accelerated gradient clipping. arXiv preprint arXiv:2005.10785, 2020.

Gupal, A. M. and Bazhenov, L. G. Stochastic analog of the conjugate-gradient method. Cybernetics and Systems Analysis, 8(1):138–140, 1972.

Hazan, E., Levy, K., and Shalev-Shwartz, S. Beyond convexity: Stochastic quasi-convex optimization. In Advances in Neural Information Processing Systems, pp. 1594–1602, 2015.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Hiriart-Urruty, J.-B. and Lemaréchal, C. Convex Analysis and Minimization Algorithms, volume 305. Springer Science & Business Media, 1993.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.

Lan, G. First-order and Stochastic Optimization Methods for Machine Learning. Springer, 2020.

Mai, V. V. and Johansson, M. Convergence of a stochastic gradient method with momentum for non-smooth non-convex optimization. In Proceedings of the 37th International Conference on Machine Learning, volume 119, pp. 6630–6639, 2020.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

Nurminskii, E. A. The quasigradient method for the solving of the nonlinear programming problems. Cybernetics, 9(1):145–150, 1973.

Pascanu, R., Mikolov, T., and Bengio, Y. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pp. 1310–1318. PMLR, 2013.

Polyak, B. T. Introduction to Optimization. Optimization Software, 1987.

Polyak, B. T. and Juditsky, A. B. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.

Robbins, H. and Siegmund, D. A convergence theorem for non negative almost supermartingales and some applications. In Optimizing Methods in Statistics, pp. 233–257. Elsevier, 1971.

Ruszczyński, A. A linearization method for nonsmooth stochastic programming problems. Mathematics of Operations Research, 12(1):32–49, 1987.

Ruszczyński, A. and Syski, W. Stochastic approximation method with gradient averaging for unconstrained problems. IEEE Transactions on Automatic Control, 28(12):1097–1105, 1983.

Ryu, E. K. and Boyd, S. Stochastic proximal iteration: a non-asymptotic improvement upon stochastic gradient descent. 2014. Author website, early draft.

Shor, N. Z. Minimization Methods for Non-Differentiable Functions, volume 3. Springer Science & Business Media, 1985.

Zhang, B., Jin, J., Fang, C., and Wang, L. Improved analysis of clipping algorithms for non-convex optimization. arXiv preprint arXiv:2010.02519, 2020.

Zhang, J., He, T., Sra, S., and Jadbabaie, A. Why gradient clipping accelerates training: A theoretical justification for adaptivity. In International Conference on Learning Representations, 2019.