Chapter 8 Stochastic Gradient / Subgradient Methods
Chapter 8: Stochastic gradient / subgradient methods

Contents (class version)
8.0 Introduction
8.1 The subgradient method
    Subgradients and subdifferentials
    Properties of subdifferentials
    Convergence of the subgradient method
8.2 Example: Hinge loss with 1-norm regularizer for binary classifier design
8.3 Incremental (sub)gradient method
8.4 Stochastic gradient (SG) method
    SG update
    Stochastic gradient algorithm: convergence analysis
    Variance reduction: overview
    Momentum
    Adaptive step-sizes
8.5 Example: X-ray CT reconstruction
8.6 Summary

© J. Fessler, April 12, 2020, 17:55 (class version)

8.0 Introduction

This chapter describes two families of algorithms:
• subgradient methods
• stochastic gradient methods, aka stochastic gradient descent methods

Often we turn to these methods as a "last resort," for applications where none of the methods discussed previously are suitable. Many machine learning applications, such as training artificial neural networks, use such methods. As stated in [1], "large-scale machine learning represents a distinctive setting in which traditional nonlinear optimization techniques typically falter." For recent surveys, especially about stochastic gradient methods, see [1, 2].

Acknowledgment: This chapter was based in part on slides made by Naveen Murthy in April 2019.

Rate of convergence review

Suppose the sequence $\{x_k\}$ converges to $\hat{x}$. Consider the limiting ratio
\[
\mu \triangleq \lim_{k \to \infty} \frac{\|x_{k+1} - \hat{x}\|_2}{\|x_k - \hat{x}\|_2}.
\]
We define the rate of convergence of the sequence $\{x_k\}$ as follows:
• converges linearly with rate $\mu$ if $\mu \in (0, 1)$
• converges sublinearly if $\mu = 1$
• converges super-linearly if $\mu = 0$.

Example. For $x_k = \rho^k$ with $|\rho| < 1$,
$\frac{|x_{k+1} - 0|}{|x_k - 0|} = \frac{|\rho|^{k+1}}{|\rho|^{k}} = |\rho| = \mu$,
so the sequence converges linearly.

Example. For $x_k = 1/k^c$ with $c > 0$,
$\frac{|x_{k+1} - 0|}{|x_k - 0|} = \frac{1/(k+1)^c}{1/k^c} = \left(\frac{k}{k+1}\right)^{c} \to 1 = \mu$,
so the convergence is sublinear.

Quiz. For the sequence $x_k = 1 - \frac{1}{3^k} - \frac{5}{k^2}$, for $k = 1, 2, \ldots$, the value of $\mu$ is
A: 0  B: 1/5  C: 1/3  D: 1/2  E: 1
??

Gradient descent (GD) review

• GD update: $x_{k+1} = x_k - \alpha \nabla f(x_k)$.
• Converges to a minimizer if $f$ is convex and differentiable, $\nabla f$ is $L$-Lipschitz continuous, and $0 < \alpha < 2/L$; simple to analyze.
• Worst-case sublinear rate of convergence of $O(1/k)$ if $\alpha \le 1/L$.
• Can be improved to a linear rate $O(\rho^k)$, where $\rho < 1$, under strong convexity assumptions on $f$.
• In the usual case where $f(x) = \sum_{m=1}^{M} f_m(x)$, gradient computation is linear in $M$, i.e., takes $O(M)$ time.
$\Longrightarrow$ Doubling the number of examples in the training set doubles the gradient computation cost.

GD is a "batch" method: $\nabla f$ uses all available data at once.

The methods in this chapter relax the differentiability requirement, and scale better for large-scale problems.

Example. ImageNet [3] contains ∼14 million images with more than 20,000 categories.

8.1 The subgradient method

The $O(1/k)$ convergence rate of ISTA in Ch. 4 (aka PGM) may seem quite slow, and it is by modern standards, but convergence rates can be even worse!
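The limiting-ratio definitions reviewed above can be checked numerically for the two example sequences. A minimal Python sketch (the helper name `ratio` and the particular values of $\rho$ and $c$ are my choices):

```python
# Estimate the limiting ratio mu = lim |x_{k+1} - xhat| / |x_k - xhat|
# for the two example sequences in the text (both converge to xhat = 0).

def ratio(seq, k):
    """One term of the ratio |x_{k+1} - 0| / |x_k - 0|."""
    return abs(seq(k + 1)) / abs(seq(k))

rho = 0.5
linear = lambda k: rho ** k        # x_k = rho^k: ratio is exactly |rho| (linear rate)
sublinear = lambda k: 1.0 / k**2   # x_k = 1/k^c with c = 2: ratio -> 1 (sublinear)

for k in (10, 100, 1000):
    print(k, ratio(linear, k), ratio(sublinear, k))
```

As $k$ grows, the first column of ratios stays at $0.5$ while the second creeps toward $1$, matching the linear vs. sublinear classification above.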
A classic way to seek a minimizer of a non-differentiable cost function $\Psi$ is the subgradient method, defined as [4, 5]:
\[
x_{k+1} = x_k - \alpha_k g_k, \qquad g_k \in \partial \Psi(x_k),
\tag{8.1}
\]
where $g_k \in \partial \Psi(x_k)$ denotes a subgradient of the (nondifferentiable) function $\Psi$ at the current iterate $x_k$. This method was published (in Russian) by Naum Shor in 1962 for solving transportation problems [6, p. 4].

Subgradients and subdifferentials

Define. If $f : D \to \mathbb{R}$ is a real-valued convex function defined on a convex open set $D \subset \mathbb{R}^N$, a vector $g \in \mathbb{R}^N$ is called a subgradient at a point $x_0 \in D$ iff for all $z \in D$ we have
\[
f(z) - f(x_0) \ge \langle g, z - x_0 \rangle.
\]

Define. The set of all subgradients at $x_0$ is called the subdifferential at $x_0$ and is denoted $\partial f(x_0)$ [6, p. 8].

Example. A rectified linear unit (ReLU) in an ANN uses the rectifier function, which has the following definition and subdifferential:
\[
r(x) = \max(x, 0), \qquad
\partial r(x) = \begin{cases} \{0\}, & x < 0 \\ [0, 1], & x = 0 \\ \{1\}, & 0 < x. \end{cases}
\]
(Figure: plots of $\mathrm{ReLU}(x)$ and $\partial\,\mathrm{ReLU}(x)$; draw tangent lines on the ReLU.)

For this example, the derivative is defined almost everywhere, i.e., everywhere but a set of Lebesgue measure equal to zero, also known as a null set. Specifically, here the derivative is defined for the entire real line except for the point $\{0\}$. In most SIPML applications, the derivatives are defined except for a finite set of points.

Unfortunately, even for a convex function, the direction negative to that of an arbitrary subgradient is not always a direction of descent [6, p. 4].

Example. For $f(x) = |x|$, the subdifferential is
\[
\partial f(x) = \begin{cases} \{-1\}, & x < 0 \\ [-1, 1], & x = 0 \\ \{1\}, & 0 < x. \end{cases}
\]
At $x = 0$, all elements of the subdifferential have negatives that are ascent directions, except for $0$.

Properties of subdifferentials

Define. The subdifferential of a convex function $f : \mathbb{R}^N \to \mathbb{R}$ is this set of subgradient vectors:
\[
\partial f(x) \triangleq \left\{ g \in \mathbb{R}^N : f(z) - f(x) \ge \langle g, z - x \rangle, \; \forall z \in \mathbb{R}^N \right\}.
\]

Properties [7].
• A convex function $f : D \to$
$\mathbb{R}$ is differentiable at $x \in D$ iff $\partial f(x) = \{\nabla f(x)\}$. So a subgradient is a generalization of a gradient (for convex functions).
• For any $x \in D$, the subdifferential is a nonempty convex and compact set (closed and bounded) [6, p. 9].
• If convex functions $f, g : \mathbb{R}^N \to \mathbb{R}$ have subdifferentials $\partial f$ and $\partial g$, respectively, and $h(x) = f(x) + g(x)$, then, for all $x \in \mathbb{R}^N$ (HW):
\[
\partial h(x) = \partial(f + g)(x) = \partial f(x) + \partial g(x) = \{u + v : u \in \partial f(x), \; v \in \partial g(x)\},
\]
where the "+" here denotes the Minkowski sum of two sets. The subdifferential of a sum of convex functions is the (set) sum of their subdifferentials [6, p. 13].
• Quiz. If convex function $f : \mathbb{R}^N \to \mathbb{R}$ has subdifferential $\partial f$ and $h(x) \triangleq \alpha f(x)$ for $\alpha \in \mathbb{R}$, then $\partial h(x) = \alpha \partial f(x)$ for all $x \in \mathbb{R}^N$. (?)
A: True  B: False
?? True when $\alpha \ge 0$. (HW)
• $x$ is a global minimizer of a convex function $f$ iff $0 \in \partial f(x)$ [6, p. 12].
Example. $f(x) = \max(|x|, 1)$. (Figure: plots of $f(x)$ and $\partial f(x)$.)
• One can define convex functions and subdifferentials using the extended reals $\mathbb{R} \cup \{\infty\}$ [7].
• There are also generalizations for non-convex functions [6, p. 17] [7].
• A chain rule for affine arguments [8]. If $g(x) = f(Ax + b)$ for convex $f : \mathbb{R}^M \to \mathbb{R}$, for $x \in \mathbb{R}^N$ and $A \in \mathbb{R}^{M \times N}$, then we saw in earlier HW that $g : \mathbb{R}^N \to \mathbb{R}$ is convex, and furthermore [7, 8]:
\[
\partial g(x) = A' \partial f(Ax + b).
\tag{8.2}
\]

Proof (Read) that $v \in \partial f(Ax + b) \Longrightarrow A'v \in \partial g(x)$.
Given that $v \in \partial f(Ax + b)$, we know that $\forall y \in \mathbb{R}^M$:
\[
f(y) - f(Ax + b) \ge v' \left( y - (Ax + b) \right).
\]
In particular, choosing $y = Az + b$ for any $z \in \mathbb{R}^N$, we have
\[
f(Az + b) - f(Ax + b) \ge v' \left( (Az + b) - (Ax + b) \right) = v' A (z - x) = (A'v)' (z - x).
\]
So by construction, for any $z \in \mathbb{R}^N$ we have $g(z) - g(x) \ge (A'v)'(z - x)$, implying that $A'v \in \partial g(x)$. Thus we have shown $A' \partial f(Ax + b) \subseteq \partial g(x)$. $\square$

Showing that $\partial g(x) \subseteq A' \partial f(Ax + b)$, to complete the proof of (8.2), seems to be more complicated. For details: https://maunamn.wordpress.com/9-the-subdifferential-chain-rule
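The easy direction of the chain rule can be checked numerically: with $f$ the 1-norm, $v = \operatorname{sign}(Ax + b)$ is a subgradient of $f$ at $Ax + b$, so $A'v$ should satisfy the subgradient inequality $g(z) - g(x) \ge (A'v)'(z - x)$ for $g(x) = f(Ax + b)$. A small pure-Python sketch ($A$, $b$, and the random test points are arbitrary choices of mine):

```python
# Numerical check that v in subdiff f(Ax+b) implies A'v in subdiff g(x),
# for f(u) = ||u||_1 and g(x) = f(Ax + b).
import random

def f(u):                      # f(u) = ||u||_1
    return sum(abs(ui) for ui in u)

def matvec(A, x):              # A x for a list-of-rows matrix
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def sign(u):                   # element-wise sign; sign(0) = 0 is a valid subgradient choice
    return [(ui > 0) - (ui < 0) for ui in u]

random.seed(0)
M, N = 4, 3
A = [[random.uniform(-1, 1) for _ in range(N)] for _ in range(M)]
b = [random.uniform(-1, 1) for _ in range(M)]
x = [random.uniform(-1, 1) for _ in range(N)]

def g(z):
    return f([ui + bi for ui, bi in zip(matvec(A, z), b)])

v = sign([ui + bi for ui, bi in zip(matvec(A, x), b)])           # v in subdiff f(Ax+b)
Atv = [sum(A[i][j] * v[i] for i in range(M)) for j in range(N)]  # A'v

ok = True
for _ in range(1000):          # subgradient inequality at many random z
    z = [random.uniform(-2, 2) for _ in range(N)]
    lhs = g(z) - g(x)
    rhs = sum(gi * (zi - xi) for gi, zi, xi in zip(Atv, z, x))
    ok = ok and (lhs >= rhs - 1e-9)
print(ok)  # True: A'v satisfies the subgradient inequality for g
```

The inequality holds at every test point, as the proof above guarantees; the check cannot, of course, establish the harder reverse inclusion.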
Example. Consider compressed sensing with an analysis sparsity regularizer and $\beta \ge 0$:
\[
\Psi(x) = \frac{1}{2} \|Ax - y\|_2^2 + \beta \|Tx\|_1
\Longrightarrow
A'(Ax - y) + \beta T' \operatorname{sign}.(Tx) \in \partial \Psi(x),
\]
where $\operatorname{sign}.(\cdot)$ denotes the element-wise sign function, because $\operatorname{sign}(t)$ is a subgradient of $|t|$, and we have applied several of the above properties: derivative, sum, scaling, affine.

Convergence of the subgradient method

For suitable choice of $\{\alpha_k\}$, and suitable assumptions on $\Psi$ such as convexity, the convergence rate of the subgradient method (8.1) is bounded by $O(1/\sqrt{k})$ [4–6]. (See result on subsequent pages.)

Diminishing step sizes

Convergence theorems for SGM often assume that the step sizes diminish, but not too quickly. Specifically, often we assume they satisfy:
\[
\alpha_k > 0, \qquad \alpha_k \to 0, \qquad \sum_{k=1}^{\infty} \alpha_k = \infty.
\tag{8.3}
\]

SGM convergence for diminishing step sizes

A classic convergence theorem for SGM is the following [6, p. 26]. If $\Psi$ is convex and has a bounded set of minimizers $X_*$, and $\{\alpha_k\}$ satisfies (8.3), then the sequence $\{x_k\}$ generated by (8.1) for any $x_0$ has the property that either
• the sequence of subgradients $\{g(x_k)\}$ is bounded and the sequence $\{x_k\}$ converges in the sense that $\{d(x_k, X_*)\} \to 0$ and $\{\Psi(x_k)\} \to \Psi_*$; or
• the sequence $\{g(x_k)\}$ is unbounded and there is no convergence.
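The subgradient method with a diminishing step size satisfying (8.3) can be illustrated on the compressed sensing cost above. The following Python sketch specializes (my simplification) to scalars with $A = T = 1$, so $\Psi(x) = \frac{1}{2}(x - y)^2 + \beta |x|$, whose exact minimizer is the soft-threshold $x_* = \operatorname{sign}(y) \max(|y| - \beta, 0)$; the step size $\alpha_k = 1/k$ and the values of $y$ and $\beta$ are also my choices:

```python
# Subgradient method with diminishing steps alpha_k = 1/k (satisfying (8.3))
# on the scalar cost Psi(x) = 0.5*(x - y)^2 + beta*|x|, using the subgradient
# (x - y) + beta*sign(x) from the compressed sensing example (A = T = 1).

y, beta = 3.0, 1.0
xstar = (1 if y > 0 else -1) * max(abs(y) - beta, 0.0)  # soft threshold: 2.0 here

x = 0.0                                         # x_0
for k in range(1, 5001):
    g = (x - y) + beta * ((x > 0) - (x < 0))    # subgradient of Psi at x_k
    x = x - (1.0 / k) * g                       # x_{k+1} = x_k - alpha_k g_k

print(x, xstar)  # the iterates approach the soft-threshold minimizer
```

Here the subgradients stay bounded, so per the theorem above the iterates converge to the minimizer; with a constant step size they would instead hover in a neighborhood of it.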