Chapter 8 Stochastic Gradient / Subgradient Methods

Contents (class version)

8.0 Introduction
8.1 The subgradient method
  Subgradients and subdifferentials
  Properties of subdifferentials
  Convergence of the subgradient method
8.2 Example: Hinge loss with 1-norm regularizer for binary classifier design
8.3 Incremental (sub)gradient method
  Incremental (sub)gradient method
8.4 Stochastic gradient (SG) method
  SG update
  Stochastic gradient algorithm: convergence analysis
  Variance reduction: overview
  Momentum
  Adaptive step-sizes
8.5 Example: X-ray CT reconstruction
8.6 Summary

© J. Fessler, April 12, 2020, 17:55 (class version)

8.0 Introduction

This chapter describes two families of algorithms:
• subgradient methods
• stochastic gradient methods, aka stochastic gradient descent methods.

Often we turn to these methods as a "last resort," for applications where none of the methods discussed previously are suitable. Many machine learning applications, such as training artificial neural networks, use such methods. As stated in [1], "large-scale machine learning represents a distinctive setting in which traditional nonlinear optimization techniques typically falter." For recent surveys, especially about stochastic gradient methods, see [1, 2].

Acknowledgment: This chapter was based in part on slides made by Naveen Murthy in April 2019.
Rate of convergence review

Suppose the sequence {x_k} converges to x̂. Consider the limiting ratio

  μ ≜ lim_{k→∞} ‖x_{k+1} − x̂‖₂ / ‖x_k − x̂‖₂.

We define the rate of convergence of the sequence {x_k} as follows:
• converges linearly with rate μ if μ ∈ (0, 1)
• converges sublinearly if μ = 1
• converges super-linearly if μ = 0.

Example. For x_k = ρ^k with |ρ| < 1, |x_{k+1} − 0| / |x_k − 0| = |ρ|^{k+1} / |ρ|^k = |ρ| = μ, so the sequence converges linearly.

Example. For x_k = 1/k^c with c > 0, |x_{k+1} − 0| / |x_k − 0| = (1/(k+1)^c) / (1/k^c) = (k/(k+1))^c → 1 = μ, so the convergence is sublinear.

For the sequence x_k = 1 − 1/3^k − 5/k², for k = 1, 2, …, the value of μ is
A: 0  B: 1/5  C: 1/3  D: 1/2  E: 1
??

Gradient descent (GD) review

• GD update: x_{k+1} = x_k − α ∇f(x_k).
• Converges to a minimizer if f is convex and differentiable, ∇f is L-Lipschitz continuous, and 0 < α < 2/L; simple to analyze.
• Worst-case sublinear rate of convergence of O(1/k) if α ≤ 1/L.
• Can be improved to a linear rate O(ρ^k), where ρ < 1, under strong convexity assumptions on f.
• In the usual case where f(x) = Σ_{m=1}^M f_m(x), gradient computation is linear in M, i.e., takes O(M) time. ⟹ Doubling the number of examples in the training set doubles the gradient computation cost.

GD is a "batch" method: ∇f uses all available data at once. The methods in this chapter relax the differentiability requirement and scale better for large-scale problems.

Example. ImageNet [3] contains ∼14 million images with more than 20,000 categories.

8.1 The subgradient method

The O(1/k) convergence rate of ISTA in Ch. 4 (aka PGM) may seem quite slow, and it is by modern standards, but convergence rates can be even worse!
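The limiting ratio μ is easy to estimate numerically for simple sequences. A minimal Python sketch (the particular sequences, index, and helper name `estimate_mu` are illustrative choices, not from the notes):

```python
def estimate_mu(x, xhat, k=200):
    """Estimate mu = lim |x_{k+1} - xhat| / |x_k - xhat| at a large index k."""
    return abs(x(k + 1) - xhat) / abs(x(k) - xhat)

# Linear convergence: x_k = rho^k with |rho| < 1 gives mu = |rho|.
mu_lin = estimate_mu(lambda k: 0.5 ** k, 0.0)

# Sublinear convergence: x_k = 1/k^c gives mu = (k/(k+1))^c -> 1.
mu_sub = estimate_mu(lambda k: 1.0 / k ** 2, 0.0)

print(mu_lin, mu_sub)
```

For the sublinear case the ratio at a finite index is slightly below 1 and approaches 1 as k grows, which is why sublinear convergence looks painfully slow in practice.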
A classic way to seek a minimizer of a non-differentiable cost function Ψ is the subgradient method, defined as [4, 5]:

  x_{k+1} = x_k − α_k g_k,   g_k ∈ ∂Ψ(x_k),   (8.1)

where g_k ∈ ∂Ψ(x_k) denotes a subgradient of the (nondifferentiable) function Ψ at the current iterate x_k. This method was published (in Russian) by Naum Shor in 1962 for solving transportation problems [6, p. 4].

Subgradients and subdifferentials

Define. If f : D → ℝ is a real-valued convex function defined on a convex open set D ⊂ ℝ^N, a vector g ∈ ℝ^N is called a subgradient at a point x₀ ∈ D iff for all z ∈ D we have f(z) − f(x₀) ≥ ⟨g, z − x₀⟩.

Define. The set of all subgradients at x₀ is called the subdifferential at x₀ and is denoted ∂f(x₀) [6, p. 8].

Example. A rectified linear unit (ReLU) in an ANN uses the rectifier function, which has the following definition and subdifferential:

  r(x) = max(x, 0),   ∂r(x) = { {0}, x < 0;  [0, 1], x = 0;  {1}, x > 0 }.

(Figure: plots of ReLU(x) and ∂ReLU(x); draw tangent lines on ReLU.)

For this example, the derivative is defined almost everywhere, i.e., everywhere but a set of Lebesgue measure equal to zero, also known as a null set. Specifically, here the derivative is defined for the entire real line except for the point {0}. In most SIPML applications, the derivatives are defined except for a finite set of points.

Unfortunately, even for a convex function, the direction negative to that of an arbitrary subgradient is not always a direction of descent [6, p. 4].

Example. For f(x) = |x|, the subdifferential is ∂f(x) = { {−1}, x < 0;  [−1, 1], x = 0;  {1}, x > 0 }. At x = 0, all elements of the subdifferential have negatives that are ascent directions, except for 0.

Properties of subdifferentials

Define. The subdifferential of a convex function f : ℝ^N → ℝ is this set of subgradient vectors:

  ∂f(x) ≜ { g ∈ ℝ^N : f(z) − f(x) ≥ ⟨g, z − x⟩, ∀z ∈ ℝ^N }.

Properties [7].
• A convex function f : D → ℝ is differentiable at x ∈ D iff ∂f(x) = {∇f(x)}.
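The update (8.1) applied to f(x) = |x|, using the subdifferential given above, gives a concrete feel for the method. A minimal Python sketch (the starting point, iteration count, and step sizes α_k = 1/k are illustrative choices):

```python
def subgrad_abs(x):
    """One subgradient of f(x) = |x|: sign(x) for x != 0, and 0 at x = 0.
    Any value in [-1, 1] is a valid choice at x = 0."""
    return (x > 0) - (x < 0)

x = 5.0
for k in range(1, 1001):
    alpha = 1.0 / k                 # diminishing step size
    x = x - alpha * subgrad_abs(x)  # subgradient update (8.1)

print(abs(x))  # close to the minimizer x* = 0
```

Note that the iterates eventually oscillate around 0 with amplitude on the order of the current step size; this is why diminishing step sizes are needed, since with a fixed step the iterates would hover at a fixed distance from the minimizer.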
So a subgradient is a generalization of a gradient (for convex functions).
• For any x ∈ D, the subdifferential is a nonempty convex and compact set (closed and bounded) [6, p. 9].
• If convex functions f, g : ℝ^N → ℝ have subdifferentials ∂f and ∂g, respectively, and h(x) = f(x) + g(x), then, for all x ∈ ℝ^N, (HW)

  ∂h(x) = ∂(f + g)(x) = ∂f(x) + ∂g(x) = { u + v : u ∈ ∂f(x), v ∈ ∂g(x) },

where the "+" here denotes the Minkowski sum of two sets. The subdifferential of a sum of convex functions is the (set) sum of their subdifferentials [6, p. 13].
• If a convex function f : ℝ^N → ℝ has subdifferential ∂f and h(x) ≜ α f(x) for α ∈ ℝ, then ∂h(x) = α ∂f(x) for all x ∈ ℝ^N. (?)
A: True  B: False
?? True when α ≥ 0. (HW)
• x is a global minimizer of a convex function f iff 0 ∈ ∂f(x) [6, p. 12].

Example. f(x) = max(|x|, 1). (Figure: plots of f(x) and ∂f(x).)

• One can define convex functions and subdifferentials using the extended reals ℝ ∪ {∞} [7].
• There are also generalizations for non-convex functions [6, p. 17] [7].
• A chain rule for affine arguments [8]. If g(x) = f(Ax + b) for convex f : ℝ^M → ℝ, with x ∈ ℝ^N and A ∈ ℝ^{M×N}, then we saw in an earlier HW that g : ℝ^N → ℝ is convex, and furthermore [7, 8]:

  ∂g(x) = A′ ∂f(Ax + b).   (8.2)

Proof that v ∈ ∂f(Ax + b) ⟹ A′v ∈ ∂g(x). (Read)
Given that v ∈ ∂f(Ax + b), we know that ∀y ∈ ℝ^M: f(y) − f(Ax + b) ≥ v′ (y − (Ax + b)). In particular, choosing y = Az + b for any z ∈ ℝ^N, we have

  f(Az + b) − f(Ax + b) ≥ v′ ((Az + b) − (Ax + b)) = v′ A (z − x) = (A′v)′ (z − x).

So by construction, for any z ∈ ℝ^N we have g(z) − g(x) ≥ (A′v)′ (z − x), implying that A′v ∈ ∂g(x). Thus we have shown A′ ∂f(Ax + b) ⊆ ∂g(x). □

Showing that ∂g(x) ⊆ A′ ∂f(Ax + b), to complete the proof of (8.2), seems to be more complicated. For details: https://maunamn.wordpress.com/9-the-subdifferential-chain-rule

Example.
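The chain rule (8.2) can be sanity-checked numerically: for g(x) = ‖Ax + b‖₁, the vector A′ sign(Ax + b) should satisfy the subgradient inequality at any point. A small Python/NumPy sketch (the random test data, dimensions, and number of trial points are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 6, 4
A = rng.standard_normal((M, N))
b = rng.standard_normal(M)

g = lambda x: np.abs(A @ x + b).sum()   # g(x) = ||Ax + b||_1

x0 = rng.standard_normal(N)
v = np.sign(A @ x0 + b)                 # v is a subgradient of ||.||_1 at Ax0 + b
sg = A.T @ v                            # candidate subgradient of g at x0, per (8.2)

# Subgradient inequality: g(z) - g(x0) >= <sg, z - x0> for all z.
for _ in range(1000):
    z = rng.standard_normal(N)
    assert g(z) - g(x0) >= sg @ (z - x0) - 1e-9

print("subgradient inequality verified")
```

Of course, passing at randomly sampled points only supports, rather than proves, the inclusion A′∂f(Ax + b) ⊆ ∂g(x) established above.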
Consider compressed sensing with an analysis sparsity regularizer and β ≥ 0:

  Ψ(x) = (1/2) ‖Ax − y‖₂² + β ‖Tx‖₁ ⟹ A′(Ax − y) + β T′ sign.(Tx) ∈ ∂Ψ(x),

because sign(t) is a subgradient of |t|; here we have applied several of the above properties: derivative, sum, scaling, affine.

Convergence of the subgradient method

For suitable choices of {α_k}, and suitable assumptions on Ψ such as convexity, the convergence rate of the subgradient method (8.1) is bounded by O(1/√k) [4–6]. (See result on subsequent pages.)

Diminishing step sizes

Convergence theorems for SGM often assume that the step sizes diminish, but not too quickly. Specifically, often we assume they satisfy:

  α_k > 0,   α_k → 0,   Σ_{k=1}^∞ α_k = ∞.   (8.3)

SGM convergence for diminishing step sizes

A classic convergence theorem for SGM is the following [6, p. 26]. If Ψ is convex and has a bounded set of minimizers X∗, and {α_k} satisfies (8.3), then the sequence {x_k} generated by (8.4) for any x₀ has the property that either
• the sequence {g(x_k)} is bounded and the sequence {x_k} converges in the sense that d(x_k, X∗) → 0 and Ψ(x_k) → Ψ∗, or
• the sequence {g(x_k)} is unbounded and there is no convergence.
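Combining the subgradient expression for Ψ with step sizes satisfying (8.3) gives a complete instance of the subgradient method (8.1). A minimal Python/NumPy sketch (the problem sizes, random data, the choice T = I, and α_k = 1/(L√k) are all illustrative choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, beta = 30, 20, 0.1
A = rng.standard_normal((M, N))
y = rng.standard_normal(M)
T = np.eye(N)                        # analysis regularizer; identity for simplicity

def cost(x):
    """Psi(x) = (1/2)||Ax - y||_2^2 + beta ||Tx||_1."""
    return 0.5 * np.sum((A @ x - y) ** 2) + beta * np.abs(T @ x).sum()

def subgrad(x):
    """One element of the subdifferential of Psi at x,
    via the derivative, sum, scaling, and affine properties."""
    return A.T @ (A @ x - y) + beta * T.T @ np.sign(T @ x)

L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of the smooth part
x = np.zeros(N)
best = cost(x)
for k in range(1, 2001):
    alpha = 1.0 / (L * np.sqrt(k))   # satisfies (8.3): alpha_k -> 0, sum diverges
    x = x - alpha * subgrad(x)       # subgradient update (8.1)
    best = min(best, cost(x))        # track best cost: Psi(x_k) need not decrease

print(best)
```

Tracking the best cost so far matters here: unlike gradient descent, the subgradient method is not a descent method, so Ψ(x_k) can increase on individual iterations even while the best value improves at the O(1/√k) rate.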
