Subgradient Method


Ryan Tibshirani
Convex Optimization 10-725/36-725

Last last time: gradient descent

Consider the problem

    min_x f(x)

for f convex and differentiable, with dom(f) = R^n. Gradient descent: choose an initial x^(0) ∈ R^n, and repeat

    x^(k) = x^(k-1) - t_k · ∇f(x^(k-1)),   k = 1, 2, 3, ...

Step sizes t_k are chosen to be fixed and small, or by backtracking line search.

If ∇f is Lipschitz, gradient descent has convergence rate O(1/ε).

Downsides:
  • Requires f differentiable (addressed in this lecture)
  • Can be slow to converge (addressed in the next lecture)

Subgradient method

Now consider f convex, with dom(f) = R^n, but not necessarily differentiable.

Subgradient method: like gradient descent, but replacing gradients with subgradients. I.e., initialize x^(0), and repeat

    x^(k) = x^(k-1) - t_k · g^(k-1),   k = 1, 2, 3, ...

where g^(k-1) ∈ ∂f(x^(k-1)) is any subgradient of f at x^(k-1).

The subgradient method is not necessarily a descent method, so we keep track of the best iterate x_best^(k) among x^(0), ..., x^(k) so far, i.e.,

    f(x_best^(k)) = min_{i=0,...,k} f(x^(i))

Outline

Today:
  • How to choose step sizes
  • Convergence analysis
  • Intersection of sets
  • Stochastic subgradient method

Step size choices

  • Fixed step sizes: t_k = t for all k = 1, 2, 3, ...
  • Diminishing step sizes: chosen to meet the conditions

        sum_{k=1}^∞ t_k^2 < ∞,   sum_{k=1}^∞ t_k = ∞,

    i.e., square summable but not summable. It is important that the step sizes go to zero, but not too fast.

There are other options too, but here is an important difference from gradient descent: all step size options are pre-specified, not adaptively computed.

Convergence analysis

Assume that f is convex with dom(f) = R^n, and also that f is Lipschitz continuous with constant G > 0, i.e.,

    |f(x) - f(y)| ≤ G ||x - y||_2   for all x, y

Theorem: For a fixed step size t, the subgradient method satisfies

    lim_{k→∞} f(x_best^(k)) ≤ f* + G^2 t / 2

Theorem: For diminishing step sizes, the subgradient method satisfies

    lim_{k→∞} f(x_best^(k)) = f*
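The update rule and best-iterate bookkeeping above can be sketched in a few lines. This is a minimal illustration, not code from the lecture: the test function f(x) = ||x - a||_1 (convex, nondifferentiable, minimized at a) and the step sequence t_k = 1/k (square summable but not summable) are this sketch's assumptions.

```python
import numpy as np

def subgradient_method(f, subgrad, x0, step, n_iter):
    """x^(k) = x^(k-1) - t_k * g^(k-1), tracking the best iterate,
    since the subgradient method is not a descent method."""
    x = x0.copy()
    x_best, f_best = x.copy(), f(x)
    for k in range(1, n_iter + 1):
        x = x - step(k) * subgrad(x)
        if f(x) < f_best:
            x_best, f_best = x.copy(), f(x)
    return x_best, f_best

a = np.array([1.0, -2.0, 3.0])
f = lambda x: np.sum(np.abs(x - a))   # minimized at x = a, with f* = 0
subgrad = lambda x: np.sign(x - a)    # a valid subgradient of f at every x
x_best, f_best = subgradient_method(f, subgrad, np.zeros(3),
                                    step=lambda k: 1.0 / k, n_iter=500)
```

With diminishing steps, f(x_best^(k)) approaches f* = 0 here, consistent with the second theorem above.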
Basic bound, and convergence rate

Letting R = ||x^(0) - x*||_2, after k steps we have the basic bound

    f(x_best^(k)) - f(x*) ≤ (R^2 + G^2 sum_{i=1}^k t_i^2) / (2 sum_{i=1}^k t_i)

The previous theorems follow from this. With fixed step size t, we have

    f(x_best^(k)) - f* ≤ R^2 / (2kt) + G^2 t / 2

For this to be ≤ ε, let's make each term ≤ ε/2. Therefore choose t = ε/G^2 and k = (R^2/t) · (1/ε) = R^2 G^2 / ε^2.

I.e., the subgradient method has convergence rate O(1/ε^2) ... compare this to the O(1/ε) rate of gradient descent.

Example: regularized logistic regression

Given (x_i, y_i) ∈ R^p × {0, 1} for i = 1, ..., n, consider the logistic regression loss:

    f(β) = sum_{i=1}^n ( -y_i x_i^T β + log(1 + exp(x_i^T β)) )

This is a smooth and convex function, with

    ∇f(β) = sum_{i=1}^n (p_i(β) - y_i) x_i

where p_i(β) = exp(x_i^T β) / (1 + exp(x_i^T β)), i = 1, ..., n. We will consider the regularized problem:

    min_{β ∈ R^p} f(β) + λ · P(β)

where P(β) = ||β||_2^2 (ridge penalty) or P(β) = ||β||_1 (lasso penalty).

Ridge problem: use gradients; lasso problem: use subgradients. Data example with n = 1000, p = 20:

[Figure: f - fstar versus k (log scale), for gradient descent on the ridge problem with t = 0.001, and the subgradient method on the lasso problem with t = 0.001 and t = 0.001/k]

Step sizes were hand-tuned to be favorable for each method (of course the comparison is imperfect, but it reveals the convergence behaviors).

Polyak step sizes

Polyak step sizes: when the optimal value f* is known, take

    t_k = (f(x^(k-1)) - f*) / ||g^(k-1)||_2^2,   k = 1, 2, 3, ...

This can be motivated from the first step in the subgradient proof:

    ||x^(k) - x*||_2^2 ≤ ||x^(k-1) - x*||_2^2 - 2 t_k (f(x^(k-1)) - f(x*)) + t_k^2 ||g^(k-1)||_2^2

The Polyak step size minimizes the right-hand side.

With Polyak step sizes, one can show that the subgradient method converges to the optimal value. The convergence rate is still O(1/ε^2).

Example: intersection of sets

Suppose we want to find x* ∈ C_1 ∩ ... ∩ C_m, i.e., find a point in the intersection of the closed, convex sets C_1, ..., C_m.

First define

    f_i(x) = dist(x, C_i),   i = 1, ..., m
    f(x) = max_{i=1,...,m} f_i(x)

and now solve

    min_x f(x)

Note that f* = 0 implies x* ∈ C_1 ∩ ... ∩ C_m.
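Note that this setup has known optimal value f* = 0, which is exactly the situation where the Polyak rule applies. As a quick standalone illustration of Polyak steps, here is a sketch on an assumed test function f(x) = ||x - a||_1 (also with known f* = 0); the example is not from the lecture.

```python
import numpy as np

# Polyak step sizes: t_k = (f(x^(k-1)) - f*) / ||g^(k-1)||_2^2,
# usable here because the optimal value f* = 0 is known.
a = np.array([1.0, -2.0, 3.0])
f = lambda x: np.sum(np.abs(x - a))
f_star = 0.0

x = np.zeros(3)
for _ in range(100):
    g = np.sign(x - a)        # a subgradient of f at x
    gnorm2 = g @ g
    if gnorm2 == 0.0:         # 0 is in the subdifferential: x is optimal
        break
    x = x - (f(x) - f_star) / gnorm2 * g
```

On this small problem the Polyak rule happens to land on the minimizer after a couple of steps; in general it guarantees convergence to the optimal value at the O(1/ε^2) rate noted above.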
Check: is this problem convex? (Yes: each distance function is convex, and a pointwise maximum of convex functions is convex.)

Recall the distance function dist(x, C) = min_{y ∈ C} ||y - x||_2. Last time we computed its gradient:

    ∇dist(x, C) = (x - P_C(x)) / ||x - P_C(x)||_2

where P_C(x) is the projection of x onto C.

Also recall the subgradient rule: if f(x) = max_{i=1,...,m} f_i(x), then

    ∂f(x) = conv( ∪_{i : f_i(x) = f(x)} ∂f_i(x) )

So if f_i(x) = f(x) and g_i ∈ ∂f_i(x), then g_i ∈ ∂f(x).

Put these two facts together for the intersection-of-sets problem, with f_i(x) = dist(x, C_i): if C_i is the farthest set from x (so that f_i(x) = f(x)), and

    g_i = ∇f_i(x) = (x - P_{C_i}(x)) / ||x - P_{C_i}(x)||_2

then g_i ∈ ∂f(x).

Now apply the subgradient method with the Polyak step size t_k = f(x^(k-1)) (recall f* = 0, and note ||g^(k-1)||_2 = 1). At iteration k, with C_i farthest from x^(k-1), we perform the update

    x^(k) = x^(k-1) - f(x^(k-1)) · (x^(k-1) - P_{C_i}(x^(k-1))) / ||x^(k-1) - P_{C_i}(x^(k-1))||_2
          = P_{C_i}(x^(k-1))

For two sets, this is the famous alternating projections algorithm: just keep projecting back and forth.

[Figure: alternating projections between two convex sets, from Boyd's lecture notes]

Stochastic subgradient method

Consider a sum of functions:

    min_x sum_{i=1}^m f_i(x)

Recall that ∂( sum_{i=1}^m f_i(x) ) = sum_{i=1}^m ∂f_i(x), so the subgradient method would repeat

    x^(k) = x^(k-1) - t_k · sum_{i=1}^m g_i^(k-1),   k = 1, 2, 3, ...

where g_i^(k-1) ∈ ∂f_i(x^(k-1)). In comparison, the stochastic subgradient method (or incremental subgradient method) repeats

    x^(k) = x^(k-1) - t_k · g_{i_k}^(k-1),   k = 1, 2, 3, ...

where i_k ∈ {1, ..., m} is some chosen index at iteration k.

Stochastic gradient descent is the special case in which the f_i, i = 1, ..., m are differentiable, so that g_i^(k-1) = ∇f_i(x^(k-1)).

Two rules for choosing the index i_k at iteration k:
  • Cyclic rule: choose i_k = 1, 2, ..., m, 1, 2, ..., m, ...
  • Randomized rule: choose i_k ∈ {1, ..., m} uniformly at random

The randomized rule is more common in practice.

What's the difference between stochastic and the usual (called batch) methods? Computationally, m stochastic steps ≈ one batch step. But what about progress?

Consider the cyclic rule with smooth f_i, for simplicity: here we cycle through a descent step on each f_i individually.
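One such cycle can be sketched and compared against a single batch step from the same point; with a small step size and slowly varying gradients the two nearly agree, which is the intuition behind the direction comparison that follows. The toy smooth functions f_i(x) = (x - c_i)^2 / 2 (so ∇f_i(x) = x - c_i), and the data and step size, are this sketch's assumptions.

```python
import numpy as np

# One cycle of the cyclic rule on smooth toy functions f_i(x) = (x - c_i)^2 / 2.
c = np.array([1.0, 2.0, 4.0])
m, t = len(c), 0.01

x0 = 10.0
x = x0
for i in range(m):                      # m individual descent steps, one per f_i
    x = x - t * (x - c[i])
cyclic_step = x

batch_step = x0 - t * np.sum(x0 - c)    # one batch step from the same point
```

For this choice of t, the cyclic and batch results differ only in the third decimal place, because each ∇f_i changes little over the course of one cycle.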
After m steps (i.e., one cycle), assuming constant step sizes t_{k+1} = ... = t_{k+m} = t, the update is

    x^(k+m) = x^(k) - t · sum_{i=1}^m ∇f_i(x^(k+i-1))

One batch step is instead

    x^(k+1) = x^(k) - t · sum_{i=1}^m ∇f_i(x^(k))

so the difference in direction is sum_{i=1}^m [ ∇f_i(x^(k+i-1)) - ∇f_i(x^(k)) ].

We can believe that the stochastic method still converges if ∇f(x) doesn't vary wildly with x.

Convergence of stochastic methods

Assume that each f_i, i = 1, ..., m is convex and Lipschitz with constant G > 0.

For fixed step sizes t_k = t, k = 1, 2, 3, ..., both the cyclic and randomized stochastic subgradient methods satisfy

    lim_{k→∞} f(x_best^(k)) ≤ f* + 5 m^2 G^2 t / 2

(for the randomized rule, this holds with probability 1).

Note: mG can be viewed as a Lipschitz constant for the whole function sum_{i=1}^m f_i, so this is comparable to the batch bound.

For diminishing step sizes, the cyclic and randomized methods satisfy

    lim_{k→∞} f(x_best^(k)) = f*

How about convergence rates? This is where things get interesting. Recall that the batch subgradient method rate was O(G_batch^2 / ε^2), where the Lipschitz constant G_batch is for the whole function.

  • Cyclic rule: iteration complexity is O(m^3 G^2 / ε^2). Therefore the number of cycles needed is O(m^2 G^2 / ε^2), comparable to batch.
  • Randomized rule: iteration complexity is O(m^2 G^2 / ε^2) (in expectation, i.e., the bound is on the expected number of iterations). Thus the number of random cycles needed is O(m G^2 / ε^2), reduced by a factor of m!

This is a convincing reason to use randomized stochastic methods, for problems where m is big.

Example: stochastic logistic regression

Back to the logistic regression problem:

    min_{β ∈ R^p} f(β) = sum_{i=1}^n ( -y_i x_i^T β + log(1 + exp(x_i^T β)) )

where each summand is f_i(β). The gradient computation ∇f(β) = sum_{i=1}^n (p_i(β) - y_i) x_i is doable when n is moderate, but not when n ≈ 500 million.
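At that scale, each stochastic update touches only a single row x_i. A sketch on simulated data follows; the dimensions, seed, and step schedule are this sketch's assumptions, not the lecture's.

```python
import numpy as np

# Stochastic gradient descent for logistic regression: each update uses one
# data row, costing O(p), versus O(np) for a full-gradient (batch) step.
rng = np.random.default_rng(1)
n, p = 1000, 5
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, -1.0, 0.5, 0.0, 2.0])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)

def loss(beta):
    z = X @ beta
    return np.sum(np.logaddexp(0.0, z) - y * z)    # full loss, monitoring only

beta = np.zeros(p)
loss0 = loss(beta)
for k in range(1, 20001):
    i = rng.integers(n)                            # randomized rule
    p_i = 1.0 / (1.0 + np.exp(-(X[i] @ beta)))     # p_i(beta)
    t = 0.5 / (1.0 + k / 1000.0)                   # slowly decaying step
    beta = beta - t * (p_i - y[i]) * X[i]          # gradient of the ith term
```

Each pass of the loop costs O(p); the O(np) full loss is evaluated only for monitoring and would be skipped in a real large-n run.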
Recall:
  • One batch update costs O(np)
  • One stochastic update costs O(p)

So clearly, e.g., 10K stochastic steps are much more affordable.

Also, we often take the fixed step size for stochastic updates to be ≈ n times what we use for batch updates. (Why?)

The "classic picture":

[Figure: iterates on a 2d example; blue: batch steps, O(np) each; red: stochastic steps, O(p) each]

Rule of thumb for stochastic methods:
  • they generally thrive far from the optimum
  • they generally struggle close to the optimum

(More on stochastic methods later in the course ...)

Can we do better?

Upside of the subgradient method: broad applicability.
