Subgradient Method
Ryan Tibshirani
Convex Optimization 10-725/36-725

Last last time: gradient descent

Consider the problem
$$\min_x \; f(x)$$
for $f$ convex and differentiable, $\mathrm{dom}(f) = \mathbb{R}^n$. Gradient descent: choose initial $x^{(0)} \in \mathbb{R}^n$, repeat
$$x^{(k)} = x^{(k-1)} - t_k \cdot \nabla f(x^{(k-1)}), \quad k = 1, 2, 3, \ldots$$

Step sizes $t_k$ chosen to be fixed and small, or by backtracking line search.

If $\nabla f$ is Lipschitz, gradient descent has convergence rate $O(1/\epsilon)$.

Downsides:
• Requires $f$ differentiable (addressed this lecture)
• Can be slow to converge (addressed next lecture)

Subgradient method

Now consider $f$ convex, with $\mathrm{dom}(f) = \mathbb{R}^n$, but not necessarily differentiable.

Subgradient method: like gradient descent, but replacing gradients with subgradients. I.e., initialize $x^{(0)}$, repeat
$$x^{(k)} = x^{(k-1)} - t_k \cdot g^{(k-1)}, \quad k = 1, 2, 3, \ldots$$
where $g^{(k-1)} \in \partial f(x^{(k-1)})$, any subgradient of $f$ at $x^{(k-1)}$.

The subgradient method is not necessarily a descent method, so we keep track of the best iterate $x_{\mathrm{best}}^{(k)}$ among $x^{(0)}, \ldots, x^{(k)}$ so far, i.e.,
$$f(x_{\mathrm{best}}^{(k)}) = \min_{i=0,\ldots,k} f(x^{(i)})$$

Outline

Today:
• How to choose step sizes
• Convergence analysis
• Intersection of sets
• Stochastic subgradient method

Step size choices

• Fixed step sizes: $t_k = t$ for all $k = 1, 2, 3, \ldots$
• Diminishing step sizes: choose to meet the conditions
$$\sum_{k=1}^\infty t_k^2 < \infty, \qquad \sum_{k=1}^\infty t_k = \infty,$$
i.e., square summable but not summable. It is important that the step sizes go to zero, but not too fast.

There are other options too, but an important difference from gradient descent: all step size options are pre-specified, not adaptively computed.

Convergence analysis

Assume that $f$ is convex, $\mathrm{dom}(f) = \mathbb{R}^n$, and also that $f$ is Lipschitz continuous with constant $G > 0$, i.e.,
$$|f(x) - f(y)| \le G \|x - y\|_2 \quad \text{for all } x, y$$

Theorem: For a fixed step size $t$, the subgradient method satisfies
$$\lim_{k\to\infty} f(x_{\mathrm{best}}^{(k)}) \le f^\star + G^2 t/2$$

Theorem: For diminishing step sizes, the subgradient method satisfies
$$\lim_{k\to\infty} f(x_{\mathrm{best}}^{(k)}) = f^\star$$

Basic bound, and convergence rate

Letting $R = \|x^{(0)} - x^\star\|_2$, after $k$ steps we have the basic bound
$$f(x_{\mathrm{best}}^{(k)}) - f(x^\star) \le \frac{R^2 + G^2 \sum_{i=1}^k t_i^2}{2 \sum_{i=1}^k t_i}$$

The previous theorems follow from this. With fixed step size $t$, we have
$$f(x_{\mathrm{best}}^{(k)}) - f^\star \le \frac{R^2}{2kt} + \frac{G^2 t}{2}$$

For this to be $\le \epsilon$, let's make each term $\le \epsilon/2$. Therefore choose $t = \epsilon/G^2$, and $k = R^2/t \cdot 1/\epsilon = R^2 G^2/\epsilon^2$.

I.e., the subgradient method has convergence rate $O(1/\epsilon^2)$ ... compare this to the $O(1/\epsilon)$ rate of gradient descent.

Example: regularized logistic regression

Given $(x_i, y_i) \in \mathbb{R}^p \times \{0, 1\}$ for $i = 1, \ldots, n$, consider the logistic regression loss:
$$f(\beta) = \sum_{i=1}^n \Big( -y_i x_i^T \beta + \log\big(1 + \exp(x_i^T \beta)\big) \Big)$$

This is smooth and convex, with
$$\nabla f(\beta) = \sum_{i=1}^n \big( p_i(\beta) - y_i \big) x_i$$
where $p_i(\beta) = \exp(x_i^T \beta)/(1 + \exp(x_i^T \beta))$, $i = 1, \ldots, n$.

We will consider the regularized problem:
$$\min_{\beta \in \mathbb{R}^p} \; f(\beta) + \lambda \cdot P(\beta)$$
where $P(\beta) = \|\beta\|_2^2$ (ridge penalty) or $P(\beta) = \|\beta\|_1$ (lasso penalty).

Ridge problem: use gradients; lasso problem: use subgradients. Data example with $n = 1000$, $p = 20$:

[Figure: $f - f^\star$ versus iteration $k$. Left panel: gradient descent on the ridge problem with $t = 0.001$. Right panel: subgradient method on the lasso problem with $t = 0.001$ (fixed) and $t = 0.001/k$ (diminishing).]

Step sizes hand-tuned to be favorable for each method (of course the comparison is imperfect, but it reveals the convergence behaviors).
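To make the lasso case concrete, here is a minimal sketch of the subgradient method on this problem. It is not the code behind the figure above: the synthetic data generation, the penalty level `lam`, and the iteration budget are all illustrative assumptions; only $n = 1000$, $p = 20$, and the diminishing schedule $t_k = 0.001/k$ mirror the example.

```python
import numpy as np

# Sketch: subgradient method for min_beta f(beta) + lam * ||beta||_1,
# where f is the logistic loss above. Synthetic data (assumption).
rng = np.random.default_rng(0)
n, p, lam = 1000, 20, 0.1
X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)

def f(beta):
    """Logistic loss plus l1 penalty (logaddexp(0, z) = log(1 + e^z))."""
    z = X @ beta
    return np.sum(np.logaddexp(0.0, z) - y * z) + lam * np.sum(np.abs(beta))

def subgrad(beta):
    """One valid subgradient of the penalized objective at beta."""
    prob = 1.0 / (1.0 + np.exp(-(X @ beta)))   # p_i(beta)
    g_smooth = X.T @ (prob - y)                # gradient of the smooth loss
    # sign(beta_j) = 0 at beta_j = 0 is a valid choice from [-1, 1]
    return g_smooth + lam * np.sign(beta)

beta, f_best = np.zeros(p), np.inf
for k in range(1, 201):
    t_k = 0.001 / k                    # diminishing step size, as in the example
    beta = beta - t_k * subgrad(beta)
    f_best = min(f_best, f(beta))      # not a descent method: track best iterate
```

Note how $f(x_{\mathrm{best}}^{(k)})$ is tracked explicitly, since individual steps need not decrease the objective.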
Polyak step sizes

Polyak step sizes: when the optimal value $f^\star$ is known, take
$$t_k = \frac{f(x^{(k-1)}) - f^\star}{\|g^{(k-1)}\|_2^2}, \quad k = 1, 2, 3, \ldots$$

This can be motivated from the first step in the subgradient proof:
$$\|x^{(k)} - x^\star\|_2^2 \le \|x^{(k-1)} - x^\star\|_2^2 - 2 t_k \big( f(x^{(k-1)}) - f(x^\star) \big) + t_k^2 \|g^{(k-1)}\|_2^2$$
The Polyak step size minimizes the right-hand side.

With Polyak step sizes, one can show that the subgradient method converges to the optimal value. The convergence rate is still $O(1/\epsilon^2)$.

Example: intersection of sets

Suppose we want to find $x^\star \in C_1 \cap \ldots \cap C_m$, i.e., find a point in the intersection of closed, convex sets $C_1, \ldots, C_m$.

First define
$$f_i(x) = \mathrm{dist}(x, C_i), \quad i = 1, \ldots, m, \qquad f(x) = \max_{i=1,\ldots,m} f_i(x)$$
and now solve
$$\min_x f(x)$$

Note that $f^\star = 0 \Rightarrow x^\star \in C_1 \cap \ldots \cap C_m$. Check: is this problem convex?

Recall the distance function $\mathrm{dist}(x, C) = \min_{y \in C} \|y - x\|_2$. Last time we computed its gradient
$$\nabla \mathrm{dist}(x, C) = \frac{x - P_C(x)}{\|x - P_C(x)\|_2}$$
where $P_C(x)$ is the projection of $x$ onto $C$.

Also recall the subgradient rule: if $f(x) = \max_{i=1,\ldots,m} f_i(x)$, then
$$\partial f(x) = \mathrm{conv}\Big( \bigcup_{i \,:\, f_i(x) = f(x)} \partial f_i(x) \Big)$$
So if $f_i(x) = f(x)$ and $g_i \in \partial f_i(x)$, then $g_i \in \partial f(x)$.

Put these two facts together for the intersection-of-sets problem, with $f_i(x) = \mathrm{dist}(x, C_i)$: if $C_i$ is the farthest set from $x$ (so $f_i(x) = f(x)$), and
$$g_i = \nabla f_i(x) = \frac{x - P_{C_i}(x)}{\|x - P_{C_i}(x)\|_2}$$
then $g_i \in \partial f(x)$.

Now apply the subgradient method, with Polyak step size $t_k = f(x^{(k-1)})$ (recall $f^\star = 0$, and $\|g_i\|_2 = 1$). At iteration $k$, with $C_i$ farthest from $x^{(k-1)}$, we perform the update
$$x^{(k)} = x^{(k-1)} - f(x^{(k-1)}) \cdot \frac{x^{(k-1)} - P_{C_i}(x^{(k-1)})}{\|x^{(k-1)} - P_{C_i}(x^{(k-1)})\|_2} = P_{C_i}(x^{(k-1)})$$

For two sets, this is the famous alternating projections algorithm, i.e., just keep projecting back and forth. A sketch appears below.

[Figure: alternating projections between two convex sets; from Boyd's lecture notes.]
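As an illustration, here is a minimal sketch of this update for $m = 2$ sets in $\mathbb{R}^2$. The two sets (a unit ball and a halfspace) are hypothetical choices, picked only because their projections have closed forms.

```python
import numpy as np

# Sketch: subgradient method with Polyak steps for the intersection-of-sets
# problem. Picking the farthest set and projecting onto it is exactly the
# update derived above, i.e., alternating projections for m = 2.

def proj_ball(x):
    """Projection onto the unit ball {x : ||x||_2 <= 1}."""
    nrm = np.linalg.norm(x)
    return x if nrm <= 1.0 else x / nrm

a, b = np.array([1.0, 1.0]), 1.0
def proj_halfspace(x):
    """Projection onto the halfspace {x : a^T x >= b}."""
    gap = b - a @ x
    return x if gap <= 0.0 else x + gap * a / (a @ a)

projections = [proj_ball, proj_halfspace]

x = np.array([3.0, -2.0])
for k in range(50):
    dists = [np.linalg.norm(x - P(x)) for P in projections]
    i = int(np.argmax(dists))    # farthest set, so f_i(x) = f(x)
    x = projections[i](x)        # Polyak-step update reduces to P_{C_i}(x)

# x is now (approximately) a point in the intersection of the two sets
```

Each iteration here is the subgradient update with Polyak step $t_k = f(x^{(k-1)})$, which, as shown above, simplifies to a single projection.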
Stochastic subgradient method

Consider a sum of functions
$$\min_x \; \sum_{i=1}^m f_i(x)$$

Recall that $\partial \sum_{i=1}^m f_i(x) = \sum_{i=1}^m \partial f_i(x)$, and the subgradient method would repeat
$$x^{(k)} = x^{(k-1)} - t_k \cdot \sum_{i=1}^m g_i^{(k-1)}, \quad k = 1, 2, 3, \ldots$$
where $g_i^{(k-1)} \in \partial f_i(x^{(k-1)})$. In comparison, the stochastic subgradient method (or incremental subgradient method) repeats
$$x^{(k)} = x^{(k-1)} - t_k \cdot g_{i_k}^{(k-1)}, \quad k = 1, 2, 3, \ldots$$
where $i_k \in \{1, \ldots, m\}$ is some chosen index at iteration $k$.

Stochastic gradient descent: the special case when $f_i$, $i = 1, \ldots, m$ are differentiable, so $g_i^{(k-1)} = \nabla f_i(x^{(k-1)})$.

Two rules for choosing the index $i_k$ at iteration $k$:
• Cyclic rule: choose $i_k = 1, 2, \ldots, m, 1, 2, \ldots, m, \ldots$
• Randomized rule: choose $i_k \in \{1, \ldots, m\}$ uniformly at random

The randomized rule is more common in practice.

What's the difference between stochastic and usual (called batch) methods? Computationally, $m$ stochastic steps $\approx$ one batch step. But what about progress?

Consider the smooth cyclic rule, for simplicity: here we cycle through a descent step on each $f_i$ individually. After $m$ steps (i.e., one cycle), assuming constant step sizes $t_{k+1} = \ldots = t_{k+m} = t$, the update is
$$x^{(k+m)} = x^{(k)} - t \sum_{i=1}^m \nabla f_i\big(x^{(k+i-1)}\big)$$
One batch step is instead
$$x^{(k+1)} = x^{(k)} - t \sum_{i=1}^m \nabla f_i\big(x^{(k)}\big)$$
so the difference in direction is $\sum_{i=1}^m \big[ \nabla f_i(x^{(k+i-1)}) - \nabla f_i(x^{(k)}) \big]$. We can believe that the stochastic method still converges if $\nabla f(x)$ doesn't vary wildly with $x$.

Convergence of stochastic methods

Assume each $f_i$, $i = 1, \ldots, m$ is convex and Lipschitz with constant $G > 0$.

For fixed step sizes $t_k = t$, $k = 1, 2, 3, \ldots$, the cyclic and randomized stochastic subgradient methods both satisfy
$$\lim_{k\to\infty} f(x_{\mathrm{best}}^{(k)}) \le f^\star + 5 m^2 G^2 t/2$$
(for the randomized rule, this holds with probability 1).

Note: $mG$ can be viewed as a Lipschitz constant for the whole function $\sum_{i=1}^m f_i$, so this is comparable to the batch bound.

For diminishing step sizes, the cyclic and randomized methods satisfy
$$\lim_{k\to\infty} f(x_{\mathrm{best}}^{(k)}) = f^\star$$

How about convergence rates? This is where things get interesting.

Recall that the batch subgradient method rate was $O(G_{\mathrm{batch}}^2/\epsilon^2)$, where the Lipschitz constant $G_{\mathrm{batch}}$ is for the whole function.
• Cyclic rule: iteration complexity is $O(m^3 G^2/\epsilon^2)$. Therefore the number of cycles needed is $O(m^2 G^2/\epsilon^2)$, comparable to batch.
• Randomized rule: iteration complexity is $O(m^2 G^2/\epsilon^2)$ (this result holds in expectation, i.e., the bound is on the expected number of iterations). Thus the number of random cycles needed is $O(m G^2/\epsilon^2)$, reduced by a factor of $m$!

This is a convincing reason to use randomized stochastic methods, for problems where $m$ is big.

Example: stochastic logistic regression

Back to the logistic regression problem:
$$\min_{\beta \in \mathbb{R}^p} \; f(\beta) = \sum_{i=1}^n \underbrace{\Big( -y_i x_i^T \beta + \log\big(1 + \exp(x_i^T \beta)\big) \Big)}_{f_i(\beta)}$$

The gradient computation $\nabla f(\beta) = \sum_{i=1}^n (p_i(\beta) - y_i) x_i$ is doable when $n$ is moderate, but not when $n \approx 500$ million. Recall:
• One batch update costs $O(np)$
• One stochastic update costs $O(p)$

So clearly, e.g., 10K stochastic steps are much more affordable. (A minimal sketch appears at the end of this section.)

Also, we often take the fixed step size for stochastic updates to be $\approx n$ times what we use for batch updates. (Why?)

[Figure: the "classic picture" of iterates in parameter space. Blue: batch steps, $O(np)$ each; red: stochastic steps, $O(p)$ each. Rule of thumb for stochastic methods: they generally thrive far from the optimum, and generally struggle close to the optimum.]

(More on stochastic methods later in the course ...)

Can we do better?

Upside of the subgradient method: broad applicability.
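As referenced above, here is a minimal sketch of the randomized rule for stochastic gradient descent on this problem. The synthetic data, the fixed step size, and the iteration budget are all illustrative assumptions, not values from the lecture.

```python
import numpy as np

# Sketch: stochastic gradient descent (randomized rule) for the
# unpenalized logistic regression problem above. Synthetic data (assumption).
rng = np.random.default_rng(0)
n, p = 10000, 20
X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)

def grad_i(beta, i):
    """Gradient of a single term f_i(beta): O(p) work instead of O(np)."""
    p_i = 1.0 / (1.0 + np.exp(-(X[i] @ beta)))
    return (p_i - y[i]) * X[i]

beta = np.zeros(p)
t = 0.1                        # fixed step size, roughly n times a batch step
for k in range(20 * n):        # 20 "effective passes" ~ 20 batch steps of work
    i_k = rng.integers(n)      # randomized rule: uniform index
    beta = beta - t * grad_i(beta, i_k)
```

Each update touches one row of `X`, which is why the per-step cost is $O(p)$; the single-term gradient is about $n$ times smaller than the batch gradient, which is one way to see why the stochastic step size should be about $n$ times larger.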