Gradient Descent Revisited
Geoff Gordon & Ryan Tibshirani
Optimization 10-725 / 36-725

Gradient descent

Recall that we have $f : \mathbb{R}^n \to \mathbb{R}$, convex and differentiable, and we want to solve
$$\min_{x \in \mathbb{R}^n} f(x),$$
i.e., find $x^\star$ such that $f(x^\star) = \min f(x)$

Gradient descent: choose an initial $x^{(0)} \in \mathbb{R}^n$, then repeat:
$$x^{(k)} = x^{(k-1)} - t_k \cdot \nabla f(x^{(k-1)}), \quad k = 1, 2, 3, \ldots$$
Stop at some point

[Figure: gradient descent iterates on the contours of a convex function]

Interpretation

At each iteration, consider the expansion
$$f(y) \approx f(x) + \nabla f(x)^T (y - x) + \frac{1}{2t} \|y - x\|^2$$
This is a quadratic approximation, replacing the usual Hessian $\nabla^2 f(x)$ by $\frac{1}{t} I$:
• $f(x) + \nabla f(x)^T (y - x)$ is the linear approximation to $f$
• $\frac{1}{2t} \|y - x\|^2$ is a proximity term to $x$, with weight $1/(2t)$

Choose the next point $y = x^+$ to minimize this quadratic approximation:
$$x^+ = x - t \nabla f(x)$$

[Figure: quadratic approximation at a point; the blue point is $x$, the red point is $x^+$]

Outline

Today:
• How to choose the step size $t_k$
• Convergence under Lipschitz gradient
• Convergence under strong convexity
• Forward stagewise regression, boosting

Fixed step size

Simply take $t_k = t$ for all $k = 1, 2, 3, \ldots$; this can diverge if $t$ is too big. Consider $f(x) = (10 x_1^2 + x_2^2)/2$:

[Figure: gradient descent after 8 steps with $t$ too big, iterates oscillating away from the optimum]

It can also be slow if $t$ is too small. Same example, gradient descent after 100 steps:

[Figure: gradient descent after 100 steps with $t$ too small, iterates creeping toward the optimum]

Same example, gradient descent after 40 appropriately sized steps:

[Figure: gradient descent after 40 well-sized steps, converging to the optimum]

This porridge is too hot! Too cold! Juuussst right. The convergence analysis later will give us a better idea

Backtracking line search

A way to adaptively choose the step size:
• First fix a parameter $0 < \beta < 1$
• Then at each iteration, start with $t = 1$, and while
$$f(x - t \nabla f(x)) > f(x) - \frac{t}{2} \|\nabla f(x)\|^2,$$
update $t = \beta t$

Simple, and tends to work pretty well in practice

Interpretation

[Figure: the backtracking (sufficient decrease) condition, from B & V page 465]

For us, $\Delta x = -\nabla f(x)$ and $\alpha = 1/2$

Backtracking picks up roughly the right step size (13 steps):

[Figure: gradient descent with backtracking line search on the same example]

Here $\beta = 0.8$ (B & V recommend $\beta \in (0.1, 0.8)$)

Exact line search

At each iteration, do the best we can along the direction of the gradient:
$$t = \operatorname*{argmin}_{s \geq 0} f(x - s \nabla f(x))$$
It is usually not possible to do this minimization exactly, and approximations to exact line search are often not much more efficient than backtracking, so it's not worth it
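Not part of the original slides: the following is a minimal NumPy sketch of gradient descent with backtracking on the quadratic example above; the names f, grad_f, and gradient_descent, the starting point, and the step count are illustrative choices.

```python
import numpy as np

def f(x):
    # Test function from the slides: f(x) = (10*x1^2 + x2^2) / 2
    return (10 * x[0]**2 + x[1]**2) / 2

def grad_f(x):
    # Its gradient: (10*x1, x2)
    return np.array([10 * x[0], x[1]])

def gradient_descent(x0, n_steps=40, beta=0.8):
    # Gradient descent with backtracking: at each iteration start with
    # t = 1 and shrink t by beta while the sufficient decrease condition
    # f(x - t*g) <= f(x) - (t/2)*||g||^2 is violated (alpha = 1/2).
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        g = grad_f(x)
        t = 1.0
        while f(x - t * g) > f(x) - (t / 2) * (g @ g):
            t *= beta
        x = x - t * g
    return x

print(gradient_descent([-18.0, 18.0]))  # approaches the minimizer (0, 0)
```

With a fixed step size instead, the analysis that follows applies once $t \leq 1/L$; for this quadratic the Hessian is $\mathrm{diag}(10, 1)$, so $L = 10$ and $t \leq 1/10$ suffices.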
Convergence analysis

Assume that $f : \mathbb{R}^n \to \mathbb{R}$ is convex and differentiable, and additionally
$$\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\| \quad \text{for any } x, y,$$
i.e., $\nabla f$ is Lipschitz continuous with constant $L > 0$

Theorem: Gradient descent with fixed step size $t \leq 1/L$ satisfies
$$f(x^{(k)}) - f(x^\star) \leq \frac{\|x^{(0)} - x^\star\|^2}{2tk}$$

I.e., gradient descent has convergence rate $O(1/k)$; to get $f(x^{(k)}) - f(x^\star) \leq \epsilon$, we need $O(1/\epsilon)$ iterations

Proof

Key steps:
• $\nabla f$ Lipschitz with constant $L$ implies
$$f(y) \leq f(x) + \nabla f(x)^T (y - x) + \frac{L}{2} \|y - x\|^2 \quad \text{all } x, y$$
• Plugging in $y = x - t \nabla f(x)$,
$$f(y) \leq f(x) - \Big(1 - \frac{Lt}{2}\Big) t \|\nabla f(x)\|^2$$
• Letting $x^+ = x - t \nabla f(x)$ and taking $0 < t \leq 1/L$,
$$f(x^+) \leq f(x^\star) + \nabla f(x)^T (x - x^\star) - \frac{t}{2} \|\nabla f(x)\|^2 = f(x^\star) + \frac{1}{2t} \big( \|x - x^\star\|^2 - \|x^+ - x^\star\|^2 \big)$$
• Summing over iterations:
$$\sum_{i=1}^k \big( f(x^{(i)}) - f(x^\star) \big) \leq \frac{1}{2t} \big( \|x^{(0)} - x^\star\|^2 - \|x^{(k)} - x^\star\|^2 \big) \leq \frac{1}{2t} \|x^{(0)} - x^\star\|^2$$
• Since $f(x^{(k)})$ is nonincreasing,
$$f(x^{(k)}) - f(x^\star) \leq \frac{1}{k} \sum_{i=1}^k \big( f(x^{(i)}) - f(x^\star) \big) \leq \frac{\|x^{(0)} - x^\star\|^2}{2tk}$$

Convergence analysis for backtracking

Same assumptions: $f : \mathbb{R}^n \to \mathbb{R}$ is convex and differentiable, and $\nabla f$ is Lipschitz continuous with constant $L > 0$. The same rate holds for a step size chosen by backtracking search

Theorem: Gradient descent with backtracking line search satisfies
$$f(x^{(k)}) - f(x^\star) \leq \frac{\|x^{(0)} - x^\star\|^2}{2 t_{\min} k}, \quad \text{where } t_{\min} = \min\{1, \beta/L\}$$
If $\beta$ is not too small, then we don't lose much compared to a fixed step size ($\beta/L$ vs. $1/L$)

Strong convexity

Strong convexity of $f$ means: for some $d > 0$,
$$\nabla^2 f(x) \succeq dI \quad \text{for any } x$$
This gives a better lower bound than plain convexity:
$$f(y) \geq f(x) + \nabla f(x)^T (y - x) + \frac{d}{2} \|y - x\|^2 \quad \text{all } x, y$$

Under the Lipschitz assumption as before, and also strong convexity:

Theorem: Gradient descent with fixed step size $t \leq 2/(d + L)$, or with backtracking line search, satisfies
$$f(x^{(k)}) - f(x^\star) \leq c^k \, \frac{L}{2} \|x^{(0)} - x^\star\|^2$$
where $0 < c < 1$

I.e., the rate with strong convexity is $O(c^k)$, exponentially fast! To get $f(x^{(k)}) - f(x^\star) \leq \epsilon$, we need $O(\log(1/\epsilon))$ iterations

Called linear convergence, because it looks linear on a semi-log plot:

[Figure: error versus iteration number on a semi-log scale, from B & V page 487]

The constant $c$ depends adversely on the condition number $L/d$ (a higher condition number means a slower rate)

How realistic are these conditions?

How realistic is Lipschitz continuity of $\nabla f$?
• This means $\nabla^2 f(x) \preceq LI$
• E.g., consider $f(x) = \frac{1}{2} \|y - Ax\|^2$ (linear regression). Here $\nabla^2 f(x) = A^T A$, so $\nabla f$ is Lipschitz with $L = \sigma_{\max}(A)^2 = \|A\|^2$

How realistic is strong convexity of $f$?
• Recall this is $\nabla^2 f(x) \succeq dI$
• E.g., again consider $f(x) = \frac{1}{2} \|y - Ax\|^2$, so $\nabla^2 f(x) = A^T A$, and we need $d = \sigma_{\min}(A)^2$
• If $A$ is wide, then $\sigma_{\min}(A) = 0$, and $f$ can't be strongly convex
• Even if $\sigma_{\min}(A) > 0$, we can have a very large condition number $L/d = \sigma_{\max}(A)^2 / \sigma_{\min}(A)^2$

Practicalities

Stopping rule: stop when $\|\nabla f(x)\|$ is small
• Recall $\nabla f(x^\star) = 0$
• If $f$ is strongly convex with parameter $d$, then
$$\|\nabla f(x)\| \leq \sqrt{2 d \epsilon} \;\Longrightarrow\; f(x) - f(x^\star) \leq \epsilon$$

Pros and cons:
• Pro: simple idea, and each iteration is cheap
• Pro: very fast for well-conditioned, strongly convex problems
• Con: often slow, because interesting problems aren't strongly convex or well-conditioned
• Con: can't handle nondifferentiable functions

Forward stagewise regression

Let's stick with $f(x) = \frac{1}{2} \|y - Ax\|^2$, linear regression. $A$ is $n \times p$, and its columns $A_1, \ldots, A_p$ are the predictor variables

Forward stagewise regression: start with $x^{(0)} = 0$, repeat:
• Find the variable $i$ such that $|A_i^T r|$ is largest, for $r = y - A x^{(k-1)}$ (largest absolute correlation with the residual)
• Update $x_i^{(k)} = x_i^{(k-1)} + \gamma \cdot \mathrm{sign}(A_i^T r)$

Here $\gamma > 0$ is small and fixed, called the learning rate. This looks kind of like gradient descent

Steepest descent

A close cousin to gradient descent; just change the choice of norm. Let $q, r$ be complementary (dual): $1/q + 1/r = 1$. Updates are $x^+ = x + t \cdot \Delta x$, where
$$\Delta x = \|\nabla f(x)\|_r \cdot u, \quad u = \operatorname*{argmin}_{\|v\|_q \leq 1} \nabla f(x)^T v$$
• If $q = 2$, then $\Delta x = -\nabla f(x)$: gradient descent
• If $q = 1$, then $\Delta x = -\partial f(x)/\partial x_i \cdot e_i$, where
$$\Big| \frac{\partial f}{\partial x_i}(x) \Big| = \max_{j = 1, \ldots, n} \Big| \frac{\partial f}{\partial x_j}(x) \Big| = \|\nabla f(x)\|_\infty$$

Normalized steepest descent just takes $\Delta x = u$ (unit $q$-norm)

Equivalence

Normalized steepest descent with the 1-norm: updates are
$$x_i^+ = x_i - t \cdot \mathrm{sign}\Big( \frac{\partial f}{\partial x_i}(x) \Big),$$
where $i$ is the largest component of $\nabla f(x)$ in absolute value

Compare forward stagewise: updates are
$$x_i^+ = x_i + \gamma \cdot \mathrm{sign}(A_i^T r), \quad r = y - Ax$$
Recall that here $f(x) = \frac{1}{2} \|y - Ax\|^2$, so $\nabla f(x) = -A^T (y - Ax)$ and $\partial f(x)/\partial x_i = -A_i^T (y - Ax)$

Forward stagewise regression is exactly normalized steepest descent under the 1-norm
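Also not from the slides: a minimal NumPy sketch of the forward stagewise updates above; the function name forward_stagewise and the default values of gamma and n_steps are illustrative.

```python
import numpy as np

def forward_stagewise(A, y, gamma=0.01, n_steps=1000):
    # Forward stagewise regression, i.e., normalized steepest descent
    # on f(x) = 0.5*||y - A x||^2 under the 1-norm.
    n, p = A.shape
    x = np.zeros(p)
    for _ in range(n_steps):
        r = y - A @ x                     # current residual
        corr = A.T @ r                    # A_i^T r for every column i
        i = np.argmax(np.abs(corr))       # largest absolute correlation
        x[i] += gamma * np.sign(corr[i])  # small fixed step on coordinate i
    return x
```

Stopping the loop early leaves many coordinates exactly at zero, which is the sparse, regularized behavior discussed next.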
Early stopping and regularization

Forward stagewise is like a slower version of forward stepwise. If we stop early, i.e., don't continue all the way to the least squares solution, then we get a sparse approximation... can this be used as a form of regularization?

Recall the lasso problem:
$$\min_{x \in \mathbb{R}^p} \frac{1}{2} \|y - Ax\|^2 \quad \text{subject to } \|x\|_1 \leq s$$
Its solution $x^\star(s)$, as a function of $s$, also exhibits varying amounts of regularization. How do they compare?

[Figure: coefficient profiles of lasso versus forward stagewise, from ESL page 609]

For some problems (some $y, A$), with a small enough step size, the forward stagewise iterates trace out the lasso solution path!

Gradient boosting

Given observations $y = (y_1, \ldots, y_n) \in \mathbb{R}^n$ and associated predictor measurements $a_i \in \mathbb{R}^p$, $i = 1, \ldots, n$, we want to construct a flexible (nonlinear) model for $y$ based on the predictors. Weighted sum of trees:
$$y_i \approx \hat{y}_i = \sum_{j=1}^m \gamma_j \cdot T_j(a_i), \quad i = 1, \ldots, n$$
Each tree $T_j$ inputs predictor measurements $a_i$ and outputs a prediction. The trees are typically very short...

Pick a loss function $L$ that reflects the task; e.g., for continuous $y$, we could take $L(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$. We want to solve
$$\min_{\gamma \in \mathbb{R}^M} \sum_{i=1}^n L\Big( y_i, \sum_{j=1}^M \gamma_j \cdot T_j(a_i) \Big)$$
where $j$ indexes all trees of a fixed size (e.g., depth = 5), so $M$ is huge. The space is simply too big to optimize over

Gradient boosting: combines the gradient descent idea with forward model building. First think of the minimization as $\min f(\hat{y})$, a function of the predictions $\hat{y}$ (subject to $\hat{y}$ coming from trees)

Start with an initial model, i.e., fit a single tree $\hat{y}^{(0)} = T_0$. Repeat:
• Evaluate the gradient $g$ at the latest prediction $\hat{y}^{(k-1)}$:
$$g_i = \frac{\partial L(y_i, \hat{y}_i)}{\partial \hat{y}_i} \bigg|_{\hat{y}_i = \hat{y}_i^{(k-1)}}, \quad i = 1, \ldots, n$$
• Find a tree $T_k$ that is close to $-g$, i.e., $T_k$ solves
$$\min_T \sum_{i=1}^n \big( -g_i - T(a_i) \big)^2$$
Not hard to (approximately) solve for a single tree
• Update our prediction:
$$\hat{y}^{(k)} = \hat{y}^{(k-1)} + \gamma_k \cdot T_k$$

Note: the predictions are weighted sums of trees, as desired!

Lower bound

Remember the $O(1/k)$ rate for gradient descent over the problem class of convex, differentiable functions with Lipschitz continuous gradients

First-order method: an iterative method that updates $x^{(k)}$ in
$$x^{(0)} + \mathrm{span}\{\nabla f(x^{(0)}), \nabla f(x^{(1)}), \ldots, \nabla f(x^{(k-1)})\}$$

Theorem (Nesterov): For any $k \leq (n-1)/2$ and any starting point $x^{(0)}$, there is a function $f$ in the problem class such that any first-order method satisfies
$$f(x^{(k)}) - f(x^\star) \geq \frac{3L \|x^{(0)} - x^\star\|^2}{32(k+1)^2}$$

Can we achieve a rate of $O(1/k^2)$? Answer: yes, and more!

References
• S. Boyd and L. Vandenberghe (2004), Convex optimization (cited above as B & V)
• T. Hastie, R. Tibshirani and J. Friedman (2009), The elements of statistical learning (cited above as ESL)
• Y. Nesterov (2004), Introductory lectures on convex optimization: a basic course
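Finally, to make the gradient boosting recipe above concrete, here is a minimal sketch for squared loss. It is not from the slides: it assumes scikit-learn's DecisionTreeRegressor as the tree fitter, and it replaces the per-step weights $\gamma_k$ with a single fixed shrinkage factor gamma.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(A, y, n_trees=100, gamma=0.1, depth=3):
    # Gradient boosting with squared loss L(y, yhat) = (y - yhat)^2.
    # The negative gradient is -g_i = 2*(y_i - yhat_i), a multiple of the
    # residual, so each step fits a short tree to the residuals and adds
    # it to the model with a fixed weight gamma.
    trees = [DecisionTreeRegressor(max_depth=depth).fit(A, y)]  # yhat^(0) = T_0
    yhat = trees[0].predict(A)
    for _ in range(n_trees):
        neg_grad = 2 * (y - yhat)              # -g for squared loss
        T = DecisionTreeRegressor(max_depth=depth).fit(A, neg_grad)
        trees.append(T)
        yhat = yhat + gamma * T.predict(A)     # yhat^(k) = yhat^(k-1) + gamma*T_k
    return trees
```

The prediction for a new point is then the weighted sum of the trees' outputs, exactly the model form the slides describe.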