Beyond Gradient Descent

Mathieu Blondel
December 3, 2020

Outline
1. Convergence rates
2. Coordinate descent
3. Newton's method
4. Frank-Wolfe
5. Mirror-Descent
6. Conclusion

1. Convergence rates

Measuring progress

How can we measure the progress made by an iterative algorithm for solving an optimization problem

    x* = argmin_{x ∈ X} f(x) ?

We use a non-negative error measure, for example

    E_t = ||x_t − x*||   or   E_t = f(x_t) − f(x*),

and the progress ratio

    ρ_t = E_t / E_{t−1}.

An algorithm makes strict progress on iteration t if ρ_t ∈ [0, 1).

Types of convergence rates

The asymptotic convergence rate is

    ρ = lim_{t→∞} ρ_t,

i.e., the limit of the sequence ρ_1, ρ_2, ρ_3, ...

• Sublinear rate: ρ = 1. The longer the algorithm runs, the slower it makes progress (the algorithm decelerates over time).
• Linear rate: ρ ∈ (0, 1). The algorithm eventually reaches a state of constant progress.
• Superlinear rate: ρ = 0. The longer the algorithm runs, the faster it makes progress (the algorithm accelerates over time).

[Figure: error E_t vs. iteration t on a log scale for E_t = 1/√t, 1/t, 1/t² (sublinear), e^{−t} (linear) and e^{−t²} (superlinear).]

[Figure: progress ratio ρ_t = E_t / E_{t−1} vs. iteration t for the same five sequences.]

Sublinear rates

Let E_t be the error at iteration t and T_ε the number of iterations needed to reach an ε-error. For b > 0,

    E_t = O(1/t^b)   ⟺   T_ε = O(1/ε^{1/b}).

• b = 1/2:  E_t = O(1/√t),  T_ε = O(1/ε²).
• b = 1:    E_t = O(1/t),   T_ε = O(1/ε).
  Example: gradient descent for smooth but not strongly convex functions.
• b = 2:    E_t = O(1/t²),  T_ε = O(1/√ε).
  Example: Nesterov's method for smooth but not strongly convex functions.

Linear rates

The iteration number t now appears in the exponent:

    E_t = O(ρ^t),   ρ ∈ (0, 1).

Example:

    E_t = O(e^{−t}),   T_ε = O(log(1/ε)).

"Linear rate" is kind of a misnomer: E_t is decreasing exponentially fast. On the other hand, log E_t is decreasing linearly.
Example: gradient descent on smooth and strongly convex functions.

Superlinear rates

We can further classify convergence rates by their order q:

    lim_{t→∞} E_t / (E_{t−1})^q = M.

• Superlinear (q = 1, M = 0):  E_t = O(e^{−t^k}) with k > 1,  T_ε = O(log^{1/k}(1/ε)).
• Quadratic (q = 2):  E_t = O(e^{−2^t}),  T_ε = O(log log(1/ε)).
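To make the three regimes concrete, here is a minimal numerical sketch (my own illustration, not from the slides) that evaluates the progress ratio ρ_t = E_t / E_{t−1} for the five example error sequences plotted above: the sublinear ratios drift towards 1, the linear one stays constant at e^{−1} ≈ 0.37, and the superlinear one collapses towards 0.

```python
# Minimal sketch: how the progress ratio rho_t = E_t / E_{t-1} behaves
# for the example error sequences discussed above.
import numpy as np

t = np.arange(1.0, 21.0)  # iterations 1..20, kept small to avoid underflow of exp(-t^2)

sequences = {
    "E_t = 1/sqrt(t) (sublinear)":   1.0 / np.sqrt(t),
    "E_t = 1/t       (sublinear)":   1.0 / t,
    "E_t = 1/t^2     (sublinear)":   1.0 / t**2,
    "E_t = exp(-t)   (linear)":      np.exp(-t),
    "E_t = exp(-t^2) (superlinear)": np.exp(-t**2),
}

for name, E in sequences.items():
    rho = E[1:] / E[:-1]  # progress ratio for t = 2..20
    # Sublinear ratios drift towards 1, the linear one stays at exp(-1) ~ 0.37,
    # and the superlinear one heads to 0.
    print(f"{name}: rho at t=5 is {rho[3]:.4f}, rho at t=20 is {rho[18]:.4g}")
```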
2. Coordinate descent

Coordinate descent

• Minimize only one variable per iteration, keeping all others fixed.
• Well suited if the one-variable sub-problem is easy to solve.
• Cheap iteration cost.
• Easy to implement.
• No need for step-size tuning.
• State of the art on the lasso, the SVM dual and (non-negative) matrix factorization.
• Block coordinate descent: minimize only a block of variables per iteration.

Coordinate-wise minimizer

A point x is called a coordinate-wise minimizer of f if f is minimized along all coordinates separately:

    f(x + δ e_j) ≥ f(x)   for all δ ∈ R and j ∈ {1, ..., d}.

Does the coordinate-wise minimizer coincide with the global minimizer?

• f convex and differentiable: yes, since ∇f(x) = 0 ⟺ ∇_j f(x) = 0 for all j ∈ {1, ..., d}.
• f convex but non-differentiable: not always; coordinate descent can get stuck.
• f(x) = g(x) + Σ_{j=1}^d h_j(x_j), where g is differentiable and only the h_j are non-differentiable: yes, since 0 ∈ ∂f(x) ⟺ −∇_j g(x) ∈ ∂h_j(x_j) for all j ∈ {1, ..., d}.

Coordinate descent

On each iteration t, pick a coordinate j_t ∈ {1, ..., d} and minimize (approximately) over this coordinate while keeping the others fixed:

    min_{x_{j_t}} f(x_1^t, x_2^t, ..., x_{j_t}, ..., x_{d−1}^t, x_d^t).

• Coordinate selection strategies: random, cyclic, shuffled cyclic.
• Coordinate descent with exact updates requires an "oracle" that solves the sub-problem.
• Coordinate gradient descent only requires first-order information (and sometimes a prox operator).

Coordinate descent with exact updates

Suppose f is a quadratic function. Then

    f(x + δ e_j) = f(x) + ∇_j f(x) δ + (δ²/2) ∇²_{jj} f(x).

Minimizing with respect to δ, we get

    δ* = −∇_j f(x) / ∇²_{jj} f(x),   i.e.,   x_j ← x_j − ∇_j f(x) / ∇²_{jj} f(x).

Example: f(x) = ½ ||Ax − b||² + (λ/2) ||x||².

    ∇f(x) = A^T (Ax − b) + λx   ⇒   ∇_j f(x) = A_{:,j}^T (Ax − b) + λ x_j
    ∇²f(x) = A^T A + λI         ⇒   ∇²_{jj} f(x) = ||A_{:,j}||² + λ

Coordinate descent with exact updates: cost per pass

• Computing ∇f(x) for gradient descent costs O(nd) time.
• Let us maintain the residual vector r = Ax − b ∈ R^n.
• When x_j is updated, synchronizing r takes O(n) time.
• When r is synchronized, we can compute ∇_j f(x) in O(n) time.
• The second derivatives ∇²_{jj} f(x) can be pre-computed ahead of time, since they do not depend on x.
• A pass over all d coordinates therefore takes O(nd) time, just like one iteration of gradient descent.

Coordinate descent with exact updates: proximal case

If f(x) = g(x) + Σ_{j=1}^d h_j(x_j) and g is quadratic, then

    f(x + δ e_j) = g(x) + ∇_j g(x) δ + (δ²/2) ∇²_{jj} g(x) + h_j(x_j + δ).

The closed-form solution is

    δ* = prox_{h_j / ∇²_{jj} g(x)} ( x_j − ∇_j g(x) / ∇²_{jj} g(x) ) − x_j,

where we used the proximity operator

    prox_{τ h_j}(u) = argmin_v  (1/(2τ)) (u − v)² + h_j(v).

If h_j = |·|, the prox is the soft-thresholding operator. This is the state of the art for solving the lasso!

Coordinate gradient descent

• If f is not quadratic, there typically does not exist a closed form for the sub-problem.
• If ∇_j f(x) is L_j-Lipschitz continuous, recall that ∇²_{jj} f(x) ≤ L_j.
• Key idea: replace ∇²_{jj} f(x) with L_j, i.e., the update
      x_j ← x_j − ∇_j f(x) / ∇²_{jj} f(x)
  becomes
      x_j ← x_j − ∇_j f(x) / L_j.
• Each L_j is coordinate-specific, hence easier to derive and tighter than a global constant L.
• Convergence: O(1/ε) iterations under a Lipschitz gradient and O(log(1/ε)) under strong convexity (random or cyclic selection).

Coordinate gradient descent with prox

Similarly, we can replace

    x_j ← prox_{h_j / ∇²_{jj} g(x)} ( x_j − ∇_j g(x) / ∇²_{jj} g(x) )

with

    x_j ← prox_{h_j / L_j} ( x_j − ∇_j g(x) / L_j ),

where ∇²_{jj} g(x) ≤ L_j for all x.

• Can be used, for instance, for L1-regularized logistic regression.
• If h_j(x_j) = I_{C_j}(x_j), the indicator of a convex set C_j, the prox becomes the projection onto C_j.

Implementation techniques

• Synchronize "statistics" (e.g. residuals) upon each update.
• Store the data in column-major format: Fortran-style array or sparse CSC matrix.
• Regularization path and warm start: for λ_1 > λ_2 > ... > λ_m, since CD converges faster with a large λ, start from λ_1 and use its solution to warm-start (initialize) λ_2, and so on.
• Active sets and safe screening: use the optimality conditions to safely discard coordinates that are guaranteed to be 0.
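Putting the exact coordinate update, the soft-thresholding prox and the residual trick together, here is a minimal sketch (my own illustration, not code from the slides) of cyclic coordinate descent for the lasso with h_j(x_j) = λ|x_j|. The function name `lasso_cd`, the synthetic data and the number of passes are all my choices.

```python
# Minimal sketch of cyclic coordinate descent for the lasso
#   f(x) = 0.5 * ||Ax - b||^2 + lam * ||x||_1
# using exact one-variable updates (soft-thresholding) and a residual
# vector synchronized in O(n) after each coordinate update.
import numpy as np

def lasso_cd(A, b, lam, n_passes=100):
    n, d = A.shape
    x = np.zeros(d)
    r = A @ x - b                        # residual r = Ax - b (here simply -b)
    col_sq_norms = (A ** 2).sum(axis=0)  # precomputed ||A_:j||^2 (second derivatives of g)

    for _ in range(n_passes):
        for j in range(d):
            if col_sq_norms[j] == 0.0:
                continue
            g_j = A[:, j] @ r                       # gradient of the smooth part along coordinate j
            u = x[j] - g_j / col_sq_norms[j]        # exact minimizer of the smooth part
            new_xj = np.sign(u) * max(abs(u) - lam / col_sq_norms[j], 0.0)  # soft-thresholding
            r += (new_xj - x[j]) * A[:, j]          # synchronize the residual in O(n)
            x[j] = new_xj
    return x

# Tiny usage example on synthetic data
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
x_true = np.zeros(10); x_true[:3] = [1.0, -2.0, 0.5]
b = A @ x_true + 0.01 * rng.standard_normal(50)
print(lasso_cd(A, b, lam=0.1))
```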
3. Newton's method

Newton's method for root finding

Given a function g, find x such that g(x) = 0. Such an x is called a root of g.

Newton's method in one dimension:

    x^{t+1} = x^t − g(x^t) / g'(x^t).

Newton's method in d dimensions:

    x^{t+1} = x^t − J_g(x^t)^{−1} g(x^t),

where J_g(x^t) ∈ R^{d×d} is the Jacobian matrix of g: R^d → R^d. The method may fail to converge to a root.

Newton's method for optimization

If we want to minimize f, we can set g = f' or g = ∇f.

Newton's method in one dimension:

    x^{t+1} = x^t − f'(x^t) / f''(x^t).

Newton's method in d dimensions:

    x^{t+1} = x^t − d^t,   where d^t := ∇²f(x^t)^{−1} ∇f(x^t).

In practice, one solves the linear system of equations ∇²f(x^t) d^t = ∇f(x^t) rather than forming the inverse. If f is non-convex, ∇²f(x^t) can be indefinite (i.e., not positive semi-definite).

Line search

• If f is a quadratic, Newton's method converges in one iteration.
• Otherwise, Newton's method may not converge, even if f is convex.
• Solution: use a step size, x^{t+1} = x^t − η^t d^t.
• Backtracking line search: decrease η^t geometrically until η^t d^t satisfies some conditions. Examples: the Armijo rule, the strong Wolfe conditions.
• This gives superlinear local convergence.

Trust region methods

Newton's method

    x^{t+1} = x^t − ∇²f(x^t)^{−1} ∇f(x^t)

is equivalent to solving a quadratic approximation of f around x^t:

    −d^t = argmin_d  f(x^t) + ∇f(x^t)^T d + ½ d^T ∇²f(x^t) d.

A trust region method adds a ball constraint:

    −d^t = argmin_d  f(x^t) + ∇f(x^t)^T d + ½ d^T ∇²f(x^t) d   s.t.  ||d|| ≤ r,

where r > 0 is the trust-region radius.
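To tie the Newton direction and the backtracking line search together, here is a minimal sketch (my own illustration, not from the slides) of Newton's method for minimization in d dimensions. The direction is obtained by solving the linear system rather than inverting the Hessian; the function name `newton_minimize`, the Armijo constant 1e-4, the halving of the step size and the toy objective are all my choices.

```python
# Minimal sketch of Newton's method for minimization with a backtracking
# (Armijo) line search. The Newton direction solves H d = grad.
import numpy as np

def newton_minimize(f, grad, hess, x0, max_iter=50, tol=1e-10):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        d = np.linalg.solve(hess(x), g)   # Newton direction d^t
        eta, fx = 1.0, f(x)
        # Backtracking: shrink eta geometrically until the Armijo condition holds
        while f(x - eta * d) > fx - 1e-4 * eta * (g @ d):
            eta *= 0.5
            if eta < 1e-12:
                break
        x = x - eta * d
    return x

# Usage example on a smooth, strictly convex toy objective
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
f    = lambda x: 0.5 * x @ A @ x - b @ x + np.log(1 + np.exp(x[0]))
grad = lambda x: A @ x - b + np.array([1.0 / (1.0 + np.exp(-x[0])), 0.0])
hess = lambda x: A + np.diag([np.exp(-x[0]) / (1.0 + np.exp(-x[0]))**2, 0.0])
print(newton_minimize(f, grad, hess, x0=np.zeros(2)))
```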
