Beyond Gradient Descent

Illustration Examples Beyond gradient descent Mathieu Blondel December 3, 2020 Outline 1 Convergence rates 2 Coordinate descent 3 Newton’s method 4 Frank-Wolfe 5 Mirror-Descent 6 Conclusion Mathieu Blondel Beyond gradient descent 1 / 47 Measuring progress How to measure the progress made by an iterative algorithm for solving an optimization problem? x? = argmin f (x) x2X Non-negative error measure ? ? Et = jjxt − x jj or Et = f (xt ) − f (x ) Progress ratio Et ρt = Et−1 An algorithm makes strict progress on iteration t if ρt 2 [0; 1). Mathieu Blondel Beyond gradient descent 2 / 47 Types of convergence rates Asymptotic convergence rate lim ρt = ρ t!1 i.e., the sequence ρ1; ρ2; ρ3;::: converges to ρ Sublinear rate: ρ = 1. The longer the algorithm runs, the slower it makes progress! (the algorithm decelerates over time) Linear rate: ρ 2 (0; 1). The algorithm eventually reaches a state of constant progress. Superlinear rate: ρ = 0. The longer the algorithm runs, the faster it makes progress! (the algorithm accelerates over time) Mathieu Blondel Beyond gradient descent 3 / 47 10 2 ) 10 5 e l a Et = 1/ t (sublinear) c s 10 8 Et = 1/t (sublinear) g o 2 l Et = 1/t (sublinear) ( 10 11 t t E E = e (linear) t r 2 o 14 t r 10 Et = e (superlinear) r E 10 17 10 20 0 20 40 60 80 100 Iteration t Mathieu Blondel Beyond gradient descent 4 / 47 1.0 Et = 1/ t (sublinear) 0.8 Et = 1/t (sublinear) 2 Et = 1/t (sublinear) 1 t E t t E 0.6 Et = e (linear) = t2 t Et = e (superlinear) o i 0.4 t a R 0.2 0.0 0 20 40 60 80 100 Iteration t Mathieu Blondel Beyond gradient descent 5 / 47 Sublinear rates Error at iteration Et , number of iterations to reach "-error T" 1 1 Et = O , T" = O b > 0 tb "1=b b = 1=2 1 1 Et = O p , T" = O t "2 b = 1 1 1 E = O , T" = O t t " ex: gradient descent for smooth but not strongly-convex functions b = 2 1 1 E = O , T" = O p t t2 " ex: Nesterov’s method for smooth but not strongly-convex functions Mathieu Blondel Beyond gradient descent 6 / 47 Linear rates The iteration number t now appears in the exponent t Et = O(ρ ) ρ 2 (0; 1) Example: −t 1 E = O e , T" = O log t " “Linear rate” is kind of a misnomer: Et is decreasing exponentially fast! On the other hand, log Et is decreasing linearly. Ex: gradient descent on smooth and strongly convex functions Mathieu Blondel Beyond gradient descent 7 / 47 Superlinear rates We can further classify the order q of convergence rates Et lim q = M t!1 (Et−1) Superlinear (q = 1, M = 0) 1=k −tk 1 E = O e , T" = O log t " Quadratic (q = 2) −2t 1 E = O e , T" = O log log t " Mathieu Blondel Beyond gradient descent 8 / 47 Outline 1 Convergence rates 2 Coordinate descent 3 Newton’s method 4 Frank-Wolfe 5 Mirror-Descent 6 Conclusion Mathieu Blondel Beyond gradient descent 9 / 47 Coordinate descent Minimize only one variable per iteration, keeping all others fixed Well-suited if the one-variable sub-problem is easy to solve Cheap iteration cost Easy to implement No need for step size tuning State-of-the-art on the lasso, SVM dual, (non-negative) matrix factorization Block coordinate descent: minimize only a block of variables Mathieu Blondel Beyond gradient descent 10 / 47 Coordinate-wise minimizer A point is called coordinate-wise minimizer of f if f is minimized along all coordinates separately f (x + δej ) ≥ f (x) 8δ 2 R; j 2 f1;:::; dg Does the coordinate-wise minimizer coincide with the global minimizer? f convex and differentiable: yes rf (x) = 0 , rj f (x) = 0 j 2 f1;:::; dg f convex but non-differentiable: not always. Coordinate descent can get stuck Pd f (x) = g(x) + j=1 hj (xj ) where g is differentiable but hj is not: yes 0 2 @f (x) , −∇j g(x) 2 @hj (xj ) j 2 f1;:::; dg Mathieu Blondel Beyond gradient descent 11 / 47 Coordinate descent On each iteration t, pick a coordinate jt 2 f1;:::; dg and minimize (approximately) this coordinate while keeping others fixed t t t t min f (x ; x ;:::; xj ;:::; x ; x ) x 1 2 t d−1 d jt Coordinate selection strategies: random, cyclic, shuffled cyclic. Coordinate descent with exact updates: requires an “oracle” to solve the sub-problem Coordinate gradient descent: only requires first-order information (and sometimes a prox operator) Mathieu Blondel Beyond gradient descent 12 / 47 Coordinate descent with exact updates Suppose f is a quadratic function. Then δ2 f (x + δe ) = f (x) + r f (x)δ + r2f (x) j j 2 jj Minimizing w.r.t. δ, we get r f (x) r f (x) ? = − j , − j δ 2 xj xj 2 rjj f (x) rjj f (x) 1 2 λ 2 Example: f (x) = 2 kAx − bk2 + 2 kxk2 > > rf (x) = A (Ax − b) + λx ) rj f (x) = A:;j (Ax − b) + λxj 2 > 2 2 r f (x) = A A + λI ) rjj f (x) = kA:;j k2 + λ Mathieu Blondel Beyond gradient descent 13 / 47 Coordinate descent with exact updates Computing rf (x) for gradient descent costs O(nd) time Let us maintain the residual vector r = Ax − b 2 Rn When xj is updated, synchronizing r takes O(n) time When r in synchronized, we can compute rj f (x) in O(n) time 2 The second derivatives rjj f (x) can be pre-computed ahead of time, since it does not depend on x Doing a pass on all d coordinates therefore takes O(nd) time, just like one iteration of gradient descent Mathieu Blondel Beyond gradient descent 14 / 47 Coordinate descent with exact updates Pd If f (x) = g(x) + j=1 hj (xj ) and g is quadratic, then δ2 f (x + δe ) = g(x) + r g(x)δ + r2g(x) + h (x + δ) j j 2 jj j j The closed form solution is ! ? rj g(x) δ = prox λ x − − x hj j 2 j r2f (x) r ( ) jj jj g x where we used the proximity operator 1 prox (u) = argmin (u − v)2 + h (v) τhj τ j v 2 If hj = j · j, then prox is the soft-thresholding operator. State-of-the-art for solving the lasso! Mathieu Blondel Beyond gradient descent 15 / 47 Coordinate gradient descent If f is not quadratic, there typically does not exist a closed form 2 If rj f (x) is Lj -Lipschitz-continuous, recall that rjj f (x) ≤ Lj 2 Key idea: replace rjj f (x) with Lj , i.e., rj f (x) − xj xj 2 rjj f (x) becomes rj f (x) xj xj − Lj Each Lj is coordinate-specific (easier to derive and tighter than a global constant L) Convergence: O(1=") under Lipschitz gradient and O(log(1=")) under strong convexity (random or cyclic selection) Mathieu Blondel Beyond gradient descent 16 / 47 Coordinate gradient descent with prox Similarly, we can replace ! rj g(x) x prox λ x − j hj j 2 r2g(x) r ( ) jj jj g x with rj g(x) x prox λ x − j hj j Lj Lj 2 where rjj g(x) ≤ Lj for all x Can be used for instance for L1-regularized logistic regression If hj (xj ) = ICj (xj ), where Cj is a convex set, then the prox becomes the projection onto Cj . Mathieu Blondel Beyond gradient descent 17 / 47 Implementation techniques Synchronize “statistics” (e.g. residuals) upon each update Column-major format: Fortran-style array or sparse CSC matrix Regularization path and warm-start λ1 > λ2 > ··· > λm Since CD converges faster with big λ, start from λ1, use solution to warm-start (initialize) λ2, etc. Active set, safe screening: use optimality conditions to safely discard coordinates that are guaranteed to be 0 Mathieu Blondel Beyond gradient descent 18 / 47 Outline 1 Convergence rates 2 Coordinate descent 3 Newton’s method 4 Frank-Wolfe 5 Mirror-Descent 6 Conclusion Mathieu Blondel Beyond gradient descent 19 / 47 Newton’s method for root finding Given a function g, find x such that g(x) = 0 Such x is called a root of g Newton’s method in one dimension: g(xt ) xt+1 = xt − g0(xt ) Newton’s method in d dimensions: t+1 t t −1 t x = x − Jg(x ) g(x ) t d×d d d where Jg(x ) 2 R is the Jacobian matrix of g : R ! R The method may fail to converge to a root Mathieu Blondel Beyond gradient descent 20 / 47 Newton’s method for optimization If we want to minimize f , we can set g = f 0 or g = rf Newton’s method in one dimension: f 0(xt ) xt+1 = xt − f 00(xt ) Newton’s method in d dimensions: xt+1 = xt − r2f (xt )−1rf (xt ) | {z } dt In practice, once solves the linear system of equations r2f (xt )d t = rf (xt ) If f is non-convex, r2f (xt ) is indefinite (i.e., not psd) Mathieu Blondel Beyond gradient descent 21 / 47 Line search If f is a quadratic, Newton’s method converges in one iteration Otherwise, Newton’s method may not converge, even if f is convex Solution: use a step size xt+1 = xt − ηt d t Backtracking line search: decrease ηt geometrically until ηt d t satisfies some conditions Examples: Armijo rule, strong Wolfe conditions Superlinear local convergence Mathieu Blondel Beyond gradient descent 22 / 47 Trust region methods Newton’s method xt+1 = xt − r2f (xt )−1rf (xt ) | {z } dt is equivalent to solving a quadratic approximation of f around xt 1 −d t = argmin f (xt ) + rf (xt )>d + d >r2f (xt )d d 2 Trust region method: add a ball constraint t t t > 1 > 2 t t −d = argmin f (x )+rf (x ) d + d r f (x )d s.t.

Load more