Illustration Examples
Beyond gradient descent
Mathieu Blondel
December 3, 2020 Outline
1 Convergence rates
3 Newton’s method
4 Frank-Wolfe
5 Mirror-Descent
6 Conclusion
Mathieu Blondel Beyond gradient descent 1 / 47 Measuring progress
How to measure the progress made by an iterative algorithm for solving an optimization problem?
x? = argmin f (x) x∈X
Non-negative error measure
? ? Et = ||xt − x || or Et = f (xt ) − f (x )
Progress ratio Et ρt = Et−1
An algorithm makes strict progress on iteration t if ρt ∈ [0, 1).
Mathieu Blondel Beyond gradient descent 2 / 47 Types of convergence rates
Asymptotic convergence rate
lim ρt = ρ t→∞
i.e., the sequence ρ1, ρ2, ρ3,... converges to ρ
Sublinear rate: ρ = 1. The longer the algorithm runs, the slower it makes progress! (the algorithm decelerates over time)
Linear rate: ρ ∈ (0, 1). The algorithm eventually reaches a state of constant progress.
Superlinear rate: ρ = 0. The longer the algorithm runs, the faster it makes progress! (the algorithm accelerates over time)
Mathieu Blondel Beyond gradient descent 3 / 47 10 2
) 10 5 e l a Et = 1/ t (sublinear) c s 10 8 Et = 1/t (sublinear) g
o 2 l Et = 1/t (sublinear) (
10 11 t t E E = e (linear)
t r 2 o 14 t r 10 Et = e (superlinear) r E
10 17
10 20 0 20 40 60 80 100 Iteration t
Mathieu Blondel Beyond gradient descent 4 / 47 1.0
Et = 1/ t (sublinear)
0.8 Et = 1/t (sublinear) 2 Et = 1/t (sublinear) 1 t E t t E 0.6 Et = e (linear)
= t2 t Et = e (superlinear)
o i 0.4 t a R
0.2
0.0 0 20 40 60 80 100 Iteration t
Mathieu Blondel Beyond gradient descent 5 / 47 Sublinear rates
Error at iteration Et , number of iterations to reach ε-error Tε 1 1 Et = O ⇔ Tε = O b > 0 tb ε1/b b = 1/2 1 1 Et = O √ ⇔ Tε = O t ε2 b = 1 1 1 E = O ⇔ Tε = O t t ε ex: gradient descent for smooth but not strongly-convex functions b = 2 1 1 E = O ⇔ Tε = O √ t t2 ε ex: Nesterov’s method for smooth but not strongly-convex functions
Mathieu Blondel Beyond gradient descent 6 / 47 Linear rates
The iteration number t now appears in the exponent
t Et = O(ρ ) ρ ∈ (0, 1)
Example: −t 1 E = O e ⇔ Tε = O log t ε
“Linear rate” is kind of a misnomer: Et is decreasing exponentially fast! On the other hand, log Et is decreasing linearly.
Ex: gradient descent on smooth and strongly convex functions
Mathieu Blondel Beyond gradient descent 7 / 47 Superlinear rates
We can further classify the order q of convergence rates
Et lim q = M t→∞ (Et−1) Superlinear (q = 1, M = 0)
1/k −tk 1 E = O e ⇔ Tε = O log t ε
Quadratic (q = 2) −2t 1 E = O e ⇔ Tε = O log log t ε
Mathieu Blondel Beyond gradient descent 8 / 47 Outline
1 Convergence rates
2 Coordinate descent
3 Newton’s method
4 Frank-Wolfe
5 Mirror-Descent
6 Conclusion
Mathieu Blondel Beyond gradient descent 9 / 47 Coordinate descent
Minimize only one variable per iteration, keeping all others fixed
Well-suited if the one-variable sub-problem is easy to solve
Cheap iteration cost
Easy to implement
No need for step size tuning
State-of-the-art on the lasso, SVM dual, (non-negative) matrix factorization
Block coordinate descent: minimize only a block of variables
Mathieu Blondel Beyond gradient descent 10 / 47 Coordinate-wise minimizer
A point is called coordinate-wise minimizer of f if f is minimized along all coordinates separately
f (x + δej ) ≥ f (x) ∀δ ∈ R, j ∈ {1,..., d} Does the coordinate-wise minimizer coincide with the global minimizer? f convex and differentiable: yes
∇f (x) = 0 ⇔ ∇j f (x) = 0 j ∈ {1,..., d} f convex but non-differentiable: not always. Coordinate descent can get stuck Pd f (x) = g(x) + j=1 hj (xj ) where g is differentiable but hj is not: yes
0 ∈ ∂f (x) ⇔ −∇j g(x) ∈ ∂hj (xj ) j ∈ {1,..., d}
Mathieu Blondel Beyond gradient descent 11 / 47 Coordinate descent
On each iteration t, pick a coordinate jt ∈ {1,..., d} and minimize (approximately) this coordinate while keeping others fixed
t t t t min f (x , x ,..., xj ,..., x , x ) x 1 2 t d−1 d jt
Coordinate selection strategies: random, cyclic, shuffled cyclic.
Coordinate descent with exact updates: requires an “oracle” to solve the sub-problem
Coordinate gradient descent: only requires first-order information (and sometimes a prox operator)
Mathieu Blondel Beyond gradient descent 12 / 47 Coordinate descent with exact updates
Suppose f is a quadratic function. Then
δ2 f (x + δe ) = f (x) + ∇ f (x)δ + ∇2f (x) j j 2 jj Minimizing w.r.t. δ, we get
∇ f (x) ∇ f (x) ? = − j ⇔ ← − j δ 2 xj xj 2 ∇jj f (x) ∇jj f (x)
1 2 λ 2 Example: f (x) = 2 kAx − bk2 + 2 kxk2
> > ∇f (x) = A (Ax − b) + λx ⇒ ∇j f (x) = A:,j (Ax − b) + λxj
2 > 2 2 ∇ f (x) = A A + λI ⇒ ∇jj f (x) = kA:,j k2 + λ
Mathieu Blondel Beyond gradient descent 13 / 47 Coordinate descent with exact updates
Computing ∇f (x) for gradient descent costs O(nd) time
Let us maintain the residual vector r = Ax − b ∈ Rn
When xj is updated, synchronizing r takes O(n) time
When r in synchronized, we can compute ∇j f (x) in O(n) time
2 The second derivatives ∇jj f (x) can be pre-computed ahead of time, since it does not depend on x
Doing a pass on all d coordinates therefore takes O(nd) time, just like one iteration of gradient descent
Mathieu Blondel Beyond gradient descent 14 / 47 Coordinate descent with exact updates
Pd If f (x) = g(x) + j=1 hj (xj ) and g is quadratic, then
δ2 f (x + δe ) = g(x) + ∇ g(x)δ + ∇2g(x) + h (x + δ) j j 2 jj j j The closed form solution is ! ? ∇j g(x) δ = prox λ x − − x hj j 2 j ∇2f (x) ∇ ( ) jj jj g x where we used the proximity operator 1 prox (u) = argmin (u − v)2 + h (v) τhj τ j v 2
If hj = | · |, then prox is the soft-thresholding operator. State-of-the-art for solving the lasso!
Mathieu Blondel Beyond gradient descent 15 / 47 Coordinate gradient descent
If f is not quadratic, there typically does not exist a closed form
2 If ∇j f (x) is Lj -Lipschitz-continuous, recall that ∇jj f (x) ≤ Lj
2 Key idea: replace ∇jj f (x) with Lj , i.e., ∇ f (x) ← − j xj xj 2 ∇jj f (x) becomes ∇j f (x) xj ← xj − Lj
Each Lj is coordinate-specific (easier to derive and tighter than a global constant L) Convergence: O(1/ε) under Lipschitz gradient and O(log(1/ε)) under strong convexity (random or cyclic selection)
Mathieu Blondel Beyond gradient descent 16 / 47 Coordinate gradient descent with prox
Similarly, we can replace ! ∇j g(x) x ← prox λ x − j hj j 2 ∇2g(x) ∇ ( ) jj jj g x
with ∇j g(x) x ← prox λ x − j hj j Lj Lj 2 where ∇jj g(x) ≤ Lj for all x
Can be used for instance for L1-regularized logistic regression
If hj (xj ) = ICj (xj ), where Cj is a convex set, then the prox becomes the projection onto Cj .
Mathieu Blondel Beyond gradient descent 17 / 47 Implementation techniques
Synchronize “statistics” (e.g. residuals) upon each update
Column-major format: Fortran-style array or sparse CSC matrix
Regularization path and warm-start
λ1 > λ2 > ··· > λm
Since CD converges faster with big λ, start from λ1, use solution to warm-start (initialize) λ2, etc.
Active set, safe screening: use optimality conditions to safely discard coordinates that are guaranteed to be 0
Mathieu Blondel Beyond gradient descent 18 / 47 Outline
1 Convergence rates
2 Coordinate descent
3 Newton’s method
4 Frank-Wolfe
5 Mirror-Descent
6 Conclusion
Mathieu Blondel Beyond gradient descent 19 / 47 Newton’s method for root finding
Given a function g, find x such that g(x) = 0
Such x is called a root of g
Newton’s method in one dimension: g(xt ) xt+1 = xt − g0(xt )
Newton’s method in d dimensions:
t+1 t t −1 t x = x − Jg(x ) g(x )
t d×d d d where Jg(x ) ∈ R is the Jacobian matrix of g : R → R The method may fail to converge to a root
Mathieu Blondel Beyond gradient descent 20 / 47 Newton’s method for optimization
If we want to minimize f , we can set g = f 0 or g = ∇f
Newton’s method in one dimension: f 0(xt ) xt+1 = xt − f 00(xt )
Newton’s method in d dimensions:
xt+1 = xt − ∇2f (xt )−1∇f (xt ) | {z } dt In practice, once solves the linear system of equations
∇2f (xt )d t = ∇f (xt )
If f is non-convex, ∇2f (xt ) is indefinite (i.e., not psd)
Mathieu Blondel Beyond gradient descent 21 / 47 Line search
If f is a quadratic, Newton’s method converges in one iteration
Otherwise, Newton’s method may not converge, even if f is convex
Solution: use a step size
xt+1 = xt − ηt d t
Backtracking line search: decrease ηt geometrically until ηt d t satisfies some conditions
Examples: Armijo rule, strong Wolfe conditions
Superlinear local convergence
Mathieu Blondel Beyond gradient descent 22 / 47 Trust region methods
Newton’s method
xt+1 = xt − ∇2f (xt )−1∇f (xt ) | {z } dt is equivalent to solving a quadratic approximation of f around xt 1 −d t = argmin f (xt ) + ∇f (xt )>d + d >∇2f (xt )d d 2 Trust region method: add a ball constraint
t t t > 1 > 2 t t −d = argmin f (x )+∇f (x ) d + d ∇ f (x )d s.t. kdk2 ≤ δ d 2 If xt − d t increases f , reject the solution and decrease δt Similar convergence guarantees as line search methods
Mathieu Blondel Beyond gradient descent 23 / 47 Hessian-free method
If d (number of dimensions) is large, computing the Hessian matrix is expensive
The conjugate gradient (CG) method can be used to solve Ax = b
It only requires to know how to multiply with A, not A directly
Since A = ∇2f (xt ), we need to multiply with the Hessian
This can be done in a number of ways: manual derivation, finite difference, autodiff (cf. autodiff lecture)
The resulting algorithm (with line search) is called Newton-CG
Mathieu Blondel Beyond gradient descent 24 / 47 Sub-sampled Hessians
In machine learning, f is often an average (finite expectation)
n 1 X f (x) = f (x) n i i=1
On iteration t we can sub-sample a set St ⊆ {1,..., n} to compute unbiased estimates of the gradient and Hessian
1 X ∇f (x) ≈ ∇f (xt ) |St | i i∈St 1 X ∇2f (x) ≈ ∇2f (xt ) |St | i i∈St
Can be combined with Hessian-free method
Mathieu Blondel Beyond gradient descent 25 / 47 Quasi-Newton methods
BFGS: replaces
xt+1 = xt − ∇2f (xt )−1∇f (xt )
with xt+1 = xt − Ht ∇f (xt ) where Ht+1 ≈ ∇f (xt+1)−1 is built incrementally from Ht , d t = xt+1 − xt and v t = ∇f (xt+1) − ∇f (xt ) using the so-called secant equation
L-BFGS: keep a history of only m pairs (d t , v t ) and compute Ht ∇f (xt ) on the fly without materializing Ht in memory
Local superlinear convergence rate
One of the go-to algorithms in machine learning!
Mathieu Blondel Beyond gradient descent 26 / 47 Outline
1 Convergence rates
2 Coordinate descent
3 Newton’s method
4 Frank-Wolfe
5 Mirror-Descent
6 Conclusion
Mathieu Blondel Beyond gradient descent 27 / 47 Projected gradient descent
Consider the constrained optimization problem
min f (x) x∈C
where f is L-smooth convex and C is a closed convex set
Projected gradient descent
1 xt+1 = P xt − ∇f (xt ) C L
where 2 PC(x) = argmin kx − yk2 y∈C is the projection of x onto C
Mathieu Blondel Beyond gradient descent 28 / 47 Projected gradient descent
Recall that gradient descent’s update can be seen as solving a (crude) local quadratic approximation of f around xt
t+1 t 1 t t t t L t t x = x − ∇f (x ) = argmin f (x )+∇f (x )>(x−x )+ (x − x )>I(x − x ) L d 2 x R | {z } ∈ x x t 2 || − ||2 Similarly
t+1 t 1 t t t > t L t 2 x = PC(x − ∇f (x )) = PC argmin f (x ) + ∇f (x ) (x − x ) + ||x − x ||2 L d 2 x∈R t t > t L t 2 = argmin f (x ) + ∇f (x ) (x − x ) + ||x − x ||2 x∈C 2
Mathieu Blondel Beyond gradient descent 29 / 47 Frank-Wolfe
A method for constrained optimization
Based on a linear approximation instead of a quadratic one
Projection free: a linear minimization oracle (LMO) is needed instead
LMOs are typically cheaper to compute than projections
Also known as conditional gradient method (not to be confused with conjugate gradient method)
Mathieu Blondel Beyond gradient descent 30 / 47 Frank-Wolfe
Initialize x0 ∈ C
For t ∈ {0, 1, 2,... }
s = argmin f (xt ) + ∇f (xt )>(s − xt ) = argmin ∇f (xt )>s s∈C s∈C 2 xt+1 = (1 − γt )xt + γt s γt = 2 + t
> argmins∈C g s is called linear minimization oracle (LMO) C needs to be compact (closed and bounded), otherwise the LMO problem is not feasible (solution goes to infinity)
How to compute the LMO?
Mathieu Blondel Beyond gradient descent 31 / 47 Convex hulls
Probability simplex
m m m X 4 = {p ∈ R : pi = 1, pi ≥ 0 i ∈ {1,..., m}} i=1
v is a convex combination of {v1,..., vm} if
m X m v = pi vi for some p ∈ 4 i=1 The convex hull of S is the set of all convex combinations of S
( m ) X m conv(S) = pi vi : m ∈ N; p ∈ 4 ; v1,..., vm ∈ S i=1
Mathieu Blondel Beyond gradient descent 32 / 47 Convex polytopes
A convex polytope is the convex hull of its vertices V = {v1,..., vm}
C = conv(V )
Example 1: Probability simplex
m 4 = conv({e1,..., em})
Example 2: L1-ball
m ♦ = {x : kxk1 ≤ 1} = conv({±e1,..., ±em})
Example 3: L∞-ball
m m = {x : kxk∞ ≤ 1} = conv({−1, 1} )
Mathieu Blondel Beyond gradient descent 33 / 47 Linear minimization oracles
If C = conv(V ) where V = {v1,..., vm} then argmin g>s ⊆ V s∈C Example 1: Probability simplex
> ei ∈ argmin g s i ∈ argmin gj s∈4m j
Example 2: L1-ball
> sign(−gi )ei ∈ argmin g s i ∈ argmax |gj | s∈♦m j
Example 3: L∞-ball sign(−g) ∈ argmin g>s s∈m
Mathieu Blondel Beyond gradient descent 34 / 47 Example: sparse regression
Consider the objective
1 2 > min f (w) = kXw − yk2 ∇f (w) = X (Xw − y) kwk1≤τ 2 Initialize w 0 = 0 For t ∈ {0, 1, 2,... } g = ∇f (w t )
i ∈ argmax |gj | j
s = τ · sign(gi )ei 2 w t+1 = (1 − γt )w t + γt s γt = 2 + t Pick a coordinate greedily and update it! Sparse solution: w t contains at most t non-zero elements
Mathieu Blondel Beyond gradient descent 35 / 47 Norm constraints
Consider the case of norm constraints C = {x ∈ Rd : kxk ≤ τ} Note that s ∈ argmin g>x = τ · argmax −g>x kxk≤τ kxk≤1 Recall the definition of dual norm
> kyk∗ = max x y kxk≤1 Thus, up to the factor τ, s is the argument achieving the maximum in the dual norm
We saw (last lecture) that this coincides with the subdifferential
Therefore, s ∈ τ · ∂k − gk∗
Example 3 (last slide): sign(−g) ∈ ∂k − gk1
Mathieu Blondel Beyond gradient descent 36 / 47 Frank-Wolfe variants
Vanilla FW has a slow sublinear rate 1 f (xt ) − f (x?) ≤ O t
Variants of FW that enjoy a linear rate of convergence exist under strong convexity assumptions on f
f (xt ) − f (x?) ≤ O ρt ρ ∈ (0, 1)
Full-corrective FW
Away-steps FW
Pairwise FW
Mathieu Blondel Beyond gradient descent 37 / 47 Outline
1 Convergence rates
2 Coordinate descent
3 Newton’s method
4 Frank-Wolfe
5 Mirror-Descent
6 Conclusion
Mathieu Blondel Beyond gradient descent 38 / 47 Bregman divergences
The Bregman divergence generated by ϕ between u and v is
Dϕ(u, v) = ϕ(u) − ϕ(v) − h∇ϕ(v), u − vi It is the difference between ϕ(u) and its linearization around v.
ϕ(u) Dϕ(u, v) ϕ(v) + ϕ(v), u v h∇ − i
v u
1 2 > Examples: ϕ(x) = 2 kxk2 (squared Euclidean), ϕ(x) = x log(x) (Kullback-Leibler) Mathieu Blondel Beyond gradient descent 39 / 47 Bregman projections
Euclidean projection
2 PC(x) = argmin ky − xk2 y∈C
Bregman projection onto C ⊆ dom(ϕ)
ϕ PC (x) = argmin Dϕ(y, x) y∈C
Recovers Euclidean projections as a special case
Example: ϕ(x) = x> log(x)
ϕ d d PC (x) = argmin KL(y, x) x ∈ R+, C ⊆ R+ y∈C
Mathieu Blondel Beyond gradient descent 40 / 47 Mirror descent
Projected gradient descent 1 t+1 = ( t − ηt ∇ ( t )) = ( t ) + ∇ ( t )>( − t ) + || − t ||2 x PC x f x PC argmin f x f x x x t x x 2 d 2η x∈R
1 = ( t ) + ∇ ( t )>( − t ) + || − t ||2 argmin f x f x x x t x x 2 x∈C 2η Mirror descent 1 t+1 = ϕ ( t ) + ∇ ( t )>( − t ) + ( , t ) x PC argmin f x f x x x t Dϕ x x d η x∈R 1 = ( t ) + ∇ ( t )>( − t ) + ( , t ) argmin f x f x x x t Dϕ x x x∈C η ϕ t t t 6= PC (x − η ∇f (x )) (in general)
L Convergence rate is O √f ,∗ , where k∇f (x)k ≤ L for all x when t ∗ f ,∗ 1 using ηt = O √ Lf ,∗ t
Mathieu Blondel Beyond gradient descent 41 / 47 Example: optimization over the simplex
If ϕ(x) = x> log(x) and C = 4d , then we have a closed form
xt exp(−η ∇f (xt )) xt+1 = t Pd t t j=1 xj exp(−ηt ∇j f (x ))
(the operations in the numerator are element-wise)
Often called exponentiated gradient descent or entropic descent
The KL case, for which k · k = k · k1 and k · k∗ = k · k∞, enjoys better convergence rate than the Euclidean case on the simplex √ Indeed, using kgk∞ ≤ kgk2 ≤ dkgk∞, we get 1 L √ ≤ f ,∞ ≤ 1 d Lf ,2
Mathieu Blondel Beyond gradient descent 42 / 47 Alternative view
On iteration t
xˆt = ∇ϕ(xt ) map primal point to dual yˆt+1 = xˆt − ηt ∇f (xt ) take gradient step in the dual y t+1 = ∇ϕ∗(yˆt+1) map new dual point back to primal t+1 ϕ t+1 x = PC (y ) project onto feasible set
∇ϕ and ∇ϕ∗ are called mirror maps
Under technical assumptions on ϕ called “Legendre-type” (ϕ strictly convex, ∇ϕ = ∞ on the boundary of dom(ϕ)), we have
∇ϕ∗ = (∇ϕ)−1
Mathieu Blondel Beyond gradient descent 43 / 47 Outline
1 Convergence rates
2 Coordinate descent
3 Newton’s method
4 Frank-Wolfe
5 Mirror-Descent
6 Conclusion
Mathieu Blondel Beyond gradient descent 44 / 47 Summary
Convergence rates: important to familarize yourself with their classification
Coordinate descent: ideal when regularizers or constraints are decomposable
Newton’s method: the Newton-CG algorithm (Hessian-free) is popularly used
Frank-Wolfe: projection-free constrained optimization
Mirror descent: generalization of projected gradient descent to non-Euclidean geometries
Mathieu Blondel Beyond gradient descent 45 / 47 Lab work
Recall that the dual of multiclass SVMs consists in maximizing
n X 1 D(β) = − [Ω(β ) − Ω(y )] − kX >(Y − β)k2 s.t. β ∈ 4k i i 2λ 2 i i=1
× × where X ∈ Rn d (features), Y ∈ Rn k (one-hot labels), Ω(βi ) = −hβi , 1 − yi i, λ > 0 (regularization parameter)
? 1 > ? d×k The primal-dual link is W = λ X (Y − β ) ∈ R × The gradient ∇D(β) ∈ Rn k has rows as follows:
> 1 > k ∇i D(β) = −∇Ω(βi ) + x ( X (Y − β)) ∈ R i λ | {z } W
where ∇Ω(βi ) = yi − 1
Mathieu Blondel Beyond gradient descent 46 / 47 Lab work
We want to minimize f (β) = −D(β), where ∇f (β) = −∇D(β) Implement Frank-Wolfe for this problem. 0 k 0 Initialize βi ∈ 4 , e.g., βi = (1/k,..., 1/k), for i ∈ {1,..., n} For t ∈ {0, 1, 2,... } t n k G = ∇f (β ) ∈ R ×
si = argmin gi>si i ∈ {1,..., n} s k i ∈4 2 βt+1 = (1 − γt )βt + γt S γt = 2 + t Implement mirror descent for this problem using the KL geometry. t t β exp(−ηt ∇ f (β )) βt+1 = i i i ∈ {1,..., n} i Pk t t j=1 βi,j exp(−ηt ∇i,j f (β )) √ using ηt = η/ t for some η ∈ (0, 1]
Mathieu Blondel Beyond gradient descent 47 / 47