
Beyond gradient descent

Mathieu Blondel

December 3, 2020

Outline

1 Convergence rates

2 Coordinate descent

3 Newton’s method

4 Frank-Wolfe

5 Mirror-Descent

6 Conclusion

Measuring progress

How to measure the progress made by an iterative algorithm for solving an optimization problem?

x⋆ = argmin_{x∈X} f(x)

Non-negative error measure

Et = ‖xt − x⋆‖   or   Et = f(xt) − f(x⋆)

Progress ratio

ρt = Et / Et−1

An algorithm makes strict progress on iteration t if ρt ∈ [0, 1).

Types of convergence rates

Asymptotic convergence rate

lim_{t→∞} ρt = ρ

i.e., the sequence ρ1, ρ2, ρ3,... converges to ρ

Sublinear rate: ρ = 1. The longer the algorithm runs, the slower it makes progress! (the algorithm decelerates over time)

Linear rate: ρ ∈ (0, 1). The algorithm eventually reaches a state of constant progress.

Superlinear rate: ρ = 0. The longer the algorithm runs, the faster it makes progress! (the algorithm accelerates over time)

[Figure: error Et (log scale) vs. iteration t for Et = 1/√t, 1/t, 1/t² (sublinear), Et = e^(−t) (linear) and Et = e^(−t²) (superlinear).]

[Figure: progress ratio ρt = Et/Et−1 vs. iteration t for the same five error sequences.]
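To make the three regimes concrete, here is a minimal sketch (assuming NumPy and Matplotlib) that generates the five error sequences from the figures above and plots both the error Et on a log scale and the progress ratio ρt = Et/Et−1:

```python
import numpy as np
import matplotlib.pyplot as plt

t = np.arange(1, 101)
sequences = {
    "E_t = 1/sqrt(t) (sublinear)": 1.0 / np.sqrt(t),
    "E_t = 1/t (sublinear)": 1.0 / t,
    "E_t = 1/t^2 (sublinear)": 1.0 / t**2,
    "E_t = exp(-t) (linear)": np.exp(-t),
    "E_t = exp(-t^2) (superlinear)": np.exp(-t**2),
}

fig, (ax_err, ax_ratio) = plt.subplots(1, 2, figsize=(10, 4))
for label, E in sequences.items():
    ax_err.semilogy(t, E, label=label)              # error on a log scale
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = E[1:] / E[:-1]                      # progress ratio rho_t
    ax_ratio.plot(t[1:], ratio, label=label)
ax_err.set(xlabel="Iteration t", ylabel="Error E_t (log scale)")
ax_ratio.set(xlabel="Iteration t", ylabel="Ratio E_t / E_{t-1}")
ax_ratio.legend(fontsize=7)
plt.tight_layout()
plt.show()
```

The sublinear ratios creep up toward 1, the linear ratio is the constant e⁻¹, and the superlinear ratio heads to 0.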

Sublinear rates

Error at iteration t: Et; number of iterations to reach ε error: Tε

Et = O(1/t^b) ⇔ Tε = O(1/ε^(1/b)),   b > 0

b = 1/2:  Et = O(1/√t) ⇔ Tε = O(1/ε²)

b = 1:  Et = O(1/t) ⇔ Tε = O(1/ε)
ex: gradient descent for smooth but not strongly-convex functions

b = 2:  Et = O(1/t²) ⇔ Tε = O(1/√ε)
ex: Nesterov’s method for smooth but not strongly-convex functions

Linear rates

The iteration number t now appears in the exponent

Et = O(ρ^t),   ρ ∈ (0, 1)

Example: Et = O(e^(−t)) ⇔ Tε = O(log(1/ε))

“Linear rate” is kind of a misnomer: Et is decreasing exponentially fast! On the other hand, log Et is decreasing linearly.

Ex: gradient descent on smooth and strongly convex functions

Superlinear rates

We can further classify convergence rates by their order q

lim_{t→∞} Et / (Et−1)^q = M

Superlinear (q = 1, M = 0):

Et = O(e^(−t^k)) ⇔ Tε = O(log^(1/k)(1/ε))

Quadratic (q = 2):

Et = O(e^(−2^t)) ⇔ Tε = O(log log(1/ε))

Outline

1 Convergence rates

2 Coordinate descent

3 Newton’s method

4 Frank-Wolfe

5 Mirror-Descent

6 Conclusion

Coordinate descent

Minimize only one variable per iteration, keeping all others fixed

Well-suited if the one-variable sub-problem is easy to solve

Cheap iteration cost

Easy to implement

No need for step size tuning

State-of-the-art on the lasso, SVM dual, (non-negative) matrix factorization

Block coordinate descent: minimize only a block of variables

Coordinate-wise minimizer

A point is called a coordinate-wise minimizer of f if f is minimized along each coordinate separately

f(x + δej) ≥ f(x)   ∀δ ∈ R, j ∈ {1, ..., d}

Does the coordinate-wise minimizer coincide with the global minimizer?

f convex and differentiable: yes

∇f(x) = 0 ⇔ ∇j f(x) = 0,   j ∈ {1, ..., d}

f convex but non-differentiable: not always. Coordinate descent can get stuck.

f(x) = g(x) + Σ_{j=1}^d hj(xj), where g is differentiable but hj is not: yes

0 ∈ ∂f(x) ⇔ −∇j g(x) ∈ ∂hj(xj),   j ∈ {1, ..., d}

Coordinate descent

On each iteration t, pick a coordinate jt ∈ {1,..., d} and minimize (approximately) this coordinate while keeping others fixed

min_{x_{jt}} f(x1^t, x2^t, ..., x_{jt}, ..., x_{d−1}^t, xd^t)

Coordinate selection strategies: random, cyclic, shuffled cyclic.

Coordinate descent with exact updates: requires an “oracle” to solve the sub-problem

Coordinate gradient descent: only requires first-order information (and sometimes a prox operator)

Coordinate descent with exact updates

Suppose f is a quadratic function. Then

f(x + δej) = f(x) + ∇j f(x) δ + ∇²jj f(x) δ²/2

Minimizing w.r.t. δ, we get

δ⋆ = −∇j f(x) / ∇²jj f(x)   ⇔   xj ← xj − ∇j f(x) / ∇²jj f(x)

Example: f(x) = ½ ‖Ax − b‖₂² + (λ/2) ‖x‖₂²

∇f(x) = Aᵀ(Ax − b) + λx   ⇒   ∇j f(x) = A:,jᵀ(Ax − b) + λxj

∇²f(x) = AᵀA + λI   ⇒   ∇²jj f(x) = ‖A:,j‖₂² + λ

Coordinate descent with exact updates

Computing ∇f (x) for gradient descent costs O(nd) time

Let us maintain the residual vector r = Ax − b ∈ Rn

When xj is updated, synchronizing r takes O(n) time

When r is synchronized, we can compute ∇j f(x) in O(n) time

The second derivatives ∇²jj f(x) can be pre-computed ahead of time, since they do not depend on x

Doing a pass on all d coordinates therefore takes O(nd) time, just like one iteration of gradient descent
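As an illustration, a minimal sketch (assuming NumPy; the function name is ours) of exact coordinate descent on the ridge example above, maintaining the residual r = Ax − b so that every coordinate update costs O(n):

```python
import numpy as np

def ridge_cd(A, b, lam, n_epochs=100):
    """Exact coordinate descent for f(x) = 0.5*||Ax - b||^2 + 0.5*lam*||x||^2."""
    n, d = A.shape
    x = np.zeros(d)
    r = A @ x - b                        # residual, kept in sync with x
    h = (A ** 2).sum(axis=0) + lam       # second derivatives, precomputed once
    for _ in range(n_epochs):
        for j in range(d):               # cyclic coordinate selection
            g_j = A[:, j] @ r + lam * x[j]   # partial derivative in O(n)
            delta = -g_j / h[j]              # exact minimizer along coordinate j
            x[j] += delta
            r += delta * A[:, j]             # O(n) residual synchronization
    return x

rng = np.random.default_rng(0)
A, b = rng.standard_normal((100, 20)), rng.standard_normal(100)
x_cd = ridge_cd(A, b, lam=0.1)
x_exact = np.linalg.solve(A.T @ A + 0.1 * np.eye(20), A.T @ b)
print(np.max(np.abs(x_cd - x_exact)))    # should be tiny (linear convergence)
```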

Coordinate descent with exact updates

If f(x) = g(x) + Σ_{j=1}^d hj(xj) and g is quadratic, then

f(x + δej) = g(x) + ∇j g(x) δ + ∇²jj g(x) δ²/2 + hj(xj + δ)

The closed-form solution is

δ⋆ = prox_{hj / ∇²jj g(x)} ( xj − ∇j g(x) / ∇²jj g(x) ) − xj

where we used the proximity operator

prox_{τ hj}(u) = argmin_v (1/(2τ)) (u − v)² + hj(v)

If hj = | · |, then prox is the soft-thresholding operator. State-of-the-art for solving the lasso!
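A minimal sketch (assuming NumPy, nonzero columns, and that the L1 penalty enters as hj = λ|·|, so the prox is soft-thresholding with threshold λ/∇²jj g(x)):

```python
import numpy as np

def soft_threshold(u, tau):
    """Proximity operator of tau*|.| (soft-thresholding)."""
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def lasso_cd(A, b, lam, n_epochs=100):
    """Coordinate descent for 0.5*||Ax - b||^2 + lam*||x||_1."""
    n, d = A.shape
    x = np.zeros(d)
    r = A @ x - b                  # residual, kept in sync with x
    h = (A ** 2).sum(axis=0)       # second derivatives of the quadratic part
    for _ in range(n_epochs):
        for j in range(d):
            g_j = A[:, j] @ r      # partial derivative of the smooth part
            x_new = soft_threshold(x[j] - g_j / h[j], lam / h[j])
            r += (x_new - x[j]) * A[:, j]   # O(n) residual update
            x[j] = x_new
    return x
```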

Coordinate gradient descent

If f is not quadratic, there typically does not exist a closed form

If ∇j f(x) is Lj-Lipschitz-continuous, recall that ∇²jj f(x) ≤ Lj

Key idea: replace ∇²jj f(x) with Lj, i.e.,

xj ← xj − ∇j f(x) / ∇²jj f(x)

becomes

xj ← xj − ∇j f(x) / Lj

Each Lj is coordinate-specific (easier to derive and tighter than a global constant L)

Convergence: O(1/ε) under Lipschitz gradient and O(log(1/ε)) under strong convexity (random or cyclic selection)
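A minimal sketch (assuming NumPy; names are ours) of coordinate gradient descent on plain logistic regression, using the standard coordinate-wise bound Lj = ‖A:,j‖²/4:

```python
import numpy as np

def logreg_cgd(A, y, n_epochs=100):
    """Coordinate gradient descent for sum_i log(1 + exp(-y_i * a_i^T w)).
    A: (n, d) features, y: (n,) labels in {-1, +1}."""
    n, d = A.shape
    w = np.zeros(d)
    L = (A ** 2).sum(axis=0) / 4.0               # coordinate-wise Lipschitz constants
    margins = y * (A @ w)                        # kept in sync with w
    for _ in range(n_epochs):
        for j in range(d):
            sigma = 1.0 / (1.0 + np.exp(np.minimum(margins, 35.0)))  # sigmoid(-y * Aw)
            g_j = -(y * sigma) @ A[:, j]         # partial derivative
            delta = -g_j / L[j]                  # Lipschitz step along coordinate j
            w[j] += delta
            margins += delta * y * A[:, j]       # O(n) synchronization
    return w

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 10))
y = np.sign(A @ rng.standard_normal(10))
w = logreg_cgd(A, y)
print(np.mean(np.sign(A @ w) == y))              # training accuracy, should be high
```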

Coordinate gradient descent with prox

Similarly, we can replace

xj ← prox_{hj / ∇²jj g(x)} ( xj − ∇j g(x) / ∇²jj g(x) )

with

xj ← prox_{hj / Lj} ( xj − ∇j g(x) / Lj )

where ∇²jj g(x) ≤ Lj for all x

Can be used for instance for L1-regularized logistic regression

If hj(xj) = I_{Cj}(xj), where Cj is a convex set, then the prox becomes the projection onto Cj.

Implementation techniques

Synchronize “statistics” (e.g. residuals) upon each update

Column-major format: Fortran-style array or sparse CSC matrix

Regularization path and warm-start

λ1 > λ2 > ··· > λm

Since CD converges faster with big λ, start from λ1, use solution to warm-start (initialize) λ2, etc.

Active set, safe screening: use optimality conditions to safely discard coordinates that are guaranteed to be 0

Outline

1 Convergence rates

2 Coordinate descent

3 Newton’s method

4 Frank-Wolfe

5 Mirror-Descent

6 Conclusion

Newton’s method for root finding

Given a function g, find x such that g(x) = 0

Such x is called a root of g

Newton’s method in one dimension:

xt+1 = xt − g(xt) / g′(xt)

Newton’s method in d dimensions:

xt+1 = xt − Jg(xt)⁻¹ g(xt)

where Jg(xt) ∈ R^(d×d) is the Jacobian matrix of g : R^d → R^d

The method may fail to converge to a root
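A minimal sketch (assuming NumPy; the function names are ours) of Newton’s method for root finding, solving the linear system rather than inverting the Jacobian:

```python
import numpy as np

def newton_root(g, jac, x0, n_iters=20, tol=1e-10):
    """Newton's method for g(x) = 0, with g: R^d -> R^d and jac(x) its Jacobian."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        gx = g(x)
        if np.linalg.norm(gx) < tol:
            break
        x = x - np.linalg.solve(jac(x), gx)    # solve J d = g instead of forming J^-1
    return x

# Example: root of g(x) = x^2 - 2 (treated as a 1-dimensional system)
g = lambda x: x**2 - 2.0
jac = lambda x: np.diag(2.0 * x)
print(newton_root(g, jac, x0=[1.0]))           # ~ [1.41421356]
```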

Newton’s method for optimization

If we want to minimize f , we can set g = f 0 or g = ∇f

Newton’s method in one dimension:

xt+1 = xt − f′(xt) / f″(xt)

Newton’s method in d dimensions:

xt+1 = xt − ∇²f(xt)⁻¹ ∇f(xt) = xt − dt

In practice, one solves the linear system of equations

∇²f(xt) dt = ∇f(xt)

If f is non-convex, ∇²f(xt) may be indefinite (i.e., not psd)


If f is a quadratic, Newton’s method converges in one iteration

Otherwise, Newton’s method may not converge, even if f is convex

Solution: use a step size

xt+1 = xt − ηt dt

Backtracking line search: decrease ηt geometrically until ηt dt satisfies some conditions

Examples: Armijo rule, strong Wolfe conditions

Superlinear local convergence
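A minimal sketch (assuming NumPy and a convex objective) of a damped Newton step with Armijo backtracking; the constants and function names are illustrative:

```python
import numpy as np

def newton_armijo(f, grad, hess, x0, n_iters=50, beta=0.5, c=1e-4, tol=1e-10):
    """Newton's method with a backtracking (Armijo) line search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        d = np.linalg.solve(hess(x), g)          # Newton direction: solve H d = g
        eta = 1.0
        # Armijo rule: shrink eta geometrically until sufficient decrease holds
        while f(x - eta * d) > f(x) - c * eta * (g @ d):
            eta *= beta
        x = x - eta * d
    return x

# Example: minimize f(x) = sum(exp(x) - x), whose minimizer is x = 0
f = lambda x: np.sum(np.exp(x) - x)
grad = lambda x: np.exp(x) - 1.0
hess = lambda x: np.diag(np.exp(x))
print(newton_armijo(f, grad, hess, x0=[2.0, -1.0]))   # ~ [0, 0]
```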

Trust region methods

Newton’s method

xt+1 = xt − ∇²f(xt)⁻¹ ∇f(xt) = xt − dt

is equivalent to solving a quadratic approximation of f around xt

−dt = argmin_d f(xt) + ∇f(xt)ᵀd + ½ dᵀ∇²f(xt) d

Trust region method: add a ball constraint

−dt = argmin_d f(xt) + ∇f(xt)ᵀd + ½ dᵀ∇²f(xt) d   s.t. ‖d‖₂ ≤ δt

If xt − dt increases f, reject the solution and decrease δt

Similar convergence guarantees as line search methods

Hessian-free method

If d (the number of dimensions) is large, computing the Hessian is expensive

The conjugate gradient (CG) method can be used to solve Ax = b

It only requires the ability to multiply a vector by A, not A itself

Since A = ∇²f(xt), we need to multiply vectors by the Hessian

This can be done in a number of ways: manual derivation, finite difference, autodiff (cf. autodiff lecture)

The resulting algorithm (with line search) is called Newton-CG
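A minimal sketch (assuming SciPy) of a single Hessian-free Newton step: the Hessian is accessed only through Hessian-vector products, wrapped in a LinearOperator and handed to conjugate gradient. The line search is omitted and the function names are ours:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def newton_cg_step(grad, hvp, x):
    """One Newton-CG step: approximately solve H d = g using only H-vector products."""
    g = grad(x)
    H = LinearOperator((x.size, x.size), matvec=lambda v: hvp(x, v))
    d, _ = cg(H, g)              # conjugate gradient, matrix-free
    return x - d                 # (combine with a line search in practice)

# Example on a quadratic f(x) = 0.5 x^T A x - b^T x, so one step solves A x = b
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + np.eye(5)                      # symmetric positive definite
b = rng.standard_normal(5)
grad = lambda x: A @ x - b
hvp = lambda x, v: A @ v                     # Hessian-vector product
x = newton_cg_step(grad, hvp, np.zeros(5))
print(np.max(np.abs(A @ x - b)))             # small residual (up to CG's tolerance)
```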

Sub-sampled Hessians

In machine learning, f is often an average (finite expectation)

f(x) = (1/n) Σ_{i=1}^n fi(x)

On iteration t we can sub-sample a set St ⊆ {1,..., n} to compute unbiased estimates of the gradient and Hessian

∇f(xt) ≈ (1/|St|) Σ_{i∈St} ∇fi(xt)

∇²f(xt) ≈ (1/|St|) Σ_{i∈St} ∇²fi(xt)

Can be combined with Hessian-free method

Quasi-Newton methods

BFGS: replaces

xt+1 = xt − ∇²f(xt)⁻¹ ∇f(xt)

with

xt+1 = xt − Ht ∇f(xt)

where Ht+1 ≈ ∇²f(xt+1)⁻¹ is built incrementally from Ht, dt = xt+1 − xt and vt = ∇f(xt+1) − ∇f(xt) using the so-called secant equation

L-BFGS: keep a history of only m pairs (dt, vt) and compute Ht ∇f(xt) on the fly without materializing Ht in memory

Local superlinear convergence rate

One of the go-to algorithms in machine learning!
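As a usage illustration (assuming SciPy), L-BFGS is available off the shelf through scipy.optimize.minimize; here it is run on SciPy’s built-in Rosenbrock function:

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.zeros(10)
res = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B",
               options={"maxiter": 1000})    # limited-memory BFGS
print(res.fun)   # should be close to 0
print(res.x)     # should be close to the all-ones vector (the minimizer)
```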

Outline

1 Convergence rates

2 Coordinate descent

3 Newton’s method

4 Frank-Wolfe

5 Mirror-Descent

6 Conclusion

Projected gradient descent

Consider the constrained optimization problem

min_{x∈C} f(x)

where f is L-smooth convex and C is a closed convex set

Projected gradient descent

xt+1 = PC( xt − (1/L) ∇f(xt) )

where

PC(x) = argmin_{y∈C} ‖x − y‖₂²

is the projection of x onto C
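A minimal sketch (assuming NumPy) of projected gradient descent with the constant step size 1/L, here with the easy projection onto an L2 ball; the function names are ours:

```python
import numpy as np

def project_l2_ball(x, radius=1.0):
    """Euclidean projection onto {x : ||x||_2 <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def projected_gradient_descent(grad, project, x0, L, n_iters=100):
    """Projected gradient descent with step size 1/L."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = project(x - grad(x) / L)
    return x

# Example: minimize 0.5*||x - c||^2 over the unit L2 ball (answer: c / ||c||_2)
c = np.array([3.0, 4.0])
grad = lambda x: x - c                 # gradient of a 1-smooth function (L = 1)
print(projected_gradient_descent(grad, project_l2_ball, np.zeros(2), L=1.0))  # ~ [0.6, 0.8]
```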

Projected gradient descent

Recall that gradient descent’s update can be seen as solving a (crude) local quadratic approximation of f around xt

xt+1 = xt − (1/L) ∇f(xt) = argmin_{x∈R^d} f(xt) + ∇f(xt)ᵀ(x − xt) + (L/2) ‖x − xt‖₂²

Similarly

xt+1 = PC( xt − (1/L) ∇f(xt) ) = PC( argmin_{x∈R^d} f(xt) + ∇f(xt)ᵀ(x − xt) + (L/2) ‖x − xt‖₂² )
     = argmin_{x∈C} f(xt) + ∇f(xt)ᵀ(x − xt) + (L/2) ‖x − xt‖₂²

Frank-Wolfe

A method for constrained optimization

Based on a linear approximation instead of a quadratic one

Projection free: a linear minimization oracle (LMO) is needed instead

LMOs are typically cheaper to compute than projections

Also known as the conditional gradient method (not to be confused with the conjugate gradient method)

Frank-Wolfe

Initialize x0 ∈ C

For t ∈ {0, 1, 2,... }

s = argmin_{s∈C} f(xt) + ∇f(xt)ᵀ(s − xt) = argmin_{s∈C} ∇f(xt)ᵀs

xt+1 = (1 − γt) xt + γt s,   γt = 2 / (2 + t)

argmin_{s∈C} gᵀs is called the linear minimization oracle (LMO)

C needs to be compact (closed and bounded), otherwise the LMO problem is unbounded (the solution goes to infinity)

How to compute the LMO?

Convex hulls

Probability simplex

△^m = { p ∈ R^m : Σ_{i=1}^m pi = 1, pi ≥ 0, i ∈ {1, ..., m} }

v is a convex combination of {v1,..., vm} if

v = Σ_{i=1}^m pi vi   for some p ∈ △^m

The convex hull of S is the set of all convex combinations of S

conv(S) = { Σ_{i=1}^m pi vi : m ∈ N; p ∈ △^m; v1, ..., vm ∈ S }

Convex polytopes

A convex polytope is the convex hull of its vertices V = {v1,..., vm}

C = conv(V )

Example 1: Probability simplex

△^m = conv({e1, ..., em})

Example 2: L1-ball

♦^m = {x : ‖x‖₁ ≤ 1} = conv({±e1, ..., ±em})

Example 3: L∞-ball

□^m = {x : ‖x‖∞ ≤ 1} = conv({−1, 1}^m)

Linear minimization oracles

If C = conv(V) where V = {v1, ..., vm}, then the LMO can always be chosen among the vertices:

argmin_{s∈C} gᵀs ∩ V ≠ ∅

Example 1: Probability simplex

ei ∈ argmin_{s∈△^m} gᵀs,   i ∈ argmin_j gj

Example 2: L1-ball

sign(−gi) ei ∈ argmin_{s∈♦^m} gᵀs,   i ∈ argmax_j |gj|

Example 3: L∞-ball

sign(−g) ∈ argmin_{s∈□^m} gᵀs
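A minimal sketch (assuming NumPy; the helper names are ours) of these three LMOs:

```python
import numpy as np

def lmo_simplex(g):
    """LMO over the probability simplex: the vertex e_i with i = argmin_j g_j."""
    s = np.zeros_like(g)
    s[np.argmin(g)] = 1.0
    return s

def lmo_l1_ball(g, radius=1.0):
    """LMO over the L1 ball: -radius * sign(g_i) * e_i with i = argmax_j |g_j|."""
    s = np.zeros_like(g)
    i = np.argmax(np.abs(g))
    s[i] = -radius * np.sign(g[i])
    return s

def lmo_linf_ball(g, radius=1.0):
    """LMO over the L-infinity ball: -radius * sign(g)."""
    return -radius * np.sign(g)

g = np.array([0.5, -2.0, 1.0])
print(lmo_simplex(g), lmo_l1_ball(g), lmo_linf_ball(g))
```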

Example: sparse regression

Consider the objective

min_{‖w‖₁≤τ} f(w) = ½ ‖Xw − y‖₂²,   ∇f(w) = Xᵀ(Xw − y)

Initialize w0 = 0

For t ∈ {0, 1, 2, ...}

g = ∇f(wt)

i ∈ argmax_j |gj|

s = τ · sign(−gi) ei

wt+1 = (1 − γt) wt + γt s,   γt = 2 / (2 + t)

Pick a coordinate greedily and update it!

Sparse solution: wt contains at most t non-zero elements
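A minimal sketch (assuming NumPy) of this Frank-Wolfe scheme for L1-constrained least squares; the data in the usage example are synthetic:

```python
import numpy as np

def fw_sparse_regression(X, y, tau, n_iters=200):
    """Frank-Wolfe for min 0.5*||Xw - y||^2 subject to ||w||_1 <= tau."""
    n, d = X.shape
    w = np.zeros(d)
    for t in range(n_iters):
        g = X.T @ (X @ w - y)              # gradient
        i = np.argmax(np.abs(g))           # LMO over the L1 ball: one coordinate
        s = np.zeros(d)
        s[i] = -tau * np.sign(g[i])
        gamma = 2.0 / (2.0 + t)            # standard step size
        w = (1.0 - gamma) * w + gamma * s  # convex combination stays feasible
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 100))
w_true = np.zeros(100)
w_true[:5] = 1.0
y = X @ w_true
w = fw_sparse_regression(X, y, tau=5.0)
print(np.linalg.norm(X @ w - y) / np.linalg.norm(y))   # relative residual, decreases in O(1/t)
print(np.count_nonzero(w))     # sparse iterate: at most one new non-zero per iteration
```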

Norm constraints

Consider the case of norm constraints C = {x ∈ R^d : ‖x‖ ≤ τ}

Note that

s ∈ argmin_{‖x‖≤τ} gᵀx = τ · argmax_{‖x‖≤1} −gᵀx

Recall the definition of the dual norm

‖y‖∗ = max_{‖x‖≤1} xᵀy

Thus, up to the factor τ, s is the argument achieving the maximum in the dual norm

We saw (last lecture) that this coincides with the subdifferential

Therefore, s ∈ τ · ∂‖−g‖∗

Example 3 (last slide): sign(−g) ∈ ∂‖−g‖₁

Frank-Wolfe variants

Vanilla FW has a slow sublinear rate

f(xt) − f(x⋆) ≤ O(1/t)

Variants of FW that enjoy a linear rate of convergence exist under strong convexity assumptions on f

f(xt) − f(x⋆) ≤ O(ρ^t),   ρ ∈ (0, 1)

Full-corrective FW

Away-steps FW

Pairwise FW

Outline

1 Convergence rates

2 Coordinate descent

3 Newton’s method

4 Frank-Wolfe

5 Mirror-Descent

6 Conclusion

Bregman divergences

The Bregman divergence generated by ϕ between u and v is

Dϕ(u, v) = ϕ(u) − ϕ(v) − ⟨∇ϕ(v), u − v⟩

It is the difference between ϕ(u) and its linearization around v.

[Figure: ϕ(u) vs. the linearization ϕ(v) + ⟨∇ϕ(v), u − v⟩; the vertical gap at u is Dϕ(u, v).]

Examples: ϕ(x) = ½ ‖x‖₂² (squared Euclidean), ϕ(x) = xᵀ log(x) (Kullback-Leibler)

Bregman projections

Euclidean projection

PC(x) = argmin_{y∈C} ‖y − x‖₂²

Bregman projection onto C ⊆ dom(ϕ)

PC^ϕ(x) = argmin_{y∈C} Dϕ(y, x)

Recovers Euclidean projections as a special case

Example: ϕ(x) = xᵀ log(x)

PC^ϕ(x) = argmin_{y∈C} KL(y, x),   x ∈ R^d_+, C ⊆ R^d_+

Mirror descent

Projected gradient descent

xt+1 = PC( xt − ηt ∇f(xt) ) = PC( argmin_{x∈R^d} f(xt) + ∇f(xt)ᵀ(x − xt) + (1/(2ηt)) ‖x − xt‖₂² )
     = argmin_{x∈C} f(xt) + ∇f(xt)ᵀ(x − xt) + (1/(2ηt)) ‖x − xt‖₂²

Mirror descent

xt+1 = PC^ϕ( argmin_{x∈R^d} f(xt) + ∇f(xt)ᵀ(x − xt) + (1/ηt) Dϕ(x, xt) )
     = argmin_{x∈C} f(xt) + ∇f(xt)ᵀ(x − xt) + (1/ηt) Dϕ(x, xt)
     ≠ PC^ϕ( xt − ηt ∇f(xt) )   (in general)

Convergence rate is O(Lf,∗ / √t), where ‖∇f(x)‖∗ ≤ Lf,∗ for all x, when using ηt = O(1 / (Lf,∗ √t))

Example: optimization over the simplex

If ϕ(x) = xᵀ log(x) and C = △^d, then we have a closed form

xt+1 = xt exp(−ηt ∇f(xt)) / Σ_{j=1}^d xt_j exp(−ηt ∇j f(xt))

(the operations in the numerator are element-wise)

Often called exponentiated gradient descent or entropic descent

The KL case, for which ‖·‖ = ‖·‖₁ and ‖·‖∗ = ‖·‖∞, enjoys a better convergence rate than the Euclidean case on the simplex

Indeed, using ‖g‖∞ ≤ ‖g‖₂ ≤ √d ‖g‖∞, we get

1/√d ≤ Lf,∞ / Lf,2 ≤ 1
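A minimal sketch (assuming NumPy) of this exponentiated gradient update with a decaying step size; the objective in the usage example is arbitrary:

```python
import numpy as np

def exponentiated_gradient(grad, d, n_iters=500, eta0=1.0):
    """Mirror descent on the simplex with the KL geometry (exponentiated gradient)."""
    x = np.full(d, 1.0 / d)                  # start from the uniform distribution
    for t in range(n_iters):
        eta = eta0 / np.sqrt(t + 1)          # decaying step size
        x = x * np.exp(-eta * grad(x))       # element-wise multiplicative update
        x /= x.sum()                         # renormalize onto the simplex
    return x

# Example: min_{x in simplex} 0.5*||x - c||^2 for a point c outside the simplex
c = np.array([0.8, 0.6, -0.2])
grad = lambda x: x - c
print(exponentiated_gradient(grad, d=3))     # ~ [0.6, 0.4, 0], the projection of c
```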

Alternative view

On iteration t

x̂t = ∇ϕ(xt)   (map primal point to dual)
ŷt+1 = x̂t − ηt ∇f(xt)   (take gradient step in the dual)
yt+1 = ∇ϕ∗(ŷt+1)   (map new dual point back to primal)
xt+1 = PC^ϕ(yt+1)   (project onto feasible set)

∇ϕ and ∇ϕ∗ are called mirror maps

Under technical assumptions on ϕ called “Legendre-type” (ϕ strictly convex, ‖∇ϕ‖ → ∞ on the boundary of dom(ϕ)), we have

∇ϕ∗ = (∇ϕ)⁻¹

Outline

1 Convergence rates

2 Coordinate descent

3 Newton’s method

4 Frank-Wolfe

5 Mirror-Descent

6 Conclusion

Summary

Convergence rates: important to familiarize yourself with their classification

Coordinate descent: ideal when regularizers or constraints are decomposable

Newton’s method: the Newton-CG algorithm (Hessian-free) is widely used

Frank-Wolfe: projection-free constrained optimization

Mirror descent: generalization of projected gradient descent to non-Euclidean geometries

Lab work

Recall that the dual of multiclass SVMs consists in maximizing

D(β) = −Σ_{i=1}^n [Ω(βi) − Ω(yi)] − (1/(2λ)) ‖Xᵀ(Y − β)‖₂²   s.t. βi ∈ △^k, i ∈ {1, ..., n}

where X ∈ R^(n×d) (features), Y ∈ R^(n×k) (one-hot labels), Ω(βi) = −⟨βi, 1 − yi⟩, λ > 0 (regularization parameter)

The primal-dual link is W⋆ = (1/λ) Xᵀ(Y − β⋆) ∈ R^(d×k)

The gradient ∇D(β) ∈ R^(n×k) has rows as follows:

∇i D(β) = −∇Ω(βi) + xiᵀ ( (1/λ) Xᵀ(Y − β) ) ∈ R^k   (the term in parentheses is W)

where ∇Ω(βi) = yi − 1

Lab work

We want to minimize f(β) = −D(β), where ∇f(β) = −∇D(β)

Implement Frank-Wolfe for this problem.

Initialize βi^0 ∈ △^k, e.g., βi^0 = (1/k, ..., 1/k), for i ∈ {1, ..., n}

For t ∈ {0, 1, 2, ...}

G = ∇f(βt) ∈ R^(n×k)

si = argmin_{si∈△^k} giᵀ si,   i ∈ {1, ..., n}

βt+1 = (1 − γt) βt + γt S,   γt = 2 / (2 + t)

Implement mirror descent for this problem using the KL geometry.

βi^{t+1} = βi^t exp(−ηt ∇i f(βt)) / Σ_{j=1}^k β_{i,j}^t exp(−ηt ∇_{i,j} f(βt)),   i ∈ {1, ..., n}

using ηt = η/√t for some η ∈ (0, 1]
