Machine Learning 8: Convex Optimization for Machine Learning
Master 2 Computer Science
Aurélien Garivier, 2018-2019

Table of contents
1. Convex functions in Rd
2. Gradient Descent
3. Smoothness
4. Strong convexity
5. Lower bounds for Lipschitz convex optimization
6. What more?
7. Stochastic Gradient Descent
Convex functions in Rd

Convex functions and subgradients
Convex function Let X ⊂ Rd be a convex set. The function f : X → R is convex if
∀x, y ∈ X , ∀λ ∈ [0, 1], f((1 − λ)x + λy) ≤ (1 − λ)f(x) + λf(y).
Subgradients A vector g ∈ Rd is a subgradient of f at x ∈ X if for any y ∈ X ,
f(y) ≥ f(x) + ⟨g, y − x⟩.
The set of subgradients of f at x is denoted ∂f (x).
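For example, for f(x) = |x| the subdifferential is {sign(x)} when x ≠ 0 and the whole interval [−1, 1] at x = 0. A small Python sketch (with illustrative helper names, not part of the course) checking the subgradient inequality:

```python
def subgradients_abs(x):
    """A few subgradients of f(x) = |x| at x (the full set at 0 is [-1, 1])."""
    if x > 0:
        return [1.0]
    if x < 0:
        return [-1.0]
    return [-1.0, -0.5, 0.0, 0.5, 1.0]

# Subgradient inequality: f(y) >= f(x) + g * (y - x) for every g in the subdifferential.
for x in [-2.0, 0.0, 3.0]:
    for g in subgradients_abs(x):
        for y in [-5.0, -0.1, 0.0, 4.0]:
            assert abs(y) >= abs(x) + g * (y - x) - 1e-12
```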
Proposition
• If ∂f(x) ≠ ∅ for all x ∈ X , then f is convex.
• If f is convex, then ∀x ∈ X̊, ∂f(x) ≠ ∅.
• If f is convex and differentiable at x, then ∂f(x) = {∇f(x)}.

Convex functions and optimization
Proposition Let f be convex. Then
• x is a local minimum of f iff 0 ∈ ∂f(x),
• and in that case, x is a global minimum of f;
• if X is closed and if f is differentiable on X , then
x = arg min_{z∈X} f(z) iff ∀y ∈ X , ⟨∇f(x), y − x⟩ ≥ 0.
Black-box optimization model
The set X is known; f : X → R is unknown but accessible through:
• a zeroth-order oracle: given x ∈ X , yields f(x),
• and possibly a first-order oracle: given x ∈ X , yields g ∈ ∂f(x).
Gradient Descent

Gradient Descent algorithms

A memoryless algorithm for first-order black-box optimization:

Algorithm: Gradient Descent
Input: convex function f, step sizes γt, initial point x0
for t = 0 ... T − 1 do
    compute gt ∈ ∂f(xt)
    xt+1 ← xt − γt gt
return xT or x̄T = (x0 + ··· + xT−1)/T

Questions:
• Does xT → x∗ = arg min f as T → ∞? Does f(xT) → f(x∗) = min f?
• Under which conditions, and at what speed?
• Does it work in high dimension?
• Do some properties of f help?
• What about the average (x0 + ··· + xT−1)/T?
• Can other algorithms do better?

Monotonicity of gradient
Property Let f be a convex function on X , and let x, y ∈ X . For every
gx ∈ ∂f (x) and every gy ∈ ∂f (y),
⟨gx − gy, x − y⟩ ≥ 0.
In fact, a differentiable mapping f is convex iff
∀x, y ∈ X , ⟨∇f(x) − ∇f(y), x − y⟩ ≥ 0.
In particular, ⟨gx, x − x∗⟩ ≥ 0 ⟹ the negative gradient does not point in the wrong direction. Under some assumptions (to come), this inequality can be strengthened, making gradient descent more relevant.
Convergence of GD for Convex-Lipschitz functions
Lipschitz Assumption For every x ∈ X and every g ∈ ∂f(x), ‖g‖ ≤ L.
This implies f(y) − f(x) ≤ ⟨g, y − x⟩ ≤ L‖y − x‖.

Theorem
Under the Lipschitz assumption, GD with γt ≡ γ = R/(L√T) satisfies
f( (1/T) Σ_{i=0}^{T−1} xi ) − f(x∗) ≤ RL/√T .
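This guarantee is easy to check numerically; below is a minimal Python sketch (the test function f(x) = |x| on X = [−R, R] with R = L = 1, the projection helper, and all names are illustrative, not from the course):

```python
import math

def project_interval(x, R):
    return max(-R, min(R, x))  # projection onto X = [-R, R]

def subgradient_descent(subgrad, x0, R, L, T):
    """Projected subgradient descent with the constant step R / (L * sqrt(T))."""
    gamma = R / (L * math.sqrt(T))
    x, avg = x0, 0.0
    for t in range(T):
        avg += x / T                       # running average of x_0, ..., x_{T-1}
        x = project_interval(x - gamma * subgrad(x), R)
    return avg

# f(x) = |x| is 1-Lipschitz with minimum f(x*) = 0 at x* = 0.
x_bar = subgradient_descent(lambda x: 1.0 if x >= 0 else -1.0, 1.0, 1.0, 1.0, 10000)
# The theorem guarantees f(x_bar) - f(x*) <= R * L / sqrt(T) = 0.01 here.
```

Averaging matters: the last iterate keeps oscillating at the scale of the step size, while the average settles well within the RL/√T bound.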
• Of course, one can return arg min_{1≤i≤T} f(xi) instead (not always better).
• It requires T ≈ R²L²/ε² steps to ensure precision ε.
• Online version γt = R/(L√t): bound in 3RL/√T (see Hazan).
• Works just as well for constrained optimization with xt+1 ← ΠX(xt − γt∇f(xt)), thanks to the Pythagorean projection theorem.

Intuition: what can happen?
The step must be large enough to reach the region of the minimum, but not too large, to avoid skipping over it. Let X = B(0, R) ⊂ R² and
f(x¹, x²) = (R/(√2 γT)) |x¹| + L|x²|,
which is L-Lipschitz for γ ≥ R/(√2 LT). Then, if x0 = (R/√2, 3Lγ/4) ∈ X ,
• x¹t = R/√2 − Rt/(√2 T), and x̄¹T ≈ R/(2√2);
• x²_{2s+1} = 3Lγ/4 − γL = −Lγ/4, x²_{2s} = 3Lγ/4, and x̄²T ≈ Lγ/4.
Hence
f(x̄¹T, x̄²T) ≈ (R/(√2 γT)) · R/(2√2) + L · Lγ/4 = (1/4)(R²/(Tγ) + L²γ),
which is minimal for γ = R/(L√T), where f(x̄¹T, x̄²T) ≈ RL/(2√T).

Proof
Cosinus theorem (generalized Pythagorean theorem, Al-Kashi's theorem):
2⟨a, b⟩ = ‖a‖² + ‖b‖² − ‖a − b‖².
Hence for every 0 ≤ t < T :
f(xt) − f(x∗) ≤ ⟨gt, xt − x∗⟩
= (1/γ) ⟨xt − xt+1, xt − x∗⟩
= (1/2γ) (‖xt − x∗‖² + ‖xt − xt+1‖² − ‖xt+1 − x∗‖²)
= (1/2γ) (‖xt − x∗‖² − ‖xt+1 − x∗‖²) + (γ/2) ‖gt‖²,
and hence
Σ_{t=0}^{T−1} (f(xt) − f(x∗)) ≤ (1/2γ)(‖x0 − x∗‖² − ‖xT − x∗‖²) + L²γT/2 ≤ RL√T/2 + RL√T/2 = RL√T,
and by convexity f( (1/T) Σ_{i=0}^{T−1} xi ) ≤ (1/T) Σ_{t=0}^{T−1} f(xt).

Smoothness
Definition A continuously differentiable function f is β-smooth if the gradient ∇f is β-Lipschitz, that is if for all x, y ∈ X ,
‖∇f(y) − ∇f(x)‖ ≤ β‖y − x‖.
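For instance, f(x) = x² has gradient f′(x) = 2x, which is Lipschitz with constant β = 2; a quick numeric check of the definition (an illustrative sketch, not from the course):

```python
# f(x) = x**2 is beta-smooth with beta = 2: its gradient 2x is 2-Lipschitz.
beta = 2.0
grad = lambda x: 2.0 * x

for x in [-3.0, 0.0, 1.5]:
    for y in [-2.0, 0.5, 4.0]:
        assert abs(grad(y) - grad(x)) <= beta * abs(y - x) + 1e-12
```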
Property If f is β-smooth, then for any x, y ∈ X :
|f(y) − f(x) − ⟨∇f(x), y − x⟩| ≤ (β/2) ‖y − x‖².
• f is convex and β-smooth iff x ↦ (β/2)‖x‖² − f(x) is convex, iff ∀x, y ∈ X ,
f(x) + ⟨∇f(x), y − x⟩ ≤ f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (β/2)‖y − x‖².
• If f is twice differentiable, then f is β-smooth iff all the eigenvalues of the Hessian of f are at most equal to β.

Convergence of GD for smooth convex functions
Theorem
Let f be a convex and β-smooth function on Rd. Then GD with γt ≡ γ = 1/β satisfies:
f(xT) − f(x∗) ≤ 2β‖x0 − x∗‖² / (T + 4).
Thus it requires T ≈ 2β‖x0 − x∗‖²/ε steps to ensure precision ε.
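As a sanity check of this rate, a small Python sketch (using f(x) = x²/2 with the deliberately conservative smoothness constant β = 10, so convergence is not immediate; the setup is illustrative):

```python
# GD with step gamma = 1/beta on f(x) = x**2 / 2. Its gradient f'(x) = x
# is 1-Lipschitz, so beta = 10 is a valid (conservative) smoothness constant.
beta = 10.0
f = lambda x: x * x / 2
grad = lambda x: x

x0, T = 5.0, 50
x = x0
for t in range(T):
    x -= grad(x) / beta       # gamma = 1/beta, so x_{t+1} = 0.9 * x_t

# Theorem: f(x_T) - f(x*) <= 2 * beta * ||x0 - x*||^2 / (T + 4), with x* = 0.
assert f(x) <= 2 * beta * x0 ** 2 / (T + 4)
```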
Majoration/minoration
Taking γ = 1/β is a "safe" choice ensuring progress:
x⁺ := x − (1/β)∇f(x) = arg min_y { f(x) + ⟨∇f(x), y − x⟩ + (β/2)‖y − x‖² }
is such that f(x⁺) ≤ f(x) − (1/2β)‖∇f(x)‖². Indeed,
f(x⁺) − f(x) ≤ ⟨∇f(x), x⁺ − x⟩ + (β/2)‖x⁺ − x‖²
= −(1/β)‖∇f(x)‖² + (1/2β)‖∇f(x)‖² = −(1/2β)‖∇f(x)‖².
⟹ Descent method. Moreover, x⁺ is "on the same side of x∗ as x" (no overshooting): since
‖∇f(x)‖ = ‖∇f(x) − ∇f(x∗)‖ ≤ β‖x − x∗‖,
we have ⟨∇f(x), x − x∗⟩ ≤ ‖∇f(x)‖ ‖x − x∗‖ ≤ β‖x − x∗‖², and thus
⟨x⁺ − x∗, x − x∗⟩ = ‖x − x∗‖² − (1/β)⟨∇f(x), x − x∗⟩ ≥ 0.

Lemma: the gradient shoots in the right direction
Lemma
For every x ∈ X , ⟨∇f(x), x − x∗⟩ ≥ (1/β)‖∇f(x)‖².
We already know that f(x∗) ≤ f(x − (1/β)∇f(x)) ≤ f(x) − (1/2β)‖∇f(x)‖². In addition, taking z = x∗ + (1/β)∇f(x):
f(x∗) = f(z) + (f(x∗) − f(z))
≥ f(x) + ⟨∇f(x), z − x⟩ − (β/2)‖z − x∗‖²   (convexity at x; β-smoothness at x∗, where ∇f(x∗) = 0)
= f(x) + ⟨∇f(x), x∗ − x⟩ + ⟨∇f(x), z − x∗⟩ − (1/2β)‖∇f(x)‖²
= f(x) + ⟨∇f(x), x∗ − x⟩ + (1/2β)‖∇f(x)‖².
Thus f(x) + ⟨∇f(x), x∗ − x⟩ + (1/2β)‖∇f(x)‖² ≤ f(x∗) ≤ f(x) − (1/2β)‖∇f(x)‖², which gives the lemma.
In fact, this lemma is a corollary of the co-coercivity of the gradient: ∀x, y ∈ X ,
⟨∇f(x) − ∇f(y), x − y⟩ ≥ (1/β)‖∇f(x) − ∇f(y)‖²,
which holds iff the convex, differentiable function f is β-smooth.

Proof step 1: the iterates get closer to x∗
Applying the preceding lemma to x = xt , we get
‖xt+1 − x∗‖² = ‖xt − (1/β)∇f(xt) − x∗‖²
= ‖xt − x∗‖² − (2/β)⟨∇f(xt), xt − x∗⟩ + (1/β²)‖∇f(xt)‖²
≤ ‖xt − x∗‖² − (1/β²)‖∇f(xt)‖²
≤ ‖xt − x∗‖².
→ it’s good, but it can be slow...
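The monotonicity above is easy to observe numerically; a minimal sketch (f(x) = x² with the conservative constant β = 4, so each step halves the iterate; the setup is illustrative):

```python
# With gamma = 1/beta, the distance to x* never increases (step 1).
beta = 4.0                    # f(x) = x**2 has f'(x) = 2x, so any beta >= 2 is valid
grad = lambda x: 2.0 * x

x, dists = 4.0, []
for t in range(30):
    dists.append(abs(x))      # distance to x* = 0
    x -= grad(x) / beta       # here x_{t+1} = x_t / 2

assert all(d2 <= d1 + 1e-12 for d1, d2 in zip(dists, dists[1:]))
```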
Proof step 2: the values of the iterates converge
We have seen that f(xt+1) − f(xt) ≤ −(1/2β)‖∇f(xt)‖². Hence, if δt = f(xt) − f(x∗), then
δ0 = f(x0) − f(x∗) ≤ (β/2)‖x0 − x∗‖²
and δt+1 ≤ δt − (1/2β)‖∇f(xt)‖². But
δt ≤ ⟨∇f(xt), xt − x∗⟩ ≤ ‖∇f(xt)‖ ‖xt − x∗‖ ≤ ‖∇f(xt)‖ ‖x0 − x∗‖
by step 1. Therefore
δt+1 ≤ δt − δt² / (2β‖x0 − x∗‖²).
Thinking of the corresponding ODE, one sets ut = 1/δt , which yields
ut+1 ≥ ut / (1 − 1/(2β‖x0 − x∗‖² ut)) ≥ ut (1 + 1/(2β‖x0 − x∗‖² ut)) = ut + 1/(2β‖x0 − x∗‖²).
Hence
uT ≥ u0 + T/(2β‖x0 − x∗‖²) ≥ 2/(β‖x0 − x∗‖²) + T/(2β‖x0 − x∗‖²) = (T + 4)/(2β‖x0 − x∗‖²),
that is, f(xT) − f(x∗) = δT ≤ 2β‖x0 − x∗‖²/(T + 4).
Strong convexity
Definition
f : X → R is α-strongly convex if for all x, y ∈ X and any gx ∈ ∂f(x),
f(y) ≥ f(x) + ⟨gx, y − x⟩ + (α/2)‖y − x‖².
• f is α-strongly convex iff x ↦ f(x) − (α/2)‖x‖² is convex.
• α measures the curvature of f.
• If f is twice differentiable, then f is α-strongly convex iff all the eigenvalues of the Hessian of f are at least equal to α.
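For example, f(x) = x² is α-strongly convex with α = 2, since its second derivative is constantly 2; a quick numeric check of the definition (an illustrative sketch):

```python
# f(x) = x**2 satisfies f(y) >= f(x) + f'(x)(y - x) + (alpha/2)(y - x)**2
# with alpha = 2 (here it is in fact an equality).
alpha = 2.0
f = lambda x: x * x
grad = lambda x: 2.0 * x

for x in [-3.0, 0.0, 1.5]:
    for y in [-2.0, 0.5, 4.0]:
        assert f(y) >= f(x) + grad(x) * (y - x) + alpha / 2 * (y - x) ** 2 - 1e-12
```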
Faster rates for Lipschitz functions through strong convexity
Theorem
Let f be α-strongly convex and L-Lipschitz. Then GD with γt = 1/(α(t + 1)) satisfies:
f( (1/T) Σ_{i=0}^{T−1} xi ) − f(x∗) ≤ L² log(T) / (αT).
Note: returning another weighted average with γt = 2/(α(t + 1)) yields:
f( Σ_{i=0}^{T−1} (2(i + 1))/(T(T + 1)) xi ) − f(x∗) ≤ 2L² / (α(T + 1)).
Thus it requires T ≈ 2L²/(αε) steps to ensure precision ε.
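A minimal numeric sketch of this setting (the test function f(x) = |x| + x² is α-strongly convex with α = 2 and Lipschitz on any bounded set; all names are illustrative, not from the course):

```python
# Subgradient descent with gamma_t = 1/(alpha * (t+1)), returning the
# plain average of the iterates, as in the theorem; x* = 0 here.
alpha = 2.0
f = lambda x: abs(x) + x * x
subgrad = lambda x: (1.0 if x >= 0 else -1.0) + 2.0 * x

x, avg, T = 3.0, 0.0, 2000
for t in range(T):
    avg += x / T
    x -= subgrad(x) / (alpha * (t + 1))

# f(avg) - f(x*) should be of order L**2 * log(T) / (alpha * T).
assert f(avg) < 0.05
```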
Proof
Cosinus theorem (generalized Pythagorean theorem, Al-Kashi's theorem):
2⟨a, b⟩ = ‖a‖² + ‖b‖² − ‖a − b‖².
Hence for every 0 ≤ t < T , by α-strong convexity:
f(xt) − f(x∗) ≤ ⟨gt, xt − x∗⟩ − (α/2)‖xt − x∗‖²
= (1/γt) ⟨xt − xt+1, xt − x∗⟩ − (α/2)‖xt − x∗‖²
= (1/2γt) (‖xt − x∗‖² + ‖xt − xt+1‖² − ‖xt+1 − x∗‖²) − (α/2)‖xt − x∗‖²
= (tα/2)‖xt − x∗‖² − ((t + 1)α/2)‖xt+1 − x∗‖² + (1/(2(t + 1)α))‖gt‖²
since γt = 1/(α(t + 1)), and hence, summing the telescoping terms,
Σ_{t=0}^{T−1} (f(xt) − f(x∗)) ≤ 0 − (Tα/2)‖xT − x∗‖² + (L²/2α) Σ_{t=0}^{T−1} 1/(t + 1) ≤ (L²/2α)(1 + log T),
and by convexity f( (1/T) Σ_{i=0}^{T−1} xi ) ≤ (1/T) Σ_{t=0}^{T−1} f(xt).
Smooth and Strongly Convex Functions

Smoothness and strong convexity: sandwiching f by squares
Let x ∈ X .
For every y ∈ X ,
β-smoothness implies: for f̄(y) := f(x) + ⟨∇f(x), y − x⟩ + (β/2)‖y − x‖² and x⁺ = x − (1/β)∇f(x),
f(y) ≤ f̄(y) = f̄(x⁺) + (β/2)‖y − x⁺‖².
Moreover, α-strong convexity implies: for f̲(y) := f(x) + ⟨∇f(x), y − x⟩ + (α/2)‖y − x‖² and x⁻ = x − (1/α)∇f(x),
f(y) ≥ f̲(y) = f̲(x⁻) + (α/2)‖y − x⁻‖².

Convergence of GD for smooth and strongly convex functions
Theorem
Let f be a β-smooth and α-strongly convex function. Then GD with the choice γt ≡ γ = 1/β satisfies
f(xT) − f(x∗) ≤ e^{−T/κ} (f(x0) − f(x∗)),
where κ = β/α ≥ 1 is the condition number of f.
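To illustrate the linear rate, a small Python sketch on a two-dimensional quadratic (α = 1, β = 10, so κ = 10; the example is illustrative, not from the course):

```python
import math

# GD with gamma = 1/beta on f(x, y) = (x**2 + 10*y**2) / 2, which is
# alpha = 1 strongly convex and beta = 10 smooth; x* = (0, 0).
alpha, beta = 1.0, 10.0
f = lambda x, y: (x * x + 10 * y * y) / 2

x, y = 4.0, 2.0
f0 = f(x, y)
T = 50
for t in range(T):
    x, y = x - x / beta, y - 10 * y / beta   # gradient is (x, 10y)

kappa = beta / alpha
assert f(x, y) <= math.exp(-T / kappa) * f0
```

Note how the well-conditioned coordinate y is solved in one step, while the small-curvature coordinate x shrinks only by a factor 1 − 1/κ per iteration: the condition number governs the rate.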