
Machine Learning 8: Convex Optimization for Machine Learning
Master 2 Computer Science — Aurélien Garivier, 2018-2019

Table of contents
1. Convex functions in ℝ^d
2. Gradient Descent
3. Smoothness
4. Strong convexity
5. Lower bounds (lower bound for Lipschitz convex optimization)
6. What more?
7. Stochastic Gradient Descent

Convex functions in ℝ^d

Convex functions and subgradients

Convex function. Let X ⊂ ℝ^d be a convex set. The function f : X → ℝ is convex if
  ∀x, y ∈ X, ∀λ ∈ [0, 1],  f((1 − λ)x + λy) ≤ (1 − λ)f(x) + λf(y).

Subgradients. A vector g ∈ ℝ^d is a subgradient of f at x ∈ X if for any y ∈ X,
  f(y) ≥ f(x) + ⟨g, y − x⟩.
The set of subgradients of f at x is denoted ∂f(x).

Proposition.
• If ∂f(x) ≠ ∅ for all x ∈ X, then f is convex.
• If f is convex, then ∂f(x) ≠ ∅ for all x in the interior of X.
• If f is convex and differentiable at x, then ∂f(x) = {∇f(x)}.

Convex functions and optimization

Proposition. Let f be convex. Then
• x is a local minimum of f iff 0 ∈ ∂f(x),
• and in that case, x is a global minimum of f;
• if X is closed and f is differentiable on X, then x = arg min_{x∈X} f(x) iff ∀y ∈ X, ⟨∇f(x), y − x⟩ ≥ 0.

Black-box optimization model. The set X is known; f : X → ℝ is unknown but accessible through:
• a zeroth-order oracle: given x ∈ X, yields f(x),
• and possibly a first-order oracle: given x ∈ X, yields g ∈ ∂f(x).

Gradient Descent

Gradient Descent algorithms

A memoryless algorithm for first-order black-box optimization:

Algorithm: Gradient Descent
  Input: convex function f, step sizes γ_t, initial point x_0
  for t = 0, …, T − 1 do
    compute g_t ∈ ∂f(x_t)
    x_{t+1} ← x_t − γ_t g_t
  return x_T or x̄_T = (x_0 + ⋯ + x_{T−1})/T

Questions:
• does x_T → x* = arg min f as T → ∞?
• does f(x_T) → f(x*) = min f as T → ∞?
• under which conditions?
• what about the average (x_0 + ⋯ + x_{T−1})/T?
• at what speed?
• does it work in high dimension?
• do some properties of f help?
• can other algorithms do better?

Monotonicity of gradient

Property. Let f be a convex function on X, and let x, y ∈ X. For every g_x ∈ ∂f(x) and every g_y ∈ ∂f(y),
  ⟨g_x − g_y, x − y⟩ ≥ 0.
In fact, a differentiable mapping f is convex iff ∀x, y ∈ X, ⟨∇f(x) − ∇f(y), x − y⟩ ≥ 0.
In particular, ⟨g_x, x − x*⟩ ≥ 0: the negative gradient does not point in the wrong direction. Under some assumptions (to come), this inequality can be strengthened, making gradient descent more relevant.

Convergence of GD for convex-Lipschitz functions

Lipschitz assumption. For every x ∈ X and every g ∈ ∂f(x), ‖g‖ ≤ L. This implies that f is L-Lipschitz: taking g ∈ ∂f(y), f(y) − f(x) ≤ ⟨g, y − x⟩ ≤ L‖y − x‖.

Theorem. Under the Lipschitz assumption, GD with γ_t ≡ γ = R/(L√T), where R is a bound on ‖x_0 − x*‖, satisfies
  f( (1/T) Σ_{i=0}^{T−1} x_i ) − f(x*) ≤ RL/√T.

• Of course, one can return arg min_{1≤i≤T} f(x_i) instead (not always better).
• It requires T ≈ R²L²/ε² steps to ensure precision ε.
• Online version γ_t = R/(L√t): bound 3RL/(2√T) (see Hazan).
• Works just as well for constrained optimization with x_{t+1} ← Π_X(x_t − γ_t ∇f(x_t)), thanks to the Pythagorean projection theorem.

Intuition: what can happen?

The step must be large enough to reach the region of the minimum, but not too large, to avoid skipping over it. Let X = B(0, R) ⊂ ℝ² and
  f(x¹, x²) = (R/(√2 γT)) |x¹| + L |x²|,
which is L-Lipschitz for γ ≥ R/(√2 LT). Then, if x_0 = (R/√2, 3Lγ/4) ∈ X,
• x¹_t = R/√2 − Rt/(√2 T) and x̄¹_T ≈ R/(2√2);
• x²_{2s+1} = 3Lγ/4 − γL = −Lγ/4, x²_{2s} = 3Lγ/4, and x̄²_T ≈ Lγ/4.
Hence
  f(x̄¹_T, x̄²_T) ≈ (R/(√2 γT)) · R/(2√2) + L · Lγ/4 = (1/4)(R²/(Tγ) + L²γ),
which is minimal for γ = R/(L√T), where f(x̄¹_T, x̄²_T) ≈ RL/(2√T).
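As a sanity check (not part of the original slides), here is a minimal Python sketch of subgradient descent with iterate averaging on the two-dimensional example above, using the constant step γ = R/(L√T); the values R = L = 1 and T = 10⁴ are arbitrary illustrative choices.

```python
import numpy as np

# Subgradient descent with averaging on the example
#   f(x1, x2) = R / (sqrt(2) * gamma * T) * |x1| + L * |x2|,
# started at x0 = (R / sqrt(2), 3 * L * gamma / 4).
# R = L = 1 and T = 10_000 are illustrative choices, not values from the slides.
R, L, T = 1.0, 1.0, 10_000
gamma = R / (L * np.sqrt(T))          # the step size of the theorem

def f(x):
    return R / (np.sqrt(2) * gamma * T) * abs(x[0]) + L * abs(x[1])

def subgradient(x):
    return np.array([R / (np.sqrt(2) * gamma * T) * np.sign(x[0]),
                     L * np.sign(x[1])])

x = np.array([R / np.sqrt(2), 3 * L * gamma / 4])
avg = np.zeros(2)
for t in range(T):
    avg += x / T                      # running average of x_0, ..., x_{T-1}
    x = x - gamma * subgradient(x)    # x_{t+1} = x_t - gamma * g_t

print(f(avg))                         # ~ R*L / (2*sqrt(T)) = 0.005 here
print(R * L / np.sqrt(T))             # the theorem's guarantee: 0.01 here
```

As predicted, the averaged iterate reaches roughly RL/(2√T), about half of the theorem's guarantee, suggesting that the RL/√T rate is tight up to a constant for this step-size choice.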
Proof

Cosine theorem (= generalized Pythagorean theorem = Al-Kashi's theorem):
  2⟨a, b⟩ = ‖a‖² + ‖b‖² − ‖a − b‖².
Hence for every 0 ≤ t < T:
  f(x_t) − f(x*) ≤ ⟨g_t, x_t − x*⟩
    = (1/γ) ⟨x_t − x_{t+1}, x_t − x*⟩
    = (1/(2γ)) (‖x_t − x*‖² + ‖x_t − x_{t+1}‖² − ‖x_{t+1} − x*‖²)
    = (1/(2γ)) (‖x_t − x*‖² − ‖x_{t+1} − x*‖²) + (γ/2) ‖g_t‖²,
and hence
  Σ_{t=0}^{T−1} (f(x_t) − f(x*)) ≤ (1/(2γ)) (‖x_0 − x*‖² − ‖x_T − x*‖²) + L²γT/2 ≤ L√T R²/(2R) + L²RT/(2L√T) = RL√T,
and by convexity f( (1/T) Σ_{i=0}^{T−1} x_i ) ≤ (1/T) Σ_{t=0}^{T−1} f(x_t), which gives the bound RL/√T after dividing by T.

Smoothness

Definition. A continuously differentiable function f is β-smooth if the gradient ∇f is β-Lipschitz, that is, if for all x, y ∈ X,
  ‖∇f(y) − ∇f(x)‖ ≤ β‖y − x‖.

Property. If f is β-smooth, then for any x, y ∈ X:
  |f(y) − f(x) − ⟨∇f(x), y − x⟩| ≤ (β/2)‖y − x‖².

• f is convex and β-smooth iff x ↦ (β/2)‖x‖² − f(x) is convex, iff ∀x, y ∈ X,
  f(x) + ⟨∇f(x), y − x⟩ ≤ f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (β/2)‖y − x‖².
• If f is twice differentiable, then f is β-smooth iff all the eigenvalues of the Hessian of f are at most equal to β.

Convergence of GD for smooth convex functions

Theorem. Let f be a convex and β-smooth function on ℝ^d. Then GD with γ_t ≡ γ = 1/β satisfies:
  f(x_T) − f(x*) ≤ 2β‖x_0 − x*‖² / (T + 4).
Thus it requires T ≈ 2βR²/ε steps to ensure precision ε.

Majoration/minoration

Taking γ = 1/β is a "safe" choice ensuring progress:
  x⁺ := x − (1/β)∇f(x) = arg min_y { f(x) + ⟨∇f(x), y − x⟩ + (β/2)‖y − x‖² }
is such that f(x⁺) ≤ f(x) − (1/(2β))‖∇f(x)‖². Indeed,
  f(x⁺) − f(x) ≤ ⟨∇f(x), x⁺ − x⟩ + (β/2)‖x⁺ − x‖²
    = −(1/β)‖∇f(x)‖² + (1/(2β))‖∇f(x)‖² = −(1/(2β))‖∇f(x)‖².
⟹ Descent method.
Moreover, x⁺ is "on the same side of x* as x" (no overshooting): since ‖∇f(x)‖ = ‖∇f(x) − ∇f(x*)‖ ≤ β‖x − x*‖,
  ⟨∇f(x), x − x*⟩ ≤ ‖∇f(x)‖ ‖x − x*‖ ≤ β‖x − x*‖²
and thus
  ⟨x⁺ − x*, x − x*⟩ = ‖x − x*‖² − (1/β)⟨∇f(x), x − x*⟩ ≥ 0.

Lemma: the gradient shoots in the right direction

Lemma. For every x ∈ X,
  ⟨∇f(x), x − x*⟩ ≥ (1/β)‖∇f(x)‖².

We already know that f(x*) ≤ f(x − (1/β)∇f(x)) ≤ f(x) − (1/(2β))‖∇f(x)‖². In addition, taking z = x* + (1/β)∇f(x):
  f(x*) = f(z) + f(x*) − f(z)
    ≥ f(x) + ⟨∇f(x), z − x⟩ − (β/2)‖z − x*‖²
    = f(x) + ⟨∇f(x), x* − x⟩ + ⟨∇f(x), z − x*⟩ − (1/(2β))‖∇f(x)‖²
    = f(x) + ⟨∇f(x), x* − x⟩ + (1/(2β))‖∇f(x)‖².
Thus f(x) + ⟨∇f(x), x* − x⟩ + (1/(2β))‖∇f(x)‖² ≤ f(x*) ≤ f(x) − (1/(2β))‖∇f(x)‖².

In fact, this lemma is a corollary of the co-coercivity of the gradient: ∀x, y ∈ X,
  ⟨∇f(x) − ∇f(y), x − y⟩ ≥ (1/β)‖∇f(x) − ∇f(y)‖²,
which holds iff the convex, differentiable function f is β-smooth.

Proof step 1: the iterates get closer to x*

Applying the preceding lemma to x = x_t, we get
  ‖x_{t+1} − x*‖² = ‖x_t − (1/β)∇f(x_t) − x*‖²
    = ‖x_t − x*‖² − (2/β)⟨∇f(x_t), x_t − x*⟩ + (1/β²)‖∇f(x_t)‖²
    ≤ ‖x_t − x*‖² − (1/β²)‖∇f(x_t)‖²
    ≤ ‖x_t − x*‖².
→ this is good, but it can be slow...

Proof step 2: the values of the iterates converge

We have seen that f(x_{t+1}) − f(x_t) ≤ −(1/(2β))‖∇f(x_t)‖². Hence, if δ_t = f(x_t) − f(x*), then
  δ_0 = f(x_0) − f(x*) ≤ (β/2)‖x_0 − x*‖²
and δ_{t+1} ≤ δ_t − (1/(2β))‖∇f(x_t)‖². But
  δ_t ≤ ⟨∇f(x_t), x_t − x*⟩ ≤ ‖∇f(x_t)‖ ‖x_t − x*‖ ≤ ‖∇f(x_t)‖ ‖x_0 − x*‖   (by step 1).
Therefore, since δ_t is decreasing with t,
  δ_{t+1} ≤ δ_t − δ_t² / (2β‖x_0 − x*‖²).
Thinking of the corresponding ODE, one sets u_t = 1/δ_t, which yields:
  u_{t+1} ≥ u_t / (1 − 1/(2β‖x_0 − x*‖² u_t)) ≥ u_t (1 + 1/(2β‖x_0 − x*‖² u_t)) = u_t + 1/(2β‖x_0 − x*‖²).
Hence
  u_T ≥ u_0 + T/(2β‖x_0 − x*‖²) ≥ 2/(β‖x_0 − x*‖²) + T/(2β‖x_0 − x*‖²) = (T + 4)/(2β‖x_0 − x*‖²),
that is, δ_T = f(x_T) − f(x*) ≤ 2β‖x_0 − x*‖²/(T + 4).
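To see the 2β‖x_0 − x*‖²/(T + 4) guarantee in action, here is a small sketch (my illustration, not an example from the slides) of gradient descent with the safe step γ = 1/β on a least-squares objective f(x) = ½‖Ax − b‖², which is convex and β-smooth with β = λ_max(AᵀA); the random A, b and the horizon T = 500 are arbitrary.

```python
import numpy as np

# Gradient descent with step 1/beta on f(x) = 0.5 * ||A x - b||^2,
# a convex, beta-smooth function with beta = largest eigenvalue of A^T A.
# The data and the horizon are arbitrary illustrative choices.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
b = rng.standard_normal(50)

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)

beta = np.linalg.eigvalsh(A.T @ A).max()
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)   # a minimizer, for reference

x0 = np.zeros(20)
x, T = x0.copy(), 500
for _ in range(T):
    x = x - grad(x) / beta        # x_{t+1} = x_t - (1/beta) * grad f(x_t)

gap = f(x) - f(x_star)
bound = 2 * beta * np.linalg.norm(x0 - x_star) ** 2 / (T + 4)
print(gap, bound)                 # the gap stays below the theoretical bound
```

On such a quadratic the gap typically decays much faster than the worst-case 1/T rate, but it always remains below the bound of the theorem.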
Strong convexity

Definition. f : X → ℝ is α-strongly convex if for all x, y ∈ X and any g_x ∈ ∂f(x),
  f(y) ≥ f(x) + ⟨g_x, y − x⟩ + (α/2)‖y − x‖².

• f is α-strongly convex iff x ↦ f(x) − (α/2)‖x‖² is convex.
• α measures the curvature of f.
• If f is twice differentiable, then f is α-strongly convex iff all the eigenvalues of the Hessian of f are at least α.
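As a quick numerical illustration (my example, not one from the slides): for a quadratic f(x) = ½ xᵀQx with a symmetric positive-definite Q, the Hessian is Q everywhere, so f is α-strongly convex with α = λ_min(Q), and the defining inequality can be spot-checked at random points.

```python
import numpy as np

# Strong convexity of f(x) = 0.5 * x^T Q x: the Hessian is Q everywhere,
# so alpha = smallest eigenvalue of Q. Q is an arbitrary illustrative choice.
rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
Q = M.T @ M + np.eye(5)              # symmetric positive definite

alpha = np.linalg.eigvalsh(Q).min()

f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x

for _ in range(1000):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    lhs = f(y)
    rhs = f(x) + grad(x) @ (y - x) + 0.5 * alpha * np.linalg.norm(y - x) ** 2
    assert lhs >= rhs - 1e-9         # f(y) >= f(x) + <grad f(x), y - x> + (alpha/2)||y - x||^2
```

Replacing λ_min(Q) by λ_max(Q) in the same check illustrates the reverse inequality given by β-smoothness.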