Convergence Theorems for Descent

Robert M. Gower.

October 5, 2018

Abstract. Here you will find a growing collection of proofs of the convergence of gradient and stochastic gradient descent type methods on convex, strongly convex and/or smooth functions. Important disclaimer: these notes do not compare to a good book or well-prepared lecture notes. You should only read these notes if you have sat through my lecture on the subject and would like to see detailed notes based on my lecture as a reminder. Under any other circumstances, I highly recommend reading instead the first few chapters of the books [3] and [1].

1 Assumptions and Lemmas

1.1 Convexity

We say that f is convex if

$$f(tx + (1-t)y) \le t f(x) + (1-t) f(y), \qquad \forall x, y \in \mathbb{R}^d, \; t \in [0,1]. \tag{1}$$

If $f$ is differentiable and convex then every tangent plane to the graph of $f$ lower bounds the function values, that is

$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle, \qquad \forall x, y \in \mathbb{R}^d. \tag{2}$$

We can deduce (2) from (1): dividing (1) by $t$ and re-arranging gives
$$\frac{f(y + t(x-y)) - f(y)}{t} \le f(x) - f(y).$$
Now taking the limit $t \to 0$ gives

$$\langle \nabla f(y), x - y \rangle \le f(x) - f(y),$$
which is (2) with the roles of $x$ and $y$ interchanged.
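Inequality (2) is easy to test numerically. The following sketch (our own illustration, not from the notes) checks it at random points for the convex function $f(x) = \|x\|_2^2$, whose gradient is $2x$:

```python
import numpy as np

# Numerical sanity check of the first-order characterization of
# convexity (2) for f(x) = ||x||_2^2, whose gradient is 2x.
rng = np.random.default_rng(0)

def f(x):
    return float(x @ x)

def grad_f(x):
    return 2.0 * x

for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    # f(y) >= f(x) + <grad f(x), y - x>, inequality (2)
    assert f(y) >= f(x) + grad_f(x) @ (y - x) - 1e-12
```

For this particular $f$ the gap is exactly $\|y - x\|_2^2 \ge 0$, so the check never fails.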

If $f$ is twice differentiable, then taking the directional derivative of (2) in the direction $v$ at the point $x$ gives

$$0 \ge \langle \nabla f(x), v \rangle + \langle \nabla^2 f(x) v, y - x \rangle - \langle \nabla f(x), v \rangle = \langle \nabla^2 f(x) v, y - x \rangle, \qquad \forall x, y, v \in \mathbb{R}^d. \tag{3}$$

Setting $y = x - v$ then gives
$$0 \le \langle \nabla^2 f(x) v, v \rangle, \qquad \forall x, v \in \mathbb{R}^d. \tag{4}$$

The above is equivalent to saying that $\nabla^2 f(x) \succeq 0$, that is, the Hessian is positive semi-definite for every $x \in \mathbb{R}^d$. An analogous property to (2) holds even when the function is not differentiable. Indeed, for every convex function $f$, we say that $g \in \mathbb{R}^d$ is a subgradient of $f$ at $x$ if

$$f(y) \ge f(x) + \langle g, y - x \rangle, \qquad \forall y. \tag{5}$$

We refer to the set of subgradients as the subdifferential ∂f(x), that is

$$\partial f(x) \overset{\text{def}}{=} \{ g : f(y) \ge f(x) + \langle g, y - x \rangle, \; \forall y \}. \tag{6}$$
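A standard example is the absolute value $f(x) = |x|$, which is convex but not differentiable at $0$: there $\partial f(0) = [-1, 1]$. The sketch below (the helper `is_subgradient` is ours, not from the notes) checks the defining inequality (5) on a grid of points:

```python
import numpy as np

# For f(x) = |x| (convex, not differentiable at 0), any g in [-1, 1]
# is a subgradient at x = 0: |y| >= |0| + g * (y - 0) for all y.
def is_subgradient(g, x, ys):
    """Check the subgradient inequality (5) for f = abs at the point x."""
    return all(abs(y) >= abs(x) + g * (y - x) for y in ys)

ys = np.linspace(-5.0, 5.0, 101)
assert is_subgradient(0.5, 0.0, ys)      # 0.5 in [-1, 1]: valid at x = 0
assert not is_subgradient(1.5, 0.0, ys)  # 1.5 outside [-1, 1]: not valid
```

Away from $0$ the subdifferential collapses to the single gradient $\{\operatorname{sign}(x)\}$.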

1.2 Smoothness

A differentiable function $f$ is said to be $L$–smooth if its gradient is Lipschitz continuous, that is

$$\|\nabla f(x) - \nabla f(y)\|_2 \le L \|x - y\|_2, \qquad \forall x, y \in \mathbb{R}^d. \tag{7}$$

If $f$ is twice differentiable then, by applying the fundamental theorem of calculus to $t \mapsto \nabla f(x + td)$, we have
$$\nabla f(x + \alpha d) - \nabla f(x) = \int_{t=0}^{\alpha} \nabla^2 f(x + td)\, d \, \mathrm{d}t.$$
Taking norms and using (7) gives
$$\left\| \int_{t=0}^{\alpha} \nabla^2 f(x + td)\, d \, \mathrm{d}t \right\|_2 \le L \alpha \|d\|_2.$$
Dividing through by $\alpha \|d\|_2$ with $d \neq 0$ and taking the limit $\alpha \to 0$ we have that
$$\lim_{\alpha \to 0} \frac{\left\| \int_{t=0}^{\alpha} \nabla^2 f(x + td)\, d \, \mathrm{d}t \right\|_2}{\alpha \|d\|_2} = \frac{\|\nabla^2 f(x) d\|_2}{\|d\|_2} \le L, \qquad \forall d \neq 0 \in \mathbb{R}^d.$$
Taking the supremum over $d \neq 0 \in \mathbb{R}^d$ in the above gives
$$\nabla^2 f(x) \preceq L I. \tag{8}$$

Furthermore, using the Taylor expansion of $f$ and the uniform bound (8) on the Hessian we have that
$$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2} \|y - x\|_2^2. \tag{9}$$
Some direct consequences of smoothness are given in the following lemma.

Lemma 1.1 If $f$ is $L$–smooth then
$$f\big(x - \tfrac{1}{L} \nabla f(x)\big) - f(x) \le -\frac{1}{2L} \|\nabla f(x)\|_2^2, \tag{10}$$
and
$$f(x^*) - f(x) \le -\frac{1}{2L} \|\nabla f(x)\|_2^2, \tag{11}$$
hold for all $x \in \mathbb{R}^d$.

Proof: The first inequality (10) follows by inserting $y = x - \frac{1}{L} \nabla f(x)$ into (9), since
$$f\big(x - \tfrac{1}{L} \nabla f(x)\big) \le f(x) - \frac{1}{L} \langle \nabla f(x), \nabla f(x) \rangle + \frac{L}{2} \big\| \tfrac{1}{L} \nabla f(x) \big\|_2^2 = f(x) - \frac{1}{2L} \|\nabla f(x)\|_2^2.$$
Furthermore, using (10) combined with $f(x^*) \le f(y)$ for all $y$, we get (11). Indeed,
$$f(x^*) - f(x) \le f\big(x - \tfrac{1}{L} \nabla f(x)\big) - f(x) \le -\frac{1}{2L} \|\nabla f(x)\|_2^2. \tag{12}$$
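Both (9) and (10) can be verified numerically on a quadratic, for which the smoothness constant is explicit. The sketch below (our own illustration; the quadratic and $L = \lambda_{\max}(A)$ are assumptions, not from the notes) checks them at random points:

```python
import numpy as np

# Check the descent lemma (9) and inequality (10) for the quadratic
# f(x) = 0.5 x^T A x with A symmetric PSD, which is L-smooth
# with L = lambda_max(A).
rng = np.random.default_rng(1)
M = rng.normal(size=(4, 4))
A = M @ M.T                      # symmetric positive semi-definite
L = np.linalg.eigvalsh(A)[-1]    # largest eigenvalue

def f(x):
    return 0.5 * float(x @ A @ x)

def grad_f(x):
    return A @ x

for _ in range(1000):
    x, y = rng.normal(size=4), rng.normal(size=4)
    # (9): f(y) <= f(x) + <grad f(x), y - x> + (L/2) ||y - x||^2
    upper = f(x) + grad_f(x) @ (y - x) + 0.5 * L * np.sum((y - x) ** 2)
    assert f(y) <= upper + 1e-9
    # (10): one gradient step with stepsize 1/L decreases f
    g = grad_f(x)
    assert f(x - g / L) - f(x) <= -(g @ g) / (2.0 * L) + 1e-9
```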

1.3 Smooth and Convex

There are many problems in optimization where the function is both smooth and convex. Furthermore, this combination has some interesting consequences, which we collect in the following lemma and then use to prove convergence of the gradient method.

Lemma 1.2 If $f$ is convex and $L$–smooth then
$$f(y) - f(x) \le \langle \nabla f(y), y - x \rangle - \frac{1}{2L} \|\nabla f(y) - \nabla f(x)\|_2^2, \tag{13}$$
$$\langle \nabla f(y) - \nabla f(x), y - x \rangle \ge \frac{1}{L} \|\nabla f(x) - \nabla f(y)\|_2^2 \qquad \text{(co-coercivity)}. \tag{14}$$
Proof: To prove (13), note that for any $z$,

$$f(y) - f(x) = f(y) - f(z) + f(z) - f(x) \overset{(2)+(9)}{\le} \langle \nabla f(y), y - z \rangle + \langle \nabla f(x), z - x \rangle + \frac{L}{2} \|z - x\|_2^2.$$
Minimizing the right-hand side in $z$ we have that
$$z = x - \frac{1}{L} (\nabla f(x) - \nabla f(y)). \tag{15}$$
Substituting this in gives
$$f(y) - f(x) \le \Big\langle \nabla f(y), y - x + \frac{1}{L} (\nabla f(x) - \nabla f(y)) \Big\rangle - \frac{1}{L} \langle \nabla f(x), \nabla f(x) - \nabla f(y) \rangle + \frac{1}{2L} \|\nabla f(x) - \nabla f(y)\|_2^2$$
$$= \langle \nabla f(y), y - x \rangle - \frac{1}{L} \|\nabla f(x) - \nabla f(y)\|_2^2 + \frac{1}{2L} \|\nabla f(x) - \nabla f(y)\|_2^2 \tag{16}$$
$$= \langle \nabla f(y), y - x \rangle - \frac{1}{2L} \|\nabla f(x) - \nabla f(y)\|_2^2. \tag{17}$$
Finally, (14) follows from applying (13) once,
$$f(y) - f(x) \le \langle \nabla f(y), y - x \rangle - \frac{1}{2L} \|\nabla f(y) - \nabla f(x)\|_2^2,$$
then interchanging the roles of $x$ and $y$ to get
$$f(x) - f(y) \le \langle \nabla f(x), x - y \rangle - \frac{1}{2L} \|\nabla f(y) - \nabla f(x)\|_2^2.$$

Adding together the two above inequalities gives
$$0 \le \langle \nabla f(y) - \nabla f(x), y - x \rangle - \frac{1}{L} \|\nabla f(y) - \nabla f(x)\|_2^2.$$
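Co-coercivity (14) can also be checked numerically on a smooth convex quadratic, where the gradient map is linear. A small sketch (our own illustration, with $L = \lambda_{\max}(A)$ as the smoothness constant):

```python
import numpy as np

# Numerical check of co-coercivity (14) for f(x) = 0.5 x^T A x, A PSD:
# <grad f(y) - grad f(x), y - x> >= (1/L) ||grad f(y) - grad f(x)||^2.
rng = np.random.default_rng(2)
M = rng.normal(size=(3, 3))
A = M @ M.T                      # symmetric positive semi-definite
L = np.linalg.eigvalsh(A)[-1]    # smoothness constant

def grad_f(x):
    return A @ x

for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    g = grad_f(y) - grad_f(x)    # = A (y - x)
    assert g @ (y - x) >= (g @ g) / L - 1e-9
```

For this $f$ the check reduces to $d^\top A d \ge \frac{1}{L} d^\top A^2 d$, which holds since $A^2 \preceq L A$.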

1.4 Strong convexity

We can “strengthen” the notion of convexity by defining $\mu$–strong convexity, that is
$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2} \|y - x\|_2^2, \qquad \forall x, y \in \mathbb{R}^d. \tag{18}$$
Minimizing both sides of (18) in $y$ proves the following lemma.

Lemma 1.3 If $f$ is $\mu$–strongly convex then it also satisfies the Polyak–Łojasiewicz condition, that is
$$\|\nabla f(x)\|_2^2 \ge 2\mu \, (f(x) - f(x^*)). \tag{19}$$

Proof: Multiplying (18) by $-1$ and substituting $y = x^*$ we have that
$$f(x) - f(x^*) \le \langle \nabla f(x), x - x^* \rangle - \frac{\mu}{2} \|x^* - x\|_2^2 = -\frac{1}{2} \Big\| \sqrt{\mu} (x - x^*) - \frac{1}{\sqrt{\mu}} \nabla f(x) \Big\|_2^2 + \frac{1}{2\mu} \|\nabla f(x)\|_2^2 \le \frac{1}{2\mu} \|\nabla f(x)\|_2^2.$$

The inequality (19) is of such importance in optimization that it merits its own name.
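The PL inequality (19) is again easy to verify on a strongly convex quadratic, where $\mu = \lambda_{\min}(A)$, $x^* = 0$ and $f(x^*) = 0$. A sketch (our own illustration, not from the notes):

```python
import numpy as np

# Check the Polyak-Lojasiewicz condition (19) for the mu-strongly
# convex quadratic f(x) = 0.5 x^T A x with A positive definite,
# where mu = lambda_min(A), x* = 0 and f(x*) = 0.
rng = np.random.default_rng(3)
M = rng.normal(size=(3, 3))
A = M @ M.T + np.eye(3)          # positive definite
mu = np.linalg.eigvalsh(A)[0]    # smallest eigenvalue

def f(x):
    return 0.5 * float(x @ A @ x)

def grad_f(x):
    return A @ x

for _ in range(1000):
    x = rng.normal(size=3)
    g = grad_f(x)
    assert g @ g >= 2.0 * mu * f(x) - 1e-9
```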

2 Gradient Descent

Consider the problem
$$x^* = \arg\min_{x \in \mathbb{R}^d} f(x), \tag{20}$$
and the following gradient method

$$x^{t+1} = x^t - \alpha \nabla f(x^t), \tag{21}$$
where $f$ is $L$–smooth. We will now prove that the iterates (21) converge. In Theorem 2.2 we will prove sublinear convergence under the assumption that $f$ is convex. In Theorem 2.3 we will prove linear convergence (a stronger form of convergence) under the assumption that $f$ is $\mu$–strongly convex.
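A minimal implementation of the gradient method (21), here applied to a least-squares quadratic with the stepsize $\alpha = 1/L$ (the test problem and constants are our own illustration, not from the notes):

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha, iters):
    """Run the gradient method x_{t+1} = x_t - alpha * grad_f(x_t)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - alpha * grad_f(x)
    return x

# Example: minimize f(x) = 0.5 ||A x - b||^2, which is L-smooth
# with L = lambda_max(A^T A).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
L = np.linalg.eigvalsh(A.T @ A)[-1]
x = gradient_descent(lambda x: A.T @ (A @ x - b), np.zeros(2), 1.0 / L, 2000)
x_star = np.linalg.solve(A, b)   # exact minimizer of this quadratic
assert np.allclose(x, x_star, atol=1e-6)
```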

2.1 Convergence for convex and smooth functions

2.1.1 Average iterates

Theorem 2.1 Let $f$ be convex and $L$–smooth and let $x^t$ for $t = 1, \ldots, T$ be the sequence of iterates generated by the gradient method (21). It follows that

$$\frac{1}{T} \sum_{t=1}^{T} \left[ f(x^t) - f(x^*) \right] \le \frac{\alpha}{4L} \, \frac{f(x^1) - f(x^*)}{T} + \frac{1}{\alpha} \, \frac{\|x^1 - x^*\|_2^2}{T}. \tag{22}$$

2.1.2 Last iterates

Theorem 2.2 Let $f$ be convex and $L$–smooth and let $x^t$ for $t = 1, \ldots, n$ be the sequence of iterates generated by the gradient method (21) with $\alpha = 1/L$. It follows that
$$f(x^n) - f(x^*) \le \frac{2L \|x^1 - x^*\|_2^2}{n - 1}. \tag{23}$$
Proof: Let $f$ be convex and $L$–smooth. It follows that

$$\|x^{t+1} - x^*\|_2^2 = \big\|x^t - x^* - \tfrac{1}{L} \nabla f(x^t)\big\|_2^2 = \|x^t - x^*\|_2^2 - \frac{2}{L} \langle x^t - x^*, \nabla f(x^t) \rangle + \frac{1}{L^2} \|\nabla f(x^t)\|_2^2 \overset{(14)}{\le} \|x^t - x^*\|_2^2 - \frac{1}{L^2} \|\nabla f(x^t)\|_2^2.$$
Thus $\|x^t - x^*\|_2^2$ is a decreasing sequence in $t$. Calling upon (10) and subtracting $f(x^*)$ from both sides gives
$$f(x^{t+1}) - f(x^*) \le f(x^t) - f(x^*) - \frac{1}{2L} \|\nabla f(x^t)\|_2^2. \tag{24}$$
Applying convexity we have that

$$f(x^t) - f(x^*) \le \langle \nabla f(x^t), x^t - x^* \rangle \le \|\nabla f(x^t)\|_2 \|x^t - x^*\|_2 \le \|\nabla f(x^t)\|_2 \|x^1 - x^*\|_2. \tag{25}$$

Isolating $\|\nabla f(x^t)\|_2$ in (25) and inserting it into (24) gives
$$f(x^{t+1}) - f(x^*) \le f(x^t) - f(x^*) - \underbrace{\frac{1}{2L \|x^1 - x^*\|_2^2}}_{\beta} \, (f(x^t) - f(x^*))^2. \tag{26}$$
Let $\delta_t = f(x^t) - f(x^*)$, and note that $\delta_{t+1} \le \delta_t$ by (24). Manipulating (26) we have that

$$\delta_{t+1} \le \delta_t - \beta \delta_t^2 \;\overset{\times \frac{1}{\delta_t \delta_{t+1}}}{\Longleftrightarrow}\; \beta \, \frac{\delta_t}{\delta_{t+1}} \le \frac{1}{\delta_{t+1}} - \frac{1}{\delta_t} \;\overset{\delta_{t+1} \le \delta_t}{\Longrightarrow}\; \beta \le \frac{1}{\delta_{t+1}} - \frac{1}{\delta_t}.$$
Summing up both sides over $t = 1, \ldots, n-1$ and using telescopic cancellation we have that
$$(n-1)\beta \le \frac{1}{\delta_n} - \frac{1}{\delta_1} \le \frac{1}{\delta_n},$$
and re-arranging gives $\delta_n \le \frac{1}{(n-1)\beta} = \frac{2L \|x^1 - x^*\|_2^2}{n-1}$.
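The $O(1/n)$ bound (23) can be observed empirically. The sketch below (our own illustration, with an assumed diagonal quadratic) runs gradient descent with stepsize $1/L$ and checks the bound at every iteration:

```python
import numpy as np

# Empirical check of the O(1/n) bound (23) on a smooth convex
# quadratic f(x) = 0.5 x^T A x, run with stepsize 1/L as in the proof.
A = np.diag([10.0, 1.0, 0.1])
L = 10.0                                # lambda_max(A)
x_star = np.zeros(3)                    # minimizer, f(x*) = 0

def f(x):
    return 0.5 * float(x @ A @ x)

x = np.array([1.0, 1.0, 1.0])           # x^1
r2 = float(np.sum((x - x_star) ** 2))   # ||x^1 - x*||^2
for n in range(2, 200):
    x = x - (1.0 / L) * (A @ x)         # x^n, after n-1 steps
    assert f(x) - f(x_star) <= 2.0 * L * r2 / (n - 1)
```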

2.2 Convergence for strongly convex (PL) and smooth functions

Now we prove some bounds that hold for strongly convex and smooth functions. In fact, as you will observe, we only use the PL inequality (19) to establish the convergence result. Assuming a function satisfies the PL condition is a strictly weaker assumption than assuming strong convexity [2]. This proof is taken from [2].

Theorem 2.3 Let $f$ be $L$–smooth and $\mu$–strongly convex. From a given $x^0 \in \mathbb{R}^d$ and $\frac{1}{L} \ge \alpha > 0$, the iterates
$$x^{t+1} = x^t - \alpha \nabla f(x^t), \tag{27}$$
converge according to
$$\|x^{t+1} - x^*\|_2^2 \le (1 - \alpha\mu)^{t+1} \|x^0 - x^*\|_2^2. \tag{28}$$
In particular, for $\alpha = \frac{1}{L}$ the iterates (21) enjoy linear convergence with a rate of $1 - \mu/L$.

Proof: From (21) we have that

$$\|x^{t+1} - x^*\|_2^2 = \|x^t - x^* - \alpha \nabla f(x^t)\|_2^2 = \|x^t - x^*\|_2^2 - 2\alpha \langle \nabla f(x^t), x^t - x^* \rangle + \alpha^2 \|\nabla f(x^t)\|_2^2$$
$$\overset{(18)}{\le} (1 - \alpha\mu) \|x^t - x^*\|_2^2 - 2\alpha (f(x^t) - f(x^*)) + \alpha^2 \|\nabla f(x^t)\|_2^2$$
$$\overset{(12)}{\le} (1 - \alpha\mu) \|x^t - x^*\|_2^2 - 2\alpha (f(x^t) - f(x^*)) + 2\alpha^2 L (f(x^t) - f(x^*))$$
$$= (1 - \alpha\mu) \|x^t - x^*\|_2^2 - 2\alpha (1 - \alpha L)(f(x^t) - f(x^*)). \tag{29}$$

Since $\frac{1}{L} \ge \alpha$, the term $-2\alpha(1 - \alpha L)(f(x^t) - f(x^*))$ is non-positive, and thus can be safely dropped to give

$$\|x^{t+1} - x^*\|_2^2 \le (1 - \alpha\mu) \|x^t - x^*\|_2^2.$$

It now remains to unroll the recurrence.
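The unrolled bound (28) can be checked on a small strongly convex quadratic; the sketch below (our own illustration, with an assumed diagonal $A$ so that $L$ and $\mu$ are its extreme eigenvalues) verifies it at every iteration for $\alpha = 1/L$:

```python
import numpy as np

# Check the linear convergence bound (28) with alpha = 1/L on the
# strongly convex quadratic f(x) = 0.5 x^T A x:
# ||x^{t+1} - x*||^2 <= (1 - mu/L)^{t+1} ||x^0 - x*||^2.
A = np.diag([4.0, 1.0])
L, mu = 4.0, 1.0
alpha = 1.0 / L
x_star = np.zeros(2)

x = np.array([1.0, -2.0])               # x^0
r2 = float(np.sum((x - x_star) ** 2))   # ||x^0 - x*||^2
for t in range(50):
    x = x - alpha * (A @ x)             # x^{t+1}
    assert np.sum((x - x_star) ** 2) <= (1 - alpha * mu) ** (t + 1) * r2 + 1e-12
```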

3 Stochastic Gradient Descent

Consider the problem
$$w^* = \arg\min_{w \in \mathbb{R}^d} \mathbb{E}_{\mathcal{D}}\left[ f(w, x) \right] \overset{\text{def}}{=} f(w), \tag{30}$$
where $x \sim \mathcal{D}$. Here we will consider the stochastic subgradient method. Let $f(w, x)$ be convex in $w$, and consider the iterates
$$w^{t+1} = w^t - \alpha_t \, g(w^t, x^t), \tag{31}$$

where $g(w^t, x^t) \in \partial_w f(w^t, x^t)$ and
$$\mathbb{E}\left[ g(w^t, x^t) \mid w^t \right] = g(w^t) \in \partial f(w^t), \tag{32}$$
and where $x^t \sim \mathcal{D}$ is sampled i.i.d. at each iteration. We will also consider the variant (33) with a projection step.

3.1 Convex

Theorem 3.1 (With projection step) Let $r, B > 0$ and let $D = \{ w : \|w\|_2 \le r \}$. Let $f(w, x)$ be convex in $w$. Assume that $\mathbb{E}_{\mathcal{D}}\left[ \|g(w^t, x)\|_2^2 \right] \le B^2$ for all $t$, and that the iterates and the optimal point lie in the ball $D$. From a given $w^0 \in \mathbb{R}^d$ and $\alpha_t = \frac{\sqrt{2}\, r}{B \sqrt{t}}$, consider the iterates of the projected stochastic subgradient method

$$w^{t+1} = \operatorname{proj}_D\left( w^t - \alpha_t \, g(w^t, x^t) \right). \tag{33}$$
The iterates satisfy

$$\mathbb{E}\left[ f(\bar{w}^T) \right] - f(w^*) \le \frac{3 r B}{\sqrt{T}}, \tag{34}$$
where $\bar{w}^T = \frac{1}{T} \sum_{t=1}^{T} w^t$. The proof technique for convex functions will always follow the same steps.
Proof: Since the projection onto $D$ is a contraction and $w^* \in D$, we have
$$\|w^{t+1} - w^*\|_2^2 \le \|w^t - w^* - \alpha_t g(w^t, x^t)\|_2^2 = \|w^t - w^*\|_2^2 - 2\alpha_t \langle g(w^t, x^t), w^t - w^* \rangle + \alpha_t^2 \|g(w^t, x^t)\|_2^2. \tag{35}$$
Taking expectation conditioned on $w^t$ we have that

$$\mathbb{E}\left[ \|w^{t+1} - w^*\|_2^2 \mid w^t \right] \overset{(32)}{=} \|w^t - w^*\|_2^2 - 2\alpha_t \langle g(w^t), w^t - w^* \rangle + \alpha_t^2 \, \mathbb{E}_{\mathcal{D}}\left[ \|g(w^t, x^t)\|_2^2 \right]$$
$$\le \|w^t - w^*\|_2^2 - 2\alpha_t \langle g(w^t), w^t - w^* \rangle + \alpha_t^2 B^2$$
$$\overset{(5)}{\le} \|w^t - w^*\|_2^2 - 2\alpha_t (f(w^t) - f(w^*)) + \alpha_t^2 B^2. \tag{36}$$
Taking expectation and re-arranging we have that
$$\mathbb{E}\left[ f(w^t) - f(w^*) \right] \le \frac{1}{2\alpha_t} \mathbb{E}\left[ \|w^t - w^*\|_2^2 \right] - \frac{1}{2\alpha_t} \mathbb{E}\left[ \|w^{t+1} - w^*\|_2^2 \right] + \frac{\alpha_t B^2}{2}. \tag{37}$$

Summing up from $t = 1, \ldots, T$ and using that $\alpha_t$ is a non-increasing sequence, we have that
$$\sum_{t=1}^{T} \mathbb{E}\left[ f(w^t) - f(w^*) \right] \le \frac{1}{2\alpha_1} \|w^1 - w^*\|_2^2 + \frac{1}{2} \sum_{t=1}^{T-1} \left( \frac{1}{\alpha_{t+1}} - \frac{1}{\alpha_t} \right) \mathbb{E}\left[ \|w^{t+1} - w^*\|_2^2 \right] - \frac{1}{2\alpha_{T+1}} \mathbb{E}\left[ \|w^{T+1} - w^*\|_2^2 \right] + \frac{B^2}{2} \sum_{t=1}^{T} \alpha_t \tag{38}$$
$$\overset{\|w\|_2 \le r}{\le} \frac{2r^2}{\alpha_1} + \sum_{t=1}^{T-1} \left( \frac{1}{\alpha_{t+1}} - \frac{1}{\alpha_t} \right) 2r^2 + \frac{B^2}{2} \sum_{t=1}^{T} \alpha_t \overset{\text{telescoping}}{=} \frac{2r^2}{\alpha_T} + \frac{B^2}{2} \sum_{t=1}^{T} \alpha_t, \tag{39}$$
where we used that $\|w^t - w^*\|_2^2 \le (2r)^2$ and dropped the non-positive last term. Finally, letting $\bar{w}^T = \frac{1}{T} \sum_{t=1}^{T} w^t$, dividing by $T$ and using Jensen's inequality we have
$$\mathbb{E}\left[ f(\bar{w}^T) \right] - f(w^*) \overset{\text{Jensen}}{\le} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\left[ f(w^t) - f(w^*) \right] \overset{(39)}{\le} \frac{2r^2}{T \alpha_T} + \frac{B^2}{2T} \sum_{t=1}^{T} \alpha_t. \tag{40}$$

Now plugging in $\alpha_t = \frac{\alpha_0}{\sqrt{t}}$ we have that

$$\mathbb{E}\left[ f(\bar{w}^T) \right] - f(w^*) \le \frac{2r^2}{\alpha_0 \sqrt{T}} + \frac{\alpha_0 B^2}{2T} \sum_{t=1}^{T} \frac{1}{\sqrt{t}} \le \frac{1}{\sqrt{T}} \left( \frac{2r^2}{\alpha_0} + \alpha_0 B^2 \right), \tag{41}$$
where we used $\sum_{t=1}^{T} \frac{1}{\sqrt{t}} \le 2\sqrt{T}$. Minimizing the right-hand side in $\alpha_0$ gives $\alpha_0 = \frac{\sqrt{2}\, r}{B}$ and the result, since
$$\frac{2r^2}{\alpha_0} + \alpha_0 B^2 = 2\sqrt{2}\, r B \le 3 r B.$$
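A sketch of the projected stochastic subgradient method (33) with the decreasing stepsize $\alpha_t = \alpha_0/\sqrt{t}$ and iterate averaging. The toy least-squares objective, the radius $r$, the stepsize constant and the iteration count are all illustrative assumptions, not from the notes:

```python
import numpy as np

# Projected stochastic subgradient method (33) on the toy objective
# f(w) = E[0.5 ((w - w*)^T x)^2] with x ~ N(0, I), projection onto
# the ball D = {w : ||w|| <= r}, stepsize alpha_t = alpha0/sqrt(t),
# and a running average of the iterates (for the Jensen step).
rng = np.random.default_rng(4)
d, r = 3, 2.0
w_star = np.array([1.0, -0.5, 0.5])     # ||w*|| <= r, so w* lies in D

def proj_ball(w, r):
    """Euclidean projection onto {w : ||w||_2 <= r}."""
    nrm = np.linalg.norm(w)
    return w if nrm <= r else (r / nrm) * w

w = np.zeros(d)
avg = np.zeros(d)                       # running average of w^1, ..., w^t
T, alpha0 = 20000, 0.5
for t in range(1, T + 1):
    x = rng.normal(size=d)
    g = ((w - w_star) @ x) * x          # stochastic gradient at w
    w = proj_ball(w - (alpha0 / np.sqrt(t)) * g, r)
    avg += (w - avg) / t
assert np.linalg.norm(avg - w_star) < 0.1
```

Since the noise here is multiplicative (it vanishes at $w^*$), the averaged iterate ends up much closer to $w^*$ than the worst-case $O(1/\sqrt{T})$ bound (34) requires.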

3.2 Strongly convex

3.2.1 Constant stepsize

Theorem 3.2 Let $f(w, x)$ be convex in $w$ and let $f$ be $\mu$–strongly convex. From a given $w^0 \in \mathbb{R}^d$ and constant stepsize $\frac{1}{\mu} > \alpha \equiv \alpha_t > 0$, consider the iterates of the stochastic subgradient method (33). Assume that $\mathbb{E}_{\mathcal{D}}\left[ \|g(w^t, x)\|_2^2 \right] \le B^2$, where $B > 0$, for all $t$.¹ The iterates satisfy
$$\mathbb{E}\left[ \|w^{t+1} - w^*\|_2^2 \right] \le (1 - \alpha\mu)^{t+1} \|w^0 - w^*\|_2^2 + \frac{\alpha B^2}{\mu}. \tag{42}$$

Proof: From (33), using that the projection onto $D$ is a contraction and that $w^* \in D$, we have that

$$\|w^{t+1} - w^*\|_2^2 \le \|w^t - w^* - \alpha g(w^t, x^t)\|_2^2 = \|w^t - w^*\|_2^2 - 2\alpha \langle g(w^t, x^t), w^t - w^* \rangle + \alpha^2 \|g(w^t, x^t)\|_2^2.$$

Taking expectation conditioned on $w^t$ in the above gives

$$\mathbb{E}\left[ \|w^{t+1} - w^*\|_2^2 \mid w^t \right] \overset{(32)}{=} \|w^t - w^*\|_2^2 - 2\alpha \langle g(w^t), w^t - w^* \rangle + \alpha^2 \, \mathbb{E}_{\mathcal{D}}\left[ \|g(w^t, x^t)\|_2^2 \right]$$
$$\le \|w^t - w^*\|_2^2 - 2\alpha \langle g(w^t), w^t - w^* \rangle + \alpha^2 B^2 \tag{43}$$
$$\overset{(18)}{\le} (1 - \alpha\mu) \|w^t - w^*\|_2^2 + \alpha^2 B^2, \tag{44}$$
where in the last step we also dropped the non-positive term $-2\alpha(f(w^t) - f(w^*))$. Taking the total expectation and unrolling the recurrence gives
$$\mathbb{E}\left[ \|w^{t+1} - w^*\|_2^2 \right] \le (1 - \alpha\mu)^{t+1} \|w^0 - w^*\|_2^2 + \sum_{i=0}^{t} (1 - \alpha\mu)^i \alpha^2 B^2. \tag{45}$$
Since
$$\sum_{i=0}^{t} (1 - \alpha\mu)^i \alpha^2 B^2 = \alpha^2 B^2 \, \frac{1 - (1 - \alpha\mu)^{t+1}}{\alpha\mu} \le \frac{\alpha^2 B^2}{\alpha\mu} = \frac{\alpha B^2}{\mu}, \tag{46}$$
combining (45) and (46) gives the result (42).
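The bound (42) says that with a constant stepsize the iterates contract linearly down to a noise floor of size $\alpha B^2 / \mu$. The following sketch illustrates this behaviour; the quadratic objective, the additive noise model and all constants are our own assumptions, not from the notes:

```python
import numpy as np

# Illustration of Theorem 3.2: with constant stepsize, SGD on a
# strongly convex problem converges linearly up to a noise floor
# of order alpha * B^2 / mu, as in bound (42).
rng = np.random.default_rng(5)
mu = 1.0                       # f(w) = 0.5 * mu * ||w - w*||^2
w_star = np.array([2.0, -1.0])
alpha = 0.05
B2 = 4.0                       # assumed bound on E||g||^2 near w*

w = np.zeros(2)
dists = []
for t in range(5000):
    noise = rng.normal(size=2)             # E||noise||^2 = 2 <= B2
    g = mu * (w - w_star) + noise          # stochastic gradient
    w = w - alpha * g
    dists.append(float(np.sum((w - w_star) ** 2)))

# After the transient, the mean squared distance sits near the noise
# floor alpha * B2 / mu = 0.2, far below ||w^0 - w*||^2 = 5.
tail = np.mean(dists[-1000:])
assert tail < alpha * B2 / mu + 0.1
```

Shrinking $\alpha$ lowers the noise floor at the price of a slower contraction factor $(1 - \alpha\mu)$, which is the usual constant-stepsize trade-off.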

¹ This is a very awkward assumption, and should really be proven instead.
