Convergence Theorems for Descent

Robert M. Gower.

October 5, 2018

Abstract. Here you will find a growing collection of proofs of the convergence of gradient and stochastic gradient descent type methods on convex, strongly convex and/or smooth functions. Important disclaimer: these notes do not compare to a good book or well-prepared lecture notes. You should only read these notes if you have sat through my lecture on the subject and would like to see detailed notes based on my lecture as a reminder. Under any other circumstances, I highly recommend reading instead the first few chapters of the books [3] and [1].

1 Assumptions and Lemmas

1.1 Convexity

We say that f is convex if

$$f(tx + (1-t)y) \le t f(x) + (1-t) f(y), \qquad \forall x, y \in \mathbb{R}^d, \; t \in [0,1]. \tag{1}$$

If $f$ is differentiable and convex then every tangent plane to the graph of $f$ lower bounds the function values, that is

$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle, \qquad \forall x, y \in \mathbb{R}^d. \tag{2}$$

We can deduce (2) from (1): dividing (1) by $t$ and re-arranging gives
$$\frac{f(y + t(x-y)) - f(y)}{t} \le f(x) - f(y).$$
Now taking the limit $t \to 0$ gives

$$\langle \nabla f(y), x - y \rangle \le f(x) - f(y),$$
which is (2) with the roles of $x$ and $y$ interchanged.
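Inequality (2) is easy to test numerically. The following sketch (our own illustration, not from the notes) checks it at random points for the convex function $f(x) = \|x\|_2^2$, whose gradient is $2x$:

```python
import numpy as np

# Numerical sanity check of the first-order characterization of
# convexity (2) for f(x) = ||x||_2^2, whose gradient is 2x.
rng = np.random.default_rng(0)

def f(x):
    return float(x @ x)

def grad_f(x):
    return 2.0 * x

for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    # f(y) >= f(x) + <grad f(x), y - x>, inequality (2)
    assert f(y) >= f(x) + grad_f(x) @ (y - x) - 1e-12
```

For this particular $f$ the gap is exactly $\|y - x\|_2^2 \ge 0$, so the check never fails.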

If $f$ is twice differentiable, then taking the directional derivative of (2) in the direction $v$ at the point $x$ gives

$$0 \ge \langle \nabla f(x), v \rangle + \langle \nabla^2 f(x) v, y - x \rangle - \langle \nabla f(x), v \rangle = \langle \nabla^2 f(x) v, y - x \rangle, \qquad \forall x, y, v \in \mathbb{R}^d. \tag{3}$$

Setting $y = x - v$ then gives
$$0 \le \langle \nabla^2 f(x) v, v \rangle, \qquad \forall x, v \in \mathbb{R}^d. \tag{4}$$

The above is equivalent to saying that $\nabla^2 f(x) \succeq 0$, that is, the Hessian is positive semi-definite for every $x \in \mathbb{R}^d$. An analogous property to (2) holds even when the function is not differentiable. Indeed, for every convex function $f$, we say that $g \in \mathbb{R}^d$ is a subgradient of $f$ at $x$ if

$$f(y) \ge f(x) + \langle g, y - x \rangle, \qquad \forall y. \tag{5}$$

We refer to the set of subgradients as the subdifferential ∂f(x), that is

$$\partial f(x) \overset{\text{def}}{=} \{ g : f(y) \ge f(x) + \langle g, y - x \rangle, \; \forall y \}. \tag{6}$$
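A standard example is the absolute value $f(x) = |x|$, which is convex but not differentiable at $0$: there $\partial f(0) = [-1, 1]$. The sketch below (the helper `is_subgradient` is ours, not from the notes) checks the defining inequality (5) on a grid of points:

```python
import numpy as np

# For f(x) = |x| (convex, not differentiable at 0), any g in [-1, 1]
# is a subgradient at x = 0: |y| >= |0| + g * (y - 0) for all y.
def is_subgradient(g, x, ys):
    """Check the subgradient inequality (5) for f = abs at the point x."""
    return all(abs(y) >= abs(x) + g * (y - x) for y in ys)

ys = np.linspace(-5.0, 5.0, 101)
assert is_subgradient(0.5, 0.0, ys)      # 0.5 in [-1, 1]: valid at x = 0
assert not is_subgradient(1.5, 0.0, ys)  # 1.5 outside [-1, 1]: not valid
```

Away from $0$ the subdifferential collapses to the single gradient $\{\operatorname{sign}(x)\}$.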

1.2 Smoothness

A differentiable function $f$ is said to be $L$–smooth if its gradient is Lipschitz continuous, that is

$$\|\nabla f(x) - \nabla f(y)\|_2 \le L \|x - y\|_2, \qquad \forall x, y \in \mathbb{R}^d. \tag{7}$$

If $f$ is twice differentiable then, by applying the fundamental theorem of calculus to $t \mapsto \nabla f(x + td)$, we have
$$\nabla f(x + \alpha d) - \nabla f(x) = \int_{t=0}^{\alpha} \nabla^2 f(x + td)\, d \, \mathrm{d}t.$$
Taking norms and using (7) gives
$$\left\| \int_{t=0}^{\alpha} \nabla^2 f(x + td)\, d \, \mathrm{d}t \right\|_2 \le L \alpha \|d\|_2.$$
Dividing through by $\alpha \|d\|_2$ with $d \neq 0$ and taking the limit $\alpha \to 0$ we have that
$$\lim_{\alpha \to 0} \frac{\left\| \int_{t=0}^{\alpha} \nabla^2 f(x + td)\, d \, \mathrm{d}t \right\|_2}{\alpha \|d\|_2} = \frac{\|\nabla^2 f(x) d\|_2}{\|d\|_2} \le L, \qquad \forall d \neq 0 \in \mathbb{R}^d.$$
Taking the supremum over $d \neq 0 \in \mathbb{R}^d$ in the above gives
$$\nabla^2 f(x) \preceq L I. \tag{8}$$

Furthermore, using the Taylor expansion of $f$ and the uniform bound (8) on the Hessian we have that
$$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2} \|y - x\|_2^2. \tag{9}$$
Some direct consequences of smoothness are given in the following lemma.

Lemma 1.1 If $f$ is $L$–smooth then
$$f\big(x - \tfrac{1}{L} \nabla f(x)\big) - f(x) \le -\frac{1}{2L} \|\nabla f(x)\|_2^2, \tag{10}$$
and
$$f(x^*) - f(x) \le -\frac{1}{2L} \|\nabla f(x)\|_2^2, \tag{11}$$
hold for all $x \in \mathbb{R}^d$.

Proof: The first inequality (10) follows by inserting $y = x - \frac{1}{L} \nabla f(x)$ into (9), since
$$f\big(x - \tfrac{1}{L} \nabla f(x)\big) \le f(x) - \frac{1}{L} \langle \nabla f(x), \nabla f(x) \rangle + \frac{L}{2} \big\| \tfrac{1}{L} \nabla f(x) \big\|_2^2 = f(x) - \frac{1}{2L} \|\nabla f(x)\|_2^2.$$
Furthermore, using (10) combined with $f(x^*) \le f(y)$ for all $y$, we get (11). Indeed,
$$f(x^*) - f(x) \le f\big(x - \tfrac{1}{L} \nabla f(x)\big) - f(x) \le -\frac{1}{2L} \|\nabla f(x)\|_2^2. \tag{12}$$
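Both (9) and (10) can be verified numerically on a quadratic, for which the smoothness constant is explicit. The sketch below (our own illustration; the quadratic and $L = \lambda_{\max}(A)$ are assumptions, not from the notes) checks them at random points:

```python
import numpy as np

# Check the descent lemma (9) and inequality (10) for the quadratic
# f(x) = 0.5 x^T A x with A symmetric PSD, which is L-smooth
# with L = lambda_max(A).
rng = np.random.default_rng(1)
M = rng.normal(size=(4, 4))
A = M @ M.T                      # symmetric positive semi-definite
L = np.linalg.eigvalsh(A)[-1]    # largest eigenvalue

def f(x):
    return 0.5 * float(x @ A @ x)

def grad_f(x):
    return A @ x

for _ in range(1000):
    x, y = rng.normal(size=4), rng.normal(size=4)
    # (9): f(y) <= f(x) + <grad f(x), y - x> + (L/2) ||y - x||^2
    upper = f(x) + grad_f(x) @ (y - x) + 0.5 * L * np.sum((y - x) ** 2)
    assert f(y) <= upper + 1e-9
    # (10): one gradient step with stepsize 1/L decreases f
    g = grad_f(x)
    assert f(x - g / L) - f(x) <= -(g @ g) / (2.0 * L) + 1e-9
```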

1.3 Smooth and Convex

There are many problems in optimization where the function is both smooth and convex. Furthermore, this combination has some interesting consequences, which we collect in the following lemma and then use to prove convergence of the gradient method.

Lemma 1.2 If $f$ is convex and $L$–smooth then
$$f(y) - f(x) \le \langle \nabla f(y), y - x \rangle - \frac{1}{2L} \|\nabla f(y) - \nabla f(x)\|_2^2, \tag{13}$$
$$\langle \nabla f(y) - \nabla f(x), y - x \rangle \ge \frac{1}{L} \|\nabla f(x) - \nabla f(y)\|_2^2 \qquad \text{(co-coercivity)}. \tag{14}$$
Proof: To prove (13), note that for any $z$,

$$f(y) - f(x) = f(y) - f(z) + f(z) - f(x) \overset{(2)+(9)}{\le} \langle \nabla f(y), y - z \rangle + \langle \nabla f(x), z - x \rangle + \frac{L}{2} \|z - x\|_2^2.$$
Minimizing the right-hand side in $z$ we have that
$$z = x - \frac{1}{L} (\nabla f(x) - \nabla f(y)). \tag{15}$$
Substituting this in gives
$$f(y) - f(x) \le \Big\langle \nabla f(y), y - x + \frac{1}{L} (\nabla f(x) - \nabla f(y)) \Big\rangle - \frac{1}{L} \langle \nabla f(x), \nabla f(x) - \nabla f(y) \rangle + \frac{1}{2L} \|\nabla f(x) - \nabla f(y)\|_2^2$$
$$= \langle \nabla f(y), y - x \rangle - \frac{1}{L} \|\nabla f(x) - \nabla f(y)\|_2^2 + \frac{1}{2L} \|\nabla f(x) - \nabla f(y)\|_2^2 \tag{16}$$
$$= \langle \nabla f(y), y - x \rangle - \frac{1}{2L} \|\nabla f(x) - \nabla f(y)\|_2^2. \tag{17}$$
Finally, (14) follows from applying (13) once,
$$f(y) - f(x) \le \langle \nabla f(y), y - x \rangle - \frac{1}{2L} \|\nabla f(y) - \nabla f(x)\|_2^2,$$
then interchanging the roles of $x$ and $y$ to get
$$f(x) - f(y) \le \langle \nabla f(x), x - y \rangle - \frac{1}{2L} \|\nabla f(y) - \nabla f(x)\|_2^2.$$

Adding together the two above inequalities gives
$$0 \le \langle \nabla f(y) - \nabla f(x), y - x \rangle - \frac{1}{L} \|\nabla f(y) - \nabla f(x)\|_2^2.$$
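Co-coercivity (14) can also be checked numerically on a smooth convex quadratic, where the gradient map is linear. A small sketch (our own illustration, with $L = \lambda_{\max}(A)$ as the smoothness constant):

```python
import numpy as np

# Numerical check of co-coercivity (14) for f(x) = 0.5 x^T A x, A PSD:
# <grad f(y) - grad f(x), y - x> >= (1/L) ||grad f(y) - grad f(x)||^2.
rng = np.random.default_rng(2)
M = rng.normal(size=(3, 3))
A = M @ M.T                      # symmetric positive semi-definite
L = np.linalg.eigvalsh(A)[-1]    # smoothness constant

def grad_f(x):
    return A @ x

for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    g = grad_f(y) - grad_f(x)    # = A (y - x)
    assert g @ (y - x) >= (g @ g) / L - 1e-9
```

For this $f$ the check reduces to $d^\top A d \ge \frac{1}{L} d^\top A^2 d$, which holds since $A^2 \preceq L A$.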

1.4 Strong convexity

We can “strengthen” the notion of convexity by defining $\mu$–strong convexity, that is
$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2} \|y - x\|_2^2, \qquad \forall x, y \in \mathbb{R}^d. \tag{18}$$
Minimizing both sides of (18) in $y$ proves the following lemma.

Lemma 1.3 If $f$ is $\mu$–strongly convex then it also satisfies the Polyak–Łojasiewicz condition, that is
$$\|\nabla f(x)\|_2^2 \ge 2\mu \, (f(x) - f(x^*)). \tag{19}$$

Proof: Multiplying (18) by $-1$ and substituting $y = x^*$ we have that
$$f(x) - f(x^*) \le \langle \nabla f(x), x - x^* \rangle - \frac{\mu}{2} \|x^* - x\|_2^2 = -\frac{1}{2} \Big\| \sqrt{\mu} (x - x^*) - \frac{1}{\sqrt{\mu}} \nabla f(x) \Big\|_2^2 + \frac{1}{2\mu} \|\nabla f(x)\|_2^2 \le \frac{1}{2\mu} \|\nabla f(x)\|_2^2.$$

The inequality (19) is of such importance in optimization that it merits its own name.
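The PL inequality (19) is again easy to verify on a strongly convex quadratic, where $\mu = \lambda_{\min}(A)$, $x^* = 0$ and $f(x^*) = 0$. A sketch (our own illustration, not from the notes):

```python
import numpy as np

# Check the Polyak-Lojasiewicz condition (19) for the mu-strongly
# convex quadratic f(x) = 0.5 x^T A x with A positive definite,
# where mu = lambda_min(A), x* = 0 and f(x*) = 0.
rng = np.random.default_rng(3)
M = rng.normal(size=(3, 3))
A = M @ M.T + np.eye(3)          # positive definite
mu = np.linalg.eigvalsh(A)[0]    # smallest eigenvalue

def f(x):
    return 0.5 * float(x @ A @ x)

def grad_f(x):
    return A @ x

for _ in range(1000):
    x = rng.normal(size=3)
    g = grad_f(x)
    assert g @ g >= 2.0 * mu * f(x) - 1e-9
```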

2 Gradient Descent

Consider the problem
$$x^* = \arg\min_{x \in \mathbb{R}^d} f(x), \tag{20}$$
and the following gradient method

$$x^{t+1} = x^t - \alpha \nabla f(x^t), \tag{21}$$
where $f$ is $L$–smooth. We will now prove that the iterates (21) converge. In Theorem 2.2 we will prove sublinear convergence under the assumption that $f$ is convex. In Theorem 2.3 we will prove linear convergence (a stronger form of convergence) under the assumption that $f$ is $\mu$–strongly convex.
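A minimal implementation of the gradient method (21), here applied to a least-squares quadratic with the stepsize $\alpha = 1/L$ (the test problem and constants are our own illustration, not from the notes):

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha, iters):
    """Run the gradient method x_{t+1} = x_t - alpha * grad_f(x_t)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - alpha * grad_f(x)
    return x

# Example: minimize f(x) = 0.5 ||A x - b||^2, which is L-smooth
# with L = lambda_max(A^T A).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
L = np.linalg.eigvalsh(A.T @ A)[-1]
x = gradient_descent(lambda x: A.T @ (A @ x - b), np.zeros(2), 1.0 / L, 2000)
x_star = np.linalg.solve(A, b)   # exact minimizer of this quadratic
assert np.allclose(x, x_star, atol=1e-6)
```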

2.1 Convergence for convex and smooth functions

2.1.1 Average iterates

Theorem 2.1 Let $f$ be convex and $L$–smooth and let $x^t$ for $t = 1, \ldots, T$ be the sequence of iterates generated by the gradient method (21). It follows that

$$\frac{1}{T} \sum_{t=1}^{T} \left[ f(x^t) - f(x^*) \right] \le \frac{\alpha}{4L} \, \frac{f(x^1) - f(x^*)}{T} + \frac{1}{\alpha} \, \frac{\|x^1 - x^*\|_2^2}{T}. \tag{22}$$

2.1.2 Last iterates

Theorem 2.2 Let $f$ be convex and $L$–smooth and let $x^t$ for $t = 1, \ldots, n$ be the sequence of iterates generated by the gradient method (21) with $\alpha = 1/L$. It follows that
$$f(x^n) - f(x^*) \le \frac{2L \|x^1 - x^*\|_2^2}{n - 1}. \tag{23}$$
Proof: Let $f$ be convex and $L$–smooth. It follows that

$$\|x^{t+1} - x^*\|_2^2 = \big\|x^t - x^* - \tfrac{1}{L} \nabla f(x^t)\big\|_2^2 = \|x^t - x^*\|_2^2 - \frac{2}{L} \langle x^t - x^*, \nabla f(x^t) \rangle + \frac{1}{L^2} \|\nabla f(x^t)\|_2^2 \overset{(14)}{\le} \|x^t - x^*\|_2^2 - \frac{1}{L^2} \|\nabla f(x^t)\|_2^2.$$
Thus $\|x^t - x^*\|_2^2$ is a decreasing sequence in $t$. Calling upon (10) and subtracting $f(x^*)$ from both sides gives
$$f(x^{t+1}) - f(x^*) \le f(x^t) - f(x^*) - \frac{1}{2L} \|\nabla f(x^t)\|_2^2. \tag{24}$$
Applying convexity we have that

$$f(x^t) - f(x^*) \le \langle \nabla f(x^t), x^t - x^* \rangle \le \|\nabla f(x^t)\|_2 \|x^t - x^*\|_2 \le \|\nabla f(x^t)\|_2 \|x^1 - x^*\|_2. \tag{25}$$

Isolating $\|\nabla f(x^t)\|_2$ in (25) and inserting it into (24) gives
$$f(x^{t+1}) - f(x^*) \le f(x^t) - f(x^*) - \underbrace{\frac{1}{2L \|x^1 - x^*\|_2^2}}_{\beta} \, (f(x^t) - f(x^*))^2. \tag{26}$$
Let $\delta_t = f(x^t) - f(x^*)$, and note that $\delta_{t+1} \le \delta_t$ by (24). Manipulating (26) we have that

$$\delta_{t+1} \le \delta_t - \beta \delta_t^2 \;\overset{\times \frac{1}{\delta_t \delta_{t+1}}}{\Longleftrightarrow}\; \beta \, \frac{\delta_t}{\delta_{t+1}} \le \frac{1}{\delta_{t+1}} - \frac{1}{\delta_t} \;\overset{\delta_{t+1} \le \delta_t}{\Longrightarrow}\; \beta \le \frac{1}{\delta_{t+1}} - \frac{1}{\delta_t}.$$
Summing up both sides over $t = 1, \ldots, n-1$ and using telescopic cancellation we have that
$$(n-1)\beta \le \frac{1}{\delta_n} - \frac{1}{\delta_1} \le \frac{1}{\delta_n},$$
and re-arranging gives $\delta_n \le \frac{1}{(n-1)\beta} = \frac{2L \|x^1 - x^*\|_2^2}{n-1}$.
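The $O(1/n)$ bound (23) can be observed empirically. The sketch below (our own illustration, with an assumed diagonal quadratic) runs gradient descent with stepsize $1/L$ and checks the bound at every iteration:

```python
import numpy as np

# Empirical check of the O(1/n) bound (23) on a smooth convex
# quadratic f(x) = 0.5 x^T A x, run with stepsize 1/L as in the proof.
A = np.diag([10.0, 1.0, 0.1])
L = 10.0                                # lambda_max(A)
x_star = np.zeros(3)                    # minimizer, f(x*) = 0

def f(x):
    return 0.5 * float(x @ A @ x)

x = np.array([1.0, 1.0, 1.0])           # x^1
r2 = float(np.sum((x - x_star) ** 2))   # ||x^1 - x*||^2
for n in range(2, 200):
    x = x - (1.0 / L) * (A @ x)         # x^n, after n-1 steps
    assert f(x) - f(x_star) <= 2.0 * L * r2 / (n - 1)
```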

2.2 Convergence for strongly convex (PL) and smooth functions

Now we prove some bounds that hold for strongly convex and smooth functions. In fact, as you will observe, we only use the PL inequality (19) to establish the convergence result. Assuming a function satisfies the PL condition is a strictly weaker assumption than assuming strong convexity [2]. This proof is taken from [2].

Theorem 2.3 Let $f$ be $L$–smooth and $\mu$–strongly convex. From a given $x^0 \in \mathbb{R}^d$ and $\frac{1}{L} \ge \alpha > 0$, the iterates
$$x^{t+1} = x^t - \alpha \nabla f(x^t), \tag{27}$$
converge according to
$$\|x^{t+1} - x^*\|_2^2 \le (1 - \alpha\mu)^{t+1} \|x^0 - x^*\|_2^2. \tag{28}$$
In particular, for $\alpha = \frac{1}{L}$ the iterates (21) enjoy linear convergence with a rate of $1 - \mu/L$.

Proof: From (21) we have that

$$\|x^{t+1} - x^*\|_2^2 = \|x^t - x^* - \alpha \nabla f(x^t)\|_2^2 = \|x^t - x^*\|_2^2 - 2\alpha \langle \nabla f(x^t), x^t - x^* \rangle + \alpha^2 \|\nabla f(x^t)\|_2^2$$
$$\overset{(18)}{\le} (1 - \alpha\mu) \|x^t - x^*\|_2^2 - 2\alpha (f(x^t) - f(x^*)) + \alpha^2 \|\nabla f(x^t)\|_2^2$$
$$\overset{(12)}{\le} (1 - \alpha\mu) \|x^t - x^*\|_2^2 - 2\alpha (f(x^t) - f(x^*)) + 2\alpha^2 L (f(x^t) - f(x^*))$$
$$= (1 - \alpha\mu) \|x^t - x^*\|_2^2 - 2\alpha (1 - \alpha L)(f(x^t) - f(x^*)). \tag{29}$$

Since $\frac{1}{L} \ge \alpha$, the term $-2\alpha(1 - \alpha L)(f(x^t) - f(x^*))$ is non-positive, and thus can be safely dropped to give

$$\|x^{t+1} - x^*\|_2^2 \le (1 - \alpha\mu) \|x^t - x^*\|_2^2.$$

It now remains to unroll the recurrence.
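The unrolled bound (28) can be checked on a small strongly convex quadratic; the sketch below (our own illustration, with an assumed diagonal $A$ so that $L$ and $\mu$ are its extreme eigenvalues) verifies it at every iteration for $\alpha = 1/L$:

```python
import numpy as np

# Check the linear convergence bound (28) with alpha = 1/L on the
# strongly convex quadratic f(x) = 0.5 x^T A x:
# ||x^{t+1} - x*||^2 <= (1 - mu/L)^{t+1} ||x^0 - x*||^2.
A = np.diag([4.0, 1.0])
L, mu = 4.0, 1.0
alpha = 1.0 / L
x_star = np.zeros(2)

x = np.array([1.0, -2.0])               # x^0
r2 = float(np.sum((x - x_star) ** 2))   # ||x^0 - x*||^2
for t in range(50):
    x = x - alpha * (A @ x)             # x^{t+1}
    assert np.sum((x - x_star) ** 2) <= (1 - alpha * mu) ** (t + 1) * r2 + 1e-12
```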

3 Stochastic Gradient Descent

Consider the problem
$$w^* = \arg\min_{w \in \mathbb{R}^d} \mathbb{E}_{\mathcal{D}}\left[ f(w, x) \right] \overset{\text{def}}{=} f(w), \tag{30}$$
where $x \sim \mathcal{D}$. Here we will consider the stochastic subgradient method. Let $f(w, x)$ be convex in $w$, and consider the iterates
$$w^{t+1} = w^t - \alpha_t \, g(w^t, x^t), \tag{31}$$

where $g(w^t, x^t) \in \partial_w f(w^t, x^t)$ and
$$\mathbb{E}\left[ g(w^t, x^t) \mid w^t \right] = g(w^t) \in \partial f(w^t), \tag{32}$$
and where $x^t \sim \mathcal{D}$ is sampled i.i.d. at each iteration. We will also consider the variant (33) with a projection step.

3.1 Convex

Theorem 3.1 (With projection step) Let $r, B > 0$ and let $D = \{ w : \|w\|_2 \le r \}$. Let $f(w, x)$ be convex in $w$. Assume that $\mathbb{E}_{\mathcal{D}}\left[ \|g(w^t, x)\|_2^2 \right] \le B^2$ for all $t$, and that the iterates and the optimal point lie in the ball $D$. From a given $w^0 \in \mathbb{R}^d$ and $\alpha_t = \frac{\sqrt{2}\, r}{B \sqrt{t}}$, consider the iterates of the projected stochastic subgradient method

$$w^{t+1} = \operatorname{proj}_D\left( w^t - \alpha_t \, g(w^t, x^t) \right). \tag{33}$$
The iterates satisfy

$$\mathbb{E}\left[ f(\bar{w}^T) \right] - f(w^*) \le \frac{3 r B}{\sqrt{T}}, \tag{34}$$
where $\bar{w}^T = \frac{1}{T} \sum_{t=1}^{T} w^t$. The proof technique for convex functions will always follow the same steps.
Proof: Since the projection onto $D$ is a contraction and $w^* \in D$, we have
$$\|w^{t+1} - w^*\|_2^2 \le \|w^t - w^* - \alpha_t g(w^t, x^t)\|_2^2 = \|w^t - w^*\|_2^2 - 2\alpha_t \langle g(w^t, x^t), w^t - w^* \rangle + \alpha_t^2 \|g(w^t, x^t)\|_2^2. \tag{35}$$
Taking expectation conditioned on $w^t$ we have that

$$\mathbb{E}\left[ \|w^{t+1} - w^*\|_2^2 \mid w^t \right] \overset{(32)}{=} \|w^t - w^*\|_2^2 - 2\alpha_t \langle g(w^t), w^t - w^* \rangle + \alpha_t^2 \, \mathbb{E}_{\mathcal{D}}\left[ \|g(w^t, x^t)\|_2^2 \right]$$
$$\le \|w^t - w^*\|_2^2 - 2\alpha_t \langle g(w^t), w^t - w^* \rangle + \alpha_t^2 B^2$$
$$\overset{(5)}{\le} \|w^t - w^*\|_2^2 - 2\alpha_t (f(w^t) - f(w^*)) + \alpha_t^2 B^2. \tag{36}$$
Taking expectation and re-arranging we have that
$$\mathbb{E}\left[ f(w^t) - f(w^*) \right] \le \frac{1}{2\alpha_t} \mathbb{E}\left[ \|w^t - w^*\|_2^2 \right] - \frac{1}{2\alpha_t} \mathbb{E}\left[ \|w^{t+1} - w^*\|_2^2 \right] + \frac{\alpha_t B^2}{2}. \tag{37}$$

Summing up from $t = 1, \ldots, T$ and using that $\alpha_t$ is a non-increasing sequence, we have that
$$\sum_{t=1}^{T} \mathbb{E}\left[ f(w^t) - f(w^*) \right] \le \frac{1}{2\alpha_1} \|w^1 - w^*\|_2^2 + \frac{1}{2} \sum_{t=1}^{T-1} \left( \frac{1}{\alpha_{t+1}} - \frac{1}{\alpha_t} \right) \mathbb{E}\left[ \|w^{t+1} - w^*\|_2^2 \right] - \frac{1}{2\alpha_{T+1}} \mathbb{E}\left[ \|w^{T+1} - w^*\|_2^2 \right] + \frac{B^2}{2} \sum_{t=1}^{T} \alpha_t \tag{38}$$
$$\overset{\|w\|_2 \le r}{\le} \frac{2r^2}{\alpha_1} + \sum_{t=1}^{T-1} \left( \frac{1}{\alpha_{t+1}} - \frac{1}{\alpha_t} \right) 2r^2 + \frac{B^2}{2} \sum_{t=1}^{T} \alpha_t \overset{\text{telescoping}}{=} \frac{2r^2}{\alpha_T} + \frac{B^2}{2} \sum_{t=1}^{T} \alpha_t, \tag{39}$$
where we used that $\|w^t - w^*\|_2^2 \le (2r)^2$ and dropped the non-positive last term. Finally, letting $\bar{w}^T = \frac{1}{T} \sum_{t=1}^{T} w^t$, dividing by $T$ and using Jensen's inequality we have
$$\mathbb{E}\left[ f(\bar{w}^T) \right] - f(w^*) \overset{\text{Jensen}}{\le} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\left[ f(w^t) - f(w^*) \right] \overset{(39)}{\le} \frac{2r^2}{T \alpha_T} + \frac{B^2}{2T} \sum_{t=1}^{T} \alpha_t. \tag{40}$$

Now plugging in $\alpha_t = \frac{\alpha_0}{\sqrt{t}}$ we have that

$$\mathbb{E}\left[ f(\bar{w}^T) \right] - f(w^*) \le \frac{2r^2}{\alpha_0 \sqrt{T}} + \frac{\alpha_0 B^2}{2T} \sum_{t=1}^{T} \frac{1}{\sqrt{t}} \le \frac{1}{\sqrt{T}} \left( \frac{2r^2}{\alpha_0} + \alpha_0 B^2 \right), \tag{41}$$
where we used $\sum_{t=1}^{T} \frac{1}{\sqrt{t}} \le 2\sqrt{T}$. Minimizing the right-hand side in $\alpha_0$ gives $\alpha_0 = \frac{\sqrt{2}\, r}{B}$ and the result, since
$$\frac{2r^2}{\alpha_0} + \alpha_0 B^2 = 2\sqrt{2}\, r B \le 3 r B.$$
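A sketch of the projected stochastic subgradient method (33) with the decreasing stepsize $\alpha_t = \alpha_0/\sqrt{t}$ and iterate averaging. The toy least-squares objective, the radius $r$, the stepsize constant and the iteration count are all illustrative assumptions, not from the notes:

```python
import numpy as np

# Projected stochastic subgradient method (33) on the toy objective
# f(w) = E[0.5 ((w - w*)^T x)^2] with x ~ N(0, I), projection onto
# the ball D = {w : ||w|| <= r}, stepsize alpha_t = alpha0/sqrt(t),
# and a running average of the iterates (for the Jensen step).
rng = np.random.default_rng(4)
d, r = 3, 2.0
w_star = np.array([1.0, -0.5, 0.5])     # ||w*|| <= r, so w* lies in D

def proj_ball(w, r):
    """Euclidean projection onto {w : ||w||_2 <= r}."""
    nrm = np.linalg.norm(w)
    return w if nrm <= r else (r / nrm) * w

w = np.zeros(d)
avg = np.zeros(d)                       # running average of w^1, ..., w^t
T, alpha0 = 20000, 0.5
for t in range(1, T + 1):
    x = rng.normal(size=d)
    g = ((w - w_star) @ x) * x          # stochastic gradient at w
    w = proj_ball(w - (alpha0 / np.sqrt(t)) * g, r)
    avg += (w - avg) / t
assert np.linalg.norm(avg - w_star) < 0.1
```

Since the noise here is multiplicative (it vanishes at $w^*$), the averaged iterate ends up much closer to $w^*$ than the worst-case $O(1/\sqrt{T})$ bound (34) requires.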

3.2 Strongly convex

3.2.1 Constant stepsize

Theorem 3.2 Let $f(w, x)$ be convex in $w$ and let $f$ be $\mu$–strongly convex. From a given $w^0 \in \mathbb{R}^d$ and constant stepsize $\frac{1}{\mu} > \alpha \equiv \alpha_t > 0$, consider the iterates of the stochastic subgradient method (33). Assume that $\mathbb{E}_{\mathcal{D}}\left[ \|g(w^t, x)\|_2^2 \right] \le B^2$, where $B > 0$, for all $t$.¹ The iterates satisfy
$$\mathbb{E}\left[ \|w^{t+1} - w^*\|_2^2 \right] \le (1 - \alpha\mu)^{t+1} \|w^0 - w^*\|_2^2 + \frac{\alpha B^2}{\mu}. \tag{42}$$

Proof: From (33), using that the projection onto $D$ is a contraction and that $w^* \in D$, we have that

$$\|w^{t+1} - w^*\|_2^2 \le \|w^t - w^* - \alpha g(w^t, x^t)\|_2^2 = \|w^t - w^*\|_2^2 - 2\alpha \langle g(w^t, x^t), w^t - w^* \rangle + \alpha^2 \|g(w^t, x^t)\|_2^2.$$

Taking expectation conditioned on $w^t$ in the above gives

$$\mathbb{E}\left[ \|w^{t+1} - w^*\|_2^2 \mid w^t \right] \overset{(32)}{=} \|w^t - w^*\|_2^2 - 2\alpha \langle g(w^t), w^t - w^* \rangle + \alpha^2 \, \mathbb{E}_{\mathcal{D}}\left[ \|g(w^t, x^t)\|_2^2 \right]$$
$$\le \|w^t - w^*\|_2^2 - 2\alpha \langle g(w^t), w^t - w^* \rangle + \alpha^2 B^2 \tag{43}$$
$$\overset{(18)}{\le} (1 - \alpha\mu) \|w^t - w^*\|_2^2 + \alpha^2 B^2, \tag{44}$$
where in the last step we also dropped the non-positive term $-2\alpha(f(w^t) - f(w^*))$. Taking the total expectation and unrolling the recurrence gives
$$\mathbb{E}\left[ \|w^{t+1} - w^*\|_2^2 \right] \le (1 - \alpha\mu)^{t+1} \|w^0 - w^*\|_2^2 + \sum_{i=0}^{t} (1 - \alpha\mu)^i \alpha^2 B^2. \tag{45}$$
Since
$$\sum_{i=0}^{t} (1 - \alpha\mu)^i \alpha^2 B^2 = \alpha^2 B^2 \, \frac{1 - (1 - \alpha\mu)^{t+1}}{\alpha\mu} \le \frac{\alpha^2 B^2}{\alpha\mu} = \frac{\alpha B^2}{\mu}, \tag{46}$$
combining (45) and (46) gives the result (42).
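The bound (42) says that with a constant stepsize the iterates contract linearly down to a noise floor of size $\alpha B^2 / \mu$. The following sketch illustrates this behaviour; the quadratic objective, the additive noise model and all constants are our own assumptions, not from the notes:

```python
import numpy as np

# Illustration of Theorem 3.2: with constant stepsize, SGD on a
# strongly convex problem converges linearly up to a noise floor
# of order alpha * B^2 / mu, as in bound (42).
rng = np.random.default_rng(5)
mu = 1.0                       # f(w) = 0.5 * mu * ||w - w*||^2
w_star = np.array([2.0, -1.0])
alpha = 0.05
B2 = 4.0                       # assumed bound on E||g||^2 near w*

w = np.zeros(2)
dists = []
for t in range(5000):
    noise = rng.normal(size=2)             # E||noise||^2 = 2 <= B2
    g = mu * (w - w_star) + noise          # stochastic gradient
    w = w - alpha * g
    dists.append(float(np.sum((w - w_star) ** 2)))

# After the transient, the mean squared distance sits near the noise
# floor alpha * B2 / mu = 0.2, far below ||w^0 - w*||^2 = 5.
tail = np.mean(dists[-1000:])
assert tail < alpha * B2 / mu + 0.1
```

Shrinking $\alpha$ lowers the noise floor at the price of a slower contraction factor $(1 - \alpha\mu)$, which is the usual constant-stepsize trade-off.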

¹ This is a very awkward assumption, and should really be proven instead.
