
L. Vandenberghe ECE236C (Spring 2020)

1. Gradient method

• gradient method, first-order methods

• convex functions

• Lipschitz continuity of gradient

• strong convexity

• analysis of gradient method

1.1

Gradient method

to minimize a convex differentiable function f : choose an initial point x0 and repeat

x_{k+1} = x_k − t_k ∇f(x_k),   k = 0, 1, ...

the step size t_k is constant or determined by line search

Advantages

• every iteration is inexpensive

• does not require second derivatives

Notation

• x_k can refer to the kth element of a sequence, or to the kth component of vector x

• to avoid confusion, we sometimes use x^(k) to denote elements of a sequence
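as an illustration (not part of the original slides), the iteration can be implemented in a few lines of NumPy; the quadratic objective and step size below are illustrative choices:

```python
import numpy as np

def gradient_method(grad, x0, t=0.1, iters=100):
    """Basic gradient iteration x_{k+1} = x_k - t * grad(x_k) (constant step)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - t * grad(x)
    return x

# illustrative problem: f(x) = ||x - a||^2 / 2 has gradient x - a and minimizer a
a = np.array([1.0, -2.0])
x = gradient_method(lambda x: x - a, np.zeros(2), t=0.5, iters=60)
```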

Gradient method 1.2

Quadratic example

f(x) = (1/2)(x_1² + γx_2²)   (with γ > 1)

with exact line search and starting point x^(0) = (γ, 1)

‖x^(k) − x⋆‖_2 / ‖x^(0) − x⋆‖_2 = ((γ − 1)/(γ + 1))^k,   where x⋆ = 0

[figure: gradient method iterates on the level curves of f in the (x_1, x_2) plane]

gradient method is often slow; convergence very dependent on scaling
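this convergence ratio can be checked numerically; the sketch below (my own, not from the slides) uses the closed-form exact line search step for a quadratic, t = gᵀg/(gᵀHg):

```python
import numpy as np

gamma = 10.0
H = np.diag([1.0, gamma])                 # f(x) = (1/2) x^T H x

x = np.array([gamma, 1.0])                # starting point x^(0) = (gamma, 1)
r0 = np.linalg.norm(x)                    # ||x^(0) - xstar||_2, with xstar = 0
for k in range(1, 11):
    g = H @ x                             # gradient of f
    t = (g @ g) / (g @ H @ g)             # exact line search step for a quadratic
    x = x - t * g
    predicted = ((gamma - 1) / (gamma + 1)) ** k
    assert abs(np.linalg.norm(x) / r0 - predicted) < 1e-12
```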

Gradient method 1.3

Nondifferentiable example

f(x) = √(x_1² + γx_2²) if |x_2| ≤ x_1,   f(x) = (x_1 + γ|x_2|)/√(1 + γ) if |x_2| > x_1

with exact line search, starting point x^(0) = (γ, 1), converges to a non-optimal point

[figure: gradient method iterates in the (x_1, x_2) plane, converging to a non-optimal point]

gradient method does not handle nondifferentiable problems
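the failure can be reproduced numerically; this sketch is my own, not part of the slides: it implements the exact line search with a ternary search (valid because f is convex, hence unimodal along any line) and checks that the iterates approach the non-optimal point 0 while f is unbounded below:

```python
import numpy as np

gamma = 2.0

def f(x):
    x1, x2 = x
    if abs(x2) <= x1:
        return np.sqrt(x1**2 + gamma * x2**2)
    return (x1 + gamma * abs(x2)) / np.sqrt(1 + gamma)

def grad(x):
    # gradient on the region |x2| < x1, where f is differentiable
    x1, x2 = x
    s = np.sqrt(x1**2 + gamma * x2**2)
    return np.array([x1 / s, gamma * x2 / s])

def exact_step(x, d):
    # ternary search for the exact line search step (f convex => unimodal in t)
    lo, hi = 0.0, 10.0
    for _ in range(200):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(x + m1 * d) < f(x + m2 * d):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

x = np.array([gamma, 1.0])                # starting point (gamma, 1)
for _ in range(60):
    d = -grad(x)
    x = x + exact_step(x, d) * d

# the iterates converge to the origin ...
assert np.linalg.norm(x) < 1e-3
# ... which is not optimal: f is unbounded below
assert f(np.array([-100.0, 0.0])) < f(x) - 1.0
```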

Gradient method 1.4

First-order methods

address one or both shortcomings of the gradient method

Methods for nondifferentiable or constrained problems

• proximal gradient method

• smoothing methods

• cutting-plane methods

Methods with improved convergence

• conjugate gradient method

• accelerated gradient method

• quasi-Newton methods

Gradient method 1.5

Outline

• gradient method, first-order methods

• convex functions

• Lipschitz continuity of gradient

• strong convexity

• analysis of gradient method

Convex function

a function f is convex if dom f is a convex set and Jensen’s inequality holds:

f (θx + (1 − θ)y) ≤ θ f (x) + (1 − θ) f (y) for all x, y ∈ dom f , θ ∈ [0,1]

First-order condition

for (continuously) differentiable f , Jensen’s inequality can be replaced with

f (y) ≥ f (x) + ∇ f (x)T(y − x) for all x, y ∈ dom f

Second-order condition

for twice differentiable f , Jensen’s inequality can be replaced with

∇²f(x) ⪰ 0 for all x ∈ dom f

Gradient method 1.6

Strictly convex function

f is strictly convex if dom f is a convex set and

f(θx + (1 − θ)y) < θf(x) + (1 − θ)f(y) for all x, y ∈ dom f , x ≠ y, and θ ∈ (0,1)

strict convexity implies that if a minimizer of f exists, it is unique

First-order condition

for differentiable f , strict Jensen’s inequality can be replaced with

f(y) > f(x) + ∇f(x)^T(y − x) for all x, y ∈ dom f , x ≠ y

Second-order condition

note that ∇²f(x) ≻ 0 is not necessary for strict convexity (cf., f(x) = x⁴)

Gradient method 1.7

Monotonicity of gradient

a differentiable function f is convex if and only if dom f is convex and

(∇f(x) − ∇f(y))^T (x − y) ≥ 0 for all x, y ∈ dom f

i.e., the gradient ∇f : R^n → R^n is a monotone mapping

a differentiable function f is strictly convex if and only if dom f is convex and

(∇f(x) − ∇f(y))^T (x − y) > 0 for all x, y ∈ dom f , x ≠ y

i.e., the gradient ∇f : R^n → R^n is a strictly monotone mapping

Gradient method 1.8

Proof

• if f is differentiable and convex, then

f (y) ≥ f (x) + ∇ f (x)T(y − x), f (x) ≥ f (y) + ∇ f (y)T(x − y)

combining the inequalities gives (∇ f (x) − ∇ f (y))T(x − y) ≥ 0

• if ∇f is monotone, then g′(t) ≥ g′(0) for t ≥ 0 and t ∈ dom g, where

g(t) = f(x + t(y − x)),   g′(t) = ∇f(x + t(y − x))^T (y − x)

hence

f(y) = g(1) = g(0) + ∫₀¹ g′(t) dt ≥ g(0) + g′(0) = f(x) + ∇f(x)^T(y − x)

this is the first-order condition for convexity

Gradient method 1.9

Outline

• gradient method, first-order methods

• convex functions

• Lipschitz continuity of gradient

• strong convexity

• analysis of gradient method

Lipschitz continuous gradient

the gradient of f is Lipschitz continuous with parameter L > 0 if

‖∇f(x) − ∇f(y)‖_* ≤ L‖x − y‖ for all x, y ∈ dom f

• functions f with this property are also called L-smooth

• the definition does not assume convexity of f (and holds for − f if it holds for f )

• in the definition, ‖ · ‖ and ‖ · ‖_* are a pair of dual norms:

‖u‖_* = sup_{v ≠ 0} (u^T v)/‖v‖ = sup_{‖v‖=1} u^T v

this implies a generalized Cauchy–Schwarz inequality

|u^T v| ≤ ‖u‖_* ‖v‖ for all u, v
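as a quick numerical illustration (mine, not from the slides): for ‖ · ‖ = ‖ · ‖_∞ the dual norm is ‖ · ‖_1, and the generalized Cauchy–Schwarz inequality can be checked on random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)

# for the pair ||.|| = ||.||_inf, the dual norm is ||.||_* = ||.||_1
for _ in range(100):
    u = rng.standard_normal(5)
    v = rng.standard_normal(5)
    # generalized Cauchy-Schwarz: |u^T v| <= ||u||_1 ||v||_inf
    assert abs(u @ v) <= np.sum(np.abs(u)) * np.max(np.abs(v)) + 1e-12

# the supremum in the dual norm definition is attained at v = sign(u)
u = rng.standard_normal(5)
v = np.sign(u)                      # ||v||_inf = 1
assert np.isclose(u @ v, np.sum(np.abs(u)))
```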

Gradient method 1.10

Choice of norm

Equivalence of norms

• for any two norms k · ka, k · kb, there exist positive constants c1, c2 such that

c_1‖x‖_b ≤ ‖x‖_a ≤ c_2‖x‖_b for all x

• constants depend on dimension; for example, for x ∈ Rn,

‖x‖_2 ≤ ‖x‖_1 ≤ √n ‖x‖_2,   (1/√n) ‖x‖_2 ≤ ‖x‖_∞ ≤ ‖x‖_2

Norm in definition of Lipschitz continuity

• without loss of generality we can use the Euclidean norm ‖ · ‖ = ‖ · ‖_* = ‖ · ‖_2

• the parameter L depends on choice of norm

• in complexity bounds, choice of norm can simplify dependence on dimensions

Gradient method 1.11

Quadratic upper bound

suppose ∇f is Lipschitz continuous with parameter L

• this implies (from the generalized Cauchy–Schwarz inequality) that

(∇f(x) − ∇f(y))^T (x − y) ≤ L‖x − y‖² for all x, y ∈ dom f (1)

• if dom f is convex, (1) is equivalent to

f(y) ≤ f(x) + ∇f(x)^T(y − x) + (L/2)‖y − x‖² for all x, y ∈ dom f (2)

[figure: graph of f and its quadratic upper bound at the point (x, f(x))]

Gradient method 1.12

Proof (of the equivalence of (1) and (2) if dom f is convex)

• consider arbitrary x, y ∈ dom f and define g(t) = f(x + t(y − x))

• g(t) is defined for t ∈ [0,1] because dom f is convex

• if (1) holds, then

g′(t) − g′(0) = (∇f(x + t(y − x)) − ∇f(x))^T (y − x) ≤ tL‖x − y‖²

integrating from t = 0 to t = 1 gives (2):

f(y) = g(1) = g(0) + ∫₀¹ g′(t) dt ≤ g(0) + g′(0) + (L/2)‖x − y‖²
     = f(x) + ∇f(x)^T(y − x) + (L/2)‖x − y‖²

• conversely, if (2) holds, then (2) and the same inequality with x, y switched, i.e.,

f(x) ≤ f(y) + ∇f(y)^T(x − y) + (L/2)‖x − y‖²,

can be combined to give (∇f(x) − ∇f(y))^T(x − y) ≤ L‖x − y‖²

Gradient method 1.13

Consequence of quadratic upper bound

if dom f = R^n and f has a minimizer x⋆, then

(1/(2L)) ‖∇f(z)‖_*² ≤ f(z) − f(x⋆) ≤ (L/2) ‖z − x⋆‖² for all z

• right-hand inequality follows from upper bound property (2) at x = x⋆, y = z

• left-hand inequality follows by minimizing quadratic upper bound for x = z

inf_y f(y) ≤ inf_y ( f(z) + ∇f(z)^T(y − z) + (L/2)‖y − z‖² )
           = inf_{‖v‖=1} inf_t ( f(z) + t ∇f(z)^T v + (L/2)t² )
           = inf_{‖v‖=1} ( f(z) − (∇f(z)^T v)²/(2L) )
           = f(z) − (1/(2L)) ‖∇f(z)‖_*²
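both inequalities can be verified numerically; the least-squares sketch below is my own illustration (Euclidean norm, which is its own dual):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

def f(x):
    return 0.5 * np.sum((A @ x - b) ** 2)

def grad(x):
    return A.T @ (A @ x - b)

L = np.linalg.eigvalsh(A.T @ A).max()      # Lipschitz constant of grad f
xstar = np.linalg.lstsq(A, b, rcond=None)[0]
fstar = f(xstar)

for _ in range(100):
    z = rng.standard_normal(5)
    gap = f(z) - fstar
    # (1/(2L)) ||grad f(z)||^2 <= f(z) - fstar <= (L/2) ||z - xstar||^2
    assert np.sum(grad(z) ** 2) / (2 * L) <= gap + 1e-9
    assert gap <= L / 2 * np.sum((z - xstar) ** 2) + 1e-9
```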

Gradient method 1.14

Co-coercivity of gradient

if f is convex with dom f = R^n and ∇f is L-Lipschitz continuous, then

(∇f(x) − ∇f(y))^T(x − y) ≥ (1/L) ‖∇f(x) − ∇f(y)‖_*² for all x, y

• this property is known as co-coercivity of ∇ f (with parameter 1/L)

• co-coercivity in turn implies Lipschitz continuity of ∇ f (by Cauchy–Schwarz)

• hence, for differentiable convex f with dom f = Rn

Lipschitz continuity of ∇f
⇒ upper bound property (2) (equivalently, (1))
⇒ co-coercivity of ∇f
⇒ Lipschitz continuity of ∇f

therefore the three properties are equivalent

Gradient method 1.15

Proof of co-coercivity: define two convex functions f_x, f_y with domain R^n

f_x(z) = f(z) − ∇f(x)^T z,   f_y(z) = f(z) − ∇f(y)^T z

• the two functions have L-Lipschitz continuous gradients

• z = x minimizes f_x(z); from the left-hand inequality on page 1.14,

f(y) − f(x) − ∇f(x)^T(y − x) = f_x(y) − f_x(x)
                             ≥ (1/(2L)) ‖∇f_x(y)‖_*²
                             = (1/(2L)) ‖∇f(y) − ∇f(x)‖_*²

• similarly, z = y minimizes f_y(z); therefore

f(x) − f(y) − ∇f(y)^T(x − y) ≥ (1/(2L)) ‖∇f(y) − ∇f(x)‖_*²

combining the two inequalities shows co-coercivity
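a numerical check of co-coercivity (my own sketch, not from the slides), using a smooth convex logistic-type function for which L = λ_max(AᵀA)/4 is a valid Lipschitz constant of the gradient:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((30, 4))

def f(x):                                  # convex and L-smooth
    return np.logaddexp(0.0, A @ x).sum()  # sum_i log(1 + exp(a_i^T x))

def grad(x):
    return A.T @ (1.0 / (1.0 + np.exp(-(A @ x))))

# sigmoid' <= 1/4, so L = lambda_max(A^T A) / 4 bounds the Hessian
L = np.linalg.eigvalsh(A.T @ A).max() / 4

for _ in range(100):
    x = rng.standard_normal(4)
    y = rng.standard_normal(4)
    dg = grad(x) - grad(y)
    # co-coercivity with parameter 1/L
    assert dg @ (x - y) >= (dg @ dg) / L - 1e-9
```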

Gradient method 1.16

Lipschitz continuity with respect to Euclidean norm

suppose f is convex with dom f = R^n, and L-smooth for the Euclidean norm:

‖∇f(x) − ∇f(y)‖_2 ≤ L‖x − y‖_2 for all x, y

• the equivalent property (1) states that

(∇f(x) − ∇f(y))^T(x − y) ≤ L(x − y)^T(x − y) for all x, y

• this is monotonicity of Lx − ∇ f (x), i.e., equivalent to the property that

(L/2)‖x‖_2² − f(x) is a convex function

• if f is twice differentiable, the Hessian of this function is LI − ∇²f(x):

λ_max(∇²f(x)) ≤ L for all x

is an equivalent characterization of L-smoothness

Gradient method 1.17

Outline

• gradient method, first-order methods

• convex functions

• Lipschitz continuity of gradient

• strong convexity

• analysis of gradient method

Strongly convex function

f is strongly convex with parameter m > 0 if dom f is convex and

f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y) − (m/2) θ(1 − θ)‖x − y‖²

holds for all x, y ∈ dom f , θ ∈ [0,1]

• this is a stronger version of Jensen’s inequality

• it holds if and only if it holds for f restricted to arbitrary lines:

f(x + t(y − x)) − (m/2) t²‖x − y‖² (3)

is a convex function of t, for all x, y ∈ dom f

• without loss of generality, we can take ‖ · ‖ = ‖ · ‖_2

• however, the strong convexity parameter m depends on the norm used

Gradient method 1.18

Quadratic lower bound

if f is differentiable and m-strongly convex, then

f(y) ≥ f(x) + ∇f(x)^T(y − x) + (m/2)‖y − x‖² for all x, y ∈ dom f (4)

[figure: graph of f and its quadratic lower bound at the point (x, f(x))]

• follows from the first-order condition for convexity of (3)

• this implies that the sublevel sets of f are bounded

• if f is closed (has closed sublevel sets), it has a unique minimizer x⋆ and

(m/2)‖z − x⋆‖² ≤ f(z) − f(x⋆) ≤ (1/(2m))‖∇f(z)‖_*² for all z ∈ dom f

(proof as on page 1.14)

Gradient method 1.19

Strong monotonicity

differentiable f is strongly convex if and only if dom f is convex and

(∇f(x) − ∇f(y))^T(x − y) ≥ m‖x − y‖² for all x, y ∈ dom f

this is called strong monotonicity (coercivity) of ∇f

Proof

• one direction follows from (4) and the same inequality with x and y switched

• for the other direction, assume ∇ f is strongly monotone and define

g(t) = f(x + t(y − x)) − (m/2) t²‖x − y‖²

then g′(t) is nondecreasing, so g is convex

Gradient method 1.20

Strong convexity with respect to Euclidean norm

suppose f is m-strongly convex for the Euclidean norm:

f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y) − (m/2) θ(1 − θ)‖x − y‖_2²

for x, y ∈ dom f , θ ∈ [0,1]

• this is Jensen’s inequality for the function

h(x) = f(x) − (m/2)‖x‖_2²

• therefore f is strongly convex if and only if h is convex

• if f is twice differentiable, h is convex if and only if ∇²f(x) − mI ⪰ 0, or

λ_min(∇²f(x)) ≥ m for all x ∈ dom f

Gradient method 1.21

Extension of co-coercivity

suppose f is m-strongly convex and L-smooth for ‖ · ‖_2, and dom f = R^n

• then the function h(x) = f(x) − (m/2)‖x‖_2² is convex and (L − m)-smooth:

0 ≤ (∇h(x) − ∇h(y))^T(x − y)
  = (∇f(x) − ∇f(y))^T(x − y) − m‖x − y‖_2²
  ≤ (L − m)‖x − y‖_2²

• co-coercivity of ∇h can be written as

(∇f(x) − ∇f(y))^T (x − y) ≥ (mL/(m + L))‖x − y‖_2² + (1/(m + L))‖∇f(x) − ∇f(y)‖_2²

for all x, y ∈ dom f
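the inequality can be checked numerically on a strongly convex quadratic (an illustrative sketch of my own; for f(x) = xᵀHx/2 the constants are m = λ_min(H), L = λ_max(H)):

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((10, 6))
H = B.T @ B + 0.5 * np.eye(6)       # Hessian of a strongly convex quadratic

def grad(x):                        # gradient of f(x) = x^T H x / 2
    return H @ x

eig = np.linalg.eigvalsh(H)
m, L = eig.min(), eig.max()         # strong convexity / smoothness constants

for _ in range(100):
    x = rng.standard_normal(6)
    y = rng.standard_normal(6)
    dg, dx = grad(x) - grad(y), x - y
    lhs = dg @ dx
    rhs = (m * L / (m + L)) * (dx @ dx) + (dg @ dg) / (m + L)
    assert lhs >= rhs - 1e-9
```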

Gradient method 1.22

Outline

• gradient method, first-order methods

• convex functions

• Lipschitz continuity of gradient

• strong convexity

• analysis of gradient method

Analysis of gradient method

x_{k+1} = x_k − t_k ∇f(x_k),   k = 0, 1, ...

with fixed step size or backtracking line search

Assumptions

1. f is convex and differentiable with dom f = Rn

2. ∇ f (x) is L-Lipschitz continuous with respect to the Euclidean norm, with L > 0

3. the optimal value f⋆ = inf_x f(x) is finite and attained at x⋆

Gradient method 1.23

Basic gradient step

• from quadratic upper bound (page 1.12) with y = x − t∇ f (x):

f(x − t∇f(x)) ≤ f(x) − t(1 − Lt/2) ‖∇f(x)‖_2²

• therefore, if x+ = x − t∇ f (x) and 0 < t ≤ 1/L,

f(x+) ≤ f(x) − (t/2)‖∇f(x)‖_2² (5)

• from (5) and convexity of f ,

f(x+) − f⋆ ≤ ∇f(x)^T(x − x⋆) − (t/2)‖∇f(x)‖_2²
           = (1/(2t)) ( ‖x − x⋆‖_2² − ‖x − x⋆ − t∇f(x)‖_2² )
           = (1/(2t)) ( ‖x − x⋆‖_2² − ‖x+ − x⋆‖_2² ) (6)

Gradient method 1.24

Descent properties

assume ∇f(x) ≠ 0 and 0 < t ≤ 1/L

• the inequality (5) shows that

f (x+) < f (x)

• the inequality (6) shows that

‖x+ − x⋆‖_2 < ‖x − x⋆‖_2

in the gradient method, function value and distance to the optimal set decrease

Gradient method 1.25

Gradient method with constant step size

xk+1 = xk − t∇ f (xk), k = 0,1,...

• take x = x_{i−1}, x+ = x_i in (6) and add the bounds for i = 1, ..., k:

∑_{i=1}^k ( f(x_i) − f⋆ ) ≤ (1/(2t)) ∑_{i=1}^k ( ‖x_{i−1} − x⋆‖_2² − ‖x_i − x⋆‖_2² )
                          = (1/(2t)) ( ‖x_0 − x⋆‖_2² − ‖x_k − x⋆‖_2² )
                          ≤ (1/(2t)) ‖x_0 − x⋆‖_2²

• since f (xi) is non-increasing (see (5))

f(x_k) − f⋆ ≤ (1/k) ∑_{i=1}^k ( f(x_i) − f⋆ ) ≤ (1/(2kt)) ‖x_0 − x⋆‖_2²

Conclusion: number of iterations to reach f(x_k) − f⋆ ≤ ε is O(1/ε)
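the bound can be verified numerically; the following least-squares sketch (my own, not from the slides) runs the method with t = 1/L and checks the bound at every iteration:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((20, 8))
b = rng.standard_normal(20)

def f(x):
    return 0.5 * np.sum((A @ x - b) ** 2)

L = np.linalg.eigvalsh(A.T @ A).max()      # Lipschitz constant of the gradient
t = 1.0 / L
xstar = np.linalg.lstsq(A, b, rcond=None)[0]
fstar = f(xstar)

x0 = np.zeros(8)
x = x0.copy()
for k in range(1, 101):
    x = x - t * A.T @ (A @ x - b)          # constant step size t = 1/L
    # bound from the analysis: f(x_k) - fstar <= ||x0 - xstar||^2 / (2 k t)
    assert f(x) - fstar <= np.sum((x0 - xstar) ** 2) / (2 * k * t) + 1e-9
```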

Gradient method 1.26

Backtracking line search

initialize t_k at t̂ > 0 (for example, t̂ = 1) and take t_k := βt_k until

f(x_k − t_k∇f(x_k)) < f(x_k) − αt_k ‖∇f(x_k)‖_2²

[figure: f(x_k − t∇f(x_k)) as a function of t, together with the lines f(x_k) − αt‖∇f(x_k)‖_2² and f(x_k) − t‖∇f(x_k)‖_2²]

0 < β < 1; we will take α = 1/2 (mostly to simplify proofs)
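a direct implementation sketch (mine, not from the slides) of the gradient method with this backtracking condition; the stopping test on the gradient norm is an added practical safeguard, and the quadratic test problem is illustrative:

```python
import numpy as np

def backtracking_gradient(f, grad, x0, t_hat=1.0, alpha=0.5, beta=0.5, iters=500):
    """Gradient method with backtracking: shrink t by beta until the
    sufficient-decrease condition f(x - t g) < f(x) - alpha * t * ||g||^2 holds."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = grad(x)
        if g @ g <= 1e-24:          # practical stopping test (added safeguard)
            break
        t = t_hat
        while f(x - t * g) >= f(x) - alpha * t * (g @ g):
            t *= beta               # backtrack until sufficient decrease
        x = x - t * g
    return x

# illustrative problem: ill-conditioned quadratic f(x) = x^T H x / 2
H = np.diag([1.0, 10.0])
x = backtracking_gradient(lambda z: 0.5 * z @ H @ z,
                          lambda z: H @ z,
                          np.array([10.0, 1.0]))
assert np.linalg.norm(x) < 1e-3
```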

Gradient method 1.27

Analysis for backtracking line search

line search with α = 1/2, if f has a Lipschitz continuous gradient

[figure: f(x_k − t∇f(x_k)) as a function of t, bounded above by f(x_k) − t(1 − tL/2)‖∇f(x_k)‖_2², which lies below f(x_k) − (t/2)‖∇f(x_k)‖_2² for t ≤ 1/L]

the selected step size satisfies t_k ≥ t_min = min{t̂, β/L}

Gradient method 1.28

Gradient method with backtracking line search

• from the line search condition and convexity of f ,

f(x_{i+1}) ≤ f(x_i) − (t_i/2)‖∇f(x_i)‖_2²
           ≤ f⋆ + ∇f(x_i)^T(x_i − x⋆) − (t_i/2)‖∇f(x_i)‖_2²
           = f⋆ + (1/(2t_i)) ( ‖x_i − x⋆‖_2² − ‖x_{i+1} − x⋆‖_2² )

• this implies ‖x_{i+1} − x⋆‖_2 ≤ ‖x_i − x⋆‖_2, so we can replace t_i with t_min ≤ t_i:

f(x_{i+1}) − f⋆ ≤ (1/(2t_min)) ( ‖x_i − x⋆‖_2² − ‖x_{i+1} − x⋆‖_2² )

• adding the upper bounds gives the same 1/k bound as with constant step size

f(x_k) − f⋆ ≤ (1/k) ∑_{i=1}^k ( f(x_i) − f⋆ ) ≤ (1/(2k t_min)) ‖x_0 − x⋆‖_2²

Gradient method 1.29

Gradient method for strongly convex functions

better results exist if we add strong convexity to the assumptions on p. 1.23

Analysis for constant step size

if x+ = x − t∇f(x) and 0 < t ≤ 2/(m + L):

‖x+ − x⋆‖_2² = ‖x − t∇f(x) − x⋆‖_2²
            = ‖x − x⋆‖_2² − 2t ∇f(x)^T(x − x⋆) + t²‖∇f(x)‖_2²
            ≤ (1 − t(2mL/(m + L))) ‖x − x⋆‖_2² + t(t − 2/(m + L)) ‖∇f(x)‖_2²
            ≤ (1 − t(2mL/(m + L))) ‖x − x⋆‖_2²

(step 3 follows from result on page 1.22)

Gradient method 1.30

Distance to optimum

‖x_k − x⋆‖_2² ≤ c^k ‖x_0 − x⋆‖_2²,   c = 1 − t(2mL/(m + L))

• implies (linear) convergence

• for t = 2/(m + L), we get c = ((γ − 1)/(γ + 1))² with γ = L/m

Bound on function value (from page 1.14)

f(x_k) − f⋆ ≤ (L/2)‖x_k − x⋆‖_2² ≤ (c^k L/2)‖x_0 − x⋆‖_2²

Conclusion: number of iterations to reach f(x_k) − f⋆ ≤ ε is O(log(1/ε))
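the linear rate can be checked on a simple strongly convex quadratic (an illustrative sketch of my own; the diagonal Hessian makes m and L explicit):

```python
import numpy as np

m, L = 1.0, 10.0
H = np.diag([m, L])                 # f(x) = x^T H x / 2: m-strongly convex, L-smooth
t = 2.0 / (m + L)                   # step size from the analysis
c = 1 - t * 2 * m * L / (m + L)     # contraction factor, here ((L/m - 1)/(L/m + 1))^2

x0 = np.array([1.0, 1.0])
x = x0.copy()
for k in range(1, 51):
    x = x - t * (H @ x)
    # distance bound: ||x_k - xstar||^2 <= c^k ||x0 - xstar||^2  (xstar = 0)
    assert np.sum(x ** 2) <= c ** k * np.sum(x0 ** 2) + 1e-12
```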

Gradient method 1.31

Limits on convergence rate of first-order methods

First-order method: any iterative method that selects x_{k+1} in the set

x_0 + span{∇f(x_0), ∇f(x_1), ..., ∇f(x_k)}

Problem class: any function that satisfies the assumptions on page 1.23

Theorem (Nesterov): for every integer k ≤ (n − 1)/2 and every x0, there exist functions in the problem class such that for any first-order method

f(x_k) − f⋆ ≥ (3/32) L‖x_0 − x⋆‖_2² / (k + 1)²

• suggests 1/k rate for gradient method is not optimal

• more recent accelerated gradient methods have 1/k2 convergence (see later)

Gradient method 1.32

References

• A. Beck, First-Order Methods in Optimization (2017), chapter 5.

• Yu. Nesterov, Lectures on Convex Optimization (2018), section 2.1. (The result on page 1.32 is Theorem 2.1.7 in the book.)

• B. T. Polyak, Introduction to Optimization (1987), section 1.4.

• The example on page 1.4 is from N. Z. Shor, Nondifferentiable Optimization and Polynomial Problems (1998), page 37.

Gradient method 1.33