L. Vandenberghe ECE236C (Spring 2020) 1. Gradient method
• gradient method, first-order methods
• convex functions
• Lipschitz continuity of gradient
• strong convexity
• analysis of gradient method
1.1

Gradient method
to minimize a convex differentiable function f : choose an initial point x0 and repeat
xk+1 = xk − tk∇f(xk), k = 0, 1, . . .

the step size tk is constant or determined by line search
Advantages
• every iteration is inexpensive
• does not require second derivatives
Notation
• xk can refer to kth element of a sequence, or to the kth component of vector x • to avoid confusion, we sometimes use x(k) to denote elements of a sequence
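In code, one iteration of the method is a single vector update. A minimal sketch (illustrative only, not part of the notes; the test function is an arbitrary choice):

```python
# Gradient method with constant step size: x_{k+1} = x_k - t * grad_f(x_k).
# grad_f is supplied by the caller; no line search is performed.

def gradient_method(grad_f, x0, t, num_iters):
    """Run the gradient method with constant step size t, starting from x0."""
    x = list(x0)
    for _ in range(num_iters):
        g = grad_f(x)
        x = [xi - t * gi for xi, gi in zip(x, g)]
    return x

# example: f(x) = x1^2 + x2^2 with gradient (2*x1, 2*x2); minimizer is (0, 0)
x = gradient_method(lambda x: [2 * x[0], 2 * x[1]], [1.0, 1.0], t=0.1, num_iters=200)
```

With t = 0.1 each component contracts by a factor 0.8 per iteration, so the iterates approach the minimizer (0, 0).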
Gradient method 1.2

Quadratic example

f(x) = (1/2)(x1² + γx2²)   (with γ > 1)

with exact line search and starting point x(0) = (γ, 1):

‖x(k) − x⋆‖₂ / ‖x(0) − x⋆‖₂ = ((γ − 1)/(γ + 1))ᵏ,   where x⋆ = 0

(figure: zigzagging iterates in the (x1, x2)-plane)

the gradient method is often slow; convergence very dependent on scaling
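The stated rate can be checked numerically. In this sketch (my own test harness; γ = 10 is an arbitrary choice) the exact line search step for the quadratic is t = gᵀg / gᵀAg with A = diag(1, γ):

```python
# Numerical check of the quadratic example: f(x) = (1/2)(x1^2 + gamma*x2^2)
# with exact line search and x(0) = (gamma, 1); iterates contract by
# exactly (gamma - 1)/(gamma + 1) per step.

import math

gamma = 10.0

def grad(x):                       # gradient of f
    return [x[0], gamma * x[1]]

def exact_step(g):                 # exact line search for this quadratic:
    num = g[0] ** 2 + g[1] ** 2    # t = g^T g / g^T A g, A = diag(1, gamma)
    den = g[0] ** 2 + gamma * g[1] ** 2
    return num / den

x = [gamma, 1.0]
rho = (gamma - 1) / (gamma + 1)
for k in range(20):
    g = grad(x)
    t = exact_step(g)
    x = [x[0] - t * g[0], x[1] - t * g[1]]

# after 20 steps, ||x(20) - x*|| / ||x(0) - x*|| should equal rho^20 (x* = 0)
ratio = math.hypot(x[0], x[1]) / math.hypot(gamma, 1.0)
```

For this particular starting point the bound holds with equality at every iteration.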
Gradient method 1.3

Nondifferentiable example

f(x) = √(x1² + γx2²) if |x2| ≤ x1,   f(x) = (x1 + γ|x2|)/√(1 + γ) if |x2| > x1

with exact line search and starting point x(0) = (γ, 1), the iterates converge to a non-optimal point

(figure: iterates stalling at a non-optimal point in the (x1, x2)-plane)

the gradient method does not handle nondifferentiable problems
Gradient method 1.4

First-order methods

address one or both shortcomings of the gradient method
Methods for nondifferentiable or constrained problems
• proximal gradient method
• smoothing methods
• cutting-plane methods
Methods with improved convergence
• conjugate gradient method
• accelerated gradient method
• quasi-Newton methods
Gradient method 1.5

Outline

• gradient method, first-order methods
• convex functions
• Lipschitz continuity of gradient
• strong convexity
• analysis of gradient method

Convex function

a function f is convex if dom f is a convex set and Jensen’s inequality holds:
f (θx + (1 − θ)y) ≤ θ f (x) + (1 − θ) f (y) for all x, y ∈ dom f , θ ∈ [0,1]
First-order condition

for (continuously) differentiable f , Jensen’s inequality can be replaced with
f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) for all x, y ∈ dom f
Second-order condition

for twice differentiable f , Jensen’s inequality can be replaced with

∇²f(x) ⪰ 0 for all x ∈ dom f
Gradient method 1.6

Strictly convex function

f is strictly convex if dom f is a convex set and

f(θx + (1 − θ)y) < θf(x) + (1 − θ)f(y) for all x, y ∈ dom f , x ≠ y, and θ ∈ (0, 1)

strict convexity implies that if a minimizer of f exists, it is unique

First-order condition

for differentiable f , strict Jensen’s inequality can be replaced with

f(y) > f(x) + ∇f(x)ᵀ(y − x) for all x, y ∈ dom f , x ≠ y

Second-order condition

note that ∇²f(x) ≻ 0 is not necessary for strict convexity (cf. f(x) = x⁴)
Gradient method 1.7

Monotonicity of gradient

a differentiable function f is convex if and only if dom f is convex and

(∇f(x) − ∇f(y))ᵀ(x − y) ≥ 0 for all x, y ∈ dom f

i.e., the gradient ∇f : Rⁿ → Rⁿ is a monotone mapping

a differentiable function f is strictly convex if and only if dom f is convex and

(∇f(x) − ∇f(y))ᵀ(x − y) > 0 for all x, y ∈ dom f , x ≠ y

i.e., the gradient ∇f : Rⁿ → Rⁿ is a strictly monotone mapping
Gradient method 1.8

Proof

• if f is differentiable and convex, then

f(y) ≥ f(x) + ∇f(x)ᵀ(y − x),   f(x) ≥ f(y) + ∇f(y)ᵀ(x − y)

combining the inequalities gives (∇f(x) − ∇f(y))ᵀ(x − y) ≥ 0

• if ∇f is monotone, then g′(t) ≥ g′(0) for t ≥ 0 and t ∈ dom g, where

g(t) = f(x + t(y − x)),   g′(t) = ∇f(x + t(y − x))ᵀ(y − x)

hence

f(y) = g(1) = g(0) + ∫₀¹ g′(t) dt ≥ g(0) + g′(0) = f(x) + ∇f(x)ᵀ(y − x)

this is the first-order condition for convexity
Gradient method 1.9

Outline
• gradient method, first-order methods
• convex functions
• Lipschitz continuity of gradient
• strong convexity
• analysis of gradient method Lipschitz continuous gradient the gradient of f is Lipschitz continuous with parameter L > 0 if
k∇ f (x) − ∇ f (y)k∗ ≤ Lkx − yk for all x, y ∈ dom f
• functions f with this property are also called L-smooth
• the definition does not assume convexity of f (and holds for − f if it holds for f )
• in the definition, k · k and k · k∗ are a pair of dual norms:
T u v T kuk∗ = sup = sup u v v,0 kvk kvk=1
this implies a generalized Cauchy–Schwarz inequality
T |u v| ≤ kuk∗kvk for all u, v
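A concrete instance (my example, not from the notes): for ‖ · ‖ = ‖ · ‖₁ the dual norm is ‖u‖∗ = ‖u‖∞, and the generalized Cauchy–Schwarz inequality is tight at a signed coordinate vector:

```python
# Dual of the l1-norm is the l-infinity norm: ||u||_* = max_i |u_i|.
# Check |u^T v| <= ||u||_inf * ||v||_1 on random vectors, and that equality
# holds for v = sign(u_j) e_j at the largest coordinate j.

import random

random.seed(0)
u = [random.uniform(-1, 1) for _ in range(5)]
u_dual = max(abs(ui) for ui in u)          # ||u||_* for the l1-norm

for _ in range(100):
    v = [random.uniform(-1, 1) for _ in range(5)]
    lhs = abs(sum(ui * vi for ui, vi in zip(u, v)))
    assert lhs <= u_dual * sum(abs(vi) for vi in v) + 1e-12

# equality: v = sign(u_j) * e_j with j = argmax |u_j|
j = max(range(5), key=lambda i: abs(u[i]))
v = [0.0] * 5
v[j] = 1.0 if u[j] >= 0 else -1.0
tight = sum(ui * vi for ui, vi in zip(u, v))   # equals ||u||_*
```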
Gradient method 1.10

Choice of norm
Equivalence of norms
• for any two norms ‖ · ‖a, ‖ · ‖b, there exist positive constants c1, c2 such that

c1‖x‖b ≤ ‖x‖a ≤ c2‖x‖b for all x

• the constants depend on the dimension; for example, for x ∈ Rⁿ,

‖x‖₂ ≤ ‖x‖₁ ≤ √n ‖x‖₂,   (1/√n) ‖x‖₂ ≤ ‖x‖∞ ≤ ‖x‖₂
Norm in definition of Lipschitz continuity
• without loss of generality we can use the Euclidean norm ‖ · ‖ = ‖ · ‖∗ = ‖ · ‖₂
• the parameter L depends on the choice of norm
• in complexity bounds, choice of norm can simplify dependence on dimensions
Gradient method 1.11

Quadratic upper bound

suppose ∇f is Lipschitz continuous with parameter L

• this implies (from the generalized Cauchy–Schwarz inequality) that

(∇f(x) − ∇f(y))ᵀ(x − y) ≤ L‖x − y‖² for all x, y ∈ dom f   (1)

• if dom f is convex, (1) is equivalent to

f(y) ≤ f(x) + ∇f(x)ᵀ(y − x) + (L/2)‖y − x‖² for all x, y ∈ dom f   (2)

(figure: f(y) lies below the quadratic upper bound centered at (x, f(x)))
Gradient method 1.12

Proof (of the equivalence of (1) and (2) if dom f is convex)

• consider arbitrary x, y ∈ dom f and define g(t) = f(x + t(y − x))
• g(t) is defined for t ∈ [0, 1] because dom f is convex
• if (1) holds, then

g′(t) − g′(0) = (∇f(x + t(y − x)) − ∇f(x))ᵀ(y − x) ≤ tL‖x − y‖²

integrating from t = 0 to t = 1 gives (2):

f(y) = g(1) = g(0) + ∫₀¹ g′(t) dt ≤ g(0) + g′(0) + (L/2)‖x − y‖² = f(x) + ∇f(x)ᵀ(y − x) + (L/2)‖x − y‖²

• conversely, if (2) holds, then (2) and the same inequality with x, y switched, i.e.,

f(x) ≤ f(y) + ∇f(y)ᵀ(x − y) + (L/2)‖x − y‖²,

can be combined to give (∇f(x) − ∇f(y))ᵀ(x − y) ≤ L‖x − y‖²
Gradient method 1.13

Consequence of quadratic upper bound

if dom f = Rⁿ and f has a minimizer x⋆, then

(1/(2L)) ‖∇f(z)‖∗² ≤ f(z) − f(x⋆) ≤ (L/2) ‖z − x⋆‖² for all z

• the right-hand inequality follows from the upper bound property (2) at x = x⋆, y = z
• the left-hand inequality follows by minimizing the quadratic upper bound for x = z:

inf_y f(y) ≤ inf_y ( f(z) + ∇f(z)ᵀ(y − z) + (L/2)‖y − z‖² )
          = inf_{‖v‖=1} inf_t ( f(z) + t ∇f(z)ᵀv + (L/2) t² )
          = inf_{‖v‖=1} ( f(z) − (∇f(z)ᵀv)² / (2L) )
          = f(z) − (1/(2L)) ‖∇f(z)‖∗²
Gradient method 1.14

Co-coercivity of gradient

if f is convex with dom f = Rⁿ and ∇f is L-Lipschitz continuous, then

(∇f(x) − ∇f(y))ᵀ(x − y) ≥ (1/L) ‖∇f(x) − ∇f(y)‖∗² for all x, y

• this property is known as co-coercivity of ∇f (with parameter 1/L)
• co-coercivity in turn implies Lipschitz continuity of ∇f (by the generalized Cauchy–Schwarz inequality)
• hence, for differentiable convex f with dom f = Rⁿ:

Lipschitz continuity of ∇f
⇒ upper bound property (2) (equivalently, (1))
⇒ co-coercivity of ∇f
⇒ Lipschitz continuity of ∇f

therefore the three properties are equivalent
Gradient method 1.15

Proof of co-coercivity: define two convex functions fx, fy with domain Rⁿ

fx(z) = f(z) − ∇f(x)ᵀz,   fy(z) = f(z) − ∇f(y)ᵀz

• the two functions have L-Lipschitz continuous gradients
• z = x minimizes fx(z); from the left-hand inequality on page 1.14,

f(y) − f(x) − ∇f(x)ᵀ(y − x) = fx(y) − fx(x)
                            ≥ (1/(2L)) ‖∇fx(y)‖∗²
                            = (1/(2L)) ‖∇f(y) − ∇f(x)‖∗²

• similarly, z = y minimizes fy(z); therefore

f(x) − f(y) − ∇f(y)ᵀ(x − y) ≥ (1/(2L)) ‖∇f(y) − ∇f(x)‖∗²

combining the two inequalities shows co-coercivity
Gradient method 1.16

Lipschitz continuity with respect to the Euclidean norm

suppose f is convex with dom f = Rⁿ, and L-smooth for the Euclidean norm:

‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂ for all x, y

• the equivalent property (1) states that

(∇f(x) − ∇f(y))ᵀ(x − y) ≤ L(x − y)ᵀ(x − y) for all x, y

• this is monotonicity of Lx − ∇f(x), i.e., equivalent to the property that

(L/2)‖x‖₂² − f(x) is a convex function

• if f is twice differentiable, the Hessian of this function is LI − ∇²f(x), so

λmax(∇²f(x)) ≤ L for all x

is an equivalent characterization of L-smoothness
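The eigenvalue characterization suggests a numerical check. For the quadratic f(x) = (1/2)xᵀAx the gradient is Ax and the smallest valid Lipschitz constant is λmax(A); this sketch (my own, using numpy; the random test matrix is arbitrary) verifies both the bound and its tightness:

```python
# For f(x) = (1/2) x^T A x, grad f(x) = A x, so
# ||grad f(x) - grad f(y)||_2 <= lambda_max(A) ||x - y||_2,
# with equality along the top eigenvector of A.

import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = B.T @ B                        # symmetric positive semidefinite Hessian
L = np.linalg.eigvalsh(A).max()    # smallest valid Lipschitz constant

# random pairs never violate the bound ...
for _ in range(100):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    assert np.linalg.norm(A @ x - A @ y) <= L * np.linalg.norm(x - y) + 1e-9

# ... and the bound is tight along the top eigenvector of A
v = np.linalg.eigh(A)[1][:, -1]    # eigenvector for the largest eigenvalue
tight = np.linalg.norm(A @ v) / np.linalg.norm(v)
```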
Gradient method 1.17

Outline
• gradient method, first-order methods
• convex functions
• Lipschitz continuity of gradient
• strong convexity
• analysis of gradient method

Strongly convex function

f is strongly convex with parameter m > 0 if dom f is convex and

f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y) − (m/2) θ(1 − θ)‖x − y‖²

holds for all x, y ∈ dom f , θ ∈ [0, 1]

• this is a stronger version of Jensen’s inequality
• it holds if and only if it holds for f restricted to arbitrary lines:

f(x + t(y − x)) − (m/2) t²‖x − y‖²   (3)

is a convex function of t, for all x, y ∈ dom f

• without loss of generality, we can take ‖ · ‖ = ‖ · ‖₂
• however, the strong convexity parameter m depends on the norm used
Gradient method 1.18

Quadratic lower bound

if f is differentiable and m-strongly convex, then

f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (m/2)‖y − x‖² for all x, y ∈ dom f   (4)

(figure: f(y) lies above the quadratic lower bound centered at (x, f(x)))

• this follows from the first-order condition for convexity of (3)
• it implies that the sublevel sets of f are bounded
• if f is closed (has closed sublevel sets), it has a unique minimizer x⋆ and

(m/2) ‖z − x⋆‖² ≤ f(z) − f(x⋆) ≤ (1/(2m)) ‖∇f(z)‖∗² for all z ∈ dom f

(proof as on page 1.14)
Gradient method 1.19

Strong monotonicity

differentiable f is strongly convex if and only if dom f is convex and

(∇f(x) − ∇f(y))ᵀ(x − y) ≥ m‖x − y‖² for all x, y ∈ dom f

this is called strong monotonicity (coercivity) of ∇f

Proof

• one direction follows from (4) and the same inequality with x and y switched
• for the other direction, assume ∇f is strongly monotone and define

g(t) = f(x + t(y − x)) − (m/2) t²‖x − y‖²

then g′(t) is nondecreasing, so g is convex
Gradient method 1.20

Strong convexity with respect to the Euclidean norm

suppose f is m-strongly convex for the Euclidean norm:

f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y) − (m/2) θ(1 − θ)‖x − y‖₂²

for x, y ∈ dom f , θ ∈ [0, 1]

• this is Jensen’s inequality for the function

h(x) = f(x) − (m/2)‖x‖₂²

• therefore f is strongly convex if and only if h is convex
• if f is twice differentiable, h is convex if and only if ∇²f(x) − mI ⪰ 0, or

λmin(∇²f(x)) ≥ m for all x ∈ dom f
Gradient method 1.21

Extension of co-coercivity

suppose f is m-strongly convex and L-smooth for ‖ · ‖₂, and dom f = Rⁿ

• then the function h(x) = f(x) − (m/2)‖x‖₂² is convex and (L − m)-smooth:

0 ≤ (∇h(x) − ∇h(y))ᵀ(x − y)
  = (∇f(x) − ∇f(y))ᵀ(x − y) − m‖x − y‖₂²
  ≤ (L − m)‖x − y‖₂²

• co-coercivity of ∇h can be written as

(∇f(x) − ∇f(y))ᵀ(x − y) ≥ (mL/(m + L)) ‖x − y‖₂² + (1/(m + L)) ‖∇f(x) − ∇f(y)‖₂²

for all x, y ∈ dom f
Gradient method 1.22

Outline
• gradient method, first-order methods
• convex functions
• Lipschitz continuity of gradient
• strong convexity
• analysis of gradient method

Analysis of gradient method

xk+1 = xk − tk∇f(xk), k = 0, 1, . . .

with fixed step size or backtracking line search

Assumptions

1. f is convex and differentiable with dom f = Rⁿ
2. ∇f(x) is L-Lipschitz continuous with respect to the Euclidean norm, with L > 0
3. the optimal value f⋆ = inf_x f(x) is finite and attained at x⋆
Gradient method 1.23

Basic gradient step

• from the quadratic upper bound (page 1.12) with y = x − t∇f(x):

f(x − t∇f(x)) ≤ f(x) − t(1 − Lt/2) ‖∇f(x)‖₂²

• therefore, if x⁺ = x − t∇f(x) and 0 < t ≤ 1/L,

f(x⁺) ≤ f(x) − (t/2) ‖∇f(x)‖₂²   (5)

• from (5) and convexity of f ,

f(x⁺) − f⋆ ≤ ∇f(x)ᵀ(x − x⋆) − (t/2) ‖∇f(x)‖₂²
           = (1/(2t)) ( ‖x − x⋆‖₂² − ‖x − x⋆ − t∇f(x)‖₂² )
           = (1/(2t)) ( ‖x − x⋆‖₂² − ‖x⁺ − x⋆‖₂² )   (6)
Gradient method 1.24

Descent properties

assume ∇f(x) ≠ 0 and 0 < t ≤ 1/L

• the inequality (5) shows that

f(x⁺) < f(x)

• the inequality (6) shows that

‖x⁺ − x⋆‖₂ < ‖x − x⋆‖₂

in the gradient method, the function value and the distance to the optimal set decrease
Gradient method 1.25

Gradient method with constant step size

xk+1 = xk − t∇f(xk), k = 0, 1, . . .

• take x = xi−1, x⁺ = xi in (6) and add the bounds for i = 1, . . . , k:

Σ_{i=1}^k ( f(xi) − f⋆ ) ≤ (1/(2t)) Σ_{i=1}^k ( ‖xi−1 − x⋆‖₂² − ‖xi − x⋆‖₂² )
                         = (1/(2t)) ( ‖x0 − x⋆‖₂² − ‖xk − x⋆‖₂² )
                         ≤ (1/(2t)) ‖x0 − x⋆‖₂²

• since f(xi) is non-increasing (see (5)),

f(xk) − f⋆ ≤ (1/k) Σ_{i=1}^k ( f(xi) − f⋆ ) ≤ (1/(2kt)) ‖x0 − x⋆‖₂²

Conclusion: the number of iterations to reach f(xk) − f⋆ ≤ ε is O(1/ε)
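The 1/k bound can be observed directly. This sketch (not from the notes; the quadratic test problem is an arbitrary choice) runs the method with t = 1/L and checks f(xk) − f⋆ ≤ ‖x0 − x⋆‖₂²/(2kt) at every iteration:

```python
# Gradient method with constant step t = 1/L on the quadratic
# f(x) = (1/2) x^T A x - b^T x, verifying the O(1/k) suboptimality bound.

import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((8, 5))
A = B.T @ B + np.eye(5)            # Hessian of f; strongly convex, so x* exists
b = rng.standard_normal(5)
x_star = np.linalg.solve(A, b)     # minimizer: A x* = b
f = lambda x: 0.5 * x @ A @ x - b @ x
L = np.linalg.eigvalsh(A).max()    # Lipschitz constant of grad f(x) = A x - b

t = 1.0 / L
x0 = np.zeros(5)
x = x0.copy()
ok = True
for k in range(1, 201):
    x = x - t * (A @ x - b)
    bound = np.linalg.norm(x0 - x_star) ** 2 / (2 * k * t)
    ok = ok and (f(x) - f(x_star) <= bound + 1e-12)
```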
Gradient method 1.26

Backtracking line search

initialize tk at tˆ > 0 (for example, tˆ = 1) and take tk := βtk until

f(xk − tk∇f(xk)) < f(xk) − αtk ‖∇f(xk)‖₂²

(figure: f(xk − t∇f(xk)) as a function of t, with the lines f(xk) − αt‖∇f(xk)‖₂² and f(xk) − t‖∇f(xk)‖₂²)
0 < β < 1; we will take α = 1/2 (mostly to simplify proofs)
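The backtracking rule above translates to a short loop. A sketch (illustrative; β = 0.8 and the test function are my choices, with α = 1/2 as in the notes):

```python
# Backtracking line search: start at t = t_hat and multiply by beta until
# f(x - t*g) < f(x) - alpha * t * ||g||^2, with g = grad f(x).

def backtrack(f, grad_f, x, t_hat=1.0, alpha=0.5, beta=0.8):
    g = grad_f(x)
    gnorm2 = sum(gi * gi for gi in g)
    t = t_hat
    while f([xi - t * gi for xi, gi in zip(x, g)]) >= f(x) - alpha * t * gnorm2:
        t *= beta
    return t

# example: f(x) = 10*x1^2 + x2^2, so L = 20 for the Euclidean norm
f = lambda x: 10 * x[0] ** 2 + x[1] ** 2
grad_f = lambda x: [20 * x[0], 2 * x[1]]
t = backtrack(f, grad_f, [1.0, 1.0])
x_new = [1.0 - t * 20, 1.0 - t * 2]
```

Consistent with the analysis on the next page, the accepted step satisfies t ≥ min{tˆ, β/L} = 0.04 here, and the step decreases f.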
Gradient method 1.27

Analysis for backtracking line search

line search with α = 1/2, if f has a Lipschitz continuous gradient

(figure: f(xk − t∇f(xk)) versus the bounds f(xk) − t(1 − tL/2)‖∇f(xk)‖₂² and f(xk) − (t/2)‖∇f(xk)‖₂², which coincide at t = 1/L)

the selected step size satisfies tk ≥ tmin = min{tˆ, β/L}
Gradient method 1.28

Gradient method with backtracking line search

• from the line search condition and convexity of f ,

f(xi+1) ≤ f(xi) − (ti/2) ‖∇f(xi)‖₂²
        ≤ f⋆ + ∇f(xi)ᵀ(xi − x⋆) − (ti/2) ‖∇f(xi)‖₂²
        = f⋆ + (1/(2ti)) ( ‖xi − x⋆‖₂² − ‖xi+1 − x⋆‖₂² )

• this implies ‖xi+1 − x⋆‖₂ ≤ ‖xi − x⋆‖₂, so we can replace ti with tmin ≤ ti:

f(xi+1) − f⋆ ≤ (1/(2tmin)) ( ‖xi − x⋆‖₂² − ‖xi+1 − x⋆‖₂² )

• adding the upper bounds gives the same 1/k bound as with constant step size:

f(xk) − f⋆ ≤ (1/k) Σ_{i=1}^k ( f(xi) − f⋆ ) ≤ (1/(2k tmin)) ‖x0 − x⋆‖₂²
Gradient method 1.29

Gradient method for strongly convex functions

better results exist if we add strong convexity to the assumptions on page 1.23

Analysis for constant step size

if x⁺ = x − t∇f(x) and 0 < t ≤ 2/(m + L):

‖x⁺ − x⋆‖₂² = ‖x − t∇f(x) − x⋆‖₂²
            = ‖x − x⋆‖₂² − 2t ∇f(x)ᵀ(x − x⋆) + t² ‖∇f(x)‖₂²
            ≤ (1 − 2mLt/(m + L)) ‖x − x⋆‖₂² + t(t − 2/(m + L)) ‖∇f(x)‖₂²
            ≤ (1 − 2mLt/(m + L)) ‖x − x⋆‖₂²

(the third step follows from the result on page 1.22)
Gradient method 1.30

Distance to optimum

‖xk − x⋆‖₂² ≤ cᵏ ‖x0 − x⋆‖₂²,   c = 1 − 2mLt/(m + L)

• this implies (linear) convergence
• for t = 2/(m + L), we get c = ((γ − 1)/(γ + 1))² with γ = L/m

Bound on function value (from page 1.14)

f(xk) − f⋆ ≤ (L/2) ‖xk − x⋆‖₂² ≤ (cᵏ L/2) ‖x0 − x⋆‖₂²

Conclusion: the number of iterations to reach f(xk) − f⋆ ≤ ε is O(log(1/ε))
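The linear rate can be checked on a strongly convex quadratic; with a diagonal Hessian diag(m, L) and t = 2/(m + L), the per-step contraction is exactly (γ − 1)/(γ + 1). (Sketch with arbitrary m, L; not from the notes.)

```python
# Check ||x_k - x*||^2 <= c^k ||x0 - x*||^2 with c = ((gamma-1)/(gamma+1))^2,
# gamma = L/m, for the gradient method with t = 2/(m + L).

import numpy as np

m, L = 1.0, 10.0
A = np.diag([m, L])                # Hessian: f(x) = (1/2) x^T A x, x* = 0
t = 2.0 / (m + L)
gamma = L / m
c = ((gamma - 1) / (gamma + 1)) ** 2

x = np.array([1.0, 1.0])
d0 = np.linalg.norm(x) ** 2        # ||x0 - x*||^2
ok = True
for k in range(1, 51):
    x = x - t * (A @ x)
    ok = ok and np.linalg.norm(x) ** 2 <= c ** k * d0 + 1e-12
```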
Gradient method 1.31

Limits on convergence rate of first-order methods

First-order method: any iterative algorithm that selects xk+1 in the set

x0 + span{∇f(x0), ∇f(x1), . . . , ∇f(xk)}

Problem class: any function that satisfies the assumptions on page 1.23

Theorem (Nesterov): for every integer k ≤ (n − 1)/2 and every x0, there exist functions in the problem class such that for any first-order method

f(xk) − f⋆ ≥ (3/32) · L‖x0 − x⋆‖₂² / (k + 1)²
• suggests 1/k rate for gradient method is not optimal
• more recent accelerated gradient methods have 1/k2 convergence (see later)
Gradient method 1.32

References
• A. Beck, First-Order Methods in Optimization (2017), chapter 5.
• Yu. Nesterov, Lectures on Convex Optimization (2018), section 2.1. (The result on page 1.32 is Theorem 2.1.7 in the book.)
• B. T. Polyak, Introduction to Optimization (1987), section 1.4.
• The example on page 1.4 is from N. Z. Shor, Nondifferentiable Optimization and Polynomial Problems (1998), page 37.
Gradient method 1.33