Lecture 13 Gradient Methods for Constrained Optimization
October 16, 2008

Outline
• Gradient Projection Algorithm
• Convergence Rate
Constrained Minimization

minimize f(x) subject to x ∈ X

• Assumption 1:
• The function f is convex and continuously differentiable over R^n
• The set X is closed and convex
• The optimal value f* = inf_{x∈X} f(x) is finite
• Gradient projection algorithm
x_{k+1} = P_X[x_k − α_k ∇f(x_k)]
starting with x_0 ∈ X.
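To make the iteration concrete, here is a minimal sketch (an illustrative example, not part of the original slides) of the gradient projection method for f(x) = (1/2)‖Ax − b‖^2 over the box X = [0, 1]^n, where the projection P_X reduces to coordinatewise clipping; the data, dimensions, and stepsize are arbitrary choices.

    import numpy as np

    # Illustrative problem (assumed data): f(x) = 0.5 * ||A x - b||^2 over X = [0, 1]^n.
    rng = np.random.default_rng(0)
    n = 5
    A = rng.standard_normal((8, n))
    b = rng.standard_normal(8)

    def grad_f(x):
        return A.T @ (A @ x - b)

    def project_X(y):
        # Euclidean projection onto the box [0, 1]^n is coordinatewise clipping.
        return np.clip(y, 0.0, 1.0)

    L = np.linalg.norm(A.T @ A, 2)   # Lipschitz constant of the gradient
    alpha = 1.0 / L                  # constant stepsize with 0 < alpha < 2/L

    x = np.zeros(n)                  # x_0 in X
    for k in range(200):
        x = project_X(x - alpha * grad_f(x))   # x_{k+1} = P_X[x_k - alpha*grad f(x_k)]

    print("approximate constrained minimizer:", x)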
Bounded Gradients
Theorem 1 Let Assumption 1 hold, and suppose that the gradients are uniformly bounded over the set X, i.e., ‖∇f(x)‖ ≤ L for all x ∈ X and some scalar L > 0. Then, the gradient projection method generates a sequence {x_k} ⊂ X such that
• When the constant stepsize α_k ≡ α is used, we have
lim inf_{k→∞} f(x_k) ≤ f* + αL^2/2
• When a diminishing stepsize is used with Σ_k α_k = +∞, we have
lim inf_{k→∞} f(x_k) = f*.

Proof: We use the projection properties and a line of analysis similar to that for the unconstrained method. HWK 6
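As a small illustration of the two stepsize rules (my own toy example, not from the slides): f(x) = √(1 + x^2) has derivatives bounded by L = 1, and over X = [1, 5] its optimal value is f* = √2. The sketch below tracks the best function value found under a constant and a diminishing stepsize; in this smooth example both reach f*, while the theorem only guarantees the stated lim inf bounds in general.

    import numpy as np

    def f(x):            # smooth convex function with |f'(x)| <= 1
        return np.sqrt(1.0 + x * x)

    def grad(x):
        return x / np.sqrt(1.0 + x * x)

    def project(x):      # X = [1, 5], so f* = f(1) = sqrt(2)
        return min(max(x, 1.0), 5.0)

    def run(stepsize, iters=500, x0=5.0):
        x, best = x0, float("inf")
        for k in range(iters):
            x = project(x - stepsize(k) * grad(x))
            best = min(best, f(x))   # tracks inf_k f(x_k)
        return best

    f_star = f(1.0)
    print("constant alpha = 0.5       :", run(lambda k: 0.5) - f_star)
    print("diminishing alpha_k=1/(k+1):", run(lambda k: 1.0 / (k + 1)) - f_star)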
Lipschitz Gradients
• Lipschitz Gradient Lemma For a differentiable convex function f with Lipschitz continuous gradients, we have for all x, y ∈ R^n,

(1/L) ‖∇f(x) − ∇f(y)‖^2 ≤ (∇f(x) − ∇f(y))^T (x − y),

where L is a Lipschitz constant.
• Theorem 2 Let Assumption 1 hold, and assume that the gradients of f are Lipschitz continuous over X. Suppose that the optimal solution set X* is not empty. Then, for a constant stepsize α_k ≡ α with 0 < α < 2/L, the sequence {x_k} converges to an optimal point, i.e.,
lim_{k→∞} ‖x_k − x*‖ = 0 for some x* ∈ X*.
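Before the proof, here is a quick numerical sanity check (my own, not from the slides) of the Lipschitz Gradient Lemma for the quadratic f(x) = (1/2) x^T Q x, for which ∇f(x) = Qx and L = λ_max(Q); the data is random and purely illustrative.

    import numpy as np

    # Check (1/L)*||grad f(x) - grad f(y)||^2 <= (grad f(x) - grad f(y))^T (x - y)
    # for f(x) = 0.5 * x^T Q x with Q symmetric positive semidefinite.
    rng = np.random.default_rng(1)
    B = rng.standard_normal((6, 6))
    Q = B.T @ B                          # symmetric PSD
    L = np.linalg.eigvalsh(Q).max()      # Lipschitz constant of the gradient

    for _ in range(5):
        x, y = rng.standard_normal(6), rng.standard_normal(6)
        g_diff = Q @ x - Q @ y           # grad f(x) - grad f(y)
        lhs = (g_diff @ g_diff) / L
        rhs = g_diff @ (x - y)
        assert lhs <= rhs + 1e-10        # the lemma's inequality holds
    print("co-coercivity inequality verified on random samples")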
Proof:

Fact 1: If z = P_X[z − v] for some v ∈ R^n, then z = P_X[z − τv] for any τ > 0.

Fact 2: z ∈ X* if and only if z = P_X[z − ∇f(z)].

These facts imply that z ∈ X* if and only if z = P_X[z − τ∇f(z)] for any τ > 0. By using the definition of the method and the preceding relation with τ = α, we obtain for any z ∈ X*,
‖x_{k+1} − z‖^2 = ‖P_X[x_k − α∇f(x_k)] − P_X[z − α∇f(z)]‖^2.

By the non-expansiveness of the projection, it follows that

‖x_{k+1} − z‖^2 ≤ ‖x_k − z − α(∇f(x_k) − ∇f(z))‖^2
              = ‖x_k − z‖^2 − 2α(x_k − z)^T (∇f(x_k) − ∇f(z)) + α^2 ‖∇f(x_k) − ∇f(z)‖^2.
Using the Lipschitz Gradient Lemma, we obtain for any z ∈ X*,
‖x_{k+1} − z‖^2 ≤ ‖x_k − z‖^2 − (α/L)(2 − αL) ‖∇f(x_k) − ∇f(z)‖^2.    (1)
Hence, for all k,
(α/L)(2 − αL) ‖∇f(x_k) − ∇f(z)‖^2 ≤ ‖x_k − z‖^2 − ‖x_{k+1} − z‖^2.
By summing the preceding relations from arbitrary K to N, with K < N, we obtain
(α/L)(2 − αL) Σ_{k=K}^{N} ‖∇f(x_k) − ∇f(z)‖^2 ≤ ‖x_K − z‖^2 − ‖x_{N+1} − z‖^2 ≤ ‖x_K − z‖^2.
In particular, setting K = 0 and letting N → ∞, we see that
(α/L)(2 − αL) Σ_{k=0}^{∞} ‖∇f(x_k) − ∇f(z)‖^2 ≤ ‖x_0 − z‖^2 < ∞.    (2)
As a consequence, we also have
lim_{k→∞} ∇f(x_k) = ∇f(z).    (3)
By discarding the non-positive term on the right-hand side of Eq. (1), we have ‖x_{k+1} − z‖^2 ≤ ‖x_k − z‖^2, and therefore, for any z ∈ X* and all k,

‖x_{k+1} − z‖^2 ≤ ‖x_k − z‖^2 + (α/L)(2 − αL) ‖∇f(x_k) − ∇f(z)‖^2.

By summing these relations over k = K, ..., N for arbitrary K and N with K < N, we obtain
‖x_{N+1} − z‖^2 ≤ ‖x_K − z‖^2 + (α/L)(2 − αL) Σ_{k=K}^{N} ‖∇f(x_k) − ∇f(z)‖^2.

Taking the lim sup as N → ∞, we obtain
lim sup_{N→∞} ‖x_{N+1} − z‖^2 ≤ ‖x_K − z‖^2 + (α/L)(2 − αL) Σ_{k=K}^{∞} ‖∇f(x_k) − ∇f(z)‖^2.
Now, taking liminf as K → ∞ yields
lim sup_{N→∞} ‖x_{N+1} − z‖^2 ≤ lim inf_{K→∞} ‖x_K − z‖^2 + (α/L)(2 − αL) lim_{K→∞} Σ_{k=K}^{∞} ‖∇f(x_k) − ∇f(z)‖^2
                             = lim inf_{K→∞} ‖x_K − z‖^2,
where the equality follows in view of the relation in (2). Thus, the sequence {‖x_k − z‖} is convergent for every z ∈ X*. By the inequality in Eq. (1), we have
‖x_k − z‖ ≤ ‖x_0 − z‖ for all k.
Hence, the sequence {x_k} is bounded, and it has an accumulation point. Since the scalar sequence {‖x_k − z‖} is convergent for every z ∈ X*, the sequence {x_k} must be convergent.
Suppose now that x_k → x̄. By considering the definition of the iterate x_{k+1}, we have
x_{k+1} = P_X[x_k − α∇f(x_k)].
Letting k → ∞, and using x_k → x̄ and the continuity of the gradient ∇f(x), we obtain
x̄ = P_X[x̄ − α∇f(x̄)].

In view of Facts 1 and 2, the preceding relation is equivalent to x̄ ∈ X*.
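The fixed-point characterization used above is easy to test numerically. The sketch below (an illustrative example, not from the slides) solves a small box-constrained least-squares problem with the gradient projection method and then checks that ‖x̄ − P_X[x̄ − τ∇f(x̄)]‖ is numerically zero for several τ > 0, as Facts 1 and 2 assert.

    import numpy as np

    # Check of Facts 1-2: at a (numerically computed) constrained minimizer x_bar,
    # x_bar = P_X[x_bar - tau * grad f(x_bar)] for every tau > 0.  Assumed data:
    # least squares over the box [0, 1]^n.
    rng = np.random.default_rng(2)
    n = 4
    A = rng.standard_normal((6, n))
    b = rng.standard_normal(6)

    def grad_f(x):
        return A.T @ (A @ x - b)

    def P_X(y):
        return np.clip(y, 0.0, 1.0)

    L = np.linalg.norm(A.T @ A, 2)
    x = np.zeros(n)
    for _ in range(5000):                 # run the method to (near) convergence
        x = P_X(x - (1.0 / L) * grad_f(x))

    for tau in (0.1, 1.0, 10.0):
        residual = np.linalg.norm(x - P_X(x - tau * grad_f(x)))
        print(f"tau = {tau:5.1f}: fixed-point residual = {residual:.2e}")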
Modes of Convexity: Strict and Strong
• Def. f is strictly convex if for all x ≠ y and α ∈ (0, 1) we have
f(αx + (1 − α)y) < αf(x) + (1 − α)f(y)
• Def. f is strongly convex if there exists a scalar ν > 0 such that
f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y) − (ν/2) α(1 − α) ‖x − y‖^2
for all x, y ∈ R^n.

Modes of Convexity: Differentiable Function

• Let f : R^n → R be differentiable. Then:
• f is convex if and only if
f(x) + ∇f(x)^T (y − x) ≤ f(y) for all x, y ∈ R^n
• f is strictly convex if and only if
f(x) + ∇f(x)^T (y − x) < f(y) for all x ≠ y
• f is strongly convex with constant ν if and only if
f(x) + ∇f(x)^T (y − x) + (ν/2) ‖y − x‖^2 ≤ f(y) for all x, y ∈ R^n

• Let f : R^n → R be differentiable. Then, in terms of the gradient:
• f is convex if and only if
(∇f(x) − ∇f(y))^T (x − y) ≥ 0 for all x, y ∈ R^n
• f is strictly convex if and only if
(∇f(x) − ∇f(y))^T (x − y) > 0 for all x ≠ y
• f is strongly convex with constant ν if and only if
(∇f(x) − ∇f(y))^T (x − y) ≥ ν ‖x − y‖^2 for all x, y ∈ R^n

Modes of Convexity: Twice Differentiable Function

• Let f : R^n → R be twice differentiable. Then:
• f is convex if and only if ∇^2 f(x) ⪰ 0 for all x ∈ R^n
• f is strictly convex if ∇^2 f(x) ≻ 0 for all x ∈ R^n
• f is strongly convex with constant ν if and only if ∇^2 f(x) ⪰ νI for all x ∈ R^n

Strong Convexity: Implications

Let f be continuously differentiable and strongly convex* over R^n with constant m.
• Implications:
• Lower bound on f over R^n: for all x, y ∈ R^n,
f(y) ≥ f(x) + ∇f(x)^T (y − x) + (m/2) ‖x − y‖^2    (4)
• Minimizing the right-hand side with respect to y gives
f(y) ≥ f(x) − (1/(2m)) ‖∇f(x)‖^2
• Taking the minimum over y ∈ R^n on the left-hand side gives
f(x) − f* ≤ (1/(2m)) ‖∇f(x)‖^2
• Useful as a stopping criterion (if you know m)

* Strong convexity over R^n can be replaced by strong convexity over a set X; then all the relations stay valid over the set X.

• Relation (4) with x = x_0 and f(y) ≤ f(x_0) implies that the level set L_f(f(x_0)) is bounded
• Relation (4) also yields, for an optimal x* and any x ∈ R^n,
(m/2) ‖x − x*‖^2 ≤ f(x) − f(x*)
• Last two bullets: HWK 6 assignment.

Convergence Rate: Once Differentiable

Theorem 3 Let Assumption 1 hold, and assume that the gradients of f are Lipschitz continuous over X with constant L > 0. Suppose that the function is strongly convex with constant m > 0. Then:
• A solution x* exists and it is unique.
• The iterates generated by the gradient projection method with α_k ≡ α and 0 < α < 2/L converge to x* with a geometric rate, i.e.,
‖x_{k+1} − x*‖^2 ≤ q ‖x_k − x*‖^2 for all k,
with q ∈ (0, 1) depending on m and L; hence ‖x_k − x*‖^2 ≤ q^k ‖x_0 − x*‖^2.
Proof: HWK 6.
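As a quick numerical illustration of the geometric rate (my own example, not from the slides), the sketch below runs the method on an unconstrained strongly convex quadratic, so that P_X is the identity and x* = 0, and reports the observed per-step contraction of ‖x_k − x*‖; the eigenvalues and stepsize are arbitrary choices, and the theoretical factor max{|1 − αm|, |1 − αL|} quoted for comparison is the one from Theorem 4 below.

    import numpy as np

    # Observed geometric rate for f(x) = 0.5 * x^T Q x with X = R^n (P_X = identity,
    # x* = 0); m and L are the smallest/largest eigenvalues of Q.  Assumed data only.
    rng = np.random.default_rng(3)
    eigs = np.array([1.0, 2.0, 5.0, 10.0])           # m = 1, L = 10
    U, _ = np.linalg.qr(rng.standard_normal((4, 4)))
    Q = U @ np.diag(eigs) @ U.T
    m, L = eigs.min(), eigs.max()

    alpha = 1.0 / L                                  # constant stepsize, 0 < alpha < 2/L
    x = rng.standard_normal(4)
    for k in range(30):
        x_new = x - alpha * (Q @ x)                  # gradient step (projection is identity)
        ratio = np.linalg.norm(x_new) / np.linalg.norm(x)
        x = x_new

    print("observed ||x_{k+1} - x*|| / ||x_k - x*|| :", round(float(ratio), 4))
    print("bound max(|1 - alpha*m|, |1 - alpha*L|)  :", max(abs(1 - alpha * m), abs(1 - alpha * L)))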
Convergence Rate: Twice Differentiable

Theorem 4 Let Assumption 1 hold. Assume that the function is twice continuously differentiable and strongly convex with constant m > 0. Assume also that ∇^2 f(x) ⪯ LI for all x ∈ X. Then:
• A solution x* exists and it is unique.
• The iterates generated by the gradient projection method with α_k ≡ α and 0 < α < 2/L converge to x* with a geometric rate, i.e.,
‖x_{k+1} − x*‖ ≤ q ‖x_k − x*‖ for all k,
with q = max{|1 − αm|, |1 − αL|}.

Proof: The q here is different from the one in the preceding theorem. Since ∇^2 f(x) ⪯ LI for all x ∈ X, it follows that the gradients are Lipschitz continuous over X with constant L. By the definition of the method and the non-expansive property of the projection, we have for z = x* and any k,

‖x_{k+1} − x*‖^2 = ‖P_X[x_k − α∇f(x_k)] − P_X[x* − α∇f(x*)]‖^2
               ≤ ‖x_k − x* − α(∇f(x_k) − ∇f(x*))‖^2.    (5)

Mean Value Theorem for vector functions: when g : R^n → R^n is differentiable on the segment [x, y], we have

g(y) = g(x) + ∫_0^1 ∇g(x + τ(y − x)) dτ (y − x).

Applying this theorem with g = ∇f, y = x_k and x = x*, we obtain

∇f(x_k) = ∇f(x*) + ∫_0^1 ∇^2 f(x* + τ(x_k − x*)) dτ (x_k − x*).

Hence,

∇f(x_k) − ∇f(x*) = A_k (x_k − x*),  where  A_k = ∫_0^1 ∇^2 f(x* + τ(x_k − x*)) dτ.    (6)

Using this in relation (5), we obtain

‖x_{k+1} − x*‖ ≤ ‖(I − αA_k)(x_k − x*)‖ ≤ ‖I − αA_k‖ ‖x_k − x*‖.

The matrix A_k is symmetric, and hence ‖I − αA_k‖ is equal to the maximum absolute eigenvalue of I − αA_k, i.e.,

‖I − αA_k‖ = max{|1 − αλ_max(A_k)|, |1 − αλ_min(A_k)|}.

In view of Eq. (6), we have A_k = ∫_0^1 ∇^2 f(x* + τ(x_k − x*)) dτ. By the strong convexity of f, we have ∇^2 f(x) ⪰ mI for all x, while by the given condition, we have ∇^2 f(x) ⪯ LI. Therefore λ_max(A_k) ≤ L and λ_min(A_k) ≥ m, implying that

‖I − αA_k‖ ≤ max{|1 − αm|, |1 − αL|},

which, combined with the preceding inequality, gives ‖x_{k+1} − x*‖ ≤ q ‖x_k − x*‖.

• The parameter q is minimized when α* = 2/(m + L), in which case
q* = (L − m)/(L + m), equivalently q* = (cond(f) − 1)/(cond(f) + 1),
with cond(f) = L/m.

Upper Bound on Hessian and f over the Level Set

For a twice differentiable strongly convex f:
• The level set L_0 = {x | f(x) ≤ f(x_0)} is bounded
• The maximum eigenvalue of the Hessian ∇^2 f(x) is a continuous function of x over L_0
• Hence, the maximum eigenvalue of the Hessian is bounded over L_0: there is a constant M such that ∇^2 f(x) ⪯ MI for all x ∈ L_0
• Upper bound on f over L_0:
f(y) ≤ f(x) + ∇f(x)^T (y − x) + (M/2) ‖y − x‖^2 for all x, y ∈ L_0
• Minimizing over y on both sides gives
f* ≤ f(x) − (1/(2M)) ‖∇f(x)‖^2 for all x ∈ L_0

Condition Number of a Matrix

For a twice differentiable strongly convex f: mI ⪯ ∇^2 f(x) ⪯ MI for all x ∈ L_0.
• The condition number cond(A) of a positive definite matrix A:
cond(A) = (largest eigenvalue of A) / (smallest eigenvalue of A)
• The ratio M/m is an upper bound on the condition number of ∇^2 f(x) for every x ∈ L_0

Strong Convexity and Condition Number of Level Sets

Assume a minimizer x* of f over R^n exists and f is strongly convex. Consider the level set L_0 = {x | f(x) ≤ f(x_0)}.
• We have seen that mI ⪯ ∇^2 f(x) ⪯ MI for all x ∈ L_0
• Also, we have
f* + (m/2) ‖x − x*‖^2 ≤ f(x) ≤ f* + (M/2) ‖x − x*‖^2
• Hence B_inner ⊆ L_0 ⊆ B_outer, where
B_inner = {x | ‖x − x*‖ ≤ √(2(f(x_0) − f*)/M)}
B_outer = {x | ‖x − x*‖ ≤ √(2(f(x_0) − f*)/m)}
• Therefore, we have a bound on cond(L_0):
cond(L_0) ≤ M/m
• The condition number of level sets affects the efficiency of the algorithms
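To tie the condition-number discussion back to the rate of the gradient method, the following sketch (an illustrative experiment, not part of the slides) runs gradient descent on diagonal quadratics f(x) = (1/2)(m x_1^2 + L x_2^2) with the stepsize α* = 2/(m + L) and compares the observed contraction factor with q* = (cond(f) − 1)/(cond(f) + 1).

    import numpy as np

    # For each condition number kappa = L/m, run gradient descent with the stepsize
    # alpha* = 2/(m + L) on f(x) = 0.5*(m*x1^2 + L*x2^2) (minimizer x* = 0) and report
    # the observed per-step contraction of ||x_k - x*||.  Purely illustrative data.
    for kappa in (2.0, 10.0, 100.0):
        m, L = 1.0, kappa
        alpha = 2.0 / (m + L)
        x = np.array([1.0, 1.0])
        for _ in range(200):
            grad = np.array([m * x[0], L * x[1]])
            x_new = x - alpha * grad
            ratio = np.linalg.norm(x_new) / np.linalg.norm(x)
            x = x_new
        q_star = (kappa - 1.0) / (kappa + 1.0)
        print(f"kappa = {kappa:6.1f}: observed ratio = {ratio:.4f}, q* = {q_star:.4f}")

The ill-conditioned case (large kappa) contracts much more slowly, which is the sense in which the condition number of the level sets governs the efficiency of the method.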