Lecture 13 Gradient Methods for Constrained Optimization
October 16, 2008

Outline
• Gradient Projection Algorithm
• Convergence Rate
Constrained Minimization

minimize f(x) subject to x ∈ X

• Assumption 1:
• The function f is convex and continuously differentiable over R^n
• The set X is closed and convex
• The optimal value f* = inf_{x∈X} f(x) is finite
• Gradient projection algorithm
x_{k+1} = P_X[x_k − α_k ∇f(x_k)]
starting with x_0 ∈ X.
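To make the iteration concrete, here is a minimal sketch (an illustrative example, not part of the original slides) of the gradient projection method for f(x) = (1/2)‖Ax − b‖^2 over the box X = [0, 1]^n, where the projection P_X reduces to coordinatewise clipping; the data, dimensions, and stepsize are arbitrary choices.

    import numpy as np

    # Illustrative problem (assumed data): f(x) = 0.5 * ||A x - b||^2 over X = [0, 1]^n.
    rng = np.random.default_rng(0)
    n = 5
    A = rng.standard_normal((8, n))
    b = rng.standard_normal(8)

    def grad_f(x):
        return A.T @ (A @ x - b)

    def project_X(y):
        # Euclidean projection onto the box [0, 1]^n is coordinatewise clipping.
        return np.clip(y, 0.0, 1.0)

    L = np.linalg.norm(A.T @ A, 2)   # Lipschitz constant of the gradient
    alpha = 1.0 / L                  # constant stepsize with 0 < alpha < 2/L

    x = np.zeros(n)                  # x_0 in X
    for k in range(200):
        x = project_X(x - alpha * grad_f(x))   # x_{k+1} = P_X[x_k - alpha*grad f(x_k)]

    print("approximate constrained minimizer:", x)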
Bounded Gradients
Theorem 1 Let Assumption 1 hold, and suppose that the gradients are uniformly bounded over the set X, i.e., ‖∇f(x)‖ ≤ L for all x ∈ X and some scalar L > 0. Then, the gradient projection method generates a sequence {x_k} ⊂ X such that
• When the constant stepsize α_k ≡ α is used, we have
lim inf_{k→∞} f(x_k) ≤ f* + αL^2/2
• When a diminishing stepsize is used with Σ_k α_k = +∞, we have
lim inf_{k→∞} f(x_k) = f*.

Proof: We use the projection properties and a line of analysis similar to that for the unconstrained method. HWK 6
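As a small illustration of the two stepsize rules (my own toy example, not from the slides): f(x) = √(1 + x^2) has derivatives bounded by L = 1, and over X = [1, 5] its optimal value is f* = √2. The sketch below tracks the best function value found under a constant and a diminishing stepsize; in this smooth example both reach f*, while the theorem only guarantees the stated lim inf bounds in general.

    import numpy as np

    def f(x):            # smooth convex function with |f'(x)| <= 1
        return np.sqrt(1.0 + x * x)

    def grad(x):
        return x / np.sqrt(1.0 + x * x)

    def project(x):      # X = [1, 5], so f* = f(1) = sqrt(2)
        return min(max(x, 1.0), 5.0)

    def run(stepsize, iters=500, x0=5.0):
        x, best = x0, float("inf")
        for k in range(iters):
            x = project(x - stepsize(k) * grad(x))
            best = min(best, f(x))   # tracks inf_k f(x_k)
        return best

    f_star = f(1.0)
    print("constant alpha = 0.5       :", run(lambda k: 0.5) - f_star)
    print("diminishing alpha_k=1/(k+1):", run(lambda k: 1.0 / (k + 1)) - f_star)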
Lipschitz Gradients
• Lipschitz Gradient Lemma For a differentiable convex function f with Lipschitz continuous gradients, we have for all x, y ∈ R^n,

(1/L) ‖∇f(x) − ∇f(y)‖^2 ≤ (∇f(x) − ∇f(y))^T (x − y),

where L is a Lipschitz constant.
• Theorem 2 Let Assumption 1 hold, and assume that the gradients of f are Lipschitz continuous over X. Suppose that the optimal solution set X* is not empty. Then, for a constant stepsize α_k ≡ α with 0 < α < 2/L, the sequence {x_k} converges to an optimal point, i.e.,
lim_{k→∞} ‖x_k − x*‖ = 0 for some x* ∈ X*.
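Before the proof, here is a quick numerical sanity check (my own, not from the slides) of the Lipschitz Gradient Lemma for the quadratic f(x) = (1/2) x^T Q x, for which ∇f(x) = Qx and L = λ_max(Q); the data is random and purely illustrative.

    import numpy as np

    # Check (1/L)*||grad f(x) - grad f(y)||^2 <= (grad f(x) - grad f(y))^T (x - y)
    # for f(x) = 0.5 * x^T Q x with Q symmetric positive semidefinite.
    rng = np.random.default_rng(1)
    B = rng.standard_normal((6, 6))
    Q = B.T @ B                          # symmetric PSD
    L = np.linalg.eigvalsh(Q).max()      # Lipschitz constant of the gradient

    for _ in range(5):
        x, y = rng.standard_normal(6), rng.standard_normal(6)
        g_diff = Q @ x - Q @ y           # grad f(x) - grad f(y)
        lhs = (g_diff @ g_diff) / L
        rhs = g_diff @ (x - y)
        assert lhs <= rhs + 1e-10        # the lemma's inequality holds
    print("co-coercivity inequality verified on random samples")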
Proof:

Fact 1: If z = P_X[z − v] for some v ∈ R^n, then z = P_X[z − τv] for any τ > 0.

Fact 2: z ∈ X* if and only if z = P_X[z − ∇f(z)].

These facts imply that z ∈ X* if and only if z = P_X[z − τ∇f(z)] for any τ > 0. By using the definition of the method and the preceding relation with τ = α, we obtain for any z ∈ X*,
‖x_{k+1} − z‖^2 = ‖P_X[x_k − α∇f(x_k)] − P_X[z − α∇f(z)]‖^2.

By the non-expansiveness of the projection, it follows that

‖x_{k+1} − z‖^2 ≤ ‖x_k − z − α(∇f(x_k) − ∇f(z))‖^2
              = ‖x_k − z‖^2 − 2α(x_k − z)^T (∇f(x_k) − ∇f(z)) + α^2 ‖∇f(x_k) − ∇f(z)‖^2.
Using the Lipschitz Gradient Lemma, we obtain for any z ∈ X*,
‖x_{k+1} − z‖^2 ≤ ‖x_k − z‖^2 − (α/L)(2 − αL) ‖∇f(x_k) − ∇f(z)‖^2.    (1)
Hence, for all k,
(α/L)(2 − αL) ‖∇f(x_k) − ∇f(z)‖^2 ≤ ‖x_k − z‖^2 − ‖x_{k+1} − z‖^2.
By summing the preceding relations from arbitrary K to N, with K < N, we obtain
(α/L)(2 − αL) Σ_{k=K}^{N} ‖∇f(x_k) − ∇f(z)‖^2 ≤ ‖x_K − z‖^2 − ‖x_{N+1} − z‖^2 ≤ ‖x_K − z‖^2.
In particular, setting K = 0 and letting N → ∞, we see that
(α/L)(2 − αL) Σ_{k=0}^{∞} ‖∇f(x_k) − ∇f(z)‖^2 ≤ ‖x_0 − z‖^2 < ∞.    (2)
As a consequence, we also have
lim_{k→∞} ∇f(x_k) = ∇f(z).    (3)
By discarding the non-positive term on the right-hand side of Eq. (1), we have ‖x_{k+1} − z‖^2 ≤ ‖x_k − z‖^2, and therefore, for any z ∈ X* and all k,

‖x_{k+1} − z‖^2 ≤ ‖x_k − z‖^2 + (α/L)(2 − αL) ‖∇f(x_k) − ∇f(z)‖^2.

By summing these relations over k = K, ..., N for arbitrary K and N with K < N, we obtain
‖x_{N+1} − z‖^2 ≤ ‖x_K − z‖^2 + (α/L)(2 − αL) Σ_{k=K}^{N} ‖∇f(x_k) − ∇f(z)‖^2.

Taking the lim sup as N → ∞, we obtain
lim sup_{N→∞} ‖x_{N+1} − z‖^2 ≤ ‖x_K − z‖^2 + (α/L)(2 − αL) Σ_{k=K}^{∞} ‖∇f(x_k) − ∇f(z)‖^2.
Now, taking liminf as K → ∞ yields
lim sup_{N→∞} ‖x_{N+1} − z‖^2 ≤ lim inf_{K→∞} ‖x_K − z‖^2 + (α/L)(2 − αL) lim_{K→∞} Σ_{k=K}^{∞} ‖∇f(x_k) − ∇f(z)‖^2
                             = lim inf_{K→∞} ‖x_K − z‖^2,
where the equality follows in view of the relation in (2). Thus, the sequence {‖x_k − z‖} is convergent for every z ∈ X*. By the inequality in Eq. (1), we have
‖x_k − z‖ ≤ ‖x_0 − z‖ for all k.
Hence, the sequence {x_k} is bounded, and it has an accumulation point. Since the scalar sequence {‖x_k − z‖} is convergent for every z ∈ X*, the sequence {x_k} must be convergent.
Suppose now that x_k → x̄. By considering the definition of the iterate x_{k+1}, we have
x_{k+1} = P_X[x_k − α∇f(x_k)].
Letting k → ∞, and using x_k → x̄ and the continuity of the gradient ∇f(x), we obtain
x̄ = P_X[x̄ − α∇f(x̄)].

In view of Facts 1 and 2, the preceding relation is equivalent to x̄ ∈ X*.
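The fixed-point characterization used above is easy to test numerically. The sketch below (an illustrative example, not from the slides) solves a small box-constrained least-squares problem with the gradient projection method and then checks that ‖x̄ − P_X[x̄ − τ∇f(x̄)]‖ is numerically zero for several τ > 0, as Facts 1 and 2 assert.

    import numpy as np

    # Check of Facts 1-2: at a (numerically computed) constrained minimizer x_bar,
    # x_bar = P_X[x_bar - tau * grad f(x_bar)] for every tau > 0.  Assumed data:
    # least squares over the box [0, 1]^n.
    rng = np.random.default_rng(2)
    n = 4
    A = rng.standard_normal((6, n))
    b = rng.standard_normal(6)

    def grad_f(x):
        return A.T @ (A @ x - b)

    def P_X(y):
        return np.clip(y, 0.0, 1.0)

    L = np.linalg.norm(A.T @ A, 2)
    x = np.zeros(n)
    for _ in range(5000):                 # run the method to (near) convergence
        x = P_X(x - (1.0 / L) * grad_f(x))

    for tau in (0.1, 1.0, 10.0):
        residual = np.linalg.norm(x - P_X(x - tau * grad_f(x)))
        print(f"tau = {tau:5.1f}: fixed-point residual = {residual:.2e}")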
Modes of Convexity: Strict and Strong
• Def. f is strictly convex if for all x ≠ y and α ∈ (0, 1) we have
f(αx + (1 − α)y) < αf(x) + (1 − α)f(y)
• Def. f is strongly convex if there exists a scalar ν > 0 such that
f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y) − (ν/2) α(1 − α) ‖x − y‖^2
for all x, y ∈ R^n.

Modes of Convexity: Differentiable Function

• Let f : R^n → R be differentiable. Then:
• f is convex if and only if
f(x) + ∇f(x)^T (y − x) ≤ f(y) for all x, y ∈ R^n
• f is strictly convex if and only if
f(x) + ∇f(x)^T (y − x) < f(y) for all x ≠ y
• f is strongly convex with constant ν if and only if
f(x) + ∇f(x)^T (y − x) + (ν/2) ‖y − x‖^2 ≤ f(y) for all x, y ∈ R^n

• Let f : R^n → R be differentiable. Then, in terms of the gradient:
• f is convex if and only if
(∇f(x) − ∇f(y))^T (x − y) ≥ 0 for all x, y ∈ R^n
• f is strictly convex if and only if
(∇f(x) − ∇f(y))^T (x − y) > 0 for all x ≠ y
• f is strongly convex with constant ν if and only if
(∇f(x) − ∇f(y))^T (x − y) ≥ ν ‖x − y‖^2 for all x, y ∈ R^n

Modes of Convexity: Twice Differentiable Function

• Let f : R^n → R be twice differentiable. Then:
• f is convex if and only if ∇^2 f(x) ⪰ 0 for all x ∈ R^n
• f is strictly convex if ∇^2 f(x) ≻ 0 for all x ∈ R^n
• f is strongly convex with constant ν if and only if ∇^2 f(x) ⪰ νI for all x ∈ R^n

Strong Convexity: Implications

Let f be continuously differentiable and strongly convex* over R^n with constant m.
• Implications:
• Lower bound on f over R^n: for all x, y ∈ R^n,
f(y) ≥ f(x) + ∇f(x)^T (y − x) + (m/2) ‖x − y‖^2    (4)
• Minimizing the right-hand side with respect to y gives
f(y) ≥ f(x) − (1/(2m)) ‖∇f(x)‖^2
• Taking the minimum over y ∈ R^n on the left-hand side gives
f(x) − f* ≤ (1/(2m)) ‖∇f(x)‖^2
• Useful as a stopping criterion (if you know m)

* Strong convexity over R^n can be replaced by strong convexity over a set X; then all the relations stay valid over the set X.

• Relation (4) with x = x_0 and f(y) ≤ f(x_0) implies that the level set L_f(f(x_0)) is bounded
• Relation (4) also yields, for an optimal x* and any x ∈ R^n,
(m/2) ‖x − x*‖^2 ≤ f(x) − f(x*)
• Last two bullets: HWK 6 assignment.

Convergence Rate: Once Differentiable

Theorem 3 Let Assumption 1 hold, and assume that the gradients of f are Lipschitz continuous over X with constant L > 0. Suppose that the function is strongly convex with constant m > 0. Then:
• A solution x* exists and it is unique.
• The iterates generated by the gradient projection method with α_k ≡ α and 0 < α < 2/L converge to x* with a geometric rate, i.e.,
‖x_{k+1} − x*‖^2 ≤ q ‖x_k − x*‖^2 for all k,
with q ∈ (0, 1) depending on m and L; hence ‖x_k − x*‖^2 ≤ q^k ‖x_0 − x*‖^2.
Proof: HWK 6.
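As a quick numerical illustration of the geometric rate (my own example, not from the slides), the sketch below runs the method on an unconstrained strongly convex quadratic, so that P_X is the identity and x* = 0, and reports the observed per-step contraction of ‖x_k − x*‖; the eigenvalues and stepsize are arbitrary choices, and the theoretical factor max{|1 − αm|, |1 − αL|} quoted for comparison is the one from Theorem 4 below.

    import numpy as np

    # Observed geometric rate for f(x) = 0.5 * x^T Q x with X = R^n (P_X = identity,
    # x* = 0); m and L are the smallest/largest eigenvalues of Q.  Assumed data only.
    rng = np.random.default_rng(3)
    eigs = np.array([1.0, 2.0, 5.0, 10.0])           # m = 1, L = 10
    U, _ = np.linalg.qr(rng.standard_normal((4, 4)))
    Q = U @ np.diag(eigs) @ U.T
    m, L = eigs.min(), eigs.max()

    alpha = 1.0 / L                                  # constant stepsize, 0 < alpha < 2/L
    x = rng.standard_normal(4)
    for k in range(30):
        x_new = x - alpha * (Q @ x)                  # gradient step (projection is identity)
        ratio = np.linalg.norm(x_new) / np.linalg.norm(x)
        x = x_new

    print("observed ||x_{k+1} - x*|| / ||x_k - x*|| :", round(float(ratio), 4))
    print("bound max(|1 - alpha*m|, |1 - alpha*L|)  :", max(abs(1 - alpha * m), abs(1 - alpha * L)))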
Convergence Rate: Twice Differentiable

Theorem 4 Let Assumption 1 hold. Assume that the function is twice continuously differentiable and strongly convex with constant m > 0. Assume also that ∇^2 f(x) ⪯ LI for all x ∈ X. Then:
• A solution x* exists and it is unique.
• The iterates generated by the gradient projection method with α_k ≡ α and 0 < α < 2/L converge to x* with a geometric rate, i.e.,
‖x_{k+1} − x*‖ ≤ q ‖x_k − x*‖ for all k,
with q = max{|1 − αm|, |1 − αL|}.

Proof: The q here is different from the one in the preceding theorem. Since ∇^2 f(x) ⪯ LI for all x ∈ X, it follows that the gradients are Lipschitz continuous over X with constant L. By the definition of the method and the non-expansive property of the projection, we have for z = x* and any k,

‖x_{k+1} − x*‖^2 = ‖P_X[x_k − α∇f(x_k)] − P_X[x* − α∇f(x*)]‖^2
               ≤ ‖x_k − x* − α(∇f(x_k) − ∇f(x*))‖^2.    (5)

Mean Value Theorem for vector functions: when g : R^n → R^n is differentiable on the segment [x, y], we have

g(y) = g(x) + ∫_0^1 ∇g(x + τ(y − x)) dτ (y − x).

Applying this theorem with g = ∇f, y = x_k and x = x*, we obtain

∇f(x_k) = ∇f(x*) + ∫_0^1 ∇^2 f(x* + τ(x_k − x*)) dτ (x_k − x*).

Hence,

∇f(x_k) − ∇f(x*) = A_k (x_k − x*),  where  A_k = ∫_0^1 ∇^2 f(x* + τ(x_k − x*)) dτ.    (6)

Using this in relation (5), we obtain

‖x_{k+1} − x*‖ ≤ ‖(I − αA_k)(x_k − x*)‖ ≤ ‖I − αA_k‖ ‖x_k − x*‖.

The matrix A_k is symmetric, and hence ‖I − αA_k‖ is equal to the maximum absolute eigenvalue of I − αA_k, i.e.,

‖I − αA_k‖ = max{|1 − αλ_max(A_k)|, |1 − αλ_min(A_k)|}.

In view of Eq. (6), we have A_k = ∫_0^1 ∇^2 f(x* + τ(x_k − x*)) dτ. By the strong convexity of f, we have ∇^2 f(x) ⪰ mI for all x, while by the given condition, we have ∇^2 f(x) ⪯ LI. Therefore λ_max(A_k) ≤ L and λ_min(A_k) ≥ m, implying that

‖I − αA_k‖ ≤ max{|1 − αm|, |1 − αL|},

which, combined with the preceding inequality, gives ‖x_{k+1} − x*‖ ≤ q ‖x_k − x*‖.

• The parameter q is minimized when α* = 2/(m + L), in which case
q* = (L − m)/(L + m), equivalently q* = (cond(f) − 1)/(cond(f) + 1),
with cond(f) = L/m.

Upper Bound on Hessian and f over the Level Set

For a twice differentiable strongly convex f:
• The level set L_0 = {x | f(x) ≤ f(x_0)} is bounded
• The maximum eigenvalue of the Hessian ∇^2 f(x) is a continuous function of x over L_0
• Hence, the maximum eigenvalue of the Hessian is bounded over L_0: there is a constant M such that ∇^2 f(x) ⪯ MI for all x ∈ L_0
• Upper bound on f over L_0:
f(y) ≤ f(x) + ∇f(x)^T (y − x) + (M/2) ‖y − x‖^2 for all x, y ∈ L_0
• Minimizing over y on both sides gives
f* ≤ f(x) − (1/(2M)) ‖∇f(x)‖^2 for all x ∈ L_0

Condition Number of a Matrix

For a twice differentiable strongly convex f: mI ⪯ ∇^2 f(x) ⪯ MI for all x ∈ L_0.
• The condition number cond(A) of a positive definite matrix A:
cond(A) = (largest eigenvalue of A) / (smallest eigenvalue of A)
• The ratio M/m is an upper bound on the condition number of ∇^2 f(x) for every x ∈ L_0

Strong Convexity and Condition Number of Level Sets

Assume a minimizer x* of f over R^n exists and f is strongly convex. Consider the level set L_0 = {x | f(x) ≤ f(x_0)}.
• We have seen that mI ⪯ ∇^2 f(x) ⪯ MI for all x ∈ L_0
• Also, we have
f* + (m/2) ‖x − x*‖^2 ≤ f(x) ≤ f* + (M/2) ‖x − x*‖^2
• Hence B_inner ⊆ L_0 ⊆ B_outer, where
B_inner = {x | ‖x − x*‖ ≤ √(2(f(x_0) − f*)/M)}
B_outer = {x | ‖x − x*‖ ≤ √(2(f(x_0) − f*)/m)}
• Therefore, we have a bound on cond(L_0):
cond(L_0) ≤ M/m
• The condition number of level sets affects the efficiency of the algorithms
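To tie the condition-number discussion back to the rate of the gradient method, the following sketch (an illustrative experiment, not part of the slides) runs gradient descent on diagonal quadratics f(x) = (1/2)(m x_1^2 + L x_2^2) with the stepsize α* = 2/(m + L) and compares the observed contraction factor with q* = (cond(f) − 1)/(cond(f) + 1).

    import numpy as np

    # For each condition number kappa = L/m, run gradient descent with the stepsize
    # alpha* = 2/(m + L) on f(x) = 0.5*(m*x1^2 + L*x2^2) (minimizer x* = 0) and report
    # the observed per-step contraction of ||x_k - x*||.  Purely illustrative data.
    for kappa in (2.0, 10.0, 100.0):
        m, L = 1.0, kappa
        alpha = 2.0 / (m + L)
        x = np.array([1.0, 1.0])
        for _ in range(200):
            grad = np.array([m * x[0], L * x[1]])
            x_new = x - alpha * grad
            ratio = np.linalg.norm(x_new) / np.linalg.norm(x)
            x = x_new
        q_star = (kappa - 1.0) / (kappa + 1.0)
        print(f"kappa = {kappa:6.1f}: observed ratio = {ratio:.4f}, q* = {q_star:.4f}")

The ill-conditioned case (large kappa) contracts much more slowly, which is the sense in which the condition number of the level sets governs the efficiency of the method.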