Projected Gradient Method; Sufficient Optimality Conditions
ORIE 6300 Mathematical Programming I                                November 10, 2016
                                      Lecture 22
Lecturer: Damek Davis                                               Scribe: Xueyu Tian

1 Last Time

1. Finished up IPMs.

2. Defined differentiability at $\bar{x}$:
\[
(\exists v \in \mathbb{R}^n)(\forall x \in \mathbb{R}^n) \quad f(x) = f(\bar{x}) + \langle v, x - \bar{x} \rangle + o(x - \bar{x}), \tag{1}
\]
where $o : \mathbb{R}^n \to \mathbb{R}$ satisfies $\lim_{x \to \bar{x}} \frac{1}{\|x - \bar{x}\|}\, o(x - \bar{x}) = 0$ and $o(0) = 0$.

Proposition 1  Precisely one vector $v$ can satisfy (1). We write $\nabla f(\bar{x}) := v$ whenever $f$ is differentiable at $\bar{x}$.

3. Necessary optimality conditions.

Theorem 2 (necessary optimality)  Let $f : \mathbb{R}^n \to \mathbb{R}$ be a function and let $C \subseteq \mathbb{R}^n$ be a closed convex set. Suppose that $\bar{x} \in \operatorname{argmin}_{x \in C} f(x)$ exists and $f$ is differentiable at $\bar{x}$. Then
\[
-\nabla f(\bar{x}) \in N_C(\bar{x}). \tag{OPT}
\]

2 Today

Definition 1  Points $\bar{x}$ satisfying (OPT) are called stationary points.

Observation 1  Minimizers of $\inf_{x \in C} f(x)$ are stationary points, but stationary points are not necessarily minimizers.

For nonconvex optimization problems, stationary points are all we can hope to find. Later we will see that stationary points of convex problems are global minimizers. So how can we find stationary points?

Theorem 3  $\bar{x}$ satisfies (OPT) if, and only if, $(\forall \gamma > 0)\ \bar{x} = P_C(\bar{x} - \gamma \nabla f(\bar{x}))$.

Proof:  Since $N_C(\bar{x})$ is a cone,
\[
-\nabla f(\bar{x}) \in N_C(\bar{x}) \iff -\gamma \nabla f(\bar{x}) \in \gamma N_C(\bar{x}) = N_C(\bar{x}) \iff (\bar{x} - \gamma \nabla f(\bar{x})) - \bar{x} \in N_C(\bar{x}) \iff \bar{x} = P_C(\bar{x} - \gamma \nabla f(\bar{x})). \qquad \square
\]

Thus, stationary points are exactly the elements of $\mathrm{Fix}(P_C \circ (I - \gamma \nabla f))$. This suggests applying the Krasnosel'skii-Mann (KM) iteration to $T = P_C \circ (I - \gamma \nabla f)$. However, $T$ is NOT nonexpansive in such a general setting. To get any result, we must restrict to $f$ with globally Lipschitz continuous gradients.

Algorithm 1  Projected gradient method for $f$ with $\nabla f$ $L$-Lipschitz.
  Input: $x \in C$, $0 < \gamma < \frac{2}{L}$
  1: loop
  2:   $x \leftarrow P_C(x - \gamma \nabla f(x))$

Example 1  $f(x) = \langle (1, 0), x \rangle$, $C = B(0, 1)$.

Theorem 4  Suppose $\inf_{x \in C} f(x) > -\infty$. Let $\{x^k\}_{k \in \mathbb{N}}$ be generated by the projected gradient method, and suppose $\bar{x}$ is a limit point of $\{x^k\}_{k \in \mathbb{N}}$. Then $\bar{x}$ is a stationary point for (OPT).

Proof:  Because $f$ is Lipschitz differentiable, for all $k \in \mathbb{N}$ we have
\[
f(x^{k+1}) \le f(x^k) + \langle \nabla f(x^k), x^{k+1} - x^k \rangle + \frac{L}{2}\|x^{k+1} - x^k\|^2
= f(x^k) + \frac{1}{\gamma}\langle x^{k+1} - (x^k - \gamma \nabla f(x^k)), x^{k+1} - x^k \rangle - \left(\frac{1}{\gamma} - \frac{L}{2}\right)\|x^{k+1} - x^k\|^2. \tag{2}
\]
Notice that
\[
x^{k+1} = P_C(x^k - \gamma \nabla f(x^k)) \iff (x^k - \gamma \nabla f(x^k)) - x^{k+1} \in N_C(x^{k+1}).
\]
Thus, taking $y = x^k \in C$ in the normal cone inequality,
\[
\frac{1}{\gamma}\langle (x^k - \gamma \nabla f(x^k)) - x^{k+1}, x^k - x^{k+1} \rangle \le 0.
\]
Therefore,
\[
f(x^{k+1}) \le (2) \le f(x^k) - \left(\frac{1}{\gamma} - \frac{L}{2}\right)\|x^{k+1} - x^k\|^2.
\]
The fixed-point residual (FPR) is then summable:
\[
\sum_{k=0}^{\infty} \|x^k - x^{k+1}\|^2 = \lim_{T \to \infty} \sum_{k=0}^{T} \|x^k - x^{k+1}\|^2 \le \lim_{T \to \infty} \frac{1}{\frac{1}{\gamma} - \frac{L}{2}}\left[f(x^0) - f(x^{T+1})\right] \le \frac{1}{\frac{1}{\gamma} - \frac{L}{2}}\left[f(x^0) - f^*\right],
\]
where $f^* = \inf_{x \in C} f(x)$. Thus $\|x^{k+1} - x^k\| \to 0$ as $k \to \infty$.

Suppose some subsequence $\{x^{k_j}\}_{j \in \mathbb{N}} \subseteq \{x^k\}_{k \in \mathbb{N}}$ converges to $\bar{x}$. Then, by continuity, $P_C(x^{k_j} - \gamma \nabla f(x^{k_j})) \to P_C(\bar{x} - \gamma \nabla f(\bar{x}))$. Moreover, because $x^{k_j+1} - x^{k_j} \to 0$, we have $x^{k_j+1} \to \bar{x}$. So
\[
\bar{x} = \lim_{j \to \infty} x^{k_j+1} = \lim_{j \to \infty} P_C(x^{k_j} - \gamma \nabla f(x^{k_j})) = P_C(\bar{x} - \gamma \nabla f(\bar{x})).
\]
Thus, $\bar{x}$ is stationary. $\square$

Corollary 5  For all $T \in \mathbb{N}$, we have
\[
\min_{k=0,1,\dots,T} \|x^{k+1} - x^k\| \le \sqrt{\frac{f(x^0) - \inf_{x \in C} f(x)}{T\left(\frac{1}{\gamma} - \frac{L}{2}\right)}}.
\]

Proof:
\[
\min_{k=0,1,\dots,T} \|x^{k+1} - x^k\|^2 \le \frac{1}{T}\sum_{k=0}^{T} \|x^k - x^{k+1}\|^2 \le \frac{f(x^0) - \inf_{x \in C} f(x)}{T\left(\frac{1}{\gamma} - \frac{L}{2}\right)},
\]
which implies the result. $\square$

Significance of the convergence rate.  The step $x^{k+1} - x^k$ satisfies
\[
\frac{1}{\gamma}(x^k - x^{k+1}) \in N_C(x^{k+1}) + \nabla f(x^k) = N_C(x^{k+1}) + \nabla f(x^{k+1}) + \nabla f(x^k) - \nabla f(x^{k+1})
\]
\[
\implies \frac{1}{\gamma}(x^k - x^{k+1}) + \left(\nabla f(x^{k+1}) - \nabla f(x^k)\right) \in N_C(x^{k+1}) + \nabla f(x^{k+1}).
\]
So it is a natural measure of stationarity.
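To make Algorithm 1 concrete, here is a minimal sketch (not part of the original notes) of the projected gradient method applied to Example 1. It assumes the standard rescaling formula for projection onto the Euclidean ball $B(0,1)$; the function names and the stopping rule based on the fixed-point residual $\|x^{k+1} - x^k\|$ are illustrative choices rather than anything fixed by the lecture.

    import numpy as np

    def proj_ball(y, radius=1.0):
        # Euclidean projection onto B(0, radius): rescale y if it lies outside the ball.
        norm = np.linalg.norm(y)
        return y if norm <= radius else (radius / norm) * y

    def projected_gradient(grad_f, proj_C, x0, gamma, max_iter=1000, tol=1e-8):
        # Projected gradient iteration x <- P_C(x - gamma * grad_f(x)), with 0 < gamma < 2/L.
        # Stops once the fixed-point residual ||x^{k+1} - x^k|| falls below tol.
        x = proj_C(x0)
        for _ in range(max_iter):
            x_new = proj_C(x - gamma * grad_f(x))
            if np.linalg.norm(x_new - x) <= tol:
                return x_new
            x = x_new
        return x

    # Example 1: f(x) = <(1, 0), x> on C = B(0, 1); the unique minimizer is (-1, 0).
    grad_f = lambda x: np.array([1.0, 0.0])   # gradient of a linear f is constant, so any gamma > 0 is admissible
    x_bar = projected_gradient(grad_f, proj_ball, np.array([0.5, 0.5]), gamma=1.0)
    print(x_bar)                              # approximately [-1.  0.]

On this instance the iterates slide along the boundary of the ball toward $(-1, 0)$, and the fixed-point residual shrinks to zero, consistent with Theorem 4.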
Using Lipschitz continuity of $\nabla f$, we can easily get rates on this measure:
\[
\min_{k=0,1,\dots,T} \left\|\frac{1}{\gamma}(x^k - x^{k+1}) + \nabla f(x^{k+1}) - \nabla f(x^k)\right\| \le \min_{k=0,1,\dots,T} \left(\frac{1}{\gamma} + L\right)\|x^k - x^{k+1}\| \le \left(\frac{1}{\gamma} + L\right)\sqrt{\frac{f(x^0) - \inf_{x \in C} f(x)}{T\left(\frac{1}{\gamma} - \frac{L}{2}\right)}}.
\]
This is a pretty terrible rate, but with convexity we can get MUCH faster rates and much stronger guarantees. Let's play with convexity for a bit first.

Definition 2  A continuously differentiable function $f$ on $\mathbb{R}^n$ is called convex (notation: $f \in \mathcal{F}(\mathbb{R}^n)$) if
\[
(\forall x, y \in \mathbb{R}^n) \quad f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle. \tag{3}
\]
The picture to have in mind is that the graph of $f$ lies above each of its tangent hyperplanes.

Proposition 6  Convexity is equivalent to the following:
\[
(\forall x, y \in \mathbb{R}^n)(\forall \alpha \in [0, 1]) \quad f((1 - \alpha)x + \alpha y) \le (1 - \alpha)f(x) + \alpha f(y).
\]

Proof:  See Theorem 2.1.2 in Nesterov's book. $\square$

For convex functions, stationary points are global minimizers.

Notation.  We let $\mathcal{F}(\mathbb{R}^n)$ denote the set of continuously differentiable, convex functions on $\mathbb{R}^n$.

Theorem 7 (Sufficient Optimality Conditions)  Let $f \in \mathcal{F}(\mathbb{R}^n)$ and let $C \subseteq \mathbb{R}^n$ be a closed convex set. Then
\[
-\nabla f(x) \in N_C(x) \iff x \in \operatorname{argmin}_{y \in C} f(y).
\]

Proof:  We proved necessity last time. Let's prove sufficiency. Suppose $-\nabla f(x) \in N_C(x)$. Then
\[
(\forall y \in C) \quad \langle -\nabla f(x), y - x \rangle \le 0 \implies \langle \nabla f(x), y - x \rangle \ge 0.
\]
Thus,
\[
(\forall y \in C) \quad f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle \ge f(x). \qquad \square
\]

When $C = \mathbb{R}^n$, we have $N_C(x) = \{0\}$, so all stationary points satisfy $\nabla f(x) = 0$.

Example 2 (Linearly constrained optimization)  Let $C = \{x \mid Ax = b\}$ with $A \in \mathbb{R}^{m \times n}$. Then
\[
N_C(x) = \begin{cases} A^T \mathbb{R}^m = \mathrm{range}(A^T), & \text{if } Ax = b; \\ \emptyset, & \text{otherwise}. \end{cases}
\]
The optimality conditions of the problem $\inf_{x \in C} f(x)$ become:
\[
\exists y \text{ s.t. } \nabla f(x) + A^T y = 0, \quad Ax = b.
\]
If $f$ is convex, these conditions are necessary and sufficient. They should remind you of Lagrange multipliers.

What about nonsmooth convex functions? Consider the minimization problem
\[
\min\ f(x) + g(x),
\]
where $g$ is nonsmooth but convex. Can we reformulate it into a form to which the projected gradient method applies? Yes:
\[
\min\ \{ f(x) + t \mid (x, t) \in \mathrm{epi}(g) = \{(x, t) \mid g(x) \le t\} \}.
\]
$\mathrm{epi}(g)$ is a closed, convex set and $f(x) + t$ is a smooth function of $(x, t)$. Our optimality conditions become
\[
(-\nabla f(\bar{x}), -1) \in N_{\mathrm{epi}(g)}(\bar{x}, t).
\]
Notice that if $g$ is a continuous function, then $g(x) < t \implies (x, t) \in \mathrm{int}(\mathrm{epi}(g)) \implies N_{\mathrm{epi}(g)}(x, t) = \{0\}$, which cannot contain $(-\nabla f(\bar{x}), -1)$ since its last coordinate is $-1$. Thus, optimality must occur at $(\bar{x}, t) = (\bar{x}, g(\bar{x}))$. Then, from Homework 3, Problem 3, we find that
\[
-\nabla f(\bar{x}) \in \partial g(\bar{x}), \quad \text{i.e.,} \quad 0 \in \nabla f(\bar{x}) + \partial g(\bar{x}),
\]
where $\partial g(\bar{x}) = \{v \mid (\forall x)\ g(x) \ge g(\bar{x}) + \langle v, x - \bar{x} \rangle\}$. Thus:

Theorem 8  Suppose $f \in \mathcal{F}(\mathbb{R}^n)$ and $g$ is convex. Then
\[
\bar{x} \in \operatorname{argmin}\{f(x) + g(x)\} \iff 0 \in \nabla f(\bar{x}) + \partial g(\bar{x}).
\]
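As a quick sanity check of Theorem 8 (a worked example, not part of the original notes), take $n = 1$, $f(x) = \frac{1}{2}(x - a)^2$, and $g(x) = \lambda |x|$ with $\lambda > 0$. Here $\nabla f(x) = x - a$, and from the definition of $\partial g$ above, $\partial g(x) = \{\lambda\,\mathrm{sign}(x)\}$ for $x \neq 0$ and $\partial g(0) = [-\lambda, \lambda]$, so the condition $0 \in \nabla f(\bar{x}) + \partial g(\bar{x})$ reads
\[
0 \in \bar{x} - a + \lambda\, \partial|\bar{x}| \iff \bar{x} = \begin{cases} a - \lambda, & a > \lambda, \\ 0, & |a| \le \lambda, \\ a + \lambda, & a < -\lambda, \end{cases}
\]
the soft-thresholding of $a$, which is indeed the global minimizer of $f + g$.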