
Taught by Professor Adrian Lewis

Notes by Mateo Díaz ([email protected])

Cornell University Spring, 2018

Last updated May 18, 2018. The latest version is online here.

Notes template by David Mehrle.

Contents

1 Convex Sets

2 Convex functions
2.1 Convexity and calculus
2.1.1 Gradients
2.1.2 Recognizing smooth convexity
2.2 Continuity of convex functions
2.3 Subgradients

3 Duality
3.1 Fenchel Duality
3.2 Conic programs
3.3 Convex calculus
3.4 Lagrangian duality

4 Algorithms
4.1 Augmented Lagrangian Method
4.2 Proximal Point method and fixed points
4.3 Smooth Minimization
4.4 Splitting Problems and Alternating projections
4.5 Proximal Gradient
4.6 Douglas-Rachford
4.6.1 Consensus optimization
4.7 ADMM: Alternating Directions Method of Multipliers
4.8 Splitting
4.9 Projected subgradient method
4.10 Accelerated Proximal Gradient

5 …
5.1 Nonconvex calculus
5.2 Subgradients and nonconvexity
5.3 Inverse problems
5.4 Linearizing sets
5.5 Optimality conditions

Contents by Lecture

Lecture 01 on 30 January 2018
Lecture 02 on February 1 2018
Lecture 03 on February 6 2018
Lecture 04 on February 8 2018
Lecture 05 on February 13 2018
Lecture 06 on February 15 2018
Lecture 07 on February 22 2018
Lecture 08 on February 27 2018
Lecture 09 on March 1 2018
Lecture 10 on March 6 2018
Lecture 11 on March 8 2018
Lecture 12 on March 13 2018
Lecture 13 on March 15 2018
Lecture 14 on March 20 2018
Lecture 15 on March 22 2018
Lecture 16 on April 10 2018
Lecture 17 on April 12 2018
Lecture 18 on April 17 2018
Lecture 19 on April 19 2018
Lecture 20 on April 24 2018
Lecture 21 on April 26 2018
Lecture 22 on May 1 2018
Lecture 23 on May 3 2018
Lecture 24 on May 8 2018

Disclaimer

These notes are not official. I do not accept any responsibility or liability for the accuracy, content, completeness, correctness, or reliability of the information contained in these notes. Please feel free to email me if you find any mistakes or typos.


1 Convex Sets

We will mainly consider a real Euclidean space E (a finite-dimensional Hilbert space). We denote the inner product by ⟨·, ·⟩.

Example 1. A few examples:

• The standard space: R^n with ⟨x, y⟩ = xᵀy.

• Symmetric matrices: S^n := {symmetric n × n matrices} with ⟨X, Y⟩ = tr(XᵀY).

Definition 1. A set C ⊆ E is convex if for all x, y ∈ C and λ ∈ [0, 1] we have

λx + (1 − λ)y ∈ C.

Figure 1: (a) a nonconvex set; (b) a convex set.

Example 2. Here are two simple examples:

• Closed halfspaces. H := {x | ⟨a, x⟩ ≤ β} for some fixed β ∈ R and nonzero a ∈ E∗.

• Unit ball. Define ‖x‖² = ⟨x, x⟩; then B = {x | ‖x‖ ≤ 1}.

Proposition 1. Arbitrary intersections of convex sets are convex.

Proof. Exercise.

Example 3. Consider:

• Polyhedra. These are finite intersections of halfspaces.

• PSD matrices. S^n_+ = {X ∈ S^n | yᵀXy ≥ 0 for all y ∈ R^n}. This is convex because

0 ≤ yᵀXy = tr(yᵀXy) = tr(Xyyᵀ) = ⟨X, yyᵀ⟩,

so each constraint (one for each y) defines a halfspace, and thus S^n_+ is an infinite intersection of halfspaces.

Proposition 2. Closures of convex sets are convex.

Proof. Exercise.
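The halfspace description of S^n_+ above rests on the identity yᵀXy = ⟨X, yyᵀ⟩. Here is a quick numerical sanity check of it (our own, not part of the notes; the random matrix and seed are arbitrary choices):

```python
import random

# Check y^T X y = tr(X * y y^T) = <X, y y^T> for a random symmetric X.
random.seed(1)
n = 3
X = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i, n):
        X[i][j] = X[j][i] = random.uniform(-1, 1)   # random symmetric matrix
y = [random.uniform(-1, 1) for _ in range(n)]

quad = sum(y[i] * X[i][j] * y[j] for i in range(n) for j in range(n))

yyT = [[y[i] * y[j] for j in range(n)] for i in range(n)]   # rank-one y y^T
prod = [[sum(X[i][k] * yyT[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)]
trace = sum(prod[i][i] for i in range(n))                   # tr(X y y^T)

print(abs(quad - trace) < 1e-12)
```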

But, how about interiors?

Lemma 1 (Accessibility). For a convex set S, if x ∈ int S and y ∈ cl S, then

λx + (1 − λ)y ∈ int S for all λ ∈ (0, 1].


Proof. Assume first that y ∈ S. Since x ∈ int S, there exists δ > 0 such that x + δB ⊆ S. By convexity, for all λ ∈ (0, 1],

λ(x + δB) + (1 − λ)y = λx + (1 − λ)y + λδB ⊆ S,

which implies that λx + (1 − λ)y ∈ int S.

Now suppose that y ∈ cl S. Then there exists a sequence y_k ∈ S with y_k → y. Write

λx + (1 − λ)y = λx + (1 − λ)y_k + (1 − λ)(y − y_k) = λ z_k + (1 − λ)y_k, where z_k := x + ((1 − λ)/λ)(y − y_k).

Notice that when k is big enough, z_k lies in the interior of S. Thus, by our first argument, this convex combination lies in the interior of S.

Corollary 1. The interior of a convex set is convex and, assuming int S ≠ ∅, we have cl int S = cl S.

Question 1. For convex S, is int cl S = int S?

Theorem 1 (Best approximation). Any nonempty closed convex set S has a unique shortest vector x̄, i.e. ‖x̄‖ ≤ ‖x‖ for all x ∈ S. Moreover, x̄ is characterized by the property

⟨x̄, x − x̄⟩ ≥ 0 for all x ∈ S. (1.1)

Proof. Existence. Choose any x̂ ∈ S; the set S₁ = S ∩ ‖x̂‖B is nonempty and compact, so the continuous function ‖·‖ has a minimizer on it, call it x̄. For any x ∈ S, either ‖x‖ > ‖x̂‖ ≥ ‖x̄‖, or x ∈ S₁ and thus ‖x‖ ≥ ‖x̄‖.

Characterization. For x ∈ S and t ∈ [0, 1] we have ‖x̄‖² ≤ ‖x̄ + t(x − x̄)‖²; expanding and dividing by t,

0 ≤ t‖x − x̄‖² + 2⟨x̄, x − x̄⟩.

Letting t → 0 we get (1.1).

Uniqueness. Suppose that there exists another point x′ ∈ S which also satisfies (1.1). Then

⟨x̄, x′ − x̄⟩ ≥ 0 and ⟨x′, x̄ − x′⟩ ≥ 0.

Adding the two inequalities we get ‖x̄ − x′‖² ≤ 0, so x̄ = x′.
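As a small illustration of Theorem 1 (our own, not from the notes; the halfspace S, seed, and tolerance are arbitrary choices): the shortest vector of a halfspace has a closed form, and it satisfies the characterization (1.1) at sampled points of S.

```python
import random

# Shortest vector of the closed convex set S = {x in R^2 : x1 + x2 >= 2}.
# Minimizing ||x|| over S is projecting 0 onto S; for {x : <a,x> >= beta}
# with beta > 0, the projection of the origin is (beta/||a||^2) * a.
a, beta = (1.0, 1.0), 2.0
scale = beta / (a[0] ** 2 + a[1] ** 2)
xbar = (scale * a[0], scale * a[1])                  # = (1.0, 1.0)

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

# Verify the characterization (1.1): <xbar, x - xbar> >= 0 for all x in S.
random.seed(0)
ok = True
for _ in range(1000):
    x = (random.uniform(-5, 5), random.uniform(-5, 5))
    if x[0] + x[1] >= 2:                             # keep only points of S
        ok = ok and dot(xbar, (x[0] - xbar[0], x[1] - xbar[1])) >= -1e-12
print(xbar, ok)
```

Note that here ⟨x̄, x − x̄⟩ = x₁ + x₂ − 2 ≥ 0 holds exactly on S, so the check is tight.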

Theorem 2 (Basic separation). If S is closed and convex and y ∉ S, then there exists a halfspace H such that S ⊆ H and y ∉ H.

Proof. Without loss of generality take y = 0 and use the previous result.

Theorem 3 (Supporting hyperplane (Hahn–Banach)). Suppose that S is convex with int S ≠ ∅, and let y ∉ int S. Then there exists a halfspace H containing S with y ∈ bd H.

Proof. Without loss of generality assume that 0 ∈ int S. For each k ∈ N₊, define the vector

z_k = (1 + 1/k) y.

By the Accessibility Lemma, z_k ∉ cl S, so by basic separation there is a separating hyperplane, given by some a_k ≠ 0, with

⟨a_k, z_k⟩ ≥ ⟨a_k, x⟩ ∀x ∈ S.

Without loss of generality assume that ‖a_k‖ = 1; then by compactness some subsequence converges to a vector a. Thus,

⟨a, y⟩ ≥ ⟨a, x⟩ ∀x ∈ S.


2 Convex functions

There is a very natural way to extend our results of the previous class from sets to functions. The trick to do that is to allow functions to take infinite values, namely ±∞. We will think about functions of the form f : E → R̄ := [−∞, ∞].

Definition 2. Let f : E → R̄ be a function; its epigraph is defined as

epi f := {(x, r) ∈ E × R | f(x) ≤ r}.

Definition 3. We say that a function f is convex if epi f is convex.

Figure 2: Epigraph of the function f(x) = (x² − 1)².

In the other direction, we can define

Definition 4. Given a set C ⊆ E, define the indicator function as

δ_C(x) := 0 if x ∈ C, and +∞ otherwise.

Then it is possible to show that C ⊆ E is convex if, and only if, δ_C is convex.

Definition 5. We say that a function f : S → R is convex if f + δ_S is convex. Equivalently, S is convex and

f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y) ∀x, y ∈ S, λ ∈ [0, 1]. (2.1)

Definition 6. We say that a function f : S → R is strictly convex if inequality (2.1) holds strictly whenever x ≠ y and λ ∈ (0, 1).

Example 4. Norms, e.g. ‖·‖₁, ‖·‖₂, ‖·‖∞, are convex (by the triangle inequality). Note, however, that no norm is strictly convex: along a ray through the origin, (2.1) holds with equality.

Definition 7. We say that x̄ ∈ E is a minimizer for f : E → R̄ if

f(x) ≥ f(x̄) ∀x ∈ E.

Proposition 3. Minimizers of strictly convex functions are unique.

Proof. Easy to check.

Definition 8. We say that x̄ ∈ E is a local minimizer for f : E → R̄ if there exists a neighborhood U ⊆ E of x̄ such that

f(x) ≥ f(x̄) ∀x ∈ U.


Proposition 4. For convex f, local minimizers are minimizers.

Proof. First, we need an easy one-dimensional result for convex functions.

Claim 1. Suppose that g : R₊ → R̄ is convex with g(0) = 0. Then t ↦ g(t)/t is a nondecreasing function.

Proof. Exercise.

Suppose that x̄ is a local minimizer. Given any x ∈ E, we would like to show that f(x) ≥ f(x̄). Define g(t) = f(x̄ + t(x − x̄)) − f(x̄); it is easy to see that this function is convex because f is. Since x̄ is a local minimizer, for small enough t > 0,

0 ≤ g(t)/t ≤ g(1)/1 = f(x) − f(x̄),

where the second inequality follows by the Claim.
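The Claim can be sanity-checked numerically; this snippet (our own, not from the notes) uses g(t) = eᵗ − 1, which is convex with g(0) = 0, and the grid is an arbitrary choice.

```python
import math

# For convex g with g(0) = 0, the slope t -> g(t)/t should be nondecreasing.
def g(t):
    return math.exp(t) - 1.0

ts = [0.1 * k for k in range(1, 100)]                # t in (0, 10)
slopes = [g(t) / t for t in ts]
nondecreasing = all(s <= s_next + 1e-12
                    for s, s_next in zip(slopes, slopes[1:]))
print(nondecreasing)
```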

2.1 Convexity and calculus

Convexity is a one-dimensional idea, so if we understand how it behaves on the reals then we are in good shape for everything else.

Proposition 5. Let h : (a, b) → R be a differentiable function. Then h is (strictly) convex if, and only if, h′ is nondecreasing (strictly increasing, respectively) on (a, b).

Proof. Take two points x, y ∈ (a, b) with x < y.

“⇒” Let s = (h(y) − h(x))/(y − x). Using the previous Claim (based at x) we see that h′(x) ≤ s, and the same idea (based at y) shows that s ≤ h′(y).

“⇐” Without loss of generality (adding a linear function) we can suppose that h(x) = h(y). Suppose, in search of a contradiction, that there exists t ∈ (x, y) such that h(t) > h(x). By the Mean Value Theorem there exists z ∈ (x, t) with h′(z) > 0. Since h′ is nondecreasing, h′ > 0 on [z, y], so h is strictly increasing there; in particular h(y) > h(t) > h(x) = h(y), a contradiction.

Corollary 2. For twice differentiable functions:

1. h is convex on (a, b) if, and only if, h″ ≥ 0 on (a, b).

2. h is strictly convex on (a, b) provided h″ > 0 on (a, b).

Counterexample 1. Notice that h being strictly convex doesn’t imply h″ > 0 on (a, b). Consider, for example, h(x) = x⁴ at zero.

2.1.1 Gradients

Suppose that S ⊆ E is an open set, let x ∈ S be an arbitrary point, and consider f : S → R. There exists at most one g ∈ E∗ such that

f(x + ty) = f(x) + t⟨g, y⟩ + o(t) ∀y;

if such a g exists, we say that f is differentiable at x and call g the gradient ∇f(x). If ∇f(x) exists for all x and is continuous in x, we say that f ∈ C¹. In this case, if the function h_y(x) = ⟨y, ∇f(x)⟩ is C¹ for all y, then we say f ∈ C² and we write

∇²f(x)[y, y] = ⟨y, ∇h_y(x)⟩.


Theorem 4 (Taylor expansion). For f ∈ C² we can write

p(t) := f(x + ty) = f(x) + t⟨∇f(x), y⟩ + (t²/2) ∇²f(x)[y, y] + o(t²),

and p ∈ C² with p″(0) = ∇²f(x)[y, y].

2.1.2 Recognizing smooth convexity

Corollary 3. Let f ∈ C². Then f is convex ⟺ ∇²f(x)[y, y] ≥ 0 for all x ∈ S, y ∈ E. Moreover, f is strictly convex provided ∇²f(x)[y, y] > 0 for all x ∈ S and all y ≠ 0.

Example 5 (Log-Barrier). Let S^n_++ be the set of n × n symmetric positive definite matrices. Consider the function f : S^n_++ → R given by f(X) = −log(det X). Let us check that this function is strictly convex. We follow these steps; the details are left as exercises.

1. First, we need to show that S^n_++ is open; this follows directly from the definition.

2. Additionally, note that S^n_++ is an intersection of open halfspaces, thus it is convex.

3. Consider the expansion det(I + tZ) = 1 + t tr Z + o(t), where the equality follows by expanding the determinant and checking the coefficient of the linear term in t.

4. Hence we can prove that ∇f(X) = −X⁻¹.

5. The next exercise is to prove (I + tZ)⁻¹ = I − tZ + o(t).

6. Hence compute ∇²f(X)[Y, Y] = tr(X⁻¹YX⁻¹Y). Notice that X⁻¹ ∈ S^n_++, so X⁻¹ = AᵀA for some A, and therefore

tr(X⁻¹YX⁻¹Y) = tr(AᵀAYAᵀAY) = tr(AYAᵀAYAᵀ) = ‖AYAᵀ‖²_F > 0 for Y ≠ 0.
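Step 4 can be checked with finite differences. The snippet below is our own check (not part of the notes), hard-coding the 2 × 2 case; it compares the directional derivative of f(X) = −log det X with ⟨−X⁻¹, Y⟩ = −tr(X⁻¹Y). The particular X, Y, and step size are arbitrary choices.

```python
import math

# Finite-difference check that the gradient of f(X) = -log det X is -X^{-1}:
# f(X + tY) - f(X) should be close to t * tr(-X^{-1} Y) for small t.
def f(X):
    (a, b), (_, d) = X
    return -math.log(a * d - b * b)              # -log det, symmetric 2x2

def inv2(X):                                     # inverse of symmetric 2x2
    (a, b), (_, d) = X
    det = a * d - b * b
    return [[d / det, -b / det], [-b / det, a / det]]

def trprod(A, B):                                # tr(A B) for 2x2 matrices
    return sum(A[i][j] * B[j][i] for i in range(2) for j in range(2))

X = [[2.0, 0.5], [0.5, 1.0]]                     # positive definite
Y = [[1.0, -0.3], [-0.3, 0.7]]                   # symmetric direction
t = 1e-6
Xt = [[X[i][j] + t * Y[i][j] for j in range(2)] for i in range(2)]
fd = (f(Xt) - f(X)) / t                          # finite difference
grad_dir = -trprod(inv2(X), Y)                   # <grad f(X), Y> with -X^{-1}
print(abs(fd - grad_dir) < 1e-4)
```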

2.2 Continuity of convex functions

Theorem 5. Let f : E → R̄ be a proper convex function and let x be a point with f(x) < ∞. Then f is Lipschitz continuous¹ around x if, and only if, f is bounded above on a neighborhood of x.

“⇒” Trivial.

“⇐” Without loss of generality x is zero and f(0) = 0, and (by rescaling) assume f is bounded above by 1 on 2B. For any x ∈ 2B we can write 0 = ½x + ½(−x), so 0 = f(0) ≤ ½f(x) + ½f(−x), and hence f(x) ≥ −f(−x) ≥ −1 on 2B. We’ll prove that the function is 2-Lipschitz on the unit ball. Let x, y ∈ B with x ≠ y; consider the segment [x, y] and extend it to [x, z], where y ∈ [x, z] and ‖y − z‖ = 1. Clearly such a point z is still in 2B. Note that z = y + (y − x)/‖y − x‖ and

y = (1/(1 + ‖y − x‖)) x + (‖y − x‖/(1 + ‖y − x‖)) z.

Then convexity implies that

f(y) ≤ (1/(1 + ‖y − x‖)) f(x) + (‖y − x‖/(1 + ‖y − x‖)) f(z),

¹ Meaning: ∃L > 0 such that |f(u) − f(v)| ≤ L‖u − v‖ for all u, v in a neighborhood of x.


thus

f(y) − f(x) ≤ (‖y − x‖/(1 + ‖y − x‖)) (f(z) − f(x)) ≤ 2‖y − x‖,

where the last inequality follows from the fact that f(z) − f(x) ≤ 2.

This last result implies that we can focus on boundedness of convex functions whenever we want to show local continuity. Let’s see a counterexample for a point in the boundary of the domain.

Example 6. Consider the function f : R² → R̄ defined by

f(x, y) = x²/y if y > 0; 0 if (x, y) = (0, 0); +∞ otherwise.

Exercise: show that this function is convex and that its epigraph is closed (i.e. it is lower semicontinuous). Notice that f(1/r, 1/r²) = 1 for all r ∈ N₊ and these points converge to (0, 0), but f(0, 0) = 0, so f is not continuous relative to its domain.
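The discontinuity in Example 6 is easy to observe numerically (our own snippet, not from the notes):

```python
# Along the parabola (1/r, 1/r^2), the function f(x, y) = x^2 / y stays equal
# to 1 even though the points approach the origin, where f is defined to be 0.
def f(x, y):
    if y > 0:
        return x * x / y
    return 0.0 if (x, y) == (0.0, 0.0) else float("inf")

vals = [f(1.0 / r, 1.0 / r ** 2) for r in range(1, 8)]
print(vals, f(0.0, 0.0))   # all values are 1.0, yet f(0, 0) = 0.0
```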

Proposition 6. Let f : E → R̄ be a convex function. Then

f(∑ᵢ₌₁ⁿ λᵢxᵢ) ≤ ∑ᵢ₌₁ⁿ λᵢf(xᵢ)

whenever λᵢ ≥ 0 and ∑ᵢ λᵢ = 1.

Proof. Induction on the number of points.

Proposition 7. Define the simplex Δ_n = {x ∈ R^n | x ≥ 0, ∑ᵢ xᵢ ≤ 1}. Any convex function f : Δ_n → R is bounded above on Δ_n, and hence it is continuous on the interior of Δ_n.

Proof. For any x in the simplex we can write x = ∑ᵢ xᵢ e_i + (1 − ∑ᵢ xᵢ)·0, so f(x) ≤ ∑ᵢ xᵢ f(e_i) + (1 − ∑ᵢ xᵢ) f(0) ≤ maxᵢ{f(e_i), f(0)}, and thus the result follows.

Corollary 4. If f : C → R is convex, then f is continuous on the interior of C.

Proof. Without loss of generality assume that E = R^n. Given any x ∈ int C, put a small enough translated copy of the simplex around x; then f is bounded above near x, and hence continuous at x.

2.3 Subgradients

Now we introduce a concept to study variations of nonsmooth convex functions.

Definition 9. For a convex function f : E → R̄, define dom f = {x ∈ E | f(x) < +∞}. If f(x) is finite and x minimizes f(·) − ⟨y, ·⟩, then we say that y is a subgradient of f at x, and we write y ∈ ∂f(x).

The concept of subgradient is a generalization of the idea of gradient.

Proposition 8. If f ∈ C1 at x then ∂f(x) = {∇f(x)}.

Theorem 6 (Existence of subgradients). Let f : E → R̄ be a convex function and let x ∈ int dom f be an arbitrary point with finite value, i.e. f(x) ∈ R. Then ∂f(x) ≠ ∅.


Proof. Exercise: under these assumptions, f never takes the value −∞. Now dom f is convex and f : dom f → R, hence f is continuous at x. The points (x, f(x) − 1/n) don’t belong to the epigraph and converge to (x, f(x)), so (x, f(x)) ∈ bd(epi f). Using the fact that f is locally Lipschitz, we can show that the point (x, f(x) + 1) lies in the interior of the epigraph; thus int epi f ≠ ∅. Hence, the Hahn–Banach Theorem implies that there exists a supporting hyperplane H, given by some (h, β) ∈ E × R, to epi f at (x, f(x)). Moreover, this hyperplane is not “vertical” (β ≠ 0), since (x, f(x) + 1) is in the interior of epi f. Then it is easy to see that −(1/β)h ∈ ∂f(x).

3 Duality

Definition 10. For any f : E → R̄, the conjugate f∗ : E → R̄ is defined by

f∗(z) = sup_x {⟨x, z⟩ − f(x)}.

Note that f∗ is always convex, since

r ≥ f∗(z) ⟺ r ≥ ⟨z, x⟩ − f(x) (∀x ∈ E), (3.1)

so epi f∗ is an intersection of closed halfspaces and thus convex and closed.

Example 7. Consider the following functions.

1. f(s) = eˢ on R. Then

ψ(t) := f∗(t) = t log t − t if t > 0; 0 if t = 0; +∞ otherwise.

Moreover, if you repeat the process you get f∗∗(s) = f(s) = eˢ. We will see that this is a broader pattern.

2. f(x) = ∑ᵢ e^{xᵢ} for x ∈ R^n. Then

f∗(x) = ∑ᵢ ψ(xᵢ),

which is also called the Boltzmann–Shannon entropy. Here again f∗∗ = f.
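The conjugate formula in item 1 can be verified by brute force (our own check, not from the notes; the grid and tolerances are arbitrary choices):

```python
import math

# f*(t) = sup_s {t*s - e^s} should equal t*log(t) - t for t > 0.
# The sup is attained at s = log(t); a coarse grid maximum should agree.
f_star = lambda t: t * math.log(t) - t

for t in (0.5, 1.0, 2.0, 5.0):
    grid = [k * 1e-3 for k in range(-10000, 10000)]   # s in [-10, 10)
    approx = max(t * s - math.exp(s) for s in grid)
    assert abs(approx - f_star(t)) < 1e-5, (t, approx)
print("conjugate formula confirmed")
```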

Proposition 9. For any f : E → R̄ we have

f(x) + f∗(z) ≥ ⟨x, z⟩.

If f(x) is finite, then equality holds exactly when z ∈ ∂f(x).

Definition 11. Let E and Y be Euclidean spaces. For a linear map A : E → Y, its adjoint A∗ : Y → E is defined by

⟨Ax, y⟩ = ⟨x, A∗y⟩ (∀x ∈ E, ∀y ∈ Y).

In the case E = R^n, Y = R^d, the adjoint is the same as the transpose.

3.1 Fenchel Duality

Now we will see a fundamental result that builds the basis of many algorithms in modern optimization.


Theorem 7 (Fenchel Duality). Let E, Y be Euclidean spaces, let A : E → Y be a linear map, and let f : E → R̄, g : Y → R̄ be arbitrary functions. Define the primal problem

p := inf_{x∈E} {f(x) + g(Ax)}

and the dual problem

d := sup_{y∈Y} {−f∗(A∗y) − g∗(−y)}.

Then (without extra assumptions) p ≥ d (weak duality). If furthermore f and g are convex, never −∞, and

0 ∈ int(dom g − A dom f) := int{u − Av | u ∈ dom g, v ∈ dom f} (constraint qualification),

then we have p = d (strong duality), and moreover d is attained if it is finite. In this case x and y are optimal if and only if

A∗y ∈ ∂f(x) and −y ∈ ∂g(Ax) (complementary slackness).

Moreover, the constraint qualification holds if either

1. ∃x̂ ∈ dom f such that Ax̂ ∈ int dom g, or

2. ∃x̂ ∈ int dom f such that Ax̂ ∈ dom g and A is surjective.

Proof. Weak duality is an easy consequence of the definition of the conjugate (exercise!). Define a value function v : Y → R̄ by

v(z) := inf{f(x) + g(Ax + z) | x ∈ E},

so p = v(0). Saying z ∈ dom v means that the infimum is not +∞; thus z ∈ dom v if, and only if, there exists x ∈ dom f with Ax + z ∈ dom g, which is equivalent to z ∈ dom g − A dom f. Thus, the constraint qualification translates to 0 ∈ int dom v. If p = −∞, we are done by weak duality, so assume p ∈ R is finite. We claim that {(z, r) | r > v(z)} is a convex set. To see this, suppose rᵢ > v(zᵢ) for i = 1, 2; consequently there exist xᵢ with f(xᵢ) + g(Axᵢ + zᵢ) < rᵢ. Take λ ∈ [0, 1]; then

f(λx₁ + (1 − λ)x₂) + g(A(λx₁ + (1 − λ)x₂) + λz₁ + (1 − λ)z₂) ≤ λ[f(x₁) + g(Ax₁ + z₁)] + (1 − λ)[f(x₂) + g(Ax₂ + z₂)] < λr₁ + (1 − λ)r₂.

This set is the strict epigraph of v, so v is convex. Since 0 ∈ int dom v and v(0) is finite, we deduce that v is never minus infinity and, furthermore, that it is locally Lipschitz. Hence v has a subgradient at every point of int dom v, and in particular at zero. Choose −y ∈ ∂v(0). Equivalently,

v(0) + v∗(−y) = ⟨0, −y⟩ = 0,


thus

p = v(0) = −v∗(−y)
= −sup_z {⟨−y, z⟩ − v(z)}
= inf_z {⟨y, z⟩ + v(z)}
= inf_z {⟨y, z⟩ + inf_x {f(x) + g(Ax + z)}}
= inf_{x,z} {⟨y, z⟩ + f(x) + g(Ax + z)}
= inf_x inf_z {⟨y, Ax + z⟩ + g(Ax + z) + ⟨y, −Ax⟩ + f(x)}
= −g∗(−y) + inf_x {⟨−A∗y, x⟩ + f(x)}
= −g∗(−y) − f∗(A∗y).

By weak duality, this shows p = d and that y is dual optimal. Complementary slackness is easy to check using the relationship between the definition of the conjugate and the definition of the subgradient. Checking that the first condition implies the constraint qualification is also easy; to check that the second one does, we need:

Theorem 8 (Open mapping). Any surjective linear map A : E → Y is open, i.e. it maps open sets to open sets.

Proof. It is sufficient to prove that 0 ∈ int AB. We first assume that dim Y = dim E, which means that A is also one-to-one. In search of a contradiction, assume that zero is not in the interior of AB. Then there exists a sequence of points z_i ∉ AB with z_i → 0; writing x_i = A⁻¹z_i, we have x_i ∉ B, so ‖Ax_i‖ → 0 while ‖x_i‖ > 1, and thus ‖Ax_i‖/‖x_i‖ → 0. The sequence (x_i/‖x_i‖) lies in B, so by compactness some subsequence converges; after relabeling, x_i/‖x_i‖ → x∗ with ‖x∗‖ = 1, and continuity gives Ax∗ = 0. This is a contradiction, because then A is not one-to-one.

Now suppose dim E > dim Y; we can reduce this case to the previous one. We know from linear algebra that A((ker A)⊥) = range A = Y and that A(B ∩ (ker A)⊥) = AB. The restriction of A to (ker A)⊥ is therefore a bijective linear map between spaces of equal dimension, so applying the first case to this restriction yields 0 ∈ int A(B ∩ (ker A)⊥) = int AB.

We didn’t finish this proof.

3.2 Conic programs

Let us exemplify the duality theorem using a very broad framework. Consider the primal problem

inf_x ⟨c, x⟩ s.t. Ax ∈ b + H, x ∈ K,

where K ⊆ E and H ⊆ Y are convex cones, which means 0 ∈ K, K + K ⊆ K, and tK ⊆ K for all t ≥ 0 (and likewise for H). Additionally, A : E → Y is a linear map, c ∈ E, and b ∈ Y. Define the dual cone as

K⁺ = {x ∈ E∗ | ⟨x, y⟩ ≥ 0 (∀y ∈ K)};

see Figure 3. On E × Y we can define an inner product that makes it Euclidean, namely ⟨(x, y), (u, v)⟩ := ⟨x, u⟩ + ⟨y, v⟩. Then one can show that K × H is a cone inside this space and, furthermore, that (K × H)⁺ = K⁺ × H⁺.


Figure 3: The cone K in blue, its dual K+ in green.

Example 8. These are three examples of cones to keep in mind.

1. The nonnegative orthant R^n_+. This cone models linear programming.

2. The positive semidefinite matrices S^n_+. This one models semidefinite programming.

3. The second order cone (the ice cream cone) SO^n_+ = {(x, r) ∈ R^n × R | r ≥ ‖x‖}. This cone models second order cone programming.

A nice fact about these cones is that they are self-dual, i.e. K⁺ = K.

Now let us use the Fenchel framework to model conic programs. Write the primal as inf_x {f(x) + g(Ax)} with f = ⟨c, ·⟩ + δ_K and g = δ_{b+H}. Then we can write the dual as

sup_y {−f∗(A∗y) − g∗(−y)},

which becomes (after some simple calculations; try this yourself!)

sup_y ⟨b, y⟩ s.t. A∗y ∈ c − K⁺, y ∈ H⁺.

So under reasonable conditions, the values of the primal and dual are equal. Let’s make this more concrete by focusing on second order cone programming.

Example 9.

inf x₁
s.t. x₂ − t = 0
(x₁, x₂, t) ∈ SO^2_+.

Then c = (1, 0, 0)ᵀ, A = (0, 1, −1), b = 0, H = {0}, and K = SO^2_+. The dual becomes

sup 0
s.t. y·(0, 1, −1)ᵀ ∈ (1, 0, 0)ᵀ − SO^2_+.

13 Lecture 07: Convex calculus February 22 2018

Note that the primal can be written as inf{x₁ | x₂ ≥ √(x₁² + x₂²)}, thus p = 0. On the other side, writing out the constraint of the dual we get

(1, −y, y)ᵀ ∈ SO^2_+ ⟹ y ≥ √(1 + y²),

therefore the dual isn’t feasible, and thus d = −∞.

The moral from this example is: the “reasonable conditions” might not hold even for simple, non-pathological problems. This is something that can happen even in linear programming! Remember that there are settings in which both the primal and the dual are infeasible.

Theorem 9. For conic programs, p ≥ d. Suppose H and K are convex cones and either

1. ∃x̂ ∈ K with Ax̂ − b ∈ int H, or

2. ∃x̂ ∈ int K with Ax̂ − b ∈ H and A is surjective.

Then p = d and the dual is attained if finite. In this case, x and y are both optimal if, and only if, ⟨x, A∗y − c⟩ = 0 and ⟨Ax − b, y⟩ = 0.

Let’s study what is wrong with our example. Consider the perturbed problem

inf x₁
s.t. x₂ − t + z = 0
(x₁, x₂, t) ∈ SO^2_+,

or equivalently v(z) := inf{x₁ | x₂ + z ≥ √(x₁² + x₂²)}. The constraint forces x₂ + z ≥ 0 and 2x₂z + z² ≥ x₁². It is easy to see that

v(z) = −∞ if z > 0; 0 if z = 0; +∞ if z < 0.

Thus, the optimal value is very sensitive to small perturbations around z = 0. Note that epi v is convex, but not closed.
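That v(z) = −∞ for every z > 0 can be seen concretely (our own snippet, not from the notes; the particular z and sample points are arbitrary): for fixed z > 0 the constraint admits feasible points with arbitrarily negative objective.

```python
import math

# For z > 0, the constraint x2 + z >= sqrt(x1^2 + x2^2) is equivalent to
# x1^2 <= 2*x2*z + z^2, so x1 = -sqrt(2*x2*z + z^2) is feasible and tends
# to -infinity as x2 grows.  Hence v(z) = -inf for z > 0.
z = 0.1
objectives = []
for x2 in (1e2, 1e4, 1e6, 1e8):
    x1 = -math.sqrt(2 * x2 * z + z * z)          # feasible by construction
    assert x2 + z >= math.sqrt(x1 ** 2 + x2 ** 2) - 1e-9
    objectives.append(x1)
print(objectives)   # decreasing without bound
```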

3.3 Convex calculus

For C¹ functions f, g we know from the definition of the gradient that ∇(f + g) = ∇f + ∇g. Moreover, the chain rule tells us that if we compose with a linear map, we get ∇(f ∘ A)(x) = A∗∇f(Ax). It turns out we can do something similar for convex functions and subgradients.

Theorem 10 (Subdifferential Calculus). Let f : E → R̄, g : Y → R̄, and let A : E → Y be a linear map. Then

∂(f + g ∘ A)(x) ⊇ ∂f(x) + A∗∂g(Ax).

Furthermore, if f and g are convex, never −∞, and 0 ∈ int(dom g − A dom f), then we have equality.

Proof. The first statement follows directly from the definition. For the converse, suppose we’ve got a subgradient w ∈ ∂(f + g ∘ A)(x), so x minimizes f − ⟨w, ·⟩ + g ∘ A. The result follows after we apply Fenchel duality.


Recall v(z) = inf_x {f(x) + g(Ax + z)}, p = v(0), and d = sup_y {−f∗(−A∗y) − g∗(y)}. Using the same trick that we used in the proof of the duality theorem, we get

v∗(y) = sup_z {⟨y, z⟩ − v(z)} = g∗(y) + f∗(−A∗y).

So d = sup_y {−v∗(y)} = v∗∗(0); thus, fundamentally, if we want to understand duality, we need to understand the relation between v and v∗∗.

3.4 Lagrangian duality

So our goal now is to understand the relation between f and f∗∗. Before we explore that question, let us present a different framework for duality, namely Lagrangian duality. Let U ⊆ E be a convex set, together with convex functions f, gᵢ : U → R, i ∈ {1, …, k}. With this in hand we define the primal problem as

p = inf f(x) s.t. gᵢ(x) ≤ 0 ∀i, x ∈ U.

Let us write a variational version of the primal:

p = inf_{x∈U} sup_{λ∈R^k, λ≥0} {f(x) + λᵀg(x)},

where the function inside the braces is the Lagrangian. The fundamental idea behind weak duality is flipping the sup and the inf, which gives us

p ≥ sup_{λ∈R^k, λ≥0} inf_{x∈U} {f(x) + λᵀg(x)} =: sup_{λ≥0} Φ(λ).

The essence of strong duality is understanding when we have equality. The dual problem is given by

d = sup_{λ≥0} Φ(λ).

Now, to study this we are going to study perturbations of the right-hand side of our primal. Thus we define a value function v : R^k → R̄ given by

v(z) = inf_{x∈U} {f(x) | g(x) ≤ z},

so p = v(0). What happens when we take the conjugate of this function?

v∗(y) = sup_z {yᵀz − v(z)}
= sup_{z, x∈U} {yᵀz − f(x) | g(x) ≤ z}
= sup_{x∈U, s≥0} {yᵀz − f(x) | g(x) + s = z}
= sup_{x∈U, s≥0} {yᵀg(x) + yᵀs − f(x)};


note that this is equal to +∞ unless y ≤ 0, and in this case

v∗(y) = sup_{x∈U, s≥0} {yᵀg(x) + yᵀs − f(x)} = sup_{x∈U} {yᵀg(x) − f(x)} = −Φ(−y).

So we have deduced that

v∗(y) = −Φ(−y) if y ≤ 0, and +∞ otherwise.

Therefore, d = sup_y {−v∗(y)} = v∗∗(0). Hence, again, the fundamental question is: when do we have v(0) = v∗∗(0)? To answer it we need to study what the biconjugate actually is. If we want to understand the biconjugate, we need to focus on the affine functions α(·) that lower bound our function (called affine minorants): the biconjugate is given by the supremum of such minorants.

Figure 4: Various affine minorants of the function f(x) = (x² − 1)³.

Proposition 10. Given any function f : E → R̄ and x ∈ E, then

f∗∗(x) = sup{α(x) | α ≤ f, α affine}.

Proof. The right-hand side is equal to

sup{⟨y, x⟩ − β | ⟨y, u⟩ − β ≤ f(u) (∀u ∈ E)}
= sup{⟨y, x⟩ − β | ⟨y, u⟩ − f(u) ≤ β (∀u ∈ E)}
= sup{⟨y, x⟩ − β | f∗(y) ≤ β}
= sup_y {⟨y, x⟩ − f∗(y)}
= f∗∗(x).

Example 10. Consider the function

f(x) = 0 if x < 0; 1 if x = 0; +∞ otherwise.

Notice that its epigraph is not closed, while the epigraph of its biconjugate f∗∗ is closed; see Figure 5.


Figure 5: (a) Epigraph of the function f. (b) Epigraph of f∗∗.
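The conjugates in Example 10 can be approximated on a grid (our own computation, not from the notes; the grid sizes are arbitrary). Analytically f∗ = δ_{R₊} and f∗∗(x) = 0 for x ≤ 0, +∞ for x > 0; the closure effect is visible at 0, where f∗∗(0) = 0 while f(0) = 1.

```python
# Approximate f*(y) = sup_x {x*y - f(x)} and f**(x) = sup_y {x*y - f*(y)}
# on coarse grids for the function of Example 10.
INF = float("inf")

def f(x):
    if x < 0:
        return 0.0
    return 1.0 if x == 0 else INF

xs = [k * 1e-2 for k in range(-500, 501)]          # grid on [-5, 5]

def f_star(y):
    return max(x * y - f(x) for x in xs if f(x) < INF)

ys = [k * 1e-2 for k in range(0, 501)]             # f* is +inf for y < 0

def f_bistar(x):
    return max(x * y - f_star(y) for y in ys)

print(f_bistar(0.0), f(0.0))   # biconjugate near 0 at the origin, but f(0) = 1
```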

This example motivates the following proposition.

Proposition 11. For any f : E → R̄, the conjugate f∗ is convex and closed (lower semicontinuous).

Proof. Easy to check from the definition (its epigraph is an intersection of closed halfspaces).

Theorem 11. f = f∗∗ if, and only if, f is closed, convex, and either never −∞ or identically −∞.

Proof. Let us prove both implications.

“⇒” The previous proposition shows two of the conditions. The other condition follows by noting that if f is ever −∞, then its conjugate is identically +∞, and hence f∗∗ is identically −∞.

“⇐” If f ≡ −∞ the claim is clear, so without loss of generality we can assume that f is never −∞. It is enough to prove the equality at 0, so we want to prove

f(0) = sup{α(0) | α ≤ f affine}.

First, let us suppose that 0 ∈ dom f. Choose any r ∈ R with r < f(0). Notice that there exists ε > 0 so that f > r on the ball εB: this is because the point (0, r) is not in the (closed) epigraph, so there exists an open box around this point that doesn’t intersect the epigraph.

So r ≤ inf{f + δ_{εB}}. Since int εB intersects dom f, we can apply Fenchel duality and get

r ≤ max_y {−f∗(y) − δ∗_{εB}(−y)} = max_y {−f∗(y) − ε‖y‖}.

Hence there exists y so that r ≤ −f∗(y) − ε‖y‖. This says that f∗(y) + ε‖y‖ ≤ −r, which translates to

α(x) := ⟨y, x⟩ + r + ε‖y‖ ≤ f(x) ∀x,

an affine minorant. Note that α(0) = r + ε‖y‖ ≥ r; hence f∗∗(0) = sup{α(0) | α ≤ f affine} ≥ r. Letting r increase to f(0) gives f(0) = f∗∗(0).

If 0 ∉ dom f, by our previous argument there exists an affine function α ≤ f (obtained by running the argument at any point of dom f). The separating hyperplane theorem implies that there exist z and β so that

⟨z, x⟩ < β < 0 ∀x ∈ dom f.

Now, it is easy to see that for any k ≥ 0,

α(x) + k(⟨z, x⟩ − β) ≤ f(x) ∀x.

This affine function evaluates at zero to α(0) − kβ, which goes to ∞ as k → ∞ (since β < 0). So f∗∗(0) = +∞ = f(0), as we wanted.

Let us continue exploring when f = f∗∗, but now from a local perspective. In particular, let us present new sufficient conditions that may be useful.

Theorem 12. Given x ∈ E, we have f(x) = f∗∗(x) if either

1. f is closed, convex, and proper (never −∞), or

2. f(x) is finite and ∂f(x) ≠ ∅.

Proof. 1. The first statement follows from the previous theorem.

2. Recall that f∗∗(x) = sup{α(x) | α an affine minorant of f}; a subgradient at x gives an affine minorant α with α(x) = f(x).

Now let us use this to study strong duality in Lagrangian duality. Recall that

p = inf_{x∈U} sup_{λ∈R^k, λ≥0} {f(x) + λᵀg(x)}

and the dual is defined as

d = sup_{λ∈R^k, λ≥0} inf_{x∈U} {f(x) + λᵀg(x)}.

Theorem 13. Suppose U is convex and f, gᵢ are convex on U. Suppose there exists a Slater point x̂ ∈ U, that is, gᵢ(x̂) < 0 for all i. Then p = d; moreover, if d is finite then there exists a dual optimal λ̄. On the other hand, if U is compact and the functions f, gᵢ are lower semicontinuous, then p = d and there exists a primal optimal x̄.

Proof. Define the value function

v(z) = inf_{x∈U} {f(x) | g(x) ≤ z};

then p = v(0) and, as we saw earlier, d = v∗∗(0). One can show that v is convex; we leave it as an exercise. Now we are in good shape to apply the previous theorem, because the Slater condition implies that 0 ∈ int dom v. So either p = −∞ and we are done by weak duality, or p is finite, in which case ∂v(0) ≠ ∅ and so v(0) = v∗∗(0). The second set of assumptions implies that v is lower semicontinuous, and therefore v(0) = v∗∗(0).

Knowing the vector of Lagrange multipliers λ̄, we can transform our constrained problem into an unconstrained one.

Proposition 12. Assume that p = d and that λ̄ is an optimal dual solution. Then x̄ ∈ U is primal optimal if, and only if,

1. it is feasible (gᵢ(x̄) ≤ 0),

2. λ̄ᵢ = 0 or gᵢ(x̄) = 0 for all i (complementary slackness), and

3. x̄ minimizes the Lagrangian L(·, λ̄) over U.

Remark 1. If f is strictly convex (or some gᵢ with λ̄ᵢ > 0 is), then L(·, λ̄) is strictly convex and thus has a unique minimizer; we then only need to solve arg min_x L(x, λ̄).


Proof. Left as exercise.

Example 11. Consider the problem

inf e^{x₂} s.t. √(x₁² + x₂²) − x₁ ≤ 0.

After some algebra it is easy to get p = 1 and d = 0. What is wrong with this problem? Notice that g₁ is never strictly negative (the feasible region is {x₂ = 0, x₁ ≥ 0}), so there is no Slater point, and this makes everything fall apart.

4 Algorithms

4.1 Augmented Lagrangian Method

Our goal now is to develop an algorithm for this framework. Before we go into the actual development, we motivate the foundations of the algorithm. Notice

inf_x {f(x) | g(x) ≤ 0} = inf_{x, z≥0} {f(x) | g(x) + z = 0}
= inf_{x, z≥0} {f(x) + ½‖g(x) + z‖² | g(x) + z = 0}
= inf_{x, z≥0} sup_λ {f(x) + ½‖g(x) + z‖² + λᵀ(g(x) + z)}
≥ sup_λ inf_{x, z≥0} {f(x) + ½‖g(x) + z‖² + λᵀ(g(x) + z)}.

Let us state a lemma that helps us understand what this is.

Lemma 2. Assume that g and λ are real numbers. Then

inf_{z∈R₊} {½(g + z)² + λ(g + z)} = ½(λ + g)₊² − ½λ²,

and the infimum is attained at z = (−λ − g)₊, so that λ + (g + z) = (λ + g)₊.

Proof. Left as exercise.
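A brute-force check of Lemma 2 (our own, not from the notes; the sample values, grid, and tolerance are arbitrary choices):

```python
# Compare the closed form 0.5*(lam+g)_+^2 - 0.5*lam^2 against a grid
# minimization of 0.5*(g+z)^2 + lam*(g+z) over z >= 0.
def closed_form(g, lam):
    return 0.5 * max(lam + g, 0.0) ** 2 - 0.5 * lam ** 2

ok = True
for g in (-2.0, -0.5, 0.0, 1.5):
    for lam in (0.0, 0.3, 2.0):
        zs = [k * 1e-4 for k in range(0, 100001)]   # z in [0, 10]
        brute = min(0.5 * (g + z) ** 2 + lam * (g + z) for z in zs)
        ok = ok and abs(brute - closed_form(g, lam)) < 1e-6
print(ok)
```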

Then the right-hand side of our inequalities is equal to

sup_λ inf_x {f(x) + ½‖(λ + g(x))₊‖² − ½‖λ‖²},

where (·)₊ is applied component-wise. The function inside the infimum is the augmented Lagrangian L(x, λ).

Lemma 3. If h : E → R is convex then so is h₊², and ∂(h₊²)(x) = 2h₊(x)∂h(x).

Proof. Also left as an exercise.

Then, using subdifferential calculus, we get the following.

Lemma 4. Suppose that the functions f, gᵢ are all convex, λ ≥ 0, and z = (λ + g(x))₊. Then x minimizes L(·, λ) if, and only if, x minimizes L(·, z).


Proof. Another exercise.

We are now in a position to present the Augmented Lagrangian algorithm.

Algorithm 0: Augmented Lagrangian Algorithm
Data: initial points x₀ ∈ E, λ₀ ∈ R₊ᵐ, convex functions f, gᵢ
repeat
    Choose x_k by minimizing L(·, λ_k);
    Set λ_{k+1} = (λ_k + g(x_k))₊;
until g(x_k)₊ and λ_{k+1}ᵀg(x_k) are small enough;

So what have we achieved with this algorithm? We translated our original constrained problem into a sequence of unconstrained ones, each of which we can solve using any well-established convex solver.
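The scheme above can be sketched on a one-dimensional toy instance, min (x−2)² subject to x − 1 ≤ 0, whose solution is x* = 1 with multiplier λ* = 2 (the instance and the inner gradient loop are illustrative choices, not from the notes):

```python
# Augmented Lagrangian method on: min (x-2)^2  s.t.  x - 1 <= 0.
# Here L(x, lam) = (x-2)^2 + (1/2)max(lam + x - 1, 0)^2 - (1/2)lam^2,
# and the lam^2 term does not affect the inner minimization over x.

def grad_L(x, lam):
    return 2.0 * (x - 2.0) + max(lam + x - 1.0, 0.0)

x, lam = 0.0, 0.0
for _ in range(50):                    # outer multiplier updates
    for _ in range(500):               # inner: minimize L(., lam) by gradient descent
        x -= 0.2 * grad_L(x, lam)
    lam = max(lam + (x - 1.0), 0.0)    # lam_{k+1} = (lam_k + g(x_k))_+

print(x, lam)  # approaches the primal-dual solution (1, 2)
```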

4.2 Proximal Point method and fixed points

Our objective now is to analyze the Augmented Lagrangian Method. Recall that the dual function was defined as Φ(λ) = inf_{x∈U} { f(x) + λᵀg(x) }; however, this function is concave. Since we are thinking about convex solvers, it is useful to redefine it (negating, and extending by +∞) as

    Φ(λ) = −inf_{x∈U} { f(x) + λᵀg(x) }   if λ ≥ 0,
    Φ(λ) = +∞                              otherwise.

This leads to an easy extension of the previous lemma.

Lemma 5. Suppose that the functions f, gᵢ are all convex and λ ≥ 0, z = (λ + g(x))₊. Then, x minimizes L(·, λ) if, and only if, x minimizes L(·, z). Furthermore, in this case λ − z ∈ ∂Φ(z).

So for the dual problem we are minimizing the convex function Φ via

    λ_k − λ_{k+1} ∈ ∂Φ(λ_{k+1}),

or equivalently λ_{k+1} minimizes Φ + ½‖· − λ_k‖² (to see this, write the optimality condition 0 ∈ ∂(Φ + ½‖· − λ_k‖²)(λ_{k+1})). In order to analyze the Augmented Lagrangian algorithm we will need to analyze a related algorithm. Minimizing a convex function plus this extra quadratic is the basic idea of a well-known algorithm called the proximal point method.

Algorithm 1: Proximal point method
Data: initial point x₀ ∈ E, closed convex function f
repeat
    Choose x_{k+1} by minimizing f + ½‖· − x_k‖²;
until some stopping criterion is satisfied;

Note that x_{k+1} minimizes this function if, and only if, either of the two equivalent conditions holds:

    x_k − x_{k+1} ∈ ∂f(x_{k+1})    or    x_{k+1} = (I + ∂f)⁻¹(x_k).
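For intuition, here is the iteration on the nonsmooth f(x) = |x|, whose proximal step has a standard closed form (the soft-threshold); the unit step size is an arbitrary choice for illustration:

```python
# Proximal point method for f(x) = |x|: each step solves
#   x_{k+1} = argmin_u |u| + (1/2)(u - x_k)^2,
# whose solution is the soft-threshold of x_k by 1.
def prox_abs(x, t=1.0):
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0

x = 5.3
for _ in range(10):
    x = prox_abs(x)
print(x)  # hits the minimizer 0 after finitely many steps
```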

But why does x_{k+1} exist, and why are we assuming that it is unique? Essentially, the quadratic term bends the function up (making it strongly convex and coercive), which guarantees existence and uniqueness of the next iterate. The following theorem formalizes this idea.

20 Lecture 10: Proximal Point method and fixed points March 6 2018

Theorem 14. Let f : E → R ∪ {+∞} be a closed convex proper function and let z ∈ E be an arbitrary point. Then we have that

    inf_{x∈E} { f(x) + ½‖x − z‖² } + inf_{y∈E} { f*(y) + ½‖y − z‖² } = ½‖z‖²;

furthermore the optimal solutions of these problems are attained at unique points x*, y*. Additionally, they can be characterized as the solutions of the following feasibility problem:

    z = x + y,    y ∈ ∂f(x).

Remark 2. This theorem is a generalization of the Pythagorean theorem. Notice that if f = δ_L (where L ⊆ E is a subspace) then f* = δ_{L⊥}, which gives us

    x* = P_L z    and    y* = P_{L⊥} z.
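The subspace case is easy to check numerically (random data; the particular subspace is an arbitrary choice):

```python
import numpy as np

# Theorem 14 with f = delta_L, L a subspace of R^5: the optimal points are
# x* = P_L z and y* = P_{L^perp} z, and the two optimal values sum to
# (1/2)||z||^2 -- the Pythagorean theorem in disguise.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2))
P = A @ np.linalg.pinv(A)               # orthogonal projector onto L = range(A)
z = rng.standard_normal(5)

x_star = P @ z                          # minimizes delta_L + (1/2)||. - z||^2
y_star = z - x_star                     # minimizes delta_{L^perp} + (1/2)||. - z||^2
total = 0.5 * np.linalg.norm(z - x_star) ** 2 + 0.5 * np.linalg.norm(z - y_star) ** 2
assert np.isclose(total, 0.5 * np.linalg.norm(z) ** 2)
assert np.allclose(x_star + y_star, z)  # the feasibility system z = x + y
```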

Proof. Note that this proof is symmetric in x* and y*, since f = f**. The first step is to apply Fenchel duality to inf_x { f(x) + ½‖x − z‖² }; we can do this since the domain of ½‖· − z‖² is all of E and the domain of f is nonempty. Then we know that there is no duality gap, but why can we ensure that the optima are attained? We need to ensure that p is finite. Since dom f ≠ ∅, we have p < +∞. To show p > −∞ we are going to prove that f has an affine minorant, and use the fact that the quadratic makes the objective grow faster than linearly.

Claim 2. A convex f : E → R ∪ {+∞} has a subgradient somewhere if, and only if, f is proper.

Assuming that this is true (we will prove it after we finish the proof of the theorem), pick x̄ and g ∈ ∂f(x̄), so that

    f(x) ≥ f(x̄) + ⟨g, x − x̄⟩    for all x.

Consequently

    f(x) + ½‖x − z‖² ≥ f(x̄) + ⟨g, x − x̄⟩ + ½‖x − z‖²,

and the right-hand side has a minimizer; then f(x) + ½‖x − z‖² is bounded below and p > −∞.

Proof of the claim. We need the following proposition.

Proposition 13. Suppose that 0 ∈ U ⊆ E with U convex. If span U = E, then int U ≠ ∅.

Proof. Choose a basis x₁, x₂, ..., x_n ∈ U; we claim that (1/(n+1)) Σᵢ₌₁ⁿ xᵢ ∈ int U. To see this we can use an isomorphism to assume that xᵢ = eᵢ.

Without loss of generality assume that 0 ∈ dom f. Let Y = span dom f and consider the restricted convex function f|_Y : Y → R ∪ {+∞}. By the previous proposition, dom f|_Y has nonempty interior in Y. Let y ∈ int_Y dom f|_Y and choose z ∈ ∂f|_Y(y). Then it is easy to see that z (extended by zero on Y⊥) belongs to ∂f(y).

Now that we know that the Proximal Point Method is well-defined, we want to prove that it actually converges to a minimizer.

Proposition 14. x minimizes f if, and only if, x is a fixed point of the map T(x̄) := arg min { αf + ½‖· − x̄‖² }, for any α > 0. (This is the map that we use to update the iterates in the Proximal Point Method.)

Proof. Exercise.


So, we have transformed the problem of finding a minimizer into the problem of finding a fixed point. Let us use subdifferential calculus (again) to characterize the minimizer at each iteration:

    0 ∈ ∂(αf)(x_{k+1}) + (x_{k+1} − x_k)    ⟺    x_{k+1} ∈ (I + α∂f)⁻¹(x_k).

We need to introduce some relevant notation.

Definition 12. A set-valued mapping T : E ⇒ E is a mapping from E to 2^E (subsets of E). Define its inverse T⁻¹ : E ⇒ E by x ∈ T⁻¹(y) ⟺ y ∈ T(x).

Remark 3. Keep in mind that a set-valued mapping could potentially send a point to ∅.

To get some intuition about this, let us consider an example.

Example 12. If f = δ_S for some closed convex set S ⊆ E, then the resolvent is exactly the projection onto the set S, namely (I + α∂f)⁻¹ = P_S.

Definition 13. Let F : E ⇒ E be a set-valued mapping. We say that F is nonexpansive if for all y ∈ F(x), y′ ∈ F(x′) we have ‖y − y′‖ ≤ ‖x − x′‖; we say that it is a contraction if the inequality holds with a factor c < 1 on the right-hand side.

Remark 4. 1. Notice that in the example above, the map is actually nonexpansive (proved in the first homework).

2. Recall that if the resolvent (I + α∂f)⁻¹ is a contraction, then the Banach contraction mapping theorem guarantees that the method converges, since iterating a contraction converges to its unique fixed point. Furthermore, this theorem gives us a linear rate of convergence depending on the Lipschitz constant.

3. Nonetheless, we don't always have contractions. For instance, consider the previous example with S = E.

Iterating nonexpansive mappings may not converge to a fixed point.

Example 13. Consider the mapping T : R² → R² given by T(x) = R_{π/2}x, where R_{π/2} is the 90° clockwise rotation matrix. Then T^k(1, 1) cycles through {(1, 1), (1, −1), (−1, 1), (−1, −1)}.

Even so, we can overcome this issue by averaging the operator.

Theorem 15. If F : E → E is averaged, meaning that F = (1 − θ)I + θG with G nonexpansive and θ ∈ (0, 1), then the iteration

    x_{k+1} = F(x_k) = (1 − θ)x_k + θG(x_k)

must converge to a fixed point of F (equivalently, of G), assuming one exists. This is known as the Krasnoselskii–Mann iteration (1953).
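The rotation of Example 13 makes a nice test case for this theorem: iterating T cycles forever, while iterating the averaged map (1 − θ)I + θT converges to the fixed point 0 (θ = ½ below is an arbitrary choice):

```python
import numpy as np

# Example 13 revisited: T is a 90-degree rotation (nonexpansive, cycles),
# but the averaged map F = (1 - theta) I + theta T converges to the fixed
# point 0 of T, illustrating Theorem 15.
T = np.array([[0.0, 1.0], [-1.0, 0.0]])   # 90-degree clockwise rotation
x_plain = np.array([1.0, 1.0])
x_avg = np.array([1.0, 1.0])
theta = 0.5
for _ in range(200):
    x_plain = T @ x_plain                               # cycles, never converges
    x_avg = (1 - theta) * x_avg + theta * (T @ x_avg)   # Krasnoselskii-Mann

print(np.linalg.norm(x_plain), np.linalg.norm(x_avg))
```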

But why do we care about this result? Well, resolvents (I + α∂f)⁻¹ are always averaged. For example, when (I + α∂f)⁻¹ = P_S we have that C = 2P_S − I is nonexpansive and therefore P_S = (C + I)/2; see Figure 6.


Figure 6: Reflections are nonexpansive and projections are averaged.

The key idea of the proof of this theorem is simple, yet it is fundamental to our toolkit. We will use it extensively, which is why it is useful to move to a more general setting.

Definition 14. A mapping T : E ⇒ E is monotone if for all y ∈ T(x), y′ ∈ T(x′) we have

    ⟨y − y′, x − x′⟩ ≥ 0.

Example 14. The subdifferential ∂f is monotone. Indeed, let y ∈ ∂f(x), y′ ∈ ∂f(x′); then we get

    ⟨y, x′ − x⟩ ≤ f(x′) − f(x)    and    ⟨y′, x − x′⟩ ≤ f(x) − f(x′);

adding these two together gives monotonicity.

Definition 15. For a monotone operator T : E ⇒ E and α > 0, we define

1. the resolvent R := (I + αT)⁻¹,

2. the Cayley operator C := 2R − I.

Lemma 6. R and C are nonexpansive and hence R is averaged.

Proof. Exercise.

Lemma 7. ‖(1 − θ)a + θb‖² = (1 − θ)‖a‖² + θ‖b‖² − θ(1 − θ)‖a − b‖².

Proof. Both terms are quadratic in θ, so it is enough to check three points. For example θ = 0, 1, 1/2.

Now let us present the more general version of Theorem 15.

Theorem 16 (Krasnoselskii–Mann iteration). Suppose that F : E → E is averaged, X = {x | x = F(x)} ≠ ∅, and x_{k+1} = F(x_k) for k = 0, 1, ... . Then,

1. x_k → x* for some x* ∈ X;

2. ‖x_k − x̄‖ is nonincreasing for every x̄ ∈ X (Fejér monotonicity);

3. min_{j≤k} ‖x_j − G(x_j)‖ = O(1/√k).

23 Lecture 11: Proximal Point method and fixed points March 8 2018

Figure 7: Monotone mapping. (a) Graph. (b) Solvability: the red line has to intersect the graph for every y.

Proof. Choose any x̄ ∈ X; then

    ‖x_{k+1} − x̄‖² = ‖F(x_k) − x̄‖²
      = ‖(1 − θ)x_k + θG(x_k) − ((1 − θ)x̄ + θG(x̄))‖²
      = ‖(1 − θ)(x_k − x̄) + θ(G(x_k) − G(x̄))‖²
      = (1 − θ)‖x_k − x̄‖² + θ‖G(x_k) − G(x̄)‖² − θ(1 − θ)‖(x_k − x̄) − (G(x_k) − G(x̄))‖²
      ≤ (1 − θ)‖x_k − x̄‖² + θ‖x_k − x̄‖² − θ(1 − θ)‖x_k − G(x_k)‖²
      = ‖x_k − x̄‖² − θ(1 − θ)‖x_k − G(x_k)‖².

Reordering gives us the second item, and a telescoping sum yields

    θ(1 − θ) Σᵢ₌₀ᵏ ‖xᵢ − G(xᵢ)‖² ≤ ‖x₀ − x̄‖²    for all k.

Hence ‖x_k − G(x_k)‖ → 0, and the telescoped bound gives min_{j≤k} ‖x_j − G(x_j)‖² ≤ ‖x₀ − x̄‖²/(θ(1 − θ)(k + 1)), which is the third item. Now, let us prove the first statement. By 2. the sequence (x_k)_k is bounded, so there is a convergent subsequence with limit x*. Note that along the subsequence ‖G(x_k) − x_k‖ → 0 and G is continuous (since it is nonexpansive), so x* ∈ X. But ‖x_k − x*‖ is nonincreasing, hence x_k → x*.

Monotonicity is a constraint on the graph of the operator, gph(T) = {(x, y) | y ∈ T(x)}. If we take E = R, it guarantees that the graph looks like the graph of a monotonically nondecreasing function with some jumps; see Figure 7. Note that we are not assuming that the resolvent is well-defined. We showed that this is fine when the operator is a subdifferential, but for a general operator, when does this happen? This is a question of solvability of a system of the form

    y − x ∈ αT(x)    (solvable in x, for every y);    (4.1)

see Figure 7. Thus we want the dotted line to intersect the graph; to ensure solvability we want the graph of T to be as large as possible.

Proposition 15. Let T : E ⇒ E be a monotone operator and let α > 0 be an arbitrary constant. Assume that T satisfies (4.1). Then T is maximal, i.e. if T′ : E ⇒ E is monotone with T(x) ⊆ T′(x) for all x, then T = T′.


Proof. Suppose that y ∈ T′(x); we want to show that y ∈ T(x). First notice that αy ∈ αT′(x), and by assumption there exists z so that (x + αy) − z ∈ αT(z) ⊆ αT′(z). Then by monotonicity of T′,

    ⟨(x + αy − z) − αy, z − x⟩ ≥ 0,  i.e.  −‖x − z‖² ≥ 0,  so  x = z.

Hence αy = (x + αy) − z ∈ αT(x).

Next we are going to prove that maximality is exactly what we need to ensure that the resolvent is nonempty-valued and a function. Before we get to that, let's state an immediate corollary of this proposition.

Corollary 5. For a closed convex proper f : E → R ∪ {+∞}, the subdifferential ∂f is maximal monotone.

Theorem 17 (Minty, 1962). Let T : E ⇒ E be a monotone operator and α > 0 a constant. Then T is maximal if, and only if, (4.1) holds true.

We are not going to show this theorem, because the proof is neither straightforward nor illuminating. Moreover, we rather rarely use it in practice; in general, it is simpler to show directly that the resolvent is well-defined.

Remark 5. If we assume the Axiom of Choice, then any monotone operator is contained in a maximal monotone operator.

This framework will allow us to prove convergence rates for a bunch of different algorithms.

4.3 Smooth Minimization

Theorem 18. Suppose f : E → R has an L-Lipschitz gradient ∇f. Then

    |f(y) − f(x) − ⟨∇f(x), y − x⟩| ≤ (L/2)‖y − x‖².

Proof. We can rewrite the left-hand side of the inequality as

    |∫₀¹ (d/dt) f(x + t(y − x)) dt − ⟨∇f(x), y − x⟩| = |∫₀¹ ⟨∇f(x + t(y − x)) − ∇f(x), y − x⟩ dt|
      ≤ ∫₀¹ |⟨∇f(x + t(y − x)) − ∇f(x), y − x⟩| dt
      ≤ ∫₀¹ ‖∇f(x + t(y − x)) − ∇f(x)‖ ‖y − x‖ dt
      ≤ ∫₀¹ Lt‖y − x‖² dt = (L/2)‖x − y‖².

In particular, what this theorem tells us is that we can build a quadratic upper bound on f at x, namely

    f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖x − y‖² =: f̃(y).

This gives us the idea of minimizing the quadratic upper bound f̃ (which is a simpler function than f) and still getting a guarantee on the improvement of the next iterate; see Figure 8. Interestingly, f̃ is minimized at x − (1/L)∇f(x), which suggests the gradient method

    x_{k+1} ← x_k − α∇f(x_k)

for some α > 0 related to 1/L. In particular, if we take α = 1/L what we see is that

    f(x_k) − f(x_{k+1}) ≥ (1/2L)‖∇f(x_k)‖².    (4.2)

What happens if the function is convex?
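Inequality (4.2) can be checked on a random convex quadratic (a toy instance chosen for illustration):

```python
import numpy as np

# Check (4.2): for a gradient step with alpha = 1/L on a convex quadratic,
#   f(x_k) - f(x_{k+1}) >= (1/2L) ||grad f(x_k)||^2.
rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = M.T @ M + np.eye(4)                 # positive definite Hessian
L = np.linalg.eigvalsh(A).max()         # Lipschitz constant of grad f = A x

f = lambda x: 0.5 * x @ A @ x
x = rng.standard_normal(4)
for _ in range(20):
    g = A @ x
    x_next = x - g / L
    assert f(x) - f(x_next) >= 0.5 / L * (g @ g) - 1e-12
    x = x_next
print(f(x))  # f decreased at every step
```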

25 Lecture 12: Splitting Problems and Alternating projections March 13 2018

Figure 8: Majorizing convex function f˜.

Theorem 19. Let f : E → R be a convex function with L-Lipschitz gradient. Then we have

    ⟨x − y, ∇f(x) − ∇f(y)⟩ ≥ (1/L)‖∇f(x) − ∇f(y)‖²    (co-coercivity)

and hence the map x ↦ x − α∇f(x) is nonexpansive for 0 ≤ α ≤ 2/L and, as an immediate consequence, averaged for 0 ≤ α < 2/L. Thus the gradient method converges to a minimizer (by Krasnoselskii–Mann), if one exists, for 0 < α < 2/L.

Proof. All we have to do is check the co-coercivity condition; once we do that, the rest is immediate. Fix y ∈ E and define the convex function g = f − ⟨∇f(y), ·⟩. In particular, y minimizes g and g also has an L-Lipschitz gradient. Then for any x, by (4.2),

    g(x) − (1/2L)‖∇g(x)‖² ≥ g(x − (1/L)∇g(x)) ≥ g(y).

Rewriting all the terms (using ∇g(x) = ∇f(x) − ∇f(y)), we get

    f(x) − ⟨∇f(y), x⟩ − (1/2L)‖∇f(x) − ∇f(y)‖² ≥ f(y) − ⟨∇f(y), y⟩.

This argument is symmetric, so we can swap x and y. To get the result, add the two inequalities.

Remark 6. Whenever you see "co-" in a name it usually refers to properties of the conjugate of f. This property corresponds to strong convexity of f* (we are not going to prove this).

4.4 Splitting Problems and Alternating projections

We now turn our attention to splitting problems, in which two functions are involved, each of which we know how to handle on its own, and we want to leverage this to minimize some combination of them.

Example 15. Given two closed convex sets X, Y ⊆ E, find a point in their intersection X ∩ Y, assuming the projections P_X and P_Y are easy to compute.


Figure 9: Alternating projections behavior. (a) Converges. (b) Doesn't converge.

More generally we can describe this example as

inf{kx − yk | x ∈ X, y ∈ Y} (4.3)

Then, the question becomes how can we combine them in a clever way to get PX∩Y?

Example 16. We can model optimization problems using this framework; for example, for linear programming we can define X = {x ∈ Rⁿ | Ax = b} and Y = Rⁿ₊.

One of the simplest ways to do this is alternating projections, an idea that dates back to von Neumann (1933).

Algorithm 2: Alternating projections
Data: initial point x₀ ∈ E, and maps P_X, P_Y
repeat
    Set y_{k+1} ← P_Y x_k;
    Set x_{k+1} ← P_X y_{k+1};
until x_k = P_X P_Y x_k;

Now, when does this algorithm converge? And when it does, how rapidly does it converge?
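For the linear-programming feasibility sets of Example 16 the two projections are explicit, and the iteration can be sketched as follows (random feasible instance chosen for illustration):

```python
import numpy as np

# Alternating projections between X = {x : Ax = b} (affine) and
# Y = R^n_+ (the nonnegative orthant), as in Example 16.
rng = np.random.default_rng(2)
A = rng.standard_normal((3, 6))
b = A @ np.abs(rng.standard_normal(6))      # feasible by construction
pinv = np.linalg.pinv(A)

P_X = lambda x: x - pinv @ (A @ x - b)      # projection onto the affine set
P_Y = lambda x: np.maximum(x, 0.0)          # projection onto the orthant

x = np.zeros(6)
for _ in range(10000):
    x = P_X(P_Y(x))
print(np.linalg.norm(A @ x - b), x.min())   # Ax = b holds; x is nearly nonnegative
```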

Proposition 16. The following two statements are equivalent:

1. the point x is a fixed point of P_X ∘ P_Y;

2. the pair (x, P_Y x) solves (4.3).

Proof. Note that (x̄, ȳ) is a solution of (4.3) if, and only if, (0, 0) ∈ ∂f(x̄, ȳ) where f(x, y) = ½‖x − y‖² + δ_X(x) + δ_Y(y). It is easy to check that

    ∂f(x, y) = (x − y, y − x) + (∂δ_X(x) × ∂δ_Y(y)).

Thus, (x̄, ȳ) solves (4.3) if, and only if, 0 ∈ x̄ − ȳ + ∂δ_X(x̄) and 0 ∈ ȳ − x̄ + ∂δ_Y(ȳ), which is equivalent to having

    x̄ ∈ arg min { ½‖· − ȳ‖² + δ_X }    and    ȳ ∈ arg min { ½‖· − x̄‖² + δ_Y },

which is equivalent to the second condition.


Exercise 1. Let F be L-Lipschitz and F′ be L′-Lipschitz. Prove that F + F′ is (L + L′)-Lipschitz and F ∘ F′ is LL′-Lipschitz. Hence show that compositions of averaged operators are averaged.

This almost gives us convergence, but not quite, we need to ensure that there is a fixed point.

Theorem 20 (Cheney–Goldstein 1959). Let X and Y be closed convex sets that have a closest pair (x̄, ȳ) (as holds if X, Y are nonempty and one is compact). Then alternating projections converges to a closest pair.

Proof. It follows immediately from the previous exercise and Theorem 16.

4.5 Proximal Gradient

Consider a closed convex proper function f : E → R ∪ {+∞}; we define the proximal mapping by

    prox_f(x) = arg min { f + ½‖· − x‖² }.

Note that prox_{αf} = (I + α∂f)⁻¹ is the resolvent. Recall that the proximal point method is defined by the recursion

    x_{k+1} ← prox_{αf}(x_k) = (I + α∂f)⁻¹ x_k.

This is what is known in the literature as a backward method (because of the inverse). Such terminology comes from differential equations, where these ideas were originally developed. Additionally, for smooth functions with ∇f L-Lipschitz we defined the gradient method by the recursion

    x_{k+1} ← x_k − α∇f(x_k) = (I − α∇f)x_k.

This is what is known as a forward method (no inverse). Both methods are majorization–minimization methods, where the key idea is to construct a majorizing convex function f̃ ≥ f at x (i.e. f̃(x) = f(x)) and then select the next point in arg min f̃; see Figure 8. We are now going to study a method that generalizes these two. We are interested in solving a problem of the form

    inf_{x∈E} f(x) + g(x)    (4.4)

where f, g : E → R ∪ {+∞} are closed convex proper functions and f is smooth with ∇f L-Lipschitz. Given x, define the majorizing convex function h by

    h(z) = f(x) + ⟨∇f(x), z − x⟩ + (1/2α)‖z − x‖² + g(z)

for some α > 0 (with α ≤ 1/L, so that h majorizes f + g). Notice that ∂h(x) = ∇f(x) + ∂g(x), and that completing the square gives

    h(z) = (1/2α)‖z − (x − α∇f(x))‖² − (α/2)‖∇f(x)‖² + f(x) + g(z).

The minimizer of this function is prox_{αg}(x − α∇f(x)). Note that this captures the spirit of a splitting method: we take a forward gradient step on f, followed by a backward step on g.

Algorithm 3: Proximal gradient method

Data: initial point x₀ ∈ E, and maps ∇f, prox_{αg}
repeat
    Set x_{k+1} ← prox_{αg}(x_k − α∇f(x_k)) = (I + α∂g)⁻¹ ∘ (I − α∇f) x_k;
until x_{k+1} is close enough to x_k;

This is also known as the forward–backward method. It is easy to see (using the same arguments as before) that the following proposition holds.

28 Lecture 13: Douglas-Rachford March 15 2018

Proposition 17. The following holds: x̄ is a fixed point of the forward–backward iteration if, and only if, it minimizes f + g.

Additionally we have the following convergence result.

Theorem 21. The proximal gradient method converges to a minimizer, if one exists, as long as 0 < α < 2/L.

Proof. The proof is analogous to the one we gave for alternating projections.

Example 17 (LASSO). A very well-known (and useful) problem in statistics is to solve a regression while ensuring that the solution is sparse. Tibshirani (1996) introduced the following nonsmooth convex formulation:

    inf ½‖Ax − b‖² + λ‖x‖₁,

where λ > 0 is a constant. The nice thing about this problem is that we can compute the forward and backward steps analytically. Let's see how. We can define

    f(x) = (1/2λ)‖Ax − b‖²,    ∇f(x) = (1/λ)Aᵀ(Ax − b),

and notice that

    prox_{α|·|}(s) = S_α(s) := { s − α   if s ≥ α,
                                 0       if |s| ≤ α,
                                 s + α   if s ≤ −α.

Then, since the proximal operator of α‖·‖₁ decouples across coordinates, we get

    (prox_{α‖·‖₁}(x))ᵢ = S_α(xᵢ).

Thus, the proximal gradient algorithm is easy to implement; for this example it is known as iterative soft thresholding (ISTA).
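The resulting algorithm can be sketched in a few lines (synthetic data; λ, the sparsity pattern, and the iteration count are illustrative choices):

```python
import numpy as np

# ISTA for (1/2)||Ax - b||^2 + lam ||x||_1: a gradient step on the smooth
# part followed by coordinate-wise soft thresholding.
def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(3)
A = rng.standard_normal((30, 10))
x_true = np.zeros(10)
x_true[:3] = [2.0, -1.5, 1.0]                # sparse ground truth
b = A @ x_true
lam = 0.5
L = np.linalg.eigvalsh(A.T @ A).max()        # Lipschitz constant of the gradient

x = np.zeros(10)
for _ in range(2000):
    x = soft(x - A.T @ (A @ x - b) / L, lam / L)
print(np.round(x, 2))  # approximately recovers the sparse signal
```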

Why is ISTA good? And why would we pick ISTA over other algorithms?

1. ISTA just needs matrix-vector multiplications and soft thresholding, and so it is extremely cheap!

2. Gradient method doesn’t (typically) converge to a minimizer even when the iterates have well-defined gradients.

3. The subgradient method works and we have a guarantee for it, but the objective function decreases too slowly, O(1/√k), both in theory and in practice. The proximal gradient method, on the other hand, converges at a rate of O(1/k) (or linearly in practice).

4. The proximal point method is hard because we need to compute the proximal mapping of the entire objective function, so each iteration is no easier than the original problem.

4.6 Douglas-Rachford

This method dates back to the 1950s and came from ideas in differential equations and infinite-dimensional problems. Recall that for a monotone operator A : E ⇒ E we defined the resolvent R_A = (I + αA)⁻¹ and the Cayley operator C_A = 2R_A − I, and we know that both of them are nonexpansive. Moreover, note that fixed points of R_A and C_A are exactly the zeros of the original operator A, i.e. {x | 0 ∈ Ax}.


Douglas and Rachford asked themselves the marvelous question of what the fixed points of compositions of these operators are. Consider another monotone operator B, and consider the fixed points of C_A ∘ C_B:

    y = C_A C_B y
      ⟺ y = 2R_A(2R_B y − y) − (2R_B y − y)
      ⟺ R_B y = R_A(2R_B y − y)        (set x = R_B y)
      ⟺ x = R_A(2x − y)
      ⟺ 2x − y ∈ x + αAx
      ⟺ x − y ∈ αAx.

Notice that x = R_B y is equivalent to y − x ∈ αBx. Thus we can characterize the fixed points of the composition.

Proposition 18. Let A, B : E ⇒ E be monotone operators. Then the zeros of A + B are exactly the images under R_B of the fixed points of C_A ∘ C_B.

Proof.

    0 ∈ (A + B)x ⟺ there exists z with z ∈ αAx and −z ∈ αBx
                 ⟺ there exists y so that x − y ∈ αAx and y − x ∈ αBx,

and the result now follows from our previous argument.

So, basically, we've transformed our original problem into a fixed-point problem, which allows us to use Krasnoselskii–Mann on the averaged iteration

    y ← ½(I + C_A C_B) y.

Formally, this reads as

Algorithm 4: Douglas-Rachford

Data: initial point y₀ ∈ E, and maps R_A, R_B
repeat
    Set x_{k+1} ← R_B y_k;
    Set y′ ← 2x_{k+1} − y_k;
    Set x′ ← R_A y′;
    (Set y″ ← 2x′ − y′ = 2x′ − 2x_{k+1} + y_k);
    Set y_{k+1} ← ½(y_k + y″) = y_k + x′ − x_{k+1};
until some stopping criterion is met;

This algorithm splits the use of R_A and R_B; there is no need to know anything about A + B.

Theorem 22. Assume that A, B are monotone and A + B has a zero. Then the sequence (y_k) converges to some ȳ, and if x̄ = R_B ȳ then 0 ∈ (A + B)x̄.

Proof. Follows immediately from KM Theorem.
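A concrete sketch: take A and B to be subdifferentials of indicator functions, so both resolvents are projections and R_B of the limit lies in the intersection of the two sets (random feasible instance chosen for illustration):

```python
import numpy as np

# Douglas-Rachford for 0 in (A + B)x with A = d(delta of an affine set)
# and B = d(delta of the orthant); R_A and R_B are the projections.
rng = np.random.default_rng(4)
M = rng.standard_normal((3, 6))
b = M @ np.abs(rng.standard_normal(6))       # the intersection is nonempty
pinv = np.linalg.pinv(M)

R_A = lambda v: v - pinv @ (M @ v - b)       # projection onto {v : Mv = b}
R_B = lambda v: np.maximum(v, 0.0)           # projection onto the orthant

y = np.zeros(6)
for _ in range(5000):
    x = R_B(y)
    y = y + R_A(2 * x - y) - x               # y_{k+1} = y_k + x' - x_{k+1}
x = R_B(y)
print(np.linalg.norm(M @ x - b), x.min())    # x is (nearly) in both sets
```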

4.6.1 Consensus optimization

We are interested in minimizing

    inf_x Σᵢ₌₁ᵏ fᵢ(x)

30 Lecture 14: ADMM: Alternating Directions Method of Multipliers March 20 2018

where fᵢ : E → R ∪ {+∞} are closed convex proper functions. Think of each of these as relatively simple or prox-friendly, meaning that we know how to compute its prox mapping easily. Equivalently,

    inf { f(x₁, ..., x_k) + δ_L(x₁, ..., x_k) }

where f(x₁, ..., x_k) = Σᵢ₌₁ᵏ fᵢ(xᵢ) and L = {(x₁, ..., x_k) | x₁ = x₂ = ··· = x_k}. Thus, we'd like to solve

    0 ∈ ∂(f + δ_L)(x₁, ..., x_k).

Assume that ∩ᵢ int dom fᵢ ≠ ∅; then

    ∂(Σᵢ fᵢ)(x) = Σᵢ ∂fᵢ(x).

Hence

    x solves the original problem ⟺ 0 ∈ ∂(Σᵢ fᵢ)(x)
      ⟺ there exist zᵢ ∈ ∂fᵢ(x) with Σᵢ zᵢ = 0
      ⟺ (z₁, ..., z_k) ∈ ∂f(x, ..., x), with (x, ..., x) ∈ L and Σᵢ zᵢ = 0.

Furthermore, it is easy to check that for such zᵢ's we have −(z₁, ..., z_k) ∈ L⊥ = ∂δ_L(x, ..., x). So we will apply Douglas–Rachford with A = ∂δ_L and B = ∂f. By the previous convergence theorem, if the original problem has a minimizer, Douglas–Rachford generates a convergent sequence y_k → y* and R_B y* is a minimizer. In particular we are going to apply

    R_B(y₁, ..., y_k) = (prox_{αfᵢ}(yᵢ))ᵢ₌₁ᵏ    and    R_A(y₁, ..., y_k) = P_L(y₁, ..., y_k) = (ȳ, ..., ȳ),

where ȳ = (1/k) Σᵢ yᵢ (easy to check). Therefore, the algorithm looks like

Algorithm 5: Consensus optimization via Douglas-Rachford
Data: initial point y₀ ∈ Eᵏ, and maps prox_{αfᵢ} for all i
repeat
    For all i set x⁽ⁱ⁾_{k+1} ← prox_{αfᵢ}(y⁽ⁱ⁾_k);
    Set y⁽ⁱ⁾_{k+1} ← y⁽ⁱ⁾_k + 2x̄_{k+1} − ȳ_k − x⁽ⁱ⁾_{k+1};
until some stopping criterion is met;
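With quadratic fᵢ(x) = ½(x − aᵢ)² each prox is closed-form, and the iteration above drives all copies to the average of the aᵢ (a toy instance chosen for illustration):

```python
import numpy as np

# Consensus Douglas-Rachford for inf_x sum_i (1/2)(x - a_i)^2.
# Each prox_{alpha f_i} is closed-form; the minimizer is mean(a) = 5.
a = np.array([1.0, 4.0, 10.0])
alpha = 1.0
prox = lambda y: (y + alpha * a) / (1 + alpha)    # coordinate-wise prox_{alpha f_i}

y = np.zeros(3)
for _ in range(200):
    x = prox(y)                                    # x^(i) = prox_{alpha f_i}(y^(i))
    y = y + 2 * x.mean() - y.mean() - x            # the y^(i) update of Algorithm 5
print(prox(y))  # all coordinates agree on the minimizer 5.0
```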

4.7 ADMM: Alternating Directions Method of Multipliers

We now consider problems of the following form:

    inf_{x∈X, z∈Z}  f(x) + g(z)
    s.t.  Ax + Bz = c,        (P)

where X, Z, Y are Euclidean spaces, A : X → Y, B : Z → Y are linear maps, c ∈ Y is given, and f : X → R ∪ {+∞}, g : Z → R ∪ {+∞} are closed proper convex functions.


Augmented Lagrangian. In this setting the augmented Lagrangian is given by

    L(x, z, u) = f(x) + g(z) + (α/2)‖Ax + Bz − c − u‖².

The ADMM algorithm is then an iterative algorithm that alternates between minimizing (and maximizing) with respect to the variables x, z, and u.

Algorithm 6: ADMM

Data: initial points x₀, z₀, u₀, and function L
repeat
    Set x_{k+1} ← arg min_x L(x, z_k, u_k);
    Set z_{k+1} ← arg min_z L(x_{k+1}, z, u_k);
    Set u_{k+1} ← u_k + c − Ax_{k+1} − Bz_{k+1};
until some stopping criterion is met;

Motivation. We seek a saddle point of L(x, z, u) − (α/2)‖u‖², that is, a point that achieves the common value

    min_{x,z} max_u { L(x, z, u) − (α/2)‖u‖² } = max_u min_{x,z} { L(x, z, u) − (α/2)‖u‖² }.

We can read the algorithm as

1. Minimize over x,

2. Minimize over z,

3. Do a gradient step for u.

We are going to assume robust feasibility, c ∈ int (A dom f + B dom g), and that (P) is attained. Under this setting we'll prove that α(u + c − Ax) yields a dual optimal solution. One thing to keep in mind is that the inner routines might be expensive: minimizing L(·, z, u) and L(x, ·, u) could be difficult.

Example 18. Consider X = Y = Z and let c = 0, A = I and B = −I. Then ADMM becomes a prox step on f, a prox step on g, and a simple u update. Let us show the versatility of this method by instantiating different f's and g's.

1. Intersections. If we set f = δ_P and g = δ_Q, then we just want to find a point in P ∩ Q. Hence, ADMM recovers the so-called Dykstra alternating projection method.

2. Linear programming. Set f = ⟨q, ·⟩ + δ_{e+L} (where L is a subspace) and g = δ_{Rⁿ₊}.

3. Compressed sensing. Set f = ‖·‖₁ and g = δ_{e+L}. Then, ADMM recovers the basis pursuit algorithm.

4. LASSO. Set f = ½‖A· − b‖² and g = λ‖·‖₁. (Check this!)

5. Least absolute deviations. Set f = ‖· − b‖₁ and g = 0.
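A minimal sketch in this A = I, B = −I, c = 0 setting, with f a strongly convex quadratic and g = λ‖·‖₁ (data and parameters are illustrative choices); the three updates are exactly the steps of Algorithm 6, and the limit can be checked against the closed-form soft-threshold solution:

```python
import numpy as np

# ADMM for min (1/2)||x - w||^2 + lam ||z||_1  s.t.  x - z = 0
# (Example 18's setting with A = I, B = -I, c = 0). The solution is the
# soft-thresholding of w by lam, which we can verify directly.
def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

w = np.array([3.0, -0.2, 1.5, 0.05])
lam, alpha = 1.0, 1.0
x = np.zeros(4); z = np.zeros(4); u = np.zeros(4)
for _ in range(200):
    x = (w + alpha * (z + u)) / (1 + alpha)   # argmin_x L(x, z_k, u_k)
    z = soft(x - u, lam / alpha)              # argmin_z L(x_{k+1}, z, u_k)
    u = u + z - x                             # u_{k+1} = u_k + c - Ax - Bz
print(x)  # converges to soft(w, lam)
```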

To analyze this problem we define the value function v : Y → R ∪ {±∞}, with v(c) the optimal value of (P). It is easy to show that v is convex (prove it!). Then,

    v*(y) = sup_c { ⟨y, c⟩ − v(c) }
          = sup_{c,x,z} { ⟨y, c⟩ − f(x) − g(z) | Ax + Bz = c }
          = f*(A*y) + g*(B*y).

32 Lecture 15: ADMM: Alternating Directions Method of Multipliers March 22 2018

Hence, the primal value satisfies

    v(c) ≥ ⟨c, y⟩ − v*(y) = ⟨c, y⟩ − f*(A*y) − g*(B*y),

and taking sup_y on the right-hand side gives the dual problem. Equality holds for a particular y if, and only if, y ∈ ∂v(c). Robust feasibility says c ∈ int dom v, so by the existence of subgradients such a y exists. Since there exists a primal solution, v(c) is finite; hence the primal and dual values are equal. Therefore, we derive the optimality condition: y is dual optimal if, and only if, there exist x ∈ ∂f*(A*y) and z ∈ ∂g*(B*y) with Ax + Bz = c. Thus, we get

    y is dual optimal ⟺ 0 ∈ Sy + Ty,    where Sy := A∂f*(A*y) − c and Ty := B∂g*(B*y).

It is easy to see that both these operators are monotone (since ∂f* and ∂g* are monotone). Thus, we can apply Douglas–Rachford.

Algorithm 7: ADMM problem via Douglas-Rachford
Data: initial point y₀, and maps R_S, R_T
repeat
    Set w ← R_T(y);
    Set y′ ← 2w − y;
    Set w′ ← R_S(y′);
    Set y ← y − w + w′;
until some stopping criterion is met;

We know (thanks to the theory we developed in the previous section) that this converges to a dual optimal solution. We will soon see that this is exactly the same as doing ADMM on the original problem! Let us elaborate on the operators in the previous algorithm. Notice that

    w = R_T(y) ⟺ y ∈ (I + αT)w
      ⟺ y − w ∈ αB∂g*(B*w)
      ⟺ ∃z ∈ ∂g*(B*w) with y − w = αBz
      ⟺ B*w ∈ ∂g(z) with w = y − αBz
      ⟺ B*y − αB*Bz ∈ ∂g(z) with w = y − αBz
      ⟺ 0 ∈ ∂g(z) − B*y + αB*Bz and w = y − αBz
      ⟺ z minimizes g − ⟨B*y, ·⟩ + (α/2)‖B·‖² and w = y − αBz.

What we discover is that R_T(y) = y − αBz for any z minimizing the aforementioned function; thus, if we can minimize that function, we can compute the resolvent. Computing R_S is similar (but minimizing over x). Finally, a change of variables y = α(u + c − Ax) and some dry algebra (where it's important to keep track of the order in which we minimize, i.e. apply the resolvents) shows that Douglas–Rachford reduces exactly to ADMM. One can also show that under reasonable conditions the x's and z's converge; the analysis is a little more involved and we are not going to show it.

33 Lecture 16: Projected subgradient method April 10 2018

4.8 Splitting

Consider the following example:

    inf_{x∈Rⁿ} ½‖x − w‖² + ‖Ax‖₁,

with w ∈ Rⁿ. Note that if we try to use the proximal gradient method, computing the prox of ‖A·‖₁ is essentially the original problem, and there is a similar issue if we naively try to apply ADMM. Nonetheless, we can reformulate the problem and apply ADMM; let us see how. In general, assume that we have

    inf f(x) + g(Ax).

Equivalently, we write

    inf f(x) + g(z)
    s.t. w − x = 0,  Aw − z = 0.

Then

    L(w, x, z, u, v) = f(x) + g(z) + (α/2)(‖w − x − u‖² + ‖Aw − z − v‖²).

ADMM then gives:

1. w solves w − x − u + A*(Aw − z − v) = 0;

2. x ← prox_{αf}(w − u),  z ← prox_{αg}(Aw − v);

3. u ← u + x − w,  v ← v + z − Aw;

but then step 1 requires solving linear equations, which is really bad when the dimension is big. We want to apply a method that is only allowed to multiply by A or A*.

Theorem 23. Suppose T is a monotone operator and it has a zero. Consider a sequence defined by

    H(x_k − x_{k+1}) ∈ αT(x_{k+1}),

with α > 0 and H : E → E linear, self-adjoint and positive definite. Then the sequence converges to a zero of T.

Proof. The matrix H has a square root, call it L, with H = L². Now make the change of variables y = Lx. The iteration becomes y_k − y_{k+1} ∈ αL⁻¹T(L⁻¹y_{k+1}); it is easy to check that L⁻¹ ∘ T ∘ L⁻¹ is monotone, and this is the proximal point iteration for that operator, so the result follows.

This is like preconditioning, we want to choose H in a clever way to evaluate the resolvent fast. I had to leave in the middle of the class, so I didn’t scribe the section of the Chambolle-Pock algorithm.

4.9 Projected subgradient method

For a convex f : E → R with Lipschitz gradient we showed that the gradient method

    x_{k+1} ← x_k − α∇f(x_k)

satisfies min_{t≤k} ‖∇f(x_t)‖ = O(1/√k) (via Krasnoselskii–Mann). In fact, we can improve this to O(1/k), and furthermore the gap between the function value and its infimum also decreases at this rate. Although this improvement doesn't seem incredible on paper, computationally it makes a great difference. We will spend a little while trying to answer the following questions about this method:


• Can we do better?

• What if f is nonsmooth?

Remark 7. For a smooth f, −∇f(x) is a descent direction; that is, the function value decreases if we take a sufficiently small step in that direction. This follows from the fact that

    f(x − t∇f(x)) = f(x) − t‖∇f(x)‖² + o(t).

However, the negative of a subgradient 0 ≠ g ∈ ∂f(x) is not necessarily a descent direction.

Example 19. Consider f(u, v) = 2|u| + v, it is easy to check that g = (2, 1) ∈ ∂f(0, 0). Further,

f(−(2t, t)) = 3t > 0 = f(0, 0) ∀t > 0.

Nonetheless, although the negative subgradient doesn't decrease the function value, it does decrease the distance to minimizers, which allows us to consider the following algorithm for nonsmooth convex functions. Assume that f : E → R ∪ {+∞} is closed convex proper and let ∅ ≠ Q ⊆ dom f be a closed convex subset. Additionally, assume that there exists x̄ ∈ arg min_Q f and that f is locally M-Lipschitz at x̄, that is, Lipschitz on x̄ + γB (for some γ > 0).

Algorithm 8: Projected subgradient method

Data: initial point x₀ ∈ Q
repeat
    Set x_{k+1} ← P_Q(x_k − t_k g_k/‖g_k‖), where 0 ≠ g_k ∈ ∂f(x_k);
until some stopping criterion is met;

Let's take a look at

    ‖x_{k+1} − x̄‖² = ‖P_Q(x_k − t_k g_k/‖g_k‖) − P_Q(x̄)‖²
      ≤ ‖x_k − t_k g_k/‖g_k‖ − x̄‖²
      = ‖x_k − x̄‖² − 2t_k ⟨g_k/‖g_k‖, x_k − x̄⟩ + t_k²,

where the inner product satisfies ⟨g_k/‖g_k‖, x_k − x̄⟩ ≥ (1/‖g_k‖)(f(x_k) − f(x̄)) > 0 if x_k is not a minimizer.
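The method can be sketched on a toy nonsmooth problem: minimize f(x) = max(x₁, x₂) over the unit ball Q, whose minimizer is −(1, 1)/√2 with value −1/√2 (the instance and the step schedule t_k = 1/√(k+1) are illustrative choices):

```python
import numpy as np

# Projected subgradient method for f(x) = max(x_1, x_2) over the unit ball.
# A subgradient at x is e_j for any maximizing index j, and projecting onto
# the ball just rescales points outside it.
def subgrad(x):
    return np.eye(2)[int(np.argmax(x))]          # unit-norm subgradient

def proj_ball(x):
    n = np.linalg.norm(x)
    return x / n if n > 1.0 else x

x = np.array([1.0, 0.0])
best = np.inf
for k in range(20000):
    x = proj_ball(x - subgrad(x) / np.sqrt(k + 1.0))
    best = min(best, float(x.max()))
print(best, -1.0 / np.sqrt(2.0))                 # best value approaches -1/sqrt(2)
```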

Lemma 8. If 0 ≠ g ∈ ∂f(x), then

    ⟨g/‖g‖, x − x̄⟩ ≥ (1/M)(f(x) − f(x̄)),

provided the left-hand side is at most γ.

Proof. Let y be the projection of x̄ onto the hyperplane {z | ⟨g, z − x⟩ = 0} (draw a picture); then ‖y − x̄‖ equals the left-hand side, so

    f(y) − f(x̄) ≤ M‖y − x̄‖,

and we also have f(y) − f(x) ≥ ⟨g, y − x⟩ = 0. Combining the two gives the result.

Assuming that ‖x₀ − x̄‖ ≤ γ, adding the inequalities above over the iterates gives

    γ² + Σᵢ≤k tᵢ² ≥ 2 Σᵢ≤k tᵢ ⟨gᵢ/‖gᵢ‖, xᵢ − x̄⟩.

35 Lecture 17: Accelerated Proximal Gradient April 12 2018

Now, let’s assume that f(xi) − f(x¯) ≥  > 0 for all i ≤ k. In particular,

 ≤ f(x0) − f(x¯) ≤ Mγ.

gi We claim that this implies h , xi − x¯i ≥ /M for all i ≤ k, otherwise we contradict our lemma. kgik Hence 2 2 γ + ti ≥ 2/M ti i≤k i≤k X X or equivalently 2 2 γ + i≤k t M i ≥ . 2 t Pi≤k i Which allows us to conclude P 2 2 γ + i≤k ti min f(xi) − f(x¯) ≤ M . i≤k 2 t Pi≤k i √  k  Then if we take t = λ/ i + 1 gives a rate of λMO log√ . P i k Actually, this rate is almost optimal, we cannot expect to do much better. Suppose an iterative algorithm for minimizing a convex function, with a minimizer x¯. Furthermore, it is M-Lipschitz on x + γB. Then starting at x0 inside that ball with an oracle, that given any x returns f(x) and one subgradient g ∈ ∂f(x). The only thing we know about the algorithm is that xk+1 ∈ x0span{g0, ... , gk}. Suppose we guarantee that

\[
\min_{i \le k} f(x_i) - f(\bar{x}) \ \le\ \gamma M \varphi(k)
\]
(we use these constants since $\gamma M$ is the natural scaling of the problem). The question is: what is the best $\varphi$ we can achieve? We will see that the best we can hope for is $\varphi(k) = O\!\big(\tfrac{1}{\sqrt{k}}\big)$. The following example proves this lower bound.

Example 20. Take $Q = \mathbb{R}^n$ and let

\[
f(x) = \max_{i \le k} x_i + \frac{1}{2}\|x\|^2.
\]

Start with $x_0 = 0$. This is a strictly convex function, so it has a unique minimizer. It is easy to check (exercise!) that the minimizer is $\bar{x} = -\frac{1}{k}(1, \dots, 1, 0, \dots, 0)$ (with $k$ ones at the beginning), so $f(\bar{x}) = -\frac{1}{2k}$. It is easy to see that $\gamma = \frac{1}{\sqrt{k}}$ and $M = 1 + \frac{1}{\sqrt{k}}$. The adversarial oracle returns
\[
e_j + x \in \partial f(x),
\]
where $j$ is the smallest index attaining the maximum. Notice by induction that $(x_m)_j = 0$ for all $j > m$. Hence $\max_{i \le k} (x_{k-1})_i \ge (x_{k-1})_k = 0$, so $f(x_{k-1}) \ge 0$ and
\[
f(x_{k-1}) - f(\bar{x}) \ \ge\ \frac{1}{2k} \ =\ \frac{M\gamma}{2(1 + \sqrt{k})}.
\]
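The projected subgradient method of Algorithm 8, with the step sizes $t_k = \lambda/\sqrt{k+1}$ analyzed above, can be sketched in a few lines of Python. The test problem below (minimizing the 1-norm over a box, whose projection is a coordinatewise clip) and the choice $\lambda = 1$ are illustrative assumptions, not part of the notes.

```python
import numpy as np

def proj_box(x, lo=-1.0, hi=2.0):
    # Projection onto the box Q = [lo, hi]^n is a coordinatewise clip.
    return np.clip(x, lo, hi)

def f(x):
    # f(x) = ||x||_1: closed, convex, Lipschitz.
    return np.abs(x).sum()

def subgrad(x):
    # One subgradient of the 1-norm: sign(x), picking +1 at kinks.
    g = np.sign(x)
    g[g == 0] = 1.0
    return g

def projected_subgradient(x0, iters=500):
    # Algorithm 8: x_{k+1} = P_Q(x_k - t_k g_k/||g_k||), with t_k = 1/sqrt(k+1).
    x, best = x0.copy(), f(x0)
    for k in range(iters):
        g = subgrad(x)
        t = 1.0 / np.sqrt(k + 1)
        x = proj_box(x - t * g / np.linalg.norm(g))
        best = min(best, f(x))  # the guarantee is on min_{i<=k} f(x_i)
    return best

best = projected_subgradient(np.array([2.0, -1.0, 1.5]))
```

Over the box $[-1,2]^3$ the minimizer of the 1-norm is the origin, so `best` slowly approaches $0$, consistent with the $O(\log k/\sqrt{k})$ rate derived above.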

4.10 Accelerated Proximal Gradient

We consider
\[
\inf f, \qquad f = g + h,
\]

where both functions are convex and $g$ is smooth with $\nabla g$ $L$-Lipschitz. Let us start with a prototypical algorithm for this problem.

Algorithm 9: Proximal gradient method (prototype)

Data: initial points $x_0, v_0 \in \mathbf{E}$
repeat
    Set $\theta \gets \theta_k$;
    Set $y \gets (1-\theta)x_k + \theta v_k$;
    Set $\hat{y} \gets y - \frac{1}{L}\nabla g(y)$;
    Set $x_{k+1} \in \operatorname{arg\,min}_x \big\{ h(x) + \frac{L}{2}\|x - \hat{y}\|^2 \big\}$;
    Set $v_{k+1} \gets x_k + \frac{1}{\theta}(x_{k+1} - x_k)$;
    Update $\theta_{k+1}$;
until some stopping criterion is met;

Note that $v_{k+1}$ extrapolates beyond $x_{k+1}$ in the direction $x_{k+1} - x_k$. We still have to decide on the parameters $\theta_k$; let us analyze the algorithm and see what we can get. Since $\nabla g$ is $L$-Lipschitz,
\[
g(x_{k+1}) \ \le\ g(y) + \langle \nabla g(y), x_{k+1} - y \rangle + \frac{L}{2}\|x_{k+1} - y\|^2.
\]
By definition of $x_{k+1}$ we have $0 \in \partial h(x_{k+1}) + L(x_{k+1} - \hat{y})$, which implies $L(\hat{y} - x_{k+1}) \in \partial h(x_{k+1})$. Then, by the subgradient inequality, for all $z \in \mathbf{E}$,
\[
h(x_{k+1}) \ \le\ h(z) + L\langle \hat{y} - x_{k+1}, x_{k+1} - z \rangle
\ =\ h(z) + L\langle y - x_{k+1}, x_{k+1} - z \rangle - \langle \nabla g(y), x_{k+1} - z \rangle,
\]
where the equality follows from $\hat{y} = y - \frac{1}{L}\nabla g(y)$. Adding these two inequalities and using convexity of $g$,
\begin{align*}
f(x_{k+1}) &\le h(z) + g(y) + \langle \nabla g(y), z - y \rangle + L\langle y - x_{k+1}, x_{k+1} - z \rangle + \frac{L}{2}\|x_{k+1} - y\|^2 \\
&\le f(z) + L\langle y - x_{k+1}, x_{k+1} - z \rangle + \frac{L}{2}\|x_{k+1} - y\|^2 \qquad \forall z.
\end{align*}
In particular, substituting $z = x_k$ and $z = x$ and taking the convex combination with weights $1-\theta$ and $\theta$, we get
\[
f(x_{k+1}) \ \le\ (1-\theta) f(x_k) + \theta f(x) + \frac{L}{2}\|x_{k+1} - y\|^2 + L\langle y - x_{k+1},\, x_{k+1} - (1-\theta)x_k - \theta x \rangle.
\]
Rearranging gives
\begin{align*}
\frac{2}{L}\Big( (f(x_{k+1}) - f(x)) - (1-\theta)(f(x_k) - f(x)) \Big)
&\le \|x_{k+1} - y\|^2 + 2\langle y - x_{k+1},\, x_{k+1} - (1-\theta)x_k - \theta x \rangle \\
&= \|y - ((1-\theta)x_k + \theta x)\|^2 - \|x_{k+1} - ((1-\theta)x_k + \theta x)\|^2 \\
&= \theta^2 \|v_k - x\|^2 - \theta^2 \|v_{k+1} - x\|^2,
\end{align*}
using $y - ((1-\theta)x_k + \theta x) = \theta(v_k - x)$ and $x_{k+1} - ((1-\theta)x_k + \theta x) = \theta(v_{k+1} - x)$. Dividing by $\theta^2$ and rearranging, we get
\[
\frac{2}{\theta_k^2 L}\big(f(x_{k+1}) - f(x)\big) + \|v_{k+1} - x\|^2 \ \le\ \frac{2(1-\theta_k)}{\theta_k^2 L}\big(f(x_k) - f(x)\big) + \|v_k - x\|^2.
\]
This inequality allows us to study the convergence rate of the algorithm. To telescope it, we need to pick $\theta_k$ so that $\frac{1-\theta_k}{\theta_k^2} \le \frac{1}{\theta_{k-1}^2}$. If we set
\[
\theta_k = \frac{2}{k+1} \quad \Longrightarrow \quad \frac{1-\theta_k}{\theta_k^2} = \frac{k^2 - 1}{4} \ \le\ \frac{k^2}{4} = \frac{1}{\theta_{k-1}^2}.
\]

Lecture 18: Nonconvex calculus April 17 2018

Now, by induction, what we end up with is
\[
\frac{2}{\theta_k^2 L}\big(f(x_{k+1}) - f(x)\big) \ \le\ \frac{2(1-\theta_1)}{\theta_1^2 L}\big(f(x_1) - f(x)\big) + \|v_0 - x\|^2,
\]
which, since $\theta_1 = 1$ kills the first term on the right, can be rewritten as
\[
f(x_{k+1}) - f(x) \ \le\ \frac{2L\|v_0 - x\|^2}{(k+1)^2}.
\]

This gives the algorithm:

Algorithm 10: Accelerated proximal gradient method

Data: initial point $x_0 \in \mathbf{E}$
Set $x_{-1} \gets x_0$;
repeat (for $k = 1, 2, \dots$)
    Set $y \gets x_{k-1} + \frac{k-2}{k+1}(x_{k-1} - x_{k-2})$;
    Set $\hat{y} \gets y - \frac{1}{L}\nabla g(y)$;
    Set $x_k \in \operatorname{arg\,min}_x \big\{ h(x) + \frac{L}{2}\|x - \hat{y}\|^2 \big\}$;
until some stopping criterion is met;
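The accelerated scheme can be sketched in a few lines of Python. As an illustrative instance (an assumption on my part, not fixed by the notes), take the smooth part $g(x) = \frac{1}{2}\|Ax - b\|^2$ and $h = \lambda\|\cdot\|_1$, whose prox is the soft-threshold map.

```python
import numpy as np

def soft_threshold(v, tau):
    # prox of tau*||.||_1: argmin_x { tau*||x||_1 + 0.5*||x - v||^2 }.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def accelerated_proximal_gradient(A, b, lam, iters=200):
    # Minimize f = g + h with g(x) = 0.5*||Ax - b||^2 and h = lam*||.||_1.
    L = np.linalg.norm(A, 2) ** 2                    # Lipschitz constant of grad g
    x_prev = x = np.zeros(A.shape[1])
    for k in range(1, iters + 1):
        y = x + (k - 2) / (k + 1) * (x - x_prev)     # momentum step
        yhat = y - A.T @ (A @ y - b) / L             # gradient step on g
        x_prev, x = x, soft_threshold(yhat, lam / L) # prox step on h
    return x

# With A = I the gradient step maps any y to b, so the method lands on the
# exact solution soft_threshold(b, lam) after one iteration and stays there.
x = accelerated_proximal_gradient(np.eye(3), np.array([3.0, 0.5, -2.0]), lam=1.0)
# x is [2., 0., -1.]
```

For a general $A$ the iterates instead converge at the $O(1/k^2)$ rate established above.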

Theorem 24. Under the assumptions of this section, the algorithm exhibits the convergence rate
\[
f(x_k) - \inf f = O\!\Big(\frac{1}{k^2}\Big).
\]
This is actually optimal: one can construct examples giving matching lower bounds for any algorithm of this type. Notice that the rate is dimension independent; if one allows dependence on the dimension, one can prove "better" bounds. Note that the difference between $1/k^2$ and $1/k$ is a huge deal! This is especially relevant for large-scale problems: $1/100 = 0.01$ while $1/100^2 = 0.0001$. This line of research was started by Nesterov (1983, 1988) and continued, for example, by Tseng (2008) and FISTA (Beck-Teboulle 2008).

5 Variational analysis

We are now going to move away from algorithms and go back to understanding sets and functions (as in the beginning of the class). In particular, we are going to try to generalize the ideas we considered in convex analysis.

5.1 Nonconvex calculus

For smooth functions we use linear approximations via Taylor expansions. For nonsmooth convex functions we focused on minorizing functions, because optimization is in a sense a one-sided discipline: we are only interested in minimizing the function. Now we are going to combine these two ideas.

Definition 16. For $f : \mathbf{E} \to \overline{\mathbb{R}}$ finite at $x$, we say that $y \in \hat{\partial} f(x)$ is a regular subgradient if, and only if,
\[
f(x + z) \ \ge\ f(x) + \langle y, z \rangle + o(z),
\]
where $o(z)$ denotes a function with $\frac{o(z)}{\|z\|} \to 0$ as $\|z\| \to 0$.


Exercise 2. Let $g : \mathbf{E} \to \overline{\mathbb{R}}$ be convex and finite at $x$, and let $h$ be differentiable at $x$. Then
\[
\hat{\partial}(g + h)(x) = \partial g(x) + \nabla h(x).
\]
(So in particular, $\hat{\partial}$ unifies the convex subdifferential $\partial$ and the gradient $\nabla$ of a smooth function.)

Unlike in the convex setting, subdifferential calculus rules for $\hat{\partial}$ are somewhat weaker. Let us explore what we can say about sums and compositions of functions.

Theorem 25 (Fuzzy sum rule). Let $f_i : \mathbf{E} \to \overline{\mathbb{R}}$ ($i = 1, 2$) be lower semicontinuous and finite at $x$. If $y \in \hat{\partial}(f_1 + f_2)(x)$, then there exist $x_1, x_2$ close to $x$ with $f_i(x_i)$ close to $f_i(x)$, and $y_i \in \hat{\partial} f_i(x_i)$ with $y_1 + y_2$ close to $y$.

Before we start with the proof, let us present a couple of facts/exercises.

Exercise 3.

1. If $x \in \mathbf{E}$ is a local minimizer of $f$, then $0 \in \hat{\partial} f(x)$.

2. Conversely, if $0 \in \hat{\partial} f(x)$, then $x$ is a strict local minimizer of $f + \delta \|\cdot - x\|$ for any $\delta > 0$.

3. If $f$ is $L$-Lipschitz around $x$, then every $y \in \hat{\partial} f(x)$ satisfies $\|y\| \le L$.

4. If $f$ is lower semicontinuous and $g$ is smooth, then $\hat{\partial}(f + g)(x) = \hat{\partial} f(x) + \nabla g(x)$.

Proof. Without loss of generality we may assume that the point in question is zero, the subgradient is zero, and the function values are zero: $0 \in \hat{\partial}(f_1 + f_2)(0)$ and $f_1(0) = f_2(0) = 0$ (why?). Since each $f_i$ is closed, there exists $\epsilon > 0$ so that $f_1, f_2 > -1$ on $\epsilon B$. Fix $\delta > 0$; by part 2 of the previous exercise we may assume (after possibly shrinking $\epsilon$) that
\[
f_1(x) + f_2(x) + \delta\|x\| \ >\ 0 = f_1(0) + f_2(0) \qquad \forall\, 0 \neq x \in \epsilon B.
\]

Fix a sequence of positive parameters $0 < \mu_r \uparrow +\infty$, and construct a sequence of problems with a quadratic penalty:
\[
\rho_r(x_0, x_1, x_2) = \delta\|x_0\| + f_1(x_1) + f_2(x_2) + \frac{\mu_r}{2}\Big( \|x_0 - x_1\|^2 + \|x_0 - x_2\|^2 \Big).
\]

Note that $\rho_r$ is closed (lower semicontinuous), and if we restrict the variables to the compact set $(x_0, x_1, x_2) \in (\epsilon B)^3$ then $\rho_r$ is bounded below. Thus it has a minimizer $(x_0^r, x_1^r, x_2^r)$ over $(\epsilon B)^3$. Moreover, the sequence $\{(x_0^r, x_1^r, x_2^r)\}_r$ is bounded, so we may assume it converges to some $(x_0^\star, x_1^\star, x_2^\star) \in (\epsilon B)^3$. Furthermore, notice that
\[
\rho_r(x_0^r, x_1^r, x_2^r) \ \le\ \rho_r(0, 0, 0) = 0.
\]
Hence
\[
\frac{\mu_r}{2}\Big( \|x_0^r - x_1^r\|^2 + \|x_0^r - x_2^r\|^2 \Big) \ \le\ -f_1(x_1^r) - f_2(x_2^r) \ \le\ 2,
\]
so rearranging and taking limits gives
\[
\|x_0^\star - x_1^\star\|^2 + \|x_0^\star - x_2^\star\|^2 \ \le\ 0,
\]
where the inequality follows since $\mu_r \uparrow +\infty$. Thus $x_0^\star = x_1^\star = x_2^\star =: x^\star$. Also,
\[
\delta\|x_0^r\| + f_1(x_1^r) + f_2(x_2^r) \ \le\ 0;
\]
taking limits and using lower semicontinuity gives
\[
\delta\|x^\star\| + f_1(x^\star) + f_2(x^\star) \ \le\ 0,
\]
and recall that the unique minimizer of this function is zero, thus $x^\star = 0$. Moreover, $f_1(x_1^r) + f_2(x_2^r) \le 0$, and since the functions are closed, eventually $f_i(x_i^r) \ge -\mu$ for any $\mu > 0$. So $f_i(x_i^r) \to 0$ as $r \to \infty$.

Lecture 19: Subgradients and nonconvexity April 19 2018

Since the points converge to zero, eventually $(x_0^r, x_1^r, x_2^r)$ is an unconstrained local minimizer of $\rho_r$. Varying just $x_0$ in the $r$th problem shows that
\[
0 \in \delta B + \mu_r\big( (x_0^r - x_1^r) + (x_0^r - x_2^r) \big);
\]
additionally, varying $x_1$ and $x_2$ (respectively) gives
\[
0 \in \hat{\partial} f_1(x_1^r) + \mu_r(x_1^r - x_0^r)
\qquad \text{and} \qquad
0 \in \hat{\partial} f_2(x_2^r) + \mu_r(x_2^r - x_0^r).
\]
Then we are done: defining $y_1^r = \mu_r(x_0^r - x_1^r) \in \hat{\partial} f_1(x_1^r)$ and $y_2^r = \mu_r(x_0^r - x_2^r) \in \hat{\partial} f_2(x_2^r)$, we have $\|y_1^r + y_2^r\| \le \delta$, $x_i^r \to 0$, and $f_i(x_i^r) \to 0$.

Example 21. Consider the function $f(x) = -|x|$. Then
\[
\hat{\partial} f(x) = \begin{cases} \{-1\} & \text{if } x < 0, \\ \emptyset & \text{if } x = 0, \\ \{1\} & \text{if } x > 0. \end{cases}
\]
The graph of this subdifferential is not closed (this was always the case in the convex setting), which causes both theoretical and numerical issues. Our goal now is to fix this. (Picture.)

5.2 Subgradients and nonconvexity

As we saw, our definition of $\hat{\partial}$ gives an operator whose graph need not be closed; the definition is not robust with respect to limits.

Definition 17 (Limiting subgradients). We say that $y \in \partial f(x)$ if there exist sequences $(x_r)_r \subseteq \mathbf{E}$ and $(y_r)_r$ such that $y_r \in \hat{\partial} f(x_r)$, $y_r \to y$, and $x_r \to x$.

Remark 8. In particular, for continuous $f$ we get $\operatorname{gph} \partial f = \operatorname{cl}\big( \operatorname{gph} \hat{\partial} f \big)$.

Example 22. For $f(x) = -|x|$, we have
\[
\partial f(x) = \begin{cases} \{-1\} & \text{if } x < 0, \\ \{-1, 1\} & \text{if } x = 0, \\ \{1\} & \text{if } x > 0. \end{cases}
\]

Exercise 4.

1. If $f$ is $C^1$, then $\partial f(x) = \{\nabla f(x)\}$.

2. If $f$ is closed, convex, and proper, then
\[
\partial f(x) = \{ g \in \mathbf{E} \mid (\forall y \in \mathbf{E}) \ f(y) \ge f(x) + \langle g, y - x \rangle \}.
\]

3. If $f$ is Lipschitz, then $\partial f$ has a closed graph and $\partial f(x)$ is always compact.

4. If $x \in \mathbf{E}$ is a local minimizer of $f$, then $0 \in \partial f(x)$.

The price we pay with this construction is that $\partial f(x)$ need not be convex.

Definition 18. For any set $S \subseteq \mathbf{E}$, we define

\[
\operatorname{conv} S = \Big\{ \sum_i \lambda_i x_i \ \Big|\ x_i \in S,\ \lambda_i \ge 0,\ \sum_i \lambda_i = 1 \Big\}.
\]
Equivalently (by Caratheodory's Theorem), we can define
\[
\operatorname{conv} S = \Big\{ \sum_{i=1}^{\dim \mathbf{E} + 1} \lambda_i x_i \ \Big|\ x_i \in S,\ \lambda_i \ge 0,\ \sum_i \lambda_i = 1 \Big\}.
\]


Proposition 20. If S is compact, so is conv S.

Definition 19 (Clarke subdifferential). For a locally Lipschitz function $f : \mathbf{E} \to \mathbb{R}$, we define
\[
\partial^c f(x) = \operatorname{conv}\big( \partial f(x) \big).
\]

Remark 9. Clarke introduced this definition in his PhD thesis in 1973.

Try to prove as an exercise that the properties in Exercise 4 also hold for $\partial^c f$. We built this definition up in several layers; now we want a better understanding of what it is and how to compute it.

Interlude: introduction to measure theory

Definition 20. A subset $S \subseteq \mathbb{R}^n$ has measure zero if for all $\epsilon > 0$ there exists a countable sequence of boxes $B_1, B_2, \dots$ with $S \subseteq \bigcup_i B_i$ and $\sum_i \operatorname{vol} B_i < \epsilon$. We say that a property holds almost everywhere (a.e.) if it holds at every point outside a set of measure zero.

Theorem 26 (Fubini's Theorem). Suppose $S \subseteq \mathbb{R}^n$ has measure zero. Then, for every $z \neq 0$, the set
\[
\{ t \in \mathbb{R} \mid x + tz \in S \}
\]
has measure zero for almost all $x$.

Theorem 27 (Rademacher's Theorem). Lipschitz functions are differentiable almost everywhere.

Back to nonsmooth analysis. Clarke's idea was to use measure-theoretic tools to study his subdifferential.

Definition 21. Suppose $f$ is Lipschitz, and choose any set $S$ of measure zero such that $f$ is differentiable on $S^c$. Define
\[
\tilde{\partial} f(x) = \operatorname{conv}\Big\{ \lim_r \nabla f(x_r) \ \Big|\ x_r \to x,\ x_r \in S^c \Big\}.
\]
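Definition 21 suggests a numerical picture: sample gradients at nearby points of differentiability and take their convex hull. A quick Python check for the running example $f(x) = -|x|$ (differentiable off $S = \{0\}$) recovers the two limiting gradients $\{-1, 1\}$, whose convex hull $[-1, 1]$ is the Clarke subdifferential at $0$. The sampling radius is an illustrative choice.

```python
import numpy as np

def grad_f(x):
    # f(x) = -|x| is differentiable for x != 0, with gradient -sign(x).
    return -np.sign(x)

# Sample gradients at points of differentiability approaching 0.
samples = sorted({grad_f(x) for x in np.linspace(-1e-3, 1e-3, 101) if x != 0})
# Limiting gradients at 0 are {-1.0, 1.0}; conv{-1, 1} = [-1, 1].
lo, hi = samples[0], samples[-1]
```

Note that this matches $\operatorname{conv} \partial f(0) = \operatorname{conv}\{-1, 1\}$ from Example 22, anticipating Theorem 28 below.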

Theorem 28. Let $f$ be locally Lipschitz. Then, for all $\bar{x}$, we have
\[
\tilde{\partial} f(\bar{x}) = \operatorname{conv} \partial f(\bar{x}).
\]
(In particular, $\tilde{\partial} f$ does not depend on the choice of $S$, and coincides with $\partial^c f$.)

Proof.

Claim 3. $\hat{\partial} f(\bar{x}) \subseteq \tilde{\partial} f(\bar{x})$.

Proof. Suppose not, and choose $y \in \hat{\partial} f(\bar{x}) \setminus \tilde{\partial} f(\bar{x})$. By the separating hyperplane theorem there exist a unit vector $z$ and $\epsilon > 0$ such that
\[
\langle y, z \rangle \ \ge\ \max_{x \in \tilde{\partial} f(\bar{x})} \langle z, x \rangle + 2\epsilon.
\]
Then, for all points $x$ sufficiently close to $\bar{x}$ at which $f$ is differentiable,
\[
\langle y, z \rangle \ \ge\ \langle \nabla f(x), z \rangle + \epsilon.
\]
Since $y \in \hat{\partial} f(\bar{x})$ we know that $f(\bar{x} + tz) - f(\bar{x}) \ge t\langle y, z \rangle + o(t)$, so for arbitrarily small $t > 0$,
\[
\frac{1}{t}\big( f(\bar{x} + tz) - f(\bar{x}) \big) \ \ge\ \langle y, z \rangle - \frac{\epsilon}{3}.
\]


Now, using Fubini's Theorem, we can find $x$ arbitrarily close to $\bar{x}$ such that $\{ t \mid x + tz \in S \}$ has measure zero. Choose such an $x$ with, in addition,
\[
\frac{1}{t}\big( f(x + tz) - f(x) \big) \ \ge\ \langle y, z \rangle - \frac{2\epsilon}{3}
\]
(which can be done since $f$ is continuous). Then, by the fundamental theorem of calculus,
\[
f(x + tz) - f(x) = \int_0^t \langle \nabla f(x + \tau z), z \rangle \, d\tau \ \le\ t\big( \langle y, z \rangle - \epsilon \big).
\]
Dividing by $t$ gives a contradiction with the previous inequality.

So we have $\operatorname{gph} \hat{\partial} f \subseteq \operatorname{gph} \tilde{\partial} f$. It is easy to check (using Caratheodory's Theorem) that $\operatorname{gph} \tilde{\partial} f$ is closed. Hence, taking closures, we learn that $\operatorname{gph} \partial f \subseteq \operatorname{gph} \tilde{\partial} f$, so $\partial f(x) \subseteq \tilde{\partial} f(x)$, and since $\tilde{\partial} f(x)$ is convex we conclude that
\[
\operatorname{conv} \partial f(x) \ \subseteq\ \tilde{\partial} f(x).
\]
Conversely, if $f$ is differentiable at $y$ then $\nabla f(y) \in \hat{\partial} f(y) \subseteq \partial f(y)$; consequently every limit of gradients lies in $\partial f(x)$, and so $\tilde{\partial} f(x)$ is contained in $\operatorname{conv} \partial f(x)$.

Let us explore what calculus rules hold for the limiting subdifferential.

Theorem 29 (Sum rule). Let $f_1, f_2 : \mathbf{E} \to \overline{\mathbb{R}}$ be closed and finite at some point $\bar{x} \in \mathbf{E}$. Then
\[
\partial(f_1 + f_2)(\bar{x}) \ \subseteq\ \partial f_1(\bar{x}) + \partial f_2(\bar{x}),
\]
provided at least one of the two functions is Lipschitz around $\bar{x}$.

Proof. Without loss of generality assume $f_1$ is locally Lipschitz. Suppose $y \in \partial(f_1 + f_2)(\bar{x})$. By definition there exist sequences $x_r \to \bar{x}$ and $y_r \in \hat{\partial}(f_1 + f_2)(x_r)$ such that $y_r \to y$ and $(f_1 + f_2)(x_r) \to (f_1 + f_2)(\bar{x})$. Now we apply the fuzzy sum rule: pick a sequence of positive errors $\delta_r \downarrow 0$. Then there exist $x_r^i \in B_{\delta_r}(x_r)$ and $y_r^i \in \hat{\partial} f_i(x_r^i)$ such that
\[
|f_i(x_r^i) - f_i(x_r)| \le \delta_r \quad (i = 1, 2)
\qquad \text{and} \qquad
\|y_r^1 + y_r^2 - y_r\| \le \delta_r.
\]
Notice that $x_r^i \to \bar{x}$ ($i = 1, 2$); moreover $f_1(x_r^1) \to f_1(\bar{x})$ (by continuity), and therefore $f_2(x_r^2) \to f_2(\bar{x})$. Since $f_1$ is Lipschitz, $(y_r^1)_r$ is a bounded sequence and, without loss of generality, we may assume it converges to some $y_1$. Furthermore, since $y_r^1 + y_r^2$ converges, $y_r^2 \to y_2 := y - y_1$. Hence $y_1 \in \partial f_1(\bar{x})$, $y_2 \in \partial f_2(\bar{x})$, and $y_1 + y_2 = y$.

Ekeland Variational Principle. Let us now ask the following question: if $f : \mathbf{E} \to \mathbb{R}$ is $C^1$ and bounded below, must there exist, for an arbitrary $\epsilon > 0$, a point $x \in \mathbf{E}$ with $\|\nabla f(x)\| \le \epsilon$?

Theorem 30 (Ekeland Variational Principle (1974)). Suppose $f : \mathbf{E} \to \overline{\mathbb{R}}$ is closed with $\inf f$ finite, and let $\epsilon, \lambda > 0$. Suppose there is an $x \in \mathbf{E}$ such that
\[
f(x) \ \le\ \inf f + \epsilon.
\]

Then there exists y satisfying

1. $\|x - y\| \le \lambda$,

Lecture 21: Inverse problems April 26 2018

2. $f(y) \le f(x)$,

3. $y$ strictly minimizes $f + \frac{\epsilon}{\lambda}\|\cdot - y\|$.

Remark 10. The previous theorem is true in any complete metric space.

Proof. Define $\hat{f} = f + \frac{\epsilon}{\lambda}\|\cdot - x\|$; clearly this function is closed. The set $\{ z \mid \hat{f}(z) \le f(x) \}$ is nonempty (it contains $x$) and compact, since $\hat{f}(z) \ge \inf f + \frac{\epsilon}{\lambda}\|z - x\|$ bounds it. Therefore $M = \operatorname{arg\,min} \hat{f}$ is also nonempty and compact; choose any $y \in \operatorname{arg\,min}_{z \in M} f(z)$. Then, for any $z \in \mathbf{E}$, there are two possibilities:

1. $z \in M$: then by the definition of $y$,
\[
f(y) \ \le\ f(z) \ <\ f(z) + \frac{\epsilon}{\lambda}\|z - y\| \qquad \text{(unless } z = y\text{)}.
\]

2. $z \notin M$: in this case $\hat{f}(y) < \hat{f}(z)$, that is,
\[
f(y) + \frac{\epsilon}{\lambda}\|y - x\| \ <\ f(z) + \frac{\epsilon}{\lambda}\|z - x\| \ \le\ f(z) + \frac{\epsilon}{\lambda}\big( \|z - y\| + \|y - x\| \big),
\]
which gives $f(y) < f(z) + \frac{\epsilon}{\lambda}\|z - y\|$,

proving the third condition. Now, $y \in M$, so $\hat{f}(y) \le \hat{f}(x)$ and consequently
\[
f(y) + \frac{\epsilon}{\lambda}\|y - x\| \ \le\ f(x) \ \le\ \inf f + \epsilon \ \le\ f(y) + \epsilon,
\]
proving the first two conditions.

Typically one takes $\lambda = \sqrt{\epsilon}$ for a small $\epsilon > 0$. Intuitively, the principle says that near any approximate minimizer there is an exact minimizer of a nearby function.

Corollary 6. For a closed $f$, if $f(x) \le \inf f + \epsilon$, then there exist $\bar{x} \in x + \sqrt{\epsilon}\, B$ and $y \in \partial f(\bar{x})$ with $\|y\| \le \sqrt{\epsilon}$.
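To see the corollary in action, consider the closed function $f(x) = e^{-x}$ on $\mathbb{R}$: it is bounded below with $\inf f = 0$, but the infimum is not attained. The quick Python check below (the choice $\epsilon = 10^{-4}$ is an illustrative assumption) exhibits an $\epsilon$-minimizer $x_0$ at which the gradient is already below $\sqrt{\epsilon}$, consistent with the corollary's promise of such a point within $\sqrt{\epsilon}$ of $x_0$.

```python
import math

def f(x):
    # Closed, bounded below: inf f = 0, but the infimum is not attained.
    return math.exp(-x)

def df(x):
    return -math.exp(-x)

eps = 1e-4
x0 = math.log(1.0 / eps)      # f(x0) = eps <= inf f + eps
small_grad = abs(df(x0))      # here x0 itself plays the role of xbar
```

Here `small_grad` equals $\epsilon = 10^{-4} \le \sqrt{\epsilon} = 10^{-2}$, as predicted.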

5.3 Inverse problems

Consider the following problem. Let $F : \mathbf{E} \to \mathbf{Y}$ be a $C^1$ operator around $\bar{x} \in \mathbf{E}$, and let $\bar{y} = F(\bar{x})$. We want to solve
\[
F(x) = y, \tag{5.1}
\]
where $x$ is the variable and $y$ is a point near $\bar{y}$. If $\nabla F(\bar{x})$ is surjective, then the implicit function theorem implies that this equation is solvable for $x$ near $\bar{x}$, given $y$ near $\bar{y}$. Furthermore, there exists $k > 0$ such that
\[
d_{F^{-1}(y)}(x) := \inf_{z \in F^{-1}(y)} \|z - x\| \ \le\ k\,\|F(x) - y\|, \tag{5.2}
\]
and this relationship holds for all $(x, y)$ close to $(\bar{x}, \bar{y})$.

Exercise 5. Prove the aforementioned result (start by proving the case where $F$ is a linear map). You should find that $k$ corresponds to the reciprocal of the smallest singular value of $\nabla F(\bar{x})$.
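For a surjective linear map, inequality (5.2) can be verified directly: the distance from $x$ to the solution set of $Az = y$ is $\|A^+(Ax - y)\|$, which the pseudoinverse bounds by $\|Ax - y\|/\sigma_{\min}(A)$. A small NumPy check (the matrix and points are arbitrary illustrative choices):

```python
import numpy as np

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0]])       # surjective: full row rank
x = np.array([1.0, -1.0, 2.0])
y = np.array([0.5, 4.0])

step = np.linalg.pinv(A) @ (A @ x - y)  # x - step is the closest solution of Az = y
dist = np.linalg.norm(step)
sigma_min = np.linalg.svd(A, compute_uv=False).min()

residual_bound = np.linalg.norm(A @ x - y) / sigma_min  # (5.2) with k = 1/sigma_min
```

By construction `dist <= residual_bound`, matching the constant suggested in Exercise 5.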


Note that $x$ solves (5.1) if, and only if, it locally minimizes $\|F(\cdot) - y\|$. Indeed, if a local minimizer $x$ near $\bar{x}$ had $F(x) \neq y$, the norm would be smooth there and
\[
\nabla F(x)^\top \left( \frac{F(x) - y}{\|F(x) - y\|} \right) = 0,
\]
a contradiction since the kernel of $\nabla F(x)^\top$ is trivial (by surjectivity). In fact, solutions are also just the minimizers $x$ of $\|F(\cdot) - y\| + \delta\|\cdot - x\|$ for sufficiently small $\delta > 0$: otherwise we would get $\nabla F(x)^\top u \in \delta B$ for the unit vector $u = \frac{F(x) - y}{\|F(x) - y\|}$, which contradicts (5.2).

What happens if the Jacobian $\nabla F(\bar{x})$ is not surjective? Let us start with $F$ linear and $\bar{x} = 0$. Choose any nonzero $y \perp \operatorname{range}(F)$; then $x = 0$ minimizes $\|F(\cdot) - y\| + \delta\|\cdot - x\|$ and yet it does not solve $F(x) = y$. What happens for more general equations $F(x) = y$, and what do we need to analyze them? We are going to use the variational tools that we developed to answer this question. Consider now a set-valued mapping $\Phi : \mathbf{E} \rightrightarrows \mathbf{Y}$.

Definition 22. We say that $\Phi$ is metrically regular at $\bar{x}$ for $\bar{y} \in \Phi(\bar{x})$ if there exists $k > 0$ so that

\[
d(x, \Phi^{-1}(y)) \ \le\ k\, d(y, \Phi(x))
\]

for all $x$ and $y$ near $\bar{x}$ and $\bar{y}$, respectively.

Lemma 9 (Ioffe, 1979). Suppose $\Phi$ is defined by $x \mapsto \{F(x)\}$ if $x \in S$ and $\emptyset$ otherwise, where $F : \mathbf{E} \to \mathbf{Y}$ is continuous and $S \subseteq \mathbf{E}$ is a closed set. Assume $\Phi$ is not metrically regular at $\bar{x} \in S$ for $\bar{y} = F(\bar{x})$. Then there exist $y$ close to $\bar{y}$, a small $\delta > 0$, and $x$ close to $\bar{x}$ minimizing
\[
\|F(\cdot) - y\| + \delta\|\cdot - x\|
\]
over $S$, but with $F(x) \neq y$.

Proof. By assumption there exist sequences $(x_r)_r \subseteq S$ and $(y_r)_r$ with $x_r \to \bar{x}$ and $y_r \to \bar{y}$, along which metric regularity fails:
\[
d(x_r, S \cap F^{-1}(y_r)) \ >\ r\,\|F(x_r) - y_r\|. \tag{5.3}
\]
We apply Ekeland's principle to the function $f_r = \|F(\cdot) - y_r\| + \delta_S$, taking $\epsilon_r = \|F(x_r) - y_r\|$ (note $f_r(x_r) = \epsilon_r \le \inf f_r + \epsilon_r$ since $f_r \ge 0$) and $\lambda_r = \min\{\sqrt{\epsilon_r},\, r\epsilon_r\}$. Therefore for each $r$ there exists $\bar{x}_r \in S$ with $\|\bar{x}_r - x_r\| \le \lambda_r$ minimizing
\[
f_r + \frac{\epsilon_r}{\lambda_r}\|\cdot - \bar{x}_r\|,
\]
i.e., minimizing over $S$ the function
\[
\|F(\cdot) - y_r\| + \underbrace{\max\{\sqrt{\epsilon_r},\, 1/r\}}_{=:\ \delta_r}\,\|\cdot - \bar{x}_r\|;
\]
by construction $\delta_r \to 0$ as $r \to \infty$. We claim that $F(\bar{x}_r) \neq y_r$: otherwise $\bar{x}_r \in S \cap F^{-1}(y_r)$, so (5.3) would give $\|\bar{x}_r - x_r\| \ge d(x_r, S \cap F^{-1}(y_r)) > r\epsilon_r \ge \lambda_r$, a contradiction.

When things were nice (smooth), checking metric regularity reduced to checking surjectivity of the Jacobian. What must we check for the property to hold in the general setting? Suppose now that $F$ is Lipschitz. If metric regularity fails at $\bar{x}$, then by the previous result there exist $x_r \to \bar{x}$, $y_r \to \bar{y} = F(\bar{x})$, and $\delta_r \to 0$ such that $x_r$ minimizes $\|F(\cdot) - y_r\| + \delta_r\|\cdot - x_r\| + \delta_S$. The sum rule implies that
\begin{align*}
0 &\in \partial \|F(\cdot) - y_r\|(x_r) + \delta_r B + \partial \delta_S(x_r) \\
&= \partial \langle w_r, F(\cdot) \rangle(x_r) + \delta_r B + \partial \delta_S(x_r),
\end{align*}

Lecture 22: Inverse problems May 1 2018

where $w_r = \frac{F(x_r) - y_r}{\|F(x_r) - y_r\|}$; the second line follows since the norm is smooth at $F(x_r) - y_r \neq 0$. Without loss of generality the unit vectors $w_r$ converge to some $w$. Then we can rewrite this as
\[
0 \in \partial \langle w, F(\cdot) \rangle(x_r) + \partial \langle w_r - w, F(\cdot) \rangle(x_r) + \delta_r B + \partial \delta_S(x_r).
\]
Note that the functions $\langle w_r - w, F(\cdot) \rangle$ are $L_r$-Lipschitz with $L_r \to 0$ as $r \to \infty$. Since $\delta_r \to 0$ as well, there exist sequences $(u_r)_r$ and $(v_r)_r$ with
\[
u_r \in \partial \langle w, F(\cdot) \rangle(x_r)
\qquad \text{and} \qquad
v_r \in \partial \delta_S(x_r)
\]
such that $u_r + v_r \to 0$. Without loss of generality we may assume $u_r \to u \in \partial \langle w, F(\cdot) \rangle(\bar{x})$ and $v_r \to -u \in \partial \delta_S(\bar{x})$ (exercise: this follows since $\partial \delta_S$ has a closed graph). Summarizing, we have proved that if metric regularity fails, then there exists $w \neq 0$ with
\[
0 \in \partial \langle w, F(\cdot) \rangle(\bar{x}) + \partial \delta_S(\bar{x}).
\]

This condition is an analogue of non-surjectivity of the Jacobian. Contrapositively:

Theorem 31. The map
\[
x \mapsto \begin{cases} \{F(x)\} & \text{if } x \in S, \\ \emptyset & \text{otherwise}, \end{cases}
\]
is metrically regular at $\bar{x}$ provided
\[
0 \in \partial \langle w, F(\cdot) \rangle(\bar{x}) + \partial \delta_S(\bar{x}) \ \Longrightarrow\ w = 0.
\]

Definition 23. For $S \subseteq \mathbf{E}$, the normal cone at $\bar{x} \in S$ is defined by $N_S(\bar{x}) := \partial \delta_S(\bar{x})$.

If $S$ is convex, then we have
\[
y \in N_S(\bar{x}) \iff \langle y, x - \bar{x} \rangle \le 0 \quad \forall x \in S.
\]
(Picture.)

Example 23. For any closed set $S$ we have $\bar{x} \in \operatorname{int} S \iff N_S(\bar{x}) = \{0\}$.

Proof. Apply the previous theorem with $F = I$.

More generally, we want to study systems of the form $F(x) \in z + P$, for a closed set $P \subseteq \mathbf{Y}$ and $x \in S$. Suppose $\bar{x} \in S$ and $F(\bar{x}) \in P$, and consider the map
\[
x \mapsto \begin{cases} F(x) - P & \text{if } x \in S, \\ \emptyset & \text{otherwise}, \end{cases}
\]

and suppose that it is not metrically regular at $\bar{x}$ for $0$. Then there exist sequences $(x_r)_r \subseteq S$ and $(z_r)_r$ with $x_r \to \bar{x}$, $z_r \to 0$, and
\[
d_{F^{-1}(z_r + P)\, \cap\, S}(x_r) \ >\ r\, d_P(F(x_r) - z_r) \ =\ r\,\|F(x_r) - z_r - y_r\| \quad \text{for some } y_r \in P.
\]

Another view is to study the system
\[
F(x) - y = z, \qquad x \in S, \qquad y \in P,
\]
via the map
\[
G(x, y) = \begin{cases} F(x) - y & \text{if } x \in S,\ y \in P, \\ \emptyset & \text{otherwise}. \end{cases}
\]

Lecture 23: Inverse problems May 3 2018

This map is not metrically regular at $(\bar{x}, F(\bar{x}))$ either. To see this, note
\begin{align*}
d\big( (x_r, y_r),\ \{(x, y) \in S \times P \mid F(x) - y = z_r\} \big)^2
&= \inf\{ \|x_r - x\|^2 + \|y_r - y\|^2 \mid x \in S,\ y \in P,\ F(x) - y = z_r \} \\
&\ge \inf\{ \|x_r - x\|^2 \mid x \in S,\ y \in P,\ F(x) - y = z_r \} \\
&\ge d_{F^{-1}(z_r + P)\, \cap\, S}(x_r)^2 \ >\ r^2\, \|F(x_r) - y_r - z_r\|^2.
\end{align*}
So we can apply the previous theorem to this map: since metric regularity fails, there exists a nonzero vector $w$ such that
\[
0 \in \partial \langle w, G(\cdot, \cdot) \rangle(\bar{x}, F(\bar{x})) + N_{S \times P}(\bar{x}, F(\bar{x})).
\]

Exercise 6. Prove that $N_{S \times P}(\bar{x}, F(\bar{x})) = N_S(\bar{x}) \times N_P(F(\bar{x}))$.

It is easy to check that $\partial \langle w, G(\cdot, \cdot) \rangle(\bar{x}, F(\bar{x})) = \partial \langle w, F(\cdot) \rangle(\bar{x}) \times \{-w\}$. Altogether, we have proved the following theorem.

Theorem 32. Suppose $S \subseteq \mathbf{E}$ and $P \subseteq \mathbf{Y}$ are closed subsets, $\bar{x} \in S$, $F : \mathbf{E} \to \mathbf{Y}$ is Lipschitz, $F(\bar{x}) \in P$, and for all $w$:
\[
\left.
\begin{array}{r}
w \in N_P(F(\bar{x})) \\
0 \in \partial \langle w, F(\cdot) \rangle(\bar{x}) + N_S(\bar{x})
\end{array}
\right\}
\ \Longrightarrow\ w = 0.
\]
Then the map
\[
x \mapsto \begin{cases} F(x) - P & \text{if } x \in S, \\ \emptyset & \text{otherwise}, \end{cases}
\]

is metrically regular at $\bar{x}$.

From now on, this will be our main tool.

Theorem 33. Suppose $S, P \subseteq \mathbf{E}$ are closed, $\bar{x} \in S \cap P$, and $S$ and $P$ are transversal at $\bar{x}$, i.e.
\[
N_S(\bar{x}) \cap \big( -N_P(\bar{x}) \big) = \{0\}.
\]
Then, for all small $z$, $S \cap (P + z) \neq \emptyset$.

Proof. Use the previous theorem with F = I.

Proposition 21 (Exact penalization). Assume $f : \mathbf{E} \to \mathbb{R}$ is $L$-Lipschitz, $S \subseteq \mathbf{E}$ is closed, and $\bar{x}$ locally minimizes $f$ over $S$. Then $\bar{x}$ is an unconstrained local minimizer of $f + k\, d_S$ provided $k \ge L$.

Proof. If not, there exists a sequence $x_r \to \bar{x}$ with $f(x_r) + k\, d_S(x_r) < f(\bar{x})$. So
\[
d_S(x_r) \ <\ \frac{1}{k}\big( f(\bar{x}) - f(x_r) \big),
\]
which implies there is a sequence $(y_r)_r \subseteq S$ with $\|x_r - y_r\| < \min\big\{ d_S(x_r) + \tfrac{1}{r},\ \tfrac{1}{k}\big(f(\bar{x}) - f(x_r)\big) \big\}$. Hence $y_r \to \bar{x}$ and
\[
f(y_r) \ \le\ f(x_r) + L\|x_r - y_r\| \ \le\ f(x_r) + k\|x_r - y_r\| \ <\ f(x_r) + f(\bar{x}) - f(x_r) \ =\ f(\bar{x}),
\]
yielding a contradiction with the local minimality of $\bar{x}$ over $S$.

Remark 11. For convex sets, transversality fails whenever there is a separating hyperplane between them. (Picture.)

Lecture 23: Linearizing sets May 3 2018

5.4 Linearizing sets

For smooth functions we use the Taylor series to approximate them locally by linear functions. Our goal now is to use the machinery we have developed to do the same with sets. We want to do this because linearized sets are easier to reason about, and we can use them to write optimality conditions. We consider a constrained system of the form
\[
F(x) \in P, \qquad x \in S,
\]
where $S \subseteq \mathbf{E}$ and $P \subseteq \mathbf{Y}$ are closed, $F : \mathbf{E} \to \mathbf{Y}$, and there exists a point $\bar{x} \in S$ such that $F(\bar{x}) \in P$. This system has an associated mapping of the form
\[
x \mapsto \begin{cases} F(x) - P & \text{if } x \in S, \\ \emptyset & \text{otherwise}. \end{cases}
\]
Recall that metric regularity at $\bar{x}$ says that there exists a constant $k > 0$ such that

Recall that metric regularity atx ¯ says→ that there exists a constant k > 0 such that

d(x, S ∩ F−1(z + P)) ≤ kd(F(x) − z, P) for all x ∈ S close tox ¯ and z ∈ Y small. We could try to understand special cases of this property. For example, note that with x = x¯ we get a “sensitivity” or “inversion” result. On the other hand, with z = 0 we get an “error bound”, for all large k > 0

d(x, S ∩ F−1(P)) ≤ kd(F(x), P)

for all x ∈ S nearx ¯. Hencex ¯ locally minimizes the function

kd(F(·), P) − d(·, S ∩ F−1(P)) over S.

Notice that this function is Lipschitz with constant k + 1, hence we can apply exact penalization. There exists k0 > 0 large such thatx ¯ locally minimizes

kd(F(·), P) − d(·, S ∩ F−1(P)) + k0d(·, S).

Thus, we have proved the following theorem.

Theorem 34. Assume that the constraint qualification holds, i.e.
\[
\left.
\begin{array}{r}
w \in N_P(F(\bar{x})) \\
0 \in \partial \langle w, F(\cdot) \rangle(\bar{x}) + N_S(\bar{x})
\end{array}
\right\}
\ \Longrightarrow\ w = 0.
\]
Then there exists $k > 0$ so that
\[
\underbrace{d\big(x,\ S \cap F^{-1}(P)\big)}_{\text{distance to feasible region}} \ \le\ k\, \big( \underbrace{d(F(x), P) + d(x, S)}_{\text{constraint error}} \big)
\]
for all $x$ near $\bar{x}$.

Let us come back to the problem of linearizing a set. To this end, we define the tangent cone.

Definition 24 (Tangent cone). For a point $\bar{x} \in S$, we say that a direction $v$ is in the tangent cone $T_S(\bar{x})$ of $S$ at $\bar{x}$ if there exist sequences $(v_r)_r \subseteq \mathbf{E}$ and $(t_r)_r \subseteq \mathbb{R}$ such that $v_r \to v$, $t_r \downarrow 0$, and $\bar{x} + t_r v_r \in S$ for all $r$. (Picture.)

Lecture 24: Linearizing sets May 8 2018

Exercise 7.

1. Show that $T_S(\bar{x})$ is a closed cone.

2. If $S$ is convex, then $v \in T_S(\bar{x})$ if, and only if,
\[
\frac{d_S(\bar{x} + tv)}{t} \to 0 \quad \text{as } t \downarrow 0.
\]

Theorem 35. Suppose $S \subseteq \mathbf{E}$ and $P \subseteq \mathbf{Y}$ are closed convex sets, $F : \mathbf{E} \to \mathbf{Y}$ is a $C^1$ map, there exists $\bar{x} \in S$ such that $F(\bar{x}) \in P$, and the constraint qualification holds:
\[
\left.
\begin{array}{r}
w \in N_P(F(\bar{x})) \\
-\nabla F(\bar{x})^* w \in N_S(\bar{x})
\end{array}
\right\}
\ \Longrightarrow\ w = 0. \tag{5.4}
\]
Then
\[
T_{S \cap F^{-1}(P)}(\bar{x}) = T_S(\bar{x}) \cap \nabla F(\bar{x})^{-1} T_P(F(\bar{x})).
\]

Proof. Since $S' \subseteq S \Rightarrow T_{S'}(\bar{x}) \subseteq T_S(\bar{x})$, we have

TS∩F−1 (P) = TS(x¯) ∩ TF−1P(F(x¯)). ⇒ Thus, for the inclusion ⊆ we just need to show that −1 TF−1P(x¯) ⊆ ∇F(x) TP(F(x¯)).

Take v ∈ TF−1P(x¯), hence there exist sequences (vr)r ⊆ E and (tr)r ⊆ R vr v such that vr v, tr 0, −1 andx ¯ + trdr ∈ F P for all r. Thus F(x + trdr) ∈ P ∀r, consequently

ÅF(x¯ + trdr) − F(x¯)ã → → → F(x¯) + tr ∈ P. tr ∇F(x¯ )v

So ∇F(x¯)v ∈ TP(F(x¯)) proving what we wanted.| → {z } Conversely, first note that the constraint qualification condition implies metric regularity holds. So there exists a constant k > 0 such that d(x, S ∩ F−1P) ≤ k (d(x, S) + d(F(x), P)) for all x nearx ¯. −1 Suppose v ∈ TS(x¯) ∩ ∇F(x) TP(F(x¯)). Hence dS(x¯ + tv) = o(t) as t 0, similarly

dP(F(x¯) + t∇F(x¯)v) = o(t) → F(x¯ +tv)+o(t)

(since S, P are convex), thus dP(F(x¯ + tv)) =|o(t). Then{z by} the previous error bound we have d(x¯ + tv, S ∩ F−1P) ≤ k(o(t) + o(t)) = o(t)

so d ∈ TS∩F−1P(x¯). For closed convex S, P ⊆ E andx ¯ ∈ S ∩ P, we have proved that

TS∩P(x¯) = TS(x¯) ∩ TP(x¯) providing NS(x¯) ∩ (−NP(x¯)) = {0}. Remark 12. This condition captures good properties in different settings, for manifolds this implies transversality. Furthermore, even for nonconvex closed sets, transversality implies local linear convergence of alternating projection to a point in the intersection. Open question: Do you need convexity for the previous result? In other words do there exist S and P satisfying transverality, but TS∩P(x¯) 6= TS(x¯) ∩ TP(x¯).

Lecture 24: Optimality conditions May 8 2018

5.5 Optimality conditions

Recall the definition of the dual cone.

Definition 25. For any $S \subseteq \mathbf{E}$, the dual cone is $S^+ = \{ y \mid \langle x, y \rangle \ge 0 \ (\forall x \in S) \}$.

It is simple to see that the dual cone is always a closed convex cone. On the other hand, the normal cone and the tangent cone are closed, but not necessarily convex.

Exercise 8. Show the following facts:

1. $S^{++}$ is the smallest closed convex cone containing $S$.

2. For a convex cone $L \subseteq \mathbf{Y}$: $L^+ = \{0\} \iff \operatorname{cl} L = \mathbf{Y}$.

3. Key fact: for any $\bar{x} \in S \subseteq \mathbf{E}$ we have $T_S(\bar{x})^+ = -\hat{\partial} \delta_S(\bar{x})$.

The previous fact has two important consequences (that we leave as exercises).

Proposition 22. Let $S \subseteq \mathbf{E}$ be a closed convex set. Then, for all $\bar{x} \in S$,
\[
T_S(\bar{x})^+ = -N_S(\bar{x})
\qquad \text{and} \qquad
T_S(\bar{x}) = \operatorname{cl}\, \mathbb{R}_+ (S - \bar{x}).
\]
In particular, the tangent cone is a closed convex cone. (Picture.)

Proposition 23 (Basic optimality condition). Suppose that $\bar{x}$ locally minimizes $f : \mathbf{E} \to \mathbb{R}$ over $S \subseteq \mathbf{E}$, and $f$ is differentiable at $\bar{x}$. Then
\[
\nabla f(\bar{x}) \in T_S(\bar{x})^+.
\]

Thus, if we understand what the dual of the tangent cone looks like, we can write explicit optimality conditions for this problem. Note that the constraint qualification (Equation 5.4) says that
\[
T_P(F(\bar{x}))^+ \ \cap\ -\big(\nabla F(\bar{x})^*\big)^{-1} T_S(\bar{x})^+ \ =\ \{0\}.
\]
Also, recall that for any cones $K, Q$ and a linear map $A$ we have
\[
(K + AQ)^+ = K^+ \cap (A^*)^{-1} Q^+.
\]
Applying this identity (with $K = -T_P(F(\bar{x}))$, $A = \nabla F(\bar{x})$, and $Q = T_S(\bar{x})$) together with the exercise above, the constraint qualification is equivalent to
\[
\mathbf{Y} = \nabla F(\bar{x})\, T_S(\bar{x}) - T_P(F(\bar{x})). \tag{5.5}
\]
This condition is called the Robinson constraint qualification (1976). Under this condition, we have

\begin{align*}
T_{S \cap F^{-1}(P)}(\bar{x})^+ &= \big( T_S(\bar{x}) \cap \nabla F(\bar{x})^{-1} T_P(F(\bar{x})) \big)^+ \\
&= T_S(\bar{x})^+ + \nabla F(\bar{x})^* \, T_P(F(\bar{x}))^+,
\end{align*}
where the last equality follows from the Krein-Rutman identity (Assignment 2), which applies because the Robinson constraint qualification holds. Let us summarize what we have learned in a theorem.


Theorem 36 (First-order optimality conditions for nonlinear programming). Suppose $\bar{x}$ is a local minimizer of
\[
\inf\{ f(x) \mid F(x) \in P,\ x \in S \},
\]
where $P \subseteq \mathbf{Y}$ and $S \subseteq \mathbf{E}$ are closed convex sets, the objective function $f$ is differentiable at $\bar{x}$, $F : \mathbf{E} \to \mathbf{Y}$ is $C^1$ around $\bar{x}$, and the Robinson constraint qualification (5.5) holds. Then
\[
-\nabla f(\bar{x}) \in N_S(\bar{x}) + \nabla F(\bar{x})^* N_P(F(\bar{x})).
\]

Corollary 7 (Karush-Kuhn-Tucker). Consider a nonlinear program
\[
\inf f(x) \quad \text{s.t.} \quad g_i(x) \ge 0 \ \ \forall i \in \{1, \dots, q\}, \qquad H(x) = 0.
\]
Assume that $f$ is differentiable at $\bar{x}$ and the functions $g_i, H$ are $C^1$ around $\bar{x}$, with $\nabla H(\bar{x})$ surjective, and that there exists a direction $d \in \ker \nabla H(\bar{x})$ such that
\[
\langle \nabla g_i(\bar{x}), d \rangle > 0
\]
for all $i$ such that $g_i(\bar{x}) = 0$ (the active constraints); this condition is known as the Mangasarian-Fromovitz constraint qualification. Then there exist Lagrange multipliers $\lambda_i \ge 0$, with $\lambda_i = 0$ for all the inactive constraints, and a vector $\mu$ with
\[
\nabla f(\bar{x}) = \sum_i \lambda_i \nabla g_i(\bar{x}) + \nabla H(\bar{x})^* \mu.
\]

50