Introduction to Convexity

Amitabh Basu

Compiled on Monday 9th December, 2019 at 16:35

Contents

1 Definitions and Preliminaries
    Basic real analysis and topology
    Basic facts about matrices

2 Convex Sets
  2.1 Definitions and basic properties
  2.2 Convex cones, affine sets and dimension
  2.3 Representations of convex sets
    2.3.1 Extrinsic description: separating hyperplanes
      How to represent general convex sets: Separation oracles
      Farkas' lemma: A glimpse into polyhedral theory
      Duality/Polarity
    2.3.2 Intrinsic description: faces, extreme points, recession cone, lineality space
    2.3.3 A remark about extrinsic and intrinsic descriptions
  2.4 Combinatorial theorems: Helly-Radon-Carathéodory
    An application to learning theory: VC-dimension of halfspaces
    Application to centerpoints
  2.5 Polyhedra
    2.5.1 The Minkowski-Weyl Theorem
    2.5.2 Valid inequalities and feasibility
    2.5.3 Faces of polyhedra
    2.5.4 Implicit equalities, dimension of polyhedra and facets

3 Convex Functions
  3.1 General properties, epigraphs, subgradients
  3.2 Continuity properties
  3.3 First-order derivative properties
  3.4 Second-order derivative properties
  3.5 Sublinear functions, support functions and gauges
    Gauges
    Support functions
    Generalized Cauchy-Schwarz/Hölder's inequality
    One-to-one correspondence between closed, convex sets and closed, sublinear functions
  3.6 Directional derivatives, subgradients and subdifferential calculus

4 Optimization
    Algorithmic setup: First-order oracles
  4.1 Subgradient algorithm
    Subgradient Algorithm
  4.2 Generalized inequalities and convex mappings
  4.3 Convex optimization with generalized inequalities
    4.3.1 Lagrangian duality for convex optimization with generalized constraints
    4.3.2 Solving the Lagrangian dual problem
    4.3.3 Explicit examples of the Lagrangian dual
      Conic optimization
      Convex optimization with explicit constraints and objective
      A closer look at linear programming duality
    4.3.4 Strong duality: sufficient conditions and complementary slackness
      Slater's condition for strong duality
      Closed cone condition for strong duality in conic optimization

      Complementary slackness
    4.3.5 Saddle point interpretation of the Lagrangian dual
  4.4 Cutting plane schemes
    General cutting plane scheme
    Center of Gravity Method
    Ellipsoid method

1 Definitions and Preliminaries

We will focus on $\mathbb{R}^d$ for arbitrary $d \in \mathbb{N}$: $x = (x_1, \ldots, x_d) \in \mathbb{R}^d$. We will use the notation $\mathbb{R}^d_+$ to denote the set of all vectors with nonnegative coordinates. We will also use $e^i$, $i = 1, \ldots, d$ to denote the $i$-th unit vector, i.e., the vector which has 1 in the $i$-th coordinate and 0 in every other coordinate.

Definition 1.1. A norm on $\mathbb{R}^d$ is a function $N : \mathbb{R}^d \to \mathbb{R}_+$ satisfying:

1. $N(x) = 0$ if and only if $x = 0$,

2. $N(\alpha x) = |\alpha| N(x)$ for all $\alpha \in \mathbb{R}$ and $x \in \mathbb{R}^d$,

3. $N(x + y) \le N(x) + N(y)$ for all $x, y \in \mathbb{R}^d$. (Triangle inequality)

Example 1.2. For any $p \ge 1$, define the $\ell^p$ norm on $\mathbb{R}^d$: $\|x\|_p = (|x_1|^p + |x_2|^p + \ldots + |x_d|^p)^{1/p}$. $p = 2$ is also called the standard Euclidean norm; we will drop the subscript 2 to denote the standard norm: $\|x\| = \sqrt{x_1^2 + x_2^2 + \ldots + x_d^2}$. The $\ell^\infty$ norm is defined as $\|x\|_\infty = \max_{i=1}^{d} |x_i|$.

Definition 1.3. Any norm on $\mathbb{R}^d$ defines a distance between points $x, y \in \mathbb{R}^d$ as $d_N(x, y) := N(x - y)$. This is called the metric or distance induced by the norm. Such a metric satisfies three important properties:

1. $d_N(x, y) = 0$ if and only if $x = y$,

2. $d_N(x, y) = d_N(y, x)$ for all $x, y \in \mathbb{R}^d$,

3. $d_N(x, z) \le d_N(x, y) + d_N(y, z)$ for all $x, y, z \in \mathbb{R}^d$. (Triangle inequality)

Definition 1.4. We also utilize the (standard) inner product of $x, y \in \mathbb{R}^d$: $\langle x, y \rangle = x_1 y_1 + x_2 y_2 + \ldots + x_d y_d$. (Note that $\|x\|_2^2 = \langle x, x \rangle$.) We say $x$ and $y$ are orthogonal if $\langle x, y \rangle = 0$.

Definition 1.5. For any norm $N$ and $x \in \mathbb{R}^d$, $r \in \mathbb{R}_+$, we will call the set $B_N(x, r) := \{y \in \mathbb{R}^d : N(y - x) \le r\}$ the ball around $x$ of radius $r$. $B_N(0, 1)$ will be called the unit ball for the norm $N$. We will drop the subscript $N$ when we speak of the standard Euclidean norm and there is no chance of confusion in the context.

A set $X \subseteq \mathbb{R}^d$ is said to be bounded if there exists $R \in \mathbb{R}$ such that $X \subseteq B_N(0, R)$.

Definition 1.6. Given any set $X \subseteq \mathbb{R}^d$ and a scalar $\alpha \in \mathbb{R}$,
$$\alpha X := \{\alpha x : x \in X\}.$$
Given any two sets $X, Y \subseteq \mathbb{R}^d$, we define the Minkowski sum of $X, Y$ as
$$X + Y := \{x + y : x \in X, y \in Y\}.$$
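The norms and set operations above are easy to experiment with numerically. The following short sketch (our illustration, not part of the notes) uses NumPy to evaluate a few $\ell^p$ norms, check the triangle inequality on one instance, and form the Minkowski sum of two finite sets.

```python
# Illustrative sketch (not from the notes): l_p norms and a Minkowski sum.
import numpy as np

x = np.array([3.0, -4.0])
y = np.array([1.0, 2.0])

for p in [1, 2, np.inf]:
    # np.linalg.norm implements the l_p norms of Example 1.2.
    print("l_%s norm of x:" % p, np.linalg.norm(x, ord=p))

# Triangle inequality N(x + y) <= N(x) + N(y) for the Euclidean norm:
assert np.linalg.norm(x + y) <= np.linalg.norm(x) + np.linalg.norm(y)

# Minkowski sum of two finite sets X + Y = {x + y : x in X, y in Y}:
X = [np.array([0.0, 0.0]), np.array([1.0, 0.0])]
Y = [np.array([0.0, 0.0]), np.array([0.0, 1.0])]
mink_sum = [u + v for u in X for v in Y]
print(mink_sum)  # the four vertices of the unit square
```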

Basic real analysis and topology. For any subset of real numbers $S \subseteq \mathbb{R}$, we denote the infimum of $S$ by $\inf S$ and the supremum by $\sup S$.

Fix a norm $N$ on $\mathbb{R}^d$. A set $X \subseteq \mathbb{R}^d$ is called open if for every $x \in X$, there exists $r \in \mathbb{R}_+$ such that $B_N(x, r) \subseteq X$. A set $X$ is closed if its complement $\mathbb{R}^d \setminus X$ is open.

Theorem 1.7. 1. $\emptyset$, $\mathbb{R}^d$ are both open and closed.

2. An arbitrary union of open sets is open. An arbitrary intersection of closed sets is closed.

3. A finite intersection of open sets is open. A finite union of closed sets is closed.

A sequence in $\mathbb{R}^d$ is a countable ordered set of points $x^1, x^2, x^3, \ldots$ and will often be denoted by $\{x^i\}_{i \in \mathbb{N}}$. We say that the sequence converges, or that the limit of the sequence exists, if there exists a point $x$ such that for every $\epsilon > 0$, there exists $M \in \mathbb{N}$ such that $N(x - x^n) \le \epsilon$ for all $n \ge M$. $x$ is called the limit point, or simply the limit, of the sequence and will also sometimes be denoted by $\lim_{n \to \infty} x^n$. Although the definition of the limit is made here with respect to a particular norm, it is a well-known fact that the concept actually does not depend on the choice of the norm.

Theorem 1.8. A set $X$ is closed if and only if for every convergent sequence in $X$, the limit of the sequence is also in $X$.

We introduce three important notions:

1. For any set $X \subseteq \mathbb{R}^d$, the closure of $X$ is the smallest closed set containing $X$ and will be denoted by $\mathrm{cl}(X)$.

2. For any set $X \subseteq \mathbb{R}^d$, the interior of $X$ is the largest open set contained inside $X$ and will be denoted by $\mathrm{int}(X)$.

3. For any set $X \subseteq \mathbb{R}^d$, the boundary of $X$ is defined as $\mathrm{bd}(X) := \mathrm{cl}(X) \setminus \mathrm{int}(X)$.

Definition 1.9. A set in $\mathbb{R}^d$ that is closed and bounded is called compact.

Theorem 1.10. Let $C \subseteq \mathbb{R}^d$ be a compact set. Then every sequence $\{x^i\}_{i \in \mathbb{N}}$ contained in $C$ (not necessarily convergent) has a convergent subsequence.

A function $f : \mathbb{R}^d \to \mathbb{R}^n$ is continuous if for every convergent sequence $\{x^i\}_{i \in \mathbb{N}} \subseteq \mathbb{R}^d$, the following holds: $\lim_{i \to \infty} f(x^i) = f(\lim_{i \to \infty} x^i)$.

Theorem 1.11 (Weierstrass' Theorem). Let $f : \mathbb{R}^d \to \mathbb{R}$ be a continuous function. Let $X \subseteq \mathbb{R}^d$ be a nonempty, compact subset. Then $\inf\{f(x) : x \in X\}$ is attained, i.e., there exists $x^{\min} \in X$ such that $f(x^{\min}) = \inf\{f(x) : x \in X\}$. Similarly, there exists $x^{\max} \in X$ such that $f(x^{\max}) = \sup\{f(x) : x \in X\}$.

A generalization of the above theorem is the following.

Theorem 1.12. Let $f : \mathbb{R}^d \to \mathbb{R}^n$ be a continuous function, and $C$ be a compact set. Then $f(C)$ is compact.

We will also need to speak of differentiability of functions $f : \mathbb{R}^d \to \mathbb{R}^n$.

Definition 1.13. We say that $f : \mathbb{R}^d \to \mathbb{R}^n$ is differentiable at $x \in \mathbb{R}^d$ if there exists a linear transformation $A : \mathbb{R}^d \to \mathbb{R}^n$ such that
$$\lim_{h \to 0} \frac{\|f(x + h) - f(x) - Ah\|}{\|h\|} = 0.$$
If $f$ is differentiable at $x$, then the linear transformation $A$ is unique. It is commonly called the differential or total derivative of $f$ and is denoted by $f'(x)$. When $n = 1$, it is commonly called the gradient of $f$ and is denoted by $\nabla f(x)$.

Definition 1.14. The partial derivative of $f : \mathbb{R}^d \to \mathbb{R}$ at $x$ in the $i$-th direction is defined as
$$f_i'(x) := \lim_{h \to 0} \frac{f(x + h e^i) - f(x)}{h},$$

if the limit exists.
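Definition 1.14 suggests a direct numerical approximation: fix a small $h$ and evaluate the difference quotient. The sketch below is our illustration, not part of the notes; the choice $h = 10^{-6}$ is an arbitrary assumption.

```python
# Illustrative sketch: approximating the partial derivative of Definition 1.14
# by a finite difference with a small step h.
import numpy as np

def partial_derivative(f, x, i, h=1e-6):
    """Approximate (f(x + h*e_i) - f(x)) / h for small h."""
    e = np.zeros_like(x)
    e[i] = 1.0
    return (f(x + h * e) - f(x)) / h

f = lambda x: x[0] ** 2 + 3.0 * x[0] * x[1]   # f(x) = x1^2 + 3 x1 x2
x = np.array([1.0, 2.0])
print(partial_derivative(f, x, 0))  # close to 2*x1 + 3*x2 = 8
print(partial_derivative(f, x, 1))  # close to 3*x1 = 3
```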

Basic facts about matrices. The set of $m \times n$ matrices will be denoted by $\mathbb{R}^{m \times n}$. The rank of a matrix $A$ will be denoted by $\mathrm{rk}(A)$; it is the maximum number of linearly independent rows of $A$, which is equal to the maximum number of linearly independent columns of $A$. When $m = n$, we say that the matrix is square.

Definition 1.15. A square matrix $A \in \mathbb{R}^{n \times n}$ is called symmetric if $A_{ij} = A_{ji}$ for all $i, j \in \{1, \ldots, n\}$.

Definition 1.16. Let $A \in \mathbb{R}^{n \times n}$. A vector $v \in \mathbb{R}^n$ is called an eigenvector of $A$ if there exists $\lambda \in \mathbb{R}$ such that $Av = \lambda v$; $\lambda$ is called the eigenvalue of $A$ associated with $v$.

Theorem 1.17. If $A \in \mathbb{R}^{n \times n}$ is symmetric, then it has $n$ orthogonal eigenvectors $v^1, \ldots, v^n$, all of unit Euclidean norm, with associated eigenvalues $\lambda_1, \ldots, \lambda_n \in \mathbb{R}$. Moreover, if $S$ is the matrix whose columns are $v^1, \ldots, v^n$ and $\Lambda$ is the diagonal matrix with $\lambda_1, \ldots, \lambda_n$ as the diagonal entries, then $A = S \Lambda S^T$. Moreover, $\mathrm{rk}(A)$ equals the number of nonzero eigenvalues.

Theorem 1.18. Let $A \in \mathbb{R}^{n \times n}$ be a symmetric matrix of rank $r$. The following are equivalent.

1. All eigenvalues of $A$ are nonnegative.

2. There exists a matrix $B \in \mathbb{R}^{r \times n}$ with linearly independent rows such that $A = B^T B$.

3. $u^T A u \ge 0$ for all $u \in \mathbb{R}^n$.

Definition 1.19. A symmetric matrix $A \in \mathbb{R}^{n \times n}$ satisfying any of the three conditions in Theorem 1.18 is called a positive semidefinite (PSD) matrix. If $\mathrm{rk}(A) = n$, i.e., all its eigenvalues are strictly positive, then $A$ is called positive definite.
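The three conditions of Theorem 1.18 are easy to check numerically. The following sketch (our illustration, not from the notes) builds a PSD matrix as $B^T B$ and verifies the eigenvalue and quadratic-form conditions with NumPy.

```python
# Illustrative sketch: checking the equivalent PSD conditions of Theorem 1.18.
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((2, 3))    # 2 x 3 with (generically) independent rows
A = B.T @ B                        # condition 2: A = B^T B, PSD of rank 2

# Condition 1: all eigenvalues nonnegative (eigvalsh is for symmetric matrices).
eigvals = np.linalg.eigvalsh(A)
assert np.all(eigvals >= -1e-10)

# Condition 3: u^T A u >= 0 for (sampled) u.
for _ in range(1000):
    u = rng.standard_normal(3)
    assert u @ A @ u >= -1e-10

print("rank:", np.linalg.matrix_rank(A), "eigenvalues:", eigvals)
```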

Exercise 1. Show that any positive definite matrix $A \in \mathbb{R}^{d \times d}$ defines a norm on $\mathbb{R}^d$ via $N_A(x) = \sqrt{x^T A x}$. This norm is called the norm induced by $A$.

2 Convex Sets

2.1 Definitions and basic properties

A set $X \subseteq \mathbb{R}^d$ is called a convex set if for all $x, y \in X$, the line segment $[x, y]$ lies entirely in $X$. More precisely, for all $x, y \in X$ and every $\lambda \in [0, 1]$, $\lambda x + (1 - \lambda) y \in X$.

Example 2.1. Some examples of convex sets:

1. In $\mathbb{R}$, the only examples of convex sets are intervals (closed, open, half open): $(a, b)$, $(a, b]$, $[a, b]$, $(-\infty, b]$, etc.

2. Let $a \in \mathbb{R}^d$ and $\delta \in \mathbb{R}$. The sets $H(a, \delta) = \{x \in \mathbb{R}^d : \langle a, x \rangle = \delta\}$, $H^+(a, \delta) = \{x \in \mathbb{R}^d : \langle a, x \rangle \ge \delta\}$ and $H^-(a, \delta) = \{x \in \mathbb{R}^d : \langle a, x \rangle \le \delta\}$ are all convex sets. Sets of the form $H(a, \delta)$ are called hyperplanes and sets of the form $H^+(a, \delta), H^-(a, \delta)$ are called halfspaces.

3. $\{x \in \mathbb{R}^d : \|x\|_\infty \le 1\}$ is a convex set.

4. $\{x = (x_1, \ldots, x_d) \in \mathbb{R}^d : x_1 + x_2 t + x_3 t^2 + \ldots + x_d t^{d-1} \ge 0 \text{ for all } t \ge 0\}$ is a convex set.

5. $\{(x, y) \in \mathbb{R}^2 : x^2 + y^2 \le 5\}$ is convex. More generally, the ball $\{x \in \mathbb{R}^d : \|x\| \le C\}$ for any $C \ge 0$ is convex.

Exercise 2. Show that if $N : \mathbb{R}^d \to \mathbb{R}$ is a norm, then every ball $B_N(x, R)$ with respect to $N$ is convex.

Definition 2.2. Let $A \in \mathbb{R}^{d \times d}$ be a positive definite matrix. The set $\{x \in \mathbb{R}^d : x^T A x \le 1\}$ is called an ellipsoid. In other words, an ellipsoid is the unit ball associated with the norm induced by $A$; see Exercise 1. Exercise 2 shows that ellipsoids are convex.

Theorem 2.3 (Operations that preserve convexity). The following are all true.

1. Let $X_i$, $i \in I$ be an arbitrary family of convex sets. Then $\bigcap_{i \in I} X_i$ is a convex set.

2. Let $X$ be a convex set and $\alpha \in \mathbb{R}$; then $\alpha X$ is a convex set.

3. Let $X, Y$ be convex sets; then $X + Y$ is convex.

4. Let $T : \mathbb{R}^d \to \mathbb{R}^m$ be any affine transformation, i.e., $T(x) = Ax + b$ for some matrix $A \in \mathbb{R}^{m \times d}$ and vector $b \in \mathbb{R}^m$. If $X \subseteq \mathbb{R}^d$ is convex, then $T(X)$ is a convex set. If $Y \subseteq \mathbb{R}^m$ is convex, then $T^{-1}(Y)$ is convex.

Proof. 1. Let $x, y \in \bigcap_{i \in I} X_i$. This implies that $x, y \in X_i$ for every $i \in I$. Since each $X_i$ is convex, for every $\lambda \in [0, 1]$, $\lambda x + (1 - \lambda) y \in X_i$ for all $i \in I$. Therefore, $\lambda x + (1 - \lambda) y \in \bigcap_{i \in I} X_i$.

The proofs of 2., 3. and 4. are very similar and are left for the reader.

Remark 2.4. Observe that item 4. in Example 2.1 can be interpreted as an (uncountable) intersection of halfspaces, one for each $t \ge 0$. Thus, item 2. from that example and Theorem 2.3 together give another proof that item 4. describes a convex set.

Definition 2.5. Let $Y = \{y^1, \ldots, y^n\} \subset \mathbb{R}^d$ be a finite set of points. The set of all convex combinations of $Y$ is defined as
$$\{\lambda_1 y^1 + \lambda_2 y^2 + \ldots + \lambda_n y^n : \lambda_i \ge 0,\ \lambda_1 + \lambda_2 + \ldots + \lambda_n = 1\}.$$

Proposition 2.6. If $X$ is convex and $y^1, \ldots, y^n \in X$, then every convex combination of $y^1, \ldots, y^n$ is in $X$.

Proof. We prove it by induction on $n$. If $n = 1$, then the conclusion is trivial. Else consider any $\lambda_1, \ldots, \lambda_n \ge 0$ such that $\lambda_1 + \ldots + \lambda_n = 1$; we may assume $\lambda_n < 1$, since otherwise the combination is just $y^n$. Then

$$\lambda_1 y^1 + \lambda_2 y^2 + \ldots + \lambda_n y^n = (\lambda_1 + \ldots + \lambda_{n-1})\Big(\tfrac{\lambda_1}{\lambda_1 + \ldots + \lambda_{n-1}} y^1 + \tfrac{\lambda_2}{\lambda_1 + \ldots + \lambda_{n-1}} y^2 + \ldots + \tfrac{\lambda_{n-1}}{\lambda_1 + \ldots + \lambda_{n-1}} y^{n-1}\Big) + \lambda_n y^n = (1 - \lambda_n)\tilde{y} + \lambda_n y^n,$$

where $\tilde{y} := \tfrac{\lambda_1}{\lambda_1 + \ldots + \lambda_{n-1}} y^1 + \tfrac{\lambda_2}{\lambda_1 + \ldots + \lambda_{n-1}} y^2 + \ldots + \tfrac{\lambda_{n-1}}{\lambda_1 + \ldots + \lambda_{n-1}} y^{n-1}$ belongs to $X$ by the induction hypothesis. The rest follows from the definition of convexity.

Definition 2.7. Given any set $X \subseteq \mathbb{R}^d$ (not necessarily convex), the convex hull of $X$, denoted by $\mathrm{conv}(X)$, is a convex set $C$ such that $X \subseteq C$ and for any other convex set $C'$, $X \subseteq C' \Rightarrow C \subseteq C'$, i.e., the convex hull of $X$ is the smallest (with respect to set inclusion) convex set containing $X$.

Theorem 2.8. For any set $X \subseteq \mathbb{R}^d$ (not necessarily convex),
$$X \subseteq \mathrm{conv}(X) = \bigcap\,(C : X \subseteq C,\ C \text{ convex}) = \Big\{\lambda_1 x_1 + \ldots + \lambda_t x_t : x_1, \ldots, x_t \in X,\ \lambda_1, \ldots, \lambda_t \ge 0,\ \sum_{i=1}^{t} \lambda_i = 1\Big\}.$$

In other words, the convex hull of $X$ is the union of the set of convex combinations of all possible finite subsets of $X$.

Proof. Let $\hat{C} = \bigcap\,(C : X \subseteq C,\ C \text{ convex})$, which is a convex set by Theorem 2.3, and by definition $X \subseteq \hat{C}$. Consider any other convex set $C'$ such that $X \subseteq C'$. Then $C'$ appears in the intersection, and thus $\hat{C} \subseteq C'$. Thus, $\hat{C} = \mathrm{conv}(X)$.

Next, let $\tilde{C} = \{\lambda_1 x_1 + \ldots + \lambda_t x_t : x_1, \ldots, x_t \in X,\ \lambda_1, \ldots, \lambda_t \ge 0,\ \sum_{i=1}^{t} \lambda_i = 1\}$. Then,

1. $\tilde{C}$ is convex. Consider two points $z_1, z_2 \in \tilde{C}$. Thus there exist two finite index sets $I_1, I_2$, two finite subsets of $X$ given by $X_1 = \{x_i^1 \in X : i \in I_1\}$ and $X_2 = \{x_i^2 \in X : i \in I_2\}$, and two subsets of nonnegative real numbers $\{\lambda_i^1 \ge 0, i \in I_1\}$, $\{\lambda_i^2 \ge 0, i \in I_2\}$ such that $\sum_{i \in I_j} \lambda_i^j = 1$ for $j = 1, 2$, with the following property: $z_j = \sum_{i \in I_j} \lambda_i^j x_i^j$ for $j = 1, 2$. Then for any $\lambda \in [0, 1]$, $\lambda z_1 + (1 - \lambda) z_2 = \lambda\big(\sum_{i \in I_1} \lambda_i^1 x_i^1\big) + (1 - \lambda)\big(\sum_{i \in I_2} \lambda_i^2 x_i^2\big)$. Consider the finite set $\tilde{X} = X_1 \cup X_2$, and for each $x \in \tilde{X}$, if $x = x_i^1 \in X_1$ with $i \in I_1$ let $\mu_x = \lambda \cdot \lambda_i^1$, and if $x = x_i^2 \in X_2$ with $i \in I_2$, let $\mu_x = (1 - \lambda) \cdot \lambda_i^2$. It is easy to check that $\sum_{x \in \tilde{X}} \mu_x = 1$, and $\lambda z_1 + (1 - \lambda) z_2 = \sum_{x \in \tilde{X}} \mu_x x$. Thus, $\lambda z_1 + (1 - \lambda) z_2 \in \tilde{C}$.

2. $X \subseteq \tilde{C}$. We simply use $\lambda = 1$ as the multiplier for a point from $X$.

3. Let $C'$ be any convex set such that $X \subseteq C'$. Since $C'$ is convex, every point of the form $\lambda_1 x_1 + \ldots + \lambda_t x_t$ where $x_1, \ldots, x_t \in X$, $\lambda_i \ge 0$, $\sum_{i=1}^{t} \lambda_i = 1$ belongs to $C'$ by Proposition 2.6. Thus, $\tilde{C} \subseteq C'$.

From 1., 2. and 3., we get that $\tilde{C} = \mathrm{conv}(X)$.
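Theorem 2.8 also gives a computational handle on convex hulls of finite sets: deciding whether $p \in \mathrm{conv}(X)$ amounts to the linear feasibility problem of finding multipliers $\lambda \ge 0$ with $\sum_i \lambda_i = 1$ and $\sum_i \lambda_i x_i = p$. Here is a minimal sketch of this test (our illustration; it assumes SciPy's LP solver).

```python
# Illustrative sketch: membership in conv(X) for finite X as an LP feasibility
# problem, directly from the description in Theorem 2.8.
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(points, p):
    points = np.asarray(points, dtype=float)            # shape (n, d)
    n, d = points.shape
    A_eq = np.vstack([points.T, np.ones((1, n))])       # d coordinate rows
    b_eq = np.append(np.asarray(p, dtype=float), 1.0)   # plus sum(lambda) = 1
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * n)               # lambda >= 0
    return res.success

square = [(0, 0), (1, 0), (0, 1), (1, 1)]
print(in_convex_hull(square, (0.5, 0.5)))   # True
print(in_convex_hull(square, (1.5, 0.5)))   # False
```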

2.2 Convex cones, affine sets and dimension

We say $X$ is convex if for all $x, y \in X$ and $\lambda, \gamma \ge 0$ such that $\lambda + \gamma = 1$, $\lambda x + \gamma y \in X$. What happens if we relax the conditions on $\lambda, \gamma$?

Definition 2.9. Let $X \subseteq \mathbb{R}^d$ be a nonempty set. We have three possibilities:

1. We say that $X \subseteq \mathbb{R}^d$ is a convex cone if for all $x, y \in X$ and $\lambda, \gamma \ge 0$, $\lambda x + \gamma y \in X$.

2. We say that $X \subseteq \mathbb{R}^d$ is an affine set or an affine subspace if for all $x, y \in X$ and $\lambda, \gamma \in \mathbb{R}$ such that $\lambda + \gamma = 1$, $\lambda x + \gamma y \in X$.

3. We say $X \subseteq \mathbb{R}^d$ is a linear set or a linear subspace if for all $x, y \in X$ and $\lambda, \gamma \in \mathbb{R}$, $\lambda x + \gamma y \in X$.

Remark 2.10. Since we relaxed the conditions on $\lambda, \gamma$, convex cones, affine sets and linear sets are all special cases of convex sets.

Similar to the definition of the convex hull of an arbitrary subset $X$, one can define the conical hull of $X$ as the set-inclusion-wise smallest convex cone containing $X$, denoted by $\mathrm{cone}(X)$. Similarly, the affine (linear) hull of $X$ is the set-inclusion-wise smallest affine (linear) set containing $X$. The affine hull will be denoted by $\mathrm{aff}(X)$, and the linear hull will be denoted by $\mathrm{span}(X)$. One can verify the following analog of Theorem 2.8.

Theorem 2.11. Let $X \subseteq \mathbb{R}^d$. The following are all true.

1. $\mathrm{cone}(X) = \bigcap\,(C : X \subseteq C,\ C \text{ is a convex cone}) = \{\lambda_1 x_1 + \ldots + \lambda_t x_t : x_1, \ldots, x_t \in X,\ \lambda_1, \ldots, \lambda_t \ge 0\}$.

2. $\mathrm{aff}(X) = \bigcap\,(C : X \subseteq C,\ C \text{ is an affine set}) = \{\lambda_1 x_1 + \ldots + \lambda_t x_t : x_1, \ldots, x_t \in X,\ \sum_{i=1}^{t} \lambda_i = 1\}$.

3. $\mathrm{span}(X) = \bigcap\,(C : X \subseteq C,\ C \text{ is a linear subspace}) = \{\lambda_1 x_1 + \ldots + \lambda_t x_t : x_1, \ldots, x_t \in X,\ \lambda_1, \ldots, \lambda_t \in \mathbb{R}\}$.

The following is a fundamental theorem of linear algebra.

Theorem 2.12. Let $X \subseteq \mathbb{R}^d$. The following are equivalent.

1. $X$ is a linear subspace.

2. There exist $0 \le m \le d$ and linearly independent vectors $v^1, \ldots, v^m \in X$ such that every $x \in X$ can be written as $x = \lambda_1 v^1 + \ldots + \lambda_m v^m$ for some reals $\lambda_i$, $i = 1, \ldots, m$, i.e., $X = \mathrm{span}(\{v^1, \ldots, v^m\})$.

3. There exists a matrix $A \in \mathbb{R}^{(d-m) \times d}$ with full row rank such that $X = \{x \in \mathbb{R}^d : Ax = 0\}$.

8 d 216 Proof sketch. We take for granted the fact that we can have at most d linearly independent vectors in R . 217 This is something one can show using Gaussian elimination. 218 It is easy to verify that 2. 1. (because linear combinations of linear combinations are linear combi- ⇒ 1 m 219 nations). To see that 1. 2., starting with a linear subspace X, we construct a finite set v ,..., v X ⇒ 1 ∈ 220 satisfying the conditions of 2. We do this in an iterative fashion. Start by picking any arbitrary v X. If 1 2 1 1 2∈ 221 X = span( v ), then we are done. Else, choose v X span( v ). Again, if X = span( v , v ) then { } 3 1 2 ∈ \ { } { } 222 we are done, else choose v X span( v , v ). This process has to end after at most d steps, because we ∈ \ { } d 223 cannot have more than d linearly independent vectors in R . ⊥ d 224 It is easy to verify 3. 1. To see that 1. 3., define the set X := y R : y, x = 0 x X (this ⇒ ⇒ ⊥{ ∈ h i ∀ ∈ } 225 is known as the orthogonal complement of X). It can be verified that X is a linear subspace. Moreover, by ⊥ 1 k 226 the equivalence 1. 2., we know that 2. holds for X . So there exist linearly independent vectors a ,..., a ⇔ ⊥ 1 k 1 k 227 for some 0 k d such that X = span( a ,..., a ). Let A be the k d matrix which has a ,..., a as ≤ ≤ { d } × 228 rows. One can now verify that X = x R : Ax = 0 . The fact that one can take k = d m where m is { ∈ } − 229 the number from condition 2. needs additional work, which we skip here.

Definition 2.13. The number $m$ showing up in item 2. of the above theorem is called the dimension of $X$. The set of vectors $\{v^1, \ldots, v^m\}$ is called a basis for the linear subspace.

There is an analogous theorem for affine sets. For this, we need the concept of affine independence, which is analogous to the concept of linear independence.

Definition 2.14. We say a set $X$ is affinely independent if there does not exist $x \in X$ such that $x \in \mathrm{aff}(X \setminus \{x\})$.

We now give several characterizations of affine independence.

Proposition 2.15. Let $X \subseteq \mathbb{R}^d$. The following are equivalent.

1. $X$ is an affinely independent set.

2. For every $x \in X$, the set $\{v - x : v \in X \setminus \{x\}\}$ is linearly independent.

3. There exists $x \in X$ such that the set $\{v - x : v \in X \setminus \{x\}\}$ is linearly independent.

4. The set of vectors $\{(x, 1) \in \mathbb{R}^{d+1} : x \in X\}$ is linearly independent.

5. $X$ is a finite set with vectors $x^1, \ldots, x^m$ such that $\lambda_1 x^1 + \ldots + \lambda_m x^m = 0$, $\lambda_1 + \ldots + \lambda_m = 0$ implies $\lambda_1 = \lambda_2 = \ldots = \lambda_m = 0$.

Proof. 1. $\Rightarrow$ 2. Consider an arbitrary $x \in X$. Suppose to the contrary that $\{v - x : v \in X \setminus \{x\}\}$ is not linearly independent, i.e., there exist multipliers $\lambda_v$, not all zero, such that $\sum_{v \in X \setminus \{x\}} \lambda_v (v - x) = 0$. Rearranging terms, we get $\sum_{v \in X \setminus \{x\}} \lambda_v v = \big(\sum_{v \in X \setminus \{x\}} \lambda_v\big) x$. We now consider two cases:

Case 1: $\sum_{v \in X \setminus \{x\}} \lambda_v = 0$. In this case, since not all the $\lambda_v$ are zero, let $\bar{v} \in X \setminus \{x\}$ be such that $\lambda_{\bar{v}} \neq 0$. Since $\sum_{v \in X \setminus \{x\}} \lambda_v v = \big(\sum_{v \in X \setminus \{x\}} \lambda_v\big) x = 0$, we obtain that $\bar{v} = \sum_{v \in X \setminus \{x, \bar{v}\}} \frac{-\lambda_v}{\lambda_{\bar{v}}} v$. Since $\sum_{v \in X \setminus \{x\}} \lambda_v = 0$, we obtain that $\sum_{v \in X \setminus \{x, \bar{v}\}} \frac{-\lambda_v}{\lambda_{\bar{v}}} = 1$ and thus $\bar{v} \in \mathrm{aff}(X \setminus \{x, \bar{v}\})$, contradicting the assumption that $X$ is affinely independent.

Case 2: $\sum_{v \in X \setminus \{x\}} \lambda_v \neq 0$. We can write $x = \sum_{v \in X \setminus \{x\}} \frac{\lambda_v}{\sum_{v \in X \setminus \{x\}} \lambda_v} v$. This implies that $x \in \mathrm{aff}(X \setminus \{x\})$, contradicting the assumption that $X$ is affinely independent.

2. $\Rightarrow$ 3. Obvious.

3. $\Rightarrow$ 4. Let $\bar{x}$ be such that $\{v - \bar{x} : v \in X \setminus \{\bar{x}\}\}$ is linearly independent. This means that the vectors $\{(v - \bar{x}, 0) : v \in X \setminus \{\bar{x}\}\} \cup \{(\bar{x}, 1)\}$ are also linearly independent. Thus the matrix with these vectors as columns has full column rank. Now if we add the column $(\bar{x}, 1)$ to each of the other columns, this does

not change the column rank, and thus the columns remain linearly independent. But the new matrix has precisely $\{(x, 1) \in \mathbb{R}^{d+1} : x \in X\}$ as its columns.

4. $\Rightarrow$ 5. If $\{(x, 1) \in \mathbb{R}^{d+1} : x \in X\}$ is linearly independent, then the set $X$ must be finite with elements $x^1, \ldots, x^m$. Moreover, for any $\lambda_1, \ldots, \lambda_m$ such that $\lambda_1 x^1 + \ldots + \lambda_m x^m = 0$, $\lambda_1 + \ldots + \lambda_m = 0$, we have $\sum_{x \in X} \lambda_x (x, 1) = 0$. By linear independence of the set $\{(x, 1) \in \mathbb{R}^{d+1} : x \in X\}$, $\lambda_1 = \ldots = \lambda_m = 0$.

5. $\Rightarrow$ 1. Consider any $x^i \in X$. If $x^i \in \mathrm{aff}(X \setminus \{x^i\})$, then there exist multipliers $\lambda_j \in \mathbb{R}$, $j \neq i$, such that $x^i = \sum_{j \neq i} \lambda_j x^j$ and $\sum_{j \neq i} \lambda_j = 1$. This implies that $\sum_{j=1}^{m} \lambda_j x^j = 0$ where $\lambda_i = -1$, and therefore $\lambda_1 + \ldots + \lambda_m = 0$, contradicting the hypothesis of 5.
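Condition 4. of Proposition 2.15 is the most convenient characterization computationally: affine independence of $x^1, \ldots, x^m$ is just linear independence of the lifted vectors $(x^i, 1)$, i.e., a rank computation. A minimal sketch (our illustration, not from the notes):

```python
# Illustrative sketch: testing affine independence via condition 4 of
# Proposition 2.15 -- the lifted vectors (x, 1) must be linearly independent.
import numpy as np

def affinely_independent(points):
    P = np.asarray(points, dtype=float)                 # shape (m, d)
    lifted = np.hstack([P, np.ones((P.shape[0], 1))])   # rows (x, 1) in R^{d+1}
    return np.linalg.matrix_rank(lifted) == P.shape[0]

print(affinely_independent([(0, 0), (1, 0), (0, 1)]))   # True: a triangle
print(affinely_independent([(0, 0), (1, 1), (2, 2)]))   # False: collinear
```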

We are now ready to state the affine version of Theorem 2.12.

Theorem 2.16. Let $X \subseteq \mathbb{R}^d$. The following are equivalent.

1. $X$ is an affine subspace.

2. There exists a linear subspace $L$ of dimension $0 \le m \le d$ such that $X - x = L$ for every $x \in X$.

3. There exist affinely independent vectors $v^1, \ldots, v^{m+1} \in X$ for some $0 \le m \le d$ such that every $x \in X$ can be written as $x = \lambda_1 v^1 + \ldots + \lambda_{m+1} v^{m+1}$ for some reals $\lambda_i$, $i = 1, \ldots, m+1$, such that $\lambda_1 + \ldots + \lambda_{m+1} = 1$, i.e., $X = \mathrm{aff}(\{v^1, \ldots, v^{m+1}\})$.

4. There exist a matrix $A \in \mathbb{R}^{(d-m) \times d}$ with full row rank and a vector $b \in \mathbb{R}^{d-m}$ for some $0 \le m \le d$ such that $X = \{x \in \mathbb{R}^d : Ax = b\}$.

Proof. 1. $\Rightarrow$ 2. Fix an arbitrary $x^* \in X$. Define $L = X - x^*$. We first show that $L$ is a linear subspace: for any $y^1, y^2 \in X$, $\lambda(y^1 - x^*) + \gamma(y^2 - x^*) \in X - x^*$ for any $\lambda, \gamma \in \mathbb{R}$. Since $\lambda(y^1 - x^*) + \gamma(y^2 - x^*) + x^* = \lambda y^1 + \gamma y^2 + (1 - \lambda - \gamma)x^*$ and $X$ is an affine subset, we have $\lambda(y^1 - x^*) + \gamma(y^2 - x^*) + x^* \in X$. So, $\lambda(y^1 - x^*) + \gamma(y^2 - x^*) \in X - x^* = L$. Now, for any other $\bar{x} \in X$, we need to show that $L = X - \bar{x}$. Consider any $y \in L$, i.e., $y = x - x^*$ for some $x \in X$. Observe that $y = (x + \bar{x} - x^*) - \bar{x}$ and $x + \bar{x} - x^* \in X$ (because the coefficients all sum to 1). Therefore, $y \in X - \bar{x}$, showing that $L = X - x^* \subseteq X - \bar{x}$. Switching the roles of $x^*$ and $\bar{x}$, one can similarly show that $X - \bar{x} \subseteq X - x^* = L$.

2. $\Rightarrow$ 1. Consider any $y^1, y^2 \in X$ and let $\lambda, \gamma \in \mathbb{R}$ such that $\lambda + \gamma = 1$. We need to show that $\lambda y^1 + \gamma y^2 \in X$. Since $X - y^1$ is a linear subspace, $\gamma(y^2 - y^1) \in X - y^1$. Thus, $\gamma(y^2 - y^1) + y^1 = \lambda y^1 + \gamma y^2 \in X$.

The equivalence of 2., 3. and 4. follows from Theorem 2.12.

Definition 2.17 (Dimension of convex sets). If $X$ is an affine subspace and $x \in X$, the linear subspace $X - x$ is called the linear subspace parallel to $X$, and the dimension of $X$ is the dimension of the linear subspace $X - x$. For any nonempty convex set $X$, the dimension of $X$ is the dimension of $\mathrm{aff}(X)$ and will be denoted by $\dim(X)$. As a matter of convention, we take the dimension of the empty set to be $-1$.

Lemma 2.18. If $X$ is a set of affinely independent points, then $\dim(\mathrm{aff}(X)) = |X| - 1$.

Proof. Fix any $x \in X$. By Theorem 2.16, $L = \mathrm{aff}(X) - x$ is a linear subspace. We claim that $(X \setminus \{x\}) - x$ is a basis for $L$. The verification of this claim is left to the reader.

Proposition 2.19. Let $X$ be a convex set. $\dim(X)$ equals one less than the maximum number of affinely independent points in $X$.

Proof. Let $X_0 \subseteq X$ be a maximum sized set of affinely independent points in $X$. By Problem 5 in “HW for Week I”, $\mathrm{aff}(X_0) \subseteq \mathrm{aff}(X)$. Since $X_0$ is a maximum sized set of affinely independent points in $X$, any $x \in X$ must lie in $\mathrm{aff}(X_0)$. Therefore, $X \subseteq \mathrm{aff}(X_0)$. Since $\mathrm{aff}(X_0)$ is an affine set, by definition of the affine hull of $X$, we have $\mathrm{aff}(X) \subseteq \mathrm{aff}(X_0)$. Therefore, $\mathrm{aff}(X) = \mathrm{aff}(X_0)$, implying that $\dim(\mathrm{aff}(X_0)) = \dim(\mathrm{aff}(X))$. By Lemma 2.18, we thus obtain $|X_0| - 1 = \dim(\mathrm{aff}(X))$.

2.3 Representations of convex sets

A large part of modern convex geometry is concerned with algorithms for computing with or optimizing over convex sets. For algorithmic purposes, we need ways to describe a convex set so that it can be stored in a computer compactly and computations can be performed with it.

2.3.1 Extrinsic description: separating hyperplanes

Perhaps the most primitive convex set in $\mathbb{R}^d$ is the halfspace; see item 2. in Example 2.1. Moreover, a halfspace is a closed convex set. By Theorem 2.3, the intersection of an arbitrary family of halfspaces is a closed convex set. Perhaps the most fundamental theorem of convexity is that the converse is true.

Theorem 2.20 (Separating Hyperplane Theorem). Let $C \subseteq \mathbb{R}^d$ be a closed convex set and let $x \notin C$. There exists a halfspace that contains $C$ and does not contain $x$. More precisely, there exist $a \in \mathbb{R}^d \setminus \{0\}$, $\delta \in \mathbb{R}$ such that $\langle a, y \rangle \le \delta$ for all $y \in C$ and $\langle a, x \rangle > \delta$. The hyperplane $\{y \in \mathbb{R}^d : \langle a, y \rangle = \delta\}$ is called a separating hyperplane for $C$ and $x$.

Proof. If $C$ is empty, then any halfspace that does not contain $x$ suffices. Otherwise, consider any $\bar{x} \in C$ and let $r = \|x - \bar{x}\|$. Let $\bar{C} = C \cap B(x, r)$. Since $C$ is closed and $B(x, r)$ is compact, $\bar{C}$ is compact. One can also verify that the function $f(y) = \|y - x\|$ is a continuous function on $\mathbb{R}^d$. Therefore, by Weierstrass' Theorem (Theorem 1.11), there exists $x^* \in \bar{C}$ such that $\|x - x^*\| \le \|x - y\|$ for all $y \in \bar{C}$, and therefore in fact $\|x - x^*\| \le \|x - y\|$ for all $y \in C$.

Let $a = x - x^*$ and let $\delta = \langle a, x^* \rangle$. Note that $a \neq 0$ because $x \notin C$ and $x^* \in C$. Also note that $\langle a, x \rangle = \langle a, a + x^* \rangle = \|a\|^2 + \delta > \delta$. Thus, it remains to check that $\langle a, y \rangle \le \delta$ for all $y \in C$. For any $y \in C$, all the points $\alpha y + (1 - \alpha)x^*$, $\alpha \in (0, 1)$ are in $C$ by convexity. Therefore, by the extremal property of $x^*$, we have
$$\begin{array}{rl} \|x - x^*\|^2 \le \|x - (\alpha y + (1 - \alpha)x^*)\|^2 & \forall \alpha \in (0, 1) \\ \Rightarrow\ 0 \le \alpha^2 \|y - x^*\|^2 - 2\alpha \langle x - x^*, y - x^* \rangle & \forall \alpha \in (0, 1) \\ \Rightarrow\ 2\langle x - x^*, y - x^* \rangle \le \alpha \|y - x^*\|^2 & \forall \alpha \in (0, 1) \end{array}$$

327 Another related, and very useful, result is the following.

d 328 Theorem 2.23 (Supporting Hyperplane Theorem). Let C R be a convex set and let x bd(C). Then, d ⊆ ∈ 329 there exists a R 0 , δ R such that a, y δ for all y C and a, x = δ. The hyperplane d ∈ \{ } ∈ h i ≤ ∈ h i 330 y R : a, y = δ is called a supporting hyperplane for C at x. { ∈ h i } d d d 331 Proof. Since bd(C) = bd(R cl(C)), x bd(R cl(C)). Since R cl(C) is an open set, there exists a i i\ ∈ i \ \ i i 332 sequence x i∈N such that x x and each x cl(C). By Theorem 2.20, for each x , there exists a such i { } i i → 6∈ i i 333 that a , y < a , x for all y C. By scaling the vectors a , we can assume that a = 1 for all i N. h i h i ∈ k k ∈

11 334 Since the set of unit norm vectors is a compact set, by Theorem 1.10, one can pick a convergent sub- ik ik ik ik 335 sequence a a such that a , y < a , x for all y C. Taking the limit on both sides, we obtain → h i h i ∈ i 336 a, y a, x for all y C. We simply set δ = a, x . Note also that since a = 1 for all i N, we must h i ≤ h i ∈ h i k k ∈ 337 have a = 1, and so a = 0. k k 6

338 How to represent general convex sets: Separation oracles. We have seen that polyhedra can be 339 represented by a matrix A and a right hand side b. Norm balls can be represented by the center x and 340 the radius R. Ellipsoids can be represented by positive definite matrices A. What about general convex 341 sets? This problem is gotten around by assuming that one has “black-box” access to the convex set via a d 342 separation oracle. More formally, we say that a convex set C R is equipped with a separation oracle O d ⊆ 343 that takes as input any vector x R and gives the following output: If x C, the output is “YES”, and if ∈ d d ∈ 344 x C, then the output is a tuple (a, δ) R R such that y R : a, y = δ is a separating hyperplane 6∈ ∈ × { ∈ h i } 345 for x and C.

346 Farkas’ lemma: A glimpse into polyhedral theory. A nice characterization of solutions to systems 347 of linear equations is given in linear algebra, which can be viewed as the most basic type of “theorem of the 348 alternative”.

d×n d 349 Theorem 2.24. Let A R and b R . Exactly one of the following is true. ∈ ∈ 350 1. Ax = b has a solution.

d T T 351 2. There exists u R such that u A = 0 and u b = 0. ∈ 6 352 What if we are interested in nonnegative solutions to linear equations? Farkas’ lemma is a characterization 353 of such solutions.

d×n d 354 Theorem 2.25. [Farkas’ Lemma] Let A R and b R . Exactly one of the following is true. ∈ ∈ 355 1. Ax = b, x 0 has a solution. ≥ d T T 356 2. There exists u R such that u A 0 and u b > 0. ∈ ≤ 357 Before we dive into the proof of Farkas’ Lemma, we need a technical result.

1 n d 1 n 358 Lemma 2.26. Let a ,..., a R . Then cone( a ,..., a ) is closed. ∈ { } 359 Proof. We will complete the proof of this lemma when we do Caratheodory’s theorem (see the end of 360 Section 2.4).

1 n d 361 Proof of Theorem 2.25. Let a ,..., a R be the columns of the matrix A. By Lemma 2.26, the cone ∈ 362 C = Ax : x 0 is closed. We now have two cases, either b C or b C. In the first case, we end up in { ≥ } ∈ 6∈ d 363 Case 1 of the statement of the theorem. In the second case, by Theorem 2.20, there exists u R and δ R ∈ ∈ 364 such that u, y δ for all y C and u, b > δ. Since 0 C, we must have δ u, 0 = 0. This already h i ≤ ∈ h i ∈ ≥ h i 365 shows that u, b > 0. h i i i i 366 Now suppose to the contrary that for some a , u, a > 0. Thus, there exists λ¯ 0 such that λ¯ u, a > δ ¯ |δ|+1 ¯ i h i ≥ h i 367 (for example, take λ = i ). Since y := λa C, this implies that u, y > δ, contradicting that u, y δ hu,a i ∈ h i h i ≤ 368 for all y C. ∈

369 Duality/Polarity. With every linear space, one can associate a “dual” linear space which is its orthogonal 370 complement.

d ⊥ d 371 Definition 2.27. Let X R be a linear subspace. We define X := y R : y, x = 0 x X as the ⊆ { ∈ h i ∀ ∈ } 372 orthogonal complement of X.

373 The following is well-known from linear algebra.

12 ⊥ ⊥ ⊥ 374 Proposition 2.28. X is a linear subspace. Moreover, (X ) = X.

375 There is a way to generalize this idea of associating a dual object to convex sets.

Definition 2.29. Let X Rd be any set. The set defined as ⊆ ◦ d X := y R : y, x 1 x X { ∈ h i ≤ ∀ ∈ }

376 is called the polar of X.

377 Proposition 2.30. The following are all true.

◦ d 378 1. X is a closed, convex set for any X R (not necessarily convex). ⊆ ◦ ◦ 379 2.( X ) = cl(conv(X 0 )). In particular, if X is a closed convex set containing the origin, then ◦ ◦ ∪ { } 380 (X ) = X.

◦ d 381 3. If X is a convex cone, then X = y R : y, x 0 x X . { ∈ h i ≤ ∀ ∈ } ◦ ⊥ 382 4. If X is a linear subspace, then X = X .

Proof. 1. Follows from the fact that X◦ can be written as the intersection of closed halfspaces:

◦ \ d X = y R : y, x 1 . { ∈ h i ≤ } x∈X

◦ ◦ ◦ ◦ ◦ ◦ 383 2. Observe that X (X ) . Also, 0 (X ) , because 0 is always in the polar of any set. Since (X ) is ⊆ ∈ ◦ ◦ 384 a closed convex set by 1., we must have cl(conv(X 0 )) (X ) . ∪ { } ⊆ ◦ ◦ 385 To show the reverse inclusion, we show that if y cl(conv(X 0 )) then y (X ) . Thus, we need ◦ 6∈ ∪ { } 6∈ 386 to show that there exists z X such that y, z > 1. Since y cl(conv(X 0 )), by Theorem 2.20, d ∈ h i 6∈ ∪ { } 387 there exists a R , δ R such that a, y > δ and a, x δ for all x cl(conv(X 0 )). Since ∈ ∈ h i h i ≤ ∈ ∪ { } 388 0 cl(conv(X 0 )), we obtain that 0 δ. We now consider two cases: ∈ ∪ { } ≤ a 389 Case 1: δ > 0. Set z = δ . Now, z, x 1 for all x X because a, x δ for all x cl(conv(X ◦ h i ≤ ∈ h i ≤ ∈ ∪ 390 0 )) X. Therefore, z X . Moreover, z, y > 1 because a, y > δ. So we are done. { } ⊇ ∈ h i h i 2a 391 Case 2: δ = 0. Define  := a, y > δ = 0. Set z =  . Then, z, y = 2 > 1. Also, for every h i 2 2 h i ◦ 392 x X cl(conv(X 0 )), we obtain that z, x = a, x δ = 0 1. Thus, z X . Thus, we ∈ ⊆ ∪ { } h i  h i ≤  ≤ ∈ 393 are done. 394 3. and 4. are left to the reader.

395

13 1 1 ◦ Example 2.31. If p, q 1 such that p + q = 1 (allowing for p or q to be ), then (B`p (0, 1)) = B`q (0, 1). This example illustrates≥ the use of the fundamental H¨older’sinequality.∞

1 1 Proposition 2.32 (H¨older’sinequality). If p, q 1 such that p + q = 1 (allowing for p or q to be ), then ≥ ∞ x, y x p y q, |h i| ≤ k k k k q d p for every x, y R . Moreover, if p, q > 1 then equality holds if and only if xi = yi . ∈ | | | | The special case with p = q = 2 is known as the Cauchy-Schwarz inequality. We won’t prove H¨older’s inequality here, but we will use it to derive the polarity relation between `p unit balls. We only show that ◦ 1 1 B`q (0, 1) = (B`p (0, 1)) for any p, q > 1 such that + = 1. The case p = 1, q = is considered in p q ∞ 396 Problem6 from “HW for Week III”. ◦ First, we show that B`q (0, 1) (B`p (0, 1)) . Consider any y B`q (0, 1) and consider any x B`p . By ⊆ ∈ ◦ ∈ H¨older’sinequality, we obtain that x, y x p y q 1. Thus, B`q (0, 1) (B`p (0, 1)) . To show the ◦ h i ≤ k k k k ≤ ◦ ⊆ reverse inclusion (B`p (0, 1)) B`q (0, 1), consider any y (B`p (0, 1)) . We would like to show that ⊆ ∈ y B`q (0, 1), i.e., y q 1. Suppose to the contrary that y q > 1. Consider x defined as follows: for ∈ k k ≤ qk k p x each i = 1, . . . , d, xi has the same sign as yi, and xi = yi . Set x˜ = . Now, | | | | kxkp 1 1 y, x˜ = x, y = ( x p y q) = y q > 1, h i x p h i x p k k k k k k k k k k ◦ contradicting the fact that y (B`p (0, 1)) , because x˜ p = 1. The second equality follows from Proposi- tion 2.32 because of the special∈ choice of x. k k

2.3.2 Intrinsic description: faces, extreme points, recession cone, lineality space

We have seen that given any set $X$ of points in $\mathbb{R}^d$, the convex hull of $X$ – the smallest convex set containing $X$ – can be expressed as the set of all convex combinations of finite subsets of $X$ (Theorem 2.8). One possibility to represent a convex set $C$ intrinsically is to give a minimal subset $X \subseteq C$ such that all points in $C$ can be expressed as convex combinations of points in $X$, i.e., $C = \mathrm{conv}(X)$. In particular, if $X$ is a finite set, then we can use $X$ to represent $C$ in a computer: implicitly, $C$ is the convex hull of the set $X$. We are going to get to such a “minimal” intrinsic description.

Definition 2.33 (Faces and extreme points). Let $C$ be a convex set. A convex subset $F \subseteq C$ is called an extreme subset or a face of $C$ if for any $x \in F$ the following holds: $x^1, x^2 \in C$, $\frac{x^1 + x^2}{2} = x$ implies that $x^1, x^2 \in F$. This is equivalent to saying that there is no point in $F$ that can be expressed as a convex combination of points in $C \setminus F$ – see Problem 10 from “HW for Week III”.

A face of dimension 0 is called an extreme point. In other words, $x$ is an extreme point of $C$ if the following holds: $x^1, x^2 \in C$, $\frac{x^1 + x^2}{2} = x$ implies that $x^1 = x^2 = x$. We denote the set of extreme points of $C$ by $\mathrm{ext}(C)$.

The one-dimensional faces of a convex set are called its edges. If $k = \dim(C)$, then the $(k-1)$-dimensional faces are called facets. We will see below that the only $k$-dimensional face of $C$ is $C$ itself. Any face of $C$ that is not $C$ or $\emptyset$ is called a proper face of $C$.

Definition 2.34. Let $C$ be a convex set. We define the relative interior of $C$ as the set of all $x \in C$ for which there exists $\epsilon > 0$ such that for all $y \in \mathrm{aff}(C)$, $x + \epsilon \frac{y - x}{\|y - x\|} \in C$. We denote it by $\mathrm{relint}(C)$.¹

We define the relative boundary of $C$ to be $\mathrm{relbd}(C) := \mathrm{cl}(C) \setminus \mathrm{relint}(C)$.

¹For the reader familiar with the concept of a relative topology: the relative interior of $C$ is the interior of $C$ with respect to the relative topology of $\mathrm{aff}(C)$.

Exercise 3. Let $C$ be convex and $x \in C$. Suppose that for all $y \in \mathrm{aff}(C)$, there exists $\epsilon_y > 0$ such that $x + \epsilon_y (y - x) \in C$. Show that $x \in \mathrm{relint}(C)$.

This exercise shows that it suffices to have a different $\epsilon$ for every direction; this implies a universal $\epsilon$ for every direction.

Exercise 4. Show that $\mathrm{relint}(C)$ is nonempty for any nonempty convex set $C$.

Lemma 2.35. Let $C$ be a convex set of dimension $k$. The only $k$-dimensional face of $C$ is $C$ itself.

Proof. Let $F \subsetneq C$ be a proper face of $C$. Let $x \in C \setminus F$. Let $X \subseteq F$ be a maximum set of affinely independent points in $F$. We claim that $X \cup \{x\}$ is affinely independent. This immediately implies that $\dim(C) > \dim(F)$ and we will be done.

Suppose to the contrary that $x \in \mathrm{aff}(X)$. Then consider $x^* \in \mathrm{relint}(F)$ (which is nonempty by Exercise 4). By definition, there exists $\epsilon > 0$ such that $y = x^* + \epsilon(x^* - x) \in F$. But this means that $x^* = \frac{1}{1 + \epsilon} y + \frac{\epsilon}{1 + \epsilon} x$. Since $y \in F$ and $x \notin F$, this contradicts that $F$ is a face.

Lemma 2.36. Let $C$ be a convex set and let $F \subseteq C$ be a face of $C$. If $x$ is an extreme point of $F$, then $x$ is an extreme point of $C$.

Proof. Left to the reader.

Lemma 2.37. Let $C \subseteq \mathbb{R}^d$ be convex. Let $a \in \mathbb{R}^d$ and $\delta \in \mathbb{R}$ be such that $C \subseteq \{x \in \mathbb{R}^d : \langle a, x \rangle \le \delta\}$. Then the set $F = C \cap \{x \in \mathbb{R}^d : \langle a, x \rangle = \delta\}$ is a face of $C$.

Proof. Let $\bar{x} \in F$ and $x^1, x^2 \in C$ such that $\frac{x^1 + x^2}{2} = \bar{x}$. By the hypothesis, $\langle a, x^i \rangle \le \delta$ for $i = 1, 2$. If $\langle a, x^i \rangle < \delta$ for either $i = 1, 2$, then
$$\langle a, \bar{x} \rangle = \Big\langle a, \frac{x^1 + x^2}{2} \Big\rangle = \frac{\langle a, x^1 \rangle + \langle a, x^2 \rangle}{2} < \delta,$$

contradicting that $\bar{x} \in F$. Therefore, we must have $\langle a, x^i \rangle = \delta$ for $i = 1, 2$, and thus $x^1, x^2 \in F$.

Definition 2.38. A face $F$ of a convex set $C$ is called an exposed face if there exist $a \in \mathbb{R}^d$ and $\delta \in \mathbb{R}$ such that $C \subseteq \{x \in \mathbb{R}^d : \langle a, x \rangle \le \delta\}$ and $F = C \cap \{x \in \mathbb{R}^d : \langle a, x \rangle = \delta\}$. We will sometimes make it explicit and say that $F$ is an exposed face induced by $(a, \delta)$.
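For a concrete illustration (ours, not from the notes), take $C = [0, 1]^2$, $a = (1, 0)$ and $\delta = 1$. Then $C \subseteq \{x : \langle a, x \rangle \le 1\}$, and Lemma 2.37 yields the face $F = C \cap \{x : x_1 = 1\}$, the right edge of the square. $F$ is an exposed face induced by $(a, \delta)$ in the sense of Definition 2.38, and its endpoints $(1, 0)$ and $(1, 1)$ are extreme points of $F$, hence of $C$ by Lemma 2.36.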

By working with the affine hull and the relative interior, and using Problem 3 from “HW for Week II”, a stronger version of the supporting hyperplane theorem can be shown to be true.

Theorem 2.39 (Supporting Hyperplane Theorem - II). Let $C \subseteq \mathbb{R}^d$ be convex and $x \in \mathrm{relbd}(C)$. There exist $a \in \mathbb{R}^d$ and $\delta \in \mathbb{R}$ such that all of the following hold:

(i) $\langle a, y \rangle \le \delta$ for all $y \in C$,

(ii) $\langle a, x \rangle = \delta$, and

(iii) there exists $\bar{y} \in C$ such that $\langle a, \bar{y} \rangle < \delta$. This third condition says that $C$ is not completely contained in the hyperplane $\{y \in \mathbb{R}^d : \langle a, y \rangle = \delta\}$.

An important consequence of the above discussion is the following theorem about the relative boundary of a closed, convex set $C$.

Theorem 2.40. Let $C \subseteq \mathbb{R}^d$ be a closed, convex set and $x \in C$. $x$ is contained in a proper face of $C$ if and only if $x \in \mathrm{relbd}(C)$.

15 d 450 Proof. If x relbd(C), then by Theorem 2.39 there exists a R and δ R such that the three conditions ∈ d ∈ ∈ 451 in Theorem 2.39 hold. By Lemma 2.37, F = C x R : a, x = δ is a face of C, and it is a proper face ∩ { ∈ h i } 452 because of condition (iii) in Theorem 2.39. Now let x F where F is a proper face of C. Since C is closed, it suffices to show that x relint(C). Suppose to the∈ contrary that x relint(C). Let x¯ C F . Observe that 2x x¯ aff(C). Since x6∈is assumed to be in the relative interior of∈C, there exists  >∈ 0 such\ that y = ((2x −x¯)∈ x) + x C. Rearranging terms, we obtain that − − ∈  1 x = x¯ + y.  + 1  + 1

Since $x \in F$ and $\bar{x} \notin F$, this contradicts the fact that $F$ is a face. Thus, $x \notin \mathrm{relint}(C)$ and so $x \in \mathrm{relbd}(C)$.

In our search for a subset $X \subseteq C$ such that $C = \mathrm{conv}(X)$, it is clear that $X$ must contain all extreme points. But is it sufficient to include all extreme points? In other words, is it true that $C = \mathrm{conv}(\mathrm{ext}(C))$? No! A simple counterexample is $\mathbb{R}^d_+$: its only extreme point is $0$. Another weird example is the set $\{x \in \mathbb{R}^d : \|x\| < 1\}$ – this set has NO extreme points! As you might suspect, the problem is that these sets are not compact, i.e., closed and bounded.

Theorem 2.41 (Krein-Milman Theorem). If $C$ is a compact convex set, then $C = \mathrm{conv}(\mathrm{ext}(C))$.

Proof. The proof is going to use induction on the dimension of $C$. First, if $C$ is the empty set, then the statement is a triviality. So we assume $C$ is nonempty.

For the base case with $\dim(C) = 0$, i.e., $C = \{x\}$ is a single point, the statement follows because $x$ is an extreme point of $C$, and $C = \mathrm{conv}(\{x\})$. For the induction step, consider any point $x \in C$. We consider two cases:

Case 1: $x \in \mathrm{relbd}(C)$. By Theorem 2.40, $x$ is contained in a proper face $F$ of $C$. By Lemma 2.35, $\dim(F) < \dim(C)$. By the induction hypothesis applied to $F$ (note that $F$ is also compact using Problem 14 from “HW for Week III”), we can express $x$ as a convex combination of extreme points of $F$, which by Lemma 2.36 shows that $x$ is a convex combination of extreme points of $C$.

Case 2: $x \in \mathrm{relint}(C)$. Let $\ell \subseteq \mathrm{aff}(C)$ be any affine set of dimension one (i.e., a line) going through $x$. Since $C$ is compact, $\ell \cap C$ is a line segment. The end points $x^1, x^2$ of $\ell \cap C$ must be in the relative boundary of $C$. By the previous case, $x^1, x^2$ can be expressed as convex combinations of extreme points in $C$. Since $x$ is a convex combination of $x^1$ and $x^2$, and a convex combination of convex combinations is a convex combination, we can express $x$ as a convex combination of extreme points of $C$.

What about non-compact sets? Let us relax the condition of being bounded, so we want to describe closed, convex sets. It turns out that there is a nice way to deal with unboundedness. We introduce the necessary concepts next.

Proposition 2.42. Let $C$ be a nonempty, closed, convex set, and $r \in \mathbb{R}^d$. The following are equivalent:

1. There exists $x \in C$ such that $x + \lambda r \in C$ for all $\lambda \ge 0$.

2. For every $x \in C$, $x + \lambda r \in C$ for all $\lambda \ge 0$.

Proof. Since $C$ is nonempty, we only need to show 1. $\Rightarrow$ 2.; the reverse implication is trivial. Let $\bar{x} \in C$ be such that $\bar{x} + \lambda r \in C$ for all $\lambda \ge 0$. Consider any arbitrary $x^* \in C$. Suppose to the contrary that there exists $\lambda' \ge 0$ such that $y = x^* + \lambda' r \notin C$. By Theorem 2.20, there exist $a \in \mathbb{R}^d$, $\delta \in \mathbb{R}$ such that $\langle a, y \rangle > \delta$ and $\langle a, x \rangle \le \delta$ for all $x \in C$. This means that $\langle a, r \rangle > 0$ because otherwise $\langle a, y \rangle = \langle a, x^* \rangle + \lambda' \langle a, r \rangle \le \delta + \lambda' \langle a, r \rangle \le \delta$, causing a contradiction. But then, if we choose $\bar{\lambda} = \frac{|\delta - \langle a, \bar{x} \rangle| + 1}{\langle a, r \rangle}$, we would obtain that

$$\langle a, \bar{x} + \bar{\lambda} r \rangle = \langle a, \bar{x} \rangle + \bar{\lambda} \langle a, r \rangle = \langle a, \bar{x} \rangle + |\delta - \langle a, \bar{x} \rangle| + 1 \ge \langle a, \bar{x} \rangle + \delta - \langle a, \bar{x} \rangle + 1 = \delta + 1 > \delta,$$

contradicting the assumption that $\bar{x} + \bar{\lambda} r \in C$.

16 d 482 Definition 2.43. Any r R that satisfies the conditions in Proposition 2.42 is called a recession direction ∈ 483 for C.

Proposition 2.44. The set of all recession directions of a nonempty, closed, convex set is a closed, convex cone.

Proof. Fix any point $x$ in the closed convex set $C$. Using condition 1. of Proposition 2.42, we see $r \in \mathbb{R}^d$ is a recession direction if and only if for every $\lambda > 0$, $r \in \frac{1}{\lambda}(C - x)$. Therefore,
$$\mathrm{rec}(C) = \bigcap_{\lambda > 0} \frac{1}{\lambda}(C - x).$$

Each term in the intersection is a closed, convex set. Therefore, $\mathrm{rec}(C)$ is a closed, convex set. It is easy to see that for any $r \in \mathrm{rec}(C)$, $\lambda r \in \mathrm{rec}(C)$ also for every $\lambda \ge 0$. Thus, $\mathrm{rec}(C)$ is a closed, convex cone.

Definition 2.45. Let $C$ be any nonempty, closed, convex set. We call the cone of recession directions the recession cone of $C$, and it is denoted by $\mathrm{rec}(C)$. The set $\mathrm{rec}(C) \cap -\mathrm{rec}(C)$ is a linear subspace and is called the lineality space of $C$. It will be denoted by $\mathrm{lin}(C)$. As a matter of convention, we say that $\mathrm{rec}(C) = \mathrm{lin}(C) = \{0\}$ when $C$ is empty.

Exercise 5. Show that Proposition 2.42 remains true if $\lambda \ge 0$ is replaced by $\lambda \in \mathbb{R}$ in both conditions. Show that $\mathrm{lin}(C)$ is exactly the set of all $r \in \mathbb{R}^d$ that satisfy these modified conditions.

Proposition 2.42 immediately gives the following corollary.

Corollary 2.46. Let $C$ be a closed convex set and let $F \subseteq C$ be a closed, convex subset. Then $\mathrm{rec}(F) \subseteq \mathrm{rec}(C)$.

Proof. Left as an exercise.

Here is a characterization of compact convex sets.

Theorem 2.47. A closed convex set $C$ is compact if and only if $\mathrm{rec}(C) = \{0\}$.

Proof. We leave it to the reader to check that if $C$ is compact, then $\mathrm{rec}(C) = \{0\}$. For the other direction, assume that $\mathrm{rec}(C) = \{0\}$. Suppose to the contrary that $C$ is not bounded, i.e., there exists a sequence of points $y^i \in C$ such that $\|y^i\| \to \infty$. Let $x \in C$ be any point and consider the set of unit norm vectors $r^i = \frac{y^i - x}{\|y^i - x\|}$. Since this is a sequence of unit norm vectors, by Theorem 1.10, there is a convergent subsequence $\{r^{i_k}\}_{k=1}^{\infty}$ converging to $r$, also with unit norm. We claim that $r$ is a recession direction, giving a contradiction to $\mathrm{rec}(C) = \{0\}$. To see this, for any $\lambda \ge 0$, let $N \in \mathbb{N}$ be such that $\|y^{i_k} - x\| > \lambda$ for all $k \ge N$. We now observe that

$$x + \lambda r^{i_k} = \frac{\|y^{i_k} - x\| - \lambda}{\|y^{i_k} - x\|}\, x + \frac{\lambda}{\|y^{i_k} - x\|}\big(x + r^{i_k} \|y^{i_k} - x\|\big) = \frac{\|y^{i_k} - x\| - \lambda}{\|y^{i_k} - x\|}\, x + \frac{\lambda}{\|y^{i_k} - x\|}\, y^{i_k} \in C$$

for all $k \ge N$. Letting $k \to \infty$, since $C$ is closed, we obtain that $x + \lambda r = \lim_{k \to \infty} x + \lambda r^{i_k} \in C$.

We next consider closed convex sets whose lineality space is $\{0\}$.

Definition 2.48. If $\mathrm{lin}(C) = \{0\}$ then $C$ is called pointed.

The main result about pointed closed convex sets says that you can decompose them into convex combinations of extreme points and recession directions.

Theorem 2.49. If $C$ is a closed, convex set that is pointed, then $C = \mathrm{conv}(\mathrm{ext}(C)) + \mathrm{rec}(C)$.


◦ ◦ 526 2. D is full-dimensional, i.e., dim(D ) = d.

527 3. 0 is an exposed face of D.

528 4. There exists a compact, convex subset B D 0 such that every d D 0 can be uniquely ⊂ \{ } ∈ \{ } 529 written in the form d = λb, where b B and λ > 0. In particular, D = cone(B). ∈ ◦ ◦ 530 Proof. 1. 2. If D is not full-dimensional, then aff(D ) is a linear space of dimension strictly less than d, ⇒ ◦ ⊥ ◦ ◦ 531 and so aff(D ) = 0 . Since D aff(D ), using Problem3 from “HW for Week III”, and property 2. 6 { } ⊆ ◦ ⊥ ◦ ◦ ◦ ◦ ◦ ⊥ 532 and 4. in Proposition 2.30, we obtain that aff(D ) = aff(D ) (D ) = D. Since aff(D ) is a linear ◦ ⊥ ⊆ 533 space, this implies that aff(D ) lin(D), contradicting the assumption that D is pointed. ⊆ ◦ ◦ ◦ 534 2. 3. By Problem5 from “HW for Week II”, int( D ) = . Choose any y int(D ). Since D = y d ⇒ 6 ∅ ∈ { ∈ 535 R : x, y 0 x D , using Problem3 from “HW for Week II”, we obtain that y, x < 0 for every h i ≤ ∀ ∈ } h i 536 x D 0 . This shows that the exposed face induced by (y, 0) is exactly 0 . ∈ \{ } { } d 537 3. 4. Let 0 be an exposed face induced by (y, 0). Define B := D x R : y, x = 1 . It is clear ⇒ ∩ { ∈ h i − } 538 from the definition that 0 B. Since it is the intersection of a convex cone and an affine set, S is also convex. 6∈ 539 We now show that B is compact. It is the intersection of closed sets, so it is closed. By Theorem 2.47, it 540 suffices to show that rec(B) = 0 . Suppose to the contrary that there exists r rec(B) 0 . Consider any { } ∈ \{ } 541 point x¯ B. Since y, x¯ = 1 and y, x¯ + r = 1, we obtain that y, r = 0. Now, by Proposition 2.42, ∈ h i − h i − h i 542 we obtain that 0 + r D, i.e., r D. But then y, r = 0 contradicts the fact that 0 is an exposed face of ∈ ∈ h i 543 D induced by (y, 0). We next consider any d D 0 . By our assumption, y, d < 0. Thus, setting b = d , we obtain ∈ \{ } h i |hy,di| that y, b = 1 and thus, b B. To show uniqueness, consider b1, b2 B both satisfying the condition. Thish means,i b−2 = λb1 for some∈ λ > 0. Therefore, ∈ λ y, b1 = y, b2 = 1 = y, b1 h i h i − h i 544 showing that λ = 1. This shows uniqueness of b. 545 4. 1. If D is not pointed, then there exists x D 0 such that x D. Moreover, there exists λ1 > 0 ⇒ 1 2 ∈ \{ } − ∈ λ2 1 λ1 2 546 such that x = λ1x B and λ2 > 0 such x = λ2( x) B. Since B is convex, x + x = 0 is ∈ − ∈ λ1+λ2 λ1+λ2 547 in B, contradicting the assumption.

18 548 Definition 2.51. For any closed convex cone D, any subset B D satisfying condition 4. of Proposition 2.50 ⊆ 549 is called a base of D.

550 The proof of Proposition 2.50 also shows the following.

551 Corollary 2.52. Let D be a closed, convex cone. D is pointed if and only if there exists a hyperplane H 552 such that H D is a base of D. ∩ 553 Remark 2.53. In fact, it can be shown that any base of a pointed cone D must be of the form H D for ∩ 554 some hyperplane H. We skip the proof of this fact from these notes.

555 Definition 2.54. Let D be a closed, convex cone. An edge of D is called an extreme ray of D. We say that 556 r D spans an extreme ray if λr : λ 0 is an extreme ray. The set of extreme rays of D will be denoted ∈ { ≥ } 557 by extr(D).

558 Proposition 2.55. Let D be a closed, convex cone and r D 0 . r spans an extreme ray of D if and 1 2 r1+r2 ∈ \{ } 1 2 559 only if for all r , r D such that r = , there exist λ1, λ2 0 such that r = λ1r and r = λ2r. ∈ 2 ≥ 560 Proof. Left as an exercise.

561 Here is an analogue of the Krein-Milman Theorem (Theorem 2.41) for closed convex cones.

562 Theorem 2.56. If D is a pointed, closed, convex cone, then D = cone(extr(D)).

563 Proof. By Proposition 2.50, there exists a base B for D. Since B is compact, B = conv(ext(B)) by Theo- 564 rem 2.41. It is easy to verify that the ray spanned by each r ext(B) is an extreme ray for D, and vice versa, ∈ 565 any extreme ray of D is spanned by some r ext(B). Moreover, using the fact that B = conv(ext(B)), it ∈ 566 immediately follows that D = cone(extr(D)).

567 Slight abuse of notation. For a closed convex set C, we will also use extr(C) to denote extr(rec(C)). We 568 will also say these are the extreme rays of C. 569 Now we can write a sharper version of Theorem 2.49:

570 Corollary 2.57. If C is a closed, convex set that is pointed, then C = conv(ext(C)) + cone(extr(C)).

571 Thus, to describe a pointed closed convex set, we just need to specify its extreme points and its extreme 572 rays. We finally deal with general closed convex sets that are not necessarily pointed. The idea is that the 573 lineality space can be “factored out”. ⊥ 574 Lemma 2.58. If C is a closed convex set, then C lin(C) is pointed. ∩ ⊥ 575 Proof. Define Cˆ = C lin(C) . Cˆ is closed because it is the intersection of two closed sets. By Corollary 2.46, ∩ 576 rec(Cˆ) rec(C). Therefore, lin(Cˆ) = rec(Cˆ) rec(Cˆ) rec(C) rec(C) = lin(C). By the same reasoning, ⊆ ⊥ ⊥ ∩− ⊥⊆ ∩− 577 lin(Cˆ) lin(lin(C) ) = lin(C) . Since lin(C) lin(C) = 0 , we obtain that lin(Cˆ) = 0 . ⊆ ∩ { } { } Theorem 2.59. Let C be a closed convex set and let Cˆ = C lin(C)⊥. Then ∩ C = conv(ext(Cˆ)) + cone(extr(Cˆ)) + lin(C).

0 578 Proof. We first observe that C = Cˆ + lin(C). Indeed, for any x C, we can express x = x + r where 0 ⊥ ⊥ n ∈ 0 579 x lin(C) and r lin(C) (since lin(C) + lin(C) = R ). We also know that x = x r C because ∈ 0∈ − ∈ 580 r lin(C). Thus, x Cˆ and we are done. Cˆ is pointed by Lemma 2.58 and applying Corollary 2.57 gives ∈ ∈ 581 the desired result.

582 Thus, a general closed convex set C can be specified by giving a set of generators for its lineality space ⊥ 583 lin(C), and the extreme points and vectors spanning the extreme rays of the set C lin(C) . In Section 2.5, ∩ 584 we will see that polyhedra are precisely those convex sets C that have a finite number of extreme points and ⊥ 585 extreme rays for C lin(C) . So we see that polyhedra are especially easy to describe intrinsically: simply ∩ 586 specify the finite list of extreme points, vectors spanning the extreme rays and a finite list of generators of 587 lin(C).

19 588 2.3.3 A remark about extrinsic and intrinsic descriptions

589 You may have already observed that although a closed convex set can be represented as the intersection of 2 590 halfspaces, such a representation is not unique. For example, consider the circle in R . You can represent 591 it by intersecting all its tangent halfspaces. On the other hand, if you throw away any finite subset of 592 these halfspaces, you still get the same set. In fact, there is a representation which uses only countably 593 many halfspaces. Thus, the same convex set can have many different representations as the intersection of 594 halfspaces. Moreover, there is usually no way to choose a “canonical” representation, i.e., there is no set of 595 representing halfspaces such that any representation will always include this “canonical” set of halfspaces 596 (this situation will get a little better with polyhedra). On the other hand, the intrinsic representation for a closed convex set is more “canonical”. To begin with, consider the compact case. We express a compact C as conv(ext(C)). We cannot remove any extreme point, because it cannot be represented as the convex combination of other points. Thus, this representation is unique/minimal/canonical in the sense that for any X such that C = conv(X), we must have ext(C) X. With closed, convex sets that have a nontrivial recession cone, the situation is a bit more subtle. First,⊆ there is more flexibility in choosing the representation because one can choose a different set of vectors to span the extreme rays. One might think that this is just a scaling issue and the following result holds: if C is a pointed, closed, convex set, and we consider any “intrinsic” representation

C = conv(E) + cone(R),

d 597 for some sets E,R R , then we must have ⊆ 598 (i) ext(C) E and ⊆ 599 (ii) for every r that spans an extreme ray of rec(C), there must be some nonnegative scaling of r present 600 in R.

601 While the above holds for polyhedra and many other closed, convex sets, it is not true in general. We 602 leave it as an exercise to find a closed, convex set that violates the above claim.

2.4 Combinatorial theorems: Helly-Radon-Carathéodory

We will discuss three foundational results that expose combinatorial aspects of convexity. We begin with Radon's Theorem.

Theorem 2.60 (Radon's Theorem). Let X ⊆ R^d be a set of size at least d + 2. Then X can be partitioned as X = X_1 ⊎ X_2 into sets X_1, X_2 such that conv(X_1) ∩ conv(X_2) ≠ ∅.

Proof. Since we can have at most d + 1 affinely independent points in R^d (see condition 2. in Proposition 2.15), and X has at least d + 2 points, there exists a subset {x^1, . . . , x^k} ⊆ X such that {x^1, . . . , x^k} is affinely dependent. By characterization 5. in Proposition 2.15, there exist multipliers λ_1, . . . , λ_k ∈ R, not all zero, such that λ_1 + . . . + λ_k = 0 and λ_1 x^1 + . . . + λ_k x^k = 0. Define P := {i : λ_i ≥ 0} and N := {i : λ_i < 0}. Since the λ_i's are not all zero and λ_1 + . . . + λ_k = 0, P and N both contain indices whose corresponding multiplier is nonzero. Moreover, Σ_{i∈P} λ_i = Σ_{i∈N} (−λ_i) since λ_1 + . . . + λ_k = 0, and Σ_{j∈P} λ_j x^j = Σ_{j∈N} (−λ_j) x^j since λ_1 x^1 + . . . + λ_k x^k = 0. Thus, we obtain that

    y := Σ_{j∈P} ( λ_j / Σ_{i∈P} λ_i ) x^j = Σ_{j∈N} ( (−λ_j) / Σ_{i∈N} (−λ_i) ) x^j,

showing that y ∈ conv(X_P) ∩ conv(X_N), where X_P = {x^i : i ∈ P} and X_N = {x^i : i ∈ N}. One can now simply define X_1 = X_P and X_2 = X \ X_P.
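The construction in this proof is directly computable. Below is a small numerical sketch (not part of the notes; the helper name radon_partition is ours) that finds a Radon partition by extracting an affine dependence from a null vector via numpy's SVD.

```python
import numpy as np

def radon_partition(X):
    """Given X of shape (k, d) with k >= d + 2 points, return index sets
    (P, N) and a point y in conv(X[P]) and conv(X[N]), following the
    proof of Radon's theorem."""
    k, d = X.shape
    # Affine dependence: lambda != 0 with sum(lambda) = 0 and sum_i lambda_i x^i = 0.
    # Stacking a row of ones turns both conditions into one homogeneous system.
    M = np.vstack([X.T, np.ones(k)])          # shape (d + 1, k), and k > d + 1
    _, _, Vt = np.linalg.svd(M)
    lam = Vt[-1]                              # a null vector of M
    P = [i for i in range(k) if lam[i] >= 0]
    N = [i for i in range(k) if lam[i] < 0]
    y = sum(lam[i] * X[i] for i in P) / sum(lam[i] for i in P)
    return P, N, y

# Any 4 points in R^2 admit a Radon partition; for the unit square the
# two diagonals cross at (0.5, 0.5).
print(radon_partition(np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])))
```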


An application to learning theory: VC-dimension of halfspaces. An important concept in learning theory is the Vapnik–Červonenkis (VC) dimension of a family of subsets [5]. Let 𝓕 be a family of subsets of R^d (possibly infinite).

Definition 2.61. A set X ⊆ R^d is said to be shattered by 𝓕 if for every subset X′ ⊆ X, there exists a set F ∈ 𝓕 such that X′ = F ∩ X. The VC-dimension of 𝓕 is defined as

    sup{m ∈ N : there exists a set X ⊆ R^d of size m that can be shattered by 𝓕}.

Proposition 2.62. Let 𝓕 be the family of halfspaces in R^d. The VC-dimension of 𝓕 is d + 1.

Proof. For any m ≤ d + 1, let X be a set of m affinely independent points. Now, for any subset X′ ⊆ X, we claim that conv(X′) ∩ conv(X \ X′) = ∅ (Verify!!). When we study polyhedra in Section 2.5, we will see that conv(X′) and conv(X \ X′) are compact convex sets. By Problem 7 from "HW for Week II", there exists a separating hyperplane for these two sets, giving a halfspace H such that X′ = H ∩ X.

Now let m ≥ d + 2 and consider any set X with m points. By Theorem 2.60, one can partition X = X_1 ⊎ X_2 such that there exists y ∈ conv(X_1) ∩ conv(X_2). Let X′ = X_1. Consider any halfspace H such that X′ ⊆ H. Since H is convex, y ∈ H. By Problem 11 in "HW for Week IV", we obtain that H ∩ X_2 ≠ ∅. Thus, X cannot be shattered by the family of halfspaces in R^d.

See Chapters 12 and 13 of [2] for more on VC dimension.

An extremely important corollary of Radon's Theorem is known as Helly's theorem, concerning the intersection of a family of convex sets.

Theorem 2.63 (Helly's Theorem). Let X_1, . . . , X_k ⊆ R^d be a family of convex sets. If X_1 ∩ . . . ∩ X_k = ∅, then there is a subfamily X_{i_1}, . . . , X_{i_m} for some m ≤ d + 1, with i_h ∈ {1, . . . , k} for each h = 1, . . . , m, such that X_{i_1} ∩ . . . ∩ X_{i_m} = ∅. Thus, there is a subfamily of size at most d + 1 that already certifies the empty intersection.

Proof. We prove by induction on k. In the base case k ≤ d + 1, we are done. Assume we know the statement to be true for all families of convex sets with k̄ elements, for some k̄ ≥ d + 1. Consider a family of k̄ + 1 convex sets X_1, X_2, . . . , X_{k̄+1}. Define a new family C_1, . . . , C_{k̄}, where C_i = X_i if i ≤ k̄ − 1 and C_{k̄} = X_{k̄} ∩ X_{k̄+1}. Since ∅ = X_1 ∩ . . . ∩ X_{k̄+1} = C_1 ∩ . . . ∩ C_{k̄}, we can use the induction hypothesis on this new family and obtain a subfamily C_{i_1}, . . . , C_{i_m} such that C_{i_1} ∩ . . . ∩ C_{i_m} = ∅ and m ≤ d + 1. If m ≤ d or none of the C_{i_h}, h = 1, . . . , m, equals C_{k̄}, then we are done. So we assume that m = d + 1 and C_{i_m} = C_{k̄} = X_{k̄} ∩ X_{k̄+1}.

To simplify notation, let us relabel everything and define D_h := C_{i_h} = X_{i_h} for h = 1, . . . , d, and D_{d+1} = X_{k̄}, D_{d+2} = X_{k̄+1}. We thus know that D_1 ∩ . . . ∩ D_{d+2} = ∅. We may assume that each subfamily of d + 1 sets from D_1, . . . , D_{d+2} has a nonempty intersection, because otherwise we are done. Let these common intersection points be

    x^i ∈ ∩_{h≠i} D_h,   i = 1, . . . , d + 2.

By Theorem 2.60, there exists a partition {1, . . . , d + 2} = L ⊎ R such that there exists y ∈ conv({x^i}_{i∈L}) ∩ conv({x^i}_{i∈R}). Now, we claim that y ∈ D_h for each h ∈ {1, . . . , d + 2}, arriving at a contradiction to D_1 ∩ . . . ∩ D_{d+2} = ∅. Indeed, consider any h* ∈ {1, . . . , d + 2}. Either L or R does not contain it. Suppose L does not contain it. Then for each i ∈ L, x^i ∈ ∩_{h≠i} D_h ⊆ D_{h*} because i ≠ h*. Since D_{h*} is convex, this shows that y ∈ conv({x^i}_{i∈L}) ⊆ D_{h*}.

A corollary for infinite families is often useful, as long as we assume compactness for the elements in the family.
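As a quick illustration in dimension d = 1: the intervals [0, 2], [1, 3] and [2.5, 4] have empty intersection, and the subfamily {[0, 2], [2.5, 4]} of size 2 = d + 1 already certifies this. Equivalently, if every pair in a finite family of intervals intersects, then the whole family has a common point.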

Corollary 2.64. Let 𝒳 be a (possibly infinite) family of compact, convex sets. If ∩_{X∈𝒳} X = ∅, then there is a subfamily X_{i_1}, . . . , X_{i_m} ∈ 𝒳 for some m ≤ d + 1 such that X_{i_1} ∩ . . . ∩ X_{i_m} = ∅. Thus, there is a subfamily of size at most d + 1 that already certifies the empty intersection.

Proof. By a standard result in topology, if the intersection of an infinite family of compact sets is empty, then there is a finite subfamily whose intersection is also empty. One can now apply Theorem 2.63 to this finite subfamily and obtain a subfamily of size at most d + 1.

Application to centerpoints. Helly's theorem can be used to extend the notion of a median to distributions on R^d with d ≥ 2. Let μ be any probability distribution on R^d. For any point x ∈ R^d, define

    f_μ(x) := inf{μ(H) : H halfspace such that x ∈ H}.

Define the centerpoint or median with respect to μ as any x in the set C_μ := argmax_{x∈R^d} f_μ(x). It can be shown that this set is nonempty for all probability distributions μ. For d = 1, this gives the standard notion of a median, and one can show that for any probability distribution μ on R, f_μ(x) = 1/2 for any centerpoint/median x. In higher dimensions, unfortunately, one cannot guarantee a value of 1/2. In fact, given the uniform distribution on a triangle in R^2, one can show that the centroid x of the triangle is the unique centerpoint, and has value f_μ(x) = 4/9 < 1/2. So can one guarantee any lower bound? Or can we find distributions whose centerpoint values are arbitrarily low? Grünbaum [4] proved a lower bound for the value of a centerpoint, irrespective of the distribution. The only assumption is a mild regularity condition on the distribution: for any halfspace H and any δ > 0, there exists a closed halfspace H′ ⊆ R^d \ H such that μ(H′) ≥ μ(R^d \ H) − δ.

Theorem 2.65. Let μ be any probability distribution on R^d satisfying the above assumption. There exists a point x ∈ R^d such that f_μ(x) ≥ 1/(d+1).

Proof. Given any α ∈ R, let 𝓗_α be the set of all halfspaces H such that μ(H) ≥ α. It is not hard to check that if α < 1, then D_α := ∩_{H∈𝓗_α} H is a compact, convex set. Indeed, for any coordinate indexed by i = 1, . . . , d, there must exist some δ_1^i, δ_2^i such that the halfspaces H_1^i := {x ∈ R^d : x_i ≤ δ_1^i} and H_2^i := {x ∈ R^d : x_i ≥ δ_2^i} satisfy μ(H_1^i) ≥ α and μ(H_2^i) ≥ α. Thus, D_α is contained in the box {x ∈ R^d : δ_2^i ≤ x_i ≤ δ_1^i, i = 1, . . . , d}.

We now claim that for any x ∈ D_α, we have f_μ(x) ≥ 1 − α. To see this, consider any halfspace H = {y ∈ R^d : ⟨a, y⟩ ≤ δ} that contains x ∈ D_α. We will show that μ(R^d \ H) ≤ α, which gives μ(H) ≥ 1 − α and hence the claim. Indeed, if μ(R^d \ H) > α, then by the regularity assumption some closed halfspace H′ contained in R^d \ H also has mass at least α. This would imply that H′ contains all of D_α and, therefore, x ∈ H′. But since H′ ⊆ R^d \ H, this contradicts the fact that x ∈ H.

Therefore, it suffices to show that D_{d/(d+1)+ε} is nonempty for every ε > 0, because using compactness and the fact that D_α ⊆ D_β when α ≤ β, we would have that ∩_{ε>0} D_{d/(d+1)+ε} is nonempty, and any point x in this set will satisfy f_μ(x) ≥ 1/(d+1).

Now let us fix an ε > 0. We want to show that D_{d/(d+1)+ε} is nonempty. By standard measure-theoretic arguments, there exists a ball B centered at the origin such that μ(B) ≥ 1 − ε/2 and D_{d/(d+1)+ε} ⊆ B, because D_α is compact, as observed earlier.

Define 𝒞 = {B ∩ H : H is a closed halfspace with μ(H) ≥ d/(d+1) + ε}. Thus, 𝒞 is a family of compact sets such that D_{d/(d+1)+ε} = ∩{C : C ∈ 𝒞}. For any subset {C_1, . . . , C_{d+1}} ⊆ 𝒞 of size d + 1, we claim

    μ(C_1^c ∪ . . . ∪ C_{d+1}^c) ≤ 1 − (d + 1)ε/2.

This is because each C_i^c = B^c ∪ H_i^c for some halfspace H_i satisfying μ(H_i^c) ≤ 1/(d+1) − ε. Since μ(B^c) ≤ ε/2, we obtain that μ(C_i^c) ≤ 1/(d+1) − ε/2. Therefore,

    μ(C_1 ∩ . . . ∩ C_{d+1}) = 1 − μ(C_1^c ∪ . . . ∪ C_{d+1}^c) ≥ 1 − (1 − (d + 1)ε/2) = (d + 1)ε/2 > 0.

This implies that C_1 ∩ . . . ∩ C_{d+1} ≠ ∅. By Corollary 2.64, ∩{C : C ∈ 𝒞} is nonempty and so D_{d/(d+1)+ε} is nonempty.

Another useful theorem is Carathéodory's theorem, which says that if a point x can be expressed as a convex combination of points from some set X ⊆ R^d, then there is a subset X′ ⊆ X of size at most d + 1 such that x ∈ conv(X′). We state the conical version first, and then the convex version.

Theorem 2.66 (Carathéodory's Theorem – cone version). Let X ⊆ R^d (not necessarily convex) and let x ∈ cone(X). There exists a subset X′ ⊆ X such that X′ is linearly independent (and thus, |X′| ≤ d), and x ∈ cone(X′).

Proof. Since x ∈ cone(X), by Theorem 2.11, we can find a finite set {x^1, . . . , x^k} ⊆ X such that x ∈ cone({x^1, . . . , x^k}). Choose a minimal such set, i.e., there is no strict subset of {x^1, . . . , x^k} whose conical hull contains x. This implies that x = λ_1 x^1 + . . . + λ_k x^k for some λ_i > 0 for each i = 1, . . . , k. We claim that x^1, . . . , x^k are linearly independent. Suppose to the contrary that there exist multipliers γ_1, . . . , γ_k ∈ R, not all zero, such that γ_1 x^1 + . . . + γ_k x^k = 0. By changing the signs of the γ_i's if necessary, we may assume that there exists j ∈ {1, . . . , k} such that γ_j > 0. Define

    θ = min_{j : γ_j > 0} λ_j/γ_j,   λ_i′ = λ_i − θγ_i  for all i = 1, . . . , k.

Observe that λ_i′ ≥ 0 for all i = 1, . . . , k and

    λ_1′ x^1 + . . . + λ_k′ x^k = λ_1 x^1 + . . . + λ_k x^k − θ(γ_1 x^1 + . . . + γ_k x^k) = λ_1 x^1 + . . . + λ_k x^k = x.

However, at least one of the λ_i′'s is zero (corresponding to an index in argmin_{j : γ_j > 0} λ_j/γ_j), contradicting the minimal choice of {x^1, . . . , x^k}.

Theorem 2.67 (Carathéodory's Theorem – convex version). Let X ⊆ R^d (not necessarily convex) and let x ∈ conv(X). There exists a subset X′ ⊆ X such that X′ is affinely independent (and thus, |X′| ≤ d + 1), and x ∈ conv(X′).

Proof. Consider the set Y ⊆ R^{d+1} defined by Y := {(y, 1) : y ∈ X}. Now, x ∈ conv(X) is equivalent to saying that (x, 1) ∈ cone(Y). We get the desired result by applying Theorem 2.66 and condition 4. of Proposition 2.15.
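The reduction step in the proof of Theorem 2.66 is itself an algorithm: repeatedly find a linear dependence among the active points and drive one weight to zero. Here is a hedged numpy sketch (the helper is ours, with a tolerance standing in for exact arithmetic):

```python
import numpy as np

def caratheodory_cone(X, lam, tol=1e-9):
    """Given points X of shape (k, d) and weights lam >= 0 representing
    x = lam @ X, shrink the support of lam until the active points are
    linearly independent, exactly as in the proof of Theorem 2.66."""
    lam = lam.astype(float).copy()
    while True:
        active = np.where(lam > tol)[0]
        A = X[active].T                      # columns are the active points
        if np.linalg.matrix_rank(A) == len(active):
            return lam                       # linearly independent: done
        _, _, Vt = np.linalg.svd(A)          # find gamma with A @ gamma = 0
        gamma = Vt[-1]
        if gamma.max() <= 0:
            gamma = -gamma                   # ensure some gamma_j > 0
        pos = gamma > tol
        theta = np.min(lam[active][pos] / gamma[pos])
        lam[active] -= theta * gamma         # one active weight drops to 0

X = np.array([[1., 0.], [0., 1.], [1., 1.]])
lam = caratheodory_cone(X, np.array([1., 1., 1.]))  # x = (2, 2)
print(lam, lam @ X)                                 # support size <= 2
```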

We can finally furnish the proof of Lemma 2.26.

Proof of Lemma 2.26. Consider a convergent sequence {x^i}_{i∈N} ⊆ cone({a^1, . . . , a^n}) converging to x ∈ R^d. By Theorem 2.66, every x^i is in the conical hull of some linearly independent subset of {a^1, . . . , a^n}. Since there are only finitely many linearly independent subsets of {a^1, . . . , a^n}, the conical hull of one of these subsets contains infinitely many elements of the sequence {x^i}_{i∈N}. Thus, after passing to that subsequence, we may assume that {x^i}_{i∈N} ⊆ cone({ā^1, . . . , ā^k}) where ā^1, . . . , ā^k are linearly independent. For each x^i, there exists λ^i ∈ R_+^k such that x^i = λ_1^i ā^1 + . . . + λ_k^i ā^k. If we denote by A ∈ R^{d×k} the matrix whose columns are ā^1, . . . , ā^k, then x^i = Aλ^i and λ^i = (AᵀA)^{-1}Aᵀx^i for every i ∈ N (A has linearly independent columns, so AᵀA is invertible and (AᵀA)^{-1}Aᵀ is a left inverse of A). Since {x^i}_{i∈N} is a convergent sequence, it is also a bounded set. This implies that {λ^i}_{i∈N} is a bounded set in R_+^k, because it is the image of a bounded set under the linear (and therefore continuous) map (AᵀA)^{-1}Aᵀ. Thus, by Theorem 1.10 there is a convergent subsequence λ^{i_k} → λ ∈ R_+^k. Taking limits,

    x = lim_{k→∞} x^{i_k} = lim_{k→∞} Aλ^{i_k} = Aλ.

Since λ ∈ R_+^k, we find that x ∈ cone({ā^1, . . . , ā^k}) ⊆ cone({a^1, . . . , a^n}).

Here is another result that proves handy in many situations.

Theorem 2.68. Let X ⊆ R^d be a compact set (not necessarily convex). Then conv(X) is compact.

Proof. By Theorem 2.67, every x ∈ conv(X) is the convex combination of some d + 1 points in X. Define the function f : R^d × ⋯ × R^d (d + 1 times) × R^{d+1} → R^d as follows:

    f(y^1, . . . , y^{d+1}, λ) = λ_1 y^1 + . . . + λ_{d+1} y^{d+1}.

It is easily verified that f is a continuous function (each coordinate of f(·) is a bilinear quadratic function of the input). We now observe that conv(X) is the image of X × ⋯ × X (d + 1 times) × Δ^{d+1} under f, where

    Δ^{d+1} := {λ ∈ R_+^{d+1} : λ_1 + . . . + λ_{d+1} = 1}.

Since X and Δ^{d+1} are compact sets, we obtain the result by applying Theorem 1.12.

2.5 Polyhedra

Recall that a polyhedron is any convex set that can be obtained by intersecting a finite number of halfspaces (Definition 2.22). Polyhedra, in a sense, are the nicest convex sets to work with because of this finiteness property. For example, our first result will be that a polyhedron can have only finitely many extreme points.

Even so, one thing to keep in mind is that the same polyhedron can be described as the intersection of two completely different finite families of halfspaces. This brings into sharp focus the non-uniqueness of extrinsic descriptions discussed in Section 2.3.3. Consider the following two systems of halfspaces/inequalities.

    −x_1 ≤ 0                  2x_1 + x_2 ≤ 0
    x_1 + x_2 ≤ 0             −x_1 + x_2 ≤ 0
    x_1 − x_2 ≤ 0             x_1 − 2x_2 ≤ 0
    −x_1 − x_2 − x_3 ≤ 0      x_1 − 2x_3 ≤ 0
    x_2 + x_3 ≤ 5             2x_1 + x_2 + 2x_3 ≤ 10

Both these systems describe the same polyhedron P = conv{(0, 0, 0), (0, 0, 5)} in R^3. However, if a polyhedron is given by its list of extreme points and extreme rays, this ambiguity disappears. Moreover, having these two alternate extrinsic/intrinsic descriptions is very useful, as many properties become easier to see in one description compared to the other. Let us, therefore, start by making some important observations about extreme points and extreme rays of a polyhedron.
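A quick numerical sanity check (not part of the notes; agreement on sampled points is evidence, not a proof of set equality) that the two systems above describe the same segment:

```python
import numpy as np

A1 = np.array([[-1, 0, 0], [1, 1, 0], [1, -1, 0], [-1, -1, -1], [0, 1, 1]], float)
b1 = np.array([0, 0, 0, 0, 5], float)
A2 = np.array([[2, 1, 0], [-1, 1, 0], [1, -2, 0], [1, 0, -2], [2, 1, 2]], float)
b2 = np.array([0, 0, 0, 0, 10], float)

def member(A, b, x, tol=1e-9):
    return bool(np.all(A @ x <= b + tol))

# Both endpoints of the segment P satisfy both systems...
for v in [np.zeros(3), np.array([0., 0., 5.])]:
    assert member(A1, b1, v) and member(A2, b2, v)

# ...and the two systems agree on membership for random points.
rng = np.random.default_rng(0)
assert all(member(A1, b1, x) == member(A2, b2, x)
           for x in rng.uniform(-3, 8, size=(10000, 3)))
```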

Definition 2.69. Let P be a polyhedron. Let A ∈ R^{m×d} with rows a^1, . . . , a^m and b ∈ R^m be such that P = {x ∈ R^d : Ax ≤ b}. Given any x ∈ P, define tight(x, A, b) := {i : ⟨a^i, x⟩ = b_i}. For brevity, when A and b are clear from the context, we will shorten this to tight(x). We also use the notation A_{tight(x)} to denote the submatrix formed by taking the rows of A indexed by tight(x). Similarly, b_{tight(x)} will denote the subvector of b indexed by tight(x).

Theorem 2.70. Let P = {x ∈ R^d : Ax ≤ b} be a polyhedron given by A ∈ R^{m×d} and b ∈ R^m. Let x ∈ P. Then x is an extreme point of P if and only if A_{tight(x)} has rank equal to d, i.e., the rows of A indexed by tight(x) span R^d.

Proof. (⇐) Suppose A_{tight(x)} has rank equal to d; we want to establish that x is an extreme point. Consider any x^1, x^2 ∈ P such that x = (x^1 + x^2)/2. For each i ∈ tight(x), ⟨a^i, x^1⟩ ≤ b_i and similarly ⟨a^i, x^2⟩ ≤ b_i. Now, we observe that

    b_i = ⟨a^i, x⟩ = ⟨a^i, x^1⟩/2 + ⟨a^i, x^2⟩/2 ≤ b_i.

Thus, the inequality must be an equality. Therefore, for each i ∈ tight(x), ⟨a^i, x^1⟩ = b_i and similarly ⟨a^i, x^2⟩ = b_i. In other words, we have that A_{tight(x)}x = b_{tight(x)} and A_{tight(x)}x^j = b_{tight(x)} for j = 1, 2. Since the rank of A_{tight(x)} is d, this system of equations has a unique solution. This means x = x^1 = x^2. This shows that x is extreme.

(⇒) Suppose to the contrary that x is extreme and A_{tight(x)} has rank strictly less than d (note that its rank is at most d because A has d columns). Thus, there exists a non-zero r ∈ R^d such that A_{tight(x)}r = 0. Define

    ε := min{ min_{j : ⟨a^j, r⟩ > 0} (b_j − ⟨a^j, x⟩)/⟨a^j, r⟩ ,  min_{j : ⟨a^j, r⟩ < 0} (b_j − ⟨a^j, x⟩)/(−⟨a^j, r⟩) }.

Note that ε > 0 because whenever ⟨a^j, r⟩ ≠ 0 we have that j ∉ tight(x), and thus all the numerators are strictly positive. We now claim that x^1 := x + εr ∈ P and x^2 := x − εr ∈ P. This would show that x = (x^1 + x^2)/2 with x^1 ≠ x^2 (because r ≠ 0 and ε > 0), contradicting extremality.

To finish the proof, we need to check that Ax^1 ≤ b and Ax^2 ≤ b. We will do the calculations for x^1 – the calculations for x^2 are similar. Consider any j ∈ {1, . . . , m}. If j ∈ tight(x), then since A_{tight(x)}r = 0, we obtain that ⟨a^j, x^1⟩ = ⟨a^j, x⟩ + ε⟨a^j, r⟩ = ⟨a^j, x⟩ = b_j. If j ∉ tight(x), then we consider two cases (when ⟨a^j, r⟩ = 0, the inequality is unaffected):

Case 1: ⟨a^j, r⟩ > 0. Since ε ≤ (b_j − ⟨a^j, x⟩)/⟨a^j, r⟩, we obtain that ⟨a^j, x^1⟩ = ⟨a^j, x⟩ + ε⟨a^j, r⟩ ≤ b_j.

Case 2: ⟨a^j, r⟩ < 0. In this case, ⟨a^j, x^1⟩ = ⟨a^j, x⟩ + ε⟨a^j, r⟩ < b_j, simply because ε > 0 and ⟨a^j, r⟩ < 0.

This immediately gives the following.

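Theorem 2.70 is directly algorithmic: to test whether a point of P is extreme, collect the tight rows and compute their rank. A minimal numpy sketch (our own helper; a tolerance stands in for exact arithmetic):

```python
import numpy as np

def is_extreme_point(A, b, x, tol=1e-9):
    """Test extremality of x in P = {x : A x <= b} via Theorem 2.70."""
    assert np.all(A @ x <= b + tol), "x must lie in P"
    tight = np.abs(A @ x - b) <= tol            # rows with <a^i, x> = b_i
    return np.linalg.matrix_rank(A[tight]) == A.shape[1]

# For the segment P = conv{(0,0,0), (0,0,5)} above: the vertex is extreme,
# the midpoint is not.
A = np.array([[-1., 0, 0], [1, 1, 0], [1, -1, 0], [-1, -1, -1], [0, 1, 1]])
b = np.array([0., 0, 0, 0, 5])
print(is_extreme_point(A, b, np.zeros(3)))              # True
print(is_extreme_point(A, b, np.array([0., 0., 2.5])))  # False
```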
Corollary 2.71. Any polyhedron P ⊆ R^d has a finite number of extreme points.

Proof. Let A ∈ R^{m×d} and b ∈ R^m be such that P = {x ∈ R^d : Ax ≤ b}. From Theorem 2.70, for any extreme point x, A_{tight(x)} has rank d. There are only finitely many subsets I ⊆ {1, . . . , m} such that the submatrix A_I is of rank d. Moreover, for any I ⊆ {1, . . . , m} such that A_I has rank d and A_I x = b_I has a solution, that solution is unique. This shows that there are only finitely many extreme points.

What about the extreme rays? First we define polyhedral cones.

Definition 2.72. A convex cone that is also a polyhedron is called a polyhedral cone.

Proposition 2.73. Let D ⊆ R^d be a convex cone. D is a polyhedral cone if and only if there exists a matrix A ∈ R^{m×d} for some m ∈ N such that D = {x ∈ R^d : Ax ≤ 0}.

Proof. We simply have to show the forward direction; the reverse is easy. Assume D is a polyhedral cone. Thus, it is a polyhedron and so there exist a matrix A ∈ R^{m×d} and b ∈ R^m for some m ∈ N such that D = {x : Ax ≤ b}. Since D is a closed, convex cone (closed because all polyhedra are closed), rec(D) = D. By Problem 1 in "HW for Week IV", we obtain that D = rec(D) = {x : Ax ≤ 0}.

Problem 1 in "HW for Week IV" also immediately implies the following.

Proposition 2.74. If P is a polyhedron, then rec(P) is a polyhedral cone.

Theorem 2.75. Let D = {x : Ax ≤ 0} be a polyhedral cone and let r ∈ D \ {0}. Then r spans an extreme ray if and only if A_{tight(r)} has rank d − 1.

Proof. (⇐) Let A_{tight(r)} have rows ā^1, . . . , ā^k. Each F_i := D ∩ {x : ⟨ā^i, x⟩ = 0}, for i = 1, . . . , k, is an exposed face of D. By Problem 13 in "HW for Week III", F := ∩_{i=1}^k F_i is a face of D. Since A_{tight(r)} has rank d − 1, the set {x : A_{tight(r)}x = 0} is a 1-dimensional linear subspace. Since F ⊆ {x : A_{tight(r)}x = 0}, F is a 1-dimensional face of D (it cannot be 0-dimensional because it contains 0 and r ≠ 0) and hence an extreme ray. Since r ∈ F, we have that r spans F.

(⇒) Suppose r spans the 1-dimensional face F. Recall that this means that any x ∈ F is a scaling of r. The rank of A_{tight(r)} cannot be d, since then r would be an extreme point of D, and hence r = 0 by Problem 3 in "HW for Week IV". This would contradict that r spans an extreme ray of D. Thus, the rank of A_{tight(r)} is at most d − 1.

If it is strictly less, then consider any r′ ∈ {x : A_{tight(r)}x = 0} that is linearly independent of r – such an r′ exists if the rank of A_{tight(r)} is at most d − 2. Define

    ε := min{ min_{j : ⟨a^j, r′⟩ > 0} (−⟨a^j, r⟩)/⟨a^j, r′⟩ ,  min_{j : ⟨a^j, r′⟩ < 0} (−⟨a^j, r⟩)/(−⟨a^j, r′⟩) }.

Note that ε > 0. We now claim that r^1 := r + εr′ ∈ D and r^2 := r − εr′ ∈ D. This would show that r = (r^1 + r^2)/2. Moreover, since r and r′ are linearly independent, r^1, r^2 are not scalings of r. This contradicts Proposition 2.55.

To finish the proof, we need to check that Ar^1 ≤ 0 and Ar^2 ≤ 0. This is the same set of calculations as in the proof of Theorem 2.70.
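The same rank criterion gives a test for extreme rays; a short sketch in the same hedged style as before:

```python
import numpy as np

def spans_extreme_ray(A, r, tol=1e-9):
    """Test whether r in D \\ {0} spans an extreme ray of D = {x : A x <= 0},
    via the rank-(d - 1) criterion of Theorem 2.75."""
    assert np.all(A @ r <= tol) and np.linalg.norm(r) > tol
    tight = np.abs(A @ r) <= tol
    return np.linalg.matrix_rank(A[tight]) == A.shape[1] - 1

# For D = R^2_+ = {x : -x_1 <= 0, -x_2 <= 0}: e^1 spans an extreme ray,
# the interior direction (1, 1) does not.
A = np.array([[-1., 0.], [0., -1.]])
print(spans_extreme_ray(A, np.array([1., 0.])))  # True
print(spans_extreme_ray(A, np.array([1., 1.])))  # False
```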

Analogous to Corollary 2.71, we have:

Corollary 2.76. Any polyhedral cone D has finitely many extreme rays.

2.5.1 The Minkowski-Weyl Theorem

We can now state the first part of the famous Minkowski-Weyl theorem.

Theorem 2.77 (Minkowski-Weyl Theorem – Part I). Let P ⊆ R^d be a polyhedron. Then there exist finite sets V, R ⊆ R^d such that P = conv(V) + cone(R).

Proof. Let L be a finite set of vectors spanning lin(P) (L is taken as the empty set if lin(P) = {0}). Note that lin(P) = cone(L ∪ −L). Define P̂ = P ∩ lin(P)⊥. By Problem 1 (iii) in "HW for Week VI", P̂ is also a polyhedron. By Corollary 2.71, we obtain that V := ext(P̂) is a finite set. Moreover, by Proposition 2.74, rec(P̂) is a polyhedral cone. By Corollary 2.76, extr(rec(P̂)) is a finite set. Define R = extr(rec(P̂)) ∪ L ∪ −L. By Theorem 2.59, P = conv(ext(P̂)) + cone(rec(P̂)) + lin(P) = conv(V) + cone(R).

We now make an observation about polars.

Lemma 2.78. Let V, R ⊆ R^d be finite sets and let X = conv(V) + cone(R). Then X is a closed, convex set.

Proof. conv(V) is compact by Theorem 2.68, and cone(R) is closed by Lemma 2.26. By Problem 6 in "HW for Week II", we obtain that X = conv(V) + cone(R) is closed. Since the Minkowski sum of convex sets is convex (property 3. in Theorem 2.3), X is also convex.

Theorem 2.79. Let V = {v^1, . . . , v^k} ⊆ R^d and R = {r^1, . . . , r^n} ⊆ R^d with k ≥ 1 and n ≥ 0. Let X = conv(V) + cone(R). Then

    X° = { y ∈ R^d : ⟨v^i, y⟩ ≤ 1 for i = 1, . . . , k,  ⟨r^j, y⟩ ≤ 0 for j = 1, . . . , n }.

Proof. Define X̃ := {y ∈ R^d : ⟨v^i, y⟩ ≤ 1 for i = 1, . . . , k, and ⟨r^j, y⟩ ≤ 0 for j = 1, . . . , n}. We first verify that X̃ ⊆ X°, i.e., ⟨x, y⟩ ≤ 1 for all y ∈ X̃ and x ∈ X. By definition of X, we can write x = Σ_{i=1}^k λ_i v^i + Σ_{j=1}^n μ_j r^j for some λ_i, μ_j ≥ 0 such that Σ_{i=1}^k λ_i = 1. Thus,

    ⟨x, y⟩ = Σ_{i=1}^k λ_i ⟨v^i, y⟩ + Σ_{j=1}^n μ_j ⟨r^j, y⟩ ≤ 1,

since ⟨v^i, y⟩ ≤ 1 for i = 1, . . . , k, and ⟨r^j, y⟩ ≤ 0 for j = 1, . . . , n.

To see that X° ⊆ X̃, consider any y ∈ X°. Since ⟨x, y⟩ ≤ 1 for all x ∈ X, we must have ⟨v^i, y⟩ ≤ 1 for i = 1, . . . , k, since v^i ∈ X. Suppose to the contrary that ⟨r^j, y⟩ > 0 for some j ∈ {1, . . . , n}. Then there exists λ > 0 such that ⟨v^1 + λr^j, y⟩ > 1. But this contradicts the fact that ⟨x, y⟩ ≤ 1 for all x ∈ X, because v^1 + λr^j ∈ X by definition of X. Therefore, ⟨r^j, y⟩ ≤ 0 for j = 1, . . . , n and thus, y ∈ X̃.

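For example, for the square X = conv{(1, 1), (1, −1), (−1, 1), (−1, −1)} ⊆ R^2 (so R = ∅), Theorem 2.79 gives X° = {y : ±y_1 ± y_2 ≤ 1} = {y : |y_1| + |y_2| ≤ 1}. With a recession direction, say X = conv{0} + cone{e^1}, the theorem gives X° = {y : y_1 ≤ 0}.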
This has the following corollary.

Corollary 2.80. Let P be a polyhedron. Then P° is a polyhedron.

Proof. If P = ∅, then P° = R^d, which is a polyhedron. Else, by Theorem 2.77, there exist finite sets V, R ⊆ R^d such that P = conv(V) + cone(R), with V ≠ ∅. By Theorem 2.79, P° is the intersection of finitely many halfspaces, and is thus a polyhedron.

We now prove the converse of Theorem 2.77.

Theorem 2.81 (Minkowski-Weyl Theorem – Part II). Let V, R ⊆ R^d be finite sets and let X = conv(V) + cone(R). Then X ⊆ R^d is a polyhedron.

Proof. The case when X is empty is trivial, so we consider nonempty X. Take any t ∈ X and define X′ = X − t. Now, it is easy to see that X is a polyhedron if and only if X′ is a polyhedron (Verify!!). So it suffices to show that X′ is a polyhedron. Note that X′ = conv(V′) + cone(R) where V′ = V − t, which is a nonempty set because V is nonempty (since X is assumed to be nonempty). By Theorem 2.79, (X′)° is a polyhedron. By Lemma 2.78, X′ is a closed, convex set, and also 0 ∈ X′. Therefore, X′ = ((X′)°)° by condition 2. in Theorem 2.30. Applying Corollary 2.80 with P = (X′)°, we obtain that ((X′)°)° = X′ is a polyhedron.

Collecting Theorems 2.77 and 2.81 together, we have the full-blown Minkowski-Weyl Theorem.

Theorem 2.82 (Minkowski-Weyl Theorem – full version). Let X ⊆ R^d. Then the following are equivalent.

(i) (H-description) There exist m ∈ N, a matrix A ∈ R^{m×d} and a vector b ∈ R^m such that X = {x ∈ R^d : Ax ≤ b}.

(ii) (V-description) There exist finite sets V, R ⊆ R^d such that X = conv(V) + cone(R).

A compact version is often useful.

Theorem 2.83 (Minkowski-Weyl Theorem – compact version). Let X ⊆ R^d. Then X is a bounded polyhedron if and only if X is the convex hull of a finite set of points.

Proof. Left as an exercise.

2.5.2 Valid inequalities and feasibility

Definition 2.84. Let X ⊆ R^d (not necessarily convex) and let a ∈ R^d, δ ∈ R. We say that ⟨a, x⟩ ≤ δ is a valid inequality/halfspace for X if X ⊆ H^−(a, δ).

Consider a polyhedron P = {x ∈ R^d : Ax ≤ b} with A ∈ R^{m×d}, b ∈ R^m. For any vector y ∈ R_+^m, the inequality ⟨yᵀA, x⟩ ≤ yᵀb is clearly a valid inequality for P. The next theorem says that all valid inequalities are of this form, up to a translation.

Theorem 2.85. Let P = {x ∈ R^d : Ax ≤ b} with A ∈ R^{m×d}, b ∈ R^m be a nonempty polyhedron. Let c ∈ R^d, δ ∈ R. Then ⟨c, x⟩ ≤ δ is a valid inequality for P if and only if there exists y ∈ R_+^m such that cᵀ = yᵀA and yᵀb ≤ δ.

Proof. (⇐) Suppose there exists y ∈ R_+^m such that cᵀ = yᵀA and yᵀb ≤ δ. The validity of ⟨c, x⟩ ≤ δ is clear from the following relations for any x ∈ P:

    ⟨c, x⟩ = ⟨yᵀA, x⟩ = yᵀ(Ax) ≤ yᵀb ≤ δ,

where the first inequality follows from the fact that x ∈ P implies Ax ≤ b and y is nonnegative.

(⇒) Let ⟨c, x⟩ ≤ δ be a valid inequality for P. Suppose to the contrary that there is no nonnegative solution to cᵀ = yᵀA and yᵀb ≤ δ. This is equivalent to saying that the following system has no solution in y, λ:

    Aᵀy = c,  bᵀy + λ = δ,  y ≥ 0,  λ ≥ 0.

Setting this up in matrix notation, we have no nonnegative solutions to

    [ Aᵀ  0 ] [ y ]   [ c ]
    [ bᵀ  1 ] [ λ ] = [ δ ].

By Farkas' Lemma (Theorem 2.25), there exists u = (ū, u_{d+1}) ∈ R^{d+1} such that

    ūᵀAᵀ + u_{d+1}bᵀ ≤ 0,  u_{d+1} ≤ 0,  and  ūᵀc + u_{d+1}δ > 0.   (2.1)

We now consider two cases:

Case 1: u_{d+1} = 0. Plugging into (2.1), we obtain ūᵀAᵀ ≤ 0, i.e., Aū ≤ 0, and ⟨c, ū⟩ > 0. By Problem 1 in "HW for Week IV", ū ∈ rec(P). Consider any x ∈ P (we assume P is nonempty). Let μ = (1 + δ − ⟨c, x⟩)/⟨c, ū⟩ > 0. Now x + μū ∈ P since ū ∈ rec(P). However, ⟨c, x + μū⟩ = δ + 1 > δ, contradicting that ⟨c, x⟩ ≤ δ is a valid inequality for P.

Case 2: u_{d+1} < 0. By rearranging (2.1), we have Aū ≤ (−u_{d+1})b and ⟨c, ū⟩ > (−u_{d+1})δ. By setting x = ū/(−u_{d+1}), we obtain that Ax ≤ b and ⟨c, x⟩ > δ, contradicting that ⟨c, x⟩ ≤ δ is a valid inequality for P.
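Finding the multiplier y of Theorem 2.85 is a linear-programming feasibility problem: minimize bᵀy subject to Aᵀy = c, y ≥ 0, and compare the optimum with δ. A hedged sketch using scipy's linprog (the helper and the instance are ours):

```python
import numpy as np
from scipy.optimize import linprog

def validity_certificate(A, b, c, delta, tol=1e-9):
    """Return y >= 0 with y^T A = c^T and y^T b <= delta if one exists
    (Theorem 2.85), else None."""
    res = linprog(c=b, A_eq=A.T, b_eq=c, bounds=[(0, None)] * A.shape[0])
    if res.status == 0 and res.fun <= delta + tol:
        return res.x
    return None

# For P = {x in R^2 : x_1 <= 1, x_2 <= 1}, the valid inequality
# x_1 + x_2 <= 2 has certificate y = (1, 1).
A = np.array([[1., 0.], [0., 1.]]); b = np.array([1., 1.])
print(validity_certificate(A, b, np.array([1., 1.]), 2.0))  # approx. [1. 1.]
```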

Definition 2.86. Let c ∈ R^d and δ_1, δ_2 ∈ R. If δ_1 ≤ δ_2, then the inequality/halfspace ⟨c, x⟩ ≤ δ_1 is said to dominate the inequality/halfspace ⟨c, x⟩ ≤ δ_2.

Remark 2.87. Let P = {x ∈ R^d : Ax ≤ b} with A ∈ R^{m×d}, b ∈ R^m be a polyhedron. Then ⟨c, x⟩ ≤ δ is called a consequence of Ax ≤ b if there exists y ∈ R_+^m such that cᵀ = yᵀA and δ = yᵀb. Another way to think of Theorem 2.85 is that it says the geometric property of being a valid inequality is the same as the algebraic property of being a consequence:

[Alternate version of Theorem 2.85] Let P = {x ∈ R^d : Ax ≤ b} be a nonempty polyhedron. Then ⟨c, x⟩ ≤ δ is a valid inequality for P if and only if ⟨c, x⟩ ≤ δ is dominated by a consequence of Ax ≤ b.

A version of Theorem 2.85 for empty polyhedra is also useful. It can be interpreted as the existence of a short certificate of infeasibility for polyhedra.

Theorem 2.88. Let P = {x ∈ R^d : Ax ≤ b} with A ∈ R^{m×d}, b ∈ R^m be a polyhedron. Then P = ∅ if and only if ⟨0, x⟩ ≤ −1 is a consequence of Ax ≤ b.

Proof. It is easy to see that if ⟨0, x⟩ ≤ −1 is a consequence of Ax ≤ b then P = ∅, because any point that satisfies Ax ≤ b must satisfy every consequence of it, and no point satisfies ⟨0, x⟩ ≤ −1.

So now assume P = ∅. This means that there is no solution to Ax ≤ b. This is equivalent to saying that there is no solution to Ax^1 − Ax^2 + s = b with x^1, x^2, s ≥ 0.² In matrix notation, this means there are no nonnegative solutions to

    [ A  −A  I ] (x^1, x^2, s) = b.

By Farkas' Lemma (Theorem 2.25), there exists u ∈ R^m such that

    uᵀA ≤ 0,  uᵀ(−A) ≤ 0,  u ≤ 0,  and  uᵀb > 0.

Define y = −u/(uᵀb) ≥ 0. Then yᵀA = 0 and yᵀb = −1, showing that ⟨0, x⟩ ≤ −1 is a consequence of Ax ≤ b.

²This is easily seen by the transformation x = x^1 − x^2.

2.5.3 Faces of polyhedra

Faces of polyhedra are very structured. Firstly, every face is an exposed face – something that is not true for general closed, convex sets. Secondly, there is an algebraic characterization of faces in terms of the describing inequalities of a polyhedron. This is the content of Theorem 2.90 below. First, we make some simpler observations.

Theorem 2.89. The following are both true.

1. Every face of a polyhedron is a polyhedron.

2. Every polyhedron has finitely many faces.

Proof. We prove the theorem for the case of bounded polyhedra, i.e., polytopes. The extension of this proof idea to the unbounded case is left as an exercise.

Let P be any polytope and let F ⊆ P be a face. If x ∈ F is an extreme point of F, then by Problem 15 in "HW for Week IV", {x} is a face of P and therefore x is an extreme point of P. Since P has finitely many extreme points by Corollary 2.71, F has only finitely many extreme points. By Problem 14 in "HW for Week III", F is closed, and since P is compact, so is F. By the Krein-Milman theorem (Theorem 2.41), F is the convex hull of its finitely many extreme points. By the Minkowski-Weyl theorem (Theorem 2.81), F is a polyhedron. Moreover, since we showed that any face is the convex hull of some subset of extreme points of P, there can only be finitely many faces, since P has finitely many extreme points.

Theorem 2.90. Let P = {x ∈ R^d : Ax ≤ b} with A ∈ R^{m×d}, b ∈ R^m. Let F ⊆ P be such that F ≠ ∅, P. The following are equivalent.

(i) F is a face of P.

(ii) F is an exposed face of P.

(iii) There exists a subset I ⊆ {1, . . . , m} such that F = {x ∈ P : A_I x = b_I}.

Proof. (i) ⇒ (ii). Consider x̄ ∈ relint(F) (which exists by Exercise 4). Since F is a proper face, by Theorem 2.40, x̄ ∈ relbd(P). By Theorem 2.39, there exists a supporting hyperplane at x̄ given by ⟨a, x⟩ ≤ δ. Let {y ∈ P : ⟨a, y⟩ = δ} be the corresponding exposed face. Since x̄ ∈ relint(F), one can show that F ⊆ {y ∈ P : ⟨a, y⟩ = δ} (Verify!!). Thus, there exists an exposed face containing F. Let F′ be the minimal (with respect to set inclusion) exposed face of P that contains F, i.e., for any other exposed face F″ ⊇ F, we have F′ ⊆ F″. Note that such a minimal exposed face exists because we have only finitely many faces, by Theorem 2.89. Let this exposed face F′ be defined by the valid inequality ⟨c^1, x⟩ ≤ δ_1 for P.

If F = F′, then we are done because F′ is an exposed face. Otherwise, F ⊊ F′, and so F is a face of F′. Therefore, x̄ ∈ relbd(F′). Applying Theorem 2.39 to F′ and x̄, we obtain c^2 ∈ R^d, δ_2 ∈ R such that F ⊆ F′ ∩ {y ∈ R^d : ⟨c^2, y⟩ = δ_2}, and there exists ȳ ∈ F′ such that ⟨c^2, ȳ⟩ < δ_2. Using Theorem 2.77, we find finite sets V, R such that P = conv(V) + cone(R). Notice that since P ⊆ H^−(c^1, δ_1), we must have ⟨c^1, v⟩ ≤ δ_1 for all v ∈ V and ⟨c^1, r⟩ ≤ 0 for all r ∈ R.

Claim 1. One can always choose λ ≥ 0 such that λc^1 + c^2 and λδ_1 + δ_2 satisfy

    ⟨λc^1 + c^2, v⟩ ≤ λδ_1 + δ_2 for all v ∈ V,   ⟨λc^1 + c^2, r⟩ ≤ 0 for all r ∈ R.

Proof of Claim. The relations can be rearranged to say

    ⟨c^2, v⟩ − δ_2 ≤ λ(δ_1 − ⟨c^1, v⟩) for all v ∈ V,   ⟨c^2, r⟩ ≤ λ(−⟨c^1, r⟩) for all r ∈ R.   (2.2)

First, recall that 0 ≤ δ_1 − ⟨c^1, v⟩ for all v ∈ V and 0 ≤ −⟨c^1, r⟩ for all r ∈ R. Notice that since F′ ⊆ H^−(c^2, δ_2), if ⟨c^1, v⟩ = δ_1 for some v ∈ V, this means that v ∈ F′ and therefore ⟨c^2, v⟩ ≤ δ_2. Similarly, if ⟨c^1, r⟩ = 0 for some r ∈ R, this means that r ∈ rec(F′) and therefore ⟨c^2, r⟩ ≤ 0. Thus, the following choice of λ,

    λ := max{ 0,  max_{v∈V : δ_1−⟨c^1,v⟩>0} (⟨c^2, v⟩ − δ_2)/(δ_1 − ⟨c^1, v⟩),  max_{r∈R : −⟨c^1,r⟩>0} ⟨c^2, r⟩/(−⟨c^1, r⟩) },

satisfies (2.2).

Using the λ from the above claim, X = P ∩ {y ∈ R^d : ⟨λc^1 + c^2, y⟩ = λδ_1 + δ_2} is an exposed face of P containing F. Moreover, ⟨λc^1 + c^2, y⟩ ≤ λδ_1 + δ_2 is valid for F′, because the inequality is a nonnegative combination of the two inequalities ⟨c^1, y⟩ ≤ δ_1 and ⟨c^2, y⟩ ≤ δ_2, both valid for F′. Therefore, X ⊆ F′. But ȳ satisfies this inequality strictly, because it satisfies ⟨c^2, ȳ⟩ < δ_2, so X ⊊ F′. This contradicts the minimality of F′.

(ii) ⇒ (iii). Let c ∈ R^d, δ ∈ R be such that F = P ∩ {x : ⟨c, x⟩ = δ}. By Theorem 2.85, there exists y ∈ R_+^m such that cᵀ = yᵀA and δ ≥ yᵀb. Consider any x ∈ F (recall that F is assumed to be nonempty). Then

    δ = ⟨c, x⟩ = ⟨yᵀA, x⟩ = yᵀAx ≤ yᵀb ≤ δ.   (2.3)

Thus, equality must hold everywhere and yᵀb = δ. Moreover, yᵀAx = yᵀb for all x ∈ F, which implies that yᵀ(Ax − b) = 0 for all x ∈ F. This last relation says that for any i ∈ {1, . . . , m}, if y_i > 0 then ⟨a^i, x⟩ = b_i for every x ∈ F. Thus, setting I = {i : y_i > 0}, we immediately obtain that A_I x = b_I for all x ∈ F. Conversely, consider any x̄ ∈ P satisfying A_I x̄ = b_I. Then yᵀAx̄ = yᵀb, since y_i = 0 for i ∉ I. Therefore, ⟨c, x̄⟩ = yᵀAx̄ = yᵀb = δ, and thus x̄ ∈ P ∩ {x : ⟨c, x⟩ = δ} = F.

(iii) ⇒ (i). By definition, F = ∩_{i∈I} F_i, where F_i = {x ∈ P : ⟨a^i, x⟩ = b_i}. By definition, each F_i is an exposed face, and thus a face. By Problem 13 in "HW for Week III", the intersection of faces is a face and thus, F is a face.

2.5.4 Implicit equalities, dimension of polyhedra and facets

Given a polyhedron P = {x : Ax ≤ b}, how can we decide the dimension of P? The concept of implicit equalities is important for this.

Definition 2.91. Let A ∈ R^{m×d} and b ∈ R^m. We say that the inequality ⟨a^i, x⟩ ≤ b_i for some i ∈ {1, . . . , m} is an implicit equality for the polyhedron P = {x : Ax ≤ b} if P ⊆ {x : ⟨a^i, x⟩ = b_i}, i.e., P ⊆ H(a^i, b_i). We denote the subsystem of implicit equalities of Ax ≤ b by A⁼x ≤ b⁼. We will also use A⁺x ≤ b⁺ to denote the inequalities in Ax ≤ b that are NOT implicit equalities.

Note that for each i such that ⟨a^i, x⟩ ≤ b_i is not an implicit equality, there exists x ∈ P such that ⟨a^i, x⟩ < b_i.

Exercise 6. Let P = {x : Ax ≤ b}. Show that there exists x̄ ∈ P such that A⁼x̄ = b⁼ and A⁺x̄ < b⁺. Show the stronger statement that relint(P) = {x ∈ R^d : A⁼x = b⁼, A⁺x < b⁺}.

We can completely characterize the affine hull of a polyhedron, and consequently its dimension, in terms of the implicit equalities.

Proposition 2.92. Let A ∈ R^{m×d}, b ∈ R^m and P = {x : Ax ≤ b}. Then

    aff(P) = {x ∈ R^d : A⁼x = b⁼} = {x ∈ R^d : A⁼x ≤ b⁼}.

Proof. It is easy to verify that aff(P) ⊆ {x ∈ R^d : A⁼x = b⁼} ⊆ {x ∈ R^d : A⁼x ≤ b⁼}. We show that {x ∈ R^d : A⁼x ≤ b⁼} ⊆ aff(P). Consider any y satisfying A⁼y ≤ b⁼. Using Exercise 6, choose any x̄ ∈ P such that A⁼x̄ = b⁼ and A⁺x̄ < b⁺. If A⁺y ≤ b⁺, then y ∈ P ⊆ aff(P) and we are done. Otherwise, set

    μ := min_{i : ⟨a^i, y⟩ > b_i} (b_i − ⟨a^i, x̄⟩)/(⟨a^i, y⟩ − ⟨a^i, x̄⟩).

Observe that since ⟨a^i, y⟩ > b_i > ⟨a^i, x̄⟩ for each i considered in the minimum, we have 0 < μ < 1. One can check that (1 − μ)x̄ + μy ∈ P. This shows that y ∈ aff(P), because y is on the line joining two points in P, namely x̄ and (1 − μ)x̄ + μy.

Combined with part 4. of Theorem 2.16, this gives the following corollary.

Corollary 2.93. Let A ∈ R^{m×d}, b ∈ R^m and P = {x : Ax ≤ b}. Then

    dim(P) = d − rank(A⁼).
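Corollary 2.93 suggests a computational recipe: detect the implicit equalities (one small LP per row) and subtract the rank of that subsystem from d. A hedged sketch with scipy (the helper name and the instance are ours):

```python
import numpy as np
from scipy.optimize import linprog

def dimension(A, b, tol=1e-9):
    """dim(P) for a nonempty P = {x : A x <= b}, via Corollary 2.93.
    Row i is an implicit equality iff min <a^i, x> over P equals b_i;
    an unbounded LP means the row is certainly not implicit."""
    implicit = []
    for i in range(A.shape[0]):
        res = linprog(c=A[i], A_ub=A, b_ub=b,
                      bounds=[(None, None)] * A.shape[1])
        if res.status == 0 and res.fun >= b[i] - tol:
            implicit.append(i)
    rank_eq = np.linalg.matrix_rank(A[implicit]) if implicit else 0
    return A.shape[1] - rank_eq

# The segment P = conv{(0,0,0), (0,0,5)} from above has dimension 1.
A = np.array([[-1., 0, 0], [1, 1, 0], [1, -1, 0], [-1, -1, -1], [0, 1, 1]])
b = np.array([0., 0, 0, 0, 5])
print(dimension(A, b))  # 1
```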

As we have seen before, a given description P = {x : Ax ≤ b} of a polyhedron may be redundant, in the sense that we can remove some of the inequalities and still have the same set P. This motivates the following definition.

Definition 2.94. Let A ∈ R^{m×d} and b ∈ R^m. We say that the inequality ⟨a^i, x⟩ ≤ b_i for some i ∈ {1, . . . , m} is redundant for the polyhedron P = {x : Ax ≤ b} if P = {x : A_{−i}x ≤ b_{−i}}, where A_{−i} denotes the matrix A without row i and b_{−i} is the vector b with the i-th coordinate removed. Otherwise, if P ⊊ {x : A_{−i}x ≤ b_{−i}}, then ⟨a^i, x⟩ ≤ b_i is said to be irredundant for P. The system Ax ≤ b is said to be an irredundant system if every inequality is irredundant for P = {x : Ax ≤ b}.

The following characterization of facets of a polyhedron is quite useful, especially in combinatorial optimization and polyhedral combinatorics.

Theorem 2.95. Let P = {x ∈ R^d : Ax ≤ b} be nonempty, with A ∈ R^{m×d}, b ∈ R^m giving an irredundant system. Let F ⊆ P. The following are equivalent.

(i) F is a facet of P, i.e., F is a face with dim(F) = dim(P) − 1.

(ii) F is a maximal, proper face of P, i.e., for any proper face F′ ⊇ F, we must have F′ = F.

(iii) There exists a unique i ∈ {1, . . . , m} such that F = {x ∈ P : ⟨a^i, x⟩ = b_i} and ⟨a^i, x⟩ ≤ b_i is not an implicit equality.

Proof. (i) ⇒ (ii). Suppose to the contrary that there exists a proper face F′ ⊋ F. Observe that F is a face of F′ by Problem 15 in "HW for Week IV", and so F is a proper face of F′. By Lemma 2.35, dim(F′) > dim(F) = dim(P) − 1. So dim(F′) = dim(P). This contradicts the fact that F′ is a proper face, by Lemma 2.35.

(ii) ⇒ (iii). By Theorem 2.90, there exists a subset of indices I ⊆ {1, . . . , m} such that F = {x ∈ R^d : Ax ≤ b, A_I x = b_I}. If all the inequalities indexed by I are implicit equalities for P, then F = P, contradicting the assumption that F is a proper face. So there exists i ∈ I such that ⟨a^i, x⟩ ≤ b_i is not an implicit equality. Let F′ = {x ∈ P : ⟨a^i, x⟩ = b_i} be the face defined by this inequality; since ⟨a^i, x⟩ ≤ b_i is not an implicit equality, F′ is a proper face of P. Also observe that F ⊆ F′. Hence F = F′ = {x ∈ P : ⟨a^i, x⟩ = b_i}. To show uniqueness of i, we exhibit x′ ∈ F with the following property: for any j ≠ i such that ⟨a^j, x⟩ ≤ b_j is not an implicit equality, we have ⟨a^j, x′⟩ < b_j. To see this, let x^1 ∈ P be such that A⁼x^1 = b⁼ and A⁺x^1 < b⁺ (such an x^1 exists by Exercise 6). Since Ax ≤ b is an irredundant system, if we remove the inequality indexed by i, then we get some new points that satisfy the rest of the inequalities but violate ⟨a^i, x⟩ ≤ b_i. More precisely, there exists x^2 ∈ R^d such that A⁼x^2 ≤ b⁼, A⁺_{−i}x^2 ≤ b⁺_{−i} and ⟨a^i, x^2⟩ > b_i, where A⁺_{−i}x ≤ b⁺_{−i} denotes the system A⁺x ≤ b⁺ without the inequality indexed by i. Since ⟨a^i, x^1⟩ < b_i and ⟨a^i, x^2⟩ > b_i, there exists a convex combination x′ of x^1, x^2 such that ⟨a^i, x′⟩ = b_i. Since A⁼x^1 = b⁼ and A⁼x^2 ≤ b⁼, we have A⁼x′ ≤ b⁼. Moreover, since A⁺x^1 < b⁺ and A⁺_{−i}x^2 ≤ b⁺_{−i}, we must have that for any j ≠ i indexing an inequality in A⁺x ≤ b⁺, x′ satisfies ⟨a^j, x′⟩ < b_j. In particular, x′ ∈ P, and hence A⁼x′ = b⁼, since these are implicit equalities. Thus, we are done: x′ witnesses that no j ≠ i with a non-implicit inequality can define F.

(iii) ⇒ (i). By Theorem 2.90, F is a face. We now establish that dim(F) = dim(P) − 1. Let 𝒥 denote the set of indices that index inequalities in Ax ≤ b that are not implicit equalities. Since there exists a unique i such that F = {x ∈ P : ⟨a^i, x⟩ = b_i}, this means that for any j ∈ 𝒥 \ {i}, there exists x^j ∈ F such that ⟨a^j, x^j⟩ < b_j. Now let x′ = (1/(|𝒥| − 1)) Σ_{j∈𝒥\{i}} x^j, and observe that x′ ∈ F (since F is convex) and for any j ∈ 𝒥 \ {i}, we have ⟨a^j, x′⟩ < b_j. Let us describe the polyhedron F by the system Ãx ≤ b̃ that appends the inequality ⟨−a^i, x⟩ ≤ −b_i to the system Ax ≤ b.

Claim 2. rank(Ã⁼) = rank(A⁼) + 1.

Proof. The properties of x′ show that the matrix Ã⁼ is simply the matrix A⁼ appended with a^i (and its negative −a^i, which does not affect the rank). So it suffices to show that a^i is not a linear combination of the rows of A⁼. Suppose to the contrary that (a^i)ᵀ = yᵀA⁼ for some y ∈ R^k, where k is the number of rows of A⁼. If b_i < yᵀb⁼, then P is empty, because any x ∈ P satisfies A⁼x = b⁼ and therefore must satisfy yᵀA⁼x = yᵀb⁼, and this contradicts yᵀA⁼x = ⟨a^i, x⟩ ≤ b_i. If b_i ≥ yᵀb⁼, then ⟨a^i, x⟩ ≤ b_i is redundant for P, as every x satisfying A⁼x = b⁼ satisfies ⟨a^i, x⟩ ≤ b_i.

Using Corollary 2.93, we obtain that dim(F) = d − rank(Ã⁼) = d − rank(A⁼) − 1 = dim(P) − 1.

A consequence of this characterization of facets is that full-dimensional polyhedra have a unique system describing them, up to scaling.

Definition 2.96. We say that the inequality ⟨a, x⟩ ≤ δ is equivalent to the inequality ⟨a′, x⟩ ≤ δ′ if there exists λ > 0 such that a′ = λa and δ′ = λδ. Equivalent inequalities define the same halfspace, i.e., H^−(a, δ) = H^−(a′, δ′).

Theorem 2.97. Let P be a full-dimensional polyhedron. Let A ∈ R^{m×d}, A′ ∈ R^{p×d}, b ∈ R^m and b′ ∈ R^p be such that Ax ≤ b and A′x ≤ b′ are both irredundant systems describing P, i.e.,

    {x ∈ R^d : Ax ≤ b} = {x ∈ R^d : A′x ≤ b′} = P.

Then both systems are the same up to permutation and scaling. More precisely, the following holds:

1. m = p.

2. There exists a permutation σ : {1, . . . , m} → {1, . . . , m} such that for each i ∈ {1, . . . , m}, ⟨a^i, x⟩ ≤ b_i is equivalent to ⟨a′^{σ(i)}, x⟩ ≤ b′_{σ(i)}.

Proof. Left as an exercise.

930 We now turn our attention to convex functions, as a step towards optimization. In this context, we will need 931 to sometimes talk about the extended real numbers R , + . One reason is that in optimization ∪ {−∞ ∞} 932 problems, many times a supremum may be + or an infimum may be , and using them on the same ∞ −∞ 933 footing as the reals makes certain statements nicer, without having to exclude annoying special cases. For 934 this, one needs to set up some convenient rules for arithmetic over R , + : ∪ {−∞ ∞}

935 x + = for any x R + . • ∞ ∞ ∈ ∪ { ∞} 936 x(+ ) = + for all x > 0. We will avoid situations where we need to consider 0 (+ ). • ∞ ∞ · ∞ 937 x < for all x R. • ∞ ∈

33 938 3.1 General properties, epigraphs, subgradients Definition 3.1. A function f : Rd R + is called convex if → ∪ { ∞} f(λx + (1 λ)y) λf(x) + (1 λ)f(y), − ≤ − for all x, y Rd and λ (0, 1). If the inequality is strict for all x = y, then the function is called strictly convex. The∈domain (sometimes∈ also called effective domain) of f is6 defined as d dom(f) := x R : f(x) < + . { ∈ ∞} 939 A function g is said to be (strictly) concave if g is (strictly) convex. − 940 The domain of a convex function is easily seen to be convex. d 941 Proposition 3.2. Let f : R R + be a convex function. Then dom(f) is a convex set. → ∪ { ∞} 942 Proof. Left as an exercise.

943 The following subfamily of convex functions is nicer to deal with from an algorithmic perspective. Definition 3.3. A function f : Rd R + is called strongly convex with modulus of strong convexity c > 0 if → ∪ { ∞} 1 f(λx + (1 λ)y)+ λf(x) + (1 λ)f(y) cλ(1 λ) x y 2, − ≤ − − 2 − k − k d 944 for all x, y R and λ (0, 1). ∈ ∈ 945 The following proposition sheds some light on strongly convex functions. d 946 Proposition 3.4. A function f : R R + is strongly convex with modulus of strong convexity → ∪ { 1 ∞} 2 947 c > 0 if and only if the function g(x) := f(x) c x is convex. − 2 k k

948 Convex functions have a natural convex set associated with them, called the epigraph. Many properties of 949 convex functions can be obtained by just analyzing the corresponding epigraph and using all the technology 950 built in Section2. We give the formal definition for general functions below; very informally, it is “the region 951 above the graph of a function”. Definition 3.5. Let f : Rd R + be any function (not necessarily convex). The epigraph of f is defined as → ∪ { ∞} n epi(f) := (x, t) R R : f(x) t . { ∈ × ≤ } d 952 Note that epi(f) R R, so it lives in a space whose dimension is one more than the space over which ⊆ × 953 the function is defined, just like the graph of the function. Note also that the epigraph is nonempty 954 if and only if the function is not identically equal to + . Convex functions are precisely those ∞ 955 functions whose epigraphs are convex. d 956 Proposition 3.6. Let f : R R + be any function. f is convex if and only if epi(f) is a convex set. → ∪ { ∞} 1 2 957 Proof. ( ) Consider any (x , t1), (x , t2) epi(f), and any λ (0, 1). ⇒ ∈ ∈ 958 The result is a consequence of the following sequence of implications:

1 2 (x , t1) epi(f), (x , t2) epi(f), f is convex 1 ∈ 2 ∈ 1 2 1 2 f(x ) t1, f(x ) t2, f(λx + (1 λ)x ) λf(x ) + (1 λ)f(x ) ⇒ 1 ≤ 2 ≤ − ≤ − f(λx + (1 λ)x ) λt1 + (1 λ)t2 ⇒ 1 − 2 ≤ − (λx + (1 λ)x , λt1 + (1 λ)t2) epi(f) ⇒ − − ∈ 1 2 d 1 2 1 959 ( ) Consider the any x , x R and λ (0, 1). We wish to show that f(λx +(1 λ)x ) λf(x )+(1 ⇐2 1 ∈2 ∈ − 1 ≤ 2 − 960 λ)f(x ). If f(x ) = + or f(x ) = + , then relation holds trivially. So we assume f(x ), f(x ) < + . 1 1 ∞ 2 2 ∞ 1 ∞ 961 The points (x , f(x )), (x , f(x )) both lie in epi(f). By convexity of epi(f), we have that (λx + (1 2 1 2 1 2 1 2− 962 λ)x , λf(x ) + (1 λ)f(x )) epi(f). This implies that f(λx + (1 λ)x ) λf(x ) + (1 λ)f(x ), − ∈ − ≤ − 963 showing that f is convex.

34 964 Just like the class of closed, convex sets are nicer to deal with compared to sets that are simply convex 965 but not closed (mainly because of the separating/supporting hyperplane theorem), it will be convenient to 966 isolate a similar class of “nicer” convex functions.

967 Definition 3.7. A function is said to be a closed, convex function if its epigraph is a closed, convex set.

968 One can associate another family of convex sets with a convex function.

Definition 3.8. Let f : Rd R + be any function. Given α R, the α-sublevel set of f is the set → ∪ { ∞} ∈ d fα := x R : f(x) α . { ∈ ≤ }

969 The following can be verified by the reader.

970 Proposition 3.9. All sublevel sets of a convex function are convex sets.

971 The converse of Proposition 3.9 is not true. Functions whose sublevel sets are all convex are called 972 quasi-convex.

Example 3.10. 1. Indicator function. For any subset X Rd, define ⊆  0 if x X I (x) := X + if x ∈ X ∞ 6∈ 973 Then IX is convex if and only if X is convex.

d 974 2. Linear/Affine function. Let a R and δ R. Then the function x a, x + δ is called an affine ∈ ∈ 7→ h i 975 function (if δ = 0, this is a linear function). It is easily verified that affine functions are convex.

3. Norms and Distances. Let N : Rd R be a norm (see Definition 1.1). Then N is convex (Verify !!). Let C be a nonempty convex set. Then→ the distance function associated with the norm N, defined as

N dC (x) := inf N(y x) y∈C −

976 is a convex function.

4. Maximum of affine functions/Piecewise linear function/Polyhedral function. Let a1,..., am Rd and ∈ δ1, . . . , δm R. The function ∈ i f(x) := max ( a , x + δi) i=1,...,m h i is a convex function. Let us verify this. Consider any x1, x2 Rd and λ (0, 1). Then, ∈ ∈ 1 2 i 1 2 f(λx + (1 λ)x ) = maxi=1,...,m( a , λx + (1 λ)x + δi) − h i 1 − i i 2  = maxi=1,...,m λ( a , x + δi) + (1 λ)( a , x + δi) h i 1i  − h i i 2  maxi=1,...,m λ( a , x + δi) + maxi=1,...,m (1 λ)( a , x + δi) ≤ h i 1 i − i h 2 i = λ maxi=1,...,m( a , x + δi) + (1 λ) maxi=1,...,m( a , x + δi) = λf(x1) + (1 hλ)f(x2i) − h i − 977 The inequality follows from the fact that if `1, . . . , `m and u1, . . . , um are two sets of m real numbers 978 for some m N, then maxi=1,...,m(`i + ui) maxi=1,...,m `i + maxi=1,...,m ui. ∈ ≤ 979 An important consequence of the definition of convexity for functions is Jensen’s inequality which sees 980 its uses in diverse areas of science and engineering.

Theorem 3.11. [Jensen’s Inequality] Let f : Rd R + be any function. Then f is convex if and only 1 n d → ∪{ ∞} if for any finite set of points x ,..., x R and λ1, . . . , λn > 0 such that λ1 + ... + λn = 1, the following holds: ∈ 1 n 1 n f(λ1x + ... + λnx ) λ1f(x ) + ... + λnf(x ). ≤

35 981 Proof. ( ) Just use the hypothesis with n = 2. ⇐ i i 982 ( ) If any f(x ) is + , then the inequality holds trivially. So we assume that each f(x ) < + . By ⇒ ∞ i i ∞ 983 Proposition 3.6, epi(f) is a convex set. For each i = 1, . . . , m, the point (x , f(x )) epi(f) by definition of Pm i i 1 n ∈ 1 n 984 epi(f). Since epi(f) is convex, i=1 λi(x , f(x )) epi(f), i.e., (λ1x +...+λnx , λ1f(x )+...+λnf(x )) 1 n ∈1 n ∈ 985 epi(f). Therefore, f(λ1x + ... + λnx ) λ1f(x ) + ... + λnf(x ). ≤ 986 Recall Theorem 2.3 that showed convexity of a set is preserved under certain operations. We would like 987 to develop a similar result for convex functions. d 988 Theorem 3.12. [Operations that preserve the property of being a (closed) convex function] Let fi : R → 989 R + , i I be a family of (closed) convex functions where the index set I is potentially infinite. The ∪ { ∞} ∈ 990 following are all true.

991 1. (Nonnegative combinations). If I is a finite set, and αi 0, i I is a corresponding set of nonnegative P ≥ ∈ 992 reals, then i∈I αifi is a (closed) convex function.

993 2. (Taking supremums). The function defined as g(x) := supi∈I fi(x) is a (closed) convex function (even 994 when I is uncountable infinite).

m×d m m 995 3. (Pre-Composition with an affine function). Let A R and b R and let f : R R be any m ∈ ∈ d → 996 (closed) convex function on R . Then g(x) := f(Ax + b) as a function from R R is a (closed) → 997 convex function.

998 4. (Post-Composition with an increasing convex function). Let h : R R + be a (closed) convex → ∪d { ∞} 999 function that is also increasing, i.e., h(x) h(y) when x y. Let f : R R + be a (closed) ≥ ≥ → ∪ { ∞} 1000 convex function. We adopt the convention that h(+ ) = + . Then h(f(x)) as a function from d ∞ ∞ 1001 R R is a (closed) convex function. → P d Proof. 1. Let F = αifi. Consider any x, y R and λ (0, 1). Then i∈I ∈ ∈ P F (λx + (1 λ)y) = αifi(λx + (1 λ)y) − Pi∈I − αi(λfi(x) + (1 λ)fi(y)) ≤ Pi∈I − P = λ i∈I αifi(x) + (1 λ) i∈I αifi(y) = λF (x) + (1 λ)F (y−) − 1002 We use the nonnegativity of αi in the inequality on the second displayed line above. We omit the proof 1003 of closedness of the function.

1004 2. The main observation is that epi(g) = i∈I epi(fi) because g(x) t if and only if fi(x) t for all ∩ ≤ ≤ 1005 i I. Since the intersection of (closed) convex sets is a (closed) convex set (part 1. of Theorem 2.3), ∈ 1006 we have the result.

d 1007 3. The main observation is that for any x R and t R,(x, t) epi(g) if and only if (Ax+b, t) epi(f). d m ∈ ∈ ∈ −1∈ 1008 Define the affine map T : R R R R as follows: T (x, t) = (Ax+b, t). Then epi(g) = T (epi(f)). × → × 1009 Since the pre-image of a (closed) convex set with respect to an affine transformation is (closed) convex 1010 (part 4. of Theorem 2.3), we obtain that epi(g) is (closed) convex.

1011 4. Left as an exercise.

1012

1013 We can now see some more interesting examples of convex functions. i d Example 3.13. 1. Let a R and δi R, i I for some index set I. Then the function ∈ ∈ ∈ i f(x) := sup( a , x + δi) i∈I h i

1014 is closed convex. This is an alternate proof of the convexity of the maximum of finitely many affine 1015 functions – part 4. of Example 3.10.

36 n(n+1) 2. Consider the vector space V of symmetric n n matrices. One can view V as R 2 . Let k n. × ≤ Consider the function fk : V R which takes a matrix X and maps it to f(X) which is the sum of → the k largest eigenvalues of X. Then fk is a convex function. This is seen by the following argument. P Given any Y V define the linear function AY on V as follows: AY (X) = XijYij. Then ∈ i,j

fk(X) = sup AYY T (X), Y ∈Ω

n 1016 where Ω is the set of n k matrices with k orthonormal columns in R . This shows that fk is the × 1017 supremum of linear functions, and by Theorem 3.12, it is closed convex.

1018 We see in part 1. of Example 3.13 that the supremum of affine functions is convex. We will show below 1019 that, in fact, every convex function is the supremum of some family of affine functions. This is analogous 1020 to the fact that all closed convex sets are the intersection of some family of halfspaces. We build up to this 1021 with an important definition.

d d 1022 Definition 3.14. Let f : R R + be any function. Let x dom(f). Then a R is said to define → ∪ { ∞} ∈d ∈ 1023 an affine support of f at x if f(y) f(x) + a, y x for all y R . ≥ h − i ∈ d 1024 A useful picture to keep in mind is the following fact: a R is an affine support of f at x if and only if ∈ 1025 the hyperplane a, y t a, x f(x) is a supporting hyperplane for the epigraph of f at (x, f(x)). h i − ≤ h i − d 1026 Theorem 3.15. Let f : R R be any function. Then f is closed convex if and only if there exists an → d 1027 affine support of f at every x R . ∈ d 1028 Proof. ( ) Consider any x R . By definition of closed convex, epi(f) is a closed convex set. Moreover, ⇒ ∈ d 1029 (x, f(x)) bd(epi(f)). By Theorem 2.23, there exists (a¯, r) R R and δ R such that a¯ and r are not ∈ ∈ × ∈ 1030 both zero, and a¯, y + rt δ for all (y, t) epi(f), and a¯, x + rf(x) = δ. h i ≤ ∈ h i 1031 We claim that r < 0. Suppose to the contrary that r 0. First consider the case that a¯ = 0, then ≥ 1032 r > 0. (x, t) epi(f) for all t f(x). But this contradicts that rt = a¯, y + rt δ for all t f(x) ∈ ≥ h i ≤ d ≥ 1033 and rf(x) = a¯, x + rf(x) = δ. Next consider the case that a¯ = 0. Consider any y R satisfying h i 6 ∈ 1034 a¯, y > δ. Since f is real valued, there exists (y, t) epi(f) for some t 0. Since r 0, this contradicts h i ∈ ≥ ≥ 1035 that a¯, y + rt δ. h i ≤ a¯ d 1036 Now set a = −r . a¯, x + rf(x) = δ and a¯, y + rf(y) δ for all y R together imply that h i h i ≤ ∈ d 1037 a¯, y ( r)f(y) + a¯, x + rf(x). Rearranging, we obtain that f(y) f(x) + a, y x for all y R . h i ≤ − h i d ≥ dh − i ∈ ( ) By definition of affine support, for every x R , there exists ax R such that f(y) f(x) + ⇐ d ∈ ∈ ≥ ax, y x for all y R . This implies that, in fact, h − i ∈

f(y) = sup (f(x) + ax, y x ), d x∈R h − i

1038 because setting x = y on the right hand side gives f(y). Thus, f is the supremum of a family of affine 1039 functions, which by Example 3.13, shows that f is closed convex.

1040 Remark 3.16. 1. Any convex function that is finite valued everywhere is closed convex. This follows 1041 from a continuity result we will prove later. We skip the details in these notes. Thus, in the forward 1042 direction of Theorem 3.15, one may weaken the hypothesis to just convex, as opposed to closed convex.

1043 2. In the reverse direction of Theorem 3.15, one may weaken the hypothesis to having local affine support d 1044 everywhere. A function f : R R is said to have local affine support at x if there exists  > 0 → 1045 (depending on x) such that f(y) f(x) + a, y x for all y B(x, ). We will omit the proof of this ≥ h − i ∈ 1046 extension of Theorem 3.15 here. See Chapter on “Convex Functions” in [3].

d 1047 3. The proof of Theorem 3.15 also shows that if f : R R + and x dom(f), then there exists → ∪ { ∞} ∈ 1048 an affine support of f at x.

1049 Affine supports for convex functions have been given a special name.

37 d 1050 Definition 3.17. Let f : R R + be a convex function. For any x dom(f), an affine support at → ∪ { ∞} ∈ 1051 x is called a subgradient of f at x. The set of all subgradients at x is denoted by ∂f(x) and is called the 1052 subdifferential of f at x.

d 1053 Theorem 3.18. Let f : R R + be a convex function. For any x dom(f), the subdifferential → ∪ { ∞} ∈ 1054 ∂f(x) at x is a closed, convex set. Proof. Note that d d ∂f(x) := a R : y x, a f(y) f(x) y R . { ∈ h − i ≤ − ∀ ∈ } 1055 Since the above set is the intersection of a family of halfspaces, this shows that ∂f(x) is a closed, convex 1056 set.

3.2 Continuity properties

Convex functions enjoy strong continuity properties in the relative interior of their domains³. This fact is very useful in many contexts, especially in optimization, because it helps in showing that minimizers and maximizers exist when optimizing convex functions that show up in practice, via Weierstrass' theorem (Theorem 1.11).

Proposition 3.19. Let f : R^d → R ∪ {+∞} be a convex function. Take x* ∈ R^d and suppose that for some ε > 0 and m, M ∈ R, the inequalities

    m ≤ f(x) ≤ M

hold for all x in the ball B(x*, 2ε). Then for all x, y ∈ B(x*, ε), it holds that

    |f(x) − f(y)| ≤ ((M − m)/ε) ‖x − y‖.    (3.1)

In particular, f is locally Lipschitz continuous about x*.

Proof. Take x, y ∈ B(x*, ε) with x ≠ y. Define z = y + ε (y − x)/‖y − x‖. Note that

    ‖z − x*‖ = ‖ y + ε (y − x)/‖y − x‖ − x* ‖ ≤ ‖y − x*‖ + ε ≤ ε + ε = 2ε.

Thus z ∈ B(x*, 2ε). Also,

    y = (‖y − x‖/(ε + ‖y − x‖)) z + (1 − ‖y − x‖/(ε + ‖y − x‖)) x,

showing that y is a convex combination of x and z. Therefore we may apply the convexity of f to see

    f(y) ≤ (‖y − x‖/(ε + ‖y − x‖)) f(z) + (1 − ‖y − x‖/(ε + ‖y − x‖)) f(x)
         = f(x) + (‖y − x‖/(ε + ‖y − x‖)) (f(z) − f(x))
         ≤ f(x) + (‖y − x‖/ε) (M − m),    using the bounds on f in B(x*, 2ε).

Hence f(y) − f(x) ≤ (‖y − x‖/ε)(M − m). Repeating this argument with the roles of x and y swapped, we get f(x) − f(y) ≤ (‖y − x‖/ε)(M − m). Therefore (3.1) holds.

³This section was written by Joseph Paat.

Proposition 3.20. Let f : R^d → R ∪ {+∞} be a convex function. Consider any compact, convex subset S ⊆ dom(f) and let x* ∈ relint(S). Then there is an ε_{x*} > 0 and values m_{x*}, M_{x*} ∈ R so that

    m_{x*} ≤ f(x) ≤ M_{x*}    (3.2)

for all x ∈ B(x*, 2ε_{x*}) ∩ S.

Proof. Let v¹, ..., v^ℓ be vectors that span the linear space parallel to aff(S) (see Theorem 2.16). By definition of relative interior, since x* ∈ aff(S), there exists ε > 0 such that x* + εv^j and x* − εv^j are both in S for j = 1, ..., ℓ. Denote the set of points x* ± εv^j as x₁, ..., x_k ∈ S (k = 2ℓ), and define S' := conv{x₁, ..., x_k}. Observe that x* ∈ relint(S') and aff(S') = aff(S). Set M_{x*} = max{f(x_i) : i = 1, ..., k}. Using Problem 3 from "HW for Week VIII", it follows that f(x) ≤ M_{x*} for all x ∈ S'.

Now, since f is convex, by Theorem 3.15 (see Remark 3.16, part 3.) there is some affine support function L(x) = ⟨a, x − x*⟩ + f(x*) for f at x*. Define m_{x*} = min{L(x_i) : i = 1, ..., k}. Consider any point x = Σ_{i=1}^k λ_i x_i ∈ S', where λ₁, ..., λ_k are convex coefficients, and observe that

    L(x) = ⟨a, Σ_{i=1}^k λ_i x_i − x*⟩ + f(x*) = Σ_{i=1}^k λ_i (⟨a, x_i − x*⟩ + f(x*)) = Σ_{i=1}^k λ_i L(x_i) ≥ m_{x*}.

Since L is an affine support, it follows that f(x) ≥ L(x) ≥ m_{x*} for all x ∈ S'. Finally, as x* ∈ relint(S') and aff(S') = aff(S), there is some ε_{x*} > 0 so that B(x*, 2ε_{x*}) ∩ S ⊆ S'.

Theorem 3.21. Let f : R^d → R ∪ {+∞} be a convex function. Let D ⊆ relint(dom(f)) be a convex, compact subset. Then there is a constant L = L(D) ≥ 0 so that

    |f(x) − f(y)| ≤ L ‖x − y‖    (3.3)

for all x, y ∈ D. In particular, f is locally Lipschitz continuous over the relative interior of its domain.

Proof. Let S be a compact set such that D ⊆ relint(S) ⊆ relint(dom(f)). From Proposition 3.20, for every x ∈ relint(S) there is a tuple (ε_x, m_x, M_x) so that m_x ≤ f(y) ≤ M_x for all y ∈ B(x, 2ε_x) ∩ S. Proposition 3.19 then implies that there is some L_x ≥ 0 so that |f(y) − f(z)| ≤ L_x ‖z − y‖ for all z, y ∈ B(x, ε_x). Note that the collection {B(x, ε_x) ∩ S : x ∈ D} forms an open cover of S (in the relative topology of aff(S)). Therefore, as S is compact, there exists a finite set {x₁, ..., x_k} ⊂ S so that S ⊆ ∪_{i=1}^k B(x_i, ε_{x_i}). Set L = max{L_{x_i} : i ∈ {1, ..., k}}.

Now take y, z ∈ S. The line segment [y, z] can be divided into finitely many segments [y, z] = [y₁, y₂] ∪ [y₂, y₃] ∪ ... ∪ [y_{q−1}, y_q], where y₁ = y, y_q = z, and each [y_i, y_{i+1}] is contained in some ball B(x_j, ε_{x_j}) for j ∈ {1, ..., k}. Without loss of generality, we may assume that q − 1 ≤ k and [y_i, y_{i+1}] ⊆ B(x_i, ε_{x_i}) for each i ∈ {1, ..., q − 1}. It follows that

    |f(y) − f(z)| = | Σ_{i=1}^{q−1} (f(y_i) − f(y_{i+1})) |
                 ≤ Σ_{i=1}^{q−1} |f(y_i) − f(y_{i+1})|
                 ≤ Σ_{i=1}^{q−1} L_{x_i} ‖y_i − y_{i+1}‖
                 ≤ L Σ_{i=1}^{q−1} ‖y_i − y_{i+1}‖
                 = L ‖y₁ − y_q‖ = L ‖y − z‖,

where the last equality uses that the y_i lie in order on the segment [y, z].

Hence f is Lipschitz continuous over S with constant L.

3.3 First-order derivative properties

A convex function enjoys very strong differentiability properties. We will first state some useful results without proof. See the chapter on "Convex Functions" in Gruber [3] for full proofs.

Theorem 3.22. Let f : R^d → R ∪ {+∞} be a convex function and let x ∈ int(dom(f)). Then f is differentiable at x if and only if the partial derivative f'_i(x) exists for all i = 1, ..., d.

Theorem 3.23 (Reidemeister's Theorem). Let f : R^d → R ∪ {+∞} be a convex function. Then f is differentiable almost everywhere in int(dom(f)), i.e., the subset of int(dom(f)) where f is not differentiable has Lebesgue measure 0.

We now prove the central relationships between the gradient ∇f and convexity. We first observe some facts about convex functions on the real line.

Proposition 3.24. Let f : R → R be a convex function. Then for any real numbers x < y < z, we must have

    (f(y) − f(x))/(y − x) ≤ (f(z) − f(x))/(z − x) ≤ (f(z) − f(y))/(z − y).

Moreover, if f is strictly convex, then these inequalities are strict.

Proof. Since y ∈ (x, z), there exists α ∈ (0, 1) such that y = αx + (1 − α)z. Now we follow the inequalities:

    (f(y) − f(x))/(y − x) = (f(αx + (1−α)z) − f(x))/(αx + (1−α)z − x)
                         ≤ (αf(x) + (1−α)f(z) − f(x))/(αx + (1−α)z − x)
                         = (f(z) − f(x))/(z − x).

Similarly,

    (f(z) − f(y))/(z − y) = (f(z) − f(αx + (1−α)z))/(z − αx − (1−α)z)
                         ≥ (f(z) − αf(x) − (1−α)f(z))/(z − αx − (1−α)z)
                         = (f(z) − f(x))/(z − x).

The strict convexity implication is clear from the above.

An immediate corollary is the following relationship between the derivative of a function on the real line and convexity.

Proposition 3.25. Let f : R → R be a differentiable function. Then f is convex if and only if f' is an increasing function, i.e., f'(x) ≥ f'(y) for all x ≥ y ∈ R. Moreover, f is strictly convex if and only if f' is strictly increasing. f is strongly convex with strong convexity modulus c > 0 if and only if f'(x) ≥ f'(y) + c(x − y) for all x ≥ y ∈ R.

Proof. (⇒) Recall that f'(x) = lim_{t→0⁺} (f(x + t) − f(x))/t. But for every 0 < t < y − x, we have (f(x + t) − f(x))/t ≤ (f(y) − f(x))/(y − x) by Proposition 3.24. Thus, f'(x) ≤ (f(y) − f(x))/(y − x). By a similar argument, we obtain f'(y) ≥ (f(y) − f(x))/(y − x). This gives the relation.

(⇐) Consider any x, z ∈ R and α ∈ (0, 1). Let y = αx + (1 − α)z. By the mean value theorem, there exists t₁ ∈ [x, y] such that (f(y) − f(x))/(y − x) = f'(t₁) and t₂ ∈ [y, z] such that (f(z) − f(y))/(z − y) = f'(t₂). Since t₂ ≥ t₁ and we assume f' is increasing, f'(t₂) ≥ f'(t₁). This implies that

    (f(z) − f(y))/(z − y) ≥ (f(y) − f(x))/(y − x).

Substituting y = αx + (1 − α)z and rearranging, we obtain that f(αx + (1 − α)z) ≤ αf(x) + (1 − α)f(z).

We can now prove the main result of this subsection. A key idea behind the results below is that one can reduce testing convexity of a function on R^d to testing convexity of any one-dimensional "slice" of it. More precisely,

Proposition 3.26. Let f : R^d → R ∪ {+∞} be a function. Then f is convex if and only if for every x, r ∈ R^d, the function φ : R → R ∪ {+∞} defined by φ(t) = f(x + tr) is convex.

Proof. Left as an exercise.

Theorem 3.27. Let f : R^d → R be differentiable everywhere. Then the following are all equivalent.

1. f is convex.

2. f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ for all x, y ∈ R^d.

3. ⟨∇f(y) − ∇f(x), y − x⟩ ≥ 0 for all x, y ∈ R^d.

A characterization of strict convexity is obtained if all the above inequalities are made strict for all x ≠ y ∈ R^d. A characterization of strong convexity with modulus c > 0 is obtained if 2. is replaced with f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (c/2)‖y − x‖² for all x, y ∈ R^d, and 3. is replaced with ⟨∇f(y) − ∇f(x), y − x⟩ ≥ c‖y − x‖² for all x, y ∈ R^d.

Proof. 1. ⇒ 2. Consider any x, y ∈ R^d. For every α > 0, convexity of f implies that f((1 − α)x + αy) ≤ (1 − α)f(x) + αf(y). Rearranging, we obtain

    (f((1−α)x + αy) − f(x))/α ≤ f(y) − f(x)
    ⇒ (f(x + α(y − x)) − f(x))/α ≤ f(y) − f(x).

Letting α ↓ 0 on the left hand side, we obtain the directional derivative ⟨∇f(x), y − x⟩, and 2. is established.

2. ⇒ 3. By switching the roles of x, y ∈ R^d, we obtain the following:

    f(y) ≥ f(x) + ⟨∇f(x), y − x⟩,
    f(x) ≥ f(y) + ⟨∇f(y), x − y⟩.

Adding these inequalities together, we obtain 3.

3. ⇒ 1. Consider any x̄, ȳ ∈ R^d and define the function φ(t) := f(x̄ + t(ȳ − x̄)). Observe that φ'(t) = ⟨∇f(x̄ + t(ȳ − x̄)), ȳ − x̄⟩ for any t ∈ R. For t₂ > t₁, we have that

    φ'(t₂) − φ'(t₁) = ⟨∇f(x̄ + t₂(ȳ − x̄)) − ∇f(x̄ + t₁(ȳ − x̄)), ȳ − x̄⟩
                   = (1/(t₂ − t₁)) ⟨∇f(x̄ + t₂(ȳ − x̄)) − ∇f(x̄ + t₁(ȳ − x̄)), (x̄ + t₂(ȳ − x̄)) − (x̄ + t₁(ȳ − x̄))⟩
                   ≥ 0,

where the last inequality follows from the fact that ⟨∇f(y) − ∇f(x), y − x⟩ ≥ 0 for all x, y ∈ R^d, and t₂ > t₁. Therefore, by Proposition 3.25, we obtain that φ(t) is a convex function in t. By Proposition 3.26, f is convex.

3.4 Second-order derivative properties

A simple consequence of Proposition 3.25 for twice differentiable functions on the real line is the following.

Corollary 3.28. Let f : R → R be a twice differentiable function. Then f is convex if and only if f''(x) ≥ 0 for all x ∈ R. If f''(x) > 0 for all x ∈ R, then f is strictly convex.

Remark 3.29. From Proposition 3.25, we know strict convexity of f is equivalent to the condition that f' is strictly increasing. However, this is not equivalent to f''(x) > 0; the implication only goes in one direction. This is why we lose the other direction when discussing strict convexity in Corollary 3.28. As a concrete example, consider f(x) = x⁴, which is strictly convex, but whose second derivative is 0 at x = 0.

This enables one to characterize convexity of f : R^d → R in terms of its Hessian, which will be denoted by ∇²f.

Theorem 3.30. Let f : R^d → R be a twice differentiable function. Then the following are all true.

1. f is convex if and only if ∇²f(x) is positive semidefinite (PSD) for all x ∈ R^d.

2. If ∇²f(x) is positive definite (PD) for all x ∈ R^d, then f is strictly convex.

3. f is strongly convex with modulus c > 0 if and only if ∇²f(x) − cI is positive semidefinite (PSD) for all x ∈ R^d.

Proof. 1. (⇒) Let x ∈ R^d; we would like to show that ∇²f(x) is positive semidefinite. Consider any r ∈ R^d. Define the function φ(t) = f(x + tr). By Proposition 3.26, φ is convex. By Corollary 3.28, 0 ≤ φ''(0) = ⟨∇²f(x)r, r⟩. Since the choice of r was arbitrary, this shows that ∇²f(x) is positive semidefinite.

(⇐) Assume ∇²f(x) is positive semidefinite for all x ∈ R^d, and consider x̄, r ∈ R^d. Define the function φ(t) = f(x̄ + tr). Now φ''(t) = ⟨∇²f(x̄ + tr)r, r⟩ ≥ 0, since ∇²f(x̄ + tr) is positive semidefinite. By Corollary 3.28, φ is convex. By Proposition 3.26, f is convex.

2. This follows from the same construction as in 1. above, and the sufficient condition that if the second derivative of a one-dimensional function is strictly positive, then the function is strictly convex.

3. We omit the proof of the characterization of strong convexity.
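To make part 1. concrete, here is a small numerical sketch (ours, not from the notes; the names are illustrative) that tests the PSD condition of Theorem 3.30 for a quadratic f(x) = (1/2)⟨x, Qx⟩ with symmetric Q, whose Hessian is the constant matrix Q:

    import numpy as np

    def is_psd(H, tol=1e-10):
        # A symmetric matrix is PSD iff all its eigenvalues are >= 0.
        return bool(np.all(np.linalg.eigvalsh(H) >= -tol))

    Q = np.array([[2.0, -1.0],
                  [-1.0, 2.0]])        # eigenvalues 1 and 3
    print(is_psd(Q))                   # True: f(x) = (1/2)<x, Qx> is convex (part 1.)
    print(is_psd(Q - 0.5 * np.eye(2))) # True: f is strongly convex with modulus c = 0.5 (part 3.)

Since the smallest eigenvalue of Q is 1, the test with c = 0.5 succeeds, while any c > 1 would fail.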


3.5 Sublinear functions, support functions and gauges

We will now introduce a more structured subfamily of convex functions which is easier to deal with analytically, and yet has very important uses in diverse areas.

Definition 3.31. A function f : R^d → R ∪ {+∞} is called sublinear if it satisfies the following two properties:

(i) f is positively homogeneous, i.e., f(λr) = λf(r) for all r ∈ R^d and λ > 0.

(ii) f is subadditive, i.e., f(x + y) ≤ f(x) + f(y) for all x, y ∈ R^d.

Here is the connection with convexity.

Proposition 3.32. Let f : R^d → R ∪ {+∞}. Then the following are equivalent:

1. f is sublinear.

2. f is convex and positively homogeneous.

3. f(λ₁x¹ + λ₂x²) ≤ λ₁f(x¹) + λ₂f(x²) for all x¹, x² ∈ R^d and λ₁, λ₂ > 0.

Proof. Left as an exercise.

A useful property of sublinear functions is the following.

Proposition 3.33. Let f : R^d → R ∪ {+∞} be a sublinear function. Then either f(0) = 0 or f(0) = +∞.

A characterization of sublinear functions via epigraphs is also possible.

Proposition 3.34. Let f : R^d → R ∪ {+∞} be such that f(0) = 0. Then f is sublinear if and only if epi(f) is a convex cone in R^d × R.

Proof. (⇒) From Proposition 3.32, we know that f is convex and positively homogeneous. From Proposition 3.6, this implies that epi(f) is convex. So we only need to verify that if (x, t) ∈ epi(f) then λ(x, t) = (λx, λt) ∈ epi(f) for all λ ≥ 0. If λ = 0, then the result follows from the assumption that f(0) = 0. Now consider λ > 0. Since (x, t) ∈ epi(f), we have f(x) ≤ t and, by positive homogeneity of f, f(λx) = λf(x) ≤ λt, and so (λx, λt) ∈ epi(f).

(⇐) From Proposition 3.6 and the assumption that epi(f) is a convex cone, we get that f is convex. We now verify that f is positively homogeneous; by Proposition 3.32, we will then be done. We first verify that for all λ > 0 and x ∈ R^d, f(λx) ≤ λf(x). Since epi(f) is a convex cone and (x, f(x)) ∈ epi(f), we have that λ(x, f(x)) = (λx, λf(x)) ∈ epi(f). This implies that f(λx) ≤ λf(x).

Now, for any particular λ̄ > 0 and x̄ ∈ R^d, we have that f(λ̄x̄) ≤ λ̄f(x̄). But using the above observation with λ = 1/λ̄ and x = λ̄x̄, we obtain that f((1/λ̄)λ̄x̄) ≤ (1/λ̄)f(λ̄x̄), i.e., λ̄f(x̄) ≤ f(λ̄x̄). Hence, we must have f(λ̄x̄) = λ̄f(x̄).

Gauges. One easily observes that any norm N : R^d → R is a sublinear function – recall Definition 1.1. In fact, a norm has the additional "symmetry" property that N(x) = N(−x). Since a sublinear function is convex (Proposition 3.32), and sublevel sets of convex functions are convex, we immediately know that the unit norm balls B_N(0, 1) = {x ∈ R^d : N(x) ≤ 1} are convex sets. Because of the "symmetry property" of norms, these unit norm balls are also "symmetric" about the origin. This merits a definition.

Definition 3.35. A convex set C ⊆ R^d is said to be centrally symmetric about the origin if x ∈ C implies that −x ∈ C. Sometimes we will abbreviate this to say C is centrally symmetric.

We now summarize the above discussion in the following observation.

Proposition 3.36. Let N : R^d → R be a norm. Then the unit norm ball B_N(0, 1) = {x ∈ R^d : N(x) ≤ 1} is a centrally symmetric, closed convex set.

One can actually prove a converse to the above statement, which will establish a nice one-to-one correspondence between norms and centrally symmetric convex sets. We first generalize the notion of a norm to a family of sublinear functions called "gauge functions".

Definition 3.37. Let C ⊆ R^d be a closed, convex set such that 0 ∈ C. Define the function γ_C : R^d → R ∪ {+∞} as

    γ_C(r) = inf{λ > 0 : r ∈ λC}.

γ_C is called the gauge or the Minkowski functional of C.

Exercise 7. Show that γ_C is finite valued everywhere if and only if 0 ∈ int(C).

The following is a useful observation for the analysis of gauge functions.

Lemma 3.38. Let C ⊆ R^d be a closed convex set such that 0 ∈ C, and let r ∈ R^d be any vector. Then the set {λ > 0 : r ∈ λC} is either empty or a convex interval of the real line of the form (a, +∞) or [a, +∞).

Proof. Define I := {λ > 0 : r ∈ λC} and suppose it is nonempty. It suffices to show that if λ̄ ∈ I then λ ∈ I for all λ ≥ λ̄. This follows from the fact that λ̄ ∈ I implies (1/λ̄)r ∈ C. For any λ ≥ λ̄, we have (1/λ)r = (λ̄/λ)((1/λ̄)r) + ((λ − λ̄)/λ)·0, which is in C because C is convex and 0 ∈ C.

A useful intuition to keep in mind is that, for any r, the gauge function value γ_C(r) gives you a factor to scale r with so that you end up on the boundary of C. More precisely,

Proposition 3.39. Let C ⊆ R^d be a closed, convex set such that 0 ∈ C. Suppose r ∈ R^d is such that 0 < γ_C(r) < ∞. Then (1/γ_C(r)) r ∈ relbd(C).

Proof. From Lemma 3.38, we have that for all λ > γ_C(r), r ∈ λC, i.e., (1/λ)r ∈ C. Taking the limit λ ↓ γ_C(r) and using the fact that C is closed, we obtain that (1/γ_C(r))r ∈ C. If (1/γ_C(r))r ∈ relint(C), then we could scale (1/γ_C(r))r by some α > 1 and obtain that (α/γ_C(r))r ∈ C, which would imply that r ∈ (γ_C(r)/α)C, contradicting the fact that γ_C(r) = inf{λ > 0 : r ∈ λC}, since γ_C(r)/α < γ_C(r).

The following theorem relates geometric properties of C with analytical properties of the gauge function. These relations are extremely handy to keep in mind.

Theorem 3.40. Let C ⊆ R^d be a closed, convex set such that 0 ∈ C. Then the following are all true.

1. γ_C is a nonnegative, sublinear function.

2. C = {x ∈ R^d : γ_C(x) ≤ 1}.

3. rec(C) = {r ∈ R^d : γ_C(r) = 0}.

4. If 0 ∈ relint(C), then relint(C) = {x ∈ R^d : γ_C(x) < 1}.

Proof. 1. Although 1. can be proved directly from the definition of the gauge, we postpone its proof until we speak of support functions below.

2. We first show that C ⊆ {x ∈ R^d : γ_C(x) ≤ 1}. This is because x ∈ C implies that 1 ∈ {λ > 0 : x ∈ λC} and therefore inf{λ > 0 : x ∈ λC} ≤ 1.

Now we verify that {x ∈ R^d : γ_C(x) ≤ 1} ⊆ C. γ_C(x) ≤ 1 implies that inf{λ > 0 : x ∈ λC} ≤ 1 and, since {λ > 0 : x ∈ λC} is an unbounded interval by Lemma 3.38, this means that either 1 ∈ {λ > 0 : x ∈ λC}, and thus x ∈ C, or 1 = inf{λ > 0 : x ∈ λC} = γ_C(x), in which case Proposition 3.39 gives 1·x ∈ C.

3. Since {λ > 0 : r ∈ λC} is convex by Lemma 3.38, we observe that γ_C(r) = 0 if and only if (1/λ)r ∈ C for all λ > 0. Since 0 ∈ C, this is equivalent to saying that tr ∈ C for all t ≥ 0; more explicitly, 0 + tr ∈ C for all t ≥ 0. This is equivalent to saying that r satisfies Definition 2.43 of rec(C).

4. Consider any x ∈ relint(C). By definition of relative interior, there exists λ > 1 such that λx ∈ C. By part 2. above, γ_C(λx) ≤ 1 and, by part 1. above, γ_C is positively homogeneous; thus γ_C(x) ≤ 1/λ < 1.

Now suppose x ∈ R^d is such that γ_C(x) < 1. If γ_C(x) = 0, then x ∈ rec(C) by part 3. above. Since 0 ∈ relint(C), we then have x = 0 + x ∈ relint(C). Now suppose 0 < γ_C(x) < 1. By part 2. above, x ∈ C. Suppose to the contrary that x ∉ relint(C). By Theorem 2.40, x is contained in a proper face F of C. Since 0 ∈ relint(C), 0 is not contained in F. Also, γ_C(x/γ_C(x)) = 1 by positive homogeneity of γ_C, from part 1. above. Therefore, x/γ_C(x) ∈ C. However, x = (1 − γ_C(x))·0 + γ_C(x)·(x/γ_C(x)). Since γ_C(x) < 1 and 0 ∉ F, this would contradict the fact that F is a face.

We derive some immediate consequences.

Corollary 3.41. Let C ⊆ R^d be a closed, convex set containing the origin. Then C is compact if and only if γ_C(r) > 0 for all r ∈ R^d \ {0}.

Corollary 3.42 (Uniqueness of the gauge). Let C be a compact convex set containing the origin in its interior, i.e., 0 ∈ int(C). Let f : R^d → R be any sublinear function. Then C = {x ∈ R^d : f(x) ≤ 1} if and only if f = γ_C.

Proof. The sufficiency follows from Theorem 3.40, part 2. For the necessity, suppose to the contrary that f(x) ≠ γ_C(x) for some x ∈ R^d. We first observe that x ≠ 0, because f(0) = 0 = γ_C(0) by Proposition 3.33.

First suppose f(x) > γ_C(x). Since C is compact, we know that γ_C(x) > 0 by Corollary 3.41. Consider the point (1/γ_C(x))x. By Proposition 3.39, (1/γ_C(x))x ∈ relbd(C). However, since f is positively homogeneous, f((1/γ_C(x))x) = (1/γ_C(x))f(x) > 1, because f(x) > γ_C(x). This contradicts that C = {x ∈ R^d : f(x) ≤ 1}.

Next suppose f(x) < γ_C(x). If f(x) ≤ 0, then by positive homogeneity, f(λx) ≤ 0 for all λ ≥ 0. Thus, λx ∈ C for all λ ≥ 0 by the assumption that C = {x ∈ R^d : f(x) ≤ 1}. This means that x ∈ rec(C), which contradicts the fact that C is compact (see Theorem 2.47). Thus, we may assume that f(x) > 0.

Now let y = (1/f(x))x. By positive homogeneity of γ_C, we obtain that γ_C(y) = γ_C((1/f(x))x) = γ_C(x)/f(x) > 1. Therefore, y ∉ C by Theorem 3.40, part 2. However, f(y) = 1, which contradicts the assumption that C = {x ∈ R^d : f(x) ≤ 1}.

The proof of Corollary 3.42 can be massaged to obtain the following.

Corollary 3.43 (Uniqueness of the gauge II). Let C be a closed, convex set (not necessarily compact) containing the origin in its interior, i.e., 0 ∈ int(C). Let f : R^d → R be any nonnegative, sublinear function. Then C = {x ∈ R^d : f(x) ≤ 1} if and only if f = γ_C.

Consequently, for every nonnegative, sublinear function f, there exists a closed, convex set C such that f = γ_C.

We also make the following observation on when the gauge function can take +∞ as a value.

Lemma 3.44. Let C be a closed, convex set with 0 ∈ C. Then the gauge γ_C is finite valued everywhere (i.e., γ_C(x) < ∞ for all x ∈ R^d) if and only if 0 ∈ int(C).

Proof. (⇒) Suppose 0 is not in the interior, i.e., 0 is on the boundary of C. By the Supporting Hyperplane Theorem 2.23, there exist a ∈ R^d \ {0} and δ ∈ R such that C ⊆ H⁻(a, δ) and ⟨a, 0⟩ = δ. Thus, δ = 0. Now consider any r ∈ R^d such that ⟨a, r⟩ > 0. Since C ⊆ H⁻(a, 0), it follows that λC ⊆ H⁻(a, 0) for all λ > 0. Therefore, the set {λ > 0 : r ∈ λC} is empty, and we conclude that γ_C(r) = ∞. In fact, this shows that γ_C takes value ∞ on the entire "open" halfspace {r ∈ R^d : ⟨a, r⟩ > 0}.

(⇐) Assume 0 ∈ int(C) and consider any x ∈ R^d. Since 0 ∈ int(C), there exists ε > 0 such that εx ∈ C. Thus, 1/ε is in the set {λ > 0 : x ∈ λC}, and so the infimum over this set is finite. Thus, γ_C(x) < ∞ for all x ∈ R^d.

We can now finally settle the correspondence between norms and centrally symmetric, compact convex sets.

Theorem 3.45. Let N : R^d → R be a norm. Then B_N(0, 1) = {x ∈ R^d : N(x) ≤ 1} is a centrally symmetric, compact convex set with 0 in its interior. Moreover, γ_{B_N(0,1)} = N.

Conversely, let B be a centrally symmetric, compact convex set containing 0 in its interior. Then γ_B is a norm on R^d and B = B_{γ_B}(0, 1).

Proof. For the first part, since N is sublinear, it is convex (by Proposition 3.32). By definition, B_N(0, 1) = {x ∈ R^d : N(x) ≤ 1} is a sublevel set of N, and is thus a convex set. It is closed, since N is continuous by Theorem 3.21. Since N(x) = N(−x), B_N(0, 1) is also centrally symmetric. We now show that rec(B_N(0, 1)) = {0}; this will imply that it is compact by Theorem 2.47. Consider any nonzero vector r, and let N(r) = M > 0. Then (2/M)r = 0 + (2/M)r, but N((2/M)r) = 2. Thus, (2/M)r ∉ B_N(0, 1), and so r cannot be a recession direction for B_N(0, 1).

We verify that 0 ∈ int(B_N(0, 1)). If not, then by the Supporting Hyperplane Theorem 2.23, there exist a ∈ R^d \ {0} and δ ∈ R such that B_N(0, 1) ⊆ H⁻(a, δ) and ⟨a, 0⟩ = δ. Thus, δ = 0. Now, since a ≠ 0, N(a) > 0. Thus, N(a/N(a)) = 1 and, by definition, a/N(a) ∈ B_N(0, 1). However, ⟨a, a/N(a)⟩ = ‖a‖²/N(a) > 0, which contradicts the fact that B_N(0, 1) ⊆ H⁻(a, 0). Therefore, from Corollary 3.42, we obtain that N = γ_{B_N(0,1)}.

For the second part, we know that γ_B is sublinear, and since B is compact, γ_B(r) > 0 for all r ≠ 0 by Corollary 3.41. Since 0 ∈ int(B), Lemma 3.44 implies that γ_B is finite valued everywhere. To confirm that γ_B is a norm, all that remains to be checked is that γ_B(x) = γ_B(−x) for all x ≠ 0. Suppose to the contrary that γ_B(x) > γ_B(−x) (note that this is without loss of generality). This implies that γ_B((1/γ_B(−x))x) > 1. Therefore, (1/γ_B(−x))x ∉ B by Theorem 3.40, part 2. However, γ_B(−(1/γ_B(−x))x) = (1/γ_B(−x))γ_B(−x) = 1, showing that −(1/γ_B(−x))x ∈ B by Theorem 3.40, part 2. This contradicts the fact that B is centrally symmetric. Thus, γ_B is a norm on R^d. Moreover, by Theorem 3.40, part 2., B = {x ∈ R^d : γ_B(x) ≤ 1} = B_{γ_B}(0, 1).

Let us build towards a more computational approach to the gauge. First, let's give an explicit formula for the gauge of a halfspace containing the origin.

Example 3.46. Let H := H⁻(a, δ) be a halfspace defined by some a ∈ R^d and δ ∈ R such that 0 ∈ H⁻(a, δ). We assume that we have normalized δ to be 0 or 1. If δ = 0, then

    γ_H(r) = 0 if ⟨a, r⟩ ≤ 0,    γ_H(r) = +∞ if ⟨a, r⟩ > 0.

If δ = 1, then γ_H(r) = max{0, ⟨a, r⟩}.

The above calculation, along with the next theorem, gives powerful computational tools for gauge functions.

Theorem 3.47. Let C_i, i ∈ I be a (not necessarily finite) family of closed, convex sets containing the origin, and let C = ∩_{i∈I} C_i. Then

    γ_C = sup_{i∈I} γ_{C_i}.

Proof. Consider any r ∈ R^d. Define A_i = {λ > 0 : r ∈ λC_i} for each i ∈ I, and define A = {λ > 0 : r ∈ λC}. Observe that A = ∩_{i∈I} A_i. If some A_i is empty, then γ_{C_i}(r) = ∞, and A is empty, so γ_C(r) = ∞ as well, and the equality holds. Now suppose all the A_i are nonempty, so by Lemma 3.38 each A_i is of the form (a_i, ∞) or [a_i, ∞). If A = ∅, then it must mean that sup_{i∈I} a_i = ∞. Since γ_{C_i}(r) = inf A_i = a_i, this shows that sup_{i∈I} γ_{C_i}(r) = ∞; moreover, A = ∅ implies that γ_C(r) = inf A = +∞. Finally, consider the case that A is nonempty. Then, since A = ∩_{i∈I} A_i, A must be of the form (a, +∞) or [a, +∞), where a := sup_{i∈I} a_i. Then γ_C(r) = a = sup_{i∈I} a_i = sup_{i∈I} γ_{C_i}(r).

This shows that gauge functions for polyhedra can be computed very easily.

Corollary 3.48. Let P be a polyhedron containing the origin in its interior. Thus, there exist a¹, ..., a^m ∈ R^d such that

    P = {x ∈ R^d : ⟨a^i, x⟩ ≤ 1, i = 1, ..., m}.

Then

    γ_P(r) = max{0, ⟨a¹, r⟩, ..., ⟨a^m, r⟩}.

Proof. Use the formula from Example 3.46 and Theorem 3.47.
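Corollary 3.48 translates directly into code. The following sketch (ours, with illustrative names) assumes P is stored as the rows of a matrix A, so that P = {x : ⟨aⁱ, x⟩ ≤ 1 for each row aⁱ}:

    import numpy as np

    def gauge_polyhedron(A, r):
        # gamma_P(r) = max{0, <a^1, r>, ..., <a^m, r>}  (Corollary 3.48)
        return max(0.0, float(np.max(A @ r)))

    # The unit l-infinity ball is {x : x_i <= 1, -x_i <= 1}; its gauge is the l-infinity norm.
    A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
    r = np.array([0.5, -2.0])
    print(gauge_polyhedron(A, r))   # 2.0, which equals np.linalg.norm(r, np.inf)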

Support functions. While gauges are good in the sense that they are a nice generalization of norms from centrally symmetric convex bodies to asymmetric convex bodies, there is a drawback. Gauges are a strict subset of sublinear functions because they are always nonnegative, while there are many sublinear functions that take negative values. We would like to establish a one-to-one correspondence between sublinear functions and all closed, convex sets. Note that the correspondence via the epigraph only establishes a correspondence with closed, convex cones, and even then not all closed, convex cones are covered. The right definition, it turns out, is inspired by optimization of linear functions over closed, convex sets.

Definition 3.49. Let S ⊆ R^d be any set. The support function of S is the function on R^d defined as

    σ_S(r) = sup_{x∈S} ⟨r, x⟩.

The following is easy to verify, and aspects of it were already explored in the midterm and HWs.

Proposition 3.50. Let S ⊆ R^d. Then

    σ_S = σ_{cl(S)} = σ_{conv(S)} = σ_{cl(conv(S))}.
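For a finite set S = {x¹, ..., x^k}, the supremum in Definition 3.49 is a maximum over the points, and by Proposition 3.50 the same computation gives the support function of conv(S). A small sketch (ours):

    import numpy as np

    def support_function(points, r):
        # points: k x d array of the x^i; sigma_S(r) = max_i <r, x^i>,
        # which equals sigma_conv(S)(r) by Proposition 3.50.
        return float(np.max(points @ r))

    # Vertices of the square [-1, 1]^2: its support function is the l_1 norm.
    S = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
    print(support_function(S, np.array([2.0, -3.0])))   # 5.0 = |2| + |-3|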

Proposition 3.51. Let S ⊆ R^d. Then σ_S is a closed, sublinear function, i.e., its epigraph is a closed, convex cone.

Proof. We first check that σ_S is sublinear. For positive homogeneity: for any r ∈ R^d and λ > 0,

    σ_S(λr) = sup_{x∈S} ⟨λr, x⟩ = sup_{x∈S} λ⟨r, x⟩ = λ sup_{x∈S} ⟨r, x⟩ = λσ_S(r).

For subadditivity: let r¹, r² ∈ R^d. Then

    σ_S(r¹ + r²) = sup_{x∈S} ⟨r¹ + r², x⟩
                = sup_{x∈S} (⟨r¹, x⟩ + ⟨r², x⟩)
                ≤ sup_{x∈S} ⟨r¹, x⟩ + sup_{x∈S} ⟨r², x⟩
                = σ_S(r¹) + σ_S(r²).

Since σ_S is the supremum of the linear functions ⟨·, x⟩, x ∈ S, its epigraph epi(σ_S) is the intersection of closed halfspaces, which shows that it is closed. The fact that it is a convex cone follows from Proposition 3.34.

We now establish a fundamental correspondence between gauges and support functions via polarity.

Theorem 3.52. Let C be a closed convex set containing the origin. Then

    γ_C = σ_{C°}.

Proof. Recall that C = (C°)° by Proposition 2.30, part 2. Unwrapping the definitions, this says that

    C = {x ∈ R^d : ⟨a, x⟩ ≤ 1 for all a ∈ C°} = ∩_{a∈C°} H⁻(a, 1).

By Theorem 3.47 and Example 3.46, we obtain that

    γ_C(r) = sup_{a∈C°} γ_{H⁻(a,1)}(r) = sup_{a∈C°} max{0, ⟨a, r⟩}.

Since 0 ∈ C°, the last term above can be written as sup_{a∈C°} ⟨a, r⟩ = σ_{C°}(r).

Example 3.53. Consider the polyhedron

    P = {x ∈ R² : −x₁ − x₂ ≤ 1, (1/2)x₁ − x₂ ≤ 1, −x₁ + (1/2)x₂ ≤ 1}.

From Corollary 3.48, we obtain that

    γ_P(r) = max{0, −r₁ − r₂, (1/2)r₁ − r₂, −r₁ + (1/2)r₂},

and by Theorem 3.40, part 2., we obtain that P = {x ∈ R² : γ_P(x) ≤ 1}. Now consider the function

    f(r) = max{−r₁ − r₂, (1/2)r₁ − r₂, −r₁ + (1/2)r₂}.

It turns out that P = {x ∈ R² : f(x) ≤ 1} as well, because

    x ∈ P ⇔ −x₁ − x₂ ≤ 1, (1/2)x₁ − x₂ ≤ 1, −x₁ + (1/2)x₂ ≤ 1
          ⇔ max{−x₁ − x₂, (1/2)x₁ − x₂, −x₁ + (1/2)x₂} ≤ 1
          ⇔ f(x) ≤ 1.

Notice that f((1, 1)) = −1/2 ≠ 0 = γ_P((1, 1)). Also, f is sublinear because f is the support function of the set S = {(−1, −1), (1/2, −1), (−1, 1/2)}. This shows that Corollary 3.42 really breaks down if the assumption of compactness is removed. Even so, given a closed, convex set C, any sublinear function that has the set C as its 1-sublevel set must match the gauge on R^d \ int(rec(C)) (see Problem 7 from "HW for Week IX"). If you are interested in learning more about representing closed, convex sets as sublevel sets of sublinear functions, please see [1] for exciting new results.

Generalized Cauchy-Schwarz/Holder's inequality. Using our relationship between norms, gauges and support functions, we can write an inequality which vastly generalizes Holder's inequality (and consequently, the Cauchy-Schwarz inequality) – see Proposition 2.32.

Theorem 3.54. Let C ⊆ R^d be a compact, convex set containing the origin in its interior. Then

    ⟨x, y⟩ ≤ γ_C(x) σ_C(y) for all x, y ∈ R^d.

Proof. Consider any x, y ∈ R^d. Since C is compact, γ_C(x) > 0 by Corollary 3.41, and σ_C(y) < +∞. By Proposition 3.39, x/γ_C(x) ∈ C, and therefore

    ⟨x/γ_C(x), y⟩ ≤ sup_{z∈C} ⟨z, y⟩ = σ_C(y).

This immediately implies ⟨x, y⟩ ≤ γ_C(x)σ_C(y).

Corollary 3.55. Let C ⊆ R^d be a compact, convex set containing the origin in its interior. Then

    ⟨x, y⟩ ≤ γ_C(x) γ_{C°}(y) for all x, y ∈ R^d.

Proof. Follows from Theorems 3.54 and 3.52.

The above corollary generalizes Holder's inequality: recall that when 1/p + 1/q = 1, the ℓ^p and ℓ^q unit balls are polars of each other. Note that Theorem 3.54 and Corollary 3.55 have no assumption of central symmetry, so they strictly generalize the norm inequalities of Holder and Cauchy-Schwarz.
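As a quick numerical sanity check (ours), the case of the ℓ¹ and ℓ^∞ unit balls, which are polar to each other, recovers the Holder inequality ⟨x, y⟩ ≤ ‖x‖₁ ‖y‖_∞, since the gauge of a unit norm ball is the norm itself (Theorem 3.45):

    import numpy as np

    rng = np.random.default_rng(0)
    for _ in range(1000):
        x, y = rng.normal(size=3), rng.normal(size=3)
        # <x, y> <= gamma_C(x) * gamma_{C polar}(y), with C the l_1 unit ball
        assert x @ y <= np.linalg.norm(x, 1) * np.linalg.norm(y, np.inf) + 1e-12
    print("Corollary 3.55 verified on random samples")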

One-to-one correspondence between closed, convex sets and closed, sublinear functions. Proposition 3.51 shows that support functions are closed, sublinear functions. Proposition 3.50 shows that two different sets, e.g., S and conv(S), may give rise to the same sublinear function σ_S = σ_{conv(S)} via the support function construction. In other words, if we consider the mapping S ↦ σ_S as a mapping from the family of subsets of R^d to the family of closed, sublinear functions, this mapping is not injective. But if we restrict to closed, convex sets, it can be shown that this mapping is injective.

Exercise 8. Let C₁, C₂ be closed, convex sets. Then σ_{C₁} = σ_{C₂} if and only if C₁ = C₂.

A natural question now is whether the mapping C ↦ σ_C from the family of closed, convex sets to the family of closed, sublinear functions is onto. The answer is yes! Thus, all closed, sublinear functions are support functions and vice versa.

Theorem 3.56. Let f : R^d → R ∪ {+∞} be a sublinear function that is also closed. Then the set

    C_f := {x ∈ R^d : ⟨r, x⟩ ≤ f(r) for all r ∈ R^d} = ∩_{r∈R^d} H⁻(r, f(r))    (3.4)

is a closed, convex set. Moreover, σ_{C_f} = f.

Conversely, if C is a closed, convex set, then C_{σ_C} = C.

Proof. We will prove the assertion when f is finite valued everywhere; the proof for general f is more tedious and does not provide any additional insight, in our opinion, and will be skipped here.

Since C_f is defined as the intersection of a family of halfspaces (indexed by R^d), C_f is a closed, convex set. We now establish that σ_{C_f} = f. For any r ∈ R^d, since C_f ⊆ H⁻(r, f(r)), we must have σ_{C_f}(r) = sup_{x∈C_f} ⟨r, x⟩ ≤ f(r). To show that σ_{C_f}(r) ≥ f(r), it suffices to exhibit y ∈ C_f such that ⟨r, y⟩ = f(r). Consider epi(f), which by Proposition 3.34 is a closed convex cone (since f is assumed to be closed). By Theorem 2.23, there exists a supporting hyperplane for epi(f) at (r, f(r)). Let this hyperplane be defined by (y, η) ∈ R^d × R and α ∈ R such that epi(f) ⊆ H⁻((y, η), α). Using Problems 8 and 9 from "HW for Week IX", one can assume that α = 0 and η < 0. After normalizing, this means that epi(f) ⊆ H⁻((y/(−η), −1), 0). This implies that for every r' ∈ R^d, (r', f(r')) ∈ H⁻((y/(−η), −1), 0), which implies that ⟨r', y/(−η)⟩ ≤ f(r') for all r' ∈ R^d. So y/(−η) ∈ C_f. Moreover, since H((y/(−η), −1), 0) is a supporting hyperplane at (r, f(r)), we must have ⟨r, y/(−η)⟩ − f(r) = 0. So we are done.

We now show that C_{σ_C} = C for any closed, convex set C. Consider any x ∈ C. Then ⟨r, x⟩ ≤ sup_{y∈C} ⟨r, y⟩ = σ_C(r). Therefore, x ∈ H⁻(r, σ_C(r)) for all r ∈ R^d. This shows that x ∈ C_{σ_C}, and therefore C ⊆ C_{σ_C}. To show the reverse inclusion, consider any y ∉ C. Since C is a closed, convex set, there exists a separating hyperplane H(a, δ) such that C ⊆ H⁻(a, δ) and ⟨a, y⟩ > δ. C ⊆ H⁻(a, δ) implies that σ_C(a) = sup_{x∈C} ⟨a, x⟩ ≤ δ. Since C_{σ_C} has ⟨a, x⟩ ≤ σ_C(a) as a defining halfspace, and ⟨a, y⟩ > δ ≥ σ_C(a), we observe that y ∉ C_{σ_C}.

d 1374 Proposition 3.57. Let f : R R be a sublinear function, and let Cf be defined as in Theorem 3.56. → ◦ d ◦ 1375 Then y Cf if and only if (y, 1) epi(f) . In other words, Cf = y R :(y, 1) epi(f) . ∈ − ∈ { ∈ − ∈ } Proof. We simply observe the following equivalences.

d y Cf r, y f(r) r R ∈ ⇔ h i ≤ ∀ ∈ r, y t r Rd, t R such that f(r) t ⇔ h i ≤ ∀ ∈ ∈ ≤ r, y t 0 r Rd, t R such that f(r) t ⇔ h i − ≤ ∀ ∈ ∈ ≤ (r, t), (y, 1) 0 r Rd, t R such that f(r) t ⇔ h(y, 1), (r−, t)i ≤ 0 ∀(r∈, t) epi(∈ f) ≤ ⇔( hy, −1) epi(fi) ≤◦ ∀ ∈ ⇔ − ∈


[Figure 1: Illustration of Propositions 3.57 and 3.58. (a) A general sublinear function: epi(f), its polar epi(f)°, and C_f recovered as the slice epi(f)° ∩ {t = −1}. (b) A nonnegative sublinear function: additionally, (C_f)° appears as the slice epi(f) ∩ {t = 1}.]

When f is a nonnegative sublinear function, even more can be said.

Proposition 3.58. Let f : R^d → R be a sublinear function that is nonnegative everywhere, and let C_f be defined as in Theorem 3.56. Then f = γ_{(C_f)°}, i.e., f is the gauge function of (C_f)°. Consequently, (C_f)° = {y ∈ R^d : (y, 1) ∈ epi(f)} = {y ∈ R^d : f(y) ≤ 1}.

Proof. Since f ≥ 0, epi(f) ⊆ {(r, t) : t ≥ 0}. Therefore, (0, −1) ∈ epi(f)°. By Proposition 3.57, 0 ∈ C_f. Moreover, by Theorems 3.56 and 3.52, f = σ_{C_f} = γ_{(C_f)°}. By Theorem 3.40, part 2., this shows that (C_f)° = {y ∈ R^d : f(y) ≤ 1}. By Problem 10 from the "HW for Week XI", we have that (C_f)° = {y ∈ R^d : (y, 1) ∈ epi(f)} = {y ∈ R^d : f(y) ≤ 1}.

3.6 Directional derivatives, subgradients and subdifferential calculus

Let us look at directional derivatives of convex functions more closely. Let f : R^d → R be any function and let x ∈ R^d and r ∈ R^d. We define the directional derivative of f at x in the direction r as

    f'(x; r) := lim_{t↓0} (f(x + tr) − f(x))/t,    (3.5)

if that limit exists. We will be speaking of f'(x; ·) as a function from R^d → R. When the function f is convex, this function has very nice properties.
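Numerically, (3.5) can be approximated with a small t > 0; for convex f, Lemma 3.59 below guarantees that the difference quotient decreases monotonically to f'(x; r) as t ↓ 0, so the approximation is always an upper bound. A tiny sketch (ours) for f = ‖·‖₁:

    import numpy as np

    def dir_deriv(f, x, r, t=1e-8):
        # One-sided difference quotient; for convex f this upper bounds
        # f'(x; r) and converges to it as t decreases to 0 (Lemma 3.59).
        return (f(x + t * r) - f(x)) / t

    f = lambda x: np.linalg.norm(x, 1)
    x, r = np.array([1.0, 0.0]), np.array([-1.0, 1.0])
    print(dir_deriv(f, x, r))   # ~0.0: the decrease in |x_1| exactly cancels the increase in |x_2|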

Lemma 3.59. If f : R^d → R is convex, the expression (f(x + tr) − f(x))/t is a non-decreasing function of t, for t ≠ 0.

Proof. By Proposition 3.26, the function φ(t) = f(x + tr) is a convex function. By Proposition 3.24, we observe that (φ(t) − φ(0))/t is a non-decreasing function of t.

Proposition 3.60. Let f : R^d → R be a convex function, and let x ∈ R^d. Then the limit in (3.5) exists and is finite for all r ∈ R^d, and the function f'(x; ·) : R^d → R is sublinear.

Proof. By Proposition 3.26, the function φ(t) = f(x + tr) is a convex function, and f'(x; r) = lim_{t↓0} (φ(t) − φ(0))/t. By Lemma 3.59, (φ(t) − φ(0))/t is a non-decreasing function of t for t ≠ 0 and, restricting to t > 0, it is lower bounded by its value at t = −1, i.e., by (φ(−1) − φ(0))/(−1). Therefore, lim_{t↓0} (φ(t) − φ(0))/t exists and is in fact equal to inf_{t>0} (φ(t) − φ(0))/t.

We now prove positive homogeneity of f'(x; ·). For any r ∈ R^d and λ > 0, we obtain that

    f'(x; λr) = lim_{t↓0} (f(x + tλr) − f(x))/t
              = λ lim_{t↓0} (f(x + tλr) − f(x))/(λt)
              = λ lim_{t'↓0} (f(x + t'r) − f(x))/t'
              = λ f'(x; r).

We next establish that f'(x; ·) is convex. Consider any r¹, r² ∈ R^d and λ ∈ (0, 1).

    f'(x; λr¹ + (1−λ)r²) = lim_{t↓0} (f(x + t(λr¹ + (1−λ)r²)) − f(x))/t
        = lim_{t↓0} (f(λ(x + tr¹) + (1−λ)(x + tr²)) − λf(x) − (1−λ)f(x))/t
        ≤ lim_{t↓0} (λf(x + tr¹) + (1−λ)f(x + tr²) − λf(x) − (1−λ)f(x))/t
        = λ lim_{t↓0} (f(x + tr¹) − f(x))/t + (1−λ) lim_{t↓0} (f(x + tr²) − f(x))/t
        = λf'(x; r¹) + (1−λ)f'(x; r²),

where the inequality follows from convexity of f. By Proposition 3.32, the function f'(x; ·) is sublinear.

Theorem 3.61. Let f : Rd R be a convex function, and let x Rd. Then → ∈

∂f(x) = Cf 0(x;·).

0 1403 In other words, f (x; ) is the support function for the subdifferential ∂f(x). · Proof. Recall from Definitions 3.14 and 3.17 that

∂f(x) = s Rd : s, y x f(y) f(x) y Rd { ∈ h − i ≤ − ∀ ∈ } = s Rd : s, r f(x + r) f(x) r Rd . { ∈ h i ≤ − ∀ ∈ } Thus, we have the following equivalences.

s ∂f(x) s, r f(x + r) f(x) r Rd ∈ ⇔ h i ≤ − ∀ ∈ s, tr f(x + tr) f(x) r Rd, t > 0 ⇔ h i ≤f(x+tr)−f(x−) ∀ ∈ s, r r Rd, t > 0 ⇔ h i ≤ t ∀ ∈ s, r f 0(x; r) r Rd ⇔ h i ≤ ∀ ∈ d s Cf 0(x;r) r R , ⇔ ∈ ∀ ∈ f(x+tr)−f(x) 1404 where the second-to-last equivalence follows the fact that t is a decreasing function of t by 1405 Lemma 3.59, and the last equivalence follows from the definition of Cf 0(x;r) in (3.4).

A characterization of differentiability for convex functions can be obtained using these concepts.

Theorem 3.62. Let f : R^d → R be a convex function, and let x ∈ R^d. Then the following are equivalent.

(i) f is differentiable at x.

(ii) f'(x; ·) is a linear function, given by f'(x; r) = ⟨a_x, r⟩ for some a_x ∈ R^d.

[Figure 2: A picture illustrating the relationship between the sublinear function f'(x; ·), the set C_{f'(x;·)} = ∂f(x), an affine support hyperplane f(x) + ⟨s, y − x⟩ ≤ f(y) given by an element s ∈ ∂f(x), and the polar epi(f'(x; ·))° with its slice at t = −1. Recall the relationships from Figure 1.]

(iii) ∂f(x) is a singleton, i.e., there is a unique subgradient for f at x.

Moreover, if any of the above conditions hold, then ∇f(x) = a_x = s, where s is the unique subgradient in ∂f(x).

Proof. (i) ⇒ (ii). If f is differentiable, then it is well known from calculus that f'(x; r) = ⟨∇f(x), r⟩; thus, setting a_x = ∇f(x) suffices.

(ii) ⇒ (iii). By Theorem 3.61 and (3.4), we obtain that

    ∂f(x) = C_{f'(x;·)} = {s ∈ R^d : ⟨s, r⟩ ≤ f'(x; r) for all r ∈ R^d}
                       = {s ∈ R^d : ⟨s, r⟩ ≤ ⟨a_x, r⟩ for all r ∈ R^d}.

We now observe that if ⟨s, r⟩ ≤ ⟨a_x, r⟩ for all r ∈ R^d, then we must have s = a_x. Therefore, ∂f(x) = {a_x}.

(iii) ⇒ (i). Let s be the unique subgradient at x. We will establish that

    lim_{h→0} |f(x + h) − f(x) − ⟨s, h⟩| / ‖h‖ = 0,

thus showing that f is differentiable at x with gradient s. In other words, given any δ > 0, we must find ε > 0 such that h ∈ B(0, ε) implies |f(x + h) − f(x) − ⟨s, h⟩|/‖h‖ < δ.

Suppose to the contrary that for some δ > 0 and every k ≥ 1 there exists h_k such that ‖h_k‖ =: t_k ≤ 1/k and |f(x + h_k) − f(x) − ⟨s, h_k⟩|/t_k ≥ δ. Since h_k/t_k is a sequence of unit norm vectors, by Theorem 1.10 there is a convergent subsequence which converges to some r with unit norm. To keep the notation easy, we relabel indices so that {h_k/t_k}_{k=1}^∞ is the convergent sequence. Using Theorem 3.21, there exists a constant L := L(B(0, 1)) such that |f(y) − f(z)| ≤ L‖y − z‖ for all y, z ∈ B(0, 1). Noting that h_k and t_k r are in the unit ball B(0, 1) for all k ≥ 1 (since t_k ≤ 1/k),

    δ ≤ |f(x + h_k) − f(x) − ⟨s, h_k⟩| / t_k
      ≤ ( |f(x + h_k) − f(x + t_k r)| + |f(x + t_k r) − f(x) − ⟨s, t_k r⟩| + |⟨s, t_k r⟩ − ⟨s, h_k⟩| ) / t_k
      ≤ L‖t_k r − h_k‖/t_k + |f(x + t_k r) − f(x) − ⟨s, t_k r⟩|/t_k + |⟨s, t_k r − h_k⟩|/t_k
      ≤ L‖r − h_k/t_k‖ + |(f(x + t_k r) − f(x))/t_k − ⟨s, r⟩| + ‖s‖ ‖r − h_k/t_k‖
      = (L + ‖s‖) ‖r − h_k/t_k‖ + |(f(x + t_k r) − f(x))/t_k − ⟨s, r⟩|,

where the Cauchy-Schwarz inequality is used in the last inequality. We now let k → ∞. The first term in the last expression goes to 0, since h_k/t_k converges to r. In the second term, (f(x + t_k r) − f(x))/t_k goes to its limit, which is the directional derivative f'(x; r). By Theorem 3.61, f'(x; r) = sup_{y∈∂f(x)} ⟨y, r⟩ = ⟨s, r⟩, because by assumption ∂f(x) = {s}. Thus, the second term also goes to 0. This contradicts δ > 0.

1425 Theorem 3.63. Subdifferential calculus. The following are all true.

d 1. Let f1, f2 : R R be convex functions and let t1, t2 0. Then → ≥ d ∂(t1f1 + t2f2)(x) = t1∂f1(x) + t2∂f2(x) for all x R . ∈ 2. Let A Rm×d and b Rm and let T (x) = Ax + b be the corresponding affine map from Rd Rm ∈ ∈ → and let g : Rm R be a convex function. Then → T d ∂(g T )(x) = A ∂g(Ax + b) for all x R . ◦ ∈

53 d 3. Let fj : R R, j J be convex functions for some (possibly infinite) index set J, and let f = → ∈ supj∈J fj. Then cl(conv( ∂fj(x))) ∂f(x), ∪j∈J(x) ⊆ 1426 where J(x) is the set of indices j such that fj(x) = f(x). Moreover, equality holds in the above 1427 relation, if one can impose a structure on J such that J(x) is a compact set.
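Rule 3. is the workhorse in practice: for a pointwise maximum of finitely many affine functions, f(x) = max_j (⟨a_j, x⟩ + b_j), the set J(x) of active indices is finite and each active a_j is a subgradient. A small sketch (ours, with illustrative names):

    import numpy as np

    def max_affine_oracle(A, b, x):
        # f(x) = max_j <a_j, x> + b_j, with the a_j as rows of A.
        # Any row attaining the max is a subgradient (Theorem 3.63, rule 3.).
        vals = A @ x + b
        j = int(np.argmax(vals))
        return float(vals[j]), A[j]

    A = np.array([[1.0, 0.0], [-1.0, 0.0]])
    b = np.array([-1.0, 2.0])
    print(max_affine_oracle(A, b, np.array([0.0, 0.0])))   # (2.0, array([-1., 0.]))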

4 Optimization

We now begin our study of the general convex optimization problem

    inf_{x∈C} f(x),    (4.1)

where f : R^d → R is a convex function and C is a closed, convex set. We first observe that local minimizers are global minimizers for convex optimization problems.

Definition 4.1. Let g : R^d → R be any function (not necessarily convex) and let X ⊆ R^d be any set (not necessarily convex). Then x* ∈ X is said to be a local minimizer for the problem inf_{x∈X} g(x) if there exists ε > 0 such that g(y) ≥ g(x*) for all y ∈ B(x*, ε) ∩ X.

x* ∈ X is said to be a global minimizer if g(y) ≥ g(x*) for all y ∈ X.

1439 Theorem 4.2. Any local minimizer for (4.1) is a global minimizer.

? ? ? 1440 Proof. Let x be a local minimizer, i.e., there exists  > 0 such that f(y) f(x ) for all y B(x , ) C. ? ≥ ? ∈ ∩ 1441 Suppose to the contrary that there exists y¯ C such that f(y¯) < f(x ). Then y¯ B(x , ); otherwise, it ? ∈ ? 6∈ ? 1442 would contradict f(y) f(x ) for all y B(x , ) C. Consider the line segment [x , y¯]. It must intersect ? ≥ ? ∈ ∩ ? 1443 B(x , ) in a point other than x . Therefore, there exists 1 > λ > 0 such that x¯ = λx + (1 λ)y¯ is in ? ? ? − 1444 B(x , ). By convexity of f, f(x¯) λf(x ) + (1 λ)f(y¯). Since λ (0, 1) and f(y¯) < f(x ), this implies ? ≤ − ∈ ? 1445 that f(x¯) < f(x ). Moreover, since C is convex, x¯ C, and so x¯ B(x , ) C. This contradicts that ? ? ∈ ∈ ∩ 1446 f(y) f(x ) for all y B(x , ) C. ≥ ∈ ∩ 1447 We now give a characterization of global minimizers of (4.1) in terms of the local geometry of C and 1448 the first order properties of f, i.e., its subdifferential ∂f. We first need some concepts related to the local 1449 geometry of a convex set.

Definition 4.3. Let C Rd be a convex set, and let x C. Define the cone of feasible directions as ⊆ ∈ d FC (x) = r R :  > 0 such that x + r C . { ∈ ∃ ∈ } 2 1450 FC (x) may not be a closed cone – consider C as the unit circle in R and x = ( 1, 0); then FC (x) = 2 − 1451 r R : r1 > 0 0 . It is much nicer to work with its closure. { ∈ } ∪ { } d 1452 Definition 4.4. Let C R be a convex set, and let x C. The tangent cone of C at x is TC (x) := ⊆ ∈ 1453 cl(FC (x)).

1454 The final concept related to the local geometry of closed, convex sets will be the normal cone.

d 1455 Definition 4.5. Let C R be a convex set, and let x C. The normal cone of C at x is NC (x) := r d ⊆ ∈ { ∈ 1456 R : r, x r, y y C . h i ≥ h i ∀ ∈ }

54 d 1457 The normal cone NC (x) is the set of vectors r R such that x is the maximizer over C for the ∈ d 1458 corresponding linear functional r, , i.e., r, x = supy∈C r, y . Moreover, since NC (x) = r R : h ·i h i h i { ∈ 1459 r, y x 0 y C which is an intersection of halfspaces with the origin on the boundary, it is h − i ≤ ∀ ∈ } 1460 immediate that NC is a closed, convex cone. Note that any nonzero vector r NC (x) defines a supporting ∈ 1461 hyperplane H(r, r, x ) at x. h i d 1462 Proposition 4.6. Let C R be a convex set, and let x C. Then FC (x),TC (x) and NC (x) are all convex ⊆ ∈ ◦ 1463 cones, with TC (x),NC (x) being closed, convex cones. Moreover, NC (x) = TC (x) , i.e., the tangent cone and 1464 the normal cone are polars of each other.

1465 Proof. See Problem4 in “HW for Week X”.

1466 We are now ready to state the characterization of a global minimizer of (4.1), in terms of the local 1467 geometry of C and the first-order information of f. d 1468 Theorem 4.7. Let f : R R be a convex function, and C be a closed, convex set. Then the following are → 1469 all equivalent. ? 1470 1. x is a global minimizer of (4.1).

0 ? ? 1471 2. f (x ; y x ) 0 for all y C. − ≥ ∈ 0 ? ? 1472 3. f (x ; r) 0 for all r TC (x ). ≥ ∈ ? ? 1473 4. 0 ∂f(x ) + NC (x ). ∈ ? ? ? 1474 Proof. 1. = 2. Since f(z) f(x ) for all z C, in particular this holds for z = x + t(y x ) for all ⇒ f(x?+t(y≥−x?))−f(x?) ∈ − 1475 0 t 1. Therefore, t 0 for all t (0, 1). Taking the limit as t 0, we obtain that 0≤ ? ≤ ? ≥ ∈ → 1476 f (x ; y x ) 0. − ≥ 0 ? ? 1477 2. = 3. We first show that f (x ; r) 0 for all r FC (x). Let  > 0 such that y = x + r C. ⇒ 0 ? ? 0 ? ≥ 0 ∈ 0 ? ∈ 1478 By assumption, 0 f (x ; y x ) = f (x ; r) = f (x; r), using the positive homogeneity of f (x ; ), since 0 ? ≤ − 0 ? · ? 1479 f (x ; ) is sublinear by Proposition 3.60. Diving by , we obtain that f (x ; r) 0 for all r FC (x ). · 0 ? ≥ ∈ 1480 Since f (x ; ) is sublinear, it is convex by Proposition 3.32, and thus, it is continuous by Theorem 3.21. · 1481 Consequently, it must be nonnegative on TC (x) = cl(FC (x)), because it is nonnegative on FC (x). ? ? 1482 3. = 4. Suppose to the contrary that 0 ∂f(x ) + NC (x ). Since f is assumed to be finite-valued ⇒ d 6∈ ? 1483 everywhere, dom(f) = R . Thus, by Problem 15 in “HW for Week IX”, ∂f(x ) is a compact, convex set. ? 1484 Moreover, NC (x ) is a closed, convex cone by Proposition 4.6. Therefore, by Problem6 in “HW for Week ? ? 1485 II”, ∂f(x ) + NC (x ) is a closed, convex set. By the separating hyperplane theorem (Theorem 2.20), there d ? ? 1486 exist a R , δ R such that 0 = a, 0 > δ a, v for all v ∂f(x ) + NC (x ). ∈ ∈ h i ≥ h ?i ∈ ? 1487 First, we claim that a, n 0 for all n NC (x ). Otherwise, consider n¯ NC (x ) such that a, n¯ > 0. ? h i ≤ ∈ ? ∈ h i 1488 Since NC (x ) is a convex cone, λn¯ NC (x ) for all λ 0. But then consider any s ∂f(x) (which is ∈ ≥ ∈ 1489 nonempty by Problem 15 in “HW for Week IX”) and the set of points s + λn¯. Since a, n¯ > 0, we can find h i ? ? 1490 λ 0 large enough such that a, s + λn¯ > δ, contradicting that δ a, v for all v ∂f(x ) + NC (x ). ≥ h ? i ≥? h◦ i ? ∈ 1491 Since a, n 0 for all n NC (x ), we obtain that a NC (x ) = TC (x ), by Proposition 4.6. Now h i ≤ ? ∈ ? ? ∈ ? 1492 we use the fact that ∂f(x ) ∂f(x ) + NC (x ), since 0 NC (x ). This implies that a, s δ < 0 ? ⊆? ∈ h i ≤ 1493 for all s ∂f(x ). Since ∂f(x ) is a compact, convex set, this implies that sups∈∂f(x?) a, s < 0. From ∈ 0 ? h i 1494 Theorem 3.61, f (x ; a) = σ∂f(x?)(a) = sups∈∂f(x?) a, s < 0. This contradicts the assumption of 3., because ? h i 1495 we showed above that a TC (x ). ∈ ? ? ? ? 4. = 1. Consider any y C. Since 0 ∂f(x ) + NC (x ), there exist s ∂f(x ) and n NC (x ) such ⇒ ? ∈ ? ∈ ? ∈ ∈ that 0 = s + n. Now, y x TC (x ) and so y x , n 0 by Proposition 4.6. Since we have − ∈ h − i ≤ 0 = y x?, 0 = y x?, s + y x?, n , h − i h − i h − i ? ? ? ? 1496 this implies that y x , s 0. By definition of subgradient, f(y) f(x ) + s, y x f(x ). Since h − i ≥ ? ≥ h − i ≥ 1497 the choice of y C was arbitrary, this shows that x is a global minimizer. ∈

55 ? ? ? 1498 Corollary 4.8. Let x be a minimizer for (4.1). If x int(C), then 0 ∂f(x ). In particular, if f is real- d ∈ ∈ 1499 valued everywhere and C = R , then a minimizer of f must contain 0 in its subdifferential. Consequently, 1500 if f is differentiable everywhere, the gradient at the minimizer must be 0.

1501 Proof. This follows from the fact that for any convex set C and any y int(C), NC (y) = 0 (Why?). ∈ { }

1502 Algorithmic setup: First-order oracles. To tackle the problem (4.1) computationally, we have to set 1503 up a precise way to access the values/subgradients of the function f and test if given points belong to the 1504 set C or not. To make this algorithmically clean, we define first-order oracles.

d 1505 Definition 4.9. A first order oracle for a convex function f : R R is an oracle/algorithm/black-box that d → 1506 takes as input any x R and returns f(x) and some s ∂f(x). A first order oracle for a closed, convex set d ∈ ∈ d 1507 C R is an oracle/algorithm/black-box that takes as input any x R and either correctly reports that ⊆ ∈ d 1508 x C or correctly reports a separating hyperplane separating x from C, i.e., it returns a R , δ R such ∈ − ∈ ∈ 1509 that C H (a, δ) and a, x > δ. Such an oracle is also known as a separation oracle. ⊆ h i

1510 4.1 Subgradient algorithm

1511 To build up towards an algorithm that assumes only first-order oracles for f and C, we will first look at the d 1512 situation where we have a first order oracle for f, and a stronger oracle for C which, given any x R , can ∈ 1513 report the closest point in C to x (assuming C is nonempty). Recall that in the proof of Theorem 2.20, we 1514 had shown that such a closest point always exists as long as C is a nonempty, closed, convex set. In fact, 1515 the proof holds even for a closed set; convexity was not sued to show the existence of a closest point. We 1516 now strengthen the observation by showing that under the additional assumption of convexity, the closest 1517 point is unique.

d d 1518 Proposition 4.10. Let C R be a nonempty, closed, convex set and let x R . Then there is a unique ? ⊆ ? ∈ 1519 point x C such that x x x y for all y C. ∈ k − k ≤ k − k ∈ Proof. If x C, then the conclusion is true by setting x? = x. So we assume x C. Following the proof of Theorem ∈2.20, there exists a closest point x? C and a = x x? satisfies a, y6∈ x? 0 for all y C. Thus, ∈ − h − i ≤ ∈ x y 2 = a + (x? y) 2 = a 2 + a, x? y + x? y 2 > x? y 2, k − k k − k k k h − i k − k k − k ? 1520 where the last inequality follows form the fact that a = 0 and a, x y 0. 6 h − i ≥

1521 Definition 4.11. ProjC (x) will denote the unique closest point (under the standard Euclidean norm) in C 1522 to x.

d 1523 Note that an oracle that reports ProjC (x) for any x R is stronger than a separation oracle for C, ∈ 1524 because Proj (x) = x if and only if x C, and when Proj (x) = x, then one can use a = x Proj (x) C ∈ C 6 − C 1525 and δ = a, ProjC (x) as a separating hyperplane; see the proof of Theorem 2.20. Even so, for “simple” h i d 1526 sets C, computing ProjC (x) is not a difficult task. For example, when C = R+, then ProjC (x) = y, where 1527 y = max 0, xi for all i = 1, . . . , d. i { } 1528 We now give a simple and elegant algorithm to solve the problem (4.1) when one has access to an oracle d d 1529 that can output ProjC (x) for any x R , and a first-order oracle for f : R R. The algorithm does not ∈ → 1530 assume any properties beyond convexity for the function f (e.g., differentiability). Note that, in particular, n n 1531 when we have no constraints, i.e., C = R , then ProjC (x) = x for all x R . Therefore, this algorithm can ∈ 1532 be used for unconstrained optimization of general convex functions with only a first-order oracle for f.

56 1533 Subgradient Algorithm. 0 1534 1. Choose any sequence h0, h1,..., of strictly positive numbers. Let x C (which can be found by d ∈ 1535 taking an arbitrary point in R and projecting to C).

1536 2. For i = 0, 1, 2,..., do

i i i i 1537 (a) Use the first-order oracle for f to get some s ∂f(x ). If s = 0, then stop and report x as the ∈ 1538 optimal point. i+1 i si  1539 (b) Set x = Proj x hi i . C − ks k 0 1 1540 The points x , x ,... will be called the iterates of the Subgradient Algorithm. We now do a simple 1541 convergence analysis for the algorithm. First, a simple observation about the point ProjC (x). Lemma 4.12. Let C Rd be a closed, convex set, let x? C and x Rd (not necessarily in C). Then ⊆ ∈ ∈ Proj (x) x? x x? . k C − k ≤ k − k Proof. The interesting case is when x C. The proof of Theorem 2.20 shows that if we set a = x ProjC (x), then a, Proj (x) y 0 for all y 6∈C; in particular, a, Proj (x) x? 0. We now observe− that h C − i ≥ ∈ h C − i ≥ ? 2 ? 2 x x = x ProjC (x) + ProjC (x) x k − k k − ? 2 − k = a + ProjC (x) x k 2 − k ? 2 ? = a + ProjC (x) x + 2 a, ProjC (x) x kProjk (kx) x? 2,− k h − i ≥ k C − k ? 1542 since a, Proj (x) x 0. h C − i ≥ d ? Theorem 4.13. Let f : R R be a convex function, and let x arg minx∈C f(x) (i.e, we assume a minimizer exists for the problem).→ Suppose x0 B(x?,R) for some∈ real number R 0. Let M := M(B(x?,R)) be a Lipschitz constant for f, guaranteed∈ to exist by Theorem 3.21, i.e., ≥f(x) f(y) M x y for all x, y B(x?,R). Let x0, x1,... be the sequence of iterates obtained by| the Subgradient− | ≤ Algorithmk − k above and assume∈ that a zero subgradient was not reported in any iteration. Then, for every k 0, ≥ k R2 + P h2  min f(xi) f(x?) + M i=0 i . i=0,...,k ≤ Pk 2 i=0 hi

i i ? i ? hs ,x −x i i Proof. Define ri = x x and vi = i for i = 0, 1, 2,.... Note that vi 0 for all i 0 since s is k − k ks k ≥ ≥ a subgradient at xi and x? is the minimizer (Verify!!). We next observe that

2 i si  ? 2 ri+1 = ProjC x hi ksik x k i− − k i s ? 2 x hi ksik x by Lemma 4.12 ≤ k i − ? 2 − 2 k = x x + hi 2hivi k2 − 2 k − = r + h 2hivi i i − 1543 Adding these inequalities for i = 0, 1, . . . , k, we obtain that

k k 2 2 X 2 X r r + h 2 hivi. (4.2) k+1 ≤ 0 i − i=0 i=0

min 2 0 ? 2 2 Let vmin = mini=0,...,k vi and let i be such that vmin = vimin . Using the fact that r0 = x x R , and that r2 0, we obtain from (4.2) that k − k ≤ k+1 ≥ k k k X X 2 X 2 vmin(2 hi) 2 hivi R + h . ≤ ≤ i i=0 i=0 i=0

57 1544 Consequently, 2 Pk 2 R + i=0 hi vmin . (4.3) ≤ Pk 2 i=0 hi

min min min min si si , y = si , xi h i h i

min xi

vmin

x?

imin Figure 3: Using vmin to bound the function value. The line through x¯ and x represents the hyperplane min min min H := H(si , si , xi ). h i

min min min min min Consider the hyperplane H := H(si , si , xi ) passing through xi , orthogonal to si . Let x¯ ? h i ? be the point on H closest to x ; see Figure3. By Problem 12 in “HW for Week IX”, vmin = x¯ x . 0 ? ? k − k Moreover, vmin v0 x x R. Therefore, x¯ B(x ,R). Using the Lipschitz constant M, we obtain ≤? ≤ k − k ≤ imin ∈ imin imin imin that f(x¯) f(x ) + Mvmin. Finally, since s ∂f(x ), we must have that f(x¯) f(x ) + s , x¯ min ≤ min min ∈ ≥ h − xi = f(xi ), since x¯, xi H. Therefore, we obtain i ∈  2 Pk 2  i imin ? ? R + i=0 hi min f(x ) f(x ) f(x¯) f(x ) + Mvmin f(x ) + M , i=0,...,k ≤ ≤ ≤ ≤ Pk 2 i=0 hi

where the last inequality follows from (4.3).

If we fix the number of steps of the algorithm to be $N \in \mathbb{N}$, then the choice of $h_0, \ldots, h_N$ that minimizes $\frac{R^2 + \sum_{i=0}^N h_i^2}{2\sum_{i=0}^N h_i}$ is $h_i = \frac{R}{\sqrt{N+1}}$ for all $i = 0, \ldots, N$, which yields the following corollary.

Corollary 4.14. Let $f : \mathbb{R}^d \to \mathbb{R}$ be a convex function, and let $x^\star \in \arg\min_{x\in C} f(x)$. Suppose $x^0 \in B(x^\star, R)$ for some real number $R \ge 0$. Let $M := M(B(x^\star, R))$ be a Lipschitz constant for $f$. Let $N \in \mathbb{N}$ be any natural number, and set $h_i = \frac{R}{\sqrt{N+1}}$ for all $i = 0, \ldots, N$. Then the iterates of the Subgradient Algorithm, with this choice of $h_i$, satisfy
\[
\min_{i=0,\ldots,N} f(x^i) \le f(x^\star) + \frac{MR}{\sqrt{N+1}}.
\]

Turning this around, if we want to be within $\epsilon$ of the optimal value $f(x^\star)$ for some $\epsilon > 0$, we should run the Subgradient Algorithm for $\frac{M^2R^2}{\epsilon^2}$ iterates, with $h_i = \frac{\epsilon}{M}$.
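To make the algorithm concrete, here is a minimal NumPy sketch of the Subgradient Algorithm, assuming the caller supplies the two oracles (a subgradient oracle for $f$ and the projection map for $C$); the function names and the test problem are ours, and we track the best iterate seen since $f(x^i)$ need not decrease monotonically:

```python
import numpy as np

def subgradient_method(f, subgrad, proj, x0, steps):
    """Projected subgradient method for min f(x) over a closed convex set C.

    f:       the convex objective (used only to track the best iterate)
    subgrad: returns some s in the subdifferential of f at x (first-order oracle)
    proj:    returns Proj_C(x)
    steps:   iterable of strictly positive step sizes h_0, h_1, ...
    """
    x = proj(np.asarray(x0, dtype=float))
    best_x, best_val = x, f(x)
    for h in steps:
        s = subgrad(x)
        norm = np.linalg.norm(s)
        if norm == 0.0:                 # zero subgradient: x is optimal
            return x
        x = proj(x - h * s / norm)      # step 2(b) of the algorithm
        if f(x) < best_val:
            best_x, best_val = x, f(x)
    return best_x

# Example: minimize f(x) = ||x - c||_1 over C = R^3_+; the minimizer is max(c, 0).
c = np.array([1.0, -2.0, 3.0])
f = lambda x: np.sum(np.abs(x - c))
subgrad = lambda x: np.sign(x - c)      # a valid subgradient of the l1 distance
proj = lambda x: np.maximum(x, 0.0)
N = 1000
steps = np.full(N + 1, 1.0 / np.sqrt(N + 1))  # h_i = R/sqrt(N+1), taking R = 1 for illustration
print(subgradient_method(f, subgrad, proj, np.zeros(3), steps))  # approx. (1, 0, 3)
```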

If we theoretically let the algorithm run for infinitely many steps, we would hope to make the difference between $\min_i f(x^i)$ and $f(x^\star)$ go to $0$ in the limit. This, of course, depends on choosing the sequence $h_0, h_1, \ldots$ so that the expression $\frac{R^2 + \sum_{i=0}^k h_i^2}{2\sum_{i=0}^k h_i} \to 0$ as $k \to \infty$. There is a general sufficient condition that guarantees this.

Proposition 4.15. Let $\{h_i\}_{i=0}^\infty$ be a sequence of strictly positive real numbers such that $\lim_{i\to\infty} h_i = 0$ and $\sum_{i=1}^\infty h_i = \infty$ (e.g., $h_i = \frac{1}{i}$). Then, for any real number $R$,
\[
\lim_{k\to\infty} \frac{R^2 + \sum_{i=0}^k h_i^2}{2\sum_{i=0}^k h_i} = 0.
\]

Remark 4.16. Corollary 4.14 shows that the subgradient algorithm has a convergence rate that is independent of the dimension! No matter how large $d$ is, as long as one can access subgradients for $f$ and project to $C$, the number of iterations needed to converge to within $\epsilon$ is $O(\frac{1}{\epsilon^2})$. This is important to keep in mind for applications where the dimension is extremely large.

4.2 Generalized inequalities and convex mappings

We first review the notion of a partial order.

Definition 4.17. Let $X$ be any set. A partial order on $X$ is a binary relation $\mathcal{R}$ on $X$, i.e., a subset $\mathcal{R} \subseteq X \times X$, that satisfies certain conditions. We will denote $x \preceq y$ for $x, y \in X$ if $(x, y) \in \mathcal{R}$. The conditions are as follows:

1. $x \preceq x$ for all $x \in X$.
2. $x \preceq y$ and $y \preceq z$ implies $x \preceq z$.

3. $x \preceq y$ and $y \preceq x$ if and only if $x = y$.

We would like to be able to define partial orders on $\mathbb{R}^m$ for any $m \ge 1$. In doing so, we want to be mindful of the vector space structure of $\mathbb{R}^m$.

Definition 4.18. We will say that a binary relation $\preceq$ on $\mathbb{R}^m$ is a generalized inequality if it satisfies the following conditions.

1. $x \preceq x$ for all $x \in \mathbb{R}^m$.
2. $x \preceq y$ and $y \preceq z$ implies $x \preceq z$.

3. $x \preceq y$ and $y \preceq x$ if and only if $x = y$.
4. $x \preceq y$ implies $x + z \preceq y + z$ for all $z \in \mathbb{R}^m$.
5. $x \preceq y$ implies $\lambda x \preceq \lambda y$ for all $\lambda \ge 0$.

Generalized inequalities have an elegant geometric characterization.

Proposition 4.19. Let $K \subseteq \mathbb{R}^m$ be a closed, convex, pointed cone. Then the relation $\preceq_K$ on $\mathbb{R}^m$ defined by $x \preceq_K y$ if and only if $y - x \in K$ is a generalized inequality. In this case, we say that $\preceq_K$ is the generalized inequality induced by $K$.

Conversely, any generalized inequality $\preceq$ is induced by a unique convex cone, given by $K_\preceq = \{x \in \mathbb{R}^m : 0 \preceq x\}$. In other words, $\preceq$ is the same relation as $\preceq_{K_\preceq}$.

Proof. Left as an exercise.

Example 4.20. Here are some examples of generalized inequalities (a small numerical sketch follows the list).

1. $K = \mathbb{R}^m_+$ induces the generalized inequality $x \preceq_K y$ if and only if $x_i \le y_i$ for all $i = 1, \ldots, m$. This is often abbreviated to $x \le y$, and is sometimes called the "canonical" generalized inequality on $\mathbb{R}^m$.

2. $K = \{x \in \mathbb{R}^d : \sqrt{x_1^2 + \ldots + x_{d-1}^2} \le x_d\}$. This cone is called the Lorentz cone, and the corresponding generalized inequality is called a second order cone constraint (SOCC).

3. Let $m = n^2$ for some $n \in \mathbb{N}$, i.e., consider the space $\mathbb{R}^{n^2}$. Identifying $\mathbb{R}^{n^2} = \mathbb{R}^{n\times n}$ with some ordering of the coordinates, we think of $\mathbb{R}^{n^2}$ as the space of all $n \times n$ matrices. Let $K$ be the cone of all symmetric matrices that are positive semidefinite; see Definition 1.19. The corresponding generalized inequality on $\mathbb{R}^{n^2}$ is called the positive semidefinite cone constraint.
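As a quick illustration of the first two examples, $x \preceq_K y$ can be tested by checking membership of $y - x$ in $K$ (a small sketch; the function names are ours):

```python
import numpy as np

def leq_orthant(x, y):
    """x <=_K y for K = R^m_+, i.e., x_i <= y_i componentwise."""
    return bool(np.all(y - x >= 0))

def leq_lorentz(x, y):
    """x <=_K y for the Lorentz cone K = {z : sqrt(z_1^2+...+z_{d-1}^2) <= z_d}."""
    z = y - x
    return bool(np.linalg.norm(z[:-1]) <= z[-1])

print(leq_orthant(np.array([0.0, 1.0]), np.array([1.0, 1.0])))   # True
print(leq_lorentz(np.zeros(3), np.array([1.0, 1.0, 2.0])))       # True: sqrt(2) <= 2
print(leq_lorentz(np.zeros(3), np.array([2.0, 2.0, 1.0])))       # False
```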

We would like to extend the notion of convex functions to vector-valued maps, for which we will use the notion of generalized inequalities.

Definition 4.21. Let $\preceq_K$ be a generalized inequality on $\mathbb{R}^m$ induced by the cone $K$. We say that $G : \mathbb{R}^d \to \mathbb{R}^m$ is a $K$-convex mapping if

\[
G(\lambda x + (1-\lambda)y) \preceq_K \lambda G(x) + (1-\lambda)G(y)
\]
for all $x, y \in \mathbb{R}^d$ and $\lambda \in (0, 1)$.

Example 4.22. Here are some examples of $K$-convex mappings.

1. Let $K \subseteq \mathbb{R}^m$ be any closed, convex, pointed cone. If $G : \mathbb{R}^d \to \mathbb{R}^m$ is an affine map, i.e., there exist a matrix $A \in \mathbb{R}^{m\times d}$ and a vector $b \in \mathbb{R}^m$ such that $G(x) = Ax + b$, then $G$ is a $K$-convex mapping.

2. Let $m = n^2$ for some $n \in \mathbb{N}$, i.e., consider the space $\mathbb{R}^{n^2}$, and let $\preceq$ be the positive semidefinite cone constraint from part 3. of Example 4.20, i.e., induced by the cone $K$ of positive semidefinite matrices. Let $A_0, A_1, \ldots, A_d$ be fixed $p \times n$ matrices, for some $p \in \mathbb{N}$ (not necessarily equal to $n$). Define $G : \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}^{n^2}$ to be the mapping
\[
G(x, s) = (A_0 + x_1A_1 + \ldots + x_dA_d)^T(A_0 + x_1A_1 + \ldots + x_dA_d) - B,
\]

where $B$ is an arbitrary $n \times n$ matrix. Then $G$ is a $K$-convex mapping.

3. Let $K = \mathbb{R}^m_+$, and let $g_1, \ldots, g_m : \mathbb{R}^d \to \mathbb{R}$ be convex functions. Let $G : \mathbb{R}^d \to \mathbb{R}^m$ be defined as $G(x) = (g_1(x), \ldots, g_m(x))$; then $G$ is a $K$-convex mapping.

4.3 Convex optimization with generalized inequalities

We can now define a very general framework for convex optimization problems, which is more concrete than the abstraction level of black-box first-order oracles, but is still flexible enough to incorporate the majority of convex optimization problems that show up in practice.

Definition 4.23. Let $f : \mathbb{R}^d \to \mathbb{R}$ be a convex function, let $K \subseteq \mathbb{R}^m$ be a closed, convex, pointed cone, and let $G : \mathbb{R}^d \to \mathbb{R}^m$ be a $K$-convex mapping. Then $f, K, G$ define a convex optimization problem with generalized constraints, given as follows:

\[
\inf\{f(x) : G(x) \preceq_K 0\}. \tag{4.4}
\]

Problem 3 in "HW for Week XI" shows that the set $C = \{x \in \mathbb{R}^d : G(x) \preceq_K 0\}$ is a convex set when $G$ is a $K$-convex mapping. Thus, (4.4) is a special case of (4.1).

Example 4.24. Let us look at some concrete examples of (4.4).

1. Linear/Quadratic Programming. Let $f(x) = \langle c, x\rangle$ for some $c \in \mathbb{R}^d$, let $K = \mathbb{R}^m_+$ and let $G : \mathbb{R}^d \to \mathbb{R}^m$ be an affine map, i.e., $G(x) = Ax - b$ for some matrix $A \in \mathbb{R}^{m\times d}$ and a vector $b \in \mathbb{R}^m$. Then (4.4) becomes
\[
\inf\{\langle c, x\rangle : Ax \le b\},
\]
which is the problem of minimizing a linear function over a polyhedron. This is more commonly known as a linear program, in accordance with the fact that the objective and the constraints are all linear. If $f(x) = x^TQx + \langle c, x\rangle$, where $Q$ is a given $d \times d$ positive semidefinite matrix and $c \in \mathbb{R}^d$, then $f$ is a convex function (see Problem 14 from "HW for Week IX"). With $K$ and $G$ as above, (4.4) is called a convex quadratic program.

2. Semidefinite Programming. Let $m = n^2$ for some $n \in \mathbb{N}$ and consider the space $\mathbb{R}^{n^2}$. Let $f(x) = \langle c, x\rangle$ for some $c \in \mathbb{R}^d$, let $K \subseteq \mathbb{R}^{n^2}$ be the positive semidefinite cone, inducing the positive semidefinite cone constraint, and let $G : \mathbb{R}^d \to \mathbb{R}^{n^2}$ be an affine map, i.e., there exist $n \times n$ matrices $F_0, F_1, \ldots, F_d$ such that $G(x) = F_0 + x_1F_1 + \ldots + x_dF_d$. Then (4.4) becomes

\[
\inf\{\langle c, x\rangle : -F_0 - x_1F_1 - \ldots - x_dF_d \text{ is a PSD matrix}\}.
\]

This is known as a semidefinite program.

3. Convex optimization with explicit constraints. Let $f, g_1, \ldots, g_m : \mathbb{R}^d \to \mathbb{R}$ be convex functions. Define $K = \mathbb{R}^m_+$ and define $G : \mathbb{R}^d \to \mathbb{R}^m$ as $G(x) = (g_1(x), \ldots, g_m(x))$, which is the $K$-convex mapping from Example 4.22. Then (4.4) becomes

\[
\inf\{f(x) : g_1(x) \le 0, \ldots, g_m(x) \le 0\}.
\]

4.3.1 Lagrangian duality for convex optimization with generalized constraints

Given that the Subgradient Algorithm is a simple and elegant method for solving unconstrained problems, or problems with "simple" constraint sets $C$ (i.e., when one can compute $\mathrm{Proj}_C(x)$ efficiently), we will try to transform convex optimization problems with more complicated constraints into ones with simple constraints. This is the motivation for what is known as Lagrangian duality.

Note that problem (4.4) is equivalent to the problem

\[
\inf_{x\in\mathbb{R}^d} f(x) + I_{-K}(G(x)), \tag{4.5}
\]

where $I_{-K}$ is the indicator function for the cone $-K$. It can be shown that the function $I_{-K} \circ G$ is a convex function; see Problem 4 from "HW for Week XI". Thus, problem (4.5) is an unconstrained convex optimization problem. However, indicator functions are nasty to deal with because they are not finite valued, and thus, obtaining subgradients at all points becomes impossible. Thus, we try to replace $I_{-K}$ with a "nicer" penalty function $p : \mathbb{R}^m \to \mathbb{R}$, which is not that wildly discontinuous, and is finite-valued everywhere. So we would be looking at the problem

\[
\inf_{x\in\mathbb{R}^d} f(x) + p(G(x)). \tag{4.6}
\]

What properties should we require from our penalty function? First, we would like problem (4.6) to be a convex problem; thus, we impose that

\[
p \circ G : \mathbb{R}^d \to \mathbb{R} \text{ is a convex function.} \tag{4.7}
\]

Next, from an optimization perspective, we would like to have a guaranteed relationship between the function $f(x) + I_{-K}(G(x))$ and the function $f(x) + p(G(x))$. It turns out that a nice property to have is the guarantee that $f(x) + p(G(x)) \le f(x) + I_{-K}(G(x))$ for all $x \in \mathbb{R}^d$. This can be achieved by imposing that
\[
p \text{ is an underestimator of } I_{-K}, \text{ i.e., } p \le I_{-K}. \tag{4.8}
\]

Lagrangian duality theory is the study of penalty functions $p$ that are linear on $\mathbb{R}^m$ and satisfy the two conditions highlighted above. Now, a function $p : \mathbb{R}^m \to \mathbb{R}$ is linear if and only if there exists $c \in \mathbb{R}^m$ such that $p(z) = \langle c, z\rangle$. The following proposition characterizes linear functions that satisfy the two conditions above.

Proposition 4.25. Let $p : \mathbb{R}^m \to \mathbb{R}$ be a linear function given by $p(z) = \langle c, z\rangle$ for some $c \in \mathbb{R}^m$. Then the following are equivalent:

1. $p$ satisfies condition (4.8).

2. $c \in -K^\circ$, i.e., $-c$ is in the polar of $K$.

3. $p$ satisfies conditions (4.7) and (4.8).

Proof. (1. $\Rightarrow$ 2.) Condition (4.8) is equivalent to saying that $p(z) \le 0$ for all $z \in -K$, i.e.,
\[
\begin{array}{rl}
& \langle c, z\rangle \le 0 \text{ for all } z \in -K \\
\Leftrightarrow & \langle c, -z\rangle \le 0 \text{ for all } z \in K \\
\Leftrightarrow & \langle -c, z\rangle \le 0 \text{ for all } z \in K \\
\Leftrightarrow & -c \in K^\circ \\
\Leftrightarrow & c \in -K^\circ.
\end{array}
\]

(2. $\Rightarrow$ 3.) We showed above that $c \in -K^\circ$ is equivalent to condition (4.8). We now check that $c \in -K^\circ$ implies (4.7). Since $G$ is a $K$-convex mapping, we have $\lambda G(x) + (1-\lambda)G(y) - G(\lambda x + (1-\lambda)y) \in K$ for all $x, y \in \mathbb{R}^d$ and $\lambda \in (0, 1)$; since $-c \in K^\circ$, we have $\langle c, w\rangle \ge 0$ for every $w \in K$, and therefore
\[
\begin{array}{rl}
& \langle c, \lambda G(x) + (1-\lambda)G(y) - G(\lambda x + (1-\lambda)y)\rangle \ge 0 \\
\Rightarrow & \langle c, \lambda G(x)\rangle + \langle c, (1-\lambda)G(y)\rangle \ge \langle c, G(\lambda x + (1-\lambda)y)\rangle \\
\Rightarrow & \lambda\langle c, G(x)\rangle + (1-\lambda)\langle c, G(y)\rangle \ge \langle c, G(\lambda x + (1-\lambda)y)\rangle \\
\Rightarrow & \lambda p(G(x)) + (1-\lambda)p(G(y)) \ge p(G(\lambda x + (1-\lambda)y)).
\end{array}
\]

Hence, condition (4.7) is satisfied.

(3. $\Rightarrow$ 1.) Trivial.

Definition 4.26. The set $-K^\circ$ is important in Lagrangian duality, and a separate notation and name has been invented for it: $-K^\circ$ is called the dual cone of $K$ and is denoted by $K^\star$.

The above discussion shows that for any $y \in K^\star$, the optimal value of (4.6), with $p$ given by $p(z) = \langle y, z\rangle$, is a lower bound on the optimal value of (4.4). This motivates the definition of the so-called dual function $\mathcal{L} : \mathbb{R}^m \to \mathbb{R}$ associated with (4.4) as follows:
\[
\mathcal{L}(y) := \inf_{x\in\mathbb{R}^d} f(x) + \langle y, G(x)\rangle \tag{4.9}
\]

We state the lower bound property formally.

Proposition 4.27 (Weak Duality). Let $f : \mathbb{R}^d \to \mathbb{R}$ be convex, let $K \subseteq \mathbb{R}^m$ be a closed, convex, pointed cone, and let $G : \mathbb{R}^d \to \mathbb{R}^m$ be a $K$-convex mapping. Let $\mathcal{L} : \mathbb{R}^m \to \mathbb{R}$ be as defined in (4.9). Then, for all $\bar{x} \in \mathbb{R}^d$ such that $G(\bar{x}) \preceq_K 0$ and all $\bar{y} \in K^\star$, we must have $\mathcal{L}(\bar{y}) \le f(\bar{x})$. Consequently, $\mathcal{L}(\bar{y}) \le \inf\{f(x) : G(x) \preceq_K 0\}$.

Proof. We simply follow the inequalities

\[
\mathcal{L}(\bar{y}) = \inf_{x\in\mathbb{R}^d} f(x) + \langle \bar{y}, G(x)\rangle \le f(\bar{x}) + \langle \bar{y}, G(\bar{x})\rangle \le f(\bar{x}),
\]
where the last inequality holds because $G(\bar{x}) \preceq_K 0$ and $\bar{y} \in K^\star$, and so $\langle \bar{y}, G(\bar{x})\rangle \le 0$.

Proposition 4.27 shows that any $y \in K^\star$ provides the lower bound $\mathcal{L}(y)$ on the optimal value of the optimization problem (4.4). The Lagrangian dual optimization problem is the problem of finding the $y \in K^\star$ that provides the best/largest lower bound. In other words, the Lagrangian dual problem is defined as

\[
\sup_{y\in K^\star} \mathcal{L}(y), \tag{4.10}
\]

and Proposition 4.27 can be restated as

\[
\sup\{\mathcal{L}(y) : y \in K^\star\} \le \inf\{f(x) : G(x) \preceq_K 0\}. \tag{4.11}
\]

If we have equality in (4.11), then to solve (4.4), one can instead solve (4.10). This merits a definition.

Definition 4.28 (Strong Duality). We say that we have a zero duality gap if equality holds in (4.11). In addition, if the supremum in (4.10) is attained for some $y \in K^\star$, then we say that strong duality holds.

4.3.2 Solving the Lagrangian dual problem

Before we investigate conditions under which we have zero duality gap or strong duality, let us try to see how one could use the subgradient algorithm to solve (4.10).

Proposition 4.29. $\mathcal{L}(y)$ is a concave function of $y$.

Proof. We have to show that $-\mathcal{L}(y)$ is a convex function of $y$. This follows from the fact that

\[
-\mathcal{L}(y) = -\inf_{x\in\mathbb{R}^d} f(x) + \langle y, G(x)\rangle = \sup_{x\in\mathbb{R}^d} -f(x) + \langle y, -G(x)\rangle,
\]

i.e., $-\mathcal{L}(y)$ is the supremum of affine functions of $y$ of the form $-f(x) + \langle y, -G(x)\rangle$. By part 2. of Theorem 3.12, $-\mathcal{L}(y)$ is convex in $y$.

We could now use the subgradient algorithm to solve (4.10), if we had a first-order oracle for $-\mathcal{L}(y)$ and an algorithm to project to $K^\star$. We show that a subgradient for $-\mathcal{L}(y)$ can be found by solving an unconstrained convex optimization problem.

Proposition 4.30. Let $\bar{y} \in \mathbb{R}^m$ and let $\bar{x} \in \arg\inf_{x\in\mathbb{R}^d} f(x) + \langle \bar{y}, G(x)\rangle$. Then $-G(\bar{x}) \in \partial(-\mathcal{L})(\bar{y})$.

Proof. We express $-\mathcal{L}(y) = \sup_{x\in\mathbb{R}^d} -f(x) + \langle y, -G(x)\rangle$ as the supremum of affine functions, and use part 3. of Theorem 3.63, and the fact that the subdifferential of the affine function $-f(\bar{x}) + \langle y, -G(\bar{x})\rangle$, at $\bar{y}$, is simply $\{-G(\bar{x})\}$.

Now, if we have an algorithm that can compute $\mathrm{Proj}_{K^\star}(y)$ for all $y \in \mathbb{R}^m$, then using Propositions 4.29 and 4.30, one can solve the Lagrangian dual problem (4.10), where in each iteration of the algorithm, one solves the unconstrained problem $\inf_{x\in\mathbb{R}^d} f(x) + \langle \bar{y}, G(x)\rangle$ for a given $\bar{y} \in K^\star$. This inner problem can, in turn, be solved by the subgradient algorithm if one has the appropriate first-order oracles for $f(x)$ and $\langle \bar{y}, G(x)\rangle$.
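A sketch of this outer loop, i.e., projected supergradient ascent on $\mathcal{L}$ with a user-supplied inner minimization and projection onto $K^\star$ (all names here are illustrative; the inner solver could itself be the subgradient algorithm):

```python
import numpy as np

def dual_ascent(inner_min, G, proj_dual_cone, y0, steps):
    """Maximize the concave dual function L(y) = inf_x f(x) + <y, G(x)> over K*.

    inner_min(y):      returns some x_bar in arg inf_x f(x) + <y, G(x)>
    G(x):              the K-convex constraint mapping
    proj_dual_cone(y): projection onto the dual cone K*
    By Proposition 4.30, G(x_bar) is a supergradient of L at y, so we take
    normalized ascent steps along it, projecting back onto K* each time.
    """
    y = proj_dual_cone(np.asarray(y0, dtype=float))
    for h in steps:
        g = G(inner_min(y))          # supergradient of L at y
        norm = np.linalg.norm(g)
        if norm == 0.0:
            return y
        y = proj_dual_cone(y + h * g / norm)
    return y
```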

4.3.3 Explicit examples of the Lagrangian dual

We will now explore some special settings of convex optimization problems with generalized inequalities, and see that the Lagrangian dual has a particularly nice form.

Conic optimization. Let $K \subseteq \mathbb{R}^m$ be a closed, convex, pointed cone. Let $G : \mathbb{R}^d \to \mathbb{R}^m$ be an affine map given by $G(x) = Ax - b$, where $A \in \mathbb{R}^{m\times d}$ and $b \in \mathbb{R}^m$. Let $f : \mathbb{R}^d \to \mathbb{R}$ be a linear function given by $f(x) = \langle c, x\rangle$ for some $c \in \mathbb{R}^d$. Then Problem (4.4) becomes
\[
\inf\{\langle c, x\rangle : Ax \preceq_K b\}. \tag{4.12}
\]
For a fixed cone $K$, problems of the form (4.12) are called conic optimization problems over the cone $K$. As we pick different data $A, b, c$, we get different instances of a conic optimization problem over the cone $K$. A special case is when $K = \mathbb{R}^m_+$, which is known as linear programming or linear optimization (see Example 4.24), i.e., the problem of optimizing a linear function over a polyhedron.

Let us investigate the dual function of (4.12). Recall that $\mathcal{L}(y) = \inf_{x\in\mathbb{R}^d} f(x) + \langle y, G(x)\rangle$, which in this case becomes
\[
\inf_{x\in\mathbb{R}^d} \langle c, x\rangle + \langle y, Ax - b\rangle = \inf_{x\in\mathbb{R}^d} \langle c, x\rangle + \langle y, Ax\rangle - \langle y, b\rangle = \inf_{x\in\mathbb{R}^d} \langle c, x\rangle + \langle A^Ty, x\rangle - \langle y, b\rangle = \inf_{x\in\mathbb{R}^d} \langle c + A^Ty, x\rangle - \langle y, b\rangle.
\]
Now, if $c + A^Ty \ne 0$, then the infimum above is clearly $-\infty$. And if $c + A^Ty = 0$, then the infimum is $-\langle b, y\rangle$. Therefore, for (4.12), the dual function is given by
\[
\mathcal{L}(y) = \begin{cases} -\infty & \text{if } c + A^Ty \ne 0 \\ -\langle b, y\rangle & \text{if } c + A^Ty = 0 \end{cases} \tag{4.13}
\]
Therefore,
\[
\sup_{y\in K^\star} \mathcal{L}(y) = \sup\{-\langle b, y\rangle : A^Ty = -c,\ y \in K^\star\} = -\inf\{\langle b, y\rangle : A^Ty = -c,\ y \in K^\star\}.
\]

To remove the slightly annoying minus sign in front of $c$ above, it is more standard to write (4.12) as $-\sup\{\langle -c, x\rangle : Ax \preceq_K b\}$, and then replace $-c$ with $c$ throughout the above derivation. Thus, the standard primal-dual pairs for conic optimization problems are
\[
\sup\{\langle c, x\rangle : Ax \preceq_K b\} \le \inf\{\langle b, y\rangle : A^Ty = c,\ y \in K^\star\}. \tag{4.14}
\]

Linear Programming/Optimization. Specializing to the linear programming case with $K = \mathbb{R}^m_+$ and observing that $K^\star = K = \mathbb{R}^m_+$ (see Problem 2 from "HW for Week III"), we obtain the primal-dual pair
\[
\sup\{\langle c, x\rangle : Ax \le b\} \le \inf\{\langle b, y\rangle : A^Ty = c,\ y \ge 0\}. \tag{4.15}
\]

Semidefinite Programming/Optimization. Another special case is that of semidefinite optimization. This is the situation when $m = n^2$ and $K$ is the cone of positive semidefinite matrices. $G : \mathbb{R}^d \to \mathbb{R}^{n^2}$ is an affine map from $\mathbb{R}^d$ to the space of $n \times n$ matrices. To avoid dealing with asymmetric matrices, $G$ is always assumed to be of the form $G(x) = x_1A_1 + \ldots + x_dA_d - A_0$, where $A_0, A_1, \ldots, A_d$ are $n \times n$ symmetric matrices.⁴ If one works through the algebra in this case and uses the fact that the positive semidefinite cone is self-dual, i.e., $K = K^\star$, (4.14) becomes

\[
\sup\{\langle c, x\rangle : x_1A_1 + \ldots + x_dA_d - A_0 \text{ is a PSD matrix}\} \le \inf\{\langle A_0, Y\rangle : \langle A_i, Y\rangle = c_i,\ Y \text{ is a PSD matrix}\},
\]
where $\langle X, Z\rangle = \sum_{i,j} X_{ij}Z_{ij}$ for any pair $X, Z$ of $n \times n$ symmetric matrices.

Convex optimization with explicit constraints and objective. Recall part 3. of Example 4.24, where $K = \mathbb{R}^m_+$, $f, g_1, \ldots, g_m : \mathbb{R}^d \to \mathbb{R}$ are convex functions, and $G : \mathbb{R}^d \to \mathbb{R}^m$ was defined as $G(x) = (g_1(x), \ldots, g_m(x))$, giving the explicit problem

\[
\inf\{f(x) : g_1(x) \le 0, \ldots, g_m(x) \le 0\}.
\]
In this case, since $K^\star = K = \mathbb{R}^m_+$ (see Problem 2 from "HW for Week III"), the dual problem is

\[
\sup_{y\in K^\star} \mathcal{L}(y) = \sup_{y\ge 0} \inf_{x\in\mathbb{R}^d} \{f(x) + y_1g_1(x) + \ldots + y_mg_m(x)\}.
\]

⁴Dealing with asymmetric matrices is not hard, but involves little details that can be overlooked for this exposition, and don't provide any great insight.

A closer look at linear programming duality. Consider the following linear program:

\[
\begin{array}{rl}
\sup & 2x_1 - 1.5x_2 \\
\text{s.t.} & x_1 + x_2 \le 1 \\
& x_1 - x_2 \le 1 \\
& -x_1 + x_2 \le 1 \\
& -x_1 - x_2 \le 1
\end{array} \tag{4.16}
\]

To solve this problem, let us make some simple observations. If we multiply the first inequality by 0.5, the second inequality by 3.5, the third by 1.75 and the fourth by 0.25 and add all these scaled inequalities, then we obtain the inequality $2x_1 - 1.5x_2 \le 6$. Now any $x \in \mathbb{R}^2$ satisfying the constraints of the above linear program must also satisfy this new inequality. This shows that our supremum is at most 6. If we instead choose the multipliers 0.25, 1.75, 0, 0 (in order), then we obtain the inequality $2x_1 - 1.5x_2 \le 2$, which gives a better bound of $2 \le 6$ on the optimal solution value. Now, consider the point $x_1 = 1, x_2 = 0$: it has value $2 \cdot 1 - 1.5 \cdot 0 = 2$. Since we have an upper bound of 2 from the above arguments, we know that $x_1 = 1, x_2 = 0$ is actually the optimal solution to the above linear program! Thus, we have provided the optimal solution, and a quick certificate of its optimality. If you think about how we were deriving the upper bounds of 6 and 2, we were looking for nonnegative multipliers $y_1, y_2, y_3, y_4$ such that the corresponding combination of the inequalities gives us $2x_1 - 1.5x_2$ on the left hand side, and the upper bound was simply the right hand side of the combined inequality, which is $y_1 + y_2 + y_3 + y_4$. If the left hand side is to end up as $2x_1 - 1.5x_2$, then we must have $y_1 + y_2 - y_3 - y_4 = 2$ and $y_1 - y_2 + y_3 - y_4 = -1.5$. To get the best upper bound, we want to find the minimum value of $y_1 + y_2 + y_3 + y_4$ such that $y_1 + y_2 - y_3 - y_4 = 2$ and $y_1 - y_2 + y_3 - y_4 = -1.5$, and all $y_i$'s are nonnegative. But this is exactly the dual problem in (4.15). We hope this gives the reader a more "hands-on" perspective on the Lagrangian dual of a linear program.
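One can check all of this numerically; the following sketch solves both sides of the pair (4.15) for the data in (4.16) using scipy.optimize.linprog (which minimizes, so the primal objective is negated):

```python
import numpy as np
from scipy.optimize import linprog

A = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
b = np.ones(4)
c = np.array([2.0, -1.5])

# Primal: sup <c, x> s.t. Ax <= b  (x is free, so the default bounds are relaxed)
primal = linprog(-c, A_ub=A, b_ub=b, bounds=[(None, None)] * 2)

# Dual: inf <b, y> s.t. A^T y = c, y >= 0, as in (4.15)
dual = linprog(b, A_eq=A.T, b_eq=c, bounds=[(0, None)] * 4)

print(-primal.fun, dual.fun)   # both 2.0: zero duality gap
print(primal.x, dual.x)        # x = (1, 0), y = (0.25, 1.75, 0, 0)
```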

4.3.4 Strong duality: sufficient conditions and complementary slackness

In the above example of the linear program in (4.16), it turned out that we could find a primal feasible solution and a dual feasible solution that have the same value, which shows that we have strong duality, and certifies the optimality of the two solutions. We will see below that this always happens for linear programs. For general conic optimization problems, or a convex optimization problem with generalized inequalities, this does not always hold and one may not even have zero duality gap. We now supply two conditions under which strong duality is obtained. Linear programming strong duality will be a special case of the second condition.

Slater's condition for strong duality. The following is perhaps the most well-known sufficient condition in convex optimization that guarantees strong duality.

Theorem 4.31 (Slater's condition). Let $f : \mathbb{R}^d \to \mathbb{R}$ be convex, let $K \subseteq \mathbb{R}^m$ be a closed, convex, pointed cone, and let $G : \mathbb{R}^d \to \mathbb{R}^m$ be a $K$-convex mapping. Let $\mathcal{L} : \mathbb{R}^m \to \mathbb{R}$ be as defined in (4.9). If there exists $\bar{x}$ such that $G(\bar{x}) \in -\mathrm{int}(K)$ and $\inf\{f(x) : G(x) \preceq_K 0\}$ is a finite value, then there exists $y^\star \in K^\star$ such that $\sup_{y\in K^\star} \mathcal{L}(y) = \mathcal{L}(y^\star) = \inf\{f(x) : G(x) \preceq_K 0\}$, i.e., strong duality holds.

Before we begin the proof, we need to establish a slight variant of the separating hyperplane theorem that does not make any closedness or compactness assumptions.

Proposition 4.32. Let $A, B \subseteq \mathbb{R}^d$ be convex sets (not necessarily closed) such that $A \cap B = \emptyset$. Then there exists $a \in \mathbb{R}^d \setminus \{0\}$ such that $\langle a, x\rangle \ge \langle a, y\rangle$ for all $x \in A, y \in B$.

Proof. Left as an exercise.

Proof of Theorem 4.31. Let $\mu_0 := \inf\{f(x) : G(x) \preceq_K 0\} \in \mathbb{R}$. Define the sets
\[
\begin{array}{rl}
A := & \{(z, r) \in \mathbb{R}^m \times \mathbb{R} : \exists x \in \mathbb{R}^d \text{ such that } f(x) \le r,\ G(x) \preceq_K z\}, \\
B := & \{(z, r) \in \mathbb{R}^m \times \mathbb{R} : r < \mu_0,\ z \preceq_K 0\}.
\end{array}
\]

It is not hard to verify that $A, B$ are convex. Moreover, it is also not hard to verify that $A \cap B = \emptyset$. By Proposition 4.32, there exist $a \in \mathbb{R}^m$, $\gamma \in \mathbb{R}$ such that

\[
\langle a, z_1\rangle + \gamma r_1 \ge \langle a, z_2\rangle + \gamma r_2 \tag{4.17}
\]

for all $(z_1, r_1) \in A$ and $(z_2, r_2) \in B$.

Claim 3. $a \in K^\star$ and $\gamma \ge 0$.

We now show that, in fact, γ > 0 because of the existence of x¯ assumed in the hypothesis of the theorem. Substitute z1 = G(x¯), r1 = f(x¯), r2 = µ0 1 and z2 = 0 in (4.17). If γ = 0, then this relation becomes − a,G(x¯) 0. h i ≥

1751 However, G(x¯) int(K) and a = 0 and therefore, a,G(x¯) < 0 (see Problem3 from “HW for Week II”). − ∈ 6 h i 1752 By Claim3, γ > 0. ? a ? ? ? Let y := γ ; by Claim3, y K . We will now show that for every  > 0, (y ) µ0 . This will ∈ ? L ? ≥ − establish the result because this means (y ) µ0 and since (y) µ0 for all y K by Proposition 4.27, ? L ≥ L d ≤ ∈ we must have supy∈K? (y) = (y ) = µ0. Consider any x R . z1 = G(x) and r1 = f(x) gives a point L L ∈ in A. Substituting into (4.17) with z2 = 0 and r2 = µ0 , we obtain that a,G(x) + γf(x) γ(µ0 ). Dividing through by γ, we obtain − h i ≥ − ? y ,G(x) + f(x) µ0 . h i ≥ − ? ? 1753 This implies that (y ) = inf d y ,G(x) + f(x) µ0 . L x∈R h i ≥ −

1754 Closed cone condition for strong duality in conic optimization. Slater’s condition applied to conic 1755 optimization problems translates into requiring that there is some x¯ such that b Ax¯ int(K). Another −∗ ∈ 1756 very useful strong duality condition uses topological properties of the dual cone K .

1757 Theorem 4.33. [Closed cone condition] Consider the conic optimization primal dual pair (4.14). Suppose T d ∗ ? 1758 the set (A y, b, y ) R R : y K is closed and the dual is feasible, i.e., there exists y K such T{ h i ∈ × ∈ } ∈ 1759 that A y = c. Then we have zero duality gap. If the optimal dual value is finite, then strong duality holds 1760 in (4.14).

1761 Proof. Since the dual is feasible, its optimal value is either or finite. By weak duality (Proposition 4.27), −∞ 1762 in the first case we must have zero duality gap and the primal is infeasible. So we consider the case when the T ∗ d 1763 optimal value of the dual is finite, say µ0 R. Let us label the set S := (A y, b, y ): y K R R. ∈ { h i ∈ } ⊆ × 1764 Notice that the optimal value of the dual is µ0 = inf r R :(c, r) S . Since S is closed, the set { ∈ ∈ } 1765 r R :(c, r) S is closed because it is topologically the same as S (c R). Therefore the infimum in { ∈ ∈ } ∩ × ? ? 1766 inf r R :(c, r) S is over a closed subset of the real line. Hence, (c, µ0) S and so there exists y K { ∈ T ? ∈ } ? ∈ ∈ 1767 such that A y = c and b, y = µ0. h i 1768 Since µ0 = inf r R :(c, r) S , for every  > 0, (c, µ0 ) S. Therefore, there exists a separating { ∈d ∈ } T − 6∈ ? 1769 hyperplane (a, γ) R R and δ R such that a,A y +γ b, y δ for all y K , and a, c +γ(µ0 ) > ∈ × ∈ h i ·h i ≤ ∈ h i − 1770 δ. By Problem8 from “HW for Week IX”, we may assume δ = 0. Therefore, we have

a,AT y + γ b, y 0 for all y K?, (4.18) h i · h i ≤ ∈ a, c + γ(µ0 ) > 0 (4.19) h i −

66 ? 1771 Substituting y in (4.18), we obtain that a, c + γµ0 0, and (4.19) tells us that a, c + γµ0 > γ. This h i ≤ h i ? 1772 implies that γ < 0 since  > 0. Now (4.18) can be rewritten as Aa+γb, y 0 for all y K and (4.19) can h i ≤ ∈ a 1773 be rewritten as a, c > γ(µ0 ). Dividing through both these relations by γ > 0, and setting x = −γ , h i − − ? − 1774 we obtain that Ax b, y 0 for all y K implying that Ax K b, and x, c > µ0 . Thus, we have h − i ≤ ∈ 4 h i − 1775 a feasible solution x for the primal with value at least µ0 . Since  > 0 was chosen arbitrarily, this shows − 1776 that for every  > 0, the primal has optimal value better than µ0 . Therefore, the primal value must be ? − 1777 µ0 and we have zero duality gap. The existence of y shows that we have strong duality.

1778 Linear Programming strong duality. The closed cone condition for strong duality implies that linear programs 1779 always enjoy strong duality when either the primal or the dual (or both) are feasible. This is because the m ? m 1780 cone K = R+ is a polyhedral cone and also self-dual, i.e., K = K = R+ . Since linear transformations of 1781 polyhedral cones are polyhedral (see part 5. of Problem1 in “HW for Week VI”), and hence closed, linear 1782 programs always satisfy the condition in Theorem 4.33. One therefore has the following table for the possible 1783 outcomes in the primal dual linear programming pair.

XX XXX Dual XX Infeasible Finite Unbounded Primal XXX 1784 Infeasible Possible Impossible Possible Finite Impossible Possible, Zero duality gap Impossible Unbounded Possible Impossible Impossible

1785 An alternate proof of zero duality gap for linear programming follows from our results on polyhedral 1786 theory. We outline it here to illustrate that linear programming duality can be approached in different 1787 ways (although ultimately both proofs go back to the separating hyperplane theorem – Theorem 2.20). We 1788 consider two cases: 1789 Primal is infeasible. In this case, we will show that if the dual is feasible, then the dual must be 1790 unbounded. Since the primal is infeasible, the polyhedron Ax b is empty. By Theorem 2.88, there exists T ≤ T 1791 yˆ 0 such that A yˆ = 0 and b, yˆ = 1. Since the dual is feasible, consider any y¯ 0 such that A y¯ = c. ≥ h i − ≥ 1792 Now, all points of the form y¯ + λyˆ are also feasible to the dual, and the corresponding value b, y¯ + λyˆ can h i 1793 be made to go to because b, yˆ = 1. −∞ h i − 1794 Primal is feasible. If the primal is unbounded, then by weak duality, the dual must be infeasible. So let 1795 us consider the case that the primal has a finite value µ0. This means that the inequality c, x µ0 is a h Ti ≤ 1796 valid inequality for the polyhedron Ax b. By Theorem 2.85, there exists yˆ 0 such that A yˆ = c and ≤ ≥ 1797 b, yˆ µ0. Therefore the dual has a solution yˆ whose objective value is equal to the primal value µ0. This h i ≤ 1798 guarantees strong duality.

1799 Complementary slackness. Complementary slackness is a useful necessary condition when we have 1800 primal and dual optimal solutions with zero duality gap.

d m 1801 Theorem 4.34. Let f : R R be convex, let K R be a closed, convex, pointed cone, and let d m → m ⊆ ? 1802 G : R R be a K-convex mapping. Let : R R be as defined in (4.9). Let x be such that ? → ? ? ? L? → ? ? 1803 G(x ) K 0 and y K such that f(x ) = (y ). Then y ,G(x ) = 0. 4 ∈ L h i ? ? ? ? ? Proof. We simply observe that since G(x ) K 0 and y K , we must have y ,G(x ) 0. Therefore, 4 ∈ h i ≤ f(x?) f(x?) + y?,G(x?) inf f(x) + y?,G(x) = (y?). d ≥ h i ≥ x∈R h i L

? ? ? ? 1804 Since f(x ) = (y ) by assumption, equality must hold throughout above giving us y ,G(x ) = 0. L h i

67 1805 4.3.5 Saddle point interpretation of the Lagrangian dual

1806 Let us go back to the original problem (4.4) and revisit the dual function (y). Define the function L ˆ(x, y) := f(x) + y,G(x) (4.20) L h i

1807 which is often called the Lagrangian function associated with (4.4). A characterization of a pair of optimal 1808 solutions to (4.4) and (4.10) can be obtained using saddle points of the Lagrangian function.

d m 1809 Theorem 4.35. Let f : R R be convex, let K R be a closed, convex, pointed cone, and let d m → m ⊆ ˆ d m 1810 G : R R be a K-convex mapping. Let : R R be as defined in (4.9) and : R R R be as → ? ? L → ? ? L × → 1811 defined in (4.20). Let x be such that G(x ) K 0 and y K . Then the following are equivalent. 4 ∈ ? ? 1812 1. (y ) = f(x ). L ˆ ? ˆ ? ? ˆ ? d ? 1813 2. (x , yˆ) (x , y ) (xˆ, y ), for all xˆ R and yˆ K . L ≤ L ≤ L ∈ ∈ Proof. 1. = 2. Consider any xˆ Rd and yˆ K?. We now derive the following chain of inequalities: ⇒ ∈ ∈ ˆ(x?, yˆ) = f(x?) + yˆ,G(x?) L ? h i ? ? ? f(x ) since yˆ,G(x ) 0 because yˆ K ,G(x ) K 0 ≤ h i ≤ ∈ 4 = f(x?) + y?,G(x?) = ˆ(x?, y?) since y?,G(x?) = 0 by Theorem 4.34 = (y?)h i L since h (y?) = f(ix?) L ? L d f x y ,G x = infx∈R ( ) + ( ) f(xˆ) + y?,G(xˆh) i ≤ h i = ˆ(xˆ, y?) L 2. = 1. Since ˆ(x?, yˆ) ˆ(x?, y?) for all yˆ K?, we have that ⇒ L ≤ L ∈ ˆ(x?, y?) = sup ˆ(x?, yˆ) = sup f(x?) + y,G(x?) = f(x?), L y∈K? L y∈K? h i

where the last equality follows from the fact that y,G(x?) 0 for all y K?. So the supremum is achieved h i ≤ ∈ for y = 0. On the other hand, since ˆ(x?, y?) ˆ(xˆ, y?) for all xˆ Rd, we have that L ≤ L ∈ ˆ(x?, y?) = inf ˆ(x, y?) = inf f(x) + y?,G(x) = (y?). d d L x∈R L x∈R h i L

? ? ? ? 1814 Thus, we obtain that f(x ) = ˆ(x , y ) = (y ). L L ? ? 1815 Theorem 4.35 says that x and y are solutions for the primal problem (4.4) and dual problem (4.10) ? ? ? 1816 respectively, if and only if (x , y ) is a saddle point for the function ˆ(x, y) of the type that x is the ? ? L ? 1817 minimizer when y is fixed at y and y is the maximizer when x is fixed at x . This can be used to 1818 directly solve (4.4) and (4.10) simultaneously by searching for such saddle-points of the function ˆ(x, y). L 1819 This approach can be useful, if one has analytical forms for f and G (with sufficient differentiable properties) 1820 so that finding saddle-points is a reasonable option.

1821 4.4 Cutting plane schemes

1822 We now go back to the most general convex optimization problem (4.1). As before, we make no assumptions d 1823 on f and C except that we have access to first-order oracles for f and C, i.e., for any x R , the oracle ∈ 1824 returns an element from the subdifferential ∂f(x), and if x C then it returns a separating hyperplane. 6∈ 1825 The subgradient algorithm from Section 4.1 can be used to solve (4.1) if one has access to the projection 1826 operator ProjC (x), which is stronger than a separation oracle. Cutting plane schemes are a class of algo- 1827 rithms that work with just a separation oracle. Moreover, the number of oracle calls is quite different from 1828 the number of oracle calls made by the subgradient algorithm: on the one hand, they typically exhibit a

68 MR 1829 logarithmic dependence of ln(  ) on the initial data M,R and error guarantee  as opposed to the quadratic M 2R2 1830 dependence 2 of the subgradient algorithm; on the other other, cutting plane schemes have a polynomial 2 1831 dependence on the dimension d of the problem (typically of the order of d ), and such a dependence does 1832 not exist for the subgradient algorithm – see Remark 4.16. 1833 We will present the algorithm and the analysis for the situation when C is compact and full-dimensional. ? 1834 Hence the minimizer x exists for (4.1) since f is convex, and therefore, continuous by Theorem 3.21. There 1835 are ways to get around this assumption, but we will ignore this complication in this write-up.

General cutting plane scheme

1. Choose any $E_0 \supseteq C$.

2. For $i = 0, 1, 2, \ldots$, do

(a) Choose $x^i \in E_i$.

(b) Call the separation oracle for $C$ with $x^i$ as input.
Case 1: $x^i \in C$. Call the first-order oracle for $f$ to get some $s^i \in \partial f(x^i)$.
Case 2: $x^i \notin C$. Set $s^i$ to be the normal vector of some separating hyperplane for $x^i$ from $C$.

(c) Set $E_{i+1} \supseteq E_i \cap \{x \in \mathbb{R}^d : \langle s^i, x\rangle \le \langle s^i, x^i\rangle\}$.

The points $x^0, x^1, \ldots$ will be called the iterates of the Cutting Plane scheme.

Remark 4.36. The above general scheme actually defines a family of algorithms. We have two choices to make to get a particular algorithm out of this scheme. First, there must be a strategy/procedure to choose $x^i \in E_i$ in step 2(a) in every iteration. Second, there should be a strategy to define $E_{i+1}$ as a superset of $E_i \cap \{x \in \mathbb{R}^d : \langle s^i, x\rangle \le \langle s^i, x^i\rangle\}$ in step 2(c) of the scheme. Depending on what these two strategies are, we get different variants of the general cutting plane scheme. We will look at two variants below: the center of gravity method and the ellipsoid method.

Technically, we also have to make a choice for $E_0$ in Step 1, but this is usually given as part of the input to the problem: $E_0$ is usually a large ball or polytope containing $C$ that is provided or known at the start.
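In code, the scheme is a short loop with the two strategies of Remark 4.36 left as callbacks (a sketch; all names are ours):

```python
def cutting_plane(choose_point, separation_oracle, subgrad_f, update_set, E0, num_iters):
    """Skeleton of the general cutting plane scheme.

    choose_point(E):      pick a point x in the current set E          (Step 2(a))
    separation_oracle(x): None if x is in C, else the normal vector of
                          a hyperplane separating x from C             (Step 2(b))
    subgrad_f(x):         some s in the subdifferential of f at x
    update_set(E, s, x):  any set containing E ∩ {z : <s,z> <= <s,x>}  (Step 2(c))
    """
    E, iterates = E0, []
    for _ in range(num_iters):
        x = choose_point(E)
        s = separation_oracle(x)
        if s is None:                 # x feasible: cut using a subgradient of f
            s = subgrad_f(x)
        E = update_set(E, s, x)
        iterates.append(x)
    return iterates
```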

We now start our analysis of cutting plane schemes. We introduce a useful notation to denote the polyhedron defined by the halfspaces obtained during the iterations of the cutting plane scheme.

Definition 4.37. Let $z^1, \ldots, z^k \in \mathbb{R}^d$ and let $s^1, \ldots, s^k$ be the corresponding outputs of the first-order oracle, i.e., $s^i \in \partial f(z^i)$ if $z^i \in C$, and $s^i$ is the normal vector of a separating hyperplane if $z^i \notin C$. Define
\[
G(z^1, \ldots, z^k) := \{x \in \mathbb{R}^d : \langle s^i, x\rangle \le \langle s^i, z^i\rangle \ \ i = 1, \ldots, k\}.
\]
This polyhedron will be referred to as the gradient polyhedron of $z^1, \ldots, z^k$. The name is a bit of a misnomer, because we are considering general $f$, so we may have no gradients, and also some of the halfspaces could correspond to separating hyperplanes, which have nothing to do with gradients. Even so, we stick with this terminology.

Definition 4.38. Let $x^0, x^1, \ldots$ be the iterates of a cutting plane scheme. For any iteration $t \ge 0$, we define $h(t) := |C \cap \{x^0, \ldots, x^t\}|$, i.e., $h(t)$ is the number of feasible iterates until iteration $t$. We also define
\[
S_t = C \cap G(x^0, \ldots, x^t).
\]

As we shall see below, the volume of $S_t$ will be central in measuring our progress towards the optimal solution. We first observe in the next lemma that $S_t$ can be described as the intersection of $C$ and the gradient polyhedron of only the feasible iterates.

Lemma 4.39. Let $x^0, x^1, \ldots$ be the iterates of a cutting plane scheme. Let $t \ge 0$ be any natural number and let the feasible iterates be denoted by $\{x^{i_1}, \ldots, x^{i_{h(t)}}\} = C \cap \{x^0, \ldots, x^t\}$, with $0 \le i_1 \le i_2 \le \ldots \le i_{h(t)}$. Then $S_t = C \cap G(x^{i_1}, \ldots, x^{i_{h(t)}})$.

Proof. Let $X_t = \{x^0, \ldots, x^t\}$. We derive the following relations:
\[
S_t = C \cap G(x^0, \ldots, x^t) = C \cap G(X_t \setminus \{x^{i_1}, \ldots, x^{i_{h(t)}}\}) \cap G(x^{i_1}, \ldots, x^{i_{h(t)}}) = C \cap G(x^{i_1}, \ldots, x^{i_{h(t)}}),
\]

where the last equality follows since $C \subseteq G(X_t \setminus \{x^{i_1}, \ldots, x^{i_{h(t)}}\})$: each $z \in X_t \setminus \{x^{i_1}, \ldots, x^{i_{h(t)}}\}$ is infeasible, i.e., $z \notin C$, and therefore the corresponding vector $s$ is the normal of a separating hyperplane for $z$ and $C$, i.e., $C \subseteq \{x \in \mathbb{R}^d : \langle s, x\rangle \le \langle s, z\rangle\}$.

Since our analysis will involve the volume of $S_t$, while our algorithm only works with the sets $E_t$, we need to establish a definite relationship between these two sets.

Lemma 4.40. Let $x^0, x^1, \ldots$ be the iterates of a cutting plane scheme. Then $E_{t+1} \supseteq S_t$ for all $t \ge 0$.

Proof. By definition, $E_{i+1} \supseteq E_i \cap \{x \in \mathbb{R}^d : \langle s^i, x\rangle \le \langle s^i, x^i\rangle\}$ for all $i = 0, \ldots, t$. By putting all these relationships together, we obtain that

\[
E_{t+1} \supseteq E_0 \cap G(x^0, \ldots, x^t) \supseteq C \cap G(x^0, \ldots, x^t) = S_t, \tag{4.21}
\]

where the second containment follows from the assumption that $E_0 \supseteq C$.

We now state our main structural result for the analysis of cutting plane schemes. We use $\mathrm{dist}(x, X)$ to denote the distance of $x \in \mathbb{R}^d$ from any subset $X \subseteq \mathbb{R}^d$, i.e., $\mathrm{dist}(x, X) := \inf_{y\in X} \|x - y\|$.

Theorem 4.41. Let $f : \mathbb{R}^d \to \mathbb{R}$ be a convex function and let $C$ be a compact, convex set. Let $x^\star$ be a minimizer for (4.1). Let $x^0, x^1, \ldots$ be the iterates of any cutting plane scheme. For any $t \ge 0$, let the feasible iterates be denoted by $\{x^{i_1}, \ldots, x^{i_{h(t)}}\} = C \cap \{x^0, \ldots, x^t\}$, with $0 \le i_1 \le i_2 \le \ldots \le i_{h(t)}$. Define
\[
v_{\min}(t) := \min_{j=i_1,\ldots,i_{h(t)}} \mathrm{dist}(x^\star, H(s^j, \langle s^j, x^j\rangle)),
\]

i.e., $v_{\min}(t)$ is the minimum distance of $x^\star$ from the hyperplanes $\{x : \langle s^j, x\rangle = \langle s^j, x^j\rangle\}$, $j = i_1, \ldots, i_{h(t)}$. Let $D$ be the diameter of $C$, i.e., $D = \max_{x,y\in C} \|x - y\|$. Then the following are all true.

1. For any $t \ge 0$, if $\mathrm{vol}(E_{t+1}) < \mathrm{vol}(C)$ then $h(t) > 0$, i.e., there is at least one feasible iterate.

2. For any $t \ge 0$ such that $h(t) > 0$, $v_{\min}(t) \le D\left(\frac{\mathrm{vol}(S_t)}{\mathrm{vol}(C)}\right)^{1/d} \le D\left(\frac{\mathrm{vol}(E_{t+1})}{\mathrm{vol}(C)}\right)^{1/d}$.

3. For any $t \ge 0$ such that $h(t) > 0$, $\min_{j=i_1,\ldots,i_{h(t)}} f(x^j) \le f(x^\star) + Mv_{\min}(t) \le f(x^\star) + MD\left(\frac{\mathrm{vol}(E_{t+1})}{\mathrm{vol}(C)}\right)^{1/d}$, where $M = L(B(x^\star, v_{\min}))$ is a Lipschitz constant for $f$ over $B(x^\star, v_{\min})$ (see Theorem 3.21). This provides a bound on the value of the best feasible point seen up to iteration $t$, in comparison to the optimal value $f(x^\star)$.

Theorem 4.41 shows that if we can ensure $\mathrm{vol}(E_t) \to 0$ as $t \to \infty$, then we have a convergent algorithm.

Proof of Theorem 4.41. 1. We prove the contrapositive. If $h(t) = 0$, then all iterates up to iteration $t$ are infeasible, i.e., $x^i \notin C$ for all $i = 0, \ldots, t$. This implies that all the vectors $s^i$ are normal vectors of separating hyperplanes. So $C \subseteq G(x^0, \ldots, x^t)$. Since $C \subseteq E_0$, this implies that $C = E_0 \cap C \subseteq E_0 \cap G(x^0, \ldots, x^t) \subseteq E_{t+1}$, where the last containment follows from the first containment in (4.21). Therefore, $\mathrm{vol}(C) \le \mathrm{vol}(E_{t+1})$.

2. Let $\alpha = \frac{v_{\min}(t)}{D}$. Since $D$ is the diameter of $C$, we must have $C \subseteq B(x^\star, D)$. Thus,
\[
\alpha(C - x^\star) + x^\star \subseteq B(x^\star, \alpha D) = B(x^\star, v_{\min}(t)) \subseteq G(x^{i_1}, \ldots, x^{i_{h(t)}}),
\]
where the first equality follows from the definition of $\alpha$ and the final containment follows from the definition of $v_{\min}(t)$. Since $x^\star \in C$ and $C$ is convex, we know that $\alpha(C - x^\star) + x^\star = \alpha C + (1 - \alpha)x^\star \subseteq C$. Therefore, $\alpha(C - x^\star) + x^\star = C \cap (\alpha(C - x^\star) + x^\star) \subseteq C \cap G(x^{i_1}, \ldots, x^{i_{h(t)}}) = S_t$, where the last equality follows from Lemma 4.39. This implies that $\alpha^d\,\mathrm{vol}(C) = \mathrm{vol}(\alpha(C - x^\star)) \le \mathrm{vol}(S_t)$. Rearranging and using the definition of $\alpha$, we obtain that $v_{\min}(t) \le D\left(\frac{\mathrm{vol}(S_t)}{\mathrm{vol}(C)}\right)^{1/d}$. By Lemma 4.40, $D\left(\frac{\mathrm{vol}(S_t)}{\mathrm{vol}(C)}\right)^{1/d} \le D\left(\frac{\mathrm{vol}(E_{t+1})}{\mathrm{vol}(C)}\right)^{1/d}$.

3. It suffices to prove the first inequality; the second inequality follows from part 2. above. Let $i^{\min} \in \{i_1, i_2, \ldots, i_{h(t)}\}$ be such that $v_{\min}(t) = \mathrm{dist}(x^\star, H(s^{i^{\min}}, \langle s^{i^{\min}}, x^{i^{\min}}\rangle))$. Denote by $H := H(s^{i^{\min}}, \langle s^{i^{\min}}, x^{i^{\min}}\rangle)$ the hyperplane passing through $x^{i^{\min}}$ orthogonal to $s^{i^{\min}}$. Let $\bar{x}$ be the point on $H$ closest to $x^\star$. Using the Lipschitz constant $M$, we obtain that $f(\bar{x}) \le f(x^\star) + Mv_{\min}(t)$; see Figure 3 in Section 4.1. Finally, since $s^{i^{\min}} \in \partial f(x^{i^{\min}})$, we must have that $f(\bar{x}) \ge f(x^{i^{\min}}) + \langle s^{i^{\min}}, \bar{x} - x^{i^{\min}}\rangle = f(x^{i^{\min}})$, since $\bar{x}, x^{i^{\min}} \in H$ implies $\langle s^{i^{\min}}, \bar{x} - x^{i^{\min}}\rangle = 0$. Therefore, we obtain
\[
\min_{j=i_1,\ldots,i_{h(t)}} f(x^j) \le f(x^{i^{\min}}) \le f(\bar{x}) \le f(x^\star) + Mv_{\min}(t).
\]


We now analyze two instantiations of the cutting plane scheme with concrete strategies to choose $x^i$ and $E_{i+1}$ in each iteration $i$.

Center of Gravity Method. The first one is called the center of gravity method.

Definition 4.42. The center of gravity of any compact set $X \subseteq \mathbb{R}^d$ with non-zero volume is defined as
\[
\frac{\int_X x\, dx}{\mathrm{vol}(X)}.
\]

An important property of the center of gravity of compact, convex sets was established by Grünbaum [4].

Theorem 4.43. Let $C \subseteq \mathbb{R}^d$ be a compact, convex set with center of gravity $\bar{x}$. Then for every hyperplane $H$ such that $\bar{x} \in H$,
\[
\frac{1}{e} \le \left(\frac{d}{d+1}\right)^d \le \frac{\mathrm{vol}(H^+ \cap C)}{\mathrm{vol}(C)} \le 1 - \left(\frac{d}{d+1}\right)^d \le 1 - \frac{1}{e},
\]
where $H^+$ is a halfspace with boundary $H$.

Theorem 4.43 follows from the proof of Theorem 2 in [4] and will not be repeated here.

In the center of gravity method, $x^i$ is chosen as the center of gravity of $E_i$ in Step 2(a) of the General cutting plane scheme, and $E_{i+1}$ is set to be equal to $E_i \cap \{x \in \mathbb{R}^d : \langle s^i, x\rangle \le \langle s^i, x^i\rangle\}$ in Step 2(c). Theorem 4.43 then implies the following. Sometimes, the center of gravity method assumes that $E_0 = C$; the central assumption is that one can compute the center of gravity of $C$ and any subset of it.

Theorem 4.44. In the center of gravity method, if $h(t) > 0$ for some iteration $t \ge 0$, then
\[
\min_{j=i_1,\ldots,i_{h(t)}} f(x^j) \le f(x^\star) + MD\left(1 - \frac{1}{e}\right)^{t/d}\left(\frac{\mathrm{vol}(E_0)}{\mathrm{vol}(C)}\right)^{1/d},
\]
where $D$ is the diameter of $C$ and $M$ is a Lipschitz constant for $f$ over $B(x^\star, D)$.

In particular, if $E_0 = C$, then $\min_{j=i_1,\ldots,i_{h(t)}} f(x^j) \le f(x^\star) + MD\left(1 - \frac{1}{e}\right)^{t/d}$.

Proof. Follows from Theorem 4.41 part 3., the fact that $B(x^\star, v_{\min}) \subseteq B(x^\star, D)$ (implying that $M$ is a Lipschitz constant for $f$ over $B(x^\star, v_{\min})$), and $\mathrm{vol}(E_{t+1}) \le \left(1 - \frac{1}{e}\right)^t\mathrm{vol}(E_0)$ by Theorem 4.43.

By setting the error term $MD\left(1 - \frac{1}{e}\right)^{t/d}\left(\frac{\mathrm{vol}(E_0)}{\mathrm{vol}(C)}\right)^{1/d}$ less than or equal to $\epsilon$ in Theorem 4.44, the following is an immediate consequence.

Corollary 4.45. For any $\epsilon > 0$, after $O\left(d\ln\left(\frac{MD}{\epsilon}\right) + \ln\left(\frac{\mathrm{vol}(E_0)}{\mathrm{vol}(C)}\right)\right)$ iterations of the center of gravity method,

\[
\min_{j=i_1,\ldots,i_{h(t)}} f(x^j) \le f(x^\star) + \epsilon.
\]

In particular, if $E_0 = C$, then one needs $O\left(d\ln\left(\frac{MD}{\epsilon}\right)\right)$ iterations.
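Exact centers of gravity are hard to compute in general, but a crude Monte Carlo version of the method conveys the idea. The following sketch assumes $E_0 = C$ is a box, stores $E_i$ as the list of cuts made so far, and estimates centroids by rejection sampling inside the box; all names and the test problem are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def approx_centroid(cuts, lo, hi, n_samples=100_000):
    """Estimate the centroid of E = box ∩ {x : <s, x> <= delta for (s, delta) in cuts}
    by averaging uniform box samples that satisfy every cut."""
    pts = rng.uniform(lo, hi, size=(n_samples, len(lo)))
    mask = np.ones(n_samples, dtype=bool)
    for s, delta in cuts:
        mask &= pts @ s <= delta
    return pts[mask].mean(axis=0)

# Center of gravity method for min f(x) = ||x - c||^2 over C = [-1, 1]^2.
c = np.array([2.0, 0.5])                # unconstrained minimizer lies outside C
f = lambda x: float(np.sum((x - c) ** 2))
grad = lambda x: 2.0 * (x - c)          # f is differentiable, so grad = subgradient
lo, hi = np.array([-1.0, -1.0]), np.array([1.0, 1.0])
cuts, x_best = [], None
for _ in range(8):
    x = approx_centroid(cuts, lo, hi)   # Step 2(a): x^i = centroid of E_i
    s = grad(x)
    cuts.append((s, float(s @ x)))      # Step 2(c): keep E_i ∩ {z : <s,z> <= <s,x>}
    if x_best is None or f(x) < f(x_best):
        x_best = x
print(x_best)                           # approaches (1, 0.5), the minimizer over C
```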

Ellipsoid method. The ellipsoid method is a cutting plane scheme where $E_0$ is assumed to be a large ball with radius $R$ around a known point $x^0$ (typically $x^0 = 0$) that is guaranteed to contain $C$. At every iteration $i$, $E_i$ is maintained to be an ellipsoid, and in Step 2(a), $x^i$ is chosen to be the center of $E_i$. In Step 2(c), $E_{i+1}$ is set to be an ellipsoid that contains $E_i \cap \{x \in \mathbb{R}^d : \langle s^i, x\rangle \le \langle s^i, x^i\rangle\}$, such that $\mathrm{vol}(E_{i+1}) \le \left(1 - \frac{1}{(d+1)^2}\right)^{d/2}\mathrm{vol}(E_i)$. The technical bulk of the analysis goes into showing that such an ellipsoid $E_{i+1}$ always exists.

Definition 4.46. Recall from Definition 2.2 that an ellipsoid is the unit ball associated with the norm induced by a positive definite matrix, i.e., $E = \{x \in \mathbb{R}^d : x^TAx \le 1\}$ for some positive definite matrix $A$. First, we need to also consider translated ellipsoids, so that the center is not $0$ anymore. Secondly, for computational reasons involving inverses of matrices, we will actually define the following family of objects, which are just translated ellipsoids written in a different way. Given a positive definite matrix $Q \in \mathbb{R}^{d\times d}$ and a point $y \in \mathbb{R}^d$, we define
\[
E(Q, y) := \{x \in \mathbb{R}^d : (x - y)^TQ^{-1}(x - y) \le 1\}.
\]

The next proposition follows from unwrapping the definition. It shows that ellipsoids are simply the image of the Euclidean unit norm ball under an invertible linear transformation.

Proposition 4.47. Let $Q \in \mathbb{R}^{d\times d}$ be a positive definite matrix and let $Q^{-1} = X^TX$ for some invertible matrix $X \in \mathbb{R}^{d\times d}$. Then $E(Q, y) = y + X^{-1}(B(0, 1))$. Thus, $\mathrm{vol}(E(Q, y)) = \det(X^{-1})\,\mathrm{vol}(B(0, 1)) = \sqrt{\det(Q)}\,\mathrm{vol}(B(0, 1))$.

In the following, we will utilize the following relation for any $w, z \in \mathbb{R}^d$ and $A \in \mathbb{R}^{d\times d}$:
\[
(w + z)^TA(w + z) = w^TAw + 2w^TAz + z^TAz. \tag{4.22}
\]

Theorem 4.48. Let $Q \in \mathbb{R}^{d\times d}$ be a positive definite matrix and $y \in \mathbb{R}^d$. Let $s \in \mathbb{R}^d$ and let $E_+ = E(Q, y) \cap H^-(s, \langle s, y\rangle)$. Define
\[
y_+ = y - \frac{1}{d+1}\cdot\frac{Qs}{\sqrt{s^TQs}},
\]

\[
Q_+ = \frac{d^2}{d^2-1}\left(Q - \frac{2}{d+1}\cdot\frac{Qss^TQ}{s^TQs}\right).
\]
Then $E_+ \subseteq E(Q_+, y_+)$ and $\mathrm{vol}(E(Q_+, y_+)) \le \left(1 - \frac{1}{(d+1)^2}\right)^{d/2}\mathrm{vol}(E(Q, y))$.

Proof. We first prove $E_+ \subseteq E(Q_+, y_+)$. Consider any $x \in E_+ = E(Q, y) \cap H^-(s, \langle s, y\rangle)$. To ease notational burden, we denote $G = Q^{-1}$ and $G_+ = Q_+^{-1}$. A direct calculation shows that $G_+ = \frac{d^2-1}{d^2}\left(G + \frac{2}{d-1}\cdot\frac{ss^T}{s^TQs}\right)$. Note that $x$ satisfies

\[
(x - y)^TG(x - y) \le 1 \tag{4.23}
\]
\[
\langle s, x - y\rangle \le 0 \tag{4.24}
\]

72 We now verify that

\[
(x - y_+)^TG_+(x - y_+) = \left(x - y + \frac{1}{d+1}\cdot\frac{Qs}{\sqrt{s^TQs}}\right)^TG_+\left(x - y + \frac{1}{d+1}\cdot\frac{Qs}{\sqrt{s^TQs}}\right) = (x-y)^TG_+(x-y) + \frac{2}{d+1}(x-y)^TG_+\frac{Qs}{\sqrt{s^TQs}} + \frac{1}{(d+1)^2}\cdot\frac{s^TQ\,G_+\,Qs}{s^TQs},
\]

where we use (4.22). Let us analyze the three terms separately. The first term can be written in terms of $G$, $s$, and $y$:

\[
(x-y)^TG_+(x-y) = (x-y)^T\left(\frac{d^2-1}{d^2}\left(G + \frac{2}{d-1}\cdot\frac{ss^T}{s^TQs}\right)\right)(x-y) = \frac{d^2-1}{d^2}\left((x-y)^TG(x-y) + \frac{2}{d-1}\cdot\frac{(s^T(x-y))^2}{s^TQs}\right).
\]

The second term simplifies to

\[
\begin{array}{rl}
\frac{2}{d+1}(x-y)^TG_+\left(\frac{Qs}{\sqrt{s^TQs}}\right) & = \frac{2}{d+1}(x-y)^T\left(\frac{d^2-1}{d^2}\left(G + \frac{2}{d-1}\cdot\frac{ss^T}{s^TQs}\right)\right)\left(\frac{Qs}{\sqrt{s^TQs}}\right) \\
& = \frac{d^2-1}{d^2}\cdot\frac{2}{d+1}\left(\frac{s^T(x-y)}{\sqrt{s^TQs}} + \frac{2}{d-1}\cdot\frac{(x-y)^Tss^TQs}{s^TQs\cdot\sqrt{s^TQs}}\right) \\
& = \frac{d^2-1}{d^2}\cdot\frac{2}{d+1}\left(\frac{s^T(x-y)}{\sqrt{s^TQs}} + \frac{2}{d-1}\cdot\frac{(x-y)^Ts}{\sqrt{s^TQs}}\right) \\
& = \frac{d^2-1}{d^2}\cdot\frac{2}{d-1}\cdot\frac{s^T(x-y)}{\sqrt{s^TQs}}.
\end{array}
\]

The third term simplifies to

\[
\frac{1}{(d+1)^2}\cdot\frac{s^TQ\,G_+\,Qs}{s^TQs} = \frac{1}{(d+1)^2}\cdot\frac{s^TQ\left(\frac{d^2-1}{d^2}\left(G + \frac{2}{d-1}\cdot\frac{ss^T}{s^TQs}\right)\right)Qs}{s^TQs} = \frac{1}{(d+1)^2}\cdot\frac{d^2-1}{d^2}\cdot\frac{s^TQs + \frac{2}{d-1}\,s^TQs}{s^TQs} = \frac{d^2-1}{d^2}\left(\frac{1}{d^2-1}\right),
\]

Putting all of it together, we obtain that

\[
(x - y_+)^TG_+(x - y_+) = \frac{d^2-1}{d^2}\left((x-y)^TG(x-y) + \frac{2}{d-1}\cdot\frac{(s^T(x-y))^2}{s^TQs} + \frac{2}{d-1}\cdot\frac{s^T(x-y)}{\sqrt{s^TQs}} + \frac{1}{d^2-1}\right). \tag{4.25}
\]

We now argue that $\frac{(s^T(x-y))^2}{s^TQs} + \frac{s^T(x-y)}{\sqrt{s^TQs}} = \frac{s^T(x-y)}{s^TQs}\left(\sqrt{s^TQs} + s^T(x-y)\right) \le 0$. Since $s^T(x-y) \le 0$ by (4.24), it suffices to show that $\sqrt{s^TQs} + s^T(x-y) \ge 0$; we will in fact show that $|s^T(x-y)| \le \sqrt{s^TQs}$.

Claim 4. $|s^T(x-y)| \le \sqrt{s^TQs}$.

Proof of Claim. Let the eigendecomposition of $Q$ be given as $Q = S\Lambda S^T$, where $S$ is the orthonormal matrix which has the eigenvectors of $Q$ as columns, and $\Lambda$ is a diagonal matrix with the corresponding eigenvalues. Then $Q^{-1} = S\Lambda^{-1}S^T = G$. Now,

\[
\begin{array}{rl}
|s^T(x-y)| & = |s^TS\Lambda^{\frac{1}{2}}\Lambda^{-\frac{1}{2}}S^T(x-y)| \\
& = |\langle \Lambda^{\frac{1}{2}}S^Ts,\ \Lambda^{-\frac{1}{2}}S^T(x-y)\rangle| \\
& \le \|\Lambda^{\frac{1}{2}}S^Ts\|_2\,\|\Lambda^{-\frac{1}{2}}S^T(x-y)\|_2 \\
& = \sqrt{(\Lambda^{\frac{1}{2}}S^Ts)^T(\Lambda^{\frac{1}{2}}S^Ts)}\sqrt{(\Lambda^{-\frac{1}{2}}S^T(x-y))^T(\Lambda^{-\frac{1}{2}}S^T(x-y))} \\
& = \sqrt{s^TS\Lambda^{\frac{1}{2}}\Lambda^{\frac{1}{2}}S^Ts}\sqrt{(x-y)^TS\Lambda^{-\frac{1}{2}}\Lambda^{-\frac{1}{2}}S^T(x-y)} \\
& = \sqrt{s^TQs}\sqrt{(x-y)^TG(x-y)} \\
& \le \sqrt{s^TQs},
\end{array}
\]

where the first inequality is the Cauchy-Schwarz inequality, and the last inequality follows from (4.23).

This claim, together with (4.25), implies that

\[
(x - y_+)^TG_+(x - y_+) \le \frac{d^2-1}{d^2}\left((x-y)^TG(x-y) + \frac{1}{d^2-1}\right) \le \frac{d^2-1}{d^2}\left(1 + \frac{1}{d^2-1}\right) = 1,
\]

where the second inequality follows from (4.23). This establishes that $x \in E(Q_+, y_+)$.

We now prove the volume claim. Let $Q = B^TB$ for some invertible matrix $B$. We use $I_d$ to denote the $d \times d$ identity matrix. By Proposition 4.47,

r 2 QssT Q 2 det(Q− · ) d d+1 sT Qs d 2 = ( d2−1 ) det(Q) r T T T 2 T 2 B Bss B B d det(B B− d+1 · T T ) d 2 s B Bs = ( d2−1 ) det(BT B) r T T 2 T 2 Bss B d det(B (Id− d+1 · T T )B) d 2 s B Bs = ( d2−1 ) det(BT ) det(B) r T T 2 T 2 Bss B d det(B ) det(Id− d+1 · T T ) det(B) d 2 s B Bs = ( d2−1 ) det(BT ) det(B) 2 d q T T d 2 2 Bss B = ( d2−1 ) det(Id d+1 sT BT Bs ) 2 d − 1 · d 2 2 2 = ( 2 ) (1 ) , d −1 · − d+1 BssT BT aaT where the last equality follows from the fact that the matrix sT BT Bs = kak2 with a = Bs, is a rank one positive semidefinite matrix with eigenvalue 1 with multiplicity 1, and eigenvalue 0 with multiplicity d 1. Now finally we observe that −

\[
\left(\frac{d^2}{d^2-1}\right)^{\frac{d}{2}}\left(1 - \frac{2}{d+1}\right)^{\frac{1}{2}} = \left(\frac{d^2}{d^2-1}\left(1 - \frac{2}{d+1}\right)^{\frac{1}{d}}\right)^{\frac{d}{2}} \le \left(\frac{d^2}{d^2-1}\left(1 - \frac{2}{d(d+1)}\right)\right)^{\frac{d}{2}} = \left(\frac{d^2(d^2+d-2)}{d(d+1)(d^2-1)}\right)^{\frac{d}{2}} = \left(1 - \frac{1}{(d+1)^2}\right)^{d/2}.
\]

This completes the proof.
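In code, the update of Theorem 4.48 is just a few lines; the following sketch (names ours) also checks the volume-ratio guarantee numerically via Proposition 4.47, under which $\mathrm{vol}(E(Q, y)) \propto \sqrt{\det Q}$:

```python
import numpy as np

def ellipsoid_update(Q, y, s):
    """Given E(Q, y) and the halfspace <s, x> <= <s, y> through its center,
    return (Q_plus, y_plus) describing the covering ellipsoid of Theorem 4.48."""
    d = len(y)
    Qs = Q @ s
    sQs = float(s @ Qs)
    y_plus = y - Qs / ((d + 1) * np.sqrt(sQs))
    Q_plus = (d**2 / (d**2 - 1.0)) * (Q - (2.0 / (d + 1)) * np.outer(Qs, Qs) / sQs)
    return Q_plus, y_plus

# Volume-ratio check: vol(E(Q, y)) is proportional to sqrt(det Q).
d = 5
Q, y, s = np.eye(d), np.zeros(d), np.ones(d)
Q_plus, y_plus = ellipsoid_update(Q, y, s)
ratio = float(np.sqrt(np.linalg.det(Q_plus) / np.linalg.det(Q)))
bound = (1.0 - 1.0 / (d + 1) ** 2) ** (d / 2)
print(ratio <= bound, ratio, bound)   # True 0.904... 0.931...
```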

Theorem 4.48 can be used to give the guarantee of the ellipsoid method as follows.

Theorem 4.49. Using the ellipsoid method with $E_0 = B(x^0, R)$, if $h(t) > 0$ for some iteration $t \ge 0$, then
\[
\min_{j=i_1,\ldots,i_{h(t)}} f(x^j) \le f(x^\star) + MR\left(1 - \frac{1}{(d+1)^2}\right)^{t/2}\left(\frac{\mathrm{vol}(E_0)}{\mathrm{vol}(C)}\right)^{1/d} \le f(x^\star) + MRe^{-\frac{t}{2(d+1)^2}}\left(\frac{\mathrm{vol}(E_0)}{\mathrm{vol}(C)}\right)^{1/d},
\]

where $M$ is a Lipschitz constant for $f$ over $B(x^0, 2R)$.

Proof. The first inequality follows from Theorem 4.41 part 3., the fact that $B(x^\star, v_{\min}) \subseteq B(x^0, 2R)$ (implying that $M$ is a Lipschitz constant for $f$ over $B(x^\star, v_{\min})$), and $\mathrm{vol}(E_{t+1}) \le \left(1 - \frac{1}{(d+1)^2}\right)^{td/2}\mathrm{vol}(E_0)$ by Theorem 4.48. The second inequality follows from the general inequality $(1 + x) \le e^x$ for all $x \in \mathbb{R}$.

By setting the error term $MRe^{-\frac{t}{2(d+1)^2}}\left(\frac{\mathrm{vol}(E_0)}{\mathrm{vol}(C)}\right)^{1/d}$ less than or equal to $\epsilon$ in Theorem 4.49, the following is an immediate consequence.

Corollary 4.50. For any $\epsilon > 0$, after $2(d+1)^2\left(\ln\left(\frac{MR}{\epsilon}\right) + \frac{1}{d}\ln\left(\frac{\mathrm{vol}(E_0)}{\mathrm{vol}(C)}\right)\right)$ iterations of the ellipsoid method,

\[
\min_{j=i_1,\ldots,i_{h(t)}} f(x^j) \le f(x^\star) + \epsilon.
\]

In particular, if there exists $\rho > 0$ such that $B(z, \rho) \subseteq C$ for some $z \in C$, then after $2(d+1)^2\ln\left(\frac{MR^2}{\epsilon\rho}\right)$ iterations of the ellipsoid method, $\min_{j=i_1,\ldots,i_{h(t)}} f(x^j) \le f(x^\star) + \epsilon$.

Proof. We simply use the fact that $\mathrm{vol}(B(z, \lambda)) = \lambda^d\,\mathrm{vol}(B(0, 1))$ for any $z \in \mathbb{R}^d$ and $\lambda \ge 0$.

Because of the logarithmic dependence on the data ($M$, $R$, $\rho$) and the error guarantee $\epsilon$, and the quadratic dependence on the dimension $d$, the ellipsoid method is said to have polynomial running time for convex optimization.

References

[1] Michele Conforti, Gérard Cornuéjols, Aris Daniilidis, Claude Lemaréchal, and Jérôme Malick. Cut-generating functions and S-free sets. Mathematics of Operations Research, 40(2):276–391, 2014.

[2] Luc Devroye, László Györfi, and Gábor Lugosi. A probabilistic theory of pattern recognition, volume 31. Springer Science & Business Media, 2013.

[3] P.M. Gruber. Convex and Discrete Geometry, volume 336 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer-Verlag, Berlin, 2007.

[4] B. Grünbaum. Partitions of mass-distributions and of convex bodies by hyperplanes. Pacific J. Math., 10:1257–1261, 1960.

[5] Vladimir N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16(2):264–280, 1971.
