Convex Optimization Page 1

CONVEX OPTIMIZATION

Chapter 2 – Convex Sets

 Basics o A set is affine if it contains any line through two of its point. Alternatively,

xx11,,nnnÎ=++Î ,qq+ 1 1 x q x . o The affine hull of a set of points is the set of all affine combinations of these points.

o The affine dimension of a set is the dimension of its affine hull. Its relative interior is its interior relative to its affine hull relint =Î{}xxCB: ( , r ) Ç aff Í for some r >0

o The most general form of a is ()x , where ()1x Î= . o A set  is a cone if xxγÎ ,qq 0  The set {}(,):xxtt£ is a cone associated with a particular norm.

 The conic hull of {}xi is {}l11x ++ lkkx0: l ³ . o A hyperplane is a set of the form {}xax: ⋅=b . Hyperplane with normal vector

^ a, offset b from the origin; can be written as {}xa:(⋅- x x00 )0 = = x + a

o Given k + 1 affinely independent pints (ie: vvi - 0 linearly independent), the k- dimensional simplex determined by these points is  =³=qqv :0,1 q . {å ii i + } We can describe this as a polyhedron as follows:  Write B = éùvv- ,,vv- and q¢ = éq ,, q ù . All points x Î  can ëûêú10k 0 ëê 1 k ûú ¢ ¢ ¢ then be expressed as xv=+0 Bq provided q ³ 0 and 1q⋅ £ 1

 B has rank k (by assumptions) and k < n, and so there exists a A Î nn´ éABùéù I such that AB ==ê 1 úêú. êABúêú0 ëê 2 ûëûúêú

Daniel Guetta Convex Optimization Page 2

¢  Multiplying the boxed equation by A, we get q =-AA110xv and ¢  ¢ AA220xv= . We can therefore express q ³ 0 and 1 q £ 0 as linear

inequalities. Together with AA220xv= , they define the polyhedron.  Operations that preserve convexity o Intersection (including infinite intersection) – also preserve subspaces, affine sets and convex cones:

+  Example: The positive semidefinite cone n can be written as XXγ :0zz . Each set in the intersection is convex (since z¹0 { n }

+ the defining equations are linear), and so n is convex.   Example:  =Îx m :cos(1xit)] £ for t Î-[,pp can be written { å i 33} as XmΣ :1- (costt , , cos ) ⋅£x 1 , and so is convex.  tÎ-[,]pp { n } 33 o Affine functions: An affine function has the form fA()xxb=+. The image and inverse image of a under such a function is convex.

 Example: 12+={xyx+ :,Î  1y Î  2} is the image of

12´={}(,xx 121 ): x Î  1 ,x2 Î  2 under f (,xx12 )=+ x 1 x 2. 

 Example: {xx: ACxd£=b, } is the inverse image of  + ´{0} under fAC()xbxdx=- ( , - ). 

 Example: {xxx:()AA=++11  xnnAB } is the inverse image of the

n positive semidefinite cone + under fBA()xx=- (). 

 n  Example: {}xxx:(--cc))P ( xx£ 1 , where P Î ++ is the image of a

1/2 unit Euclidean ball under fP()uux=+c .  o Perspective function: ft(,)zz= / t, where t > 0. It normalizes the last component of a vector to 1 and then gets rid of that component. The image of a convex set under the perspective function is convex.

o Linear-fractional function: A linear-fractional function is formed by compsing that perspective function with an affine function. They take the form fAbd()xx=+ ( )/( cx ⋅+ ), with domain {}xcx:0⋅+>d .

Daniel Guetta Convex Optimization Page 3

 Separating & Supporting Hyperplanes o Theorem: If Ç¹Æ then $¹a0 and b such that abx⋅£"Îx  and abx⋅³x "Î . In some cases, strict separation is possible (ie: the inequalities become strict). o Example: Consider an affine set  =+{}Fugu: Îm and a convex set  which are disjoint. Then by our Theorem, there exists a0¹ and b such that

abx⋅£"Îx  and aagu⋅-⋅" [FFug+³ ] ³ baub . The only way a linear function can be bounded below is if it’s 0 – as such, a0F = , and bag£⋅.

o Theorem: Consider two convex sets  and  . Provided at least one of them is open, they are disjoint if and only if there exists a separating hyperplane. Proof: Consider the open set – ax⋅ cannot be 0 for any x in that set, else it would be greater than 0 for a point close to x. Thus, ax⋅ is strictly less than 0 for all points in the open set. 

o Example: Consider Axb< . This has a solution if and only if

n m  =-{}bxxA : Î and  =  ++ do not intersect. By the Theorem, this is true if and only if there exists l ¹ 0 and m Î  such that l ⋅yy£"Îm  and l ⋅yy³"Îm  . In other words, there is not separating hyperplane iff

ll¹³00llA =£ 0b0 Thus, only one of this sytem and Axb< can have a solution. 

Chapter 3 – Convex Functions

 Basics o We extend a convex/concave function by setting it to +/– ¥ outside its domain. o Theorem: f is convex over  convex iff f (yx))³+-f ( )f (x ) ( yx over  .

Proof:  choose x1, x2 and convex comb z. Apply equation with y = z and

xx= i . Multiply one equation by l , other by 1-l . Add the two.  Take x, y. By convexity ftfxtfy(()xtyx+-[)()]) £-(1 + for t Î (0,1). Re-arrange to get

Daniel Guetta Convex Optimization Page 4

f(y) on one side, divide by t, take limit as t  0 . General case consider

 gt()=+- f() tyx (1 t ) and gt¢()= +- ft()()yxyx(1 t ) -.  Apply previous

result with y = 1 and x = 0.  Apply inequality with ttyx+-(1 ) and ttyx+-(1 ) . This implies an inequality about g that makes it convex. 

o Theorem: 2 f (x ) 0 over  convexf convex over .

Proof: ff()yy= (xx)1)+-⋅-+- f()(xyx )1 ( yx )é 2 fee x +- ( ) yù ( for 2 ëê ()ûú e Î [0,1]. If 2 f is positive definite, get FOC for convexity.   Convex functions The following functions are convex Function Parameters Convex/concave… …on domain

eax a Î  convex  a > 1 or a < 0 convex (0,¥ ) xa 0 < a < 1 concave (0,¥ )

p ||x p > 1 convex  log x concave (0,¥ ) x log x convex (0,¥ ) ax⋅+ b (ie: any affine function) both n (ie: any norm) convex n x log(åe i ) (the log-sum-exp func.) convex n

1/n x (the geometric mean) concave (0,¥ )n () i

(the log determinant) n log det X convex X Î ++

Same as fi, providing they are all wf()x wi > 0 å ii concave/convex

Ways to find convexity:

o Directly verify definition

o Check the Hessian: for example, for f(,xy )= x2 / y

22é yxy2 - ù é yù =2 fxy(, )êú =-éù y xêú 0 33êú-xy x 2 ëûêúêú-x yyëêúû ëê ûú

Daniel Guetta Convex Optimization Page 5

Restrict to a line: f is convex if and only if gt()=+- fé txx (1 t ) ù is convex o ëê 12ûú

over [0,1] " xx12, . For example, fX()= logdet X. Take the line XZtV=+ , restricting to values of t for which X  0 , and wlog, assume it contains t = 0.

g() t=+= f ( Z tV ) logdeté Z1/2 ( I + tZ-- 1/2 VZ 1/2 ) Z 1/2 ù ëûêú =+log detéù (ItZVZZ--1/2 1/2 ) =++ log det Z log(1 tl ) ëê ûú å i --1/2 1/2 Where li are the eigenvalues of ZVZ . Taking derivatives of g, we find that the derivative is always < 0. Thus, convexity.

 -1 nn o Use the epigraph: Consider f (,xxxYY )= . It is convex over  ´++ . ìüé ù ïïY x epi fYtY==(xxx , , ) : -1 £ tY,0íýïï ( x , Yt , ) : êú 0,0 Y {}ïïêúx  t ïïîþëêúû (We used Schur Complements). This is a set of LMIs, and therefore convex. é ù o Jensen’s Inequality: f convex  ff()[]xx£ ëê ()ûú , "Î=xxs.t. ( dom f ) 1.  Operations that preserve convexity o Non-negative weighed sum. o The perspective function gt(,)xx= tft ( /) [t > 0] is convex if f is convex. Example: The perspective of the negative logarithm gives the relative entropy and Kullback-Leibler divergence. 

o The pointwise maximum supyÎ f (xy , ) is an extended-value convex function, if f(,⋅ y ) is convex for each y Î  [note that we do not require joint convexity of f]

This corresponds to intersections of epigraphs. Example: Let f(x) be the sum of the r largest elements of x. Then we can write f(x) as the maximum of all the possible sums of r elements of x. 

Example: Support function s ()xxyy=⋅Î sup{ : } . Convex. 

 Example: lmax ()XX=£ sup{}xxx : 1; family of linear functions of X.  Note: Every convex function f : n  can be written as fggf¢()xx=£ sup():{} affine, gz()zz () "

Daniel Guetta Convex Optimization Page 6

Clearly, f ¢()x)(£ f x . Furthermore, epi f is convex and so at any (,)z tfÎ epi , élùéxz- ù $¹(,l m ) 0 such that êúê⋅ ú£⋅0()(l xz - +m f x) -t £ 0 for all x. êúêm f (x)-t ú () ëê úêûë ûú Now, (1) we must have m ³ 0 , else as we take t ¥ violated and (2) we must have m ¹ 0 , else we get l = 0 . Thus, can write gf()zx=+⋅-£ ()1 l ( xz ) t m Choosing a point on the boundary of the epigraph, tf= ()z and so gtf()zz£= (). As such, it’s a global underestimator with gf()x) = (x . 

o The minimum infyÎ f (xy , ) is convex provided it is >-¥ and f (,xy ) is jointly convex in (x, y) over nm´ and  Ì m is convex. This corresponds to the projective of a (convex) epigraph onto the subspace of x.

o The composition with an affine function fA( xb+ ) is convex, provided f ()⋅ is convex.

o The general composition fhgg()xxx= ()1 (),, k () behaves as follows: Result h h g Convex Convex Nondecreasing Convex Convex Convex Nonincreasing Concave Concave Concave Nondecreasing Concave Concave Concave Nonincreasing Convex These can be derived (for the case k = 1 and x scalar) by noting that the second derivative of f is given by

2 fx¢¢()=+ hgxgx ¢¢( ())éù ¢ () hgxgx ¢( ()) ¢¢ () ëê ûú  Conjugate Functions

* o ff()yyxx=⋅- supxÎdom f () (). It’s the maximum gap between the linear function g(x) = yx and f. If f is differentiable, this occurs at a point at which f ¢()xy= . Basic examples:

f(x) f *(x) Domain f(x) f *(x) Domain

y – 1 ax + b –b {a} x log x e  1/2 –log x log(1/y) – 1 - ++ 1/x –2(–y) - + x e y logy – y  +

Daniel Guetta Convex Optimization Page 7

Example: f ()xxx= 1 Q with Q Î n . Then yx⋅-1 xQ x which is bounded 2 ++ 2 above and maximized at yx-=QQxy0  =-1 , and fQ*1()yyy= 1  - .  2

Example: Let I  (xx )= 0 if Î¥ , otherwise . Then the conjugate of this

* indicator is the support function I (yyxy )=⋅= supxÎ ( )s ( ).  Example: If f ()xx= , then f *()yyy=£ Indicator fn of : 1. To see why; { * } if y > 1 , then $⋅>£zyzz s.t. 1, 1 . Take xz= t and let t ¥. If y £ 1 * * , then xy⋅£xy £ x. Therefore, 0 maximizes xy⋅ - x maxed at 0.  *

x Example: f ()x = log()åe i . Differentiating yx⋅-f() x and setting to 0, we

xx get ye= ii/ e and fyy*()xy=³⋅= log if 01 , y 1 and ¥ otherwise i å å ii [this is valid even if some of the components of y are 0].  Example: Company uses resources r at price p prices produces revenue S(r). The maximum profit that can be made from a given price is

* MS(prpr )=-⋅=-- supr [ ( ) ] ( Sp )( ) .  o From the definition, we get f()xyxy+³⋅f * () . o If f is differentiable, the maximizer of yx⋅-f() x satisfies yx=f(). Thus, for any z for which yz=f (), f *()y))(=⋅zzz-f ( f

o The conjugate of gfA()xxb=+ ( ) is gfAA**()yyby=- (-- ) . o The conjugate of the sum of functions is the sum of conjugates. o Let f be a proper convex function. A vector g is a subgradient of f at x if f ()z ³+⋅-"Îf()xgxzz ( ) n . ¶f ()x is the set of all subgradients at x. It is a closed and convex set. If x is in the interior of the domain of f, ¶f ()x is non- empty, and if f is differentiable at x, ¶=f ()xx {f ()}.

 If f is a proper convex function, then xy⋅¶=f ()x + f * () yy Î f () x [if f is closed, these are also equivalent to xyζf *(). Proof: Follows directly from the definition of f*.   Chernoff Bounds & Large Deviations Chernoff Bounds –  £³£"³eXteelll()Xt--() t [] X l 0. o {}Xt³

Daniel Guetta Convex Optimization Page 8

o Define the cumulant generating function f()l = log[ elX ], and define

f+()ll= f () if l ³ 0 and ¥ otherwise. o Making the Chernoff Bound as tight as possible, we get

-lltX * (Xt³£ ) infll³³+00 e ( e ) = exp( - sup [ll tf - ( )]) = exp( - ft [ ])

o Similarly, (XzzΣ ) [el⋅+X m ] provided l ⋅+³"Îm 0 . Then, defining f()l = log[ el⋅X ], we find that

log ()inf():XzzΣl,m {mm +f ll -⋅£"Î }

=-⋅+infl {} supzÎ (llz )f ( ) * =-+=-infl {Sf (ll ) ( )} f (0) o Example: Let X be multivariate Gaussian: X ~(0,)NI. We then have f ()l =⋅1 ll. Consider  =£{:xxbA }. Now 2  By LP duality: S ()sup(-=l -⋅=l zbu )inf ⋅ zÎ Auu0=-l, ³  As such, log ()infXbuu0u0Σ ⋅+⋅1 ll s.t. ³,A += l l 2  Eliminating l , we get log)n (XbuuuΣ i f ⋅+1 AA u0³ 2

2  Using QP duality, we can write this program as sup--1 A-1 (l b ) . l³0 2 2

2  Intepreting l as a slack variable, this becomes sup - 1 x Axb£ 2 2  As such, ()expdist(0,)X Σé -1 ù . ëê 2 ûú

Chapter 4 – Convex Optimization Problems

 Terminology

minfx0 ( )

s.t.fxi ( ) £= 0i 1, , m

hxi ()==0,, i1  p o  is the feasible region of the problem  Equivalent Problems o Informal definition of equivalent problems: from the solution to one problem, a solution to the other is readily found, and vice versa.

Daniel Guetta Convex Optimization Page 9

o Change of variables – consider a one-one function f : nn with  Í ff(dom ). Then replacing x by f()x leads to an equivalent problem.  Example (Linear-fractional programming): Consider mincx⋅+ d s.t. AGxbxh=£ , ex⋅+ f Simply write the objective as cd⋅+x 1 , and min cy⋅+ dz . ()ex⋅+ f() ex ⋅+ f

o Transformation of objectives & Constraints – suppose:

 y0 :  is monotone increasing.

 y1,:, ym  satisfy yi (uu)0££u

 ymp+1,,ym+ :  satisfy yi ()u == 0 u 0 Then composing f and h with these functions leads to the same problem.

o Eliminating equality constraints – say we find a function f : kn such that x satisfies the equality constraints if and only if it can be written as xz= f(). Then we can eliminate the equality constraints and optimize

fz0(ff ( )) s.t. fzi ( ( ))£ 0 over z. For example, the equality constraint Ax = b

mn´ with A Î  (with solution x0) can be replaced by xzx=+F 0 , with F Î nn´-[rank A]. This preserves convexity, since this is an affine transformation. Optimizing over some variables – it is possible to optimize over each variable one-by-one; this is especially true when some constraints involve only a

 subset of the variables. For example, take min xxP . Minimizing over x2 xx:fi ()01 £

 only gives an objective of xx11S . Thus, the problem is equivalent to min xxS . xx11:(fi )0£ 1 1 o Epigraph form – the problem above is equivalent to minimizing t subject to

fx0()-£ t 0. Particularly useful for minimax problem.

2 Example: minxx max1,££irxy-³"it=- mintt s.t. xy i i – a QCQP. 

o Implicit & Explicit constraints – if the objective function has a restricted domain, it can often be unrestricted by adding additional constraints instead, and vice versa.  Convex problems

Daniel Guetta Convex Optimization Page 10

A convex optimization problem is

minf0 (x) ¬ convex

s.t.fi (x ) £=01,,conim ¬vex  axi == bi i 1, ,p ¬ afine

If f0 is quasiconvex, the problem is a quasiconvex optimization problem. In either case, the e -suboptimal sets are convex. An optimality condition for x is

⋅-³"f0(xyx ) ( ) 0 y feasible

Geometrically, f0()x is a supporting hyperplane to the feasible set at x. Alternatively, anywhere feasible we try to move from x yields an increase in objective.

Proof: For a convex function, ff00()y ³+⋅-f0()xxyx ()( ). If the optimality condition

is met, ff0(yy)()³"0 x . Conversely, suppose $⋅-

close to 0, we can decrease f0 by moving away from f.  Examples:

 For unconstrained problems, =f0()x 0  For problems with Ax = b, every solution can be written as yxvv=+,() Î A

So ⋅³"ÎfA0()xv 0 v  (), but this is a nullspace, so ⋅="ÎfA0()xv 0 v  ()

^   In other words, fAA0()x Î= () ( ), ie: $+=nn s.t. fA0 (x0 ) .

 For problems with x0 , need ³f0()x0, else ⋅-f0()(xyx ) unbounded

below. So reduces to ⋅³f0()xx 0. This gives complementary slackness…

Examples of convex optimization problems.

o Linear programs (LP)  Example (Chebyshev Center): Consider the problem of finding the

largest ball  =+{}xuuc | £r that lies in a polyhedron. Require that  lies on one side of ax⋅ £ b is equivalent to requiring: sup ax⋅+ au⋅⋅+ £bb axr a £  u £r { cc}

 Example: minx maxii()ax⋅+ b i can be linearised using the epigraph.  o Linear-fractional programs

Daniel Guetta Convex Optimization Page 11

cx⋅+ d min s.t. x Î= {}xxhxb :GA £ , = xex: ⋅+> f 0 ex⋅+ f This is a quasiconvex program, which can be transformed into min cy⋅+ dz y,z s.t.Gzyh--=⋅+=³£ 0,Azyb 0, eyf z 1, z 0 Proof: Note that zz==1 ,yx is feasible for the transformed problem with ex⋅+ f the same objective. Similarly, x = y/z is feasible for the original problem, if z ¹ 0 . Thus, the original and modified objectives are both > and < each other.

If z = 0 and x0 is feasible for the original problem, then x = x0 + ty is optimal for the original problem; taking t ¥ allows us to gets the two objectives arbitrarily close to each other. 

o Quadratic programs (QP) min1 xxqxPGA+⋅+rhb s.t. xxÎ= x £, = x 2 { } Where P  0 .

2  Example: Distance between polyhedra minxx-Î s.t. x , xÎ . 122 112 2  Example: Consider an LP minimizing cx⋅ where ()xx= , ov(x ) =S. Then  ar(cx⋅=S ) x x. Can minimize cx⋅+g x S x. 

 Example (portfolio problems): pi is relative price change of asset i

(change/start) and xi is amount of asset bought. Mean return is px⋅ , variance is xxS . Minimize variance subject to given mean return. Budget constraint 1x⋅= 1. Short sales: Can set xx=-long x short and

require 1x⋅ short£⋅h1x l o ng . Transaction costs: Can set xx=+-init u buy u sell and constraint (11-⋅=+⋅ffsell)1u sell( buy) 1u buy . 

o Quadratically constrained quadratic programs (QCQP) min11xxqxrPP+⋅+ s.t. xxqxr0x +⋅+£Î,  x 2200 0 ii

Where the Pi are positive definite. In this case, we are maximizing a quadratic function over an intersection of ellipses and polyhedra.

o Second-order cone program (SOCP) minfx s.t. Aimii x+ bcxd£⋅+ = ,1, , , x Î x 2

Daniel Guetta Convex Optimization Page 12

If c0i = "i , this reduces to a QCQP (square both sides to see). Note that the direction of the inequality is important!  Example (quadratic constraint):

1 (1+⋅+bx c ) 1 xxbAA +⋅xx+cc£012 £() -b ⋅- Ax 2 2

i i  Example (robust LP): Consider minx cx⋅⋅£ s.t. ax b for all

i ii aauuÎ=i { +P |1 £} (ie: ellipsoids). Can write constraint as axii⋅++£sup{ uPP,,xu | ££1} baxii ⋅ i  x b i, an SOCP. Norm terms are regularization terms; prevents b from being high in directions where uncertainty in a is high.  Note: If = =={}xxf: A is a polyhedron with k vertices, then sup{ax⋅ : xÎ } must occur at a vertex, so 1 inequality just turns into k inequalities. Alternatively, supax⋅= inf fz ⋅ (strong Ax£f { } Az=a0,z³ { } duality) and so write ax⋅£⋅£ b fz b ,A z = az , ³ 0 . 

ii  Example (uncertain LP): Say aa~(,)N Si . Can ask for

()axii⋅£ b ³h . Note that D(axi ⋅= ) x S=S x1/2 x . Can then ii2 express probability constraint as SOCP. 

o Semidefinite programs (SDP)

minx cx⋅+++Î s.t. x11FFG xnn  0 , x  Multiple LMIs are easily dealt with by forming a large block diagonal LMI from the individual LMIs. If the matrices F are diagonal, this is an LP.  Example (bounds on eigenvalues): Consider min A ( A is max 2 2 singular value). Note AA£sA  sI2 . Using Schur complements, 2 é sI Aù AA - sI2  00êú êúAsI ëê ûú (Clearly, sI  0 and the Schur complement is SsI=-AA ). Thus, s simplify minimize s subject to the constraint above. The original LMI was

Daniel Guetta Convex Optimization Page 13

quadratic – Schur complements allowed us to make it linear. [Similarly,

we can bound the lowest eigenvalue: lmin ()AsAsI³  ] 

 Example (portfolio optimization): Say we know Lij£Sij £U ij . Given a portfolio x, can maximize xxS s.t. that constraint and S  0 to get worst-case variance. We can add additional convex constraints

 2  Known portfolio variances: uukkS=s k  Estimation error: If we estimate S=Sˆ but within an ellipsoidal confidence interval, we have C ()S-Sˆ £a , where C()⋅ is some positive definite quadratic form.  Factor models: Say pFzd=+, where z are random factors and d represents additional randomness. We then have

 S=FFD Sfactor + , and we can constraint each individually.

Sij  Correlation coefficients: rij = . In a case where we know SSii jj

the volatilities exactly, constraints on rij are linear…   Example (expressing QCQP and SOCP as SDP): Using Schur complements, we can make these non-linear constraints linear

é IP1/2x ù 1 xxPxg+⋅+h £00êú 2 êú()P1/2x  -⋅g xh - êúë û é()gx⋅+ hIF x + qù Fxq+£⋅+ gxh êú 0 êú()Fxq+⋅+ gxh ëê ûú  Geometric Programming

K aa a A function fcxxx()x = 12kk n k, with c > 0 and a Î  is a posynomial o åk=1 kn12 k i (closed under +, ´). K = 1 gives a monomial (closed under ´¸, ).  Posynomial´= Monomial Posynomial  Posynomial¸= Monomial Posynomial o A geometric program is of the form

minfh0 (xx ) s.t. fi (),£=> 1i (x ) 1, x0

Daniel Guetta Convex Optimization Page 14

f are posynomials, h are monomials. Can deal with fg()x) £ (x and hg()xx= () by dividing. Can maximize by minimizing inverse (also posynomial).

o Example: maxxx123xxxxxxx s.t. 12++ 23 13£ c / 2,x0 > [min volume box] can

---111 be written as minxxx123 s.t. (2 xxc 12 / )++ (2 xxc 23 / ) (2 xxc 13 / )£ 1,x0 >. 

yi o To make convex, substitute yxii= log =xei . Feed in and then take logs of objective and constraints. Result is convex.  Existence of Solutions o Theorem (Weierstrass): Consider the problem minf (xx ) s.t. ÎÌ n . Then, if  is non-empty, f is lower semicontinuous over  and either (1)  is compact (2)  is closed, and f is coercive (3) There exists a scalar g such that the level set ()gg=Î{xx :f () £} is nonempty and compact, then the set of optimal minimizing solutions of f is non-empty and compact.

* * Proof: Let f be the optimal objective, and gk  f . Then the set of optimal

¥ solutions is ()g . If (3) is true, this is an intersection of nested non-empty k=1 k compact sets – it is therefore non-empty and compact. 13 For g > f * , ()g must be non-empty, and by semi-continuity of f, it is closed. Since  is compact, this closed subset is also compact. 23 ()gg=-¥Çf -1 () ( , ] since  is closed, the intersection is closed and so is the inverse by semi-continuity of f. Since f is coercive, ()g is also bounded. Thus, ()g is compact.  Example: Consider min1 xxbxxP -⋅ , Î . If l is the smallest eigenvalue of o 2

2 P, then we can say that 1 xxbxP -⋅ ³1 l x - bx. This is coercive if 2 2 l > 0 . Thus, a solution exists if P  0 . 

Chapter 5 – Duality

 The Lagrangian Function

o min f0(xx ) s.t. fx())(£=0h , 0. We let (,xxfxhxln , )=+⋅+⋅f0 () l () n ()

be the Lagrangian, and g(,)ln= inf(,,)x  x ln . This clearly underestimates the

optimal value, because everywhere in the feasible region, (,,)xxln £ f0 ().

Daniel Guetta Convex Optimization Page 15

mp o Writing the original program as min fIfIh()xx++éù () éù () x, 00ååii==11- ëûêúii ëûêú

where I– and I0 are indicator functions for the negative orthant and {0}, we see the Lagrangian replaces the indicators (“hard walls”) by “soft walls”.

o Example: min xx⋅= s.t. A x b.  =⋅+⋅xxl ()A x - b. Differentiating, set to 0, minimum is at 20x +=Al , so gA(,ll) =  - 1  l .  ( 2 )

 2 Example: min xxWx s.t. i = 1 . Involves partitioning the xi into either a “+1”

group or a “–1” group; Wij is the cost of having xi and xj in the same/different 2 é ù partitions. (,xxxx1xnn )=+⋅-=+WW ( )ëê diag() nnûú x1 -⋅. Minimizing é ù over x, we find that gW (nn )=-1 ⋅ if ëê + diag( n)ûú  0 and -¥ o.w. Can use to

find bound – for example, using n =-lmin ()W 1 , feasible because WI-lmin  0 .  The Lagrangian & Convex Conjugates o Consider minfAC (xx ) s.t.xd£=b , . The dual function is gfAC(,ln )=+⋅-+⋅- infé ()xxbxd l ( ) n ( )ù x ëûêú =-ln ⋅bd - ⋅ +inféfAC ( x ) + ( l + n ) xù x ëê ûú =-ln ⋅bd - ⋅ -supé ( -AC l - n ) xx - f ( )ù x ëê ûú =-ln ⋅bd - ⋅ -fA*() - l - C n Example: f ()xx= and only equality constraints. The conjugate of a norm is

ïì  ï-⋅nnd C £ 1 the indicator of the unit ball in its dual norm, so g()n = í * . ï -¥ otherwise îï Example: Min entropy: minxx log s.t. Axb1x£⋅= , 1 . The conjugate of x å ii log x is ev – 1, and so gA(,)llnn=- ⋅b - - exp(é - l ) - n - 1ù .  å ëê i ûú  The Lagrange Dual Problem o The dual problem is maxg (ln , ) s.t. l³ 0 , with domain {(,ln ):(,g ln )>-¥} .

o Weak duality implies that dp* £ * . If the primal is unbounded below, the dual is infeasible. If the dual is unbounded above, the primal is infeasible.

o The dual is always convex and can be used to find a good lower bound. o Strong duality holds under Slater’s Conditions; the problem is convex, and there exists a point such that every inequality constraint is strictly satisfied. If the constraints are linear, only feasibility is needed, not strict feasibility.

Daniel Guetta Convex Optimization Page 16

Example: QCQP: min11xxqxrPP+⋅+ s.t. xxqxr +⋅+ £ 0 , where o 2200 0 ii i

PP0  0,i  0 . The Lagrangian is P()l qr ()ll  () (,xxll )=+1  PPll xqqxrr ++⋅++⋅ 2 ()()()00ååii ii 0 i l ³ 0 and so P()l  0. Differentiating and setting to 0, we find the dual problem is max()g llllll=-1 qqr0 ()P() -1 () + () s.t.  . pd**= if we have 2 strict feasibility. 

-()Al Example: Min entropy (above) has dual max -⋅--l b n ee-n-1 i . o l³0,n ()å

-()Al Optimizing over n , we get max-⋅-l b e i + 1 , a GP.  l³0 ( å )  Geometric Interpretation/Proof of Strong Duality

* o Consider  = {}((),(),()f xhxf0 x . Then ptt= inf{ : (uv ,v , )Σ= ,u0 , 0} , and gtt(,ln )=⋅ inf(,,1)(,,):(,,){ ln uv uv Î}. To assuming the infemum exists, the inequality (,,1)(ln⋅³"Îuv,,,(t),)gt (,l n ) u v  defines a supporting hyperplane. Looking at the expression for p* and g(,ln ), it is clear that p* ³ g(,ln ) when l ³ 0 , because it involves more constraints.

m o Let =+(++,{0 }, ) be an “epigraph” of  which extends it “up” in the objective & inequalities. This then allows us to write ptt* =Îinf{} : (00 , , )  , and, for l ³ 0 , gtt(,ln )=⋅ inf(,,1)(,,):(,,){ ln uv uv Î} . Once again, this defines a supporting hyperplane to  , and the dual problem involves finding the supporting hyperplane with least t. Since (,00 ,p* )Î Bd() , we have pp**=⋅³(,00 , )(,llnn ,1)g (, ), which is weak duality. If there is a non-vertical supporting hyperplane at (,00 ,p* ), then strong duality holds. o To prove strong duality, let  =<{}(,00 ,):ss p* . This set does not intersect with  . Create a separating hyperplane (,,ln  m ) so that  (,,ln  m )(⋅Σuv,,t) £"amma (,, uvttp )  * £  (,,ln  mam )(⋅³³u0,)vv,0t ³³pt* " (,,)u Î  l , m. Otherwise, the LHS would be unbounded below as we went up the epigraph.

Daniel Guetta Convex Optimization Page 17

Divide first equation by m and substitute in the second to find the Lagrangian is > p* .

If m = 0 ; second equation above becomes (,ln  )(⋅"Îuv,0 )³ (,,)uvt  . Applying this to the strictly feasible point with v = 0 and u < 0, we find that

l = 0 , and so n ¹ 0 [supporting hyperplane cannot be 0 vector], so n ⋅v ³ 0 . But the problem is convex, so vxb=-A . But since we have an interior point with n ⋅=v 0 , there are points around that interior point with n ⋅

o Multicriterion optimization: Consider min(f (xx ),f0 ( )). One way to obtain every pareto-optimal point is to minimize l ⋅F . Since we can re-scale this

without changing the minimizers, f0()xF+⋅l is an example of such a program, which is precisely the Lagrangian.

o Shadow prices: The dual problem is the lowest cost given we can buy some “constraint violation” and be rewarded when we don’t violate them. Clearly, this is lower than our lowest cost without these amenities. When strong duality holds, there is a price that makes us indifferent. This is the “value” of the constraints.  Optimality conditions o Complementary Slackness: Consider that, if x * and (,ln** ) are primal and dual optimal points with zero duality gap

*** ** *** * fg0()x = (,lnn )=££ inf(,,x xxlln ) (, , )f0 () x As such, every inequality in this line must be an equality. Now, recall that

*** ***** (,xxfxgxln , )=+⋅+⋅f0 () l () n (). This implies that at optimality

** ** * l ⋅=f()x 0. Since each term is non-positive, we must have liif ()x = 0. Only active constraints at the optimum can have non-zero multipliers.

o KKT Conditions: Based on all the above, we find that for any problem for which strong duality holds, any primal-dual optimal pair must satisfy

Daniel Guetta Convex Optimization Page 18

=£(,x0fx0hx00***ln , ) () * () * =³⋅= l * l * fx () * 0 If the problem is convex, these are also sufficient conditions, because since l* ³ 0 ,  is convex, and so the first condition implies  is minimized. Finally, complementary slackness shows we have 0 duality gap. Example: min1 xxqxrPA+⋅+ s.t. xb =. KKT conditions are Axb* = and 2 éPA ù éxq* ù é- ù PAxq**++n = 0, or êúêú= ê ú .  êúA 0 êún* ê b ú ëê ûú êúë û ëê ûú a Example: min(-+å log(iix )) s.t. x1x³⋅= 0, 1. This attempt to maximize communication capacity given total available power of 1 for each channel. KKT: - 1 -+=ln0 "i a +x i ii xxiiii³ 0,1x⋅= 1l ³ 0l = 0

Clearly, li only acts as a slack variable in the first equation. So n ³"1 i a +x i i xx³⋅=0,1x 1n -1 = 0 ii()a +x i i The first equation gives x =-1 a , but this can only work if an£11 £. If iin iina

this were not the case, xi would go negative. Thus ïì 1 -

n n Example: minfx ( ) s.t. ax⋅= b, (,x n )= fx ( )+⋅-n (ax b ). The o åi=1 ii åi=1 ii

é nnù * dual function is gb()nnn= -+nninfêú fxbfa ( ) +ax ⋅ =-+ ( - ). x ëååii==11iiû i i The dual therefore involves a single scalar variable (simple to solve). We then use the fact that the optimal point minimizes (,x n * ), which is convex.   Sensitivity analysis

* o Consider minf0 (xx ) s.t. f()xv)(£=uh , . We let p (,)uv be its optimal value. If the problem is convex, this is jointly convex (its epigraph is cl( ), above).

o Global inequality: Under SD pp***(,)uvv ³-⋅-⋅*(,00 ) ln u . To prove,

Daniel Guetta Convex Optimization Page 19

***** ** pg(,)00=⋅+ (lln ,nln )££+⋅+ f0 () x+⋅ fx () hx () f0 () x u ⋅ v

**** o Local Result: If p differentiable & SD, =-=-uvpp(,00 )ln and (, 00 )

* ** ¶p (0, 0) pt(,0)ei - p * =-limt0 ³ l [By global inequa lity] ¶uti Taking t negative gives the opposite inequality.  Examples Consider min log exp(axi ⋅+b ) . The dual isn’t particularly interesting. But o (å i ) the dual of min log exp(yA ) s.t. xb+= y is entropy maximization. The (å i ) same is true of Axb- . 

o This can be done with constraints; fb()ax⋅+ can be transformed to fy().   Theorems of the Alternative o Weak alternatives:  Non-strict Inequalities: Consider

f(x )£=0hx, ( ) 0 has soln  min 0 s.t .  =0 (and not ¥) The dual function has the property gg(,)aaln= a (,) ln and is g(,ln )=⋅+⋅ infé lf ()xhx n ()ù x ëê ûú Because of the homogeneity, if there is any g(,ln )> 0 with l ³ 0 , d * =¥. If that’s infeasible, d * = 0 . Thus, since dp* £ * , we find that g(ln , )>£= 0, l³ 0 feasible f(xhx ) 0, ( ) 0 infeasible In fact, at most one of the two is feasible – weak alternatives.  Strict inequalities: If the inequality f()x0< is strict, the alternative is

g(ln , )³ 0, l><= 0 feasible f(xhx ) 0, ( ) 0 infeasible We can show this directly from the definition of the dual function, if we assume there exists f()x0 < , then $>ln 0, s.t. g ( ln , ) < 0 .

o Strong alternatives – when fi are convex and hi are affine, we might be able to prove strong alternatives; that exactly one of them must hold  Strict inequalities: First, consider

* fx(n)<<0x ,AA=£bb feasible =psfs(mi s.t. i (),0 x x =)

Daniel Guetta Convex Optimization Page 20

The dual function is gA (ln , )=⋅ n (xb -+ ) miné (1 -⋅ 1 l ) sù . This is only s ëê ûú finite if 1⋅=l 1 . So the dual is dg* =⋅=³()max (ln , ) s.t. 10 l 1, l . Provided strict feasibility holds, strong duality holds and pd**= . So if the original system is infeasible (p* > 0), then there exists a g(),ln,0³> l 0 . Similarly, if there exists such a (,ln ), then p* > 0… f(x0xb)< ,)A =³> feasible g(,ln 0, l 0 infeasible infeasible feasible  Non-strict inequalities: Consider f()x0xb£ ,A = the program is the same as above, but we need the optimum to be attained so that p* > 0 if the system is infeasible. In that case, l ³>0,(,g ln ) 0 is clearly feasible.

o Example: Consider Axb£ . Then gA (ll )=- ⋅b0 if  l = and -¥ o.w. The strong system of alternative inequalities is lll³=⋅<0b ,A 0, 0 . 

 ii n o Example: Take m ellipsoids ii==+⋅+£Î{xxxxbxc:()fA i 2 0,} A i++ . We ask if the intersection has a non-empty interior. This is equivalent to solving

the system f()x0< . Here, gA()l =+⋅+ infxxbxc lll 2 ii. x (åååii) ( i) ( i )

 -1 Differentiating, setting to 0 and using obvious notation, gA()l =-bbclll + l.

 -1 As such, the alternative system is l > 0b,-+³lllA b c l 0.

To explain geometrically, consider that the ellipsoid with f()xfx=⋅l () contains the intersection of all the ellipsoids above, because if f ()x0£ , then clearly a positive of them is also < 0. This ellipsoid is empty if and only if the alternative is satisfied [prove by finding inff (x )]. 

o Example: Farkas’ Lemma: the following two systems are strong alternatives AAxx0=⋅

kk Consider minf (xgxiii ) s.t. ( )£ÎW0x , i [note: the vector g o ååii==11i i represents a number of inequality constraints]. The Lagrangian is

kk k (,xxmm )=⋅f ()xgiii+ (). The dual is gg()mmm=³ () s.t. 0 ååii==11i åi=1 i

iii where gfii()mm=+⋅ infi (xgx ) ( ) x ÎWi

Daniel Guetta Convex Optimization Page 21

o For example xij could be the quantity of resource j allocated to acitivity i, and we might want to maximize utility. Each component of gi corresponds to one resource constraint. The resulting geometric multipliers can be considered as prices for a given resources.

o The Tatonnement procedure guesses initial prices, solves the problem, and then adjusts them to ensure each resource is used exactly.  Duality & Combinatorial Optimization o The knapsack problem is max vx⋅£Î s.t. wx⋅ C , x {0,1}n .  gCCwvw()mmmmm=-⋅+=+- max (vwx ) max([/ ] ,0) xÎ{0,1]n å iii

v v  Assume WLOG 1 ³³ n . There is then a breakpoint I *()m until w1 wn

I * ()m which vw/ ³ m . We can write gC()mm=+ vw - m ii å i=1 ii  The dual problem is to make this as small as possible for m ³ 0 . Consider

æöæII**()mm () ö that gvCw()mm=+-çç÷÷. This is piecewise-linear in m – èøèççååii==11ii÷÷ ø the minimum occurs when the gradient switches from negative to positive;

I so IIwCopt =>min : . We then want m to take its smallest { åi=1 i }

possible value, which isd m* = vw/ IIopt opt

æöæIIopt opt ö  This gives an upper bound ççvC÷÷+-m* w. èøèççååii==11ii÷÷ ø

opt o For a lower-bound, consider the solution with xiIi =<1 if and 0 otherwise. This is clearly feasible, and corresponds to the greatest “bang for buck” policy.

A More Rigorous Approach to Optimality Conditions

 Unconstrained optimization: interior solutions of minf (xx ) s.t. ÎÌ n o Necessary conditions: If x * is in the interior of the feasible region and is an optimal minimum, then =f()x * 0 and 2*f ()x  0.

o Sufficient conditions: If x * is in the interior of the feasible region and =f()x * 0 and 2*f ()x  0, then x * is a strict local minimum.

Daniel Guetta Convex Optimization Page 22

o When using these conditions, (1) verify existence (2) find points with =f()x0 (3) compare those to points on the boundary.

o Consider, instead f (,xa ). Differentiate the FOCs with respect to a to get

-1 *2*2* =-xa()xaff ( xaa (),){  xx ( xaa (),)}

**** =ff()axaaxaxaaax ( (),) + () f ( (),)  Constrained optimization: boundary solutions of minf (xx ) s.t. ÎÌ n . o The set of descent directions is ()xd**=Î{ n : f () xd ⋅< 0} . The tangent cone  ()x * is the set of directions we can move from x * while staying in the feasible region. A necessary condition for x * to be optimal is ()xx**Ç=Æ () .

o If  is defined only by the equality constraints hx()= 0, then for any regular

* point (ie: point at which the gradients hi ()x are linearly independent): ()xxd**==Î ()()(){ n : hxd0hx * ==} ( *) Intuitively, any move in direction d will change h by hx()*  d… Thus

 xhxhx* local minimum & regular f ()xx* Î ( ** )^ =()()  ( ) =   ( * ) [For d Î  , we need ⋅³f()xd* 0. But since ddÎ-Î, ⋅=f ()xd* 0].

* ** * Thus, for any regular local optimum x , $+⋅=unique ll s.t. f ()xhx0 () . Example: min xxG⋅=⋅= s.t. 1x 1,m x m . FOCs are

** 201G+x1ll12 +mm = 1x ⋅ = x = m The first gives x1*1=-1 G- ll + m . Feeding into the others gives a system of 2 ( 12) equations for l , whence lhz=+mmsambg xvw*22 = +  =() + +. 

o Consider the addition of inequality constraints gx()£ 0 to the definition of  . It can be shown that all constraints but the active ones at the optimum can be

** ignored. Thus, provided a point is regular (ie: {hgjij()xx} È { (): active} is linearly independent), the KKT conditions provide conditions for optimality.

o When using such conditions, it is important to check for non-regular points as well. Constraint qualifications can be weakened to requiring inequalities to be convex and equalities to be linear.  Subgradients – another way of expressing optimality conditions is as follows

Daniel Guetta Convex Optimization Page 23

x ÎÎÎ+arg min{ff (xx ) :} x arg min{ ( x ) ( x )} ζ0xxxéùff() + () =¶+¶ () () x ëûêú

$gx ζf() s.t. - g ζ () x $gxgyxy ζf() s.t. ⋅ ( - ) ³ 0 " Î

For the last step, note g ζ()xyxgyxgyxy  () ³ () + ⋅ ( - )  0 ³ ⋅ ( - ) " Î .

Chapter 6 – Approximation

 The most basic approximation problem is min Axb- . Has solution 0 if b Î ()A . o Approximating b as closely as possible using columns of A. o Letting y = Ax + v (v is noise), and guessing x based on y, assuming noise small

o minuÎ()A ub- ; projecting b onto ()A . o x are design variables, b is a target, Ax is the result.  Examples

2 o min Axb- ; least square. Solution AAxb= A . 2

2 o min Axb- ; Chebyshev approx problem. Same as mintt s.t. -£11Axb -£ t ¥

2 o min Axb- ; Robust estimator. Same as min1t⋅- s.t. tt£ Ax -£ b . Slowest 1 growing that is still convex.

o minff (rrA1 )++=- (n ) s.t. rxb is penalty function approximation. Measure of dislike of large residuals. 

 Least norm problems are minxxb s.t. A = . Can be reformulated as min xu0 + Z ,

where x0 is a solution and the columns of Z form a basis for  ()A .  Regularization problems are bi-objective problems; min()Axbx- , . Use min Axb-+g x to trace out the tradeoff curve. o Example (Tikhonov regularization): æö 22 A æöb    ç ÷ ç ÷ minAAIAxb-+gg x = x + x - 2 bxbb + =ç ÷ x -ç ÷ 22() ç ÷ ç ÷ èøçI g ÷ èøç0÷ 2 Solution is

Daniel Guetta Convex Optimization Page 24

 æöæöæöAA Aæö -1 çç÷÷ ç ÷çb÷  çç÷÷xx==+ ç ÷ç ÷  AAg I A b  çç÷÷ ç ÷ç ÷ () èøèøèøççIIgg÷÷ ç I g ÷çèø0÷

o Example (minimizing derivatives and curvature): We can replace our second objective ( x ) by Dx . We then get xb*1=+()AAg DD- A . Two

useful examples of D are  D has 1s across its diagonal and a –1 to the left of each 1. Dx is then the

vector of quantities xi + 1 – xi; the “discrete derivative” of x.  D has 2s across its diagonal, and 1s to its left and right. Dx is then the

vector of quantities ([][]xxxxxiiiii+-+-1111---) = - 2 xx ii + , approximately the curvature (second derivative) of x. 

2 o Example (LASSO): min Axb-+g x can be written as QP 21

2 minAxb-+g 1y⋅££ s.t. -yy x  2

 Stochastic Robust approximation: minx Axb- with an uncertain A that has a

probability distribution. Do minx  Axb- instead. If ()AA==ii p, this becomes

minpt⋅£ s.t. Aii x- b t . For a 2-norm, this is an SOCP. For a 1-norm or ¥-norm, this can be written as an LP.

2 min  Axb- is actually tractable. We can write AAU=+, and we can then write 2 2 2 æö1/2  the objective as minç APxb-+ x÷, with PUU=  ( ) . This is Tikhonov èøç 2 2 ÷ regularized least-squares.  Worst-case Robust approximation: We let A Î  , and we solve

minx supAÎ Axb- . We consider several sets  :

o Finite set or polyhedron:  = {,AA1  ,k }: mintA s.t. ixb-£" ti . For a polyhedron, try this out for the vertices.

o Norm bound error:  =+{|AUU £ a }, where the norm is a matrix-norm.

Consider the approximation problem with the Euclidean norm and the max-

singular-value norm. Then supAÎ AUxb-+ x occurs when Ux is aligned with Axb- and is as large as can be. Letting UaA=-()/xbx A xbx - 22

Daniel Guetta Convex Optimization Page 25

achieves that. Thus, our program is min Aaxb-+ x. This is solvable as an x 22 SOCP: mintatA+-££ s.t. xb t , x t. 12 2 122 o Uncertainty ellipsoids: Assume each row of A is in  =+{|1}AuuP £. Then sup Axb- can be found by individually iiirow 2 AÎ maximizing supAx⋅=-⋅bbP sup Ax ⋅-+ ( uxu ) | £ 1 . We do row iiu { row iii 2 } this by aligning u with P x and get sup Ax⋅-bbP = Ax ⋅- +  x. i row iirow iii2 We then have min supAbtxb-= min t s.t. A ⋅-+ xP  x £. xxtAiiÎ ,row2 ii2 We can get rid of the absolute value sign and put the problem in epigraph form to get an SOCP.

Chapter 7 – Statistical Estimation

 In a parametric estimation problem, we believe that a random variable of interest has a

density that is part of a family px ()y indexed by x ÎW. Given an observation y, the

maximum-likelihood estimation problem is max logpx (yx ) s.t. ÎW. i  Example: Consider a model yvii=⋅+ax , where yi are the observed quantities, x is to m be estimated and the v are IID noises with a density p. Then ppy()yax=-⋅ (i ) i x i=1 i 2 o Gaussian noise: ()xxy=-m log(2ps2 ) -1 A - . ML is least-squares. 2 2s2 2 o Laplacian noise: pe()z = -||/z a /2a , and ()xax=-my log2aa -i ⋅ - / . i 1

This is 1 - norm approximation.

i o Uniform [–a, a] noise: (xax )=-my log 2a if i - ⋅ £a " i and -¥

i otherwise. The ML estimate is then any x with yi -⋅£"ax a i .

 Example (logistic regression): We let yi be Bernoulli random variables with p =⋅++⋅+exp(auii b )/[1 exp( au b )]. Then  =+-logpp log(1 ) . i ab, ååii yyii==10

We can feed pi into this expression.

Daniel Guetta Convex Optimization Page 26

Chapter 8 – Geometric problems

 In a classification problem, we are given two sets of points {xx1,, N } and {yy1,, M } and wish to find a function f (within a family) s.t. f (x i )>" 0 i and f (yi )<" 0 i .  Linear discrimination: We look for a function fb()xax=⋅- such that f (x i )³ 1 and f (yi )£- 1 (where we have simply normalized the equations above) [ie: we seek a hyperplane that separates the two sets of points].

Interestingly, the strong alternative of this system of equations is l ³³00,,(,),lll ¹ 0ll xii =  y11 , ⋅=⋅ ll  ååii By dividing by 1⋅l , this becomes ll,1, ³⋅=⋅=01,1,ll 1ll xii = y . This ååii states that the convex hulls of x and y intersect.  Robust linear discrimination: There are lots of possible solutions to the problem above; we’d like to find the one that separates the points the most. Consider the planes {}az⋅+=b 1 and {az⋅+=-b 1}. To find the distance between them, take a point

22 with ax⋅+=b 1 and solve ax⋅+()t a +bt=-1/= a -2Distance  = 2 a. So, 22 we want to solve min1 ax s.t. fti (i )³"tif , ( yi ) £-" . 2 2

The dual function is gttb(,ln )=1 a1 + ( l ⋅+ ) ( n ⋅+⋅-⋅ 1 )é n 1 l 1ù + ( n YXa - l ) . 2 2 ëê ûú We need 1n⋅=⋅ 1l, in which case gt(,ln )=+-1 aYXa1 ( n l ) +⋅+⋅ ( l n 1 ). 2 2 We then note that by Cauchy-Schwarz, ()nlYXa--£ nl Y Xa and so the 2 2 dual is maxl1⋅+ n1 ⋅ s.t. 1n ⋅=³1X0⋅- ln ,Y ln £1 , l , . Normalizing l and 2 n so that 1l⋅=⋅= 1n 1, we get mintt s.t. nnYX-⋅=l1£³ ,lm,0 , 1l =⋅ 1 . This is the minimum distance between the of the points.

2 Practically, we would minimize the program above using x . The dual is relatively 2 simple to construct as a QP, and the primal solution can be recovered: aXY=-ln  Approximate linear separation: If the points cannot be exactly separated, we might

ii try to solve min1u⋅+⋅ 1v s.t. a ⋅x0-³bi³- 1uiii ",ay ⋅ -£-- b (1 v ) ", u ³ 0 , v . This is a heuristic to minimize the number of misclassifications.

2 The support vector classifier minimizes a trade-off between 1u⋅+⋅ 1v and 1 a . If 2 2 nMN , , this can efficiently be solved by taking the dual.

Daniel Guetta Convex Optimization Page 27

 Non-linear classification: In the simplest case, we can use a linearly parameterized family of functions f ()zFz=⋅q (). Our problem then reduces to solving the following system of linear inequalities: q ⋅Fx (i )³" 1 i and q ⋅Fy (i )£- 1 "i .

Infinite-dimensional optimization

 Hilbert spaces o Definition: A pre-Hilbert Space consists of a vector space V over  with an

inner product xy,:VV´ . Satisfies (a) xy,,= yx (b)

xyz+=,,, xyyz + (c) llxy,,= xy for all l (d) xx, ³ 0 , with

equality iff x = 0. xxx= , is a norm, and inner product is continuous in

both its arguments under that norm (Proof on Luenberger, pp49).  V = space of sequences that are square-summable. Define

¥ xy, = xy . [Finite by Cauchy Schwarz]. å i=1 ii

 Va= 2[,]b = space of measurable functions x :[ab , ]  such that

2 b x()t is Lesbegue integrable. xy,()()= xttt y d. òa

b  V = polynomials on [a, b] with xy,()()= xttt y d òa

o Definition: A Sequence {xVn } Í is Cauchy iff xxnm-0 as nm , ¥,

under the norm yyy= , .

o Definition: A Hilbert Space is a complete pre-Hilbert space; one in which every Cauchy sequences converges in the space.  Example of an incomplete space: Take VC= [0,1], the space of continuous functions on [0,1]. Consider two norms

1 f ==maxfx ( ) f fx ( ) d x xÎ[0,1] * ò0 Neither, it turns out, induce an inner product, so this it not a Hilbert Space. However:  V is complete under the first norm. Even though there is a

sequence of functions fn that tend to a step function (which isn’t in

Daniel Guetta Convex Optimization Page 28

the space), the sequence isn’t Cauchy, because f -step 1 , n 2 since we are considering the point of maximum difference.  V is not complete under the starred norm, because f -step 0 , so the sequence is Cauchy and leaves the space. n *

Indeed, the area between fn and the step function shrinks to 0.  o Definition (Orthogonality): We say x and y are orthogonal iff xy,0= . We further define MM^ =="{yyx:, 0 xÎ }. By the joint continuity of the inner product, this is always closed.

o Theorem (Projection): Let H be a Hilbert Space and K a closed and non-

empty convex subset of H. Then, for any x Î H , minkÎK xk- has an optimal ¢ solution k0 called the projection of x onto K. k Î K is equal to k0 if and only if xk-΢ K ^ (in other words, xkkk- ¢¢, -£"Î0 kK ).

Proof: Luenberger, pp. 51.

o Theorem: If M is a closed subspace of H, then HMM=Å^ and MM= ^^ . As

such, we call k0 in the projection theorem the orthogonal projection of x onto M,

^ and we can write xk=+-00( xk) , where kx0 Î-ÎMM,(k0 ) . Proof: Luenberger, pp. 53.

o Definition (Linear functional): A function j :V   is a linear functional if ja()()()xy+= b aj x + bj y.

 Continuous if jj()yx-£"-£ () e yxy: d. If j is continuous at x0,

it is continuous everywhere. Proof: jj()yx-+- ()= j ( yxxx- 00 ) j ( ).  Bounded if $£"MMs.t. j(yyy ) . Define norm

jj==infM :()yy)u£ M s p j ( y (last step by linearity). { } y =1

Notes:  If continuous, continuous at 0, jd()yy£" 1 £. As such,

zzz jjdjd()z ==zz£ , since bit inside brackets has norm d . ddd()zz()

Daniel Guetta Convex Optimization Page 29

If bounded, j()zzz£"M and so je()zzz£" : £e . So  M continuous at 0, and therefore everywhere.  Example of a non-bounded linear functional: Let V be the space of all sequences with finitely many non-zero elements, with norm

x = maxkkx . Then j()x = maxkkkx is unbounded because we can push the non-zero elements of x to infinity without changing the norm but making the functional grow to infinity. 

o Theorem (Riesz-Frechet): If j()x is a continuous linear functional, then there exists a z Î H such that j()xxz= , Proof: Let M =={}y : j()y 0. Since the functional is continuous, M is closed. If M = H, set z = 0. Else, choose g Î M ^ .

jjjxxx-jj()xxg =-=() () 0 x - ()g ÎM ()jj()gg ()  0,,==-xx- jj()xxgg, g () gg jj()gg ()

j()g  j()xx= ,,2 g = xz g

Note also that by Cauchy-Schwarz, jj()xzx£= z. 

o This means that Hilbert spaces are self-dual (see later), and that we can write jj()xx= , .

o Theorem (Special case of the Hahn-Banach Theorem): Let MHÍ be a

closed subspace and jM be a continuous linear functional on M. Then there

exists a continuous linear functional j on H such that jj()xxx="ÎM () M

and jj= M . Proof: Easy in the case of a Hilbert space. Since M is closed, it is also a Hilbert

space, and so $Îm M such that jM ()xxm= , . Then define j()xxm= , for

x Î H . By the CS inequality, jjM ==m .   Banach Spaces & Their Duals o A is a normed, complete vector space with no inner product.  C[0,1] is the space of continuous function son [0,1], with

f = max0££t 1 f (t )

Daniel Guetta Convex Optimization Page 30

[As we showed above, the choice of this norm ensures completeness]. An example of a linear functional on this space is

11 j(fx )=££ ()tvt d() x d() vt x TV() v ò00ò Provided the total variation of v, TV(v ) <¥, where

n TV(vvtvt )=- sup ( ) ( ) All partitions 0=

  =<¥xxÎ ¥ : , where p { p }

1/p æö¥ p xx===¥ç xxp÷ or sup if ppèøçåi=1 iii÷

ïïìü1 p   [0,1]=<¥ïïxx : (tt ) d , with p íýò îþïï0

1/p æö1 p ff= ç ()tt d÷ 1£¥ p<  p èçò0 ÷ø o Definition: We say VV* = {jj: is continuous linear functional on } is the

dual space of V, with norm jj=£sup()xx: 1 . V *, is always a * { } ()* Banach space.

Proof: Want to show that "Íx **V with xx**-£e "nm, ³ M {}n nm* e

*** converges to a point xx=Îlimnn¥ V . First fix x ÎV and note that xx**()-=-£- x () x ( x * x* )() xxx** x nm nm nm*

* As such, {xxn ()} is a Cauchy sequence in  . Since  is complete,

** * xx()= limnn¥ xx () exists. Define x pointwise using this limit. Now  Linearity: By linearity of expectations, x * is linear.

* *  Continuity/boundedness: Fix m0 such that xxnm-£e "nm, ³ m0 .

* ** Then by the definition of xx(), xx()- xm () x£ e x, and

**** * xx()££+ xx ()- xmm () x+ x ( x)nede x m x boud  00() 0* Examples  We have already shown (Riesz-Frechet Theorem) that Hilbert spaces are self-dual.

Daniel Guetta Convex Optimization Page 31

 Theorem: For p Î¥(1, ) , * = , where 11+=1 . In other words pq pq

¥ 1. For any y Î  , fx()= yx is a bounded linear functional q åi=1 ii

on  p .

¥ 2. Every f Î * can be represented uniquely as fx()= yx p åi=1 i i

with y Î q . 3. In both cases above, f = y . * q Proof: We prove each step separately

1. Suppose y Î q . Clearly, the proposed functional is linear.

¥ Furthermore, by Holder’s Inequality fy()xx££yx . åi=1 ii pq As such, the functional is also bounded, with fy£ . q

* 2. We prove this in four steps. Let x Î  p and f Î  p .  Step 1 – approximate x using “basis” functions. Define

i “basis functions” e Î  p , consisting of sequences which are identically 0 except for the ith component. We then have

N xex=x i Niåi=1  Step 2 – find the image under f. In this case, let yii= f()e Î  . We then have, by linearity of f

N ¥ fyxyx()x =ii Niiååii==11

By continuity of f, we have ff()xxN  (). Thus

¥ fyx()x = i åi=1 i

i  Step 3 – show that the y form a vector in q . Define the

vector yN Î  p as a “truncated y” series.

ì qp/ ïyNi sgn yii £ ()y = íï () N i ï 0 iN> îï We then have

Daniel Guetta Convex Optimization Page 32

1/p æöN q y = ç yi ÷ N çå ÷ p èøç i=1 ÷ NNqp/ q f ()y ==yyyyiiii sgn() N ååii==11

f (y ) We know that f ()y ££f yfN . Thus N N y * p N p *

1(1/)- pq 1/ æöæöNNqq ççyyii÷÷= £"f N èøèøçççååii==11÷÷ç * As such, y £ f , and y Î  . q q 3. We have shown that y £ f and y ³ f . Thus, y = f . q q q

*  Theorem: 1 = ¥ .

Proof: Show this as above, and define the truncated vector yN as a containing sgnyN for i = N.

* --11  Theorem: For p Î¥[1, ) , LLpq[0,1]= [0,1], where pq+=1 . For

1 every f Î L* , there exists a y Î L such that fx( )= x ()()ttt y d and p q ò0 f = y . * q  Theorem: C[0,1]* = NBV[0,1]; we will prove this later using the HB Theorem.  The Hahn Banach Theorem & Application to C[0,1] o Theorem (Hahn-Banach): Let p be a continuous seminorm (same as a norm, except for the fact it can be equal to 0 even when x0¹ ), MVÍ be a closed subspace and f : M   be a linear functional such that f ()x £"ÎpM(xx ) . Then, there exists a linear functional FV :   such that F()x £"ÎpV(xx ) and Ff()xxx=" () Î M. Note: Consider setting p()xfx= . Clearly, the condition of the theorem * then applies, because f ()x £ f x . The theorem then implies that Ff= . * **

Daniel Guetta Convex Optimization Page 33

The only reason we generalize p to a seminorm is to prove the geometric HB theorem (see later). Note 2: The idea it is possible to extend f over an entire space is not particularly revolutionary. The crux of the theorem is that this extension has bounded norm. In a way, the HB Theorem can be stated as “the optimization

problem min* Fx s.t. ,Ffx="ÎÌ ( ) xMX has a global optimium, FÎX * and its value is f ”. * Note 3: Let X be a . Then "ÎxFFxFxX, $ s.t. () = .

Define f()axx= a on the subspace generated by x; this has norm unity. By

the HB Theorem, we can extend this to F on X with norm unity. This satisfies the requirements for the point x.

o Example/Theorem (dual space of C[0, 1], Riesz Representation

Theorem): (In all that follows, use the usual norm, max0££t 1 x (t ) , over C[0,1])  Take any function v of bounded variation on [0,1]. Then f :[0,1]C  

1 defined by fx()= x ()tvt d() is a bounded linear functional in C[0,1]* . ò0  Tale any bounded linear functional f ÎC[0,1]* . Then there is a function v

1 of bounded variation on [0, 1] such that fx()= x ()tvt d(). ò0  For the function defined in (2), f = TV(v ).

 C[0,1]* = NBV . Proof: 1. Clearly, any f defined in this fashion is linear. Furthermore, it is bounded:

1 fxx()xx=£(dtv) v(tt )max ()TV() £ T V( v ) ò0 01££t 2. Note that C[0,1] (space of continuous functions on [0,1]) is a subset of B[0,1], the space of bounded functions on [0,1]. Thus, by the HB Theorem fFÎ$ÎCB[0,1]** [0,1] s.t. Ff =

Our proof then goes as follows:

Daniel Guetta Convex Optimization Page 34

 Step 1 – Approximate x ÎC[0,1] by discretisng it. Define a set of step functions u (tB)]=  Î [0,1 . (Note these are not in C[0,1]; s {}ts£ this is why it’s useful to move to the larger space). We can write

n xz()tt»Îp ()=- xuu ()t () t(1 t)]B [0, åi=1 it() t ii-1 Where p is some partition of [0,1]. Hereafter, when we write “tends to”, we mean “as the partition gets arbitrarily fine”.

 Step 2 – find the image under F. Let vs()= Fu (s )Î  be the image of these “basis functions”. Since F is linear and the first term in the sum is a constant, we can write

n 1 Fx((zxp )()()()()=-tvtvt( )  t d) vt Î åi=1 ii i-1 ò0  Step 3 – bridge the gap between B and C. By uniform continuity of x, the approximation z p becomes arbitrarily good (using the max norm). Since F is also continuous using the max norm, this implies that Fz (p )= Fx ( ) fx ( ). As such

1 fx()= x ()tvt d() ò0  Step 4 – show v has bounded TV. By linearity of F, we have

nn vt()-= vt ( ) F e u - u ååii==11ii-1 () itt() ii-1 n £ F e uu- åi=1 it() t * ii-1 ==Ff **

Where ei =1 takes the absolute value into account, and the jump from line 2 to 3 follows since u are step functions. Taking a supremum over all partitions, we find TV(v ) £<¥f . * Note, however, that the F produced by the HB Theorem is not necessarily unique. As such, nor is the function v. This theorem only states there exists such a v.

3. From (1), it is clear that f £ TV(v ). From (2, Step 4) it is clear that * f ³ TV(v ). Thus, f = TV(v ) . * *

Daniel Guetta Convex Optimization Page 35

4. Because if not non-uniqueness noted above, C[0,1]* ¹ BV[0,1]. Indeed, the linear functional f ()xx= ()1 , for example, we can represented using a v 2 that is 0 on [0, ½), 1 on (½, 1] and takes any value at ½. As such, we define the space NBV[0,1] (normalized bounded variation), consisting of all functions of bounded variations that vanish at 0 and are right- continuous. We then have C[0,1]* = NBV[0,1], because every element in one set can be mapped to an element in the other.   Odds & Ends o We will sometimes abuse notation and write xx**()= xx , , for x ÎÎVV,x **.

In a Hilbert space, this is true, because V = V* (Riesz-Frechet). In a Banach space, it is convenient notation (see Hyperplanes section).

o This means that by viewing x as fixed, xx, * also defines a functional in X **

(easy to show linear and bounded). Now, we have

 xx, **£ x x

 By Corollary 2 of H-B (below), "$x ÎÎXX,xxx**** s.t. xx, =

As such, if we consider xx, * as a functional on X**, its norm is x . We define X the natural mapping j : XX ** so that j()x maps x to the functional it

generates in X**; in other words, x,xx**= j(),x but with j()xxÎ ** . The

mapping is linear, and, as we showed previously, norm-preserving; j()x = x . But it might not be onto – some elements in X** might not be XX** representable by elements in X. If XX= ** , X is called reflexive. All Hilbert

spaces are reflexive, as are  p and Lp for p Î¥(1, ) .

o We have xx, **£ x x . In a Hilbert space, we have equality if and only if

xx* = a . In a Banach space, we say x * Î X * is aligned with x Î X if

$\langle x, x^* \rangle = \|x\|\,\|x^*\|$. They are said to be orthogonal if $\langle x, x^* \rangle = 0$. Similarly, if

$S \subseteq X$, we define $S^\perp = \{x^* \in X^* : \langle x, x^* \rangle = 0 \;\forall x \in S\}$. If $U \subseteq X^*$, we define $^\perp U = \{x \in X : \langle x, x^* \rangle = 0 \;\forall x^* \in U\}$.
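To make alignment concrete in finite dimensions (a sketch of my own, not from the notes): with $X = \ell^1$ and $X^* = \ell^\infty$ on $\mathbb{R}^4$, a functional is aligned with x exactly when it has constant magnitude $\|x^*\|_\infty$ and matches the sign of x wherever x is non-zero.

```python
import numpy as np

x = np.array([2.0, -1.0, 0.0, 3.0])           # x in l^1
x_star = 5.0 * np.sign(x)                      # candidate aligned functional in l^infinity
x_star[x == 0.0] = 5.0                         # where x = 0, any value of magnitude <= 5 works

pairing = float(x @ x_star)                    # <x, x*>
print(pairing)                                 # 30.0
print(np.abs(x).sum() * np.abs(x_star).max())  # 30.0 = ||x||_1 * ||x*||_inf  ->  aligned
```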


o Theorem: If $M \subseteq X$ is a closed subspace, then $^\perp(M^\perp) = M$. Proof: Clearly, $M \subseteq {}^\perp(M^\perp)$. To show the converse, we'll show that

xxÏÏMM^^(). Define a linear functional f on the space spanned by M and

x which vanishes on M, so that $f(m + \alpha x) = \alpha$. It can be shown that $\|f\| < \infty$,

and so by the HB Theorem, we can extend it to some F which also vanishes on

M. As such, $F \in M^\perp$. However, $F(x) = \langle x, F \rangle = 1 \ne 0$, and so $x \notin {}^\perp(M^\perp)$. ∎

 Minimum Norm Problems
o Let us consider a vector $x \in X$. There are clearly two ways to take the norm of that vector – as an element of X or as an element of $X^{**}$ (a functional on $X^*$).

$\|x\| \qquad \text{or} \qquad \max_{\|x^*\| = 1} \langle x, x^* \rangle$

It is clear these two should be equal, because $\langle x, x^* \rangle \le \|x\| \cdot 1$ (or, more

intuitively, because the second norm finds the most x can yield under a functional of norm 1 – clearly, the answer is its norm). Let us now restrict ourselves to a subspace M of X. We can, again, define two norms

$\|x\|_M = \inf_{m \in M} \|x - m\| \qquad \text{or} \qquad \|x\|_{M^\perp} = \sup_{\|x^*\| = 1,\; x^* \in M^\perp} \langle x, x^* \rangle$

The first is simply the minimum distance between x and M (as opposed to between x and 0). The second is the most x can yield under a functional of norm 1 that annihilates every element of M. Intuitively, the “remaining bit” that’s

“not annihilated” is x – m; the pairing is maximized when x – m is aligned with $x^*$, which happens at $m_0$. So it makes sense that the two should be equal.

o Theorem: Consider a normed linear space X and a subspace M therein. Let $x \in X$. Then

$d = \inf_{m \in M} \|x - m\| = \max_{\|x^*\| \le 1,\; x^* \in M^\perp} \langle x, x^* \rangle \;\; \Big( = \max_{\|x^*\| \le 1,\; x^* \in M^\perp} \langle x - m, x^* \rangle \text{ for any } m \in M, \text{ since } \langle m, x^* \rangle = 0 \Big)$

Or, in our terminology above, $\|x\|_M = \|x\|_{M^\perp}$. The maximum on the right is

achieved for some $x_0^* \in M^\perp$; if the infimum on the left is achieved for some

$m_0 \in M$, then $x - m_0$ is aligned with $x_0^*$.


Intuitively, this is because at the optimal m, the residual x – m0 is aligned to

some vector in $M^\perp$. As such, for that vector, $\langle x - m_0, x^* \rangle = \|x - m_0\|$. For every other $x^*$ (of norm 1 in $M^\perp$), the pairing is smaller.

Pictorially, looking for the point m in M that minimizes $\|x - m\|$ is equivalent to

looking for a functional in $M^\perp$ that is aligned with $x - m_0$.

[Figure: the residual $x - m_0$ from x to its closest point $m_0 \in M$, aligned with a functional in $M^\perp$.]

This also implies that a vector m0 is the minimum-norm projection if and only if

there is a non-zero vector $x^* \in M^\perp$ aligned with $x - m_0$.
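A minimal numerical sketch (in the Euclidean case, where alignment reduces to proportionality; the matrix A spanning M is made up for illustration): the primal distance $\inf_{m \in M}\|x - m\|$ and the dual value $\max\{\langle x, x^*\rangle : \|x^*\| \le 1,\ x^* \in M^\perp\}$ coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2))       # columns span the subspace M in R^5
x = rng.standard_normal(5)

# Primal: distance from x to M via least-squares projection
m0, *_ = np.linalg.lstsq(A, x, rcond=None)
primal = np.linalg.norm(x - A @ m0)

# Dual: the maximizer over {x* in M-perp, ||x*|| <= 1} is the normalized residual,
# so the dual value is <x, x*> with x* = (x - A m0) / ||x - A m0||
x_star = (x - A @ m0) / primal
dual = float(x @ x_star)

print(primal, dual)                   # equal, as the duality theorem predicts
print(np.abs(A.T @ x_star).max())     # ~0: x* annihilates M
```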

o Theorem: Let M be a subspace of a real normed space X, and let $x^* \in X^*$. Then

$d = \min_{m^* \in M^\perp} \|x^* - m^*\| = \sup_{x \in M,\; \|x\| \le 1} \langle x, x^* \rangle$

where the minimum on the left is achieved for some $m_0^* \in M^\perp$. If the supremum

is achieved for some $x_0 \in M$, then $x^* - m_0^*$ is aligned with $x_0$.

Because the minimum on the left is always achieved, it is always more desirable to express optimization problems in a dual space.

o In many optimization problems, we seek to minimize a norm over an affine subset of a dual space rather than over a subspace; more specifically, subject to a set of

linear constraints of the form $\langle y_i, x^* \rangle = c_i$. In that case, if $x^*$ is some vector that satisfies these constraints,


$\min_{\langle y_i, x^* \rangle = c_i} \|x^*\| = \min_{m^* \in M^\perp} \|x^* - m^*\| = \sup_{x \in M,\ \|x\| \le 1} \langle x, x^* \rangle = \sup_{\|\sum_i a_i y_i\| \le 1} \left\langle \textstyle\sum_i a_i y_i,\ x^* \right\rangle = \sup_{\|\sum_i a_i y_i\| \le 1} c \cdot a$

where M is the space generated by the $y_i$. The last equality follows from the fact that $x^*$ satisfies the equality constraints. Note that the optimal $\sum_i a_i y_i$ is aligned with the optimal $x^*$.

 Applications
o Example: Consider the problem of selecting the field current u(t) on [0,1] to drive a motor from initial conditions $\theta(0) = \dot\theta(0) = 0$ to the state $\theta(1) = 1$, $\dot\theta(1) = 0$,

while minimizing $\max_{0 \le t \le 1} |u(t)|$. Assume the motor is governed by $\ddot\theta(t) + \dot\theta(t) = u(t)$.
 First, we need to choose a space in which to pose the problem.

Choose $u(t) \in L^\infty[0,1]$, which is the dual of $L^1[0,1]$.
 Next, note that we can treat the governing equation as a first-order equation in $\dot\theta$ and use an integrating factor to find

$\frac{d}{dt}\left[e^t \dot\theta(t)\right] = e^t u(t) \;\Rightarrow\; \left[e^t \dot\theta(t)\right]_0^1 = \int_0^1 e^t u(t)\,dt \;\Rightarrow\; \dot\theta(1) = \int_0^1 e^{t-1} u(t)\,dt = \langle e^{t-1}, u \rangle$

(Here $e^{t-1}$ is considered as a function in $L^1$ and $\int_0^1 e^{t-1} u(t)\,dt$ as the action of some functional in $(L^1)^*$ on that function.)

We can also integrate the governing equation directly to get

11 qq(1)-+-= (0) qq (1) (0)ut ( ) d t- q (1) = ut ( ) d t q  (1) ò00ò Feeding in the results of the previous equation

$\theta(1) = \int_0^1 \left(1 - e^{t-1}\right) u(t)\,dt = \langle 1 - e^{t-1}, u \rangle$

 As such, our problem boils down to minimizing the norm of u subject to

$\langle e^{t-1}, u \rangle = 0$ and $\langle 1 - e^{t-1}, u \rangle = 1$. This is precisely the situation


considered at the end of the previous section, and so this optimization problem is equivalent to

$\max_{\left\| a_1 e^{t-1} + a_2\left(1 - e^{t-1}\right) \right\|_1 \le 1} \; 0 \cdot a_1 + 1 \cdot a_2$

Where the norm is taken in L1 (the primal space). As such, we want to

maximize $a_2$ subject to $\int_0^1 \left| (a_1 - a_2)\, e^{t-1} + a_2 \right| dt \le 1$.

 Once we have found the optimal value of a2, we can find u by

characterizing the alignment between $L^1$ and $L^\infty$. For $x \in L^1$ and $u \in L^\infty$ to be aligned, we require

$\langle x, u \rangle = \int_0^1 x(t)\, u(t)\,dt = \|x\|_1 \|u\|_\infty = \int_0^1 |x(t)|\,dt \cdot \max_{t \in [0,1]} |u(t)|$

For this to hold, it is clear that u can only take the two values $\pm M$ (where $M = \|u\|_\infty$) and that it must have the same sign as x at any given value of t where $x(t) \ne 0$.

 Finally, note that in this case x is $(a_1 - a_2)\, e^{t-1} + a_2$. Clearly, this changes sign at most once, and so u(t) must be equal to $\pm M$ with at most a single change of sign.
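A small numerical sketch of this dual problem (my own illustration, not from the notes): since the constraint is positively homogeneous, we may fix $a_2 = 1$, minimise the $L^1$ norm over $a_1$, and recover the minimal $\max_t |u(t)|$ as the reciprocal of that minimum.

```python
import numpy as np

n = 20_000
t = (np.arange(n) + 0.5) / n                 # midpoint quadrature grid on [0, 1]

def l1_norm(a1, a2=1.0):
    """|| (a1 - a2) e^(t-1) + a2 ||_1 by a midpoint Riemann sum."""
    return np.mean(np.abs((a1 - a2) * np.exp(t - 1.0) + a2))

# maximize a2 s.t. l1_norm(a1, a2) <= 1; by homogeneity, fix a2 = 1, so the
# optimal value is 1 / min_{a1} l1_norm(a1, 1)
a1_grid = np.linspace(-20.0, 5.0, 4001)
vals = np.array([l1_norm(a1) for a1 in a1_grid])
a1_opt = a1_grid[vals.argmin()]

print("optimal a1 (with a2 = 1):", a1_opt)   # determines where the sign of u switches
print("minimal max |u(t)|      :", 1.0 / vals.min())
```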

o Example: Consider selecting a thrust program u(t) for a vertically ascending rocket subject only to gravity and thrust in order to reach a given altitude (say

1) with minimum fuel expenditure $\int_0^T u(t)\,dt$. Assume $x(0) = \dot x(0) = 0$, unit mass and unit gravity. The equation of motion is $\ddot x(t) = u(t) - 1$.

 We might originally regard this problem in $L^1$, but this is not a dual space. Instead, consider it in NBV[0, T], and associate with every u a $v \in NBV[0, T]$ so that $u(t)\,dt = dv(t)$.

 The time at which the rocket reaches altitude 1 is unknown; we denote it by T and will later optimize over this parameter. Integrating the equation of motion,

tt xt( )-- x (0)=-= us ( ) d s t xt  ( ) us ( ) d s t ò00ò Integrating again, by parts


$x(T) = \int_0^T \int_0^t u(s)\,ds\,dt - \frac{T^2}{2} = \left[t \int_0^t u(s)\,ds\right]_0^T - \int_0^T t\, u(t)\,dt - \frac{T^2}{2} = T \int_0^T u(s)\,ds - \int_0^T t\, u(t)\,dt - \frac{T^2}{2}$

$\Rightarrow \quad \int_0^T (T - s)\, u(s)\,ds = x(T) + \frac{T^2}{2} \quad\Rightarrow\quad \langle T - t,\ v \rangle = x(T) + \frac{T^2}{2}$

where v is the function in NBV[0, T] associated with u, as described above.
 Our problem is then a minimum-norm problem subject to a single linear constraint. We want x(T) = 1, and using our theorem, as we did above,

$\min_{\langle T - t,\, v \rangle = 1 + \frac{T^2}{2}} \|v\| = \max_{\|(T - t)\,a\| \le 1} \left(1 + \frac{T^2}{2}\right) a$

This is a one-dimensional problem. The norm is taken in C[0, T], the space to

which NBV is dual. As such, $\|(T - t)\,a\| = \max_{t \in [0, T]} |(T - t)\,a| = T|a|$, and the optimum occurs at $a = 1/T$. We then have

$\min_{\langle T - t,\, v \rangle = 1 + \frac{T^2}{2}} \|v\| = \frac{1}{T} + \frac{T}{2}$

Differentiating this with respect to T, we find that the minimum fuel

expenditure of $\sqrt{2}$ is achieved at $T = \sqrt{2}$.
 To find the optimal u, note that the optimal v must be aligned with $(T - t)\,a$. As we discussed above when characterizing alignment of C and

NBV, this means that v must be a step function, jumping to $\sqrt{2}$ at t = 0; as such, u must be an impulse (delta function) at t = 0.

 Hyperplanes & the Geometric Hahn-Banach Theorem
o Definition: A hyperplane H of a normed linear space X is a maximal proper affine set, i.e., if $H \subseteq A$ and A is affine, then either A = H or A = X.

o Theorem: A set H is a hyperplane if and only if it is of the form $\{x \in X : f(x) = c\}$, where f is a non-zero linear functional and c is a scalar. Proof:

 If: Let $H = x_0 + M$, where M is a subspace.


 If $x_0 \notin M$, then $X = x_0 + \mathrm{span}(x_0, M)$ by the maximality property of H, since $\mathrm{span}(x_0, M)$ is strictly bigger than M and so this affine set strictly contains H. Thus,

$X = \mathrm{span}(x_0, M)$. We can therefore write any $x \in X$ as

$x = \alpha x_0 + m$, with $m \in M$. Define $f(\alpha x_0 + m) = \alpha$. Then $H = \{x : f(x) = 1\}$.
 If $x_0 \in M$, then $H = M$. Simply pick $x_0' \notin M$, apply the above, and get $H = \{x : f(x) = 0\}$.
 Only if: Suppose $f \ne 0$ and let $M = \{x : f(x) = 0\}$. Clearly, it is a

linear subspace. Furthermore, there exists $x_0$ such that $f(x_0) = 1$. As

such, $x - f(x)\, x_0 \in M$ for all x, and so $X = \{\alpha x_0 + v : \alpha \in \mathbb{R},\ v \in M\}$. So we only

require one extra vector ($x_0$) to expand M into the whole space. So M

is a maximal subspace. Thus $H = \{x : f(x) = c\} = c\, x_0 + M$ is a hyperplane.
 Important note: A hyperplane $\{x : f(x) = c\}$ is closed if and only if f is continuous.

o Theorem (Geometric Hahn-Banach): Let K be a convex set having a nonempty interior in a real normed linear vector space X. Suppose V is an affine set in X containing no interior points of K. Then there is a closed hyperplane in X containing V but containing no interior point of K. In other words, there is an

element $x^* \in X^*$ and a constant c such that $\langle v, x^* \rangle = c$ for all $v \in V$ and

$\langle k, x^* \rangle < c$ for all $k \in \mathrm{int}\, K$.
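A tiny numerical illustration of the statement (my own construction, not from the notes): take K to be the open unit disk in $\mathbb{R}^2$ and V the horizontal line through (0, 2). The hyperplane $\{x : \langle x, x^* \rangle = c\}$ with $x^* = (0, 1)$ and $c = 2$ contains V, while every interior point of K pairs to strictly less than c.

```python
import numpy as np

x_star, c = np.array([0.0, 1.0]), 2.0

# V: points on the affine line {(t, 2) : t real} -- the hyperplane contains V
V = np.column_stack([np.linspace(-5, 5, 11), np.full(11, 2.0)])
print(np.allclose(V @ x_star, c))             # True: <v, x*> = c on V

# K: samples from the open unit disk -- every interior point pairs to < c
rng = np.random.default_rng(1)
pts = rng.uniform(-1.0, 1.0, size=(10_000, 2))
K = pts[np.linalg.norm(pts, axis=1) < 1.0]
print(bool((K @ x_star < c).all()))           # True: <k, x*> <= 1 < 2
```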

o We will abuse notation and write $x^*(x) = \langle x, x^* \rangle$ for $x \in V$, $x^* \in V^*$.

 In Hilbert spaces, this is actually true, thanks to Riesz–Fréchet.
 It allows us to represent all closed hyperplanes as $\{x : \langle x, a \rangle = c\}$.

Mathematical background

 Vector spaces & Topology o Inner products


 Cauchy–Schwarz inequality: $\langle x, y \rangle^2 \le \langle x, x \rangle \cdot \langle y, y \rangle$, with equality if and

only if $x = \lambda y$, or either vector is 0. Proof: If y = 0, the result is trivial. Otherwise,

$0 \le \langle x - \lambda y,\ x - \lambda y \rangle = \langle x, x \rangle - 2\lambda \langle x, y \rangle + \lambda^2 \langle y, y \rangle$

Set $\lambda = \langle x, y \rangle / \langle y, y \rangle$ to get the result. ∎

 Parallelogram law: $\|x + y\|^2 + \|x - y\|^2 = 2\|x\|^2 + 2\|y\|^2$ (prove by

expanding the norms as inner products).

 The induced norm $\|x\| = \sqrt{\langle x, x \rangle}$ satisfies the triangle inequality (expand $\|x + y\|^2$ and use Cauchy–Schwarz).

 For matrices, the standard inner product is $\langle X, Y \rangle = \mathrm{tr}(X^\top Y)$. This is equivalent

to multiplying every element of X by the corresponding element of Y and summing. The induced norm is the Frobenius norm, $\|X\|_F$.
o Norms
 A norm has the properties (a) $\|x\| \ge 0$, (b) $\|x\| = 0 \Leftrightarrow x = 0$, (c)

$\|\alpha x\| = |\alpha|\,\|x\|$, (d) $\|x + y\| \le \|x\| + \|y\|$. A seminorm might not satisfy (b).

 Common norms: $\|x\|_p = \left(\sum_i |x_i|^p\right)^{1/p}$. For $P \in \mathbb{S}^n_{++}$, $\|x\|_P = \sqrt{x^\top P x}$ (the unit ball is an ellipsoid). $\|x\|_\infty = \max\{|x_1|, \dots, |x_n|\}$.
 The dual norm of a norm $\|\cdot\|$ is $\|z\|_* = \sup_x \{z \cdot x : \|x\| \le 1\}$. It is the support function of the unit ball of the norm. Note that $x \cdot y \le \|x\|\,\|y\|_*$.
 For $p \in (1, \infty)$, the dual norm of $\|\cdot\|_p$ is $\|\cdot\|_q$, where $\frac{1}{p} + \frac{1}{q} = 1$.
 $\|\cdot\|_{**} = \|\cdot\|$ (not true in infinite-dimensional spaces).
o Open/closed sets

 $N_r(x) := \{y \in X : \|x - y\| < r\}$ is a neighborhood of x (“open ball”)

 $x \in \mathrm{int}\,\mathcal{C}$ if $\exists r$ s.t. $N_r(x) \subset \mathcal{C}$. $\mathcal{C}$ open $\Leftrightarrow \mathcal{C} = \mathrm{int}\,\mathcal{C}$.

 $x \in \mathrm{cl}\,\mathcal{C}$ if for every $N_r(x)$, $N_r(x) \cap \mathcal{C} \ne \emptyset$. $\mathcal{C}$ closed $\Leftrightarrow \mathcal{C} = \mathrm{cl}\,\mathcal{C}$.


 The union of open sets is open. The intersection of a finite number of open sets is open. The intersection of closed sets is closed. The union of a finite number of closed sets is closed.
 The set of reals is both closed and open.
 Theorem: $f^{-1}(\mathcal{B}) = \{x \in \mathrm{dom}\, f : f(x) \in \mathcal{B}\}$. If f is continuous, $\mathrm{dom}\, f$ is open/closed and $\mathcal{B}$ is open/closed, then $f^{-1}(\mathcal{B})$ is also open/closed.

o $\mathcal{C} \subset \mathbb{R}^n$ is (sequentially) compact if for every sequence $\{x_k\} \subseteq \mathcal{C}$ there exists a

subsequence $\{x_{k_i}\}$ converging to an element $x \in \mathcal{C}$. [Another definition, equivalent in metric spaces, is that every open cover must have a finite sub-cover.]
 Theorem (Heine–Borel): in finite-dimensional spaces, a set is compact if and only if it is closed and bounded.
 Theorems: A closed subset of a compact set is compact. The intersection of a sequence of non-empty, nested compact sets is non-empty.

o The indicator function of a set $\mathcal{C}$, $I_{\mathcal{C}}(x)$, is equal to 0 if $x \in \mathcal{C}$ and $\infty$ otherwise.
o A subspace of a vector space is a subset of the vector space that contains the 0 vector and is closed under addition and scalar multiplication.
 Analysis & Calculus

o A function f is coercive over $\mathcal{C}$ if for every sequence $\{x_k\} \subset \mathcal{C}$ with $\|x_k\| \to \infty$,

we have $\lim_{k \to \infty} f(x_k) = \infty$.
o Second-order Taylor expansion: if f is twice continuously differentiable over

$N_r(x)$, then $\forall d \in N_r(0)$, $f(x + d) = f(x) + \nabla f(x)^\top d + \tfrac{1}{2}\, d^\top \nabla^2 f(x)\, d + o(\|d\|^2)$
o For a vector-valued function F, $\left[\nabla F(x)\right]_{ij} = \partial F_j(x) / \partial x_i$.
o The chain rule states that $\nabla\left[g(f(x))\right] = \nabla f(x)\, \nabla g(f(x))$.
 Linear algebra
o The range or image of $A \in \mathbb{R}^{m \times n}$, $\mathcal{R}(A)$, is the set of all vectors that can be written as Ax. The nullspace or kernel $\mathcal{N}(A)$ is the set of vectors that satisfy Ax

= 0. Note that $\mathcal{R}(A) \oplus \mathcal{N}(A^\top) = \mathbb{R}^m$ (and similarly $\mathcal{R}(A^\top) \oplus \mathcal{N}(A) = \mathbb{R}^n$); in other words, $\mathcal{R}(A) = \left[\mathcal{N}(A^\top)\right]^\perp$. This last statement means that $z = Ax$ for some x $\;\Leftrightarrow\; z \cdot y = 0$ for every y with $A^\top y = 0$.
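A one-line numerical check of this last statement (dimensions chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((5, 3))

z = A @ rng.standard_normal(3)                 # an arbitrary element of R(A)

# the trailing left singular vectors span N(A^T)
U, s, Vt = np.linalg.svd(A)
N = U[:, (s > 1e-12).sum():]                   # orthonormal basis of N(A^T)

print(np.allclose(N.T @ z, 0.0))               # True: z . y = 0 for every y with A^T y = 0
```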


o A real symmetric matrix $A \in \mathbb{S}^n \subset \mathbb{R}^{n \times n}$ can be factored as $A = Q \Lambda Q^\top$, where $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$ and Q is orthogonal. Note that $\det A = \prod_i \lambda_i$, $\mathrm{tr}\, A = \sum_i \lambda_i$, $\|A\|_2 = \max_i |\lambda_i|$ and $\|A\|_F = \sqrt{\sum_i \lambda_i^2}$.
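A quick numerical check of these identities (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                          # a real symmetric matrix

lam, Q = np.linalg.eigh(A)                 # A = Q diag(lam) Q^T
print(np.allclose(A, Q @ np.diag(lam) @ Q.T))                          # factorisation
print(np.isclose(np.linalg.det(A), np.prod(lam)))                      # det = prod of eigenvalues
print(np.isclose(np.trace(A), np.sum(lam)))                            # trace = sum of eigenvalues
print(np.isclose(np.linalg.norm(A, 2), np.abs(lam).max()))             # spectral norm
print(np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.sum(lam**2))))   # Frobenius norm
```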

n o ++ is the set of symmetric, positive definite matrix; all their eigenvectors are

positive, and $A \in \mathbb{S}^n_{++} \;\Leftrightarrow\; x^\top A x > 0 \ \forall x \ne 0$.

o Note that $\lambda_{\max/\min}(A) = \sup/\inf_{x \ne 0} \frac{x^\top A x}{x^\top x}$. As such, $\lambda_{\min}(A)\, x^\top x \le x^\top A x \le \lambda_{\max}(A)\, x^\top x$.
o Suppose $A \in \mathbb{R}^{m \times n}$ with rank r. Then we can factor $A = U \Sigma V^\top$, where

$U \in \mathbb{R}^{m \times r}$, $V \in \mathbb{R}^{n \times r}$ have orthonormal columns and $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_r)$. Writing $A^\top A = V \Sigma U^\top U \Sigma V^\top = V \Sigma^2 V^\top$, we see that the singular values are the square roots of the non-zero eigenvalues of $A^\top A$, and the right singular vectors V are the corresponding eigenvectors of $A^\top A$. Similarly, U contains the eigenvectors of $A A^\top$. We have

$\sigma_{\max}(A) = \sup_{x, y \ne 0} \frac{x^\top A y}{\|x\|_2 \|y\|_2} = \sup_{y \ne 0} \frac{\sqrt{y^\top A^\top A y}}{\|y\|_2}$. In other words, $\|A\|_2 = \sigma_{\max}(A)$. We denote $\mathrm{cond}(A) = \|A\|_2 \|A^{-1}\|_2 = \sigma_{\max}(A) / \sigma_{\min}(A)$.
o The pseudo-inverse is $A^\dagger = V \Sigma^{-1} U^\top$. If A is square and non-singular, then $A^\dagger = A^{-1}$. It is useful for the following reasons:

 $x = A^\dagger b$ is the minimum-Euclidean-norm solution of $\min_x \|Ax - b\|_2^2$.
 The optimal value of $\min_x \tfrac{1}{2} x^\top P x + q \cdot x + r$ for $P \in \mathbb{S}^n$ can be expressed as $-\tfrac{1}{2} q^\top P^\dagger q + r$ if $P \succeq 0$ and $q \in \mathcal{R}(P)$, and $-\infty$ otherwise.
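A short numerical check of both facts (random data, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Minimum-norm least squares: a fat A gives infinitely many minimisers of ||Ax - b||
A = rng.standard_normal((3, 6))
b = rng.standard_normal(3)
x_dag = np.linalg.pinv(A) @ b
x_lsq, *_ = np.linalg.lstsq(A, b, rcond=None)     # lstsq also returns the min-norm solution
print(np.allclose(x_dag, x_lsq), np.linalg.norm(A @ x_dag - b))

# Optimal value of (1/2) x'Px + q.x + r with P >= 0 singular and q in range(P)
C = rng.standard_normal((4, 2))
P = C @ C.T                                       # rank-2 PSD matrix
q = P @ rng.standard_normal(4)                    # ensures q lies in the range of P
r = 1.0
x_opt = -np.linalg.pinv(P) @ q
val = 0.5 * x_opt @ P @ x_opt + q @ x_opt + r
print(np.isclose(val, -0.5 * q @ np.linalg.pinv(P) @ q + r))
```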

o Consider a matrix $X = \begin{bmatrix} A & B \\ B^\top & C \end{bmatrix} \in \mathbb{S}^n$, with $A \in \mathbb{S}^k$. If $\det A \ne 0$, then $S = C - B^\top A^{-1} B$ is the Schur complement of A in X, and $\det X = \det A \det S$.

 Let $A \succ 0$ and consider $\min_u x^\top X x = \min_u \left( u^\top A u + 2 v^\top B^\top u + v^\top C v \right)$, where $x = \begin{bmatrix} u \\ v \end{bmatrix}$. This is a quadratic in u with solution $u = -A^{-1} B v$ and optimal value $v^\top S v$. Thus

 $X \succ 0 \;\Leftrightarrow\; A \succ 0$ and $S \succ 0$
 If $A \succ 0$, then $X \succeq 0 \;\Leftrightarrow\; S \succeq 0$


 Consider $\begin{bmatrix} A & B \\ B^\top & C \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} u \\ v \end{bmatrix}$ with $\det A \ne 0$. Using the top equation to eliminate x and feeding it into the bottom block, we get $y = S^{-1}\left(v - B^\top A^{-1} u\right)$. Substituting this back to find x, we get

$\begin{bmatrix} A & B \\ B^\top & C \end{bmatrix}^{-1} = \begin{bmatrix} A^{-1} + A^{-1} B S^{-1} B^\top A^{-1} & -A^{-1} B S^{-1} \\ -S^{-1} B^\top A^{-1} & S^{-1} \end{bmatrix}$
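A numerical sanity check of the block-inverse formula and the determinant identity (random symmetric data, for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
k, m = 3, 2
A = rng.standard_normal((k, k)); A = A @ A.T + 5 * np.eye(k)   # make A positive definite
B = rng.standard_normal((k, m))
C = rng.standard_normal((m, m)); C = C @ C.T + 5 * np.eye(m)

X = np.block([[A, B], [B.T, C]])
S = C - B.T @ np.linalg.inv(A) @ B                              # Schur complement of A in X

Ai, Si = np.linalg.inv(A), np.linalg.inv(S)
X_inv_blocks = np.block([[Ai + Ai @ B @ Si @ B.T @ Ai, -Ai @ B @ Si],
                         [-Si @ B.T @ Ai,               Si        ]])

print(np.allclose(X_inv_blocks, np.linalg.inv(X)))                         # block-inverse formula
print(np.isclose(np.linalg.det(X), np.linalg.det(A) * np.linalg.det(S)))   # det X = det A det S
```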
