Applying Subgradient Methods – and Accelerated Methods – to Solve General Convex, Conic Optimization Problems

Jim Renegar School of Operations Research and Information Engineering Cornell University

Workshop on “Optimization and Parsimonious Modeling”
IMA, January 25-29, 2016

min f(x)   s.t. x ∈ Q        (f convex, assumed lower semicontinuous; Q closed)

A vector g is a subgradient of f at x if the graph of the affine function
y ↦ f(x) + ⟨g, y − x⟩ lies below the graph of f, touching it at x.

The set of all subgradients at x is denoted ∂f(x) – the “subdifferential” at x.
For a concave function, supgradients play the analogous role.

min f(x)   s.t. x ∈ Q        (Q a closed, convex set)

Assume f is Lipschitz continuous on an open neighborhood of Q:

    |f(x) − f(y)| ≤ M ‖x − y‖        (M the Lipschitz constant)

Goal: Compute x ∈ Q satisfying f(x) ≤ f* + ε, where f* is the optimal value.

A typical subgradient method:

Initialize: x_0 ∈ Q
Iterate:    compute g_k ∈ ∂f(x_k), and let x_{k+1} = P_Q(x_k − (ε/‖g_k‖²) g_k),
            where P_Q is projection onto Q.
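As a minimal, runnable sketch of the iteration above (illustrative only: the test function, the choice of Q as a Euclidean ball, and all names are mine, not from the talk; the step size ε/‖g‖² matches the displayed update):

```python
import numpy as np

def projected_subgradient(f, subgrad, proj, x0, eps, iters):
    # x_{k+1} = P_Q(x_k - (eps/||g_k||^2) g_k), keeping the best iterate seen
    x = x0
    best_x, best_f = x0, f(x0)
    for _ in range(iters):
        g = subgrad(x)
        x = proj(x - (eps / g.dot(g)) * g)
        fx = f(x)
        if fx < best_f:
            best_x, best_f = x, fx
    return best_x, best_f

# Illustration: minimize f(x) = ||x - a||_1 over the unit ball Q = {x : ||x|| <= 1}
a = np.array([2.0, -1.0])
f = lambda x: np.abs(x - a).sum()
subgrad = lambda x: np.sign(x - a)                 # a subgradient of f at x
proj = lambda x: x / max(1.0, np.linalg.norm(x))   # projection onto the unit ball
x_best, f_best = projected_subgradient(f, subgrad, proj, np.zeros(2), 0.01, 3000)
```

Here the optimal value is 3 − √2, attained on the boundary of the ball; the method approaches it to within roughly ε.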

A typical theorem: if x* is an optimal solution and

    ℓ ≥ ( M ‖x_0 − x*‖ / ε )²,

then min_{k ≤ ℓ} f(x_k) ≤ f* + ε.

In the special case of

    min cᵀx   s.t. Ax ≥ b        (so Q = {x : Ax ≥ b}),

this becomes ... Then, of course, the objective function is Lipschitz continuous,

    |cᵀx − cᵀy| ≤ ‖c‖ ‖x − y‖,

with Lipschitz constant ‖c‖.

Goal: Compute x satisfying Ax ≥ b and cᵀx ≤ z* + ε, where z* is the optimal value.

A typical subgradient method:

Initialize: x_0 ∈ Q
Iterate:    let x_{k+1} = P_Q(x_k − (ε/‖c‖²) c),
            where P_Q is projection onto Q.

A typical theorem: if x* is an optimal solution and

    ℓ ≥ ( ‖c‖ ‖x_0 − x*‖ / ε )²,

then min_{k ≤ ℓ} cᵀx_k ≤ z* + ε.

But in general, projecting onto Q = {x : Ax ≥ b} is no easier than solving linear programs!!!


There are ways, however, to use a subgradient method to “solve” an LP. For example, here is a way to “approximate” an LP by an unconstrained problem:

    min cᵀx s.t. Ax ≥ b    ≈    min_x  cᵀx + β · max{ 0, b_i − a_iᵀx : i = 1, ..., m },

where β is a user-chosen positive constant and a_iᵀ is the i-th row of A.
However, the optimal solution for the problem on the right will not necessarily
be feasible for the LP.

In the literature, in fact, the only subgradient methods producing feasible iterates require the feasible region to be “simple.”

“Why is this?” Consider an LP in standard form:

    min cᵀx   s.t. Ax = b, x ≥ 0

Think of this 2-dimensional plane as being the slice of ℝⁿ cut out by {x : Ax = b}.

(figure: π* marks the optimal solution)

Assume the objective function x ↦ cᵀx is constant on horizontal slices.

To begin with simplicity, assume the vector 1 of all ones is feasible.

Affine_z := {x : Ax = b and cᵀx = z}

The “radial projection” π maps a point x ∈ Affine_z to the point π(x) where the
ray from 1 through x meets the boundary of the nonnegative orthant. In particular,
x_z* denotes the point of Affine_z whose radial projection is the optimal
solution: π* = π(x_z*).

    π(x) := 1 + (1 / (1 − min_j x_j)) (x − 1)

LP:  min cᵀx  s.t. Ax = b, x ≥ 0;    z* = optimal value, assumed finite;
z a fixed value satisfying z < cᵀ1.

For x ∈ Affine_z,

    cᵀπ(x) = cᵀ1 + (1 / (1 − min_j x_j)) (cᵀx − cᵀ1)
           = cᵀ1 + (1 / (1 − min_j x_j)) (z − cᵀ1),

and (z − cᵀ1) is a negative constant. Thus, for x, y ∈ Affine_z,

    cᵀπ(x) ≤ cᵀπ(y)   ⇔   min_j x_j ≥ min_j y_j.

Theorem: LP is equivalent to

    max_x min_j x_j   s.t. Ax = b, cᵀx = z.

The only constraints in the equivalent problem are linear equations. It’s thus
easy to project onto the feasible region for the equivalent problem.

    min cᵀx s.t. Ax = b, x ≥ 0    ≡    max_x min_j x_j s.t. Ax = b, cᵀx = z

• x ↦ min_j x_j is the exemplary nonsmooth concave function, Lipschitz
  continuous with constant M = 1.
• Supgradients at x are the convex combinations of the standard basis vectors
  e(k) for which x_k = min_j x_j.

Thus, projected supgradients at x are the convex combinations of the
corresponding columns of the projection matrix

    P̄ := I − Āᵀ(Ā Āᵀ)⁻¹ Ā,    where Ā = [A; cᵀ].

Hence, in implementing a supgradient method, one option in choosing a
supgradient at x is simply to compute any column P̄_k for which x_k = min_j x_j,
and step

    x⁺ = x + (ε/‖P̄_k‖²) P̄_k.

For large n, do not (cannot) compute (store in memory) all of P̄; compute columns
of P̄ as needed. With a modest amount of preprocessing work, the cost of each
iteration is proportional to the number of nonzero entries in A.
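A small numerical sketch of this projected-supgradient step (illustrative, with a dense random A; the variable names and data are mine, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 8
A = rng.standard_normal((m, n))
c = rng.standard_normal(n)
Abar = np.vstack([A, c])    # Abar stacks the rows of A with the row c^T

# Orthogonal projector onto the nullspace of Abar
Pbar = np.eye(n) - Abar.T @ np.linalg.solve(Abar @ Abar.T, Abar)

x = np.ones(n) + Pbar @ rng.standard_normal(n)   # a point with Ax = b, c^T x = z
b, z = A @ x, c @ x

eps = 0.1
k = int(np.argmin(x))                      # a coordinate achieving min_j x_j
col = Pbar[:, k]                           # projected supgradient: column k of Pbar
x_next = x + (eps / col.dot(col)) * col    # the supgradient step

# x_next remains in Affine_z: A x_next = b and c^T x_next = z
```

For large sparse A one would not form P̄ explicitly; instead, factor ĀĀᵀ once and compute only the single needed column each iteration, as the slide suggests.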

Now consider a semidefinite program, and for simplicity, assume the identity
matrix is feasible:

    min ⟨C, X⟩   s.t. A(X) = b, X ⪰ 0.

Affine_z := {X : A(X) = b and ⟨C, X⟩ = z}

    π(X) := I + (1 / (1 − λ_min(X))) (X − I)

Equivalent problem:

    max λ_min(X)   s.t. A(X) = b, ⟨C, X⟩ = z.

As before, X_z* denotes the point of Affine_z whose radial projection is the
optimal solution: π* = π(X_z*) – so z* = ⟨C, π(X_z*)⟩.
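A quick numerical check of the radial projection (my own illustration): π(X) moves X along the ray from I through X so that its smallest eigenvalue lands exactly at 0, i.e. on the boundary of the PSD cone.

```python
import numpy as np

def radial_projection(X):
    # pi(X) = I + (X - I) / (1 - lambda_min(X)); requires lambda_min(X) < 1
    n = X.shape[0]
    lam_min = np.linalg.eigvalsh(X)[0]   # eigvalsh returns ascending eigenvalues
    return np.eye(n) + (X - np.eye(n)) / (1.0 - lam_min)

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))
X = (B + B.T) / 4.0                      # symmetric; lambda_min(X) < 1 here
piX = radial_projection(X)
# lambda_min(pi(X)) = 0: pi(X) lies on the boundary of the PSD cone
```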

Goal: Compute X satisfying

    (⟨C, π(X)⟩ − z*) / (⟨C, I⟩ − z*) ≤ ε.

To accomplish this, how accurately does λ_min(X) need to approximate λ_min(X_z*)?

    SDP: min ⟨C, X⟩ s.t. A(X) = b, X ⪰ 0   ≡   max λ_min(X) s.t. A(X) = b, ⟨C, X⟩ = z

    (⟨C, π(X)⟩ − z*) / (⟨C, I⟩ − z*) ≤ ε
        ⇔   λ_min(X_z*) − λ_min(X) ≤ (ε / (1 − ε)) · (⟨C, I⟩ − z) / (⟨C, I⟩ − z*)

(figure: the ratio (⟨C, I⟩ − z) / (⟨C, I⟩ − z*), which can be small)

Affine_z := {X : A(X) = b and ⟨C, X⟩ = z}

To get around the high accuracy required if the ratio is small, and to get around
having to know the ratio (that is, having to know the optimal value z*), we apply
a supgradient method to multiple layers (details of which can be found in the
arXiv posting) . . .

    max λ_min(X)   s.t. A(X) = b, ⟨C, X⟩ = z

(figure: X_0 is an initial user-supplied SDP-feasible matrix, shown with I)

(figure: successive iterates X_1, X_2, ..., X_5 moving through the layers)

(figure: level sets for SDP, with the initial user-supplied SDP-feasible matrix X_0 and I)

Diam := supremum of diameters of level sets for objective values ≤ ⟨C, X_0⟩

Thm: If

    ℓ ≥ 8 ( Diam²/ε² + (1/ε) log_{4/3}( (⟨C, I⟩ − z*) / (⟨C, I⟩ − ⟨C, X_0⟩) ) + 1 ),

then

    min_{k ≤ ℓ} (⟨C, π(X_k)⟩ − z*) / (⟨C, I⟩ − z*) ≤ ε.

Now consider a general convex conic optimization problem, and fix a strictly
feasible point e:

    min c·x   s.t. Ax = b, x ∈ K,

where K is a closed, convex cone with nonempty interior.

Affine_z := {x : Ax = b and c·x = z}

    π(x) := e + (1 / (1 − λ_min(x))) (x − e)

... where λ_min(x) is the scalar λ satisfying x − λe ∈ boundary(K).

Prop: The map x ↦ λ_min(x) is concave and Lipschitz continuous.

    min c·x s.t. Ax = b, x ∈ K    ≡    max λ_min(x) s.t. Ax = b, c·x = z

    (c·π(x) − z*) / (c·e − z*) ≤ ε
        ⇔   λ_min(x_z*) − λ_min(x) ≤ (ε / (1 − ε)) · (c·e − z) / (c·e − z*)

Even in this general setting we have “if and only if”.
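As a concrete instance (my example, not the talk's): for the second-order cone K = {(u, t) : ‖u‖ ≤ t} with e = (0, ..., 0, 1), the scalar λ with x − λe ∈ boundary(K) has a closed form, and its concavity is easy to check numerically.

```python
import numpy as np

# K = {(u, t) : ||u|| <= t}, e = (0, ..., 0, 1).
# x - lam*e = (u, t - lam) is on boundary(K) exactly when ||u|| = t - lam,
# so lam_min(u, t) = t - ||u||: linear minus a norm, hence concave.
def lam_min(x):
    return x[-1] - np.linalg.norm(x[:-1])

e = np.array([0.0, 0.0, 1.0])
x = np.array([0.6, -0.8, 2.5])       # ||u|| = 1.0, so lam_min(x) = 1.5
y = np.array([-1.0, 2.0, 4.0])
bd = x - lam_min(x) * e              # lands on boundary(K): ||u|| equals t
```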

Whereas for linear programming we relied on the dot product, and for SDP we relied on the trace product, in this general setting we allow computations to be done with respect to any inner product.

However, the extent to which the inner product reflects the geometry of the
cone K affects the Lipschitz constant . . .

(figure: e and the ball {x : ‖x − e‖ ≤ r_e and Ax = b}, contained in K)

Prop:  |λ_min(x) − λ_min(y)| ≤ (1/r_e) ‖x − y‖  for all x, y ∈ Affine_z
       and for every z.

(see arXiv posting for full explanation)

(figure: level sets, with the initial user-supplied feasible point x_0 and e)

Diam := supremum of diameters of level sets for objective values ≤ c·x_0

Applying a supgradient method results in a sequence x0,x1,... for which . . .

Thm: The Lipschitz constant satisfies M ≤ 1/r_e. If

    ℓ ≥ 8 ( (M·Diam)²/ε² + (1/ε) log_{4/3}( (c·e − z*) / (c·e − c·x_0) ) + 1 ),

then

    min_{k ≤ ℓ} (c·π(x_k) − z*) / (c·e − z*) ≤ ε.

Our motivation now is to develop an approach similar to the one of Nesterov, but
which applies to optimization problems with complicated feasible regions rather
than just “simple” ones.

We depend heavily on various works of Nesterov, as well as results from the literature on “hyperbolic polynomials.”

(Our most recent arXiv posting has all of the details.)

Smoothing

Following Nesterov, rely on the smooth concave function

    f_µ(X) := −µ ln Σ_j exp(−λ_j(X)/µ)        (for fixed µ > 0)

Easy to see:   λ_min(X) − µ ln n ≤ f_µ(X) ≤ λ_min(X)

Not so obvious, but which Nesterov showed:

    ‖∇f_µ(X) − ∇f_µ(Y)‖₁* ≤ (1/µ) ‖X − Y‖₁,

that is, X ↦ ∇f_µ(X) has Lipschitz constant L = 1/µ.

    ∇f_µ(X) = (1 / Σ_j exp(−λ_j(X)/µ)) · Q diag( exp(−λ_1(X)/µ), ..., exp(−λ_n(X)/µ) ) Qᵀ,

where X = Q diag(λ_1(X), ..., λ_n(X)) Qᵀ is an eigendecomposition of X – expensive!

A relevant line of work thus begins with . . .

d’Aspremont, “Smooth optimization with approximate gradient,” SIAM J Opt (2008)


For linear programming:

    ∇f_µ(x) = (1 / Σ_j exp(−x_j/µ)) · ( exp(−x_1/µ), ..., exp(−x_n/µ) )   – inexpensive!

    min ⟨C, X⟩ s.t. A(X) = b, X ⪰ 0   ≡   max λ_min(X) s.t. A(X) = b, ⟨C, X⟩ = z
                                      ≈   max f_µ(X) s.t. A(X) = b, ⟨C, X⟩ = z
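A numerically stable sketch of the LP-case smoothing (shifting by min_j x_j inside the log-sum-exp to avoid overflow; the helper names and data are mine):

```python
import numpy as np

def f_mu(x, mu):
    # f_mu(x) = -mu * log(sum_j exp(-x_j/mu)), computed stably by shifting
    m = x.min()
    return m - mu * np.log(np.exp(-(x - m) / mu).sum())

def grad_f_mu(x, mu):
    # the gradient is a probability vector concentrating on the minimizing coords
    w = np.exp(-(x - x.min()) / mu)
    return w / w.sum()

x = np.array([0.3, -0.2, 1.5, -0.2])
mu = 0.05
val, g = f_mu(x, mu), grad_f_mu(x, mu)
# sandwich bound: min(x) - mu*ln(n) <= f_mu(x) <= min(x)
```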

Same goal as before: Compute feasible X satisfying

    (⟨C, X⟩ − z*) / (⟨C, I⟩ − z*) ≤ ε.

Choosing µ = ε/(6 ln n) and relying on Nesterov’s original accelerated method . . .

Thm: There is a universal constant κ such that if

    k ≥ κ · √(ln n) · (1/ε) · ( Diam + log( (⟨C, I⟩ − z*) / (⟨C, I⟩ − ⟨C, X_0⟩) ) ),

then

    (⟨C, π(X_k)⟩ − z*) / (⟨C, I⟩ − z*) ≤ ε.

Especially-notable earlier work with similar iteration bounds:

Lu, Nemirovski and Monteiro, “Large-scale semidefinite programming via a saddle
point Mirror-Prox algorithm,” Math Prog (2007)

Lan, Lu and Monteiro, “Primal-dual first-order methods with O(1/ε)
iteration-complexity for cone programming,” Math Prog (2011)

    min c·x   s.t. Ax = b, x ∈ K ⊆ E        (E a Euclidean space)

Defn: K is a “hyperbolicity cone” if there is a homogeneous polynomial p satisfying:
• p(x) ≠ 0 for all x ∈ int(K)
• p(x) = 0 for all x ∈ bdy(K)
• there exists e ∈ int(K) such that for all x ∈ E, the univariate polynomial
  λ ↦ p(x − λe) has only real roots.

Example: K = S₊ⁿˣⁿ, p(X) = det(X), e = I, λ ↦ det(X − λI).

Also: the non-negative orthant, second-order cones, and many others.
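A small check of this example (my illustration): for K = S₊ⁿˣⁿ with p(X) = det(X) and e = I, the roots of λ ↦ p(X − λI) are just the ordinary eigenvalues of X, and for symmetric X they are all real.

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((3, 3))
X = B + B.T                  # symmetric, so all eigenvalues are real

coeffs = np.poly(X)          # coefficients of the characteristic polynomial of X
roots = np.roots(coeffs)     # roots of lambda -> det(lambda*I - X); same roots
                             # as det(X - lambda*I), which differs only in sign
eigs = np.linalg.eigvalsh(X) # eigenvalues in ascending order, guaranteed real
```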

Theorem (Gårding, 1959): For every e ∈ int(K) and all x ∈ E, the univariate
polynomial λ ↦ p(x − λe) has only real roots.

• If each of K₁ and K₂ is a hyperbolicity cone, then so are K₁ × K₂ and K₁ ∩ K₂.
• If K′ ⊆ E′ is a hyperbolicity cone and T : E → E′ is a linear transformation,
  then K := {x ∈ E : T(x) ∈ K′} is a hyperbolicity cone.

    min c·x   s.t. Ax = b, x ∈ K (a hyperbolicity cone)        – a “hyperbolic program”

Güler, “Hyperbolic polynomials and interior point methods for convex
programming,” Math of Oper Res (1997)

Now every x has n real “eigenvalues” λ_1(x), ..., λ_n(x) – the roots of the
univariate polynomial λ ↦ p(x − λe) – where n is the degree of the hyperbolic
polynomial p.

Affine_z := {x : Ax = b and c·x = z}

    f_µ(x) := −µ ln Σ_j exp(−λ_j(x)/µ)        (for fixed µ > 0)

Easy to see:   λ_min(x) − µ ln n ≤ f_µ(x) ≤ λ_min(x)

Prop: f_µ is concave and infinitely Fréchet differentiable. Moreover,

    ‖∇f_µ(x) − ∇f_µ(y)‖₁* ≤ (1/µ) ‖x − y‖₁   for all x, y.

(Mostly a corollary to Nesterov’s result and the Helton-Vinnikov Theorem.)

Moreover, the eigenvalues λ_j(x) are readily computable, especially if the
underlying “hyperbolic polynomial” can be factored as the product of polynomials
of low degrees.

(see the arXiv posting for details)

(figure: e and the ball {x : ‖x − e‖ ≤ r_e and Ax = b}, contained in K)

Cor:  ‖P̄∇f_µ(x) − P̄∇f_µ(y)‖ ≤ (1/(r_e² µ)) ‖x − y‖  for all x, y ∈ Affine_z
      and for every z.

    min c·x s.t. Ax = b, x ∈ K   ≡   max λ_min(x) s.t. Ax = b, c·x = z
                                 ≈   max f_µ(x) s.t. Ax = b, c·x = z

Same goal as before: Compute feasible x satisfying (c·x − z*) / (c·e − z*) ≤ ε.

Choosing µ = ε/(6 ln n) and using a “uniformly optimal” (or “universal”)
accelerated gradient method (Lan (2010), Nesterov (2014)) . . .

Thm: The algorithm produces x_k satisfying

    (c·π(x_k) − z*) / (c·e − z*) ≤ ε

and does so within computing a total number of gradients not exceeding

    O( √L · Diam · (1/√ε) · ( 1 + log( (c·e − z*) / (c·e − c·x_0) ) + log(L/L_0) ) ),

where L is the Lipschitz constant for the gradient map and where L_0 > 0 is the
input guess of L.

Note:  L ≤ 1/(r_e² µ) = (6 ln n)/(r_e² ε)

Let’s see what results by applying the framework to general convex optimization
problems by putting those problems into conic form.

First we consider minimizing a convex function subject to no constraints, but we
make some assumptions on the function so that we can clarify how the new
approach differs from applying subgradient methods directly . . .

Assume: f is lower semicontinuous and has a minimizer.

    min f(x)   ≡   min_{x,t} t    s.t. (x, t) ∈ epi(f) := {(x, t) : f(x) ≤ t}
                                       (a closed convex set)

               ≡   min_{x,t,t′} t    s.t. t′ = 1, (x, t, t′) ∈ K,

where K is the closed cone for which (x, t, 1) ∈ K ⇔ (x, t) ∈ epi(f).

The third problem is a conic formulation to which we apply our supgradient
framework.

Assume:
• f is lower semicontinuous and has a minimizer
• {x : f(x) < ∞} is open
• ‖x‖ < 1 ⇒ f(x) < 0

(figure: the graph of such a function f, with an asymptote)

Initialize: x_0 = 0 (the zero vector), z = f(0)
Iterate:

(1) Compute the positive scalar α_k satisfying f(α_k x_k) = α_k z.

(2) Let y_k := α_k x_k.

(figure: the graphs of α ↦ f(α x_k) and α ↦ αz; their intersection determines α_k)

Here z < 0 is an upper bound on the optimal value of min f(x) (the value z is
occasionally updated).

(figure: in ℝⁿ, the points 0, x_k, and y_k := α_k x_k, along with the step to x_{k+1})

If α_k < 4/3, then let x_{k+1} = x_k + (ε/‖g_k‖²) g_k, where

    g_k = (1 / (f(y_k) + ⟨∇f(y_k), 0 − y_k⟩)) ∇f(y_k),

built from a subgradient ∇f(y_k) at y_k.

The subgradient is for y_k but the step is taken from x_k!

(figure: in ℝⁿ, the next point x_{k+1}; determine the positive scalar α_{k+1}
for which f(α_{k+1} x_{k+1}) = α_{k+1} z, and define y_{k+1} = α_{k+1} x_{k+1})

If α_{k+1} ≥ 4/3, define x_{k+2} = y_{k+1} and update z:  z ← f(x_{k+2}).

Initialize: x_0 = 0, z = f(0)
Iterate:

(1) Compute the positive scalar α_k satisfying f(α_k x_k) = α_k z.

(2) Let y_k := α_k x_k.

(3) If α_k ≥ 4/3, let x_{k+1} = y_k and z ← f(y_k).
    Else let

        x_{k+1} = x_k + (ε/‖g_k‖²) g_k,

    where

        g_k = (1 / (f(y_k) + ⟨∇f(y_k), 0 − y_k⟩)) ∇f(y_k),

    ∇f(y_k) being a subgradient at y_k. (The denominator satisfies
    f(y_k) + ⟨∇f(y_k), 0 − y_k⟩ ≤ f(0) < 0, by the subgradient inequality.)
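A compact sketch of this iteration (illustrative only: α_k is found by bisection, and the test function, iteration count, and tolerances are all my choices, assuming f < 0 on the unit ball and f growing at infinity):

```python
import numpy as np

def radial_supgradient(f, subgrad, dim, eps, iters):
    x, z = np.zeros(dim), f(np.zeros(dim))   # z = f(0) < 0 by assumption
    best = z
    for _ in range(iters):
        # (1) find alpha > 0 with f(alpha*x) = alpha*z, via bisection on
        #     g(alpha) = f(alpha*x) - alpha*z  (g(0) = f(0) < 0, g -> +inf)
        lo, hi = 0.0, 1.0
        while f(hi * x) - hi * z < 0.0:
            hi *= 2.0
        for _ in range(60):
            mid = 0.5 * (lo + hi)
            if f(mid * x) - mid * z < 0.0:
                lo = mid
            else:
                hi = mid
        alpha = 0.5 * (lo + hi)
        y = alpha * x                        # (2)
        best = min(best, f(y))
        if alpha >= 4.0 / 3.0:               # (3) enough progress: move to y
            x, z = y, f(y)
        else:                                # supgradient step, taken from x
            gy = subgrad(y)
            denom = f(y) - gy.dot(y)         # = f(y) + <grad, 0 - y> <= f(0) < 0
            g = gy / denom
            x = x + (eps / g.dot(g)) * g
    return best

# Test function: f(x) = ||x - a|| - r with r > ||a|| + 1, so f < 0 on the
# unit ball; the minimum value is -r, attained at x = a.
a, r = np.array([2.0, 0.0]), 4.0
f = lambda x: np.linalg.norm(x - a) - r
subgrad = lambda x: (x - a) / max(np.linalg.norm(x - a), 1e-12)
best = radial_supgradient(f, subgrad, 2, eps=0.1, iters=100)
```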

Cor: (the optimal value satisfies f* < 0)

The supgradient algorithm computes y_k satisfying

    (f(y_k) − f*) / (0 − f*) ≤ ε,

where

    k ≤ 8 ( D²/ε² + (1/ε) log_{4/3}( (D + 1)/(1 − ε) ) + 1 ),

defining D = diameter of {x : f(x) ≤ f(0)}.

Differs from the traditional subgradient literature in that f is not required to
be Lipschitz continuous.

More generally . . .

    min f(x)   s.t. x ∈ Feas := {x ∈ S : Ax = b},

where f is extended-valued, convex and lower semicontinuous, and S is closed
and convex.

Assume:
• x̄ satisfies Ax̄ = b and x̄ ∈ interior(S) ∩ effective domain(f)
• {x ∈ B(x̄, 1) : Ax = b} ⊆ S ∩ effective domain(f), where B is the unit ball
  for the Euclidean norm; let f̂ be a scalar upper bound on f(x) for all x in
  this set
• D = diameter of {x ∈ Feas : f(x) ≤ f(x̄)}

Then one can compute feasible x satisfying

    (f(x) − f*) / (f̂ − f*) ≤ ε        (f* the optimal value)

within O( D²/ε² + (1/ε) log D ) iterations.

(see arXiv posting for details)

But the most important takeaway is entirely elementary: the radial projection

    π(x) = e + (1 / (1 − λ_min(x))) (x − e),

where λ_min(x) is the scalar λ satisfying x − λe ∈ boundary(K), turns the conic
program min c·x s.t. Ax = b, x ∈ K into the equivalent problem

    max λ_min(x)   s.t. Ax = b, c·x = z

over Affine_z := {x : Ax = b and c·x = z}, whose only constraints are linear
equations.

Thanks for listening!