Interior-Point Theory for

Robert M. Freund

May, 2014

© 2014 Massachusetts Institute of Technology. All rights reserved.

1 Background

The material presented herein is based on the following two research texts:

Interior-Point Polynomial Algorithms in Convex Programming by Yurii Nesterov and Arkadii Nemirovskii, SIAM 1994, and

A Mathematical View of Interior-Point Methods in Convex Optimization by James Renegar, SIAM 2001.

2 Barrier Scheme for Solving Convex Optimization

Our problem of interest is

P :   minimize_x   c^T x

      s.t.   x ∈ S ,

where S is some closed convex set, and denote the optimal objective value by V*. Let f(·) be a barrier function for S, namely f(·) satisfies:

(a) f(·) is strictly convex on its domain D_f := int S , and

(b) f(x) → ∞ as x → ∂S .

The idea of the barrier method is to dissuade the algorithm from computing points too close to ∂S, effectively eliminating the complicating factors of dealing with ∂S. For every value of µ > 0 we create the barrier problem:

P_µ :   minimize_x   µ c^T x + f(x)

        s.t.   x ∈ D_f .

Note that P_µ is effectively unconstrained, since the boundary of S will never be encountered. The solution of P_µ is denoted z(µ):

z(µ) := arg min_x { µ c^T x + f(x) : x ∈ D_f } .

Intuitively, as µ → ∞, the impact of the barrier function on the solution of P_µ should become less and less, so we should have c^T z(µ) → V* as µ → ∞. Presuming this is the case, the barrier scheme tries to use Newton's method to solve for approximate solutions x^i of P_{µ_i} for an increasing sequence of values µ_i → ∞. In order to be more specific about how the barrier scheme might work, let us assume that at each iteration we have some value x ∈ D_f that is an approximate solution of P_µ for a given value µ > 0. We will, of course, need a way to define "is an approximate solution of P_µ" that will be developed later. We then will increase the barrier parameter µ by a multiplicative factor α > 1:

µ̂ ← αµ .

Then we will take a Newton step at x for the problem P_µ̂ to obtain a new point x̂ that we would like to then be an approximate solution of P_µ̂. If so, we can continue the scheme inductively. We typically use g(·) and H(·) to denote the gradient and Hessian of f(·). Note that the Newton iterate for P_µ̂ has the formula:

x̂ ← x − H(x)^{-1}(µ̂ c + g(x)) .

The general algorithmic scheme is presented in Algorithm 1.

3 Some Plain Facts

Let f(·) : R^n → R be a twice-differentiable function. We typically use g(·) and H(·) to denote the gradient and Hessian of f(·). Here are four facts about integrals and derivatives:

Fact 3.1   g(y) = g(x) + ∫₀¹ H(x + t(y − x))(y − x) dt

Algorithm 1 General Barrier Scheme

Initialize.

Initialize with µ₀ > 0 and x⁰ ∈ D_f that is "an approximate solution of P_µ₀ ." i ← 0.

Define α > 1.

At iteration i :

1. Current values.

µ ← µi

x ← xi

2. Increase µ and take Newton step.

µ̂ ← αµ

x̂ ← x − H(x)^{-1}(µ̂ c + g(x))

3. Update values.

µ_{i+1} ← µ̂

x_{i+1} ← x̂
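As a concrete illustration of the scheme (a hypothetical numerical sketch, not part of the development), consider the one-dimensional problem of minimizing c·x over S = {x : x ≥ 0} with barrier f(x) = −ln(x), so that g(x) = −1/x, H(x) = 1/x², and V* = 0:

```python
# Numerical sketch of the general barrier scheme (Algorithm 1) for the
# 1-D problem: minimize c*x s.t. x >= 0, with barrier f(x) = -ln(x).
# Then g(x) = -1/x, H(x) = 1/x^2, and the optimal value is V* = 0.
c = 1.0
alpha = 1.2          # multiplicative increase factor for mu (a modest choice)
mu = 1.0
x = 1.0 / (mu * c)   # start at the exact minimizer z(mu) of mu*c*x - ln(x)
for i in range(60):
    mu = alpha * mu                        # mu_hat <- alpha * mu
    x = x - x ** 2 * (mu * c - 1.0 / x)    # x_hat <- x - H(x)^{-1}(mu_hat*c + g(x))
print(mu, c * x)     # c*x approaches V* = 0 as mu grows
```

Here z(µ) = 1/(µc), so c·z(µ) = 1/µ → V* = 0; the modest factor α = 1.2 keeps each iterate well inside the region where the Newton step is reliable.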

Fact 3.2 Let h(t) := f(x + tv). Then

(i) h′(t) = g(x + tv)^T v , and

(ii) h″(t) = v^T H(x + tv) v .

Fact 3.3

f(y) = f(x) + g(x)^T (y − x) + (1/2)(y − x)^T H(x)(y − x)

       + ∫₀¹ ∫₀ᵗ (y − x)^T [H(x + s(y − x)) − H(x)](y − x) ds dt

Fact 3.4   ∫₀ʳ [ 1/(1 − at)² − 1 ] dt = ar²/(1 − ar)

This follows by observing that ∫ 1/(1 − at)² dt = 1/(a(1 − at)).
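Fact 3.4 is easy to confirm numerically; the snippet below (a hypothetical sketch using numpy quadrature, not part of the text) compares both sides for a few values of a and r with ar < 1:

```python
import numpy as np

# Numerical check of Fact 3.4:
#   integral_0^r [ 1/(1 - a*t)^2 - 1 ] dt = a*r^2 / (1 - a*r),  valid when a*r < 1.
for a, r in [(0.5, 0.9), (2.0, 0.4), (1.0, 0.5)]:
    t = np.linspace(0.0, r, 200001)
    f = 1.0 / (1.0 - a * t) ** 2 - 1.0
    lhs = float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(t)))  # trapezoid rule
    rhs = a * r ** 2 / (1.0 - a * r)
    assert abs(lhs - rhs) < 1e-4, (a, r, lhs, rhs)
print("Fact 3.4 verified")
```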

We also present five additional facts that we will need in our analyses.

Fact 3.5 Suppose f(·) is a convex function on R^n, S ⊂ R^n is a compact convex set, and suppose x ∈ int S satisfies f(x) ≤ f(y) for all y ∈ ∂S. Then f(·) attains its global minimum on S.

Fact 3.6 Let ‖v‖ := √(v^T v) be the Euclidean norm. Let λ₁ ≤ ... ≤ λ_n be the ordered eigenvalues of the symmetric matrix M, and define ‖M‖ := max{‖Mx‖ : ‖x‖ ≤ 1}. Then ‖M‖ = max_i{|λ_i|} = max{|λ_n|, |λ₁|}.

Fact 3.7 Suppose A, B are symmetric and A + B = θI for some θ ∈ R. Then AB = BA. Furthermore, if A ≽ 0, B ≽ 0, then A^α B^β = B^β A^α for all α, β ≥ 0.

To see why this is true, decompose A = PDP^T where P is orthonormal (P^T = P^{-1}) and D is diagonal. Then B = P(θI − D)P^T, whereby A^α B^β = PD^α P^T P(θI − D)^β P^T = PD^α (θI − D)^β P^T = P(θI − D)^β D^α P^T = P(θI − D)^β P^T PD^α P^T = B^β A^α.

Fact 3.8 Suppose λ_n ≥ ... ≥ λ₁ > 0. Then

max_i{|λ_i − 1|} ≤ max{λ_n − 1, 1/λ₁ − 1} .

Fact 3.9 Suppose a, b, c, d > 0. Then

min{ a/b , c/d } ≤ (a + c)/(b + d) ≤ max{ a/b , c/d } .

4 Self-Concordant Functions and Properties

Let f(·) be a strictly convex twice-differentiable function defined on the open set D_f := domain f(·), and let D̄_f := cl D_f. Consider x ∈ D_f. We will often abbreviate H_x := H(x) for the Hessian at x. Here we assume that H_x ≻ 0, whereby H_x can be used to define the norm

‖v‖_x := √(v^T H_x v) ,

which is the "local norm" at x. Notice that

‖v‖_x = √(v^T H_x v) = ‖H_x^{1/2} v‖ ,

where ‖w‖ = √(w^T w) is the standard Euclidean (L2) norm. Let

B_x(x, 1) := {y : ‖y − x‖_x < 1} .

This is called the open Dikin ball at x, after the Russian mathematician I. I. Dikin.

Definition 4.1 f(·) is said to be (strongly nondegenerate) self-concordant if for all x ∈ D_f we have B_x(x, 1) ⊂ D_f, and for all y ∈ B_x(x, 1) we have:

1 − ‖y − x‖_x ≤ ‖v‖_y / ‖v‖_x ≤ 1/(1 − ‖y − x‖_x)    for all v ≠ 0.

Let SC denote the class of all such functions.
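For intuition, Definition 4.1 can be checked directly for the univariate barrier f(x) = −ln(x) (a hypothetical numerical sketch; here H(x) = 1/x², so ‖v‖_x = |v|/x, ‖y − x‖_x = |y − x|/x, and the ratio ‖v‖_y/‖v‖_x = x/y for every v ≠ 0):

```python
import random

# Check of Definition 4.1 for f(x) = -ln(x):  H(x) = 1/x^2, so
#   ||v||_x = |v|/x,   ||y - x||_x = |y - x|/x,   ||v||_y / ||v||_x = x/y.
random.seed(0)
for _ in range(1000):
    x = random.uniform(0.1, 10.0)
    y = x * (1.0 + random.uniform(-0.999, 0.999))  # y in the Dikin ball B_x(x,1) = (0, 2x)
    d = abs(y - x) / x                             # ||y - x||_x < 1
    r = x / y                                      # ||v||_y / ||v||_x (independent of v)
    assert 1.0 - d - 1e-9 <= r <= 1.0 / (1.0 - d) + 1e-9
print("Definition 4.1 holds for -ln(x) on all sampled points")
```

Note that for y < x the upper bound holds with equality, which shows the bound in the definition cannot be improved in general.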

Remark 1 The following are the most-used self-concordant functions:

(i) f(x) = −ln(x) for x ∈ D_f = {x ∈ R : x > 0} ,

(ii) f(X) = −ln det(X) for X ∈ D_f = {X ∈ S^{k×k} : X ≻ 0} , and

(iii) f(x) = −ln(x₁² − Σ_{j=2}^n x_j²) for x ∈ D_f := {x : ‖(x₂, ..., x_n)‖ < x₁} .

Before showing that these functions are self-concordant, let us see how we can combine self-concordant functions to obtain other self-concordant functions.

Proposition 4.1 (self-concordance under addition/intersection) Suppose that f_i(·) ∈ SC with domain D_i := D_{f_i} for i = 1, 2, and suppose that D := D₁ ∩ D₂ ≠ ∅. Define f(·) = f₁(·) + f₂(·). Then D_f = D and f(·) ∈ SC.

Proof: Consider x ∈ D = D₁ ∩ D₂. Let B_x^i(c, r) denote the Dikin ball centered at c with radius r defined by f_i(·), and let ‖·‖_{x,i} denote the norm induced at x using the Hessian H_i(x) of f_i(·), for i = 1, 2. Then since x ∈ D_i we have B_x^i(x, 1) ⊂ D_i, and ‖v‖_x² = ‖v‖_{x,1}² + ‖v‖_{x,2}² because H(x) = H₁(x) + H₂(x). Therefore if ‖y − x‖_x < 1 it follows that ‖y − x‖_{x,1} < 1 and ‖y − x‖_{x,2} < 1, whereby y ∈ B_x^i(x, 1) ⊂ D_i for i = 1, 2, and hence y ∈ D₁ ∩ D₂ = D. Also, for any v ≠ 0, using Fact 3.9 we have

‖v‖_y² / ‖v‖_x² = (‖v‖_{y,1}² + ‖v‖_{y,2}²) / (‖v‖_{x,1}² + ‖v‖_{x,2}²) ≤ max{ ‖v‖_{y,1}²/‖v‖_{x,1}² , ‖v‖_{y,2}²/‖v‖_{x,2}² }

≤ max{ ( 1/(1 − ‖y − x‖_{x,1}) )² , ( 1/(1 − ‖y − x‖_{x,2}) )² }

≤ ( 1/(1 − ‖y − x‖_x) )² .

The virtually identical argument can also be applied to prove the "≥" inequality of the definition of self-concordance, by replacing "max" by "min" above and applying the other inequality of Fact 3.9.

Proposition 4.2 (self-concordance under affine transformation) Let A ∈ R^{m×n} satisfy rank A = n ≤ m. Suppose that f(·) ∈ SC with domain D_f ⊂ R^m, and define f̂(·) by f̂(x) = f(Ax − b). Then f̂(·) ∈ SC with domain D̂ := {x : Ax − b ∈ D_f}.

Proof: Consider x ∈ D̂ and s = Ax − b. Letting g(s) and H(s) denote the gradient and Hessian of f(·) at s, and ĝ(x) and Ĥ(x) the gradient and Hessian of f̂(·) at x, we have ĝ(x) = A^T g(s) and Ĥ(x) = A^T H(s)A. Suppose that ‖y − x‖_x < 1. Then defining t := Ay − b we have 1 > ‖y − x‖_x = √((Ay − Ax)^T H(s)(Ay − Ax)) = ‖t − s‖_s, whereby t ∈ D_f and so y ∈ D̂. Therefore B_x(x, 1) ⊂ D̂. Also, for any v ≠ 0, we have

‖v‖_y / ‖v‖_x = √(v^T A^T H(t) A v) / √(v^T A^T H(s) A v) = ‖Av‖_t / ‖Av‖_s ≤ 1/(1 − ‖s − t‖_s) = 1/(1 − ‖y − x‖_x) .

The exact same argument can also be applied to prove the "≥" inequality of the definition of self-concordance.

Proposition 4.3 The three functions defined in Remark 1 are self-concordant.

Proof: We will prove that f(X) := −ln det(X) is self-concordant on its domain {X ∈ S^{k×k} : X ≻ 0}. When k = 1, this is the logarithmic barrier function. Although it is true, we will not prove that f(x) = −ln(x₁² − Σ_{j=2}^n x_j²) is a self-concordant barrier for the interior of the second-order cone Q^n := {x : ‖(x₂, ..., x_n)‖ ≤ x₁}, as this proof is arithmetically uninspiring.

Before we get started with the proof, we have to expand and amend our notation a bit. If we consider two vectors x, y in the n-dimensional space V (which we typically identify with R^n), we use the standard inner product x^T y to define the standard norm ‖v‖ := √(v^T v). When we consider the vector space V to be S^{k×k} (the set of symmetric matrices of order k), we need to define the standard inner product of two symmetric k × k matrices X and Y and also the standard norm on V = S^{k×k}. The standard inner product in this case is the trace inner product:

X • Y := Tr(XY) = Σ_{i=1}^k (XY)_{ii} = Σ_{i=1}^k Σ_{j=1}^k X_{ij} Y_{ij} .

8 The trace inner product has the following properties which are elementary to establish:

1. A • BC = AB • C

2. Tr(AB) = Tr(BA), whereby A • B = B • A

The trace inner product is used to define the standard norm:

‖X‖ := √(X • X) .

This norm is also called the Frobenius norm of the matrix X. Notice that the Frobenius norm of X is not the operator norm of X. For the remainder of this proof we use the trace inner product and the Frobenius norm, and the reader should set aside operator norms entirely.

To prove f(X) := −ln det(X) is self-concordant, let X ≻ 0 be given, and let Y ∈ B_X(X, 1) and V ∈ S^{k×k} be given. We need to verify three statements:

1. Y ≻ 0 ,

2. ‖V‖_Y / ‖V‖_X ≤ 1/(1 − ‖Y − X‖_X) , and

3. ‖V‖_Y / ‖V‖_X ≥ 1 − ‖Y − X‖_X .

To get started, direct expansion yields the following second-order expansion of f(X):

f(X + ∆X) ≈ f(X) − X^{-1} • ∆X + (1/2) ∆X • X^{-1} ∆X X^{-1} ,

and indeed it is easy to derive:

• g(X) = −X^{-1} , and

• H(X)∆X = X^{-1} ∆X X^{-1} .

It therefore follows that

‖∆X‖_X = √( Tr(∆X X^{-1} ∆X X^{-1}) )

        = √( Tr(X^{-1/2} ∆X X^{-1/2} X^{-1/2} ∆X X^{-1/2}) )        (1)

        = √( Tr([X^{-1/2} ∆X X^{-1/2}]²) ) .
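Equation (1) states that ‖∆X‖_X is exactly the Frobenius norm of X^{-1/2} ∆X X^{-1/2}; this identity is easy to confirm numerically (a hypothetical sketch with numpy; the matrices are randomly generated, not from the text):

```python
import numpy as np

# Check of equation (1): ||dX||_X = sqrt(Tr(dX X^{-1} dX X^{-1}))
# equals the Frobenius norm of X^{-1/2} dX X^{-1/2}.
rng = np.random.default_rng(0)
k = 5
A = rng.standard_normal((k, k))
X = A @ A.T + k * np.eye(k)       # a (randomly chosen) positive definite X
B = rng.standard_normal((k, k))
dX = (B + B.T) / 2.0              # a symmetric direction Delta X

Xinv = np.linalg.inv(X)
lhs = np.sqrt(np.trace(dX @ Xinv @ dX @ Xinv))

w, Q = np.linalg.eigh(X)
Xmh = Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T      # X^{-1/2} via eigendecomposition
rhs = np.linalg.norm(Xmh @ dX @ Xmh, "fro")

assert abs(lhs - rhs) < 1e-10 * (1.0 + rhs)
print("equation (1) verified")
```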

Now define two auxiliary matrices:

F := X^{-1/2} Y X^{-1/2}    and    S := X^{-1/2} V X^{-1/2} .

Note that

‖S‖ = √( Tr(X^{-1/2} V X^{-1/2} X^{-1/2} V X^{-1/2}) ) = ‖V‖_X .        (2)

Furthermore let us write F = QDQ^T, where Q is orthonormal (Q^T = Q^{-1}) and the diagonal matrix D is comprised of the eigenvalues of F; let λ denote the vector of eigenvalues, with minimum and maximum λ_min and λ_max. To prove item (1.) above, we observe:

1 > ‖Y − X‖_X² = Tr(X^{-1/2}(Y − X)X^{-1/2} X^{-1/2}(Y − X)X^{-1/2})

               = Tr((F − I)(F − I))

               = Tr(Q(D − I)Q^T Q(D − I)Q^T)        (3)

               = Tr((D − I)(D − I))

               = Σ_{j=1}^k (λ_j − 1)²

               = ‖λ − e‖₂² ,

where e = (1, ..., 1). Since the last quantity above is less than 1, it follows that λ > 0, and hence F ≻ 0 and therefore Y ≻ 0, establishing (1.). In order to establish (2.) and (3.) we will need the following:

‖F^{-1/2} S F^{-1/2}‖ ≤ (1/λ_min) ‖S‖    and    ‖F^{-1/2} S F^{-1/2}‖ ≥ (1/λ_max) ‖S‖ .        (4)

To prove (4), we proceed as follows:

‖F^{-1/2} S F^{-1/2}‖ = √( Tr(Q D^{-1/2} Q^T S Q D^{-1/2} Q^T Q D^{-1/2} Q^T S Q D^{-1/2} Q^T) )

                      = √( Tr(D^{-1} Q^T S Q D^{-1} Q^T S Q) )

                      ≤ (1/√λ_min) √( Tr(Q^T S Q D^{-1} Q^T S Q) )

                      = (1/√λ_min) √( Tr(D^{-1} Q^T S S Q) )

                      ≤ (1/λ_min) √( Tr(Q^T S S Q) )

                      = (1/λ_min) √( Tr(SS) ) = (1/λ_min) ‖S‖ .

The other inequality of (4) follows by substituting λ_max for λ_min and switching ≥ for ≤ in the above chain of equalities and inequalities. We now have:

‖V‖_Y² = Tr(V Y^{-1} V Y^{-1})

       = Tr(X^{-1/2} V X^{-1/2} X^{1/2} Y^{-1} X^{1/2} X^{-1/2} V X^{-1/2} X^{1/2} Y^{-1} X^{1/2})

       = Tr(S F^{-1} S F^{-1})

       = Tr(F^{-1/2} S F^{-1/2} F^{-1/2} S F^{-1/2})

       = ‖F^{-1/2} S F^{-1/2}‖² ≤ (1/λ_min²) ‖S‖² = (1/λ_min²) ‖V‖_X² ,

where the last inequality follows from (4) and the last equality from (2). Therefore

‖V‖_Y / ‖V‖_X ≤ 1/λ_min ≤ 1/(1 − |1 − λ_min|) ≤ 1/(1 − ‖e − λ‖₂) = 1/(1 − ‖Y − X‖_X) ,

where the last equality is from (3). This proves (2.). To prove (3.), use the same equalities as above and the second inequality of (4) to obtain:

‖V‖_Y² = ‖F^{-1/2} S F^{-1/2}‖² ≥ (1/λ_max²) ‖V‖_X² ,

and therefore ‖V‖_Y / ‖V‖_X ≥ 1/λ_max. If λ_max ≤ 1 it follows directly that ‖V‖_Y / ‖V‖_X ≥ 1 ≥ 1 − ‖Y − X‖_X, while if λ_max > 1 we have:

‖Y − X‖_X = ‖λ − e‖₂ ≥ λ_max − 1 ,

from which it follows that

λ_max ‖Y − X‖_X ≥ ‖Y − X‖_X ≥ λ_max − 1    and so    λ_max (1 − ‖Y − X‖_X) ≤ 1 .

From this it then follows that ‖V‖_Y / ‖V‖_X ≥ 1/λ_max ≥ 1 − ‖Y − X‖_X, thus completing the proof of (3.).

Our next result is rather technical, as it shows further properties of changes in Hessian matrices under self-concordance. Recall from Fact 3.6 that the operator norm of a matrix M is defined using the Euclidean norm, namely

kMk := max{kMxk : kxk ≤ 1} .

Lemma 4.1 Suppose that f(·) ∈ SC and x ∈ D_f. If ‖y − x‖_x < 1, then

(i) ‖H_x^{-1/2} H_y H_x^{-1/2}‖ ≤ ( 1/(1 − ‖y − x‖_x) )² ,

(ii) ‖H_x^{1/2} H_y^{-1} H_x^{1/2}‖ ≤ ( 1/(1 − ‖y − x‖_x) )² ,

(iii) ‖I − H_x^{-1/2} H_y H_x^{-1/2}‖ ≤ ( 1/(1 − ‖y − x‖_x) )² − 1 , and

(iv) ‖I − H_x^{1/2} H_y^{-1} H_x^{1/2}‖ ≤ ( 1/(1 − ‖y − x‖_x) )² − 1 .

Proof: Let Q := H_x^{-1/2} H_y H_x^{-1/2}, and observe that Q ≻ 0 with eigenvalues λ_n ≥ ... ≥ λ₁ > 0. From Fact 3.6 we have

√‖Q‖ = √λ_n = max_w √(w^T Q w / w^T w) = max_v √(v^T H_y v / v^T H_x v) = max_v ‖v‖_y/‖v‖_x ≤ 1/(1 − ‖y − x‖_x)

(where the third equality uses the substitution v = H_x^{-1/2} w), and squaring yields the first assertion. Similarly, we have

1/√‖Q^{-1}‖ = √λ₁ = min_w √(w^T Q w / w^T w) = min_v √(v^T H_y v / v^T H_x v) = min_v ‖v‖_y/‖v‖_x ≥ 1 − ‖y − x‖_x

(where the third equality again uses the substitution v = H_x^{-1/2} w), and squaring and rearranging yields the second assertion. Next observe

‖I − Q‖ = max_i{|λ_i − 1|} ≤ max{λ_n − 1, 1/λ₁ − 1} ≤ ( 1/(1 − ‖y − x‖_x) )² − 1 ,

where the first inequality is from Fact 3.8 and the second inequality follows from the two equation streams above, thus showing the third assertion of the lemma. Finally, we have

‖I − Q^{-1}‖ = max_i{|1/λ_i − 1|} ≤ max{1/λ₁ − 1, λ_n − 1} ≤ ( 1/(1 − ‖y − x‖_x) )² − 1 ,

where the first inequality is from Fact 3.8 and the second inequality follows from the two equation streams above, thus showing the fourth assertion of the lemma.

The next result states that we can bound the error in the quadratic approximation of f(·) inside the Dikin ball B_x(x, 1).

Proposition 4.4 Suppose that f(·) ∈ SC and x ∈ D_f. If ‖y − x‖_x < 1, then

| f(y) − [ f(x) + g(x)^T (y − x) + (1/2)(y − x)^T H_x (y − x) ] | ≤ ‖y − x‖_x³ / (3(1 − ‖y − x‖_x)) .

Proof: Let L denote the left-hand side of the inequality to be proved. From Fact 3.3 we have

L = | ∫₀¹ ∫₀ᵗ (y − x)^T [H(x + s(y − x)) − H(x)] (y − x) ds dt |

  = | ∫₀¹ ∫₀ᵗ (y − x)^T H_x^{1/2} [H_x^{-1/2} H(x + s(y − x)) H_x^{-1/2} − I] H_x^{1/2} (y − x) ds dt |

  ≤ ‖y − x‖_x² ∫₀¹ ∫₀ᵗ ‖H_x^{-1/2} H(x + s(y − x)) H_x^{-1/2} − I‖ ds dt

  ≤ ‖y − x‖_x² ∫₀¹ ∫₀ᵗ [ ( 1/(1 − s‖y − x‖_x) )² − 1 ] ds dt        (from Lemma 4.1)

  = ‖y − x‖_x² ∫₀¹ (‖y − x‖_x t²) / (1 − t‖y − x‖_x) dt        (from Fact 3.4)

  ≤ ( ‖y − x‖_x³ / (1 − ‖y − x‖_x) ) ∫₀¹ t² dt = ‖y − x‖_x³ / (3(1 − ‖y − x‖_x)) .

Recall Newton's method to minimize f(·). At x ∈ D_f we compute the Newton step:

n(x) := −H(x)^{-1} g(x)

and compute the Newton iterate:

x₊ := x + n(x) = x − H(x)^{-1} g(x) .

When f(·) ∈ SC, Newton's method has some wonderful properties, as we now show.

Theorem 4.1 Suppose that f(·) ∈ SC and x ∈ D_f. If ‖n(x)‖_x < 1, then

‖n(x₊)‖_{x₊} ≤ ( ‖n(x)‖_x / (1 − ‖n(x)‖_x) )² .

Proof: We will prove this by proving the following two results, which together establish the theorem:

(a) ‖n(x₊)‖_{x₊} ≤ ‖H_x^{-1/2} g(x₊)‖ / (1 − ‖n(x)‖_x) , and

(b) ‖H_x^{-1/2} g(x₊)‖ ≤ ‖n(x)‖_x² / (1 − ‖n(x)‖_x) .

First we prove (a):

‖n(x₊)‖_{x₊}² = g(x₊)^T H(x₊)^{-1} H(x₊) H(x₊)^{-1} g(x₊)

             = g(x₊)^T H_x^{-1/2} H_x^{1/2} H(x₊)^{-1} H_x^{1/2} H_x^{-1/2} g(x₊)

             ≤ ‖H_x^{1/2} H(x₊)^{-1} H_x^{1/2}‖ · ‖H_x^{-1/2} g(x₊)‖²

             ≤ ( 1/(1 − ‖x₊ − x‖_x) )² ‖H_x^{-1/2} g(x₊)‖²        (from Lemma 4.1)

             = ( 1/(1 − ‖n(x)‖_x) )² ‖H_x^{-1/2} g(x₊)‖² ,

which proves (a). To prove (b), observe first that

g(x₊) = g(x₊) − g(x) + g(x)

      = g(x₊) − g(x) − H_x n(x)

      = ∫₀¹ H(x + t(x₊ − x))(x₊ − x) dt − H_x n(x)        (from Fact 3.1)

      = ∫₀¹ [H(x + t n(x)) − H_x] n(x) dt

      = ∫₀¹ [H(x + t n(x)) − H_x] H_x^{-1/2} H_x^{1/2} n(x) dt .

Therefore

H_x^{-1/2} g(x₊) = ∫₀¹ [ H_x^{-1/2} H(x + t n(x)) H_x^{-1/2} − I ] H_x^{1/2} n(x) dt ,

which then implies

‖H_x^{-1/2} g(x₊)‖ ≤ ∫₀¹ ‖H_x^{-1/2} H(x + t n(x)) H_x^{-1/2} − I‖ · ‖H_x^{1/2} n(x)‖ dt

                  ≤ ‖H_x^{1/2} n(x)‖ ∫₀¹ [ ( 1/(1 − t‖n(x)‖_x) )² − 1 ] dt        (from Lemma 4.1)

                  = ‖n(x)‖_x · ‖n(x)‖_x/(1 − ‖n(x)‖_x) ,        (from Fact 3.4)

which proves (b).
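Theorem 4.1 can be observed concretely for the self-concordant function f(x) = x − ln(x), which has the same Hessian as −ln(x) and so belongs to SC (a hypothetical numerical sketch, not part of the development):

```python
# Theorem 4.1 for f(x) = x - ln(x): g(x) = 1 - 1/x, H(x) = 1/x^2,
# so n(x) = -H(x)^{-1} g(x) = x - x^2 and ||n(x)||_x = |n(x)|/x = |1 - x|.
def newton_step(x):
    return -(x ** 2) * (1.0 - 1.0 / x)   # n(x) = x - x^2

def local_norm(v, x):
    return abs(v) / x                    # ||v||_x = |v| / x

for x in [0.5, 0.8, 1.3, 1.6]:
    lam = local_norm(newton_step(x), x)            # ||n(x)||_x, here < 1
    xp = x + newton_step(x)                        # Newton iterate x_+
    lam_plus = local_norm(newton_step(xp), xp)     # ||n(x_+)||_{x_+}
    assert lam < 1.0
    assert lam_plus <= (lam / (1.0 - lam)) ** 2 + 1e-12
print("Theorem 4.1 bound verified")
```

In this example ‖n(x)‖_x = |1 − x| and ‖n(x₊)‖_{x₊} = (1 − x)², so the quadratic decrease predicted by the theorem is exact up to the factor (1 − ‖n(x)‖_x)².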

Theorem 4.2 Suppose that f(·) ∈ SC and x ∈ D_f. If ‖n(x)‖_x ≤ 1/4, then f(·) has a minimizer z, and

‖z − x₊‖_x ≤ 3‖n(x)‖_x² / (1 − ‖n(x)‖_x)³ .

Proof: First suppose that ‖n(x)‖_x ≤ 1/9, and define Q_x(y) := f(x) + g(x)^T (y − x) + (1/2)(y − x)^T H_x (y − x). Let y satisfy ‖y − x‖_x ≤ 1/3. Then from Proposition 4.4 we have

|f(y) − Q_x(y)| ≤ ‖y − x‖_x³ / (3(1 − 1/3)) ≤ ‖y − x‖_x² / (9(2/3)) = ‖y − x‖_x² / 6 ,

and therefore

f(y) ≥ f(x) + g(x)^T H_x^{-1/2} H_x^{1/2} (y − x) + (1/2)‖y − x‖_x² − (1/6)‖y − x‖_x²

     ≥ f(x) − ‖n(x)‖_x ‖y − x‖_x + (1/3)‖y − x‖_x²

     = f(x) + (1/3)‖y − x‖_x ( −3‖n(x)‖_x + ‖y − x‖_x ) .

Now if ỹ ∈ ∂S := ∂{y : ‖y − x‖_x ≤ 3‖n(x)‖_x}, it follows that f(ỹ) ≥ f(x). So, by Fact 3.5, f(·) has a global minimizer z ∈ S, and so ‖z − x‖_x ≤ 3‖n(x)‖_x.

Now suppose that ‖n(x)‖_x ≤ 1/4. From Theorem 4.1 we have

‖n(x₊)‖_{x₊} ≤ ( (1/4)/(1 − 1/4) )² = 1/9 ,

so f(·) has a global minimizer z and ‖z − x₊‖_{x₊} ≤ 3‖n(x₊)‖_{x₊}. Therefore

‖z − x₊‖_x ≤ ‖z − x₊‖_{x₊} / (1 − ‖x − x₊‖_x)        (from Definition 4.1)

           = ‖z − x₊‖_{x₊} / (1 − ‖n(x)‖_x)

           ≤ 3‖n(x₊)‖_{x₊} / (1 − ‖n(x)‖_x)

           ≤ 3‖n(x)‖_x² / (1 − ‖n(x)‖_x)³ .        (from Theorem 4.1)

Last of all in this section, we present a more traditional result about the convergence of Newton’s method for a self-concordant function. This result is not used elsewhere in our development, and is only included to relate the results herein to more traditional theory of Newton’s method. The proof of this result is left as a somewhat-challenging exercise.

Theorem 4.3 Suppose that f(·) ∈ SC and x ∈ D_f, and f(·) has a minimizer z. If ‖x − z‖_z < 1/4, then

‖x₊ − z‖_z < 4‖x − z‖_z² .
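A quick numerical illustration of Theorem 4.3 (hypothetical, again using f(x) = x − ln(x), whose minimizer is z = 1 and for which H(z) = 1):

```python
# Theorem 4.3 for f(x) = x - ln(x): the minimizer is z = 1, H(z) = 1, and the
# Newton iterate is x_+ = x - H(x)^{-1} g(x) = 2x - x^2, so x_+ - 1 = -(x - 1)^2.
z = 1.0
for x in [0.8, 0.9, 1.1, 1.2]:
    dist = abs(x - z)                        # ||x - z||_z, since H(z) = 1
    assert dist < 0.25                       # hypothesis of Theorem 4.3
    x_plus = x - x ** 2 * (1.0 - 1.0 / x)    # Newton iterate 2x - x^2
    assert abs(x_plus - z) < 4.0 * dist ** 2
print("Theorem 4.3 quadratic convergence verified")
```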

5 Self-Concordant Barriers

We begin with another definition.

Definition 5.1 f(·) is a ϑ-(strongly nondegenerate self-concordant)-barrier if f(·) ∈ SC and

ϑ = ϑ_f := max_{x ∈ D_f} ‖n(x)‖_x² < ∞ .

Note that ‖n(x)‖_x² = (−g(x))^T H(x)^{-1} H(x) H(x)^{-1} (−g(x)) = g(x)^T H(x)^{-1} g(x), so we can equivalently define

ϑ_f := max_{x ∈ D_f} g(x)^T H(x)^{-1} g(x)    or    ϑ_f := max_{x ∈ D_f} n(x)^T H(x) n(x) .

The quantity ϑf is called the complexity value of the barrier f(·). Let SCB denote the class of all such functions. The following prop- erty is very important.

Theorem 5.1 Suppose that f(·) ∈ SCB and x, y ∈ Df . Then

g(x)^T (y − x) < ϑ_f .

Proof: Define φ(t) := f(x + t(y − x)), whereby φ′(t) = g(x + t(y − x))^T (y − x) and φ″(t) = (y − x)^T H(x + t(y − x))(y − x). We want to prove that φ′(0) < ϑ_f. If φ′(0) ≤ 0 there is nothing further to prove, so we can assume that φ′(0) > 0, whereby from convexity it also follows that φ′(t) > 0 for all t ∈ [0, 1]. (Notice that φ(1) = f(y), and so t = 1 is in the domain of φ(·).) Let t ∈ [0, 1] be given and let v = x + t(y − x). Then

φ′(t) = g(v)^T (y − x)

      = g(v)^T H_v^{-1} H_v^{1/2} H_v^{1/2} (y − x)

      = −n(v)^T H_v^{1/2} H_v^{1/2} (y − x)

      ≤ ‖H_v^{1/2} n(v)‖ · ‖H_v^{1/2} (y − x)‖

      = ‖n(v)‖_v ‖y − x‖_v ≤ √ϑ_f ‖y − x‖_v .

Also φ″(t) = (y − x)^T H_v (y − x) = ‖y − x‖_v², whereby

φ″(t)/φ′(t)² ≥ ‖y − x‖_v² / (ϑ_f ‖y − x‖_v²) = 1/ϑ_f .

It follows that

1/ϑ_f ≤ ∫₀¹ φ″(t)/φ′(t)² dt = [ −1/φ′(t) ]₀¹ = 1/φ′(0) − 1/φ′(1) ,

and we have

1/φ′(0) ≥ 1/φ′(1) + 1/ϑ_f > 1/ϑ_f ,

which proves the result.

The next two results show how the complexity value ϑ behaves under addition/intersection and affine transformation.

Theorem 5.2 (self-concordant barriers under addition/intersection) Suppose that f_i(·) ∈ SCB with domain D_i := D_{f_i} and complexity value ϑ_i := ϑ_{f_i} for i = 1, 2, and suppose that D := D₁ ∩ D₂ ≠ ∅. Define f(·) = f₁(·) + f₂(·). Then f(·) ∈ SCB with domain D, and ϑ_f ≤ ϑ₁ + ϑ₂.

Proof: Fix x ∈ D and let g₁, g₂, g and H₁, H₂, H denote the gradients and Hessians of f₁(·), f₂(·), f(·) at x, whereby g = g₁ + g₂ and H = H₁ + H₂. Define A_i := H^{-1/2} H_i H^{-1/2} for i = 1, 2. Then A_i ≻ 0 and A₁ + A₂ = I, so A₁, A₂ commute and A₁^{1/2}, A₂^{1/2} commute, from Fact 3.7. Also define u_i := A_i^{-1/2} H^{-1/2} g_i for i = 1, 2. We have

g^T H^{-1} g = g₁^T H^{-1} g₁ + g₂^T H^{-1} g₂ + 2 g₁^T H^{-1} g₂

 = u₁^T A₁ u₁ + u₂^T A₂ u₂ + 2 u₁^T A₁^{1/2} A₂^{1/2} u₂

 = u₁^T [I − A₂] u₁ + u₂^T [I − A₁] u₂ + 2 u₁^T A₂^{1/2} A₁^{1/2} u₂        (from Fact 3.7)

 = u₁^T u₁ + u₂^T u₂ − [ u₁^T A₂ u₁ + u₂^T A₁ u₂ − 2 u₁^T A₂^{1/2} A₁^{1/2} u₂ ]

 = g₁^T H^{-1/2} A₁^{-1} H^{-1/2} g₁ + g₂^T H^{-1/2} A₂^{-1} H^{-1/2} g₂ − ‖A₂^{1/2} u₁ − A₁^{1/2} u₂‖²

 ≤ g₁^T H^{-1/2} H^{1/2} H₁^{-1} H^{1/2} H^{-1/2} g₁ + g₂^T H^{-1/2} H^{1/2} H₂^{-1} H^{1/2} H^{-1/2} g₂

 = g₁^T H₁^{-1} g₁ + g₂^T H₂^{-1} g₂

 ≤ ϑ₁ + ϑ₂ ,

thereby showing that ϑ_f ≤ ϑ₁ + ϑ₂.

Theorem 5.3 (self-concordant barriers under affine transformation) Let A ∈ R^{m×n} satisfy rank A = n ≤ m. Suppose that f(·) ∈ SCB with complexity value ϑ_f and domain D_f ⊂ R^m, and define f̂(·) by f̂(x) = f(Ax − b). Then f̂(·) ∈ SCB and ϑ_f̂ ≤ ϑ_f.

Proof: Fix x ∈ D̂ and define s = Ax − b. Letting g and H denote the gradient and Hessian of f(·) at s, and ĝ and Ĥ the gradient and Hessian of f̂(·) at x, we have ĝ = A^T g and Ĥ = A^T H A. Then

ĝ^T Ĥ^{-1} ĝ = g^T A (A^T H A)^{-1} A^T g = g^T H^{-1/2} H^{1/2} A (A^T H^{1/2} H^{1/2} A)^{-1} A^T H^{1/2} H^{-1/2} g

            ≤ g^T H^{-1/2} H^{-1/2} g = g^T H^{-1} g ≤ ϑ_f ,

since the matrix H^{1/2} A (A^T H^{1/2} H^{1/2} A)^{-1} A^T H^{1/2} is a projection matrix.

Remark 2 The complexity values of the three most-used barriers are as follows:

1. ϑ_f = 1 for the barrier f(x) = −ln(x) defined on D_f = {x : x > 0} ,

2. ϑ_f = k for the barrier f(X) = −ln det(X) defined on D_f = {X ∈ S^{k×k} : X ≻ 0} , and

3. ϑ_f = 2 for the barrier f(x) = −ln(x₁² − Σ_{j=2}^n x_j²) defined on D_f = {x : ‖(x₂, ..., x_n)‖ < x₁} .

Proof: Item (1.) follows from item (2.), so we first prove (2.). Recall the use of the trace inner product X • Y = Tr(XY) and the Frobenius norm ‖X‖ := √(X • X) as discussed early in the proof of Proposition 4.3. Also recall that for f(X) = −ln det(X) we have g(X) = −X^{-1} and H(X)∆X = X^{-1} ∆X X^{-1}. Therefore the Newton step at X, denoted by n(X), is the solution of the following equation:

X^{-1} [n(X)] X^{-1} = X^{-1} ,

and it follows that n(X) = X. Therefore, using (1) we have

‖n(X)‖_X² = n(X) • X^{-1} n(X) X^{-1} = X • X^{-1} X X^{-1} = X • X^{-1} = Tr(XX^{-1}) = Tr(I) = k ,

and therefore ϑ_f = max_{X≻0} ‖n(X)‖_X² = k, which proves (2.) and hence (1.).

In order to prove (3.) we amend our notation a bit, letting Q^n = {(t, x) ∈ R × R^{n−1} : ‖x‖ ≤ t}. For (t, x) ∈ int Q^n we have ‖x‖ < t, and mechanically we can derive (writing s := t² − x^T x):

g(t, x) = ( −2t/s , 2x/s )    and    H(t, x) = (2/s²) · [ t² + x^T x ,  −2t x^T ;  −2t x ,  s I + 2 x x^T ] ,

and the Hessian inverse is given by

H(t, x)^{-1} = [ (t² + x^T x)/2 ,  t x^T ;  t x ,  (s/2) I + x x^T ] .

Directly plugging in yields

g(t, x)^T H(t, x)^{-1} g(t, x) = 2
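Both of these complexity-value computations can be double-checked numerically; the sketch below (hypothetical, using numpy and a finite-difference Hessian for the second-order cone barrier) confirms ϑ_f = 1 for −ln(x) and the value 2 just computed:

```python
import numpy as np

# theta_f = 1 for f(x) = -ln(x): g(x) = -1/x, H(x) = 1/x^2, so
# g(x)^T H(x)^{-1} g(x) = (1/x^2) * x^2 = 1 at every x > 0.
for x in [0.3, 1.0, 7.5]:
    assert abs((-1.0 / x) ** 2 * x ** 2 - 1.0) < 1e-12

# g^T H^{-1} g = 2 for the second-order cone barrier f(t, x) = -ln(t^2 - x^T x),
# using the gradient formula and a central-difference Hessian.
def grad(u):
    t, x = u[0], u[1:]
    s = t * t - x @ x
    return np.concatenate(([-2.0 * t / s], 2.0 * x / s))

rng = np.random.default_rng(1)
x = rng.standard_normal(3)
u = np.concatenate(([np.linalg.norm(x) + 1.0], x))   # interior point: ||x|| < t

h, n = 1e-6, u.size
H = np.zeros((n, n))
for j in range(n):                                   # finite-difference Hessian columns
    e = np.zeros(n)
    e[j] = h
    H[:, j] = (grad(u + e) - grad(u - e)) / (2.0 * h)
g = grad(u)
val = float(g @ np.linalg.solve(H, g))
assert abs(val - 2.0) < 1e-4
print("theta values verified:", val)
```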

from which it follows that ϑ_f = max_{(t,x) ∈ int Q^n} g(t, x)^T H(t, x)^{-1} g(t, x) = 2.

Finally, we present a result that shows that "one half of the Dikin ball" approximates the shape of a relevant portion of D_f to within a factor of ϑ_f.

Proposition 5.1 Suppose that f(·) ∈ SCB with complexity value ϑ_f, let x ∈ D_f, and define the following three sets:

(a) S₁ := {y : ‖y − x‖_x < 1, g(x)^T (y − x) ≥ 0} ,

(b) S₂ := {y : y ∈ D_f , g(x)^T (y − x) ≥ 0} , and

(c) S₃ := {y : ‖y − x‖_x < 4ϑ_f + 1, g(x)^T (y − x) ≥ 0} .

Then S₁ ⊂ S₂ ⊂ S₃.

The proof of Proposition 5.1 is no more involved than the other results herein; it is omitted only because it is not needed for subsequent results. It is stated here to show the power and beauty of self-concordant barriers.

Corollary 5.1 If z is the minimizer of f(·), then

Bz(z, 1) ⊂ Df ⊂ Bz(z, 4ϑf + 1) .

6 The Barrier Method and its Analysis

Our original problem of interest is

P :   minimize_x   c^T x

      s.t.   x ∈ S ,

whose optimal objective value we denote by V*. Let f(·) be a self-concordant barrier on D_f = int S. For every µ > 0 we create the barrier problem:

P_µ :   minimize_x   µ c^T x + f(x)

        s.t.   x ∈ D_f .

The solution of this problem for each µ is denoted z(µ):

z(µ) := arg min_x { µ c^T x + f(x) : x ∈ D_f } .

Intuitively, as µ → ∞, the impact of the barrier function on the solution of P_µ should become less and less, so we should have c^T z(µ) → V* as µ → ∞. Presuming this is the case, the barrier scheme will use Newton's method to solve for approximate solutions x^i of P_{µ_i} for an increasing sequence of values µ_i → ∞. In order to be more specific about how the barrier scheme might work, let us assume that at each iteration we have some value x ∈ D_f that is an approximate solution of P_µ for a given value µ > 0. (We will define "an approximate solution of P_µ" shortly.) We then will increase the barrier parameter µ by a multiplicative factor α > 1:

µ̂ ← αµ .

Then we will take a Newton step at x for the problem P_µ̂ to obtain a new point x̂ that we would like to then be an approximate solution of P_µ̂. If so, we can continue the scheme inductively.

Let x ∈ Df be given, and let us compute the Newton step for Pµ at x. The objective function of Pµ is

h_µ(x) := µ c^T x + f(x) ,

whereby we have:

(a) ∇h_µ(x) = µc + g(x) , and

(b) ∇²h_µ(x) = H(x) = H_x .

Therefore the Newton step for h_µ(·) at x is:

n_µ(x) := −H_x^{-1}(µc + g(x)) = n(x) − µ H_x^{-1} c ,

and the new iterate is:

x̂ := x + n_µ(x) = x − H_x^{-1}(µc + g(x)) .

Remark 3 Notice that h_µ(·) ∈ SC, since membership in SC has only to do with Hessians, and h_µ(·) and f(·) have the same Hessian. However, membership in SCB depends also on gradients, and h_µ(·) and f(·) have different gradients, whereby it will typically be true that h_µ(·) ∉ SCB (unless c = 0).

We now define what we mean for y to be an “approximate solution” of Pµ.

Definition 6.1 Let γ ∈ [0, 1) be given. We say that y ∈ D_f is a γ-approximate solution of P_µ if

knµ(y)ky ≤ γ .

Notice that this definition is equivalent to

‖n(y) − µ H_y^{-1} c‖_y ≤ γ .

Essentially, the definition states that y is a γ-approximate solution of Pµ if the Newton step for Pµ at y is small (measured using the local norm at y). The following theorem gives an explicit optimality gap bound for y if y is a γ = 1/4-approximate solution of Pµ.

Theorem 6.1 Suppose γ ≤ 1/4, and y ∈ D_f is a γ-approximate solution of P_µ. Then

c^T y ≤ V* + (ϑ_f / µ) · 1/(1 − δ) ,

where δ = γ + 3γ²/(1 − γ)³ .

Proof: From Theorem 4.2 we know that z(µ) exists, and furthermore

‖y − z(µ)‖_y = ‖y + n_µ(y) − z(µ) − n_µ(y)‖_y

             ≤ ‖y₊ − z(µ)‖_y + ‖n_µ(y)‖_y

             ≤ 3γ²/(1 − γ)³ + γ = δ .

From basic first-order optimality conditions we know that z(µ) satisfies

µc + g(z(µ)) = 0 ,

and from Theorem 5.1 we have

−µ c^T (w − z(µ)) = g(z(µ))^T (w − z(µ)) < ϑ_f    for all w ∈ D_f .

Rearranging, we have

c^T w + ϑ_f/µ > c^T z(µ)    for all w ∈ D_f ,

whereby V* + ϑ_f/µ ≥ c^T z(µ). Now for notational convenience denote z := z(µ)

and observe

c^T y = c^T z + c^T (y − z)

      ≤ V* + ϑ_f/µ + c^T H_z^{-1/2} H_z^{1/2} (y − z)

      ≤ V* + ϑ_f/µ + ‖H_z^{-1/2} c‖ · ‖y − z‖_z

      ≤ V* + ϑ_f/µ + √(c^T H_z^{-1} c) · ‖y − z‖_y / (1 − ‖y − z‖_y)

      ≤ V* + ϑ_f/µ + √((g(z)/µ)^T H_z^{-1} (g(z)/µ)) · δ/(1 − δ)

      = V* + ϑ_f/µ + ( √(g(z)^T H_z^{-1} g(z)) / µ ) · δ/(1 − δ)

      ≤ V* + ϑ_f/µ + (√ϑ_f / µ) · δ/(1 − δ)

      ≤ V* + (ϑ_f/µ)(1 + δ/(1 − δ))

      = V* + ϑ_f/(µ(1 − δ)) .

The last inequality above follows from the fact (which we will not prove) that ϑ_f ≥ 1 for any f(·) ∈ SCB, whereby √ϑ_f ≤ ϑ_f. Note that with γ = 1/9 we have 1/(1 − δ) ≤ 6/5, whereby c^T y ≤ V* + 1.2 ϑ_f/µ.

Theorem 6.2 Let β := 1/4, γ := 1/9, and α := (√ϑ + β)/(√ϑ + γ). Suppose x is a γ-approximate solution of P_µ. Define µ̂ := αµ, and let x̂ be the Newton iterate for P_µ̂ at x, namely

x̂ := x − H(x)^{-1}(µ̂ c + g(x)) .

Then

Algorithm 2 Barrier Method

Initialize.

Define γ := 1/9, β := 1/4, α := (√ϑ + β)/(√ϑ + γ).

Initialize with µ₀ > 0 and x⁰ ∈ D_f that is a γ-approximate solution of P_µ₀ . i ← 0.

At iteration i :

1. Current values.

µ ← µi

x ← xi

2. Increase µ and take Newton step.

µ̂ ← αµ

x̂ ← x − H(x)^{-1}(µ̂ c + g(x))

3. Update values.

µ_{i+1} ← µ̂

x_{i+1} ← x̂

1. x is a β-approximate solution of P_µ̂ , and

2. x̂ is a γ-approximate solution of P_µ̂ .

Proof: To prove (1.) we have:

‖n_µ̂(x)‖_x = ‖n_αµ(x)‖_x

           = ‖H(x)^{-1}(αµ c + g(x))‖_x

           = ‖α [H(x)^{-1}(µ c + g(x))] + (1 − α) H(x)^{-1} g(x)‖_x

           ≤ α ‖H(x)^{-1}(µ c + g(x))‖_x + (α − 1) ‖H(x)^{-1} g(x)‖_x

           ≤ αγ + (α − 1) ‖n(x)‖_x

           ≤ αγ + (α − 1) √ϑ_f = β .

To prove (2.) we invoke Theorem 4.1:

‖n_µ̂(x̂)‖_x̂ ≤ ( ‖n_µ̂(x)‖_x / (1 − ‖n_µ̂(x)‖_x) )² ≤ ( β/(1 − β) )² = 1/9 = γ .

Applying Theorem 6.2 inductively we obtain the basic barrier method for self-concordant barriers as presented in Algorithm 2. The complexity of this scheme is presented below.

Theorem 6.3 Let ε > 0 be the desired optimality tolerance, and define

J := ⌈ 9√ϑ ln( 6ϑ/(5µ₀ε) ) ⌉ .

Then by iteration J of the barrier method the current iterate x satisfies c^T x ≤ V* + ε.

Proof: With the given values of γ, β, α, the quantity δ := γ + 3γ²/(1 − γ)³ satisfies 1/(1 − δ) ≤ 6/5, and

1 − 1/α = 1/(7.2√ϑ + 1.8) ≥ 1/(9√ϑ) .

After J iterations the current iterate x is a γ-approximate solution of P_µ where µ = α^J µ₀. Therefore

ln µ₀ − ln µ = J ln(1/α) ≤ J (1/α − 1) .

(Note that the inequality above follows from the fact that ln t ≤ t − 1 for t > 0, which itself is a consequence of the concavity of ln(·).) Therefore

ln µ ≥ ln µ₀ + J(1 − 1/α) ≥ ln µ₀ + J/(9√ϑ) ≥ ln µ₀ + ln( 6ϑ/(5µ₀ε) ) ≥ ln( ϑ/((1 − δ)ε) ) .

Therefore µ ≥ ϑ/((1 − δ)ε), and from Theorem 6.1 we have

c^T x ≤ V* + ϑ/(µ(1 − δ)) ≤ V* + ε .
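The barrier method can be watched in action on the one-dimensional problem of minimizing c·x over {x : x ≥ 0} with f(x) = −ln(x), for which ϑ = 1, V* = 0, and ‖n_µ(x)‖_x = |1 − µcx| (a hypothetical numerical sketch of Algorithm 2; x⁰ = 1/(µ₀c) is the exact center, hence a γ-approximate solution):

```python
import math

# Algorithm 2 on: minimize c*x s.t. x >= 0, with f(x) = -ln(x), theta = 1, V* = 0.
# Here g(x) = -1/x, H(x) = 1/x^2, and ||n_mu(x)||_x = |1 - mu*c*x|.
c = 1.0
theta = 1.0
gamma, beta = 1.0 / 9.0, 1.0 / 4.0
alpha = (math.sqrt(theta) + beta) / (math.sqrt(theta) + gamma)   # = 1.125
mu = 1.0
x = 1.0 / (mu * c)             # exact center: ||n_mu(x)||_x = 0 <= gamma
for i in range(200):
    mu = alpha * mu                                 # mu_hat <- alpha * mu
    x = x - x ** 2 * (mu * c - 1.0 / x)             # Newton step for P_{mu_hat}
    assert abs(1.0 - mu * c * x) <= gamma + 1e-12   # stays gamma-approximate (Theorem 6.2)
print("mu =", mu, "  c^T x =", c * x)               # the gap c^T x - V* shrinks like 1/mu
```

Each pass through the loop multiplies µ by α ≈ 1.125 while the assertion confirms the inductive invariant of Theorem 6.2, and the optimality gap c^T x − V* decays at the geometric rate 1/µ, as Theorem 6.3 predicts.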

7 Remarks and other Matters

7.1 Nesterov-Nemirovskii definition of self-concordance

The original definition of self-concordance due to Nesterov and Nemirovskii is different than that presented here in Definition 4.1. Let f(·) be a function defined on an open set D_f ⊂ R^n. For every x ∈ D_f and every vector h ∈ R^n define the univariate function

φx,h(α) := f(x + αh) .

Then Nesterov and Nemirovskii’s definition of self-concordance is as follows.

Definition 7.1 The function f(·) defined on the domain D_f ⊂ R^n is (strongly nondegenerate) self-concordant if:

(a) H(x) ≻ 0 for all x ∈ D_f ,

(b) f(x) → ∞ as x → ∂D_f , and

(c) for all x ∈ D_f and h ∈ R^n, φ_{x,h}(·) satisfies |φ‴_{x,h}(0)| ≤ 2 [ φ″_{x,h}(0) ]^{3/2} .

Definition 7.1 roughly states that the third derivative of f(·) is bounded by an appropriate function of the second derivative. This definition assumes f(·) is three-times differentiable, unlike Definition 4.1. It turns out that when f(·) is three-times differentiable, then Definition 7.1 implies the properties of Definition 4.1 and vice versa; thus the two definitions are equivalent. We prefer Definition 4.1 because proofs of later properties are far easier to establish. The key advantage of Definition 7.1 is that it leads to an easier way to validate the self-concordance of most self-concordant functions, especially −ln det(X) and −ln(t² − x^T x) for the semidefinite and second-order cones, respectively.

7.2 Universal Barrier

One natural question is: given a convex set S ⊂ R^n with nonempty interior, does there exist a self-concordant barrier f_S(·) whose domain is D_{f_S} = int S? Furthermore, if the answer is yes, does there exist a barrier whose complexity value is, say, O(n)? It turns out that the answers to these two questions are both "yes," but the proofs are incredibly difficult.

Let S ⊂ R^n be a closed convex set with nonempty interior. For each point x ∈ int S, define the "local polar of S relative to x":

P_S(x) := {w ∈ R^n : w^T (y − x) ≤ 1 for all y ∈ S} .

Basic geometry suggests that as x → ∂S, the volume of P_S(x) should approach ∞. Now define the function:

fS(x) := ln (volume (PS(x))) .

It turns out that f_S(·) is a self-concordant function, but this is quite hard to prove. Furthermore, there is a universal constant C̄ such that for any dimension n and any closed convex set S with nonempty interior, it is true that

ϑ_{f_S} ≤ C̄ · n .

This is a remarkable and very deep result. However, it is computationally useless, since not only are the function values of f_S(·) extremely difficult to compute, but gradient and Hessian information is also extremely difficult to compute.

7.3 Other Matters

(i) getting started,

(ii) other formats for convex optimization,

(iii) ϑ-logarithmic homogeneous barriers for cones,

(iv) primal-dual methods,

(v) computational practice, and

(vi) other self-concordant functions and self-concordant barriers.

7.4 Exercises

1. Prove that if ‖n(x)‖_x > 1/4, then upon setting

y = x + ( 1/(5‖n(x)‖_x) ) n(x) ,

we have f(y) ≤ f(x) − 1/37.5 .

2. Prove Theorem 4.3. To get started, observe that x₊ − z = x − z − H_x^{-1}(g(x) − g(z)) (since z is a minimizer and hence g(z) = 0), and then use Fact 3.1 to show that

x₊ − z = H_x^{-1/2} ∫₀¹ H_x^{-1/2} [H_x − H(x + t(z − x))] H_x^{-1/2} H_x^{1/2} (x − z) dt .

Next multiply each side by H_x^{1/2} and take norms, and then invoke Lemma 4.1. This should help get you started.

3. Use Theorem 4.3 to show that under the hypothesis of the theorem, the sequence of Newton iterates starting with x, namely x¹ = x₊, x² = (x¹)₊, ..., satisfies

‖x^i − z‖_z < (1/4)(4‖x − z‖_z)^{2^i} .

4. Complete the proof of Proposition 4.1 by proving the "≥" inequality in the definition of self-concordance, using the suggestions at the end of the proof earlier in the text.

5. Complete the proof of Proposition 4.2 by proving the "≥" inequality in the definition of self-concordance, using the suggestions at the end of the proof earlier in the text.

6. The proof of Proposition 4.3 uses the inequalities presented in (4), but only the left-most inequality is proved in the text herein. Prove the right-most inequality by substituting λmax for λmin and switching ≥ for ≤ in the chain of equalities and inequalities in the text.
