MOSEK Modeling Cookbook Release 3.2.3

MOSEK ApS

13 September 2021

Contents

1 Preface

2 Linear optimization
  2.1 Introduction
  2.2 Linear modeling
  2.3 Infeasibility in linear optimization
  2.4 Duality in linear optimization

3 Conic quadratic optimization
  3.1 Cones
  3.2 Conic quadratic modeling
  3.3 Conic quadratic case studies

4 The power cone
  4.1 The power cone(s)
  4.2 Sets representable using the power cone
  4.3 Power cone case studies

5 Exponential cone optimization
  5.1 Exponential cone
  5.2 Modeling with the exponential cone
  5.3 Geometric programming
  5.4 Exponential cone case studies

6 Semidefinite optimization
  6.1 Introduction to semidefinite matrices
  6.2 Semidefinite modeling
  6.3 Semidefinite optimization case studies

7 Practical optimization
  7.1 Conic reformulations
  7.2 Avoiding ill-posed problems
  7.3 Scaling
  7.4 The huge and the tiny
  7.5 Semidefinite variables
  7.6 The quality of a solution
  7.7 Distance to a cone

8 Duality in conic optimization
  8.1 Dual cone
  8.2 Infeasibility in conic optimization
  8.3 Lagrangian and the dual problem
  8.4 Weak and strong duality
  8.5 Applications of conic duality
  8.6 Semidefinite duality and LMIs

9 Mixed integer optimization
  9.1 Integer modeling
  9.2 Mixed integer conic case studies

10 Quadratic optimization
  10.1 Quadratic objective
  10.2 Quadratically constrained optimization
  10.3 Example: Factor model

11 Bibliographic notes

12 Notation and definitions

Bibliography

Index

Chapter 1

Preface

This cookbook is about model building using conic optimization. It is intended as a modeling guide for the MOSEK optimization package. However, the style is intentionally quite generic without specific MOSEK commands or API descriptions. There are several excellent books available on this topic, for example the books by Ben-Tal and Nemirovski [BenTalN01] and Boyd and Vandenberghe [BV04], which have both been a great source of inspiration for this manual. The purpose of this manual is to collect the material which we consider most relevant to our users and to present it in a practical self-contained manner; however, we highly recommend the books as a supplement to this manual. Some textbooks on building models using optimization (or mathematical programming) introduce various concepts through practical examples. In this manual we have chosen a different route, where we instead show the different sets and functions that can be modeled using convex optimization, which can subsequently be combined into realistic examples and applications. In other words, we present simple convex building blocks, which can then be combined into more elaborate convex models. We call this approach extremely disciplined modeling. With the advent of more expressive and sophisticated tools like conic optimization, we feel that this approach is better suited.

Content

We begin with a comprehensive chapter on linear optimization, including modeling examples, duality theory and infeasibility certificates for linear problems. Linear problems are optimization problems of the form

minimize   푐푇 푥
subject to 퐴푥 = 푏,
           푥 ≥ 0.

Conic optimization is a generalization of linear optimization which handles problems of the form:

minimize   푐푇 푥
subject to 퐴푥 = 푏,
           푥 ∈ 퐾,

where 퐾 is a convex cone. Various families of convex cones allow formulating different types of nonlinear constraints. The following chapters present modeling with four types of convex cones:

• quadratic cones,

• power cone,

• exponential cone,

• semidefinite cone.

It is “well-known” in the convex optimization community that this family of cones is sufficient to express almost all convex optimization problems appearing in practice. Next we discuss issues arising in practical optimization, and we wholeheartedly recommend this short chapter to all readers before moving on to implementing mathematical models with real data. Following that, we present a general duality and infeasibility theory for conic problems. Finally we diverge slightly from the topic of conic optimization and introduce the language of mixed-integer optimization and we discuss the relation between convex quadratic optimization and conic quadratic optimization.

Chapter 2

Linear optimization

In this chapter we discuss various aspects of linear optimization. We first introduce the basic concepts of linear optimization and discuss the underlying geometric interpretations. We then give examples of the most frequently used reformulations or modeling tricks used in linear optimization, and finally we discuss duality and infeasibility theory in some detail.

2.1 Introduction

2.1.1 Basic notions

The most basic type of optimization is linear optimization. In linear optimization we minimize a linear function given a set of linear constraints. For example, we may wish to minimize a linear function

푥1 + 2푥2 − 푥3

under the constraints that

푥1 + 푥2 + 푥3 = 1, 푥1, 푥2, 푥3 ≥ 0.

The function we minimize is often called the objective function; in this case we have a linear objective function. The constraints are also linear and consist of both linear equalities and inequalities. We typically use more compact notation

minimize   푥1 + 2푥2 − 푥3
subject to 푥1 + 푥2 + 푥3 = 1,    (2.1)
           푥1, 푥2, 푥3 ≥ 0,

and we call (2.1) a linear optimization problem. The domain where all constraints are satisfied is called the feasible set; the feasible set for (2.1) is shown in Fig. 2.1. For this simple problem we see by inspection that the optimal value of the problem is −1 obtained by the optimal solution

(푥1⋆, 푥2⋆, 푥3⋆) = (0, 0, 1).

Fig. 2.1: Feasible set for 푥1 + 푥2 + 푥3 = 1 and 푥1, 푥2, 푥3 ≥ 0.

Linear optimization problems are typically formulated using matrix notation. The standard form of a linear minimization problem is:

minimize   푐푇 푥
subject to 퐴푥 = 푏,    (2.2)
           푥 ≥ 0.

For example, we can pose (2.1) in this form with

푥 = (푥1, 푥2, 푥3)푇 ,  푐 = (1, 2, −1)푇 ,  퐴 = [1 1 1].

There are many other formulations for linear optimization problems; we can have different types of constraints,

퐴푥 = 푏,  퐴푥 ≥ 푏,  퐴푥 ≤ 푏,  푙푐 ≤ 퐴푥 ≤ 푢푐,

and different bounds on the variables

푙푥 ≤ 푥 ≤ 푢푥

or we may have no bounds on some 푥푖, in which case we say that 푥푖 is a free variable. All these formulations are equivalent in the sense that by simple linear transformations and introduction of auxiliary variables they represent the same set of problems. The important feature is that the objective function and the constraints are all linear in 푥.
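As an illustration, problem (2.1) can be handed to any LP solver in the standard form (2.2). The following minimal sketch uses SciPy's linprog routine; the choice of SciPy is our assumption here, since the cookbook itself is solver-agnostic:

    import numpy as np
    from scipy.optimize import linprog

    # Problem (2.1): minimize x1 + 2*x2 - x3  s.t.  x1 + x2 + x3 = 1, x >= 0
    c = np.array([1.0, 2.0, -1.0])
    A_eq = np.array([[1.0, 1.0, 1.0]])
    b_eq = np.array([1.0])

    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * 3)
    print(res.x, res.fun)   # expected: x* = (0, 0, 1) with optimal value -1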

2.1.2 Geometry of linear optimization

A hyperplane is a subset of R푛 defined as {푥 | 푎푇 (푥 − 푥0) = 0} or equivalently {푥 | 푎푇 푥 = 훾} with 푎푇 푥0 = 훾, see Fig. 2.2. Thus a linear constraint

퐴푥 = 푏

with 퐴 ∈ R푚×푛 represents an intersection of 푚 hyperplanes.

Fig. 2.2: The dashed line illustrates a hyperplane {푥 | 푎푇 푥 = 훾}.

Next, consider a point 푥 above the hyperplane in Fig. 2.3. Since 푥 − 푥0 forms an acute angle with 푎 we have that 푎푇 (푥 − 푥0) ≥ 0, or 푎푇 푥 ≥ 훾. The set {푥 | 푎푇 푥 ≥ 훾} is called a halfspace. Similarly the set {푥 | 푎푇 푥 ≤ 훾} forms another halfspace; in Fig. 2.3 it corresponds to the area below the dashed line.

Fig. 2.3: The grey area is the halfspace {푥 | 푎푇 푥 ≥ 훾}.

A set of linear inequalities

퐴푥 ≤ 푏

corresponds to an intersection of halfspaces and forms a polyhedron, see Fig. 2.4. The polyhedral description of the feasible set gives us a very intuitive interpretation of linear optimization, which is illustrated in Fig. 2.5. The dashed lines are normal to the objective 푐 = (−1, 1), and to minimize 푐푇 푥 we move as far as possible in the opposite direction of 푐, to the furthest position where a dashed line intersects the polyhedron; an optimal solution is therefore always either a vertex of the polyhedron, or an entire facet of the polyhedron may be optimal. The polyhedron shown in the figure is nonempty and bounded, but this is not always the case for polyhedra arising from linear inequalities in optimization problems. In such cases the optimization problem may be infeasible or unbounded, which we will discuss in detail in Sec. 2.3.

Fig. 2.4: A polyhedron formed as an intersection of halfspaces.

Fig. 2.5: Geometric interpretation of linear optimization. The optimal solution 푥⋆ is at a point where the normals to 푐 (the dashed lines) intersect the polyhedron.

2.2 Linear modeling

In this section we present useful reformulation techniques and standard tricks which allow constructing more complicated models using linear optimization. It is also a guide through the types of constraints which can be expressed using linear (in)equalities.

2.2.1 Maximum

The inequality 푡 ≥ max{푥1, . . . , 푥푛} is equivalent to a simultaneous sequence of 푛 inequalities

푡 ≥ 푥푖, 푖 = 1, . . . , 푛

and similarly 푡 ≤ min{푥1, . . . , 푥푛} is the same as

푡 ≤ 푥푖, 푖 = 1, . . . , 푛.

Of course the same reformulation applies if each 푥푖 is not a single variable but a linear expression. In particular, we can consider convex piecewise-linear functions 푓 : R푛 → R defined as the maximum of affine functions (see Fig. 2.6):

푓(푥) := max푖=1,...,푚 {푎푖푇 푥 + 푏푖}

Fig. 2.6: A convex piecewise-linear function (solid lines) of a single variable 푥. The function is defined as the maximum of 3 affine functions.

The epigraph 푓(푥) ≤ 푡 (see Sec. 12) has an equivalent formulation with 푚 inequalities:

푎푖푇 푥 + 푏푖 ≤ 푡, 푖 = 1, . . . , 푚.

Piecewise-linear functions have many uses in linear optimization; either we have a convex piecewise-linear formulation from the onset, or we may approximate a more complicated (nonlinear) problem using piecewise-linear approximations, although with modern nonlinear optimization software it is becoming both easier and more efficient to directly formulate and solve nonlinear problems without piecewise-linear approximations.
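To make the epigraph trick concrete, here is a small sketch (hypothetical data, SciPy assumed) that minimizes the maximum of three affine functions of a scalar 푥 by introducing the epigraph variable 푡:

    import numpy as np
    from scipy.optimize import linprog

    # minimize max_i (a_i*x + b_i) over scalar x:  min t  s.t.  a_i*x + b_i <= t
    a = np.array([-1.0, 0.5, 2.0])   # hypothetical slopes
    b = np.array([0.0, 1.0, -3.0])   # hypothetical intercepts

    # variables (x, t); constraints a_i*x - t <= -b_i
    c = np.array([0.0, 1.0])
    A_ub = np.column_stack([a, -np.ones(3)])
    b_ub = -b
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * 2)
    x, t = res.x
    print(x, t)   # t equals the minimum of the piecewise-linear function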

2.2.2 Absolute value

The absolute value of a scalar variable is a special case of maximum

|푥| := max{푥, −푥},

so we can model the epigraph |푥| ≤ 푡 using two inequalities

−푡 ≤ 푥 ≤ 푡.

2.2.3 The ℓ1 norm

All norms are convex functions, but the ℓ1 and ℓ∞ norms are of particular interest for linear optimization. The ℓ1 norm of a vector 푥 ∈ R푛 is defined as

‖푥‖1 := |푥1| + |푥2| + ··· + |푥푛|.

To model the epigraph

‖푥‖1 ≤ 푡,    (2.3)

we introduce the following system

|푥푖| ≤ 푧푖, 푖 = 1, . . . , 푛,   ∑︀푖 푧푖 = 푡,    (2.4)

with additional (auxiliary) variable 푧 ∈ R푛. Clearly (2.3) and (2.4) are equivalent, in the sense that they have the same projection onto the space of 푥 and 푡 variables. Therefore, we can model (2.3) using linear (in)equalities

−푧푖 ≤ 푥푖 ≤ 푧푖,   ∑︀푖 푧푖 = 푡,    (2.5)

with auxiliary variables 푧. Similarly, we can describe the epigraph of the norm of an affine function of 푥,

‖퐴푥 − 푏‖1 ≤ 푡

as

−푧푖 ≤ 푎푖푇 푥 − 푏푖 ≤ 푧푖,   ∑︀푖 푧푖 = 푡,

where 푎푖 is the 푖-th row of 퐴 (taken as a column-vector).

Example 2.1 (Basis pursuit). The ℓ1 norm is overwhelmingly popular as a convex approximation of the cardinality (i.e., number of nonzero elements) of a vector 푥. For example, suppose we are given an underdetermined linear system

퐴푥 = 푏

where 퐴 ∈ R푚×푛 and 푚 ≪ 푛. The basis pursuit problem

minimize   ‖푥‖1
subject to 퐴푥 = 푏,    (2.6)

uses the ℓ1 norm of 푥 as a heuristic for finding a sparse solution (one with many zero elements) to 퐴푥 = 푏, i.e., it aims to represent 푏 as a linear combination of few columns of 퐴. Using (2.5) we can pose the problem as a linear optimization problem,

minimize   푒푇 푧
subject to −푧 ≤ 푥 ≤ 푧,    (2.7)
           퐴푥 = 푏,

where 푒 = (1,..., 1)푇 .
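A small numerical sketch of (2.7), again with SciPy's LP solver and hypothetical data; the decision variables are stacked as (푥, 푧):

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(0)
    m, n = 5, 12
    A = rng.standard_normal((m, n))
    b = A @ np.array([1.5, 0, 0, -2.0] + [0] * (n - 4))   # b generated from a sparse x

    # variables v = (x, z); minimize e^T z  s.t.  -z <= x <= z,  A x = b
    c = np.concatenate([np.zeros(n), np.ones(n)])
    I = np.eye(n)
    A_ub = np.block([[ I, -I],     #  x - z <= 0
                     [-I, -I]])    # -x - z <= 0
    b_ub = np.zeros(2 * n)
    A_eq = np.hstack([A, np.zeros((m, n))])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b,
                  bounds=[(None, None)] * n + [(0, None)] * n)
    print(np.round(res.x[:n], 4))   # typically a sparse representation of b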

2.2.4 The ℓ∞ norm

The ℓ∞ norm of a vector 푥 ∈ R푛 is defined as

‖푥‖∞ := max푖=1,...,푛 |푥푖|,

which is another example of a simple piecewise-linear function. Using Sec. 2.2.2 we model

‖푥‖∞ ≤ 푡

as

−푡 ≤ 푥푖 ≤ 푡, 푖 = 1, . . . , 푛.

Again, we can also consider affine functions of 푥, i.e.,

‖퐴푥 − 푏‖∞ ≤ 푡,

which can be described as

−푡 ≤ 푎푖푇 푥 − 푏푖 ≤ 푡, 푖 = 1, . . . , 푛.

Example 2.2 (Dual norms). It is interesting to note that the ℓ1 and ℓ∞ norms are dual. For any norm ‖·‖ on R푛, the dual norm ‖·‖* is defined as

‖푥‖* = max{푥푇 푣 | ‖푣‖ ≤ 1}.

Let us verify that the dual of the ℓ∞ norm is the ℓ1 norm. Consider

‖푥‖*,∞ = max{푥푇 푣 | ‖푣‖∞ ≤ 1}.

Obviously the maximum is attained for

푣푖 = +1 if 푥푖 ≥ 0 and 푣푖 = −1 if 푥푖 < 0,

i.e., ‖푥‖*,∞ = ∑︀푖 |푥푖| = ‖푥‖1. Similarly, consider the dual of the ℓ1 norm,

‖푥‖*,1 = max{푥푇 푣 | ‖푣‖1 ≤ 1}.

To maximize 푥푇 푣 subject to |푣1| + ··· + |푣푛| ≤ 1 we simply pick the element of 푥 with largest absolute value, say |푥푘|, and set 푣푘 = ±1, so that ‖푥‖*,1 = |푥푘| = ‖푥‖∞. This illustrates a more general property of dual norms, namely that ‖푥‖** = ‖푥‖.

2.2.5 Homogenization

Consider the linear-fractional problem

minimize   (푎푇 푥 + 푏)/(푐푇 푥 + 푑)
subject to 푐푇 푥 + 푑 > 0,    (2.8)
           퐹 푥 = 푔.

Perhaps surprisingly, it can be turned into a linear problem if we homogenize the linear constraint, i.e. replace it with 퐹 푦 = 푔푧 for a single variable 푧 ∈ R. The full new optimization problem is

minimize   푎푇 푦 + 푏푧
subject to 푐푇 푦 + 푑푧 = 1,    (2.9)
           퐹 푦 = 푔푧,
           푧 ≥ 0.

If 푥 is a feasible point in (2.8) then 푧 = (푐푇 푥 + 푑)−1, 푦 = 푥푧 is feasible for (2.9) with the same objective value. Conversely, if (푦, 푧) is feasible for (2.9) then 푥 = 푦/푧 is feasible in (2.8) and has the same objective value, at least when 푧 ̸= 0. If 푧 = 0 and 푥 is any feasible point for (2.8) then 푥 + 푡푦, 푡 → +∞ is a sequence of solutions of (2.8) converging to the value of (2.9). We leave it for the reader to check those statements. In either case we showed an equivalence between the two problems. Note that, as the sketch of proof above suggests, the optimal value in (2.8) may not be attained, even though the one in the linear problem (2.9) always is. For example, consider a pair of problems constructed as above:

minimize   푥1/푥2              minimize   푦1
subject to 푥2 > 0,            subject to 푦1 + 푦2 = 푧,
           푥1 + 푥2 = 1.                  푦2 = 1,
                                         푧 ≥ 0.

Both have an optimal value of −1, but in the fractional problem on the left we can only approach it arbitrarily closely.
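The pair of problems above can be checked numerically. A minimal sketch (SciPy assumed) solves the homogenized linear problem (2.9) for this example; the variables are (푦1, 푦2, 푧):

    import numpy as np
    from scipy.optimize import linprog

    # Homogenization of: minimize x1/x2  s.t.  x2 > 0,  x1 + x2 = 1
    # (2.9): minimize y1  s.t.  y1 + y2 - z = 0,  y2 = 1,  z >= 0
    c = np.array([1.0, 0.0, 0.0])
    A_eq = np.array([[1.0, 1.0, -1.0],
                     [0.0, 1.0,  0.0]])
    b_eq = np.array([0.0, 1.0])
    bounds = [(None, None), (None, None), (0, None)]

    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    print(res.x, res.fun)   # optimum -1 attained at z = 0, so x = y/z is undefined;
                            # the fractional problem only approaches -1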

2.2.6 Sum of largest elements

Suppose 푥 ∈ R푛 and that 푚 is a positive integer. Consider the problem

minimize   푚푡 + ∑︀푖 푢푖
subject to 푢푖 + 푡 ≥ 푥푖, 푖 = 1, . . . , 푛,    (2.10)
           푢푖 ≥ 0, 푖 = 1, . . . , 푛,

with new variables 푡 ∈ R, 푢 ∈ R푛. It is easy to see that fixing a value for 푡 determines the rest of the solution. For the sake of simplifying notation let us assume for a moment that 푥 is sorted:

푥1 ≥ 푥2 ≥ · · · ≥ 푥푛.

If 푡 ∈ [푥푘+1, 푥푘] then 푢푙 = 0 for 푙 ≥ 푘 + 1 and 푢푙 = 푥푙 − 푡 for 푙 ≤ 푘 in the optimal solution. Therefore, the objective value under the assumption 푡 ∈ [푥푘+1, 푥푘] is

obj푡 = 푥1 + ··· + 푥푘 + 푡(푚 − 푘)

which is a linear function minimized at one of the endpoints of [푥푘+1, 푥푘]. Now we can compute

obj푥푘+1 − obj푥푘 = (푘 − 푚)(푥푘 − 푥푘+1).

It follows that obj푥푘 has a minimum for 푘 = 푚, and therefore the optimum value of (2.10) is simply

푥1 + ··· + 푥푚.

Since the assumption that 푥 is sorted was only a notational convenience, we conclude that in general the optimization model (2.10) computes the sum of 푚 largest entries in 푥. In Sec. 2.4 we will show a conceptual way of deriving this model.
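A quick numerical check (SciPy assumed) that the LP (2.10) indeed returns the sum of the 푚 largest entries of a fixed vector 푥:

    import numpy as np
    from scipy.optimize import linprog

    x = np.array([3.1, -0.4, 2.2, 5.0, 1.7, 0.9])
    m, n = 3, len(x)

    # variables (t, u_1..u_n); minimize m*t + sum(u)  s.t.  u_i + t >= x_i,  u >= 0
    c = np.concatenate([[m], np.ones(n)])
    A_ub = np.hstack([-np.ones((n, 1)), -np.eye(n)])   # -t - u_i <= -x_i
    b_ub = -x
    bounds = [(None, None)] + [(0, None)] * n

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    print(res.fun, np.sort(x)[-m:].sum())   # both equal 10.3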

2.3 Infeasibility in linear optimization

In this section we discuss the basic theory of primal infeasibility certificates for linear problems. These ideas will be developed further after we have introduced duality in the next section. One of the first issues one faces when presented with an optimization problem is whether it has any solutions at all. As we discussed previously, for a linear optimization problem

minimize   푐푇 푥
subject to 퐴푥 = 푏,    (2.11)
           푥 ≥ 0,

the feasible set

ℱ푝 = {푥 ∈ R푛 | 퐴푥 = 푏, 푥 ≥ 0}

is a convex polytope. We say the problem is feasible if ℱ푝 ̸= ∅ and infeasible otherwise.

Example 2.3 (Linear infeasible problem). Consider the optimization problem:

minimize   2푥1 + 3푥2 − 푥3
subject to 푥1 + 푥2 + 2푥3 = 1,
           −2푥1 − 푥2 + 푥3 = −0.5,
           −푥1 + 5푥3 = −0.1,
           푥푖 ≥ 0.

This problem is infeasible. We see it by taking a linear combination of the constraints with coefficients 푦 = (1, 2, −1)푇 :

푥1 + 푥2 + 2푥3 = 1        / · 1
−2푥1 − 푥2 + 푥3 = −0.5    / · 2
−푥1 + 5푥3 = −0.1         / · (−1)
−2푥1 − 푥2 − 푥3 = 0.1.

This clearly proves infeasibility: the left-hand side is negative and the right-hand side is positive, which is impossible.
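The certificate can be verified mechanically: with 퐴 and 푏 from the example and 푦 = (1, 2, −1)푇 one checks that 퐴푇 푦 ≤ 0 while 푏푇 푦 > 0. A small NumPy sketch:

    import numpy as np

    A = np.array([[ 1.0,  1.0, 2.0],
                  [-2.0, -1.0, 1.0],
                  [-1.0,  0.0, 5.0]])
    b = np.array([1.0, -0.5, -0.1])
    y = np.array([1.0, 2.0, -1.0])

    print(A.T @ y)   # [-2. -1. -1.]  -> all entries <= 0
    print(b @ y)     # 0.1            -> strictly positive, so Ax = b, x >= 0 is infeasible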

2.3.1 Farkas’ lemma

In the last example we proved infeasibility of the linear system by exhibiting an explicit linear combination of the equations, such that the right-hand side (constant) is positive while on the left-hand side all coefficients are negative or zero. In matrix notation, such a linear combination is given by a vector 푦 such that 퐴푇 푦 ≤ 0 and 푏푇 푦 > 0. The next lemma shows that infeasibility of (2.11) is equivalent to the existence of such a vector.

Lemma 2.1 (Farkas’ lemma). Given 퐴 and 푏 as in (2.11), exactly one of the two statements is true:

1. There exists 푥 ≥ 0 such that 퐴푥 = 푏.

2. There exists 푦 such that 퐴푇 푦 ≤ 0 and 푏푇 푦 > 0.

Proof. Let 푎1, . . . , 푎푛 be the columns of 퐴. The set {퐴푥 | 푥 ≥ 0} is a closed convex cone spanned by 푎1, . . . , 푎푛. If this cone contains 푏 then we have the first alternative. Otherwise the cone can be separated from the point 푏 by a hyperplane passing through 0, i.e. there exists 푦 such that 푦푇 푏 > 0 and 푦푇 푎푖 ≤ 0 for all 푖. This is equivalent to the second alternative. Finally, 1. and 2. are mutually exclusive, since otherwise we would have

0 < 푦푇 푏 = 푦푇 퐴푥 = (퐴푇 푦)푇 푥 ≤ 0.

Farkas’ lemma implies that either the problem (2.11) is feasible or there is a certificate of infeasibility 푦. In other words, every time we classify a model as infeasible, we can certify this fact by providing an appropriate 푦, as in Example 2.3.

2.3.2 Locating infeasibility

As we already discussed, the infeasibility certificate 푦 gives coefficients of a linear combination of the constraints which is infeasible “in an obvious way”, that is positive on one side and negative on the other. In some cases, 푦 may be very sparse, i.e. it may have very few nonzeros, which means that already a very small subset of the constraints is the root cause of infeasibility. This may be interesting if, for example, we are debugging a large model which we expected to be feasible and infeasibility is caused by an error in the problem formulation. Then we only have to consider the sub-problem formed by constraints with index set {푗 | 푦푗 ̸= 0}.

Example 2.4 (All constraints involved in infeasibility). As a cautionary note consider the constraints

0 ≤ 푥1 ≤ 푥2 ≤ · · · ≤ 푥푛 ≤ −1.

Any problem with those constraints is infeasible, but dropping any one of the inequalities creates a feasible subproblem.

2.4 Duality in linear optimization

Duality is a rich and powerful theory, central to understanding infeasibility and sensitivity issues in linear optimization. In this section we only discuss duality in linear optimization at a descriptive level suited for practitioners; we refer to Sec.8 for a more in-depth discussion of duality for general conic problems.

2.4.1 The dual problem

Primal problem

We consider as always a linear optimization problem in standard form:

minimize   푐푇 푥
subject to 퐴푥 = 푏,    (2.12)
           푥 ≥ 0.

We denote the optimal objective value in (2.12) by 푝⋆. There are three possibilities:

• The problem is infeasible. By convention 푝⋆ = +∞.

• 푝⋆ is finite, in which case the problem has an optimal solution.

• 푝⋆ = −∞, meaning that there are feasible solutions with 푐푇 푥 decreasing to −∞, in which case we say the problem is unbounded.

Lagrange function

We associate with (2.12) a so-called Lagrangian function 퐿 : R푛 × R푚 × R푛+ → R that augments the objective with a weighted combination of all the constraints,

퐿(푥, 푦, 푠) = 푐푇 푥 + 푦푇 (푏 − 퐴푥) − 푠푇 푥.

The variables 푦 ∈ R푚 and 푠 ∈ R푛+ are called Lagrange multipliers or dual variables. For any feasible 푥* ∈ ℱ푝 and any (푦*, 푠*) ∈ R푚 × R푛+ we have

퐿(푥*, 푦*, 푠*) = 푐푇 푥* + (푦*)푇 · 0 − (푠*)푇 푥* ≤ 푐푇 푥*.

Note that we used the nonnegativity of 푠*, or in general of any Lagrange multiplier associated with an inequality constraint. The dual function is defined as the minimum of 퐿(푥, 푦, 푠) over 푥. Thus the dual function of (2.12) is

푔(푦, 푠) = min푥 퐿(푥, 푦, 푠) = min푥 푥푇 (푐 − 퐴푇 푦 − 푠) + 푏푇 푦 = 푏푇 푦 if 푐 − 퐴푇 푦 − 푠 = 0, and −∞ otherwise.

Dual problem

For every (푦, 푠) the value of 푔(푦, 푠) is a lower bound for 푝⋆. To get the best such bound we maximize 푔(푦, 푠) over all (푦, 푠) and get the dual problem:

maximize   푏푇 푦
subject to 푐 − 퐴푇 푦 = 푠,    (2.13)
           푠 ≥ 0.

The optimal value of (2.13) will be denoted 푑⋆. As in the case of (2.12) (which from now on we call the primal problem), the dual problem can be infeasible (푑⋆ = −∞), have an optimal solution (−∞ < 푑⋆ < +∞) or be unbounded (푑⋆ = +∞). Note that the roles of −∞ and +∞ are now reversed because the dual is a maximization problem.

Example 2.5 (Dual of basis pursuit). As an example, let us derive the dual of the basis pursuit formulation (2.7). It would be possible to add auxiliary variables and constraints to force that problem into the standard form (2.12) and then just apply the dual transformation as a black box, but it is both easier and more instructive to directly write the Lagrangian:

퐿(푥, 푧, 푦, 푢, 푣) = 푒푇 푧 + 푢푇 (푥 − 푧) − 푣푇 (푥 + 푧) + 푦푇 (푏 − 퐴푥)

where 푒 = (1, . . . , 1)푇 , with Lagrange multipliers 푦 ∈ R푚 and 푢, 푣 ∈ R푛+. The dual function

푔(푦, 푢, 푣) = min푥,푧 퐿(푥, 푧, 푦, 푢, 푣) = min푥,푧 푧푇 (푒 − 푢 − 푣) + 푥푇 (푢 − 푣 − 퐴푇 푦) + 푦푇 푏

is only bounded below if 푒 = 푢 + 푣 and 퐴푇 푦 = 푢 − 푣, hence the dual problem is

maximize   푏푇 푦
subject to 푒 = 푢 + 푣,    (2.14)
           퐴푇 푦 = 푢 − 푣,
           푢, 푣 ≥ 0.

It is not hard to observe that an equivalent formulation of (2.14) is simply

maximize   푏푇 푦
subject to ‖퐴푇 푦‖∞ ≤ 1,    (2.15)

which should be associated with duality between norms discussed in Example 2.2.
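Continuing the numerical sketch from Example 2.1, one can solve (2.15) as an LP (the constraint ‖퐴푇 푦‖∞ ≤ 1 is just −1 ≤ 퐴푇 푦 ≤ 1) and observe that its optimal value coincides with that of the primal problem (2.7), anticipating the duality results of the next subsections; SciPy assumed, data hypothetical:

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(0)
    m, n = 5, 12
    A = rng.standard_normal((m, n))
    b = A @ np.array([1.5, 0, 0, -2.0] + [0] * (n - 4))

    # Dual (2.15): maximize b^T y  s.t.  -1 <= A^T y <= 1
    c = -b                                  # linprog minimizes, so negate
    A_ub = np.vstack([A.T, -A.T])
    b_ub = np.ones(2 * n)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * m)
    print(-res.fun)   # equals the optimal value of the primal problem (2.7)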

Example 2.6 (Dual of a maximization problem). We can similarly derive the dual of problem (2.13). If we write it simply as

maximize   푏푇 푦
subject to 푐 − 퐴푇 푦 ≥ 0,

then the Lagrangian is

퐿(푦, 푢) = 푏푇 푦 + 푢푇 (푐 − 퐴푇 푦) = 푦푇 (푏 − 퐴푢) + 푐푇 푢

with 푢 ∈ R푛+, so that now 퐿(푦, 푢) ≥ 푏푇 푦 for any feasible 푦. Calculating min푢 max푦 퐿(푦, 푢) is now equivalent to the problem

minimize   푐푇 푢
subject to 퐴푢 = 푏,
           푢 ≥ 0,

so, as expected, the dual of the dual recovers the original primal problem.

2.4.2 Weak and strong duality

Suppose 푥* and (푦*, 푠*) are feasible points for the primal and dual problems (2.12) and (2.13), respectively. Then we have

푏푇 푦* = (퐴푥*)푇 푦* = (푥*)푇 (퐴푇 푦*) = (푥*)푇 (푐 − 푠*) = 푐푇 푥* − (푠*)푇 푥* ≤ 푐푇 푥*,

so the dual objective value is a lower bound on the objective value of the primal. In particular, any dual feasible point (푦*, 푠*) gives a lower bound:

푏푇 푦* ≤ 푝⋆

and we immediately get the next lemma.

Lemma 2.2 (Weak duality). 푑⋆ ≤ 푝⋆.

It follows that if 푏푇 푦* = 푐푇 푥* then 푥* is optimal for the primal, (푦*, 푠*) is optimal for the dual and 푏푇 푦* = 푐푇 푥* is the common optimal objective value. This way we can use the optimal dual solution to certify optimality of the primal solution and vice versa. The remarkable property of linear optimization is that 푑⋆ = 푝⋆ holds in the most interesting scenario when the primal problem is feasible and bounded. It means that the certificate of optimality mentioned in the previous paragraph always exists.

Lemma 2.3 (Strong duality). If at least one of 푑⋆, 푝⋆ is finite then 푑⋆ = 푝⋆.

Proof. Suppose −∞ < 푝⋆ < ∞; the proof in the dual case is analogous. For any 휀 > 0 consider the feasibility problem with variable 푥 ≥ 0 and constraints

[−푐푇 ; 퐴] 푥 = [−푝⋆ + 휀 ; 푏],   that is   푐푇 푥 = 푝⋆ − 휀, 퐴푥 = 푏.

Optimality of 푝⋆ implies that the above problem is infeasible. By Lemma 2.1 there exists 푦ˆ = [푦0 푦]푇 such that

[−푐, 퐴푇 ] 푦ˆ ≤ 0 and [−푝⋆ + 휀, 푏푇 ] 푦ˆ > 0.

If 푦0 = 0 then 퐴푇 푦 ≤ 0 and 푏푇 푦 > 0, which by Lemma 2.1 again would mean that the original primal problem was infeasible, which is not the case. Hence we can rescale so that 푦0 = 1 and then we get

푐 − 퐴푇 푦 ≥ 0 and 푏푇 푦 ≥ 푝⋆ − 휀.

The first inequality above implies that 푦 is feasible for the dual problem. By letting 휀 → 0 we obtain 푑⋆ ≥ 푝⋆.

We can exploit strong duality to freely choose between solving the primal or dual version of any linear problem.

Example 2.7 (Sum of largest elements). Suppose that 푥 is now a constant vector. Consider the following problem with variable 푧:

maximize   푥푇 푧
subject to ∑︀푖 푧푖 = 푚,
           0 ≤ 푧 ≤ 1.

The maximum is attained when 푧 indicates the positions of the 푚 largest entries in 푥, and the objective value is then their sum. This formulation, however, cannot be used when 푥 is another variable, since then the objective function is no longer linear. Let us derive the dual problem. The Lagrangian is

퐿(푧, 푠, 푡, 푢) = 푥푇 푧 + 푡(푚 − 푒푇 푧) + 푠푇 푧 + 푢푇 (푒 − 푧) = 푧푇 (푥 − 푡푒 + 푠 − 푢) + 푡푚 + 푢푇 푒

with 푢, 푠 ≥ 0. Since 푠푖 ≥ 0 is arbitrary and not otherwise constrained, the equality 푥푖 − 푡 + 푠푖 − 푢푖 = 0 is the same as 푢푖 + 푡 ≥ 푥푖 and for the dual problem we get

minimize   푚푡 + ∑︀푖 푢푖
subject to 푢푖 + 푡 ≥ 푥푖, 푖 = 1, . . . , 푛,
           푢푖 ≥ 0, 푖 = 1, . . . , 푛,

which is exactly the problem (2.10) we studied in Sec. 2.2.6. Strong duality now implies that (2.10) computes the sum of the 푚 biggest entries in 푥.

2.4.3 Duality and infeasibility: summary

We can now expand the discussion of infeasibility certificates in the context of duality. Farkas’ lemma (Lemma 2.1) can be dualized and the two versions can be summarized as follows:

Lemma 2.4 (Primal and dual Farkas’ lemma). For a primal-dual pair of linear problems we have the following equivalences:

i. The primal problem (2.12) is infeasible if and only if there is 푦 such that 퐴푇 푦 ≤ 0 and 푏푇 푦 > 0.

ii. The dual problem (2.13) is infeasible if and only if there is 푥 ≥ 0 such that 퐴푥 = 0 and 푐푇 푥 < 0.

Weak and strong duality for linear optimization now lead to the following conclusions:

• If the problem is primal feasible and has finite objective value (−∞ < 푝⋆ < ∞) then so is the dual and 푑⋆ = 푝⋆. We sometimes refer to this case as primal and dual feasible. The dual solution certifies the optimality of the primal solution and vice versa.

• If the primal problem is feasible but unbounded (푝⋆ = −∞) then the dual is infeasible (푑⋆ = −∞). Part (ii) of Farkas’ lemma provides a certificate of this fact, that is a vector 푥 with 푥 ≥ 0, 퐴푥 = 0 and 푐푇 푥 < 0. In fact it is easy to give this statement a geometric interpretation. If 푥0 is any primal feasible point then the infinite ray

푡 → 푥0 + 푡푥, 푡 ∈ [0, ∞)

belongs to the feasible set ℱ푝 because 퐴(푥0 + 푡푥) = 푏 and 푥0 + 푡푥 ≥ 0. Along this ray the objective value is unbounded below:

푐푇 (푥0 + 푡푥) = 푐푇 푥0 + 푡(푐푇 푥) → −∞.

• If the primal problem is infeasible (푝⋆ = ∞) then a certificate of this fact is provided by part (i). The dual problem may be unbounded (푑⋆ = ∞) or infeasible (푑⋆ = −∞).

Example 2.8 (Primal-dual infeasibility). Weak and strong duality imply that the only case when 푑⋆ ̸= 푝⋆ is when both the primal and the dual problem are infeasible (푑⋆ = −∞, 푝⋆ = ∞), for example:

minimize   푥
subject to 0 · 푥 = 1.

2.4.4 Dual values as shadow prices

Dual values are related to shadow prices, as they measure, under some nondegeneracy assumption, the sensitivity of the objective value to a change in the constraint. Consider again the primal and dual problem pair (2.12) and (2.13) with feasible sets ℱ푝 and ℱ푑 and with a primal-dual optimal solution (푥*, 푦*, 푠*).

Suppose we change one of the values in 푏 from 푏푖 to 푏푖′. This corresponds to moving one of the hyperplanes defining ℱ푝, and in consequence the optimal solution (and the objective value) may change. On the other hand, the dual feasible set ℱ푑 is not affected. Assuming that the solution (푦*, 푠*) was a unique vertex of ℱ푑 this point remains optimal for the dual after a sufficiently small change of 푏. But then the change of the dual objective is

푦푖*(푏푖′ − 푏푖)

and by strong duality the primal objective changes by the same amount.

Example 2.9 (Student diet). An optimization student wants to save money on the diet while remaining healthy. A healthy diet requires at least 푃 = 6 units of protein, 퐶 = 15 units of carbohydrates, 퐹 = 5 units of fats and 푉 = 7 units of vitamins. The student can choose from the following products:

            P    C    F    V    price
takeaway    3    3    2    1    5
vegetables  1    2    0    4    1
bread       0.5  4    1    0    2

The problem of minimizing cost while meeting dietary requirements is

minimize   5푥1 + 푥2 + 2푥3
subject to 3푥1 + 푥2 + 0.5푥3 ≥ 6,
           3푥1 + 2푥2 + 4푥3 ≥ 15,
           2푥1 + 푥3 ≥ 5,
           푥1 + 4푥2 ≥ 7,
           푥1, 푥2, 푥3 ≥ 0.

If 푦1, 푦2, 푦3, 푦4 are the dual variables associated with the four inequality constraints then the (unique) primal-dual optimal solution to this problem is approximately:

(푥, 푦) = ((1, 1.5, 3), (0.42, 0, 1.78, 0.14))

with optimal cost 푝⋆ = 12.5. Note 푦2 = 0 indicates that the second constraint is not binding. Indeed, we could increase 퐶 to 18 without affecting the optimal solution. The remaining constraints are binding. Improving the intake of protein by 1 unit (increasing 푃 to 7) will increase the cost by 0.42, while doing the same for fat will cost an extra 1.78 per unit. If the student had extra money to improve one of the parameters then the best choice would be to increase the intake of vitamins, with a shadow price of just 0.14. If one month the student only had 12 units of money and was willing to relax one of the requirements then the best choice is to save on fats: the necessary reduction of 퐹 is smallest, namely 0.5 · 1.78⁻¹ = 0.28. Indeed, with the new value of 퐹 = 4.72 the same problem solves to 푝⋆ = 12 and 푥 = (1.08, 1.48, 2.56). We stress that a truly balanced diet problem should also include upper bounds.
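A sketch of the student diet model in code (SciPy assumed). With the HiGHS-based solvers, recent SciPy versions also report the dual values of the constraints, which are exactly the shadow prices discussed above; the attribute name below is a SciPy detail and an assumption about the installed version, not part of this cookbook:

    import numpy as np
    from scipy.optimize import linprog

    c = np.array([5.0, 1.0, 2.0])              # prices: takeaway, vegetables, bread
    N = np.array([[3.0, 1.0, 0.5],             # protein per unit of each product
                  [3.0, 2.0, 4.0],             # carbohydrates
                  [2.0, 0.0, 1.0],             # fats
                  [1.0, 4.0, 0.0]])            # vitamins
    req = np.array([6.0, 15.0, 5.0, 7.0])      # requirements P, C, F, V

    # linprog uses <= constraints, so flip the signs of N x >= req
    res = linprog(c, A_ub=-N, b_ub=-req, bounds=[(0, None)] * 3, method="highs")
    print(res.x, res.fun)               # approx. (1, 1.5, 3) and cost 12.5
    print(-res.ineqlin.marginals)       # approx. (0.42, 0, 1.78, 0.14): shadow prices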

Chapter 3

Conic quadratic optimization

This chapter extends the notion of linear optimization with quadratic cones. Conic quadratic optimization, also known as second-order cone optimization, is a straightforward generalization of linear optimization, in the sense that we optimize a linear function under linear (in)equalities with some variables belonging to one or more (rotated) quadratic cones. We discuss the basic concept of quadratic cones, and demonstrate the surprisingly large flexibility of conic quadratic modeling.

3.1 Cones

Since this is the first place where we introduce a non-linear cone, it seems suitable to make our most important definition: A set 퐾 ⊆ R푛 is called a convex cone if

• for every 푥, 푦 ∈ 퐾 we have 푥 + 푦 ∈ 퐾,

• for every 푥 ∈ 퐾 and 훼 ≥ 0 we have 훼푥 ∈ 퐾.

For example a linear subspace of R푛, the positive orthant R푛≥0 or any ray (half-line) starting at the origin are examples of convex cones. We leave it for the reader to check that the intersection of convex cones is a convex cone; this property enables us to assemble complicated optimization models from individual conic bricks.

3.1.1 Quadratic cones

We define the 푛-dimensional quadratic cone as

풬푛 = {푥 ∈ R푛 | 푥1 ≥ √(푥2² + 푥3² + ··· + 푥푛²)}.    (3.1)

The geometric interpretation of a quadratic (or second-order) cone is shown in Fig. 3.1 for a cone with three variables, and illustrates how the boundary of the cone resembles an ice-cream cone. The 1-dimensional quadratic cone simply states nonnegativity 푥1 ≥ 0.

Fig. 3.1: Boundary of the quadratic cone 푥1 ≥ √(푥2² + 푥3²) and of the rotated quadratic cone 2푥1푥2 ≥ 푥3², 푥1, 푥2 ≥ 0.

3.1.2 Rotated quadratic cones

An 푛-dimensional rotated quadratic cone is defined as

풬푟푛 = {푥 ∈ R푛 | 2푥1푥2 ≥ 푥3² + ··· + 푥푛², 푥1, 푥2 ≥ 0}.    (3.2)

As the name indicates, there is a simple relationship between quadratic and rotated quadratic cones. Define an orthogonal transformation

푇푛 := [ 1/√2    1/√2    0
        1/√2   −1/√2    0
          0       0    퐼푛−2 ].    (3.3)

Then it is easy to verify that

푥 ∈ 풬푛 ⇐⇒ 푇푛푥 ∈ 풬푟푛,

and since 푇푛 is orthogonal we call 풬푟푛 a rotated cone; the transformation corresponds to a rotation of 휋/4 in the (푥1, 푥2) plane. For example if 푥 ∈ 풬³ and

(푧1, 푧2, 푧3) = 푇3푥 = ((푥1 + 푥2)/√2, (푥1 − 푥2)/√2, 푥3)

then

2푧1푧2 ≥ 푧3², 푧1, 푧2 ≥ 0 =⇒ (푥1² − 푥2²) ≥ 푥3², 푥1 ≥ 0,

and similarly we see that

푥1² ≥ 푥2² + 푥3², 푥1 ≥ 0 =⇒ 2푧1푧2 ≥ 푧3², 푧1, 푧2 ≥ 0.

Thus, one could argue that we only need the quadratic cones 풬푛, but there are many examples where using an explicit rotated quadratic cone 풬푟푛 is more natural, as we will see next.

3.2 Conic quadratic modeling

In the following we describe several convex sets that can be modeled using conic quadratic formulations or, as we call them, are conic quadratic representable.

3.2.1 Absolute values

In Sec. 2.2.2 we saw how to model |푥| ≤ 푡 using two linear inequalities, but in fact the epigraph of the absolute value is just the definition of a two-dimensional quadratic cone, i.e.,

|푥| ≤ 푡 ⇐⇒ (푡, 푥) ∈ 풬².

3.2.2 Euclidean norms

The Euclidean norm of 푥 ∈ R푛,

‖푥‖2 = √(푥1² + 푥2² + ··· + 푥푛²),

essentially defines the quadratic cone, i.e.,

‖푥‖2 ≤ 푡 ⇐⇒ (푡, 푥) ∈ 풬푛+1.

The epigraph of the squared Euclidean norm can be described as the intersection of a rotated quadratic cone with an affine hyperplane,

푥1² + ··· + 푥푛² = ‖푥‖2² ≤ 푡 ⇐⇒ (1/2, 푡, 푥) ∈ 풬푟푛+2.

3.2.3 Convex quadratic sets

Assume 푄 ∈ R푛×푛 is a symmetric positive semidefinite matrix. The convex inequality

(1/2)푥푇 푄푥 + 푐푇 푥 + 푟 ≤ 0

may be rewritten as

푡 + 푐푇 푥 + 푟 = 0,    (3.4)
푥푇 푄푥 ≤ 2푡.

Since 푄 is symmetric positive semidefinite the epigraph

푥푇 푄푥 ≤ 2푡    (3.5)

is a convex set and there exists a matrix 퐹 ∈ R푘×푛 such that

푄 = 퐹 푇 퐹    (3.6)

(see Sec. 6 for properties of semidefinite matrices). For instance 퐹 could be the Cholesky factorization of 푄. Then

푥푇 푄푥 = 푥푇 퐹 푇 퐹 푥 = ‖퐹 푥‖2²

and we have an equivalent characterization of (3.5) as

(1/2)푥푇 푄푥 ≤ 푡 ⇐⇒ (푡, 1, 퐹 푥) ∈ 풬푟2+푘.

Frequently 푄 has the structure

푄 = 퐼 + 퐹 푇 퐹

where 퐼 is the identity matrix, so

푥푇 푄푥 = 푥푇 푥 + 푥푇 퐹 푇 퐹 푥 = ‖푥‖2² + ‖퐹 푥‖2²

and hence

(푓, 1, 푥) ∈ 풬푟2+푛, (ℎ, 1, 퐹 푥) ∈ 풬푟2+푘, 푓 + ℎ = 푡

is a conic quadratic representation of (3.5) in this case.

푇 퐴푥 + 푏 2 푐 푥 + 푑 (3.7) ‖ ‖ ≤ where 퐴 R푚×푛 and 푐 R푛. The formulation (3.7) is simply ∈ ∈ (푐푇 푥 + 푑, 퐴푥 + 푏) 푚+1 (3.8) ∈ 풬 or equivalently

푠 = 퐴푥 + 푏, 푡 = 푐푇 푥 + 푑, (3.9) (푡, 푠) 푚+1. ∈ 풬 As will be explained in Sec.8, we refer to (3.8) as the dual form and (3.9) as the primal form. An alternative characterization of (3.7) is

퐴푥 + 푏 2 (푐푇 푥 + 푑)2 0, 푐푇 푥 + 푑 0 (3.10) ‖ ‖2 − ≤ ≥ which shows that certain quadratic inequalities are conic quadratic representable.

23 3.2.5 Simple sets involving power functions Some power-like inequalities are conic quadratic representable, even though it need not be obvious at first glance. For example, we have

푡 √푥, 푥 0 (푥, 1/2, 푡) 3, | | ≤ ≥ ⇐⇒ ∈ 풬푟 or in a similar fashion 1 푡 , 푥 0 (푥, 푡, √2) 3. ≥ 푥 ≥ ⇐⇒ ∈ 풬푟 For a more complicated example, consider the constraint

푡 푥3/2, 푥 0. ≥ ≥ This is equivalent to a statement involving two cones and an extra variable

(푠, 푡, 푥), (푥, 1/8, 푠) 3 ∈ 풬푟 because 1 1 2푠푡 푥2, 2 푥 푠2, = 4푠2푡2 푥 푥4 푠2 = 푡 푥3/2. ≥ · 8 ≥ ⇒ · 4 ≥ · ⇒ ≥ In practice power-like inequalities representable with similar tricks can often be expressed much more naturally using the power cone (see Sec.4), so we will not dwell on these examples much longer.

3.2.6 Rational functions

푎푥+푏 A general non-degenerate rational function of one variable 푓(푥) = 푔푥+ℎ can always be 푞 written in the form 푓(푥) = 푝 + 푔푥+ℎ , and when 푞 > 0 it is convex on the set where 푔푥 + ℎ > 0. The conic quadratic model of 푞 푡 푝 + , 푔푥 + ℎ > 0, 푞 > 0 ≥ 푔푥 + ℎ with variables 푡, 푥 can be written as: √︀ (푡 푝, 푔푥 + ℎ, 2푞) 3. − ∈ 풬푟 푎푥2+푏푥+푐 We can generalize it to a quadratic-over-linear function 푓(푥) = 푔푥+ℎ . By performing 푞 long polynomial division we can write it as 푓(푥) = 푟푥 + 푠 + 푔푥+ℎ , and as before it is convex when 푞 > 0 on the set where 푔푥 + ℎ > 0. In particular 푞 푡 푟푥 + 푠 + , 푔푥 + ℎ > 0, 푞 > 0 ≥ 푔푥 + ℎ can be written as √︀ (푡 푟푥 푠, 푔푥 + ℎ, 2푞) 3. − − ∈ 풬푟 In both cases the argument boils down to the observation that the target function is a sum of an affine expression and the inverse of a positive affine expression, see Sec. 3.2.5.

24 3.2.7 Harmonic mean Consider next the hypograph of the harmonic mean,

(︃ 푛 )︃−1 1 ∑︁ 푥−1 푡 0, 푥 0. 푛 푖 ≥ ≥ ≥ 푖=1

It is not obvious that the inequality defines a convex set, nor that it is conic quadratic representable. However, we can rewrite it in the form

푛 ∑︁ 푡2 푛푡, 푥 ≤ 푖=1 푖 which suggests the conic representation:

푛 2 ∑︁ 2푥푖푧푖 푡 , 푥푖, 푧푖 0, 2 푧푖 = 푛푡. (3.11) ≥ ≥ 푖=1

Alternatively, in case 푛 = 2, the hypograph of the harmonic mean can be represented by a single quadratic cone:

4 (푥1 + 푥2 푡/2, 푥1, 푥2, 푡/2) . (3.12) − ∈ 풬

As this representation generalizes the harmonic mean to all points 푥1 +푥2 0, as implied ≥ by (3.12), the variable bounds 푥 0 and 푡 0 have to be added seperately if so desired. ≥ ≥ 3.2.8 Quadratic forms with one negative eigenvalue

Assume that 퐴 R푛×푛 is a symmetric matrix with exactly one negative eigenvalue, i.e., ∈ 퐴 has a spectral factorization (i.e., eigenvalue decomposition)

푛 ∑︁ 퐴 = 푄Λ푄푇 = 훼 푞 푞푇 + 훼 푞 푞푇 , − 1 1 1 푖 푖 푖 푖=2

푇 where 푄 푄 = 퐼, Λ = Diag( 훼1, 훼2, . . . , 훼푛), 훼푖 0. Then − ≥ 푥푇 퐴푥 0 ≤ is equivalent to

푛 ∑︁ 푇 2 푇 2 훼푗(푞푗 푥) 훼1(푞1 푥) . (3.13) ≤ 푗=2

This shows 푥푇 퐴푥 0 to be a union of two convex subsets, each representable by a ≤ quadratic cone. For example, assuming 푞푇 푥 0, we can characterize (3.13) as 1 ≥ 푇 푇 푇 푛 (√훼1푞 푥, √훼2푞 푥, . . . , √훼푛푞 푥) . (3.14) 1 2 푛 ∈ 풬

25 3.2.9 Ellipsoidal sets The set

푛 = 푥 R 푃 (푥 푐) 2 1 ℰ { ∈ | ‖ − ‖ ≤ } describes an ellipsoid centred at 푐. It has a natural conic quadratic representation, i.e., 푥 if and only if ∈ ℰ 푥 (1, 푃 (푥 푐)) 푛+1. ∈ ℰ ⇐⇒ − ∈ 풬 3.3 Conic quadratic case studies

3.3.1 Quadratically constrained quadratic optimization A general convex quadratically constrained quadratic optimization problem can be writ- ten as 푇 푇 minimize (1/2)푥 푄0푥 + 푐0 푥 + 푟0 푇 푇 (3.15) subject to (1/2)푥 푄푖푥 + 푐 푥 + 푟푖 0, 푖 = 1, . . . , 푝, 푖 ≤ 푛×푛 where all 푄푖 R are symmetric positive semidefinite. Let ∈ 푇 푄푖 = 퐹푖 퐹푖, 푖 = 0, . . . , 푝,

푘푖×푛 where 퐹푖 R . Using the formulations in Sec. 3.2.3 we then get an equivalent conic ∈ quadratic problem 푇 minimize 푡0 + 푐0 푥 + 푟0 푇 subject to 푡푖 + 푐푖 푥 + 푟푖 = 0, 푖 = 1, . . . , 푝, (3.16) (푡 , 1, 퐹 푥) 푘푖+2, 푖 = 0, . . . , 푝. 푖 푖 ∈ 풬푟 Assume next that 푘푖, the number of rows in 퐹푖, is small compared to 푛. Storing 푄푖 2 requires about 푛 /2 space whereas storing 퐹푖 then only requires 푛푘푖 space. Moreover, 푇 2 the amount of work required to evaluate 푥 푄푖푥 is proportional to 푛 whereas the work 푇 푇 2 required to evaluate 푥 퐹 퐹푖푥 = 퐹푖푥 is proportional to 푛푘푖 only. In other words, if 푖 ‖ ‖ 푄푖 have low rank, then (3.16) will require much less space and time to solve than (3.15). We will study the reformulation (3.16) in much more detail in Sec. 10.

3.3.2 Robust optimization with ellipsoidal uncertainties Often in robust optimization some of the parameters in the model are assumed to be unknown exactly, but there is a simple set describing the uncertainty. For example, for a standard linear optimization problem we may wish to find a robust solution for all objective vectors 푐 in an ellipsoid

푛 = 푐 R 푐 = 퐹 푦 + 푔, 푦 2 1 . ℰ { ∈ | ‖ ‖ ≤ } A common approach is then to optimize for the worst-case scenario for 푐, so we get a robust version 푇 minimize sup푐∈ℰ 푐 푥 subject to 퐴푥 = 푏, (3.17) 푥 0. ≥ 26 The worst-case objective can be evaluated as

푇 푇 푇 푇 푇 푇 sup 푐 푥 = 푔 푥 + sup 푦 퐹 푥 = 푔 푥 + 퐹 푥 2 푐∈ℰ ‖푦‖2≤1 ‖ ‖

푇 푇 where we used that sup 푣 푢 = (푣 푣)/ 푣 2 = 푣 2. Thus the robust problem (3.17) ‖푢‖2≤1 ‖ ‖ ‖ ‖ is equivalent to

푇 푇 minimize 푔 푥 + 퐹 푥 2 ‖ ‖ subject to 퐴푥 = 푏, 푥 0, ≥ which can be posed as a conic quadratic problem

minimize 푔푇 푥 + 푡 subject to 퐴푥 = 푏, (3.18) (푡, 퐹 푇 푥) 푛+1, 푥 0. ∈ 풬 ≥

3.3.3 Markowitz portfolio optimization In classical Markowitz portfolio optimization we consider investment in 푛 stocks or assets held over a period of time. Let 푥푖 denote the amount we invest in asset 푖, and assume a stochastic model where the return of the assets is a random variable 푟 with known mean

휇 = E푟 and covariance

Σ = E(푟 휇)(푟 휇)푇 . − − The return of our investment is also a random variable 푦 = 푟푇 푥 with mean (or expected return)

E푦 = 휇푇 푥 and variance (or risk)

(푦 E푦)2 = 푥푇 Σ푥. − We then wish to rebalance our portfolio to achieve a compromise between risk and ex- pected return, e.g., we can maximize the expected return given an upper bound 훾 on the tolerable risk and a constraint that our total investment is fixed,

maximize 휇푇 푥 subject to 푥푇 Σ푥 훾 (3.19) 푒푇 푥 =≤ 1 푥 0. ≥

27 Suppose we factor Σ = 퐺퐺푇 (e.g., using a Cholesky or a eigenvalue decomposition). We then get a conic formulation maximize 휇푇 푥 subject to (√훾, 퐺푇 푥) 푛+1 (3.20) 푒푇 푥 = 1 ∈ 풬 푥 0. ≥ In practice both the average return and covariance are estimated using historical data. A recent trend is then to formulate a robust version of the portfolio optimization problem to combat the inherent uncertainty in those estimates, e.g., we can constrain 휇 to an ellipsoidal uncertainty set as in Sec. 3.3.2. It is also common that the data for a portfolio optimization problem is already given in the form of a factor model Σ = 퐹 푇 퐹 of Σ = 퐼 +퐹 푇 퐹 and a conic quadratic formulation as in Sec. 3.3.1 is most natural. For more details see Sec. 10.3.

3.3.4 Maximizing the Sharpe ratio Continuing the previous example, the Sharpe ratio defines an efficiency metric of aport- folio as the expected return per unit risk, i.e., 휇푇 푥 푟 푆(푥) = − 푓 , (푥푇 Σ푥)1/2

where 푟푓 denotes the return of a risk-free asset. We assume that there is a portfolio with 푇 휇 푥 > 푟푓 , so maximizing the Sharpe ratio is equivalent to minimizing 1/푆(푥). In other words, we have the following problem

‖퐺푇 푥‖ 푇 minimize 휇 푥−푟푓 subject to 푒푇 푥 = 1, 푥 0. ≥ The objective has the same nature as a quotient of two affine functions we studied in Sec. 2.2.5. We reformulate the problem in a similar way, introducing a scalar variable 푧 0 ≥ and a variable transformation

푦 = 푧푥.

푇 Since a positive 푧 can be chosen arbitrarily and (휇 푟푓 푒) 푥 > 0, we can without loss of − generality assume that

(휇 푟 푒)푇 푦 = 1. − 푓 Thus, we obtain the following conic problem for maximizing the Sharpe ratio, minimize 푡 subject to (푡, 퐺푇 푦) 푘+1, 푒푇 푦 =∈ 풬푧, 푇 (휇 푟푓 푒) 푦 = 1, −푦, 푧 0, ≥ and we recover 푥 = 푦/푧.

28 3.3.5 A resource constrained production and inventory problem The resource constrained production and inventory problem [Zie82] can be formulated as follows: ∑︀푛 minimize 푗=1(푑푗푥푗 + 푒푗/푥푗) ∑︀푛 subject to 푟푗푥푗 푏, (3.21) 푗=1 ≤ 푥 0, 푗 = 1, . . . , 푛, 푗 ≥ where 푛 denotes the number of items to be produced, 푏 denotes the amount of common resource, and 푟푗 is the consumption of the limited resource to produce one unit of item 푗. 푝 The objective function represents inventory and ordering costs. Let 푐푗 denote the holding 푟 cost per unit of product 푗 and 푐푗 denote the rate of holding costs, respectively. Further, let 푐푝푐푟 푑 = 푗 푗 푗 2 so that

푑푗푥푗

is the average holding costs for product 푗. If 퐷푗 denotes the total demand for product 푗 표 and 푐푗 the ordering cost per order of product 푗 then let

표 푒푗 = 푐푗 퐷푗 and hence 표 푒 푐 퐷푗 푗 = 푗 푥푗 푥푗

is the average ordering costs for product 푗. In summary, the problem finds the optimal batch size such that the inventory and ordering cost are minimized while satisfying the constraints on the common resource. Given 푑푗, 푒푗 0 problem (3.21) is equivalent to the ≥ conic quadratic problem ∑︀푛 minimize 푗=1(푑푗푥푗 + 푒푗푡푗) ∑︀푛 subject to 푟푗푥푗 푏, 푗=1 ≤ (푡 , 푥 , √2) 3, 푗 = 1, . . . , 푛. 푗 푗 ∈ 풬푟

It is not always possible to produce a fractional number of items. In such case 푥푗 should be constrained to be integers. See Sec.9.

Chapter 4

The power cone

So far we studied quadratic cones and their applications in modeling problems involving, directly or indirectly, quadratic terms. In this part we expand the quadratic and rotated quadratic cone family with power cones, which provide a convenient language to express models involving powers other than 2. We must stress that although the power cones include the quadratic cones as special cases, at the current state-of-the-art they require more advanced and less efficient algorithms.

4.1 The power cone(s)

푛-dimensional power cones form a family of convex cones parametrized by a real number 0 < 훼 < 1: {︁ }︁ 훼,1−훼 푛 훼 1−훼 √︀ 2 2 푛 = 푥 R : 푥1 푥2 푥3 + + 푥푛, 푥1, 푥2 0 . (4.1) 풫 ∈ ≥ ··· ≥ The constraint in the definition of 훼,1−훼 can be expressed as a composition of two 풫푛 constraints, one of which is a quadratic cone:

푥훼푥1−훼 푧 , 1 2 ≥ |√︀| (4.2) 푧 푥2 + + 푥2 , ≥ 3 ··· 푛 which means that the basic building block we need to consider is the three-dimensional power cone

훼,1−훼 {︀ 3 훼 1−훼 }︀ 3 = 푥 R : 푥1 푥2 푥3 , 푥1, 푥2 0 . (4.3) 풫 ∈ ≥ | | ≥ More generally, we can also consider power cones with “long left-hand side”. That is, for 푚 < 푛 and a sequence of exponents 훼1, . . . , 훼푚 with 훼1 + + 훼푚 = 1, we have the most ··· general power cone object defined as {︁ √︁ }︁ 훼1,··· ,훼푚 푛 ∏︀푚 훼푖 ∑︀푛 2 푛 = 푥 R : 푥 푥 , 푥1, . . . , 푥푚 0 . (4.4) 풫 ∈ 푖=1 푖 ≥ 푖=푚+1 푖 ≥

The left-hand side is nothing but the weighted geometric mean of the 푥푖, 푖 = 1, . . . , 푚 with weights 훼푖. As we will see later, also this most general cone can be modeled as a composition of three-dimensional cones 훼,1−훼, so in a sense that is the basic object of 풫3 interest.

30 There are some notable special cases we are familiar with. If we let 훼 0 then in the 0,1 푛−1 1 → limit we get 푛 = R+ . If 훼 = then we have a rescaled version of the rotated 풫 × 풬 2 quadratic cone, precisely:

1 1 2 , 2 푛 (푥 , 푥 , . . . , 푥 ) 푛 (푥 /√2, 푥 /√2, 푥 , . . . , 푥 ) . 1 2 푛 ∈ 풫 ⇐⇒ 1 2 3 푛 ∈ 풬r A gallery of three-dimensional power cones for varying 훼 is shown in Fig. 4.1.

Fig. 4.1: The boundary of 훼,1−훼 seen from a point inside the cone for 훼 = 풫3 0.1, 0.2, 0.35, 0.5.

31 4.2 Sets representable using the power cone

In this section we give basic examples of constraints which can be expressed using power cones.

4.2.1 Powers For all values of 푝 = 0, 1 we can bound 푥푝 depending on the convexity of 푓(푥) = 푥푝. ̸ • For 푝 > 1 the inequality 푡 푥 푝 is equivalent to 푡1/푝 푥 and hence corresponds ≥ | | ≥ | | to

푡 푥 푝 (푡, 1, 푥) 1/푝,1−1/푝. ≥ | | ⇐⇒ ∈ 풫3 For instance 푡 푥 1.5 is equivalent to (푡, 1, 푥) 2/3,1/3. ≥ | | ∈ 풫3 • For 0 < 푝 < 1 the function 푓(푥) = 푥푝 is concave for 푥 0 and so we get ≥ 푡 푥푝, 푥 0 (푥, 1, 푡) 푝,1−푝. | | ≤ ≥ ⇐⇒ ∈ 풫3 • For 푝 < 0 the function 푓(푥) = 푥푝 is convex for 푥 > 0 and in this range the inequality 푡 푥푝 is equivalent to ≥ 푡 푥푝 푡1/(1−푝)푥−푝/(1−푝) 1 (푡, 푥, 1) 1/(1−푝),−푝/(1−푝). ≥ ⇐⇒ ≥ ⇐⇒ ∈ 풫3 2/3,1/3 For example 푡 √1 is the same as (푡, 푥, 1) . ≥ 푥 ∈ 풫3 4.2.2 푝-norm cones

푛 푝 푝 1/푝 Let 푝 1. The 푝-norm of a vector 푥 R is 푥 푝 = ( 푥1 + + 푥푛 ) and the ≥ ∈ ‖ ‖ | | ··· | | 푝-norm ball of radius 푡 is defined by the inequality 푥 푝 푡. We take the 푝-norm cone ‖ ‖ ≤ (in dimension 푛 + 1) to be the convex set

{︀ 푛+1 }︀ (푡, 푥) R : 푡 푥 푝 . (4.5) ∈ ≥ ‖ ‖ For 푝 = 2 this is precisely the quadratic cone. We can model the 푝-norm cone by writing the inequality 푡 푥 푝 as: ≥ ‖ ‖ ∑︁ 푡 푥 푝/푡푝−1 ≥ | 푖| 푖 and bounding each summand with a power cone. This leads to the following model:

푝−1 푝 1/푝,1−1/푝 푟푖푡 푥푖 ((푟푖, 푡, 푥푖) ), ∑︀ ≥ | | ∈ 풫3 (4.6) 푟푖 = 푡.

When 0 < 푝 < 1 or 푝 < 0 the formula for 푥 푝 gives a concave, rather than convex 푛 ‖ ‖ function on R+ and in this case it is possible to model the set {︂ (︁∑︁ )︁1/푝 }︂ (푡, 푥) : 0 푡 푥푝 , 푥 0 , 푝 < 1, 푝 = 0. ≤ ≤ 푖 푖 ≥ ̸ We leave it as an exercise (see previous subsection). The case 푝 = 1 appears in Sec. − 3.2.7.

32 4.2.3 The most general power cone Consider the most general version of the power cone with “long left-hand side” defined in (4.4). We show that it can be expressed using the basic three-dimensional cones 훼,1−훼. 풫3 Clearly it suffices to consider a short right-hand side, that is to model thecone

훼1,...,훼푚 훼1 훼2 훼푚 = 푥 푥 푥 푧 , 푥1, . . . , 푥푚 0 , (4.7) 풫푚+1 { 1 2 ··· 푚 ≥ | | ≥ } ∑︀ where 훼푖 = 1. Denote 푠 = 훼1 + + 훼푚−1. We split (4.7) into two constraints 푖 ··· 훼1/푠 훼푚−1/푠 푥 푥 푡 , 푥 , . . . , 푥 − 0, 1 ··· 푚−1 ≥ | | 1 푚 1 ≥ (4.8) 푡푠푥훼푚 푧 , 푥 0, 푚 ≥ | | 푚 ≥

훼1,...,훼푚 훼1/푠,...,훼푚−1/푠 푠,훼푚 and this way we expressed using two power cones 푚 and . 풫푚+1 풫 풫3 Proceeding by induction gives the desired splitting.

4.2.4 Geometric mean

The power cone 1/푛,...,1/푛 is a direct way to model the inequality 풫푛+1 푧 (푥 푥 푥 )1/푛, (4.9) | | ≤ 1 2 ··· 푛 which corresponds to maximizing the geometric mean of the variables 푥푖 0. In this ≥ special case tracing through the splitting (4.8) produces an equivalent representation of (4.9) using three-dimensional power cones as follows:

푥1/2푥1/2 푡 , 1 2 ≥ | 3| 푡1−1/3푥1/3 푡 , 3 3 ≥ | 4| 1−1/(푛−1) 1/(푛−1) ··· 푡푛−1 푥푛−1 푡푛 , 1−1/푛 1/푛 ≥ | | 푡푛 푥푛 푧 . ≥ | | 4.2.5 Non-homogenous constraints Every constraint of the form

푥훼1 푥훼2 푥훼푚 푧 훽, 푥 0, 1 2 ··· 푚 ≥ | | 푖 ≥ ∑︀ where 푖 훼푖 < 훽 and 훼푖 > 0 is equivalent to

(푥 , 푥 , . . . , 푥 , 1, 푧) 훼1/훽,...,훼푚/훽,푠 1 2 푚 ∈ 풫푚+2 ∑︀ with 푠 = 1 훼푖/훽. − 푖 4.3 Power cone case studies

4.3.1 Portfolio optimization with market impact Let us go back to the Markowitz portfolio optimization problem introduced in Sec. 3.3.3, where we now ask to maximize expected profit subject to bounded risk in a long-only

33 portfolio:

maximize 휇푇 푥 subject to √푥푇 Σ푥 훾, (4.10) 푥 ≤ 0, 푖 = 1, . . . , 푛, 푖 ≥ In a realistic model we would have to consider transaction costs which decrease the expected return. In particular if a really large volume is traded then the trade itself will affect the price of the asset, a phenomenon called market impact. It is typically modeled by decreasing the expected return of 푖-th asset by a slippage cost proportional to 푥훽 for some 훽 > 1, so that the objective function changes to (︃ )︃ 푇 ∑︁ 훽 maximize 휇 푥 훿푖푥 . − 푖 푖

A popular choice is 훽 = 3/2. This objective can easily be modeled with a power cone as in Sec. 4.2: maximize 휇푇 푥 훿푇 푡 − 1/훽,1−1/훽 훽 subject to (푡푖, 1, 푥푖) (푡푖 푥 ), ∈ 풫3 ≥ 푖 ··· 3/2 In particular if 훽 = 3/2 the inequality 푡푖 푥 has conic representation (푡푖, 1, 푥푖) ≥ 푖 ∈ 2/3,1/3. 풫3 4.3.2 Maximum volume cuboid

Suppose we have a convex, conic representable set 퐾 R푛 (for instance a polyhedron, ⊆ a ball, intersections of those and so on). We would like to find a maximum volume axis-parallel cuboid inscribed in 퐾. If we denote by 푝 R푛 the leftmost corner and by ∈ 푥1, . . . , 푥푛 the edge lengths, then this problem is equivalent to

maximize 푡 1/푛 subject to 푡 (푥1 푥푛) , ≤ ··· (4.11) 푥푖 0, (푝 ≥+ 푒 푥 , . . . , 푝 + 푒 푥 ) 퐾, 푒 , . . . , 푒 0, 1 , 1 1 1 푛 푛 푛 ∈ ∀ 1 푛 ∈ { } where the last constraint states that all vertices of the cuboid are in 퐾. The optimal volume is then 푣 = 푡푛. Modeling the geometric mean with power cones was discussed in Sec. 4.2.4. Maximizing the volume of an arbitrary (not necessarily axis-parallel) cuboid inscribed in 퐾 is no longer a convex problem. However, it can be solved by maximizing the solution to (4.11) over all sets 푇 (퐾) where 푇 is a rotation (orthogonal matrix) in R푛. In practice one can approximate the global solution by sampling sufficiently many rotations 푇 or using more advanced methods of optimization over the orthogonal group.

34 Fig. 4.2: The maximal volume cuboid inscribed in the regular icosahedron takes up approximately 0.388 of the volume of the icosahedron (the exact value is 3(1 + √5)/25).

4.3.3 푝-norm geometric median

푛 The geometric median of a sequence of points 푥1, . . . , 푥푘 R is defined as ∈ 푘 ∑︁ argmin 푛 푦 푥 , 푦∈R ‖ − 푖‖ 푖=1 that is a point which minimizes the sum of distances to all the given points. Here ‖·‖ can be any norm on R푛. The most classical case is the Euclidean norm, where the geometric median is the solution to the basic facility location problem minimizing total transportation cost from one depot to given destinations. For a general 푝-norm 푥 푝 with 1 푝 < (see Sec. 4.2.2) the geometric median is ‖ ‖ ≤ ∞ the solution to the obvious conic problem: ∑︀ minimize 푖 푡푖 subject to 푡푖 푦 푥푖 푝, (4.12) ≥ ‖ − ‖ 푦 R푛. ∈ In Sec. 4.2.2 we showed how to model the 푝-norm bound using 푛 power cones. The Fermat-Torricelli point of a triangle is the Euclidean geometric mean of its ver- tices, and a classical theorem in planar geometry (due to Torricelli, posed by Fermat), states that it is the unique point inside the triangle from which each edge is visible at the angle of 120∘ (or a vertex if the triangle has an angle of 120∘ or more). Using (4.12) we can compute the 푝-norm analogues of the Fermat-Torricelli point. Some examples are shown in Fig. 4.3.

35 Fig. 4.3: The geometric median of three triangle vertices in various 푝-norms.

4.3.4 Maximum likelihood estimator of a convex density function In [TV98] the problem of estimating a density function that is know in advance to be convex is considered. Here we will show that this problem can be posed as a conic optimization problem. Formally the problem is to estimate an unknown convex density function 푔 : R+ R+ given an ordered sample 푦1 < 푦2 < . . . < 푦푛 of 푛 outcomes of a → distribution with density 푔. The estimator 푔˜ 0 is a piecewise linear function ≥

푔˜ :[푦1, 푦푛] R+ → with break points at (푦푖, 푥푖), 푖 = 1, . . . , 푛, where the variables 푥푖 > 0 are estimators for 푔(푦푖). The slope of the 푖-th linear segment of 푔˜ is 푥 푥 푖+1 − 푖 . 푦 푦 푖+1 − 푖 Hence the convexity requirement leads to the constraints 푥 푥 푥 푥 푖+1 − 푖 푖+2 − 푖+1 , 푖 = 1, . . . , 푛 2. 푦 푦 ≤ 푦 푦 − 푖+1 − 푖 푖+2 − 푖+1 Recall the area under the density function must be 1. Hence,

푛−1 (︂ )︂ ∑︁ 푥푖+1 + 푥푖 (푦 푦 ) = 1 푖+1 − 푖 2 푖=1

must hold. Therefore, the problem to be solved is ∏︀푛 maximize 푖=1 푥푖 − − subject to 푥푖+1 푥푖 푥푖+2 푥푖+1 0, 푖 = 1, . . . , 푛 2, 푦푖+1−푦푖 − 푦푖+2−푦푖+1 ≤ − ∑︀푛−1 (︀ 푥푖+1+푥푖 )︀ 푖=1 (푦푖+1 푦푖) 2 = 1, − 푥 0. ≥ 푛 푛 1 ∏︀ ∏︀ 푛 Maximizing 푖=1 푥푖 or the geometric mean ( 푖=1 푥푖) will produce the same optimal solutions. Using that observation we can model the objective as shown in Sec. 4.2.4. ∑︀ Alternatively, one can use as objective 푖 log 푥푖 and model it using the exponential cone as in Sec. 5.2.

36 Chapter 5

Exponential cone optimization

So far we discussed optimization problems involving the major “polynomial” families of cones: linear, quadratic and power cones. In this chapter we introduce a single new object, namely the three-dimensional exponential cone, together with examples and applications. The exponential cone can be used to model a variety of constraints involving exponentials and logarithms.

5.1 Exponential cone

The exponential cone is a convex subset of R3 defined as

퐾 = {︀(푥 , 푥 , 푥 ): 푥 푥 푒푥3/푥2 , 푥 > 0}︀ exp 1 2 3 1 2 2 (5.1) (푥 , 0, 푥 ): 푥 ≥0, 푥 0 . ∪ { 1 3 1 ≥ 3 ≤ } Thus the exponential cone is the closure in R3 of the set of points which satisfy

푥 푥 푒푥3/푥2 , 푥 , 푥 > 0. (5.2) 1 ≥ 2 1 2 When working with logarithms, a convenient reformulation of (5.2) is

푥3 푥2 log(푥1/푥2), 푥1, 푥2 > 0. (5.3) ≤ Alternatively, one can write the same condition as

푥 /푥 푒푥3/푥2 , 푥 , 푥 > 0, 1 2 ≥ 1 2 which immediately shows that 퐾exp is in fact a cone, i.e. 훼푥 퐾exp for 푥 퐾exp and ∈ ∈ 훼 0. Convexity of 퐾exp follows from the fact that the Hessian of 푓(푥, 푦) = 푦 exp(푥/푦), ≥ namely

[︂ 푦−1 푥푦−2 ]︂ 퐷2(푓) = 푒푥/푦 푥푦−2 −푥2푦−3 − is positive semidefinite for 푦 > 0.

37 Fig. 5.1: The boundary of the exponential cone 퐾exp. On the left, the red isolines are graphs of 푥2 푥2 log(푥1/푥2) for fixed 푥1, see (5.3). On the right, they are graphs of 푥3/푥2 → 푥3 푥2푒 for fixed 푥2. → 5.2 Modeling with the exponential cone

Extending the conic optimization toolbox with the exponential cone leads to new types of constraint building blocks and new types of representable sets. In this section we list the basic operations available using the exponential cone.

5.2.1 Exponential

푥 The epigraph 푡 푒 is a section of 퐾exp: ≥ 푥 푡 푒 (푡, 1, 푥) 퐾exp. (5.4) ≥ ⇐⇒ ∈ 5.2.2 Logarithm Similarly, we can express the hypograph 푡 log 푥, 푥 0: ≤ ≥

푡 log 푥 (푥, 1, 푡) 퐾exp. (5.5) ≤ ⇐⇒ ∈ 5.2.3 Entropy The entropy function 퐻(푥) = 푥 log 푥 can be maximized using the following representa- − tion which follows directly from (5.3):

푡 푥 log 푥 푡 푥 log(1/푥) (1, 푥, 푡) 퐾exp. (5.6) ≤ − ⇐⇒ ≤ ⇐⇒ ∈

38 5.2.4 Relative entropy The relative entropy or Kullback-Leiber divergence of two probability distributions is defined in terms of the function 퐷(푥, 푦) = 푥 log(푥/푦). It is convex, and the minimization problem 푡 퐷(푥, 푦) is equivalent to ≥

푡 퐷(푥, 푦) 푡 푥 log(푦/푥) (푦, 푥, 푡) 퐾exp. (5.7) ≥ ⇐⇒ − ≤ ⇐⇒ − ∈ Because of this reparametrization the exponential cone is also referred to as the relative entropy cone, leading to a class of problems known as REPs (relative entropy problems). Having the relative entropy function available makes it possible to express epigraphs of other functions appearing in REPs, for instance:

푥 log(1 + 푥/푦) = 퐷(푥 + 푦, 푦) + 퐷(푦, 푥 + 푦).

5.2.5 Softplus function

In neural networks the function $f(x) = \log(1 + e^x)$, known as the softplus function, is used as an analytic approximation to the rectifier activation function $r(x) = x^+ = \max(0, x)$. The softplus function is convex and we can express its epigraph $t \ge \log(1 + e^x)$ by combining two exponential cones. Note that

$$t \ge \log(1 + e^x) \iff e^{x-t} + e^{-t} \le 1,$$

and therefore $t \ge \log(1 + e^x)$ is equivalent to the following set of conic constraints:

$$\begin{array}{l}
u + v \le 1, \\
(u, 1, x - t) \in K_{\exp}, \\
(v, 1, -t) \in K_{\exp}.
\end{array} \qquad (5.8)$$

5.2.6 Log-sum-exp

We can generalize the previous example to a log-sum-exp (logarithm of a sum of exponentials) expression

$$t \ge \log(e^{x_1} + \cdots + e^{x_n}).$$

This is equivalent to the inequality

$$e^{x_1 - t} + \cdots + e^{x_n - t} \le 1,$$

and so it can be modeled as follows:

$$\begin{array}{l}
\sum_i u_i \le 1, \\
(u_i, 1, x_i - t) \in K_{\exp}, \quad i = 1, \ldots, n.
\end{array} \qquad (5.9)$$
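A quick numerical sanity check of this model can be made with an off-the-shelf modeling tool. The sketch below uses CVXPY (assumed available); the auxiliary variables $u_i$ play exactly the role of the members of $K_{\exp}$ in (5.9), since the constraints enforce $e^{x_i - t} \le u_i$.

```python
# Sketch verifying the conic model (5.9) of t >= log(exp(x_1) + ... + exp(x_n)).
# Assumes CVXPY; the data vector x is invented.
import numpy as np
import cvxpy as cp

x = np.array([0.3, -1.2, 2.0, 0.7])           # fixed data
t = cp.Variable()
u = cp.Variable(len(x))

prob = cp.Problem(cp.Minimize(t),
                  [cp.sum(u) <= 1,
                   cp.exp(x - t) <= u])        # elementwise; each pair maps to one K_exp member
prob.solve()

print(t.value, np.log(np.sum(np.exp(x))))      # the two values should agree
```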

5.2.7 Log-sum-inv

The following type of bound has applications in capacity optimization for wireless network design:

$$t \ge \log\left(\frac{1}{x_1} + \cdots + \frac{1}{x_n}\right), \quad x_i > 0.$$

Since the logarithm is increasing, we can model this using a log-sum-exp and an exponential as:

$$t \ge \log(e^{y_1} + \cdots + e^{y_n}), \qquad x_i \ge e^{-y_i},\ i = 1, \ldots, n.$$

Alternatively, one can also rewrite the original constraint in equivalent form:

$$e^{-t} \le s \le \left(\frac{1}{x_1} + \cdots + \frac{1}{x_n}\right)^{-1}, \quad x_i > 0,$$

and then model the right-hand side inequality using the technique from Sec. 3.2.7. This approach requires only one exponential cone.

5.2.8 Arbitrary exponential The inequality

$$t \ge a_1^{x_1}a_2^{x_2}\cdots a_n^{x_n},$$

where $a_i$ are arbitrary positive constants, is of course equivalent to

$$t \ge \exp\left(\sum_i x_i\log a_i\right)$$

and therefore to $\left(t, 1, \sum_i x_i\log a_i\right) \in K_{\exp}$.

5.2.9 Lambert W-function

The Lambert function $W : \mathbb{R}_+ \to \mathbb{R}_+$ is the unique function satisfying the identity

$$W(x)e^{W(x)} = x.$$

It is the real branch of a more general function which appears in applications such as diode modeling. The $W$ function is concave. Although there is no explicit analytic formula for $W(x)$, the hypograph $\{(x, t) : 0 \le x,\ 0 \le t \le W(x)\}$ has an equivalent description:

$$x \ge te^t = te^{t^2/t},$$

and so it can be modeled with a mix of exponential and quadratic cones (see Sec. 3.1.2):

$$\begin{array}{ll}
(x, t, u) \in K_{\exp}, & (x \ge t\exp(u/t)), \\
(1/2, u, t) \in \mathcal{Q}^3_r, & (u \ge t^2).
\end{array} \qquad (5.10)$$
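The defining identity behind this model is easy to verify numerically; the following snippet uses SciPy's implementation of the Lambert function (assumed available) and only checks the identity itself, not the conic representation.

```python
# Quick numerical check of the identity behind (5.10): for t = W(x) one has x = t*exp(t),
# so the hypograph constraint x >= t*exp(t) is tight exactly at t = W(x).
# scipy.special.lambertw returns the principal branch as a complex value; take the real part.
import numpy as np
from scipy.special import lambertw

x = np.array([0.5, 1.0, 2.0, 10.0])
t = lambertw(x).real
print(np.allclose(t * np.exp(t), x))   # True: W(x) exp(W(x)) = x
```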

5.2.10 Other simple sets

Here are a few more typical sets which can be expressed using the exponential and quadratic cones. The presentations should be self-explanatory; we leave the simple verifications to the reader.

Table 5.1: Sets representable with the exponential cone

• $t \ge (\log x)^2,\ 0 < x \le 1$:  $(\tfrac12, t, u) \in \mathcal{Q}^3_r,\ (x, 1, u) \in K_{\exp},\ x \le 1$
• $t \le \log\log x,\ x > 1$:  $(u, 1, t) \in K_{\exp},\ (x, 1, u) \in K_{\exp}$
• $t \ge (\log x)^{-1},\ x > 1$:  $(u, t, \sqrt{2}) \in \mathcal{Q}^3_r,\ (x, 1, u) \in K_{\exp}$
• $t \le \sqrt{\log x},\ x > 1$:  $(\tfrac12, u, t) \in \mathcal{Q}^3_r,\ (x, 1, u) \in K_{\exp}$
• $t \le \sqrt{x\log x},\ x > 1$:  $(x, u, \sqrt{2}\,t) \in \mathcal{Q}^3_r,\ (x, 1, u) \in K_{\exp}$
• $t \le \log(1 - 1/x),\ x > 1$:  $(x, u, \sqrt{2}) \in \mathcal{Q}^3_r,\ (1 - u, 1, t) \in K_{\exp}$
• $t \ge \log(1 + 1/x),\ x > 0$:  $(x + 1, u, \sqrt{2}) \in \mathcal{Q}^3_r,\ (1 - u, 1, -t) \in K_{\exp}$

5.3 Geometric programming

Geometric optimization problems form a family of optimization problems with objective and constraints in special polynomial form. It is a rich class of problems solved by reformulating in logarithmic-exponential form, and thus a major area of applications for the exponential cone 퐾exp. Geometric programming is used in circuit design, chemical engineering, mechanical engineering and finance, just to mention a few applications. We refer to [BKVH07] for a survey and extensive bibliography.

5.3.1 Definition and basic examples A monomial is a real valued function of the form

푎1 푎2 푎푛 푓(푥1, . . . , 푥푛) = 푐푥 푥 푥 , (5.11) 1 2 ··· 푛

where the exponents $a_i$ are arbitrary real numbers and $c > 0$. A posynomial (positive polynomial) is a sum of monomials. Thus the difference between a posynomial and the standard notion of a multivariate polynomial known from algebra or calculus is that (i) posynomials can have arbitrary exponents, not just integers, but (ii) they can only have positive coefficients. For example, the following functions are monomials (in variables $x, y, z$):

$$xy, \quad 2x^{1.5}y^{-1}x^{0.3}, \quad 3\sqrt{xy/z}, \quad 1, \qquad (5.12)$$

and the following are examples of posynomials:

$$2x + yz, \quad 1.5x^3z + 5/y, \quad (x^2y^2 + 3z^{-0.3})^4 + 1. \qquad (5.13)$$

A geometric program (GP) is an optimization problem of the form

$$\begin{array}{ll}
\text{minimize} & f_0(x) \\
\text{subject to} & f_i(x) \le 1, \quad i = 1, \ldots, m, \\
& x_j > 0, \quad j = 1, \ldots, n,
\end{array} \qquad (5.14)$$

where $f_0, \ldots, f_m$ are posynomials and $x = (x_1, \ldots, x_n)$ is the variable vector. A geometric program (5.14) can be modeled in exponential conic form by making a substitution

$$x_j = e^{y_j}, \quad j = 1, \ldots, n.$$

Under this substitution a monomial of the form (5.11) becomes

$$ce^{a_1y_1}e^{a_2y_2}\cdots e^{a_ny_n} = \exp(a_*^Ty + \log c)$$

for $a_* = (a_1, \ldots, a_n)$. Consequently, the optimization problem (5.14) takes an equivalent form

$$\begin{array}{ll}
\text{minimize} & t \\
\text{subject to} & \log\left(\sum_k \exp(a_{0,k,*}^Ty + \log c_{0,k})\right) \le t, \\
& \log\left(\sum_k \exp(a_{i,k,*}^Ty + \log c_{i,k})\right) \le 0, \quad i = 1, \ldots, m,
\end{array} \qquad (5.15)$$

where $a_{i,k} \in \mathbb{R}^n$ and $c_{i,k} \in \mathbb{R}$ for all $i, k$. These are now log-sum-exp constraints we already discussed in Sec. 5.2.6. In particular, the problem (5.15) is convex, as opposed to the posynomial formulation (5.14).

Example We demonstrate this reduction on a simple example. Take the geometric problem

$$\begin{array}{ll}
\text{minimize} & x + y^2z \\
\text{subject to} & 0.1\sqrt{x} + 2y^{-1} \le 1, \\
& z^{-1} + yx^{-2} \le 1.
\end{array}$$

By substituting $x = e^u$, $y = e^v$, $z = e^w$ we get

$$\begin{array}{ll}
\text{minimize} & t \\
\text{subject to} & \log(e^u + e^{2v+w}) \le t, \\
& \log(e^{0.5u + \log 0.1} + e^{-v + \log 2}) \le 0, \\
& \log(e^{-w} + e^{v - 2u}) \le 0,
\end{array}$$

and using the log-sum-exp reduction from Sec. 5.2.6 we write an explicit conic problem:

$$\begin{array}{lll}
\text{minimize} & t & \\
\text{subject to} & (p_1, 1, u - t),\ (q_1, 1, 2v + w - t) \in K_{\exp}, & p_1 + q_1 \le 1, \\
& (p_2, 1, 0.5u + \log 0.1),\ (q_2, 1, -v + \log 2) \in K_{\exp}, & p_2 + q_2 \le 1, \\
& (p_3, 1, -w),\ (q_3, 1, v - 2u) \in K_{\exp}, & p_3 + q_3 \le 1.
\end{array}$$

Solving this problem yields $(x, y, z) \approx (3.14, 2.43, 1.32)$.

5.3.2 Generalized geometric models

In this section we briefly discuss more general types of constraints which can be modeled with geometric programs.

42 Monomials If 푚(푥) is a monomial then the constraint 푚(푥) = 푐 is equivalent to two posynomial inequalities 푚(푥)푐−1 1 and 푚(푥)−1푐 1, so it can be expressed in the language of ≤ ≤ geometric programs. In practice it should be added to the model (5.15) as a linear constraint ∑︁ 푎푘푦푘 = log 푐. 푘

Monomial inequalities 푚(푥) 푐 and 푚(푥) 푐 should similarly be modeled as linear ≤ ≥ inequalities in the 푦푖 variables. In similar vein, if 푓(푥) is a posynomial and 푚(푥) is a monomial then 푓(푥) 푚(푥) is ≤ still a posynomial inequality because it can be written as 푓(푥)푚(푥)−1 1. ≤ It also means that we can add lower and upper variable bounds: 0 < 푐1 푥 푐2 is −1 −1 ≤ ≤ equivalent to 푥 푐1 1 and 푐 푥 1. ≤ 2 ≤ Products and positive powers Expressions involving products and positive powers (possibly iterated) of posynomials can again be modeled with posynomials. For example, a constraint such as

((푥푦2 + 푧)0.3 + 푦)(1/푥 + 1/푦)2.2 1 ≤ can be replaced with

푥푦2 + 푧 푡, 푡0.3 + 푦 푢, 푥−1 + 푦−1 푣, 푢푣2.2 1. ≤ ≤ ≤ ≤ Other transformations and extensions

• If 푓1, 푓2 are already expressed by posynomials then max 푓1(푥), 푓2(푥) 푡 is clearly { } ≤ equivalent to 푓1(푥) 푡 and 푓2(푥) 푡. Hence we can add the maximum operator ≤ ≤ to the list of building blocks for geometric programs.

• If 푓, 푔 are posynomials, 푚 is a monomial and we know that 푚(푥) 푔(푥) then the ≥ constraint 푓(푥) 푡 is equivalent to 푡−1푓(푥)푚(푥)−1 + 푔(푥)푚(푥)−1 1. 푚(푥)−푔(푥) ≤ ≤ • The objective function of a geometric program (5.14) can be extended to include other terms, for example: ∑︁ minimize 푓0(푥) + log 푚푘(푥) 푘

푦 where 푚푘(푥) are monomials. After the change of variables 푥 = 푒 we get a slightly modified version of (5.15):

minimize 푡 + 푏푇 푦 ∑︀ 푇 subject to exp(푎 푦 + log 푐0,푘) 푡, 푘 0,푘,* ≤ log(∑︀ exp(푎푇 푦 + log 푐 )) 0, 푖 = 1, . . . , 푚, 푘 푖,푘,* 푖,푘 ≤ (note the lack of one logarithm) which can still be expressed with exponential cones.

43 5.3.3 Geometric programming case studies Frobenius norm diagonal scaling

Suppose we have a matrix $M \in \mathbb{R}^{n\times n}$ and we want to rescale the coordinate system using a diagonal matrix $D = \mathrm{Diag}(d_1, \ldots, d_n)$ with $d_i > 0$. In the new basis the linear transformation given by $M$ will now be described by the matrix $DMD^{-1} = (d_iM_{ij}d_j^{-1})_{i,j}$. To choose $D$ which leads to a "small" rescaling we can for example minimize the Frobenius norm

$$\|DMD^{-1}\|_F^2 = \sum_{ij}\left((DMD^{-1})_{ij}\right)^2 = \sum_{ij} M_{ij}^2\, d_i^2 d_j^{-2}.$$

Minimizing the last sum is an example of a geometric program with variables 푑푖 (and without constraints).
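After the usual substitution $d_i = e^{u_i}$ the objective becomes a smooth convex function of $u$, so even a general-purpose solver can handle it. The sketch below uses NumPy/SciPy (assumed available) with invented data for $M$; note that the objective is invariant under a common shift of $u$, reflecting the scale invariance of $D$.

```python
# Sketch of the diagonal-scaling GP after the log change of variables d_i = exp(u_i):
# minimize sum_ij M_ij^2 * exp(2*(u_i - u_j)), which is smooth and convex in u.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))                           # invented test matrix

def frob_sq(u):
    D = np.exp(u)
    return np.sum((np.outer(D, 1.0 / D) * M) ** 2)        # ||D M D^{-1}||_F^2

res = minimize(frob_sq, x0=np.zeros(5))
print(np.sqrt(frob_sq(np.zeros(5))), np.sqrt(res.fun))    # Frobenius norm before vs. after scaling
```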

Maximum likelihood estimation

Geometric programs appear naturally in connection with maximum likelihood estimation of parameters of random distributions. Consider a simple example. Suppose we have two biased coins, with head probabilities $p$ and $2p$, respectively. We toss both coins and count the total number of heads. Given that, in the long run, we observed $i$ heads $n_i$ times for $i = 0, 1, 2$, estimate the value of $p$. The probability of obtaining the given outcome equals

$$\binom{n_0 + n_1 + n_2}{n_0}\binom{n_1 + n_2}{n_1}(p\cdot 2p)^{n_2}\bigl(p(1 - 2p) + 2p(1 - p)\bigr)^{n_1}\bigl((1 - p)(1 - 2p)\bigr)^{n_0},$$

and, up to constant factors, maximizing that expression is equivalent to solving the problem

$$\begin{array}{ll}
\text{maximize} & p^{2n_2 + n_1}s^{n_1}q^{n_0}r^{n_0} \\
\text{subject to} & q \le 1 - p, \\
& r \le 1 - 2p, \\
& s \le 3 - 4p,
\end{array}$$

or, as a geometric problem:

$$\begin{array}{ll}
\text{minimize} & p^{-2n_2 - n_1}s^{-n_1}q^{-n_0}r^{-n_0} \\
\text{subject to} & q + p \le 1, \\
& r + 2p \le 1, \\
& \tfrac13 s + \tfrac43 p \le 1.
\end{array}$$

For example, if (푛0, 푛1, 푛2) = (30, 53, 16) then the above problem solves with 푝 = 0.29.
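This value is easy to confirm without the GP machinery: up to constants, the objective is the log-likelihood $(2n_2 + n_1)\log p + n_1\log(3 - 4p) + n_0\log(1 - p) + n_0\log(1 - 2p)$, which can be maximized directly. A small check using SciPy (assumed available):

```python
# Numerical sanity check of the coin example: maximize the log-likelihood over 0 < p < 1/2.
# This bypasses the GP formulation and just confirms p ~ 0.29.
import numpy as np
from scipy.optimize import minimize_scalar

n0, n1, n2 = 30, 53, 16
neg_loglik = lambda p: -((2*n2 + n1)*np.log(p) + n1*np.log(3 - 4*p)
                         + n0*np.log(1 - p) + n0*np.log(1 - 2*p))
res = minimize_scalar(neg_loglik, bounds=(1e-6, 0.5 - 1e-6), method="bounded")
print(round(res.x, 2))   # ~0.29
```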

An Olympiad problem The 26th Vojtěch Jarník International Mathematical Competition, Ostrava 2016. Let 푎, 푏, 푐 be positive real numbers with 푎 + 푏 + 푐 = 1. Prove that (︂1 1 )︂ (︂1 1 )︂ (︂1 1 )︂ + + + 1728 푎 푏푐 푏 푎푐 푐 푎푏 ≥

44 Using the tricks introduced in Sec. 5.3.2 we formulate this problem as a geometric pro- gram:

minimize 푝푞푟 subject to 푝−1푎−1 + 푝−1푏−1푐−1 1, 푞−1푏−1 + 푞−1푎−1푐−1 ≤ 1, 푟−1푐−1 + 푟−1푎−1푏−1 ≤ 1, 푎 + 푏 + 푐 ≤ 1. ≤ Unsurprisingly, the optimal value of this program is 1728, achieved for (푎, 푏, 푐, 푝, 푞, 푟) = (︀ 1 1 1 )︀ 3 , 3 , 3 , 12, 12, 12 .

Power control and rate allocation in wireless networks We consider a basic wireless network power control problem. In a wireless network with 푛 logical transmitter-receiver pairs if the power output of transmitter 푗 is 푝푗 then the power received by receiver 푖 is 퐺푖푗푝푗, where 퐺푖푗 models path gain and fading effects. If the 푖-th receiver’s own noise is 휎푖 then the signal-to-interference-plus-noise (SINR) ratio of receiver 푖 is given by

퐺푖푖푝푖 푠푖 = ∑︀ . (5.16) 휎푖 + 푗≠ 푖 퐺푖푗푝푗

Maximizing the minimal SINR over all receivers (max min푖 푠푖), subject to bounded power output of the transmitters, is equivalent to the geometric program with variables 푝1, . . . , 푝푛, 푡:

minimize 푡−1 subject to 푝min 푝푗 푝max, 푗 = 1, . . . , 푛, (5.17) 푡(휎 + ∑︀ 퐺≤ 푝 )≤퐺−1푝−1 1, 푖 = 1, . . . , 푛. 푖 푗≠ 푖 푖푗 푗 푖푖 푖 ≤ In the low-SNR regime the problem of system rate maximization is approximated by the ∑︀ ∏︀ −1 problem of maximizing 푖 log 푠푖, or equivalently minimizing 푖 푠푖 . This is a geometric problem with variables 푝푖, 푠푖:

minimize 푠−1 푠−1 1 ···· 푛 subject to 푝min 푝푗 푝max, 푗 = 1, . . . , 푛, (5.18) 푠 (휎 + ∑︀ 퐺≤ 푝 )≤퐺−1푝−1 1, 푖 = 1, . . . , 푛. 푖 푖 푗≠ 푖 푖푗 푗 푖푖 푖 ≤ For more information and examples see [BKVH07].

5.4 Exponential cone case studies

In this section we introduce some practical optimization problems where the exponential cone comes in handy.

45 5.4.1 Risk parity portfolio Consider a simple version of the Markowitz portfolio optimization problem introduced in Sec. 3.3.3, where we simply ask to minimize the risk 푟(푥) = √푥푇 Σ푥 of a fully-invested long-only portfolio:

minimize √푥푇 Σ푥 ∑︀푛 subject to 푖=1 푥푖 = 1, (5.19) 푥 0, 푖 = 1, . . . , 푛, 푖 ≥ where Σ is a symmetric positive definite covariance matrix. We can derive from the 휕푟 = 휕푟 first-order optimality conditions that the solution to(5.19) satisfies 휕푥푖 휕푥푗 whenever 푥푖, 푥푗 > 0, i.e. marginal risk contributions of positively invested assets are equal. In practice this often leads to concentrated portfolios, whereas it would benefit diversification to consider portfolios where all assets have the same total contribution to risk: 휕푟 휕푟 푥푖 = 푥푗 , 푖, 푗 = 1, . . . , 푛. (5.20) 휕푥푖 휕푥푗 We call (5.20) the risk parity condition. It indeed models equal risk contribution from all the assets, because as one can easily check 휕푟 = √(Σ푥)푖 and 휕푥푖 푥푇 Σ푥 ∑︁ 휕푟 푟(푥) = 푥 . 푖 휕푥 푖 푖 Risk parity portfolios satisfying condition (5.20) can be found with an auxiliary optimiza- tion problem:

푇 ∑︀ minimize √푥 Σ푥 푐 log 푥푖 − 푖 (5.21) subject to 푥푖 > 0, 푖 = 1, . . . , 푛, for any 푐 > 0. More precisely, the gradient of the objective function in (5.21) is zero 휕푟 = 푐/푥 푖 when 휕푥푖 푖 for all , implying the parity condition (5.20) holds. Since (5.20) is scale-invariant, we can rescale any solution of (5.21) and get a fully-invested risk parity ∑︀ portfolio with 푖 푥푖 = 1. The conic form of problem (5.21) is:

minimize 푡 푐푒푇 푠 − subject to (푡, Σ1/2푥) 푛+1, (푡 √푥푇 Σ푥), (5.22) (푥 , 1, 푠 ) ∈ 풬퐾 , (푠≥ log 푥 ). 푖 푖 ∈ exp 푖 ≤ 푖 5.4.2 Entropy maximization A general entropy maximization problem has the form ∑︀ maximize 푝푖 log 푝푖 − 푖 ∑︀ subject to 푝푖 = 1, 푖 (5.23) 푝푖 0, 푝 ≥ , ∈ ℐ where defines additional constraints on the probability distribution 푝 (these are known ℐ as prior information). In the absence of complete information about 푝 the maximum

46 entropy principle of Jaynes posits to choose the distribution which maximizes uncertainty, that is entropy, subject to what is known. Practitioners think of the solution to (5.23) as the most random or most conservative of distributions consistent with . ℐ Maximization of the entropy function 퐻(푥) = 푥 log 푥 was explained in Sec. 5.2. − Often one has an a priori distribution 푞, and one tries to minimize the distance between 푝 and 푞, while remaining consistent with . In this case it is standard to minimize the ℐ Kullback-Leiber divergence ∑︁ (푝 푞) = 푝 log 푝 /푞 풟퐾퐿 || 푖 푖 푖 푖

which leads to an optimization problem ∑︀ minimize 푝푖 log 푝푖/푞푖 푖 ∑︀ subject to 푝푖 = 1, 푖 (5.24) 푝푖 0, 푝 ≥ . ∈ ℐ 5.4.3 Hitting time of a linear system Consider a linear dynamical system

x′(푡) = 퐴x(푡) (5.25) where we assume for simplicity that 퐴 = Diag(푎1, . . . , 푎푛) with 푎푖 < 0 with initial condition x(0)푖 = 푥푖. The resulting dynamical system x(푡) = x(0) exp(퐴푡) converges to 0 and one can ask, for instance, for the time it takes to approach the limit up to distance 휀. The resulting optimization problem is

minimize 푡 √︁ ∑︀ 2 subject to (푥푖 exp(푎푖푡)) 휀, 푖 ≤

with the following conic form, where the variables are 푡, 푞1, . . . , 푞푛:

minimize 푡 푛+1 subject to (휀, 푥1푞1, . . . , 푥푛푞푛) , (푞 , 1, 푎 푡) ∈ 풬퐾 , 푖 = 1, . . . , 푛. 푖 푖 ∈ exp See Fig. 5.2 for an example. Other criteria for the target set of the trajectories are also possible. For example, polyhedral constraints

푇 푛 푐 x 푑, 푐 R+, 푑 R+ ≤ ∈ ∈ 푛 are also expressible in exponential conic form for starting points x(0) R+, since they ∈ correspond to log-sum-exp constraints of the form (︃ )︃ ∑︁ log exp(푎 x + log(푐 푥 )) log 푑. 푖 푖 푖 푖 ≤ 푖

For a robust version involving uncertainty on 퐴 and 푥(0) see [CS14].
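In the simple diagonal case the hitting time from Fig. 5.2 can also be recovered by one-dimensional root finding, since the distance to the origin is decreasing in $t$; the following SciPy-based check (assumed available) reproduces $t \approx 15.94$.

```python
# Sketch reproducing the hitting time in Fig. 5.2 by root finding rather than conic
# optimization: find the t at which ||x(0) * exp(a*t)||_2 = eps.
import numpy as np
from scipy.optimize import brentq

a = np.array([-0.3, -0.06])
x0 = np.array([2.2, 1.3])
eps = 0.5

dist = lambda t: np.linalg.norm(x0 * np.exp(a * t)) - eps
print(brentq(dist, 0.0, 100.0))   # ~15.94, matching the figure
```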

Fig. 5.2: With $A = \mathrm{Diag}(-0.3, -0.06)$ and starting point $x(0) = (2.2, 1.3)$ the trajectory reaches distance $\varepsilon = 0.5$ from the origin at time $t \approx 15.936$.

5.4.4 Logistic regression

Logistic regression is a technique for training a binary classifier. We are given a training set of examples $x_1, \ldots, x_n \in \mathbb{R}^d$ together with labels $y_i \in \{0, 1\}$. The goal is to train a classifier capable of assigning new data points to either class 0 or 1. Specifically, the labels should be assigned by computing

$$h_\theta(x) = \frac{1}{1 + \exp(-\theta^Tx)} \qquad (5.26)$$

and choosing label $y = 0$ if $h_\theta(x) < \tfrac12$ and $y = 1$ for $h_\theta(x) \ge \tfrac12$. Here $h_\theta(x)$ is interpreted as the probability that $x$ belongs to class 1. The optimal parameter vector $\theta$ should be learned from the training set, so as to maximize the likelihood function:

$$\prod_i h_\theta(x_i)^{y_i}(1 - h_\theta(x_i))^{1 - y_i}.$$

By taking logarithms and adding regularization with respect to $\theta$ we reach an unconstrained optimization problem

$$\mathop{\text{minimize}}_{\theta\in\mathbb{R}^d}\ \lambda\|\theta\|_2 + \sum_i\Bigl(-y_i\log(h_\theta(x_i)) - (1 - y_i)\log(1 - h_\theta(x_i))\Bigr). \qquad (5.27)$$

Problem (5.27) is convex, and can be more explicitly written as

$$\begin{array}{lll}
\text{minimize} & \sum_i t_i + \lambda r & \\
\text{subject to} & t_i \ge -\log(h_\theta(x_i)) = \log(1 + \exp(-\theta^Tx_i)) & \text{if } y_i = 1, \\
& t_i \ge -\log(1 - h_\theta(x_i)) = \log(1 + \exp(\theta^Tx_i)) & \text{if } y_i = 0, \\
& r \ge \|\theta\|_2, &
\end{array}$$

involving softplus type constraints (see Sec. 5.2) and a quadratic cone. See Fig. 5.3 for an example.
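Since (5.27) is a smooth unconstrained problem (away from $\theta = 0$), it can also be attacked with a general-purpose solver, which is a useful cross-check of the conic model. The sketch below uses NumPy/SciPy (assumed available) with a small synthetic training set; a conic solver would instead use the softplus and quadratic-cone constraints above.

```python
# Sketch of regularized logistic regression (5.27) as an unconstrained smooth problem.
# The training data below are synthetic stand-ins, not from any real dataset.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))                         # 100 examples, d = 3 features
y = (X @ np.array([1.5, -2.0, 0.5]) > 0).astype(float)    # labels in {0, 1}
lam = 0.1

def objective(theta):
    z = X @ theta
    # log(1 + exp(-z)) when y = 1 and log(1 + exp(z)) when y = 0, i.e. softplus((1 - 2y) * z)
    softplus = np.logaddexp(0.0, (1 - 2*y) * z)
    return softplus.sum() + lam * np.linalg.norm(theta)

theta_hat = minimize(objective, x0=np.full(3, 0.1)).x     # start away from the kink at 0
print(theta_hat)
```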

Fig. 5.3: Logistic regression example with none, medium and strong regularization (small, medium, large $\lambda$). The two-dimensional dataset was converted into a feature vector $x \in \mathbb{R}^{28}$ using monomial coordinates of degrees at most 6. Without regularization we get obvious overfitting.

Chapter 6

Semidefinite optimization

In this chapter we extend the conic optimization framework introduced before with sym- metric positive semidefinite matrix variables.

6.1 Introduction to semidefinite matrices

6.1.1 Semidefinite matrices and cones A symmetric matrix 푋 푛 is called symmetric positive semidefinite if ∈ 풮 푧푇 푋푧 0, 푧 R푛. ≥ ∀ ∈ We then define the cone of symmetric positive semidefinite matrices as

푛 푛 푇 푛 + = 푋 푧 푋푧 0, 푧 R . (6.1) 풮 { ∈ 풮 | ≥ ∀ ∈ } For brevity we will often use the shorter notion semidefinite instead of symmetric positive semidefinite, and we will write 푋 푌 (푋 푌 ) as shorthand notation for (푋 푌 ) 푛 ⪰ ⪯ − ∈ 풮+ ((푌 푋) 푛 ). As inner product for semidefinite matrices, we use the standard trace − ∈ 풮+ inner product for general matrices, i.e., ∑︁ 퐴, 퐵 := tr(퐴푇 퐵) = 푎 푏 . ⟨ ⟩ 푖푗 푖푗 푖푗

It is easy to see that (6.1) indeed specifies a convex cone; it is pointed (with origin 푋 = 0), and 푋, 푌 푛 implies that (훼푋 + 훽푌 ) 푛 , 훼, 훽 0. Let us review a ∈ 풮+ ∈ 풮+ ≥ few equivalent definitions of 푛 . It is well-known that every symmetric matrix 퐴 has a 풮+ spectral factorization

푛 ∑︁ 푇 퐴 = 휆푖푞푖푞푖 . 푖=1

푛 where 푞푖 R are the (orthogonal) eigenvectors and 휆푖 are eigenvalues of 퐴. Using the ∈ spectral factorization of 퐴 we have

푛 푇 ∑︁ 푇 2 푥 퐴푥 = 휆푖(푥 푞푖) , 푖=1

50 푇 which shows that 푥 퐴푥 0 휆푖 0, 푖 = 1, . . . , 푛. In other words, ≥ ⇔ ≥ 푛 푛 = 푋 휆푖(푋) 0, 푖 = 1, . . . , 푛 . (6.2) 풮+ { ∈ 풮 | ≥ } Another useful characterization is that 퐴 푛 if and only if it is a Grammian matrix ∈ 풮+ 퐴 = 푉 푇 푉 . Here 푉 is called the Cholesky factor of 퐴. Using the Grammian representation we have

푥푇 퐴푥 = 푥푇 푉 푇 푉 푥 = 푉 푥 2, ‖ ‖2 i.e., if 퐴 = 푉 푇 푉 then 푥푇 퐴푥 0 for all 푥. On the other hand, from the positive ≥ spectral factorization 퐴 = 푄Λ푄푇 we have 퐴 = 푉 푇 푉 with 푉 = Λ1/2푄푇 , where Λ1/2 = Diag(√휆1,..., √휆푛). We thus have the equivalent characterization

푛 푛 푇 푛×푛 + = 푋 + 푋 = 푉 푉 for some 푉 R . (6.3) 풮 { ∈ 풮 | ∈ } In a completely analogous way we define the cone of symmetric positive definite matrices as 푛 푛 푇 푛 ++ = 푋 푧 푋푧 > 0, 푧 R 풮 { ∈ 풮푛 | ∀ ∈ } = 푋 휆푖(푋) > 0, 푖 = 1, . . . , 푛 { ∈ 풮푛 | 푇 } 푛×푛 = 푋 + 푋 = 푉 푉 for some 푉 R , rank(푉 ) = 푛 , { ∈ 풮 | ∈ } 푛 푛 and we write 푋 푌 (푋 푌 ) as shorthand notation for (푋 푌 ) ++ ((푌 푋) ++). ≻ ≺ 1 − ∈ 풮 − ∈ 풮 The one dimensional cone + simply corresponds to R+. Similarly consider 풮 [︂ 푥 푥 ]︂ 푋 = 1 3 푥3 푥2

with determinant $\det(X) = x_1x_2 - x_3^2 = \lambda_1\lambda_2$ and trace $\mathrm{tr}(X) = x_1 + x_2 = \lambda_1 + \lambda_2$. Therefore $X$ has positive eigenvalues if and only if

$$x_1x_2 \ge x_3^2, \quad x_1, x_2 \ge 0,$$

which characterizes a three-dimensional scaled rotated cone, i.e.,

$$\begin{bmatrix} x_1 & x_3 \\ x_3 & x_2 \end{bmatrix} \in \mathcal{S}^2_+ \iff (x_1, x_2, x_3\sqrt{2}) \in \mathcal{Q}^3_r.$$

More generally we have

$$x \in \mathbb{R}^n_+ \iff \mathrm{Diag}(x) \in \mathcal{S}^n_+$$

and

$$(t, x) \in \mathcal{Q}^{n+1} \iff \begin{bmatrix} t & x^T \\ x & tI \end{bmatrix} \in \mathcal{S}^{n+1}_+,$$

where the latter equivalence follows immediately from Lemma 6.1. Thus both the linear and quadratic cone are embedded in the semidefinite cone. In practice, however, linear and quadratic cones should never be described using semidefinite constraints, which would result in a large performance penalty by squaring the number of variables.
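The $2\times 2$ equivalence above is easy to confirm numerically; the following NumPy check compares the eigenvalue test with the rotated-cone description on random data.

```python
# Small numerical illustration: [[x1, x3], [x3, x2]] is PSD exactly when x1, x2 >= 0 and
# x1*x2 >= x3^2, i.e. (x1, x2, x3*sqrt(2)) lies in a rotated quadratic cone.
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    x1, x2, x3 = rng.uniform(-1.0, 1.0, size=3)
    psd = np.linalg.eigvalsh([[x1, x3], [x3, x2]]).min() >= -1e-9
    in_cone = x1 >= -1e-9 and x2 >= -1e-9 and x1 * x2 - x3 ** 2 >= -1e-9
    assert psd == in_cone
print("2x2 PSD test agrees with the rotated-cone description")
```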

51 Example 6.1. As a more interesting example, consider the symmetric matrix

⎡ 1 푥 푦 ⎤ 퐴(푥, 푦, 푧) = ⎣ 푥 1 푧 ⎦ (6.4) 푦 푧 1 parametrized by (푥, 푦, 푧). The set

3 3 푆 = (푥, 푦, 푧) R 퐴(푥, 푦, 푧) + , { ∈ | ∈ 풮 } (shown in Fig. 6.1) is called a spectrahedron and is perhaps the simplest bounded semidefinite representable set, which cannot be represented using (finitely many) linear or quadratic cones. To gain a geometric intuition of 푆, we note that

det(퐴(푥, 푦, 푧)) = (푥2 + 푦2 + 푧2 2푥푦푧 1), − − − so the boundary of 푆 can be characterized as

푥2 + 푦2 + 푧2 2푥푦푧 = 1, − or equivalently as

[︂ 푥 ]︂푇 [︂ 1 푧 ]︂ [︂ 푥 ]︂ − = 1 푧2. 푦 푧 1 푦 − − For 푧 = 0 this describes a circle in the (푥, 푦)-plane, and for 1 푧 1 it characterizes − ≤ ≤ an ellipse (for a fixed 푧).

Fig. 6.1: Plot of spectrahedron 푆 = (푥, 푦, 푧) R3 퐴(푥, 푦, 푧) 0 . { ∈ | ⪰ }

52 6.1.2 Properties of semidefinite matrices Many useful properties of (semi)definite matrices follow directly from the definitions (6.1)-(6.3) and their definite counterparts.

푛 • The diagonal elements of 퐴 + are nonnegative. Let 푒푖 denote the 푖th standard ∈ 풮 푇 basis vector (i.e., [푒푖]푗 = 0, 푗 = 푖, [푒푖]푖 = 1). Then 퐴푖푖 = 푒 퐴푒푖, so (6.1) implies that ̸ 푖 퐴푖푖 0. ≥

• A block-diagonal matrix 퐴 = Diag(퐴1, . . . 퐴푝) is (semi)definite if and only if each diagonal block 퐴푖 is (semi)definite.

• Given a quadratic transformation 푀 := 퐵푇 퐴퐵, 푀 0 if and only if 퐴 0 ≻ ≻ and 퐵 has full rank. This follows directly from the Grammian characterization 푀 = (푉 퐵)푇 (푉 퐵). For 푀 0 we only require that 퐴 0. As an example, if 퐴 is ⪰ ⪰ (semi)definite then so is any permutation 푃 푇 퐴푃 .

• Any principal submatrix of 퐴 푛 (퐴 restricted to the same set of rows as columns) ∈ 풮+ is positive semidefinite; this follows by restricting the Grammian characterization 퐴 = 푉 푇 푉 to a submatrix of 푉 .

• The inner product of positive (semi)definite matrices is positive (nonnegative). For any $A, B \in \mathcal{S}^n_{++}$ let $A = U^TU$ and $B = V^TV$ where $U$ and $V$ have full rank. Then

$$\langle A, B\rangle = \mathrm{tr}(U^TUV^TV) = \|UV^T\|_F^2 > 0,$$

where strict positivity follows from the assumption that $U$ has full column-rank, i.e., $UV^T \ne 0$.

• The inverse of a positive definite matrix is positive definite. This follows from the positive spectral factorization $A = Q\Lambda Q^T$, which gives us

퐴−1 = 푄푇 Λ−1푄

† where Λ푖푖 > 0. If 퐴 is semidefinite then the pseudo-inverse 퐴 of 퐴 is semidefinite.

• Consider a matrix 푋 푛 partitioned as ∈ 풮

[︂ 퐴 퐵푇 ]︂ 푋 = . 퐵 퐶

Let us find necessary and sufficient conditions for 푋 0. We know that 퐴 0 and ≻ ≻ 퐶 0 (since any principal submatrix must be positive definite). Furthermore, we ≻ can simplify the analysis using a nonsingular transformation

[︂ 퐼 0 ]︂ 퐿 = 퐹 퐼

53 to diagonalize 푋 as 퐿푋퐿푇 = 퐷, where 퐷 is block-diagonal. Note that det(퐿) = 1, so 퐿 is indeed nonsingular. Then 푋 0 if and only if 퐷 0. Expanding 퐿푋퐿푇 = ≻ ≻ 퐷, we get

[︂ 푇 푇 ]︂ [︂ ]︂ 퐴 퐴퐹 + 퐵 퐷1 0 푇 푇 푇 = . 퐹 퐴 + 퐵 퐹 퐴퐹 + 퐹 퐵 + 퐵퐹 + 퐶 0 퐷2

Since det(퐴) = 0 (by assuming that 퐴 0) we see that 퐹 = 퐵퐴−1 and direct ̸ ≻ − substitution gives us [︂ 퐴 0 ]︂ [︂ 퐷 0 ]︂ = 1 . 0 퐶 퐵퐴−1퐵푇 0 퐷 − 2 In the last part we have thus established the following useful result. Lemma 6.1 (Schur complement lemma). A symmetric matrix [︂ 퐴 퐵푇 ]︂ 푋 = . 퐵 퐶 is positive definite if and only if

퐴 0, 퐶 퐵퐴−1퐵푇 0. ≻ − ≻

6.2 Semidefinite modeling

Having discussed different characterizations and properties of semidefinite matrices, we next turn to different functions and sets that can be modeled using semidefinite cones and variables. Most of those representations involve semidefinite matrix-valued affine functions, which we discuss next.

6.2.1 Linear matrix inequalities

Consider an affine matrix-valued mapping 퐴 : R푛 푚: ↦→ 풮

퐴(푥) = 퐴0 + 푥1퐴1 + + 푥푛퐴푛. (6.5) ··· A linear matrix inequality (LMI) is a constraint of the form

퐴0 + 푥1퐴1 + + 푥푛퐴푛 0 (6.6) ··· ⪰ 푛 푚 in the variable 푥 R with symmetric coefficients 퐴푖 , 푖 = 0, . . . , 푛. As a simple ∈ ∈ 풮 example consider the matrix in (6.4),

퐴(푥, 푦, 푧) = 퐴 + 푥퐴 + 푦퐴 + 푧퐴 0 0 1 2 3 ⪰ with ⎡ 0 1 0 ⎤ ⎡ 0 0 1 ⎤ ⎡ 0 0 0 ⎤ 퐴0 = 퐼, 퐴1 = ⎣ 1 0 0 ⎦ , 퐴2 = ⎣ 0 0 0 ⎦ , 퐴3 = ⎣ 0 0 1 ⎦ . 0 0 0 1 0 0 0 1 0

54 Alternatively, we can describe the linear matrix inequality 퐴(푥, 푦, 푧) 0 as ⪰ 푋 3 , 푥 = 푥 = 푥 = 1, ∈ 풮+ 11 22 33 i.e., as a semidefinite variable with fixed diagonal; these two alternative formulations illustrate the difference between primal and dual form of semidefinite problems, see Sec. 8.6.

6.2.2 Eigenvalue optimization Consider a symmetric matrix 퐴 푚 and let its eigenvalues be denoted by ∈ 풮 휆 (퐴) 휆 (퐴) 휆 (퐴). 1 ≥ 2 ≥ · · · ≥ 푚 A number of different functions of 휆푖 can then be described using a mix of linear and semidefinite constraints.

Sum of eigenvalues The sum of the eigenvalues corresponds to 푚 ∑︁ 휆푖(퐴) = tr(퐴). 푖=1

Largest eigenvalue

The largest eigenvalue can be characterized in epigraph form 휆1(퐴) 푡 as ≤ 푡퐼 퐴 0. (6.7) − ⪰ To verify this, suppose we have a spectral factorization 퐴 = 푄Λ푄푇 where 푄 is orthogonal and Λ is diagonal. Then 푡 is an upper bound on the largest eigenvalue if and only if 푄푇 (푡퐼 퐴)푄 = 푡퐼 Λ 0. − − ⪰ Thus we can minimize the largest eigenvalue of 퐴.

Smallest eigenvalue

The smallest eigenvalue can be described in hypograph form 휆푚(퐴) 푡 as ≥ 퐴 푡퐼, (6.8) ⪰ i.e., we can maximize the smallest eigenvalue of 퐴.

Eigenvalue spread

The eigenvalue spread can be modeled in epigraph form $\lambda_1(A) - \lambda_m(A) \le t$ by combining the two linear matrix inequalities in (6.7) and (6.8), i.e.,

$$\begin{array}{l}
zI \preceq A \preceq sI, \\
s - z \le t.
\end{array} \qquad (6.9)$$

Spectral radius

The spectral radius $\rho(A) := \max_i|\lambda_i(A)|$ can be modeled in epigraph form $\rho(A) \le t$ using two linear matrix inequalities

$$-tI \preceq A \preceq tI.$$

Condition number of a positive definite matrix

Suppose now that $A \in \mathcal{S}^m_+$. The condition number of a positive definite matrix can be minimized by noting that $\lambda_1(A)/\lambda_m(A) \le t$ if and only if there exists a $\mu > 0$ such that

$$\mu I \preceq A \preceq \mu tI,$$

or equivalently if and only if $I \preceq \mu^{-1}A \preceq tI$. If $A = A(x)$ is represented as in (6.5) then a change of variables $z := x/\mu$, $\nu := 1/\mu$ leads to a problem of the form

$$\begin{array}{ll}
\text{minimize} & t \\
\text{subject to} & I \preceq \nu A_0 + \sum_i z_iA_i \preceq tI,
\end{array} \qquad (6.10)$$

from which we recover the solution $x = z/\nu$. In essence, we first normalize the spectrum by the smallest eigenvalue, and then minimize the largest eigenvalue of the normalized linear matrix inequality. Compare Sec. 2.2.5.
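As an illustration of the eigenvalue models above, the following sketch minimizes the largest eigenvalue of an affine matrix function via the LMI (6.7). It uses CVXPY with an SDP-capable solver (assumed available); the coefficient matrices are invented for the example.

```python
# Sketch: minimize the largest eigenvalue of A(x) = A0 + x1*A1 + x2*A2 using t*I - A(x) >= 0.
import numpy as np
import cvxpy as cp

A0 = np.array([[2.0, 0.5, 0.0], [0.5, 1.0, 0.3], [0.0, 0.3, 3.0]])
A1 = np.diag([1.0, -1.0, 0.0])
A2 = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, -1.0]])

x = cp.Variable(2)
t = cp.Variable()
A = A0 + x[0] * A1 + x[1] * A2
prob = cp.Problem(cp.Minimize(t), [t * np.eye(3) - A >> 0])   # LMI (6.7)
prob.solve()
print(t.value, np.linalg.eigvalsh(A.value).max())             # equal at the optimum
```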

6.2.3 Log-determinant

Consider again a symmetric positive-definite matrix $A \in \mathcal{S}^m_+$. The determinant

$$\det(A) = \prod_{i=1}^{m}\lambda_i(A)$$

is neither convex nor concave, but $\log\det(A)$ is concave and we can write the inequality

푡 log det(퐴) ≤ in the form of the following problem: [︂ 퐴 푍 ]︂ 0, 푍푇 diag(푍) ⪰ (6.11) 푍 is lower triangular, 푡 ∑︀ log 푍 . ≤ 푖 푖푖 The equivalence of the two problems follows from Lemma 6.1 and subadditivity of deter- minant for semidefinite matrices. That is: 0 det(퐴 푍푇 diag(푍)−1푍) det(퐴) det(푍푇 diag(푍)−1푍) ≤ − =≤ det(퐴) − det(푍). − On the other hand the optimal value det(퐴) is attained for 푍 = 퐿퐷 if 퐴 = 퐿퐷퐿푇 is the LDL factorization of 퐴. The last inequality in problem (6.11) can of course be modeled using the exponential ∏︀ 1/푚 cone as in Sec. 5.2. Note that we can replace that bound with 푡 ( 푍푖푖) to get ≤ 푖 instead the model of 푡 det(퐴)1/푚 using Sec. 4.2.4. ≤ 56 6.2.4 Singular value optimization

We next consider a non-square matrix 퐴 R푚×푝. Assume 푝 푚 and denote the singular ∈ ≤ values of 퐴 by

휎 (퐴) 휎 (퐴) 휎 (퐴) 0. 1 ≥ 2 ≥ · · · ≥ 푝 ≥ The singular values are connected to the eigenvalues of 퐴푇 퐴 via

√︀ 푇 휎푖(퐴) = 휆푖(퐴 퐴), (6.12)

and if 퐴 is square and symmetric then 휎푖(퐴) = 휆푖(퐴) . We show next how to optimize | | several functions of the singular values.

Largest singular value

The epigraph 휎1(퐴) 푡 can be characterized using (6.12) as ≤ 퐴푇 퐴 푡2퐼, ⪯ which from Schur’s lemma is equivalent to [︂ 푡퐼 퐴 ]︂ 0. (6.13) 퐴푇 푡퐼 ⪰

The largest singular value 휎1(퐴) is also called the spectral norm or the ℓ2-norm of 퐴, 퐴 2 := 휎1(퐴). ‖ ‖ Sum of singular values

The trace norm or the nuclear norm of 푋 is the dual of the ℓ2-norm:

푇 푋 * = sup tr(푋 푍). (6.14) ‖ ‖ ‖푍‖2≤1 It turns out that the nuclear norm corresponds to the sum of the singular values, 푚 푛 ∑︁ ∑︁ √︀ 푇 푋 * = 휎푖(푋) = 휆푖(푋 푋), (6.15) ‖ ‖ 푖=1 푖=1 which is easy to verify using singular value decomposition 푋 = 푈Σ푉 푇 . We have sup tr(푋푇 푍) = sup tr(Σ푇 푈 푇 푍푉 ) ‖푍‖2≤1 ‖푍‖2≤1 = sup tr(Σ푇 푌 ) ‖푌 ‖2≤1 푝 푝 ∑︁ ∑︁ = sup 휎푖푦푖 = 휎푖. | |≤ 푦푖 1 푖=1 푖=1 which shows (6.15). Alternatively, we can express (6.14) as the solution to maximize tr(푋푇 푍) [︂ ]︂ 퐼 푍푇 (6.16) subject to 0, 푍 퐼 ⪰

57 with the dual problem (see Example 8.8)

minimize tr(푈 + 푉 )/2 [︂ ]︂ 푈 푋푇 (6.17) subject to 0. 푋 푉 ⪰

In other words, using strong duality we can characterize the epigraph 퐴 * 푡 with ‖ ‖ ≤ [︂ 푈 퐴푇 ]︂ 0, tr(푈 + 푉 )/2 푡. (6.18) 퐴 푉 ⪰ ≤ For a symmetric matrix the nuclear norm corresponds to the sum of absolute values of eigenvalues, and for a semidefinite matrix it simply corresponds to the trace of the matrix.

6.2.5 Matrix inequalities from Schur’s Lemma Several quadratic or quadratic-over-linear matrix inequalities follow immediately from Schur’s lemma. Suppose 퐴 : R푚×푝 and 퐵 : R푝×푝 are matrix variables. Then

퐴푇 퐵−1퐴 퐶 ⪯ if and only if [︂ 퐶 퐴푇 ]︂ 0. 퐴 퐵 ⪰

6.2.6 Nonnegative polynomials We next consider characterizations of polynomials constrained to be nonnegative on the real axis. To that end, consider a polynomial basis function

푣(푡) = (︀1, 푡, . . . , 푡2푛)︀ .

It is then well-known that a polynomial 푓 : R R of even degree 2푛 is nonnegative on ↦→ the entire real axis

푇 2푛 푓(푡) := 푥 푣(푡) = 푥0 + 푥1푡 + + 푥2푛푡 0, 푡 (6.19) ··· ≥ ∀ if and only if it can be written as a sum of squared polynomials of degree 푛 (or less), i.e., 푛+1 for some 푞1, 푞2 R ∈ 푇 2 푇 2 푛 푓(푡) = (푞1 푢(푡)) + (푞2 푢(푡)) , 푢(푡) := (1, 푡, . . . , 푡 ). (6.20)

It turns out that an equivalent characterization of 푥 푥푇 푣(푡) 0, 푡 can be given in { | ≥ ∀ } terms of a semidefinite variable 푋,

푛+1 푥푖 = 푋, 퐻푖 , 푖 = 0,..., 2푛, 푋 . (6.21) ⟨ ⟩ ∈ 풮+ 푛+1 where 퐻 R(푛+1)×(푛+1) are Hankel matrices 푖 ∈ {︂ 1, 푘 + 푙 = 푖, [퐻 ] = 푖 푘푙 0, otherwise.

58 When there is no ambiguity, we drop the superscript on 퐻푖. For example, for 푛 = 2 we have ⎡ 1 0 0 ⎤ ⎡ 0 1 0 ⎤ ⎡ 0 0 0 ⎤ 퐻0 = ⎣ 0 0 0 ⎦ , 퐻1 = ⎣ 1 0 0 ⎦ , . . . 퐻4 = ⎣ 0 0 0 ⎦ . 0 0 0 0 0 0 0 0 1 To verify that (6.19) and (6.21) are equivalent, we first note that

2푛 푇 ∑︁ 푢(푡)푢(푡) = 퐻푖푣푖(푡), 푖=0 i.e., ⎡ ⎤ ⎡ ⎤푇 ⎡ ⎤ 1 1 1 푡 . . . 푡푛 ⎢ 푡 ⎥ ⎢ 푡 ⎥ ⎢ 푡 푡2 . . . 푡푛+1 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ . ⎥ ⎢ . ⎥ = ⎢ . . . . ⎥ . ⎣ . ⎦ ⎣ . ⎦ ⎣ . . .. . ⎦ 푡푛 푡푛 푡푛 푡푛+1 . . . 푡2푛 Assume next that 푓(푡) 0. Then from (6.20) we have ≥ 푇 2 푇 2 푓(푡) = (푞1 푢(푡)) + (푞2 푢(푡)) 푇 푇 푇 = 푞1푞1 + 푞2푞2 , 푢(푡)푢(푡) ⟨ 2푛 ⟩ = ∑︀ 푞 푞푇 + 푞 푞푇 , 퐻 푣 (푡), 푖=0⟨ 1 1 2 2 푖⟩ 푖 푇 푇 푇 i.e., we have 푓(푡) = 푥 푣(푡) with 푥푖 = 푋, 퐻푖 , 푋 = (푞1푞 +푞2푞 ) 0. Conversely, assume ⟨ ⟩ 1 2 ⪰ that (6.21) holds. Then

$$f(t) = \sum_{i=0}^{2n}\langle H_i, X\rangle v_i(t) = \left\langle X, \sum_{i=0}^{2n}H_iv_i(t)\right\rangle = \langle X, u(t)u(t)^T\rangle \ge 0$$

since $X \succeq 0$. In summary, we can characterize the cone of nonnegative polynomials over the real axis as

$$K^n_\infty = \left\{x \in \mathbb{R}^{2n+1} \mid x_i = \langle X, H_i\rangle,\ i = 0, \ldots, 2n,\ X \in \mathcal{S}^{n+1}_+\right\}. \qquad (6.22)$$

Checking nonnegativity of a univariate polynomial thus corresponds to a semidefinite feasibility problem.
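The feasibility test can be set up directly from (6.21): the constraint $x_i = \langle X, H_i\rangle$ says that the $i$-th anti-diagonal of $X$ sums to $x_i$. The sketch below uses CVXPY with an SDP solver (assumed available) and tests the polynomial $(t^2 - 1)^2 = t^4 - 2t^2 + 1$, which is nonnegative on the real axis.

```python
# Sketch of the semidefinite feasibility test (6.21) for nonnegativity of a degree-2n polynomial.
import numpy as np
import cvxpy as cp

coef = np.array([1.0, 0.0, -2.0, 0.0, 1.0])   # x_0, ..., x_{2n} for (t^2 - 1)^2
n = (len(coef) - 1) // 2

X = cp.Variable((n + 1, n + 1), PSD=True)
constraints = [sum(X[k, i - k] for k in range(max(0, i - n), min(i, n) + 1)) == coef[i]
               for i in range(2 * n + 1)]      # i-th anti-diagonal of X sums to x_i

prob = cp.Problem(cp.Minimize(0), constraints)
prob.solve()
print(prob.status)   # 'optimal' here; infeasibility would certify f(t) < 0 for some t
```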

Nonnegativity on a finite interval As an extension we consider a basis function of degree 푛, 푣(푡) = (1, 푡, . . . , 푡푛).

A polynomial 푓(푡) := 푥푇 푣(푡) is then nonnegative on a subinterval 퐼 = [푎, 푏] R if and ⊂ only 푓(푡) can be written as a sum of weighted squares,

푇 2 푇 2 푓(푡) = 푤1(푡)(푞1 푢1(푡)) + 푤2(푡)(푞2 푢2(푡))

where 푤푖(푡) are polynomials nonnegative on [푎, 푏]. To describe the cone

푛 푛+1 푇 퐾푎,푏 = 푥 R 푓(푡) = 푥 푣(푡) 0, 푡 [푎, 푏] { ∈ | ≥ ∀ ∈ } we distinguish between polynomials of odd and even degree.

59 • Even degree. Let 푛 = 2푚 and denote

푚 푚−1 푢1(푡) = (1, 푡, . . . , 푡 ), 푢2(푡) = (1, 푡, . . . , 푡 ).

We choose 푤1(푡) = 1 and 푤2(푡) = (푡 푎)(푏 푡) and note that 푤2(푡) 0 on [푎, 푏]. − − ≥ Then 푓(푡) 0, 푡 [푎, 푏] if and only if ≥ ∀ ∈ 푇 2 푇 2 푓(푡) = (푞1 푢1(푡)) + 푤2(푡)(푞2 푢2(푡))

for some 푞1, 푞2, and an equivalent semidefinite characterization can be found as

푛 푛+1 푚 푚−1 푚−1 푚−1 퐾푎,푏 = 푥 R 푥푖 = 푋1, 퐻푖 + 푋2, (푎 + 푏)퐻 − 푎푏퐻 퐻 − , { ∈ | ⟨ ⟩ ⟨ 푖 1 − 푖 − 푖 2 ⟩ 푖 = 0, . . . , 푛, 푋 푚, 푋 푚−1 . 1 ∈ 풮+ 2 ∈ 풮+ } (6.23)

푚 • Odd degree. Let 푛 = 2푚 + 1 and denote 푢(푡) = (1, 푡, . . . , 푡 ). We choose 푤1(푡) = 푇 (푡 푎) and 푤2(푡) = (푏 푡). We then have that 푓(푡) = 푥 푣(푡) 0, 푡 [푎, 푏] if and − − ≥ ∀ ∈ only if

푓(푡) = (푡 푎)(푞푇 푢(푡))2 + (푏 푡)(푞푇 푢(푡))2 − 1 − 2

for some 푞1, 푞2, and an equivalent semidefinite characterization can be found as

푛 푛+1 푚 푚 푚 푚 퐾푎,푏 = 푥 R 푥푖 = 푋1, 퐻푖−1 푎퐻푖 + 푋2, 푏퐻푖 퐻푖−1 , { ∈ | ⟨ − ⟩ ⟨ − ⟩ (6.24) 푖 = 0, . . . , 푛, 푋 , 푋 푚 . 1 2 ∈ 풮+ } 6.2.7 Hermitian matrices Semidefinite optimization can be extended to complex-valued matrices. To that end,let 푛 denote the cone of Hermitian matrices of order 푛, i.e., ℋ × 푋 푛 푋 C푛 푛, 푋퐻 = 푋, (6.25) ∈ ℋ ⇐⇒ ∈ where superscript ’퐻’ denotes Hermitian (or complex) transposition. Then 푋 푛 if ∈ ℋ+ and only if

푧퐻 푋푧 = ( 푧 푖 푧)푇 ( 푋 + 푖 푋)( 푧 + 푖 푧) ℜ − ℑ ℜ ℑ ℜ ℑ [︂ 푧 ]︂푇 [︂ 푋 푋 ]︂ [︂ 푧 ]︂ = ℜ ℜ −ℑ ℜ 0, 푧 C푛. 푧 푋 푋 푧 ≥ ∀ ∈ ℑ ℑ ℜ ℑ In other words,

[︂ 푋 푋 ]︂ 푋 푛 ℜ −ℑ 2푛. (6.26) ∈ ℋ+ ⇐⇒ 푋 푋 ∈ 풮+ ℑ ℜ Note that (6.25) implies skew-symmetry of 푋, i.e., 푋 = 푋푇 . ℑ ℑ −ℑ

60 6.2.8 Nonnegative trigonometric polynomials As a complex-valued variation of the sum-of-squares representation we consider trigono- metric polynomials; optimization over cones of nonnegative trigonometric polynomials has several important engineering applications. Consider a trigonometric polynomial evaluated on the complex unit-circle 푛 ∑︁ −푖 푓(푧) = 푥0 + 2 ( 푥푖푧 ), 푧 = 1 (6.27) ℜ | | 푖=1 parametrized by 푥 R C푛. We are interested in characterizing the cone of trigonometric ∈ × polynomials that are nonnegative on the angular interval [0, 휋],

푛 푛 푛 ∑︁ −푖 푗푡 퐾0,휋 = 푥 R C 푥0 + 2 ( 푥푖푧 ) 0, 푧 = 푒 , 푡 [0, 휋] . { ∈ × | ℜ ≥ ∀ ∈ } 푖=1 Consider a complex-valued basis function

푣(푧) = (1, 푧, . . . , 푧푛).

The Riesz-Fejer Theorem states that a trigonometric polynomial 푓(푧) in (6.27) is non- 푛 푛+1 negative (i.e., 푥 퐾0,휋) if and only if for some 푞 C ∈ ∈ 푓(푧) = 푞퐻 푣(푧) 2. (6.28) | | Analogously to Sec. 6.2.6 we have a semidefinite characterization of the sum-of-squares representation, i.e., 푓(푧) 0, 푧 = 푒푗푡, 푡 [0, 2휋] if and only if ≥ ∀ ∈ 푛+1 푥푖 = 푋, 푇푖 , 푖 = 0, . . . , 푛, 푋 (6.29) ⟨ ⟩ ∈ ℋ+ 푛+1 where 푇 R(푛+1)×(푛+1) are Toeplitz matrices 푖 ∈ {︂ 1, 푘 푙 = 푖 [푇푖]푘푙 = − , 푖 = 0, . . . , 푛. 0, otherwise

When there is no ambiguity, we drop the superscript on 푇푖. For example, for 푛 = 2 we have ⎡ 1 0 0 ⎤ ⎡ 0 0 0 ⎤ ⎡ 0 0 0 ⎤ 푇0 = ⎣ 0 1 0 ⎦ , 푇1 = ⎣ 1 0 0 ⎦ , 푇2 = ⎣ 0 0 0 ⎦ . 0 0 1 0 1 0 1 0 0

To prove correctness of the semidefinite characterization, we first note that 푛 푛 퐻 ∑︁ ∑︁ 퐻 푣(푧)푣(푧) = 퐼 + (푇푖푣푖(푧)) + (푇푖푣푖(푧)) 푖=1 푖=1 i.e., ⎡ ⎤ ⎡ ⎤퐻 ⎡ ⎤ 1 1 1 푧−1 . . . 푧−푛 ⎢ 푧 ⎥ ⎢ 푧 ⎥ ⎢ 푧 1 . . . 푧1−푛 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ . ⎥ ⎢ . ⎥ = ⎢ . . . . ⎥ . ⎣ . ⎦ ⎣ . ⎦ ⎣ . . .. . ⎦ 푧푛 푧푛 푧푛 푧푛−1 ... 1

61 Next assume that (6.28) is satisfied. Then 푓(푧) = 푞푞퐻 , 푣(푧)푣(푧)퐻 ⟨ 퐻 퐻 ⟩∑︀푛 퐻 ∑︀푛 푇 = 푞푞 , 퐼 + 푞푞 , 푖=1 푇푖푣푖(푧) + 푞푞 , 푖=1 푇푖 푣푖(푧) ⟨ 퐻 ⟩ ⟨∑︀푛 퐻 ⟩ ∑︀⟨ 푛 퐻 푇 ⟩ = 푞푞 , 퐼 + 푖=1 푞푞 , 푇푖 푣푖(푧) + 푖=1 푞푞 , 푇푖 푣푖(푧) = 푥⟨ + 2 ⟩(∑︀푛 푥⟨푣 (푧)) ⟩ ⟨ ⟩ 0 ℜ 푖=1 푖 푖 퐻 with 푥푖 = 푞푞 , 푇푖 . Conversely, assume that (6.29) holds. Then ⟨ ⟩ 푛 푛 ∑︁ ∑︁ 푓(푧) = 푋, 퐼 + 푋, 푇 푣 (푧) + 푋, 푇 푇 푣 (푧) = 푋, 푣(푧)푣(푧)퐻 0. ⟨ ⟩ ⟨ 푖⟩ 푖 ⟨ 푖 ⟩ 푖 ⟨ ⟩ ≥ 푖=1 푖=1 In other words, we have shown that

푛 푛 푛+1 퐾0,휋 = 푥 R C 푥푖 = 푋, 푇푖 , 푖 = 0, . . . , 푛, 푋 + . (6.30) { ∈ × | ⟨ ⟩ ∈ ℋ } Nonnegativity on a subinterval We next sketch a few useful extensions. An extension of the Riesz-Fejer Theorem states that a trigonometric polynomial 푓(푧) of degree 푛 is nonnegative on 퐼(푎, 푏) = 푧 푧 = { | 푒푗푡, 푡 [푎, 푏] [0, 휋] if and only if it can be written as a weighted sum of squared ∈ ⊆ } trigonometric polynomials

푓(푧) = 푓 (푧) 2 + 푔(푧) 푓 (푧) 2 | 1 | | 2 |

where 푓1, 푓2, 푔 are trigonemetric polynomials of degree 푛, 푛 푑 and 푑, respectively, and − 푔(푧) 0 on 퐼(푎, 푏). For example 푔(푧) = 푧 + 푧−1 2 cos 훼 is nonnegative on 퐼(0, 훼), and ≥ − it can be verified that 푓(푧) 0, 푧 퐼(0, 훼) if and only if ≥ ∀ ∈ 푥 = 푋 , 푇 푛+1 + 푋 , 푇 푛 + 푋 , 푇 푛 2 cos(훼) 푋 , 푇 푛 , 푖 ⟨ 1 푖 ⟩ ⟨ 2 푖+1⟩ ⟨ 2 푖−1⟩ − ⟨ 2 푖 ⟩ 푛+1 푛 for 푋1 , 푋2 , i.e., ∈ ℋ+ ∈ ℋ+ 푛 푛 푛+1 푛 푛 퐾0,훼 = 푥 R C 푥푖 = 푋1, 푇푖 + 푋2, 푇푖+1 + 푋2, 푇푖−1 { ∈ × | ⟨ ⟩ ⟨ ⟩ ⟨ ⟩ (6.31) 2 cos(훼) 푋 , 푇 푛 , 푋 푛+1, 푋 푛 . − ⟨ 2 푖 ⟩ 1 ∈ ℋ+ 2 ∈ ℋ+} Similarly 푓(푧) 0, 푧 퐼(훼, 휋) if and only if ≥ ∀ ∈ 푥 = 푋 , 푇 푛+1 + 푋 , 푇 푛 + 푋 , 푇 푛 2 cos(훼) 푋 , 푇 푛 푖 ⟨ 1 푖 ⟩ ⟨ 2 푖+1⟩ ⟨ 2 푖−1⟩ − ⟨ 2 푖 ⟩ i.e., 푛 푛 푛+1 푛 푛 퐾훼,휋 = 푥 R C 푥푖 = 푋1, 푇푖 + 푋2, 푇푖+1 + 푋2, 푇푖−1 { ∈ × | ⟨ ⟩ ⟨ ⟩ ⟨ ⟩ (6.32) +2 cos(훼) 푋 , 푇 푛 , 푋 푛+1, 푋 푛 . ⟨ 2 푖 ⟩ 1 ∈ ℋ+ 2 ∈ ℋ+} 6.3 Semidefinite optimization case studies

6.3.1 Nearest correlation matrix We consider the set

푆 = 푋 푛 푋 = 1, 푖 = 1, . . . , 푛 { ∈ 풮+ | 푖푖 } 62 (shown in Fig. 6.1 for 푛 = 3). For 퐴 푛 the nearest correlation matrix is ∈ 풮 ⋆ 푋 = arg min 퐴 푋 퐹 , 푋∈푆 ‖ − ‖ i.e., the projection of 퐴 onto the set 푆. To pose this as a conic optimization we define the linear operator

svec(푈) = (푈11, √2푈21,..., √2푈푛1, 푈22, √2푈32,..., √2푈푛2, . . . , 푈푛푛),

which extracts and scales the lower-triangular part of $U$. We then get a conic formulation of the nearest correlation problem exploiting symmetry of $A - X$:

$$\begin{array}{ll}
\text{minimize} & t \\
\text{subject to} & \|z\|_2 \le t, \\
& \mathrm{svec}(A - X) = z, \\
& \mathrm{diag}(X) = e, \\
& X \succeq 0.
\end{array} \qquad (6.33)$$

This is an example of a problem with both conic quadratic and semidefinite constraints in primal form. We can add different constraints to the problem, for example a bound $\gamma$ on the smallest eigenvalue by replacing $X \succeq 0$ with $X \succeq \gamma I$.

6.3.2 Extremal ellipsoids

Given a polytope we can find the largest ellipsoid contained in the polytope, or the smallest ellipsoid containing the polytope (for certain representations of the polytope).

Maximal inscribed ellipsoid Consider a polytope

푛 푇 푆 = 푥 R 푎푖 푥 푏푖, 푖 = 1, . . . , 푚 . { ∈ | ≤ } The ellipsoid

:= 푥 푥 = 퐶푢 + 푑, 푢 1 ℰ { | ‖ ‖ ≤ } is contained in 푆 if and only if

푇 max 푎푖 (퐶푢 + 푑) 푏푖, 푖 = 1, . . . , 푚 ‖푢‖2≤1 ≤ or equivalently, if and only if

퐶푎 + 푎푇 푑 푏 , 푖 = 1, . . . , 푚. ‖ 푖‖2 푖 ≤ 푖 Since Vol( ) det(퐶)1/푛 the maximum-volume inscribed ellipsoid is the solution to ℰ ≈ maximize det(퐶) 푇 subject to 퐶푎푖 2 + 푎푖 푑 푏푖, 푖 = 1, . . . , 푚, 퐶‖ ‖0. ≤ ⪰ In Sec. 6.2.2 we show how to maximize the determinant of a positive definite matrix.

63 Minimal enclosing ellipsoid Next consider a polytope given as the convex hull of a set of points,

′ 푛 푆 = conv 푥1, 푥2, . . . , 푥푚 , 푥푖 R . { } ∈ The ellipsoid

′ := 푥 푃 푥 푐 1 ℰ { | ‖ − ‖2 ≤ } has Vol( ′) det(푃 )−1/푛, so the minimum-volume enclosing ellipsoid is the solution to ℰ ≈ maximize det(푃 ) subject to 푃 푥푖 푐 2 1, 푖 = 1, . . . , 푚, 푃‖ 0−. ‖ ≤ ⪰

Fig. 6.2: Example of inner and outer ellipsoidal approximations of a pentagon in R2.

6.3.3 Polynomial curve-fitting Consider a univariate polynomial of degree 푛,

푓(푡) = 푥 + 푥 푡 + 푥 푡2 + + 푥 푡푛. 0 1 2 ··· 푛 Often we wish to fit such a polynomial to a given set of measurements or control points

(푡 , 푦 ), (푡 , 푦 ),..., (푡 , 푦 ) , { 1 1 2 2 푚 푚 } i.e., we wish to determine coefficients 푥푖, 푖 = 0, . . . , 푛 such that

푓(푡 ) 푦 , 푗 = 1, . . . , 푚. 푗 ≈ 푗 To that end, define the Vandermonde matrix

⎡ 2 푛 ⎤ 1 푡1 푡1 . . . 푡1 ⎢ 1 푡 푡2 . . . 푡푛 ⎥ ⎢ 2 2 2 ⎥ 퐴 = ⎢ . . . . ⎥ . ⎣ . . . . ⎦ 2 푛 1 푡푚 푡푚 . . . 푡푚

64 We can then express the desired curve-fit compactly as

$$Ax \approx y,$$

i.e., as a linear expression in the coefficients $x$. When the degree of the polynomial equals the number of measurements, $n = m$, the matrix $A$ is square and non-singular (provided there are no duplicate rows), so we can solve

퐴푥 = 푦

to find a polynomial that passes through all the control points (푡푖, 푦푖). Similarly, if 푛 > 푚 there are infinitely many solutions satisfying the underdetermined system 퐴푥 = 푦.A typical choice in that case is the least-norm solution

푥ln = arg min 푥 2 퐴푥=푦 ‖ ‖ which (assuming again there are no duplicate rows) equals

푇 푇 −1 푥ln = 퐴 (퐴퐴 ) 푦.

On the other hand, if 푛 < 푚 we generally cannot find a solution to the overdetermined system 퐴푥 = 푦, and we typically resort to a least-squares solution

푥 = arg min 퐴푥 푦 ls ‖ − ‖2 which is given by the formula (see Sec. 8.5.1)

푇 −1 푇 푥ls = (퐴 퐴) 퐴 푦.

In the following we discuss how the semidefinite characterizations of nonnegative poly- nomials (see Sec. 6.2.6) lead to more advanced and useful polynomial curve-fitting con- straints.

• Nonnegativity. One possible constraint is nonnegativity on an interval,

푓(푡) := 푥 + 푥 푡 + + 푥 푡푛 0, 푡 [푎, 푏] 0 1 ··· 푛 ≥ ∀ ∈ with a semidefinite characterization embedded in 푥 퐾푛 , see (6.23). ∈ 푎,푏 • Monotonicity. We can ensure monotonicity of 푓(푡) by requiring that 푓 ′(푡) 0 (or ≥ 푓 ′(푡) 0), i.e., ≤ (푥 , 2푥 , . . . , 푛푥 ) 퐾푛−1, 1 2 푛 ∈ 푎,푏 or

(푥 , 2푥 , . . . , 푛푥 ) 퐾푛−1, − 1 2 푛 ∈ 푎,푏 respectively.

65 • Convexity or concavity. Convexity (or concavity) of 푓(푡) corresponds to 푓 ′′(푡) 0 ≥ (or 푓 ′′(푡) 0), i.e., ≤ (2푥 , 6푥 ,..., (푛 1)푛푥 ) 퐾푛−2, 2 3 − 푛 ∈ 푎,푏 or (2푥 , 6푥 ,..., (푛 1)푛푥 ) 퐾푛−2, − 2 3 − 푛 ∈ 푎,푏 respectively. As an example, we consider fitting a smooth polynomial 푓 (푡) = 푥 + 푥 푡 + + 푥 푡푛 푛 0 1 ··· 푛 to the points ( 1, 1), (0, 0), (1, 1) , where smoothness is implied by bounding 푓 ′ (푡) . { − } | 푛 | More specifically, we wish to solve the problem minimize 푧 ′ subject to 푓푛(푡) 푧, 푡 [ 1, 1] 푓| ( 1)| ≤ = 1,∀ 푓 ∈(0)− = 0, 푓 (1) = 1, 푛 − 푛 푛 or equivalently minimize 푧 ′ subject to 푧 푓푛(푡) 0, 푡 [ 1, 1] ′− ≥ ∀ ∈ − 푓푛(푡) 푧 0, 푡 [ 1, 1] 푓 ( 1)− =≥ 1, 푓 ∀(0)∈ =− 0, 푓 (1) = 1. 푛 − 푛 푛 푛 Finally, we use the characterizations 퐾푎,푏 to get a conic problem minimize 푧 푛−1 subject to (푧 푥1, 2푥2,..., 푛푥푛) 퐾−1,1 − − − ∈− (푥 푧, 2푥 , . . . , 푛푥 ) 퐾푛 1 1 − 2 푛 ∈ −1,1 푓 ( 1) = 1, 푓 (0) = 0, 푓 (1) = 1. 푛 − 푛 푛 In Fig. 6.3 we show the graphs for the resulting polynomails of degree 2, 4 and 8, re- spectively. The second degree polynomial is uniquely determined by the three constraints 2 푓2( 1) = 1, 푓2(0) = 0, 푓2(1) = 1, i.e., 푓2(푡) = 푡 . Also, we obviously have a lower bound − ′ on the largest derivative max ∈ − 푓 (푡) 1. The computed fourth degree polynomial 푡 [ 1,1] | 푛 | ≥ is given by 3 1 푓 (푡) = 푡2 푡4 4 2 − 2 after rounding coefficients to rational numbers. Furthermore, the largest derivative is given by

′ 푓4(1/√2) = √2,

′′ and 푓4 (푡) < 0 on (1/√2, 1] so, although not visibly clear, the polynomial is nonconvex on [ 1, 1]. In Fig. 6.4 we show the graphs of the corresponding polynomials where we − added a convexity constraint 푓 ′′(푡) 0, i.e., 푛 ≥ (2푥 , 6푥 ,..., (푛 1)푛푥 ) 퐾푛−2 . 2 3 − 푛 ∈ −1,1

Fig. 6.3: Graph of univariate polynomials of degree 2, 4, and 8, respectively, passing through $\{(-1, 1), (0, 0), (1, 1)\}$. The higher-degree polynomials are increasingly smoother on $[-1, 1]$.

In this case, we get

$$f_4(t) = \frac{6}{5}t^2 - \frac{1}{5}t^4$$

and the largest derivative increases to $\frac{8}{5}$.

Fig. 6.4: Graph of univariate polynomials of degree 2, 4, and 8, respectively, passing through $\{(-1, 1), (0, 0), (1, 1)\}$. The polynomials all have positive second derivative (i.e., they are convex) on $[-1, 1]$ and the higher-degree polynomials are increasingly smoother on that interval.

6.3.4 Filter design problems Filter design is an important application of optimization over trigonometric polynomials in signal processing. We consider a trigonometric polynomial ∑︀푛 −푗휔푘 퐻(휔) = 푥0 + 2 ( 푘=1 푥푘푒 ) ℜ∑︀푛 = 푎0 + 2 푘=1 (푎푘 cos(휔푘) + 푏푘 sin(휔푘))

where 푎푘 := (푥푘), 푏푘 := (푥푘). If the function 퐻(휔) is nonnegative we call it a transfer ℜ ℑ function, and it describes how different harmonic components of a periodic discrete signal are attenuated when a filter with transfer function 퐻(휔) is applied to the signal.

67 We often wish a transfer function where 퐻(휔) 1 for 0 휔 휔푝 and 퐻(휔) 0 for ≈ ≤ ≤ ≈ 휔푠 휔 휋 for given constants 휔푝, 휔푠. One possible formulation for achieving this is ≤ ≤ minimize 푡 subject to 0 퐻(휔) 휔 [0, 휋] ≤ ∀ ∈ 1 훿 퐻(휔) 1 + 훿 휔 [0, 휔푝] − ≤ 퐻(휔) ≤ 푡 ∀휔 ∈ [휔 , 휋], ≤ ∀ ∈ 푠 which corresponds to minimizing 퐻(푤) on the interval [휔푠, 휋] while allowing 퐻(푤) to depart from unity by a small amount 훿 on the interval [0, 휔푝]. Using the results from Sec. 6.2.8 (in particular (6.30), (6.31) and (6.32)), we can pose this as a conic optimization problem minimize 푡 푛 subject to 푥 퐾0,휋, ∈ 푛 (푥0 (1 훿), 푥1:푛) 퐾 , (6.34) − − ∈ 0,휔푝 (푥 (1 + 훿), 푥 ) 퐾푛 , − 0 − 1:푛 ∈ 0,휔푝 (푥 푡, 푥 ) 퐾푛 , − 0 − 1:푛 ∈ 휔푠,휋 which is a semidefinite optimization problem. In Fig. 6.5 we show 퐻(휔) obtained by solving (6.34) for 푛 = 10, 훿 = 0.05, 휔푝 = 휋/4 and 휔푠 = 휔푝 + 휋/8.

Fig. 6.5: Plot of lowpass filter transfer function.

6.3.5 Relaxations of binary optimization

Semidefinite optimization is also useful for computing bounds on difficult non-convex or combinatorial optimization problems. For example consider the binary linear optimization problem

$$\begin{array}{ll}
\text{minimize} & c^Tx \\
\text{subject to} & Ax = b, \\
& x \in \{0, 1\}^n.
\end{array} \qquad (6.35)$$

In general, problem (6.35) is a very difficult non-convex problem where we have to explore $2^n$ different objectives. Alternatively, we can use semidefinite optimization to get a lower bound on the optimal solution with polynomial complexity. We first note that

$$x_i \in \{0, 1\} \iff x_i^2 = x_i,$$

푋 = 푥푥푇 , diag(푋) = 푥.

By relaxing the rank 1 constraint on 푋 we get a semidefinite relaxation of (6.35),

minimize 푐푇 푥 subject to 퐴푥 = 푏, (6.36) diag(푋) = 푥, 푋 푥푥푇 , ⪰ where we note that [︂ 1 푥푇 ]︂ 푋 푥푥푇 0. ⪰ ⇐⇒ 푥 푋 ⪰

Since (6.36) is a semidefinite optimization problem, it can be solved very efficiently. Suppose 푥⋆ is an optimal solution for (6.35); then (푥⋆, 푥⋆(푥⋆)푇 ) is also feasible for (6.36), but the feasible set for (6.36) is larger than the feasible set for (6.35), so in general the optimal solution of (6.36) serves as a lower bound. However, if the optimal solution 푋⋆ of (6.36) has rank 1 we have found a solution to (6.35) also. The semidefinite relaxation can also be used in a branch-bound mixed-integer exact algorithm for (6.35). We can tighten (or improve) the relaxation (6.36) by adding other constraints that cut away parts of the feasible set, without excluding rank 1 solutions. By tightening the relaxation, we reduce the gap between the optimal values of the original problem and the relaxation. For example, we can add the constraints

0 푋푖푗 1, 푖 = 1, . . . , 푛, 푗 = 1, . . . , 푛, 푋≤ 푋≤, 푖 = 1, . . . , 푛, 푗 = 1, . . . , 푛, 푖푖 ≥ 푖푗 and so on. This will usually have a dramatic impact on solution times and memory requirements. Already constraining a semidefinite matrix to be doubly nonnegative (푋푖푗 ≥ 0) introduces additional 푛2 linear inequality constraints.

6.3.6 Relaxations of boolean optimization Similarly to Sec. 6.3.5 we can use semidefinite relaxations of boolean constraints 푥 ∈ 1, +1 푛. To that end, we note that {− } 푥 1, +1 푛 푋 = 푥푥푇 , diag(푋) = 푒, (6.37) ∈ {− } ⇐⇒ with a semidefinite relaxation 푋 푥푥푇 of the rank-1 constraint. ⪰ As a (standard) example of a combinatorial problem with boolean constraints, we consider an undirected graph 퐺 = (푉, 퐸) described by a set of vertices 푉 = 푣1, . . . , 푣푛 { } and a set of edges 퐸 = (푣푖, 푣푗) 푣푖, 푣푗 푉, 푖 = 푗 , and we wish to find the cut of maximum { | ∈ ̸ } capacity. A cut 퐶 partitions the nodes 푉 into two disjoint sets 푆 and 푇 = 푉 푆, and ∖ the cut-set 퐼 is the set of edges with one node in 푆 and another in 푇 , i.e.,

$$I = \{(u, v) \in E \mid u \in S,\ v \in T\}.$$

The capacity of a cut is then defined as the number of edges in the cut-set, $|I|$.

Fig. 6.6: Undirected graph. The cut $\{v_2, v_4, v_5\}$ has capacity 9 (thick edges).

where $n = |V|$, and let

$$x_i = \begin{cases} +1, & v_i \in S, \\ -1, & v_i \notin S. \end{cases}$$

Suppose $v_i \in S$. Then $1 - x_ix_j = 0$ if $v_j \in S$ and $1 - x_ix_j = 2$ if $v_j \notin S$, so we get an expression for the capacity as

$$c(x) = \frac14\sum_{ij}(1 - x_ix_j)A_{ij} = \frac14 e^TAe - \frac14 x^TAx,$$

and discarding the constant term $e^TAe$ gives us the MAX-CUT problem

$$\begin{array}{ll}
\text{minimize} & x^TAx \\
\text{subject to} & x \in \{-1, +1\}^n,
\end{array} \qquad (6.38)$$

with a semidefinite relaxation

$$\begin{array}{ll}
\text{minimize} & \langle A, X\rangle \\
\text{subject to} & \mathrm{diag}(X) = e, \\
& X \succeq 0.
\end{array} \qquad (6.39)$$
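The relaxation (6.39) is straightforward to set up with a modeling tool; the sketch below uses CVXPY with an SDP solver (assumed available) on an invented random graph, and adds a crude hyperplane-rounding step to extract a cut from the solution $X^\star$.

```python
# Sketch of the MAX-CUT relaxation (6.39) for a small random graph, with sign rounding.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(3)
n = 8
A = np.triu((rng.random((n, n)) < 0.4).astype(float), 1)
A = A + A.T                                              # symmetric 0/1 adjacency matrix

X = cp.Variable((n, n), PSD=True)
prob = cp.Problem(cp.Minimize(cp.trace(A @ X)), [cp.diag(X) == 1])
prob.solve()

w, Q = np.linalg.eigh(X.value)
V = Q * np.sqrt(np.clip(w, 0, None))                     # X ~ V V^T; rows of V are the relaxed vectors
x = np.sign(V @ rng.standard_normal(n))                  # random-hyperplane rounding
print(0.25 * (A.sum() - prob.value),                     # upper bound on the maximum cut capacity
      0.25 * (A.sum() - x @ A @ x))                      # capacity of the rounded cut
```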

Chapter 7

Practical optimization

In this chapter we discuss various practical aspects of creating optimization models.

7.1 Conic reformulations

Many nonlinear constraint families were reformulated to conic form in previous chapters, and these examples are often sufficient to construct conic models for the optimization problems at hand. In case you do need to go beyond the cookbook examples, however, a few composition rules are worth knowing to guarantee correctness of your reformulation.

7.1.1 Perspective functions The perspective of a function 푓(푥) is defined as 푠푓(푥/푠) on 푠 > 0. From any conic representation of 푡 푓(푥) one can reach a representation of 푡 푠푓(푥/푠) by exchanging ≥ ≥ all constants 푘 with their homogenized counterpart 푘푠.

Example 7.1. In Sec. 5.2.6 it was shown that the logarithm of sum of exponentials

푡 log(푒푥1 + + 푒푥푛 ), ≥ ··· can be modeled as: ∑︀ 푢푖 1, (푢 , 1, 푥 푡) ≤ 퐾 , 푖 = 1, . . . , 푛. 푖 푖 − ∈ exp It follows immediately that its perspective

푡 푠 log(푒푥1/푠 + + 푒푥푛/푠), ≥ ··· can be modeled as: ∑︀ 푢푖 푠, (푢 , 푠, 푥 푡) ≤ 퐾 , 푖 = 1, . . . , 푛. 푖 푖 − ∈ exp

Proof. If [푡 푓(푥)] [푡 푐(푥, 푟) + 푐0, 퐴(푥, 푟) + 푏 퐾] for some linear functions ≥ ⇔ ≥ ∈ (︀ 푐 and 퐴, a cone 퐾, and artificial variable 푟, then [푡 푠푓(푥/푠)] [푡 푠 푐(푥/푠, 푟) + ≥ ⇔ ≥ 71 )︀ 푐0 , 퐴(푥/푠, 푟) + 푏 퐾] [푡 푐(푥, 푟¯) + 푐0푠, 퐴(푥, 푟¯) + 푏푠 퐾], where 푟¯ is taken as a new ∈ ⇔ ≥ ∈ artificial variable in place of 푟푠.

7.1.2 Composite functions In representing composites 푓(푔(푥)), it is often helpful to move the inner function 푔(푥) to a separate constraint. That is, we would like 푡 푓(푔(푥)) to be reformulated as 푡 푓(푟) ≥ ≥ for a new variable 푟 constrained as 푟 = 푔(푥). This reformulation works directly as stated if 푔 is affine; that is, 푔(푥) = 푎푇 푥 + 푏. Otherwise, one of two possible relaxations may be considered:

1. If 푓 is convex and 푔 is convex, then 푡 푓(푟), 푟 푔(푥) is equivalent to 푡 {︃ ≥ ≥ ≥ 푓(푔(푥)) if 푔(푥) +, ∈ ℱ 푓min otherwise ,

2. If 푓 is convex and 푔 is concave, then 푡 푓(푟), 푟 푔(푥) is equivalent to 푡 {︃ ≥ ≤ ≥ 푓(푔(푥)) if 푔(푥) −, ∈ ℱ 푓min otherwise ,

where + (resp. −) is the domain on which 푓 is nondecreasing (resp. nonincreasing), ℱ ℱ and 푓min is the minimum of 푓, if finite, and otherwise. Note that Relaxation 1 −∞ provides an exact representation if 푔(푥) + for all 푥 in domain, and likewise for ∈ ℱ Relaxation 2 if 푔(푥) − for all 푥 in domain. ∈ ℱ Proof. Consider Relaxation 1. Intuitively, if 푟 +, you can expect clever solvers to ∈ ℱ drive 푟 towards in attempt to relax 푡 푓(푟), and this push continues until 푟 푔(푥) −∞ ≥+ ≥ becomes active or else — in case 푔(푥) — until 푓min is reached. On the other hand, − − ̸∈ ℱ if 푟 (and thus 푟 푔(푥) ) then solvers will drive 푟 towards + until 푓min is ∈ ℱ ≥ ∈ ℱ ∞ reached.

Example 7.2. Here are some simple substitutions.

• The inequality 푡 exp(푔(푥)) can be rewritten as 푡 exp(푟) and 푟 푔(푥) ≥ ≥ ≥ for all convex functions 푔(푥). This is because 푓(푟) = exp(푟) is nondecreasing everywhere.

• The inequality 2푠푡 푔(푥)2 can be rewritten as 2푠푡 푟2 and 푟 푔(푥) for all ≥ ≥ ≥ nonnegative convex functions 푔(푥). This is because 푓(푟) = 푟2 is nondecreasing on 푟 0. ≥ • The nonconvex inequality 푡 (푥2 1)2 can be relaxed as 푡 푟2 and 푟 푥2 1. ≥ − ≥ ≥ − This relaxation is exact on 푥2 1 0 where 푓(푟) = 푟2 is nondecreasing, i.e., on − ≥ the disjoint domain 푥 1. On the domain 푥 1 it represents the strongest | | ≥ | | ≤ possible convex cut given by 푡 푓min = 0. ≥

72 Example 7.3. Some amount of work is generally required to find the correct reformu- lation of a nested nonlinear constraint. For instance, suppose that for 푥, 푦, 푧 0 with ≥ 푥푦푧 > 1 we want to write 1 푡 . ≥ 푥푦푧 1 − A natural first attempt is: 1 푡 , 푟 푥푦푧 1, ≥ 푟 ≤ − which corresponds to Relaxation 2 with 푓(푟) = 1/푟 and 푔(푥, 푦, 푧) = 푥푦푧 1 > 0. The − function 푓 is indeed convex and nonincreasing on all of 푔(푥, 푦, 푧), and the inequality 푡푟 1 is moreover representable with a rotated quadratic cone. Unfortunately 푔 is ≥ not concave. We know that a monomial like 푥푦푧 appears in connection with the power cone, but that requires a homogeneous constraint such as 푥푦푧 푢3. This gives us an ≥ idea to try 1 푡 , 푟3 푥푦푧 1, ≥ 푟3 ≤ − which is 푓(푟) = 1/푟3 and 푔(푥, 푦, 푧) = (푥푦푧 1)1/3 > 0. This provides the right balance: − all conditions for validity and exactness of Relaxation 2 are satisfied. Introducing another variable 푢 we get the following model:

푡푟3 1, 푥푦푧 푢3, 푢 (푟3 + 13)1/3, ≥ ≥ ≥ We refer to Sec.4 to verify that all the constraints above are representable using power cones. We leave it as an exercise to find other conic representations, based on other transformations of the original inequality.

7.1.3 Convex univariate piecewise-defined functions Consider a univariate function with 푘 pieces: ⎧ 푓1(푥) if 푥 훼1, ⎨⎪ ≤ 푓(푥) = 푓푖(푥) if 훼푖−1 푥 훼푖, for 푖 = 2, . . . , 푘 1, ⎪ ≤ ≤ − ⎩푓푘(푥) if 훼푘−1 푥, ≤ Suppose 푓(푥) is convex (in particular each piece is convex by itself) and 푓푖(훼푖) = 푓푖+1(훼푖). In representing the epigraph 푡 푓(푥) it is helpful to proceed via the equivalent represen- ≥ tation: ∑︀푘 ∑︀푘−1 푡 = 푖=1 푡푖 푖=1 푓푖(훼푖), − − 푥 = ∑︀푘 푥 ∑︀푘 1 훼 , 푖=1 푖 − 푖=1 푖 푡푖 푓푖(푥푖) for 푖 = 1, . . . , 푘, ≥ 푥1 훼1, 훼푖−1 푥푖 훼푖 for 푖 = 2, . . . , 푘 1, 훼푘−1 푥푘. ≤ ≤ ≤ − ≤ Proof. In the special case when 푘 = 2 and 훼1 = 푓1(훼1) = 푓2(훼1) = 0 the epigraph of 푓 is equal to the Minkowski sum (푡1 +푡2, 푥1 +푥2): 푡푖 푓푖(푥푖) . In general the epigraph over { ≥ } 73 two consecutive pieces is obtained by shifting them to (0, 0), computing the Minkowski sum and shifting back. Finally more than two pieces can be joined by continuing this argument by induction. As the reformulation grows in size with the number of pieces, it is preferable to keep this number low. Trivially, if 푓푖(푥) = 푓푖+1(푥), these two pieces can be merged. Substituting 푓푖(푥) and 푓푖+1(푥) for max(푓푖(푥), 푓푖+1(푥)) is sometimes an invariant change to facilitate this merge. For instance, it always works for affine functions 푓푖(푥) and 푓푖+1(푥). Finally, if 푓(푥) is symmetric around some point 훼, we can represent its epigraph via a piecewise function, with only half the number of pieces, defined in terms of a new variable 푧 = 푥 훼 + 훼. | − |

Example 7.4 (Huber loss function). The Huber loss function ⎧ 2푥 1 if 푥 1, ⎨⎪− − ≤ − 푓(푥) = 푥2 if 1 푥 1, ⎪ − ≤ ≤ ⎩2푥 1 if 1 푥, − ≤ is convex and its epigraph 푡 푓(푥) has an equivalent representation ≥ 푡 푡 + 푡 + 푡 2, ≥ 1 2 3 − 푥 = 푥1 + 푥2 + 푥3, 2 푡1 = 2푥1 1, 푡2 푥2, 푡3 = 2푥3 1, 푥 − 1, −1 푥≥ 1, 1 푥 , − 1 ≤ − − ≤ 2 ≤ ≤ 3 2 where 푡2 푥 is representable by a rotated quadratic cone. No two pieces of 푓(푥) ≥ 2 can be merged to reduce the size of this formulation, but the loss function does satisfy a simple symmetry; namely 푓(푥) = 푓( 푥). We can thus represent its epigraph by − 푡 푓ˆ(푧) and 푧 푥 , where ≥ ≥ | | {︃ 푧2 if 푧 1, 푓ˆ(푧) = ≤ 2푧 1 if 1 푧. − ≤ In this particular example, however, unless the absolute value 푧 from 푧 푥 is used ≥ | | elsewhere, the cost of introducing it does not outweigh the savings achieved by going from three pieces in 푥 to two pieces in 푧.

7.2 Avoiding ill-posed problems

For a well-posed continuous problem a small change in the input data should induce a small change of the optimal solution. A problem is ill-posed if small perturbations of the problem data result in arbitrarily large perturbations of the solution, or even change the feasibility status of the problem. In such cases small rounding errors, or solving the problem on a different computer can result in different or wrong solutions. In fact, from an algorithmic point of view, even computing a wrong solution is numer- ically difficult for ill-posed problems. A rigorous definition of the degree of ill-posedness is

74 possible by defining a condition number. This is an attractive, but not very practical met- ric, as its evaluation requires solving several auxiliary optimization problems. Therefore we only make the modest recommendations to avoid problems

• that are nearly infeasible,

• with constraints that are linearly dependent, or nearly linearly dependent.

Example 7.5 (Near linear dependencies). To give some idea about the perils of near linear dependence, consider this pair of optimization problems:

$$\begin{array}{ll}
\text{maximize} & x \\
\text{subject to} & 2x - y \ge -1, \\
& 2.0001x - y \le 0,
\end{array}$$

and

$$\begin{array}{ll}
\text{maximize} & x \\
\text{subject to} & 2x - y \ge -1, \\
& 1.9999x - y \le 0.
\end{array}$$

The first problem has a unique optimal solution $(x, y) = (10^4, 2\cdot10^4 + 1)$, while the second problem is unbounded. This is caused by the fact that the hyperplanes (here: straight lines) defining the constraints in each problem are almost parallel. Moreover, we can consider another modification:

$$\begin{array}{ll}
\text{maximize} & x \\
\text{subject to} & 2x - y \ge -1, \\
& 2.001x - y \le 0,
\end{array}$$

with optimal solution $(x, y) = (10^3, 2\cdot10^3 + 1)$, which shows how a small perturbation of the coefficients induces a large change of the solution.
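These three problems are small enough to reproduce with any LP solver; the sketch below uses SciPy's linprog (assumed available). Its default variable bounds $x, y \ge 0$ do not affect the solutions quoted above.

```python
# Sketch reproducing Example 7.5: a 0.0001 change in one coefficient moves the optimum
# from x = 10^4 to unbounded.
from scipy.optimize import linprog

def solve(a):
    # maximize x  <=>  minimize -x,  s.t.  -2x + y <= 1  and  a*x - y <= 0  (x, y >= 0 by default)
    return linprog(c=[-1, 0], A_ub=[[-2, 1], [a, -1]], b_ub=[1, 0], method="highs")

for a in (2.0001, 1.9999, 2.001):
    res = solve(a)
    print(a, res.status, res.x)   # status 0 = optimal, 3 = unbounded
```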

Typical examples of problems with nearly linearly dependent constraints are dis- cretizations of continuous processes, where the constraints invariably become more cor- related as we make the discretization finer; as such there may be nothing wrong with the discretization or problem formulation, but we should expect numerical difficulties for sufficiently fine discretizations. One should also be careful not to specify problems whose optimal value is only achieved in the limit. A trivial example is

minimize 푒푥, 푥 R. ∈ The infimum value 0 is not attained for any finite 푥, which can lead to unspecified be- haviour of the solver. More examples of ill-posed conic problems are discussed in Fig. 7.1 and Sec.8. In particular, Fig. 7.1 depicts some troublesome problems in two dimensions. In (a) minimizing 푦 1/푥 on the nonnegative orthant is unattained approaching zero ≥ for 푥 . In (b) the only feasible point is (0, 0), but objectives maximizing 푥 are not → ∞ blocked by the purely vertical normal vectors of active constraints at this point, falsely

75 suggesting local progress could be made. In (c) the intersection of the two subsets is empty, but the distance between them is zero. Finally in (d) minimizing 푥 on 푦 푥2 ≥ is unbounded at minus infinity for 푦 , but there is no improving ray to follow. Al- → ∞ though seemingly unrelated, the four cases are actually primal-dual pairs; (a) with (b) and (c) with (d). In fact, the missing normal vector property in (b)—desired to certify optimality—can be attributed to (a) not attaining the best among objective values at distance zero to feasibility, and the missing positive distance property in (c)—desired to certify infeasibility—is because (d) has no improving ray.

Fig. 7.1: Geometric representations of some ill-posed situations.

7.3 Scaling

Another difficulty encountered in practice involves models that are badly scaled. Loosely speaking, we consider a model to be badly scaled if • variables are measured on very different scales, • constraints or bounds are measured on very different scales.

For example if one variable 푥1 has units of molecules and another variable 푥2 measures temperature, we might expect an objective function such as

푥1 + 10^12 푥2,

and in finite precision the second term dominates the objective and renders the contribution from 푥1 insignificant and unreliable.

A similar situation (from a numerical point of view) is encountered when using penalization or big-푀 strategies. Assume we have a standard linear optimization problem (2.12) with the additional constraint that 푥1 = 0. We may eliminate 푥1 completely from the model, or we might add an additional constraint, but suppose we choose to formulate a penalized problem instead

minimize   푐ᵀ푥 + 10^12 푥1
subject to 퐴푥 = 푏,
           푥 ≥ 0,

reasoning that the large penalty term will force 푥1 = 0. However, if ‖푐‖ is small we have the exact same problem, namely that in finite precision the penalty term will completely dominate the objective and render the contribution 푐ᵀ푥 insignificant or unreliable. Therefore, the penalty term should be chosen carefully.

Example 7.6 (Risk-return tradeoff). Suppose we want to solve an efficient frontier variant of the optimal portfolio problem from Sec. 3.3.3 with an objective of the form

maximize 휇ᵀ푥 − (1/2)훾푥ᵀΣ푥,

where the parameter 훾 controls the tradeoff between return and risk. The user might choose a very small 훾 to discount the risk term. This, however, may lead to numerical issues caused by a very large solution. To see why, consider a simple one-dimensional, unconstrained analogue:

maximize 푥 − (1/2)훾푥²,  푥 ∈ R,

and note that the maximum is attained for 푥 = 1/훾, which may be very large.

Example 7.7 (Explicit redundant bounds). Consider again the problem (2.12), but with additional (redundant) constraints 푥푖 훾. This is a common approach for some ≤ optimization practitioners. The problem we solve is then

minimize   푐ᵀ푥
subject to 퐴푥 = 푏,
           푥 ≥ 0,
           푥 ≤ 훾푒,

with a dual problem

maximize   푏ᵀ푦 − 훾푒ᵀ푧
subject to 퐴ᵀ푦 + 푠 − 푧 = 푐,
           푠, 푧 ≥ 0.

Suppose we do not know a priori an upper bound on ‖푥‖∞, so we choose a very large 훾 = 10^12, reasoning that this will not change the optimal solution. Note that the large variable bound becomes a penalty term in the dual problem; in finite precision such a large bound will effectively destroy accuracy of the solution.

Example 7.8 (Big-M). Suppose we have a mixed integer model which involves a big-M constraint (see Sec.9), for instance

푎푇 푥 푏 + 푀(1 푧) ≤ −

77 as in Sec. 9.1.3. The constant 푀 must be big enough to guarantee that the constraint becomes redundant when 푧 = 0, or we risk shrinking the feasible set, perhaps to the point of creating an infeasible model. On the other hand, the user might set 푀 to a huge, safe value, say 푀 = 1012. While this is mathematically correct, it will lead to a model with poorer linear relaxations and most likely increase solution time (sometimes dramatically). It is therefore very important to put some effort into estimating a reasonable value of 푀 which is safe, but not too big.

7.4 The huge and the tiny

Some types of problems and constraints are especially prone to behave badly with very large or very small numbers. We discuss such examples briefly in this section.

Sum of squares

The sum of squares 푥1² + · · · + 푥푛² can be bounded above using the rotated quadratic cone:

푡 ≥ 푥1² + · · · + 푥푛²  ⟺  (1/2, 푡, 푥) ∈ 풬r^(푛+2),

but in most cases it is better to bound the square root with a quadratic cone:

푡′ ≥ √(푥1² + · · · + 푥푛²)  ⟺  (푡′, 푥) ∈ 풬^(푛+1).

The latter has slightly better numerical properties: 푡′ is roughly comparable with 푥 and is measured on the same scale, while 푡 grows as the square of 푥 which will sooner lead to numerical instabilities.
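To see the scale disparity concretely, here is a small NumPy sketch; the vector is an arbitrary example, not data from the text:

```python
import numpy as np

x = np.full(1000, 1.0e4)          # entries on the scale 1e4
t_rot = np.sum(x**2)              # bound used with the rotated quadratic cone
t_q   = np.linalg.norm(x)         # bound used with the quadratic cone

print(f"sum of squares t  = {t_rot:.3e}")   # ~1e11, far from the scale of x
print(f"norm           t' = {t_q:.3e}")     # ~3e5, comparable with x
```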

Exponential cone

Using the function e^푥 for 푥 ≤ −30 or 푥 ≥ 30 would be rather questionable if e^푥 is supposed to represent any realistic value. Note that e^30 ≈ 10^13 and that e^30.0000001 − e^30 ≈ 10^6, so a tiny perturbation of the exponent produces a huge change of the result. Ideally, 푥 should have a single-digit absolute value. For similar reasons, it is even more important to have well-scaled data. For instance, in floating point arithmetic we have the “equality”

e^20 + e^−20 = e^20,

since the smaller summand disappears in comparison to the other one, 10^17 times bigger. For the same reason it is advised to replace explicit inequalities involving exp(푥) with log-sum-exp variants (see Sec. 5.2.6). For example, suppose we have a constraint such as

푡 ≥ e^푥1 + e^푥2 + e^푥3.

Then after the substitution 푡 = e^푡′ we have

1 ≥ e^(푥1−푡′) + e^(푥2−푡′) + e^(푥3−푡′),

which has the advantage that 푡′ is of the same order of magnitude as the 푥푖, and that the exponents 푥푖 − 푡′ are likely to have much smaller absolute values than the 푥푖 themselves.

Power cone

The power cone is not reliable when one of the exponents is very small. For example, consider the function 푓(푥) = 푥^0.01, which behaves almost like the indicator function of 푥 > 0 in the sense that

푓(0) = 0, and 푓(푥) > 0.5 for 푥 > 10−30.

Now suppose 푥 = 10−20. Is the constraint 푓(푥) > 0.5 satisfied? In principle yes, but in practice 푥 could have been obtained as a solution to an optimization problem and it may in fact represent 0 up to numerical error. The function 푓(푥) is sensitive to changes in 푥 well below standard numerical accuracy. The point 푥 = 0 does not satisfy 푓(푥) > 0.5 but it is only 10−30 from doing so.
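Both effects from this section are easy to reproduce in double precision; the following NumPy sketch (values chosen only for illustration) shows the absorbed exponential term and the sensitivity of 푥^0.01 near zero:

```python
import numpy as np

# The small summand is absorbed: e^20 + e^-20 evaluates to exactly e^20.
print(np.exp(20.0) + np.exp(-20.0) == np.exp(20.0))   # True

# f(x) = x^0.01 jumps from 0 to above 0.5 within numerical noise of x = 0.
for x in [0.0, 1e-30, 1e-20, 1e-8]:
    print(x, x**0.01)
```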

7.5 Semidefinite variables

Special care should be given to models involving semidefinite matrix variables (see Sec. 6), otherwise it is easy to produce an unnecessarily inefficient model. The general rule of thumb is: • having many small matrix variables is more efficient than one big matrix variable,

• efficiency drops with growing number of semidefinite terms involved in linearcon- straints. This can have much bigger impact on the solution time than increasing the dimension of the semidefinite variable. Let us consider a few examples.

Block matrices Given two matrix variables 푋, 푌 0 do not assemble them in a block matrix and write ⪰ the constraints as [︂ 푋 0 ]︂ 0. 0 푌 ⪰

This increases the dimension of the problem and, even worse, introduces unnecessary constraints for a large portion of entries of the block matrix.

Schur complement Suppose we want to model a relaxation of a rank-one constraint: [︂ 푋 푥 ]︂ 0. 푥푇 1 ⪰

푛 푛 where 푥 R and 푋 +. The correct way to do this is to set up a matrix variable ∈ ∈ 풮 푌 푛+1 with only a linear number of constraints: ∈ 풮+

푌푖,푛+1 = 푥푖, 푖 = 1, . . . , 푛 푌푛+1,푛+1 = 1, 푌 0, ⪰ 79 and use the upper-left 푛 푛 part of 푌 as the original 푋. Going the other way around, i.e. × starting with a variable 푋 and aligning it with the corner of another, bigger semidefinite matrix 푌 introduces 푛(푛 + 1)/2 equality constraints and will quickly have formidable solution times.

Sparse LMIs Suppose we want to model a problem with a sparse linear matrix inequality (see Sec. 6.2.1) such as:

minimize 푐푇 푥 ∑︀푘 subject to 퐴0 + 퐴푖푥푖 0, 푖=1 ⪰ 푘 where 푥 R and 퐴푖 are symmetric 푛 푛 matrices. The representation of this problem ∈ × in primal form is: minimize 푐푇 푥 ∑︀푘 subject to 퐴0 + 푖=1 퐴푖푥푖 = 푋, 푋 0, ⪰ and the linear constraint requires a full set of 푛(푛 + 1)/2 equalities, many of which are just 푋푘,푙 = 0, regardless of the sparsity of 퐴푖. However the dual problem (see Sec. 8.6) is:

maximize 퐴0, 푍 −⟨ ⟩ subject to 퐴푖, 푍 = 푐푖, 푖 = 1, . . . , 푘, 푍⟨ 0,⟩ ⪰

and the number of nonzeros in linear constraints is just joint number of nonzeros in 퐴푖. It means that large, sparse LMIs should almost always be dualized and entered in that form for efficiency reasons.

7.6 The quality of a solution

In this section we will discuss how to validate an obtained solution. Assume we have a conic model with continuous variables only and that the optimization software has reported an optimal primal and dual solution. Given such a solution, we might ask how to verify that it is indeed feasible and optimal. To that end, consider a simple model

minimize   −푥2
subject to 푥1 + 푥2 ≤ 3,
           푥2² ≤ 2,                          (7.1)
           푥1, 푥2 ≥ 0,

where a solver might approximate the solution as

푥1 = 0.0000000000000000 and 푥2 = 1.4142135623730951

and therefore the approximate optimal objective value is

1.4142135623730951. − 80 The true objective value is √2, so the approximate objective value is wrong by the − amount

1.4142135623730951 √2 10−16. − ≈ Most likely this difference is irrelevant for all practical purposes. Nevertheless, in general a solution obtained using floating point arithmetic is only an approximation. Most (if not all) commercial optimization software uses double precision floating point arithmetic, implying that about 16 digits are correct in the computations performed internally by the software. This also means that irrational numbers such as √2 and 휋 can only be stored accurately within 16 digits.

Verifying feasibility A good practice after solving an optimization problem is to evaluate the reported solution. At the very least this process should be carried out during the initial phase of building a model, or if the reported solution is unexpected in some way. The first step in that process is to check that the solution is feasible; in case of the small example (7.1) this amounts to checking that:

푥1 + 푥2 = 0.0000000000000000 + 1.4142135623730951 ≤ 3,
푥2² = 2.0000000000000004 ≤ 2,   (?)
푥1 = 0.0000000000000000 ≥ 0,
푥2 = 1.4142135623730951 ≥ 0,

which demonstrates that one constraint is slightly violated due to computations in finite precision. It is up to the user to assess the significance of a specific violation; for example a violation of one unit in

푥1 + 푥2 ≤ 1

may or may not be more significant than a violation of one unit in

푥1 + 푥2 ≤ 10^9.

The right-hand side of 10^9 may itself be the result of a computation in finite precision, and may only be known with, say, 3 digits of accuracy. Therefore, a violation of 1 unit is not significant since the true right-hand side could just as well be 1.001 · 10^9. Certainly a violation of 4 · 10^−16 as in our example is within the numerical tolerance for zero and we would accept the solution as feasible. In practice it would be standard to see violations of order 10^−8.
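A good habit is to recompute the residuals of the reported solution directly. A minimal sketch for problem (7.1); the acceptance tolerance is an arbitrary choice:

```python
x1, x2 = 0.0000000000000000, 1.4142135623730951
tol = 1e-8   # acceptance threshold; problem dependent

residuals = {
    "x1 + x2 <= 3": x1 + x2 - 3,
    "x2^2    <= 2": x2**2 - 2,
    "x1      >= 0": -x1,
    "x2      >= 0": -x2,
}
for name, r in residuals.items():
    viol = max(r, 0.0)
    print(f"{name}: violation {viol:.2e}  ok={viol <= tol}")
```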

Verifying optimality Another question is how to verify that a feasible approximate solution is actually optimal, which is answered by duality theory, which is discussed in Sec. 2.4 for linear problems and in Sec.8 in full generality. Following Sec.8 we can derive an equivalent version of the dual problem as:

maximize   −3푦1 − 푦2 − 푦3
subject to 2푦2푦3 ≥ (푦1 + 1)²,
           푦1 ≥ 0.

This problem has a feasible point (푦1, 푦2, 푦3) = (0, 1/√2, 1/√2) with objective value −√2, matching the objective value of the primal problem. By duality theory this proves that both the primal and dual solutions are optimal. Again, in finite precision the dual objective value may be reported as, for example,

−1.4142135623730950,

leading to a duality gap of about 10^−16 between the approximate primal and dual objective values. It is up to the user to assess the significance of any duality gap and accept or reject the solution as a sufficiently good approximation of the optimal value.
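The optimality check is analogous: evaluate dual feasibility and the gap between the reported primal and dual objective values. A small sketch using the dual point given above:

```python
import numpy as np

# Reported primal and dual solutions for (7.1) and its dual.
x1, x2 = 0.0, 1.4142135623730951
y1, y2, y3 = 0.0, 1 / np.sqrt(2), 1 / np.sqrt(2)

primal_obj = -x2
dual_obj = -3 * y1 - y2 - y3

print("dual residual:", (y1 + 1) ** 2 - 2 * y2 * y3)   # tiny, at machine precision
print("duality gap:  ", primal_obj - dual_obj)         # close to zero
```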

7.7 Distance to a cone

The violation of a linear constraint 푎푇 푥 푏 under a given solution 푥* is obviously ≤ max(0, 푎푇 푥* 푏). − It is less obvious how to assess violations of conic constraints. To this end suppose we have a convex cone 퐾. The (Euclidean) projection of a point 푥 onto 퐾 is defined as the solution of the problem

minimize 푝 푥 2 ‖ − ‖ (7.2) subject to 푝 퐾. ∈ We denote the projection by proj퐾 (푥). The distance of 푥 to 퐾 dist(푥, 퐾) = 푥 proj (푥) , ‖ − 퐾 ‖2 i.e. the objective value of (7.2), measures the violation of a conic constraint involving 퐾. Obviously dist(푥, 퐾) = 0 푥 퐾. ⇐⇒ ∈ This distance measure is attractive because it depends only on the set 퐾 and not on any particular representation of 퐾 using (in)equalities.

Example 7.9 (Surprisingly small distance). Let 퐾 = 0.1,0.9 be the power cone defined 풫3 by the inequality 푥0.1푦0.9 푧 . The point ≥ | | 푥* = (푥, 푦, 푧) = (0, 10000, 500)

is clearly not in the cone, in fact 푥0.1푦0.9 = 0 while 푧 = 500, so the violation of the | | conic inequality is seemingly large. However, the point

푝 = (10−8, 10000, 500)

belongs to 0.1,0.9 (since (10−8)0.1 100000.9 630), hence 풫3 · ≈ dist(푥*, 0.1,0.9) 푥* 푝 = 10−8. 풫3 ≤ ‖ − ‖2 Therefore the distance of 푥* to the cone is actually very small.
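The claim is easy to verify numerically; the helper below simply evaluates the defining inequality of the power cone:

```python
x_star = (0.0, 10000.0, 500.0)
p      = (1e-8, 10000.0, 500.0)

def in_pow_cone(x, y, z, a=0.1, b=0.9):
    # Membership in the power cone: x^a * y^b >= |z| with x, y >= 0.
    return x >= 0 and y >= 0 and x**a * y**b >= abs(z)

print(in_pow_cone(*x_star))   # False
print(in_pow_cone(*p))        # True: (1e-8)^0.1 * 10000^0.9 is about 630
print(sum((a - b) ** 2 for a, b in zip(x_star, p)) ** 0.5)   # distance 1e-8
```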

82 Distance to certain cones For some basic cone types the projection problem (7.2) can be solved analytically. De- noting [푥]+ = max(0, 푥) we have that:

• For 푥 ∈ R the projection onto the nonnegative half-line is proj_R₊(푥) = [푥]₊.

• For (푡, 푥) ∈ R × R^(푛−1) the projection onto the quadratic cone is

proj_풬ⁿ(푡, 푥) = (푡, 푥)                                if 푡 ≥ ‖푥‖₂,
proj_풬ⁿ(푡, 푥) = (1/2)(1 + 푡/‖푥‖₂)(‖푥‖₂, 푥)            if −‖푥‖₂ < 푡 < ‖푥‖₂,
proj_풬ⁿ(푡, 푥) = 0                                     if 푡 ≤ −‖푥‖₂.

• For 푋 = Σ_{푖=1}^{푛} 휆푖푞푖푞푖ᵀ the projection onto the semidefinite cone is

proj_풮₊ⁿ(푋) = Σ_{푖=1}^{푛} [휆푖]₊ 푞푖푞푖ᵀ.
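These formulas translate directly into code. A minimal NumPy sketch (np.linalg.eigh provides the eigenvalue decomposition used in the semidefinite case):

```python
import numpy as np

def proj_nonneg(x):
    return np.maximum(x, 0.0)

def proj_quadratic_cone(t, x):
    nrm = np.linalg.norm(x)
    if t >= nrm:
        return t, x
    if t <= -nrm:
        return 0.0, np.zeros_like(x)
    scale = 0.5 * (1.0 + t / nrm)
    return scale * nrm, scale * x

def proj_psd(X):
    lam, Q = np.linalg.eigh(X)              # X = sum_i lam_i q_i q_i^T
    return (Q * np.maximum(lam, 0.0)) @ Q.T

# Example: distance of a point to the quadratic cone Q^3.
t, x = -0.5, np.array([1.0, 2.0])
pt, px = proj_quadratic_cone(t, x)
print(np.hypot(pt - t, np.linalg.norm(px - x)))
```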

83 Chapter 8

Duality in conic optimization

In Sec.2 we introduced duality and related concepts for linear optimization. Here we present a more general version of this theory for conic optimization and we illustrate it with examples. Although this chapter is self-contained, we recommend familiarity with Sec.2, which some of this material is a natural extension of. Duality theory is a rich and powerful area of convex optimization, and central to understanding sensitivity analysis and infeasibility issues. Furthermore, it provides a simple and systematic way of obtaining non-trivial lower bounds on the optimal value for many difficult non-convex problems. From now on we consider a conic problem in standard form

minimize 푐푇 푥 subject to 퐴푥 = 푏, (8.1) 푥 퐾, ∈ where 퐾 is a convex cone. In practice 퐾 would likely be defined as a product

퐾 = 퐾 퐾 1 × · · · × 푚 of smaller cones corresponding to actual constraints in the problem formulation. The 푛 abstract formulation (8.1) encompasses such special cases as 퐾 = R+ (linear optimiza- 푛 1 tion), 퐾푖 = R (unconstrained variable) or 퐾푖 = + ( 푛(푛 + 1) variables representing the 풮 2 upper-triangular part of a matrix). The feasible set for (8.1):

푛 푝 = 푥 R 퐴푥 = 푏 퐾 ℱ { ∈ | } ∩

is a section of 퐾. We say the problem is feasible if 푝 = and infeasible otherwise. The ℱ ̸ ∅ value of (8.1) is defined as

푝⋆ = inf 푐푇 푥 : 푥 , { ∈ ℱ푝} allowing the special cases of 푝⋆ = + (the problem is infeasible) and 푝⋆ = (the ∞ −∞ problem is unbounded). Note that 푝⋆ may not be attained, i.e. the infimum is not necessarily a minimum, although this sort of problems are ill-posed from a practical point of view (see Sec.7).

84 Example 8.1 (Unattained problem value). The conic quadratic problem

minimize 푥 subject to (푥, 푦, 1) 3, ∈ 풬푟 with constraint equivalent to 2푥푦 1, 푥, 푦 0 has 푝⋆ = 0 but this optimal value is ≥ ≥ not attained by any point in 푝. ℱ

8.1 Dual cone

Suppose 퐾 R푛 is a closed convex cone. We define the dual cone 퐾* as ⊆ * 퐾 = 푦 R푛 : 푦푇 푥 0 푥 퐾 . (8.2) { ∈ ≥ ∀ ∈ } For simplicity we write 푦푇 푥, denoting the standard Euclidean inner product, but every- thing in this section applies verbatim to the inner product of matrices in the semidefinite context. In other words 퐾* consists of vectors which form an acute (or right) angle with every vector in 퐾. We see easily that 퐾* is in fact a convex cone. The main example, associated with linear optimization, is the dual of the positive orthant: Lemma 8.1 (Dual of linear cone).

푛 * 푛 (R+) = R+.

Proof. This is obvious for 푛 = 1, and then we use the simple fact

(퐾 퐾 )* = 퐾* 퐾* , 1 × · · · × 푚 1 × · · · × 푚 which we invite the reader to prove. All cones studied in this cookbook can be explicitly dualized as we show next. Lemma 8.2 (Duals of nonlinear cones). We have the following: • The quadratic, rotated quadratic and semidefinite cones are self-dual:

( 푛)* = 푛, ( 푛)* = 푛, ( 푛 )* = 푛 . 풬 풬 풬r 풬r 풮+ 풮+

• The dual of the power cone (4.4) is {︂ (︂ )︂ }︂ 훼1,··· ,훼푚 * 푛 푦1 푦푚 훼1,··· ,훼푚 ( 푛 ) = 푦 R : ,..., , 푦푚+1, . . . , 푦푛 푛 . 풫 ∈ 훼1 훼푚 ∈ 풫

• The dual of the exponential cone (5.2) is the closure

* {︀ 3 푦2/푦3−1 }︀ (퐾exp) = cl 푦 R : 푦1 푦3푒 , 푦1 > 0, 푦3 < 0 . ∈ ≥ −

85 Proof. We only sketch some parts of the proof and indicate geometric intuitions. First, let us show ( 푛)* = 푛. For (푡, 푥), (푠, 푦) 푛 we have 풬 풬 ∈ 풬 푠푡 + 푦푇 푥 푦 푥 + 푦푇 푥 0 ≥ ‖ ‖2‖ ‖2 ≥ 푇 푛 푛 * by the Cauchy-Schwartz inequality 푢 푣 푢 2 푣 2, therefore ( ) . Geomet- | | ≤ ‖ ‖ ‖ ‖ 풬 ⊆ 풬 rically we are just saying that the quadratic cone 푛 has a right angle at the apex, so 풬 any two vectors inside the cone form at most a right angle. Now suppose (푠, 푦) is a point outside 푛. If ( 푠, 푦) 푛 then 푠2 푦푇 푦 < 0 and we showed (푠, 푦) ( 푛)*. 풬 − − ∈ 풬 − − ̸∈ 풬 Otherwise let (푡, 푥) = proj풬푛 (푠, 푦) be the projection (see Sec. 7.7); note that (푠, 푦) and (푡, 푥) 푛 form an obtuse angle. − ∈ 풬 푛 For the semidefinite cone we use the property 푋, 푌 0 for 푋, 푌 + (see Sec. 푛 ⟨ ⟩ ≥ ∈ 풮 푇 6.1.2). Conversely, assume that 푍 +. Then there exists a 푤 satisfying 푤푤 , 푍 = 푇 푛 * ̸∈ 풮 ⟨ ⟩ 푤 푍푤 < 0, so 푍 ( +) . ̸∈ 풮 * As our last exercise let us check that vectors in 퐾exp and (퐾exp) form acute angles. * By definition of 퐾exp and (퐾exp) we take 푥, 푦 such that:

푥1 ≥ 푥2 exp(푥3/푥2),  푦1 ≥ −푦3 exp(푦2/푦3 − 1),  푥1, 푥2, 푦1 > 0,  푦3 < 0,

and then we have

푦ᵀ푥 = 푥1푦1 + 푥2푦2 + 푥3푦3 ≥ −푥2푦3 exp(푥3/푥2 + 푦2/푦3 − 1) + 푥2푦2 + 푥3푦3
                            ≥ −푥2푦3(푥3/푥2 + 푦2/푦3) + 푥2푦2 + 푥3푦3 = 0,

using the inequality exp(푡) ≥ 1 + 푡. We refer to the literature for full proofs of all the statements.
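A quick numerical spot-check of these duality relations: points sampled from a cone and from its dual should never form an obtuse angle. A small sketch for the exponential cone; the sampling scheme is an ad hoc illustration, not part of the proof:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_Kexp(n):
    # x1 >= x2 * exp(x3 / x2), x2 > 0
    x2 = rng.uniform(0.1, 5.0, n)
    x3 = rng.uniform(-5.0, 5.0, n)
    x1 = x2 * np.exp(x3 / x2) * rng.uniform(1.0, 2.0, n)
    return np.stack([x1, x2, x3], axis=1)

def sample_Kexp_dual(n):
    # y1 >= -y3 * exp(y2 / y3 - 1), y1 > 0, y3 < 0
    y3 = -rng.uniform(0.1, 5.0, n)
    y2 = rng.uniform(-5.0, 5.0, n)
    y1 = -y3 * np.exp(y2 / y3 - 1.0) * rng.uniform(1.0, 2.0, n)
    return np.stack([y1, y2, y3], axis=1)

X, Y = sample_Kexp(1000), sample_Kexp_dual(1000)
print(np.min(np.einsum("ij,ij->i", X, Y)))   # pairwise inner products, all >= 0
```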

Finally it is nice to realize that (퐾*)* = 퐾 and that the cones 0 and R are each { } others’ duals. We leave it to the reader to check these facts.

8.2 Infeasibility in conic optimization

We can now discuss infeasibility certificates for conic problems. Given an optimization problem, the first basic question we are interested in is its feasibility status. Thetheory of infeasibility certificates for linear problems, including the Farkas lemma (see Sec. 2.3) extends almost verbatim to the conic case. Lemma 8.3 (Farkas’ lemma, conic version). Suppose we have the conic optimization problem (8.1). Exactly one of the following statements is true: 1. Problem (8.1) is feasible.

2. Problem (8.1) is infeasible, but there is a sequence 푥푛 퐾 such that 퐴푥푛 푏 0. ∈ ‖ − ‖ → 3. There exists 푦 such that 퐴푇 푦 퐾* and 푏푇 푦 > 0. − ∈ Problems which fall under option 2. (limit-feasible) are ill-posed: an arbitrarily small perturbation of input data puts the problem in either category 1. or 3. This fringe case should therefore not appear in practice, and if it does, it signals issues with the optimization model which should be addressed.

86 Example 8.2 (Limit-feasible model). Here is an example of an ill-posed limit-feasible model created by fixing one of the root variables of a rotated quadratic coneto 0.

minimize 푢 3 subject to (푢, 푣, 푤) 푟, 푣 = 0, ∈ 풬 푤 1. ≥ 3 The problem is clearly infeasible, but the sequence (푢푛, 푣푛, 푤푛) = (푛, 1/푛, 1) with ∈ 풬푟 (푣푛, 푤푛) (0, 1) makes it limit-feasible as in alternative 2. of Lemma 8.3. There is no → infeasibility certificate as in alternative 3.

Having cleared this detail we continue with the proof and example for the actually useful part of conic Farkas’ lemma. Proof. Consider the set

퐴(퐾) = 퐴푥 : 푥 퐾 . { ∈ } It is a convex cone. Feasibility is equivalent to 푏 퐴(퐾). If 푏 퐴(퐾) but 푏 cl(퐴(퐾)) ∈ ̸∈ ∈ then we have the second alternative. Finally, if 푏 cl(퐴(퐾)) then the closed convex cone ̸∈ cl(퐴(퐾)) and the point 푏 can be strictly separated by a hyperplane passing through the origin, i.e. there exists 푦 such that

푏푇 푦 > 0, (퐴푥)푇 푦 0 푥 퐾. ≤ ∀ ∈ But then 0 (퐴푥)푇 푦 = ( 퐴푇 푦)푇 푥 for all 푥 퐾, so by definition of 퐾* we have ≤ − − ∈ 퐴푇 푦 퐾*, and we showed 푦 satisfies the third alternative. Finally, 1. and 3.are − ∈ mutually exclusive, since otherwise we would have

0 ( 퐴푇 푦)푇 푥 = 푦푇 퐴푥 = 푦푇 푏 < 0 ≤ − − − and the same argument works in the limit if 푥푛 is a sequence as in 2. That proves the lemma. Therefore Farkas’ lemma implies that (up to certain ill-posed cases) either the problem (8.1) is feasible (first alternative) or there isa certificate of infeasibility 푦 (last alternative). In other words, every time we classify a model as infeasible, we can certify this fact by 푛 providing an appropriate 푦. Note that when 퐾 = R+ we recover precisely the linear version, Lemma 2.1.

Example 8.3 (Infeasible conic problem). Consider a minimization problem:

minimize   푥1
subject to 푥1 − 푥2 + 푥4 = −1,
           2푥3 − 3푥4 = −1,
           푥1 ≥ √(푥2² + 푥3²),
           푥4 ≥ 0.

It can be expressed in the standard form (8.1) with

퐴 = [ 1  −1  0   1 ]
    [ 0   0  2  −3 ],    푏 = [−1, −1]ᵀ,    퐾 = 풬³ × R₊,    푐 = [1, 0, 0, 0]ᵀ.

A certificate of infeasibility is 푦 = [−1, 0]ᵀ. Indeed, 푏ᵀ푦 = 1 > 0 and −퐴ᵀ푦 = [1, −1, 0, 1] ∈ 퐾* = 풬³ × R₊. The certificate indicates that the first linear constraint alone causes infeasibility, which is indeed the case: the first equality together with the conic constraints yields a contradiction:

푥2 − 1 ≥ 푥2 − 1 − 푥4 = 푥1 ≥ |푥2|.
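The certificate can be checked mechanically by computing 푏ᵀ푦 and testing that −퐴ᵀ푦 lies in 풬³ × R₊; a small NumPy sketch:

```python
import numpy as np

A = np.array([[1.0, -1.0, 0.0, 1.0],
              [0.0,  0.0, 2.0, -3.0]])
b = np.array([-1.0, -1.0])
y = np.array([-1.0, 0.0])

s = -A.T @ y                                   # candidate element of K* = Q^3 x R_+
in_quad_cone = s[0] >= np.linalg.norm(s[1:3])
in_halfline = s[3] >= 0

print("b^T y  =", b @ y)                                 # 1.0 > 0
print("-A^T y =", s, in_quad_cone and in_halfline)       # [1, -1, 0, 1], True
```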

8.3 Lagrangian and the dual problem

Classical Lagrangian In general constrained optimization we consider an optimization problem of the form

minimize 푓0(푥) subject to 푓(푥) 0, (8.3) ℎ(푥) =≤ 0,

푛 푛 푚 where 푓0 : R R is the objective function, 푓 : R R encodes inequality constraints, ↦→ ↦→ and ℎ : R푛 R푝 encodes equality constraints. Readers familiar with the method of ↦→ Lagrange multipliers or penalization will recognize the Lagrangian for (8.3), a function 퐿 : R푛 R푚 R푝 R that augments the objective with a weighted combination of all × × ↦→ the constraints,

푇 푇 퐿(푥, 푦, 푠) = 푓0(푥) + 푦 ℎ(푥) + 푠 푓(푥). (8.4)

푝 푚 The variables 푦 R and 푠 R+ are called Lagrange multipliers or dual variables. The ∈ ∈ Lagrangian has the property that 퐿(푥, 푦, 푠) 푓0(푥) whenever 푥 is feasible for (8.1) and 푚 ≤ 푠 R+ . The optimal point satisfies the first-order optimality condition 푥퐿(푥, 푦, 푠) = 0. ∈ ∇ Moreover, the dual function 푔(푦, 푠) = inf푥 퐿(푥, 푦, 푠) provides a lower bound for the optimal 푝 푚 value of (8.3) for any 푦 R and 푠 R+ , which leads to considering the dual problem ∈ ∈ of maximizing 푔(푦, 푠).

Lagrangian for a conic problem We next set up an analogue of the Lagrangian theory for the conic problem (8.1) minimize 푐푇 푥 subject to 퐴푥 = 푏, 푥 퐾, ∈ where 푥, 푐 R푛, 푏 R푚, 퐴 R푚×푛 and 퐾 R푛 is a convex cone. We associate with ∈ ∈ ∈ ⊆ (8.1) a Lagrangian of the form 퐿 : R푛 R푚 퐾* R × × → 퐿(푥, 푦, 푠) = 푐푇 푥 + 푦푇 (푏 퐴푥) 푠푇 푥. − − 88 * * * 푚 * For any feasible 푥 푝 and any (푦 , 푠 ) R 퐾 we have ∈ ℱ ∈ × 퐿(푥*, 푦*, 푠*) = 푐푇 푥* + (푦*)푇 0 (푠*)푇 푥* 푐푇 푥*. (8.5) · − ≤ Note the we used the definition of the dual cone to conclude that (푠*)푇 푥* 0. The dual ≥ function is defined as the minimum of 퐿(푥, 푦, 푠) over 푥. Thus the dual function of (8.1) is {︂ 푏푇 푦, 푐 퐴푇 푦 푠 = 0, 푔(푦, 푠) = min 퐿(푥, 푦, 푠) = min 푥푇 (푐 퐴푇 푦 푠) + 푏푇 푦 = − − 푥 푥 − − , otherwise. −∞ Dual problem

From (8.5) every (푦, 푠) R푚 퐾* satisfies 푔(푦, 푠) 푝⋆, i.e. 푔(푦, 푠) is a lower bound for ∈ × ≤ 푝⋆. To get the best such bound we maximize 푔(푦, 푠) over all (푦, 푠) R푚 퐾* and get ∈ × the dual problem: maximize 푏푇 푦 subject to 푐 퐴푇 푦 = 푠, (8.6) 푠 − 퐾*, ∈ or simply: maximize 푏푇 푦 (8.7) subject to 푐 퐴푇 푦 퐾*. − ∈ The optimal value of (8.6) will be denoted 푑⋆. As in the case of (8.1) (which from now on we call the primal problem), the dual problem can be infeasible (푑⋆ = ), have an −∞ optimal solution ( < 푑⋆ < + ) or be unbounded (푑⋆ = + ). As before, the value −∞ ∞ ∞ 푑⋆ is defined as a supremum of 푏푇 푦 and may not be attained. Note that the roles of −∞ and + are now reversed because the dual is a maximization problem. ∞ Example 8.4 (More general constraints). We can just as easily derive the dual of a problem with more general constraints, without necessarily having to transform the problem to standard form beforehand. Imagine, for example, that some solver accepts conic problems of the form: minimize 푐푇 푥 + 푐푓 subject to 푙푐 퐴푥 푢푐, (8.8) 푙푥 ≤ 푥 ≤푢푥, 푥 ≤퐾.≤ ∈ Then we define a Lagrangian with one set of dual variables for each constraint appearing in the problem: 푐 푐 푥 푥 푥 푇 푓 푐 푇 푐 푐 푇 푐 퐿(푥, 푠푙 , 푠푢, 푠푙 , 푠푢, 푠푛) = 푐 푥 + 푐 (푠푙 ) (퐴푥 푙 ) (푠푢) (푢 퐴푥) 푥 푇 − 푥 푥 −푇 푥− 푥 −푇 (푠푙 ) (푥 푙 ) (푠푢) (푢 푥) (푠푛) 푥 −푇 푇−푐 −푇 푐 푥 − 푥 − 푥 = 푥 (푐 퐴 푠푙 + 퐴 푠푢 푠푙 + 푠푢 푠푛) +(푙푐)−푇 푠푐 (푢푐)푇 푠푐 +− (푙푥)푇 푠푥 −(푢푐)푇 푠푥 + 푐푓 푙 − 푢 푙 − 푢 and that gives a dual problem 푐 푇 푐 푐 푇 푐 푥 푇 푥 푐 푇 푥 푓 maximize (푙 ) 푠푙 (푢 ) 푠푢 + (푙 ) 푠푙 (푢 ) 푠푢 + 푐 푇 − 푐 푐 푥 푥− 푥 subject to 푐 + 퐴 ( 푠푙 + 푠푢) 푠푙 + 푠푢 푠푛 = 0, 푐 푐 푥− 푥 − − 푠푙 , 푠푢, 푠푙 , 푠푢 0, 푠푥 퐾*. ≥ 푛 ∈

89 Example 8.5 (Dual of simple portfolio). Consider a simplified version of the portfolio optimization problem, where we maximize expected return subject to an upper bound on the risk and no other constraints: maximize 휇푇 푥 subject to 퐹 푥 2 훾, ‖ ‖ ≤ where 푥 R푛 and 퐹 R푚×푛. The conic formulation is ∈ ∈ maximize 휇푇 푥 (8.9) subject to (훾, 퐹 푥) 푚+1, ∈ 풬 and we can directly write the Lagrangian

퐿(푥, 푣, 푤) = 휇푇 푥 + 푣훾 + 푤푇 퐹 푥 = 푥푇 (휇 + 퐹 푇 푤) + 푣훾

with (푣, 푤) ( 푚+1)* = 푚+1. Note that we chose signs to have 퐿(푥, 푣, 푤) 휇푇 푥 ∈ 풬 풬 ≥ since we are dealing with a maximization problem. The dual is now determined by the 푇 conditions 퐹 푤 + 휇 = 0 and 푣 푤 2, so it can be formulated as ≥ ‖ ‖

minimize 훾 푤 2 ‖ ‖ (8.10) subject to 퐹 푇 푤 = 휇. − Note that it is actually more natural to view problem (8.9) as the dual form and problem (8.10) as the primal. Indeed we can write the constraint in (8.9) as

[︂ 훾 ]︂ [︂ 0 ]︂ 푥 (푄푚+1)* 0 − 퐹 ∈ − which fits naturally into the form (8.7). Having done this we can recover the dual as a minimization problem in the standard form (8.1). We leave it to the reader to check that we get the same answer as above.

8.4 Weak and strong duality

Weak duality Suppose 푥* and (푦*, 푠*) are feasible points for the primal and dual problems (8.1) and (8.6), respectively. Then we have

푏푇 푦* = (퐴푥*)푇 푦* = (푥*)푇 (퐴푇 푦*) = (푥*)푇 (푐 푠*) = 푐푇 푥* (푠*)푇 푥* 푐푇 푥* (8.11) − − ≤ so the dual objective value is a lower bound on the objective value of the primal. In particular, any dual feasible point (푦*, 푠*) gives a lower bound: 푏푇 푦* 푝⋆ ≤ and we immediately get the next lemma.

90 Lemma 8.4 (Weak duality). 푑⋆ 푝⋆. ≤ It follows that if 푏푇 푦* = 푐푇 푥* then 푥* is optimal for the primal, (푦*, 푠*) is optimal for the dual and 푏푇 푦* = 푐푇 푥* is the common optimal objective value. This way we can use the optimal dual solution to certify optimality of the primal solution and vice versa.

Complementary slackness Moreover, (8.11) asserts that 푏푇 푦* = 푐푇 푥* is equivalent to orthogonality (푠*)푇 푥* = 0 i.e. complementary slackness. It is not hard to verify what complementary slackness means for particular types of cones, for example

• for 푠, 푥 R+ we have 푠푥 = 0 iff 푠 = 0 or 푥 = 0, ∈ 푛+1 • vectors (푠1, 푠˜), (푥1, 푥˜) are orthogonal iff (푠1, 푠˜) and (푥1, 푥˜) are parallel, ∈ 풬 − 푛+2 • vectors (푠1, 푠2, 푠˜), (푥1, 푥2, 푥˜) are orthogonal iff (푠1, 푠2, 푠˜) and (푥2, 푥1, 푥˜) are ∈ 풬r − parallel. One implicit case is worth special attention: complementary slackness for a linear inequality constraint 푎푇 푥 푏 with a non-negative dual variable 푦 asserts that (푎푇 푥* ≤ − 푏)푦* = 0. This can be seen by directly writing down the appropriate Lagrangian for this type of constraint. Alternatively, we can introduce a slack variable 푢 = 푏 푎푇 푥 with a − conic constraint 푢 0 and let 푦 be the dual conic variable. In particular, if a constraint ≥ is non-binding in the optimal solution (푎푇 푥* < 푏) then the corresponding dual variable 푦* = 0. If 푦* > 0 then it can be related to a shadow price, see Sec. 2.4.4 and Sec. 8.5.2.

Strong duality

The obvious question is now whether 푑⋆ = 푝⋆, that is if optimality of a primal solution can always be certified by a dual solution with matching objective value, as in the linear case. This turns out not to be the case for general conic problems.

Example 8.6 (Positive duality gap). Consider the problem

minimize 푥3 √︀ 2 2 subject to 푥1 푥2 + 푥3, 푥 ≥ 푥 , 푥 1. 2 ≥ 1 3 ≥ − The only feasible points are (푥, 푥, 0), so 푝⋆ = 0. The dual problem is

maximize 푦2 − √︀ 2 2 subject to 푦1 푦 + (1 푦2) , ≥ 1 − with feasible points (푦, 1), hence 푑⋆ = 1. − Similarly, we consider a problem

minimize 푥1 ⎡ ⎤ 0 푥1 0 3 subject to ⎣ 푥1 푥2 0 ⎦ , ∈ 풮+ 0 0 1 + 푥1

91 ⋆ with feasible set 푥1 = 0, 푥2 0 and optimal value 푝 = 0. The dual problem can be { ≥ } formulated as

maximize 푧2 ⎡− ⎤ 푧1 (1 푧2)/2 0 − 3 subject to ⎣ (1 푧2)/2 0 0 ⎦ , − ∈ 풮+ 0 0 푧2

⋆ which has a feasible set 푧1 0, 푧2 = 1 and dual optimal value 푑 = 1. { ≥ } −

To ensure strong duality for conic problems we need an additional regularity assump- tion. As with the conic version of Farkas’ lemma Lemma 8.3, we stress that this is a technical condition to eliminate ill-posed problems which should not be formed in prac- tice. In particular, we invite the reader to think that strong duality holds for all well formed conic problems one is likely to come across in applications, and that having a duality gap signals issues with the model formulation. We say problem (8.1) is very nicely posed if for all values of 푐0 the feasibility problem

푐푇 푥 = 푐 , 퐴푥 = 푏, 푥 퐾 0 ∈ satisfies either the first or third alternative in Lemma 8.3. Lemma 8.5 (Strong duality). Suppose that (8.1) is very nicely posed and 푝⋆ is finite. Then 푑⋆ = 푝⋆. Proof. For any 휀 > 0 consider the feasibility problem with variable 푥 and constraints

[︂ 푐푇 ]︂ [︂ 푝⋆ + 휀 ]︂ 푐푇 푥 = 푝⋆ 휀, − 푥 = − − 퐴 푏 that is 퐴푥 = 푏, 푥 퐾 푥 퐾. ∈ ∈ Optimality of 푝⋆ implies that the above problem is infeasible. By Lemma 8.3 and because 푇 we assumed very-nice-posedness there exists 푦ˆ = [푦0 푦] such that

[︀ 푇 ]︀ * [︀ ⋆ 푇 ]︀ 푐, 퐴 푦ˆ 퐾 and 푝 + 휀, 푏 푦ˆ > 0. − ∈ − 푇 * 푇 If 푦0 = 0 then 퐴 푦 퐾 and 푏 푦 > 0, which by Lemma 8.3 again would mean that − ∈ the original problem was infeasible, which is not the case. Hence we can rescale so that 푦0 = 1 and then we get

푐 퐴푇 푦 퐾* and 푏푇 푦 푝⋆ 휀. − ∈ ≥ − The first constraint means that 푦 is feasible for the dual problem. By letting 휀 0 we → obtain 푑⋆ 푝⋆. ≥ There are more direct conditions which guarantee strong duality, such as below. Lemma 8.6 (Slater constraint qualification). Suppose that (8.1) is strictly feasible: there exists a point 푥 int(퐾) in the interior of 퐾 such that 퐴푥 = 푏. Then strong duality ∈ holds if 푝⋆ is finite. Moreover, if both primal and dual problem are strictly feasible then 푝⋆ and 푑⋆ are attained.

92 We omit the proof which can be found in standard texts. Note that the first problem from Example 8.6 does not satisfy Slater constraint qualification: the only feasible points lie on the boundary of the cone (we say the problem has no interior). That problem 2 −1 2 −1 3 is not very nicely posed either: the point (푥1, 푥2, 푥3) = (0.5푐 휀 + 휀, 0.5푐 휀 , 푐0) 0 0 ∈ 풬 violates the inequality 푥2 푥1 by an arbitrarily small 휀, so the problem is infeasible but ≥ limit-feasible (second alternative in Lemma 8.3).

8.5 Applications of conic duality

8.5.1 Linear regression and the normal equation

Least-squares linear regression is the problem of minimizing ‖퐴푥 − 푏‖₂² over 푥 ∈ Rⁿ, where 퐴 ∈ R^(푚×푛) and 푏 ∈ R^푚 are fixed. This problem can be posed in conic form as

minimize   푡
subject to (푡, 퐴푥 − 푏) ∈ 풬^(푚+1),

and we can write the Lagrangian

퐿(푡, 푥, 푢, 푣) = 푡 − 푡푢 − 푣ᵀ(퐴푥 − 푏) = 푡(1 − 푢) − 푥ᵀ퐴ᵀ푣 + 푏ᵀ푣,

so the constraints in the dual problem are:

푢 = 1,  퐴ᵀ푣 = 0,  (푢, 푣) ∈ 풬^(푚+1).

The problem exhibits strong duality with both the primal and dual values attained in the optimal solution. The primal solution clearly satisfies 푡 = ‖퐴푥 − 푏‖₂, and so complementary slackness for the quadratic cone implies that the vectors (푢, 푣) = (1, 푣) and (푡, −(퐴푥 − 푏)) are parallel. As a consequence the constraint 퐴ᵀ푣 = 0 becomes 퐴ᵀ(퐴푥 − 푏) = 0, or simply

퐴ᵀ퐴푥 = 퐴ᵀ푏,

so if 퐴ᵀ퐴 is invertible then 푥 = (퐴ᵀ퐴)⁻¹퐴ᵀ푏. This is the so-called normal equation for least-squares regression, which we have now obtained as a consequence of strong duality.
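A quick numerical confirmation that the normal equation reproduces the least-squares solution; the data below is random and only for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 4))
b = rng.standard_normal(50)

x_normal = np.linalg.solve(A.T @ A, A.T @ b)       # normal equation
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)    # reference least-squares solution

print(np.allclose(x_normal, x_lstsq))              # True
```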

8.5.2 Constraint attribution Consider again a portfolio optimization problem with mean-variance utility function, vector of expected returns 훼 and covariance matrix Σ = 퐹 푇 퐹 :

maximize 훼푇 푥 1 푐푥푇 Σ푥 − 2 (8.12) subject to 퐴푥 푏, ≤ where the linear part represents any set of additional constraints: total budget, sector constraints, diversification constraints, individual relations between positions etc. Inthe absence of additional constraints the solution to the unconstrained maximization problem is easy to derive using basic calculus and equals

푥ˆ = 푐−1Σ−1훼.

93 We would like to understand the difference 푥* 푥ˆ, where 푥* is the solution of (8.12), and − in particular to measure which of the linear constraints actually cause 푥* to deviate from 푥ˆ and to what degree. This can be quantified using the dual variables. The conic version of problem (8.12) is

maximize   훼ᵀ푥 − 푐푟
subject to 퐴푥 ≤ 푏,
           (1, 푟, 퐹푥) ∈ 풬r

with dual

minimize   푏ᵀ푦 + 푠
subject to 퐴ᵀ푦 = 훼 + 퐹ᵀ푢,
           (푠, 푐, 푢) ∈ 풬r,
           푦 ≥ 0.

Suppose we have a primal-dual optimal solution (푥*, 푟*, 푦*, 푠*, 푢*). Complementary slackness for the rotated quadratic cone implies

(푠*, 푐, 푢*) = 훽(푟*, 1, −퐹푥*),

which leads to 훽 = 푐 and

퐴ᵀ푦* = 훼 − 푐퐹ᵀ퐹푥*,

or equivalently

푥* = 푥̂ − 푐⁻¹Σ⁻¹퐴ᵀ푦*.

This equation splits the difference between the constrained and unconstrained solutions into contributions from individual constraints, where the weights are precisely the dual variables 푦*. For example, if a constraint is not binding (푎푖ᵀ푥* − 푏푖 < 0) then by complementary slackness 푦푖* = 0 and, indeed, a non-binding constraint has no effect on the change in solution.
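In code the attribution is a one-liner once a solver has returned an optimal primal-dual pair. The sketch below assumes hypothetical arrays x_star and y_star produced by such a solver and only evaluates the decomposition derived above:

```python
import numpy as np

def constraint_attribution(alpha, Sigma, A, c, x_star, y_star):
    """Split x_hat - x_star into per-constraint pieces c^{-1} Sigma^{-1} a_i y_i*."""
    x_hat = np.linalg.solve(c * Sigma, alpha)            # unconstrained optimizer
    pieces = np.linalg.solve(c * Sigma, A.T) * y_star    # column i: c^{-1} Sigma^{-1} a_i y_i*
    residual = x_star - (x_hat - pieces.sum(axis=1))     # ~0 for an exact primal-dual solution
    return x_hat, pieces, residual
```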

8.6 Semidefinite duality and LMIs

The general theory of conic duality applies in particular to problems with semidefinite variables so here we just state it in the language familiar to SDP practitioners. Consider for simplicity a primal semidefinite optimization problem with one matrix variable

minimize 퐶, 푋 ⟨ ⟩ subject to 퐴푖, 푋 = 푏푖, 푖 = 1, . . . , 푚, (8.13) ⟨푋 ⟩푛 . ∈ 풮+ We can quickly repeat the derivation of the dual problem in this notation. The Lagrangian is ∑︀ 퐿(푋, 푦, 푆) = 퐶, 푋 푖 푦푖( 퐴푖, 푋 푏푖) 푆, 푋 = ⟨퐶 ∑︀⟩ −푦 퐴 ⟨푆, 푋 ⟩+ −푏푇 푦 − ⟨ ⟩ ⟨ − 푖 푖 푖 − ⟩ 94 and we get the dual problem maximize 푏푇 푦 ∑︀푚 푛 (8.14) subject to 퐶 푦푖퐴푖 . − 푖=1 ∈ 풮+ 푛 The dual contains an affine matrix-valued function with coefficients 퐶, 퐴푖 and ∈ 풮 variable 푦 R푚. Such a matrix-valued affine inequality is called a linear matrix inequality ∈ (LMI). In Sec.6 we formulated many problems as LMIs, that is in the form more naturally matching the dual. From a modeling perspective it does not matter whether constraints are given as linear matrix inequalities or as an intersection of affine hyperplanes; one formulation is easily converted to other using auxiliary variables and constraints, and this transformation is often done transparently by optimization software. Nevertheless, it is instructive to study an explicit example of how to carry out this transformation. An linear matrix inequality 퐴 + 푥 퐴 + + 푥 퐴 0 0 1 1 ··· 푛 푛 ⪰ 푚 where 퐴푖 is converted to a set of linear equality constraints using a slack variable ∈ 풮+ 퐴 + 푥 퐴 + + 푥 퐴 = 푆, 푆 0. 0 1 1 ··· 푛 푛 ⪰ Apart from introducing an explicit semidefinite variable 푆 푚 we also added 푚(푚+1)/2 ∈ 풮+ equality constraints. On the other hand, a semidefinite variable 푋 푛 can be rewritten ∈ 풮+ as a linear matrix inequality with 푛(푛 + 1)/2 scalar variables 푛 푛 푛 ∑︁ ∑︁ ∑︁ 푋 = 푒 푒푇 푥 + (푒 푒푇 + 푒 푒푇 )푥 0. 푖 푖 푖푖 푖 푗 푗 푖 푖푗 ⪰ 푖=1 푖=1 푗=푖+1 Obviously we should only use these transformations when necessary; if we have a problem that is more naturally interpreted in either primal or dual form, we should be careful to recognize that structure.

Example 8.7 (Dualization and efficiency). Consider the problem: minimize 푒푇 푧 subject to 퐴 + Diag(푧) = 푋, 푋 0. ⪰ 푛 푛 with the variables 푋 + and 푧 R . This is a problem in primal form with ∈ 풮 ∈ 푛(푛 + 1)/2 equality constraints, but they are more naturally interpreted as a linear matrix inequality ∑︁ 퐴 + 푒 푒푇 푧 0. 푖 푖 푖 ⪰ 푖 The dual problem is maximize 퐴, 푍 −⟨ ⟩ subject to diag(푍) = 푒, 푍 0, ⪰ in the variable 푍 푛 . The dual problem has only 푛 equality constraints, which is a ∈ 풮+ vast improvement over the 푛(푛 + 1)/2 constraints in the primal problem. See also Sec. 7.5.

95 Example 8.8 (Sum of singular values revisited). In Sec. 6.2.4, and specifically in (6.16), we expressed the problem of minimizing the sum of singular values of a non- symmetric matrix 푋. Problem (6.16) can be written as an LMI: ∑︀ maximize 푖,푗 푋푖푗푧푖푗 [︂ 푇 ]︂ ∑︀ 0 푒푗푒푖 subject to 퐼 푖,푗 푧푖푗 푇 0. − 푒푖푒푗 0 ⪰

Treating this as the dual and going back to the primal form we get:

minimize Tr(푈) + Tr(푉 ) 1 subject to 푆 = 2 푋, [︂ 푈− 푆푇 ]︂ 0, 푆 푉 ⪰ which is equivalent to the claimed (6.17). The dual formulation has the advantage of being linear in 푋.

96 Chapter 9

Mixed integer optimization

In other chapters of this cookbook we have considered different classes of convex problems with continuous variables. In this chapter we consider a much wider range of non-convex problems by allowing integer variables. This technique is extremely useful in practice, and already for linear programming it covers a vast range of problems. We introduce different building blocks for integer optimization, which make it possible to model useful non-convex dependencies between variables in conic problems. It should be noted that mixed integer optimization problems are very hard (technically NP-hard), and for many practical cases an exact solution may not be found in reasonable time.

9.1 Integer modeling

A general mixed integer conic optimization problem has the form

minimize 푐푇 푥 subject to 퐴푥 = 푏, (9.1) 푥 퐾, ∈ 푥푖 Z, 푖 , ∈ ∀ ∈ ℐ where 퐾 is a cone and 1, . . . , 푛 denotes the set of variables that are constrained ℐ ⊆ { } to be integers. Two major techniques are typical for mixed integer optimization. The first one is the use of binary variables, also known as indicator variables, which only take values 0 and 1, and indicate the absence or presence of a particular event or choice. This restriction can of course be modeled in the form (9.1) by writing:

0 푥 1 and 푥 Z. ≤ ≤ ∈ The other, known as big-M, refers to the fact that some relations can only be modeled linearly if one assumes some fixed bound 푀 on the quantities involved, and this constant enters the model formulation. The choice of 푀 can affect the performance of the model, see Example 7.8.

97 9.1.1 Implication of positivity

Often we have a real-valued variable 푥 R satisfying 0 푥 < 푀 for a known upper ∈ ≤ bound 푀, and we wish to model the implication

푥 > 0 = 푧 = 1. (9.2) ⇒ Making 푧 a binary variable we can write (9.2) as a linear inequality

푥 푀푧, 푧 0, 1 . (9.3) ≤ ∈ { } Indeed 푥 > 0 excludes the possibility of 푧 = 0, hence forces 푧 = 1. Since a priori 푥 푀, ≤ there is no danger that the constraint accidentally makes the problem infeasible. A typical use of this trick is to model fixed setup costs.

Example 9.1 (Fixed setup cost). Assume that production of a specific item 푖 costs 푢푖 per unit, but there is an additional fixed charge of 푤푖 if we produce item 푖 at all. For instance, 푤푖 could be the cost of setting up a production plant, initial cost of equipment etc. Then the cost of producing 푥푖 units of product 푖 is given by the discontinuous function

푐푖(푥푖) = 푤푖 + 푢푖푥푖   if 푥푖 > 0,
푐푖(푥푖) = 0           if 푥푖 = 0.

If we let 푀 denote an upper bound on the quantities we can produce, we can then minimize the total production cost of 푛 products under some affine constraint 퐴푥 = 푏 with

minimize   푢ᵀ푥 + 푤ᵀ푧
subject to 퐴푥 = 푏,
           푥푖 ≤ 푀푧푖,  푖 = 1, . . . , 푛,
           푥 ≥ 0,
           푧 ∈ {0, 1}ⁿ,

which is a linear mixed-integer optimization problem. Note that by minimizing the production cost, we drive 푧푖 to 0 when 푥푖 = 0, so setup costs are indeed included only for products with 푥푖 > 0.
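For illustration, the model is small enough to pass to a generic MILP interface; the sketch below uses SciPy's milp (available in SciPy ≥ 1.9) on made-up data, so the cost vectors and the single budget constraint are assumptions for this example only:

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

n, M = 3, 10.0
u = np.array([1.0, 2.0, 0.5])      # unit costs
w = np.array([9.0, 1.0, 5.0])      # fixed setup charges
# Single affine constraint Ax = b: total production must equal 10 units.
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([10.0])

# Variable vector is (x, z); objective u^T x + w^T z.
c = np.concatenate([u, w])
eq = LinearConstraint(np.hstack([A, np.zeros((1, n))]), b, b)
link = LinearConstraint(np.hstack([np.eye(n), -M * np.eye(n)]), -np.inf, 0.0)  # x_i <= M z_i
bounds = Bounds(np.zeros(2 * n), np.concatenate([np.full(n, np.inf), np.ones(n)]))
integrality = np.concatenate([np.zeros(n), np.ones(n)])   # x continuous, z integer (binary)

res = milp(c=c, constraints=[eq, link], bounds=bounds, integrality=integrality)
x, z = res.x[:n], res.x[n:]
print(x, z, res.fun)
```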

9.1.2 Semi-continuous variables

We can also model semi-continuity of a variable 푥 ∈ R,

푥 ∈ {0} ∪ [푎, 푏],   (9.4)

where 0 < 푎 ≤ 푏, using a double inequality

푎푧 ≤ 푥 ≤ 푏푧,  푧 ∈ {0, 1}.

Fig. 9.1: Production cost with fixed setup cost 푤푖.

9.1.3 Indicator constraints

Suppose we want to model the fact that a certain linear inequality must be satisfied when some other event occurs. In other words, for a binary variable 푧 we want to model the implication

푧 = 1 ⟹ 푎ᵀ푥 ≤ 푏.

Suppose we know in advance an upper bound 푎ᵀ푥 − 푏 ≤ 푀. Then we can write the above as a linear inequality

푎ᵀ푥 ≤ 푏 + 푀(1 − 푧).

Now if 푧 = 1 then we forced 푎ᵀ푥 ≤ 푏, while for 푧 = 0 the inequality is trivially satisfied and does not enforce any additional constraint on 푥.

9.1.4 Disjunctive constraints

With a disjunctive constraint we require that at least one of the given linear constraints is satisfied, that is

(푎1ᵀ푥 ≤ 푏1) ∨ (푎2ᵀ푥 ≤ 푏2) ∨ · · · ∨ (푎푘ᵀ푥 ≤ 푏푘).

Introducing binary variables 푧1, . . . , 푧푘, we can use Sec. 9.1.3 to write a linear model

푧1 + · · · + 푧푘 ≥ 1,
푧푖 ∈ {0, 1},  푖 = 1, . . . , 푘,
푎푖ᵀ푥 ≤ 푏푖 + 푀(1 − 푧푖),  푖 = 1, . . . , 푘.

Note that 푧푗 = 1 implies that the 푗-th constraint is satisfied, but not vice-versa. Achieving that effect is described in the next section.

9.1.5 Constraint satisfaction

Say we want to define an optimization model that will behave differently depending on which of the inequalities

푎ᵀ푥 ≤ 푏   or   푎ᵀ푥 ≥ 푏

is satisfied. Suppose we have lower and upper bounds for 푎ᵀ푥 − 푏 in the form of 푚 ≤ 푎ᵀ푥 − 푏 ≤ 푀. Then we can write a model

푏 + 푚푧 ≤ 푎ᵀ푥 ≤ 푏 + 푀(1 − 푧),  푧 ∈ {0, 1}.   (9.5)

Now observe that 푧 = 0 implies 푏 ≤ 푎ᵀ푥 ≤ 푏 + 푀, of which the right-hand inequality is redundant, i.e. always satisfied. Similarly, 푧 = 1 implies 푏 + 푚 ≤ 푎ᵀ푥 ≤ 푏. In other words 푧 is an indicator of whether 푎ᵀ푥 ≤ 푏.

In practice we would relax one inequality using a small amount of slack, i.e.,

푏 + (푚 − 휖)푧 + 휖 ≤ 푎ᵀ푥   (9.6)

to avoid issues with classifying the equality 푎ᵀ푥 = 푏.

9.1.6 Exact absolute value In Sec. 2.2.2 we showed how to model 푥 푡 as two linear inequalities. Now suppose we | | ≤ need to model an exact equality

푥 = 푡. (9.7) | | It defines a non-convex set, hence it is not conic representable. If we split 푥 into positive and negative part 푥 = 푥+ 푥−, where 푥+, 푥− 0, then 푥 = 푥+ + 푥− as long as either − ≥ | | 푥+ = 0 or 푥− = 0. That last alternative can be modeled with a binary variable, and we get a model of (9.7):

푥 = 푥+ 푥−, 푡 = 푥+ +− 푥−, 0 푥+, 푥−, (9.8) 푥+ ≤ 푀푧, 푥− ≤ 푀(1 푧), 푧 ≤ 0, 1 −, ∈ { } where the constant 푀 is an a priori known upper bound on 푥 in the problem. | | 9.1.7 Exact 1-norm

We can use the technique above to model the exact ℓ1-norm equality constraint

푛 ∑︁ 푥푖 = 푐, (9.9) | | 푖=1

where 푥 ∈ Rⁿ is a decision variable and 푐 is a constant. Such constraints arise for instance in fully invested portfolio optimization scenarios (with short-selling). As before, we split

100 푥 into a positive and negative part, using a sequence of binary variables to guarantee that at most one of them is nonzero: 푥 = 푥+ 푥−, 0 푥+,− 푥−, 푥+ ≤ 푐푧, (9.10) 푥− ≤ 푐(푒 푧), ∑︀ + ∑︀ − ≤ − 푖 푥푖 + 푖 푥푖 = 푐, 푧 0, 1 푛, 푥+, 푥− R푛. ∈ { } ∈ 9.1.8 Maximum

The exact equality 푡 = max 푥1, . . . , 푥푛 can be expressed by introducing a sequence of { } mutually exclusive indicator variables 푧1, . . . , 푧푛, with the intention that 푧푖 = 1 picks the variable 푥푖 which actually achieves maximum. Choosing a safe bound 푀 we get a model:

푥 푡 푥 + 푀(1 푧 ), 푖 = 1, . . . , 푛, 푖 ≤ ≤ 푖 − 푖 푧1 + + 푧푛 = 1, (9.11) ··· 푧 0, 1 푛. ∈ { } 9.1.9 Boolean operators Typically an indicator variable 푧 0, 1 represents a boolean value (true/false). In this ∈ { } case the standard boolean operators can be implemented as linear inequalities. In the table below we assume all variables are binary.

Table 9.1: Boolean operators

Boolean                                                     Linear
푧 = 푥 OR 푦                                                  푥 ≤ 푧, 푦 ≤ 푧, 푧 ≤ 푥 + 푦
푧 = 푥 AND 푦                                                 푥 ≥ 푧, 푦 ≥ 푧, 푧 + 1 ≥ 푥 + 푦
푧 = NOT 푥                                                   푧 = 1 − 푥
푥 ⟹ 푦                                                       푥 ≤ 푦
At most one of 푧1, . . . , 푧푛 holds (SOS1, set-packing)      Σ푖 푧푖 ≤ 1
Exactly one of 푧1, . . . , 푧푛 holds (set-partitioning)       Σ푖 푧푖 = 1
At least one of 푧1, . . . , 푧푛 holds (set-covering)          Σ푖 푧푖 ≥ 1
At most 푘 of 푧1, . . . , 푧푛 holds (cardinality)              Σ푖 푧푖 ≤ 푘
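The encodings in the table are easy to sanity-check by brute force over all 0/1 assignments; a short sketch for the OR and AND rows:

```python
from itertools import product

def or_ok(x, y, z):
    return x <= z and y <= z and z <= x + y

def and_ok(x, y, z):
    return x >= z and y >= z and z + 1 >= x + y

for x, y in product((0, 1), repeat=2):
    # The unique z satisfying the inequalities must equal the boolean value.
    assert [z for z in (0, 1) if or_ok(x, y, z)] == [x | y]
    assert [z for z in (0, 1) if and_ok(x, y, z)] == [x & y]
print("OR/AND encodings verified")
```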

9.1.10 Fixed set of values

We can restrict a variable 푥 to take on only values from a specified finite set 푎1, . . . , 푎푛 { } by writing ∑︀ 푥 = 푖 푧푖푎푖 푧 0, 1 푛, (9.12) ∑︀ ∈ { } 푖 푧푖 = 1.

In (9.12) we essentially defined 푧푖 to be the indicator variable of whether 푥 = 푎푖. In some circumstances there may be more efficient representations of a restricted set of values, for example:

101 • (sign) 푥 1, 1 푥 = 2푧 1, 푧 0, 1 , ∈ {− } ⇐⇒ − ∈ { } • (modulo) 푥 1, 4, 7, 10 푥 = 3푧 + 1, 0 푧 3, 푧 Z, ∈ { } ⇐⇒ ≤ ≤ ∈ • (fraction) 푥 0, 1/3, 2/3, 1 3푥 = 푧, 0 푧 3, 푧 Z, ∈ { } ⇐⇒ ≤ ≤ ∈ • (gap) 푥 ( , 푎] [푏, ) 푏 푀(1 푧) 푥 푎 + 푀푧, 푧 0, 1 for ∈ −∞ ∪ ∞ ⇐⇒ − − ≤ ≤ ∈ { } sufficiently large 푀.

In a very similar fashion we can restrict a variable 푥 to a finite union of intervals ⋃︀ 푖[푎푖, 푏푖]:

∑︀ 푧 푎 푥 ∑︀ 푧 푏 , 푖 푖 푖 ≤ ≤ 푖 푖 푖 푧 0, 1 푛, (9.13) ∑︀∈ { } 푖 푧푖 = 1.

9.1.11 Alternative as a bilinear equality In the literature one often finds models containing a bilinear constraint

푥푦 = 0.

This constraint is just an algebraic formulation of the alternative 푥 = 0 or 푦 = 0 and therefore it can be written as a mixed-integer linear problem:

푥 푀푧, |푦| ≤ 푀(1 푧), | 푧| ≤ 0, 1 −, ∈ { } for a suitable constant 푀. The absolute value can be omitted if both 푥, 푦 are nonnegative. Otherwise it can be modeled as in Sec. 2.2.2.

9.1.12 Equal signs

Suppose we want to say that the variables 푥1, . . . , 푥푛 must either be all nonnegative or all nonpositive. This can be achieved by introducing a single binary variable 푧 0, 1 ∈ { } which picks one of these two alternatives:

푀(1 푧) 푥 푀푧, 푖 = 1, . . . , 푛. − − ≤ 푖 ≤

Indeed, if 푧 = 0 then we have 푀 푥푖 0 and if 푧 = 1 then 0 푥푖 푀. − ≤ ≤ ≤ ≤ 9.1.13 Continuous piecewise-linear functions

Consider a continuous, univariate, piecewise-linear, non-convex function 푓 :[훼1, 훼5] R ↦→ shown in Fig. 9.2. At the interval [훼푗, 훼푗+1], 푗 = 1, 2, 3, 4 we can describe the function as

푓(푥) = 휆푗푓(훼푗) + 휆푗+1푓(훼푗+1)

Fig. 9.2: A univariate piecewise-linear non-convex function.

where 휆푗훼푗 + 휆푗+1훼푗+1 = 푥, 휆푗 + 휆푗+1 = 1 and 휆푗, 휆푗+1 ≥ 0. If we add a constraint that only two (adjacent) variables 휆푗, 휆푗+1 can be nonzero, we can characterize every value 푓(푥) over the entire interval [훼1, 훼5] as such a convex combination:

푥 = Σ_{푗=1}^{푛} 휆푗훼푗,  푓(푥) = Σ_{푗=1}^{푛} 휆푗푓(훼푗),  Σ_{푗=1}^{푛} 휆푗 = 1,  휆 ≥ 0.

The condition that only two adjacent variables can be nonzero is sometimes called an SOS2 constraint. If we introduce an indicator variable 푧푗 for each pair of adjacent variables (휆푗, 휆푗+1), we can model an SOS2 constraint as

휆1 ≤ 푧1,  휆푗 ≤ 푧푗−1 + 푧푗 for 푗 = 2, . . . , 푛 − 1,  휆푛 ≤ 푧푛−1,
푧1 + · · · + 푧푛−1 = 1,  푧 ∈ {0, 1}^(푛−1),

so that we have 푧푗 = 1 ⟹ 휆푖 = 0 for 푖 ∉ {푗, 푗 + 1}. Collectively, we can then model the epigraph 푓(푥) ≤ 푡 for a piecewise-linear function 푓(푥) with 푛 terms as

푥 = Σ_{푗=1}^{푛} 휆푗훼푗,  Σ_{푗=1}^{푛} 휆푗푓(훼푗) ≤ 푡,
휆1 ≤ 푧1,  휆푗 ≤ 푧푗−1 + 푧푗 for 푗 = 2, . . . , 푛 − 1,  휆푛 ≤ 푧푛−1,           (9.14)
휆 ≥ 0,  Σ_{푗=1}^{푛} 휆푗 = 1,  푧1 + · · · + 푧푛−1 = 1,  푧 ∈ {0, 1}^(푛−1).

This approach is often called the lambda-method.

For the function in Fig. 9.2 we can reduce the number of integer variables by using a Gray encoding of the intervals [훼푗, 훼푗+1],

[훼1, 훼2] ↔ 00,  [훼2, 훼3] ↔ 10,  [훼3, 훼4] ↔ 11,  [훼4, 훼5] ↔ 01,

and an indicator variable 푦 ∈ {0, 1}² to represent the four different values of the Gray code. We can then describe the constraints on 휆 using only two indicator variables,

(푦1 = 0) → 휆3 = 0,
(푦1 = 1) → 휆1 = 휆5 = 0,
(푦2 = 0) → 휆4 = 휆5 = 0,
(푦2 = 1) → 휆1 = 휆2 = 0,

Note, for example, that 푧2 = 1 implies that 휆2 = 휆5 = 휆6 = 0 and 푥 = 휆1푣1 + 휆3푣3 + 휆4푣4.

Fig. 9.3: A multivariate continuous piecewise-linear non-convex function.

9.1.14 Lower semicontinuous piecewise-linear functions The ideas in Sec. 9.1.13 can be applied to lower semicontinuous piecewise-linear functions as well. For example, consider the function shown in Fig. 9.4. If we denote the one-sided limits by 푓−(푐) := lim푥↑푐 푓(푥) and 푓+(푐) := lim푥↓푐 푓(푥), respectively, the one-sided limits, then we can describe the epigraph 푓(푥) 푡 for the function in Fig. 9.4 as ≤

푥 = 휆1훼1 + (휆2 + 휆3 + 휆4)훼2 + 휆5훼3, 휆 푓(훼 ) + 휆 푓−(훼 ) + 휆 푓(훼 ) + 휆 푓 (훼 ) + 휆 푓(훼 ) 푡, 1 1 2 2 3 2 4 + 2 5 3 ≤ (9.16) 휆1 + 휆2 푧1, 휆3 푧2, 휆4 + 휆5 푧3, 5 ≤ ≤ 3 ≤ 휆 0, ∑︀ 휆 = 1, ∑︀ 푧 = 1, 푧 0, 1 3, ≥ 푖=1 푖 푖=1 푖 ∈ { }

where we have a different decision variable for the intervals [훼1, 훼2), [훼2, 훼2], and (훼2, 훼3]. As a special case this gives us an alternative characterization of fixed charge models considered in Sec. 9.1.1.

Fig. 9.4: A univariate lower semicontinuous piecewise-linear function.

9.2 Mixed integer conic case studies

9.2.1 Wireless network design

The following problem arises in wireless network design and some other applications. We want to serve 푛 clients with 푘 transmitters. A transmitter with range 푟푗 can serve all clients within Euclidean distance 푟푗 from its position. The power consumption of a transmitter with range 푟 is proportional to 푟^훼 for some fixed exponent 1 ≤ 훼 < ∞. The goal is to assign locations and ranges to the transmitters, minimizing the total power consumption while providing coverage of all clients.

This is a type of clustering problem. Denoting by 푥1, . . . , 푥푛 ∈ R² the locations of the clients and by 푝푗 ∈ R², 푟푗 ≥ 0 the position and range of the 푗-th transmitter, we can model the problem using binary decision variables 푧푖푗 indicating if the 푖-th client is covered by the 푗-th transmitter. This leads to a mixed-integer conic problem:

minimize   Σ_푗 푟푗^훼
subject to ‖푝푗 − 푥푖‖₂ ≤ 푟푗 + 푀(1 − 푧푖푗),  푖 = 1, . . . , 푛,  푗 = 1, . . . , 푘,
           Σ_푗 푧푖푗 ≥ 1,  푖 = 1, . . . , 푛,                                   (9.17)
           푧푖푗 ∈ {0, 1}.

The objective can be realized by summing power bounds 푡푗 ≥ 푟푗^훼 or by directly bounding the 훼-norm of (푟1, . . . , 푟푘); the latter approach would be recommended for large 훼. For 훼 = 1, 2, respectively, we are minimizing the perimeter and area of the covered region. In practical applications the power exponent 훼 can be as large as 6, depending on various factors (for instance terrain). In the linear cost model (훼 = 1) typical solutions contain a small number of huge disks covering most of the clients. For increasing 훼 large ranges are penalized more heavily and the disks tend to be more balanced.

maximize 휇푇 푥 subject to (훾, 퐺푇 푥) 푛+1, (9.18) 푒푇 푥 = 푤 +∈푒 풬푇 푥0, 푥 0, ≥ with initial holdings 푥0, initial cash amount 푤, expected returns 휇, risk bound 훾 and 0 decision variable 푥. Here 푒 is the all-ones vector. Let ∆푥푗 = 푥푗 푥 denote the change − 푗 of position in asset 푗.

Transaction costs

A transaction cost involved with nonzero ∆푥푗 could be modeled as {︃ 0, ∆푥푗 = 0, 푇푗(푥푗) = 훼 ∆푥 + 훽 , ∆푥 = 0, 푗 푗 푗 푗 ̸ similarly to the problem from Example 9.1. Including transaction costs will now lead to the model: maximize 휇푇 푥 subject to (훾, 퐺푇 푥) 푛+1, ∈ 풬 푒푇 푥 + 훼푇 푥 + 훽푇 푧 = 푤 + 푒푇 푥0, (9.19) 푥 푥0 푀푧, 푥0 푥 푀푧, 푥 − 0,≤ 푧 0, 1 푛−, ≤ ≥ ∈ { } where the binary variable 푧푗 is an indicator of ∆푥푗 = 0. Here 푀 is a sufficiently large ̸ constant, for instance 푀 = 푤 + 푒푇 푥0 will do.

Cardinality constraints Another option is to fix an upper bound 푘 on the number of nonzero trades. The meaning of 푧 is the same as before: maximize 휇푇 푥 subject to (훾, 퐺푇 푥) 푛+1, 푒푇 푥 = 푤 +∈푒 풬푇 푥0, (9.20) 푥 푥0 푀푧, 푥0 푥 푀푧, 푒푇−푧 푘,≤ − ≤ 푥 ≤0, 푧 0, 1 푛. ≥ ∈ { }

106 Trading size constraints

We can also demand a lower bound on nonzero trades, that is ∆푥푗 0 [푎, 푏] for all | | ∈ { } ∪ 푗. To this end we combine the techniques from Sec. 9.1.6 and Sec. 9.1.2 writing 푝푗, 푞푗 for the indicators of ∆푥푗 > 0 and ∆푥푗 < 0, respectively:

maximize 휇푇 푥 subject to (훾, 퐺푇 푥) 푛+1, 푒푇 푥 = 푤 +∈푒 풬푇 푥0, 푥 푥0 = 푥+ 푥−, (9.21) 푥+− 푀푝, 푥− 푀푞, 푎(푝≤+ 푞) 푥+ +≤ 푥− 푏(푝 + 푞), 푝 + 푞 푒,≤ ≤ 푥, 푥+,≤ 푥− 0, 푝, 푞 0, 1 푛. ≥ ∈ { } 9.2.3 Convex piecewise linear regression

Consider the problem of approximating the data (푥푖, 푦푖), 푖 = 1, . . . , 푛 by a piecewise linear convex function of the form

푓(푥) = max 푎 푥 + 푏 , 푗 = 1, . . . 푘 , { 푗 푗 } where 푘 is the number of segments we want to consider. The quality of the fit is measured with least squares as ∑︁ (푓(푥 ) 푦 )2. 푖 − 푖 푖 Note that we do not specify the locations of nodes (breakpoints), i.e. points where 푓(푥) changes slope. Finding them is part of the fitting problem. As in Sec. 9.1.8 we introduce binary variables 푧푖푗 indicating that 푓(푥푖) = 푎푗푥푖 + 푏푗, i.e. it is the 푗-th linear function that achieves maximum at the point 푥푖. Following Sec. 9.1.8 we now have a mixed integer conic quadratic problem

minimize 푦 푓 2 ‖ − ‖ subject to 푎푗푥푖 + 푏푗 푓푖, 푖 = 1, . . . , 푛, 푗 = 1, . . . , 푘, ≤ 푓푖 푎푗푥푖 + 푏푗 + 푀(1 푧푖푗), 푖 = 1, . . . , 푛, 푗 = 1, . . . , 푘, (9.22) ∑︀ ≤ − 푗 푧푖푗 = 1, 푖 = 1 . . . , 푛, 푧 0, 1 , 푖 = 1, . . . , 푛, 푗 = 1, . . . , 푘, 푖푗 ∈ { } with variables 푎, 푏, 푓, 푧, where 푀 is a sufficiently large constant. Frequently an integer model will have properties which formally follow from the prob- lem’s constraints, but may be very hard or impossible for a mixed-integer solver to auto- matically deduce. It may dramatically improve efficiency to explicitly add some of them to the model. For example, we can enhance (9.22) with all inequalities of the form

푧 + 푧 ′ 1, 푖,푗+1 푖+푖 ,푗 ≤ which indicate that each linear segment covers a contiguous subset of the sample and additionally force these segments to come in the order of increasing 푗 as 푖 increases from left to right. The last statement is an example of symmetry breaking.

107 Fig. 9.5: Convex piecewise linear fit with 푘 = 2, 3, 4 segments.

108 Chapter 10

Quadratic optimization

In this chapter we discuss convex quadratic and quadratically constrained optimization. Our discussion is fairly brief compared to the previous chapters for three reasons; (i) convex quadratic optimization is a special case of conic quadratic optimization, (ii) for most convex problems it is actually more computationally efficient to pose the problem in conic form, and (iii) duality theory (including infeasibility certificates) is much simpler for conic quadratic optimization. Therefore, we generally recommend a conic quadratic formulation, see Sec.3 and especially Sec. 3.2.3.

10.1 Quadratic objective

A standard (convex) quadratic optimization problem

$$
\begin{array}{lrcl}
\mbox{minimize} & \frac{1}{2} x^T Q x + c^T x & & \\
\mbox{subject to} & Ax & = & b,\\
& x & \geq & 0,
\end{array}
\qquad (10.1)
$$

with $Q \in \mathcal{S}_+^n$ is conceptually a simple extension of a standard linear optimization problem with a quadratic term $x^T Q x$. Note the important requirement that $Q$ is symmetric positive semidefinite; otherwise the objective function would not be convex.
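A tiny numerical instance of (10.1) may help fix ideas; this sketch assumes CVXPY is available and uses made-up data.

```python
import numpy as np
import cvxpy as cp

# A small instance of (10.1); the data are made up.
Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])          # symmetric positive semidefinite
c = np.array([-1.0, -1.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

x = cp.Variable(2, nonneg=True)
objective = 0.5 * cp.quad_form(x, Q) + c @ x
prob = cp.Problem(cp.Minimize(objective), [A @ x == b])
prob.solve()
print(prob.value, x.value)
```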

10.1.1 Geometry of quadratic optimization

Quadratic optimization has a simple geometric interpretation; we minimize a convex quadratic function over a polyhedron, see Fig. 10.1. It is intuitively clear that the following different cases can occur:

• The optimal solution $x^\star$ is at the boundary of the polyhedron (as shown in Fig. 10.1). At $x^\star$ one of the hyperplanes is tangential to an ellipsoidal level curve.

• The optimal solution is inside the polyhedron; this occurs if the unconstrained minimizer $\arg\min_x \frac{1}{2} x^T Q x + c^T x = -Q^\dagger c$ (i.e., the center of the ellipsoidal level curves) is inside the polyhedron (see the numerical sketch after Fig. 10.1). From now on $Q^\dagger$ denotes the pseudoinverse of $Q$; in particular $Q^\dagger = Q^{-1}$ if $Q$ is positive definite.

• If the polyhedron is unbounded in the opposite direction of $c$, and if the ellipsoid level curves are degenerate in that direction (i.e., $Qc = 0$), then the problem is unbounded. If $Q \in \mathcal{S}_{++}^n$, then the problem cannot be unbounded.

• The problem is infeasible, i.e., $\{x \mid Ax = b,\ x \geq 0\} = \emptyset$.


Fig. 10.1: Geometric interpretation of quadratic optimization. At the optimal point $x^\star$ the hyperplane $\{x \mid a_1^T x = b\}$ is tangential to an ellipsoidal level curve.

Possibly because of its simple geometric interpretation and similarity to linear optimization, quadratic optimization has been more widely adopted by optimization practitioners than conic quadratic optimization.
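The unconstrained minimizer $-Q^\dagger c$ mentioned in the second bullet above is easy to verify numerically; a minimal sketch with made-up data, assuming only NumPy:

```python
import numpy as np

# The unconstrained minimizer of (1/2) x^T Q x + c^T x is -Q^+ c when c lies in range(Q).
Q = np.array([[2.0, 0.0],
              [0.0, 0.0]])              # positive semidefinite but singular
c = np.array([-4.0, 0.0])               # lies in range(Q), so the minimum is attained
x_star = -np.linalg.pinv(Q) @ c         # pseudoinverse minimizer, here (2, 0)
print(Q @ x_star + c)                   # gradient at x_star: numerically zero
```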

10.1.2 Duality in quadratic optimization

The Lagrangian function (Sec. 8.3) for (10.1) is

$$
L(x, y, s) = \frac{1}{2} x^T Q x + x^T (c - A^T y - s) + b^T y
\qquad (10.2)
$$

with Lagrange multipliers $s \in \mathbb{R}_+^n$, and from $\nabla_x L(x, y, s) = 0$ we get the necessary first-order optimality condition

$$
Qx = A^T y + s - c,
$$

i.e. $(A^T y + s - c) \in \mathcal{R}(Q)$. Then

$$
\arg\min_x L(x, y, s) = Q^\dagger (A^T y + s - c),
$$

which can be substituted into (10.2) to give a dual function

$$
g(y, s) = \begin{cases} b^T y - \frac{1}{2} (A^T y + s - c)^T Q^\dagger (A^T y + s - c), & (A^T y + s - c) \in \mathcal{R}(Q),\\ -\infty, & \mbox{otherwise.} \end{cases}
$$

Thus we get a dual problem

$$
\begin{array}{lrcl}
\mbox{maximize} & b^T y - \frac{1}{2} (A^T y + s - c)^T Q^\dagger (A^T y + s - c) & & \\
\mbox{subject to} & (A^T y + s - c) & \in & \mathcal{R}(Q),\\
& s & \geq & 0,
\end{array}
\qquad (10.3)
$$

or alternatively, using the optimality condition $Qx = A^T y + s - c$, we can write

$$
\begin{array}{lrcl}
\mbox{maximize} & b^T y - \frac{1}{2} x^T Q x & & \\
\mbox{subject to} & Qx & = & A^T y - c + s,\\
& s & \geq & 0.
\end{array}
\qquad (10.4)
$$

Note that this is an unusual dual problem in the sense that it involves both primal and dual variables.

Weak duality, strong duality under the Slater constraint qualification and Farkas infeasibility certificates work similarly as in Sec. 8. In particular, note that the constraints in both (10.1) and (10.4) are linear, so Lemma 2.4 applies and we have:

i. The primal problem (10.1) is infeasible if and only if there is $y$ such that $A^T y \leq 0$ and $b^T y > 0$.

ii. The dual problem (10.4) is infeasible if and only if there is $x \geq 0$ such that $Ax = 0$, $Qx = 0$ and $c^T x < 0$.

10.1.3 Conic reformulation

Suppose we have a factorization $Q = F^T F$ where $F \in \mathbb{R}^{k \times n}$, which is most interesting when $k \ll n$. Then $x^T Q x = x^T F^T F x = \|Fx\|_2^2$ and the conic quadratic reformulation of (10.1) is

$$
\begin{array}{lrcl}
\mbox{minimize} & r + c^T x & & \\
\mbox{subject to} & Ax & = & b,\\
& x & \geq & 0,\\
& (1, r, Fx) & \in & \mathcal{Q}_r^{k+2},
\end{array}
\qquad (10.5)
$$

with dual problem

$$
\begin{array}{lrcl}
\mbox{maximize} & b^T y - u & & \\
\mbox{subject to} & -F^T v & = & A^T y - c + s,\\
& s & \geq & 0,\\
& (u, 1, v) & \in & \mathcal{Q}_r^{k+2}.
\end{array}
\qquad (10.6)
$$

Note that in an optimal primal-dual solution we have $r = \frac{1}{2}\|Fx\|_2^2$, hence complementary slackness for $\mathcal{Q}_r^{k+2}$ demands $v = -Fx$, so that $-F^T v = F^T F x = Qx$, as well as $u = \frac{1}{2}\|v\|_2^2 = \frac{1}{2} x^T Q x$. This justifies why some of the dual variables in (10.6) and (10.4) have the same names - they are in fact equal, and so both the primal and dual solution to the original quadratic problem can be recovered from the primal-dual solution to the conic reformulation.
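A quick numerical check of the factorization $Q = F^T F$ behind (10.5), using only NumPy; the data are random and purely illustrative.

```python
import numpy as np

# Build a low-rank PSD matrix Q and a factor F with Q = F^T F, so x^T Q x = ||F x||^2.
rng = np.random.default_rng(2)
B = rng.normal(size=(3, 6))                         # k = 3 factors, n = 6 variables
Q = B.T @ B                                         # PSD, rank 3

w, V = np.linalg.eigh(Q)                            # one possible factorization
keep = w > 1e-10                                    # drop (numerically) zero eigenvalues
F = np.sqrt(w[keep])[:, None] * V[:, keep].T        # k x n with k = rank(Q)

x = rng.normal(size=6)
print(np.allclose(x @ Q @ x, np.linalg.norm(F @ x)**2))   # True
```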

10.2 Quadratically constrained optimization

A general convex quadratically constrained quadratic optimization problem is

$$
\begin{array}{lrcll}
\mbox{minimize} & \frac{1}{2} x^T Q_0 x + c_0^T x + r_0 & & & \\
\mbox{subject to} & \frac{1}{2} x^T Q_i x + c_i^T x + r_i & \leq & 0, & i = 1, \dots, m,
\end{array}
\qquad (10.7)
$$

where $Q_i \in \mathcal{S}_+^n$. This corresponds to minimizing a convex quadratic function over an intersection of convex quadratic sets such as ellipsoids or affine halfspaces. Note the important requirement $Q_i \succeq 0$ for all $i = 0, \dots, m$, so that the objective function is convex and the constraints characterize convex sets. For example, neither of the constraints

$$
\frac{1}{2} x^T Q_i x + c_i^T x + r_i = 0, \qquad \frac{1}{2} x^T Q_i x + c_i^T x + r_i \geq 0
$$

characterizes a convex set, and therefore they cannot be included.
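A small convex QCQP of the form (10.7), sketched with CVXPY; the data are made up, $m = 1$, and the single quadratic constraint describes a ball.

```python
import numpy as np
import cvxpy as cp

# Sketch of (10.7) with one quadratic constraint; all data are made up.
rng = np.random.default_rng(3)
n = 4
F0 = rng.normal(size=(2, n))
Q0 = F0.T @ F0                        # PSD objective matrix (rank 2)
c0 = rng.normal(size=n)
Q1 = np.eye(n)                        # PSD constraint matrix: a ball
c1 = np.zeros(n)
r1 = -1.0                             # constraint: (1/2)||x||^2 - 1 <= 0

x = cp.Variable(n)
objective = 0.5 * cp.quad_form(x, Q0) + c0 @ x
constraints = [0.5 * cp.quad_form(x, Q1) + c1 @ x + r1 <= 0]
prob = cp.Problem(cp.Minimize(objective), constraints)
prob.solve()
print(prob.value, x.value)
```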

10.2.1 Duality in quadratically constrained optimization

The Lagrangian function for (10.7) is

$$
\begin{array}{rcl}
L(x, \lambda) & = & \frac{1}{2} x^T Q_0 x + c_0^T x + r_0 + \sum_{i=1}^m \lambda_i \left( \frac{1}{2} x^T Q_i x + c_i^T x + r_i \right)\\
& = & \frac{1}{2} x^T Q(\lambda) x + c(\lambda)^T x + r(\lambda),
\end{array}
$$

where

$$
Q(\lambda) = Q_0 + \sum_{i=1}^m \lambda_i Q_i, \qquad c(\lambda) = c_0 + \sum_{i=1}^m \lambda_i c_i, \qquad r(\lambda) = r_0 + \sum_{i=1}^m \lambda_i r_i.
$$

From the Lagrangian we get the first-order optimality conditions

$$
Q(\lambda) x = -c(\lambda),
\qquad (10.8)
$$

and similar to the case of quadratic optimization we get a dual problem

$$
\begin{array}{lrcl}
\mbox{maximize} & -\frac{1}{2} c(\lambda)^T Q(\lambda)^\dagger c(\lambda) + r(\lambda) & & \\
\mbox{subject to} & c(\lambda) & \in & \mathcal{R}(Q(\lambda)),\\
& \lambda & \geq & 0,
\end{array}
\qquad (10.9)
$$

or equivalently

$$
\begin{array}{lrcl}
\mbox{maximize} & -\frac{1}{2} w^T Q(\lambda) w + r(\lambda) & & \\
\mbox{subject to} & Q(\lambda) w & = & -c(\lambda),\\
& \lambda & \geq & 0.
\end{array}
\qquad (10.10)
$$

Using a general version of the Schur Lemma for singular matrices, we can also write (10.9) as an equivalent semidefinite optimization problem,

$$
\begin{array}{lrcl}
\mbox{maximize} & t & & \\
\mbox{subject to} & \begin{bmatrix} 2(r(\lambda) - t) & c(\lambda)^T \\ c(\lambda) & Q(\lambda) \end{bmatrix} & \succeq & 0,\\
& \lambda & \geq & 0.
\end{array}
\qquad (10.11)
$$

Feasibility in quadratically constrained optimization is characterized by the following conditions (assuming Slater constraint qualification or other conditions to exclude ill-posed problems):

• Either the primal problem (10.7) is feasible or there exists $\lambda \geq 0$, $\lambda \neq 0$ satisfying

$$
\sum_{i=1}^m \lambda_i \begin{bmatrix} 2 r_i & c_i^T \\ c_i & Q_i \end{bmatrix} \succ 0.
\qquad (10.12)
$$

• Either the dual problem (10.10) is feasible or there exists $x \in \mathbb{R}^n$ satisfying

$$
Q_0 x = 0, \quad c_0^T x < 0, \qquad Q_i x = 0, \quad c_i^T x = 0, \quad i = 1, \dots, m.
\qquad (10.13)
$$

To see why the certificate proves infeasibility, suppose for instance that (10.12) and (10.7) are simultaneously satisfied. Then

$$
0 < \sum_i \lambda_i \begin{bmatrix} 1 \\ x \end{bmatrix}^T \begin{bmatrix} 2 r_i & c_i^T \\ c_i & Q_i \end{bmatrix} \begin{bmatrix} 1 \\ x \end{bmatrix} = 2 \sum_i \lambda_i \left( \frac{1}{2} x^T Q_i x + c_i^T x + r_i \right) \leq 0
$$

and we have a contradiction, so (10.12) certifies infeasibility.

10.2.2 Conic reformulation

If $Q_i = F_i^T F_i$ for $i = 0, \dots, m$, where $F_i \in \mathbb{R}^{k_i \times n}$, then we get a conic quadratic reformulation of (10.7):

$$
\begin{array}{lrcl}
\mbox{minimize} & t_0 + c_0^T x + r_0 & & \\
\mbox{subject to} & (1, t_0, F_0 x) & \in & \mathcal{Q}_r^{k_0 + 2},\\
& (1, -c_i^T x - r_i, F_i x) & \in & \mathcal{Q}_r^{k_i + 2}, \quad i = 1, \dots, m.
\end{array}
\qquad (10.14)
$$

The primal and dual solution of (10.14) recovers the primal and dual solution of (10.7) similarly as for quadratic optimization in Sec. 10.1.3. Let us see, for example, how a (conic) primal infeasibility certificate for (10.14) implies an infeasibility certificate in the form (10.12). Infeasibility in (10.14) involves only the last set of conic constraints. We can derive the infeasibility certificate for those constraints from Lemma 8.3 or by directly writing the Lagrangian

$$
\begin{array}{rcl}
L & = & -\sum_i \left( u_i + v_i(-c_i^T x - r_i) + w_i^T F_i x \right)\\
& = & x^T \left( \sum_i v_i c_i - \sum_i F_i^T w_i \right) + \left( \sum_i v_i r_i - \sum_i u_i \right).
\end{array}
$$

The dual maximization problem is unbounded (i.e. the primal problem is infeasible) if we have

$$
\begin{array}{rcl}
\sum_i v_i c_i & = & \sum_i F_i^T w_i,\\
\sum_i v_i r_i & > & \sum_i u_i,\\
2 u_i v_i & \geq & \|w_i\|^2, \quad u_i, v_i \geq 0.
\end{array}
$$

We claim that $\lambda = v$ is an infeasibility certificate in the sense of (10.12). We can assume $v_i > 0$ for all $i$, as otherwise $w_i = 0$ and we can take $u_i = 0$ and skip the $i$-th coordinate. Let $M$ denote the matrix appearing in (10.12) with $\lambda = v$. We show that $M$ is positive definite:

$$
\begin{array}{rcl}
[y, x]^T M [y, x] & = & \sum_i \left( v_i x^T Q_i x + 2 v_i y c_i^T x + 2 v_i r_i y^2 \right)\\
& > & \sum_i \left( v_i \|F_i x\|^2 + 2 y w_i^T F_i x + 2 u_i y^2 \right)\\
& \geq & \sum_i v_i^{-1} \|v_i F_i x + y w_i\|^2 \geq 0.
\end{array}
$$
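The conic reformulation (10.14) can be written directly in terms of the factors $F_i$. Below is a sketch with CVXPY, where each rotated-cone membership is expressed as the equivalent inequality $\|F_i x\|^2 \leq 2s$ for the appropriate affine $s$; all data are made up.

```python
import numpy as np
import cvxpy as cp

# Conic form (10.14) of the QCQP, via ||F_i x||^2 <= 2 * (affine) constraints.
rng = np.random.default_rng(5)
n = 4
F0 = rng.normal(size=(2, n)); c0 = rng.normal(size=n); r0 = 0.0
Fs = [np.eye(n), rng.normal(size=(3, n))]
cs = [np.zeros(n), np.zeros(n)]
rs = [-1.0, -2.0]                       # both constraints hold at x = 0

x = cp.Variable(n)
t0 = cp.Variable()
cons = [cp.sum_squares(F0 @ x) <= 2 * t0]                        # (1, t0, F0 x) in Q_r
for Fi, ci, ri in zip(Fs, cs, rs):
    cons.append(cp.sum_squares(Fi @ x) <= 2 * (-ci @ x - ri))    # (1, -c_i^T x - r_i, F_i x) in Q_r
prob = cp.Problem(cp.Minimize(t0 + c0 @ x + r0), cons)
prob.solve()
print(prob.value, x.value)
```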

10.3 Example: Factor model

Recall from Sec. 3.3.3 that the standard Markowitz portfolio optimization problem is

$$
\begin{array}{lrcl}
\mbox{maximize} & \mu^T x & & \\
\mbox{subject to} & x^T \Sigma x & \leq & \gamma,\\
& e^T x & = & 1,\\
& x & \geq & 0,
\end{array}
\qquad (10.15)
$$

where $\mu \in \mathbb{R}^n$ is a vector of expected returns for $n$ different assets and $\Sigma \in \mathcal{S}_+^n$ denotes the corresponding covariance matrix. Problem (10.15) maximizes the expected return of an investment given a budget constraint and an upper bound $\gamma$ on the allowed risk. Alternatively, we can minimize the risk given a lower bound $\delta$ on the expected return of the investment, i.e., we can solve the problem

$$
\begin{array}{lrcl}
\mbox{minimize} & x^T \Sigma x & & \\
\mbox{subject to} & \mu^T x & \geq & \delta,\\
& e^T x & = & 1,\\
& x & \geq & 0.
\end{array}
\qquad (10.16)
$$

Problems (10.15) and (10.16) are equivalent in the sense that they describe the same Pareto-optimal trade-off curve by varying $\delta$ and $\gamma$. Next consider a factorization

$$
\Sigma = V^T V
\qquad (10.17)
$$

for some $V \in \mathbb{R}^{k \times n}$. We can then rewrite both problems (10.15) and (10.16) in conic quadratic form as

$$
\begin{array}{lrcl}
\mbox{maximize} & \mu^T x & & \\
\mbox{subject to} & (1/2, \gamma, Vx) & \in & \mathcal{Q}_r^{k+2},\\
& e^T x & = & 1,\\
& x & \geq & 0,
\end{array}
\qquad (10.18)
$$

and

$$
\begin{array}{lrcl}
\mbox{minimize} & t & & \\
\mbox{subject to} & (1/2, t, Vx) & \in & \mathcal{Q}_r^{k+2},\\
& \mu^T x & \geq & \delta,\\
& e^T x & = & 1,\\
& x & \geq & 0,
\end{array}
\qquad (10.19)
$$

respectively. Given $\Sigma \succ 0$, we may always compute a factorization (10.17) where $V$ is upper-triangular (Cholesky factor). In this case $k = n$, i.e., $V \in \mathbb{R}^{n \times n}$, so there is little difference in complexity between the conic and quadratic formulations. However, in practice, better choices of $V$ are either known or readily available. We mention two examples.

Data matrix

$\Sigma$ might be specified directly in the form (10.17), where $V$ is a normalized data matrix with $k$ observations of market data (for example daily returns) of the $n$ assets. When the observation horizon $k$ is shorter than $n$, which is typically the case, the conic representation is both more parsimonious and numerically better behaved.

Factor model

For a factor model we have

$$
\Sigma = D + U^T R U
$$

where $D = \mathrm{Diag}(w)$ is a diagonal matrix, $U \in \mathbb{R}^{k \times n}$ represents the exposure of assets to risk factors and $R \in \mathbb{R}^{k \times k}$ is the covariance matrix of factors. Importantly, we normally have a small number of factors ($k \ll n$), so it is computationally much cheaper to find a Cholesky decomposition $R = F^T F$ with $F \in \mathbb{R}^{k \times k}$. Combined, this gives us $\Sigma = V^T V$ for

$$
V = \begin{bmatrix} D^{1/2} \\ FU \end{bmatrix}
$$

of dimensions $(n + k) \times n$. The dimensions of $V$ are larger than the dimensions of the Cholesky factors of $\Sigma$, but $V$ is very sparse, which usually results in a significant reduction of solution time. The resulting risk minimization conic problem can ultimately be written as:

$$
\begin{array}{lrcl}
\mbox{minimize} & d + t & & \\
\mbox{subject to} & (1/2, t, FUx) & \in & \mathcal{Q}_r^{k+2},\\
& (1/2, d, w_1^{1/2} x_1, \dots, w_n^{1/2} x_n) & \in & \mathcal{Q}_r^{n+2},\\
& \mu^T x & \geq & \delta,\\
& e^T x & = & 1,\\
& x & \geq & 0.
\end{array}
\qquad (10.20)
$$
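The construction of $V$ and the risk minimization above are easy to prototype. The sketch below assumes NumPy and CVXPY and uses synthetic data; for brevity the two rotated-cone terms $d + t$ of (10.20) are folded into a single objective $\|Vx\|^2 = x^T \Sigma x$, which is equivalent.

```python
import numpy as np
import cvxpy as cp

# Factor-model sketch for (10.20); all data are synthetic.
rng = np.random.default_rng(4)
n, k = 50, 3                                   # many assets, few factors
U = rng.normal(size=(k, n))                    # factor exposures
R = np.diag([0.04, 0.02, 0.01])                # factor covariance (diagonal here)
w = 0.01 + 0.02 * rng.random(n)                # specific (diagonal) variances
mu = 0.05 + 0.05 * rng.random(n)               # expected returns
delta = 0.07                                   # required expected return

F = np.linalg.cholesky(R).T                    # R = F^T F with F of size k x k
V = np.vstack([np.diag(np.sqrt(w)), F @ U])    # Sigma = V^T V; V is (n + k) x n and sparse

x = cp.Variable(n, nonneg=True)
prob = cp.Problem(cp.Minimize(cp.sum_squares(V @ x)),      # = x^T Sigma x = d + t
                  [mu @ x >= delta, cp.sum(x) == 1])
prob.solve()
print(prob.value, x.value[:5])
```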

Chapter 11

Bibliographic notes

The material on linear optimization is very basic and can be found in any textbook. For further details, we suggest a few standard references [Chvatal83], [BT97] and [PS98], which all cover much more than is discussed here. [NW06] gives a more modern treatment of both theory and algorithmic aspects of linear optimization. Material on conic quadratic optimization is based on the paper [LVBL98] and the books [BenTalN01], [BV04]. The papers [AG03], [ART03] contain additional theoretical and algorithmic aspects. For more theory behind the power cone and the exponential cone we recommend the thesis [Cha09]. Much of the material about semidefinite optimization is based on the paper [VB96] and the books [BenTalN01], [BKVH07]. The section on optimization over nonnegative polynomials is based on [Nes99], [Hac03]. We refer to [LR05] for a comprehensive survey on semidefinite optimization and relaxations in combinatorial optimization. The chapter on conic duality follows the exposition in [GartnerM12]. Mixed-integer optimization is based on the books [NW88], [Wil93]. Modeling of piecewise linear functions is described in the survey paper [VAN10].

Chapter 12

Notation and definitions

$\mathbb{R}$ and $\mathbb{Z}$ denote the sets of real numbers and integers, respectively. $\mathbb{R}^n$ denotes the set of $n$-dimensional vectors of real numbers (and similarly for $\mathbb{Z}^n$ and $\{0, 1\}^n$); in most cases we denote such vectors by lower case letters, e.g., $a \in \mathbb{R}^n$. A subscripted value $a_i$ then refers to the $i$-th entry in $a$, i.e.,

$$
a = (a_1, a_2, \dots, a_n).
$$

The symbol $e$ denotes the all-ones vector $e = (1, \dots, 1)^T$, whose length always follows from the context. All vectors are interpreted as column vectors. For $a, b \in \mathbb{R}^n$ we use the standard inner product,

$$
\langle a, b \rangle := a_1 b_1 + a_2 b_2 + \cdots + a_n b_n,
$$

which we also write as $a^T b := \langle a, b \rangle$. We let $\mathbb{R}^{m \times n}$ denote the set of $m \times n$ matrices, and we use upper case letters to represent them, e.g., $B \in \mathbb{R}^{m \times n}$ is organized as

$$
B = \begin{bmatrix} b_{11} & b_{12} & \dots & b_{1n}\\ b_{21} & b_{22} & \dots & b_{2n}\\ \vdots & \vdots & \ddots & \vdots\\ b_{m1} & b_{m2} & \dots & b_{mn} \end{bmatrix}.
$$

For matrices $A, B \in \mathbb{R}^{m \times n}$ we use the inner product

$$
\langle A, B \rangle := \sum_{i=1}^m \sum_{j=1}^n a_{ij} b_{ij}.
$$

For a vector $x \in \mathbb{R}^n$ we have

$$
\mathrm{Diag}(x) := \begin{bmatrix} x_1 & 0 & \dots & 0\\ 0 & x_2 & \dots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \dots & x_n \end{bmatrix},
$$

i.e., a square matrix with $x$ on the diagonal and zero elsewhere. Similarly, for a square matrix $X \in \mathbb{R}^{n \times n}$ we have

$$
\mathrm{diag}(X) := (x_{11}, x_{22}, \dots, x_{nn}).
$$
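These conventions map directly onto standard numerical tooling; the following short NumPy illustration is an aside, not part of the original text.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
X = np.diag(x)                 # Diag(x): square matrix with x on the diagonal
d = np.diag(X)                 # diag(X): the diagonal of a square matrix as a vector

A = np.arange(6.0).reshape(2, 3)
B = np.ones((2, 3))
ip = np.sum(A * B)             # <A, B> = sum_ij a_ij b_ij, equal to trace(A^T B)
print(d, ip)
```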

A set $S \subseteq \mathbb{R}^n$ is convex if and only if for any $x, y \in S$ and $\theta \in [0, 1]$ we have

$$
\theta x + (1 - \theta) y \in S.
$$

A function $f : \mathbb{R}^n \mapsto \mathbb{R}$ is convex if and only if its domain $\mathrm{dom}(f)$ is convex and for all $\theta \in [0, 1]$ we have

$$
f(\theta x + (1 - \theta) y) \leq \theta f(x) + (1 - \theta) f(y).
$$

A function $f : \mathbb{R}^n \mapsto \mathbb{R}$ is concave if and only if $-f$ is convex. The epigraph of a function $f : \mathbb{R}^n \mapsto \mathbb{R}$ is the set

$$
\mathrm{epi}(f) := \{ (x, t) \mid x \in \mathrm{dom}(f),\ f(x) \leq t \},
$$

shown in Fig. 12.1.


Fig. 12.1: The shaded region is the epigraph of the function $f(x) = -\log(x)$.

Thus, minimizing over the epigraph

$$
\begin{array}{lrcl}
\mbox{minimize} & t & & \\
\mbox{subject to} & f(x) & \leq & t
\end{array}
$$

is equivalent to minimizing $f(x)$. Furthermore, $f$ is convex if and only if $\mathrm{epi}(f)$ is a convex set. Similarly, the hypograph of a function $f : \mathbb{R}^n \mapsto \mathbb{R}$ is the set

$$
\mathrm{hypo}(f) := \{ (x, t) \mid x \in \mathrm{dom}(f),\ f(x) \geq t \}.
$$

Maximizing $f$ is equivalent to maximizing over the hypograph

$$
\begin{array}{lrcl}
\mbox{maximize} & t & & \\
\mbox{subject to} & f(x) & \geq & t,
\end{array}
$$

and $f$ is concave if and only if $\mathrm{hypo}(f)$ is a convex set.

Bibliography

[AG03] F. Alizadeh and D. Goldfarb. Second-order cone programming. Math. Programming, 95(1):3–51, 2003.

[ART03] E. D. Andersen, C. Roos, and T. Terlaky. On implementing a primal-dual interior-point method for conic quadratic optimization. Math. Programming, February 2003.

[BT97] D. Bertsimas and J. N. Tsitsiklis. Introduction to linear optimization. Athena Scientific, 1997.

[BKVH07] S. Boyd, S.J. Kim, L. Vandenberghe, and A. Hassibi. A Tutorial on Geometric Programming. Optimization and Engineering, 8(1):67–127, 2007. Available at http://www.stanford.edu/~boyd/gp_tutorial.html.

[BV04] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, 2004. http://www.stanford.edu/~boyd/cvxbook/.

[CS14] Venkat Chandrasekaran and Parikshit Shah. Conic geometric programming. In Information Sciences and Systems (CISS), 2014 48th Annual Conference on, 1–4. IEEE, 2014.

[Cha09] Peter Robert Chares. Cones and interior-point algorithms for structured convex optimization involving powers and exponentials. PhD thesis, Ecole polytechnique de Louvain, Université catholique de Louvain, 2009.

[Chvatal83] V. Chvátal. Linear programming. W.H. Freeman and Company, 1983.

[GartnerM12] B. Gärtner and J. Matousek. Approximation algorithms and semidefinite programming. Springer Science & Business Media, 2012.

[Hac03] Y. Hachez. Convex optimization over non-negative polynomials: structured algorithms and applications. PhD thesis, Université Catholique de Louvain, 2003.

[LR05] M. Laurent and F. Rendl. Semidefinite programming and integer pro- gramming. Handbooks in Operations Research and Management Science, 12:393–514, 2005.

[LVBL98] M. S. Lobo, L. Vandenberghe, S. Boyd, and H. Lebret. Applications of second-order cone programming. Linear Algebra Appl., 284:193–228, November 1998.

[NW88] G. L. Nemhauser and L. A. Wolsey. Integer programming and combinatorial optimization. John Wiley and Sons, New York, 1988.

[Nes99] Yu. Nesterov. Squared functional systems and optimization problems. In H. Frenk, K. Roos, T. Terlaky, and S. Zhang, editors, High Performance Optimization. Kluwer Academic Publishers, 1999.

[NW06] J. Nocedal and S. Wright. Numerical optimization. Springer Science, 2nd edition, 2006.

[PS98] C. H. Papadimitriou and K. Steiglitz. Combinatorial optimization: algorithms and complexity. Dover Publications, 1998.

[TV98] T. Terlaky and J.-Ph. Vial. Computing maximum likelihood estimators of convex density functions. SIAM J. Sci. Statist. Comput., 19(2):675–694, 1998.

[VB96] L. Vandenberghe and S. Boyd. Semidefinite programming. SIAM Rev., 38(1):49–95, March 1996.

[VAN10] J. P. Vielma, S. Ahmed, and G. Nemhauser. Mixed-integer models for nonseparable piecewise-linear optimization: unifying framework and extensions. Operations Research, 58(2):303–315, 2010.

[Wil93] H. P. Williams. Model building in mathematical programming. John Wiley and Sons, 3rd edition, 1993.

[Zie82] H. Ziegler. Solving certain singly constrained convex optimization problems in production planning. Operations Research Letters, 1982. URL: http://www.sciencedirect.com/science/article/pii/016763778290030X.

[BenTalN01] A. Ben-Tal and A. Nemirovski. Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications. MPS/SIAM Series on Optimization. SIAM, 2001.
