
Université Joseph Fourier, Master de Mathématiques Appliquées, 2ème année

EFFICIENT METHODS IN OPTIMIZATION
Théorie de la programmation non-linéaire
Anatoli Iouditski

Lecture Notes

Optimization problems arise naturally in many application fields. Whatever people do, at some point they get a craving for organizing things in the best possible way. This intention, converted into mathematical form, turns out to be an optimization problem of a certain type (think of, say, the Optimal Diet Problem). Unfortunately, the next step, finding a solution to the mathematical model, is less trivial. At first glance everything looks very simple: many commercial optimization packages are easily available, and any user can get a “solution” to his model just by clicking on an icon on the desktop of his PC. The question, however, is how much he can trust it. One of the goals of this course is to show that, despite their attraction, general optimization problems very often break the expectations of a naive user. In order to apply these formulations successfully, it is necessary to be aware of some theory which tells us what we can and what we cannot do with optimization problems. The elements of this theory can be found in each lecture of the course. The course itself is based on lectures given at Technion in the late 1990's. On the other hand, all the errors and inanities you may find here should be put on the account of the name on the title page.

http://www-ljk.imag.fr/membres/Anatoli.Iouditski/cours/optimisation-convexe.htm

Contents

1 Introduction 5
1.1 General formulation of the problem 5
1.1.1 Problem formulation and terminology 5
1.1.2 Performance of Numerical Methods 8
1.2 Complexity bounds for Global Optimization 10
1.3 Identity cards of the fields 15
1.4 Rules of the game 17
1.5 Suggested reading 17

2 When everything is simple: 1-dimensional Convex Optimization 19
2.1 Example: one-dimensional convex problems 19
2.1.1 Upper complexity bound 20
2.1.2 Lower complexity bound 24
2.2 Conclusion 27
2.3 Exercises 27
2.3.1 Can we use 1-dimensional optimization? 27
2.3.2 Extremal Ellipsoids 30

3 Methods with linear convergence 37
3.1 Class of general convex problems: description and complexity 37
3.2 Cutting Plane scheme and Center of Gravity Method 39
3.2.1 Case of problems without functional constraints 39
3.3 The general case: problems with functional constraints 44
3.4 Lower complexity bound 48
3.5 The Ellipsoid method 54
3.5.1 Ellipsoids 55
3.5.2 The Ellipsoid method 57
3.6 Exercises 60
3.6.1 Some extensions of the Cutting Plane scheme 60
3.6.2 The method of outer simplex 71


4 Large-scale optimization problems 73
4.1 Goals and motivations 73
4.2 The main result 74
4.3 Upper complexity bound: Subgradient Descent 76
4.4 The lower bound 79
4.5 Subgradient Descent for Lipschitz-continuous convex problems 80
4.6 Bundle methods 84
4.6.1 The Level method 86
4.6.2 Concluding remarks 90
4.7 Exercises 92
4.7.1 Around Subgradient Descent 92
4.7.2 Mirror Descent 96
4.7.3 Stochastic Approximation 101
4.7.4 The Stochastic Approximation method 105

5 Unconstrained Minimization 109
5.1 Relaxation 109
5.2 Gradient method 110
5.3 Newton method 117
5.3.1 Gradient Method and Newton Method: What is different? 121
5.4 Newton Method and Self-Concordant Functions 125
5.4.1 Preliminaries 125
5.4.2 Self-concordance 126
5.4.3 Self-concordant functions and the Newton method 127
5.4.4 Self-concordant functions: applications 130
5.5 Conjugate gradients method 133
5.6 Exercises 137
5.6.1 Implementing Gradient method 137
5.6.2 Regular functions 139
5.6.3 Properties of the gradient 139
5.6.4 Classes of differentiable functions 142

6 Constrained Minimization 145
6.1 Penalty and barrier methods 145
6.2 Self-concordant barriers and path-following scheme 150
6.2.1 Path-following scheme 151
6.2.2 Applications 156
6.2.3 Concluding remarks 158

Lecture 1

Introduction

(General formulation of the problem; Important examples; Black Box and Iterative Methods; Analytical and Arithmetical Complexity; Uniform Grid Method; Lower complexity bounds; Lower bounds for Global Optimization; Rules of the Game.)

1.1 General formulation of the problem

Let us start with fixing the mathematical form of our main problem and the standard terminology.

1.1.1 Problem formulation and terminology

Let x be an n-dimensional real vector: x = (x(1), ..., x(n)) ∈ Rn, let G be a subset of Rn, and let f(x), g1(x), ..., gm(x) be some real-valued functions of x. In the entire course we deal with variants of the following minimization problem:

min f(x),
s.t.: gj(x) & 0,  j = 1, ..., m,                    (1.1.1)
x ∈ G,

where & could be ≤, ≥ or =. We call f(x) the objective function, the vector function g(x) = (g1(x), ..., gm(x)) is called the functional constraint, the set G is called the basic feasible set, and the set

Q = {x ∈ G | gj(x) ≤ 0, j = 1, ..., m}

is called the feasible set of the problem (1.1.1).¹ There is a natural classification of the types of minimization problems:

¹That is just a convention to consider minimization problems. Instead, we could consider a maximization problem with the objective function −f(x).


• Constrained problems: Q ⊂ Rn.

• Unconstrained problems: Q ≡ Rn.

• Smooth problems: all gj(x) are differentiable.

• Nonsmooth problems: there is a nondifferentiable component gk(x).

• Linearly constrained problems: all functional constraints are linear:

gj(x) = Σ_{i=1}^n a_{i,j} x_i + b_j ≡ ⟨a_j, x⟩ + b_j,   j = 1, ..., m,

(here ⟨·, ·⟩ stands for the inner product in Rn), and G is a polyhedron. If f(x) is also linear then (1.1.1) is a Linear Programming Problem. If f(x) is quadratic then (1.1.1) is a Quadratic Programming Problem.

There is also a classification according to the properties of the feasible set.

• Problem (1.1.1) is called feasible if Q ≠ ∅.

• Problem (1.1.1) is called strictly feasible if there exists x ∈ int Q such that gj(x) < 0 (or > 0) for all inequality constraints and gj(x) = 0 for all equality constraints.

Finally, we distinguish different types of solutions to (1.1.1):

• x∗ is called the optimal global solution to (1.1.1) if

f(x∗) ≤ f(x) for all x ∈ Q

(global minimum). Then f(x∗) is called the optimal value of the problem.

• x∗ is called a local solution to (1.1.1) if

f(x∗) ≤ f(x) for all x ∈ Q̄, where Q̄ ⊂ Q is some set with x∗ ∈ int Q̄

(local minimum).

Let us now consider several examples demonstrating the origin of optimization problems.

Example 1.1.1 Let x(1), ..., x(n) be our design or decision variables. Then we can fix some functional characteristics of our decision: f(x), g1(x), ..., gm(x). These could be the price of the project, the amount of required resources, the reliability of the system, and many others. We fix the most important characteristic, f(x), as our objective. For all others we impose some bounds: aj ≤ gj(x) ≤ bj.

Thus, we come up with the problem: min f(x),

s.t.: aj ≤ gj(x) ≤ bj,  j = 1, ..., m,

x ∈ G, where G stands for the structural constraints, like positiveness or boundedness of some variables, etc.

Example 1.1.2 Let our initial problem be as follows: Find x ∈ Rn such that

g1(x) = a1,
...                                                (1.1.2)
gm(x) = am.

Then we can consider the problem

min_x Σ_{j=1}^m (gj(x) − aj)^2

(maybe with some additional constraints on x). Note that the problem (1.1.2) is almost universal. It covers ordinary differential equations, partial differential equations, problems arising in Game Theory, and many others.
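To make this reduction concrete, here is a minimal sketch, with a made-up two-equation system and starting point, of recasting gj(x) = aj as a least-squares minimization handed to a generic optimizer:

```python
# Hypothetical example: recast g_1(x) = a_1, g_2(x) = a_2 as
# min_x sum_j (g_j(x) - a_j)^2 and solve it with a general-purpose minimizer.
import numpy as np
from scipy.optimize import minimize

def g(x):
    # an illustrative pair of nonlinear equations in two unknowns
    return np.array([x[0]**2 + x[1], x[0] + x[1]**3])

a = np.array([3.0, 5.0])                       # right-hand sides a_1, a_2

res = minimize(lambda x: np.sum((g(x) - a)**2), x0=np.array([1.0, 1.0]))
print(res.x, res.fun)   # res.fun close to 0 means the system is (nearly) solved
```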

Example 1.1.3 Sometimes our decision variables x(1), ..., x(n) must be integer, say, we need x(i) ∈ {0, 1}. That can be described by the constraint:

x(i)(x(i) − 1) = 0,  i = 1, ..., n.

Thus, we could also treat the {0,1} Programming Problems:

min f(x),

s.t.: aj ≤ gj(x) ≤ bj,  j = 1, ..., m,

x ∈ G,

x(i)(x(i) − 1) = 0,  i = 1, ..., n.

Looking at these examples, a reader can understand the enthusiasm of the pioneers of nonlinear programming, which can easily be recognized in the papers of the 1950s and 1960s. Thus, our first impression should be as follows:

Nonlinear Optimization is a very important and promising applied theory. It covers almost all fields of Numerical Analysis.

However, just by looking at the same list, especially at Examples 1.1.2 and 1.1.3, a more suspicious (or more experienced) reader should come to the following conjecture:

In general, optimization problems are unsolvable.

Indeed, life is too complicated for us to believe in a universal tool that solves all problems at once. However, conjectures are not so important in science; it is a question of personal taste how much we believe in them. The most important event in optimization theory in the middle of the 1970s was that this conjecture was proved in some strict sense. The proof is so simple and remarkable that we cannot avoid it in our course. But first of all, we should introduce some special language, which is necessary to speak about such serious things.

1.1.2 Performance of Numerical Methods

Let us consider a sample situation: we have a problem P, which we are going to solve using a method M. What is the performance of M on P? Let us start with the general definition:

Performance of M on P is the total amount of computational effort which is required by method M to solve the problem P.

In this definition there are several things to be specified. First, what does it mean to solve the problem? In some fields it could mean finding an exact solution. However, in many areas of numerical analysis that is impossible (and optimization is definitely such a case). Therefore, for us to solve the problem will mean:

To find an approximate solution to P with a small accuracy ε > 0.

Now, we know that there are different numerical methods for doing that, and of course we want to choose the scheme which is best for our P. However, it appears that we are looking for something that does not exist. In fact, it does exist, but it is too silly. Just imagine a method for solving (1.1.1) which always reports that x∗ = 0. Of course, it does not work on any problem except those with x∗ = 0, and for the latter problems its “performance” is better than that of all other schemes. Thus, we cannot speak about the best method for a concrete problem P, but we can do so for a class of problems F containing P. Indeed, numerical methods are usually developed

for solving many different problems with similar characteristics. Therefore we can define the performance of M on F as its performance on the worst problem from F.
Since we are going to speak about the performance of M on the whole class F, we should assume that M does not have complete information about a concrete problem P. It has only the description of the problem class F. In order to recognize P (and solve it), the method should be able to collect information about this particular P piece by piece. To model this situation, it is convenient to introduce the notion of an oracle. An oracle O is just a unit which answers the successive questions of the method. The method M, collecting and handling the data, tries to solve the problem P.
In general, each problem can be included in different problem classes, and for each problem we can also imagine different types of oracles. But if we fix F and O, then we fix a model of our problem P. In this case, it is natural to define the performance of M on (F, O) as its performance on the worst Pw from F.²
Let us now consider the iterative process which naturally describes any method M working with the oracle.

General Iterative Scheme.                          (1.1.3)

Input: a starting point x0 and an accuracy ε > 0.

Initialization: set k = 0, I−1 = ∅. Here k is the iteration counter and Ik is the informational set accumulated after k iterations.

Main Loop:
1. Call the oracle O at xk.
2. Update the informational set: Ik = Ik−1 ∪ {(xk, O(xk))}.
3. Apply the rules of method M to Ik and form the new test point xk+1.
4. Check the stopping criterion. If it is satisfied, form an output x̄; otherwise set k = k + 1 and go to 1.
End of the Loop.

Now we can specify the term computational efforts in our definition of the performance. In the scheme (1.1.3) we can easily find its two main sources. The first one is Step 1, where we call the oracle, and the second one is Step 3, where we form the next test point. We introduce two measures of the complexity of the problem P for the method M:

1. Analytical complexity: the number of calls of the oracle which is required to solve the problem P up to the accuracy ε.

2. Arithmetical complexity: the total number of arithmetic operations (including the work of the oracle and the method) which is required to solve the problem P up to the accuracy ε.

²Note that this Pw can be bad only for M.
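As an illustration of the scheme (1.1.3), here is a minimal Python sketch; the oracle, the rule for choosing the next test point and the stopping criterion are placeholders (their names are mine, not part of the text) to be supplied by a concrete method:

```python
def iterative_scheme(oracle, next_point, stop, x0):
    """Generic skeleton of the General Iterative Scheme (1.1.3)."""
    info = []                      # I_{-1}: empty informational set
    x, k = x0, 0
    while True:
        answer = oracle(x)         # 1. call the oracle at x_k
        info.append((x, answer))   # 2. I_k = I_{k-1} U {(x_k, O(x_k))}
        x_next = next_point(info)  # 3. apply the rules of M to I_k
        if stop(info):             # 4. stopping criterion satisfied:
            return info            #    the method forms its output from I_k
        x, k = x_next, k + 1
```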

Thus, the only thing which is not clear yet is the meaning of the words up to the accuracy ε > 0. Note that this meaning is very important for our definitions of complexity. However, it is too specific to speak about it here; we will make it exact when we consider concrete problem classes. Comparing the notions of analytical and arithmetical complexity, we can see that the second one is more realistic. However, for a concrete method M the arithmetical complexity can usually be obtained easily from the analytical complexity. Therefore, in this course we will speak mainly about estimates of the analytical complexity of some problem classes. There is one standard assumption about the oracle which allows us to obtain most of the results on the analytical complexity of optimization methods. This assumption is called the black box concept and it looks as follows:

1. The only information available from the oracle is its answer. No intermediate results are available.
2. The oracle is local: a small variation of the problem far enough from the test point x does not change the answer at x.

This concept is extremely popular in numerical analysis. Of course, it looks like an artificial wall between the method and the oracle created by ourselves. It seems natural to allow the method to analyze the internal structure of the oracle. However, we will see that for some problems with complicated structure this analysis is almost useless, while for some important problems it could help. If we have enough time, that will be the subject of the last lecture of this course.
To conclude this section, let us present the main types of oracles used in optimization. For all of them the input is a test point x ∈ Rn, but the output is different:

• Zero-order oracle: the value f(x).

• First-order oracle: the value f(x) and the gradient f′(x).

• Second-order oracle: the value f(x), the gradient f′(x) and the Hessian f″(x).

1.2 Complexity bounds for Global Optimization

Let us practice in applying the formal language, introduced in the previous section, to a concrete problem. For that, let us consider

min_{x ∈ Bn} f(x).                                  (1.2.4)

In our terminology, that is a constrained minimization problem without functional constraints. The basic feasible set of this problem is Bn, an n-dimensional cube in Rn:

Bn = {x ∈ Rn | 0 ≤ x_i ≤ 1, i = 1, ..., n}.

In order to specify the problem class, let us make the following assumption:

The objective function f(x) is Lipschitz continuous on Bn:

∀x, y ∈ Bn :  |f(x) − f(y)| ≤ L ‖x − y‖

with some constant L (the Lipschitz constant).

Here and in the sequel we use the notation ‖·‖ for the Euclidean norm on Rn:

‖x‖ = √⟨x, x⟩ = ( Σ_{i=1}^n (x_i)^2 )^{1/2}.

Let us consider a trivial method for solving (1.2.4), which is called the Uniform Grid Method. This method, G(p), has one integer input parameter p and its scheme is as follows.

Scheme of the method G(p). (1.2.5)

1. Form (p + 1)^n points

x_{(i1,i2,...,in)} = ( i1/p, i2/p, ..., in/p ),

where i1 = 0, ..., p;  i2 = 0, ..., p;  ...;  in = 0, ..., p.

2. Among all points x_{(...)} find the point x̄ with the minimal value of the objective function.

3. Return the pair (x̄, f(x̄)) as the result.
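A direct implementation of this scheme might look as follows (a sketch; the toy objective in the usage line is mine, not from the text):

```python
import itertools
import numpy as np

def uniform_grid_method(f, n, p):
    """Uniform Grid Method G(p): evaluate f at the (p+1)^n grid points of B_n."""
    best_x, best_val = None, float("inf")
    for idx in itertools.product(range(p + 1), repeat=n):
        x = np.array(idx, dtype=float) / p          # x = (i_1/p, ..., i_n/p)
        val = f(x)
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val                          # the pair (x_bar, f(x_bar))

# toy usage: f(x) = ||x - 0.3||, Lipschitz on B_3 with L = 1
x_bar, f_bar = uniform_grid_method(lambda x: np.linalg.norm(x - 0.3), n=3, p=10)
```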

Thus, this method forms a uniform grid of test points inside the cube Bn, computes the minimal value of the objective over this grid and returns it as an approximate solution to the problem (1.2.4). In our terminology, this is a zero-order method without any influence of the accumulated information on the sequence of test points. Let us find its efficiency estimate.

Theorem 1.2.1 Let f∗ be the global optimal value of problem (1.2.4). Then

f(x̄) − f∗ ≤ L √n / (2p).

Proof:
Let x∗ be the global minimum of our problem. Then there exists a multi-index (i1, i2, ..., in) such that

x ≡ x_{(i1,i2,...,in)} ≤ x∗ ≤ x_{(i1+1,i2+1,...,in+1)} ≡ y

(here and in the sequel we write x ≤ y for x, y ∈ Rn if and only if x_i ≤ y_i, i = 1, ..., n). Note that y_i − x_i = 1/p and x∗_i ∈ [x_i, y_i], i = 1, ..., n. Denote x̂ = (x + y)/2 and form a point x̃ as follows:

x̃_i = y_i  if x∗_i ≥ x̂_i,   and   x̃_i = x_i  otherwise.

It is clear that |x̃_i − x∗_i| ≤ 1/(2p), i = 1, ..., n. Therefore

‖x̃ − x∗‖^2 = Σ_{i=1}^n (x̃_i − x∗_i)^2 ≤ n/(4p^2).

Since x̃ belongs to our grid, we conclude that

f(x̄) − f(x∗) ≤ f(x̃) − f(x∗) ≤ L ‖x̃ − x∗‖ ≤ L √n / (2p).

Note that so far we cannot say what the complexity of this method on the problem (1.2.4) is. The reason is that we have not yet defined the quality of the approximate solution we are looking for. Let us define our goal as follows:

Find x̄ ∈ Bn :  f(x̄) − f∗ ≤ ε.                      (1.2.6)

Then we immediately get the following result.

Corollary 1.2.1 The analytical complexity of the method G is

A(G) = ( ⌊ L√n/(2ε) ⌋ + 2 )^n

(here ⌊a⌋ is the integer part of a).

Proof:
Indeed, let us take p = ⌊ L√n/(2ε) ⌋ + 1. Then p ≥ L√n/(2ε), and therefore, in view of Theorem 1.2.1, we have

f(x̄) − f∗ ≤ L√n/(2p) ≤ ε.

This result is more informative, but we still have some questions. First, maybe our proof is too rough and the real performance of G(p) is much better. Second, we cannot be sure that this is a reasonable method for solving (1.2.4); maybe there are methods with much higher performance. In order to answer these questions, we need to derive for (1.2.4), (1.2.6) the lower complexity bounds. The main features of these bounds are as follows.

• They are based on the Black Box concept.
• They can be derived for a specific class of problems F equipped with a local oracle O.
• These bounds are valid for all reasonable iterative schemes. Thus, they provide us with a lower bound for the analytical complexity on the problem class.
• They use the idea of a resisting oracle.

For us only the notion of the resisting oracle is new; therefore, let us discuss it in detail. A resisting oracle tries to create the worst possible problem for each concrete method. It starts from an “empty” function and tries to answer each call of the method in the worst possible way. However, the answers must be coherent with the previous answers and with the description of the problem class. Note that after the termination of the method it is possible to reconstruct the created problem. Moreover, if we launch the method on this problem, it will reproduce the same sequence of test points, since it will get the same answers from the oracle.
Let us show how it works for the problem (1.2.4). Consider the class of problems F defined as follows:

Problem formulation: min_{x∈Bn} f(x).
Problem class: f(x) is Lipschitz continuous on Bn.
Approximate solution: find x̄ ∈ Bn : f(x̄) − f∗ ≤ ε.

Theorem 1.2.2 The analytical complexity of F for zero-order methods is at least (⌊L/(2ε)⌋)^n − 1.

Proof:
Assume that there exists a method which needs less than p^n − 1 calls of the oracle, where p = ⌊L/(2ε)⌋ (≥ 1), to solve any problem of our class up to accuracy ε > 0. Let us suppose that when the method finds its approximate solution x̄ we allow it to call the oracle one more time, at x̄; this extra call will not be counted in our complexity evaluation. Then the total number N of calls of the oracle made by the method is less than p^n. Apply the method to the resisting oracle which answers f(x) = 0 at every test point. Since N < p^n, among the p^n cells of the uniform grid with step 1/p there is one whose interior contains no test points; let x̂ be the vertex of this cell closest to the origin.

Denote x̃ = x̂ + (1/(2p)) e, where e = (1, ..., 1), and consider the function

f̄(x) = min{ 0, L ‖x − x̃‖∞ − ε },

where ‖a‖∞ = max_{1≤i≤n} |a_i|. Note that the function f̄(x) is Lipschitz continuous with constant L (since ‖a‖∞ ≤ ‖a‖) and the optimal value of f̄(·) is −ε. Moreover, f̄(x) differs from zero only inside the box

B = {x | ‖x − x̃‖∞ ≤ ε/L}.

Since 2p ≤ L/ε, we conclude that

B ⊆ B′ ≡ {x | ‖x − x̃‖∞ ≤ 1/(2p)}.

Thus, f̄(x) is equal to zero at all test points of our method; hence the answers of the oracle for f̄ coincide with those for f ≡ 0, the method returns the same point x̄, and f̄(x̄) = 0 while the optimal value of f̄ is −ε. Since the accuracy of the result of our method is ε, we come to the following conclusion: if the number of calls of the oracle is less than p^n, then the accuracy of the result cannot be less than ε.
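The resisting function of the proof is easy to write down explicitly; the sketch below (with arbitrarily chosen L, ε and dimension) checks that it vanishes at a grid vertex outside the hidden box while taking the value −ε at x̃:

```python
import numpy as np

L, eps, n = 2.0, 1.0 / 64, 3
p = int(L / (2 * eps))                       # p = L/(2*eps) = 64 here

def f_resisting(x, x_tilde):
    # f_bar(x) = min{0, L * ||x - x_tilde||_inf - eps}
    return min(0.0, L * np.max(np.abs(x - x_tilde)) - eps)

x_tilde = np.full(n, 1.0 / (2 * p))          # center of the grid cell touching 0
print(f_resisting(np.full(n, 1.0 / p), x_tilde))   # 0.0: a grid vertex sees nothing
print(f_resisting(x_tilde, x_tilde))               # -eps: the hidden optimal value
```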

Now we can say much more about the performance of the uniform grid method. Let us compare its efficiency estimate with the lower bound:

G :  ( L√n/(2ε) )^n ,        Lower bound:  ( L/(2ε) )^n .

Thus, we conclude that G has the optimal dependence of its complexity on ε, but not on n. Note that our conclusion depends on the problem class. If we consider the functions f satisfying

∀x, y ∈ Bn :  |f(x) − f(y)| ≤ L ‖x − y‖∞

then the same reasoning as before proves that the uniform grid method is optimal, with the efficiency estimate ( L/(2ε) )^n.
Theorem 1.2.2 supports our initial claim that general optimization problems are unsolvable. Let us look at the following example.

Example 1.2.1 Consider the problem class F defined by the following parameters:

L = 2,  n = 10,  ε = 0.01.

Note that the size of the problem is very small and we ask only for 1% accuracy. The lower complexity bound for this class is ( L/(2ε) )^n. Let us compute what that means:

Lower bound: 10^20 calls of the oracle,
Complexity of the oracle: n a.o.,
Total complexity: 10^21 a.o.,
Intel Quad Core processor: 10^9 a.o. per second,
Total time: 10^12 seconds,
1 year: less than 3.2 · 10^7 sec.

We need: 32 000 years.
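The arithmetic behind these figures is easy to reproduce (the 10^9 a.o. per second and the 3.2·10^7 seconds per year are the same rough figures as above):

```python
L, n, eps = 2.0, 10, 0.01
calls = (L / (2 * eps)) ** n        # lower bound (L/(2*eps))^n = 100^10 = 1e20
total_ops = calls * n               # each oracle call costs about n a.o.  -> 1e21
seconds = total_ops / 1e9           # 10^9 a.o. per second                 -> 1e12 s
years = seconds / 3.2e7             # about 3.2e7 seconds in a year        -> ~3.1e4
print(f"{calls:.0e} calls, about {years:,.0f} years")
```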

This estimate is so disappointing that we cannot believe such problems may become solvable even in the future. Indeed, suppose we believe in Moore's law, i.e., that processor power is multiplied by 3 every 2 years. Then we can hope that a PC of 2030 will solve the problem in only 1 year, and in 2070 it will take only 1 second. However, let us just play with the parameters of the class.

• If we change n to n + 1 then we have to multiply our estimate by 100. Thus, for n = 11 our time estimate is valid even for the fastest available computer.

• But if we multiply ε by two, we reduce the complexity by a factor of 1000. For example, if ε = 8% then we need only 2 days to solve the problem.

We should note that the lower complexity bounds for problems with smooth functions, or for high-order methods, are not much better than that of Theorem 1.2.2. This can be proved using the same arguments, and we leave the proof as an exercise for the reader. An advanced reader can compare our results with the upper bound for NP-hard problems, which are considered as examples of very difficult problems in combinatorial optimization: it is only 2^n a.o.!
To conclude this section, let us compare our situation with some other fields of numerical analysis. It is well known that the uniform grid approach is a standard tool for many of them. For example, if we need to compute numerically the value of the integral

I = ∫_0^1 f(x) dx,

then we have to form the discrete sum

S_N = (1/N) Σ_{i=1}^N f(x_i),   x_i = i/N,  i = 1, ..., N.

If f(x) is Lipschitz continuous then this sum can be used as an approximation to I:

N = L/ε  ⇒  |I − S_N| ≤ ε.

Note that in our terminology this is exactly the uniform grid approach. Moreover, it is a standard way of approximating integrals. The reason why it works here lies in the dimension of the problem: for integration the standard dimensions are 1 – 3, while in optimization we sometimes need to solve problems with several million variables.
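For comparison, the one-dimensional uniform grid (Riemann sum) really is a practical tool; here is a tiny sketch with an illustrative integrand:

```python
import math

def riemann_sum(f, N):
    # S_N = (1/N) * sum_{i=1}^{N} f(i/N), the uniform grid approximation of I
    return sum(f(i / N) for i in range(1, N + 1)) / N

I_approx = riemann_sum(math.sin, 1000)   # integral of sin over [0, 1]
I_exact = 1.0 - math.cos(1.0)
print(abs(I_approx - I_exact))           # error of order 1/N for a Lipschitz f
```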

1.3 Identity cards of the fields

After the pessimistic result of the previous section, we should first understand what could be our goal in the theoretical analysis of optimization problems. Hopefully, by now everything is clear with global optimization. But maybe its goals are too ambitious? Maybe in some practical problems we would be satisfied by a much less “optimal” solution? Or maybe there are some interesting problem classes which are not as terrible as the class of general continuous functions?
In fact, each of these questions can be answered in a different way, and this way defines the style of the research (or rules of the game) in the different optimization fields. If we try to classify them, we easily see that they differ from one another in the following aspects:

• Description of the goals.

• Description of the problem class.

• Description of the oracle.

These aspects define in a natural way the list of desired properties of the optimization methods. To conclude this lecture, let us present the “identity cards” of the fields we will consider in our course.

Name: Global Optimization.
Goals: Find a global minimum.
Problem class: Continuous functions.
Oracle: 0th–2nd order black box.
Desired properties: Convergence to a global minimum.
Features: This game is too short. We always lose it.
Problem sizes: Sometimes people pretend to solve problems with several thousands of variables. No guarantee of success even for very small problems.
History: Starts from 1955. Several local peaks of interest (e.g., neural networks, genetic algorithms).

Name: Nonlinear Programming.
Goals: Find a local minimum.
Problem class: Differentiable functions.
Oracle: 1st–2nd order black box.
Desired properties: Convergence to a local minimum. Fast convergence.
Features: Variability of approaches. Most widespread software. The goal is not always acceptable.
Problem sizes: Up to 1000 variables.
History: Starts from 1955. Peak period: 1965 – 1975; variable theoretical activity right now.

Name: Convex Optimization.
Goals: Find a global minimum.
Problem class: Convex sets and functions.
Oracle: 1st order black box.

Desired properties: Convergence to a global minimum. Rate of convergence depends on the dimension.
Features: Very rich and interesting theory. Complete complexity theory. Efficient practical methods. The problem class is sometimes restrictive.
Problem sizes: Up to 1 000 000's of variables.
History: Starts from 1970. Peak period: 1975 – 1985; currently a peak of theoretical activity concerning methods for extremely large scale problems.

Name: Interior-Point Polynomial Methods. (Would be a nice subject for an advanced course.)
Goals: Find a global minimum.
Problem class: Convex sets and functions.
Oracle: 2nd order oracle which is not a black box.
Desired properties: Fast convergence to a global minimum. Rate of convergence depends on the structure of the problem.
Features: Very new and promising theory. Avoids the black box concept. The problem class is the same as in Convex Programming.
Problem sizes: Sometimes up to 1 000 000's of variables.
History: Starts from 1984. Peak period: 1990 − ...; high theoretical activity now.

1.4 Rules of the game

Some of the statements in the course (theorems, propositions, lemmas, and examples, if the latter contain a certain statement) are marked by the superscripts ∗ or +. The unmarked statements are obligatory: you are required to know both the statement and its proof. The statements marked by ∗ are semi-obligatory: you are expected to know the statement itself and may skip its proof (which normally accompanies the statement), although you are welcome, of course, to read the proof as well. The proofs of the statements marked by + are omitted; you are expected to be able to prove these statements by yourself, and these statements are parts of the assignments.

The majority of lectures are accompanied by “Exercise” sections. In several cases the exercises are devoted to the lecture where they are placed; sometimes they prepare the reader for the next lecture. Exercises marked by # are closely related to the lecture where they are placed or to the following one; it would be a good thing to solve such an exercise, or at least to become acquainted with its solution (if any is given). Exercises which I find difficult are marked with >.

1.5 Suggested reading

It could be helpful to have a good book on analysis at hand.

If you want to improve your background on the basic mathematical notions involved, consider the reference

• J. Céa: Optimisation, théorie et algorithmes, Dunod, Paris (1971).

The main drawback of the “little blue book” by C. Lemaréchal (Méthodes numériques d'optimisation, Notes de cours, Université Paris IX-Dauphine, INRIA, Rocquencourt, 1989) is that it is too small. As far as the main body of the course is concerned, for Chapter 5 I would suggest the references

• A. Auslender: Optimisation. Méthodes numériques, Masson, Paris (1976).

• P.J. Laurent: Approximation et Optimisation, Hermann (1972).

All these books also possess the important quality of being written in French. If you decide that you are interested in Convex Optimization, the following reading would be extremely gratifying:

• A. Ben-Tal, A. Nemirovski: Lectures on Modern Convex Optimization, MPS-SIAM Series on Optimization, SIAM, Philadelphia (2001).

• Yu. Nesterov: Introductory Lectures on Convex Optimization: A Basic Course, Kluwer Academic Publishers, Boston (2004).

• S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press (2004).

The latter book is available on S. Boyd's page: http://www.stanford.edu/~boyd/index.html

Lecture 2

When everything is simple: 1-dimensional Convex Optimization

(Complexity of One-dimensional Convex Optimization: Upper and Lower Bounds)

2.1 Example: one-dimensional convex problems

In this part of the course we are interested in theoretically efficient methods for convex optimization problems. We start with a simple case: one-dimensional convex minimization. Let us apply our formal language to this example, where everything can be easily seen. Consider the one-dimensional convex problem

minimize f(x) s.t. x ∈ G =[a, b], (2.1.1)

where [a, b] is a given finite segment of the axis. It is also known that our objective f is convex and continuous on G; for the sake of simplicity, assume that we know bounds, let them be 0 and V, for the values of the objective on G. Thus, all we know about the objective is that it belongs to the family

F = { f : [a, b] → R | f is convex and continuous; 0 ≤ f(x) ≤ V, x ∈ [a, b] }.

And what we are asked to do is to find, for a given positive ε, an ε-solution to the problem, i.e., a point x̄ ∈ G such that

f(x̄) − f∗ ≡ f(x̄) − min_G f ≤ ε.

Of course, our a priori knowledge of the objective, given by the inclusion f ∈ F, is, for small ε, far from being sufficient for finding an ε-solution, and we need some source of quantitative information on the objective. The standard assumption here, which comes from optimization practice, is that we can compute the value and a subgradient of the objective at a point, i.e., we have access to a subroutine, our oracle O, which gets as input a point


x from our segment and returns the value f(x) and a subgradient f′(x) of the objective at the point; we subject the input of the subroutine to the restriction a ≤ x ≤ b. Given a method M working with this oracle, we denote by A(M) its complexity on the family (the worst-case, over the family, number of oracle calls it performs) and by Accuracy(M) its worst-case inaccuracy f(x̄) − f∗ of the result x̄ it produces. The problem we are interested in is: given ε > 0, find among the methods M with accuracy on the family not worse than ε the method with the smallest possible complexity on the family. Recall that the complexity of the associated optimal method, i.e., the function

A(ε) = min{ A(M) | Accuracy(M) ≤ ε },

is called the complexity of the family.

2.1.1 Upper complexity bound

Let us consider a simple bisection method to solve the problem (2.1.1). In order to minimize f, we start with the midpoint of our segment, i.e., with

x1 = (a + b)/2,

and ask the oracle for the value and a subgradient of the objective at this point. If the subgradient is zero, we are done: we have found an optimal solution. If the subgradient is positive, then, due to convexity, the function to the right of x1 is greater than at the point, and we may cut off the right half of our initial segment: the minimum for sure is localized in the remaining part. If the subgradient is negative, then we may cut off the left half of the initial segment. Thus, we either terminate with an optimal solution, or find a new segment, twice smaller than the initial domain, which for sure localizes the set of optimal solutions. In this latter case we repeat the procedure, with the initial domain replaced by the new localizer, and so on. After we have performed the number of steps indicated in the formulation of the theorem below, we terminate and form the result as the best of the search points we have looked through, i.e., the one with the minimal value of f:

x̄ ∈ {x1, ..., xN};   f(x̄) = min_{1≤i≤N} f(xi).

Note that traditionally the approximate solution given by the bisection method is identified with the last search point (which is clearly at distance at most (b − a) 2^{−N} from the optimal solution), rather than with the best point found so far. This traditional choice has little in common with our accuracy measure (we are interested in small values of the objective rather than in closeness to the optimal solution) and is simply dangerous, as you can see from the following example:


Figure 1.

Here during the first N − 1 steps everything looks as if we were minimizing f(x) = x, so that the N-th search point is xN = 2^{−N}; our experience is misleading, as you see from the picture, and the relative accuracy of xN as an approximate solution to the problem is very bad, something like 1/2.
By the way, we see from this example that the evident convergence of the search points to the optimal set at the rate at least 2^{−i} does not automatically imply a certain fixed rate of

convergence in terms of the objective; it turns out, anyhow, that such a rate exists, but for the best points found so far rather than for the search points themselves. This way we obtain the following scheme of the bisection algorithm:

1. Initialization.

Set x− = a, x+ = b, k =1 and x1 =(a + b)/2.

2. Iteration. For k =1, ..., N:

Call the oracle: compute f′(xk).
Update the active segment: if f′(xk) < 0, set x− = xk; else if f′(xk) > 0, set x+ = xk. If f′(xk) = 0, go to 3.
Put xk+1 = (x− + x+)/2.

3. Form the result. Take x̄ = argmin_{i=1,...,N+1} f(xi).
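A minimal sketch of this bisection (the quadratic test function in the usage line is mine, not from the text):

```python
def bisection(f, f_prime, a, b, N):
    """Bisection for a convex f on [a, b]; returns the best point visited."""
    x_minus, x_plus = a, b
    x = 0.5 * (a + b)
    visited = []
    for _ in range(N):
        g = f_prime(x)                     # call the oracle at x_k
        visited.append(x)
        if g == 0:
            break                          # exact minimizer found
        if g < 0:
            x_minus = x                    # minimum lies to the right
        else:
            x_plus = x                     # minimum lies to the left
        x = 0.5 * (x_minus + x_plus)
    visited.append(x)                      # the (N+1)-th point of the scheme
    return min(visited, key=f)             # x_bar = argmin over search points

x_bar = bisection(lambda x: (x - 0.3) ** 2, lambda x: 2 * (x - 0.3), 0.0, 1.0, N=20)
```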

To establish an upper bound for A(ε) it suffices to evaluate A(M).

Theorem 2.1.1 The complexity of the family in question satisfies the inequality

A(ε) ≤ ⌈ log2(V/ε) ⌉,    0 < ε < V.

Note that the range of values of ε in our statement is (0, V), and this is quite natural: since all functions from the family take their values between 0 and V, any point of the segment solves every problem from the family to the accuracy V, so that a nontrivial optimization problem occurs only when ε < V.

Proof:
We start with the observation that if GN = [x−, x+] is the final localizer of the optimum found during the bisection, then outside the localizer the value of the objective is at least the value at the best of the search points, i.e., at least the value at the approximate solution x̄ found by bisection:

f(x) ≥ f(x̄) ≡ min_{1≤i≤N} f(xi),   x ∈ G \ GN.

Indeed, at a step of the method we cut off only those points of the domain G where f is at least as large as at the current search point, and is therefore ≥ its value at the best of the search points, that is, ≥ the value at x̄; this is exactly what was claimed.
Now, let x∗ be an optimal solution to the problem, i.e., a minimizer of f; as we know, such a minimizer does exist, since our continuous objective for sure attains its minimum

over the finite segment G. Let α be a real number greater than 2^{−N} and less than 1, and consider the α-contraction of the segment G to x∗, i.e., the set

Gα = (1 − α)x∗ + αG ≡ { (1 − α)x∗ + αz | z ∈ G }.

This is a segment of length α(b − a), and due to our choice of α this length is greater than that of our final localizer GN. It follows that Gα cannot lie inside the localizer, so that there is a point, let it be y, which belongs to Gα but does not belong to the interior of the final localizer:

∃ y ∈ Gα :  y ∉ int GN.

Since y belongs to Gα, we have

y = (1 − α)x∗ + αz

for some z ∈ G, and from convexity of f it follows that

f(y) ≤ (1 − α)f(x∗)+αf(z), which can be rewritten as

f(y) − f∗ ≤ α (f(z) − f∗) ≤ αV,

so that y is an αV-solution to the problem. On the other hand, y is outside GN, where, as we already know, f is at least f(x̄):

f(¯x) ≤ f(y).

We conclude that f(x̄) − f∗ ≤ f(y) − f∗ ≤ αV. Since α can be arbitrarily close to 2^{−N}, we come to

f(x̄) − f∗ ≤ 2^{−N} V ≤ 2^{−log2(V/ε)} V = ε.

Thus, Accuracy(BisectionN ) ≤ ε. The upper bound is proved.

The observation that the length of the localizers Gi converges geometrically to 0 was crucial in the above proof of the complexity estimate. However, for the bisection procedure to possess this property, convexity of f is not necessary; for instance, it is enough for f to be quasi-convex. On the other hand, quasi-convexity itself does not imply convergence of the error to 0 in the course of the iterations. To have this, we have to impose some condition on the local variation of the objective, e.g., that f is Lipschitz continuous.1)

1)We will discuss this subject at length in the Exercise section of the next lecture.

2.1.2 Lower complexity bound

We would also like to provide a lower bound for A(ε). At first glance it is not clear where one could get lower complexity bounds for the class from. This problem looks less natural than inventing methods and evaluating their behaviour on the family - this is what people in optimization are doing all their life. In contrast to this, to find a lower bound means to say something definite about all methods from an extremely wide class, not to investigate a particular method. We shall see in a while that the task is quite tractable in the convex context.2)
The answer to the question is given by the following

Theorem 2.1.2 The complexity of the family in question satisfies

(1/5) log2(V/ε) − 1 < A(ε),    0 < ε < V.

Proof. Without loss of generality we may assume that G = [−1, 1] and V = 1. Let M be a method which solves every problem of the family within accuracy ε, and let K be the number of oracle calls it performs. The key step of the proof is the following

Lemma 2.1.1 For any i, 0 ≤ i ≤ K, our family contains an objective fi with the following two properties:

(1_i): there is a segment Δi ⊂ G = [−1, 1] - the active segment of fi - of the length

δi = 2^{1−2i}

where fi is a modulus-like function, namely,

fi(x) = ai + 2^{−3i} |x − ci|,   x ∈ Δi,

2)Recall that we have succeeded in treating this task for the class of Global Optimization problems.

ci being the midpoint of Δi;

(2_i): the first i points x1, ..., xi generated by M as applied to fi are outside Δi.

Proof will be given by induction on i.

Base i = 0 is immediate: we set

f0(x)=|x|, Δ0 = G =[−1, 1],

thus ensuring (1_0). Property (2_0) holds true for trivial reasons: when i = 0, there are no search points to be looked at.

Step i ⇒ i +1: Let fi be the objective given by our inductive hypothesis, let Δi be the active segment of this objective and let ci be the midpoint of the segment.

Let also x1, ..., xi,xi+1 be the first i + 1 search points generated by M as applied to fi. According to our inductive hypothesis, the first i of these points are outside the active segment.

In order to obtain fi+1, we modify the function fi in its active segment and do not vary the function outside the segment. The way we modify fi in the active segment depends on whether xi+1 is to the right of the midpoint ci of the segment (”right” modification), or this is not the case and xi+1 either coincides with ci or is to the left of the point (”left” modification).

The “right” modification is as follows: we replace the function fi, which is modulus-like in its active segment, by a piecewise linear function with three linear pieces, as shown in the picture below. Namely, we do not change the slope of the function in the initial 1/14 part of the segment, then change the slope from −2^{−3i} to −2^{−3(i+1)} and make a new breakpoint at the end ci+1 of the first quarter of the segment Δi. Starting with this breakpoint and up to the right endpoint of the active segment, the slope of the modified function is 2^{−3(i+1)}. It is easily seen that the modified function at the right endpoint of Δi comes to the same value as that of fi, and that the modified function is convex on the whole axis.

In the case of the “left” modification, i.e., when xi+1 ≤ ci, we act in the “symmetric” manner, so that the breakpoints of the modified function are at the distances (3/4)|Δi| and (13/14)|Δi| from the left endpoint of Δi, and the slopes of the function, from left to right, are −2^{−3(i+1)}, 2^{−3(i+1)} and 2^{−3i}:


Figure 2. “Right” modification of fi. The bold segment on the axis is the active segment of the modified function.

Let us verify that the modified function fi+1 satisfies the requirements of the lemma. As we have mentioned, it is a convex continuous function; since we do not vary fi outside the segment Δi and do not decrease it inside the segment, the modified function takes its values in (0, 1) together with fi. It suffices to verify that fi+1 satisfies (1_{i+1}) and (2_{i+1}). (1_{i+1}) is evident by construction: the modified function indeed is modulus-like, with the required slope, in a segment of the required length. What should be proved is (2_{i+1}), the claim that the method M as applied to fi+1 does not visit the active segment of fi+1 during the first i + 1 steps. To prove this, it suffices to prove that the first i + 1 search points generated by the method as applied to fi+1 are exactly the search points generated by it when minimizing fi, i.e., that they are the points x1, ..., xi+1. Indeed, these latter points for sure are outside the new active segment: the first i of them because they do not even belong to the larger segment Δi, and the last point, xi+1, by our construction, which ensures that the active segment of the modified function and xi+1 are separated by the midpoint ci of the segment Δi. Thus, we come to the necessity to prove that x1, ..., xi+1 are the first i + 1 points generated by M as applied to fi+1. This is evident: the points x1, ..., xi are outside Δi, where fi and fi+1 coincide; consequently, the information (the values and the subgradients) on the functions along the sequence x1, ..., xi is the same for both of the functions. Now, by definition of a method, the information accumulated by it during the first i steps uniquely determines the first i + 1 search points; since fi and fi+1 are indistinguishable in a neighbourhood of the first i search points generated by M as applied to fi, the initial (i + 1)-point segments of the trajectories of M on fi and on fi+1 coincide with each other, as claimed. Thus, we have justified the inductive step and have thereby proved the lemma.

It remains to derive from the lemma the desired lower complexity bound. This is immediate. According to the lemma, there exists a function fK in our family which is modulus-like in its active segment and is such that the method during its first K steps does not visit this active segment. But the K-th point xK of the trajectory of M on fK is exactly the result found by the method as applied to this function; since fK is modulus-like in ΔK and is convex everywhere, it attains its minimum f∗K at the midpoint cK of the segment ΔK, and outside ΔK it is greater than

f∗K + 2^{−3K} · 2^{−2K} = f∗K + 2^{−5K}

(the product here is half of the length of ΔK times the slope of fK). Thus,

(the product here is half of the length of ΔK times the slope of fK). Thus, M − ∗ −5K fK (¯x( ,fK)) fK > 2 . On the other hand, M, by its origin, solves all problems from the family to the accuracy ε, and we come to 2−5K <ε, i.e., to 1 1 K ≡ K +1> log ( ). 5 2 ε as required in our lower complexity bound.

2.2 Conclusion

The one-dimensional situation we have investigated is, of course, very simple; I spoke about it only to give you an impression of what we are going to do. In the main body of the course we shall consider much more general classes of convex optimization problems, i.e., multidimensional problems with functional constraints. Just as in our simple one-dimensional example, we shall ask ourselves what the complexity of these classes is and what the corresponding optimal methods are. Let me stress that it is these optimal methods we shall mainly focus on; this is a much more interesting issue than the complexity itself, both from the mathematical and the practical viewpoint. In this respect the one-dimensional situation is not typical: it is easy to guess that bisection should be optimal and to establish its rate of convergence. In several dimensions the situation is far from being so trivial and is incomparably more interesting.

2.3 Exercises

2.3.1 Can we use 1-dimensional optimization?

Note that, though extremely simple, the bisection algorithm can be of great use for solving multi-dimensional optimization problems which look much more involved. Consider

for instance the following example of minimizing a separable function subject to an equality constraint. We consider the problem

min { f(x) = Σ_{i=1}^n fi(xi) }   subject to   a^T x = b,                 (2.3.4)

where a ∈ Rn, b ∈ R, and fi : R → R are differentiable and strictly convex. The objective function is called separable since it is a sum of functions of the individual variables x1, ..., xn. We assume that the domain of f intersects the constraint set, i.e., that there exists a point x0 ∈ dom f with a^T x0 = b. This implies that the problem has a unique optimal point x∗ (why?). The Lagrangian is

L(x, λ) = Σ_{i=1}^n fi(xi) + λ(a^T x − b) = −bλ + Σ_{i=1}^n ( fi(xi) + λ ai xi ),

which is also separable, so the dual function is

L(λ) = −bλ + inf_x Σ_{i=1}^n ( fi(xi) + λ ai xi )
     = −bλ + Σ_{i=1}^n inf_{xi} [ fi(xi) + λ ai xi ]
     = −bλ − Σ_{i=1}^n fi∗(−λ ai),

where fi∗ : R → R denotes the function conjugate to fi:

fi∗(u) = max_{z∈R} [ uz − fi(z) ],   i = 1, ..., n.

The function fi∗ is also referred to as the Legendre transform of fi. Note that the fi∗, being pointwise suprema of convex functions, are themselves convex (and, by the way, differentiable). The dual problem is thus

max_λ  −bλ − Σ_{i=1}^n fi∗(−λ ai)                                         (2.3.5)

with the (scalar) variable λ ∈ R. Now suppose that 1) we are able to write down the conjugate functions fi∗ explicitly, so that we are able to compute a first-order oracle for the convex problem (2.3.5): indeed, for a given ui ∈ R, the value fi∗(ui) is the optimal value of the problem max_z [ui z − fi(z)], and [fi∗]′(ui) is its (unique!) optimizer z∗. Then 2) we can recover an optimal dual variable λ∗ by bisection. Since each fi is strictly convex, the function L(x, λ∗) is strictly convex in x, and so has a unique minimizer x̄. But we also know that the optimal solution x∗ to (2.3.4) also minimizes L(x, λ∗), so we must have x̄ = x∗. Note that we can recover x∗ from L′x(x, λ∗) = 0, i.e., by solving the equations fi′(x∗i) = −λ∗ ai.
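Here is a sketch of this dual bisection on a made-up instance with quadratic fi(xi) = ci xi² (everything below, including the bracket chosen for λ, is illustrative and not part of the exercises that follow):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
c = rng.uniform(1.0, 2.0, n)          # f_i(x_i) = c_i * x_i^2, strictly convex
a = rng.uniform(0.5, 1.5, n)
b = 1.0

def x_of(lam):
    # minimizer of f_i(x_i) + lam * a_i * x_i:  x_i = -lam * a_i / (2 c_i)
    return -lam * a / (2.0 * c)

lo, hi = -100.0, 100.0                # a^T x(lam) is decreasing in lam: bracket it
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if a @ x_of(mid) > b:
        lo = mid                      # constraint value too large: increase lam
    else:
        hi = mid
lam_star = 0.5 * (lo + hi)
x_star = x_of(lam_star)
print(x_star, a @ x_star)             # a^T x_star equals b up to roundoff
```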

Exercise 2.3.1 Consider the problem of finding the Euclidean projection of a point y ∈ Rn onto the standard simplex:

min { f(x) = |x − y|^2 }   subject to   Σ_{i=1}^n xi = 1,  xi ≥ 0,  i = 1, ..., n.        (2.3.6)

1. Consider the 1-dimensional problem

max_{z≥0} [ uz − (z − a)^2 ].

Write down its Lagrange dual and find its explicit solution.

2. Using the method described in this section, propose a simple solution to the problem (2.3.6) by bisection. Write down explicitly the formulas which allow one to recover the primal solution from the dual one.

A slightly more involved variation on the same theme is as follows:

Exercise 2.3.2 (Entropy maximization)

1. Let a ∈ Rn be an “interior” point of the standard simplex, i.e., ai > 0, Σ_i ai = 1, and let u ∈ Rn. Consider the problem

min { f(x) = Σ_{i=1}^n xi log(xi/ai) + Σ_{i=1}^n ui (xi − ai) }  subject to  Σ_{i=1}^n xi = 1,  xi ≥ 0,  i = 1, ..., n.        (2.3.7)

Find the unique (explain why?) explicit solution to (2.3.7).

2. Let 1/n ≤ α < 1, and let a ∈ Rn satisfy 0 < ai ≤ α, Σ_i ai = 1. Find an explicit solution to the one-dimensional problem

max_{0≤z≤α} [ u(z − β) − z log(z/β) ]   for 0 < β ≤ α.

Explain how the bisection algorithm can be used to solve the problem

min { f(x) = Σ_{i=1}^n xi log(xi/ai) + Σ_{i=1}^n ui (xi − ai) }  subject to  Σ_{i=1}^n xi = 1,  0 ≤ xi ≤ α,  i = 1, ..., n.

Hint: “Dualize” the equality constraint Σ_{i=1}^n xi = 1.

2.3.2 Extremal Ellipsoids

In order to advance further we need some basic results about ellipsoids in Rn. There are two natural ways to define an ellipsoid W in Rn. The first is to represent W as the set defined by a convex quadratic constraint, namely, as

W = {x ∈ Rn | (x − c)T A(x − c) ≤ 1} (2.3.8)

A being a symmetric positive definite n × n matrix and c being a point in Rn (the center of the ellipsoid). The second way is to represent W as the image of the unit Euclidean ball under an affine invertible mapping, i.e., as

W = {x = Bu + c | uT u ≤ 1}, (2.3.9) where B is an n × n nonsingular matrix and c is a point from Rn.
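The passage between the two descriptions is easy to carry out numerically; here is a small sketch (with an arbitrary positive definite A) that builds B = A^{−1/2} and checks that the image of a unit vector lands on the boundary set by (2.3.8):

```python
import numpy as np

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])                   # an illustrative positive definite A
c = np.array([1.0, -1.0])

w, U = np.linalg.eigh(A)                     # A = U diag(w) U^T
B = U @ np.diag(w ** -0.5) @ U.T             # B = A^{-1/2}, so A = (B^{-1})^T B^{-1}

u = np.array([0.6, 0.8])                     # a point of the unit sphere, u^T u = 1
x = B @ u + c                                # its image lies on the boundary of W
print((x - c) @ A @ (x - c))                 # prints 1.0 (up to rounding)
```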

Exercise 2.3.3 # Prove that the above definitions are equivalent: if W ⊂ Rn is given by (2.3.8), then W can be represented by (2.3.9) with B chosen according to

A = (B^{−1})^T B^{−1}

(e.g., with B chosen as A^{−1/2}). Vice versa, if W is represented by (2.3.9), then W can be represented by (2.3.8), where one should set

A = (B^{−1})^T B^{−1}.

Note that the (positive definite symmetric) matrix A involved in (2.3.8) is uniquely defined by W (why?); in contrast to this, a nonsingular matrix B involved in (2.3.9) is defined by W only up to a right orthogonal factor: the matrices B and B′ define the same ellipsoid if and only if B′ = BU with an orthogonal n × n matrix U (why?). From the second description of an ellipsoid it immediately follows that

if W = {x = Bu + c | u ∈ Rn, u^T u ≤ 1} is an ellipsoid and x → p + B′x is an invertible affine transformation of Rn (so that B′ is a nonsingular n × n matrix), then the image of W under this transformation is also an ellipsoid.

Indeed, the image is nothing but

W′ = {x = B′Bu + (p + B′c) | u ∈ Rn, u^T u ≤ 1},

the matrix B′B being nonsingular along with B and B′. It is also worth noting that

for any ellipsoid

W = {x = Bu + c | u ∈ Rn,uT u ≤ 1}

there exists an invertible affine transformation of Rn, e.g., the transformation

x → B^{−1}x − B^{−1}c,

which transforms the ellipsoid exactly into the unit Euclidean ball

V = {u ∈ Rn | uT u ≤ 1}.

In what follows we mainly focus on various volume-related issues; to avoid complicated constant factors, it is convenient to take as the volume unit the volume of the unit Euclidean ball V in Rn rather than the volume of the unit cube. The volume of a body3) Q measured in this unit, i.e., the ratio

Vol_n(Q) / Vol_n(V),

Vol_n being the usual Lebesgue volume in Rn, will be denoted vol_n(Q) (we omit the subscript n if the value of n is clear from the context).

Exercise 2.3.4 # Prove that if W is an ellipsoid in Rn given by (2.3.9), then

vol W = |Det B|,                                 (2.3.10)

and if W is given by (2.3.8), then

vol W = |Det A|^{−1/2}.                          (2.3.11)

Our local goal is to prove the following statement:

Let Q be a convex body in Rn (i.e., a closed and bounded convex set with a nonempty interior). Then there exist ellipsoids containing Q, and among these ellipsoids there is one with the smallest volume. This ellipsoid is unique; it is called the outer extremal ellipsoid associated with Q. Similarly, there exist ellipsoids contained in Q, and among these ellipsoids there is one with the largest volume. This ellipsoid is unique; it is called the inner extremal ellipsoid associated with Q.

In fact we are not too interested in the uniqueness of the extremal ellipsoids (and you may try to prove the uniqueness yourself); what is actually of interest is the existence and some important properties of the extremal ellipsoids.

Exercise 2.3.5 # Prove that if Q is a closed and bounded convex body in Rn, then there exist ellipsoids containing Q, and among these ellipsoids there is (at least) one with the smallest volume.

3)In what follows, “body” means a set with a nonempty interior.

Exercise 2.3.6 . Prove that if Q is a closed and bounded convex body in Rn, then there exist ellipsoids contained in Q and among these ellipsoids there is (at least) one with the largest volume.

Note that the extremal ellipsoids associated with a closed and bounded convex body Q “accompany Q under affine transformations”: if x → Ax + b is an invertible affine transformation and Q′ is the image of Q under this transformation, then the image W′ of an extremal outer ellipsoid W associated with Q (note the article: we have not proved uniqueness!) is an extremal outer ellipsoid associated with Q′, and similarly for (an) extremal inner ellipsoid.
The indicated property is, of course, an immediate consequence of the facts that affine images of ellipsoids are again ellipsoids and that the ratio of volumes remains invariant under an affine transformation of the space.
In what follows we focus on outer extremal ellipsoids. Useful information can be obtained by investigating these ellipsoids for “simple parts” of a Euclidean ball.

Exercise 2.3.7 + Prove that the volume of the “spherical hat”

Vα = {x ∈ Rn | |x| ≤ 1, xn ≥ α}

(0 ≤ α ≤ 1) of the unit Euclidean ball V = {x ∈ Rn | |x| ≤ 1} satisfies the inequality

Vol_n(Vα) / Vol_n(V) ≤ κ exp{−nα^2/2},

κ > 0 being a positive absolute constant. What is, numerically, the left hand side ratio when α = 1/2 and n = 64?

Exercise 2.3.8 #+ Let n > 1, let

V = { x ∈ Rn | |x|2 ≡ ( Σ_{i=1}^n xi^2 )^{1/2} ≤ 1 }

be the unit Euclidean ball, let e be a unit vector in Rn and let

Vα = {x ∈ V | e^T x ≥ α},   α ∈ [−1, 1]

(Vα is what is called a “spherical hat”). Prove that if −1/n < α < 1, then the set Vα can be covered by an ellipsoid W of volume

vol_n(W) ≤ ( n^2/(n^2 − 1) )^{n/2} ( (n − 1)/(n + 1) )^{1/2} (1 − α^2)^{(n−1)/2} (1 − α) < 1 = vol_n(V);

W is defined as

W = { x = ((1 + nα)/(n + 1)) e + Bu | u^T u ≤ 1 },

where

B = ( n^2 (1 − α^2)/(n^2 − 1) )^{1/2} ( I − β e e^T ),   β = 1 − ( (1 − α)(n − 1) / ((1 + α)(n + 1)) )^{1/2}.

Hint: note that Vα is contained in the set of solutions to the following pair of quadratic inequalities:

x^T x ≤ 1;                                       (2.3.12)

(e^T x − α)(e^T x − 1) ≤ 0.                      (2.3.13)

Consequently, Vα satisfies any convex quadratic inequality which can be represented as a weighted sum of (2.3.12) and (2.3.13) with positive weights. Every inequality of this type defines an ellipsoid; compute the volume of this ellipsoid as a function of the weights and optimize it with respect to the weights.
In fact the ellipsoid given by the latter exercise is the extremal outer ellipsoid associated with Vα. Looking at the result stated in the latter exercise, one may draw a number of useful conclusions.

1. When α = 0, i.e., when the spherical hat Vα is a half-ball, we have

vol_n(W) = ( 1 + 1/(n^2 − 1) )^{n/2} ( 1 − 2/(n + 1) )^{1/2}
         ≤ exp{ n/(2(n^2 − 1)) } exp{ −1/(n + 1) }
         = exp{ −(n − 2)/(2(n^2 − 1)) } vol_n(V);

thus, for the case of α = 0 (and, of course, for the case of α > 0) we may cover Vα by an ellipsoid with volume 1 − O(1/n) times less than that of V. In fact the same conclusion (with another absolute constant factor O(1)) holds true when α is negative (so that the spherical hat is greater than a half-ball), but “not too negative”, say, when α ≥ −1/(2n).

2. In order to cover Vα by an ellipsoid of absolute constant times less volume than that of V we need α to be positive of order O(n^{−1/2}) or greater. This fits our observation that the volume of Vα itself is at least an absolute constant times less than that of V only if α ≥ O(n^{−1/2}) (Exercise 2.3.7). Thus, whenever the volume of Vα is an absolute constant times less than that of V, we can cover Vα by an ellipsoid of volume also an absolute constant times less than that of V; such a covering is already given by the Euclidean ball of radius √(1 − α^2) centered at the point αe (which, however, is not the optimal covering presented in Exercise 2.3.8).

Exercise 2.3.9 #+ Let V be the unit Euclidean ball in Rn, let e be a unit vector and let α ∈ (0, 1). Consider the “symmetric spherical stripe”

V^α = {x ∈ V | −α ≤ e^T x ≤ α}.

Prove that if 0 < α < 1/√n then V^α can be covered by an ellipsoid W with the volume

vol_n(W) ≤ α√n ( n(1 − α^2)/(n − 1) )^{(n−1)/2} < 1 = vol_n(V).

Find an explicit representation of the ellipsoid.

Hint: use the same construction as the one for Exercise 2.3.8.

We see that in order to cover a symmetric spherical stripe of the unit Euclidean ball V by an ellipsoid of volume less than that of V, it suffices to have the “half-thickness” α of the stripe less than 1/√n, which again fits our observation (Exercise 2.3.7) that basically all the volume of the unit n-dimensional Euclidean ball is concentrated in the O(1/√n) neighbourhood of its “equator”, the cross-section of the ball by a hyperplane passing through the center of the ball. A useful exercise is to realize when a non-symmetric spherical stripe

V^{α,β} = {x ∈ V | −α ≤ e^T x ≤ β}

of the (centered at the origin) unit Euclidean ball V can be covered by an ellipsoid of volume less than that of V.
The results of Exercises 2.3.8 and 2.3.9 imply a number of important geometrical consequences.

Exercise 2.3.10 + Prove the following theorem of Fritz John:

Let Q be a closed and bounded convex body in Rn. Then
(i) Q can be covered by an ellipsoid W in such a way that the concentric, n times smaller ellipsoid

W′ = (1 − 1/n) c + (1/n) W

(c is the center of W) is contained in Q. One can choose as W the extremal outer ellipsoid associated with Q.
(ii) If, in addition, Q is central-symmetric with respect to a certain point c, then the above result can be improved: Q can be covered by an ellipsoid W centered at c in such a way that the concentric, √n times smaller ellipsoid

W′ = (1 − 1/√n) c + (1/√n) W

is contained in Q.

Hint: use the results given by Exercises 2.3.8 and 2.3.9.
Note that the constants n and √n in the Fritz John Theorem are sharp; an extremal example for (i) is a simplex, and for (ii) a cube. Here are several nice geometrical consequences of the Fritz John Theorem:

Let Q be a closed and bounded convex body in R^n. Then
1. There exists a pair of concentric parallelotopes p, P, homothetic with respect to their common center with homothety coefficient n^{−3/2}, such that p ⊂ Q ⊂ P; in other words, there exists an invertible affine transformation of the space such that the image Q' of Q under this transformation satisfies the inclusions

\[
\{x \in \mathbb{R}^n \mid \|x\|_\infty \le n^{-3/2}\} \;\subset\; Q' \;\subset\; \{x \in \mathbb{R}^n \mid \|x\|_\infty \le 1\};
\]

here \|x\|_\infty = \max_{1\le i\le n}|x_i| is the uniform norm of x.
Indeed, from the Fritz John Theorem it follows that there exists an invertible affine transformation resulting in

\[
\{x \mid |x|_2 \le 1/n\} \;\subset\; Q' \;\subset\; \{x \mid |x|_2 \le 1\},
\]

Q' being the image of Q under the transformation (it suffices to transform the extremal outer ellipsoid associated with Q into the unit Euclidean ball centered at the origin). It remains to note that the smaller Euclidean ball in the above chain of inclusions contains the cube {x | \|x\|_\infty ≤ n^{−3/2}} and the larger one is contained in the unit cube.
2. If Q is central-symmetric, then the parallelotopes mentioned in 1. can be chosen to have the same center, and the homothety coefficient can be improved to 1/n; in other words, there exists an invertible affine transformation of the space which makes the image Q' of Q central-symmetric with respect to the origin and ensures the inclusions

\[
\{x \mid \|x\|_\infty \le 1/n\} \;\subset\; Q' \;\subset\; \{x \mid \|x\|_\infty \le 1\}.
\]

The statement is obtained by reasoning completely similar to that used for 1., up to the fact that now we should refer to item (ii) of the Fritz John Theorem.
3. Any norm ‖·‖ on R^n can be approximated, within factor √n, by a Euclidean norm: given ‖·‖, one can find a Euclidean norm

\[
|x|_A = (x^T A x)^{1/2},
\]

A being a symmetric positive definite n × n matrix, in such a way that

\[
\frac{1}{\sqrt{n}}\,|x|_A \;\le\; \|x\| \;\le\; |x|_A
\]

for any x ∈ R^n.
Indeed, let B = {x | ‖x‖ ≤ 1} be the unit ball with respect to the norm ‖·‖; this is a closed and bounded convex body which is central-symmetric with respect to the origin. By item (ii) of the Fritz John Theorem, there exists an ellipsoid W = {x | x^T A x ≤ n} centered at the origin (A is an n × n symmetric positive definite matrix) which contains B, while the ellipsoid {x | x^T A x ≤ 1} is contained in B; this latter inclusion means exactly that

\[
|x|_A \le 1 \;\Rightarrow\; x \in B \;\Leftrightarrow\; \|x\| \le 1,
\]

i.e., that |x|_A ≥ ‖x‖. The inclusion B ⊂ W, by similar reasons, implies that ‖x‖ ≥ n^{−1/2}|x|_A.
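As an illustration of consequence 3 (not part of the original notes): for the ℓ1-norm on R^n one may take A = nI, i.e., |x|_A = √n |x|_2; the sandwich (1/√n)|x|_A ≤ ‖x‖_1 ≤ |x|_A then reduces to the familiar inequalities ‖x‖_2 ≤ ‖x‖_1 ≤ √n‖x‖_2, which the hypothetical script below verifies on random vectors.

```python
import math
import random

def l1(x):  # the norm we want to approximate
    return sum(abs(t) for t in x)

def euclidean_A(x, n):  # |x|_A with A = n * I, i.e. sqrt(n) * ||x||_2
    return math.sqrt(n) * math.sqrt(sum(t * t for t in x))

n = 20
for _ in range(1000):
    x = [random.gauss(0.0, 1.0) for _ in range(n)]
    a = euclidean_A(x, n)
    # check (1/sqrt(n)) |x|_A <= ||x||_1 <= |x|_A
    assert a / math.sqrt(n) <= l1(x) + 1e-12 <= a + 1e-12
print("sandwich verified on 1000 random vectors")
```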

Remark 2.3.1 The third of the indicated consequences says that any norm on R^n can be approximated, within the factor √n, by an appropriately chosen Euclidean norm. It turns out that the quality of approximation can be made much better if we are satisfied with approximating the norm not on the whole space, but on a properly chosen subspace. Namely, there exists a marvelous and important theorem of Dvoretzky which is as follows: there exists a function m(n, ε) of a positive integer n and a positive real ε with the following properties: first,

\[
\lim_{n\to\infty} m(n,\varepsilon) = +\infty,
\]

and, second, whenever ‖·‖ is a norm on R^n, one can indicate an m(n, ε)-dimensional subspace E ⊂ R^n and a Euclidean norm |·|_A on R^n such that |·|_A approximates ‖·‖ on E within the factor 1 + ε:

\[
(1-\varepsilon)\,|x|_A \;\le\; \|x\| \;\le\; (1+\varepsilon)\,|x|_A, \qquad x \in E.
\]

In other words, the Euclidean norm is "marked by God": for any given integer k, an arbitrary normed linear space contains an "almost Euclidean" k-dimensional subspace, provided that the dimension of the space is large enough.

Lecture 3

Methods with linear convergence

(Multidimensional Convex Problems: Class Complexity; Cutting Plane Scheme, Center of Gravity Method, Ellipsoid Method)

3.1 Class of general convex problems: description and complexity

A convex programming problem is

\[
\text{minimize } f(x) \;\text{ subject to }\; g_i(x) \le 0,\; i = 1, ..., m,\; x \in G \subset \mathbb{R}^n. \tag{3.1.1}
\]

Here the domain G of the problem is a closed convex set in Rn with a nonempty interior, the objective f and the functional constraints gi, i =1, ..., m, are convex continuous functions on G. Let us fix a closed and bounded convex domain G ⊂ Rn and the number m of func- tional constraints, and let P = Pm(G) be the family of all feasible convex problems with m functional constraints and the domain G. Note that since the domain G is bounded and all problems from the family are feasible, all of them are solvable, due to the standard compactness reasons. In what follows we identify a problem instance from the family Pm(G) with a vector- valued function

p =(f,g1, ..., gm) comprised of the objective and the functional constraints. What we shall be interested in for a long time are the efficient methods for solving problems from the indicated very wide family. Similarly to the one-dimensional case, we assume that the methods have an access to the first order local oracle O which, given an input vector x ∈ int G, returns the values and some subgradients of the objective and the functional constraints at x, so that the oracle computes the mapping

\[
x \mapsto \mathcal{O}(p,x) = \big(f(x), f'(x);\; g_1(x), g_1'(x);\; \dots;\; g_m(x), g_m'(x)\big)\;:\; \operatorname{int} G \to \mathbb{R}^{(m+1)\times(n+1)}.
\]


The notions of a method and its complexity at a problem instance and on the whole family are introduced exactly as it was done in Section 1.2 of our first lecture^{1)}. The accuracy of a method at a problem and on the family is defined in the following way. Let us start with the vector of residuals of a point x ∈ G regarded as an approximate solution to a problem instance p:

\[
\mathrm{Residual}(p,x) = \big(f(x) - f^*,\ (g_1(x))_+,\ \dots,\ (g_m(x))_+\big),
\]

which is comprised of the inaccuracy in the objective and the violations of the functional constraints at x. In order to get a convenient scalar accuracy measure, it is reasonable to pass from this vector to the relative accuracy

\[
\varepsilon(p,x) = \max\left\{\frac{f(x)-f^*}{\max_G f - f^*},\ \frac{(g_1(x))_+}{(\max_G g_1)_+},\ \dots,\ \frac{(g_m(x))_+}{(\max_G g_m)_+}\right\};
\]

to get the relative accuracy, we normalize each of the components of the vector of residuals by its maximal, over all x ∈ G, value and take the maximum of the resulting quantities. It is clear that the relative accuracy takes its values in [0, 1] and is zero if and only if x is an optimal solution to p, as it should be for a reasonable accuracy measure.
After we have agreed how to measure the accuracy of tentative approximate solutions, we define the accuracy of a method M at a problem instance as the accuracy of the approximate solution found by the method when applied to the instance:

\[
\mathrm{Accuracy}(\mathcal{M},p) = \varepsilon(p, \bar x(p,\mathcal{M})).
\]

The accuracy of the method on the family is its worst-case accuracy at the problems of the family:

\[
\mathrm{Accuracy}(\mathcal{M}) = \sup_{p \in \mathcal{P}_m(G)} \mathrm{Accuracy}(\mathcal{M},p).
\]

Last, the complexity of the family is defined in the manner we are already acquainted with, namely, as the best complexity of a method solving all problems from the family to a given accuracy:

\[
\mathcal{A}(\varepsilon) = \min\{\mathcal{A}(\mathcal{M}) \mid \mathrm{Accuracy}(\mathcal{M}) \le \varepsilon\}.
\]

What we are about to do is to establish the following main result:

Theorem 3.1.1 The complexity A(ε) of the family P_m(G) of general-type convex problems on an n-dimensional closed and bounded convex domain G satisfies the inequalities

\[
\left\lfloor \frac{n\ln(1/\varepsilon)}{6\ln 2}\right\rfloor - 1 \;\le\; \mathcal{A}(\varepsilon) \;\le\; \left\lceil 2.181\, n \ln(1/\varepsilon)\right\rceil. \tag{3.1.2}
\]

Here the upper bound is valid for all ε < 1. The lower bound is valid for all ε < ε(G), where

\[
\varepsilon(G) \ge \frac{1}{n^3}
\]

for all G ⊂ R^n; for an ellipsoid G one has ε(G) = 1/n, and for a parallelotope G, ε(G) = 1.

^{1)} that is, a set of rules for forming the sequential search points, the moment of termination and the result as functions of the information on the problem; this information is comprised of the answers of the oracle obtained up to the moment when the rule is to be applied.

Same as in the one-dimensional case, to prove the theorem means to establish the lower complexity bound and to present a method associated with the upper complexity bound (and thus optimal in complexity, up to an absolute constant factor, for small enough ε, namely, for 0 < ε < ε(G)). We shall start with the latter task, i.e., with constructing an optimal method.

3.2 Cutting Plane scheme and Center of Gravity Method

The method we are about to present is based on a very natural extension of the bisection - the cutting plane scheme.

3.2.1 Case of problems without functional constraints To explain the cutting plane scheme, let me start with the case when there are no functional constraints at all, so that m = 0 and the family is comprised of problems

minimize f(x) s.t. x ∈ G (3.2.3)

of minimizing convex continuous objectives over a given closed and bounded convex domain G ⊂ R^n. To solve such a problem, we can use the same basic idea as in the one-dimensional bisection. Namely, choosing somehow the first search point x_1, we get from the oracle the value f(x_1) and a subgradient f'(x_1) of f; thus, we obtain the linear function

\[
f_1(x) = f(x_1) + (x - x_1)^T f'(x_1)
\]

which, due to the convexity of f, underestimates f everywhere on G and coincides with f at x_1:

\[
f_1(x) \le f(x),\ x \in G; \qquad f_1(x_1) = f(x_1).
\]

If the subgradient is zero, we are done - x1 is an optimal solution. Otherwise we can point out a proper part of G which localizes the optimal set of the problem, namely, the set

\[
G_1 = \{x \in G \mid (x - x_1)^T f'(x_1) \le 0\};
\]

indeed, outside this new localizer our linear lower bound f_1 for the objective, and therefore the objective itself, is greater than the value of the objective at x_1.

Now, our new localizer of the optimal set, i.e., G1, is, same as G, a closed and bounded convex domain, and we may iterate the process by choosing the second search point x2 inside G1 and generating the next localizer

\[
G_2 = \{x \in G_1 \mid (x - x_2)^T f'(x_2) \le 0\},
\]

and so on. We come to the following generic cutting plane scheme:

starting with the localizer G_0 ≡ G, choose x_i in the interior of the current localizer G_{i−1} and check whether f'(x_i) = 0; if it is the case, terminate, x_i being the result, otherwise define the new localizer

\[
G_i = \{x \in G_{i-1} \mid (x - x_i)^T f'(x_i) \le 0\}
\]

and loop. The approximate solution found after i steps of the routine is, by definition, the best point found so far, i.e., the point

\[
\bar x_i \in \{x_1, ..., x_i\} \ \text{ such that }\ f(\bar x_i) = \min_{1\le j\le i} f(x_j).
\]

A cutting plane method, i.e., a method associated with the scheme, is governed by the rules for choosing the sequential search points in the localizers. In the one-dimensional case there is, basically, only one natural possibility for this choice - the midpoint of the current localizer (the localizer always is a segment). This choice results exactly in the bisection and enforces the lengths of the localizers to go to 0 at the rate 2^{−i}, i being the step number.
In the multidimensional case the situation is not so simple. Of course, we would like to decrease a reasonably defined size of the localizer at the highest possible rate; the problem is, anyhow, which size to choose and how to ensure its decrease. When choosing a size, we should take care of two things:
(1) we should have a possibility to conclude that if the size of a current localizer G_i is small, then the inaccuracy of the current approximate solution also is small;
(2) we should be able to decrease at a certain rate the size of the sequential localizers by an appropriate choice of the search points in the localizers.
Let us start with a wide enough family of sizes which satisfy the first of our requirements.

Definition 3.2.1 A real-valued function Size(Q) defined on the family \mathcal{Q} of all closed and bounded convex subsets Q ⊂ R^n with a nonempty interior is called a size, if it possesses the following properties:
(Size.1) Positivity: Size(Q) > 0 for any Q ∈ \mathcal{Q};
(Size.2) Monotonicity with respect to inclusion: Size(Q) ≥ Size(Q') whenever Q' ⊂ Q, Q, Q' ∈ \mathcal{Q};
(Size.3) Homogeneity with respect to homotheties: if Q ∈ \mathcal{Q}, λ > 0, a ∈ R^n and

\[
Q' = a + \lambda(Q - a) = \{a + \lambda(x - a) \mid x \in Q\}
\]

is the image of Q under the homothety with center at the point a and coefficient λ, then Size(Q') = λ Size(Q).

Example 1. The diameter

\[
\mathrm{Diam}(Q) = \max\{|x - x'| \;\mid\; x, x' \in Q\}
\]

is a size;
Example 2. The average diameter

\[
\mathrm{AvDiam}(Q) = \big(\mathrm{Vol}_n(Q)\big)^{1/n},
\]

Vol_n being the usual n-dimensional volume, is a size.
For the moment these examples are sufficient for us. Let us prove that any size of the indicated type satisfies requirement (1), i.e., if the size of a localizer is small, then the problem is "almost" solved.

Lemma 3.2.1 Let Size(·) be a size. Assume that we are solving a convex problem (3.2.3) by a cutting plane method, and let G_i and x̄_i be the localizers and approximate solutions generated by the method. Then we have, for all i ≥ 1,

\[
\varepsilon(p, \bar x_i) \;\le\; \varepsilon_i \equiv \frac{\mathrm{Size}(G_i)}{\mathrm{Size}(G)}, \tag{3.2.4}
\]

p denoting the problem in question.

Proof looks completely similar to the one used for the bisection method. Indeed, let us fix i ≥ 1. If ε_i ≥ 1, then (3.2.4) is evident - recall that the relative accuracy always is ≤ 1.
Now assume that ε_i < 1 for our i. Let us choose α ∈ (ε_i, 1], let x^* be a minimizer of our objective f over G and let

\[
G^\alpha = x^* + \alpha(G - x^*).
\]

Then

\[
\mathrm{Size}(G^\alpha) = \alpha\,\mathrm{Size}(G) > \varepsilon_i\,\mathrm{Size}(G) = \mathrm{Size}(G_i)
\]

(we have used the homogeneity of Size(·) with respect to homotheties). Thus, the size of G^α is greater than that of G_i, and therefore, due to the monotonicity of Size, G^α cannot be a subset of G_i. In other words, there exists a point

\[
y \in G^\alpha \setminus G_i.
\]

Since G^α clearly is contained in the domain of the problem and y does not belong to the i-th localizer G_i, we have

\[
f(y) > f(\bar x_i);
\]

indeed, at each step j, j ≤ i, of the method we remove from the previous localizer (which initially is the whole domain G of the problem) only those points where the objective is

greater than at the current search point x_j and is therefore greater than at the best point x̄_i found during the first i steps; since y was removed at one of these steps, we conclude that f(y) > f(x̄_i), as claimed.
On the other hand, y ∈ G^α, so that

\[
y = (1-\alpha)x^* + \alpha z
\]

with some z ∈ G. From the convexity of f it follows that

\[
f(y) \le (1-\alpha)f(x^*) + \alpha f(z) = (1-\alpha)\min_G f + \alpha \max_G f,
\]

whence

\[
f(y) - \min_G f \le \alpha\big(\max_G f - \min_G f\big).
\]

As we know, f(y) >f(¯xi), and we come to

\[
f(\bar x_i) - \min_G f < \alpha\big(\max_G f - \min_G f\big),
\]

or, which is exactly the same, to ε(p, x̄_i) < α.

Since α is an arbitrary real >εi, we conclude that (3.2.4)holds.

Thus, we realize now what could be the sizes we are interested in, and the problem is how to ensure certain rate of their decreasing along the sequence of localizers generated by a cutting plane method. The difficulty here is that when choosing the next search point in the current localizer, we do not know what will be the next cutting plane; the only thing we know is that it will pass through the search point. Thus, we are interested in the choice of the search point which guarantees certain reasonable, not too close to 1, ratio of the size of the new localizer to that one of the previous localizer independently of what will be the cutting plane. Whether such a choice of the search point is possible, it depends on the size we are using. For example, the diameter of a localizer, which is a very natural measure of it and which was successively used in the one-dimensional case, would be a very bad choice in the multidimensional case. To realize this, imagine that we are minimizing over the unit square on the two-dimensional plane, and our objective in fact depends on the first coordinate only. All our cutting planes (in our example they are lines) will be parallel to the second coordinate axis, and the localizers will be stripes of certain horizontal size (which we may enforce to tend to 0) and of some fixed vertical size (equal to 1). The diameters of the localizers here although decrease but do not tend to zero. Thus, the first of the particular sizes we have looked at does not fit the second requirement. In contrast to this, the second particular size, the average diameter AvDiam, is quite appropriate, due to the following geometric fact which we present without proof: Proposition 3.2.1 (Grunbaum) Let Q be a closed and bounded convex domain in Rn,let  1 x∗(G)= xdx Voln(G) G 3.2. CUTTING PLANE SCHEME AND CENTER OF GRAVITY METHOD 43

be the center of gravity of Q, and let Γ be an affine hyperplane passing through the center of gravity. Then the volumes of the parts Q', Q'' into which Q is partitioned by Γ satisfy the inequality

\[
\mathrm{Vol}_n(Q'),\ \mathrm{Vol}_n(Q'') \;\le\; \left\{1 - \left(\frac{n}{n+1}\right)^n\right\}\mathrm{Vol}_n(Q) \;\le\; \exp\{-\kappa\}\,\mathrm{Vol}_n(Q),
\]

\[
\kappa = \ln\frac{1}{1 - 1/\mathrm{e}} = 0.45867...;
\]

in other words,

\[
\mathrm{AvDiam}(Q'),\ \mathrm{AvDiam}(Q'') \;\le\; \exp\left\{-\frac{\kappa}{n}\right\}\mathrm{AvDiam}(Q). \tag{3.2.5}
\]

Note that the proposition states exactly that the smallest (in terms of the volume) fraction you can cut off a n-dimensional convex body by a hyperplane passing through the center of gravity of the body is the fraction you get when the body is a simplex, the plane passes parallel to a facet of the simplex and you cut off the part not containing the facet.

Corollary 3.2.1 Consider the Center of Gravity method, i.e., the cutting plane method with the search points being the centers of gravity of the corresponding localizers:

\[
x_i = x^*(G_{i-1}) \equiv \frac{1}{\mathrm{Vol}_n(G_{i-1})}\int_{G_{i-1}} x\,dx.
\]

For the method in question one has

\[
\mathrm{AvDiam}(G_i) \;\le\; \exp\left\{-\frac{\kappa}{n}\right\}\mathrm{AvDiam}(G_{i-1}), \qquad i \ge 1;
\]

consequently (see Lemma 3.2.1) the relative accuracy of the i-th approximate solution generated by the method, as applied to any problem p of minimizing a convex objective over G, satisfies the inequality

\[
\varepsilon(p,\bar x_i) \;\le\; \exp\left\{-\frac{\kappa}{n}\, i\right\}, \qquad i \ge 1.
\]

In particular, to solve the problem within relative accuracy ε ∈ (0, 1) it suffices to perform no more than

\[
N = \left\lceil \frac{1}{\kappa}\, n \ln\frac{1}{\varepsilon} \right\rceil \;\le\; \left\lceil 2.181\, n \ln\frac{1}{\varepsilon}\right\rceil \tag{3.2.6}
\]

steps of the method.
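To get a feel for (3.2.6), the following small script (an illustration, not from the notes) tabulates the guaranteed accuracy exp{−κi/n} and the iteration bound N for a few dimensions.

```python
import math

KAPPA = math.log(1.0 / (1.0 - 1.0 / math.e))  # = 0.45867...

def cg_iterations(n: int, eps: float) -> int:
    """Upper bound (3.2.6) on the number of Center of Gravity steps."""
    return math.ceil(n * math.log(1.0 / eps) / KAPPA)

def cg_accuracy(n: int, i: int) -> float:
    """Guaranteed relative accuracy after i steps."""
    return math.exp(-KAPPA * i / n)

for n in (2, 10, 100):
    print(n, cg_iterations(n, 1e-6), cg_accuracy(n, 10 * n))
```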

Remark 3.2.1 The Center of Gravity method for convex problems without functional constraints was invented in 1965 independently by A. Yu. Levin in the USSR and D. Newman in the USA.

3.3 The general case: problems with functional constraints

The Center of Gravity method, as it was presented, results in the upper complexity bound stated in our Main Theorem, but only for problems without functional constraints. In order to establish the upper bound in full generality, we should modify the cutting plane scheme in a manner which enables us to deal with these constraints. The very first idea is to act as follows: after current search point is chosen and we have received the values and subgradients of the objective and the constraints at the point, let us check whether the point is feasible; if it is the case, then let us use the subgradient of the objective in order to cut off the part of the current localizer where the objective is greater that at the current search point, same as it was done in the case when there were no functional constraints. Now, if there is a functional constraint violated at the point, we can use its subgradient to cut off points where the constraint is for sure greater than it is at the search point; the removed points for sure are not feasible: This straightforward approach cannot be used as it is, since the feasible set may have empty interior, and in this case our process, normally, will never find a feasible search point and, consequently, will never look at the objective. In this case the localizers will shrink to the feasible set of the problem, and this is fine, but if the set is not a singleton, we would not have any idea how to extract from our sequence of search points a point where the constraints are ”almost satisfied” and at the same time the objective is close to the optimal value - recall that we simply did not look at the objective when solving the problem! There is, anyhow, a simple way to overcome the difficulty - we should use for the cut the subgradient of the objective at the steps when the constraints are ”almost satisfied” at the current search point, not only at the steps when the point is feasible. Let us consider the method, proposed by Nemirovski and Yudin in 1976:

Cutting plane scheme for problems with functional constraints: Given in advance the desired relative accuracy ε ∈ (0, 1), generate, starting with G0 = G, the sequence of localizers Gi, as follows: given Gi−1 (which is a closed and bounded convex subset of G with a nonempty interior), act as follows: 1) choose i-th search point

xi ∈ int Gi−1

and ask the oracle about the values

f(xi),g1(xi), ..., gm(xi)

and subgradients    f (xi),g1(xi), ..., gm(xi)

of the objective and the constraints at x_i.

2) form the affine lower bounds

\[
g_j^{(i)}(x) = g_j(x_i) + (x - x_i)^T g_j'(x_i)
\]

for the functional constraints (these actually are lower bounds, since the constraints are convex) and compute the quantities

\[
g_j^{*,i} = \left[\max_G g_j^{(i)}(x)\right]_+, \qquad j = 1, ..., m.
\]

3) If, for all j = 1, ..., m,

\[
g_j(x_i) \le \varepsilon\, g_j^{*,i}, \tag{3.3.7}
\]

claim that i is a productive step and go to 4.a), otherwise go to 4.b).

4.a) [productive step] If f'(x_i) ≠ 0, define the new localizer G_i as

\[
G_i = \{x \in G_{i-1} \mid (x - x_i)^T f'(x_i) \le 0\}
\]

and go to 5); otherwise terminate, x_i being the result formed by the method.

4.b) [non-productive step] Choose j such that

\[
g_j(x_i) > \varepsilon\, g_j^{*,i},
\]

define the new localizer G_i as

\[
G_i = \{x \in G_{i-1} \mid (x - x_i)^T g_j'(x_i) \le 0\}
\]

and go to 5).

5) Define the i-th approximate solution as the best (with the smallest value of the objective) of the search points x_j corresponding to the productive steps j ≤ i. Replace i by i + 1 and go to 1).
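The decision logic of steps 2)-4) is easy to express in code. The sketch below (illustrative only; the function names and the box-domain assumption are mine) computes the quantities g_j^{*,i} for a box domain, decides whether the step is productive, and returns the vector e_i defining the cut.

```python
import numpy as np

def max_affine_over_box(val, grad, x, lo, hi):
    """max over the box [lo, hi] of the affine function val + grad^T (y - x)."""
    # the maximum is attained at a vertex: pick hi_k when grad_k > 0, else lo_k
    y = np.where(grad > 0, hi, lo)
    return val + grad @ (y - x)

def choose_cut(x, f_grad, g_vals, g_grads, lo, hi, eps):
    """One pass of steps 2)-4): return ('productive'/'non-productive', e_i)."""
    for gj, gj_grad in zip(g_vals, g_grads):
        g_star = max(max_affine_over_box(gj, gj_grad, x, lo, hi), 0.0)  # g_j^{*,i}
        if gj > eps * g_star:                 # constraint j is badly violated
            return "non-productive", gj_grad  # cut with the constraint subgradient
    return "productive", f_grad               # cut with the objective subgradient
```

The termination test f'(x_i) = 0 of step 4.a) and the bookkeeping of step 5) are left to the outer loop.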

Note that the approximate solution x̄_i is defined only if a productive step has already been performed.
The rate of convergence of the above method is given by the following analogue of Lemma 3.2.1:

Proposition 3.3.1 Let Size be a size, and let a cutting plane method be applied to a convex problem p with m functional constraints. Assume that for a given N the method either terminates in the course of the first N steps, or this is not the case and the relation

\[
\frac{\mathrm{Size}(G_N)}{\mathrm{Size}(G)} < \varepsilon \tag{3.3.8}
\]

is satisfied. In the first case the result x̄ found by the method is a solution to the problem of relative accuracy ≤ ε; in the second case the N-th approximate solution is well-defined and solves the problem within the relative accuracy ε.

Proof. Let us first note that for any i and j one has

\[
g_j^{*,i} \;\le\; \left[\max_G g_j\right]_+; \tag{3.3.9}
\]

this is an immediate consequence of the fact that g_j^{(i)}(x) is a lower bound for g_j(x) (an immediate consequence of the convexity of g_j), so that the maximum of this lower bound over x ∈ G, i.e., g_j^{*,i}, is at most the similar quantity for the constraint g_j itself.
Now assume that the method terminates at a certain step i ≤ N. According to the description of the method, it means that i is a productive step and 0 is a subgradient of the objective at x_i; the latter means that x_i is a minimizer of f over the whole of G, so that

∗ f(xi) ≤ f .

Besides this, i is a productive step, so that   ≤ ∗,i ≤ gj(xi) εgj ε max gj ,j=1, ..., m G +

(we have used (3.3.9)); these inequalities, combined with the definition of the relative accu- racy, state exactly that xi (i.e., the result obtained by the method in the case in question) solves the problem within the relative accuracy ε, as claimed. Now assume that the method does not terminate in course of the first N steps. In view of our premise, here we have

Size(GN ) <εSize(G). (3.3.10) Let x∗ be an optimal solution to the problem, and let

Gε = x∗ + ε(G − x∗).

Gε is a closed and bounded convex subset of G with a nonempty interior; due to homogeneity of Size with respect to homotheties, we have

ε Size(G )=ε Size(G) > Size(GN )

(the second inequality here is (3.3.10)). From this inequality and the monotonicity of the ε size it follows that G cannot be a subset of GN :

ε ∃y ∈ G \GN .

Now, y is a point of G (since the whole Gε is contained in G), and since it does not belong to GN ,itwascutoffatsomestepofthemethod,i.e.,thereisani ≤ N such that

T − ei (y xi) > 0, (3.3.11) ei being the linear functional defining i-th cut. 3.3. THE GENERAL CASE: PROBLEMS WITH FUNCTIONAL CONSTRAINTS 47

Note also that since y ∈ Gε, we have a representation

y =(1− ε)x∗ + εz (3.3.12) with certain z ∈ G. Let us prove that in fact i-th step is productive. Indeed, assume it is not the case. Then

 ei = gj(xi) (3.3.13)

(i) ∗,i for some j such that gj (xi) >εgj . From this latter inequality combined with (3.3.13)and (3.3.11) it follows that (i) (i) ∗,i gj (y) >gj(xi)=gj (xi) >εgj . (3.3.14) On the other hand, we have (i) − (i) ∗ (i) ≤ ∗,i gj (y)=(1 ε)gj (x )+εgj (z) εgj

(i) · (we have taken into account that gj is a lower bound for gj( ) and therefore this bound is nonpositive at the optimal solution to the problem); the resulting inequality contradicts (3.3.14), and thus the step i indeed is productive.  Now, since i is a productive step, we have ei = f (xi), and (3.3.11) implies therefore that

f(y) >f(xi); from this latter inequality and (3.3.12), exactly as in the case of problems with no functional constraints, it follows that

∗ ∗ ∗ f(xi) − f

∗ ∗ ∗ f(¯xN ) − f ≤ f(xi) − f ≤ ε(max f − f ); (3.3.16) G

 sincex ¯N is, by construction, the search point generated at certain productive step i ,we have also   (i ) ∗,i ≤ ≤ gj(¯xN )=gj (xi ) εgj ε max gj ,j=1, ..., m; G + these inequalities combined with (3.3.16)resultsin

ε(p, x¯N ) ≤ ε, as claimed. 48 LECTURE 3. METHODS WITH LINEAR CONVERGENCE

Combining Proposition 3.3.1 and the Grunbaum Theorem, we come to the Center of Gravity method for problems with functional constraints. The method is obtained from our general cutting plane scheme for constrained problems by the following specifications: first, we use, as the current search point, the center of gravity of the previous localizer:

\[
x_i = \frac{1}{\mathrm{Vol}_n(G_{i-1})}\int_{G_{i-1}} x\, dx;
\]

second, we terminate the method after the N-th step, N being given by the relation

\[
N = \left\lceil 2.181\, n \ln\frac{1}{\varepsilon}\right\rceil.
\]

With these specifications the average diameter of the i-th localizer, due to the Grunbaum Theorem, decreases with i at least as

\[
\exp\left\{-\frac{\kappa}{n}\, i\right\}\mathrm{AvDiam}(G), \qquad \kappa = 0.45867...,
\]

and since κ^{-1} < 2.181, we come to

\[
\mathrm{AvDiam}(G_N) < \varepsilon\, \mathrm{AvDiam}(G);
\]

this latter inequality, in view of Proposition 3.3.1, implies that the method does find an ε-solution to every problem from the family, thus justifying the upper complexity bound we are proving.

3.4 Lower complexity bound

To complete the proof of Theorem 3.1.1, it remains to establish the lower complexity bound. This is done, basically, in the same way as in the one-dimensional case. In order to avoid things difficult for verbal presentation, we show here a slightly worse lower bound

\[
\mathcal{A}(\varepsilon) \;\ge\; O(1)\, \frac{n \ln\frac{1}{\varepsilon}}{\ln\left(n \ln\frac{1}{\varepsilon}\right)}, \qquad 0 < \varepsilon < \varepsilon^*(G), \tag{3.4.17}
\]

O(1) being a positive absolute constant and ε*(G) a certain positive quantity depending on the geometry of G only (we shall see that this quantity measures how much G differs from a parallelotope)^{2)}. The "spoiled" bound (3.4.17) (which is worse, by the logarithmic denominator, than the estimate announced in the Theorem) is a more or less immediate consequence of our one-dimensional considerations. Of course, it is sufficient to establish the lower bound

^{2)} For the "exact" lower bound refer to A. S. Nemirovskij and D. B. Yudin, Problem Complexity and Method Efficiency in Optimization, A Wiley-Interscience Publication, Chichester etc.: John Wiley & Sons, 1983.

for the case of problems without functional constraints, since the constrained ones form a wider family (indeed, a problem without functional constraints can be thought of as a problem with a given number m of trivial, identically zero functional constraints). Thus, in what follows the number of constraints m is set to 0. Let us start with the following simple observation. Let, for a given ε>0anda convex objective f,thesetGε(f) be comprised of all approximate solutions to f of relative accuracy not worse than ε:

Gε(f)={x ∈ G | f(x) − min f ≤ ε(max f − min f)}. G G G

Assume that, for a given ε>0, we are able to point out a finite set F of objectives with the following two properties: (I) no different problems from F admit a common ε-solution:

Gε(f) ∩ Gε(f¯)=∅

whenever f, f̄ ∈ F and f ≠ f̄;
(II) given in advance that the problem in question belongs to F, one can compress an answer of the first-order local oracle to a (log_2 K)-bit word. This means the following: for a certain positive integer K one can indicate a function I(f, x) taking values in a K-element set and a function R(i, x) such that

O(f,x)=R(I(f,x),x),f∈F,x∈ int G.

In other words, given in advance that the problem we are interested in belongs to F, a method can imitate the first-order oracle O via another oracle I which returns log_2 K bits of information rather than the infinitely many bits contained in the answer of the first-order oracle; given the "compressed" answer I(f, x), a method can substitute this answer, along with x itself, into a universal (defined by F only) function in order to get the "complete" first-order information on the problem.

E.g., consider the family F_n comprised of the 2^n convex functions

\[
f(x) = \max_{i=1,...,n} \epsilon_i x_i,
\]

where all ε_i are ±1. At every point x a function from the family admits a subgradient of the form I(f, x) = ±e_i (e_i are the orths of the axes), with i, same as the sign at e_i, depending on f and x. Assume that the first-order oracle in question, when asked about f ∈ F_n, reports a subgradient of exactly this form. Since all functions from the family are homogeneous, given x and I(f, x) we know not only a subgradient of f at x, but also the value of f at the point:

\[
f(x) = x^T I(f,x).
\]
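For concreteness, here is how such a compressed oracle could look in code (an illustration with names of my own choosing): the compressed answer is just the index of a maximal component together with its sign, and both f(x) and a subgradient are recovered from it.

```python
import numpy as np

def compressed_answer(signs, x):
    """I(f, x) for f(x) = max_i signs[i] * x[i]: index and sign of an active term."""
    i = int(np.argmax(signs * x))
    return i, int(signs[i])

def decompress(i, sign, x):
    """R(I(f,x), x): recover f(x) and a subgradient from the compressed answer."""
    e_i = np.zeros_like(x)
    e_i[i] = sign                 # the subgradient is sign * e_i
    f_val = float(x @ e_i)        # f(x) = x^T I(f, x), by homogeneity
    return f_val, e_i

signs = np.array([1.0, -1.0, 1.0])        # one of the 2^n members of F_n
x = np.array([0.2, -0.7, 0.1])
print(decompress(*compressed_answer(signs, x), x))
```

The compressed answer takes only 2n distinct values, which is exactly the log_2(2n)-bit compression discussed next.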

Thus, our particular first-order oracle, as restricted onto F_n, can be compressed to log_2(2n) bits. Now let us make the following observation:

(*): under assumptions (I) and (II) the ε-complexity of the family F, and therefore of every larger family, is at least

\[
\frac{\log_2|\mathcal{F}|}{\log_2 K}.
\]

Indeed, let M be a method which solves all problems from F within accuracy ε in no more than N steps. We may assume (since informationally this is the same) that the method uses the oracle I rather than the first-order oracle. Now, the behavior of the method is uniquely defined by the sequence of answers of I in the course of N steps; therefore there are at most K^N different sequences of answers and, consequently, no more than K^N different trajectories of M. In particular, the set X̄ formed by the results produced by M as applied to problems from F is comprised of at most K^N points. On the other hand, since M solves every one of the |F| problems of the family within accuracy ε, and no two different problems from the family admit a common ε-solution, X̄ should contain at least |F| points. Thus,

K^N ≥ |F|,

as claimed.
As an immediate consequence of what was said, we come to the following result: the complexity of minimizing a convex function over an n-dimensional parallelotope G within relative accuracy ε < 1/2 is at least n/(1 + log_2 n).
Indeed, all our problem classes and complexity-related notions are affine invariant, so that we may always assume the parallelotope G mentioned in the assertion to be the unit cube

\[
\{x \in \mathbb{R}^n \mid |x|_\infty \equiv \max_i |x_i| \le 1\}.
\]

For any ε < 1/2 the aforementioned family

\[
\mathcal{F}_n = \{f(x) = \max_i \epsilon_i x_i\}
\]

clearly possesses property (I) and, as we have seen, at least for a certain first-order oracle, possesses also property (II) with K = 2n. We immediately conclude that the complexity of finding an ε-minimizer, ε < 1/2, of a convex function over an n-dimensional parallelotope is, at least for some first-order oracle, no less than

\[
\frac{\log_2|\mathcal{F}_n|}{\log_2(2n)},
\]

as claimed. In fact, of course, the complexity is at least n for any first-order oracle, but proving the latter statement requires more detailed considerations.
Now let us use the above scheme to derive the lower bound (3.4.17). Recall that when studying the one-dimensional case, we introduced a certain family of univariate convex functions which was as follows. The functions of the family form a tree, with the root ("generation 0") being the function

\[
f^{\mathrm{root}}(x) = |x|;
\]

when subject to the left and to the right modifications, the function produces two "children", let them be called f_r and f_l; each of these functions, in turn, may be subject to the right and to the left modification, producing two new functions, so that at the level of "grandchildren" there are four functions f_rr, f_rl, f_lr, f_ll, and so on. Now, every one of the functions f_· of a "generation" k > 0 possesses its own active segment δ(f) of length 2^{1−2k}, and on this segment the function is modulus-like:

\[
f_\cdot(x) = a^{(k)} + 8^{-k}\,|x - c(f)|,
\]

(k) c(f) being the midpoint of δ(f). Note that a depends only on the generation f· belongs to, not on the particular representative of the generation; note also that the active segments of the 2k functions belonging to the generation k are mutually disjoint and that a function from our ”population” coincides with its ”parent” outside the active segment of the parent. In what follows it is convenient also to define the active segment of the root function f root as the whole axis. k Now, let Fk be the set of 2 functions comprising k-th generation in our population. Let us demonstrate that any first order oracle, restricted onto this family of functions, admits compression to log2(2k) bits. Indeed, it is clear from our construction that in order to restore, given an x, f(x) and a subgradient f (x), it suffices to trace the path of predecessors of f - its father, its grandfather, ... - and to find the ”youngest” of them, let it be f¯, such that x belongs to the active segment of f¯ (let us call this predecessor the active at x predecessor of f). The active at x predecessor of f does exist, since the active segment of the ”common predecessor” f root is the whole axis. Now, f is obtained from f¯ by a number of modifications; the first of them possibly varies f¯ in a neighborhood of x (x is in the active segment of f¯), but the subsequent modifications do not, since x is outside the corresponding active segments. Thus, in a neighborhood of xfcoincides with the function f˜ - the modification of f¯ which leads from f¯ to f. Now, to identify the local behavior of f˜ (i.e., that one of f)atx, it suffices to indicate the ”age” of f˜, i.e., the number of the generation it belongs to, and the type of the modification - left or right - which transforms f¯ into f˜. Indeed, given x and the age k¯ of f˜, we may uniquely identify the active segment of f¯ (since the segments for different members of the same generation k¯ − 1haveno common points); given the age of f¯, its active segment and the type of modification leading from f¯ to f˜, we, of course, know f˜ in a neighborhood of the active segment of f¯ and consequently at a neighborhood of x. Thus, to identify the behavior of f at x and therefore to imitate the answer of any given local oracle on the input x, it suffices to know the age k¯ of the active at x predecessor of f and the type - left or right - of modification which ”moves” the predecessor towards f, i.e., to know a point from certain (2k)-element set, as claimed. Now let us act as follows. Let us start with the case when our domain G is a parallelotope; due to affine invariance of our considerations, we may assume G to be the unit n-dimensional cube:

n G = {x ∈ R ||x|∞ ≤ 1}.

Given a positive integer k, consider the family F_k comprised of the objectives

\[
f_{i_1,...,i_n}(x) = \max\{f_{i_1}(x_1), ..., f_{i_n}(x_n)\}, \qquad f_{i_s} \in \mathcal{F}_k,\; s = 1, ..., n.
\]

This family contains 2^{nk} objectives, all of them clearly being convex and Lipschitz continuous with constant 1 with respect to the uniform norm |·|_∞. Let us demonstrate that there exists a first-order oracle such that the family, equipped with this oracle, possesses properties (I) and (II), where one should set

\[
\varepsilon = 2^{-6k}, \qquad K = 2nk. \tag{3.4.18}
\]

(k) Indeed, a function fi1,...,in attains its minimum a exactly at the point xi1,...,in with

the coordinates comprised of the minimizers of fis (xs). It is clear that within the cube { || − | ≤ × −3k} Cı1,...,in = x x xi1,...,in ∞ 2 2

(i.e., within the direct product of the active segments of fis , s =1, ..., n) the function is simply (k) −3k| − | a +2 x xi1,...,in ∞, therefore outside this cube one has

− −6k fi1,...,in(x) min fi1,...,in > 2 = ε. G

Taking into account that all our functions fi1,...,in , being restricted onto the unit cube G, take their values in [0, 1], so that for these functions absolute inaccuracy in terms of the objective is majorated by the relative accuracy, we come to

⊂ Gε(fi1,...,in) Ci1,...,in.

It remains to note that the cubes C· corresponding to various functions from the family are mutually disjoint (since the active segments of different elements of the generation Fk are disjoint). Thus, (I) is verified. In order to establish (II), let us note that to find the value and a subgradient of

fi1,...,in at a point x it suffices to know the value and a subgradient at xis of any function

fis which is ”active” at x, i.e., is ≥ all other functions participating in the expression

for fi1,...,in. In turn, as we know, to indicate the value and a subgradient of fis it suffices to report a point from a (2k)-element set. Thus, one can imitate certain (not any)first Fk order oracle for the family via a ”compressed” oracle reporting log2(2nk)-bit word (it suffices to indicate the number s,1≤ s ≤ n of a component fis active at x and a

point of a (2k)-element set to identify fis at xis ). Thus, we may imitate certain first order oracle for the family Fk (comprised of 2kn functions), given a ”compressed” oracle with K =2nk;itfollowsfrom(*)thatthe ε-complexity of F for ε =2−6k (see (3.4.18)) is at least

log (2nk) 2 ; log2(2nk) expressing k in terms of ε,wecometo

nlog ( 1 ) A(ε) ≥  2 ε ,ε=2−6k,k =1, 2, ...; 2 6log2 nlog2( ε ) 3.4. LOWER COMPLEXITY BOUND 53

taking into account that A(ε) is a non-increasing function, we come to an estimate of ∗ 1 thetype(3.4.17)withε<ε (G)= 64 (recall that we were dealing with the unit cube G). Let me stress what was in fact was proved. We have demonstrated that it is possible to equip the family of all convex continuous functions on the unit cube with a first order oracle in such a way that the complexity of the family would satisfy lower 1 bound (3.4.17), provided that ε< 64 . Now, what happens in the case when G is not a parallelotope? In this case we can find a pair of homothetic parallelotopes, p and P , such that p ⊂ G ⊂ P .Letuschoose among these pairs p, P that one with the smallest possible similarity ratio, let it be called α(G). Those who have solved problems given at the previous lecture know, and other should believe that α(G) ≤ n3/2 for any closed and bounded convex body G ⊂ Rn. After appropriate affine transforma- tion of the space (the transformation does not influence the complexity - once again, all our complexity-related notions are affine invariant) we may assume that p is the unit cube. Thus, we come to the situation as follows: n n Cn ≡{x ∈ R ||x|∞ ≤ 1}⊂G ⊂{x ∈ R ||x|∞ ≤ α(G)}. (3.4.19) Now let us look at the family of problems minimize f(x) s.t. x ∈ G associated with f ∈Fk. It is easily seen that all the objectives from the family attain their global (i.e., over the whole space) minima within the unit cube and the sets of approximate solutions of absolute accuracy 2−6k to problems of the family are mutually disjoint (these sets are exactly the small cubes already known to us). Therefore our previous reasoning states that it is possible to equip the family with a first order oracle in such a way that the worst-case, with respect to the family Fk, complexity of finding an approximate solution of absolute accuracy 2−5k is at least n log2 k . On the other log2(2nk) hand, the Lipschitz constant of every function from the family Fk taken with respect to the uniform norm is, as we know, at most 1, so that the variation max f − min f G G of such a function on the domain G is at most the diameter of G with respect to the uniform norm; the latter diameter, due to (3.4.19), is at most 2α(G). It follows that any method which solves all problems from Fk within relative accuracy 2−6k−1/α(G) solves all these problems within absolute accuracy 2−6k as well; thus, the complexity of minimizing convex function over G within relative accuracy 2−6k−1/α(G)isatleast n log2 k : log2(2nk) nlog k A(2−6k−1/α(G)) ≥ 2 ,k=1, 2, ... log2(nk) This lower bound immediately implies that   1 nlog2 α(G)ε 1 A(ε) ≥ O(1)  ,α(G)ε< , 1 128 log2 nlog2( α(G)ε ) 54 LECTURE 3. METHODS WITH LINEAR CONVERGENCE

whence, in turn, n ln(1/ε) 1 1 A(ε) ≥ O(1) ,ε≤ ε∗(G) ≡ (≥ ); ln (n ln(1/ε)) 128α2(G) 128n3 this is exactly what is required in (3.4.17). Note that our reasoning results in a lower bound which is worse than that one indicated in the Theorem not only by the logarithmic denominator, but also due to the fact that this is a lower bound for a particular first order oracle, not for an arbitrary one. In fact both these shortcomings, i.e. the presence of the denominator and the ”oracle- dependent” type of the lower bound, may be overcome by more careful reasoning, but we are not going to reproduce it here.

3.5 The Ellipsoid method

We have presented the Center of Gravity method which is optimal in complexity, up to an absolute constant factor, at least when the required accuracy is small enough. The rate of convergence of the method, which is the best possible theoretically, looks fine also from the practical viewpoint. The upper complexity bound O(1)n ln(1/ε) associated with the method means that in order to improve inaccuracy by an absolute constant factor (say, reduce it by factor 10) it suffices to perform O(1)n iterations more, which looks not too bad, taking into account how wide is the family of problems the method can be applied to. Thus, one may ask: what were the computational consequences of the method invented as early as in 1965? The answer is: no consequences at all, since the method cannot be used in practice, provided that the dimension of the problem is, say > 4. The reason is that the auxiliary problems arising at the steps of the method, I mean those of finding centers of gravity, are computationally extremely hard; the source of the difficulty is that the localizers arising in the method may be almost arbitrary solids; indeed, all we can say is that if the domain G of the problem is a polytope given by k0 linear inequalities, then the localizer Gi formed after i steps of the method is a polytope given by k0 + i linear inequalities; and all known deterministic methods for computing the center of gravity of a polytope in Rn given by k>2n linear inequalities take an exponential in n time to find the center. In our definition of complexity we ignore the computational effort required to implement the search rules; but in practice this effort, of course, should be taken into account, this is why the Center of Gravity method, for which this latter effort is tremendous, cannot be used at all. The situation, nevertheless, is not as bad as one could think: to the moment we have not exploit all abilities of the cutting plane scheme. Let us note that the cutting plane scheme can be ”spoiled” as follows. Given previous ¯ localizer Gi−1, we apply our basic scheme to produce a new localizer, Gi, but now this is something intermediate, not the localizer we are forwarding to the next step (this is why we denote by G¯i the solid which in the basic scheme was designated Gi); the localizer we do use at the next step is certain larger solid Gi ⊃ G¯i. Thus, at a step of the modified cutting plane scheme we perform a cut, exactly as in the basic scheme, and enlarge the resulting localizer to obtain Gi. 3.5. THE ELLIPSOID METHOD 55

At this point one could ask: what for should we add to an actual localizer something which for sure does not contain optimal solutions? The answer is: acting in this manner, we may stabilize geometry of our localizers and enforce them to be convenient for numerical implementation of the search rules. This is the idea underlying the Ellipsoid method we are about to present.

3.5.1 Ellipsoids Recall that an ellipsoid in Rn is defined as a level set of a nondegenerate convex quadratic form, i.e., as a set of the type

W = {x ∈ Rn | (x − c)T A(x − c) ≤ 1}, (3.5.20) where A is an n × n symmetric positive definite matrix and c ∈ Rn is the center of the ellipsoid. An equivalent definition, which will be more convenient for us, says that an ellipsoid is the image of the unit Euclidean ball under a one-to-one affine transformation:

W = W (B,c)={x = Bu + c | uT u ≤ 1}, (3.5.21) where B is an n × n nonsingular matrix. It is immediately seen that one can pass from representation (3.5.21)to(3.5.20) by setting

A =(BT )−1B−1; (3.5.22) since any symmetric positive definite matrix A admits a representation of the type (3.5.22) (e.g., with B = A−1/2), the above definitions indeed are equivalent. From (3.5.21) it follows immediately that

Voln(W (B,c)) = | Det B|Voln(V ), (3.5.23) where V denotes the unit Euclidean ball in Rn. Now, by compactness reasons it is easily seen that for any n-dimensional solid Q there exist ellipsoids containing Q and among these ellipsoids there is at least one with the smallest volume; in fact this extremal outer ellipsoid of Q is uniquely defined, but we are not interested in the uniqueness issues. The average diameter of this extremal outer ellipsoid is certain function of Q, let it be denoted by EllOut(Q) and called the outer ellipsoidal size:

\[
\mathrm{EllOut}(Q) = \big(\min\{\mathrm{Vol}_n(W) \mid W \text{ is an ellipsoid containing } Q\}\big)^{1/n}.
\]
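By (3.5.23), computing the volume of W(B, c) requires only |det B| and the volume of the unit ball; a small helper (illustrative, using the standard formula for the unit-ball volume, which is not stated in the notes) could look as follows.

```python
import math
import numpy as np

def unit_ball_volume(n: int) -> float:
    """Volume of the unit Euclidean ball in R^n: pi^(n/2) / Gamma(n/2 + 1)."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)

def ellipsoid_volume(B: np.ndarray) -> float:
    """Vol_n(W(B, c)) = |det B| * Vol_n(V), by (3.5.23)."""
    n = B.shape[0]
    return abs(np.linalg.det(B)) * unit_ball_volume(n)

# example: the ellipse {x : x_1^2/4 + x_2^2 <= 1} is W(B, 0) with B = diag(2, 1)
print(ellipsoid_volume(np.diag([2.0, 1.0])))   # 2 * pi
```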

It is immediately seen that the introduced function is a size, i.e., it is positive, monotone with respect to inclusions and homogeneous with respect to similarity transformations with homogeneity degree 1. We need the following simple lemma.

Lemma 3.5.1 Let n > 1, let

\[
W = \{x = Bu + c \mid u^Tu \le 1\}
\]

be an ellipsoid in R^n, and let

\[
\bar W = \{x \in W \mid (x - c)^T q \le 0\}
\]

be a "half-ellipsoid" - the intersection of W and a half-space whose boundary hyperplane passes through the center of W (here q ≠ 0). Then W̄ can be covered by an ellipsoid W^+ of the volume

\[
\mathrm{Vol}_n(W^+) = \kappa^n(n)\,\mathrm{Vol}_n(W), \qquad
\kappa^n(n) = \left(\frac{n^2}{n^2-1}\right)^{n/2}\left(\frac{n-1}{n+1}\right)^{1/2} \le \exp\left\{-\frac{1}{2(n+1)}\right\}; \tag{3.5.24}
\]

in particular,

\[
\mathrm{EllOut}(W^+) \le \kappa(n)\,\mathrm{EllOut}(W) \le \exp\left\{-\frac{1}{2n(n+1)}\right\}\mathrm{EllOut}(W). \tag{3.5.25}
\]

The ellipsoid W + is given by

\[
W^+ = \{x = B^+u + c^+ \mid u^Tu \le 1\}, \qquad
B^+ = \alpha(n)B - \gamma(n)(Bp)p^T, \qquad c^+ = c - \frac{1}{n+1}Bp,
\]

where

\[
\alpha(n) = \left(\frac{n^2}{n^2-1}\right)^{1/2}, \qquad
\gamma(n) = \alpha(n)\left(1 - \sqrt{\frac{n-1}{n+1}}\right), \qquad
p = \frac{B^Tq}{\sqrt{q^TBB^Tq}}.
\]

To prove the lemma, it suffices to reduce the situation to the similar one with W being the unit Euclidean ball V ; indeed, since W is the image of V under the affine transformation u → Bu + c, the half-ellipsoid W¯ is the image, under this transformation, of the half-ball

V¯ = {u ∈ V | (BT q)T u ≤ 0} = {u ∈ V | pT u ≤ 0}.

Now, it is quite straightforward to verify that a half-ball indeed can be covered by an ellipsoid V^+ with the volume being the required fraction of the volume of V; to verify this was one of the exercises of the previous lecture (cf. Exercise 2.3.8), and in the formulation of the exercise you were given the explicit representation of V^+. It remains to note that the image of V^+ under the affine transformation which maps the unit ball V onto the ellipsoid W is an ellipsoid which clearly contains the half-ellipsoid W̄ and is in the same ratio of volumes with respect to W as V^+ is with respect to the unit ball V (since the ratio of volumes remains invariant under affine transformations). The ellipsoid W^+ given in the formulation of the lemma is nothing but the image of V^+ under our affine transformation.
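A direct numerical check of Lemma 3.5.1 is straightforward; the sketch below (illustrative code, not from the notes) builds W^+ from (B, c, q) by the formulas of the lemma and verifies that |det B^+| / |det B| equals κ^n(n).

```python
import numpy as np

def ellipsoid_cut_update(B, c, q):
    """Cover the half-ellipsoid {x in W(B,c): (x-c)^T q <= 0} by the ellipsoid
    W(B_plus, c_plus) of Lemma 3.5.1."""
    n = B.shape[0]
    alpha = np.sqrt(n**2 / (n**2 - 1.0))
    gamma = alpha * (1.0 - np.sqrt((n - 1.0) / (n + 1.0)))
    p = B.T @ q / np.sqrt(q @ B @ B.T @ q)
    B_plus = alpha * B - gamma * np.outer(B @ p, p)
    c_plus = c - (B @ p) / (n + 1.0)
    return B_plus, c_plus

rng = np.random.default_rng(0)
n = 5
B = rng.standard_normal((n, n))          # any nonsingular matrix
c = rng.standard_normal(n)
q = rng.standard_normal(n)
B_plus, c_plus = ellipsoid_cut_update(B, c, q)

ratio = abs(np.linalg.det(B_plus)) / abs(np.linalg.det(B))
kappa_n = (n**2 / (n**2 - 1.0)) ** (n / 2) * np.sqrt((n - 1.0) / (n + 1.0))
print(ratio, kappa_n)                    # the two numbers should coincide
```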

3.5.2 The Ellipsoid method Bearing in mind our ”spoiled” cutting plane scheme, we may interpret the statement of Lemma 3.5.1 as follows: assume that at certain step of a cutting plane method the localizer Gi−1 to be updated is an ellipsoid. Let us choose the current search point as the center of ¯ the ellipsoid; then the cut will result in the intermediate localizer Gi which is a half-ellipsoid. Let us cover this intermediate localizer by an ellipsoid given by our lemma and choose the latter ellipsoid as the new localizer Gi. Now we are in the same position as we were - the new localizer is an ellipsoid, and we may proceed in the same manner. And due to our lemma, we do decrease in certain ratio certain size of localizers - namely, the outer ellipsoidal size EllOut. Two issues should be thought of. First, how to initialize our procedure - i.e., how to enforce the initial localizer to be an ellipsoid (recall that in our basic scheme the initial localizer was the domain G oftheproblem).Theanswerisimmediate-letustakeasG0 an arbitrary ellipsoid containing G. The second difficulty is as follows: in our ”spoiled” cutting plane scheme the localizers, generally speaking, are not subsets of the domain of the problem; it may, consequently, happen that the center of a localizer is outside the domain; how to perform a cut in the latter case? Here again the difficulty can be immediately overcome. If the center xi of the current localizer Gi−1 is outside the interior of G, then, due to the Separation Theorem for convex sets, we may find a nonzero linear functional eT x which separates xi and int G: T (x − xi) e ≤ 0,x∈ G. Using e for the cut, i.e., setting

\[
\bar G_i = \{x \in G_{i-1} \mid (x - x_i)^T e \le 0\},
\]

we remove from the previous localizer G_{i−1} only those points which do not belong to the domain of the problem, so that Ḡ_i indeed can be thought of as a new intermediate localizer. Thus, we come to the Ellipsoid method, due to Nemirovski and Yudin (1979), which, as applied to a convex programming problem

\[
\text{minimize } f(x) \;\text{ s.t. }\; g_j(x) \le 0,\; j = 1, ..., m,\; x \in G \subset \mathbb{R}^n
\]

works as follows:

Initialization. Choose an n × n nonsingular matrix B_0 and a point x_1 such that the ellipsoid

\[
G_0 = \{x = B_0u + x_1 \mid u^Tu \le 1\}
\]

contains G. Choose β > 0 such that

\[
\beta \le \frac{\mathrm{EllOut}(G)}{\mathrm{EllOut}(G_0)}.
\]

i-th step, i ≥ 1. Given B_{i−1}, x_i, act as follows:

1) Check whether x_i ∈ int G. If it is not the case, then call step i non-productive, find a nonzero e_i such that

\[
(x - x_i)^T e_i \le 0 \quad \forall x \in G
\]

and go to 3); otherwise go to 2).

2) Call the oracle to compute the quantities

\[
f(x_i), f'(x_i), g_1(x_i), g_1'(x_i), ..., g_m(x_i), g_m'(x_i).
\]

If one of the inequalities

\[
g_j(x_i) \le \varepsilon\left[\max_G\{g_j(x_i) + (x - x_i)^Tg_j'(x_i)\}\right]_+, \qquad j = 1, ..., m, \tag{3.5.26}
\]

is violated, say, the k-th of them, call the i-th step non-productive, set

\[
e_i = g_k'(x_i)
\]

and go to 3). If all inequalities (3.5.26) are satisfied, call the i-th step productive and set

\[
e_i = f'(x_i).
\]

If e_i = 0, terminate, x_i being the result found by the method; otherwise go to 3).

3) Set

\[
p = \frac{B_{i-1}^Te_i}{\sqrt{e_i^TB_{i-1}B_{i-1}^Te_i}}, \qquad
B_i = \alpha(n)B_{i-1} - \gamma(n)(B_{i-1}p)p^T, \qquad
x_{i+1} = x_i - \frac{1}{n+1}B_{i-1}p, \tag{3.5.27}
\]

α(n) and γ(n) being the quantities from Lemma 3.5.1. If

\[
\kappa^i(n) < \varepsilon\beta, \tag{3.5.28}
\]

terminate with the result x̄ being the best, with respect to the objective, of the search points associated with the productive steps:

x¯ ∈ Argmin{f(x) | x is one of xj with productive j ≤ i}

otherwise go to the next step.

The main result on the Ellipsoid method is as follows:

Theorem 3.5.1 The Ellipsoid method Ell(ε) associated with a given relative accuracy ε ∈ (0, 1), as applied to a problem instance p ∈ P_m(G), terminates in no more than

\[
\mathcal{A}(\mathrm{Ell}(\varepsilon)) = \left\lceil \frac{\ln\frac{1}{\beta\varepsilon}}{\ln(1/\kappa(n))}\right\rceil \le \left\lceil 2n(n+1)\ln\frac{1}{\beta\varepsilon}\right\rceil
\]

steps and solves p within relative accuracy ε: the result x̄ is well defined and

\[
\varepsilon(p,\bar x) \le \varepsilon.
\]

Given the direction e_i defining the i-th cut, it takes O(n^2) arithmetic operations to update (B_{i−1}, x_i) into (B_i, x_{i+1}).

Proof. The complexity bound is an immediate corollary of the termination test (3.5.28). To prove that the method solves p within relative accuracy ε, note that from Lemma 3.5.1 it follows that

\[
\mathrm{EllOut}(G_i) \le \kappa^i(n)\,\mathrm{EllOut}(G_0) \le \kappa^i(n)\,\beta^{-1}\,\mathrm{EllOut}(G)
\]

(the latter inequality comes from the origin of β). It follows that if the method terminates at a step N due to (3.5.28), then

EllOut(GN ) <εEllOut(G).

Due to this latter inequality, we immediately obtain the accuracy estimate as a corollary of our general convergence statement on the cutting plane scheme (Proposition 3.3.1). Although the latter statement was formulated and proved for the basic cutting plane scheme rather than for the ”spoiled” one, the reasoning can be literally repeated in the case of the ”spoiled” scheme.
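Putting the pieces together, a minimal sketch of the Ellipsoid method for the unconstrained case m = 0 might look as follows (illustrative code under the assumptions that G is given by a separation oracle and that G_0 is a Euclidean ball of radius R containing G; the function and variable names are mine).

```python
import numpy as np

def ellipsoid_method(f, f_grad, separate, x1, R, n_steps):
    """Minimize a convex f over G (case m = 0 of the scheme above).
    f(x), f_grad(x): value and a subgradient of the objective;
    separate(x): None if x lies in int G, otherwise a nonzero e with
    (y - x)^T e <= 0 for all y in G.  G_0 is the ball of radius R around x1."""
    n = len(x1)
    alpha = np.sqrt(n**2 / (n**2 - 1.0))
    gamma = alpha * (1.0 - np.sqrt((n - 1.0) / (n + 1.0)))
    B = R * np.eye(n)
    x = np.array(x1, dtype=float)
    best_x, best_val = None, np.inf
    for _ in range(n_steps):
        e = separate(x)
        if e is None:                       # productive step: cut with f'(x)
            val = f(x)
            if val < best_val:
                best_x, best_val = x.copy(), val
            e = f_grad(x)
            if not np.any(e):               # zero subgradient: x is optimal
                return x
        p = B.T @ e / np.sqrt(e @ B @ B.T @ e)
        x = x - (B @ p) / (n + 1.0)         # update the center first ...
        B = alpha * B - gamma * np.outer(B @ p, p)   # ... then the matrix
    return best_x

# toy run: minimize ||x - a||^2 over the unit ball G = {||x|| <= 1}
a = np.array([2.0, 0.5, -1.0])
sep = lambda x: x if x @ x >= 1.0 else None          # separator for the unit ball
sol = ellipsoid_method(lambda x: (x - a) @ (x - a),
                       lambda x: 2.0 * (x - a), sep, np.zeros(3), 1.0, 200)
print(sol)
```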

Note that the complexity of the Ellipsoid method depends on β, i.e., on how good the initial ellipsoidal localizer we start with is. Theoretically, we could choose as G_0 the ellipsoid of the smallest volume containing the domain G of the problem, thus ensuring β = 1; for "simple" domains, like a box, a simplex or a Euclidean ball, we may start with this optimal ellipsoid not only in theory, but also in practice. Even with this good start, the Ellipsoid method has O(n) times worse theoretical complexity than the Center of Gravity method (here it takes O(n^2) steps to improve the inaccuracy by an absolute constant factor). As a compensation for this theoretical drawback, the Ellipsoid method is not only of theoretical interest: it can be used for practical computations as well. Indeed, if G is a simple domain from the above list, then all actions prescribed by rules 1)-3) cost only O(n(m + n)) arithmetic operations. Here the term mn comes from the necessity to check whether the current search point is in the interior of G and, if it is not the case, to separate the point from G, and also from the necessity to maximize the linear approximations of the constraints over G; the term n^2 reflects the complexity of updating B_{i−1} → B_i after e_i is found. Thus, the arithmetic cost of a step is quite moderate, incomparable to the tremendous one for the Center of Gravity method.

3.6 Exercises

3.6.1 Some extensions of the Cutting Plane scheme

To the moment we have applied the Cutting Plane scheme to convex optimization problems in the standard form

\[
\text{minimize } f(x) \;\text{ s.t. }\; g_j(x) \le 0,\; j = 1, ..., m,\; x \in G \subset \mathbb{R}^n \tag{3.6.29}
\]

(G is a solid, i.e., a closed and bounded convex set with a nonempty interior, f and gj are convex and continuous on G). In fact the scheme has a wider field of applications. Namely, consider a ”generic” problem as follows:

\[
(f):\quad \text{minimize } f(x) \;\text{ s.t. }\; x \in G_f \subset \mathbb{R}^n; \tag{3.6.30}
\]

here Gf is certain (specific for the problem instance) solid and f is a function taking values in the extended real axis R ∪{−∞}∪{+∞} and finite on the interior of G. Let us make the following assumption on our abilities to get information on (f): n (A): we have an access to an oracle OA which, given on input a point x ∈ R , informs us whether x belongs to the interior of Gf ; if it is not the case, the oracle reports a nonzero functional ex which separates x and Gf , i.e., is such that

\[
(y - x)^T e_x \le 0, \qquad y \in G_f;
\]

if x ∈ int G_f, then the oracle reports f(x) and a functional e_x such that the level set

\[
\mathrm{lev}_{f(x)}f = \{y \in G_f \mid f(y) < f(x)\}
\]

is contained in the open half-space

\[
\{y \in \mathbb{R}^n \mid (y - x)^T e_x < 0\}.
\]

In the meantime we shall see that under assumption (A) we can efficiently solve (f) by cutting plane methods; but before coming to this main issue let me indicate some interesting

particular cases of the ”generic” problem (f).

[Figure 3: a quasiconvex function h(x) and its level set L_h(x)]

Example 1. Convex optimization in the standard setting. Consider the usual convex problem (3.6.29) and assume that the feasible set of the problem has a nonempty interior where the constraints are negative. We can rewrite (3.6.29) as (3.6.30) by setting

Gf = {x ∈ G | gj(x) ≤ 0,j=1, ..., m}. Now, given in advance the domain G of the problem (3.6.29) and being allowed to use a first-order oracle O for (3.6.29), we clearly can imitate the oracle Oa required by (A). Example 1 is not so interesting - we simply have expressed something we already know in a slightly different way. The next examples are less trivial. Example 2. Quasi-convex optimization. Assume that the objective f involved into (3.6.29) and the functional constraints gj are quasi-convex rather than convex. Recall that a function h defined on a convex set Q is called quasi-convex,ifthesets

lev h(x)h = {y ∈ Q | h(y) ≤ h(x)} are convex whenever x ∈ Q (note the difference between Lh(x) and lev h(x)h; the latter set is defined via strict inequality <,theformeronevia≤). Besides quasi-convexity, assume that the functions f, gj, j =1, ..., m,areregular, i.e., continuous on G, differentiable on the interior of G with a nonzero gradient at any point which is not a minimizer of the function over G. Last, assume that the feasible set Gf of the problem possesses a nonempty interior and that the constraints are negative on the interior of Gf . If h is a regular quasi-convex function on G and x ∈ int G, then the set lev h(x)h = {y ∈ G | h(y)

\[
\Pi_h(x) = \{y \in G \mid (y - x)^T h'(x) < 0\}.
\]

Exercise 3.6.1 #+ Prove the latter statement.

Thus, in the case in question the sets {y ∈ G | f(y) < f(x)} and {y ∈ G | g_j(y) < g_j(x)}, x ∈ int G, are contained in the open half-spaces Π_f(x) and Π_{g_j}(x), so that the usual first-order information on f and the g_j allows us to imitate the oracle O_A required by (A).

Example 3. The fractional problem. Consider the problem

minimize max_{ω∈Ω} a_ω(x)/b_ω(x) s.t. g_j(x) ≤ 0, j = 1, ..., m, b_ω(x) > 0, ω ∈ Ω, x ∈ G;    (3.6.31)

here G is a solid in R^n and Ω is a finite set of indices, a_ω and b_ω are affine functions and the g_j are, say, convex and continuous on G. The problem is, as we see, to minimize the maximum of ratios of given linear forms over the convex set defined by the inclusion x ∈ G, convex functional constraints g_j(x) ≤ 0 and additional linear constraints expressing positivity of the denominators. Let us set

G_f = {x ∈ G | g_j(x) ≤ 0, b_ω(x) ≥ 0, ω ∈ Ω};

we assume that the (closed and convex) set G_f possesses a nonempty interior and that the functions g_j are negative, while the b_ω are positive, on the interior of G_f. By setting

f(x) = max_ω {a_ω(x)/b_ω(x)} for x ∈ int G_f,   f(x) = +∞ otherwise,    (3.6.32)

we can rewrite our problem as

minimize f(x) s.t. x ∈ Gf . (3.6.33)

Now, assume that we are given G in advance and have access to a first-order oracle O which, given on input a point x ∈ int G, reports the values and subgradients of the functional constraints at x, and also reports all a_ω(·), b_ω(·). Under these assumptions we can imitate for (3.6.33) the oracle O_A required by assumption (A). Indeed, given x ∈ R^n, we first check whether x ∈ int G, and if it is not the case, find a nonzero functional e_x which separates x and G (we can do it, since G is known in advance); of course, this functional also separates x and G_f, as required in (A). Now, if x ∈ int G, we ask the first-order oracle O about the values and subgradients of the g_j, a_ω and b_ω at x and check whether all g_j are negative at x and all b_ω(x) are positive. If it is not the case and, say, g_k(x) ≥ 0, we claim that x ∉ int G_f and set e_x equal to g_k'(x); this functional is nonzero (since otherwise g_k would attain a nonnegative minimum at x, which contradicts our assumptions about the problem) and clearly separates x and G_f (due to the convexity of g_k). Similarly, if one of the denominators b_ω̄ is nonpositive at x, we claim that x ∉ int G_f and set

e_x = −∇b_ω̄;

here again e_x is nonzero and separates x and G_f (why?). Last, if x ∈ int G, all g_j are negative at x and all b_ω are positive at the point, we claim that x ∈ int G_f (as it actually is the case), find the index ω = ω(x) associated with the largest at x of the fractions a_ω(·)/b_ω(·), compute the corresponding fraction at x (this is nothing but f(x)) and set

e_x = ∇_y a_{ω(x)}(y) − f(x) ∇_y b_{ω(x)}(y).

Since in the latter case we have

lev_{f(x)} f ⊂ {y ∈ G_f | a_{ω(x)}(y)/b_{ω(x)}(y) < f(x)} ⊂ {y ∈ R^n | (y − x)^T e_x < 0},

e_x is as required in (A) in this case as well.
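The recipe of the preceding paragraphs is easy to turn into code. Below is a minimal Python sketch of the oracle O_A for the fractional problem; the names are my own, the rows of the arrays a and b are assumed to hold the coefficients of the affine forms a_ω, b_ω, and separate_from_G is an assumed helper for the "simple" set G.

    import numpy as np

    def fractional_oracle(x, gs, a, b, separate_from_G):
        """Imitates O_A for the fractional problem (3.6.31)-(3.6.33).

        gs              -- list of callables: gs[j](x) -> (g_j(x), subgradient of g_j at x)
        a, b            -- arrays of shape (|Omega|, n+1): a_w(x) = a[w,:n] @ x + a[w,n]
        separate_from_G -- assumed helper: None if x is in int G, else a separator from G
        """
        e = separate_from_G(x)
        if e is not None:                       # x is not in int G
            return False, e, None
        for g in gs:                            # the convex constraints
            val, grad = g(x)
            if val >= 0:
                return False, grad, None        # e_x = g_k'(x)
        num = a[:, :-1] @ x + a[:, -1]          # a_w(x) for all w
        den = b[:, :-1] @ x + b[:, -1]          # b_w(x) for all w
        k = int(np.argmin(den))
        if den[k] <= 0:                         # some denominator is nonpositive
            return False, -b[k, :-1], None      # e_x = -grad b_w
        w = int(np.argmax(num / den))           # the active fraction
        fx = num[w] / den[w]
        e_x = a[w, :-1] - fx * b[w, :-1]        # grad a_w - f(x) grad b_w
        return True, e_x, fx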

An instructive particular case of Example 3 is the von Neumann Economic Growth problem. In this model the state of an economy in year t is described by a vector x_t ≥ 0 of intensities at which the production processes are run; the technology is given by a pair of nonnegative matrices A and B (Ax being the vector of goods consumed and Bx the vector of goods produced when the intensities are x), and the requirement that in year t + 1 we cannot consume more than was produced in year t reads

A x_{t+1} ≤ B x_t,   t = 0, 1, ...

(for t = 0 the right hand side should be replaced by a positive vector representing the "starting" amount of goods). Now, in the von Neumann Economic Growth problem it is asked what is the largest growth factor, γ*, for which there exists a "semi-stationary growth trajectory", i.e., a trajectory of the type x_t = γ^t x_0. In other words, we should solve the problem

maximize γ s.t. γ A x ≤ B x for some nonzero x ≥ 0.

Without loss of generality, x in the above formulation can be taken as a point from the standard simplex

G = {x ∈ R^n | x ≥ 0, Σ_j x_j = 1}

(which should be regarded as a solid in its affine hull). It is clearly seen that the problem in question can be rewritten as follows:

minimize max_{i=1,...,m} (Σ_j a_{ij} x_j)/(Σ_j b_{ij} x_j) s.t. x ∈ G;    (3.6.34)

this is a problem of the type (3.6.33).
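As a brute-force sanity check (this is not the cutting-plane scheme of these notes), the growth factor γ* can also be bracketed by bisection on γ, each step being an LP feasibility problem over the simplex. A minimal sketch using scipy, with A and B the technology matrices above and an assumed finite γ*, is:

    import numpy as np
    from scipy.optimize import linprog

    def von_neumann_gamma(A, B, tol=1e-6):
        """Bisection for the largest gamma with gamma*A@x <= B@x for some x in the simplex."""
        m, n = A.shape
        def feasible(gamma):
            # look for x >= 0, sum x = 1, with (gamma*A - B) x <= 0
            res = linprog(c=np.zeros(n),
                          A_ub=gamma * A - B, b_ub=np.zeros(m),
                          A_eq=np.ones((1, n)), b_eq=np.array([1.0]),
                          bounds=[(0, None)] * n, method="highs")
            return res.status == 0
        lo, hi = 0.0, 1.0
        while feasible(hi):            # bracket gamma* from above (assumes gamma* < infinity)
            lo, hi = hi, 2.0 * hi
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if feasible(mid) else (lo, mid)
        return lo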

It is worth noting that the von Neumann growth factor γ* describes, in a sense, the highest rate of growth of our economy (this is far from being clear in advance: why is the "Soviet" proportional growth the best one? Why could we not get something better along an oscillating trajectory?) One exact statement on optimality of the von Neumann semi-stationary trajectory (or, better to say, the simplest of these statements) is as follows:

Proposition 3.6.1 Let {x_t}_{t=0}^{T} be a trajectory of our economy, so that the x_t are nonnegative, x_0 ≠ 0 and

A x_{t+1} ≤ B x_t,   t = 0, 1, ..., T − 1.

Assume that x_T ≥ λ^T x_0 for some positive λ (so that our trajectory results, for some T, in growth of the amount of goods by the factor λ^T in T years). Then λ ≤ γ*.

Note that following the von Neumann trajectory

x_t = (γ*)^t x_0,

x_0 being the x-component of an optimal solution to (3.6.34), does ensure growth by the factor (γ*)^T each T years.

Exercise 3.6.2 ∗ Prove Proposition 3.6.1.

Example 4. The Generalized Eigenvalue problem. Let us again look at problem (3.6.31). What happens if the index set Ω becomes infinite? To avoid minor elaborations, let us assume (as it actually is the case in most applications) that Ω is a compact set and that the functions a_ω(x), b_ω(x) are continuous in (x, ω) (and, as above, are affine in x). As far as reformulation (3.6.32)-(3.6.33) is concerned, no difficulties occur. The possibility to imitate the oracle O_A is another story (it hardly would be realistic to ask O to report infinitely many numerators and denominators). Note, anyhow, that from the discussion accompanying the previous example it is immediately seen that at a given x we are not interested in all a_ω, b_ω; what we in fact are interested in are the numerator and denominator active at x, i.e., either (any) b_ω(·) such that b_ω(x) ≤ 0, if such a denominator exists, or those a_ω(·), b_ω(·) with the largest ratio at the point x, if all the denominators at x are positive. Assume therefore that we know G in advance and have access to an oracle O which, given on input x ∈ int G, reports the values and subgradients of the functional constraints g_j at x and, besides this, tells us (at least in the case when all g_j are negative at x) whether all the denominators b_ω(x), ω ∈ Ω, are positive; if it is the case, then the oracle returns the numerator and the denominator with the largest ratio at x, otherwise it returns a denominator which is nonpositive at x. Looking at the construction of O_A given in the previous example we immediately conclude that in our present situation we again can imitate an oracle O_A compatible with (A) for problem (3.6.33). In fact the "semidefinite fractional problem" we are discussing possesses interesting applications; let me introduce one of them which is of extreme importance for modern Control Theory - the Generalized Eigenvalue problem. The problem is as follows: given two affine functions, A(x) and B(x), taking values in the space of symmetric m × m matrices ("affine"

means that the entries of the matrices are affine functions of x), minimize, with respect to x, the Rayleigh ratio

max_{ω ∈ R^m \ {0}} (ω^T A(x) ω)/(ω^T B(x) ω)

of the quadratic forms associated with these matrices under the constraint that B(x) is positive definite (and, possibly, under additional convex constraints on x). In other words, we are looking for a pair (x, λ) satisfying the constraints

B(x) is positive definite,   λ B(x) − A(x) is positive semidefinite,

and additional constraints

g_j(x) ≤ 0, j = 1, ..., m, x ∈ G ⊂ R^n

(the g_j are convex and continuous on the solid G) and are interested in the pair of this type with the smallest possible λ. The Generalized Eigenvalue problem (the origin of the name is that in the particular case when B(x) ≡ I is the unit matrix we come to the problem of minimizing, with respect to x, the largest eigenvalue of A(x)) can be immediately written down as a "semidefinite fractional problem"

minimize max_{ω∈Ω} (ω^T A(x) ω)/(ω^T B(x) ω) s.t. g_j(x) ≤ 0, j = 1, ..., m, ω^T B(x) ω > 0, ω ∈ Ω, x ∈ G;    (3.6.35)

here Ω is the unit sphere in R^m. Note that the numerators and denominators in our "objective fractions" are affine in x, as required by our general assumptions on fractional problems. Assume that we are given G in advance, as well as the data identifying the affine in x matrix-valued functions A(x) and B(x), and that we have access to a first-order oracle providing us with local information on the "general type" convex constraints g_j. Then it is not difficult to decide, for a given x, whether B(x) is positive definite, and if it is not the case, to find ω ∈ Ω such that the denominator ω^T B(x) ω is nonpositive at x. Indeed, it suffices to compute B(x) and to subject the matrix to Cholesky factorization (I hope you know what it means). If the factorization is successful, we find a lower-triangular matrix Q with nonzero diagonal such that B(x) = Q Q^T, and B(x) is positive definite; if the factorization fails, then in the course of it we automatically meet a unit vector ω which "proves" that B(x) is not positive definite, i.e., is such that ω^T B(x) ω ≤ 0. Now, if B(x), for a given x, is positive definite, then to find ω associated with the largest at x of the fractions

(ω^T A(x) ω)/(ω^T B(x) ω)

is the same as to find the eigenvector of the (symmetric) matrix Q^{-1} A(x) (Q^T)^{-1} associated with the largest eigenvalue of the matrix, Q being the above Cholesky factor of B(x) (why?); to find this eigenvector is a standard Linear Algebra routine.
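In numerical terms the two operations just described are standard; a minimal numpy/scipy sketch (the wrapper name is mine) is given below. Note that numpy's Cholesky routine only signals failure and does not return the certificate vector ω with ω^T B(x) ω ≤ 0, so that part is left out here.

    import numpy as np
    from scipy.linalg import eigh

    def largest_rayleigh_ratio(A_x, B_x):
        """For symmetric A(x), B(x): test positive definiteness of B(x) via Cholesky;
        if it holds, return the largest generalized eigenvalue of (A(x), B(x)),
        i.e. max_w (w^T A(x) w)/(w^T B(x) w), and a corresponding eigenvector."""
        try:
            np.linalg.cholesky(B_x)        # raises LinAlgError iff B(x) is not positive definite
        except np.linalg.LinAlgError:
            return None                    # caller should produce a separating cut instead
        vals, vecs = eigh(A_x, B_x)        # generalized symmetric eigenproblem A w = lam B w
        return vals[-1], vecs[:, -1]       # eigenvalues come back in ascending order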

Thus, any technique which allows us to solve (f) under assumption (A) immediately implies a numerical method for solving the Generalized Eigenvalue problem. It is worth explaining what the "control source" of Generalized Eigenvalue problems is. Let me start with a well-known issue - stability of a linear differential equation

ż(t) = α z(t)

(z ∈ R^s). As you surely know, the maximal "growth" of the trajectories of the equation as t → ∞ is predetermined by the eigenvalue of α with the largest real part; let this part be λ. Namely, all the trajectories admit, for any ε > 0, the estimate

|z(t)|_2 ≤ C_ε exp{(λ + ε)t} |z(0)|_2,

and vice versa: from the fact that all the trajectories admit an estimate

|z(t)|_2 ≤ C exp{at} |z(0)|_2    (3.6.36)

it follows that a ≥ λ. There are different ways to prove the above fundamental Lyapunov Theorem, and one of the simplest is via quadratic Lyapunov functions. Let us say that a quadratic function z^T L z (L is a symmetric positive definite s × s matrix) proves that the decay rate of the trajectories is at most a, if for any trajectory of the equation one has

ż^T(t) L z(t) ≤ a z^T(t) L z(t),   t ≥ 0;    (3.6.37)

(d/dt) ln( z^T(t) L z(t) ) ≤ 2a

and, consequently,

( z^T(t) L z(t) )^{1/2} ≤ exp{at} ( z^T(0) L z(0) )^{1/2},

which immediately results in an estimate of the type (3.6.36). Thus, any positive definite symmetric matrix L which satisfies, for some a, relation (3.6.37) implies an upper bound (3.6.36) on the trajectories of the equation, the upper bound involving just this a. Now, what does it mean that L satisfies (3.6.37)? Since ż(t) = α z(t), it means exactly that

z^T(t) L α z(t) ≡ (1/2) z^T(t) (Lα + α^T L) z(t) ≤ a z^T(t) L z(t)

for all t and all trajectories of the equation; since z(t) can be an arbitrary vector of R^s, the latter inequality means that

2aL − (α^T L + Lα) is positive semidefinite.    (3.6.38)

Thus, any pair comprised of a real a and a positive definite symmetric L satisfying (3.6.38) results in the upper bound (3.6.36); the best (with the smallest possible a) bound (3.6.36) which can be obtained in this way is given by the solution to the problem

minimize a s.t. L is positive definite, 2aL − (α^T L + Lα) is positive semidefinite;

this is nothing but the Generalized Eigenvalue problem with B(L) = 2L, A(L) = α^T L + Lα and no additional constraints on x ≡ L. And it can be proved that the best a given by this construction is nothing but the largest of the real parts of the eigenvalues of α, so that in the case in question the approach based on quadratic Lyapunov functions and Generalized Eigenvalue problems results in a complete description of the asymptotic behaviour of the trajectories as t → ∞. In fact, of course, what was said is of no "literal" significance: why should we solve a Generalized Eigenvalue problem in order to find something which can be found by a direct computation of the eigenvalues of α? The indicated approach becomes meaningful when we pass from our simple case of a linear differential equation with constant coefficients to the much more difficult (and more important for practice) case of a differential inclusion. Namely, assume that we are given a multivalued mapping z → Q(z) ⊂ R^s and are interested in bounding the trajectories of the differential inclusion

ż(t) ∈ Q(z(t)),   t ≥ 0;    (3.6.39)

such an inclusion may model, e.g., a time-varying dynamic system

ż(t) = α(t) z(t)

with certain unknown α(·). Assume that we know finitely many matrices α_1, ..., α_M such that

Q(z)=Conv{α1z, ..., αM z}

(e.g., we know bounds on entries of α(t) in the above time-varying system). In order to obtain an estimate of the type (3.6.36), we again may use a quadratic Lyapunov function zT Lz: if for all trajectories of the inclusion one has

ż^T(t) L z(t) ≤ a z^T(t) L z(t),

or, which is the same, if

ż^T L z ≤ a z^T L z   ∀ (z ∈ R^s, ż ∈ Q(z)),    (3.6.40)

( z^T(t) L z(t) )^{1/2} ≤ exp{at} ( z^T(0) L z(0) )^{1/2}

for all trajectories. Now, the left hand side in (3.6.40)islinearinz, and in order for the inequality in  T (3.6.40) to be satisfied for all z ∈ Q(z) it is necessary and sufficient to have z Lαiz ≡ 68 LECTURE 3. METHODS WITH LINEAR CONVERGENCE   1 T T ≤ T 2 z (αi L + Lαi)z z Lz for all z and all i, i.e., to ensure positive semidefiniteness of − T the matrices 2aL (αi L + Lαi), i =1, ..., M. By setting

B(L) = Diag(2L, ..., 2L),   A(L) = Diag(α_1^T L + Lα_1, ..., α_M^T L + Lα_M),

minimize a s.t. B(L) positive definite, aB(L) − A(L) positive semidefinite

with no additional constraints on x ≡ L. Note that in the case of differential inclusions (in contrast to that of equations with constant coefficients) the best decay rate which can be demonstrated by a quadratic Lyapunov function is not necessarily the actual best decay rate possessed by the trajectories of the inclusion; the trajectories may behave better than can be demonstrated by a quadratic Lyapunov function. This shortcoming, anyhow, is compensated by the fact that the indicated scheme is quite tractable computationally; this is why it has now become one of the standard tools for stability analysis and synthesis.³ It is worth adding that the von Neumann problem is a very specific case of a Generalized Eigenvalue problem (make the entries of Ax and Bx the diagonal entries of diagonal matrices).
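To make the construction concrete: given candidate values a and L and the matrices α_1, ..., α_M, checking whether the pair (a, L) is admissible amounts to M semidefiniteness tests. A minimal numpy sketch (names of my choosing) is:

    import numpy as np

    def lyapunov_certificate_ok(a, L, alphas, tol=1e-10):
        """Check that L is symmetric positive definite and that
        2aL - (alpha_i^T L + L alpha_i) is positive semidefinite for every i,
        i.e. that z^T L z certifies decay rate a for the differential inclusion."""
        if np.min(np.linalg.eigvalsh(L)) <= 0:
            return False
        for alpha in alphas:
            S = 2 * a * L - (alpha.T @ L + L @ alpha)
            if np.min(np.linalg.eigvalsh(0.5 * (S + S.T))) < -tol:
                return False
        return True

Bisection over a with this test is a cheap (though not polynomial-time-optimal) way to approximate the best decay rate certified by a fixed L; finding the best L itself is exactly the Generalized Eigenvalue problem above.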

I have indicated a number of important applications of (f); it is time now to think how to solve the problem. There is no difficulty in applying to the problem the cutting plane scheme.

The Cutting Plane scheme for problem (f):

Initialization. Choose a solid G_0 which covers G_f.

i-th step, i ≥ 1. Given a solid G_{i-1} (the previous localizer), choose

x_i ∈ int G_{i-1}

and call the oracle O_A, x_i being the input. Given the answer e_i ≡ e_{x_i} of the oracle,
- call step i productive if the oracle says that x_i ∈ int G_f, and call the step non-productive otherwise;
- check whether e_i = 0 (due to (A), this may happen only at a productive step); if it is the case, terminate, x_i being the result found by the method. Otherwise
- define the i-th approximate solution to (f) as the best, in terms of the values of f, of the search points x_j associated with the productive steps j ≤ i;
- set

Ḡ_i = {x ∈ G_{i-1} | (x − x_i)^T e_i ≤ 0};

- embed the intermediate localizer Ḡ_i into a solid G_i and loop.

³ For "user" information look at the MATLAB LMI Control Toolbox manual.
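A skeletal implementation of this scheme might look as follows; the choice of x_i and the localizer update are left abstract, since they depend on which variant (Center of Gravity, Ellipsoid, outer simplex, ...) one implements, and the oracle is assumed to follow the interface sketched after assumption (A).

    import numpy as np

    def cutting_plane_for_f(oracle, choose_point, update_localizer, G0, max_steps):
        """Generic Cutting Plane scheme for problem (f) under assumption (A).

        oracle(x)                  -> (productive, e, f_val), the answer of O_A
        choose_point(G)            -> a point in the interior of the current localizer
        update_localizer(G, x, e)  -> a solid containing {y in G : (y - x)^T e <= 0}
        """
        G = G0
        best_x, best_f = None, float("inf")
        for _ in range(max_steps):
            x = choose_point(G)
            productive, e, f_val = oracle(x)
            if productive:
                if not np.any(e):              # e_i = 0: exact minimizer found, terminate
                    return x, f_val
                if f_val < best_f:
                    best_x, best_f = x, f_val  # i-th approximate solution
            G = update_localizer(G, x, e)
        return best_x, best_f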

The presented scheme defines, of course, a family of methods rather than a single method. The basic implementation issues, as always, are how to choose x_i in the interior of G_{i-1} and how to extend Ḡ_i to G_i; here one may use the same tactics as in the Center of Gravity or in the Ellipsoid methods. An additional problem is how to start the process (i.e., how to choose G_0); this issue heavily depends on a priori information on the problem, and here we hardly could give any universal recommendations. Now, what can be said about the rate of convergence of the method? First of all, we should say how we measure inaccuracy. A convenient general approach here is as follows. Let x ∈ G_f and let, for a given ε ∈ (0, 1),

G_f^ε = x + ε(G_f − x) = {y = (1 − ε)x + εz | z ∈ G_f}

be the image of G_f under the similarity transformation which shrinks G_f towards x by the factor 1/ε. Let

f_x(ε) = sup_{y ∈ G_f^ε} f(y).

The introduced quantity depends on x; let us take the infimum of it over x ∈ Gf :

f*(ε) = inf_{x ∈ G_f} f_x(ε).

By definition, an ε-solution to f is any point x ∈ Gf such that f(x) ≤ f ∗(ε).

Let us motivate the introduced notion. The actual motivation is, of course, that the notion works, but let us start with a kind of speculation. Assume for a moment that the problem is solvable, and let x* be an optimal solution to it. One hardly could argue that a point x̄ ∈ G_f which is at distance of order ε from x* is a natural candidate for the role of an ε-solution; since all points from G^ε_{x*} are at distance at most ε Diam(G_f) from x*, all these points can be regarded as ε-solutions, in particular the worst of them (i.e., the one with the largest value of f), the point x*(ε). Now, what we actually are interested in are the values of the objective; if we agree to think of x*(ε) as of an ε-solution, we should agree that any point x ∈ G_f with f(x) ≤ f(x*(ε)) also is an ε-solution. But this latter property is shared by any point which is an ε-solution in the sense of the above definition (look, f(x*(ε)) is nothing but f_{x*}(ε)), and we are done - our definition is justified! Of course, this is nothing but a speculation. What might, and what might not, be called a good approximate solution cannot be decided in advance; the definition should come from the real-world interpretation of the problem, not from inside Optimization Theory. What could happen with our definition in the case of a "bad" problem can be seen from the following example:

minimize x/(10^{-20} + x),   x ∈ G_f = [0, 1].

Here in order to find a solution with the value of the objective better than, say, 1/2 (note that the optimal value is 0) we should be at a distance of order 10^{-20} from the exact solution

x = 0. For our toy problem it is immediate, of course, to indicate the solution exactly, but think what happens if the same effect is met in the case of a multidimensional and nonpolyhedral G_f. We should note, anyhow, that problems like the one just presented are "intrinsically bad" (what is the problem?); in "good" situations our definition does work:

Exercise 3.6.3 # Prove that
1) if the function f involved in (f) is convex and continuous on G_f, then any ε-solution x to the problem satisfies the inequality

f(x) − min_{G_f} f ≤ ε (max_{G_f} f − min_{G_f} f),

i.e., solves the problem within relative accuracy ε;
2) if the function f involved in (f) is Lipschitz continuous on G_f with respect to a certain norm |·|, L_f being the corresponding Lipschitz constant, and if x is an ε-solution to (f), then

f(x) − min_{G_f} f ≤ Diam(G_f) L_f ε,

where Diam is the diameter of G_f with respect to the norm in question. Now - the end of the story.

Exercise 3.6.4 #∗ Prove the following statement: let a cutting plane method be applied to (f), and let Size(·) be a size. Assume that for certain N the method either terminates in course of the first N steps, or this is not the case, but Size(G_N) is smaller than Size(G_f). In the first of the cases the result produced by the method is a minimizer of f over G_f, and in the second case the N-th approximate solution x̄_N is well-defined and is an ε′-solution to (f) for any

ε′ > Size(G_N)/Size(G_f).

Exercise 3.6.5 # Write a code implementing the Ellipsoid version of the Cutting Plane scheme for (f). Use the code to find the best decay rate for the differential inclusion

ż(t) ∈ Q(z) ⊂ R^3, where Q(z) = Conv{α_1 z, ..., α_M z}

and the α_i, i = 1, 2, ..., M = 2^6 = 64, are the vertices of the polytope

P = { (1, p_12, p_13; p_21, 1, p_23; p_31, p_32, 1) : |p_ij| ≤ 0.1 }.
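For this exercise the only nontrivial ingredient besides the oracle is the ellipsoid update. A minimal sketch of the standard minimum-volume-ellipsoid update, with the ellipsoid parametrized as {c + Bu : |u| ≤ 1}, is given below; these are the textbook formulas, written down here as a convenience rather than copied from these notes.

    import numpy as np

    def ellipsoid_step(c, B, e):
        """One Ellipsoid-method step: given the ellipsoid {c + B u : |u| <= 1} and the cut
        (x - c)^T e <= 0, return the minimum-volume ellipsoid containing the half-ellipsoid."""
        n = c.size
        p = B.T @ e
        p = p / np.linalg.norm(p)
        c_new = c - (1.0 / (n + 1)) * (B @ p)
        B_new = (n / np.sqrt(n * n - 1.0)) * B \
            + (n / (n + 1.0) - n / np.sqrt(n * n - 1.0)) * np.outer(B @ p, p)
        return c_new, B_new

Combining this update with the generic loop sketched above (and with an oracle built from the Cholesky/eigenvector test of Example 4) gives a working, if unpolished, solver for the decay-rate problem.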

3.6.2 The method of outer simplex

Looking at the Ellipsoid method, one may ask: why should we restrict ourselves to ellipsoids and not use localizers of some other "simple" shape? This is a reasonable question. In fact all properties of the "ellipsoidal" shape which were important in the Ellipsoid version of the cutting plane scheme were as follows:
(a) all ellipsoids in R^n are affine equivalent to each other, and, in particular, are affine equivalent to a certain "standard" ellipsoid, the unit Euclidean ball; therefore in order to see what happens after a cut passing through the center of an ellipsoid, it suffices to study the simplest case when the ellipsoid is the unit Euclidean ball;
(b) the part of the unit Euclidean ball which is cut off the ball by a hyperplane passing through the center of the ball can be covered by an ellipsoid of volume less than that of the ball.
Now, can we point out another "simple shape" which satisfies the above requirements? The natural candidate is, of course, a simplex. All n-dimensional simplices are affine equivalent to each other, so that we have no problems with (a); the only bottleneck for simplices could be (b). The exercises below demonstrate that everything is ok with (b) as well. Let us start with considering the standard simplex (which now plays the role of the unit Euclidean ball).

Exercise 3.6.6 + Let

Δ = {x ∈ R^n | x ≥ 0, Σ_{i=1}^n x_i ≤ 1}

be the standard n-dimensional simplex in R^n, let

c = (1/(n+1), ..., 1/(n+1))^T

be the barycenter (≡ the center of gravity, see exercise 1.5) of the simplex, let

g = (γ_1, ..., γ_n)^T ≥ 0

be a nonzero nonnegative vector such that g^T c = 1, and let

Δ̄ = {x ∈ Δ | g^T x ≤ g^T c = 1}.

Prove that if ϑ ∈ [0, 1], then the simplex Δϑ with the vertices v0 =0,

v_i = (1 − ϑ + ϑγ_i)^{-1} e_i,   i = 1, ..., n,

(the e_i are the standard orths of the axes) contains Δ̄. Prove that under an appropriate choice of ϑ one can ensure that

Vol_n(Δ_ϑ) ≤ χ^n(n) Vol_n(Δ),   χ^n(n) = (1 + 1/(n² − 1))^{n−1} (1 − 1/(n + 1)) ≤ 1 − O(n^{-2}).    (3.6.41)

Exercise 3.6.7 Let D be a simplex in R^n, v_0, ..., v_n be the vertices of D,

w = (1/(n+1)) (v_0 + ... + v_n)

be the barycenter of D and g be a nonzero linear functional. Prove that the set

D̄ = {x ∈ D | (x − w)^T g ≤ 0}

can be covered by a simplex D′ such that

Vol_n(D′) ≤ χ^n(n) Vol_n(D),

with χ^n(n) given by (3.6.41). Based on this observation, construct a cutting plane method for convex problems with functional constraints where all localizers are simplices. What should be the associated size? What is the complexity of the method?

Hint: without loss of generality we may assume that the linear form g^T x attains its minimum over D at the vertex v_0 and that g^T(w − v_0) = 1. Choosing v_0 as our new origin and v_1 − v_0, ..., v_n − v_0 as the orths of our new coordinate axes, we come to the situation studied in exercise 3.6.6.

Note that the progress in volumes of the subsequent localizers in the method of outer simplex (i.e., the quantity χ^n(n) = 1 − O(n^{-2})) is worse than the quantity κ^n(n) = 1 − O(n^{-1}) in the Ellipsoid method. It does not, anyhow, mean that the former method is for sure worse than the latter one: in the Ellipsoid method, the actual progress in volumes always equals κ^n(n), while in the method of outer simplex the progress depends on what the cutting planes are; the quantity χ^n(n) is nothing but the worst-case bound for the progress, and the latter, for a given problem, may happen to be more significant.

Lecture 4

Large-scale optimization problems

Large-scale non-smooth convex problems, complexity bounds, subgradient descent algorithm, bundle methods

4.1 Goals and motivations

We now start a new topic - complexity and efficient methods of large-scale convex optimization. Thus, we come back to problems of the type

(p) minimize f(x) s.t. gi(x) ≤ 0,i=1, ..., m, x ∈ G,

where G is a given solid in R^n and f, g_1, ..., g_m are convex functions continuous on G. The family of all consistent problems of the indicated type was denoted by P_m(G), and we are interested in finding an ε-solution to a problem instance from the family, i.e., a point x ∈ G such that

f(x) − min_G f ≤ ε (max_G f − min_G f),   g_i(x) ≤ ε (max_G g_i)_+,   i = 1, ..., m.

We have shown that the complexity of the family in question satisfies the inequalities

O(1) n ln(1/ε) ≤ A(ε) ≤ 2.182 n ln(1/ε),    (4.1.1)

where O(1) is an appropriate positive absolute constant; what should be stressed is that the upper complexity bound holds true for all ε ∈ (0, 1), while the lower one is valid only for not too large ε, namely, for

ε < ε(G).

The "critical value" ε(G) depends, as we remember, on affine properties of G; for the box it is 1/2, and for any n-dimensional solid G one has

ε(G) ≥ 1/(2n³).


Thus, our complexity bounds identify complexity, up to an absolute constant factor, only for small enough values of ε; there is an initial interval of values of the relative accuracy

Δ(G) = [ε(G), 1)

where so far we have only an upper bound on the complexity and no lower bound. Should we be bothered by this incompleteness of our knowledge? I think we should. Indeed, the initial segment depends on G; if G is a box, then this segment is fixed once and for all, so that there, basically, is nothing to worry about - one hardly might be interested in solving optimization problems within relative inaccuracy ≥ 1/2, and for smaller ε we know the complexity. But if G is a more general set than a box, then there is something to think about: all we can say about an arbitrary n-dimensional G is that ε(G) ≥ 1/(2n³); this lower bound tends to 0 as the dimension n of the problem increases, so that for large n the segment Δ(G) can in fact "almost cover" the interval (0, 1) of all possible values of ε. On the other hand, when solving large-scale problems of real-world origin, we often are not interested in too high accuracy, and it may happen that the value of ε we actually are interested in lies exactly in Δ(G), where we do not know what the complexity is and what the optimal methods are. Thus, we have reasons, both of theoretical and practical origin, to be interested in the pre-asymptotic behaviour of the complexity. The difficulty in investigating the behaviour of the complexity in the initial range of values of the accuracy is that it depends on affine properties of the domain G, and this is something too diffuse for a quantitative description. This is why it is reasonable to restrict ourselves to certain "standard" domains G. We already know what happens when G is a parallelotope, or, which is the same, a box - in this case there, basically, is no initial segment. And, of course, the next interesting case is when G is an ellipsoid, or, which is the same, an Euclidean ball (all our notions are affine invariant, so to speak about Euclidean balls is the same as to speak about arbitrary ellipsoids). This is the case we shall focus on. In fact we shall assume G to be something like a ball rather than a ball exactly. Namely, let us fix a real α ≥ 1 and assume that the asphericity of G is at most α, i.e., there is a pair of concentric Euclidean balls V_in and V_out with the ratio of radii not exceeding α and such that the smaller ball is inside G, and the larger one contains G:

Vin ⊂ G ⊂ Vout.

4.2 The main result

Our main goal is to establish the following

Theorem 4.2.1 The complexity of the family P_m(G) of convex problems with m functional constraints on an n-dimensional domain G of asphericity α satisfies the bounds

min{n, ⌊(2αε)^{-2} − 0⌋} ≤ A(ε) ≤ 4⌈α²/ε²⌉ + 1,   0 < ε < 1;    (4.2.2)

here ⌊a − 0⌋ denotes the largest integer which is smaller than the real a.

Comment. Before proving the theorem, let us think about what the theorem says. First, it says that the complexity of convex minimization on a domain similar to an Euclidean ball is bounded from above uniformly in the dimension by a function O(1)α²ε^{-2}; the asphericity α is responsible for the level of similarity between the domain and a ball. Second, we see that in the large-scale case, when the dimension of the domain is large enough for given α and ε, or, which is the same, when the inaccuracy ε is large enough for a given dimension (and asphericity), namely, when

ε ≥ 1/(2α√n),    (4.2.3)

then the complexity admits a lower bound O(1)α^{-2}ε^{-2} which differs from the aforementioned upper bound by a factor O(1)α^{-4} which depends on asphericity only. Thus, in the large-scale case (4.2.3) our upper complexity bound coincides with the complexity up to a factor depending on asphericity only; if G is an Euclidean ball (α = 1), then this factor does not exceed 16. Now, our new complexity results combined with the initial results related to the case of small inaccuracies give us a basically complete description of the complexity in the case when G is an Euclidean ball. The graph of complexity in this case is as follows:


Figure 4. Complexity of convex minimization over an n-dimensional Euclidean ball: the whole range [1, ∞) of values of 1/ε can be partitioned into three segments:
- the initial segment [1, 2√n]; within this segment the complexity, up to an absolute constant factor, is ε^{-2}; at the right endpoint of the segment the complexity is equal to n; in this initial segment the complexity is independent of the dimension and is in fact defined by the affine geometry of G;
- the final segment [1/ε(G), ∞) = [2n, ∞); here the complexity, up to an absolute constant factor, is n ln(1/ε), this is the standard asymptotics known to us; in this final segment the complexity forgets everything about the geometry of G;
- the intermediate segment [2√n, 2n]; at the left endpoint of this segment the complexity is O(n), at the right endpoint it is O(n ln n); within this segment we know the complexity up to a factor of order of its logarithm rather than up to an absolute constant factor.

Now let us prove the theorem.

4.3 Upper complexity bound: Subgradient Descent

The upper complexity bound is associated with one of the most traditional methods for nonsmooth convex minimization - the short-step Subgradient Descent method. The generic scheme of the method is as follows: given a starting point x_1 ∈ int G and the required relative accuracy ε ∈ (0, 1), form the sequence

x_{i+1} = x_i − γ_i R e_i,    (4.3.4)

where γ_i > 0 is a certain stepsize, R is the radius of the Euclidean ball V_out covering G, and the unit vector e_i is defined as follows:
(i) if x_i ∉ int G, then e_i is an arbitrary unit vector which separates x_i and G:

(x − x_i)^T e_i ≤ 0,   x ∈ G;

(ii) if x_i ∈ int G, but there is a constraint g_j which is "ε-violated" at x_i, i.e., is such that

g_j(x_i) > ε ( max_{x ∈ G} { g_j(x_i) + (x − x_i)^T g_j'(x_i) } )_+,    (4.3.5)

then

e_i = g_j'(x_i) / |g_j'(x_i)|;

(iii) if x_i ∈ int G and no constraint is ε-violated at x_i, i.e., no inequality (4.3.5) is satisfied, then

e_i = f'(x_i) / |f'(x_i)|.

Note that the last formula makes sense only if f'(x_i) ≠ 0; if in the case of (iii) we meet with f'(x_i) = 0, then we simply terminate and claim that x_i is the result of our activity. Same as in the cutting plane scheme, let us say that the search point x_i is productive if at the i-th step we meet the case (iii), and non-productive otherwise, and let us define the i-th approximate solution x̄_i as the best (with the smallest value of the objective) of the productive search points generated in course of the first i iterations (if no productive search point is generated, x̄_i is undefined). The efficiency of the method is given by the following.
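A compact sketch of this scheme in Python is given below; the separation, constraint and objective oracles are passed in as callables (their exact interfaces are my own choice), and the constant stepsize γ_i ≡ ε/α suggested by the analysis that follows is used.

    import numpy as np

    def short_step_subgradient_descent(x1, R, alpha, eps, separate, constraints, f_oracle, N):
        """Short-step Subgradient Descent (4.3.4) with gamma_i = eps/alpha.

        separate(x)    -> None if x in int G, else a unit vector separating x from G
        constraints(x) -> iterable of (g_j(x), g_j'(x), max over G of the linearization of g_j at x)
        f_oracle(x)    -> (f(x), f'(x))
        """
        x, best_x, best_f = x1.copy(), None, np.inf
        gamma = eps / alpha
        for _ in range(N):
            e = separate(x)                                   # case (i): x outside int G
            if e is None:
                for g_val, g_sub, lin_max in constraints(x):  # case (ii): eps-violated constraint?
                    if g_val > eps * max(lin_max, 0.0):
                        e = g_sub / np.linalg.norm(g_sub)
                        break
            if e is None:                                     # case (iii): productive step
                f_val, f_sub = f_oracle(x)
                if np.linalg.norm(f_sub) == 0.0:
                    return x, f_val                           # exact minimizer found
                if f_val < best_f:
                    best_x, best_f = x.copy(), f_val
                e = f_sub / np.linalg.norm(f_sub)
            x = x - gamma * R * e
        return best_x, best_f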

Proposition 4.3.1 Let a problem (p) from the family P_m(G) be solved by the short-step Subgradient Descent method associated with accuracy ε, and let N be a positive integer such that

(2 + (1/2) Σ_{j=1}^N γ_j²) / (Σ_{j=1}^N γ_j) < ε/α.    (4.3.6)

Then either the method terminates in course of N steps with the result being an ε-solution to (p), or x̄_N is well-defined and is an ε-solution to (p).

In particular, if γ_i ≡ ε/α, then (4.3.6) is satisfied by

N ≡ N(ε) = 4⌈α²/ε²⌉ + 1,

and with the indicated choice of the stepsizes we can terminate the method after the N-th step; the resulting method solves any problem from the class within relative accuracy ε with the complexity N(ε), which is exactly the upper complexity bound stated in Theorem 4.2.1.

Proof. Let me make the following crucial observation: let us associate with the method the localizers

G_i = {x ∈ G | (x − x_j)^T e_j ≤ 0, 1 ≤ j ≤ i}.    (4.3.7)

Then the presented method fits our generic cutting plane scheme for problems with functional constraints, up to the fact that the G_i now need not be solids (they may possess empty interior or even be empty) and x_i need not be an interior point of G_{i-1}. But these particularities were not used in the proof of the general proposition on the rate of convergence of the scheme (Proposition 3.3.1), and in fact there we have proved the following:

Proposition 4.3.2 Assume that we are generating a sequence of search points x_i ∈ R^n and associate with these points vectors e_i and approximate solutions x̄_i in accordance with (i)-(iii). Let the sets G_i be defined by the pairs (x_i, e_i) according to (4.3.7), and let Size be a size. Assume that in course of N steps we either terminate due to vanishing of the subgradient of the objective at a productive search point, or this is not the case, but

Size(GN ) <εSize(G)

(if GN is not a solid, then, by definition, Size(GN )=0). In the first case the result formed at the termination is an ε-solution to the problem; in the second case such a solution is x¯N (which is for sure well-defined). Now let us apply this latter proposition to our short-step Subgradient Descent method and to the size

InnerRad(Q)=max{r | Q contains an Euclidean ball of radius r}.

We know in advance that G contains an Euclidean ball V_in of radius R/α, so that

InnerRad(G) ≥ R/α. (4.3.8)

Now let us estimate from above the size of the i-th localizer G_i, provided that the localizer is well-defined (i.e., that the method did not terminate in course of the first i steps due to vanishing of the subgradient of the objective at a productive search point). Assume that G_i

contains an Euclidean ball V of a certain radius r > 0, and let x^+ be the center of the ball. Since V is contained in G_i, we have

(x − x_j)^T e_j ≤ 0,   x ∈ V, 1 ≤ j ≤ i,

whence

(x^+ − x_j)^T e_j + h^T e_j ≤ 0,   |h| ≤ r, 1 ≤ j ≤ i,

and since e_j is a unit vector, we come to

(x^+ − x_j)^T e_j ≤ −r,   1 ≤ j ≤ i.    (4.3.9)

Now let us write down the cosine theorem:

|x_{j+1} − x^+|² = |x_j − x^+|² + 2(x_j − x^+)^T (x_{j+1} − x_j) + |x_{j+1} − x_j|² =

= |x_j − x^+|² + 2γ_j R (x^+ − x_j)^T e_j + γ_j² R² ≤ |x_j − x^+|² − 2γ_j R r + γ_j² R².

We see that the squared distance from x_j to x^+ decreases with j by at least 2γ_j R r − γ_j² R² at each step; since the squared distance cannot become negative, we come to

Σ_{j=1}^i R (2rγ_j − γ_j² R) ≤ |x_1 − x^+|² ≤ 4R²

(we have used the fact that G is contained in the Euclidean ball V_out of radius R). Thus, we come to the estimate

r ≤ ( (2 + (1/2) Σ_{j=1}^i γ_j²) / (Σ_{j=1}^i γ_j) ) R.

This bound holds for the radius r of an arbitrary Euclidean ball contained in G_i, and we come to

InnerRad(G_i) ≤ ( (2 + (1/2) Σ_{j=1}^i γ_j²) / (Σ_{j=1}^i γ_j) ) R.    (4.3.10)

Combining this inequality with (4.3.8), we come to

InnerRad(G_i)/InnerRad(G) ≤ ( (2 + (1/2) Σ_{j=1}^i γ_j²) / (Σ_{j=1}^i γ_j) ) α,    (4.3.11)

and due to the definition of N, we come to

InnerRad(G_N)/InnerRad(G) < ε.

Thus, the conclusion of the Theorem follows from Proposition 4.3.2.

4.4 The lower bound

The lower bound in (4.2.2) is given by a simple reasoning which is in fact already known to us. Due to similarity reasons, we without loss of generality may assume that G is contained in the Euclidean ball of radius R = 1/2 centered at the origin and contains the ball of radius r = 1/(2α) with the same center. It, of course, suffices to establish the lower bound for the case of problems without functional constraints. Besides this, due to monotonicity of the complexity in ε, it suffices to prove that if ε ∈ (0, 1) is such that

M = ⌊(2αε)^{-2} − 0⌋ ≤ n,

then the complexity A(ε) is at least M. Assume that this is not the case, so that there exists a method M which solves all problems from the family in question in no more than M − 1 steps. We may assume that M solves any problem exactly in M steps, and the result always is the last search point. Let us set

δ = 1/(2α√M) − ε,

so that δ > 0 by definition of M. Now consider the family F_0 comprised of the functions

f(x) = max_{1≤i≤M} (ξ_i x^i + d_i),

where ξ_i = ±1 and 0 < d_i ≤ 2^{-1} δ.

From P_k it follows that the (k + 1)-th search point x_{k+1} generated by M as applied to a function from the family F_k is independent of the function. At the step k + 1 we
- find the index i_{k+1} of the largest in absolute value of the coordinates of x_{k+1} with indices different from i_1, ..., i_k,
- define ξ*_{i_{k+1}} as the sign of that coordinate,
- set d*_{i_{k+1}} = 2^{-(k+1)} δ, and
- define F_{k+1} as the set of those functions from F_k for which ξ_{i_{k+1}} = ξ*_{i_{k+1}}, d_{i_{k+1}} = d*_{i_{k+1}} and d_i ≤ 2^{-(k+2)} δ for i different from i_1, ..., i_{k+1}.
It is immediately seen that the resulting family satisfies the predicate P_{k+1}, and we may proceed in the same manner. Now let us look at what will be found after M steps of the construction. We will end up with a family F_M which consists of exactly one function

f = max_{1≤i≤M} (ξ*_i x^i + d*_i),

such that f is positive along the sequence x_1, ..., x_M of search points generated by M as applied to the function. On the other hand, G contains the ball of radius r = 1/(2α) centered at the origin, and, consequently, contains the point

x* = − Σ_{i=1}^M ( ξ*_i / (2α√M) ) e_i,

the e_i being the basic orths in R^n. We clearly have

f* ≡ min_{x∈G} f(x) ≤ f(x*) < −1/(2α√M) + δ ≤ −ε

(the concluding inequality follows from the definition of δ). On the other hand, f clearly is Lipschitz continuous with constant 1 on G, and G is contained in the Euclidean ball of radius 1/2, so that the variation max_G f − min_G f of f over G is ≤ 1. Thus, we have

f(x_M) − f* > 0 − (−ε) = ε ≥ ε (max_G f − min_G f);

since, by construction, x_M is the result obtained by M as applied to f, we conclude that M does not solve the problem f within relative accuracy ε, which is the desired contradiction with the origin of M.

4.5 Subgradient Descent for Lipschitz-continuous convex problems

For the sake of simplicity we restrict ourselves to problems without functional constraints:

(f) minimize f(x) s.t. x ∈ G ⊂ R^n.

From now on we assume that G is a closed and bounded convex subset in Rn, possibly, with empty interior, and that the objective is convex and Lipschitz continuous on G:

|f(x) − f(y)| ≤ L(f) |x − y|,   x, y ∈ G,

where L(f) < ∞ and |·| is the usual Euclidean norm in R^n. Note that the subgradient set of f at any point from G is nonempty and contains subgradients of norms not exceeding L(f); from now on we assume that the oracle in question reports such a subgradient at any input point x ∈ G. We would like to solve the problem within absolute inaccuracy ≤ ε, i.e., to find x ∈ G such that

f(x) − f* ≡ f(x) − min_G f ≤ ε.

The simplest way to solve the problem is to apply the standard Subgradient Descent method which generates the sequence of search points {x_i}_{i=1}^∞ according to the rule

x_{i+1} = π_G(x_i − γ_i g(x_i)),   g(x) = f'(x)/|f'(x)|,    (4.5.12)

where x_1 ∈ G is a certain starting point, γ_i > 0 are positive stepsizes and

π_G(x) = argmin{|x − y| : y ∈ G}

is the standard projector onto G. Of course, if we meet a point with f'(x) = 0, we terminate with an optimal solution at hand; from now on I ignore this trivial case. As always, the i-th approximate solution x̄_i found by the method is the best - with the smallest value of f - of the search points x_1, ..., x_i; note that all these points belong to G. It is easy to investigate the rate of convergence of the aforementioned routine. To this end let x* be the optimal solution to the problem closest to x_1, and let

d_i = |x_i − x*|.

We are going to see how di vary. To this end let us start with the following simple and important observation (cf Exercise 4.7.3):

Lemma 4.5.1 Let x ∈ R^n, and let G be a closed convex subset in R^n. Under projection onto G, x becomes closer to any point u of G, namely, the squared distance from x to u decreases at least by the squared distance from x to G:

|π_G(x) − u|² ≤ |x − u|² − |x − π_G(x)|².    (4.5.13)

From Lemma 4.5.1 it follows that

d_{i+1}² ≡ |x_{i+1} − x*|² = |π_G(x_i − γ_i g(x_i)) − x*|² ≤ |x_i − γ_i g(x_i) − x*|² =

= |x_i − x*|² − 2γ_i (x_i − x*)^T f'(x_i)/|f'(x_i)| + γ_i² ≤ d_i² − 2γ_i (f(x_i) − f*)/|f'(x_i)| + γ_i²

(the concluding inequality is due to the convexity of f). Thus, we come to the recurrence

d_{i+1}² ≤ d_i² − 2γ_i (f(x_i) − f*)/|f'(x_i)| + γ_i²;    (4.5.14)

in view of the evident inequality

f(x_i) − f* ≥ f(x̄_i) − f* ≡ ε_i,

and since |f'(x)| ≤ L(f), the recurrence implies that

d_{i+1}² ≤ d_i² − 2γ_i ε_i/L(f) + γ_i².    (4.5.15)

The latter inequality allows us to make several immediate conclusions.

1) From (4.5.15) it follows that

2 Σ_{i=1}^N γ_i ε_i ≤ L(f) ( d_1² + Σ_{i=1}^N γ_i² );

since the ε_i clearly do not increase with i, we come to

ε_N ≤ ν_N ≡ L(f) ( |x_1 − x*|² + Σ_{i=1}^N γ_i² ) / ( 2 Σ_{i=1}^N γ_i ).    (4.5.16)

The right hand side in this inequality clearly tends to 0 as N → ∞, provided that

Σ_{i=1}^∞ γ_i = ∞,   γ_i → 0, i → ∞

(why?), which gives us a certain general statement on convergence of the method as applied to a Lipschitz continuous convex function; note that we did not use the fact that G is bounded. Of course, we would like to choose the stepsizes resulting in the best possible estimate (4.5.16). Note that our basic recurrence (4.5.14) implies that for any N ≥ M ≥ 1 one has

2 ε_N Σ_{i=M}^N γ_i ≤ L(f) ( d_M² + Σ_{i=M}^N γ_i² ) ≤ L(f) ( D² + Σ_{i=M}^N γ_i² ).

Whence

ε_N ≤ L(f) ( D² + Σ_{i=M}^N γ_i² ) / ( 2 Σ_{i=M}^N γ_i ),

with D being an a priori upper bound on the diameter of G. With M = N/2 and

γ_i = D i^{-1/2},    (4.5.17)

the right hand side of the latter inequality does not exceed O(1) L(f) D N^{-1/2}. This way we come to the optimal, up to an absolute constant factor, estimate

ε_N ≤ O(1) L(f) D / √N,   N = 1, 2, ...    (4.5.18)

(O(1) is an easily computable absolute constant). I call this rate optimal, since the lower complexity bound of Section 4.4 says that if G is an Euclidean ball of diameter D in R^n and L is a given constant, then the complexity of minimizing over G, within absolute accuracy ε, an arbitrary convex function f which is Lipschitz continuous with constant L is at least

min{ n; O(1) (LD/ε)² },

so that in the large-scale case, when

n ≥ (LD/ε)²,

the lower complexity bound coincides, within an absolute constant factor, with the upper bound given by (4.5.18). Thus, we can choose the stepsizes γ_i according to (4.5.17) and obtain a dimension-independent rate of convergence (4.5.18); this rate of convergence does not admit a "significant" improvement uniform in the dimension, provided that G is an Euclidean ball.

2) The stepsizes (4.5.17) are theoretically optimal and more or less reasonable from the practical viewpoint, provided that you deal with a domain G of reasonable diameter, i.e., a diameter of the same order of magnitude as the distance from the starting point to the optimal set. If the latter assumption is not satisfied (as is often the case), the stepsizes should be chosen more carefully. A reasonable idea here is as follows. Our rate-of-convergence proof in fact was based on the very simple relation

d_{i+1}² ≤ d_i² − 2γ_i (f(x_i) − f*)/|f'(x_i)| + γ_i²;

let us choose as γ_i the quantity which results in the strongest possible inequality of this type, namely, the one which minimizes the right hand side:

γ_i = (f(x_i) − f*) / |f'(x_i)|.    (4.5.19)

Of course, this choice is possible only when we know the optimal value f*. Sometimes this is not a problem, e.g., when we reduce a system of convex inequalities

f_i(x) ≤ 0,   i = 1, ..., m,

to the minimization of

f(x) = max_i f_i(x);

here we can take f* = 0. In more complicated cases people use some on-line estimates of f*; I would not like to go into details, so I assume that f* is known in advance. With the stepsizes (4.5.19) (proposed many years ago by B.T. Polyak) our recurrence becomes

d_{i+1}² ≤ d_i² − (f(x_i) − f*)² |f'(x_i)|^{-2} ≤ d_i² − ε_i² L^{-2}(f),

whence Σ_{i=1}^N ε_i² ≤ L²(f) d_1², and we immediately come to the estimate

ε_N ≤ L(f) |x_1 − x*| N^{-1/2}.    (4.5.20)

This estimate seems to be the best one, since it involves the actual distance |x_1 − x*| to the optimal set rather than the diameter of G; in fact G might even be unbounded. Typically, whenever one can use the Polyak stepsizes, this is the best possible tactics for the Subgradient Descent method.
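For reference, a minimal sketch of the scheme (4.5.12) with the Polyak stepsizes (4.5.19), assuming that the optimal value f* is available and that a projector onto G is given as a callable, is:

    import numpy as np

    def polyak_subgradient_descent(x1, f_oracle, project_G, f_star, N):
        """Projected Subgradient Descent (4.5.12) with Polyak stepsizes (4.5.19).

        f_oracle(x)  -> (f(x), f'(x));   project_G(x) -> projection of x onto G.
        Returns the best point found and its objective value.
        """
        x = project_G(np.asarray(x1, dtype=float))
        best_x, best_f = x, np.inf
        for _ in range(N):
            f_val, g = f_oracle(x)
            if f_val < best_f:
                best_x, best_f = x, f_val
            gnorm = np.linalg.norm(g)
            if gnorm == 0.0:
                return x, f_val                       # x is optimal
            gamma = (f_val - f_star) / gnorm          # Polyak stepsize (4.5.19)
            x = project_G(x - gamma * (g / gnorm))    # step along the normalized subgradient
        return best_x, best_f

For a box or a Euclidean ball the projector is a one-line clip or rescaling, so each iteration indeed costs O(n) on top of the oracle call.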

∗ −1/2 εN ≡ min f(xi) − f ≤ O(1)L(f)D(G)N , (4.5.21) 1≤i≤N where O(1) is a moderate absolute constant, L(f) is the Lipschitz constant of f and D(G)is the diameter of G. If the optimal value of the problem is known, then one can use stepsizes ∗ which allow to replace D(G) by the distance |x1 − x | from the starting point to the optimal set; in this latter case, G should not necessarily be bounded. And the rate of convergence is optimal, I mean, it cannot be improved by more than an absolute constant factor, provided that G is an n-dimensional Euclidean ball and n>N. Note also that if G is a ”simple” set, say, an Euclidean ball, or a box, or the standard simplex n { ∈ n | } x R+ xi =1 , i=1 then the method is computationally very cheap - a step costs only O(n) operations in addition to those spent by the oracle. Theoretically all it looks perfect. It is not a problem to speak about an upper accuracy bound O(N −1/2) and about optimality of this bound in the large scale case; but in practice such a rate of convergence would result in thousands of steps, which is too much for the majority of applications. Note that in practice we are interested in ”typical” complexity of a method rather than in its worst case complexity and worst case optimality. And from this practical viewpoint the Subgradient Descent is far from being optimal: there are other methods with the same worst case theoretical complexity bound, but with significantly better ”typical” performance; needless to say that these methods are more preferable in actual computations. What we are about to do is to look at a certain family of methods of this latter type.

4.6 Bundle methods

Common sense says that the weak point of the Subgradient Descent is that when running the method, we almost lose previous information on the objective; the whole "prehistory" is compressed into the current iterate (this was not the case with the cutting plane methods,

where the prehistory was memorized in the current localizer). Generally speaking, what do we actually know about the objective after we have formed a sequence of search points x_j ∈ G, j = 1, ..., i? All we know is the bundle - the sequence of affine forms

f(x_j) + (x − x_j)^T f'(x_j)

f_i(x) = max_{1≤j≤i} { f(x_j) + (x − x_j)^T f'(x_j) }.

This model underestimates the objective:

fi(x) ≤ f(x),x∈ G, (4.6.22)

and is exact at the points x1, ..., xi:

fi(xj)=f(xj),j=1, ..., i. (4.6.23)

And once again - the model accumulates all our knowledge obtained so far; e.g., the infor- mation we possess does not contradict the hypothesis that the model is exact everywhere. Since the model accumulates the whole prehistory, it is reasonable to formulate the search rules for a method in terms of the model. The most natural and optimistic idea is to trust in the model completely and to take, as the next search point, the minimizer of the model:

x_{i+1} ∈ Argmin_G f_i.

This is the Kelley cutting plane method - the very first method proposed for nonsmooth convex optimization. The idea is very simple - if we are lucky and the model is good everywhere, not only along the previous search points, we would improve significantly the best value of the objective found so far. On the other hand, if the model is bad, then it will be corrected at the "right" place. From compactness of G one can immediately derive that the method does converge and is even finite if the objective is piecewise linear. Unfortunately, it turns out that the rate of convergence of the method is a disaster; one can demonstrate that the worst-case number of steps required by the Kelley method to solve a problem f within absolute inaccuracy ε (G is the unit n-dimensional ball, L(f) = 1) is at least

O(1) (1/ε)^{(n−1)/2}.

We see how dangerous it is to be too optimistic, and it is clear why: even in the case of a smooth objective the model is close to the objective only in a neighborhood of the search points; until the number of these points becomes very large, this neighborhood covers a "negligible" part of the domain G, so that the global characteristic of the model - its minimizer - is very

unstable and, until the termination phase, has little in common with the actual optimal set. It should be noted that the Kelley method in practice is much better than one could think looking at its worst-case complexity (a method with "practical" complexity like this estimate simply could not be used even in dimension 10), but the qualitative conclusions from the estimate are more or less valid also in practice - the Kelley method sometimes is too slow. A natural way to improve the Kelley method is as follows. We can only hope that the model approximates the objective in a neighborhood of the search points. Therefore it is reasonable to enforce the next search point to be not too far from the previous ones, more exactly, from the "most promising", the best of them, since the latter, as the method goes on, hopefully will become close to the optimal set. To forbid the new iterate to move far away, let us choose x_{i+1} as the minimizer of the penalized model:

x_{i+1} = argmin_G { f_i(x) + (d_i/2) |x − x_i^+|² },

where x_i^+ is what is called the current prox center, and the prox coefficient d_i > 0 is a certain parameter. When d_i is large, we enforce x_{i+1} to be close to the prox center, and when it is small, we act almost as in the Kelley method. What is displayed is the generic form of the bundle methods; to specify a method from this family, one needs to indicate the policies of updating the prox centers and the prox coefficients. There are a number of reasonable policies of this type, and among these policies there are those resulting in methods with very good practical performance. I would not like to go into details here; let me say only that, first, the best theoretical complexity estimate for the traditional bundle methods is something like O(ε^{-3}); although non-optimal, this upper bound is incomparably better than the lower complexity bound for the method of Kelley. Second, there is a more or less unique reasonable policy of updating the prox center, in contrast to the policy for updating the prox coefficient. Practical performance of a bundle algorithm heavily depends on this latter policy, and sensitivity to the prox coefficient is, in a sense, the weak point of the bundle methods. Indeed, even without appealing to computational experience we can guess in advance that the scheme should be sensitive to d_i - since in the limiting cases of zero and infinite prox coefficient we get, respectively, the Kelley method, which can be slow, and the "method" which simply does not move from the initial point. Thus, both small and large prox coefficients are forbidden; and it is unclear how to choose the "golden middle" - our information has nothing in common with any quadratic terms in the model, these terms are invented by us.
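One computational kernel shared by the Kelley method and by the Level method presented next is minimization of the piecewise-linear model f_i over G. For a box-shaped G this is a small linear program; a sketch using scipy's linprog (the function name and data layout are mine) is:

    import numpy as np
    from scipy.optimize import linprog

    def minimize_model(points, f_vals, subgrads, box):
        """Minimize f_i(x) = max_j [ f(x_j) + (x - x_j)^T f'(x_j) ] over a box via the
        epigraph LP:  min t  s.t.  f(x_j) + (x - x_j)^T f'(x_j) <= t,  x in box.

        points, subgrads -- arrays of shape (k, n); f_vals -- array of shape (k,);
        box              -- list of (lo, hi) pairs, one per coordinate.
        """
        n = points.shape[1]
        c = np.zeros(n + 1); c[-1] = 1.0                      # objective: the epigraph variable t
        A_ub = np.hstack([subgrads, -np.ones((len(points), 1))])
        b_ub = np.array([g @ xj - fj for xj, fj, g in zip(points, f_vals, subgrads)])
        bounds = list(box) + [(None, None)]                   # box for x, t free
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
        return res.x[:n], res.fun                             # minimizer of the model and f_i^-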

4.6.1 The Level method

We now present the Level method, a rather recent method from the bundle family due to Lemarechal, Nemirovski and Nesterov (1991). This method, in a sense, is free from the aforementioned shortcomings of the traditional bundle scheme. Namely, the method possesses the optimal complexity bound O(ε^{-2}), and the difficulty with tuning the prox coefficient in the method is resolved in a funny way - this problem does not occur at all.

To describe the method, we introduce several simple quantities. Given i-th model fi(·), we can compute its optimum, same as in the Kelley method; but now we are interested not in the point where the optimum is attained, but in the optimal value

f_i^- = min_G f_i

of the model. Since the model underestimates the objective, the quantity f_i^- is a lower bound for the actual optimal value; and since the models clearly increase with i at every point, their minima also increase, so that

f_1^- ≤ f_2^- ≤ ... ≤ f*.    (4.6.24)

On the other hand, let f_i^+ be the best value of the objective found so far:

f_i^+ = min_{1≤j≤i} f(x_j) = f(x̄_i),    (4.6.25)

where x̄_i is the best (with the smallest value of the objective) of the search points generated so far. The quantities f_i^+ clearly decrease with i and overestimate the actual optimal value:

f_1^+ ≥ f_2^+ ≥ ... ≥ f*.    (4.6.26)

It follows that the gaps

Δ_i = f_i^+ − f_i^-

are nonnegative and nonincreasing and bound from above the inaccuracy of the best approximate solutions found so far:

f(x̄_i) − f* ≤ Δ_i,   Δ_1 ≥ Δ_2 ≥ ... ≥ 0.    (4.6.27)

Now let me describe the method. Its i-th step is as follows:
1) solve the piecewise linear problem

minimize fi(x) s.t. x ∈ G

to get the minimum f_i^- of the i-th model;
2) form the level

l_i = (1 − λ) f_i^- + λ f_i^+ ≡ f_i^- + λ Δ_i,

λ ∈ (0, 1) being the parameter of the method (normally, λ = 1/2), and define the new iterate x_{i+1} as the projection of the previous one, x_i, onto the level set

Qi = {x ∈ G | fi(x) ≤ li} of the i-th model, the level set being associated with li:

x_{i+1} = π_{Q_i}(x_i).    (4.6.28)
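Putting the pieces together, one iteration of the Level method can be sketched as follows; minimize_model_over_G is the LP kernel discussed above and project_to_level_set is an assumed QP routine computing the projection (4.6.28) - both are taken here as black boxes supplied by the user.

    import numpy as np

    def level_method(x1, f_oracle, minimize_model_over_G, project_to_level_set,
                     lam=0.5, eps=1e-6, max_iter=1000):
        """The Level method: maintain the bundle, the gap Delta_i = f_i^+ - f_i^-, and
        project the current iterate onto a level set of the model (steps 1) and 2)).

        minimize_model_over_G(pts, vals, subs)          -> f_i^-  (assumed LP routine)
        project_to_level_set(x, pts, vals, subs, level) -> pi_{Q_i}(x)  (assumed QP routine)
        """
        x = np.asarray(x1, dtype=float)
        pts, vals, subs = [], [], []
        best_f = np.inf
        for _ in range(max_iter):
            f_val, g = f_oracle(x)
            pts.append(x.copy()); vals.append(f_val); subs.append(g)
            best_f = min(best_f, f_val)                           # f_i^+
            f_minus = minimize_model_over_G(pts, vals, subs)      # f_i^-   (step 1)
            gap = best_f - f_minus                                # Delta_i >= f(xbar_i) - f*
            if gap <= eps:
                break
            level = f_minus + lam * gap                           # l_i     (step 2)
            x = project_to_level_set(x, pts, vals, subs, level)   # x_{i+1} = pi_{Q_i}(x_i)
        i_best = int(np.argmin(vals))
        return pts[i_best], best_f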

Computationally, the method requires solving two auxiliary problems at each iteration. The first is to minimize the model in order to compute f_i^-; this problem arises in the Kelley method and does not arise in the bundle ones. The second auxiliary problem is to project x_i onto Q_i; this is, basically, the same quadratic problem which arises in bundle methods and does not arise in the Kelley one. If G is a polytope, which normally is the case, the first of these auxiliary problems is a linear program, and the second is a convex linearly constrained quadratic program; to solve them, one can use the traditional efficient simplex-type technique. Let me note that the method actually belongs to the bundle family, and that for this method the prox center always is the last iterate. To see this, let us look at the solution

x(d) = argmin_{x∈G} { f_i(x) + (d/2) |x − x_i|² }

of the auxiliary problem arising in the bundle scheme as a function of the prox coefficient d. It is clear that x(d) is the point of the set {x ∈ G | f_i(x) ≤ f_i(x(d))} closest to x_i, so that x(d) is the projection of x_i onto the level set

{x ∈ G | f_i(x) ≤ l_i(d)}

of the i-th model associated with the level l_i(d) = f_i(x(d)) (this latter relation gives us a certain equation which relates d and l_i(d)). As d varies from 0 to ∞, x(d) moves along a certain path which starts at the point of the optimal set of the i-th model closest to x_i and ends at the prox center x_i; consequently, the level l_i(d) varies from f_i^- to f(x_i) ≥ f_i^+, and therefore, for a certain value d_i of the prox coefficient, we have l_i(d_i) = l_i and, consequently, x(d_i) = x_{i+1}. Note that the only goal of this reasoning was to demonstrate that the Level method does belong to the bundle scheme and corresponds to a certain implicit control of the prox coefficient; this control exists, but is completely uninteresting for us, since the method does not require knowledge of d_i. Now let me formulate and prove the main result on the method.

Theorem 4.6.1 Let the Level method be applied to a convex problem (f) with Lipschitz continuous, with constant L(f), objective f and with a closed and bounded convex domain G of diameter D(G). Then the gaps Δ_i converge to 0; namely, for any positive ε one has

i > c(λ) (L(f) D(G)/ε)²  ⇒  Δ_i ≤ ε,    (4.6.29)

where

c(λ) = 1 / ( (1 − λ)² λ (2 − λ) ).

In particular,

i > c(λ) (L(f) D(G)/ε)²  ⇒  f(x̄_i) − f* ≤ ε.

Proof. The "in particular" part is an immediate consequence of (4.6.29) and (4.6.27), so that all we need is to verify (4.6.29). To this end assume that N is such that Δ_N > ε, and let us bound N from above.

1⁰. Let us partition the set I = {1, 2, ..., N} of iteration indices into groups I_1, ..., I_k as follows. The first group ends with the index i(1) ≡ N and contains all indices i ≤ i(1) such that

Δ_i ≤ (1 − λ)^{-1} Δ_N ≡ (1 − λ)^{-1} Δ_{i(1)};

since, as we know, the gaps never increase, I_1 is a certain final segment of I. If it differs from the whole of I, we define i(2) as the largest of those i ∈ I which are not in I_1, and define I_2 as the set of all indices i ≤ i(2) such that

−1 Δi ≤ (1 − λ) Δi(2).

I2 is certain preceding I1 segment in I. If the union of I1 and I2 is less than I, we define i(3) as the largest of those indices in I which are not in I1 ∪ I2, and define I3 as the set of those indices i ≤ i(3) for which

−1 Δi ≤ (1 − λ) Δi(3), and so on. With this process, we partition the set I of iteration indices into sequential segments I1,..,Ik (Is follows Is+1 in I). The last index in Is is i(s), and we have

−1 Δi(s+1) > (1 − λ) Δi(s),s=1, ..., k − 1 (4.6.30)

(indeed, if the opposite inequality would hold, then i(s+1) would be included into the group Is, which is not the case). 20. The main (and very simple) observation is as follows: the level sets Qi of the models corresponding to certain group of iterations Is have a point in common, namely, the minimizer, us, of the last, i(s)-th, model from the group. Indeed, since the models increase with i, and the best found so far values of the objective decrease with i, for all i ∈ Is one has ≤ − + − ≤ + − ≤ + − − ≡ fi(us) fi(s)(us)=fi(s) = fi(s) Δi(s) fi Δi(s) fi (1 λ)Δi li

−1 (the concluding ≤ in the chain follows from the fact that i ∈ Is,sothatΔi ≤ (1−λ) Δi(s)). 0 3 . The above observation allows to estimate from above the number Ns of iterations in the group Is. Indeed, since xi+1 is the projection of xi onto Qi and us ∈ Qi for i ∈ Is,we conclude from Lemma 4.5.1 that

2 2 2 |xi+1 − us| ≤|xi − us| −|xi − xi+1| ,i∈ Is, whence, denoting by j(s) the first element in Is, 2 2 2 |xi − xi+1| ≤|xj(s) − us| ≤ D (G). (4.6.31) i∈Is 90 LECTURE 4. LARGE-SCALE OPTIMIZATION PROBLEMS

Now let us estimate from below the steplengths |xi − xi+1|. At the point xi the i-th model fi ≥ + equals to f(xi) and is therefore fi , and at the point xi+1 the i-th model is, by construction + − − of xi+1, less or equal (in fact is equal) to li = fi (1 λ)Δi. Thus, when passing from xi to xi+1,thei-th model varies at least by the quantity (1 − λ)Δi, which is, in turn, at least (1 − λ)Δi(s) (the gaps may decrease only!). On the other hand, fi clearly is Lipschitz continuous with the same constant L(f) as the objective (recall that, according to our assumption, the oracle reports subgradients of f of the norms not exceeding L(f)). Thus, at the segment [xi,xi+1] the Lipschitz continuous with constant L(f) function fi varies at least by (1 − λ)Δi(s), whence

−1 |xi − xi+1|≥(1 − λ)Δi(s)L (f).

From this inequality and (4.6.31) we conclude that the number Ns of iterations in the group Is satisfies the estimate ≤ − −2 2 2 −2 Ns (1 λ) L (f)D (G)Δi(s). 0 −s+1 4 .WehaveΔi(1) >ε(the origin of N)andΔi(s) > (1 − λ) Δi(1) (see (4.6.30)), so that the above estimate of Ns results in

−2 2 2 2(s−1) −2 Ns ≤ (1 − λ) L (f)D (G)(1 − λ) ε , whence 2 2 −2 N = Ns ≤ c(λ)L (f)D (G)ε , s as claimed.

4.6.2 Concluding remarks The theorem we have proved says that the level method is optimal in complexity, provided that G is an Euclidean ball of a large enough dimension. In fact there exists “computational evidence”, based on many numerical tests, that the method is also optimal in complexity in a fixed dimension. Namely, as applied to a problem of minimization of a convex function over an n-dimensional solid G, the method finds an approximate solution of absolute accuracy ε in no more than V (f) c(f) n ln ε

iterations, where V (f)=maxG f −minG f is the variation of the objective over G and c(f)is certain problem-dependent constant which never is greater than 1 and typically is something around 0.2 - 0.5. We stress that this is an experimental fact, not a theorem. Strong doubts exist that a theorem of this type can be proved, but empirically this ”law” is supported by hundreds of tests. To illustrate this point, let me present numerical results related to one of the standard test problems called MAXQUAD. This is a small, although difficult, problem of maximizing the 4.6. BUNDLE METHODS 91

maximum of 5 convex quadratic forms of 10 variables. In the below table you see the results obtained on this problem by the Subgradient Descent and by the Level methods. In the Subgradient Descent the Polyak stepsizes were used (to this end, the method was equipped with the exact optimal value, so that the experiment was in favor of the Subgradient Descent). The results are as follows: Subgradient Descent: 100,000 steps, best found value -0.8413414 (absolute inaccuracy 0.0007), running time 54; Level: 103 steps, best found value -0.8414077 (absolute inaccuracy < 0.0000001), running time 2, ”complexity index” c(f)=0.47. Runs:

Subgradient Descent Level

+ + ifi ifi 1 5337.066429 1 5337.066429 298.595071 698.586465 86.295622 16 7.278115 31 0.198568 39 −0.674044 41 −0.221810 54 −0.811759 73 −0.841058 81 −0.841232 103 −0.841408 201 −0.801369 4001 ∗ − 0.839771 5001 −0.840100 17001 −0.841021 25001 −0.841144 50001 −0.841276 75001 −0.841319 100000 −0.841341

∗ marks the result obtained by the Subgradient Descent after 2 (the total CPU time of the Level method); to that moment the Subgradient Descent has performed 4,000 iterations, but has restored the solution within 2 accuracy digits rather then 6 digits given by Level. −1/2 The Subgradient Descent with the ”default” stepsizes γi = O(1)i in the same 100,000 iterations was unable to achieve value of the objective less than -0.837594, i.e., found the solution within a single accuracy digit. 92 LECTURE 4. LARGE-SCALE OPTIMIZATION PROBLEMS 4.7 Exercises

Exercise 4.7.1 Implement the level method. Try to test it on the MAXQUAD test problem1) which is as follows: 10 min f(x),X= {x ∈ R ||xi|≤1}, x∈X where   f(x)= max xT A(i)x + xT b(i) i=1,...,5 with

(i) (i) j/k Akj = Ajk = e cos(jk) sin(i),j

Take the initial point x0 =0.

When implementing the level method you will need a solver. For SCILAB implementation you can use QUAPRO internal solver, when working with MAT- LAB, you can use the conic solver from SDPT3 (http://www.math.nus.edu.sg/ mattohkc/sdpt3.html).

4.7.1 Around Subgradient Descent The short-step version of the Subgradient Descent presented above (I hope you are familiar with the lecture) is quite appropriate for proving the upper complexity bound; as a com- putational scheme, it is not too attractive. The most unpleasant property of the scheme is that it actually is a short-step procedure: one should from the very beginning tune it to the desired accuracy ε, and the stepsizes γi = ε/α associated with the upper complexity bound stated in Theorem 4.2.1 should be of order of ε. For the sake of simplicity, let us assume for a moment that G is an Euclidean ball of radius R,sothatα = 1. With the choice of stepsizes γi = ε/α = ε, the method will for sure be very slow, since to pass from the starting point x1 to a reasonable neighborhood of the optimal solution it normally requires to cover a distance of order of the diameter of the ball G, i.e., of order of R, and it will take at least M = O(1/ε) steps (since a single step moves the point by γiR = εR) even in the ideal case when all directions −ei look directly to the optimal solution. The indicated observation does not contradict the theoretical optimality of the method in the large-scale case, where the worst-case complexity, as we know, is at least is O(ε−2), which is much worse that the above M, namely, something like M 2. Note, anyhow, that right now we were comparing the best possible worst-case complexity of a method, the complexity of the family, and the best possible ideal-case complexity of the short-step Subgradient Descent. And there is nothing good in the fact that even in the ideal case the method is slow.

1)this problem is probably due to C. Lemarechal and R. Mifflin 4.7. EXERCISES 93

There are, anyhow, more or less evident possibilities to make the method computationally more reasonable. The idea is not to tune the method to the prescribed accuracy in advance, thus making the stepsizes small from the very beginning, but to start with ”large” stepsizes and then decrease them at a reasonable rate. To implement the idea, we need an auxiliary tool (which is important an interesting in its own right), namely, projections. n Let Q be a closed and nonempty convex subset in R .Theprojection πQ(x) of a point x ∈ Rn onto Q is defined as the closest, with respect to the usual Euclidean norm, to x point of Q, i.e., as the solution to the following optimization problem

| − |2 ∈ (Px): minimize x y 2 over y Q.

# Exercise 4.7.2 Prove that πQ(x) does exist and is unique.

# Exercise 4.7.3 Prove that a point y ∈ Q is a solution to (Px) if and only if the vector x − y is such that (u − y)T (x − y) ≤ 0 ∀u ∈ Q. (4.7.32) Derive from this observation the following important property:

| − |2 ≤| − |2 −| − |2 πQ(x) u 2 x u 2 x πQ(x) 2. (4.7.33) Thus, when we project a point onto a convex set, the point becomes closer to any point u of the set, namely, the squared distance to u is decreased at least by the squared distance from x to Q. Derive from (4.7.32) that the mappings x → πQ(x) and x → x − πQ(x) are Lipschitz continuous with Lipschitz constant 1.

Now consider the following modification of the Subgradient Descent scheme: given a solid Q which covers G, a starting point x1 ∈ int Q and a pair of sequences {γi > 0} (stepsizes) and {εi ∈ (0, 1)} (tolerances), form the sequence

xi+1 = πQ(xi − γiρei), (4.7.34) where 2ρ is the Euclidean diameter of Q and the unit vector ei is defined as follows: (i) if xi ∈ int G,thenei is an arbitrary unit vector which separates xi and G:

T (x − xi) ei ≤ 0,x∈ G;

(ii) if xi ∈ int G, but there is a constraint gj which is ”εi-violated” at xi, i.e., is such that   g (x ) >ε max{g (x )+(x − x )T g (x )} , (4.7.35) j i i ∈ j i i j i x G + then 1  ei = |  | gj(xi); gj(xi) 2 94 LECTURE 4. LARGE-SCALE OPTIMIZATION PROBLEMS

(iii) if xi ∈ int G and no constraint is εi-violated at xi, i.e., no inequality (4.7.35)is satisfied, then 1  ei =  f (xi). |f (xi)|2  If in the case of (iii) f (xi)=0,thenei is an arbitrary unit vector. After ei is chosen, loop. The modification, as we see, is in the following: 1) We add projection onto a covering G solid Q into the rule defining the updating xi → xi+1 and use the half of the diameter of Q as the scale factor in the steplength (in the basic version of the method, there was no projection, and the scale factor was the radius of the ball Vout); 2) We use time-dependent tactics to distinguish between search points which ”almost satisfy” the functional constraints and those which ”significantly violate” a constraint. 3) If we meet with a productive point xi with vanishing subgradient of the objective, we choose as ei an arbitrary unit vector and continue the process. Note that in the initial version of the method in the case in question we terminate and claim that xi is an ε-solution; now we also could terminate and claim that xi is an εi-solution, which in fact is the case, but εi could be large, not the accuracy we actually are interested in.

Exercise 4.7.4 # Prove the following modification of Proposition 4.3.1: Let a problem (p) from the family Pm(G) be solved by the aforementioned Subgradient Descent method with nonincreasing sequence of tolerances {εi}. Assume that for a pair of positive integers N>N one has  1 N 2 2+ γ r  ≡ 2 j=N j in Γ(N : N) N <εN , (4.7.36) j=N γj ρ where rin is the maximal of radii of Euclidean balls contained in G.Thenamongthesearch points xN ,xN +1, ..., xN there were productive ones, and the best of them (i.e., that one with the smallest value of the objective) point x¯N ,N is an εN -solution to (p). Derive from this result that in the case of problems without functional constraints (m =0), where εi do not influence the process at all, the relation

∗  ε (N) ≡ min {ρΓ(M : M)/rin} < 1 (4.7.37) N≥M≥M ≥1 implies that the best of the productive search points found in course of the first N steps is well-defined and is an ε∗(N)-solution to (p).

Hint: follow the line of argument of the original proof of Proposition 4.3.1. Namely, apply the proof to the ”shifted” process which starts at xN and uses at its i-th iteration, i ≥ 1, the stepsize γi+N −1 and the tolerance εi+N −1. This process differs from that one considered in the lecture in two issues: (1) presence of time-varying tolerance in detecting productivity and an ”arbitrary” step, instead of termination, when a productive search point with vanishing subgradient of the objective is met; 4.7. EXERCISES 95

(2) exploiting the projection onto Q ⊃ G when updating the search points. To handle (1), prove the following version of Proposition 3.3.1 (Lecture 3): n Assume that we are generating a sequence of search points xi ∈ R and associate with these points vectors ei and approximate solutions x¯i in accordance to (i)-(iii). Let

T Gi = {x ∈ G | (x − xj) ej ≤ 0, 1 ≤ j ≤ i}, and let Size be a size. Assume that for some M

Size(GM ) <εM Size(G)

(if GM is not a solid, then, by definition, Size(GM )=0). Then among the search points x1, ..., xM there were productive ones, and the best (with the smallest value of the objective) of these pro- ductive points is a ε1-solution to the problem. To handle (2), note that when estimating InnerRad(GN ), we used the equalities | − +|2 | − +|2 xj+1 x 2 = xj x 2 + ... and would be quite satisfied if = in these inequalities would be replaced with ≤; in view of Exercise 4.7.3, this replacement is exactly what the projection does. Looking at the statement given by Exercise 4.7.4, we may ask ourselves what could be a reasonable way to choose the stepsizes γi and the tolerances εi. Let us start with the case of problems without functional constraints, where we can forget about the tolerances - they do not influence the process. What we are interested in is to minimize over stepsizes the quantities ε∗(N). For a given pair of positive integers M ≥ M  the minimum of the quantity  1 M 2 2+ γ  2 j=M j Γ(N : N)= M j=M γj

√ 2  ≤ ≤ √ 2 over positive γj is attained when γj = M−M +1 , M j M, and is equal to M−M +1 ; thus, to minimize ε∗(N)foragiveni, one should set γ = √2 , j =1, ..., N, which would j N result in ρ ε∗(N)=2N −1/2 . rin This is, basically, the choice of stepsizes we used in the short-step version of the Subgradient Descent; an unpleasant property of this choice is that it is ”tied” to N, and we would like to avoid necessity to fix in advance the number of steps allowed for the method. A natural −1/2 idea is to use the recommendation γj =2N in the ”sliding” way, i.e., to set

−1/2 γj =2j ,j=1, 2, ... (4.7.38) Let us look what will be the quantities ε∗(N) for the stepsizes (4.7.38). Exercise 4.7.5 # Prove that for the stepsizes (4.7.38) one has ρ ρ ε∗(N) ≤ Γ(]N/2[: N) ≤ κN −1/2 rin rin with certain absolute constant κ. Compute the constant. 96 LECTURE 4. LARGE-SCALE OPTIMIZATION PROBLEMS

We see that the stepsizes (4.7.38) result in optimal, up to an absolute constant factor, rate of convergence of the quantities ε∗(N)to0asN →∞. Thus, when solving problems without functional constraints, it is reasonable to use the aforementioned Subgradient Descent with stepsizes (4.7.38); according to the second statement of Exercise 4.7.4 and Exercise 4.7.5,for all N such that ρ ε(N) ≡ κN −1/2 < 1 rin the best of the productive search points found in course of the first N steps is well-defined and solves the problem within relative accuracy ε(N). Now let us look at problems with functional constraints. It is natural to use here the same rule (4.7.38); the only question now is how to choose the tolerances εi. A reasonable policy would be something like

−1/2 ρ εi =min{0.9999, 1.01κi }, (4.7.39) rin

Exercise 4.7.6 # Prove that the Subgradient Descent with stepsizes (4.7.38) and tolerances (4.7.39), as applied to a problem (p) from the family Pm(G), possesses the following conver- gence properties: for all N such that ρ κN −1/2 < 0.99 rin among the search points x]N/2[,x]N/2[+1, ..., xN there are productive ones, and the best (with the smallest value of the objective) of these points solves (p) within relative inaccuracy not exceeding −1/2 ρ ε]N/2[ ≤ χN , rin χ being an absolute constant. Note that if one chooses Q = Vout (i.e., ρ = R, so that ρ/rini = α is the asphericity of G), then the indicated rate of convergence results in the same (up to an absolute constant factor) as for the basic short-step Subgradient Descent complexity of solving problems from the family within relative accuracy ε.

4.7.2 Mirror Descent Looking at the 3-line convergence proof for the standard Subgradient Descent:

−  |  | ⇒| − ∗|2 ≤| − ∗|2 − − ∗ T  |  | 2 xi+1 = πG(xi γif (xi)/ f (xi) ) xi+1 x xi x 2γi(xi x ) f (xi)/ f (xi) +γi

⇒| − ∗|2 ≤| − ∗|2 − − ∗ |  | 2 xi+1 x xi x 2γi(f(xi) f )/ f (xi) + γi  | − ∗|2 N 2 ∗ x1 x + i=1 γi ⇒ min f(xi) − f ≤  i≤N N 2 i=1 γi 4.7. EXERCISES 97

one should be surprised. Indeed, all of us know the origin of the : if f is smooth, a step in the antigradient direction decreases the first-order expansion of f and therefore, for a reasonably chosen stepsize, increases f itself. Note that this standard rea- soning has nothing in common with the above one: we deal with a nonsmooth f,andit should not decrease in the direction of an anti-subgradient independently of how small is the stepsize; there is a subgradient in the subgradient set which actually possesses the desired property, but this is not necessarily the subgradient used in the method, and even with the ”good” subgradient you could say nothing about the amount the objective can be decreased by. The ”correct” reasoning deals with algebraic structure of the Euclidean norm rather than with local behavior of the objective, which is very surprising; it is a kind of miracle. But we are interested in understanding, not in miracles. Let us try to understand what is behind the phenomenon we have met. First of all, what is a subgradient? Is it actually a vector? The answer, of course, is ”no”. Given a convex function f defined on an n-dimensional vector space E and an interior point x of the domain of f, you can define a nonempty set of support functionals - linear forms f (x)[h]ofh ∈ E which are support to f at x, i.e., such that

f(y) ≥ f(x)+f (x)[y − x],y ∈ Dom f;

these forms are intrinsically associated with f and x.Now,having chosen somehow an Euclidean structure (·, ·) on E, you may associate with linear forms f (x)[h] vectors f (x) from E in such a way that

f (x)[h]=(f (x),h),h∈ Rn, thus coming from support functionals to subgradients-vectors. The crucial point is that these vectors are not defined by f and x only; they also depend on what is the Euclidean structure on E we use. Of course, normally we think of an n-dimensional space as of the coordinate space Rn with once for ever fixed Euclidean structure, but this habit sometimes is dangerous; the problems we are interested in are defined in affine terms, not in the metric ones, so why should we always look at the problems via certain once for ever fixed Euclidean structure which has nothing in common with the problem? Developing systematically this evident observation, one may come to the most advanced and recent convex optimization methods like the polynomial time interior point ones. Our goal now is much more modest, but we also shall get profit from the aforementioned observation. Thus, once more: the ”correct” objects associated with f and x are not vectors from E, but elements of the dual to E space E∗ of linear forms on E.Ofcourse,E∗ is of the same dimension as E and therefore it can be identified with E; but there are many ways to identify these spaces, and no one of them is ”natural”, more preferable than others. Since the support functionals f (x)[h] ”live” in the dual space, the Gradient Descent cannot avoid the necessity to identify somehow the initial - primal - and the dual space, and this is done via the Euclidean structure the method is related to - as it was already explained, this is what allows to associate with a support functional - something which ”actually exists”, but belongs to the dual space - a subgradient, a vector belonging to the primal space; in 98 LECTURE 4. LARGE-SCALE OPTIMIZATION PROBLEMS

a sense, this vector is a phantom - it depends on the Euclidean structure on E.Now,is a Euclidean structure the only way to identify the dual and the primal spaces? Of course, no, there are many other ways. What we are about to do is to consider certain family of ”identifications” of E and E∗ which includes, as particular cases, all identifications given by Euclidean structures. This family is as follows. Let V (φ) be a smooth (say, continuously differentiable) convex function on E∗; its support functional V (φ)[·] at a point φ is a linear functional on the dual space. Due to the well-known fact of Linear Algebra, every linear form L[η] on the dual space is defined by a vector l from the primal space, in the sense that L[η]=η[l] for all η ∈ E∗; this vector is uniquely defined, and the mapping L → l is a linear isomorphism between the space (E∗)∗ dual to E∗ and the primal space E. This isomorphism is ”canonical” - it does not depend on any additional structures on the spaces in question, like inner products, coordinates, etc., it is given by intrinsic nature of the spaces. In other words, in the quantity φ[x] we may think of φ being fixed and x varying over E,whichgivesusa linear form on E (this is the origin of the quantity); but we also can think of x as being fixed and φ varying over E∗, which gives us a linear form on E∗; and Linear Algebra says to us that every linear form on E∗ can be obtained in this manner from certain uniquely defined x ∈ E. Bearing in mind this symmetry, let us from now on renotate the quantity φ[x]as φ, x, thus using more ”symmetric” notation. Thus, every linear form on E∗ corresponds to certain x ∈ E; in particular, the linear form V (φ)[·]onE∗ corresponds to certain vector V (φ) ∈ E: V (φ)[η] ≡η, V (φ) ,η ∈ E∗. We come to certain mapping φ → V (φ):E∗ → E; this mapping, under some mild assumptions, is a continuous one-to-one mapping with con- tinuous inverse, i.e., is certain identification (not necessarily linear) of the dual and the primal spaces. Exercise 4.7.7 Let (·, ·) be certain Euclidean structure on the dual space, and let 1 V (φ)= (φ, φ). 2 The aforementioned construction associates with V the mapping φ → V (φ):E → E∗;on the other hand, the Euclidean structure in question itself defines certain identification of the dual and the primal space, the identification I given by the identity φ[x]=(φ, I−1x),φ∈ E∗,x∈ E. Prove that I = V . { }n { ∗} ∗ Assume that ei i=1 is a basis in E and ei is the biorthonormal basis in E (so that ∗ ei [ej]=δij), and let A be the matrix which represents the inner product in the coordinates ∗ { ∗} ∗ ∗ of E related to the basis ei , i.e., Aij =(ei ,ej ). What is the matrix of the associated  { ∗} ∗ { } mapping V taken with respect to the ei -coordinates in E and the ei -coordinates in E? 4.7. EXERCISES 99

We see that all ”standard” identifications of the primal and the dual spaces, i.e., those given by Euclidean structures, are covered by our mappings φ → V (φ); the corresponding V ’s are, up to the factor 1/2, squared Euclidean norms. A natural question is what are the mappings associated with other squared norms.

Exercise 4.7.8 Let · be a norm on E,let

φ∗ =max{φ[x] | x ∈ E, x≤1}

be the conjugate norm on and assume that the function

1 2 V · (φ)= φ ∗ 2 ∗ is continuously differentiable. Then the mapping φ → V (φ) is as follows: you take a linear form φ =0 on E, maximize it on the ·-unit ball of E; the maximizer is unique and is exactly V (φ);and,ofcourse,V (0) = 0.Inotherwords,V (φ) is nothing but the ”direction of the fastest growth” of the functional φ, where, of course, the rate of growth in a direction is defined as the ”progress in φ per ·-unit step in the direction”. Prove that the mapping φ → φ is a continuous mapping form E onto E. Prove that this mapping is a continuous one-to-one correspondence between E and E if and only if the function 1 2 W · (x)= x : E → R 2 is continuously differentiable, and in this case the mapping

 x → W · (x)

 is nothing but the inverse to the mapping given by V · ∗ .

Now, every Euclidean norm ·on E induces, as we know, a Subgradient Descent method for minimization of convex functions over closed convex domains in E. Let us write down this method in terms of the corresponding function V · ∗ . For the sake of simplicity let us ignore for the moment the projector πG, thus looking at the method for minimizing over the whole E. The method, as it is easily seen, would become

−      ≡ φi+1 = φi γif (xi)/ f (xi) ∗,xi = V (φi),V V · ∗ . (4.7.40)

Now, in the presented form of the Subgradient Descent there is nothing from the fact that · is a Euclidean norm; the only property of the norm which we actually need is the differentiability of the associated function V .Thus,givenanorm·on E which induces a differentiable outside 0 conjugate norm on the conjugate space, we can write down certain method for minimizing convex functions over E. How could we analyze the convergence properties of the method? In the case of the usual Subgradient Descent the proof of con- vergence was based on the fact that the anti-gradient direction f (x) is a descent direction 100 LECTURE 4. LARGE-SCALE OPTIMIZATION PROBLEMS

for certain Lyapunov function,namely,|x − x∗|2, x∗ being a minimizer of f.Infactour reasoning was as follows: since f is convex, we have

f (x),x∗ − x≤f(x) − f(x∗) ≤ 0, (4.7.41) and the quantity f (x),x− x∗ is, up to the constant factor 2, the derivative of the function |x − x∗|2 in the direction f (x). Could we say something similar in the general case, where, accordingto(4.7.40), we should deal with the situation x = V (φ)? With this substitution, thelefthandsideof(4.7.41) becomes d f (x),x∗ − V (φ) = | V +(φ − tf (x)),V+(ψ)=V (ψ) −ψ, x∗ . dt t=0 Thus, we can associate with (4.7.40) the function

+ ≡ + − ∗ V (φ) V · ∗ (φ)=V · ∗ (φ) φ, x , (4.7.42) x∗ being a minimizer of f, and the derivative of this function in the direction −f (V (φ)) of the trajectory (4.7.40)isnonpositive: ' ( −f (V (φ)), (V +)(φ) ≤ f(x∗) − f(x) ≤ 0,φ∈ E∗. (4.7.43)

Now we may try to reproduce the reasoning which leads to the rate-of-convergence estimate for the Subgradient Descent for our now situation, where we speak about process (4.7.40) associated with an arbitrary norm on E (the norm should result, of course, in a continuously differentiable V ). For the sake of simplicity, let us restrict ourselves with the simple case when V possesses a Lipschitz continuous derivative. Thus, from now on let · be a norm on E such that the mapping V ≡  ∗ → (φ) V · ∗ (φ):E E is Lipschitz continuous, and let V  −V   L≡ { (φ ) (φ ) |     ∈ ∗} L · =sup   φ = φ ,φ,φ E . φ − φ ∗

For the sake of brevity, from now on we write V instead of V · ∗ . Exercise 4.7.9 Prove that

V (φ + η) ≤ V (φ)+η, V(φ) + LV (η),φ,η∈ E∗. (4.7.44)

Now let us investigate process (4.7.40). Exercise 4.7.10 ∗ Let f : E → R be a Lipschitz continuous convex function which attains its minimum on E at certain point x∗. Consider process (cf. (4.7.40))

   φi+1 = φi − γif (xi)/|f (xi)|∗,xi = V (φi),φ1 =0, (4.7.45) 4.7. EXERCISES 101

and let x¯i be the best (with the smallest value of f)ofthepointsx1, ..., xi and let εi = f(¯xi) − minE f. Prove that then  |x∗|2 + L N γ2 ≤  i=1 i εN L · (f) N ,N=1, 2, ... (4.7.46) 2 i=1 γi where L · (f) is the Lipschitz constant of f with respect to the norm ·. In particular, the method converges, provided that γi = ∞,γi → 0,i→∞. i

Hint: use the result of exercise 4.7.9 and (4.7.43) to demonstrate that L + + ∗  2 + ∗ V (φ ) ≤ V (φ ) − 2γ (f(x ) − f )/|f (x )|∗ + γ ,V (φ)=V (φ) −φ, x  , i+1 i i i i 2 i and then act exactly as in the case of the usual Subgradient Descent. Note that the basic result explains what is the origin of the ”Subgradient Descent miracle” which motivated our considerations; as we see, this miracle comes not from the very specific algebraic structure of the Euclidean norm, but from certain ”robust” analytic property of the norm (the Lipschitz continuity of the derivative of the conjugate norm), and we can fabricate similar miracles for arbitrary norms which share the indicated property. In fact you could use the outlined Mirror Descent scheme, developed by Nemirovski and Yudin, with necessary (and more or less straightforward) modifications, in order to extend everything what we know about the usual - ”Euclidean” - Subgradient Descent (I mean, the versions for optimization over a domain rather than over the whole space and for optimization over solids under functional constraints) onto the general ”non-Euclidean” case, but we skip here these issues.

4.7.3 Stochastic Approximation We shall speak here about a problem which is quite different from those considered so far – about stochastic optimization. The simplest (and, perhaps, the most important in applications) single-stage Stochastic Programming program is as follows:  minimize f(x)= F (x, ω)dP (ω) s.t. x ∈ G. (4.7.47) Ω Here

• F (x, ω) are functions of the design vector x ∈ Rn and parameter w ∈ Ω

• G is a subset in Rn

• P is a probability distribution on the set Ω. 102 LECTURE 4. LARGE-SCALE OPTIMIZATION PROBLEMS

Thus, a stochastic program is a Mathematical Programming program where the objective and the constraints are expectations of certain functions depending both of the design vector x and random parameter ω. The main source of programs of the indicated type is optimization in stochastic systems, like Queing Networks, where the processes depend not only on the design parameters, like performances and numbers of serving devices of different types), but also on random factors. As a result, the characteristics of such a system (e.g., time of serving a client, cost of service, etc.) are random variables depending, as on parameters, on the design parameters of the system. It is reasonable to measure the quality of the system by the expected values of the indicated random variables (in the dynamic systems, we should speak about the steady-state expectations). These expected values have the form of the objective and the constraints in (4.7.47), so that to optimize a system in question over its design parameters is a program of thetype(4.7.47). Looking at stochastic program (4.7.47), you can immediately ask what are, if any, the specific features of these problems and why these problems need specific treatment. (4.7.47) is, finally, nothing but the usual Mathematical Programming problem; we know how to solve tractable (convex) Mathematical Programming programs, and to the moment we never took into account what is the origin of the objective and the constraints, are they given by explicit simple formulae, or are they intergals, as in (4.7.47), or solutions to differential equations, or whatever else. All what was important for us were the properties of the objective and the constraints – convexity, smoothness, etc., but not their origin. This “indifference to the origin of the problem” indeed was the feature of our approach, but it was its weak point, not the strong one. In actual computations, the performance of an algorithm heavily depends not only on the quality of the method we apply to solve the problem, but also on the computational effort needed to provide the algorithm by the information on the problem instance – the job which in our model of optimization was the task of the oracle. We did not think how to help the oracle to solve his task and took care only of total number of oracle calls – we did our best to reduce this number. In the most general case, when we have no idea of what is the internal structure of the problem, this is the only possible approach. But the more we know about the structure of the problem, the more should we think of how to simplify the task of the oracle in order to reduce the overall computational expenses. Stochastic Programming is an example when we know something about the structure of the problem in question. Namely, let us look at a typical stochastic system, like Queing Network. Normally the function F (x, ω) associated with the system is “algorithmically simple” – given x and ω, we can more or less easily compute the quantity F (x, ω)and even its derivative with respect to x; to this end it suffices to create a simulation model of the system and run it at the given by ω realization of the random parameters (arrival and service times, etc.); even for a rather sophisticated system, a single simulation of this type is relatively fast. Thus, typically we have no difficulties with simulating realizations of the random quantity F (x, ·). 
On the other hand, even for relatively simple systems it is, as a rule, impossible to compute the expected values of the indicated quantities in a “closed analytic form”, and the only way to evaluate these expected values if to use a kind of the 4.7. EXERCISES 103

Monte-Carlo method: to run not a single, but many simulations, for a fixed value of the design parameters, and to take the empiric average of the observed random quantities as an estimate of their expected values. According to the well-known results on the rate of convergence of the Monte-Carlo method, to estimate the expected values within inaccuracy it requires O(1/2) simulations, and this is just to get the estimate of the objective and the constraints of problem (4.7.47) at a single point! Now imagine that we are going to treat (4.7.47) as a usual “black-box represented” optimization problem and intend to imitate the usual first-order oracle for it via the afrementioned Monte-Carlo estimator. In order to get an -solution to the problem, we, even in good cases, need to estimate within accuracy O() the objective and the constraints along the search points. It means that the method will require much more simulations than the aforementioned O(1/2): this quantity should be multiplied by the information-based complexity of the optimization method we are going to use. As a result, the indicated approach in most of the cases results in unappropriately long computations. An extremely surprising thing is that there exists another way to solve the problem. This way, under reasonable convexity assumptions, results in overall number of O(1/2) computations only – as if there were no optimization at all and the only goal were to estimate the objective and the constraints at a given point. The subject of our today lecture is this other way – Stochastic Approximation. To get a convenient framework for presenting Stochastic Approximation, it is worthy to modify a little the way we are looking at our problem. Assume that when solving it, we are allowed to generate a random sample ω1, ω2,... of “random factors” involved into the problem; the elements of the sample are assumed to be mutually independent and distributed according to P . Assume also that given x and ω, we are able to compute the value F (x, ω) and the gradient ∇xF (x, ω) of the integrant in (4.7.47). Note that under mild regularity assumptions the differentiation with respect to x and taking expectation are interchangeable:  

f(x)= F (x, ω)dP (ω), ∇f(x)= ∇xF (x, ω)dP (ω). (4.7.48) Ω Ω It means that the situation is covered by the following model of an optimization method solving (4.7.47): At a step i, we (the method) form i-th search point xi and forward it to the oracle which we have in our disposal. The oracle returns the quantities

F (xi,ωi), ∇xF (xi,ωi)

(in our previous interpretation it means that a single simulation of the stochastic system in question is performed), and this answer is the portion of information on the problem we get on the step in question. Using the information accumulated so far, we generate the new search point xi+1, again forward it to the oracle, enrich our accumulated information by its answer, and so on. The presented scheme is a very natural definition of a method based on stochastic first order oracle capable to provide the method with random unbiased (see (4.7.48)) estimates of 104 LECTURE 4. LARGE-SCALE OPTIMIZATION PROBLEMS

the values and the gradients of the objective and the constraints of (4.7.47). Note that the estimates are not only unbiased, but also form a kind of Markov chain: the distribution of the answers of the oracle at a point depends only on the point, not on the previous answers (recall that {ωi} are assumed to be independent). Now, for our further considerations it is completely unimportant that the observation of f(x) comes from the value, and the observation of ∇f(x) comes from the gradient. It suffices to postulate the following:

• The goal is to solve the Convex Programming program2

n f0(x) → min, | x ∈ G ⊂ R (4.7.49) [G is closed and convex, f is convex and Lipschitz continuous on G] • The information obtained by an optimization method at i-th step, i =1, 2, ...,comes from a stochastic oracle, i.e., is a real and a vector

n φ(xi,ωi) ∈ R,ψ(xi,ωi) ∈ R , where

– {ωi} is sequence of independent identically distributed, according to certain prob- ability distribution P , random parameters taking values in certain space Ω;

– xi is i-th search point generated by the method; this point may be an arbitrary deterministic function of the information obtained by the method so far (i.e., at the steps 1, ..., i − 1) It is assumed (and it is crucial) that the information obtained by the method is unbi- ased: Eφ(x, ω)=f(x),f(x) ≡Eψ(x, ω) ∈ ∂f(x),x∈ G; (4.7.50) here E is the expectation with respect to the distribution of the random parameters in question. To get reasonable complexity results, we need to bound somehow the magnitude of the random noise in the process in the stochastic oracle (as it is always done in all statistical considerations). Mathematically, the most convenient way to do it is as follows: let  1/2 L =sup E|ψ(x, f)|2 . (4.7.51) x∈G From now on we assume that the oracle is such that L<∞.ThequantityL will be called the intensity of the oracle at the problem in question; in what follows it plays the same role as the Lipschitz constant of the objective in large-scale minimization of Lipschitz continuous convex functions. 2Of course, the model we are about to present makes sense not only for convex programs; but the methods we are interested in will, as always, work well only in convex case, so that we loose nothing when imposing the convexity assumption from the very beginning 4.7. EXERCISES 105

4.7.4 The Stochastic Approximation method The method we are interested in is completely similar to the Subgradient Descent method from Section 4.5. Namely, the method generates search points according to the recurrency (cf. (8.1.1)) xi+1 = πG(xi − γiψ(xi,ωi)),i=1, 2, ..., (4.7.52) where

• x1 ∈ G is an arbitrary starting point;

• γi are deterministic positive stepsizes; • | − | πG(x) = argminy∈G x y is the standard projector on G.   The only difference with (8.1.1) is that now we replace the direction g(xi)=f (xi)/|f (xi)| of the subgradinet of f at xi by the random estimate ψ(xi,ωi) of the subgradient. Note that the search trajectory of the method is governed by the random variables ωi and is therefore random: i−1 s xi = xi(ω ),ω =(ω1, ..., ωs). Recurrency (4.7.52) defines the sequence of the search points, not that one of the approx- imate solutions. In the deterministic case of Section 4.5 we extracted from the search points the approximate solutions by choosing the best (with the smallest value of the objective) of the search points generated so far. The same could work in the stochastic case as well, but here we meet with the following obstacle: we “do not see” the values of the objective, and therefore cannot say which of the search point is better. To resolve the difficulty, we use the following trick: define i-th approximate solution as the sliding average of the search points: ⎡ ⎤ −1 i i i−1 ⎣ ⎦ x ≡ x (ω )= γi γtxt. (4.7.53) i/2≤t≤i i/2≤t≤i The efficiency of the resulting method is given by the following Theorem 4.7.1 For the aforementined method one has, for all positive integer N,  2 2 2 D + L ≤ ≤ γ ε ≡E[f(xN ) − min f(x)] ≤  N/2 i N i , (4.7.54) N ∈ x G 2 N/2≤i≤N γi G being the diameter of G. In particular, for γi chosen accoding to

D√ γi = (4.7.55) L i one has √LD εN ≤ O(1) (4.7.56) N with appropriately chosen absolute constant O(1). 106 LECTURE 4. LARGE-SCALE OPTIMIZATION PROBLEMS

Exercise 4.7.11 Prove Theorem 4.7.1.

2 Hint: follow the lines of the proof of the estimate (4.5.16) in Section 4.5; substitute di+1 i ∗ 2 with vi ≡E|xi+1(ω ) − x | .

Comments: The statement and the proof of Theorem 4.7.1 are completely similar to the related “deterministic” considerations of Section 4.5. The only difference is that now we are estimating from above the expected inaccuracy of N-th approximate solution; this is quite natural, since the stochastic nature of the process makes it impossible to say something reasonable about the quality of every realization of the random vector xN = xN (ωN−1) It turns out that the rate of convergence established in (4.7.56) connot be improved. Namely, it is not difficult to prove the following statement.

Proposition 4.7.1 For every L>0,anyD>0, any positive integer N and any stochastic- oracle-based N-step method M of minimizing univariate convex functions over the segment G =[0,D] on the axis there exists a linear function f and a stochastic oracle with intensity L on the function such that

LD E[f(xN ) − min f] ≥ O(1)√ , G N xN being the result formed by the method as applied to the problem f.HereO(1) is properly chosen positive absolute constant. √ Note that in the deterministic case the rate of convergence O(1/ N) was unimprovable only in the large-scale case; in contrast to this, in the stochastic case this rate becomes optimal already when we are minimizing univariate linear functions. √ Convergence rate O(1/ N) can be improved only if the objective is strongly convex.The simplest and the most important result here is as follows (the proof is completely similar to that one of Theorem 4.7.1):

Proposition 4.7.2 Assume that convex function f on Rn attains its minimum on G at a unique point x∗ and is such that

θ Θ f(x∗)+ |x − x∗|2 ≤ f(x) ≤ f(x∗)+ |x − x∗|2,x∈ G, (4.7.57) 2 2 with certain positive θ and Θ. Consider process (4.7.52) with the stepsizes γ γ = , (4.7.58) i i γ being a positive scale factor satisfying the relation

γθ > 1. (4.7.59) 4.7. EXERCISES 107

Then D2 + γ2L2 εN ≡E(f(xN ) − min f) ≤ c(γθ)Θ , (4.7.60) G N where D is the diameter of G, L is the intensity of the oracle at the problem in question and c(·) is certain problem-independent function on (1, ∞).

Exercise 4.7.12 Prove Proposition 4.7.2.

The algorithm (4.7.52) with the stepsizes (4.7.58) and the approximate solutions identical to the search points is called the classical Stochastic Approximation; it originates from Kiefer and Wolfovitz.√ A good news about the algorithm is its rate of convergence: O(1/N )instead of O(1/ N). A bad news is that this better rate is ensured only in the case of problems satisfying (4.7.57) and that the rate of convergence is very sensitive to the choice of the scale factor γ in the stepsize formula (4.7.58): if this scale factor does not satisfy (4.7.59), the rate of convergence may become worse in order. To see this, consider the following simple example: the problem is 1 f(x)= x2, [θ =Θ=1]; G =[−1, 1]; x =1, 2 1 the observations are given by

ψ(x, ω)=x + ω [= f (x)+ω],

E E 2 and ωi are the standard Gaussian random variables ( ωi =0, ωi = 1); we do not specify φ(x, ω), since it is not used in the algorithm. In this example, the best choice of γ is γ =1(in this case one can make (4.7.59) an equality rather than strict inequality due to the extreme simplicity of the objective). For this choice of γ one has 1 Ef(xN+1) ≡E[f(xN+1) − min f(x)] ≤ ,N≥ 1. x 2N In particular, it takes no more than 50 steps to reach expected inaccuracy not exceeding 0.01. Now assume that when solving the problem we overestimate the quantity θ and choose stepsizes according to (4.7.58) with γ =0.1. How many steps do we need in this case to reach the same expected inaccuracy 0.01 – 500, 5000, or what? The answer is astonishing: approximately 1,602,000 steps. And with γ =0.05 (20 times less than the optimal value of the parameter) the same accuracy costs more than 5.2 × 1014 steps! We see how dangerous the “classical” rule (4.7.58) for the stepsizes is: underestimating of γ (≡ overestimating of θ) may kill the procedure completely. And where from, in more or less complicated cases, could we take a reasonable estimate of θ? It should be said that there exist stable versions of the classical Stochastic Approximation (they, same as our version of the routine, use “large”, as compared to O(1/i), stepsizes and take, as approximate solutions, certain averages of the search points). These stable version of the method are capable to reach (under assumptions similar to (4.7.57)) the O(1/N )-rate of convergence, even with 108 LECTURE 4. LARGE-SCALE OPTIMIZATION PROBLEMS

the “asymptotically optimal” coefficient at 1/N . Note, anyhow, that the nondegeneracy assumption (4.7.57)iscrucialfortheO(1/N )-rate of convergence;√ if it is removed, the best possible rate, as we know from Proposition 4.7.1, becomes O(1/ N), and this is the rate given by our “robust” Stochastic Approximation with “large” steps and averaging. Lecture 5

Nonlinear programming: Unconstrained Minimization

(Relaxation; Gradient method; Rate of convergence; Newton method; Gradient Method and Newton Method: What is different? Idea of Variable Metric; Variable Metric Methods; Conjugate Gradient Methods Nous discutons ici des m´ethodes classiques de programmation non-lin´eaire. Ces m´ethodes constituent le point de d´epart de la th´eorie d’optimisation - c’est ici que l’histoire d’optimisation a commenc´e. Notre objectif sera aussi d’obtenir un avant-goˆut de la “nouvelle vie” que certaines id´ees tr`es anciennes ont retrouv´e en optimisation convexe, ce qui a donn´edes techniques algorithmiques les plus avanc´ees actuellement disponibles.

5.1 Relaxation

We have already mentioned in the first lecture of this course that the main goal of the general nonlinear programming is to find a local solution to a problem defined by differentiable functions. In general, the global structure of these problems is not much simpler than that of the problems defined by Lipshitz continuous functions. Therefore, even for such restricted goals, it is necessary to follow some special principles, which guarantee the convergence of the minimization process. The majority of the nonlinear programming methods are based on the idea of relaxation: { }∞ ≤ ≥ We call the sequence ak k=0 a relaxation sequence if ak+1 ak for all k 0. In this sections we consider several methods for solving the unconstrained minimization problem min f(x), (5.1.1) x∈Rn where f(x) is a smooth function. To solve this problem, we can try to generate a relaxation { }∞ sequence f(xk) k=0: f(xk+1) ≤ f(xk),k=0, 1,... . If we manage to do that, then we immediately have the following important consequences:

109 110LECTURE 5. NONLINEAR PROGRAMMING: UNCONSTRAINED MINIMIZATION

n { }∞ 1. If f(x) is bounded from below on R then the sequence f(xk) k=0 converges. 2. In any case we improve the initial value of the objective function.

5.2 Gradient method

Now we are completely prepared for the analysis of the unconstrained minimization methods. Let us start from the simplest scheme. We already know that the antigradient is a direction of the locally steepest descent of a differentiable function (cf. Exercise 5.6.2). Since we are going to find a local minimum of such function, the following scheme is the first to be tried:

n 0). Choose x0 ∈ R . 1). Iterate (5.2.2)  xk+1 = xk − hkf (xk),k=0, 1,... .

This is a scheme of the gradient method. The gradient’s factor in this scheme, hk, is called the step size. Of course, it is reasonable to choose the step size positive. There are many variants of this method, which differ one from another by the step size strategy. Let us consider the most important ones. { }∞ 1. The sequence hk k=0 is chosen in advance, before the gradient method starts its job. For example, hk = h>0, (constant step)

h √ h . k = k+1 2. Full relaxation:  hk =argmin f(xk − hf (xk)). h≥0

 3. Goldstein-Armijo rule: Find xk+1 = xk − hf (xk) such that

 αf (xk),xk − xk+1≤f(xk) − f(xk+1), (5.2.3)

 βf (xk),xk − xk+1≥f(xk) − f(xk+1), (5.2.4) where 0 <α<β<1 are some fixed parameters. Comparing these strategies, we see that the first strategy is the simplest one. Indeed, it is often used, but only in Convex Optimization, where the behavior of functions is much more predictable than in the general nonlinear case. The second strategy is completely theoretical. It is never used in practice since even in one-dimensional case we cannot find an exact minimum of a function in finite time. The third strategy is used in the majority of the practical algorithms. It has the following geometric interpretation. Let us fix x ∈ Rn. Consider the function of one variable

φ(h)=f(x − hf (x)),h≥ 0. 5.2. GRADIENT METHOD 111

Then the step-size values acceptable for this strategy belong to the part of the graph of φ, which is located between two linear functions:

 2  2 φ1(h)=f(x) − αh  f (x)  ,φ2(h)=f(x) − βh  f (x)  .

   Note that φ(0) = φ1(0) = φ2(0) and φ (0) <φ2(0) <φ1(0) < 0. Therefore, the ac- ceptable values exist unless φ(h) is not bounded from below. There are several very fast one-dimensional procedures for finding a point satisfying the conditions of this strategy, but their description is not so important for us now. Let us estimate now the performance of the gradient method. Consider the problem

min f(x), x∈Rn

∈ 1,1 n with f CL (R ). The latter means (cf the definition in Section 5.6.4 that f is continuously differentiable on Rn and its derivative is Lipschitz continuous on Rn with the constant L:

 f (x) − f (y) ≤ L  x − y  for all x, y ∈ Rn. Let us also assume that f(x) is bounded from below on Rn. Let us evaluate first the result of one step of the gradient method. Consider y = x−hf (x). Then, in view of (5.6.32), we have:

≤   −  L  − 2 f(y) f(x)+ f (x),y x + 2 y x (5.2.5) −   2 h2   2 − − h   2 = f(x) h f (x) + 2 L f (x) = f(x) h(1 2 L) f (x) . Thus, in order to get the best estimate for the possible decrease of the objective function, we have to solve the following one-dimensional problem:

h Δ(h)=−h(1 − L) → min . 2 h Computing the derivative of this function, we conclude that the optimal step size must satisfy  − ∗ 1 the equation Δ (h)=hL 1 = 0. Thus, it could be only h = L , and that is a minimum of Δ(h) since Δ(h)=L>0. Thus, our considerations prove that one step of the gradient method can decrease the objective function as follows:

1 f(y) ≤ f(x) −  f (x) 2 . 2L Let us check what is going on with our step-size strategies.  Let xk+1 = xk − hkf (xk). Then for the constant step strategy, hk = h,wehave:

− ≥ − 1   2 f(xk) f(xk+1) h(1 2 Lh) f (xk) . 112LECTURE 5. NONLINEAR PROGRAMMING: UNCONSTRAINED MINIMIZATION

2α ∈ Therefore, if we choose hk = L with α (0, 1), then 2 f(x ) − f(x ) ≥ α(1 − α)  f (x ) 2 . k k+1 L k

1 Of course, the optimal choice is hk = L . For full relaxation strategy we have 1 f(x ) − f(x ) ≥  f (x ) 2 k k+1 2L k

1 since the maximal decrease cannot be less than that with hk = L . Finally, for Goldstein-Armijo rule in view of (5.2.4)wehave:

  2 f(xk) − f(xk+1) ≤ βf (xk),xk − xk+1 = βhk  f (xk)  .

From (5.2.5) we obtain: h f(x ) − f(x ) ≥ h (1 − k L)  f (x ) 2 . k k+1 k 2 k ≥ 2 − Therefore hk L (1 β). Further, using (5.2.3)wehave:

  2 f(xk) − f(xk+1) ≥ αf (xk),xk − xk+1 = αhk  f (xk)  .

Combining this inequality with the previous one, we conclude that 2 f(x ) − f(x ) ≥ α(1 − β)  f (x ) 2 . k k+1 L k Thus, we have proved that in all cases we have ω f(x ) − f(x ) ≥  f (x ) 2, (5.2.6) k k+1 L k where ω is some constant. Now we are ready to estimate the performance of the gradient scheme. When taking the sum of the inequalities (5.2.6)fork =0,...,N we obtain:

N ω  2 ∗  f (xk)  ≤ f(x0) − f(xN ) ≤ f(x0) − f , (5.2.7) L k=0 where f ∗ is the optimal value of the problem (5.1.1). As a simple conclusion of (5.2.7)we have:   f (xk) → 0ask →∞. However, we can say something about the convergence rate. Indeed, denote

∗ gN =mingk, 0≤k≤N 5.2. GRADIENT METHOD 113

 where gk = f (xk) . Then, in view of (5.2.7), we come to the following inequality:

 1/2 ∗ √ 1 1 ∗ g ≤ L(f(x0) − f ) . (5.2.8) N N +1 ω { ∗ } The right hand side of this inequality describes the rate of convergence of the sequence gN to zero. Note that we cannot say anything about the rate of convergence of the sequences {f(xk)} or {xk}. Recall, that in general nonlinear optimization our goal is rather moderate: We want to find only a local minimum of our problem. Nevertheless, even this goal, in general, is unreachable for the gradient method. Let us consider the following example. Example 5.2.1 Consider the function of two variables: 1 1 1 f(x) ≡ f(x(1),x(2))= (x(1))2 + (x(2))4 − (x(2))2. 2 4 2 The gradient of this function is

f (x)=(x(1), (x(2))3 − x(2))T .

Therefore there are only three points which can be a local minimum of this function: ∗ ∗ − ∗ x1 =(0, 0),x2 =(0, 1),x3 =(0, 1). Computing the Hessian of this function, 10 f (x)= , 03(x(2))2 − 1

∗ ∗ 1 ∗ we conclude that x2 and x3 are the isolated local minima , but x1 is only a stationary point ∗ of our function. Indeed, f(x1)=0and 4 2 f(x∗ + e )= − < 0 1 2 4 2 for small enough. Now, let us consider the trajectory of the gradient method, which starts from x0 =(1, 0). Note that the second coordinate of this point is zero. Therefore, the second coordinate of  f (x0) is also zero. Consequently, the second coordinate of x1 is zero, etc. Thus, the entire sequence of points, generated by the gradient method will have the second coordinate equal ∗ to zero. This means that this sequence can converge to x1 only. To conclude our example, note that this situation is typical for all first–order uncon- strained minimization methods. Without additional very strict assumptions, it is impossible to guarantee the global convergence of the minimizing sequence to a local minimum, only to a stationary point.

1In fact, in our example they are the global solutions. 114LECTURE 5. NONLINEAR PROGRAMMING: UNCONSTRAINED MINIMIZATION

We can now write down some (upper) complexity estimates for a certain class of opti- mization problems. Let us look at the following example.

Example 5.2.2 Consider the following problem class:

Problem class: 1. Unconstrained minimization. ∈ 1,1 n 2.f CL (R ). 3.f(x) is bounded from below.

Oracle: First order black box.

 − solution: f(¯x) ≤ f(x0),  f (¯x) ≤ .

1,1 n Recall that the function class CL (R ) is defined as follows (cf. the definition in Section 5.6.4):

• ∈ 1,1 n n any f CL (R ) is continuously differentiable on R . • Its 1st derivative is Lipshitz continuous on Rn with the constant L:

 f (x) − f (y) ≤ L  x − y 

for all x, y ∈ Rn.

Note, that (5.2.8) can be used to obtain an upper bound for the number of steps (= calls of the oracle), which is necessary to find a point with a small norm of the gradient. For that, let us write out the following inequality:   1 1 1/2 g∗ ≤ √ L(f(x ) − f ∗) ≤ . N N +1 ω 0

≥ L − ∗ ∗ ≤ Therefore, if N +1 ω2 (f(x0) f ), we necessarily have gN . L − ∗ Thus, we can use the value ω2 (f(x0) f )asanupper complexity estimate for our problem class. Comparing this estimate with the result of Theorem 1.2.2,wecanseethatit is much better; at least it does not depend upon n. To conclude this example, note that the lower complexity bounds for the class under consideration are not known.

Let us check, what can be said about the local convergence of the gradient method. Let us consider the unconstrained minimization problem

min f(x) x∈Rn under the following assumptions: 5.2. GRADIENT METHOD 115

∈ 2,2 n 1. f CM (R ) (the class of twice differentiable functions with Lipschitz continuous ∈ 2,2 n Hessian. Recall that for f CM (R )wehave

 f (x) − f (y) ≤ M  x − y 

for all x, y ∈ Rn).

2. There exists a local minimum of function f at which the Hessian is positive definite.

3. We know some bounds 0

 ∗ lIn ≤ f (x ) ≤ LIn. (5.2.9)

∗ 4. Our starting point x0 is close enough to x .

  ∗ Consider the process: xk+1 = xk − hkf (xk). Note that f (x ) = 0. Hence,

1    ∗  ∗ ∗ ∗ ∗ f (xk)=f (xk) − f (x )= f (x + τ(xk − x ))(xk − x )dτ = Gk(xk − x ), 0

1  ∗ ∗ where Gk = f (x + τ(xk − x ))dτ. Therefore 0

∗ ∗ ∗ ∗ xk+1 − x = xk − x − hkGk(xk − x )=(I − hkGk)(xk − x ).

There is a standard technique for analyzing the processes of this type, which is based on contracting mapping. Recall that for a process

n a0 ∈ R ,ak+1 = Akak, where Ak are (n × n) matrices such that  Ak ≤ 1 − q, q ∈ (0, 1), we can estimate the rate of convergence of the sequence {ak} to zero:

k+1  ak+1 ≤ (1 − q)  ak ≤ (1 − q)  a0 → 0.

∗ Thus, in our case we need to estimate  In − hkGk .Denoterk = xk − x . In view of Exercise 5.6.6,wehave:

 ∗  ∗ ∗  ∗ f (x ) − τMrkIn ≤ f (x + τ(xk − x )) ≤ f (x )+τMrkIn.

Therefore, using our assumption (5.2.9), we obtain:

− rk ≤ ≤ rk (l 2 M)In Gk (L + 2 M)In. Hence, − rk ≤ − ≤ − − rk (1 hk(L + 2 M))In In hkGk (1 hk(l 2 M))In. 116LECTURE 5. NONLINEAR PROGRAMMING: UNCONSTRAINED MINIMIZATION

Thus,  In − hkGk ≤ max{ak(hk),bk(hk)}, (5.2.10) − − rk rk − ak(h)=1 h(l 2 M),bk(h)=h(L + 2 M) 1. − ≡ 2l Note that ak(0) = 1 and bk(0) = 1. Therefore, if rk < r¯ M ,thenak(h) is a strictly decreasing function of h and we can ensure  In − hkGk < 1 for small enough hk.Inthis case we will have rk+1

max{ak(h),bk(h)}→min . h

Let us assume that r0 < r¯. Then, if we form the sequence {xk} using this strategy, we can be ∗ sure that rk+1

2l M Let us estimate the rate of convergence. Denote q = L+l and ak = L+l rk (

− − 2 ≤ − 2 − ak(1 (ak q) ) ≤ ak ak+1 (1 q)ak + ak = ak(1 + (ak q)) = . 1 − (ak − q) 1+q − ak

Therefore 1 ≥ 1+q − 1, or ak+1 ak   q q(1 + q) q − 1 ≥ − q − 1=(1+q) − 1 . ak+1 ak ak Hence,     q q 2l L + l r¯ − 1 ≥ (1 + q)k − 1 =(1+q)k · − 1 =(1+q)k − 1 . ak a0 L + l r0M r0

Thus, k − k ≤ qr0 ≤ qr0 1 ≤ qr0(1 q) ak k . r0 +(1+q) (¯r − r0) r¯ − r0 1+q r¯ − r0 This proves the following theorem. 5.3. NEWTON METHOD 117

Theorem 5.2.1 Let function f(x) satisfy our assumptions and let the starting point x0 be close enough to a local minimum: 2l r = x − x∗ < r¯ = . 0 0 M Then the gradient method with the optimal step size (5.2.11) converges with the following rate: k ∗ rr¯ 0 L − l  xk − x ≤ . r¯ − r0 L + l We call his rate of convergence linear.

5.3 Newton method

Initially, the Newton method has been proposed for finding a root of a function of one variable φ(t), t ∈ R1: φ(t∗)=0. For that, it uses the idea of linear approximation. Indeed, assume that we have some t close enough to t∗.Notethat

φ(t +Δt)=φ(t)+φ(t)Δt + o(| Δt |).

Therefore the equation φ(t +Δt) = 0 can be approximated by the following linear equation:

φ(t)+φ(t)Δt =0.

We can expect that the solution of this equation, the displacement Δt, is a good approxi- mation to the optimal displacement Δt∗ = t∗ − t. Converting this idea in the algorithmic form, we obtain the following process:

− φ(tk) tk+1 = tk  . φ (tk) This scheme can be naturally extended onto the problem of finding solution to the system of nonlinear equations: F (x)=0, where x ∈ Rn and F (x):Rn → Rn. For that we have to define the displacement Δx as a solution to the following linear system:

F (x)+F (x)Δx =0

(it is called the Newton system). The corresponding iterative scheme looks as follows:

 −1 xk+1 = xk − [F (xk)] F (xk). 118LECTURE 5. NONLINEAR PROGRAMMING: UNCONSTRAINED MINIMIZATION

Finally, in view of Theorem 5.6.1 (necessary condition of the minimum), we can replace the unconstrained minimization problem by the following nonlinear system

f (x)=0. (5.3.12)

(This replacement is not completely equivalent, but it works in the nondegenerate situations.) Further, for solving (5.3.12) we can apply the standard Newton method for the system of nonlinear equations. In the optimization case, the Newton system looks as follows:

f (x)+f (x)Δx =0,

Hence, the Newton method for optimization problems appears in the following form:

 −1  xk+1 = xk − [f (xk)] f (xk). (5.3.13)

Note that we can come to the process (5.3.13), using the idea of quadratic approximation. Consider this approximation, computed at the point xk:   −  1   − −  φ(x)=f(xk)+ f (xk),x xk + 2 f (xk)(x xk),x xk .  Assume that f (xk) > 0. Then we can choose xk+1 as a point of minimum of the quadratic function φ(x). This means that

   φ (xk+1)=f (xk)+f (xk)(xk+1 − xk)=0, and, we come again to the Newton process (5.3.13). We will see that the convergence of the Newton method in a neighborhood of a strict local minimum is very fast. However, this method has two serious drawbacks. First, it can  break down if f (xk) is degenerate. Second, the Newton process can diverge. Let us look at the following example. Example 5.3.1 Let us apply the Newton method to finding a root of the following function of one variable: t φ(t)=√ . 1+t2 Clearly, t∗ =0.Notethat 1 φ(t)= . [1 + t2]3/2 Therefore the Newton process looks as follows: φ(t ) t − k − $ k · 2 3/2 − 3 tk+1 = tk  = tk [1 + tk] = tk. φ (tk) 2 1+tk

Thus, if | t0 |< 1, then this method converges and the convergence is extremely fast. Point t0 = 1 is an oscillation point of this method. If | t0 |> 1, then the method diverge. 5.3. NEWTON METHOD 119

In order to escape from the possible divergence, in practice we can apply a Damped Newton method:  −1  xk+1 = xk − hk[f (xk)] f (xk), where hk > 0 is a step-size parameter. At the initial stage of the method we can use the same step size strategies as for the gradient method. At the final stage it is reasonable to chose hk =1. Let us study the local convergence of the Newton method. Consider the problem

min f(x) x∈Rn under the following assumptions:

∈ 2,2 n 1. f CM (R ). 2. There exists a local minimum of function f with positive definite Hessian:

 ∗ f (x ) ≥ lIn,l>0. (5.3.14)

∗ 3. Our starting point x0 is close enough to x .

 −1  Consider the process: xk+1 = xk − [f (xk)] f (xk). Then, using the same reasoning as for the gradient method, we obtain the following representation:

∗ ∗  −1  xk+1 − x = xk − x − [f (xk)] f (xk)

1 ∗  −1  ∗ ∗ ∗ = xk − x − [f (xk)] f (x + τ(xk − x ))(xk − x )dτ 0

 −1 ∗ =[f (xk)] Gk(xk − x ),

1   ∗ ∗ ∗ where Gk = [f (xk) − f (x + τ(xk − x ))]dτ.Denoterk = xk − x .Then 0

1   ∗ ∗  Gk  = [f (xk) − f (x + τ(xk − x ))]dτ  0

1   ∗ ∗ ≤  f (xk) − f (x + τ(xk − x ))  dτ 0

1 ≤ − rk M(1 τ)rkdτ = 2 M. 0

In view of Exercise 5.6.6,and(5.3.14), we have:

  ∗ f (xk) ≥ f (x ) − MrkIn ≥ (l − Mrk)In. 120LECTURE 5. NONLINEAR PROGRAMMING: UNCONSTRAINED MINIMIZATION

l    −1 ≤ − −1 Therefore, if rk < M then f (xk) is positive definite and [f (xk)] (l Mrk) . Hence, 2l for rk small enough (rk < 3M ), we have

Mr2 M ≤ k ≤ 2 rk+1 ( rk

The rate of convergence of this type is called quadratic. Thus, we have proved the following theorem.

Theorem 5.3.1 Let function f(x) satisfy our assumptions. Suppose that the initial starting ∗ point x0 is close enough to x : 2l  x − x∗ < r¯ = . 0 3M

∗ Then  xk − x < r¯ for all k and the Newton method converges quadratically:

 − ∗ 2  − ∗ ≤ M xk x xk+1 x ∗ . 2(l − M  xk − x )

Comparing this result with the rate of convergence of the gradient method, we see that the Newton method is much faster. Surprisingly enough, the region of quadratic convergence of the Newton method is almost the same as the region of the linear convergence of the gradient method. This means that the gradient method is worth to use only at the initial stage of the minimization process in order to get close to a local minimum. The final job should be performed by the Newton method. In this section we have seen several examples of the convergence rate. Let us compare these rates in terms of complexity. As we have seen in Example 5.2.2, the upper bound for the analytical complexity of a problem class is an inverse function of the rate of convergence.

1. Sublinear rate. This rate is described in terms of a power function of the iteration counter. For example, we can have r ≤ √c . In this case the complexity of this scheme k k is c2/2. Sublinear rate is rather slow. In terms of complexity, each new right digit of the answer takes the amount of computations comparable with the total amount of the previous work. Note also, that the constant c plays a significant role in the corresponding complexity estimate.

2. Linear rate. This rate is given in terms of an exponential function of the iteration k counter. For example, it could be like that: rk ≤ c(1−q) . Note that the corresponding 1 1 complexity bound is q (ln c +ln ). This rate is fast: Each new right digit of the answer takes a constant amount of computations. Moreover, the dependence of the complexity estimate in constant c is very weak. 5.3. NEWTON METHOD 121

3. Quadratic rate. This rate has a form of the double exponential function of the iteration ≤ 2 counter. For example, it could be as follows: rk+1 crk. The corresponding complexity 1 estimate depends double-logarithmically on the desired accuracy: ln ln . This rate is extremely fast: Each iteration doubles the number of right digits in the answer. The constant c is important only for the starting moment of the quadratic convergence (crk < 1).

5.3.1 Gradient Method and Newton Method: What is different? In the previous section we have considered two local methods for finding a strict local mini- mum of the following unconstrained minimization problem:

min f(x), x∈Rn

∈ 2,2 n where f CL (R ). Those were the gradient method

 xk+1 = xk − hkf (xk),hk > 0. and the Newton Method:  −1  xk+1 = xk − [f (xk)] f (xk). Recall that the local rate of convergence of these method was different. We have seen, that the gradient method has a linear rate and the Newton method converges quadratically. What is the reason for this difference? If we look at the analytic form of these methods, we can see at least the following formal difference: In the gradient method the search direction is the antigradient, while in the Newton method we multiply the antigradient by some matrix, that is the inverse Hessian. Let us try derive these directions using some “universal” reasoning. Let us fix somex ¯ ∈ Rn. Consider the following approximation of the function f(x), we will call it the model: 1 φ (x)=f(¯x)+f (¯x),x− x¯ +  x − x¯ 2, 1 2h where the parameter h is positive. The first-order optimality condition provides us with the ∗ following equation for x1, the unconstrained minimum of the function φ1(x): 1 φ (x∗)=f (¯x)+ (x∗ − x¯)=0. 1 1 h 1 ∗ −  Thus, x1 =¯x hf (¯x). That is exactly the iterate of the gradient method. Note, that if ∈ 1 h (0, L ] then the function φ1(x)istheglobal upper approximation of f(x):

n f(x) ≤ φ1(x), ∀x ∈ R ,

(see Lemma 5.6.4). This fact is responsible for the global convergence of the gradient method. 122LECTURE 5. NONLINEAR PROGRAMMING: UNCONSTRAINED MINIMIZATION

Further, consider the quadratic approximation (the model) of the function f(x): 1 φ (x)=f(¯x)+f (¯x),x− x¯ + f (¯x)(x − x¯),x− x¯. (5.3.15) 2 2 We have already seen, that the minimum of this function is ∗ −  −1  x2 =¯x [f (¯x)] f (¯x), and that is the iterate of the Newton method. Thus, we can try to use some approximations of the function f(x), which are better than φ1(x) and which are less expensive than φ2(x). Let G be a positive definite n × n-matrix. Denote 1 φ (x)=f(¯x)+f (¯x),x− x¯ + G(x − x¯),x− x¯. G 2 Computing its minimum from the equation  ∗  ∗ − φG(xG)=f (¯x)+G(xG x¯)=0, we obtain ∗ − −1  xG =¯x G f (¯x). (5.3.16) The first-order methods which form a sequence

 ∗ {Gk} : Gk → f (x ) { } ≡ −1 →  ∗ −1 (or Hk : Hk Gk [f (x )] ) are called the variable metric methods. (Sometimes the name Quasi-Newton methods is used.) In these methods only the gradients are involved in the process of generating the sequences {Gk} or {Hk}. The reasons explaining the step of the form (5.3.16) are so important for optimization, that we will provide it with one more interpretation. It will also shed some light on the “variable metric” name of the algorithm family. We have already used the gradient and the Hessian of a nonlinear function f(x). However, note that they are defined with respect to the standard Euclidean inner product on Rn:

n x, y = x(i)y(i),x,y∈ Rn, i=1

 x = x, x1/2. Indeed, the definition of the gradient is as follows:

f(x + h)=f(x)+f (x),h + o( h ).

From that equation we derive its coordinate form ∂f(x) ∂f(x) f (x)= ,..., . ∂x1 ∂xn 5.3. NEWTON METHOD 123

Let us introduce now a new inner product. Consider a symmetric positive definite n × n- matrix A.Forx, y ∈ Rn denote

1/2 x, yA = Ax, y,  x A= Ax, x .

n The function  x A is a new metric on R defined by the matrix A. Note that topologically this new metric is equivalent to ·:

1/2 1/2 λ1(A)  x ≤x A ≤ λn(A)  x , where λ1(A)andλn(A) are the smallest and the largest eigenvalue of the matrix A. However, the gradient and the Hessian are changing:    1      f(x + h)=f(x)+ f (x),h + 2 f (x)h, h + o( h )

 −1   1  −1     = f(x)+ A f (x),h A + 2 A f (x)h, h A + o( h A)

 −1   1  −1   −1    = f(x)+ A f (x),h A + 4 [A f (x)+f (x)A ]h, h A + o( h A).

 −1   1 −1   −1 Hence, fA(x)=A f (x) is the new gradient and fA(x)= 2 [A f (x)+f (x)A ]isthe new Hessian (with respect to the metric defined by A). Thus, the direction used in the Newton method can be interpreted as the gradient com- puted with respect to the metric defined by A = f (x). Note that the Hessian of f(x)atx computed with respect to A = f (x)istheunit matrix.

Example 5.3.2 Consider the quadratic function

  1   f(x)=α + a, x + 2 Ax, x , where A = AT > 0. Note that f (x)=Ax + a, f (x)=A and f (x∗)=Ax∗ + a =0for x∗ = −A−1a. Let us compute the Newton direction at some x ∈ Rn:

 −1  −1 −1 dN (x)=[f (x)] f (x)=A (Ax + a)=x + A a.

Therefore for any x ∈ Rn we have:

−1 ∗ x − dN (x)=−A a = x .

Thus, the Newton method converges for a quadratic function in one step. Note also that

 −1  1  2 f(x)=α + A a, x A + 2 x A,

 −1   −1  fA(x)=A f (x)=dN (x),fA(x)=A f (x)=In. 124LECTURE 5. NONLINEAR PROGRAMMING: UNCONSTRAINED MINIMIZATION

Let us write out the general scheme of the variable metric methods. General scheme.

n  0. Choose x0 ∈ R . Set H0 = In. Compute f(x0) and f (x0). 1. k-th iteration (k ≥ 0).  a). Set pk = Hkf (xk).

b). Find xk+1 = xk − hkpk (see section 2.1.2 for the step-size rules).  c). Compute f(xk+1) and f (xk+1).

d). Update the matrix Hk : Hk → Hk+1

The variable metric schemes differ one from another only in the implementation of Step 1d), which updates the matrix Hk. For that, they use the new information, accumulated at  Step 1c), namely the gradient f (xk+1). The idea of this update can be explained with a quadratic function. Let   1    f(x)=α + a, x + 2 Ax, x ,f(x)=Ax + a. Then, for any x, y ∈ Rn we have f (x) − f (y)=A(x − y). This identity explains the origin of the following Quasi-Newton rule:   Choose Hk+1 such that Hk+1(f (xk+1) − f (xk)) = xk+1 − xk. Naturally, there are many ways to satisfy this relation. Let us present several examples of the variable metric schemes, which are recognized as the most efficient ones.

Example 5.3.3 Denote ΔHk = Hk+1 − Hk,   γk = f (xk+1) − f (xk),δk = xk+1 − xk. Then all of the following rules satisfy the Quasi-Newton relation: 1. One-rank correction scheme. T (δk − Hkγk)(δk − Hkγk) ΔHk = . δk − Hkγk,γk 2. Davidon-Fletcher-Powell scheme (DFP). T T δkδk Hkγkγk Hk ΔHk = − . γk,δk Hkγk,γk 3. Broyden-Fletcher-Goldfarb-Shanno scheme (BFGS). T T T Hkγkδk + δkγk Hk Hkγkγk Hk ΔHk = − βk , Hkγk,γk Hkγk,γk where βk =1+γk,δk/Hkγk,γk. Clearly, there are many other possibilities. From the computational point of view, BFGS is considered as the most stable scheme. 5.4. NEWTON METHOD AND SELF-CONCORDANT FUNCTIONS 125

For quadratic functions the variable metric methods usually terminate in n iterations. In the neighborhood of a strict minimum they have the superlinear rate of convergence: for n any x0 ∈ R there exists a number N such that for all k ≥ N we have

∗ ∗ ∗  xk+1 − x ≤ const·xk − x ·xk−n − x 

(the proofs are very long and technical). As far as the global convergence is concerned, these methods are not better than the gradient method (at least, from the theoretical point of view). Note that in these methods it is necessary to store and update a symmetric n × n- matrix. Thus, each iteration needs O(n2) auxiliary arithmetic operations. During a long time this feature was considered as the main drawback of the variable metric methods. That stimulated the interest to the conjugate gradient schemes, which have much lower complexity of each iteration (we will consider these schemes in Section 5.5). However, in view of the tremendous growth of the computer power, these arguments are not so important now.

5.4 Newton Method and Self-Concordant Functions

The traditional results on the Newton method establish no more than its fast asymptotical convergence; global efficiency estimates can be proved only for modified versions of the method, and these global estimates never are better than those for the Gradient Descent. In this section we consider a family of objectives – the so called self-concordant ones –where the Newton method admits excellent global efficiency estimates. This family underlies the most advanced Interior Point methods for large-scale Convex Optimization.

5.4.1 Preliminaries The traditional “starting point” in the theory of the Newton method – Theorem 5.3.1 – possesses an evident drawback (which, anyhow, remained unnoticed by generations of researchers). The Theorem establishes local quadratic convergence of the Basic Newton method as applied to a function f with positive definite Hessian at the minimizer, this is fine; but what is the “quantitative” information given by the Theorem? What indeed is the “region of quadratic convergence” of the method, let us denote it G, – the set of those starting points from which the method converges quickly to x∗? The proof provides us with certain “constructive” description of G, but look – this description involves differential char- acteristics of f like the magnitude M of the third order derivatives of f in a neighborhood of x∗ and the bound on the norm of inverted Hessian in this neighborhood (which depends on M, the radius of the neighborhood and the smallest eigenvalue l of f (x∗)). Besides this, the “fast convergence” of the method is described in terms of the behavior of the standard ∗ Euclidean distances xt −x . All these quantities – magnitudes of third-order derivatives of f, norms of the inverted Hessian, distances from the iterates to the minimizer – are “frame- dependent”: they depend on the choice of Euclidean structure on the space of variables, 126LECTURE 5. NONLINEAR PROGRAMMING: UNCONSTRAINED MINIMIZATION

on what are the orthonormal coordinates used to compute partial derivatives, Hessian ma- trices and their norms, etc. When we vary the Euclidean structure (pass from the original coordinates to another coordinates via a non-orthogonal linear transformation), all these quantities somehow vary, same as the description of G given by Theorem 5.3.1. On the other hand, when passing from one Euclidean structure on the space of variables to another, we do not vary neither the problem, nor the Basic Newton method. Indeed, the latter method is independent of any a priori coordinates, as it is seen from the following “coordinateless” description of the method (cf (5.3.15)):

To find the Newton iterate xt+1 of the previous iterate xt, take the second order Taylor expansion of f at xt and choose, as xt+1, the minimizer of the resulting quadratic form. Thus, the coordinates are responsible only for the point of view we use to investigate the process and are absolutely irrelevant to the process itself. And the results of Theorem 5.3.1 in their quantitative part (same as other traditional results on the Newton method) reflect this “point of view”, not the actual properties of the Newton process! This “dependence on viewpoint” is a severe drawback: how can we get correct impression of actual abilities of the method looking at the method from an “occasionally chosen” position? This is exactly the same as to try to get a good picture of a landscape directing the camera in a random manner.

5.4.2 Self-concordance When the drawback of the traditional results is realized, could we choose a proper point of view – to orient our camera properly, at least for “good” objectives? Assume, e.g., that our objective f is convex with nondegenerate Hessian. Then at every point x there is a natural, intrinsic for the objective, Euclidean structure on the space of variables, namely, the one given by the Hessian of the objective at x; the corresponding norm is  $ d2 |h| = hT f (x)h ≡ | f(x + th). (5.4.17) f,x dt2 t=0

Note that the first expression for |h|f,x seems to be “frame-dependent” – it is given in terms of coordinates used to compute inner product and the Hessian. But in fact the value of this expression is “frame-independent”, as it is seen from the second representation of |h|f,x. Now, from the standard results on the Newton method we know that the behavior of the method depends on the magnitudes of the third-order derivatives of f. Thus, these results are expressed in terms of upper bounds d3 | f(x + th) ≤ α dt3 t=0 on the third-order directional derivatives of the objective, the derivatives being taken along unit in the standard Euclidean metric directions h. What happens if we impose similar upper bound on the third-order directional derivatives along the directions of the unit |·|f,x 5.4. NEWTON METHOD AND SELF-CONCORDANT FUNCTIONS 127

length rather than along the directions of the unit “usual” length? In other words, what happens if we assume that d3 |h| ≤ 1 ⇒ | f(x + th) ≤ α ? f,x dt3 t=0

Since the left hand side of the concluding inequality is of homogeneity degree 3 with respect to h, the indicated assumption is equivalent to the one d3 | f(x + th) ≤ α|h|3 ∀x ∀h. dt3 t=0 f,x

Now, the resulting inequality, qualitatively, remains true when we scale f – replace it by λf with positive constant λ, but the value of α varies: α → λ−1/2α. We can use this property to normalize the constant factor α, e.g., to set it equal to 2 (this is the most technically convenient normalization). Thus, we come to the main ingredient of the notion of a self-concordant function: a three times continuously differentiable convex function f sat- isfying the inequality  3/2 d3 d2 | f(x + th) ≤ 2|h|3 ≡ 2 | f(x + th) ∀h ∈ Rn. (5.4.18) dt3 t=0 f,x dt2 t=0

We do not insist on f to be defined everywhere; it suffices to assume that the domain of f n is an open convex set Gf ⊂ R ,andthat(5.4.18) is satisfied at every point x ∈ Gf .The second part of the definition of a self-concordant function is that

Gf is a “natural domain” of f,sothatf possesses the barrier property with respect to Gf – blows up to infinity when a sequence of interior points of Gf approaches a boundary point of the domain:

∀{xi ∈ Gf } : xi → x ∈ ∂Gf ,i→∞⇒f(xi) →∞,i→∞. (5.4.19)

Of course, the second part of the definition imposes something on f only when the domain of f is less than the entire Rn. Note that the definition of a self-concordant function is “coordinateless” – it imposes certain inequality between third- and second-order directional derivatives of the function and certain behavior of the function on the boundary of its domain; all notions involved are “frame-independent”.

5.4.3 Self-concordant functions and the Newton method It turns out that the Newton method as applied to a self-concordant function f possesses extremely nice global convergence properties. Namely, one can more or less straightforwardly prove the following statements: 128LECTURE 5. NONLINEAR PROGRAMMING: UNCONSTRAINED MINIMIZATION

Proposition 5.4.1 [Self-concordant functions and the Newton method]  Let f be strongly self-concordant, and let f be nondegenerate at some point of Gf (this for sure is the case when Gf does not contain lines, e.g., is bounded). Then  (i) [Nondegeneracy] f (x) is positive definite at every point x ∈ Gf ;

(ii) [Existence of minimizer] If f is below bounded (which for sure is the case when Gf is bounded), then f attains its minimum on Gf , the minimizer being unique;

(iii) [DampedNewtonmethod]When started at arbitrary point x0 ∈ Gf the process $ 1  −1   T  −1  xt+1 = xt − [f (xt)] f (xt),λ(f,x)= (f (x)) [f (x)] f (x) (5.4.20) 1+λ(f,xt) – the Newton method with particular stepsizes 1 γt+1 = 1+λ(f,xt) – possesses the following properties:

– (iii.1) The process keeps the iterates in Gf and is therefore well-defined;

– (iii.2) If f is below bounded on Gf (which for sure is the case if λ(f,x) < 1 for ∈ { } ∗ some x Gf )then xt converges to the unique minimizer xf of f on Gf ; – (iii.3) Each step of the process (5.4.20)decreasesf “significantly”, provided that λ(f,xt) is not too small:

f(xt) − f(xt+1) ≥ λ(f,xt) − ln (1 + λ(f,xt)) ; (5.4.21)

– (iii.4) For every t, one has λ2(f,x ) λ(f,x ) < 1 ⇒ f(x )−f(x∗ ) ≤−ln (1 − λ(f,x ))−λ(f,x ) ≤ t (5.4.22) t t f t t 2 and 2 2λ (f,xt) λ(f,xt+1) ≤ . (5.4.23) 1 − λ(f,xt) The indicated statements demonstrate extremely nice global convergence properties of the Damped Newton method (5.4.20) as applied to a self-concordant function f.Namely, assume that f is self-concordant with nondegenerate Hessian at certain (and then, as it was mentioned in the above proposition, at any) point of Gf . Assume, besides this, that f is below bounded on Gf (and, consequently, attains its minimum on Gf by (ii)). According to (iii), the Damped Newton method keeps the iterates in Gf . Now, we may partition the trajectory into two parts: • the initial phase: from the beginning to the first moment, let it be called t∗,when λ(f,xt) ≤ 1/4; 5.4. NEWTON METHOD AND SELF-CONCORDANT FUNCTIONS 129

• the final phase: starting from the moment t∗.

According to (iii.3), at every step of the initial phase the objective is decreased at least by absolute constant 1 5 1 κ = − ln (> ); 4 4 32 consequently,

• the initial phase is finite and is comprised of no more than

f(x ) − min f N = 0 Gf ini κ iterations.

Starting with t = t∗,wehaveinviewof(5.4.23):

2 2λ (f,xt) 1 λ(f,xt+1) ≤ ≤ λ(f,xt); 1 − λ(f,xt) 2 thus,

∗ • starting with t = t , the quantities λ(f,xt) converge quadratically to 0 with objective- independent rate. According to (5.4.23),

• ∗ − starting with t = t , the residuals in terms of the objective f(xt) minGf f also converge quadratically to zero with objective-independent rate.

Combining the above observations, we observe that

• the number of steps of the Damped Newton method required to reduce the residual f(xt)−min f in the value of a self-concordant below bounded objective to a prescribed value <0.1 is no more than   1 N() ≤ O(1) [f(x ) − min f] + ln ln , (5.4.24) 0 O(1) being an absolute constant.

It is also worthy of note what happens when we apply the Damped Newton method to a below unbounded self-concordant f. The answer is as follows:

• for a below unbounded f one has λ(f,x) ≥ 1 for every x (see (iii.2)), and, consequently, every step of the method decreases f at least by the absolute constant 1 − ln 2 (see (iii.3)). 130LECTURE 5. NONLINEAR PROGRAMMING: UNCONSTRAINED MINIMIZATION

The indicated picture gives a “frame-” and “objective-independent” description of the global behavior of the Damped Newton method as applied to a below bounded self-concordant function. Note that the quantity λ(f,x) used to describe the behavior of the method at the first glance is “coordinate dependent” (see (5.4.20)), but in fact this quantity is “coordinate- less”. Indeed, one can easily verify that

λ2(f,x) = fˆ(x) − min fˆ(y), 2 y where 1 fˆ(y)=f(x)+(y − x)T f (x)+ (y − x)T f (x)(y − x) 2 is the second-order Taylor expansion of f at x. This is a coordinateless definition of λ(f,x). Note that the region of quadratic convergence of the Damped Newton method as applied to a below bounded self-concordant function f is, according to (iii.4), the set

1 G = {x ∈ G | λ(f,x) ≤ }. (5.4.25) f f 4

5.4.4 Self-concordant functions: applications At the first glance, the family of self-concordant functions is rather “thin” – the functions are given by certain “strict” differential inequality, and “a general”, even convex and smooth, f hardly may happen to be self-concordant. Thus, what for these elegant results on the behavior of the Newton method on self-concordant functions? The answer is as follows: it turns out that (even constrained) Convex Programming problems of reasonable (in principle – of arbitrary) analytic structure can be reduced to a “small series” of problems of minimizing self-concordant functions. Applying to these auxil- iary problems the Damped Newton method, we come to the theoretically most efficient (and extremely efficient in practice) Interior Point methods for Convex Optimization.Appear- ance of these methods definitely was one of the most important events in optimization. I am going to speak about Interior Point methods in more detail in the next lecture. What should be stressed now is that the crucial point in the design of such methods is our ability to construct “good” self-concordant functions with prescribed domains. To this end it is worth to note how to construct self-concordant functions. Here the following “raw materials” and “combination rules” are useful: Basic examples of self-concordant functions. For the time being, the following exam- ples will be sufficient:

• [Convex quadratic (e.g., linear) form] The convex quadratic function

1 f(x)= xT Ax − bT x + c 2 (A is symmetric positive semidefinite n × n matrix) is self-concordant on Rn; 5.4. NEWTON METHOD AND SELF-CONCORDANT FUNCTIONS 131

• [Logarithm] The function − ln(x)

is self-concordant with the domain R+ = {x ∈ R | x>0};

• [Extension of the previous example: Logarithmic barrier, linear/quadratic case] Let

n G = {x ∈ R | φj(x) < 0,j=1, ..., m}

be a nonempty set in Rn given by m strict convex quadratic (e.g., linear) inequalities. Then the function m f(x)=− ln(−φi(x)) i=1 is self-concordant with the domain equal to G.

Combination rules: simple operations with functions preserving self-concordance

• [Linear combination with coefficients ≥ 1] Let fi, i =1, ...m, be self-concordant func-

tions with the domains Gfi , let these domains possess a nonempty intersection G,and let αi ≥ 1, i =1, ..., m, be given reals. Then the function

m f(x)= αifi(x) i=1 is self-concordant with the domain equal to G. In particular, the sum of a self-concordant function and a convex quadratic function (e.g., a linear one) is self-concordant;

n • [Affine substitution] Let f(x) be self-concordant with the domain Gf ⊂ R , and let k n x = Aξ + b be an affine mapping from R into R with the image intersecting Gf . Then the composite function g(ξ)=f(Aξ + b) is self-concordant with the domain

Gg = {ξ | Aξ + b ∈ Gf }

being the inverse image of Gf under the affine mapping in question. To justify self-concordance of the indicated functions, same as the validity of the combina- tion rules, only minimal effort is required; at the same time, these examples and rules give almost all required to establish excellent global efficiency estimates for Interior Point meth- ods as applied to and Convex Quadratically Constrained Quadratic programming. After we know examples of self-concordant functions, let us look how our new under- standing of the behavior of the Newton method on such a function differs from the one 132LECTURE 5. NONLINEAR PROGRAMMING: UNCONSTRAINED MINIMIZATION

given by Theorem 5.3.1. To this end consider a particular self-concordant function – the logarithmic barrier

f(x)=− ln(δ − x1) − ln(δ + x1) − ln(1 − x2) − ln(1 + x2) for the 2D rectangle 2 D = {x ∈ R ||x1| <δ,|x2| < 1}; in what follows we assume that the rectangle is “wide”, i.e., that

δ  1.

This function indeed is self-concordant (see the third of the above “raw material” examples). The minimizer of the function clearly is the origin; the region of quadratic convergence of theDampedNewtonmethodisgivenby

2 2 G { ∈ | x1 x2 ≤ 1 } = x D 2 2 + 2 δ + x1 1+x2 32 (see (5.4.25)). We see that the region of quadratic convergence of the Damped Newton method is large enough – it contains, e.g., 8 times smaller than D concentric to D rectangle D. Besides this, (5.4.24) says that in order to minimize f to inaccuracy, in terms of the objective, , starting with a point x0 ∈ D, it suffices to perform no more than  1 1 O(1) ln +lnln  x0 

steps, where O(1) is an absolute constant and

|x |  x =max{ 1 , |x |}. δ 2 Now let us look what Theorem 5.3.1 says. The Hessian f (0) of the objective at the minimizer is   2δ−2 0 H = , 02 and H−1 = O(δ2); in, say, 0.5-neighborhood U of x∗ =0wealsohave[f (x)]−1 = O(δ2). The third-order derivatives of f in U are of order of 1. Thus, in the notation from the proof of Theorem 5.3.1 we have M = O(1) (this is the magnitude of the third order derivatives of f in U), r =0.5 and the upper bound on the norm of the inverted Hessian of f in U is O(δ2). According to the proof, the region U  of quadratic convergence of the Newton method isr ¯-neighborhood of x∗ =0with r¯ = O(δ−2). Thus, according to Theorem 5.3.1, the region of quadratic convergence of the method be- comes the smaller the larger is δ, while the actual behavior of this region is quite opposite. 5.5. CONJUGATE GRADIENTS METHOD 133

In this simple example, the aforementioned drawback of the traditional approach – its “frame-dependence” – is clearly seen. Applying Theorem 5.3.1 to the situation in question, we used extremely bad “frame” – Euclidean structure. If we were clever enough to scale the variable x1 before applying Theorem 5.3.1 – to divide it by δ – it would become absolutely clear that the behavior of the Newton method is absolutely independent of δ, and the region of quadratic convergence of the method is a once for ever fixed “fraction” of the rectangle D.

5.5 Conjugate gradients method

The conjugate gradient method was initially proposed for minimizing a quadratic func- tion. Therefore we will describe first the conjugate gradient schemes for the problem

min f(x), x∈Rn   1   T with f(x)=α + a, x + 2 Ax, x and A = A > 0. We have already seen that the solution of this problem is x∗ = −A−1a. Therefore, our quadratic objective function canbewritteninthefollowingform:   1   − ∗  1   f(x)=α + a, x + 2 Ax, x = α Ax ,x + 2 Ax, x

− 1  ∗ ∗ 1  − ∗ − ∗ = α 2 Ax ,x + 2 A(x x ),x x . ∗ − 1  ∗ ∗  − ∗ Thus, f = α 2 Ax ,x and f (x)=A(x x ). Let we are given by a starting point x0. Consider the linear subspaces

∗ k ∗ Lk =Lin{A(x0 − x ),...,A (x0 − x )}.

The sequence of the conjugate gradient method is defined as follows:

xk =argmin{f(x) | x ∈ x0 + Lk},k=1, 2,.... (5.5.26)

This definition looks rather abstract. However, we will see that this method can be written in much more “algorithmic” form. The above representation is convenient for the theoretical analysis.

  Lemma 5.5.1 Lk =Lin{f (x0),...,f (xk−1)}. Proof:  ∗ For k =1wehavef (x0)=A(x0 − x ). Let the statement of the lemma is true for some k ≥ 1. Note that k i ∗ xk = x0 + λiA (x0 − x ) i=1 1 with some λi ∈ R . Therefore k  ∗ i+1 ∗ k+1 ∗ f (xk)=A(x0 − x )+ λiA (x0 − x )=y + λkA (x0 − x ), i=1 134LECTURE 5. NONLINEAR PROGRAMMING: UNCONSTRAINED MINIMIZATION

where y ∈Lk.Thus,

k+1 ∗    Lk+1 =Lin{Lk,A (x0 − x )} =Lin{Lk,f (xk)} =Lin{f (x0),...,f (xk)}.

The next result is important for understanding the behavior of the minimization sequence of the method.

  Lemma 5.5.2 For any k = i we have f (xk),f (xi) =0.

Proof: Let k>i. Consider the function ⎛ ⎞ k ¯ ⎝  ⎠ φ(λ)=φ(λ1,...,λk)=f x0 + λjf (xj−1) . j=1

In view of Lemma 5.5.1,thereexistsλ¯∗ such that

k ∗  xk = x0 + λj f (xj−1). j=1

However, in view of its definition, xk is the point of minimum of f(x)onLk. Therefore φ(λ¯∗) = 0. It remains to compute the components of the gradient:

∗ ∂φ(λ¯ )   0= = f (xk),f (xi). ∂λi

Corollary 5.5.1 The sequence generated by the conjugate gradient method is finite.

(Since the number of the orthogonal directions cannot exceed n.)

 Corollary 5.5.2 For any p ∈Lk we have f (xk),p =0.

The last result we need explains the name of the method. Denote δi = xi+1 − xi. It is clear that Lk =Lin{δ0,...,δk−1}.

Lemma 5.5.3 For any k = i we have Aδk,δi =0.

(Such directions are called conjugate with respect to A.) Proof: Without loss of generality, we can assume that k>i.Then

  Aδk,δi = A(xk+1 − xk),xi+1 − xi = f (xk+1) − f (xk),xi+1 − xi =0

since δi = xi+1 − xi ∈Li+1 ⊆Lk. 5.5. CONJUGATE GRADIENTS METHOD 135

Let us show, how we can write out the conjugate gradient method in more algo- rithmic form. Since Lk =Lin{δ0,...,δk−1}, we can represent xk+1 as follows: k−1  xk+1 = xk − hkf (xk)+ λjδj. j=0 That is k−1  δk = −hkf (xk)+ λjδj. (5.5.27) j=0 Let us compute the coefficients of this representation. Multiplying (5.5.27)byA and δi,0≤ i ≤ k − 1, and using Lemma 5.5.3 we obtain: k−1   0=Aδk,δi = −hkAf (xk),δi + λjAδj ,δi j=0

 = −hkAf (xk),δi + λiAδi,δi

   = −hkf (xk),f (xi+1) − f (xi) + λiAδi,δi.

Therefore, in view of Lemma 5.5.2, λi =0foralli

Thus, xk+1 = xk − hkpk,where   2   2  − f (xk) δk−1  − f (xk) pk−1 pk = f (xk)   = f (xk)   f (xk) − f (xk−1),δk−1 f (xk) − f (xk−1),pk−1

since δk−1 = −hk−1pk−1 by the definition of {pk}. Note that we managed to write down the conjugate gradient scheme in terms of the gradients of the objective function f(x). This provides us with possibility to apply formally this scheme for minimizing a general nonlinear function. Of course, such extension destroys all properties of the process, specific for the quadratic func- tions. However, we can hope that asymptotically this method could be very fast in the neighborhood of a strict local minimum, where the objective function is close to be quadratic. Let us present the general scheme of the conjugate gradient method for minimizing some nonlinear function f(x) Conjugate gradient method

n   0. Choose x0 ∈ R . Compute f(x0) and f (x0). Set p0 = f (x0). 1. kth iteration (k ≥ 0).

a). Find xk+1 = xk+hkpk (using an ‘‘exact’’ procedure).  b). Compute f(xk+1) and f (xk+1). c). Compute the coefficient βk.  d). Set pk+1 = f (xk+1) − βkpk. 136LECTURE 5. NONLINEAR PROGRAMMING: UNCONSTRAINED MINIMIZATION

In the above scheme we did not specify yet the coefficient βk. In fact, there are many different formulas for this coefficient. All of them give the same result for a quadratic function, but in the general nonlinear case they generate different algorithmic schemes. Let us present three most popular expressions. 2 f (xk+1) 1. β = . k f (xk+1)−f (xk),pk 2 f (xk+1) 2. Fletcher-Rieves: β = − 2 . k f (xk) f (xk+1),f (xk+1)−f (xk) 3. Polak-Ribbiere: β = − 2 . k f (xk) Recall that in the quadratic case the conjugate gradient method terminates in n iterations (or less). Algorithmically, this means that pn+1 = 0. In the nonlinear case that is not true, but after n iteration this direction could loose any sense. Therefore, in all practical schemes there is a restart strategy, which at some point sets βk =0 (usually after every n iterations). This ensures the global convergence of the scheme (since we have a normal gradient step just after the restart), and a local n-step local quadratic convergence:

∗ ∗ 2  xn+1 − x ≤ const·x0 − x  ,

∗ provided that x0 is close enough to the strict minimum x . Note, that this local convergence is slower than that of the variable metric methods. However, the conjugate gradient schemes have an advantage of the very cheap iteration. As far as the global convergence is concerned, the conjugate gradients in general are not better than the gradient method. 5.6. EXERCISES 137 5.6 Exercises

5.6.1 Implementing Gradient method Exercise 5.6.1

Let us consider the implementation of the Armijo-Goldstein version of the Gradient Descent.  Recall that at each step of the algorithm we are looking for xk+1 = xk − hf (xk)andthe step-size h such that

 T αf (xk) (xk − xk+1) ≤ f(xk) − f(xk+1), (5.6.28)  T βf (xk) (xk − xk+1) ≥ f(xk) − f(xk+1), (5.6.29) where 0 <α<β<1 are fixed parameters of the algorithm. Note that a point xk+1 which satisfy the system of inequalities above surely exists (why?). To find it we can proceed as follows: let us choose an arbitrary positive h = h0 and test whether it satisfies (5.6.28). If it is the case, let us replace this value subsequently by β β h1 = α h0, h2 = α h1, etc., each time verifying whether the new value of h satisfies (5.6.28). This cannot last forever: for certain s ≥ 1 hs for sure fails to satisfy (5.6.28). When it happens for the first time, the quantity hs−1 turns out to satisfy (5.6.28), while the quantity β hs = α hs−1 fails to satisfy (5.6.28), which means that h = hs−1 passes the Armijo-Goldstein test. α If the initial h0 does not satisfy (5.6.28), we replace this value subsequently by h1 = β h0, α h2 = β γ1, etc., each time verifying whether the new value of h still does not satisfy (5.6.28). Again, this cannot last forever: for certain s ≥ 1, hs for sure satisfies (5.6.28). When it β − happens for the first time, hs turns out to satisfy (5.6.28), while hs 1 = α hs fails to satisfy (5.6.28), and h = hs passes the Armijo test. Write the code implementing the Armijo-Goldstein Gradient Descent method and apply it to the following problems:

• Rosenbrock problem

− 2 2 − 2 → | ∈ 2 f(x) = 100(x2 x1) +(1 x1) min x =(x1,x2) R ,

the starting point is x0 =(−1.2, 1). The Rosenbrock problem is a well-known test example: it has a unique critical point x∗ =(1, 1) (the global minimizer of f); the level lines of the function are banana-shaped valleys, and the function is nonconvex and rather badly conditioned.

• Quadratic problem

2 2 → | ∈ 2 fα(x)=x1 + αx2 min x =(x1,x2) R . Test the following values of α 10−1;10−4;10−6 138LECTURE 5. NONLINEAR PROGRAMMING: UNCONSTRAINED MINIMIZATION

and for each of these values test the starting points √ (1, 1); ( α, 1); (α, 1).

How long it takes to reduce the initial inaccuracy, in terms of the objective, by factor 0.1?

• Quadratic problem 1 f(x)= xT Ax − bT x, x ∈ R4, 2 with ⎛ ⎞ ⎛ ⎞ 0.78 −0.02 −0.12 −0.14 0.76 ⎜ ⎟ ⎜ ⎟ ⎜ −0.02 0.86 −0.04 0.06 ⎟ ⎜ 0.08 ⎟ A = ⎜ ⎟ ,b= ⎜ ⎟ ,x=0. ⎝ −0.12 −0.04 0.72 −0.08 ⎠ ⎝ 1.12 ⎠ 0 −0.14 0.06 −0.08 0.74 0.68 Run the method until the norm of the gradient at the current iterate is becomes less than 10−6. Is the convergence fast or not? Those using MATLAB can compute the spectrum of A and to compare the theoretical upper bound on convergence rate with the observed one.

• Experiments with Hilbert matrix. Let H(n) be the n × n Hilbert matrix: 1 (H(n)) = ,i,j=1, ..., n. ij i + j − 1  T (n) 1 n i−1 2 ≥ This is a symmetric positive definite matrix (since x H x = 0 ( i=1 xit ) dt 0, the inequality being strict for x =0). For n =2, 3, 4, 5 perform the following experiments:

– choose somehow n-dimensional nonzero vector x∗, e.g., x∗ =(1, ..., 1)T ; – compute b = H(n)x∗; – Apply your Gradient Descent code to the quadratic function 1 f(x)= xT H(n)x − bT x, 2 ∗ the starting point being x0 =0.Notethatx is the unique minimizer of f. ∗ −4 – Terminate the method when you will get |xN −x |≤10 , not allowing it, anyhow, to run more than 10,000 steps.

What will be the results? Those using MATLAB can try to compute the condition number of the Hilbert matrices in question. You can play with the parameters α and β of the method to get the best possible convergence. 5.6. EXERCISES 139

5.6.2 Regular functions The exercises of this section are critical to understand the content of the next lecture. You are invited to work on them or, at least, to work on the proposed solutions. We are going to use a fundamental principle of numerical analysis, we mean approxima- tion. In general,

To approximate an object means to replace the initial complicated object by a simplified one, close enough to the original.

In nonlinear optimization we usually apply the local approximations based on the derivatives of the nonlinear function. Those are the first- and the second-order approximations (or, the linear and quadratic approximations). Let f(x) be differentiable atx ¯.Thenfory ∈ Rn we have: f(y)=f(¯x)+f (¯x),y− x¯ + o( y − x¯ ), where o(r) is some function of r ≥ 0 such that 1 lim o(r)=0,o(0) = 0. r↓0 r The linear function f(¯x)+f (¯x),y− x¯ is called the linear approximation of f atx ¯. Recall that the vector f (x) is called the gradient of function f at x. Considering the points n yi =¯x + ei,whereei is the ith coordinate vector in R , we obtain the following coordinate form of the gradient: ∂f(x) ∂f(x) f (x)= ,..., ∂x1 ∂xn Let us look at some important properties of the gradient.

5.6.3 Properties of the gradient

Denote by Lf (α)thesublevel set of f(x):

n Lf (α)={x ∈ R | f(x) ≤ α}.

Consider the set of directions tangent to Lf (α)at¯x, f(¯x)=α:   − n yk x¯ Sf (¯x)= s ∈ R | s = lim . → yk x,f¯ (yk)=α  yk − x¯ 

+  Exercise 5.6.2 If s ∈ Sf (¯x) then f (¯x),s =0. Let s be a direction in Rn,  s = 1. Consider the local decrease of f(x)alongs: 1 Δ(s) = lim [f(¯x + αs) − f(¯x)]. α↓0 α 140LECTURE 5. NONLINEAR PROGRAMMING: UNCONSTRAINED MINIMIZATION

Note that f(¯x + αs) − f(¯x)=αf (¯x),s + o(α). Therefore Δ(s)=f (¯x),s. Using the Cauchy-Schwartz inequality: −x ·y ≤ x, y≤x ·y , we obtain: Δ(s)=f (¯x),s≥−f (¯x)  . Let us takes ¯ = −f (¯x)/  f (¯x .Then Δ(¯s)=−f (¯x),f(¯x)/  f (¯x) = −f (¯x)  . Thus, the direction −f (¯x) (the antigradient) is the direction of the fastest local decrease of f(x)atthepoint¯x. The next statement, which is already known to us, is probably the most fundamental fact in optimization. Theorem 5.6.1 (First-order optimality condition; Ferm`a theorem.) Let x∗ be a local minimum of the differentiable function f(x).Thenf (x∗)=0. Note that this only is a necessary condition of a local minimum. The points satisfying this condition are called the stationary points of function f. Let us look now at the second-order approximation. Let the function f(x)betwice differentiable atx ¯.Then   −  1   − −   − 2 f(y)=f(¯x)+ f (¯x),y x¯ + 2 f (¯x)(y x¯),y x¯ + o( y x¯ ). The quadratic function   −  1   − −  f(¯x)+ f (¯x),y x¯ + 2 f (¯x)(y x¯),y x¯ is called the quadratic (or second-order) approximation of function f atx ¯. Recall that the (n × n)-matrix f (x): 2  ∂ f(x) (f (x))i,j = , ∂xi∂xj is called the Hessian of function f at x. Note that the Hessian is a symmetric matrix: f (x)=[f (x)]T . This matrix can be seen as a derivative of the vector function f (x): f (y)=f (¯x)+f (¯x)(y − x¯)+o( y − x¯ ), where o(r) is some vector function of r ≥ 0 such that 1 lim  o(r) =0, o(0) = 0. r↓0 r Using the second–order approximation, we can write out the second–order optimality conditions. In what follows notation A ≥ 0, used for a symmetric matrix A,meansthatA is positive semidefinite; A>0meansthatA is positive definite. The following result supplies a necessary condition of a local minimum: 5.6. EXERCISES 141

Theorem 5.6.2 (Second-order optimality condition.) Let x∗ be a local minimum of a twice differentiable function f(x).Then

f (x∗)=0,f(x∗) ≥ 0.

Again, the above theorem is a necessary (second–order) characteristics of a local mini- mum. We already know a sufficient condition. Proof: ∗ ∗ Since x is a local minimum of function f(x), there exists r>0 such that for all y ∈ B2(x ,r)

f(y) ≥ f(x∗).

 ∗ ∗ In view of Theorem 5.6.1, f (x ) = 0. Therefore, for any y from B2(x ,r)wehave:

f(y)=f(x∗)+f (x∗)(y − x∗),y− x∗ + o( y − x∗ 2) ≥ f(x∗).

Thus, f (x∗)s, s≥0, for all s,  s =1.

Theorem 5.6.3 Let function f(x) be twice differentiable on Rn and let x∗ satisfy the fol- lowing conditions: f (x∗)=0,f(x∗) > 0.

Then x∗ is a strict local optimum of f(x).

(Sometimes, instead of strict,wesaytheisolated local minimum.) Proof: Note that in a small neighborhood of the point x∗ the function f(x) can be represented as follows: ∗ 1   ∗ − ∗ − ∗  − ∗ 2 f(y)=f(x )+ 2 f (x )(y x ),y x + o( y x ). 1 → ∈ Since r o(r) 0, there exists a valuer ¯ such that for all r [0, r¯]wehave r | o(r) |≤ λ (f (x∗)), 4 1

 ∗  ∗ where λ1(f (x )) is the smallest eigenvalue of matrix f (x ). Recall, that in view of our ∗ assumption, this eigenvalue is positive. Therefore, for any y ∈ Bn(x , r¯)wehave:

≥ ∗ 1  ∗  − ∗ 2  − ∗ 2 f(y) f(x )+ 2 λ1(f (x )) y x +o( y x )

≥ ∗ 1  ∗  − ∗ 2 ∗ f(x )+ 4 λ1(f (x )) y x >f(x ). 142LECTURE 5. NONLINEAR PROGRAMMING: UNCONSTRAINED MINIMIZATION

5.6.4 Classes of differentiable functions It is well-known that any continuous function can be approximated by a smooth function with an arbitrary small accuracy. Therefore, assuming the differentiability only, we cannot derive any reasonable properties of the minimization processes. For that we have impose some additional assumptions on the magnitude of the derivatives of the functional components of our problem. Traditionally, in optimization such assumptions are presented in the form of Lipschitz condition for a derivative of certain order. n k,p Let Q be a subset of R .WedenotebyCL (Q) the class of functions with the following properties: • ∈ k,p any f CL (Q)isk times continuously differentiable on Q. • Its pth derivative is Lipschitz continuous on Q with the constant L:

 f (p)(x) − f (p)(y) ≤ L  x − y 

for all x, y ∈ Q. ≤ ≥ q,p ⊆ k,p 2,1 ⊆ Clearly, we always have p k.Ifq k then CL (Q) CL (Q). For example, CL (Q) 1,1 CL (Q). Note also that these classes possess the following property: ∈ k,p ∈ k,p ∈ 1 if f1 CL1 (Q), f2 CL2 (Q) and α, β R ,then ∈ k,p αf1 + βf2 CL3 (Q) with L3 =| α | L1+ | β | L2. We use notation f ∈ Ck(Q)forf,whichisk times continuously differentiable on Q. 1,1 The most important class of the above type is CL (Q), the class of functions with Lip- ∈ 1,1 n schitz continuous gradient. In view of the definition, the inclusion f CL (R )means that  f (x) − f (y) ≤ L  x − y  (5.6.30) for all x, y ∈ Rn. Let us give a sufficient condition for that inclusion.

+ 2,1 n Exercise 5.6.3 Function f(x) belongs to CL (R ) if and only if

 f (x) ≤ L, ∀x ∈ Rn. (5.6.31)

1,1 n This simple result provides us with many representatives of the class CL (R ).   1,1 n Example 5.6.1 1. Linear function f(x)=α + a, x belongs to C0 (R )since

f (x)=a, f (x)=0.

2. For quadratic function

  1   T f(x)=α + a, x + 2 Ax, x ,A= A , 5.6. EXERCISES 143

we have: f (x)=a + Ax, f (x)=A. ∈ 1,1 n   Therefore f(x) CL (R ) with L = A . √ 3. Consider the function of one variable f(x)= 1+x2, x ∈ R1.Wehave: x 1 f (x)=√ ,f(x)= ≤ 1. 1+x2 (1 + x2)3/2 ∈ 1,1 Therefore f(x) C1 (R). Next statement is important for the geometric interpretation of functions from the class 1,1 n CL (R ). + ∈ 1,1 n n Exercise 5.6.4 Let f CL (R ). Then for any x, y from R we have: L | f(y) − f(x) −f (x),y− x|≤  y − x 2 . (5.6.32) 2

1,1 n Geometrically, this means the following. Consider a function f from CL (R ). Let us fix n some x0 ∈ R and form two quadratic functions   −  L  − 2 φ1(x)=f(x0)+ f (x0),x x0 + 2 x x0 ,

  − − L  − 2 φ2(x)=f(x0)+ f (x0),x x0 2 x x0 .

Then, the graph of the function f is located between the graphs of φ1 and φ2:

n φ1(x) ≥ f(x) ≥ φ2(x), ∀x ∈ R . Let us consider the similar result for the class of twice differentiable functions. Our main 2,2 n class of functions of that type will be CM (R ), the class of twice differentiable functions ∈ 2,2 n with Lipschitz continuous Hessian. Recall that for f CM (R )wehave  f (x) − f (y) ≤ M  x − y  (5.6.33) for all x, y ∈ Rn. + ∈ 2,2 n n Exercise 5.6.5 Let f CL (R ). Then for any x, y from R we have: M  f (y) − f (x) − f (x)(y − x) ≤  y − x 2 . (5.6.34) 2 We have the following corollary of this result: + ∈ 2,2 n  −  Exercise 5.6.6 Let f CM (R ) and y x = r.Then    f (x) − MrIn ≤ f (y) ≤ f (x)+MrIn,

n where In is the unit matrix in R . 144LECTURE 5. NONLINEAR PROGRAMMING: UNCONSTRAINED MINIMIZATION

Recall that for matrices A and B we write A ≥ B if A − B ≥ 0 (positive semidefinite). Here is an example of a quite unexpected use of “optimization” (I would rather say, analytic) technique. This is one of the facts which makes the beauty of the mathematics. I propose you to provide an extremely short proof of The Principal Theorem of Algebra, i.e. you are going to show that a non-constant polynomial has a (complex) root.2

Exercise 5.6.7 ∗ Let p(x) be a polynomial of degree n>0. Without loss of generality we can suppose that p(x)=xn + ..., i.e. the coefficient of the highest degree monomial is 1. Now consider the modulus |p(z)| as a function of the complex argument z ∈ C. Show that this function has a minimum, and that minimum is zero.

Hint: Since |p(z)|→+∞ as |z|→+∞, the continuous function |p(z)| must attain a minimum on a complex plan. Let z be a point of the complex plan. Show that for small complex h

k k+1 p(z + h)=p(z)+h ck + O(|h| ) for some k, 1 ≤ k ≤ n and ck =0.Now,if p(z) = 0 there a choice (which one?) of h small such that |p(z + h)| < |p(z)|.

2This proof is tentatively attributed to Hadamard. Lecture 6

Constrained Minimization

(Penalty Function Methods, Barrier Function Methods, Self-Concordant Barriers, Tracing the path, Interior-Point Methods)

This lecture is devoted to the penalty and the barrier methods; as far as the underlying ideas are concerned, these methods implement the simplest approach to constrained opti- mization – approximate a constrained problem by unconstrained ones. Let us look how it is done.

6.1 Penalty and Barrier function methods

The problem we deal with is as follows:

f(x) → min, (ICP) (6.1.1) gi(x) ≤ 0,i=1,...,m.

1,1 n where gi(x) are smooth functions. For example, we can consider gi(x)fromCL (R ). Since the components of the problem (6.1.1) are general nonlinear functions, we cannot expect it would be easier to solve than the unconstrained minimization problem. Indeed, even the troubles with stationary points, which we have in unconstrained minimization, appears in (6.1.1) in much stronger form. Note that the stationary points of this problem (what ever it is?) can be infeasible for the functional constraints and any minimization scheme, attracted by such point, should admit that it fails to solve the problem. Therefore, the following reasoning looks almost as the only way to proceed:

1. We have several efficient methods for unconstrained minimization. (?)1

1In fact, that is not absolutely true. We will see, that in order to apply the unconstrained minimization method to solving the constrained problems, we need to be sure that we are able to find at least a strict local minimum. And we have already seen (Example 5.2.1), that this could be a problem.

145 146 LECTURE 6. CONSTRAINED MINIMIZATION

2. An unconstrained minimization problem is simpler than a constrained one. (?)2

3. Therefore, let us try to approximate the solution of the constrained problem (6.1.1)by a sequence of solutions of some auxiliary unconstrained problems.

These ideology was implemented in the methods of Sequential Unconstrained Minimiza- tion. There are two main groups of such method: the penalty function methods and the barrier methods. Let us describe the basic ideas of this approach. We start from penalty function methods.

Definition 6.1.1 A continuous function Φ(x) is called a penalty function for a closed set G if

• Φ(x)=0for any x ∈ G,

• Φ(x) > 0 for any x/∈ G.

(Sometimes the penalty function is called just penalty.)

The main property of the penalty functions is as follows: If Φ1(x) is a penalty function for G1 and Φ2(x) is a penalty function for G2 then Φ1(x)+Φ2(x) is a penalty function for the intersection G1 G2. Let us give several examples of the penalty functions.

n Example 6.1.1 Let G = {x ∈ R | gi(x) ≤ 0,i=1,...,m}. Then the following functions are the penalties for the set G:

m 2 1. Quadratic penalty: Φ(x)= (gi(x))+. i=1

m 2. Nonsmooth penalty: Φ(x)= (gi(x))+ i=1

(we denote (a)+ =max{a, 0}). The reader can easily continue this list.

The general scheme of the penalty function method is as follows

Penalty Function Method

n 0. Choose x0 ∈ R . Choose a sequence of penalty coefficients:

0

2We are not going to discuss the correctness of this statement for the general nonlinear problems. Note that we have already observed that it is note true for convex problems. 6.1. PENALTY AND BARRIER FUNCTION METHODS 147

1. kth iteration (k ≥ 0): Find a point

xk+1 =argmin{f(x)+tkΦ(x)} x∈Rn

using xk as a starting point.

It is easy to prove the convergence of this scheme assuming that xk+1 is a global minimum of the auxiliary function.3 Denote ∗ Ψk(x)=f(x)+tkΦ(x), Ψk =minΨk(x) x∈Rn ∗ (Ψk is the global optimal value of Ψk(x)). Let us make the following assumption. Assumption 6.1.1 There exists t>¯ 0 such that the set S = {x ∈ Rn | f(x)+t¯Φ(x) ≤ f ∗} is bounded. Theorem 6.1.1 If the problem (6.1.1) satisfies Assumption 6.1.1 then ∗ lim f(xk)=f , lim Φ(xk)=0. k→∞ k→∞ Proof: ∗ ≤ ∗ ∗ ∈ n ≥ Note that Ψk Ψk(x )=f . Further, for any x R we have: Ψk+1(x) Ψk(x). Therefore ∗ ≥ ∗ Ψk+1 Ψk. Thus, there exists a limit ∗ ≡ ∗ ≤ ∗ lim Ψk Ψ f . k→∞

If tk > t¯ then ∗ f(xk)+t¯Φ(xk) ≤ f(xk)+tkΦ(xk) ≤ f .

Therefore, the sequence {xk} has limit points. For any limit point x∗ we have Φ(x∗) = 0. Therefore x∗ ∈ G and ∗ ∗ Ψ = f(x∗)+Φ(x∗)=f(x∗) ≥ f .

Remark 6.1.1 Note that this result is very general, but not too informative. There are still many questions, which should be answered. For example, we don’t know what kind of penalty function should we use. What should be the rules for choosing the penalty coefficients? What should be the accuracy for solving the auxiliary problems? The main feature of this questions is that they cannot be answered in the framework of the general nonlinear programming theory. Therefore, traditionally, they are considered to be answered by the computational practice.

3If we assume that it is a strict local minimum, then the result is much weaker.

Now it is time to look at our abilities to solve the unconstrained problems

(P_t)   [Ψ(x) = f(x) + t Φ(x)] → min

which, as we already know, for large t are good approximations of the constrained problem in question. In principle we can solve these problems by any one of the unconstrained minimization methods we know, and this is definitely a great advantage of the approach.
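As an illustration of the scheme, here is a minimal Python sketch; the test problem, the use of scipy.optimize.minimize with BFGS as the inner solver, and the tenfold update of the penalty coefficient are illustrative choices of mine, not prescriptions from the lecture:

import numpy as np
from scipy.optimize import minimize

f = lambda x: (x[0] - 2.0)**2 + (x[1] - 2.0)**2   # objective
g = lambda x: x[0] + x[1] - 1.0                    # constraint g(x) <= 0
penalty = lambda x: max(g(x), 0.0)**2              # quadratic penalty

x = np.zeros(2)                                    # x_0
t = 1.0                                            # initial penalty coefficient t_0
for k in range(6):
    # k-th iteration: minimize the auxiliary function f + t_k * Phi, warm-started at x_k
    res = minimize(lambda z: f(z) + t * penalty(z), x, method='BFGS')
    x = res.x
    t *= 10.0                                      # increase the penalty coefficient
print(x)     # approaches the constrained minimizer (0.5, 0.5)
print(g(x))  # constraint violation tends to 0 as t grows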

Remark 6.1.2 There is, anyhow, a severe weak point in this construction: to approximate the constrained problem well by an unconstrained one, we must deal with large values of the penalty parameter, and this, as we shall see in a while, unavoidably makes the unconstrained problem (P_t) ill-conditioned and thus very difficult for any unconstrained minimization method sensitive to the conditioning of the problem. And all the methods for unconstrained minimization we know, except, possibly, the Newton method, are "sensitive" to conditioning (e.g., in Gradient Descent the number of steps required to achieve an ε-solution is, asymptotically, inversely proportional to the condition number of the Hessian of the objective at the optimal point). Even the Newton method, which does not react to the conditioning explicitly – it is "self-scaled" – suffers a lot as applied to an ill-conditioned problem, since here we are forced to invert ill-conditioned Hessian matrices, and this, in actual computations with their rounding errors, causes a lot of trouble. The indicated drawback – ill-conditioning of the auxiliary unconstrained problems – is the main disadvantage of the "straightforward" penalty scheme; because of it the scheme is not that widely used now and is in many cases replaced with the smarter modified Lagrangian scheme.
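The ill-conditioning effect is easy to see on a toy problem (this example is mine, not from the notes): for f(x) = x_1^2 + x_2 with the single constraint x_2 ≥ 0 and the quadratic penalty, the aggregate Ψ_t(x) = x_1^2 + x_2 + t (max{-x_2, 0})^2 has its minimizer at (0, -1/(2t)), where its Hessian is diag(2, 2t), so the condition number grows linearly in t:

import numpy as np

# Toy illustration: Psi_t(x) = x1^2 + x2 + t * max(-x2, 0)^2, minimizer (0, -1/(2t))
for t in [1e1, 1e3, 1e5]:
    x_min = np.array([0.0, -1.0 / (2.0 * t)])
    hessian = np.diag([2.0, 2.0 * t])     # Hessian of Psi_t at its minimizer
    print(t, np.linalg.cond(hessian))     # condition number grows linearly in t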

Let us consider now the barrier methods.

Definition 6.1.2 A continuous function F(x) is called a barrier function for a closed set G with nonempty interior if F(x) → ∞ when x approaches the boundary of the set G.

(Sometimes a barrier function is called a barrier for short.) Similarly to the penalty functions, barriers possess the following property: if F_1(x) is a barrier for G_1 and F_2(x) is a barrier for G_2, then F_1(x) + F_2(x) is a barrier for the intersection G_1 ∩ G_2. In order to apply the barrier approach, our problem must satisfy the Slater condition:

∃ x̄ : g_i(x̄) < 0, i = 1,...,m.

Let us look at some examples of the barriers.

Example 6.1.2 Let G = {x ∈ R^n | g_i(x) ≤ 0, i = 1,...,m}. Then all of the following functions are barriers for G:

1. Power-function barrier: F(x) = Σ_{i=1}^m 1/(−g_i(x))^p,  p ≥ 1.

2. Logarithmic barrier: F(x) = −Σ_{i=1}^m ln(−g_i(x)).

3. Exponential barrier: F(x) = Σ_{i=1}^m exp( 1/(−g_i(x)) ).

The reader can continue this list.
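Again, a small Python sketch of two of these barriers (illustrative, not from the notes); note how the logarithmic barrier blows up as the argument approaches the boundary of the set:

import numpy as np

def log_barrier(g_values):
    """Logarithmic barrier: -sum ln(-g_i(x)), finite only for strictly feasible points."""
    return -np.sum(np.log(-g_values))

def power_barrier(g_values, p=1):
    """Power-function barrier: sum 1/(-g_i(x))^p, p >= 1."""
    return np.sum((-g_values) ** (-p))

# Illustrative set G = {x in R : x - 1 <= 0, -x <= 0} = [0, 1]
g = lambda x: np.array([x - 1.0, -x])
for x in [0.5, 0.99, 0.999999]:
    print(x, log_barrier(g(x)))   # blows up as x approaches the boundary point 1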

The scheme of the barrier method is as follows.

Barrier Function Method

0. Choose x0 ∈ int G. Choose a sequence of penalty coefficients:

0 < t_k < t_{k+1}, t_k → ∞.

1. kth iteration (k ≥ 0): Find a point

x_{k+1} = argmin_{x∈G} { f(x) + (1/t_k) F(x) }

using xk as a starting point.
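A minimal sketch of the whole scheme on a one-dimensional toy problem (my own illustration, with a guarded bisection playing the role of the inner solver so that all trial points stay in int G):

import numpy as np

# Toy problem: minimize f(x) = x over G = [0, 1], optimal value f* = 0.
# Log barrier for G: F(x) = -ln(x) - ln(1 - x).
def dFt(x, t):
    """Derivative of the aggregate F_t(x) = x + (1/t) * F(x); F_t is convex on (0,1)."""
    return 1.0 + (1.0 / t) * (-1.0 / x + 1.0 / (1.0 - x))

t = 1.0
for k in range(8):
    # Auxiliary problem: find the root of dFt by bisection; every trial point stays in int G.
    lo, hi = 1e-12, 1.0 - 1e-12
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if dFt(mid, t) > 0:
            hi = mid
        else:
            lo = mid
    x_t = 0.5 * (lo + hi)
    print(t, x_t)   # f(x_t) - f* = x_t shrinks roughly like 1/t as t grows
    t *= 10.0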

Let us prove the convergence of this method assuming that x_{k+1} is a global minimum of the auxiliary function. Denote

F_k(x) = f(x) + (1/t_k) F(x),   F_k^* = min_{x∈G} F_k(x)

(F_k^* is the global optimal value of F_k(x)).

Assumption 6.1.2 The barrier F(x) is bounded from below: F(x) ≥ F^* for all x ∈ G.

Theorem 6.1.2 Let (6.1.1) satisfy Assumption 6.1.2. Then

lim_{k→∞} F_k^* = f^*.

Proof:
Let x̄ ∈ int G. Then

lim_{k→∞} F_k^* ≤ lim_{k→∞} [ f(x̄) + (1/t_k) F(x̄) ] = f(x̄).

Therefore lim_{k→∞} F_k^* ≤ f^*. Further,

F_k^* = min_{x∈G} { f(x) + (1/t_k) F(x) } ≥ min_{x∈G} { f(x) + (1/t_k) F^* } = f^* + (1/t_k) F^*.

Thus, lim_{k→∞} F_k^* = f^*.

Same as with the penalty function method, there are many questions to be answered. We don't know how to find the starting point x_0 and how to choose the best barrier function. We don't know the rules for updating the penalty coefficients or the acceptable accuracy of the solutions to the auxiliary problems. Finally, we have no idea about the efficiency estimates of this process. And the reason is not the lack of theory. Our problem (6.1.1) is just too complicated. If I were writing this lecture, say, 20 years ago, I would probably stop here, or add some complaints about the fact that, in the same way as for the penalty scheme, "the problems of minimization of F_k normally (when the solution to the original problem is on the boundary of G; otherwise the problem actually is unconstrained) become the more ill-conditioned the larger is t, so that the difficulties of their numerical solution grow with the penalty parameter". Writing this lecture now, I would say something quite opposite: there exist important situations when the difficulties in the numerical minimization of F_k do not increase with the penalty parameter, and the overall scheme turns out to be theoretically efficient and, moreover, the best known so far. This change in the evaluation of the scheme is the result of the recent "interior point revolution" in Optimization which I have already mentioned in Lecture 5.

6.2 Self-concordant barriers and path-following scheme

Assume from now on that our problem (ICP) is convex (the revolution we are speaking about deals, at least directly, with Convex Programming only). It is well known that the convex program (ICP) can be easily rewritten as a program with linear objective; indeed, it suffices to extend the vector of design variables by one more variable, let it be called t, and to rewrite (ICP) as the problem

t → min | g_j(x) ≤ 0, j = 1,...,k,  f(x) − t ≤ 0.

The resulting problem is still convex and has linear objective. To save notation, assume that the outlined conversion is carried out in advance, so that already the original problem has linear objective and is therefore of the form

(P)   f(x) ≡ c^T x → min | x ∈ G ⊂ R^n.

Here the feasible set G of the problem is convex; we also assume from now on that it is closed, bounded and possesses a nonempty interior. Let us denote

(P_t)   [F_t(x) ≡ f(x) + (1/t) F(x)] → min

Our abilities to solve (P) efficiently by an interior point method depend on our abilities to point out a "good" interior penalty F for the feasible domain. What we are interested in is a ϑ-self-concordant barrier F; the meaning of these words is given by the following

Definition 6.2.1 [Self-concordant barrier] Let ϑ ≥ 1. We say that a function F is a ϑ-self-concordant barrier for the feasible domain G of problem (P), if

• F is a self-concordant function on int G (Section 5.4, Lecture 5), i.e., a three times continuously differentiable convex function on int G possessing the barrier property (i.e., F(x_i) → ∞ along every sequence of points x_i ∈ int G converging to a boundary point of G) and satisfying the differential inequality

| (d^3/dt^3)|_{t=0} F(x + th) | ≤ 2 ( h^T F''(x) h )^{3/2}   ∀ x ∈ int G, ∀ h ∈ R^n;

• F satisfies the differential inequality

| h^T F'(x) | ≤ √ϑ ( h^T F''(x) h )^{1/2}   ∀ x ∈ int G, ∀ h ∈ R^n.   (6.2.2)

An immediate example is as follows (cf. Section 5.4.4, Lecture 5):

Example 6.2.1 [Logarithmic barrier for a polytope] Let

G = { x ∈ R^n | a_j^T x ≤ b_j, j = 1,...,m }

be a polytope given by a list of linear inequalities satisfying the Slater condition (i.e., there exists x̄ such that a_j^T x̄ < b_j, j = 1,...,m). Then the function

F(x) = − Σ_{j=1}^m ln( b_j − a_j^T x )

is an m-self-concordant barrier for G.
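For later use it is convenient to have the value, gradient and Hessian of this barrier in computable form; here is a small Python sketch (illustrative; the unit-box instance and the helper name polytope_barrier are my choices, not notation from the notes):

import numpy as np

# Value, gradient and Hessian of the log-barrier F(x) = -sum_j ln(b_j - a_j^T x)
# for a polytope G = {x : A x <= b}; x is assumed to lie in int G (all slacks > 0).
def polytope_barrier(A, b, x):
    s = b - A @ x                                 # slacks b_j - a_j^T x
    value = -np.sum(np.log(s))
    grad = A.T @ (1.0 / s)                        # F'(x)  = sum_j a_j / s_j
    hess = A.T @ np.diag(1.0 / s**2) @ A          # F''(x) = sum_j a_j a_j^T / s_j^2
    return value, grad, hess

# Illustrative polytope: the unit box [0,1]^2 written via m = 4 linear inequalities
A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.array([1.0, 0.0, 1.0, 0.0])
value, grad, hess = polytope_barrier(A, b, np.array([0.3, 0.7]))
print(value, grad, np.linalg.eigvalsh(hess))      # Hessian is positive definite on int G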

In a moment we will justify this example and will consider the crucial issue of how to find a self-concordant barrier for a given feasible domain. For the time being, let us focus on another issue: how to solve (P), given a ϑ-self-concordant barrier for the feasible domain of the problem. What we intend to do is to use the path-following scheme associated with the barrier – a certain very natural implementation of the barrier method.

6.2.1 Path-following scheme

When speaking about the barrier scheme, we simply claimed that the minimizers of the aggregate F_t approach, as t → ∞, the optimal set of (P). The immediate recommendation coming from this claim could be: choose a "large enough" value of the penalty and solve, for the chosen t, the problem (P_t) of minimizing F_t, thus coming to a good approximate solution to (P). It makes, anyhow, sense to come to the solution of (P_t) gradually, by solving

sequentially the problems (P_{t_k}) along an increasing sequence of values of the penalty parameter. Namely, assume that the barrier F we use is nondegenerate, i.e., F''(x) is positive definite at every point x ∈ int G (note that this indeed is the case when F is a self-concordant barrier for a bounded feasible domain, see Proposition 5.4.1.(i)). Then the optimal set of F_t for every

positive t is a singleton (we already know that it is nonempty, and the uniqueness of the minimizer follows from the convexity and nondegeneracy of F_t). Thus, we have a path

x^*(t) = argmin_{int G} F_t(·);

as we know from Theorem 6.1.2, this path converges to the optimal set of (P) as t → ∞; besides this, it can be easily seen that the path is continuous (even continuously differentiable) in t. In order to approximate x^*(t) with large values of t via the path-following scheme, we trace the path x^*(t), namely, generate sequentially approximations x(t_i) to the points x^*(t_i) along a certain sequence t_0 < t_1 < t_2 < ... of values of the penalty parameter diverging to infinity. The updating (x(t_i), t_i) → (x(t_{i+1}), t_{i+1}) is as follows:

• first, choose somehow a new value t_{i+1} > t_i of the penalty parameter;

• second, apply to the function F_{t_{i+1}}(·) a method for unconstrained minimization started at x(t_i), and run the method until closeness to the new target point x^*(t_{i+1}) is restored, thus coming to the new iterate x(t_{i+1}) close to the new target point of the path.

Our hope is that since x^*(t) is continuous in t and x(t_i) is "close" to x^*(t_i), for "not too large" t_{i+1} − t_i the point x(t_i) will be "not too far" from the new target point x^*(t_{i+1}), so that the unconstrained minimization method we use will quickly restore closeness to the new target point. With this "gradual" movement, we may hope to arrive near x^*(t) with large t faster than by attacking the problem (P_t) directly.

All this was known for many years; the progress during the last decade was in transforming these qualitative ideas into exact quantitative recommendations. Namely, it turned out that

• A. The best possibilities to carry this scheme out are when the barrier F is ϑ-self-concordant; the less is the value of ϑ, the better;

• B. The natural measure of "closeness" of a point x ∈ int G to the point x^*(t) of the path is the Newton decrement of the self-concordant function

Φ_t(x) = t F_t(x) ≡ t c^T x + F(x)

at the point x, i.e., the quantity

λ(Φ_t, x) = ( [Φ_t'(x)]^T [Φ_t''(x)]^{-1} Φ_t'(x) )^{1/2}

(cf. Proposition 5.4.1.(iii)). More specifically, the notion "x is close to x^*(t)" is convenient to define as the relation

λ(Φ_t, x) ≤ 0.05   (6.2.3)

(in fact, 0.05 in the right hand side could be replaced with an arbitrary absolute constant < 1, with slight modification of the subsequent statements; I choose this particular value for the sake of simplicity).

Now, what do all these words "the best possibility" and "natural measure" actually mean? It is said by the following two statements.

• C. Assume that x is close, in the sense of (6.2.3), to a point x^*(t) of the path x^*(·) associated with a ϑ-self-concordant barrier for the feasible domain G of problem (P). Let us increase the parameter t to the larger value

t^+ = ( 1 + 0.08/√ϑ ) t   (6.2.4)

and replace x by its damped Newton iterate (cf. (5.4.20), Lecture 5)

x^+ = x − [ 1/(1 + λ(Φ_{t^+}, x)) ] [Φ_{t^+}''(x)]^{-1} Φ_{t^+}'(x).   (6.2.5)

Then x^+ is close, in the sense of (6.2.3), to the new target point x^*(t^+) of the path.

C. says that we are able to trace the path (all the time staying close to it in the sense of B.) increasing the penalty parameter linearly in the ratio (1 + 0.08 ϑ^{-1/2}) and accompanying each step in the penalty parameter by a single Newton step in x. And why we should be happy with this is said by

• D. If x is close, in the sense of (6.2.3), to a point x^*(t) of the path, then the inaccuracy, in terms of the objective, of the point x as an approximate solution to (P) is bounded from above by 2ϑ t^{-1}:

f(x) − min_{x∈G} f(x) ≤ 2ϑ/t.   (6.2.6)

D. says that the inaccuracy of the iterates x(t_i) formed in the above path-following procedure goes to 0 as 1/t_i, while C. says that we are able to increase t_i linearly, at the cost of a single Newton step per each updating of t. Thus, we come to the following

Theorem 6.2.1 Assume that we are given

• (i) a ϑ-self-concordant barrier F for the feasible domain G of problem (P)

• (ii) a starting pair (x_0, t_0) with t_0 > 0 and x_0 being close, in the sense of (6.2.3), to the point x^*(t_0).

Consider the path-following method (cf. (6.2.4)-(6.2.5))

t_{i+1} = ( 1 + 0.08/√ϑ ) t_i;   x_{i+1} = x_i − [ 1/(1 + λ(Φ_{t_{i+1}}, x_i)) ] [Φ_{t_{i+1}}''(x_i)]^{-1} Φ_{t_{i+1}}'(x_i).   (6.2.7)

Then the iterates of the method are well-defined, belong to the interior of G and the method possesses a linear global rate of convergence:

f(x_i) − min_G f ≤ (2ϑ/t_0) ( 1 + 0.08/√ϑ )^{-i}.   (6.2.8)

In particular, to make the residual in f less than a given ε > 0, it suffices to perform no more than

N(ε) ≤ 20 √ϑ ln( 1 + 20ϑ/(t_0 ε) )   (6.2.9)

Newton steps.

We see that the parameter ϑ of the self-concordant barrier underlying the method is responsible for the Newton complexity of the method – the factor at the log-term in the complexity bound (6.2.9).
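Here is a minimal Python sketch of the short-step recursion (6.2.7) on a toy LP over the unit box, with the logarithmic barrier of Example 6.2.1 (the instance, the starting pair and the iteration count are illustrative choices of mine; the analytic center of the box together with a small t_0 gives a pair satisfying (6.2.3)):

import numpy as np

# Toy instance: minimize c^T x over the unit box G = [0,1]^2 (m = 4 inequalities, theta = 4).
A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.array([1.0, 0.0, 1.0, 0.0])
c = np.array([1.0, 1.0])             # optimal solution is (0, 0), optimal value 0
theta = 4.0                           # barrier parameter

def grad_hess_Phi(x, t):
    """Gradient and Hessian of Phi_t(x) = t*c^T x + F(x), F the box log-barrier."""
    s = b - A @ x
    grad = t * c + A.T @ (1.0 / s)
    hess = A.T @ np.diag(1.0 / s**2) @ A
    return grad, hess

x = np.array([0.5, 0.5])              # analytic center of the box: F'(x) = 0
t = 0.05                              # then lambda(Phi_t, x) = 0.025 <= 0.05, so (x, t) is "close"
for i in range(300):
    t *= 1.0 + 0.08 / np.sqrt(theta)                  # penalty update (6.2.4)
    grad, hess = grad_hess_Phi(x, t)
    newton_dir = np.linalg.solve(hess, grad)
    lam = np.sqrt(grad @ newton_dir)                   # Newton decrement lambda(Phi_t, x)
    x = x - newton_dir / (1.0 + lam)                   # damped Newton step (6.2.5)
print(c @ x, 2 * theta / t)           # observed gap vs. the bound 2*theta/t from (6.2.6)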

Remark 6.2.1 The presented result does not explain how to start tracing the path – how to get an initial pair (x_0, t_0) close to the path. This turns out to be a minor difficulty: given in advance a strictly feasible solution x̄ to (P), we could use the same path-following scheme (applied to a certain artificial objective) to come close to the path x^*(·), thus arriving at a position from which we can start tracing the path. In our very brief outline of the topic, it makes no sense to go into these "details of initialization"; it suffices to say that the necessity to start by approaching x^*(·) basically does not spoil the overall complexity of the method.

It makes sense, if not to prove the aforementioned statements – the complete proofs, although rather simple, go beyond the scope of today's lecture – then at least to motivate them: to explain what is the role of self-concordance and of the "magic inequality" (6.2.2) in ensuring properties C. and D. (this is all we need – the Theorem, of course, is an immediate consequence of these two properties). Let us start with C. – this property is much more important. Thus, assume we are at a point x close, in the sense of (6.2.3), to x^*(t). What does this inequality actually say? Let us denote by

‖h‖_{H^{-1}} = ( h^T H^{-1} h )^{1/2}

the scaled Euclidean norm given by the inverse of the matrix

H ≡ Φ_t''(x) = F''(x)

(the equality comes from the fact that Φ_t and F differ by a linear function t f(x) ≡ t c^T x). Note that by definition of λ(·, ·) one has

λ(Φ_s, x) = ‖Φ_s'(x)‖_{H^{-1}} ≡ ‖s c + F'(x)‖_{H^{-1}}.

Due to the last formula, the closeness of x to x^*(t) (see (6.2.3)) means exactly that

‖Φ_t'(x)‖_{H^{-1}} ≡ ‖t c + F'(x)‖_{H^{-1}} ≤ 0.05,

whence, by the triangle inequality,

‖t c‖_{H^{-1}} ≤ 0.05 + ‖F'(x)‖_{H^{-1}} ≤ 0.05 + √ϑ   (6.2.10)

(the concluding inequality here is given by (6.2.2)4), and this is the main point where this component of the definition of a self-concordant barrier comes into play). From the indicated relations,

λ(Φ_{t^+}, x) = ‖t^+ c + F'(x)‖_{H^{-1}} ≤ ‖(t^+ − t) c‖_{H^{-1}} + ‖t c + F'(x)‖_{H^{-1}}
= ((t^+ − t)/t) ‖t c‖_{H^{-1}} + λ(Φ_t, x)   [see (6.2.4), (6.2.10)]
≤ (0.08/√ϑ)(0.05 + √ϑ) + 0.05 ≤ 0.134 (≤ 1/4)

(note that ϑ ≥ 1 by Definition 6.2.1). According to Proposition 5.4.1.(iii.3), Lecture 5, the indicated inequality says that we are in the domain of quadratic convergence of the damped Newton method as applied to the self-concordant function Φ_{t^+}; namely, the indicated Proposition says that

λ(Φ_{t^+}, x^+) ≤ 2(0.134)^2/(1 − 0.134) < 0.05,

as claimed in C.. Note that this reasoning heavily exploits the self-concordance of F.

To establish property D., one needs to analyze in more detail the notion of a self-concordant barrier, and I am not going to do it here. Just to demonstrate where ϑ comes from, let us prove an estimate similar to (6.2.6) for the particular case when, first, the barrier in question is the standard logarithmic barrier given by Example 6.2.1 and, second, the point x is exactly the point x^*(t) rather than close to the latter point. Under the outlined assumptions we have

x = x^*(t)  ⇒  Φ_t'(x) = 0  ⇒

[substitute expressions for Φ_t and F]

t c + Σ_{j=1}^m a_j/(b_j − a_j^T x) = 0   ⇒

[take inner product with x − x^*, x^* being an optimal solution to (P)]

t c^T (x − x^*) = Σ_{j=1}^m a_j^T (x^* − x)/(b_j − a_j^T x) ≤

[take into account that a_j^T (x^* − x) = a_j^T x^* − a_j^T x ≤ b_j − a_j^T x due to x^* ∈ G]

≤ m,

whence

c^T (x − x^*) ≤ m/t ≡ ϑ/t

(for the case in question ϑ = m). This estimate is twice better than (6.2.6) – this is because we have considered the case of x = x^*(t) rather than the one of x close to x^*(t).

4Indeed, for a positive definite symmetric matrix H it clearly is the same (why?) to say that ‖g‖_{H^{-1}} ≤ α and to say that |g^T h| ≤ α ‖h‖_H for all h.

6.2.2 Applications

Linear Programming. The most famous (although, I believe, not the most important) application of Theorem 6.2.1 deals with Linear Programming, when G is a polytope and F is the standard logarithmic barrier for this polytope (see Example 6.2.1). For this case, the Newton complexity of the method5) is O(√m), m being the # of linear inequalities involved in the description of G. Each Newton step costs, as is easily seen, O(m n^2) arithmetic operations, so that the arithmetic cost per accuracy digit – the number of arithmetic operations required to reduce the current inaccuracy by an absolute constant factor – turns out to be O(m^{1.5} n^2). Thus, we get a polynomial time solution method for LP with very nice complexity characteristics, typically (for m and n of the same order) better than those, e.g., of the Ellipsoid method. Note also that with certain "smart" implementation of the Linear Algebra, the above arithmetic cost can be reduced to O(m n^2); this is the best upper complexity bound, cubic in the size of the problem, known so far for Linear Programming.

To increase the list of application examples, note that our abilities to solve in the outlined style a convex program of a given structure are limited only by our abilities to point out a self-concordant barrier for the corresponding feasible domain. In principle, there are no limits at all – it can be proved that every closed convex domain in R^n admits a self-concordant barrier with the value of the parameter at most O(n). This "universal barrier" is given by a certain multivariate integral and is too complicated for actual computations; recall that we should form and solve Newton systems associated with our barrier, so that we need it to be "explicitly computable".

Thus, we come to the following important question: how to construct "explicit" self-concordant barriers? There are many cases when we are clever enough to point out "explicitly computable" self-concordant barriers for the convex domains we are interested in. We already know one example of this type – Linear Programming (although we do not know at the moment why the standard logarithmic barrier for a polytope given by m linear constraints is m-self-concordant). What helps us to construct self-concordant barriers and to evaluate their parameters are the following extremely simple combination rules, completely similar to those for self-concordant functions (see Section 5.4.4, Lecture 5):

• [Linear combination with coefficients ≥ 1] Let F_i, i = 1,...,m, be ϑ_i-self-concordant barriers for the closed convex domains G_i, let the intersection G of these domains possess a nonempty interior, and let α_i ≥ 1, i = 1,...,m, be given reals. Then the function

F(x) = Σ_{i=1}^m α_i F_i(x)

is a (Σ_{i=1}^m α_i ϑ_i)-self-concordant barrier for G.

5recall that it is the factor at the logarithmic term in (6.2.9), i.e., the # of Newton steps sufficient to reduce the current inaccuracy by an absolute constant factor, say, by factor 2; cf. the stories about polynomial time methods from Lecture 7

• [Affine substitution] Let F(x) be a ϑ-self-concordant barrier for the closed convex domain G ⊂ R^n, and let x = Aξ + b be an affine mapping from R^k into R^n with the image intersecting int G. Then the composite function

F^+(ξ) = F(Aξ + b)

is a ϑ-self-concordant barrier for the closed convex domain

G^+ = { ξ | Aξ + b ∈ G }

which is the inverse image of G under the affine mapping in question.

The indicated combination rules can be applied to the "raw materials" as follows:

• [Logarithm] The function

− ln(x)

is a 1-self-concordant barrier for the nonnegative ray R_+ = { x ∈ R | x ≥ 0 };

[the indicated property of the logarithm is given by a one-line computation]

• [Extension of the previous example: Logarithmic barrier, linear/quadratic case] Let

G = cl { x ∈ R^n | φ_j(x) < 0, j = 1,...,m }

be a nonempty set in R^n given by m convex quadratic (e.g., linear) inequalities satisfying the Slater condition. Then the function

F(x) = − Σ_{j=1}^m ln( −φ_j(x) )

is an m-self-concordant barrier for G.

Note that for the case when all the functions φ_j are linear, the conclusion immediately follows from the combination rules (a polyhedral set given by m linear inequalities is the intersection of m half-spaces, and a half-space is the inverse image of the nonnegative ray under an affine mapping; applying the combination rules to the barrier − ln x for the nonnegative ray, we get, without any computations, that the standard logarithmic barrier for a polyhedral set given by m linear inequalities is m-self-concordant). For the case when there are also quadratic forms among the φ_j, one needs several lines of computations.

Note that the case of linear φ_j covers the entire Linear Programming, while the case of convex quadratic φ_j covers the much wider family of quadratically constrained convex quadratic problems. Furthermore, the following examples seem much more important:

• The function

F(t, x) = − ln( t^2 − x^T x )

is a 2-self-concordant barrier for the "ice-cream cone"

K_+^n = { (t, x) ∈ R × R^n | t ≥ |x| };

the function

F(X) = − ln Det X

is an n-self-concordant barrier for the cone S_+^n of positive definite symmetric n × n matrices.
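Both of these barriers are one-liners to evaluate; here is a small illustrative Python sketch (the inputs are assumed to lie in the interiors of the respective cones; the function names are mine):

import numpy as np

def ice_cream_barrier(t, x):
    """F(t, x) = -ln(t^2 - x^T x): 2-self-concordant barrier for {(t, x) : t >= |x|}."""
    return -np.log(t**2 - x @ x)

def log_det_barrier(X):
    """F(X) = -ln Det X: n-self-concordant barrier for positive definite symmetric X."""
    sign, logdet = np.linalg.slogdet(X)   # slogdet is numerically safer than log(det(X))
    return -logdet

print(ice_cream_barrier(2.0, np.array([1.0, 1.0])))          # interior point: t^2 - |x|^2 = 2 > 0
print(log_det_barrier(np.array([[2.0, 0.5], [0.5, 1.0]])))   # a positive definite 2x2 matrix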

One could hardly imagine how wide is the class of applications – from Combinatorial Optimization to Structural Design and Stability Analysis/Synthesis in Control – of the latter two barriers, especially of the "Log-Det" one.

6.2.3 Concluding remarks

The path-following scheme results in the most efficient, in terms of the worst-case complexity analysis, interior point methods for Linear and Convex Programming. From the practical viewpoint, anyhow, this scheme, in its aforementioned form, looks bad. The severe practical drawback of the scheme is its "short-step" nature – according to the scheme, the penalty parameter should be updated by the "programmed" rule (6.2.4), and this makes the actual performance of the method more or less close to the one given by (6.2.9). Thus, in the straightforward implementation of the scheme the complexity estimate (6.2.9) will be not just the theoretical upper bound on the worst-case complexity of the method, but an indication of the "typical" performance of the algorithm. And a method actually working according to the complexity estimate (6.2.9) could be fine theoretically, but it definitely will be of very restricted practical interest in the large-scale case. E.g., for an LP program with m ≈ 10^5 inequality constraints and n ≈ 10^4 variables (these are respectable, but in no sense "outstanding" sizes for a practical LP program) estimate (6.2.9) predicts something like hundreds of Newton steps with Newton systems of size 10^4 × 10^4; even in the case of good sparsity structure of the systems, such a computation would be much more time consuming than the one given by the classical Simplex Method. In order to get "practical" path-following methods, we need a long-step tactics – rules for on-line adjusting of the stepsizes in the penalty parameter to, let me say, the local curvature of the path, rules which allow us to update the parameter as fast as possible – possible from the viewpoint of the actual "numerical circumstances" rather than from the viewpoint of the very conservative theoretical worst-case complexity analysis. Today there are efficient "long-step" policies of tracing the paths, policies which are both fine theoretically (i.e., satisfy the complexity bound (6.2.9)) and very efficient computationally. An extremely surprising phenomenon here is that for "good" long-step path-following methods as applied to convex problems of the most important classes (Linear Programming, Quadratically Constrained Convex Quadratic Programming and some others) it turns out that

the actually observed number of Newton iterations required to solve the problem within reasonable accuracy is basically independent of the sizes of the problem and is within 30-50, even ≤ 20 iterations in most situations.

This “empirical fact” (which can be only partly supported by theoretical considerations, not proved completely) is extremely important for applications; it makes polynomial time interior point methods the most attractive (and sometimes - the only appropriate) optimization tool in many important large-scale applications. I should add that the efficient “long-step” implementations of the path-following scheme 6.2. SELF-CONCORDANT BARRIERS AND PATH-FOLLOWING SCHEME 159

are relatively new, and for a long time6) the only interior point methods which demonstrated the outlined "data- and size-independent" convergence rate were the so-called potential reduction interior point methods. In fact, the very first interior point method – the method of Karmarkar for LP – which initiated the entire interior point revolution, was a potential reduction algorithm, and what indeed caused the revolution was the outstanding practical performance of this method. The method of Karmarkar possesses a very nice (and in fact very simple) geometry and is closely related to the interior penalty scheme; anyhow, time limitations force me to skip the description of this wonderful, although now old-fashioned, algorithm. The concluding remark I would like to make is as follows: all polynomial time implementations of the penalty/barrier scheme known so far are those of the barrier scheme (which is reflected in the name of these implementations: "interior point methods"); numerous attempts to do something similar with the penalty approach have not been successful. It is a pity, due to some attractive properties of the penalty scheme (e.g., here you do not meet the problem of finding a feasible starting point, which, of course, is needed to start the barrier scheme).

6if I may qualify as "long" a part of a story which – the entire story – started only in 1984