How to advance in Structural Convex Optimization

October, 2008

Abstract  In this paper we are trying to analyze the common features of the recent advances in Structural Convex Optimization: polynomial-time interior-point methods, the smoothing technique, minimization in relative scale, and minimization of composite functions.

Keywords  convex optimization · nonsmooth optimization · complexity theory · black-box model · optimal methods · structural optimization · smoothing technique

Subject Classification (2000)  90C06 · 90C22 · 90C25 · 90C60

Convex Optimization is one of the rare fields of Numerical Analysis which benefit from the existence of a well-developed complexity theory. In our domain, this theory was created in the middle of the seventies in a series of papers by A. Nemirovsky and D. Yudin (see [8] for a full exposition). It consists of three parts:
– Classification and description of problem instances.
– Lower complexity bounds.
– Optimal methods.

In [8], the complexity of a convex optimization problem was linked with its level of smoothness, introduced by Hölder conditions on the first derivatives of the functional components. It was assumed that the only information the optimization methods can learn about a particular problem instance is the values and derivatives of these components at some test points. This data can be reported by a special unit called the oracle, and it is local, which means that it does not change if the problem is modified far enough from the test point. This model of interaction between the optimization scheme and the problem data is called the local Black Box. At the time of its development, this concept fitted very well the existing computational practice, where the interface between general optimization packages and the problem data was established by Fortran subroutines created independently by the users.

The Black-Box framework allows us to speak about lower performance bounds for different problem classes in terms of informational complexity. This is a lower estimate for the number of calls of the oracle which is necessary for any optimization method in order to guarantee delivering an ε-solution to any problem from the problem class. In this performance measure we do not include at all the complexity of the auxiliary computations of the scheme.
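To make the black-box setting concrete, here is a minimal sketch (in Python, not taken from [8]) of such a local first-order oracle: the optimization scheme sees only the values and gradients returned at its test points, and the informational complexity simply counts these calls. All names below are illustrative.

    import numpy as np

    class FirstOrderOracle:
        """Local black-box oracle: reports f(x) and its gradient at a test point.

        The optimization scheme sees nothing else about the problem; the
        informational complexity is the number of calls made here.
        """

        def __init__(self, f, grad_f):
            self.f = f
            self.grad_f = grad_f
            self.calls = 0                       # number of oracle calls

        def __call__(self, x):
            self.calls += 1
            return self.f(x), self.grad_f(x)

    # Example: a smooth convex quadratic queried by a simple gradient scheme.
    A = np.array([[2.0, 0.0], [0.0, 10.0]])
    oracle = FirstOrderOracle(lambda x: 0.5 * x @ A @ x, lambda x: A @ x)

    x = np.array([1.0, 1.0])
    for _ in range(50):
        fx, gx = oracle(x)                       # the only interaction with the problem data
        x = x - 0.1 * gx                         # gradient step with a fixed step size

    print(oracle.calls, fx)                      # 50 calls; the value is close to the minimum 0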

This paper was written during the author's visit to IFOR (ETH, Zurich). The author expresses his gratitude to the Scientific Director of this center, Hans-Jacob Lüthi, for his support and excellent working conditions.

Yurii Nesterov
Center for Operations Research and Econometrics (CORE), Université catholique de Louvain (UCL), 34 voie du Roman Pays, 1348 Louvain-la-Neuve, Belgium.
Tel.: +32-10-474348. Fax: +32-10-474301. e-mail: [email protected]

Let us present these bounds for the most important classes of optimization problems posed in the form

    min_{x ∈ Q} f(x),                                                        (1)

where Q ⊆ R^n is a bounded closed convex set (‖x‖ ≤ R for all x ∈ Q), and the function f is convex on Q. In the table below, the first column indicates the problem class, the second one gives an upper bound for the allowed number of calls of the oracle in the optimization scheme¹, and the last column gives the lower bound for the analytical complexity of the problem class, which depends on the absolute accuracy ε and the class parameters.

    Problem class                            Calls of the oracle    Lower complexity bound
    C1:  |f(x) − f(y)| ≤ M ‖x − y‖                ≤ O(n)             O( M² R² / ε² )
    C2:  ‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖              ≤ O(n)             O( (L R² / ε)^{1/2} )
    C3:  |f(x) − f(y)| ≤ M ‖x − y‖                no limit           O( n ln(MR/ε) )
                                                                                        (2)

¹ If this upper bound is smaller than O(n), then the dimension of the problem is really very big, and we cannot afford the method to perform this number of calls.

It is important that these bounds are exact. This means that there exist methods whose efficiency estimates on the corresponding problem classes are proportional to the lower bounds. The corresponding optimal methods were developed in [8,9,19,22,23]. For future reference, we present a simplified version of the optimal method [9]², as applied to problem (1) with f ∈ C2:

    Choose a starting point y_0 ∈ Q and set x_{−1} = y_0. For k ≥ 0 iterate:
        x_k = arg min_{x ∈ Q} { ⟨∇f(y_k), x − y_k⟩ + (L/2) ‖x − y_k‖² },
        y_{k+1} = x_k + (k/(k+3)) (x_k − x_{k−1}).                           (3)

² In method (11)-(13) from [9], we can set a_k = 1 + k/2, since in the proof we only need to ensure a_{k+1}² − a_k² ≤ a_{k+1}.

As we see, the complexity of each iteration of this scheme is comparable with that of the simplest gradient method. However, the rate of convergence of method (3) is much faster.
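For illustration, the sketch below implements a fast gradient scheme of the type (3) in the unconstrained case, where the auxiliary minimization reduces to an ordinary gradient step; the momentum coefficient k/(k+3) corresponds to the choice a_k = 1 + k/2 mentioned in the footnote. It is a sketch under these simplifying assumptions, not a verbatim transcription of the method of [9].

    import numpy as np

    def fast_gradient(grad_f, L, x0, iterations=300):
        """Sketch of a fast gradient scheme of the type (3), unconstrained case.

        grad_f : gradient of a convex function f with L-Lipschitz-continuous gradient
        L      : Lipschitz constant of the gradient
        """
        y = x0.copy()
        x_prev = x0.copy()                            # plays the role of x_{-1} = y_0
        for k in range(iterations):
            x = y - grad_f(y) / L                     # gradient step from y_k
            y = x + (k / (k + 3.0)) * (x - x_prev)    # momentum (a_k - 1)/a_{k+1} = k/(k+3)
            x_prev = x
        return x

    # Example: a badly conditioned convex quadratic f(x) = 0.5 <Ax, x> - <b, x>.
    A = np.diag(np.linspace(1.0, 100.0, 50))
    b = np.ones(50)
    x_star = np.linalg.solve(A, b)
    x = fast_gradient(lambda v: A @ v - b, L=100.0, x0=np.zeros(50))
    print(np.linalg.norm(x - x_star))                 # distance to the minimizer after 300 iterations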

After a certain period of time, it became clear that, despite its mathematical excellence, the Complexity Theory of Convex Optimization has a hidden drawback. Indeed, in order to apply convex optimization methods, we need to be sure that the functional components of our problem are convex. However, we can check convexity only by analyzing the structure of these functions³: if our function is obtained from basic convex functions by convex operations (summation, maximum, etc.), we conclude that it is convex. If not, then we have to apply general optimization methods, which usually do not have theoretical guarantees for their global performance. Thus, the functional components of the problem are not in the black box at the moment we check their convexity and choose a minimization scheme. However, we put them into the black box for the numerical methods. That is the main conceptual contradiction of standard Convex Optimization.

³ Numerical verification of convexity is an extremely difficult problem.

Intuitively, we always hope that the structure of the problem can be used for improving the performance of minimization schemes. Unfortunately, structure is a very fuzzy notion, which is quite difficult to formalize. One possible way to describe the structure is to fix the analytical type of the functional components. For example, we can consider problems with linear constraints only. It can help, but this approach is very fragile: if we add just a single constraint of another type, then we get a new problem class, and all the theory must be redone from scratch. On the other hand, it is clear that, having the structure at hand, we can play a lot with the analytical form of the problem. We can rewrite the problem in many equivalent settings using non-trivial transformations of variables or constraints, introducing additional variables, etc. However, this would serve almost no purpose without fixing a clear final goal. So, let us try to understand what it could be.

As usual, it is better to look at classical examples. In many situations the sequential reformulations of the initial problem can be seen as a part of the numerical scheme. We start from a complicated problem P and, step by step, change its structure up to the moment we get a trivial problem (or a problem which we know how to solve):

    P → … → (f*, x*).

A good example of such a strategy is the standard approach for solving a system of linear equations Ax = b. We can proceed as follows:
1. Check if A is symmetric and positive definite. Sometimes this is clear from the origin of the matrix.
2. Compute the Cholesky factorization of this matrix, A = L Lᵀ, where L is a lower-triangular matrix, and form two auxiliary systems Ly = b, Lᵀx = y.
3. Solve these systems by sequential exclusion of variables.
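In code, these three steps might look as follows (an illustration using standard NumPy/SciPy routines).

    import numpy as np
    from scipy.linalg import cholesky, solve_triangular

    def solve_spd(A, b):
        """Solve Ax = b for a symmetric positive definite A via A = L L^T."""
        L = cholesky(A, lower=True)                   # step 2: factorization
        y = solve_triangular(L, b, lower=True)        # step 3: solve L y = b
        return solve_triangular(L.T, y, lower=False)  #         solve L^T x = y

    # Example with a matrix that is symmetric positive definite by construction.
    rng = np.random.default_rng(0)
    B = rng.standard_normal((5, 5))
    A = B @ B.T + 5 * np.eye(5)                       # step 1: A is SPD
    b = rng.standard_normal(5)
    x = solve_spd(A, b)
    print(np.allclose(A @ x, b))                      # True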
Imagine for a moment that we do not know how to solve systems of linear equations. In order to discover the above scheme, we should apply the following

    Golden Rules
    1. Find a class of problems which can be solved very efficiently.
    2. Describe the transformation rules for converting the initial problem into the desired form.
    3. Describe the class of problems for which these transformation rules are applicable.        (4)

In our example, it is the class of linear systems with triangular matrices.

In Convex Optimization, these rules were already used several times for breaking down the limitations of Complexity Theory. Historically, the first example of that type is the theory of polynomial-time interior-point methods (IPM) based on self-concordant barriers. In this framework, the class of easy problems is formed by problems of unconstrained minimization of self-concordant functions, treated by the Newton method. This know-how is further used in the framework of path-following schemes for solving so-called standard minimization problems. Finally, it can be shown that by a simple barrier calculus this approach can be extended onto all convex optimization problems with known structure (see [11,18] for details). The efficiency estimates of the corresponding schemes are of the order of O(ν^{1/2} ln(ν/ε)) iterations of the Newton method, where ν is the parameter of the corresponding self-concordant barrier. Note that for many important feasible sets this parameter is smaller than the dimension of the space of variables. Hence, for the pure Black-Box schemes such an efficiency is simply unreachable in view of the lower complexity bound for the class C3 (see (2)). It is interesting that formally the modern IPMs look very similar to the usual Black-Box schemes (the Newton method plus a path-following approach), which were developed in the very early days of Nonlinear Optimization [4]. However, this is just an illusion. For the complexity analysis of polynomial-time IPM, it is crucial that they employ special barrier functions which do not satisfy the local Black-Box assumptions (see [10] for discussion).
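As a toy illustration of the path-following idea, the sketch below minimizes ⟨c, x⟩ over a polyhedron {x : Ax ≤ b} by tracing the central path of the logarithmic barrier F(x) = −Σ_i ln(b_i − ⟨a_i, x⟩), a self-concordant barrier with parameter ν = m, using damped Newton steps. It only illustrates the mechanism and is not an implementation of the schemes analyzed in [11,18].

    import numpy as np

    def barrier_path_following(c, A, b, x0, eps=1e-6, t0=1.0):
        """Toy path-following scheme for  min <c, x>  s.t.  Ax <= b.

        Follows the minimizers of  t*<c, x> + F(x)  with the log-barrier
        F(x) = -sum(log(b - Ax)); x0 must be strictly feasible (A x0 < b).
        """
        m = A.shape[0]
        x, t = x0.astype(float), t0
        while m / t > eps:                              # duality-gap style stopping rule
            for _ in range(50):                         # re-center for the current t
                s = b - A @ x                           # slacks, stay strictly positive
                grad = t * c + A.T @ (1.0 / s)
                hess = A.T @ (A / s[:, None] ** 2)
                step = np.linalg.solve(hess, -grad)
                lam = np.sqrt(step @ hess @ step)       # Newton decrement
                x = x + step / (1.0 + lam)              # damped Newton step
                if lam < 1e-8:
                    break
            t *= 1.0 + 1.0 / (4.0 * np.sqrt(m))         # short-step increase of the penalty
        return x

    # Example: minimize x1 + x2 over the box 0 <= x <= 1, written as Ax <= b.
    A = np.vstack([np.eye(2), -np.eye(2)])
    b = np.array([1.0, 1.0, 0.0, 0.0])
    x = barrier_path_following(np.array([1.0, 1.0]), A, b, x0=np.array([0.5, 0.5]), eps=1e-4)
    print(x)                                            # close to the optimal vertex (0, 0)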

The second example of using the rules (4) needs more explanation. Due to certain circumstances, these results were discovered with a delay of twenty years. Perhaps they were too simple. Or maybe they are in a seemingly very sharp contradiction with the rigorously proved lower bounds of Complexity Theory.

Anyway, now everything looks almost evident. Indeed, in accordance with Rule 1 in (4), we need to find a class of very easy problems. And this class can be discovered directly in Table (2)! To see that, let us compare the complexity of the classes C1 and C2 for the accuracy of 1% (ε = 10⁻²). Note that in this case the accuracy-dependent factors in the efficiency estimates vary from ten to ten thousand. So, the natural question is: Can the easy problems from C2 help us somehow in finding an approximate solution to the difficult problems from C1?

And the evident answer is: Yes, of course! It is a simple exercise in Calculus to show that we can always approximate a Lipschitz-continuous nonsmooth convex function on a bounded convex set with a uniform accuracy ε > 0 by a smooth convex function with Lipschitz-continuous gradient. We pay for the accuracy of the approximation by a large Lipschitz constant M for the gradient, which should be of the order of O(1/ε). Putting this bound for M into the efficiency estimate of C2 in (2), we can see that, in principle, it is possible to minimize nonsmooth convex functions by oracle-based gradient methods with analytical complexity O(1/ε). But what about the Complexity Theory? It seems that it was proved that such efficiency is just impossible.⁴

⁴ It is easy to see that the standard subgradient methods for nonsmooth convex minimization indeed need O(1/ε²) operations to converge. Consider a univariate function f(x) = |x|, x ∈ R, and look at the subgradient process x_{k+1} = x_k − h_k · sign(x_k) with step sizes h_k of the order O(1/k^{1/2}). It is easy to see that the accuracy of this process after k iterations is only of the order O(1/k^{1/2}). However, this step-size sequence is optimal [8].

It is interesting that in fact we do not get any contradiction. Indeed, in order to minimize a smooth approximation of a nonsmooth function by an oracle-based scheme, we need to change the initial oracle. Therefore, from the mathematical point of view, we violate the Black-Box assumption. On the other hand, in the majority of practical applications this change is not difficult. Usually we can work directly with the structure of our problem, at least in the cases when it is created by us.

Thus, the basis of the smoothing technique [12,13] is formed by two ingredients: the above observation, and a trivial but systematic way of approximating a nonsmooth function by a smooth one. This can be done for convex functions represented explicitly in a max-form:

    f(x) = max_{u ∈ Q_d} { ⟨Ax − b, u⟩ − φ(u) },

where Q_d is a bounded and convex dual feasible set and φ(u) is a convex function. Then, choosing a nonnegative strongly convex function d(u), we can define a smooth function

    f_μ(x) = max_{u ∈ Q_d} { ⟨Ax − b, u⟩ − φ(u) − μ·d(u) },                  (5)

which approximates the initial objective. Indeed, denoting D_d = max_{u ∈ Q_d} d(u), we get

    f(x) ≥ f_μ(x) ≥ f(x) − μ D_d.

At the same time, the gradient of the function f_μ is Lipschitz-continuous with a Lipschitz constant of the order of O(1/μ) (see [12] for details).

Thus, we can see that for an implementable definition (5) we get a possibility to solve problem (1) in O(1/ε) iterations of the fast gradient method (3). In order to see the magnitude of the improvement, let us look at the following example:

    min_{x ∈ Δ_n}  f(x) = max_{1 ≤ j ≤ m} ⟨a_j, x⟩,                          (6)

where Δ_n ⊂ R^n is the standard simplex. Then the properly implemented smoothing technique ensures the following rate of convergence:

    f(x_k) − f* ≤ O( (ln n · ln m)^{1/2} / k ).

If we apply to problem (6) the standard subgradient methods (e.g. [14]), we can guarantee only

    f(x_k) − f* ≤ O( (ln n / k)^{1/2} )

(in both estimates the constants depend on the magnitude of the vectors a_j). Thus, up to a logarithmic factor, for obtaining the same accuracy, the methods based on the smoothing technique need only a square root of the number of iterations of the usual subgradient scheme. Taking into account that usually the subgradient methods are allowed to run many thousands or even millions of iterations, the gain of the smoothing technique in computational time can be enormously big.

It is interesting that for problem (6) the computation of the smooth approximation is very cheap. Indeed, let us use for smoothing the entropy function

    d(u) = ln m + Σ_{j=1}^{m} u_j ln u_j,   u ∈ Δ_m.

Then the smooth approximation (5) of the objective function in (6) has the following compact representation:

    f_μ(x) = μ ln( (1/m) Σ_{j=1}^{m} exp( ⟨a_j, x⟩ / μ ) ).

Thus, the complexity of the oracle for f(x) and f_μ(x) is similar. Note that again, as in the polynomial-time IPM theory, we apply a standard oracle-based method ((3) in this case) to a function which does not satisfy the Black-Box assumptions.
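The following sketch evaluates this smooth approximation and its gradient for problem (6); the gradient is the softmax combination of the rows a_j, so the cost is indeed comparable to one evaluation of f itself. (The function below is an illustration written for this note, not code from [12].)

    import numpy as np

    def smoothed_max(A, x, mu):
        """Entropy smoothing of f(x) = max_j <a_j, x>, the a_j being the rows of A.

        Returns f_mu(x) = mu * log( (1/m) * sum_j exp(<a_j, x>/mu) ) and its
        gradient A^T u(x), where u(x) is the softmax of the values <a_j, x>/mu.
        """
        m = A.shape[0]
        z = A @ x / mu
        w = np.exp(z - z.max())                  # shift by the maximum for numerical stability
        f_mu = mu * (np.log(w.sum()) + z.max() - np.log(m))
        grad = A.T @ (w / w.sum())               # a convex combination of the rows a_j
        return f_mu, grad

    # Sanity check of the approximation property  f(x) - mu*ln(m) <= f_mu(x) <= f(x).
    rng = np.random.default_rng(1)
    A = rng.standard_normal((20, 10))
    x = rng.standard_normal(10)
    mu = 0.1
    f_val = (A @ x).max()
    f_mu, _ = smoothed_max(A, x, mu)
    print(f_val - mu * np.log(20) <= f_mu <= f_val)   # True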

An inexplicable blindness to the possibility of reducing the complexity of nonsmooth optimization problems with known structure is not restricted to the smoothing technique only. As was shown in [7], very similar results can be obtained by the extra-gradient method of G. Korpelevich [6], using the fact that this method is optimal for the class of variational inequalities with a Lipschitz-continuous operator (for these problems it converges as O(1/k)). Actually, in a verbal form, the optimality of the extra-gradient method was known already for a couple of decades. However, a rigorous proof of this important fact and a discussion of its consequences for Structural Nonsmooth Optimization were published only in [7], after the discovery of the smoothing technique.
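For illustration, here is a sketch of the extra-gradient iteration applied to the saddle-point form of problem (6), min_{x ∈ Δ_n} max_{u ∈ Δ_m} ⟨Ax, u⟩, whose operator F(x, u) = (Aᵀu, −Ax) is Lipschitz continuous and monotone. Euclidean projections onto the simplex are used purely for simplicity; this is a plain Korpelevich-type scheme, not the prox-method of [7].

    import numpy as np

    def project_simplex(v):
        """Euclidean projection of v onto the standard simplex."""
        u = np.sort(v)[::-1]
        css = np.cumsum(u) - 1.0
        rho = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
        return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

    def extragradient(A, steps=2000, gamma=None):
        """Extra-gradient method for  min_{x in simplex} max_{u in simplex} <Ax, u>."""
        m, n = A.shape
        if gamma is None:
            gamma = 0.5 / np.linalg.norm(A, 2)          # step size below 1/Lipschitz constant
        x, u = np.ones(n) / n, np.ones(m) / m
        x_avg, u_avg = np.zeros(n), np.zeros(m)
        for _ in range(steps):
            # predictor step at the current point
            x_half = project_simplex(x - gamma * (A.T @ u))
            u_half = project_simplex(u + gamma * (A @ x))
            # corrector step, with the operator evaluated at the predictor point
            x, u = (project_simplex(x - gamma * (A.T @ u_half)),
                    project_simplex(u + gamma * (A @ x_half)))
            x_avg += x_half
            u_avg += u_half
        return x_avg / steps, u_avg / steps             # averaged iterates

    rng = np.random.default_rng(2)
    A = rng.standard_normal((15, 10))
    x, u = extragradient(A)
    print((A @ x).max() - (A.T @ u).min())              # duality gap, close to zero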
To conclude this section, let us discuss the last example of acceleration strategies in Structural Optimization. Consider the problem of minimizing the composite objective function:

    min_x { f(x) + Ψ(x) },                                                   (7)

where the function f is a convex differentiable function on dom Ψ with a Lipschitz-continuous gradient, and the function Ψ is an arbitrary closed convex function. Since Ψ can be even discontinuous, in general this problem is very difficult. However, if we assume that the function Ψ is simple, then the situation is changing. Indeed, suppose that for any ȳ ∈ dom Ψ we are able to solve explicitly the following auxiliary optimization problem:

    min_x { ⟨∇f(ȳ), x − ȳ⟩ + (L/2) ‖x − ȳ‖² + Ψ(x) }                         (8)

(compare with (3)). Then it becomes possible to develop for problem (7) fast gradient methods (similar to (3)), which have a rate of convergence of the order O(1/k²) (see [15] for details; a similar technique was developed in [3]). Note that the formulation (7) can also be seen as a part of Structural Optimization, since we use the knowledge of the structure of its objective function directly in the optimization methods.
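For instance, if Ψ(x) = λ‖x‖₁, the auxiliary problem (8) is solved coordinate-wise by soft-thresholding, so one composite gradient step is as cheap as a usual gradient step. The sketch below shows this building block inside a plain (non-accelerated) loop; it illustrates (8) and is not the accelerated method of [15].

    import numpy as np

    def composite_step(x, grad, L, lam):
        """Solve (8) for Psi(y) = lam*||y||_1:
        argmin_y { <grad, y - x> + (L/2)||y - x||^2 + lam*||y||_1 },
        i.e. soft-thresholding of the gradient step x - grad/L."""
        z = x - grad / L
        return np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)

    # Example: min 0.5*||Ax - b||^2 + lam*||x||_1, solved by plain composite gradient steps.
    rng = np.random.default_rng(3)
    A = rng.standard_normal((40, 100))
    x_true = np.zeros(100)
    x_true[:5] = 1.0
    b = A @ x_true
    L = np.linalg.norm(A, 2) ** 2                 # Lipschitz constant of the smooth part
    lam = 0.1

    x = np.zeros(100)
    for _ in range(500):
        grad = A.T @ (A @ x - b)                  # gradient of the smooth part f
        x = composite_step(x, grad, L, lam)

    print(np.count_nonzero(np.abs(x) > 1e-3))     # a sparse approximate solution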

Conclusion

In this paper, we have considered several examples of significant acceleration of the usual oracle-based methods. Note that the achieved progress is visible only because of the supporting complexity analysis. It is interesting that all these methods have some prototypes proposed much earlier:
− Optimal method (3) is very similar to the heavy point method:

    x_{k+1} = x_k − α ∇f(x_k) + β (x_k − x_{k−1}),

  where α and β are some fixed positive coefficients (see [20] for historical details).
− Polynomial-time IPM are very similar to some variants of the classical barrier methods [4].
− The idea to apply smoothing for solving minimax problems is also not new (see [21] and the references therein).

At certain moments of time these ideas were quite new and attractive. However, they did not result in a significant change in computational practice, since they were not provided with a convincing complexity analysis. Indeed, many other schemes have similar theoretical justifications, and it was not clear at all why these particular suggestions deserve more attention. Moreover, even now, when we know that the modified variants of some old methods give excellent complexity results, we cannot say too much about the theoretical efficiency of the original schemes.

Thus, we have seen that in Convex Optimization the complexity analysis plays an important role in selecting the promising optimization methods among hundreds of others. Of course, it is based on an investigation of the worst-case situation. However, even this limited help is important for choosing promising directions for further research. This is true especially now, when the development of Structural Optimization makes the problem settings and the corresponding efficiency estimates more and more interesting and diverse.

The size of this paper does not allow us to discuss other interesting settings of Structural Convex Optimization (e.g. optimization in relative scale [16,17]). However, we hope that even the presented examples can help the reader to find new and interesting research directions in this promising field (see, for example, [1,2,5]).

References
1. A. d'Aspremont, O. Banerjee, and L. El Ghaoui. First-Order Methods for Sparse Covariance Selection. SIAM Journal on Matrix Analysis and its Applications, 30(1), 56-66 (2008).
2. O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model Selection Through Sparse Maximum Likelihood Estimation. Journal of Machine Learning Research, 9, 485-516 (2008).
3. A. Beck and M. Teboulle. A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. Research Report, Technion (2008).
4. A.V. Fiacco and G.P. McCormick. Nonlinear Programming: Sequential Unconstrained Minimization Techniques. John Wiley, New York, 1968.
5. S. Hoda, A. Gilpin, and J. Pena. Smoothing techniques for computing Nash equilibria of sequential games. Research Report, Carnegie Mellon University (2008).
6. G.M. Korpelevich. The extragradient method for finding saddle points and other problems. Matecon 12, 747-756 (1976).
7. A. Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15, 229-251 (2004).
8. A. Nemirovsky and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, New York, 1983.
9. Yu. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Doklady AN SSSR (translated as Soviet Mathematics Doklady), 269(3), 543-547 (1983).
10. Yu. Nesterov. Interior-point methods: An old and new approach to nonlinear programming. Mathematical Programming, 79(1-3), 285-297 (1997).
11. Yu. Nesterov. Introductory Lectures on Convex Optimization. Kluwer, Boston, 2004.
12. Yu. Nesterov. Smooth minimization of non-smooth functions. CORE Discussion Paper 2003/12 (2003). Published in Mathematical Programming, 103(1), 127-152 (2005).
13. Yu. Nesterov. Excessive gap technique in nonsmooth convex minimization. SIAM Journal on Optimization, 16(1), 235-249 (2005).
14. Yu. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming (2007).
15. Yu. Nesterov. Gradient methods for minimizing composite objective function. CORE Discussion Paper 2007/76 (2007).
16. Yu. Nesterov. Rounding of convex sets and efficient gradient methods for linear programming problems. Optimization Methods and Software, 23(1), 109-135 (2008).
17. Yu. Nesterov. Unconstrained convex minimization in relative scale. Accepted by Mathematics of Operations Research.
18. Yu. Nesterov, A. Nemirovskii. Interior point polynomial methods in convex programming: Theory and Applications. SIAM, Philadelphia, 1994.
19. B.T. Polyak. A general method of solving extremum problems. Soviet Mat. Dokl. 8, 593-597 (1967).
20. B. Polyak. Introduction to Optimization. Optimization Software, New York, 1987.
21. R. Polyak. Smooth Optimization Methods for Minimax Problems. SIAM J. Control and Optimization, 26(6), 1274-1286 (1988).
22. N.Z. Shor. Minimization Methods for Nondifferentiable Functions. Springer-Verlag, Berlin, 1985.
23. S.P. Tarasov, L.G. Khachiyan, and I.I. Erlikh. The Method of Inscribed Ellipsoids. Soviet Mathematics Doklady, 37, 226-230 (1988).