Combinatorial Structures in Nonlinear Programming

Stefan Scholtes∗ April 2002

Abstract Non-smoothness and non-convexity in optimization problems often arise because a combinatorial structure is imposed on smooth or convex data. The combinatorial aspect can be explicit, e.g. through the use of "max", "min", or "if" statements in a model, or implicit as in the case of bilevel optimization where the combinatorial structure arises from the possible choices of active constraints in the lower level problem. In analyzing such problems, it is desirable to decouple the combinatorial from the nonlinear aspect and deal with them separately. This paper suggests a problem formulation which explicitly decouples the two aspects. We show that such combinatorial nonlinear programs, despite their inherent non-convexity, allow for a convex first order local optimality condition which is generic and tight. The stationarity condition can be phrased in terms of Lagrange multipliers, which allows an extension of the popular sequential quadratic programming (SQP) approach to solve these problems. We show that the favorable local convergence properties of SQP are retained in this setting. The computational effectiveness of the method depends on our ability to solve the subproblems efficiently which, in turn, depends on the representation of the governing combinatorial structure. We illustrate the potential of the approach by applying it to optimization problems with max-min constraints which arise, for example, in robust optimization.

1 Introduction

Nonlinear programming is nowadays regarded as a mature field. A combination of important algorithmic developments and increased computing power over the past decades have advanced the field to a stage where the majority of practical problems can be solved efficiently by commercial software. However, this is not to say that there are no research challenges left. One considerable challenge is posed by global, i.e. non-convex, optimization problems. There are two main sources of non-convexity: an analytic and a combinatorial source. Analytic non-convexity is due to a combination of negative and positive curvature of underlying model functions,

∗Judge Institute of Management & Department of Engineering, University of Cambridge, Cambridge CB2 1AG, [email protected]

such as polynomials or trigonometric functions. In contrast, combinatorial non-convexity arises when a combinatorial selection rule is imposed on possibly quite simple, even linear, functions. Typical examples are evaluations that make use of "max", "min", "abs", or "if" statements. Business spreadsheets are a veritable source of such combinatorial nonlinear functions. Consider for example the calculation of revenue as a function of price and production capacity. Revenue is the product of price and sales, where sales is the minimum of demand and production capacity. If capacity κ and price π are decision variables and demand d is a known function of price, then the revenue formula becomes r(π, κ) = π min{d(π), κ}. Such formulas are commonplace in spreadsheet models.
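The revenue formula above can be prototyped in a few lines. Note that the linear demand curve d(π) = a − bπ and its parameter values below are purely illustrative assumptions, not part of the model discussed here.

```python
# Sketch of the spreadsheet-style revenue formula r(pi, kappa) = pi * min{d(pi), kappa}.
# The linear demand curve d(pi) = a - b*pi and its parameters are hypothetical.
def demand(pi, a=100.0, b=2.0):
    """Illustrative linear demand curve, truncated at zero."""
    return max(a - b * pi, 0.0)

def revenue(pi, kappa):
    """Revenue = price * sales, where sales = min(demand, capacity)."""
    return pi * min(demand(pi), kappa)
```

The "min" makes revenue non-smooth in (π, κ) along the curve d(π) = κ, which is exactly the combinatorial kink discussed in the text.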

A more complicated example where nested max- and min-terms occur is the resource allocation problem that arises when several projects are evaluated by decision trees and compete for limited resources. This is a timely problem area as decision tree analysis is experiencing a renaissance in the wake of the popularization of real options valuations of projects. Decision tree evaluations consist essentially of a finite sequence of nested linear combinations of max-terms, where the linear combinations correspond to taking expectations at event nodes and the max-terms correspond to choosing the best of a finite number of possible actions at decision nodes. The value of a project is likely to depend on design parameters such as marketing budgets or other limited resources and a company is faced with the problem of allocating such limited resources to a variety of projects. This resource allocation problem will involve parametric values of decision trees, i.e., sequences of max-terms. The problem becomes even more complex if, as is often the case in practice, probability estimates are subjective and several experts assign differing probability estimates to the event branches in the trees. Backward induction through the decision trees will lead to value estimates for each of these probability estimates and a conservative decision maker may well consider the minimum of the expected values as a proxy for the unknown value of the project at event nodes. With this addition, the constraints of the resource allocation problem involve sequences of max- as well as min-terms. These max- and min-terms define the combinatorial structure of the problem.

Many practical applications are a combination of both the analytical and the combinatorial sources of non-convexity. A non-convex combinatorial structure may well be imposed onto smooth but already non-convex functions. Nevertheless, it is helpful to think of the two sources separately, not least because the combinatorial source introduces the additional complication of non-smoothness. In fact, for optimization problems which involve only the first, the analytic source of non-convexity, nonlinear programming provides us with a host of methods, such as sequential quadratic programming (SQP) or interior point methods. Engineers and operations researchers regularly apply these methods, even though they cannot guarantee a global optimum, because the methods often provide substantial improvement and, at the same time, are applicable to model sizes that are orders of magnitude larger than those that can be handled by global optimization routines. It is indeed unlikely that the dominance of local optimization procedures will change in the foreseeable future. In this light, it is rather surprising that relatively little attention has been

given to the local optimization of non-convex problems arising from the second, the combinatorial source of non-convexity. This is the more surprising as some fairly general classes of such problems, such as disjunctive programs, have been studied extensively from a global optimization viewpoint. The present study attempts to fill this gap by extending the tools available for the local optimization of smooth non-convex problems to problems involving non-convex combinatorial structures.

To date, problems of the type considered here are often tackled by variants of Lemaréchal's bundle method [17], such as the bundle code of Schramm and Zowe [26]. These methods were originally designed to solve non-smooth convex problems but they are well-defined for more general non-smooth problems if the subgradient concept is suitably broadened, see e.g. [5]. However, bundle methods have rather poor convergence properties for non-convex problems. This is not surprising since they are based on convex models, such as maxima of finitely many affine functions, of originally non-convex functions and can therefore only be guaranteed to converge to points where these convex models provide no descent directions. These methods do not exploit the specific combinatorial structure of the problem at hand, e.g. a particular combination of "max" and "min" terms, but rather replace this structure by a simpler and essentially convex combinatorial structure. An alternative approach, more akin to our own and more suitable for the exploitation of non-convex structures, has been developed for piecewise linear problems under the label of simplex-type or active-set type methods by Fourer [10, 11, 12] and others. For recent further developments along these lines we refer the reader to the comprehensive paper of Conn and Mongeau [6] which also surveys some of these methods and provides a fairly extensive bibliography. The existing combinatorial methods, however, are either restricted to piecewise affine models or to models where the combinatorial structure is inherently convex, such as the minimization of point-wise maxima of nonlinear functions. We attempt to add to this literature by presenting and analyzing a modelling framework which explicitly allows for non-convex structures as well as smooth nonlinear data.

The focus of this paper is the extension of the traditional active set and SQP approaches to combinatorial nonlinear programs. We will provide a theoretical framework for the local optimization of such problems, including appropriate sta- tionarity conditions that are tighter than the stationarity conditions of [5] which bundle methods rely on. We then show how the SQP-idea can be extended to combinatorial nonlinear programs and that these methods have better convergence properties for these problems than bundle-type methods, both in terms of speed and in terms of identifying more appropriate stationary points.

We have already mentioned that one motivation for this study is to allow modelers to use simple nonsmooth functions such as "max", "min", "‖·‖", etc. as building blocks in their optimization models. There are, however, also interesting classes of optimization problems to which our approach applies directly and we will mention some of them in the next section before we proceed to the technical developments. One of these classes are optimization problems with max-min constraints which

arise in robust optimization. We will use this special case frequently to illustrate our general theoretical developments. Section 3 introduces necessary extensions of standard NLP terminology, including notions of active constraint functions, regularity and stationarity for combinatorial nonlinear programs. Section 4 explains how Lagrange multipliers can be defined and used to check stationarity. In Section 5 we discuss the use of the SQP method in the setting of combinatorial nonlinear programs. We then focus in Section 6 on the subproblems that arise in the case of max-min constrained problems and show how a decomposition approach can be used to solve these problems to local optimality. We close the paper with a brief report on some preliminary numerical experiments.

2 Problem Statement and Sample Problems

We study combinatorial nonlinear programs of the form

   min f(x)   s.t. g(x) ∈ Z,   (1)

where f : IR^n → IR, g : IR^n → IR^m, and Z ⊆ IR^m. We regard the functions f, g as the data specifying an instance of the problem, whilst the structure set Z represents the combinatorial structure of a problem class. We will assume throughout that f and g are smooth and that Z is closed. Problem formulations of the above kind are commonplace in semi-definite programming and related areas, where Z is assumed to be a convex, possibly non-polyhedral cone. In contrast to these studies, we will not assume that Z is convex but only that it has some exploitable combinatorial characteristics. A typical set Z is a polyhedral complex, for example the boundary of a polyhedron or the set of roots of a piecewise affine function. In the latter case one may for example wish to exploit a max-min representation of the piecewise affine function, see e.g. [1]. Notice that problem (1), although of a combinatorial nature, is quite different from integer programming and does not necessarily have a discrete feasible set; whence local optimization is non-trivial.

2.1 Sample Problems

The problem formulation (1) is quite flexible and encompasses several classes of well-known optimization problems. It obviously encompasses as a special case the standard nonlinear program

   min f(x)   s.t. g(x) ≤ 0, h(x) = 0,

the constraints of which can be rewritten as

   (g(x), h(x)) ∈ Z = IR_-^p × {0}^q.

Notice that this set Z is the set of roots of the piecewise linear convex penalty-type function p(u, v) = max{u_1, . . . , u_p, |v_1|, . . . , |v_q|}.
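The root-set characterization of the standard NLP constraints can be checked numerically. The helper names and tolerance below are a minimal sketch, not code from the paper.

```python
# Sketch: Z = IR_-^p x {0}^q is the root set of the piecewise linear
# penalty-type function p(u, v) = max{u_1, ..., u_p, |v_1|, ..., |v_q|}.
def penalty(u, v):
    """Evaluate p(u, v) for inequality residuals u and equality residuals v."""
    return max([ui for ui in u] + [abs(vi) for vi in v])

def feasible(u, v, tol=1e-12):
    """(u, v) lies in Z exactly when the penalty is (numerically) at most zero."""
    return penalty(u, v) <= tol
```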

Because the structure set Z is convex, the nonlinear programming problem can, in some sense, be thought of as a structurally convex problem. In particular, one obtains a convex local approximation of the problem if the data functions are replaced by their first order Taylor series. In other words, non-convexity is only introduced through nonlinearity in the data. As mentioned before, such structurally convex problems are not the focus of our study.

A first example of a structurally non-convex problem is the so-called mathematical program with complementarity constraints

   min f(x)   s.t. g(x) ≤ 0, h(x) = 0, min{G(x), H(x)} = 0,

see e.g. [18, 23] and the references therein. The constraints are equivalent to

   (g(x), h(x), (G_1(x), H_1(x)), . . . , (G_r(x), H_r(x))) ∈ Z = IR_-^p × {0}^q × L^r,

where L is the boundary of the positive orthant in IR^2. This structure set Z is non-convex and therefore the problem itself should be thought of as structurally non-convex. The set Z can again be represented as the set of roots of a piecewise linear function, e.g.

   p(t, u, v, w) = max{t_1, . . . , t_p, |u_1|, . . . , |u_q|, |min{v_1, w_1}|, . . . , |min{v_r, w_r}|}.

As a final example, we mention robust variants of stochastic programs, which lead directly to optimization problems with max-min functions. A standard stochastic program with recourse over finitely many scenarios is of the form

   min_{x∈X} Σ_{1≤j≤m} p_j min_{y_j∈Y_j} f(x, y_j).

The interpretation is that a decision x needs to be taken now before a scenario j is observed with probability p_j. After the observation of the scenario a recourse action y_j ∈ Y_j can be taken. The decision criterion is to choose x now so that the expected value of f is minimal. If it is important to ensure that the objective is small for all scenarios then the robust counterpart of this stochastic program is more appropriate. This optimization problem is of the form

   min_{x∈X} max_{1≤j≤m} min_{y_j∈Y_j} f(x, y_j).

If there are only finitely many candidates for recourse actions in each scenario then the robust optimization problem turns into a finite max-min optimization problem

   min_{x∈X} max_{1≤j≤m} min_{i∈N_j} f_i(x),

where N_j ⊆ {1, . . . , n}. Such problems have been recently investigated e.g. in [13, 19] with engineering design applications in mind.

The latter optimization problem can be re-written as a constrained problem of the form

   min α   s.t. max_{1≤j≤r} min_{i∈N_j} f_i(x) − α ≤ 0.   (2)

The constraints are obviously equivalent to

   (f_1(x) − α, . . . , f_m(x) − α) ∈ Z,

where Z is the lower zero-level set of a piecewise linear function, e.g.,

   Z = {z | p(z) ≤ 0},   p(z) = max_{1≤j≤r} min_{i∈N_j} z_i.

The set Z is a nonconvex union of convex polyhedral cones and therefore the problem itself is structurally non-convex. More generally, max-min optimization problems are problems of the form

   min f(x)   s.t. g(x) ∈ Z,   (3)

with the set Z above. We will use such max-min optimization problems to illustrate the more theoretical developments in the sequel.
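The membership test g(x) ∈ Z for this max-min structure set can be sketched directly; the 0-based index sets and the tolerance below are implementation assumptions.

```python
# Sketch: Z = {z | p(z) <= 0} with p(z) = max_j min_{i in N_j} z_i.
def p(z, index_sets):
    """Evaluate the max-min function; index_sets is a list of lists N_j (0-based)."""
    return max(min(z[i] for i in Nj) for Nj in index_sets)

def in_Z(z, index_sets, tol=1e-12):
    """Membership in the structure set Z up to a numerical tolerance."""
    return p(z, index_sets) <= tol
```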

3 Some Terminology

The SQP method for standard nonlinear programs is essentially an application of Newton's method to the stationarity conditions. The notions of active constraints and of regularity play an important role in the development of the stationarity conditions as well as the analysis of the SQP method. We therefore need to extend the notions of active constraints, regularity, and stationarity to combinatorial nonlinear programs before we can discuss a modification of SQP for such problems.

3.1 Active Constraint Functions

We regard a constraint function g_i as inactive at a feasible point x̄ of problem (1) if the validity of the inclusion g(x) ∈ Z does not depend on the value g_i(x) for x close to x̄. This intuitive notion is made precise in the following definition.

Definition 3.1 A component z_i is called inactive for a set Z at a point z̄ ∈ Z if there exists a neighborhood V of z̄ such that for any two points z′, z″ ∈ V which differ only in the i-th component (i.e. z′_j = z″_j for every j ≠ i) the inclusion z′ ∈ Z holds if and only if z″ ∈ Z. All other components are called active at z̄. A constraint function g_i is called active (inactive) at a feasible point x̄ of problem (1) if z_i is active (inactive) for Z at z̄ = g(x̄).

Notice that activity of a constraint function depends only on the value of the constraint function at the point x̄, not on the behavior of the function close to x̄. The following lemma is an immediate consequence of the above definition and provides an equivalent definition of activity.

Lemma 3.1 The component z_i is active for Z at z̄ ∈ Z if and only if there exist sequences z^k → z̄ and α_k → 0 such that z^k ∈ Z and z^k + α_k e_i ∉ Z, where e_i is the i-th unit vector.

Inactivity of a component z_i implies that z̄ + α e_i ∈ Z for every sufficiently small α but the reverse implication does not hold in general. For instance, if Z is the union of the axes in IR^2 then both components satisfy the latter condition at the origin but both components are active at the origin. Another example is the set in IR^2 defined by the inequality |z_1| ≤ |z_2|. Although α e_2 ∈ Z for every α, the second component z_2 is active at the origin.

If the components of z are rearranged so that z = (z_I, z_A) with z_I and z_A corresponding to inactive and active components, resp., at z̄ then for every z close to z̄

   z ∈ Z if and only if z_A ∈ Z_A = {ζ_A | (ζ_I, ζ_A) ∈ Z}.   (4)

In other words, locally around z̄ the set Z coincides with the set ({z̄_I} + IR^{|I|}) × Z_A.

A natural question to ask is whether one can obtain a characterization of the active components for the union or intersection of sets from knowledge of active components of the individual sets. It is not difficult to see that an active component z_i for either ∪_{j=0}^m Z_j or ∩_{j=0}^m Z_j at a point z̄ ∈ ∩_{j=0}^m Z_j is active for at least one of the sets Z_j at z̄. The reverse statement, however, does not hold in general in either case. If Z_1 is the set of nonnegative reals and Z_2 is the set of nonpositive reals, then the single variable is active for both sets at the origin but inactive for the union of the two sets. Also, if Z_1 is the set of all (z_1, z_2) with z_2 ≤ 0 and Z_2 is the union of Z_1 and the nonnegative z_2-axis then z_1 and z_2 are both active for Z_2 at the origin but only z_2 is active for Z_1 ∩ Z_2 at the origin.

The following proposition gives an algorithmic characterization of the active components of the set Z arising in (3).

Proposition 3.1 Let Z = {z ∈ IR^m | p(z) ≤ 0} with p(z) = max_{1≤j≤r} min_{i∈N_j} z_i and let z̄ ∈ Z. If p(z̄) < 0 then all components are inactive at z̄. If p(z̄) = 0 then the active components can be determined in the following way:

1. Determine J = {j | min_{i∈N_j} z̄_i = 0}.

2. Determine N̄_j = {i ∈ N_j | z̄_i = 0} for every j ∈ J.

3. Delete from J all indices k for which there exists j ∈ J such that N̄_j is a proper subset of N̄_k and call the remaining index set J̄.

A component z_i is active at z̄ if and only if i ∈ ∪_{j∈J̄} N̄_j.

Proof. The first statement follows directly from the continuity of p(·). To prove the second statement we assume that p(z̄) = 0. Since z̄ ∈ Z we have min_{i∈N_j} z̄_i ≤ 0 for every 1 ≤ j ≤ r. If min_{i∈N_j} z̄_i < p(z̄) then this inequality remains valid in a

neighborhood of z̄ and therefore p is unchanged if the term min_{i∈N_j} z_i is eliminated. Hence

   p(z) = max_{j∈J} min_{i∈N_j} z_i

for every z in a neighborhood of z̄. Also, if i ∈ N_j and z̄_i > min_{k∈N_j} z̄_k = 0 then one can remove i from N_j without changing min_{k∈N_j} z_k in a neighborhood of z̄. After these eliminations we obtain

   p(z) = max_{j∈J} min_{i∈N̄_j} z_i

for all z in a neighborhood of z̄. Finally, one can eliminate all min-terms corresponding to an index set N̄_k such that there exists a proper subset N̄_j of N̄_k for some j ∈ J since

   min_{i∈N̄_j} z_i ≥ min_{i∈N̄_k} z_i

for every z. Thus

   p(z) = max_{j∈J̄} min_{i∈N̄_j} z_i

for every z close to z̄. Because of this representation of p in a neighborhood of z̄ it is obvious that i ∈ ∪_{j∈J̄} N̄_j for every active component z_i at z̄. To see the converse, let i_0 ∈ N̄_{j_0} for some j_0 ∈ J̄, and define

   z_i^k(α) = 1/k if i ∈ N̄_{j_0}\{i_0},   α if i = i_0,   −1/k otherwise.

Notice that for every −1/k ≤ α ≤ 1/k

   min_{i∈N̄_{j_0}} z_i^k(α) = α,
   min_{i∈N̄_j} z_i^k(α) = −1/k, ∀j ≠ j_0.

Therefore p(z^k(α)) = α for every 0 ≤ α ≤ 1/k. Since z^k(α) = z^k(0) + α e_{i_0} it follows that z_{i_0} is active at z̄. Q.E.D.
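The three elimination steps of Proposition 3.1 translate directly into code. The sketch below uses 0-based indices and a numerical tolerance in place of exact zero tests, both of which are implementation assumptions.

```python
# Sketch of the active-component test of Proposition 3.1 for
# Z = {z | p(z) <= 0}, p(z) = max_j min_{i in N_j} z_i.
def active_components(z, index_sets, tol=1e-12):
    """Return the set of (0-based) active component indices at z in Z."""
    p_val = max(min(z[i] for i in Nj) for Nj in index_sets)
    if p_val < -tol:
        return set()                                  # p(z) < 0: all inactive
    # Step 1: J = {j | min_{i in N_j} z_i = 0}
    J = [j for j, Nj in enumerate(index_sets)
         if abs(min(z[i] for i in Nj)) <= tol]
    # Step 2: Nbar_j = {i in N_j | z_i = 0} for every j in J
    Nbar = {j: frozenset(i for i in index_sets[j] if abs(z[i]) <= tol)
            for j in J}
    # Step 3: delete k if some Nbar_j is a proper subset of Nbar_k
    Jbar = [k for k in J if not any(Nbar[j] < Nbar[k] for j in J)]
    # A component is active iff its index lies in the union of Nbar_j, j in Jbar
    return set().union(*(Nbar[k] for k in Jbar))
```

For instance, with z̄ = (0, 0) and the single index set {0, 1}, both components are active, while the dominated set {0, 1} is discarded in step 3 when the singleton {0} is also present.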

3.2 Regularity and Stationarity

We call the constraint functions regular at x̄ if the gradients of all active constraint functions at x̄ are linearly independent and call x̄ a stationary point of (1) if d = 0 is a local minimizer of

   min_d ∇f(x̄)ᵀd   s.t. g(x̄) + ∇g(x̄)d ∈ Z.   (5)

Since activity of a constraint function only depends on the value of the constraint function at a given point, the function g_i(x̄) + ∇g_i(x̄)d is active at d = 0 in (5) if and only if g_i is active at x̄ in (1). Notice that local and global minimization for (5) are equivalent if Z is convex but not necessarily otherwise. The reader may ask why we

do not further simplify the auxiliary problem (5) by replacing Z in the constraints of (5) by an appropriate first order approximation at the point g(x̄), e.g. by its Bouligand cone, see [3]. This would indeed simplify the development of necessary optimality conditions. However, the importance of first order optimality conditions is not that they indicate optimality - after all, in the presence of nonlinearity it is unlikely that a finite computational method will produce a stationary point - but that they allow the computation of descent paths if the conditions are not satisfied. The convergence of methods with descent directions obtained from tangent cone based optimality conditions, however, often suffers from the discontinuity of the tangent cone as a function of the base point. For instance, if Z is the boundary of the positive orthant in IR^2 then the only sensible tangent cone at a point on the positive x_1-axis is the x_1-axis itself. Therefore the tangent cone based subproblem that computes a descent direction is, arbitrarily close to the origin, blind to the possibility to move along the x_2-axis at the origin. Consequently, the method may converge to the origin along the x_1-axis even though the objective function allows a first order descent along the positive x_2-axis at the origin.

The first question we need to address is whether stationarity as defined above is a necessary optimality condition under appropriate assumptions. The simple example

   min 2x_1^2 − x_2   s.t. (x_1, x_2) ∈ Z,   (6)

where Z = {(z_1, z_2) | z_1^2 − z_2 = 0}, shows that even for regular constraint functions one cannot expect a local minimizer, the origin in this case, to be stationary. We will have to make an additional assumption for this to be the case.

Definition 3.2 A set Z is said to be locally star-shaped at a feasible point z̄ if there exists a neighborhood V of z̄ such that z ∈ Z ∩ V implies αz + (1 − α)z̄ ∈ Z ∩ V for every α ∈ [0, 1].

In particular, convex sets are locally star-shaped. However, the set of locally star-shaped sets is much broader. Indeed, one readily verifies that the intersection and the union of finitely many closed locally star-shaped sets remains locally star-shaped. An important special class of locally star-shaped sets are the roots or lower level sets of piecewise affine functions which are non-convex polyhedra. Notice that the set Z employed in example (6) is not locally star-shaped anywhere. Indeed, if the structure set Z is locally star-shaped at a regular local minimizer then this local minimizer is stationary.

Proposition 3.2 If Z is locally star-shaped at g(x) and x is a regular local minimizer of (1) then x is a stationary point.

Proof. To simplify notation we assume that all components of g are active at x* and that x* = 0, g(x*) = 0. This can be achieved by replacing Z by the set Z_A of (4) and replacing g(·) by g(x* + ·) − g(x*). By assumption, the Jacobian ∇g(0) has full row rank and thus there is an (n − m) × n-matrix A such that the n × n-matrix

   [ ∇g(0) ]
   [   A   ]

is invertible. The inverse function theorem thus implies that the system of equations

   g(x) = ∇g(0)d,   Ax = Ad

has locally around the origin a unique smooth solution function x = ξ(d) with ∇ξ(0) = Id, the n × n identity matrix. Now suppose the origin is not a stationary point. Then there exists d arbitrarily close to the origin such that ∇g(0)d ∈ Z and ∇f(0)d < 0. Choose d close enough to zero such that ξ(αd) is well defined for every α ∈ [0, 1] (regularity assumption) and that ∇g(0)αd ∈ Z for every α ∈ [0, 1] (local star-shapedness). With φ(α) = f(ξ(αd)) one obtains

   φ′(0) = ∇f(0)∇ξ(0)d = ∇f(0)d < 0.

Hence f(ξ(αd)) < f(ξ(0)) = f(0) for all sufficiently small α > 0 which contradicts the local optimality of x = 0. Q.E.D.

4 Lagrange Multipliers

The Lagrange multipliers provide the connection between Newton's method and SQP. In this section we extend the notion of Lagrange multipliers to combinatorial nonlinear programs and provide an equivalent stationarity condition in terms of these multipliers.

Definition 4.1 The Lagrangian function associated with the data (f, g) of problem (1) is of the form L(x, λ) = f(x) + g(x)ᵀλ. A feasible point x is called a critical point if there exist multipliers λ such that

   ∇_x L(x, λ) := ∇f(x) + ∇g(x)ᵀλ = 0,   (7)
   λ_i = 0 if g_i is inactive at x.

Notice that the multipliers associated with a regular critical point are unique.

Proposition 4.1 Every stationary point of (1) is critical.

Proof. If d = 0 is a local solution of (5) then it is also a solution of the linear program

   min_d ∇f(x̄)ᵀd   s.t. g_i(x̄) + ∇g_i(x̄)d = z̄_i, ∀i : g_i is active at x̄,

where z̄ = g(x̄). Hence linear programming duality provides the existence of the multipliers (7). Q.E.D.

Without stringent assumptions on the set Z criticality is not sufficient for stationarity at a regular feasible point of (1). Indeed, in the case of a standard nonlinear

program one needs to impose sign constraints on the multipliers of inequalities in order to completely characterize stationarity. So what is the analogue of the sign constraints for general structure sets Z? The following proposition answers this question. The key observation is that stationarity of a regular critical point is independent of the behavior of the data functions in the vicinity of this point but depends only on the value of g at the critical point and on the Lagrange multiplier λ.

Proposition 4.2 If x̄ is a regular critical point with multiplier λ then it is a stationary point if and only if v = 0 is a local solution to

   max_v λᵀv   s.t. g(x̄) + v ∈ Z.   (8)

Proof. Recall that x̄ is stationary if and only if d = 0 is a local solution of (5). In view of (7), d = 0 is a local solution of (5) if and only if it is a local solution of

   min_d −λᵀ∇g(x̄)d   s.t. g(x̄) + ∇g(x̄)d ∈ Z.   (9)

Since changes of inactive constraint functions have locally no impact on the validity of the constraint we may restrict our attention to the active constraints. In view of the regularity assumption there exists a solution d(v) to the system of equations

   ∇g_i(x̄)d = v_i,   i : g_i is active at x̄,

for every v. If, on the one hand, a sequence v^k of feasible points for (8) converges to zero and λᵀv^k > 0 then there exists a sequence d(v^k) of feasible points for (5) converging to zero such that ∇f(x̄)ᵀd(v^k) < 0 and therefore x̄ is not stationary. If, on the other hand, x̄ is not stationary then there exists a sequence d^k converging to zero such that ∇f(x̄)ᵀd^k < 0 and therefore v^k = ∇g(x̄)d^k converges to zero as well and λᵀv^k > 0, i.e. the origin is not a local solution of (8). Q.E.D.

We emphasize again that stationarity at regular points is fully characterized by the multipliers λ, the value z̄ = g(x̄), and the properties of Z in a neighborhood of z̄. There is no need for information about the function f or the constraint functions g in the vicinity of x̄.

Notice that small inactive components v_i will neither affect the validity of the constraint g(x̄) + v ∈ Z nor, since the corresponding multipliers vanish, the value of the objective function. If we reduce the optimization problem (8) to the active components by replacing Z by its projection Z_A onto the subspace of active components, we obtain the optimization problem

   max_{v_A} λ_Aᵀ v_A   s.t. g_A(x̄) + v_A ∈ Z_A,   (10)

which has a local solution at v_A = 0 if and only if the origin is a local solution of the original problem (8). The latter optimization problem (10) allows us to define

strict complementarity as the requirement that there exists a neighborhood V of λ_A such that the origin v_A = 0 remains a local solution of (10) if λ_A is replaced by any λ̃_A ∈ V. In the classical case of nonlinear programming this is equivalent to the requirement that the multipliers corresponding to active inequalities are strictly positive. This definition will turn out to be useful for a convergence analysis of SQP later on.

It is illustrative to compare the concepts of critical and stationary point in the following example which, although a standard nonlinear program, is re-formulated in a non-standard way that fits our framework. Consider the problem

   min ½[(x_1 − 1)^2 + (x_2 − 1)^2]   s.t. x ∈ Z = {z | z_1 + z_2 ≤ 0}.

The interesting difference of this formulation from the standard formulation g(x) = x1 + x2 ≤ 0 is that our chosen structure set Z is not pointed. In our setting, the active components are z1 and z2 if z1 + z2 = 0 and none otherwise. Critical points can only occur where both components are active and the criticality condition

   x_1 − 1 + λ_1 = 0,
   x_2 − 1 + λ_2 = 0

is in fact satisfied for all such points, i.e., all points x with x_1 + x_2 = 0 are critical with multipliers λ_i = 1 − x_i, i = 1, 2. The stationarity condition requires that the origin solves

   max λᵀv   s.t. x + v ∈ Z.

Since x_1 + x_2 = 0, x + v ∈ Z if and only if v_1 + v_2 ≤ 0 and the origin is a maximizer if and only if λ is a non-negative multiple of the vector e = (1, 1). Since λ = e − x and x_1 + x_2 = 0 this holds only for x = (0, 0). The minimizer of the original problem is therefore the only stationary point. Notice that, in contrast to the standard NLP formulation, the stationary point does not satisfy strict complementarity in the re-formulation with the structure set Z. This is due to the non-pointedness of the chosen structure set Z.
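The stationarity test in this example reduces to a one-line check: on the critical line x_1 + x_2 = 0 the multiplier is λ = e − x, and the origin is a maximizer of (8) exactly when the two multiplier components agree and are non-negative. The helper below is an illustrative sketch, not code from the paper.

```python
# Sketch: criticality holds on x1 + x2 = 0 with lambda = (1 - x1, 1 - x2);
# stationarity additionally requires lambda to be a non-negative multiple of (1, 1).
def is_stationary(x1, tol=1e-12):
    """Check stationarity of the critical point (x1, -x1) in the example."""
    x2 = -x1
    lam = (1.0 - x1, 1.0 - x2)
    return abs(lam[0] - lam[1]) <= tol and lam[0] >= -tol
```

Only x_1 = 0 passes the test, in agreement with the discussion above.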

The following concept is useful for the further specification of Proposition 4.2 for the case of locally star-shaped structure sets Z.

Definition 4.2 Let Z be locally star-shaped and let z ∈ Z. Then the localization of Z at z is the cone

   C_Z(z) = cl{v | ∃ α_v > 0 : z + αv ∈ Z ∀α ∈ [0, α_v]}.

Notice that the linear subspace spanned by the inactive components is contained in C_Z(z). Readers familiar with nonsmooth analysis will realize that C_Z(z) coincides with Bouligand's contingent cone of the locally star-shaped set Z at z, see [3], which contains all vectors v such that z + α_k v^k ∈ Z for some v^k → v and α_k → 0 from

above.

Recall that the polar cone of a cone C is defined by C° = {y | yᵀx ≤ 0, ∀x ∈ C}, see [22]. With this notion, the following corollary follows directly from the foregoing proposition.

Corollary 4.1 If Z is closed and locally star-shaped at g(x̄) ∈ Z then a regular critical point x̄ with associated multiplier λ is stationary if and only if λ ∈ C_Z(g(x̄))°.

The simple example

   max x_2   s.t. (x_1, x_2) ∈ Z = {(z_1, z_2) | z_2 = z_1^2}

with the origin as reference point shows that the "only if" part of this corollary does not hold if C_Z(z) is replaced by the Bouligand cone of a non-star-shaped set.

If Z is closed and locally star-shaped then, in view of the latter corollary, checking stationarity for a regular feasible point is, in principle, a convex feasibility problem. Its computational tractability, however, depends on the availability of a suitable representation of C_Z(z)°. If, for example, Z = {(u, v) | min{u, v} = 0}, where the minimum operation is taken componentwise over the two vectors u, v ∈ IR^k, then the set C_Z(0) is the union of the 2^k convex cones

   C_I = {(u, v) | u_i = 0, v_i ≥ 0, if i ∈ I; u_i ≥ 0, v_i = 0, if i ∉ I},

I ⊆ {1, . . . , k}. The convex hull of C_Z(0), however, is the nonnegative orthant in IR^{2k}, i.e., in this case the combinatorial difficulty of checking λᵀx ≤ 0 for every x ∈ C_I and every I ⊆ {1, . . . , k} disappears. This remarkable fact lies at the heart of all the recently developed efficient methods for the local optimization of MPECs. Indeed, if verifying stationarity was a difficult combinatorial problem it would seem unlikely that an efficient method can be guaranteed to converge to a stationary point. By the same token, if the verification of stationarity for a problem of type (1) is tractable, then there is hope that an efficient local optimization method can be developed for this type of problem. The essential question is therefore whether the convex hull of C_Z(z) can be represented in a computationally tractable way. The answer to this question obviously depends on the initial representation of C_Z(z). If, for example, the set is given as the union of polyhedral cones then the convex hull of C_Z(z) is the sum of all these cones. If the cones are given in "primal" form C_i = {x | A_i x ≤ 0}, i = 1, . . . , m, then the stationarity problem λ ∈ C_Z(z)° is equivalent to

λ^T (x_1 + ··· + x_m) ≤ 0 for all (x_1, . . . , x_m) with A_1 x_1 ≤ 0, . . . , A_m x_m ≤ 0.

This can be checked efficiently by a decomposition method as long as the number m of polyhedral cones in the union is not too large. Similarly, if the cones are given in dual form C_i = {x | x = A_i^T λ_i, λ_i ≥ 0}, i = 1, . . . , m, then conv C_Z(z) = {x | x = A^T λ, λ ≥ 0}, where A contains all rows of the matrices A_i without duplication. This

generator matrix A can be quite small even if m is large. In the MPEC example above, the cones C_I are of the form

C_I = {(u, v) | (u, v) = Σ_{i ∈ I} λ_i (e_i, 0) + Σ_{i ∉ I} µ_i (0, e_i),  λ_i, µ_i ≥ 0},

where e_i is the ith unit vector. The overall matrix A is therefore the unit matrix in IR^{2k} even though the set C_Z(z) is the union of m = 2^k cones C_I.
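To make the collapse of the combinatorial check concrete, the following Python sketch (our own illustration, not code from the paper) compares the brute-force test λ^T x ≤ 0 over the generators of all 2^k cones C_I with the closed-form sign test obtained from the fact that conv C_Z(0) is the nonnegative orthant:

```python
import itertools

def polar_brute_force(lam_u, lam_v):
    """Check lam^T x <= 0 on every cone C_I by testing its generators.

    C_I is generated by (0, e_i) for i in I and (e_i, 0) for i not in I,
    so each generator contributes a single component of lam."""
    k = len(lam_u)
    for I in itertools.chain.from_iterable(
            itertools.combinations(range(k), r) for r in range(k + 1)):
        I = set(I)
        for i in range(k):
            # generator (0, e_i) if i in I, else (e_i, 0)
            val = lam_v[i] if i in I else lam_u[i]
            if val > 0:
                return False
    return True

def polar_closed_form(lam_u, lam_v):
    """Equivalent test via conv C_Z(0) = nonnegative orthant of R^{2k}."""
    return all(x <= 0 for x in lam_u) and all(x <= 0 for x in lam_v)
```

The exponential enumeration and the componentwise sign check agree, which is exactly the "disappearing" combinatorial difficulty described above.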

As a further example, consider the structure set Z of the optimization problem with max-min constraints (3). After applying the local reduction used in the proof of Proposition 3.1 the structure set is locally represented as Z = {z | p(z) ≤ 0} with

p(z) = max_{j ∈ J̄} min_{i ∈ N̄_j} z_i,

where z̄_i = 0 for every

i ∈ Ī = ∪_{j ∈ J̄} N̄_j

and N̄_k ⊆ N̄_j implies N̄_k = N̄_j. Obviously, v ∈ C_Z(z̄) if and only if p(v) ≤ 0, which is the case if and only if

v ∈ V = ∪_{r ∈ ×_{j ∈ J̄} N̄_j} {w | w_{r_j} ≤ 0, j ∈ J̄}.

If for some i ∈ Ī there exists a tuple r ∈ ×_{j ∈ J̄} N̄_j with r_j ≠ i for every j, then e_i as well as −e_i are contained in V, since there is a set on the right-hand side which has no constraint on w_i. On the other hand, if for some i ∈ Ī every tuple r ∈ ×_{j ∈ J̄} N̄_j has one index r_j = i, then −e_i ∈ V but e_i ∉ V, since the constraint w_i ≤ 0 is present in every set on the right-hand side. It follows that the cone conv C_Z(z̄) = conv V is the direct product of the linear subspace generated by the unit vectors e_i corresponding to the inactive components v_i and to those components v_i for which there exists a tuple r ∈ ×_{j ∈ J̄} N̄_j with r_j ≠ i for every j, and the cone generated by the negative unit vectors −e_i corresponding to those i which are contained in every tuple r ∈ ×_{j ∈ J̄} N̄_j. The latter indices are precisely those i for which N̄_j = {i} for some j ∈ J̄. The following corollary is a direct consequence of this observation and the stationarity condition λ ∈ C_Z(g(x̄))^o for regular x̄.

Corollary 4.2 Suppose a regular feasible point x̄ of (3) is critical with associated multiplier vector λ and assume the notation of Proposition 3.1 for z̄ = g(x̄). Then x̄ is stationary if and only if λ_i ≥ 0 for every i such that N̄_j = {i} for some j ∈ J̄ and λ_i = 0 for all other indices i.
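The stationarity test of Corollary 4.2 is a componentwise sign check. A small Python sketch (the function name and the dictionary encoding of the multipliers are our own):

```python
def maxmin_stationary(lam, Nbar, tol=1e-9):
    """Corollary 4.2: lam_i >= 0 whenever some N̄_j = {i} is a singleton,
    and lam_i = 0 for all other active indices.

    lam  -- dict i -> multiplier lambda_i
    Nbar -- list of the sets N̄_j, j in J̄
    """
    singletons = {next(iter(Nj)) for Nj in Nbar if len(Nj) == 1}
    active = set().union(*Nbar)
    for i in active:
        li = lam.get(i, 0.0)
        if i in singletons:
            if li < -tol:        # singleton index: multiplier must be >= 0
                return False
        elif abs(li) > tol:      # non-singleton active index: must vanish
            return False
    return True
```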

If λ_{i_0} < 0 for some i_0 such that N̄_j = {i_0} then, following the arguments in the proof of Proposition 4.2, ∇f(x̄)d < 0 for every solution d of the equations

∇g_i(x̄)d = −(e_{i_0})_i for every i such that g_i is active at x̄.

To illustrate the strength of this optimality condition in comparison with the often-used condition based on Clarke's subdifferential [5], we finally specify the optimality condition for the case of the unconstrained max-min problem

min_x max_{1 ≤ j ≤ r} min_{i ∈ N_j} f_i(x)        (11)

which can be recast as (2). In view of Proposition 3.1 it is possible to compute the set of active functions f_i at x̄ in the following way:

1. Set α := max_{1 ≤ j ≤ r} min_{i ∈ N_j} f_i(x̄).

2. Determine J = {j | min_{i ∈ N_j} f_i(x̄) = α}.

3. Determine N̄_j = {i ∈ N_j | f_i(x̄) = α} for every j ∈ J.

4. Delete from J all indices j for which there exists k ∈ J such that N̄_k is a proper subset of N̄_j, and call the remaining index set J̄.

A function f_i is active at x̄ if and only if i ∈ ∪_{j ∈ J̄} N̄_j. The following corollary is a direct consequence of the foregoing corollary in view of the reformulation (2) of (11).

Corollary 4.3 If x̄ is a minimizer of (11) and the gradients of all active functions f_i are affinely independent, then zero is in the convex hull of the gradients of those active f_{i_0} for which there exists an index j_0 with i_0 ∈ N_{j_0} and f_{i_0}(x̄) < f_i(x̄) for every i ∈ N_{j_0} \ {i_0}.
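The four-step computation of the active functions can be transcribed directly into code; the following Python sketch (function name and data encoding are ours) follows steps 1-4 verbatim:

```python
def active_functions(fvals, N):
    """Steps 1-4 above: fvals maps i -> f_i(xbar), N is the list of sets N_j.
    Returns the level alpha, the reduced family {j: N̄_j for j in J̄},
    and the set of active indices."""
    alpha = max(min(fvals[i] for i in Nj) for Nj in N)                       # step 1
    J = [j for j, Nj in enumerate(N) if min(fvals[i] for i in Nj) == alpha]  # step 2
    Nbar = {j: {i for i in N[j] if fvals[i] == alpha} for j in J}            # step 3
    Jbar = [j for j in J                                                     # step 4
            if not any(k != j and Nbar[k] < Nbar[j] for k in J)]
    active = set().union(*(Nbar[j] for j in Jbar))
    return alpha, {j: Nbar[j] for j in Jbar}, active
```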

Notice that a sufficient condition for the gradients of all active functions f_i to be affinely independent is that the gradients of all functions with f_i(x̄) = α are affinely independent. One may wish to call the latter functions weakly active. Notice also that the above optimality condition is in general considerably stronger than the often employed condition based on the Clarke subdifferential. Indeed, if the gradients of the active selection functions of a max-min function are affinely independent, then the Clarke subdifferential of the max-min function is the convex hull of the gradients of all active functions¹. It is not difficult to see that this argument extends to the stationarity condition of Clarke [4] for constrained problems if applied to (3).

¹ This can be seen as follows: It suffices to show that every gradient ∇f_i(x̄) is contained in the subdifferential. Suppose w.l.o.g. that f(x̄) = 0. First reduce the max-min function to a selection of active functions: Remove all non-vanishing min-functions and, in each vanishing min-function, all non-vanishing selection functions f_i. This results in a max-min function which coincides locally with the original function but has only vanishing selection functions. Next delete all min-functions corresponding to sets N_j with N_j ⊆ N_k for some j ≠ k. This does not change the function since these min-functions were redundant. Now choose an index i_0 and a set N_j containing i_0. Renumber the other selection functions so that N_j = {1, . . . , i_0} for some j. Consider the inequality system

f_1(x) > . . . > f_{i_0}(x) > . . . > f_m(x).

Notice that the affine independence of the vectors ∇f_i(x̄) implies the linear independence of the vectors ∇f_i(x̄) − ∇f_{i+1}(x̄). The implicit function theorem therefore implies that the above set is nonempty with boundary point x̄. Moreover, the max-min function coincides with the selection function f_{i_0}(x) on the above set. This proves that ∇f_{i_0}(x̄) is contained in the Clarke subdifferential.

5 Sequential Quadratic Programming

The Lagrange multiplier-based stationarity condition of the last section allows a straightforward extension of the sequential quadratic programming idea to combinatorial nonlinear programs. We discuss this extension in this section and argue that the traditional convergence arguments for standard nonlinear programs extend, mutatis mutandis, to combinatorial programs. For the sake of simplicity, transparency, and brevity we have chosen to mimic the traditional textbook analysis, see [9, 20, 28], using the standard assumptions of regularity, strict complementarity, and a second order condition. Our aim here is merely to argue that the SQP method is a sensible procedure for combinatorial nonlinear programs, rather than to provide a comprehensive study of its convergence properties under the weakest possible assumptions. In particular, the strict complementarity assumption requires that the projection of the structure set Z onto the subspace of active components is "pointed" at the solution, as illustrated in an example in the foregoing section. It is an interesting question whether some of these assumptions can be relaxed, e.g. by extending the more sophisticated convergence analyses of SQP methods, e.g. [2], or of Newton's method for nonsmooth or generalized equations, e.g. [8, 16, 21], to the present setting. However, this constitutes a research question in its own right and is beyond the scope of the present study.

Recall that the problem we wish to solve is of the form

min f(x)
s.t. g(x) ∈ Z.        (12)

Given a point x and a multiplier estimate λ we can formulate the SQP subproblem

min_d ∇f(x)d + ½ d^T ∇^2_{xx} L(x, λ) d
s.t. g(x) + ∇g(x)d ∈ Z.        (13)
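For illustration, the data of subproblem (13) can be assembled numerically. The finite-difference approximation of the Lagrangian Hessian below is a simplification we use for this sketch (all function names are ours; a practical implementation would use exact derivatives or quasi-Newton updates):

```python
import numpy as np

def sqp_subproblem_data(f, g, x, lam, eps=1e-5):
    """Return (grad f(x), Jacobian of g at x, approx. Hessian of L = f + lam^T g),
    so that (13) reads: min  gf @ d + 0.5 * d @ H @ d  s.t.  g(x) + J @ d in Z."""
    x = np.asarray(x, dtype=float)
    n, m = x.size, len(g(x))

    def L(y):
        return f(y) + float(np.dot(lam, g(y)))

    def grad(fun, y):
        gr = np.zeros(n)
        for i in range(n):
            e = np.zeros(n); e[i] = eps
            gr[i] = (fun(y + e) - fun(y - e)) / (2 * eps)  # central difference
        return gr

    gf = grad(f, x)
    J = np.array([grad(lambda y, i=i: g(y)[i], x) for i in range(m)])
    H = np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n); e[i] = eps
        H[i] = (grad(L, x + e) - grad(L, x - e)) / (2 * eps)
    H = 0.5 * (H + H.T)  # symmetrize the numerical Hessian
    return gf, J, H
```

The returned data define the combinatorial QP; solving it still requires a method that can handle the structure set Z, as discussed in the next section.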

We will now argue that SQP converges locally quadratically to a stationary point x* under the following assumptions:

• Z is locally star-shaped at z* = g(x*),
• x* is a regular stationary point of (12) with multiplier λ*,
• strict complementarity holds at x*,
• d^T ∇^2_{xx} L(x*, λ*) d > 0 for every non-vanishing direction d with ∇g_i(x*)d = 0 for all active constraint functions g_i at x*.

The latter second order condition is the traditional second order sufficient condition for the equality constrained program

min f(x)
s.t. g_i(x) = z_i*, i ∈ A        (14)

where z* = g(x*) and A is the set of active constraint indices at x*. To simplify our exposition we assume that all constraint functions are active at x*. This is

without loss of generality since inactive constraint functions will remain inactive in a neighborhood of x* and can therefore be discarded for a local analysis. With this assumption the criticality conditions for (12) become

∇f(x*) + ∇g(x*)^T λ* = 0        (15)
g(x*) = z*.

The criticality conditions for the subproblem are

∇f(x) + ∇^2_{xx} L(x, λ) d + ∇g(x)^T µ = 0
g(x) + ∇g(x)d = z        (16)
z ∈ Z.

Since x* is critical with multiplier λ*, we know that d = 0 and µ = λ* solve the latter system, provided (x, λ, z) = (x*, λ*, z*). In view of our second order condition, the implicit function theorem implies that the two equations in (16) define locally unique smooth implicit functions d(x, λ, z) and µ(x, λ, z). Next we employ the strict complementarity condition. Since we have assumed that all constraint functions are active, strict complementarity means that v = 0 is a local solution to

max µ^T v
s.t. z* + v ∈ Z        (17)

for every µ close to λ*. Therefore the critical points d(x, λ, z*) with multipliers µ(x, λ, z*) are indeed stationary points of the subproblems for (x, λ) close to (x*, λ*). We finally show that they are the only stationary points of the subproblems close to the origin. To this end, suppose that d(x, λ, z) is stationary for some z ≠ z*. We will now make use of the fact that Z is locally star-shaped at z* and therefore the line segment [z*, z] will be a subset of Z for z close to z*. Since µ = µ(x, λ, z) is close to λ*, strict complementarity implies that v = 0 is a local solution of (17). Also, stationarity of d(x, λ, z) means that w = 0 is a local solution of

max µ^T w
s.t. z + w ∈ Z.        (18)

Therefore, using the fact that [z∗, z] is contained in Z, µ is perpendicular to z − z∗ and, since z∗ + α(z − z∗) ∈ Z for every α ∈ [0, 1], u = 0 is not a local solution of

max [µ + α(z − z*)]^T u
s.t. z* + u ∈ Z        (19)

for any small positive α. This contradicts strict complementarity.

We can now repeat the classical argument that under our assumptions SQP is equivalent to Newton’s method applied to the optimality conditions with active set identification and converges quadratically if started at a point (x, λ) close to (x∗, λ∗). The penalty trust-region globalization scheme of [25] and [27] applies to this setting, provided an appropriate residual function p is available. This function needs to be locally Lipschitz and an indicator for the set Z in the sense that p ≥ 0 and p(z) = 0

if and only if z ∈ Z. In addition, it needs to provide an exact penalization of the constraints. This is, for example, the case if p is piecewise affine and the constraints are regular, see [25]. For a general approach to Lipschitzian penalty functions we refer the reader to the recent paper [15]. For a comprehensive study of trust-region globalization we refer to the book [7].

6 A Decomposition Approach to Combinatorial Problems

In order to apply the SQP method of the last section, we need a method to solve the subproblems, which are combinatorial quadratic programmes. Efficient methods for the solution of such problems will most definitely have to make use of the specific structure of Z. Nevertheless, there is a simple decomposition approach which may guide the development of such methods. In this section we briefly explain the general decomposition approach and show how it can be applied to the subproblems arising from max-min optimization. We begin again with our general problem

min f(x)
s.t. g(x) ∈ Z.        (20)

The decomposition approach for problem (20) is based on a collection of finitely many subsets Z_i ⊆ Z. These subsets, which we call pieces, give rise to subproblems of the form

min f(x)
s.t. g(x) ∈ Z_i.        (21)

We assume that the subproblems are simpler than the full problem (20) in the sense that a (descent) method is available to solve the subproblems (21) from a given feasible point. In order to extend the solution procedure for the subproblems to a method for the full problem (20), there has to be a link between optimality or stationarity of the subproblems and optimality or stationarity of the full problem (20). We focus on stationarity in the sequel and require

1. If (20) has a stationary point then g(x) ∈ Zi for at least one such stationary point x and one piece Zi.

2. If x is a stationary point of (21) for all adjacent pieces Zi then it is stationary for the full problem (20).

Here, a piece Zi is called adjacent to x if g(x) ∈ Zi. Notice that both conditions are automatically satisfied if the union of the sets Zi coincides with Z.

Starting from a piece Z_{i_0} and a point x^0 with g(x^0) ∈ Z_{i_0}, the k-th iteration of the decomposition method proceeds as follows:

1. Given x^{k−1} with g(x^{k−1}) ∈ Z_{i_{k−1}}, use a descent method to solve (21) for Z_i = Z_{i_{k−1}}, starting from x^{k−1}.

2. If the descent method does not produce a stationary point of (21) then stop and report the failure of the descent method (e.g. divergence to −∞ or non-convergence).

3. If the method produces a stationary point x^k of (21) and this point is stationary for every adjacent piece then stop; a stationary point of (20) has been found.

4. Otherwise, choose a piece Z_{i_k} adjacent to x^k for which x^k is not stationary and continue.

If the decomposition method stops after finitely many iterations then it either reports failure of the descent method or produces a stationary point x^k of (20). If each subproblem (21) has only finitely many stationary levels then the method stops after finitely many iterations, since it is a descent method and there are only finitely many pieces Z_i. Here, a stationary level of (21) is a value f(x), where x is a stationary point of (21). Obviously convex subproblems have at most one stationary level.
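The four steps above can be phrased as a generic loop over user-supplied callbacks. In the Python sketch below (all names are ours), the callbacks stand in for the descent method, the adjacency test, and the stationarity test:

```python
def decomposition(solve_piece, adjacent_pieces, stationary_on, x0, piece0,
                  max_iter=100):
    """Generic decomposition loop for problem (20); see steps 1-4 above.

    solve_piece(p, x)    -- descent method for (21) on piece p, or None on failure
    adjacent_pieces(x)   -- pieces p with g(x) in Z_p
    stationary_on(p, x)  -- stationarity test for (21) on piece p
    """
    x, piece = x0, piece0
    for _ in range(max_iter):
        x = solve_piece(piece, x)                  # step 1
        if x is None:                              # step 2: descent method failed
            return None
        bad = [p for p in adjacent_pieces(x)
               if not stationary_on(p, x)]
        if not bad:                                # step 3: stationary on every
            return x                               #   adjacent piece: done
        piece = bad[0]                             # step 4: switch to a bad piece
    raise RuntimeError("iteration limit reached")
```

As a toy usage example, minimizing (x − 3)^2 over the union of the intervals [0, 1] and [1, 4] starts on the first piece, detects non-stationarity at the shared point 1, switches pieces, and finishes at 3.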

The design of a decomposition method is therefore broken down into three parts:

1. Decide on the partition of Z into pieces Z_i.

2. Decide on a descent method to solve the subproblems (21).

3. Develop a mechanism to decide whether the solution of a subproblem is stationary for every adjacent piece and, if not, to produce a piece adjacent to x^k for which x^k is not stationary.

It seems most desirable to make use of the multipliers corresponding to the subproblem at iteration k to decide whether x^k is stationary for every adjacent piece or, if not, to determine a descent piece.

As an example, consider the decomposition method for MPECs

min f(z)
s.t. min{G(z), H(z)} = 0.

One starts with an index set I and solves the standard nonlinear program

min f(z)
s.t. G_i(z) = 0, i ∈ I
     G_j(z) ≥ 0, j ∉ I
     H_i(z) ≥ 0, i ∈ I
     H_j(z) = 0, j ∉ I.

If z̄ is a regular stationary point of this NLP and the multipliers corresponding to bi-active constraints H_i(z) = G_i(z) = 0 are both nonnegative, then z̄ is stationary for the MPEC. Otherwise one adds a bi-active index i to I if the multiplier corresponding to H_i is negative, or removes i from I if the multiplier corresponding to G_i is negative, and iterates.
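The index-set update just described can be written down directly. A Python sketch (our own naming and encoding of the multipliers):

```python
def update_mpec_index_set(I, biactive, mult_G, mult_H):
    """Pivoting rule for the MPEC decomposition described above.

    I         -- current index set (indices i with G_i(z) = 0 enforced)
    biactive  -- indices i with G_i(zbar) = H_i(zbar) = 0
    mult_G/H  -- dicts i -> multiplier of the constraint on G_i resp. H_i
    Returns (new_I, True) if a swap was made, or (I, False) if zbar is
    stationary for the MPEC (all bi-active multipliers nonnegative).
    """
    for i in biactive:
        if mult_H.get(i, 0.0) < 0:
            return I | {i}, True    # enforce G_i(z) = 0 instead of H_i(z) = 0
        if mult_G.get(i, 0.0) < 0:
            return I - {i}, True    # enforce H_i(z) = 0 instead of G_i(z) = 0
    return I, False
```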

A similar method can be used to solve (3). The basis for this method is the following simple observation.

Lemma 6.1

max_{1 ≤ j ≤ r} min_{i ∈ N_j} z_i = min_{s ∈ ×_{j=1}^r N_j} max{z_{s_1}, . . . , z_{s_r}}.

Proof. On the one hand, if s_j ∈ N_j is chosen with z_{s_j} = min_{i ∈ N_j} z_i, then

max_{1 ≤ j ≤ r} min_{i ∈ N_j} z_i = max_{1 ≤ j ≤ r} z_{s_j} ≥ min_{s ∈ ×_{j=1}^r N_j} max{z_{s_1}, . . . , z_{s_r}}.

On the other hand, min_{i ∈ N_{j_0}} z_i ≤ min_{s ∈ ×_{j=1}^r N_j} max{z_{s_1}, . . . , z_{s_r}} for every j_0, and therefore

max_{1 ≤ j ≤ r} min_{i ∈ N_j} z_i ≤ min_{s ∈ ×_{j=1}^r N_j} max{z_{s_1}, . . . , z_{s_r}}.   Q.E.D.

In view of the lemma, the max-min optimization problem (3)

min_x f(x)
s.t. max_{1 ≤ j ≤ r} min_{i ∈ N_j} g_i(x) ≤ 0

can be reformulated as

min_{s ∈ ×_{j=1}^r N_j} min_x f(x)
s.t. g_{s_j}(x) ≤ 0, j = 1, . . . , r.
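Lemma 6.1 is easy to confirm numerically. The brute-force check below (our own illustration) enumerates the product of the index sets:

```python
from itertools import product

def maxmin(z, N):
    """Left-hand side of Lemma 6.1: max over blocks of the block minima."""
    return max(min(z[i] for i in Nj) for Nj in N)

def minmax(z, N):
    """Right-hand side: minimize max{z_{s_1},...,z_{s_r}} over s in N_1 x ... x N_r."""
    return min(max(z[i] for i in s) for s in product(*N))
```

For any data z and any family of index sets N the two values coincide, as the lemma asserts.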

Given a combination s ∈ ×_{j=1}^r N_j, the decomposition method now proceeds as follows:

1. Find a stationary point x̄ of

min f(x)
s.t. g_i(x) ≤ 0, i ∈ {s_1, . . . , s_r}.        (22)

We assume that each constraint appears only once if s_j = s_k for some j ≠ k. If the gradients of the vanishing constraint functions at x̄ are linearly independent, then we obtain a unique multiplier vector λ̄ satisfying

∇f(x̄) + ∇g(x̄)^T λ̄ = 0
λ̄ ≥ 0
λ̄_k = 0 if k ∉ {s_1, . . . , s_r} or g_k(x̄) ≠ 0.

2. Compute the sets J̄ and N̄_j defined in Proposition 3.1 for z̄ = g(x̄). According to Proposition 3.1, the functions g_i, i ∈ ∪_{j ∈ J̄} N̄_j, are the active constraint functions at x̄.

3. We assume that the gradients of the active constraint functions at x̄ are linearly independent. Then, according to Corollary 4.2, x̄ is stationary if and only if for every λ̄_{i_0} > 0 there exists an index j_0 with i_0 ∈ N_{j_0} and g_{i_0}(x̄) < g_i(x̄) for every i ∈ N_{j_0} \ {i_0}. If this is the case for every index i_0 with λ̄_{i_0} > 0 then stop.

4. If this is not the case for some i_0 with λ̄_{i_0} > 0 then change s in the following way: For every j with s_j = i_0 set s_j := i_1, where i_1 ∈ N_j, i_1 ≠ i_0, and g_{i_1}(x̄) ≤ g_{i_0}(x̄). Go to step 1.

This method is only conceptual since the solution of the subproblems may not be obtainable in a finite number of iterations if nonlinear functions are present. This

issue would have to be addressed in an implementation. We will not dwell on these details since we suggest using this method to solve the combinatorial quadratic programs arising as subproblems in the SQP method, rather than using the method directly to solve max-min optimization problems as a sequence of nonlinear programs. One advantage of the SQP approach is that no new derivative information is needed during the solution of the subproblems. This makes the SQP approach more efficient than a direct decomposition approach.
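Steps 3 and 4 of the method, i.e. the stationarity test of Corollary 4.2 and the swap of selected indices, can be sketched as follows (Python; the encoding of s, N, the multipliers and the constraint values is our own):

```python
def check_and_swap(s, N, lam, gval, tol=1e-9):
    """Return (True, s) if xbar is stationary (step 3); otherwise return
    (False, s_new) with the selection updated as in step 4.

    s    -- list with s[j] in N[j] for each j
    N    -- list of the index sets N_j
    lam  -- dict i -> multiplier of g_i in the current subproblem (22)
    gval -- dict i -> g_i(xbar)
    """
    for i0, li in lam.items():
        if li <= tol:
            continue
        # step 3: is g_{i0} the strict minimum over some block containing i0?
        if any(i0 in Nj and all(gval[i0] < gval[i] for i in Nj if i != i0)
               for Nj in N):
            continue
        # step 4: swap i0 out of every block where it is currently selected
        s_new = list(s)
        for j, Nj in enumerate(N):
            if s_new[j] == i0:
                s_new[j] = min((i for i in Nj if i != i0 and gval[i] <= gval[i0]),
                               key=lambda i: gval[i])
        return False, s_new
    return True, list(s)
```

The sketch assumes, as the text does, that a replacement index with g_{i_1}(x̄) ≤ g_{i_0}(x̄) exists in every affected block.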

Conceptually, the method is well-defined if the gradients of the active constraint functions defined in step 2 are linearly independent at every visited stationary point x̄. We will comment on the stringency of this condition below. Let us assume it holds, and let us also assume that the method used to solve (22), if it converges, produces a point with a lower objective value than the starting point, provided the starting point is not stationary. Then we will either observe non-convergence in a subproblem, e.g. due to an unbounded objective function, or we obtain a stationary point of the subproblem. This point is either stationary for the overall problem, as tested in step 3, or the method continues with a different subproblem. In the non-convex case we can possibly revisit a subproblem, but only finitely often if it has only finitely many stationary levels, as argued above. Since there are only finitely many subproblems, this leads to finite convergence to a stationary point. Recall that convex subproblems have a single stationary level. Moreover, if the subproblems are convex then stationarity for a subproblem is equivalent to minimality, and therefore stationarity for the master problem implies minimality on every adjacent piece and hence local minimality for the master problem. Summing up, we obtain the following proposition.

Proposition 6.1 If the functions f and g_i are smooth and convex and the gradients of the active functions at the generated solutions of all subproblems are linearly independent, then the decomposition method stops after finitely many iterations and either produces a local minimizer or reports that a particular subproblem could not be solved.

The underlying reason why this method works is that the multipliers contain sensitivity information. Indeed, the subproblem is of the form

min f(x)
s.t. g_{s_j}(x) ≤ 0, j = 1, . . . , r.

We assume that constraints corresponding to indices s_j = s_{j_0} occur only once. Suppose x̄ is a stationary point of the subproblem with a multiplier vector λ̄ ∈ IR^m with λ̄_i = 0 if i ∉ {s_1, . . . , s_r} and

∇f(x̄) + Σ_{j=1}^r λ̄_{s_j} ∇g_{s_j}(x̄) = 0
λ̄_{s_j} ≥ 0
λ̄_{s_j} = 0 for all j with g_{s_j}(x̄) < 0.

Suppose there exists an index s_{j_0} = i_0 with λ̄_{i_0} > 0 and that for every j with s_j = i_0 there exists an i_j ∈ N_j different from i_0 with g_{i_j}(x̄) ≤ 0. Notice that x̄ is also a stationary point of

min f(x)
s.t. g_{s_j}(x) ≤ 0, j : s_j ≠ i_0
     g_{i_0}(x) ≤ 0
     g_{i_j}(x) ≤ 0, j : s_j = i_0,

with the same multiplier vector λ̄. Here we assume again that each constraint appears only once. If the active gradients are linearly independent, then the positive multiplier λ̄_{i_0} > 0 indicates that removing the constraint g_{i_0}(x) ≤ 0 will improve the objective value. This can indeed be done by replacing s_j = i_0 by s_j := i_j. We do this and proceed. If, on the other hand, λ̄_{i_0} > 0 implies that there exists an index j_0 with s_{j_0} = i_0 and g_i(x̄) ≠ 0 for every i ∈ N_{j_0} \ {i_0}, then x̄ is a stationary point.

Let us finally return to a discussion of the assumption that the gradients of the active functions at the generated solutions of the subproblems are linearly independent. Sard's theorem, as applied in [24], suggests that this assumption is not particularly stringent: given a particular constraint function g, the linear independence assumption will be satisfied for the perturbed constraint function g(x) + a for almost all a in the Lebesgue sense. This result also holds for the functions f_i used in the reformulation (2) of the unconstrained max-min problem. We give a brief proof of this result for completeness.

Proposition 6.2 Given a smooth mapping f : IR^n → IR^m and a vector a ∈ IR^m, define

f^a(x) = f(x) + a
I_i^a(x) = {j ∈ {1, . . . , m} | f_j^a(x) = f_i^a(x)}

for x ∈ IR^n and 1 ≤ i ≤ m. For almost all a ∈ IR^m (w.r.t. the Lebesgue measure) the gradients ∇f_j(x), j ∈ I_i^a(x), are affinely independent at every point x and for every i ∈ {1, . . . , m}.

Proof. Fix an index set I ⊆ {1, . . . , m} and an index i ∈ I and consider the equations

f_j(x) − f_i(x) = a_i − a_j, j ∈ I \ {i}.

Let (C) be the condition that the vectors ∇f_j(x) − ∇f_i(x), j ∈ I \ {i}, are linearly independent at every solution x of these equations. By Sard's theorem, for any fixed a_i the set of vectors (a_j)_{j ∈ I\{i}} ∈ IR^{I\{i}} such that condition (C) is violated has Lebesgue measure zero in IR^{I\{i}}. Thus, by Fubini's theorem, the set of vectors a ∈ IR^I such that condition (C) is violated has Lebesgue measure zero in IR^I. Since a finite union of Lebesgue null sets is again a Lebesgue null set, it follows that for almost all a ∈ IR^m condition (C) holds for every I ⊆ {1, . . . , m} and every i ∈ I. To complete the proof it suffices to show that if, for fixed I, the vectors ∇f_j(x) − ∇f_i(x), j ∈ I \ {i}, are linearly independent for every i ∈ I, then the gradients ∇f_j(x), j ∈ I, are affinely independent. Indeed, if ∇f_j(x), j ∈ I, are affinely dependent then

there exists a non-vanishing vector λ such that Σ_{j ∈ I} λ_j ∇f_j(x) = 0 and Σ_{j ∈ I} λ_j = 0. If i ∈ I is such that λ_i ≠ 0 then

Σ_{j ∈ I, j ≠ i} (λ_j / λ_i) (∇f_j(x) − ∇f_i(x)) = 0

and thus the vectors ∇f_j(x) − ∇f_i(x), j ∈ I \ {i}, are linearly dependent. Q.E.D.

6.1 Preliminary Numerical Experience

The method of the last section was coded for problems of the form

min ½ x^T Q x + q^T x
s.t. max_{1 ≤ j ≤ l} min_{i ∈ N_j} a_i^T x − b_i ≤ 0

with a positive definite matrix Q. The code was written in MATLAB and used Roger Fletcher's BQPD code to solve the QPs. We checked for replacement of active constraint functions g_i(x) in the order of descending multipliers and replaced a representative g_i of the min-function min_{i ∈ N_j} g_i(x) by the function g_k, k ∈ N_j \ {i}, with the minimal value at the current iteration point, provided that value was non-positive.

A typical set of test problems was generated as follows: We generated an m × n matrix A using the RANDN or RAND functions in MATLAB. The use of the latter guarantees feasibility since A has positive entries and therefore Ax − b < 0 for vectors x with sufficiently large negative components. We chose b to be the m-vector with all components equal to −1, so that the origin is infeasible for any max-min combination of the selection functions A_i x − b_i, since all selection functions are positive at the origin. We then solved the problem of finding a feasible solution with smallest Euclidean norm. To generate the index sets N_j, we randomly permuted the numbers 1, . . . , 5m and chose N_j to be those indices among the first m components of the permutation vector which did not exceed m. The MATLAB code of the method, albeit with the MATLAB QP solver, as well as the random problem generator can be downloaded from [29].
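For readers without MATLAB, the generator can be mimicked in Python/NumPy. The following is a loose re-implementation under our own assumptions, not the downloadable code:

```python
import numpy as np

def random_maxmin_problem(m, n, l, seed=0):
    """Mimic the test-problem generator described above: a positive matrix A
    (RAND analogue), b = -1, and l index sets N_j obtained from random
    permutations of 1..5m, keeping the first-m entries that do not exceed m."""
    rng = np.random.default_rng(seed)
    A = rng.random((m, n))                 # positive entries
    b = -np.ones(m)
    N = []
    for _ in range(l):
        perm = rng.permutation(5 * m) + 1  # the numbers 1, ..., 5m
        Nj = [int(i) for i in perm[:m] if i <= m]
        if Nj:                             # guard against an empty draw
            N.append(Nj)
    return A, b, N
```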

Finding a feasible point of a max-min system is a non-trivial task in general. If an initial QP is infeasible, then one may attempt to restore feasibility by solving an appropriate phase-1 problem such as

min_{(x,y)} Σ_{j=1}^l y_j
s.t. min_{i ∈ N_j} g_i(x) ≤ y_j, j = 1, . . . , l
     y ≥ 0,

starting from a feasible point (x, y), where x could be any vector that one may wish to use as a starting point and y_j = max{min_{i ∈ N_j} g_i(x), 0}. Obviously, the phase-1 problem is

only a heuristic since the method can be trapped in a local minimum of the L1 merit function. As it turned out, a phase-1 approach was unnecessary in our tests. If the initial QP subproblem was infeasible, which occurred frequently, the BQPD code would return positive multipliers for the violated constraints, which the method then swaps successively for "more feasible" constraint functions in the corresponding sets N_j. It typically took only a few iterations until a feasible point was found in this way.

The method performed credibly on the random test problems. A typical problem with, e.g., n = 100 variables, l = 200 min-functions and a total of m = 200 selection functions was solved in 110-140 QP iterations to local optimality. We performed some tests with various randomly chosen initial QPs; in these tests the choice of the initial constraints typically did not affect the number of QP iterations. Not surprisingly, the objective values achieved with different starting QPs could differ substantially. In the above-mentioned test run the best of five starts outperformed the worst by up to 30%. We also performed test runs where we checked after each iteration whether the current point was Clarke stationary and occasionally found such points upon which the method could often substantially improve. We refrain from giving more details of the numerical tests as they are very preliminary and were only performed as a sanity check and as an illustration of the practical potential of the method. More extensive numerical tests and comparisons with other methods will be necessary for a sound judgement of the numerical reliability of the method.

It is obvious that our method, being a local optimization method, can and often will be trapped in a local minimum. It is still a worthwhile method since descent is guaranteed once feasibility is detected. It is of course possible to reformulate optimization problems with max-min constraints as mixed integer NLPs, e.g. through a "big-M" approach

min f(x)
s.t. g_i(x) + M y_i ≤ M, i = 1, . . . , m
     Σ_{i ∈ N_j} y_i ≥ 1, j = 1, . . . , l
     y_i ∈ {0, 1}, i = 1, . . . , m,

where M is a large constant. If the data of the original problem are linear then this results in a mixed integer linear program which is, in principle, solvable to global optimality. Even for such reformulations, which are only tractable for very moderate numbers of selection functions, the local method suggested here is useful, as it can be employed to generate an initial upper bound for the objective and to improve new incumbent upper bounds by invoking the method from the corresponding feasible solutions.
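For linear data g_i(x) = a_i^T x − b_i the big-M reformulation can be assembled explicitly. A sketch of the constraint matrices (our own encoding, with zero-based index sets and a user-chosen M):

```python
import numpy as np

def big_m_milp(A, b, N, M=1e4):
    """Assemble the big-M model above for g(x) = A x - b:
      A x - b + M y <= M 1      (rows [A  M*I] over variables [x; y])
      sum_{i in N_j} y_i >= 1   (written as -sum_{i in N_j} y_i <= -1)
      y binary (handled by the MILP solver, not here).
    Returns (G, h, n_bin) with G [x; y] <= h and n_bin binary variables."""
    m, n = A.shape
    top = np.hstack([A, M * np.eye(m)])   # g_i(x) + M y_i <= M
    h_top = b + M
    rows = np.zeros((len(N), n + m))
    for j, Nj in enumerate(N):
        for i in Nj:
            rows[j, n + i] = -1.0         # -sum_{i in N_j} y_i <= -1
    G = np.vstack([top, rows])
    h = np.concatenate([h_top, -np.ones(len(N))])
    return G, h, m
```

Setting y_i = 1 releases nothing (g_i(x) ≤ 0 is enforced), while y_i = 0 relaxes the i-th constraint to g_i(x) ≤ M; the covering rows force at least one enforced constraint per block N_j, which is exactly the max-min feasibility condition.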

Acknowledgement. Thanks are due to my colleague Danny Ralph for many tea-room discussions on the subject and for valuable suggestions on an earlier draft. I also thank Roger Fletcher and Sven Leyffer for making the MATLAB MEX-file of their BQPD code available to me.

References

[1] S.G. Bartels, L. Kuntz, and S. Scholtes, Continuous selections of linear functions and nonsmooth critical point theory, Nonlinear Analysis, Theory, Methods & Applications 24 (1995), 385-407.
[2] J.F. Bonnans and G. Launay, Implicit trust region algorithm for , Technical report (1992), Institut National de Recherche en Informatique et en Automatique, Le Chesnay, France.
[3] G. Bouligand, Sur la semi-continuité d'inclusions et quelques sujets connexes, Enseign. Math. 31 (1932), 11-22.
[4] F.H. Clarke, A new approach to Lagrange multipliers, Mathematics of Operations Research 1 (1976), 165-174.
[5] F.H. Clarke, Optimization and Nonsmooth Analysis, John Wiley & Sons, New York, 1983.
[6] A.R. Conn and M. Mongeau, Discontinuous piecewise linear optimization, Mathematical Programming 80, 315-380.
[7] A.R. Conn, N.I.M. Gould, and Ph.L. Toint, Trust-Region Methods, SIAM, Philadelphia, 2000.
[8] A.L. Dontchev, Local convergence of Newton's method for generalized equations, Comptes Rendus de l'Académie des Sciences de Paris 332, Ser. I (1996), 327-331.
[9] R. Fletcher, Practical Methods of Optimization, 2nd Edition, John Wiley & Sons, Chichester, 1987.
[10] R. Fourer, A simplex algorithm for piecewise-linear programming I: Derivation and proof, Mathematical Programming 33 (1985), 204-233.
[11] R. Fourer, A simplex algorithm for piecewise-linear programming II: Finiteness, feasibility and degeneracy, Mathematical Programming 41 (1988), 281-315.
[12] R. Fourer, A simplex algorithm for piecewise-linear programming III: Computational analysis and applications, Mathematical Programming 53 (1992), 213-235.
[13] C. Kirjner-Neto and E. Polak, On the conversion of optimization problems with max-min constraints to standard optimization problems, SIAM J. Optimization 8 (1998), 887-915.
[14] D. Klatte and B. Kummer, Nonsmooth Equations in Optimization, Kluwer Academic Publishers, Boston, 2002.
[15] D. Klatte and B. Kummer, Constrained minima and Lipschitzian penalties in metric spaces, SIAM J. Opt. 2002 (forthcoming).
[16] B. Kummer, Newton's method based on generalized derivatives for nonsmooth functions: convergence analysis, in: Advances in Optimization (eds. W. Oettli and D. Pallaschke), Springer Verlag, Berlin, 1992, pp. 171-194.
[17] C. Lemaréchal, Bundle methods in nonsmooth optimization, in: Proceedings of the IIASA Workshop, Nonsmooth Optimization (eds. C. Lemaréchal and R. Mifflin), Pergamon Press, Oxford (1978), 79-102.
[18] Z.-Q. Luo, J.-S. Pang, and D. Ralph, Mathematical Programs with Equilibrium Constraints, Cambridge University Press, 1996.
[19] D. Ralph and E. Polak, A first-order algorithm for semi-infinite min-max-min problems, Manuscript, 2000.
[20] S.M. Robinson, Perturbed Kuhn-Tucker points and rates of convergence for a class of nonlinear programming algorithms, Mathematical Programming 7 (1974), 1-16.
[21] S.M. Robinson, Newton's method for a class of nonsmooth functions, Set-Valued Analysis 2 (1994), 291-305.
[22] R.T. Rockafellar, Convex Analysis, Princeton University Press, 1970.
[23] H. Scheel and S. Scholtes, Mathematical Programs with Complementarity Constraints: Stationarity, Optimality, and Sensitivity.
[24] S. Scholtes and M. Stöhr, How stringent is the linear independence assumption for mathematical programs with complementarity constraints? Mathematics of Operations Research (forthcoming).
[25] S. Scholtes and M. Stöhr, Exact penalization of mathematical programs with equilibrium constraints, SIAM Journal on Control and Optimization 37 (1999), 617-652.
[26] H. Schramm and J. Zowe, A version of the bundle idea for minimizing a nonsmooth function: conceptual idea, convergence analysis, numerical results, SIAM J. Opt. 2 (1992), 121-152.
[27] M. Stöhr, Nonsmooth Trust Region Methods and Their Applications to Mathematical Programs with Equilibrium Constraints, PhD Thesis (1999), Shaker-Verlag, Aachen, Germany.
[28] R.B. Wilson, A simplicial algorithm for concave programming, PhD dissertation, Harvard Business School (1963).
[29] http://www.eng.cam.ac.uk/~ss248/publications
