EXTENDED READINGS, QUESTIONS, EXERCISES, PROGRAMS

Brno University of Technology FME, Institute of

Optimization I sop, vo1p, opm

Pavel Popela

February 15, 2017 Chapter 1

Linear and nonlinear models

Content: The following themes will be studied: • Introduction (Lecture 1–2 below) • (3–9) • Optimality Conditions (10, 17–20) • Lagrangian (24–25) • Introduction to Algorithms - Unconstrained Problems (11–16) - Problems with Constraints (21–23), (26–28)

Detail lecture themes: The order and list will be updated during the flow of lectures (they are numbered below). 1. Organization + Basic concepts 2. Modelling + Solving NLPs (principles) 3. 4. Convex sets 5. 6. Polyhedral sets 7. 8. and method 9. 10. Convex functions 11. Unconstrained optimization Algorithms : general approach and examples 12. Line search methods 13. Multidimensional search without using derivatives 14. 15. 16. Multidimensional search using derivatives (involving advanced methods) 17. 18. 19. Constrained optimization (KKT conditions) 20. Constraints qualification 21. From feasible directions to successive approximations. 22. Projection-based algorithms 23. Reduced algorithms 24. 25. Lagrangian duality 26. Penalty /based methods 27. Augmented Lagrangian-based methods 28. Barrier functions and their use

1 Suggested texts: The following books are suggested as extending readings (further interesting books can be found in the university library): BS93 Bazaraa, M.S., Sherali, H.D., and Shetty, C.M. (1993) Nonlinear Programming: Theory and Applications, John Wiley and Sons. Mi85 Minoux, M. (1985) Mathematical Programming, John Wiley and Sons. BJ90 Bazaraa, M.S., Jarvis, J.J., and Sherali, H.D. (1990) Linear Programming and Network Flows, John Wiley and Sons.

Further sources of information: Books: E.g., Fletcher: Practical Methods of Optimization, Wiley. Journals: Journal of OR, Mathematical Programming. Interesting WWW pages: nlp.fag, http:// www.gams.com, www˜math.cudenver.edu/˜hgreenbe/glossary/glossary.html, www.maths.mu.oz.au/˜worms, mat.gsia.cmu.edu, www.caam.rice.edu/˜mathprog, www.informs.org.

Search also for keywords: nonlinear programming, mathematical programming, optimization . . . and share your result with us!

Exercise 1.

What is NEOS?

2 1.1 Basic concepts

Notation: The following notation is inherited from calculus: a, b, c scalars, constants u, v, x, y, z variables, unknowns A, B, X sets 2A set of all subsets of set A a, b, x, y column vector, a> row vector A, B matrices (aij) where i = 1, ..n, j = 1...n. ∅, {} IN,Z, IR sets of positive integers, integers, and real numbers ∈, ⊂, ∩, ∪, S¯ usual set-related symbols ∧, ∨, ⇒, ⇔ usual logical symbols ∃x ∈ X there exists x in X ∀x ∈ X for any element x of X f : IRn −→ IR a scalar function g : IRn −→ IRm a vector function ∇f(x) the gradient of f at the point x ∇2f(x), H(x) the Hessian of f at x n En, IR the Euclidean n-dimensional (linear) space kxk the Euclidean norm d(x, y) = kx − yk a distance between x and y Nε(x) the epsilon neighbourhood of x x1, ...xk, ... or {xk}k∈IN an infinite sequence lim, lim sup, lim inf limit, upper limit, lower limit int S, cl S, ∂S interior, closure, and boundary of set S sup, inf supremum, infimum min, max minimum, maximum ≤ ordering of real numbers := define

Definition 2 (Topological concepts).

The following topological concepts are often utilized. Denote Nε(x) = {y:ky − xk < ε} and S ⊂ IRn:

Closure: x ∈ cl S ⇔ ∀ε>0 : S ∩ Nε(x) 6= ∅ Closed set: S is closed ⇔ S = cl S

Interior: x ∈ int S ⇔ ∃ε>0 : Nε(x) ⊂ S Open set: S is open ⇔ S = int S Solid set: S is solid ⇔ int S 6= ∅

Boundary: x ∈ ∂S ⇔ ∀ε>0 : (Nε(x) ∩ S 6= ∅) ∧ (Nε(x) ∩ S¯ 6= ∅) n Bounded set: S is bounded ⇔ ∃x ∈ IR , ∃ε>0 : S ⊂ Nε(x) Compact set: S is compact ⇔ (S is bounded)∧(S is closed)

Properties: A finite intersection (union) of open sets is an open set. A finite intersection (union) of closed sets is an open set.

3 cl S = S ∪ ∂S IRn = cl IRn = int IRn S 6= int S ∪ ∂S ∅ = cl ∅ = int ∅ int S = S \ ∂S S open ⇒ S¯ closed ∂S 6= S \ int SS closed ⇒ S¯ open

Exercise 3.

Illustrate concepts and properties by figures. Compare concepts: the closed interval and closed set.

Definition 4 (Continuity-related concepts).

The following concepts that are related to convergence and continuity will be often used:

Convergence: xk −→ x ⇔ ∀ε>0, ∃N ∈ IN, ∀k ≥ N : kxk − xk < ε. Cluster point: x is a cluster point of {xk}k∈IN ⇔ ∃L ⊂ IN : xk −→ x, k ∈ L. Remark: a convergent sequence has a unique cluster point that is a limit.

Limes superior: A limit superior (upper limit) of {xk}k∈IN ⊂ IR is the largest cluster point, i.e. for {xk}k∈IN y = lim sup xk ⇔ (∀ε>0, ∃K ∈ IN : k > K ⇒ xk ≤ y + ε) ∧ (∀ε>0, ∀K > 0, ∃k > K : xk ≤ y − ε) Limes inferior: A limit inferior (lower limit) of {xk}k∈IN ⊂ IR is the smallest cluster point. We may define lim inf xk = − lim sup(−xk). Closedness: The alternative definition of the closed set (or its property): S is closed ⇔ (∀{xk}k∈IN sequence such that xk ∈ S, xk −→ x ⇒ x ∈ S). Relative interior: Let S ⊂ IRn, aff S denotes the affine hull of S (see Remark 36) and int S = ∅ is allowed. Then: relint S = {x ∈ S | ∃ε > 0 : Nε(x) ∩ aff S ⊂ S} Continuous function: For f(x) continuous at x: xk −→ x and f(xk) −→ L implies f(x) = L Lower semicontinuous function: f is lower semicontinuous at x ∈ S ⇔ ε > 0 ∃δ > 0 : y ∈ S, ky − xk < δ ⇒ f(y) − f(x) > −ε (or f(x) < f(y) + ε).

Exercise 5.

Illustrate concepts by drawing figures in IR2. Compare continuity and lower semicon- tinuity of f.

General decision problems. Complex decision problems may be described using algebraic concepts. Let Γ = (V,E) be an oriented graph describing distribution of decision in time or space. Indices v ∈ V are vertices (nodes) and e ∈ E are edges (arcs, arrows). For all nodes, we introduce v v v v v v O = (N , K , {X } v , C , K , { } v ). D K K∈KD I K K∈KI

4 Ov a description of decision problem related to v ∈ V N v problem-related decision makers in node v v N v KD ⊂ 2 groups of decision makers v K ∈ KD one group of decision makers v XK set of feasible decisions in v for K v Q v C ⊂ v X decision situations K∈KD K (no interactions among problems related to different nodes) v v v K ⊂ C × C preference relations v N v KI ⊂ 2 groups of interests v Q v Generally for interacting decision problems: C ⊂ U × v X describe decision situations K∈KD K u v (with interactions among problems) and Tuv : C −→ C describes the influence of decisions from node u on node v.

Definition 6 (Mathematical program).

Let S ⊂ IRn and f : IRn −→ IR. Then a mathematical program (MP is further used as the abbreviation) is defined as follows:

min{f(x) | x ∈ S} (1.1) x

There are various special cases of the general mathematical program (c, b are vectors, A is matrix):

• Linear program (LP): min{c>x | Ax = b, x ≥ 0} • Mixed integer linear program (MIP): > ∃J ⊂ {1, . . . , n}: min{c x | Ax = b, x ≥ 0, ∀j ∈ Jxj ∈ IN} • Nonlinear program - see below

Another mathematical programs can be obtained from general cases after some initial modifica- tion:

• from optimal control using discretization, • from multicriteria optimization by scalar reformulation, and • from stochastic programming by deterministic reformulation.

Definition 7 (Nonlinear program).

Let f : IRn −→ IR, g : IRn −→ IRm, and X ⊂ IRn. In addition ◦ ∈ {≤, ≥, =}m. Then

min{f(x) | g(x) ◦ 0, x ∈ X} (1.2) x is called a nonlinear program (further used shortcut NLP).

Notice that it can be obtained from mathematical program by assigning: S = {x ∈ X | g(x) ◦ m m 0} = ∩i=1Si = ∩i=1{x ∈ X | gi(x) ◦ 0} = {x ∈ X | gi(x) ≤ 0, 1 ≤ i ≤ l, gi(x) = 0, l + 1 ≤

5 i ≤ m}. A standard form of nonlinear program (given by the last form of the feasible set and by minimization) can be obtained by equivalent transformations from any nonlinear program (cf. with linear program standard form, i.e. max 7−→ min, ≤, ≥7−→ =, etc. There are also frequently used special cases of nonlinear programs:

• A quadratic program: min{x>Hx + c>x | Ax = b, x ≥ 0}. • A convex program: S is a and f is a .

Descriptive and normative approach. Till now, we utilized a descriptive approach. It means that syntax of optimization program is correct, i.e. it says “how mathematical optimization model looks". We turn to a normative approach now, saying “what to do" to reach the optimum. We review the ordering concepts.

Ordering of IR and its properties. The couple (IR, ≤) represents a natural ordering of real numbers. There is a binary relation ≤⊂ IR × IR such that it satisfies following axioms:

•∀ a ∈ IR : a ≤ a •∀ a, b ∈ IR : (a ≤ b) ∧ (b ≤ a) ⇒ a = b •∀ a, b, c ∈ IR : (a ≤ b) ∧ (b ≤ c) ⇒ a ≤ c • In addition ∀a, b ∈ IR : (a ≤ b) ∨ (b ≤ a).

Definition 8 (max, min, inf, sup).

Let B ⊂ IR, IR∗ = IR ∪ {−∞, +∞} where −∞ and ∞ are such elements that satisfy ∀a ∈ IR : −∞ ≤ a ≤ ∞. Then we define:

Bmin = min B ⇔ (bmin ∈ B) ∧ (∀b ∈ B : bmin ≤ b),

Bmax = max B ⇔ (bmax ∈ B) ∧ (∀b ∈ B : bmax ≥ b), inf B = max{c | (c ∈ IR∗) ∧ (∀b ∈ B : c ≤ b)},

sup B = min{c | (c ∈ IR∗) ∧ (∀b ∈ B : c ≥ b)}.

Denote domain of f as Domf and image of f as Imf. Then semantics of MP may be given as minx{f(x) | x ∈ S} = min Imf.

Syntax extensions. Usually, in mathematical programming, one of the following descriptions is used. We describe MP by min f(x) subject to x ∈ S or by minx{f(x) | x ∈ S}. To identify the goal of our optimization search, we extend the notation assuming that minx{f(x) | x ∈ S} denotes the set of all minimal objective function values. Similarly, argminx{f(x) | x ∈ S} is a set of all minima. So, ? ∈ argminx{f(x) | x ∈ S} means the search for at least one minimum (using = instead of ∈ means to find all minima). Sometimes, we insert the words -glob- or -loc- into argmin to add details about the type of searched minima.

6 Definition 9 (Scheme for minima).

n Let S ⊂ IR and f : S −→ IR. We define that xmin ∈ S is a point of  local   strict  × minima global non-strict

⇐⇒

    ∃Nε(xmin): ∀x ∈ S ∩ Nε(xmin) \{xmin} < × f(xmin) f(x) ∀x ∈ S \{xmin} ≤

Note that xmin is an isolated minimum of f on S ⇔ ∃ε > 0 : ∃!xmin ∈ Nε(xmin). Example 10 (Use of definition scheme).

Let f : S −→ IR be a given function with domain S. Then xmin ∈ S is a local (non- strict) minimum if and if (iff) ∃Nε(xmin): ∀x ∈ S ∩ Nε(xmin) \{xmin} : f(xmin) ≤ f(x).

Exercise 11.

Apply the definition scheme9 to develop other definitions of optimal solutions. Try to reformulate it for maximum.

Theorem 12 (Weierstrass).

Let S ⊂ IRn, S 6= ∅, S is a compact set, and f : S −→ IR continuous function on S. Then mathematical program minx{f(x) | x ∈ S} attains its global minimum (denoted as xmin).

The theorem remains valid if continuity is replaced by ‘lower semicontinuity’. Exercise 13.

Show the importance of the following assumptions of Theorem 12 by counterexamples in IR2: S 6= ∅, S closed, bounded, and f continuous on S. Does it work for linear programs? Compare formulations: ‘must have a global minimum’ or ‘may have a global minimum’ for Weierstrass’ Theorem.

Exercise 14.

Why constraint g(x) < 0 is seldom considered in mathematical programming? Hint: Think about Weierstrass’ Theorem assumptions and conclusions.

7 Exercise 15.

If g(x) < 0 is still required then show how to change it to attain ≤. Hint: Use ‘rounding relaxation’ g(x) ≤ 0 or ε use g(x) ≤ −ε.

Proof of Theorem 12: Assume S closed and bounded and f continuous on S. Then (from Calculus) f is bounded below on S. Since S 6= ∅ ⇒ ∃α = inf{f(x) | x ∈ S}. Let 0 < ε < 1 : Sk = {x ∈ k S : α ≤ f(x) ≤ α + ε }. So, ∀k ∈ IN : Sk 6= ∅, xk ∈ Sk, and ∀k ∈ IN : {xk} ⊂ S. Because S is bounded ⇒ ∃L ⊂ IN : xk −→ x¯, k ∈ L. Because S is closed ⇒ x¯ ∈ S. Because f is continuous: k ∀k ∈ L : α ≤ f(xk) ≤ α + ε ⇒ α = lim f(xk) = f(x¯). So f(x¯) = inf{f(x) | x ∈ S} = f(xmin) k∈L,k−→∞ and we have xmin. 

1.2 Modelling

Example 16 (Location of facilities).

Locate centers of activities in optimal way, e.g., imagine Malta towns and markets. Use a map and simplify the problem specification (direct distance considered and no governmental geographic rules given).

Exercise 17.

Formulate a mathematical program. Hint: Begin with a traditional transportation problem.

Modeling recommendations: Answer ‘10 modelling questions for ever’. The answers may help you to identify model type and properties:

1. How many decision makers participate in decision problem?

2. How many criteria have to be considered?

3. Are decisions spatially or time distributed?

4. Any uncertain parameters have to be taken into account?

5. Which basic types for decision variables must be used (scalars, vectors, functions, etc.)?

6. Which domains are specified for decision variables?

7. Do we have to make a difference between local and global optima?

8. May differentiability or another computational feature of involved functions be effectively employed?

9. Are there linear functions used for problem description?

10. Is it possible to take an advantage from special structures (networks, etc.)?

8 Modelling in general. Going through successful applications of mathematical programming, we generalize the obtained experience, and we will try to formulate some general rules applica- ble for model building and solving in mathematical programming. It was observed that every mathematical programming model has its own life-cycle, similar to other models. This life-cycle is composed of several steps. These steps were discussed in detail by Geoffrion. We may use, for instance, the following scheme for the description of modelling life-cycle:

Problem analysis: In this step, the decision problem is identified and analysed:

• First, we must formulate the target of the later-built model. • Then, the context of the studied problem is analysed to evaluate global consequences of local decisions. • The extensive discussions among the modelers and experts of the application field follow to define problem structures, parameters, and decision variables. • Bibliographical references to similar applications may help to classify the considered problem. • Historical input data and related decisions are collected for model verification. They are classified (e.g., for different technological scenarios) and stored (e.g., in information system databases). • The results of analysis are formulated using the application specific language as the first step to model building.

Model design: The considered problem is completely classified, the model is built as a math- ematical model, and its transformation into the implementation description is realized in the following steps:

• Number of decision makers is defined, and decision variables are identified. – The type of decision representation is fixed for each decision variable as scalar, vector, function, etc. – Time structure is handled as static or dynamic with discrete or continuous time. – Domains of variables are characterized as real, integer, etc. • Number and the form of decision rules is determined. – Artificial variables describing the values of criteria are declared. – The optimization criteria are expressed in the form of objective functions. – The objective functions are built with general mathematical forms of composed functions combining variables, input parameters, and weighting parameters in- troduced by a decision maker. • Sets of decisions, together with the feasible set of possible situations, are specified. – The experience of consultants must be used in such a way that historical decisions are also feasible for the built program. – Indices and sets of their values assess a model size. – Input data parameters are listed, and their certainty is characterized. Primary parameter values are obtained from databases saving analysis results. – Derived parameters are calculated from primary data using assignments and mathematical operations to simplify model formulation. – A feasible set is described by bounds and constraints. – The groups of constraints are ordered by their importance.

9 – Some of them are identified as soft constraints, others as hard constraints. • The mathematical model properties such as linearity and convexity that help the solution process are studied. • This model is described using a general modelling language for practical computations. • The external data sources are assigned to model parameters. • Model building continues with approximation, relaxation, or transformation that keeps the problem solvable.

Implementation and solution: In this step, the model is completely implemented with the concrete software and hardware platforms. As a result, the solution will be found.

• The mathematical model is implemented for the chosen type of computer and op- erating system. The appropriate optimization system is selected, a previous general formulation is specified. • The solver is chosen. Sometimes the expert’s opinion is important in this selection, e.g., Schittkowski developed a methodology for the support of a solver selection in nonlinear programming. • Computing of test problems with simple structure based on historical data set follows. • Tuning of algorithm parameters for the considered class of problems is realized. • Permanent connection to other software, as information system data sources, is es- tablished.

Presentation and interpretation: In this step, results are collected and analysed.

• The unbounded objective or the infeasible program detect the modelling mistakes, such as missing or contradictory constraints or typing errors in data. • Results are analysed and evaluated. • Suggestions for model change are given.

The importance of modelling in the optimization process is also noted by modelers. Practical experience shows that model building in mathematical programming needs approximately 2/3 of the entire time spent with the problem. Model building in mathematical programming and many ‘modelling tricks’ for different classes of problems were discussed in detail in literature and journals.

Application of problem analysis and model design. The 10 questions introduced earlier may help you to specify a type of the problem. There are two ways to optimization model building:

• Either use existing Input/Output model having the form y = h(x) and then: (1) choose k ∈ {1, . . . , m} and f(x) = hk(x) the objective function, (2) specify lj and uj lower and upper bounds for j ∈ {1, . . . , m}\{k} and set gj = hj for all considered j to get n S = {x | ∀j : lj ≤ gj(x) ≤ uj, x ∈ IR }. Then mathematical program min{f(x) | x ∈ S} is built. • Build the model in the bottom-top style. It means to identify (1) the goal min z, (2) the objective z = f(x), (3) the variables x ∈ IRn, (4) their bounds l ≤ x ≤ u, and (5) constraints (I/O (balance) equalities =, ‘demand’constraints ≥, and ‘storage’ constraints ≤.

10 Example 18 (Continuation – problem analysis).

We choose the second possibility for our example: min z means to minimize overall transportation costs. z = f(x, y, w) costs depend on location of warehouse (xj, yj) and on the amount wj transported from warehouse i to town j. Hint: Compare it with a usual transportation problem and use it partially!

Example 19 (Continuation – model building).

z cost of total transport is proportional to distances and shipments from warehouses

(xi, yi) unknown location of warehouse i ∈ {1, . . . , m}

ci known capacity of warehouse i ∈ {1, . . . , m}

(aj, bj) known location of market j ∈ {1, . . . , n}

rj known demand at market j ∈ {1, . . . , n}

dij unknown distance from warehouse i to market area j

wij unknown units shipped from i to j

min z Pm Pn z = i=1 j=1 dijwij Pn j=1 wij ≤ ci, i = 1, . . . , m Pm i=1 wij ≥ rj, j = 1, . . . , n

wij ≥ 0 Where is the nonlinearity? p 2 2 dij = (xi − aj) + (yi − bj) = k(xi, yi) − (aj, bj)k2

Also another norm thank · k2may be used

l ≤ xi, yi ≤ u, i = 1, . . . , m

Exercise 20.

Fill the model with real data for 10 towns, 3 warehouses, e.g. for Malta island! Visualize your data.

Exercise 21.

Use MS Excel for real data computations including optimization.

11 Useful theoretical properties. At first, we need to know theoretical properties of the model. Weierstrass’ Theorem will answer us: ‘Does the optimal solution exist?’ Other questions remain, e.g. does a unique solution exist?

Exercise 22.

Use Weierstrass’ Theorem to recognize whether the optimal solution exists.

Algorithm choice. We must learn more about the algorithms for NLP (this semester), as the simplex method for LP is not applicable. At the beginning, we may restrict ourselves to the use of optimization software. So, how to solve NLP?

Some software tools. non-specialized Excel, Matlab, etc. own programs hard to compete as you do not know unpublished numerical tricks procedures you should know their programming language solvers MINOS, CONOPT, etc. Difficult data communication modeling languages GAMS, AMPL, AIMMS - the right choice interfaces MPL model building, AIMMS report writing See www.gams.com and www.paragon.nl for more information. Model more problems as nonlinear programs — see below.

Exercise 23 (Parallelogram).

Formulate a mathematical program (using analytic geometry): Find a point E on the side BC of the triangle ABC such that the parallelogram with vertices D resp. F lying on the sides AB resp. AC has maximal area. Solve it using a computer program (Matlab, Excel, GAMS). Hint: The solution is obviously given by choosing E to be the midpoint of BC. In fact for arbitrary E on BC with λ = BE/BC the area of the corresponding parallelogram is the area of (ADEF) = 2λ(1 − λ) multiplying the area of(ABC) and this function is maximal for lambda = 1/2.

12 Exercise 24 (Heron).

Heron (ca. 100 BC) gave a solution to the following problem: On a given line find a point C such that the sum of the distances to the points A and B is minimal. Compare the modeling and solution by mathematical programming with the explanatory and simple geometric approach. Hint: If A and B lie on opposite sides of the given line, then obviously the intersection of the line with the segment AB is the desired point; otherwise one reflects A in the line getting A’ and determines C as the intersection of the line with the segment A’B. Hence: a light ray reflected in the line with angle of incidence and angle of reflection equal takes the shortest possible path from A to B via C. Or: if one wants to go from A to B on the shortest possible route and on the way fetch a pail of water from a (straight) river, then one must solve the same problem.

Exercise 25 (Kepler).

Minimize the cylinder surface S having the constant cylinder volume V . J. Kepler lived in Prague in 17th century. During his walks, he observed that barrels for wine are sold with only one measurement. Solving this problem, he wrote a paper why two measurements are not necessary for the ‘optimal’ barrels.

Exercise 26 (Paper box).

Having a sheet of paper, use scissors to create a box with the biggest volume.

Advanced applications of NLP may be found in Bracken, McCormick: Selected applications of nonlinear programming, Wiley, 1968 (a copy available from Dr.Sklenar). Think about problems that are also implemented in the GAMS system: WEAPONS.GMS, PROCESS.GMS, CHEM.GMS, LINEAR.GMS, LEAST.GMS, LIKE.GMS. Other real-life problems may serve as bases for diploma thesis:

- Melt control problem: Optimize the sequence of charges of input materials to minimize a cost and maximize a quality of produced alloy with respect to constraints involving random losses and goal intervals for the melt composition. - Irrigation network design: Optimize a reconstructed irrigation pipe network topology, min- imize costs, satisfy pressure and flow uncertain demands. - Heat exchanger parameters optimization: Optimize exchanger design parameters mini- mizing investment and maintenance costs subject to technical and economical constraints having parameters with values estimated from experimental data.

13 1.3 Convex sets

1.3.1 Convex sets, combinations, and hulls

Definition 27 (Convex set).

n Let S ⊂ IR . We say S is a convex set ⇔ ∀x1, x2, ∀λ ∈ [0; 1] : λx1 + (1 − λ)x2 ∈ S.

Definition 27 is very important for understanding and memorizing. “The convex set is a set where you cannot hide yourself." For the convex set, all points of any line segment connecting two points from it again belong to it.

Exercise 28.

Draw figures describing different convex and non-convex sets in IR2. Discuss whether a segment line, triangle, circle, etc. are convex sets. Does exist a concave set? Hint: No!!!

Definition 29 (Convex combination).

n Pk Pk Let x, x1,..., xk ∈ IR . Then x satisfying x = j=1 λjxj where j=1 λj = 1 and ∀j ∈ {1, . . . , k} : λj ≥ 0 is called a convex combination of x1,..., xk.

Specifically: The point λx1 + (1 − λ)x2 is a convex combination of x1 and x2. Review: Pk Pk 1. The point x = j=1 λjxj where j=1 λj = 1, ∀j ∈ {1, . . . , k} : λj ∈ IR is called an affine combination of x1,..., xk. Pk 2. The point x = j=1 λjxj where ∀j ∈ {1, . . . , k} : λj ∈ IR is called a linear combination of x1,..., xk.

Example 30.

Convexity of the following sets is important in optimization:

1. Let p ∈ IRn, x¯ ∈ IRn, α ∈ IR, and p>x¯ = α then hyperplane H specified using the normal vector p and constant α (or fixed point x¯) by H = {x ∈ IRn | p>x = α} = {x ∈ IRn | p>(x − x¯) = 0} is a convex set.

2. Let p ∈ IRn and α ∈ IR then halfspace H+ defined by H+ = {x ∈ IRn | p>x ≤ α} is a convex set.

3. Let A ∈ IRm×n and b ∈ IRm then the set S defined by S = {x ∈ IRn | Ax ≤ b} is a convex set.

4. Let A ∈ IRm×n, c ∈ IRn, and b ∈ IRm then the feasible set and set of optimal solutions of a linear program ? ∈ argmin{c>x | Ax = b, x ≥ 0} are convex sets.

14 Exercise 31.

Using a definition 27 prove the convexity of sets introduced in Example 30. Illustrate all cases by figures in IR2.

Lemma 32 (Properties of convex sets).

n Let S1,S2 ⊂ IR be convex sets. Then:

1. S1 ∩ S2 is a convex set.

2. S1 ⊕ S2 = {x1 + x2 | x1 ∈ S1, x2 ∈ S2} is a convex set.

3. S1 S2 = {x1 − x2 | x1 ∈ S1, x2 ∈ S2} is a convex set.

Exercise 33.

Prove Lemma 32. The fact that the intersection of convex sets is again a convex set is very useful in optimization. Why? Hint: For example, think about the feasible set of a linear programming problem.

Definition 34 ().

Let S ⊂ IRn, we say that x ∈ IRn belongs to a convex hull of S (x ∈ conv S) ⇔ ∃k ∈ IN, Pk Pk ∃λ1, . . . , λk ≥ 0, j=1 λj = 1, ∃x1,..., xk ∈ S : x = j=1 λjxj.

Exercise 35.

Draw figures describing convex hulls for different sets in IR2.

Remark 36 (Properties of convex hulls).

Remember, understand, and draw figures: Let S ⊂ IRn then:

1. conv S is the smallest convex set containing S (outside specification).

2. conv S is the intersection of all convex sets containing S.

3. For S convex, S = conv S is true.

4. conv S is the set of all convex combinations from S (inside specification).

5. An affine hull aff S is the set of all affine combinations of points in S. It is the smallest affine subspace containing S.

6. A linear hull lin S is the set of all linear combinations of points in S. It is the smallest linear subspace.

15 Exercise 37.

Drawing figures, compare concepts of conv S, aff S, and lin S using examples in IR2. Hint: Use two points in IR2.

Definition 38 (Polytope, simplex).

n Let x1,..., xk, xk+1 ∈ IR then:

1. conv{x1,..., xk+1} is called a polytope.

2. If x2 − x1, x3 − x1,..., xk+1 − x1 are linearly independent then we say that x1,..., xk+1 are affinely independent.

3. If x1,..., xk+1 are affinely independent then conv{x1,..., xk+1} is called a sim- plex with vertices x1,..., xk+1. 4. There could be no simplex with more than n + 1 vertices.

Theorem 39 (Caratheodory).

n Pn+1 Let S ⊂ IR then ∀x ∈ conv S : ∃ λj ≥ 0, j ∈ {1, . . . , n + 1}, j=1 λj = 1, ∃ xj ∈ S, Pn+1 j ∈ {1, . . . , n + 1} : x = j=1 λjxj.

Caratheodory’s Theorem says that any point in the convex hull of set S can be represented as a convex combination of at most n + 1 points in S (remember that n is a dimension of IRn).

Exercise 40.

How many points do you need when x ∈ S (not in conv S \ S)?

n Proof of Theorem 39: Let S ⊂ IR and x ∈ conv S. Then (by Definition 34) ∃k : ∃xj ∈ S ,∃λj > 0 , Pk Pk j = 1, . . . , k , j=1 λj = 1 and x = j=1 λjxj . If k ≤ n + 1 then the theorem is proven. Otherwise, k > n + 1 and we continue. The main idea is to eliminate one λi > 0. As k ≥ n + 2 then x2 − x1,..., xk − x1 are linearly dependent (because k − 1 ≥ n + 1). So ∃µ2, . . . , µk ∈ IR not all equal zero and satisfying Pk Pk Pk Pk j=2 µj(xj − x1) = 0. We denote µ1 = − j=2 µj. Then j=1 µjxj = 0 , j=1 µj = 0 , and > Pk Pk (µ1, . . . , µk) 6= 0. So, at least one µj > 0. Then we may write: x = j=1 λjxj = j=1 λjxj + 0 = Pk Pk Pk Pk j=1 λjxj −α0 = j=1 λjxj −α j=1 µjxj = j=1(λj −αµj)xj. Now we choose suitable α to eliminate λj λi one nonzero λj: α = min1≤j≤k{ | µj > 0}. Assume that α = . Because λj > 0 and µj > 0 in µj µi the minimization term then α > 0. Then for other j: (1) µj ≤ 0 ⇒ λj − αµj > 0, (2) µj > 0 ⇒ λj λi Pk ≥ = α, and hence, λj − αµj > 0. In addition λi − αµi = 0. We have x = (λj − αµj)xj µj µi j=1j6=i Pk and j=1j6=i(λj − αµj) = 1. Therefore x is a convex combination of k − 1 points in S. This step can be repeated till k = n + 1 and then the proof is complete. 

16 1.3.2 Closed and open convex sets

Theorem 41 (About line segment).

n Let S ⊂ IR be a convex set, int S 6= ∅, x1 ∈ cl S, and x2 ∈ int S. Then ∀λ ∈ (0; 1) : λx1 + (1 − λ)x2 ∈ int S.

Theorem says: Given a convex set with a nonempty interior, the line segment (without the end points) joining a point in the interior of the set and a point in the closure of the set belongs to the interior of the set.

n Proof of Theorem 41: Let S ⊂ IR be convex, int S 6= ∅, x1 ∈ cl S, x2 ∈ int S. Because x2 ∈ int S ⇒ ∃Nε(x2) ⊂ S : {z | kz − x2k < ε} ⊂ S. We denote y = λx1 + (1 − λ)x2 for λ ∈ (0; 1) chosen arbitrarily. 0 We need to show that if y ∈ int S then ∃Nε0 : Nε0 ⊂ S. We choose ε = (1 − λ)ε. So, we want to show {z | kz − yk < (1 − λ)ε} ⊂ S. We take z : kz − yk < (1 − λ)ε. We want to show z ∈ S. It can be done showing that z is a convex combination of 2 points of S. So, ∃z1 satisfying (1 − λ)ε − kz − yk x ∈ cl S ⇒ {x : kx − x k < } ∩ S 6= ∅. 1 1 λ

z−λz1 y−λx1 Set z2 = 1−λ . Then z = λz1 − (1 − λ)z2 and because x2 = 1−λ : z − λz kz − x k = k 1 − x k = 2 2 1 − λ 2 z − λz − (y − λx ) 1 z − λz − (y − λx ) k 1 1 k = k 1 1 k ≤ 1 − λ 1 − λ 1 − λ 1 1 (1 − λ)ε − kz − yk (kz − yk + λkx − z k) < (kz − yk + λ ) = ε. 1 − λ 1 1 1 − λ λ

Then z2 ∈ int S ⇒ z ∈ S ⇒ y ∈ int S.  Exercise 42.

Draw figures and give counterexamples that the theorem does not remain valid when any of its assumptions is omitted.

Corollary 43 (Closure, interior, and convex hull).

Let S be a convex set, so then int S is convex. Let S be convex and int S 6= ∅ ⇒ cl S is convex, cl(int S) = cl S, and int(cl S) = int S.

Proof: To prove: S is convex and int S 6= ∅ ⇒ cl S is convex. So, x1, x2 ∈ cl S arbitrarily. ∃z ∈ int S. By theorem 41, we get ∀λ ∈ (0; 1) : λx2 +(1−λ)z ∈ int S. We fix µ ∈ (0; 1) : and by theorem ∀λ ∈ (0; 1) : µx1 + (1 − µ)(λx2 + (1 − λ)z) ∈ int S ⊂ S. Then λ −→ 1 ⇒ µx1 + (1 − µ)x2 ∈ cl S.  To prove: S is convex and int S 6= ∅ ⇒ cl int S = cl S. So, we know cl(int S) ⊂ cl S because int S ⊂ S. We want to prove cl(int S) ⊃ cl S Assume x ∈ cl S and y ∈ int S 6= ∅ ⇒ ∀λ ∈ (0; 1) λx + (1 − λ)y ∈ int S. − Then λ −→ 1 ⇒ x ∈ cl int S.  To prove: S is convex and int S 6= ∅ ⇒ int(cl S) = int S. We know int S ⊂ int(cl S) because S ⊂ cl S.

We take x1 ∈ int(cl S) and want to show that x1 ∈ int S. ∃ε > 0 : ky − x1k < ε ⇒ y ∈ cl S. We get ε x2 ∈ int S such that y = (1 + ∆)x1 − ∆x2 where ∆ = and (x2 6= x1). Since ky − x1k = ε/2 2kx1−x2k 1 ⇒ y ∈ cl S. Therefore, we can write x1 = λy + (1 − λ)x2 where λ = 1+∆ ∈ (0; 1). Since y ∈ cl S and x2 ∈ int S then by Theorem 41, we conclude x1 ∈ int S. 

17 Exercise 44.

Illustrate principle ideas of proofs by drawings in IR2.

1.3.3 Separating hyperplanes Almost all optimality conditions and duality relationships in nonlinear programming use some sort of separation of convex sets or supporting hyperplanes of convex sets theorems. For them, previous results about cl, int, and conv are needed together with the following theorem:

Theorem 45 (Minimum distance).

Let S ⊂ IRn,S 6= ∅, S = cl S (i.e. S is closed), S = conv S (i.e. S is convex), and y 6∈ S ⇒ ∃!x¯ ∈ S : x¯ ∈ argminx{ky − xk | x ∈ S} and x¯ ∈ argminx{ky − xk | x ∈ S} ⇒ ∀x ∈ S :(y − x¯)>(x − x¯) ≤ 0.

Lemma 46 (Cosine law).

∀a, b ∈ IRn : ka + bk2 = kak2 + kbk2 + 2a>b.

Proof: Let a, b ∈ IRn and γ is the angle defined by them. Then ka + bk2 = (kbk − kak cos γ)2 + 2 2 2 2 2 2 2 > a>b (kak sin γ) = kak (sin γ +cos γ)+kbk −2kak·kbk·cos γ = kak +kbk +2a b because cos γ = kak·kbk 2 (draw a figure in IR and use the Pythagoras’ theorem).  Lemma 47 (Parallelogram law).

∀a, b ∈ IRn : ka + bk2 + ka − bk2 = 2kak2 + 2kbk2.

Proof: By the cosine law: ∀a, b ∈ IRn : ka+bk2 = kak2+kbk2+2a>b and ka−bk2 = kak2+kbk2−2a>b. 2 2 By sum of both equalities, we get 2kak + 2kbk and proof is complete.  Exercise 48.

qPn 2 2 Assuming that kak = j=1 aj and drawing figures in IR explain ideas of both lemmas. Hint: The parallelogram law says that the sum of squared norms of the parallelogram diagonals equals the sum of squared norms of its sides.

Proof of Theorem 45: Existence of x¯: Because S 6= ∅ ⇒ ∃xˆ ∈ S one fixed point and we define S¯ := {x | ky − xk ≤ ky − xˆk} ∩ S. Because S is closed then S¯ is also closed and bounded (xˆ is fixed), and q ¯ Pn 2 hence, S is a compact set. We know that ky − xk is continuous in x. (Note ky − xk = j=1(yj − xj) , (y−x¯)>(x−x¯) and for the angle α of y − x¯ and x − x¯ then cos α = ky−x¯k·kx−x¯k .) In addition inf{ky − xk | x ∈ S} = inf{ky − xˆk | x ∈ S} ∩ S. By Weierstrass’ theorem even minimum x¯ exists. Uniqueness of x¯: We assume x¯0 ∈ S : ky − x¯k = |y − x¯0k = γ and we show that x¯ = x¯0. By 0 x¯−x¯0 1 0 convexity of S (x¯ + x¯ )/2 ∈ S. We check y − 2 k = 2 k(y − x¯) + (y − x¯ )k ≤ (Schwarz inequality)

18 1 1 0 x¯+x¯0 2 ky − x¯k + 2 ky − x¯ k = γ. If < then 2 is nearer and we obtain a contradiction with x¯ ∈ argmin{...}. If = then ∃λ : y − x¯ = λ(y − x¯0). As ky − x¯k = ky − x¯0k = γ then we have | λ |= 1. If λ = 1 then x¯ = x¯0, 0 x¯+x¯0 so x¯ is unique. If λ = −1 then from y − x¯ = λ(y − x¯ ) we get y = 2 ∈ S, however by assumption y 6∈ S and contradiction. To complete the proof ∀x ∈ S : ((y − x¯)>(x − x¯) ≤ 0) ⇒ x¯ ∈ argmin{ky − xk | x ∈ S}. Sufficiency ⇒: x ∈ S ⇒ ky − xk2 = ky − x¯ + x¯ − xk2 = by parallelogram law ky − x¯k2+ kx¯ − xk2+ 2(x¯ − x)>(y − x¯). Note that kx¯ + xk ≥ 0 and 2(x¯ − x)>(y − x¯) = −2(y − x¯)>(x − x¯) ≥ 0 by assumption. Hence: ky − xk2 ≥ ky − x¯k2 ⇒ x¯ ∈ argmin{ky − xk | x ∈ S}. Necessity ⇐: x ∈ argmin{ky − xk | x ∈ S} ⇒ ∀x ∈ S : ky − xk ≥ ky − x¯k, so ky − xk2 ≥ ky − x¯k2. x ∈ S is chosen arbitrarily, S is convex then ∀λ ∈ [0; 1] : λx + (1 − λ)x¯ = x¯ + λ(x − x¯) ∈ S. So ky − x¯ − λ(x − x¯)k2 ≥| y − x¯k2. By Lemma 47 ky − x¯ − λ(x − x¯)k2 = ky − x¯k2 + λ2kx − x¯k2− 2λ(y −x¯)>(x−x¯). Together we have ky −x¯ −λ(x−x¯)k2− λ2kx−x¯k2+ 2λ(y −x¯)>(x−x¯) = ky −x¯k2 ≤ ky − x¯ − λ(x − x¯)k2. Therefore, 2λ(y − x¯)>(x − x¯) ≤ λ2kx − x¯k2. It is trivially satisfied for λ = 0. > λ 2 Assuming λ > 0 and multiplying the inequality by 1/(2λ), we get: (y − x¯) (x − x¯) ≤ 2 kx − x¯k . The right-hand-side (RHS) is greater than 0. As λ −→ 0+ (because λ ∈ [0; 1]) then the RHS approaches 0 and the theorem is proven.  Definition 49 (Hyperplane).

Let α ∈ IR, p ∈ IRn, and p 6= 0. A hyperplane H in IRn is defined by H = {x ∈ IRn | p>x = α}. It also defines two closed halfspaces H+ = {x ∈ IRn | p>x ≥ α} and H− = {x ∈ IRn | > + n > − n p x ≤ α} and two open halfspaces H0 = {x ∈ IR | p x > α} and H0 = {x ∈ IR | p>x < α}.

Alternatively, for x¯ ∈ H : p>x¯ = α and H = {x ∈ IRn | p>x = p>x¯} = {x ∈ IRn | p>(x − x¯) = + − + − 0}. Similarly, we may define H ,H ,H0 ,H0 alternatively. Definition 50 (Separating hyperplanes).

n > Let S1,S2 ⊂ IR and S1 6= ∅, S2 6= ∅. Define H = {x | p x = α}. Then we say that:

> > H separates S1 and S2 ⇔ ∃i ∈ {1, 2} :(∀x ∈ Si : p x ≥ α) ∧ (∀x ∈ S3−i : p x ≤ α).

If in addition S1 ∪ S2 6⊂ H then H is said to properly separate S1 and S2.

> H strictly separates S1 and S2 ⇔ ∃i ∈ {1, 2} :(∀x ∈ Si : p x > α) ∧ (∀x ∈ S3−i p>x < α).

> H strongly separates S1 and S2 ⇔ ∃ ∈ {1, 2} ∃ε > 0 : (∀x ∈ Si : p x ≥ α + ε) ∧ > (∀x ∈ S3−i : p x ≤ α).

Theorem 51 (Separation of point and convex set).

Let S ⊂ IRn,S 6= ∅, S be closed and convex, and y 6∈ S ⇒ ∃p ∈ IRn 6= 0, ∃α ∈ IR : ∀x ∈ S : p>x ≤ α ∧ p>y > α.

Proof: By Theorem 45 ∃x¯ ∈ argmin{ky−xk | x ∈ S} and ∀x ∈ S :(y−x)>(x−x¯) = (x−x¯)>(y−x¯) ≤ 0. We define p = y − x¯, α = (y − x¯)>x¯, and we get p>x ≤ α. Then p>y − α = p>y − (y − x¯)>x¯ = > > > 2 (y − x¯) y − (y − x¯) x¯ = (y − x¯) (y − x¯) = ky − x¯k > 0 because y 6∈ cl S, x¯ ∈ cl S, and y 6= x¯. 

19 Exercise 52.

Using figures and examples in IR2, show the importance of assumptions in separating theorems.

Corollary 53 (Intersection of halfspaces).

Let S ⊂ IRn be closed and convex ⇒ S is the intersection of halfspaces containing S.

Proof: S ⊂ ∩H+ trivially true. To show S ⊃ ∩H+, we use a contradiction expecting y 6∈ S, y ∈ ∩H+. Then by Theorem 51 ∃H hyperplane that strongly separates S (: S ⊂ H+) and y (y 6∈ H+). So, the contradiction is obtained.  Corollary 54.

Let S ⊂ IRn and S 6= ∅, y 6∈ cl conv S ⇒ ∃H hyperplane strongly separating S and y.

Proof: Use cl conv S instead of S and apply Theorem 51.  Remark 55 (Equivalent conclusions).

Theorem 51 conclusion is equivalent to:

1. ∃H hyperplane strictly separating S and y.

2. ∃H hyperplane strongly separating S and y.

3. ∃p ∈ IRn : p>y > sup{p>x | x ∈ S}

4. ∃p ∈ IRn : p>y < inf{p>x | x ∈ S}

Exercise 56.

Explain Theorem 51 and the following Corollaries and Remarks drawing figures. Hints: Think about the use of the graphical solution of NLP and LP. Compare sup an p>x¯ values.

Theorem 57 (Farkas).

m×n n > A ∈ IR is a matrix, c ∈ IR is a vector. Denote S1 = {x | Ax ≤ 0 ∧ c x > 0} and > S2 = {y | A y = c ∧ y ≥ 0} ⇒ (S1 6= ∅ ∧ S2 = ∅)∨ (S1 = ∅ ∧ S2 6= ∅).

Farkas’ theorem is extensively used in the derivation of optimality conditions of LP and NLP. It says that either S1 or S2 is nonempty (empty).

20 > Proof: Assume that S2 6= ∅. Then ∃y ≥ 0 : A y = c. We get x such that Ax ≤ 0. (It exists as x may be chosen 0.) Then c>x = (A>y)x = y>Ax. Because y ≥ 0 and Ax ≥ 0 then y>Ax ≤ 0, and hence, > c x 6> 0 then S1 = ∅. m > Assume S2 = ∅. It means S = {x | ∃y ∈ IR : x = A y, y ≥ 0} is closed, convex, and c 6∈ S. By Theorem 51 ∃p ∈ IRn, α ∈ IR : p>c > α and ∀x ∈ S : p>x ≤ α. We know that 0 ∈ S and then α ≥ 0. As α ≥ p>x = p>A>y = y>Ap. As y ≥ 0 then Ap must be ≤ 0 (otherwise we obtain contradiction). n > So, we have found p ∈ IR such that c p > α and Ap ≤ 0. So p ∈ S1, and hence, S1 6= ∅.  Exercise 58.

Set m = 2 and n = 3. Choose suitable values for A and c. Denote columns of A as 2 vectors a1, a2, a3. Draw two figures. The first one, describing set S1 in IR and the 2 second describing set S2 indirectly, drawing {Ay | y ≥ 0} in IR .

Farkas’ theorem also remains valid for redefined S1,S2:

Corollary 59 (Gordon’s Theorem).

> Let A be an m×n matrix. Then either S1 = {x | Ax < 0} or S2 = {y | A y = 0, y ≥ 0, y 6= 0} is nonempty.

n Proof: S1 can be written as S1 = {x ∈ IR , s ∈ IR | Ax + 1s ≤ 0} where 1 is a vector of m ones. We > > may rewrite it in the form suitable for S1 in Theorem 57. We get {x, s | (A, 1)(x , s) ≤ 0} and for > > > > > en+1 = (0,..., 0, 1) we have en+1(x , s) > 0. Then S2 is specified by (A, 1) y = en+1 and y ≥ 0. > > That is A y = 0, 1 y = 1 and y ≥ 0. This is equivalent to S2 description.  Corollary 60 (Alternatives to Gordon’s Theorem).

Let A be an m × n matrix (A ∈ IRm×n), B be an l × n matrix (B ∈ IRl×n), and c be an n vector (c ∈ IRn). Then:

> > 1. Either S1 = {x | Ax ≤ 0, x ≥ 0, c x > 0} or S2 = {y | A y ≥ c, y ≥ 0} is nonempty.

> > > > > 2. Either S1 = {x | Ax ≤ 0, Bx = 0, c x > 0} or S2 = {(y , z ) | A y + B z = c, y ≥ 0} is nonempty.

> > Proof: To prove 1.: Introduce slack variables to get equalities for S2 (replace A by (A − I)). > > > > To prove 2.: Define z = z1 − z2 where z1 ≥ 0, z2 ≥ 0 in S2 and replace A by (A , B , −B ). 

1.3.4 Supporting hyperplanes

Definition 61 (Supporting hyperplane).

Let S ⊂ IRn, S 6= ∅, x¯ ∈ ∂S. H = {x | p>(x − x¯) = 0} is called a supporting hyperplane of S at x¯ ⇔ S ⊂ H+ ∨ S ⊂ H−. (Without reference to x¯, we may add cl S ∩ H 6= ∅.) If S 6⊂ H then H is a proper supporting hyperplane of S at x¯.

21 It is equivalent to p>x¯ = inf{p>x | x ∈ S} or p>x¯ = sup{p>x | x ∈ S}.

Theorem 62 (Existence f supporting hyperplane).

S ⊂ IRn, S 6= ∅, S is convex, x¯ ∈ ∂S ⇒ ∃H supporting S at x¯ (∃p 6= 0 : ∀x ∈ cl S : p>(x − x¯) ≤ 0)

Proof: x¯ ∈ ∂S ⇒ ∃{yk} a sequence such that yk ∈ cl S, yk −→ x¯. ∀yk ∃pk, kpkk = 1 such that > > ∀x ∈ cl S : pk yk > pk x. {pk} is bounded, so exists K ⊂ IN such that {pk}k∈K has a limit p and > > > > kpk = 1. As ∀k ∈ K : pk yk > pk x then pk (x − yk) < 0. For k −→ ∞ we get p (x − x¯) ≤ 0.  Corollary 63.

Let S ⊂ IRn, S 6= ∅, and:

1. S is convex, x¯ 6∈ int S ⇒ ∃p ∈ IRn, p 6= ∅ : ∀x ∈ cl S : p>(x − x¯) ≤ 0.

2. y 6∈ int conv S ⇒ ∃H separating S and y.

3. x¯ ∈ ∂S ∪ ∂conv S ⇒ ∃H that supports S at x¯.

Exercise 64.

Illustrate Corollary 63 by drawing figures in IR2.

Theorem 65 (Separation of 2 convex sets).

n Let S1,S2 ⊂ IR , S1 6= ∅, S1 6= ∅ be convex sets, S1 ∩ S2 = ∅ ⇒ ∃H hyperplane > > separating S1 and S2, i.e. ∃p 6= 0 : inf{p x | x ∈ S1} ≥ sup{p x | x ∈ S2}.

Exercise 66.

Draw an explanatory figure for Theorem 65. Compare with visualization of nonlinear programs.

Proof: S := S1 S2 = {x1 − x2 | x1 ∈ S1, x2 ∈ S2}. As S1 and S2 are convex ⇒ S convex. Because > S1 ∩ S2 = ∅ ⇒ 0 6∈ S. By Corollary 63.1 ∃p 6= 0, ∀x ∈ S : p x ≥ 0, hence ∀x1 ∈ S1, ∀x2 ∈ S2 : > > p x1 ≥ p x2.  Remark 67.

Gordan’s theorem may be derived either from Farkas’ theorem or from separating theorems.

22 Corollary 68.

n Let S1,S2 ⊂ IR , S1 6= ∅, S1 6= ∅, and

1. S1, S2 be convex, int S2 6= ∅, S1 ∩ int S2 = ∅ ⇒ ∃H hyperplane separating S1 and S2.

2. int conv Si 6= ∅, i = 1, 2 and int conv S1 ∩ int conv S2 = ∅ ⇒ ∃H separating S1 and S2.

Exercise 69.

Draw illustrating figures for Corollary. Why assumption of 2. are important? Hint: Thing about two straight lines and their intersection!

Proof: To prove 1. replace S2 with int S2 in Theorem 65. Similarly prove 2.  Theorem 70 (Strong separation).

Let S1, S2 be closed convex and S1 is bounded. If S1 ∩ S2 = ∅ ⇒ ∃H hyperplane strongly separating S1 and S2.

Proof: Main idea: Put S = S1 S2, show that S is closed, take 0 6∈ S and strongly separate 0 and S.  Corollary 71.

n Let S1,S2 ⊂ IR , S1 6= ∅, S1 6= ∅, S1 bounded, cl conv S1 ∩ cl conv S2 = ∅ ⇒ ∃H hyperplane strongly separating S1 and S2.

Exercise 72.

Why S1 is assumed to be bounded in Theorem 70? Remove this assumption and create a figure in IR2 showing why the conclusion of theorem does not remain valid.

1.3.5 Convex cones and polarity

Definition 73 (Cone).

C ⊂ IRn is called a cone with vertex 0 if x ∈ C ⇒ ∀λ ≥ 0 : λx ∈ C.

Definition 74 (Polar cone).

Let S ⊂ IRn, S 6= ∅ then the polar cone of S (polS or S∗) is given by polS = {p | ∀x ∈ S : p>x ≤ 0}.

23 If S = ∅ then S∗ will be interpreted as IRn. Exercise 75.

Draw figures in IR2 with convex, nonconvex, and polar cones.

Lemma 76.

n Let S1,S2,S3 ⊂ IR be nonempty sets: 1. S∗ is a closed convex cone.

2. S ⊂ (S∗)∗ (we also write (S∗)∗ = S∗∗ = polpolS).

∗ ∗ 3. S1 ⊂ S2 ⇒ S2 ⊂ S1 .

Theorem 77 (Polar cone property).

Let C 6= ∅ be a closed convex cone ⇒ C = C∗∗.

Proof: From Lemma 76: C ⊂ C∗∗. Let x ∈ C∗∗ and suppose by contradiction that x 6∈ C. By previous Theorem 51 (separation of set and point) ∃p 6= 0, ∃α : ∀y ∈ C (also 0 ∈ C, 0 ≤ α): p>y ≤ α and p>x > α ≥ 0. Assume p 6∈ C∗ : ∃y¯ ∈ C : p>y¯ > 0 and also ∀λ ≥ 0 : p>(λy¯) ≥ 0 and p ∈ C∗. As λ −→ ∞ ⇒ p>(λy¯) ≥ 0 −→ ∞. Because we have shown ∀y : p>y ≤ α, we see that also ∀λ ≥ 0 : p>(λy¯) ≤ α. It gives the contradiction. Since x ∈ C∗∗ := {u | ∀v ∈ C∗ :u>v ≤ 0} then p>x ≤ 0 but it > is a contradiction with p x ≥ 0 and so x ∈ C.  Remark 78.

Farkas’ theorem can be proven using polar cones. Define C := {A>y | y ≥ 0}, which is a closed convex cone (explain why) and see that C∗ = {x | Ax ≤ 0}. Therefore c ∈ C ⇔ c ∈ C∗∗ (by theorem) where (c = Ay) and x ∈ C∗ ⇒ c>x ≤ 0.

1.3.6 Polyhedral sets, extreme points, extreme directions

Definition 79 (Polyhedral set).

S ⊂ IRn is called a polyhedral set if it is the intersection of a finite number of closed n k > halfspaces, i.e. ∃k ∈ IN, ∃αi, ∃pi ∈ IR , pi 6= 0, i = 1, . . . , k : S = ∩i=1{x | pi x ≤ αi}

Remark 80.

(1) Every polytope is a polyhedral set. (2) Bounded polyhedral set is a polytope. (3) The feasible set of LP is a polyhedral set. (4) The convex feasible set of NLP may be contained in a polyhedral set.

24 Definition 81 (Extreme point).

Let S ⊂ IRn, S 6= ∅ be convex. x ∈ S is called an extreme point (EP) of S ⇔ ∀x1, x2 ∈ S, ∀λ ∈ (0, 1) : x = λx − 1 + (1 − λ)x2 ⇒ x = x1 = x2.

Exercise 82.

Draw several figures illustrating the concept of an extreme point (EP). Compare it with BFS – Basic Feasible Solution in LP.

Notice that for a compact convex set, any its point can be represented as a convex combination of its extreme points.

Theorem 83 (Characterization of EPs).

Let S = {x ∈ IRn | Ax = b, x ≥ 0} (LP in the standard form), r(A) = m (rank of A, otherwise we may throw away any redundant rows). Then, x is an extreme point of S ⇔ A can be decomposed by rearranging columns of A (and rows of x respectively) in such a way that A = (B, N) (Ax = BxB + NxN = b) satisfying:

 x   B−1b  x = B = xN 0

and B−1b ≥ 0.

Proof: We have x defined by theorem formula. ⇒ We want to show x ∈ S and x is an EP. It means to show that Ax = b and x ≥ 0. Let x = λx1 + (1 − λ)x2 for x1, x2 ∈ S and λ ∈ (0, 1). So we have:

     −1      x1B x2B B b x1B x2B x1 = x2 = then = λ + (1 − λ) . x1N x2N 0 x1N x2N

−1 So B b = λx1B + (1 − λ)x2B and x1N = x2N = 0. We have x1 ∈ S ⇒ Ax1 = b ⇒ Bx1B = b ⇒ −1 −1 x1B = B b and x2 ∈ S ⇒ Ax2 = b ⇒ Bx2B = b ⇒ x2B = B b. So, x = x1 = x2, and hence, x is an EP. > Conversely, x is an EP. We suppose (without losing generality) x = (x1, . . . , xk, 0,..., 0) where x1 > 0, . . . , xk > 0. We discuss whether a1,..., ak columns of A are linearly independent. If they are > Pk dependent then ∃λ1, . . . , λk (λ = (λ1, . . . , λk, 0,..., 0) 6= 0): j=1 λjaj = 0. We construct x1 = x+αλ and x2 = x − αλ such that x1, x2 ≥ 0 (choosing α > 0 without loss of generality). So:

Pk Ax1 = Ax + αAλ = Ax + j=1 λjaj = b ⇒ x1 ∈ S Pk Ax2 = Ax − αAλ = Ax − j=1 λjaj = b ⇒ x2 ∈ S.

1 Because α > 0, λ 6= 0 : x1 6= x2 and x = 2 (x1 + x2) ⇒ then x is not an EP ⇒ contradiction. So, a1,..., ak are linearly independent. Assume that r(A) = m. Then k ≤ m. If k = m then done, otherwise (k < m) ⇒ ∃m − k columns of A (of last n − k columns) that with k columns form a linearly −1 independent set of m vectors. After reordering of columns, we get A = (B, N) and xB = B b ≥ 0. 

25 Exercise 84.

For own LP with 4 variables and 2 equality constraints, compute several extreme points. Compare with the graphical solution (for the original 2 variables and 2 inequalities).

Exercise 85.

Discuss types of nonlinear programs whose feasible sets have extreme points. Hint: Draw figures.

Corollary 86 (Finite number of EPs).

The number of extreme points of S = {x | Ax = b, x ≥ 0} is finite and less or equal  n  to (remember that A ∈ IRm×n). m

Exercise 87.

Prove Corollary, giving the upper bound of number of extreme points.

Theorem 88 (Existence of EP).

Let S = {x | Ax = b, x ≥ 0}= 6 ∅, r(A) = m ⇒ S has at least one EP.

> Proof: Let x ∈ S, x = (x1, . . . , xk, 0,..., 0) . If a1,..., ak are linearly independent then k ≤ m and x > Pk is an extreme point. Otherwise ∃λ1, . . . , λk (λ = (λ1, . . . , λk, 0,..., 0) 6= 0 ⇒ ∃λi > 0) j=1 λjaj = 0. xj xi 0 0 We define: α = min1≤j≤k{ : λj > 0} = > 0. Considerx : x = xj − αλj for j = 1, . . . , k and λj λi j 0 0 0 Pn 0 xj = 0 for j = k + 1, . . . , n. Then xj ≥ 0, j ∈ {1, . . . , k}\{i} and xj = 0 elsewhere. Also: j=1 ajxj = Pk Pk Pk 0 j=1(xj − αλj) = j=1 ajxj − α j=1 ajλj = b and x has a reduced number of positive components (k-1). Repeat procedure till reaching the linear independence. Then EP will be found.  Exercise 89.

Build an example showing that for S = {x | Ax = b} (nonnegativity constraint x ≥ 0 omitted) EP does not necessarily exists.

Definition 90 (Direction).

Let S ⊂ IRn, S 6= ∅, closed, convex. Then d ∈ IRn, d 6= 0 is called a (recession) direction of S ⇔ ∀x ∈ S, ∀λ ≥ 0 : x + λd ∈ S.

26 Definition 91 (Distinct directions).

Let d1 and d2 be two directions of S. They are called distinct directions of S ⇔ ∀α > 0 : d1 6= αd2

Definition 92 (Extreme direction).

A direction d of S is called an extreme direction (ED) of S ⇔ (∀d1, d2 directions of S, ∀λ1, λ2 > 0: d = λ1d1 + λ2d2 ⇒ ∃α > 0 : d1 = αd2).

Exercise 93.

Illustrate directions (of S, distinct, and extreme) by figures in IR2.

Theorem 94 (Characterization of EDs).

Let S = {x ∈ IRn | Ax = b, x ≥ 0}= 6 ∅, r(A) = m. Then, d¯ is an extreme direction of S ⇔ A can be decomposed by rearranging columns of A (and rows of x respectively) in such a way that A = (B, N) (Ax = BxB +NxN = b) and ∃aj column of N, ∃α > 0 satisfying:

 d   −B−1a  d¯ = α B = α j dN ej

−1 and B aj ≤ 0 where ej = (eij)i=1,...,n−m such that eij = 1 for i = j and eij = 0 for i = 1, . . . , n − m, i 6= j.

Proof: We derive it for d instead fo d¯. With x ∈ S, d given in theorem, λ ≥ 0 :

 −1  −B aj A(x + λd) = Ax + λAd = b + λ(B, N) = b + λ(−aj + aj) = b. ej

−1 Notice that Ad = 0 (so A(x + λd) = Ax = b) and B aj ≤ 0 ⇒ d ≥ 0 ⇒ x + λd ≥ 0. So d is a direction of S. Now, we show that d is an extreme direction of S. Let d1 and d2 be directions and assume d = λ1d1+ = λ2d2 for λ1 > 0 and λ2 > 0. We know that at least n − (m + 1) components of d ≥ 0 are necessarily zeroes. Therefore, the same is valid for d1 and d2 corresponding components. Then:   diB di = αi i = 1, 2, Ad1 = Ad2 = 0 ⇒ BdiB + NdiN = 0 ⇒ diN

−1 ¯ Hence, BdiB = −Nej = −aj diB = B aj ⇒ d = d1 = d2, so d is an ED, and therefore, d is an ED. Conversely: Suppose d¯ is an ED and derive a formula above. We may assume

¯ ¯ ¯ ¯ > d = (d1,..., dk, 0,..., 0, dj, 0,..., 0)

. We claim that a1,..., ak are linearly independent. We show it by a contradiction: ⇒ ∃λ1, . . . , λk, > Pk ¯ ¯ λ = (λ1, . . . , λk) 6= 0 : i=1 λiai = 0. We choose α > 0 small that d1 = d + αλ, d2 = d − αλ

27 ¯ P and d1, d2 ≥ 0. Note that Ad1 = Ad + α i λiai = 0 and Ad2 = 0. Together they are distinct ¯ ¯ directions and d = (d1 +d2)/2, so it is a contradiction with d is an ED. Therefore, a1,..., ak are linearly independent. As r(A) = m, we know k ≤ m. If k < m then we may add m − k vectors ai such that i ∈ {k + 1, . . . , n}\{j} to form B matrix (with r(B) = m by a suitable selection of columns). So,  −1  ¯ ¯ ¯ ¯ ¯ −1 ¯ ¯ −B aj ¯ ¯ Ad = 0 = BdB + ajdj and dB = −djB aj and d = dj and also with d ≥ 0, dj > 0 ⇒ ej −1 −B aj ≤ 0.  Exercise 95.

For own LP with 4 variables and 2 equality constraints, compute one extreme direction. Hint: Draw a figure first!

Corollary 96 (Finite number of EDs).

 n  The number of extreme directions of S is finite and less or equal to (n − m) m where A ∈ IRm×n.

 n  Proof: For each of choices of B from A, there are (n − m) ways to extract a from matrix N. m j  Theorem 97 (Existence of EDs).

Let S = {x ∈ IRn | Ax = b, x ≥ 0} 6= ∅, r(A) = m. S has at least one extreme direction ⇔ S is unbounded.

Proof: ⇒: It is valid from the definition of ED. ⇐ Contradiction S is unbounded and there is no extreme direction. From the following theorem: ∀x ∈ S : P P P P x = j λjxj+ j µjdj = j λjxj (because of no extreme direction). Then kxk = k j λjxjk ≤ P Pj Pj (Schwarz inequality) j kλjxjk = (because λj ≥ 0 and λj ≤ 1) i=1 λjkxjk ≤ i=1 kxjk = K ∈ IR. Therefore, ∃K ∈ IR : ∀x ∈ S : kxk < K. It leads to the contradiction with unboundedness of S assumption. Hence, S has at least one extreme direction.  The following representation theorem says: Any point of LP standard form feasible set can be represented as a sum of a convex combination of EPs of S and nonnegative linear combination of EDs of S.

Theorem 98 (Representation theorem).

n Let S ⊂ IR , S 6= ∅, S = {x | Ax = b, x ≥ 0}, r(A) = m. There are x1,..., xk all EPs (extreme points) of S and d1,..., dl are all EDs (extreme directions) of S. Then:

x ∈ S ⇔ ∃µj ≥ 0, j = 1, . . . , l, ∃λj ≥ 0, j = 1, . . . , k : Pk Pk Pl j=1 λj = 1 ∧ x = j=1 λjxj + j=1 µjdj.

28 Exercise 99.

Draw a figure in IR2 for unbounded polyhedral S to illustrate the theorem.

Pk Pl Proof of Theorem 98: Let M := { j=1 λjxj + j=1 µjdj | µj ≥ 0, j = 1, . . . , l, λj ≥ 0, j = 1, . . . , k}. M is a closed (because of definition), convex (prove directly), and nonempty (S 6= ∅ ⇒ ∃ EP by Theorem 88), and M ⊂ S (S is polyhedral ⇒ ∩ halfspaces ⇒ convex, so it contains convex combinations with nonnegative linear combinations of extreme directions from M by definition). We want to show S ⊂ M. By contradiction: z ∈ S ∧ z 6∈ M. Then it is possible to separate z from M (by Theorem > > k Pl 51) by a hyperplane: ∃α ∈ IR, ∃p 6= 0 : p z > α and p (sumj=1λjxj + j=1 µjdj) for each element > k Pl of M. Then, dj may arbitrarily be enlarged, so with µj ≥ 0 and p (sumj=1λjxj + j=1 µjdj) ≤ α > requirement, we get: ∀j : p dj ≤ 0. When λj, µj are set to zeroes and only one λi = 1 then we obtain > > > > > > ∀j : p xj ≤ α < p z. We choose the “tightest" xj by p x¯ = max1≤j≤k p xj, also p x¯ ≤ α < p z.  B−1b  x¯ is an EP: x¯ = ≥ 0. Further, we assume B−1b > 0 (nondegenerate case) without loss of 0 −1 −1 generality. z ∈ S ⇒ Az = b, z ≥ 0, BzB + NzN = b and so: zB = B b − B NzN . We know    −1  > > > > > > zB > > B b > −1 −1 p x¯ < p z, so 0 < p z − p x¯ = (pB, pN ) − (pB, pN ) = pB(B b − B NzN )+ zN 0 > > −1 > > −1 > −1 p zN − p B b = (p − p B N)zN ⇒ ∃j : zj > 0 and pj − p B aj > 0. We check the sign of N B N B  B −1 −1 −yj B aj denoting yj = B aj ≤ 0. We assume yj ≤ 0 then ≥ 0 is an ED of S by Theorem ej   > > > −yj 94. Earlier, we noticed that p dj ≤ 0, for EDs. So, (pB, pN ) ≤ 0 and we get contradiction ej > −1 (previously pj − pBB aj > 0). So yj 6≤ 0 and we can construct:

 b¯   −y  x = + λ j , b¯ = B−1b 0 ej

¯ ¯ bi ¯ br and λ = min1≤i≤m{ | yij > 0}. So, λ > 0 (because of nondegeneracy bi > 0), λ = (rth yij yrj component), x ≥ 0 by definition (has at most m positive components and rth drops to zero and jth is −1 −1 −1 −1 λ). Then Ax = B(B b − λB aj)+ Nλej = Ax = B(B b − λB aj)+ λaj = b ⇒ x ∈ S. Then a1,..., ar−1, ar+1,..., am, aj are linearly independent. (As Byj = aj and we take out yrj 6= 0, if yrj = 0  ¯  > > > b − λyj > ¯ > then linear dependence occur), hence x is an EP. p x = (pB, pN ) = pBb−λpByj +λpj = λej > > −1 > p x¯ + λ(pj − pBB aj) > p x¯ So, this is a contradiction with the assumption that we took the nearest EP, and hence, z ∈ S.  Exercise 100.

For the given simple S and x ∈ S express x by Representation Theorem.

29 1.4 Linear programming — revisited

1.4.1 Assumptions, standard form, and full enumeration Remark 101.

At the beginning, review important ideas related to linear programs:

• Modelling requirements: linearity (cf. NLP), certainty (cf. stochastic program- ming), and divisibility (cf. integer programming). • Simple examples (n = 2) are solvable graphically. It gives ideas to the general case. • Software is widely available: “hidden solvers" (e.g., MS Excel). • There are successful applications of large-scale problems (105 − −108 variables). • Mathematical formulations may differ but theory and algorithm is developed for LP in the standard form:

? ∈ argmin{c>x | Ax = b, x ≥ 0}.

Additional assumptions further used are r(A) = m (nondegeneracy) and S 6= ∅.

Remark 102 (Standard form of LP).

Any LP can be put in the standard form:

• Minimization:

argmin{−f(x) | x ∈ C} = argmax{f(x) | x ∈ C} and − min{−f(x) | x ∈ C} = max{f(x) | x ∈ C}

• Slacks: x¯ ∈ {x | a>x ≤ b} ⇔ (x¯>, s)> ∈ {(x>, s)> | a>x + s = b, s ≥ 0} and x¯ ∈ {x | a>x ≥ b} ⇔ (x¯>, s)> ∈ {(x>, s)> | a>x − s = b, s ≥ 0} • x¯ ∈ C ⇔ ∃x¯+, x¯− ≥ 0 : x¯ = x¯+ − x¯− ∈ C

Algorithm 103 (Full enumeration).

1. Check whether there is an ED dj (Notice that it needs to check all EDs!) decreas- > ing the value of the objective, i.e. ∃j = 1, . . . , l : c dj < 0. 2. If “yes" then there is no optimal solution because of unboundedness of the objective function on the feasible region. Otherwise, compute and compare values of the objective for all EPs. Then the minimum xmin is identified by the smallest value > of c xmin.

Although this procedure works theoretically, it does not work practically because of a huge amount of EDs and EPs to be checked.

30 1.4.2 Linear programming theory

Theorem 104 (Optimality conditions in LP).

We have a linear program (LP) in the standard form. We denote S = {x | Ax = b, x ≥ > 0} and also D = argminx{c x | Ax = b, x ≥ 0}. We assume r(A) = m, S 6= ∅, and that all EPs and EDs are listed: x1,..., xk are all EPs of S and d1 ..., dl are all EDs > of S. Then: D 6= ∅ ⇔ ∀j ∈ {1 . . . , l} : c dj ≥ 0. In addition if D 6= ∅ then ∃xi ∈ D that is an EP of S.

Proof: We use Theorem 98: ∀x ∈ IRⁿ : x ∈ S ⇔ ∃λj ≥ 0, j = 1, …, k, with Σ_{j=1}^k λj = 1, and ∃µj ≥ 0, j = 1, …, l, such that x = Σ_{j=1}^k λj xj + Σ_{j=1}^l µj dj. So we may rewrite the LP using this representation of x ∈ S as follows:

? ∈ argmin_{λ,µ} { c>(Σ_{j=1}^k λj xj + Σ_{j=1}^l µj dj) | λj ≥ 0, j = 1, …, k, Σ_{j=1}^k λj = 1, µj ≥ 0, j = 1, …, l }.

If c>dj < 0 for some j, then with µj → ∞ (and the remaining µj equal to 0) the objective function is unbounded from below; this proves necessity. For sufficiency: if ∀j : c>dj ≥ 0 then, because we minimize, we set ∀j : µj = 0. Then we need to solve

? ∈ argmin_λ { c> Σ_{j=1}^k λj xj | λj ≥ 0, j = 1, …, k, Σ_{j=1}^k λj = 1 }.

We can do it by setting λi = 1 for i such that min{c>xj | j = 1, …, k} = c>xi (the remaining λj are zeroes). □ The theorem explains the full enumeration algorithm, but because of its inefficiency it is useless in practice. We need something more sophisticated (like "go down and follow edges").

1.4.3 Simplex method

The idea is to use x_{n+1} := x_n + λ_n d_n, where x_n is the approximate optimal solution after n iterations, λ_n is the n-th step size, and d_n is the n-th direction along which we move from x_n to x_{n+1}. The step should be chosen so that it improves the objective function value, i.e. c>x_{n+1} ≤ c>x_n.

Remark 105 (LP notation).

We have a variable x ∈ IRⁿ, constant vectors c ∈ IRⁿ and b ∈ IRᵐ, and a matrix A ∈ IRᵐˣⁿ. We have the linear program in the standard form, with the feasible set S = {x | Ax = b, x ≥ 0} and the set of optimal solutions D = argmin_x{c>x | Ax = b, x ≥ 0}. We assume that A = (B, N) and B⁻¹ exists. We call the matrix B a basis. The related EP (B⁻¹b, 0)> ≥ 0 is also called a Basic Feasible Solution (BFS). We denote by JB the set of indices of the B columns in A and by JN the set of indices of the N columns in A. Rewriting the LP using B and N, we get: D = argmin_{xB,xN} {cB>xB + cN>xN | BxB + NxN = b, xB ≥ 0, xN ≥ 0}.

Remark 106 (Related concepts).

Remember that w> = cB>B⁻¹ are the so-called simplex multipliers, cf. the dual problems of LP. Denoting (zj)j = cB>B⁻¹N, we have the optimality condition in the form ∀j ∈ JN : zj − cj ≤ 0. The differences cj − zj (sometimes zj − cj) are called reduced costs.

Theorem 107 (Optimality test).

Let x̄ = (B⁻¹b, 0)> ≥ 0 be an EP of S. Assume that the nondegeneracy condition is satisfied, i.e. B⁻¹b > 0. Then: x̄ ∈ D ⇔ −cN> + cB>B⁻¹N ≤ 0>.

Proof: Let x̄ = (x̄B>, x̄N>)> = (b̄>, 0>)> = ((B⁻¹b)>, 0>)> ≥ 0 be an EP. Compute the objective function value: c>x̄ = (cB>, cN>)(x̄B>, x̄N>)> = cB>B⁻¹b. Let x ∈ S, so it satisfies x = (xB>, xN>)> ≥ 0 and Ax = BxB + NxN = b. Hence xB = B⁻¹(b − NxN) = x̄B − B⁻¹NxN. We compute for the new solution x: c>x = cB>xB + cN>xN = cB>B⁻¹(b − NxN) + cN>xN = cB>B⁻¹b − (cB>B⁻¹N − cN>)xN = c>x̄ − (zj − cj)j> xN. So with ∀j : zj − cj ≤ 0 the solution x̄ cannot be improved. □

Theorem 108 (Improving direction).

Let x̄ = (B⁻¹b, 0)> ≥ 0 be the given EP and B⁻¹b > 0. We also assume that ∃k ∈ JN : zk − ck > 0. Then either there is an EP x (improved BFS) such that c>x < c>x̄, or for every xk ≥ 0 there is x ∈ S with xk as its k-th component satisfying c>x < c>x̄ (unbounded case).

Proof: Let k ∈ JN : zk − ck > 0, i.e. cB>B⁻¹N − cN> ≰ 0>. We see that zk = cB>B⁻¹ak. (Why? Hint: visualize the multiplication of the matrices.) We set ∀j ∈ JN \ {k} : xj := 0 (but xk ≠ 0), so c>x = c>x̄ − (cB>B⁻¹ak − ck)xk = c>x̄ − (zk − ck)xk. (How big an xk to choose?) We choose the biggest feasible xk to get the biggest improvement of the objective function value. Therefore, xB = B⁻¹(b − NxN) ≥ 0 must hold. So we may rewrite the nonnegativity constraints as xB = x̄B − B⁻¹NxN = b̄ − B⁻¹ak xk ≥ 0 and we search for max{xk | B⁻¹ak xk ≤ b̄}. Denote yk = B⁻¹ak for simplicity. Then we find the maximum as follows: if yk ≤ 0 then no constraint imposes an upper bound, hence xk → ∞ and the unbounded case is proven. Otherwise yk ≰ 0 ⇒ λ = b̄r/yrk = min{b̄i/yik : yik > 0} and xk = λ. Then (for the local index i related to the column of B corresponding to index j of the column of the original A): ∀j ∈ JB : xj = xBi = b̄i − λyik, i = 1, …, m (and xBi = 0 for i = r), xk = b̄r/yrk, and xj = 0 for j ∈ JN \ {k}. The new solution is an EP formed from the updated B that has linearly independent columns aj, j ∈ JB \ {r}, together with the column ak. It is the basis and EP adjacent to the previous one. The requirement c>x < c>x̄ is also satisfied. □

Corollary 109 (Finite convergence).

Assuming LP nondegeneracy, the algorithm described in the proof converges in a finite number of steps.

Proof: The number of EPs is finite, and the strict decrease of the objective function value excludes the possibility of passing twice through any EP. □

Algorithm 110 (Simplex algorithm — matrix form).

0. Initialization: Solve

? ∈ argmin_x{c>x | Ax = b, x ≥ 0}, r(A) = m.

Denote J = {1, …, n}. Assume that the initial basis is given: J = JB ∪ JN, JB ∩ JN = ∅, |JB| = m, |JN| = n − m. Therefore, we have B = (aj)_{j∈JB}, r(B) = m, N = (aj)_{j∈JN}, and cB>, cN>.

1. Solve the system of linear equations BxB = b efficiently to get xB = B⁻¹b. Set xN = 0 and compute z = cB>xB.
2. Solve w>B = cB> efficiently to get w> = cB>B⁻¹. Compute ∀j ∈ JN : zj − cj = w>aj − cj. Search for zk − ck = max{zj − cj | j ∈ JN}. If zk − ck ≤ 0 then STOP: the optimum was reached.
3. Solve Byk = ak efficiently to get yk = B⁻¹ak. If yk ≤ 0 then STOP because of unboundedness; the recession direction is specified by (−yk>, ek>)>.
4. Solve b̄r/yrk = min_i{ b̄i/yik : yik > 0 }. Then: JB := JB \ {r} ∪ {k}, JN := JN \ {k} ∪ {r}, update B := (aj)_{j∈JB}, N := (aj)_{j∈JN}, and GO TO 1.
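Algorithm 110 translates almost step by step into numpy. The sketch below assumes a starting feasible basis JB is known (here the slack basis of the earlier illustrative data) and uses Dantzig's rule; it is an illustration, not a production implementation.

```python
# A direct transcription of Algorithm 110 (simplex, matrix form).
# Assumes a starting feasible basis JB is supplied; illustrative only.
import numpy as np

def simplex(A, b, c, JB, max_iter=100):
    m, n = A.shape
    JB = list(JB)
    for _ in range(max_iter):
        JN = [j for j in range(n) if j not in JB]
        B, N = A[:, JB], A[:, JN]
        xB = np.linalg.solve(B, b)                  # step 1: B xB = b
        w = np.linalg.solve(B.T, c[JB])             # step 2: w^T B = cB^T
        red = w @ N - c[JN]                         # z_j - c_j, j in JN
        k_loc = int(np.argmax(red))                 # Dantzig's rule
        if red[k_loc] <= 1e-9:                      # optimality test
            x = np.zeros(n); x[JB] = xB
            return x, c @ x
        k = JN[k_loc]
        yk = np.linalg.solve(B, A[:, k])            # step 3: B yk = ak
        if np.all(yk <= 1e-9):
            raise ValueError("LP is unbounded")
        ratios = [xB[i] / yk[i] if yk[i] > 1e-9 else np.inf
                  for i in range(m)]
        r = int(np.argmin(ratios))                  # step 4: ratio test
        JB[r] = k                                   # basis exchange
    raise RuntimeError("iteration limit reached")

A = np.array([[1.0, 1.0, 1.0, 0.0],
              [1.0, 3.0, 0.0, 1.0]])
b = np.array([4.0, 6.0]); c = np.array([-2.0, -3.0, 0.0, 0.0])
print(simplex(A, b, c, JB=[2, 3]))  # start from the slack basis
```

The two linear solves per iteration (with B and B>) are exactly the systems named in steps 1–3 of the algorithm.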

Exercise 111.

What is the most important numerical procedure for the simplex method?

Remark 112.

We expressed the new x using x̄, λ, and d. So, x_{n+1} := x_n + λ_n d_n, where

(xB>, xN>)> := (b̄>, 0>)> + xk(−yk>, ek>)> = ((B⁻¹b)>, 0>)> + xk(−(B⁻¹ak)>, ek>)>.

Remember that x_n is feasible, so b̄ ≥ 0 (or b̄ > 0 for the nondegenerate case). Further, xk = b̄r/yrk > 0. If yk ≤ 0 then we get a descent ED. For the objective function, we get:

c>x = c>x̄ + xk (cB>, cN>)(−(B⁻¹ak)>, ek>)>.

To decrease the objective function value, (cB>, cN>)(−(B⁻¹ak)>, ek>)> < 0 must be satisfied.

Remark 113 (Simplex Tableau).

What is behind the simplex tableau? We rewrite the LP as follows:

? ∈ argmin{z | z − c>x = 0, Ax = b, x ≥ 0}

Then we may use the tabular form often used for the Gauss elimination method for solving systems of linear equations. The sequence of simplex tables, derived by basic equivalent transformations applied to matrix blocks, follows:

( 1  −c>   | 0 )     ( 1  −cB>  −cN>  | 0 )     ( 1   −cB>    −cN>   |   0   )
( 0   A    | b )  ∼  ( 0   B     N    | b )  ∼  ( 0  B⁻¹B    B⁻¹N   | B⁻¹b  )  ∼

( 1  −cB>  −cN>  |   0   )     ( 1  cB>−cB>  cB>B⁻¹N−cN>  | cB>B⁻¹b )     ( 1  0>  cB>B⁻¹N−cN>  | cB>B⁻¹b )
( 0   I    B⁻¹N  | B⁻¹b  )  ∼  ( 0     I        B⁻¹N      |  B⁻¹b   )  =  ( 0  I      B⁻¹N      |  B⁻¹b   )

We see that the final table contains the information suitable for the simplex algorithm and the choice of k (see the first row) and r (see the last column and the k-th column containing yk). In the last column, we find the information about the current xB = B⁻¹b, as we know that xB = x̄B − B⁻¹NxN = b̄ − (yj)_{j∈JN} xN and xN = 0. In the first row, we find the optimality condition term and the objective function value cB>B⁻¹b, as z = c>x = cB>B⁻¹b − (cB>B⁻¹N − cN>)xN = z̄ − (zj − cj)j> xN.

Remark 114 (Important ideas).

• Simple lower and upper bounds l ≤ x ≤ u can be implemented in a special way and need not be treated as ordinary constraints. • An initial basic feasible solution can be found by the Two-Phase Method, which begins with the solution of the modified first-phase program (b ≥ 0): ? ∈ argmin{1>x_s | Ax + x_s = b, x ≥ 0, x_s ≥ 0}. When the first-phase program is solved, the original one is either detected as infeasible (for zmin > 0) or (for zmin = 0) the initial basis and JB are obtained for the original program. • The identification of all optimal solutions requires finding all EPs and EDs of the polyhedral set D. There are backtracking procedures based on the spanning tree search modified for LP.

• Instead of Dantzig's rule (used in Algorithm 110), identifying the index k by zk − ck = max{zj − cj | j ∈ JN}, other rules for the choice of the 'key column' can be used (see also the Devex rule and column generation techniques for large-scale problems). • Cycling (stalling) may be caused by degeneracy. There are theoretical lexicographic and Bland's rules avoiding it. In practice, rounding errors may help. For integer network flow programs, the network simplex method has been developed. • Regarding memory requirements: for sparse matrix structures, the revised simplex method (also in the product form of the inverse) has been developed. • The number of required operations per iteration can be expressed by a polynomial function of the program size. • For special structures, decomposition approaches may help. • There is a difference between the theoretical (worst-case) overall computational complexity, which is exponential (the number of iterations required for a problem with n variables might be 2ⁿ), and practical experience showing that for typical programs even 2m–3m iterations are enough (m is the number of constraints). • There are also polynomial algorithms for linear programs. The first polynomial algorithm, Khachiyan's, had theoretical importance. The subsequent Karmarkar's algorithm led to the development of the class of efficient interior point methods for huge linear programs. • The previous formulas lead to basic sensitivity results

∂xBl/∂xNj = −ylj,  ∂z/∂xNj = cj − zj,  ∂xBl/∂bi = (B⁻¹)li,  ∂z/∂bi = wi

where (B⁻¹)li represents the (l, i) element of B⁻¹. • LP duality will be revisited with NLP later.

Exercise 115.

Compute typical examples from linear programming reviewing: the standard form, two-phase method, and simplex tableau computations.

1.5 Convex functions

1.5.1 Basic properties

Definition 116 (Convex function).

Let S ⊂ IRⁿ, S ≠ ∅ be a convex set, f : S −→ IR. The function f is said to be convex on S ⇔ ∀x1, x2 ∈ S, ∀λ ∈ (0, 1) : f(λx1 + (1 − λ)x2) ≤ λf(x1) + (1 − λ)f(x2). The function f is said to be strictly convex on S ⇔ ∀x1, x2 ∈ S, x1 ≠ x2, ∀λ ∈ (0, 1) : f(λx1 + (1 − λ)x2) < λf(x1) + (1 − λ)f(x2). The function f is said to be concave (strictly concave) on S ⇔ −f is convex (strictly convex) on S.

Exercise 117.

How does Definition 116 change when λ ∈ (0, 1) is replaced by λ ∈ [0, 1]?

Exercise 118.

Draw figures of different real functions of one real variable (convex, concave, etc.).

Remark 119 (Geometric interpretation).

For a convex function f, the value of f at points of the line segment λx1 + (1 − λ)x2 is less than or equal to the height of the chord joining [x1, f(x1)] and [x2, f(x2)].

Remark 120.

f is both convex and concave ⇔ f is affine (linear). Note that f may be non-convex over S but convex over a convex subset of S.

Exercise 121.

Why must a convex function be defined on a convex set? Explain, using a simple example!

Definition 122 (Level set).

Sα = {x ∈ S | f(x) ≤ α} is called an α-level set.

Theorem 123 (Convexity of a level set).

Let S ⊂ IRⁿ, S ≠ ∅ be a convex set and f : S −→ IR a convex function ⇒ ∀α ∈ IR : Sα is a convex set.

Proof: Let x1, x2 ∈ Sα ⇒ x1, x2 ∈ S as Sα ⊂ S. By Definition 122, f(x1) ≤ α and f(x2) ≤ α. We choose λ ∈ (0, 1) and we have x = λx1 + (1 − λ)x2 ∈ S. Because of convexity of f: f(x) ≤ λf(x1) + (1 − λ)f(x2) ≤ λα + (1 − λ)α = α ⇒ f(x) ≤ α ⇒ x ∈ Sα. □

Exercise 124.

Why is Theorem 123 very important for nonlinear programming? Hint: Think about the NLP constraint gi(x) ≤ 0 where gi : S −→ IR, and then C = {x ∈ IRⁿ | gi(x) ≤ 0, i = 1, …, m} = ∩i Ci, where Ci = {x ∈ IRⁿ | gi(x) ≤ 0}.

Theorem 125 (Continuity of a convex function).

Let S ⊂ IRⁿ, S ≠ ∅ be a convex set and f : S −→ IR a convex function ⇒ f is continuous on int S.

Proof: Let x̄ ∈ int S; we need to show ∀ε > 0 ∃δ > 0 : ‖x − x̄‖ ≤ δ ⇒ |f(x) − f(x̄)| ≤ ε. Because x̄ ∈ int S, for the given ε > 0 there is δ′ > 0 such that ‖x − x̄‖ ≤ δ′ ⇒ x ∈ S. We define θ := max_{1≤i≤n}{max{f(x̄ + δ′ei) − f(x̄), f(x̄ − δ′ei) − f(x̄)}}, i.e. the maximum increase along a coordinate direction. Because of convexity, 0 ≤ θ < ∞ (finite and nonnegative). Set δ := min(δ′/n, εδ′/(nθ)). Choose x : ‖x − x̄‖ ≤ δ. If xi − x̄i ≥ 0 let zi = δ′ei, otherwise zi = −δ′ei. Then x − x̄ = Σ_{i=1}^n αi zi, where ∀i : αi ≥ 0, and ‖x − x̄‖ = δ′(Σ_{i=1}^n αi²)^{1/2}. Because ‖x − x̄‖ ≤ δ, we get (Σ αi²)^{1/2} ≤ min{1/n, ε/(θn)}, and in particular ∀i : αi ≤ 1/n, so 0 ≤ nαi ≤ 1. Then f(x) = f(x̄ + Σ_{i=1}^n αi zi) = f((1/n) Σ_{i=1}^n (x̄ + nαi zi)) ≤ (convexity) (1/n) Σ_{i=1}^n f(x̄ + nαi zi) = (1/n) Σ_{i=1}^n f((1 − nαi)x̄ + nαi(x̄ + zi)) ≤ (convexity) (1/n) Σ_{i=1}^n ((1 − nαi)f(x̄) + nαi f(x̄ + zi)). Then f(x) − f(x̄) ≤ Σ_{i=1}^n αi(f(x̄ + zi) − f(x̄)) ≤ θ Σ_{i=1}^n αi ≤ θ · n · ε/(nθ) = ε, and upper semicontinuity is proven. We also need f(x̄) − f(x) ≤ ε to have |f(x) − f(x̄)| ≤ ε. Take y := 2x̄ − x (so ‖y − x̄‖ ≤ δ). Then x̄ = (y + x)/2 and f(x̄) ≤ (f(y) + f(x))/2 (convexity). Then f(y) ≥ 2f(x̄) − f(x), hence f(x̄) − f(x) ≤ f(y) − f(x̄) ≤ ε. □

Why is Theorem 125 useful for NLP? Hint: Think about the assumptions of Weierstrass' Theorem 12!

Exercise 127.

Illustrate Theorem 125 to show that discontinuity on the boundary is allowed. Hint: For example, use y = x² on [−1, 1] and modify its value at a boundary point.

Definition 128 (Directional derivative).

Let S ⊂ IRⁿ, S ≠ ∅, f : S −→ IR, x̄ ∈ S, and d ≠ 0 such that ∃λ₀ > 0 ∀λ ∈ (0, λ₀) : x̄ + λd ∈ S. Then we define the directional derivative of f at x̄ along d as:

f′(x̄; d) := lim_{λ→0⁺} (f(x̄ + λd) − f(x̄))/λ.

Lemma 129 (Existence of a directional derivative).

Let f : IRⁿ −→ IR be a convex function. Then ∀x̄ ∈ IRⁿ ∀d ∈ IRⁿ, d ≠ 0 : ∃f′(x̄; d).

Proof: Let λ₂ > λ₁ > 0. Then f(x̄ + λ₁d) = f((λ₁/λ₂)(x̄ + λ₂d) + (1 − λ₁/λ₂)x̄) ≤ (λ₁/λ₂)f(x̄ + λ₂d) + (1 − λ₁/λ₂)f(x̄), and hence (f(x̄ + λ₁d) − f(x̄))/λ₁ ≤ (f(x̄ + λ₂d) − f(x̄))/λ₂. So the difference quotient is monotone non-increasing as λ → 0⁺. Furthermore, ∀λ > 0 : f(x̄) = f((λ/(1+λ))(x̄ − d) + (1/(1+λ))(x̄ + λd)) ≤ (λ/(1+λ))f(x̄ − d) + (1/(1+λ))f(x̄ + λd) ⇒ (1 + λ)f(x̄) ≤ λf(x̄ − d) + f(x̄ + λd) ⇒ (f(x̄ + λd) − f(x̄))/λ ≥ f(x̄) − f(x̄ − d). For the given x̄, d, the right-hand side is constant, so the difference quotient is bounded from below. From theorems on limits of monotone functions, we conclude that the limit exists and:

lim_{λ→0⁺} (f(x̄ + λd) − f(x̄))/λ = inf_{λ>0} (f(x̄ + λd) − f(x̄))/λ. □

Remark 130.

Let f be convex on a convex S, d as in Definition 128, x̄ ∈ int S ⇒ ∃f′(x̄; d). If x̄ ∈ ∂S, then f′(x̄; d) might be −∞ (also for continuous f — illustrate by a figure in IR²).

Definition 131 (Graph).

Let S ⊂ IRn, f : S −→ IR. Then {(x>, f(x))> | x ∈ S} ⊂ IRn+1 is called a graph of f.

Definition 132 (Epigraph).

S ⊂ IRⁿ, S ≠ ∅, f : S −→ IR. Then epi f := {(x>, y)> | x ∈ S, y ∈ IR, y ≥ f(x)} is called the epigraph of f.

Definition 133 (Hypograph).

S ⊂ IRⁿ, S ≠ ∅, f : S −→ IR. Then hyp f := {(x>, y)> | x ∈ S, y ∈ IR, y ≤ f(x)} is called the hypograph of f.

Theorem 134 (Convexity of an epigraph).

Let S ⊂ IRⁿ, S ≠ ∅ be a convex set. Then f : S −→ IR is a convex function ⇔ epi f is a convex set.

Remark 135.

Let S ⊂ IRⁿ, S ≠ ∅ be a convex set. Then f : S −→ IR is a concave function ⇔ hyp f is a convex set.

Proof of Theorem 134: ⇒: Let f be convex and (x1>, y1)>, (x2>, y2)> ∈ epi f (so x1, x2 ∈ S, y1 ≥ f(x1), y2 ≥ f(x2)). Take λ ∈ (0, 1) arbitrarily: λy1 + (1 − λ)y2 ≥ λf(x1) + (1 − λ)f(x2) ≥ f(λx1 + (1 − λ)x2). So, for x = λx1 + (1 − λ)x2 and y = λy1 + (1 − λ)y2 we have y ≥ f(x), hence (x>, y)> ∈ epi f and epi f is convex.
⇐: Conversely, assume epi f is convex and S convex. Let x1, x2 ∈ S; then (x1>, f(x1))>, (x2>, f(x2))> ∈ epi f. By the assumed convexity of epi f: ∀λ ∈ (0, 1) : (λx1> + (1 − λ)x2>, λf(x1) + (1 − λ)f(x2))> ∈ epi f. By Definition 132 this means λf(x1) + (1 − λ)f(x2) ≥ f(λx1 + (1 − λ)x2), so f is convex. □

Remark 136 (Usefulness of epigraph).

When f(x) is nondifferentiable, the solution of ? ∈ argmin{f(x) | x ∈ S} can be obtained by solving the modified program ? ∈ argmin{z | z ≥ f(x), x ∈ S}, where the epigraph of f appears in the description of the feasibility region.

Exercise 137.

Solve the following regression problem: for the points (xj, yj), j = 1, 2, 3: (1, 1), (2, 4), (3, 2), find a straight-line formula y = βx minimizing max{|yj − βxj| : j = 1, 2, 3}. Hint: Use the criterion epigraph to reformulate the problem as an LP. Draw a figure!
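One possible numerical check of the exercise, assuming scipy is available: the epigraph variable z ≥ |yj − βxj| turns the min-max fit into the LP min{z | βxj − z ≤ yj, −βxj − z ≤ −yj}.

```python
# Exercise 137 as an LP via the epigraph variable z >= |y_j - beta*x_j|.
# Variables: v = (beta, z); minimize z. Sketch assuming scipy is available.
import numpy as np
from scipy.optimize import linprog

xs = np.array([1.0, 2.0, 3.0])
ys = np.array([1.0, 4.0, 2.0])

# beta*x_j - z <= y_j   and   -beta*x_j - z <= -y_j
A_ub = np.vstack([np.column_stack([xs, -np.ones(3)]),
                  np.column_stack([-xs, -np.ones(3)])])
b_ub = np.concatenate([ys, -ys])
cost = np.array([0.0, 1.0])                 # minimize z only

res = linprog(cost, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None), (0, None)])
print("beta =", res.x[0], " max residual z =", res.x[1])
```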

Definition 138 (Subgradient).

Let S ⊂ IRⁿ, S ≠ ∅ be convex and f : S −→ IR convex. Then ξ ∈ IRⁿ is called a subgradient of f at x̄ ∈ S ⇔ ∀x ∈ S : f(x) ≥ f(x̄) + ξ>(x − x̄).

A subgradient for a concave function is defined using ≤ in Definition 138.

Definition 139 (Subdifferential).

The set of all subgradients of f at x¯ is called a subdifferential of f at x¯.

Exercise 140.

Draw a subgradient for S ⊂ IR and S ⊂ IR², and f : S −→ IR.

Remark 141.

Recognize that since epi f is a convex set, there exists a supporting hyperplane, which is described by a subgradient. f(x̄) + ξ>(x − x̄) corresponds to a supporting hyperplane of epi f at (x̄>, f(x̄))>. Then ξ identifies its slope.

Exercise 142.

Find a subgradient and draw a figure for y = |x|, x¯ = 0. Hint: Find that ξ ∈ [−1, 1] as y = 0 + ξ(x − 0).
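A quick numerical confirmation of the hint: every ξ ∈ [−1, 1] gives an affine minorant of y = |x| at x̄ = 0, i.e. satisfies the subgradient inequality of Definition 138.

```python
# Verify that y = 0 + xi*(x - 0) underestimates |x| for xi in [-1, 1],
# i.e. every such xi is a subgradient of |x| at 0 (Definition 138).
import numpy as np

xs = np.linspace(-2.0, 2.0, 401)
for xi in np.linspace(-1.0, 1.0, 9):
    assert np.all(np.abs(xs) >= xi * xs - 1e-12)
print("all xi in [-1, 1] satisfy the subgradient inequality at 0")
```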

Exercise 143.

Develop your own examples in IR² and draw contour graphs for x̄, x ∈ S and ξ ∈ IR².

Theorem 144 (Existence of subdifferential).

Let S ⊂ IRⁿ, S ≠ ∅ be convex and f : S −→ IR be convex ⇒ ∀x̄ ∈ int S ∃ξ : H = {(x>, y)> | y = f(x̄) + ξ>(x − x̄)} supports epi f at (x̄>, f(x̄))>. In particular, ∀x ∈ S : f(x) ≥ f(x̄) + ξ>(x − x̄), so ξ is a subgradient of f at x̄, i.e. the subdifferential of f is nonempty at interior points of S.

Proof: By Theorem 134, epi f is convex. Let (x̄>, f(x̄))> ∈ ∂ epi f. So ∃(ξ₀>, µ)> ∈ IR^{n+1}, (ξ₀>, µ)> ≠ 0, such that ∀(x>, y)> ∈ epi f : ξ₀>(x − x̄) + µ(y − f(x̄)) ≤ 0 (by Theorem 62 about the supporting hyperplane). If µ > 0, then choosing y large enough gives a contradiction. If µ = 0 ⇒ ∀x ∈ S : ξ₀>(x − x̄) ≤ 0. Since x̄ ∈ int S, ∃λ > 0 such that x̄ + λξ₀ ∈ S (we may move in an arbitrary direction, even ξ₀). So ξ₀>(x̄ + λξ₀ − x̄) = λξ₀>ξ₀ = λ‖ξ₀‖² ≤ 0. Because λ > 0, then ξ₀ = 0, which contradicts (ξ₀>, µ)> ≠ 0. Then µ < 0, and we have (ξ₀>/|µ|)(x − x̄) + (µ/|µ|)(y − f(x̄)) = ξ>(x − x̄) + f(x̄) − y ≤ 0. By letting y = f(x) and ξ = ξ₀/|µ| we get the conclusion of the theorem. □

Corollary 145.

Let S ⊂ IRⁿ, S ≠ ∅ be convex and f : S −→ IR a strictly convex function. Then ∀x̄ ∈ int S ∃ξ ∈ IRⁿ : ∀x ∈ S, x ≠ x̄ : f(x) > f(x̄) + ξ>(x − x̄).

Proof: By Theorem 144, ∃ξ ∀x ∈ S : f(x) ≥ f(x̄) + ξ>(x − x̄). By contradiction, ∃x̂ ≠ x̄ : f(x̂) = f(x̄) + ξ>(x̂ − x̄). By strict convexity of f, for λ ∈ (0, 1) : f(λx̄ + (1 − λ)x̂) < λf(x̄) + (1 − λ)f(x̂) = λf(x̄) + (1 − λ)(f(x̄) + ξ>(x̂ − x̄)) = f(x̄) + (1 − λ)ξ>(x̂ − x̄). But for x = λx̄ + (1 − λ)x̂, the theorem gives f(λx̄ + (1 − λ)x̂) ≥ f(x̄) + ξ>(λx̄ + (1 − λ)x̂ − x̄) = f(x̄) + (1 − λ)ξ>(x̂ − x̄), which is a contradiction. □

The converse implication does not hold in general: the existence of subgradients at all x̄ ∈ int S does not imply that f is convex on S (however, it does imply convexity on int S).

Theorem 147 (Convexity implied by subgradient).

S ⊂ IRⁿ, S ≠ ∅ convex, f : S −→ IR, ∀x̄ ∈ int S : ∃ξ ∀x ∈ S : f(x) ≥ f(x̄) + ξ>(x − x̄) ⇒ f is convex on int S.

Proof: Let x1, x2 ∈ int S, λ ∈ (0, 1). The interior of a convex set is convex, so x̄ = λx1 + (1 − λ)x2 ∈ int S ⊂ S ⇒ ∃ξ of f at x̄, so: For x1: f(x1) ≥ f(x̄) + ξ>(x1 − λx1 − (1 − λ)x2) = f(λx1 + (1 − λ)x2) + (1 − λ)ξ>(x1 − x2). For x2: f(x2) ≥ f(λx1 + (1 − λ)x2) + ξ>(x2 − λx1 − (1 − λ)x2) = f(λx1 + (1 − λ)x2) + λξ>(x2 − x1). Then, multiplying the inequalities by λ and 1 − λ and summing them, we obtain λf(x1) + (1 − λ)f(x2) ≥ (λ + 1 − λ)f(λx1 + (1 − λ)x2) + 0. So, f is convex on int S. □

1.5.2 Differentiable convex functions

Definition 148 (Gradient, differentiable function at point).

S ⊂ IRⁿ, S ≠ ∅, f : S −→ IR. Then f is said to be differentiable at x̄ ∈ int S ⇔ ∃∇f(x̄) ∈ IRⁿ (the gradient of f at x̄) and a function α(x̄; · − x̄) : IRⁿ −→ IR such that ∀x ∈ S : f(x) = f(x̄) + ∇f(x̄)>(x − x̄) + ‖x − x̄‖α(x̄; x − x̄), where lim_{x→x̄} α(x̄; x − x̄) = 0.

Compare the expression with the first-order Taylor expansion, the first-order Taylor approximation, and the remainder term (α(x̄; x − x̄)).

Exercise 149.

Draw contour graphs in IR2 illustrating the gradient concept.

Definition 150 (Differentiable function on set).

f is said to be differentiable on the open set S′ ⊂ S if f is differentiable at x̄ for every x̄ ∈ S′.

Remark 151.

If f is differentiable at x̄ ⇒ ∃! ∇f(x̄) = (∂f(x̄)/∂xi)_{i=1,…,n}> = (f′_{xi}(x̄))_{i=1,…,n}>, the vector of partial derivatives of f at x̄.

Lemma 152 (Subgradient and gradient).

Let S ⊂ IRⁿ, S ≠ ∅ be convex and f : S −→ IR convex. Suppose that f is differentiable ∀x̄ ∈ int S. ⇒ ∀x̄ ∈ int S : |Ξ| = |{ξ | ξ is a subgradient of f at x̄}| = 1 and ξ = ∇f(x̄).

Proof: Ξ ≠ ∅ by Theorem 144. Let ξ ∈ Ξ; then ∀d ∃λ₀ > 0 ∀λ ∈ (0, λ₀) : f(x̄ + λd) ≥ f(x̄) + λξ>d, and by differentiability f(x̄ + λd) = f(x̄) + λ∇f(x̄)>d + λ‖d‖α(x̄; λd). Together 0 ≥ λ(ξ − ∇f(x̄))>d − λ‖d‖α(x̄; λd); multiplying by 1/λ, 0 ≥ (ξ − ∇f(x̄))>d − ‖d‖α(x̄; λd). With λ → 0⁺, α(x̄; λd) → 0, so 0 ≥ (ξ − ∇f(x̄))>d for any d, even for d = ξ − ∇f(x̄). So 0 ≥ ‖ξ − ∇f(x̄)‖² ⇒ ξ = ∇f(x̄). □

Theorem 153 (Convexity and existence of gradient).

Let S ⊂ IRn, S 6= ∅ be an open convex set, f : S −→ IR differentiable on S. Then: f is convex (strictly convex) on S ⇔ ∀x¯ ∈ S, ∀x ∈ S (for strictly convex add x 6= x¯): f(x) ≥ f(x¯) + ∇f(x¯)>(x − x¯) (for strictly convex >).

Proof: Directly follows from the subgradient Theorem 144. □

Remark 154 (Geometrical interpretations).

1. min{f(x) | x ∈ S} ≥ min{f(x̄) + ∇f(x̄)>(x − x̄) | x ∈ S} gives a lower bound on the optimal objective function value of the original program. This fact is useful for some optimization algorithms.

2. We may construct an outer linearization of the convex set S = {x | g(x) ≤ 0} as S′ = {x | gi(x̄) + ∇gi(x̄)>(x − x̄) ≤ 0, i = 1, …, m}. So, S ⊂ S′. 3. For f : IR −→ IR (differentiable): non-decreasing slope ⇔ f convex.

Exercise 155.

Illustrate Remark 154 by figures in IR2.

Theorem 156.

Let S ⊂ IRⁿ, S ≠ ∅ be a convex open set, f : S −→ IR differentiable on S. Then f is convex (strictly convex) on S ⇔ ∀x1, x2 ∈ S (x1 ≠ x2 for strictly convex f): (∇f(x2) − ∇f(x1))>(x2 − x1) ≥ 0 (for strictly convex: >).

Proof: Assume f convex and x1, x2 ∈ S; then f(x1) ≥ f(x2) + ∇f(x2)>(x1 − x2) and f(x2) ≥ f(x1) + ∇f(x1)>(x2 − x1). The sum of the inequalities leads to: f(x1) + f(x2) ≥ f(x1) + f(x2) + (∇f(x2) − ∇f(x1))>(x1 − x2). Hence, 0 ≤ (∇f(x2) − ∇f(x1))>(x2 − x1).
Conversely: for x1, x2 ∈ S, by the multivariate mean value theorem (f is differentiable, and so continuous) ∃λ ∈ (0, 1) and x = λx1 + (1 − λ)x2 such that f(x2) − f(x1) = ∇f(x)>(x2 − x1). The assumption gives (∇f(x) − ∇f(x1))>(x − x1) ≥ 0 ⇒ (1 − λ)(∇f(x) − ∇f(x1))>(x2 − x1) ≥ 0 because x − x1 = (1 − λ)(x2 − x1). Therefore −(1 − λ)∇f(x1)>(x2 − x1) + (1 − λ)∇f(x)>(x2 − x1) ≥ 0, and substituting ∇f(x)>(x2 − x1) = f(x2) − f(x1): −(1 − λ)∇f(x1)>(x2 − x1) + (1 − λ)(f(x2) − f(x1)) ≥ 0. Then (multiplying by 1/(1 − λ)) f(x2) ≥ f(x1) + ∇f(x1)>(x2 − x1), and so f is convex (similarly for strictly convex). □
All these results are useful for algorithms, but they do not give a practical test of the convexity of a given f.

1.5.3 Twice differentiable convex functions

Definition 157 (Twice differentiable function at point).

S ⊂ IRⁿ, S ≠ ∅, f : S −→ IR. Then f is said to be twice differentiable at x̄ ∈ int S ⇔ ∃∇f(x̄) ∈ IRⁿ, an n × n symmetric matrix H(x̄) (the Hessian), and α(x̄; · − x̄) : IRⁿ −→ IR such that ∀x ∈ S : f(x) = f(x̄) + ∇f(x̄)>(x − x̄) + ½(x − x̄)>H(x̄)(x − x̄) + ‖x − x̄‖²α(x̄; x − x̄), where lim_{x→x̄} α(x̄; x − x̄) = 0.

Definition 158 (Twice differentiable function on set).

f is said to be twice differentiable on the open set S′ ⊂ S if f is twice differentiable at x̄ for every x̄ ∈ S′.

Remark 159.

If f is twice differentiable at x̄ ⇒ ∃! H(x̄) = (∂²f(x̄)/∂xi∂xj)_{i,j=1,…,n} = (f″_{xi,xj}(x̄))_{i,j=1,…,n}, the matrix of the second partial derivatives of f at x̄. The above-mentioned second-order Taylor expansion can be rewritten in the summation-index form: f(x) = f(x̄) + Σ_{j=1}^n f′_{xj}(x̄)(xj − x̄j) + ½ Σ_{i=1}^n Σ_{j=1}^n (xi − x̄i) f″_{xi,xj}(x̄)(xj − x̄j) + ‖x − x̄‖²α(x̄; x − x̄).

Definition 160 (Definiteness of matrix).

Let D ∈ IRⁿˣⁿ be an n × n symmetric matrix. Then D is said to be positive definite (PD) ⇔ ∀x ∈ IRⁿ, x ≠ 0 : x>Dx > 0; positive semidefinite (PSD) ⇔ ∀x ∈ IRⁿ : x>Dx ≥ 0; negative definite (ND) ⇔ ∀x ∈ IRⁿ, x ≠ 0 : x>Dx < 0; and negative semidefinite (NSD) ⇔ ∀x ∈ IRⁿ : x>Dx ≤ 0.

Remark 161.

From linear algebra it is known that x>Dx = Σ_{i=1}^n λi yi², where the eigenvalues λi solve the equation |D − λI| = 0 and x = By, where the columns of B are eigenvectors bj ≠ 0 solving (D − λjI)bj = 0.

Theorem 162 (Definiteness and eigenvalues).

D is positive definite ⇔ ∀j : λj > 0, D is positive semidefinite ⇔ ∀j : λj ≥ 0, D is negative definite ⇔ ∀j : λj < 0, and D is negative semidefinite ⇔ ∀j : λj ≤ 0.

Exercise 163.

Use software (e.g., Matlab) to compute eigenvalues and check definiteness of some matrices.
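A possible numpy version of this exercise, classifying definiteness via the eigenvalue signs of Theorem 162; the sample matrices are illustrative.

```python
# Classify definiteness of a symmetric matrix from its eigenvalues
# (Theorem 162), as suggested in Exercise 163.
import numpy as np

def definiteness(D, tol=1e-10):
    lam = np.linalg.eigvalsh(D)          # eigenvalues of symmetric D
    if np.all(lam > tol):   return "PD"
    if np.all(lam >= -tol): return "PSD"
    if np.all(lam < -tol):  return "ND"
    if np.all(lam <= tol):  return "NSD"
    return "indefinite"

for D in (np.array([[2.0, 1.0], [1.0, 2.0]]),
          np.array([[1.0, 1.0], [1.0, 1.0]]),
          np.array([[0.0, 1.0], [1.0, 0.0]])):
    print(D.tolist(), "->", definiteness(D))
```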

Theorem 164 (Convexity and definiteness).

S ⊂ IRⁿ, S ≠ ∅ open convex, f : S −→ IR twice differentiable on S. Then: f is convex on S ⇔ its H(x̄) is PSD ∀x̄ ∈ S. f is strictly convex on S ⇐ its H(x̄) is PD ∀x̄ ∈ S. f is strictly convex on S ⇒ its H(x̄) is PSD ∀x̄ ∈ S. f is quadratic and strictly convex on S ⇒ its H(x̄) is PD ∀x̄ ∈ S.

Proof: ⇒: f convex, x̄ ∈ S. We need to show ∀x ∈ IRⁿ : x>H(x̄)x ≥ 0. Since S is open ⇒ ∃λ₀ such that ∀x ∈ IRⁿ ∀λ ∈ [−λ₀, λ₀], λ ≠ 0 : x̄ + λx ∈ S. We already know: f(x̄ + λx) = f(x̄) + λ∇f(x̄)>x + ½λ²x>H(x̄)x + λ²‖x‖²α(x̄; λx). Subtracting the gradient inequality f(x̄ + λx) ≥ f(x̄) + λ∇f(x̄)>x, we have 0 ≤ ½λ²x>H(x̄)x + λ²‖x‖²α(x̄; λx). We multiply the inequality by 1/λ², and with λ → 0 we conclude that H(x̄) is PSD.
⇐: ∀x̄ ∈ S : H(x̄) is PSD. Then for x and x̄, by the mean value theorem, we have: f(x) = f(x̄) + ∇f(x̄)>(x − x̄) + ½(x − x̄)>H(x̂)(x − x̄), where x̂ = λx̄ + (1 − λ)x for some λ ∈ (0, 1) ⇒ x̂ ∈ S. By assumption H(x̂) is PSD, so (x − x̄)>H(x̂)(x − x̄) ≥ 0 and f(x) ≥ f(x̄) + ∇f(x̄)>(x − x̄); then f is convex. □

The theorem is especially useful for quadratic functions, because H(x̄) is constant in this case and definiteness may be easily checked (cf. the use of Matlab computations of eigenvalues, or symbolic computations in the general case).

Theorem 166 (Univariate function convexity).

S ⊂ IR, S ≠ ∅ open convex, f : S −→ IR infinitely differentiable. Then f is strictly convex on S ⇔ ∀x̄ ∈ S ∃k ∈ IN : f^(2k)(x̄) > 0 while f^(j)(x̄) = 0 ∀j ∈ {2, …, 2k − 1}.

Proof: See calculus (mathematical analysis). □

Theorem 167.

f : IRⁿ −→ IR, x̄ ∈ IRⁿ, d ∈ IRⁿ, d ≠ 0, and define F_{x̄,d}(λ) := f(x̄ + λd). Then f is (strictly) convex ⇔ F_{x̄,d} is (strictly) convex ∀x̄ ∈ IRⁿ, d ∈ IRⁿ, d ≠ 0.

The proof is omitted. The theorem may be used (examining f via univariate cross sections F_{x̄,d}) in algorithms.

Lemma 168.

Let H = ( a b ; b c ) be a symmetric 2 × 2 matrix. Then H is PSD ⇔ a ≥ 0, c ≥ 0, ac − b² ≥ 0.

Proof: ⇒: By Definition 160 of PSD: d>Hd ≥ 0 for all d = (d1, d2)>, so ad1² + 2bd1d2 + cd2² ≥ 0 ⇒ a ≥ 0, c ≥ 0 (otherwise set either d2 = 0 or d1 = 0 and make the remaining component large to get a negative value). Also a = 0 ⇒ b = 0 (otherwise choose |d1| large with a suitable sign) ⇒ ac − b² = 0. With a > 0, completing the square, we get: d>Hd = a(d1 + (b/a)d2)² + d2²(ac − b²)/a ⇒ ac − b² ≥ 0, otherwise with d1 = −(b/a)d2 the form would be negative.
⇐: a ≥ 0, c ≥ 0, ac − b² ≥ 0. Assuming a = 0 ⇒ b = 0 ⇒ d>Hd = cd2² ≥ 0. If a > 0 then d>Hd = a(d1 + (b/a)d2)² + d2²(ac − b²)/a ≥ 0. PD similarly. □

Exercise 169.

From Lemma 168, we get for ND: a < 0, c < 0, ac − b² > 0. Why? Hint: A matrix D is NSD (ND) ⇔ −D is PSD (PD).

Exercise 170.

How to deal with a non-symmetric H in d>Hd? Hint: Note that we may write d>Hd = d>H>d = ½d>(H + H>)d.

Theorem 171 (Checking definiteness).

Let H = (hij) be a symmetric n × n matrix:

(a) If ∃i ∈ {1, …, n} : hii ≤ 0, then H is not PD. If hii < 0, then H is not PSD. (b) If ∃i ∈ {1, …, n} : hii = 0, then either ∀j ∈ {1, …, n} : hij = hji = 0 or H is not PSD. (c) If n = 1: H is PSD (PD) ⇔ h11 ≥ 0 (h11 > 0). Otherwise, for n ≥ 2, write H = ( h11 q> ; q G ) (by (b), h11 = 0 ⇒ q = 0 to keep PSD). For h11 > 0 as a pivot, perform the elementary Gauss-Jordan operation with the first row, obtaining ( h11 q> ; 0 Gnew ), where Gnew is a symmetric (n − 1) × (n − 1) matrix and H is PSD ⇔ Gnew is PSD. (Moreover, for h11 > 0: H is PD ⇔ Gnew is PD.)

Proof: (a) d>Hd = di²hii for d = (0, …, 0, di, 0, …, 0)>, and the claim follows.
(b) hii = 0, hij ≠ 0 ⇒ choose dk = 0 for k ≠ i, k ≠ j ⇒ d>Hd = 2hij di dj + hjj dj² ⇒ take |di| large with a suitable sign and get a contradiction.
(c) n = 1: trivial. For n ≥ 2 write d = (d1, δ>)>. If h11 = 0 then q = 0, G = Gnew, and d>Hd = δ>Gδ, and the theorem is valid. If h11 > 0 then d>Hd = d1²h11 + 2d1(q>δ) + δ>Gδ. By Gauss-Jordan, we have Gnew = G − (1/h11)qq>. It is a symmetric matrix, because it is a difference of symmetric matrices. Therefore, d>Hd = δ>Gnew δ + h11(d1 + q>δ/h11)², and the theorem is valid because of the nonnegativity of the second term in the sum. Similarly for PD. □

Algorithm 172 (Gauss-Jordan).

1. Scan the diagonal elements of H (cf. (a) and (b) of Theorem 171). 2. Perform the Gauss-Jordan reduction to obtain Gnew by (c). 3. Take Gnew instead of H and GO TO 1, until n = 1 (if n = 1, apply (c)).
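A recursive sketch of Algorithm 172 for the PSD test, following rule (c) of Theorem 171 with Gnew = G − qq>/h11; the floating-point tolerance handling is an added assumption.

```python
# PSD test by the Gauss-Jordan reduction of Theorem 171 / Algorithm 172.
# H must be symmetric; the recursion reduces the order by one per step.
import numpy as np

def is_psd(H, tol=1e-12):
    H = np.asarray(H, dtype=float)
    n = H.shape[0]
    h11, q, G = H[0, 0], H[1:, 0], H[1:, 1:]
    if h11 < -tol:
        return False                       # rule (a)
    if abs(h11) <= tol:                    # rule (b): zero diagonal pivot
        return np.all(np.abs(q) <= tol) and (n == 1 or is_psd(G, tol))
    if n == 1:
        return True                        # rule (c), base case
    G_new = G - np.outer(q, q) / h11       # rule (c), pivot reduction
    return is_psd(G_new, tol)

print(is_psd([[2.0, -1.0], [-1.0, 2.0]]))   # True
print(is_psd([[1.0, 2.0], [2.0, 1.0]]))     # False (ac - b^2 < 0)
```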

Remark 173.

Each pass through steps 1–3 takes m arithmetic operations and comparisons, where m ≤ Kn² for a certain K, and there are at most n passes; the total number of operations is therefore M ≤ Kn³, i.e. O(n³). The algorithm has polynomial complexity.

Exercise 174.

Compute your own numerical examples.

Corollary 175.

H is an n × n symmetric matrix: H is PD ⇔ H is PSD and nonsingular (∃H⁻¹, r(H) = n).

Proof: ⇒: H is PD ⇒ Gauss-Jordan reduction gives the upper triangular matrix with > 0 in diagonal ⇒ r(H) = n. ⇐: H is PSD and nonsingular, so r(H) = n ⇒ Gauss-Jordan gives the upper triangular matrix, diagonal ≥ 0 but > 0 because of the full rank. 

Remark 176 (Convexity of composed functions).

1. Let f1, …, fk : IRⁿ −→ IR be convex. Then (a) ∀αj > 0, j = 1, …, k : Σ_{j=1}^k αj fj(x) is convex; (b) f(x) = max{f1(x), …, fk(x)} is convex. 2. Let g : IRⁿ −→ IR be a concave function and S = {x | g(x) > 0} (a convex set). Then f : S −→ IR defined by f(x) = 1/g(x) is convex. 3. Let h(x) = Ax + b be an affine function, h : IRⁿ −→ IRᵐ, and let g : IRᵐ −→ IR be convex. Then f(x) = g(h(x)) is a convex function.

4. Let g : IR −→ IR be a non-decreasing (univariate) convex function and let h : IRⁿ −→ IR be a convex function. Then f(x) = g(h(x)) is convex.

Let h1 : IR −→ IR and h2 : IR −→ IR be univariate functions. Then the composed function h2(h1) : IR −→ IR inherits convexity (concavity) from the functions h1 and h2 (see Figure 1).

Exercise 177.

Explain the conclusions of Remark 176. Try to prove them.

1.5.4 Minima and maxima of convex functions

Exercise 178.

Review previously introduced concepts: feasible solution; infimum, minimum; optimal solution (minimum, maximum); local or global optimal solution; alternative optimal solutions; strict local (global) optimal solution; isolated local (global) optimal solution. Draw own figures in IR2 to illustrate those various concepts.

Exercise 179.

For the following piecewise defined function f(x) on [0, 5), identify all optimal solutions: f(x) = 1 − x for x ∈ [0, 1], f(x) = x − 1 for x ∈ (1, 2], f(x) = 1 for x ∈ (2, 3], f(x) = 4 − x for x ∈ (3, 4), and f(x) = x − 3 for x ∈ [4, 5).
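A numerical hint for the exercise, assuming a fine grid scan is acceptable for locating candidate optima (the exact answer still requires inspecting the pieces analytically):

```python
# Evaluate the piecewise function of Exercise 179 on a fine grid and
# report the best values found; this is only a numerical hint.
import numpy as np

def f(x):
    if 0 <= x <= 1:  return 1 - x
    if 1 < x <= 2:   return x - 1
    if 2 < x <= 3:   return 1.0
    if 3 < x < 4:    return 4 - x
    if 4 <= x < 5:   return x - 3
    raise ValueError("outside [0, 5)")

grid = np.linspace(0.0, 4.999, 100000)
vals = np.array([f(x) for x in grid])
print("min value ~", vals.min(), "attained near x ~", grid[vals.argmin()])
print("sup value ~", vals.max(), "attained near x ~", grid[vals.argmax()])
```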

Theorem 180 (Convexity and global minimum).

Let S ⊂ IRⁿ, S ≠ ∅ be convex and f : S −→ IR convex on S. Let x̄ ∈ arglocmin_x{f(x) | x ∈ S}; then x̄ ∈ argglobmin_x{f(x) | x ∈ S}. If x̄ is a strict local minimum (x̄ ∈ argstrictlocmin_x{f(x) | x ∈ S}) or f is strictly convex on S, then x̄ is a unique global minimum of f on S (and an isolated local minimum).

Proof: Let x̄ ∈ arglocmin_x{f(x) | x ∈ S}. Then ∃Nε(x̄) : ∀x ∈ S ∩ Nε(x̄) : f(x) ≥ f(x̄). By contradiction, suppose ∃x̂ ∈ S : f(x̂) < f(x̄). By convexity of f: ∀λ ∈ (0, 1) : f(λx̂ + (1 − λ)x̄) ≤ λf(x̂) + (1 − λ)f(x̄) < λf(x̄) + (1 − λ)f(x̄) = f(x̄). But for λ > 0 small, λx̂ + (1 − λ)x̄ ∈ S ∩ Nε(x̄), and so f(λx̂ + (1 − λ)x̄) ≥ f(x̄), and we obtain a contradiction. So, x̄ ∈ argglobmin_x{f(x) | x ∈ S}.
Suppose x̄ is a strict local minimum. By contradiction, assume it is not unique: ∃x̂ ∈ S : f(x̄) = f(x̂). For xλ = λx̂ + (1 − λ)x̄, by convexity of f and S: xλ ∈ S and f(xλ) = f(λx̂ + (1 − λ)x̄) ≤ λf(x̂) + (1 − λ)f(x̄) = f(x̄). For λ → 0⁺ : xλ ∈ S ∩ Nε(x̄), a contradiction with strictness; so x̄ is a unique global and isolated minimum.

Finally, assume f is strictly convex and x̄ ∈ arglocmin_x{f(x) | x ∈ S}. Because strict convexity of f on S implies convexity of f on S, x̄ ∈ argglobmin_x{f(x) | x ∈ S} as above. By contradiction to uniqueness: ∃x ∈ S, x ≠ x̄ : f(x) = f(x̄); then f(½x + ½x̄) < (strict convexity) ½f(x) + ½f(x̄) = f(x̄), which contradicts x̄ ∈ argglobmin_x{f(x) | x ∈ S}. So it is unique and isolated. □

Theorem 181 (Convexity and subgradient).

Let S ⊂ IRⁿ, S ≠ ∅ be a convex set and f : IRⁿ −→ IR a convex function. Then x̄ ∈ argmin_x{f(x) | x ∈ S} ⇔ ∃ξ, a subgradient of f at x̄, such that ∀x ∈ S : ξ>(x − x̄) ≥ 0.

Proof: ⇐: ∀x ∈ S : ξ>(x − x̄) ≥ 0. For the subgradient ξ, ∀x ∈ S : f(x̄) + ξ>(x − x̄) ≤ f(x), so f(x̄) ≤ f(x), and hence x̄ is a minimum.
⇒ (Main idea): Let x̄ be a minimum; define M1 = {(x> − x̄>, y)> | x ∈ IRⁿ, y > f(x) − f(x̄)}, a convex set, and M2 = {(x> − x̄>, y)> | x ∈ S, y ≤ 0}, also a convex set. By the definitions of M1 and M2, we have M1 ∩ M2 = ∅; otherwise ∃(x> − x̄>, y)> with x ∈ S : 0 ≥ y > f(x) − f(x̄) (a contradiction with the minimization requirement). By the theorem about separation of convex sets, we may find a separating (even supporting) hyperplane for M1 and M2. The normal vector of this hyperplane specifies a subgradient ξ with the required property. □

Corollary 182.

If the assumptions of Theorem 181 are valid and in addition S is an open set, then: x̄ ∈ argmin_x{f(x) | x ∈ S} ⇔ ξ = 0 is a subgradient of f at x̄.

Proof: Because S is open, ∃λ₀ > 0 such that ∀λ ∈ (0, λ₀) : x̄ − λξ ∈ S ⇒ ξ>(x̄ − λξ − x̄) = −λ‖ξ‖² ≥ 0 ⇒ ξ = 0. □

Corollary 183.

If the assumptions of Theorem 181 are valid and f is differentiable, then: x̄ ∈ argmin_x{f(x) | x ∈ S} ⇔ ∀x ∈ S : ∇f(x̄)>(x − x̄) ≥ 0.

Corollary 184.

If the assumptions of Theorem 181 are valid, f is differentiable, and S is open, then: x̄ ∈ argmin_x{f(x) | x ∈ S} ⇔ ∇f(x̄) = 0.

Remark 185.

1. Let S ⊂ IRn, S 6= ∅ be open convex and f : S −→ IR convex differentiable. Then: x¯ ∈ argglobminx{f(x) | x ∈ S} ⇔ ∇f(x¯) = 0 (Necessary and sufficient condition).

2. If x̄ is non-optimal, then ∃x ∈ S with ∇f(x̄)>(x − x̄) < 0, and we may improve x̄ using the direction d = x − x̄ and the step λ₀ ∈ argmin_λ{f(x̄ + λd) | λ ≥ 0, x̄ + λd ∈ S} (this idea will introduce feasible direction methods).

3. Let x̄ + λ(x − x̄) be a feasible solution, so d = x − x̄ is a feasible direction and −∇f(x̄) is a descent direction. If the angle between −∇f(x̄) and d is less than the right angle, then we may improve x̄ to get a smaller objective function value.

4. If f is differentiable: f′(x̄; d) = ∇f(x̄)>d and f(x) ≥ f(x̄) + ∇f(x̄)>(x − x̄). If ∇f(x̄)>(x − x̄) < 0, then x̄ may be improved.

5. When x̄ ∈ int S : (∀d : f′(x̄; d) ≥ 0) ⇒ ∇f(x̄) = 0.

Exercise 186.

Illustrate Remark 185 by figures using contour graphs.

Theorem 187 (Alternative optimal solutions).

S ⊂ IRⁿ, S ≠ ∅ convex, f : S −→ IR convex twice differentiable. Denote S* = argmin_x{f(x) | x ∈ S} and let x̄ ∈ S*. Then: S* = {x ∈ S | ∇f(x̄)>(x − x̄) ≤ 0, ∇f(x) = ∇f(x̄)}.

Proof: Not required (see, e.g., Bazaraa-Sherali-Shetty pages 104-105). 

Corollary 188 (Properties of S∗).

1. S∗ = {x ∈ S | ∇f(x¯)>(x − x¯) = 0, ∇f(x) = ∇f(x¯)}.

2. In addition, suppose f is a quadratic function f(x) = ½x>Hx + c>x and S is a polyhedral set. Then S* is polyhedral and S* = {x ∈ S : c>(x − x̄) = 0, H(x − x̄) = 0}.

Theorem 189 (Convexity and maximum).

S ⊂ IRⁿ, S ≠ ∅ convex, f : IRⁿ −→ IR convex, x̄ ∈ arglocmax_x{f(x) | x ∈ S} ⇒ ∀x ∈ S, ∀ξ subgradients of f at x̄ : ξ>(x − x̄) ≤ 0.

Proof: Let x̄ ∈ arglocmax_x{f(x) | x ∈ S}; then ∃Nε(x̄) such that ∀x ∈ Nε(x̄) ∩ S : f(x) ≤ f(x̄). Then for λ > 0 small enough, f(x̄ + λ(x − x̄)) ≤ f(x̄), and for a subgradient ξ, f(x̄ + λ(x − x̄)) − f(x̄) ≥ λξ>(x − x̄) from convexity. So 0 ≥ λξ>(x − x̄) for λ > 0, and the theorem is proven. □

Corollary 190.

In addition if f is differentiable and x¯ ∈ arglocmaxx{f(x) | x ∈ S} ⇒ ∀x ∈ S : ∇f(x¯)>(x − x¯) ≤ 0.

Theorem 191.

S ⊂ IRⁿ, S ≠ ∅ compact and polyhedral, f : IRⁿ −→ IR convex. Then argmax_x{f(x) | x ∈ S} ≠ ∅ and it contains an x̄ that is an EP of S.

Proof: Use Weierstrass’ Theorem, convexity properties, and extreme point properties.  Think about the reversed umbrella!

1.5.5 Generalizations of convex functions

Definition 192 (Quasiconvex function).

Let S ⊂ IRn, S 6= ∅ be a convex set. f : S −→ IR is said to be quasiconvex function on S ⇔ ∀x1, x2 ∈ S ∀λ ∈ (0, 1) : f(λx1 + (1 − λ)x2) ≤ max{f(x1), f(x2)}. f is called quasiconcave ⇔ −f is quasiconvex.

A univariate cross section of a quasiconvex function is either monotone or unimodal. A function that is both quasiconvex and quasiconcave is called quasimonotone. A strictly quasiconvex function is also alternatively called a semistrictly, explicitly, or functionally quasiconvex function.

Exercise 193.

Illustrate concepts above by figures in IR2.

Theorem 194 (Quasiconvexity characterization).

S ⊂ IRⁿ, S ≠ ∅ convex. Then f : S −→ IR is quasiconvex ⇔ ∀α ∈ IR : Sα is a convex set.

Proof: ⇒: x1, x2 ∈ Sα ⇒ x1, x2 ∈ S ∧ max{f(x1), f(x2)} ≤ α. Choose λ ∈ (0, 1), so x = λx1 + (1 − λ)x2 ⇒ x ∈ S by convexity of S. By quasiconvexity of f: f(x) ≤ max{f(x1), f(x2)} ≤ α ⇒ x ∈ Sα. ⇐: Let x1, x2 ∈ S, λ ∈ (0, 1), and x = λx1 + (1 − λ)x2. Then x1, x2 ∈ Sα for α = max{f(x1), f(x2)}; by assumption x ∈ Sα (a convex combination), and so f(x) ≤ α = max{f(x1), f(x2)}. □

Remark 195.

f quasiconcave ⇔ Sα = {x | f(x) ≥ α} is convex for every α. f quasimonotone ⇔ Sα = {x | f(x) = α} is convex for every α.

Theorem 196 (Quasiconvexity and maximization).

S ⊂ IRⁿ, S ≠ ∅ polyhedral and compact, f : IRⁿ −→ IR quasiconvex and continuous: ∃x̄ ∈ argmax_x{f(x) | x ∈ S} ∧ x̄ is an EP of S.

Proof: By contradiction: suppose x′ ∈ argmax_x{f(x) | x ∈ S} is not an EP. Because of compactness, x′ belongs to the convex hull of the extreme points of S, and suppose f(x′) is greater than the objective function value at every EP xj. Take max_j f(xj) = f(xk) = α. Consider Sα ⇒ ∀j : xj ∈ Sα ⇒ x′ ∈ Sα ⇒ f(x′) ≤ α, a contradiction. □

Theorem 197 (Differentiable quasiconvex function characterization).

Let S ⊂ IRⁿ, S ≠ ∅ be an open convex set, f : S −→ IR differentiable on S. Then f is quasiconvex ⇔ one of the following equivalent statements holds:

1. ∀x1, x2 ∈ S : f(x1) ≤ f(x2) ⇒ ∇f(x2)>(x1 − x2) ≤ 0. 2. ∀x1, x2 ∈ S : ∇f(x2)>(x1 − x2) > 0 ⇒ f(x1) > f(x2).

Proof: Not required (see, e.g., Bazaraa-Sherali-Shetty). □

Definition 198 (Strictly quasiconvex function).

Let S ⊂ IRⁿ, S ≠ ∅ be a convex set. f : S −→ IR is said to be a strictly quasiconvex function on S ⇔ ∀x1, x2 ∈ S with f(x1) ≠ f(x2): ∀λ ∈ (0, 1) : f(λx1 + (1 − λ)x2) < max{f(x1), f(x2)}. f is called strictly quasiconcave ⇔ −f is strictly quasiconvex.

Exercise 199.

Explain by figures that f convex ⇒ f strictly quasiconvex, but f strictly quasiconvex ⇏ f quasiconvex. Hint: "Flat spots" may occur at extremizing points (only!). Think about y = 0 for x ∈ IR \ {0} and y = 1 for x = 0.

Theorem 200 (Local minima for strictly quasiconvex).

Let S ⊂ IRⁿ, S ≠ ∅ be a convex set, f : IRⁿ −→ IR be strictly quasiconvex, and x̄ ∈ arglocmin_x{f(x) | x ∈ S} ⇒ x̄ ∈ argglobmin_x{f(x) | x ∈ S}.

Proof: Not required (see, e.g., Bazaraa-Sherali-Shetty page 111, by contradiction). 

Lemma 201 (Semicontinuous and strictly quasiconvex function).

S ⊂ IRⁿ, S ≠ ∅ convex, f : S −→ IR strictly quasiconvex and lower semicontinuous ⇒ f is quasiconvex.

Proof: Not required (see, e.g., Bazaraa-Sherali-Shetty page 112). 

Definition 202 (Strongly quasiconvex function).

Let S ⊂ IRⁿ, S ≠ ∅ be a convex set. f : S −→ IR is said to be a strongly quasiconvex function on S ⇔ ∀x1, x2 ∈ S with x1 ≠ x2: ∀λ ∈ (0, 1) : f(λx1 + (1 − λ)x2) < max{f(x1), f(x2)}. f is called strongly quasiconcave ⇔ −f is strongly quasiconvex.

Alternatively, some authors replace ‘strongly quasiconvex’ with ‘strictly quasiconvex’. Note that strong quasiconvexity enforces unimodality:

Theorem 203 (Unimodality).

S ⊂ IRⁿ, S ≠ ∅ convex, f : IRⁿ −→ IR strongly quasiconvex on S. Then x̄ ∈ arglocmin_x{f(x) | x ∈ S} ⇒ {x̄} = argglobmin_x{f(x) | x ∈ S}.

Proof: Not required (see, e.g., Bazaraa-Sherali-Shetty page 113). □

Remark 204.

However, a differentiable and strongly quasiconvex function need not have the property ∇f(x̄) = 0 ⇒ x̄ ∈ argglobmin_x{f(x) | x ∈ S}. Therefore, we introduce:

Definition 205 (Pseudoconvex function).

S ⊂ IRⁿ, S ≠ ∅ convex, f : S −→ IR differentiable on S. f is said to be pseudoconvex on S ⇔ ∀x1, x2 ∈ S : ∇f(x1)>(x2 − x1) ≥ 0 ⇒ f(x2) ≥ f(x1). f is pseudoconcave ⇔ −f is pseudoconvex.

Equivalently: f(x2) < f(x1) ⇒ ∇f(x1)>(x2 − x1) < 0. f is said to be strictly pseudoconvex on S ⇔ ∀x1, x2 ∈ S, x1 ≠ x2 : ∇f(x1)>(x2 − x1) ≥ 0 ⇒ f(x2) > f(x1).

Exercise 206.

Draw a figure explaining the concept of pseudoconvex function f on S.

Theorem 207 (Relations among different convex functions).

Let S ⊂ IRⁿ, S ≠ ∅ be an open and convex set and f : S −→ IR:

1. f is strictly convex on S ⇒ f is convex on S.

2. f is strictly convex on S ⇒ f is strongly quasiconvex on S.

3. f is convex on S ⇒ f is (strictly) quasiconvex on S.

4. f is differentiable and (strictly) convex on S ⇒ f is (strictly) pseudoconvex on S.

5. f is strictly pseudoconvex on S ⇒ f is pseudoconvex on S.

6. f is strictly pseudoconvex on S ⇒ f is strongly quasiconvex on S.

7. f is pseudoconvex on S ⇒ f is strictly quasiconvex on S.

8. f is strongly quasiconvex on S ⇒ f is (strictly) quasiconvex on S.

9. f is strictly quasiconvex and lower semicontinuous on S ⇒ f is quasiconvex.

Proof: Not required (see, e.g., Bazaraa-Sherali-Shetty page 114–115). 

Definition 208 (Convex function at point).

Let S ⊂ IRⁿ, S ≠ ∅ be a convex set and f : S −→ IR. Then:

1. f is said to be a convex function at x¯ ∈ S ⇔ ∀x ∈ S ∀λ ∈ (0, 1) : f(λx¯ + (1 − λ)x) ≤ λf(x¯) + (1 − λ)f(x).

2. f is said to be a strictly convex function at x̄ ∈ S ⇔ ∀x ∈ S, x ≠ x̄, ∀λ ∈ (0, 1) : f(λx̄ + (1 − λ)x) < λf(x̄) + (1 − λ)f(x).

3. f is said to be a quasiconvex function at x¯ ∈ S ⇔ ∀x ∈ S, ∀λ ∈ (0, 1) : f(λx¯ + (1 − λ)x) ≤ max{f(x¯), f(x)}.

4. f is said to be a strictly quasiconvex function at x̄ ∈ S ⇔ ∀x ∈ S, f(x) ≠ f(x̄), ∀λ ∈ (0, 1) : f(λx̄ + (1 − λ)x) < max{f(x̄), f(x)}.

5. f is said to be a strongly quasiconvex function at x̄ ∈ S ⇔ ∀x ∈ S, x ≠ x̄, ∀λ ∈ (0, 1) : f(λx̄ + (1 − λ)x) < max{f(x̄), f(x)}.

6. f is said to be a pseudoconvex function at x¯ ∈ S ⇔ ∀x ∈ S: ∇f(x¯)>(x − x¯) ≥ 0 ⇒ f(x) ≥ f(x¯).

7. f is said to be a strictly pseudoconvex function at x̄ ∈ S ⇔ ∀x ∈ S, x ≠ x̄ : ∇f(x̄)>(x − x̄) > 0 ⇒ f(x) > f(x̄).

Remark 209 (Properties).

S ⊂ IRⁿ, S ≠ ∅ convex and f : S −→ IR. Then:

1. f convex and differentiable at x¯ ⇒ f(x) ≥ f(x¯) + ∇f(x¯)>(x − x¯).

2. f convex and twice differentiable at x¯ ⇒ H(x¯) is PSD.

3. f convex at x¯, x¯ ∈ arglocminx{f(x) | x ∈ S} ⇒ x¯ ∈ argglobminx{f(x) | x ∈ S}.

4. f convex and differentiable at x¯. If x¯ ∈ int S : x¯ ∈ argminx{f(x) | x ∈ S} ⇔ > ∇f(x¯) = 0. Otherwise, x¯ ∈ argminx{f(x) | x ∈ S} ⇔ ∀x ∈ S : ∇f(x¯) (x−x¯) ≥ 0.

5. f convex and differentiable at x¯. x¯ ∈ arglocmaxx{f(x) | x ∈ S} ⇒ ∀x ∈ S : ∇f(x¯)>(x − x¯) ≤ 0.

6. f strictly quasiconvex at x¯: x¯ ∈ arglocminx{f(x) | x ∈ S} ⇒ x¯ ∈ argglobminx{f(x) | x ∈ S}.

7. f strongly quasiconvex at x¯: x¯ ∈ arglocminx{f(x) | x ∈ S} ⇒ x¯ ∈ argglobminx{f(x) | x ∈ S} and x¯ is unique.

8. f pseudoconvex at x¯: ∇f(x¯) = 0 ⇒ x¯ ∈ argglobminx{f(x) | x ∈ S}.

9. f strictly pseudoconvex at x¯: ∇f(x¯) = 0 ⇒ x¯ ∈ argglobminx{f(x) | x ∈ S} and x¯ is unique.

1.6 Theory of unconstrained optimization

Exercise 210.

Review concepts: local minimum, global minimum.

Theorem 211 (Existence of a descent direction).

Let f : IRⁿ −→ IR be differentiable at x̄. If ∃d ∈ IRⁿ : ∇f(x̄)>d < 0, then ∃δ > 0 ∀λ ∈ (0, δ) : f(x̄ + λd) < f(x̄). So d is a descent direction of f at x̄.

Proof: By differentiability: f(x̄ + λd) = f(x̄) + λ∇f(x̄)>d + λ‖d‖α(x̄, λd), where α(·, ·) → 0 for λ → 0. We may rewrite it as: (f(x̄ + λd) − f(x̄))/λ = ∇f(x̄)>d + ‖d‖α(x̄, λd). Since ∇f(x̄)>d < 0 by assumption and α(·, ·) → 0 for λ → 0 ⇒ ∃δ > 0 ∀λ ∈ (0, δ) : ∇f(x̄)>d + ‖d‖α(x̄, λd) < 0, and hence f(x̄ + λd) < f(x̄). □

Let f : IRn −→ IR be differentiable at x¯. If x¯ ∈ arglocmin{f(x) | x ∈ IRn} ⇒ ∇f(x¯) = 0.

Proof: By contradiction: ∇f(x̄) ≠ 0. Set d = −∇f(x̄). Then ∇f(x̄)>d = −‖∇f(x̄)‖² < 0, and by Theorem 211 we have found a descent direction of f at x̄, a contradiction with local minimality. □

Theorem 213 (Second-order necessary optimality condition).

Let f : IRn −→ IR be twice differentiable at x¯. If x¯ ∈ arglocmin{f(x) | x ∈ IRn} ⇒ H(x¯) is PSD.

Proof: Let d be an arbitrarily chosen direction. By twice differentiability of f: f(x̄ + λd) = f(x̄) + λ∇f(x̄)>d + ½λ²d>H(x̄)d + λ²‖d‖²α(x̄, λd), where α(·, ·) → 0 for λ → 0. By Corollary 212: x̄ ∈ arglocmin{f(x) | x ∈ IRⁿ} ⇒ ∇f(x̄) = 0. Then (f(x̄ + λd) − f(x̄))/λ² = ½d>H(x̄)d + ‖d‖²α(x̄, λd). The left-hand side is ≥ 0 for all small λ > 0 (x̄ is a minimum), so the right-hand side is ≥ 0 too. Taking the limit of both sides, α(·, ·) → 0, and so (because differentiability implies continuity) ½d>H(x̄)d ≥ 0, so H(x̄) is PSD. □

Theorem 214 (Sufficient differentiability condition).

Let f : IRn −→ IR be twice differentiable at x¯. If ∇f(x¯) = 0 and H(x¯) is PD ⇒ x¯ ∈ argstrictlocmin{f(x) | x ∈ IRn} (and the isolated minimum).

Proof: By twice differentiability: ∀x ∈ IRⁿ : f(x) = f(x̄) + ∇f(x̄)>(x − x̄) + ½(x − x̄)>H(x̄)(x − x̄) + ‖x − x̄‖²α(x̄, x − x̄), where x → x̄ ⇒ α → 0. By contradiction: x̄ ∉ argstrictlocmin{f(x) | x ∈ IRⁿ}, i.e. ∃{xk} : xk → x̄, ∀k : f(xk) ≤ f(x̄), xk ≠ x̄. Since ∇f(x̄) = 0, denoting dk = (xk − x̄)/‖xk − x̄‖, we obtain ∀k : ½dk>H(x̄)dk + α(x̄, xk − x̄) ≤ 0, with ‖dk‖ = 1 (a bounded sequence). Therefore, ∃K : {dk}_{k∈K} converges to d with ‖d‖ = 1. For k → ∞, α → 0, so d>H(x̄)d ≤ 0, a contradiction with PD. □

Because H(x¯) is PD ⇒ H(x) is PD on some Nε(x¯) and f is strictly convex there, so x¯ is an isolated minimum, too.

Theorem 216 (Sufficient convexity condition).

Let f : IRn −→ IR be pseudoconvex at x¯. x¯ ∈ argglobmin{f(x) | x ∈ IRn} ⇔ ∇f(x¯) = 0.

Proof: ⇒: If x̄ is a global minimum, it is also a local minimum, and by Corollary 212 ∇f(x̄) = 0. ⇐: ∇f(x̄) = 0 ⇒ ∀x ∈ IRⁿ : ∇f(x̄)>(x − x̄) = 0 ≥ 0. By pseudoconvexity: ∀x : f(x) ≥ f(x̄), so x̄ is a global minimum. □

Example 217 (Useful for regression analysis).

We have n measurements x1, …, xn of the independent variables x = (x1, …, xk)> (so xj = (xj1, …, xjk)>) and related values y1, …, yn (denote y = (y1, …, yn)>) of the dependent variable y. We search for y = φ(x; β) (φ known, β unknown). We know that the measurements were obtained from η = φ(x; β) + ε(x), where the ε(xi) are i.i.d. (independent identically distributed) with ε(xi) ∼ N(0, σ²). We assume σ² known, and we look for a point estimate b of the unknown β using maximization of the likelihood function (h is the density function of η):

max_b Π_{j=1}^n h(yj, b, σ) = max_b (1/((2π)^{n/2} σⁿ)) exp( −(1/(2σ²)) Σ_{j=1}^n (yj − φ(xj; b))² ).

It is equivalent to the LSQ (least squares) criterion:

min_b S*(b) = min_b Σ_{j=1}^n (yj − φ(xj; b))².

We put ∇S∗(b) = 0. Then:

∂S*(b)/∂bi = 0, i = 1, …, m, i.e. Σ_{j=1}^n 2(yj − φ(xj; b)) · (−∂φ(xj; b)/∂bi) = 0, i = 1, …, m.

Generally, we have a system of m nonlinear equations in m unknown variables bi. The global minimum b0 is among the roots of this system (solvable by further numerical methods). We denote f(x) = (f1(x), …, fm(x))> and assume the case of a linear regression function φ(x; β) = f(x)>β = Σ_{i=1}^m βi fi(x). So, we solve:

min_b Σ_{j=1}^n (yj − f(xj)>b)² = min_b Σ_{j=1}^n (yj − Σ_{l=1}^m fl(xj) bl)²,

and then by the theorems above:

Σ_{j=1}^n [ 2(yj − Σ_{l=1}^m fl(xj) bl) · (−fi(xj)) ] = 0, i = 1, …, m
Σ_{j=1}^n [ (yj − Σ_{l=1}^m fl(xj) bl) fi(xj) ] = 0, i = 1, …, m
Σ_{j=1}^n Σ_{l=1}^m fl(xj) fi(xj) bl − Σ_{j=1}^n yj fi(xj) = 0, i = 1, …, m
Σ_{l=1}^m ( Σ_{j=1}^n fi(xj) fl(xj) ) bl = Σ_{j=1}^n fi(xj) yj, i = 1, …, m.

Because of the linear regression function, we have a system of linear equations. To simplify notation, we introduce the m × n matrix F defined as F = (f(x1), …, f(xn)) = (fi(xj))ij. Then we have the system of normal equations in the form FF>b = Fy, which is solvable by Gauss elimination. In the case of a unique solution b0 (F of full rank, r(F) = m), we may derive the explicit form using a matrix inverse: b0 = (FF>)⁻¹Fy.
To check whether b0 represents a minimum, we have to check the convexity of S*(b) at b0. For the linear regression function, the Hessian of S*(b) is H(b) = 2FF>. By linear algebra, FF> = FIF>. Because I is PD and F is of full rank, H(b) is also PD, and hence S*(b) is strictly convex and b0 is a global minimum.
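A numpy sketch of the normal equations FF>b = Fy, with the assumed illustrative basis functions f1(x) = 1 and f2(x) = x (a straight-line fit); the data are invented.

```python
# Linear least squares via the normal equations F F^T b = F y of
# Example 217, with assumed basis functions f1(x) = 1, f2(x) = x.
import numpy as np

xs = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 1.9, 3.2, 3.9])

F = np.vstack([np.ones_like(xs), xs])        # F is m x n, rows = f_i(x_j)
b0 = np.linalg.solve(F @ F.T, F @ y)         # normal equations
print("b0 =", b0)

# Cross-check with the library least-squares routine:
b_lstsq, *_ = np.linalg.lstsq(F.T, y, rcond=None)
print("lstsq =", b_lstsq)
```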

Theorem 218 (One dimensional case).

Let f : IR −→ IR be an infinitely differentiable function (f ∈ C^∞(IR)). Then x̄ ∈ IR satisfies x̄ ∈ arglocmin{f(x) | x ∈ IR} ⇔ either ∀j = 1, 2, … : f^(j)(x̄) = 0, or ∃n = 2k (even) with f^(n)(x̄) > 0 and ∀j = 1, …, n − 1 : f^(j)(x̄) = 0.

Proof: It is not required. □ Here f^(n) denotes the n-th order derivative of f. The result essentially asserts that in the discussed case x̄ ∈ arglocmin{f(x) | x ∈ IR} ⇔ f is locally convex about x̄. For comparison with the multidimensional case, read the discussion by Bazaraa, Sherali, and Shetty, pages 135–137.

1.7 Algorithms – general viewpoint

Remark 219 (Solution sets).

For NLP ? ∈ argminx{f(x) | x ∈ S}, we need an algorithm generating a sequence of points converging to a global optimal solution. Usually, we obtain less. We often obtain only the limit point belonging to some solution set Ω. There are the following typical sets Ω:

Ω = {x̄ | x̄ ∈ arglocmin{f(x) | x ∈ S}}.
Ω = {x̄ | x̄ ∈ S ∧ f(x̄) ≤ b}, where b is an acceptable level of the objective function value.
Ω = {x̄ | x̄ ∈ S ∧ f(x̄) < LB + ε}, where LB is a lower bound and ε is a tolerance.
Ω = {x̄ | x̄ ∈ S ∧ f(x̄) − fmin < ε}, where only the optimal objective function value is available.
Ω = {x̄ | x̄ satisfies KKT conditions}.
Ω = {x̄ | x̄ satisfies FJ conditions}.

Definition 220 (Algorithmic map).

Let X ⊂ IRⁿ, A : X −→ 2X a point-to-set mapping, and x1 ∈ X. We call A an algorithmic map (mapping).

Remark 221 (Convergence of algorithm).

By A, we may generate an aforementioned sequence x1, . . . , xk,... approximating solu- tions of NLP satisfying xk+1 ∈ A(xk) for k = 1, 2,.... Convergence of NLP algorithms will be related to properties of A with respect to Ω.

Definition 222 (Convergence of algorithmic map).

X Algorithmic map A : X −→ 2 is said to converge over Y ⊂ X if ∀x1 ∈ Y the limit of any convergent subsequence of the sequence x1, x2,... generated by the algorithm belongs to the solution set Ω.

Exercise 223.

Consider argmin{x² | x ≥ 1}. Draw a figure explaining that the solution set Ω = {1}. A numerical sketch follows the list.

1. Analyze graphically A(x) = {(x + 1)/2}.
2. Analyze graphically A(x) = [1, (x + 1)/2] for x ≥ 1 and A(x) = [(x + 1)/2, 1] for x < 1.
3. Analyze graphically A(x) = [3/2 + x/4, 1 + x/2] for x ≥ 2 and A(x) = {(x + 1)/2} for x < 2. Hint: If x1 ≥ 2 then xk does not converge to a point of Ω. Why?
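A numerical sketch for case 1, whose map is point-to-point: iterates of A(x) = {(x + 1)/2} approach Ω = {1} from any start. Cases 2 and 3 are set-valued and are best studied graphically.

```python
# Iterating the algorithmic map A(x) = {(x + 1) / 2} of Exercise 223,
# case 1: x_{k+1} in A(x_k). The iterates converge to Omega = {1}.
x = 5.0
for k in range(10):
    x = (x + 1.0) / 2.0
    print(f"x_{k+1} = {x:.6f}")
# Each step halves the distance to 1, so the convergence is linear
# with ratio beta = 1/2 (cf. Remark 238 later in the text).
```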

Exercise 224.

Check graphically that all three previous cases may satisfy: xk ∈ S ⇒ xk+1 ∈ S (keeps feasibility); xk ∈ S, xk ∉ Ω ⇒ f(xk) > f(xk+1) (decreasing objective); and xk ∈ Ω ⇒ xk+1 ∈ Ω (keeps optimality).

Definition 225 (Closed algorithmic map).

Let X ⊂ IRp, Y ⊂ IRq, both nonempty and closed. A : X −→ 2Y . Then A is said to be closed at x ∈ X ⇔ (xk ∈ X, xk −→ x, yk ∈ A(xk), yk −→ y ⇒ y ∈ A(x)). The map A is said to be closed on Z ⊂ X if it is closed ∀x ∈ Z.

Theorem 226 (About convergence).

Let X ⊂ IRⁿ, X ≠ ∅, be closed, let Ω be a solution set, A : X −→ 2X, x1 ∈ X, and let {xk} be defined ∀k ∈ IN by: for xk ∉ Ω, xk+1 ∈ A(xk); if xk ∈ Ω then stop (or xk+1 := xk). Suppose that {xk} is contained in a compact subset of X and that there exists a continuous descent function α : X −→ IR (e.g., f(x) or ‖∇f(x)‖) such that α(y) < α(x) if x ∉ Ω and y ∈ A(x). If A is closed over X \ Ω, then either the algorithm stops in a finite number of steps with a point in Ω, or it generates an infinite sequence {xk} such that:

1. every convergent subsequence of {xk} has a limit in Ω.

2. α(xk) −→ α(x) for some x ∈ Ω.

Proof: If some xk ∈ Ω, we stop. Otherwise we have an infinite sequence {xk}. Let {xk}_{k∈K} be a convergent subsequence with limit x ∈ X. By continuity of α: α(xk) → α(x) for k ∈ K, i.e. ∀ε > 0 ∃K̄ ∈ K : ∀k ∈ K, k ≥ K̄ : α(xk) − α(x) < ε. Because α(x_{K̄}) − α(x) < ε, then for k > K̄ (because α is a descent function) α(xk) < α(x_{K̄}) and α(xk) − α(x) = α(xk) − α(x_{K̄}) + α(x_{K̄}) − α(x) < 0 + ε, so lim_{k→∞} α(xk) = α(x) (for the whole sequence). By contradiction, we show that x ∈ Ω. Assume x ∉ Ω: {xk+1}_{k∈K} is contained in a compact subset of X and has a convergent subsequence {xk+1}_{k∈K̄} with limit x̄ ∈ X. Then lim_{k→∞} α(xk) = α(x) ⇒ α(x) = α(x̄). Because A is closed at x and for k ∈ K̄ : xk → x, xk+1 ∈ A(xk), xk+1 → x̄, we get x̄ ∈ A(x); then α(x̄) < α(x), a contradiction. □

If the Theorem's assumptions are valid and Ω = {x̄}, then xk → x̄.

Definition 228 (Composite map).

Let X ⊂ IRⁿ, Y ⊂ IRᵖ, Z ⊂ IRq be nonempty closed sets. Let B : X −→ 2Y and C : Y −→ 2Z. Then the composite map A = CB : X −→ 2Z (C is applied after B) is defined by A(x) = ∪{C(y) | y ∈ B(x)}.

The definition of composite map is useful for optimization algorithms because they are also composed.

Theorem 229 (Closed composite map).

Let X ⊂ IRn, Y ⊂ IRp, Z ⊂ IRq nonempty closed sets. Let B : X −→ 2Y , C : Y −→ 2Z , and A = CB. We suppose that B is closed at x and C is closed on B(x). Furthermore, we suppose that xk −→ x and yk ∈ B(xk) ⇒ ∃ convergent subsequence {yk}. Then A is closed at x.

Proof: Let xk → x, zk ∈ A(xk), and zk → z. We need to show z ∈ A(x) to prove Theorem 229. By definition: ∀k ∃yk ∈ B(xk) such that zk ∈ C(yk). By assumption ∃{yk}_K, a convergent subsequence with limit y. B is closed ⇒ y ∈ B(x); C is closed on B(x), in particular at y. Therefore z ∈ C(y), so z ∈ CB(x) = A(x). □

Corollary 230.

Let X ⊂ IRn, Y ⊂ IRp, Z ⊂ IRq nonempty closed sets. Let B : X −→ 2Y and C : Y −→ 2Z , B closed at x, C closed on B(x), and Y is a compact set. Then A = CB is closed at x.

Corollary 231.

Let X ⊂ IRn, Y ⊂ IRp, Z ⊂ IRq nonempty closed sets. Let B : X −→ Y (point-to-point function, i.e. point-to-one-point-set mapping), C : Y −→ 2Z , B continuous at x, and C closed on B(x). Then A = CB is closed at x.

Theorem 232 (More about convergence).

X ⊂ IRⁿ, X ≠ ∅ is a closed set, Ω ⊂ X, Ω ≠ ∅ is a solution set, and α : IRⁿ −→ IR is a continuous mapping. The map C : X −→ 2X satisfies: ∀x ∈ X and y ∈ C(x) : α(y) ≤ α(x). The map B : X −→ 2X is closed over X \ Ω and ∀x ∉ Ω, ∀y ∈ B(x) : α(y) < α(x). Assume that A = CB generates for x1 the sequence {xk} as follows: either xk ∈ Ω (and stop), or xk ∉ Ω and xk+1 ∈ A(xk). If Λ = {x | α(x) ≤ α(x1)} is a compact set, then: either the algorithm stops in a finite number of steps with a point in Ω, or all accumulation points of {xk} belong to Ω.

Proof: If xk ∈ Ω, we stop by assumption. Let {xk} be generated by A and {xk}_K be a convergent subsequence with limit x. Thus α(xk) → α(x) for k ∈ K by continuity of α; by monotonicity of α, as before, lim_{k→∞} α(xk) = α(x). We need to show x ∈ Ω. By contradiction, assume x ∉ Ω and consider the sequence {xk+1}_K. By the definition of A, we have xk+1 ∈ C(yk) for some yk ∈ B(xk), and yk, xk+1 ∈ Λ. By compactness of Λ, ∃ further subsequences {xk+1}_{K′}, {yk}_{K′} : xk+1 → x′, yk → y. B closed at x ⇒ y ∈ B(x) and α(y) < α(x) (as x ∉ Ω). xk+1 ∈ C(yk) ⇒ α(xk+1) ≤ α(yk), hence α(x′) ≤ α(y) < α(x), which contradicts α(x′) = lim α(xk) = α(x). □

Let f : IRⁿ −→ IR be a differentiable function and ? ∈ arglocmin{f(x) | x ∈ IRⁿ} an (unconstrained) NLP. A is defined as follows: y ∈ A(x) is obtained by n sequential minimizations of f along the directions d1, …, dn (starting from x), where ‖dj‖ = 1, j = 1, …, n. Denote the matrix D(x) = (d1, …, dn) (for given x), and assume det(D(x)) ≥ ε > 0 for some ε and ∀x ∈ IRⁿ. Suppose that the minimum of f along any line in IRⁿ is unique. For x1, {xk} is generated as follows: if ∇f(xk) = 0 then stop at xk; otherwise xk+1 ∈ A(xk) (and k := k + 1), and repeat. If {xk} is contained in a compact set, then each accumulation point x of {xk} satisfies ∇f(x) = 0.

Remark 234.

1. The discussed general approach (and also the last Theorem 233) to convergence is suitable for many optimization algorithms.

2. No additional assumption about closedness or continuity is needed for Theorem 233. However, all search directions should be linearly independent (e.g., this holds for coordinate or orthogonal directions); a sketch of such a coordinate search follows below.
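A sketch of the scheme of Theorem 233 with coordinate directions dj = ej and the exact line minimization delegated to scipy's scalar minimizer; the convex quadratic test function is an assumed example.

```python
# Cyclic coordinate search (Theorem 233): minimize f sequentially along
# the unit coordinate directions, assuming scipy is available.
import numpy as np
from scipy.optimize import minimize_scalar

def f(x):                                   # assumed convex test function
    return (x[0] - 1.0) ** 2 + 5.0 * (x[1] + 2.0) ** 2 + x[0] * x[1]

x = np.zeros(2)
for k in range(20):
    for j in range(2):                      # one pass = one A(x) step
        e = np.zeros(2); e[j] = 1.0
        t = minimize_scalar(lambda t: f(x + t * e)).x
        x = x + t * e
print("x ~", x, " f(x) ~", f(x))
# grad f = (2(x1 - 1) + x2, 10(x2 + 2) + x1) should be ~ 0 at the limit.
```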

Exercise 235.

Why is uniqueness required in Theorem 233? Hint: Draw a figure for z = x2(1 − x1) and use that ∇f(x) ≠ 0 ⇒ f(x2) < f(x1).

Remark 236 (Termination of algorithm).

In the case of convergence to Ω in a limit sense, we need stopping rules (assume that ε > 0 and N ∈ IN):

1. ‖xk+N − xk‖ < ε (small movement).

2. ‖xk+1 − xk‖ / ‖xk‖ < ε (small relative distance change).

3. α(xk) − α(xk+N) < ε (descent function does not improve).

4. (α(xk) − α(xk+1)) / |α(xk)| < ε (small relative improvement).

5. α(xk) − α(x̄) < ε (e.g., Ω = {x̄ | ∇f(x̄) = 0} ∧ α(x̄) = ‖∇f(x̄)‖).
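These rules are directly implementable. The following is a minimal Python sketch (the function name and the history arguments xs, alphas are our own illustration, not part of the text) testing rules 1-4 on stored iterates:

```python
import numpy as np

def should_stop(xs, alphas, eps=1e-6, N=5):
    """Test stopping rules 1-4 of Remark 236 on the iterate history xs
    (list of np.ndarray) and descent-function history alphas (list of floats)."""
    if len(xs) < N + 1:
        return False                                   # not enough history yet
    x_k, x_k1, x_kN = xs[-N - 1], xs[-N], xs[-1]       # x_k, x_{k+1}, x_{k+N}
    a_k, a_k1, a_kN = alphas[-N - 1], alphas[-N], alphas[-1]
    return (np.linalg.norm(x_kN - x_k) < eps                                       # rule 1
            or np.linalg.norm(x_k1 - x_k) / max(np.linalg.norm(x_k), 1e-15) < eps  # rule 2
            or a_k - a_kN < eps                                                    # rule 3
            or (a_k - a_k1) / max(abs(a_k), 1e-15) < eps)                          # rule 4
```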

Remark 237 (Comparison of algorithms).

They can be compared using various features:

Generality is given by required assumptions, solvable problems.

Reliability (robustness) is defined by achievable accuracy for solvable problems.

Precision is specified by the quality of iterates after a given number of iterations.

Sensitivity relates to setup of algorithm parameters and problem’s input data.

Preparations required identify the necessary preprocessing steps, e.g., computation of the gradient and Hessian.

Computational effort depends on computing time, number of iterations, elementary operations required, functional evaluations, etc.

Remark 238 (Convergence).

It is important to distinguish between theoretical and practical convergence (e.g., an average rate based on statistical observations). In theory, we have xk −→ x̄ (∀k : xk ≠ x̄). The order of convergence is defined by

sup{p | p ≥ 0 ∧ ∃β : lim sup_{k−→∞} ‖xk+1 − x̄‖ / ‖xk − x̄‖^p = β < ∞}.

Alternatively, the fraction can be given as |α(xk+1) − α(x̄)| / |α(xk) − α(x̄)|^p. If p = 1 and β < 1 then we call the convergence linear (geometric; why?). If p > 1 (or p = 1 and β = 0) then the convergence is said to be superlinear. If p = 2 and β < ∞ then the convergence is called quadratic (second order). Notice that with greater p the theoretical convergence is faster.

1.8 Line search methods revisited

Remark 239 (Motivation).

Efficient line search algorithms are very important (Why? Give examples!) because they are repeatedly used by multidimensional algorithms solving NLP problems. Many multidimensional NLP algorithms use the following iteration formula (Explain by own example!): xk+1 := xk + λkdk,

where xk is a point known from the previous iteration, dk is a direction specified by the multidimensional algorithm, and λk is a step size. For given xk and dk, the step size λk (also denoted λmin) is computed as follows:

λk ∈ argmin{θ(λ) | λ ∈ L}, where θ(λ) = f(xk + λdk),

which is a univariate minimization problem with unknown λ (Does it change with iterations?). We will introduce line search algorithms for this problem.

Remark 240 (Interval of uncertainty).

A feasible set L may be defined in different ways (e.g., L = IR, L = [0; ∞), L = [a; b]). We will search for λmin ∈ L = [a; b]. It is the so-called interval of uncertainty, since the location of λmin in [a; b] is not known. The goal is to reduce the interval of uncertainty by the line search excluding parts of it.

Remark 241 (Reduction for strictly quasiconvex function).

If we assume that λmin ∈ L and θ is strictly quasiconvex (i.e. any local minimum is also a global minimum), the important reduction idea is given by the following theorem (Explain by figure!):

Theorem 242.

Let θ : IR −→ IR be strictly quasiconvex over the interval [a; b]. Let λ, µ ∈ [a; b] and λ < µ. If θ(λ) > θ(µ) then ∀z ∈ [a; λ): θ(z) ≥ θ(µ). If θ(λ) ≤ θ(µ) then ∀z ∈ (µ; b]: θ(z) ≥ θ(λ).

Proof: Suppose that θ(λ) > θ(µ) and let z ∈ [a; λ). By contradiction, suppose that θ(z) < θ(µ). Since λ can be written as a convex combination of z and µ, by the definition of strict quasiconvexity we obtain θ(λ) < max{θ(z), θ(µ)} = θ(µ). This is a contradiction with θ(λ) > θ(µ). So θ(z) ≥ θ(µ). The rest of the theorem is proven in a similar way. □

Therefore, in the case of strict quasiconvexity we have the following simple rules (Explain for the forthcoming algorithm!):

• If θ(λ) > θ(µ) then the new interval of uncertainty is [λ; b].

• If θ(λ) ≤ θ(µ) then the new interval of uncertainty is [a; µ].

Remark 243 (Unimodality).

A reader may also be interested in the relation between different types of generalized convexity introduced earlier and different concepts of unimodality often used in the books on numerical algorithms. So, we give a brief overview (Draw figures!):

Definition 244.

Let θ : L −→ IR be a univariate function and L ⊂ IR is an interval on IR. θ is unimodal on L if there exists λmin ∈ argmin{θ(λ)|λ ∈ L} and θ is nondecreasing on the interval {λ ∈ L|λ ≥ λmin} and nonincreasing on the interval {λ ∈ L|λ ≤ λmin}.

Assuming that θ attains a minimum on L (Why? Figure! This avoids cases such as θ(λ) = λ³), it can be shown that θ is quasiconvex if and only if it is unimodal on L.

Definition 245.

A function θ : IR −→ IR to be minimized is said to be strictly unimodal over [a; b] if there exists λmin ∈ argmin{θ(λ) | λ ∈ [a; b]} and ∀λ1, λ2 ∈ [a; b] such that θ(λ1) ≠ θ(λmin) and θ(λ2) ≠ θ(λmin) and λ1 < λ2 we have (λ2 ≤ λmin) ⇒ (θ(λ1) > θ(λ2)) and (λ1 ≥ λmin) ⇒ (θ(λ1) < θ(λ2)).

It may be shown that if θ is strictly unimodal and continuous over [a; b] then θ is strictly quasiconvex over [a; b]. Conversely, if θ is strictly quasiconvex over [a; b] and has a minimum in this interval (Why? Figure!) then it is strictly unimodal over this interval.

Definition 246.

A function θ : IR −→ IR to be minimized is said to be strongly unimodal over [a; b] if there exists λmin ∈ argmin{θ(λ)|λ ∈ [a; b]} and ∀λ1, λ2 ∈ [a; b] such that λ1 < λ2 we have (λ2 ≤ λmin) ⇒ (θ(λ1) > θ(λ2)) and (λ1 ≥ λmin) ⇒ (θ(λ1) < θ(λ2)).

It can be shown that if θ is strongly unimodal over [a; b] then θ is strongly quasiconvex over [a; b]. Conversely, if θ is strongly quasiconvex over [a; b] and has a minimum in this interval (Why? Figure!), then it is strongly unimodal over the interval.

Definition 247.

Let f : S −→ IR, where f is lower-semicontinuous and S ⊂ IRn is a convex set. f is strongly unimodal on S if ∀x1, x2 ∈ S such that the function F (λ) = f(x1 + λ(x2 − x1)), λ ∈ [0; 1], attains a minimum λmin > 0 we have ∀λ ∈ (0, λmin): F (0) > F (λ) > F (λmin).

It can be shown that f is strongly quasiconvex on S if and only if f is strongly unimodal on S.

Classification of line search algorithms:

- LS without using derivatives: uniform search, dichotomous search, golden section, Fibonacci.

- LS Using derivatives: bisection search, Newton’s method.

- Approximation-based LS: quadratic (cubic) approximations.

- Inexact LS: Armijo’s rule.

1.8.1 Line search without using derivatives

Uniform search. We know n grid points in advance at which functional evaluations (θ(λ)) are to be made. The big disadvantage is that the information obtained during computations is not further used (Why?). So, for the kth step and step size δ we assign λk := a + kδ, k = 1, ..., n, and evaluate θ(λk). We search for r such that θ(λr) = mink θ(λk). The uncertainty interval is then reduced to [λr − δ; λr + δ]. This technique may be used for the initial bracketing of λmin.

Dichotomous search. It already uses information taken from previous iterations. Again, we assume that θ is strictly quasiconvex on [a; b]. The idea is to place λ and µ near to the midpoint to guard against the worst possible outcome in reduction of the uncertainty interval.

Algorithm 248.

Choose the constant δ > 0 and the allowable final length of the uncertainty interval ε > 0 (with ε > 2δ, so the interval can shrink below ε). Let [a1; b1] be the initial interval of uncertainty, let k := 1, and continue.

1. If bk − ak < ε then STOP; the minimum lies in the interval [ak; bk]. Otherwise:

λk := (ak + bk)/2 − δ,  µk := (ak + bk)/2 + δ.

2. If θ(λk) < θ(µk), let ak+1 := ak and bk+1 := µk. Otherwise, let ak+1 := λk and bk+1 := bk. In both cases k := k + 1 and GOTO 1.

(Be able to use it!) Notice that each iteration requires two new evaluations of θ.
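A minimal Python sketch of Algorithm 248 (the function name is ours; θ is passed as a callable) may look as follows:

```python
def dichotomous_search(theta, a, b, delta=1e-5, eps=1e-4):
    """Algorithm 248: dichotomous search for a strictly quasiconvex theta
    on [a; b]; returns the final interval of uncertainty (needs eps > 2*delta)."""
    while b - a >= eps:
        mid = (a + b) / 2
        lam, mu = mid - delta, mid + delta
        if theta(lam) < theta(mu):
            b = mu          # the minimum cannot lie in (mu; b]
        else:
            a = lam         # the minimum cannot lie in [a; lam)
    return a, b

# Example 250's problem: min {lambda^2 + 2*lambda | -3 <= lambda <= 5}
print(dichotomous_search(lambda t: t**2 + 2*t, -3.0, 5.0))  # small interval around -1
```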

Golden section method. To compare various line search methods, we usually consider the ratio of the uncertainty interval length after the kth iteration to its length before any iteration. Clearly, a more efficient algorithm has a smaller ratio. This seems to be an argument for dichotomous search. For it (and also for the golden section method developed below) λk and µk are chosen in such a way that µk − ak = bk − λk = bk+1 − ak+1, because we do not know our preference for either µk or λk in advance. It may also be acceptable to require

(bk+1 − ak+1) / (bk − ak) = α, (1.3)

and so the reduction ratio α ∈ (0; 1) is a constant. We may think about another possibility to save computational time. We may find that our previous idea to count iterations is imprecise because we do not take into account the needed functional evaluations (there are 2 per iteration for dichotomous search). So, we try to utilize an already computed function value in the next iteration (this is the main idea of the golden section method! Could you utilize Figure 1.2 for the computation of α?). Therefore, for the case ak+1 := λk and bk+1 := bk we have to assign λk+1 := µk. By Figure 1.2, two different descriptions of µk utilizing the constant α may be used:

ak + α(bk − ak) = λk + (1 − α)(bk − λk).

[Figure 1.1: Convexity and concavity of composed functions. Displaced figure; not reproduced here.]

We may replace λk with ak + (1 − α)(bk − ak) and after the following simplification:

ak + α(bk − ak) = ak + (1 − α)(bk − ak) + (1 − α)(bk − (ak + (1 − α)(bk − ak)))
α(bk − ak) = 2(1 − α)(bk − ak) − (1 − α)²(bk − ak)
α = 2(1 − α) − (1 − α)²

we obtain the quadratic equation α² + α − 1 = 0 (it is the same for the case ak+1 := ak and bk+1 := µk). We search for α > 0, that is, α = −1/2 + √5/2 ≈ 0.618, the golden section ratio. For a strictly quasiconvex function θ on [a; b] we form the following algorithm (Be able to use it, with help!):

Algorithm 249.

Choose ε > 0, [a1; b1], α := 0.618, compute λ1 := a1 + (1 − α)(b1 − a1) and µ1 := a1 + α(b1 − a1) together with θ(λ1) and θ(µ1), and assign k := 1.

1. If bk − ak < ε STOP , the minimum lies in [ak; bk]. Otherwise, if θ(λk) > θ(µk) then GOTO 2. ; and if θ(λk) ≤ θ(µk) then GOTO 3.

2. Assign ak+1 := λk and bk+1 := bk. Furthermore, λk+1 := µk and µk+1 := ak+1 + α(bk+1 − ak+1). Evaluate θ(µk+1) and GOTO 4.

3. Assign ak+1 := ak and bk+1 := µk. Furthermore, µk+1 := λk and λk+1 := ak+1 + (1 − α)(bk+1 − ak+1). Evaluate θ(λk+1) and GOTO 4.

4. k := k + 1, GOTO 1.
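A possible Python sketch of Algorithm 249 (names are ours), reusing one θ value per iteration as required:

```python
def golden_section(theta, a, b, eps=1e-4):
    """Algorithm 249: golden section search for a strictly quasiconvex theta."""
    alpha = 0.618
    lam, mu = a + (1 - alpha) * (b - a), a + alpha * (b - a)
    t_lam, t_mu = theta(lam), theta(mu)
    while b - a >= eps:
        if t_lam > t_mu:                     # step 2: drop [a; lam)
            a, lam, t_lam = lam, mu, t_mu
            mu = a + alpha * (b - a)
            t_mu = theta(mu)                 # only one new evaluation
        else:                                # step 3: drop (mu; b]
            b, mu, t_mu = mu, lam, t_lam
            lam = a + (1 - alpha) * (b - a)
            t_lam = theta(lam)
    return a, b

print(golden_section(lambda t: t**2 + 2*t, -3.0, 5.0))   # brackets lambda_min = -1
```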

Example 250.

Solve the problem min{λ2 + 2λ | −3 ≤ λ ≤ 5} using the golden section method. The results of 4 iterations are contained in the table (cf. with the true minimum λmin = −1).

k    ak      bk      λk      µk      θ(λk)   θ(µk)
1   −3.000   5.000   0.056   1.944   0.115   7.667
2   −3.000   1.944  −1.112   0.056  −0.987   0.115
3   −3.000   0.056  −1.832  −1.112  −0.308  −0.987
4   −1.832   0.056  −1.112  −0.664  −0.987  −0.887

The Fibonacci search. It is useful in cases when the number of iterations N to be realized is known. The Fibonacci sequence of N + 1 numbers Fk is defined by the assignment Fk+1 := Fk + Fk−1, k = 1, ..., N − 1, where F0 = F1 = 1. The basic idea is shared with the golden section method, which is the limiting case of the Fibonacci method (think about the limit of the sequence of ratios used instead of α). Notice that the number of iterations must be known in advance because FN is used at the beginning.

Algorithm 251 (Fibonacci).

Choose ε > 0, δ > 0, [a1; b1], and N such that FN > (b1 − a1)/ε. Then λ1 := a1 + (FN−2/FN)(b1 − a1), µ1 := a1 + (FN−1/FN)(b1 − a1), compute θ(λ1) and θ(µ1), and set k := 1.

1. If θ(λk) > θ(µk) then GOTO 2. , otherwise, if θ(λk) ≤ θ(µk) then GOTO 3.

2. ak+1 := λk and bk+1 := bk. Furthermore, λk+1 := µk and µk+1 := ak+1 + (FN−k−1/FN−k)(bk+1 − ak+1). If k = N − 2 then GOTO 5. , otherwise compute θ(µk+1) and GOTO 4.

3. ak+1 := ak and bk+1 := µk. Furthermore, µk+1 := λk and λk+1 := ak+1 + (FN−k−2/FN−k)(bk+1 − ak+1). If k = N − 2 then GOTO 5. , otherwise compute θ(λk+1) and GOTO 4.

4. k := k + 1, GOTO 1.

5. Set λN := λN−1 and µN := λN−1 + δ. If θ(λN ) > θ(µN ) then aN := λN and bN := bN−1. If θ(λN ) ≤ θ(µN ) then aN := aN−1 and bN := λN . STOP and minimum lies in interval [aN ; bN ].
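A slightly simplified Python sketch of Algorithm 251 (names are ours; it may spend a couple of redundant evaluations near the end of the loop):

```python
def fibonacci_search(theta, a, b, eps=1e-4, delta=1e-5):
    """Algorithm 251: Fibonacci search; N is the smallest index with
    F_N > (b - a)/eps, so the number of iterations is fixed in advance."""
    F = [1, 1]
    while F[-1] <= (b - a) / eps:
        F.append(F[-1] + F[-2])
    N = len(F) - 1
    lam = a + F[N - 2] / F[N] * (b - a)
    mu  = a + F[N - 1] / F[N] * (b - a)
    t_lam, t_mu = theta(lam), theta(mu)
    for k in range(1, N - 1):
        if t_lam > t_mu:                     # step 2
            a, lam, t_lam = lam, mu, t_mu
            mu = a + F[N - k - 1] / F[N - k] * (b - a)
            t_mu = theta(mu)
        else:                                # step 3
            b, mu, t_mu = mu, lam, t_lam
            lam = a + F[N - k - 2] / F[N - k] * (b - a)
            t_lam = theta(lam)
    mu = lam + delta                         # step 5: final delta-shifted comparison
    if theta(lam) > theta(mu):
        a = lam
    else:
        b = lam
    return a, b

print(fibonacci_search(lambda t: t**2 + 2*t, -3.0, 5.0))
```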

Example 252.

Solve the problem min{λ2 + 2λ | −3 ≤ λ ≤ 5} using the Fibonacci search for N = 9. Results of 4 iterations are contained in the table.

k    ak      bk      λk      µk      θ(λk)   θ(µk)
1   −3.000   5.000   0.054   1.945   0.112   7.675
2   −3.000   1.945  −1.109   0.054  −0.988   0.112
3   −3.000   0.054  −1.836  −1.109  −0.300  −0.988
4   −1.836   0.054  −1.109  −0.672  −0.988  −0.892

Comparison of derivative-free line search methods. For the given precision ε > 0, the required number of iterations n ∈ IN must satisfy the following inequalities: Do you have any idea how they were derived?

Uniform search method: n ≥ (b1 − a1)/(ε/2) − 1

Dichotomous search method: (0.5)^(n/2) ≤ ε/(b1 − a1)

Golden section method: (0.618)^(n−1) ≤ ε/(b1 − a1)

Fibonacci search method: Fn ≥ (b1 − a1)/ε
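Reading the shrinking-ratio conditions with '≤' as above, the required n can be computed directly; a small Python sketch (the interval and precision values are only illustrative):

```python
import math

b1_a1, eps = 8.0, 1e-4               # interval length from Example 250, target precision
r = eps / b1_a1

n_uniform     = math.ceil(b1_a1 / (eps / 2) - 1)              # n >= (b1-a1)/(eps/2) - 1
n_dichotomous = math.ceil(2 * math.log(r) / math.log(0.5))    # 0.5^(n/2) <= r
n_golden      = math.ceil(math.log(r) / math.log(0.618)) + 1  # 0.618^(n-1) <= r
F = [1, 1]
while F[-1] < 1 / r:                                          # F_n >= (b1-a1)/eps
    F.append(F[-1] + F[-2])
n_fibonacci = len(F) - 1

print(n_uniform, n_dichotomous, n_golden, n_fibonacci)        # 159999 33 25 25
```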

1.8.2 Line search using derivatives

Using calculus. We may try to solve θ′(λ) = 0; then we obtain:

0 = θ′(λ) = dk^T ∇f(xk + λdk),

which is usually a nonlinear equation. Be able to use it during the examination! If f is not differentiable then θ is not differentiable either, and the previous methods are used. With differentiable θ we may use the following algorithms.

Bisection search method. It converges for a pseudoconvex function θ. Compare with bisection for root search! Be able to use it for computations.

Algorithm 253.

Identify [a1; b1], ε > 0, and N ∈ IN satisfying (0.5)^N ≤ ε/(b1 − a1); set k := 1.

1. λk := (ak + bk)/2, compute θ′(λk). If θ′(λk) = 0 then STOP and λk is a minimum. For θ′(λk) > 0 GOTO 2. and for θ′(λk) < 0 GOTO 3.

2. ak+1 := ak and bk+1 := λk, GOTO 4.

3. ak+1 := λk and bk+1 := bk, GOTO 4.

4. If k = N, STOP , and the minimum lies in [aN+1; bN+1]. Otherwise k := k + 1 and GOTO 1.
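A minimal Python sketch of Algorithm 253, taking θ′ as a callable (names are ours):

```python
import math

def bisection_line_search(dtheta, a, b, eps=1e-6):
    """Algorithm 253: bisection on theta'(lambda) for a pseudoconvex theta."""
    N = math.ceil(math.log2((b - a) / eps))      # smallest N with 0.5^N <= eps/(b1-a1)
    for _ in range(N):
        lam = (a + b) / 2
        d = dtheta(lam)
        if d == 0:
            return lam, lam                      # exact stationary point found
        elif d > 0:
            b = lam                              # the minimum is to the left
        else:
            a = lam                              # the minimum is to the right
    return a, b

# Example 254: theta(l) = l^2 + 2l, so theta'(l) = 2l + 2, on [-3; 6]
print(bisection_line_search(lambda l: 2 * l + 2, -3.0, 6.0))
```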

Example 254.

Solve the problem min{λ2 + 2λ | −3 ≤ λ ≤ 6} by the bisection search method.

k    ak      bk      λk      θ′(λk)
1   −3.000   6.000   1.500   5.000
2   −3.000   1.500  −0.750   0.500
3   −3.000  −0.750  −1.875  −1.750
4   −1.875  −0.750  −1.312  −0.625

Newton’s method. It searches for roots of the equation f(x) = 0, so we may use it for our equation θ′(λ) = 0. By the 2nd order Taylor series approximation of the function θ at λk we obtain:

T(λ) = θ(λk) + θ′(λk)(λ − λk) + (1/2)θ″(λk)(λ − λk)²,

and the following iteration formula (Be able to derive it!):

λk+1 := λk − θ′(λk)/θ″(λk).

If |λk+1 − λk| < ε or |θ′(λk)| < ε then the iteration process is stopped. The use of this procedure requires the existence of non-zero 2nd order derivatives of θ at λk, k = 1, 2, .... This method does not converge to the minimum for all initial points λ1. See Bazaraa-Shetty for theoretical sufficient convergence conditions ("the initial point λ1 should be chosen near to the minimum").
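A minimal Python sketch of the iteration formula with both stopping tests (names and the iteration cap are ours):

```python
def newton_line_search(dtheta, ddtheta, lam, eps=1e-8, max_iter=50):
    """Newton's method for theta'(lambda) = 0; needs a nonzero second
    derivative and a starting point near the minimum (no global convergence)."""
    for _ in range(max_iter):
        step = dtheta(lam) / ddtheta(lam)
        lam -= step
        if abs(step) < eps or abs(dtheta(lam)) < eps:
            break
    return lam

# theta(l) = l^2 + 2l: theta' = 2l + 2, theta'' = 2; one step reaches -1 exactly
print(newton_line_search(lambda l: 2 * l + 2, lambda l: 2.0, 5.0))
```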

1.8.3 Approximating line search

Motivating idea. The previous methods do not accelerate the computational process using information about the shape of the function θ (with the exception of the non-globally convergent Newton’s method and its use of θ″). Therefore, the quadratic fit line search has been developed (a cubic variant also exists). During each iteration, the original function θ is replaced by a parabola (Explain the algorithm by figures!). Its minimum approximates the minimizer of θ. We assume that θ is strictly quasiconvex and continuous and λ > 0.

68 Algorithm 255.

We suppose that three points are given 0 ≤ λ1 < λ2 < λ3 such that they satisfy θ1 ≥ θ2 and θ2 ≤ θ3, where θj = θ(λj) for j = 1, 2, 3. Choose ε > 0.

1. The three points (λj; θj), j = 1, ..., 3, determine the parabola (How?):

q(λ) = θ1 (λ − λ2)(λ − λ3) / ((λ1 − λ2)(λ1 − λ3)) + θ2 (λ − λ1)(λ − λ3) / ((λ2 − λ1)(λ2 − λ3)) + θ3 (λ − λ1)(λ − λ2) / ((λ3 − λ1)(λ3 − λ2)),

and from q′(λ) = 0, we obtain:

λ* = (1/2) (b23θ1 + b31θ2 + b12θ3) / (a23θ1 + a31θ2 + a12θ3),

where aij = λi − λj and bij = λi² − λj². By substitution we obtain θ* = θ(λ*).

2. If λ* > λ2 and θ* ≥ θ2 then the new three points are defined as λ1, λ2, λ*. If λ* > λ2 and θ* < θ2 then the new three points are specified as λ2, λ*, λ3. GOTO 5.

3. If λ* < λ2 and θ* ≥ θ2 then the new three points are given as λ*, λ2, λ3. If λ* < λ2 and θ* < θ2 then the new three points are λ1, λ*, λ2. GOTO 5.

4. If λ* = λ2, then no new point is derived; choose 0 < δ < λ3 − λ1 and move λ* from λ2 by δ/2 (the direction does not matter), then GOTO 2.

5. Test whether θ1 = θ2 = θ3 or λ3 − λ1 < ε. If the termination condition is not fulfilled then update λ1, λ2, and λ3 by the previous rules and GOTO 1.
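Step 1 is just Lagrange interpolation; a minimal Python sketch of the λ* formula (the function name is ours):

```python
def parabola_vertex(l1, l2, l3, t1, t2, t3):
    """Minimizer of the parabola through (l1,t1), (l2,t2), (l3,t3),
    via the a_ij, b_ij formula of step 1 in Algorithm 255."""
    a23, a31, a12 = l2 - l3, l3 - l1, l1 - l2
    b23, b31, b12 = l2**2 - l3**2, l3**2 - l1**2, l1**2 - l2**2
    return 0.5 * (b23 * t1 + b31 * t2 + b12 * t3) / (a23 * t1 + a31 * t2 + a12 * t3)

# theta(l) = l^2 + 2l sampled at 0, 1, 3: the vertex -1 is recovered exactly
print(parabola_vertex(0.0, 1.0, 3.0, 0.0, 3.0, 15.0))   # -> -1.0
```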

1.8.4 Inexact line search

Armijo’s rule. There are many rules that do not try to minimize θ(λ) but search for an improved value of θ(λ) in such a way that an outer multivariate algorithm still converges (Explain by figure!).

Algorithm 256.

Choose 0 < δ < 1 (0.2 is recommended) and α > 1 (2 is recommended). The computational procedure uses θ*(λ) = θ(0) + λδθ′(0). The initial iterate λ* is given.

1. If θ(λ*) ≤ θ*(λ*) then multiply λ* by the coefficient α repeatedly until a t is found that satisfies θ(α^t λ*) ≤ θ*(α^t λ*) and in addition θ(α^(t+1) λ*) > θ*(α^(t+1) λ*). Then λ* := α^t λ*.

2. If θ(λ*) > θ*(λ*) then divide λ* by the coefficient α repeatedly until a t is found that satisfies θ(λ*/α^t) ≤ θ*(λ*/α^t) and in addition θ(λ*/α^(t−1)) > θ*(λ*/α^(t−1)). Then λ* := λ*/α^t.
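A minimal Python sketch of Algorithm 256 (names are ours; the enlarging loop assumes θ is bounded below so the search terminates):

```python
def armijo_step(theta, dtheta0, lam=1.0, delta=0.2, alpha=2.0):
    """Armijo's rule (Algorithm 256): accept lambda with
    theta(lambda) <= theta(0) + lambda*delta*theta'(0), pushed as far as possible."""
    theta0 = theta(0.0)
    accept = lambda l: theta(l) <= theta0 + l * delta * dtheta0
    if accept(lam):
        while accept(lam * alpha):      # step 1: enlarge while still acceptable
            lam *= alpha
    else:
        while not accept(lam):          # step 2: shrink until acceptable
            lam /= alpha
    return lam

# theta(l) = (l - 1)^2 - 1, so theta'(0) = -2; the step 1.0 is kept, 2.0 is rejected
print(armijo_step(lambda l: (l - 1.0)**2 - 1.0, -2.0))   # -> 1.0
```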

1.8.5 Convergence

Line search algorithmic map. Since the line search is a basic component of most NLP algorithms, we will analyse its convergence by the concept of a closed map (Do not memorize! But understand!). We introduce

M(x; d) = {y | ∃λmin ∈ L : y = x + λmind, ∀λ ∈ L : f(y) ≤ f(x + λd)}.

M is generally a point-to-set mapping (Why? Think about the uniqueness of λmin).

Theorem 257.

Let f : IRn −→ IR be continuous at x, let L be a closed interval in IR, and d ≠ 0. Then the above introduced M : IRn × IRn −→ 2^(IRn) is closed at (x; d).

Proof: We want to show that y ∈ M(x, d) whenever (xk, dk) −→ (x, d), yk ∈ M(xk; dk), and yk −→ y. From yk = xk + λkdk and d ≠ 0 ⇒ dk ≠ 0 (for k large enough) we obtain λk = ‖yk − xk‖ / ‖dk‖. As k −→ ∞, the right-hand side converges: ‖yk − xk‖ / ‖dk‖ −→ ‖y − x‖ / ‖d‖ =: λmin, so λk −→ λmin and y = x + λmin d. Furthermore, by closedness of L: λk ∈ L ⇒ λmin ∈ L, and by continuity of f, from f(yk) ≤ f(xk + λdk), ∀λ ∈ L, ∀k, we may conclude that f(y) ≤ f(x + λd), ∀λ ∈ L. Thus y ∈ M(x, d) and the proof is complete. □

Remark 258.

In addition, if the mapping D identifying the direction d is also closed, then by the general algorithm-related theorems the composed mapping A = MD is closed. See also Bazaraa-Shetty, page 283, for an example where d = 0 causes M not to be closed.

1.9 Multidimensional Search Without Using Derivatives

Methods that do not use derivatives are popular because they are robust and easily implementable (Do you agree?). They are suitable for modest-size engineering problems that are not solved too often. They are also advantageous in cases when the derivatives (or their approximations) are hardly available. There are complex computational systems (e.g., Fluent, Ansys) that for the given input vector x ∈ IRn generate the output y ∈ IRm by a transformation g : IRn −→ IRm, which is not explicitly known. For instance, it may be hidden in the program. In such a case, when yi is minimized and other variables are free, one of the methods proposed in this section may be employed.

1.9.1 The Cyclic Coordinate Method

Basic idea. It is a trivial and slowly convergent method (You have to know it!). The main idea is to repeatedly realize line searches along the coordinate (axis) directions dj. It often results in zig-zag behaviour, see Figure 1.3.

70 Algorithm 259.

Choose ε > 0 and directions dj := ej for j = 1, ..., n, where ej are coordinate directions. Choose an initial point x1, set y1 := x1 and j := 1, k := 1.

1. Let λj be an optimal solution of ? ∈ argmin{f(yj + λdj) | λ ∈ IR} and let yj+1 := yj + λjdj. If j < n then j := j + 1 and GOTO 1. If j = n then continue.

2. Let xk+1 := yn+1. If ‖xk+1 − xk‖ < ε then STOP. Otherwise y1 := xk+1, j := 1, k := k + 1 and GOTO 1.
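A minimal Python sketch of Algorithm 259, using scipy's minimize_scalar as the inner line search (function names are ours):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def cyclic_coordinate(f, x1, eps=1e-6, max_outer=100):
    """Algorithm 259: line searches along the coordinate directions e_1..e_n."""
    x = np.asarray(x1, dtype=float)
    n = len(x)
    for _ in range(max_outer):
        y = x.copy()
        for j in range(n):
            d = np.zeros(n); d[j] = 1.0
            lam = minimize_scalar(lambda t: f(y + t * d)).x   # univariate minimization
            y = y + lam * d
        if np.linalg.norm(y - x) < eps:
            return y
        x = y
    return x

f = lambda x: (x[0] - 2)**4 + (x[0] - 2 * x[1])**2     # Example 260's problem
print(cyclic_coordinate(f, [0.0, 3.0]))                # slowly approaches (2, 1)
```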

Method convergence. It follows immediately from the theorem about linearly independent directions under the following assumptions (Why? What is D?):

1. The minimum of f along any line in IRn is unique.

2. The sequence of points generated by the algorithm is contained in a compact subset of IRn.

Example 260.

Solve the problem min [(x1 − 2)^4 + (x1 − 2x2)^2] by Algorithm 259. Remember that the results in tables are rounded or truncated; transposition is very often omitted.

k   xk            f(xk)   j   dj          yj             λj      yj+1
1   (0.00; 3.00)  52.00   1   (1.0; 0.0)  (0.00; 3.00)    3.13   (3.13; 3.00)
                          2   (0.0; 1.0)  (3.13; 3.00)   −1.44   (3.13; 1.56)
2   (3.13; 1.56)   1.63   1   (1.0; 0.0)  (3.13; 1.56)   −0.50   (2.63; 1.56)
                          2   (0.0; 1.0)  (2.63; 1.56)   −0.25   (2.63; 1.31)
3   (2.63; 1.31)   0.16   1   (1.0; 0.0)  (2.63; 1.31)   −0.19   (2.44; 1.31)
                          2   (0.0; 1.0)  (2.44; 1.31)   −0.09   (2.44; 1.22)
4   (2.44; 1.22)   0.04   1   (1.0; 0.0)  (2.44; 1.22)   −0.09   (2.35; 1.22)
                          2   (0.0; 1.0)  (2.35; 1.22)   −0.05   (2.35; 1.17)

Figure 1.2: Illustration of the golden section rule.

1.9.2 The Method of Hooke and Jeeves

Acceleration step. The cyclic coordinate method is accelerated (and zig-zag appears less often) by including additional accelerating steps using the direction xk+1 − xk. See Figure 1.4 (And use the figure!).

71 Algorithm 261 (Hooke-Jeeves).

Choose ε > 0, directions dj := ej for j = 1, ..., n, an initial point x1, and assign y1 := x1 and j := 1, k := 1.

1. Let λj be an optimal solution of ? ∈ argmin{f(yj + λdj) | λ ∈ IR} and set yj+1 := yj + λjdj. If j < n then j := j + 1 and GOTO 1. If j = n then xk+1 := yn+1; if ‖xk+1 − xk‖ < ε then STOP, otherwise continue.

2. Let d := xk+1 − xk and let λ̂ solve ? ∈ argmin{f(xk+1 + λd) | λ ∈ IR}. Determine y1 := xk+1 + λ̂d, set j := 1, k := k + 1, and GOTO 1.
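A minimal Python sketch of Algorithm 261, again with scipy's minimize_scalar as the inner line search (names are ours):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def hooke_jeeves(f, x1, eps=1e-6, max_outer=100):
    """Algorithm 261: cyclic coordinate sweep plus an accelerating
    line search along d = x_{k+1} - x_k."""
    x = np.asarray(x1, dtype=float)
    n = len(x)
    y = x.copy()
    for _ in range(max_outer):
        for j in range(n):                      # step 1: coordinate sweep
            d = np.zeros(n); d[j] = 1.0
            lam = minimize_scalar(lambda t: f(y + t * d)).x
            y = y + lam * d
        x_new = y
        if np.linalg.norm(x_new - x) < eps:
            return x_new
        d = x_new - x                           # step 2: acceleration step
        lam = minimize_scalar(lambda t: f(x_new + t * d)).x
        y = x_new + lam * d
        x = x_new
    return x

f = lambda x: (x[0] - 2)**4 + (x[0] - 2 * x[1])**2     # Example 262's problem
print(hooke_jeeves(f, [0.0, 3.0]))                     # converges to (2, 1)
```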

Method convergence. The idea of the convergence proof lies in the decomposition of the method into two maps: B denotes the cyclic coordinate method already discussed, and the accelerating step is described by a map C. With the additional assumptions that f is differentiable with a unique minimum along any line, that α = f satisfies ∀x ∈ IRn \ Ω, ∀y ∈ B(x): α(y) < α(x) and ∀z ∈ C(y): α(z) ≤ α(y), where Ω = {x | ∇f(x) = 0}, and that {x | f(x) ≤ f(x1)} is compact for the starting point x1, the convergence is established by the general convergence theorem for composed maps.

Example 262.

Solve the problem min [(x1 − 2)^4 + (x1 − 2x2)^2] by Algorithm 261.

k   xk            f(xk)      j   dj          yj             λj      yj+1           d                λ̂      y1 = yn+1 + λ̂d
1   (0.00; 3.00)  52.00      1   (1.0; 0.0)  (0.00; 3.00)    3.13   (3.13; 3.00)
                             2   (0.0; 1.0)  (3.13; 3.00)   −1.44   (3.13; 1.56)   (3.13; −1.44)   −0.10   (2.82; 1.70)
2   (3.13; 1.56)   1.63      1   (1.0; 0.0)  (2.82; 1.70)   −0.12   (2.70; 1.70)
                             2   (0.0; 1.0)  (2.70; 1.70)   −0.35   (2.70; 1.35)   (−0.43; −0.21)   1.50   (2.06; 1.04)
3   (2.70; 1.35)   0.24      1   (1.0; 0.0)  (2.06; 1.04)   −0.02   (2.04; 1.04)
                             2   (0.0; 1.0)  (2.04; 1.04)   −0.02   (2.04; 1.02)   (−0.66; −0.33)   0.06   (2.00; 1.00)
4   (2.04; 1.02)   0.000003  1   (1.0; 0.0)  (2.00; 1.00)   −0.01   (2.00; 1.00)
                             2   (0.0; 1.0)  (2.00; 1.00)   −0.01   (2.00; 1.00)

It is important to note that univariate minimization can be replaced by imprecise algorithms (e.g., discrete steps, Armijo’s rule). Other methods without derivatives may be found, e.g., in Bazaraa-Shetty. They use different sets of directions d1,..., dn.

Figure 1.3: Illustration of the cyclic coordinate method.

1.9.3 The method of Nelder and Mead

Moving simplex. The method of the moving simplex (do not confuse it with the simplex method of LP!) is often used by engineers. Its main idea is to choose points x1, ..., xn+1 to form a simplex. The algorithm moves and deforms the simplex to achieve the situation when the searched minimum becomes an interior point of the simplex. By biological analogy this method is often implemented as AMOEBA (Be able to explain certain algorithm steps by figures!). The method is robust but slow, so it is suitable for n ≤ 10.

72 Algorithm 263.

Choose points x1, ..., xn+1 ∈ IRn to form a simplex. Choose a coefficient of reflection α > 0, a coefficient of contraction β < 1, and a coefficient of expansion γ > 1.

1. Search for r and s such that f(xr) = min_{1≤j≤n+1} f(xj) and f(xs) = max_{1≤j≤n+1} f(xj). If the points are near enough then STOP and the minimum is found. Otherwise, compute x̄ := (1/n) Σ_{j=1, j≠s}^{n+1} xj.

2. Let x̂ := x̄ + α(x̄ − xs). If f(xr) > f(x̂) then xe := x̄ + γ(x̂ − x̄) and GOTO 3. If f(xr) ≤ f(x̂), then GOTO 4.

3. If f(x̂) > f(xe) then xs := xe. If f(x̂) ≤ f(xe) then xs := x̂. In both cases GOTO 1.

4. If max_{1≤j≤n+1}{f(xj) | j ≠ s} ≥ f(x̂) then xs := x̂ and GOTO 1. Otherwise GOTO 5.

5. Compute x′ using f(x′) = min{f(x̂); f(xs)} and x″ := x̄ + β(x′ − x̄). If f(x″) > f(x′) then xj := xj + (1/2)(xr − xj) for j = 1, ..., n + 1 and GOTO 1. If f(x″) ≤ f(x′) then xs := x″ and GOTO 1.
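An AMOEBA-style implementation is available, e.g., in scipy; a short usage sketch on the running example (the tolerance values are only illustrative):

```python
import numpy as np
from scipy.optimize import minimize

f = lambda x: (x[0] - 2)**4 + (x[0] - 2 * x[1])**2
res = minimize(f, x0=np.array([0.0, 3.0]), method='Nelder-Mead',
               options={'xatol': 1e-8, 'fatol': 1e-8})
print(res.x)        # close to (2, 1)
```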

1.10 Multidimensional Search Using Derivatives

1.10.1 Steepest descent (gradient) method

Remark 264 (Main idea).

It is a classical method (Gauss) for multivariate differentiable function minimization. d is a descent direction (feasible because S = IRn) of f at x ⇔ ∃δ > 0 : ∀λ ∈ (0, δ): f(x + λd) < f(x). In particular: if lim_{λ−→0+} (f(x + λd) − f(x))/λ < 0 then d is a descent direction. To find a descent direction explicitly we should minimize the directional derivative (see the limit above) under the condition ‖d‖ = 1.

Lemma 265 (Steepest descent direction).

Let f : S −→ IR be differentiable at x, ∇f(x) ≠ 0. Then the steepest descent direction d̄ satisfies d̄ = −∇f(x)/‖∇f(x)‖ ∈ argmin{f′(x, d) | ‖d‖ ≤ 1}.

Proof: By differentiability and definitions, f′(x, d) = lim_{λ−→0+} (f(x + λd) − f(x))/λ = ∇f(x)^T d (Taylor formula in use). So argmin_d{∇f(x)^T d | ‖d‖ ≤ 1} has to be solved. We know that ∇f(x)^T d ≥ −‖∇f(x)‖‖d‖ ≥ −‖∇f(x)‖. Hint: See the formula for the cosine of the angle between the vectors ∇f(x) and d, and ask when it is equal to −1. □

73 So, −∇f(x) is a direction of the steepest descent of f at x, and hence, the method is also called the gradient method.

Algorithm 266 (Gradient method).

We choose ε > 0, point x1, k := 1.

1. If ‖∇f(xk)‖ < ε then STOP. Otherwise we assign dk := −∇f(xk). We get a solution λk of min{f(xk + λdk) | λ ≥ 0}, and then we define xk+1 := xk + λkdk and k := k + 1, GOTO 1.
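A minimal Python sketch of Algorithm 266 on the running example (names are ours; the bounded line search interval is an arbitrary illustrative choice):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def steepest_descent(f, grad, x1, eps=1e-5, max_iter=200):
    """Algorithm 266: d_k = -grad f(x_k) with a (nearly) exact line search."""
    x = np.asarray(x1, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < eps:
            break
        d = -g
        lam = minimize_scalar(lambda t: f(x + t * d), bounds=(0, 10), method='bounded').x
        x = x + lam * d
    return x

f    = lambda x: (x[0] - 2)**4 + (x[0] - 2 * x[1])**2
grad = lambda x: np.array([4 * (x[0] - 2)**3 + 2 * (x[0] - 2 * x[1]),
                           -4 * (x[0] - 2 * x[1])])
print(steepest_descent(f, grad, [0.0, 3.0]))    # crawls towards (2, 1)
```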

Example 267.

Solve the program min [(x1 − 2)^4 + (x1 − 2x2)^2] by the gradient method (Algorithm 266).

k   xk            f(xk)   ∇f(xk)            ‖∇f(xk)‖   dk = −∇f(xk)      λk      xk+1
1   (0.00; 3.00)  52.00   (−44.00; 24.00)    50.12     (44.00; −24.00)   0.062   (2.70; 1.51)
2   (2.70; 1.51)   0.34   (0.73; 1.28)        1.47     (−0.73; −1.28)    0.24    (2.52; 1.20)
3   (2.52; 1.20)   0.09   (0.80; −0.48)       0.93     (−0.80; 0.48)     0.11    (2.43; 1.25)
4   (2.43; 1.25)   0.04   (0.18; 0.28)        0.33     (−0.18; −0.28)    0.31    (2.37; 1.16)

Remark 268 (Convergence of the steepest descent method (Optional)).

Let the set Ω = {x̄ | ∇f(x̄) = 0} be the solution set, α(x) = f(x) the descent function, and A = MD the composed algorithmic map, where D : x ↦ [x, −∇f(x)] (for the given point, a new point and descent direction are generated). Therefore, D is a function. If f is continuously differentiable then D is continuous. M was already discussed with the line search methods (check it!). There, M : [x, d] ↦ {y | y = x + λmin d}, where λmin ∈ argmin{f(x + λd) | λ ∈ L}. Then M is closed for L a closed interval, f continuous, and d ≠ 0. Why? We had (xk, dk) −→ (x, d), yk derived by M from (xk, dk), and yk −→ y, and we have to prove M : (x, d) ↦ y. So yk = xk + λkdk; as d ≠ 0 and dk −→ d, we have dk ≠ 0 for big k. So yk − xk = λkdk, ‖yk − xk‖ = |λk|‖dk‖, and λk = ‖yk − xk‖/‖dk‖. Therefore, for k −→ ∞, we have ‖yk − xk‖/‖dk‖ −→ ‖y − x‖/‖d‖ =: λmin (and λk −→ λmin). So yk = xk + λkdk and, because of convergence, we get y = x + λmin d. Because L is closed, λmin ∈ L, so y has the form required by M. But is λmin a true minimum? ∀λ ∈ L, ∀k : f(yk) ≤ f(xk + λdk), and because of the limit and continuity of f: f(y) ≤ f(x + λd). So we have D continuous and M closed. Because of Corollary 231, A = MD is closed. The last step of the convergence proof is to show that ∀x ∉ Ω: y ∈ A(x) ⇒ α(y) < α(x). But y = x + λmin d and d = −∇f(x) ≠ 0. So ∇f(x)^T d = ∇f(x)^T(−∇f(x)) = −‖∇f(x)‖² < 0, and by unconstrained optimization theory d is a descent direction (check your notes!), so from α(·) = f(·) we get f(y) < f(x). At the end, we need to suppose that {xk} generated by the algorithm is contained in a compact set to complete the proof (global convergence, i.e. for every x1 the sequence converges to Ω). E.g., it suffices that f(x) −→ ∞ whenever ‖x‖ −→ ∞.

The algorithm looks easily implementable and it converges. Why is it not often used? Because it works well at the beginning of the iteration process and for nice functions, but near the minimum and for ‘narrow valleys’ a zigzagging effect may appear. Why? Intuitively, because near the minimum ∇f(xk) −→ 0. In detail: f(xk + λd) = f(xk) + λ∇f(xk)^T d + λ‖d‖α(xk; λd) (see the Taylor expansion), where d is a search direction, ‖d‖ = 1, and for λd −→ 0 we have α(xk; λd) −→ 0. In the case of the steepest descent method, we use d = −∇f(xk)/‖∇f(xk)‖, and after substitution the second term of the Taylor series is λ∇f(xk)^T d = −λ‖∇f(xk)‖; hence the objective function changes slowly. (It is even worse if α, involving the Hessian describing curvature, contributes significantly to the description of f.)

Exercise 269.

Create and compute your own examples. Draw contour graphs (e.g., using Matlab functions). Use quadratic bivariate functions to be able to compute the univariate minima (λmin) easily (solving the simple linear equation derived from θ′(λ) = 0 instead of using the line search).

Remark 270 (Convergence rate).

The theoretical convergence rate is linear. In practical cases it may be destroyed because of the zigzag effect. The reason is that although −∇f(xk) is the steepest descent direction, this is true only locally (linear approximation). And if the true descent direction turns away from it, problems arise; see, e.g., computations on the Rosenbrock function (see the Matlab demo banana function f(x1, x2) = 100(x2 − x1²)² + (1 − x1)²). Therefore, the line search allows just a short step.

Remark 271 (Terminology).

Remember the difference between the global convergence property (i.e., for all x1 satisfying certain assumptions the sequence {xk} converges to Ω) and convergence to a global minimum!

Exercise 272.

Why are ∇f(xk+1) and ∇f(xk) orthogonal? Show using simple examples! Explain theoretically. Hint: Use that from θ(λ) = f(xk + λdk) we obtain 0 = θ′(λk) = dk^T ∇f(xk + λkdk).

Exercise 273.

Discuss why the steepest descent algorithm works badly for problems with ill-conditioned functions of the narrow and elongated valley type.

Our discussion leads to the conclusion that from the global convergence viewpoint it is better to continue locally in a non-steepest direction, i.e. to modify it slightly, e.g. turn it. In general we have two main possibilities: either to use historical information (see the conjugate gradient method later) to update ∇f(xk), or to use the information about the function curvature obtained from the Hessian matrix of f, denoted H(x). It is important to remember that the steepest descent (gradient) method uses only a linear approximation (see the use of f′(x, d) previously). Therefore, only incomplete information about f is used for the direction choice.

Exercise 274.

Compare the numerical behaviour for two problems: min{x1² + x2²} and min{100x1² + x2²}. Draw contour graphs and compute eigenvalues. Identify condition numbers defined as the ratio of the maximal and minimal eigenvalues. Then use the following formula for the estimation of the convergence rate of the steepest descent method:

|f(xk+1) − f(xmin)| ≤ ((cond H(xmin) − 1) / (cond H(xmin) + 1))² |f(xk) − f(xmin)|.

Compare the bounds with the results of iterations. The formula is related to β, and the condition number of the Hessian equals the ratio of its largest and smallest eigenvalues.

Exercise 275.

Explain the concept of linear convergence rate on the selected geometric sequence. Hint: Use the idea rk+1 = qrk where rk = |f(xk) − f(xmin)|.

Convergence rate analysis. Let {rk} be a sequence satisfying rk −→ r¯. It may be generated, e.g., by f(xk) or α(xk) values. Let:

p0 = sup{p | lim sup_{k−→∞} |rk+1 − r̄| / |rk − r̄|^p = β < ∞},

where β is a ratio of convergence (if it is smaller for the same p then faster convergence may occur) and p is an order of convergence (larger p may also cause faster convergence). Remember that the formula above describes the limiting behaviour of the algorithm. See the following example:

Example. Consider the sequence defined by rk+1 = (rk + 1)/2. For k −→ ∞, we get rk −→ 1, so r¯ = 1. Then:

lim_{k−→∞} |rk+1 − 1| / |rk − 1| = lim_{k−→∞} (1/2)|rk − 1| / |rk − 1| = 1/2 < ∞

We may denote ek = rk − r̄ and then |ek| = |rk − r̄|, or in the vector case ‖ek‖ = ‖rk − r̄‖. Without considering the limit, we may write |ek+1| = β|ek|^p, or |ek+1| = β|ek| for p = 1. Therefore, the geometric sequence can be taken as an example of the error decrease. Compare 1, 0.1, 0.01, 0.001, ... for β = 0.1 and e0 = 1 with either β = 0.01 (1, 0.01, 0.0001, 0.000001, ...) or β = 0.99 (1, 0.99, ...). When β > 1 then the sequence diverges. With higher p (e.g. p = 2) we get faster convergence and the error decrease 1, 0.1, 0.001, 0.0000001, .... In our example, p cannot be equal to 2 because:

lim_{k−→∞} |rk+1 − 1| / |rk − 1|² = lim_{k−→∞} (1/2) · 1/|rk − 1| = ∞

1.10.2 Newton’s method

While the gradient method works with a linear approximation of the objective f, this method utilizes its quadratic approximation (using the higher, second, derivative):

q(x) = f(xk) + ∇f(xk)^T (x − xk) + (1/2)(x − xk)^T H(xk)(x − xk).

The necessary optimality condition ∇q(x) = 0 may be applied:

H(xk)(x − xk) = −∇f(xk). (1.4)

We simplify the formula to derive x explicitly from it and we get a new approximation xk+1, so:

xk+1 = xk − H^(-1)(xk)∇f(xk).

Note that the formula fits the principle xk+1 := xk + λkdk with λk = 1 and dk = −H^(-1)(xk)∇f(xk).

Algorithm 276 (Newton).

Choose ε > 0, x1, k := 1.

1. If ‖∇f(xk)‖ < ε, STOP. Otherwise dk := −H^(-1)(xk)∇f(xk), and so xk+1 := xk + dk, k := k + 1 and GOTO 1.
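A minimal Python sketch of Algorithm 276, solving the linear system instead of inverting H (cf. item 5 in the list of problems below; names are ours):

```python
import numpy as np

def newton_method(grad, hess, x1, eps=1e-8, max_iter=50):
    """Algorithm 276: solve H(x_k) d_k = -grad f(x_k) for the Newton step."""
    x = np.asarray(x1, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < eps:
            break
        x = x + np.linalg.solve(hess(x), -g)
    return x

grad = lambda x: np.array([4 * (x[0] - 2)**3 + 2 * (x[0] - 2 * x[1]),
                           -4 * (x[0] - 2 * x[1])])
hess = lambda x: np.array([[12 * (x[0] - 2)**2 + 2, -4.0],
                           [-4.0,                    8.0]])
print(newton_method(grad, hess, [0.0, 3.0]))   # approaches (2, 1), cf. Example 277
```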

Example 277.

Solve min [(x1 − 2)^4 + (x1 − 2x2)^2] by Algorithm 276.

(Matrices are written row by row.)

k   xk            f(xk)   ∇f(xk)           H(xk)                   H^(-1)(xk)                    dk              xk+1
1   (0.0; 3.0)    52.0    (−44.0; 24.0)    [50.0 −4.0; −4.0 8.0]   (1/384)[8.0 4.0; 4.0 50.0]    (0.67; −2.67)   (0.67; 0.33)
2   (0.67; 0.33)   3.13   (−9.39; −0.04)   [23.2 −4.0; −4.0 8.0]   (1/170)[8.0 4.0; 4.0 23.2]    (0.44; 0.23)    (1.11; 0.56)
3   (1.11; 0.56)   0.63   (−2.84; −0.04)   [11.5 −4.0; −4.0 8.0]   (1/76.0)[8.0 4.0; 4.0 11.5]   (0.30; 0.14)    (1.41; 0.70)
4   (1.41; 0.70)   0.12   (−0.80; −0.04)   [6.18 −4.0; −4.0 8.0]   (1/33.4)[8.0 4.0; 4.0 6.18]   (0.20; 0.10)    (1.61; 0.80)

When the method converges, the theoretical convergence rate is quadratic. If the starting point is ‘close enough’ (see literature) to xmin (a point satisfying ∇f(xmin) = 0) and r(H(xmin)) = n (full rank), then it converges with order two (quadratically). Even one-step convergence is guaranteed for strictly convex quadratic functions. However, it is impossible to verify the theoretical sufficient convergence conditions during computations, so the method may accidentally diverge. In practice, the method converges quickly when near to the minimum. So, the important problems to be considered are:

1. H(xk) may be singular and the inverse matrix H^(-1)(xk) does not necessarily exist. The later idea will be to replace H(xk) with a PD approximating matrix (e.g., the Marquardt-Levenberg method).

2. For regular H(xk), the direction dk = −H^(-1)(xk)∇f(xk) is not necessarily descent, and so f(xk+1) < f(xk) is not guaranteed. Again the idea will be to use the aforementioned PD approximation.

3. If xk+1 − xk is not a descent step because of λk = 1, then we may introduce the formula with λk (cf. xk+1 := xk − λkH^(-1)(xk)∇f(xk)) and use a trust region method.

4. As no global convergence property can be expected, we may use Newton’s method only near the optimum, or approximate H^(-1)(xk).

5. Regarding the storage requirements: H(xk) needs an n × n table, and dk may be obtained from the system of linear equations H(xk)dk = −∇f(xk) instead of computing H^(-1)(xk) explicitly. There are also memory-less quasi-Newton methods discussed later.

6. Computational difficulties are based on the derivation of H(x), the evaluation of H(xk), and the solution of a system of linear equations to obtain dk. Again the approximation of H^(-1)(xk) may help.

We may also discuss the relation between the gradient and Newton methods (cf. xk+1 := xk + λkdk and xk+1 := xk − H^(-1)(xk)∇f(xk)). If H(xk) is PD then H^(-1)(xk) is also PD, and so ∇f(xk)^T (−H^(-1)(xk)∇f(xk)) < 0. Therefore, dk = −H^(-1)(xk)∇f(xk) is a descent direction here (cf. convex function properties!). If H^(-1)(xk) exists then ∃L (a lower triangular matrix with diagonal elements greater than zero) such that H^(-1)(xk) = LL^T (Cholesky decomposition). We further think about the transformation x = Ly. Then f(x) = f(Ly) = F(y) (F(·) = f(L·) here), xk = Lyk, yk = L^(-1)xk, and ∇F(yk) = L^T ∇f(xk). So, we may use the special (gradient-method-like) iteration step in the y space (after ‘scaling’): yk+1 = yk − 1 · ∇F(yk) (here λk = 1). Then, by substitution, L^(-1)xk+1 = L^(-1)xk − L^T ∇f(xk), and finally, xk+1 = xk − LL^T ∇f(xk), which is the original Newton method step with H^(-1)(xk) = LL^T.

The previous discussion offers the idea to start with the gradient method and to finish with Newton’s method (the Marquardt-Levenberg method for LSQ nonlinear regression). xk+1 is derived from (µkI + H(xk))(x − xk) = −∇f(xk), where µk is the Marquardt parameter. µk is chosen such that all eigenvalues λi of the matrix Mk = µkI + H(xk) satisfy ∃δ : λi ≥ δ > 0. So, the idea is to check whether Mk is PD by trying the LL^T factorization. If it is not PD then increase µk. If it is PD, solve Mk(xk+1 − xk) = −∇f(xk) by the lower triangular factorization, LL^T (xk+1 − xk) = −∇f(xk). Precisely, compute f(xk+1) and evaluate Rk = (f(xk+1) − f(xk)) / (q(xk+1) − q(xk)), where q(x) is the quadratic approximation of f at xk. Then, the following heuristic is utilized to update µk and k := k + 1:

Rk < 0.25 ⇒ µk+1 := 4µk,

Rk > 0.75 ⇒ µk+1 := 0.5µk,

0.25 ≤ Rk ≤ 0.75 ⇒ µk+1 := µk,

Rk < 0 (no progress) ⇒ restart xk.

So, positive eigenvalues imply a PD matrix that is regular, and so invertible. Then, the iteration formula xk+1 := xk − Mk^(-1)∇f(xk) uses descent directions (as 0 > ∇f(xk)^T dk = −∇f(xk)^T Mk^(-1)∇f(xk)).

Trust region methods use the formula xk+1 := xk − λkH^(-1)(xk)∇f(xk), and λk is chosen such that ‖xk+1 − xk‖ is not too large, so the step is restricted. This region (‘region of trust’) is chosen such that the q(x) (quadratic) approximation at xk is reliable, so Rk is evaluated. If Rk is small then ∆k is decreased. If Rk is big (near to 1) then ∆k is expanded. Then xk+1 is found:

xk+1 ∈ argmin_x{q(x) | x ∈ Ω = {x | ‖x − xk‖ < ∆k}}.

For the discussion of the relation of the method to xk+1 := xk − λkH^(-1)(xk)∇f(xk) (also with Mk^(-1) used instead of H^(-1)(xk)), and for other ways to compute xk+1, like the dog-leg trajectory method or the PARTAN method, see the literature.

1.10.3 Conjugate directions

Improvement of the gradient method. The question is how to improve the convergence of the ∇f method. The idea is based on several facts: at xmin, we have ∇f(xmin) = 0, and from the Taylor series we get q(x) = f(xmin) + (1/2)(x − xmin)^T H(xmin)(x − xmin) with H(xmin) PD (positive definite). Then f behaves in Nε(xmin) like a strictly convex quadratic function! So, a general method, to be efficient, should quickly converge on quadratic functions; otherwise it is slow near xmin.

Definition 278 (D-conjugacy).

Let D be an n × n symmetric matrix. Vectors d1, ..., dn are called conjugate (with respect to D) iff they are linearly independent and ∀i, j = 1, ..., n, i ≠ j: di^T Ddj = 0.

Remark 279 (Separability achieved by conjugate directions).

Let us show the importance of conjugate directions. Assume that we minimize a quadratic function f(x) = c^T x + (1/2)x^T Dx (D symmetric). Then, because of the linear independence of d1, ..., dn, we may express an arbitrary x ∈ IRn as a sum of the initial point x1 and a linear combination of conjugate directions, i.e. x = x1 + Σ_{j=1}^{n} λjdj. We substitute it in f(x) and obtain a function of the variables λ1, ..., λn denoted as λ = (λ1, ..., λn)^T, so

F(λ) = c^T (x1 + Σ_{j=1}^{n} λjdj) + (1/2) (x1 + Σ_{j=1}^{n} λjdj)^T D (x1 + Σ_{j=1}^{n} λjdj).

Exercise 280.

Simplify and show that because of conjugate directions F is separable in λ1, . . . , λn. Answer why it can be helpful. Hint:

F(λ) = c^T x1 + (1/2)x1^T Dx1 + Σ_{j=1}^{n} λj(c^T + x1^T D)dj + (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} λi di^T D λj dj
     = c^T x1 + (1/2)x1^T Dx1 + Σ_{j=1}^{n} (λj(c^T + x1^T D)dj + (1/2)λj² dj^T Ddj).

Think about multivariate minimization realized by repeated independent univariate minimizations (leading, for conjugate directions, to λj,min and hence to xmin). Hint: For positive definite D, we have (using the derivatives F′_{λj}(λj) = 0):

λj,min = −(c^T dj + x1^T Ddj) / (dj^T Ddj).

So, λj,min, j = 1, . . . , n are explicitly computed and repeatedly used. Then, the sum is computed. It is similar when f is sequentially minimized from x1 along directions dj, j = 1, . . . , n.

Notice that conjugate directions may also be used in algorithms without derivatives.

Example 281.

Solve min (−12x2 + 4x1² + 4x2² − 4x1x2). Derive the Hessian matrix H and choose d1 = (1; 0)^T. If d2 = (a; b)^T then 0 = d1^T Hd2 = 8a − 4b. Therefore, d2 is not unique. Select d2 = (1; 2)^T. Minimize f from x1 = (−0.5; 1)^T along d1 and get x2 = (0.5; 1)^T. Continue along d2 and get xmin = x3 = (1; 2)^T.

Emphasize that minimization of a quadratic function using conjugate directions requires (only) n univariate minimizations.

Conjugate directions for quadratic functions: We may use the following algorithm: At first j := 1. Main step: Start at xj, follow dj to obtain λj,min and xj+1. Then, j := j + 1. If j = n + 1 then STOP and xmin := xn+1, otherwise repeat the main step.

Theorem: Let f(x) = c^T x + (1/2)x^T Dx (and a minimum exists), where D is an n × n symmetric matrix, d1, ..., dn are D-conjugate, and x1 ∈ IRn is a starting point. Let ∀k = 1, ..., n : λk,min ∈ argmin{f(xk + λkdk) | λk ∈ IR}, where xk+1 := xk + λk,min dk. Then ∀k = 1, ..., n :

1. ∇f(xk+1)^T dj = 0, j = 1, ..., k.

2. ∇f(x1)^T dk = ∇f(xk)^T dk.

3. xk+1 ∈ argmin_x{f(x) | x − x1 ∈ L(d1, ..., dk)}, where L(d1, ..., dk) = {Σ_{j=1}^{k} µjdj | ∀j : µj ∈ IR}, and xn+1 ∈ argmin_x{f(x) | x ∈ IRn}.

Proof: (1) We already know it (cf. the gradient method; use F′(λ) and ∇f(xk+1) = ∇f(xk + λkdk)). (2) Use conjugate directions (see literature). (3) The proof is technical (see literature). □

Exercise: Interpret the theorem geometrically.

Explanatory remarks: The conjugate direction method converges finitely, in at most n steps, in the case of minimizing a convex quadratic function f(x) = c^T x + (1/2)x^T Dx. We have xk+1 := xk + λkdk (λk ∈ argmin{f(xk + λdk) | λ ∈ IR}) and xn+1 := x1 + Σ_{j=1}^{n} λjdj. We know that ∇f(xn+1) = 0 (minimum achieved). Then ∇f(xn+1) = Dxn+1 + c = 0 and, from the theorem, 0 = ∇f(xk+1)^T dk = (Dxk+1 + c)^T dk = (D(xk + λkdk))^T dk + c^T dk ⇒ λk = −(Dxk + c)^T dk / (dk^T Ddk), and for xk = x1 + Σ_{j=1}^{k−1} λjdj we obtain dk^T Dxk = dk^T Dx1 + Σ_{j=1}^{k−1} λj dk^T Ddj = dk^T Dx1, and hence,

λk = −dk^T (Dx1 + c) / (dk^T Ddk).

So, only one question remains: how to generate the conjugate directions d1, ..., dn? We have certain degrees of freedom to choose them. Therefore, two classes of algorithms may be derived: (1) conjugate gradient methods, (2) quasi-Newton methods. Or, from the practitioner’s viewpoint, we may consider two basic ideas: (1) cross-country skiing with slow turns (based on averaging of ∇f(xk): conjugate gradients) and (2) downhill skiing by an experienced (risk-less) skier (based on approximation of H^(-1)(xk): quasi-Newton).

1.10.4 Conjugate gradient methods

Remark 282 (Basic idea).

Because of problems with the gradient method, methods of fixed metrics (conjugate gradients) using the information from previous iterations to modify the current gradient direction have been developed.

Algorithm for quadratic functions: We further discuss the conjugate gradient method (related references are Hestenes and Stiefel (1952), who developed the method for systems of linear equations; later, in 1964, Fletcher and Reeves applied it to optimization problems). These methods are often called fixed metric methods because of their relation to metric spaces. We have a quadratic function f(x) = c^T x + (1/2)x^T Dx. The idea is to construct d1, ..., dn D-conjugate directions combining (linearly) xk, ∇f(xk), and {dj}_{j=1}^{k−1}. We denote gk = ∇f(xk) = Dxk + c.

1. Set x1, compute g1 = ∇f(x1) = Dx1 + c, assign d1 := −g1 and k := 1.

2. Compute xk+1 := xk + λkdk, where λk = −gk^T dk / (dk^T Ddk). Update dk+1 := −gk+1 + αkdk, where αk = gk+1^T Ddk / (dk^T Ddk). If k = n + 1 then xn+1 is xmin and STOP. Otherwise, k := k + 1 and GOTO 2.
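A minimal Python sketch of this scheme (names are ours), checked on the quadratic function from Example 281:

```python
import numpy as np

def conjugate_gradient_quadratic(D, c, x1):
    """Minimize f(x) = c^T x + 0.5 x^T D x (D PD) by the conjugate gradient
    scheme above; finite convergence in n inner steps."""
    x = np.asarray(x1, dtype=float)
    g = D @ x + c
    d = -g
    for _ in range(len(x)):
        Dd = D @ d
        lam = -(g @ d) / (d @ Dd)          # exact step along d
        x = x + lam * d
        g_new = D @ x + c
        alpha = (g_new @ Dd) / (d @ Dd)    # conjugacy coefficient
        d = -g_new + alpha * d
        g = g_new
    return x

# Example 281's quadratic: f = 4x1^2 + 4x2^2 - 4x1x2 - 12x2
D = np.array([[8.0, -4.0], [-4.0, 8.0]])
c = np.array([0.0, -12.0])
print(conjugate_gradient_quadratic(D, c, [-0.5, 1.0]))   # -> (1, 2)
```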

It is important to know that (for a quadratic function with an existing minimum) the conjugate gradient algorithm achieves the minimum (converges) after one complete main step (n univariate minimizations, i.e. line searches). The following theorem answers why:

Theorem: Let f(x) be a quadratic function, x1 ∈ IRn, and let the conjugate gradient algorithm be used. Then:

1. d1,..., dn generated by the algorithm are D-conjugate directions.

2. d1,..., dn are descent directions.

3. αk = dk^T D∇f(xk+1) / (dk^T Ddk) = gk+1^T gk+1 / (gk^T gk) = ‖∇f(xk+1)‖² / ‖∇f(xk)‖² = gk+1^T (gk+1 − gk) / (gk^T gk).

Remark: We already know that when the optimum has not yet been reached, then

λk = −gk^T dk / (dk^T Ddk) = −gk^T (−gk + αk−1dk−1) / (dk^T Ddk) = gk^T gk / (dk^T Ddk),

because, by the theorem and conjugate directions, gk^T dk−1 = 0.

Proof: For readers: the proof is rather technical; just check the main ideas and improve your ability to do matrix calculations! Be able to do the small simple computational steps.

gk+1 − gk = D(xk+1 − xk) = λkDdk

gk+1^T Ddk = (1/λk) gk+1^T (gk+1 − gk)

gk+1^T Ddk = (dk^T Ddk / (gk^T gk)) gk+1^T (gk+1 − gk)

αk = gk+1^T Ddk / (dk^T Ddk) = gk+1^T (gk+1 − gk) / (gk^T gk) = gk+1^T gk+1 / (gk^T gk)

because gk = −dk + αk−1dk−1 ∈ L(d1, ..., dk) and gk+1 ⊥ L(d1, ..., dk), so gk+1^T gk = 0. To prove 1. and 2., we use induction to move from d1, ..., dk to dk+1. We obtain 1. dk+1^T Ddk = 0 using the substitution dk+1 = −gk+1 + αkdk (and again αk is replaced). Similarly, dk+1^T Ddj = 0, j = 1, ..., k, by induction and recursion. □

Remark 283 (Convergence).

Under quite general assumptions, conjugate gradient methods converge superlinearly.

Algorithm 284 (Fletcher-Reeves).

Choose ε > 0, x1, d1 := −∇f(x1). Assign y1 := x1 (d1 = −g1 = −∇f(x1) = −∇f(y1)), j := 1, k := 1.

1. If ‖∇f(yj)‖ < ε then the optimum is reached and STOP; otherwise solve min{f(yj + λdj) | λ ≥ 0}. The solution λj is used to compute yj+1 := yj + λjdj. If j = n then GOTO 3. In the opposite case, when j < n, continue with GOTO 2.

2. dj+1 := −∇f(yj+1) + αj^(FR) dj, where

αj^(FR) = ‖∇f(yj+1)‖² / ‖∇f(yj)‖².

Then j := j + 1 and GOTO 1.

3. Define xk+1 := yn+1 (i.e. = yj+1) and y1 := xk+1. Further set up d1 := −∇f(y1), k := k + 1, j := 1, and GOTO 1.

We see that the algorithm is based on the following update of the descent direction: dj+1 := −∇f(yj+1) + αjdj. The coefficient αj is chosen in such a way that the directions dj are conjugate with respect to the Hessian matrix H in the case of a quadratic objective function. Other modifications of conjugate gradient methods are obtained for other choices of αj. Hestenes and Stiefel suggested, for qj = ∇f(yj+1) − ∇f(yj) (also further used in Algorithm 288):

αj^(HS) = ∇f(yj+1)^T qj / (dj^T qj),

and Polak and Ribière suggested to specify

αj^(PR) = ∇f(yj+1)^T qj / ‖∇f(yj)‖².

Exercise: You may show that for quadratic functions f(x), the choices αj^(HS), αj^(PR), and αj^(FR) are equivalent.
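A minimal Python sketch of Algorithm 284 with a selectable αj (FR or PR variant; names are ours, and scipy's minimize_scalar replaces the exact line search):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def conjugate_gradient(f, grad, x1, variant='FR', eps=1e-6, max_outer=50):
    """Algorithm 284 with alpha_j = 'FR' (Fletcher-Reeves) or 'PR'
    (Polak-Ribiere); restart with -grad after n inner steps (step 3)."""
    x = np.asarray(x1, dtype=float)
    n = len(x)
    for _ in range(max_outer):
        y = x.copy()
        g = grad(y)
        d = -g
        for _ in range(n):
            if np.linalg.norm(g) < eps:
                return y
            lam = minimize_scalar(lambda t: f(y + t * d), bounds=(0, 10), method='bounded').x
            y = y + lam * d
            g_new = grad(y)
            if variant == 'FR':
                a = (g_new @ g_new) / (g @ g)
            else:                                  # 'PR'
                a = (g_new @ (g_new - g)) / (g @ g)
            d = -g_new + a * d
            g = g_new
        x = y                                      # step 3: restart
    return x

f    = lambda x: (x[0] - 2)**4 + (x[0] - 2 * x[1])**2
grad = lambda x: np.array([4 * (x[0] - 2)**3 + 2 * (x[0] - 2 * x[1]),
                           -4 * (x[0] - 2 * x[1])])
print(conjugate_gradient(f, grad, [0.0, 3.0]))     # cf. Example 285
```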

Example 285.

Solve min [(x1 − 2)^4 + (x1 − 2x2)^2] by Algorithm 284.

k   xk            f(xk)    j   yj              f(yj)    ∇f(yj)            ‖∇f(yj)‖   αj       dj                 λj
1   (0.00; 3.00)  52.00    1   (0.00; 3.00)    52.00    (−44.00; 24.00)   50.12               (44.00; −24.00)    0.062
                           2   (2.70; 1.51)     0.34    (0.73; 1.28)       1.47      0.0009   (−0.69; −1.30)     0.23
2   (2.54; 1.21)   0.10    1   (2.54; 1.21)     0.10    (0.87; −0.48)      0.99               (−0.87; 0.48)      0.11
                           2   (2.44; 1.26)     0.04    (0.18; 0.32)       0.37      0.14     (−0.30; −0.25)     0.63
3   (2.25; 1.10)   0.008   1   (2.25; 1.10)     0.008   (0.16; −0.20)      0.32               (−0.16; 0.20)      0.10
                           2   (2.23; 1.12)     0.003   (0.03; 0.04)       0.05      0.04     (−0.036; −0.032)   1.02
4   (2.19; 1.09)   0.0017  1   (2.19; 1.09)     0.0017  (0.05; −0.04)      0.06               (−0.05; 0.04)      0.11
                           2   (2.185; 1.094)   0.0012  (0.02; 0.01)       0.02

A practical implementation of the conjugate gradient method requires discussing the restart possibilities (see step 3. in the FR algorithm), the update of the Hessian matrix, and the computation of ∇f(x).

Remark (Convergence): Step 3. (the outer loop) ensures the global convergence property, which is a consequence of the global convergence property of the ∇f method. The influence of the inserted inner-loop step 2. will be discussed later together with the DFP method. See the Bazaraa-Shetty book for details and the restart idea.

Remark (Computational properties): The conjugate gradient methods have useful computational properties as they have modest memory requirements: it is enough to save 3 n-dimensional vectors.

Remark (Convergence rate): Even the convergence rate is superior to the rate of the ∇f method. For the gradient method, we had:

lim sup_{k−→∞} (f(xk+1) − f(xmin)) / (f(xk) − f(xmin)) = β ≤ ((cond H(xmin) − 1) / (cond H(xmin) + 1))² = ((λmax − λmin) / (λmax + λmin))².

So, for f(x) = x1² + 100x2², we obtain H = [1 0; 0 100] and eigenvalues λmin = 1 and λmax = 100. It implies that β ≤ (99/101)². If you think about the related sequence ek+1 = βek, where e0 = 1 and β ≈ 0.99, we see the slow decrease of the error ek+1! In contrast, the conjugate gradient method has superlinear convergence, as ‖xk+1 − xmin‖ / ‖xk − xmin‖ −→ 0 (remember that for the algorithm step from xk to xk+1 we need n single univariate minimization steps generating y2, ..., yn+1!). Under the Lipschitz continuity property (∃K ∀d ∈ IRn, ∀x ∈ Nε(xmin): ‖H(x)d‖ ≤ K · ‖d‖) we even obtain quadratic convergence: lim sup ‖xk+1 − xmin‖ / ‖xk − xmin‖² < ∞. Theoretically, this is true even with inexact line searches. Practical considerations show that 2n iterations are often enough. (Bazaraa-Shetty also discuss how eigenvalues have a smaller influence than in the case of the gradient method.)

1.10.5 Quasi-Newton methods

Remark 286 (Ideas and properties).

These methods use conjugate directions. They are also called variable metric methods (cf. the fixed metric methods, i.e. conjugate gradients). The main idea is to express the descent direction in the form dj = −Dj∇f(x), where Dj is a symmetric positive definite matrix approximating the inverse matrix H^(-1)(x) (the reason for the name quasi-Newton methods). These methods converge under quite general assumptions. The convergence rate is superlinear.

Remark (Conjugate directions by H^(-1)(xk) approximation): The main idea is to deflect the gradient direction ∇f(yj) (cf. conjugate gradients and the inner loop specification) by Dj, an n × n PD approximation of H^(-1)(yj). As we already know, with Dj PD, dj = −Dj∇f(yj) is a descent direction (for ∇f(yj) ≠ 0) because dj^T ∇f(yj) = −∇f(yj)^T Dj∇f(yj) < 0.

Remark (The simplest Dj update): For the next step, Dj+1 is derived as Dj + Cj (the addition is the simplest full matrix update). The form of Cj specifies the particular quasi-Newton method.

Remark (Rank 1 method): The computationally simplest form of Cj is αj uj uj^T. Why? It needs only n² + n multiplications. The rank of Cj is 1 (it is formed from products of the single vector uj).

Algorithm 287 (Rank 1).

Choose ε > 0, a starting point x1, and a symmetric positive definite matrix D1 (often D1 := I). Then we assign y1 := x1, j := 1, k := 1.

1. If ‖∇f(yj)‖ < ε then STOP (the minimum is found), otherwise dj := −Dj∇f(yj) and solve min{f(yj + λdj) | λ ≥ 0}. The solution λj is used to compute yj+1 := yj + λjdj. If j = n then xk+1 := yn+1, y1 := xk+1, k := k + 1, j := 1 (D1 is given and guarantees the restart) and GOTO 1. Otherwise, when j < n, we continue with GOTO 2.

2. Assign Dj+1 := Dj + Cj, where

Cj = Cj^(R1) := (pj − Djqj)(pj − Djqj)^T / (qj^T (pj − Djqj)),

pj := yj+1 − yj (= λjdj),

qj := ∇f(yj+1) − ∇f(yj),

j := j + 1, and GOTO 1.

Note that Cj^(R1) = αj uj uj^T because uj = pj − Djqj and αj = (qj^T (pj − Djqj))^(-1).

Remark (How was it derived?): For f(x) = c^T x + (1/2)x^T Hx (H is PD, so regular), we have ∇f(x) = Hx + c, and for yj+1, yj we obtain ∇f(yj+1) − ∇f(yj) = H(yj+1 − yj) ⇒ qj = Hpj (as qj = ∇f(yj+1) − ∇f(yj) and pj = yj+1 − yj). It also implies that pj = H^(-1)qj. This must also be satisfied (in a certain form) by Dj+1 to guarantee the nice behaviour of the algorithm in the case of f(x) quadratic (or near to quadratic, e.g., in the neighbourhood of the minimum). At first, we use Cj = αj uj uj^T (because it is easily computable and symmetric, so Dj+1 is a sum of two symmetric matrices, and hence also symmetric). We consider the idea to replace the property pj = H^(-1)qj by pj = Dj+1qj, which specifies the Dj+1 update. So: Dj+1qj = (Dj + Cj)qj = (Dj + αj uj uj^T)qj = pj, so αj uj uj^T qj = pj − Djqj. We may also multiply this equality by qj^T from the left and obtain: qj^T Djqj + αj(qj^T uj)(uj^T qj) = qj^T pj. We may express the term containing αj. So: αj(qj^T uj)(uj^T qj) = qj^T pj − qj^T Djqj. Therefore, αj(qj^T uj)² = qj^T (pj − Djqj). Now we return back to Cj (remember its symmetry!).

Cj = αj uj uj^T = (αj uj uj^T qj)(αj uj uj^T qj)^T / (αj (uj^T qj)²) = (pj − Djqj)(pj − Djqj)^T / (qj^T (pj − Djqj)),

where the last equality uses αj uj uj^T qj = pj − Djqj and αj (uj^T qj)² = qj^T (pj − Djqj) derived above.

The method converges in n (inner) steps if Cj is PD and f is a quadratic function. Even an exact line search is not required! However, there are the following problems: Dj is not necessarily PD and the denominator may approach 0.

Remark (Rank 2 method): Because the rank 1 method is not so good, rank 2 methods are used instead:

Cj := αj uj uj^T + βj vj vj^T.

By the form of αj, uj, βj, and vj, you get various methods. We discuss the DFP (Davidon-Fletcher-Powell: good but sensitive to the precision of the line search method) and BFGS (Broyden-Fletcher-Goldfarb-Shanno: today the best choice, robust enough) algorithms.

We have qj = Hpj and pj = H^(-1)qj. Building new matrices Q (with columns qj, j = 1, ..., n) and P (with columns pj, j = 1, ..., n), we may write Q = HP and P = H^(-1)Q, or even H = QP^(-1) and H^(-1) = PQ^(-1). And this is the key idea. We want to obtain Dn+1H = I (to have a precise approximation of H^(-1) at the minimum). Then Dn+1Hpj = pj, j = 1, ..., n. We also want Dj+1Hpk = pk, k = 1, ..., j. So, the pk are linearly independent eigenvectors (they form the matrix P) of Dj+1H with eigenvalues equal to 1 (see above). This gives the idea how to derive Cj^(DFP). We know the requirements: Q = HP, and the columns of P must be H-conjugate and linearly independent. In addition, pk, k = 1, ..., j, are eigenvectors of Dj+1H with eigenvalues equal to 1 (as Dj+1qk = pk and DjHpk = pk, k = 1, ..., j − 1). So, pk = Djqk + Cjqk = DjHpk + Cjqk = pk + Cjqk, k = 1, ..., j − 1. So, Cjqk = 0, k = 1, ..., j − 1. For k = j, we obtain Dj+1Hpj = pj (that is the secant condition Dj+1qj = pj). And so, (Dj + Cj)qj = pj ⇒ Cjqj = pj − Djqj. If Cj contains the term pj pj^T / (pj^T qj), then applying it to qj yields pj. If Cj contains the term −Djqj(Djqj)^T / ((Djqj)^T qj), then applying it to qj yields −Djqj. So, if Cj^(DFP) = pj pj^T / (pj^T qj) − Djqj(Djqj)^T / ((Djqj)^T qj), then even Cj^(DFP)qk = 0. So, one of the important possibilities of choosing the directions dj and the approximating matrix Dj is introduced as follows.

86 Algorithm 288 (Davidon-Fletcher-Powell).

Choose ε > 0, starting point x1, symmetric, positive definite matrix D1. Then we assign y1 := x1, j := 1, k := 1.

1. If ‖∇f(yj)‖ < ε then STOP, otherwise dj := −Dj∇f(yj) and solve min{f(yj + λdj) | λ ≥ 0}. The solution λj is used to compute yj+1 := yj + λjdj. If j = n then xk+1 := yn+1, y1 := xk+1, k := k + 1, j := 1 and GOTO 1. Otherwise, when j < n, we continue.

DFP 2. Assign Dj+1 := Dj + Cj , where

Cj^(DFP) = pj pj^T / (pj^T qj) − Dj qj qj^T Dj / (qj^T Djqj),

pj = yj+1 − yj (= λjdj),

qj = ∇f(yj+1) − ∇f(yj),

j := j + 1, and GOTO 1.

Note that Cj^(DFP) = αj uj uj^T + βj vj vj^T with uj = pj, αj = (pj^T qj)^(-1), vj = Djqj, and βj = −(qj^T Djqj)^(-1).
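A minimal Python sketch of Algorithm 288 (names are ours; an inexact scipy line search replaces the exact one, D1 = I, and a small safeguard against division by zero is added):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def dfp(f, grad, x1, eps=1e-6, max_outer=50):
    """Algorithm 288 (Davidon-Fletcher-Powell) with D1 = I and restarts."""
    x = np.asarray(x1, dtype=float)
    n = len(x)
    for _ in range(max_outer):
        y = x.copy()
        D = np.eye(n)                        # restart: D1 := I
        for _ in range(n):
            g = grad(y)
            if np.linalg.norm(g) < eps:
                return y
            d = -D @ g                       # deflected gradient direction
            lam = minimize_scalar(lambda t: f(y + t * d), bounds=(0, 10), method='bounded').x
            y_new = y + lam * d
            p = y_new - y
            q = grad(y_new) - g
            if abs(p @ q) > 1e-12:           # skip a degenerate update
                D = D + np.outer(p, p) / (p @ q) - (D @ np.outer(q, q) @ D) / (q @ D @ q)
            y = y_new
        x = y
    return x

f    = lambda x: (x[0] - 2)**4 + (x[0] - 2 * x[1])**2
grad = lambda x: np.array([4 * (x[0] - 2)**3 + 2 * (x[0] - 2 * x[1]),
                           -4 * (x[0] - 2 * x[1])])
print(dfp(f, grad, [0.0, 3.0]))              # cf. Example 289
```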

Example 289.

Solve min [(x1 − 2)^4 + (x1 − 2x2)^2] by Algorithm 288. (Matrices are written row by row.)

k   xk            f(xk)    j   yj            f(yj)    ∇f(yj)           Dj                       dj               λj      yj+1
1   (0.0; 3.0)    52.0     1   (0.0; 3.0)    52.0     (−44.0; 24.0)    [1 0; 0 1]               (44.0; −24.0)    0.062   (2.70; 1.51)
                           2   (2.70; 1.51)  0.34     (0.73; 1.28)     [0.25 0.38; 0.38 0.81]   (−0.67; −1.31)   0.22    (2.55; 1.22)
2   (2.55; 1.22)   0.1036  1   (2.55; 1.22)  0.1036   (0.89; −0.44)    [1 0; 0 1]               (−0.89; 0.44)    0.11    (2.45; 1.27)
                           2   (2.45; 1.27)  0.0490   (0.18; 0.36)     [0.65 0.45; 0.45 0.46]   (−0.28; −0.25)   0.64    (2.27; 1.11)
3   (2.27; 1.11)   0.008   1   (2.27; 1.11)  0.008    (0.18; −0.20)    [1 0; 0 1]               (−0.18; 0.20)    0.10    (2.25; 1.13)
                           2   (2.25; 1.13)  0.004    (0.04; 0.04)     [0.80 0.38; 0.38 0.31]   (−0.05; −0.03)   2.64    (2.12; 1.05)
4   (2.12; 1.05)   0.0005  1   (2.12; 1.05)  0.0005   (0.05; −0.08)    [1 0; 0 1]               (−0.05; 0.08)    0.10    (2.12; 1.06)
                           2   (2.12; 1.06)  0.0002   (0.004; 0.004)

Exercise: Be able to realize one algorithm iteration. In this case, use calculus minimization instead of line search algorithm for univariate minimization (to obtain λj).

87 Remark: It is known that Dj is PD, and hence, dj is a descent direction.

Theorem (DFP properties): Let f(x) be a quadratic function with PD Hessian H to be n minimized (x ∈ IR ). We solve it by DFP and ∀j = 1, . . . , n : ∇f(yj) 6= 0. Then:

1. d1,..., dn are D-conjugate directions,

−1 2. Dn+1 = H ,

n 3. yn+1 ∈ argmin{f(x) | x ∈ IR }. Proof: We emphasize main ideas of several steps. For the complete proof — see literature. We consider the situation ∀j ≤ n. So, we take j arbitrarily fixed. Then for 1. we have to prove that dk are linearly independent H conjugate directions for k ≤ j. It is true, because of the way how dk are constructed. We would like to show that Dj+1Hpk = pk, k ≤ j and Dj+1Hdk = dk (we know that pk = λkdk). So, Hpk = H(λkdk) = H(yk+1 − yk) = qk = ∇f(yk+1) − ∇f(yk), Therefore by induction Hp1 = q1 ⇒ > > p1p1 −D1q1(D1q1) D2Hp1 = (D1 + > + > )q1 = D1q1 + p1 − D1q1 = p1 and then we continue by induction. p1 q1 (D1q1) q1 

Remark 290 (Why DFP choice?).

Let us discuss the DFP Algorithm 288 in the case of quadratic objective function > 1 > f(x) = c x + 2 x Hx with positive definite Hessian matrix H. The algoritmus is constructed in such a way that descent directions dj are conjugate and linearly inde- −1 pendent. In addition, Dj approximates H in such a way that it is a positive definite −1 and symmetric. In addition, Dn+1 = H and the algorithm stops after one complete iteration (composed of n univariate minimizations).

Remark 291 (Choice of vectors).

For the j’th iteration we know vectors pk, k = 1, . . . , j − 1 that satisfy DjHpk = pk, and hence, they are eigenvectors of matrix DjH with unit eigenvalues. In addition, they are linearly independent and conjugate. For known point yj we find a direction dj = −Dj∇f(yj) to get a new iterate yj+1. Then pj = yj+1 −yj and qj = ∇f(yj+1)− ∇f(yj) = H(yj+1 − yj) = Hpj.

Remark 292 (Choice of matrix).

We search for symmetric matrix Cj, which simply updates Dj i.e. Dj+1 := Dj + Cj. We also want Dj+1 satisfying conditions similar to conditions satisfied by Dj, and so, Cjqk = 0 for k = 1, . . . , j − 1 and Dj+1qj = pj must be satisfied. These conditions DFP are satisfied for the choice Cj = Cj used above.

Remark (Convergence rate): Assuming f continuously differentiable and that H(xmin) is PD then: kx − x k k+1 min −→ 0 kxk+1 − xmink

88 that is the superlinear convergence rate to xmin. With kH(x)yk ≤ Kkyk ∀y and ∀x ∈ Nε(xmin) then kyk+1 − xmink lim sup 2 < ∞ k−→∞ kyk − xmink that is quadratic convergence rate. Compare with conjugate gradients, where we obtain

kyk+n − xmink lim sup 2 < ∞. k−→∞ kyk − xmink So, n-times more steps might be necessary for the same behaviour. However, for DFP storage requirements are ∼ n2 (cf. 3n) and intermediate matrix products also spend time. Till now, the algorithm derivation is based on several ideas:

1. The secant property is required.

2. The transformed eigenvectors condition is needed.

3. The symmetry and positive definiteness of approximating matrices cannot be excluded.

89 Remark 293 (Numerical behaviour and BFGS).

The DFP method sometimes generates matrices near to singular. Because there are some degrees of freedom in the choice of Cj (see e.g. BAZARAA 1993), Broyden has suggested another update (for φ > 0: Cj is PD and also Cjqk = 0):

> B DFP φτjvjvj Cj = Cj + > , pj qj

where vj = pj − (1/τj)Djqj (to satisfy Cjqj = pj − Djqj) and

> qj Djqj τj = > > 0. pj qj

For the choice of φ = 1, we get Cj in the form extensively studied by Broyden, Fletcher, BFGS Goldfarb, and Shanno. We denote it as Cj :

> > ! > t BFGS pjpj qj Djqj Djqjpj + pjqjopDj Cj := > 1 + > − > . pj qj pj qj pj qj

−1 The idea is to approximate H by Bj (instead of approximating H by Dj). So, −1 Hpk = qk and Bj+1pk = qk for k = 1, . . . , j (with DFP, we have had H qk = pk and −1 Dj+1qk = pk for k = 1, . . . , j). As Bn+1 = H then we obtain H Bn+1 = I. So, we −1 −1 −1 get from Bj+1pk = qk that H Bj+1pk = H qk and H Bj+1pk = pk, k = 1, . . . , j. −1 ¯ So, pk are eigenvectors of H Bj+1 with eigenvalues equal to 1. Then Bj+1 := Bj +Cj −1 ¯ ¯ and H (Bj + Cj)pk = pk. As qk = Hpk then Bjpk = qk (Cjpk = 0, k ≤ j − 1, C¯ jpj = qj − Bjpj). Then cf. the situation for Dj and understand the formula based on the replacements:

q q> B p p>B ¯ j j j j j j Cj = > − > , qj pj pj Bjpj

Then we need new Dj+1 (we do not want to compute inverse of Bj). It should satisfy BFGS −1 ¯ −1 Dj+1 = Dj + Cj = Bj+1 = (Bj+1 + Cj) . In this case the Sherman-Morrison BFGS formula from linear algebra may help to obtain Cj as introduced above. As we need Dj+1, we may also solve a system of linear equations Bjdj = −∇f(yj) by de- composition to obtain dj. B DFP BFGS DFP It is true that Cj = (1 − φ)Cj + φCj . If we replace DFP matrix Cj by BFGS Cj in Algorithm 288, we get the BFGS algorithm, which is currently considered as the most efficient method for unconstrained minimization.

Remark (Computational comments): There are partial quasi-Newton methods when the restart (reset of Dj) is used earlier than after n inner iterations. It is used to save memory. Scaling is often realized by multiplication of Dj by sj > 0 before update to avoid situation when D1H eigenvalues are >> 1! Then Dj update may move in the wrong way. So, conditional number of matrix is big and computational problems may occur. There are also memory-less quasi-Newton methods based on the idea to forget Dj — often setting Dj := I in BFGS formula. In this case you do not need so large memory and it will work

90 even with weak conditions (regarding the descent directions) even for inexact line search. With exact line search, the equivalence to conjugate gradients is obtained. See literature for more information.

Remark (Convergence of conjugate directions methods): For f quadratic convex was discussed. For non-quadratic case, we have composed A = CB composed map and we need to show that: (1) ∀x ∈ X \ Ω: B is closed map, (2) ∀x ∈ X \ Ω: y ∈ B(x) ⇒ f(y) ≤ f(x), (3) z ∈ C(y) ⇒ f(z) ≤ f(y), (4) {x | f(x) ≤ f(x1)} is a compact set (x1 is a starting point). We get y ∈ B(x) by minimization along d from x where d = −D∇f(x) (D is PD) For D = I, we obtain conjugate gradient method, otherwise quasi-Newton’s method. C gives the minimum along directions by methods (decrease, so (3) is guaranteed). With Ω = {x | ∇f(x) = 0} we may show that: B is closed (cf. discussion with the gradient method) and f is decreasing by d descent (as D is PD). As usually, the assumption (4) should be included. The key idea is the inserted gradient-like step.

Remark 294 (Comparison).

Quasi-Newton methods utilize the modification of descent direction by multiplication of the minus gradient by approximating matrix H(x). It may be interpreted as taking into account the information about the curvature of the surface defined by z = f(x). In the case of large scale problems, there are troubles with memory requirements (huge matrices must be stored). In this case, conjugate gradient methods are preferable.

1.11 Theory of constrained optimization

1.11.1 Introduction In this section, we study properties of constrained optimization problems. Remember from convex analysis that in the case of differentiable convex function f and convex set S the condition ∇f(x¯)>(x−x¯) ≥ 0, ∀x ∈ S is necessary and sufficient condition for attaining the global minimum of f on S at x¯.

Remark 295 (Notation).

By a symbol ∇g(x)>, we denote a transposed Jacobi matrix having n rows and m columns with elements ∂gi(x) (see j’th row a i’th column). Therefore, matrix ∇g(x)> ∂xj > has gradients of functions gi(x) as its columns. Similarly, we denote ∇h(x) .

The following sequence of examples show how to solve constrained optimization (NLP — nonlinear programming — problems) analytically. Theoretical notes follow, so the reader has to use forward references in the example.

Example 296 (Cookbook).

2 2 Completely solve min{(x1 − 1) + (x2 − 2) | x1 − x2 ≤ 1, x2 ≥ 0}. Hint: At first, we split the example to identify main (educational) steps in the solution procedure.

91 Example 297 (Graphical solution).

Solve the problem graphically (can be utilized only if x ∈ IR2 or partially x ∈ IR3). Use your experience from LP. Notice that the contour graph is composed from circles.

Example 298 (Reformulation).

Express the given NLP in the following equivalent form: min{f(x) | g(x) ≤ 0, h(x) = 0, x ∈ X}. 2 2 min{(x1 − 1) + (x2 − 2) | x1 − x2 − 1 ≤ 0, −x2 ≤ 0}

Example 299 (Lagrangian).

Build Lagrangian in the form L(x, u, v) = f(x) + g(x)>u + h(x)>v. 2 2 L(x, u, v) = L(x1, x2, u1, u2) = (x1 − 1) + (x2 − 2) + u1(x1 − x2 − 1) + u2(−x2)

Example 300 (Karush-Kuhn-Tucker conditions).

For the given example, derive the KKT conditions from Lagrangian: ∇xL = 0 (or > > ∇xL ≥ 0, x ≥ 0, x ∇xL = 0) u ≥ 0, ∇uL ≤ 0, u ∇uL = 0, and ∇vL = 0.

∇xL = 0 : 0 Lx1 = 2(x1 − 1) + u1 = 0, 0 Lx2 = 2(x2 − 2) + u1 − u2 = 0, u ≥ 0 : u1 ≥ 0, u2 ≥ 0, ∇uL ≤ 0 : 0 Lu1 = x1 + x2 − 1 ≤ 0, 0 Lu2 = −x2 ≤ 0, > u ∇uL = 0 : 0 u1Lu1 = u1(x1 − x2 − 1) = 0, 0 u2Lu2 = u2(−x2) = 0.

Example 301 (Check KKT for the point).

Check validity of the KKT conditions for the given point (x1 = 1, x2 = 0). Remem- ber also the regularity condition! (columns of ∇g(x) have to be linearly independent vectors).

For the given point, we get: u1 + 2(1 − 1) = 0 and u1 − u2 − 4 = 0. Therefore, u2 = −4 < 0 and the KKT conditions are not satisfied.

92 Example 302 (Geometrical interpretation).

Interpret the KKT conditions geometrically. Hint: Draw the feasible region, contour graph as before, add vectors −∇f(x) = (), and columns of ∇g(x).

For the point specified by x1 = 1, and x2 = 0 we have u1 = 0 and u2 = −4. We > > > also have −∇f(x) = (−2(x1 − 1), −2(x2 − 2)) = (0, 4) , ∇g1(x) = (1, 1) , and > ∇g2(x) = (0, −1) . So, we may show that for this point −∇f(x) is not a nonnegative linear combination of gradients of active constraints i.e. ∇g1(x)and ∇g2(x).

Example 303 (One solution satisfying KKT).

Find at least one solution satisfying the KKT conditions.

For finding solutions, a good policy is to subsequently setup ui coefficients and analyze simplified systems of equalities and inequalities:

1. u1 = 0, u2 = 0: ⇒ x1 = 1, x2 = 2, 1 + 2 − 1 6≤ 0. Infeasibility appeared!

2. u1 6= 0, u2 = 0: ⇒ x1 = 1 − x2, −2(x2 − 2) = u1 = −2(x1 − 1), so x1 = x2 − 1. Then, x1 = 0, x2 = 1 and this point satisfies the KKT conditions. We finish the search for all optimal solutions now.

3. u1 = 0, u2 6= 0: ⇒ x1 = 1, x2 = 0, so u1 = 0, u2 − 4 and contradiction.

4. u1 6= 0, u2 6= 0: ⇒ x1 = 1, x2 = 0, so u1 = 0, u2 − 4 and the same contradiction.

Example 304 (All solutions).

Find all solutions satisfying the KKT conditions. Discuss also points that do not satisfy the regularity condition mentioned above.

Solution procedure is completed above. Because ∇g1 and ∇g2 are constant and linearly independent, no irregular points may occur.

Example 305 (Classify found solutions).

Discuss which solutions are optimal using geometry or the second order conditions. Because the feasible set S is a convex set and f is a convex function that is minimized then by the previous theory any local minimum is also a global one.

93 1.11.2 Review of calculus Remark 306 (Equality constraint).

Let us consider a simple program min{f(x, y) | h(x, y) = 0}, where f : IR2 −→ R and 2 h : IR −→ R are differentiable functions. For minimum point (xmin, ymin), the follow- ing equality must be satisfied ∇f(xmin, ymin) = v · ∇h(xmin, ymin). The requirement guarantees that there is no feasible descent direction at the minimizing point.

Remark 307 (Implicit function use).

Another explanation is given by the idea that y depends functionally on x. Then, the objective function derivative f by x can be computed using the formula for computation of the implicit function derivative.

0 0 0 0 0 0 0 0 0 fx + fyyx = fx − (hx/hy)fy = fx − v·hx = 0,

0 0 and so fy − v·hy = 0, which follows from the notation of v. Generalization is given by the following theorem:

Theorem 308 (Lagrange).

Let x¯ be a feasible solution of the following minimizing program Eq with constraints in the form of equalities min{f(x) | h(x) = 0}. Let functions f and h have the continuous first order partial derivatives in the neighbourhood of x¯. Let columns of matrix ∇h(x¯)> are linearly independent (the regularity conditions). If x¯ is a point of local minimum of the solved program Eq then exists a vector v such that

∇f(x¯) + ∇h(x¯)>v = 0.

Theorem 308 specifies the necessary first order optimality conditions for the program Eq. La- grange multipliers v present the information about the sensitivity of the objective function values at minimum point x¯ with respect to the right-hand-side (RHS) changes for constraints h(x) = 0 (cf. with shadow prices in linear programming).

Remark 309 (Lagrange function (1760)).

For program Eq, we define Lagrange function (Lagrangian):

m > X L(x, v) = f(x) + h(x) v = f(x) + vihi(x). i=1

94 Remark 310 (Stationary points).

We search for stationary points of Lagrangian, satisfying the condition ∇L(x, v) = 0, i.e. ∇xL(x, v) = 0 and ∇vL(x, v) = 0. Namely, theorem 308 says how to search the minimum of Eq using stationary points of Lagrangian. We obtain a system of n + m (usually) nonlinear equations of n + m unknown variables x a v, including conditions of Theorem 308, and constraints h(x) = 0.

m ∂L(x, v) ∂f(x) X ∂hi(x) = + v = 0 j = 1, . . . , n ∂x ∂x i ∂x j j i=1 j ∂L(x, v) = hk(x) = 0 k = 1, . . . , m, ∂vk The obtained solution x¯ is a point of the possible solution of Eq. The system of equations gives the necessary conditions, however not sufficient conditions (see the example below).

Example 311.

2 2 2 2 Solve min{−x1 −4x2 | x1 + 2x2 = 6}. Then L(x1, x2, u) = −x1 −4x2 +u·(x1 +2x2 −6) and the solution of ∇L = 0 is obtained x1 = 3, x2 = 3/2 a u = 6. However, for x1 −→ ∞ and x2 −→ ∞ the following conclusion is valid f(x1, x2) −→ −∞.

Remark 312 (Possible simplifications?).

The discussed program and its objective function Eq can be modified by substitutions of variables using equality constraints. Then, the unconstrained optimization problem is obtained. The use of this approach may reduce the dimension of the solved problem. However, it is important to consider whether such substitution is also simplifying for the numerical algorithm use. The additional argument to study NLPs is that it may be impossible to get completely describing expressions for separate variables from h(x) = 0. Therefore, NLPs have to be studied also without such simplifications. Note that analytical computations based on the Lagrangian function are exceptionally used to get optimal solutions. Instead of this, Lagrangian properties are very important for the development of constrained optimization numerical algorithms.

95 1.11.3 Constrained optimization — inequalities

Remark 313 (Transforming inequalities to equalities).

Let us consider the program Ineq of the form min{f(x) | g(x) ≤ 0}. We introduce slack > ∗ 2 2 variables y = (y1, . . . , ym) for separate constraints and we denote y = (y1, . . . , ym). The program Ineq can be rewritten in the form Eq, so we obtain min{f(x) | g(x)+y∗ = 0}. We have the Lagrange function in the form L(x, u, y) = f(x) + (g(x) + y∗)>u and we may use Theorem 308. From ∇L = 0, we get necessary conditions in the form > ∗ ∇xL = ∇f(x) + ∇g(x) u = 0, ∇uL = g(x) + y = 0 (and hence g(x) ≤ 0), and > finally ∇yL = (2u1y1,..., 2umym) = 0. We suggest to compare these conditions with further developed KKT conditions.

Definition 314 (Cone of feasibility directions).

Let S ⊂ IRn, S 6= ∅, x¯ ∈ cl S. The cone of feasibility directions of S at x¯ is:

D = {d | d 6= 0, ∃δ > 0, ∀λ ∈ (0, δ): x¯ + λd ∈ S}.

Exercise 315.

Consider f differentiable at x¯. Draw a figure illustrating the fact that for minimum at x¯, ∇f(x¯)>d 6< 0. Hint: Otherwise d is a feasible descent direction of f at x¯.

Definition 316 (Cone of improving directions).

We define a cone of improving (descent) directions of f at x¯ as

F = {d | ∃δ > 0, ∀λ ∈ (0, δ): f(x¯ + λd) < f(x¯)}.

Definition 317.

We have f differentiable at x¯. We define a halfspace F0 by

> F0 = {d | ∇f(x0) d < 0}

Theorem 318 (Geometric optimality conditions).

n n Let S ⊂ IR , S 6= ∅, f : IR −→ IR differentiable at x¯, x¯ ∈ arglocminx{f(x) | x ∈ S} ⇒ F0 ∩ D = ∅. Conversely, F0 ∩ D = ∅, f is pseudoconvex at x¯ and ∃Nε(x¯), ε > 0 ∀x ∈ S ∩ Nε(x¯): d = (x − x¯) ∈ D ⇒ x¯ ∈ arglocminx{f(x) | x ∈ S}

96 Proof: By contradiction: ∃d ∈ F0 ∩D ⇒ ∃δ1, ∀λ ∈ (0, δ1)f(x¯ +λd) < f(x¯) and ∃δ2, ∀λ ∈ (0, δ2)x¯ +λd ∈ S. Because of intersection, we get a contradiction.

Conversely: ∀x ∈ S ∩ Nε(x¯): f(x) ≥ f(x¯). Otherwise, ∃xˆ ∈ S ∩ Nε : f(xˆ) < f(x¯) but d = (xˆ − x¯) ∈ D and by pseudoconvexity ∇f(x¯)>d < 0 or else ∇f(x¯)>d ≥ 0 ⇒ f(xˆ) = f(x + λd) ≥ f(x¯). x¯ is not a local minimizer, ∃d ∈ F0 ∩ D and contradiction.  Remark 319.

0 0 > Notice that F0 ⊂ F ⊂ F0, where F0 = {d 6= 0 | ∇f(x¯) d ≤ 0}. If f is pseudoconvex 0 at x¯ then F = F0. If f is strictly pseudoconcave then F = F0.

To be able to use geometric interpretation of optimality conditions for computations, we specify S = {x ∈ X | gi(x) ≤ 0, i = 1, . . . , m}. We define G0 ⊂ D using ∇gi, and so F0 ∩ D = ∅ at x¯ ⇒ F0 ∩ G0 = ∅ at x¯. Then, we may use gradients as follows. Lemma 320.

n Let S = {x ∈ X | gi(x) ≤ 0, i = 1, . . . , m}, X be a nonempty open set in IR , gi : n IR −→ IR, i = 1, . . . , m, x¯ ∈ S, I = {i | gi(x¯) = 0} be an index set of binding (active) constraints. Assume that gi are differentiable for i ∈ I at x¯ gi are continuous at x¯ for > 0 > i 6∈ I. Define G0 = {d | ∇gi(x¯) d < 0, ∀i ∈ I}, G0 = {d 6= 0 | ∇gi(x¯) d ≤ 0, ∀i ∈ I}. 0 Then G0 ⊂ D ⊂ G 0. If gi, i ∈ I are strictly pseudoconvex ⇒ D = G0. If gi, i ∈ I are strictly pseudoconvex 0 ⇒ D = G0.

Proof: Not included.  Theorem 321 (Geometric optimality conditions II).

Let the feasible set S be defined by S = {x ∈ X | gi(x) ≤ 0, i = 1, . . . , m}, X be n n n a nonempty open set in IR , f : IR −→ IR, gi : IR −→ IR, i = 1, . . . , m, and Ineq: ? ∈ arglocminx{f(x) | x ∈ S} Let x¯ ∈ S and I be an index set of indices of active constraints. In addition, f, gi, i ∈ I are differentiable at x¯ and gi, i 6∈ I are continuous at x¯. If x¯ solves Ineq ⇒ F0 ∩G0 = ∅. Conversely, F0 ∩ G0 = ∅, f is pseudoconvex at x¯ and gi, i ∈ I are strictly pseudoconvex (over some existing Nε(x¯), ε > 0) ⇒ x¯ ∈ arglocminx{f(x) | x ∈ S}.

Proof: x¯ is a local minimum. Then F0 ∩ D = ∅ ⇒ F0 ∩ G0 = ∅. Conversely: F0 ∩ G0 = ∅ S redefined by active constraints, pseudoconvexity considered then G0 = D by Lemma and F0 ∩ D = ∅. Then gi(x¯) ≤ 0 related level sets are convex on Nε(x¯) ⇒ S ∩ Nε(x¯) is a convex set. F0 ∩ D = ∅ and f is pseudoconvex at x¯ ⇒ x¯ is a local minimum (valid when binding included).  Exercise 322.

Choose different points, compute gradients and draw figures for programs: min{(x1 − 2 2 2 2 2 2 3) + (x2 − 2) | x1 + x2 ≤ 5, x1 + x2 ≤ 3, x1, x2 ≥ 0} and min{(x1 − 1) + (x2 − 1) | 3 (x1 + x2 − 1) ≤ 0, x1, x2 ≥ 0}. Compare results.

97 Remark 323.

In addition if gi, i 6∈ I continuous then x¯ ∈ arglocminx{f(x) | x ∈ S} ⇔ F0 ∩ D = ∅ ⇔ F0 ∩ G0 = ∅ (f pseudoconvex) F0 ∩ D = ∅ ⇔ F ∩ D = ∅. But it is not still useful for computations. The next step is to introduce computationally useful Fritz John (FJ) conditions based on Farkas’ (Gordon’s) Theorem (see section about convex cones).

Theorem 324 (Fritz John).

n n n X ⊂ IR , X 6= ∅ open set, f : IR −→ IR gi : IR −→ IR, i = 1, . . . , m, n Let S = {x ∈ X | gi(x) ≤ 0, i = 1, . . . , m}, X be a nonempty open set in IR , n n f : IR −→ IR, gi : IR −→ IR, i = 1, . . . , m, Ineq: ? ∈ arglocminx{f(x) | x ∈ S} Let x¯ ∈ S and I be an index set of indices of active constraints (I = {i | gi(x¯) = 0}), f, gi, i ∈ I are differentiable at x¯ and gi, i 6∈ I are continuous at x¯. If x¯ solves Ineq ⇒ ∃u0, ui, i ∈ I, u = (ui)i∈I : P u0∇f(x¯) + ui∇gi(x¯) = 0 i∈I > > ∀i ∈ I : u0, ui ≥ 0, (u0, u ) 6= (0, 0 ).

Proof: Let x¯ ∈∈ argminx{f(x) | x ∈ S} ⇒ by Theorem 321 6 ∃d : ∇f(x¯)d < 0 and ∇gi(x¯)d < 0, i ∈ I. > We denote A = (∇f(x¯), (∇g(x¯))i∈I )) . So, we assume that Ad < 0 is inconsistent. By Farkas’s Theorem 57 either {x | Ax ≤ 0 ∧ c>x > 0}= 6 ∅ or {y | A>y = c ∧ y ≥ 0}= 6 ∅. By its Gordon’s reformulation in Theorem 59 either {x | Ax < 0} 6= ∅ or {y | A>y = 0, y ≥ 0, y 6= 0} 6= ∅. (Gordon’s  x   x  Theorem follows from Farkas’ Theorem for the choice A 1  ≤ 0 and 0> 1  > 0 s s  A>   0>  ⇒ y = and y ≥ 0.) Then, for Ad < 0 unsolvable ∃y 6= 0, y ≥ 0, A>y = 0. Denote 1> s > y = (u0, (u1)i∈I ) and proof is complete. 

Corollary 325.

If, in addition, gi, i 6∈ I are differentiable at x¯ then we may write that exists u0, u such that

> > u0∇f(x¯) + ∇g(x¯) u = 0, u g(x¯) = 0, > > > > (u0, u ) ≥ (0, 0 ), (u0, u ) 6= (0, 0 ).

Proof: The proof is trivial just zero terms are added for inactive constraints.  These conditions include dual feasibility, complementary slackness, and Lagrange multipliers. Remember! When these conditions are used to discover x¯ possible minimum, the solution still must be checked for its primal feasibility (x¯ ∈ S).

98 Example 326.

2 2 2 2 Solve min{(x1 − 3) + (x2 − 2) | x1 + x2 − 5 ≤ 0, x1 + 2x2 − 4 ≤ 0, −x1 ≤ 0; −x2 ≤ 0}. At first graphically. Check points (2; 1) and (0; 0). For the first point: u3 = u4 = 0 and u1 = u0/3 and u2 = 2u0/3. For example u0 = 3, u1 = 1 and u2 = 2 and (2; 1) may be minimum. Contradiction occurs for the second point as u1 = u2 = 0 and u3 = −6u0 and u4 = −4u0. So, some ui must be negative.

Example 327 (Kuhn and Tucker (1951)).

3 > Solve min{−x1 | x2 − (1 − x1) ≤ 0, −x2 ≤ 0}. Then, the minimum is x0 = (1; 0) , > > > so ∇f(x0) = (−1; 0) , ∇g1(x0) = (0; 1) and ∇g2(x0) = (0; −1) . We get u0 = 0 and u1 = u2 and Fritz John conditions are not useful as for u0 = 0 no information about −∇f(x0) is used.

Remark 328 (Problems with FJ conditions).

These conditions may also be satisfied trivially (see also example above). When any gradient ∇gi(x¯) = 0 then we may set the related ui 6= 0 and other ui = 0 and conditions are satisfied. When gi(x) = 0 constraint is replaced by the couple gi(x) ≤ 0 and −gi(x) ≥ 0 then Fritz John (FJ) conditions are satisfied by any feasible solutions, as we may choose multipliers ui related to these constraints nonzero and equal. In general, we may fulfill Fritz John conditions for any feasible solution x0 by adding 2 redundant constraint of the form kx − x0k ≥ 0. It is satisfied for any feasible x. The constraint is active in x0. Its gradient equals zero in x0. Therefore, G0 = ∅, and so F0 ∩ G0 = ∅.

These observations led to the development in two directions: (1) the formulation of Fritz John sufficient conditions and (2) its improvement avoiding such cases (KKT conditions).

Theorem 329 (Fritz John sufficient conditions).

We have Ineq problem and x¯ satisfying FJ conditions. Binding constraints are defined by index set I. We assume that S = {x | gi(x) ≤ 0, i ∈ I} (a relaxed feasible set). Let Nε(x¯) be a neighbourhood of x¯, f be pseudoconvex and gi, i ∈ I strictly pseudo- convex over Nε(x¯) ∩ S ⇒ x¯ ∈ arglocmin of Ineq. Let f be pseudoconvex at x¯ and gi, i ∈ I strictly pseudoconvex at x¯ and gi, i ∈ I quasiconvex at x¯ ⇒ x¯ ∈ argglobmin of Ineq.

Proof: Use various definitions generalizing convex functions.  Regarding improvement of FJ, the most important idea is to avoid the situation when F0 ∩ G0 = ∅ because of G0 = ∅ (for any f). Therefore, the condition G0 6= ∅ (constraint qualification) will be sufficient to solve our problem. The Karush-Kuhn-Tucker (KKT) conditions follows.

99 Theorem 330 (Karush-Kuhn-Tucker conditions — inequalities).

n n n Let X ⊂ IR , X 6= ∅, X open, f : IR −→ IR, gi : IR −→ IR, i = 1, . . . , m be given. Ineq: ? ∈ argminx{f(x) | gi(x) ≤ 0, i = 1, . . . , m, x ∈ X}. Let x¯ be feasible, I denotes set of indices of binding constraints at x¯, f, gi, i ∈ I are differentiable at x¯ and gi, i 6∈ I are continuous at x¯. Furthermore, suppose that ∇gi(x¯), i ∈ I are linearly independent (constraint qualification). If x¯ solves Ineq (locally) then ∃ui, i ∈ I : X ∇f(x¯) + ui∇gi(x¯) = 0, ui ≥ 0. i∈I

Proof: By FJ exist u0 and uˆi, i ∈ I not all equal zero such that P u0∇f(x¯) + uˆi∇gi(x¯) = 0, u0, uˆi ≥ 0, i ∈ I. i∈I

uˆi If u0 = 0 then ∇gi(x¯) are linearly dependent ⇒ contradiction. So, u0 6= 0 and u0 > 0. We define ui = u0 and divide FJ equality with u0 by u0. 

Remark 331 (Geometric interpretation of KKT).

KKT theorem 332 introduces necessary conditions for existence of minimum at x¯. The theorem is illustrated by Figure5. It says that for x¯ the steepest descent vector −∇f(x0) can be expressed as nonnegative linear combination of gradients ∇gi(x0) defined only by binding (active) constraints.

Theorem 332 (Karush-Kuhn-Tucker - differentiable version).

Let f a g be differentiable at x¯ minimum of min{f(x) | g(x) ≤ 0}. Let columns of ∇g(x¯)> related to gradients of active constraints are linearly independent (regularity — constraint qualification). Then exist coefficients u such that:

∇f(x¯) + ∇g(x¯)>u = 0, u>g(x¯) = 0, u ≥ 0.

Proof: For the proof, see the previous theorem. Compare with FJ and linear programming complemen- tary slackness. Note that gi(x¯) ≤ 0, i = 1, . . . , m and ui ≥ 0 by KKT. So, uigi(x¯) ≤ 0, i = 1, . . . , m. It is > Pm m equal to zero iff ui = 0 or gi(x¯) = 0. Then, u g(x¯) = i=1 uigi(x¯) = 0 iff ∧i=1(uigi(x¯) = 0). So, the > condition u g(x¯) = 0 guarantees that for non-binding constraints gi(x¯) < 0, i 6∈ I ui must be equal to zero and related gradients are not further considered. 

100 Example 333.

2 2 Find all points satisfying KKT conditions for min{−3x1−x2 | x1+x2 ≤ 5, x1−x2 ≤ 1}. Using Lagrange function, we get:

−3 + u1 · 2x1 + u2 = 0 −1 + u1 · 2x2 − u2 = 0 (1.5) 2 2 u1 · (x1 + x2 − 5) = 0 u2 · (x1 − x2 − 1) = 0 (1.6) u1, u2 ≥ 0 (1.7)

We consider u1 = 0, then (5) is not solvable. Therefore, u1 6= 0, and (6) implies 2 2 x1 + x2 − 5 = 0. If u2 = 0, we express x1 and x2 from (5) using u1. However, the substitution in (6) leads to contradiction. Therefore, u2 6= 0, and from (6) we derive x1 − x2 − 1 = 0. We express x2 and we get a quadratic equation by replacement. For the first root, we have x1 = −1, x2 = −2, that do not satisfy KKT conditions as u1 = −2/3 6≥ 0. For the second root x1 = 2, x2 = 1, u1 = 2/3 and u2 = 1/3, so KKT conditions are satisfied.

Remark 334 (Sufficient KKT for inequalities).

They are very similar to FJ (see later for the general case combining inequalities and equalities). For FJ sufficient conditions strict pseudoconvexity and quasiconvexity have been required. With KKT sufficient conditions gi, i ∈ I must differentiable and quasi- convex.

Exercise 335.

Think about the fact, that convexity assumptions are not necessary even for convex 2 2 2 2 programs. Hint: Check argmin{−x2 | (x1 − 1) + x2 ≤ 1, (x1 + 1) + x2 ≤ 1} where only feasible point is (0, 0) and it is also optimal. However, although it is a trivial convex program, KKT are not satisfied as constraint qualification is not valid.

1.11.4 Constrained optimization — inequalities and equalities

Remark 336 (Generalization).

Theoretical results of previous sections discussed separately for equalities and inequal- ities is possible to generalize and unify. For computational purposes, we have usually considered feasible solutions x ∈ IRn satisfying certain constraints. The most of theory also remains valid for the general case when x ∈ X where X is an open nonempty set. So, we further consider P: ? ∈ argminx{f(x) | g(x) ≤ 0, h(x) = 0, x ∈ X}. Notice > > that g(x) = (g1(x), . . . , gm(x)) and h(x) = (h1(x), . . . , hl(x)) .

101 Remark 337 (Overview – see literature for details).

We have P, I set, x¯ ∈ arglocmin of P, f differentiable, gi, i ∈ I differentiable, gi, i 6∈ I continuous, and hi, ∀i continuously differentiable. > Geometrical condition: F0, G0 are defined as before, H0 = {d | ∇hi(x¯) d = 0, i = 1, . . . , l} ∇hi(x¯) linearly independent ⇒ F0 ∩G0 ∩H0 = ∅ (for the proof using ordinary differential equations, see literature). Fritz John necessary conditions: We have P, I, f, gi, hi as before (and all functions are differentiable). If x¯ ∈ arglocmin of P ⇒

> > u0∇f(x¯) + ∇g(x¯) u + ∇h(x¯) v = 0, > > > > > > > > > u g(x¯) = 0, (u0, u ) ≥ 0, (u0, u , v ) 6= (0, 0 , 0 ) .

Fritz John sufficient conditions: We have P, x¯ FJ solution, I, and S is a relaxation of the feasible set at x¯. Then if hi is affine, ∇hi are linearly independent, and ∃Nε(x¯) such that f is pseudoconvex on S ∩ Nε(x¯) and gi, i ∈ I are strictly pseudoconvex on S ∩ Nε(x¯) ⇒ x¯ ∈ arglocmin of P.

Theorem 338 (KKT 1st order necessary conditions).

1. Let X ⊂ IRn be a nonempty open set, f : IRn −→ IR, g : IRn −→ IRm and n l h : IR −→ IR be functions. The components of g are denoted as gi, components of h are denoted hi. 2. We solve P: ? ∈ argmin{f(x) | g(x) ≤ 0, h(x) = 0, x ∈ X}.

3. We have x¯ feasible solution P. We denote I = {i | gi(x0) = 0} the index set of binding (active) inequality constraints.

4. Let f and gi are differentiable at x¯ for i ∈ I, gi are continuous for i 6∈ I at x¯, and hi are continuously differentiable at x¯ for i = 1, . . . , l. 5. We also assume that gradients ∇gi(x¯), i ∈ I and ∇hi(x¯), i = 1, . . . , l are linearly independent (constraint qualification).

If x¯ is a local minimum of P then ∃ui, i ∈ I and ∃vi, i = 1, . . . , l such that the following KKT conditions are valid:

l X X ∇f(x¯) + ui∇gi(x¯) + vi∇hi(x¯) = 0, ui ≥ 0 for i ∈ I i∈I i=1

Proof: Simple, based on FJ, similar as for inequalities — see literature. To get pure inequalities, you + − + − may set vi = vi − vi , vi , vi ≥ 0. 

Remark 339.

Consider that previous Lagrange and KKT conditions may be obtained as corollaries of Theorem 338.

102 Remark 340 (Another formulations).

If we consider differentiability at x¯ also for gi, i 6∈ I, then we may write KKT conditions equivalently as follows:

∇f(x¯) + ∇g(x¯)>u + ∇h(x¯)>v = 0, u>g(x¯) = 0, u ≥ 0.

Previously introduced concept of Lagrangian may be generalized as follows:

L(x, u, v) = f(x) + g(x)>u + h(x)>v.

We may formulate KKT conditions using conditions for the existence of general La- grangian stationary points and we obtain

> ∇xL = 0, ∇uL ≤ 0, u ∇uL = 0, u ≥ 0, ∇vL = 0. (1.8)

If we have to introduce nonnegativity conditions x ≥ 0, we replace the condition > ∇xL = 0 in8 by three conditions ∇xL ≥ 0, x ∇xL = 0 and x ≥ 0.

Theorem 341 (Sufficient KKT 1st order conditions).

We have P, x¯ feasible, I index set of indices of binding constraints. We suppose tha KKT conditions hold at x¯, so: ∃u¯i ≥ 0, i ∈ I ∃v¯i ∈ IR, i = 1, . . . , l such that:

l P P ∇f(x¯) + u¯i∇gi(x¯) + v¯i∇hi(x¯) = 0. i∈I i=1

Let J = {i | v¯i > 0}, K = {i | v¯i < 0} and f is pseudoconvex at x¯, gi quasiconvex at x¯ ∀i ∈ I, and hi is quasiconvex at x¯ ∀i ∈ J and hi is quasiconcave at x¯ ∀i ∈ K. Then x¯ is a global minimum of P.

Proof: Let x¯ be feasible for P. Because of quasiconvexity: ∀i ∈ I gi(x) ≤ gi(x¯) = 0. Then, gi(x¯ + λ(x − x¯)) = gi(λx + (1 − λ)x¯) ≤ max{gi(x), gi(x¯)} = gi(x¯). So, gi(x) does not increase moving from x > along x − x¯: So, for i ∈ I : ∇g(x¯) (x − x¯) ≤ 0. We multiply the inequality by u¯i ≥ 0. Similarly, for hi > > quasiconvex and quasiconcave: ∇hi(x¯) (x − x¯) ≤ 0, i ∈ J multiplied by v¯i > 0 and ∇hi(x¯) (x − x¯) ≥ 0, i ∈ K multiplied by v¯i < 0. So, together we have l P P > ( u¯i∇gi(x¯) + v¯i∇hi(x¯)) (x − x¯) ≤ 0. i∈I i=1 We use it for KKT formula and obtain ∇f(x¯)>(x − x¯) ≥ 0. By pseudoconvexity of f at x¯ it implies that f(x) ≥ f(x¯).  Theorem 342 (Sufficient 2nd order KKT conditions).

We consider functions f, g a h from P 2nd order differentiable and vectors x¯, u¯ a v¯, satisfying the KKT conditions. We choose fixed u¯ and v¯ in Lagrangian, so we have 2 L(x, u¯, v¯) the function of the variable x. We denote ∇xL(x) Hessian of L(x, u¯, v¯) at 2 x. If ∇xL(x) is positive semidefinite at feasible points x of some neighbourhood of x¯ then x¯ is a point of a local minimum.

103 Remark 343 (Constraint qualification).

The KKT conditions require ∇gi, ∇hi linearly independent. It is understandable and possible to check it in examples. This simple constraint qualification can can be signif- icantly generalized (using properties of various cones at x¯)– see literature. It is important to know that the KKT conditions require differentiability of consid- ered functions and regularity. However, for the nondifferentiable case there are still conditions based on saddle point of L and duality — see later.

1.12 Constrained optimization — algorithms

Remark 344 (Main idea).

A lot of constrained optimization numerical algorithms uses the idea to repeatedly transform a multivariate nonlinear (constrained) program to a multivariate uncon- strained optimization problem. For unconstrained minimization, several useful algo- rithms have been introduced. (They again repeatedly use an unconstrained problem transformation on the univariate line search problem). This ‘encapsulating’ principle will be often used for particular algorithms. The list of methods will completed with examples.

1.12.1 Penalty function-based algorithms

Remark 345 (Principle).

A nonlinear program is solved through the solution of a sequence of unconstrained problems. They are derived from the original NLP in two steps: (1) constraints are relaxed and (2) they are incorporated into the penalty term α(x) that is included in the objective function. The penalty term is chosen in such a way that feasible solutions (of the original NLP) are not penalized but for the infeasible solutions the objective function value is significantly increased.

Example 346.

Instead of minx{f(x) | h(x) = 0} (the original constrained NLP), we solve minx{f(x)+ µh(x)2 | x ∈ IRn} (unconstrained approximating penalty problem), where µ is a penalty 2 coefficient, and here, aforementioned α(x) = µh(x) . We see that for x −→ xmin ⇒ h(x)2 −→ 0. n Another possibility might be to solve minx{f(x) + µ max{0, g(x)} | x ∈ IR } instead of the original minx{f(x) | g(x) ≤ 0}. In this case, we have α(x) = µ max{0, g(x)} and we may notice that if g(x) is differentiable then α(x) need not be differentiable.

104 Exercise 347.

Illustrate penalty principle by own examples and figures. Start with min{x | x ≥ 2} (Hint: e.g. use min{x + α(x) | x ∈ IR} where α(x) = 0 for x ≥ 0 and α(x) = µ(2 − x)2 for x < 2). Continue with min{x2 | −1 ≤x ≤ 2}.

Example 348 (The relation between solutions).

2 2 Solve min{x1 + x2 | x1 + x2 − 1 = 0} graphically (the solution is xmin = (0.5, 0.5)). 2 2 Reformulate the NLP given above using the penalty principle: min{x1 + x2 + 2 > 2 µ(x1 + x2 − 1 = 0) | (x1, x2) ∈ IR }. The solution xµ,min of the reformulated NLP µ can be obtained using ∇f(x)) = 0. So, we obtain x1 = x2 = 2µ+1 . Thus, for µ −→ ∞ µ 1 ⇒ 2µ+1 −→ 2 , and so xµ,min −→ xmin.

105 Remark 349 (Geometric interpretation of perturbed problem).

2 2 We may consider the perturbed program: min{x1 +x2 | x1 + x2 − 1 = ε} ⇒ We express x2 from the equality constraint and we replace it in the objective function: The penalty 2 2 1+ε reformulation is min{x1 + (1 + ε − x1) | x1 ∈ IR} ⇒ x1,min = x2,min = 2 and the 2 2 minimum value of the objective function is v(ε) = min{x1 + x2 | x1 + x2 − 1 = ε} = 2 2 (1+ε)2 min{x1 + (1 + ε − x1) | x1 ∈ IR} = 2 .  x   h(x , x )  So we may consider the mapping 1 7−→ 1 2 . It means that the points x2 f(x1, x2) are mapped from the space with coordinates x1, x2 to the space with coordinates h, f. It is also important to note that h(x) = ε. Therefore, we may draw a graph of the (1+ε)2 function v(ε) = 2 using axes h, f – draw axes and v function. It gives f(xmin) for different values of ε. For fixed value of ε points on the straight line specified by the equality h = ε with f coordinate greater than v(ε) represent the non-optimal objective function values f(x). Therefore, all feasible points of the original perturbed problem (allowing h(x) 6= 0) map to the epigraph of v(ε). So, the solution of the original problem is derived geometrically (including the previous case ε = 0). Draw straight lines filling the epigraph, colour the case ε = 0. Then, the penalty objective is specified by term f + µh2. We may think about two variables h and f, and we may draw a contour graph of the function specified by this term — draw it. Solving the penalty problem, we are interested to find a feasible point of the perturbed problem with the lowest f + µh2 penalty objective value. Draw the point. For different values of µ, you will get different solutions (points (ε, v(ε)) = 2 (h(xµ,min), f(xµ,min) + µh(xµ,min) ) different from the original problem solution 2 (0, v(0)) = (h(xmin), f(xmin) + µh(xmin) ) = (0, f(xmin))). Check whether you draw both points. With increasing µ curvature of parabola f + µh2 changes and points 2 (h(xµ,min), f(xµ,min) + µh(xµ,min) ) are approaching to point (0, f(xmin)). Draw a sequence of points using several figures. From figures, we see that the solution of penalty problem is infeasible (slightly with µ increasing). We also see that this simple geometric interpretation can be generalized to x ∈ IRn and h(x) ∈ IRl (also to g(x) ≤ 0 ∈ IRm, e.g., using slack variables). The whole visualizing approach is very suitable also for NLP duality.

Remark 350 (Nonconvex case).

Even in the general case when v(ε) might be nonconvex the approaching process works. The reason is that the parabola f + µh2 may move along the nonconvex v (or for f and h nonconvex at x, f + µh2 tends to convexity at x with increasing µ).

106 Remark 351 (General penalty function).

The penalty function α(x) is defined such that it has the following property: Its value is positive for infeasible x and 0 for feasible x. For optimization problem P (see previous section), we choose:

m l X X α(x) = φ(gi(x)) + ψ(hi(x)), i=1 i=1 where φ and ψ are continuous such that: φ(y) = 0 for y ≤ 0 and φ(y) > 0 for y > 0 (cf. gi(x) ≤ 0); ψ(y) = 0 for y = 0 and ψ(y) > 0 for y 6= 0 (cf. hi(x) = 0). Typically φ(y) = (max{0; y})p a ψ(y) = |y|p where p is positive integer. In this case, the penalty function is

m l X p X p α(x) = (max{0; gi(x)}) + |hi(x)| . i=1 i=1 We refer to the function f(x) + µα(x) (new objective) as the auxiliary function.

Remark 352 (General formulation).

We have X ⊂ IRn, X 6= ∅, g(x) ∈ IRm, h(x) ∈ IRl and ? ∈ argmin{f(x) | g(x) ≤ n 0, h(x) = 0, x ∈ X} and all f, gi(i = 1, . . . , m), hi(i = 1, . . . , l) are continuous on IR . Let α be a continuous and satisfying penalty function properties (as it was already discussed). Then (basic) penalty problem is:

max{θ(µ) | µ ≥ 0, θ(µ) = inf{f(x) + µα(x) | x ∈ X}}

The following lemma and theorem say:

inf{f(x) | g(x) ≤ 0, h(x) = 0, x ∈ X} = sup{θ(µ) | µ ≥ 0} = lim θ(µ) µ−→∞

So, we can get arbitrarily close to xmin solving θ(µ) for large µ.

Lemma 353.

n Let X ⊂ IR , X 6= ∅, f, g1, . . . , gm, h1, . . . , l continuous functions (as above), α penalty function (as above), and ∀µ ∃xµ ∈ X : θ(µ) = f(xµ) + µα(xµ) ⇒ 1. inf{f(x) | g(x) ≤ 0, h(x) = 0, x ∈ X} ≥ sup{θ(µ) | µ ≥ 0} where θ(µ) = inf{f(x) + µα(x) | x ∈ X}}

2. f(xµ) is a nondecreasing function of µ ≥ 0, θ(µ) is a nondecreasing function of µ and α(xµ) is a nonincreasing function of µ.

Proof: See literature. 

107 Theorem 354.

n Let X ⊂ IR , X 6= ∅, f, g1, . . . , gm, h1, . . . , l continuous functions (as above), α penalty function (as above), and ∀µ ∃xµ ∈ X : xµ ∈ argmin{f(x) + µα(x) | x ∈ X} and the sequence {xµ} is contained in a compact subset of X. Then:

inf{f(x) | g(x) ≤ 0, h(x) = 0, x ∈ X} = sup{θ(µ) | µ ≥ 0} = lim θ(µ) µ−→∞

where θ(µ) = inf{f(x) + µα(x) | x ∈ X}} = f(xµ) + µα(xµ). Furthermore, limit x¯ of any convergent subsequence of {xµ} is an optimal solution to the original problem and µα(xµ) −→ 0 as µ −→ ∞.

Proof: Main ideas: We want to apply general convergence theorems. By lemma θ(µ) monotone ⇒ sup ... = lim .... α(xµ) −→ 0 is proven using α(xµ) ≤ ε and ε > 0 and ε −→ 0. So, α(x¯) = 0. Ten, x¯ is optimal because of convergence and continuity xµ −→ x¯, so sup ... = f(x¯). At the end µα(xµ) = θ(µ) − f(xµ), and as µ −→ ∞, it approaches 0. 

Corollary 355.

α(xµ) = 0 ⇒ xµ solves the problem.

Remark 356.

In applications we often have X compact, so {xµ} belongs to compact and the assump- tion is satisfied. Notice that xµ optimal for penalty problem can be made arbitrarily close to the feasible region. f(xµ) + µα(xµ) can be made arbitrarily close to the optimal objective function value of the original primal problem. {xµ} points with µ increasing gives the following algorithm. xµ are infeasible approach- ing the feasible region, so the frequent names are ‘outer minimization’ or ‘exterior penalty function method’.

Remark 357 (KKT multipliers).

p With α, φ, ψ differentiable (Remember that function ψ(gi(x)) = | max{0, gi(x)}| is not necessarily differentiable with respect to inner function – φ0 does not exist – or components of x.) KKT multipliers can be computed in the case of the unique solution x¯. Then, the sequences may be used: uµ,i −→ ui and vµ,i −→ vi where 0 0 uµ,i = µφ (gi(xµ)) vµ,i = µψ (hi(xµ)).

108 Algorithm 358 (SUMT — Sequential Unconstrained Minimization Technique).

Set ε > 0, starting point x1, penalty coefficient µ1 > 0 and β > 1, k := 1.

1. For starting solution xk, we solve the following penalty problem xµk ∈

argmin{f(x) + µkα(x) | x ∈ X}. For xµk optimal solution with the objec-

tive function value θ(µk) = f(xµk ) + µkα(xµk ), we assign xk+1 := xµk (a new starting point for the next iteration).

2. If µkα(xk+1) < ε then STOP , otherwise µk+1 := βµk, k := k + 1 and GOTO 1.

Figure 1.4: Illustration of Hooke-Jeeves algorithm.

Example 359.

4 2 2 2 Solve min{(x1 − 2) + (x1 − 2x2) | x1 − x2 = 0, x ∈ IR } using a quadratic penalty 2 2 function α(x1, x2) = (x1 − x2) , µ1 = 0, 1, β = 10 and Algorithm 358. The increasing value µk forces the infeasibility as small as it is possible.

k µk xk+1 = xµk f(xk+1) α(xk+1) µkα(xk+1) θ(µk) 1 0, 1 (1, 4539; 0, 7608) 0, 0935 1, 8307 0, 1831 0, 2766 2 1, 0 (1, 1687; 0, 7407) 0, 5753 0, 3908 0, 3908 0, 9661 3 10, 0 (0, 9906; 0, 8425) 1, 5203 0, 01926 0, 1926 1, 7129 4 100, 0 (0, 9507; 0, 8875) 1, 8917 0, 000267 0, 0267 1, 9184

Remark 360 (Computational difficulties).

For very large µ sometimes ill conditioning (cf. eigenvalues of Hessian of auxiliary function) may cause premature termination (near to feasible region but for from the optimum). So improvement may be required.

Remark 361 (Improvements).

The idea is to avoid computational problems for cases where we do not need µ −→ ∞. Only µ large enough and still approaching xmin. It works for exact penalty functions. We may choose l1 (absolute value) penalty function (p = 1). Theory says that for the KKT point and multipliers, for the convex case, to reach optimum is enough to solve the problem with l1 penalty function and µ ≥ max{ui(i = 1, . . . , m), vi(i = 1, . . . , l)}.

Exercise 362.

Draw figure for your own example and the case of f + µ|h| instead of f + µh2.

109 Remark 363 (Augmented Lagrangian.).

Therefore, recently, Lagrangian-based penalty functions are utilized (see software pack- ages, e.g., MINOS accompanying GAMS). The idea is to shift penalty to constant θ to Pl 2 obtain f(x) + µ i=1(hi(x) − θi) (θi perturbed from 0). After simple computations Pl Pl 2 and changed notation, we have f(x) + i=1 vihi(x) + µ i=1 hi (x). It does not need µ −→ ∞ and it is differentiable (cf. l1). For software implementations, the general ob- Pm 2 Pl Pm 2 2 jective of the form f(x) + i=1 ui(gi(x) + si ) + i=1 vihi(x) + µ( i=1 (gi(x) + si ) + Pl 2 i=1 hi(x) ) is used. Frequently different penalty coefficients µi for different con- straints are used. In addition, it is necessary to specify the way how Lagrange multi- pliers will be updated.

1.12.2 Barrier function-based algorithms Remark 364.

We will further consider the following NLP:

min{f(x) | g(x) ≤ 0, x ∈ X},

where X ⊂ IRn, X 6= ∅ (it may involve equality constraints), f and components of g are continuous functions. The equality constraints are involved in X because the barrier function requires the nonempty interior of the set described by explicit constraints {x | g(x) < 0} 6= ∅. (It is not enough to replace h(x) = 0 with h(x) ≥ 0 and h(x) ≤ 0.)

Exercise 365 (Examples of barrier functions).

m Explain why the following barrier functions are as follows: B(x) = P(− 1 ), B(x) = gi(x) i=1 m m P P − ln(−gi(x)), and B(x) = − ln(min{1, −gi(x)}). i=1 i=1

Remark 366 (General requirements).

We have min{θ(µ) | µ ≥ 0} and θ(µ) = inf{f(x) + µB(x) | g(x) < 0, x ∈ X}. There is m P B(x) = φ(gi(x)), where φ(y) is continuous over {y | y < 0}, if y < 0 then φ(y) ≥ 0, i=1 and φ(y) −→ ∞ with y −→ 0−.

110 Remark 367 (Convergence).

n n There are f, g1, . . . , gm continuous functions on IR , X ⊂ IR , X 6= ∅ closed and {x ∈ X | g(x) < 0}= 6 ∅. There is B a barrier function, as above, continuous on set {x ∈ X | g(x) < 0}. We assume that x¯ is optimal and ∀Nε(x¯) ∃x ∈ Nε(x¯): g(x) < 0. Then: min{f(x) | g(x) ≤ 0, x ∈ X} = lim θ(µ) = inf{θ(µ) | µ > 0} µ−→0+ where θ(µ) = inf{f(x) + µB(x) | g(x) < 0, x ∈ X}; and so θ(µ) = f(xµ) + µB(xµ) where xµ is feasible, i.e., xµ ∈ X ∧ g(xµ) < 0; and a convergent subsequence of {xµ} + converges to x¯ and µB(xµ) −→ 0 as µ −→ 0 .

Remark 368 (Principles).

While the penalty method generates a sequence infeasible points that approaches the boundary of the feasible set from outside (cf. Figure ??), the barrier method creates a sequence of feasible points that approaches the boundary from inside. Emphasize that the barrier does not allow to reach the boundary. We may discuss just to NLPs having inequalities Ineq with the nonempty interior. the reason is that under the equalities the method cannot work with interior points. certain possibility is given by the replacement 2 of hi(x) = 0 by relaxed constraint hi(x) ≤ ε. The barrier is involved in Ineq in such a way that constraints g(x) < 0 are considered and the objective function is enriched with the barrier function B(x):

m X B(x) = φ(gi(x)), i=1 where, usually, φ(y) = −1/y, φ(y) = − ln(−y) or φ(y) = − ln(min{1, −y}).

Algorithm 369 (Barrier).

We choose ε > 0, x1, satisfying g(x1) < 0, and barrier coefficients µ1 > 0 and β ∈ (0, 1), k := 1.

1. For the initial value xk, we solve min{f(x) + µkB(x) | g(x) < 0, x ∈ X}. If xµk

then the optimum solution value θ(µk) = f(xµk ) + µkB(xµk ), is used to obtain

xk+1 := xµk .

2. If µkB(xk+1) < ε then STOP , otherwise µk+1 := βµk, k := k +1 and GOTO 1.

111 Example 370.

4 2 2 2 Solve min{(x1 − 2) + (x1 − 2x2) | x1 − x2 ≤ 0, x ∈ IR } using a barrier function 2 B(x1, x2) = −1/(x1 − x2), µ1 = 10, 0, β = 0, 1 and by Algorithm 369.

k µk xk+1 = xµk f(xk+1) B(xk+1) µkB(xk+1) θ(µk) 1 10.0 (0.7079; 1.5315) 8.3338 0.9705 9.705 18.0388 2 1.0 (0.8282; 1.1098) 3.8214 2.3591 2.3591 6.1805 3 0.1 (0.8989; 0.9638) 2.5282 6.4194 0.6419 3.1701 4 0.01 (0.9294; 0.9162) 2.1291 19.0783 0.1908 2.3199

Figure 1.5: Illustration of KKT conditions.

Remark 371.

Fiacco and McCormick developed the SUMT algorithm combining the penalty approach for equalities with a barrier approach for inequalities. There are also very efficient modifications (interior point methods) for linear program- ming.

Exercise 372.

For penalty and barrier methods, be able to formulate the approximating problem (see θ(µ)). With this problem, be able to realize one iteration (e.g., by the gradient method). Understand differences between penalty (exterior penalty function) and barrier (interior penalty function) methods. Hint: In contrast to penalty, B(x) 6= 0 (barrier) for x feasible (cf. α(x) = 0 for x feasible with penalty).

Remark 373 (Computational difficulty).

Another computational difficulty is that we have to start from x relative interior point with respect to {x ∈ X | g(x) < 0} and also line search, e.g., with discrete steps, nay lead out of the feasible set and, e.g., B(x) < 0.

1.12.3 Methods of feasible directions Remark 374 (The choice of feasible direction).

Another possibility to solve a constrained optimization problem is based on the idea to search for feasible directions within the feasible set and use them during iterations. We start with the easiest classical and understandable Zoutendijk’s method for Ineq.

112 Algorithm 375 (Zoutendijk).

Choose x1 feasible, i.e. g(x1) ≤ 0 and k := 1.

1. Denote I = {i | gi(xk) = 0} and solve

> > min{z | ∇f(xk) d−z ≤ 0, ∇gi(xk) d−z ≤ 0, i ∈ I, −1 ≤ dj ≤ 1, j = 1, . . . , n}

We denote (zk, dk) as the optimal solution. If z = 0 then STOP and xk is a point satisfying conditions of Fritz John. If zk < 0 then we further continue.

2. We obtain the solution λk of min{f(xk + λdk) | 0 ≤ λ ≤ λmax},

where λmax = sup{λ | gi(xk + λdk) ≤ 0, i = 1, . . . , m}.

We assign xk+1 := xk + λkdk, k := k + 1 and GOTO 1.

Remark 376 (Topkis-Veinott).

The method has several weak points (mainly the algorithmic map is not closed, and hence, the convergence is not guaranteed), however, there are several improving modi- fications (e.g., Topkis and Veinott achieved the convergence by replacing the constraint > > ∇gi(xk) d − z ≤ 0, i ∈ I by ∇gi(xk) d − z ≤ −gi(xk), i = 1, . . . , m).

Remark 377 (Rosen projected gradient).

Another idea (Rosen) is to use a projected gradient instead of reduced one. With constrained problems, we often cannot use the steepest direction because it may lead (immediately) to infeasible points. The idea is to project −∇f(x) in such a way that improves the objective function and maintains feasibility.

113 Remark 378 (Wolfe).

The most promising idea (implementation viewpoint) is Wolfe’s reduced gradient method that works with the feasible directions. It depends upon reducing the di- mensionality of the problem by representing all variables in terms of an independent subset of the variables. Originally has been developed for linear constraints. We have min{f(x) | Ax = b, x ≥ 0}, where r(A) = m, f continuously differ- entiable on IRn. Nondegeneracy assumption is satisfied: Any columns of A are linearly independent and each EP of the feasible set has m strictly positive vari- ables. So, every feasible solution has at least m positive components and at most  x  n − m zero components. Let x be feasible A = [B, N], x = B . We de- xN  ∇ f(x)  note components of ∇f(x) = B and d is an improving feasible direc- ∇N f(x) > tion (cf. LP) of f at x if ∇f(x) d < 0, Ad = 0(xj = 0 ⇒ dj ≥ 0). We find   dB −1 d = as follows: 0 = Ad = BdB + NdN ⇒ dB = −B NdN . Therefore, dN   > > > dB > −1 (cf. with LP) ∇f(x) d = (∇Bf(x) , ∇N f(x) ) = ∇Bf(x) (−B NdN ) + dN > > > −1 > > ∇N f(x) dN = (∇N f(x) − ∇Bf(x) B N)dN = rN , where rN is a reduced gradi-    −1  > rB > > B B > > ent. Then, r = = ∇f(x) − ∇Bf(x) −1 = (0 , rN ). So, we used rN B N Ax = b to obtain a reduced description of feasible d, we need to assign dN to obtain  −B−1a  descent direction (With LP — edge descent direction was d = j ≥ 0.) ej > We must choose dN to obtain rN dN < 0 (and xj = 0 ⇒ dj ≥ 0). ∀j nonbasic: rj ≤ 0 ⇒ dj = −rj and rj > 0 ⇒ dj = −xjrj.

Theorem 379.

  xB We have min{f(x) | Ax = b, x ≥ 0}, r(A) = m, x = feasible, xB > 0, xN −1 > > > −1 A = [B, N], and ∃B . f is differentiable at x¯, r = ∇f(x) − ∇Bf(x) B A.   dB d = is defined ∀j ∈ IN : dj = −rj if rj ≤ 0 and dj = −xjdj if rj > 0 and dN −1 dB = −B NdN . If d 6= 0 then d is an improving feasible direction. d = 0 iff x is a KKT point.

Proof: See literature. 

114 Algorithm 380 (Wolfe).

Choose x1 feasible, and k := 1.   dB 1. d = is obtained from (*). If d = 0 then STOP and xk is a dN KKT point (Even Lagrange multipliers are obtained for constraints Ax = > −1 b ... ∇Bf(x) B , x ≥ 0,..., r). dj = −rj if i 6∈ Ik and rj ≤ 0 and dj = −xjrj −1 if i 6∈ Ik and rj > 0 and dB = −B NdN . where Ik is an index set of the m largest components of xk, and so B = {aj | j 6∈ Ik}. and B = {aj | j 6∈ Ik} and > > > −1 r = ∇N f(xk) − ∇Bf(xk) B A.

2. Solve the line search problem λk ∈ argmin{f(xk + λdk) | 0 ≤ λ ≤ λmax}, where xjk λmax = ∞ for dk ≥ 0 and min {− } otherwise λmax = ∞ for dk ≥ 0. Then, 1≤j≤n djk

xk+1 := xk + λkdk, k := k + 1 and GOTO 1. .

Remark 381 (Zangwill).

There is also Zangwill’s convex-simplex method. It is similar to Wolfe’s method. But the only nonbasic variable is modified while the other nonbasic variables are fixed at their current levels (cf. LP simplex).

Remark 382.

The following Wolfe’s method generalization is used for Eq. It is implemented in the CONOPT solver (cf. GAMS and AIMMS).

Remark 383 (Solved program).

We solve a program Eq including bounds on x:

min{f(x) | h(x) = 0, a ≤ x ≤ b}.

We assume differentiability of f and h functions. Therefore, we may use linearization h(xk)+∇h(xk)(x−xk) and use previous remarks. We also suppose nondegeneracy that, > > > l i.e. any feasible solution x may be decomposed in x = (xB, xN ) , where xB ∈ IR a n−l xN ∈ IR . Similarly, it is possible to split a and b in such a way that aB < xB < bB is valid. In addition, Jacobi matrix ∇h(x) may be split in two matrices: a regular square matrix ∇Bh(x) and the remaining matrix ∇N h(x).

115 Algorithm 384 (Abadie-Carpentier).

Choose the feasible solution x and specify xB a xN .

> > > −1 1. We assign r = ∇N f(x) − ∇Bf(x) ∇Bh(x) ∇N h(x). We specify vector dN of dimension n − l that j-th component of dj is equal 0 for xj = aj and rj > 0 or xj = bj and rj < 0. Otherwise dj := −rj. If dN = 0 then STOP and a point x satisfies the KKT conditions. Otherwise, we continue.

2. We find a solution y of the system of nonlinear equations h(y, x¯N ) = 0 by New- ton’s method (x¯N is specified later). We choose ε > 0 and positive integer K. Therefore, θ > 0 satisfies aN ≤ x¯N ≤ bN , where x¯N = xN + θdN . We define y1 := xB, k := 1. −1 A. We select yk+1 := yk − ∇Bh(yk, x¯N ) h(yk, x¯N ). Je-li aB ≤ yk+1 ≤ bB,

f(yk+1, x¯N ) < f(xB, xN ) and kh(yk+1, x¯N )k < ε then GOTO C. Other- wise GOTO B. 1 B. If k = K then θ := 2 θ, x¯N := xN + θdN , and so, y1 := xB, k := 1 and GOTO A. In contrast, k := k + 1 and GOTO A. > > C. We define x := (yk+1, x¯N ), identify new basis B and GOTO 1.

Exercise 385.

Explain principles of feasible direction methods. Try to reformulate own NLP problem for the selected algorithm and realize one iteration using linear programming.

1.12.4 Successive linear programming approach

Remark 386 (Approximation of programs).

Engineering problems of oil industry in sixties involved less non-linear terms. Therefore, techniques based on a repeated local approximation by linear (or quadratic) programs have been developed.

116 Algorithm 387 (Griffith-Stewart).

We choose ε > 0 (terminating scalar), δ > 0 (limiting iteration movement), x1, k := 1. 1. Solve a linear program

> min{∇f(xk) (x − xk) | ∇g(xk)(x − xk) ≤ −g(xk), ∇h(xk)(x − xk) = −h(xk),

a ≤ x ≤ b, −δ ≤ xi − xik ≤ δ, i = 1, . . . , n},

where xik is the i-th component xk. The optimal solution is denoted as xk+1.

2. If kxk+1 − xkk ≤ ε and xk+1 is near feasible then STOP . Otherwise, in the case k := k + 1 and GOTO 1.

Remark 388 (Comments).

It is useful for a few (hundreds) nonlinear terms. It is generalized in two directions: (1) l1 penalty is used as a merit function deciding whether to reject a new iterate; (2) with the augmented Lagrangian the successive quadratic approximation is used.

1.13 Lagrangian duality

Remark 389 (Primal and dual problems).

A primal problem of NLP is P: min{f(x) | g(x) ≤ 0, h(x) = 0, x ∈ X}. La- x grangian dual problem D: is defined as maxu,v{θ(u, v) | u ≥ 0}, where θ(u, v) = > > infx{L(x, u, v) | x ∈ X} i.e. θ(u, v) = inf{f(x) + u g(x) + v h(x) | x ∈ X}.

Example 390.

Find a dual program explicit formulation (if it is possible — useful — suitable) for:  2  ∈ argmin {x2 + x2 | −x − x + 4 ≤ 0, x , x ≥ 0}. We denote g(x) = 2 x1,x2 1 2 1 2 1 2 2 2 −x1 + x2 + 4 and X = {x | x1, x2 ≥ 0}. Then θ(u) = infx{x1 + x2 + u(−x1 − x2 + 4) | 2 2 x1, x2 ≥ 0} = infx1 {x1 − ux1 | x1, x2 ≥ 0} + infx2 {x2 − ux2 | x2 ≥ 0} + 4u. We use u calculus and obtain: x1,min = x2,min = 2 for u ≥ 0 and x1,min = x2,min = 0 for u < 0. 1 2 Hence θ(u) = − 2 u + 4u for u ≥ 0 and θ(u) = 4u for u < 0 As θ(u) is a concave function then umax = 4.

117 Example 391.

 2  Find a dual program explicit formulation for ∈ argmin {−2x +x | x +x − 1 x1,x2 1 2 1 2 > 3 = 0 ∧ (x1, x2) ∈ X = {(0, 0), (0, 4), (4, 4), (1, 2)}}. Then θ(v) = minx1,x2 {−2x1 + x2 + v(x1 + x2 − 3) | (x1, x2) ∈ X} and we replace x1 and x2 with numbers and obtain θ(v) = −4 + 5v for v ≤ −1, θ(v) = −8 + v for −1 ≤ v ≤ 2, −3v for v > 2 (draw a figure of graph of θ(v)). Therefore, vmax = 2.

Compare results of last two examples: f(xmin) = θ(umax) and f(xmin) > θ(vmax).

Example 392 (Linear programming).

It is possible to show that linear programming duality may be derived from Lagrangian duality of nonlinear programs. So, for primal min{c>x | Ax = b, x ∈ X = {x | > > x ≥ 0}} we obtain: max θ(v) where θ(v) = infx{c x + v (b − Ax) | x ≥ 0} = > > > > > infx{(c − v A)x | x ≥ 0} + v b. So, if c − v A ≥ 0 then we need to obtain > > > infimum x = 0 and we have θ(v) = v b. If c − v A 6≥ 0 then ∃xj −→ ∞ : > > > θ(v) = −∞. Hence a dual program is maxv{v b | v A ≥ c }.

Example 393 (Quadratic programming).

We have the primal min{½xᵀHx + dᵀx | Ax ≤ b} where H is symmetric and PSD. Then the dual is max{θ(u) | u ≥ 0} where θ(u) = inf{½xᵀHx + dᵀx + uᵀ(Ax − b) | x ∈ IRn}. So the gradient must be equal to 0: Hx + Aᵀu + d = 0, and the dual reads: max{½xᵀHx + dᵀx + uᵀ(Ax − b) | u ≥ 0, Hx + Aᵀu = −d}. Transposing Hx + Aᵀu + d = 0 and multiplying it by x from the right, we obtain uᵀAx + dᵀx = −xᵀHx. We may substitute this into the dual and obtain max{−½xᵀHx − bᵀu | u ≥ 0, Hx + Aᵀu = −d}. If H is even PD then ∃H⁻¹ and x = −H⁻¹(d + Aᵀu), so the dual has the form: max{−½uᵀAH⁻¹Aᵀu − (b + AH⁻¹d)ᵀu − ½dᵀH⁻¹d | u ≥ 0}.
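A sketch verifying the explicit dual on a small PD instance (data invented here): the dual is a bound-constrained concave quadratic in u, and its optimal value matches the primal objective at the recovered point x = −H⁻¹(d + Aᵀu).

```python
import numpy as np
from scipy.optimize import minimize

# Small illustrative PD instance.
H = np.array([[2.0, 0.0], [0.0, 2.0]])
d = np.array([-2.0, -5.0])
A = np.array([[1.0, 2.0], [-1.0, 2.0], [-1.0, -2.0]])
b = np.array([6.0, 2.0, 2.0])
Hinv = np.linalg.inv(H)

# Explicit dual for PD H: maximize over u >= 0 the concave quadratic
# -1/2 u^T (A H^{-1} A^T) u - (b + A H^{-1} d)^T u - 1/2 d^T H^{-1} d.
Q = A @ Hinv @ A.T
r = b + A @ Hinv @ d
const = 0.5 * d @ Hinv @ d
neg_theta = lambda u: 0.5 * u @ Q @ u + r @ u + const

res = minimize(neg_theta, x0=np.zeros(3), bounds=[(0, None)] * 3)
u = res.x
x = -Hinv @ (d + A.T @ u)                  # primal point from the dual
print(-res.fun, 0.5 * x @ H @ x + d @ x)   # theta(u*) = f(x*) = -6.45
```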

Remark 394 (Geometric interpretation of Lagrange duality).

Compare it with the geometric interpretation of penalty algorithms. Consider a simple primal program min{f(x) | g(x) ≤ 0, x ∈ X} with a single constraint. We denote y = g(x), z = f(x) and G = {(y, z) | y = g(x), z = f(x), x ∈ X}. So we have to find min{z | y ≤ 0, (y, z) ∈ G}. Then θ for the dual is θ(u) = min{f(x) + ug(x) | x ∈ X} or θ(u) = min{z + uy | (y, z) ∈ G}. If α = z + uy then z = −uy + α is a straight line (−u is the slope, α is the intercept). So we may draw a figure containing G and the contour graph of z + uy for fixed u (slope). We may identify (ymin, zmin) graphically (as in LP) and find the value θ(u) as the intersection with the z axis. We may change u and obtain another (ymin, zmin) and θ(u) (as in penalty methods). We change u to find umax such that θ(u) is maximized (as θ(umax)). Illustrate by figures. In contrast to LP, in the NLP case a duality gap may occur (illustrate by a figure) when G is nonconvex.
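The picture can be reproduced numerically for Example 390 (a sketch; G is sampled on a grid, so θ(u) below is the intercept of the lowest line of slope −u touching the sampled G):

```python
import numpy as np

# (y, z) picture for Example 390: y = g(x) = 4 - x1 - x2,
# z = f(x) = x1^2 + x2^2, with x sampled over X = {x >= 0}.
grid = np.linspace(0.0, 6.0, 121)
x1, x2 = np.meshgrid(grid, grid)
y = 4.0 - x1 - x2                 # constraint values over sampled X
z = x1**2 + x2**2                 # objective values over sampled X

def theta(u):
    # intercept of the lowest line z = -u*y + alpha supporting the set G
    return np.min(z + u * y)

for u in (0.0, 2.0, 4.0, 6.0):
    print(u, theta(u))            # theta is maximized at u = 4
```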

1.13.1 Duality theorems

For NLP there are weak and strong duality theorems analogous to those in LP.

Theorem 395 (weak duality).

Let x be feasible to P and (u, v) be feasible to D. Then f(x) ≥ θ(u, v).

Proof: Based on the idea that inf A ≤ a for any a ∈ A. □

Corollary 396.

1. “inf of P ≥ sup of D”.

2. If f(x̄) = θ(ū, v̄) and x̄, (ū, v̄) are feasible to P and D respectively, then they are optimal.

3. If the inf of P is equal to −∞ then ∀u ≥ 0 : θ(u, v) = −∞.

4. If the sup of D is equal to +∞ then P is infeasible.

Lemma 397.

Let X ⊂ IRn, X ≠ ∅ be convex, let α : IRn −→ IR and the components of g : IRn −→ IRm be convex, and let h : IRn −→ IRl be affine (defined by a linear term and a constant vector). If System 1 below has no solution x, then System 2 below has a solution (u0, uᵀ, vᵀ). The converse holds if u0 > 0.

System 1: α(x) < 0, g(x) ≤ 0, h(x) = 0, x ∈ X.

System 2: ∀x ∈ X : u0α(x) + uᵀg(x) + vᵀh(x) ≥ 0, (u0, uᵀ) ≥ 0, (u0, uᵀ, vᵀ) ≠ 0.

Proof: See literature (cf. Farkas’ Theorem). □

Theorem 398 (strong duality).

Let X ⊂ IRn, X ≠ ∅ be convex, let f : IRn −→ IR and the components of g : IRn −→ IRm be convex functions, and let h : IRn −→ IRl be affine (h(x) = Ax − b). Suppose that the following constraint qualification holds: ∃x̂ ∈ X : g(x̂) < 0, h(x̂) = 0, and 0 ∈ int h(X) where h(X) = {h(x) | x ∈ X}. Then inf{f(x) | x ∈ X, g(x) ≤ 0, h(x) = 0} = sup{θ(u, v) | u ≥ 0}. Furthermore, if the infimum is finite, then sup{θ(u, v) | u ≥ 0} is achieved at (ū, v̄) with ū ≥ 0. If the infimum is achieved at x̄, then ūᵀg(x̄) = 0.

Proof: See literature. It uses the corollary above and then the Lemma. There u0 > 0, and so 1/u0 is used as a multiplier. At the end the complementarity condition is utilized. □

Definition 399 (Saddle point of Lagrangian function).

Let Φ(x, u, v) = f(x) + uᵀg(x) + vᵀh(x) be the Lagrangian function. Then (x̄, ū, v̄) with x̄ ∈ X and ū ≥ 0 is its saddle point if ∀x ∈ X, ∀(u, v) with u ≥ 0: Φ(x̄, u, v) ≤ Φ(x̄, ū, v̄) ≤ Φ(x, ū, v̄).
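For Example 390 the saddle point is (x̄, ū) = ((2, 2), 4) (no equality constraints, so v is absent); a short sketch checks both inequalities on sampled points:

```python
import numpy as np

# Saddle point check for Example 390: Phi(x, u) = f(x) + u*g(x).
f = lambda x: x[0]**2 + x[1]**2
g = lambda x: 4.0 - x[0] - x[1]
Phi = lambda x, u: f(x) + u * g(x)
x_bar, u_bar = np.array([2.0, 2.0]), 4.0

# Left inequality: Phi(x_bar, u) <= Phi(x_bar, u_bar) for all u >= 0
# (holds with equality since g(x_bar) = 0).
us = np.linspace(0.0, 10.0, 11)
assert all(Phi(x_bar, u) <= Phi(x_bar, u_bar) + 1e-12 for u in us)

# Right inequality: Phi(x_bar, u_bar) <= Phi(x, u_bar) for all x in X.
rng = np.random.default_rng(0)
xs = rng.uniform(0.0, 6.0, size=(1000, 2))   # random samples of X
assert all(Phi(x_bar, u_bar) <= Phi(x, u_bar) + 1e-12 for x in xs)
print("saddle point inequalities hold on the samples")
```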

Theorem 400 (Saddle point optimality and absence of duality gap).

(x̄, ū, v̄) with x̄ ∈ X, ū ≥ 0 is a saddle point for the Lagrangian function Φ(x, u, v) iff

a) Φ(x̄, ū, v̄) = min{Φ(x, ū, v̄) | x ∈ X},

b) g(x̄) ≤ 0, h(x̄) = 0, and

c) ūᵀg(x̄) = 0.

Moreover, x̄ solves P, (ū, v̄) solves D, and f(x̄) = θ(ū, v̄).

Proof: See literature. □

Corollary 401.

Let X, f, g be convex, h affine, 0 ∈ int h(X), and ∃x̂ ∈ X : g(x̂) < 0, h(x̂) = 0. If x̄ is optimal to P, then ∃(ū, v̄) with ū ≥ 0 such that (x̄, ū, v̄) is a saddle point.

Theorem 402 (Relation between saddle point criterion and KKT).

Let S = {x ∈ X | g(x) ≤ 0, h(x) = 0} and minx{f(x) | x ∈ S} be defined. Suppose that x̄ ∈ S satisfies the KKT conditions, i.e., ∃(ū, v̄) with ū ≥ 0 : ∇f(x̄) + ∇g(x̄)ᵀū + ∇h(x̄)ᵀv̄ = 0, ūᵀg(x̄) = 0. Suppose that f and gi (i ∈ I = {i | gi(x̄) = 0}) are convex at x̄ and that hi is affine whenever v̄i ≠ 0. Then (x̄, ū, v̄) is a saddle point for Φ(x, u, v). Conversely, suppose that (x̄, ū, v̄) with x̄ ∈ int X and ū ≥ 0 is a saddle point solution. Then x̄ is feasible to P and, furthermore, (x̄, ū, v̄) satisfies the KKT conditions.

Proof: See literature. □

All the theorems together say: under certain assumptions, the optimal dual variables of the Lagrangian dual problem are equal to the Lagrange multipliers of the KKT conditions and to the multipliers of the saddle point conditions.
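For Example 390 this can be seen directly: the dual optimum u = 4 is also the KKT multiplier at x̄ = (2, 2), as a two-line check confirms (a sketch):

```python
import numpy as np

# KKT check for Example 390 at x_bar = (2, 2) with the dual optimum u = 4:
# grad f + u * grad g = 0 (stationarity) and u * g(x_bar) = 0.
x_bar, u = np.array([2.0, 2.0]), 4.0
grad_f = 2.0 * x_bar                 # gradient of x1^2 + x2^2
grad_g = np.array([-1.0, -1.0])      # gradient of 4 - x1 - x2
print(grad_f + u * grad_g)           # [0. 0.]  (KKT stationarity)
print(u * (4.0 - x_bar.sum()))       # 0.0      (complementarity)
```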

Remark 403 (Further readings).

The following ideas should be further studied:

• θ(u, v) is concave under weak assumptions (useful for maximization).
• Differentiability of θ relates to an assumption of uniqueness.
• A subgradient of θ is useful for finding ascent directions (see the sketch after this list).
• Ascent directions of θ allow searching for the steepest ascent direction.
• Formulating the dual problem and using it in its implicit form.
• Solving the dual problem by cutting plane methods (outer linearization).
• Recovering a primal solution from the dual problem.
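A minimal sketch of the subgradient idea from the list above, applied to Example 390: if xu minimizes the Lagrangian L(·, u) over X, then g(xu) is a subgradient of θ at u (a standard fact, assumed here), and projected subgradient ascent with diminishing steps recovers umax = 4.

```python
import numpy as np

# Projected subgradient ascent on theta(u) for Example 390.
g = lambda x: 4.0 - x[0] - x[1]

def lagrangian_argmin(u):
    # L(x, u) = x1^2 + x2^2 + u*(4 - x1 - x2): separable, closed form
    xi = max(u / 2.0, 0.0)
    return np.array([xi, xi])

u = 0.0
for k in range(1, 200):
    x_u = lagrangian_argmin(u)
    u = max(0.0, u + (1.0 / k) * g(x_u))   # step along the subgradient,
                                           # projected back onto u >= 0
print(u)                                   # approaches u_max = 4
```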
