Nonlinear Programming Models
Fabio Schoen
Introduction

2008 http://gol.dsi.unifi.it/users/schoen


NLP problems

Standard form:
min f(x)
h_i(x) = 0  i = 1,...,m
g_j(x) ≤ 0  j = 1,...,k
Here S = {x ∈ R^n : h_i(x) = 0 ∀i, g_j(x) ≤ 0 ∀j}.

Local and global optima

For the problem min f(x), x ∈ S ⊆ R^n, a global minimum (or global optimum) is any x⋆ ∈ S such that
x ∈ S ⇒ f(x) ≥ f(x⋆)
A point x̄ is a local optimum if ∃ε > 0 such that
x ∈ S ∩ B(x̄, ε) ⇒ f(x) ≥ f(x̄)
where B(x̄, ε) = {x ∈ R^n : ‖x − x̄‖ ≤ ε} is a ball in R^n.
Any global optimum is also a local optimum, but the opposite is generally false.

Convex Functions

A set S ⊆ R^n is convex if
x, y ∈ S ⇒ λx + (1 − λ)y ∈ S
for all choices of λ ∈ [0, 1].
Let Ω ⊆ R^n be a non empty convex set. A function f : Ω → R is convex iff
f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y)
for all x, y ∈ Ω, λ ∈ [0, 1].


Properties of convex functions

Every convex function is continuous in the interior of Ω. It might be discontinuous, but only on the frontier.
If f is continuously differentiable, then it is convex iff
f(y) ≥ f(x) + (y − x)^T ∇f(x)
for all x, y ∈ Ω.

If f is twice continuously differentiable, then it is convex iff its Hessian matrix is positive semi-definite:
∇²f(x) := [∂²f/(∂x_i ∂x_j)]
∇²f(x) ⪰ 0 iff
v^T ∇²f(x) v ≥ 0  ∀v ∈ R^n
or, equivalently, all eigenvalues of ∇²f(x) are non negative.

Example: an affine function is convex (and concave). For a quadratic function (Q: symmetric matrix)
f(x) = (1/2) x^T Q x + b^T x + c
we have
∇f(x) = Qx + b,  ∇²f(x) = Q
⇒ f is convex iff Q ⪰ 0.
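The eigenvalue test above is easy to run numerically. A small sketch with numpy, using a hypothetical symmetric Q (both the eigenvalue check and the sampled quadratic-form check should agree):

```python
import numpy as np

# Hypothetical quadratic f(x) = 0.5 x^T Q x + b^T x + c; its Hessian is Q,
# so convexity of f reduces to Q being positive semi-definite.
Q = np.array([[2.0, 1.0],
              [1.0, 2.0]])          # symmetric by construction

# Test 1: all eigenvalues non-negative  <=>  Q is PSD  <=>  f is convex
eigenvalues = np.linalg.eigvalsh(Q)
is_convex = bool(np.all(eigenvalues >= 0))

# Test 2 (sampled): v^T Q v >= 0 for random v in R^n
rng = np.random.default_rng(0)
V = rng.standard_normal((1000, 2))
quad_forms = np.einsum('ij,jk,ik->i', V, Q, V)   # one v^T Q v per row of V
```

Here `eigvalsh` is used because Q is symmetric; for a non-symmetric Hessian candidate one would symmetrize first.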


Convex Optimization Problems

Slight abuse of notation: a problem
min f(x), x ∈ S
is a convex optimization problem iff S is a convex set and f is convex on S. For a problem in standard form
min f(x)
h_i(x) = 0  i = 1,...,m
g_j(x) ≤ 0  j = 1,...,k
if f is convex, the h_i(x) are affine functions and the g_j(x) are convex functions, then the problem is convex.

Maximization

max f(x), x ∈ S
is called convex iff S is a convex set and f is a concave function (not to be confused with minimization of a concave function, or maximization of a convex function, which are NOT convex optimization problems).

Convex and non convex optimization

Convex optimization "is easy", non convex optimization is usually very hard.
Fundamental property of convex optimization problems: every local optimum is also a global optimum (a proof will be given later).
Minimizing a positive semidefinite quadratic function on a polyhedron is easy (polynomially solvable); if even a single eigenvalue of the Hessian is negative the problem becomes NP-hard.

Convex functions: examples

Many (of course not all...) functions are convex:
- affine functions a^T x + b
- quadratic functions (1/2) x^T Q x + b^T x + c with Q = Q^T, Q ⪰ 0
- any norm is a convex function
- x log x (log x, however, is concave)
- f is convex if and only if, ∀x_0, d ∈ R^n, its restriction to any line, φ(α) = f(x_0 + αd), is a convex function
- a non negative linear combination of convex functions is convex
- g(x, y) convex in x for all y ⇒ ∫ g(x, y) dy is convex


more examples...

- max_i {a_i^T x + b_i} is convex
- f, g convex ⇒ max{f(x), g(x)} is convex
- f_a convex for every a ∈ A (A a possibly uncountable set) ⇒ sup_{a∈A} f_a(x) is convex
- f convex ⇒ f(Ax + b) is convex
- let S ⊆ R^n be any set ⇒ f(x) = sup_{s∈S} ‖x − s‖ is convex
- Trace(A^T X) = Σ_{i,j} A_ij X_ij is convex (it is linear!)
- log det X^{-1} is convex over the set of matrices {X ∈ R^{n×n} : X ≻ 0}
- λ_max(X) (the largest eigenvalue of a matrix X)

Data Approximation

Table of contents
- norm approximation
- maximum likelihood
- robust estimation

Norm approximation

Problem:
min_x ‖Ax − b‖
where A, b are parameters. Usually the system is over-determined, i.e. b ∉ Range(A). For example, this happens when A ∈ R^{m×n} with m > n and A has full rank.
r := Ax − b: "residual".


Examples

- ‖r‖ = √(r^T r): least squares (or "regression")
- ‖r‖ = √(r^T P r) with P ≻ 0: weighted least squares
- ‖r‖ = max_i |r_i|: minimax, ℓ∞, or Tchebichev approximation
- ‖r‖ = Σ_i |r_i|: absolute or ℓ1 approximation

Possible (convex) additional constraints:
- maximum deviation from an initial estimate: ‖x − x_est‖ ≤ ε
- simple bounds: ℓ_i ≤ x_i ≤ u_i
- ordering: x_1 ≤ x_2 ≤ ... ≤ x_n

Examples: ℓ1, ℓ∞ and ℓ2 norms
(figures: histograms of the residuals of the ℓ1, ℓ∞ and ℓ2 approximations for a matrix A ∈ R^{100×30})
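The ℓ2 case of the norm approximation problem has a direct numerical solution. A sketch with random data of the same shape as the slide's example (A ∈ R^{100×30} is the only detail taken from the slide; the entries are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 100, 30                       # over-determined: m > n
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# l2 norm approximation min ||Ax - b||_2 solved via numpy's least squares
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
r = A @ x_ls - b                     # residual vector

# Optimality check: the residual must be orthogonal to Range(A)
# (the normal equations A^T (Ax - b) = 0)
grad_norm = float(np.linalg.norm(A.T @ r))
```

The ℓ1 and ℓ∞ variants have no closed form but can be cast as linear programs, as the later slides do for MLE with Laplace and uniform noise.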


Variants

min_x Σ_i h(y_i − a_i^T x), where h is a convex function:
- linear-quadratic: h(z) = z² if |z| ≤ 1, 2|z| − 1 if |z| > 1
- "dead zone": h(z) = 0 if |z| ≤ 1, |z| − 1 if |z| > 1
- logarithmic barrier: h(z) = −log(1 − z²) if |z| < 1, ∞ if |z| ≥ 1

comparison
(figure: plots of the norm-1, norm-2, linear-quadratic, dead-zone and log-barrier penalties on [−2, 2])
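The three penalty variants can be written down directly as vectorized functions; a minimal sketch (the function names are made up, the formulas are the ones above):

```python
import numpy as np

def linquad(z):
    # linear-quadratic (Huber-like): quadratic inside [-1, 1], linear outside
    z = np.abs(z)
    return np.where(z <= 1, z**2, 2*z - 1)

def deadzone(z):
    # dead zone: zero penalty inside [-1, 1], linear outside
    z = np.abs(z)
    return np.where(z <= 1, 0.0, z - 1)

def logbarrier(z):
    # -log(1 - z^2) for |z| < 1, +inf otherwise
    z = np.asarray(z, dtype=float)
    out = np.full_like(z, np.inf)
    inside = np.abs(z) < 1
    out[inside] = -np.log(1 - z[inside]**2)
    return out

# All three agree at z = 0 and the first two are continuous at |z| = 1
vals = (float(linquad(1.0)), float(deadzone(1.0)))
```

Note that both piecewise penalties are continuous at |z| = 1 (values 1 and 0 respectively), which is what makes them convex.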

Maximum likelihood

Given a sample X_1, X_2, ..., X_k and a parametric family of probability density functions L(·; θ), the maximum likelihood estimate of θ given the sample is
θ̂ = arg max_θ L(X_1, ..., X_k; θ)
Example: linear measurements with additive i.i.d. (independent identically distributed) noise:
X_i = a_i^T θ + ε_i    (1)
where the ε_i are i.i.d. random variables with density p(·):
L(X_1, ..., X_k; θ) = Π_{i=1}^k p(X_i − a_i^T θ)

Max likelihood estimate - MLE

(taking the logarithm, which does not change optimum points):
θ̂ = arg max_θ Σ_i log p(X_i − a_i^T θ)
If p is log-concave ⇒ this problem is convex. Examples:
- ε ~ N(0, σ), i.e. p(z) = (2πσ)^{-1/2} exp(−z²/(2σ²)) ⇒ the MLE is the ℓ2 estimate: θ̂ = arg min_θ ‖Aθ − X‖_2
- p(z) = (1/(2a)) exp(−|z|/a) ⇒ ℓ1 estimate: θ̂ = arg min_θ ‖Aθ − X‖_1

- p(z) = (1/a) exp(−z/a) 1_{z≥0} (negative exponential) ⇒ the estimate can be found by solving the LP problem:
  min 1^T (X − Aθ)
  Aθ ≤ X
- p uniform on [−a, a] ⇒ the MLE is any θ such that ‖Aθ − X‖_∞ ≤ a

Ellipsoids

An ellipsoid is a subset of R^n of the form
E = {x ∈ R^n : (x − x_0)^T P^{-1} (x − x_0) ≤ 1}
where x_0 is the center of the ellipsoid and P is a symmetric positive-definite matrix. Alternative representations:
E = {x ∈ R^n : ‖Ax − b‖_2 ≤ 1}
where A ≻ 0, or
E = {x ∈ R^n : x = x_0 + Au, ‖u‖_2 ≤ 1}
where A is square and non singular (an affine transformation of the unit ball).

Robust Least Squares

Least Squares: x̂ = arg min_x √(Σ_i (a_i^T x − b_i)²)
Hypothesis: the a_i are not known, but it is known that
a_i ∈ E_i = {ā_i + P_i u : ‖u‖ ≤ 1}
where P_i = P_i^T ⪰ 0. Definition: worst case residuals:
√(Σ_i max_{a_i ∈ E_i} (a_i^T x − b_i)²)
A robust estimate of x is the solution of
x̂_r = arg min_x √(Σ_i max_{a_i ∈ E_i} (a_i^T x − b_i)²)

RLS

It holds:
|α + β^T y| ≤ |α| + ‖β‖ ‖y‖
Choosing y⋆ = β/‖β‖ if α ≥ 0 and y⋆ = −β/‖β‖ if α < 0, then ‖y⋆‖ = 1 and
|α + β^T y⋆| = |α + β^T β sign(α)/‖β‖| = |α| + ‖β‖
then:
max_{a_i ∈ E_i} |a_i^T x − b_i| = max_{‖u‖≤1} |ā_i^T x − b_i + u^T P_i x|
= |ā_i^T x − b_i| + ‖P_i x‖



Thus the Robust Least Squares problem reduces to
min_x (Σ_i (|ā_i^T x − b_i| + ‖P_i x‖)²)^{1/2}
(a convex optimization problem). Transformation:
min_{x,t} ‖t‖_2
|ā_i^T x − b_i| + ‖P_i x‖ ≤ t_i  ∀i
i.e.
min_{x,t} ‖t‖_2
ā_i^T x − b_i + ‖P_i x‖ ≤ t_i
−ā_i^T x + b_i + ‖P_i x‖ ≤ t_i
(a Second Order Cone Problem). A norm cone is a convex set:
C = {(x, t) ∈ R^{n+1} : ‖x‖ ≤ t}

Geometrical Problems

- projections and distances
- polyhedral intersection
- extremal volume ellipsoids
- classification problems


Projection on a set

Given a set C, the projection of x on C is defined as:
P_C(x) = arg min_{z∈C} ‖z − x‖

Projection on a convex set

If
C = {x : Ax = b, f_i(x) ≤ 0, i = 1,...,m}
where the f_i are convex ⇒ C is a convex set and the problem
P_C(x) = arg min_z ‖x − z‖
Az = b
f_i(z) ≤ 0  i = 1,...,m
is convex.
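For some convex sets the projection has a closed form. A minimal sketch for a box (a convex set of the form {z : l ≤ z ≤ u}, chosen here because its projection is a componentwise clip; the numbers are made up):

```python
import numpy as np

# Projection onto the box C = {z : l <= z <= u}: P_C(x) = clip(x, l, u)
l = np.array([0.0, 0.0, 0.0])
u = np.array([1.0, 1.0, 1.0])
x = np.array([2.0, -0.5, 0.3])

p = np.clip(x, l, u)                 # the projection P_C(x)

# Variational characterization of projection onto a convex set:
# (x - p)^T (z - p) <= 0 for every z in C (checked on sampled points)
rng = np.random.default_rng(3)
Z = rng.uniform(l, u, size=(500, 3))
inner = (Z - p) @ (x - p)
```

The inequality (x − p)^T (z − p) ≤ 0 is the first-order optimality condition of the convex program above, specialized to the projection objective.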

Distance between convex sets

dist(C^(1), C^(2)) = min_{x∈C^(1), y∈C^(2)} ‖x − y‖
If C^(j) = {x : A^(j) x = b^(j), f_i^(j)(x) ≤ 0} then the minimum distance can be found through a convex model:
min ‖x^(1) − x^(2)‖
A^(1) x^(1) = b^(1)
A^(2) x^(2) = b^(2)
f_i^(1)(x^(1)) ≤ 0
f_i^(2)(x^(2)) ≤ 0


Polyhedral intersection

Case 1: polyhedra described by means of linear inequalities:
P_1 = {x : Ax ≤ b},  P_2 = {x : Cx ≤ d}
P_1 ∩ P_2 = ∅? It is a linear feasibility problem: Ax ≤ b, Cx ≤ d.
P_1 ⊆ P_2? Just check
sup {c_k^T x : Ax ≤ b} ≤ d_k  ∀k
(solution of a finite number of LPs).

Polyhedral intersection (2)

Case 2: polyhedra (polytopes) described through vertices:
P_1 = conv{v_1,...,v_k},  P_2 = conv{w_1,...,w_h}
P_1 ∩ P_2 = ∅? Need to find λ_1,...,λ_k, μ_1,...,μ_h ≥ 0:
Σ_i λ_i = 1,  Σ_j μ_j = 1
Σ_i λ_i v_i = Σ_j μ_j w_j
P_1 ⊆ P_2? ∀i = 1,...,k check whether ∃μ ≥ 0:
Σ_j μ_j = 1
Σ_j μ_j w_j = v_i

Minimal ellipsoid containing k points

Given v_1,...,v_k ∈ R^n find an ellipsoid
E = {x : ‖Ax − b‖ ≤ 1}
with minimal volume containing the k given points.
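A simpler mixed case than the vertex/vertex LPs above (and a useful sanity check): when P_1 is given by vertices and P_2 by inequalities, P_1 ⊆ P_2 iff every vertex satisfies Cx ≤ d, by convexity. A sketch with a hypothetical unit square inside a larger box:

```python
import numpy as np

# P1 = conv{v_1..v_4}: the unit square, given by its vertices
V = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# P2 = {x : Cx <= d}: the box -2 <= x_i <= 2
C = np.array([[ 1.0,  0.0], [-1.0,  0.0], [ 0.0,  1.0], [ 0.0, -1.0]])
d = np.array([2.0, 2.0, 2.0, 2.0])

# Containment: every vertex must satisfy every inequality; a convex
# combination of feasible points is then feasible too.
contained = bool(np.all(V @ C.T <= d + 1e-12))
```

When both polytopes are given by vertices, as on the slide, one LP per vertex of P_1 (expressing v_i as a convex combination of the w_j) is needed instead.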

A = A^T ≻ 0. The volume of E is proportional to det A^{-1} ⇒ a convex optimization problem (in the unknowns A, b):
min log det A^{-1}
A = A^T
A ≻ 0
‖Av_i − b‖ ≤ 1  i = 1,...,k

Max. ellipsoid contained in a polyhedron

Given P = {x : Ax ≤ b} find an ellipsoid
E = {By + d : ‖y‖ ≤ 1}
contained in P with maximum volume.

Max. ellipsoid contained in a polyhedron

E ⊆ P ⇔ a_i^T (By + d) ≤ b_i  ∀y: ‖y‖ ≤ 1
⇔ sup_{‖y‖≤1} {a_i^T By + a_i^T d} ≤ b_i  ∀i
⇔ ‖B a_i‖ + a_i^T d ≤ b_i
max_{B,d} log det B
B = B^T ≻ 0
‖B a_i‖ + a_i^T d ≤ b_i  i = 1,...,k

Difficult variants

These problems are hard:
- find a maximal volume ellipsoid contained in a polyhedron given by its vertices


- find a minimal volume ellipsoid containing a polyhedron described as a system of linear inequalities.
It is already a difficult problem to check whether a given ellipsoid E contains a polyhedron P = {Ax ≤ b}. This problem is still difficult even when the ellipsoid is a sphere: it is equivalent to norm maximization over a polyhedron, an NP-hard concave optimization problem.

Linear classification (separation)

Given two point sets X_1,...,X_k, Y_1,...,Y_h find a hyperplane a^T x = t such that:
a^T X_i ≥ 1  i = 1,...,k
a^T Y_j ≤ 1  j = 1,...,h
(an LP feasibility problem).

Robust separation

Find a "maximal" separation:
max_{a: ‖a‖≤1} (min_i a^T X_i − max_j a^T Y_j)
equivalent to the convex problem:
max t_1 − t_2
a^T X_i ≥ t_1  ∀i
a^T Y_j ≤ t_2  ∀j
‖a‖ ≤ 1

Optimality Conditions
Fabio Schoen
2008 http://gol.dsi.unifi.it/users/schoen

Optimality Conditions: descent directions

Let S ⊆ R^n be a convex set and consider the problem
min_{x∈S} f(x)
where f : S → R. Let x_1, x_2 ∈ S and d = x_2 − x_1: d is a feasible direction. If there exists ε̄ > 0 such that f(x_1 + εd) < f(x_1) for all ε ∈ (0, ε̄), then d is a descent direction.


Optimality Conditions for Convex Sets

If x⋆ ∈ S is a local optimum for f and there exists a neighborhood U(x⋆) such that f ∈ C¹(U(x⋆)), then
d^T ∇f(x⋆) ≥ 0  ∀d: feasible direction

Proof

Taylor expansion:
f(x⋆ + εd) = f(x⋆) + ε d^T ∇f(x⋆) + o(ε)
d cannot be a descent direction, so, if ε is sufficiently small,
f(x⋆ + εd) ≥ f(x⋆). Thus
ε d^T ∇f(x⋆) + o(ε) ≥ 0
and, dividing by ε,
d^T ∇f(x⋆) + o(ε)/ε ≥ 0
Letting ε ↓ 0 the proof is complete.

Optimality Conditions: tangent cone

General case:
min f(x)
g_i(x) ≤ 0  i = 1,...,m
x ∈ X  (X: open set)
Let S = {x ∈ X : g_i(x) ≤ 0, i = 1,...,m}. Tangent cone to S at x̄:
T(x̄) = {d ∈ R^n : d/‖d‖ = lim_k (x_k − x̄)/‖x_k − x̄‖}
where x_k ∈ S, x_k → x̄.


Some examples

- S = R^n ⇒ T(x) = R^n ∀x
- S = {Ax = b} ⇒ T(x) = {d : Ad = 0}
- S = {Ax ≤ b}; let I be the set of active constraints at x̄:
  a_i^T x̄ = b_i  i ∈ I
  a_i^T x̄ < b_i  i ∉ I

Let d = lim_k (x_k − x̄)/‖x_k − x̄‖ ⇒ for i ∈ I:
a_i^T d = a_i^T lim_k (x_k − x̄)/‖x_k − x̄‖
= lim_k a_i^T (x_k − x̄)/‖x_k − x̄‖
= lim_k (a_i^T x_k − b_i)/‖x_k − x̄‖
≤ 0
Thus if d ∈ T(x̄) ⇒ a_i^T d ≤ 0 for i ∈ I.


Viceversa, let x_k = x̄ + α_k d. If a_i^T d ≤ 0 for i ∈ I ⇒
a_i^T x_k = a_i^T (x̄ + α_k d) = b_i + α_k a_i^T d ≤ b_i  i ∈ I
a_i^T x_k = a_i^T (x̄ + α_k d) < b_i + α_k a_i^T d ≤ b_i  i ∉ I, if α_k is small enough
Thus
T(x̄) = {d : a_i^T d ≤ 0 ∀i ∈ I}

Example

Let S = {(x, y) ∈ R² : x² − y = 0} (a parabola). Tangent cone at (0, 0)? Let (x_k, y_k) → (0, 0), i.e. x_k → 0, y_k = x_k²:
‖(x_k, y_k) − (0, 0)‖ = √(x_k² + x_k⁴) = |x_k| √(1 + x_k²)
lim_{x_k→0⁺} x_k/(|x_k| √(1 + x_k²)) = 1,  lim_{x_k→0⁺} y_k/(|x_k| √(1 + x_k²)) = 0
lim_{x_k→0⁻} x_k/(|x_k| √(1 + x_k²)) = −1,  lim_{x_k→0⁻} y_k/(|x_k| √(1 + x_k²)) = 0
thus T(0, 0) = {(−1, 0), (1, 0)}

Descent direction

d ∈ R^n is a feasible direction at x̄ ∈ S if ∃ᾱ > 0:
x̄ + αd ∈ S  ∀α ∈ [0, ᾱ).
d feasible ⇒ d ∈ T(x̄), but in general the converse is false.
If
f(x̄ + αd) ≤ f(x̄)  ∀α ∈ (0, ᾱ)
d is a descent direction.

First order necessary optimality condition

Let x̄ ∈ S ⊆ R^n be a local optimum for min_{x∈S} f(x); let f ∈ C¹(U(x̄)). Then
d^T ∇f(x̄) ≥ 0  ∀d ∈ T(x̄)
Proof: let d = lim_k (x_k − x̄)/‖x_k − x̄‖. Taylor expansion:
f(x_k) = f(x̄) + ∇^T f(x̄)(x_k − x̄) + o(‖x_k − x̄‖)
= f(x̄) + ∇^T f(x̄)(x_k − x̄) + ‖x_k − x̄‖ o(1).
x̄ local optimum ⇒ ∃U(x̄): f(x) ≥ f(x̄) ∀x ∈ U ∩ S.


If k is large enough, x_k ∈ U(x̄):
f(x_k) − f(x̄) ≥ 0
thus
∇^T f(x̄)(x_k − x̄) + ‖x_k − x̄‖ o(1) ≥ 0
Dividing by ‖x_k − x̄‖:
∇^T f(x̄)(x_k − x̄)/‖x_k − x̄‖ + o(1) ≥ 0
and, in the limit, ∇^T f(x̄) d ≥ 0.

Examples

Unconstrained problems: every d ∈ R^n belongs to the tangent cone at a local optimum ⇒
∇^T f(x̄) d ≥ 0  ∀d ∈ R^n
Choosing d = e_i and d = −e_i we get
∇f(x̄) = 0
NB: the same is true if x̄ is a local minimum in the relative interior of the feasible set.

Linear equality constraints

min f(x)
Ax = b
Tangent cone: {d : Ad = 0}. Necessary conditions:
∇^T f(x̄) d ≥ 0  ∀d: Ad = 0
Equivalent statement:
min_d ∇^T f(x̄) d = 0
Ad = 0
(a linear program). From LP duality ⇒
max 0^T λ = 0
A^T λ = ∇f(x̄)
Thus at a local minimum point there exist Lagrange multipliers:
∃λ: A^T λ = ∇f(x̄)

Linear inequalities

min f(x)
Ax ≤ b
Tangent cone at a local minimum x̄: {d ∈ R^n : a_i^T d ≤ 0 ∀i ∈ I(x̄)}. Let A_I be the rows of A associated to the active constraints at x̄. Then
min_d ∇^T f(x̄) d = 0
A_I d ≤ 0
From LP duality:
max 0^T λ = 0
A_I^T λ = ∇f(x̄)
λ ≤ 0
Thus, at a local optimum, the gradient is a non positive linear combination of the coefficients of the active constraints.

Farkas' Lemma

Let A be a matrix in R^{m×n} and b ∈ R^m. One and only one of the following sets is non empty:
{x : Ax = b, x ≥ 0}
{y : A^T y ≤ 0, b^T y > 0}

Geometrical interpretation

Either b belongs to the cone {z : ∃x ≥ 0, z = Ax} generated by the columns of A, or there exists y ∈ {y : A^T y ≤ 0} with b^T y > 0.
(figure: the cone generated by the columns a_1, a_2 and the vector b)


Proof

1) If ∃x ≥ 0: Ax = b ⇒ b^T y = x^T A^T y. Thus if A^T y ≤ 0 ⇒ b^T y ≤ 0.
2) Premise: separating hyperplane theorem. Let C and D be two convex non empty sets with C ∩ D = ∅. Then there exist a ≠ 0 and b:
a^T x ≤ b  ∀x ∈ C
a^T x ≥ b  ∀x ∈ D
If C is a point and D is a closed convex set, separation is strict, i.e.
a^T C < b
a^T x > b  ∀x ∈ D

Farkas' Lemma (proof)

2) Let {x : Ax = b, x ≥ 0} = ∅. Let
S = {y ∈ R^m : ∃x ≥ 0, Ax = y}
S is closed and convex, and b ∉ S. From the separating hyperplane theorem ∃α ∈ R^m, α ≠ 0, β ∈ R:
α^T y ≤ β  ∀y ∈ S
α^T b > β
0 ∈ S ⇒ β ≥ 0 ⇒ α^T b > 0; α^T Ax ≤ β for all x ≥ 0. This is possible iff α^T A ≤ 0.
Letting y = α we obtain a solution of
A^T y ≤ 0
b^T y > 0

First order feasible variations cone

G(x̄) = {d ∈ R^n : ∇^T g_i(x̄) d ≤ 0, i ∈ I}

First order variations

G(x̄) ⊇ T(x̄). In fact, if {x_k} is feasible and
d = lim_k (x_k − x̄)/‖x_k − x̄‖
then g_i(x̄) ≤ 0 and
g(x̄ + lim_k (x_k − x̄)) ≤ 0


g(x̄ + lim_k ‖x_k − x̄‖ (x_k − x̄)/‖x_k − x̄‖) ≤ 0
g(x̄ + lim_k ‖x_k − x̄‖ · lim_k (x_k − x̄)/‖x_k − x̄‖) ≤ 0
g(x̄ + lim_k ‖x_k − x̄‖ d) ≤ 0
Let α_k = ‖x_k − x̄‖; for α_k ≈ 0:
g(x̄ + α_k d) ≤ 0
Now
g_i(x̄ + α_k d) = g_i(x̄) + α_k ∇^T g_i(x̄) d + o(α_k)
where α_k > 0 and d belongs to the tangent cone T(x̄). If the i-th constraint is active, then
g_i(x̄ + α_k d) = α_k ∇^T g_i(x̄) d + o(α_k) ≤ 0
g_i(x̄ + α_k d)/α_k = ∇^T g_i(x̄) d + o(α_k)/α_k ≤ 0
Letting α_k → 0 the result is obtained.

Example

G(x̄) ≠ T(x̄):
−x³ + y ≤ 0
−y ≤ 0

KKT necessary conditions

(Karush–Kuhn–Tucker) Let x̄ ∈ X ⊆ R^n, X ≠ ∅, be a local optimum for
min f(x)
g_i(x) ≤ 0  i = 1,...,m
x ∈ X
I: indices of the active constraints at x̄. If:
1. f(x), g_i(x) ∈ C¹(x̄) for i ∈ I
2. "constraint qualification" conditions T(x̄) = G(x̄) hold at x̄;
then there exist Lagrange multipliers λ_i ≥ 0, i ∈ I:
∇f(x̄) + Σ_{i∈I} λ_i ∇g_i(x̄) = 0.

Proof

x̄ local optimum ⇒ d ∈ T(x̄) ⇒ d^T ∇f(x̄) ≥ 0. But d ∈ T(x̄) ⇒ d^T ∇g_i(x̄) ≤ 0, i ∈ I. Thus it is impossible that
−∇^T f(x̄) d > 0
∇^T g_i(x̄) d ≤ 0  i ∈ I
From Farkas' Lemma ⇒ there exists a solution of:
Σ_{i∈I} λ_i ∇^T g_i(x̄) = −∇^T f(x̄)
λ_i ≥ 0  i ∈ I

Constraint qualifications: examples

- polyhedra: X = R^n and the g_i(x) are affine functions: Ax ≤ b
- linear independence: X open set, g_i(x), i ∉ I continuous at x̄, and {∇g_i(x̄)}, i ∈ I linearly independent
- Slater condition: X open set, g_i(x), i ∈ I convex differentiable functions at x̄, g_i(x), i ∉ I continuous at x̄, and ∃x̂ ∈ X strictly feasible: g_i(x̂) < 0, i ∈ I.

Convex problems

An optimization problem
min_{x∈S} f(x)
is a convex problem if
- S is a convex set, i.e. x, y ∈ S ⇒ λx + (1 − λ)y ∈ S ∀λ ∈ [0, 1]
- f is a convex function on S, i.e. f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y) ∀λ ∈ [0, 1] and x, y ∈ S

Standard convex problem

min f(x)
g_i(x) ≤ 0  i = 1,...,m
h_j(x) = 0  j = 1,...,k
If f is convex, the g_i are convex and the h_j are affine (i.e. of the form α^T x + β), then the problem is convex.


Convex problems

Every local optimum is a global one.
Proof: let x̄ be a local optimum for min_{x∈S} f(x) and x⋆ a global optimum. S convex ⇒ λx⋆ + (1 − λ)x̄ ∈ S. Thus, if λ ≈ 0,
f(x̄) ≤ f(λx⋆ + (1 − λ)x̄)
≤ λf(x⋆) + (1 − λ)f(x̄)
≤ f(x̄)
and x̄ is also a global optimum.

Sufficiency of 1st order conditions

For a convex differentiable problem: if d^T ∇f(x̄) ≥ 0 ∀d ∈ T(x̄), then x̄ is a (global) optimum.
Proof:
f(y) ≥ f(x̄) + (y − x̄)^T ∇f(x̄)  ∀y ∈ S
But y − x̄ ∈ T(x̄) ⇒
f(y) ≥ f(x̄) + d^T ∇f(x̄) ≥ f(x̄)  ∀y ∈ S
thus x̄ is a global minimum.

Convexity of the set of global optima

The set of global minima of a convex problem is a convex set. In fact, let x̄ and ȳ be global minima for the convex problem
min_{x∈S} f(x)
Then, choosing λ ∈ [0, 1], we have λx̄ + (1 − λ)ȳ ∈ S, as S is convex. Moreover
f(λx̄ + (1 − λ)ȳ) ≤ λf(x̄) + (1 − λ)f(ȳ) = λf⋆ + (1 − λ)f⋆ = f⋆
where f⋆ is the global minimum value. Thus equality holds and the proof is complete.

KKT for equality constraints

x̄: local optimum for
min f(x)
g_i(x) ≤ 0  i = 1,...,m
h_j(x) = 0  j = 1,...,k
x ∈ X ⊆ R^n
Let I be the set of active inequalities at x̄. If f(x), g_i(x), i ∈ I, and h_j(x) ∈ C¹ and "constraint qualifications" hold at x̄, then ∃λ_i ≥ 0 ∀i ∈ I and μ_j ∈ R ∀j = 1,...,k:
∇f(x̄) + Σ_{i∈I} λ_i ∇g_i(x̄) + Σ_{j=1}^k μ_j ∇h_j(x̄) = 0


Complementarity

KKT equivalent formulation:
∇f(x̄) + Σ_{i=1}^m λ_i ∇g_i(x̄) + Σ_{j=1}^k μ_j ∇h_j(x̄) = 0
λ_i g_i(x̄) = 0  i = 1,...,m
The condition λ_i g_i(x̄) = 0 is called the complementarity condition.

II order necessary conditions

If f, g_i, h_j ∈ C² at x̄ and the gradients of the active constraints at x̄ are linearly independent, then there exist multipliers λ_i ≥ 0, i ∈ I and μ_j, j = 1,...,k such that
∇f(x̄) + Σ_{i∈I} λ_i ∇g_i(x̄) + Σ_{j=1}^k μ_j ∇h_j(x̄) = 0
d^T ∇²L(x̄) d ≥ 0
for every direction d: d^T ∇g_i(x̄) ≤ 0, d^T ∇h_j(x̄) = 0, where
∇²L(x) := ∇²f(x) + Σ_{i∈I} λ_i ∇²g_i(x) + Σ_{j=1}^k μ_j ∇²h_j(x)

Sufficient conditions

Let f, g_i, h_j be twice continuously differentiable and let x⋆, λ⋆, μ⋆ satisfy:
∇f(x⋆) + Σ_{i∈I} λ_i⋆ ∇g_i(x⋆) + Σ_{j=1}^k μ_j⋆ ∇h_j(x⋆) = 0
λ_i⋆ g_i(x⋆) = 0
λ_i⋆ ≥ 0
d^T ∇²L(x⋆) d > 0  ∀d ≠ 0: ∇^T h_j(x⋆) d = 0, ∇^T g_i(x⋆) d = 0, i ∈ I
then x⋆ is a local minimum.

Lagrange Duality

Problem:
f⋆ = min f(x)
g_i(x) ≤ 0
x ∈ X
Definition: Lagrange function:
L(x; λ) = f(x) + Σ_i λ_i g_i(x)   λ ≥ 0, x ∈ X


Relaxation

Given an optimization problem
min_{x∈S} f(x)
a relaxation is a problem
min_{x∈Q} g(x)
where
S ⊆ Q
g(x) ≤ f(x)  ∀x ∈ S.
Weak duality: the optimal value of a relaxation is a lower bound on the optimal value of the problem.

Lagrange minimization is a relaxation

Proof: the feasible set X of the Lagrange problem contains the original one. If g(x) ≤ 0 and λ ≥ 0 ⇒
L(x, λ) = f(x) + λ^T g(x) ≤ f(x)

Optimality Conditions – p. 43 Optimality Conditions – p. 44 Dual Lagrange function Example (circle packing) with respect to constraints g(x) 0: ≤ min r θ(λ) = inf L(x, λ) − x∈X 2 2 2 4r (xi xj) (yi yj) 0 1 i < j N = inf (f(x) + λT g(x)) − − − − ≤ ≤ ≤ x∈X xi, yi 1 i =1,...,N ≤ xi, yi 0 i =1,...,N For every choice of λ 0, θ(λ) is a lower bound for every − − ≤ feasible solution and in≥ particular, is a lower bound for the global minimum value of the problem.


solution

When N = 2, relaxing the first constraint:
θ(λ) = min_{x,y,r} −r + λ(4r² − (x_1 − x_2)² − (y_1 − y_2)²)
0 ≤ x_1, x_2, y_1, y_2 ≤ 1
Minimizing with respect to x, y ⇒ |x_1 − x_2| = |y_1 − y_2| = 1, from which
θ(λ) = min_r −r + 4λr² − 2λ
r = 1/(8λ)
θ(λ) = −2λ − 1/(16λ)
This is a lower bound on the optimum value. The best possible lower bound:
θ⋆ = max_λ θ(λ)
λ⋆ = 1/(4√2),  θ⋆ = −√2/2
Choosing (x_1, y_1) = (0, 0) and (x_2, y_2) = (1, 1), a feasible solution with r = √2/2 is obtained. The Lagrange dual gives a lower bound equal to −√2/2: the same as the objective function −r at a feasible solution ⇒ optimal solution! (an exception, not the rule!)

Lagrange Dual

θ⋆ = max_{λ≥0} θ(λ)
This problem might:
1. be unbounded
2. have a finite sup but no max
3. have a unique maximum attained in correspondence with a single solution x
4. have many different maxima, each connected with a different solution x
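The N = 2 bound derived above is easy to verify numerically by maximizing θ(λ) = −2λ − 1/(16λ) on a grid (a crude but sufficient sketch):

```python
import numpy as np

# theta(lambda) = -2*lambda - 1/(16*lambda) for lambda > 0, as derived
# for the N = 2 circle packing relaxation.
lambdas = np.linspace(1e-3, 1.0, 200000)
theta = -2 * lambdas - 1 / (16 * lambdas)

i = int(np.argmax(theta))
lam_star = float(lambdas[i])       # should approach 1/(4*sqrt(2))
theta_star = float(theta[i])       # should approach -sqrt(2)/2
```

The grid maximum matches the analytic values λ⋆ = 1/(4√2) ≈ 0.1768 and θ⋆ = −√2/2 ≈ −0.7071.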


Equality constraints

f⋆ = min f(x)
g_i(x) ≤ 0  i = 1,...,m
h_j(x) = 0  j = 1,...,k
x ∈ X
Lagrange function:
L(x; λ, μ) = f(x) + λ^T g(x) + μ^T h(x)
where λ ≥ 0, but μ is free.

Linear Programming

min c^T x
Ax ≤ b
Dual Lagrange function:
θ(λ) = min_x c^T x + λ^T (Ax − b) = −λ^T b + min_x (c^T + λ^T A) x
but:
min_x (c^T + λ^T A) x = 0 if c^T + λ^T A = 0, −∞ otherwise.

Lagrange dual function:
θ(λ) = −λ^T b if c^T + λ^T A = 0, −∞ otherwise.
Lagrange dual:
max −λ^T b
A^T λ + c = 0
λ ≥ 0
which is equivalent (setting λ := −λ) to:
max λ^T b
A^T λ = c
λ ≤ 0

Quadratic Programming (QP)

min (1/2) x^T Q x + c^T x
Ax = b
(Q: symmetric). Lagrange dual function:
θ(λ) = min_x (1/2) x^T Q x + c^T x + λ^T (Ax − b)
= −λ^T b + min_x (1/2) x^T Q x + (c^T + λ^T A) x

QP – Case 1

Q has at least one negative eigenvalue ⇒
min_x (1/2) x^T Q x + (c^T + λ^T A) x = −∞
In fact ∃d: d^T Q d < 0. Choosing x = αd with α > 0 ⇒
(1/2) α² d^T Q d + α (c^T + λ^T A) d
and for large values of α this can be made as small as desired.

QP – Case 2

Q positive definite ⇒ minimum point of the dual Lagrange function:
Q x̄ + (c + A^T λ) = 0
i.e.
x̄ = −Q^{-1} (c + A^T λ)


Lagrange function value:
θ(λ) = −λ^T b + (1/2) x̄^T Q x̄ + (c^T + λ^T A) x̄
= −λ^T b + (1/2)(c + A^T λ)^T Q^{-1} Q Q^{-1} (c + A^T λ) − (c^T + λ^T A) Q^{-1} (c + A^T λ)
= −λ^T b + (1/2)(c + A^T λ)^T Q^{-1} (c + A^T λ) − (c^T + λ^T A) Q^{-1} (c + A^T λ)
= −λ^T b − (1/2)(c + A^T λ)^T Q^{-1} (c + A^T λ)

Lagrange dual (seen as a min problem):
min_λ λ^T b + (1/2)(c + A^T λ)^T Q^{-1} (c + A^T λ)
Optimality conditions:
b + A Q^{-1} (c + A^T λ) = 0
But, recalling that x̄ = −Q^{-1} (c + A^T λ) ⇒
b − A x̄ = 0
i.e. feasibility of x̄ ⇒ if we find the optimal multipliers λ (a linear system) we get the optimal solution x̄ (thanks to feasibility and weak duality)!
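The derivation above says the equality-constrained QP with Q ≻ 0 reduces to one linear system in λ. A numerical sketch with made-up Q, c, A, b:

```python
import numpy as np

# min 0.5 x^T Q x + c^T x  s.t.  Ax = b, with Q positive definite.
Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])           # positive definite (hypothetical data)
c = np.array([1.0, -1.0])
A = np.array([[1.0, 1.0]])           # one equality constraint
b = np.array([1.0])

Qinv = np.linalg.inv(Q)
# Dual optimality b + A Qinv (c + A^T lam) = 0 rearranged as a linear
# system in lam:  (A Qinv A^T) lam = -(b + A Qinv c)
lam = np.linalg.solve(A @ Qinv @ A.T, -(b + A @ Qinv @ c))
xbar = -Qinv @ (c + A.T @ lam)       # recover the primal solution

feasibility_gap = float(np.linalg.norm(A @ xbar - b))
stationarity_gap = float(np.linalg.norm(Q @ xbar + c + A.T @ lam))
```

Both gaps are zero up to rounding: x̄ is feasible and stationary, which by weak duality makes it optimal.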


Properties of the Lagrange dual

For any problem
f⋆ = min f(x)
g_i(x) ≤ 0  i = 1,...,m
x ∈ X
where X is non empty and compact, if f and the g_i are continuous then the Lagrange dual function is concave.

Proof

From the Weierstrass theorem
θ(λ) = min_{x∈X} f(x) + λ^T g(x)
exists and is finite.
θ(ηa + (1 − η)b) = min_{x∈X} (f(x) + (ηa + (1 − η)b)^T g(x))
= min_{x∈X} (η(f(x) + a^T g(x)) + (1 − η)(f(x) + b^T g(x)))
≥ η min_{x∈X} (f(x) + a^T g(x)) + (1 − η) min_{x∈X} (f(x) + b^T g(x))
= η θ(a) + (1 − η) θ(b).

The dual
max_λ θ(λ) = max_λ min_{x∈X} (f(x) + λ^T g(x))
is equivalent to
max z
z ≤ f(x) + λ^T g(x)  ∀x ∈ X
λ ≥ 0
After having computed f and g at x_1, x_2, ..., x_k, a restricted dual can be defined:
max z
z ≤ f(x_j) + λ^T g(x_j)  ∀j = 1,...,k
λ ≥ 0

Solution of the Lagrange dual

Let λ̄ be the optimal solution of the restricted dual. Is it an optimal dual solution? Is it true that z̄ ≤ f(x) + λ̄^T g(x)? Check: we look for x̄, an optimal solution of
min_{x∈X} f(x) + λ̄^T g(x)
If f(x̄) + λ̄^T g(x̄) ≥ z̄ then we have found the optimal solution of the dual; otherwise the pair x̄, f(x̄) is added to the restricted dual and a new solution is computed.

Geometric programming

Unconstrained geometric program:
min_{x>0} Σ_{k=1}^m c_k Π_{j=1}^n x_j^{α_kj}   α_kj ∈ R, c_k > 0
(non convex). Variable substitution:
x_j = exp(y_j)   y_j ∈ R

Transformed problem:
min_y Σ_{k=1}^m c_k e^{Σ_j α_kj y_j} = min_y Σ_{k=1}^m e^{α_k^T y + β_k}   β_k = log c_k
still non convex, but its logarithm is convex.

Duality example

Dual of
min f(x)
No constraints ⇒ the dual Lagrange function is identical to f(x)! Strong duality holds, but it is useless.
Simple transformation:
min log Σ_{k=1}^m exp y_k
y_k = α_k^T x + β_k

solving the dual

Dual function:
L(λ) = min_{x,y} log Σ_{k=1}^m exp y_k + λ^T (Ax + β − y)
Minimization in x is unconstrained ⇒ min λ^T Ax:
- if λ^T A ≠ 0, L(λ) is unbounded
- if λ^T A = 0, then
L(λ) = min_y log Σ_{k=1}^m exp y_k + λ^T (β − y)


First order (unconstrained) optimality conditions w.r.t. y_i:
exp y_i / Σ_k exp y_k − λ_i = 0
⇒ Lagrange multipliers exist provided that
Σ_i λ_i = 1,  λ_i > 0 ∀i
Substituting λ_j = exp y_j / Σ_k exp y_k:
L(λ) = log Σ_j exp y_j − Σ_j λ_j y_j
= log Σ_j exp y_j − Σ_j y_j exp y_j / Σ_k exp y_k
= (1/Σ_k exp y_k) Σ_k exp y_k (log Σ_j exp y_j − y_k)
= Σ_k (exp y_k / Σ_j exp y_j)(log Σ_j exp y_j − y_k)
= −Σ_k λ_k log λ_k
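The identity just derived (the minimized log-sum-exp term equals the negative entropy of the softmax weights) can be checked numerically for a random y:

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.standard_normal(5)

# lambda_j = exp(y_j) / sum_k exp(y_k), the stationary multipliers
lam = np.exp(y) / np.exp(y).sum()

lhs = float(np.log(np.exp(y).sum()) - lam @ y)   # log-sum-exp minus lambda^T y
rhs = float(-(lam * np.log(lam)).sum())          # negative entropy of lambda
gap = abs(lhs - rhs)
```

The two expressions agree to machine precision, since log λ_k = y_k − log Σ_j exp y_j term by term.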

Lagrange Dual

The Lagrange dual becomes:
max_λ β^T λ − Σ_k λ_k log λ_k
Σ_k λ_k = 1
A^T λ = 0
λ ≥ 0

Special cases: linear constraints

min f(x)
Ax ≥ b
Lagrange function:
L(x, λ) = f(x) + λ^T (b − Ax)
Constraint qualifications always hold (polyhedron). If x⋆ is a local optimum there exists λ⋆ ≥ 0:
Ax⋆ ≥ b
∇f(x⋆) = A^T λ⋆
λ⋆^T (b − Ax⋆) = 0


Non negativity constraints

min f(x)
x ≥ 0
Lagrange function: L(x, λ) = f(x) − λ^T x. KKT conditions:
∇f(x⋆) = λ⋆
x⋆ ≥ 0
λ⋆ ≥ 0
(λ⋆)^T x⋆ = 0
from which
λ_j⋆ = ∂f(x⋆)/∂x_j  j = 1,...,n
∂f(x⋆)/∂x_j = 0  ∀j: x_j⋆ > 0
∂f(x⋆)/∂x_j ≥ 0 otherwise
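These conditions can be verified on a toy instance. A sketch under the assumption f(x) = 0.5‖x − c‖² (a hypothetical smooth objective chosen so the solution is known in closed form):

```python
import numpy as np

# min 0.5 ||x - c||^2 s.t. x >= 0: the minimizer is x* = max(c, 0)
# and the multipliers are lambda* = grad f(x*) = x* - c = max(-c, 0).
c = np.array([1.5, -2.0, 0.0])

x_star = np.maximum(c, 0.0)
lam_star = x_star - c                # lambda* = grad f(x*)

# KKT conditions from the slide: x >= 0, lambda >= 0, lambda^T x = 0
complementarity = float(lam_star @ x_star)
```

Components with c_j > 0 are interior (λ_j⋆ = 0, gradient zero); components with c_j < 0 sit on the boundary with λ_j⋆ = −c_j > 0, exactly the complementarity pattern above.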

Box constraints

min f(x)
ℓ ≤ x ≤ u   (ℓ_i < u_i ∀i)
Lagrange function: L(x, λ, μ) = f(x) + λ^T (ℓ − x) + μ^T (x − u). KKT conditions:
∇f(x⋆) = λ⋆ − μ⋆
(ℓ − x⋆)^T λ⋆ = 0
(x⋆ − u)^T μ⋆ = 0
(λ⋆, μ⋆) ≥ 0
Given x⋆, let
J_ℓ = {j : x_j⋆ = ℓ_j},  J_u = {j : x_j⋆ = u_j},  J_0 = {j : ℓ_j < x_j⋆ < u_j}

Box constr. (cont)

Then, from complementarity,
∂f(x⋆)/∂x_j = λ_j⋆  j ∈ J_ℓ
∂f(x⋆)/∂x_j = −μ_j⋆  j ∈ J_u
∂f(x⋆)/∂x_j = 0  j ∈ J_0

Thus
∂f(x⋆)/∂x_j ≥ 0  j ∈ J_ℓ
∂f(x⋆)/∂x_j ≤ 0  j ∈ J_u
∂f(x⋆)/∂x_j = 0  j ∈ J_0
with feasibility ℓ ≤ x⋆ ≤ u.

Optimization over the simplex

min f(x)
1^T x = 1
x ≥ 0
Lagrange function: L(x, λ, μ) = f(x) − λ^T x + μ(1^T x − 1). KKT:
∇f(x⋆) = λ⋆ − μ⋆ 1
1^T x⋆ = 1
(x⋆, λ⋆) ≥ 0
(λ⋆)^T x⋆ = 0

simplex...

λ_j⋆ = ∂f(x⋆)/∂x_j + μ⋆
(all equal shift). Thus, from complementarity, if x_j⋆ > 0 then λ_j⋆ = 0 and ∂f(x⋆)/∂x_j = −μ⋆; otherwise ∂f(x⋆)/∂x_j ≥ −μ⋆. Thus, if j: x_j⋆ > 0,
∂f(x⋆)/∂x_j ≤ ∂f(x⋆)/∂x_k  ∀k

Application: Min var portfolio

Given n assets with random returns R_1,...,R_n, how to invest in such a way that the resulting portfolio has minimum variance? If x_j denotes the percentage of the investment on asset j, how to compute the variance of this portfolio P(x)?
Var = E(P(x) − E(P(x)))²
= E(Σ_{j=1}^n (R_j − E(R_j)) x_j)²
= Σ_{i,j} E[(R_i − E(R_i))(R_j − E(R_j))] x_i x_j
= x^T Q x
where Q is the variance-covariance matrix of the n assets.


Min var portfolio

Problem (objective multiplied by 1/2 for simpler computations):
min (1/2) x^T Q x
1^T x = 1
x ≥ 0

Optimal portfolio

KKT: for all i: x_i⋆ > 0:
Σ_j Q_ij x_j⋆ ≤ Σ_j Q_kj x_j⋆  ∀k
The vector Qx might be thought of as the vector of marginal contributions to the total risk (which is a weighted sum of the elements of Qx). Thus, in the optimal portfolio, all assets with positive level give an equal (and minimal) contribution to the total risk.
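The "equal marginal risk" property is easy to observe numerically. A sketch under the assumption that the optimum is interior (all x_j⋆ > 0, so the non-negativity constraints are inactive and x⋆ is proportional to Q^{-1}1); the covariance matrix below is made up:

```python
import numpy as np

# Hypothetical covariance matrix (symmetric positive definite)
Q = np.array([[0.04, 0.01, 0.00],
              [0.01, 0.09, 0.02],
              [0.00, 0.02, 0.16]])

# With inactive non-negativity constraints, KKT gives Q x* = mu * 1,
# so x* is proportional to Q^{-1} 1, normalized to sum to one.
ones = np.ones(3)
w = np.linalg.solve(Q, ones)
x_star = w / w.sum()

marginal_risk = Q @ x_star           # should be constant across assets
spread = float(marginal_risk.max() - marginal_risk.min())
```

For covariance matrices where Q^{-1}1 has negative components the interior assumption fails and the full KKT system with active non-negativity constraints must be used instead.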

Algorithms for unconstrained local optimization
Fabio Schoen
2008 http://gol.dsi.unifi.it/users/schoen

Optimization

Most common form for optimization algorithms, line search-based methods: given a starting point x_0, a sequence is generated:
x_{k+1} = x_k + α_k d_k
where d_k ∈ R^n is the search direction and α_k > 0 the step.
Usually d_k is chosen first and then the step is obtained, often from a 1-dimensional optimization.
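The generic scheme x_{k+1} = x_k + α_k d_k can be sketched concretely. A minimal illustration (not any specific method from the course) using the steepest descent direction and an Armijo backtracking step on a hypothetical convex quadratic:

```python
import numpy as np

H = np.array([[4.0, 0.0],
              [0.0, 1.0]])            # made-up positive definite Hessian

def f(x):    return float(0.5 * x @ H @ x)
def grad(x): return H @ x

x = np.array([2.0, 3.0])              # starting point x_0
for _ in range(200):
    d = -grad(x)                      # search direction d_k
    alpha = 1.0
    # backtracking: halve the step until the Armijo sufficient
    # decrease condition holds (a simple 1-dimensional search)
    while f(x + alpha * d) > f(x) + 1e-4 * alpha * float(grad(x) @ d):
        alpha *= 0.5
    x = x + alpha * d                 # x_{k+1} = x_k + alpha_k d_k

final_value = f(x)
```

The iterates converge to the unique minimizer x⋆ = 0; the line search guarantees monotone decrease of f at every iteration.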

Algorithms for unconstrained local optimization – p. 1 Algorithms for unconstrained local optimization – p. 2

Trust-region algorithms

A model m(x) and a confidence region U(x_k) containing x_k are
defined. The new iterate is chosen as the solution of the constrained
optimization problem

min_{x ∈ U(x_k)} m(x)

The model and the confidence region are possibly updated at each
iteration.

Speed measures

Let x⋆ be a local optimum. The error in x_k might be measured e.g. as

e(x_k) = ‖x_k − x⋆‖   or   e(x_k) = |f(x_k) − f(x⋆)|.

Given {x_k} → x⋆, if ∃ q > 0, β ∈ (0, 1) such that (for k large enough)

e(x_k) ≤ q β^k

⇒ {x_k} is linearly convergent, or converges with order 1;
β: convergence rate. A sufficient condition for linear convergence:

lim sup e(x_{k+1}) / e(x_k) ≤ β

super–linear convergence

If for every β ∈ (0, 1) there exists q such that

e(x_k) ≤ q β^k

then convergence is super–linear. Sufficient condition:

lim sup e(x_{k+1}) / e(x_k) = 0

Higher order convergence

If, given p > 1, ∃ q > 0, β ∈ (0, 1) such that

e(x_k) ≤ q β^(p^k)

then {x_k} is said to converge with order at least p.
If p = 2 ⇒ quadratic convergence. Sufficient condition:

lim sup e(x_{k+1}) / e(x_k)^p < ∞


Examples

1/k converges to 0 with order 1 (linear convergence)
1/k² converges to 0 with order 1
2^(−k) converges to 0 with order 1
k^(−k) converges to 0 with order 1; convergence is super–linear
1/2^(2^k) converges to 0 with order 2: quadratic convergence

Descent directions and the gradient

Let f ∈ C¹(Rⁿ) and x_k ∈ Rⁿ : ∇f(x_k) ≠ 0. Let d ∈ Rⁿ. If

dᵀ∇f(x_k) < 0

then d is a descent direction.
Taylor expansion:

f(x_k + αd) − f(x_k) = α dᵀ∇f(x_k) + o(α)
(f(x_k + αd) − f(x_k))/α = dᵀ∇f(x_k) + o(1)

Thus if α is small enough, f(x_k + αd) − f(x_k) < 0.
NB: d might be a descent direction even if dᵀ∇f(x_k) = 0.
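The orders claimed in the examples can be verified numerically by computing the ratio e(k+1)/e(k)^p for some large k (a rough sketch; the helper name `rate` is ours):

```python
def rate(e, p=1.0, k=40):
    """Estimate the order-p convergence ratio e(k+1)/e(k)**p."""
    return e(k + 1) / e(k) ** p

r_geom = rate(lambda k: 2.0 ** -k)            # 2^-k: order 1, ratio 1/2
r_harm = rate(lambda k: 1.0 / k)              # 1/k: ratio k/(k+1) -> 1
r_quad = rate(lambda k: 2.0 ** -(2 ** k), p=2.0, k=3)  # order 2: ratio 1
```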

Convergence of line search methods

If a sequence x_{k+1} = x_k + α_k d_k is generated in such a way that:

L_0 = {x : f(x) ≤ f(x_0)} is compact
d_k ≠ 0 whenever ∇f(x_k) ≠ 0
f(x_{k+1}) ≤ f(x_k)
if ∇f(x_k) ≠ 0 ∀ k then

lim_{k→∞} d_kᵀ∇f(x_k)/‖d_k‖ = 0

and, if d_k ≠ 0,

|d_kᵀ∇f(x_k)|/‖d_k‖ ≥ σ(‖∇f(x_k)‖)

where σ is such that lim_{k→∞} σ(t_k) = 0 ⇒ lim_{k→∞} t_k = 0
(σ is called a forcing function)


Then either there exists a finite index k̄ such that ∇f(x_k̄) = 0, or
otherwise:

{x_k} ⊆ L_0 and all of its limit points are in L_0
lim_{k→∞} ∇f(x_k) = 0
for every limit point x̄ of {x_k} we have ∇f(x̄) = 0

Comments on the assumptions

f(x_{k+1}) ≤ f(x_k): most optimization methods choose d_k as a
descent direction. If d_k is a descent direction, choosing α_k
"sufficiently small" ensures the validity of the assumption.

d_kᵀ∇f(x_k)/‖d_k‖ admits a limit: given a normalized direction d_k,
the scalar product d_kᵀ∇f(x_k) is the directional derivative of f along
d_k: it is required that this goes to zero. This can be achieved through
precise line searches (choosing the step so that f is minimized
along d_k).

|d_kᵀ∇f(x_k)|/‖d_k‖ ≥ σ(‖∇f(x_k)‖): letting, e.g., σ(t) = ct, c > 0, if
d_k : d_kᵀ∇f(x_k) < 0, then the condition becomes

d_kᵀ∇f(x_k) ≤ −c ‖d_k‖ ‖∇f(x_k)‖

Recalling that

cos θ_k = d_kᵀ∇f(x_k) / (‖d_k‖ ‖∇f(x_k)‖)

the condition becomes

cos θ_k ≤ −c

that is, the angle between d_k and ∇f(x_k) is bounded away from
orthogonality.

Gradient Algorithms

General scheme:

x_{k+1} = x_k − α_k D_k ∇f(x_k)

with D_k ≻ 0 and α_k > 0. If ∇f(x_k) ≠ 0, then

d_k = −D_k ∇f(x_k)

is a descent direction. In fact

d_kᵀ∇f(x_k) = −∇ᵀf(x_k) D_k ∇f(x_k) < 0

Steepest Descent or “gradient” method:

D_k := I, i.e. x_{k+1} = x_k − α_k ∇f(x_k).
If ∇f(x_k) ≠ 0, then d_k = −∇f(x_k) is a descent direction. Moreover,
it is the steepest (w.r.t. the euclidean norm):

min_{d ∈ Rⁿ : ‖d‖ ≤ 1} ∇ᵀf(x_k) d

min_{d ∈ Rⁿ : √(dᵀd) ≤ 1} ∇ᵀf(x_k) d

KKT conditions: in the interior, ∇f(x_k) = 0; if the constraint is
active ⇒

∇f(x_k) + λ d/‖d‖ = 0
√(dᵀd) = 1
λ ≥ 0

⇒ d = −∇f(x_k)/‖∇f(x_k)‖

Newton's method

D_k := (∇²f(x_k))⁻¹

Motivation: Taylor expansion of f:

f(x) ≈ f(x_k) + ∇ᵀf(x_k)(x − x_k) + (1/2)(x − x_k)ᵀ∇²f(x_k)(x − x_k)

Minimizing the approximation:

∇f(x_k) + ∇²f(x_k)(x − x_k) = 0

If the hessian is non singular ⇒

x = x_k − (∇²f(x_k))⁻¹ ∇f(x_k)
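A minimal one-dimensional sketch of the Newton iteration; the test function f(x) = x² + eˣ is our own example, not from the slides:

```python
import math

def newton_min_1d(df, d2f, x, iters=20):
    """x_{k+1} = x_k - f'(x_k)/f''(x_k): Newton's method in one variable."""
    for _ in range(iters):
        x -= df(x) / d2f(x)
    return x

# f(x) = x^2 + exp(x), strictly convex, so Newton converges to the minimizer
df = lambda x: 2.0 * x + math.exp(x)
d2f = lambda x: 2.0 + math.exp(x)
xstar = newton_min_1d(df, d2f, 0.0)
```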

Step choice

Given d_k, how to choose α_k so that x_{k+1} = x_k + α_k d_k?
"Optimal" choice (one-dimensional optimization):

α_k = arg min_{α ≥ 0} f(x_k + α d_k).

An analytical expression of the optimal step is available only in a few
cases. E.g. if f(x) = (1/2) xᵀQx + cᵀx with Q ≻ 0, then

f(x_k + α d_k) = (1/2)(x_k + α d_k)ᵀQ(x_k + α d_k) + cᵀ(x_k + α d_k)
              = (1/2) α² d_kᵀQ d_k + α (Qx_k + c)ᵀd_k + β

where β does not depend on α. Minimizing w.r.t. α:

α d_kᵀQ d_k + (Qx_k + c)ᵀd_k = 0 ⇒
α = −(Qx_k + c)ᵀd_k / (d_kᵀQ d_k)
  = −d_kᵀ∇f(x_k) / (d_kᵀ∇²f(x_k) d_k)

E.g., in steepest descent:

α = ‖∇f(x_k)‖² / (∇ᵀf(x_k) ∇²f(x_k) ∇f(x_k))

Approximate step size / Avoid too large steps

Rules for choosing a step-size (from the sufficient condition for
convergence):

[figure omitted: a step with f(x_{k+1}) < f(x_k) may still be too large]


Avoid too small steps / Armijo's rule

Input: δ ∈ (0, 1), γ ∈ (0, 1/2), ∆_k > 0
α := ∆_k;
while (f(x_k + α d_k) > f(x_k) + γ α d_kᵀ∇f(x_k)) do
    α := δα;
end
return α

Typical values: δ ∈ [0.1, 0.5], γ ∈ [10⁻⁴, 10⁻³].
On exit the returned step is such that

f(x_k + α d_k) ≤ f(x_k) + γ α d_kᵀ∇f(x_k)
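Armijo's rule translates almost directly into code; a sketch (function and argument names are ours):

```python
def armijo(phi, phi0, slope, delta=0.5, gamma=1e-4, step=1.0):
    """Backtrack until phi(a) <= phi(0) + gamma*a*slope,
    where phi(a) = f(x + a*d) and slope = d^T grad f(x) < 0."""
    a = step
    while phi(a) > phi0 + gamma * a * slope:
        a *= delta
    return a

# f(x) = x^2 at x = 1 with direction d = -f'(1) = -2, so slope = d*f'(1) = -4
phi = lambda a: (1.0 - 2.0 * a) ** 2
a = armijo(phi, phi0=1.0, slope=-4.0)   # full step a=1 fails, a=0.5 accepted
```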

Line search in practice

How to choose the initial step size ∆_k?
Let φ(α) = f(x_k + α d_k). A possibility is to choose ∆_k = α⋆, the
minimizer of a quadratic approximation to φ(·):

q(α) = c_0 + c_1 α + (1/2) c_2 α²
q(0) = c_0 := f(x_k)
q′(0) = c_1 := d_kᵀ∇f(x_k)

Then α⋆ = −c_1/c_2.

Third condition? If an estimate f̂ of the minimum of f(x_k + α d_k) is
available ⇒ choose c_2 such that min q(α) = f̂:

min q(α) = q(−c_1/c_2) = c_0 − c_1²/(2c_2) := f̂
c_2 = c_1²/(2(c_0 − f̂))
α⋆ = −c_1/c_2 = 2(f̂ − c_0)/c_1

Thus it is reasonable to start with

∆_k = 2 (f̂ − f(x_k)) / (d_kᵀ∇f(x_k))

A reasonable estimate might be to choose

∆_k = −2 (f(x_{k−1}) − f(x_k)) / (d_kᵀ∇f(x_k))

Convergence of steepest descent

x_{k+1} = x_k − α_k ∇f(x_k)

If a sufficiently accurate step size is used, the conditions of the
theorem on global convergence are satisfied ⇒ the steepest descent
algorithm globally converges to a stationary point.
"Sufficiently accurate" means exact line search or, e.g., Armijo's rule.

Local analysis of steepest descent

Behaviour of the steepest descent method when minimizing

f(x) = (1/2) xᵀQx

where Q ≻ 0. The (local and global) optimum is x⋆ = 0.

x_{k+1} = x_k − α_k ∇f(x_k) = x_k − α_k Q x_k = (I − α_k Q) x_k

Error (in x) at step k + 1:

‖x_{k+1} − 0‖ = ‖(I − α_k Q) x_k‖ = √(x_kᵀ (I − α_k Q)² x_k)

Analysis

Let A be symmetric with eigenvalues λ_1 ≤ · · · ≤ λ_n. Then:

λ is an eigenvalue of A iff αλ is an eigenvalue of αA
λ is an eigenvalue of A iff 1 + λ is an eigenvalue of I + A
λ_1 ‖v‖² ≤ vᵀAv ≤ λ_n ‖v‖²   ∀ v ∈ Rⁿ

Thus the eigenvalues of (I − α_k Q) are 1 − α_k λ_i, where λ_i are
the eigenvalues of Q, and

x_kᵀ(I − α_k Q)² x_k ≤ λ⋆ x_kᵀx_k

where λ⋆ is the largest eigenvalue of (I − α_k Q)². The maximum
eigenvalue is max{(1 − α_k λ_1)², (1 − α_k λ_n)²}, thus

‖x_{k+1}‖ ≤ √(max{(1 − α_k λ_1)², (1 − α_k λ_n)²}) ‖x_k‖
         = max{|1 − α_k λ_1|, |1 − α_k λ_n|} ‖x_k‖

Eliminating the dependency on α_k: since α ≥ 0 and λ_1 ≤ λ_n,

1 − αλ_1 ≥ 1 − αλ_n
−(1 + αλ_1) ≥ −(1 + αλ_n)

[figure omitted: |1 − αλ_1| and |1 − αλ_n| as functions of α]

and thus

max{|1 − αλ_1|, |1 − αλ_n|} = max{1 − αλ_1, −1 + αλ_n}

Minimum point:

1 − αλ_1 = −1 + αλ_n

i.e.

α⋆ = 2/(λ_1 + λ_n)

Analysis / Zig–zagging

In the best possible case

‖x_{k+1}‖/‖x_k‖ ≤ |1 − α⋆λ_1| = |1 − 2λ_1/(λ_1 + λ_n)|
              = (λ_n − λ_1)/(λ_n + λ_1) = (ρ − 1)/(ρ + 1)

where ρ = λ_n/λ_1 is the condition number of Q:
ρ ≫ 1 (ill–conditioned problem) ⇒ very slow convergence
ρ ≈ 1 ⇒ very fast convergence

Zig–zagging example:

min (1/2)(x² + My²)

where M > 0. Optimum: x⋆ = y⋆ = 0. Starting point: (M, 1). Iterates:

[x_{k+1}; y_{k+1}] = [x_k; y_k] + α_k [−x_k; −M y_k]

With the optimal step size ⇒

x_k = M ((M−1)/(M+1))^k,   y_k = (−(M−1)/(M+1))^k
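The closed-form iterates can be reproduced by running exact-line-search steepest descent on f(x, y) = (x² + My²)/2 (an illustrative sketch; helper name ours):

```python
def steepest_descent_quad(M, x, y, iters):
    """Exact-line-search steepest descent on f(x, y) = (x^2 + M*y^2)/2."""
    for _ in range(iters):
        gx, gy = x, M * y                  # gradient
        num = gx * gx + gy * gy            # ||g||^2
        den = gx * gx + M * gy * gy        # g^T Q g with Q = diag(1, M)
        a = num / den                      # exact minimizing step
        x, y = x - a * gx, y - a * gy
    return x, y

M = 10.0
x5, y5 = steepest_descent_quad(M, M, 1.0, 5)
r = (M - 1.0) / (M + 1.0)   # predicted per-iteration contraction factor
```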

Zig–zagging

Convergence is:

rapid if M ≈ 1
very slow and "zig–zagging" if M ≫ 1 or M ≪ 1

Slow convergence and zig–zagging are general phenomena
(especially when the starting point is near the longest axes of the
ellipsoidal level sets).

[figure omitted: zig–zagging iterates of steepest descent]


Analysis of Newton’s method

Newton–Raphson method: x_{k+1} = x_k − (∇²f(x_k))⁻¹ ∇f(x_k).
Let x⋆ be a local optimum: ∇f(x⋆) = 0. Taylor expansion of ∇f:

0 = ∇f(x⋆) = ∇f(x_k) + ∇²f(x_k)(x⋆ − x_k) + o(‖x⋆ − x_k‖)

If ∇²f(x_k) is non singular and ‖(∇²f(x_k))⁻¹‖ is bounded ⇒

0 = (∇²f(x_k))⁻¹ ∇f(x_k) + (x⋆ − x_k) + (∇²f(x_k))⁻¹ o(‖x⋆ − x_k‖)
  = x⋆ − x_{k+1} + o(‖x⋆ − x_k‖)

Thus

‖x⋆ − x_{k+1}‖ = o(‖x⋆ − x_k‖)

i.e.

‖x⋆ − x_{k+1}‖/‖x⋆ − x_k‖ = o(‖x⋆ − x_k‖)/‖x⋆ − x_k‖ → 0

⇒ convergence is at least super–linear.

Local Convergence of Newton's Method

Let f ∈ C²(U(x⋆, δ_1)), where U is a ball with radius δ_1 and center x⋆;
let ∇²f(x⋆) be non–singular. Then:

1. ∃ δ > 0 : if x_0 ∈ U(x⋆, δ) ⇒ {x_k} is well defined and converges
   to x⋆ at least superlinearly.
2. If ∃ δ > 0, L > 0, M > 0 :

   ‖∇²f(x) − ∇²f(y)‖ ≤ L ‖x − y‖
   ‖(∇²f(x))⁻¹‖ ≤ M

   then, if x_0 ∈ U(x⋆, δ), Newton's method converges with order at
   least 2 and

   ‖x_{k+1} − x⋆‖ ≤ (LM/2) ‖x_k − x⋆‖²

Difficulties

Many things might go wrong:

at some iteration ∇²f(x_k) might be singular: for example, if x_k
belongs to a flat region where f(x) = constant
even if non singular, inverting ∇²f(x_k) or, in any case, solving a
linear system with coefficient matrix ∇²f(x_k) is numerically
unstable and computationally demanding
there is no guarantee that ∇²f(x_k) ≻ 0 ⇒ the Newton direction
might not be a descent direction

Difficulties

Newton's method just tries to solve the system

∇f(x_k) = 0

and thus might very well be attracted towards a maximum; the
method lacks global convergence: it converges only if started "near"
a local optimum.

Newton–type methods

line search variant: x_{k+1} = x_k − α_k (∇²f(x_k))⁻¹ ∇f(x_k)
Modified Newton method: replace ∇²f(x_k) by (∇²f(x_k) + D_k),
where D_k is chosen so that ∇²f(x_k) + D_k is positive definite

Quasi-Newton methods

Consider solving the nonlinear system ∇f(x) = 0. Taylor expansion
of the gradient:

∇f(x_k) ≈ ∇f(x_{k+1}) + ∇²f(x_{k+1})(x_k − x_{k+1})

Let B_{k+1} be an approximation of the hessian in x_{k+1}.
Quasi–Newton equation:

B_{k+1}(x_{k+1} − x_k) = ∇f(x_{k+1}) − ∇f(x_k)

Quasi–Newton equation

Let:

s_k := x_{k+1} − x_k,   y_k := ∇f(x_{k+1}) − ∇f(x_k)

Quasi–Newton equation: B_{k+1} s_k = y_k. If B_k was the previous
approximate hessian, we ask that:

1. the variation between B_k and B_{k+1} is "small"
2. nothing changes along directions which are normal to the step s_k:

   B_k z = B_{k+1} z   ∀ z : zᵀs_k = 0

Choosing n − 1 vectors z orthogonal to s_k ⇒ n² linearly
independent equations in n² unknowns ⇒ ∃ a unique solution.


Broyden updating

It can be shown that the unique solution is given by:

B_{k+1} = B_k + (y_k − B_k s_k) s_kᵀ / (s_kᵀ s_k)

Theorem: let B̂ ∈ R^{n×n} and s_k ≠ 0. The unique solution to:

min_{B̂} ‖B̂ − B_k‖_F
s.t. B̂ s_k = y_k

is Broyden's update B_{k+1}; here ‖X‖_F = √(Tr XᵀX) denotes the
Frobenius norm.

proof

B_{k+1} − B_k = (y_k − B_k s_k) s_kᵀ / (s_kᵀ s_k)
             = (B̂ s_k − B_k s_k) s_kᵀ / (s_kᵀ s_k)
             = (B̂ − B_k) s_k s_kᵀ / (s_kᵀ s_k)

‖(B̂ − B_k) s_k s_kᵀ / (s_kᵀ s_k)‖_F ≤ ‖B̂ − B_k‖_F ‖s_k s_kᵀ / (s_kᵀ s_k)‖_F
                                    = ‖B̂ − B_k‖_F

Unicity is a consequence of the strict convexity of the norm and the
convexity of the feasible region.

Quasi-Newton and optimization

Special situation:

1. the hessian matrix in optimization problems is symmetric;
2. in gradient methods, when we let x_{k+1} = x_k − (B_{k+1})⁻¹∇f(x_k),
   it is desirable that B_{k+1} be positive definite.

Broyden's update

B_{k+1} = B_k + (y_k − B_k s_k) s_kᵀ / (s_kᵀ s_k)

is generally not symmetric even if B_k is.

Symmetry

Remedy: let C_1 = B_k + (y_k − B_k s_k) s_kᵀ / (s_kᵀ s_k); symmetrization:

C_2 = (1/2)(C_1 + C_1ᵀ)

However, C_2 does not satisfy the Quasi–Newton equation. Broyden
update of C_2:

C_3 = C_2 + (y_k − C_2 s_k) s_kᵀ / (s_kᵀ s_k)

which is not symmetric, . . .


PBS update

In the limit of this alternating process one obtains:

B_{k+1} = B_k + [(y_k − B_k s_k)s_kᵀ + s_k(y_k − B_k s_k)ᵀ]/(s_kᵀ s_k)
        − [s_kᵀ(y_k − B_k s_k)] s_k s_kᵀ/(s_kᵀ s_k)²

(the PBS – Powell-Broyden-Symmetric – update).
Imposing also hereditary positive definiteness, the DFP update
(Davidon-Fletcher-Powell) is obtained:

B_{k+1} = B_k + [(y_k − B_k s_k)y_kᵀ + y_k(y_k − B_k s_k)ᵀ]/(y_kᵀ s_k)
        − [s_kᵀ(y_k − B_k s_k)] y_k y_kᵀ/(y_kᵀ s_k)²
        = (I − y_k s_kᵀ/(y_kᵀ s_k)) B_k (I − s_k y_kᵀ/(y_kᵀ s_k)) + y_k y_kᵀ/(y_kᵀ s_k)

BFGS

Same ideas, but applied to the approximate inverse Hessian.
Inverse Quasi–Newton equation:

s_k = H_{k+1} y_k

This leads to the most common Quasi–Newton update, BFGS
(Broyden-Fletcher-Goldfarb-Shanno):

H_{k+1} = (I − s_k y_kᵀ/(y_kᵀ s_k)) H_k (I − y_k s_kᵀ/(y_kᵀ s_k)) + s_k s_kᵀ/(y_kᵀ s_k)

BFGS method

x_{k+1} = x_k − α_k H_k ∇f(x_k)
H_{k+1} = (I − s_k y_kᵀ/(y_kᵀ s_k)) H_k (I − y_k s_kᵀ/(y_kᵀ s_k)) + s_k s_kᵀ/(y_kᵀ s_k)

where y_k = ∇f(x_{k+1}) − ∇f(x_k) and s_k = x_{k+1} − x_k.

Trust Region methods

Possible defect of the standard Newton method: the approximation
becomes less and less precise as we move away from the current
point. Long step ⇒ bad approximation.
Idea: constrained minimization of the quadratic approximation:

x_{k+1} = arg min_{‖x − x_k‖ ≤ ∆_k} m_k(x)

m_k(x) = f(x_k) + ∇ᵀf(x_k)(x − x_k) + (1/2)(x − x_k)ᵀ∇²f(x_k)(x − x_k)

∆_k > 0: parameter. First advantage (over pure Newton): the step is
always well defined (thanks to Weierstrass's theorem).
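The BFGS update of the inverse Hessian approximation can be sketched in plain Python (lists of lists; in practice a linear-algebra library would be used). The inverse quasi-Newton equation H_{k+1} y_k = s_k holds by construction, which the example checks:

```python
def bfgs_update(H, s, y):
    """H+ = (I - rho*s*y^T) H (I - rho*y*s^T) + rho*s*s^T, rho = 1/(y^T s)."""
    n = len(s)
    rho = 1.0 / sum(si * yi for si, yi in zip(s, y))
    # A = I - rho * s * y^T; note that (I - rho * y * s^T) = A^T
    A = [[(1.0 if i == j else 0.0) - rho * s[i] * y[j] for j in range(n)]
         for i in range(n)]
    AH = [[sum(A[i][k] * H[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]
    # A H A^T + rho * s * s^T
    return [[sum(AH[i][k] * A[j][k] for k in range(n)) + rho * s[i] * s[j]
             for j in range(n)] for i in range(n)]

H0 = [[1.0, 0.0], [0.0, 1.0]]
s, y = [1.0, 0.0], [2.0, 1.0]
H1 = bfgs_update(H0, s, y)
Hy = [sum(H1[i][k] * y[k] for k in range(2)) for i in range(2)]  # should equal s
```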


Outline of Trust Region

Let m_k(·) be a local model function. E.g. in Newton Trust Region
methods

m_k(s) = f(x_k) + sᵀ∇f(x_k) + (1/2) sᵀ∇²f(x_k) s

or, in a Quasi-Newton Trust Region method,

m_k(s) = f(x_k) + sᵀ∇f(x_k) + (1/2) sᵀB_k s

How to choose and update the trust region radius ∆_k? Given a step
s_k, let

ρ_k = (f(x_k) − f(x_k + s_k)) / (m_k(0) − m_k(s_k))

the ratio between the actual reduction and the predicted reduction.

Model updating

ρ_k = (f(x_k) − f(x_k + s_k)) / (m_k(0) − m_k(s_k))

The predicted reduction is always non negative;
if ρ_k is small (surely if it is negative), the model and the function
strongly disagree ⇒ the step must be rejected and the trust region
reduced
if ρ_k ≈ 1, it is safe to expand the trust region
intermediate ρ_k values lead us to keep the region unchanged

Algorithm

Data: ∆̂ > 0, ∆_0 ∈ (0, ∆̂), η ∈ [0, 1/4]
for k = 0, 1, . . . do
    find the step s_k minimizing the model in the trust region, and ρ_k;
    if ρ_k < 1/4 then
        ∆_{k+1} = ∆_k/4;
    else
        if ρ_k > 3/4 and ‖s_k‖ = ∆_k then
            ∆_{k+1} = min{2∆_k, ∆̂};
        else
            ∆_{k+1} = ∆_k;
        end
    end
    if ρ_k > η then x_{k+1} = x_k + s_k;
    else x_{k+1} = x_k;
    end
end
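The radius-update logic of the algorithm, in isolation (a sketch with the same thresholds as the pseudocode; function name ours):

```python
def update_radius(rho, radius, step_norm, max_radius, eps=1e-12):
    """Shrink the radius on poor model agreement; expand it when agreement
    is very good and the step hit the trust-region boundary."""
    if rho < 0.25:
        return radius / 4.0
    if rho > 0.75 and abs(step_norm - radius) < eps:
        return min(2.0 * radius, max_radius)
    return radius

shrunk = update_radius(0.1, 1.0, 0.5, 2.0)    # poor agreement: 1.0 -> 0.25
grown = update_radius(0.9, 1.0, 1.0, 1.5)     # good agreement on boundary
kept = update_radius(0.5, 1.0, 0.3, 2.0)      # intermediate: unchanged
```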


Solving the model

How to find

min_{‖s‖ ≤ ∆} ∇ᵀf(x_k) s + (1/2) sᵀB_k s

If B_k ≻ 0, KKT conditions are necessary and sufficient; rewriting the
constraint as sᵀs ≤ ∆²:

∇f(x_k) + B_k s + 2λs = 0
λ(∆ − ‖s‖) = 0

Thus either s is in the interior of the ball with radius ∆, in which case
λ = 0 and we have the (quasi)-Newton step:

p = −B_k⁻¹ ∇f(x_k)

or ‖s‖ = ∆ and, if λ > 0, then

2λs = −∇f(x_k) − B_k s = −∇m_k(s)

⇒ s is parallel to the negative gradient of the model and normal to
its contour lines.

The Cauchy Point

Strategy to approximately solve the trust region sub–problem: find
the "Cauchy point", the minimizer of m_k along the direction −∇f(x_k)
within the trust region. First find the direction:

p_k^s = arg min_{‖p‖ ≤ ∆_k} f_k + ∇ᵀf(x_k) p

i.e. p_k^s = −∆_k ∇f(x_k)/‖∇f(x_k)‖. Then along this direction find a
minimizer:

τ_k = arg min_{τ ≥ 0 : ‖τ p_k^s‖ ≤ ∆_k} m_k(τ p_k^s)

Finding the Cauchy point

Finding τ_k is easy; analytic solution:

if ∇ᵀf(x_k) B_k ∇f(x_k) ≤ 0 ⇒ negative curvature direction ⇒
largest possible step ⇒ τ_k = 1
otherwise the model along the line is strictly convex, so

τ_k = min{1, ‖∇f(x_k)‖³ / (∆_k ∇ᵀf(x_k) B_k ∇f(x_k))}

The Cauchy point is x_k + τ_k p_k^s. Choosing the Cauchy point ⇒
global but extremely slow convergence (similar to steepest descent).
Usually an improved point is searched starting from the Cauchy one.
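A sketch of the Cauchy-step computation (assuming the product Bg = B∇f(x_k) is supplied; names are ours):

```python
import math

def cauchy_step(g, Bg, radius):
    """Cauchy step: minimize the quadratic model along -g inside the ball.
    g: gradient, Bg: product B*g, radius: trust-region radius."""
    gnorm = math.sqrt(sum(gi * gi for gi in g))
    curv = sum(gi * bgi for gi, bgi in zip(g, Bg))   # g^T B g
    if curv <= 0.0:
        tau = 1.0                                    # negative curvature: full step
    else:
        tau = min(1.0, gnorm ** 3 / (radius * curv))
    return [-tau * radius / gnorm * gi for gi in g]

# B = I, g = (2, 0), radius 1: tau = min(1, 8/4) = 1, so step = (-1, 0)
step = cauchy_step([2.0, 0.0], [2.0, 0.0], 1.0)
```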

Pattern Search

Derivative Free Optimization

For smooth optimization, but without knowledge of derivatives.
Elementary idea: if x ∈ R² is not a local minimum for f, then at least
one of the directions e_1, e_2, −e_1, −e_2 (moving towards E, N, W,
S) forms an acute angle with −∇f(x) ⇒ it is a descent direction.
Direct search: explores all the directions in search of one which
gives a descent.

Coordinate search

Let D⊕ = {±e_i} be the set of coordinate directions and their
opposites.

Data: k = 0, ∆_0 an initial step length, x_0 a starting point
while ∆_k is large enough do
    if ∃ d ∈ D⊕ : f(x_k + ∆_k d) < f(x_k) then
        x_{k+1} = x_k + ∆_k d ;  ∆_{k+1} = ∆_k ;
    else
        x_{k+1} = x_k ;  ∆_{k+1} = 0.5 ∆_k ;
    end
    k = k + 1 ;
end

Pattern search

It is not necessary to explore 2n directions. It is sufficient that the
set of directions forms a positive span, i.e. every v ∈ Rⁿ should be
expressible as a non negative linear combination of the vectors in the
set. Formally, G is a generating set iff ∀ v ≠ 0 ∃ d ∈ G : vᵀd > 0.
A "good" generating set should be characterized by a sufficiently
high cosine measure:

κ(G) := min_{v ≠ 0} max_{d ∈ G} vᵀd / (‖v‖ ‖d‖)
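Coordinate search is short enough to implement directly; a runnable sketch (our helper names, first-improvement polling):

```python
def coordinate_search(f, x, step=1.0, tol=1e-6):
    """Poll the 2n coordinate directions; halve the step on failure."""
    x = list(x)
    fx = f(x)
    while step > tol:
        improved = False
        for i in range(len(x)):
            for s in (step, -step):
                y = list(x)
                y[i] += s
                fy = f(y)
                if fy < fx:
                    x, fx, improved = y, fy, True
                    break
            if improved:
                break
        if not improved:
            step *= 0.5
    return x

f = lambda v: (v[0] - 1.0) ** 2 + (v[1] + 2.0) ** 2
xmin = coordinate_search(f, [0.0, 0.0])
```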


Examples / Step Choice

[figures omitted: examples of generating sets in R²]

x_{k+1} = x_k + ∆_k d if f(x_k + ∆_k d) < f(x_k); otherwise
x_{k+1} = x_k and the step length ∆_k is reduced.





Nelder-Mead Simplex

Given a simplex S = {v_1, . . . , v_{n+1}} in Rⁿ, let v_r be the worst
point: r = arg max_i {f(v_i)}. Let C be the centroid of S \ {v_r}:

C = Σ_{i ≠ r} v_i / n

The algorithm performs a sort of line search along the direction
C − v_r. Let

R = C + (C − v_r)

be a reflection of the worst point along this direction. Let f̄ be the
best function value in the current simplex. Three cases might occur:

1: Reflection

Check f(R): if it is intermediate, i.e. better than the worst and worse
than the best, then accept the reflection, i.e. discard the worst point
in the simplex and replace it with R.
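The centroid/reflection computation can be sketched as follows (an illustrative helper of our own, not the full Nelder-Mead loop):

```python
def reflect(simplex, f):
    """Return (index of worst vertex, centroid C of the rest, R = C + (C - v_r))."""
    values = [f(v) for v in simplex]
    r = values.index(max(values))                  # worst vertex
    rest = [v for i, v in enumerate(simplex) if i != r]
    dim = len(simplex[0])
    C = [sum(v[j] for v in rest) / len(rest) for j in range(dim)]
    R = [2.0 * C[j] - simplex[r][j] for j in range(dim)]
    return r, C, R

S = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
r, C, R = reflect(S, lambda v: v[0] + v[1])   # worst vertex is the first with f = 1
```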


Reflection step / 2: improvement

If the trial step is an improvement, f(R) < f̄, then attempt an
expansion: try to move R to R̄ = R + (R − C). If successful
(f(R̄) < f(R)), the expanded point R̄ replaces the worst vertex;
otherwise R does.

[figure omitted: reflection of the worst vertex through the centroid]

Expansion / 3: contraction

If however the reflected point R is worse than all points in the
simplex (possibly except the worst v_r), then a contraction step is
performed:

if f(R) > f(v_r) (R is worse than all points in the simplex), add
0.5(v_r + C) to the simplex and discard v_r
otherwise, if R is better than v_r, add 0.5(R + C) to the simplex and
discard v_r


Contraction

[figure omitted: contraction step]

Nelder-Mead is not a direct search method (only a single direction at
a time is explored). It is widely used by practitioners. However, it may
fail to converge to a local minimum: there are examples of strictly
convex functions in R² on which the method converges to a
non-stationary point. The bad convergence properties are connected
to the event that the n–dimensional simplex degenerates into a lower
dimensional space. Moreover, the method has a strong tendency to
generate directions which are almost normal to that of the gradient!
Convergent variants of the Nelder-Mead method do exist.

Implicit filtering

Let

f(x) = h(x) + w(x)

where h(x) is a smooth function, while w(x) can be considered as an
additive, typically random, noise.
The method computes a rough estimate of the gradient (finite
differences with a "large" step) and proceeds with an Armijo line
search. If unsuccessful, the step used for the finite differences is
reduced.

Data: {ε_k} ↓ 0, parameters δ, γ, ∆ of Armijo's rule
repeat
    OuterIteration = false;
    repeat
        compute f(x_k) and a finite difference estimate of ∇f(x_k):
        ∇_{ε_k} f(x_k) = [(f(x_k + ε_k e_i) − f(x_k − ε_k e_i)) / (2ε_k)]_i
        if ‖∇_{ε_k} f(x_k)‖ ≤ ε_k then
            OuterIteration = true
        else
            Armijo: if successful accept the Armijo step;
            otherwise OuterIteration = true
        end
    until OuterIteration;
    k = k + 1;
until convergence criterion;

Convergence properties

If

∇²h(x) is Lipschitz continuous
the sequence {x_k} generated by the method is infinite

lim_{k→∞} (ε_k² + η(x_k; ε_k)/ε_k) = 0

where

η(x; ε) = sup_{z : ‖z − x‖_∞ ≤ ε} |w(z)|

unsuccessful Armijo steps occur at most a finite number of times

then all limit points of {x_k} are stationary.

Algorithms for constrained local optimization
Fabio Schoen
2008
http://gol.dsi.unifi.it/users/schoen

Feasible direction methods

Algorithms for constrained local optimization – p. 1 Algorithms for constrained local optimization – p. 2

Frank–Wolfe method

Let X be a convex set. Consider the problem:

min_{x ∈ X} f(x)

Let x_k ∈ X ⇒ choosing a feasible direction d_k corresponds to
choosing a point x ∈ X : d_k = x − x_k.
"Steepest descent" choice:

min_{x ∈ X} ∇ᵀf(x_k)(x − x_k)

(a linear objective with convex constraints, usually easy to solve).
Let x̂_k be an optimal solution of this problem.

Frank–Wolfe

If ∇ᵀf(x_k)(x̂_k − x_k) = 0, then

∇ᵀf(x_k) d ≥ 0

for every feasible direction d ⇒ first order necessary conditions hold.
Otherwise, letting d_k = x̂_k − x_k, this is a descent direction along
which a step α_k ∈ (0, 1] might be chosen according to Armijo's rule.
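A sketch of Frank–Wolfe on the unit simplex, where the linear subproblem has the closed-form solution "pick the vertex e_j with the smallest gradient component". We use the classical diminishing step 2/(k+2) instead of the Armijo rule mentioned in the slides (an assumption of this sketch):

```python
def frank_wolfe_simplex(grad, x, iters):
    """Frank-Wolfe on {x >= 0, sum(x) = 1} with step 2/(k+2)."""
    for k in range(iters):
        g = grad(x)
        j = min(range(len(x)), key=lambda i: g[i])   # best simplex vertex e_j
        gamma = 2.0 / (k + 2.0)
        x = [(1.0 - gamma) * xi for xi in x]
        x[j] += gamma
    return x

# min (1/2)||x||^2 over the simplex: optimum x = (1/2, 1/2), f* = 1/4
x = frank_wolfe_simplex(lambda x: x, [1.0, 0.0], 200)
fval = 0.5 * sum(xi * xi for xi in x)
```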


Under mild conditions the method converges to a point Generic iteration: satisfying first order necessary conditions. However it is usually extremely slow (convergence may be xk+1 = xk + αk(¯xk − xk) sub–linear) It might find applications in very large scale problems in which where the direction dk =x ¯k − xk is obtained finding solving the sub-problem for direction determination is very easy + (e.g. when X is a polytope). x¯k = [xk − sk∇f(xk)]

+ + where: sk ∈ R and [·] represents projection over the feasible set.


The method is slightly faster than Frank-Wolfe, with a linear
convergence rate similar to that of (unconstrained) steepest descent.
It might be applied if projection is relatively cheap, e.g. when the
feasible set is a box. A point x_k satisfies the first order necessary
conditions (dᵀ∇f(x_k) ≥ 0 for every feasible direction d) iff

x_k = [x_k − s_k ∇f(x_k)]⁺

Lagrange Multiplier Algorithms

Barrier Methods

min f(x),  x ∈ Rⁿ
g_j(x) ≤ 0,  j = 1, . . . , r

A Barrier is a function which tends to +∞ whenever x approaches
the boundary of the feasible region. Examples of barrier functions:

B(x) = −Σ_j log(−g_j(x))   (logarithmic barrier)
B(x) = −Σ_j 1/g_j(x)       (inverse barrier)

Barrier Method

Let ε_k ↓ 0 and x_0 be strictly feasible, i.e. g_j(x_0) < 0 ∀ j. Then let

x_k = arg min_x (f(x) + ε_k B(x))

Proposition: every limit point of {x_k} is a global minimum of the
constrained optimization problem.


Analysis of Barrier methods

Special case: a single constraint (might be generalized).
If B(x) = φ(g(x)) ⇒ x_k, solution of the barrier problem

min f(x) + ε_k B(x),   g(x) < 0

satisfies

∇f(x_k) + ε_k φ′(g(x_k)) ∇g(x_k) = 0

Let x̄ be a limit point of {x_k} (a global minimum). If KKT conditions
hold, then there exists a unique λ ≥ 0:

∇f(x̄) + λ∇g(x̄) = 0

(with λg(x̄) = 0). In the limit, for k → ∞:

lim ε_k φ′(g(x_k)) ∇g(x_k) = λ∇g(x̄)

if lim_k g(x_k) < 0 ⇒ φ′(g(x_k))∇g(x_k) → K (finite) and Kε_k → 0
if lim_k g(x_k) = 0 ⇒ (thanks to the unicity of Lagrange multipliers)

λ = lim_k ε_k φ′(g(x_k))

Difficulties in Barrier Methods

strong numeric instability: the condition number of the hessian
matrix grows as ε_k → 0
need for an initial strictly feasible point x_0

(partial) remedy: ε_k is very slowly decreased and the solution of the
(k + 1)–th problem is obtained starting an unconstrained optimization
from x_k

Example

min (x − 1)² + (y − 1)²
x + y ≤ 1

Logarithmic Barrier problem:

min (x − 1)² + (y − 1)² − ε_k log(1 − x − y)
x + y − 1 < 0

Gradient:

[ 2(x − 1) + ε_k/(1 − x − y) ,  2(y − 1) + ε_k/(1 − x − y) ]ᵀ

Setting it to zero along x = y gives the stationary points

x = y = 3/4 ± √(1 + 4ε_k)/4

(only the "−" solution is acceptable; as ε_k → 0 it tends to the
constrained optimum x = y = 1/2).
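The stationarity condition of this example can be checked numerically (a sketch; note the root formula with √(1 + 4ε_k), obtained from 2(t − 1) + ε/(1 − 2t) = 0 along x = y = t):

```python
import math

def barrier_stationary(eps):
    """Feasible stationary point of (x-1)^2 + (y-1)^2 - eps*log(1-x-y), x = y."""
    return 0.75 - math.sqrt(1.0 + 4.0 * eps) / 4.0

def grad_component(t, eps):
    """One component of the barrier-problem gradient along x = y = t."""
    return 2.0 * (t - 1.0) + eps / (1.0 - 2.0 * t)

t = barrier_stationary(0.1)
t_small = barrier_stationary(1e-10)   # approaches the constrained optimum 1/2
```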


Barrier methods and L.P.

min cᵀx
Ax = b,  x ≥ 0

Logarithmic Barrier on x ≥ 0:

min cᵀx − ε Σ_j log x_j
Ax = b,  x > 0

The central path

The starting point is usually associated with ε = ∞ and is the unique
solution of

min −Σ_j log x_j
Ax = b,  x > 0

The trajectory x(ε) of solutions to the barrier problem is called the
central path and leads to an optimal solution of the LP.

Penalty Methods

Penalized problem:

min f(x) + ρ P(x)

where ρ > 0 and P(x) ≥ 0 with P(x) = 0 iff x is feasible. Example:
given

min f(x)
h_i(x) = 0,  i = 1, . . . , m

a penalized problem might be:

min f(x) + ρ Σ_i h_i²(x)

Convergence of the quadratic penalty method

(for equality constrained problems): let

P(x; ρ) = f(x) + ρ Σ_i h_i²(x)

Given ρ_0 > 0, x_0 ∈ Rⁿ, k = 0, let

x_{k+1} = arg min P(x; ρ_k)

(found with an iterative method initialized at x_k); let ρ_{k+1} > ρ_k,
k := k + 1.
If x_{k+1} is a global minimizer of P and ρ_k → ∞, then every limit
point of {x_k} is a global optimum of the constrained problem.
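A one-variable illustration of the penalty scheme (our own toy problem, not from the slides): for min (x − 2)² subject to x = 1, the penalized minimizer has a closed form and tends to the feasible optimum as ρ → ∞:

```python
def penalty_minimizer(rho):
    """Minimizer of (x-2)^2 + rho*(x-1)^2: setting the derivative to zero
    gives 2(x-2) + 2*rho*(x-1) = 0, i.e. x = (2 + rho)/(1 + rho)."""
    return (2.0 + rho) / (1.0 + rho)

xs = [penalty_minimizer(r) for r in (1.0, 10.0, 1000.0)]  # monotonically -> 1
```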


Exact penalties

Exact penalties: there exists a finite penalty parameter value such
that the optimal solution of the penalized problem is the optimal
solution of the original one.
ℓ_1 penalty function:

P_1(x; ρ) = f(x) + ρ Σ_i |h_i(x)|

For inequality constrained problems

min f(x)
h_i(x) = 0
g_j(x) ≤ 0

the penalized problem is

P_1(x; ρ) = f(x) + ρ Σ_i |h_i(x)| + ρ Σ_j max(0, g_j(x))

Augmented Lagrangian method

Given an equality constrained problem, reformulate it as:

min f(x) + (1/2) ρ ‖h(x)‖²
h(x) = 0

The Lagrange function of this problem is called the Augmented
Lagrangian:

L_ρ(x; λ) = f(x) + (1/2) ρ ‖h(x)‖² + λᵀh(x)

Motivation

∇_x L_ρ(x, λ) = ∇f(x) + Σ_i λ_i ∇h_i(x) + ρ Σ_i h_i(x) ∇h_i(x)
             = ∇_x L(x, λ) + ρ Σ_i h_i(x) ∇h_i(x)
∇²_xx L_ρ(x, λ) = ∇²_xx L(x, λ) + ρ Σ_i h_i(x) ∇²h_i(x) + ρ ∇h(x) ∇ᵀh(x)


motivation . . .

Let (x⋆, λ⋆) be an optimal (primal and dual) solution. Necessarily
∇_x L(x⋆, λ⋆) = 0; moreover h(x⋆) = 0, thus

∇_x L_ρ(x⋆, λ⋆) = ∇_x L(x⋆, λ⋆) + ρ Σ_i h_i(x⋆) ∇h_i(x⋆) = 0

⇒ (x⋆, λ⋆) is a stationary point of the augmented lagrangian.
Observe that:

∇²_xx L_ρ(x⋆, λ⋆) = ∇²_xx L(x⋆, λ⋆) + ρ ∇h(x⋆) ∇ᵀh(x⋆)

Assume that the sufficient optimality conditions hold:

vᵀ ∇²_xx L(x⋆, λ⋆) v > 0   ∀ v ≠ 0 : vᵀ∇h(x⋆) = 0


Let v ≠ 0 : vᵀ∇h(x⋆) = 0. Then

vᵀ ∇²_xx L_ρ(x⋆, λ⋆) v = vᵀ ∇²_xx L(x⋆, λ⋆) v + ρ (vᵀ∇h(x⋆))²
                      = vᵀ ∇²_xx L(x⋆, λ⋆) v > 0

Let v ≠ 0 : vᵀ∇h(x⋆) ≠ 0. Then

vᵀ ∇²_xx L_ρ(x⋆, λ⋆) v = vᵀ ∇²_xx L(x⋆, λ⋆) v + ρ (vᵀ∇h(x⋆))²

where the first term might be negative. However ∃ ρ̄ > 0: if ρ ≥ ρ̄ ⇒
vᵀ ∇²_xx L_ρ(x⋆, λ⋆) v > 0. Thus, if ρ is large enough, the Hessian of
the augmented lagrangian is positive definite and x⋆ is a (strict) local
minimum of L_ρ(·, λ⋆).


Inequality constraints

Given the problem

min f(x)
g(x) ≤ 0

a nonlinear transformation turns inequalities into equalities:

min_{x,s} f(x)
g_j(x) + s_j² = 0   j = 1, p

For the general problem

min f(x)
h_i(x) = 0   i = 1, m
g_j(x) ≤ 0   j = 1, p

an Augmented Lagrangian problem might be defined as

min_{x,z} L_ρ(x, z; λ, µ) = min_{x,z} f(x) + λᵀh(x) + (1/2) ρ ‖h(x)‖²
    + Σ_j µ_j (g_j(x) + z_j²) + (1/2) ρ Σ_j (g_j(x) + z_j²)²


Consider minimization with respect to the z variables:

min_z Σ_j µ_j (g_j(x) + z_j²) + (1/2) ρ Σ_j (g_j(x) + z_j²)²
  = min_{u ≥ 0} Σ_j µ_j (g_j(x) + u_j) + (1/2) ρ (g_j(x) + u_j)²

(a quadratic minimization over the non-negative orthant, with u_j = z_j²). Its solution is

u⋆_j = max{0, ū_j}

where ū_j is the unconstrained optimum: µ_j + ρ (g_j(x) + ū_j) = 0. Thus:

u⋆_j = max{0, −µ_j/ρ − g_j(x)}

Substituting:

L_ρ(x; λ, µ) = f(x) + λᵀh(x) + (1/2) ρ ‖h(x)‖²
    + (1/(2ρ)) Σ_j ( max{0, µ_j + ρ g_j(x)}² − µ_j² )

This is an Augmented Lagrangian for inequality constrained problems.


Sequential Quadratic Programming

Consider the equality constrained problem

min f(x)
h_i(x) = 0

Idea: apply Newton's method to solve the KKT equations. With the Lagrangian function

L(x; λ) = f(x) + Σ_i λ_i h_i(x)

and H(x) = [h_i(x)], ∇H(x) = [∇h_i(x)], the KKT conditions are

F[x; λ] = [ ∇f(x) + ∇Hᵀ(x)λ ; H(x) ] = 0

Newton step for SQP

Jacobian of the KKT system:

F′(x, λ) = [ ∇²_xx L(x; λ)  ∇ᵀH(x) ; ∇H(x)  0 ]

Newton step:

[ x_{k+1} ; λ_{k+1} ] = [ x_k ; λ_k ] + [ d_k ; ∆_k ]

where

[ ∇²_xx L(x_k; λ_k)  ∇ᵀH(x_k) ; ∇H(x_k)  0 ] [ d_k ; ∆_k ] = [ −∇f(x_k) − ∇Hᵀ(x_k)λ_k ; −H(x_k) ]
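The Newton step above amounts to one linear solve on the KKT matrix. A sketch (function names and the quadratic toy problem are mine; for a quadratic objective with linear constraints a single step is exact):

```python
import numpy as np

def sqp_step(grad_f, hess_L, h, jac_h, x, lam):
    """One Newton step on the KKT system of min f s.t. h(x) = 0.
    Solves [[H, A^T], [A, 0]] [d, Delta] = [-grad_x L, -h]."""
    H = hess_L(x, lam)                 # Hessian of the Lagrangian
    A = jac_h(x)                       # Jacobian of the constraints, m x n
    n, m = H.shape[0], A.shape[0]
    K = np.block([[H, A.T], [A, np.zeros((m, m))]])
    rhs = np.concatenate([-grad_f(x) - A.T @ lam, -h(x)])
    step = np.linalg.solve(K, rhs)
    return x + step[:n], lam + step[n:]

# min x1^2 + x2^2  s.t.  x1 + x2 = 1: one step reaches x* = (1/2, 1/2).
x, lam = sqp_step(lambda x: 2 * x,
                  lambda x, l: 2 * np.eye(2),
                  lambda x: np.array([x[0] + x[1] - 1.0]),
                  lambda x: np.array([[1.0, 1.0]]),
                  np.array([2.0, -1.0]), np.array([0.0]))
```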

Existence

The Newton step exists if

1. the Jacobian of the constraints ∇H(x_k) has full row rank
2. the Hessian ∇²_xx L(x_k; λ_k) is positive definite

In this case the Newton step is the unique solution of the KKT conditions:

∇²_xx L(x_k; λ_k) d_k + ∇ᵀH(x_k) ∆_k + ∇f(x_k) + ∇Hᵀ(x_k) λ_k = 0
∇H(x_k) d_k + H(x_k) = 0

Alternative view: SQP

min_d f(x_k) + ∇f(x_k)ᵀ d + (1/2) dᵀ ∇²_xx L(x_k; λ_k) d
∇H(x_k) d + H(x_k) = 0

KKT conditions:

∇²_xx L(x_k; λ_k) d + ∇ᵀH(x_k) Λ_k + ∇f(x_k) = 0
∇H(x_k) d + H(x_k) = 0

Under the same conditions as before this QP has a unique solution d_k with Lagrange multipliers Λ_k = λ_{k+1}.


Alternative view: SQP

Thus SQP can be seen as a method which minimizes a quadratic approximation of the Lagrangian subject to a first order approximation of the constraints:

min_d L(x_k, λ_k) + ∇_x L(x_k, λ_k)ᵀ d + (1/2) dᵀ ∇²_xx L(x_k; λ_k) d
∇H(x_k) d + H(x_k) = 0

KKT conditions:

∇²_xx L(x_k; λ_k) d + ∇f(x_k) + ∇ᵀH(x_k) λ_k + ∇ᵀH(x_k) Λ_k = 0
∇H(x_k) d + H(x_k) = 0

Under the same conditions as before this QP has a unique solution d_k with Lagrange multipliers Λ_k = ∆_k (so that λ_{k+1} = λ_k + Λ_k).

Inequalities

If the original problem is

min f(x)
h_i(x) = 0
g_j(x) ≤ 0

then the SQP iteration solves

min_d f(x_k) + ∇f(x_k)ᵀ d + (1/2) dᵀ ∇²_xx L(x_k, λ_k) d
∇ᵀh_i(x_k) d + h_i(x_k) = 0
∇ᵀg_j(x_k) d + g_j(x_k) ≤ 0

Filter Methods

Basic idea: the problem

min f(x)
g(x) ≤ 0

can be considered as a problem with two objectives:

minimize f(x)
minimize the violation of g(x) ≤ 0

(the second objective has priority over the first)


Filter

Given the problem

min f(x)
g_j(x) ≤ 0   j = 1, . . . , k

let us consider the bi-criteria optimization problem

min f(x)
min h(x)

where

h(x) = Σ_j max{g_j(x), 0}

Let {(f_k, h_k), k = 1, 2, . . .} be the observed values of f and h at the points x_1, x_2, . . .. A pair (f_k, h_k) dominates a pair (f_ℓ, h_ℓ) iff

f_k ≤ f_ℓ  and  h_k ≤ h_ℓ

A filter is a list of pairs none of which is dominated by any other.
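The dominance test and filter update translate almost literally into code (function names are mine, not from the slides):

```python
def acceptable(f_new, h_new, filt):
    """A pair is acceptable to the filter iff no stored pair dominates it."""
    return not any(fk <= f_new and hk <= h_new for (fk, hk) in filt)

def add_to_filter(f_new, h_new, filt):
    """Insert an accepted pair and drop the pairs it dominates."""
    filt = [(fk, hk) for (fk, hk) in filt
            if not (f_new <= fk and h_new <= hk)]
    filt.append((f_new, h_new))
    return filt
```

For example, with filter [(1.0, 0.5)], the pair (2.0, 0.6) is dominated and rejected, while (0.5, 0.1) is accepted and replaces the dominated entry.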

Trust region SQP

[Figure: filter pairs plotted in the (h(x), f(x)) plane.]

Consider a trust-region SQP method:

min_d f(x_k) + ∇_x L(x_k; λ_k)ᵀ d + (1/2) dᵀ ∇²_xx L(x_k; λ_k) d
∇ᵀg_j(x_k) d + g_j(x_k) ≤ 0
‖d‖_∞ ≤ ρ

(the ∞ norm is used here in order to keep the subproblem a QP). In traditional (unconstrained) trust region methods: if the current step is a failure ⇒ reduce the trust region ⇒ eventually the step becomes a pure gradient step ⇒ convergence!


Trust region SQP

Here diminishing the trust region radius might lead to infeasible QP subproblems: the constraints

∇ᵀg_j(x_k) d + g_j(x_k) ≤ 0
‖d‖_∞ ≤ ρ

may have no common solution for small ρ.

Filter methods

Data: x_0: starting point; ρ; k = 0
while convergence criterion not satisfied do
    if the QP is infeasible then
        find x_{k+1} minimizing the constraint violation
    else
        solve the QP and get a step d_k; try setting x_{k+1} = x_k + d_k
        if (f_{k+1}, h_{k+1}) is acceptable to the filter then
            accept x_{k+1} and add (f_{k+1}, h_{k+1}) to the filter;
            remove dominated pairs from the filter;
            possibly increase ρ
        else
            reject the step; reduce ρ
        end
    end
    set k = k + 1
end

Comparison with other methods

[Figure: in the (h(x), f(x)) plane, steps rejected by the filter vs. acceptable steps, compared with the trajectory of a "classical" penalty-type method.]

Introduction to Global Optimization

Fabio Schoen

2008 — http://gol.dsi.unifi.it/users/schoen

Global Optimization Problems

min_{x∈S⊆Rⁿ} f(x)

What is meant by global optimization? Of course we would like to find

f⋆ = min_{x∈S⊆Rⁿ} f(x)

and

x⋆ = arg min f(x) : f(x⋆) ≤ f(x)  ∀ x ∈ S

Introduction to Global Optimization – p. 1 Introduction to Global Optimization – p. 2

This definition is unsatisfactory:

the problem is "ill posed" in x (two objective functions which differ only slightly might have global optima which are arbitrarily far apart)
it is however well posed in the optimal values:

‖f − g‖ ≤ δ ⇒ |f⋆ − g⋆| ≤ ε

Quite often we are satisfied with looking for f⋆ and searching for one or more feasible solutions x̄ such that

f(x̄) ≤ f(x⋆) + ε

Frequently, however, even this is too ambitious a task!

Research in Global Optimization

the problem is highly relevant, especially in applications
the problem is very hard (perhaps too hard) to solve
there are plenty of publications on global optimization algorithms for specific problem classes
there are only relatively few papers with relevant theoretical content
often, from elegant theories, weak algorithms have been produced — and vice versa: the best computational methods often lack a sound theoretical support

Many global optimization papers get published in applied research journals. Bazaraa, Sherali, Shetty, "Nonlinear Programming: theory and algorithms", 1993: the words "global optimum" appear for the first time on page 99, the second time at page 132, then at page 247:

"A desirable property of an algorithm for solving [an optimization] problem is that it generates a sequence of points converging to a global optimal solution. In many cases, however, we may have to be satisfied with less favorable outcomes."

After this (in 638 pages) it never appears again. "Global optimization" is never cited.


Complexity

A similar situation holds in Bertsekas, "Nonlinear Programming" (1999): 777 pages, but only the definition of global minima and maxima is given! Nocedal & Wright, "Numerical Optimization", 2nd edition, 2006:

"Global solutions are needed in some applications, but for many problems they are difficult to recognize and even more difficult to locate . . . many successful global optimization algorithms require the solution of many local optimization problems, to which the algorithms described in this book can be applied"

Global optimization is "hopeless": without "global" information no algorithm will find a certifiable global optimum unless it generates a dense sample. There exists a rigorous definition of "global" information — some examples:

the number of local optima
the global optimum value
for global optimization problems over a box, (an upper bound on) the Lipschitz constant:

|f(y) − f(x)| ≤ L ‖x − y‖  ∀ x, y

concavity of the objective function + convexity of the feasible region
an explicit representation of the objective function as the difference between two convex functions (+ convexity of the feasible region)

Global optimization is computationally intractable also according to classical complexity theory. Quadratic programming:

min (1/2) xᵀQx + cᵀx
Ax ≤ b

is NP-hard [Sahni, 1974] and, when considered as a decision problem, NP-complete [Vavasis, 1990]. Many special cases are still NP-hard:

norm maximization on a parallelotope:

max ‖x‖
ℓ ≤ Ax ≤ u

quadratic optimization on a hyper-rectangle (A = I), even when only one eigenvalue of Q is negative
quadratic minimization over a simplex:

min_{x≥0} (1/2) xᵀQx + cᵀx
Σ_j x_j = 1

Even checking that a point is a local optimum is NP-hard.

Applications of global optimization

concave minimization — quantity discounts, scale economies, fixed charges
combinatorial optimization — e.g. binary linear programming:

min cᵀx + K xᵀ(1 − x)
Ax = b
x ∈ [0, 1]ⁿ

minimization of cost functions which are neither convex nor concave, e.g. finding the minimum energy conformation of complex molecules: Lennard-Jones micro-clusters, protein folding, protein–ligand docking

Example: the Lennard-Jones pair potential due to two atoms at X₁, X₂ ∈ R³:

v(r) = 1/r¹² − 2/r⁶

where r = ‖X₁ − X₂‖. The total energy of a cluster of N atoms located at X₁, . . . , X_N ∈ R³ is defined as:

E = Σ_{i=1,...,N} Σ_{j<i} v(‖X_i − X_j‖)
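The cluster energy above is a few lines of code (a sketch; the function name is mine, not from the slides):

```python
import math

def lj_energy(X):
    """Total Lennard-Jones energy of a cluster; X is a list of (x, y, z) positions.
    The pair potential v(r) = 1/r^12 - 2/r^6 attains its minimum v = -1 at r = 1."""
    E = 0.0
    for i in range(len(X) - 1):
        for j in range(i + 1, len(X)):
            r = math.dist(X[i], X[j])
            E += 1.0 / r ** 12 - 2.0 / r ** 6
    return E

# Two atoms at unit distance sit exactly at the pair-potential minimum.
E2 = lj_energy([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)])
```

Despite the simple formula, the number of local minima of this energy grows extremely fast with N, which is what makes the problem a classical global-optimization benchmark.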

[Figure: attractive, repulsive, and total Lennard-Jones pair potential as functions of the interatomic distance.]

Molecular potential energy model: E = E_l + E_a + E_d + E_v + E_e where:

E_l = (1/2) Σ_{i∈L} K_i^b (r_i − r_i⁰)²   (contribution of pairs of bonded atoms)

E_a = (1/2) Σ_{i∈A} K_i^θ (θ_i − θ_i⁰)²   (angles between triplets of bonded atoms)

E_d = (1/2) Σ_{i∈T} K_i^φ [1 + cos(n_i φ_i − γ_i)]   (dihedral angles)


Docking

E_v = Σ_{(i,j)∈C} ( A_ij/R_ij¹² − B_ij/R_ij⁶ )   (van der Waals)

E_e = (1/2) Σ_{(i,j)∈C} q_i q_j / (ε R_ij)   (Coulomb interaction)

Given two macro-molecules M₁, M₂, find their minimal energy coupling. If no bonds are changed, to find the optimal docking it is sufficient to minimize:

E_v + E_e = Σ_{i∈M₁, j∈M₂} ( A_ij/R_ij¹² − B_ij/R_ij⁶ ) + (1/2) Σ_{i∈M₁, j∈M₂} q_i q_j / (ε R_ij)

Main algorithmic strategies

Two main families:

1. with global information ("structured problems")
2. without global information ("unstructured problems")

Structured problems ⇒ stochastic and deterministic methods. Unstructured problems ⇒ typically stochastic algorithms. Every global optimization method should try to find a balance between exploration of the feasible region and refinement of approximations of the optimum.

Example: Lennard-Jones

LJ_N = min_X LJ(X) = min Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} ( 1/‖X_i − X_j‖¹² − 2/‖X_i − X_j‖⁶ )

This is a highly structured problem. But is it easy/convenient to use its structure? And how?


LJ

The map

F₁ : R^{3N} → R₊^{N(N−1)/2}
F₁(X₁, . . . , X_N) ↦ ( ‖X₁ − X₂‖², . . . , ‖X_{N−1} − X_N‖² )

is convex, and the function

F₂ : R₊^{N(N−1)/2} → R
F₂(r₁₂, . . . , r_{N−1,N}) ↦ Σ Σ ( 1/r_ij⁶ − 2/r_ij³ )

(here the r_ij are squared distances) is the difference between two convex functions. Thus LJ(X) can be seen as the difference between two convex functions (a d.c. programming problem).

NB: every C² function is d.c., but often its d.c. decomposition is not known. D.C. optimization is very elegant and there exists a nice duality theory, but algorithms are typically very inefficient.

A primal method for d.c. optimization

A "cutting plane" method (just an example, not particularly efficient, useless for high dimensional problems).

D.C. canonical form. Any unconstrained d.c. problem can be represented as an equivalent problem with a linear objective, a convex constraint and a reverse convex constraint. If g, h are convex, then min g(x) − h(x) is equivalent to

min z
g(x) − h(x) ≤ z

which is equivalent to

min z
g(x) ≤ w
h(x) + z ≥ w

The canonical form is thus

min cᵀx
g(x) ≤ 0
h(x) ≥ 0

where g, h are convex. Let

Ω = {x : g(x) ≤ 0}
C = {x : h(x) ≤ 0}

Hypotheses: 0 ∈ int Ω ∩ int C, and cᵀx > 0 ∀ x ∈ Ω \ int C.

Fundamental property: if a d.c. problem in canonical form admits an optimum, at least one optimum belongs to ∂Ω ∩ ∂C.

Discussion of the assumptions

[Figure: the sets Ω and C and the level line cᵀx = 0 in the plane.]

g(0) < 0, h(0) < 0, and cᵀx > 0 for all feasible x. Let x̄ be a solution of the convex problem

min cᵀx
g(x) ≤ 0

If h(x̄) ≥ 0 then x̄ solves the d.c. problem. Otherwise cᵀx > cᵀx̄ for all feasible x. Coordinate transformation y = x − x̄:

min cᵀy
ĝ(y) ≤ 0
ĥ(y) ≥ 0

where ĝ(y) = g(y + x̄), ĥ(y) = h(y + x̄). Then cᵀy > 0 for all feasible solutions and ĥ(0) > 0; by continuity it is possible to choose x̄ so that ĝ(0) < 0.

Let x̄ be the best known solution and

D(x̄) = {x ∈ Ω : cᵀx ≤ cᵀx̄}

If D(x̄) ⊆ C then x̄ is optimal. Check: a polytope P (with known vertices) is built which contains D(x̄). If all vertices of P are in C ⇒ optimal solution. Otherwise let v be the vertex with largest constraint violation; the intersection of the segment [0, v] with ∂C (if feasible) is an improving point x. Otherwise a cut is introduced in P which is tangent to Ω in x.

Initialization

Given a feasible solution x̄, take a polytope P, with vertices V₁, . . . , V_k, such that

P ⊇ D(x̄)

i.e.

cᵀy ≤ cᵀx̄, y feasible ⇒ y ∈ P

If P ⊂ C, i.e. if y ∈ P ⇒ h(y) ≤ 0, then x̄ is optimal. Checking this is easy if we know the vertices of P. Let

V⋆ := arg max_j h(V_j)

[Figure: the initial polytope P containing D(x̄), with the vertex V⋆ marked.]

Step 1

Let V⋆ be the vertex with the largest h(·) value. Surely h(V⋆) > 0 (otherwise we stop with an optimal solution). Moreover h(0) < 0 (0 is in the interior of C). Thus the segment from V⋆ to 0 must intersect the boundary of C:

x_k = ∂C ∩ [V⋆, 0]

Let x_k be the intersection point. It might be feasible (⇒ improving) or not.

[Figure: the intersection point x_k on the segment [V⋆, 0].]

If x_k ∈ Ω, set x̄ := x_k. Otherwise, if x_k ∉ Ω, the polytope is divided by a cut tangent to Ω at x_k.

[Figures: the two cases — an improving feasible point, and a cut splitting the polytope.]

Duality for d.c. problems

min_{x∈S} g(x) − h(x)

where g, h are convex. Let

h⋆(u) := sup { uᵀx − h(x) : x ∈ Rⁿ }
g⋆(u) := sup { uᵀx − g(x) : x ∈ Rⁿ }

be the conjugate functions of h and g. The problem

inf { h⋆(u) − g⋆(u) : u : h⋆(u) < +∞ }

is the Fenchel–Rockafellar dual. If min g(x) − h(x) admits an optimum, then the Fenchel dual is a strong dual.

A primal/dual algorithm

If x⋆ ∈ arg min g(x) − h(x) then any

u⋆ ∈ ∂h(x⋆)

(∂ denotes the subdifferential) is dual optimal; and if u⋆ ∈ arg min h⋆(u) − g⋆(u) then

x⋆ ∈ ∂g⋆(u⋆)

is an optimal primal solution. This suggests the alternating iteration:

P_k : min_x g(x) − ( h(x_k) + (x − x_k)ᵀ y_k )
D_k : min_y h⋆(y) − ( g⋆(y_{k−1}) + x_kᵀ(y − y_{k−1}) )

GlobOpt - relaxations

Exact Global Optimization

Consider the global optimization problem (P):

min f(x)
x ∈ X

and assume the minimum exists and is finite, and that we can use a relaxation (R):

min g(y)
y ∈ Y

Usually both X and Y are subsets of the same space Rⁿ. Recall: (R) is a relaxation of (P) iff:

X ⊆ Y
g(x) ≤ f(x) for all x ∈ X

Branch and Bound

1. Solve the relaxation (R) and let L be its (global) optimum value
2. (Heuristically) solve the original problem (P) (or, more generally, find a "good" feasible solution to (P) in X); let U be the best feasible function value known
3. If U − L ≤ ε then stop: U is a certified ε-optimum for (P)
4. Otherwise split X and Y into two parts and apply the same method to each of them

Tools

"good relaxations": easy yet accurate
good upper bounding, i.e., good heuristics for (P)

Good relaxations can be obtained, e.g., through convex relaxations and domain reduction.
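Steps 1–4 can be sketched on a one-dimensional problem where the "global information" is a Lipschitz constant (mentioned earlier): on an interval, min of f at the endpoints minus L·width/2 is a valid lower bound. Everything below (function names, the test function, the bound) is an illustrative assumption, not the slides' method.

```python
def branch_and_bound(f, a, b, L, eps=1e-3):
    """Lipschitz branch-and-bound on [a, b]: lower-bound each interval by
    min(f at its endpoints) - L*(width)/2, fathom intervals that cannot
    improve the incumbent by more than eps, and bisect the others."""
    U, xbest = min((f(a), a), (f(b), b))      # incumbent (upper bound)
    stack = [(a, b)]
    while stack:
        lo, hi = stack.pop()
        lb = min(f(lo), f(hi)) - L * (hi - lo) / 2.0
        if lb >= U - eps:                     # fathom this interval
            continue
        mid = (lo + hi) / 2.0
        if f(mid) < U:                        # improve the incumbent
            U, xbest = f(mid), mid
        stack += [(lo, mid), (mid, hi)]
    return xbest, U

# f(x) = (x^2 - 1)^2 on [-2, 2]: global minima at x = +-1; |f'| <= 24 there.
xbest, U = branch_and_bound(lambda x: (x * x - 1.0) ** 2, -2.0, 2.0, L=24.0)
```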


Assume X is convex and Y = X. If g is the convex envelope of f on X, then solving the convex relaxation (R) gives, in one step, the certified global optimum of (P).

g(x) is a convex under-estimator of f on X if:

g(x) is convex
g(x) ≤ f(x)  ∀ x ∈ X

g is the convex envelope of f on X if:

g is a convex under-estimator of f
g(x) ≥ h(x)  ∀ x ∈ X, for every convex under-estimator h of f

A 1-D example

[Figure: a nonconvex univariate function together with its convex envelope.]

[Figures: a convex under-estimator of the objective; branching splits the domain and tightens the under-estimators on each part.]

Bounding

[Figure: the incumbent provides the upper bound; the relaxations provide lower bounds on each subdomain; nodes whose lower bound exceeds the incumbent are fathomed.]

Relaxation of the feasible domain

Let

min_{x∈S} f(x)

be a GlobOpt problem where f is convex, while S is non convex. A relaxation (outer approximation) is obtained replacing S with a larger set Q ⊇ S. If Q is convex ⇒ a convex optimization problem. If the optimal solution of

min_{x∈Q} f(x)

belongs to S ⇒ it is an optimal solution of the original problem.


Example

min_{x∈[0,5], y∈[0,3]} −x − 2y
xy ≤ 3

We know that

(x + y)² = x² + y² + 2xy

thus

xy = ((x + y)² − x² − y²)/2

and, as x and y are non-negative, x² ≤ 5x, y² ≤ 3y; thus a (convex) relaxation of xy ≤ 3 is

(x + y)² − 5x − 3y ≤ 6

(a convex constraint). The optimal solution of this relaxed convex problem is (2, 3) (value: −8).

Stronger relaxation: since

(5 − x)(3 − y) ≥ 0 ⇒ 15 − 3x − 5y + xy ≥ 0 ⇒ xy ≥ 3x + 5y − 15

a (convex, in fact linear) relaxation of xy ≤ 3 is

3x + 5y − 15 ≤ 3,  i.e.  3x + 5y ≤ 18


Relaxation

[Figure: the feasible region xy ≤ 3 on the box, with the linear relaxation 3x + 5y ≤ 18.]

The optimal solution of the convex (linear) relaxation is (1, 3), which is feasible ⇒ optimal for the original problem.

Convex (concave) envelopes

How to build convex envelopes of a function, or how to relax a non convex constraint?

Convex envelopes of f(x) ⇒ lower bounds
Concave envelopes of f(x) ⇒ upper bounds
Constraint g(x) ≤ 0: if h(x) is a convex under-estimator of g, then h(x) ≤ 0 is a convex relaxation
Constraint g(x) ≥ 0: if h(x) is concave and h(x) ≥ g(x), then h(x) ≥ 0 is a "convex" constraint

Convex envelopes

Definition: a function is polyhedral if it is the pointwise maximum of a finite number of linear functions. (NB: in general, the convex envelope is the pointwise supremum of affine minorants.)

Generating sets

The generating set X⋆ of a function f over a convex set P is the set

X⋆ = { x ∈ Rⁿ : (x, f(x)) is a vertex of epi(conv_P(f)) }

I.e., given f we first build its convex envelope on P and then form its epigraph {(x, y) : x ∈ P, y ≥ conv_P f(x)}. This is a convex set, whose extreme points we denote by V; X⋆ collects the x coordinates of the points in V.

Characterization

Let f(x) be continuously differentiable on a polytope P. The convex envelope of f on P is polyhedral if and only if

X(f) = Vert(P)

(the generating set is the vertex set of P).

Corollary: let f₁, . . . , f_m ∈ C¹(P) each possess a polyhedral convex envelope on P. Then

Conv( Σ_i f_i(x) ) = Σ_i Conv f_i(x)

iff the generating set of Σ_i Conv f_i(x) is Vert(P).

Characterization

If f(x) is such that Conv f(x) is polyhedral, then an affine function h(x) such that

1. h(x) ≤ f(x) for all x ∈ Vert(P)
2. there exist n + 1 affinely independent vertices V₁, . . . , V_{n+1} of P such that f(V_i) = h(V_i), i = 1, . . . , n + 1

belongs to the polyhedral description of Conv f(x), and h(x) = conv f(x) for any x ∈ Conv(V₁, . . . , V_{n+1}).

The condition may be reversed: given m affine functions h₁, . . . , h_m such that, for each of them,

1. h_j(x) ≤ f(x) for all x ∈ Vert(P)
2. there exist n + 1 affinely independent vertices V₁, . . . , V_{n+1} of P such that f(V_i) = h_j(V_i), i = 1, . . . , n + 1

then the function ψ(x) = max_j h_j(x) is the convex envelope of f iff

ψ is polyhedral with generating set Vert(P)
for every vertex V_i we have ψ(V_i) = f(V_i)


Sufficient condition

If f(x) is lower semi-continuous on P and for all x ∉ Vert(P) there exists a line ℓ_x, with x ∈ int(P) ∩ ℓ_x, such that f is concave in a neighborhood of x on ℓ_x, then Conv f(x) is polyhedral.

Application: let

f(x) = Σ_{i,j} α_ij x_i x_j

The sufficient condition holds for f on [0, 1]ⁿ ⇒ bilinear forms are polyhedral on a hypercube.

Application: a bilinear term

(Al-Khayyal, Falk (1983)): let x ∈ [ℓ_x, u_x], y ∈ [ℓ_y, u_y]. Then the convex envelope of xy on [ℓ_x, u_x] × [ℓ_y, u_y] is

φ(x, y) = max{ ℓ_y x + ℓ_x y − ℓ_x ℓ_y ;  u_y x + u_x y − u_x u_y }

In fact φ(x, y) is an under-estimate of xy:

(x − ℓ_x)(y − ℓ_y) ≥ 0 ⇒ xy ≥ ℓ_y x + ℓ_x y − ℓ_x ℓ_y

and analogously xy ≥ u_y x + u_x y − u_x u_y.

No other (polyhedral) function under-estimating xy is tighter. In fact ℓ_y x + ℓ_x y − ℓ_x ℓ_y belongs to the convex envelope: it under-estimates xy and coincides with it at 3 vertices ((ℓ_x, ℓ_y), (ℓ_x, u_y), (u_x, ℓ_y)); analogously for the other affine function.

All easy then?

Of course not! Many things can go wrong . . . It is true that, on the hypercube, a bilinear form

Σ_{i<j} α_ij x_i x_j

has a polyhedral convex envelope.


Fractional terms

A convex under-estimate of a fractional term x/y over a box can be obtained through

w ≥ ℓ_x/y + x/u_y − ℓ_x/u_y              if ℓ_x ≥ 0
w ≥ x/u_y − ℓ_x y/(ℓ_y u_y) + ℓ_x/ℓ_y    if ℓ_x < 0
w ≥ u_x/y + x/ℓ_y − u_x/ℓ_y              if ℓ_x ≥ 0
w ≥ x/ℓ_y − u_x y/(ℓ_y u_y) + u_x/u_y    if ℓ_x < 0

(a better under-estimate exists).

Univariate concave terms

If f(x), x ∈ [ℓ_x, u_x], is concave, then its convex envelope is simply the linear interpolation at the extremes of the interval:

f(ℓ_x) + ( (f(u_x) − f(ℓ_x)) / (u_x − ℓ_x) ) (x − ℓ_x)

Underestimating a general nonconvex function

Let f ∈ C² be a general non convex function. Then a convex under-estimate on a box can be defined as

φ(x) = f(x) − Σ_{i=1}^{n} α_i (x_i − ℓ_i)(u_i − x_i)

where α_i > 0 are parameters. The Hessian of φ is

∇²φ(x) = ∇²f(x) + 2 diag(α)

and φ is convex iff ∇²φ(x) is positive semi-definite. How to choose the α_i's? One possibility is the uniform choice α_i = α; in this case convexity of φ is obtained iff

α ≥ max{ 0, −(1/2) min_{x∈[ℓ,u]} λ_min(x) }

where λ_min(x) is the minimum eigenvalue of ∇²f(x).
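A one-dimensional sketch (the example function and names are mine): for f(x) = −x² on [0, 1] we have f″ ≡ −2, so the uniform choice α = 1 already makes φ convex.

```python
def alpha_bb(f, x, l, u, alpha):
    """alpha-BB underestimator phi(x) = f(x) - sum_i alpha_i*(x_i - l_i)*(u_i - x_i)."""
    return f(x) - sum(a * (xi - li) * (ui - xi)
                      for a, xi, li, ui in zip(alpha, x, l, u))

# f(x) = -x^2 on [0, 1]: f'' = -2 everywhere, so alpha = max(0, 2/2) = 1 suffices.
f = lambda x: -x[0] ** 2
phi = lambda x: alpha_bb(f, x, [0.0], [1.0], [1.0])
```

Here φ(x) = −x, which is (weakly) convex, matches f at both endpoints of the box, and lies below f in the interior.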


Key properties

φ(x) ≤ f(x)
φ interpolates f at all vertices of [ℓ, u]
φ is convex
Maximum separation:

max (f(x) − φ(x)) = (1/4) α Σ_i (u_i − ℓ_i)²

Thus the error in under-estimation decreases when the box is split.

Estimation of α

Compute an interval Hessian [H]: [H(x)]_ij = [h^L_ij, h^U_ij] on [ℓ, u], and find α such that [H] + 2 diag(α) ⪰ 0. Gerschgorin's theorem for real matrices:

λ_min ≥ min_i ( h_ii − Σ_{j≠i} |h_ij| )

Extension to interval matrices:

λ_min ≥ min_i ( h^L_ii − Σ_{j≠i} max(|h^L_ij|, |h^U_ij|) (u_j − ℓ_j)/(u_i − ℓ_i) )
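The interval Gerschgorin bound is straightforward to code (a sketch; names are mine, and a real-valued matrix is recovered by setting H^L = H^U):

```python
def interval_gerschgorin(HL, HU, l, u):
    """Lower bound on the least eigenvalue of an interval Hessian [HL, HU],
    with off-diagonal radii scaled by the box widths as in the slide."""
    n = len(HL)
    best = float("inf")
    for i in range(n):
        radius = sum(max(abs(HL[i][j]), abs(HU[i][j])) * (u[j] - l[j]) / (u[i] - l[i])
                     for j in range(n) if j != i)
        best = min(best, HL[i][i] - radius)
    return best
```

For the constant matrix [[2, 1], [1, 2]] on the unit box the bound returns 1, which here coincides with the true minimum eigenvalue.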


Improvements

New relaxation functions (other than quadratic). Example:

Φ(x; γ) = − Σ_{i=1}^{n} (1 − e^{γ_i(x_i − ℓ_i)})(1 − e^{γ_i(u_i − x_i)})

gives a tighter under-estimate than the quadratic perturbation.
Partitioning: partition the domain into a small number of regions (hyper-rectangles); evaluate a convex under-estimator in each region; join the under-estimators to form a single convex function on the whole domain.

Domain (range) reduction

Techniques for cutting the feasible region without cutting the global optimum solution. Simplest approaches: feasibility-based and optimality-based range reduction (RR). Let the problem be

min_{x∈S} f(x)

Feasibility-based RR asks for solving

ℓ_i = min_{x∈S} x_i    u_i = max_{x∈S} x_i

for all i ∈ 1, . . . , n, and then adding the constraints x ∈ [ℓ, u] to the problem (or to the sub-problems generated during Branch & Bound).

Feasibility Based RR

If S is a polyhedron, RR requires the solution of LP's:

[ℓ̄, ū] = min / max x_i
Ax ≤ b
x ∈ [L, U]

"Poor man's" LP-based RR: from every constraint Σ_j a_ij x_j ≤ b_i in which a_iȷ̄ > 0, we obtain

x_ȷ̄ ≤ (1/a_iȷ̄) ( b_i − Σ_{j≠ȷ̄} a_ij x_j )
x_ȷ̄ ≤ (1/a_iȷ̄) ( b_i − Σ_{j≠ȷ̄} min(a_ij L_j, a_ij U_j) )

Optimality Based RR

Given an incumbent solution x̄ ∈ S, ranges are updated by solving the sequence:

ℓ_i = min x_i    u_i = max x_i
f̂(x) ≤ f(x̄)
x ∈ S

where f̂(x) is a convex under-estimate of f on the current domain. RR can be applied iteratively (i.e., at the end of a complete RR sequence, we might start a new one using the new bounds).
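The "poor man's" bound is a one-liner worth spelling out (a sketch; the function name is mine):

```python
def poor_mans_rr(a, b, L, U, i):
    """Tighten the upper bound of x_i using one constraint sum_j a_j*x_j <= b
    (with a[i] > 0) and interval bounds [L_j, U_j] on the other variables."""
    rest = sum(min(a[j] * L[j], a[j] * U[j])     # smallest possible contribution
               for j in range(len(a)) if j != i)
    return (b - rest) / a[i]

# x + y <= 4 with y in [1, 3] tightens the upper bound of x to 3.
ub_x = poor_mans_rr([1.0, 1.0], 4.0, [0.0, 1.0], [5.0, 3.0], 0)
```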


Generalization

Let

min_{x∈X} f(x)   (P)
g(x) ≤ 0

be a (non convex) problem and let

min_{x∈X̄} f̂(x)   (R)
ĝ(x) ≤ 0

be a convex relaxation of (P):

{x ∈ X : g(x) ≤ 0} ⊆ {x ∈ X̄ : ĝ(x) ≤ 0}
f̂(x) ≤ f(x)  ∀ x ∈ X : g(x) ≤ 0

R.H.S. perturbation

Let

φ(y) = min_{x∈X̄} f̂(x)   (R_y)
ĝ(x) ≤ y

be a perturbation of (R). (R) convex ⇒ (R_y) convex for any y. Let x̄ be an optimal solution of (R) and assume that the i-th constraint is active:

ĝ_i(x̄) = 0

Then, if x̄_y is an optimal solution of (R_y), the constraint ĝ_i(x) ≤ y_i is active at x̄_y if y_i ≤ 0.

Duality

Assume (R) has a finite optimum at x̄ with value φ(0) and Lagrange multipliers µ. Then the hyperplane

H(y) = φ(0) − µᵀy

is a supporting hyperplane of the graph of φ(y) at y = 0, i.e.

φ(y) ≥ φ(0) − µᵀy  ∀ y ∈ Rᵐ

Main result

If (R) is convex with optimum value φ(0) = L, constraint i is active at the optimum with Lagrange multiplier µ_i > 0, and U is an upper bound for the original problem (P), then the constraint

g_i(x) ≥ −(U − L)/µ_i

is valid for the original problem (P), i.e. it does not exclude any feasible solution with value better than U.

Proof

Problem (R_y) can be seen as a convex relaxation of the perturbed non convex problem

Φ(y) = min_{x∈X} f(x)
g(x) ≤ y

and thus φ(y) ≤ Φ(y). Let y := e_i y_i. From duality:

L − µᵀ e_i y_i ≤ φ(e_i y_i) ≤ Φ(e_i y_i)

If y_i < 0 then U is an upper bound also for Φ(e_i y_i), thus L − µ_i y_i ≤ U. But if y_i < 0 then constraint i is active: for any feasible x there exists y_i < 0 such that g_i(x) ≤ y_i is active ⇒ we may substitute y_i with g_i(x) and deduce L − µ_i g_i(x) ≤ U.

Applications

Range reduction: let x ∈ [ℓ, u] in the convex relaxed problem. If variable x_i is at its upper bound in the optimal solution, then we can deduce

x_i ≥ max{ ℓ_i, u_i − (U − L)/λ_i }

where λ_i is the optimal multiplier associated with the i-th upper bound. Analogously, for active lower bounds:

x_i ≤ min{ u_i, ℓ_i + (U − L)/λ_i }


Methods based on “merit functions”

Let the constraint

a_iᵀx ≤ b_i

be active in an optimal solution of the convex relaxation (R). Then we can deduce the valid inequality

a_iᵀx ≥ b_i − (U − L)/µ_i

Bayesian algorithm: the objective function is considered as a realization of a stochastic process

f(x) = F(x; ω)

A loss function is defined, e.g.:

L(x₁, . . . , x_n; ω) = min_{i=1,n} F(x_i; ω) − min_x F(x; ω)

and the next point to sample is placed in order to minimize the expected loss (or risk):

x_{n+1} = arg min E( L(x₁, . . . , x_n, x_{n+1}) | x₁, . . . , x_n )

Radial basis method

Given k observations (x₁, f₁), . . . , (x_k, f_k), an interpolant is built:

s(x) = Σ_{i=1}^{k} λ_i Φ(‖x − x_i‖) + p(x)

where p is a polynomial of (prefixed) small degree m and Φ is a radial function like, e.g.:

Φ(r) = r          linear
Φ(r) = r³         cubic
Φ(r) = r² log r   thin plate spline
Φ(r) = e^{−γr²}   gaussian

The polynomial p is necessary to guarantee existence of a unique interpolant (i.e. when the matrix {Φ_ij} = {Φ(‖x_i − x_j‖)} is singular).

"Bumpiness"

Let f⋆_k be an estimate of the value of the global optimum after k observations, and let s_y be the (unique) interpolant of the data points

(x_i, f_i), i = 1, . . . , k   and   (y, f⋆_k)

Idea: the most likely location of y is the one for which the resulting interpolant has minimum "bumpiness", measured by

σ(s_y) = (−1)^{m+1} Σ_i λ_i s_y(x_i)
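A minimal sketch of fitting the interpolant with a cubic radial function and a linear polynomial tail (function names are mine; the linear solve is plain dense algebra, not what a production RBF code would use):

```python
import numpy as np

def rbf_fit(X, fvals, phi=lambda r: r ** 3):
    """Fit s(x) = sum_i lambda_i * phi(||x - x_i||) + a + b^T x.
    The linear tail guarantees a unique interpolant for the cubic phi."""
    k, n = X.shape
    Phi = phi(np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2))
    P = np.hstack([np.ones((k, 1)), X])             # polynomial block [1, x]
    A = np.block([[Phi, P], [P.T, np.zeros((n + 1, n + 1))]])
    coef = np.linalg.solve(A, np.concatenate([fvals, np.zeros(n + 1)]))
    lam, c = coef[:k], coef[k:]
    def s(x):
        return lam @ phi(np.linalg.norm(X - x, axis=1)) + c[0] + c[1:] @ x
    return s

# 1-D data from f(x) = x^2; s must reproduce the data at the nodes.
s = rbf_fit(np.array([[0.0], [1.0], [2.0]]), np.array([0.0, 1.0, 4.0]))
```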

Stochastic methods

Pure Random Search: random uniform sampling over the feasible region
Best start: like Pure Random Search, but a local search is started from the best observation
Multistart: local searches are started from randomly generated starting points
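Multistart is a few lines once a local search is available (everything here — names, the toy function, and the plain gradient-descent "local search" — is an illustrative assumption):

```python
import random

def multistart(f, local_search, sample, n_starts=20, seed=0):
    """Multistart: run a local search from random starting points, keep the best."""
    random.seed(seed)
    best_x, best_f = None, float("inf")
    for _ in range(n_starts):
        x = local_search(sample())
        if f(x) < best_f:
            best_x, best_f = x, f(x)
    return best_x, best_f

# Toy multimodal function: f(x) = (x^2 - 1)^2, global minima at x = +-1.
f = lambda x: (x * x - 1.0) ** 2
grad = lambda x: 4.0 * x * (x * x - 1.0)

def descent(x, step=0.01, iters=2000):
    for _ in range(iters):
        x -= step * grad(x)
    return x

x_best, f_best = multistart(f, descent, lambda: random.uniform(-2.0, 2.0))
```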

[Figures: sampled points and the local minima found, illustrating Pure Random Search vs. Multistart on a one-dimensional multimodal function.]

Clustering methods

Given a uniform sample, evaluate the objective function
Sample transformation (or concentration): either a fraction of the "worst" points is discarded, or a few steps of a gradient method are performed
The remaining points are clustered
From the best point of each cluster a single local search is started

[Figure: a uniform sample over the feasible box, with the function values at the sampled points.]


[Figures: the concentrated sample, the resulting clusters, and the local searches started from the best point of each cluster.]

[Figure: a local optimization started from the best point of each cluster]

Clustering: MLSL

Sampling proceeds in batches of N points. Given sample points
X_1, ..., X_k ∈ [0, 1]^n, label X_j as "clustered" iff ∃ Y ∈ {X_1, ..., X_k}:

    ||X_j − Y|| ≤ Δ_k := (1/√π) (Γ(1 + n/2) · σ · (log k)/k)^{1/n}

and

    f(Y) ≤ f(X_j)
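The critical distance Δ_k is straightforward to compute. The helper below is hypothetical: it assumes the unit hypercube [0,1]^n and treats σ as the method's tuning parameter.

```python
import math

def mlsl_critical_distance(k, n, sigma=4.0):
    """MLSL critical distance Delta_k on the unit hypercube [0,1]^n.

    Delta_k = (1/sqrt(pi)) * (Gamma(1 + n/2) * sigma * log(k) / k)**(1/n)

    sigma is a tuning parameter; k is the number of sample points so far.
    """
    return (1.0 / math.sqrt(math.pi)) * (
        math.gamma(1.0 + n / 2.0) * sigma * math.log(k) / k
    ) ** (1.0 / n)

# The radius shrinks as the sample grows, so clustering becomes
# progressively more selective and fewer local searches are started.
d_small_sample = mlsl_critical_distance(k=10, n=2)
d_large_sample = mlsl_critical_distance(k=1000, n=2)
```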

Introduction to Global Optimization – p. 90

Simple Linkage

A sequential sample is generated (batches consist of a single observation).
A local search is started only from the last sampled point (i.e. there is no
"recall"), unless there exists a sufficiently near sampled point with a
better function value.

Smoothing methods

Given f : R^n → R, the Gaussian transform is defined as:

    ⟨f⟩_λ(x) = (1 / (π^{n/2} λ^n)) ∫_{R^n} f(y) exp(−||y − x||² / λ²) dy

When λ is sufficiently large ⇒ ⟨f⟩_λ is convex.
Idea: starting with a large enough λ, minimize the smoothed function and
slowly decrease λ towards 0.
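Since the Gaussian kernel above is, up to normalization, the density of a normal distribution with mean x and standard deviation λ/√2, the transform is an expectation over Gaussian perturbations of x and can be estimated by Monte Carlo. The 1-D sketch below is illustrative, not a prescribed implementation.

```python
import math
import random

def gaussian_transform(f, x, lam, n_samples=20000, seed=0):
    """Monte Carlo estimate of the Gaussian transform <f>_lam(x) in 1-D.

    The kernel (1/(sqrt(pi)*lam)) * exp(-(y-x)^2/lam^2) is the density of
    a normal with mean x and standard deviation lam/sqrt(2), so the
    transform is just E[f(x + Z)] with Z ~ N(0, lam^2/2).
    """
    rng = random.Random(seed)
    s = lam / math.sqrt(2.0)
    return sum(f(x + rng.gauss(0.0, s)) for _ in range(n_samples)) / n_samples

# A wiggly objective: the high-frequency sin term averages out under
# smoothing, leaving (approximately) the underlying quadratic.
f = lambda x: x * x + math.sin(20.0 * x)
smoothed = gaussian_transform(f, 0.0, lam=2.0)
raw = f(0.0)
```

For λ = 2 the exact transform at x = 0 is E[X²] + E[sin(20X)] = λ²/2 + 0 = 2, and the estimate lands close to that value.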

[Figures: the smoothed function ⟨f⟩_λ for progressively smaller values of λ]

Transformed function landscape

Elementary idea: local optimization smooths out many “high frequency” oscillations

[Figures: the objective landscape before and after the local-optimization transformation]

Monotonic Basin-Hopping

k := 0; f⋆ := +∞
while k < MaxIter do
    X^k := random initial solution
    X⋆_k := arg min f(x; X^k)   (local minimization started at X^k)
    f_k := f(X⋆_k)
    if f_k < f⋆ ⇒ f⋆ := f_k
    NoImprove := 0
    while NoImprove < MaxImprove do
        X := random perturbation of X⋆_k
        Y := arg min f(x; X)
        if f(Y) < f⋆ ⇒ X⋆_k := Y; NoImprove := 0; f⋆ := f(Y)
        otherwise NoImprove++
    end while
    k := k + 1
end while
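The scheme above can be sketched as runnable code. A naive pattern search stands in for the local optimizer, acceptance is the per-chain monotonic test f(Y) < f(X⋆_k), and the test function and all parameter values are illustrative assumptions.

```python
import math
import random

def local_min(f, x, step=0.1, tol=1e-6):
    # Naive 1-D pattern search standing in for "arg min f(x; X)".
    while step > tol:
        if f(x + step) < f(x):
            x += step
        elif f(x - step) < f(x):
            x -= step
        else:
            step /= 2.0
    return x

def basin_hopping(f, lo, hi, max_iter=8, max_no_improve=30, pert=2.0, seed=0):
    # Outer loop: random restarts; inner loop: perturb the current local
    # minimum, re-minimize, and accept only strict improvements (monotonic).
    rng = random.Random(seed)
    x_best, f_best = None, math.inf
    for _ in range(max_iter):
        x_k = local_min(f, rng.uniform(lo, hi))
        if f(x_k) < f_best:
            x_best, f_best = x_k, f(x_k)
        no_improve = 0
        while no_improve < max_no_improve:
            y = local_min(f, x_k + rng.uniform(-pert, pert))
            if f(y) < f(x_k):            # monotonic acceptance
                x_k, no_improve = y, 0
                if f(y) < f_best:
                    x_best, f_best = y, f(y)
            else:
                no_improve += 1
    return x_best, f_best

# Rastrigin-like 1-D test: many local minima, global minimum f(0) = 0.
f = lambda x: x * x + 10.0 * (1.0 - math.cos(2.0 * math.pi * x))
x_star, f_star = basin_hopping(f, -5.0, 5.0)
```

The perturbation radius must be large enough to reach neighbouring basins (here the basins have width about 1, so radius 2 lets each hop escape), which is the key tuning decision of the method.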

[Figures: successive iterations of Monotonic Basin-Hopping on a multimodal landscape]

References

In this year’s course the global optimization part has been expanded, so some parts of nonlinear optimization may be skipped. Here is an essential reference list for the material covered during the course:

Mokhtar S. Bazaraa, John J. Jarvis and Hanif D. Sherali, Linear Programming and Network Flows, John Wiley & Sons, 1990.

Dimitri P. Bertsekas, Nonlinear Programming, Athena Scientific, 1999.

Jorge Nocedal and Stephen J. Wright, Numerical Optimization, Springer, 2006.

Mohit Tawarmalani and Nikolaos V. Sahinidis, A Polyhedral Branch-and-Cut Approach to Global Optimization, in: Mathematical Programming, volume 103, pages 225-249, 2005.

I.P. Androulakis, C.D. Maranas and C.A. Floudas, αBB: A Global Optimization Method for General Constrained Nonconvex Problems, in: Journal of Global Optimization, volume 7, number 4, pages 337-363, 1995.

A. Rikun, A Convex Envelope Formula for Multilinear Functions, in: Journal of Global Optimization, volume 10, pages 425-437, 1997.

Andrea Grosso, Marco Locatelli and Fabio Schoen, A Population Based Approach for Hard Global Optimization Problems Based on Dissimilarity Measures, in: Mathematical Programming, volume 110, number 2, pages 373-404, 2007.
