Notes on Numerical Optimization
University of Chicago, 2014

Vivak Patel, October 18, 2014

Contents

List of Algorithms

I Fundamentals of Optimization

1 Overview of Numerical Optimization
  1.1 Problem and Classification
  1.2 Theoretical Properties of Solutions
  1.3 Algorithm Overview
  1.4 Fundamental Definitions and Results

2 Line Search Methods
  2.1 Step Length
  2.2 Step Length Selection Algorithms
  2.3 Global Convergence and Zoutendjik

3 Trust Region Methods
  3.1 Trust Region Subproblem and Fidelity
  3.2 Fidelity Algorithms
  3.3 Approximate Solutions to Subproblem
    3.3.1 Cauchy Point
    3.3.2 Dogleg Method
    3.3.3 Global Convergence of Cauchy Point Methods
  3.4 Iterative Solutions to Subproblem
    3.4.1 Exact Solution to Subproblem
    3.4.2 Newton's Root Finding Iterative Solution
  3.5 Trust Region Subproblem Algorithms

4 Conjugate Gradients

II Model Hessian Selection

5 Newton's Method

6 Newton's Method with Hessian Modification

7 Quasi-Newton's Method
  7.1 Rank-2 Update: DFP & BFGS
  7.2 Rank-1 Update: SR1

III Specialized Applications of Optimization

8 Inexact Newton's Method

9 Limited Memory BFGS

10 Least Squares Regression Methods
  10.1 Linear Least Squares Regression
  10.2 Line Search: Gauss-Newton
  10.3 Trust Region: Levenberg-Marquardt

IV Constrained Optimization
  10.4 Theory of Constrained Optimization

List of Algorithms

1 Backtracking Algorithm
2 Interpolation Algorithm
3 Trust Region Management Algorithm
4 Solution Acceptance Algorithm
5 Overview of Conjugate Gradient Algorithm
6 Conjugate Gradients Algorithm for Convex Quadratic
7 Fletcher-Reeves CG Algorithm
8 Line Search with Modified Hessian
9 Overview of Inexact CG Algorithm
10 L-BFGS Algorithm

Part I
Fundamentals of Optimization

1 Overview of Numerical Optimization

1.1 Problem and Classification

1. Problem:

   arg min_{z ∈ R^n} f(z)  subject to  c_i(z) = 0 for i ∈ E,  c_i(z) ≥ 0 for i ∈ I

(a) f : R^n → R is known as the objective function
(b) E are equality constraints
(c) I are inequality constraints

2. Classifications

(a) Unconstrained vs. Constrained. If E ∪ I = ∅ then it is an unconstrained problem.
(b) Linear vs. Nonlinear. When f(z) and the c_i(z) are all linear functions of z, this is a linear programming problem. Otherwise, it is a nonlinear programming problem.
(c) Note: This is not the same as having linear or nonlinear equations.

1.2 Theoretical Properties of Solutions

1. Global Minimizer. Weak Local Minimizer (Local Minimizer). Strict Local Minimizer. Isolated Local Minimizer.

Definition 1.1. Let f : R^n → R.
(a) x* is a global minimizer of f if f(x*) ≤ f(x) for all x ∈ R^n.
(b) x* is a (weak) local minimizer of f if for some neighborhood N(x*) of x*, f(x*) ≤ f(x) for all x ∈ N(x*).
(c) x* is a strict local minimizer of f if for some neighborhood N(x*), f(x*) < f(x) for all x ∈ N(x*) with x ≠ x*.
(d) x* is an isolated local minimizer of f if for some neighborhood N(x*), there are no other local minimizers of f in N(x*).

Note 1.1. Every isolated local minimizer is a strict local minimizer, but it is not true that a strict local minimizer is always an isolated local minimizer.

2. Taylor’s Theorem

Theorem 1.1. Suppose f : R^n → R is continuously differentiable. Let p ∈ R^n. Then for some t ∈ [0, 1]:

f(x + p) = f(x) + ∇f(x + tp)^T p   (1)

If f is twice continuously differentiable, then:

∇f(x + p) = ∇f(x) + ∫_0^1 ∇²f(x + tp) p dt   (2)

And for some t ∈ [0, 1]:

f(x + p) = f(x) + ∇f(x)^T p + (1/2) p^T ∇²f(x + tp) p   (3)

3. Necessary Conditions

(a) First Order Necessary Conditions

Lemma 1.1. Let f be continuously differentiable. If x* is a minimizer of f then ∇f(x*) = 0.

Proof. Suppose ∇f(x*) ≠ 0. Let p = −∇f(x*). Then p^T ∇f(x*) = −‖∇f(x*)‖² < 0, and by continuity there is some T > 0 such that p^T ∇f(x* + tp) < 0 for all t ∈ [0, T). We now apply Equation 1 of Taylor's Theorem to get a contradiction.

Let t ∈ (0, T) and q = tp. Then ∃ t′ ∈ [0, 1] such that:

f(x* + q) = f(x*) + t ∇f(x* + t′tp)^T p < f(x*)

contradicting that x* is a minimizer.

(b) Second Order Necessary Conditions

Lemma 1.2. Suppose f is twice continuously differentiable. If x* is a minimizer of f then ∇f(x*) = 0 and ∇²f(x*) ⪰ 0 (i.e. positive semi-definite).

Proof. The first conclusion follows from the first order necessary condition. We prove the second by contradiction and Taylor's Theorem. Suppose ∇²f(x*) is not positive semi-definite. Then there is a p such that:

p^T ∇²f(x*) p < 0

By continuity, there is a T > 0 such that p^T ∇²f(x* + tp) p < 0 for t ∈ [0, T]. Fix t ∈ (0, T] and let q = tp. Then, by Equation 3 from Taylor's Theorem and ∇f(x*) = 0, ∃ t′ ∈ [0, 1] such that:

f(x* + q) = f(x*) + (1/2) t² p^T ∇²f(x* + t′tp) p < f(x*)

contradicting that x* is a minimizer.

4. Second Order Sufficient Condition

Lemma 1.3. Let x* ∈ R^n. Suppose ∇²f is continuous, ∇²f(x*) ≻ 0, and ∇f(x*) = 0. Then x* is a strict local minimizer.

Proof. By continuity, there is a ball B of radius T about x* on which ∇²f(x) ≻ 0. Let ‖p‖ < T with p ≠ 0. Then, by Taylor's Theorem (using ∇f(x*) = 0), there exists a t ∈ [0, 1] such that:

f(x* + p) = f(x*) + (1/2) p^T ∇²f(x* + tp) p > f(x*)

5. Convexity and Local Minimizers Lemma 1.4. When f is convex any local minimizer x∗ is a global min- imizer. If, in addition, f is differentiable then any stationary point is a global minimizer.

Proof. Suppose x* is not a global minimizer, and let z be the global minimizer, so f(x*) > f(z). By convexity, for any λ ∈ (0, 1]:

f(x*) > λf(z) + (1 − λ)f(x*) ≥ f(λz + (1 − λ)x*)

So as λ → 0, we see that every neighborhood of x* contains a point w with f(w) < f(x*). A contradiction.

For the second part, suppose x* is a stationary point and z is as above. Then:

∇f(x*)^T (z − x*) = lim_{λ↓0} [f(x* + λ(z − x*)) − f(x*)] / λ
                  ≤ lim_{λ↓0} [λf(z) + (1 − λ)f(x*) − f(x*)] / λ
                  = f(z) − f(x*) < 0

which contradicts ∇f(x*) = 0.

1.3 Algorithm Overview

1. In general, algorithms begin with a seed point, x0, and locally search for decreases in the objective function, producing iterates xk, until stopping conditions are met.

2. Algorithms typically generate a local model for f near a point x_k:

f(x_k + p) ≈ m_k(p) = f_k + p^T ∇f_k + (1/2) p^T B_k p

3. Different choices of Bk will lead to different methods with different prop- erties:

(a) B_k = 0 leads to steepest descent methods
(b) Letting B_k be the closest positive definite approximation to ∇²f_k leads to Newton's methods
(c) Iterative approximations to the Hessian given by B_k lead to quasi-Newton methods
(d) Conjugate Gradient methods update p without explicitly computing B_k
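To make the model concrete, here is a small Python sketch (the helper name `quadratic_model` and the example objective are mine, not from the notes): it evaluates m_k(p) for B_k = 0 (the linear model behind steepest descent) and for B_k equal to the exact Hessian (Newton's model) on f(x) = x1² + 2x2² at x_k = (1, 1).

```python
import numpy as np

def quadratic_model(f_k, grad_k, B_k):
    """Return m_k(p) = f_k + p^T grad_k + 0.5 p^T B_k p as a callable."""
    return lambda p: f_k + grad_k @ p + 0.5 * p @ B_k @ p

# f(x) = x1^2 + 2 x2^2 at x_k = (1, 1): f_k = 3, grad = (2, 4), Hessian = diag(2, 4)
f_k = 3.0
grad_k = np.array([2.0, 4.0])

# B_k = 0 recovers the linear (steepest descent) model;
# B_k = exact Hessian gives Newton's model, which is exact for a quadratic f.
m_sd = quadratic_model(f_k, grad_k, np.zeros((2, 2)))
m_newton = quadratic_model(f_k, grad_k, np.diag([2.0, 4.0]))

p = np.array([-1.0, -1.0])  # the step back to the true minimizer (0, 0)
```

Since f is itself quadratic here, Newton's model reproduces f exactly: m_newton(p) equals f(0, 0) = 0, while the linear model extrapolates past it.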

1.4 Fundamental Definitions and Results

1. Q-convergence.

Definition 1.2. Let x_k → x in R^n.

(a) x_k converges Q-linearly if ∃ q ∈ (0, 1) such that for sufficiently large k:

‖x_{k+1} − x‖ ≤ q ‖x_k − x‖

(b) x_k converges Q-superlinearly if:

lim_{k→∞} ‖x_{k+1} − x‖ / ‖x_k − x‖ = 0

(c) x_k converges Q-quadratically if ∃ q > 0 such that for sufficiently large k:

‖x_{k+1} − x‖ ≤ q ‖x_k − x‖²

2. R-convergence.

Definition 1.3. Let x_k → x in R^n.

(a) x_k converges R-linearly if there is a sequence (v_k) ⊂ R converging Q-linearly to 0 such that ‖x_k − x‖ ≤ v_k
(b) x_k converges R-superlinearly if there is a sequence (v_k) converging Q-superlinearly to 0 such that ‖x_k − x‖ ≤ v_k
(c) x_k converges R-quadratically if there is a sequence (v_k) converging Q-quadratically to 0 such that ‖x_k − x‖ ≤ v_k

3. Sherman-Morrison-Woodbury

Lemma 1.5. If A and Ã are non-singular and related by Ã = A + ab^T, then:

Ã⁻¹ = A⁻¹ − (A⁻¹ a b^T A⁻¹) / (1 + b^T A⁻¹ a)

Proof. We can check this by simply plugging in the formula. However, to "derive" it, expand (I + A⁻¹ab^T)⁻¹ as a geometric (Neumann) series:

Ã⁻¹ = (A + ab^T)⁻¹
    = [A(I + A⁻¹ab^T)]⁻¹
    = (I + A⁻¹ab^T)⁻¹ A⁻¹
    = [I − A⁻¹ab^T + A⁻¹ab^T A⁻¹ab^T − ⋯] A⁻¹
    = [I − A⁻¹a(1 − b^T A⁻¹a + ⋯)b^T] A⁻¹
    = A⁻¹ − (A⁻¹ a b^T A⁻¹) / (1 + b^T A⁻¹ a)
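As a quick numerical sanity check of the formula (a Python/NumPy sketch; the random test problem is my own choice, with A shifted to be well-conditioned so that both A and Ã are invertible and the denominator is nonzero):

```python
import numpy as np

# Numerical check of Sherman-Morrison:
# (A + a b^T)^{-1} = A^{-1} - (A^{-1} a b^T A^{-1}) / (1 + b^T A^{-1} a)
rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned test matrix
a = rng.standard_normal(n)
b = rng.standard_normal(n)

A_inv = np.linalg.inv(A)
denom = 1.0 + b @ A_inv @ a                       # must be nonzero
update = np.outer(A_inv @ a, b @ A_inv) / denom   # (A^{-1} a)(b^T A^{-1}) / denom
A_tilde_inv = A_inv - update

direct = np.linalg.inv(A + np.outer(a, b))        # direct inverse for comparison
```

The rank-one update costs O(n²) once A⁻¹ is known, versus O(n³) for re-inverting from scratch; this is the workhorse identity behind the quasi-Newton updates later in the notes.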

2 Line Search Methods

2.1 Step Length

1. Descent Direction

Definition 2.1. p ∈ R^n is a descent direction if p^T ∇f_k < 0.

2. Problem: Find α_k ∈ arg min φ(α) where φ(α) = f(x_k + α p_k). However, it is too costly to find the exact minimizer of φ. Instead, we find an α which acceptably reduces φ.

3. Wolfe Conditions

(a) Armijo Condition. Curvature Condition. Wolfe Condition. Strong Wolfe Condition.

Definition 2.2. Fix x and a descent direction p. Let φ(α) = f(x + αp), so that φ′(α) = ∇f(x + αp)^T p. Fix 0 < c1 < c2 < 1.
  i. α satisfies the Armijo Condition if φ(α) ≤ φ(0) + α c1 φ′(0). Write l(α) := φ(0) + α c1 φ′(0).
  ii. α satisfies the Curvature Condition if φ′(α) ≥ c2 φ′(0).
  iii. α satisfies the Strong Curvature Condition if |φ′(α)| ≤ c2 |φ′(0)|.
  iv. The Wolfe Conditions are the Armijo Condition together with the Curvature Condition.
  v. The Strong Wolfe Conditions are the Armijo Condition together with the Strong Curvature Condition.

(b) The Armijo Condition guarantees a decrease at the next iterate, since it ensures φ(α) ≤ l(α) < φ(0).

(c) Curvature Condition
  i. If φ′(α) < c2 φ′(0), then φ is still decreasing steeply at α, so we can improve the reduction in f by taking a larger α.
  ii. If φ′(α) ≥ c2 φ′(0), then either we are closer to a minimum (where φ′ = 0) or φ′ > 0, which means we have passed the minimum.
  iii. The strong condition guarantees that the choice of α is closer to a point where φ′ = 0.

(d) Existence of Steps satisfying the Wolfe and Strong Wolfe Conditions

Lemma 2.1. Suppose f : R^n → R is continuously differentiable. Let p be a descent direction at x and assume that f is bounded from below along the ray {x + αp : α > 0}. If 0 < c1 < c2 < 1, there exists an interval of step lengths satisfying the Wolfe and Strong Wolfe Conditions.

Proof.
i. Satisfying the Armijo Condition: Let l(α) = f(x) + c1 α p^T ∇f(x). Since l(α) is decreasing and φ(α) is bounded from below, eventually φ(α) = l(α). Let α_A > 0 be the first α at which this intersection occurs. Then φ(α) < l(α) for α ∈ (0, α_A).

ii. Curvature Condition: By the mean value theorem, ∃ β ∈ (0, α_A) such that (since l(α_A) = φ(α_A)):

α_A φ′(β) = φ(α_A) − φ(0) = c1 α_A φ′(0) > c2 α_A φ′(0)

so φ′(β) > c2 φ′(0). And since φ′ is continuous, there is an interval containing β on which this inequality holds.

iii. Strong Curvature Condition: Since φ′(β) = c1 φ′(0) < 0 and c1 < c2, it follows that:

|φ′(β)| < c2 |φ′(0)|

4. Goldstein Condition

(a) Goldstein Condition

Definition 2.3. Take c ∈ (0, 1/2). The Goldstein Conditions are satisfied by α if:

φ(0) + (1 − c) α φ′(0) ≤ φ(α) ≤ φ(0) + c α φ′(0)

(b) Existence of Step satisfying Goldstein Condition

Lemma 2.2. Suppose f : R^n → R is continuously differentiable. Let p be a descent direction at x and assume that f is bounded from below along the ray {x + αp : α > 0}. Let c ∈ (0, 1/2). Then there exists an interval of step lengths satisfying the Goldstein condition.

Proof. Let l(α) = φ(0) + (1 − c) α φ′(0) and u(α) = φ(0) + c α φ′(0). For α > 0, l(α) < u(α) since c ∈ (0, 1/2). Since φ is bounded from below, let α_u be the smallest point of intersection between u(α) and φ(α) for α > 0, and let α_l be the largest point of intersection, less than α_u, of l(α) and φ(α). Then, for α ∈ (α_l, α_u), l(α) < φ(α) < u(α).

5. Backtracking

(a) Backtracking

Definition 2.4. Let ρ ∈ (0, 1) and ᾱ > 0. Backtracking checks the sequence ᾱ, ρᾱ, ρ²ᾱ, ... until an α is found satisfying the Armijo condition.

(b) Existence of Step satisfying Backtracking

Lemma 2.3. Suppose f : R^n → R is continuously differentiable and let p be a descent direction at x. Then the backtracking algorithm produces a step satisfying the Armijo condition.

Proof. By continuity, ∃ ε > 0 such that φ(α) < φ(0) + c α φ′(0) for 0 < α < ε. Since ρ^n ᾱ → 0, there is an n such that ρ^n ᾱ < ε.

2.2 Step Length Selection Algorithms

1. Algorithms which implement the Wolfe or Goldstein Conditions exist and are ideal, but are not discussed herein.

2. Backtracking Algorithm

Algorithm 1: Backtracking Algorithm
  input: reduction rate ρ ∈ (0, 1), initial estimate ᾱ > 0, c ∈ (0, 1), φ
  α ← ᾱ
  while φ(α) > φ(0) + α c φ′(0) do
    α ← ρα
  end
  return α
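A minimal Python implementation of Algorithm 1 (the function name `backtracking` and the worked example are mine; the example uses f(x) = x² at x = 1 with descent direction p = −f′(1) = −2, so φ(α) = (1 − 2α)²):

```python
def backtracking(phi, dphi0, alpha_bar=1.0, rho=0.5, c=1e-4):
    """Shrink alpha geometrically until the Armijo condition holds.

    phi(alpha) = f(x + alpha*p); dphi0 = phi'(0) < 0 for a descent direction.
    """
    alpha = alpha_bar
    while phi(alpha) > phi(0.0) + c * alpha * dphi0:
        alpha *= rho
    return alpha

# phi(alpha) = (1 - 2 alpha)^2, phi'(0) = -4
phi = lambda a: (1.0 - 2.0 * a) ** 2
alpha = backtracking(phi, dphi0=-4.0)
```

Here ᾱ = 1 is rejected (φ(1) = 1 is not a sufficient decrease from φ(0) = 1), and α = 1/2 — which happens to be the exact minimizer of φ — is accepted.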

3. Interpolation with Expensive First Derivatives

(a) Suppose the interpolation technique produces a sequence of guesses α_0, α_1, ....

(b) Interpolation produces α_k by modeling φ with a polynomial m (quadratic or cubic) and then minimizing m to find a new estimate for α. The parameters of the polynomial are determined by requiring:
  i. m(0) = φ(0)
  ii. m′(0) = φ′(0)
  iii. m(α_{k−1}) = φ(α_{k−1})
  iv. m(α_{k−2}) = φ(α_{k−2})

(c) When k = 1, we model using a quadratic polynomial (only conditions i-iii apply).

4. Interpolation with Inexpensive First Derivatives

(a) Suppose interpolation produces a sequence of guesses α_0, α_1, ....

(b) In this case, m is always a cubic polynomial, and α_k is computed by requiring:
  i. m(α_{k−1}) = φ(α_{k−1})
  ii. m(α_{k−2}) = φ(α_{k−2})
  iii. m′(α_{k−1}) = φ′(α_{k−1})
  iv. m′(α_{k−2}) = φ′(α_{k−2})

5. Interpolation Algorithm

Algorithm 2: Interpolation Algorithm
  input: feasible search region [ā, b̄], initial estimate α_0 > 0, φ
  α ← 0, β ← α_0
  while φ(β) > φ(0) + c β φ′(0) do
    Fit the polynomial m
    if inexpensive derivatives then
      α ← β
    end
    Explicitly compute the minimizer of m in the feasible region; store as β
  end
  return β
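For the k = 1 quadratic case, the interpolation step has a closed form: the quadratic m with m(0) = φ(0), m′(0) = φ′(0), and m(α_0) = φ(α_0) is minimized at α_1 = −φ′(0) α_0² / (2[φ(α_0) − φ(0) − φ′(0) α_0]). A Python sketch (the function name is mine):

```python
def quad_interp_step(phi0, dphi0, alpha0, phi_a0):
    """Minimizer of the quadratic m with m(0)=phi0, m'(0)=dphi0, m(alpha0)=phi_a0."""
    return -dphi0 * alpha0 ** 2 / (2.0 * (phi_a0 - phi0 - dphi0 * alpha0))

# Example: phi(alpha) = (1 - alpha)^2, so phi(0) = 1, phi'(0) = -2, phi(2) = 1.
# Since phi is itself quadratic, one interpolation step lands on the exact
# minimizer alpha = 1.
alpha1 = quad_interp_step(phi0=1.0, dphi0=-2.0, alpha0=2.0, phi_a0=1.0)
```

The denominator is positive precisely when the trial α_0 failed the Armijo test, which is why the formula is safe to use inside Algorithm 2.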

2.3 Global Convergence and Zoutendjik

1. Globally Convergent. Zoutendjik Condition.

Definition 2.5. Suppose we have an algorithm Ω which produces iterates (x_k), and denote ∇f(x_k) = ∇f_k.

(a) Ω is globally convergent if ‖∇f_k‖ → 0. (That is, the x_k approach stationarity.)

(b) Suppose Ω produces search directions p_k with ‖p_k‖ = 1, and let θ_k be the angle between p_k and −∇f_k. Ω satisfies the Zoutendjik Condition if:

Σ_{k=1}^∞ cos²(θ_k) ‖∇f_k‖² < ∞

2. Zoutendjik Condition & Angle Bound implies global convergence

Lemma 2.4. Suppose Ω produces a sequence (x_k, p_k, ∇f_k, θ_k) such that ∃ δ > 0 with cos(θ_k) ≥ δ for all k ≥ 1. If Ω satisfies the Zoutendjik condition, then Ω is globally convergent.

Proof. By the Zoutendjik condition:

δ² Σ_{k=1}^∞ ‖∇f_k‖² ≤ Σ_{k=1}^∞ cos²(θ_k) ‖∇f_k‖² < ∞

Therefore ‖∇f_k‖² → 0.

3. Example of Angle Bound

Example 2.1. Suppose p_k = −B_k⁻¹ ∇f_k where the B_k are positive definite with ‖B_k‖ ‖B_k⁻¹‖ ≤ M < ∞ for all k. Take ‖·‖ = ‖·‖₂, let B_k = XΛX^T be the eigenvalue decomposition of B_k with eigenvalues λ_1 ≥ ⋯ ≥ λ_n > 0, and set z = X^T ∇f_k. Then:

cos(θ_k) = −∇f_k^T p_k / (‖∇f_k‖ ‖p_k‖)
         = ∇f_k^T B_k⁻¹ ∇f_k / (‖∇f_k‖ ‖B_k⁻¹ ∇f_k‖)
         = z^T Λ⁻¹ z / (‖z‖ ‖Λ⁻¹ z‖)
         = (Σ_i z_i²/λ_i) / (‖z‖ √(Σ_i z_i²/λ_i²))
         ≥ (‖z‖²/λ_1) / (‖z‖²/λ_n)
         = λ_n/λ_1
         ≥ 1/M

Hence, if Ω satisfies the Zoutendjik Condition, Lemma 2.4 gives ‖∇f_k‖ → 0.

4. Wolfe Condition Line Search satisfies Zoutendjik Condition

Theorem 2.1. Suppose we have an objective f satisfying:

(a) f is bounded from below in R^n
(b) Given an initial x_0, there is an open set N containing the level set L = {x : f(x) ≤ f(x_0)}
(c) f is continuously differentiable on N
(d) ∇f is Lipschitz continuous on N

Suppose we have an algorithm Ω producing (x_k, p_k, ∇f_k, θ_k, α_k) such that:

(a) p_k is a descent direction (with ‖p_k‖ = 1)
(b) α_k satisfies the Wolfe Conditions

Then Ω satisfies the Zoutendjik Condition.

Proof. The general strategy is to lower bound α_k uniformly by C |∇f_k^T p_k| with C > 0. Then, using the Armijo condition:

f_{k+1} ≤ f_k − c1 α_k |∇f_k^T p_k|
        ≤ f_k − c1 C |∇f_k^T p_k|²
        = f_k − c1 C cos²(θ_k) ‖∇f_k‖²

and so, iterating:

f_{k+1} ≤ f_0 − c1 C Σ_{j=1}^k cos²(θ_j) ‖∇f_j‖²

Since f is bounded from below by some l, f_0 − f_{k+1} ≤ f_0 − l < ∞, so the series converges and the result follows.

So now we set out to show that α_k ≥ C |∇f_k^T p_k|, leveraging the Curvature Condition.

(a) From the curvature condition:

(∇f_{k+1} − ∇f_k)^T p_k ≥ (c2 − 1) ∇f_k^T p_k = (1 − c2) |∇f_k^T p_k|

(b) From Lipschitz continuity, with constant L:

(∇f_{k+1} − ∇f_k)^T p_k ≤ ‖∇f_{k+1} − ∇f_k‖ ‖p_k‖ ≤ L α_k ‖p_k‖² = L α_k

Together these imply α_k ≥ ((1 − c2)/L) |∇f_k^T p_k|.

5. Goldstein Condition Line Search satisfies Zoutendjik Condition

Theorem 2.2. Suppose f has the same properties as in Theorem 2.1, and suppose Ω produces (x_k, p_k, ∇f_k, θ_k, α_k) such that:

(a) p_k is a descent direction with ‖p_k‖ = 1
(b) α_k satisfies the Goldstein conditions

Then Ω satisfies the Zoutendjik Condition.

Proof. We use Taylor's Theorem and then Lipschitz continuity to find the lower bound. By Taylor's Theorem and the first Goldstein inequality, for some t_k ∈ (0, 1):

(1 − c) α_k ∇f_k^T p_k ≤ f_{k+1} − f_k = α_k ∇f(x_k + t_k α_k p_k)^T p_k

Dividing by α_k and subtracting ∇f_k^T p_k from both sides:

c |∇f_k^T p_k| ≤ (∇f(x_k + t_k α_k p_k) − ∇f_k)^T p_k ≤ L t_k α_k ‖p_k‖² ≤ L α_k

so α_k ≥ (c/L) |∇f_k^T p_k|, and the result follows as in Theorem 2.1.

6. Backtracking Line Search satisfies Zoutendjik Condition

Theorem 2.3. Suppose f has the same properties as in Theorem 2.1, and suppose Ω produces (x_k, p_k, ∇f_k, θ_k, α_k) such that:

(a) p_k is a descent direction with ‖p_k‖ = 1
(b) α_k is selected by backtracking with ᾱ = 1

Then Ω satisfies the Zoutendjik condition.

Proof. Suppose Ω uses ρ ∈ (0, 1). There are two cases to consider: α_k = 1 and α_k < 1. Denote the indices of the first subsequence by k(j) and of the second by k(l).

For α_{k(l)} < 1 we know at least that α_{k(l)}/ρ was rejected, that is:

f(x_k + (α_{k(l)}/ρ) p_k) > f_k + c (α_{k(l)}/ρ) ∇f_k^T p_k

Using the same strategy as in the Goldstein case, ∃ t_k such that:

∇f(x_k + t_k (α_{k(l)}/ρ) p_k)^T p_k > c ∇f_k^T p_k

Therefore, by Lipschitz continuity:

(1 − c) |∇f_k^T p_k| ≤ L t_k α_{k(l)}/ρ ≤ L α_{k(l)}/ρ

so α_{k(l)} ≥ (ρ(1 − c)/L) |∇f_k^T p_k|, and these indices are handled as in Theorem 2.1.

Now consider α_{k(j)} = 1. The Armijo condition gives:

f_{k(j+1)} ≤ f_{k(j)+1} ≤ f_{k(j)} + c ∇f_{k(j)}^T p_{k(j)} = f_{k(j)} − c |cos(θ_{k(j)})| ‖∇f_{k(j)}‖

Summing over j and using that f is bounded from below:

c Σ_j |cos(θ_{k(j)})| ‖∇f_{k(j)}‖ < ∞

Then ∃ J such that for j ≥ J:

cos²(θ_{k(j)}) ‖∇f_{k(j)}‖² ≤ |cos(θ_{k(j)})| ‖∇f_{k(j)}‖ < 1

so the squared terms along this subsequence are also summable. Combining the two cases, the Zoutendjik condition holds.

3 Trust Region Methods

3.1 Trust Region Subproblem and Fidelity

1. The strategy of trust region methods is as follows: at any point x we have a model m_x(p) of f(x + p) which we trust to be a good approximation in a region ‖p‖ ≤ ∆. Minimizing this model gives a new iterate, on which we continue applying the approach.

2. Trust Region Subproblem

Definition 3.1. Let x ∈ R^n and f : R^n → R be an objective function with model m_x(p) in a region ‖p‖ ≤ ∆. The Trust Region Subproblem is to find:

arg min {m_x(p) : ‖p‖ ≤ ∆}

Note 3.1. Usually m_x is taken to be

m_x(p) = f(x) + ∇f(x)^T p + (1/2) p^T B p

where B is the model Hessian and is not necessarily positive definite.

3. Fidelity

Definition 3.2. The fidelity of m_x in a region ‖p‖ ≤ ∆ at the point x + p is:

ρ(m_x, f, p, ∆) = [f(x + p) − f(x)] / [m_x(p) − m_x(0)]

4. Fidelity and Trust Region. Let 0 < c1 < c2 < 1.

(a) If ρ ≥ c2 (especially when p is intended for descent), the model is a good approximation to f and we can expand the trust region ∆.
(b) If ρ ≤ c1 (especially when p is intended for descent), the model is a poor approximation to f and we should shrink the trust region ∆.
(c) Otherwise, the model is a sufficient approximation to f and the trust region should not be adjusted.

5. Fidelity and Iterates. Suppose p causes a descent in m_x, and let η ∈ (0, 1/4].

(a) If ρ < η then we do not have a sufficient decrease in f, and the process should be repeated at x.
(b) If ρ ≥ η then we have a sufficient decrease in f, and we should continue the process from x + p.

6. Notation

(a) Let x_0, x_1, ... be the iterates produced by successive solutions of the trust region subproblem for some objective f.
(b) m_{x_k}(p) =: m_k(p)
(c) ∆ for a particular m_k is denoted ∆_k
(d) Accepted solutions to the subproblem at x_k are p_k, so that x_{k+1} = x_k + p_k.

3.2 Fidelity Algorithms

1. Fidelity and Trust Region

Algorithm 3: Trust Region Management Algorithm
  input: thresholds 0 < c1 < c2 < 1, trust region ∆, maximum radius ∆_max, fidelity ρ
  if ρ > c2 then
    ∆ ← min(2∆, ∆_max)
  else if ρ < c1 then
    ∆ ← ∆/2
  else
    Continue
  end
  return ∆

Note 3.2. Usually c1 = 1/4 and c2 = 3/4.

2. Fidelity and Solution Acceptance

Algorithm 4: Solution Acceptance Algorithm
  input: threshold η ∈ (0, 1/4], fidelity ρ, point x, step p
  if ρ ≥ η then
    x ← x + p
  else
    Continue
  end
  return x
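Algorithms 3 and 4 can be sketched directly in Python (the function names are mine; this uses the usual thresholds c1 = 1/4, c2 = 3/4 and the doubling/halving rules above):

```python
def update_radius(rho, delta, delta_max, c1=0.25, c2=0.75):
    """Trust region management: expand on high fidelity, shrink on low
    fidelity, otherwise leave the radius alone."""
    if rho > c2:
        return min(2.0 * delta, delta_max)
    if rho < c1:
        return delta / 2.0
    return delta

def accept_step(rho, x, p, eta=0.25):
    """Solution acceptance: take the step only if the decrease is sufficient."""
    return x + p if rho >= eta else x
```

Note the two thresholds play different roles: c1/c2 only resize the region, while η decides whether the iterate moves at all; a step can be accepted (ρ ≥ η) even while the region shrinks (ρ < c1) when η < c1.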

3.3 Approximate Solutions to Subproblem

3.3.1 Cauchy Point

1. Cauchy Point

Definition 3.3. Let m_x be a model of f within ‖p‖ ≤ ∆. Let p^S be the direction of steepest descent of m_x at p = 0, normalized so that ‖p^S‖ = 1, and let:

τ = arg min {m_x(τ p^S) : 0 ≤ τ ≤ ∆}

Then p^C = τ p^S is the Cauchy Point of m_x.

Note 3.3. The Cauchy point is the line search minimizer of m_x along the direction of steepest descent, subject to ‖p‖ ≤ ∆.

2. Computing the Cauchy Point

Lemma 3.1. Let m_x be a quadratic model of f within ‖p‖ ≤ ∆ such that:

m_x(p) = f(x) + g^T p + (1/2) p^T B p

Then its Cauchy Point is:

p^C = −(g/‖g‖) min(∆, ‖g‖³/(g^T B g))   if g^T B g > 0
p^C = −(g/‖g‖) ∆                        if g^T B g ≤ 0

Proof. First, we compute the direction of steepest descent:

p^S = −g/‖g‖

Then m_x(t p^S) = f(x) + t g^T p^S + (t²/2)(p^S)^T B p^S. If the quadratic term is non-positive, the model decreases along the whole ray, so t = ∆ is the minimizer and one option is:

p^C = −∆ g/‖g‖

If the quadratic term is positive, then setting

(d/dt) m_x(t p^S) = g^T p^S + t (p^S)^T B p^S = 0

and substituting p^S = −g/‖g‖:

t = ‖g‖³/(g^T B g)

This value could be larger than ∆, so to safeguard against this, in that case:

τ = min(∆, ‖g‖³/(g^T B g))

3. Using the Cauchy point does not produce the best rate of convergence
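Lemma 3.1 translates into a few lines of Python (a sketch; the function name `cauchy_point` is mine):

```python
import numpy as np

def cauchy_point(g, B, delta):
    """Cauchy point of m(p) = f + g^T p + 0.5 p^T B p subject to ||p|| <= delta."""
    gBg = g @ B @ g
    if gBg <= 0:
        # model is non-convex along -g: go all the way to the boundary
        tau = delta
    else:
        # unconstrained line minimizer ||g||^3 / (g^T B g), capped at delta
        tau = min(delta, np.linalg.norm(g) ** 3 / gBg)
    return -tau * g / np.linalg.norm(g)

# With B = I and g = (1, 0), the line minimizer sits at distance 1 along -g,
# inside a trust region of radius 2:
p_c = cauchy_point(np.array([1.0, 0.0]), np.eye(2), delta=2.0)
```

Shrinking the radius below 1 makes the boundary active, and negative curvature along −g always pushes the step to the boundary.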

3.3.2 Dogleg Method

1. Requirement: B ≻ 0

2. Intuitively, the dogleg method moves toward the unconstrained minimizer of m_x, p^min = −B⁻¹g, until it reaches this point or hits the boundary of the trust region. It does this by first moving to the Cauchy point, and then along a straight path toward p^min.

3. Dogleg Path

Definition 3.4. Let m_x be the quadratic model of f with B ≻ 0 and radius ∆. Let p^C be the Cauchy point and p^min the unconstrained minimizer of m_x. The Dogleg Path is p^DL(t) where:

p^DL(t) = t p^C                        t ∈ [0, 1]
p^DL(t) = p^C + (t − 1)(p^min − p^C)   t ∈ [1, 2]

4. Dogleg Method

Definition 3.5. The Dogleg Method chooses x_{k+1} = x_k + p^DL(τ), where τ is the largest value in [0, 2] with ‖p^DL(τ)‖ ≤ ∆; in particular ‖p^DL(τ)‖ = ∆ whenever the path exits the trust region.

5. Properties of the Dogleg Path

Lemma 3.2. Suppose p^DL(t) is the Dogleg path of m_x with ∆ = ∞ and B ≻ 0. Then:

(a) m_x(p^DL(t)) is decreasing in t
(b) ‖p^DL(t)‖ is increasing in t

Proof. The case 0 < t < 1 is immediate: along the steepest descent segment m_x(p^DL(t)) is decreasing up to the Cauchy point, and ‖p^DL(t)‖ = t‖p^C‖ is increasing. So consider 1 < t < 2 and reparameterize using z = t − 1. We have (using g = −B p^min):

(d/dz) m_x(p^DL(1 + z)) = (p^min − p^C)^T (g + B p^C) + z (p^min − p^C)^T B (p^min − p^C)
                        = −(1 − z)(p^min − p^C)^T B (p^min − p^C)

Letting c = (p^min − p^C)^T B (p^min − p^C) > 0 (since B ≻ 0), the derivative is strictly negative for z < 1.

To show that the norm is increasing, first we show that the angle between p^C and p^min lies in (−π/2, π/2), then that p^min is the longer vector.

(a) For the angle, with g ≠ 0:

⟨p^min, p^C⟩ = (‖g‖²/(g^T B g)) g^T B⁻¹ g > 0

(b) ‖p^min‖ ≥ ‖p^C‖, since by Cauchy-Schwarz:

‖g‖⁴ = (g^T g)² ≤ (g^T B g)(g^T B⁻¹ g) ≤ (g^T B g) ‖g‖ ‖B⁻¹ g‖

so ‖p^min‖ = ‖B⁻¹g‖ ≥ ‖g‖³/(g^T B g) = ‖p^C‖.
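Putting Definitions 3.4-3.5 together, a sketch of the dogleg step in Python (the function name is mine; the boundary crossing on the second segment is found by solving the scalar quadratic ‖p^C + t(p^min − p^C)‖² = ∆²):

```python
import numpy as np

def dogleg_step(g, B, delta):
    """Dogleg step for m(p) = g^T p + 0.5 p^T B p, ||p|| <= delta, with B > 0."""
    p_min = -np.linalg.solve(B, g)             # unconstrained minimizer
    if np.linalg.norm(p_min) <= delta:
        return p_min                           # whole path fits: take p_min
    p_c = -(g @ g) / (g @ B @ g) * g           # unconstrained Cauchy step
    if np.linalg.norm(p_c) >= delta:
        return -delta * g / np.linalg.norm(g)  # boundary hit on first segment
    # boundary hit on second segment: solve ||p_c + t d||^2 = delta^2, t in [0,1]
    d = p_min - p_c
    a, b, c = d @ d, 2.0 * p_c @ d, p_c @ p_c - delta ** 2
    t = (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return p_c + t * d

p = dogleg_step(np.array([1.0, 0.0]), np.diag([1.0, 2.0]), delta=0.5)
```

Lemma 3.2 is exactly what makes the third branch well-posed: the norm is increasing along the path, so the quadratic has a unique root with t ∈ [0, 1].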

3.3.3 Global Convergence of Cauchy Point Methods

1. Reduction obtained by the Cauchy Point

Lemma 3.3. The Cauchy point p_k^C satisfies:

m_k(p_k^C) − m_k(0) ≤ −(1/2) ‖g_k‖ min(∆_k, ‖g_k‖/‖B_k‖)

Proof. We have:

m_k(p_k^C) − m_k(0) = g_k^T p_k^C + (1/2)(p_k^C)^T B_k p_k^C

This brings in three cases (dropping subscripts). When g^T B g ≤ 0:

m(p^C) − m(0) = −‖g‖∆ + (∆²/(2‖g‖²)) g^T B g
             ≤ −‖g‖∆
             ≤ −(1/2) ‖g‖ min(∆, ‖g‖/‖B‖)

When g^T B g > 0 and ∆(g^T B g) ≤ ‖g‖³, the step is again τ = ∆, and:

m(p^C) − m(0) = −‖g‖∆ + (∆²/(2‖g‖²)) g^T B g
             ≤ −‖g‖∆ + (1/2) ‖g‖∆
             = −(1/2) ‖g‖∆
             ≤ −(1/2) ‖g‖ min(∆, ‖g‖/‖B‖)

When ∆(g^T B g) > ‖g‖³, τ = ‖g‖³/(g^T B g) and:

m(p^C) − m(0) = −‖g‖⁴/(g^T B g) + ‖g‖⁴/(2 g^T B g)
             = −‖g‖⁴/(2 g^T B g)
             ≤ −(1/2) ‖g‖²/‖B‖
             ≤ −(1/2) ‖g‖ min(∆, ‖g‖/‖B‖)

2. Reduction achieved by Dogleg

Corollary 3.1. The dogleg step p_k^DL satisfies:

m_k(p_k^DL) − m_k(0) ≤ −(1/2) ‖g_k‖ min(∆_k, ‖g_k‖/‖B_k‖)

Proof. Since m_k(p_k^DL) ≤ m_k(p_k^C), the result follows from Lemma 3.3.

3. Reduction achieved by any method based on the Cauchy Point

Corollary 3.2. Suppose a method uses steps p_k where m_k(p_k) − m_k(0) ≤ c(m_k(p_k^C) − m_k(0)) for some c ∈ (0, 1). Then p_k satisfies:

m_k(p_k) − m_k(0) ≤ −(c/2) ‖g_k‖ min(∆_k, ‖g_k‖/‖B_k‖)

4. Convergence when η = 0

Theorem 3.1. Let x_0 ∈ R^n and f be an objective function. Let S = {x : f(x) ≤ f(x_0)} and let S(R_0) = {x : ‖x − y‖ < R_0 for some y ∈ S} be a neighborhood of radius R_0 about S. Suppose the following conditions:

(a) f is bounded from below on S

(b) f is Lipschitz continuously differentiable on S(R_0) for some R_0 > 0 and some constant L
(c) There is a c > 0 such that for all k:

m_k(p_k) − m_k(0) ≤ −c ‖g_k‖ min(∆_k, ‖g_k‖/‖B_k‖)

(d) There is a γ ≥ 1 such that ‖p_k‖ ≤ γ∆_k for all k
(e) η = 0 (fidelity threshold)
(f) ‖B_k‖ ≤ β for all k

Then:

lim inf_{k→∞} ‖g_k‖ = 0

5. Convergence when η ∈ (0, 1/4)

Theorem 3.2. If η ∈ (0, 1/4), with all other conditions from Theorem 3.1 holding, then:

lim_{k→∞} ‖g_k‖ = 0

3.4 Iterative Solutions to Subproblem

3.4.1 Exact Solution to Subproblem

1. Conditions for Minimizing Quadratics

Lemma 3.4. Let m be the quadratic function

m(p) = g^T p + (1/2) p^T B p

where B is a symmetric matrix.

(a) m attains a minimum if and only if B ⪰ 0 and g ∈ Im(B). If B ⪰ 0, then every p satisfying Bp = −g is a global minimizer of m.
(b) m has a unique minimizer if and only if B ≻ 0.

Proof. If p* is a minimizer of m, the necessary conditions give 0 = ∇m(p*) = Bp* + g and B = ∇²m(p*) ⪰ 0. Now suppose B ⪰ 0 and g ∈ Im(B); then ∃ p such that Bp = −g. Let w ∈ R^n. Then:

m(p + w) = g^T p + g^T w + (1/2)(p^T B p + 2 p^T B w + w^T B w)
         = m(p) + g^T w + p^T B w + (1/2) w^T B w
         = m(p) + (1/2) w^T B w
         ≥ m(p)

using p^T B w = (Bp)^T w = −g^T w, where the inequality follows since B ⪰ 0. This proves the first part.

Now suppose m has a unique minimizer p*, and suppose ∃ w ≠ 0 such that w^T B w = 0. Since B ⪰ 0, this forces Bw = 0, and from the computation above m(p* + w) = m(p*), so p* + w and p* are both minimizers of m, which is a contradiction. Hence B ≻ 0. For the other direction, since B ≻ 0, letting p* = −B⁻¹g, the Second Order Sufficient Conditions show m has a unique minimizer.

2. Characterization of the Exact Solution

Theorem 3.3. The vector p* is a solution to the trust region problem

min_{p∈R^n} { f(x) + g^T p + (1/2) p^T B p : ‖p‖ ≤ ∆ }

if and only if p* is feasible and ∃ λ ≥ 0 such that:

(a) (B + λI) p* = −g
(b) λ(‖p*‖ − ∆) = 0
(c) B + λI ⪰ 0

Proof. First we prove the (⇐) direction. Suppose ∃ λ ≥ 0 satisfying the properties, and consider:

m*(p) = f(x) + g^T p + (1/2) p^T (B + λI) p = m(p) + (λ/2) ‖p‖²

By condition (c) and Lemma 3.4, p* minimizes m*, so m*(p) ≥ m*(p*), which implies:

m(p) ≥ m(p*) + (λ/2)(‖p*‖² − ‖p‖²)

If λ = 0 then m(p) ≥ m(p*) for all p, so p* is a minimizer of m. If λ > 0 then ‖p*‖ = ∆ by (b), so for any feasible p (‖p‖ ≤ ∆) the added term is non-negative and m(p) ≥ m(p*).

Now suppose p* is the minimizer of m over the feasible set. If ‖p*‖ < ∆ then p* is an unconstrained local minimizer of m, so ∇m(p*) = Bp* + g = 0 and ∇²m(p*) = B ⪰ 0 by the previous lemma, and the conditions hold with λ = 0. Now suppose ‖p*‖ = ∆; then the second condition is automatically satisfied. We use the Lagrangian L(λ, p) = g^T p + (1/2) p^T B p + (λ/2)(p^T p − ∆²); differentiating with respect to p and setting this equal to 0, we see that:

(B + λI) p* + g = 0

To show that B + λI is positive semi-definite, note that for any p with ‖p‖ = ∆, since p* is the minimizer of m over the feasible set:

m(p) + (λ/2)‖p‖² ≥ m(p*) + (λ/2)‖p*‖²

Rearranging, using (B + λI)p* = −g:

(p − p*)^T (B + λI)(p − p*) ≥ 0

And since the set of all normalized ±(p − p*) with ‖p‖ = ∆ is dense on the unit sphere, B + λI is positive semi-definite. The last thing to show is that λ ≥ 0, which follows from the fact that if λ < 0 then p* would have to be a global minimizer of m; by the previous lemma, this is a contradiction.

3.4.2 Newton's Root Finding Iterative Solution

1. Theorem 3.3 guarantees that a solution exists; we need only find a λ which satisfies its conditions. Two cases arise from the theorem:

Case 1: λ = 0. Then B ⪰ 0 and we can compute a solution p* of Bp = −g using a QR decomposition.

Case 2: λ > 0. Letting QΛQ^T be the EVD of B, define:

p(λ) = −Q(Λ + λI)⁻¹ Q^T g

Hence:

‖p(λ)‖² = g^T Q(Λ + λI)⁻² Q^T g = Σ_{j=1}^n (q_j^T g)² / (λ_j + λ)²

From Theorem 3.3, we want to find λ > 0 for which:

∆² = Σ_{j=1}^n (q_j^T g)² / (λ_j + λ)²

2. The second case has two sub-cases which must be considered. Let λ_1 be the smallest eigenvalue of B and let Q_1 be the columns of Q corresponding to the eigenspace of λ_1. (For B + λI ≻ 0 we need λ > −λ_1.)

Case 2a: If Q_1^T g ≠ 0, then we implement Newton's root finding algorithm on:

f(λ) = 1/∆ − 1/‖p(λ)‖

Since ‖p(λ)‖ → ∞ as λ → −λ_1 from the right, a solution exists.

Case 2b: If Q_1^T g = 0, then applying Newton's root finding algorithm naively is not useful since ‖p(λ)‖ ↛ ∞ as λ → −λ_1. We note that when λ = −λ_1, B + λI ⪰ 0, and so we can find a τ such that:

∆² = Σ_{j : λ_j ≠ λ_1} (q_j^T g)² / (λ_j − λ_1)² + τ²

Computation: ∃ z ≠ 0 such that (B − λ_1 I)z = 0 and ‖z‖ = 1. Therefore q_j^T z = 0 for all j with λ_j ≠ λ_1. Setting

p(−λ_1, τ) = −Σ_{j : λ_j ≠ λ_1} [(q_j^T g)/(λ_j − λ_1)] q_j + τ z

gives ‖p(−λ_1, τ)‖² equal to the sum above, so we need only find τ.

3.5 Trust Region Subproblem Algorithms

4 Conjugate Gradients

1. Conjugate Gradients Overview

Algorithm 5: Overview of Conjugate Gradient Algorithm
  input: starting point x_0, tolerance ε, objective f
  x ← x_0
  Compute gradient: r ← ∇f(x)
  Compute conjugated descent direction: p ← −r
  while ‖r‖ > ε do
    Compute optimal step length: α
    Compute step: x ← x + αp
    Compute gradient: r ← ∇f(x)
    Compute conjugated step direction: p
  end
  return minimizer: x

2. Conjugate Gradients for minimizing a convex (A ≻ 0) quadratic function:

f(x) = (1/2) x^T A x − b^T x

Algorithm 6: Conjugate Gradients Algorithm for Convex Quadratic
  input: starting point x_0, tolerance ε = 0, objective f
  x ← x_0
  Compute gradient: r ← ∇f(x) = Ax − b
  Compute conjugated descent direction: p ← −r
  while ‖r‖ > ε do
    Compute optimal step length: α ← (r^T r)/(p^T A p)
    Compute step: x ← x + αp
    Store previous: r_old ← r
    Compute gradient: r ← r + αAp
    Compute conjugated step direction: p ← −r + (‖r‖²/‖r_old‖²) p
  end
  return minimizer: x
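Algorithm 6 in runnable Python (a sketch; the function name is mine, and the update steps along the conjugate direction p are made explicit):

```python
import numpy as np

def conjugate_gradients(A, b, x0, tol=1e-10):
    """Minimize 0.5 x^T A x - b^T x for symmetric positive definite A,
    i.e. solve A x = b, stepping along A-conjugate directions."""
    x = x0.copy()
    r = A @ x - b                         # gradient of the objective
    p = -r
    while np.linalg.norm(r) > tol:
        alpha = (r @ r) / (p @ A @ p)     # exact line minimizer along p
        x = x + alpha * p                 # step along the conjugate direction
        r_old = r
        r = r + alpha * (A @ p)           # cheap gradient update
        beta = (r @ r) / (r_old @ r_old)  # Fletcher-Reeves coefficient
        p = -r + beta * p
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradients(A, b, np.zeros(2))
```

For an n x n SPD system, exact arithmetic terminates in at most n iterations; the 2 x 2 example above finishes in two.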

3. Fletcher-Reeves for Nonlinear Functions

Algorithm 7: Fletcher-Reeves CG Algorithm
  input: starting point x_0, tolerance ε, objective f
  x ← x_0
  Compute gradient: r ← ∇f(x)
  Compute conjugated descent direction: p ← −r
  while ‖r‖ > ε do
    Compute optimal step length: α (Strong Wolfe line search)
    Compute step: x ← x + αp
    Store previous: r_old ← r
    Compute gradient: r ← ∇f(x)
    Compute conjugated step direction: p ← −r + (‖r‖²/‖r_old‖²) p
  end
  return minimizer: x

Part II
Model Hessian Selection

5 Newton’s Method

1. Requires ∇²f ≻ 0 in some region for the method to work.

2. Line Search

(a) The search direction:

p_k^N = −(∇²f_k)⁻¹ ∇f_k

(b) Rate of Convergence

Theorem 5.1. Suppose f is an objective with minimizer x* satisfying the Second Order Conditions. Moreover:

i. Suppose f is twice continuously differentiable in a neighborhood of x*
ii. Specifically, suppose ∇²f is Lipschitz continuous in a neighborhood of x*
iii. Suppose p_k = p_k^N and x_{k+1} = x_k + p_k
iv. Suppose x_k → x*

Then for x_0 sufficiently close to x*:

i. x_k → x* quadratically
ii. ‖∇f_k‖ → 0 quadratically.

Proof. Our goal is to express x_{k+1} − x* in terms of ∇²f and use Lipschitz continuity to provide an upper bound.

i. We have:

x_{k+1} − x* = x_k + p_k − x*
            = x_k − x* − (∇²f_k)⁻¹ ∇f_k
            = (∇²f_k)⁻¹ [∇²f_k (x_k − x*) − (∇f_k − ∇f(x*))]
            = (∇²f_k)⁻¹ [∇²f_k (x_k − x*) − ∫_0^1 ∇²f(x_k + t(x* − x_k))(x_k − x*) dt]
            = (∇²f_k)⁻¹ ∫_0^1 [∇²f_k − ∇²f(x_k + t(x* − x_k))](x_k − x*) dt

ii. Using Lipschitz continuity, with constant L:

‖x_{k+1} − x*‖ ≤ ‖(∇²f_k)⁻¹‖ ∫_0^1 ‖∇²f_k − ∇²f(x_k + t(x* − x_k))‖ ‖x_k − x*‖ dt
             ≤ ‖(∇²f_k)⁻¹‖ ∫_0^1 tL ‖x_k − x*‖² dt
             = (L/2) ‖(∇²f_k)⁻¹‖ ‖x_k − x*‖²

We now need to show that ‖(∇²f_k)⁻¹‖ is bounded from above. Let η(2r) be a neighborhood of radius 2r about x* on which ∇²f ≻ 0. On this neighborhood, (∇²f)⁻¹ exists and g(x) = ‖∇²f(x)⁻¹‖ is continuous. Then, on the compact closed ball of radius r, g(x) has a finite maximum. Therefore, we have quadratic convergence of x_k → x*.

iii. We now consider the first derivative. Using ∇f_k + ∇²f_k p_k = 0:

‖∇f_{k+1} − ∇f(x*)‖ = ‖∇f_{k+1}‖
                    = ‖∇f_{k+1} − ∇f_k − ∇²f_k p_k‖
                    = ‖∫_0^1 [∇²f(x_k + t p_k) − ∇²f_k] p_k dt‖
                    ≤ ∫_0^1 tL ‖p_k‖² dt
                    ≤ (L/2) g(x_k)² ‖∇f_k‖²

So if x_0 ∈ η(r), the result follows.
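A quick numerical illustration of Theorem 5.1 (a Python sketch on an example of my choosing, f(x) = x1² + x1⁴ + x2², whose minimizer x* = 0 satisfies the second order conditions with ∇²f(0) = 2I ≻ 0):

```python
import numpy as np

def newton(grad, hess, x0, steps=6):
    """Newton iteration x_{k+1} = x_k - [hess(x_k)]^{-1} grad(x_k),
    recording the error ||x_k - x*|| at each step (x* = 0 below)."""
    x = np.asarray(x0, dtype=float)
    errs = []
    for _ in range(steps):
        x = x - np.linalg.solve(hess(x), grad(x))
        errs.append(np.linalg.norm(x))
    return x, errs

# f(x) = x1^2 + x1^4 + x2^2
grad = lambda x: np.array([2 * x[0] + 4 * x[0] ** 3, 2 * x[1]])
hess = lambda x: np.array([[2 + 12 * x[0] ** 2, 0.0], [0.0, 2.0]])

x, errs = newton(grad, hess, [1.0, 1.0])
# once close to x*, the error is roughly squared at every iteration
```

Starting from (1, 1), the error sequence collapses at the characteristic quadratic rate: each new error is bounded by a constant times the square of the previous one, and six steps already reach machine-precision territory.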

3. Trust Region

(a) Asymptotically Similar

Definition 5.1. A sequence p_k is asymptotically similar to a sequence q_k if ‖p_k − q_k‖ = o(‖q_k‖).

(b) The Subproblem

min_{p∈R^n} { f(x) + ∇f_x^T p + (1/2) p^T ∇²f_x p : ‖p‖ ≤ ∆ }

(c) Rate of Convergence

Theorem 5.2. Suppose f is an objective with minimizer x* satisfying the Second Order Conditions. Moreover:

i. Suppose f is twice continuously differentiable in a neighborhood of x*
ii. Specifically, suppose ∇²f is Lipschitz continuous in a neighborhood of x*
iii. Suppose for some c ∈ (0, 1) the p_k satisfy

m_k(p_k) − m_k(0) ≤ −c ‖∇f_k‖ min(∆_k, ‖∇f_k‖/‖∇²f_k‖)

and are asymptotically similar to p_k^N whenever ‖p_k^N‖ ≤ (1/2)∆_k.
iv. Suppose x_{k+1} = x_k + p_k and x_k → x*

Then:

i. The trust region constraint ∆_k becomes inactive for sufficiently large k.
ii. x_k → x* superlinearly.

Proof. We need to show that ∆k are bounded from below. The sec- ∗ N ond part follows from this quickly since xk → x implies pk → 0 N 1 and eventually pk ≤ 2 ∆k. So applying the asymptotic similarity:

∗ N ∗ N kxk + pk − x k ≤ xk + pk − x + pk − pk ∗ 2 N ≤ o(kxk − x k ) + o( pk ) ∗ 2 ∗ ≤ o(kxk − x k ) + o(kx − x k) ≤ o(kx − x∗k) where the penultimate inequality follows from: ! 2 −1 ∗ 2 −1 ∗ 2 ∇ fk k∇fk − ∇f k ≤ ∇ fk kxk − x k sup ∇ f(x) x∈η(r)

To get that ∆k is bounded from below, we frame it in terms of ρk. The numerator of ρk − 1 satisfies:

|f(x_k + p_k) − f(x_k) − (m_k(p_k) − m_k(0))|
≤ ½ |∫_0^1 p_k^T [∇²f(x_k + t p_k) − ∇²f_k] p_k dt|
≤ (L/4) ‖p_k‖³

And the denominator of ρ_k − 1 satisfies:

|m_k(p_k) − m_k(0)| ≥ c ‖∇f_k‖ min(Δ_k, ‖∇f_k‖ / ‖∇²f_k‖)
≥ c ‖∇f_k‖ min(‖p_k‖, ‖∇f_k‖ / ‖∇²f_k‖)

So we need only lower bound ‖∇f_k‖ by a multiple of ‖p_k‖ to translate this into a lower bound for Δ_k. If ‖p_k^N‖ > ½ Δ_k, then (in some neighborhood of x*)

‖p_k‖ ≤ Δ_k ≤ 2 ‖p_k^N‖ ≤ 2 ‖(∇²f_k)^{-1}‖ ‖∇f_k‖ ≤ C ‖∇f_k‖

Else, if ‖p_k^N‖ ≤ ½ Δ_k, then

‖p_k‖ ≤ ‖p_k^N‖ + o(‖p_k^N‖) ≤ 2 ‖p_k^N‖ ≤ C ‖∇f_k‖

Therefore:

|m_k(p_k) − m_k(0)| ≥ (c ‖p_k‖ / C) min(‖p_k‖, ‖p_k‖ / (C ‖∇²f_k‖ ‖(∇²f_k)^{-1}‖))

Since the second term’s denominator is the condition number of the matrix (which must be greater than or equal to 1), we have that:

|ρ_k − 1| ≤ C′ ‖p_k‖ ≤ C′ Δ_k

Hence, if Δ_k < 3/(4C′), then ρ_k > 3/4 and Δ_k would double. So we know that Δ_k ≥ Δ̂/2^M for some fixed M such that Δ̂/2^M ≤ 3/(4C′). Since p_k → 0 for sufficiently large k, we have ρ_k → 1, and so Δ_k becomes inactive.

6 Newton’s Method with Hessian Modification

1. If ∇²f_k ⊁ 0, we can add a matrix E_k so that B_k := ∇²f_k + E_k ≻ 0. As long as this model produces a descent direction, and ‖B_k‖ and ‖B_k^{-1}‖ are bounded uniformly, Zoutendijk’s theorem guarantees convergence.

2. Newton’s Method with Hessian Modification

Algorithm 8: Line Search with Modified Hessian
input : Starting point x_0, tolerance ε
k ← 0
while ‖∇f_k‖ > ε do
    Find E_k so that B_k := ∇²f_k + E_k ≻ 0
    p_k ← Solve B_k p = −∇f_k
    Compute α_k (Wolfe, Goldstein, Backtracking, Interpolation)
    x_{k+1} ← x_k + α_k p_k
    k ← k + 1
end

3. Minimum Frobenius Norm

(a) Problem: find E which satisfies min ‖E‖_F² : λ_min(A + E) ≥ δ.
(b) Solution: If A = QΛQ^T is the EVD of A, then E = Q diag(τ_i) Q^T where

τ_i = 0 if λ_i ≥ δ, and τ_i = δ − λ_i if λ_i < δ.

4. Minimum 2-Norm

(a) Problem: find E which satisfies min ‖E‖₂ : λ_min(A + E) ≥ δ

(b) Solution: If λ_min(A) = λ_n, then E = τI where

τ = 0 if λ_n ≥ δ, and τ = δ − λ_n if λ_n < δ.
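Both eigenvalue-based modifications can be sketched in a few lines (the matrix A and threshold δ below are my illustrative choices):

```python
import numpy as np

def modify_frobenius(A, delta):
    # Minimum Frobenius norm modification: shift only eigenvalues below delta,
    # E = Q diag(tau_i) Q^T with tau_i = max(0, delta - lambda_i)
    lam, Q = np.linalg.eigh(A)
    tau = np.maximum(0.0, delta - lam)
    return Q @ np.diag(tau) @ Q.T

def modify_2norm(A, delta):
    # Minimum 2-norm modification: a single shift of the identity,
    # E = tau * I with tau = max(0, delta - lambda_min(A))
    tau = max(0.0, delta - np.linalg.eigvalsh(A)[0])
    return tau * np.eye(A.shape[0])

A = np.array([[1.0, 2.0],
              [2.0, -3.0]])              # symmetric but indefinite
delta = 1e-3
for E in (modify_frobenius(A, delta), modify_2norm(A, delta)):
    # In both cases A + E has smallest eigenvalue >= delta (up to roundoff)
    assert np.linalg.eigvalsh(A + E)[0] >= delta - 1e-10
```

The Frobenius modification perturbs only the offending eigendirections, while the 2-norm modification perturbs every direction equally.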

5. Modified Cholesky
(a) Compute the LDL^T factorization, but add constants to ensure positive definiteness of A
(b) May require pivoting to ensure numerical stability

6. Modified Block Cholesky
(a) Compute the LBL^T factorization, where B is block diagonal with 1×1 and 2×2 blocks (the number of 1×1 blocks is the number of positive eigenvalues and the number of 2×2 blocks is the number of negative eigenvalues)
(b) Compute the EVD of B and modify it to ensure appropriate eigenvalues.

7 Quasi-Newton’s Method

1. Fundamental Idea

Computation. Let f be our objective, and suppose it is twice continuously differentiable with Lipschitz continuous Hessian. From Taylor’s Theorem:

∇f(x + p) = ∇f(x) + ∫_0^1 ∇²f(x + tp) p dt
= ∇f(x) + ∇²f(x) p + ∫_0^1 [∇²f(x + tp) − ∇²f(x)] p dt
= ∇f(x) + ∇²f(x) p + o(‖p‖)

Therefore, letting x = xk and p = xk+1 − xk

∇²f_k (x_{k+1} − x_k) ≈ ∇f_{k+1} − ∇f_k

Quasi-Newton methods choose an approximation B_{k+1} to ∇²f_k which satisfies:

B_{k+1}(x_{k+1} − x_k) = g_{k+1} − g_k

2. Secant Equation

Definition 7.1. Let x_i be a sequence of iterates when minimizing an objective function f. Let s_k = x_{k+1} − x_k and y_k = g_{k+1} − g_k, where g_i = ∇f_i. Then the secant equation is

Bk+1sk = yk

where Bk+1 is some matrix.

3. Curvature Condition

Definition 7.2. The curvature condition is s_k^T y_k > 0.

Note 7.1. If B_{k+1} satisfies the secant equation and s_k, y_k satisfy the curvature condition, then s_k^T B_{k+1} s_k = s_k^T y_k > 0. So the quadratic model along the step direction is convex.

4. Line Search Rate of Convergence

Theorem 7.1. Suppose f is an objective with minimum x* satisfying the Second Order Conditions. Moreover:

(a) Suppose f is twice continuously differentiable in a neighborhood of x∗

(b) Suppose Bk is computed using a quasi-Newton method and

Bkpk = −gk

(c) Suppose xk+1 = xk + pk

(d) Suppose x_k → x*

When x_0 is sufficiently close to x*, x_k → x* superlinearly if and only if:

lim_{k→∞} ‖(B_k − ∇²f(x*)) p_k‖ / ‖p_k‖ = 0

Proof. We show that the condition is equivalent to ‖p_k − p_k^N‖ = o(‖p_k‖). First, if the condition holds:

p_k − p_k^N = (∇²f_k)^{-1} (∇²f_k p_k + ∇f_k)
= (∇²f_k)^{-1} (∇²f_k − B_k) p_k
= (∇²f_k)^{-1} [(∇²f_k − ∇²f(x*)) p_k + (∇²f(x*) − B_k) p_k]

Taking norms, using that x_0 is sufficiently close to x*, and continuity:

‖p_k − p_k^N‖ = o(‖p_k‖)

For the other direction:

o(‖p_k‖) = ‖p_k − p_k^N‖ ≥ ‖(∇²f_k − B_k) p_k‖ / ‖∇²f_k‖, so by continuity of ∇²f near x*, ‖(∇²f(x*) − B_k) p_k‖ = o(‖p_k‖). Now we just apply the triangle inequality and properties of the Newton step:

‖x_k + p_k − x*‖ ≤ ‖x_k + p_k^N − x*‖ + ‖p_k − p_k^N‖ = o(‖x_k − x*‖²) + o(‖p_k‖)

Note that o(‖p_k‖) = o(‖x_k − x*‖).

7.1 Rank-2 Update: DFP & BFGS

1. Davidon-Fletcher-Powell Update

(a) Let:

G_k = ∫_0^1 ∇²f(x_k + t α p_k) dt

So Taylor’s Theorem implies:

yk = Gksk

(b) Let:

‖A‖_W = ‖G_k^{-1/2} A G_k^{-1/2}‖_F

(c) Problem. Given Bk symmetric positive definite, sk and yk, find

arg min ‖B − B_k‖_W : B = B^T, B s_k = y_k

(d) Solution:

B_{k+1} = (I − (y_k s_k^T)/(y_k^T s_k)) B_k (I − (s_k y_k^T)/(y_k^T s_k)) + (y_k y_k^T)/(y_k^T s_k)

(e) Inverse Hessian:

H_{k+1} = H_k − (H_k y_k y_k^T H_k)/(y_k^T H_k y_k) + (s_k s_k^T)/(y_k^T s_k)

2. Broyden, Fletcher, Goldfarb, Shanno Update

(a) Problem: Given Hk (inverse of Bk) symmetric positive definite, sk and yk, find

arg min ‖H − H_k‖_W : H = H^T, H y_k = s_k

(b) Solution:

H_{k+1} = (I − (s_k y_k^T)/(y_k^T s_k)) H_k (I − (y_k s_k^T)/(y_k^T s_k)) + (s_k s_k^T)/(y_k^T s_k)

(c) Hessian:

B_{k+1} = B_k − (B_k s_k s_k^T B_k)/(s_k^T B_k s_k) + (y_k y_k^T)/(y_k^T s_k)

3. Preservation of Positive Definiteness

Lemma 7.1. If B_k ≻ 0 and B_{k+1} is updated from the BFGS method, then B_{k+1} ≻ 0.

Proof. We can equivalently show that H_{k+1} ≻ 0. Let z ∈ ℝⁿ \ {0}. Then:

z^T H_{k+1} z = w^T H_k w + ρ_k (s_k^T z)²

where ρ_k = 1/(y_k^T s_k) and w = z − ρ_k (s_k^T z) y_k. If s_k^T z = 0, then w = z and z^T H_k z > 0. If s_k^T z ≠ 0, then ρ_k (s_k^T z)² > 0 (regardless of whether w = 0), and so H_{k+1} ≻ 0.
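The two inverse-Hessian updates above can be checked numerically. A sketch (function names are mine): both updates satisfy the secant equation in the inverse form H_{k+1} y_k = s_k and, given the curvature condition, preserve positive definiteness.

```python
import numpy as np

def dfp_inverse_update(H, s, y):
    # H_{k+1} = H_k - (H_k y y^T H_k)/(y^T H_k y) + (s s^T)/(y^T s)
    Hy = H @ y
    return H - np.outer(Hy, Hy) / (y @ Hy) + np.outer(s, s) / (y @ s)

def bfgs_inverse_update(H, s, y):
    # H_{k+1} = (I - rho s y^T) H_k (I - rho y s^T) + rho s s^T, rho = 1/(y^T s)
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

H = np.eye(3)
s = np.array([0.5, -0.1, 0.3])
y = np.array([0.4, 0.1, 0.2])
assert y @ s > 0                          # curvature condition holds

for update in (dfp_inverse_update, bfgs_inverse_update):
    H1 = update(H, s, y)
    assert np.allclose(H1 @ y, s)         # secant equation (inverse form)
    assert np.linalg.eigvalsh(H1)[0] > 0  # positive definiteness preserved
```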

7.2 Rank-1 Update: SR1

1. Problem: find v such that:

B_{k+1} = B_k + σ v v^T : B_{k+1} s_k = y_k

2. Solution:

B_{k+1} = B_k + ((y_k − B_k s_k)(y_k − B_k s_k)^T) / (s_k^T (y_k − B_k s_k))

Proof. Multiplying by sk:

y_k − B_k s_k = σ (v^T s_k) v

Hence, v is a multiple of yk − Bksk.
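A sketch of the SR1 update as derived above; skipping the update when the denominator s_k^T(y_k − B_k s_k) is near zero is standard practice and is my addition, since nothing prevents it from vanishing.

```python
import numpy as np

def sr1_update(B, s, y, eps=1e-8):
    v = y - B @ s
    denom = v @ s
    # Skip the update when the denominator is tiny: the SR1 update is
    # undefined or numerically unstable in that case.
    if abs(denom) < eps * np.linalg.norm(s) * np.linalg.norm(v):
        return B
    return B + np.outer(v, v) / denom

B = np.eye(2)
s = np.array([1.0, 0.0])
y = np.array([2.0, 1.0])
B1 = sr1_update(B, s, y)
assert np.allclose(B1 @ s, y)     # secant equation holds
```

Unlike DFP/BFGS, SR1 does not preserve positive definiteness, which makes it a natural fit for trust-region rather than line-search methods.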

Part III
Specialized Applications of Optimization

8 Inexact Newton’s Method

1. Requirements for search direction:

(a) pk is inexpensive to compute

(b) p_k approximates the Newton step closely, as measured by the residual:

z_k = ∇²f_k p_k + ∇f_k

(c) The error is controlled by the current gradient:

‖z_k‖ ≤ η_k ‖∇f_k‖, where the η_k are called the forcing sequence.

2. Typically, inexact methods keep B_k = ∇²f_k and improve on computing p_k

3. CG Based Algorithmic Overview

Algorithm 9: Overview of Inexact CG Algorithm
input : x_0 a starting point, f objective, restrictions
x ← x_0
Define Tolerance: ε
Compute Gradient: r ← ∇f(x)
Compute Conjugated Descent Direction: p ← −r
while ‖r‖ > ε do
    Check Restricted Polynomial:
    if p^T B p ≤ 0 then
        Compute α given restrictions
        Compute Minimizer: x ← x + αp
        Break Loop
    else
        Compute Optimal Step Length: α
        if Restrictions on x then
            Compute τ given restrictions
            Compute Minimizer: x ← x + ταp
            Break Loop
        else
            Compute Step: x ← x + αp
        end
        Compute Gradient: r ← ∇f(x)
        Compute Conjugated Step Direction: p
    end
end
x* ← x
return Minimizer: x*

4. As with the usual CG, we can implement methods for computing the gradient and conjugate step direction efficiently

32 5. Line Search Strategy: the restriction of α is simply that it satisfies the strong Wolfe Conditions

6. Trust-Region Strategy:
(a) The restriction on α is that ‖x + αp‖ = Δ, because the restricted polynomial in the direction p is decreasing, so we should go all the way to the boundary
(b) The restriction on x is that ‖x + αp‖ > Δ: x + αp is no longer in the trust region, so we need to find τ that returns the step to the boundary.
7. An alternative is to use Lanczos’ Method, which generalizes CG Methods.
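A sketch of the line-search variant of the inner CG loop (variable names are mine): CG is applied to ∇²f_k p = −∇f_k, truncated once the residual satisfies ‖r‖ ≤ η_k‖∇f_k‖, with a fallback to steepest descent if negative curvature appears on the first iteration.

```python
import numpy as np

def newton_cg_direction(g, hess_vec, eta, max_iter=50):
    # Approximately solve (Hess f) p = -g by CG, truncated at residual
    # ||r|| <= eta * ||g|| or when negative curvature is detected.
    z = np.zeros_like(g)          # running solution p
    r = g.copy()                  # residual of (Hess f) z + g
    d = -r                        # conjugated descent direction
    for _ in range(max_iter):
        Hd = hess_vec(d)
        if d @ Hd <= 0:           # negative curvature detected
            return -g if np.allclose(z, 0.0) else z
        alpha = (r @ r) / (d @ Hd)
        z = z + alpha * d
        r_new = r + alpha * Hd
        if np.linalg.norm(r_new) <= eta * np.linalg.norm(g):
            return z              # inexactness controlled by the forcing term
        beta = (r_new @ r_new) / (r @ r)
        d = -r_new + beta * d
        r = r_new
    return z

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])        # stand-in for a positive definite Hessian
g = np.array([1.0, 2.0])
p = newton_cg_direction(g, lambda v: A @ v, eta=1e-10)
# With a tiny forcing term the direction is essentially the exact Newton step
assert np.allclose(A @ p, -g, atol=1e-8)
```

The trust-region (Steihaug) variant additionally tracks ‖z‖ and steps to the boundary ‖z‖ = Δ, as described above.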

9 Limited Memory BFGS

1. In normal BFGS:

H_{k+1} = (I − ρ_k s_k y_k^T) H_k (I − ρ_k y_k s_k^T) + ρ_k s_k s_k^T, where ρ_k = 1/(y_k^T s_k)

2. Substituting in for H_k until H_0 reveals that only the pairs (s_i, y_i) must be stored to compute H_{k+1}

3. Limited BFGS capitalizes on this by storing a fixed number of pairs (s_k, y_k) and re-updating H_k from a (typically diagonal) matrix H_0^k

Algorithm 10: L-BFGS Algorithm
input : x_0 a starting point, f objective, m storage size, ε tolerance
x ← x_0
k ← 0
while ‖∇f(x)‖ > ε do
    Choose H_0
    Update H from {(s_k, y_k), ..., (s_{k−m}, y_{k−m})}
    Compute p_u ← −H ∇f(x)
    Implement Line Search or Trust Region (dogleg) to find p
    x ← x + p
    k ← k + 1
end
return Minimizer: x
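The steps "Update H" and "Compute p_u ← −H∇f(x)" are usually implemented without ever forming H, via the standard two-loop recursion; a sketch (variable names mine, with the common scaling H_0 = γI, γ = s^T y / y^T y from the most recent pair):

```python
import numpy as np

def lbfgs_direction(g, s_list, y_list):
    # Two-loop recursion: returns -H g for the L-BFGS inverse Hessian H
    # built from the stored pairs (s_i, y_i), newest pair last.
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    q = g.copy()
    alphas = []
    # First loop: newest to oldest
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        a = rho * (s @ q)
        alphas.append(a)
        q = q - a * y
    # Initial matrix H_0 = gamma * I from the most recent pair
    s, y = s_list[-1], y_list[-1]
    r = ((s @ y) / (y @ y)) * q
    # Second loop: oldest to newest
    for s, y, rho, a in zip(s_list, y_list, rhos, reversed(alphas)):
        b = rho * (y @ r)
        r = r + (a - b) * s
    return -r
```

With a single stored pair this reproduces one dense BFGS update of H_0 = γI exactly, which makes a convenient sanity check.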

10 Least Squares Regression Methods

1. In least squares problems, the objective function has a specialized form, for example:

f(x) = ½ Σ_{j=1}^m r_j(x)²

(a) x ∈ ℝⁿ, and the components are usually the parameters of a particular model of interest

(b) rj are usually the residuals evaluated for the model φ with parameter x at point tj and output yj. That is:

r_j(x) = φ(x, t_j) − y_j

2. Formulation

(a) Let r(x) = (r₁(x), r₂(x), ..., r_m(x))^T. Then:

f(x) = ½ ‖r(x)‖²

(b) The first derivative:

∇f(x) = ∇r(x)^T r(x) =: J(x)^T r(x)

where J(x) is the Jacobian of r, with rows ∇r₁(x)^T, ∇r₂(x)^T, ..., ∇r_m(x)^T.

(c) The Second Derivative:

∇²f(x) = J(x)^T J(x) + Σ_{j=1}^m r_j(x) ∇²r_j(x)

3. The main idea behind least squares methods is to approximate ∇²f(x) by J(x)^T J(x); thus, only first derivative information is required.

10.1 Linear Least Squares Regression

1. Suppose φ(x, t_j) = t_j^T x. Then:

r(x) = T x − y

where each row of T is t_j^T

2. Objective Function:

f(x) = ½ ‖T x − y‖²

Note 10.1. The objective function can be minimized directly using matrix factorizations if m is not too large.

3. First Derivative:

∇f(x) = T^T T x − T^T y

Note 10.2. The normal equations T^T T x = T^T y may be better to solve if m is large because, by multiplying by the transpose, we need only solve an n×n system of equations.
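The two routes can be compared directly; this sketch (with random data of my choosing) contrasts a factorization-based solve with the n×n normal equations:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 100, 3
T = rng.standard_normal((m, n))
y = rng.standard_normal(m)

# Route 1: matrix factorization (numpy's lstsq is SVD-based)
x_fact = np.linalg.lstsq(T, y, rcond=None)[0]

# Route 2: normal equations T^T T x = T^T y -- only an n x n solve, but the
# conditioning is squared: cond(T^T T) = cond(T)^2
x_norm = np.linalg.solve(T.T @ T, T.T @ y)

assert np.allclose(x_fact, x_norm)
```

For well-conditioned T the two agree to machine precision; for ill-conditioned T the normal equations lose roughly twice as many digits.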

10.2 Line Search: Gauss-Newton

1. The Gauss-Newton Algorithm is a line search for non-linear least squares regression with search direction:

J_k^T J_k p_k = −J_k^T r_k

2. Notice that these are the normal equations from:

min_p ½ ‖J_k p + r_k‖²

3. Using this search direction, we proceed with line search as per usual

10.3 Trust Region: Levenberg-Marquardt

1. The local model is:

m(p) = f(x) + (J(x)^T r(x))^T p + ½ p^T J(x)^T J(x) p, ‖p‖ ≤ Δ

2. Since f(x) = ½ ‖r(x)‖², minimizing m(p) is equivalent to:

min_p ½ ‖J(x) p + r(x)‖², ‖p‖ ≤ Δ
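In practice the Levenberg-Marquardt step for a given damping parameter λ ≥ 0 is computed from the regularized normal equations (J^T J + λI)p = −J^T r, which are the normal equations of min ½‖Jp + r‖² + ½λ‖p‖²; λ = 0 recovers the Gauss-Newton step, and larger λ shrinks the step as a smaller trust region would. The toy J and r below are my choice.

```python
import numpy as np

def lm_step(J, r, lam):
    # Solve (J^T J + lam I) p = -J^T r
    n = J.shape[1]
    return np.linalg.solve(J.T @ J + lam * np.eye(n), -J.T @ r)

J = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
r = np.array([0.5, -0.2, 0.1])

p_gn = lm_step(J, r, 0.0)      # Gauss-Newton step (undamped)
p_lm = lm_step(J, r, 10.0)     # heavily damped step
# Damping shrinks the step, mimicking a tighter trust-region radius
assert np.linalg.norm(p_lm) < np.linalg.norm(p_gn)
```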

Part IV
Constrained Optimization

10.4 Theory of Constrained Optimization

1. In constrained optimization, the problem is:

min_x f(x) : c_i(x) = 0 ∀i ∈ E, c_j(x) ≥ 0 ∀j ∈ I

(a) f is the usual objective function
(b) E is the index set for equality constraints
(c) I is the index set for inequality constraints

2. Feasible Set. Active Set.

Definition 10.1. The feasible set, Ω, is the set of all points which satisfy the constraints:

Ω = {x : ci(x) = 0 ∀i ∈ E, cj(x) ≥ 0 ∀j ∈ I}

The active set at point x ∈ Ω, A(x), is defined by:

A(x) = E ∪ {j ∈ I : cj(x) = 0}

3. Feasible Sequence. Tangent. Tangent Cone.

Definition 10.2. Let x ∈ Ω. A sequence z_k → x is a feasible sequence if there is a K such that z_k ∈ Ω for all k ≥ K. A vector d is a tangent to Ω at x if there are a feasible sequence z_k → x and a sequence t_k → 0 such that:

lim_{k→∞} (z_k − x)/t_k = d

The set of all tangents is the Tangent Cone at x, denoted TΩ(x).

Note 10.3. TΩ(x) contains all step directions which for a sufficiently small step size will remain in the feasible region Ω.

4. Linearized Feasible Directions.

Definition 10.3. Let x ∈ Ω and A(x) be its active set. The set of linearized feasible directions, F(x), is defined as:

F(x) = {d : d^T ∇c_i(x) = 0 ∀i ∈ E, d^T ∇c_j(x) ≥ 0 ∀j ∈ A(x) ∩ I}

Note 10.4. Let x ∈ Ω and suppose we linearize the constraints near x. The set F(x) contains all search directions for which the linearized approximations to c_i(x + d) satisfy the constraints. For example, if c_i(x) = 0 for i ∈ E, then we need c_i(x + d) ≈ c_i(x) + d^T ∇c_i(x) = d^T ∇c_i(x) = 0 for d to be a good search direction.

5. LICQ Constraint Qualification

Definition 10.4. Given a point x and active set A(x), the linear independence constraint qualification (LICQ) holds if the set of active constraint gradients, {∇c_i(x) : i ∈ A(x)}, is linearly independent.

Note 10.5. Most algorithms using line search or trust regions require that F(x) ≈ T_Ω(x); otherwise there is not a good way to find another iterate.

6. First Order Necessary Conditions

Theorem 10.1. Let f, c_i be continuously differentiable. Suppose x* is a local solution to the optimization problem and LICQ holds at x*. Then there is a Lagrange multiplier vector λ, with components λ_i for i ∈ E ∪ I, such that the Karush-Kuhn-Tucker conditions are satisfied at (x*, λ):

(a) ∇_x L(x*, λ) = 0
(b) For all i ∈ E, c_i(x*) = 0
(c) For all j ∈ I, c_j(x*) ≥ 0

(d) For all j ∈ I, λ_j ≥ 0
(e) For all j ∈ I, λ_j c_j(x*) = 0
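A worked example of checking the KKT conditions (the problem is my illustration, not from the notes): minimize f(x) = x₁² + x₂² subject to the single equality constraint c₁(x) = x₁ + x₂ − 1 = 0. The solution is x* = (½, ½), and stationarity ∇f(x*) = λ∇c₁(x*) gives λ = 1.

```python
import numpy as np

x_star = np.array([0.5, 0.5])
lam = 1.0

grad_f = 2.0 * x_star               # grad f(x*) = (1, 1)
grad_c = np.array([1.0, 1.0])       # grad c1(x*); LICQ holds trivially

# (a) stationarity of the Lagrangian: grad f - lam * grad c1 = 0
assert np.allclose(grad_f - lam * grad_c, 0.0)
# (b) primal feasibility of the equality constraint
assert np.isclose(x_star.sum() - 1.0, 0.0)
```

Conditions (c)-(e) are vacuous here since I = ∅.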
