Trust-Region Methods for Sequential Quadratic Programming

Cory Cutsail

May 17, 2016

Contents

1 Sequential Quadratic Programming
  1.1 Background
  1.2 A Review of Trust Region Methods
  1.3 A Review of Quadratic Programming
  1.4 A Note on Merit Functions
    1.4.1 The Basics of Merit Functions
    1.4.2 Penalty Functions

2 Trust Region Methods for Sequential Quadratic Programming
  2.1 An Interior-Point Method
  2.2 Active-Set Methods
    2.2.1 Sequential ℓ1 Quadratic Programming
    2.2.2 Sequential Linear-Quadratic Programming
    2.2.3 Computing the Step and Updating the Penalty Parameter
  2.3 Existing SQP Solvers
  2.4 Conclusions

1 Sequential Quadratic Programming

1.1 Background

Sequential quadratic programming is an iterative method for obtaining the solution to nonlinear optimization problems. Nonlinear optimization problems are of the form

\[ \min_{x \in \mathbb{R}^n} f(x) \quad \text{subject to} \quad \begin{cases} c_i(x) = 0, & i \in \mathcal{E} \\ c_i(x) \geq 0, & i \in \mathcal{I} \end{cases} \]

where f and the c_i are smooth scalar functions over A ⊂ ℝ^n. There are two primary types of iterative methods for nonlinear optimization: interior-point methods and active-set methods. The difference between them can be thought of as the direction from which they approach an optimum x*. An interior-point method will typically begin at a point in the interior of the feasible region, whereas an active-set method will typically approach x* from a point on the boundary ∂A.

Sequential quadratic programming (SQP) methods can be either active-set or interior-point. We will consider each, moving from a general algorithm for SQP to an interior-point method, and then on to consider various active-set trust-region SQP methods. Note that a quadratic in ℝ^n will be expressed as f(x) = x^T A x + b^T x.

Sequential quadratic programming methods have a variety of applications in both academic and industrial problems, and are an active area of research in the field of numerical optimization. A particularly interesting set of examples that could use SQP methods involves optimal control. For example¹, consider the control IBVP for the wave equation,

\[ \frac{\partial^2 u}{\partial t^2} = c^2 \frac{\partial^2 u}{\partial x^2} \tag{1} \]

\[ u(x, 0) = \psi_1(x) \tag{2} \]
\[ \frac{\partial u}{\partial t}(x, 0) = \varphi_1(x) \tag{3} \]
\[ u(0, t) = \psi_2(t) \tag{4} \]

\[ u(L, t) = \varphi_2(t) \tag{5} \]

which can be used to model the vibrations of a one-dimensional string. Suppose we want to minimize the time taken to bring the string to rest. This problem can be represented as

\[ \min \int_0^T \left( \psi_2(t)^2 + \varphi_2(t)^2 \right) dt \quad \text{subject to} \quad \begin{cases} (1)\text{--}(5) \\ u(x, T) = \dfrac{\partial u}{\partial t}(x, T) = 0 \end{cases} \tag{6} \]

Stated more plainly, we want to minimize the squared boundary values of the hyperbolic PDE over the time interval, given constraints (1)–(5), which are the physical model of the wave equation, and a terminal boundary condition stating that the mechanical displacement of the “string” and its velocity are zero at the conclusion of the time interval, T. Omitting the details of the discretization, we can discretize this optimal control problem, converting it to

\[ \min \; k \sum_{i=1}^{N} \left[ (\varphi_2^i)^2 + (\psi_2^i)^2 \right] \quad \text{subject to} \quad \begin{cases} u_i^0 = \psi_1(x_i) \\ \psi_1^i = \psi_2^i \\ u_M^i = \varphi_2^i \\ \varphi_1(x, T) = 0 \end{cases} \]

Before, we had a problem in continuous time and space; we now see the same problem in discrete time across M equally spaced gridpoints. Gerdts solves this discretized control problem using a BFGS-modified SQP algorithm. Although this requires discretization, the fact that we are able to solve the control problem for the 1D wave equation using SQP methods is incredibly valuable.

This paper won’t focus on optimal control problems, but rather on the inner workings of trust-region SQP methods. We begin by reviewing trust-region methods, touch briefly on penalty methods with a focus on penalty/merit functions, and finally discuss various trust-region methods for sequential quadratic programming.

¹See [4].

1.2 A Review of Trust Region Methods

The general trust-region method minimizes an unconstrained objective by imposing an artificial constraint on the step length². Typically this constraint requires the step to be no longer than some length, ‖p‖ ≤ Δ, where p is the step, Δ is the maximum step length (often thought of as the “radius” of the trust region), and ‖·‖ is some norm defined on ℝ^n.

The formulation of the general trust region method appears as

\[ \min_{p \in \mathbb{R}^n} m_k(p) = f_k + g_k^T p + \frac{1}{2} p^T B_k p \quad \text{subject to} \quad \|p\| \leq \Delta \]

where m_k(p) is a quadratic model of our original function from the Taylor expansion. The model error³ here is O(‖p‖²) [6], so the model is accurate for small p. B_k is either the exact Hessian or an approximation to it. Trust-region methods, generally, are robust and have desirable convergence properties. Without giving away too much at the beginning, note that we can guarantee local convergence by establishing the equivalence of the trial step and the SQP step.

²Yuan [7] reviews trust-region methods and introduces the SQP subproblem.
³The error being |f(x_k + p) − m_k(p)|.
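To make the loop concrete, here is a minimal sketch of a generic trust-region iteration in Python; the Cauchy-point subproblem solver and all parameter values are illustrative choices of mine, not a prescription from the literature.

    import numpy as np

    def trust_region_minimize(f, grad, hess, x0, delta0=1.0, delta_max=10.0,
                              eta=0.15, tol=1e-8, max_iter=200):
        # Generic trust-region loop. The subproblem is "solved" at the Cauchy
        # point (steepest descent clipped to the region), a deliberately
        # simple choice; dogleg or exact solvers would slot in here instead.
        x, delta = np.asarray(x0, dtype=float), delta0
        for _ in range(max_iter):
            g, B = grad(x), hess(x)
            gnorm = np.linalg.norm(g)
            if gnorm < tol:
                break
            # Cauchy point: minimizer of the model along -g within ||p|| <= delta
            gBg = g @ B @ g
            tau = 1.0 if gBg <= 0 else min(gnorm**3 / (delta * gBg), 1.0)
            p = -tau * (delta / gnorm) * g
            # rho = actual reduction / predicted reduction, as in the text
            pred = -(g @ p + 0.5 * p @ B @ p)
            rho = (f(x) - f(x + p)) / pred
            if rho < 0.25:
                delta *= 0.25                        # poor model agreement: shrink
            elif rho > 0.75 and np.isclose(np.linalg.norm(p), delta):
                delta = min(2.0 * delta, delta_max)  # good step at the boundary: expand
            if rho > eta:                            # accept only sufficient decrease
                x = x + p
        return x

    # e.g. trust_region_minimize(lambda x: x @ x, lambda x: 2 * x,
    #                            lambda x: 2 * np.eye(x.size), np.array([3.0, -4.0]))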

1.3 A Review of Quadratic Programming

The sequential quadratic programming method solves a quadratic programming subproblem for a model function at each iterate. The general quadratic programming problem can be expressed as

\[ \min_{x \in \mathbb{R}^n} f(x) = \frac{1}{2} x^T G x + c^T x \quad \text{subject to} \quad \begin{cases} a_i^T x = b_i, & i \in \mathcal{E} \\ a_i^T x \leq b_i, & i \in \mathcal{I} \end{cases} \]

This is notably similar to the formulation of the trust-region subproblem, where the subproblem was expressed as a quadratic in the step for the model function. E denotes the set of equality constraints, meaning that constraint satisfaction requires that equality holds. I denotes the set of inequality constraints, meaning that as long as the inequality is satisfied, the constraint is satisfied. We can define the active set of constraint indices as those i such that a_i^T x = b_i, i ∈ E ∪ I, or the set of constraints for which equality holds. This will be useful later in considering active-set SQP methods. Also important to recall is that an optimum for an equality-constrained quadratic programming problem is given by the solution of the system

\[ \begin{pmatrix} G & A^T \\ A & 0 \end{pmatrix} \begin{pmatrix} x \\ \lambda \end{pmatrix} = \begin{pmatrix} -c \\ b \end{pmatrix} \]

where the leftmost matrix is known as the KKT matrix. We can guarantee a solution when A ∈ ℝ^{m×n} has rank m and G is positive definite.
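To make the KKT system concrete, the following sketch assembles and solves it directly for an equality-constrained QP; the helper name and the toy data are made up for illustration.

    import numpy as np

    def solve_eq_qp(G, c, A, b):
        # Solve min 0.5 x'Gx + c'x subject to Ax = b via the KKT system
        # [G A'; A 0][x; lam] = [-c; b]. Assumes A (m x n) has rank m and
        # G is positive definite, as in the text.
        m, n = A.shape
        K = np.block([[G, A.T], [A, np.zeros((m, m))]])
        sol = np.linalg.solve(K, np.concatenate([-c, b]))
        return sol[:n], sol[n:]   # primal solution x, multipliers lambda

    # Illustrative data: minimize 0.5(x1^2 + x2^2) subject to x1 + x2 = 1.
    x, lam = solve_eq_qp(np.eye(2), np.zeros(2),
                         np.array([[1.0, 1.0]]), np.array([1.0]))
    # x is approximately [0.5, 0.5]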

1.4 A Note on Merit Functions

1.4.1 The Basics of Merit Functions

Merit functions help ensure that iterative methods for constrained optimization converge to a solution, instead of blowing up or cycling. Because of the number of moving parts, and the fact that SQP methods solve an approximation of the original objective function, merit functions can be invaluable. An example of a merit function in rootfinding might be the sum of squared residuals: when f(x_{k+1}) − f(x_k) > f(x_k) − f(x_{k−1}), the iterate k + 1 isn’t a good one. We can account for this in finding the next iterate by including the residual

\[ r(x_k) = \frac{1}{2} \sum_{i=1}^{n} f_i(x_k)^2 \]

where x_k is the current iterate and f_i is the i-th component of the vector f(x) ∈ ℝ^n.
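As a small sketch of that residual merit function (the interface here is an assumption, not a standard one):

    import numpy as np

    def residual_merit(F, x):
        # Sum-of-squares merit r(x) = 0.5 * sum_i F_i(x)^2 for a vector-valued F.
        Fx = np.asarray(F(x))
        return 0.5 * float(Fx @ Fx)

    # A trial iterate is suspect when the merit fails to decrease, i.e. when
    # residual_merit(F, x_next) >= residual_merit(F, x_curr).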

1.4.2 Penalty Functions

Penalty functions are a special type of merit function that allow a constrained optimization problem to violate its constraints, but account for this in the function approximation. Penalty functions are frequently used in SQP methods. The ℓ1 penalty function is particularly relevant for SQP methods, and is expressed as

\[ \phi_1(x; \mu) := f(x) + \mu \left( \sum_{i \in \mathcal{E}} |c_i(x)| + \sum_{i \in \mathcal{I}} [c_i(x)]^- \right) \]

where [y]^− := max{0, −y}. Instead of minimizing an objective function subject to a set of constraints, a penalty function puts the constraints inside the objective. Theorem 17.3 from Nocedal and Wright [6] states that if x* is a strict local minimizer of some problem, then x* is a local minimizer of φ1(x; µ) for all µ > µ*, where µ* = ‖λ*‖_∞ = max_{i ∈ E ∪ I} |λ_i*|. Then a solution to the penalized NLP also yields a local solution to the general NLP, because any movement from the solution yields an increase in the penalty function. We will show an example of a penalized SQP that uses the ℓ1 penalty further on.
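A minimal sketch of evaluating φ1(x; µ) in Python, assuming the constraints are passed as lists of callables (an illustrative interface of my own):

    def l1_penalty(f, eq_cons, ineq_cons, x, mu):
        # phi_1(x; mu) = f(x) + mu * sum_{i in E} |c_i(x)| + mu * sum_{i in I} [c_i(x)]^-,
        # where [y]^- = max(0, -y) penalizes violated inequalities c_i(x) >= 0.
        violation = sum(abs(c(x)) for c in eq_cons)
        violation += sum(max(0.0, -c(x)) for c in ineq_cons)
        return f(x) + mu * violation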

2 Trust Region Methods for Sequential Quadratic Programming

SQP methods, as previously stated, are used for the minimization of C² nonlinear objective functions. The standard approximation method is to take the second-order Taylor approximation of the objective function for the model, that is,

\[ f(x_k + p) \approx f_k + \nabla f_k^T p + \frac{1}{2} p^T \nabla^2_{xx} f_k \, p = m_k(p), \]

with error O(‖p‖²), where the model function at iterate k is denoted m_k(p). If we add the constraint ‖p‖ ≤ Δ to this model function and attempt to minimize it, we get a problem identical to the original trust-region problem. This is why trust-region methods are a nice way to solve SQP problems. Also of concern for the SQP problem are the original constraints. Because the constraints may be nonlinear, we may have to linearize the constraints or introduce a penalty for constraint violation. Below we take these things into consideration.

The basic SQP method can also be thought of as Newton’s method for systems. If the KKT matrix from the section on quadratic programming is non-singular, a unique minimum for the model function is given by the solution of the system

\[ \begin{pmatrix} \nabla^2_{xx} L_k & -A_k^T \\ A_k & 0 \end{pmatrix} \begin{pmatrix} p_k \\ \lambda_{k+1} \end{pmatrix} = \begin{pmatrix} -\nabla f_k \\ -c_k \end{pmatrix}, \]

meaning that the next iterate is identical to the iterate given by applying Newton’s method. This is covered on pages 531–532 of Nocedal and Wright [6].
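A sketch of a single Newton-KKT iteration under the non-singularity assumption above; the callable interface (grad_f, cons, jac, hess_lag) is an assumption for illustration.

    import numpy as np

    def sqp_newton_step(grad_f, cons, jac, hess_lag, x, lam):
        # One equality-constrained SQP iteration: solve
        #   [ H  -A'] [ p       ]   [ -grad f(x) ]
        #   [ A   0 ] [ lam_new ] = [ -c(x)      ]
        # where H is the Hessian of the Lagrangian and A the constraint Jacobian.
        g, c, A = grad_f(x), cons(x), jac(x)
        H = hess_lag(x, lam)
        m, n = len(c), len(x)
        K = np.block([[H, -A.T], [A, np.zeros((m, m))]])
        sol = np.linalg.solve(K, np.concatenate([-g, -c]))
        p, lam_new = sol[:n], sol[n:]
        return x + p, lam_new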

We begin by discussing the general trust-region approach for SQP problems, then move on to examine an interior-point method for SQP problems. We then look at two active-set methods, Sℓ1QP and SLQP. Despite being active-set methods with names that are similar in appearance, these two methods are very different!

2.1 An Interior-Point Method

Interior-point methods generally find a minimum x* from the interior of the feasible region. One might imagine that if constraints are nonlinear, an interior-point method might make the problem more tractable, or have nicer convergence properties than an active-set method, which typically approaches a solution from the boundary of the feasible region. The method shown here is for equality-constrained nonlinear programming problems.

The general interior-point algorithm shown here was developed by Dennis, Heinkenschloss, and Vicente [3] (DHV). It was developed for “the solution of a class of minimization problems with nonlinear equality constraints. . . such nonlinear problems [may] arise from the discretization of optimal control problems.” This is the type of problem shown in the introduction, wherein we showed the discretization of the 1D wave equation, formulated as an optimal control problem. The DHV interior-point methods have a global convergence property because of their formulation as a trust-region problem, as long as an appropriate merit function for the selection of the trust radius is included. The general algorithm is well suited for high-dimensional problems, including optimal control problems governed by partial differential equations.

The affine scaling approach of the DHV methods allows the problem to maintain its original formulation as a discretized PDE subject to the same discretized constraints, as opposed to enforcing stricter constraints or using a penalty function to allow the violation of constraints. For more on affine scaling techniques in general, see [2].

The general DHV algorithm appears, with only minor changes from its presentation in [3], as follows:

Algorithm 1 Trust-Region Interior-Point SQP Algorithm⁴

1: Select x_0 such that a < (x_2)_0 < b and Δ_0 > 0, and calculate λ_0. Select α_1 ∈ (0, 1), η_1 ∈ (0, 1), σ ∈ (0, 1), 0 < Δ_min ≤ Δ_max, ρ̄ > 0, and ρ_{−1} ≥ 1.
2: for k = 0, 1, 2, . . . do
3:   Compute p_k^n such that ‖p_k^n‖ ≤ Δ_k.
4:   Compute (p_k)_u. This varies based on model specification. It must satisfy

\[ \sigma_k \left( a - (x_2)_k \right) \leq (p_k)_{x_2} \leq \sigma_k \left( b - (x_2)_k \right) \]

     where σ_k ∈ (σ, 1].
5:   Let p_k = p_k^n + W_k (s_k)_{x_1}.
6:   Compute λ_{k+1}. Let δλ_k = λ_{k+1} − λ_k.
7:   Compute the predicted reduction, which is given by

\[ \mathrm{pred}(p_k; \rho_{k-1}) = m_k(0) - m_k(p_k) - \delta\lambda_k^T (J_k p_k + C_k) + \rho_{k-1} \left( \|C_k\|^2 - \|J_k s_k + C_k\|^2 \right) \]

8:   if pred(p_k; ρ_{k−1}) ≥ (ρ_{k−1}/2) (‖C_k‖² − ‖J_k s_k + C_k‖²), then set ρ_k = ρ_{k−1};
9:   else set

\[ \rho_k = \frac{2 \left( m_k(p_k) - m_k(0) + \delta\lambda_k^T (J_k p_k + C_k) \right)}{\|C_k\|^2 - \|J_k s_k + C_k\|^2} + \bar{\rho} \]

10:  if the ratio of actual reduction to predicted reduction⁵ is less than η_1, set Δ_k according to model specification, and let x_{k+1} = x_k, λ_{k+1} = λ_k;
11:  else accept p_k and select Δ_{k+1} such that

\[ \max\{\Delta_{\min}, \Delta_k\} \leq \Delta_{k+1} \leq \Delta_{\max}, \]

     and let x_{k+1} = x_k + p_k and λ_{k+1} = λ_k + δλ_k.

Note that p = \begin{pmatrix} p_{x_1} \\ p_{x_2} \end{pmatrix} denotes a solution of some linearized state equation in the SQP subproblem, which is of the form

\[ \min_p \nabla f^T p + \frac{1}{2} p^T \nabla^2_{xx} L(x, \lambda) p \]

where L(x, λ) is the Lagrangian of the constrained problem, L(x, λ) = f(x) + λ^T C(x), and C(x) denotes the set of equality constraints. Then, given that the block C_{x_1}(x) of the constraint Jacobian is invertible (an assumption that guarantees a solution of the KKT system), a solution satisfies J(x)p = −C(x); that is, the product of the constraint Jacobian J(x) with the step p equals −C(x). Then

\[ p = \begin{pmatrix} p^{x_1} \\ p^{x_2} \end{pmatrix} = p^n + W(x) p^{x_2}, \qquad W(x) = \begin{pmatrix} -C_{x_1}(x)^{-1} C_{x_2}(x) \\ I_{n-m} \end{pmatrix}. \]

Then each W_i ∈ col(W) satisfies W_i ∈ N(J(x)). Note here the similarity between this method and the null-space method used for quadratic programming. We are essentially using the same process as is found on page 457 of [6] to solve the QP subproblem at each iterate in steps (3)–(7). So the algorithm solves a nonlinear optimal control problem using a model derived from the second-order Taylor approximation of the C² nonlinear objective and, granted the matrix of equality constraints is full-rank, it guarantees global convergence. Nice! The issue with this approach is the restriction that constraints be equality constraints. The active-set methods below allow for the relaxation of this assumption.
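Here is a minimal sketch of forming W(x) from a partitioned constraint Jacobian, assuming the leading m × m block C_{x_1} is invertible; the names mirror the text, and the random test data are illustrative.

    import numpy as np

    def nullspace_basis(J, m):
        # Given J = [C_x1 | C_x2] with an invertible m x m block C_x1, return
        # W = [[-C_x1^{-1} C_x2], [I_{n-m}]]; every column of W satisfies J @ W = 0.
        n = J.shape[1]
        C1, C2 = J[:, :m], J[:, m:]
        top = -np.linalg.solve(C1, C2)
        return np.vstack([top, np.eye(n - m)])

    # Quick check on random data: J @ W should be (numerically) zero.
    J = np.random.default_rng(0).standard_normal((2, 5))
    W = nullspace_basis(J, 2)
    assert np.allclose(J @ W, 0)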

⁴A similar algorithm can be found on p. 549 of Nocedal and Wright [6]. I’d recommend starting there.
⁵See [3], pp. 1764–1767, for more on this.

2.2 Active-Set Methods

Active-set methods typically approach an optimum x* from the boundary of the feasible region. The fundamental assumptions of the active-set methods presented below are the same as the assumptions of the interior-point methods above, namely that

(1) The Jacobian of the constraints, A(x) ∈ ℝ^{m×n}, has rank m.
(2) p^T ∇²_{xx} L(x, λ) p > 0 for all p ≠ 0 such that A(x)p = 0.

In active-set methods, the user recognizes that the best iterate may violate the linearized equality constraints, and so the aim is to bring each iterate closer to the feasible region while simultaneously minimizing the value of the objective function. We examine two different SQP methods: Sℓ1QP and SLQP.

2.2.1 Sequential ℓ1 Quadratic Programming

Sequential ℓ1 quadratic programming takes a quadratic model function and a set of linearized constraints and places them into the model of the objective function, converting the problem to the minimization of a penalized model function over a trust region. More directly, the problem becomes

\[ \min_p m_\mu(p) := f_k + \nabla f_k^T p + \frac{1}{2} p^T \nabla^2_{xx} L_k p + \mu \sum_{i \in \mathcal{E}} \left| c_i(x_k) + \nabla c_i(x_k)^T p \right| + \mu \sum_{i \in \mathcal{I}} \left[ c_i(x_k) + \nabla c_i(x_k)^T p \right]^- \quad \text{s.t.} \quad \|p\|_\infty \leq \Delta \]

Using the ∞-norm to define the trust region, we have a problem that can be written as a smooth quadratic program and solved at each iterate via a quadratic programming algorithm, like the null-space method mentioned previously or MATLAB’s quadprog⁶. The ℓ1 merit function used here is identical to the merit function defined in Section 1.4.2; call it, as before, φ1(x; µ). Then we determine the ratio of actual reduction to expected (predicted) reduction in the model function as

\[ \rho_k = \frac{\mathrm{ared}_k}{\mathrm{pred}_k} = \frac{\phi_1(x_k; \mu) - \phi_1(x_k + p_k; \mu)}{m_\mu(0) - m_\mu(p_k)} \]

as was done in Algorithm 1. Use standard trust-region methods to verify whether or not a step is “good”: e.g., if ρ_k > η, where η is some predefined value in (0, 1), accept the step; otherwise shrink the trust region and try again.
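A sketch of this acceptance test in Python; phi1 and m_mu stand in for the merit function and penalized model above, and the expansion/shrink factors are illustrative assumptions.

    def sl1qp_accept(phi1, m_mu, x, p, delta, eta=0.25):
        # Trust-region test for an S-ell1-QP step: rho = ared / pred, with the
        # ell-1 merit phi_1 and the penalized model m_mu passed as callables.
        ared = phi1(x) - phi1(x + p)        # actual reduction in the merit
        pred = m_mu(0.0 * p) - m_mu(p)      # predicted reduction m_mu(0) - m_mu(p)
        rho = ared / pred
        if rho > eta:                       # "good" step: accept it
            return x + p, 2.0 * delta       # and allow a larger region
        return x, 0.5 * delta               # otherwise shrink and retry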

Using the Sℓ1QP method allows us to avoid issues arising from violating the linearized constraints by focusing on solving what is essentially an unconstrained optimization problem over a trust region. Also, many methods require that the Hessian or Hessian approximation be positive definite in a region surrounding the minimizer x*. For this method, the Hessian need not be positive definite. Another way to think about this is that the penalty function φ1(x; µ) is the function we are minimizing, as opposed to the model function itself. If the Hessian of φ1(x; µ) weren’t positive definite, we might run into issues, but the Hessian of the model function doesn’t matter here. A proof of the fact that minimizing φ1(x; µ) yields a minimizer of the original objective function is provided on page 503 of Nocedal and Wright [6].

The behavior of the approximation, and thus the convergence of the method, depends on proper selection of the penalty parameter. Nocedal and Wright [6] (pp. 553–554) give a technique and algorithm for updating the penalty parameter at each iterate. I won’t go through all the details here, but the essence of the method is to select a penalty parameter such that the slack variables defined by making all constraints active are equal to zero. In other words, select a penalty parameter that places the model function in the feasible region. If this cannot be done, update the penalty parameter such that the step p_k achieves “at least a fraction of the optimal reduction given by p^∞.” The algorithm to update the penalty parameter (and compute the step), shown in the next section and found on page 554 of [6], can be used for both the Sℓ1QP and SLQP methods.

The key benefit of this particular method is a reduction in the number of equations that must hold with equality. Instead of having m equations in n unknowns, the system is essentially reduced to optimizing a single (albeit very big) quadratic programming problem over a trust region. The reason for characterizing this method as an active-set method isn’t yet obvious; when the algorithm for calculating the penalty parameter is introduced in the next section, it will be clear that the ultimate goal is to solve the penalized objective such that A = I ∪ E, or that the active set is the union of both sets of constraints, meaning that each constraint holds with equality. This is why the Sℓ1QP approach is deemed an active-set method.

⁶From the Optimization Toolbox.

2.2.2 Sequential Linear-Quadratic Programming

The sequential linear-quadratic programming method⁷ lessens the cost of solving the quadratic subproblem. It is a two-step method, the first step being to determine a working set W via the solution of a linear program. Next, the model function is solved via an equality-constrained quadratic programming method wherein the constraints are defined by the working set. The first step is to solve the problem whose iterate k is defined by

\[ \min_p f_k + \nabla f_k^T p \quad \text{subject to} \quad \begin{cases} c_i(x_k) + \nabla c_i(x_k)^T p = 0, & i \in \mathcal{E} \\ c_i(x_k) + \nabla c_i(x_k)^T p \geq 0, & i \in \mathcal{I} \\ \|p\|_\infty \leq \Delta_k^{LP} \end{cases} \]

As noted in the section on Sℓ1QP, the system of equations determined by the linearized constraints may be inconsistent. Instead of solving this system directly, we can solve it via the ℓ1 penalty method to avoid running into trouble with the constraints. To do so, reformulate the system above, introducing the slack variables z_i = ∇c_i(x_k)^T p + c_i(x_k), i ∈ E, and −t_i = ∇c_i(x_k)^T p + c_i(x_k), i ∈ I. Then we express the system as

\[ \min_p f_k + \nabla f_k^T p + \mu \left( \sum_{i \in \mathcal{E}} z_i + \sum_{i \in \mathcal{I}} t_i \right) \quad \text{s.t.} \quad \|p\|_\infty \leq \Delta_k^{LP}. \]

The simplex method can be used to solve this LP. In doing so, we determine an active set of constraints, given by

\[ \mathcal{A}_k(p^{LP}) = \{ i \in \mathcal{E} \mid z_i(p^{LP}) = 0 \} \cup \{ i \in \mathcal{I} \mid t_i(p^{LP}) = 0 \}, \]

where p^{LP} is given by the solution of the linear programming problem. Any i in the union of the two index sets E ∪ I such that i ∉ A_k belongs to V_k, the set of violated constraints. Define next the working set at iterate k, W_k ⊂ A_k. Finally, the Cauchy step is given by p^C = α^{LP} p^{LP}, where α^{LP} is a step length with upper bound one. Via Nocedal and Wright, p. 552, we use the Cauchy step to “ensure that the algorithm makes progress on the penalty function.”
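Here is a sketch of the LP phase using SciPy's linprog. To keep the LP exact, the equality residuals are split into nonnegative parts z⁺ and z⁻ (a standard reformulation of the |·| terms, slightly more explicit than the single z_i used above); the data layout is an assumption for illustration.

    import numpy as np
    from scipy.optimize import linprog

    def slqp_lp_phase(g, A_eq, c_eq, A_in, c_in, mu, delta):
        # ell-1-penalized LP phase of SLQP:
        #   min g'p + mu * sum(z+ + z-) + mu * sum(t)
        #   s.t. A_eq p + c_eq = z+ - z-,  A_in p + c_in >= -t,
        #        z+, z-, t >= 0,  ||p||_inf <= delta.
        n, me, mi = len(g), len(c_eq), len(c_in)
        # variable ordering: [p, z+, z-, t]
        cost = np.concatenate([g, mu * np.ones(2 * me + mi)])
        # equalities rewritten as: A_eq p - z+ + z- = -c_eq
        Aeq = np.hstack([A_eq, -np.eye(me), np.eye(me), np.zeros((me, mi))])
        # inequalities rewritten as: -A_in p - t <= c_in
        Aub = np.hstack([-A_in, np.zeros((mi, 2 * me)), -np.eye(mi)])
        bounds = [(-delta, delta)] * n + [(0, None)] * (2 * me + mi)
        res = linprog(cost, A_ub=Aub, b_ub=c_in, A_eq=Aeq, b_eq=-c_eq,
                      bounds=bounds)
        return res.x[:n]   # the LP step p^LP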

With W_k and p^C defined via the solution of the LP, we can formulate an equality-constrained QP as follows:

\[ \min_p f_k + \frac{1}{2} p^T \nabla^2_{xx} L_k p + \left( \nabla f_k + \mu_k \sum_{i \in \mathcal{V}_k} \gamma_i \nabla c_i(x_k) \right)^T p \quad \text{s.t.} \quad \begin{cases} c_i(x_k) + \nabla c_i(x_k)^T p = 0, & i \in \mathcal{W}_k \\ \|p\|_2 \leq \Delta_k \end{cases} \]

where γ_i inverts the sign of negative constraints. The trust region here is entirely different from the one used in the LP problem, in both shape and maximal value Δ. We can solve this using any EQP method, albeit Nocedal and Wright recommend the projected conjugate gradient method. The total step at iterate k is then given by p_k = p^C + α^m (p^m − p^C), α^m ∈ [0, 1], where p^m is the step derived from the EQP problem. The Cauchy step is used when the EQP step doesn’t provide a sufficient decrease in the objective function.

2.2.3 Computing the Step and Updating the Penalty Parameter

The method below shows how to compute the step and update the penalty parameter for SLQP and Sℓ1QP methods. This is a general algorithm, and a specific example may require a particular kind of implementation. Gerdts’s optimal control for the wave equation, for example, modifies the standard SQP method with a BFGS Hessian approximation.

2.3 Existing SQP Solvers

Some numerical solvers for SQP problems exist, implementing various methods depending on the problem at hand. MATLAB’s built-in fmincon [5] has an SQP option for medium-sized problems. fmincon is designed to find the minimum of an NLP problem with constraints. It defaults to an interior-point method, but it is unclear whether that is an interior-point SQP algorithm or not. The 'sqp' option runs an active-set SQP method. The only information the MATLAB documentation gives is that it satisfies bounds at all iterations; the specific SQP method is not clear.

⁷This entire section draws heavily from Nocedal and Wright, pp. 551–553, including notation. I add to the interpretation, but the methodology is pulled squarely from this section of [6].

Algorithm 2 Step Computation and Penalty Update for SLQP and Sℓ1QP Methods

1: Take as given x_k, µ_{k−1}, Δ_k > 0, and parameters ε_1, ε_2 ∈ (0, 1).
2: First solve

\[ \min_{p, z, t} m_\mu(p) = f_k + \nabla f_k^T p + \frac{1}{2} p^T \nabla^2_{xx} L_k p + \mu \left( \sum_{i \in \mathcal{E}} z_i + \sum_{i \in \mathcal{I}} t_i \right) \quad \text{s.t.} \quad \begin{cases} \nabla c_i(x_k)^T p + c_i(x_k) = z_i, & i \in \mathcal{E} \\ \nabla c_i(x_k)^T p + c_i(x_k) \geq -t_i, & i \in \mathcal{I} \\ z, t \geq 0 \\ \|p\|_\infty \leq \Delta \end{cases} \]

   with µ = µ_{k−1}, obtaining p(µ_{k−1}).
3: Denote the penalized portion of the penalty function at iterate k by ξ_k(·).
4: if ξ_k(p(µ_{k−1})) = 0, then let µ_k = µ_{k−1};
5: else compute the step p^∞;
6:   if ξ_k(p^∞) = 0, then find µ_{k+1} ≥ µ_{k−1} s.t. ξ_k(p(µ_{k+1})) = 0;
7:   else find µ_{k+1} ≥ µ_{k−1} such that

\[ \xi_k(0) - \xi_k(p(\mu_{k+1})) \geq \varepsilon_1 \left[ \xi_k(0) - \xi_k(p^\infty) \right]. \]

8: If µ_{k+1} is not such that

\[ m_{\mu_{k+1}}(0) - m_{\mu_{k+1}}(p(\mu_{k+1})) \geq \varepsilon_2 \, \mu_{k+1} \left[ \xi_k(0) - \xi_k(p(\mu_{k+1})) \right], \]

   increase µ_{k+1} until the above is satisfied.
9: µ_{k+1} becomes the new penalty parameter, and p_{k+1} = p(µ_{k+1}) becomes the new step.
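A schematic rendering of this update logic in Python; solve_qp, xi, and model are stand-ins for the subproblem solver, the penalized infeasibility ξ_k, and the model m_µ, and the tenfold increase of µ is an assumed update factor. Termination safeguards are omitted in this sketch.

    def update_penalty(solve_qp, xi, model, mu, p_inf, eps1=0.1, eps2=0.1):
        # Schematic of Algorithm 2: solve_qp(mu) returns the step p(mu);
        # xi(p) is the penalized infeasibility at step p; model(mu, p) is m_mu(p);
        # p_inf is the step p^inf of step 5.
        p = solve_qp(mu)
        zero = 0 * p
        if xi(p) > 0:                        # linearized constraints still violated
            if xi(p_inf) == 0:               # full feasibility is attainable:
                while xi(p) > 0:             # raise mu until p(mu) is feasible too
                    mu *= 10.0
                    p = solve_qp(mu)
            else:                            # otherwise demand a fraction of the
                while xi(zero) - xi(p) < eps1 * (xi(zero) - xi(p_inf)):
                    mu *= 10.0               # best possible infeasibility reduction
                    p = solve_qp(mu)
        # step 8: enforce sufficient model reduction relative to infeasibility decrease
        while model(mu, zero) - model(mu, p) < eps2 * mu * (xi(zero) - xi(p)):
            mu *= 10.0
            p = solve_qp(mu)
        return mu, p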

The KNITRO package [1] includes multiple SQP algorithm implementations for medium- to large-scale nonlinear programming problems. The KNITRO package was written in C, and offers both interior-point and active-set methods. The active-set method implemented by KNITRO is in fact the SLQP method described above. The original KNITRO package used a conjugate gradient method for step computation; KNITRO is on its fifth iteration now, and offers a variety of both trust-region and line-search approaches depending on the problem at hand. KNITRO automatically selects the appropriate solver for the problem. See Byrd, Nocedal, and Waltz [1] for more on the KNITRO package.
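For a Python analogue, SciPy exposes an SQP implementation (SLSQP, a line-search SQP rather than a trust-region one) through scipy.optimize.minimize; a minimal usage sketch with made-up problem data:

    import numpy as np
    from scipy.optimize import minimize

    # minimize f(x) = x1^2 + x2^2 subject to x1 + x2 = 1 and x1 >= 0.2
    res = minimize(
        lambda x: x[0]**2 + x[1]**2,
        x0=np.array([0.5, 0.5]),
        method="SLSQP",
        constraints=[
            {"type": "eq",   "fun": lambda x: x[0] + x[1] - 1.0},
            {"type": "ineq", "fun": lambda x: x[0] - 0.2},
        ],
    )
    # res.x is approximately [0.5, 0.5]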

2.4 Conclusions

SQP methods are useful for a variety of problems in nonlinear programming. SQP methods can prevent issues of degeneracy and often guarantee global convergence under mild assumptions. SQP methods are an active area of research in mathematical optimization, and provide useful insight into various applied problems in dynamical systems and partial differential equations. Although some SQP methods use interior-point techniques to find an optimum, the main competition for SQP methods as a dominant paradigm for solving nonlinear programming problems is interior-point methods. There is no dominant type of SQP method (line search or trust region), but trust-region methods that use a penalty parameter are able to avoid degeneracy and guarantee global convergence. The two active-set methods presented here fall into this category.

References

[1] Richard Byrd, Jorge Nocedal, and Richard Waltz. KNITRO: An integrated package for nonlinear optimization. February 6, 2006.

[2] Thomas F. Coleman and Yuying Li. An affine scaling trust region algorithm for nonlinear programming. 2000.

[3] J. Dennis, M. Heinkenschloss, and L. Vicente. Trust-region interior-point SQP algorithms for a class of nonlinear programming problems. SIAM Journal on Control and Optimization, pages 1750–1794, 1998.

[4] Matthias Gerdts. Optimal Control of ODEs and DAEs. De Gruyter, 2012.

[5] MathWorks. MATLAB R2016a Documentation, 2016.

[6] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer Series in Operations Research. Springer, 2006.

[7] Ya-Xiang Yuan. A review of trust-region algorithms for optimization, 2000.
