Appendix A Brief Review of Mathematical Optimization
*Ecce signum* (Behold the sign; see the proof)

Optimization is a mathematical discipline that determines a "best" solution in a quantitatively well-defined sense. It applies to certain mathematically defined problems in, e.g., science, engineering, mathematics, economics, and commerce. Optimization theory provides algorithms to solve well-structured optimization problems, along with the analysis of those algorithms; this analysis includes necessary and sufficient conditions for the existence of optimal solutions. Optimization problems are expressed in terms of variables (degrees of freedom) and their domain, an objective function [1] to be optimized, and, possibly, constraints. In unconstrained optimization, the optimum value of an objective function of several variables is sought without any constraints. If, in addition, constraints (equations, inequalities) are present, we have a constrained optimization problem. Often, the optimum value of such a problem is found on the boundary of the region defined by the inequality constraints.

In many cases, the domain $X$ of the unknown variables $x$ is $X = \mathbb{R}^n$. For such problems the reader is referred to Fletcher (1987), Gill et al. (1981), and Dennis & Schnabel (1983). Problems in which the domain $X$ is a discrete set, e.g., $X = \mathbb{N}$ or $X = \mathbb{Z}$, belong to the field of discrete optimization; cf. Nemhauser & Wolsey (1988). Discrete variables might be used, e.g., to select different physical laws (for instance, different limb-darkening laws) or models which cannot be smoothly mapped to each other. In what follows, we consider only continuous domains, i.e., $X = \mathbb{R}^n$.

In this appendix we assume that the reader is familiar with the basic concepts of calculus and linear algebra and with the standard nomenclature of mathematics and theoretical physics.

[1] The fitting of a model with free parameters to data leads to a minimization problem in which the objective function is the sum of squares of the residuals.

A.1 Unconstrained Optimization

Let $f : X = \mathbb{R}^n \to \mathbb{R}$, $x \mapsto f(x)$, be a continuous real-valued function. The problem

$$\mathrm{UCP}: \quad \min_x \{ f(x) \} \iff f_* := \min_x \{ f(x) \mid x \in X \} \tag{A.1.1}$$

is called an unconstrained minimization problem. A vector $x_*$ corresponding to the scalar $f_*$ is called a minimizing point or minimizer, $f_* = f(x_*)$, and is also expressed as

$$x_* = \operatorname*{arg\,min}_x \{ f(x) \} := \{ x \mid f(x) \le f(x'), \ \forall\, x' \in X \}. \tag{A.1.2}$$

Maximization problems can be transformed into minimization problems through the relation

$$\max_x \{ f(x) \} = -\min_x \{ -f(x) \}. \tag{A.1.3}$$

Therefore, it is sufficient to discuss minimization problems only.

A typical difficulty associated with nonlinear optimization is that in most cases it is possible to determine only a locally optimal solution, not the global optimum. Loosely speaking, the global optimum is the best of all possible solutions, whereas a local optimum is the best only in a neighborhood (for instance, the local sphere introduced on page 172) of $x_*$. The formal definition reads:

Definition A.1.1 Given a minimization problem, a point $x_* \in X$ is called a local minimum with respect to a neighborhood $N$ of $x_*$ (or simply local minimum) if

$$f(x_*) \le f(x), \quad \forall\, x \in N(x_*). \tag{A.1.4}$$

If $N(x_*) = X$, then $x_*$ is called a global minimum.
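To make the local-versus-global distinction concrete, here is a minimal sketch (our illustration, not from the text; the double-well function and the starting points are invented for the example, and NumPy/SciPy are assumed to be available). A local method started at different points converges to different local minima, only one of which is global.

```python
# Minimal sketch: a local minimizer returns different minima depending
# on the starting point. Function and starting points are illustrative.
import numpy as np
from scipy.optimize import minimize

def f(x):
    # Asymmetric double well: local minima near x = -1 and x = +1;
    # the tilt 0.3*x makes the one near x = -1 the global minimum.
    return (x[0]**2 - 1.0)**2 + 0.3 * x[0]

for x0 in (-2.0, 2.0):
    res = minimize(f, np.array([x0]), method="Nelder-Mead")
    print(f"start {x0:+.1f}  ->  x* = {res.x[0]:+.4f},  f(x*) = {res.fun:+.4f}")

# The run started at -2.0 finds the global minimum (f* < 0); the run
# started at +2.0 is trapped in the local minimum (f* > 0).
```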
In Definition A.1.1, $N(x_*)$ could be chosen, for instance, as the ball $B_\varepsilon(x_*)$ introduced on page 172.

Early methods used for minimization were ad hoc search methods that merely compare function values $f(x)$ at different points $x$. The most successful method of this group is the Simplex method [Spendley et al. (1962), Nelder & Mead (1965)] described in Chap. 4. In this group we also find the alternating variables method, in which in iteration $k$ ($k = 1, 2, \ldots, K$) the variable $x_k$ alone is changed in an attempt to reduce the value of the objective function, while the other variables are kept fixed. Both methods are easy to implement. They can handle nonsmooth functions, and they do not suffer from the requirement of many derivative-based optimization procedures, which depend on sufficiently accurate initial solutions for convergence. However, they sometimes converge very slowly. The most efficient way to use such methods is to derive an approximate solution for the minimization problem and use it to initialize a derivative-based minimization algorithm.

Derivative-based methods need the first- and possibly second-order derivatives, involving the following objects:

- Gradient of a scalar, real-valued differentiable function $f(x)$ of a vector $x$, $f : X = \mathbb{R}^n \to \mathbb{R}$, $x \mapsto f(x)$:

$$\nabla f(x) := \left( \frac{\partial}{\partial x_1} f(x), \ldots, \frac{\partial}{\partial x_n} f(x) \right) \in \mathbb{R}^n. \tag{A.1.5}$$

- Jacobian matrix [2] of a vector-valued function $F(x) = (f_1(x), \ldots, f_m(x))^{\mathrm{T}}$:

$$J(x) \equiv \nabla F(x) := \nabla (f_1(x), \ldots, f_m(x)) = \left( \frac{\partial}{\partial x_j} f_i(x) \right) \in M(m, n). \tag{A.1.6}$$

- Hessian matrix [3] of a real-valued function $f(x)$:

$$H(x) \equiv \nabla^2 f(x) := \left( \frac{\partial}{\partial x_i} \frac{\partial}{\partial x_j} f(x) \right) = \left( \frac{\partial^2}{\partial x_i \partial x_j} f(x) \right) \in M(n, n). \tag{A.1.7}$$

Note that the Hessian matrix of the scalar function $f(x)$ is the Jacobian matrix of its gradient, i.e., of the vector function $\nabla f(x)$.

[2] Named after Carl Gustav Jacob Jacobi (1804–1851).

[3] Named after the German mathematician Ludwig Otto Hesse (1811–1874). In German the matrix is correctly called the "Hesse matrix," but in English the not quite correct term "Hessian" is used; the proper spelling would be "Hesseian" or "Hessean." $M(m, n)$ denotes the set of all matrices with $m$ rows and $n$ columns.

Derivative-based minimization algorithms are local minimizers and may optionally involve the Jacobian $J$ and the Hessian $H$. In any case, we assume that $f(x)$ is sufficiently smooth so that the derivatives are continuous. As in the one-dimensional case, a local minimizer $x_*$ satisfies

$$\nabla f(x_*) = 0, \quad \nabla f(x_*) \in \mathbb{R}^n. \tag{A.1.8}$$

A Taylor series expansion about $x_*$ can be used to prove the following theorem on sufficient conditions:

Theorem A.1.2 Sufficient conditions for a strict and isolated local minimizer $x_*$ are that (A.1.8) is satisfied and that the Hessian $H_* := H(x_*)$ is positive definite, that is,

$$s^{\mathrm{T}} H_* s > 0, \quad \forall\, s \ne 0, \ s \in \mathbb{R}^n. \tag{A.1.9}$$

Here $s$ denotes any vector different from zero.

The widespread approach to solving the minimization problem (A.1.1) numerically is to apply line search algorithms. Given the solution $x_k$ of the $k$th iteration, the next iterate $x_{k+1}$ is computed according to the following scheme:

- determine a search direction $s_k$;
- solve the line search subproblem, i.e., compute an appropriate damping factor [4] $\alpha_k > 0$ which minimizes $f(x_k + \alpha s_k)$; and
- define $x_{k+1} := x_k + \alpha_k s_k$.

Different algorithms correspond to different ways of computing the search direction $s_k$. For damped methods there are several procedures to compute the damping factor $\alpha_k$; undamped methods do not solve the line search problem and simply put $\alpha_k = 1$.

[4] Usually, $f(x_k + \alpha s_k)$ is not exactly minimized with respect to $\alpha$. One possible heuristic is to evaluate $f$ for $\alpha_m = 2^{-m}$, $m = 0, 1, 2, \ldots$, and stop when $f(x_k + \alpha_m s_k) \le f(x_k)$.
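As a concrete illustration of the scheme above, the following sketch (ours, with illustrative names; NumPy-style array arithmetic is assumed) performs one damped step using the halving heuristic of footnote [4]: it tries $\alpha_m = 2^{-m}$ for $m = 0, 1, 2, \ldots$ and accepts the first step that lowers the objective.

```python
import numpy as np

def damped_step(f, x_k, s_k, m_max=30):
    """One line search step x_{k+1} = x_k + alpha_k * s_k.

    Implements the heuristic of footnote [4]: try alpha_m = 2**(-m)
    and accept the first improving step. (We require a strict
    decrease, slightly stronger than the '<=' in the footnote,
    to guarantee actual progress.)
    """
    f_k = f(x_k)
    for m in range(m_max):
        alpha = 2.0 ** (-m)
        x_new = x_k + alpha * s_k
        if f(x_new) < f_k:       # first improving damping factor wins
            return x_new
    return x_k                   # direction gave no improvement

# An undamped method corresponds to always returning x_k + s_k,
# i.e., alpha_k = 1 without any test.
```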
If $f(x_k + \alpha s_k)$ is exactly minimized, the gradient $\nabla f(x_{k+1})$ at the new point is orthogonal to the current search direction $s_k$, i.e., $s_k^{\mathrm{T}} \nabla f(x_{k+1}) = 0$. This follows from the necessary condition

$$0 = \frac{d}{d\alpha} f(x_k + \alpha_k s_k) = \left( \nabla f(x_k + \alpha_k s_k) \right)^{\mathrm{T}} s_k. \tag{A.1.10}$$

With this concept in mind, several methods can be classified. Descent methods are line search methods in which the search direction $s_k$ satisfies the descent property

$$\nabla f(x_k)^{\mathrm{T}} s_k < 0. \tag{A.1.11}$$

The steepest descent method results from $s_k = -\nabla f(x_k)$; obviously, (A.1.11) is fulfilled in this case. The gradient may be calculated analytically or may be approximated by finite differences with a small $h > 0$, e.g., by forward differences

$$\nabla_i f(x) := \frac{\partial f}{\partial x_i}(x) \approx \frac{f(x + h e_i) - f(x)}{h} = \nabla_i f(x) + \tfrac{1}{2} \nabla_i^2 f(x)\, h + \mathcal{O}(h^2) \tag{A.1.12}$$

or central differences

$$\nabla_i f(x) := \frac{\partial f}{\partial x_i}(x) \approx \frac{f(x + \tfrac{1}{2} h e_i) - f(x - \tfrac{1}{2} h e_i)}{h} = \nabla_i f(x) + \tfrac{1}{24} \nabla_i^3 f(x)\, h^2 + \mathcal{O}(h^4). \tag{A.1.13}$$

Here $e_i$ denotes the unit vector along the $i$th coordinate axis. From a numerical point of view, central differences are preferred because they are more accurate: their error is only of the order of $h^2$.

Another approach to computing the search direction $s_k$ is to expand the function $f(x)$ about $x_k$ in a Taylor series up to second order. Thus we approximate $f(x)$ locally by a quadratic model, i.e., with $x = x_k + s_k$ and $f(x) = f(x_k + s_k)$,

$$f(x_k + s_k) \approx f(x_k) + \nabla f(x_k)^{\mathrm{T}} s_k + \tfrac{1}{2} s_k^{\mathrm{T}} H_k s_k. \tag{A.1.14}$$

If both the gradient $\nabla f(x)$ and the Hessian matrix $H$ are known analytically, the classical Newton method can be derived from (A.1.14). Applying the necessary condition (A.1.8) to (A.1.14) leads to

$$H_k s_k = -\nabla f(x_k), \tag{A.1.15}$$

from which $s_k$ is computed according to

$$s_k = -H_k^{-1} \nabla f(x_k). \tag{A.1.16}$$

It can be shown [see, for instance, Bock (1987)] that the damped Newton method (the Newton method with line search) converges globally if $H_k$ is positive definite.
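Putting the pieces together, here is a compact sketch (our illustration, not the book's code) of the damped Newton method: the gradient is approximated by central differences (A.1.13), the Newton direction is obtained by solving the linear system (A.1.15) rather than forming the inverse in (A.1.16) explicitly, and the step is damped by halving. An analytic Hessian `hess(x)`, assumed positive definite, is supplied by the caller.

```python
import numpy as np

def grad_central(f, x, h=1e-6):
    """Central-difference approximation of the gradient, cf. (A.1.13)."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = 1.0
        g[i] = (f(x + 0.5 * h * e) - f(x - 0.5 * h * e)) / h
    return g

def damped_newton(f, hess, x0, tol=1e-8, max_iter=50):
    """Damped Newton iteration; assumes hess(x) is positive definite."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_central(f, x)
        if np.linalg.norm(g) < tol:        # necessary condition (A.1.8)
            break
        s = np.linalg.solve(hess(x), -g)   # Newton direction, (A.1.15)
        alpha, f_x = 1.0, f(x)
        while f(x + alpha * s) >= f_x and alpha > 1e-12:
            alpha *= 0.5                   # damping (line search)
        x = x + alpha * s
    return x

# Usage on a convex quadratic (one undamped Newton step suffices):
A = np.array([[3.0, 1.0], [1.0, 2.0]])
f = lambda x: x @ A @ x
hess = lambda x: 2.0 * A
print(damped_newton(f, hess, [1.0, 1.0]))  # approximately [0, 0]
```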