APM462 Lecture Notes
Yuchen Wang
December 28, 2019

Contents
1 Matrix Calculus
  1.1 Matrix Multiplication
  1.2 Partitioned Matrices
  1.3 Matrix Differentiation
2 Second-year Calculus Review
  2.1 Mean Value Theorem in 1 Dimension
  2.2 1st Order Taylor Approximation
  2.3 2nd Order Mean Value Theorem
  2.4 Recall: Definition of gradient
  2.5 Mean Value Theorem in n Dimensions
  2.6 1st Order Taylor Approximation in $\mathbb{R}^n$
  2.7 2nd Order Mean Value Theorem in $\mathbb{R}^n$
  2.8 2nd Order Taylor Approximation in $\mathbb{R}^n$
  2.9 Geometric Meaning of Gradient
  2.10 Implicit Function Theorem
  2.11 Level Sets of f
3 Convex Sets & Functions
  3.1 Definitions
  3.2 Basic Properties of Convex Functions
  3.3 Criterions for convexity
  3.4 Minimization and Maximization of Convex Functions
4 Basics of Unconstrained Optimization
  4.1 Extreme Value Theorem
  4.2 Unconstrained Optimization
  4.3 1st order necessary condition for local minimum
  4.4 2nd order necessary condition for local minimum
  4.5 2nd order sufficient condition (for interior points)
5 Optimization with Equality Constraints
  5.1 Definitions of Related Spaces
  5.2 Lagrange Multipliers: 1st order necessary condition for local minimum
  5.3 2nd order necessary condition for local minimum
  5.4 2nd order sufficient condition for local minimum
6 Optimization with Inequality Constraints
  6.1 Kuhn-Tucker conditions: 1st order necessary condition for local minimum
  6.2 2nd order necessary conditions for local minimum
  6.3 2nd order sufficient conditions
7 Different Computation Methods for Solving Optimum
  7.1 Newton's Method
  7.2 Method of Steepest Descent (Gradient Method)
  7.3 Method of Conjugate Direction
    7.3.1 Geometric Interpretations of Method of Conjugate Directions
8 Calculus of Variations
  8.1 Example
  8.2 Classical Problem: the Brachistochrone
  8.3 General class of problems in Calculus of Variations
  8.4 Euler-Lagrange Equations in $\mathbb{R}^n$
  8.5 Equality constraints
    8.5.1 Isoperimetric constraints
    8.5.2 Holonomic constraints

1 Matrix Calculus

Row vs. Column Vectors
Our default rule is that every vector is a column vector unless explicitly stated otherwise. This convention is also known as the numerator layout.
Special case: for $f : \mathbb{R}^n \to \mathbb{R}$, $Df$ is a $1 \times n$ matrix, i.e. a row vector.

1.1 Matrix Multiplication

Definition 1.1.1 Let $A$ be $m \times n$, let $B$ be $n \times p$, and let the product be $C = AB$. Then $C$ is an $m \times p$ matrix whose element $(i, j)$ is given by
$$c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$$
for all $i = 1, 2, \dots, m$ and $j = 1, 2, \dots, p$.

Proposition 1.1.2 Let $A$ be $m \times n$ and $x$ be $n \times 1$. Then the typical element of the product $z = Ax$ is given by
$$z_i = \sum_{k=1}^{n} a_{ik} x_k$$
for all $i = 1, 2, \dots, m$. Similarly, let $y$ be $m \times 1$; then the typical element of the product $z^T = y^T A$ is given by
$$z_i = \sum_{k=1}^{m} a_{ki} y_k$$
for all $i = 1, 2, \dots, n$. Finally, the scalar resulting from the product $\alpha = y^T A x$ is given by
$$\alpha = \sum_{j=1}^{m} \sum_{k=1}^{n} a_{jk} y_j x_k.$$
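The summation formulas above are easy to sanity-check numerically. The following NumPy sketch (not part of the original notes; the dimensions, seed, and variable names are arbitrary choices) builds $c_{ij}$ and $\alpha = y^T A x$ directly from the sums in Definition 1.1.1 and Proposition 1.1.2 and compares them with NumPy's built-in matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p = 3, 4, 2                      # arbitrary dimensions
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, p))
x = rng.standard_normal(n)
y = rng.standard_normal(m)

# c_ij = sum_k a_ik b_kj  (Definition 1.1.1)
C = np.array([[sum(A[i, k] * B[k, j] for k in range(n)) for j in range(p)]
              for i in range(m)])
assert np.allclose(C, A @ B)

# alpha = sum_j sum_k a_jk y_j x_k  (Proposition 1.1.2)
alpha = sum(A[j, k] * y[j] * x[k] for j in range(m) for k in range(n))
assert np.isclose(alpha, y @ A @ x)
```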
1.2 Partitioned Matrices

Proposition 1.2.1 Let $A$ be a square, nonsingular matrix of order $m$. Partition $A$ as
$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$$
so that $A_{11}$ and $A_{22}$ are invertible. Then
$$A^{-1} = \begin{pmatrix} (A_{11} - A_{12}A_{22}^{-1}A_{21})^{-1} & -A_{11}^{-1}A_{12}(A_{22} - A_{21}A_{11}^{-1}A_{12})^{-1} \\ -A_{22}^{-1}A_{21}(A_{11} - A_{12}A_{22}^{-1}A_{21})^{-1} & (A_{22} - A_{21}A_{11}^{-1}A_{12})^{-1} \end{pmatrix}$$
proof: Direct multiplication of the proposed $A^{-1}$ and $A$ yields $A^{-1}A = I$.

1.3 Matrix Differentiation

Proposition 1.3.1 Differentiation commutes with transposition:
$$\left( \frac{\partial A}{\partial x} \right)^T = \frac{\partial A^T}{\partial x}$$

Proposition 1.3.2 Let $y = Ax$, where $y$ is $m \times 1$, $x$ is $n \times 1$, $A$ is $m \times n$, and $A$ does not depend on $x$. Suppose that $x$ is a function of the vector $z$, while $A$ is independent of $z$. Then
$$\frac{\partial y}{\partial z} = A \frac{\partial x}{\partial z}$$

Proposition 1.3.3 Let the scalar $\alpha$ be defined by $\alpha = y^T A x$, where $y$ is $m \times 1$, $x$ is $n \times 1$, $A$ is $m \times n$, and $A$ is independent of $x$ and $y$. Then
$$\frac{\partial \alpha}{\partial x} = y^T A$$
and
$$\frac{\partial \alpha}{\partial y} = x^T A^T$$

Proposition 1.3.4 For the special case where the scalar $\alpha$ is given by the quadratic form $\alpha = x^T A x$, where $x$ is $n \times 1$, $A$ is $n \times n$, and $A$ does not depend on $x$,
$$\frac{\partial \alpha}{\partial x} = x^T (A + A^T)$$
proof: By definition,
$$\alpha = \sum_{j=1}^{n} \sum_{i=1}^{n} a_{ij} x_i x_j$$
Differentiating with respect to the $k$th element of $x$, we have
$$\frac{\partial \alpha}{\partial x_k} = \sum_{j=1}^{n} a_{kj} x_j + \sum_{i=1}^{n} a_{ik} x_i$$
for all $k = 1, 2, \dots, n$, and consequently
$$\frac{\partial \alpha}{\partial x} = x^T A^T + x^T A = x^T (A^T + A)$$

Proposition 1.3.4 (symmetric case) For the special case where $A$ is a symmetric matrix and $\alpha = x^T A x$, where $x$ is $n \times 1$, $A$ is $n \times n$, and $A$ does not depend on $x$,
$$\frac{\partial \alpha}{\partial x} = 2 x^T A$$

Proposition 1.3.5 Let the scalar $\alpha$ be defined by $\alpha = y^T x$, where $y$ is $n \times 1$, $x$ is $n \times 1$, and both $y$ and $x$ are functions of the vector $z$. Then
$$\frac{\partial \alpha}{\partial z} = x^T \frac{\partial y}{\partial z} + y^T \frac{\partial x}{\partial z}$$

Proposition 1.3.6 Let the scalar $\alpha$ be defined by $\alpha = x^T x$, where $x$ is $n \times 1$ and $x$ is a function of the vector $z$. Then
$$\frac{\partial \alpha}{\partial z} = 2 x^T \frac{\partial x}{\partial z}$$

Proposition 1.3.7 Let the scalar $\alpha$ be defined by $\alpha = y^T A x$, where $y$ is $m \times 1$, $A$ is $m \times n$, $x$ is $n \times 1$, and both $y$ and $x$ are functions of the vector $z$, while $A$ does not depend on $z$. Then
$$\frac{\partial \alpha}{\partial z} = x^T A^T \frac{\partial y}{\partial z} + y^T A \frac{\partial x}{\partial z}$$

Proposition 1.3.8 Let $A$ be an invertible $m \times m$ matrix whose elements are functions of the scalar parameter $\alpha$. Then
$$\frac{\partial A^{-1}}{\partial \alpha} = -A^{-1} \frac{\partial A}{\partial \alpha} A^{-1}$$
proof: Start with the definition of the inverse, $A^{-1}A = I$, and differentiate, yielding
$$A^{-1} \frac{\partial A}{\partial \alpha} + \frac{\partial A^{-1}}{\partial \alpha} A = 0$$
Rearranging the terms yields
$$\frac{\partial A^{-1}}{\partial \alpha} = -A^{-1} \frac{\partial A}{\partial \alpha} A^{-1}$$

Vector-by-vector Differentiation Identities 1.3.9

Young's Theorem 1.3.10 (symmetry of second derivatives)
$$[\nabla_{xy} f(x, y)]^T = \nabla_{yx} f(x, y)$$
proof: This is straightforward by writing out the elements of the matrix.

2 Second-year Calculus Review

Functions $\mathbb{R} \to \mathbb{R}$:

2.1 Mean Value Theorem in 1 Dimension
Let $g \in C^1$ on $\mathbb{R}$. Then
$$\frac{g(x + h) - g(x)}{h} = g'(x + \theta h)$$
for some $\theta \in (0, 1)$, or equivalently,
$$g(x + h) = g(x) + h g'(x + \theta h)$$

2.2 1st Order Taylor Approximation
Let $g \in C^1$ on $\mathbb{R}$. Then
$$g(x + h) = g(x) + h g'(x) + o(h)$$
where $o(h)$ is "little o" of $h$, the error term. Saying that a function satisfies $f(h) = o(h)$ means
$$\lim_{h \to 0} \frac{f(h)}{h} = 0$$
For example, for $f(h) = h^2$ we can say $f(h) = o(h)$, since
$$\lim_{h \to 0} \frac{f(h)}{h} = \lim_{h \to 0} \frac{h^2}{h} = \lim_{h \to 0} h = 0$$
proof (using the MVT): We want to show $g(x + h) - g(x) - h g'(x) = o(h)$:
$$\lim_{h \to 0} \frac{[g(x + h) - g(x)] - h g'(x)}{h} = \lim_{h \to 0} \frac{[h g'(x + \theta h)] - h g'(x)}{h} = \lim_{h \to 0} \big(g'(x + \theta h) - g'(x)\big) = \lim_{h \to 0} \big(g'(x) - g'(x)\big) = 0$$
where the second-to-last equality uses the continuity of $g'$.

2.3 2nd Order Mean Value Theorem
Let $g \in C^2$ on $\mathbb{R}$. Then
$$g(x + h) = g(x) + h g'(x) + \frac{h^2}{2} g''(x + \theta h)$$
for some $\theta \in (0, 1)$.
proof (of the corresponding 2nd order Taylor approximation): We want to show $g(x + h) - g(x) - h g'(x) - \frac{h^2}{2} g''(x) = o(h^2)$:
$$\lim_{h \to 0} \frac{g(x+h) - g(x) - h g'(x) - \frac{h^2}{2} g''(x)}{h^2} = \lim_{h \to 0} \frac{\left[\frac{h^2}{2} g''(x + \theta h)\right] - \frac{h^2}{2} g''(x)}{h^2} = \lim_{h \to 0} \frac{1}{2}\big(g''(x + \theta h) - g''(x)\big) = \lim_{h \to 0} \frac{1}{2}\big(g''(x) - g''(x)\big) = 0$$
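The 2nd order expansion can be illustrated numerically: the remainder $g(x+h) - g(x) - h g'(x) - \frac{h^2}{2} g''(x)$, divided by $h^2$, should shrink to 0 as $h \to 0$. A minimal sketch (not part of the original notes; $g = \sin$ and the evaluation point are arbitrary choices):

```python
import numpy as np

# g, g', g'' for an arbitrary C^2 function (here g = sin)
g, dg, d2g = np.sin, np.cos, lambda t: -np.sin(t)
x = 0.7

for h in [1e-1, 1e-2, 1e-3, 1e-4]:
    remainder = g(x + h) - (g(x) + h * dg(x) + 0.5 * h**2 * d2g(x))
    print(h, remainder / h**2)   # the ratio shrinks toward 0, consistent with o(h^2)
```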
Multivariate functions $\mathbb{R}^n \to \mathbb{R}$:

2.4 Recall: Definition of gradient
The gradient of $f : \mathbb{R}^n \to \mathbb{R}$ at $x \in \mathbb{R}^n$ (denoted $\nabla f(x)$), if it exists, is the vector characterized by the property
$$\lim_{v \to 0} \frac{f(x + v) - f(x) - \nabla f(x) \cdot v}{\|v\|} = 0$$
In Cartesian coordinates,
$$\nabla f(x) = \left( \frac{\partial f}{\partial x_1}(x), \dots, \frac{\partial f}{\partial x_n}(x) \right)$$

2.5 Mean Value Theorem in n Dimensions
Let $f \in C^1$ on $\mathbb{R}^n$. Then for any $x, v \in \mathbb{R}^n$,
$$f(x + v) = f(x) + \nabla f(x + \theta v) \cdot v$$
for some $\theta \in (0, 1)$.
proof: Reduce to the 1-dimensional case. Define $g(t) := f(x + t v)$ for $t \in \mathbb{R}$. Then
$$g'(t) = \frac{d}{dt} f(x + t v) = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i}(x + t v) \cdot \frac{d(x + t v)_i}{dt} \quad \text{(by the Chain Rule)}$$
$$= \sum_{i} \frac{\partial f}{\partial x_i}(x + t v) \cdot \frac{d(x_i + t v_i)}{dt} = \sum_{i} \frac{\partial f}{\partial x_i}(x + t v) \cdot v_i = \nabla f(x + t v) \cdot v \quad (*)$$
Since $f \in C^1$ on $\mathbb{R}^n$, $g \in C^1$ on $\mathbb{R}$. Using the MVT in $\mathbb{R}$,
$$f(x + v) = g(1) = g(0 + 1) = g(0) + 1 \cdot g'(0 + \theta \cdot 1) \quad (\theta \in (0, 1))$$
$$= g(0) + g'(\theta) = f(x) + \nabla f(x + \theta v) \cdot v \quad \text{(by (*))}$$

2.6 1st Order Taylor Approximation in $\mathbb{R}^n$
Let $f \in C^1$ on $\mathbb{R}^n$. Then
$$f(x + v) = f(x) + \nabla f(x) \cdot v + o(\|v\|)$$
proof:
$$\lim_{\|v\| \to 0} \frac{[f(x + v) - f(x)] - \nabla f(x) \cdot v}{\|v\|} = \lim_{\|v\| \to 0} \frac{[\nabla f(x + \theta v) \cdot v] - \nabla f(x) \cdot v}{\|v\|} = \lim_{\|v\| \to 0} \big[\nabla f(x + \theta v) - \nabla f(x)\big] \cdot \frac{v}{\|v\|} = 0$$
since $\frac{v}{\|v\|}$ remains a unit vector while $\nabla f(x + \theta v) - \nabla f(x) \to 0$ by continuity of $\nabla f$.

2.7 2nd Order Mean Value Theorem in $\mathbb{R}^n$
Let $f \in C^2$ on $\mathbb{R}^n$. Then
$$f(x + v) = f(x) + \nabla f(x) \cdot v + \frac{1}{2} v^T \nabla^2 f(x + \theta v) \, v$$
for some $\theta \in (0, 1)$.
Remark: In this course, $\nabla^2$ means the Hessian, not the Laplacian.
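The theorem above implies the quadratic approximation $f(x+v) \approx f(x) + \nabla f(x) \cdot v + \frac{1}{2} v^T \nabla^2 f(x)\, v$ with error $o(\|v\|^2)$, i.e. the 2nd order Taylor approximation treated next in the notes. A minimal numerical sketch, assuming the arbitrary smooth example $f(x_1, x_2) = x_1^2 x_2 + \sin x_2$ (not from the notes) with a hand-computed gradient and Hessian:

```python
import numpy as np

def f(x):
    return x[0]**2 * x[1] + np.sin(x[1])

def grad_f(x):
    # gradient: (2 x1 x2, x1^2 + cos x2)
    return np.array([2 * x[0] * x[1], x[0]**2 + np.cos(x[1])])

def hess_f(x):
    # Hessian of f
    return np.array([[2 * x[1], 2 * x[0]],
                     [2 * x[0], -np.sin(x[1])]])

x = np.array([1.0, 0.5])
d = np.array([0.3, -0.4])              # fixed direction; v = t*d shrinks with t
for t in [1e-1, 1e-2, 1e-3]:
    v = t * d
    quad = f(x) + grad_f(x) @ v + 0.5 * v @ hess_f(x) @ v
    print(t, abs(f(x + v) - quad) / (v @ v))   # ratio -> 0, i.e. the error is o(||v||^2)
```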