
Paul Klein
Stockholms universitet
October 1999

Calculus with vectors and matrices

1 Differential calculus

1.1 The gradient and the Hessian

The purpose of this section is to make sense of expressions like

$$\frac{\partial f(x)}{\partial x^T} = \nabla_x f(x) = \nabla f(x) = f'(x) = f_x(x) \tag{1}$$
where $f : \mathbb{R}^n \to \mathbb{R}^m$. Of course, we already know what a partial derivative is and how to calculate it. What this section will tell us is how to arrange the partial derivatives into a matrix (the gradient), and the rules of arithmetic that follow from adopting our particular arrangement convention.

Definition 1.1 Let $f : \mathbb{R}^n \to \mathbb{R}^m$ have partial derivatives at $x$. Then
$$\underbrace{\frac{\partial f(x)}{\partial x^T}}_{m \times n} \triangleq \begin{bmatrix} \dfrac{\partial f_1(x)}{\partial x_1} & \cdots & \dfrac{\partial f_1(x)}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f_m(x)}{\partial x_1} & \cdots & \dfrac{\partial f_m(x)}{\partial x_n} \end{bmatrix} \tag{2}$$

and
$$\underbrace{\frac{\partial f(x)}{\partial x}}_{n \times m} \triangleq \left( \frac{\partial f(x)}{\partial x^T} \right)^T \tag{3}$$
where $A^T$ is the transpose of $A$.

Definition 1.2 Let $f : \mathbb{R}^n \to \mathbb{R}^n$ have partial derivatives at $x$. Then the (scalar-valued) Jacobian of $f$ at $x$ is defined via

$$J_f(x) \triangleq \det \frac{\partial f(x)}{\partial x^T}. \tag{4}$$

Remark 1.1 Sometimes the gradient $\dfrac{\partial f(x)}{\partial x^T}$ itself is called the Jacobian. Here the Jacobian is defined as the determinant of the gradient.

The following properties of the gradient follow straightforwardly from the definition.

Proposition 1.1 1. Let x be an n × 1 vector and A an m × n matrix. Then

$$\frac{\partial}{\partial x^T} \left[ Ax \right] = A. \tag{5}$$

2. Let x be an n × 1 vector and A an n × m matrix. Then

$$\frac{\partial}{\partial x^T} \left[ x^T A \right] = A^T. \tag{6}$$

3. Let x be an n × 1 vector and A an n × n matrix. Then

$$\frac{\partial}{\partial x^T} \left[ x^T A x \right] = x^T \left( A + A^T \right). \tag{7}$$

4. Let x be an n × 1 vector and A an n × n symmetric matrix. Then

$$\frac{\partial}{\partial x^T} \left[ x^T A x \right] = 2 x^T A. \tag{8}$$
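These rules are easy to check numerically. The following is a minimal sketch in Python with NumPy (not part of the original notes; the helper name `num_gradient`, the random seed and the tolerance are illustrative choices) that compares rule (7) with a finite-difference approximation of the gradient.

```python
import numpy as np

def num_gradient(f, x, eps=1e-6):
    """Central finite-difference approximation of the 1 x n gradient df/dx^T."""
    g = np.zeros((1, x.size))
    for i in range(x.size):
        e = np.zeros(x.size)
        e[i] = eps
        g[0, i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

f = lambda z: z @ A @ z                  # f(x) = x^T A x
analytic = (x @ (A + A.T))[None, :]      # rule (7): x^T (A + A^T)
print(np.allclose(num_gradient(f, x), analytic, atol=1e-5))  # True
```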

If f is scalar-valued, it is straightforward to define the second derivative (Hessian) as follows.

Definition 1.3 Let $f : \mathbb{R}^n \to \mathbb{R}$ have continuous first and second partial derivatives at $x$ (so as to satisfy the requirements of Young's theorem). Then
$$\underbrace{\frac{\partial^2 f(x)}{\partial x \partial x^T}}_{n \times n} \triangleq \begin{bmatrix} \dfrac{\partial^2 f(x)}{\partial x_1^2} & \cdots & \dfrac{\partial^2 f(x)}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f(x)}{\partial x_n \partial x_1} & \cdots & \dfrac{\partial^2 f(x)}{\partial x_n^2} \end{bmatrix} \triangleq f''(x). \tag{9}$$

Note that, by Young's theorem, the Hessian of a scalar-valued function is symmetric.

Proposition 1.2 Let $f(x) \triangleq x^T A x$ where $A$ is symmetric. Then

$$\frac{\partial^2 f(x)}{\partial x \partial x^T} = 2A. \tag{10}$$
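A corresponding numerical check of Proposition 1.2, again as a Python/NumPy sketch (the helper name, seed, step size and tolerance are illustrative, not part of the original notes):

```python
import numpy as np

def num_hessian(f, x, eps=1e-5):
    """Central finite-difference approximation of the n x n Hessian."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.zeros(n), np.zeros(n)
            ei[i], ej[j] = eps, eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps ** 2)
    return H

rng = np.random.default_rng(1)
n = 3
M = rng.standard_normal((n, n))
A = (M + M.T) / 2                        # a symmetric matrix
x = rng.standard_normal(n)

H = num_hessian(lambda z: z @ A @ z, x)
print(np.allclose(H, 2 * A, atol=1e-4))  # Proposition 1.2: the Hessian is 2A
```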

Occasionally we run into matrix-valued functions, and the way forward then is

to vectorize and then differentiate.

Definition 1.4 Let $A = \begin{bmatrix} a_1 & a_2 & \cdots & a_n \end{bmatrix}$ be an $m \times n$ matrix, where each column $a_i$ is $m \times 1$. Then
$$\underbrace{\operatorname{vec}(A)}_{mn \times 1} \triangleq \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix}. \tag{11}$$

Definition 1.5 Let f : Rk → Rn×m have partial derivatives at x. Then

$$\underbrace{\frac{\partial f(x)}{\partial x^T}}_{nm \times k} \triangleq \frac{\partial \operatorname{vec} f(x)}{\partial x^T}. \tag{12}$$

Having defined the vec operator, we quickly run into cases where we need the

Kronecker product, defined as follows.

Definition 1.6 Let $A$ ($m \times n$) and $B$ ($k \times l$) be matrices. Denote the element in the $i$th row and $j$th column of $A$ by $a_{ij}$. Then

$$\underbrace{A \otimes B}_{mk \times ln} \triangleq \begin{bmatrix} a_{11} B & \cdots & a_{1n} B \\ \vdots & \ddots & \vdots \\ a_{m1} B & \cdots & a_{mn} B \end{bmatrix}. \tag{13}$$

Proposition 1.3 Let $\underset{k \times l}{A}$, $\underset{m \times n}{B}$ and $\underset{p \times q}{C}$ be conformable matrices (so that the product $ABC$ is defined). Then

$$\operatorname{vec}(ABC) = \left( C^T \otimes A \right) \operatorname{vec}(B). \tag{14}$$

Proof. Exercise.
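A quick numerical sanity check of (14), as a Python/NumPy sketch (the column-stacking `vec` helper mirrors Definition 1.4; the dimensions and seed are arbitrary illustrative choices):

```python
import numpy as np

def vec(M):
    """Stack the columns of M into one long column (Definition 1.4)."""
    return M.T.reshape(-1, 1)            # column-major (Fortran-order) stacking

rng = np.random.default_rng(2)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((4, 5))

lhs = vec(A @ B @ C)
rhs = np.kron(C.T, A) @ vec(B)
print(np.allclose(lhs, rhs))             # True: vec(ABC) = (C^T ⊗ A) vec(B)
```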

Occasionally we find ourselves wanting to differentiate a vector-valued function with respect to a matrix. Again the way forward is to vectorize.

Proposition 1 Whenever the following expressions are defined, they are true.

The trace of a matrix A is denoted by tr (A). [Various rules of arithmetic omitted in this version. See the bibliography for sources.]

Definition 1.7 Let $f : \mathbb{R}^{n \times m} \to \mathbb{R}^k$ have partial derivatives at $A$. Then

$$\underbrace{\frac{\partial f(A)}{\partial A^T}}_{k \times nm} \triangleq \frac{\partial f(A)}{\partial (\operatorname{vec} A)^T}. \tag{15}$$

Example 1.1 Let $f : \mathbb{R}^{n \times m} \to \mathbb{R}^n$ be defined via $f(\Phi) \triangleq \Phi k$ where $k \in \mathbb{R}^m$ is a constant vector. Then $f(\Phi) = \left( k^T \otimes I_n \right) \operatorname{vec} \Phi$ and hence
$$\frac{\partial f(\Phi)}{\partial \Phi^T} = k^T \otimes I_n. \tag{16}$$
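The identity $\operatorname{vec}(\Phi k) = (k^T \otimes I_n)\operatorname{vec}\Phi$ used in this example can be checked in a few lines (Python/NumPy sketch, not part of the original notes; dimensions and seed are arbitrary):

```python
import numpy as np

vec = lambda M: M.T.reshape(-1, 1)       # column-stacking vec, as in Definition 1.4

rng = np.random.default_rng(3)
n, m = 3, 4
Phi = rng.standard_normal((n, m))
k = rng.standard_normal((m, 1))

lhs = vec(Phi @ k)                               # f(Phi) = Phi k, an n x 1 vector
rhs = np.kron(k.T, np.eye(n)) @ vec(Phi)         # (k^T ⊗ I_n) vec(Phi)
print(np.allclose(lhs, rhs))                     # True, so the gradient is k^T ⊗ I_n
```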

We are now in a position to state rather general versions of the product and chain rules for matrices.

1.2 The product rule

Proposition 1.4 (the product rule) Let A : Rl → Rn×m and B : Rl → Rm×k have partial derivatives at x ∈ Rl. Then

$$\frac{\partial}{\partial x^T} \left[ A(x) B(x) \right] = \left( B(x)^T \otimes I_n \right) \frac{\partial \operatorname{vec} A(x)}{\partial x^T} + \left( I_k \otimes A(x) \right) \frac{\partial \operatorname{vec} B(x)}{\partial x^T}. \tag{17}$$

Kind-of proof. Suppose A (x) ≡ A. Then, by Proposition 1.3,

$$\operatorname{vec}(A B(x)) = (I_k \otimes A) \operatorname{vec} B(x). \tag{18}$$

Since differentiation is a linear operator, it follows that

$$\frac{\partial \operatorname{vec}(A B(x))}{\partial x^T} = (I_k \otimes A) \frac{\partial \operatorname{vec} B(x)}{\partial x^T}. \tag{19}$$

Conversely, assume that B (x) ≡ B. Then

$$\frac{\partial \operatorname{vec}(A(x) B)}{\partial x^T} = \left( B^T \otimes I_n \right) \frac{\partial \operatorname{vec} A(x)}{\partial x^T}. \tag{20}$$

Combining the two results yields the product rule.
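Equation (17) can be verified numerically by choosing matrix-valued functions whose vec-gradients are known. The sketch below (Python/NumPy; the linear parameterizations of $A(x)$ and $B(x)$, the dimensions, seed and tolerance are illustrative assumptions) compares the formula with finite differences.

```python
import numpy as np

vec = lambda M: M.T.reshape(-1)            # column-stacking vec as a flat array

def num_jac(F, x, eps=1e-6):
    """Finite-difference gradient of a vector-valued F with respect to x^T."""
    cols = []
    for i in range(x.size):
        e = np.zeros(x.size)
        e[i] = eps
        cols.append((F(x + e) - F(x - e)) / (2 * eps))
    return np.column_stack(cols)

rng = np.random.default_rng(4)
n, m, k, l = 2, 3, 2, 3                    # arbitrary illustrative dimensions
P = rng.standard_normal((n * m, l))        # vec A(x) = P x, so d vec A / dx^T = P
Q = rng.standard_normal((m * k, l))        # vec B(x) = Q x, so d vec B / dx^T = Q
A = lambda x: (P @ x).reshape(m, n).T      # "unvec": A(x) is n x m
B = lambda x: (Q @ x).reshape(k, m).T      # "unvec": B(x) is m x k

x0 = rng.standard_normal(l)
lhs = num_jac(lambda x: vec(A(x) @ B(x)), x0)
rhs = np.kron(B(x0).T, np.eye(n)) @ P + np.kron(np.eye(k), A(x0)) @ Q
print(np.allclose(lhs, rhs, atol=1e-6))    # True: equation (17) holds
```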

Corollary 1.1 When we have vector- rather than matrix-valued functions, the formula is drastically simplified. Let $f : \mathbb{R}^l \to \mathbb{R}^m$ and $g : \mathbb{R}^l \to \mathbb{R}^m$ have partial derivatives at $x \in \mathbb{R}^l$. Then

$$\frac{\partial}{\partial x^T} \left[ f(x)^T g(x) \right] = g(x)^T \frac{\partial f(x)}{\partial x^T} + f(x)^T \frac{\partial g(x)}{\partial x^T}. \tag{21}$$

Example 1.2 Suppose we would like to differentiate $f(\Omega) \triangleq \Omega^{-1}$ with respect to $(\operatorname{vec}\Omega)^T$. One quick way of getting the result is to note that

$$\frac{\partial (\Omega \Omega^{-1})}{\partial (\operatorname{vec}\Omega)^T} = \frac{\partial (I)}{\partial (\operatorname{vec}\Omega)^T} = 0$$
so that
$$0 = \left( \Omega^{-T} \otimes I_n \right) \frac{\partial \operatorname{vec}\Omega}{\partial (\operatorname{vec}\Omega)^T} + \left( I_n \otimes \Omega \right) \frac{\partial \operatorname{vec}\Omega^{-1}}{\partial (\operatorname{vec}\Omega)^T}.$$
Hence
$$\frac{\partial \operatorname{vec}\Omega^{-1}}{\partial (\operatorname{vec}\Omega)^T} = - \left( I_n \otimes \Omega \right)^{-1} \left( \Omega^{-T} \otimes I_n \right) = -\Omega^{-T} \otimes \Omega^{-1}.$$
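The result $\partial \operatorname{vec}\Omega^{-1} / \partial(\operatorname{vec}\Omega)^T = -\Omega^{-T} \otimes \Omega^{-1}$ is easy to confirm by finite differences. A Python/NumPy sketch (the diagonal shift keeping $\Omega$ comfortably invertible, the seed and the tolerance are illustrative choices):

```python
import numpy as np

vec = lambda M: M.T.reshape(-1)                  # column-stacking vec

def num_jac(F, x, eps=1e-6):
    cols = []
    for i in range(x.size):
        e = np.zeros(x.size)
        e[i] = eps
        cols.append((F(x + e) - F(x - e)) / (2 * eps))
    return np.column_stack(cols)

rng = np.random.default_rng(5)
n = 3
Omega = rng.standard_normal((n, n)) + n * np.eye(n)   # comfortably invertible

unvec = lambda v: v.reshape(n, n, order="F")          # inverse of the vec above
F = lambda v: vec(np.linalg.inv(unvec(v)))            # vec(Omega) -> vec(Omega^{-1})

numeric = num_jac(F, vec(Omega))
Oinv = np.linalg.inv(Omega)
analytic = -np.kron(Oinv.T, Oinv)                     # -Omega^{-T} ⊗ Omega^{-1}
print(np.allclose(numeric, analytic, atol=1e-5))      # True
```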

1.3 The chain rule

Proposition 1.5 (the chain rule) Let $f$ and $g$ have partial derivatives at $x$, and let $h(x) = (f \circ g)(x) = f(g(x))$. Define $y = g(x)$. Then $h$ has partial derivatives at $x$ and
$$\frac{\partial h(x)}{\partial x^T} = \frac{\partial f(y)}{\partial y^T} \frac{\partial g(x)}{\partial x^T}. \tag{22}$$
With an alternative piece of notation, we have

$$\frac{\partial f(g(x))}{\partial x^T} = \frac{\partial f(g(x))}{\partial g^T} \frac{\partial g(x)}{\partial x^T}. \tag{23}$$

Proof. The scalar chain rule and the definition of matrix multiplication.

Example 1.3 Let $f(A) \triangleq x^T A x$ and let $B(A) \triangleq A^{-1}$. Find
$$\frac{\partial f(B(A))}{\partial \operatorname{vec}(A)^T}.$$

Here we use the chain rule to find that

$$\frac{\partial f(B(A))}{\partial \operatorname{vec}(A)^T} = \frac{\partial \left( x^T B x \right)}{\partial \operatorname{vec}(B)^T} \frac{\partial \operatorname{vec}(A^{-1})}{\partial \operatorname{vec}(A)^T} = - \left( x^T \otimes x^T \right) \left( A^{-T} \otimes A^{-1} \right).$$
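The same finite-difference machinery confirms this chain-rule result (Python/NumPy sketch, not part of the original notes; the diagonal shift keeps $A$ invertible and the tolerance is illustrative):

```python
import numpy as np

vec = lambda M: M.T.reshape(-1)                  # column-stacking vec

def num_grad(f, v, eps=1e-6):
    """Finite-difference row gradient of a scalar f with respect to v^T."""
    g = np.zeros(v.size)
    for i in range(v.size):
        e = np.zeros(v.size)
        e[i] = eps
        g[i] = (f(v + e) - f(v - e)) / (2 * eps)
    return g

rng = np.random.default_rng(6)
n = 3
A = rng.standard_normal((n, n)) + n * np.eye(n)
x = rng.standard_normal(n)

unvec = lambda v: v.reshape(n, n, order="F")
f = lambda v: x @ np.linalg.inv(unvec(v)) @ x          # f(B(A)) = x^T A^{-1} x

Ainv = np.linalg.inv(A)
analytic = -np.kron(x, x) @ np.kron(Ainv.T, Ainv)      # -(x^T ⊗ x^T)(A^{-T} ⊗ A^{-1})
print(np.allclose(num_grad(f, vec(A)), analytic, atol=1e-5))  # True
```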

2 Remarks on unvectorizing

The definition of the derivative of a matrix function with respect to a matrix given

above was stated originally by Magnus and Neudecker (1999). It has some great

advantages. But sometimes it is not so useful. One example is found in chapter

10, where we differentiate a matrix with respect to a scalar. Then the whole

theory would break down if we had to vectorize before we differentiated. So we

go against Magnus & Neudecker and do not vectorize.

A similar phenomenon arises when we want to differentiate a scalar with respect

to a matrix. For example, according to the Magnus-Neudecker definition,

$$\frac{\partial x^T A x}{\partial A^T} = \frac{\partial x^T A x}{\partial (\operatorname{vec} A)^T} = x^T \otimes x^T,$$
so that the result is a $1 \times n^2$ vector. But often it is nicer to define the result as the $n \times n$ matrix
$$\frac{\partial x^T A x}{\partial A^T} = \frac{\partial x^T A x}{\partial A} = x x^T.$$
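A direct element-by-element check that $\partial(x^T A x)/\partial a_{ij} = x_i x_j$, i.e. that the 'unvectorized' derivative is $x x^T$, as a Python/NumPy sketch (seed, step size and tolerance are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 3
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)
eps = 1e-6

D = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = eps
        D[i, j] = (x @ (A + E) @ x - x @ (A - E) @ x) / (2 * eps)

print(np.allclose(D, np.outer(x, x), atol=1e-6))   # element-wise derivative is x x^T
```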

More generally, let $f : \mathbb{R}^{m \times n} \to \mathbb{R}$ be a scalar-valued function. Then the definitions
$$\frac{\partial f(A)}{\partial A} \triangleq \begin{bmatrix} \dfrac{\partial f(A)}{\partial a_{11}} & \cdots & \dfrac{\partial f(A)}{\partial a_{m1}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f(A)}{\partial a_{1n}} & \cdots & \dfrac{\partial f(A)}{\partial a_{mn}} \end{bmatrix}$$
and
$$\frac{\partial f(A)}{\partial A^T} = \left( \frac{\partial f(A)}{\partial A} \right)^T,$$
where $a_{ij}$ is the element in the $i$th row and $j$th column of $A$, are often nicer in practice than the ones given by Magnus & Neudecker.

Another example is the following. Suppose we want to evaluate

$$\frac{\partial}{\partial \Omega^T} \left[ x^T \Omega^{-1} x \right].$$

We found above that

$$\frac{\partial \operatorname{vec}\left( x^T \Omega^{-1} x \right)}{\partial (\operatorname{vec}\Omega)^T} = - \left( x^T \otimes x^T \right) \left( \Omega^{-T} \otimes \Omega^{-1} \right),$$
which is a $1 \times n^2$ vector. But a much neater statement of the result is

$$\frac{\partial}{\partial \Omega^T} \left[ x^T \Omega^{-1} x \right] = -\Omega^{-1} x x^T \Omega^{-1},$$

which is an n × n matrix. This fact is brought home clearly by looking at the

following example, where the transpose will be denoted by a prime ($'$).

Example 2.1 (Maximum likelihood estimation of the population mean and vari-

ance/covariance matrix of a normally distributed vector of random variables.) Let

µ be an unknown n × 1 column vector and let Ω be an unknown n × n matrix. Let $\langle x_t \rangle_{t=1}^{T}$ be a given sequence of n × 1 vectors. Consider the maximization problem

$$\max_{\mu, \Omega} \left\{ |\Omega|^{-T/2} \exp\left\{ -\frac{1}{2} \sum_{t=1}^{T} (x_t - \mu)' \, \Omega^{-1} (x_t - \mu) \right\} \right\}.$$

Using the ‘unvectorized’ definition of the matrix derivative and noting that, with

this definition,
$$\frac{\partial \ln |\Omega|}{\partial \Omega'} = \Omega^{-1},$$

it is easy to confirm that the unique solution is given by

$$\begin{cases} \mu = \dfrac{1}{T} \displaystyle\sum_{t=1}^{T} x_t \\[2ex] \Omega = \dfrac{1}{T} \displaystyle\sum_{t=1}^{T} (x_t - \mu)(x_t - \mu)'. \end{cases}$$
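A small numerical illustration of this result, as a Python/NumPy sketch (the data-generating parameters, seed and perturbation sizes are illustrative assumptions, and the log-likelihood is written up to an additive constant): the closed-form estimators should yield a higher log-likelihood than nearby alternatives.

```python
import numpy as np

rng = np.random.default_rng(8)
n, T = 3, 500
data = rng.multivariate_normal(mean=[1.0, -2.0, 0.5],
                               cov=np.diag([1.0, 2.0, 0.5]), size=T)

def loglik(mu, Omega):
    """Log-likelihood of the sample, up to an additive constant."""
    d = data - mu
    _, logdet = np.linalg.slogdet(Omega)
    return -T / 2 * logdet - 0.5 * np.sum(d @ np.linalg.inv(Omega) * d)

# Closed-form maximum likelihood estimators from the example above.
mu_hat = data.mean(axis=0)
Omega_hat = (data - mu_hat).T @ (data - mu_hat) / T

best = loglik(mu_hat, Omega_hat)
S = rng.standard_normal((n, n))
S = (S + S.T) / 100                                   # small symmetric perturbation
print(best >= loglik(mu_hat + 0.01, Omega_hat))       # True
print(best >= loglik(mu_hat, Omega_hat + S))          # True (for small S)
```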

So the derivative of a scalar-valued function with respect to a matrix may usefully be defined as I have in this section. However, there is no satisfactory chain rule for the definition of the matrix derivative given here. This is one of Magnus &

Neudecker’s reasons for advocating the definition I give in the previous section.

2.1 Taylor’s formula in n dimensions

Proposition 2.1 Let $f : S \to \mathbb{R}^m$ be differentiable on the open set $S \subset \mathbb{R}^n$. Let $x_0 \in S$. Then there is a function $r : \mathbb{R}^n \to \mathbb{R}^m$ (which typically depends on $x_0$) such that

1. For all $x \in S$,

$$f(x) = f(x_0) + \frac{\partial f(x_0)}{\partial x^T} (x - x_0) + r(x - x_0). \tag{24}$$

2.
$$\lim_{h \to 0} \frac{r(h)}{\|h\|} = 0. \tag{25}$$

Proof. See [??].

Proposition 2.2 Let $f : S \to \mathbb{R}$ be twice differentiable on the open set $S \subset \mathbb{R}^n$. Let $x_0 \in S$. Then there is a function $r : \mathbb{R}^n \to \mathbb{R}$ such that

1. For all x ∈ S,

$$f(x) = f(x_0) + \frac{\partial f(x_0)}{\partial x^T} (x - x_0) + \frac{1}{2} (x - x_0)^T \frac{\partial^2 f(x_0)}{\partial x \partial x^T} (x - x_0) + r(x - x_0). \tag{26}$$

2.
$$\lim_{h \to 0} \frac{r(h)}{\|h\|^2} = 0. \tag{27}$$

Proof. See [??].
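The content of Proposition 2.2 is that the second-order remainder vanishes faster than $\|h\|^2$. A minimal numerical illustration (Python/NumPy sketch with an arbitrarily chosen test function, not part of the original notes) shows the ratio $r(h)/\|h\|^2$ shrinking as $h \to 0$:

```python
import numpy as np

# Test function f(x) = exp(x1) * sin(x2); its gradient and Hessian are known.
f = lambda x: np.exp(x[0]) * np.sin(x[1])
x0 = np.array([0.2, 0.3])
grad = np.array([np.exp(x0[0]) * np.sin(x0[1]),
                 np.exp(x0[0]) * np.cos(x0[1])])
hess = np.exp(x0[0]) * np.array([[np.sin(x0[1]),  np.cos(x0[1])],
                                 [np.cos(x0[1]), -np.sin(x0[1])]])

for t in [1e-1, 1e-2, 1e-3]:
    h = t * np.array([0.7, -0.4])
    taylor2 = f(x0) + grad @ h + 0.5 * h @ hess @ h
    r = f(x0 + h) - taylor2
    print(t, r / np.dot(h, h))      # the ratio r(h)/||h||^2 shrinks towards 0
```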

Warning. It is not claimed (and it isn't true) that whenever f : S → R has derivatives of all orders, the infinite Taylor series

$$f(x_0) + \frac{\partial f(x_0)}{\partial x^T} (x - x_0) + \frac{1}{2} (x - x_0)^T \frac{\partial^2 f(x_0)}{\partial x \partial x^T} (x - x_0) + \cdots$$

converges to f (x). This series may fail to converge, or it may converge to a

number different from f (x). In other words, not all infinitely differentiable

functions are analytic (representable by a power series). A counterexample

is the function $f$ defined via $f(x) = e^{-1/x^2}$ for $x \neq 0$ and $f(0) = 0$. Then $f$

is infinitely differentiable with $f^{(n)}(0) = 0$ for all $n = 0, 1, 2, \ldots$ and so the

infinite Taylor series around $x_0 = 0$ is identically zero, yet of course $f$ itself

is not identically zero.
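A tiny numerical illustration of the counterexample (Python sketch, not part of the original notes): every Taylor coefficient of $f$ at 0 is zero, so the Taylor series is identically zero, yet $f$ is visibly nonzero away from the origin.

```python
import math

def f(x):
    """f(x) = exp(-1/x^2) for x != 0 and f(0) = 0: smooth but not analytic at 0."""
    return 0.0 if x == 0.0 else math.exp(-1.0 / x ** 2)

# All derivatives of f at 0 vanish, so the Taylor series around 0 is the zero function.
for x in [0.0, 0.1, 0.5, 1.0]:
    print(x, f(x), "vs. Taylor-series value 0.0")
```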

3 Integrating over Rn

3.1 Definition

Within the Riemann theory, integrating over rectangles (and the $n$-dimensional counterparts) is just a matter of iterating the process of integration. More precisely, suppose $E \subset \mathbb{R}^2$ is a closed rectangle, i.e. $E = [a_1, b_1] \times [a_2, b_2]$ where we require $a_1 \leq b_1$ and $a_2 \leq b_2$ so that the orientation of our set $E$ is not an issue. We then have the following definition.

Definition 3.1 Let $E \subset \mathbb{R}^2$ be a closed rectangle and let $f : E \to \mathbb{R}$ be a continuous function. Let $x = \langle x, y \rangle$. Define
$$\varphi(y) = \int_{a_1}^{b_1} f(x, y)\, dx. \tag{28}$$
Then
$$\int_E f(x)\, dx = \int_{a_2}^{b_2} \left( \int_{a_1}^{b_1} f(x, y)\, dx \right) dy = \int_{a_2}^{b_2} \varphi(y)\, dy. \tag{29}$$

Happily, the order of integration does not matter under our assumptions. We have the following proposition.

Proposition 3.1 Let E be a closed rectangle and let f : E → R be a continuous function. Then

$$\int_{a_2}^{b_2} \left( \int_{a_1}^{b_1} f(x, y)\, dx \right) dy = \int_E f(x)\, dx = \int_{a_1}^{b_1} \left( \int_{a_2}^{b_2} f(x, y)\, dy \right) dx. \tag{30}$$

Proof. See [??].

Remark 3.1 If you think this is a surprising result, recall that integrals are just sums, and sums (avoiding pathologies where infinity is involved) are the same independent of the order of the terms.
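As an illustration, the sketch below (Python/NumPy; the integrand, rectangle and grid size are arbitrary choices, not from the original notes) approximates the iterated integrals in both orders by midpoint Riemann sums and compares them with the exact value.

```python
import numpy as np

# Iterated midpoint Riemann sums over the rectangle E = [0, 1] x [0, 2]
# for f(x, y) = x * y**2, integrating in both orders.
f = lambda x, y: x * y ** 2
nx, ny = 2000, 2000
hx, hy = 1.0 / nx, 2.0 / ny
xs = (np.arange(nx) + 0.5) * hx                   # midpoints in [0, 1]
ys = (np.arange(ny) + 0.5) * hy                   # midpoints in [0, 2]
grid = f(xs[:, None], ys[None, :])

dx_then_dy = (grid.sum(axis=0) * hx * hy).sum()   # inner integral over x, then y
dy_then_dx = (grid.sum(axis=1) * hy * hx).sum()   # inner integral over y, then x
exact = (1.0 / 2.0) * (8.0 / 3.0)                 # (integral of x)(integral of y^2)
print(dx_then_dy, dy_then_dx, exact)              # all three agree closely
```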

We can of course generalize this definition and proposition to integration over

closed rectangles $E \subset \mathbb{R}^n$, i.e. sets of the form $E = [a_1, b_1] \times [a_2, b_2] \times \cdots \times [a_n, b_n]$. Just keep on iterating the process of integration!

3.2 Change of variables

Definition 3.2 Let f be an arbitrary function on Rn into R. Then the set

$$S_f = \{ x \in \mathbb{R}^n : f(x) \neq 0 \} \tag{31}$$

is called the support of f. If Sf is a compact set, then f is said to have compact

support.

Theorem 3.1 Let T be a 1-1 (injective) continuously differentiable function from

an open set $E \subset \mathbb{R}^n$ into $\mathbb{R}^n$ such that the Jacobian $J_T(x) \neq 0$ for all $x \in E$. Let $f$
be a continuous function from $\mathbb{R}^n$ into $\mathbb{R}$ whose support is compact and lies in $T(E)$. Then
$$\int_{\mathbb{R}^n} f(y)\, dy = \int_{\mathbb{R}^n} f(T(x)) \, |J_T(x)| \, dx. \tag{32}$$

Proof. See [??].

Remark 3.2 The reason for having $|J_T(x)|$ instead of $J_T(x)$ is that, with the definition of the integral used in this section, we integrate over subsets of $\mathbb{R}^n$ without regard for their orientation. For example, in the scalar case, we consider $\int_a^b f(x)\, dx$ and $\int_b^a f(x)\, dx$ to be the same. Given that these are defined to be the same, we must take steps to ensure that, say, the change of variables $T(x) = -x$ makes no difference, and that is guaranteed by taking the absolute value of the Jacobian.

Example 3.1 (from Econometrics II; calculating the volume of a cylinder)

Let $c, k \geq 0$. Let $f : \mathbb{R}^2 \to \mathbb{R}$ be defined via
$$f(x, y) = \begin{cases} c & \text{if } x^2 + y^2 \leq k^2 \\ 0 & \text{otherwise.} \end{cases} \tag{33}$$
(Draw a picture of this!) We now want to calculate
$$\int_{\mathbb{R}^2} f(x, y)\, dx \tag{34}$$
and it turns out to be convenient to use the change of variables approach, noting with satisfaction that $f$ has compact support. Looking at the picture, it seems that a switch to polar coordinates makes sense. So define
$$E = \left\{ \begin{bmatrix} r \\ \theta \end{bmatrix} \in \mathbb{R}^2 : 0 < r < k \text{ and } 0 < \theta < 2\pi \right\} \tag{35}$$
and $T$ on $E$ via
$$T(r, \theta) = \begin{bmatrix} r \cos\theta \\ r \sin\theta \end{bmatrix}. \tag{36}$$
Apparently the Jacobian is
$$J_T(r, \theta) = \det \begin{bmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{bmatrix} = r \tag{37}$$
and
$$f(T(r, \theta)) = \begin{cases} c & \text{if } 0 \leq r \leq k \text{ and } 0 \leq \theta \leq 2\pi \\ 0 & \text{otherwise.} \end{cases} \tag{38}$$
Hence
$$\int_{\mathbb{R}^2} f(x, y)\, dx = \int_0^{2\pi} \left( \int_0^k c r\, dr \right) d\theta = c k^2 \pi. \tag{39}$$
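A crude numerical confirmation of (39), as a Python/NumPy sketch (the values of $c$ and $k$ and the grid size are illustrative): approximate the integral of $f$ on a fine grid covering its support and compare with $ck^2\pi$.

```python
import numpy as np

# Numerical check of equation (39): integrate f over a grid covering its support.
c, k = 2.0, 1.5
n = 2000
h = 2.0 * k / n                                    # grid over the square [-k, k]^2
x = (np.arange(n) + 0.5) * h - k                   # cell midpoints
f = np.where(x[:, None] ** 2 + x[None, :] ** 2 <= k ** 2, c, 0.0)

approx = f.sum() * h * h
print(approx, c * k ** 2 * np.pi)                  # both approximately 14.137 = c k^2 pi
```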

References

Magnus, J. and H. Neudecker (1999). Matrix Differential Calculus with Applications in Statistics and Econometrics. John Wiley and Sons.
