Vector and Matrix Calculus

Herman Kamper [email protected]

Published: 2013-01-30 Last update: 2021-07-26

1 Introduction

As explained in detail in [1], there unfortunately exist multiple competing notations for the layout of matrix calculus. This can cause a lot of difficulty when consulting several sources, since different sources might use different conventions. Some sources, for example [2] (from which I use a lot of identities), even use a mixed layout (according to [1, Notes]). Identities for both the numerator layout (sometimes called the Jacobian formulation) and the denominator layout (sometimes called the Hessian formulation) are given in [1], so this makes it easy to check which layout a particular source uses. I will aim to stick to the denominator layout, which seems to be the most widely used in the fields of machine learning and pattern recognition (e.g. [3] and [4, pp. 327–332]). Other useful references on matrix calculus include [5] and [6]. In this document column vectors are assumed in all cases except where specifically stated otherwise.

Table 1: Derivatives of scalars, vector functions and matrices [1,6].

                           scalar y ∈ R                  column vector y ∈ R^m            matrix Y ∈ R^(m×n)
scalar x                   scalar ∂y/∂x                  row vector ∂y/∂x ∈ R^(1×m)       matrix ∂Y/∂x (only numerator layout)
column vector x ∈ R^n      column vector ∂y/∂x ∈ R^n     matrix ∂y/∂x ∈ R^(n×m)
matrix X ∈ R^(p×q)         matrix ∂y/∂X ∈ R^(p×q)

2 Definitions

Table 1 indicates the six possible kinds of derivatives when using the denominator layout. Using this layout notation consistently, we have the following definitions. The derivative of a scalar function f : R^n → R with respect to vector x ∈ R^n is

\[
\frac{\partial f(x)}{\partial x} \overset{\text{def}}{=}
\begin{bmatrix}
\frac{\partial f(x)}{\partial x_1} \\
\frac{\partial f(x)}{\partial x_2} \\
\vdots \\
\frac{\partial f(x)}{\partial x_n}
\end{bmatrix}
\tag{1}
\]

This is the gradient of the scalar function (some authors simply call this the gradient, irrespective of whether the numerator or denominator layout is used).

The derivative of a vector function f : R^n → R^m, where f(x) = [f_1(x) f_2(x) ... f_m(x)]^T and x ∈ R^n, with respect to scalar x_i is

\[
\frac{\partial f(x)}{\partial x_i} \overset{\text{def}}{=}
\begin{bmatrix}
\frac{\partial f_1(x)}{\partial x_i} & \frac{\partial f_2(x)}{\partial x_i} & \cdots & \frac{\partial f_m(x)}{\partial x_i}
\end{bmatrix}
\tag{2}
\]

The derivative of a vector function f : R^n → R^m, where f(x) = [f_1(x) f_2(x) ... f_m(x)]^T, with respect to vector x ∈ R^n is

\[
\frac{\partial f(x)}{\partial x} \overset{\text{def}}{=}
\begin{bmatrix}
\frac{\partial f(x)}{\partial x_1} \\
\frac{\partial f(x)}{\partial x_2} \\
\vdots \\
\frac{\partial f(x)}{\partial x_n}
\end{bmatrix}
=
\begin{bmatrix}
\frac{\partial f_1(x)}{\partial x_1} & \frac{\partial f_2(x)}{\partial x_1} & \cdots & \frac{\partial f_m(x)}{\partial x_1} \\
\frac{\partial f_1(x)}{\partial x_2} & \frac{\partial f_2(x)}{\partial x_2} & \cdots & \frac{\partial f_m(x)}{\partial x_2} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f_1(x)}{\partial x_n} & \frac{\partial f_2(x)}{\partial x_n} & \cdots & \frac{\partial f_m(x)}{\partial x_n}
\end{bmatrix}
\tag{3}
\]

This is just the transpose of the Jacobian matrix. The derivative of a scalar function f : R^(m×n) → R with respect to matrix X ∈ R^(m×n) is

\[
\frac{\partial f(X)}{\partial X} \overset{\text{def}}{=}
\begin{bmatrix}
\frac{\partial f(X)}{\partial X_{11}} & \frac{\partial f(X)}{\partial X_{12}} & \cdots & \frac{\partial f(X)}{\partial X_{1n}} \\
\frac{\partial f(X)}{\partial X_{21}} & \frac{\partial f(X)}{\partial X_{22}} & \cdots & \frac{\partial f(X)}{\partial X_{2n}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f(X)}{\partial X_{m1}} & \frac{\partial f(X)}{\partial X_{m2}} & \cdots & \frac{\partial f(X)}{\partial X_{mn}}
\end{bmatrix}
\tag{4}
\]

Observe that the gradient (1) is just a special case of (4) for column vectors. Often (as in [3]) the gradient notation is used as an alternative to the notation used above, for example:

\[
\nabla_x f(x) = \frac{\partial f(x)}{\partial x} \tag{5}
\]

\[
\nabla_X f(X) = \frac{\partial f(X)}{\partial X} \tag{6}
\]
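As a quick numerical sanity check of these conventions, the following is a minimal sketch of my own (not from any of the references, assuming NumPy; the helper name denominator_derivative is made up). It estimates (3) by central differences and confirms that, in denominator layout, the result is the n × m transpose of the Jacobian:

```python
import numpy as np

def denominator_derivative(f, x, eps=1e-6):
    # Central-difference estimate of (3): an n x m matrix whose
    # (i, j) entry is the partial derivative of f_j with respect to x_i.
    n, m = x.size, f(x).size
    D = np.zeros((n, m))
    for i in range(n):
        e = np.zeros(n)
        e[i] = eps
        D[i, :] = (f(x + e) - f(x - e)) / (2 * eps)
    return D

# f : R^3 -> R^2, so (3) gives a 3 x 2 matrix
f = lambda x: np.array([x[0] * x[1], x[1] + np.sin(x[2])])
x = np.array([1.0, 2.0, 0.5])

D = denominator_derivative(f, x)
print(D.shape)  # (3, 2)

# Analytic transpose of the Jacobian for comparison
J_T = np.array([[x[1], 0.0],
                [x[0], 1.0],
                [0.0, np.cos(x[2])]])
print(np.allclose(D, J_T, atol=1e-8))  # True
```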

3 Identities

3.1 Scalar-by-vector

If a ∈ R^m, b ∈ R^n and C ∈ R^(m×n) then

\[
a^T C b = \sum_{i=1}^{m} a_i (Cb)_i = \sum_{i=1}^{m} a_i \left( \sum_{j=1}^{n} C_{ij} b_j \right) = \sum_{i=1}^{m} \sum_{j=1}^{n} C_{ij} a_i b_j \tag{7}
\]

Now assume we have vector functions u : R^q → R^m and v : R^q → R^n, and a matrix A ∈ R^(m×n). The vector functions u and v are functions of x ∈ R^q, but A is not. We want to find an identity for

\[
\frac{\partial u^T A v}{\partial x} \tag{8}
\]

From (7), we have:

\begin{align*}
\left( \frac{\partial u^T A v}{\partial x} \right)_l = \frac{\partial u^T A v}{\partial x_l}
&= \frac{\partial}{\partial x_l} \sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij} u_i v_j \\
&= \sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij} \frac{\partial}{\partial x_l} (u_i v_j) \\
&= \sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij} \left( \frac{\partial u_i}{\partial x_l} v_j + u_i \frac{\partial v_j}{\partial x_l} \right) \\
&= \sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij} v_j \frac{\partial u_i}{\partial x_l} + \sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij} u_i \frac{\partial v_j}{\partial x_l} \tag{9}
\end{align*}

Now we can show (by writing out the elements [Notebook, 2012-05-22]) that:

\begin{align*}
\left( \frac{\partial u}{\partial x} A v + \frac{\partial v}{\partial x} A^T u \right)_l
&= \sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij} v_j \frac{\partial u_i}{\partial x_l} + \sum_{i=1}^{m} \sum_{j=1}^{n} (A^T)_{ji} u_i \frac{\partial v_j}{\partial x_l} \\
&= \sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij} v_j \frac{\partial u_i}{\partial x_l} + \sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij} u_i \frac{\partial v_j}{\partial x_l} \tag{10}
\end{align*}

A comparison of (9) and (10) completes the proof that

\[
\frac{\partial u^T A v}{\partial x} = \frac{\partial u}{\partial x} A v + \frac{\partial v}{\partial x} A^T u \tag{11}
\]
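Identity (11) can also be verified numerically. The following sketch is my own (assuming NumPy; the choices of u, v and the dimensions are arbitrary). It compares a central-difference gradient of the scalar u^T A v against the right-hand side of (11):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, q = 3, 2, 4
A = rng.standard_normal((m, n))
P = rng.standard_normal((m, q))
Q = rng.standard_normal((n, q))

u = lambda x: np.sin(P @ x)    # u : R^q -> R^m
v = lambda x: np.tanh(Q @ x)   # v : R^q -> R^n
g = lambda x: u(x) @ A @ v(x)  # the scalar u^T A v

x = rng.standard_normal(q)

# Denominator-layout derivatives of u and v: q x m and q x n
du = P.T @ np.diag(np.cos(P @ x))
dv = Q.T @ np.diag(1.0 - np.tanh(Q @ x) ** 2)

rhs = du @ (A @ v(x)) + dv @ (A.T @ u(x))  # right-hand side of (11)

# Central-difference gradient of g: a column vector in R^q
eps = 1e-6
lhs = np.array([(g(x + eps * e) - g(x - eps * e)) / (2 * eps)
                for e in np.eye(q)])

print(np.allclose(lhs, rhs, atol=1e-6))  # True
```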

3.2 Useful identities from the scalar-by-vector product rule

From (11) it follows, with vectors and matrices b ∈ R^m, d ∈ R^q, x ∈ R^n, B ∈ R^(m×n), C ∈ R^(m×q), D ∈ R^(q×n), that

\[
\frac{\partial (Bx + b)^T C (Dx + d)}{\partial x} = \frac{\partial (Bx + b)}{\partial x} C (Dx + d) + \frac{\partial (Dx + d)}{\partial x} C^T (Bx + b) \tag{12}
\]

resulting in the identity

\[
\frac{\partial (Bx + b)^T C (Dx + d)}{\partial x} = B^T C (Dx + d) + D^T C^T (Bx + b) \tag{13}
\]

by using the easily verifiable identities:

\[
\frac{\partial (u(x) + v(x))}{\partial x} = \frac{\partial u(x)}{\partial x} + \frac{\partial v(x)}{\partial x} \tag{14}
\]

\[
\frac{\partial Ax}{\partial x} = A^T \tag{15}
\]

\[
\frac{\partial a}{\partial x} = 0 \tag{16}
\]
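A numerical spot check of (13), again a sketch of my own with arbitrary dimensions, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, q = 3, 4, 2
b, d = rng.standard_normal(m), rng.standard_normal(q)
x = rng.standard_normal(n)
B = rng.standard_normal((m, n))
C = rng.standard_normal((m, q))
D = rng.standard_normal((q, n))

g = lambda x: (B @ x + b) @ C @ (D @ x + d)  # the scalar in (13)

rhs = B.T @ C @ (D @ x + d) + D.T @ C.T @ (B @ x + b)

# Central-difference gradient (column vector in R^n)
eps = 1e-6
lhs = np.array([(g(x + eps * e) - g(x - eps * e)) / (2 * eps)
                for e in np.eye(n)])
print(np.allclose(lhs, rhs, atol=1e-6))  # True
```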

Some other useful special cases of (11):

\[
\frac{\partial x^T A b}{\partial x} = Ab \tag{17}
\]

\[
\frac{\partial x^T A x}{\partial x} = (A + A^T) x \tag{18}
\]

\[
\frac{\partial x^T A x}{\partial x} = 2Ax \quad \text{if } A \text{ is symmetric} \tag{19}
\]
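Identities (18) and (19) can likewise be spot-checked; a minimal sketch of my own, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

g = lambda x: x @ A @ x  # the quadratic form x^T A x

eps = 1e-6
grad = np.array([(g(x + eps * e) - g(x - eps * e)) / (2 * eps)
                 for e in np.eye(n)])
print(np.allclose(grad, (A + A.T) @ x, atol=1e-6))  # (18): True

S = A + A.T  # a symmetric matrix, for (19)
gs = lambda x: x @ S @ x
grad_s = np.array([(gs(x + eps * e) - gs(x - eps * e)) / (2 * eps)
                   for e in np.eye(n)])
print(np.allclose(grad_s, 2 * S @ x, atol=1e-6))    # (19): True
```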

3.3 Derivatives of determinants

See [7, p. 374] for the definition of cofactors. Also see [Notebook, 2012-05-22]. We can write the determinant of matrix X ∈ R^(n×n) as the cofactor expansion along any row i:

\[
|X| = X_{i1} C_{i1} + X_{i2} C_{i2} + \cdots + X_{in} C_{in} = \sum_{j=1}^{n} X_{ij} C_{ij} \tag{20}
\]

where C_{ij} denotes the cofactor of X_{ij}.

Thus the derivative will be

\begin{align*}
\left( \frac{\partial |X|}{\partial X} \right)_{kl}
&= \frac{\partial}{\partial X_{kl}} \left\{ X_{i1} C_{i1} + X_{i2} C_{i2} + \cdots + X_{in} C_{in} \right\} \\
&= \frac{\partial}{\partial X_{kl}} \left\{ X_{k1} C_{k1} + X_{k2} C_{k2} + \cdots + X_{kn} C_{kn} \right\} \quad \text{(we can expand along any row, so choose } i = k\text{)} \\
&= C_{kl} \tag{21}
\end{align*}

The last step follows because the cofactors C_{k1}, C_{k2}, ..., C_{kn} are computed from the entries of X outside row k, so they do not depend on X_{kl}.

Thus (see [7, p. 386])

\[
\frac{\partial |X|}{\partial X} = \text{cofactor matrix of } X = (\operatorname{adj} X)^T \tag{22}
\]

But we know that the inverse of X is given by [7, p. 387]

\[
X^{-1} = \frac{1}{|X|} \operatorname{adj} X \tag{23}
\]

thus

\[
\operatorname{adj} X = |X| X^{-1} \tag{24}
\]

which, when substituted into (22), results in the identity

\[
\frac{\partial |X|}{\partial X} = |X| (X^{-1})^T \tag{25}
\]

From (25), applying the chain rule element-wise, (∂ ln|X| / ∂X)_{kl} = ∂ ln|X| / ∂X_{kl} = (1/|X|) ∂|X| / ∂X_{kl}, we can also write

\[
\frac{\partial \ln |X|}{\partial X} = \frac{1}{|X|} \frac{\partial |X|}{\partial X} = \frac{1}{|X|} |X| (X^{-1})^T \tag{26}
\]

giving the identity

\[
\frac{\partial \ln |X|}{\partial X} = (X^{-1})^T \tag{27}
\]
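Both (25) and (27) are easy to verify numerically. A sketch of my own, assuming NumPy (X is made symmetric positive definite so that ln |X| is defined, and grad_wrt_matrix is a made-up helper implementing (4) by central differences):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
M = rng.standard_normal((n, n))
X = M @ M.T + n * np.eye(n)  # symmetric positive definite, so |X| > 0

def grad_wrt_matrix(f, X, eps=1e-6):
    # Central-difference estimate of the scalar-by-matrix derivative (4).
    G = np.zeros_like(X)
    for k in range(X.shape[0]):
        for l in range(X.shape[1]):
            E = np.zeros_like(X)
            E[k, l] = eps
            G[k, l] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

det_grad = grad_wrt_matrix(np.linalg.det, X)
print(np.allclose(det_grad,
                  np.linalg.det(X) * np.linalg.inv(X).T, atol=1e-4))  # (25)

logdet_grad = grad_wrt_matrix(lambda X: np.log(np.linalg.det(X)), X)
print(np.allclose(logdet_grad, np.linalg.inv(X).T, atol=1e-6))        # (27)
```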

4 References

[1] Matrix calculus. [Online]. Available: http://en.wikipedia.org/wiki/Matrix_calculus
[2] K. B. Petersen and M. S. Pedersen, "The matrix cookbook," 2008.
[3] A. Ng, Machine learning. Class notes for CS229, Stanford Engineering Everywhere, Stanford University, 2008. [Online]. Available: http://see.stanford.edu
[4] S. R. Searle, Matrix Algebra Useful for Statistics. New York, NY: John Wiley & Sons, 1982.
[5] J. R. Schott, Matrix Analysis for Statistics. New York, NY: John Wiley & Sons, 1996.
[6] T. P. Minka, "Old and new matrix algebra useful for statistics," 2000. [Online]. Available: http://research.microsoft.com/en-us/um/people/minka/papers/matrix
[7] D. G. Zill and M. R. Cullen, Advanced Engineering Mathematics, 3rd ed. Jones and Bartlett, 2006.
