Vector and Matrix Calculus

Herman Kamper [email protected]

Published: 2013-01-30 Last update: 2021-07-26

1 Introduction

As explained in detail in [1], there unfortunately exist multiple competing notations for the layout of matrix calculus. This can cause a lot of difficulty when consulting several sources, since different sources might use different conventions. Some sources, for example [2] (from which I use a lot of identities), even use a mixed layout (according to [1, Notes]). Identities for both the numerator layout (sometimes called the Jacobian formulation) and the denominator layout (sometimes called the Hessian formulation) are given in [1], so this makes it easy to check which layout a particular source uses. I will aim to stick to the denominator layout, which seems to be the most widely used in the fields of machine learning and pattern recognition (e.g. [3] and [4, pp. 327–332]). Other useful references on matrix calculus include [5] and [6]. In this document column vectors are assumed in all cases except where specifically stated otherwise.

Table 1: Derivatives of scalars, vector functions and matrices [1,6].

                           scalar y ∈ R                  column vector y ∈ R^m            matrix Y ∈ R^(m×n)
scalar x                   scalar ∂y/∂x                  row vector ∂y/∂x ∈ R^(1×m)       matrix ∂Y/∂x (only numerator layout)
column vector x ∈ R^n      column vector ∂y/∂x ∈ R^n     matrix ∂y/∂x ∈ R^(n×m)
matrix X ∈ R^(p×q)         matrix ∂y/∂X ∈ R^(p×q)

2 Definitions

Table 1 indicates the six possible kinds of derivatives when using the denominator layout. Using this layout notation consistently, we have the following definitions. The derivative of a scalar function f : R^n → R with respect to vector x ∈ R^n is

\[
\frac{\partial f(x)}{\partial x} \overset{\text{def}}{=}
\begin{bmatrix}
\frac{\partial f(x)}{\partial x_1} \\
\frac{\partial f(x)}{\partial x_2} \\
\vdots \\
\frac{\partial f(x)}{\partial x_n}
\end{bmatrix}
\tag{1}
\]

This is the gradient of the scalar function (some authors simply call this the gradient, irrespective of whether the numerator or denominator layout is used).

The derivative of a vector function f : R^n → R^m, where f(x) = [f_1(x) f_2(x) ... f_m(x)]^T and x ∈ R^n, with respect to scalar x_i is

\[
\frac{\partial f(x)}{\partial x_i} \overset{\text{def}}{=}
\begin{bmatrix}
\frac{\partial f_1(x)}{\partial x_i} & \frac{\partial f_2(x)}{\partial x_i} & \cdots & \frac{\partial f_m(x)}{\partial x_i}
\end{bmatrix}
\tag{2}
\]

The derivative of a vector function f : R^n → R^m, where f(x) = [f_1(x) f_2(x) ... f_m(x)]^T, with respect to vector x ∈ R^n is

\[
\frac{\partial f(x)}{\partial x} \overset{\text{def}}{=}
\begin{bmatrix}
\frac{\partial f(x)}{\partial x_1} \\
\frac{\partial f(x)}{\partial x_2} \\
\vdots \\
\frac{\partial f(x)}{\partial x_n}
\end{bmatrix}
=
\begin{bmatrix}
\frac{\partial f_1(x)}{\partial x_1} & \frac{\partial f_2(x)}{\partial x_1} & \cdots & \frac{\partial f_m(x)}{\partial x_1} \\
\frac{\partial f_1(x)}{\partial x_2} & \frac{\partial f_2(x)}{\partial x_2} & \cdots & \frac{\partial f_m(x)}{\partial x_2} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f_1(x)}{\partial x_n} & \frac{\partial f_2(x)}{\partial x_n} & \cdots & \frac{\partial f_m(x)}{\partial x_n}
\end{bmatrix}
\tag{3}
\]

This is just the transpose of the Jacobian matrix. The derivative of a scalar function f : R^(m×n) → R with respect to matrix X ∈ R^(m×n) is

\[
\frac{\partial f(X)}{\partial X} \overset{\text{def}}{=}
\begin{bmatrix}
\frac{\partial f(X)}{\partial X_{11}} & \frac{\partial f(X)}{\partial X_{12}} & \cdots & \frac{\partial f(X)}{\partial X_{1n}} \\
\frac{\partial f(X)}{\partial X_{21}} & \frac{\partial f(X)}{\partial X_{22}} & \cdots & \frac{\partial f(X)}{\partial X_{2n}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f(X)}{\partial X_{m1}} & \frac{\partial f(X)}{\partial X_{m2}} & \cdots & \frac{\partial f(X)}{\partial X_{mn}}
\end{bmatrix}
\tag{4}
\]

Observe that the gradient (1) is just a special case of (4) for column vectors. Often (as in [3]) the gradient notation is used as an alternative to the notation used above, for example:

\[
\nabla_x f(x) = \frac{\partial f(x)}{\partial x} \tag{5}
\]

\[
\nabla_X f(X) = \frac{\partial f(X)}{\partial X} \tag{6}
\]
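As a quick numerical sanity check of these conventions, the following is a minimal sketch of my own (not from any of the references, assuming NumPy; the helper name denominator_derivative is made up). It estimates (3) by central differences and confirms that, in denominator layout, the result is the n × m transpose of the Jacobian:

```python
import numpy as np

def denominator_derivative(f, x, eps=1e-6):
    # Central-difference estimate of (3): an n x m matrix whose
    # (i, j) entry is the partial derivative of f_j with respect to x_i.
    n, m = x.size, f(x).size
    D = np.zeros((n, m))
    for i in range(n):
        e = np.zeros(n)
        e[i] = eps
        D[i, :] = (f(x + e) - f(x - e)) / (2 * eps)
    return D

# f : R^3 -> R^2, so (3) gives a 3 x 2 matrix
f = lambda x: np.array([x[0] * x[1], x[1] + np.sin(x[2])])
x = np.array([1.0, 2.0, 0.5])

D = denominator_derivative(f, x)
print(D.shape)  # (3, 2)

# Analytic transpose of the Jacobian for comparison
J_T = np.array([[x[1], 0.0],
                [x[0], 1.0],
                [0.0, np.cos(x[2])]])
print(np.allclose(D, J_T, atol=1e-8))  # True
```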

3 Identities

3.1 Scalar-by-vector

If a ∈ R^m, b ∈ R^n and C ∈ R^(m×n) then

\[
a^T C b = \sum_{i=1}^{m} a_i (Cb)_i = \sum_{i=1}^{m} a_i \left( \sum_{j=1}^{n} C_{ij} b_j \right) = \sum_{i=1}^{m} \sum_{j=1}^{n} C_{ij} a_i b_j \tag{7}
\]

Now assume we have vector functions u : R^q → R^m and v : R^q → R^n, and a matrix A ∈ R^(m×n). The vector functions u and v are functions of x ∈ R^q, but A is not. We want to find an identity for

\[
\frac{\partial u^T A v}{\partial x} \tag{8}
\]

From (7), we have:

\begin{align*}
\left( \frac{\partial u^T A v}{\partial x} \right)_l = \frac{\partial u^T A v}{\partial x_l}
&= \frac{\partial}{\partial x_l} \sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij} u_i v_j \\
&= \sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij} \frac{\partial}{\partial x_l} (u_i v_j) \\
&= \sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij} \left( \frac{\partial u_i}{\partial x_l} v_j + u_i \frac{\partial v_j}{\partial x_l} \right) \\
&= \sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij} v_j \frac{\partial u_i}{\partial x_l} + \sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij} u_i \frac{\partial v_j}{\partial x_l} \tag{9}
\end{align*}

Now we can show (by writing out the elements [Notebook, 2012-05-22]) that:

\begin{align*}
\left( \frac{\partial u}{\partial x} A v + \frac{\partial v}{\partial x} A^T u \right)_l
&= \sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij} v_j \frac{\partial u_i}{\partial x_l} + \sum_{i=1}^{m} \sum_{j=1}^{n} (A^T)_{ji} u_i \frac{\partial v_j}{\partial x_l} \\
&= \sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij} v_j \frac{\partial u_i}{\partial x_l} + \sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij} u_i \frac{\partial v_j}{\partial x_l} \tag{10}
\end{align*}

A comparison of (9) and (10) completes the proof that

\[
\frac{\partial u^T A v}{\partial x} = \frac{\partial u}{\partial x} A v + \frac{\partial v}{\partial x} A^T u \tag{11}
\]
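Identity (11) can also be verified numerically. The following sketch is my own (assuming NumPy; the choices of u, v and the dimensions are arbitrary). It compares a central-difference gradient of the scalar u^T A v against the right-hand side of (11):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, q = 3, 2, 4
A = rng.standard_normal((m, n))
P = rng.standard_normal((m, q))
Q = rng.standard_normal((n, q))

u = lambda x: np.sin(P @ x)    # u : R^q -> R^m
v = lambda x: np.tanh(Q @ x)   # v : R^q -> R^n
g = lambda x: u(x) @ A @ v(x)  # the scalar u^T A v

x = rng.standard_normal(q)

# Denominator-layout derivatives of u and v: q x m and q x n
du = P.T @ np.diag(np.cos(P @ x))
dv = Q.T @ np.diag(1.0 - np.tanh(Q @ x) ** 2)

rhs = du @ (A @ v(x)) + dv @ (A.T @ u(x))  # right-hand side of (11)

# Central-difference gradient of g: a column vector in R^q
eps = 1e-6
lhs = np.array([(g(x + eps * e) - g(x - eps * e)) / (2 * eps)
                for e in np.eye(q)])

print(np.allclose(lhs, rhs, atol=1e-6))  # True
```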

3.2 Useful identities from the scalar-by-vector product rule

From (11) it follows, with vectors and matrices b ∈ R^m, d ∈ R^q, x ∈ R^n, B ∈ R^(m×n), C ∈ R^(m×q), D ∈ R^(q×n), that

\[
\frac{\partial (Bx + b)^T C (Dx + d)}{\partial x} = \frac{\partial (Bx + b)}{\partial x} C (Dx + d) + \frac{\partial (Dx + d)}{\partial x} C^T (Bx + b) \tag{12}
\]

resulting in the identity

\[
\frac{\partial (Bx + b)^T C (Dx + d)}{\partial x} = B^T C (Dx + d) + D^T C^T (Bx + b) \tag{13}
\]

by using the easily verifiable identities:

\[
\frac{\partial (u(x) + v(x))}{\partial x} = \frac{\partial u(x)}{\partial x} + \frac{\partial v(x)}{\partial x} \tag{14}
\]

\[
\frac{\partial Ax}{\partial x} = A^T \tag{15}
\]

\[
\frac{\partial a}{\partial x} = 0 \tag{16}
\]
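A numerical spot check of (13), again a sketch of my own with arbitrary dimensions, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, q = 3, 4, 2
b, d = rng.standard_normal(m), rng.standard_normal(q)
x = rng.standard_normal(n)
B = rng.standard_normal((m, n))
C = rng.standard_normal((m, q))
D = rng.standard_normal((q, n))

g = lambda x: (B @ x + b) @ C @ (D @ x + d)  # the scalar in (13)

rhs = B.T @ C @ (D @ x + d) + D.T @ C.T @ (B @ x + b)

# Central-difference gradient (column vector in R^n)
eps = 1e-6
lhs = np.array([(g(x + eps * e) - g(x - eps * e)) / (2 * eps)
                for e in np.eye(n)])
print(np.allclose(lhs, rhs, atol=1e-6))  # True
```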

Some other useful special cases of (11):

\[
\frac{\partial x^T A b}{\partial x} = Ab \tag{17}
\]

\[
\frac{\partial x^T A x}{\partial x} = (A + A^T) x \tag{18}
\]

\[
\frac{\partial x^T A x}{\partial x} = 2Ax \quad \text{if } A \text{ is symmetric} \tag{19}
\]
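Identities (18) and (19) can likewise be spot-checked; a minimal sketch of my own, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

g = lambda x: x @ A @ x  # the quadratic form x^T A x

eps = 1e-6
grad = np.array([(g(x + eps * e) - g(x - eps * e)) / (2 * eps)
                 for e in np.eye(n)])
print(np.allclose(grad, (A + A.T) @ x, atol=1e-6))  # (18): True

S = A + A.T  # a symmetric matrix, for (19)
gs = lambda x: x @ S @ x
grad_s = np.array([(gs(x + eps * e) - gs(x - eps * e)) / (2 * eps)
                   for e in np.eye(n)])
print(np.allclose(grad_s, 2 * S @ x, atol=1e-6))    # (19): True
```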

3.3 Derivatives of determinants

See [7, p. 374] for the definition of cofactors. Also see [Notebook, 2012-05-22]. We can write the determinant of matrix X ∈ R^(n×n) as the cofactor expansion along any row i:

\[
|X| = X_{i1} C_{i1} + X_{i2} C_{i2} + \cdots + X_{in} C_{in} = \sum_{j=1}^{n} X_{ij} C_{ij} \tag{20}
\]

where C_{ij} denotes the cofactor of X_{ij}.

Thus the derivative will be

\begin{align*}
\left( \frac{\partial |X|}{\partial X} \right)_{kl}
&= \frac{\partial}{\partial X_{kl}} \left\{ X_{i1} C_{i1} + X_{i2} C_{i2} + \cdots + X_{in} C_{in} \right\} \\
&= \frac{\partial}{\partial X_{kl}} \left\{ X_{k1} C_{k1} + X_{k2} C_{k2} + \cdots + X_{kn} C_{kn} \right\} \quad \text{(we can expand along any row, so choose } i = k\text{)} \\
&= C_{kl} \tag{21}
\end{align*}

The last step follows because the cofactors C_{k1}, C_{k2}, ..., C_{kn} are computed from the entries of X outside row k, so they do not depend on X_{kl}.

Thus (see [7, p. 386])

\[
\frac{\partial |X|}{\partial X} = \text{cofactor matrix of } X = (\operatorname{adj} X)^T \tag{22}
\]

But we know that the inverse of X is given by [7, p. 387]

\[
X^{-1} = \frac{1}{|X|} \operatorname{adj} X \tag{23}
\]

thus

\[
\operatorname{adj} X = |X| X^{-1} \tag{24}
\]

which, when substituted into (22), results in the identity

\[
\frac{\partial |X|}{\partial X} = |X| (X^{-1})^T \tag{25}
\]

From (25), applying the chain rule element-wise, (∂ ln|X| / ∂X)_{kl} = ∂ ln|X| / ∂X_{kl} = (1/|X|) ∂|X| / ∂X_{kl}, we can also write

\[
\frac{\partial \ln |X|}{\partial X} = \frac{1}{|X|} \frac{\partial |X|}{\partial X} = \frac{1}{|X|} |X| (X^{-1})^T \tag{26}
\]

giving the identity

\[
\frac{\partial \ln |X|}{\partial X} = (X^{-1})^T \tag{27}
\]
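Both (25) and (27) are easy to verify numerically. A sketch of my own, assuming NumPy (X is made symmetric positive definite so that ln |X| is defined, and grad_wrt_matrix is a made-up helper implementing (4) by central differences):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
M = rng.standard_normal((n, n))
X = M @ M.T + n * np.eye(n)  # symmetric positive definite, so |X| > 0

def grad_wrt_matrix(f, X, eps=1e-6):
    # Central-difference estimate of the scalar-by-matrix derivative (4).
    G = np.zeros_like(X)
    for k in range(X.shape[0]):
        for l in range(X.shape[1]):
            E = np.zeros_like(X)
            E[k, l] = eps
            G[k, l] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

det_grad = grad_wrt_matrix(np.linalg.det, X)
print(np.allclose(det_grad,
                  np.linalg.det(X) * np.linalg.inv(X).T, atol=1e-4))  # (25)

logdet_grad = grad_wrt_matrix(lambda X: np.log(np.linalg.det(X)), X)
print(np.allclose(logdet_grad, np.linalg.inv(X).T, atol=1e-6))        # (27)
```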

4 References

[1] Matrix calculus. [Online]. Available: http://en.wikipedia.org/wiki/Matrix_calculus
[2] K. B. Petersen and M. S. Pedersen, "The matrix cookbook," 2008.
[3] A. Ng, Machine learning. Class notes for CS229, Stanford Engineering Everywhere, Stanford University, 2008. [Online]. Available: http://see.stanford.edu
[4] S. R. Searle, Matrix Algebra Useful for Statistics. New York, NY: John Wiley & Sons, 1982.
[5] J. R. Schott, Matrix Analysis for Statistics. New York, NY: John Wiley & Sons, 1996.
[6] T. P. Minka, "Old and new matrix algebra useful for statistics," 2000. [Online]. Available: http://research.microsoft.com/en-us/um/people/minka/papers/matrix
[7] D. G. Zill and M. R. Cullen, Advanced Engineering Mathematics, 3rd ed. Jones and Bartlett, 2006.
