
Notes on the Derivative of a Trace

Johannes Traa

This write-up elucidates the rules of matrix calculus for expressions involving the trace of a function of a matrix X:

$$ f = \operatorname{tr}\left[ g(X) \right]. \tag{1} $$

We would like to take the derivative of f with respect to X:

$$ \frac{\partial f}{\partial X} = \;? \tag{2} $$

One strategy is to write the trace expression as a sum using index notation, take the derivative, and re-write the result in matrix form. An easier way is to reduce the problem to one or more smaller problems where the results for simpler expressions can be applied. It's brute-force vs. bottom-up.

MATRIX-VALUED DERIVATIVE

The derivative of a scalar f with respect to a matrix $X \in \mathbb{R}^{M \times N}$ can be written as:

$$ \frac{\partial f}{\partial X} = \begin{bmatrix} \frac{\partial f}{\partial X_{11}} & \frac{\partial f}{\partial X_{12}} & \cdots & \frac{\partial f}{\partial X_{1N}} \\ \frac{\partial f}{\partial X_{21}} & \frac{\partial f}{\partial X_{22}} & \cdots & \frac{\partial f}{\partial X_{2N}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial X_{M1}} & \frac{\partial f}{\partial X_{M2}} & \cdots & \frac{\partial f}{\partial X_{MN}} \end{bmatrix}. \tag{3} $$

So, the result is the same size as X.
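As a concrete sketch of (3), assuming NumPy (the helper name num_grad and the test function are ours, not from the note), the matrix of partials can be assembled entry by entry with central differences; the result has the same shape as X. It is checked against the standard gradient 2X of the squared Frobenius norm.

```python
import numpy as np

def num_grad(f, X, eps=1e-6):
    """Assemble the matrix of partial derivatives in (3) by central differences."""
    G = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        E = np.zeros_like(X)
        E[idx] = eps
        G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))

# Scalar function of a matrix: f(X) = ||X||_F^2, whose gradient is the standard result 2X.
G = num_grad(lambda X: np.sum(X ** 2), X)
print(G.shape)                          # (4, 3): the same size as X
assert np.allclose(G, 2 * X, atol=1e-6)
```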

MATRIX AND INDEX NOTATION

It is useful to be able to convert between matrix notation and index notation. For example, the product AB has elements:

$$ [AB]_{ik} = \sum_j A_{ij} B_{jk}, \tag{4} $$

and the matrix product $ABC^T$ has elements:

$$ \left[ ABC^T \right]_{il} = \sum_j A_{ij} \left[ BC^T \right]_{jl} = \sum_j A_{ij} \sum_k B_{jk} C_{lk} = \sum_j \sum_k A_{ij} B_{jk} C_{lk}. \tag{5} $$
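The conversions (4) and (5) are easy to sanity-check with einsum; a minimal sketch assuming NumPy (the shapes are one illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((5, 4))

# (4): [AB]_{ik} = sum_j A_{ij} B_{jk}
assert np.allclose(np.einsum('ij,jk->ik', A, B), A @ B)

# (5): [A B C^T]_{il} = sum_j sum_k A_{ij} B_{jk} C_{lk}
assert np.allclose(np.einsum('ij,jk,lk->il', A, B, C), A @ B @ C.T)
```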

FIRST-ORDER DERIVATIVES

EXAMPLE 1 Consider this example:

$$ f = \operatorname{tr}\left[ AXB \right]. \tag{6} $$

We can write this using index notation as:

$$ f = \sum_i [AXB]_{ii} = \sum_i \sum_j A_{ij} [XB]_{ji} = \sum_i \sum_j A_{ij} \sum_k X_{jk} B_{ki} = \sum_i \sum_j \sum_k A_{ij} X_{jk} B_{ki}. \tag{7} $$

Taking the derivative with respect to $X_{jk}$, we get:

$$ \frac{\partial f}{\partial X_{jk}} = \sum_i A_{ij} B_{ki} = [BA]_{kj}. \tag{8} $$

The result has to be the same size as X, so we know that the indices of the rows and columns must be j and k, respectively. This means we have to transpose the result above to write the derivative in matrix form as:

$$ \frac{\partial \operatorname{tr}\left[ AXB \right]}{\partial X} = A^T B^T. \tag{9} $$
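A quick numerical check of the index-to-matrix conversion in (8) and (9); a sketch assuming NumPy, with shapes chosen so that AXB is a valid trace:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))   # A is I x J
B = rng.standard_normal((5, 4))   # B is K x I, so AXB is square for X of size J x K

# (8): sum_i A_{ij} B_{ki} = [BA]_{kj}
D = np.einsum('ij,ki->kj', A, B)
assert np.allclose(D, B @ A)

# (9): arranging the entries with rows indexed by j and columns by k transposes the result.
assert np.allclose(D.T, A.T @ B.T)
```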

EXAMPLE 2 Similarly, we have:

$$ f = \operatorname{tr}\left[ AX^T B \right] = \sum_i \sum_j \sum_k A_{ij} X_{kj} B_{ki}, \tag{10} $$

so that the derivative is:

$$ \frac{\partial f}{\partial X_{kj}} = \sum_i A_{ij} B_{ki} = [BA]_{kj}. \tag{11} $$

The X term appears in (10) with indices $kj$, so we need to write the derivative in matrix form such that k is the row index and j is the column index. Thus, we have:

$$ \frac{\partial \operatorname{tr}\left[ AX^T B \right]}{\partial X} = BA. \tag{12} $$
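Both first-order identities (9) and (12) can also be confirmed with a directional-derivative check: for any direction D, the central difference (f(X + eps D) - f(X - eps D)) / (2 eps) should match the entry-wise sum of the claimed gradient times D. A minimal sketch assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)
A, X, B = rng.standard_normal((4, 3)), rng.standard_normal((3, 5)), rng.standard_normal((5, 4))
D = rng.standard_normal(X.shape)

# Directional derivative of f at X along D, by central differences.
dd = lambda f, e=1e-6: (f(X + e * D) - f(X - e * D)) / (2 * e)

# (9): d tr[AXB]/dX = A^T B^T
assert np.isclose(dd(lambda X: np.trace(A @ X @ B)), np.sum((A.T @ B.T) * D), atol=1e-6)

# (12): d tr[A X^T B]/dX = B A  (A2, B2 sized so that A2 X^T B2 is a valid trace)
A2, B2 = rng.standard_normal((4, 5)), rng.standard_normal((3, 4))
assert np.isclose(dd(lambda X: np.trace(A2 @ X.T @ B2)), np.sum((B2 @ A2) * D), atol=1e-6)
```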

MULTIPLE-ORDER DERIVATIVES

Now consider a more complicated example:

$$ f = \operatorname{tr}\left[ AXBXC^T \right] \tag{13} $$

$$ \;\; = \sum_i \sum_j \sum_k \sum_l \sum_m A_{ij} X_{jk} B_{kl} X_{lm} C_{im}. \tag{14} $$

The derivative has contributions from both appearances of X.

TAKE 1

In index notation:

$$ \frac{\partial f}{\partial X_{jk}} = \sum_i \sum_l \sum_m A_{ij} B_{kl} X_{lm} C_{im} = \left[ BXC^T A \right]_{kj}, \tag{15} $$

$$ \frac{\partial f}{\partial X_{lm}} = \sum_i \sum_j \sum_k A_{ij} X_{jk} B_{kl} C_{im} = \left[ C^T AXB \right]_{ml}. \tag{16} $$

Transposing appropriately and summing the terms together, we have:

$$ \frac{\partial \operatorname{tr}\left[ AXBXC^T \right]}{\partial X} = A^T C X^T B^T + B^T X^T A^T C. \tag{17} $$
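A directional-derivative check of (17); a sketch assuming NumPy, with shapes chosen so the trace is well defined:

```python
import numpy as np

rng = np.random.default_rng(2)
p, m, k = 4, 3, 5
A, B, C = rng.standard_normal((p, m)), rng.standard_normal((k, m)), rng.standard_normal((p, k))
X = rng.standard_normal((m, k))
D = rng.standard_normal(X.shape)

f = lambda X: np.trace(A @ X @ B @ X @ C.T)
G = A.T @ C @ X.T @ B.T + B.T @ X.T @ A.T @ C       # right-hand side of (17)

dd = (f(X + 1e-6 * D) - f(X - 1e-6 * D)) / 2e-6     # directional derivative along D
assert np.isclose(dd, np.sum(G * D), atol=1e-5)
```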

TAKE 2

We can skip this tedious process by applying (9) for each appearance of X:

$$ \frac{\partial \operatorname{tr}\left[ AXBXC^T \right]}{\partial X} = \frac{\partial \operatorname{tr}\left[ AXD \right]}{\partial X} + \frac{\partial \operatorname{tr}\left[ EXC^T \right]}{\partial X} = A^T D^T + E^T C, \tag{18} $$

where $D = BXC^T$ and $E = AXB$. So we just evaluate the matrix derivative for each appearance of X assuming that everything else is a constant (including other X's). To see why this rule is useful, consider the following beast:

$$ f = \operatorname{tr}\left[ AXX^T BCX^T XC \right]. \tag{19} $$

We can immediately write down the derivative using (9) and (12):

$$ \frac{\partial \operatorname{tr}\left[ AXX^T BCX^T XC \right]}{\partial X} = A^T \left( X^T BCX^T XC \right)^T + \left( BCX^T XC \right)\left( AX \right) + \left( XC \right)\left( AXX^T BC \right) + \left( AXX^T BCX^T \right)^T C^T \tag{20} $$

$$ = A^T C^T X^T X C^T B^T X + BCX^T XCAX + XCAXX^T BC + XC^T B^T XX^T A^T C^T. \tag{21} $$
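The same directional-derivative check confirms (21), one term per appearance of X; a sketch assuming NumPy, with shapes chosen so that (19) is a valid trace:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 4                                   # X is m x n; A is n x m, B is m x n, C is n x n
A, B, C = rng.standard_normal((n, m)), rng.standard_normal((m, n)), rng.standard_normal((n, n))
X = rng.standard_normal((m, n))
D = rng.standard_normal(X.shape)

f = lambda X: np.trace(A @ X @ X.T @ B @ C @ X.T @ X @ C)

# Right-hand side of (21), one term per appearance of X
G = (A.T @ C.T @ X.T @ X @ C.T @ B.T @ X
     + B @ C @ X.T @ X @ C @ A @ X
     + X @ C @ A @ X @ X.T @ B @ C
     + X @ C.T @ B.T @ X @ X.T @ A.T @ C.T)

dd = (f(X + 1e-5 * D) - f(X - 1e-5 * D)) / 2e-5
assert np.isclose(dd, np.sum(G * D), atol=1e-4)
```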

FROBENIUS NORM

The Frobenius norm shows up when we have an optimization problem involving a matrix factorization and we want to minimize a sum-of-squares error criterion:

$$ f = \sum_i \sum_k \left( X_{ik} - \sum_j W_{ij} H_{jk} \right)^2 = \left\| X - WH \right\|_F^2 = \operatorname{tr}\left[ (X - WH)(X - WH)^T \right]. \tag{22} $$

We can work with the expression in index notation, but it's easier to work directly with matrices and apply the results derived earlier. Suppose we want to find the derivative with respect to W. Expanding the matrix product inside the trace, we have:

$$ f = \operatorname{tr}\left[ XX^T \right] - \operatorname{tr}\left[ XH^T W^T \right] - \operatorname{tr}\left[ WHX^T \right] + \operatorname{tr}\left[ WHH^T W^T \right]. \tag{23} $$

Applying (9) and (12), we easily deduce that:

$$ \frac{\partial \operatorname{tr}\left[ (X - WH)(X - WH)^T \right]}{\partial W} = -2XH^T + 2WHH^T. \tag{24} $$
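A quick numerical confirmation of (24), as a sketch assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((6, 5))
W = rng.standard_normal((6, 3))
H = rng.standard_normal((3, 5))
D = rng.standard_normal(W.shape)

f = lambda W: np.linalg.norm(X - W @ H, 'fro') ** 2
G = -2 * X @ H.T + 2 * W @ H @ H.T              # right-hand side of (24)

dd = (f(W + 1e-6 * D) - f(W - 1e-6 * D)) / 2e-6
assert np.isclose(dd, np.sum(G * D), atol=1e-5)
```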

LOG

Consider this trace expression:

$$ f = \operatorname{tr}\left[ V^T \log(AXB) \right] = \sum_i \sum_j V_{ij} \log\left( \sum_m \sum_n A_{im} X_{mn} B_{nj} \right). \tag{25} $$

Taking the derivative with respect to $X_{mn}$, we get:

$$ \frac{\partial f}{\partial X_{mn}} = \sum_i \sum_j \left( \frac{V_{ij}}{\sum_m \sum_n A_{im} X_{mn} B_{nj}} \right) A_{im} B_{nj} = \sum_i \sum_j \left[ \frac{V}{AXB} \right]_{ij} A_{im} B_{nj}. \tag{26} $$

Thus:

$$ \frac{\partial \operatorname{tr}\left[ V^T \log(AXB) \right]}{\partial X} = A^T \left( \frac{V}{AXB} \right) B^T. \tag{27} $$

Similarly:

$$ \frac{\partial \operatorname{tr}\left[ V^T \log\left( AX^T B \right) \right]}{\partial X} = B \left( \frac{V}{AX^T B} \right)^T A. \tag{28} $$

These bear a spooky resemblance to (9) and (12).
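A numerical confirmation of (27); a sketch assuming NumPy, where positive entries keep the element-wise log defined and the quotient V/(AXB) is element-wise:

```python
import numpy as np

rng = np.random.default_rng(5)
# Positive entries keep the element-wise log in (25) well defined.
A = rng.uniform(0.5, 1.5, (4, 3))
X = rng.uniform(0.5, 1.5, (3, 5))
B = rng.uniform(0.5, 1.5, (5, 6))
V = rng.standard_normal((4, 6))
D = rng.standard_normal(X.shape)

f = lambda X: np.trace(V.T @ np.log(A @ X @ B))
G = A.T @ (V / (A @ X @ B)) @ B.T               # right-hand side of (27)

dd = (f(X + 1e-6 * D) - f(X - 1e-6 * D)) / 2e-6
assert np.isclose(dd, np.sum(G * D), atol=1e-5)
```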

DIAG

EXAMPLE 1 Consider the tricky case of a $\operatorname{diag}(\cdot)$ operator:

$$ f = \operatorname{tr}\left[ A \operatorname{diag}(x) B \right] = \sum_i \sum_j A_{ij} x_j B_{ji}. \tag{29} $$

Taking the derivative, we have:

$$ \frac{\partial f}{\partial x_j} = \sum_i A_{ij} B_{ji} = \left[ \left( A^T \circ B \right) \mathbf{1} \right]_j. \tag{30} $$

So we can write:

$$ \frac{\partial \operatorname{tr}\left[ A \operatorname{diag}(x) B \right]}{\partial x} = \left( A^T \circ B \right) \mathbf{1}, \tag{31} $$

where $\circ$ denotes the element-wise (Hadamard) product and $\mathbf{1}$ is a vector of ones.
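A numerical confirmation of (31); a sketch assuming NumPy, where the Hadamard product is written with *:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 4))
x = rng.standard_normal(3)
d = rng.standard_normal(3)

f = lambda x: np.trace(A @ np.diag(x) @ B)
g = (A.T * B) @ np.ones(4)                      # right-hand side of (31); * is element-wise

dd = (f(x + 1e-6 * d) - f(x - 1e-6 * d)) / 2e-6
assert np.isclose(dd, g @ d, atol=1e-6)
```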

EXAMPLE 2 Consider the following take on the last example:

$$ f = \operatorname{tr}\left[ J A \operatorname{diag}(x) B \right] = \sum_i \sum_j \sum_k A_{ij} x_j B_{jk}, \tag{32} $$

where J is the all-ones matrix.

TAKE 1

Taking the derivative, we have:

$$ \frac{\partial f}{\partial x_j} = \sum_i \sum_k A_{ij} B_{jk} = \left( \sum_i A_{ij} \right)\left( \sum_k B_{jk} \right) = \left[ A^T \mathbf{1} \circ B \mathbf{1} \right]_j. \tag{33} $$

So we can write:

$$ \frac{\partial \operatorname{tr}\left[ J A \operatorname{diag}(x) B \right]}{\partial x} = A^T \mathbf{1} \circ B \mathbf{1}. \tag{34} $$

TAKE 2

We could have derived this result from the previous example using the cyclic (rotation) property of the trace operator:

$$ f = \operatorname{tr}\left[ J A \operatorname{diag}(x) B \right] = \operatorname{tr}\left[ \mathbf{1}\mathbf{1}^T A \operatorname{diag}(x) B \right] = \operatorname{tr}\left[ \mathbf{1}^T A \operatorname{diag}(x) B \mathbf{1} \right] = \operatorname{tr}\left[ a^T \operatorname{diag}(x) b \right], \tag{35} $$

where we have defined $a = A^T \mathbf{1}$ and $b = B \mathbf{1}$. Applying (31), we have:

$$ \frac{\partial \operatorname{tr}\left[ a^T \operatorname{diag}(x) b \right]}{\partial x} = (a \circ b)\mathbf{1} = A^T \mathbf{1} \circ B \mathbf{1}. \tag{36} $$
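A numerical confirmation of (34)/(36); a sketch assuming NumPy, with J built explicitly as the all-ones matrix:

```python
import numpy as np

rng = np.random.default_rng(7)
I, N, K = 4, 3, 5
A = rng.standard_normal((I, N))
B = rng.standard_normal((N, K))
x = rng.standard_normal(N)
d = rng.standard_normal(N)
J = np.ones((K, I))                              # the all-ones matrix, J = 1 1^T

f = lambda x: np.trace(J @ A @ np.diag(x) @ B)
g = (A.T @ np.ones(I)) * (B @ np.ones(K))        # right-hand side of (34)/(36)

dd = (f(x + 1e-6 * d) - f(x - 1e-6 * d)) / 2e-6
assert np.isclose(dd, g @ d, atol=1e-6)
```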

EXAMPLE 3 Consider a more complicated example:

$$ f = \operatorname{tr}\left[ V^T \log\left( A \operatorname{diag}(x) B \right) \right] = \sum_i \sum_j V_{ij} \log\left( \sum_m A_{im} x_m B_{mj} \right). \tag{37} $$

Taking the derivative with respect to $x_m$, we have:

$$ \frac{\partial f}{\partial x_m} = \sum_i \sum_j \left( \frac{V_{ij}}{\sum_m A_{im} x_m B_{mj}} \right) A_{im} B_{mj} \tag{38} $$

$$ = \sum_j \sum_i \left[ \frac{V}{A \operatorname{diag}(x) B} \right]_{ij} A_{im} B_{mj} \tag{39} $$

$$ = \left[ \left( A^T \left( \frac{V}{A \operatorname{diag}(x) B} \right) \circ B \right) \mathbf{1} \right]_m. \tag{40} $$

The final result is:

$$ \frac{\partial \operatorname{tr}\left[ V^T \log\left( A \operatorname{diag}(x) B \right) \right]}{\partial x} = \left( A^T \left( \frac{V}{A \operatorname{diag}(x) B} \right) \circ B \right) \mathbf{1}. \tag{41} $$
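A numerical confirmation of (41); a sketch assuming NumPy, with positive entries so the element-wise log is defined:

```python
import numpy as np

rng = np.random.default_rng(8)
# Positive entries keep the element-wise log in (37) well defined.
A = rng.uniform(0.5, 1.5, (4, 3))
B = rng.uniform(0.5, 1.5, (3, 5))
V = rng.standard_normal((4, 5))
x = rng.uniform(0.5, 1.5, 3)
d = rng.standard_normal(3)

f = lambda x: np.trace(V.T @ np.log(A @ np.diag(x) @ B))
W = V / (A @ np.diag(x) @ B)                     # element-wise quotient appearing in (41)
g = ((A.T @ W) * B) @ np.ones(5)                 # right-hand side of (41)

dd = (f(x + 1e-6 * d) - f(x - 1e-6 * d)) / 2e-6
assert np.isclose(dd, g @ d, atol=1e-5)
```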

EXAMPLE 4 How about when we have a trace composed of a sum of expressions, each of which depends on which row of a matrix B is chosen:

$$ f = \sum_k \operatorname{tr}\left[ V^T \log\left( A \operatorname{diag}(B_{k:} X) C \right) \right] = \sum_k \sum_i \sum_j V_{ij} \log\left( \sum_m A_{im} \left( \sum_n B_{kn} X_{nm} \right) C_{mj} \right). \tag{42} $$

Taking the derivative, we get:

$$ \frac{\partial f}{\partial X_{nm}} = \sum_k \sum_i \sum_j \left( \frac{V_{ij}}{\sum_m A_{im} \left( \sum_n B_{kn} X_{nm} \right) C_{mj}} \right) A_{im} B_{kn} C_{mj} \tag{43} $$

$$ = \sum_k B_{kn} \sum_i \sum_j \left[ \frac{V}{A \operatorname{diag}(B_{k:} X) C} \right]_{ij} A_{im} C_{mj} \tag{44} $$

$$ = \sum_k B_{kn} \left[ \mathbf{1}^T \left( \left( \frac{V}{A \operatorname{diag}(B_{k:} X) C} \right)^T A \circ C^T \right) \right]_m, \tag{45} $$

so we can write:

$$ \frac{\partial \operatorname{tr}\left[ \sum_k V^T \log\left( A \operatorname{diag}(B_{k:} X) C \right) \right]}{\partial X} = B^T \begin{bmatrix} \mathbf{1}^T \left( \left( \frac{V}{A \operatorname{diag}(B_{1:} X) C} \right)^T A \circ C^T \right) \\ \vdots \\ \mathbf{1}^T \left( \left( \frac{V}{A \operatorname{diag}(B_{K:} X) C} \right)^T A \circ C^T \right) \end{bmatrix}. \tag{46} $$
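Finally, (46) can be confirmed the same way; a sketch assuming NumPy, building the stacked matrix row by row and left-multiplying by B^T:

```python
import numpy as np

rng = np.random.default_rng(9)
I, M, J, K, N = 3, 4, 5, 2, 6
# Positive entries keep the element-wise log in (42) well defined.
A = rng.uniform(0.5, 1.5, (I, M))
C = rng.uniform(0.5, 1.5, (M, J))
B = rng.uniform(0.5, 1.5, (K, N))
X = rng.uniform(0.5, 1.5, (N, M))
V = rng.standard_normal((I, J))
D = rng.standard_normal(X.shape)

def f(X):
    return sum(np.trace(V.T @ np.log(A @ np.diag(B[k] @ X) @ C)) for k in range(K))

# (46): stack the K row vectors 1^T((V / (A diag(B_k: X) C))^T A * C^T), then left-multiply by B^T.
rows = [np.ones(J) @ (((V / (A @ np.diag(B[k] @ X) @ C)).T @ A) * C.T) for k in range(K)]
G = B.T @ np.stack(rows)

dd = (f(X + 1e-6 * D) - f(X - 1e-6 * D)) / 2e-6
assert np.isclose(dd, np.sum(G * D), atol=1e-5)
```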

CONCLUSION

We've learned two things. First, if we don't know how to find the derivative of an expression using matrix calculus directly, we can always fall back on index notation and convert back to matrices at the end. This reduces a potentially unintuitive matrix-valued problem into one involving scalars, which we are used to. Second, it's less painful to massage an expression into a familiar form and apply previously-derived identities.
