
Notes on the Derivative of a Trace

Johannes Traa

This write-up elucidates the rules of matrix calculus for expressions involving the trace of a function of a matrix X:

$$ f = \operatorname{tr}\left[ g(X) \right]. \tag{1} $$

We would like to take the derivative of f with respect to X:

$$ \frac{\partial f}{\partial X} = \;? \tag{2} $$

One strategy is to write the trace expression as a sum using index notation, take the derivative, and re-write the result in matrix form. An easier way is to reduce the problem to one or more smaller problems where the results for simpler expressions can be applied. It's brute-force vs. bottom-up.

MATRIX-VALUED DERIVATIVE

The derivative of a scalar f with respect to a matrix $X \in \mathbb{R}^{M \times N}$ can be written as:

$$ \frac{\partial f}{\partial X} = \begin{bmatrix} \frac{\partial f}{\partial X_{11}} & \frac{\partial f}{\partial X_{12}} & \cdots & \frac{\partial f}{\partial X_{1N}} \\ \frac{\partial f}{\partial X_{21}} & \frac{\partial f}{\partial X_{22}} & \cdots & \frac{\partial f}{\partial X_{2N}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial X_{M1}} & \frac{\partial f}{\partial X_{M2}} & \cdots & \frac{\partial f}{\partial X_{MN}} \end{bmatrix}. \tag{3} $$

So, the result is the same size as X.
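As a concrete sketch of (3), assuming NumPy (the helper name num_grad and the test function are ours, not from the note), the matrix of partials can be assembled entry by entry with central differences; the result has the same shape as X. It is checked against the standard gradient 2X of the squared Frobenius norm.

```python
import numpy as np

def num_grad(f, X, eps=1e-6):
    """Assemble the matrix of partial derivatives in (3) by central differences."""
    G = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        E = np.zeros_like(X)
        E[idx] = eps
        G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))

# Scalar function of a matrix: f(X) = ||X||_F^2, whose gradient is the standard result 2X.
G = num_grad(lambda X: np.sum(X ** 2), X)
print(G.shape)                          # (4, 3): the same size as X
assert np.allclose(G, 2 * X, atol=1e-6)
```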

MATRIX AND INDEX NOTATION

It is useful to be able to convert between matrix notation and index notation. For example, the product AB has elements:

$$ [AB]_{ik} = \sum_j A_{ij} B_{jk}, \tag{4} $$

and the matrix product $ABC^T$ has elements:

$$ \left[ ABC^T \right]_{il} = \sum_j A_{ij} \left[ BC^T \right]_{jl} = \sum_j A_{ij} \sum_k B_{jk} C_{lk} = \sum_j \sum_k A_{ij} B_{jk} C_{lk}. \tag{5} $$
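The conversions (4) and (5) are easy to sanity-check with einsum; a minimal sketch assuming NumPy (the shapes are one illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((5, 4))

# (4): [AB]_{ik} = sum_j A_{ij} B_{jk}
assert np.allclose(np.einsum('ij,jk->ik', A, B), A @ B)

# (5): [A B C^T]_{il} = sum_j sum_k A_{ij} B_{jk} C_{lk}
assert np.allclose(np.einsum('ij,jk,lk->il', A, B, C), A @ B @ C.T)
```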

FIRST-ORDER DERIVATIVES

EXAMPLE 1 Consider this example:

$$ f = \operatorname{tr}\left[ AXB \right]. \tag{6} $$

We can write this using index notation as:

$$ f = \sum_i [AXB]_{ii} = \sum_i \sum_j A_{ij} [XB]_{ji} = \sum_i \sum_j A_{ij} \sum_k X_{jk} B_{ki} = \sum_i \sum_j \sum_k A_{ij} X_{jk} B_{ki}. \tag{7} $$

Taking the derivative with respect to $X_{jk}$, we get:

$$ \frac{\partial f}{\partial X_{jk}} = \sum_i A_{ij} B_{ki} = [BA]_{kj}. \tag{8} $$

The result has to be the same size as X, so we know that the indices of the rows and columns must be j and k, respectively. This means we have to transpose the result above to write the derivative in matrix form as:

$$ \frac{\partial \operatorname{tr}\left[ AXB \right]}{\partial X} = A^T B^T. \tag{9} $$
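A quick numerical check of the index-to-matrix conversion in (8) and (9); a sketch assuming NumPy, with shapes chosen so that AXB is a valid trace:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))   # A is I x J
B = rng.standard_normal((5, 4))   # B is K x I, so AXB is square for X of size J x K

# (8): sum_i A_{ij} B_{ki} = [BA]_{kj}
D = np.einsum('ij,ki->kj', A, B)
assert np.allclose(D, B @ A)

# (9): arranging the entries with rows indexed by j and columns by k transposes the result.
assert np.allclose(D.T, A.T @ B.T)
```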

EXAMPLE 2 Similarly, we have:

$$ f = \operatorname{tr}\left[ AX^T B \right] = \sum_i \sum_j \sum_k A_{ij} X_{kj} B_{ki}, \tag{10} $$

so that the derivative is:

$$ \frac{\partial f}{\partial X_{kj}} = \sum_i A_{ij} B_{ki} = [BA]_{kj}. \tag{11} $$

The X term appears in (10) with indices $kj$, so we need to write the derivative in matrix form such that k is the row index and j is the column index. Thus, we have:

$$ \frac{\partial \operatorname{tr}\left[ AX^T B \right]}{\partial X} = BA. \tag{12} $$
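Both first-order identities (9) and (12) can also be confirmed with a directional-derivative check: for any direction D, the central difference (f(X + eps D) - f(X - eps D)) / (2 eps) should match the entry-wise sum of the claimed gradient times D. A minimal sketch assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)
A, X, B = rng.standard_normal((4, 3)), rng.standard_normal((3, 5)), rng.standard_normal((5, 4))
D = rng.standard_normal(X.shape)

# Directional derivative of f at X along D, by central differences.
dd = lambda f, e=1e-6: (f(X + e * D) - f(X - e * D)) / (2 * e)

# (9): d tr[AXB]/dX = A^T B^T
assert np.isclose(dd(lambda X: np.trace(A @ X @ B)), np.sum((A.T @ B.T) * D), atol=1e-6)

# (12): d tr[A X^T B]/dX = B A  (A2, B2 sized so that A2 X^T B2 is a valid trace)
A2, B2 = rng.standard_normal((4, 5)), rng.standard_normal((3, 4))
assert np.isclose(dd(lambda X: np.trace(A2 @ X.T @ B2)), np.sum((B2 @ A2) * D), atol=1e-6)
```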

MULTIPLE-ORDER DERIVATIVES

Now consider a more complicated example:

$$ f = \operatorname{tr}\left[ AXBXC^T \right] \tag{13} $$

$$ \;\; = \sum_i \sum_j \sum_k \sum_l \sum_m A_{ij} X_{jk} B_{kl} X_{lm} C_{im}. \tag{14} $$

The derivative has contributions from both appearances of X.

TAKE 1

In index notation:

$$ \frac{\partial f}{\partial X_{jk}} = \sum_i \sum_l \sum_m A_{ij} B_{kl} X_{lm} C_{im} = \left[ BXC^T A \right]_{kj}, \tag{15} $$

$$ \frac{\partial f}{\partial X_{lm}} = \sum_i \sum_j \sum_k A_{ij} X_{jk} B_{kl} C_{im} = \left[ C^T AXB \right]_{ml}. \tag{16} $$

Transposing appropriately and summing the terms together, we have:

$$ \frac{\partial \operatorname{tr}\left[ AXBXC^T \right]}{\partial X} = A^T C X^T B^T + B^T X^T A^T C. \tag{17} $$
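A directional-derivative check of (17); a sketch assuming NumPy, with shapes chosen so the trace is well defined:

```python
import numpy as np

rng = np.random.default_rng(2)
p, m, k = 4, 3, 5
A, B, C = rng.standard_normal((p, m)), rng.standard_normal((k, m)), rng.standard_normal((p, k))
X = rng.standard_normal((m, k))
D = rng.standard_normal(X.shape)

f = lambda X: np.trace(A @ X @ B @ X @ C.T)
G = A.T @ C @ X.T @ B.T + B.T @ X.T @ A.T @ C       # right-hand side of (17)

dd = (f(X + 1e-6 * D) - f(X - 1e-6 * D)) / 2e-6     # directional derivative along D
assert np.isclose(dd, np.sum(G * D), atol=1e-5)
```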

TAKE 2

We can skip this tedious process by applying (9) for each appearance of X:

$$ \frac{\partial \operatorname{tr}\left[ AXBXC^T \right]}{\partial X} = \frac{\partial \operatorname{tr}\left[ AXD \right]}{\partial X} + \frac{\partial \operatorname{tr}\left[ EXC^T \right]}{\partial X} = A^T D^T + E^T C, \tag{18} $$

where $D = BXC^T$ and $E = AXB$. So we just evaluate the matrix derivative for each appearance of X assuming that everything else is a constant (including other X's). To see why this rule is useful, consider the following beast:

$$ f = \operatorname{tr}\left[ AXX^T BCX^T XC \right]. \tag{19} $$

We can immediately write down the derivative using (9) and (12):

$$ \frac{\partial \operatorname{tr}\left[ AXX^T BCX^T XC \right]}{\partial X} = A^T \left( X^T BCX^T XC \right)^T + \left( BCX^T XC \right)\left( AX \right) + \left( XC \right)\left( AXX^T BC \right) + \left( AXX^T BCX^T \right)^T C^T \tag{20} $$

$$ = A^T C^T X^T X C^T B^T X + BCX^T XCAX + XCAXX^T BC + XC^T B^T XX^T A^T C^T. \tag{21} $$
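The same directional-derivative check confirms (21), one term per appearance of X; a sketch assuming NumPy, with shapes chosen so that (19) is a valid trace:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 4                                   # X is m x n; A is n x m, B is m x n, C is n x n
A, B, C = rng.standard_normal((n, m)), rng.standard_normal((m, n)), rng.standard_normal((n, n))
X = rng.standard_normal((m, n))
D = rng.standard_normal(X.shape)

f = lambda X: np.trace(A @ X @ X.T @ B @ C @ X.T @ X @ C)

# Right-hand side of (21), one term per appearance of X
G = (A.T @ C.T @ X.T @ X @ C.T @ B.T @ X
     + B @ C @ X.T @ X @ C @ A @ X
     + X @ C @ A @ X @ X.T @ B @ C
     + X @ C.T @ B.T @ X @ X.T @ A.T @ C.T)

dd = (f(X + 1e-5 * D) - f(X - 1e-5 * D)) / 2e-5
assert np.isclose(dd, np.sum(G * D), atol=1e-4)
```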

FROBENIUS NORM

The Frobenius norm shows up when we have an optimization problem involving a matrix factorization and we want to minimize a sum-of-squares error criterion:

$$ f = \sum_i \sum_k \left( X_{ik} - \sum_j W_{ij} H_{jk} \right)^2 = \left\| X - WH \right\|_F^2 = \operatorname{tr}\left[ (X - WH)(X - WH)^T \right]. \tag{22} $$

We can work with the expression in index notation, but it's easier to work directly with matrices and apply the results derived earlier. Suppose we want to find the derivative with respect to W. Expanding the matrix product inside the trace, we have:

$$ f = \operatorname{tr}\left[ XX^T \right] - \operatorname{tr}\left[ XH^T W^T \right] - \operatorname{tr}\left[ WHX^T \right] + \operatorname{tr}\left[ WHH^T W^T \right]. \tag{23} $$

Applying (9) and (12), we easily deduce that:

$$ \frac{\partial \operatorname{tr}\left[ (X - WH)(X - WH)^T \right]}{\partial W} = -2XH^T + 2WHH^T. \tag{24} $$
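A quick numerical confirmation of (24), as a sketch assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((6, 5))
W = rng.standard_normal((6, 3))
H = rng.standard_normal((3, 5))
D = rng.standard_normal(W.shape)

f = lambda W: np.linalg.norm(X - W @ H, 'fro') ** 2
G = -2 * X @ H.T + 2 * W @ H @ H.T              # right-hand side of (24)

dd = (f(W + 1e-6 * D) - f(W - 1e-6 * D)) / 2e-6
assert np.isclose(dd, np.sum(G * D), atol=1e-5)
```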

LOG

Consider this trace expression:

$$ f = \operatorname{tr}\left[ V^T \log(AXB) \right] = \sum_i \sum_j V_{ij} \log\left( \sum_m \sum_n A_{im} X_{mn} B_{nj} \right). \tag{25} $$

Taking the derivative with respect to $X_{mn}$, we get:

$$ \frac{\partial f}{\partial X_{mn}} = \sum_i \sum_j \left( \frac{V_{ij}}{\sum_m \sum_n A_{im} X_{mn} B_{nj}} \right) A_{im} B_{nj} = \sum_i \sum_j \left[ \frac{V}{AXB} \right]_{ij} A_{im} B_{nj}. \tag{26} $$

Thus:

$$ \frac{\partial \operatorname{tr}\left[ V^T \log(AXB) \right]}{\partial X} = A^T \left( \frac{V}{AXB} \right) B^T. \tag{27} $$

Similarly:

$$ \frac{\partial \operatorname{tr}\left[ V^T \log\left( AX^T B \right) \right]}{\partial X} = B \left( \frac{V}{AX^T B} \right)^T A. \tag{28} $$

These bear a spooky resemblance to (9) and (12).
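A numerical confirmation of (27); a sketch assuming NumPy, where positive entries keep the element-wise log defined and the quotient V/(AXB) is element-wise:

```python
import numpy as np

rng = np.random.default_rng(5)
# Positive entries keep the element-wise log in (25) well defined.
A = rng.uniform(0.5, 1.5, (4, 3))
X = rng.uniform(0.5, 1.5, (3, 5))
B = rng.uniform(0.5, 1.5, (5, 6))
V = rng.standard_normal((4, 6))
D = rng.standard_normal(X.shape)

f = lambda X: np.trace(V.T @ np.log(A @ X @ B))
G = A.T @ (V / (A @ X @ B)) @ B.T               # right-hand side of (27)

dd = (f(X + 1e-6 * D) - f(X - 1e-6 * D)) / 2e-6
assert np.isclose(dd, np.sum(G * D), atol=1e-5)
```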

DIAG

EXAMPLE 1 Consider the tricky case of a $\operatorname{diag}(\cdot)$ operator:

$$ f = \operatorname{tr}\left[ A \operatorname{diag}(x) B \right] = \sum_i \sum_j A_{ij} x_j B_{ji}. \tag{29} $$

Taking the derivative, we have:

$$ \frac{\partial f}{\partial x_j} = \sum_i A_{ij} B_{ji} = \left[ \left( A^T \circ B \right) \mathbf{1} \right]_j. \tag{30} $$

So we can write:

$$ \frac{\partial \operatorname{tr}\left[ A \operatorname{diag}(x) B \right]}{\partial x} = \left( A^T \circ B \right) \mathbf{1}, \tag{31} $$

where $\circ$ denotes the element-wise (Hadamard) product and $\mathbf{1}$ is a vector of ones.
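A numerical confirmation of (31); a sketch assuming NumPy, where the Hadamard product is written with *:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 4))
x = rng.standard_normal(3)
d = rng.standard_normal(3)

f = lambda x: np.trace(A @ np.diag(x) @ B)
g = (A.T * B) @ np.ones(4)                      # right-hand side of (31); * is element-wise

dd = (f(x + 1e-6 * d) - f(x - 1e-6 * d)) / 2e-6
assert np.isclose(dd, g @ d, atol=1e-6)
```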

EXAMPLE 2 Consider the following take on the last example:

$$ f = \operatorname{tr}\left[ J A \operatorname{diag}(x) B \right] = \sum_i \sum_j \sum_k A_{ij} x_j B_{jk}, \tag{32} $$

where J is the all-ones matrix.

TAKE 1

Taking the derivative, we have:

$$ \frac{\partial f}{\partial x_j} = \sum_i \sum_k A_{ij} B_{jk} = \left( \sum_i A_{ij} \right)\left( \sum_k B_{jk} \right) = \left[ A^T \mathbf{1} \circ B \mathbf{1} \right]_j. \tag{33} $$

So we can write:

$$ \frac{\partial \operatorname{tr}\left[ J A \operatorname{diag}(x) B \right]}{\partial x} = A^T \mathbf{1} \circ B \mathbf{1}. \tag{34} $$

TAKE 2

We could have derived this result from the previous example using the cyclic (rotation) property of the trace operator:

$$ f = \operatorname{tr}\left[ J A \operatorname{diag}(x) B \right] = \operatorname{tr}\left[ \mathbf{1}\mathbf{1}^T A \operatorname{diag}(x) B \right] = \operatorname{tr}\left[ \mathbf{1}^T A \operatorname{diag}(x) B \mathbf{1} \right] = \operatorname{tr}\left[ a^T \operatorname{diag}(x) b \right], \tag{35} $$

where we have defined $a = A^T \mathbf{1}$ and $b = B \mathbf{1}$. Applying (31), we have:

$$ \frac{\partial \operatorname{tr}\left[ a^T \operatorname{diag}(x) b \right]}{\partial x} = (a \circ b)\mathbf{1} = A^T \mathbf{1} \circ B \mathbf{1}. \tag{36} $$
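A numerical confirmation of (34)/(36); a sketch assuming NumPy, with J built explicitly as the all-ones matrix:

```python
import numpy as np

rng = np.random.default_rng(7)
I, N, K = 4, 3, 5
A = rng.standard_normal((I, N))
B = rng.standard_normal((N, K))
x = rng.standard_normal(N)
d = rng.standard_normal(N)
J = np.ones((K, I))                              # the all-ones matrix, J = 1 1^T

f = lambda x: np.trace(J @ A @ np.diag(x) @ B)
g = (A.T @ np.ones(I)) * (B @ np.ones(K))        # right-hand side of (34)/(36)

dd = (f(x + 1e-6 * d) - f(x - 1e-6 * d)) / 2e-6
assert np.isclose(dd, g @ d, atol=1e-6)
```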

EXAMPLE 3 Consider a more complicated example:

$$ f = \operatorname{tr}\left[ V^T \log\left( A \operatorname{diag}(x) B \right) \right] = \sum_i \sum_j V_{ij} \log\left( \sum_m A_{im} x_m B_{mj} \right). \tag{37} $$

Taking the derivative with respect to $x_m$, we have:

$$ \frac{\partial f}{\partial x_m} = \sum_i \sum_j \left( \frac{V_{ij}}{\sum_m A_{im} x_m B_{mj}} \right) A_{im} B_{mj} \tag{38} $$

$$ = \sum_j \sum_i \left[ \frac{V}{A \operatorname{diag}(x) B} \right]_{ij} A_{im} B_{mj} \tag{39} $$

$$ = \left[ \left( A^T \left( \frac{V}{A \operatorname{diag}(x) B} \right) \circ B \right) \mathbf{1} \right]_m. \tag{40} $$

The final result is:

$$ \frac{\partial \operatorname{tr}\left[ V^T \log\left( A \operatorname{diag}(x) B \right) \right]}{\partial x} = \left( A^T \left( \frac{V}{A \operatorname{diag}(x) B} \right) \circ B \right) \mathbf{1}. \tag{41} $$
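A numerical confirmation of (41); a sketch assuming NumPy, with positive entries so the element-wise log is defined:

```python
import numpy as np

rng = np.random.default_rng(8)
# Positive entries keep the element-wise log in (37) well defined.
A = rng.uniform(0.5, 1.5, (4, 3))
B = rng.uniform(0.5, 1.5, (3, 5))
V = rng.standard_normal((4, 5))
x = rng.uniform(0.5, 1.5, 3)
d = rng.standard_normal(3)

f = lambda x: np.trace(V.T @ np.log(A @ np.diag(x) @ B))
W = V / (A @ np.diag(x) @ B)                     # element-wise quotient appearing in (41)
g = ((A.T @ W) * B) @ np.ones(5)                 # right-hand side of (41)

dd = (f(x + 1e-6 * d) - f(x - 1e-6 * d)) / 2e-6
assert np.isclose(dd, g @ d, atol=1e-5)
```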

EXAMPLE 4 How about when we have a trace composed of a sum of expressions, each of which depends on which row of a matrix B is chosen:

$$ f = \sum_k \operatorname{tr}\left[ V^T \log\left( A \operatorname{diag}(B_{k:} X) C \right) \right] = \sum_k \sum_i \sum_j V_{ij} \log\left( \sum_m A_{im} \left( \sum_n B_{kn} X_{nm} \right) C_{mj} \right). \tag{42} $$

Taking the derivative, we get:

$$ \frac{\partial f}{\partial X_{nm}} = \sum_k \sum_i \sum_j \left( \frac{V_{ij}}{\sum_m A_{im} \left( \sum_n B_{kn} X_{nm} \right) C_{mj}} \right) A_{im} B_{kn} C_{mj} \tag{43} $$

$$ = \sum_k B_{kn} \sum_i \sum_j \left[ \frac{V}{A \operatorname{diag}(B_{k:} X) C} \right]_{ij} A_{im} C_{mj} \tag{44} $$

$$ = \sum_k B_{kn} \left[ \mathbf{1}^T \left( \left( \frac{V}{A \operatorname{diag}(B_{k:} X) C} \right)^T A \circ C^T \right) \right]_m, \tag{45} $$

so we can write:

$$ \frac{\partial \operatorname{tr}\left[ \sum_k V^T \log\left( A \operatorname{diag}(B_{k:} X) C \right) \right]}{\partial X} = B^T \begin{bmatrix} \mathbf{1}^T \left( \left( \frac{V}{A \operatorname{diag}(B_{1:} X) C} \right)^T A \circ C^T \right) \\ \vdots \\ \mathbf{1}^T \left( \left( \frac{V}{A \operatorname{diag}(B_{K:} X) C} \right)^T A \circ C^T \right) \end{bmatrix}. \tag{46} $$
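Finally, (46) can be confirmed the same way; a sketch assuming NumPy, building the stacked matrix row by row and left-multiplying by B^T:

```python
import numpy as np

rng = np.random.default_rng(9)
I, M, J, K, N = 3, 4, 5, 2, 6
# Positive entries keep the element-wise log in (42) well defined.
A = rng.uniform(0.5, 1.5, (I, M))
C = rng.uniform(0.5, 1.5, (M, J))
B = rng.uniform(0.5, 1.5, (K, N))
X = rng.uniform(0.5, 1.5, (N, M))
V = rng.standard_normal((I, J))
D = rng.standard_normal(X.shape)

def f(X):
    return sum(np.trace(V.T @ np.log(A @ np.diag(B[k] @ X) @ C)) for k in range(K))

# (46): stack the K row vectors 1^T((V / (A diag(B_k: X) C))^T A * C^T), then left-multiply by B^T.
rows = [np.ones(J) @ (((V / (A @ np.diag(B[k] @ X) @ C)).T @ A) * C.T) for k in range(K)]
G = B.T @ np.stack(rows)

dd = (f(X + 1e-6 * D) - f(X - 1e-6 * D)) / 2e-6
assert np.isclose(dd, np.sum(G * D), atol=1e-5)
```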

CONCLUSION

We've learned two things. First, if we don't know how to find the derivative of an expression using matrix calculus directly, we can always fall back on index notation and convert back to matrices at the end. This reduces a potentially unintuitive matrix-valued problem into one involving scalars, which we are used to. Second, it's less painful to massage an expression into a familiar form and apply previously-derived identities.
