
Appendix D

Matrix Calculus

From too much study, and from extreme passion, cometh madnesse.

− Isaac Newton [205, §5]

D.1 Gradient, Directional Derivative, Taylor Series

D.1.1 Gradients

Gradient of a differentiable real function f(x) : R^K → R with respect to its vector argument is defined uniquely in terms of partial derivatives

∇f(x) ≜ [ ∂f(x)/∂x_1 ; ∂f(x)/∂x_2 ; … ; ∂f(x)/∂x_K ] ∈ R^K    (2053)

while the second-order gradient of the twice differentiable real function with respect to its vector argument is traditionally called the Hessian;

∇²f(x) ≜
[ ∂²f(x)/∂x_1²       ∂²f(x)/∂x_1∂x_2   ⋯  ∂²f(x)/∂x_1∂x_K
  ∂²f(x)/∂x_2∂x_1    ∂²f(x)/∂x_2²      ⋯  ∂²f(x)/∂x_2∂x_K
  ⋮                  ⋮                 ⋱  ⋮
  ∂²f(x)/∂x_K∂x_1    ∂²f(x)/∂x_K∂x_2   ⋯  ∂²f(x)/∂x_K² ] ∈ S^K    (2054)

interpreted

∂²f(x)/∂x_1∂x_2 = ∂(∂f(x)/∂x_1)/∂x_2 = ∂(∂f(x)/∂x_2)/∂x_1 = ∂²f(x)/∂x_2∂x_1    (2055)
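A minimal numerical sketch of (2053)-(2055), assuming Python with NumPy (not part of the book's toolset); the quadratic test function, step size, and data are arbitrary choices:

```python
# Finite-difference check of gradient (2053), Hessian (2054), symmetry (2055)
# for f(x) = x'Ax + b'x, whose analytic gradient is (A + A')x + b.
import numpy as np

rng = np.random.default_rng(0)
K = 4
A = rng.standard_normal((K, K))
b = rng.standard_normal(K)
x = rng.standard_normal(K)
h = 1e-5

f = lambda x: x @ A @ x + b @ x
grad = lambda x: (A + A.T) @ x + b

# gradient (2053) entry by entry, by central differences
g = np.array([(f(x + h*e) - f(x - h*e)) / (2*h) for e in np.eye(K)])
assert np.allclose(g, grad(x), atol=1e-6)

# Hessian (2054) by differencing the gradient; symmetry confirms (2055)
H = np.array([(grad(x + h*e) - grad(x - h*e)) / (2*h) for e in np.eye(K)])
assert np.allclose(H, A + A.T, atol=1e-6)
assert np.allclose(H, H.T)
```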


The gradient of vector-valued function v(x) : R → R^N on real domain is a row vector

∇v(x) ≜ [ ∂v_1(x)/∂x  ∂v_2(x)/∂x  ⋯  ∂v_N(x)/∂x ] ∈ R^N    (2056)

while the second-order gradient is

∇²v(x) ≜ [ ∂²v_1(x)/∂x²  ∂²v_2(x)/∂x²  ⋯  ∂²v_N(x)/∂x² ] ∈ R^N    (2057)

Gradient of vector-valued function h(x) : R^K → R^N on vector domain is

∇h(x) ≜
[ ∂h_1(x)/∂x_1  ∂h_2(x)/∂x_1  ⋯  ∂h_N(x)/∂x_1
  ∂h_1(x)/∂x_2  ∂h_2(x)/∂x_2  ⋯  ∂h_N(x)/∂x_2
  ⋮             ⋮                ⋮
  ∂h_1(x)/∂x_K  ∂h_2(x)/∂x_K  ⋯  ∂h_N(x)/∂x_K ]
= [ ∇h_1(x)  ∇h_2(x)  ⋯  ∇h_N(x) ] ∈ R^{K×N}    (2058)

while the second-order gradient has a three-dimensional written representation dubbed cubix;^{D.1}

∇²h(x) ≜
[ ∇∂h_1(x)/∂x_1  ∇∂h_2(x)/∂x_1  ⋯  ∇∂h_N(x)/∂x_1
  ∇∂h_1(x)/∂x_2  ∇∂h_2(x)/∂x_2  ⋯  ∇∂h_N(x)/∂x_2
  ⋮              ⋮                 ⋮
  ∇∂h_1(x)/∂x_K  ∇∂h_2(x)/∂x_K  ⋯  ∇∂h_N(x)/∂x_K ]
= [ ∇²h_1(x)  ∇²h_2(x)  ⋯  ∇²h_N(x) ] ∈ R^{K×N×K}    (2059)

where the gradient of each real entry is with respect to vector x as in (2053).

The gradient of real function g(X) : R^{K×L} → R on matrix domain is

∇g(X) ≜
[ ∂g(X)/∂X_11  ∂g(X)/∂X_12  ⋯  ∂g(X)/∂X_1L
  ∂g(X)/∂X_21  ∂g(X)/∂X_22  ⋯  ∂g(X)/∂X_2L
  ⋮            ⋮               ⋮
  ∂g(X)/∂X_K1  ∂g(X)/∂X_K2  ⋯  ∂g(X)/∂X_KL ] ∈ R^{K×L}

= [ ∇_{X(:,1)} g(X)   ∇_{X(:,2)} g(X)   ⋯   ∇_{X(:,L)} g(X) ] ∈ R^{K×1×L}    (2060)

where gradient ∇_{X(:,i)} is with respect to the i-th column of X. The strange appearance of (2060) in R^{K×1×L} is meant to suggest a third dimension perpendicular to the page (not a diagonal matrix). The second-order gradient has representation

^{D.1} The word matrix comes from the Latin for womb; related to the prefix matri- derived from mater meaning mother.

∇²g(X) ≜
[ ∇∂g(X)/∂X_11  ∇∂g(X)/∂X_12  ⋯  ∇∂g(X)/∂X_1L
  ∇∂g(X)/∂X_21  ∇∂g(X)/∂X_22  ⋯  ∇∂g(X)/∂X_2L
  ⋮             ⋮                ⋮
  ∇∂g(X)/∂X_K1  ∇∂g(X)/∂X_K2  ⋯  ∇∂g(X)/∂X_KL ] ∈ R^{K×L×K×L}

= [ ∇∇_{X(:,1)} g(X)   ∇∇_{X(:,2)} g(X)   ⋯   ∇∇_{X(:,L)} g(X) ] ∈ R^{K×1×L×K×L}    (2061)

where the gradient ∇ is with respect to matrix X.

Gradient of vector-valued function g(X) : R^{K×L} → R^N on matrix domain is a cubix

∇g(X) ≜
[ ∇_{X(:,1)} g_1(X)  ∇_{X(:,1)} g_2(X)  ⋯  ∇_{X(:,1)} g_N(X)
  ∇_{X(:,2)} g_1(X)  ∇_{X(:,2)} g_2(X)  ⋯  ∇_{X(:,2)} g_N(X)
  ⋮                  ⋮                     ⋮
  ∇_{X(:,L)} g_1(X)  ∇_{X(:,L)} g_2(X)  ⋯  ∇_{X(:,L)} g_N(X) ]
= [ ∇g_1(X)  ∇g_2(X)  ⋯  ∇g_N(X) ] ∈ R^{K×N×L}    (2062)

while the second-order gradient has a five-dimensional representation;

∇²g(X) ≜
[ ∇∇_{X(:,1)} g_1(X)  ∇∇_{X(:,1)} g_2(X)  ⋯  ∇∇_{X(:,1)} g_N(X)
  ∇∇_{X(:,2)} g_1(X)  ∇∇_{X(:,2)} g_2(X)  ⋯  ∇∇_{X(:,2)} g_N(X)
  ⋮                   ⋮                      ⋮
  ∇∇_{X(:,L)} g_1(X)  ∇∇_{X(:,L)} g_2(X)  ⋯  ∇∇_{X(:,L)} g_N(X) ]
= [ ∇²g_1(X)  ∇²g_2(X)  ⋯  ∇²g_N(X) ] ∈ R^{K×N×L×K×L}    (2063)

The gradient of matrix-valued function g(X) : R^{K×L} → R^{M×N} on matrix domain has a four-dimensional representation called quartix (fourth-order tensor)

∇g(X) ≜
[ ∇g_11(X)  ∇g_12(X)  ⋯  ∇g_1N(X)
  ∇g_21(X)  ∇g_22(X)  ⋯  ∇g_2N(X)
  ⋮         ⋮            ⋮
  ∇g_M1(X)  ∇g_M2(X)  ⋯  ∇g_MN(X) ] ∈ R^{M×N×K×L}    (2064)

while the second-order gradient has a six-dimensional representation

∇²g(X) ≜
[ ∇²g_11(X)  ∇²g_12(X)  ⋯  ∇²g_1N(X)
  ∇²g_21(X)  ∇²g_22(X)  ⋯  ∇²g_2N(X)
  ⋮          ⋮             ⋮
  ∇²g_M1(X)  ∇²g_M2(X)  ⋯  ∇²g_MN(X) ] ∈ R^{M×N×K×L×K×L}    (2065)

and so on.

D.1.2 Product rules for matrix-functions

Given dimensionally compatible matrix-valued functions of matrix variable f(X) and g(X)

∇_X ( f(X)^T g(X) ) = ∇_X(f) g + ∇_X(g) f    (2066)

while [65, §8.3] [420]

∇_X tr( f(X)^T g(X) ) = ∇_X ( tr( f(X)^T g(Z) ) + tr( g(X)^T f(Z) ) ) |_{Z←X}    (2067)

These expressions implicitly apply as well to scalar-, vector-, or matrix-valued functions of scalar, vector, or matrix arguments.

D.1.2.0.1 Example. Cubix.
Suppose f(X) : R^{2×2} → R^2 = X^T a and g(X) : R^{2×2} → R^2 = Xb. We wish to find

∇_X ( f(X)^T g(X) ) = ∇_X a^T X² b    (2068)

using the product rule. Formula (2066) calls for

∇_X a^T X² b = ∇_X(X^T a) Xb + ∇_X(Xb) X^T a    (2069)

Consider the first of the two terms:

∇_X(f) g = ∇_X(X^T a) Xb = [ ∇(X^T a)_1   ∇(X^T a)_2 ] Xb    (2070)

The gradient of X^T a forms a cubix in R^{2×2×2}; a.k.a third-order tensor.

∇_X(X^T a) Xb =
[ ∂(X^T a)_1/∂X_11   ∂(X^T a)_2/∂X_11
  ∂(X^T a)_1/∂X_12   ∂(X^T a)_2/∂X_12
  ∂(X^T a)_1/∂X_21   ∂(X^T a)_2/∂X_21
  ∂(X^T a)_1/∂X_22   ∂(X^T a)_2/∂X_22 ] [ (Xb)_1 ; (Xb)_2 ] ∈ R^{2×1×2}    (2071)

where the first and second pairs of rows are the two slices of the cubix, stacked in the third dimension. Because gradient of the product (2068) requires total change with respect to change in each entry of matrix X, the Xb vector must make an inner product with each vector in that second dimension of the cubix indicated by dotted line segments;

∇_X(X^T a) Xb =
[ a_1  0
  0    a_1
  a_2  0
  0    a_2 ] [ b_1 X_11 + b_2 X_12 ; b_1 X_21 + b_2 X_22 ] ∈ R^{2×1×2}

= [ a_1(b_1 X_11 + b_2 X_12)   a_1(b_1 X_21 + b_2 X_22)
    a_2(b_1 X_11 + b_2 X_12)   a_2(b_1 X_21 + b_2 X_22) ] ∈ R^{2×2}

= a b^T X^T    (2072)

where the cubix appears as a complete 2×2×2 matrix. In like manner for the second term ∇_X(g) f

∇_X(Xb) X^T a =
[ b_1  0
  b_2  0
  0    b_1
  0    b_2 ] [ X_11 a_1 + X_21 a_2 ; X_12 a_1 + X_22 a_2 ] ∈ R^{2×1×2}

= X^T a b^T ∈ R^{2×2}    (2073)

The solution

∇_X a^T X² b = a b^T X^T + X^T a b^T    (2074)

can be found from Table D.2.1 or verified using (2067). ∎
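The closed form (2074) is easy to corroborate numerically; a sketch assuming NumPy, with arbitrary data:

```python
# Finite-difference check of (2074): grad of a'X^2 b equals ab'X' + X'ab'.
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.standard_normal(2), rng.standard_normal(2)
X = rng.standard_normal((2, 2))
h = 1e-6

g = lambda X: a @ X @ X @ b            # g(X) = a'X^2 b
G = np.zeros((2, 2))                   # finite-difference gradient w.r.t. X
for k in range(2):
    for l in range(2):
        E = np.zeros((2, 2)); E[k, l] = h
        G[k, l] = (g(X + E) - g(X - E)) / (2*h)

assert np.allclose(G, np.outer(a, b) @ X.T + X.T @ np.outer(a, b), atol=1e-6)
```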

D.1.2.1 Kronecker product

A partial remedy for venturing into hyperdimensional matrix representations, such as the cubix or quartix, is to first vectorize matrices as in (39). This device gives rise to the Kronecker product ⊗ of matrices; a.k.a tensor product (kron() in Matlab). Although its definition sees reversal in the literature, [434, §2.1] Kronecker product is not commutative (B⊗A ≠ A⊗B). We adopt the definition: for A ∈ R^{m×n} and B ∈ R^{p×q}

B⊗A ≜
[ B_11 A  B_12 A  ⋯  B_1q A
  B_21 A  B_22 A  ⋯  B_2q A
  ⋮       ⋮          ⋮
  B_p1 A  B_p2 A  ⋯  B_pq A ] ∈ R^{pm×qn}    (2075)

for which A⊗1 = 1⊗A = A (real unity acts like Identity).

One advantage to vectorization is existence of the traditional two-dimensional matrix representation (second-order tensor) for the second-order gradient of a real function with respect to a vectorized matrix. From §A.1.1 no.36 (§D.2.1) for square A, B ∈ R^{n×n}, for example [220, §5.2] [15, §3]

∇²_{vec X} tr(A X B X^T) = ∇²_{vec X} vec(X)^T (B^T ⊗ A) vec X = B ⊗ A^T + B^T ⊗ A ∈ R^{n²×n²}    (2076)
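Identity (2076) and the underlying quadratic form can be checked directly with kron(); a sketch assuming NumPy, where vec denotes column-stacking as in (39):

```python
# Check tr(AXBX') = vec(X)'(B' kron A) vec X and the first-order gradient
# (B' kron A + B kron A') vec X implied by (2076).
import numpy as np

rng = np.random.default_rng(2)
n = 3
A, B, X = (rng.standard_normal((n, n)) for _ in range(3))
vec = lambda M: M.reshape(-1, order='F')       # column-stacking vec

lhs = np.trace(A @ X @ B @ X.T)
rhs = vec(X) @ np.kron(B.T, A) @ vec(X)
assert np.isclose(lhs, rhs)

g = lambda v: np.trace(A @ v.reshape(n, n, order='F') @ B
                       @ v.reshape(n, n, order='F').T)
h = 1e-6
fd = np.array([(g(vec(X) + h*e) - g(vec(X) - h*e)) / (2*h)
               for e in np.eye(n*n)])
assert np.allclose(fd, (np.kron(B.T, A) + np.kron(B, A.T)) @ vec(X), atol=1e-5)
```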

A disadvantage is a large new but known set of algebraic rules (§A.1.1) and the fact that its mere use does not generally guarantee two-dimensional matrix representation of gradients.

Another application of the Kronecker product is to reverse order of appearance in a matrix product: Suppose we wish to weight the columns of a matrix S ∈ R^{M×N}, for example, by respective entries w_i from the main diagonal in

W ≜ [ w_1 ⋯ 0 ; ⋱ ; 0 ⋯ w_N ] ∈ S^N    (2077)

A conventional means for accomplishing column weighting is to multiply S by diagonal matrix W on the right side:

S W = S [ w_1 ⋯ 0 ; ⋱ ; 0 ⋯ w_N ] = [ S(:,1)w_1  ⋯  S(:,N)w_N ] ∈ R^{M×N}    (2078)

To reverse product order such that diagonal matrix W instead appears to the left of S: for I ∈ S^M (Law)

S W = ( δ(W)^T ⊗ I )
[ S(:,1)  0       ⋯  0
  0       S(:,2)     ⋮
  ⋮               ⋱  0
  0       ⋯   0      S(:,N) ] ∈ R^{M×N}    (2079)
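A small demonstration of (2078) and (2079), assuming NumPy and SciPy; sizes and data are arbitrary:

```python
# Column weighting two ways: S @ diag(w) versus the Kronecker form (2079).
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(3)
M, N = 3, 4
S = rng.standard_normal((M, N))
w = rng.standard_normal(N)

right = S @ np.diag(w)                               # conventional (2078)

# (2079): delta(W)' kron I applied to a block-diagonal stack of columns of S
blocks = block_diag(*[S[:, [j]] for j in range(N)])  # MN x N
left = np.kron(w[None, :], np.eye(M)) @ blocks       # M x N
assert np.allclose(left, right)
```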

To instead weight the rows of S via diagonal matrix W ∈ S^M, for I ∈ S^N

W S =
[ S(1,:)  0       ⋯  0
  0       S(2,:)     ⋮
  ⋮               ⋱  0
  0       ⋯   0      S(M,:) ] ( δ(W) ⊗ I ) ∈ R^{M×N}    (2080)

D.1.2.2 Hadamard product

For any matrices of like size, S, Y ∈ R^{M×N}, Hadamard's product ∘ denotes simple multiplication of corresponding entries (.* in Matlab). It is possible to convert Hadamard product into a standard product of matrices:

S ∘ Y = [ δ(Y(:,1))  ⋯  δ(Y(:,N)) ]
[ S(:,1)  0       ⋯  0
  0       S(:,2)     ⋮
  ⋮               ⋱  0
  0       ⋯   0      S(:,N) ] ∈ R^{M×N}    (2081)

In the special case that S = s and Y = y are vectors in R^M

s ∘ y = δ(s) y    (2082)

s^T ⊗ y = y s^T ,   s ⊗ y^T = s y^T    (2083)

D.1.3 Chain rules for composite matrix-functions

Given dimensionally compatible matrix-valued functions of matrix variable f(X) and

g(X) [462, §15.7]

∇_X g( f(X)^T ) = ∇_X f^T ∇_f g    (2084)

∇²_X g( f(X)^T ) = ∇_X ( ∇_X f^T ∇_f g ) = ∇²_X f ∇_f g + ∇_X f^T ∇²_f g ∇_X f    (2085)

D.1.3.1 Two arguments

∇_X g( f(X)^T , h(X)^T ) = ∇_X f^T ∇_f g + ∇_X h^T ∇_h g    (2086)

D.1.3.1.1 Example. Chain rule for two arguments. [51, §1.1]

g( f(x)^T , h(x)^T ) = ( f(x) + h(x) )^T A ( f(x) + h(x) )    (2087)

f(x) = [ x_1 ; ε x_2 ] ,   h(x) = [ ε x_1 ; x_2 ]    (2088)

∇_x g( f(x)^T , h(x)^T ) = [ 1 0 ; 0 ε ] (A + A^T)(f + h) + [ ε 0 ; 0 1 ] (A + A^T)(f + h)    (2089)

∇_x g( f(x)^T , h(x)^T ) = [ 1+ε 0 ; 0 1+ε ] (A + A^T) ( [ x_1 ; ε x_2 ] + [ ε x_1 ; x_2 ] )    (2090)

lim_{ε→0} ∇_x g( f(x)^T , h(x)^T ) = (A + A^T) x    (2091)

from Table D.2.1. ∎

These foregoing formulae remain correct when gradient produces hyperdimensional representation:

D.1.4 First directional derivative

Assume that a differentiable function g(X) : R^{K×L} → R^{M×N} has continuous first- and second-order gradients ∇g and ∇²g over dom g, which is an open set. We seek simple expressions for the first and second directional derivatives in direction Y ∈ R^{K×L}: respectively, →Y dg ∈ R^{M×N} and →Y dg² ∈ R^{M×N}.

Assuming that the limit exists, we may state the partial derivative of the mn-th entry of g with respect to the kl-th entry of X;

∂g_mn(X)/∂X_kl = lim_{Δt→0} ( g_mn(X + Δt e_k e_l^T) − g_mn(X) ) / Δt ∈ R    (2092)

where e_k is the k-th standard basis vector in R^K while e_l is the l-th standard basis vector in R^L. Total number of partial derivatives equals KLMN while the gradient is defined in their terms; the mn-th entry of the gradient is

∇g_mn(X) =
[ ∂g_mn(X)/∂X_11  ∂g_mn(X)/∂X_12  ⋯  ∂g_mn(X)/∂X_1L
  ∂g_mn(X)/∂X_21  ∂g_mn(X)/∂X_22  ⋯  ∂g_mn(X)/∂X_2L
  ⋮               ⋮                  ⋮
  ∂g_mn(X)/∂X_K1  ∂g_mn(X)/∂X_K2  ⋯  ∂g_mn(X)/∂X_KL ] ∈ R^{K×L}    (2093)

while the gradient is a quartix

∇g(X) =
[ ∇g_11(X)  ∇g_12(X)  ⋯  ∇g_1N(X)
  ∇g_21(X)  ∇g_22(X)  ⋯  ∇g_2N(X)
  ⋮         ⋮            ⋮
  ∇g_M1(X)  ∇g_M2(X)  ⋯  ∇g_MN(X) ] ∈ R^{M×N×K×L}    (2094)

By simply rotating our perspective of a four-dimensional representation of gradient matrix, we find one of three useful transpositions of this quartix (connoted T1):

∇g(X)^{T1} =
[ ∂g(X)/∂X_11  ∂g(X)/∂X_12  ⋯  ∂g(X)/∂X_1L
  ∂g(X)/∂X_21  ∂g(X)/∂X_22  ⋯  ∂g(X)/∂X_2L
  ⋮            ⋮               ⋮
  ∂g(X)/∂X_K1  ∂g(X)/∂X_K2  ⋯  ∂g(X)/∂X_KL ] ∈ R^{K×L×M×N}    (2095)

When a limit for Δt ∈ R exists, it is easy to show by substitution of variables in (2092)

Y_kl ∂g_mn(X)/∂X_kl = lim_{Δt→0} ( g_mn(X + Δt Y_kl e_k e_l^T) − g_mn(X) ) / Δt ∈ R    (2096)

which may be interpreted as the change in g_mn at X when the change in X_kl is equal to Y_kl, the kl-th entry of any Y ∈ R^{K×L}. Because the total change in g_mn(X) due to Y is the sum of change with respect to each and every X_kl, the mn-th entry of the directional

derivative is the corresponding total differential [462, §15.8]

dg_mn(X)|_{dX→Y} = Σ_{k,l} ∂g_mn(X)/∂X_kl Y_kl = tr( ∇g_mn(X)^T Y )    (2097)

= lim_{Δt→0} Σ_{k,l} ( g_mn(X + Δt Y_kl e_k e_l^T) − g_mn(X) ) / Δt    (2098)

= lim_{Δt→0} ( g_mn(X + Δt Y) − g_mn(X) ) / Δt    (2099)

= d/dt|_{t=0} g_mn(X + t Y)    (2100)

where t ∈ R. Assuming finite Y, equation (2099) is called the Gâteaux differential [50, App.A.5] [265, §D.2.1] [474, §5.28] whose existence is implied by existence of the Fréchet differential (the sum in (2097)). [337, §7.2] Each may be understood as the change in g_mn at X when the change in X is equal in magnitude and direction to Y.^{D.2} Hence the directional derivative,

→Y dg(X) ≜
[ dg_11(X)  dg_12(X)  ⋯  dg_1N(X)
  dg_21(X)  dg_22(X)  ⋯  dg_2N(X)
  ⋮         ⋮            ⋮
  dg_M1(X)  dg_M2(X)  ⋯  dg_MN(X) ] |_{dX→Y} ∈ R^{M×N}

= [ tr( ∇g_11(X)^T Y )  ⋯  tr( ∇g_1N(X)^T Y )
    ⋮                      ⋮
    tr( ∇g_M1(X)^T Y )  ⋯  tr( ∇g_MN(X)^T Y ) ]

= [ Σ_{k,l} ∂g_11(X)/∂X_kl Y_kl  ⋯  Σ_{k,l} ∂g_1N(X)/∂X_kl Y_kl
    ⋮                               ⋮
    Σ_{k,l} ∂g_M1(X)/∂X_kl Y_kl  ⋯  Σ_{k,l} ∂g_MN(X)/∂X_kl Y_kl ]    (2101)

from which it follows

→Y dg(X) = Σ_{k,l} ∂g(X)/∂X_kl Y_kl    (2102)

Yet for all X ∈ dom g, any Y ∈ R^{K×L}, and some open interval of t ∈ R

g(X + t Y) = g(X) + t →Y dg(X) + O(t²)    (2103)

which is the first-order multidimensional Taylor series expansion about X. [462, §18.4] [203, §2.3.4] Differentiation with respect to t and subsequent t-zeroing isolates the second term of the expansion. Thus differentiating and zeroing g(X + tY) in t is an operation equivalent to individually differentiating and zeroing every entry g_mn(X + tY) as in (2100). So the directional derivative of g(X) : R^{K×L} → R^{M×N} in any direction Y ∈ R^{K×L} evaluated at X ∈ dom g becomes [371, §2.1, §5.4.5] [43, §6.3.1]

→Y dg(X) = d/dt|_{t=0} g(X + t Y) ∈ R^{M×N}    (2104)

which is simplest.

^{D.2} Although Y is a matrix, we may regard it as a vector in R^{KL}.
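A numerical illustration, assuming NumPy, comparing the trace form (2097) with the derivative form (2104) for the real function g(X) = tr(X^T X), whose gradient 2X appears in Table D.2.3:

```python
# Directional derivative of g(X) = tr(X'X): tr(grad' Y) vs d/dt g(X+tY)|t=0.
import numpy as np

rng = np.random.default_rng(4)
X, Y = rng.standard_normal((2, 3, 3))
g = lambda X: np.trace(X.T @ X)

grad = 2 * X                                # analytic gradient (Table D.2.3)
lhs = np.trace(grad.T @ Y)                  # trace form (2097)
h = 1e-6
rhs = (g(X + h*Y) - g(X - h*Y)) / (2*h)     # derivative form (2104)
assert np.isclose(lhs, rhs)
```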

Figure 216: Strictly convex quadratic bowl in R²×R; f(x) = x^T x : R² → R versus x on some open disc in R². Plane slice ∂H is perpendicular to function domain. Slice intersection with domain connotes bidirectional vector y. Slope of tangent line T at point (α, f(α)) is value of directional derivative ∇_x f(α)^T y (2129) at α in slice direction y. Negative gradient −∇_x f(x) ∈ R² is direction of steepest descent. [74, §9.4.1] [462, §15.6] [203] [519] When vector υ ∈ R³ entry υ_3 is half directional derivative in gradient direction at α, and when [υ_1 ; υ_2] = ∇_x f(α), then −υ points directly toward bowl bottom; the figure labels υ ≜ [ ∇_x f(α) ; (1/2) →∇_x f(α) df(α) ].

In case of a real function g(X) : R^{K×L} → R

→Y dg(X) = tr( ∇g(X)^T Y )    (2126)

In case g(X) : R^K → R

→Y dg(X) = ∇g(X)^T Y    (2129)

Unlike gradient, directional derivative does not expand dimension; directional derivative (2104) retains the dimensions of g. The derivative with respect to t makes the directional derivative resemble ordinary calculus (§D.2); e.g, when g(X) is linear, →Y dg(X) = g(Y). [337, §7.2]

D.1.4.1 Interpretation of directional derivative

In the case of any differentiable real function g(X) : R^{K×L} → R, the directional derivative of g(X) at X in any direction Y yields the slope of g along the line {X + tY | t ∈ R} through its domain evaluated at t = 0. For higher-dimensional functions, by (2101), this slope interpretation can be applied to each entry of the directional derivative. Figure 216, for example, shows a plane slice of a real convex bowl-shaped function f(x) along a line {α + ty | t ∈ R} through its domain. The slice reveals a one-dimensional real function of t; f(α + ty). The directional derivative at x = α in direction y is the slope of f(α + ty) with respect to t at t = 0. In the case of a real function having vector argument h(X) : R^K → R, its directional derivative in the normalized direction of its gradient is the gradient magnitude. (2129) For a real function of real variable, the directional derivative evaluated at any point in the function domain is just the slope of that function there scaled by the real direction. (confer §3.6)

Directional derivative generalizes our one-dimensional notion of derivative to a multidimensional domain. When direction Y coincides with a member of the standard Cartesian basis e_k e_l^T (63), then a single partial derivative ∂g(X)/∂X_kl is obtained from directional derivative (2102); such is each entry of gradient ∇g(X) in equalities (2126) and (2129), for example.

D.1.4.1.1 Theorem. Directional derivative optimality condition. [337, §7.4]
Suppose f(X) : R^{K×L} → R is minimized on convex set C ⊆ R^{K×L} by X⋆, and the directional derivative of f exists there. Then for all X ∈ C

→X−X⋆ df(X⋆) ≥ 0    (2105)    ⋄

D.1.4.1.2 Example. Simple bowl. Bowl function (Figure 216)

f(x) : R^K → R ≜ (x − a)^T (x − a) − b    (2106)

has function offset −b ∈ R, axis of revolution at x = a, and positive definite Hessian (2054) everywhere in its domain (an open hyperdisc in R^K); id est, strictly convex quadratic f(x) has unique global minimum equal to −b at x = a. A vector −υ based anywhere in dom f × R pointing toward the unique bowl-bottom is specified:

υ ∝ [ x − a ; f(x) + b ] ∈ R^K × R    (2107)

Such a vector is

υ = [ ∇_x f(x) ; (1/2) →∇_x f(x) df(x) ]    (2108)

since the gradient is

∇_x f(x) = 2(x − a)    (2109)

and the directional derivative in direction of the gradient is (2129)

→∇_x f(x) df(x) = ∇_x f(x)^T ∇_x f(x) = 4(x − a)^T (x − a) = 4( f(x) + b )    (2110)

∎

D.1.5 Second directional derivative

By similar argument, it so happens: the second directional derivative is equally simple. Given g(X) : R^{K×L} → R^{M×N} on open domain,

∇ ∂g_mn(X)/∂X_kl = ∂∇g_mn(X)/∂X_kl =
[ ∂²g_mn(X)/∂X_kl∂X_11  ∂²g_mn(X)/∂X_kl∂X_12  ⋯  ∂²g_mn(X)/∂X_kl∂X_1L
  ∂²g_mn(X)/∂X_kl∂X_21  ∂²g_mn(X)/∂X_kl∂X_22  ⋯  ∂²g_mn(X)/∂X_kl∂X_2L
  ⋮                     ⋮                        ⋮
  ∂²g_mn(X)/∂X_kl∂X_K1  ∂²g_mn(X)/∂X_kl∂X_K2  ⋯  ∂²g_mn(X)/∂X_kl∂X_KL ] ∈ R^{K×L}    (2111)

∇²g_mn(X) =
[ ∇∂g_mn(X)/∂X_11  ⋯  ∇∂g_mn(X)/∂X_1L
  ⋮                   ⋮
  ∇∂g_mn(X)/∂X_K1  ⋯  ∇∂g_mn(X)/∂X_KL ] ∈ R^{K×L×K×L}
=
[ ∂∇g_mn(X)/∂X_11  ⋯  ∂∇g_mn(X)/∂X_1L
  ⋮                   ⋮
  ∂∇g_mn(X)/∂X_K1  ⋯  ∂∇g_mn(X)/∂X_KL ]    (2112)

Rotating our perspective, we get several views of the second-order gradient:

∇²g(X) =
[ ∇²g_11(X)  ⋯  ∇²g_1N(X)
  ⋮             ⋮
  ∇²g_M1(X)  ⋯  ∇²g_MN(X) ] ∈ R^{M×N×K×L×K×L}    (2113)

∇²g(X)^{T1} =
[ ∇∂g(X)/∂X_11  ⋯  ∇∂g(X)/∂X_1L
  ⋮                ⋮
  ∇∂g(X)/∂X_K1  ⋯  ∇∂g(X)/∂X_KL ] ∈ R^{K×L×M×N×K×L}    (2114)

∇²g(X)^{T2} =
[ ∂∇g(X)/∂X_11  ⋯  ∂∇g(X)/∂X_1L
  ⋮                ⋮
  ∂∇g(X)/∂X_K1  ⋯  ∂∇g(X)/∂X_KL ] ∈ R^{K×L×K×L×M×N}    (2115)

Assuming the limits to exist, we may state the partial derivative of the mn-th entry of g with respect to the kl-th and ij-th entries of X;

∂²g_mn(X)/∂X_kl ∂X_ij = ∂( ∂g_mn(X)/∂X_kl )/∂X_ij
= lim_{Δt→0} ( ∂g_mn(X + Δt e_k e_l^T)/∂X_ij − ∂g_mn(X)/∂X_ij ) / Δt
= lim_{Δτ,Δt→0} ( ( g_mn(X + Δt e_k e_l^T + Δτ e_i e_j^T) − g_mn(X + Δt e_k e_l^T) ) − ( g_mn(X + Δτ e_i e_j^T) − g_mn(X) ) ) / (Δτ Δt)    (2116)

Differentiating (2096) and then scaling by Y_ij

Y_kl Y_ij ∂²g_mn(X)/∂X_kl ∂X_ij
= lim_{Δt→0} ( ∂g_mn(X + Δt Y_kl e_k e_l^T)/∂X_ij − ∂g_mn(X)/∂X_ij ) Y_ij / Δt
= lim_{Δτ,Δt→0} ( ( g_mn(X + Δt Y_kl e_k e_l^T + Δτ Y_ij e_i e_j^T) − g_mn(X + Δt Y_kl e_k e_l^T) ) − ( g_mn(X + Δτ Y_ij e_i e_j^T) − g_mn(X) ) ) / (Δτ Δt)    (2117)

which can be proved by substitution of variables in (2116). The mn-th second-order total differential due to any Y ∈ R^{K×L} is

d²g_mn(X)|_{dX→Y} = Σ_{i,j} Σ_{k,l} ∂²g_mn(X)/∂X_kl ∂X_ij Y_kl Y_ij = tr( ∇_X tr( ∇g_mn(X)^T Y )^T Y )    (2118)

= lim_{Δt→0} Σ_{i,j} ( ∂g_mn(X + Δt Y)/∂X_ij − ∂g_mn(X)/∂X_ij ) Y_ij / Δt    (2119)

= lim_{Δt→0} ( g_mn(X + 2Δt Y) − 2 g_mn(X + Δt Y) + g_mn(X) ) / Δt²    (2120)

= d²/dt²|_{t=0} g_mn(X + t Y)    (2121)

Hence the second directional derivative,

→Y dg²(X) ≜
[ d²g_11(X)  d²g_12(X)  ⋯  d²g_1N(X)
  d²g_21(X)  d²g_22(X)  ⋯  d²g_2N(X)
  ⋮          ⋮             ⋮
  d²g_M1(X)  d²g_M2(X)  ⋯  d²g_MN(X) ] |_{dX→Y} ∈ R^{M×N}

= [ tr( ∇_X tr( ∇g_mn(X)^T Y )^T Y ) ]_{m=1…M, n=1…N}

= [ Σ_{i,j} Σ_{k,l} ∂²g_mn(X)/∂X_kl ∂X_ij Y_kl Y_ij ]_{m=1…M, n=1…N}    (2122)

from which it follows

→Y dg²(X) = Σ_{i,j} Σ_{k,l} ∂²g(X)/∂X_kl ∂X_ij Y_kl Y_ij = Σ_{i,j} ∂( →Y dg(X) )/∂X_ij Y_ij    (2123)

Yet for all X ∈ dom g, any Y ∈ R^{K×L}, and some open interval of t ∈ R

g(X + t Y) = g(X) + t →Y dg(X) + (1/2!) t² →Y dg²(X) + O(t³)    (2124)

which is the second-order multidimensional Taylor series expansion about X. [462, §18.4] [203, §2.3.4] Differentiating twice with respect to t and subsequent t-zeroing isolates the third term of the expansion. Thus differentiating and zeroing g(X + tY) in t is an operation equivalent to individually differentiating and zeroing every entry g_mn(X + tY) as in (2121). So the second directional derivative of g(X) : R^{K×L} → R^{M×N} becomes [371, §2.1, §5.4.5] [43, §6.3.1]

→Y dg²(X) = d²/dt²|_{t=0} g(X + t Y) ∈ R^{M×N}    (2125)

which is again simplest. (confer (2104)) Directional derivative retains the dimensions of g.

D.1.6 directional derivative expressions

In the case of a real function g(X) : R^{K×L} → R, all its directional derivatives are in R:

→Y dg(X) = tr( ∇g(X)^T Y )    (2126)

→Y dg²(X) = tr( ∇_X tr( ∇g(X)^T Y )^T Y ) = tr( ∇_X →Y dg(X)^T Y )    (2127)

→Y dg³(X) = tr( ∇_X tr( ∇_X tr( ∇g(X)^T Y )^T Y )^T Y ) = tr( ∇_X →Y dg²(X)^T Y )    (2128)

In the case g(X) : R^K → R has vector argument, they further simplify:

→Y dg(X) = ∇g(X)^T Y    (2129)

→Y dg²(X) = Y^T ∇²g(X) Y    (2130)

→Y dg³(X) = ∇_X ( Y^T ∇²g(X) Y )^T Y    (2131)

and so on.

D.1.7 higher-order multidimensional Taylor series

Series expansions of the differentiable matrix-valued function g(X), of matrix argument, were given earlier in (2103) and (2124). Assume that g(X) has continuous first-, second-, and third-order gradients over open set dom g. Then, for X ∈ dom g and any Y ∈ R^{K×L}, the Taylor series is expressed on some open interval of μ ∈ R

g(X + μY) = g(X) + μ →Y dg(X) + (1/2!) μ² →Y dg²(X) + (1/3!) μ³ →Y dg³(X) + O(μ⁴)    (2132)

or on some open interval of ‖Y‖₂

g(Y) = g(X) + →Y−X dg(X) + (1/2!) →Y−X dg²(X) + (1/3!) →Y−X dg³(X) + O(‖Y‖⁴)    (2133)

which are third-order expansions about X. The mean value theorem from calculus is what insures finite order of the series. [462] [51, §1.1] [50, App.A.5] [265, §0.4] These somewhat unbelievable formulae^{D.3} imply that a function can be determined over the whole of its domain by knowing its value and all its directional derivatives at a single point X.

D.1.7.0.1 Example. Inverse-matrix function.
Say g(Y) = Y^{−1}. From the table on page 616,

→Y dg(X) = d/dt|_{t=0} g(X + t Y) = −X^{−1} Y X^{−1}    (2134)

→Y dg²(X) = d²/dt²|_{t=0} g(X + t Y) = 2 X^{−1} Y X^{−1} Y X^{−1}    (2135)

→Y dg³(X) = d³/dt³|_{t=0} g(X + t Y) = −6 X^{−1} Y X^{−1} Y X^{−1} Y X^{−1}    (2136)

Let's find the Taylor series expansion of g about X = I: Since g(I) = I, for ‖Y‖₂ < 1 (μ = 1 in (2132))

g(I + Y) = (I + Y)^{−1} = I − Y + Y² − Y³ + …    (2137)

If Y is small, (I + Y)^{−1} ≈ I − Y.^{D.4} Now we find Taylor series expansion about X:

g(X + Y) = (X + Y)^{−1} = X^{−1} − X^{−1} Y X^{−1} + X^{−1} Y X^{−1} Y X^{−1} − …    (2138)

If Y is small, (X + Y)^{−1} ≈ X^{−1} − X^{−1} Y X^{−1}. ∎

^{D.3} e.g, real continuous and differentiable function of real variable f(x) = e^{−1/x²} has no Taylor series expansion about x = 0, of any practical use, because each derivative equals 0 there.
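A numerical sketch of expansion (2138), assuming NumPy; truncating after the first-order term leaves O(‖Y‖²) error, and after the second-order term O(‖Y‖³):

```python
# Compare inv(X + Y) against one- and two-term Neumann-style truncations.
import numpy as np

rng = np.random.default_rng(5)
X = np.eye(3) + 0.1 * rng.standard_normal((3, 3))
Y = 1e-2 * rng.standard_normal((3, 3))
Xi = np.linalg.inv(X)

exact = np.linalg.inv(X + Y)
first = Xi - Xi @ Y @ Xi                     # through the (1/1!) term
second = first + Xi @ Y @ Xi @ Y @ Xi        # adds the (1/2!) dg^2 term

print(np.linalg.norm(exact - first))         # ~ O(|Y|^2)
print(np.linalg.norm(exact - second))        # ~ O(|Y|^3), much smaller
```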

D.1.7.0.2 Exercise. log det.
(confer [74, p.644]) Find the first three terms of a Taylor series expansion for log det Y. Specify an open interval over which the expansion holds in vicinity of X. H

D.1.8 Correspondence of gradient to derivative

From the foregoing expressions for directional derivative, we derive a relationship between gradient with respect to matrix X and derivative with respect to real variable t :

D.1.8.1 first-order

Removing evaluation at t = 0 from (2104),^{D.5} we find an expression for the directional derivative of g(X) in direction Y evaluated anywhere along a line {X + tY | t ∈ R} intersecting dom g

→Y dg(X + t Y) = d/dt g(X + t Y)    (2139)

In the general case g(X) : R^{K×L} → R^{M×N}, from (2097) and (2100) we find

tr( ∇_X g_mn(X + t Y)^T Y ) = d/dt g_mn(X + t Y)    (2140)

which is valid at t = 0, of course, when X ∈ dom g. In the important case of a real function g(X) : R^{K×L} → R, from (2126) we have simply

tr( ∇_X g(X + t Y)^T Y ) = d/dt g(X + t Y)    (2141)

When g(X) : R^K → R has vector argument,

∇_X g(X + t Y)^T Y = d/dt g(X + t Y)    (2142)

^{D.4} Had we instead set g(Y) = (I + Y)^{−1}, then the equivalent expansion would have been about X = 0.
^{D.5} Justified by replacing X with X + tY in (2097)-(2099); beginning,

dg_mn(X + t Y)|_{dX→Y} = Σ_{k,l} ∂g_mn(X + t Y)/∂X_kl Y_kl

D.1.8.1.1 Example. Gradient.
g(X) = w^T X^T X w, X ∈ R^{K×L}, w ∈ R^L. Using the tables in §D.2,

tr( ∇_X g(X + t Y)^T Y ) = tr( 2 w w^T (X + t Y)^T Y )    (2143)
= 2 w^T ( X^T Y + t Y^T Y ) w    (2144)

Applying equivalence (2141),

d/dt g(X + t Y) = d/dt ( w^T (X + t Y)^T (X + t Y) w )    (2145)
= w^T ( X^T Y + Y^T X + 2 t Y^T Y ) w    (2146)
= 2 w^T ( X^T Y + t Y^T Y ) w    (2147)

which is the same as (2144). Hence, the equivalence is demonstrated. It is easy to extract ∇g(X) from (2147) knowing only (2141):

tr( ∇_X g(X + t Y)^T Y ) = 2 w^T ( X^T Y + t Y^T Y ) w = 2 tr( w w^T (X^T + t Y^T) Y )

tr( ∇_X g(X)^T Y ) = 2 tr( w w^T X^T Y )    (2148)
⇔ ∇_X g(X) = 2 X w w^T

∎
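The extracted gradient ∇g(X) = 2Xww^T from (2148) is readily corroborated by central differences; a sketch assuming NumPy, with arbitrary K, L:

```python
# Finite-difference check that grad of g(X) = w'X'Xw equals 2Xww'.
import numpy as np

rng = np.random.default_rng(6)
K, L = 3, 2
X = rng.standard_normal((K, L))
w = rng.standard_normal(L)
g = lambda X: w @ X.T @ X @ w
h = 1e-6

G = np.zeros((K, L))
for k in range(K):
    for l in range(L):
        E = np.zeros((K, L)); E[k, l] = h
        G[k, l] = (g(X + E) - g(X - E)) / (2*h)

assert np.allclose(G, 2 * X @ np.outer(w, w), atol=1e-6)
```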

D.1.8.2 second-order

Likewise removing the evaluation at t = 0 from (2125),

→Y dg²(X + t Y) = d²/dt² g(X + t Y)    (2149)

we can find a similar relationship between second-order gradient and second derivative in t: In the general case g(X) : R^{K×L} → R^{M×N} from (2118) and (2121),

tr( ∇_X tr( ∇_X g_mn(X + t Y)^T Y )^T Y ) = d²/dt² g_mn(X + t Y)    (2150)

In the case of a real function g(X) : R^{K×L} → R we have, of course,

tr( ∇_X tr( ∇_X g(X + t Y)^T Y )^T Y ) = d²/dt² g(X + t Y)    (2151)

From (2130), the simpler case, where real function g(X) : R^K → R has vector argument,

Y^T ∇²_X g(X + t Y) Y = d²/dt² g(X + t Y)    (2152)

D.1.8.2.1 Example. Second-order gradient.
We want to find ∇²g(X) ∈ R^{K×K×K×K} given real function g(X) = log det X having domain int S^K_+. From the tables in §D.2,

h(X) ≜ ∇g(X) = X^{−1} ∈ int S^K_+    (2153)

so ∇²g(X) = ∇h(X). By (2140) and (2103), for Y ∈ S^K

tr( ∇h_mn(X)^T Y ) = d/dt|_{t=0} h_mn(X + t Y)    (2154)
= ( d/dt|_{t=0} h(X + t Y) )_mn    (2155)
= ( d/dt|_{t=0} (X + t Y)^{−1} )_mn    (2156)
= −( X^{−1} Y X^{−1} )_mn    (2157)

Setting Y to a member of { e_k e_l^T ∈ R^{K×K} | k, l = 1…K }, and employing a property (41) of the trace function we find

∇²g(X)_mnkl = tr( ∇h_mn(X)^T e_k e_l^T ) = ∇h_mn(X)_kl = −( X^{−1} e_k e_l^T X^{−1} )_mn    (2158)

∇²g(X)_kl = ∇h(X)_kl = −X^{−1} e_k e_l^T X^{−1} ∈ R^{K×K}    (2159)

∎

From all these first- and second-order expressions, we may generate new ones by evaluating both sides at arbitrary t (in some open interval) but only after differentiation.
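A numerical check of (2159), assuming NumPy: differencing h(X) = X^{−1} in the direction e_k e_l^T reproduces −X^{−1} e_k e_l^T X^{−1}:

```python
# Each kl-slice of the second-order gradient of log det X is
# -inv(X) e_k e_l' inv(X), per (2159).
import numpy as np

rng = np.random.default_rng(7)
K = 3
M = rng.standard_normal((K, K))
X = M @ M.T + K * np.eye(K)               # a point in int S^K_+
Xi = np.linalg.inv(X)
h = 1e-6

k, l = 1, 2                               # any fixed entry of X
E = np.zeros((K, K)); E[k, l] = h
dH = (np.linalg.inv(X + E) - np.linalg.inv(X - E)) / (2*h)
assert np.allclose(dH, -Xi[:, [k]] @ Xi[[l], :], atol=1e-6)
```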

D.2 Tables of gradients and derivatives

• Results may be validated numerically via Richardson extrapolation [332, §5.4] [146] (a numerical sketch follows this list). When algebraically proving results for symmetric matrices, it is critical to take gradients ignoring symmetry and to then substitute symmetric entries afterward. [220] [78]

• i, j, k, ℓ, K, L, m, n, M, N are integers, unless otherwise noted, and a, b ∈ R^n, x, y ∈ R^k, A, B ∈ R^{m×n}, X, Y ∈ R^{K×L}, t, μ ∈ R.

• x^μ means δ(δ(x)^μ) for μ ∈ R; id est, entrywise vector exponentiation. δ is the main-diagonal linear operator (1681). x⁰ ≜ 1, X⁰ ≜ I if square.

• d/dx ≜ [ d/dx_1 ; … ; d/dx_k ], while →y dg(x), →y dg²(x) (directional derivatives §D.1), log x, e^x, |x|, x/y (Hadamard quotient), sgn x, √∘x (entrywise square root), etcetera, are maps f : R^k → R^k that maintain dimension; e.g, (§A.1.1)

d/dx x^{−1} ≜ ∇_x 1^T δ(x)^{−1} 1    (2160)

• For A a scalar or square matrix, we have the Taylor series [98, §3.6]

e^A ≜ Σ_{k=0}^∞ (1/k!) A^k    (2161)

Further, [440, §5.4]

e^A ≻ 0   ∀ A ∈ S^m    (2162)

• For all square A and integer k

det^k A = det A^k    (2163)
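As mentioned in the first bullet above, table entries may be validated numerically via Richardson extrapolation; a sketch assuming NumPy, applied here to the entry d/dt log det(X + tY) = tr((X + tY)^{−1} Y) from §D.2.4:

```python
# Richardson-extrapolated central difference of phi(t) = log det(X + tY)
# at t = 0, compared against the table entry tr(inv(X) Y).
import numpy as np

rng = np.random.default_rng(8)
M = rng.standard_normal((4, 4))
X = M @ M.T + 4 * np.eye(4)                  # keep X positive definite
Y = rng.standard_normal((4, 4))
phi = lambda t: np.linalg.slogdet(X + t * Y)[1]

D = lambda h: (phi(h) - phi(-h)) / (2 * h)   # O(h^2) central difference
h = 1e-3
richardson = (4 * D(h / 2) - D(h)) / 3       # cancels the O(h^2) error term
assert np.isclose(richardson, np.trace(np.linalg.inv(X) @ Y))
```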

D.2.1 algebraic

∇_x x = ∇_x x^T = I ∈ R^{k×k}            ∇_X X = ∇_X X^T ≜ I ∈ R^{K×L×K×L}   (Identity)
∇_x 1^T x = ∇_x x^T 1 = 1 ∈ R^k          ∇_X 1^T X 1 = ∇_X 1^T X^T 1 = 1 1^T ∈ R^{K×L}

∇_x (Ax − b) = A^T
∇_x (x^T A − b^T) = A
∇_x (Ax − b)^T (Ax − b) = 2 A^T (Ax − b)
∇²_x (Ax − b)^T (Ax − b) = 2 A^T A
∇_x √( (Ax − b)^T (Ax − b) ) = A^T (Ax − b) / ‖Ax − b‖₂ = ∇_x ‖Ax − b‖₂
∇_x z^T |Ax − b| = A^T δ(z) sgn(Ax − b) ,   z_i ≠ 0 ⇒ (Ax − b)_i ≠ 0
∇_x 1^T |Ax − b| = A^T sgn(Ax − b) = ∇_x ‖Ax − b‖₁
∇_x 1^T f(|Ax − b|) = A^T δ( df(y)/dy |_{y=|Ax−b|} ) sgn(Ax − b)

∇_x ( x^T A x + 2 x^T B y + y^T C y ) = (A + A^T) x + 2 B y
∇_x (x + y)^T A (x + y) = (A + A^T)(x + y)
∇²_x ( x^T A x + 2 x^T B y + y^T C y ) = A + A^T

∇_X a^T X b = ∇_X b^T X^T a = a b^T
∇_X a^T X² b = X^T a b^T + a b^T X^T
∇_X a^T X^{−1} b = −X^{−T} a b^T X^{−T}
∇_X (X^{−1})_kl = ∂X^{−1}/∂X_kl = −X^{−1} e_k e_l^T X^{−1} ,   confer (2095)(2159)

∇_x a^T x^T x b = 2 x a^T b              ∇_X a^T X^T X b = X ( a b^T + b a^T )
∇_x a^T x x^T b = ( a b^T + b a^T ) x    ∇_X a^T X X^T b = ( a b^T + b a^T ) X
∇_x a^T x^T x a = 2 x a^T a              ∇_X a^T X^T X a = 2 X a a^T
∇_x a^T x x^T a = 2 a a^T x              ∇_X a^T X X^T a = 2 a a^T X
∇_x a^T y x^T b = b a^T y                ∇_X a^T Y X^T b = b a^T Y
∇_x a^T y^T x b = y b^T a                ∇_X a^T Y^T X b = Y a b^T
∇_x a^T x y^T b = a b^T y                ∇_X a^T X Y^T b = a b^T Y
∇_x a^T x^T y b = y a^T b                ∇_X a^T X^T Y b = Y b a^T

algebraic continued

d/dt (X + t Y) = Y

d/dt B^T (X + t Y)^{−1} A = −B^T (X + t Y)^{−1} Y (X + t Y)^{−1} A
d/dt B^T (X + t Y)^{−T} A = −B^T (X + t Y)^{−T} Y^T (X + t Y)^{−T} A
d/dt B^T (X + t Y)^μ A = … ,   −1 ≤ μ ≤ 1 ,   X, Y ∈ S^M_+
d²/dt² B^T (X + t Y)^{−1} A = 2 B^T (X + t Y)^{−1} Y (X + t Y)^{−1} Y (X + t Y)^{−1} A
d³/dt³ B^T (X + t Y)^{−1} A = −6 B^T (X + t Y)^{−1} Y (X + t Y)^{−1} Y (X + t Y)^{−1} Y (X + t Y)^{−1} A

d/dt ( (X + t Y)^T A (X + t Y) ) = Y^T A X + X^T A Y + 2 t Y^T A Y
d²/dt² ( (X + t Y)^T A (X + t Y) ) = 2 Y^T A Y
d/dt ( (X + t Y)^T A (X + t Y) )^{−1}
  = −( (X + t Y)^T A (X + t Y) )^{−1} ( Y^T A X + X^T A Y + 2 t Y^T A Y ) ( (X + t Y)^T A (X + t Y) )^{−1}
d/dt ( (X + t Y) A (X + t Y) ) = Y A X + X A Y + 2 t Y A Y
d²/dt² ( (X + t Y) A (X + t Y) ) = 2 Y A Y

D.2.2 trace Kronecker

∇_{vec X} tr(A X B X^T) = ∇_{vec X} vec(X)^T (B^T ⊗ A) vec X = (B^T ⊗ A + B ⊗ A^T) vec X

∇²_{vec X} tr(A X B X^T) = ∇²_{vec X} vec(X)^T (B^T ⊗ A) vec X = B ⊗ A^T + B^T ⊗ A    (2076)

D.2.3 trace

∇_x μx = μI                                  ∇_X tr μX = ∇_X μ tr X = μI
∇_x 1^T δ(x)^{−1} 1 = d/dx x^{−1} = −x^{−2}   ∇_X tr X^{−1} = −X^{−2T}
∇_x 1^T δ(x)^{−1} y = −δ(x)^{−2} y           ∇_X tr(X^{−1} Y) = ∇_X tr(Y X^{−1}) = −X^{−T} Y^T X^{−T}
d/dx x^μ = μ x^{μ−1}                         ∇_X tr X^μ = μ X^{μ−1} ,   X ∈ S^M
                                             ∇_X tr X^j = j X^{(j−1)T}

∇_x (b − a^T x)^{−1} = (b − a^T x)^{−2} a    ∇_X tr( (B − A X)^{−1} ) = ( (B − A X)^{−2} A )^T
∇_x (b − a^T x)^μ = −μ (b − a^T x)^{μ−1} a
∇_x x^T y = ∇_x y^T x = y                    ∇_X tr(X^T Y) = ∇_X tr(Y^T X) = ∇_X tr(Y X^T) = ∇_X tr(X Y^T) = Y
∇_x x^T x = 2x                               ∇_X tr(X^T X) = ∇_X tr(X X^T) = 2X

∇_X tr(A X B X^T) = ∇_X tr(X B X^T A) = A^T X B^T + A X B
∇_X tr(A X B X) = ∇_X tr(X B X A) = A^T X^T B^T + B^T X^T A^T
∇_X tr(A X A X A X A X) = ∇_X tr(X A X A X A X A) = 4 (A X A X A X A)^T
∇_X tr(A X A X A X) = ∇_X tr(X A X A X A) = 3 (A X A X A)^T
∇_X tr(A X A X) = ∇_X tr(X A X A) = 2 (A X A)^T
∇_X tr(A X) = ∇_X tr(X A) = A^T
∇_X tr(Y X^k) = ∇_X tr(X^k Y) = Σ_{i=0}^{k−1} ( X^i Y X^{k−1−i} )^T
∇_X tr(X^T Y Y^T X X^T Y Y^T X) = 4 Y Y^T X X^T Y Y^T X
∇_X tr(X Y Y^T X^T X Y Y^T X^T) = 4 X Y Y^T X^T X Y Y^T
∇_X tr(Y^T X X^T Y) = ∇_X tr(X^T Y Y^T X) = 2 Y Y^T X
∇_X tr(Y^T X^T X Y) = ∇_X tr(X Y Y^T X^T) = 2 X Y Y^T
∇_X tr( (X + Y)^T (X + Y) ) = 2 (X + Y) = ∇_X ‖X + Y‖²_F
∇_X tr( (X + Y)(X + Y) ) = 2 (X + Y)^T
∇_X tr(A^T X B) = ∇_X tr(X^T A B^T) = A B^T
∇_X tr(A^T X^{−1} B) = ∇_X tr(X^{−T} A B^T) = −X^{−T} A B^T X^{−T}
∇_X a^T X b = ∇_X tr(b a^T X) = ∇_X tr(X b a^T) = a b^T
∇_X b^T X^T a = ∇_X tr(X^T a b^T) = ∇_X tr(a b^T X^T) = a b^T
∇_X a^T X^{−1} b = ∇_X tr(X^{−T} a b^T) = −X^{−T} a b^T X^{−T}
∇_X a^T X^μ b = …

trace continued

d/dt tr g(X + t Y) = tr( d/dt g(X + t Y) )   [273, p.491]

d/dt tr(X + t Y) = tr Y

d/dt tr^j (X + t Y) = j tr^{j−1}(X + t Y) tr Y

d/dt tr( (X + t Y)^j ) = j tr( (X + t Y)^{j−1} Y )   (∀ j)

d/dt tr( (X + t Y) Y ) = tr Y²

d/dt tr( (X + t Y)^k Y ) = d/dt tr( Y (X + t Y)^k ) = k tr( (X + t Y)^{k−1} Y² ) ,   k ∈ {0, 1, 2}

d/dt tr( (X + t Y)^k Y ) = d/dt tr( Y (X + t Y)^k ) = tr( Σ_{i=0}^{k−1} (X + t Y)^i Y (X + t Y)^{k−1−i} Y )

d/dt tr( (X + t Y)^{−1} Y ) = −tr( (X + t Y)^{−1} Y (X + t Y)^{−1} Y )
d/dt tr( B^T (X + t Y)^{−1} A ) = −tr( B^T (X + t Y)^{−1} Y (X + t Y)^{−1} A )
d/dt tr( B^T (X + t Y)^{−T} A ) = −tr( B^T (X + t Y)^{−T} Y^T (X + t Y)^{−T} A )
d/dt tr( B^T (X + t Y)^{−k} A ) = … ,   k > 0
d/dt tr( B^T (X + t Y)^μ A ) = … ,   −1 ≤ μ ≤ 1 ,   X, Y ∈ S^M_+
d²/dt² tr( B^T (X + t Y)^{−1} A ) = 2 tr( B^T (X + t Y)^{−1} Y (X + t Y)^{−1} Y (X + t Y)^{−1} A )

d/dt tr( (X + t Y)^T A (X + t Y) ) = tr( Y^T A X + X^T A Y + 2 t Y^T A Y )
d²/dt² tr( (X + t Y)^T A (X + t Y) ) = 2 tr( Y^T A Y )
d/dt tr( ( (X + t Y)^T A (X + t Y) )^{−1} )
  = −tr( ( (X + t Y)^T A (X + t Y) )^{−1} ( Y^T A X + X^T A Y + 2 t Y^T A Y ) ( (X + t Y)^T A (X + t Y) )^{−1} )
d/dt tr( (X + t Y) A (X + t Y) ) = tr( Y A X + X A Y + 2 t Y A Y )
d²/dt² tr( (X + t Y) A (X + t Y) ) = 2 tr( Y A Y )

D.2.4 logarithmic determinant

x ≻ 0, det X > 0 on some neighborhood of X, and det(X + t Y) > 0 on some open interval of t; otherwise, log(·) would be discontinuous. [107, p.75]

d/dx log x = x^{−1}                          ∇_X log det X = X^{−T}

∇²_X log det(X)_kl = ∂X^{−T}/∂X_kl = −( X^{−1} e_k e_l^T X^{−1} )^T ,   confer (2112)(2159)

d/dx log x^{−1} = −x^{−1}                    ∇_X log det X^{−1} = −X^{−T}
d/dx log x^μ = μ x^{−1}                      ∇_X log det^μ X = μ X^{−T}
                                             ∇_X log det X^μ = μ X^{−T}
                                             ∇_X log det X^k = ∇_X log det^k X = k X^{−T}
                                             ∇_X log det^μ (X + t Y) = μ (X + t Y)^{−T}
∇_x log(a^T x + b) = a (a^T x + b)^{−1}      ∇_X log det(A X + B) = A^T (A X + B)^{−T}

∇_X log det(I ± A^T X A) = ±A (I ± A^T X A)^{−T} A^T
∇_X log det(X + t Y)^k = ∇_X log det^k (X + t Y) = k (X + t Y)^{−T}

d/dt log det(X + t Y) = tr( (X + t Y)^{−1} Y )
d²/dt² log det(X + t Y) = −tr( (X + t Y)^{−1} Y (X + t Y)^{−1} Y )
d/dt log det(X + t Y)^{−1} = −tr( (X + t Y)^{−1} Y )
d²/dt² log det(X + t Y)^{−1} = tr( (X + t Y)^{−1} Y (X + t Y)^{−1} Y )
d/dt log det( δ(A(x + t y) + a)² + μ I )
  = tr( ( δ(A(x + t y) + a)² + μ I )^{−1} 2 δ(A(x + t y) + a) δ(A y) )

D.2.5 determinant

∇_X det X = ∇_X det X^T = det(X) X^{−T}
∇_X det X^{−1} = −det(X^{−1}) X^{−T} = −det(X)^{−1} X^{−T}
∇_X det^μ X = μ det^μ(X) X^{−T}
∇_X det X^μ = μ det(X^μ) X^{−T}
∇_X det^k X = k det^{k−1}(X) ( tr(X) I − X^T ) ,   X ∈ R^{2×2}
∇_X det X^k = ∇_X det^k X = k det(X^k) X^{−T} = k det^k(X) X^{−T}
∇_X det^μ (X + t Y) = μ det^μ(X + t Y) (X + t Y)^{−T}
∇_X det(X + t Y)^k = ∇_X det^k (X + t Y) = k det^k(X + t Y) (X + t Y)^{−T}

d/dt det(X + t Y) = det(X + t Y) tr( (X + t Y)^{−1} Y )
d²/dt² det(X + t Y) = det(X + t Y) ( tr²( (X + t Y)^{−1} Y ) − tr( (X + t Y)^{−1} Y (X + t Y)^{−1} Y ) )
d/dt det(X + t Y)^{−1} = −det(X + t Y)^{−1} tr( (X + t Y)^{−1} Y )
d²/dt² det(X + t Y)^{−1} = det(X + t Y)^{−1} ( tr²( (X + t Y)^{−1} Y ) + tr( (X + t Y)^{−1} Y (X + t Y)^{−1} Y ) )
d/dt det^μ (X + t Y) = μ det^μ(X + t Y) tr( (X + t Y)^{−1} Y )
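A quick numerical check of the first derivative entry above (Jacobi's formula), assuming NumPy; data are arbitrary but X is kept well conditioned:

```python
# d/dt det(X + tY) at t = 0 equals det(X) tr(inv(X) Y).
import numpy as np

rng = np.random.default_rng(9)
X = np.eye(3) + 0.2 * rng.standard_normal((3, 3))
Y = rng.standard_normal((3, 3))
h = 1e-6

fd = (np.linalg.det(X + h*Y) - np.linalg.det(X - h*Y)) / (2*h)
assert np.isclose(fd, np.linalg.det(X) * np.trace(np.linalg.inv(X) @ Y))
```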

D.2.6 logarithmic

Matrix logarithm.

d/dt log(X + t Y)^μ = μ Y (X + t Y)^{−1} = μ (X + t Y)^{−1} Y ,   X Y = Y X

d/dt log(I − t Y)^μ = −μ Y (I − t Y)^{−1} = −μ (I − t Y)^{−1} Y   [273, p.493]

D.2.7 exponential

Matrix exponential. [98, §3.6, §4.5] [440, §5.4]

∇_X e^{tr(Y^T X)} = ∇_X det e^{Y^T X} = e^{tr(Y^T X)} Y   (∀ X, Y)
∇_X tr e^{Y X} = e^{Y^T X^T} Y^T = Y^T e^{X^T Y^T}   (∀ X, Y)
∇_X tr( A e^{Y X} ) = …

∇_x 1^T e^{Ax} = A^T e^{Ax}
∇_x 1^T e^{|Ax|} = A^T δ( sgn(Ax) ) e^{|Ax|}   ( (Ax)_i ≠ 0 )

∇_x log(1^T e^x) = e^x / (1^T e^x)
∇²_x log(1^T e^x) = ( 1/(1^T e^x) ) ( δ(e^x) − ( 1/(1^T e^x) ) e^x e^{x T} )

∇_x ( Π_{i=1}^k x_i )^{1/k} = (1/k) ( Π_{i=1}^k x_i )^{1/k} (1/x)
∇²_x ( Π_{i=1}^k x_i )^{1/k} = −(1/k) ( Π_{i=1}^k x_i )^{1/k} ( δ(x)^{−2} − (1/k)(1/x)(1/x)^T )

d/dt e^{t Y} = e^{t Y} Y = Y e^{t Y}
d/dt e^{X + t Y} = e^{X + t Y} Y = Y e^{X + t Y} ,   X Y = Y X
d²/dt² e^{X + t Y} = e^{X + t Y} Y² = Y e^{X + t Y} Y = Y² e^{X + t Y} ,   X Y = Y X
d^j/dt^j e^{tr(X + t Y)} = e^{tr(X + t Y)} tr^j(Y)
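A numerical check of the entry d/dt e^{tY} = e^{tY} Y = Y e^{tY}, assuming NumPy and SciPy's expm:

```python
# Central difference of t -> expm(tY) versus expm(tY) @ Y.
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(10)
Y = rng.standard_normal((3, 3))
t, h = 0.7, 1e-6

fd = (expm((t + h) * Y) - expm((t - h) * Y)) / (2 * h)
assert np.allclose(fd, expm(t * Y) @ Y, atol=1e-6)
assert np.allclose(expm(t * Y) @ Y, Y @ expm(t * Y))  # exp(tY) commutes with Y
```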

D.2.7.0.1 Exercise. Expand these tables.
Provide four unfinished table entries indicated by … in §D.2.1 & §D.2.3. H

D.2.7.0.2 Exercise. log. (§D.1.7, §3.5.4)
Find the first four terms of the Taylor series expansion for log x about x = 1. Plot the supporting hyperplane to the hypograph of log x at [x ; log x] = [1 ; 0]. Prove log x ≤ x − 1. H