Appendix D
Matrix calculus
From too much study, and from extreme passion, cometh madnesse.
− Isaac Newton [86, §5]
D.1 Directional derivative, Taylor series
D.1.1 Gradients

Gradient of a differentiable real function f(x) : \mathbb{R}^K \to \mathbb{R} with respect to its vector domain is defined

\[
\nabla f(x) \triangleq \begin{bmatrix}
\frac{\partial f(x)}{\partial x_1}\\[2pt]
\frac{\partial f(x)}{\partial x_2}\\[2pt]
\vdots\\[2pt]
\frac{\partial f(x)}{\partial x_K}
\end{bmatrix} \in \mathbb{R}^K \tag{1354}
\]

while the second-order gradient of the twice differentiable real function with respect to its vector domain is traditionally called the Hessian;

\[
\nabla^2 f(x) \triangleq \begin{bmatrix}
\frac{\partial^2 f(x)}{\partial x_1^2} & \frac{\partial^2 f(x)}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_1 \partial x_K}\\[2pt]
\frac{\partial^2 f(x)}{\partial x_2 \partial x_1} & \frac{\partial^2 f(x)}{\partial x_2^2} & \cdots & \frac{\partial^2 f(x)}{\partial x_2 \partial x_K}\\[2pt]
\vdots & \vdots & \ddots & \vdots\\[2pt]
\frac{\partial^2 f(x)}{\partial x_K \partial x_1} & \frac{\partial^2 f(x)}{\partial x_K \partial x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_K^2}
\end{bmatrix} \in \mathbb{S}^K \tag{1355}
\]
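As a numerical sanity check (not part of the book), the definitions (1354) and (1355) can be verified by finite differences on a quadratic f(x) = x'Ax + b'x, whose analytic gradient is (A + A')x + b and whose Hessian is A + A'. All names below are illustrative; NumPy is assumed.

```python
# Sketch: finite-difference check of gradient (1354) and Hessian (1355)
# on a quadratic, for which the differences are exact up to rounding.
import numpy as np

rng = np.random.default_rng(0)
K = 4
A = rng.standard_normal((K, K))
b = rng.standard_normal(K)
x = rng.standard_normal(K)

f = lambda x: x @ A @ x + b @ x
e = np.eye(K)

# entrywise central differences, one partial per entry as in (1354)
h = 1e-6
grad_fd = np.array([(f(x + h*e[k]) - f(x - h*e[k])) / (2*h) for k in range(K)])
grad = (A + A.T) @ x + b

# second differences, per (1355)
h2 = 1e-4
hess_fd = np.array([[(f(x + h2*(e[i] + e[j])) - f(x + h2*e[i])
                     - f(x + h2*e[j]) + f(x)) / h2**2
                    for j in range(K)] for i in range(K)])
hess = A + A.T

assert np.allclose(grad_fd, grad, atol=1e-5)
assert np.allclose(hess_fd, hess, atol=1e-6)
```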
© 2001 Jon Dattorro. CO&EDG version 04.18.2006. All rights reserved.
Citation: Jon Dattorro, Convex Optimization & Euclidean Distance Geometry, Meboo Publishing USA, 2005.
The gradient of vector-valued function v(x) : \mathbb{R} \to \mathbb{R}^N on real domain is a row vector

\[
\nabla v(x) \triangleq \begin{bmatrix} \frac{\partial v_1(x)}{\partial x} & \frac{\partial v_2(x)}{\partial x} & \cdots & \frac{\partial v_N(x)}{\partial x} \end{bmatrix} \in \mathbb{R}^N \tag{1356}
\]

while the second-order gradient is

\[
\nabla^2 v(x) \triangleq \begin{bmatrix} \frac{\partial^2 v_1(x)}{\partial x^2} & \frac{\partial^2 v_2(x)}{\partial x^2} & \cdots & \frac{\partial^2 v_N(x)}{\partial x^2} \end{bmatrix} \in \mathbb{R}^N \tag{1357}
\]

Gradient of vector-valued function h(x) : \mathbb{R}^K \to \mathbb{R}^N on vector domain is
\[
\nabla h(x) \triangleq \begin{bmatrix}
\frac{\partial h_1(x)}{\partial x_1} & \frac{\partial h_2(x)}{\partial x_1} & \cdots & \frac{\partial h_N(x)}{\partial x_1}\\[2pt]
\frac{\partial h_1(x)}{\partial x_2} & \frac{\partial h_2(x)}{\partial x_2} & \cdots & \frac{\partial h_N(x)}{\partial x_2}\\[2pt]
\vdots & \vdots & & \vdots\\[2pt]
\frac{\partial h_1(x)}{\partial x_K} & \frac{\partial h_2(x)}{\partial x_K} & \cdots & \frac{\partial h_N(x)}{\partial x_K}
\end{bmatrix}
= \begin{bmatrix} \nabla h_1(x) & \nabla h_2(x) & \cdots & \nabla h_N(x) \end{bmatrix} \in \mathbb{R}^{K\times N} \tag{1358}
\]

while the second-order gradient has a three-dimensional representation dubbed cubix;^{D.1}

\[
\nabla^2 h(x) \triangleq \begin{bmatrix}
\nabla \frac{\partial h_1(x)}{\partial x_1} & \nabla \frac{\partial h_2(x)}{\partial x_1} & \cdots & \nabla \frac{\partial h_N(x)}{\partial x_1}\\[2pt]
\nabla \frac{\partial h_1(x)}{\partial x_2} & \nabla \frac{\partial h_2(x)}{\partial x_2} & \cdots & \nabla \frac{\partial h_N(x)}{\partial x_2}\\[2pt]
\vdots & \vdots & & \vdots\\[2pt]
\nabla \frac{\partial h_1(x)}{\partial x_K} & \nabla \frac{\partial h_2(x)}{\partial x_K} & \cdots & \nabla \frac{\partial h_N(x)}{\partial x_K}
\end{bmatrix}
= \begin{bmatrix} \nabla^2 h_1(x) & \nabla^2 h_2(x) & \cdots & \nabla^2 h_N(x) \end{bmatrix} \in \mathbb{R}^{K\times N\times K} \tag{1359}
\]

where the gradient of each real entry is with respect to vector x as in (1354).
^{D.1}The word matrix comes from the Latin for womb; related to the prefix matri- derived from mater meaning mother.
The gradient of real function g(X) : \mathbb{R}^{K\times L} \to \mathbb{R} on matrix domain is

\[
\nabla g(X) \triangleq \begin{bmatrix}
\frac{\partial g(X)}{\partial X_{11}} & \frac{\partial g(X)}{\partial X_{12}} & \cdots & \frac{\partial g(X)}{\partial X_{1L}}\\[2pt]
\frac{\partial g(X)}{\partial X_{21}} & \frac{\partial g(X)}{\partial X_{22}} & \cdots & \frac{\partial g(X)}{\partial X_{2L}}\\[2pt]
\vdots & \vdots & & \vdots\\[2pt]
\frac{\partial g(X)}{\partial X_{K1}} & \frac{\partial g(X)}{\partial X_{K2}} & \cdots & \frac{\partial g(X)}{\partial X_{KL}}
\end{bmatrix} \in \mathbb{R}^{K\times L}
= \begin{bmatrix} \nabla_{X(:,1)}\, g(X) & \nabla_{X(:,2)}\, g(X) & \cdots & \nabla_{X(:,L)}\, g(X) \end{bmatrix} \in \mathbb{R}^{K\times 1\times L} \tag{1360}
\]

where the gradient \nabla_{X(:,i)} is with respect to the i^{th} column of X. The strange appearance of (1360) in \mathbb{R}^{K\times 1\times L} is meant to suggest a third dimension perpendicular to the page (not a diagonal matrix). The second-order gradient has representation

\[
\nabla^2 g(X) \triangleq \begin{bmatrix}
\nabla\frac{\partial g(X)}{\partial X_{11}} & \cdots & \nabla\frac{\partial g(X)}{\partial X_{1L}}\\[2pt]
\vdots & & \vdots\\[2pt]
\nabla\frac{\partial g(X)}{\partial X_{K1}} & \cdots & \nabla\frac{\partial g(X)}{\partial X_{KL}}
\end{bmatrix} \in \mathbb{R}^{K\times L\times K\times L}
= \begin{bmatrix} \nabla\nabla_{X(:,1)}\, g(X) & \cdots & \nabla\nabla_{X(:,L)}\, g(X) \end{bmatrix} \in \mathbb{R}^{K\times 1\times L\times K\times L} \tag{1361}
\]

where the gradient \nabla is with respect to matrix X.
Gradient of vector-valued function g(X) : \mathbb{R}^{K\times L} \to \mathbb{R}^N on matrix domain is a cubix

\[
\nabla g(X) \triangleq \begin{bmatrix}
\nabla_{X(:,1)}\, g_1(X) & \nabla_{X(:,1)}\, g_2(X) & \cdots & \nabla_{X(:,1)}\, g_N(X)\\
\nabla_{X(:,2)}\, g_1(X) & \nabla_{X(:,2)}\, g_2(X) & \cdots & \nabla_{X(:,2)}\, g_N(X)\\
\vdots & \vdots & & \vdots\\
\nabla_{X(:,L)}\, g_1(X) & \nabla_{X(:,L)}\, g_2(X) & \cdots & \nabla_{X(:,L)}\, g_N(X)
\end{bmatrix}
= \begin{bmatrix} \nabla g_1(X) & \nabla g_2(X) & \cdots & \nabla g_N(X) \end{bmatrix} \in \mathbb{R}^{K\times N\times L} \tag{1362}
\]

while the second-order gradient has a five-dimensional representation;

\[
\nabla^2 g(X) \triangleq \begin{bmatrix}
\nabla\nabla_{X(:,1)}\, g_1(X) & \cdots & \nabla\nabla_{X(:,1)}\, g_N(X)\\
\vdots & & \vdots\\
\nabla\nabla_{X(:,L)}\, g_1(X) & \cdots & \nabla\nabla_{X(:,L)}\, g_N(X)
\end{bmatrix}
= \begin{bmatrix} \nabla^2 g_1(X) & \nabla^2 g_2(X) & \cdots & \nabla^2 g_N(X) \end{bmatrix} \in \mathbb{R}^{K\times N\times L\times K\times L} \tag{1363}
\]

The gradient of matrix-valued function g(X) : \mathbb{R}^{K\times L} \to \mathbb{R}^{M\times N} on matrix domain has a four-dimensional representation called quartix

\[
\nabla g(X) \triangleq \begin{bmatrix}
\nabla g_{11}(X) & \nabla g_{12}(X) & \cdots & \nabla g_{1N}(X)\\
\nabla g_{21}(X) & \nabla g_{22}(X) & \cdots & \nabla g_{2N}(X)\\
\vdots & \vdots & & \vdots\\
\nabla g_{M1}(X) & \nabla g_{M2}(X) & \cdots & \nabla g_{MN}(X)
\end{bmatrix} \in \mathbb{R}^{M\times N\times K\times L} \tag{1364}
\]

while the second-order gradient has six-dimensional representation

\[
\nabla^2 g(X) \triangleq \begin{bmatrix}
\nabla^2 g_{11}(X) & \cdots & \nabla^2 g_{1N}(X)\\
\vdots & & \vdots\\
\nabla^2 g_{M1}(X) & \cdots & \nabla^2 g_{MN}(X)
\end{bmatrix} \in \mathbb{R}^{M\times N\times K\times L\times K\times L} \tag{1365}
\]

and so on.
D.1.2 Product rules for matrix-functions

Given dimensionally compatible matrix-valued functions of matrix variable f(X) and g(X)

\[
\nabla_X \bigl(f(X)^T g(X)\bigr) = \nabla_X(f)\, g + \nabla_X(g)\, f \tag{1366}
\]

while [35, §8.3] [205]

\[
\nabla_X \operatorname{tr}\bigl(f(X)^T g(X)\bigr) = \nabla_X \Bigl[\operatorname{tr}\bigl(f(X)^T g(Z)\bigr) + \operatorname{tr}\bigl(g(X)^T f(Z)\bigr)\Bigr]\Big|_{Z\leftarrow X} \tag{1367}
\]

These expressions implicitly apply as well to scalar-, vector-, or matrix-valued functions of scalar, vector, or matrix arguments.
D.1.2.0.1 Example. Cubix.
Suppose f(X) : \mathbb{R}^{2\times 2} \to \mathbb{R}^2 = X^T a and g(X) : \mathbb{R}^{2\times 2} \to \mathbb{R}^2 = Xb. We wish to find

\[
\nabla_X \bigl(f(X)^T g(X)\bigr) = \nabla_X\, a^T X^2\, b \tag{1368}
\]

using the product rule. Formula (1366) calls for

\[
\nabla_X\, a^T X^2\, b = \nabla_X(X^T a)\, Xb + \nabla_X(Xb)\, X^T a \tag{1369}
\]

Consider the first of the two terms:

\[
\nabla_X(f)\, g = \nabla_X(X^T a)\, Xb = \bigl[\,\nabla (X^Ta)_1 \quad \nabla (X^Ta)_2\,\bigr]\, Xb \tag{1370}
\]

The gradient of X^T a forms a cubix in \mathbb{R}^{2\times 2\times 2} whose slices hold the partials \partial (X^Ta)_j / \partial X_{kl} ; in \mathbb{R}^{2\times 1\times 2} it contracts against the vector Xb as in (1371).

\[
\nabla_X(X^T a)\, Xb = \begin{bmatrix}
\frac{\partial (X^Ta)_1}{\partial X_{11}} & \frac{\partial (X^Ta)_2}{\partial X_{11}}\\[2pt]
\frac{\partial (X^Ta)_1}{\partial X_{12}} & \frac{\partial (X^Ta)_2}{\partial X_{12}}\\[2pt]
\frac{\partial (X^Ta)_1}{\partial X_{21}} & \frac{\partial (X^Ta)_2}{\partial X_{21}}\\[2pt]
\frac{\partial (X^Ta)_1}{\partial X_{22}} & \frac{\partial (X^Ta)_2}{\partial X_{22}}
\end{bmatrix}
\begin{bmatrix} (Xb)_1 \\ (Xb)_2 \end{bmatrix} \in \mathbb{R}^{2\times 1\times 2} \tag{1371}
\]

Because gradient of the product (1368) requires total change with respect to change in each entry of matrix X, the Xb vector must make an inner product with each vector in the second dimension of the cubix (indicated in the book by dotted line segments);

\[
\nabla_X(X^T a)\, Xb
= \begin{bmatrix} a_1 & 0\\ a_2 & 0 \end{bmatrix}(b_1 X_{11} + b_2 X_{12})
+ \begin{bmatrix} 0 & a_1\\ 0 & a_2 \end{bmatrix}(b_1 X_{21} + b_2 X_{22})
= \begin{bmatrix}
a_1(b_1 X_{11} + b_2 X_{12}) & a_1(b_1 X_{21} + b_2 X_{22})\\
a_2(b_1 X_{11} + b_2 X_{12}) & a_2(b_1 X_{21} + b_2 X_{22})
\end{bmatrix} = ab^T X^T \in \mathbb{R}^{2\times 2} \tag{1372}
\]

where the cubix appears as a complete 2\times 2\times 2 matrix. In like manner for the second term \nabla_X(g)\, f

\[
\nabla_X(Xb)\, X^T a
= \begin{bmatrix} b_1 & b_2\\ 0 & 0 \end{bmatrix}(X_{11}a_1 + X_{21}a_2)
+ \begin{bmatrix} 0 & 0\\ b_1 & b_2 \end{bmatrix}(X_{12}a_1 + X_{22}a_2)
= X^T ab^T \in \mathbb{R}^{2\times 2} \tag{1373}
\]

The solution

\[
\nabla_X\, a^T X^2\, b = ab^T X^T + X^T ab^T \tag{1374}
\]

can be found from Table D.2.1 or verified using (1367). □
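A quick numerical check of the result (1374), outside the book's text: compare the analytic gradient ab'X' + X'ab' against entrywise finite differences of a'X²b. NumPy is assumed; names are illustrative.

```python
# Sketch: finite-difference verification of (1374),
# grad of a'X X b  =  ab'X' + X'ab'  for X in R^{2x2}.
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.standard_normal(2), rng.standard_normal(2)
X = rng.standard_normal((2, 2))

f = lambda X: a @ X @ X @ b            # a' X^2 b, as in (1368)
h = 1e-6
grad_fd = np.zeros((2, 2))
for k in range(2):
    for l in range(2):
        E = np.zeros((2, 2)); E[k, l] = h
        grad_fd[k, l] = (f(X + E) - f(X - E)) / (2*h)

grad = np.outer(a, b) @ X.T + X.T @ np.outer(a, b)   # (1374)
assert np.allclose(grad_fd, grad, atol=1e-6)
```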
D.1.2.1 Kronecker product

A partial remedy for venturing into hyperdimensional representations, such as the cubix or quartix, is to first vectorize matrices as in (29). This device gives rise to the Kronecker product ⊗ of matrices; a.k.a, direct product or tensor product. Although its definition sees reversal in the literature, [211, §2.1] we adopt: for A \in \mathbb{R}^{m\times n} and B \in \mathbb{R}^{p\times q}

\[
B \otimes A \triangleq \begin{bmatrix}
B_{11}A & B_{12}A & \cdots & B_{1q}A\\
B_{21}A & B_{22}A & \cdots & B_{2q}A\\
\vdots & \vdots & & \vdots\\
B_{p1}A & B_{p2}A & \cdots & B_{pq}A
\end{bmatrix} \in \mathbb{R}^{pm\times qn} \tag{1375}
\]
One advantage to vectorization is existence of a traditional two-dimensional matrix representation for the second-order gradient of a real function with respect to a vectorized matrix. For example, from §A.1.1 no.22 (§D.2.1) for square A, B \in \mathbb{R}^{n\times n} [96, §5.2] [10, §3]

\[
\nabla^2_{\operatorname{vec}X}\, \operatorname{tr}(AXBX^T) = \nabla^2_{\operatorname{vec}X}\, \operatorname{vec}(X)^T (B^T\!\otimes A)\operatorname{vec}X = B\otimes A^T + B^T\!\otimes A \;\in\; \mathbb{R}^{n^2\times n^2} \tag{1376}
\]

A disadvantage is a large new but known set of algebraic rules and the fact that its mere use does not generally guarantee two-dimensional matrix representation of gradients.
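The vectorized Hessian (1376) can be checked numerically (a sketch, not from the book): build the finite-difference Hessian of tr(AXBX') in vec X, using column-stacking vec as in (29), and compare it to B⊗A' + B'⊗A. NumPy's `kron` follows the same block convention as (1375).

```python
# Sketch: verify (1376) for random square A, B.
import numpy as np

rng = np.random.default_rng(2)
n = 3
A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))

unvec = lambda v: v.reshape(n, n, order='F')     # column-stacking vec inverse
f = lambda v: np.trace(A @ unvec(v) @ B @ unvec(v).T)

H = np.kron(B, A.T) + np.kron(B.T, A)            # analytic Hessian (1376)

# finite-difference Hessian in vec X (exact for this quadratic, up to rounding)
h = 1e-4
m = n * n
v0 = rng.standard_normal(m)
I = np.eye(m)
H_fd = np.zeros((m, m))
for i in range(m):
    for j in range(m):
        H_fd[i, j] = (f(v0 + h*I[i] + h*I[j]) - f(v0 + h*I[i])
                      - f(v0 + h*I[j]) + f(v0)) / h**2

assert np.allclose(H, H_fd, atol=1e-5)
```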
D.1.3 Chain rules for composite matrix-functions

Given dimensionally compatible matrix-valued functions of matrix variable f(X) and g(X) [137, §15.7]

\[
\nabla_X\, g\bigl(f(X)^T\bigr) = \nabla_X f\; \nabla_f\, g \tag{1377}
\]
\[
\nabla_X^2\, g\bigl(f(X)^T\bigr) = \nabla_X\bigl(\nabla_X f\; \nabla_f\, g\bigr) = \nabla_X^2 f\; \nabla_f\, g + \nabla_X f\; \nabla_f^2\, g\; \nabla_X f \tag{1378}
\]

D.1.3.1 Two arguments

\[
\nabla_X\, g\bigl(f(X)^T, h(X)^T\bigr) = \nabla_X f\; \nabla_f\, g + \nabla_X h\; \nabla_h\, g \tag{1379}
\]
D.1.3.1.1 Example. Chain rule for two arguments. [28, §1.1]

\[
g\bigl(f(x)^T, h(x)^T\bigr) = \bigl(f(x)+h(x)\bigr)^T A\, \bigl(f(x)+h(x)\bigr) \tag{1380}
\]
\[
f(x) = \begin{bmatrix} x_1\\ \varepsilon x_2 \end{bmatrix},\qquad
h(x) = \begin{bmatrix} \varepsilon x_1\\ x_2 \end{bmatrix} \tag{1381}
\]
\[
\nabla_x\, g\bigl(f(x)^T, h(x)^T\bigr)
= \begin{bmatrix} 1 & 0\\ 0 & \varepsilon \end{bmatrix}(A+A^T)(f+h)
+ \begin{bmatrix} \varepsilon & 0\\ 0 & 1 \end{bmatrix}(A+A^T)(f+h) \tag{1382}
\]
\[
\nabla_x\, g\bigl(f(x)^T, h(x)^T\bigr)
= \begin{bmatrix} 1+\varepsilon & 0\\ 0 & 1+\varepsilon \end{bmatrix}(A+A^T)
\left(\begin{bmatrix} x_1\\ \varepsilon x_2 \end{bmatrix} + \begin{bmatrix} \varepsilon x_1\\ x_2 \end{bmatrix}\right) \tag{1383}
\]
\[
\lim_{\varepsilon\to 0}\, \nabla_x\, g\bigl(f(x)^T, h(x)^T\bigr) = (A+A^T)\,x \tag{1384}
\]

from Table D.2.1. □
These formulae remain correct when the gradients produce hyperdimensional representations.
D.1.4 First directional derivative

Assume that a differentiable function g(X) : \mathbb{R}^{K\times L}\to\mathbb{R}^{M\times N} has continuous first- and second-order gradients \nabla g and \nabla^2 g over \operatorname{dom} g which is an open set. We seek simple expressions for the first and second directional derivatives in direction Y\in\mathbb{R}^{K\times L} : respectively, \overset{\to Y}{dg} \in \mathbb{R}^{M\times N} and \overset{\to Y}{dg^2} \in \mathbb{R}^{M\times N}.

Assuming that the limit exists, we may state the partial derivative of the mn^{th} entry of g with respect to the kl^{th} entry of X ;

\[
\frac{\partial g_{mn}(X)}{\partial X_{kl}} = \lim_{\Delta t\to 0} \frac{g_{mn}(X+\Delta t\, e_k e_l^T) - g_{mn}(X)}{\Delta t} \;\in\; \mathbb{R} \tag{1385}
\]

where e_k is the k^{th} standard basis vector in \mathbb{R}^K while e_l is the l^{th} standard basis vector in \mathbb{R}^L. The total number of partial derivatives equals KLMN while the gradient is defined in their terms; the mn^{th} entry of the gradient is

\[
\nabla g_{mn}(X) = \begin{bmatrix}
\frac{\partial g_{mn}(X)}{\partial X_{11}} & \frac{\partial g_{mn}(X)}{\partial X_{12}} & \cdots & \frac{\partial g_{mn}(X)}{\partial X_{1L}}\\[2pt]
\frac{\partial g_{mn}(X)}{\partial X_{21}} & \frac{\partial g_{mn}(X)}{\partial X_{22}} & \cdots & \frac{\partial g_{mn}(X)}{\partial X_{2L}}\\[2pt]
\vdots & \vdots & & \vdots\\[2pt]
\frac{\partial g_{mn}(X)}{\partial X_{K1}} & \frac{\partial g_{mn}(X)}{\partial X_{K2}} & \cdots & \frac{\partial g_{mn}(X)}{\partial X_{KL}}
\end{bmatrix} \in \mathbb{R}^{K\times L} \tag{1386}
\]

while the gradient is a quartix

\[
\nabla g(X) = \begin{bmatrix}
\nabla g_{11}(X) & \cdots & \nabla g_{1N}(X)\\
\vdots & & \vdots\\
\nabla g_{M1}(X) & \cdots & \nabla g_{MN}(X)
\end{bmatrix} \in \mathbb{R}^{M\times N\times K\times L} \tag{1387}
\]

By simply rotating our perspective of the four-dimensional representation of the gradient matrix, we find one of three useful transpositions of this quartix (connoted T1):

\[
\nabla g(X)^{T_1} = \begin{bmatrix}
\frac{\partial g(X)}{\partial X_{11}} & \cdots & \frac{\partial g(X)}{\partial X_{1L}}\\[2pt]
\vdots & & \vdots\\[2pt]
\frac{\partial g(X)}{\partial X_{K1}} & \cdots & \frac{\partial g(X)}{\partial X_{KL}}
\end{bmatrix} \in \mathbb{R}^{K\times L\times M\times N} \tag{1388}
\]

When the limit for \Delta t \in \mathbb{R} exists, it is easy to show by substitution of variables in (1385)
\[
\frac{\partial g_{mn}(X)}{\partial X_{kl}}\, Y_{kl} = \lim_{\Delta t\to 0} \frac{g_{mn}(X+\Delta t\, Y_{kl}\, e_k e_l^T) - g_{mn}(X)}{\Delta t} \;\in\; \mathbb{R} \tag{1389}
\]

which may be interpreted as the change in g_{mn} at X when the change in X_{kl} is equal to Y_{kl}, the kl^{th} entry of any Y \in \mathbb{R}^{K\times L}. Because the total change in g_{mn}(X) due to Y is the sum of change with respect to each and every X_{kl}, the mn^{th} entry of the directional derivative is the corresponding total differential [137, §15.8]

\[
dg_{mn}(X)\big|_{dX\to Y} = \sum_{k,l} \frac{\partial g_{mn}(X)}{\partial X_{kl}}\, Y_{kl} = \operatorname{tr}\bigl(\nabla g_{mn}(X)^T\, Y\bigr) \tag{1390}
\]
\[
= \sum_{k,l}\, \lim_{\Delta t\to 0} \frac{g_{mn}(X+\Delta t\, Y_{kl}\, e_k e_l^T) - g_{mn}(X)}{\Delta t} \tag{1391}
\]
\[
= \lim_{\Delta t\to 0} \frac{g_{mn}(X+\Delta t\, Y) - g_{mn}(X)}{\Delta t} \tag{1392}
\]
\[
= \frac{d}{dt}\bigg|_{t=0}\, g_{mn}(X+t\,Y) \tag{1393}
\]

where t \in \mathbb{R}. Assuming finite Y, equation (1392) is called the Gâteaux differential [27, App. A.5] [125, §D.2.1] [234, §5.28] whose existence is implied by the existence of the Fréchet differential, the sum in (1390). [157, §7.2] Each may be understood as the change in g_{mn} at X when the change in X is equal in magnitude and direction to Y.^{D.2} Hence the directional derivative,

\[
\overset{\to Y}{dg}(X) \triangleq \begin{bmatrix}
dg_{11}(X) & \cdots & dg_{1N}(X)\\
\vdots & & \vdots\\
dg_{M1}(X) & \cdots & dg_{MN}(X)
\end{bmatrix}\Bigg|_{dX\to Y} \in \mathbb{R}^{M\times N}
= \begin{bmatrix}
\operatorname{tr}\bigl(\nabla g_{11}(X)^T Y\bigr) & \cdots & \operatorname{tr}\bigl(\nabla g_{1N}(X)^T Y\bigr)\\
\vdots & & \vdots\\
\operatorname{tr}\bigl(\nabla g_{M1}(X)^T Y\bigr) & \cdots & \operatorname{tr}\bigl(\nabla g_{MN}(X)^T Y\bigr)
\end{bmatrix}
= \begin{bmatrix}
\sum_{k,l} \frac{\partial g_{11}(X)}{\partial X_{kl}} Y_{kl} & \cdots & \sum_{k,l} \frac{\partial g_{1N}(X)}{\partial X_{kl}} Y_{kl}\\
\vdots & & \vdots\\
\sum_{k,l} \frac{\partial g_{M1}(X)}{\partial X_{kl}} Y_{kl} & \cdots & \sum_{k,l} \frac{\partial g_{MN}(X)}{\partial X_{kl}} Y_{kl}
\end{bmatrix} \tag{1394}
\]

from which it follows

\[
\overset{\to Y}{dg}(X) = \sum_{k,l} \frac{\partial g(X)}{\partial X_{kl}}\, Y_{kl} \tag{1395}
\]

Yet for all X \in \operatorname{dom} g, any Y \in \mathbb{R}^{K\times L}, and some open interval of t \in \mathbb{R}

\[
g(X+t\,Y) = g(X) + t\, \overset{\to Y}{dg}(X) + o(t^2) \tag{1396}
\]

which is the first-order Taylor series expansion about X. [137, §18.4] [85, §2.3.4] Differentiation with respect to t and subsequent t-zeroing isolates the second term of the expansion. Thus differentiating and zeroing g(X+tY) in t is an operation equivalent to individually differentiating and zeroing every entry g_{mn}(X+tY) as in (1393). So the directional derivative of g(X) in any direction Y \in \mathbb{R}^{K\times L} evaluated at X \in \operatorname{dom} g becomes

\[
\overset{\to Y}{dg}(X) = \frac{d}{dt}\bigg|_{t=0}\, g(X+t\,Y) \;\in\; \mathbb{R}^{M\times N} \tag{1397}
\]
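As a numerical illustration (not from the book) of the equivalence between (1390) and (1397) for a real function on matrix domain: the trace form tr(∇g(X)'Y) should match the one-dimensional derivative of g(X+tY) at t = 0. The test function here is arbitrary; NumPy assumed.

```python
# Sketch: directional derivative two ways, (1390)/(1421) vs (1397).
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((3, 4))
Y = rng.standard_normal((3, 4))

g = lambda X: np.sum(np.sin(X))        # arbitrary smooth real test function
gradX = np.cos(X)                      # its entrywise gradient

dg = np.trace(gradX.T @ Y)             # tr(grad(X)' Y), the total differential
h = 1e-6
dg_t = (g(X + h*Y) - g(X - h*Y)) / (2*h)   # d/dt g(X+tY) at t = 0
assert np.isclose(dg, dg_t, atol=1e-6)
```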
^{D.2}Although Y is a matrix, we may regard it as a vector in \mathbb{R}^{KL}.
Figure 97: Drawn is a convex quadratic bowl in \mathbb{R}^2\times\mathbb{R} ; \hat f(x) = x^Tx : \mathbb{R}^2\to\mathbb{R} versus x on some open disc in \mathbb{R}^2. Plane slice \partial\mathcal{H} is perpendicular to function domain. Slice intersection with domain connotes bidirectional vector y. Tangent line slope at point (\alpha, \hat f(\alpha)) is directional derivative value \nabla_x \hat f(\alpha)^T y (1424) at \alpha in slice direction y. Recall, negative gradient -\nabla_x \hat f(x) \in \mathbb{R}^2 is always steepest descent direction [248]. [137, §15.6] When vector \upsilon\in\mathbb{R}^3 entry \upsilon_3 is half the directional derivative in gradient direction at \alpha, and when \begin{bmatrix}\upsilon_1\\ \upsilon_2\end{bmatrix} = \nabla_x \hat f(\alpha), then -\upsilon points directly toward bowl bottom.
[177, §2.1, §5.4.5] [25, §6.3.1] which is simplest. The derivative with respect to t makes the directional derivative (1397) resemble ordinary calculus (§D.2); e.g., when g(X) is linear, \overset{\to Y}{dg}(X) = g(Y). [157, §7.2]
D.1.4.1 Interpretation of directional derivative

In the case of any differentiable real function \hat f(X) : \mathbb{R}^{K\times L}\to\mathbb{R}, the directional derivative of \hat f(X) at X in any direction Y yields the slope of \hat f along the line X+t\,Y through its domain (parametrized by t\in\mathbb{R}) evaluated at t=0. For higher-dimensional functions, by (1394), this slope interpretation can be applied to each entry of the directional derivative. Unlike the gradient, the directional derivative does not expand dimension; e.g., the directional derivative in (1397) retains the dimensions of g.

Figure 97, for example, shows a plane slice of a real convex bowl-shaped function \hat f(x) along a line \alpha + t\,y through its domain. The slice reveals a one-dimensional real function of t ; \hat f(\alpha + t\,y). The directional derivative at x=\alpha in direction y is the slope of \hat f(\alpha + t\,y) with respect to t at t=0. In the case of a real function having vector argument h(X) : \mathbb{R}^K\to\mathbb{R}, its directional derivative in the normalized direction of its gradient is the gradient magnitude. (1424) For a real function of real variable, the directional derivative evaluated at any point in the function domain is just the slope of that function there scaled by the real direction. (confer §3.1.1.4)
D.1.4.1.1 Theorem. Directional derivative condition for optimization. [157, §7.4]
Suppose \hat f(X) : \mathbb{R}^{K\times L}\to\mathbb{R} is minimized on convex set \mathcal{C}\subseteq\mathbb{R}^{K\times L} by X^\star, and the directional derivative of \hat f exists there. Then for all X\in\mathcal{C}

\[
\overset{\to X-X^\star}{d\hat f}(X) \;\ge\; 0 \tag{1398}
\]
⋄

D.1.4.1.2 Example. Simple bowl.
Bowl function (Figure 97)

\[
\hat f(x) : \mathbb{R}^K \to \mathbb{R} \;\triangleq\; (x-a)^T (x-a) - b \tag{1399}
\]

has function offset -b \in \mathbb{R}, axis of revolution at x=a, and positive definite Hessian (1355) everywhere in its domain (an open hyperdisc in \mathbb{R}^K); id est, strictly convex quadratic \hat f(x) has unique global minimum equal to -b at x=a. A vector -\upsilon based anywhere in \operatorname{dom}\hat f \times \mathbb{R} pointing toward the unique bowl-bottom is specified:

\[
\upsilon \;\propto\; \begin{bmatrix} x-a\\ \hat f(x)+b \end{bmatrix} \in \mathbb{R}^K\times\mathbb{R} \tag{1400}
\]

Such a vector is

\[
\upsilon = \begin{bmatrix} \nabla_x \hat f(x)\\[4pt] \tfrac{1}{2}\,\overset{\to\nabla_x \hat f(x)}{d\hat f}(x) \end{bmatrix} \tag{1401}
\]

since the gradient is

\[
\nabla_x \hat f(x) = 2(x-a) \tag{1402}
\]

and the directional derivative in the direction of the gradient is (1424)

\[
\overset{\to\nabla_x \hat f(x)}{d\hat f}(x) = \nabla_x \hat f(x)^T\, \nabla_x \hat f(x) = 4(x-a)^T(x-a) = 4\bigl(\hat f(x)+b\bigr) \tag{1403}
\]
□
D.1.5 Second directional derivative

By similar argument, it so happens: the second directional derivative is equally simple. Given g(X) : \mathbb{R}^{K\times L}\to\mathbb{R}^{M\times N} on open domain,

\[
\nabla \frac{\partial g_{mn}(X)}{\partial X_{kl}} = \frac{\partial\, \nabla g_{mn}(X)}{\partial X_{kl}} = \begin{bmatrix}
\frac{\partial^2 g_{mn}(X)}{\partial X_{kl}\partial X_{11}} & \frac{\partial^2 g_{mn}(X)}{\partial X_{kl}\partial X_{12}} & \cdots & \frac{\partial^2 g_{mn}(X)}{\partial X_{kl}\partial X_{1L}}\\[2pt]
\frac{\partial^2 g_{mn}(X)}{\partial X_{kl}\partial X_{21}} & \frac{\partial^2 g_{mn}(X)}{\partial X_{kl}\partial X_{22}} & \cdots & \frac{\partial^2 g_{mn}(X)}{\partial X_{kl}\partial X_{2L}}\\[2pt]
\vdots & \vdots & & \vdots\\[2pt]
\frac{\partial^2 g_{mn}(X)}{\partial X_{kl}\partial X_{K1}} & \frac{\partial^2 g_{mn}(X)}{\partial X_{kl}\partial X_{K2}} & \cdots & \frac{\partial^2 g_{mn}(X)}{\partial X_{kl}\partial X_{KL}}
\end{bmatrix} \in \mathbb{R}^{K\times L} \tag{1404}
\]

\[
\nabla^2 g_{mn}(X) = \begin{bmatrix}
\nabla\frac{\partial g_{mn}(X)}{\partial X_{11}} & \cdots & \nabla\frac{\partial g_{mn}(X)}{\partial X_{1L}}\\[2pt]
\vdots & & \vdots\\[2pt]
\nabla\frac{\partial g_{mn}(X)}{\partial X_{K1}} & \cdots & \nabla\frac{\partial g_{mn}(X)}{\partial X_{KL}}
\end{bmatrix} \in \mathbb{R}^{K\times L\times K\times L}
= \begin{bmatrix}
\frac{\partial\,\nabla g_{mn}(X)}{\partial X_{11}} & \cdots & \frac{\partial\,\nabla g_{mn}(X)}{\partial X_{1L}}\\[2pt]
\vdots & & \vdots\\[2pt]
\frac{\partial\,\nabla g_{mn}(X)}{\partial X_{K1}} & \cdots & \frac{\partial\,\nabla g_{mn}(X)}{\partial X_{KL}}
\end{bmatrix} \tag{1405}
\]

Rotating our perspective, we get several views of the second-order gradient:

\[
\nabla^2 g(X) = \begin{bmatrix}
\nabla^2 g_{11}(X) & \cdots & \nabla^2 g_{1N}(X)\\
\vdots & & \vdots\\
\nabla^2 g_{M1}(X) & \cdots & \nabla^2 g_{MN}(X)
\end{bmatrix} \in \mathbb{R}^{M\times N\times K\times L\times K\times L} \tag{1406}
\]
\[
\nabla^2 g(X)^{T_1} = \begin{bmatrix}
\nabla\frac{\partial g(X)}{\partial X_{11}} & \cdots & \nabla\frac{\partial g(X)}{\partial X_{1L}}\\[2pt]
\vdots & & \vdots\\[2pt]
\nabla\frac{\partial g(X)}{\partial X_{K1}} & \cdots & \nabla\frac{\partial g(X)}{\partial X_{KL}}
\end{bmatrix} \in \mathbb{R}^{K\times L\times M\times N\times K\times L} \tag{1407}
\]
\[
\nabla^2 g(X)^{T_2} = \begin{bmatrix}
\frac{\partial\,\nabla g(X)}{\partial X_{11}} & \cdots & \frac{\partial\,\nabla g(X)}{\partial X_{1L}}\\[2pt]
\vdots & & \vdots\\[2pt]
\frac{\partial\,\nabla g(X)}{\partial X_{K1}} & \cdots & \frac{\partial\,\nabla g(X)}{\partial X_{KL}}
\end{bmatrix} \in \mathbb{R}^{K\times L\times K\times L\times M\times N} \tag{1408}
\]

Assuming the limits exist, we may state the partial derivative of the mn^{th} entry of g with respect to the kl^{th} and ij^{th} entries of X ;
\[
\frac{\partial^2 g_{mn}(X)}{\partial X_{kl}\,\partial X_{ij}} = \lim_{\Delta\tau,\Delta t\to 0} \frac{g_{mn}(X+\Delta t\,e_k e_l^T + \Delta\tau\,e_i e_j^T) - g_{mn}(X+\Delta t\,e_k e_l^T) - \bigl(g_{mn}(X+\Delta\tau\,e_i e_j^T) - g_{mn}(X)\bigr)}{\Delta\tau\,\Delta t} \tag{1409}
\]

Differentiating (1389) and then scaling by Y_{ij}

\[
\frac{\partial^2 g_{mn}(X)}{\partial X_{kl}\,\partial X_{ij}}\, Y_{kl} Y_{ij}
= \lim_{\Delta t\to 0} \frac{1}{\Delta t}\left(\frac{\partial g_{mn}(X+\Delta t\,Y_{kl}\,e_k e_l^T)}{\partial X_{ij}} - \frac{\partial g_{mn}(X)}{\partial X_{ij}}\right) Y_{ij} \tag{1410}
\]
\[
= \lim_{\Delta\tau,\Delta t\to 0} \frac{g_{mn}(X+\Delta t\,Y_{kl}\,e_k e_l^T + \Delta\tau\,Y_{ij}\,e_i e_j^T) - g_{mn}(X+\Delta t\,Y_{kl}\,e_k e_l^T) - \bigl(g_{mn}(X+\Delta\tau\,Y_{ij}\,e_i e_j^T) - g_{mn}(X)\bigr)}{\Delta\tau\,\Delta t}
\]

which can be proved by substitution of variables in (1409). The mn^{th} second-order total differential due to any Y\in\mathbb{R}^{K\times L} is

\[
d^2 g_{mn}(X)\big|_{dX\to Y} = \sum_{i,j}\sum_{k,l} \frac{\partial^2 g_{mn}(X)}{\partial X_{kl}\,\partial X_{ij}}\, Y_{kl} Y_{ij} = \operatorname{tr}\Bigl(\nabla_X \operatorname{tr}\bigl(\nabla g_{mn}(X)^T Y\bigr)^T\, Y\Bigr) \tag{1411}
\]
\[
= \sum_{i,j}\, \lim_{\Delta t\to 0} \frac{1}{\Delta t}\left(\frac{\partial g_{mn}(X+\Delta t\,Y)}{\partial X_{ij}} - \frac{\partial g_{mn}(X)}{\partial X_{ij}}\right) Y_{ij} \tag{1412}
\]
\[
= \lim_{\Delta t\to 0} \frac{g_{mn}(X+2\Delta t\,Y) - 2g_{mn}(X+\Delta t\,Y) + g_{mn}(X)}{\Delta t^2} \tag{1413}
\]
\[
= \frac{d^2}{dt^2}\bigg|_{t=0}\, g_{mn}(X+t\,Y) \tag{1414}
\]

Hence the second directional derivative,

\[
\overset{\to Y}{dg^2}(X) \triangleq \begin{bmatrix}
d^2 g_{11}(X) & \cdots & d^2 g_{1N}(X)\\
\vdots & & \vdots\\
d^2 g_{M1}(X) & \cdots & d^2 g_{MN}(X)
\end{bmatrix}\Bigg|_{dX\to Y} \in \mathbb{R}^{M\times N}
= \begin{bmatrix}
\operatorname{tr}\bigl(\nabla\operatorname{tr}(\nabla g_{11}(X)^T Y)^T Y\bigr) & \cdots & \operatorname{tr}\bigl(\nabla\operatorname{tr}(\nabla g_{1N}(X)^T Y)^T Y\bigr)\\
\vdots & & \vdots\\
\operatorname{tr}\bigl(\nabla\operatorname{tr}(\nabla g_{M1}(X)^T Y)^T Y\bigr) & \cdots & \operatorname{tr}\bigl(\nabla\operatorname{tr}(\nabla g_{MN}(X)^T Y)^T Y\bigr)
\end{bmatrix}
= \begin{bmatrix}
\sum_{i,j}\sum_{k,l} \frac{\partial^2 g_{11}(X)}{\partial X_{kl}\partial X_{ij}} Y_{kl}Y_{ij} & \cdots & \sum_{i,j}\sum_{k,l} \frac{\partial^2 g_{1N}(X)}{\partial X_{kl}\partial X_{ij}} Y_{kl}Y_{ij}\\
\vdots & & \vdots\\
\sum_{i,j}\sum_{k,l} \frac{\partial^2 g_{M1}(X)}{\partial X_{kl}\partial X_{ij}} Y_{kl}Y_{ij} & \cdots & \sum_{i,j}\sum_{k,l} \frac{\partial^2 g_{MN}(X)}{\partial X_{kl}\partial X_{ij}} Y_{kl}Y_{ij}
\end{bmatrix} \tag{1415}
\]

from which it follows

\[
\overset{\to Y}{dg^2}(X) = \sum_{i,j}\sum_{k,l} \frac{\partial^2 g(X)}{\partial X_{kl}\,\partial X_{ij}}\, Y_{kl} Y_{ij} = \sum_{i,j} \frac{\partial}{\partial X_{ij}}\, \overset{\to Y}{dg}(X)\, Y_{ij} \tag{1416}
\]

Yet for all X \in \operatorname{dom} g, any Y\in\mathbb{R}^{K\times L}, and some open interval of t\in\mathbb{R}

\[
g(X+t\,Y) = g(X) + t\,\overset{\to Y}{dg}(X) + \frac{1}{2!}\,t^2\,\overset{\to Y}{dg^2}(X) + o(t^3) \tag{1417}
\]
which is the second-order Taylor series expansion about X. [137, §18.4] [85, §2.3.4] Differentiating twice with respect to t and subsequent t-zeroing isolates the third term of the expansion. Thus twice differentiating and zeroing g(X+tY) in t is an operation equivalent to individually twice differentiating and zeroing every entry g_{mn}(X+tY) as in (1414). So the second directional derivative becomes

\[
\overset{\to Y}{dg^2}(X) = \frac{d^2}{dt^2}\bigg|_{t=0}\, g(X+t\,Y) \;\in\; \mathbb{R}^{M\times N} \tag{1418}
\]

[177, §2.1, §5.4.5] [25, §6.3.1] which is again simplest. (confer (1397))
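A numerical illustration (not from the book) of (1418) and its vector-argument form (1425): for a function with vector argument, Y'∇²g(X)Y should match the one-dimensional second derivative of g along x+ty. The cubic test function is arbitrary; NumPy assumed.

```python
# Sketch: second directional derivative two ways, (1425) vs (1418).
import numpy as np

rng = np.random.default_rng(4)
x = rng.standard_normal(5)
y = rng.standard_normal(5)

g = lambda x: np.sum(x**3)             # arbitrary smooth test function
H = np.diag(6*x)                       # its Hessian at x

d2 = y @ H @ y                         # Y'(Hessian)Y
h = 1e-4
d2_t = (g(x + h*y) - 2*g(x) + g(x - h*y)) / h**2   # d^2/dt^2 at t = 0
assert np.isclose(d2, d2_t, atol=1e-5)
```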
D.1.6 Taylor series

Series expansions of the differentiable matrix-valued function g(X), of matrix argument, were given earlier in (1396) and (1417). Assuming g(X) has continuous first-, second-, and third-order gradients over the open set \operatorname{dom} g, then for X\in\operatorname{dom} g and any Y\in\mathbb{R}^{K\times L} the complete Taylor series on some open interval of \mu\in\mathbb{R} is expressed

\[
g(X+\mu Y) = g(X) + \mu\,\overset{\to Y}{dg}(X) + \frac{1}{2!}\,\mu^2\,\overset{\to Y}{dg^2}(X) + \frac{1}{3!}\,\mu^3\,\overset{\to Y}{dg^3}(X) + o(\mu^4) \tag{1419}
\]

or on some open interval of \|Y\|

\[
g(Y) = g(X) + \overset{\to Y-X}{dg}(X) + \frac{1}{2!}\,\overset{\to Y-X}{dg^2}(X) + \frac{1}{3!}\,\overset{\to Y-X}{dg^3}(X) + o(\|Y\|^4) \tag{1420}
\]

which are third-order expansions about X. The mean value theorem from calculus is what insures the finite order of the series. [28, §1.1] [27, App. A.5] [125, §0.4] [137]

In the case of a real function g(X) : \mathbb{R}^{K\times L}\to\mathbb{R}, all the directional derivatives are in \mathbb{R} :

\[
\overset{\to Y}{dg}(X) = \operatorname{tr}\bigl(\nabla g(X)^T Y\bigr) \tag{1421}
\]
\[
\overset{\to Y}{dg^2}(X) = \operatorname{tr}\Bigl(\nabla_X \operatorname{tr}\bigl(\nabla g(X)^T Y\bigr)^T Y\Bigr) = \operatorname{tr}\Bigl(\nabla_X\, \overset{\to Y}{dg}(X)^T\, Y\Bigr) \tag{1422}
\]
\[
\overset{\to Y}{dg^3}(X) = \operatorname{tr}\Bigl(\nabla_X \operatorname{tr}\bigl(\nabla_X \operatorname{tr}(\nabla g(X)^T Y)^T Y\bigr)^T Y\Bigr) = \operatorname{tr}\Bigl(\nabla_X\, \overset{\to Y}{dg^2}(X)^T\, Y\Bigr) \tag{1423}
\]

In the case g(X) : \mathbb{R}^K\to\mathbb{R} has vector argument, they further simplify:

\[
\overset{\to Y}{dg}(X) = \nabla g(X)^T Y \tag{1424}
\]
\[
\overset{\to Y}{dg^2}(X) = Y^T\,\nabla^2 g(X)\,Y \tag{1425}
\]
\[
\overset{\to Y}{dg^3}(X) = \nabla_X\bigl(Y^T\,\nabla^2 g(X)\,Y\bigr)^T\, Y \tag{1426}
\]

and so on.
D.1.6.0.1 Exercise. log det. (confer [39, p.644])
Find the first two terms of the Taylor series expansion (1420) for log det X. H
D.1.7 Correspondence of gradient to derivative
From the foregoing expressions for directional derivative, we derive a relationship between the gradient with respect to matrix X and the derivative with respect to real variable t :
D.1.7.1 first-order
Removing from (1397) the evaluation at t = 0,^{D.3} we find an expression for the directional derivative of g(X) in direction Y evaluated anywhere along a line X + tY (parametrized by t) intersecting \operatorname{dom} g

\[
\overset{\to Y}{dg}(X+t\,Y) = \frac{d}{dt}\, g(X+t\,Y) \tag{1427}
\]

In the general case g(X) : \mathbb{R}^{K\times L}\to\mathbb{R}^{M\times N}, from (1390) and (1393) we find

\[
\operatorname{tr}\bigl(\nabla_X\, g_{mn}(X+t\,Y)^T\, Y\bigr) = \frac{d}{dt}\, g_{mn}(X+t\,Y) \tag{1428}
\]

which is valid at t = 0, of course, when X \in \operatorname{dom} g. In the important case of a real function g(X) : \mathbb{R}^{K\times L}\to\mathbb{R}, from (1421) we have simply

\[
\operatorname{tr}\bigl(\nabla_X\, g(X+t\,Y)^T\, Y\bigr) = \frac{d}{dt}\, g(X+t\,Y) \tag{1429}
\]

When, additionally, g(X) : \mathbb{R}^K\to\mathbb{R} has vector argument,

\[
\nabla_X\, g(X+t\,Y)^T\, Y = \frac{d}{dt}\, g(X+t\,Y) \tag{1430}
\]

^{D.3}Justified by replacing X with X+t\,Y in (1390)-(1392); beginning,

\[
dg_{mn}(X+t\,Y)\big|_{dX\to Y} = \sum_{k,l} \frac{\partial g_{mn}(X+t\,Y)}{\partial X_{kl}}\, Y_{kl}
\]
D.1.7.1.1 Example. Gradient.
g(X) = w^T X^T X w, X\in\mathbb{R}^{K\times L}, w\in\mathbb{R}^L. Using the tables in §D.2,

\[
\operatorname{tr}\bigl(\nabla_X\, g(X+t\,Y)^T\, Y\bigr) = \operatorname{tr}\bigl(2\,w w^T (X+t\,Y)^T\, Y\bigr) \tag{1431}
\]
\[
= 2\,w^T (X^T Y + t\,Y^T Y)\, w \tag{1432}
\]

Applying the equivalence (1429),

\[
\frac{d}{dt}\, g(X+t\,Y) = \frac{d}{dt}\, w^T (X+t\,Y)^T (X+t\,Y)\, w \tag{1433}
\]
\[
= w^T\bigl(X^T Y + Y^T X + 2t\,Y^T Y\bigr)\, w \tag{1434}
\]
\[
= 2\,w^T (X^T Y + t\,Y^T Y)\, w \tag{1435}
\]

which is the same as (1432); hence, equivalence is demonstrated. It is easy to extract \nabla g(X) from (1435) knowing only (1429):

\[
\begin{aligned}
\operatorname{tr}\bigl(\nabla_X\, g(X+t\,Y)^T\, Y\bigr) &= 2\,w^T (X^T Y + t\,Y^T Y)\, w = 2\operatorname{tr}\bigl(w w^T (X^T + t\,Y^T)\, Y\bigr)\\
\operatorname{tr}\bigl(\nabla_X\, g(X)^T\, Y\bigr) &= 2\operatorname{tr}\bigl(w w^T X^T\, Y\bigr)\\
&\Leftrightarrow\quad \nabla_X\, g(X) = 2\,X w w^T
\end{aligned} \tag{1436}
\]
□
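The extracted gradient in (1436) can be spot-checked numerically (a sketch outside the book): compare 2Xww' against entrywise finite differences of g(X) = w'X'Xw. NumPy assumed; dimensions arbitrary.

```python
# Sketch: finite-difference verification of (1436), grad g = 2 X w w'.
import numpy as np

rng = np.random.default_rng(5)
K, L = 4, 3
X = rng.standard_normal((K, L))
w = rng.standard_normal(L)

g = lambda X: w @ X.T @ X @ w
grad = 2 * X @ np.outer(w, w)          # analytic result (1436)

h = 1e-6
grad_fd = np.zeros((K, L))
for k in range(K):
    for l in range(L):
        E = np.zeros((K, L)); E[k, l] = h
        grad_fd[k, l] = (g(X + E) - g(X - E)) / (2*h)

assert np.allclose(grad, grad_fd, atol=1e-6)
```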
D.1.7.2 second-order

Likewise removing the evaluation at t = 0 from (1418),

\[
\overset{\to Y}{dg^2}(X+t\,Y) = \frac{d^2}{dt^2}\, g(X+t\,Y) \tag{1437}
\]

we can find a similar relationship between the second-order gradient and the second derivative: In the general case g(X) : \mathbb{R}^{K\times L}\to\mathbb{R}^{M\times N} from (1411) and (1414),

\[
\operatorname{tr}\Bigl(\nabla_X \operatorname{tr}\bigl(\nabla_X\, g_{mn}(X+t\,Y)^T\, Y\bigr)^T\, Y\Bigr) = \frac{d^2}{dt^2}\, g_{mn}(X+t\,Y) \tag{1438}
\]

In the case of a real function g(X) : \mathbb{R}^{K\times L}\to\mathbb{R} we have, of course,

\[
\operatorname{tr}\Bigl(\nabla_X \operatorname{tr}\bigl(\nabla_X\, g(X+t\,Y)^T\, Y\bigr)^T\, Y\Bigr) = \frac{d^2}{dt^2}\, g(X+t\,Y) \tag{1439}
\]

From (1425), the simpler case, where the real function g(X) : \mathbb{R}^K\to\mathbb{R} has vector argument,

\[
Y^T\,\nabla_X^2\, g(X+t\,Y)\, Y = \frac{d^2}{dt^2}\, g(X+t\,Y) \tag{1440}
\]
D.1.7.2.1 Example. Second-order gradient.
Given real function g(X) = \log\det X having domain \operatorname{int}\,\mathbb{S}^K_+, we want to find \nabla^2 g(X) \in \mathbb{R}^{K\times K\times K\times K}. From the tables in §D.2,

\[
h(X) \triangleq \nabla g(X) = X^{-1} \in \operatorname{int}\,\mathbb{S}^K_+ \tag{1441}
\]

so \nabla^2 g(X) = \nabla h(X). By (1428) and (1396), for Y\in\mathbb{S}^K

\[
\operatorname{tr}\bigl(\nabla h_{mn}(X)^T\, Y\bigr) = \frac{d}{dt}\bigg|_{t=0}\, h_{mn}(X+t\,Y) \tag{1442}
\]
\[
= \left[\frac{d}{dt}\bigg|_{t=0}\, h(X+t\,Y)\right]_{mn} \tag{1443}
\]
\[
= \left[\frac{d}{dt}\bigg|_{t=0}\, (X+t\,Y)^{-1}\right]_{mn} \tag{1444}
\]
\[
= -\bigl(X^{-1}\, Y\, X^{-1}\bigr)_{mn} \tag{1445}
\]

Setting Y to a member of the standard basis E_{kl} = e_k e_l^T, for k,l \in \{1\ldots K\}, and employing a property of the trace function (31) we find

\[
\nabla^2 g(X)_{mnkl} = \operatorname{tr}\bigl(\nabla h_{mn}(X)^T\, E_{kl}\bigr) = \nabla h_{mn}(X)_{kl} = -\bigl(X^{-1} E_{kl}\, X^{-1}\bigr)_{mn} \tag{1446}
\]
\[
\nabla^2 g(X)_{kl} = \nabla h(X)_{kl} = -\,X^{-1} E_{kl}\, X^{-1} \in \mathbb{R}^{K\times K} \tag{1447}
\]
□
From all these first- and second-order expressions, we may generate new ones by evaluating both sides at arbitrary t (in some open interval) but only after the differentiation.

D.2 Tables of gradients and derivatives
[96] [43]

When proving results for symmetric matrices algebraically, it is critical to take gradients ignoring symmetry and to then substitute symmetric entries afterward.

a,b \in \mathbb{R}^n ; x,y \in \mathbb{R}^k ; A,B \in \mathbb{R}^{m\times n} ; X,Y \in \mathbb{R}^{K\times L} ; t,\mu \in \mathbb{R} ; i,j,k,\ell,K,L,m,n,M,N are integers, unless otherwise noted.

x^{\mu} means \delta(\delta(x)^{\mu}) for \mu\in\mathbb{R} ; id est, entrywise vector exponentiation. \delta is the main-diagonal linear operator (1036). x^0 \triangleq \mathbf{1}, X^0 \triangleq I if square.
\[
\frac{d}{dx} \triangleq \begin{bmatrix} \frac{d}{dx_1}\\ \vdots\\ \frac{d}{dx_k} \end{bmatrix}
\]

\overset{\to y}{dg}(x), \overset{\to y}{dg^2}(x) (directional derivatives §D.1), \log x, \operatorname{sgn} x, \sin x, x/y (Hadamard quotient), \sqrt{x} (entrywise square root), etcetera, are maps f : \mathbb{R}^k\to\mathbb{R}^k that maintain dimension; e.g., (§A.1.1)

\[
\frac{d}{dx}\, x^{-1} \triangleq \nabla_x\bigl(\mathbf{1}^T \delta(x)^{-1} \mathbf{1}\bigr) \tag{1448}
\]

The standard basis: \{\,E_{kl} = e_k e_\ell^T \in \mathbb{R}^{K\times K} \mid k,\ell \in \{1\ldots K\}\,\}

For A a scalar or matrix, we have the Taylor series [45, §3.6]

\[
e^A = \sum_{k=0}^{\infty} \frac{1}{k!}\, A^k \tag{1449}
\]

Further, [215, §5.4]

\[
e^A \succ 0 \qquad \forall\, A \in \mathbb{S}^m \tag{1450}
\]

For all square A and integer k

\[
\det^k A = \det A^k \tag{1451}
\]

Table entries with notation X\in\mathbb{R}^{2\times 2} have been algebraically verified in that dimension but may hold more broadly.
D.2.1 Algebraic

\[
\begin{aligned}
&\nabla_x\, x = \nabla_x\, x^T = I \in \mathbb{R}^{k\times k}
&&\nabla_X\, X = \nabla_X\, X^T \triangleq I \in \mathbb{R}^{K\times L\times K\times L} \quad\text{(identity)}\\[4pt]
&\nabla_x\,(Ax - b) = A^T\\
&\nabla_x\,\bigl(x^TA - b^T\bigr) = A\\
&\nabla_x\,(Ax-b)^T(Ax-b) = 2A^T(Ax-b)\\
&\nabla_x^2\,(Ax-b)^T(Ax-b) = 2A^TA\\
&\nabla_x\,\bigl(x^TAx + 2x^TBy + y^TCy\bigr) = (A+A^T)\,x + 2By\\
&\nabla_x^2\,\bigl(x^TAx + 2x^TBy + y^TCy\bigr) = A+A^T\\[6pt]
&\nabla_X\, a^TXb = \nabla_X\, b^TX^Ta = ab^T\\
&\nabla_X\, a^TX^2 b = X^T ab^T + ab^T X^T\\
&\nabla_X\, a^TX^{-1} b = -X^{-T} ab^T X^{-T}\\
&\nabla_X\,(X^{-1})_{kl} = \frac{\partial X^{-1}}{\partial X_{kl}} = -X^{-1} E_{kl}\, X^{-1},\ \text{confer (1388)(1447)}\\[6pt]
&\nabla_x\, a^Tx^Txb = 2xa^Tb &&\nabla_X\, a^TX^TXb = X(ab^T + ba^T)\\
&\nabla_x\, a^Txx^Tb = (ab^T + ba^T)\,x &&\nabla_X\, a^TXX^Tb = (ab^T + ba^T)\,X\\
&\nabla_x\, a^Tx^Txa = 2xa^Ta &&\nabla_X\, a^TX^TXa = 2Xaa^T\\
&\nabla_x\, a^Txx^Ta = 2aa^Tx &&\nabla_X\, a^TXX^Ta = 2aa^TX\\
&\nabla_x\, a^Tyx^Tb = ba^Ty &&\nabla_X\, a^TYX^Tb = ba^TY\\
&\nabla_x\, a^Ty^Txb = yb^Ta &&\nabla_X\, a^TY^TXb = Yab^T\\
&\nabla_x\, a^Txy^Tb = ab^Ty &&\nabla_X\, a^TXY^Tb = ab^TY\\
&\nabla_x\, a^Tx^Tyb = ya^Tb &&\nabla_X\, a^TX^TYb = Yba^T
\end{aligned}
\]
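One of the less obvious table entries, ∇_X a'X⁻¹b = −X⁻ᵀab'X⁻ᵀ, can be spot-checked numerically (a sketch, outside the book): difference a'X⁻¹b in each entry of X and compare. NumPy assumed; the well-conditioned test point is arbitrary.

```python
# Sketch: finite-difference check of the table entry
#   grad_X a' X^{-1} b = -X^{-T} a b' X^{-T}.
import numpy as np

rng = np.random.default_rng(7)
K = 3
X = rng.standard_normal((K, K)) + K*np.eye(K)   # diagonally-shifted, invertible
a, b = rng.standard_normal(K), rng.standard_normal(K)

f = lambda X: a @ np.linalg.inv(X) @ b
XiT = np.linalg.inv(X).T
grad = -XiT @ np.outer(a, b) @ XiT              # analytic table entry

h = 1e-6
grad_fd = np.zeros((K, K))
for k in range(K):
    for l in range(K):
        E = np.zeros((K, K)); E[k, l] = h
        grad_fd[k, l] = (f(X + E) - f(X - E)) / (2*h)

assert np.allclose(grad, grad_fd, atol=1e-6)
```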
Algebraic continued

\[
\begin{aligned}
&\frac{d}{dt}\,(X+t\,Y) = Y\\[4pt]
&\frac{d}{dt}\, B^T(X+t\,Y)^{-1}A = -B^T(X+t\,Y)^{-1}\, Y\,(X+t\,Y)^{-1}A\\
&\frac{d}{dt}\, B^T(X+t\,Y)^{-T}A = -B^T(X+t\,Y)^{-T}\, Y^T(X+t\,Y)^{-T}A\\
&\frac{d}{dt}\, B^T(X+t\,Y)^{\mu}A = \ldots,\quad -1\le\mu\le 1,\ X,Y\in\mathbb{S}^M_+\\
&\frac{d^2}{dt^2}\, B^T(X+t\,Y)^{-1}A = 2B^T(X+t\,Y)^{-1}\, Y\,(X+t\,Y)^{-1}\, Y\,(X+t\,Y)^{-1}A\\[4pt]
&\frac{d}{dt}\,\bigl((X+t\,Y)^TA\,(X+t\,Y)\bigr) = Y^TAX + X^TAY + 2t\,Y^TAY\\
&\frac{d^2}{dt^2}\,\bigl((X+t\,Y)^TA\,(X+t\,Y)\bigr) = 2\,Y^TAY\\
&\frac{d}{dt}\,\bigl((X+t\,Y)A\,(X+t\,Y)\bigr) = YAX + XAY + 2t\,YAY\\
&\frac{d^2}{dt^2}\,\bigl((X+t\,Y)A\,(X+t\,Y)\bigr) = 2\,YAY
\end{aligned}
\]
D.2.2 Trace Kronecker

\[
\begin{aligned}
&\nabla_{\operatorname{vec}X}\, \operatorname{tr}(AXBX^T) = \nabla_{\operatorname{vec}X}\, \operatorname{vec}(X)^T(B^T\!\otimes A)\operatorname{vec}X = (B\otimes A^T + B^T\!\otimes A)\operatorname{vec}X\\
&\nabla^2_{\operatorname{vec}X}\, \operatorname{tr}(AXBX^T) = \nabla^2_{\operatorname{vec}X}\, \operatorname{vec}(X)^T(B^T\!\otimes A)\operatorname{vec}X = B\otimes A^T + B^T\!\otimes A
\end{aligned}
\]
D.2.3 Trace

\[
\begin{aligned}
&\nabla_x\, \mu x = \mu I &&\nabla_X \operatorname{tr}\mu X = \nabla_X\, \mu\operatorname{tr}X = \mu I\\[4pt]
&\nabla_x\, \mathbf{1}^T\delta(x)^{-1}\mathbf{1} = \frac{d}{dx}\,x^{-1} = -x^{-2} &&\nabla_X \operatorname{tr}X^{-1} = -X^{-2T}\\
&\nabla_x\, \mathbf{1}^T\delta(x)^{-1}y = -\delta(x)^{-2}y &&\nabla_X \operatorname{tr}(X^{-1}Y) = \nabla_X \operatorname{tr}(YX^{-1}) = -X^{-T}Y^TX^{-T}\\
&\frac{d}{dx}\,x^{\mu} = \mu x^{\mu-1} &&\nabla_X \operatorname{tr}X^{\mu} = \mu X^{(\mu-1)T},\ X\in\mathbb{R}^{2\times 2}\\
& &&\nabla_X \operatorname{tr}X^{j} = jX^{(j-1)T}\\[4pt]
&\nabla_x\,(b - a^Tx)^{-1} = (b - a^Tx)^{-2}\,a &&\nabla_X \operatorname{tr}\bigl((B - AX)^{-1}\bigr) = \bigl((B-AX)^{-2}A\bigr)^T\\
&\nabla_x\,(b - a^Tx)^{\mu} = -\mu(b - a^Tx)^{\mu-1}\,a\\
&\nabla_x\, x^Ty = \nabla_x\, y^Tx = y &&\nabla_X \operatorname{tr}(X^TY) = \nabla_X \operatorname{tr}(YX^T) = \nabla_X \operatorname{tr}(Y^TX) = \nabla_X \operatorname{tr}(XY^T) = Y\\[6pt]
&&&\nabla_X \operatorname{tr}(AXBX^T) = \nabla_X \operatorname{tr}(XBX^TA) = A^TXB^T + AXB\\
&&&\nabla_X \operatorname{tr}(AXBX) = \nabla_X \operatorname{tr}(XBXA) = A^TX^TB^T + B^TX^TA^T\\
&&&\nabla_X \operatorname{tr}(AXAXAX) = \nabla_X \operatorname{tr}(XAXAXA) = 3(AXAXA)^T\\
&&&\nabla_X \operatorname{tr}(YX^k) = \nabla_X \operatorname{tr}(X^kY) = \sum_{i=0}^{k-1}\bigl(X^i\,Y\,X^{k-1-i}\bigr)^T\\
&&&\nabla_X \operatorname{tr}(Y^TXX^TY) = \nabla_X \operatorname{tr}(X^TYY^TX) = 2\,YY^TX\\
&&&\nabla_X \operatorname{tr}(Y^TX^TXY) = \nabla_X \operatorname{tr}(XYY^TX^T) = 2\,XYY^T\\
&&&\nabla_X \operatorname{tr}\bigl((X+Y)^T(X+Y)\bigr) = 2(X+Y)\\
&&&\nabla_X \operatorname{tr}\bigl((X+Y)(X+Y)\bigr) = 2(X+Y)^T\\
&&&\nabla_X \operatorname{tr}(A^TXB) = \nabla_X \operatorname{tr}(X^TAB^T) = AB^T\\
&&&\nabla_X \operatorname{tr}(A^TX^{-1}B) = \nabla_X \operatorname{tr}(X^{-T}AB^T) = -X^{-T}AB^TX^{-T}\\[6pt]
&&&\nabla_X\, a^TXb = \nabla_X \operatorname{tr}(ba^TX) = \nabla_X \operatorname{tr}(Xba^T) = ab^T\\
&&&\nabla_X\, b^TX^Ta = \nabla_X \operatorname{tr}(X^Tab^T) = \nabla_X \operatorname{tr}(ab^TX^T) = ab^T\\
&&&\nabla_X\, a^TX^{-1}b = \nabla_X \operatorname{tr}(X^{-T}ab^T) = -X^{-T}ab^TX^{-T}\\
&&&\nabla_X\, a^TX^{\mu}b = \ldots
\end{aligned}
\]
Trace continued

\[
\begin{aligned}
&\frac{d}{dt}\operatorname{tr}\, g(X+t\,Y) = \operatorname{tr}\frac{d}{dt}\, g(X+t\,Y)\\[4pt]
&\frac{d}{dt}\operatorname{tr}(X+t\,Y) = \operatorname{tr}Y\\[4pt]
&\frac{d}{dt}\operatorname{tr}^{\,j}(X+t\,Y) = j\operatorname{tr}^{\,j-1}(X+t\,Y)\,\operatorname{tr}Y\\
&\frac{d}{dt}\operatorname{tr}\bigl((X+t\,Y)^j\bigr) = j\operatorname{tr}\bigl((X+t\,Y)^{j-1}\,Y\bigr) \quad (\forall\, j)\\[4pt]
&\frac{d}{dt}\operatorname{tr}\bigl((X+t\,Y)\,Y\bigr) = \operatorname{tr}Y^2\\
&\frac{d}{dt}\operatorname{tr}\bigl((X+t\,Y)^k\,Y\bigr) = \frac{d}{dt}\operatorname{tr}\bigl(Y(X+t\,Y)^k\bigr) = k\operatorname{tr}\bigl((X+t\,Y)^{k-1}\,Y^2\bigr),\quad k\in\{0,1,2\}
\end{aligned}
\]
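A closing numerical spot-check (not from the book) of one derivative table entry, d/dt tr((X+tY)^j) = j tr((X+tY)^{j-1} Y), evaluated away from t = 0. NumPy assumed; matrices and t are arbitrary.

```python
# Sketch: verify d/dt tr((X+tY)^j) = j tr((X+tY)^{j-1} Y) at t = 0.7, j = 3.
import numpy as np

rng = np.random.default_rng(8)
n, j, t0 = 4, 3, 0.7
X = rng.standard_normal((n, n))
Y = rng.standard_normal((n, n))

mpow = np.linalg.matrix_power
Z = X + t0*Y
analytic = j * np.trace(mpow(Z, j-1) @ Y)       # table entry

h = 1e-6
tr_j = lambda t: np.trace(mpow(X + t*Y, j))
numeric = (tr_j(t0 + h) - tr_j(t0 - h)) / (2*h)  # central difference in t
assert np.isclose(analytic, numeric, atol=1e-5)
```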