Appendix D Matrix Calculus
From too much study, and from extreme passion, cometh madnesse.
− Isaac Newton [86, §5]

© 2001 Jon Dattorro. Citation: Jon Dattorro, Convex Optimization & Euclidean Distance Geometry, Meboo Publishing USA, 2005.

D.1 Directional derivative, Taylor series

D.1.1 Gradients

Gradient of a differentiable real function f(x) : R^K → R with respect to its vector domain is defined

\[
\nabla f(x) \triangleq
\begin{bmatrix}
\frac{\partial f(x)}{\partial x_1}\\[2pt]
\frac{\partial f(x)}{\partial x_2}\\[2pt]
\vdots\\[2pt]
\frac{\partial f(x)}{\partial x_K}
\end{bmatrix}
\in \mathbb{R}^K
\tag{1354}
\]

while the second-order gradient of the twice differentiable real function with respect to its vector domain is traditionally called the Hessian;

\[
\nabla^2 f(x) \triangleq
\begin{bmatrix}
\frac{\partial^2 f(x)}{\partial x_1^2} & \frac{\partial^2 f(x)}{\partial x_1\partial x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_1\partial x_K}\\[2pt]
\frac{\partial^2 f(x)}{\partial x_2\partial x_1} & \frac{\partial^2 f(x)}{\partial x_2^2} & \cdots & \frac{\partial^2 f(x)}{\partial x_2\partial x_K}\\[2pt]
\vdots & \vdots & \ddots & \vdots\\[2pt]
\frac{\partial^2 f(x)}{\partial x_K\partial x_1} & \frac{\partial^2 f(x)}{\partial x_K\partial x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_K^2}
\end{bmatrix}
\in \mathbb{S}^K
\tag{1355}
\]

The gradient of vector-valued function v(x) : R → R^N on real domain is a row vector

\[
\nabla v(x) \triangleq
\begin{bmatrix}
\frac{\partial v_1(x)}{\partial x} & \frac{\partial v_2(x)}{\partial x} & \cdots & \frac{\partial v_N(x)}{\partial x}
\end{bmatrix}
\in \mathbb{R}^N
\tag{1356}
\]

while the second-order gradient is

\[
\nabla^2 v(x) \triangleq
\begin{bmatrix}
\frac{\partial^2 v_1(x)}{\partial x^2} & \frac{\partial^2 v_2(x)}{\partial x^2} & \cdots & \frac{\partial^2 v_N(x)}{\partial x^2}
\end{bmatrix}
\in \mathbb{R}^N
\tag{1357}
\]

Gradient of vector-valued function h(x) : R^K → R^N on vector domain is

\[
\nabla h(x) \triangleq
\begin{bmatrix}
\frac{\partial h_1(x)}{\partial x_1} & \frac{\partial h_2(x)}{\partial x_1} & \cdots & \frac{\partial h_N(x)}{\partial x_1}\\[2pt]
\frac{\partial h_1(x)}{\partial x_2} & \frac{\partial h_2(x)}{\partial x_2} & \cdots & \frac{\partial h_N(x)}{\partial x_2}\\[2pt]
\vdots & \vdots & & \vdots\\[2pt]
\frac{\partial h_1(x)}{\partial x_K} & \frac{\partial h_2(x)}{\partial x_K} & \cdots & \frac{\partial h_N(x)}{\partial x_K}
\end{bmatrix}
= \begin{bmatrix} \nabla h_1(x) & \nabla h_2(x) & \cdots & \nabla h_N(x) \end{bmatrix}
\in \mathbb{R}^{K\times N}
\tag{1358}
\]

while the second-order gradient has a three-dimensional representation dubbed cubix;^{D.1}

\[
\nabla^2 h(x) \triangleq
\begin{bmatrix}
\nabla\frac{\partial h_1(x)}{\partial x_1} & \nabla\frac{\partial h_2(x)}{\partial x_1} & \cdots & \nabla\frac{\partial h_N(x)}{\partial x_1}\\[2pt]
\nabla\frac{\partial h_1(x)}{\partial x_2} & \nabla\frac{\partial h_2(x)}{\partial x_2} & \cdots & \nabla\frac{\partial h_N(x)}{\partial x_2}\\[2pt]
\vdots & \vdots & & \vdots\\[2pt]
\nabla\frac{\partial h_1(x)}{\partial x_K} & \nabla\frac{\partial h_2(x)}{\partial x_K} & \cdots & \nabla\frac{\partial h_N(x)}{\partial x_K}
\end{bmatrix}
= \begin{bmatrix} \nabla^2 h_1(x) & \nabla^2 h_2(x) & \cdots & \nabla^2 h_N(x) \end{bmatrix}
\in \mathbb{R}^{K\times N\times K}
\tag{1359}
\]

where the gradient of each real entry is with respect to vector x as in (1354).

^{D.1} The word matrix comes from the Latin for womb; related to the prefix matri- derived from mater meaning mother.

The gradient of real function g(X) : R^{K×L} → R on matrix domain is

\[
\nabla g(X) \triangleq
\begin{bmatrix}
\frac{\partial g(X)}{\partial X_{11}} & \frac{\partial g(X)}{\partial X_{12}} & \cdots & \frac{\partial g(X)}{\partial X_{1L}}\\[2pt]
\frac{\partial g(X)}{\partial X_{21}} & \frac{\partial g(X)}{\partial X_{22}} & \cdots & \frac{\partial g(X)}{\partial X_{2L}}\\[2pt]
\vdots & \vdots & & \vdots\\[2pt]
\frac{\partial g(X)}{\partial X_{K1}} & \frac{\partial g(X)}{\partial X_{K2}} & \cdots & \frac{\partial g(X)}{\partial X_{KL}}
\end{bmatrix}
\in \mathbb{R}^{K\times L}
= \begin{bmatrix}
\nabla_{X(:,1)}\, g(X) & \nabla_{X(:,2)}\, g(X) & \cdots & \nabla_{X(:,L)}\, g(X)
\end{bmatrix}
\in \mathbb{R}^{K\times 1\times L}
\tag{1360}
\]

where the gradient ∇_{X(:,i)} is with respect to the i-th column of X. The strange appearance of (1360) in R^{K×1×L} is meant to suggest a third dimension perpendicular to the page (not a diagonal matrix). The second-order gradient has representation

\[
\nabla^2 g(X) \triangleq
\begin{bmatrix}
\nabla\frac{\partial g(X)}{\partial X_{11}} & \nabla\frac{\partial g(X)}{\partial X_{12}} & \cdots & \nabla\frac{\partial g(X)}{\partial X_{1L}}\\[2pt]
\nabla\frac{\partial g(X)}{\partial X_{21}} & \nabla\frac{\partial g(X)}{\partial X_{22}} & \cdots & \nabla\frac{\partial g(X)}{\partial X_{2L}}\\[2pt]
\vdots & \vdots & & \vdots\\[2pt]
\nabla\frac{\partial g(X)}{\partial X_{K1}} & \nabla\frac{\partial g(X)}{\partial X_{K2}} & \cdots & \nabla\frac{\partial g(X)}{\partial X_{KL}}
\end{bmatrix}
\in \mathbb{R}^{K\times L\times K\times L}
= \begin{bmatrix}
\nabla\nabla_{X(:,1)}\, g(X) & \nabla\nabla_{X(:,2)}\, g(X) & \cdots & \nabla\nabla_{X(:,L)}\, g(X)
\end{bmatrix}
\in \mathbb{R}^{K\times 1\times L\times K\times L}
\tag{1361}
\]

where the gradient ∇ is with respect to matrix X.
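As a concrete numerical check of the layout convention (1360), the short sketch below compares a finite-difference gradient of the real function g(X) = aᵀXb on matrix domain with its closed form abᵀ. The test function, the vectors a and b, and the NumPy dependency are illustrative choices for this appendix, not taken from the text.

```python
import numpy as np

# Finite-difference check of the gradient layout (1360) for the real
# function g(X) = a' X b on matrix domain R^{K x L}; its gradient is
# the rank-one matrix a b'. (Illustrative example, not from the text.)
rng = np.random.default_rng(0)
K, L = 3, 4
a = rng.standard_normal(K)
b = rng.standard_normal(L)
X = rng.standard_normal((K, L))

def g(X):
    return a @ X @ b                     # g : R^{K x L} -> R

eps = 1e-6
G = np.zeros((K, L))
for k in range(K):
    for l in range(L):
        E = np.zeros((K, L))
        E[k, l] = eps
        G[k, l] = (g(X + E) - g(X - E)) / (2 * eps)   # dg/dX_{kl}

assert np.allclose(G, np.outer(a, b), atol=1e-6)      # matches a b'
```

Entry (k, l) of the numerical gradient equals ∂g/∂X_{kl} = a_k b_l, which is exactly the arrangement prescribed by (1360).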
Gradient of vector-valued function g(X) : R^{K×L} → R^N on matrix domain is a cubix

\[
\nabla g(X) \triangleq
\begin{bmatrix}
\nabla_{X(:,1)}\, g_1(X) & \nabla_{X(:,1)}\, g_2(X) & \cdots & \nabla_{X(:,1)}\, g_N(X)\\[2pt]
\nabla_{X(:,2)}\, g_1(X) & \nabla_{X(:,2)}\, g_2(X) & \cdots & \nabla_{X(:,2)}\, g_N(X)\\[2pt]
\vdots & \vdots & & \vdots\\[2pt]
\nabla_{X(:,L)}\, g_1(X) & \nabla_{X(:,L)}\, g_2(X) & \cdots & \nabla_{X(:,L)}\, g_N(X)
\end{bmatrix}
= \begin{bmatrix} \nabla g_1(X) & \nabla g_2(X) & \cdots & \nabla g_N(X) \end{bmatrix}
\in \mathbb{R}^{K\times N\times L}
\tag{1362}
\]

while the second-order gradient has a five-dimensional representation;

\[
\nabla^2 g(X) \triangleq
\begin{bmatrix}
\nabla\nabla_{X(:,1)}\, g_1(X) & \nabla\nabla_{X(:,1)}\, g_2(X) & \cdots & \nabla\nabla_{X(:,1)}\, g_N(X)\\[2pt]
\nabla\nabla_{X(:,2)}\, g_1(X) & \nabla\nabla_{X(:,2)}\, g_2(X) & \cdots & \nabla\nabla_{X(:,2)}\, g_N(X)\\[2pt]
\vdots & \vdots & & \vdots\\[2pt]
\nabla\nabla_{X(:,L)}\, g_1(X) & \nabla\nabla_{X(:,L)}\, g_2(X) & \cdots & \nabla\nabla_{X(:,L)}\, g_N(X)
\end{bmatrix}
= \begin{bmatrix} \nabla^2 g_1(X) & \nabla^2 g_2(X) & \cdots & \nabla^2 g_N(X) \end{bmatrix}
\in \mathbb{R}^{K\times N\times L\times K\times L}
\tag{1363}
\]

The gradient of matrix-valued function g(X) : R^{K×L} → R^{M×N} on matrix domain has a four-dimensional representation called quartix

\[
\nabla g(X) \triangleq
\begin{bmatrix}
\nabla g_{11}(X) & \nabla g_{12}(X) & \cdots & \nabla g_{1N}(X)\\[2pt]
\nabla g_{21}(X) & \nabla g_{22}(X) & \cdots & \nabla g_{2N}(X)\\[2pt]
\vdots & \vdots & & \vdots\\[2pt]
\nabla g_{M1}(X) & \nabla g_{M2}(X) & \cdots & \nabla g_{MN}(X)
\end{bmatrix}
\in \mathbb{R}^{M\times N\times K\times L}
\tag{1364}
\]

while the second-order gradient has a six-dimensional representation

\[
\nabla^2 g(X) \triangleq
\begin{bmatrix}
\nabla^2 g_{11}(X) & \nabla^2 g_{12}(X) & \cdots & \nabla^2 g_{1N}(X)\\[2pt]
\nabla^2 g_{21}(X) & \nabla^2 g_{22}(X) & \cdots & \nabla^2 g_{2N}(X)\\[2pt]
\vdots & \vdots & & \vdots\\[2pt]
\nabla^2 g_{M1}(X) & \nabla^2 g_{M2}(X) & \cdots & \nabla^2 g_{MN}(X)
\end{bmatrix}
\in \mathbb{R}^{M\times N\times K\times L\times K\times L}
\tag{1365}
\]

and so on.

D.1.2 Product rules for matrix-functions

Given dimensionally compatible matrix-valued functions of matrix variable f(X) and g(X),

\[
\nabla_X\bigl(f(X)^T g(X)\bigr) = \nabla_X(f)\, g + \nabla_X(g)\, f
\tag{1366}
\]

while [35, §8.3] [205]

\[
\nabla_X \operatorname{tr}\bigl(f(X)^T g(X)\bigr)
= \nabla_X\Bigl[\operatorname{tr}\bigl(f(X)^T g(Z)\bigr) + \operatorname{tr}\bigl(g(X)^T f(Z)\bigr)\Bigr]\Big|_{Z\leftarrow X}
\tag{1367}
\]

These expressions implicitly apply as well to scalar-, vector-, or matrix-valued functions of scalar, vector, or matrix arguments.

D.1.2.0.1 Example. Cubix.
Suppose f(X) : R^{2×2} → R^2 = X^T a and g(X) : R^{2×2} → R^2 = Xb. We wish to find

\[
\nabla_X\bigl(f(X)^T g(X)\bigr) = \nabla_X\bigl(a^T X^2 b\bigr)
\tag{1368}
\]

using the product rule. Formula (1366) calls for

\[
\nabla_X\bigl(a^T X^2 b\bigr) = \nabla_X(X^T a)\, Xb + \nabla_X(Xb)\, X^T a
\tag{1369}
\]

Consider the first of the two terms:

\[
\nabla_X(f)\, g = \nabla_X(X^T a)\, Xb
= \begin{bmatrix} \nabla(X^T a)_1 & \nabla(X^T a)_2 \end{bmatrix} Xb
\tag{1370}
\]

The gradient of X^T a forms a cubix in R^{2×2×2}; each entry of the 2×2 arrangement below holds the pair of partial derivatives belonging to one entry X_{ij}:

\[
\nabla_X(X^T a)\, Xb =
\left[\begin{array}{cc}
\begin{bmatrix}\frac{\partial (X^Ta)_1}{\partial X_{11}} & \frac{\partial (X^Ta)_2}{\partial X_{11}}\end{bmatrix} &
\begin{bmatrix}\frac{\partial (X^Ta)_1}{\partial X_{12}} & \frac{\partial (X^Ta)_2}{\partial X_{12}}\end{bmatrix}\\[6pt]
\begin{bmatrix}\frac{\partial (X^Ta)_1}{\partial X_{21}} & \frac{\partial (X^Ta)_2}{\partial X_{21}}\end{bmatrix} &
\begin{bmatrix}\frac{\partial (X^Ta)_1}{\partial X_{22}} & \frac{\partial (X^Ta)_2}{\partial X_{22}}\end{bmatrix}
\end{array}\right]
\begin{bmatrix}(Xb)_1\\ (Xb)_2\end{bmatrix}
\in \mathbb{R}^{2\times 1\times 2}
\tag{1371}
\]

Because gradient of the product (1368) requires total change with respect to change in each entry of matrix X, the Xb vector must make an inner product with each vector in the second dimension of the cubix, the bracketed pairs in (1371);

\[
\nabla_X(X^T a)\, Xb =
\left[\begin{array}{cc}
\begin{bmatrix}a_1 & 0\end{bmatrix} & \begin{bmatrix}0 & a_1\end{bmatrix}\\[4pt]
\begin{bmatrix}a_2 & 0\end{bmatrix} & \begin{bmatrix}0 & a_2\end{bmatrix}
\end{array}\right]
\begin{bmatrix} b_1X_{11}+b_2X_{12}\\ b_1X_{21}+b_2X_{22} \end{bmatrix}
\in \mathbb{R}^{2\times 1\times 2}
=
\begin{bmatrix}
a_1(b_1X_{11}+b_2X_{12}) & a_1(b_1X_{21}+b_2X_{22})\\
a_2(b_1X_{11}+b_2X_{12}) & a_2(b_1X_{21}+b_2X_{22})
\end{bmatrix}
\in \mathbb{R}^{2\times 2}
= ab^T X^T
\tag{1372}
\]

where the cubix appears as a complete 2×2×2 matrix. In like manner for the second term ∇_X(g) f,

\[
\nabla_X(Xb)\, X^T a =
\left[\begin{array}{cc}
\begin{bmatrix}b_1 & 0\end{bmatrix} & \begin{bmatrix}b_2 & 0\end{bmatrix}\\[4pt]
\begin{bmatrix}0 & b_1\end{bmatrix} & \begin{bmatrix}0 & b_2\end{bmatrix}
\end{array}\right]
\begin{bmatrix} X_{11}a_1+X_{21}a_2\\ X_{12}a_1+X_{22}a_2 \end{bmatrix}
\in \mathbb{R}^{2\times 1\times 2}
= X^T a b^T \in \mathbb{R}^{2\times 2}
\tag{1373}
\]

The solution

\[
\nabla_X\bigl(a^T X^2 b\bigr) = ab^T X^T + X^T ab^T
\tag{1374}
\]

can be found from Table D.2.1 or verified using (1367).

D.1.2.1 Kronecker product

A partial remedy for venturing into hyperdimensional representations, such as the cubix or quartix, is to first vectorize matrices as in (29). This device gives rise to the Kronecker product ⊗ of matrices, a.k.a. direct product or tensor product. Although its definition sees reversal in the literature, [211, §2.1] we adopt the following: for A ∈ R^{m×n} and B ∈ R^{p×q}

\[
B \otimes A \triangleq
\begin{bmatrix}
B_{11}A & B_{12}A & \cdots & B_{1q}A\\
B_{21}A & B_{22}A & \cdots & B_{2q}A\\
\vdots & \vdots & & \vdots\\
B_{p1}A & B_{p2}A & \cdots & B_{pq}A
\end{bmatrix}
\in \mathbb{R}^{pm\times qn}
\tag{1375}
\]

One advantage to vectorization is existence of a traditional two-dimensional matrix representation for the second-order gradient of a real function with respect to a vectorized matrix. For example, from §A.1.1 no.22 (§D.2.1) for square A, B ∈ R^{n×n} [96, §5.2] [10, §3]

\[
\nabla^2_{\operatorname{vec} X}\, \operatorname{tr}(AXBX^T)
= \nabla^2_{\operatorname{vec} X}\, \operatorname{vec}(X)^T (B^T\!\otimes A)\operatorname{vec} X
= B\otimes A^T + B^T\!\otimes A
\in \mathbb{R}^{n^2\times n^2}
\tag{1376}
\]

The disadvantage is a large new but known set of algebraic rules, and the fact that mere use of the Kronecker product does not generally guarantee a two-dimensional matrix representation of gradients.
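The identity (1376) lends itself to a direct numerical check. The sketch below (assuming NumPy; the matrices A, B and the dimension n are arbitrary test data) builds the finite-difference Hessian of tr(AXBXᵀ) over the column-stacked vec X and compares it with B ⊗ Aᵀ + Bᵀ ⊗ A, where np.kron(B, A) reproduces the block pattern of definition (1375).

```python
import numpy as np

# Numerical check of (1376): the Hessian of tr(A X B X') with respect
# to the column-stacked vector vec X equals  B (x) A' + B' (x) A,
# where np.kron(B, A) matches the block definition (1375) of B (x) A.
# (A, B, and n are arbitrary test data; NumPy assumed.)
rng = np.random.default_rng(1)
n = 3
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

def f(x):
    X = x.reshape((n, n), order="F")     # undo column-stacking vec
    return np.trace(A @ X @ B @ X.T)

eps = 1e-3
x0 = rng.standard_normal(n * n)
I = np.eye(n * n)
H = np.zeros((n * n, n * n))
for i in range(n * n):
    for j in range(n * n):
        # central second difference for d^2 f / d(vec X)_i d(vec X)_j
        H[i, j] = (f(x0 + eps * (I[i] + I[j])) - f(x0 + eps * (I[i] - I[j]))
                   - f(x0 - eps * (I[i] - I[j])) + f(x0 - eps * (I[i] + I[j]))) / (4 * eps**2)

H_exact = np.kron(B, A.T) + np.kron(B.T, A)             # (1376)
assert np.allclose(H, H_exact, atol=1e-6)
```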
D.1.3 Chain rules for composite matrix-functions

Given dimensionally compatible matrix-valued functions of matrix variable f(X) and g(X) [137, §15.7]

\[
\nabla_X g\bigl(f(X)^T\bigr) = \nabla_X f^T\, \nabla_f g
\tag{1377}
\]

\[
\nabla^2_X g\bigl(f(X)^T\bigr)
= \nabla_X\bigl(\nabla_X f^T\, \nabla_f g\bigr)
= \nabla^2_X f\, \nabla_f g + \nabla_X f^T\, \nabla^2_f g\, \nabla_X f
\tag{1378}
\]

D.1.3.1 Two arguments

\[
\nabla_X g\bigl(f(X)^T, h(X)^T\bigr)
= \nabla_X f^T\, \nabla_f g + \nabla_X h^T\, \nabla_h g
\tag{1379}
\]

D.1.3.1.1 Example. Chain rule for two arguments. [28, §1.1]

\[
g\bigl(f(x)^T, h(x)^T\bigr) = \bigl(f(x)+h(x)\bigr)^T A\, \bigl(f(x)+h(x)\bigr)
\tag{1380}
\]

\[
f(x) = \begin{bmatrix} x_1\\ \varepsilon x_2 \end{bmatrix},
\qquad
h(x) = \begin{bmatrix} \varepsilon x_1\\ x_2 \end{bmatrix}
\tag{1381}
\]

\[
\nabla_x g\bigl(f(x)^T, h(x)^T\bigr)
= \begin{bmatrix} 1 & 0\\ 0 & \varepsilon \end{bmatrix}(A+A^T)(f+h)
+ \begin{bmatrix} \varepsilon & 0\\ 0 & 1 \end{bmatrix}(A+A^T)(f+h)
\tag{1382}
\]

\[
\nabla_x g\bigl(f(x)^T, h(x)^T\bigr)
= \begin{bmatrix} 1+\varepsilon & 0\\ 0 & 1+\varepsilon \end{bmatrix}(A+A^T)
\left( \begin{bmatrix} x_1\\ \varepsilon x_2 \end{bmatrix}
+ \begin{bmatrix} \varepsilon x_1\\ x_2 \end{bmatrix} \right)
\tag{1383}
\]
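The closed form (1383) can likewise be checked numerically. In the sketch below (assuming NumPy; A, ε, and the evaluation point x are arbitrary test values) a finite-difference gradient of the composition (1380)–(1381) is compared with (1+ε)(A+Aᵀ)(f(x)+h(x)), which is (1383) written compactly.

```python
import numpy as np

# Numerical check of the two-argument chain-rule example (1380)-(1383):
# for f(x) = [x1, eps*x2], h(x) = [eps*x1, x2] and
# g = (f + h)' A (f + h), the gradient (1383) equals
# (1 + eps) * (A + A') (f(x) + h(x)).
# (A, eps, and x are arbitrary test values; NumPy assumed.)
rng = np.random.default_rng(2)
A = rng.standard_normal((2, 2))
eps = 0.7
x = rng.standard_normal(2)

f = lambda x: np.array([x[0], eps * x[1]])
h = lambda x: np.array([eps * x[0], x[1]])
g = lambda x: (f(x) + h(x)) @ A @ (f(x) + h(x))

d = 1e-6
numerical = np.array([(g(x + d * e) - g(x - d * e)) / (2 * d) for e in np.eye(2)])
closed_form = (1 + eps) * (A + A.T) @ (f(x) + h(x))     # (1383)
assert np.allclose(numerical, closed_form, atol=1e-6)
```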