
Appendix D

Matrix Calculus

From too much study, and from extreme passion, cometh madnesse.

− Isaac Newton [205, §5]

D.1 Gradient, Directional Derivative, Taylor Series

D.1.1 Gradients

Gradient of a differentiable real function f(x) : R^K → R with respect to its vector argument is defined uniquely in terms of partial derivatives

∇f(x) ≜ [ ∂f(x)/∂x_1 ; ∂f(x)/∂x_2 ; … ; ∂f(x)/∂x_K ] ∈ R^K    (2053)

while the second-order gradient of the twice differentiable real function with respect to its vector argument is traditionally called the Hessian;

∇²f(x) ≜
[ ∂²f(x)/∂x_1²       ∂²f(x)/∂x_1∂x_2   ⋯  ∂²f(x)/∂x_1∂x_K
  ∂²f(x)/∂x_2∂x_1    ∂²f(x)/∂x_2²      ⋯  ∂²f(x)/∂x_2∂x_K
  ⋮                  ⋮                 ⋱  ⋮
  ∂²f(x)/∂x_K∂x_1    ∂²f(x)/∂x_K∂x_2   ⋯  ∂²f(x)/∂x_K² ] ∈ S^K    (2054)

interpreted

∂²f(x)/∂x_1∂x_2 = ∂(∂f(x)/∂x_1)/∂x_2 = ∂(∂f(x)/∂x_2)/∂x_1 = ∂²f(x)/∂x_2∂x_1    (2055)
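A minimal numerical sketch of (2053)-(2055), assuming Python with NumPy (not part of the book's toolset); the quadratic test function, step size, and data are arbitrary choices:

```python
# Finite-difference check of gradient (2053), Hessian (2054), symmetry (2055)
# for f(x) = x'Ax + b'x, whose analytic gradient is (A + A')x + b.
import numpy as np

rng = np.random.default_rng(0)
K = 4
A = rng.standard_normal((K, K))
b = rng.standard_normal(K)
x = rng.standard_normal(K)
h = 1e-5

f = lambda x: x @ A @ x + b @ x
grad = lambda x: (A + A.T) @ x + b

# gradient (2053) entry by entry, by central differences
g = np.array([(f(x + h*e) - f(x - h*e)) / (2*h) for e in np.eye(K)])
assert np.allclose(g, grad(x), atol=1e-6)

# Hessian (2054) by differencing the gradient; symmetry confirms (2055)
H = np.array([(grad(x + h*e) - grad(x - h*e)) / (2*h) for e in np.eye(K)])
assert np.allclose(H, A + A.T, atol=1e-6)
assert np.allclose(H, H.T)
```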


The gradient of vector-valued function v(x) : R → R^N on real domain is a row vector

∇v(x) ≜ [ ∂v_1(x)/∂x  ∂v_2(x)/∂x  ⋯  ∂v_N(x)/∂x ] ∈ R^N    (2056)

while the second-order gradient is

∇²v(x) ≜ [ ∂²v_1(x)/∂x²  ∂²v_2(x)/∂x²  ⋯  ∂²v_N(x)/∂x² ] ∈ R^N    (2057)

Gradient of vector-valued function h(x) : R^K → R^N on vector domain is

∇h(x) ≜
[ ∂h_1(x)/∂x_1  ∂h_2(x)/∂x_1  ⋯  ∂h_N(x)/∂x_1
  ∂h_1(x)/∂x_2  ∂h_2(x)/∂x_2  ⋯  ∂h_N(x)/∂x_2
  ⋮             ⋮                ⋮
  ∂h_1(x)/∂x_K  ∂h_2(x)/∂x_K  ⋯  ∂h_N(x)/∂x_K ]
= [ ∇h_1(x)  ∇h_2(x)  ⋯  ∇h_N(x) ] ∈ R^{K×N}    (2058)

while the second-order gradient has a three-dimensional written representation dubbed cubix;^{D.1}

∇²h(x) ≜
[ ∇∂h_1(x)/∂x_1  ∇∂h_2(x)/∂x_1  ⋯  ∇∂h_N(x)/∂x_1
  ∇∂h_1(x)/∂x_2  ∇∂h_2(x)/∂x_2  ⋯  ∇∂h_N(x)/∂x_2
  ⋮              ⋮                 ⋮
  ∇∂h_1(x)/∂x_K  ∇∂h_2(x)/∂x_K  ⋯  ∇∂h_N(x)/∂x_K ]
= [ ∇²h_1(x)  ∇²h_2(x)  ⋯  ∇²h_N(x) ] ∈ R^{K×N×K}    (2059)

where the gradient of each real entry is with respect to vector x as in (2053).

The gradient of real function g(X) : R^{K×L} → R on matrix domain is

∇g(X) ≜
[ ∂g(X)/∂X_11  ∂g(X)/∂X_12  ⋯  ∂g(X)/∂X_1L
  ∂g(X)/∂X_21  ∂g(X)/∂X_22  ⋯  ∂g(X)/∂X_2L
  ⋮            ⋮               ⋮
  ∂g(X)/∂X_K1  ∂g(X)/∂X_K2  ⋯  ∂g(X)/∂X_KL ] ∈ R^{K×L}

= [ ∇_{X(:,1)} g(X)   ∇_{X(:,2)} g(X)   ⋯   ∇_{X(:,L)} g(X) ] ∈ R^{K×1×L}    (2060)

where gradient ∇_{X(:,i)} is with respect to the i-th column of X. The strange appearance of (2060) in R^{K×1×L} is meant to suggest a third dimension perpendicular to the page (not a diagonal matrix). The second-order gradient has representation

^{D.1} The word matrix comes from the Latin for womb; related to the prefix matri- derived from mater meaning mother.

∇²g(X) ≜
[ ∇∂g(X)/∂X_11  ∇∂g(X)/∂X_12  ⋯  ∇∂g(X)/∂X_1L
  ∇∂g(X)/∂X_21  ∇∂g(X)/∂X_22  ⋯  ∇∂g(X)/∂X_2L
  ⋮             ⋮                ⋮
  ∇∂g(X)/∂X_K1  ∇∂g(X)/∂X_K2  ⋯  ∇∂g(X)/∂X_KL ] ∈ R^{K×L×K×L}

= [ ∇∇_{X(:,1)} g(X)   ∇∇_{X(:,2)} g(X)   ⋯   ∇∇_{X(:,L)} g(X) ] ∈ R^{K×1×L×K×L}    (2061)

where the gradient ∇ is with respect to matrix X.

Gradient of vector-valued function g(X) : R^{K×L} → R^N on matrix domain is a cubix

∇g(X) ≜
[ ∇_{X(:,1)} g_1(X)  ∇_{X(:,1)} g_2(X)  ⋯  ∇_{X(:,1)} g_N(X)
  ∇_{X(:,2)} g_1(X)  ∇_{X(:,2)} g_2(X)  ⋯  ∇_{X(:,2)} g_N(X)
  ⋮                  ⋮                     ⋮
  ∇_{X(:,L)} g_1(X)  ∇_{X(:,L)} g_2(X)  ⋯  ∇_{X(:,L)} g_N(X) ]
= [ ∇g_1(X)  ∇g_2(X)  ⋯  ∇g_N(X) ] ∈ R^{K×N×L}    (2062)

while the second-order gradient has a five-dimensional representation;

∇²g(X) ≜
[ ∇∇_{X(:,1)} g_1(X)  ∇∇_{X(:,1)} g_2(X)  ⋯  ∇∇_{X(:,1)} g_N(X)
  ∇∇_{X(:,2)} g_1(X)  ∇∇_{X(:,2)} g_2(X)  ⋯  ∇∇_{X(:,2)} g_N(X)
  ⋮                   ⋮                      ⋮
  ∇∇_{X(:,L)} g_1(X)  ∇∇_{X(:,L)} g_2(X)  ⋯  ∇∇_{X(:,L)} g_N(X) ]
= [ ∇²g_1(X)  ∇²g_2(X)  ⋯  ∇²g_N(X) ] ∈ R^{K×N×L×K×L}    (2063)

The gradient of matrix-valued function g(X) : R^{K×L} → R^{M×N} on matrix domain has a four-dimensional representation called quartix (fourth-order tensor)

∇g(X) ≜
[ ∇g_11(X)  ∇g_12(X)  ⋯  ∇g_1N(X)
  ∇g_21(X)  ∇g_22(X)  ⋯  ∇g_2N(X)
  ⋮         ⋮            ⋮
  ∇g_M1(X)  ∇g_M2(X)  ⋯  ∇g_MN(X) ] ∈ R^{M×N×K×L}    (2064)

while the second-order gradient has a six-dimensional representation

∇²g(X) ≜
[ ∇²g_11(X)  ∇²g_12(X)  ⋯  ∇²g_1N(X)
  ∇²g_21(X)  ∇²g_22(X)  ⋯  ∇²g_2N(X)
  ⋮          ⋮             ⋮
  ∇²g_M1(X)  ∇²g_M2(X)  ⋯  ∇²g_MN(X) ] ∈ R^{M×N×K×L×K×L}    (2065)

and so on.

D.1.2 Product rules for matrix-functions

Given dimensionally compatible matrix-valued functions of matrix variable f(X) and g(X)

∇_X ( f(X)^T g(X) ) = ∇_X(f) g + ∇_X(g) f    (2066)

while [65, §8.3] [420]

∇_X tr( f(X)^T g(X) ) = ∇_X ( tr( f(X)^T g(Z) ) + tr( g(X)^T f(Z) ) ) |_{Z←X}    (2067)

These expressions implicitly apply as well to scalar-, vector-, or matrix-valued functions of scalar, vector, or matrix arguments.

D.1.2.0.1 Example. Cubix.
Suppose f(X) : R^{2×2} → R^2 = X^T a and g(X) : R^{2×2} → R^2 = Xb. We wish to find

∇_X ( f(X)^T g(X) ) = ∇_X a^T X² b    (2068)

using the product rule. Formula (2066) calls for

∇_X a^T X² b = ∇_X(X^T a) Xb + ∇_X(Xb) X^T a    (2069)

Consider the first of the two terms:

∇_X(f) g = ∇_X(X^T a) Xb = [ ∇(X^T a)_1   ∇(X^T a)_2 ] Xb    (2070)

The gradient of X^T a forms a cubix in R^{2×2×2}; a.k.a third-order tensor.

∇_X(X^T a) Xb =
[ ∂(X^T a)_1/∂X_11   ∂(X^T a)_2/∂X_11
  ∂(X^T a)_1/∂X_12   ∂(X^T a)_2/∂X_12
  ∂(X^T a)_1/∂X_21   ∂(X^T a)_2/∂X_21
  ∂(X^T a)_1/∂X_22   ∂(X^T a)_2/∂X_22 ] [ (Xb)_1 ; (Xb)_2 ] ∈ R^{2×1×2}    (2071)

where the first and second pairs of rows are the two slices of the cubix, stacked in the third dimension. Because gradient of the product (2068) requires total change with respect to change in each entry of matrix X, the Xb vector must make an inner product with each vector in that second dimension of the cubix indicated by dotted line segments;

∇_X(X^T a) Xb =
[ a_1  0
  0    a_1
  a_2  0
  0    a_2 ] [ b_1 X_11 + b_2 X_12 ; b_1 X_21 + b_2 X_22 ] ∈ R^{2×1×2}

= [ a_1(b_1 X_11 + b_2 X_12)   a_1(b_1 X_21 + b_2 X_22)
    a_2(b_1 X_11 + b_2 X_12)   a_2(b_1 X_21 + b_2 X_22) ] ∈ R^{2×2}

= a b^T X^T    (2072)

where the cubix appears as a complete 2×2×2 matrix. In like manner for the second term ∇_X(g) f

∇_X(Xb) X^T a =
[ b_1  0
  b_2  0
  0    b_1
  0    b_2 ] [ X_11 a_1 + X_21 a_2 ; X_12 a_1 + X_22 a_2 ] ∈ R^{2×1×2}

= X^T a b^T ∈ R^{2×2}    (2073)

The solution

∇_X a^T X² b = a b^T X^T + X^T a b^T    (2074)

can be found from Table D.2.1 or verified using (2067). ∎
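The closed form (2074) is easy to corroborate numerically; a sketch assuming NumPy, with arbitrary data:

```python
# Finite-difference check of (2074): grad of a'X^2 b equals ab'X' + X'ab'.
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.standard_normal(2), rng.standard_normal(2)
X = rng.standard_normal((2, 2))
h = 1e-6

g = lambda X: a @ X @ X @ b            # g(X) = a'X^2 b
G = np.zeros((2, 2))                   # finite-difference gradient w.r.t. X
for k in range(2):
    for l in range(2):
        E = np.zeros((2, 2)); E[k, l] = h
        G[k, l] = (g(X + E) - g(X - E)) / (2*h)

assert np.allclose(G, np.outer(a, b) @ X.T + X.T @ np.outer(a, b), atol=1e-6)
```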

D.1.2.1 Kronecker product

A partial remedy for venturing into hyperdimensional matrix representations, such as the cubix or quartix, is to first vectorize matrices as in (39). This device gives rise to the Kronecker product ⊗ of matrices; a.k.a tensor product (kron() in Matlab). Although its definition sees reversal in the literature, [434, §2.1] Kronecker product is not commutative (B⊗A ≠ A⊗B). We adopt the definition: for A ∈ R^{m×n} and B ∈ R^{p×q}

B⊗A ≜
[ B_11 A  B_12 A  ⋯  B_1q A
  B_21 A  B_22 A  ⋯  B_2q A
  ⋮       ⋮          ⋮
  B_p1 A  B_p2 A  ⋯  B_pq A ] ∈ R^{pm×qn}    (2075)

for which A⊗1 = 1⊗A = A (real unity acts like Identity).

One advantage to vectorization is existence of the traditional two-dimensional matrix representation (second-order tensor) for the second-order gradient of a real function with respect to a vectorized matrix. From §A.1.1 no.36 (§D.2.1) for square A, B ∈ R^{n×n}, for example [220, §5.2] [15, §3]

∇²_{vec X} tr(A X B X^T) = ∇²_{vec X} vec(X)^T (B^T ⊗ A) vec X = B ⊗ A^T + B^T ⊗ A ∈ R^{n²×n²}    (2076)
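Identity (2076) and the underlying quadratic form can be checked directly with kron(); a sketch assuming NumPy, where vec denotes column-stacking as in (39):

```python
# Check tr(AXBX') = vec(X)'(B' kron A) vec X and the first-order gradient
# (B' kron A + B kron A') vec X implied by (2076).
import numpy as np

rng = np.random.default_rng(2)
n = 3
A, B, X = (rng.standard_normal((n, n)) for _ in range(3))
vec = lambda M: M.reshape(-1, order='F')       # column-stacking vec

lhs = np.trace(A @ X @ B @ X.T)
rhs = vec(X) @ np.kron(B.T, A) @ vec(X)
assert np.isclose(lhs, rhs)

g = lambda v: np.trace(A @ v.reshape(n, n, order='F') @ B
                       @ v.reshape(n, n, order='F').T)
h = 1e-6
fd = np.array([(g(vec(X) + h*e) - g(vec(X) - h*e)) / (2*h)
               for e in np.eye(n*n)])
assert np.allclose(fd, (np.kron(B.T, A) + np.kron(B, A.T)) @ vec(X), atol=1e-5)
```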

A disadvantage is a large new but known set of algebraic rules (§A.1.1) and the fact that its mere use does not generally guarantee two-dimensional matrix representation of gradients.

Another application of the Kronecker product is to reverse order of appearance in a matrix product: Suppose we wish to weight the columns of a matrix S ∈ R^{M×N}, for example, by respective entries w_i from the main diagonal in

W ≜ [ w_1 ⋯ 0 ; ⋱ ; 0 ⋯ w_N ] ∈ S^N    (2077)

A conventional means for accomplishing column weighting is to multiply S by diagonal matrix W on the right side:

S W = S [ w_1 ⋯ 0 ; ⋱ ; 0 ⋯ w_N ] = [ S(:,1)w_1  ⋯  S(:,N)w_N ] ∈ R^{M×N}    (2078)

To reverse product order such that diagonal matrix W instead appears to the left of S: for I ∈ S^M (Law)

S W = ( δ(W)^T ⊗ I )
[ S(:,1)  0       ⋯  0
  0       S(:,2)     ⋮
  ⋮               ⋱  0
  0       ⋯   0      S(:,N) ] ∈ R^{M×N}    (2079)
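A small demonstration of (2078) and (2079), assuming NumPy and SciPy; sizes and data are arbitrary:

```python
# Column weighting two ways: S @ diag(w) versus the Kronecker form (2079).
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(3)
M, N = 3, 4
S = rng.standard_normal((M, N))
w = rng.standard_normal(N)

right = S @ np.diag(w)                               # conventional (2078)

# (2079): delta(W)' kron I applied to a block-diagonal stack of columns of S
blocks = block_diag(*[S[:, [j]] for j in range(N)])  # MN x N
left = np.kron(w[None, :], np.eye(M)) @ blocks       # M x N
assert np.allclose(left, right)
```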

To instead weight the rows of S via diagonal matrix W ∈ S^M, for I ∈ S^N

W S =
[ S(1,:)  0       ⋯  0
  0       S(2,:)     ⋮
  ⋮               ⋱  0
  0       ⋯   0      S(M,:) ] ( δ(W) ⊗ I ) ∈ R^{M×N}    (2080)

D.1.2.2 Hadamard product

For any matrices of like size, S, Y ∈ R^{M×N}, Hadamard's product ∘ denotes simple multiplication of corresponding entries (.* in Matlab). It is possible to convert Hadamard product into a standard product of matrices:

S ∘ Y = [ δ(Y(:,1))  ⋯  δ(Y(:,N)) ]
[ S(:,1)  0       ⋯  0
  0       S(:,2)     ⋮
  ⋮               ⋱  0
  0       ⋯   0      S(:,N) ] ∈ R^{M×N}    (2081)

In the special case that S = s and Y = y are vectors in R^M

s ∘ y = δ(s) y    (2082)

s^T ⊗ y = y s^T ,   s ⊗ y^T = s y^T    (2083)

D.1.3 Chain rules for composite matrix-functions

Given dimensionally compatible matrix-valued functions of matrix variable f(X) and

g(X) [462, §15.7]

∇_X g( f(X)^T ) = ∇_X f^T ∇_f g    (2084)

∇²_X g( f(X)^T ) = ∇_X ( ∇_X f^T ∇_f g ) = ∇²_X f ∇_f g + ∇_X f^T ∇²_f g ∇_X f    (2085)

D.1.3.1 Two arguments

∇_X g( f(X)^T , h(X)^T ) = ∇_X f^T ∇_f g + ∇_X h^T ∇_h g    (2086)

D.1.3.1.1 Example. Chain rule for two arguments. [51, §1.1]

g( f(x)^T , h(x)^T ) = ( f(x) + h(x) )^T A ( f(x) + h(x) )    (2087)

f(x) = [ x_1 ; ε x_2 ] ,   h(x) = [ ε x_1 ; x_2 ]    (2088)

∇_x g( f(x)^T , h(x)^T ) = [ 1 0 ; 0 ε ] (A + A^T)(f + h) + [ ε 0 ; 0 1 ] (A + A^T)(f + h)    (2089)

∇_x g( f(x)^T , h(x)^T ) = [ 1+ε 0 ; 0 1+ε ] (A + A^T) ( [ x_1 ; ε x_2 ] + [ ε x_1 ; x_2 ] )    (2090)

lim_{ε→0} ∇_x g( f(x)^T , h(x)^T ) = (A + A^T) x    (2091)

from Table D.2.1. ∎

These foregoing formulae remain correct when gradient produces hyperdimensional representation:

D.1.4 First directional derivative

Assume that a differentiable function g(X) : R^{K×L} → R^{M×N} has continuous first- and second-order gradients ∇g and ∇²g over dom g, which is an open set. We seek simple expressions for the first and second directional derivatives in direction Y ∈ R^{K×L}: respectively, →Y dg ∈ R^{M×N} and →Y dg² ∈ R^{M×N}.

Assuming that the limit exists, we may state the partial derivative of the mn-th entry of g with respect to the kl-th entry of X;

∂g_mn(X)/∂X_kl = lim_{Δt→0} ( g_mn(X + Δt e_k e_l^T) − g_mn(X) ) / Δt ∈ R    (2092)

where e_k is the k-th standard basis vector in R^K while e_l is the l-th standard basis vector in R^L. Total number of partial derivatives equals KLMN while the gradient is defined in their terms; the mn-th entry of the gradient is

∇g_mn(X) =
[ ∂g_mn(X)/∂X_11  ∂g_mn(X)/∂X_12  ⋯  ∂g_mn(X)/∂X_1L
  ∂g_mn(X)/∂X_21  ∂g_mn(X)/∂X_22  ⋯  ∂g_mn(X)/∂X_2L
  ⋮               ⋮                  ⋮
  ∂g_mn(X)/∂X_K1  ∂g_mn(X)/∂X_K2  ⋯  ∂g_mn(X)/∂X_KL ] ∈ R^{K×L}    (2093)

while the gradient is a quartix

∇g(X) =
[ ∇g_11(X)  ∇g_12(X)  ⋯  ∇g_1N(X)
  ∇g_21(X)  ∇g_22(X)  ⋯  ∇g_2N(X)
  ⋮         ⋮            ⋮
  ∇g_M1(X)  ∇g_M2(X)  ⋯  ∇g_MN(X) ] ∈ R^{M×N×K×L}    (2094)

By simply rotating our perspective of a four-dimensional representation of gradient matrix, we find one of three useful transpositions of this quartix (connoted T1):

∇g(X)^{T1} =
[ ∂g(X)/∂X_11  ∂g(X)/∂X_12  ⋯  ∂g(X)/∂X_1L
  ∂g(X)/∂X_21  ∂g(X)/∂X_22  ⋯  ∂g(X)/∂X_2L
  ⋮            ⋮               ⋮
  ∂g(X)/∂X_K1  ∂g(X)/∂X_K2  ⋯  ∂g(X)/∂X_KL ] ∈ R^{K×L×M×N}    (2095)

When a limit for Δt ∈ R exists, it is easy to show by substitution of variables in (2092)

Y_kl ∂g_mn(X)/∂X_kl = lim_{Δt→0} ( g_mn(X + Δt Y_kl e_k e_l^T) − g_mn(X) ) / Δt ∈ R    (2096)

which may be interpreted as the change in g_mn at X when the change in X_kl is equal to Y_kl, the kl-th entry of any Y ∈ R^{K×L}. Because the total change in g_mn(X) due to Y is the sum of change with respect to each and every X_kl, the mn-th entry of the directional

derivative is the corresponding total differential [462, §15.8]

dg_mn(X)|_{dX→Y} = Σ_{k,l} ∂g_mn(X)/∂X_kl Y_kl = tr( ∇g_mn(X)^T Y )    (2097)

= lim_{Δt→0} Σ_{k,l} ( g_mn(X + Δt Y_kl e_k e_l^T) − g_mn(X) ) / Δt    (2098)

= lim_{Δt→0} ( g_mn(X + Δt Y) − g_mn(X) ) / Δt    (2099)

= d/dt|_{t=0} g_mn(X + t Y)    (2100)

where t ∈ R. Assuming finite Y, equation (2099) is called the Gâteaux differential [50, App.A.5] [265, §D.2.1] [474, §5.28] whose existence is implied by existence of the Fréchet differential (the sum in (2097)). [337, §7.2] Each may be understood as the change in g_mn at X when the change in X is equal in magnitude and direction to Y.^{D.2} Hence the directional derivative,

→Y dg(X) ≜
[ dg_11(X)  dg_12(X)  ⋯  dg_1N(X)
  dg_21(X)  dg_22(X)  ⋯  dg_2N(X)
  ⋮         ⋮            ⋮
  dg_M1(X)  dg_M2(X)  ⋯  dg_MN(X) ] |_{dX→Y} ∈ R^{M×N}

= [ tr( ∇g_11(X)^T Y )  ⋯  tr( ∇g_1N(X)^T Y )
    ⋮                      ⋮
    tr( ∇g_M1(X)^T Y )  ⋯  tr( ∇g_MN(X)^T Y ) ]

= [ Σ_{k,l} ∂g_11(X)/∂X_kl Y_kl  ⋯  Σ_{k,l} ∂g_1N(X)/∂X_kl Y_kl
    ⋮                               ⋮
    Σ_{k,l} ∂g_M1(X)/∂X_kl Y_kl  ⋯  Σ_{k,l} ∂g_MN(X)/∂X_kl Y_kl ]    (2101)

from which it follows

→Y dg(X) = Σ_{k,l} ∂g(X)/∂X_kl Y_kl    (2102)

Yet for all X ∈ dom g, any Y ∈ R^{K×L}, and some open interval of t ∈ R

g(X + t Y) = g(X) + t →Y dg(X) + O(t²)    (2103)

which is the first-order multidimensional Taylor series expansion about X. [462, §18.4] [203, §2.3.4] Differentiation with respect to t and subsequent t-zeroing isolates the second term of the expansion. Thus differentiating and zeroing g(X + tY) in t is an operation equivalent to individually differentiating and zeroing every entry g_mn(X + tY) as in (2100). So the directional derivative of g(X) : R^{K×L} → R^{M×N} in any direction Y ∈ R^{K×L} evaluated at X ∈ dom g becomes [371, §2.1, §5.4.5] [43, §6.3.1]

→Y dg(X) = d/dt|_{t=0} g(X + t Y) ∈ R^{M×N}    (2104)

which is simplest.

^{D.2} Although Y is a matrix, we may regard it as a vector in R^{KL}.
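A numerical illustration, assuming NumPy, comparing the trace form (2097) with the derivative form (2104) for the real function g(X) = tr(X^T X), whose gradient 2X appears in Table D.2.3:

```python
# Directional derivative of g(X) = tr(X'X): tr(grad' Y) vs d/dt g(X+tY)|t=0.
import numpy as np

rng = np.random.default_rng(4)
X, Y = rng.standard_normal((2, 3, 3))
g = lambda X: np.trace(X.T @ X)

grad = 2 * X                                # analytic gradient (Table D.2.3)
lhs = np.trace(grad.T @ Y)                  # trace form (2097)
h = 1e-6
rhs = (g(X + h*Y) - g(X - h*Y)) / (2*h)     # derivative form (2104)
assert np.isclose(lhs, rhs)
```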

Figure 216: Strictly convex quadratic bowl in R²×R; f(x) = x^T x : R² → R versus x on some open disc in R². Plane slice ∂H is perpendicular to function domain. Slice intersection with domain connotes bidirectional vector y. Slope of tangent line T at point (α, f(α)) is value of directional derivative ∇_x f(α)^T y (2129) at α in slice direction y. Negative gradient −∇_x f(x) ∈ R² is direction of steepest descent. [74, §9.4.1] [462, §15.6] [203] [519] When vector υ ∈ R³ entry υ_3 is half directional derivative in gradient direction at α, and when [υ_1 ; υ_2] = ∇_x f(α), then −υ points directly toward bowl bottom; the figure labels υ ≜ [ ∇_x f(α) ; (1/2) →∇_x f(α) df(α) ].

In case of a real function g(X) : R^{K×L} → R

→Y dg(X) = tr( ∇g(X)^T Y )    (2126)

In case g(X) : R^K → R

→Y dg(X) = ∇g(X)^T Y    (2129)

Unlike gradient, directional derivative does not expand dimension; directional derivative (2104) retains the dimensions of g. The derivative with respect to t makes the directional derivative resemble ordinary calculus (§D.2); e.g, when g(X) is linear, →Y dg(X) = g(Y). [337, §7.2]

D.1.4.1 Interpretation of directional derivative

In the case of any differentiable real function g(X) : R^{K×L} → R, the directional derivative of g(X) at X in any direction Y yields the slope of g along the line {X + tY | t ∈ R} through its domain evaluated at t = 0. For higher-dimensional functions, by (2101), this slope interpretation can be applied to each entry of the directional derivative. Figure 216, for example, shows a plane slice of a real convex bowl-shaped function f(x) along a line {α + ty | t ∈ R} through its domain. The slice reveals a one-dimensional real function of t; f(α + ty). The directional derivative at x = α in direction y is the slope of f(α + ty) with respect to t at t = 0. In the case of a real function having vector argument h(X) : R^K → R, its directional derivative in the normalized direction of its gradient is the gradient magnitude. (2129) For a real function of real variable, the directional derivative evaluated at any point in the function domain is just the slope of that function there scaled by the real direction. (confer §3.6)

Directional derivative generalizes our one-dimensional notion of derivative to a multidimensional domain. When direction Y coincides with a member of the standard Cartesian basis e_k e_l^T (63), then a single partial derivative ∂g(X)/∂X_kl is obtained from directional derivative (2102); such is each entry of gradient ∇g(X) in equalities (2126) and (2129), for example.

D.1.4.1.1 Theorem. Directional derivative optimality condition. [337, §7.4]
Suppose f(X) : R^{K×L} → R is minimized on convex set C ⊆ R^{K×L} by X⋆, and the directional derivative of f exists there. Then for all X ∈ C

→X−X⋆ df(X⋆) ≥ 0    (2105)    ⋄

D.1.4.1.2 Example. Simple bowl. Bowl function (Figure 216)

f(x) : R^K → R ≜ (x − a)^T (x − a) − b    (2106)

has function offset −b ∈ R, axis of revolution at x = a, and positive definite Hessian (2054) everywhere in its domain (an open hyperdisc in R^K); id est, strictly convex quadratic f(x) has unique global minimum equal to −b at x = a. A vector −υ based anywhere in dom f × R pointing toward the unique bowl-bottom is specified:

υ ∝ [ x − a ; f(x) + b ] ∈ R^K × R    (2107)

Such a vector is

υ = [ ∇_x f(x) ; (1/2) →∇_x f(x) df(x) ]    (2108)

since the gradient is

∇_x f(x) = 2(x − a)    (2109)

and the directional derivative in direction of the gradient is (2129)

→∇_x f(x) df(x) = ∇_x f(x)^T ∇_x f(x) = 4(x − a)^T (x − a) = 4( f(x) + b )    (2110)

∎

D.1.5 Second directional derivative

By similar argument, it so happens: the second directional derivative is equally simple. Given g(X) : R^{K×L} → R^{M×N} on open domain,

∇ ∂g_mn(X)/∂X_kl = ∂∇g_mn(X)/∂X_kl =
[ ∂²g_mn(X)/∂X_kl∂X_11  ∂²g_mn(X)/∂X_kl∂X_12  ⋯  ∂²g_mn(X)/∂X_kl∂X_1L
  ∂²g_mn(X)/∂X_kl∂X_21  ∂²g_mn(X)/∂X_kl∂X_22  ⋯  ∂²g_mn(X)/∂X_kl∂X_2L
  ⋮                     ⋮                        ⋮
  ∂²g_mn(X)/∂X_kl∂X_K1  ∂²g_mn(X)/∂X_kl∂X_K2  ⋯  ∂²g_mn(X)/∂X_kl∂X_KL ] ∈ R^{K×L}    (2111)

∇²g_mn(X) =
[ ∇∂g_mn(X)/∂X_11  ⋯  ∇∂g_mn(X)/∂X_1L
  ⋮                   ⋮
  ∇∂g_mn(X)/∂X_K1  ⋯  ∇∂g_mn(X)/∂X_KL ] ∈ R^{K×L×K×L}
=
[ ∂∇g_mn(X)/∂X_11  ⋯  ∂∇g_mn(X)/∂X_1L
  ⋮                   ⋮
  ∂∇g_mn(X)/∂X_K1  ⋯  ∂∇g_mn(X)/∂X_KL ]    (2112)

Rotating our perspective, we get several views of the second-order gradient:

∇²g(X) =
[ ∇²g_11(X)  ⋯  ∇²g_1N(X)
  ⋮             ⋮
  ∇²g_M1(X)  ⋯  ∇²g_MN(X) ] ∈ R^{M×N×K×L×K×L}    (2113)

∇²g(X)^{T1} =
[ ∇∂g(X)/∂X_11  ⋯  ∇∂g(X)/∂X_1L
  ⋮                ⋮
  ∇∂g(X)/∂X_K1  ⋯  ∇∂g(X)/∂X_KL ] ∈ R^{K×L×M×N×K×L}    (2114)

∇²g(X)^{T2} =
[ ∂∇g(X)/∂X_11  ⋯  ∂∇g(X)/∂X_1L
  ⋮                ⋮
  ∂∇g(X)/∂X_K1  ⋯  ∂∇g(X)/∂X_KL ] ∈ R^{K×L×K×L×M×N}    (2115)

Assuming the limits to exist, we may state the partial derivative of the mn-th entry of g with respect to the kl-th and ij-th entries of X;

∂²g_mn(X)/∂X_kl ∂X_ij = ∂( ∂g_mn(X)/∂X_kl )/∂X_ij
= lim_{Δt→0} ( ∂g_mn(X + Δt e_k e_l^T)/∂X_ij − ∂g_mn(X)/∂X_ij ) / Δt
= lim_{Δτ,Δt→0} ( ( g_mn(X + Δt e_k e_l^T + Δτ e_i e_j^T) − g_mn(X + Δt e_k e_l^T) ) − ( g_mn(X + Δτ e_i e_j^T) − g_mn(X) ) ) / (Δτ Δt)    (2116)

Differentiating (2096) and then scaling by Y_ij

Y_kl Y_ij ∂²g_mn(X)/∂X_kl ∂X_ij
= lim_{Δt→0} ( ∂g_mn(X + Δt Y_kl e_k e_l^T)/∂X_ij − ∂g_mn(X)/∂X_ij ) Y_ij / Δt
= lim_{Δτ,Δt→0} ( ( g_mn(X + Δt Y_kl e_k e_l^T + Δτ Y_ij e_i e_j^T) − g_mn(X + Δt Y_kl e_k e_l^T) ) − ( g_mn(X + Δτ Y_ij e_i e_j^T) − g_mn(X) ) ) / (Δτ Δt)    (2117)

which can be proved by substitution of variables in (2116). The mn-th second-order total differential due to any Y ∈ R^{K×L} is

d²g_mn(X)|_{dX→Y} = Σ_{i,j} Σ_{k,l} ∂²g_mn(X)/∂X_kl ∂X_ij Y_kl Y_ij = tr( ∇_X tr( ∇g_mn(X)^T Y )^T Y )    (2118)

= lim_{Δt→0} Σ_{i,j} ( ∂g_mn(X + Δt Y)/∂X_ij − ∂g_mn(X)/∂X_ij ) Y_ij / Δt    (2119)

= lim_{Δt→0} ( g_mn(X + 2Δt Y) − 2 g_mn(X + Δt Y) + g_mn(X) ) / Δt²    (2120)

= d²/dt²|_{t=0} g_mn(X + t Y)    (2121)

Hence the second directional derivative,

→Y dg²(X) ≜
[ d²g_11(X)  d²g_12(X)  ⋯  d²g_1N(X)
  d²g_21(X)  d²g_22(X)  ⋯  d²g_2N(X)
  ⋮          ⋮             ⋮
  d²g_M1(X)  d²g_M2(X)  ⋯  d²g_MN(X) ] |_{dX→Y} ∈ R^{M×N}

= [ tr( ∇_X tr( ∇g_mn(X)^T Y )^T Y ) ]_{m=1…M, n=1…N}

= [ Σ_{i,j} Σ_{k,l} ∂²g_mn(X)/∂X_kl ∂X_ij Y_kl Y_ij ]_{m=1…M, n=1…N}    (2122)

from which it follows

→Y dg²(X) = Σ_{i,j} Σ_{k,l} ∂²g(X)/∂X_kl ∂X_ij Y_kl Y_ij = Σ_{i,j} ∂( →Y dg(X) )/∂X_ij Y_ij    (2123)

Yet for all X ∈ dom g, any Y ∈ R^{K×L}, and some open interval of t ∈ R

g(X + t Y) = g(X) + t →Y dg(X) + (1/2!) t² →Y dg²(X) + O(t³)    (2124)

which is the second-order multidimensional Taylor series expansion about X. [462, §18.4] [203, §2.3.4] Differentiating twice with respect to t and subsequent t-zeroing isolates the third term of the expansion. Thus differentiating and zeroing g(X + tY) in t is an operation equivalent to individually differentiating and zeroing every entry g_mn(X + tY) as in (2121). So the second directional derivative of g(X) : R^{K×L} → R^{M×N} becomes [371, §2.1, §5.4.5] [43, §6.3.1]

→Y dg²(X) = d²/dt²|_{t=0} g(X + t Y) ∈ R^{M×N}    (2125)

which is again simplest. (confer (2104)) Directional derivative retains the dimensions of g.

D.1.6 directional derivative expressions

In the case of a real function g(X) : R^{K×L} → R, all its directional derivatives are in R:

→Y dg(X) = tr( ∇g(X)^T Y )    (2126)

→Y dg²(X) = tr( ∇_X tr( ∇g(X)^T Y )^T Y ) = tr( ∇_X →Y dg(X)^T Y )    (2127)

→Y dg³(X) = tr( ∇_X tr( ∇_X tr( ∇g(X)^T Y )^T Y )^T Y ) = tr( ∇_X →Y dg²(X)^T Y )    (2128)

In the case g(X) : R^K → R has vector argument, they further simplify:

→Y dg(X) = ∇g(X)^T Y    (2129)

→Y dg²(X) = Y^T ∇²g(X) Y    (2130)

→Y dg³(X) = ∇_X ( Y^T ∇²g(X) Y )^T Y    (2131)

and so on.

D.1.7 higher-order multidimensional Taylor series

Series expansions of the differentiable matrix-valued function g(X), of matrix argument, were given earlier in (2103) and (2124). Assume that g(X) has continuous first-, second-, and third-order gradients over open set dom g. Then, for X ∈ dom g and any Y ∈ R^{K×L}, the Taylor series is expressed on some open interval of μ ∈ R

g(X + μY) = g(X) + μ →Y dg(X) + (1/2!) μ² →Y dg²(X) + (1/3!) μ³ →Y dg³(X) + O(μ⁴)    (2132)

or on some open interval of ‖Y‖₂

g(Y) = g(X) + →Y−X dg(X) + (1/2!) →Y−X dg²(X) + (1/3!) →Y−X dg³(X) + O(‖Y‖⁴)    (2133)

which are third-order expansions about X. The mean value theorem from calculus is what insures finite order of the series. [462] [51, §1.1] [50, App.A.5] [265, §0.4] These somewhat unbelievable formulae^{D.3} imply that a function can be determined over the whole of its domain by knowing its value and all its directional derivatives at a single point X.

D.1.7.0.1 Example. Inverse-matrix function.
Say g(Y) = Y^{−1}. From the table on page 616,

→Y dg(X) = d/dt|_{t=0} g(X + t Y) = −X^{−1} Y X^{−1}    (2134)

→Y dg²(X) = d²/dt²|_{t=0} g(X + t Y) = 2 X^{−1} Y X^{−1} Y X^{−1}    (2135)

→Y dg³(X) = d³/dt³|_{t=0} g(X + t Y) = −6 X^{−1} Y X^{−1} Y X^{−1} Y X^{−1}    (2136)

Let's find the Taylor series expansion of g about X = I: Since g(I) = I, for ‖Y‖₂ < 1 (μ = 1 in (2132))

g(I + Y) = (I + Y)^{−1} = I − Y + Y² − Y³ + …    (2137)

If Y is small, (I + Y)^{−1} ≈ I − Y.^{D.4} Now we find Taylor series expansion about X:

g(X + Y) = (X + Y)^{−1} = X^{−1} − X^{−1} Y X^{−1} + X^{−1} Y X^{−1} Y X^{−1} − …    (2138)

If Y is small, (X + Y)^{−1} ≈ X^{−1} − X^{−1} Y X^{−1}. ∎

^{D.3} e.g, real continuous and differentiable function of real variable f(x) = e^{−1/x²} has no Taylor series expansion about x = 0, of any practical use, because each derivative equals 0 there.
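A numerical sketch of expansion (2138), assuming NumPy; truncating after the first-order term leaves O(‖Y‖²) error, and after the second-order term O(‖Y‖³):

```python
# Compare inv(X + Y) against one- and two-term Neumann-style truncations.
import numpy as np

rng = np.random.default_rng(5)
X = np.eye(3) + 0.1 * rng.standard_normal((3, 3))
Y = 1e-2 * rng.standard_normal((3, 3))
Xi = np.linalg.inv(X)

exact = np.linalg.inv(X + Y)
first = Xi - Xi @ Y @ Xi                     # through the (1/1!) term
second = first + Xi @ Y @ Xi @ Y @ Xi        # adds the (1/2!) dg^2 term

print(np.linalg.norm(exact - first))         # ~ O(|Y|^2)
print(np.linalg.norm(exact - second))        # ~ O(|Y|^3), much smaller
```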

D.1.7.0.2 Exercise. log det.
(confer [74, p.644]) Find the first three terms of a Taylor series expansion for log det Y. Specify an open interval over which the expansion holds in vicinity of X. H

D.1.8 Correspondence of gradient to derivative

From the foregoing expressions for directional derivative, we derive a relationship between gradient with respect to matrix X and derivative with respect to real variable t :

D.1.8.1 first-order

Removing evaluation at t = 0 from (2104),^{D.5} we find an expression for the directional derivative of g(X) in direction Y evaluated anywhere along a line {X + tY | t ∈ R} intersecting dom g

→Y dg(X + t Y) = d/dt g(X + t Y)    (2139)

In the general case g(X) : R^{K×L} → R^{M×N}, from (2097) and (2100) we find

tr( ∇_X g_mn(X + t Y)^T Y ) = d/dt g_mn(X + t Y)    (2140)

which is valid at t = 0, of course, when X ∈ dom g. In the important case of a real function g(X) : R^{K×L} → R, from (2126) we have simply

tr( ∇_X g(X + t Y)^T Y ) = d/dt g(X + t Y)    (2141)

When g(X) : R^K → R has vector argument,

∇_X g(X + t Y)^T Y = d/dt g(X + t Y)    (2142)

^{D.4} Had we instead set g(Y) = (I + Y)^{−1}, then the equivalent expansion would have been about X = 0.
^{D.5} Justified by replacing X with X + tY in (2097)-(2099); beginning,

dg_mn(X + t Y)|_{dX→Y} = Σ_{k,l} ∂g_mn(X + t Y)/∂X_kl Y_kl

D.1.8.1.1 Example. Gradient.
g(X) = w^T X^T X w, X ∈ R^{K×L}, w ∈ R^L. Using the tables in §D.2,

tr( ∇_X g(X + t Y)^T Y ) = tr( 2 w w^T (X + t Y)^T Y )    (2143)
= 2 w^T ( X^T Y + t Y^T Y ) w    (2144)

Applying equivalence (2141),

d/dt g(X + t Y) = d/dt ( w^T (X + t Y)^T (X + t Y) w )    (2145)
= w^T ( X^T Y + Y^T X + 2 t Y^T Y ) w    (2146)
= 2 w^T ( X^T Y + t Y^T Y ) w    (2147)

which is the same as (2144). Hence, the equivalence is demonstrated. It is easy to extract ∇g(X) from (2147) knowing only (2141):

tr( ∇_X g(X + t Y)^T Y ) = 2 w^T ( X^T Y + t Y^T Y ) w = 2 tr( w w^T (X^T + t Y^T) Y )

tr( ∇_X g(X)^T Y ) = 2 tr( w w^T X^T Y )    (2148)
⇔ ∇_X g(X) = 2 X w w^T

∎
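The extracted gradient ∇g(X) = 2Xww^T from (2148) is readily corroborated by central differences; a sketch assuming NumPy, with arbitrary K, L:

```python
# Finite-difference check that grad of g(X) = w'X'Xw equals 2Xww'.
import numpy as np

rng = np.random.default_rng(6)
K, L = 3, 2
X = rng.standard_normal((K, L))
w = rng.standard_normal(L)
g = lambda X: w @ X.T @ X @ w
h = 1e-6

G = np.zeros((K, L))
for k in range(K):
    for l in range(L):
        E = np.zeros((K, L)); E[k, l] = h
        G[k, l] = (g(X + E) - g(X - E)) / (2*h)

assert np.allclose(G, 2 * X @ np.outer(w, w), atol=1e-6)
```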

D.1.8.2 second-order

Likewise removing the evaluation at t = 0 from (2125),

→Y dg²(X + t Y) = d²/dt² g(X + t Y)    (2149)

we can find a similar relationship between second-order gradient and second derivative in t: In the general case g(X) : R^{K×L} → R^{M×N} from (2118) and (2121),

tr( ∇_X tr( ∇_X g_mn(X + t Y)^T Y )^T Y ) = d²/dt² g_mn(X + t Y)    (2150)

In the case of a real function g(X) : R^{K×L} → R we have, of course,

tr( ∇_X tr( ∇_X g(X + t Y)^T Y )^T Y ) = d²/dt² g(X + t Y)    (2151)

From (2130), the simpler case, where real function g(X) : R^K → R has vector argument,

Y^T ∇²_X g(X + t Y) Y = d²/dt² g(X + t Y)    (2152)

D.1.8.2.1 Example. Second-order gradient.
We want to find ∇²g(X) ∈ R^{K×K×K×K} given real function g(X) = log det X having domain int S^K_+. From the tables in §D.2,

h(X) ≜ ∇g(X) = X^{−1} ∈ int S^K_+    (2153)

so ∇²g(X) = ∇h(X). By (2140) and (2103), for Y ∈ S^K

tr( ∇h_mn(X)^T Y ) = d/dt|_{t=0} h_mn(X + t Y)    (2154)
= ( d/dt|_{t=0} h(X + t Y) )_mn    (2155)
= ( d/dt|_{t=0} (X + t Y)^{−1} )_mn    (2156)
= −( X^{−1} Y X^{−1} )_mn    (2157)

Setting Y to a member of { e_k e_l^T ∈ R^{K×K} | k, l = 1…K }, and employing a property (41) of the trace function we find

∇²g(X)_mnkl = tr( ∇h_mn(X)^T e_k e_l^T ) = ∇h_mn(X)_kl = −( X^{−1} e_k e_l^T X^{−1} )_mn    (2158)

∇²g(X)_kl = ∇h(X)_kl = −X^{−1} e_k e_l^T X^{−1} ∈ R^{K×K}    (2159)

∎

From all these first- and second-order expressions, we may generate new ones by evaluating both sides at arbitrary t (in some open interval) but only after differentiation.
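A numerical check of (2159), assuming NumPy: differencing h(X) = X^{−1} in the direction e_k e_l^T reproduces −X^{−1} e_k e_l^T X^{−1}:

```python
# Each kl-slice of the second-order gradient of log det X is
# -inv(X) e_k e_l' inv(X), per (2159).
import numpy as np

rng = np.random.default_rng(7)
K = 3
M = rng.standard_normal((K, K))
X = M @ M.T + K * np.eye(K)               # a point in int S^K_+
Xi = np.linalg.inv(X)
h = 1e-6

k, l = 1, 2                               # any fixed entry of X
E = np.zeros((K, K)); E[k, l] = h
dH = (np.linalg.inv(X + E) - np.linalg.inv(X - E)) / (2*h)
assert np.allclose(dH, -Xi[:, [k]] @ Xi[[l], :], atol=1e-6)
```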

D.2 Tables of gradients and derivatives

• Results may be validated numerically via Richardson extrapolation [332, §5.4] [146] (a numerical sketch follows this list). When algebraically proving results for symmetric matrices, it is critical to take gradients ignoring symmetry and to then substitute symmetric entries afterward. [220] [78]

• i, j, k, ℓ, K, L, m, n, M, N are integers, unless otherwise noted, and a, b ∈ R^n, x, y ∈ R^k, A, B ∈ R^{m×n}, X, Y ∈ R^{K×L}, t, μ ∈ R.

• x^μ means δ(δ(x)^μ) for μ ∈ R; id est, entrywise vector exponentiation. δ is the main-diagonal linear operator (1681). x⁰ ≜ 1, X⁰ ≜ I if square.

• d/dx ≜ [ d/dx_1 ; … ; d/dx_k ], while →y dg(x), →y dg²(x) (directional derivatives §D.1), log x, e^x, |x|, x/y (Hadamard quotient), sgn x, √∘x (entrywise square root), etcetera, are maps f : R^k → R^k that maintain dimension; e.g, (§A.1.1)

d/dx x^{−1} ≜ ∇_x 1^T δ(x)^{−1} 1    (2160)

• For A a scalar or square matrix, we have the Taylor series [98, §3.6]

e^A ≜ Σ_{k=0}^∞ (1/k!) A^k    (2161)

Further, [440, §5.4]

e^A ≻ 0   ∀ A ∈ S^m    (2162)

• For all square A and integer k

det^k A = det A^k    (2163)
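As mentioned in the first bullet above, table entries may be validated numerically via Richardson extrapolation; a sketch assuming NumPy, applied here to the entry d/dt log det(X + tY) = tr((X + tY)^{−1} Y) from §D.2.4:

```python
# Richardson-extrapolated central difference of phi(t) = log det(X + tY)
# at t = 0, compared against the table entry tr(inv(X) Y).
import numpy as np

rng = np.random.default_rng(8)
M = rng.standard_normal((4, 4))
X = M @ M.T + 4 * np.eye(4)                  # keep X positive definite
Y = rng.standard_normal((4, 4))
phi = lambda t: np.linalg.slogdet(X + t * Y)[1]

D = lambda h: (phi(h) - phi(-h)) / (2 * h)   # O(h^2) central difference
h = 1e-3
richardson = (4 * D(h / 2) - D(h)) / 3       # cancels the O(h^2) error term
assert np.isclose(richardson, np.trace(np.linalg.inv(X) @ Y))
```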

D.2.1 algebraic

∇_x x = ∇_x x^T = I ∈ R^{k×k}            ∇_X X = ∇_X X^T ≜ I ∈ R^{K×L×K×L}   (Identity)
∇_x 1^T x = ∇_x x^T 1 = 1 ∈ R^k          ∇_X 1^T X 1 = ∇_X 1^T X^T 1 = 1 1^T ∈ R^{K×L}

∇_x (Ax − b) = A^T
∇_x (x^T A − b^T) = A
∇_x (Ax − b)^T (Ax − b) = 2 A^T (Ax − b)
∇²_x (Ax − b)^T (Ax − b) = 2 A^T A
∇_x √( (Ax − b)^T (Ax − b) ) = A^T (Ax − b) / ‖Ax − b‖₂ = ∇_x ‖Ax − b‖₂
∇_x z^T |Ax − b| = A^T δ(z) sgn(Ax − b) ,   z_i ≠ 0 ⇒ (Ax − b)_i ≠ 0
∇_x 1^T |Ax − b| = A^T sgn(Ax − b) = ∇_x ‖Ax − b‖₁
∇_x 1^T f(|Ax − b|) = A^T δ( df(y)/dy |_{y=|Ax−b|} ) sgn(Ax − b)

∇_x ( x^T A x + 2 x^T B y + y^T C y ) = (A + A^T) x + 2 B y
∇_x (x + y)^T A (x + y) = (A + A^T)(x + y)
∇²_x ( x^T A x + 2 x^T B y + y^T C y ) = A + A^T

∇_X a^T X b = ∇_X b^T X^T a = a b^T
∇_X a^T X² b = X^T a b^T + a b^T X^T
∇_X a^T X^{−1} b = −X^{−T} a b^T X^{−T}
∇_X (X^{−1})_kl = ∂X^{−1}/∂X_kl = −X^{−1} e_k e_l^T X^{−1} ,   confer (2095)(2159)

∇_x a^T x^T x b = 2 x a^T b              ∇_X a^T X^T X b = X ( a b^T + b a^T )
∇_x a^T x x^T b = ( a b^T + b a^T ) x    ∇_X a^T X X^T b = ( a b^T + b a^T ) X
∇_x a^T x^T x a = 2 x a^T a              ∇_X a^T X^T X a = 2 X a a^T
∇_x a^T x x^T a = 2 a a^T x              ∇_X a^T X X^T a = 2 a a^T X
∇_x a^T y x^T b = b a^T y                ∇_X a^T Y X^T b = b a^T Y
∇_x a^T y^T x b = y b^T a                ∇_X a^T Y^T X b = Y a b^T
∇_x a^T x y^T b = a b^T y                ∇_X a^T X Y^T b = a b^T Y
∇_x a^T x^T y b = y a^T b                ∇_X a^T X^T Y b = Y b a^T

algebraic continued

d/dt (X + t Y) = Y

d/dt B^T (X + t Y)^{−1} A = −B^T (X + t Y)^{−1} Y (X + t Y)^{−1} A
d/dt B^T (X + t Y)^{−T} A = −B^T (X + t Y)^{−T} Y^T (X + t Y)^{−T} A
d/dt B^T (X + t Y)^μ A = … ,   −1 ≤ μ ≤ 1 ,   X, Y ∈ S^M_+
d²/dt² B^T (X + t Y)^{−1} A = 2 B^T (X + t Y)^{−1} Y (X + t Y)^{−1} Y (X + t Y)^{−1} A
d³/dt³ B^T (X + t Y)^{−1} A = −6 B^T (X + t Y)^{−1} Y (X + t Y)^{−1} Y (X + t Y)^{−1} Y (X + t Y)^{−1} A

d/dt ( (X + t Y)^T A (X + t Y) ) = Y^T A X + X^T A Y + 2 t Y^T A Y
d²/dt² ( (X + t Y)^T A (X + t Y) ) = 2 Y^T A Y
d/dt ( (X + t Y)^T A (X + t Y) )^{−1}
  = −( (X + t Y)^T A (X + t Y) )^{−1} ( Y^T A X + X^T A Y + 2 t Y^T A Y ) ( (X + t Y)^T A (X + t Y) )^{−1}
d/dt ( (X + t Y) A (X + t Y) ) = Y A X + X A Y + 2 t Y A Y
d²/dt² ( (X + t Y) A (X + t Y) ) = 2 Y A Y

D.2.2 trace Kronecker

∇_{vec X} tr(A X B X^T) = ∇_{vec X} vec(X)^T (B^T ⊗ A) vec X = (B^T ⊗ A + B ⊗ A^T) vec X

∇²_{vec X} tr(A X B X^T) = ∇²_{vec X} vec(X)^T (B^T ⊗ A) vec X = B ⊗ A^T + B^T ⊗ A    (2076)

D.2.3 trace

∇_x μx = μI                                  ∇_X tr μX = ∇_X μ tr X = μI
∇_x 1^T δ(x)^{−1} 1 = d/dx x^{−1} = −x^{−2}   ∇_X tr X^{−1} = −X^{−2T}
∇_x 1^T δ(x)^{−1} y = −δ(x)^{−2} y           ∇_X tr(X^{−1} Y) = ∇_X tr(Y X^{−1}) = −X^{−T} Y^T X^{−T}
d/dx x^μ = μ x^{μ−1}                         ∇_X tr X^μ = μ X^{μ−1} ,   X ∈ S^M
                                             ∇_X tr X^j = j X^{(j−1)T}

∇_x (b − a^T x)^{−1} = (b − a^T x)^{−2} a    ∇_X tr( (B − A X)^{−1} ) = ( (B − A X)^{−2} A )^T
∇_x (b − a^T x)^μ = −μ (b − a^T x)^{μ−1} a
∇_x x^T y = ∇_x y^T x = y                    ∇_X tr(X^T Y) = ∇_X tr(Y^T X) = ∇_X tr(Y X^T) = ∇_X tr(X Y^T) = Y
∇_x x^T x = 2x                               ∇_X tr(X^T X) = ∇_X tr(X X^T) = 2X

∇_X tr(A X B X^T) = ∇_X tr(X B X^T A) = A^T X B^T + A X B
∇_X tr(A X B X) = ∇_X tr(X B X A) = A^T X^T B^T + B^T X^T A^T
∇_X tr(A X A X A X A X) = ∇_X tr(X A X A X A X A) = 4 (A X A X A X A)^T
∇_X tr(A X A X A X) = ∇_X tr(X A X A X A) = 3 (A X A X A)^T
∇_X tr(A X A X) = ∇_X tr(X A X A) = 2 (A X A)^T
∇_X tr(A X) = ∇_X tr(X A) = A^T
∇_X tr(Y X^k) = ∇_X tr(X^k Y) = Σ_{i=0}^{k−1} ( X^i Y X^{k−1−i} )^T
∇_X tr(X^T Y Y^T X X^T Y Y^T X) = 4 Y Y^T X X^T Y Y^T X
∇_X tr(X Y Y^T X^T X Y Y^T X^T) = 4 X Y Y^T X^T X Y Y^T
∇_X tr(Y^T X X^T Y) = ∇_X tr(X^T Y Y^T X) = 2 Y Y^T X
∇_X tr(Y^T X^T X Y) = ∇_X tr(X Y Y^T X^T) = 2 X Y Y^T
∇_X tr( (X + Y)^T (X + Y) ) = 2 (X + Y) = ∇_X ‖X + Y‖²_F
∇_X tr( (X + Y)(X + Y) ) = 2 (X + Y)^T
∇_X tr(A^T X B) = ∇_X tr(X^T A B^T) = A B^T
∇_X tr(A^T X^{−1} B) = ∇_X tr(X^{−T} A B^T) = −X^{−T} A B^T X^{−T}
∇_X a^T X b = ∇_X tr(b a^T X) = ∇_X tr(X b a^T) = a b^T
∇_X b^T X^T a = ∇_X tr(X^T a b^T) = ∇_X tr(a b^T X^T) = a b^T
∇_X a^T X^{−1} b = ∇_X tr(X^{−T} a b^T) = −X^{−T} a b^T X^{−T}
∇_X a^T X^μ b = …

trace continued

d/dt tr g(X + t Y) = tr( d/dt g(X + t Y) )   [273, p.491]

d/dt tr(X + t Y) = tr Y

d/dt tr^j (X + t Y) = j tr^{j−1}(X + t Y) tr Y

d/dt tr( (X + t Y)^j ) = j tr( (X + t Y)^{j−1} Y )   (∀ j)

d/dt tr( (X + t Y) Y ) = tr Y²

d/dt tr( (X + t Y)^k Y ) = d/dt tr( Y (X + t Y)^k ) = k tr( (X + t Y)^{k−1} Y² ) ,   k ∈ {0, 1, 2}

d/dt tr( (X + t Y)^k Y ) = d/dt tr( Y (X + t Y)^k ) = tr( Σ_{i=0}^{k−1} (X + t Y)^i Y (X + t Y)^{k−1−i} Y )

d/dt tr( (X + t Y)^{−1} Y ) = −tr( (X + t Y)^{−1} Y (X + t Y)^{−1} Y )
d/dt tr( B^T (X + t Y)^{−1} A ) = −tr( B^T (X + t Y)^{−1} Y (X + t Y)^{−1} A )
d/dt tr( B^T (X + t Y)^{−T} A ) = −tr( B^T (X + t Y)^{−T} Y^T (X + t Y)^{−T} A )
d/dt tr( B^T (X + t Y)^{−k} A ) = … ,   k > 0
d/dt tr( B^T (X + t Y)^μ A ) = … ,   −1 ≤ μ ≤ 1 ,   X, Y ∈ S^M_+
d²/dt² tr( B^T (X + t Y)^{−1} A ) = 2 tr( B^T (X + t Y)^{−1} Y (X + t Y)^{−1} Y (X + t Y)^{−1} A )

d/dt tr( (X + t Y)^T A (X + t Y) ) = tr( Y^T A X + X^T A Y + 2 t Y^T A Y )
d²/dt² tr( (X + t Y)^T A (X + t Y) ) = 2 tr( Y^T A Y )
d/dt tr( ( (X + t Y)^T A (X + t Y) )^{−1} )
  = −tr( ( (X + t Y)^T A (X + t Y) )^{−1} ( Y^T A X + X^T A Y + 2 t Y^T A Y ) ( (X + t Y)^T A (X + t Y) )^{−1} )
d/dt tr( (X + t Y) A (X + t Y) ) = tr( Y A X + X A Y + 2 t Y A Y )
d²/dt² tr( (X + t Y) A (X + t Y) ) = 2 tr( Y A Y )

D.2.4 logarithmic determinant

x ≻ 0, det X > 0 on some neighborhood of X, and det(X + t Y) > 0 on some open interval of t; otherwise, log(·) would be discontinuous. [107, p.75]

d/dx log x = x^{−1}                          ∇_X log det X = X^{−T}

∇²_X log det(X)_kl = ∂X^{−T}/∂X_kl = −( X^{−1} e_k e_l^T X^{−1} )^T ,   confer (2112)(2159)

d/dx log x^{−1} = −x^{−1}                    ∇_X log det X^{−1} = −X^{−T}
d/dx log x^μ = μ x^{−1}                      ∇_X log det^μ X = μ X^{−T}
                                             ∇_X log det X^μ = μ X^{−T}
                                             ∇_X log det X^k = ∇_X log det^k X = k X^{−T}
                                             ∇_X log det^μ (X + t Y) = μ (X + t Y)^{−T}
∇_x log(a^T x + b) = a (a^T x + b)^{−1}      ∇_X log det(A X + B) = A^T (A X + B)^{−T}

∇_X log det(I ± A^T X A) = ±A (I ± A^T X A)^{−T} A^T
∇_X log det(X + t Y)^k = ∇_X log det^k (X + t Y) = k (X + t Y)^{−T}

d/dt log det(X + t Y) = tr( (X + t Y)^{−1} Y )
d²/dt² log det(X + t Y) = −tr( (X + t Y)^{−1} Y (X + t Y)^{−1} Y )
d/dt log det(X + t Y)^{−1} = −tr( (X + t Y)^{−1} Y )
d²/dt² log det(X + t Y)^{−1} = tr( (X + t Y)^{−1} Y (X + t Y)^{−1} Y )
d/dt log det( δ(A(x + t y) + a)² + μ I )
  = tr( ( δ(A(x + t y) + a)² + μ I )^{−1} 2 δ(A(x + t y) + a) δ(A y) )

D.2.5 determinant

∇_X det X = ∇_X det X^T = det(X) X^{−T}
∇_X det X^{−1} = −det(X^{−1}) X^{−T} = −det(X)^{−1} X^{−T}
∇_X det^μ X = μ det^μ(X) X^{−T}
∇_X det X^μ = μ det(X^μ) X^{−T}
∇_X det^k X = k det^{k−1}(X) ( tr(X) I − X^T ) ,   X ∈ R^{2×2}
∇_X det X^k = ∇_X det^k X = k det(X^k) X^{−T} = k det^k(X) X^{−T}
∇_X det^μ (X + t Y) = μ det^μ(X + t Y) (X + t Y)^{−T}
∇_X det(X + t Y)^k = ∇_X det^k (X + t Y) = k det^k(X + t Y) (X + t Y)^{−T}

d/dt det(X + t Y) = det(X + t Y) tr( (X + t Y)^{−1} Y )
d²/dt² det(X + t Y) = det(X + t Y) ( tr²( (X + t Y)^{−1} Y ) − tr( (X + t Y)^{−1} Y (X + t Y)^{−1} Y ) )
d/dt det(X + t Y)^{−1} = −det(X + t Y)^{−1} tr( (X + t Y)^{−1} Y )
d²/dt² det(X + t Y)^{−1} = det(X + t Y)^{−1} ( tr²( (X + t Y)^{−1} Y ) + tr( (X + t Y)^{−1} Y (X + t Y)^{−1} Y ) )
d/dt det^μ (X + t Y) = μ det^μ(X + t Y) tr( (X + t Y)^{−1} Y )
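A quick numerical check of the first derivative entry above (Jacobi's formula), assuming NumPy; data are arbitrary but X is kept well conditioned:

```python
# d/dt det(X + tY) at t = 0 equals det(X) tr(inv(X) Y).
import numpy as np

rng = np.random.default_rng(9)
X = np.eye(3) + 0.2 * rng.standard_normal((3, 3))
Y = rng.standard_normal((3, 3))
h = 1e-6

fd = (np.linalg.det(X + h*Y) - np.linalg.det(X - h*Y)) / (2*h)
assert np.isclose(fd, np.linalg.det(X) * np.trace(np.linalg.inv(X) @ Y))
```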

D.2.6 logarithmic

Matrix logarithm.

d/dt log(X + t Y)^μ = μ Y (X + t Y)^{−1} = μ (X + t Y)^{−1} Y ,   X Y = Y X

d/dt log(I − t Y)^μ = −μ Y (I − t Y)^{−1} = −μ (I − t Y)^{−1} Y   [273, p.493]

D.2.7 exponential

Matrix exponential. [98, §3.6, §4.5] [440, §5.4]

∇_X e^{tr(Y^T X)} = ∇_X det e^{Y^T X} = e^{tr(Y^T X)} Y   (∀ X, Y)
∇_X tr e^{Y X} = e^{Y^T X^T} Y^T = Y^T e^{X^T Y^T}   (∀ X, Y)
∇_X tr( A e^{Y X} ) = …

∇_x 1^T e^{Ax} = A^T e^{Ax}
∇_x 1^T e^{|Ax|} = A^T δ( sgn(Ax) ) e^{|Ax|}   ( (Ax)_i ≠ 0 )

∇_x log(1^T e^x) = e^x / (1^T e^x)
∇²_x log(1^T e^x) = ( 1/(1^T e^x) ) ( δ(e^x) − ( 1/(1^T e^x) ) e^x e^{x T} )

∇_x ( Π_{i=1}^k x_i )^{1/k} = (1/k) ( Π_{i=1}^k x_i )^{1/k} (1/x)
∇²_x ( Π_{i=1}^k x_i )^{1/k} = −(1/k) ( Π_{i=1}^k x_i )^{1/k} ( δ(x)^{−2} − (1/k)(1/x)(1/x)^T )

d/dt e^{t Y} = e^{t Y} Y = Y e^{t Y}
d/dt e^{X + t Y} = e^{X + t Y} Y = Y e^{X + t Y} ,   X Y = Y X
d²/dt² e^{X + t Y} = e^{X + t Y} Y² = Y e^{X + t Y} Y = Y² e^{X + t Y} ,   X Y = Y X
d^j/dt^j e^{tr(X + t Y)} = e^{tr(X + t Y)} tr^j(Y)
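A numerical check of the entry d/dt e^{tY} = e^{tY} Y = Y e^{tY}, assuming NumPy and SciPy's expm:

```python
# Central difference of t -> expm(tY) versus expm(tY) @ Y.
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(10)
Y = rng.standard_normal((3, 3))
t, h = 0.7, 1e-6

fd = (expm((t + h) * Y) - expm((t - h) * Y)) / (2 * h)
assert np.allclose(fd, expm(t * Y) @ Y, atol=1e-6)
assert np.allclose(expm(t * Y) @ Y, Y @ expm(t * Y))  # exp(tY) commutes with Y
```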

D.2.7.0.1 Exercise. Expand these tables.
Provide four unfinished table entries indicated by … in §D.2.1 & §D.2.3. H

D.2.7.0.2 Exercise. log. (§D.1.7, §3.5.4)
Find the first four terms of the Taylor series expansion for log x about x = 1. Plot the supporting hyperplane to the hypograph of log x at [x ; log x] = [1 ; 0]. Prove log x ≤ x − 1. H