Properties of the Trace and Matrix Derivatives

Properties of the Trace and Matrix Derivatives John Duchi Contents 1 Notation 1 2 Matrix multiplication 1 3 Gradient of linear function 1 4 Derivative in a trace 2 5 Derivative of product in trace 2 6 Derivative of function of a matrix 3 7 Derivative of linear transformed input to function 3 8 Funky trace derivative 3 9 Symmetric Matrices and Eigenvectors 4 1 Notation A few things on notation (which may not be very consistent, actually): The columns of a matrix m×n T T A ∈ R are a1 through an, while the rows are given (as vectors) bya ˜1 throughta ˜m. 2 Matrix multiplication First, consider a matrix A ∈ Rn×n. We have that Xn T T AA = aiai , i=1 that is, that the product of AAT is the sum of the outer products of the columns of A. To see this, consider that Xn T (AA )ij = apiapj p=1 th because the i, j element is the i row of A, which is the vector ha1i, a2i, ··· , anii, dotted with the th T j column of A , which is ha1j, ··· , anj. 1 If we look at the matrix AAT , we see that 2 P P 3 2 3 n a a ··· n a a a a ··· a a p=1 p1 p1 p=1 p1 pn Xn i1 i1 i1 in Xn T 6 . 7 6 . 7 T AA = 4 . .. 5 = 4 . .. 5 = aiai Pn Pn i=1 i=1 p=1 apnap1 ··· p=1 apnapn ainai1 ··· ainain 3 Gradient of linear function Consider Ax, where A ∈ Rm×n and x ∈ Rn. We have 2 T 3 ∇xa˜1 x T 6 ∇xa˜ x 7 6 2 7 £ ¤ T ∇xAx = 6 . 7 = a˜1 a˜2 ··· a˜m = A 4 . 5 T ∇xa˜mx Now let us consider xT Ax for A ∈ Rn×n, x ∈ Rn. We have that T T T T T T T T x Ax = x [ã1 x a˜2 x ··· añ x] = x1a˜1 x + ··· + xnañ x If we take the derivative with respect to one of the xls, we have the l component for eacha ˜i, which T is to say ail, and the term for xla˜l x, which gives us that ∂ Xn xT Ax = x a +a ˜T x = aT x +a ˜T x. ∂x i il l l l l i=1 In the end, we see that T T ∇xx Ax = Ax + A x. 4 Derivative in a trace Recall (as in Old and New Matrix Algebra Useful for Statistics) that we can define the differential of a function f(x) to be the part of f(x + dx) − f(x) that is linear in dx, i.e. is a constant times dx. Then, for example, for a vector valued function f, we can have f(x + dx) = f(x) + f 0(x)dx + (higher order terms). In the above, f 0 is the derivative (or Jacobian). Note that the gradient is the transpose of the Jacobian. Consider an arbitrary matrix A. We see that 2 T 3 a˜1 dx1 6 . 7 tr 4 .. 5 T Pn T tr(AdX) a˜ dxn a˜ dx = n = i=1 i i . dX dX dX Thus, we have · ¸ ·Pn T ¸ tr(AdX) i=1 a˜i dxi = = aij dX ij ∂xji so that tr(AdX) = A. dX Note that this is the Jacobian formulation. 2 5 Derivative of product in trace In this section, we prove that T ∇AtrAB = B 2 3 ←− ~a1 −→ 2 3 6 7 ↑ ↑ ↑ 6 ←− ~a2 −→ 7 trAB = tr 6 . 7 4 b~ b~ ··· b~ 5 4 . 5 1 2 n . ↓ ↓ ↓ ←− ~an −→ 2 T ~ T ~ T ~ 3 ~a1 b1 ~a1 b2 ··· ~a1 bn 6 T ~ T ~ T ~ 7 6 ~a2 b1 ~a2 b2 ··· ~a2 bn 7 = tr 6 . 7 4 . .. 5 T ~ T ~ T ~ ~an b1 ~an b2 ··· ~an bn Xm Xm Xm = a1ibi1 + a2ibi2 + ... + anibin i=1 i=1 i=1 ∂trAB ⇒ = bji ∂aij T ⇒ ∇AtrAB = B 6 Derivative of function of a matrix Here we prove that T ∇AT f(A) = (∇Af(A)) . 2 3 ∂f(A) ∂f(A) ··· ∂f(A) 6 ∂A11 ∂A21 ∂An1 7 6 ∂f(A) ∂f(A) ··· ∂f(A) 7 6 ∂A12 ∂A22 ∂An2 7 ∇AT f(A) = 6 . 7 4 . .. 5 ∂f(A) ∂f(A) ··· ∂f(A) ∂A1n ∂A2n ∂Ann T = (∇Af(A)) 7 Derivative of linear transformed input to function Consider a function f : Rn → R. Suppose we have a matrix A ∈ Rn×m and a vector x ∈ Rm. We wish to compute ∇xf(Ax). By the chain rule, we have ∂f(Ax) Xn ∂f(Ax) ∂(Ax) Xn ∂f(Ax) ∂(ãT x) = · k = · k ∂xi ∂(Ax)k ∂xi ∂(Ax)k ∂xi k=1 k=1 Xn ∂f(Ax) Xn = · aki = ∂kf(Ax)aki ∂(Ax)k k=1 k=1 T = ai ∇f(Ax). 3 T As such, ∇xf(Ax) = A ∇f(Ax). Now, if we would like to get the second derivative of this function (third derivatives would be a little nice, but I do not like tensors), we have 2 Xn ∂ f(Ax) ∂ T ∂ ∂f(Ax) = ai ∇f(Ax) = aki ∂xi∂xj ∂xj ∂xj ∂(Ax)k k=1 Xn Xn ∂2f(Ax) = aki ali ∂(Ax)k∂(Ax)l l=1 k=1 T 2 = ai ∇ f(Ax)aj 2 T 2 From this, it is easy to see that ∇xf(Ax) = A ∇ f(Ax)A. 8 Funky trace derivative In this section, we prove that T T T ∇AtrABA C = CAB + C AB . In this bit, let us have AB = f(A), where f is matrix-valued. T T ∇AtrABA C = ∇Atrf(A)A C T T = ∇•trf(•)A C + ∇•trf(A) • C T T 0 T T = (A C) f (•) + (∇•T trf(A) • C) T T T T = C AB + (∇•T tr • Cf(A)) = CT ABT + ((Cf(A))T )T = CT ABT + CAB 9 Symmetric Matrices and Eigenvectors In this we prove that for a symmetric matrix A ∈ Rn×n, all the eigenvalues are real, and that the eigenvectors of A form an orthonormal basis of Rn. First, we prove that the eigenvalues are real. Suppose one is complex: we have λx¯ T x = (Ax)T x = xT AT x = xT Ax = λxT x. Thus, all the eigenvalues are real. Now, we suppose we have at least one eigenvector v 6= 0 of A. Consider a space W of vectors orthogonal to v. We then have that, for w ∈ W , (Aw)T v = wT AT v = wT Av = λwT v = 0. Thus, we have a set of vectors W that, when transformed by A, are still orthogonal to v, so if we have an original eigenvector v of A, then a simple inductive argument shows that there is an orthonormal set of eigenvectors. To see that there is at least one eigenvector, consider the characteristic polynomial of A: X (A) = det(A − λI). The field is algebraicly closed, so there is at least one complex root r, so we have that A − rI is singular and there is a vector v 6= 0 that is an eigenvector of A. Thus r is a real eigenvalue, so we have the base case for our induction, and the proof is complete. 4.

Properties of the Trace and Matrix Derivatives

Estimations of the Trace of Powers of Positive Self-Adjoint Operators by Extrapolation of the Moments∗

The Mean Value Theorem Math 120 Calculus I Fall 2015

AP Calculus AB Topic List 1. Limits Algebraically 2. Limits Graphically 3

5 the Dirac Equation and Spinors

Multivariable and Vector Calculus

A Some Basic Rules of Tensor Calculus

18.700 JORDAN NORMAL FORM NOTES These Are Some Supplementary Notes on How to Find the Jordan Normal Form of a Small Matrix. Firs

Concavity and Points of Inflection We Now Know How to Determine Where a Function Is Increasing Or Decreasing

Calculus Terminology

Matrix Calculus

The Linear Algebra Version of the Chain Rule 1

Trace Inequalities for Matrices