Properties of the Trace and Matrix Derivatives
John Duchi
Contents
1 Notation 1
3 Gradient of linear function 1
4 Derivative in a trace 2
5 Derivative of product in trace 2
6 Derivative of function of a matrix 3
7 Derivative of linear transformed input to function 3
8 Funky trace derivative 3
9 Symmetric Matrices and Eigenvectors 4
1 Notation
A few things on notation (which may not be very consistent, actually): The columns of a matrix m×n T T A ∈ R are a1 through an, while the rows are given (as vectors) bya ˜1 throughta ˜m.
2 Matrix multiplication
First, consider a matrix A ∈ Rn×n. We have that Xn T T AA = aiai , i=1 that is, that the product of AAT is the sum of the outer products of the columns of A. To see this, consider that Xn T (AA )ij = apiapj p=1
th because the i, j element is the i row of A, which is the vector ha1i, a2i, ··· , anii, dotted with the th T j column of A , which is ha1j, ··· , anj.
1 If we look at the matrix AAT , we see that P P n a a ··· n a a a a ··· a a p=1 p1 p1 p=1 p1 pn Xn i1 i1 i1 in Xn T . . . . . . T AA = . .. . = . .. . = aiai Pn Pn i=1 i=1 p=1 apnap1 ··· p=1 apnapn ainai1 ··· ainain 3 Gradient of linear function
Consider Ax, where A ∈ Rm×n and x ∈ Rn. We have
T ∇xa˜1 x T ∇xa˜ x 2 £ ¤ T ∇xAx = . = a˜1 a˜2 ··· a˜m = A . T ∇xa˜mx Now let us consider xT Ax for A ∈ Rn×n, x ∈ Rn. We have that
T T T T T T T T x Ax = x [˜a1 x a˜2 x ··· a˜n x] = x1a˜1 x + ··· + xna˜n x
If we take the derivative with respect to one of the xls, we have the l component for eacha ˜i, which T is to say ail, and the term for xla˜l x, which gives us that ∂ Xn xT Ax = x a +a ˜T x = aT x +a ˜T x. ∂x i il l l l l i=1 In the end, we see that T T ∇xx Ax = Ax + A x.
4 Derivative in a trace
Recall (as in Old and New Matrix Algebra Useful for Statistics) that we can define the differential of a function f(x) to be the part of f(x + dx) − f(x) that is linear in dx, i.e. is a constant times dx. Then, for example, for a vector valued function f, we can have
f(x + dx) = f(x) + f 0(x)dx + (higher order terms).
In the above, f 0 is the derivative (or Jacobian). Note that the gradient is the transpose of the Jacobian. Consider an arbitrary matrix A. We see that
T a˜1 dx1 . tr .. T Pn T tr(AdX) a˜ dxn a˜ dx = n = i=1 i i . dX dX dX Thus, we have · ¸ ·Pn T ¸ tr(AdX) i=1 a˜i dxi = = aij dX ij ∂xji so that tr(AdX) = A. dX Note that this is the Jacobian formulation.
2 5 Derivative of product in trace
In this section, we prove that T ∇AtrAB = B
←− ~a1 −→ ↑ ↑ ↑ ←− ~a2 −→ trAB = tr . b~ b~ ··· b~ . 1 2 n . ↓ ↓ ↓ ←− ~an −→ T ~ T ~ T ~ ~a1 b1 ~a1 b2 ··· ~a1 bn T ~ T ~ T ~ ~a2 b1 ~a2 b2 ··· ~a2 bn = tr . . . . .. . T ~ T ~ T ~ ~an b1 ~an b2 ··· ~an bn Xm Xm Xm = a1ibi1 + a2ibi2 + ... + anibin i=1 i=1 i=1 ∂trAB ⇒ = bji ∂aij T ⇒ ∇AtrAB = B
6 Derivative of function of a matrix
Here we prove that T ∇AT f(A) = (∇Af(A)) . ∂f(A) ∂f(A) ··· ∂f(A) ∂A11 ∂A21 ∂An1 ∂f(A) ∂f(A) ··· ∂f(A) ∂A12 ∂A22 ∂An2 ∇AT f(A) = . . . . . . .. . ∂f(A) ∂f(A) ··· ∂f(A) ∂A1n ∂A2n ∂Ann T = (∇Af(A))
7 Derivative of linear transformed input to function
Consider a function f : Rn → R. Suppose we have a matrix A ∈ Rn×m and a vector x ∈ Rm. We wish to compute ∇xf(Ax). By the chain rule, we have
∂f(Ax) Xn ∂f(Ax) ∂(Ax) Xn ∂f(Ax) ∂(˜aT x) = · k = · k ∂xi ∂(Ax)k ∂xi ∂(Ax)k ∂xi k=1 k=1 Xn ∂f(Ax) Xn = · aki = ∂kf(Ax)aki ∂(Ax)k k=1 k=1 T = ai ∇f(Ax).
3 T As such, ∇xf(Ax) = A ∇f(Ax). Now, if we would like to get the second derivative of this function (third derivatives would be a little nice, but I do not like tensors), we have
2 Xn ∂ f(Ax) ∂ T ∂ ∂f(Ax) = ai ∇f(Ax) = aki ∂xi∂xj ∂xj ∂xj ∂(Ax)k k=1 Xn Xn ∂2f(Ax) = aki ali ∂(Ax)k∂(Ax)l l=1 k=1 T 2 = ai ∇ f(Ax)aj
2 T 2 From this, it is easy to see that ∇xf(Ax) = A ∇ f(Ax)A.
8 Funky trace derivative
In this section, we prove that
T T T ∇AtrABA C = CAB + C AB .
In this bit, let us have AB = f(A), where f is matrix-valued.
T T ∇AtrABA C = ∇Atrf(A)A C T T = ∇•trf(•)A C + ∇•trf(A) • C T T 0 T T = (A C) f (•) + (∇•T trf(A) • C) T T T T = C AB + (∇•T tr • Cf(A)) = CT ABT + ((Cf(A))T )T = CT ABT + CAB
9 Symmetric Matrices and Eigenvectors
In this we prove that for a symmetric matrix A ∈ Rn×n, all the eigenvalues are real, and that the eigenvectors of A form an orthonormal basis of Rn. First, we prove that the eigenvalues are real. Suppose one is complex: we have
λx¯ T x = (Ax)T x = xT AT x = xT Ax = λxT x.
Thus, all the eigenvalues are real. Now, we suppose we have at least one eigenvector v 6= 0 of A. Consider a space W of vectors orthogonal to v. We then have that, for w ∈ W ,
(Aw)T v = wT AT v = wT Av = λwT v = 0.
Thus, we have a set of vectors W that, when transformed by A, are still orthogonal to v, so if we have an original eigenvector v of A, then a simple inductive argument shows that there is an orthonormal set of eigenvectors. To see that there is at least one eigenvector, consider the characteristic polynomial of A:
X (A) = det(A − λI).
The field is algebraicly closed, so there is at least one complex root r, so we have that A − rI is singular and there is a vector v 6= 0 that is an eigenvector of A. Thus r is a real eigenvalue, so we have the base case for our induction, and the proof is complete.
4