
Properties of the Trace and Matrix Derivatives

John Duchi

Contents

1 Notation
2 Matrix multiplication
3 Gradient of linear function
4 Derivative in a trace
5 Derivative of product in trace
6 Derivative of function of a matrix
7 Derivative of linear transformed input to function
8 Funky trace derivative
9 Symmetric Matrices and Eigenvectors

1 Notation

A few things on notation (which may not be very consistent, actually): the columns of a matrix $A \in \mathbb{R}^{m \times n}$ are $a_1$ through $a_n$, while the rows are given (as vectors) by $\tilde{a}_1^T$ through $\tilde{a}_m^T$.

2 Matrix multiplication

First, consider a matrix $A \in \mathbb{R}^{n \times n}$. We have that

\[
AA^T = \sum_{i=1}^n a_i a_i^T,
\]

that is, the product $AA^T$ is the sum of the outer products of the columns of $A$. To see this, consider that

\[
(AA^T)_{ij} = \sum_{p=1}^n a_{ip} a_{jp}
\]

because the $i,j$ element is the $i$th row of $A$, which is the vector $\langle a_{i1}, a_{i2}, \ldots, a_{in} \rangle$, dotted with the $j$th column of $A^T$, which is $\langle a_{j1}, \ldots, a_{jn} \rangle$.

If we look at the full matrix $AA^T$, we see that

\[
AA^T =
\begin{bmatrix}
\sum_{p=1}^n a_{1p} a_{1p} & \cdots & \sum_{p=1}^n a_{1p} a_{np} \\
\vdots & \ddots & \vdots \\
\sum_{p=1}^n a_{np} a_{1p} & \cdots & \sum_{p=1}^n a_{np} a_{np}
\end{bmatrix}
= \sum_{i=1}^n
\begin{bmatrix}
a_{1i} a_{1i} & \cdots & a_{1i} a_{ni} \\
\vdots & \ddots & \vdots \\
a_{ni} a_{1i} & \cdots & a_{ni} a_{ni}
\end{bmatrix}
= \sum_{i=1}^n a_i a_i^T.
\]

3 Gradient of linear function

Consider $Ax$, where $A \in \mathbb{R}^{m \times n}$ and $x \in \mathbb{R}^n$. We have

\[
\nabla_x Ax =
\begin{bmatrix}
\nabla_x \tilde{a}_1^T x \\
\nabla_x \tilde{a}_2^T x \\
\vdots \\
\nabla_x \tilde{a}_m^T x
\end{bmatrix}
=
\begin{bmatrix}
\tilde{a}_1 & \tilde{a}_2 & \cdots & \tilde{a}_m
\end{bmatrix}
= A^T.
\]

Now let us consider $x^T A x$ for $A \in \mathbb{R}^{n \times n}$, $x \in \mathbb{R}^n$. We have that

\[
x^T A x = x^T \begin{bmatrix} \tilde{a}_1^T x \\ \tilde{a}_2^T x \\ \vdots \\ \tilde{a}_n^T x \end{bmatrix}
= x_1 \tilde{a}_1^T x + \cdots + x_n \tilde{a}_n^T x.
\]

If we take the derivative with respect to one of the $x_l$'s, we have the $l$th component of each $\tilde{a}_i$, which is to say $a_{il}$, plus the term from differentiating $x_l \tilde{a}_l^T x$ itself, which gives us

\[
\frac{\partial}{\partial x_l} x^T A x = \sum_{i=1}^n x_i a_{il} + \tilde{a}_l^T x = a_l^T x + \tilde{a}_l^T x.
\]

In the end, we see that

\[
\nabla_x x^T A x = A x + A^T x.
\]

4 Derivative in a trace

Recall (as in Old and New Matrix Algebra Useful for Statistics) that we can define the differential of a function $f(x)$ to be the part of $f(x + dx) - f(x)$ that is linear in $dx$, i.e., a constant times $dx$. Then, for example, for a vector-valued function $f$, we can have

\[
f(x + dx) = f(x) + f'(x)\, dx + (\text{higher order terms}).
\]

In the above, $f'$ is the derivative (or Jacobian). Note that the gradient is the transpose of the Jacobian.

Consider an arbitrary matrix $A$. Writing $dx_i$ for the columns of $dX$, the diagonal entries of $A\,dX$ are $\tilde{a}_i^T dx_i$, so we see that

\[
\frac{\operatorname{tr}(A\,dX)}{dX}
= \frac{\operatorname{tr}
\begin{bmatrix}
\tilde{a}_1^T dx_1 & & \\
& \ddots & \\
& & \tilde{a}_n^T dx_n
\end{bmatrix}}{dX}
= \frac{\sum_{i=1}^n \tilde{a}_i^T dx_i}{dX}.
\]

Thus, we have

\[
\left[ \frac{\operatorname{tr}(A\,dX)}{dX} \right]_{ij}
= \frac{\partial \sum_{k=1}^n \tilde{a}_k^T dx_k}{\partial x_{ji}}
= a_{ij},
\]

so that

\[
\frac{\operatorname{tr}(A\,dX)}{dX} = A.
\]

Note that this is the Jacobian formulation.

5 Derivative of product in trace

In this section, we prove that, for $A \in \mathbb{R}^{n \times m}$ and $B \in \mathbb{R}^{m \times n}$,

\[
\nabla_A \operatorname{tr} AB = B^T.
\]

Writing $\vec{a}_i^T$ for the rows of $A$ and $\vec{b}_j$ for the columns of $B$,

\begin{align*}
\operatorname{tr} AB
&= \operatorname{tr}\left(
\begin{bmatrix}
\text{---} & \vec{a}_1^T & \text{---} \\
\text{---} & \vec{a}_2^T & \text{---} \\
& \vdots & \\
\text{---} & \vec{a}_n^T & \text{---}
\end{bmatrix}
\begin{bmatrix}
| & | & & | \\
\vec{b}_1 & \vec{b}_2 & \cdots & \vec{b}_n \\
| & | & & |
\end{bmatrix}
\right) \\
&= \operatorname{tr}
\begin{bmatrix}
\vec{a}_1^T \vec{b}_1 & \vec{a}_1^T \vec{b}_2 & \cdots & \vec{a}_1^T \vec{b}_n \\
\vec{a}_2^T \vec{b}_1 & \vec{a}_2^T \vec{b}_2 & \cdots & \vec{a}_2^T \vec{b}_n \\
\vdots & & \ddots & \vdots \\
\vec{a}_n^T \vec{b}_1 & \vec{a}_n^T \vec{b}_2 & \cdots & \vec{a}_n^T \vec{b}_n
\end{bmatrix} \\
&= \sum_{i=1}^m a_{1i} b_{i1} + \sum_{i=1}^m a_{2i} b_{i2} + \cdots + \sum_{i=1}^m a_{ni} b_{in}.
\end{align*}

Hence

\[
\frac{\partial \operatorname{tr} AB}{\partial a_{ij}} = b_{ji}
\quad\Longrightarrow\quad
\nabla_A \operatorname{tr} AB = B^T.
\]

6 Derivative of function of a matrix

Here we prove that

\[
\nabla_{A^T} f(A) = \left( \nabla_A f(A) \right)^T.
\]

\[
\nabla_{A^T} f(A) =
\begin{bmatrix}
\frac{\partial f(A)}{\partial A_{11}} & \frac{\partial f(A)}{\partial A_{21}} & \cdots & \frac{\partial f(A)}{\partial A_{n1}} \\
\frac{\partial f(A)}{\partial A_{12}} & \frac{\partial f(A)}{\partial A_{22}} & \cdots & \frac{\partial f(A)}{\partial A_{n2}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f(A)}{\partial A_{1n}} & \frac{\partial f(A)}{\partial A_{2n}} & \cdots & \frac{\partial f(A)}{\partial A_{nn}}
\end{bmatrix}
= \left( \nabla_A f(A) \right)^T.
\]

7 Derivative of linear transformed input to function

Consider a function $f : \mathbb{R}^n \to \mathbb{R}$. Suppose we have a matrix $A \in \mathbb{R}^{n \times m}$ and a vector $x \in \mathbb{R}^m$. We wish to compute $\nabla_x f(Ax)$. By the chain rule, we have

\begin{align*}
\frac{\partial f(Ax)}{\partial x_i}
&= \sum_{k=1}^n \frac{\partial f(Ax)}{\partial (Ax)_k} \cdot \frac{\partial (Ax)_k}{\partial x_i}
= \sum_{k=1}^n \frac{\partial f(Ax)}{\partial (Ax)_k} \cdot \frac{\partial (\tilde{a}_k^T x)}{\partial x_i} \\
&= \sum_{k=1}^n \frac{\partial f(Ax)}{\partial (Ax)_k} \cdot a_{ki}
= \sum_{k=1}^n \partial_k f(Ax)\, a_{ki}
= a_i^T \nabla f(Ax).
\end{align*}

As such, $\nabla_x f(Ax) = A^T \nabla f(Ax)$.
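The identities above are easy to test numerically. The following is a minimal sketch (my addition, not part of the note) that checks $\nabla_x x^T A x = Ax + A^T x$ from Section 3 and $\nabla_A \operatorname{tr} AB = B^T$ from Section 5 against central finite differences; NumPy and the particular shapes are arbitrary choices.

```python
import numpy as np

# Finite-difference sanity check (not part of the original note) for two
# identities above: grad_x x^T A x = (A + A^T) x  and  grad_A tr(AB) = B^T.

rng = np.random.default_rng(0)
n, m = 4, 3
eps = 1e-6

# --- gradient of the quadratic form x^T A x ---
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

num_grad = np.zeros(n)
for i in range(n):
    dx = np.zeros(n)
    dx[i] = eps
    num_grad[i] = ((x + dx) @ A @ (x + dx) - (x - dx) @ A @ (x - dx)) / (2 * eps)
print(np.allclose(num_grad, A @ x + A.T @ x, atol=1e-4))  # True

# --- gradient of tr(AB) with respect to A ---
A2 = rng.standard_normal((n, m))
B = rng.standard_normal((m, n))

num_grad = np.zeros((n, m))
for i in range(n):
    for j in range(m):
        dA = np.zeros((n, m))
        dA[i, j] = eps
        num_grad[i, j] = (np.trace((A2 + dA) @ B) - np.trace((A2 - dA) @ B)) / (2 * eps)
print(np.allclose(num_grad, B.T, atol=1e-4))  # True
```

Central differences are accurate to $O(\varepsilon^2)$, so a tolerance of $10^{-4}$ with $\varepsilon = 10^{-6}$ is comfortable.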
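Similarly, the chain-rule identity just derived can be checked on a concrete $f$. Here is another small sketch (again my addition): it uses $f(y) = \log \sum_k e^{y_k}$, whose gradient is the softmax function, so $\nabla_x f(Ax)$ should equal $A^T \operatorname{softmax}(Ax)$.

```python
import numpy as np

# Check of Section 7's chain rule, a sketch rather than part of the note:
# for f(y) = log(sum(exp(y))) we know grad f(y) = softmax(y), so
# grad_x f(Ax) should equal A^T softmax(Ax).

rng = np.random.default_rng(1)
n, m = 5, 3
A = rng.standard_normal((n, m))
x = rng.standard_normal(m)
eps = 1e-6

f = lambda y: np.log(np.sum(np.exp(y)))
softmax = lambda y: np.exp(y) / np.sum(np.exp(y))

num_grad = np.array([
    (f(A @ (x + eps * e)) - f(A @ (x - eps * e))) / (2 * eps)
    for e in np.eye(m)
])
print(np.allclose(num_grad, A.T @ softmax(A @ x), atol=1e-4))  # True
```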
Now, if we would like the second derivative of $f(Ax)$ (third derivatives would be nice, but I do not like tensors), we have

\begin{align*}
\frac{\partial^2 f(Ax)}{\partial x_i \partial x_j}
&= \frac{\partial}{\partial x_j}\, a_i^T \nabla f(Ax)
= \frac{\partial}{\partial x_j} \sum_{k=1}^n a_{ki} \frac{\partial f(Ax)}{\partial (Ax)_k} \\
&= \sum_{l=1}^n \sum_{k=1}^n a_{ki} \frac{\partial^2 f(Ax)}{\partial (Ax)_k \partial (Ax)_l}\, a_{lj}
= a_i^T \nabla^2 f(Ax)\, a_j.
\end{align*}

From this, it is easy to see that $\nabla_x^2 f(Ax) = A^T \nabla^2 f(Ax) A$.

8 Funky trace derivative

In this section, we prove that

\[
\nabla_A \operatorname{tr}(A B A^T C) = C A B + C^T A B^T.
\]

In this bit, let us have $AB = f(A)$, where $f$ is matrix-valued. Differentiating with respect to each appearance of $A$ in turn (writing $\bullet$ for the copy being differentiated),

\begin{align*}
\nabla_A \operatorname{tr}(A B A^T C)
&= \nabla_A \operatorname{tr}(f(A) A^T C) \\
&= \nabla_\bullet \operatorname{tr}(f(\bullet) A^T C) + \nabla_\bullet \operatorname{tr}(f(A) \bullet^T C) \\
&= (A^T C)^T f'(\bullet) + \left( \nabla_{\bullet^T} \operatorname{tr}(f(A) \bullet^T C) \right)^T \\
&= C^T A B^T + \left( \nabla_{\bullet^T} \operatorname{tr}(\bullet^T C f(A)) \right)^T \\
&= C^T A B^T + \left( (C f(A))^T \right)^T \\
&= C^T A B^T + C A B.
\end{align*}

9 Symmetric Matrices and Eigenvectors

In this section we prove that for a symmetric matrix $A \in \mathbb{R}^{n \times n}$, all the eigenvalues are real, and the eigenvectors of $A$ form an orthonormal basis of $\mathbb{R}^n$.

First, we prove that the eigenvalues are real. Suppose $\lambda$ is a (possibly complex) eigenvalue with eigenvector $x \neq 0$; then we have

\[
\bar{\lambda} \bar{x}^T x = (\overline{Ax})^T x = \bar{x}^T A^T x = \bar{x}^T A x = \lambda \bar{x}^T x.
\]

Since $\bar{x}^T x > 0$, this forces $\bar{\lambda} = \lambda$. Thus, all the eigenvalues are real.

Now, suppose we have at least one eigenvector $v \neq 0$ of $A$. Consider the space $W$ of vectors orthogonal to $v$. We then have that, for $w \in W$,

\[
(Aw)^T v = w^T A^T v = w^T A v = \lambda w^T v = 0.
\]

Thus the vectors in $W$, when transformed by $A$, are still orthogonal to $v$, so $A$ maps $W$ to itself. So if we have an original eigenvector $v$ of $A$, a simple inductive argument on the restriction of $A$ to $W$ shows that there is an orthonormal set of eigenvectors.

To see that there is at least one eigenvector, consider the characteristic polynomial of $A$:

\[
\chi_A(\lambda) = \det(A - \lambda I).
\]

The field $\mathbb{C}$ is algebraically closed, so there is at least one complex root $r$, so $A - rI$ is singular and there is a vector $v \neq 0$ that is an eigenvector of $A$. By the argument above, $r$ is a real eigenvalue, so we have the base case for our induction, and the proof is complete.
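As a closing sanity check (my addition, not the note's), the sketch below verifies the funky trace identity of Section 8 by finite differences, and the claims of Section 9 with NumPy's eigensolvers.

```python
import numpy as np

# Numerical checks (not part of the original note) for the last two
# sections: grad_A tr(A B A^T C) = C A B + C^T A B^T, and the spectral
# theorem for a random symmetric matrix.

rng = np.random.default_rng(2)
n, m = 4, 3
eps = 1e-6

A = rng.standard_normal((n, m))
B = rng.standard_normal((m, m))
C = rng.standard_normal((n, n))

g = lambda M: np.trace(M @ B @ M.T @ C)
num_grad = np.zeros((n, m))
for i in range(n):
    for j in range(m):
        dA = np.zeros((n, m))
        dA[i, j] = eps
        num_grad[i, j] = (g(A + dA) - g(A - dA)) / (2 * eps)
print(np.allclose(num_grad, C @ A @ B + C.T @ A @ B.T, atol=1e-4))  # True

# Spectral theorem: S = S^T has real eigenvalues and orthonormal eigenvectors.
S = rng.standard_normal((n, n))
S = S + S.T
lam = np.linalg.eigvals(S)               # general solver, may return complex
print(np.allclose(np.imag(lam), 0))      # True: eigenvalues are real
_, V = np.linalg.eigh(S)                 # solver for symmetric/Hermitian input
print(np.allclose(V.T @ V, np.eye(n)))   # True: eigenvectors are orthonormal
```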