
Properties of the Trace and Matrix Derivatives

John Duchi

Contents

1 Notation
2 Matrix multiplication
3 Gradient of linear function
4 Derivative in a trace
5 Derivative of product in trace
6 Derivative of function of a matrix
7 Derivative of linear transformed input to function
8 Funky trace derivative
9 Symmetric Matrices and Eigenvectors

1 Notation

A few things on notation (which may not be very consistent, actually): The columns of a matrix $A \in \mathbb{R}^{m \times n}$ are $a_1$ through $a_n$, while the rows are given (as vectors) by $\tilde{a}_1^T$ through $\tilde{a}_m^T$.

2 Matrix multiplication

First, consider a matrix $A \in \mathbb{R}^{n \times n}$. We have that
$$AA^T = \sum_{i=1}^n a_i a_i^T,$$
that is, the product $AA^T$ is the sum of the outer products of the columns of $A$. To see this, consider that
$$(AA^T)_{ij} = \sum_{p=1}^n a_{pi} a_{pj}$$

th because the i, j element is the i row of A, which is the vector ha1i, a2i, ··· , anii, dotted with the th T j column of A , which is ha1j, ··· , anj.

If we look at the matrix $AA^T$, we see that
$$AA^T = \begin{bmatrix} \sum_{p=1}^n a_{p1}a_{p1} & \cdots & \sum_{p=1}^n a_{p1}a_{pn} \\ \vdots & \ddots & \vdots \\ \sum_{p=1}^n a_{pn}a_{p1} & \cdots & \sum_{p=1}^n a_{pn}a_{pn} \end{bmatrix} = \sum_{i=1}^n \begin{bmatrix} a_{i1}a_{i1} & \cdots & a_{i1}a_{in} \\ \vdots & \ddots & \vdots \\ a_{in}a_{i1} & \cdots & a_{in}a_{in} \end{bmatrix} = \sum_{i=1}^n a_i a_i^T$$
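This is easy to check numerically; here is a minimal NumPy sketch (not part of the original note; the size and random seed are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 4))

    # Sum of the outer products of the columns of A.
    outer_sum = sum(np.outer(A[:, i], A[:, i]) for i in range(A.shape[1]))

    # Should agree with A A^T up to floating-point error.
    assert np.allclose(A @ A.T, outer_sum)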

3 Gradient of linear function

Consider $Ax$, where $A \in \mathbb{R}^{m \times n}$ and $x \in \mathbb{R}^n$. We have

$$\nabla_x Ax = \begin{bmatrix} \nabla_x \tilde{a}_1^T x & \nabla_x \tilde{a}_2^T x & \cdots & \nabla_x \tilde{a}_m^T x \end{bmatrix} = \begin{bmatrix} \tilde{a}_1 & \tilde{a}_2 & \cdots & \tilde{a}_m \end{bmatrix} = A^T.$$
Now let us consider $x^T Ax$ for $A \in \mathbb{R}^{n \times n}$, $x \in \mathbb{R}^n$. We have that

$$x^T Ax = x^T \begin{bmatrix} \tilde{a}_1^T x & \tilde{a}_2^T x & \cdots & \tilde{a}_n^T x \end{bmatrix}^T = x_1 \tilde{a}_1^T x + \cdots + x_n \tilde{a}_n^T x$$

If we take the derivative with respect to one of the $x_l$'s, we have the $l$th component for each $\tilde{a}_i$, which is to say $a_{il}$, and the term for $x_l \tilde{a}_l^T x$, which gives us
$$\frac{\partial}{\partial x_l} x^T Ax = \sum_{i=1}^n x_i a_{il} + \tilde{a}_l^T x = a_l^T x + \tilde{a}_l^T x.$$
In the end, we see that
$$\nabla_x x^T Ax = Ax + A^T x.$$
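A quick numerical sanity check of $\nabla_x x^T Ax = Ax + A^T x$, using central differences (a sketch; the matrix and the point are arbitrary):

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((5, 5))
    x = rng.standard_normal(5)

    f = lambda v: v @ A @ v  # f(v) = v^T A v

    # Central-difference approximation of the gradient at x.
    eps = 1e-6
    num_grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                         for e in np.eye(5)])

    assert np.allclose(num_grad, A @ x + A.T @ x, atol=1e-5)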

4 Derivative in a trace

Recall (as in Old and New Matrix Algebra Useful for Statistics) that we can define the differential of a function $f(x)$ to be the part of $f(x + dx) - f(x)$ that is linear in $dx$, i.e. is a constant times $dx$. Then, for example, for a vector-valued function $f$, we can have

$$f(x + dx) = f(x) + f'(x) dx + (\text{higher order terms}).$$

In the above, $f'$ is the derivative (or Jacobian). Note that the gradient is the transpose of the Jacobian. Consider an arbitrary matrix $A$. We see that

$$\frac{\operatorname{tr}(A \, dX)}{dX} = \frac{\operatorname{tr}\begin{pmatrix} \tilde{a}_1^T dx_1 & & \\ & \ddots & \\ & & \tilde{a}_n^T dx_n \end{pmatrix}}{dX} = \frac{\sum_{i=1}^n \tilde{a}_i^T dx_i}{dX},$$
where $dx_i$ denotes the $i$th column of $dX$ (only the diagonal entries of $A \, dX$ matter for the trace). Thus, we have
$$\left[\frac{\operatorname{tr}(A \, dX)}{dX}\right]_{ij} = \frac{\partial \left[\sum_{i=1}^n \tilde{a}_i^T dx_i\right]}{\partial x_{ji}} = a_{ij}$$
so that
$$\frac{\operatorname{tr}(A \, dX)}{dX} = A.$$
Note that this is the Jacobian formulation.
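The Jacobian formulation can be checked numerically as well; the sketch below perturbs $X$ one entry at a time and confirms that $\partial \operatorname{tr}(AX) / \partial x_{ji} = a_{ij}$ (sizes and seed arbitrary):

    import numpy as np

    rng = np.random.default_rng(2)
    n = 4
    A = rng.standard_normal((n, n))
    X = rng.standard_normal((n, n))

    eps = 1e-6
    D = np.zeros_like(A)
    for i in range(n):
        for j in range(n):
            dX = np.zeros_like(X)
            dX[j, i] = eps  # perturb entry (j, i), so D[i, j] = d tr(AX) / d x_{ji}
            D[i, j] = (np.trace(A @ (X + dX)) - np.trace(A @ (X - dX))) / (2 * eps)

    assert np.allclose(D, A, atol=1e-5)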

5 Derivative of product in trace

In this section, we prove that
$$\nabla_A \operatorname{tr} AB = B^T.$$

  ←− ~a1 −→     ↑ ↑ ↑  ←− ~a2 −→  trAB = tr  .   b~ b~ ··· b~   .  1 2 n . ↓ ↓ ↓ ←− ~an −→  T ~ T ~ T ~  ~a1 b1 ~a1 b2 ··· ~a1 bn  T ~ T ~ T ~   ~a2 b1 ~a2 b2 ··· ~a2 bn  = tr  . . .   . .. .  T ~ T ~ T ~ ~an b1 ~an b2 ··· ~an bn Xm Xm Xm = a1ibi1 + a2ibi2 + ... + anibin i=1 i=1 i=1 ∂trAB ⇒ = bji ∂aij T ⇒ ∇AtrAB = B

6 Derivative of function of a matrix

Here we prove that
$$\nabla_{A^T} f(A) = \left( \nabla_A f(A) \right)^T.$$
$$\nabla_{A^T} f(A) = \begin{bmatrix} \frac{\partial f(A)}{\partial A_{11}} & \frac{\partial f(A)}{\partial A_{21}} & \cdots & \frac{\partial f(A)}{\partial A_{n1}} \\ \frac{\partial f(A)}{\partial A_{12}} & \frac{\partial f(A)}{\partial A_{22}} & \cdots & \frac{\partial f(A)}{\partial A_{n2}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f(A)}{\partial A_{1n}} & \frac{\partial f(A)}{\partial A_{2n}} & \cdots & \frac{\partial f(A)}{\partial A_{nn}} \end{bmatrix} = \left( \nabla_A f(A) \right)^T$$
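To see the identity concretely, take $f(A) = \operatorname{tr} AB$ from the previous section, write it as a function of $M = A^T$, and differentiate numerically (a sketch; $f$ is just a convenient test function):

    import numpy as np

    rng = np.random.default_rng(4)
    n = 3
    B = rng.standard_normal((n, n))
    M = rng.standard_normal((n, n))   # M plays the role of A^T

    g = lambda M: np.trace(M.T @ B)   # f(A) = tr(AB) written in terms of A^T

    eps = 1e-6
    grad = np.zeros_like(M)
    for i in range(n):
        for j in range(n):
            dM = np.zeros_like(M)
            dM[i, j] = eps
            grad[i, j] = (g(M + dM) - g(M - dM)) / (2 * eps)

    # grad_A tr(AB) = B^T, so the gradient wrt A^T should be (B^T)^T = B.
    assert np.allclose(grad, B, atol=1e-5)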

7 Derivative of linear transformed input to function

Consider a function $f : \mathbb{R}^n \to \mathbb{R}$. Suppose we have a matrix $A \in \mathbb{R}^{n \times m}$ and a vector $x \in \mathbb{R}^m$. We wish to compute $\nabla_x f(Ax)$. By the chain rule, we have

$$\frac{\partial f(Ax)}{\partial x_i} = \sum_{k=1}^n \frac{\partial f(Ax)}{\partial (Ax)_k} \cdot \frac{\partial (Ax)_k}{\partial x_i} = \sum_{k=1}^n \frac{\partial f(Ax)}{\partial (Ax)_k} \cdot \frac{\partial (\tilde{a}_k^T x)}{\partial x_i} = \sum_{k=1}^n \frac{\partial f(Ax)}{\partial (Ax)_k} \cdot a_{ki} = \sum_{k=1}^n \partial_k f(Ax) \, a_{ki} = a_i^T \nabla f(Ax).$$

As such, $\nabla_x f(Ax) = A^T \nabla f(Ax)$. Now, if we would like to get the Hessian of this function (third derivatives would be nice, but I do not like tensors), we have

$$\frac{\partial^2 f(Ax)}{\partial x_i \partial x_j} = \frac{\partial}{\partial x_j} a_i^T \nabla f(Ax) = \frac{\partial}{\partial x_j} \sum_{k=1}^n a_{ki} \frac{\partial f(Ax)}{\partial (Ax)_k} = \sum_{l=1}^n \sum_{k=1}^n a_{ki} \, \frac{\partial^2 f(Ax)}{\partial (Ax)_k \partial (Ax)_l} \, a_{lj} = a_i^T \nabla^2 f(Ax) \, a_j$$

From this, it is easy to see that $\nabla_x^2 f(Ax) = A^T \nabla^2 f(Ax) A$.
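Both identities are easy to test numerically. The sketch below uses $f(y) = \sum_i y_i^3$ (an arbitrary smooth choice with a simple known gradient and Hessian):

    import numpy as np

    rng = np.random.default_rng(5)
    n, m = 4, 3
    A = rng.standard_normal((n, m))
    x = rng.standard_normal(m)

    f = lambda y: np.sum(y ** 3)
    grad_f = lambda y: 3 * y ** 2        # gradient of f at y
    hess_f = lambda y: np.diag(6 * y)    # Hessian of f at y

    y = A @ x

    # Gradient identity: grad_x f(Ax) = A^T grad f(Ax).
    eps = 1e-6
    num_grad = np.array([(f(A @ (x + eps * e)) - f(A @ (x - eps * e))) / (2 * eps)
                         for e in np.eye(m)])
    assert np.allclose(num_grad, A.T @ grad_f(y), atol=1e-4)

    # Hessian identity: grad^2_x f(Ax) = A^T grad^2 f(Ax) A,
    # checked against a second-order central difference.
    h = 1e-4
    E = np.eye(m)
    num_hess = np.array([[(f(A @ (x + h*ei + h*ej)) - f(A @ (x + h*ei - h*ej))
                           - f(A @ (x - h*ei + h*ej)) + f(A @ (x - h*ei - h*ej)))
                          / (4 * h ** 2)
                          for ej in E] for ei in E])
    assert np.allclose(num_hess, A.T @ hess_f(y) @ A, atol=1e-4)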

8 Funky trace derivative

In this section, we prove that

$$\nabla_A \operatorname{tr} ABA^T C = CAB + C^T AB^T.$$

In this bit, let us have $AB = f(A)$, where $f$ is matrix-valued.

$$\begin{aligned} \nabla_A \operatorname{tr} ABA^T C &= \nabla_A \operatorname{tr} f(A) A^T C \\ &= \nabla_{\bullet} \operatorname{tr} f(\bullet) A^T C + \nabla_{\bullet} \operatorname{tr} f(A) \bullet^T C \\ &= (A^T C)^T f'(\bullet) + \left( \nabla_{\bullet^T} \operatorname{tr} f(A) \bullet^T C \right)^T \\ &= C^T A B^T + \left( \nabla_{\bullet^T} \operatorname{tr} \bullet^T C f(A) \right)^T \\ &= C^T A B^T + \left( (C f(A))^T \right)^T \\ &= C^T A B^T + CAB \end{aligned}$$
(Here $\bullet$ marks the copy of $A$ with respect to which we differentiate, holding the other copy fixed.)
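A finite-difference check of the funky identity (a sketch; dimensions chosen so that $A$ is rectangular, with $B$ and $C$ square):

    import numpy as np

    rng = np.random.default_rng(6)
    n, m = 4, 3
    A = rng.standard_normal((n, m))
    B = rng.standard_normal((m, m))
    C = rng.standard_normal((n, n))

    f = lambda A: np.trace(A @ B @ A.T @ C)

    eps = 1e-6
    grad = np.zeros_like(A)
    for i in range(n):
        for j in range(m):
            dA = np.zeros_like(A)
            dA[i, j] = eps
            grad[i, j] = (f(A + dA) - f(A - dA)) / (2 * eps)

    assert np.allclose(grad, C @ A @ B + C.T @ A @ B.T, atol=1e-5)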

9 Symmetric Matrices and Eigenvectors

In this section we prove that for a symmetric matrix $A \in \mathbb{R}^{n \times n}$, all the eigenvalues are real, and that the eigenvectors of $A$ form an orthonormal basis of $\mathbb{R}^n$. First, we prove that the eigenvalues are real. Suppose one is complex: for an eigenvalue $\lambda$ with eigenvector $x$, we have

$$\bar{\lambda} \bar{x}^T x = (A\bar{x})^T x = \bar{x}^T A^T x = \bar{x}^T A x = \lambda \bar{x}^T x.$$

Since $\bar{x}^T x \neq 0$ for $x \neq 0$, this gives $\bar{\lambda} = \lambda$. Thus, all the eigenvalues are real. Now, we suppose we have at least one eigenvector $v \neq 0$ of $A$ with eigenvalue $\lambda$. Consider the space $W$ of vectors orthogonal to $v$. We then have that, for $w \in W$,

$$(Aw)^T v = w^T A^T v = w^T A v = \lambda w^T v = 0.$$

Thus, we have a set of vectors $W$ that, when transformed by $A$, are still orthogonal to $v$, so if we have an original eigenvector $v$ of $A$, then a simple inductive argument (applying the same construction to $A$ restricted to $W$) shows that there is an orthonormal set of eigenvectors. To see that there is at least one eigenvector, consider the characteristic polynomial of $A$:

$$\chi_A(\lambda) = \det(A - \lambda I).$$

The field $\mathbb{C}$ is algebraically closed, so there is at least one complex root $r$ of $\chi_A$, so we have that $A - rI$ is singular and there is a vector $v \neq 0$ that is an eigenvector of $A$. Thus $r$ is a real eigenvalue (by the argument above), so we have the base case for our induction, and the proof is complete.
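In floating point, this is exactly what np.linalg.eigh computes for a symmetric matrix: real eigenvalues and an orthonormal eigenbasis (a quick illustration; the seed is arbitrary):

    import numpy as np

    rng = np.random.default_rng(7)
    S = rng.standard_normal((5, 5))
    A = S + S.T                      # a symmetric matrix

    # eigh is the routine for symmetric (Hermitian) matrices.
    eigvals, Q = np.linalg.eigh(A)

    assert np.all(np.isreal(eigvals))                   # eigenvalues are real
    assert np.allclose(Q.T @ Q, np.eye(5))              # eigenvectors orthonormal
    assert np.allclose(Q @ np.diag(eigvals) @ Q.T, A)   # A = Q diag(lambda) Q^T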
