
Some Important Properties for Matrix Calculus

Dawen Liang
Carnegie Mellon University
[email protected]

1 Introduction

Matrix calculation plays an essential role in many algorithms, among which matrix calculus is the most commonly used tool. In this note, starting from the properties of differential calculus, we show that they all carry over to matrix calculus.[1] In the end, an example on least-squares linear regression is presented.

2 Notation

A matrix is represented by a bold uppercase letter, e.g. X, where X_{m,n} indicates that the matrix has m rows and n columns. A vector is represented by a bold lowercase letter, e.g. x, and is an n × 1 column vector throughout this note. An important quantity for an n × n matrix A_{n,n} is the trace Tr(A), defined as the sum of the diagonal elements:

Tr(A) = \sum_{i=1}^{n} A_{ii}                                  (1)

where A_{ii} is the element at the i-th row and i-th column.
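As a quick numerical illustration (a minimal NumPy sketch, not part of the original note), the definition in Eq. (1) can be checked against np.trace:

import numpy as np

A = np.arange(9.0).reshape(3, 3)                 # an arbitrary 3 x 3 matrix
trace_by_def = sum(A[i, i] for i in range(3))    # Eq. (1): sum of the diagonal
assert np.isclose(trace_by_def, np.trace(A))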

3 Properties

The derivative of a matrix is usually referred to as the gradient, denoted ∇. Consider a function f : R^{m×n} → R^{p×q}; the gradient of f(A) with respect to A_{m,n} is:

∇_A f(A) = ∂f(A)/∂A =
    [ ∂f/∂A_{11}  ∂f/∂A_{12}  ···  ∂f/∂A_{1n} ]
    [ ∂f/∂A_{21}  ∂f/∂A_{22}  ···  ∂f/∂A_{2n} ]
    [     ⋮            ⋮        ⋱       ⋮     ]
    [ ∂f/∂A_{m1}  ∂f/∂A_{m2}  ···  ∂f/∂A_{mn} ]

This definition is very similar to the ordinary derivative from differential calculus, so a few simple properties hold (the matrix A below is square and has the same dimension as the vectors):

∇_x b^T A x = A^T b                                            (2)

[1] Some of the detailed derivations which are omitted in this note can be found at http://www.cs.berkeley.edu/~jduchi/projects/matrix_prop.pdf

∇_A XAY = Y^T X^T                                              (3)

∇_x x^T A x = Ax + A^T x                                       (4)

∇_{A^T} f(A) = (∇_A f(A))^T                                    (5)

where the superscript T denotes the transpose of a matrix or a vector. Now let us turn to the properties for the derivative of the trace. First of all, a few useful properties of the trace itself:

Tr(A) = Tr(A^T)                                                (6)

Tr(ABC) = Tr(BCA) = Tr(CAB)                                    (7)

Tr(A + B) = Tr(A) + Tr(B)                                      (8)

which are all easily derived. Note that the second one can be extended to the more general case with an arbitrary number of matrices. Thus, for the gradient of the trace,

∇_A Tr(AB) = B^T                                               (9)

Proof: Just expand Tr(AB) according to the trace definition (Eq. 1).
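These identities can also be verified numerically with central finite differences. The sketch below is a minimal NumPy illustration, not part of the original note; the helper num_grad and the random test matrices are introduced here purely for the check. It verifies Eqs. (2), (4), (7), and (9):

import numpy as np

def num_grad(f, X, eps=1e-6):
    # Central finite-difference gradient of a scalar function f; same shape as X.
    G = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        E = np.zeros_like(X)
        E[idx] = eps
        G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
n = 4
A = rng.normal(size=(n, n))
B = rng.normal(size=(n, n))
C = rng.normal(size=(n, n))
b = rng.normal(size=n)
x = rng.normal(size=n)

assert np.allclose(num_grad(lambda x: b @ A @ x, x), A.T @ b)          # Eq. (2)
assert np.allclose(num_grad(lambda x: x @ A @ x, x), A @ x + A.T @ x)  # Eq. (4)
assert np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B))            # Eq. (7)
assert np.allclose(num_grad(lambda A: np.trace(A @ B), A), B.T)        # Eq. (9)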

∇_A Tr(ABA^T C) = CAB + C^T A B^T                              (10)

Proof:

∇_A Tr(ABA^T C)
  = ∇_A Tr((AB)(A^T C))                        with u(A) := AB and v(A^T) := A^T C
  = ∇_{A:u(A)} Tr(u(A) v(A^T)) + ∇_{A:v(A^T)} Tr(u(A) v(A^T))
  = (v(A^T))^T ∇_A u(A) + (∇_{A^T:v(A^T)} Tr(u(A) v(A^T)))^T
  = C^T A B^T + ((u(A))^T ∇_{A^T} v(A^T))^T
  = C^T A B^T + (B^T A^T C^T)^T
  = CAB + C^T A B^T

Here we make use of the product rule from differential calculus: (u(x)v(x))' = u'(x)v(x) + u(x)v'(x). The notation ∇_{A:u(A)} means to take the derivative with respect to A only through u(A); the same applies to ∇_{A:v(A^T)}. Here the chain rule is used. Note that the conversion from ∇_{A:v(A^T)} to ∇_{A^T:v(A^T)} is based on Eq. (5).
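Eq. (10) can be checked the same way. The following self-contained NumPy sketch (again an illustration with arbitrary random matrices, not part of the original note) compares a finite-difference gradient of Tr(ABA^T C) with CAB + C^T A B^T:

import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 4
A = rng.normal(size=(m, n))
B = rng.normal(size=(n, n))
C = rng.normal(size=(m, m))

def f(A):
    # The scalar in Eq. (10).
    return np.trace(A @ B @ A.T @ C)

# Central finite differences, entry by entry.
eps = 1e-6
G = np.zeros_like(A)
for idx in np.ndindex(A.shape):
    E = np.zeros_like(A)
    E[idx] = eps
    G[idx] = (f(A + E) - f(A - E)) / (2 * eps)

assert np.allclose(G, C @ A @ B + C.T @ A @ B.T)   # Eq. (10)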

4 An Example on Least-squares Linear Regression

Now we will derive the solution for least-squares linear regression in matrix form, using the properties shown above. We know that least-squares linear regression has a closed-form solution (often referred to as the normal equation).

Assume we have N data points {x^{(i)}, y^{(i)}}_{i=1}^{N}, and the linear regression function h_θ(x) = x^T θ is parametrized by θ. We can rearrange the data into matrix form:

X = [ (x^{(1)})^T ]          y = [ y^{(1)} ]
    [ (x^{(2)})^T ]              [ y^{(2)} ]
    [     ⋮       ]              [    ⋮    ]
    [ (x^{(N)})^T ]              [ y^{(N)} ]

Thus the error can be represented as:

 (1) (1)  hθ(x ) − y (2) (2)  hθ(x ) − y  Xθ − y =    .   .  (N) (N) hθ(x ) − y

The squared error E(θ), according to its scalar definition, is:

E(θ) = (1/2) \sum_{i=1}^{N} (h_θ(x^{(i)}) − y^{(i)})^2

which is equivalent to the matrix form:

E(θ) = (1/2) (Xθ − y)^T (Xθ − y)

Take the derivative:

∇_θ E(θ) = (1/2) ∇_θ (Xθ − y)^T (Xθ − y)          (a 1×1 matrix, thus Tr(·) = (·))
         = (1/2) ∇_θ Tr(θ^T X^T X θ − y^T X θ − θ^T X^T y + y^T y)
         = (1/2) [ ∇_θ Tr(θ^T X^T X θ) − ∇_θ Tr(y^T X θ) − ∇_θ Tr(θ^T X^T y) ]
         = (1/2) [ ∇_θ Tr(θ I θ^T X^T X) − (y^T X)^T − X^T y ]

The term y^T y does not depend on θ and drops out. The first term can be computed using Eq. (10), where A = θ, B = I, and C = X^T X (note that in this case C = C^T). Plugging back into the derivation:

∇_θ E(θ) = (1/2) (X^T X θ + X^T X θ − 2 X^T y)
         = (1/2) (2 X^T X θ − 2 X^T y)

Setting this to 0:

X^T X θ = X^T y

θ_{LS} = (X^T X)^{-1} X^T y

The normal equation is obtained in matrix form.
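As a final sanity check (an illustrative NumPy sketch with synthetic data, not part of the original note), the closed-form θ_{LS} can be compared against NumPy's least-squares solver, and the gradient from the derivation can be confirmed to vanish at the solution. Solving the linear system X^T X θ = X^T y avoids forming the explicit inverse:

import numpy as np

rng = np.random.default_rng(42)
N, d = 100, 3
X = rng.normal(size=(N, d))                       # rows are (x^(i))^T
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.1 * rng.normal(size=N)     # noisy targets

# Normal equation: theta_LS = (X^T X)^{-1} X^T y, via a linear solve.
theta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Agrees with NumPy's least-squares solver ...
theta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
assert np.allclose(theta_ls, theta_lstsq)

# ... and the gradient X^T X theta - X^T y vanishes at theta_LS.
assert np.allclose(X.T @ X @ theta_ls - X.T @ y, 0.0, atol=1e-9)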
