
Paul Klein
Stockholms universitet
October 1999

Calculus with vectors and matrices

1 Differential calculus

1.1 The gradient and the Hessian

The purpose of this section is to make sense of expressions like

$$\frac{\partial f(x)}{\partial x^T} = \nabla_x f(x) = \nabla f(x) = f'(x) = f_x(x) \tag{1}$$
where $f : \mathbb{R}^n \to \mathbb{R}^m$. Of course, we already know what a partial derivative is and how to calculate it. What this section will tell us is how to arrange the partial derivatives into a matrix (the gradient), and the rules of arithmetic that follow from adopting our particular arrangement convention.

Definition 1.1 Let $f : \mathbb{R}^n \to \mathbb{R}^m$ have partial derivatives at $x$. Then
$$\underbrace{\frac{\partial f(x)}{\partial x^T}}_{m \times n} \triangleq \begin{bmatrix} \dfrac{\partial f_1(x)}{\partial x_1} & \cdots & \dfrac{\partial f_1(x)}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f_m(x)}{\partial x_1} & \cdots & \dfrac{\partial f_m(x)}{\partial x_n} \end{bmatrix} \tag{2}$$

and
$$\underbrace{\frac{\partial f(x)}{\partial x}}_{n \times m} \triangleq \left( \frac{\partial f(x)}{\partial x^T} \right)^T \tag{3}$$
where $A^T$ is the transpose of $A$.

Definition 1.2 Let $f : \mathbb{R}^n \to \mathbb{R}^n$ have partial derivatives at $x$. Then the (scalar-valued) Jacobian of $f$ at $x$ is defined via

$$J_f(x) \triangleq \det \frac{\partial f(x)}{\partial x^T}. \tag{4}$$

Remark 1.1 Sometimes the gradient $\dfrac{\partial f(x)}{\partial x^T}$ itself is called the Jacobian. Here the Jacobian is defined as the determinant of the gradient.

The following properties of the gradient follow straightforwardly from the definition.

Proposition 1.1 1. Let x be an n × 1 vector and A an m × n matrix. Then

$$\frac{\partial}{\partial x^T} \left[ Ax \right] = A. \tag{5}$$

2. Let x be an n × 1 vector and A an n × m matrix. Then

$$\frac{\partial}{\partial x^T} \left[ x^T A \right] = A^T. \tag{6}$$

3. Let x be an n × 1 vector and A an n × n matrix. Then

$$\frac{\partial}{\partial x^T} \left[ x^T A x \right] = x^T \left( A + A^T \right). \tag{7}$$

4. Let x be an n × 1 vector and A an n × n symmetric matrix. Then

$$\frac{\partial}{\partial x^T} \left[ x^T A x \right] = 2 x^T A. \tag{8}$$
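These rules are easy to check numerically. The following is a minimal sketch in Python with NumPy (not part of the original notes; the helper name `num_gradient`, the random seed and the tolerance are illustrative choices) that compares rule (7) with a finite-difference approximation of the gradient.

```python
import numpy as np

def num_gradient(f, x, eps=1e-6):
    """Central finite-difference approximation of the 1 x n gradient df/dx^T."""
    g = np.zeros((1, x.size))
    for i in range(x.size):
        e = np.zeros(x.size)
        e[i] = eps
        g[0, i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

f = lambda z: z @ A @ z                  # f(x) = x^T A x
analytic = (x @ (A + A.T))[None, :]      # rule (7): x^T (A + A^T)
print(np.allclose(num_gradient(f, x), analytic, atol=1e-5))  # True
```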

If f is scalar-valued, it is straightforward to define the second derivative (Hessian) as follows.

Definition 1.3 Let $f : \mathbb{R}^n \to \mathbb{R}$ have continuous first and second partial derivatives at $x$ (so as to satisfy the requirements of Young's theorem). Then
$$\underbrace{\frac{\partial^2 f(x)}{\partial x \partial x^T}}_{n \times n} \triangleq \begin{bmatrix} \dfrac{\partial^2 f(x)}{\partial x_1^2} & \cdots & \dfrac{\partial^2 f(x)}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f(x)}{\partial x_n \partial x_1} & \cdots & \dfrac{\partial^2 f(x)}{\partial x_n^2} \end{bmatrix} \triangleq f''(x). \tag{9}$$

Note that, by Young's theorem, the Hessian of a scalar-valued function is symmetric.

Proposition 1.2 Let $f(x) \triangleq x^T A x$ where $A$ is symmetric. Then

$$\frac{\partial^2 f(x)}{\partial x \partial x^T} = 2A. \tag{10}$$
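A corresponding numerical check of Proposition 1.2, again as a Python/NumPy sketch (the helper name, seed, step size and tolerance are illustrative, not part of the original notes):

```python
import numpy as np

def num_hessian(f, x, eps=1e-5):
    """Central finite-difference approximation of the n x n Hessian."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.zeros(n), np.zeros(n)
            ei[i], ej[j] = eps, eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps ** 2)
    return H

rng = np.random.default_rng(1)
n = 3
M = rng.standard_normal((n, n))
A = (M + M.T) / 2                        # a symmetric matrix
x = rng.standard_normal(n)

H = num_hessian(lambda z: z @ A @ z, x)
print(np.allclose(H, 2 * A, atol=1e-4))  # Proposition 1.2: the Hessian is 2A
```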

Occasionally we run into matrix-valued functions, and the way forward then is

to vectorize and then differentiate.

Definition 1.4 Let $A = \begin{bmatrix} a_1 & a_2 & \cdots & a_n \end{bmatrix}$ be an $m \times n$ matrix, where each column $a_i$ is $m \times 1$. Then
$$\underbrace{\operatorname{vec}(A)}_{mn \times 1} \triangleq \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix}. \tag{11}$$

Definition 1.5 Let f : Rk → Rn×m have partial derivatives at x. Then

$$\underbrace{\frac{\partial f(x)}{\partial x^T}}_{nm \times k} \triangleq \frac{\partial \operatorname{vec} f(x)}{\partial x^T}. \tag{12}$$

Having defined the vec operator, we quickly run into cases where we need the

Kronecker product, defined as follows.

Definition 1.6 Let $A$ ($m \times n$) and $B$ ($k \times l$) be matrices. Denote the element in the $i$th row and $j$th column of $A$ by $a_{ij}$. Then

$$\underbrace{A \otimes B}_{mk \times ln} \triangleq \begin{bmatrix} a_{11} B & \cdots & a_{1n} B \\ \vdots & \ddots & \vdots \\ a_{m1} B & \cdots & a_{mn} B \end{bmatrix}. \tag{13}$$

Proposition 1.3 Let $\underset{k \times l}{A}$, $\underset{m \times n}{B}$ and $\underset{p \times q}{C}$ be conformable matrices (so that the product $ABC$ is defined). Then

$$\operatorname{vec}(ABC) = \left( C^T \otimes A \right) \operatorname{vec}(B). \tag{14}$$

Proof. Exercise.
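A quick numerical sanity check of (14), as a Python/NumPy sketch (the column-stacking `vec` helper mirrors Definition 1.4; the dimensions and seed are arbitrary illustrative choices):

```python
import numpy as np

def vec(M):
    """Stack the columns of M into one long column (Definition 1.4)."""
    return M.T.reshape(-1, 1)            # column-major (Fortran-order) stacking

rng = np.random.default_rng(2)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((4, 5))

lhs = vec(A @ B @ C)
rhs = np.kron(C.T, A) @ vec(B)
print(np.allclose(lhs, rhs))             # True: vec(ABC) = (C^T ⊗ A) vec(B)
```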

Occasionally we find ourselves wanting to differentiate a vector-valued function with respect to a matrix. Again the way forward is to vectorize.

Proposition 1 Whenever the following expressions are defined, they are true.

The trace of a matrix A is denoted by tr (A). [Various rules of arithmetic omitted in this version. See the bibliography for sources.]

Definition 1.7 Let $f : \mathbb{R}^{n \times m} \to \mathbb{R}^k$ have partial derivatives at $A$. Then

$$\underbrace{\frac{\partial f(A)}{\partial A^T}}_{k \times nm} \triangleq \frac{\partial f(A)}{\partial (\operatorname{vec} A)^T}. \tag{15}$$

Example 1.1 Let $f : \mathbb{R}^{n \times m} \to \mathbb{R}^n$ be defined via $f(\Phi) \triangleq \Phi k$ where $k \in \mathbb{R}^m$ is a constant vector. Then $f(\Phi) = \left( k^T \otimes I_n \right) \operatorname{vec} \Phi$ and hence
$$\frac{\partial f(\Phi)}{\partial \Phi^T} = k^T \otimes I_n. \tag{16}$$
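The identity $\operatorname{vec}(\Phi k) = (k^T \otimes I_n)\operatorname{vec}\Phi$ used in this example can be checked in a few lines (Python/NumPy sketch, not part of the original notes; dimensions and seed are arbitrary):

```python
import numpy as np

vec = lambda M: M.T.reshape(-1, 1)       # column-stacking vec, as in Definition 1.4

rng = np.random.default_rng(3)
n, m = 3, 4
Phi = rng.standard_normal((n, m))
k = rng.standard_normal((m, 1))

lhs = vec(Phi @ k)                               # f(Phi) = Phi k, an n x 1 vector
rhs = np.kron(k.T, np.eye(n)) @ vec(Phi)         # (k^T ⊗ I_n) vec(Phi)
print(np.allclose(lhs, rhs))                     # True, so the gradient is k^T ⊗ I_n
```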

We are now in a position to state rather general versions of the product and chain rules for matrices.

1.2 The product rule

Proposition 1.4 (the product rule) Let A : Rl → Rn×m and B : Rl → Rm×k have partial derivatives at x ∈ Rl. Then

$$\frac{\partial}{\partial x^T} \left[ A(x) B(x) \right] = \left( B(x)^T \otimes I_n \right) \frac{\partial \operatorname{vec} A(x)}{\partial x^T} + \left( I_k \otimes A(x) \right) \frac{\partial \operatorname{vec} B(x)}{\partial x^T}. \tag{17}$$

Kind-of proof. Suppose A (x) ≡ A. Then, by Proposition 1.3,

$$\operatorname{vec}(A B(x)) = (I_k \otimes A) \operatorname{vec} B(x). \tag{18}$$

Since differentiation is a linear operator, it follows that

$$\frac{\partial \operatorname{vec}(A B(x))}{\partial x^T} = (I_k \otimes A) \frac{\partial \operatorname{vec} B(x)}{\partial x^T}. \tag{19}$$

Conversely, assume that B (x) ≡ B. Then

$$\frac{\partial \operatorname{vec}(A(x) B)}{\partial x^T} = \left( B^T \otimes I_n \right) \frac{\partial \operatorname{vec} A(x)}{\partial x^T}. \tag{20}$$

Combining the two results yields the product rule.
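Equation (17) can be verified numerically by choosing matrix-valued functions whose vec-gradients are known. The sketch below (Python/NumPy; the linear parameterizations of $A(x)$ and $B(x)$, the dimensions, seed and tolerance are illustrative assumptions) compares the formula with finite differences.

```python
import numpy as np

vec = lambda M: M.T.reshape(-1)            # column-stacking vec as a flat array

def num_jac(F, x, eps=1e-6):
    """Finite-difference gradient of a vector-valued F with respect to x^T."""
    cols = []
    for i in range(x.size):
        e = np.zeros(x.size)
        e[i] = eps
        cols.append((F(x + e) - F(x - e)) / (2 * eps))
    return np.column_stack(cols)

rng = np.random.default_rng(4)
n, m, k, l = 2, 3, 2, 3                    # arbitrary illustrative dimensions
P = rng.standard_normal((n * m, l))        # vec A(x) = P x, so d vec A / dx^T = P
Q = rng.standard_normal((m * k, l))        # vec B(x) = Q x, so d vec B / dx^T = Q
A = lambda x: (P @ x).reshape(m, n).T      # "unvec": A(x) is n x m
B = lambda x: (Q @ x).reshape(k, m).T      # "unvec": B(x) is m x k

x0 = rng.standard_normal(l)
lhs = num_jac(lambda x: vec(A(x) @ B(x)), x0)
rhs = np.kron(B(x0).T, np.eye(n)) @ P + np.kron(np.eye(k), A(x0)) @ Q
print(np.allclose(lhs, rhs, atol=1e-6))    # True: equation (17) holds
```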

Corollary 1.1 When we have vector- rather than matrix-valued functions, the formula is drastically simplified. Let $f : \mathbb{R}^l \to \mathbb{R}^m$ and $g : \mathbb{R}^l \to \mathbb{R}^m$ have partial derivatives at $x \in \mathbb{R}^l$. Then

$$\frac{\partial}{\partial x^T} \left[ f(x)^T g(x) \right] = g(x)^T \frac{\partial f(x)}{\partial x^T} + f(x)^T \frac{\partial g(x)}{\partial x^T}. \tag{21}$$

Example 1.2 Suppose we would like to differentiate $f(\Omega) \triangleq \Omega^{-1}$ with respect to $(\operatorname{vec}\Omega)^T$. One quick way of getting the result is to note that

$$\frac{\partial (\Omega \Omega^{-1})}{\partial (\operatorname{vec}\Omega)^T} = \frac{\partial (I)}{\partial (\operatorname{vec}\Omega)^T} = 0$$
so that
$$0 = \left( \Omega^{-T} \otimes I_n \right) \frac{\partial \operatorname{vec}\Omega}{\partial (\operatorname{vec}\Omega)^T} + \left( I_n \otimes \Omega \right) \frac{\partial \operatorname{vec}\Omega^{-1}}{\partial (\operatorname{vec}\Omega)^T}.$$
Hence
$$\frac{\partial \operatorname{vec}\Omega^{-1}}{\partial (\operatorname{vec}\Omega)^T} = - \left( I_n \otimes \Omega \right)^{-1} \left( \Omega^{-T} \otimes I_n \right) = -\Omega^{-T} \otimes \Omega^{-1}.$$
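The result $\partial \operatorname{vec}\Omega^{-1} / \partial(\operatorname{vec}\Omega)^T = -\Omega^{-T} \otimes \Omega^{-1}$ is easy to confirm by finite differences. A Python/NumPy sketch (the diagonal shift keeping $\Omega$ comfortably invertible, the seed and the tolerance are illustrative choices):

```python
import numpy as np

vec = lambda M: M.T.reshape(-1)                  # column-stacking vec

def num_jac(F, x, eps=1e-6):
    cols = []
    for i in range(x.size):
        e = np.zeros(x.size)
        e[i] = eps
        cols.append((F(x + e) - F(x - e)) / (2 * eps))
    return np.column_stack(cols)

rng = np.random.default_rng(5)
n = 3
Omega = rng.standard_normal((n, n)) + n * np.eye(n)   # comfortably invertible

unvec = lambda v: v.reshape(n, n, order="F")          # inverse of the vec above
F = lambda v: vec(np.linalg.inv(unvec(v)))            # vec(Omega) -> vec(Omega^{-1})

numeric = num_jac(F, vec(Omega))
Oinv = np.linalg.inv(Omega)
analytic = -np.kron(Oinv.T, Oinv)                     # -Omega^{-T} ⊗ Omega^{-1}
print(np.allclose(numeric, analytic, atol=1e-5))      # True
```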

1.3 The chain rule

Proposition 1.5 (the chain rule) Let $f$ and $g$ have partial derivatives at $x$, and let $h(x) = (f \circ g)(x) = f(g(x))$. Define $y = g(x)$. Then $h$ has partial derivatives at $x$ and
$$\frac{\partial h(x)}{\partial x^T} = \frac{\partial f(y)}{\partial y^T} \frac{\partial g(x)}{\partial x^T}. \tag{22}$$
With an alternative piece of notation, we have

$$\frac{\partial f(g(x))}{\partial x^T} = \frac{\partial f(g(x))}{\partial g^T} \frac{\partial g(x)}{\partial x^T}. \tag{23}$$

Proof. The scalar chain rule and the definition of matrix multiplication.

Example 1.3 Let $f(A) \triangleq x^T A x$ and let $B(A) \triangleq A^{-1}$. Find
$$\frac{\partial f(B(A))}{\partial \operatorname{vec}(A)^T}.$$

Here we use the chain rule to find that

$$\frac{\partial f(B(A))}{\partial \operatorname{vec}(A)^T} = \frac{\partial \left( x^T B x \right)}{\partial \operatorname{vec}(B)^T} \frac{\partial \operatorname{vec}(A^{-1})}{\partial \operatorname{vec}(A)^T} = - \left( x^T \otimes x^T \right) \left( A^{-T} \otimes A^{-1} \right).$$
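The same finite-difference machinery confirms this chain-rule result (Python/NumPy sketch, not part of the original notes; the diagonal shift keeps $A$ invertible and the tolerance is illustrative):

```python
import numpy as np

vec = lambda M: M.T.reshape(-1)                  # column-stacking vec

def num_grad(f, v, eps=1e-6):
    """Finite-difference row gradient of a scalar f with respect to v^T."""
    g = np.zeros(v.size)
    for i in range(v.size):
        e = np.zeros(v.size)
        e[i] = eps
        g[i] = (f(v + e) - f(v - e)) / (2 * eps)
    return g

rng = np.random.default_rng(6)
n = 3
A = rng.standard_normal((n, n)) + n * np.eye(n)
x = rng.standard_normal(n)

unvec = lambda v: v.reshape(n, n, order="F")
f = lambda v: x @ np.linalg.inv(unvec(v)) @ x          # f(B(A)) = x^T A^{-1} x

Ainv = np.linalg.inv(A)
analytic = -np.kron(x, x) @ np.kron(Ainv.T, Ainv)      # -(x^T ⊗ x^T)(A^{-T} ⊗ A^{-1})
print(np.allclose(num_grad(f, vec(A)), analytic, atol=1e-5))  # True
```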

2 Remarks on unvectorizing

The definition of the derivative of a matrix function with respect to a matrix given

above was stated originally by Magnus and Neudecker (1999). It has some great

advantages. But sometimes it is not so useful. One example is found in chapter

10, where we differentiate a matrix with respect to a scalar. Then the whole

theory would break down if we had to vectorize before we differentiated. So we

go against Magnus & Neudecker and do not vectorize.

A similar phenomenon arises when we want to differentiate a scalar with respect

to a matrix. For example, according to the Magnus-Neudecker definition,

$$\frac{\partial x^T A x}{\partial A^T} = \frac{\partial x^T A x}{\partial (\operatorname{vec} A)^T} = x^T \otimes x^T,$$
so that the result is a $1 \times n^2$ vector. But often it is nicer to define the result as the $n \times n$ matrix
$$\frac{\partial x^T A x}{\partial A^T} = \frac{\partial x^T A x}{\partial A} = x x^T.$$
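A direct element-by-element check that $\partial(x^T A x)/\partial a_{ij} = x_i x_j$, i.e. that the 'unvectorized' derivative is $x x^T$, as a Python/NumPy sketch (seed, step size and tolerance are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 3
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)
eps = 1e-6

D = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = eps
        D[i, j] = (x @ (A + E) @ x - x @ (A - E) @ x) / (2 * eps)

print(np.allclose(D, np.outer(x, x), atol=1e-6))   # element-wise derivative is x x^T
```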

More generally, let $f : \mathbb{R}^{m \times n} \to \mathbb{R}$ be a scalar-valued function. Then the definitions
$$\frac{\partial f(A)}{\partial A} \triangleq \begin{bmatrix} \dfrac{\partial f(A)}{\partial a_{11}} & \cdots & \dfrac{\partial f(A)}{\partial a_{m1}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f(A)}{\partial a_{1n}} & \cdots & \dfrac{\partial f(A)}{\partial a_{mn}} \end{bmatrix}$$
and
$$\frac{\partial f(A)}{\partial A^T} = \left( \frac{\partial f(A)}{\partial A} \right)^T,$$
where $a_{ij}$ is the element in the $i$th row and $j$th column of $A$, are often nicer in practice than the ones given by Magnus & Neudecker.

Another example is the following. Suppose we want to evaluate

$$\frac{\partial}{\partial \Omega^T} \left[ x^T \Omega^{-1} x \right].$$

We found above that

$$\frac{\partial \operatorname{vec}\left( x^T \Omega^{-1} x \right)}{\partial (\operatorname{vec}\Omega)^T} = - \left( x^T \otimes x^T \right) \left( \Omega^{-T} \otimes \Omega^{-1} \right),$$
which is a $1 \times n^2$ vector. But a much neater statement of the result is

$$\frac{\partial}{\partial \Omega^T} \left[ x^T \Omega^{-1} x \right] = -\Omega^{-1} x x^T \Omega^{-1},$$

which is an n × n matrix. This fact is brought home clearly by looking at the

following example, where the transpose will be denoted by a prime ($'$).

Example 2.1 (Maximum likelihood estimation of the population mean and vari-

ance/covariance matrix of a normally distributed vector of random variables.) Let

µ be an unknown n × 1 column vector and let Ω be an unknown n × n matrix. Let $\langle x_t \rangle_{t=1}^{T}$ be a given sequence of n × 1 vectors. Consider the maximization problem

$$\max_{\mu, \Omega} \left\{ |\Omega|^{-T/2} \exp\left\{ -\frac{1}{2} \sum_{t=1}^{T} (x_t - \mu)' \, \Omega^{-1} (x_t - \mu) \right\} \right\}.$$

Using the ‘unvectorized’ definition of the matrix derivative and noting that, with

this definition,
$$\frac{\partial \ln |\Omega|}{\partial \Omega'} = \Omega^{-1},$$

it is easy to confirm that the unique solution is given by

$$\begin{cases} \mu = \dfrac{1}{T} \displaystyle\sum_{t=1}^{T} x_t \\[2ex] \Omega = \dfrac{1}{T} \displaystyle\sum_{t=1}^{T} (x_t - \mu)(x_t - \mu)'. \end{cases}$$
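A small numerical illustration of this result, as a Python/NumPy sketch (the data-generating parameters, seed and perturbation sizes are illustrative assumptions, and the log-likelihood is written up to an additive constant): the closed-form estimators should yield a higher log-likelihood than nearby alternatives.

```python
import numpy as np

rng = np.random.default_rng(8)
n, T = 3, 500
data = rng.multivariate_normal(mean=[1.0, -2.0, 0.5],
                               cov=np.diag([1.0, 2.0, 0.5]), size=T)

def loglik(mu, Omega):
    """Log-likelihood of the sample, up to an additive constant."""
    d = data - mu
    _, logdet = np.linalg.slogdet(Omega)
    return -T / 2 * logdet - 0.5 * np.sum(d @ np.linalg.inv(Omega) * d)

# Closed-form maximum likelihood estimators from the example above.
mu_hat = data.mean(axis=0)
Omega_hat = (data - mu_hat).T @ (data - mu_hat) / T

best = loglik(mu_hat, Omega_hat)
S = rng.standard_normal((n, n))
S = (S + S.T) / 100                                   # small symmetric perturbation
print(best >= loglik(mu_hat + 0.01, Omega_hat))       # True
print(best >= loglik(mu_hat, Omega_hat + S))          # True (for small S)
```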

So the derivative of a scalar-valued function with respect to a matrix may usefully be defined as I have in this section. However, there is no satisfactory chain rule for the definition of the matrix derivative given here. This is one of Magnus &

Neudecker’s reasons for advocating the definition I give in the previous section.

2.1 Taylor’s formula in n dimensions

Proposition 2.1 Let $f : S \to \mathbb{R}^m$ be differentiable on the open set $S \subset \mathbb{R}^n$. Let $x_0 \in S$. Then there is a function $r : \mathbb{R}^n \to \mathbb{R}^m$ (which typically depends on $x_0$) such that

1. For all $x \in S$,

$$f(x) = f(x_0) + \frac{\partial f(x_0)}{\partial x^T} (x - x_0) + r(x - x_0). \tag{24}$$

2.
$$\lim_{h \to 0} \frac{r(h)}{\|h\|} = 0. \tag{25}$$

Proof. See [??].

Proposition 2.2 Let $f : S \to \mathbb{R}$ be twice differentiable on the open set $S \subset \mathbb{R}^n$. Let $x_0 \in S$. Then there is a function $r : \mathbb{R}^n \to \mathbb{R}$ such that

1. For all x ∈ S,

$$f(x) = f(x_0) + \frac{\partial f(x_0)}{\partial x^T} (x - x_0) + \frac{1}{2} (x - x_0)^T \frac{\partial^2 f(x_0)}{\partial x \partial x^T} (x - x_0) + r(x - x_0). \tag{26}$$

2.
$$\lim_{h \to 0} \frac{r(h)}{\|h\|^2} = 0. \tag{27}$$

Proof. See [??].
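The content of Proposition 2.2 is that the second-order remainder vanishes faster than $\|h\|^2$. A minimal numerical illustration (Python/NumPy sketch with an arbitrarily chosen test function, not part of the original notes) shows the ratio $r(h)/\|h\|^2$ shrinking as $h \to 0$:

```python
import numpy as np

# Test function f(x) = exp(x1) * sin(x2); its gradient and Hessian are known.
f = lambda x: np.exp(x[0]) * np.sin(x[1])
x0 = np.array([0.2, 0.3])
grad = np.array([np.exp(x0[0]) * np.sin(x0[1]),
                 np.exp(x0[0]) * np.cos(x0[1])])
hess = np.exp(x0[0]) * np.array([[np.sin(x0[1]),  np.cos(x0[1])],
                                 [np.cos(x0[1]), -np.sin(x0[1])]])

for t in [1e-1, 1e-2, 1e-3]:
    h = t * np.array([0.7, -0.4])
    taylor2 = f(x0) + grad @ h + 0.5 * h @ hess @ h
    r = f(x0 + h) - taylor2
    print(t, r / np.dot(h, h))      # the ratio r(h)/||h||^2 shrinks towards 0
```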

Warning. It is not claimed (and it isn't true) that whenever f : S → R has derivatives of all orders, the infinite Taylor series

$$f(x_0) + \frac{\partial f(x_0)}{\partial x^T} (x - x_0) + \frac{1}{2} (x - x_0)^T \frac{\partial^2 f(x_0)}{\partial x \partial x^T} (x - x_0) + \cdots$$

converges to f (x). This series may fail to converge, or it may converge to a

number different from f (x). In other words, not all infinitely differentiable

functions are analytic (representable by a power series). A counterexample

is the function $f$ defined via $f(x) = e^{-1/x^2}$ for $x \neq 0$ and $f(0) = 0$. Then $f$

is infinitely differentiable with $f^{(n)}(0) = 0$ for all $n = 0, 1, 2, \ldots$ and so the

infinite Taylor series around $x_0 = 0$ is identically zero, yet of course $f$ itself

is not identically zero.
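A tiny numerical illustration of the counterexample (Python sketch, not part of the original notes): every Taylor coefficient of $f$ at 0 is zero, so the Taylor series is identically zero, yet $f$ is visibly nonzero away from the origin.

```python
import math

def f(x):
    """f(x) = exp(-1/x^2) for x != 0 and f(0) = 0: smooth but not analytic at 0."""
    return 0.0 if x == 0.0 else math.exp(-1.0 / x ** 2)

# All derivatives of f at 0 vanish, so the Taylor series around 0 is the zero function.
for x in [0.0, 0.1, 0.5, 1.0]:
    print(x, f(x), "vs. Taylor-series value 0.0")
```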

3 Integrating over Rn

3.1 Definition

Within the Riemann theory, integrating over rectangles (and the $n$-dimensional counterparts) is just a matter of iterating the process of integration. More precisely, suppose $E \subset \mathbb{R}^2$ is a closed rectangle, i.e. $E = [a_1, b_1] \times [a_2, b_2]$ where we require $a_1 \leq b_1$ and $a_2 \leq b_2$ so that the orientation of our set $E$ is not an issue. We then have the following definition.

Definition 3.1 Let $E \subset \mathbb{R}^2$ be a closed rectangle and let $f : E \to \mathbb{R}$ be a continuous function. Let $x = \langle x, y \rangle$. Define
$$\varphi(y) = \int_{a_1}^{b_1} f(x, y)\, dx. \tag{28}$$
Then
$$\int_E f(x)\, dx = \int_{a_2}^{b_2} \left( \int_{a_1}^{b_1} f(x, y)\, dx \right) dy = \int_{a_2}^{b_2} \varphi(y)\, dy. \tag{29}$$

Happily, the order of integration does not matter under our assumptions. We have the following proposition.

Proposition 3.1 Let E be a closed rectangle and let f : E → R be a continuous function. Then

$$\int_{a_2}^{b_2} \left( \int_{a_1}^{b_1} f(x, y)\, dx \right) dy = \int_E f(x)\, dx = \int_{a_1}^{b_1} \left( \int_{a_2}^{b_2} f(x, y)\, dy \right) dx. \tag{30}$$

Proof. See [??].

Remark 3.1 If you think this is a surprising result, recall that integrals are just sums, and sums (avoiding pathologies where infinity is involved) are the same independent of the order of the terms.
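As an illustration, the sketch below (Python/NumPy; the integrand, rectangle and grid size are arbitrary choices, not from the original notes) approximates the iterated integrals in both orders by midpoint Riemann sums and compares them with the exact value.

```python
import numpy as np

# Iterated midpoint Riemann sums over the rectangle E = [0, 1] x [0, 2]
# for f(x, y) = x * y**2, integrating in both orders.
f = lambda x, y: x * y ** 2
nx, ny = 2000, 2000
hx, hy = 1.0 / nx, 2.0 / ny
xs = (np.arange(nx) + 0.5) * hx                   # midpoints in [0, 1]
ys = (np.arange(ny) + 0.5) * hy                   # midpoints in [0, 2]
grid = f(xs[:, None], ys[None, :])

dx_then_dy = (grid.sum(axis=0) * hx * hy).sum()   # inner integral over x, then y
dy_then_dx = (grid.sum(axis=1) * hy * hx).sum()   # inner integral over y, then x
exact = (1.0 / 2.0) * (8.0 / 3.0)                 # (integral of x)(integral of y^2)
print(dx_then_dy, dy_then_dx, exact)              # all three agree closely
```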

We can of course generalize this definition and proposition to integration over

closed rectangles $E \subset \mathbb{R}^n$, i.e. sets of the form $E = [a_1, b_1] \times [a_2, b_2] \times \cdots \times [a_n, b_n]$. Just keep on iterating the process of integration!

3.2 Change of variables

Definition 3.2 Let f be an arbitrary function on Rn into R. Then the set

$$S_f = \{ x \in \mathbb{R}^n : f(x) \neq 0 \} \tag{31}$$

is called the support of f. If Sf is a compact set, then f is said to have compact

support.

Theorem 3.1 Let T be a 1-1 (injective) continuously differentiable function from

an open set $E \subset \mathbb{R}^n$ into $\mathbb{R}^n$ such that the Jacobian $J_T(x) \neq 0$ for all $x \in E$. Let $f$
be a continuous function from $\mathbb{R}^n$ into $\mathbb{R}$ whose support is compact and lies in $T(E)$. Then
$$\int_{\mathbb{R}^n} f(y)\, dy = \int_{\mathbb{R}^n} f(T(x)) \, |J_T(x)| \, dx. \tag{32}$$

Proof. See [??].

Remark 3.2 The reason for having $|J_T(x)|$ instead of $J_T(x)$ is that, with the definition of the integral used in this section, we integrate over subsets of $\mathbb{R}^n$ without regard for their orientation. For example, in the scalar case, we consider $\int_a^b f(x)\, dx$ and $\int_b^a f(x)\, dx$ to be the same. Given that these are defined to be the same, we must take steps to ensure that, say, the change of variables $T(x) = -x$ makes no difference, and that is guaranteed by taking the absolute value of the Jacobian.

Example 3.1 (from Econometrics II; calculating the volume of a cylinder)

Let $c, k \geq 0$. Let $f : \mathbb{R}^2 \to \mathbb{R}$ be defined via
$$f(x, y) = \begin{cases} c & \text{if } x^2 + y^2 \leq k^2 \\ 0 & \text{otherwise.} \end{cases} \tag{33}$$
(Draw a picture of this!) We now want to calculate
$$\int_{\mathbb{R}^2} f(x, y)\, dx \tag{34}$$
and it turns out to be convenient to use the change of variables approach, noting with satisfaction that $f$ has compact support. Looking at the picture, it seems that a switch to polar coordinates makes sense. So define
$$E = \left\{ \begin{bmatrix} r \\ \theta \end{bmatrix} \in \mathbb{R}^2 : 0 < r < k \text{ and } 0 < \theta < 2\pi \right\} \tag{35}$$
and $T$ on $E$ via
$$T(r, \theta) = \begin{bmatrix} r \cos\theta \\ r \sin\theta \end{bmatrix}. \tag{36}$$
Apparently the Jacobian is
$$J_T(r, \theta) = \det \begin{bmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{bmatrix} = r \tag{37}$$
and
$$f(T(r, \theta)) = \begin{cases} c & \text{if } 0 \leq r \leq k \text{ and } 0 \leq \theta \leq 2\pi \\ 0 & \text{otherwise.} \end{cases} \tag{38}$$
Hence
$$\int_{\mathbb{R}^2} f(x, y)\, dx = \int_0^{2\pi} \left( \int_0^k c r\, dr \right) d\theta = c k^2 \pi. \tag{39}$$
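A crude numerical confirmation of (39), as a Python/NumPy sketch (the values of $c$ and $k$ and the grid size are illustrative): approximate the integral of $f$ on a fine grid covering its support and compare with $ck^2\pi$.

```python
import numpy as np

# Numerical check of equation (39): integrate f over a grid covering its support.
c, k = 2.0, 1.5
n = 2000
h = 2.0 * k / n                                    # grid over the square [-k, k]^2
x = (np.arange(n) + 0.5) * h - k                   # cell midpoints
f = np.where(x[:, None] ** 2 + x[None, :] ** 2 <= k ** 2, c, 0.0)

approx = f.sum() * h * h
print(approx, c * k ** 2 * np.pi)                  # both approximately 14.137 = c k^2 pi
```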

References

Magnus, J. and H. Neudecker (1999). Matrix Differential Calculus with Applications in Statistics and Econometrics. John Wiley and Sons.
