3. The Multivariate Normal Distribution

The multivariate normal (MVN) distribution is a generalization of the univariate normal distribution, which has the probability density function (p.d.f.)

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}, \qquad -\infty < x < \infty,$$
where $\mu$ = mean of the distribution and $\sigma^2$ = variance. In $p$ dimensions the density becomes

$$f(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}}\exp\left\{-\tfrac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right\} \tag{3.1}$$
Within the vector $\mu$ there are $p$ (independent) parameters and within the symmetric covariance matrix $\Sigma$ there are $\tfrac{1}{2}p(p+1)$ independent parameters [$\tfrac{1}{2}p(p+3)$ independent parameters in total]. We use the notation

$$x \sim N_p(\mu, \Sigma) \tag{3.2}$$
to denote a random vector $x$ having the $p$-variate MVN distribution with

$$E(x) = \mu, \qquad \operatorname{Cov}(x) = \Sigma.$$

Note that MVN distributions are entirely characterized by the first and second moments of the distribution.
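As a quick numerical sanity check of (3.1), the following sketch (assuming NumPy and SciPy are available; the parameter values are arbitrary illustrations) evaluates the density directly and compares it with `scipy.stats.multivariate_normal`.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Arbitrary illustrative parameters for a p = 2 MVN
mu = np.array([1.0, -0.5])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
x = np.array([0.7, 0.1])

# Density (3.1) evaluated directly
p = len(mu)
d = x - mu
quad = d @ np.linalg.inv(Sigma) @ d
f_direct = np.exp(-0.5 * quad) / ((2 * np.pi) ** (p / 2) * np.linalg.det(Sigma) ** 0.5)

# Same density via SciPy for comparison
f_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)
print(f_direct, f_scipy)   # the two values should agree to machine precision
```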

3.1 Basic properties

If $x$ $(p \times 1)$ is MVN with mean $\mu$ and covariance matrix $\Sigma$, then:

Any linear combination of $x$ is MVN (a short simulation sketch illustrating this property follows this list). Let $y = Ax + c$ with $A$ $(q \times p)$ and $c$ $(q \times 1)$; then
$$y \sim N_q(\mu_y, \Sigma_y)$$
where $\mu_y = A\mu + c$ and $\Sigma_y = A\Sigma A^T$.

Any subset of variables in $x$ has a MVN distribution.

If a set of variables is uncorrelated, then they are independently distributed. In particular:

i) if $\sigma_{ij} = 0$ then $x_i, x_j$ are independent;

ii) if $x$ is MVN with covariance matrix $\Sigma$, then $Ax$ and $Bx$ are independent if and only if
$$\operatorname{Cov}(Ax, Bx) = A\Sigma B^T = 0 \tag{3.3}$$

Conditional distributions are MVN.
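The linear-transformation property can be illustrated by simulation. The sketch below uses hypothetical values for $\mu$, $\Sigma$, $A$ and $c$, draws a large sample of $x$, forms $y = Ax + c$, and compares the empirical mean and covariance of $y$ with $A\mu + c$ and $A\Sigma A^T$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative p = 3 MVN and an arbitrary (2 x 3) transformation y = Ax + c
mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 2.0, 0.0],
                  [0.2, 0.0, 1.5]])
A = np.array([[1.0, -1.0, 0.0],
              [0.5,  0.5, 1.0]])
c = np.array([3.0, -2.0])

X = rng.multivariate_normal(mu, Sigma, size=200_000)   # rows are draws of x
Y = X @ A.T + c                                        # rows are draws of y

print(Y.mean(axis=0))           # approximately A mu + c
print(A @ mu + c)
print(np.cov(Y, rowvar=False))  # approximately A Sigma A^T
print(A @ Sigma @ A.T)
```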

Result: For the MVN distribution, variables are uncorrelated $\Leftrightarrow$ variables are independent.

Proof: Let $x$ $(p \times 1)$ be partitioned as
$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix},$$
where $x_1$ is $q \times 1$ and $x_2$ is $(p-q) \times 1$, with mean vector
$$\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}$$
and covariance matrix
$$\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix},$$
where $\Sigma_{11}$ is $q \times q$ and $\Sigma_{22}$ is $(p-q) \times (p-q)$.

i) Independent $\Rightarrow$ uncorrelated (always holds).
Suppose $x_1, x_2$ are independent. Then $f(x_1, x_2) = h(x_1)\,g(x_2)$ is a factorization of the multivariate p.d.f., and $\Sigma_{12} = \operatorname{Cov}(x_1, x_2) = E\left[(x_1 - \mu_1)(x_2 - \mu_2)^T\right]$ factorizes into the product of $E[(x_1 - \mu_1)]$ and $E\left[(x_2 - \mu_2)^T\right]$, which are both zero since $E(x_1) = \mu_1$ and $E(x_2) = \mu_2$. Hence $\Sigma_{12} = 0$.

ii) Uncorrelated $\Rightarrow$ independent (for MVN).
This result depends on factorizing the p.d.f. (3.1) when $\Sigma_{12} = 0$. In this case $(x - \mu)^T\Sigma^{-1}(x - \mu)$ has the partitioned form
$$\begin{bmatrix} (x_1 - \mu_1)^T, \; (x_2 - \mu_2)^T \end{bmatrix}
\begin{bmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} \end{bmatrix}^{-1}
\begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix}
= \begin{bmatrix} (x_1 - \mu_1)^T, \; (x_2 - \mu_2)^T \end{bmatrix}
\begin{bmatrix} \Sigma_{11}^{-1} & 0 \\ 0 & \Sigma_{22}^{-1} \end{bmatrix}
\begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix}$$
$$= (x_1 - \mu_1)^T\Sigma_{11}^{-1}(x_1 - \mu_1) + (x_2 - \mu_2)^T\Sigma_{22}^{-1}(x_2 - \mu_2),$$
so that $\exp\left\{-\tfrac{1}{2}(x - \mu)^T\Sigma^{-1}(x - \mu)\right\}$ factorizes into the product of $\exp\left\{-\tfrac{1}{2}(x_1 - \mu_1)^T\Sigma_{11}^{-1}(x_1 - \mu_1)\right\}$ and $\exp\left\{-\tfrac{1}{2}(x_2 - \mu_2)^T\Sigma_{22}^{-1}(x_2 - \mu_2)\right\}$. Therefore the p.d.f. can be written as
$$f(x) = g(x_1)\,h(x_2),$$
proving that $x_1$ and $x_2$ are independent. $\blacksquare$

3.2 Conditional distribution

Let
$$X = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}$$
be a partitioned MVN random $p$-vector, with $X_1$ of dimension $q$, mean
$$\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}$$
and covariance matrix
$$\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}.$$

The conditional distribution of $X_2$ given $X_1 = x_1$ is MVN with
$$E(X_2 \mid X_1 = x_1) = \mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(x_1 - \mu_1) \tag{3.4a}$$
$$\operatorname{Cov}(X_2 \mid X_1 = x_1) = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12} \tag{3.4b}$$

Note: the notation $X_1$ to denote the r.v. and $x_1$ to denote a specific constant value (a realization of $X_1$) will be very useful here.
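A direct implementation of (3.4a) and (3.4b) is sketched below; the function name `mvn_conditional` and the convention that $X_1$ occupies the first $q$ components are my own choices for illustration.

```python
import numpy as np

def mvn_conditional(mu, Sigma, q, x1):
    """Mean and covariance of X2 | X1 = x1 for an MVN partitioned
    so that X1 holds the first q components (equations 3.4a, 3.4b)."""
    mu1, mu2 = mu[:q], mu[q:]
    S11, S12 = Sigma[:q, :q], Sigma[:q, q:]
    S21, S22 = Sigma[q:, :q], Sigma[q:, q:]
    S11_inv = np.linalg.inv(S11)
    cond_mean = mu2 + S21 @ S11_inv @ (x1 - mu1)   # (3.4a)
    cond_cov = S22 - S21 @ S11_inv @ S12           # (3.4b)
    return cond_mean, cond_cov
```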

Proof of 3.4a

Define a transformation from $(X_1, X_2)$ to new variables $X_1$ and $X_2' = X_2 - \Sigma_{21}\Sigma_{11}^{-1}X_1$. This is achieved by the linear transformation
$$\begin{bmatrix} X_1 \\ X_2' \end{bmatrix} = \begin{bmatrix} I & 0 \\ -\Sigma_{21}\Sigma_{11}^{-1} & I \end{bmatrix}\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} \tag{3.5a}$$
$$= AX, \text{ say.} \tag{3.5b}$$

This linear relationship shows that $X_1, X_2'$ are jointly MVN (by the first property of MVN stated above).

We now show that $X_2'$ and $X_1$ are independent by proving that $X_1$ and $X_2'$ are uncorrelated. Approach 1:

$$\operatorname{Cov}(X_1, X_2') = \operatorname{Cov}\left(X_1, X_2 - \Sigma_{21}\Sigma_{11}^{-1}X_1\right) = \operatorname{Cov}(X_1, X_2) - \operatorname{Cov}(X_1, X_1)\Sigma_{11}^{-1}\Sigma_{12} = \Sigma_{12} - \Sigma_{11}\Sigma_{11}^{-1}\Sigma_{12} = 0$$

Approach 2:

In (3.3), write $A = \begin{bmatrix} B \\ C \end{bmatrix}$ where $B = \begin{bmatrix} I & 0 \end{bmatrix}$ and $C = \begin{bmatrix} -\Sigma_{21}\Sigma_{11}^{-1} & I \end{bmatrix}$. Then
$$\operatorname{Cov}(X_1, X_2') = \operatorname{Cov}(BX, CX) = B\Sigma C^T = \begin{bmatrix} I & 0 \end{bmatrix}\begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}\begin{bmatrix} -\Sigma_{11}^{-1}\Sigma_{12} \\ I \end{bmatrix} = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \end{bmatrix}\begin{bmatrix} -\Sigma_{11}^{-1}\Sigma_{12} \\ I \end{bmatrix} = 0$$

Since $X_2'$ and $X_1$ are MVN and uncorrelated, they are independent. Thus

$$E\left(X_2' \mid X_1 = x_1\right) = E\left(X_2'\right) = E\left(X_2 - \Sigma_{21}\Sigma_{11}^{-1}X_1\right) = \mu_2 - \Sigma_{21}\Sigma_{11}^{-1}\mu_1$$

Now, as $X_2' = X_2 - \Sigma_{21}\Sigma_{11}^{-1}X_1$ and $X_1 = x_1$ is given, we have
$$E(X_2 \mid X_1 = x_1) = E\left(X_2' \mid X_1 = x_1\right) + \Sigma_{21}\Sigma_{11}^{-1}x_1 = \mu_2 - \Sigma_{21}\Sigma_{11}^{-1}\mu_1 + \Sigma_{21}\Sigma_{11}^{-1}x_1 = \mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(x_1 - \mu_1)$$
as required.

Proof of 3.4b

Because $X_2'$ is independent of $X_1$,
$$\operatorname{Cov}\left(X_2' \mid X_1 = x_1\right) = \operatorname{Cov}\left(X_2'\right).$$

The left-hand side is
$$\text{LHS} = \operatorname{Cov}\left(X_2' \mid X_1 = x_1\right) = \operatorname{Cov}\left(X_2 - \Sigma_{21}\Sigma_{11}^{-1}x_1 \mid X_1 = x_1\right) = \operatorname{Cov}(X_2 \mid X_1 = x_1).$$
The right-hand side is

$$\text{RHS} = \operatorname{Cov}\left(X_2'\right) = \operatorname{Cov}\left(X_2 - \Sigma_{21}\Sigma_{11}^{-1}X_1\right) = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12},$$
following from the general expansion
$$\operatorname{Cov}(X_2 - DX_1) = \operatorname{Cov}(X_2, X_2) - D\operatorname{Cov}(X_1, X_2) - \operatorname{Cov}(X_2, X_1)D^T + D\operatorname{Cov}(X_1, X_1)D^T$$
with $D = \Sigma_{21}\Sigma_{11}^{-1}$. Therefore
$$\operatorname{Cov}(X_2 \mid X_1 = x_1) = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}$$
as required. $\blacksquare$

Example: Let $x$ have a MVN distribution with covariance matrix
$$\Sigma = \begin{bmatrix} 1 & \rho & 2\rho \\ \rho & 1 & 0 \\ 2\rho & 0 & 1 \end{bmatrix}.$$
Show that the conditional distribution of $(X_1, X_2)$ given $X_3 = x_3$ is also MVN with mean
$$\begin{bmatrix} \mu_1 + 2\rho(x_3 - \mu_3) \\ \mu_2 \end{bmatrix}$$
and covariance matrix
$$\begin{bmatrix} 1 - 4\rho^2 & \rho \\ \rho & 1 \end{bmatrix}.$$

Solution

Let $Y_1 = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}$ and $Y_2 = (X_3)$; then
$$E\,Y_1 = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \qquad E\,Y_2 = (\mu_3).$$

We have $\operatorname{Cov}\begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix} = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}$ where
$$\Sigma_{11} = \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}, \qquad \Sigma_{12} = \begin{bmatrix} 2\rho \\ 0 \end{bmatrix} = \Sigma_{21}^T, \qquad \Sigma_{22} = [1].$$

Hence
$$E[Y_1 \mid Y_2 = x_3] = E[Y_1] + \Sigma_{12}\Sigma_{22}^{-1}(x_3 - \mu_3) = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} + \begin{bmatrix} 2\rho \\ 0 \end{bmatrix}(x_3 - \mu_3) = \begin{bmatrix} \mu_1 + 2\rho(x_3 - \mu_3) \\ \mu_2 \end{bmatrix}$$
and
$$\operatorname{Cov}[Y_1 \mid Y_2 = x_3] = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} = \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix} - \begin{bmatrix} 2\rho \\ 0 \end{bmatrix}\begin{bmatrix} 2\rho & 0 \end{bmatrix} = \begin{bmatrix} 1 - 4\rho^2 & \rho \\ \rho & 1 \end{bmatrix}.$$
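The example can also be checked numerically; the sketch below uses an arbitrary value $\rho = 0.3$ (small enough to keep $\Sigma$ positive definite) and an arbitrary mean vector.

```python
import numpy as np

rho = 0.3                        # any value keeping Sigma positive definite
mu = np.array([0.5, -1.0, 2.0])  # arbitrary illustrative mean (mu1, mu2, mu3)
Sigma = np.array([[1.0,     rho, 2 * rho],
                  [rho,     1.0, 0.0],
                  [2 * rho, 0.0, 1.0]])
x3 = 1.7

S11 = Sigma[:2, :2]   # Cov(Y1) with Y1 = (X1, X2)
S12 = Sigma[:2, 2:]   # Cov(Y1, Y2) with Y2 = X3
S22 = Sigma[2:, 2:]

cond_mean = mu[:2] + (S12 @ np.linalg.inv(S22)).ravel() * (x3 - mu[2])
cond_cov = S11 - S12 @ np.linalg.inv(S22) @ S12.T

print(cond_mean)  # [mu1 + 2*rho*(x3 - mu3), mu2]
print(cond_cov)   # [[1 - 4*rho**2, rho], [rho, 1]]
```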

3.3 Maximum-likelihood estimation

Let $X = (x_1, \ldots, x_n)^T$ contain an independent random sample of size $n$ from $N_p(\mu, \Sigma)$. The maximum likelihood estimates (MLEs) of $\mu, \Sigma$ are the sample mean and covariance matrix (with divisor $n$):

$$\hat{\mu} = \bar{x} \tag{3.6a}$$
$$\hat{\Sigma} = S \tag{3.6b}$$

The likelihood is a function of the parameters $\mu, \Sigma$ given the data $X$:

$$L(\mu, \Sigma \mid X) = \prod_{r=1}^n f(x_r; \mu, \Sigma) \tag{3.7}$$

The RHS is evaluated by substituting the individual data vectors $\{x_1, \ldots, x_n\}$ in turn into the p.d.f. of $N_p(\mu, \Sigma)$ and taking the product:

$$\prod_{r=1}^n f(x_r; \mu, \Sigma) = (2\pi)^{-np/2}\,|\Sigma|^{-n/2}\exp\left\{-\frac{1}{2}\sum_{r=1}^n (x_r - \mu)^T\Sigma^{-1}(x_r - \mu)\right\}$$
Maximizing $L$ is equivalent to minimizing the "log-likelihood" function

$$l(\mu, \Sigma) = -2\log L = -2\sum_{r=1}^n \log f(x_r; \mu, \Sigma) = K + n\log|\Sigma| + \sum_{r=1}^n (x_r - \mu)^T\Sigma^{-1}(x_r - \mu) \tag{3.8}$$
where $K$ is a constant independent of $\mu, \Sigma$.

Result 3.3
$$l(\mu, \Sigma) = n\log|\Sigma| + n\operatorname{tr}\left[\Sigma^{-1}\left(S + dd^T\right)\right] \tag{3.9}$$
up to an additive constant, where $d = \bar{x} - \mu$.

Proof

Noting that $x_r - \mu = (x_r - \bar{x}) + d$, the final term in the likelihood expression (3.8) becomes
$$\sum_{r=1}^n (x_r - \mu)^T\Sigma^{-1}(x_r - \mu) = \sum_{r=1}^n (x_r - \bar{x})^T\Sigma^{-1}(x_r - \bar{x}) + n\,d^T\Sigma^{-1}d = n\operatorname{tr}\left(\Sigma^{-1}S\right) + n\,d^T\Sigma^{-1}d = n\operatorname{tr}\left[\Sigma^{-1}\left(S + dd^T\right)\right],$$
proving the expression (3.9). Note that the cross-product terms have vanished because $\sum_{r=1}^n x_r = n\bar{x}$ and therefore
$$\sum_{r=1}^n d^T\Sigma^{-1}(x_r - \bar{x}) = d^T\Sigma^{-1}\sum_{r=1}^n (x_r - \bar{x}) = 0 = \sum_{r=1}^n (x_r - \bar{x})^T\Sigma^{-1}d.$$
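The decomposition used in (3.9) is easy to verify numerically; the sketch below uses arbitrary data and parameter values, since the identity is algebraic and holds for any choice.

```python
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary data and parameter values; the identity holds for any choice
n, p = 50, 3
X = rng.normal(size=(n, p))
mu = np.array([0.2, -0.1, 0.4])
Sigma_inv = np.linalg.inv(np.array([[1.0, 0.2, 0.0],
                                    [0.2, 1.5, 0.3],
                                    [0.0, 0.3, 2.0]]))

xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False, bias=True)   # covariance with divisor n
d = xbar - mu

lhs = sum((xr - mu) @ Sigma_inv @ (xr - mu) for xr in X)
rhs = n * np.trace(Sigma_inv @ (S + np.outer(d, d)))
print(lhs, rhs)   # equal up to rounding error
```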

In (3.9) the dependence on $\mu$ is entirely through $d$. Now assume that $\Sigma$ is positive definite (p.d.); then so is $\Sigma^{-1}$, as
$$\Sigma^{-1} = V\Lambda^{-1}V^T,$$
where $\Sigma = V\Lambda V^T$ is the eigenanalysis of $\Sigma$. Thus for all $d \neq 0$ we have $d^T\Sigma^{-1}d > 0$. Hence $l(\mu, \Sigma)$ is minimized with respect to $\mu$ for fixed $\Sigma$ when $d = 0$, i.e.
$$\hat{\mu} = \bar{x}.$$

Final part of proof: to minimize the log-likelihood $l(\hat{\mu}, \Sigma)$ w.r.t. $\Sigma$, let
$$l(\hat{\mu}, \Sigma) = n\left[\log|\Sigma| + \operatorname{tr}\left(\Sigma^{-1}S\right)\right] = \phi(\Sigma). \tag{3.10}$$

We show that
$$\phi(\Sigma) - \phi(S) = n\left[\log|\Sigma| - \log|S| + \operatorname{tr}\left(\Sigma^{-1}S\right) - p\right] = n\left[\operatorname{tr}\left(\Sigma^{-1}S\right) - \log\left|\Sigma^{-1}S\right| - p\right] \geq 0. \tag{3.11}$$

Lemma 1: $\Sigma^{-1}S$ is positive semi-definite (proved elsewhere). Therefore the eigenvalues of $\Sigma^{-1}S$ are non-negative, and strictly positive when $S$ is non-singular, as assumed here.

Lemma 2: For any set of positive numbers, $A \geq \log G + 1$, where $A$ and $G$ are the arithmetic and geometric means respectively.

Proof

For all $x$ we have $e^x \geq 1 + x$ (simple exercise). Consider a set of $n$ strictly positive numbers $\{y_i\}$:
$$y_i \geq 1 + \log y_i$$
$$\sum y_i \geq n + \sum \log y_i$$
$$A = \frac{1}{n}\sum y_i \geq 1 + \frac{1}{n}\sum \log y_i = 1 + \log G,$$
as required.

Recall that for any $(n \times n)$ matrix $A$ we have $\operatorname{tr}(A) = \sum_{i=1}^n \lambda_i$, the sum of the eigenvalues, and $|A| = \prod \lambda_i$, the product of the eigenvalues. Let $\lambda_i$ $(i = 1, \ldots, p)$ be the positive eigenvalues of $\Sigma^{-1}S$ and substitute in (3.11):

$$\log\left|\Sigma^{-1}S\right| = \sum \log\lambda_i = p\log G$$
$$\operatorname{tr}\left(\Sigma^{-1}S\right) = \sum \lambda_i = pA$$

Hence
$$\phi(\Sigma) - \phi(S) = np\left\{A - \log G - 1\right\} \geq 0.$$
This proves that the MLEs are as stated in (3.6). $\blacksquare$
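A numerical illustration of the result: the sketch below evaluates $-2\log L$ (up to a constant) in the form (3.9) at the MLEs $(\bar{x}, S)$ and at a few perturbed parameter values, and checks that the MLEs give the smallest value. The helper name `minus2_loglik` and the specific numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)

def minus2_loglik(mu, Sigma, X):
    """-2 log L up to the additive constant, in the form (3.9)."""
    n = len(X)
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False, bias=True)   # divisor n
    d = xbar - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return n * (logdet + np.trace(np.linalg.inv(Sigma) @ (S + np.outer(d, d))))

X = rng.multivariate_normal([1.0, -1.0], [[2.0, 0.5], [0.5, 1.0]], size=100)
mu_hat, Sigma_hat = X.mean(axis=0), np.cov(X, rowvar=False, bias=True)

best = minus2_loglik(mu_hat, Sigma_hat, X)
for _ in range(5):
    mu_alt = mu_hat + rng.normal(scale=0.2, size=2)
    Sigma_alt = Sigma_hat + np.diag(rng.uniform(0.05, 0.3, size=2))  # still p.d.
    assert minus2_loglik(mu_alt, Sigma_alt, X) >= best   # MLEs give the minimum
```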

3.3 Distribution of $\bar{x}$ and $S$

The Wishart distribution (Definition)

If $M$ $(p \times p)$ can be written $M = X^T X$ where $X$ $(m \times p)$ is a data matrix from $N_p(0, \Sigma)$, then $M$ is said to have a Wishart distribution with scale matrix $\Sigma$ and degrees of freedom $m$. We write

$$M \sim W_p(\Sigma, m) \tag{3.12}$$

When $\Sigma = I_p$ the distribution is said to be in standard form.

Note:

The Wishart distribution is the multivariate generalization of the chi-squared ($\chi^2$) distribution.

Additive property of matrices with a Wishart distribution

Let $M_1$, $M_2$ be matrices having the Wishart distributions
$$M_1 \sim W_p(\Sigma, m_1), \qquad M_2 \sim W_p(\Sigma, m_2)$$
independently; then
$$M_1 + M_2 \sim W_p(\Sigma, m_1 + m_2).$$

This property follows from the definition of the Wishart distribution because data matrices are additive in the sense that if
$$X = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}$$
is a combined data matrix consisting of $m_1 + m_2$ rows, then
$$X^T X = X_1^T X_1 + X_2^T X_2$$
is the matrix (known as the "Gram matrix") formed from the combined data matrix $X$.

Case of $p = 1$

When $p = 1$ we know from the definition of $\chi^2_r$ as the distribution of the sum of squares of $r$ independent $N(0, 1)$ variates that
$$M = \sum_{i=1}^m x_i^2 \sim \sigma^2\chi^2_m,$$
so that
$$W_1\left(\sigma^2, m\right) \equiv \sigma^2\chi^2_m.$$

Sampling distributions

Let $x_1, x_2, \ldots, x_n$ be a random sample of size $n$ from $N_p(\mu, \Sigma)$. Then

1. The sample mean $\bar{x}$ has the normal distribution
$$\bar{x} \sim N_p\left(\mu, \tfrac{1}{n}\Sigma\right)$$

2. The (scaled) sample covariance matrix has the Wishart distribution:
$$(n-1)S_u \sim W_p(\Sigma, n-1)$$

3. The distributions of $\bar{x}$ and $S_u$ are independent.
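The Wishart sampling result for $S_u$ can be illustrated by simulation, using the standard fact that $E[W_p(\Sigma, m)] = m\Sigma$: repeatedly drawing samples of size $n$, forming $(n-1)S_u$, and averaging should give approximately $(n-1)\Sigma$. The parameter values below are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(3)

p, n, reps = 2, 10, 5_000
mu = np.zeros(p)
Sigma = np.array([[1.0, 0.4],
                  [0.4, 2.0]])

M_sum = np.zeros((p, p))
for _ in range(reps):
    X = rng.multivariate_normal(mu, Sigma, size=n)
    Su = np.cov(X, rowvar=False)   # unbiased covariance (divisor n - 1)
    M_sum += (n - 1) * Su          # a draw of (n - 1) Su ~ W_p(Sigma, n - 1)

print(M_sum / reps)        # approximately E[W_p(Sigma, n - 1)] = (n - 1) Sigma
print((n - 1) * Sigma)
```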

3.4 Estimators for special circumstances

3.4.1 $\mu$ proportional to a given vector

Sometimes $\mu$ is known to be proportional to a given vector, so $\mu = k\mu_0$ with $\mu_0$ being a known vector. For example, if $x$ represents a sample of repeated measurements then $\mu = k\mathbf{1}$, where $\mathbf{1} = (1, 1, \ldots, 1)^T$ is the $p$-vector of 1's.

We find the MLE of $k$ for this situation. Suppose $\Sigma$ is known and $\mu = k\mu_0$. Let $d_0 = \bar{x} - k\mu_0$. The log-likelihood is
$$l(k) = -2\log L = n\left[\log|\Sigma| + \operatorname{tr}\left(\Sigma^{-1}\left(S + d_0 d_0^T\right)\right)\right]$$
$$= n\left[\log|\Sigma| + \operatorname{tr}\left(\Sigma^{-1}S\right) + (\bar{x} - k\mu_0)^T\Sigma^{-1}(\bar{x} - k\mu_0)\right]$$
$$= n\left[\bar{x}^T\Sigma^{-1}\bar{x} - 2k\,\mu_0^T\Sigma^{-1}\bar{x} + k^2\mu_0^T\Sigma^{-1}\mu_0\right] + \text{constant terms independent of } k.$$

Set $\dfrac{dl}{dk} = 0$ to minimize $l(k)$ w.r.t. $k$:
$$-2\mu_0^T\Sigma^{-1}\bar{x} + 2\mu_0^T\Sigma^{-1}\mu_0\,k = 0,$$
from which
$$\hat{k} = \frac{\mu_0^T\Sigma^{-1}\bar{x}}{\mu_0^T\Sigma^{-1}\mu_0}. \tag{3.13}$$

Properties: We now show that $\hat{k}$ is an unbiased estimator of $k$ and determine the variance of $\hat{k}$. In (3.13), $\hat{k}$ takes the form $c^T\bar{x}$ with $c^T = \alpha^{-1}\mu_0^T\Sigma^{-1}$ and $\alpha = \mu_0^T\Sigma^{-1}\mu_0$, so

$$E\left[\hat{k}\right] = c^T E[\bar{x}] = k\,c^T\mu_0 = \frac{k\,\mu_0^T\Sigma^{-1}\mu_0}{\mu_0^T\Sigma^{-1}\mu_0} = k,$$
since $E[\bar{x}] = k\mu_0$. Hence
$$E\left[\hat{k}\right] = k, \tag{3.14}$$
showing that $\hat{k}$ is an unbiased estimator.

Noting that $\operatorname{Var}[\bar{x}] = \frac{1}{n}\Sigma$ and therefore that $\operatorname{Var}\left[c^T\bar{x}\right] = \frac{1}{n}c^T\Sigma c$, we have
$$\operatorname{Var}\left[\hat{k}\right] = \frac{1}{n}c^T\Sigma c = \frac{1}{n}\left(\mu_0^T\Sigma^{-1}\mu_0\right)^{-2}\mu_0^T\Sigma^{-1}\Sigma\,\Sigma^{-1}\mu_0 = \frac{1}{n\,\mu_0^T\Sigma^{-1}\mu_0}. \tag{3.15}$$
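A simulation sketch of the estimator (3.13), checking the unbiasedness (3.14) and the variance formula (3.15); the values of $k$, $\mu_0$, $\Sigma$ and $n$ below are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative setup: mu = k * mu0 with k = 2 and Sigma known
k_true, mu0 = 2.0, np.array([1.0, 1.0, 1.0])
Sigma = np.array([[1.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 1.0]])
n = 25
Sigma_inv = np.linalg.inv(Sigma)

k_hats = []
for _ in range(5_000):
    X = rng.multivariate_normal(k_true * mu0, Sigma, size=n)
    xbar = X.mean(axis=0)
    k_hats.append((mu0 @ Sigma_inv @ xbar) / (mu0 @ Sigma_inv @ mu0))   # (3.13)

print(np.mean(k_hats))                          # approximately k_true (3.14)
print(np.var(k_hats))                           # approximately the variance (3.15)
print(1.0 / (n * (mu0 @ Sigma_inv @ mu0)))
```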

3.4.2 Linear restriction on $\mu$

We determine an estimator for $\mu$ to satisfy a linear restriction
$$A\mu = b,$$
where $A$ $(m \times p)$ and $b$ $(m \times 1)$ are given constants and $\Sigma$ is assumed to be known. We write the restriction in vector form $g(\mu) = 0$ and form the Lagrangean

$$\mathcal{L}(\mu, \lambda) = l(\mu) + 2\lambda^T g(\mu),$$

where $\lambda = (\lambda_1, \ldots, \lambda_m)^T$ is a vector of Lagrange multipliers (the factor 2 is inserted just for convenience).

$$\mathcal{L}(\mu, \lambda) = l(\mu) + 2\lambda^T(A\mu - b) = n(\bar{x} - \mu)^T\Sigma^{-1}(\bar{x} - \mu) + 2\lambda^T(A\mu - b),$$
ignoring constant terms involving $\Sigma$.

Set $\dfrac{d}{d\mu}\mathcal{L}(\mu, \lambda) = 0$, using results from Example Sheet 2:

$$-2n\Sigma^{-1}(\bar{x} - \mu) + 2A^T\lambda = 0,$$
and, absorbing the factor $n$ into $\lambda$,
$$\bar{x} - \mu = \Sigma A^T\lambda. \tag{3.16}$$
We use the constraint $A\mu = b$ to evaluate the Lagrange multipliers $\lambda$. Premultiply by $A$:
$$A\bar{x} - b = A\Sigma A^T\lambda$$
$$\lambda = \left(A\Sigma A^T\right)^{-1}(A\bar{x} - b)$$
Substitute into (3.16):
$$\hat{\mu} = \bar{x} - \Sigma A^T\left(A\Sigma A^T\right)^{-1}(A\bar{x} - b) \tag{3.17}$$
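Equation (3.17) translates directly into code; the sketch below (the helper name `restricted_mean_mle` is mine, and the toy restriction forces the two components of $\mu$ to be equal) computes the restricted estimate and checks that it satisfies $A\hat{\mu} = b$.

```python
import numpy as np

def restricted_mean_mle(xbar, Sigma, A, b):
    """MLE of mu subject to A mu = b, with Sigma known (equation 3.17)."""
    SAt = Sigma @ A.T
    lam_part = np.linalg.solve(A @ SAt, A @ xbar - b)
    return xbar - SAt @ lam_part

# Small illustrative check: force the two components of mu to be equal
xbar = np.array([1.3, 0.7])
Sigma = np.array([[1.0, 0.2],
                  [0.2, 2.0]])
A = np.array([[1.0, -1.0]])
b = np.array([0.0])

mu_hat = restricted_mean_mle(xbar, Sigma, A, b)
print(mu_hat)            # restricted estimate
print(A @ mu_hat - b)    # approximately 0, so the constraint holds
```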

3.4.3 Covariance matrix $\Sigma$ proportional to a given matrix

We consider estimating $k$ when $\Sigma = k\Sigma_0$, where $\Sigma_0$ is a given constant matrix. The likelihood (3.8) takes the form, when $d = 0$ ($\hat{\mu} = \bar{x}$),

$$l(k) = n\left[\log|k\Sigma_0| + \operatorname{tr}\left(\frac{1}{k}\Sigma_0^{-1}S\right)\right]$$
plus constant terms (not involving $k$).

$$l(k) = n\left[p\log k + \frac{1}{k}\operatorname{tr}\left(\Sigma_0^{-1}S\right)\right] + \text{constant terms}$$
$$\frac{dl}{dk} = 0 \;\Rightarrow\; \frac{p}{k} - \frac{1}{k^2}\operatorname{tr}\left(\Sigma_0^{-1}S\right) = 0$$
Hence
$$\hat{k} = \frac{\operatorname{tr}\left(\Sigma_0^{-1}S\right)}{p}. \tag{3.18}$$
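A quick simulation check of (3.18), with an arbitrary choice of $k$ and $\Sigma_0$: for a large sample the estimate should be close to the true proportionality constant.

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative check: Sigma = k * Sigma0 with k = 3
k_true = 3.0
Sigma0 = np.array([[1.0, 0.5],
                   [0.5, 2.0]])
p, n = 2, 5_000

X = rng.multivariate_normal(np.zeros(p), k_true * Sigma0, size=n)
S = np.cov(X, rowvar=False, bias=True)           # MLE covariance (divisor n)
k_hat = np.trace(np.linalg.inv(Sigma0) @ S) / p  # (3.18)
print(k_hat)   # close to k_true for large n
```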
