3. The Multivariate Normal Distribution

The multivariate normal (MVN) distribution is a generalization of the univariate normal distribution, which has the probability density function (p.d.f.)

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}, \qquad -\infty < x < \infty,$$
where $\mu$ = mean of the distribution and $\sigma^2$ = variance. In $p$ dimensions the density becomes

$$f(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}}\exp\left\{-\tfrac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right\} \tag{3.1}$$
Within the vector $\mu$ there are $p$ (independent) parameters and within the symmetric covariance matrix $\Sigma$ there are $\tfrac{1}{2}p(p+1)$ independent parameters [$\tfrac{1}{2}p(p+3)$ independent parameters in total]. We use the notation

$$x \sim N_p(\mu, \Sigma) \tag{3.2}$$
to denote a random vector $x$ having the $p$-variate MVN distribution with

$$E(x) = \mu, \qquad \operatorname{Cov}(x) = \Sigma.$$

Note that MVN distributions are entirely characterized by the first and second moments of the distribution.
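As a quick numerical sanity check of (3.1), the following sketch (assuming NumPy and SciPy are available; the parameter values are arbitrary illustrations) evaluates the density directly and compares it with `scipy.stats.multivariate_normal`.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Arbitrary illustrative parameters for a p = 2 MVN
mu = np.array([1.0, -0.5])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
x = np.array([0.7, 0.1])

# Density (3.1) evaluated directly
p = len(mu)
d = x - mu
quad = d @ np.linalg.inv(Sigma) @ d
f_direct = np.exp(-0.5 * quad) / ((2 * np.pi) ** (p / 2) * np.linalg.det(Sigma) ** 0.5)

# Same density via SciPy for comparison
f_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)
print(f_direct, f_scipy)   # the two values should agree to machine precision
```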

3.1 Basic properties

If $x$ $(p \times 1)$ is MVN with mean $\mu$ and covariance matrix $\Sigma$, then:

Any linear combination of $x$ is MVN (a short simulation sketch illustrating this property follows this list). Let $y = Ax + c$ with $A$ $(q \times p)$ and $c$ $(q \times 1)$; then
$$y \sim N_q(\mu_y, \Sigma_y)$$
where $\mu_y = A\mu + c$ and $\Sigma_y = A\Sigma A^T$.

Any subset of variables in $x$ has a MVN distribution.

If a set of variables is uncorrelated, then they are independently distributed. In particular:

i) if $\sigma_{ij} = 0$ then $x_i, x_j$ are independent;

ii) if $x$ is MVN with covariance matrix $\Sigma$, then $Ax$ and $Bx$ are independent if and only if
$$\operatorname{Cov}(Ax, Bx) = A\Sigma B^T = 0 \tag{3.3}$$

Conditional distributions are MVN.
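The linear-transformation property can be illustrated by simulation. The sketch below uses hypothetical values for $\mu$, $\Sigma$, $A$ and $c$, draws a large sample of $x$, forms $y = Ax + c$, and compares the empirical mean and covariance of $y$ with $A\mu + c$ and $A\Sigma A^T$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative p = 3 MVN and an arbitrary (2 x 3) transformation y = Ax + c
mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 2.0, 0.0],
                  [0.2, 0.0, 1.5]])
A = np.array([[1.0, -1.0, 0.0],
              [0.5,  0.5, 1.0]])
c = np.array([3.0, -2.0])

X = rng.multivariate_normal(mu, Sigma, size=200_000)   # rows are draws of x
Y = X @ A.T + c                                        # rows are draws of y

print(Y.mean(axis=0))           # approximately A mu + c
print(A @ mu + c)
print(np.cov(Y, rowvar=False))  # approximately A Sigma A^T
print(A @ Sigma @ A.T)
```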

Result: For the MVN distribution, variables are uncorrelated $\Leftrightarrow$ variables are independent.

Proof: Let $x$ $(p \times 1)$ be partitioned as
$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix},$$
where $x_1$ is $q \times 1$ and $x_2$ is $(p-q) \times 1$, with mean vector
$$\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}$$
and covariance matrix
$$\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix},$$
where $\Sigma_{11}$ is $q \times q$ and $\Sigma_{22}$ is $(p-q) \times (p-q)$.

i) Independent $\Rightarrow$ uncorrelated (always holds).
Suppose $x_1, x_2$ are independent. Then $f(x_1, x_2) = h(x_1)\,g(x_2)$ is a factorization of the multivariate p.d.f., and $\Sigma_{12} = \operatorname{Cov}(x_1, x_2) = E\left[(x_1 - \mu_1)(x_2 - \mu_2)^T\right]$ factorizes into the product of $E[(x_1 - \mu_1)]$ and $E\left[(x_2 - \mu_2)^T\right]$, which are both zero since $E(x_1) = \mu_1$ and $E(x_2) = \mu_2$. Hence $\Sigma_{12} = 0$.

ii) Uncorrelated $\Rightarrow$ independent (for MVN).
This result depends on factorizing the p.d.f. (3.1) when $\Sigma_{12} = 0$. In this case $(x - \mu)^T\Sigma^{-1}(x - \mu)$ has the partitioned form
$$\begin{bmatrix} (x_1 - \mu_1)^T, \; (x_2 - \mu_2)^T \end{bmatrix}
\begin{bmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} \end{bmatrix}^{-1}
\begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix}
= \begin{bmatrix} (x_1 - \mu_1)^T, \; (x_2 - \mu_2)^T \end{bmatrix}
\begin{bmatrix} \Sigma_{11}^{-1} & 0 \\ 0 & \Sigma_{22}^{-1} \end{bmatrix}
\begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix}$$
$$= (x_1 - \mu_1)^T\Sigma_{11}^{-1}(x_1 - \mu_1) + (x_2 - \mu_2)^T\Sigma_{22}^{-1}(x_2 - \mu_2),$$
so that $\exp\left\{-\tfrac{1}{2}(x - \mu)^T\Sigma^{-1}(x - \mu)\right\}$ factorizes into the product of $\exp\left\{-\tfrac{1}{2}(x_1 - \mu_1)^T\Sigma_{11}^{-1}(x_1 - \mu_1)\right\}$ and $\exp\left\{-\tfrac{1}{2}(x_2 - \mu_2)^T\Sigma_{22}^{-1}(x_2 - \mu_2)\right\}$. Therefore the p.d.f. can be written as
$$f(x) = g(x_1)\,h(x_2),$$
proving that $x_1$ and $x_2$ are independent. $\blacksquare$

3.2 Conditional distribution

Let
$$X = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}$$
be a partitioned MVN random $p$-vector, with $X_1$ of dimension $q$, mean
$$\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}$$
and covariance matrix
$$\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}.$$

The conditional distribution of $X_2$ given $X_1 = x_1$ is MVN with
$$E(X_2 \mid X_1 = x_1) = \mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(x_1 - \mu_1) \tag{3.4a}$$
$$\operatorname{Cov}(X_2 \mid X_1 = x_1) = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12} \tag{3.4b}$$

Note: the notation $X_1$ to denote the r.v. and $x_1$ to denote a specific constant value (a realization of $X_1$) will be very useful here.
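A direct implementation of (3.4a) and (3.4b) is sketched below; the function name `mvn_conditional` and the convention that $X_1$ occupies the first $q$ components are my own choices for illustration.

```python
import numpy as np

def mvn_conditional(mu, Sigma, q, x1):
    """Mean and covariance of X2 | X1 = x1 for an MVN partitioned
    so that X1 holds the first q components (equations 3.4a, 3.4b)."""
    mu1, mu2 = mu[:q], mu[q:]
    S11, S12 = Sigma[:q, :q], Sigma[:q, q:]
    S21, S22 = Sigma[q:, :q], Sigma[q:, q:]
    S11_inv = np.linalg.inv(S11)
    cond_mean = mu2 + S21 @ S11_inv @ (x1 - mu1)   # (3.4a)
    cond_cov = S22 - S21 @ S11_inv @ S12           # (3.4b)
    return cond_mean, cond_cov
```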

Proof of 3.4a

Define a transformation from $(X_1, X_2)$ to new variables $X_1$ and $X_2' = X_2 - \Sigma_{21}\Sigma_{11}^{-1}X_1$. This is achieved by the linear transformation
$$\begin{bmatrix} X_1 \\ X_2' \end{bmatrix} = \begin{bmatrix} I & 0 \\ -\Sigma_{21}\Sigma_{11}^{-1} & I \end{bmatrix}\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} \tag{3.5a}$$
$$= AX, \text{ say.} \tag{3.5b}$$

This linear relationship shows that $X_1, X_2'$ are jointly MVN (by the first property of MVN stated above).

We now show that $X_2'$ and $X_1$ are independent by proving that $X_1$ and $X_2'$ are uncorrelated. Approach 1:

$$\operatorname{Cov}(X_1, X_2') = \operatorname{Cov}\left(X_1, X_2 - \Sigma_{21}\Sigma_{11}^{-1}X_1\right) = \operatorname{Cov}(X_1, X_2) - \operatorname{Cov}(X_1, X_1)\Sigma_{11}^{-1}\Sigma_{12} = \Sigma_{12} - \Sigma_{11}\Sigma_{11}^{-1}\Sigma_{12} = 0$$

Approach 2:

In (3.3), write $A = \begin{bmatrix} B \\ C \end{bmatrix}$ where $B = \begin{bmatrix} I & 0 \end{bmatrix}$ and $C = \begin{bmatrix} -\Sigma_{21}\Sigma_{11}^{-1} & I \end{bmatrix}$. Then
$$\operatorname{Cov}(X_1, X_2') = \operatorname{Cov}(BX, CX) = B\Sigma C^T = \begin{bmatrix} I & 0 \end{bmatrix}\begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}\begin{bmatrix} -\Sigma_{11}^{-1}\Sigma_{12} \\ I \end{bmatrix} = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \end{bmatrix}\begin{bmatrix} -\Sigma_{11}^{-1}\Sigma_{12} \\ I \end{bmatrix} = 0$$

Since $X_2'$ and $X_1$ are MVN and uncorrelated, they are independent. Thus

$$E\left(X_2' \mid X_1 = x_1\right) = E\left(X_2'\right) = E\left(X_2 - \Sigma_{21}\Sigma_{11}^{-1}X_1\right) = \mu_2 - \Sigma_{21}\Sigma_{11}^{-1}\mu_1$$

Now, as $X_2' = X_2 - \Sigma_{21}\Sigma_{11}^{-1}X_1$ and $X_1 = x_1$ is given, we have
$$E(X_2 \mid X_1 = x_1) = E\left(X_2' \mid X_1 = x_1\right) + \Sigma_{21}\Sigma_{11}^{-1}x_1 = \mu_2 - \Sigma_{21}\Sigma_{11}^{-1}\mu_1 + \Sigma_{21}\Sigma_{11}^{-1}x_1 = \mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(x_1 - \mu_1)$$
as required.

Proof of 3.4b

Because $X_2'$ is independent of $X_1$,
$$\operatorname{Cov}\left(X_2' \mid X_1 = x_1\right) = \operatorname{Cov}\left(X_2'\right).$$

The left-hand side is
$$\text{LHS} = \operatorname{Cov}\left(X_2' \mid X_1 = x_1\right) = \operatorname{Cov}\left(X_2 - \Sigma_{21}\Sigma_{11}^{-1}x_1 \mid X_1 = x_1\right) = \operatorname{Cov}(X_2 \mid X_1 = x_1).$$
The right-hand side is

$$\text{RHS} = \operatorname{Cov}\left(X_2'\right) = \operatorname{Cov}\left(X_2 - \Sigma_{21}\Sigma_{11}^{-1}X_1\right) = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12},$$
following from the general expansion
$$\operatorname{Cov}(X_2 - DX_1) = \operatorname{Cov}(X_2, X_2) - D\operatorname{Cov}(X_1, X_2) - \operatorname{Cov}(X_2, X_1)D^T + D\operatorname{Cov}(X_1, X_1)D^T$$
with $D = \Sigma_{21}\Sigma_{11}^{-1}$. Therefore
$$\operatorname{Cov}(X_2 \mid X_1 = x_1) = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}$$
as required. $\blacksquare$

Example: Let $x$ have a MVN distribution with covariance matrix
$$\Sigma = \begin{bmatrix} 1 & \rho & 2\rho \\ \rho & 1 & 0 \\ 2\rho & 0 & 1 \end{bmatrix}.$$
Show that the conditional distribution of $(X_1, X_2)$ given $X_3 = x_3$ is also MVN with mean
$$\begin{bmatrix} \mu_1 + 2\rho(x_3 - \mu_3) \\ \mu_2 \end{bmatrix}$$
and covariance matrix
$$\begin{bmatrix} 1 - 4\rho^2 & \rho \\ \rho & 1 \end{bmatrix}.$$

Solution

Let $Y_1 = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}$ and $Y_2 = (X_3)$; then
$$E\,Y_1 = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \qquad E\,Y_2 = (\mu_3).$$

We have $\operatorname{Cov}\begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix} = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}$ where
$$\Sigma_{11} = \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}, \qquad \Sigma_{12} = \begin{bmatrix} 2\rho \\ 0 \end{bmatrix} = \Sigma_{21}^T, \qquad \Sigma_{22} = [1].$$

Hence
$$E[Y_1 \mid Y_2 = x_3] = E[Y_1] + \Sigma_{12}\Sigma_{22}^{-1}(x_3 - \mu_3) = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} + \begin{bmatrix} 2\rho \\ 0 \end{bmatrix}(x_3 - \mu_3) = \begin{bmatrix} \mu_1 + 2\rho(x_3 - \mu_3) \\ \mu_2 \end{bmatrix}$$
and
$$\operatorname{Cov}[Y_1 \mid Y_2 = x_3] = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} = \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix} - \begin{bmatrix} 2\rho \\ 0 \end{bmatrix}\begin{bmatrix} 2\rho & 0 \end{bmatrix} = \begin{bmatrix} 1 - 4\rho^2 & \rho \\ \rho & 1 \end{bmatrix}.$$
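The example can also be checked numerically; the sketch below uses an arbitrary value $\rho = 0.3$ (small enough to keep $\Sigma$ positive definite) and an arbitrary mean vector.

```python
import numpy as np

rho = 0.3                        # any value keeping Sigma positive definite
mu = np.array([0.5, -1.0, 2.0])  # arbitrary illustrative mean (mu1, mu2, mu3)
Sigma = np.array([[1.0,     rho, 2 * rho],
                  [rho,     1.0, 0.0],
                  [2 * rho, 0.0, 1.0]])
x3 = 1.7

S11 = Sigma[:2, :2]   # Cov(Y1) with Y1 = (X1, X2)
S12 = Sigma[:2, 2:]   # Cov(Y1, Y2) with Y2 = X3
S22 = Sigma[2:, 2:]

cond_mean = mu[:2] + (S12 @ np.linalg.inv(S22)).ravel() * (x3 - mu[2])
cond_cov = S11 - S12 @ np.linalg.inv(S22) @ S12.T

print(cond_mean)  # [mu1 + 2*rho*(x3 - mu3), mu2]
print(cond_cov)   # [[1 - 4*rho**2, rho], [rho, 1]]
```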

3.3 Maximum-likelihood estimation

Let $X = (x_1, \ldots, x_n)^T$ contain an independent random sample of size $n$ from $N_p(\mu, \Sigma)$. The maximum likelihood estimates (MLEs) of $\mu, \Sigma$ are the sample mean and covariance matrix (with divisor $n$):

$$\hat{\mu} = \bar{x} \tag{3.6a}$$
$$\hat{\Sigma} = S \tag{3.6b}$$

The likelihood is a function of the parameters $\mu, \Sigma$ given the data $X$:

$$L(\mu, \Sigma \mid X) = \prod_{r=1}^n f(x_r; \mu, \Sigma) \tag{3.7}$$

The RHS is evaluated by substituting the individual data vectors $\{x_1, \ldots, x_n\}$ in turn into the p.d.f. of $N_p(\mu, \Sigma)$ and taking the product:

$$\prod_{r=1}^n f(x_r; \mu, \Sigma) = (2\pi)^{-np/2}\,|\Sigma|^{-n/2}\exp\left\{-\frac{1}{2}\sum_{r=1}^n (x_r - \mu)^T\Sigma^{-1}(x_r - \mu)\right\}$$
Maximizing $L$ is equivalent to minimizing the "log-likelihood" function

$$l(\mu, \Sigma) = -2\log L = -2\sum_{r=1}^n \log f(x_r; \mu, \Sigma) = K + n\log|\Sigma| + \sum_{r=1}^n (x_r - \mu)^T\Sigma^{-1}(x_r - \mu) \tag{3.8}$$
where $K$ is a constant independent of $\mu, \Sigma$.

Result 3.3
$$l(\mu, \Sigma) = n\log|\Sigma| + n\operatorname{tr}\left[\Sigma^{-1}\left(S + dd^T\right)\right] \tag{3.9}$$
up to an additive constant, where $d = \bar{x} - \mu$.

Proof

Noting that $x_r - \mu = (x_r - \bar{x}) + d$, the final term in the likelihood expression (3.8) becomes
$$\sum_{r=1}^n (x_r - \mu)^T\Sigma^{-1}(x_r - \mu) = \sum_{r=1}^n (x_r - \bar{x})^T\Sigma^{-1}(x_r - \bar{x}) + n\,d^T\Sigma^{-1}d = n\operatorname{tr}\left(\Sigma^{-1}S\right) + n\,d^T\Sigma^{-1}d = n\operatorname{tr}\left[\Sigma^{-1}\left(S + dd^T\right)\right],$$
proving the expression (3.9). Note that the cross-product terms have vanished because $\sum_{r=1}^n x_r = n\bar{x}$ and therefore
$$\sum_{r=1}^n d^T\Sigma^{-1}(x_r - \bar{x}) = d^T\Sigma^{-1}\sum_{r=1}^n (x_r - \bar{x}) = 0 = \sum_{r=1}^n (x_r - \bar{x})^T\Sigma^{-1}d.$$
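The decomposition used in (3.9) is easy to verify numerically; the sketch below uses arbitrary data and parameter values, since the identity is algebraic and holds for any choice.

```python
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary data and parameter values; the identity holds for any choice
n, p = 50, 3
X = rng.normal(size=(n, p))
mu = np.array([0.2, -0.1, 0.4])
Sigma_inv = np.linalg.inv(np.array([[1.0, 0.2, 0.0],
                                    [0.2, 1.5, 0.3],
                                    [0.0, 0.3, 2.0]]))

xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False, bias=True)   # covariance with divisor n
d = xbar - mu

lhs = sum((xr - mu) @ Sigma_inv @ (xr - mu) for xr in X)
rhs = n * np.trace(Sigma_inv @ (S + np.outer(d, d)))
print(lhs, rhs)   # equal up to rounding error
```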

In (3.9) the dependence on $\mu$ is entirely through $d$. Now assume that $\Sigma$ is positive definite (p.d.); then so is $\Sigma^{-1}$, as
$$\Sigma^{-1} = V\Lambda^{-1}V^T,$$
where $\Sigma = V\Lambda V^T$ is the eigenanalysis of $\Sigma$. Thus for all $d \neq 0$ we have $d^T\Sigma^{-1}d > 0$. Hence $l(\mu, \Sigma)$ is minimized with respect to $\mu$ for fixed $\Sigma$ when $d = 0$, i.e.
$$\hat{\mu} = \bar{x}.$$

Final part of proof: to minimize the log-likelihood $l(\hat{\mu}, \Sigma)$ w.r.t. $\Sigma$, let
$$l(\hat{\mu}, \Sigma) = n\left[\log|\Sigma| + \operatorname{tr}\left(\Sigma^{-1}S\right)\right] = \phi(\Sigma). \tag{3.10}$$

We show that
$$\phi(\Sigma) - \phi(S) = n\left[\log|\Sigma| - \log|S| + \operatorname{tr}\left(\Sigma^{-1}S\right) - p\right] = n\left[\operatorname{tr}\left(\Sigma^{-1}S\right) - \log\left|\Sigma^{-1}S\right| - p\right] \geq 0. \tag{3.11}$$

Lemma 1: $\Sigma^{-1}S$ is positive semi-definite (proved elsewhere). Therefore the eigenvalues of $\Sigma^{-1}S$ are non-negative, and strictly positive when $S$ is non-singular, as assumed here.

Lemma 2: For any set of positive numbers, $A \geq \log G + 1$, where $A$ and $G$ are the arithmetic and geometric means respectively.

Proof

For all $x$ we have $e^x \geq 1 + x$ (simple exercise). Consider a set of $n$ strictly positive numbers $\{y_i\}$:
$$y_i \geq 1 + \log y_i$$
$$\sum y_i \geq n + \sum \log y_i$$
$$A = \frac{1}{n}\sum y_i \geq 1 + \frac{1}{n}\sum \log y_i = 1 + \log G,$$
as required.

Recall that for any $(n \times n)$ matrix $A$ we have $\operatorname{tr}(A) = \sum_{i=1}^n \lambda_i$, the sum of the eigenvalues, and $|A| = \prod \lambda_i$, the product of the eigenvalues. Let $\lambda_i$ $(i = 1, \ldots, p)$ be the positive eigenvalues of $\Sigma^{-1}S$ and substitute in (3.11):

$$\log\left|\Sigma^{-1}S\right| = \sum \log\lambda_i = p\log G$$
$$\operatorname{tr}\left(\Sigma^{-1}S\right) = \sum \lambda_i = pA$$

Hence
$$\phi(\Sigma) - \phi(S) = np\left\{A - \log G - 1\right\} \geq 0.$$
This proves that the MLEs are as stated in (3.6). $\blacksquare$
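A numerical illustration of the result: the sketch below evaluates $-2\log L$ (up to a constant) in the form (3.9) at the MLEs $(\bar{x}, S)$ and at a few perturbed parameter values, and checks that the MLEs give the smallest value. The helper name `minus2_loglik` and the specific numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)

def minus2_loglik(mu, Sigma, X):
    """-2 log L up to the additive constant, in the form (3.9)."""
    n = len(X)
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False, bias=True)   # divisor n
    d = xbar - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return n * (logdet + np.trace(np.linalg.inv(Sigma) @ (S + np.outer(d, d))))

X = rng.multivariate_normal([1.0, -1.0], [[2.0, 0.5], [0.5, 1.0]], size=100)
mu_hat, Sigma_hat = X.mean(axis=0), np.cov(X, rowvar=False, bias=True)

best = minus2_loglik(mu_hat, Sigma_hat, X)
for _ in range(5):
    mu_alt = mu_hat + rng.normal(scale=0.2, size=2)
    Sigma_alt = Sigma_hat + np.diag(rng.uniform(0.05, 0.3, size=2))  # still p.d.
    assert minus2_loglik(mu_alt, Sigma_alt, X) >= best   # MLEs give the minimum
```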

3.3 Distribution of $\bar{x}$ and $S$

The Wishart distribution (Definition)

If $M$ $(p \times p)$ can be written $M = X^T X$ where $X$ $(m \times p)$ is a data matrix from $N_p(0, \Sigma)$, then $M$ is said to have a Wishart distribution with scale matrix $\Sigma$ and degrees of freedom $m$. We write

$$M \sim W_p(\Sigma, m) \tag{3.12}$$

When $\Sigma = I_p$ the distribution is said to be in standard form.

Note:

The Wishart distribution is the multivariate generalization of the chi-squared ($\chi^2$) distribution.

Additive property of matrices with a Wishart distribution

Let $M_1$, $M_2$ be matrices having the Wishart distributions
$$M_1 \sim W_p(\Sigma, m_1), \qquad M_2 \sim W_p(\Sigma, m_2)$$
independently; then
$$M_1 + M_2 \sim W_p(\Sigma, m_1 + m_2).$$

This property follows from the definition of the Wishart distribution because data matrices are additive in the sense that if
$$X = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}$$
is a combined data matrix consisting of $m_1 + m_2$ rows, then
$$X^T X = X_1^T X_1 + X_2^T X_2$$
is the matrix (known as the "Gram matrix") formed from the combined data matrix $X$.

Case of $p = 1$

When $p = 1$ we know from the definition of $\chi^2_r$ as the distribution of the sum of squares of $r$ independent $N(0, 1)$ variates that
$$M = \sum_{i=1}^m x_i^2 \sim \sigma^2\chi^2_m,$$
so that
$$W_1\left(\sigma^2, m\right) \equiv \sigma^2\chi^2_m.$$

Sampling distributions

Let $x_1, x_2, \ldots, x_n$ be a random sample of size $n$ from $N_p(\mu, \Sigma)$. Then

1. The sample mean $\bar{x}$ has the normal distribution
$$\bar{x} \sim N_p\left(\mu, \tfrac{1}{n}\Sigma\right)$$

2. The (scaled) sample covariance matrix has the Wishart distribution:
$$(n-1)S_u \sim W_p(\Sigma, n-1)$$

3. The distributions of $\bar{x}$ and $S_u$ are independent.
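The Wishart sampling result for $S_u$ can be illustrated by simulation, using the standard fact that $E[W_p(\Sigma, m)] = m\Sigma$: repeatedly drawing samples of size $n$, forming $(n-1)S_u$, and averaging should give approximately $(n-1)\Sigma$. The parameter values below are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(3)

p, n, reps = 2, 10, 5_000
mu = np.zeros(p)
Sigma = np.array([[1.0, 0.4],
                  [0.4, 2.0]])

M_sum = np.zeros((p, p))
for _ in range(reps):
    X = rng.multivariate_normal(mu, Sigma, size=n)
    Su = np.cov(X, rowvar=False)   # unbiased covariance (divisor n - 1)
    M_sum += (n - 1) * Su          # a draw of (n - 1) Su ~ W_p(Sigma, n - 1)

print(M_sum / reps)        # approximately E[W_p(Sigma, n - 1)] = (n - 1) Sigma
print((n - 1) * Sigma)
```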

3.4 Estimators for special circumstances

3.4.1 $\mu$ proportional to a given vector

Sometimes $\mu$ is known to be proportional to a given vector, so $\mu = k\mu_0$ with $\mu_0$ being a known vector. For example, if $x$ represents a sample of repeated measurements then $\mu = k\mathbf{1}$, where $\mathbf{1} = (1, 1, \ldots, 1)^T$ is the $p$-vector of 1's.

We find the MLE of $k$ for this situation. Suppose $\Sigma$ is known and $\mu = k\mu_0$. Let $d_0 = \bar{x} - k\mu_0$. The log-likelihood is
$$l(k) = -2\log L = n\left[\log|\Sigma| + \operatorname{tr}\left(\Sigma^{-1}\left(S + d_0 d_0^T\right)\right)\right]$$
$$= n\left[\log|\Sigma| + \operatorname{tr}\left(\Sigma^{-1}S\right) + (\bar{x} - k\mu_0)^T\Sigma^{-1}(\bar{x} - k\mu_0)\right]$$
$$= n\left[\bar{x}^T\Sigma^{-1}\bar{x} - 2k\,\mu_0^T\Sigma^{-1}\bar{x} + k^2\mu_0^T\Sigma^{-1}\mu_0\right] + \text{constant terms independent of } k.$$

Set $\dfrac{dl}{dk} = 0$ to minimize $l(k)$ w.r.t. $k$:
$$-2\mu_0^T\Sigma^{-1}\bar{x} + 2\mu_0^T\Sigma^{-1}\mu_0\,k = 0,$$
from which
$$\hat{k} = \frac{\mu_0^T\Sigma^{-1}\bar{x}}{\mu_0^T\Sigma^{-1}\mu_0}. \tag{3.13}$$

Properties: We now show that $\hat{k}$ is an unbiased estimator of $k$ and determine the variance of $\hat{k}$. In (3.13), $\hat{k}$ takes the form $c^T\bar{x}$ with $c^T = \alpha^{-1}\mu_0^T\Sigma^{-1}$ and $\alpha = \mu_0^T\Sigma^{-1}\mu_0$, so

$$E\left[\hat{k}\right] = c^T E[\bar{x}] = k\,c^T\mu_0 = \frac{k\,\mu_0^T\Sigma^{-1}\mu_0}{\mu_0^T\Sigma^{-1}\mu_0} = k,$$
since $E[\bar{x}] = k\mu_0$. Hence
$$E\left[\hat{k}\right] = k, \tag{3.14}$$
showing that $\hat{k}$ is an unbiased estimator.

Noting that $\operatorname{Var}[\bar{x}] = \frac{1}{n}\Sigma$ and therefore that $\operatorname{Var}\left[c^T\bar{x}\right] = \frac{1}{n}c^T\Sigma c$, we have
$$\operatorname{Var}\left[\hat{k}\right] = \frac{1}{n}c^T\Sigma c = \frac{1}{n}\left(\mu_0^T\Sigma^{-1}\mu_0\right)^{-2}\mu_0^T\Sigma^{-1}\Sigma\,\Sigma^{-1}\mu_0 = \frac{1}{n\,\mu_0^T\Sigma^{-1}\mu_0}. \tag{3.15}$$
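A simulation sketch of the estimator (3.13), checking the unbiasedness (3.14) and the variance formula (3.15); the values of $k$, $\mu_0$, $\Sigma$ and $n$ below are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative setup: mu = k * mu0 with k = 2 and Sigma known
k_true, mu0 = 2.0, np.array([1.0, 1.0, 1.0])
Sigma = np.array([[1.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 1.0]])
n = 25
Sigma_inv = np.linalg.inv(Sigma)

k_hats = []
for _ in range(5_000):
    X = rng.multivariate_normal(k_true * mu0, Sigma, size=n)
    xbar = X.mean(axis=0)
    k_hats.append((mu0 @ Sigma_inv @ xbar) / (mu0 @ Sigma_inv @ mu0))   # (3.13)

print(np.mean(k_hats))                          # approximately k_true (3.14)
print(np.var(k_hats))                           # approximately the variance (3.15)
print(1.0 / (n * (mu0 @ Sigma_inv @ mu0)))
```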

3.4.2 Linear restriction on $\mu$

We determine an estimator for $\mu$ to satisfy a linear restriction
$$A\mu = b,$$
where $A$ $(m \times p)$ and $b$ $(m \times 1)$ are given constants and $\Sigma$ is assumed to be known. We write the restriction in vector form $g(\mu) = 0$ and form the Lagrangean

$$\mathcal{L}(\mu, \lambda) = l(\mu) + 2\lambda^T g(\mu),$$

where $\lambda = (\lambda_1, \ldots, \lambda_m)^T$ is a vector of Lagrange multipliers (the factor 2 is inserted just for convenience).

$$\mathcal{L}(\mu, \lambda) = l(\mu) + 2\lambda^T(A\mu - b) = n(\bar{x} - \mu)^T\Sigma^{-1}(\bar{x} - \mu) + 2\lambda^T(A\mu - b),$$
ignoring constant terms involving $\Sigma$.

Set $\dfrac{d}{d\mu}\mathcal{L}(\mu, \lambda) = 0$, using results from Example Sheet 2:

$$-2n\Sigma^{-1}(\bar{x} - \mu) + 2A^T\lambda = 0,$$
and, absorbing the factor $n$ into $\lambda$,
$$\bar{x} - \mu = \Sigma A^T\lambda. \tag{3.16}$$
We use the constraint $A\mu = b$ to evaluate the Lagrange multipliers $\lambda$. Premultiply by $A$:
$$A\bar{x} - b = A\Sigma A^T\lambda$$
$$\lambda = \left(A\Sigma A^T\right)^{-1}(A\bar{x} - b)$$
Substitute into (3.16):
$$\hat{\mu} = \bar{x} - \Sigma A^T\left(A\Sigma A^T\right)^{-1}(A\bar{x} - b) \tag{3.17}$$
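Equation (3.17) translates directly into code; the sketch below (the helper name `restricted_mean_mle` is mine, and the toy restriction forces the two components of $\mu$ to be equal) computes the restricted estimate and checks that it satisfies $A\hat{\mu} = b$.

```python
import numpy as np

def restricted_mean_mle(xbar, Sigma, A, b):
    """MLE of mu subject to A mu = b, with Sigma known (equation 3.17)."""
    SAt = Sigma @ A.T
    lam_part = np.linalg.solve(A @ SAt, A @ xbar - b)
    return xbar - SAt @ lam_part

# Small illustrative check: force the two components of mu to be equal
xbar = np.array([1.3, 0.7])
Sigma = np.array([[1.0, 0.2],
                  [0.2, 2.0]])
A = np.array([[1.0, -1.0]])
b = np.array([0.0])

mu_hat = restricted_mean_mle(xbar, Sigma, A, b)
print(mu_hat)            # restricted estimate
print(A @ mu_hat - b)    # approximately 0, so the constraint holds
```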

3.4.3 Covariance matrix $\Sigma$ proportional to a given matrix

We consider estimating $k$ when $\Sigma = k\Sigma_0$, where $\Sigma_0$ is a given constant matrix. The likelihood (3.8) takes the form, when $d = 0$ ($\hat{\mu} = \bar{x}$),

$$l(k) = n\left[\log|k\Sigma_0| + \operatorname{tr}\left(\frac{1}{k}\Sigma_0^{-1}S\right)\right]$$
plus constant terms (not involving $k$).

$$l(k) = n\left[p\log k + \frac{1}{k}\operatorname{tr}\left(\Sigma_0^{-1}S\right)\right] + \text{constant terms}$$
$$\frac{dl}{dk} = 0 \;\Rightarrow\; \frac{p}{k} - \frac{1}{k^2}\operatorname{tr}\left(\Sigma_0^{-1}S\right) = 0$$
Hence
$$\hat{k} = \frac{\operatorname{tr}\left(\Sigma_0^{-1}S\right)}{p}. \tag{3.18}$$
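A quick simulation check of (3.18), with an arbitrary choice of $k$ and $\Sigma_0$: for a large sample the estimate should be close to the true proportionality constant.

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative check: Sigma = k * Sigma0 with k = 3
k_true = 3.0
Sigma0 = np.array([[1.0, 0.5],
                   [0.5, 2.0]])
p, n = 2, 5_000

X = rng.multivariate_normal(np.zeros(p), k_true * Sigma0, size=n)
S = np.cov(X, rowvar=False, bias=True)           # MLE covariance (divisor n)
k_hat = np.trace(np.linalg.inv(Sigma0) @ S) / p  # (3.18)
print(k_hat)   # close to k_true for large n
```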
