
Basic Multivariate Normal Theory

[Prerequisite background: theory of random variables, expectation, variance, covariance, moment generating functions, independence and conditioning. Other requirements: basic vector-matrix theory, multivariate calculus, multivariate change of variable.]

1 Random Vector

A random vector $U \in \mathbb{R}^k$ is a vector $(U_1, U_2, \cdots, U_k)^T$ of scalar random variables $U_i$ defined over a common probability space. The expectation and variance of $U$ are defined as:
\[
\mathbb{E}U := \begin{pmatrix} \mathbb{E}U_1 \\ \mathbb{E}U_2 \\ \vdots \\ \mathbb{E}U_k \end{pmatrix}, \qquad
\mathrm{Var}\,U := \mathbb{E}\big(\{U - \mathbb{E}U\}\{U - \mathbb{E}U\}^T\big) =
\begin{pmatrix}
\mathrm{Var}\,U_1 & \mathrm{Cov}(U_1,U_2) & \cdots & \mathrm{Cov}(U_1,U_k) \\
\mathrm{Cov}(U_2,U_1) & \mathrm{Var}\,U_2 & \cdots & \mathrm{Cov}(U_2,U_k) \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{Cov}(U_k,U_1) & \mathrm{Cov}(U_k,U_2) & \cdots & \mathrm{Var}\,U_k
\end{pmatrix}.
\]

It is easy to check that if $X = a + BU$ where $a \in \mathbb{R}^p$ and $B$ is a $p \times k$ matrix, then $X$ is a random vector in $\mathbb{R}^p$ with $\mathbb{E}X = a + B\,\mathbb{E}U$ and $\mathrm{Var}\,X = B\,\mathrm{Var}(U)\,B^T$. You should also note that if $\Sigma = \mathrm{Var}\,U$ for some random vector $U$, then $\Sigma$ is a $k \times k$ non-negative definite matrix, because for any $a \in \mathbb{R}^k$, $a^T\Sigma a = \mathrm{Var}(a^T U) \ge 0$.
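As a quick numerical illustration, the following minimal sketch simulates draws of an arbitrary (here non-Gaussian) random vector $U$ and checks the identities $\mathbb{E}X = a + B\,\mathbb{E}U$ and $\mathrm{Var}\,X = B\,\mathrm{Var}(U)\,B^T$ for $X = a + BU$; the particular $a$, $B$ and distribution of $U$ are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Any random vector U works; here U has correlated, non-Gaussian coordinates.
W = rng.exponential(size=(200_000, 3))
M = np.array([[1.0, 0.5, 0.0],
              [0.0, 1.0, 0.3],
              [0.0, 0.0, 1.0]])
U = W @ M.T                           # rows are draws of U in R^3

a = np.array([0.5, -1.0])             # a in R^2
B = np.array([[1.0, 0.0, 2.0],        # B is a 2 x 3 matrix
              [0.0, 3.0, -1.0]])
X = a + U @ B.T                       # rows are draws of X = a + B U

EU, VarU = U.mean(axis=0), np.cov(U, rowvar=False)
print(np.allclose(X.mean(axis=0), a + B @ EU))               # EX = a + B EU
print(np.allclose(np.cov(X, rowvar=False), B @ VarU @ B.T))  # VarX = B (VarU) B^T
```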

2 Multivariate Normal

Definition 1. A random vector $U \in \mathbb{R}^k$ is called a normal random vector if for every $a \in \mathbb{R}^k$, $a^T U$ is a (one dimensional) normal random variable.

Theorem 1. A random vector $U \in \mathbb{R}^k$ is a normal random vector if and only if one can write $U = m + AZ$ for some $m \in \mathbb{R}^k$ and $k \times k$ matrix $A$, where $Z = (Z_1, \cdots, Z_k)^T$ with $Z_i \overset{\mathrm{IID}}{\sim} \mathrm{Normal}(0, 1)$.

Proof. “If part”. Suppose $U = m + AZ$ with $m$, $A$ and $Z$ as in the statement of the theorem. Then for any $a \in \mathbb{R}^k$,
\[
a^T U = a^T m + a^T A Z = b_0 + \sum_{i=1}^k b_i Z_i
\]
for some scalars $b_0, b_1, \cdots, b_k$. But a linear combination of independent (one dimensional) normal variables is another normal, so $a^T U$ is a normal variable.

“Only if part”. Suppose $U$ is a normal random vector. It suffices to show that a random vector $V = m + AZ$, with $Z$ as in the statement of the theorem and suitably chosen $m$ and $A$, has the same distribution as $U$. For any $a \in \mathbb{R}^k$, the moment generating function $M_U(a)$ of $U$ at $a$ is
\[
M_U(a) = \mathbb{E}e^{a^T U} = \mathbb{E}e^{X},
\]
where $X = a^T U$ is normally distributed by definition. But, by one dimensional normal distribution theory, $\mathbb{E}e^{X} = e^{\mathbb{E}X + \frac{1}{2}\mathrm{Var}X} = e^{a^T \mathbb{E}U + \frac{1}{2} a^T (\mathrm{Var}\,U) a} = e^{a^T \mu + \frac{1}{2} a^T \Sigma a}$, where we denote $\mathbb{E}U$ by $\mu$ and $\mathrm{Var}\,U$ by $\Sigma$. Note that $\Sigma$ is non-negative definite and thus can be written as $\Sigma = AA^T$ for some $k \times k$ matrix $A$.

Write $V = \mu + AZ$ where $Z = (Z_1, \cdots, Z_k)^T$ with $Z_i \overset{\mathrm{IID}}{\sim} \mathrm{Normal}(0, 1)$. Then, by the “if part”, $V$ is a normal random vector, and because the $Z_i$'s are IID with mean 0 and variance 1, $\mathbb{E}V = \mu$ and $\mathrm{Var}\,V = \Sigma$. So by the above discussion $M_V(a) = e^{a^T \mu + \frac{1}{2} a^T \Sigma a} = M_U(a)$. So $U$ and $V$ have the same moment generating function. Because this moment generating function is defined for all $a \in \mathbb{R}^k$, it uniquely determines the associated distribution. That is, $V$ and $U$ have the same distribution.

Notation. If a random $k$-vector $U$ is a normal random vector, then by the above proof, its distribution is completely determined by its mean $\mu = \mathbb{E}U$ and variance $\Sigma = \mathrm{Var}\,U$. We shall denote this distribution by $\mathrm{Normal}_k(\mu, \Sigma)$. Note that $U \sim \mathrm{Normal}_k(\mu, \Sigma)$ means that $U = \mu + AZ$ for $Z$ as in the above theorem, where $A$ satisfies $\Sigma = AA^T$.
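The representation $U = \mu + AZ$ is also how one simulates from $\mathrm{Normal}_k(\mu, \Sigma)$ in practice: take any $A$ with $AA^T = \Sigma$, for instance the Cholesky factor, and set $U = \mu + AZ$. A minimal sketch, with an arbitrary $\mu$ and $\Sigma$ chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

mu = np.array([1.0, -1.0, 2.0])
Sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 3.0, 0.2],
                  [0.5, 0.2, 2.0]])

A = np.linalg.cholesky(Sigma)                 # lower-triangular A with A A^T = Sigma
Z = rng.standard_normal(size=(100_000, 3))    # rows: IID Normal(0, 1) coordinates
U = mu + Z @ A.T                              # rows: draws of Normal_3(mu, Sigma)

print(np.round(U.mean(axis=0), 2))            # approximately mu
print(np.round(np.cov(U, rowvar=False), 2))   # approximately Sigma
```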

Theorem 2 (Linear transformations). If $U \sim \mathrm{Normal}_k(\mu, \Sigma)$, $a \in \mathbb{R}^p$ and $B$ is a $p \times k$ matrix, then $V = a + BU \sim \mathrm{Normal}_p(a + B\mu, B\Sigma B^T)$.

Proof. Write $U = \mu + AZ$ where $Z$ is as in Theorem 1 and $A$ satisfies $\Sigma = AA^T$. Then $V = a + BU = (a + B\mu) + (BA)Z$, and $(BA)(BA)^T = B\Sigma B^T$. This proves the result.

Theorem 3 (Probability density function). A $\mathrm{Normal}_k(\mu, \Sigma)$ distribution with a positive definite $\Sigma$ admits the following pdf:
\[
f(x) = (2\pi)^{-k/2}(\det \Sigma)^{-1/2} \exp\left\{ -\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu) \right\}, \qquad x \in \mathbb{R}^k.
\]

Proof. Suppose $X \sim \mathrm{Normal}_k(\mu, \Sigma)$. Then $X = \mu + AZ$ where $AA^T = \Sigma$ and $A$ is non-singular (because $\Sigma$ is p.d.). The joint pdf of $Z_1, \cdots, Z_k$ is given by
\[
g(z_1, \cdots, z_k) = \prod_{i=1}^k \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{z_i^2}{2}\right\} = (2\pi)^{-k/2} \exp\left\{-\frac{1}{2} z^T z\right\}, \qquad z = (z_1, \cdots, z_k)^T \in \mathbb{R}^k.
\]
Therefore, by the multivariate change of variable formula for the transformation $X = \mu + AZ$, with inverse transformation $Z = A^{-1}(X - \mu)$ and hence Jacobian $J = (\det A)^{-1}$, the pdf of $X$ is
\[
\begin{aligned}
f(x) &= |\det A|^{-1}\, g(A^{-1}(x - \mu)), \qquad x \in \mathbb{R}^k \\
&= (2\pi)^{-k/2} |\det A|^{-1} \exp\left\{-\frac{1}{2}(x-\mu)^T (AA^T)^{-1} (x-\mu)\right\} \\
&= (2\pi)^{-k/2} (\det \Sigma)^{-1/2} \exp\left\{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right\},
\end{aligned}
\]
as $\det \Sigma = (\det A)^2$.
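As a sanity check of the density formula, the sketch below evaluates it directly and compares against scipy.stats.multivariate_normal (assuming scipy is available); $\mu$, $\Sigma$ and the evaluation points are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.5, -1.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

def mvn_pdf(x, mu, Sigma):
    """Density of Normal_k(mu, Sigma) evaluated by the formula in Theorem 3."""
    k = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)      # (x - mu)^T Sigma^{-1} (x - mu)
    return (2 * np.pi) ** (-k / 2) * np.linalg.det(Sigma) ** (-0.5) * np.exp(-0.5 * quad)

rng = np.random.default_rng(2)
for x in rng.normal(size=(3, 2)):                   # a few arbitrary points in R^2
    print(np.isclose(mvn_pdf(x, mu, Sigma),
                     multivariate_normal(mean=mu, cov=Sigma).pdf(x)))
```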

Theorem 4 (Independence). Let $U \sim \mathrm{Normal}_k(\mu, \Sigma)$. Suppose for some $1 \le p < k$, $\Sigma$ can be partitioned as
\[
\Sigma = \begin{pmatrix} \Sigma_{(11)} & 0_{p \times (k-p)} \\ 0_{(k-p) \times p} & \Sigma_{(22)} \end{pmatrix},
\]

where $\Sigma_{(11)}$ is $p \times p$, $\Sigma_{(22)}$ is $(k-p) \times (k-p)$ and $0_{m \times n}$ denotes the matrix of zeros of the specified dimensions. Similarly partition
\[
U = \begin{pmatrix} U_{(1)} \\ U_{(2)} \end{pmatrix}, \qquad \mu = \begin{pmatrix} \mu_{(1)} \\ \mu_{(2)} \end{pmatrix}
\]
into vectors of length $p$ and $(k-p)$. Then

\[
U_{(1)} \sim \mathrm{Normal}_p(\mu_{(1)}, \Sigma_{(11)}), \qquad U_{(2)} \sim \mathrm{Normal}_{k-p}(\mu_{(2)}, \Sigma_{(22)}),
\]

and $U_{(1)}$ and $U_{(2)}$ are independent.

Proof. Because $\Sigma$ is non-negative definite, so must be any of its diagonal blocks. Therefore there are $p \times p$ and $(k-p) \times (k-p)$ matrices $A_{(1)}$ and $A_{(2)}$ such that $\Sigma_{(11)} = A_{(1)}A_{(1)}^T$ and $\Sigma_{(22)} = A_{(2)}A_{(2)}^T$. Clearly, $\Sigma = AA^T$ with
\[
A = \begin{pmatrix} A_{(1)} & 0_{p \times (k-p)} \\ 0_{(k-p) \times p} & A_{(2)} \end{pmatrix}.
\]

Because $U$ is normal, we may write $U = \mu + AZ$ where $Z = (Z_1, \cdots, Z_k)^T$ with $Z_i \overset{\mathrm{IID}}{\sim} \mathrm{Normal}(0,1)$. Write $Z_{(1)} = (Z_1, \cdots, Z_p)^T$ and $Z_{(2)} = (Z_{p+1}, \cdots, Z_k)^T$. Then $Z_{(1)}$ and $Z_{(2)}$ are independent and
\[
\begin{pmatrix} U_{(1)} \\ U_{(2)} \end{pmatrix} = \begin{pmatrix} \mu_{(1)} \\ \mu_{(2)} \end{pmatrix} + \begin{pmatrix} A_{(1)} & 0_{p \times (k-p)} \\ 0_{(k-p) \times p} & A_{(2)} \end{pmatrix}\begin{pmatrix} Z_{(1)} \\ Z_{(2)} \end{pmatrix} = \begin{pmatrix} \mu_{(1)} + A_{(1)} Z_{(1)} \\ \mu_{(2)} + A_{(2)} Z_{(2)} \end{pmatrix}.
\]

From this the result follows.

Theorem 5 (Conditioning). Let $U \sim \mathrm{Normal}_k(\mu, \Sigma)$. For some $1 \le p < k$, consider the partition into blocks of the first $p$ and last $k - p$ variables:
\[
U = \begin{pmatrix} U_{(1)} \\ U_{(2)} \end{pmatrix}, \qquad \mu = \begin{pmatrix} \mu_{(1)} \\ \mu_{(2)} \end{pmatrix} \qquad \text{and} \qquad \Sigma = \begin{pmatrix} \Sigma_{(11)} & \Sigma_{(12)} \\ \Sigma_{(21)} & \Sigma_{(22)} \end{pmatrix}.
\]

Then,
\[
U_{(1)} \mid U_{(2)} \sim \mathrm{Normal}_p\!\left(\mu_{(1)} + \Sigma_{(12)}\Sigma_{(22)}^{-1}(U_{(2)} - \mu_{(2)}),\; \Sigma_{(11)} - \Sigma_{(12)}\Sigma_{(22)}^{-1}\Sigma_{(21)}\right).
\]

Proof. Clearly, $U_{(2)} \sim \mathrm{Normal}_{k-p}(\mu_{(2)}, \Sigma_{(22)})$. Consider $p$ IID $\mathrm{Normal}(0, 1)$ variables $Z_1, \ldots, Z_p$ that are independent of $U_{(2)}$ and let $Z = (Z_1, \ldots, Z_p)^T$. The matrix $\Sigma_{(11)} - \Sigma_{(12)}\Sigma_{(22)}^{-1}\Sigma_{(21)}$ is non-negative definite, so let $B$ be such that $\Sigma_{(11)} - \Sigma_{(12)}\Sigma_{(22)}^{-1}\Sigma_{(21)} = BB^T$. By Theorem 1,
\[
\begin{pmatrix} Z \\ U_{(2)} \end{pmatrix} \sim \mathrm{Normal}_k\!\left( \begin{pmatrix} 0_{p \times 1} \\ \mu_{(2)} \end{pmatrix}, \begin{pmatrix} I_p & 0_{p \times (k-p)} \\ 0_{(k-p) \times p} & \Sigma_{(22)} \end{pmatrix} \right)
\]

and hence,
\[
\begin{pmatrix} \mu_{(1)} + \Sigma_{(12)}\Sigma_{(22)}^{-1}(U_{(2)} - \mu_{(2)}) + BZ \\ U_{(2)} \end{pmatrix} \sim \mathrm{Normal}_k\!\left( \begin{pmatrix} \mu_{(1)} \\ \mu_{(2)} \end{pmatrix}, \begin{pmatrix} \Sigma_{(11)} & \Sigma_{(12)} \\ \Sigma_{(21)} & \Sigma_{(22)} \end{pmatrix} \right),
\]
which is the same as the distribution of $U$. So the conditional distribution of $U_{(1)}$ given $U_{(2)}$ is the same as the conditional distribution of $\mu_{(1)} + \Sigma_{(12)}\Sigma_{(22)}^{-1}(U_{(2)} - \mu_{(2)}) + BZ$ given $U_{(2)}$; but the latter conditional distribution is precisely $\mathrm{Normal}_p\!\left(\mu_{(1)} + \Sigma_{(12)}\Sigma_{(22)}^{-1}(U_{(2)} - \mu_{(2)}),\; \Sigma_{(11)} - \Sigma_{(12)}\Sigma_{(22)}^{-1}\Sigma_{(21)}\right)$.
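The conditional mean and covariance in Theorem 5 are simple matrix expressions, as in the following sketch; the helper conditional_normal and the particular numbers are illustrative only.

```python
import numpy as np

def conditional_normal(mu, Sigma, p, u2):
    """Return (mean, cov) of U_(1) | U_(2) = u2 when U ~ Normal_k(mu, Sigma)."""
    mu1, mu2 = mu[:p], mu[p:]
    S11, S12 = Sigma[:p, :p], Sigma[:p, p:]
    S21, S22 = Sigma[p:, :p], Sigma[p:, p:]
    S12_S22inv = S12 @ np.linalg.inv(S22)
    cond_mean = mu1 + S12_S22inv @ (u2 - mu2)    # mu_(1) + S12 S22^{-1} (u2 - mu_(2))
    cond_cov = S11 - S12_S22inv @ S21            # S11 - S12 S22^{-1} S21
    return cond_mean, cond_cov

mu = np.array([1.0, 2.0, 0.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
m, C = conditional_normal(mu, Sigma, p=1, u2=np.array([1.5, -0.5]))
print(m, C)   # mean and covariance of U_1 given (U_2, U_3) = (1.5, -0.5)
```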

3 Sample mean and variance of a normal sample

Theorem 6. Let $X_i \overset{\mathrm{IID}}{\sim} \mathrm{Normal}(\mu, \sigma^2)$, $i = 1, \ldots, n$. Then

1. $W = \dfrac{\bar X - \mu}{\sigma/\sqrt{n}} \sim \mathrm{Normal}(0, 1)$,

2. $V = \dfrac{\sum_{i=1}^n (X_i - \bar X)^2}{\sigma^2} \sim \chi^2_{n-1}$,

3. and $V$ and $W$ are independent.

Proof. It suffices to prove the result for $\mu = 0$ and $\sigma^2 = 1$ (otherwise take $\tilde X_i = (X_i - \mu)/\sigma$). Let $X$ denote the random vector $(X_1, \cdots, X_n)^T \in \mathbb{R}^n$. Then $X \sim \mathrm{Normal}_n(0_{n \times 1}, I_n)$, where $I_n$ denotes the $n \times n$ identity matrix with 1 on the diagonal and zero elsewhere. Let $B$ be an $n \times n$ orthogonal matrix whose first row equals $(1/\sqrt{n})\,\mathbf{1}_{1 \times n}$, where $\mathbf{1}_{m \times n}$ denotes the matrix of ones of the given dimensions (such an orthogonal matrix can be constructed by the Gram-Schmidt method). Take $Y = BX$. Then

\[
Y \sim \mathrm{Normal}_n(0_{n \times 1}, BB^T) = \mathrm{Normal}_n(0_{n \times 1}, I_n),
\]
as $B$ is orthogonal. Therefore the coordinate variables $Y_1, Y_2, \cdots, Y_n$ of $Y$ are IID $\mathrm{Normal}(0, 1)$. Now argue,

1. $Y_1 = (1/\sqrt{n})\,\mathbf{1}_{1 \times n} X = (1/\sqrt{n}) \sum_{i=1}^n X_i = \sqrt{n}\,\bar X$. So $W = Y_1 \sim \mathrm{Normal}(0, 1)$.

2. $Y^T Y = X^T X$ (since $B$ is orthogonal), and so $Y_1^2 + Y_2^2 + \cdots + Y_n^2 = X_1^2 + X_2^2 + \cdots + X_n^2$. But $Y_1^2 = n \bar X^2$. Hence
\[
Y_2^2 + \cdots + Y_n^2 = \sum_{i=1}^n X_i^2 - n \bar X^2 = \sum_{i=1}^n (X_i - \bar X)^2 = V.
\]
So $V = Y_2^2 + \cdots + Y_n^2 \sim \chi^2_{n-1}$, as this is a sum of squares of $n - 1$ IID standard normal variables.

3. $W = Y_1$ and $V$ is a function of $Y_2, \cdots, Y_n$; hence these two are independent, as the $Y_i$'s are independent.
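A quick Monte Carlo sketch of Theorem 6 (with arbitrary illustrative values of $n$, $\mu$ and $\sigma$): $W$ should behave like a standard normal variable, $V$ like a $\chi^2_{n-1}$ variable, and the two should be uncorrelated, which is consistent with their independence.

```python
import numpy as np

rng = np.random.default_rng(3)
n, mu, sigma = 5, 2.0, 3.0

X = rng.normal(mu, sigma, size=(200_000, n))       # each row is one sample of size n
Xbar = X.mean(axis=1)
W = (Xbar - mu) / (sigma / np.sqrt(n))
V = ((X - Xbar[:, None]) ** 2).sum(axis=1) / sigma**2

print(np.round([W.mean(), W.var()], 2))            # approximately 0 and 1
print(np.round([V.mean(), V.var()], 2))            # approximately n-1 = 4 and 2(n-1) = 8
print(np.round(np.corrcoef(W, V)[0, 1], 3))        # approximately 0
```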

4 Quadratic forms

A $k \times k$ symmetric matrix $H$ is called idempotent if $H^2 = H$. The eigenvalues of an idempotent matrix are either 0 or 1. To see this, note that if $\lambda$ is an eigenvalue of an idempotent matrix $H$ then $Hv = \lambda v$ for some $v \neq 0$. Pre-multiply both sides by $H$ to get $H^2 v = \lambda H v = \lambda^2 v$. But $H^2 = H$ and so $H^2 v = Hv = \lambda v$. Thus $\lambda^2 v = \lambda v$, and because $v \neq 0$ this implies $\lambda^2 = \lambda$. So $\lambda \in \{0, 1\}$. Because the rank of a symmetric matrix is equal to the number of its non-zero eigenvalues, the only full-rank idempotent matrix is the identity matrix. If $H$ is idempotent with rank $m$, then $I_k - H$ is idempotent with rank $k - m$ (because $(I_k - H)(I_k - H) = I_k - H - H + H^2 = I_k - H$).

For any set of linearly independent vectors $v_1, \cdots, v_m \in \mathbb{R}^k$, the linear space $\{a_1 v_1 + \cdots + a_m v_m : a_1, \cdots, a_m \in \mathbb{R}\}$ is the same as the space of vectors $\mathcal{C}(V) = \{Vb : b \in \mathbb{R}^m\}$, where $V = [v_1 : v_2 : \cdots : v_m]$ is the $k \times m$ matrix with columns $v_i$. This space is called the column space of $V$. The matrix $H = V(V^TV)^{-1}V^T$ is idempotent and gives the orthogonal projection matrix onto $\mathcal{C}(V)$ (that means, for any $x \in \mathbb{R}^k$, the Euclidean distance $\|x - y\|$, $y \in \mathcal{C}(V)$, is minimized at $y = Hx$, and $Hx$ and $x - Hx$ are orthogonal).

The converse is also true: every idempotent matrix $H$ of rank $m$ is the orthogonal projection matrix onto $\mathcal{C}(V)$ for some $k \times m$ matrix $V$ with linearly independent columns. This can be seen by writing $H = \sum_{i=1}^m v_i v_i^T$, where the $v_i$'s are orthonormal eigenvectors corresponding to the $m$ non-zero eigenvalues (which must equal 1) of $H$. Take $V = [v_1 : v_2 : \cdots : v_m]$. Then $H = VV^T = V(V^TV)^{-1}V^T$ as $V^TV = I_m$.
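The projection matrix $H = V(V^TV)^{-1}V^T$ and its properties can be checked directly, as in this sketch with an arbitrary full-column-rank $V$:

```python
import numpy as np

rng = np.random.default_rng(4)
k, m = 6, 3
V = rng.normal(size=(k, m))                      # columns span a 3-dim subspace of R^6

H = V @ np.linalg.inv(V.T @ V) @ V.T             # orthogonal projection onto C(V)

print(np.allclose(H @ H, H))                     # idempotent: H^2 = H
print(np.linalg.matrix_rank(H))                  # rank m = 3
x = rng.normal(size=k)
print(np.allclose((x - H @ x) @ (H @ x), 0.0))   # x - Hx is orthogonal to Hx
```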

Theorem 7 (Quadratic form). Suppose $X \sim \mathrm{Normal}_k(0, I_k)$ and $H$ is idempotent with rank $m$. Then $Y = X^T H X \sim \chi^2_m$.

Proof. Write $H = \sum_{i=1}^m v_i v_i^T$, where the $v_i$'s are orthonormal eigenvectors corresponding to the $m$ non-zero eigenvalues of $H$. Then $H = VV^T$ where $V = [v_1 : \cdots : v_m]$. Then $Y = (V^T X)^T (V^T X) = Z^T Z$ where $Z = V^T X \sim \mathrm{Normal}_m(0, I_m)$, as $V^T V = I_m$. Therefore $Y = Z^T Z = Z_1^2 + \cdots + Z_m^2 \sim \chi^2_m$.

Theorem 8 (A variant of Theorem 7). Suppose $X \sim \mathrm{Normal}_k(0, H)$ where $H$ is idempotent with rank $m$. Then $X^T X \sim \chi^2_m$.

Proof. Let $Y = HZ$ where $Z \sim \mathrm{Normal}_k(0, I_k)$. Then $Y \sim \mathrm{Normal}_k(0, HH^T)$. But because $H$ is idempotent, $HH^T = H$ and hence $Y$ has the same distribution as $X$. So $X^T X$ has the same distribution as $Y^T Y = (HZ)^T (HZ) = Z^T H^T H Z = Z^T H Z$ (because $H^T H = H^2 = H$). But by Theorem 7, $Z^T H Z \sim \chi^2_m$.
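Theorem 8 is easy to check by simulation. The sketch below reuses a projection matrix as the idempotent $H$ (the choice of $V$ is arbitrary) and verifies that $X^T X$ has the mean $m$ and variance $2m$ of a $\chi^2_m$ variable.

```python
import numpy as np

rng = np.random.default_rng(7)
k, m = 6, 3
V = rng.normal(size=(k, m))
H = V @ np.linalg.inv(V.T @ V) @ V.T             # idempotent, rank m

Z = rng.standard_normal(size=(200_000, k))
X = Z @ H.T                                      # rows: draws of X = H Z ~ Normal_k(0, H)
Q = (X ** 2).sum(axis=1)                         # X^T X for each draw

print(np.round([Q.mean(), Q.var()], 2))          # approximately m = 3 and 2m = 6
```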

5 Least-squares estimate and residual sum of squares

Consider the Gaussian linear model
\[
Y_i = z_i^T \beta + \epsilon_i, \qquad i = 1, 2, \cdots, n,
\]

where $z_i \in \mathbb{R}^p$ are fixed, and $\epsilon_i \overset{\mathrm{IID}}{\sim} \mathrm{Normal}(0, \sigma^2)$. The least-squares estimate of $\beta$ is $\hat\beta_{LS} = (Z^T Z)^{-1} Z^T Y$ with $Y = (Y_1, \cdots, Y_n)^T$ and $Z$ denoting the $n \times p$ matrix with $i$-th row given by

$z_i^T$. Note that $\mathcal{C}(Z)$ is now the column space of $z_{\cdot 1}, z_{\cdot 2}, \cdots, z_{\cdot p}$, where $z_{\cdot j} = (z_{1j}, z_{2j}, \cdots, z_{nj})^T$ is the vector of observations from all $n$ cases on the $j$-th attribute of the input variable. Denote $H = Z(Z^T Z)^{-1} Z^T$. Then $\hat Y = Z\hat\beta_{LS} = HY$ is the projection of $Y$ onto $\mathcal{C}(Z)$. This is expected, because in least squares we minimize $\|Y - Z\beta\|^2$ over $\beta$, which is the same as minimizing $\|Y - v\|^2$ over $v \in \mathcal{C}(Z)$. Notice that $HZ = Z$ and so $(I_n - H)Z = 0$.

Define the residuals $\hat\epsilon_i = Y_i - \hat Y_i = Y_i - z_i^T \hat\beta_{LS}$ and take $\hat\epsilon = (\hat\epsilon_1, \cdots, \hat\epsilon_n)^T$. Then $\hat\epsilon = Y - Z\hat\beta_{LS} = (I_n - H)Y$. An estimate of $\sigma^2$ is given by $s^2 = \frac{1}{n-p}\sum_{i=1}^n \hat\epsilon_i^2 = \frac{1}{n-p}\hat\epsilon^T \hat\epsilon$.
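The quantities just defined are straightforward to compute; the sketch below does so on simulated data (the design $Z$, the true $\beta$ and $\sigma$ are arbitrary illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, sigma = 50, 3, 1.5
beta = np.array([2.0, -1.0, 0.5])

Z = rng.normal(size=(n, p))                      # fixed design matrix (rows z_i^T)
Y = Z @ beta + rng.normal(0.0, sigma, size=n)    # Gaussian linear model

beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)     # (Z^T Z)^{-1} Z^T Y
H = Z @ np.linalg.inv(Z.T @ Z) @ Z.T             # hat matrix, projection onto C(Z)
Y_hat = H @ Y                                    # fitted values Z beta_hat
resid = Y - Y_hat                                # residuals (I_n - H) Y
s2 = resid @ resid / (n - p)                     # estimate of sigma^2

print(np.round(beta_hat, 2), np.round(s2, 2))
print(np.allclose(Y_hat, Z @ beta_hat))          # H Y equals Z beta_hat
```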

Here is the counterpart of Theorem 6 for the least-squares setting.

Theorem 9. Let $Y$, $Z$, $\hat\beta_{LS}$ and $H$ be as above. Then

1. $\hat\beta_{LS} \sim \mathrm{Normal}_p(\beta, \sigma^2 (Z^T Z)^{-1})$.

2. $\hat\epsilon \sim \mathrm{Normal}_n(0, \sigma^2 (I_n - H))$.

3. $\hat\beta_{LS}$ and $\hat\epsilon$ are independent.

4. $\frac{1}{\sigma^2}\hat\epsilon^T \hat\epsilon \sim \chi^2_{n-p}$.

Proof. From what we have discussed above,
\[
\begin{pmatrix} \hat\beta_{LS} \\ \hat\epsilon \end{pmatrix}
= \begin{pmatrix} (Z^T Z)^{-1} Z^T \\ I_n - H \end{pmatrix} Y
\sim \mathrm{Normal}_{n+p}\!\left(
\begin{pmatrix} (Z^T Z)^{-1} Z^T \\ I_n - H \end{pmatrix} Z\beta,\;
\begin{pmatrix} (Z^T Z)^{-1} Z^T \\ I_n - H \end{pmatrix} (\sigma^2 I_n)
\begin{pmatrix} Z(Z^T Z)^{-1} & I_n - H \end{pmatrix}
\right)
\]
\[
= \mathrm{Normal}_{n+p}\!\left(
\begin{pmatrix} (Z^T Z)^{-1} Z^T Z\beta \\ (I_n - H) Z\beta \end{pmatrix},\;
\begin{pmatrix} \sigma^2 (Z^T Z)^{-1} Z^T Z (Z^T Z)^{-1} & \sigma^2 (Z^T Z)^{-1} Z^T (I_n - H) \\ \sigma^2 (I_n - H) Z (Z^T Z)^{-1} & \sigma^2 (I_n - H)(I_n - H) \end{pmatrix}
\right)
= \mathrm{Normal}_{n+p}\!\left(
\begin{pmatrix} \beta \\ 0_{n \times 1} \end{pmatrix},\;
\begin{pmatrix} \sigma^2 (Z^T Z)^{-1} & 0_{p \times n} \\ 0_{n \times p} & \sigma^2 (I_n - H) \end{pmatrix}
\right)
\]

because $HZ = Z$, $(I_n - H)Z = 0_{n \times p}$ and $H^2 = H$ (and so $(I_n - H)^2 = I_n - H$). This proves assertions 1, 2 and 3 (see Theorem 4 above). To prove the last assertion, note that $\frac{1}{\sigma}\hat\epsilon \sim \mathrm{Normal}_n(0, I_n - H)$ and $I_n - H$ is idempotent with rank $n - p$. So, by Theorem 8, $\frac{1}{\sigma^2}\hat\epsilon^T \hat\epsilon \sim \chi^2_{n-p}$.

Theorem 10 (Confidence intervals). Let $Y$, $Z$, $\hat\beta_{LS}$ and $H$ be as above. Let $\alpha$ be any number in $(0, 1)$. Then

1. For any vector $a \in \mathbb{R}^p$, $a \neq 0_{p \times 1}$, $a^T \hat\beta_{LS} \mp z_{n-p}(\alpha)\,\dfrac{s}{\sqrt{n_a}}$ is a $(1 - \alpha)$ confidence interval for $a^T \beta$, where $n_a = \dfrac{1}{a^T (Z^T Z)^{-1} a} > 0$ and $z_{n-p}(\alpha)$ denotes the $(1 - \alpha/2)$ quantile of the $t(n-p)$ distribution.

2. In particular, for every $j = 1, 2, \cdots, p$, $\hat\beta_{LS,j} \mp z_{n-p}(\alpha)\,\dfrac{s}{\sqrt{n_{jj}}}$ is a $(1 - \alpha)$ confidence interval for $\beta_j$, where $n_{jj}$ is the inverse of the $j$-th diagonal entry of $(Z^T Z)^{-1}$.

Proof. The second assertion follows from the first with $a = (0, \cdots, 0, 1, 0, \cdots, 0)^T$, where the 1 occurs at the $j$-th position. To prove the first assertion, note that, by Theorem 9 (together with Theorem 2),

1. $a^T \hat\beta_{LS} \sim \mathrm{Normal}(a^T \beta, \sigma^2 a^T (Z^T Z)^{-1} a)$ and hence $W = \dfrac{a^T \hat\beta_{LS} - a^T \beta}{\sigma \sqrt{a^T (Z^T Z)^{-1} a}} \sim \mathrm{Normal}(0, 1)$.

2. $V = \dfrac{(n-p)s^2}{\sigma^2} = \dfrac{1}{\sigma^2}\hat\epsilon^T \hat\epsilon \sim \chi^2_{n-p}$.

3. $W$ and $V$ are independent.

Hence
\[
T = \frac{W}{\sqrt{V/(n-p)}} = \frac{a^T \hat\beta_{LS} - a^T \beta}{s \sqrt{a^T (Z^T Z)^{-1} a}} = \frac{a^T \hat\beta_{LS} - a^T \beta}{s/\sqrt{n_a}} \sim t(n-p).
\]
Therefore,
\[
P\left( a^T \beta \in a^T \hat\beta_{LS} \mp z_{n-p}(\alpha)\frac{s}{\sqrt{n_a}} \right) = P\big(-z_{n-p}(\alpha) \le T \le z_{n-p}(\alpha)\big) = 1 - \alpha.
\]
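Continuing with simulated data, the sketch below computes the interval of Theorem 10 for a single $a$; here $z_{n-p}(\alpha)$ is taken to be the $(1-\alpha/2)$ quantile of $t(n-p)$, as identified in the proof, and scipy is assumed available for the quantile.

```python
import numpy as np
from scipy.stats import t as t_dist

rng = np.random.default_rng(6)
n, p, sigma, alpha = 50, 3, 1.5, 0.05
beta = np.array([2.0, -1.0, 0.5])

Z = rng.normal(size=(n, p))
Y = Z @ beta + rng.normal(0.0, sigma, size=n)

ZtZ_inv = np.linalg.inv(Z.T @ Z)
beta_hat = ZtZ_inv @ Z.T @ Y
resid = Y - Z @ beta_hat
s = np.sqrt(resid @ resid / (n - p))

a = np.array([1.0, 0.0, 0.0])                    # gives a CI for beta_1
n_a = 1.0 / (a @ ZtZ_inv @ a)                    # n_a = 1 / (a^T (Z^T Z)^{-1} a)
z = t_dist.ppf(1 - alpha / 2, df=n - p)          # z_{n-p}(alpha)
ci = (a @ beta_hat - z * s / np.sqrt(n_a),
      a @ beta_hat + z * s / np.sqrt(n_a))
print(np.round(ci, 3))                           # 95% confidence interval for a^T beta
```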
