Basic Multivariate Normal Theory
[Prerequisite probability background: univariate theory of random variables, expectation, variance, covariance, moment generating function, independence and the normal distribution. Other requirements: basic vector-matrix theory, multivariate calculus, multivariate change of variable.]

1 Random Vector

A random vector $U \in \mathbb{R}^k$ is a vector $(U_1, U_2, \cdots, U_k)^T$ of scalar random variables $U_i$ defined over a common probability space. The expectation and variance of $U$ are defined as:
\[
EU := \begin{pmatrix} EU_1 \\ EU_2 \\ \vdots \\ EU_k \end{pmatrix}, \qquad
\mathrm{Var}\,U := E\bigl(\{U - EU\}\{U - EU\}^T\bigr) =
\begin{pmatrix}
\mathrm{Var}\,U_1 & \mathrm{Cov}(U_1,U_2) & \cdots & \mathrm{Cov}(U_1,U_k) \\
\mathrm{Cov}(U_2,U_1) & \mathrm{Var}\,U_2 & \cdots & \mathrm{Cov}(U_2,U_k) \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{Cov}(U_k,U_1) & \mathrm{Cov}(U_k,U_2) & \cdots & \mathrm{Var}\,U_k
\end{pmatrix}.
\]
It is easy to check that if $X = a + BU$ where $a \in \mathbb{R}^p$ and $B$ is a $p \times k$ matrix, then $X$ is a random vector in $\mathbb{R}^p$ with $EX = a + B\,EU$ and $\mathrm{Var}\,X = B\,\mathrm{Var}(U)\,B^T$. You should also note that if $\Sigma = \mathrm{Var}\,U$ for some random vector $U$, then $\Sigma$ is a $k \times k$ non-negative definite matrix, because for any $a \in \mathbb{R}^k$, $a^T \Sigma a = \mathrm{Var}(a^T U) \ge 0$.

2 Multivariate Normal

Definition 1. A random vector $U \in \mathbb{R}^k$ is called a normal random vector if for every $a \in \mathbb{R}^k$, $a^T U$ is a (one-dimensional) normal random variable.

Theorem 1. A random vector $U \in \mathbb{R}^k$ is a normal random vector if and only if one can write $U = m + AZ$ for some $m \in \mathbb{R}^k$ and $k \times k$ matrix $A$, where $Z = (Z_1, \cdots, Z_k)^T$ with $Z_i \overset{\text{IID}}{\sim} \mathrm{Normal}(0,1)$.

Proof. "If part". Suppose $U = m + AZ$ with $m$, $A$ and $Z$ as in the statement of the theorem. Then for any $a \in \mathbb{R}^k$,
\[
a^T U = a^T m + a^T A Z = b_0 + \sum_{i=1}^{k} b_i Z_i
\]
for some scalars $b_0, b_1, \cdots, b_k$. But a linear combination of independent (one-dimensional) normal variables is again normal, so $a^T U$ is a normal variable.

"Only if part". Suppose $U$ is a normal random vector. It suffices to show that a $V = m + AZ$, with $Z$ as in the statement of the theorem and suitably chosen $m$ and $A$, has the same distribution as $U$. For any $a \in \mathbb{R}^k$, the moment generating function $M_U(a)$ of $U$ at $a$ is $Ee^{a^T U} = Ee^{X}$, where $X = a^T U$ is normally distributed by definition. But, by one-dimensional normal distribution theory,
\[
Ee^{X} = e^{EX + \frac{1}{2}\mathrm{Var}\,X} = e^{a^T EU + \frac{1}{2} a^T (\mathrm{Var}\,U)\, a} = e^{a^T \mu + \frac{1}{2} a^T \Sigma a},
\]
where we denote $EU$ by $\mu$ and $\mathrm{Var}\,U$ by $\Sigma$. Note that $\Sigma$ is non-negative definite and thus can be written as $\Sigma = AA^T$ for some $k \times k$ matrix $A$.

Write $V = \mu + AZ$ where $Z = (Z_1, \cdots, Z_k)^T$ with $Z_i \overset{\text{IID}}{\sim} \mathrm{Normal}(0,1)$. Then, by the "if part", $V$ is a normal random vector, and because the $Z_i$'s are IID with mean 0 and variance 1, $EV = \mu$ and $\mathrm{Var}\,V = \Sigma$. So by the above discussion $M_V(a) = e^{a^T \mu + \frac{1}{2} a^T \Sigma a} = M_U(a)$. So $U$ and $V$ have the same moment generating function. Because this moment generating function is defined for all $a \in \mathbb{R}^k$, it uniquely determines the associated probability distribution. That is, $V$ and $U$ have the same distribution.

Notation. If a random $k$-vector $U$ is a normal random vector, then by the above proof its distribution is completely determined by its mean $\mu = EU$ and variance $\Sigma = \mathrm{Var}\,U$. We shall denote this distribution by $\mathrm{Normal}_k(\mu, \Sigma)$. Note that $U \sim \mathrm{Normal}_k(\mu, \Sigma)$ means that $U = \mu + AZ$ for $Z$ as in the above theorem, where $A$ satisfies $\Sigma = AA^T$.

Theorem 2 (Linear transformations). If $U \sim \mathrm{Normal}_k(\mu, \Sigma)$, $a \in \mathbb{R}^p$ and $B$ is a $p \times k$ matrix, then $V = a + BU \sim \mathrm{Normal}_p(a + B\mu, B\Sigma B^T)$.

Proof. Write $U = \mu + AZ$ where $Z$ is as in Theorem 1 and $A$ satisfies $\Sigma = AA^T$. Then $V = a + BU = (a + B\mu) + (BA)Z$, so any linear combination $b^T V$ is a constant plus a linear combination of the independent standard normals $Z_i$ and hence normal; by the facts of Section 1, $EV = a + B\mu$ and $\mathrm{Var}\,V = B\Sigma B^T$. This proves the result.
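The representation $U = \mu + AZ$ in Theorem 1 is also how one simulates from $\mathrm{Normal}_k(\mu, \Sigma)$ in practice. Below is a minimal NumPy sketch (not part of the notes; $\mu$, $\Sigma$, $a$ and $B$ are arbitrary illustrative values, and the Cholesky factor is just one choice of $A$ with $AA^T = \Sigma$) that draws $U$ this way and checks Theorem 2's mean and variance formulas empirically.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (not from the notes).
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 0.5]])

# One choice of A with A A^T = Sigma (Theorem 1): the Cholesky factor.
A = np.linalg.cholesky(Sigma)

# Draw U = mu + A Z, where Z has IID Normal(0, 1) coordinates.
n_draws = 200_000
Z = rng.standard_normal((n_draws, 3))
U = mu + Z @ A.T

# Theorem 2: V = a + B U should have mean a + B mu and variance B Sigma B^T.
a = np.array([0.0, 1.0])
B = np.array([[1.0, 2.0, 0.0],
              [0.0, -1.0, 3.0]])
V = a + U @ B.T

print(np.round(V.mean(axis=0), 2), a + B @ mu)   # empirical mean vs. a + B mu
print(np.round(np.cov(V.T), 2))                  # empirical covariance of V
print(B @ Sigma @ B.T)                           # theoretical B Sigma B^T
```

With a couple hundred thousand draws the empirical mean and covariance of $V$ should match $a + B\mu$ and $B\Sigma B^T$ to about two decimal places.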
Theorem 3 (Probability density function). A $\mathrm{Normal}_k(\mu, \Sigma)$ distribution with a positive definite $\Sigma$ admits the following pdf:
\[
f(x) = (2\pi)^{-k/2} (\det \Sigma)^{-1/2} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}, \quad x \in \mathbb{R}^k.
\]
Proof. Suppose $X \sim \mathrm{Normal}_k(\mu, \Sigma)$; then $X = \mu + AZ$ where $AA^T = \Sigma$ and $A$ is non-singular (because $\Sigma$ is p.d.). The joint pdf of $Z_1, \cdots, Z_k$ is given by
\[
g(z_1, \cdots, z_k) = \prod_{i=1}^{k} \frac{1}{\sqrt{2\pi}} \exp\left\{ -\frac{z_i^2}{2} \right\} = (2\pi)^{-k/2} \exp\left\{ -\frac{1}{2} z^T z \right\}, \quad z = (z_1, \cdots, z_k)^T \in \mathbb{R}^k.
\]
Therefore, by the multivariate change of variable formula for the transformation $X = \mu + AZ$, with inverse transformation $Z = A^{-1}(X - \mu)$ and hence Jacobian $J = (\det A)^{-1}$, the pdf of $X$ is
\begin{align*}
f(x) &= |\det A|^{-1}\, g(A^{-1}(x - \mu)), \quad x \in \mathbb{R}^k \\
&= (2\pi)^{-k/2} |\det A|^{-1} \exp\left\{ -\frac{1}{2} (x - \mu)^T (AA^T)^{-1} (x - \mu) \right\}, \quad x \in \mathbb{R}^k \\
&= (2\pi)^{-k/2} (\det \Sigma)^{-1/2} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}, \quad x \in \mathbb{R}^k,
\end{align*}
as $\det \Sigma = (\det A)^2$.

Theorem 4 (Independence). Let $U \sim \mathrm{Normal}_k(\mu, \Sigma)$. Suppose for some $1 \le p < k$, $\Sigma$ can be partitioned as
\[
\Sigma = \begin{pmatrix} \Sigma_{(11)} & 0_{p \times (k-p)} \\ 0_{(k-p) \times p} & \Sigma_{(22)} \end{pmatrix}
\]
where $\Sigma_{(11)}$ is $p \times p$, $\Sigma_{(22)}$ is $(k-p) \times (k-p)$ and $0_{m \times n}$ denotes the matrix of zeros of the specified dimensions. Similarly partition
\[
U = \begin{pmatrix} U_{(1)} \\ U_{(2)} \end{pmatrix} \quad \text{and} \quad \mu = \begin{pmatrix} \mu_{(1)} \\ \mu_{(2)} \end{pmatrix}
\]
into vectors of length $p$ and $(k-p)$. Then
\[
U_{(1)} \sim \mathrm{Normal}_p(\mu_{(1)}, \Sigma_{(11)}), \qquad U_{(2)} \sim \mathrm{Normal}_{k-p}(\mu_{(2)}, \Sigma_{(22)}),
\]
and $U_{(1)}$ and $U_{(2)}$ are independent.

Proof. Because $\Sigma$ is non-negative definite, so must be any of its diagonal blocks. Therefore there are $p \times p$ and $(k-p) \times (k-p)$ matrices $A_{(1)}$ and $A_{(2)}$ such that $\Sigma_{(11)} = A_{(1)}A_{(1)}^T$ and $\Sigma_{(22)} = A_{(2)}A_{(2)}^T$. Clearly, $\Sigma = AA^T$ with
\[
A = \begin{pmatrix} A_{(1)} & 0_{p \times (k-p)} \\ 0_{(k-p) \times p} & A_{(2)} \end{pmatrix}.
\]
Because $U$ is normal, we must have $U = \mu + AZ$ where $Z = (Z_1, \cdots, Z_k)^T$ with $Z_i \overset{\text{IID}}{\sim} \mathrm{Normal}(0,1)$. Write $Z_{(1)} = (Z_1, \cdots, Z_p)^T$ and $Z_{(2)} = (Z_{p+1}, \cdots, Z_k)^T$. Then $Z_{(1)}$ and $Z_{(2)}$ are independent and
\[
\begin{pmatrix} U_{(1)} \\ U_{(2)} \end{pmatrix}
= \begin{pmatrix} \mu_{(1)} \\ \mu_{(2)} \end{pmatrix}
+ \begin{pmatrix} A_{(1)} & 0_{p \times (k-p)} \\ 0_{(k-p) \times p} & A_{(2)} \end{pmatrix}
\begin{pmatrix} Z_{(1)} \\ Z_{(2)} \end{pmatrix}
= \begin{pmatrix} \mu_{(1)} + A_{(1)}Z_{(1)} \\ \mu_{(2)} + A_{(2)}Z_{(2)} \end{pmatrix}.
\]
From this the result follows.

Theorem 5 (Conditioning). Let $U \sim \mathrm{N}_k(\mu, \Sigma)$. For some $1 \le p < k$, consider the partition into blocks of the first $p$ and last $k - p$ variables:
\[
U = \begin{pmatrix} U_{(1)} \\ U_{(2)} \end{pmatrix}, \quad
\mu = \begin{pmatrix} \mu_{(1)} \\ \mu_{(2)} \end{pmatrix} \quad \text{and} \quad
\Sigma = \begin{pmatrix} \Sigma_{(11)} & \Sigma_{(12)} \\ \Sigma_{(21)} & \Sigma_{(22)} \end{pmatrix}.
\]
Then,
\[
U_{(1)} \mid U_{(2)} \sim \mathrm{N}_p\!\left(\mu_{(1)} + \Sigma_{(12)}\Sigma_{(22)}^{-1}(U_{(2)} - \mu_{(2)}),\; \Sigma_{(11)} - \Sigma_{(12)}\Sigma_{(22)}^{-1}\Sigma_{(21)}\right).
\]
Proof. Clearly, $U_{(2)} \sim \mathrm{N}_{k-p}(\mu_{(2)}, \Sigma_{(22)})$. Consider $p$ IID $\mathrm{N}_1(0,1)$ variables $Z_1, \ldots, Z_p$ that are independent of $U_{(2)}$ and let $Z = (Z_1, \ldots, Z_p)^T$. The matrix $\Sigma_{(11)} - \Sigma_{(12)}\Sigma_{(22)}^{-1}\Sigma_{(21)}$ is non-negative definite, so let $B$ be such that $\Sigma_{(11)} - \Sigma_{(12)}\Sigma_{(22)}^{-1}\Sigma_{(21)} = BB^T$. By Theorem 1,
\[
\begin{pmatrix} Z \\ U_{(2)} \end{pmatrix} \sim \mathrm{N}_k\!\left( \begin{pmatrix} 0_{p \times 1} \\ \mu_{(2)} \end{pmatrix}, \begin{pmatrix} I_p & 0_{p \times (k-p)} \\ 0_{(k-p) \times p} & \Sigma_{(22)} \end{pmatrix} \right)
\]
and hence,
\[
\begin{pmatrix} \mu_{(1)} + \Sigma_{(12)}\Sigma_{(22)}^{-1}(U_{(2)} - \mu_{(2)}) + BZ \\ U_{(2)} \end{pmatrix}
\sim \mathrm{N}_k\!\left( \begin{pmatrix} \mu_{(1)} \\ \mu_{(2)} \end{pmatrix}, \begin{pmatrix} \Sigma_{(11)} & \Sigma_{(12)} \\ \Sigma_{(21)} & \Sigma_{(22)} \end{pmatrix} \right),
\]
which is the same as the distribution of $U$. So the conditional distribution of $U_{(1)}$ given $U_{(2)}$ is the same as the conditional distribution of $\mu_{(1)} + \Sigma_{(12)}\Sigma_{(22)}^{-1}(U_{(2)} - \mu_{(2)}) + BZ$ given $U_{(2)}$; but the latter conditional distribution is precisely $\mathrm{N}_p\bigl(\mu_{(1)} + \Sigma_{(12)}\Sigma_{(22)}^{-1}(U_{(2)} - \mu_{(2)}),\; \Sigma_{(11)} - \Sigma_{(12)}\Sigma_{(22)}^{-1}\Sigma_{(21)}\bigr)$.
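As a quick illustration of Theorem 5, the following NumPy sketch (not part of the notes; the partition sizes and matrix entries are arbitrary illustrative values) builds $U_{(1)}$ from $U_{(2)}$ and an independent $Z$ exactly as in the proof, and checks that the resulting joint vector has covariance $\Sigma$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative partitioned parameters (p = 1, k = 3).
mu1, mu2 = np.array([0.0]), np.array([1.0, -1.0])
S11 = np.array([[2.0]])
S12 = np.array([[0.8, 0.4]])
S21 = S12.T
S22 = np.array([[1.0, 0.3],
                [0.3, 1.5]])

# Conditional parameters from Theorem 5.
S22_inv = np.linalg.inv(S22)
cond_cov = S11 - S12 @ S22_inv @ S21     # Schur complement, non-negative definite
B = np.linalg.cholesky(cond_cov)         # one B with B B^T = cond_cov

# Reconstruct the joint as in the proof:
# U1 = mu1 + S12 S22^{-1} (U2 - mu2) + B Z, with Z independent of U2.
n = 300_000
U2 = mu2 + rng.standard_normal((n, 2)) @ np.linalg.cholesky(S22).T
Z = rng.standard_normal((n, 1))
U1 = mu1 + (U2 - mu2) @ (S12 @ S22_inv).T + Z @ B.T

joint = np.hstack([U1, U2])
Sigma = np.block([[S11, S12], [S21, S22]])
print(np.round(np.cov(joint.T), 2))   # should be close to Sigma
print(Sigma)
```

The empirical covariance of the stacked vector should reproduce the full $\Sigma$, which is exactly the step in the proof that identifies the constructed vector's distribution with that of $U$.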
3 Sample mean and variance of normal data

Theorem 6. Let $X_i \overset{\text{IID}}{\sim} \mathrm{Normal}(\mu, \sigma^2)$. Then

1. $W = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim \mathrm{Normal}(0, 1)$,

2. $V = \dfrac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{\sigma^2} \sim \chi^2_{n-1}$,

3. and $V$ and $W$ are independent.

Proof. It suffices to prove the result for $\mu = 0$ and $\sigma^2 = 1$ (otherwise take $\tilde{X}_i = (X_i - \mu)/\sigma$). Let $X$ denote the random vector $(X_1, \cdots, X_n)^T \in \mathbb{R}^n$. Then $X \sim \mathrm{Normal}_n(0_{n \times 1}, I_n)$, where $I_n$ denotes the $n \times n$ identity matrix with 1 on the diagonal and zero elsewhere.

Let $B$ be an $n \times n$ orthogonal matrix whose first row equals $(1/\sqrt{n})\,\mathbf{1}_{1 \times n}$, where $\mathbf{1}_{m \times n}$ denotes the matrix of ones of the given dimensions (such an orthogonal matrix can be constructed by the Gram-Schmidt method). Take $Y = BX$. Then
\[
Y \sim \mathrm{Normal}_n(0_{n \times 1}, BB^T) = \mathrm{Normal}_n(0_{n \times 1}, I_n),
\]
as $B$ is orthogonal. Therefore the coordinate variables $Y_1, Y_2, \cdots, Y_n$ of $Y$ are IID $\mathrm{Normal}(0, 1)$. Now argue,

1. $Y_1 = (1/\sqrt{n})\,\mathbf{1}_{1 \times n} X = (1/\sqrt{n})\sum_{i=1}^{n} X_i = \sqrt{n}\,\bar{X}$. So $W = Y_1 \sim \mathrm{Normal}(0, 1)$.

2. $Y^T Y = X^T X$ (since $B$ is orthogonal), and so $Y_1^2 + Y_2^2 + \cdots + Y_n^2 = X_1^2 + X_2^2 + \cdots + X_n^2$. Hence
\[
V = \sum_{i=1}^{n}(X_i - \bar{X})^2 = \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 = \sum_{i=1}^{n} Y_i^2 - Y_1^2 = Y_2^2 + \cdots + Y_n^2,
\]
which, being a sum of squares of $n-1$ IID $\mathrm{Normal}(0,1)$ variables, has the $\chi^2_{n-1}$ distribution.

3. $W = Y_1$ and $V = Y_2^2 + \cdots + Y_n^2$ are functions of disjoint sets of the independent variables $Y_1, \cdots, Y_n$, so $V$ and $W$ are independent.
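To make Theorem 6 concrete, here is a small Monte Carlo sketch (not part of the notes; NumPy, with arbitrary choices of $\mu$, $\sigma$ and $n$) that computes $W$ and $V$ from repeated samples and compares their moments and correlation against the stated $\mathrm{Normal}(0,1)$, $\chi^2_{n-1}$ and independence claims.

```python
import numpy as np

rng = np.random.default_rng(2)

# Repeatedly draw samples of size n and form W and V for each sample.
mu, sigma, n, reps = 3.0, 2.0, 10, 200_000
X = rng.normal(mu, sigma, size=(reps, n))

xbar = X.mean(axis=1)
W = (xbar - mu) / (sigma / np.sqrt(n))                   # claimed Normal(0, 1)
V = ((X - xbar[:, None]) ** 2).sum(axis=1) / sigma**2    # claimed chi^2_{n-1}

print(W.mean(), W.var())        # approx 0 and 1
print(V.mean(), V.var())        # approx n - 1 = 9 and 2(n - 1) = 18
print(np.corrcoef(W, V)[0, 1])  # approx 0, consistent with independence
```

The printed moments match those of a standard normal and a $\chi^2_{n-1}$ variable, and the near-zero correlation is consistent with (though of course weaker than) the independence asserted in part 3.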