Lecture 3. Inference About Multivariate Normal Distribution
3.1 Point and Interval Estimation

Let $X_1, \ldots, X_n$ be i.i.d. $N_p(\mu, \Sigma)$. We are interested in the maximum likelihood estimates of $\mu$ and $\Sigma$. Recall that the density of $X_1$ is
\[
f(x) = |2\pi\Sigma|^{-\frac{1}{2}} \exp\left\{-\frac{1}{2}(x-\mu)'\Sigma^{-1}(x-\mu)\right\}, \quad x \in \mathbb{R}^p.
\]
The negative log-likelihood function, given observations $x_1^n = \{x_1, \ldots, x_n\}$, is then
\[
\ell(\mu, \Sigma \mid x_1^n) = \frac{n}{2}\log|\Sigma| + \frac{1}{2}\sum_{i=1}^n (x_i-\mu)'\Sigma^{-1}(x_i-\mu) + c
= \frac{n}{2}\log|\Sigma| + \frac{n}{2}(\bar{x}-\mu)'\Sigma^{-1}(\bar{x}-\mu) + \frac{1}{2}\sum_{i=1}^n (x_i-\bar{x})'\Sigma^{-1}(x_i-\bar{x}).
\]
To this end, denote the centered data matrix by $\tilde{X} = [\tilde{x}_1, \ldots, \tilde{x}_n]_{p\times n}$, where $\tilde{x}_i = x_i - \bar{x}$. Let
\[
S_0 = \frac{1}{n}\tilde{X}\tilde{X}' = \frac{1}{n}\sum_{i=1}^n \tilde{x}_i\tilde{x}_i'.
\]

Proposition 1. The m.l.e.'s of $\mu$ and $\Sigma$, which jointly minimize $\ell(\mu, \Sigma \mid x_1^n)$, are
\[
\hat{\mu}^{\mathrm{MLE}} = \bar{x}, \qquad \hat{\Sigma}^{\mathrm{MLE}} = S_0.
\]

Note that $S_0$ is a biased estimator of $\Sigma$. The sample variance-covariance matrix $S = \frac{n}{n-1}S_0$ is unbiased.

For interval estimation of $\mu$, we largely follow Section 7.1 of Härdle and Simar (2012). First note that since $\mu \in \mathbb{R}^p$, we need to generalize the notion of an interval (primarily defined for $\mathbb{R}^1$) to higher dimensions. A simple extension is a direct product of marginal intervals: for intervals $a < x < b$ and $c < y < d$, we obtain the rectangular region $\{(x, y) \in \mathbb{R}^2 : a < x < b,\ c < y < d\}$.

A confidence region $A \subset \mathbb{R}^p$ is composed of the values of a function of the (random) observations $X_1, \ldots, X_n$. $A = A(X_1^n)$ is a confidence region of size $1-\alpha \in (0, 1)$ for the parameter $\mu$ if
\[
P(\mu \in A) \ge 1 - \alpha, \quad \text{for all } \mu \in \mathbb{R}^p.
\]

(Elliptical confidence region) Corollary 7 in Lecture 2 provides a pivot which paves the way to constructing a confidence region for $\mu$. Since $\frac{n-p}{p}(\bar{X}-\mu)'S_0^{-1}(\bar{X}-\mu) \sim F_{p,n-p}$ and
\[
P\left((\bar{X}-\mu)'S_0^{-1}(\bar{X}-\mu) < \frac{p}{n-p}F_{1-\alpha;p,n-p}\right) = 1-\alpha,
\]
the set
\[
A = \left\{\mu \in \mathbb{R}^p : (\bar{X}-\mu)'S_0^{-1}(\bar{X}-\mu) < \frac{p}{n-p}F_{1-\alpha;p,n-p}\right\}
\]
is a confidence region of size $1-\alpha$ for the parameter $\mu$.

(Simultaneous confidence intervals) Simultaneous confidence intervals for all linear combinations $a'\mu$ of the elements of $\mu$, with arbitrary $a \in \mathbb{R}^p$, provide confidence of size $1-\alpha$ for all intervals covering $a'\mu$, including those for the marginal means $\mu_1, \ldots, \mu_p$. We are interested in lower and upper bounds $L(a)$ and $U(a)$ satisfying
\[
P(L(a) < a'\mu < U(a) \text{ for all } a \in \mathbb{R}^p) \ge 1-\alpha, \quad \text{for all } \mu \in \mathbb{R}^p.
\]
First consider a single confidence interval by fixing a particular vector $a$. To evaluate a confidence interval for $a'\mu$, write new random variables $Y_i = a'X_i \sim N_1(a'\mu, a'\Sigma a)$ $(i = 1, \ldots, n)$, whose squared $t$-statistic is
\[
t^2(a) = \frac{n(a'\mu - a'\bar{X})^2}{a'Sa} \sim F_{1,n-1}.
\]
Thus, for any fixed $a$,
\[
P\left(t^2(a) \le F_{1-\alpha;1,n-1}\right) = 1-\alpha. \tag{1}
\]
Next, consider many projection vectors $a_1, a_2, \ldots, a_M$ ($M$ is finite only for convenience). Simultaneous confidence intervals of a type similar to (1) then require
\[
P\left(\bigcap_{i=1}^M \{t^2(a_i) \le h(\alpha)\}\right) \ge 1-\alpha,
\]
for some $h(\cdot)$. Collecting the facts that

1. $\max_a t^2(a) \le h(\alpha)$ implies $t^2(a_i) \le h(\alpha)$ for all $i$,
2. $\max_a t^2(a) = n(\mu - \bar{X})'S^{-1}(\mu - \bar{X})$,
3. and Corollary 7 in Lecture 2,

we have, for $h(\alpha) = n\,\frac{n-1}{n}\,\frac{p}{n-p}F_{1-\alpha;p,n-p}$,
\[
P\left(\bigcap_{i=1}^M \{t^2(a_i) \le h(\alpha)\}\right) \ge P\left(\max_a t^2(a) \le h(\alpha)\right) = 1-\alpha.
\]

Proposition 2. Simultaneously for all $a \in \mathbb{R}^p$, the interval
\[
a'\bar{X} \pm \sqrt{h(\alpha)\,a'Sa/n}
\]
contains $a'\mu$ with probability $1-\alpha$.
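To make the region and the intervals concrete, here is a minimal numerical sketch in Python with NumPy/SciPy. The function name, the simulated data, and the choice $a = e_1, \ldots, e_p$ (standard basis vectors, giving the marginal intervals) are illustrative assumptions, not part of the lecture.

```python
import numpy as np
from scipy.stats import f

def mean_inference(X, alpha=0.05):
    """Elliptical confidence region and simultaneous CIs (Section 3.1).

    X : (n, p) array of i.i.d. N_p(mu, Sigma) observations.
    """
    n, p = X.shape
    xbar = X.mean(axis=0)
    Xc = X - xbar                      # centered data
    S0 = Xc.T @ Xc / n                 # m.l.e. of Sigma (biased)
    S = n / (n - 1) * S0               # unbiased sample covariance
    Fq = f.ppf(1 - alpha, p, n - p)    # F_{1-alpha; p, n-p} quantile

    cutoff = p / (n - p) * Fq          # elliptical region cutoff

    def in_region(mu):
        # membership test: (xbar - mu)' S0^{-1} (xbar - mu) < cutoff
        d = xbar - mu
        return d @ np.linalg.solve(S0, d) < cutoff

    # Proposition 2 with a = e_1, ..., e_p: a'xbar +/- sqrt(h(alpha) a'S a / n)
    h = (n - 1) * p / (n - p) * Fq
    half = np.sqrt(h * np.diag(S) / n)
    return in_region, np.column_stack([xbar - half, xbar + half])

# Illustration on simulated bivariate data (the Golub subset is not bundled here).
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 0.0], [[2.0, 0.5], [0.5, 1.0]], size=38)
in_region, cis = mean_inference(X)
print(in_region(np.zeros(2)), cis)
```

For $p = 2$ and $a_1 = e_1$, $a_2 = e_2$, the rectangle formed by the two simultaneous intervals is exactly the bounding box of the confidence ellipse, which is the comparison displayed in the figures below.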
Example 1. From the Golub gene expression data, with dimension $d = 7129$, take the first and the 1674th variables (genes) to focus on the bivariate case ($p = 2$). There are two populations: 11 observations from AML and 27 from ALL. Figure 1 illustrates the elliptical confidence regions of size 95% and 99%. Figure 2 compares the elliptical confidence region with the simultaneous confidence intervals for $a_1 = (1, 0)'$ and $a_2 = (0, 1)'$.

[Figure 1: Elliptical confidence regions of size 95% and 99%. Golub data, ALL (red) and AML (blue); gene #1 (horizontal axis) vs. gene #1674 (vertical axis).]

[Figure 2: Simultaneous confidence intervals of size 95%. Golub data, ALL (red) and AML (blue); gene #1 (horizontal axis) vs. gene #1674 (vertical axis).]

3.2 Hypothesis testing

Consider testing a null hypothesis $H_0: \theta \in \Omega_0$ against an alternative hypothesis $H_1: \theta \in \Omega_1$. The principle of the likelihood ratio test is as follows. Let $L_0$ be the maximized likelihood under $H_0: \theta \in \Omega_0$, and let $L_1$ be the maximized likelihood under $\theta \in \Omega_0 \cup \Omega_1$. The likelihood ratio statistic, sometimes called the Wilks statistic, is then
\[
W = -2\log\left(\frac{L_0}{L_1}\right) \ge 0.
\]
The null hypothesis is rejected if the observed value of $W$ is large. In some cases the exact distribution of $W$ under $H_0$ can be evaluated. In other cases, Wilks' theorem states that for large $n$ (sample size),
\[
W \xrightarrow{\mathcal{L}} \chi^2_\nu,
\]
where $\nu$ is the number of free parameters in $H_1$ that are not in $H_0$: if the number of degrees of freedom in $\Omega_0$ is $q$ and that in $\Omega_0 \cup \Omega_1$ is $r$, then $\nu = r - q$.

Consider testing hypotheses on $\mu$ and $\Sigma$ of the multivariate normal distribution, based on the $n$-sample $X_1, \ldots, X_n$.

Case I: $H_0: \mu = \mu_0$, $H_1: \mu \ne \mu_0$; $\Sigma$ is known.

In this case, we know the exact distribution of the likelihood ratio statistic:
\[
W = n(\bar{x} - \mu_0)'\Sigma^{-1}(\bar{x} - \mu_0) \sim \chi^2_p \quad \text{under } H_0.
\]

Case II: $H_0: \mu = \mu_0$, $H_1: \mu \ne \mu_0$; $\Sigma$ is unknown.

The m.l.e.'s under $H_1$ are $\hat{\mu} = \bar{x}$ and $\hat{\Sigma} = S_0$. The restricted m.l.e. of $\Sigma$ under $H_0$ is
\[
\hat{\Sigma}_{(0)} = \frac{1}{n}\sum_{i=1}^n (x_i - \mu_0)(x_i - \mu_0)' = S_0 + \frac{1}{n}\delta\delta', \quad \text{where } \delta = \sqrt{n}(\bar{x} - \mu_0).
\]
The likelihood ratio statistic is then
\[
W = n\log\left|S_0 + \tfrac{1}{n}\delta\delta'\right| - n\log|S_0|.
\]
It turns out that $W$ is a monotone increasing function of
\[
\delta'S^{-1}\delta = n(\bar{x} - \mu_0)'S^{-1}(\bar{x} - \mu_0),
\]
which is Hotelling's $T^2(n-1)$ statistic.

Case III: $H_0: \Sigma = \Sigma_0$, $H_1: \Sigma \ne \Sigma_0$; $\mu$ is unknown.

We have the likelihood ratio statistic
\[
W = -n\log|\Sigma_0^{-1}S_0| - np + n\,\mathrm{trace}(\Sigma_0^{-1}S_0).
\]
This is a case where the exact distribution of $W$ is difficult to evaluate. For large $n$, use Wilks' theorem to approximate the distribution of $W$ by $\chi^2_\nu$ with degrees of freedom $\nu = p(p+1)/2$.

Next, consider testing the equality of two mean vectors. Let $X_{11}, \ldots, X_{1n_1}$ be i.i.d. $N_p(\mu_1, \Sigma)$ and $X_{21}, \ldots, X_{2n_2}$ be i.i.d. $N_p(\mu_2, \Sigma)$.

Case IV: $H_0: \mu_1 = \mu_2$, $H_1: \mu_1 \ne \mu_2$; $\Sigma$ is unknown.

Since
\[
\bar{X}_1 - \bar{X}_2 \sim N_p\left(\mu_1 - \mu_2,\ \frac{n_1 + n_2}{n_1 n_2}\Sigma\right), \qquad (n_1 + n_2 - 2)S_P \sim W_p(n_1 + n_2 - 2, \Sigma),
\]
we have Hotelling's $T^2$ statistic for the two-sample problem,
\[
T^2(n_1 + n_2 - 2) = \frac{n_1 n_2}{n_1 + n_2}(\bar{X}_1 - \bar{X}_2)'S_P^{-1}(\bar{X}_1 - \bar{X}_2),
\]
and by Theorem 5 in Lecture 2,
\[
\frac{n_1 + n_2 - p - 1}{(n_1 + n_2 - 2)p}\,T^2(n_1 + n_2 - 2) \sim F_{p,\,n_1+n_2-p-1}.
\]
Similar to Case II above, the likelihood ratio statistic is a monotone function of $T^2(n_1 + n_2 - 2)$.
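Cases II and IV reduce to short computable recipes. Below is a minimal sketch in Python with NumPy/SciPy, assuming data held as $(n, p)$ arrays; the function names are mine. The one-sample calibration $(n-p)/((n-1)p)\,T^2 \sim F_{p,n-p}$ follows from Corollary 7 in Lecture 2, and the two-sample calibration restates the display above.

```python
import numpy as np
from scipy.stats import f

def hotelling_one_sample(X, mu0):
    """Case II: T^2 = n (xbar - mu0)' S^{-1} (xbar - mu0);
    under H0, (n-p)/((n-1)p) T^2 ~ F_{p, n-p}."""
    n, p = X.shape
    d = X.mean(axis=0) - mu0
    S = np.cov(X, rowvar=False)        # unbiased sample covariance
    T2 = n * d @ np.linalg.solve(S, d)
    return T2, f.sf((n - p) / ((n - 1) * p) * T2, p, n - p)

def hotelling_two_sample(X1, X2):
    """Case IV: two-sample T^2 with pooled covariance S_P;
    under H0, (n1+n2-p-1)/((n1+n2-2)p) T^2 ~ F_{p, n1+n2-p-1}."""
    (n1, p), (n2, _) = X1.shape, X2.shape
    d = X1.mean(axis=0) - X2.mean(axis=0)
    SP = ((n1 - 1) * np.cov(X1, rowvar=False)
          + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    T2 = n1 * n2 / (n1 + n2) * d @ np.linalg.solve(SP, d)
    return T2, f.sf((n1 + n2 - p - 1) / ((n1 + n2 - 2) * p) * T2,
                    p, n1 + n2 - p - 1)
```

Both helpers return the $T^2$ statistic together with the $F$-based p-value; $H_0$ is rejected for small p-values, i.e., large $T^2$.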
3.3 Hypothesis testing when p > n

In the high-dimensional situation where the dimension $p$ is larger than the sample size ($p > n - 1$ or $p > n_1 + n_2 - 2$), the sample covariance matrix $S$ is not invertible; thus Hotelling's $T^2$ statistic, which is essential in the testing procedures above, cannot be computed. We survey important proposals for testing hypotheses on means for high dimension, low sample size data.

A basic idea in generalizing a test procedure to the $p > n$ case is to base the test on a computable test statistic that is also an estimator of $\|\mu - \mu_0\|$ or $\|\mu_1 - \mu_2\|$.

In Case II (one sample), Dempster (1960) proposed to replace $S^{-1}$ in Hotelling's statistic by $(\mathrm{trace}(S)I_p)^{-1}$. He showed that under $H_0: \mu = 0$,
\[
T_D = \frac{n\bar{X}'\bar{X}}{\mathrm{trace}(S)} \sim F_{r,(n-1)r}, \quad \text{approximately},
\]
where $r = \frac{(\mathrm{trace}(\Sigma))^2}{\mathrm{trace}(\Sigma^2)}$ is a measure of the sphericity of $\Sigma$. An estimator $\hat{r}$ of $r$ is used in testing.

Bai and Saranadasa (1996) proposed to simply replace $S^{-1}$ in Hotelling's statistic by $I_p$, yielding $T_B = n\bar{X}'\bar{X}$. However, $\bar{X}'\bar{X}$ is not an unbiased estimator of $\mu'\mu$, since $E(\bar{X}'\bar{X}) = \frac{1}{n}\mathrm{trace}(\Sigma) + \mu'\mu$. They showed that the standardized statistic
\[
M_B = \frac{n\bar{X}'\bar{X} - \mathrm{trace}(S)}{\widehat{\mathrm{sd}}\left(n\bar{X}'\bar{X} - \mathrm{trace}(S)\right)}
= \frac{n\bar{X}'\bar{X} - \mathrm{trace}(S)}{\sqrt{\frac{2(n-1)n}{(n-2)(n+1)}\left(\mathrm{trace}(S^2) - \frac{1}{n}(\mathrm{trace}(S))^2\right)}}
\]
has an asymptotic $N(0, 1)$ distribution as $p, n \to \infty$.
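A sketch of both high-dimensional statistics, under the same conventions as the previous snippets. The plug-in $\hat{r} = (\mathrm{trace}\,S)^2 / \mathrm{trace}(S^2)$ is one simple choice for the estimator of $r$; the lecture does not specify which estimator Dempster used, so this choice and the helper names are assumptions for illustration.

```python
import numpy as np
from scipy.stats import f, norm

def dempster_test(X):
    """Dempster (1960): T_D = n xbar'xbar / trace(S),
    approximately F_{r, (n-1)r} under H0: mu = 0."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    TD = n * (xbar @ xbar) / np.trace(S)
    # Plug-in estimate of the sphericity r = (tr Sigma)^2 / tr(Sigma^2);
    # an illustrative choice, not necessarily the estimator of the paper.
    r_hat = np.trace(S) ** 2 / np.trace(S @ S)
    return TD, f.sf(TD, r_hat, (n - 1) * r_hat)

def bai_saranadasa_test(X):
    """Bai & Saranadasa (1996): M_B is asymptotically N(0, 1)
    under H0: mu = 0 as p, n -> infinity."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    num = n * (xbar @ xbar) - np.trace(S)
    var = (2.0 * (n - 1) * n / ((n - 2) * (n + 1))
           * (np.trace(S @ S) - np.trace(S) ** 2 / n))
    MB = num / np.sqrt(var)
    return MB, norm.sf(MB)             # reject H0 for large M_B
```

Note that neither statistic requires inverting $S$, so both remain computable when $p > n$, which is the whole point of these proposals.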