
Lecture 3. Inference about multivariate normal distributions

3.1 Point and interval estimation

Let $X_1, \dots, X_n$ be i.i.d. $N_p(\mu, \Sigma)$. We are interested in evaluating the maximum likelihood estimates of $\mu$ and $\Sigma$. Recall that the density of $X_1$ is
\[
f(x) = |2\pi\Sigma|^{-\frac{1}{2}} \exp\left\{ -\tfrac{1}{2}(x - \mu)'\Sigma^{-1}(x - \mu) \right\}, \quad x \in \mathbb{R}^p.
\]
The negative log-likelihood, given observations $x_1^n = \{x_1, \dots, x_n\}$, is then
\begin{align*}
\ell(\mu, \Sigma \mid x_1^n) &= \frac{n}{2}\log|\Sigma| + \frac{1}{2}\sum_{i=1}^n (x_i - \mu)'\Sigma^{-1}(x_i - \mu) + c \\
&= \frac{n}{2}\log|\Sigma| + \frac{n}{2}(\bar{x} - \mu)'\Sigma^{-1}(\bar{x} - \mu) + \frac{1}{2}\sum_{i=1}^n (x_i - \bar{x})'\Sigma^{-1}(x_i - \bar{x}).
\end{align*}
To this end, denote the centered data matrix by $\tilde{X} = [\tilde{x}_1, \dots, \tilde{x}_n]_{p \times n}$, where $\tilde{x}_i = x_i - \bar{x}$, and let
\[
S_0 = \frac{1}{n}\tilde{X}\tilde{X}' = \frac{1}{n}\sum_{i=1}^n \tilde{x}_i \tilde{x}_i'.
\]

Proposition 1. The m.l.e. of $\mu$ and $\Sigma$, which jointly minimize $\ell(\mu, \Sigma \mid x_1^n)$, are $\hat{\mu}_{\mathrm{MLE}} = \bar{x}$ and $\hat{\Sigma}_{\mathrm{MLE}} = S_0$.
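As a quick numerical companion to Proposition 1 (not part of the original notes; a minimal sketch assuming numpy is available, with illustrative variable names):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.standard_normal((n, p))   # rows are observations x_1, ..., x_n

mu_hat = X.mean(axis=0)           # m.l.e. of mu: the sample mean
Xc = X - mu_hat                   # centered observations x_i - x_bar
S0 = Xc.T @ Xc / n                # m.l.e. of Sigma: divide by n (biased)
S = Xc.T @ Xc / (n - 1)           # unbiased version S = n/(n-1) * S0
```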

Note that $S_0$ is a biased estimator of $\Sigma$; the matrix $S = \frac{n}{n-1}S_0$ is unbiased. For interval estimation of $\mu$, we largely follow Section 7.1 of Härdle and Simar (2012). First note that since $\mu \in \mathbb{R}^p$, we need to generalize the notion of intervals (defined primarily for $\mathbb{R}^1$) to higher dimensions. A simple extension is a direct product of marginal intervals: for intervals $a < x < b$ and $c < y < d$, we obtain the rectangular region $\{(x, y) \in \mathbb{R}^2 : a < x < b,\ c < y < d\}$.

A confidence region $A \subset \mathbb{R}^p$ is composed of the values of a function of the (random) observations $X_1, \dots, X_n$. We say $A = A(X_1^n)$ is a confidence region of size $1 - \alpha \in (0, 1)$ for $\mu$ if
\[
P(\mu \in A) \ge 1 - \alpha, \quad \text{for all } \mu \in \mathbb{R}^p.
\]

(Elliptical confidence region) Corollary 7 in Lecture 2 provides a pivot which paves the way to construct a confidence region for $\mu$. Since $\frac{n-p}{p}(\bar{X} - \mu)'S_0^{-1}(\bar{X} - \mu) \sim F_{p, n-p}$ and
\[
P\left( (\bar{X} - \mu)'S_0^{-1}(\bar{X} - \mu) < \frac{p}{n-p}F_{1-\alpha;p,n-p} \right) = 1 - \alpha,
\]
the set
\[
A = \left\{ \mu \in \mathbb{R}^p : (\bar{X} - \mu)'S_0^{-1}(\bar{X} - \mu) < \frac{p}{n-p}F_{1-\alpha;p,n-p} \right\}
\]
is a confidence region of size $1 - \alpha$ for the parameter $\mu$.

(Simultaneous confidence intervals) Simultaneous confidence intervals for all linear combinations $a'\mu$ of the elements of $\mu$, for arbitrary $a \in \mathbb{R}^p$, provide confidence of size $1 - \alpha$ for all intervals covering $a'\mu$, including the marginal means $\mu_1, \dots, \mu_p$. We are interested in evaluating lower and upper bounds $L(a)$ and $U(a)$ satisfying

\[
P(L(a) < a'\mu < U(a), \text{ for all } a \in \mathbb{R}^p) \ge 1 - \alpha, \quad \text{for all } \mu \in \mathbb{R}^p.
\]

First consider a single confidence interval, fixing a particular vector $a$. To evaluate a confidence interval for $a'\mu$, define the new random variables $Y_i = a'X_i \sim N_1(a'\mu, a'\Sigma a)$ $(i = 1, \dots, n)$, whose squared t-statistic is
\[
t^2(a) = \frac{n(a'\mu - a'\bar{X})^2}{a'Sa} \sim F_{1, n-1}.
\]
Thus, for any fixed $a$,

\[
P\left( t^2(a) \le F_{1-\alpha;1,n-1} \right) = 1 - \alpha. \tag{1}
\]

Next, consider many projection vectors $a_1, a_2, \dots, a_M$ ($M$ is taken finite only for convenience). Simultaneous confidence statements of a type similar to (1) then take the form

\[
P\left( \bigcap_{i=1}^M \{ t^2(a_i) \le h(\alpha) \} \right) \ge 1 - \alpha,
\]
for some function $h(\cdot)$. Collecting the following facts:

1. $\max_a t^2(a) \le h(\alpha)$ implies $t^2(a_i) \le h(\alpha)$ for all $i$;

2. $\max_a t^2(a) = n(\mu - \bar{X})'S^{-1}(\mu - \bar{X})$;

3. Corollary 7 in Lecture 2,

we have, for $h(\alpha) = \frac{(n-1)p}{n-p}F_{1-\alpha;p,n-p}$,

\[
P\left( \bigcap_{i=1}^M \{ t^2(a_i) \le h(\alpha) \} \right) \ge P\left( \max_a t^2(a) \le h(\alpha) \right) = 1 - \alpha.
\]

Proposition 2. Simultaneously for all $a \in \mathbb{R}^p$, the interval
\[
a'\bar{X} \pm \sqrt{h(\alpha)\, a'Sa / n}
\]
contains $a'\mu$ with probability $1 - \alpha$.

Example 1. From the Golub gene expression data, with dimension $d = 7129$, take the first and the 1674th variables (genes), to focus on the bivariate case ($p = 2$). There are two populations: 11 observations from AML, 27 from ALL. Figure 1 illustrates the elliptical confidence regions of size 95% and 99%. Figure 2 compares the elliptical confidence region with the simultaneous confidence intervals for $a_1 = (1, 0)'$ and $a_2 = (0, 1)'$.

[Figure 1: Elliptical confidence regions of size 95% and 99%. Golub data, ALL (red) and AML (blue); horizontal axis: gene #1, vertical axis: gene #1674.]

[Figure 2: Simultaneous confidence intervals of size 95%. Golub data, ALL (red) and AML (blue); horizontal axis: gene #1, vertical axis: gene #1674.]

3.2 Hypothesis testing

Consider testing a null hypothesis $H_0: \theta \in \Omega_0$ against an alternative $H_1: \theta \in \Omega_1$. The principle of the likelihood ratio test is as follows. Let $L_0$ be the maximized likelihood under $H_0: \theta \in \Omega_0$, and $L_1$ be the maximized likelihood under $\theta \in \Omega_0 \cup \Omega_1$. The likelihood ratio statistic, sometimes called the Wilks statistic, is then
\[
W = -2\log\left(\frac{L_0}{L_1}\right) \ge 0.
\]
The null hypothesis is rejected if the observed value of $W$ is large. In some cases the exact distribution of $W$ under $H_0$ can be evaluated. In other cases, Wilks' theorem states that for large $n$ (sample size),
\[
W \xrightarrow{\mathcal{L}} \chi^2_\nu,
\]
where $\nu$ is the number of free parameters in $H_1$ that are not free in $H_0$: if the number of free parameters in $\Omega_0$ is $q$ and that in $\Omega_0 \cup \Omega_1$ is $r$, then $\nu = r - q$.

Consider testing hypotheses on $\mu$ and $\Sigma$ of the multivariate normal distribution, based on an $n$-sample $X_1, \dots, X_n$.

case I: $H_0: \mu = \mu_0$, $H_1: \mu \ne \mu_0$, $\Sigma$ is known. In this case, we know the exact distribution of the likelihood ratio statistic

\[
W = n(\bar{x} - \mu_0)'\Sigma^{-1}(\bar{x} - \mu_0) \sim \chi^2_p,
\]

under $H_0$.

case II: $H_0: \mu = \mu_0$, $H_1: \mu \ne \mu_0$, $\Sigma$ is unknown.

The m.l.e. under $H_1$ are $\hat{\mu} = \bar{x}$ and $\hat{\Sigma} = S_0$. The restricted m.l.e. of $\Sigma$ under $H_0$ is
\[
\hat{\Sigma}_{(0)} = \frac{1}{n}\sum_{i=1}^n (x_i - \mu_0)(x_i - \mu_0)' = S_0 + \frac{1}{n}\delta\delta', \quad \text{where } \delta = \sqrt{n}(\bar{x} - \mu_0).
\]
The likelihood ratio statistic is then
\[
W = n\log\left|S_0 + \tfrac{1}{n}\delta\delta'\right| - n\log|S_0|.
\]
It turns out that $W$ is a monotone increasing function of

\[
\delta'S^{-1}\delta = n(\bar{x} - \mu_0)'S^{-1}(\bar{x} - \mu_0),
\]

which is Hotelling's $T^2(n-1)$ statistic.

case III: $H_0: \Sigma = \Sigma_0$, $H_1: \Sigma \ne \Sigma_0$, $\mu$ is unknown. We have the likelihood ratio statistic

\[
W = -n\log|\Sigma_0^{-1}S_0| - np + n\,\mathrm{trace}(\Sigma_0^{-1}S_0).
\]

This is a case where the exact distribution of $W$ is difficult to evaluate. For large $n$, use Wilks' theorem to approximate the distribution of $W$ by $\chi^2_\nu$ with degrees of freedom $\nu = p(p+1)/2$.
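The three cases can be sketched numerically as follows (a hedged illustration assuming numpy and scipy; the function names are our own, not from the notes):

```python
import numpy as np
from scipy.stats import chi2, f

def case_I_test(X, mu0, Sigma):
    """H0: mu = mu0 with Sigma known; W ~ chi^2_p exactly under H0."""
    n, p = X.shape
    d = X.mean(axis=0) - mu0
    W = n * d @ np.linalg.solve(Sigma, d)
    return W, chi2.sf(W, p)

def case_II_test(X, mu0):
    """H0: mu = mu0 with Sigma unknown; Hotelling's T^2(n-1), F calibration."""
    n, p = X.shape
    d = X.mean(axis=0) - mu0
    S = np.cov(X, rowvar=False)
    T2 = n * d @ np.linalg.solve(S, d)
    F_stat = (n - p) / ((n - 1) * p) * T2     # ~ F_{p, n-p} under H0
    return T2, f.sf(F_stat, p, n - p)

def case_III_test(X, Sigma0):
    """H0: Sigma = Sigma0; Wilks chi^2 approximation with nu = p(p+1)/2."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S0 = Xc.T @ Xc / n
    M = np.linalg.solve(Sigma0, S0)           # Sigma0^{-1} S0
    W = -n * np.linalg.slogdet(M)[1] - n * p + n * np.trace(M)
    return W, chi2.sf(W, p * (p + 1) / 2)
```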

Next, consider testing the equality of two mean vectors. Let $X_{11}, \dots, X_{1n_1}$ be i.i.d. $N_p(\mu_1, \Sigma)$ and $X_{21}, \dots, X_{2n_2}$ be i.i.d. $N_p(\mu_2, \Sigma)$.

case IV: $H_0: \mu_1 = \mu_2$, $H_1: \mu_1 \ne \mu_2$, $\Sigma$ is unknown. Since

\[
\bar{X}_1 - \bar{X}_2 \sim N_p\left( \mu_1 - \mu_2,\ \frac{n_1 + n_2}{n_1 n_2}\Sigma \right), \qquad (n_1 + n_2 - 2)S_P \sim W_p(n_1 + n_2 - 2, \Sigma),
\]

we have Hotelling's $T^2$ statistic for the two-sample problem

\[
T^2(n_1 + n_2 - 2) = \frac{n_1 n_2}{n_1 + n_2}(\bar{X}_1 - \bar{X}_2)'S_P^{-1}(\bar{X}_1 - \bar{X}_2),
\]
and by Theorem 5 in Lecture 2,

\[
\frac{n_1 + n_2 - p - 1}{(n_1 + n_2 - 2)p}\, T^2(n_1 + n_2 - 2) \sim F_{p,\, n_1 + n_2 - p - 1}.
\]

Similar to case II above, the likelihood ratio statistic is a monotone function of $T^2(n_1 + n_2 - 2)$.
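A sketch of case IV under the same assumptions (numpy/scipy; the function name is illustrative):

```python
import numpy as np
from scipy.stats import f

def two_sample_hotelling(X1, X2):
    """Case IV: two-sample Hotelling T^2 with pooled covariance S_P."""
    (n1, p), (n2, _) = X1.shape, X2.shape
    d = X1.mean(axis=0) - X2.mean(axis=0)
    SP = ((n1 - 1) * np.cov(X1, rowvar=False) +
          (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    T2 = n1 * n2 / (n1 + n2) * (d @ np.linalg.solve(SP, d))
    F_stat = (n1 + n2 - p - 1) / ((n1 + n2 - 2) * p) * T2
    return T2, f.sf(F_stat, p, n1 + n2 - p - 1)   # statistic and p-value
```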

3.3 Hypothesis testing when p > n

In the high-dimensional situation where the dimension $p$ is larger than the sample size ($p > n - 1$ or $p > n_1 + n_2 - 2$), the sample covariance matrix $S$ is not invertible, and thus Hotelling's $T^2$ statistic, which is essential in the testing procedures above, cannot be computed. We review important proposals for testing hypotheses on means for high dimension, low sample size data. A basic idea in generalizing a test procedure to the $p > n$ case is to base the test on a computable statistic that is also an estimator of $\|\mu - \mu_0\|$ or $\|\mu_1 - \mu_2\|$.

In case II (one sample), Dempster (1960) proposed to replace $S^{-1}$ in Hotelling's statistic by $(\mathrm{trace}(S)I_p)^{-1}$. He showed that under $H_0: \mu = 0$,
\[
T_D = \frac{n\bar{X}'\bar{X}}{\mathrm{trace}(S)} \sim F_{r,(n-1)r}, \quad \text{approximately},
\]

for $r = (\mathrm{trace}(\Sigma))^2 / \mathrm{trace}(\Sigma^2)$, a measure of the sphericity of $\Sigma$. An estimator $\hat{r}$ of $r$ is used in testing.

Bai and Saranadasa (1996) proposed to simply replace $S^{-1}$ in Hotelling's statistic by $I_p$, yielding $T_B = n\bar{X}'\bar{X}$. However, $\bar{X}'\bar{X}$ is not an unbiased estimator of $\mu'\mu$, since $E(\bar{X}'\bar{X}) = \frac{1}{n}\mathrm{trace}(\Sigma) + \mu'\mu$. They showed that the standardized statistic
\[
M_B = \frac{n\bar{X}'\bar{X} - \mathrm{trace}(S)}{\widehat{\mathrm{sd}}(n\bar{X}'\bar{X} - \mathrm{trace}(S))} = \frac{n\bar{X}'\bar{X} - \mathrm{trace}(S)}{\sqrt{\frac{2(n-1)n}{(n-2)(n+1)}\left( \mathrm{trace}(S^2) - \frac{1}{n}(\mathrm{trace}(S))^2 \right)}}
\]
has an asymptotic $N(0, 1)$ distribution as $p, n \to \infty$.
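The following hedged sketch (numpy/scipy assumed, names are ours) computes Dempster's $T_D$, omitting the sphericity estimate $\hat{r}$ needed for its F calibration, and the standardized statistic $M_B$ as reconstructed above:

```python
import numpy as np
from scipy.stats import norm

def dempster_statistic(X):
    """Dempster's T_D = n xbar'xbar / trace(S); the F_{r,(n-1)r}
    calibration additionally requires an estimate of r (omitted here)."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    return n * (xbar @ xbar) / np.trace(np.cov(X, rowvar=False))

def bai_saranadasa_test(X):
    """M_B of Bai and Saranadasa (1996), asymptotically N(0,1) under H0: mu = 0."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    trS, trS2 = np.trace(S), np.trace(S @ S)
    num = n * (xbar @ xbar) - trS
    var_hat = 2 * (n - 1) * n / ((n - 2) * (n + 1)) * (trS2 - trS ** 2 / n)
    MB = num / np.sqrt(var_hat)
    return MB, norm.sf(MB)            # reject H0 for large positive MB
```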

Srivastava and Du (2008) proposed to replace $S^{-1}$ in Hotelling's statistic by $D_S^{-1}$, where $D_S = \mathrm{diag}(S)$. Then
\[
T_S = n\bar{X}'D_S^{-1}\bar{X} - \frac{(n-1)p}{n-3}
\]
can be used to estimate $\frac{n(n-1)}{n-3}\|D_\Sigma^{-1/2}\mu\|^2$, which is zero under $H_0: \mu = 0$. Srivastava and Du's test statistic is then
\[
M_S = \frac{T_S}{\widehat{\mathrm{sd}}(T_S)} = \frac{n\bar{X}'D_S^{-1}\bar{X} - \frac{(n-1)p}{n-3}}{\sqrt{2\left( \mathrm{trace}(R^2) - \frac{p^2}{n-1} \right)}},
\]

which has an asymptotic $N(0, 1)$ distribution as $p, n \to \infty$. Here $R = D_S^{-1/2} S D_S^{-1/2}$ is the sample correlation matrix.

Chen and Qin (2010) improved the two-sample test for mean vectors of Bai and Saranadasa (1996). In testing $H_0: \mu_1 = \mu_2$, Bai and Saranadasa (1996) proposed to use
\[
T_B = (\bar{X}_1 - \bar{X}_2)'(\bar{X}_1 - \bar{X}_2) - \frac{n_1 + n_2}{n_1 n_2}\mathrm{trace}(S_P).
\]
The subtraction of $\mathrm{trace}(S_P)$ makes sure that $E(T_B) = \|\mu_1 - \mu_2\|^2$. Chen and Qin (2010) proposed to avoid $\mathrm{trace}(S_P)$ altogether, by considering
\[
T_C = \frac{\sum_{i \ne j}^{n_1} X_{1i}'X_{1j}}{n_1(n_1 - 1)} + \frac{\sum_{i \ne j}^{n_2} X_{2i}'X_{2j}}{n_2(n_2 - 1)} - 2\,\frac{\sum_{i=1}^{n_1}\sum_{j=1}^{n_2} X_{1i}'X_{2j}}{n_1 n_2}.
\]
Since $E(T_C) = \|\mu_1 - \mu_2\|^2$, Chen and Qin proposed a test based on $T_C$ (a small numerical sketch follows the list below). There are many other ideas, including:

1. The test statistic is essentially the maximum of $p$ normalized marginal mean differences (Cai et al., 2013);

2. Use a generalized inverse of $S$, denoted by $S^-$ or $S^\dagger$, to replace $S^{-1}$;

3. Estimate $\Sigma$ in a way that is invertible;

4. Reduce the dimension $p$ of the random vector $X$ by $Z = h(X) \in \mathbb{R}^d$, for $d < n$, and then apply the traditional theory of hypothesis testing.
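As promised above, a small sketch of the Chen and Qin statistic $T_C$ (numpy assumed; the function name is ours):

```python
import numpy as np

def chen_qin_statistic(X1, X2):
    """T_C of Chen and Qin (2010): an unbiased estimator of ||mu1 - mu2||^2
    that avoids trace(S_P); rows of X1 and X2 are observations."""
    n1, n2 = X1.shape[0], X2.shape[0]
    G1, G2 = X1 @ X1.T, X2 @ X2.T
    term1 = (G1.sum() - np.trace(G1)) / (n1 * (n1 - 1))   # i != j pairs, sample 1
    term2 = (G2.sum() - np.trace(G2)) / (n2 * (n2 - 1))   # i != j pairs, sample 2
    cross = 2 * (X1 @ X2.T).sum() / (n1 * n2)             # all cross-sample pairs
    return term1 + term2 - cross
```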

The next lecture is on linear dimension reduction: principal component analysis.

References

Bai, Z. and Saranadasa, H. (1996), "Effect of high dimension: by an example of a two sample problem," Statistica Sinica, 6, 311-329.

Cai, T. T., Liu, W., and Xia, Y. (2013), "Two-sample test of high dimensional means under dependency," to appear in Journal of the Royal Statistical Society: Series B.

Chen, S. X. and Qin, Y.-L. (2010), "A two-sample test for high-dimensional data with applications to gene-set testing," The Annals of Statistics, 38, 808-835.

Dempster, A. P. (1960), "A significance test for the separation of two highly multivariate small samples," Biometrics, 16, 41-50.

Härdle, W. and Simar, L. (2012), Applied Multivariate Statistical Analysis, 3rd ed., Springer, Berlin.

Srivastava, M. S. and Du, M. (2008), "A test for the mean vector with fewer observations than the dimension," Journal of Multivariate Analysis, 99, 386-402.

In evaluating the m.l.e. of $\Sigma$ for the multivariate normal distribution, one can use the following famous result.

Lemma 3 (von Neumann). For any $m \times m$ symmetric matrices $A$ and $B$ with eigenvalue vectors $\sigma_A = (\sigma_{A1}, \dots, \sigma_{Am})'$ and $\sigma_B = (\sigma_{B1}, \dots, \sigma_{Bm})'$ in decreasing order,

\[
|\mathrm{trace}(A'B)| \le \sigma_A'\sigma_B,
\]

and the equality holds when A and B have the same eigenvectors.

A more general version of the von Neumann inequality is:

Lemma 4 (von Neumann). For any $m \times n$ matrices $A$ and $B$ with vectors of singular values $\sigma_A$ and $\sigma_B$ in decreasing order,

\[
|\mathrm{trace}(A'B)| \le \sigma_A'\sigma_B,
\]

and the equality holds when $A$ and $B$ are simultaneously diagonalizable.
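A quick numerical check of the general inequality (a sketch using numpy's SVD; not part of the original notes):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))
B = rng.standard_normal((4, 6))
sA = np.linalg.svd(A, compute_uv=False)   # singular values, decreasing order
sB = np.linalg.svd(B, compute_uv=False)
assert abs(np.trace(A.T @ B)) <= sA @ sB + 1e-9   # |trace(A'B)| <= sigma_A' sigma_B
```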

The problem was to minimize the negative log-likelihood profile $\log|\Sigma| + \mathrm{trace}(\Sigma^{-1}S_0)$. The parameter $\Sigma$, assumed positive definite, is eigen-decomposed as $\Sigma = U\Lambda U'$. Likewise, we eigen-decompose the real symmetric matrix $S_0 = VDV'$.

\begin{align*}
\log|\Sigma| + \mathrm{trace}(\Sigma^{-1}S_0) &= \log|\Lambda| + \mathrm{trace}(U\Lambda^{-1}U'VDV') \\
&\ge \log|\Lambda| + \mathrm{trace}(\Lambda^{-1}D) \\
&= \sum_{i=1}^p \left( \log \lambda_i + d_i/\lambda_i \right) \\
&= \sum_{i=1}^p \left\{ (a_i - \log a_i) + \log d_i \right\},
\end{align*}

where $a_i = d_i/\lambda_i$. Each term $a_i - \log a_i$ is minimized when $a_i = 1$, i.e., $\lambda_i = d_i$; together with $U = V$ (the equality condition in Lemma 3), this yields $\hat{\Sigma} = VDV' = S_0$.
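A numerical sanity check of this minimization argument (numpy assumed; names are illustrative): the objective evaluated at $S_0$, which equals $p + \log|S_0|$, should not exceed its value at random positive definite alternatives.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 3
X = rng.standard_normal((n, p))
Xc = X - X.mean(axis=0)
S0 = Xc.T @ Xc / n                        # claimed minimizer Sigma-hat = S_0

def objective(Sigma, S0):
    """log|Sigma| + trace(Sigma^{-1} S_0), the profiled criterion above."""
    return np.linalg.slogdet(Sigma)[1] + np.trace(np.linalg.solve(Sigma, S0))

best = objective(S0, S0)                  # equals p + log|S_0|
for _ in range(200):                      # random positive definite competitors
    Z = rng.standard_normal((p, p))
    assert objective(Z @ Z.T + 0.1 * np.eye(p), S0) >= best - 1e-8
```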
