This article was downloaded by: [University of Pennsylvania] On: 29 April 2013, At: 08:44 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK

Journal of the American Statistical Association Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/uasa20 Two-Sample Covariance Matrix Testing and Support Recovery in High-Dimensional and Sparse Settings Tony a , Weidong b & Yin a a Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA, 19104 b Department of Mathematics, Institute of Natural Sciences and MOE-LSC, Shanghai Jiao Tong University, Shanghai, Accepted author version posted online: 02 Jan 2013.Published online: 15 Mar 2013.

To cite this article: Tony Cai , Weidong Liu & Yin Xia (2013): Two-Sample Covariance Matrix Testing and Support Recovery in High-Dimensional and Sparse Settings, Journal of the American Statistical Association, 108:501, 265-277 To link to this article: http://dx.doi.org/10.1080/01621459.2012.758041

PLEASE SCROLL DOWN FOR ARTICLE

Full terms and conditions of use: http://www.tandfonline.com/page/terms-and-conditions This article may be used for research, teaching, and private study purposes. Any substantial or systematic reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any form to anyone is expressly forbidden. The publisher does not give any warranty express or implied or make any representation that the contents will be complete or accurate or up to date. The accuracy of any instructions, formulae, and drug doses should be independently verified with primary sources. The publisher shall not be liable for any loss, actions, claims, proceedings, demand, or costs or damages whatsoever or howsoever caused arising directly or indirectly in connection with or arising out of the use of this material. Supplementary materials for this article are available online. Please go to www.tandfonline.com/r/JASA Two-Sample Covariance Matrix Testing and Support Recovery in High-Dimensional and Sparse Settings

Tony C AI, Weidong LIU, and Yin XIA

In the high-dimensional setting, this article considers three interrelated problems: (a) testing the equality of two covariance matrices 1 and 2; (b) recovering the support of 1 − 2; and (c) testing the equality of 1 and 2 row by row. We propose a new test for testing the hypothesis H0: 1 = 2 and investigate its theoretical and numerical properties. The limiting null distribution of the test statistic is derived and the power of the test is studied. The test is shown to enjoy certain optimality and to be especially powerful against sparse alternatives. The simulation results show that the test significantly outperforms the existing methods both in terms of size and power. Analysis of a prostate cancer dataset is carried out to demonstrate the application of the testing procedures. When the null hypothesis of equal covariance matrices is rejected, it is often of significant interest to further investigate how they differ from each other. Motivated by applications in genomics, we also consider recovering the support of 1 − 2 and testing the equality of the two covariance matrices row by row. New procedures are introduced and their properties are studied. Applications to gene selection are also discussed. Supplementary materials for this article are available online. KEY WORDS: Extreme value Type I distribution; Gene selection; Hypothesis testing; Sparsity.

1. INTRODUCTION or quite similar in the sense that they only possibly differ in a small number of entries. In such a setting, under the alternative Testing the equality of two covariance matrices and is 1 2 the difference of the two covariance matrices − is sparse. an important problem in multivariate analysis. Many statistical 1 2 The above-mentioned tests, which are all based on the Frobenius procedures including the classical Fisher’s linear discriminant norm, are not powerful against such sparse alternatives. See analysis rely on the fundamental assumption of equal covari- Section 3.3 for further details. ance matrices. This testing problem has been well studied in The first goal of this article is to develop a test that is pow- the conventional low-dimensional setting. See, for example, erful against sparse alternatives and robust with respect to the Sugiura and Nagao (1968), Gupta and Giri (1973), Perlman population distributions. Let X and Y be two p variate random (1980), Gupta and Tang (1984), O’Brien (1992), and Anderson vectors with covariance matrices and , respectively. Let (2003). In particular, the likelihood ratio test (LRT) is commonly 1 2 {X ,...,Xn } be independent and identically distributed (iid) used and enjoys certain optimality under regularity conditions. 1 1 random samples from X and let {Y ,...,Y n } be iid random Driven by a wide range of contemporary scientific applica- 1 2 samples from Y that are independent of {X ,...,Xn }.Wewish tions, analysis of high-dimensional data is of significant current 1 1 to test the hypotheses interest. In the high-dimensional setting, where the dimension can be much larger than the sample size, the conventional testing H0 : 1 = 2 versus H1 : 1 = 2 (1) procedures such as the LRT either perform poorly or are not even based on the two samples. We are particularly interested in the well defined. Several tests for the equality of two large covari- high-dimensional setting where p can be much larger than n = ance matrices have been proposed. For example, Schott (2007) max(n ,n ). In many applications, if the null hypothesis H : introduced a test based on the Frobenius norm of the differ- 1 2 0 = is rejected, it is often of significant interest to further ence of the two covariance matrices. Srivastava and Yanagihara 1 2 investigate in which way the two covariance matrices differ

Downloaded by [University of Pennsylvania] at 08:44 29 April 2013 (2010) constructed a test that relied on a measure of distance by from each other. Motivated by applications to gene selection, in tr(2)/(tr( ))2 − tr(2)/(tr( ))2. Both of these two tests are 1 1 2 2 this article we also consider two related problems, recovering designed for the multivariate normal populations. and the support of − and testing the equality of the two (2012) proposed a test using a linear combination of three one- 1 2 covariance matrices row by row. sample U statistics, which was also motivated by an unbiased We propose a test for the hypotheses in Equation (1) based estimator of the Frobenius norm of − . 1 2 on the maximum of the standardized differences between the In many applications, such as gene selection in genomics, the entries of the two-sample covariance matrices and investigate covariance matrices of the two populations can be either equal its theoretical and numerical properties. The limiting null dis- tribution of the test statistic is derived. It is shown that the Tony Cai is Dorothy Silberberg Professor of Statistics, Department of Statis- distribution of the test statistic converges to a Type I extreme tics, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104 H (E-mail: [email protected]). Weidong Liu is Professor, Department of value distribution under the null 0. This fact implies that Mathematics, Institute of Natural Sciences and MOE-LSC, Shanghai Jiao Tong the proposed test has the prespecified significance level asymp- University, Shanghai, China (E-mail: [email protected]). Yin Xia is totically. The power of the test is investigated. The theoretical Ph.D. student, Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104 (E-mail: [email protected]). analysis shows that the proposed test enjoys certain optimality The research of Tony Cai and Yin Xia was supported in part by NSF FRG grant DMS-0854973. Weidong Liu’s research was supported by NSFC, Grant No. 11201298, the Program for Professor of Special Appointment (Eastern Scholar) © 2013 American Statistical Association at Shanghai Institutions of Higher Learning, Foundation for the Author of Na- Journal of the American Statistical Association tional Excellent Doctoral Dissertation of PR China and the startup fund from March 2013, Vol. 108, No. 501, Theory and Methods Shanghai Jiao Tong University (SJTU). DOI: 10.1080/01621459.2012.758041

265 266 Journal of the American Statistical Association, March 2013

{ ,..., } against a large class of sparse alternatives in terms of the power. and Y 1 Y n2 from a distribution with the covariance − = σ We show that it only requires one√ of the entries of 1 2 matrix 2 ( ij 2)p×p, define the sample covariance matrices having a magnitude more than C log p/n in order for the test n 1 H 1  to correctly reject the√ null hypothesis 0. It is also shown that (ˆσij )p×p := ˆ = (Xk − X¯ )(Xk − X¯ ) , C p/n 1 1 n this lower bound log is rate optimal. 1 k=1 n In addition to the theoretical properties, we also consider the 2 1  σij p×p = ˆ = Y k − Y¯ Y k − Y¯ , numerical performance of the proposed testing procedure using (ˆ 2) : 2 n ( )( ) both simulated and real datasets. The numerical results show 2 k=1 n n that the new test significantly outperforms the existing methods ¯ = 1 1 ¯ = 1 2 where X n k= Xk and Y n k= Y k. The null hy- both in terms of size and power. A prostate cancer dataset is 1 1 2 1 pothesis H0 : 1 = 2 is equivalent to H0 :max1≤i≤j≤p |σij 1 − used to illustrate our testing procedures for gene selection. σij 2|=0. A natural approach to testing this hypothesis is to com- In addition to the global test of equal covariance matrices, we pare the sample covariancesσ ˆij andσ ˆij and to base the test on − 1 2 also consider recovery of the support of 1 2 as well as test- the maximum differences. It is important to note that the sample ing 1 and 2 row by row. Support recovery can also be viewed covariancesσ ˆij 1’s andσ ˆij 2’s are in general heteroscedastic and as simultaneous testing of the equality of individual entries be- can possibly have a wide range of variability. It is thus neces- tween the two covariance matrices. We introduce a procedure for sary to first standardizeσ ˆij 1 − σˆij 2 before making a comparison support recovery based on the thresholding of the standardized among different entries. differences between the entries of the two covariance matrices. To be more specific, define the variances θij 1 = var((Xi − It is shown that under certain conditions, the procedure recovers μ i )(Xj − μ j )) and θij = var(( − μ i )(Yj − μ j )). Given − 1 1 2 2 2 the true support of 1 2 exactly with probability tending to { ,..., } { ,..., } θ θ the two samples X1 Xn1 and Y 1 Y n2 , ij 1 and ij 2 1. The procedure is also shown to be minimax rate optimal. can be, respectively, estimated by The problem of testing the equality of two covariance - n trices row by row is motivated by applications in genomics. 1 1 θˆ = X − X¯ X − X¯ − σ 2, A commonly used approach in microarray analysis is to select ij 1 n [( ki i )( kj j ) ˆij 1] 1 k=1 “interesting” genes by applying multiple testing procedures on n 1 the two-sample t statistics. This approach has been successful 1 X¯ i = Xki, in finding genes with significant changes in the mean expres- n 1 k=1 sion levels between diseased and nondiseased populations. It has been noted recently that these mean-based methods lack and n the ability to discover genes that change their relationships with 1 2 θˆ = Y − Y¯ Y − Y¯ − σ 2, other genes, and new methods that are based on the change ij 2 n [( ki i )( kj j ) ˆij 2] 2 k=1 in the gene’s dependence structure are thus needed to identify n 2 these genes. See, for example, Ho et al. (2008), et al. (2009), Y¯ = 1 Y . and Hu, Qiu, and Glazko (2010). In this article, we propose a i n ki 2 k= procedure that simultaneously tests the p null hypotheses that 1 the corresponding rows of the two covariance matrices are equal Such an estimator of the variance has been used in Cai and to each other. Asymptotic null distribution of the test statistics Liu (2011) in the context of adaptive estimation of a sparse θˆ θˆ σ − σ is derived and properties of the test are studied. It is shown covariance matrix. Given ij 1 and ij 2, the variance of ˆij 1 ˆij 2 θˆ /n + θˆ /n that the procedure controls the family-wise error rate (FWER) can then be estimated by ij 1 1 ij 2 2. at a prespecified level. Applications to gene selection are also Define the standardized statistics

Downloaded by [University of Pennsylvania] at 08:44 29 April 2013 considered. σ − σ 2 = (ˆij 1 ˆij 2) , ≤ i ≤ j ≤ p. The rest of this article is organized as follows. Section 2 in- Mij : 1 (2) θˆij /n + θˆij /n troduces the procedure for testing the equality of two covariance 1 1 2 2 matrices. Theoretical properties of the test are investigated in We consider the following test statistic for testing the hypothesis Section 3. After Section 3.1 in which basic definitions and as- H0 : 1 = 2, sumptions are given, Section 3.2 develops the asymptotic null 2 (ˆσij 1 − σˆij 2) distribution of the test statistic and presents the optimality re- Mn =:maxMij = max . (3) 1≤i≤j≤p 1≤i≤j≤p θˆ /n + θˆ /n sults for testing against sparse alternatives. Comparisons with ij 1 1 ij 2 2 other tests are given in Section 3.3. Section 4 considers support The asymptotic behavior of the test statistic Mn will be studied recovery of 1 − 2 and testing the two covariance matrices in detail in Section 3. Intuitively, Mij are approximately square row by row. Section 5 investigates the numerical performance of of standard normal variables under the null H0 and they are the proposed test by simulations and by an analysis of a prostate only “weakly dependent” under suitable conditions. The test cancer dataset. Section 6 discusses our results and other related statistic Mn is the maximum of p(p + 1)/2 such variables and 2 work. The proofs of the main results are given in Section 7. so the value of Mn is close to 2 log p under H0, based on the extreme values of normal random variables. More precisely, we show in Section 3 that, under certain regularity conditions, 2. THE TESTING PROCEDURE Mn − 4logp + log log p converges to a Type I extreme value { ,..., } H Given two iid random samples X1 Xn1 from a p- distribution under the null hypothesis 0. Based on this result, variate distribution with the covariance matrix 1 = (σij 1)p×p for a given significance level 0 <α<1, we define the test α Cai, Liu, and Xia: Two-Sample Covariance Matrix Testing and Support Recovery in High-Dimensional and Sparse Settings 267

by So (r) is the set of indices i such that either the ith variable of X is highly correlated (>r) with some other variable of X or  = I M ≥ q + p − p , α ( n α 4log log log ) (4) the ith variable of Y is highly correlated (>r) with some other variable of Y. where qα is the 1 − α quantile of the Type I extreme M value distribution with the cumulative distribution function To obtain the asymptotic distribution of n, we assume that s r exp(− √1 exp(− x )), that is, max1≤j≤p j is not too large, and the set ( )isalsonottoo 8π 2 large for some r<1. −1 qα =−log(8π) − 2 log log(1 − α) . (5) (C1). Suppose that there exists a subset ϒ ⊂{1, 2,...,p} with Card(ϒ) = o(p) and a constant α0 > 0 such that for The hypothesis H0 : 1 = 2 is rejected whenever α = 1. γ all γ>0, max1≤j≤p,j∈ /ϒ sj (α0) = o(p ). Moreover, there ex- The test α is particularly well suited for testing against ist some constant r<1 and a sequence of numbers p,r such sparse alternatives. It is consistent if one of the entries of 1 − √ that Card((r)) ≤ p,r = o(p). 2 has a magnitude more than C log p/n for some constant C>0 and no other special structure of 1 − 2 is required. It It is easy to check that if the eigenvalues of the correlation will be shown that the test α is an asymptotically α-level test matrices R1 and R2 are bounded from above, then Condition and enjoys certain optimality against sparse alternatives. (C1) is easily satisfied. In fact, if λmax(R1) ≤ C0 and λmax(R2) ≤ 2+2α0 The standardized statistics Mij are also useful for identifying C0 for some constant C0 > 0, then maxj sj ≤ C(log p) . γ the support of 1 − 2. That is, they can be used to estimate The condition maxj,j∈ /ϒ sj = o(p ) is quite weak. It allows the + α the positions at which the two covariance matrices differ from largest eigenvalues to be of order p/(log p)2 2 0 .Itisalsoeasy each other. This is of interest in many applications including to see that the condition Card((r)) = o(p)forsomer<1is gene selection. We show in Section 4.1 that under certain con- trivially satisfied if all the correlations are bounded away from ditions, the support of 1 − 2 can be correctly recovered by ±1, that is, thresholding the Mij . max |ρij 1|≤r<1 and max |ρij 2|≤r<1. (6) 1≤i

We now turn to an analysis of the properties of the test α We do not require the distributions of X and Y to be Gaussian. including the asymptotic size and power. The asymptotic size Instead we shall impose more general moment conditions. of the test is obtained by deriving the limiting distribution of the / (C2). Sub-Gaussian-type tails: Suppose that log p = o(n1 5) test statistic Mn under the null. We are particularly interested and n1 n2. There exist some constants η>0 and K>0 such in the power of the test α under the sparse alternatives where that 1 − 2 is sparse in the sense that 1 − 2 only contains a 2 small number of nonzero entries. E exp(η(Xi − μi1) /σii1) ≤ K, η Y − μ 2/σ ≤ K i 3.1 Definitions and Assumptions E exp( ( i i2) ii2) for all . τ > We begin by introducing some basic definitions and technical Furthermore, we assume that for some constants 1 0 and p τ > | | = a2 2 0, assumptions. Throughout the article, denote a 2 j=1 j = a ,... ,a T ∈ θij θij for the usual Euclidean norm of a vector a ( 1 p) min 1 ≥ τ and min 2 ≥ τ . (7) p p×q ≤i≤j≤p 1 ≤i≤j≤p 2 IR . For a matrix A = (aij ) ∈ IR , define the spectral 1 σii1σjj1 1 σii2σjj2 A = |Ax| AF = norm 2 sup|x|2≤1 2 and the Frobenius norm

Downloaded by [University of Pennsylvania] at 08:44 29 April 2013 ∗ 2 (C2 ). Polynomial-type tails: Suppose that for some γ0,c1 > i,j aij . For two sequences of real numbers {an} and {bn}, γ0 0, p ≤ c1n , n1 n2 and for some >0 write an = O(bn) if there exists a constant C such that |an|≤ γ + + C|bn| holds for all sufficiently large n, write an = o(bn)if X − μ /σ 1/24 0 4 ≤ K, E ( i i1) ii1 limn→∞ an/bn = 0, and write an bn if there exist constants γ + + Y − μ /σ 1/24 0 4 ≤ K i C>c>0 such that c|bn|≤|an|≤C|bn| for all sufficiently E ( i i2) ii2 for all . large n. The two-sample sizes are assumed to be comparable, Furthermore, we assume (7) holds. that is, n n .Letn = max(n ,n ). 1 2 1 2 In addition to the moment conditions, we also need the fol- Let R =:(ρij ) and R =:(ρij ) be the correlation matrices 1 1 2 2 lowing technical condition for some of the results. It holds if X of X and Y, respectively. For a fixed constant α > 0, define 0 and Y have elliptically contoured distributions, see, for example,

−1−α0 sj = sj (α0):= card{i : |ρij 1|≥(log p) or |ρij 2| Anderson (2003, pp. 47–54). − −α ≥ (log p) 1 0 }. κ ,κ ≥ 1 (C3). Suppose that there exist 1 2 3 such that for any i, j, k, l ∈{1, 2,...,p}, So sj is the cardinality of the set of indices i such that either − −α > p 1 0 the ith variable of X is highly correlated ( (log ) ) with E(Xi − μi1)(Xj − μj1)(Xk − μk1)(Xl − μl1) the jth variable of X or the ith variable of Y is highly correlated = κ1(σij 1σkl1 + σik1σjl1 + σil1σjk1), with the jth variable of Y.For0

(r) ={1 ≤ i ≤ p : |ρij 1| >ror |ρij 2| >rfor some j = i}. = κ2(σij 2σkl2 + σik2σjl2 + σil2σjk2). (8) 268 Journal of the American Statistical Association, March 2013

Remark 1. Conditions (C1) and (C3) are only needed for We now turn to an analysis of the power of the test α given the limiting null distribution of the test statistic. (C3) holds in (4). We shall define the following class of matrices: for the elliptically contoured distributions (Anderson 2003) 1 4 2 2 1 |σij − σij | with κ1 ≡ E(Xi − μi1) /[E(Xi − μi1) ] and κ2 ≡ E(Yi − 1 2 3 3 U(c)= (1, 2): max ≥ c log p . 4 2 2 ≤i≤j≤p μi2) /[E(Yi − μi2) ] . Results on the power of the test α do 1 θij 1/n1 + θij 2/n2 ∗ not rely on either (C1) or (C3). Conditions (C2) and (C2 )are (12) moment conditions on X and Y. We only require either (C2) or (C2∗) to hold. This is much weaker than the Gaussian as- The next theorem shows that the null parameter set in which = U sumption required in the literature such as Schott (2007) and 1 2 is asymptotically distinguishable from (4) by the test  H  Srivastava and Yanagihara (2010). Condition (7) is satisfied α. That is, 0 is rejected by α with overwhelming probability τ = τ = ∼ N μ , ∼ N μ , if (1, 2) ∈ U(4). with 1 2 1, if X ( 1 1) and Y ( 2 2). In the non-Gaussian case, if (C3) and (6) hold, then (7) holds with Theorem 2. Suppose that (C2) or (C2∗) holds. Then as τ = κ − r2/ τ = κ − r2/ 1 1 3 and 2 2 3. n, p →∞,

3.2 Limiting Null Distribution and Optimality inf P(α = 1) → 1. (13) (1,2)∈U(4) We are now ready to present the asymptotic null distribu- tion of Mn. The following theorem shows that Mn − 4logp + It can be seen from Theorem 2 that it only requires√ one of the log log p converges weakly under H0 to an extreme value entries of 1 − 2 having a magnitude more than C log p/n distribution of Type I with the distribution function F (t) = in order for the test α to correctly reject H0. This lower bound exp(−1 √1 e−t/2). 8π is rate optimal. Denote by P the collection of distributions sat- ∗ T α P Theorem 1. Suppose that (C1), (C2) (or (C2∗)), and (C3) isfying (C2) or (C2 ). Let α be the set of -level tests over , that is, P(Tα = 1) ≤ α under H0 over all distributions in P for hold. Then under H0, for any t ∈ IR, any Tα ∈ Tα. P(Mn − 4logp + log log p ≤ t) ∗ α, β > 1 t Theorem 3. Suppose that (C2) or (C2 ) holds. Let 0 → exp − √ exp − , (9) and α + β<1. Then there exists a constant c > 0 such that π 2 0 8 for all large n and p, as n, p →∞. Furthermore, under H0, the convergence in (9)is ∗ inf sup P(Tα = 1) ≤ 1 − β. (14) uniform for all X and Y satisfying (C1), (C2) (or (C2 )), and , ∈U c ( 1 2) ( 0) Tα ∈Tα (C3).

The limiting behavior of Mn is similar to that of the largest The difference between 1 and 2 is measured by off-diagonal entry Ln of the sample correlation matrix in Jiang max1≤i≤j≤p |σij 1 − σij 2| in Theorem 3. Another measurement (2004), wherein the article derived the asymptotic distribution is based on the Frobenius norm 1 − 2F . Suppose that − c p of Ln under the assumption that the components X1,...,Xp are 1 2 is a sparse matrix with 0( ) nonzero entries. That is, independent. Further extensions and improvements are given in p p (2007), Liu, , and (2008), and Cai and Jiang c0(p) = I{σij 1 − σij 2 = 0}. (2011, 2012). A key assumption in these articles is the in- i=1 j=1 dependence between the components. The techniques in their We introduce the following class of matrices for 1 − 2: proofs cannot be used to obtain (9), since dependent components are allowed in Theorem 1. The proof of (9) requires different Downloaded by [University of Pennsylvania] at 08:44 29 April 2013 V c = , |σ |≤K, |σ |≤K, techniques. ( ) ( 1 2):max ii1 max ii2 i i The limiting null distribution given in (9) shows that the test log p  α  − 2 ≥ cc p . α defined in (4) is an asymptotically level test. That is, 1 2 F 0( ) n P (Type I error) = PH (α = 1) → α. (10) 0 V c |σ − σ |≥ Note√ that on ( ), we have max1≤i≤j≤p ij 1 ij 2 The limiting null distribution in Theorem 1 requires Condi- c log p/n. Thus, for sufficiently large constant c, tions (C1) and (C3). However, without (C1) and (C3), the size inf P(α = 1) → 1 of the test can still be effectively controlled. (1,2)∈V(c) Proposition 1. Under (C2) (or (C2)∗), for 0 <α<1, as n, p →∞. The following theorem shows that the lower 2 bound for 1 − 2F in V(c) is rate optimal. That is, no α- P (Type I error) = PH (α = 1) ≤−log(1 − α) + o(1). 0 level test can reject H with overwhelming probability uniformly (11) 0 over V(c0)forsomec0 > 0. Note that for commonly used significant level α = 0.05, Theorem 4. Suppose that (C2) or (C2∗) holds. Assume that − log(1 − α) = 0.05129, which is close to 0.05, and for α = c (p) ≤ pr for some 0 0 and α + β<1. 0.01, − log(1 − α) = 0.01005, which is even closer to the nom- 0 There exists a constant c > 0 such that for all large n and p, inal level. It is easy to see that − log(1 − α) ≈ α for small α. So, 0  the test α effectively controls the probability of Type I error inf sup P(Tα = 1) ≤ 1 − β. ( , )∈V(c ) without (C1) and (C3). 1 2 0 Tα ∈Tα Cai, Liu, and Xia: Two-Sample Covariance Matrix Testing and Support Recovery in High-Dimensional and Sparse Settings 269

Note that for every c>0, there exists some constant K(c) > also illustrated in some of the simulation results given in Sec- 0 such that for any 0 0 and γ>0, which s α (D1). For any i, j, k, l ∈{1, 2}, is stronger than the condition on maxj/∈ϒ j ( 0) in (C1). Addi- 2 2 2+2α tionally, Condition (C1) allows for tr(i ) (p /(log p) 0 ). tr(kl) →∞and

tr{(i j )(kl)}=o{tr(i j )tr(kl)}. 4. SUPPORT RECOVERY OF 1 – 2 AND 1 i = γ ∈ , ∞ 1 i APPLICATION TO GENE SELECTION (D2). limp→∞ p tr( 1) i1 (0 ) and limp→∞ p tr( 2) = γi2 ∈ (0, ∞). We have focused on testing the equality of two covariance matrices 1 and 2 in Sections 2 and 3. As mentioned in the (D3). p/n1 → b1 and p/n2 → b2, where b1,b2 ∈ (0, ∞). introduction, if the null hypothesis H0 : 1 = 2 is rejected, Let βLC(α), βSY(α), and βSc(α) be, respectively, the powers further exploring in which ways they differ is also of significant of the tests given in Li and Chen (2012), Srivastava and Yanagi- interest in practice. Motivated by applications in gene selection, hara (2010), and Schott (2007) with the significance level being we consider in this section two related problems, one is recover- controlled at level α. We shall make the comparisons under ing the support of 1 − 2 and another is identifying the rows the normality condition since two of these tests require this on which the two covariance matrices differ from each other. Downloaded by [University of Pennsylvania] at 08:44 29 April 2013 assumption. 4.1 Support Recovery of 1 – 2 ∼ N μ , ∼ Proposition 2. Suppose that X ( 1 1) and Y N μ , , ∈ S s ,c The goal of support recovery is to find the positions at which ( 2 2). Assume that ( 1 2) ( p n,p). the two covariance matrices differ from each other. The prob- (i) For the proposed test α defined in (4), lem can also be viewed as simultaneous testing of equality of individual entries between the two covariance matrices. Denote lim P(α = 1) = 1. n,p→∞ the support of 1 − 2 by (ii) Suppose (D1) holds, then  = (1, 2) ={(i, j):σij 1 = σij 2}. (16) lim βLC(α) = α. n,p→∞ In certain applications, the variances along the diagonal play (iii) Suppose (D2) holds for i = 1, 2, 3, and 4, then a more important role than the covariances. For example, in a differential variability analysis of gene expression, Ho et al. lim β (α) = α. n,p→∞ SY (2008) proposed to select genes based on the differences in the variances. In this section, we shall treat the variances along Proposition 2 shows that under the class of sparse alternatives the diagonal differently from the off-diagonal covariances for in S(sp,cn,p), the tests given in Li and Chen (2012) and Srivas- support recovery. Since there are p diagonal elements, Mii along tava and Yanagihara (2010) would suffer from trivial power the diagonal are thresholded at 2 log p based on the extreme while the proposed test α enjoys the full power. This fact is values of normal variables. The off-diagonal entries Mij with 270 Journal of the American Statistical Association, March 2013

i = j are thresholded at a different level. More specifically, set level α for some 0 <α<1, a different threshold level is needed. For this purpose, we shall set the threshold level at 4 log p − ˆ (τ)={(i, i):Mii ≥ 2logp}∪{(i, j):Mij ≥ τ log p, i = j}, log log p + qα where qα is given in (5). Define (17) M τ  where ij are defined in (2) and is the threshold constant for ˆ ={(i, i):Mii ≥ 2logp}∪{(i, j) the off-diagonal entries. : Mij ≥ 4logp − log log p + qα,i = j}. The following theorem shows that with τ = 4 the estimator ˆ (4) recovers the support  exactly with probability tending We have the following result.

to 1 when the magnitudes of nonzero entries are above certain ∗ thresholds. Define Proposition 3. Suppose that (C1), (C2) (or (C2 )), and (C3) hold. Under − ∈ S ∩ W (4) with s (p) = o(p), we have 1 2 0 0 0 |σij − σij | n, p →∞ W (c) = ( , ): min 1 2 ≥ c log p . as , 0 1 2 i,j ∈ ( ) θij 1/n1 + θij 2/n2 P(ˆ  = ) → α. We have the following result. Theorem 5. Suppose that (C2) or (C2∗) holds. Then as Proposition 3 assumes Conditions (C1) and (C3). As in Propo-  n, p →∞, sition 1, it can be shown that P(ˆ = ) ≤−log(1 − α) + o(1) under Condition (C2) (or (C2∗)) only, without assuming (C1) or inf P(ˆ (4) = ) → 1. (1,2)∈W0(4) (C3).

The choice of the thresholding constant τ = 4 is optimal for 4.2 Testing Rows of Two Covariance Matrices  s p the exact recovery of the support . Consider the class of 0( )- As discussed in the introduction, the standard methods for sparse matrices, ⎧ ⎫ gene selection are based on the comparisons of the means and ⎨ p ⎬ thus lack the ability to select genes that change their relation- ships with other genes. It is of significant practical interest to S0 = A = (aij )p×p :max I{aij = 0}≤s0(p) . ⎩ i ⎭ j=1 develop methods for gene selection, which capture the changes in the gene’s dependence structure. We show in the following theorem that, for any threshold con- Several methods have been proposed in the literature. It was τ< stant 4, the probability of recovering the support exactly noted in Ho et al. (2008) that the changes of variances are bi- − ∈ S tends to 0 for all 1 2 0. This is mainly because the ologically interesting and are associated with changes in coex- τ p threshold level log is too small to ensure that the zero en- pression patterns in different biological states. Ho et al. (2008) tries of 1 − 2 are estimated by zero. We assume that proposed to test H0i : σii1 = σii2 and select the ith gene if H0i (C1∗). There exists a subset  ⊂{1, 2,...,p} with is rejected. Hu et al. (2009) and Hu, Qiu, and Glazko (2010) γ Card() ≥ p for all 0 <γ <1 such that |ρij 1|≤ introduced methods that are based on simultaneous testing of −1−α0 −1−α0 C(log p) and |ρij 2|≤C(log p) for all i = j ∈  and the equality of the joint distributions of each row between two- some α0 > 0. sample correlation matrices/covariance distance matrices. It is easy to see that (C1∗) is much weaker than (C1) because The covariance provides a natural measure of the associa- it allows p − pγ variables to be arbitrarily correlated. tion between two genes and it can also reflect the changes of ∗ ∗ variances. Motivated by these applications, in this section we Theorem 6. Suppose that (C1 ), (C2) (or (C2 )), and (C3) consider testing the equality of two covariance matrices row by 1−τ1 hold. Let 0 <τ<4. If s0(p) = O(p ) with some τ/4 < row. Let σi·1 and σi·2 be the ith row of 1 and 2, respectively. Downloaded by [University of Pennsylvania] at 08:44 29 April 2013 τ < n, p →∞ 1 1, then as , We consider testing simultaneously the hypotheses sup P(ˆ (τ) = ) → 0. H i : σi· = σi· , 1 ≤ i ≤ p. 1−2∈S0 0 1 2

In addition, the minimum separation of the magnitudes of the We shall use the FWER to measure the Type I errors for the p √ ˆ nonzero entries must be at least of order log p/n to enable the tests. The support estimate (4) defined in (17) can be used to H H ˆ s p = test the hypotheses 0i by rejecting 0i if the ith row of (4) exact recovery of the support. Consider, for example, 0( ) 1 ∗ in Theorem 4. Then Theorem 4 indicates that no test can distin- is nonzero. Suppose that (C1) and (C2) (or (C2 )) hold. Then it can be shown easily that for this test, the FWER → 0. guish H0 and H1 uniformly on the space V(c0) with probability tending to one. It is easy to see that V(c ) ⊆ W (c )forsome A different test is needed to simultaneously test the hypothe- 0 0 1 H ≤ i ≤ p α c > 0. Hence, there is no estimate that can recover the sup- ses 0i ,1 , at a prespecified FWER level for some 1 0 <α<1. Define port  exactly uniformly on the space W0(c1) with probability tending to one. Mi· = max Mij . We have so far focused on the exact recovery of the support of 1≤j≤p,j=i − with probability tending to 1. It is sometimes desirable 1 2 The null distribution of Mi· can be derived similarly under the to recover the support under a less stringent criterion such as the same conditions as in Theorem 1. FWER, defined by FWER = P(V ≥ 1), where V is the number of false discoveries. The goal of using the threshold level 4 log p Theorem 7. Suppose the null hypothesis H0i holds. Then is to ensure FWER → 0. To control the FWER at a prespecified under the conditions in Theorem 1, for any given x ∈ IR , Cai, Liu, and Xia: Two-Sample Covariance Matrix Testing and Support Recovery in High-Dimensional and Sparse Settings 271

/ as n, p →∞, (−1)i+j 0.4|i−j|1 10 . This model was used in Srivastava and   1 x Yanagihara (2010). P(Mi· −2logp + log log p ≤ x) → exp −√ exp − . π 2 (18) To evaluate the power of the tests, let U = (ukl)beama- trix with eight random nonzero entries. The locations of four Based on the limiting null distribution of Mi· given in (18), nonzero entries are selected randomly from the upper trian- we introduce the following test for testing a single hypothesis gle of U, each with a magnitude generated from Unif(0, 4) ∗ H0i , max1≤j≤p σjj . The other four nonzero entries in the lower trian- i,α = I(Mi· ≥ αp or Mii ≥ 2logp) (19) gle are determined by symmetry. We use the following four pairs (i), (i) i = , , of covariance matrices ( 1 2 ), 1 2 3, and 4, to compare α = p − p + q q i i with p 4log log log α, where α is given in (5). the power of the tests, where ( ) = (i) + δ I and ( ) = (i) + H  = 1 2 0i is rejected and the ith gene is selected if i,α 1. It can U + δ I, with δ =|min{λ ((i) + U),λ ((i))}| + 0.05.  ≤ i ≤ p min min be shown that the tests i,α,1 together controls the The sample sizes are taken to be n = n = n with n = 60 α 1 2 overall FWER at level asymptotically. and 100, while the dimension p varies over the values 50, 100,

Theorem 8. Let F ={i : σi·1 = σi·2, 1 ≤ i ≤ p} and suppose 200, 400, and 800. For each model, data are generated from Card(F ) = o(p). Then under the conditions in Theorem 1, multivariate normal distributions with mean zero and covariance   matrices 1 and 2. The nominal significant level for all the FWER = P max i,α = 1 → α. testsissetatα = 0.05. The actual sizes and powers in percents i∈F c for the four models, reported in Table 1, are estimated from 5000 replications. Theorem 8 requires Conditions (C1) and (C3). However, sim- It can be seen from Table 1 that the estimated sizes of our ilar to Proposition 1, it can be shown that the FWER can still test α are close to the nominal level 0.05 in all the cases. This be effectively controlled without (C1) and (C3) in the sense that ∗ reflects the fact that the null distribution of the test statistic Mn FWER ≤−log(1 − α) + o(1) under Condition (C2) (or (C2 )) is well approximated by its asymptotic distribution. For Models only. 1–3, the estimated sizes of the tests in Schott (2007) and Li and Chen (2012) are also close to 0.05. But both tests suffer from 5. NUMERICAL RESULTS the size distortion for Model 4, the actual sizes are around 0.10 In this section, we turn to the numerical performance of the for both tests. The LRT has serious size distortion (all is equal proposed methods. The goal is to first investigate the numerical to 1). Srivastava and Yanagihara (2010)’s test has actual sizes performance of the test α and the support recovery procedure close to the nominal significance level in all the cases. through simulation studies and then illustrate our methods by The power results in Table 1 show that the proposed test applying them to the analysis of a prostate cancer dataset. The has much higher power than the other tests in all settings. The test α is compared with four other tests, the LRT, the tests number of nonzero off-diagonal entries of 1 − 2 does not given in Schott (2007) and Li and Chen (2012), which are both change when p grows, so the estimated powers of all tests tend 2 based on an unbiased estimator of 1 − 2F , and the test to decrease when the dimension p increases. It can be seen in proposed in Srivastava and Yanagihara(2010), which is based on Table 1 that the powers of Schott (2007), Li and Chen (2012), 2 / 2 − 2 / 2 a measure of distance by tr( 1) (tr( 1)) tr( 2) (tr( 2)) . and Srivastava and Yanagihara (2010)’s tests decrease extremely The LRT is not well defined when p>n, so the comparison fast as p grows. However, the power of the proposed test α with the LRT is made only for the case p ≤ n. remains high even when p = 800, especially in the case of n =

Downloaded by [University of Pennsylvania] at 08:44 29 April 2013 We first introduce the matrix models used in the simulations. 100. Overall, for the sparse models, the new test significantly Let D = (dij ) be a diagonal matrix with diagonal elements outperforms all the other three tests. dii = Unif(0.5, 2.5) for i = 1,...,p. Denote by λmin(A)the We also carried out simulations for non-Gaussian distribu- minimum eigenvalue of a symmetric matrix A. The following tions including Gamma distribution, t-distribution, and normal (i) four models under the null, 1 = 2 = , i = 1, 2, 3 and 4, distribution contaminated with exponential distribution. Simi- are used to study the size of the tests. lar phenomena as those in the Gaussian case are observed. For ∗ ∗ ∗ reasons of space, these simulation results are given in the sup- • Model 1: ∗(1) = (σ (1)), where σ (1) = 1, σ (1) = 0.5for ij ii ij plementary material (Cai, Liu, and Xia 2012). 5(k − 1) + 1 ≤ i = j ≤ 5k, where k = 1,...,[p/5] and σ ∗(1) = (1) = 1/2∗(1) 1/2 ij 0 otherwise. D D . 5.1 Support Recovery ∗ ∗ • Model 2: ∗(2) = (σ (2)), where ω (2) = 0.5|i−j| for 1 ≤ ij ij We now consider the simulation results on recovering the i, j ≤ p (2) = 1/2∗(2) 1/2 . D D . support of − in the first three models with D = I and • ∗(3) = σ (3) σ ∗(3) = σ ∗(3) = . ∗ 1 2 Model 3: ( ij ), where ii 1, ij 0 5 the fourth model with O = I under the normal distribution. ∗(3) ∗(3) (3) (i) Bernoulli(1, 0.05) for i

Table 1. N(0, 1) random variables. Models 1–4 empirical sizes and powers in percents. α = 0.05. n = 60 and 100. 5000 replications

Model 1 Model 2 Model 3 Model 4

np50 100 200 400 800 50 100 200 400 800 50 100 200 400 800 50 100 200 400 800

Empirical size 60 α 5.04.65.06.16.15.55.45.05.36.05.55.35.65.95.94.54.54.64.65.2 Likelihood 100.0 100.0 NA NA NA 100.0 100.0 NA NA NA 100.0 100.0 NA NA NA 100.0 100.0NANANA Schott 7.15.75.35.65.66.66.15.44.84.66.45.55.35.45.210.210.810.39.010.3 Li-Chen 7.15.85.35.55.66.66.05.24.84.96.35.25.35.45.110.010.910.29.110.4 Srivastava 5.65.44.74.83.85.24.75.14.93.75.25.04.74.14.05.15.95.24.45.3

100 α 4.84.14.44.94.84.54.24.35.04.94.65.14.54.54.94.23.93.74.04.2 Likelihood 100.0 100.0 100.0 NA NA 100.0 100.0 100.0 NA NA 100.0 100.0 100.0 NA NA 100.0 100.0 100.0NA NA Schott 6.45.35.05.24.85.55.04.95.05.25.05.35.25.24.89.29.29.29.99.7 Li-Chen 6.55.35.15.14.75.75.05.15.25.24.95.15.05.14.99.19.49.29.99.6 Srivastava 5.25.55.05.04.14.95.35.05.44.54.74.95.04.74.75.65.74.85.04.7 Empirical power 60 α 87.974.147.640.438.886.756.260.946.325.890.377.649.742.540.181.872.148.240.532.7 Schott 31.814.28.36.76.234.59.28.96.65.235.015.58.36.76.428.413.58.27.17.2 Li-Chen 31.213.98.16.45.933.78.98.76.75.434.015.48.36.76.427.813.17.96.97.0 Srivastava 28.711.35.94.63.936.77.86.75.13.737.614.76.24.64.115.56.25.44.34.7

100 α 99.999.494.191.693.099.994.698.195.979.3 100.099.795.092.993.799.699.293.992.589.4 Schott 57.424.710.07.97.063.014.112.99.25.863.627.310.37.97.051.321.99.47.96.5 Li-Chen 57.124.39.87.67.162.514.112.88.95.463.227.110.28.06.950.421.49.47.96.2 Srivastava 58.128.38.95.95.473.812.010.96.74.675.032.09.16.25.224.98.15.54.75.0

magnitude between 0.74 and 0.86 for i = 1, 2, 3, and 4 in our able at http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi. generated models. The dataset consists of two classes of gene expression data that The accuracy of the support recovery is evaluated by the came from 52 prostate tumor patients and 50 prostate normal similarity measure s(,ˆ ) defined by patients. This dataset has been analyzed in several articles on |ˆ ∩ | classification in which the two covariance matrices are assumed s(,ˆ ) = , to be equal; see, for example, , Brock, and Parrish (2009). |ˆ | The equality of the two covariance matrices is an important as- i i sumption for the validity of these classification methods. It is where ˆ = ˆ (4) and  is the support of ( ) − ( ) and 1 2 thus interesting to test whether such an assumption is valid. |·| denotes the cardinality. Note that 0 ≤ s(,ˆ ) ≤ 1 with There are a total of 12,600 genes. To control the computational s(,ˆ ) = 0 indicating disjointness and s(,ˆ ) = 1 indicat- costs, only the 5000 genes with the largest absolute values of the t ing exact recovery. We summarize the average (standard devia- statistics are used. Let and be, respectively, the covariance tion) values of s(,ˆ ) in percents for all models with n = 100 1 2 matrices of these 5000 genes in tumor and normal samples. We in Table 2 based on 100 replications. The values are close to one, apply the test α defined in (4) to test the hypotheses H : = and hence the supports are well recovered by our procedure. 0 1 versus H : = . Based on the asymptotic distribution To better illustrate the elementwise recovery performance, 2 1 1 2 of the test statistic Mn,thep-value is calculated to be 0.0058 and Downloaded by [University of Pennsylvania] at 08:44 29 April 2013 heat maps of the nonzeros identified out of 100 replications for the null hypothesis H : = is thus rejected at commonly p = 50 and 100 are shown in Figures 1 and 2. These heat maps 0 1 2 used significant levels such as α = 0.05 or α = 0.01. Based on suggest that, in all models, the estimated support by ˆ (4) is this test result, it is therefore not reasonable to assume  =  close to the true support. 1 2 in applying a classifier to this dataset. 5.2 Real Data Analysis We then apply three methods to select genes with changes in variances/covariances between the two classes. The first is the We now illustrate our methods by applying them to the analy- differential variability analysis (Ho et al. 2008), which chooses sis of a prostate cancer dataset (Singh et al. 2002), which is avail- the genes with different variances between two classes. In our procedure, the ith gene is selected if Mii ≥ 2logp. As a result, Table 2. Average (standard deviation) of s(,ˆ ) in percents 21 genes are selected. The second and third methods are based on the differential covariance analysis, which is similar to the Model 1 Model 2 Model 3 Model 4 differential covariance distance vector analysis in Hu, Qiu, and Glazko (2010), but replacing the covariance distance matrix p = 50 97.4(2.5) 89.8(4.3) 91.3(4.1) 93.0(3.8) in their article by the covariance matrix. The second method p = ...... 100 95 3(31) 86 8(51) 86 2(53) 89 7(44) selects the ith gene, if the ith row of ˆ (4) is nonzero. This p = ...... 200 78 8(56) 80 1(59) 82 4(58) 84 0(64) leads to the selection of 43 genes. The third method, which p = 400 75.6(6.5) 73.2(7.7) 79.4(6.1) 77.8(7.0) tests the covariance matrices row by row and is defined in (19), p = 800 64.0(7.5) 61.8(8.8) 72.9(6.1) 68.1(7.6) controls the FWER at α = 0.1, and is able to find 52 genes. As Cai, Liu, and Xia: Two-Sample Covariance Matrix Testing and Support Recovery in High-Dimensional and Sparse Settings 273

Figure 1. Heat maps of the frequency of the 0’s identified for each entry of 1 − 2 (n = 100, p = 50 for the top row and p = 100 for the second row) out of 100 replications. White indicates 1000’s identified out of 100 runs; black, 0/100.

expected, the gene selection based on the covariances could be There are several possible extensions of our method. For more powerful than the one that is based only on the variances. example, an interesting direction is to generalize the procedure The tests provide valuable information as these identified genes to testing the hypothesis of homogeneity of several covariance can be selected for a follow-up study. matrices, H0 : 1 =···=K , where K ≥ 2. A test statistic Finally, we apply the support recovery procedure ˆ (4) to similar to Mn can be constructed to test this hypothesis and 1 − 2. For a visual comparison between 1 and 2, Figure 2 analogous theoretical results can be developed. We shall report plots the heat map of 1 − 2 of the 200 largest absolute values the details of the results elsewhere in the future as a significant of two-sample t statistics. It can be seen from Figure 2 that the amount of additional work is still needed. estimated support of 1 − 2 is quite sparse. Much recent attention has focused on the estimation of large covariance and precision matrices. The current work here is 6. DISCUSSION related to the estimation of covariance matrices. An adap- tive thresholding estimator of sparse covariance matrices was We introduced in this article a test for the equality of two introduced in Cai and Liu (2011). The procedure is based covariance matrices, which is proved to have the prespecified 1/2 on the standardized statisticsσ ˆij /θˆij , which is closely re- significance level asymptotically and to enjoy certain optimality lated to Mij . In this article, the asymptotic distribution of in terms of its power. In particular, it was shown theoretically and Mn = max ≤i≤j≤p Mij is obtained. It gives a justification on numerically that the test is especially powerful against sparse 1 the family-wise error of simultaneous tests H ij : σij = σij alternatives. Support recovery and testing two covariance ma- 0 1 2 for 1 ≤ i ≤ j ≤ p. For example, by thresholding Mij at level trices row by row with applications to gene selection are also 4logp − log log p + qα, the family-wise error is controlled considered. asymptotically at level α, that is,   Downloaded by [University of Pennsylvania] at 08:44 29 April 2013 FWER = P max Mij ≥ 4logp − log log p + qα → α, (i,j)∈G

c 2 where G ={(i, j):σij 1 = σij 2} and Card(G ) = o(p ). These tests are useful in the studies of differential coexpression in genetics; see de la Fuente (2010). The test introduced in the article is based on the asymptotic results. When the sample size is small, say n ≤ 30, the critical value derived from the asymptotic distribution is not sufficiently accurate and modification is thus needed. The following “normal ∗ cut-off” method can be used instead. Let (Xk, 1 ≤ k ≤ n1) and ∗ ∗ (Y k, 1 ≤ k ≤ n2) be generated from iid N(0, I p×p). Let Mn be ∗ the maximum test statistic constructed from (Xk, 1 ≤ k ≤ n1) ∗ and (Y k, 1 ≤ k ≤ n2). It follows from Theorem 1 that under (C1)–(C3), we have ∗ sup |P(Mn ≥ y) − P(Mn ≥ y)|→0 y∈R

under the null, where Mn is the maximum test statistic computed Figure 2. Heat map of the selected genes by exact recovery. from the original data. Let yn(α) be the critical value such that 274 Journal of the American Statistical Association, March 2013

∗ P(Mn ≥ yn(α)) = α. Then we have P(Mn ≥ yn(α)) → α under Let be any subset of {(i, j):1≤ i ≤ j ≤ p} and H0.Soyn(α) can be used as the critical value when the sample | |=Card( ). size is small and it can be obtained easily by simulation. Some ∗ Lemma 4. Under the conditions of (C2) or (C2 ), we have additional simulation results illustrating the case with a small for some constant C>0 that sample size (n = 30) are given in the supplementary material  2  (˜σij − σ˜ij − σij + σij ) (Cai, Liu, and Xia 2012). P max 1 2 1 2 ≥ x2 i,j ∈ ( ) θij 1/n1 + θij 2/n2 7. PROOFS ≤ C| |(1 − (x)) + O(p−1 + n−/8) We prove the main results in this section. Throughout this sec- uniformly for 0 ≤ x ≤ (8 log p)1/2 and ⊆{(i, j):1≤ i ≤ tion, we denote by C, C1, C2,..., constants that do not depend j ≤ p}. on n, p and may vary from place to place. We begin by collect- ing some technical lemmas that will be used in the proofs of the The proofs of Lemmas 3 and 4 are given in the supplementary main results. These technical lemmas together with Proposition material (Cai, Liu, and Xia 2012). 1–3 are proved in the supplementary material (Cai, Liu, and Xia 7.2 Proof of Theorem 1 2012). μ = μ = Without loss of generality, we assume that 1 0, 2 0, 7.1 Technical Lemmas σii1 = σii2 = 1for1≤ i ≤ p.Let 2 The first lemma is the classical Bonferroni’s inequality. (ˆσij 1 − σˆij 2) Mˆ n = max , ≤i≤j≤p θ /n + θ /n B =∪p B 1 ij 1 1 ij 2 2 Lemma 1 (Bonferroni inequality). Let t=1 t . For any k<[p/2], we have and σ − σ 2 M˜ = (˜ij 1 ˜ij 2) . 2k 2k−1 n max 1≤i≤j≤p θij /n + θij /n t−1 t−1 1 1 2 2 (−1) Et ≤ P(B) ≤ (−1) Et , {|θˆ /θ − |≤Cε / p, t=1 t=1 Note that under the event ij 1 ij 1 1 n log |θˆij /θij − 1|≤Cεn/ log p}, E = B ∩···∩B . 2 2 where t ≤i <···c,Y>c) By the√ exponential inequality, max1≤i≤p i max1≤i≤p i   = , lim c2 1 O ( log p/n). Thus, by Lemma 3, it suffices to show that for c→∞ [2π(1 − ρ)1/2c2]−1 exp − (1 + ρ)1/2 P 1+ρ any x ∈ IR ,   uniformly for all ρ such that |ρ|≤δ, for any δ,0<δ<1. 1 x P(M˜ n −4logp + log log p ≤ x)→exp −√ exp − 8π 2 The next lemma is on the large deviations for θˆij 1 and θˆij 2. (22) Lemma 3. Under the conditions of (C2) or (C2∗), there exists as n, p →∞.Letρij = ρij = ρij under H . Define some constant C>0 such that 1 2 0

Downloaded by [University of Pennsylvania] at 08:44 29 April 2013   −1−α ε Aj = i : |ρij |≥(log p) 0 , |θˆ − θ |/σ σ ≥ C n = O p−1 + n−/8 , P max ij 1 ij 1 ii1 jj1 ( ) A ={(i, j):1≤ i ≤ j ≤ p}, i,j log p A ={i, j ≤ i ≤ p, i ≤ j,j ∈ A }, (20) 0 ( ):1 i B0 ={(i, j):i ∈ (r) ∪ ϒ, i < j ≤ p}∪{(i, j) and : j ∈ (r) ∪ ϒ, 1 ≤ i 2 k=1 that Card ( ( )) Card ( ) ( ) and for all 0, Cai, Liu, and Xia: Two-Sample Covariance Matrix Testing and Support Recovery in High-Dimensional and Sparse Settings 275

s = o pγ D ≤ C p1+γ + Note that maxj/∈ϒ j ( ). This implies that Card( 0) 0 2 o(p ) for any γ>0. By Lemma 4, we obtain that for any fixed 2 2 max Vk − max Vˆk ≤ 2max|Vˆk| max |Vk − Vˆk| x ∈ R, 1≤k≤q 1≤k≤q 1≤k≤q 1≤k≤q + |V − Vˆ |2. M˜ ≥ y ≤ D × Cp−2 + o = o . max k k (24) P( D0 p) Card( 0) (1) (1) 1≤k≤q Set By (23) and (24), it suffices to prove that for any x ∈ IR ,as n, p →∞ σ − σ 2 , (˜ij 1 ˜ij 2) 2   M˜ A\D = max =:maxUij , 0 i,j ∈A\D θ /n + θ /n i,j ∈A\D 2 ( ) 0 ij 1 1 ij 2 2 ( ) 0 P max Vˆk − 4logp + log log p ≤ x 1≤k≤q σ˜ij 1−σ˜ij 2   Uij = √ . x where θ /n +θ /n It suffices to show that for any 1 ij 1 1 ij 2 2 → exp −√ exp − . (25) x ∈ IR, 8π 2   By Bonferroni inequality (see Lemma 1), for any integer m with P M˜ A\D − 4logp + log log p ≤ x 0   0 c >  → , ≤l≤n +n ≤k≤q where 1 0 and 2 0 are absolutely constants, n 0 √ 1 1 2 1 = N ,...,N −4 which will be specified later and Nd :( k1 kd )isa Downloaded by [University of Pennsylvania] at 08:44 29 April 2013 ≤ C n p + n |Zlk| η|Zlk|/ K n ( ) max max E exp( (2 1)) = = 1 + 1≤l≤n1+n2 1≤k≤q normal vector with ENd 0 and cov(Nd ) n cov(W 1) √ 2 −4 cov(W n + ). Recall that d is a fixed integer not depending on ≤ C n(p + n) , 1 1 1/5 n, p. Because log p = o(n ), we can let n → 0 sufficiently and if (C2∗) holds, then slow such that n +n 1/2 1 2 n n 1 √ c d5/2 − = O p−M √ |Z |I{|Z |≥ n/ p 8} 1 exp / ( ) (28) max E lk lk (log ) c d3τn p 1 2 1≤k≤q n 2 (log ) √ l=1 √ 8 M> ≤ C n max max E|Zlk|I{|Zlk|≥ n/(log p) } for any large 0. It follows from (26), (27), and (28) that 1≤l≤n1+n2 1≤k≤q −γ −/ 2m−1 ≤ Cn 0 8. 2 d−1 P max Vˆk ≥ yp ≤ (−1) 1≤k≤q d= ≤k <···

We need the following technical lemma whose proof is in- N .Letμρ be the distribution of 1 − I. Note that μρ is a prob- 2 2 volved, and is given in the supplementary material (Cai, Liu, ability measure on { ∈ S(c0(p)) : F = c0(p)ρ }, where and Xia 2012). S(c0(p)) is the class of matrices with c0(p) nonzero entries. Let dP ({Xn, Y n}) be the likelihood function given being Lemma 5. For any fixed integer d ≥ 1 and real number x ∈ 1 1 uniformly distributed on N and IR ,   d {Xn, Y n} | | ≥ y1/2 ±  p −1/2 L = L { , } = P1( ) , P Nd min p n(log ) μρ : μρ ( Xn Y n ) Eμρ dP0({Xn, Y n}) 1≤k1<···0 will be Proof of Theorem 6. Without loss of generality, we as- γ specified later. Let 2 = I and 1 be uniformly distributed on sume  ={1, 2,...,p0} with p0 = Card(θ)≥ p for all Cai, Liu, and Xia: Two-Sample Covariance Matrix Testing and Support Recovery in High-Dimensional and Sparse Settings 277

0 <γ <1. Let A1 be the largest subset of \{1} such that Berman, S. M. (1962), “A Law of Large Numbers for the Maximum of a Stationary Gaussian Sequence,” The Annals of Mathematical Statistics, 33, σ1k1 = σ1k2 for all k ∈ A1.Leti1 = min{j : j ∈ A1,j >1}. |i − |≤s p A ≥ p − s p 894–908. [274] Then we have 1 1 0( ). Also, Card( 1) 0 0( ). Cai, T., and Jiang, T. (2011), “Limiting Laws of Coherence of Random Matrices Similarly, let Al be the largest subset of Al−1 \{il−1} such With Applications to Testing Covariance Structure and Construction of σ = σ k ∈ A i = {j j ∈ A ,j > Compressed Sensing Matrices,” The Annals of Statistics, 39, 1496–1525. that il−1k1 il−1k2 for all l and l min : l i } i − i ≤ s p l

Downloaded by [University of Pennsylvania] at 08:44 29 April 2013 In the supplement we prove Proposition 1–3 and the technical Correlates of Clinical Prostate Cancer Behavior,” Cancer Cell, 1, 203–209. [272] results, Lemmas 3, 4, and 5, which are used in the proofs of the Srivastava, M. S., and Yanagihara, H. (2010), “Testing the Equality of Sev- main results. We also present more extensive simulation results eral Covariance Matrices With Fewer Observations Than the Dimension,” comparing the numerical performance of the proposed test with Journal of Multivariate Analysis, 101, 1319–1329. [265,268,269,271] Sugiura, N., and Nagao, H. (1968), “Unbiasedness of Some Test Criteria for the that of other tests. Equality of One or Two Covariance Matrices,” The Annals of Mathematical Statistics, 39, 1682–1692. [265] [Received October 2012. Revised November 2012.] Xu, P., Brock, G. N., and Parrish, R. S. (2009), “Modified Linear Discrim- inant Analysis Approaches for Classification of High-Dimensional Mi- croarray Data,” Computational Statistics and Data Analysis, 53, 1674– 1687. [272] Za¨ıtsev, A. (1987), “On the Gaussian Approximation of Convolutions Under REFERENCES Multidimensional Analogues of S.N. Bernstein’s Inequality Conditions,” Anderson, T. W. (2003), An Introduction to Multivariate Statistical Analysis Probability Theory and Related Fields, 74, 535–566. [275] (3rd ed.), New York: Wiley-Interscience. [265,267] Zhou, W. (2007), “Asymptotic Distribution of the Largest Off-Diagonal Entry of Baraud, Y. (2002), “Non-Asymptotic Minimax Rates of Testing in Signal De- Correlation Matrices,” Transactions of the American Mathematical Society, tection,” Bernoulli, 8, 577–606. [276] 359, 5345–5364. [268]