Two-Sample Covariance Matrix Testing and Support Recovery In
Total Page:16
File Type:pdf, Size:1020Kb
This article was downloaded by: [University of Pennsylvania] On: 29 April 2013, At: 08:44 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK Journal of the American Statistical Association Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/uasa20 Two-Sample Covariance Matrix Testing and Support Recovery in High-Dimensional and Sparse Settings Tony Cai a , Weidong Liu b & Yin Xia a a Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA, 19104 b Department of Mathematics, Institute of Natural Sciences and MOE-LSC, Shanghai Jiao Tong University, Shanghai, China Accepted author version posted online: 02 Jan 2013.Published online: 15 Mar 2013. To cite this article: Tony Cai , Weidong Liu & Yin Xia (2013): Two-Sample Covariance Matrix Testing and Support Recovery in High-Dimensional and Sparse Settings, Journal of the American Statistical Association, 108:501, 265-277 To link to this article: http://dx.doi.org/10.1080/01621459.2012.758041 PLEASE SCROLL DOWN FOR ARTICLE Full terms and conditions of use: http://www.tandfonline.com/page/terms-and-conditions This article may be used for research, teaching, and private study purposes. Any substantial or systematic reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any form to anyone is expressly forbidden. The publisher does not give any warranty express or implied or make any representation that the contents will be complete or accurate or up to date. The accuracy of any instructions, formulae, and drug doses should be independently verified with primary sources. The publisher shall not be liable for any loss, actions, claims, proceedings, demand, or costs or damages whatsoever or howsoever caused arising directly or indirectly in connection with or arising out of the use of this material. Supplementary materials for this article are available online. Please go to www.tandfonline.com/r/JASA Two-Sample Covariance Matrix Testing and Support Recovery in High-Dimensional and Sparse Settings Tony C AI, Weidong LIU, and Yin XIA In the high-dimensional setting, this article considers three interrelated problems: (a) testing the equality of two covariance matrices 1 and 2; (b) recovering the support of 1 − 2; and (c) testing the equality of 1 and 2 row by row. We propose a new test for testing the hypothesis H0: 1 = 2 and investigate its theoretical and numerical properties. The limiting null distribution of the test statistic is derived and the power of the test is studied. The test is shown to enjoy certain optimality and to be especially powerful against sparse alternatives. The simulation results show that the test significantly outperforms the existing methods both in terms of size and power. Analysis of a prostate cancer dataset is carried out to demonstrate the application of the testing procedures. When the null hypothesis of equal covariance matrices is rejected, it is often of significant interest to further investigate how they differ from each other. Motivated by applications in genomics, we also consider recovering the support of 1 − 2 and testing the equality of the two covariance matrices row by row. New procedures are introduced and their properties are studied. Applications to gene selection are also discussed. Supplementary materials for this article are available online. KEY WORDS: Extreme value Type I distribution; Gene selection; Hypothesis testing; Sparsity. 1. INTRODUCTION or quite similar in the sense that they only possibly differ in a small number of entries. In such a setting, under the alternative Testing the equality of two covariance matrices and is 1 2 the difference of the two covariance matrices − is sparse. an important problem in multivariate analysis. Many statistical 1 2 The above-mentioned tests, which are all based on the Frobenius procedures including the classical Fisher’s linear discriminant norm, are not powerful against such sparse alternatives. See analysis rely on the fundamental assumption of equal covari- Section 3.3 for further details. ance matrices. This testing problem has been well studied in The first goal of this article is to develop a test that is pow- the conventional low-dimensional setting. See, for example, erful against sparse alternatives and robust with respect to the Sugiura and Nagao (1968), Gupta and Giri (1973), Perlman population distributions. Let X and Y be two p variate random (1980), Gupta and Tang (1984), O’Brien (1992), and Anderson vectors with covariance matrices and , respectively. Let (2003). In particular, the likelihood ratio test (LRT) is commonly 1 2 {X ,...,Xn } be independent and identically distributed (iid) used and enjoys certain optimality under regularity conditions. 1 1 random samples from X and let {Y ,...,Y n } be iid random Driven by a wide range of contemporary scientific applica- 1 2 samples from Y that are independent of {X ,...,Xn }.Wewish tions, analysis of high-dimensional data is of significant current 1 1 to test the hypotheses interest. In the high-dimensional setting, where the dimension can be much larger than the sample size, the conventional testing H0 : 1 = 2 versus H1 : 1 = 2 (1) procedures such as the LRT either perform poorly or are not even based on the two samples. We are particularly interested in the well defined. Several tests for the equality of two large covari- high-dimensional setting where p can be much larger than n = ance matrices have been proposed. For example, Schott (2007) max(n ,n ). In many applications, if the null hypothesis H : introduced a test based on the Frobenius norm of the differ- 1 2 0 = is rejected, it is often of significant interest to further ence of the two covariance matrices. Srivastava and Yanagihara 1 2 investigate in which way the two covariance matrices differ Downloaded by [University of Pennsylvania] at 08:44 29 April 2013 (2010) constructed a test that relied on a measure of distance by from each other. Motivated by applications to gene selection, in tr(2)/(tr( ))2 − tr(2)/(tr( ))2. Both of these two tests are 1 1 2 2 this article we also consider two related problems, recovering designed for the multivariate normal populations. Li and Chen the support of − and testing the equality of the two (2012) proposed a test using a linear combination of three one- 1 2 covariance matrices row by row. sample U statistics, which was also motivated by an unbiased We propose a test for the hypotheses in Equation (1) based estimator of the Frobenius norm of − . 1 2 on the maximum of the standardized differences between the In many applications, such as gene selection in genomics, the entries of the two-sample covariance matrices and investigate covariance matrices of the two populations can be either equal its theoretical and numerical properties. The limiting null dis- tribution of the test statistic is derived. It is shown that the Tony Cai is Dorothy Silberberg Professor of Statistics, Department of Statis- distribution of the test statistic converges to a Type I extreme tics, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104 H (E-mail: [email protected]). Weidong Liu is Professor, Department of value distribution under the null 0. This fact implies that Mathematics, Institute of Natural Sciences and MOE-LSC, Shanghai Jiao Tong the proposed test has the prespecified significance level asymp- University, Shanghai, China (E-mail: [email protected]). Yin Xia is totically. The power of the test is investigated. The theoretical Ph.D. student, Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104 (E-mail: [email protected]). analysis shows that the proposed test enjoys certain optimality The research of Tony Cai and Yin Xia was supported in part by NSF FRG grant DMS-0854973. Weidong Liu’s research was supported by NSFC, Grant No. 11201298, the Program for Professor of Special Appointment (Eastern Scholar) © 2013 American Statistical Association at Shanghai Institutions of Higher Learning, Foundation for the Author of Na- Journal of the American Statistical Association tional Excellent Doctoral Dissertation of PR China and the startup fund from March 2013, Vol. 108, No. 501, Theory and Methods Shanghai Jiao Tong University (SJTU). DOI: 10.1080/01621459.2012.758041 265 266 Journal of the American Statistical Association, March 2013 { ,..., } against a large class of sparse alternatives in terms of the power. and Y 1 Y n2 from a distribution with the covariance − = σ We show that it only requires one√ of the entries of 1 2 matrix 2 ( ij 2)p×p, define the sample covariance matrices having a magnitude more than C log p/n in order for the test n 1 H 1 to correctly reject the√ null hypothesis 0. It is also shown that (ˆσij )p×p := ˆ = (Xk − X¯ )(Xk − X¯ ) , C p/n 1 1 n this lower bound log is rate optimal. 1 k=1 n In addition to the theoretical properties, we also consider the 2 1 σij p×p = ˆ = Y k − Y¯ Y k − Y¯ , numerical performance of the proposed testing procedure using (ˆ 2) : 2 n ( )( ) both simulated and real datasets. The numerical results show 2 k=1 n n that the new test significantly outperforms the existing methods ¯ = 1 1 ¯ = 1 2 where X n k= Xk and Y n k= Y k. The null hy- both in terms of size and power. A prostate cancer dataset is 1 1 2 1 pothesis H0 : 1 = 2 is equivalent to H0 :max1≤i≤j≤p |σij 1 − used to illustrate our testing procedures for gene selection. σij 2|=0. A natural approach to testing this hypothesis is to com- In addition to the global test of equal covariance matrices, we pare the sample covariancesσ ˆij andσ ˆij and to base the test on − 1 2 also consider recovery of the support of 1 2 as well as test- the maximum differences.