Estimating False Discovery Proportion Under Arbitrary Covariance

Estimating False Discovery Proportion Under Arbitrary Covariance

This article was downloaded by: [Princeton University] On: 02 November 2012, At: 12:43 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK Journal of the American Statistical Association Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/uasa20 Estimating False Discovery Proportion Under Arbitrary Covariance Dependence Jianqing Fan a b , Xu Han c & Weijie Gu a a Department of Operations Research & Financial Engineering, Princeton University, Princeton, NJ, 08544, USA b School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China c Department of Statistics, Fox Business School, Temple University, Philadelphia, PA, 19122 Accepted author version posted online: 20 Aug 2012.Version of record first published: 08 Oct 2012. To cite this article: Jianqing Fan, Xu Han & Weijie Gu (2012): Estimating False Discovery Proportion Under Arbitrary Covariance Dependence, Journal of the American Statistical Association, 107:499, 1019-1035 To link to this article: http://dx.doi.org/10.1080/01621459.2012.720478 PLEASE SCROLL DOWN FOR ARTICLE Full terms and conditions of use: http://www.tandfonline.com/page/terms-and-conditions This article may be used for research, teaching, and private study purposes. Any substantial or systematic reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any form to anyone is expressly forbidden. The publisher does not give any warranty express or implied or make any representation that the contents will be complete or accurate or up to date. The accuracy of any instructions, formulae, and drug doses should be independently verified with primary sources. The publisher shall not be liable for any loss, actions, claims, proceedings, demand, or costs or damages whatsoever or howsoever caused arising directly or indirectly in connection with or arising out of the use of this material. Supplementary materials for this article are available online. Please go to www.tandfonline.com/r/JASA Estimating False Discovery Proportion Under Arbitrary Covariance Dependence Jianqing FAN,XuHAN, and Weijie GU Multiple hypothesis testing is a fundamental problem in high-dimensional inference, with wide applications in many scientific fields. In genome-wide association studies, tens of thousands of tests are performed simultaneously to find if any single-nucleotide polymorphisms (SNPs) are associated with some traits and those tests are correlated. When test statistics are correlated, false discovery control becomes very challenging under arbitrary dependence. In this article, we propose a novel method—based on principal factor approximation—that successfully subtracts the common dependence and weakens significantly the correlation structure, to deal with an arbitrary dependence structure. We derive an approximate expression for false discovery proportion (FDP) in large-scale multiple testing when a common threshold is used and provide a consistent estimate of realized FDP. This result has important applications in controlling false discovery rate and FDP. Our estimate of realized FDP compares favorably with Efron’s approach, as demonstrated in the simulated examples. Our approach is further illustrated by some real data applications. We also propose a dependence-adjusted procedure that is more powerful than the fixed-threshold procedure. Supplementary material for this article is available online. KEY WORDS: Arbitrary dependence structure; False discovery rate; Genome-wide association studies; High-dimensional inference; Multiple hypothesis testing. 1. INTRODUCTION t, estimate the FDR, and choose t so that the estimated FDR is no larger than α (Storey 2002). Finding such a common thresh- Multiple hypothesis testing is a fundamental problem in mod- old is based on a conservative estimate of FDR. Specifically, let ern research for high-dimensional inference, with wide appli- = ∨ cations in scientific fields, such as biology, medicine, genetics, FDR(t) p0t/(R(t) 1), where R(t) is the number of total dis- neuroscience, economics, and finance. For example, in genome- coveries with the threshold t and p0 is an estimate of p0. Then, ≤ wide association studies, massive amount of genomic data (e.g., solve t such that FDR(t) α. The equivalence between the two single-nucleotide polymorphism (SNPs), expression quantita- methods has been theoretically studied by Storey, Taylor, and tive trait loci (eQTLs)) are collected and tens of thousands of Siegmund (2004) and Ferreira and Zwinderman (2006). hypotheses are tested simultaneously to find if any of these Both procedures have been shown to control the FDR for in- genomic data are associated with some observable traits (e.g., dependent test statistics. However, in practice, test statistics are blood pressure, weight, some disease); in finance, thousands of usually correlated. Although Benjamini and Yekutieli (2001) tests are performed to see which fund managers have winning and Clarke and Hall (2009) argued that when the null distribu- ability (Barras, Scaillet, and Wermers 2010). tion of test statistics satisfies some conditions, the dependence False discovery rate (FDR) was introduced in the celebrated case in the multiple testing is asymptotically the same as the article by Benjamini and Hochberg (1995) for large-scale mul- independence case, multiple testing under general dependence tiple testing. By definition, FDR is the expected proportion of structures is still a very challenging and important open problem. falsely rejected null hypotheses among all of the rejected null Efron (2007) pioneered the work in the field and noted that corre- hypotheses. The classification of tested hypotheses can be sum- lation must be accounted for in deciding which null hypotheses marized in Table 1. are significant because the accuracy of FDR techniques will be compromised in high correlation situations. There are several Downloaded by [Princeton University] at 12:43 02 November 2012 Various testing procedures have been developed for control- ling FDR, among which there are two major approaches. One articles that show that the Benjamini–Hochberg (B–H) proce- is to compare the p-values with a data-driven threshold as in dure or Storey’s procedure can control FDR under some special dependence structures, for example, positive regression depen- Benjamini and Hochberg (1995). Specifically, let p(1) ≤ p(2) ≤ dence on subsets (Benjamini and Yekutieli 2001) and weak de- ···≤p(p) be the ordered observed p-values of p hypotheses. De- = { ≤ } 0 0 pendence (Storey, Taylor, and Siegmund 2004). Sarkar (2002) fine k max i : p(i) iα/p and reject H(1),...,H(k), where α is a specified control rate. If no such i exists, reject no hy- also showed that FDR can be controlled by a generalized step- pothesis. The other related approach is to fix a threshold value wise multiple testing procedure under positive regression de- pendence on subsets. However, even if the procedures are valid under these special dependence structures, they will still suffer Jianqing Fan is Frederick L. Moore’18 Professor, Department of Oper- from efficiency loss without considering the actual dependence ations Research & Financial Engineering, Princeton University, Princeton, NJ 08544, USA and Honorary Professor, School of Statistics and Man- information. In other words, there are universal upper bounds agement, Shanghai University of Finance and Economics, Shanghai, China for a given class of covariance matrices. (E-mail: [email protected]). Xu Han is Assistant Professor, Department of In this article, we develop a procedure for high-dimensional Statistics, Fox Business School, Temple University, Philadelphia, PA 19122 (E-mail: [email protected]). Weijie Gu is graduate student, Department of multiple testing, which can deal with any arbitrary dependence Operations Research & Financial Engineering, Princeton University, Princeton, NJ 08544, USA (E-mail: [email protected]). The article was completed while Xu Han was a postdoctoral fellow at Princeton University. This research was © 2012 American Statistical Association partly supported by NSF grants DMS-0704337 and DMS-0714554 and NIH Journal of the American Statistical Association grant R01-GM072611. The authors are grateful to the editor, associate editor, September 2012, Vol. 107, No. 499, Theory and Methods and referees for helpful comments. DOI: 10.1080/01621459.2012.720478 1019 1020 Journal of the American Statistical Association, September 2012 Table 1. Classification of tested hypotheses is, the simultaneous tests for each SNP j of the null hypothesis { i }n Hj of no association between the SNP Xj and Y.Let Xj i=1 Number Number i n be the sample data of X , {Y } = be the independent sample Number not rejected rejected Total j i 1 random variables of Y. Consider the marginal linear regression { i}n { i }n True null UVp0 between Y i=1 and Xj i=1: False null TSp1 n p − R Rp 1 i i 2 (αj ,βj ) = argmin E Y − aj − bj X , aj ,bj n j i=1 j = 1,...,p. (1) structure and fully incorporate the covariance information. This { i }n is in contrast with previous articles that considers multiple test- where expectation is taken conditionally given Xj i=1. ing under special dependence structures; for example, Sun and We wish to simultaneously test the hypotheses Cai (2009) developed a multiple testing procedure where param- H : β = 0 versus H : β = 0,j= 1,...,p

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    18 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us