arXiv:1407.4546v7 [math.ST] 28 Sep 2016 Council. Council Research Australian the and 11501462 SWUFE No. at (Grant Research NSFC tical JBK150501), JBK160159, No. (Grant il yohsstsig efnraie oeaedevia moderate sample self-normalized testing, hypothesis tiple aer ee ilb erymse n oee eebrda during remembered provided forever had and Peter missed dearly that Q be supports him. will with and Peter spent helps career. yea we final time the the his all treasure in for always Even will and Melbourne. We from of us. learn University guide Ch to the Jinyuan at opportunity generation. years the young two for the help grateful gen to extremely a keen are was and Peter heart community. warm servic statistical a and the work appli and His statistics often problems. on he statistical that complex techniques tackle to analytic of proba knowledge and statistics dinary mathematical to contributions ential 62H15. 1931–1956 5, No. DOI: 44, Vol. Statistics 2016, of Annals The
Melbourne, c aiainadtpgahcdetail. typographic 1931–1956 and 5, the pagination by No. published 44, Vol. article 2016, original Statistics the Mathematical of of reprint Institute electronic an is This CRAM yJnunChang Jinyuan By nttt fMteaia Statistics Mathematical of Institute 3 2 1 e od n phrases. and words Key eevdJn 2015. June Received M 00sbetclassifications. subject 2000 AMS Tribute upre yNHGatR1G107- n rn rmteAu the from grant an a and 603710 R01-GM100474-4 GRF Grant Council NIH Grants by Research Supported Kong Hong by C Supported the for Funds Research Fundamental the by part in Supported 10.1214/15-AOS1375 otwsenUiest fFnneadEconomics, and Finance of University Southwestern t saitc two-sample -statistic, TWO-SAMPLE plcblt fteeitn ttsia ehdlge f methodologies statistical sample existing the of applicability rlfaeok nldn h two-sample the including two-sample framework, a Studentized eral Cram´er-type biostatistics for mod sharp of theorems establish we viation fields paper, the this In in metrics. those including plications, salse o h two-sample accu second-order the with for theorem established deviation p moderate In refined examples. a prototypical as statistic test Mann–Whitney ee a rlin n rlfi eerhr h a aee made has who researcher, prolific and brilliant a was Peter : RTP OEAEDVAIN O STUDENTIZED FOR DEVIATIONS MODERATE ER-TYPE ´ † Two-sample hns nvriyo ogKong Hong of University Chinese t saitct oegnrlnniersaitc.Applicat statistics. nonlinear general more to -statistic U otta,fledsoeyrt,Mann–Whitney rate, discovery false Bootstrap, 1 , saitc r ieyue nabodrneo ap- of range broad a in used widely are -statistics U ∗ , U 2016 , † SAITC IHAPPLICATIONS WITH -STATISTICS iMnShao Qi-Man , -statistics. hsrpitdffr rmteoiia in original the from differs reprint This . rmr 01,6E7 eodr 22,62F40, 62E20, secondary 62E17; 60F10, Primary in t h naso Statistics of Annals The saitc hs eut xedthe extend results These -statistic. 1 2 t , saitcadStudentized and -statistic ‡ ‡ n e-i Zhou Wen-Xin and in tdnie ttsis two- statistics, Studentized tion, iiyter.Ptrhdextraor- had Peter theory. bility U n rneo University Princeton and saitc nagen- a in -statistics aehdapoon impact profound a had have e ru etradfin with friend and mentor erous dwt neiu simplicity ingenious with ed u etradfriend. and mentor our s okwt ee ntelast the in Peter with work ,h a ffre ieto time afforded had he r, . -a hoi ograteful so is Shao i-Man h aiu tgso his of stages various the o h one- the rom n n e-i Zhou Wen-Xin and ang ,teCne fStatis- of Center the ), ∗ decono- nd nvriyof University , articular, rt de- erate nrlUniversities entral 403513. d tainResearch stralian ayis racy osto ions omul influ- normously U et mul- test, 3 , § , † § 2 J. CHANG, Q.-M. SHAO AND W.-X. ZHOU
two-sample large-scale multiple testing problems with false discovery rate control and the regularized bootstrap method are also discussed.
1. Introduction. The U-statistic is one of the most commonly used non- linear and nonparametric statistics, and its asymptotic theory has been well studied since the seminal paper of Hoeffding (1948). U-statistics extend the scope of parametric estimation to more complex nonparametric problems and provide a general theoretical framework for statistical inference. We re- fer to Koroljuk and Borovskich (1994) for a systematic presentation of the theory of U-statistics, and to Kowalski and Tu (2007) for more recently discovered methods and contemporary applications of U-statistics. Applications of U-statistics can also be found in high dimensional statis- tical inference and estimation, including the simultaneous testing of many different hypotheses, feature selection and ranking, the estimation of high dimensional graphical models and sparse, high dimensional signal detec- tion. In the context of high dimensional hypothesis testing, for example, several new methods based on U-statistics have been proposed and studied in Chen and Qin (2010), Chen, Zhang and Zhong (2010) and Zhong and Chen (2011). Moreover, Li et al. (2012) and Li, Zhong and Zhu (2012) em- ployed U-statistics to construct independence feature screening procedures for analyzing ultrahigh dimensional data. Due to heteroscedasticity, the measurements across disparate subjects may differ significantly in scale for each feature. To standardize for scale, unknown nuisance parameters are always involved and a natural approach is to use Studentized, or self-normalized statistics. The noteworthy advan- tage of Studentization is that compared to standardized statistics, Studen- tized ratios take heteroscedasticity into account and are more robust against heavy-tailed data. The theoretical and numerical studies in Delaigle, Hall and Jin (2011) and Chang, Tang and Wu (2013, 2016) evidence the impor- tance of using Studentized statistics in high dimensional data analysis. As noted in Delaigle, Hall and Jin (2011), a careful study of the moderate deviations in the Studentized ratios is indispensable to understanding the common statistical procedures used in analyzing high dimensional data. Further, it is now known that the theory of Cram´er-type moderate de- viations for Studentized statistics quantifies the accuracy of the estimated p-values, which is crucial in the study of large-scale multiple tests for con- trolling the false discovery rate [Fan, Hall and Yao (2007), Liu and Shao (2010)]. In particular, Cram´er-type moderate deviation results can be used to investigate the robustness and accuracy properties of p-values and critical values in multiple testing procedures. However, thus far, most applications have been confined to t-statistics [Fan, Hall and Yao (2007), Wang and Hall (2009), Delaigle, Hall and Jin (2011), Cao and Kosorok (2011)]. It is conjec- tured in Fan, Hall and Yao (2007) that analogues of the theoretical properties STUDENTIZED TWO-SAMPLE U-STATISTICS 3 of these statistical methodologies remain valid for other resampling meth- ods based on Studentized statistics. Motivated by the above applications, we are attempting to develop a unified theory on moderate deviations for more general Studentized nonlinear statistics, in particular, for two-sample U-statistics. The asymptotic properties of the standardized U-statistics are extensively studied in the literature, whereas significant developments are achieved in the past decade for one-sample Studentized U-statistics. We refer to Wang, Jing and Zhao (2000) and the references therein for Berry–Esseen-type bounds and Edgeworth expansions. The results for moderate deviations can be found in Vandemaele and Veraverbeke (1985), Lai, Shao and Wang (2011) and Shao and Zhou (2016). The results in Shao and Zhou (2016) paved the way for further applications of statistical methodologies using Studentized U-statistics in high dimensional data analysis. Two-sample U-statistics are also commonly used to compare the differ- ent (treatment) effects of two groups, such as an experimental group and a control group, in scientifically controlled experiments. However, due to the structural complexities, the theoretical properties of the two-sample U- statistics have not been well studied. In this paper, we establish a Cram´er- type moderate deviation theorem in a general framework for Studentized two-sample U-statistics, especially the two-sample t-statistic and the Stu- dentized Mann–Whitney test. In particular, a refined moderate deviation theorem with second-order accuracy is established for the two-sample t- statistic. The paper is organized as follows. In Section 2, we present the main results on Cram´er-type moderate deviations for Studentized two-sample U- statistics as well as a refined result for the two-sample t-statistic. In Sec- tion 3, we investigate statistical applications of our theoretical results to the problem of simultaneously testing many different hypotheses, based partic- ularly on the two-sample t-statistics and Studentized Mann–Whitney tests. Section 4 shows numerical studies. A discussion is given in Section 5. All the proofs are relegated to the supplementary material [Chang, Shao and Zhou (2016)].
2. Moderate deviations for Studentized U-statistics. We use the follow- ing notation throughout this paper. For two sequences of real numbers an and bn, we write an ≍ bn if there exist two positive constants c1, c2 such that c1 ≤ an/bn ≤ c2 for all n ≥ 1, we write an = O(bn) if there is a constant C such that |an|≤ C|bn| holds for all sufficiently large n, and we write an ∼ bn and an = o(bn), respectively, if limn→∞ an/bn = 1 and limn→∞ an/bn = 0. Moreover, for two real numbers a and b, we write for ease of presentation that a ∨ b = max(a, b) and a ∧ b = min(a, b). 4 J. CHANG, Q.-M. SHAO AND W.-X. ZHOU
2.1. A review of Studentized one-sample U-statistics. We start with a brief review of Cram´er-type moderate deviation for Studentized one-sample U-statistics. For an integer s ≥ 2 and for n> 2s, let X1,...,Xn be indepen- dent and identically distributed (i.i.d.) random variables taking values in a metric space (X, G), and let h : Xd 7→ R be a symmetric Borel measurable function. Hoeffding’s U-statistic with a kernel h of degree s is defined as 1 Un = n h(Xi1 ,...,Xis ), s ≤ ··· ≤ 1 i1