Goodness-Of-Fit Tests for Parametric Regression Models
Total Page:16
File Type:pdf, Size:1020Kb
Goodness-of-Fit Tests for Parametric Regression Models Jianqing Fan and Li-Shan Huang Several new tests are proposed for examining the adequacy of a family of parametric models against large nonparametric alternatives. These tests formally check if the bias vector of residuals from parametric ts is negligible by using the adaptive Neyman test and other methods. The testing procedures formalize the traditional model diagnostic tools based on residual plots. We examine the rates of contiguous alternatives that can be detected consistently by the adaptive Neyman test. Applications of the procedures to the partially linear models are thoroughly discussed. Our simulation studies show that the new testing procedures are indeed powerful and omnibus. The power of the proposed tests is comparable to the F -test statistic even in the situations where the F test is known to be suitable and can be far more powerful than the F -test statistic in other situations. An application to testing linear models versus additive models is also discussed. KEY WORDS: Adaptive Neyman test; Contiguous alternatives; Partial linear model; Power; Wavelet thresholding. 1. INTRODUCTION independent observations from a population, Parametric linear models are frequently used to describe Y m4x5 ˜1 ˜ N 401‘ 251 (1.1) the association between a response variable and its predic- D C tors. The adequacy of such parametric ts often arises. Con- ventional methods rely on residual plots against tted values where x is a p-dimensional vector and m4 5 is a smooth regres- sion surface. Let f 4 1 ˆ5 be a given parametri¢ c family. The or a covariate variable to detect if there are any system- ¢ atic departures from zero in the residuals. One drawback of null hypothesis is the conventional methods is that a systematic departure that H 2 m4 5 f 4 1 ˆ5 for some ˆ0 (1.2) is smaller than the noise level cannot be observed easily. 0 ¢ D ¢ Recently many articles have presented investigations of the use of nonparametric techniques for model diagnostics. Most of Examples of f 4 1 ˆ5 include the widely used linear model T ¢ them focused on one-dimensional problems, including Dette f 4x1 ˆ5 x ˆ and a logistic model f 4x1ˆ5 ˆ0=41 TD D C (1999), Dette and Munk (1998), and Kim (2000). The book ˆ1 exp4ˆ2 x55. by Hart (1997) gave an extensive overview and useful refer- The alternative hypothesis is often vague. Depending on the ences. Chapter 5 of Bowman and Azzalini (1997) outlined the situations of applications, one possible choice is the saturated work by Azzalini, Bowman, and Härdle (1989) and Azzalini nonparametric alternative and Bowman (1993). Although a lot of work has focused on H 2 m4 5 f4 1 ˆ5 for all ˆ1 (1.3) the univariate case, there is relatively little work on the multi- 1 ¢ 6D ¢ ple regression setting. Hart (1997, section 9.3), considered two approaches: one is to regress residuals on a scalar function of and another possible choice is the partially linear models (see the covariates and then apply one-dimensional goodness-of- t Green and Silverman 1994 and the references therein) tests; the second approach is to explicitly take into account the T multivariate nature. The test by Härdle and Mammen (1993) H 2 m4x5 f 4x 5 x2 ‚ 1 with f 4 5 nonlinear1 (1.4) 1 D 1 C 2 ¢ was based on the L2 error criterion in d-dimensions, Stute, González Mantoiga, and Presedo Quindimil (1998) investi- where x1 and x2 are the covariates. Traditional model diagnos- gated a marked empirical process based on the residuals and tic techniques involve plotting residuals against each covari- Aerts, Claeskens, and Hart (1999) constructed tests based ate to examine if there is any systematic departure, against on orthogonal series that involved choosing a nested model the index sequence to check if there is any serial correlation, sequence in the bivariate regression. and against tted values to detect possible heteroscedasticity, In this article, a completely different approach is proposed among others. This amounts to informally checking if biases and studied. The basic idea is that if a parametric family of are negligible in the presence of large stochastic noises. These models ts data adequately, then the residuals should have techniques can be formalized as follows. Let ˆ be an esti- O T nearly zero bias. More formally, let 4x11 Y151 : : : 1 4xn1 Yn5 be mate under the null hypothesis and let ˜ 4˜11 : : : 1 ˜n5 be the resulting residuals with ˜ Y f4OxD1 ˆ5O. UsuallyO , ˆ Oi D i ƒ i O O converges to some vector ˆ0. Then under model (1.1), con- n Jianqing Fan is Professor, Department of Statistics, Chinese University ditioned on 8xi9i 1, ˜ is nearly independently and normally D O of Hong Kong and Professor, Department of Statistics, University of distributed with mean vector ‡ 4‡ 1 : : : 1 ‡ 5T , where ‡ North Carolina, Chapel Hill, NC 27599-3260 (E-mail: [email protected]). 1 n i m4x 5 f 4x 1 ˆ 5. Namely, any Dnite-dimensional componentDs Li-Shan Huang is Assistant Professor, Department of Biostatistics, i ƒ i 0 University of Rochester Medical Center, Rochester, NY 14642 (E-mail: [email protected] r.edu). Fan’s research was partially supported by NSA 96-1-0015 and NSF grant DMS-98-0-3200. Huang’s work was partially sup- ported by NSF grant DMS-97-10084. The authors thank the referee and the © 2001 American Statistical Association associate editor for their constructive and detailed suggestions that led to Journal of the American Statistical Association signi cant improvement in the presentation of the article. June 2001, Vol. 96, No. 454, Theory and Methods 640 Fan and Huang: Regression Goodness-of-Fit 641 of ˜ are asymptotically independent and normally distributed there is only one predictor variable (i.e., p 1), such ordering O D with mean being a subvector of ‡. Thus, the problem becomes is straightforward. However, it is quite challenging to order a multivariate vector. A method based on the principal compo- H 2 ‡ 0 versus H 2 ‡ 01 (1.5) nent analysis is described in Section 2.2. 0 D 1 6D When there is some information about the alternative based on the observations ˜. Note that ˜i will not be inde- hypothesis such as the partially linear model in (1.4), it is pendent in general after estimatinO g ˆ, buOt the dependence is sensible to order the residuals according to the covariate x1. practically negligible, as will be demonstrated. Thus, the tech- Indeed, our simulation studies show that it improves a great niques for independent samples such as the adaptive Neyman deal the power of the generic ordering procedure outlined in test in Fan (1996) continue to apply. Plotting a covariate X j the last paragraph improved a great deal. The parameters ‚2 against the residual ˜ aims at testing if the biases in ˜ along 1=2 O O in model (1.4) can be estimated easily at the nƒ rate via the direction Xj are negligible, namely whether the scatter plot a difference based estimator, for example, that of Yatchew 84Xj 1 ‡j 59 is negligible in the presence of the noise. (1997). Then the problem is, heuristically, reduced to the one- In testing the signi cance of the high-dimensional prob- dimensional nonparametric setting; see Section 2.4 for details. lem (1.5), the conventional likelihood ratio statistic is not pow- Wavelet transform is another popular family of orthogo- erful due to noise accumulation in the n-dimensional space, nal transforms. It can be applied to the testing problem (1.5). as demonstrated in Fan (1996). An innovative idea proposed For comparison purposes, wavelet thresholding tests are also by Neyman (1937) is to test only the rst m-dimensional sub- included in our simulation studies. See Fan (1996) and problem if there is prior that most of nonzero elements lie Spokoiny (1996) for various discussions on the properties of on the rst m dimension. To obtain such a qualitative prior, the wavelet thresholding tests. a Fourier transform is applied to the residuals in an attempt This article is organized as follows. In Section 2, we pro- to compress nonzero signals into lower frequencies. In test- pose a few new tests for the testing problems (1.2) ver- ing goodness of t for a distribution function, Fan (1996) sus (1.3) and (1.2) versus (1.4). These tests include the proposed a simple data-driven approach to choose m based adaptive Neyman test and the wavelet thresholding tests in on power considerations. This results in an adaptive Neyman Sections 2.1 and 2.3, respectively. Ordering of multivari- test. See Rayner and Best (1989) for more discussions on ate vectors is discussed in Section 2.2. Applications of the the Neyman test. Some other recent work motivated by the test statistics in the partially linear models are discussed in Neyman test includes Eubank and Hart (1992), Eubank and Section 2.4 and their extension to additive models is outlined LaRiccia (1992), Kim (2000), Inglot, Kallenberg, and Ledwina brie y in Section 2.5. Section 2.6 outlines simple methods for (1994), Ledwina (1994), Kuchibhatla and Hart (1996), and estimating the residual variance ‘ 2 in the high-dimensional Lee and Hart (1998), among others. Also see Bickel and setting. In Section 3, we carry out a number of numerical stud- Ritov (1992) and Inglot and Ledwina (1996) for illuminating ies to illustrate the power of our testing procedures. Technical insights into nonparametric tests. In this article, we adopt the proofs are relegated to the Appendix. adaptive Neyman test proposed by Fan (1996) to the testing problem (1.5).