Effect of Mean on Variance Function Estimation in Nonparametric Regression

Lie Wang¹, Lawrence D. Brown¹, T. Tony Cai¹ and Michael Levine²

Abstract

Variance function estimation in nonparametric regression is considered and the minimax rate of convergence is derived. We are particularly interested in the effect of the unknown mean on the estimation of the variance function. Our results indicate that, contrary to the common practice, it is often not desirable to base the estimator of the variance function on the residuals from an optimal estimator of the mean. Instead it is more desirable to use estimators of the mean with minimal bias. In addition, our results also correct the optimal rate claimed in the previous literature.

Keywords: Minimax estimation, nonparametric regression, variance estimation.

AMS 2000 Subject Classification: Primary: 62G08, 62G20.

¹ Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104. The research of Tony Cai was supported in part by NSF Grant DMS-0306576.
² Department of Statistics, Purdue University, West Lafayette, IN 47907.

1 Introduction

Consider the heteroscedastic nonparametric regression model

    y_i = f(x_i) + V^{1/2}(x_i) z_i,   i = 1, ..., n,    (1)

where x_i = i/n and the z_i are independent with zero mean, unit variance and uniformly bounded fourth moments. Both the mean function f and the variance function V are defined on [0, 1] and are unknown. The main object of interest is the variance function V. The estimation accuracy is measured both globally by the mean integrated squared error

    R(V̂, V) = E ∫_0^1 (V̂(x) − V(x))^2 dx    (2)

and locally by the mean squared error at a point

    R(V̂(x_*), V(x_*)) = E(V̂(x_*) − V(x_*))^2.    (3)

We wish to study the effect of the unknown mean f on the estimation of the variance function V. In particular, we are interested in the case where the difficulty in estimating V is driven by the degree of smoothness of the mean f.

The effect of not knowing the mean f on the estimation of V has been studied before in Hall and Carroll (1989). The main conclusion of their paper is that it is possible to characterize explicitly how the smoothness of the unknown mean function influences the rate of convergence of the variance estimator. In association with this they claim an explicit minimax rate of convergence for the variance estimator under pointwise risk. For example, they state that the "classical" rate of convergence (n^{−4/5}) for estimating a twice differentiable variance function is achievable if and only if f is in the Lipschitz class of order at least 1/3. More precisely, Hall and Carroll (1989) stated that, under the pointwise mean squared error loss, the minimax rate of convergence for estimating V is

    max{n^{−4α/(2α+1)}, n^{−2β/(2β+1)}}    (4)

if f has α derivatives and V has β derivatives. We shall show here that this result is in fact incorrect.

In the present paper we revisit the problem in the same setting as in Hall and Carroll (1989). We show that the minimax rate of convergence under both the pointwise squared error and the global integrated mean squared error is

    max{n^{−4α}, n^{−2β/(2β+1)}}    (5)

if f has α derivatives and V has β derivatives. The derivation of the minimax lower bound is involved and is based on a moment matching technique and a two-point testing argument. A key step is to study a hypothesis testing problem where the alternative hypothesis is a Gaussian location mixture with a special moment matching property.
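To compare (4) and (5) concretely, note that max{n^{−a}, n^{−b}} = n^{−min(a,b)}, so each rate is determined by the smaller of its two exponents. The following Python sketch (an illustration of ours, not part of the paper) evaluates both exponents for β = 2: under (4) the classical exponent 4/5 requires α ≥ 1/3, while under (5) it is attained already for α ≥ 1/5.

```python
# Compare the rate exponents in (4) and (5): max{n^-a, n^-b} = n^-min(a, b).

def claimed_exponent(alpha, beta):
    # Exponent of the rate claimed by Hall and Carroll (1989), display (4)
    return min(4 * alpha / (2 * alpha + 1), 2 * beta / (2 * beta + 1))

def corrected_exponent(alpha, beta):
    # Exponent of the corrected rate in display (5)
    return min(4 * alpha, 2 * beta / (2 * beta + 1))

beta = 2.0  # twice differentiable V: classical exponent 2b/(2b+1) = 4/5
for alpha in (0.1, 0.2, 1 / 3, 0.5, 1.0):
    print(f"alpha = {alpha:.3f}: claimed n^-{claimed_exponent(alpha, beta):.3f}, "
          f"corrected n^-{corrected_exponent(alpha, beta):.3f}")
```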
The minimax upper bound is obtained using kernel smoothing of the squared first order differences.

Our results have two interesting implications. Firstly, if V is known to belong to a regular parametric model, such as the set of positive polynomials of a given order, the cutoff for the smoothness of f on the estimation of V is 1/4, not 1/2 as stated in Hall and Carroll (1989). That is, if f has at least 1/4 derivative, then the minimax rate of convergence for estimating V is solely determined by the smoothness of V, as if f were known. On the other hand, if f has less than 1/4 derivative, then the minimax rate depends on the relative smoothness of both f and V and is completely driven by the roughness of f. Secondly, contrary to the common practice, our results indicate that it is often not desirable to base the estimator V̂ of the variance function V on the residuals from an optimal estimator f̂ of f. In fact, the results show that it is desirable to use an f̂ with minimal bias. The main reason is that the bias and variance of f̂ have quite different effects on the estimation of V. The bias of f̂ cannot be removed or even reduced in the second stage smoothing of the squared residuals, while the variance of f̂ can be incorporated easily.

The paper is organized as follows. Section 2 presents an upper bound for the minimax risk while Section 3 derives a rate-sharp lower bound for the minimax risk under both the global and local losses. The lower and upper bounds together yield the minimax rate of convergence. Section 4 discusses the results and their implications for practical variance estimation in nonparametric regression. The proofs are given in Section 6.

2 Upper bound

In this section we construct a kernel estimator based on the squares of the first order differences. Such and more general difference based kernel estimators of the variance function have been considered, for example, in Müller and Stadtmüller (1987 and 1993). For estimating a constant variance, difference based estimators have a long history; see von Neumann (1941, 1942), Rice (1984), Hall, Kay and Titterington (1990) and Munk, Bissantz, Wagner and Freitag (2005).

Define the Lipschitz class Λ^α(M) in the usual way:

    Λ^α(M) = {g : for all 0 ≤ x, y ≤ 1 and k = 0, ..., ⌊α⌋ − 1,
              |g^{(k)}(x)| ≤ M, and |g^{(⌊α⌋)}(x) − g^{(⌊α⌋)}(y)| ≤ M|x − y|^{α′}},

where ⌊α⌋ is the largest integer less than α and α′ = α − ⌊α⌋. We shall assume that f ∈ Λ^α(M_f) and V ∈ Λ^β(M_V). We say that the function f "has α derivatives" if f ∈ Λ^α(M_f) and that V "has β derivatives" if V ∈ Λ^β(M_V).

For i = 1, 2, ..., n − 1, set D_i = y_i − y_{i+1}. Then one can write

    D_i = f(x_i) − f(x_{i+1}) + V^{1/2}(x_i)z_i − V^{1/2}(x_{i+1})z_{i+1} = δ_i + √2 V_i^{1/2} ε_i,    (6)

where δ_i = f(x_i) − f(x_{i+1}), V_i = (V(x_i) + V(x_{i+1}))/2, and

    ε_i = (V(x_i) + V(x_{i+1}))^{−1/2} (V^{1/2}(x_i)z_i − V^{1/2}(x_{i+1})z_{i+1})

has zero mean and unit variance. We construct an estimator V̂ by applying kernel smoothing to the squared differences D_i^2, which have means δ_i^2 + 2V_i. Let K(x) be a kernel function satisfying

    K(x) is supported on [−1, 1],   ∫_{−1}^{1} K(x) dx = 1,
    ∫_{−1}^{1} K(x)x^i dx = 0 for i = 1, 2, ..., ⌊β⌋,   and   ∫_{−1}^{1} K^2(x) dx = k < ∞.
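Before turning to boundary corrections, the mean structure in (6), namely E D_i^2 = δ_i^2 + 2V_i, can be verified numerically. The following Python sketch is a minimal Monte Carlo check; the particular f, V, and Gaussian z_i are illustrative assumptions of ours, not choices made in the paper.

```python
import numpy as np

# Monte Carlo check of (6): the squared differences D_i^2 have mean
# delta_i^2 + 2 V_i. The f, V and Gaussian z_i below are illustrative.
rng = np.random.default_rng(0)
n, reps = 200, 20000
x = np.arange(1, n + 1) / n                    # design points x_i = i/n

f = lambda t: np.sin(2 * np.pi * t)            # assumed mean function
V = lambda t: 0.5 + 0.25 * t                   # assumed variance function

y = f(x) + np.sqrt(V(x)) * rng.standard_normal((reps, n))
D2 = (y[:, :-1] - y[:, 1:]) ** 2               # D_i^2 for i = 1, ..., n-1

delta = f(x[:-1]) - f(x[1:])                   # delta_i = f(x_i) - f(x_{i+1})
Vbar = 0.5 * (V(x[:-1]) + V(x[1:]))            # V_i = (V(x_i) + V(x_{i+1}))/2
# maximal deviation from the theoretical mean; small (Monte Carlo error only)
print(np.max(np.abs(D2.mean(axis=0) - (delta ** 2 + 2 * Vbar))))
```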
It is well known in kernel regression that special care is needed in order to avoid significant, sometimes dominant, boundary effects. We shall use the boundary kernels with asymmetric support, given in Gasser and Müller (1979 and 1984), to control the boundary effects. For any t ∈ [0, 1], there exists a boundary kernel function K_t(x) with support [−1, t] satisfying the same conditions as K(x), i.e.

    ∫_{−1}^{t} K_t(x) dx = 1,
    ∫_{−1}^{t} K_t(x)x^i dx = 0 for i = 1, 2, ..., ⌊β⌋,   and
    ∫_{−1}^{t} K_t^2(x) dx ≤ k̄ < ∞ for all t ∈ [0, 1].

We can also make K_t(x) → K(x) as t → 1, but this is not necessary here; see Gasser, Müller and Mammitzsch (1985). For any 0 < h < 1/2, x ∈ [0, 1], and i = 2, ..., n − 2, let

    K_i^h(x) = ∫_{(x_{i−1}+x_i)/2}^{(x_i+x_{i+1})/2} (1/h) K((x − u)/h) du       when x ∈ (h, 1 − h),
    K_i^h(x) = ∫_{(x_{i−1}+x_i)/2}^{(x_i+x_{i+1})/2} (1/h) K_t((x − u)/h) du     when x = th for some t ∈ [0, 1],
    K_i^h(x) = ∫_{(x_{i−1}+x_i)/2}^{(x_i+x_{i+1})/2} (1/h) K_t(−(x − u)/h) du    when x = 1 − th for some t ∈ [0, 1],

and we take the integral from 0 to (x_1 + x_2)/2 for i = 1, and from (x_{n−2} + x_{n−1})/2 to 1 for i = n − 1. Then for any 0 ≤ x ≤ 1, Σ_{i=1}^{n−1} K_i^h(x) = 1. Define the estimator V̂ as

    V̂(x) = (1/2) Σ_{i=1}^{n−1} K_i^h(x) D_i^2.    (7)

As in the mean function estimation problem, the optimal bandwidth h_n is easily seen to be h_n = O(n^{−1/(1+2β)}) for V ∈ Λ^β(M_V). For this optimal choice of the bandwidth, we have the following theorem.

Theorem 1. Under the regression model (1), where x_i = i/n and the z_i are independent with zero mean, unit variance and uniformly bounded fourth moments, let the estimator V̂ be given as in (7) with the bandwidth h = O(n^{−1/(1+2β)}). Then there exists some constant C_0 > 0, depending only on α, β, M_f and M_V, such that for sufficiently large n,

    sup_{f∈Λ^α(M_f), V∈Λ^β(M_V)} sup_{0≤x_*≤1} E(V̂(x_*) − V(x_*))^2 ≤ C_0 · max{n^{−4α}, n^{−2β/(1+2β)}}    (8)

and

    sup_{f∈Λ^α(M_f), V∈Λ^β(M_V)} E ∫_0^1 (V̂(x) − V(x))^2 dx ≤ C_0 · max{n^{−4α}, n^{−2β/(1+2β)}}.
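As a deliberately simplified illustration of (7), the Python sketch below evaluates the estimator only at an interior point x0 ∈ (h, 1 − h), using the Epanechnikov kernel, which satisfies the moment conditions when ⌊β⌋ = 1; the boundary kernels K_t are omitted, and the function name variance_estimate is ours.

```python
import numpy as np

# Minimal sketch of estimator (7) at an interior point x0 in (h, 1 - h).
# Uses the Epanechnikov kernel K(u) = 0.75*(1 - u^2) on [-1, 1], valid
# for floor(beta) = 1; the Gasser-Mueller boundary kernels are omitted.
def variance_estimate(y, x0, h):
    n = len(y)
    x = np.arange(1, n + 1) / n                 # design points x_i = i/n
    D2 = (y[:-1] - y[1:]) ** 2                  # squared differences D_i^2
    u = (x0 - x[:-1]) / h
    # Riemann approximation of K_i^h(x0): the integral of (1/h) K((x0-u)/h)
    # over an interval of width 1/n around x_i
    w = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0) / (n * h)
    return 0.5 * np.sum(w * D2)                 # (1/2) sum_i K_i^h(x0) D_i^2
```

For ⌊β⌋ = 1, for instance β = 1, Theorem 1 suggests the bandwidth choice h ≍ n^{−1/3}; at interior points the weights w in the sketch sum to approximately 1, mirroring the property Σ_{i=1}^{n−1} K_i^h(x) = 1 above.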