Adaptive Estimation of High Dimensional Partially Linear Model
Fang Han,∗ Zhao Ren,† and Yuxin Zhu‡

May 6, 2017

∗Department of Statistics, University of Washington, Seattle, WA 98195, USA; e-mail: [email protected]
†Department of Statistics, University of Pittsburgh, Pittsburgh, PA 15260, USA; e-mail: [email protected]
‡Department of Biostatistics, Johns Hopkins University, Baltimore, MD 21205, USA; e-mail: [email protected]

Abstract

Consider the partially linear model (PLM) with random design: $Y = X^T\beta^* + g(W) + u$, where $g(\cdot)$ is an unknown real-valued function, $X$ is $p$-dimensional, $W$ is one-dimensional, and $\beta^*$ is $s$-sparse. Our aim is to efficiently estimate $\beta^*$ based on $n$ i.i.d. observations of $(Y, X, W)$, with possibly $n < p$. The popular approaches and their theoretical properties developed so far mainly require explicit knowledge of certain function classes. In this paper, we propose an adaptive estimation procedure that automatically adapts to the model via a tuning-insensitive bandwidth parameter. In many cases, the proposed procedure is also shown to attain weaker scaling and noise requirements than the best existing ones. The proof rests on a general method for determining the estimation accuracy of a perturbed M-estimator and on new U-statistics tools, both of which are of independent interest.

Keywords: partially linear model; pairwise difference approach; adaptivity; minimum sample size requirement; heavy-tailed noise; degenerate U-processes.

1 Introduction

Partially linear regression involves estimating a vector $\beta^* = (\beta_1^*, \ldots, \beta_p^*)^T \in \mathbb{R}^p$ from independent and identically distributed (i.i.d.) observations $\{(Y_i, X_i, W_i),\ i = 1, \ldots, n\}$ satisfying

$$Y_i = X_i^T\beta^* + g(W_i) + u_i, \quad \text{for } i = 1, \ldots, n. \tag{1.1}$$

Here $X_i \in \mathbb{R}^p$; $Y_i, W_i, u_i \in \mathbb{R}$; $g(\cdot)$ is an unknown real-valued function; and $u_i$ is a noise term of finite variance, independent of $(X_i, W_i)$. This paper focuses on such problems when the number of covariates $p + 1$ is much larger than the sample size $n$, a setting we shall refer to in the sequel as a high dimensional partially linear model (PLM). Accordingly, we require $\beta^*$ to be $s$-sparse, namely, the number $s$ of nonzero elements in $\beta^*$ is smaller than $n$.

Depending on the smoothness of the function $g(\cdot)$, the following regression problems, each with sparse regression coefficient $\beta^*$, are special instances of the studied model (1.1):

(1) ordinary linear regression, when $g(\cdot)$ is constant-valued;

(2) partially linear Lipschitz models, when $g(\cdot)$ satisfies a Lipschitz-type condition (see, for example, Condition 7.1 in Li and Racine (2007) for a detailed description);

(3) partially smoothing spline models (Engle et al., 1986; Wahba, 1990), when $g(\cdot)$ can be well approximated by splines;

(4) partially linear jump-discontinuous regression, when $g(\cdot)$ may contain numerous jumps.
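To fix ideas, the following minimal sketch simulates one draw of data from model (1.1) in the high dimensional regime $n < p$. The sinusoidal $g(\cdot)$, the Gaussian design, and the Gaussian noise are illustrative assumptions only; any of the instances (1)–(4) above can be obtained by changing $g$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s = 200, 500, 5                    # high dimensional regime: n < p
beta_star = np.zeros(p)
beta_star[:s] = 1.0                      # beta* is s-sparse
X = rng.standard_normal((n, p))          # p-dimensional covariates X_i
W = rng.uniform(-1.0, 1.0, size=n)       # absolutely continuous scalar W_i
def g(w):                                # one illustrative unknown g(.)
    return np.sin(2.0 * np.pi * w)
u = rng.standard_normal(n)               # noise independent of (X, W)
Y = X @ beta_star + g(W) + u             # responses from model (1.1)
```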
In the literature, Bunea (2004), Bunea and Wegkamp (2004), Fan and Li (2004), Liang and Li (2009), Müller and van de Geer (2015), Sherwood and Wang (2016), Yu et al. (2016), and Zhu (2017), among many others, have studied the high dimensional partially linear model, largely following least squares approaches (Chen, 1988; Robinson, 1988; Speckman, 1988; Donald and Newey, 1994; Carroll et al., 1997; Fan and Huang, 2005; Cattaneo et al., 2017). For example, Zhu (2017) proposed to estimate $\beta^*$ through the following two-stage projection strategy:

$$\hat{m}_j = \operatorname*{argmin}_{\tilde{m}_j \in \mathcal{F}_j} \Big[\frac{1}{n}\sum_{i=1}^n \big(Z_{ij} - \tilde{m}_j(W_i)\big)^2 + \lambda_{j,n} \|\tilde{m}_j\|_{\mathcal{F}_j}^2\Big], \quad \text{where } Z_{i0} = Y_i \text{ and } Z_{ij} = X_{ij} \text{ for } j \in [p];$$

$$\hat{\beta}^{\mathrm{proj}} = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \Big[\frac{1}{n}\sum_{i=1}^n \big(\hat{V}_{0i} - \hat{V}_i^T\beta\big)^2 + \lambda_n \|\beta\|_1\Big], \quad \text{where } \hat{V}_{0i} = Y_i - \hat{m}_0(W_i) \text{ and } \hat{V}_{ij} = X_{ij} - \hat{m}_j(W_i).$$

Here $\{\mathcal{F}_j,\ 0 \le j \le p\}$ is a series of pre-specified function classes, $\|\cdot\|_{\mathcal{F}_j}$ is the associated norm, $\|\cdot\|_q$ is the vector $\ell_q$-norm, and $[p]$ denotes the set of integers between 1 and $p$. In a related study, Müller and van de Geer (2015) proposed a penalized least squares approach:

$$\big(\hat{\beta}^{\mathrm{LSE}}, \hat{g}\big) := \operatorname*{argmin}_{\beta \in \mathbb{R}^p,\ \tilde{g} \in \mathcal{G}} \Big[\frac{1}{n}\sum_{i=1}^n \big(Y_i - X_i^T\beta - \tilde{g}(W_i)\big)^2 + \lambda_n \|\beta\|_1 + \mu_n^2 \|\tilde{g}\|_{\mathcal{G}}^2\Big].$$

Here $\mathcal{G}$ is a pre-specified function class allowed to be of infinite dimension, and two tuning parameters $\lambda_n$ and $\mu_n$ are employed to induce sparsity and smoothness separately. Both approaches above, which exemplify a large fraction of existing methods, often require knowledge of certain function classes: either $\mathcal{G}$ for penalized least squares estimators or $\{\mathcal{F}_j,\ 0 \le j \le p\}$ for the projection approach.

In this paper, we consider adaptive estimation and adopt an alternative approach for estimating $\beta^*$: Honoré and Powell's pairwise difference method (Honoré and Powell, 2005) with an extra lasso-type penalty. To motivate this procedure, for $i, j \in [n]$, write $\tilde{Y}_{ij} = Y_i - Y_j$, $\tilde{X}_{ij} = X_i - X_j$, $\tilde{W}_{ij} = W_i - W_j$, and $\tilde{u}_{ij} = u_i - u_j$. To avoid trivial settings of discrete-type $W$, hereafter we assume $W$ to be absolutely continuous, without loss of generality. The fact that $\mathbb{E}[\tilde{Y}_{ij} \mid \tilde{W}_{ij} = 0, X_i, X_j] = \tilde{X}_{ij}^T\beta^*$ then implies

$$\beta^* = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} L_0(\beta), \quad \text{where } L_0(\beta) := \mathbb{E}\big[f_{\tilde{W}_{ij}}(0)\,\big(\tilde{Y}_{ij} - \tilde{X}_{ij}^T\beta\big)^2 \,\big|\, \tilde{W}_{ij} = 0\big],$$

and $f_{\tilde{W}_{ij}}(0)$ stands for the density value of $\tilde{W}_{ij}$ at 0. A natural estimator of $\beta^*$ is

$$\hat{\beta}_{h_n} = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \Big\{\hat{L}_n(\beta; h_n) + \lambda_n \|\beta\|_1\Big\}. \tag{1.2}$$

Here $h_n$ is a specified bandwidth, $\lambda_n$ is a tuning parameter controlling the sparsity level,

$$\hat{L}_n(\beta; h_n) := \binom{n}{2}^{-1} \sum_{i<j} \frac{1}{h_n} K\Big(\frac{\tilde{W}_{ij}}{h_n}\Big)\big(\tilde{Y}_{ij} - \tilde{X}_{ij}^T\beta\big)^2 \tag{1.3}$$

is an empirical approximation to $L_0(\beta)$, and $K(\cdot)$ is a nonnegative kernel function that will be specified later in Assumption 5. When $\lambda_n = 0$, we recover Honoré and Powell's original estimator (Honoré and Powell, 2005); when $\lambda_n > 0$, we obtain a sparse solution that is more suitable for tackling high dimensional data (see the code sketch below).

Our analysis is under the triangular array setting, where $p = p_n$ and $s = s_n$ are allowed to grow with $n$. Three targets are in order: (1) we aim to show that the pairwise difference approach enjoys an adaptivity property, in the sense that our procedure does not rely on knowledge of the smoothness of $g(\cdot)$ and, as $h_n$ and $\lambda_n$ appropriately scale with $n$, the estimation error $\|\hat{\beta}_{h_n} - \beta^*\|_2^2$ can achieve sharp rates of convergence for a wide range of $g(\cdot)$ function classes with different degrees of smoothness; in addition, we will unveil an intrinsic tuning-insensitivity property for the choice of bandwidth $h_n$; (2) we will investigate the minimum sample size requirement for the pairwise difference approach; and (3) we will study the impact of heavy-tailed noise $u$ on estimation.
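The estimator (1.2)–(1.3) is simple enough to sketch in a few lines. The following minimal numpy implementation forms all pairwise differences, weights them by a kernel evaluated at $\tilde{W}_{ij}/h_n$, and minimizes the lasso-penalized quadratic objective by proximal gradient descent (ISTA). The Gaussian kernel and the ISTA solver are our own illustrative choices: the paper only requires $K(\cdot)$ to satisfy Assumption 5 and does not prescribe an optimization algorithm.

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding, the proximal operator of t*||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def pairwise_diff_lasso(Y, X, W, h, lam, n_iter=500):
    """Sketch of the penalized pairwise-difference estimator (1.2)-(1.3).

    Uses a Gaussian kernel for K(.) and solves the lasso-penalized
    objective by proximal gradient descent (ISTA); both are illustrative
    choices, not prescriptions from the paper.
    """
    n, p = X.shape
    i, j = np.triu_indices(n, k=1)                  # all pairs with i < j
    Yd, Xd, Wd = Y[i] - Y[j], X[i] - X[j], W[i] - W[j]
    K = np.exp(-0.5 * (Wd / h) ** 2) / np.sqrt(2.0 * np.pi)
    w = K / (h * len(Yd))                           # per-pair weight: (n choose 2)^{-1} h^{-1} K(.)
    H = 2.0 * Xd.T @ (Xd * w[:, None])              # Hessian of the smooth part of (1.2)
    step = 1.0 / (np.linalg.eigvalsh(H)[-1] + 1e-12)  # 1 / Lipschitz constant
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = -2.0 * Xd.T @ (w * (Yd - Xd @ beta))   # gradient of L_n(beta; h)
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta
```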
1.1 Adaptivity of pairwise difference approach

One major aim of this paper is to show the adaptivity of $\hat{\beta}_{h_n}$ to the smoothness of the function $g(\cdot)$. From the methodological perspective, no matter how smooth (or complex) $g(\cdot)$ is, the pairwise difference approach requires no tailoring. In addition, we will show that the bandwidth $h_n$ bears a tuning-insensitivity property (Sun and Zhang, 2012) in the sense that $h_n$ does not depend on the smoothness of $g(\cdot)$. Hence, methodologically, the procedure automatically adapts to the model.

From the theoretical perspective, the adaptivity of the pairwise difference approach is shown by analyzing $\|\hat{\beta}_{h_n} - \beta^*\|_2^2$ over a range of smoothness classes for $g(\cdot)$. To this end, we first construct a general characterization of the "smoothness" degree of $g(\cdot)$, in contrast to commonly used smoothness conditions. Recall that $\hat{L}_n(\beta; h_n)$ is an empirical approximation of $L_0(\beta)$ as $h_n \to 0$. Define

$$\beta^*_h = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \mathbb{E}\hat{L}_n(\beta; h) \tag{1.4}$$

to be the minimizer of a perturbed version of the objective $L_0$, and recall that $\beta^*$ is the minimizer of $L_0$ itself. Intuitively, as $h$ gets closer to 0, $L_h(\beta) := \mathbb{E}\hat{L}_n(\beta; h)$ is pointwise closer to $L_0(\beta)$, and hence $\beta^*_h$ should converge to $\beta^*$. In Assumption 11, we characterize this rate of convergence. Specifically, it is assumed that there exist two positive absolute constants $\zeta$ and $\gamma$ such that, for any $h$ small enough,

$$\text{(general smoothness condition)} \qquad \|\beta^*_h - \beta^*\|_2 \le \zeta h^{\gamma}. \tag{1.5}$$

Condition (1.5) is related to Honoré and Powell's condition on the existence of a Taylor expansion of $\beta^*_h$ around $\beta^*$, i.e.,

$$\beta^*_h = \beta^* + \sum_{\ell=1}^{D} b_\ell h^\ell + o(n^{-1/2}), \tag{1.6}$$

which, when $p$ is fixed, implies (1.5) with $\gamma = 1$. Compared to (1.6), however, (1.5) is more general, remains suitable when $p = p_n \to \infty$ as $n \to \infty$, and can be verified for many function classes. To illustrate this, in Theorems 3.4 and 3.5 we calculate a lower bound on $\gamma$ for a series of (discontinuous) piecewise $\alpha$-Hölder function classes.

We are now ready to describe our main theoretical results. They show the adaptivity of $\hat{\beta}_{h_n}$ through the analysis of $\|\hat{\beta}_{h_n} - \beta^*\|_2^2$ with either $\gamma = 1$ (corresponding to the "smooth" case) or $\gamma < 1$ (corresponding to the "non-smooth" case). In addition to Condition (1.5), we impose the following six commonly posed conditions on the choice of the kernel and on the distribution of the model (an informal numerical illustration of the adaptivity claim follows the list):

(1) the kernel function $K(\cdot)$ satisfies several routinely required conditions (Assumption 5);

(2) $W$ is not degenerate given $X$ (Assumption 6);

(3) conditioning on $\tilde{W}_{12} = 0$, $\tilde{X}_{12}$ satisfies a restricted eigenvalue condition (Bickel et al., 2009; van de Geer and Bühlmann, 2009) (Assumption 7);

(4) $\tilde{W}_{12}$ is informative near zero (Assumption 8);

(5) conditioning on any $W = w$, $X$ is multivariate subgaussian, i.e., any linear contrast of $X$ is subgaussian (Assumption 9);

(6) $u$ is subgaussian or, as studied later, of finite variance (Assumptions 10 and 17, respectively).
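As an informal complement to the theory, the sketch below (continuing the simulation and `pairwise_diff_lasso` sketches above) runs the identical procedure, over the identical bandwidth grid, under a smooth $g$ and a jump-discontinuous $g$; nothing is retailored between the two cases. The particular functions, the bandwidth grid, and the value $\lambda_n = 0.3$ are all hypothetical choices made for the demonstration, not values prescribed by the paper.

```python
# Continues the simulation and pairwise_diff_lasso sketches above.
for name, g_fn in [("smooth", lambda w: np.sin(2.0 * np.pi * w)),
                   ("jumpy",  lambda w: np.where(w > 0.0, 1.0, -1.0))]:
    Y = X @ beta_star + g_fn(W) + u          # regenerate Y under each g
    for h in (0.05, 0.10, 0.20):             # a crude bandwidth grid
        beta_hat = pairwise_diff_lasso(Y, X, W, h=h, lam=0.3)
        err = np.linalg.norm(beta_hat - beta_star)
        print(f"g = {name}, h = {h:.2f}: l2 error = {err:.3f}")
```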