Adaptive Estimation of High Dimensional Partially Linear Model

Fang Han,∗ Zhao Ren,† and Yuxin Zhu‡

May 6, 2017

Abstract Consider the partially linear model (PLM) with random design: Y = XTβ∗ + g(W ) + u, where g(·) is an unknown real-valued function, X is p-dimensional, W is one-dimensional, and β∗ is s-sparse. Our aim is to efficiently estimate β∗ based on n i.i.d. observations of (Y,X,W ) with possibly n < p. The popular approaches and their theoretical properties so far have mainly been developed with an explicit knowledge of some function classes. In this paper, we propose an adaptive estimation procedure, which automatically adapts to the model via a tuning-insensitive bandwidth parameter. In many cases, the proposed procedure also proves to attain weaker scaling and noise requirements than the best existing ones. The proof rests on a general method for determining the estimation accuracy of a perturbed M-estimator and new U- tools, which are of independent interest.

Keywords: partially linear model; pairwise difference approach; adaptivity; minimum sample size requirement; heavy-tailed noise; degenerate U-processes.

1 Introduction

∗ ∗ ∗ T p Partially involves estimating a vector β = (β1 , . . . , βp ) ∈ R from independent and identically distributed (i.i.d.) observations {(Yi,Xi,Wi), i = 1, . . . , n} satisfying T ∗ Yi = Xi β + g(Wi) + ui, for i = 1, . . . , n. (1.1) p Here Xi ∈ R , Yi,Wi, ui ∈ R, g(·) is an unknown real-valued function, and ui stands for a noise term of finite and independent of (Xi,Wi). This paper is focused on such problems when the number of covariates p + 1 is much larger than the sample size n, which in the sequel we shall refer to as a high dimensional partially linear model (PLM). For this, we regulate β∗ to be s-sparse, namely, the number of nonzero elements in β∗, s, is smaller than n. According to the smoothness of function g(·), the following regression problems, with sparse regression coefficient β∗, are special instances of the studied model (1.1):

(1) The ordinary linear regression models, when g(·) is constant-valued.

∗Department of Statistics, University of Washington, Seattle, WA 98195, USA; e-mail: [email protected] †Department of Statistics, University of Pittsburgh, Pittsburgh, PA 15260, USA; email: [email protected] ‡Department of , Johns Hopkins University, Baltimore, MD 21205, USA; e-mail: [email protected]

1 (2) Partially linear Lipschitz models, when g(·) satisfies a Lipschitz-type condition (see, for ex- ample, Condition 7.1 in Li and Racine(2007) for detailed descriptions).

(3) Partially smoothing spline models (Engle et al., 1986; Wahba, 1990), when g(·) can be well approximated by splines.

(4) Partially linear jump discontinuous regression, when g(·) can contain numerous jumps.

In literature, Bunea(2004), Bunea and Wegkamp(2004), Fan and Li(2004), Liang and Li (2009), M¨ullerand van de Geer(2015), Sherwood and Wang(2016), Yu et al.(2016), Zhu(2017), among many others, have studied the high dimensional partially linear model, largely following the least squares approaches (Chen, 1988; Robinson, 1988; Speckman, 1988; Donald and Newey, 1994; Carroll et al., 1997; Fan and Huang, 2005; Cattaneo et al., 2017). For example, Zhu (Zhu, 2017) proposed to estimate β∗ through the following two-stage projection strategy: n h 1 X 2 i m = argmin Z − m (W ) + λ km k2 , where Z = Y and Z = X for j ∈ [p]; b j n ij e j i j,n e j Fj i0 i ij ij me j ∈Fj i=1 n proj 1 X T 2 βb = (Vb0i − Vb β) + λnkβk1, where Vb0i = Yi − m0(Wi) and Vbij = Xij − mj(wi). n i b b i=1

Here {Fj, 0 ≤ j ≤ p} are a series of pre-specified function classes, k·kFj is the associated norm, k·kq is the vector `q-norm, and [p] denotes the set of integers between 1 and p. In a related study, M¨uller and van de Geer (M¨ullerand van de Geer, 2015) proposed a penalized least squares approach: n  LSE h 1 X  T 2 2 2 i βb , g := argmin Yi − Xi β − g(Xi) + λnkβk1 + µnkgkG . b p n e e β∈R ,ge∈G i=1 Here G is a pre-specified function class allowed to be of infinite dimension. Two tuning parameters λn, µn are employed to induce sparsity and smoothness separately. Both approaches mentioned above, exemplifying a large fraction of existing methods, often require a knowledge of some function classes, either G for penalized least squares estimators or {Fj, 0 ≤ j ≤ p} for the projection approach. In this paper, we consider adaptive estimation and adopt an alternative approach, Honor´eand Powell’s pairwise difference method (Honor´eand Powell, 2005) with an extra lasso-type penalty, for estimating β∗ . In order to motivate this procedure, for i, j ∈ [n], write Yeij = Yi − Yj, Xeij = Xi − Xj, Wfij = Wi − Wj, and ueij = ui − uj. For avoiding trivial settings of discrete type W , hereafter we assume W to be absolutely continuous without loss of generality. The fact that T ∗ E[Yeij|Wfij = 0,Xi,Xj] = Xeijβ then implies ∗ h T 2 i β = argmin L0(β), where L0(β) := f (0)(Yeij − Xe β) | Wfij = 0 , E Wfij ij β∈Rp ∗ and f (0) stands for the density value of Wfij at 0. A natural estimator of β is Wfij n o βbhn = argmin Lbn(β, hn) + λnkβk1 . (1.2) β∈Rp

2 Here hn is a specified bandwidth, λn is a tuning parameter to control the sparsity level,  −1 n X 1 Wfij  T 2 Lbn(β, hn) := K Yeij − Xe β (1.3) 2 h h ij i 0, we obtain a sparse solution that is more suitable for tackling high dimensional data. Our analysis is under the triangular array setting where p = pn and s = sn are allowed to grow with n. Three targets are in order: (1) we aim to show that the pairwise difference approach enjoys the adaptivity property in the sense that our procedure does not rely on the knowledge of ∗ 2 smoothness of g(·), and as hn and λn appropriately scale with n, the estimation error kβbhn − β k2 can achieve sharp rates of convergence for a wide of g(·) function classes of different degrees of smoothness. In addition, we will unveil an intrinsic tuning insensitivity property for the choice of bandwidth hn; (2) we will investigate the minimum sample size requirement for the pairwise difference approach; and (3) we will study the impact of heavy-tailed noise u on estimation.

1.1 Adaptivity of pairwise difference approach

One major aim of this paper is to show the adaptivity of βbhn to the smoothness of function g(·). From the methodological perspective, no matter how smooth (complex) the g(·) function is, one does not need to tailor the pairwise difference approach. In addition, we will show that the bandwidth hn bears a tuning insensitive property (Sun and Zhang, 2012) in the sense that hn does not depend on the smoothness of function g(·). Hence, methodologically, the procedure can automatically adapt to the model. From the theoretical perspective, the adaptivity property of the pairwise difference approach ∗ 2 is shown via analyzing kβbhn − β k2 for a range of smooth function classes of g(·). To this end, we first construct a general characterization of the “smoothness” degree for g(·). In contrast to commonly used smoothness conditions, a new characterization of smoothness degree is introduced. Recall that Lbn(β, hn) is an empirical approximation of L0(β) as hn → 0. Define ∗ βh = argmin ELbn(β, h) (1.4) β∈Rp ∗ to be the minimum of a perturbed objective function of L0. Define β to be the minimum of ∗ L0. Intuitively, as h is closer to 0, Lh(β) := ELbn(β, h) is pointwise closer to L0(β), and hence βh shall converge to β∗. In Assumption 11, we characterize this rate of convergence. Specifically, it is assumed that there exist two positive absolute constants ζ and γ, such that for any h small enough, ∗ ∗ γ (general smoothness condition) kβh − β k2 ≤ ζh . (1.5) Condition (1.5) is related to Honor´eand Powell’s condition on the existence of Taylor expansion of ∗ ∗ βh around β , i.e., D ∗ ∗ X ` −1/2 βh = β + b`h + o(n ), (1.6) `=1

3 where, when p is fixed, (1.6) implies (1.5) with γ = 1. However, compared to (1.6), (1.5) is more general and suitable when p = pn → ∞ as n → ∞, and can be verified for many function classes. To illustrate it, in Theorems 3.4 and 3.5, we calculate a lower bound of γ for a series of (discontinuous) piecewise α-H¨olderfunction classes.

We are now ready to describe our main theoretical results. They show the adaptivity of βbhn ∗ 2 through the analysis of kβbhn − β k2 with either γ = 1 (corresponding to the “smooth” case) or γ < 1 (corresponding to the “non-smooth” case). In addition to Condition (1.5), we pose the following six commonly posed conditions on the choice of kernel and distribution of the model:

(1) the kernel function K(·) satisfies several routinely required conditions (Assumption5);

(2) W is not degenerate given X (Assumptions6);

(3) conditioning on Wf12 = 0, Xe12 satisfies a restricted eigenvalue condition (Bickel et al., 2009; van de Geer and B¨uhlmann, 2009) (Assumption7);

(4) Wf12 is informative near zero (Assumption8);

(5) conditioning on any W = w, X is multivariate subgaussian, i.e., any linear contrast of X is subgaussian (Assumption9);

(6) u is subgaussian, or as later studied, of finite variance (Assumptions 10 and 17 respectively).

We investigate the “smooth” case first in Section3. Theorem 3.1 proves, for every s-sparse ∗ vector β , as long as γ = 1 and hn, λn are properly chosen, with high probability, ∗ 2 kβbhn − β k2 . s log p/n, (1.7) where by the symbol “.” we “≤” up to a multiplicative absolute constant. We hence recover the same rate-optimality phenomenon as those using projection-based (Zhu, 2017) or penalized least squares approaches (M¨ullerand van de Geer, 2015; Yu et al., 2016) in high dimensions. For more discussions on this bound, see Remarks 3.7 and 3.8. In Section3 we further study these g(·) functions in the regime γ < 1. For describing the results T on the “non-smooth” case, additional notation is needed. Define ηn := kE[Xe12Xe12 | Wf12 = 0]k∞ to be a characterization of sparsity of the covariance matrix, with k · k∞ representing the matrix induced `∞ norm. In Theorems 3.2 and 3.3, we prove that, when γ ∈ [1/4, 1], for appropriately chosen hn, λn, the inequality γ γ ∗ 2 h 2 ns log p log p o log p i kβbh − β k min η · + , s (1.8) n 2 . n n n n holds with high probability. Inequalities (1.7) and (1.8) together demonstrate the adaptivity behavior of the pairwise differ- ence estimator βbhn . Depending on the smoothness degree of g(·), one gets a whole range of rates  2 γ γ from s log p/n to min ηn ·{s log p/n + (log p/n) }, s(log p/n) , while we do not need to tailor the estimation procedure for different g(·) up to a selection of (hn, λn). In addition, compared to λn, the bandwidth hn bears a tuning-insensitive property and much less effort is required for selecting

4 1/2 it. In particular, we can assign hn to be 2(log p/n) , which will automatically adapt to the g(·) function and achieve the best convergence rates we can derive. Interestingly, this bandwidth level is intrinsic and cannot be altered. For more details, see Remark 2.2 and Remark 3.8. The bound (1.8) behaves differently depending on the scales of (n, p, s), ηn, and γ. It also reveals an interesting rate-optimality phenomenon. Particularly, the pairwise difference approach is still possible to derive an estimator with desired rate of convergence s log p/n, provided that ηn 1−γ is upper bounded by an absolute constant and (n/ log p) . s. More details on this bound is in Remark 3.9. A lower bound of the γ-value corresponding to α-H¨olderfunctions or some possibly discontinuous functions is calculated in Theorem 3.5. Of note, the pairwise difference approach could handle α-H¨olderclasses with α ≤ 1/2 and could even achieve convergence rate s log p/n under some scaling conditions for 1/4 ≤ α ≤ 1/2. See Corollary 3.1 for further details. ∗ 2 We briefly discuss the technical challenges in determining the upper bounds of kβbhn − β k2. First, although β∗ is s-sparse, β∗ is usually a dense vector. Accordingly, standard analysis of high hn dimensional M-estimators requiring sparsity of β∗ (see, for example, Negahban et al.(2012)) does hn not apply directly. For this, the proof rests on a general method for establishing the convergence rate of a possibly perturbed M-estimator. There, of independent interest, it is shown that Condi- tion (1.5) is near-sufficient for constructing a sharp analysis. Secondly, to achieve sharp rates of 1/2 convergence due to the bias-variance tradeoff, hn has to be set of the order (log p/n) , resulting in variance explosion of the Lbn(β, hn), formulated as a non-degenerate U-process. For this, a careless analysis exploiting the first-order Bernstein inequality (Proposition 2.3(a) in Ar- cones and Gine(1993), for example), coupled with a truncation argument, will render an inefficient bound. To overcome this, one needs to employ a more careful analysis, separately analyzing the sample-mean and degenerate U- components. Such a combined analysis, formulated as a user-friendly tail bound, is provided in Lemma 3.4. Thirdly, the verification of (1.5) for Lipschitz and possibly discontinuous piecewise H¨olderfunctions requires some additional treatments. For this, viewing an M-estimator as the minimum of a stochastic perturbation of the population loss function, we borrow known techniques in mathematical statistics (van der Vaart and Wellner, 1996), and couple them with some initial intuition from the perturbation theory (Davis and Kahan, 1970; Kato, 2013).

1.2 Additional properties The previous section highlights the adaptivity of the pairwise difference approach in handling a range of both “smooth” and “non-smooth” g(·) functions. As another contribution of this paper, in Sections3 and4, we will study some additional properties of the studied procedure. We start with the minimum sample size requirement for estimation consistency. In high di- mensional data analysis, sample size requirement is a key feature for measuring the performance of various methods. In literature of sparse learning, a variety of scaling requirements relating n to p and s have been posed. See, for example, Han and Liu(2016) for an illustration of “optimal rate under sub-optimal scaling” phenomenon in estimation of sparse principal components, Dezeure et al.(2015) for the scaling requirement in conducing inference for the lasso, and Ren et al.(2015) for the minimum scaling requirement in making inference for Gaussian graphical models.

5 In Theorem 3.1 and several follow-up results, provided that Condition (1.5) holds with γ = 1, it is shown that the pairwise difference approach attains the rate of convergence (1.7) under a mild scaling requirement: {(log p)4 ∨ s4/3(log p)1/3 ∨ s(log p)2}/n small enough. In many cases, this sample size requirement compares favorably with the existing ones. See the discussions in Section 6.1 for details. The improvement is viable due to the formulation of the pairwise difference estimator as the minimum of a smooth convex loss function Lbn(β, hn). This loss function is intrinsically a U-process as β evolves, and is technically easier to handle than the loss functions of projection/penalized least squares estimators that involve stochastic sequences of nonlinear functions. For more interpretation and why the order s4/3 appears, see Remark 3.7 and discussions in Section 6.2. From a different perspective, in Section4, we further investigate the impact of heavy-tailed noise u on the performance of the pairwise difference approach. Recent progresses in studying the lasso regression have revealed a fundamental property that, given a random design of multivariate subgaussian covariates, the lasso is able to maintain minimax rate-optimal estimation of β∗ as long as u is of a (2 + )-. See, for example, Theorem 1.4 in Lecu´eand Mendelson(2017a). We study the impact of heavy-tailed noise on estimation for the pairwise difference approach in Section 4. Particularly, in Corollary 4.1, we will show that the result in Lecu´eand Mendelson(2017a) applies to the pairwise difference approach.

1.3 Notation

+ Throughout the paper, we define R, Z, and Z to be sets of real numbers, integers, and positive +  integers. For n ∈ Z , write [n] = 1, . . . , n . Let 1(·) stand for the indicator function. For arbitrary 0 p Pp 1 q Pp q vectors v, v ∈ R and 0 < q < ∞, we define kvk0 = j=1 (vj 6= 0), kvkq = j=1 |vj| , and 0 Pp 0 p×q Pq hv, v i = j=1 vjvj. For an arbitrary matrix Ω = (Ωij) ∈ R , write kΩk∞ = maxi∈[p] j=1 |Ωij|. For a symmetric real matrix Ω, let λmin(Ω) denote its smallest eigenvalue. For a set S, we denote |S| c p to be its cardinality and S to be its complement. For a vector v ∈ R and an index set S, we write |S| vS ∈ R to be the sub-vector of v of components indexed by S. For a real function f : X → R, k T let kfk∞ = supf∈X f(x). For an arbitrary function f : R → R, we use ∇f = (∇1f, . . . , ∇kf) to p denote its gradient. For some absolutely continuous random vector X ∈ R , let fX denote its density function, FX denote its distribution function, and ΣX denote its covariance matrix. For some joint T T p+1 p m continuous random vector (X ,W ) ∈ R and some measurable function ψ(·): R → R , let fW |ψ(X)(w, z) denote the value of the conditional density of W = w given ψ(X) = z. For any two numbers a, b ∈ R, we define a ∨ b = max(a, b) and a ∧ b = min(a, b). For any two real sequences {an} and {bn}, we write an . bn, or equivalently bn & an, if there exists an absolute constant C such that |an| ≤ C|bn| for any large enough n. We write an  bn if an . bn and bn . an. We + 0 0 denote Ip to be the p×p identity matrix for p ∈ Z . Let c, c ,C,C > 0 be generic constants, whose actual values may vary from place to place.

1.4 Paper organization The rest of this paper is organized as follows. In Section2, we introduce a general method for determining the rate of convergence of M-estimators in which the minimizer of the population risk

6 can be a perturbation of the true parameter. This general method lays the foundation for later sections. In Section3, we highlight the adaptivity property of the pairwise difference approach, and show the rate-optimality in estimating β∗ of the pairwise difference approach, i.e., inequalities (1.7) and (1.8). We first show these main results under the general smoothness condition defined in (1.5), and then for some specific function classes of g(·), including some discontinuous ones. Technical challenges related to controlling U-processes are also discussed in this section. In Section 4 we extend the analysis in Section3 and allow a heavy-tailed noise. Section5 contains some finite sample analysis. Discussions are provided in Section6, with all the proofs relegated to a supplementary appendix.

2 A general method

∗ p Let θ ∈ R be a p-dimensional parameter of interest. Suppose that Z1,...,Zn with n < p follow a distribution indexed by parameters (θ∗, η∗), where η∗ is possibly infinite-dimensional and stands ∗ p for a nuisance parameter. Suppose further that θ minimizes a loss function Γ0(θ) defined on R , which is independent of choice of the nuisance parameter and difficult to approximate. Instead, we observe a {Z1,...,Zn}-measurable loss function, Γbn(θ, h), such that Γh(θ) = EΓbn(θ, h) is a p perturbed version of Γ0(θ), namely, for each θ ∈ R ,

Γh(θ) → Γ0(θ) as h → 0. We define ∗ θh = argmin Γh(θ), θ∈Rp and for a chosen bandwidth hn > 0 and a tuning parameter λn ≥ 0 that scale with n, n o θbhn = argmin Γbn(θ, hn) + λnkθk1 . θ∈Rp To motivate this setting, recall that the pairwise difference estimator defined in the Introduction can be formulated into this framework, with ∗ ∗ θ = β , Γ0(θ) = L0(β), Γbn(θ, h) = Lbn(β, h), and Γh(θ) = EΓbn(θ, h). ∗ In this section, we present a general method for establishing consistency of θbhn to θ with explicit rate of convergence provided. This method clearly has its origins in Negahban et al.(2012), and is recast into the current form for the purpose of our setting. Before introducing the main result, some more assumptions and notation are in order. For establishing consistency in high dimensions, we need a notion of intrinsic dimension (Amelunxen et al., 2014) that has to be of order smaller than n. This paper is focused on sparsity, an assumption that has been well-accepted in literature (B¨uhlmannand van de Geer, 2011).

Assumption 1 (Sparsity condition). Assume that there exists a positive integer s = sn < n ∗ such that kθ k0 ≤ s.

∗ Of note, θh is usually no longer a sparse vector. Thus, the framework in Negahban et al.(2012) cannot be directly applied to obtain the error bound between θbhn and the population minimizer

7 θ∗ . In addition, our ultimate goal is to provide the rate of convergence of θ to θ∗ rather than hn bhn θ∗ . These two facts motivate us to characterize the behavior of θ by carefully constructing a hn bhn surrogate sparse vector θ∗ ∈ p, which possibly depends on h . Then the estimation accuracy ehn R n of θ can be well controlled via bounding θ − θ∗ and θ∗ − θ∗ separately. Specifically, for the bhn bhn ehn ehn latter term we assume that there exist a positive number ρn and an integer sen > 0 such that kθ∗ − θ∗k ≤ ρ and kθ∗ k = s . (2.1) ehn 2 n ehn 0 en Possible choices of θ∗ include θ∗ and a hard thresholding version of θ∗ . ehn hn We then proceed to characterize the first term θ −θ∗ above by studying the relation between bhn ehn h and λ through the surrogate sparse vector θ∗ . This characterization is made via the following n n ehn perturbation level condition that corresponds to Equation (23) in Negahban et al.(2012).

+ Assumption 2 (Perturbation level condition). Assume a sequence {1,n, n ∈ Z } such that + 1,n → 0 as n → ∞, and for each n ∈ Z ,  ∗ P 2 ∇kΓbn(θehn , hn) ≤ λn for all k ∈ [p] ≥ 1 − 1,n. (2.2)

Remark 2.1. Assumption2 demands more words. While often we have ∇ Γ (θ∗ , h ) = 0, the E k bn hn n same conclusion does not always apply to ∇ Γ (θ∗ , h ). Hence, the mean value of ∇ Γ (θ∗ , h ) E k bn ehn n k bn ehn n is intrinsically a measure of the bias level of our estimator θ away from the surrogate θ∗ , which bhn ehn also characterizes the perturbation of Γ to Γ especially for the choice of θ∗ = θ∗. Due to this h 0 ehn reason, we call it the perturbation level condition.

Remark 2.2. Concerning the studied pairwise difference estimator, we will see in Section3 that 1/2 1/2 (2.2) usually reduces to a requirement λn & hn + (log p/n) and hn & (log p/n) , which has sharply regulated the scales of the pair (hn, λn). Compared to the typical requirement λn & 1/2 (log p/n) , the extra term hn quantifies the bias level. In addition, the requirement hn & (log p/n)1/2 is intrinsic to our pairwise difference approach. In order to have a sharp choice of 1/2 1/2 λ  (log p/n) , hn has to be chosen at the order of (log p/n) (see Remarks 3.5-3.6 for further details).

We then move on to define an analogy of the restricted eigenvalue (RE) condition for lasso introduced in Bickel et al.(2009) (see also, compatibility condition in van de Geer and B¨uhlmann (2009), among other similar conditions.). To this end, we denote the first-order Taylor series error, evaluated at θ∗ , to be ehn ∗ ∗ ∗ δΓbn(∆, hn) = Γbn(θehn + ∆, hn) − Γbn(θehn , hn) − h∇Γbn(θehn , hn), ∆i. Further define sets  ∗  p Sen = j ∈ [p]: θe 6= 0 and C = ∆ ∈ : k∆ c k1 ≤ 3k∆ k1 . hn,j Sen R Sen Sen We are now ready to state the empirical RE assumption.

Assumption 3 (Empirical restricted eigenvalue condition). Assume, for any h > 0, Γbn(θ, h) is convex in θ. In addition, assume that there exist positive absolute constant κ1 and radius r such

8 + + that there exists a sequence {2,n, n ∈ Z } with 2,n → 0 as n → ∞, and for each n ∈ Z ,  2  p  δΓn(∆, hn) ≥ κ1k∆k for all ∆ ∈ C ∩ ∆ ∈ : k∆k2 ≤ r ≥ 1 − 2,n. P b 2 Sen R

With sen = |Sen| and ρn defined in (2.1), we are then able to present the general method for determining the convergence rate of θbhn . 1/2 Theorem 2.1. Provided λn ≤ κ1r/3sen and Assumptions1-3 stand, the inequality 2 ∗ 2 18senλn 2 kθbhn − θ k2 ≤ 2 + 2ρn κ1 holds with probability at least 1 − 1,n − 2,n.

Of note, in the sequel, r can always be set as ∞. For these cases, the first condition in Theorem 2.1 vanishes, resulting in a theorem (slightly) extending Theorem 1 in Negahban et al.(2012).

3 Pairwise difference approach for partially linear model

The goal of this section is to prove inequalities (1.7) and (1.8) via the general method introduced in Section2. Specifically, in Section 3.1 we first introduce some general assumptions imposed on the kernel as well as the distribution of the model. Using Theorem 2.1, Section 3.2 presents the main results, i.e., inequalities (1.7) and (1.8), under the general smoothness condition (1.5). Section 3.3 specializes the main results in Section 3.2 by focusing on several specific classes of g(·) functions, including H¨olderfunction classes as well as some discontinuous ones. We close this section by highlighting two distinct technical challenges intrinsic to the pairwise difference approach. Some additional notation is introduced. In what follows, we write (Y, X, W, u) to be a copy of (Y1,X1,W1, u1), and (Y,e X,e W,f ue) to be a copy of (Ye12, Xe12, Wf12, ue12). Recall that without loss of generality, we assume W to be absolutely continuous with regard to the Lebesgue measure, since discrete type W would render a much simpler situation for analyzing βbhn in (1.2).

3.1 Regularity assumptions We start with some common assumptions in literature. The first assumption formally states the model described in the Introduction. Of note, we do not require either X or u to be mean-zero.

Assumption 4. Assume {(Yi,Xi,Wi, ui), i ∈ [n]} are i.i.d. random variables following (1.1) with p ∗ p ∗ Xi ∈ R , Yi,Wi, ui ∈ R, and p > n. Assume β ∈ R to be s-sparse with s := kβ k0 < n. Our second assumption concerns the kernel function K(·) specified in (1.3). R +∞ Assumption 5. Assume a positive kernel K(·) such that −∞ K(w) dw = 1. Further assume there exists some positive absolute constant MK , such that Z +∞ n 3 o max |w| K(w) dw, sup |w|K(w), sup K(w) ≤ MK . −∞ w∈R w∈R For simplicity, we pick MK ≥ 1.

9 Remark 3.1. Assumption5 eliminates higher order kernel functions considered in literature (Li and Racine, 2007; Abrevaya and Shin, 2011), which would add non-convexity to the loss function (1.3), an issue that is not easy to fix with existing techniques. The assumption is mild, and (after rescaling) is satisfied by any of those classical kernel functions in Section 1.2 of Tsybakov(2009).

The remaining assumptions concern the distribution of (X, W, u). We first characterize the “conditional non-degeneracy” property of the pair (X,W ) in the following Assumptions6,7, and 70. This is also related to the identification condition of β∗ posed in Section 7.1 of Li and Racine (2007).

Assumption 6. Assume that the conditional density of W is smooth enough, or more specifically, assume there exists a positive absolute constant M such that

n ∂fW |X (w, x) o sup , fW |X (w, x) ≤ M. w,x ∂w Assumption 7. Define S ⊂ [p] to be the support of β∗. Assume there exists some positive absolute  0 p 0 0 T T  constant κ`, such that for any v ∈ v ∈ R : kvSc k1 ≤ 3kvS k1 , we have v E XeXe Wf = 0 v ≥ 2 κ`kvk2. As will be seen later, Assumption7 is sufficient when g(·) is “smooth” (e.g., globally α-H¨older with α ≥ 1. See Section 3.3.2). However, if g(·) is “non-smooth” (e.g., discontinuous or globally α-H¨olderwith α < 1. See Section 3.3.1), we need to pose a stronger lower eigenvalue condition summarized in the following assumption.

0  T  Assumption 7 . There exists an absolute constant κ` > 0, such that λmin E XeXe Wf = 0 ≥ κl. Remark 3.2. Assumption7 0 is posed to achieve two goals: characterizing the bias kβ∗ −β∗k and hn 2 the variance. While Assumption7 suffices for controlling the variance term, controlling the bias usually requires much stronger conditions. Indeed, it suffices to use the following two assumptions, which are consequences of Assumption7 0:

(A1) Assumption7 with its slight variant (referred to Assumption 13), posed for proving a new empirical RE condition;

0 ∗ ∗ 2 ∗ ∗ (A2) there exist positive absolute constants C,C > 0, such that kβh − β k2 ≤ C|L0(βh) − L0(β )| for any h ∈ (0,C0), posed for bounding kβ∗ − β∗k . hn 2 For (1.3) to be informative, besides requiring the pair (X,W ) to be conditionally non-degenerate, we also need Wf to possess a nonzero probability mass in a small neighborhood of 0. The next assumption, combined with Assumption6, guarantees this.

Assumption 8. There exists some positive absolute constant M , such that f (0) ≥ M . ` Wf ` Before introducing the last two regularity conditions, we formally define the subgaussian distri- 2 bution. We say that a random variable X is subgaussian with parameter σ , if E exp{t(X−E[X])} ≤ 2 2 exp(t σ /2) holds for any t ∈ R. We then define the following two conditions which regulate the tail behaviors of X and u.

10 Assumption 9. There exists some positive absolute constant κx, such that, conditional on W = w for any w in the range of W and unconditionally, hX, vi is subgaussian with parameter at most 2 2 p κxkvk2 for any v ∈ R .

Assumption 10. There exists some positive absolute constant κu, such that u is subgaussian with 2 parameter at most κu. Remark 3.3. There is no need to assume zero mean for X or u in Assumptions9-10 thanks to our pairwise difference approach. Assumption9 is arguably difficult to relax (Adamczak et al., 2011; Lecu´eand Mendelson, 2017b). Assumption 10 has been posed throughout the existing literature. In Section4, however, we will show that the pairwise difference approach can adopt a much milder moment condition for the noise u when Assumption9 holds.

3.2 Main results In this section we prove inequalities (1.7) and (1.8) under a general smoothness condition. The main results are stated in Theorems 3.1 for the “smooth” case and in Theorems 3.2-3.3 for the “non-smooth” case. ∗ We first formally pose the smoothness condition (1.5). Recalling βh defined in (1.4), we assume ∗ ∗ that βh converges to β in a certain polynomial rate with regard to h. Assumption 11 (General smoothness condition). Assume there exist absolute constants ζ > 0, γ ∈ (0, 1], and C0 > 0, such that for any h ∈ (0,C0) we have ∗ ∗ γ kβh − β k2 ≤ ζh . Remark 3.4. Assumption 11 is closely related to the Taylor expansion condition (1.6) posed in Honor´eand Powell(2005) and Aradillas-Lopez et al.(2007). There the authors verified the condition under several smoothness conditions on the model (see Section 3.3 in Honor´eand Powell (2005) for details). Unfortunately, in general their condition is ill-defined with increasing dimension. Instead, our condition can be viewed as a high dimensional analogy of Condition (1.6), which also includes the non-smooth case with γ < 1. We further require one more assumption involving function g(·). Denote  −1 n X 1 Wfij   Uk = K Xeijk g(Wi) − g(Wj) , for k ∈ [p], (3.1) 2 h h i

11 Remark 3.6. There is one interesting phenomenon on Assumption 12 that sheds some light on the advantage of the pairwise difference approach. Instead of using U-statistic in the objective function sp Lbn(β, hn) in (1.3), one may also consider the objective function Ln (β, hn) by naively splitting the data into two halves, where n/2 sp 2 X 1 nWf(2i−1)(2i) o T 2 L (β, hn) := K Ye(2i−1)(2i) − Xe β . n n h h (2i−1)(2i) i=1 n n sp By doing so, Assumption 12 also becomes a data-splitting based statistic Uk , which can be written sp as the sum of n/2 i.i.d. terms. However, if Uk is used, Assumption 12 is no longer valid as hn → 0. To see this, as hn → 0, the effective sample size becomes nhn, and thus the concentration rate is 1/2 nhn(log p/n) . Thanks to our pairwise difference approach, the sharp concentration can still hold 1/2 as long as hn & (log p/n) . Similar observations have been known in literature (cf. Section 6 in Ramdas et al.(2015)). Also see Lemma A2.21 in the supplement for further details on verifying Assumption 12. Our first result concerns the situation γ = 1. This corresponds to the “smooth” case, on which a vast literature of partially linear models has been focused (Li and Racine, 2007). Theorem 3.1 (Smooth case). Assume Assumption 11 holds with γ = 1 and that there exists some 1/2 absolute constant K1 > 0 such that K1(log p/n) < C0. In addition, assume Assumptions4-12 hold and that 1  1 n 3 4 1 2o hn ∈ [K1(log p/n) 2 ,C0), λn ≥ C hn + (log p/n) 2 , and n ≥ C (log p) ∨ s 3 (log p) 3 ∨ s(log p) , where the constant C only depends on M, MK , C0, κx, κu, κ`, M`, ζ, and K1. Then we have ∗ 2 0 2 0 0 P(kβbhn − β k2 ≤ C sλn) ≥ 1 − c exp(−c log p) − c exp(−c n) − n, 0 0 where C , c, c are three positive constants only depending on M, MK , C0, κx, M`, κ`, and C. Remark 3.7. Theorem 3.1 shows, in many settings, the sample size requirement n ≥ Cs4/3(log p)1/3 suffices for βbhn to be consistent and of the convergence rate s log p/n. It compares favorably to the best existing scaling requirement (see Section 6.1 for detailed comparisons with existing results). The term s4/3(log p)1/3 shows up in verifying the empirical restricted eigenvalue condition, Condi- tion3. In particular, the major effort is put on separately controlling a sample-mean-type random matrix and a degenerate U-matrix. See Theorem 3.7 for details. The sharpness of this sample size requirement is further discussed in Section 6.2. 1/2 Remark 3.8. The requirement hn ∈ [K1(log p/n) ,C0) in Theorem 3.1 is due to the pairwise difference approach in order to verify the perturbation level condition, Condition2. With regard to this bandwidth selection, Theorem 3.1 unveils a tuning-insensitive phenomenon, which also applies to all the following results. Particularly, via examining the theorem, it is immediate that one might 1/2 choose, say, hn = 2(log p/n) without any impact on estimation accuracy asymptotically. The constant K1 that achieves the best upper bound constant can be theoretically calculated. However, we do not feel necessary to provide it. 1/2 Picking hn, λn  (log p/n) , we recover the desired rate of convergence s log p/n, which proves to be minimax rate-optimal (Raskutti et al., 2011) as if there is no nonparametric component. This

12 result has been established using other procedures in literature under certain smoothness conditions of g(·). In contrast, our result shows that the pairwise difference approach can attain estimation rate-optimality in the “smooth” regime, while in various settings improving the best to date scaling requirement. Our next result concerns the “non-smooth” case γ < 1. Its proof is a slight modification to that of Theorem 3.1.

Theorem 3.2 (Non-smooth, case I). Assume Assumption 11 holds with a general γ ∈ (0, 1] and 1/2 that there exists some absolute constant K1 > 0 such that K1(log p/n) < C0. In addition, we assume Assumptions4-12 hold and that 1  1 γ n 3 4 1 2o hn ∈ [K1(log p/n) 2 ,C0), λn ≥ C (log p/n) 2 + hn , and n ≥ C (log p) ∨ s 3 (log p) 3 ∨ s(log p) , where constant C only depends on M, MK , C0, κx, κu, κ`, M`, ζ, γ, and K1. Then we have ∗ 2 0 2 0 0 P(kβbhn − β k2 ≤ C sλn) ≥ 1 − c exp(−c log p) − c exp(−c n) − n, 0 0 where C , c, c are three positive constants only depending on M, MK , C0, κx, M`, κ`, and C.

1/2 γ/2 γ Picking hn  (log p/n) and λn  (log p/n) , we obtain a series of upper bounds s(log p/n) as γ changes from 1 to 0. While the rate derived here is not always s log p/n, to the best of our knowledge there is little analogous theoretical results in literature focusing on functions g(·) leading to this setting, especially when γ is small. The only exceptions turn out to be Honor´eand Powell(2005) and Aradillas-Lopez et al.(2007), where the authors proved consistency with little assumption on g(·). However, the rate of convergence was not calculated, and the analysis was focused on fixed dimensional settings. By inspecting Theorem 3.2, one might attempt to conjecture that, in certain cases, the minimax optimal rate for βbhn is no longer the same as that in the smooth case. However, surprisingly to us, this is not always the case. As a matter of fact, one can still recover the rate s log p/n under a certain regime of (n, p, s). To this end, we first define an RE condition slightly stronger than Assumption7.

Assumption 13. Assume there exists some positive absolute constant κ` such that for any  0 p 0 0 2 2γ v ∈ v ∈ R : kvJ c k1 ≤ 3kvJ k1 for any J ⊂ [p] and |J | ≤ s + ζ nhn / log p , T  T  2 we have v E XeXe Wf = 0 v ≥ κ`kvk2. Theorem 3.3 (Non-smooth, case II). Assume Assumption 11 holds with a general γ ∈ [1/4, 1] 1/2 and that there exists some absolute constant K1 > 0 such that K1(log p/n) < C0. Define T T ηn = kE[XeXe |Wf = 0]k∞ as a measure of sparsity for E[XeXe |Wf = 0]. In addition, we further assume Assumptions4-6,8-13 hold and that 1/2  1/2 hn ∈ [K1(log p/n) ,C0), λn ≥ C hn + ηn(log p/n) , n o and n ≥ C (log p)3 ∨ q4/3(log p)1/3 ∨ q(log p)2 ,

2γ where q = s + nhn / log p, and the constant C > 0 only depends on M, MK , C0, κx, κu, κ`, M`, ζ,

13 γ and K1. Then we have 2 2γ n ∗ 2 0 2 s log p nλnhn o 0 0 kβbh − β k ≤ C sλ + + ≥ 1 − c exp(−c log p) − c exp(−c n) − n, P n 2 n n log p 0 0 where C , c, c are three positive constants only depending on M, MK , C0, κx, M`, κ`, and C.

1/2 1/2 Remark 3.9. Picking hn  (log p/n) , λn  ηn(log p/n) , Theorem 3.3 proves that, under the n −1 o sample size requirement n ≥ C (log p)3∨(1+γ ) ∨ s4/3(log p)1/3 ∨ s(log p)2 , the inequality

γ ∗ 2 2 ns log p log p o kβbh − β k η · + n 2 . n n n holds with high probability. On one hand, as ηn . 1 and γ = 1, the above bound reduces to the 1−γ 1−γ inequality presented in Theorem 3.1. On the other hand, as ηn . 1, γ < 1, and s(log p) & n , we can still recover the optimal rate s log p/n. In addition, although λ is set to be of possibly 1/2 different order in Theorems 3.1-3.3, hn is constantly chosen to be of the same order (log p/n) . We refer to the previous Remark 3.8 for discussions.

3.3 Case studies This section exploits and extends the main results, Theorems 2.1, 3.1, 3.2, and 3.3, to examine several special g(·) functions. For this, we first consider various H¨olderfunction classes as well as some discontinuous ones under a lower eigenvalue condition Assumption7 0. In Section 3.3.2, we show, for Lipschitz function class (and also higher order H¨olderclasses), rate-optimality can be attained under a weaker restricted eigenvalue condition, Assumption7, by directly applying Theorem 2.1. In the following, we formally define the α-H¨olderfunction class.

Definition 3.1. Let ` = bαc denote the greatest integer strictly less than α. Given a metric space (T, | · |), a function g : T → R is said to belong to (L, α)-H¨olderfunction class if |g(`)(x) − g(`)(y)| ≤ L|x − y|α−`, for any x, y ∈ T. Here g(`) represents the `-th derivative of g(·). We say that g(·) is α-H¨olderif g(·) belongs to a (L, α)-H¨olderfunction class for a certain absolute constant L > 0.

3.3.1 Cases under the general smoothness condition This section is devoted to general α-H¨olderfunctions and certain discontinuous ones. For this, we need a stronger lower eigenvalue condition, Assumption7 0. We also need a smooth condition on fW |X (w, x) below, slightly stronger than Assumption6.

Assumption 14. Assume that the conditional density of W is smooth enough, or more specifically, assume that there exists some positive absolute constant M, such that n ∂2f (w, x) o W |X sup 2 ≤ M. w,x ∂w We first consider any α-H¨olderfunction g(·) with α ≥ 1 and show that it yields Assumption 11 with γ = 1.

14 2 2 −1 Theorem 3.4. Assume h ≤ C0 for some positive constant C0, and that h ≤ κ`M` ·(4MMK κx) . In addition, assume g(·) to be 1-H¨older(or higher-order H¨olderon some compact support [a, b]). Under Assumptions4-6,7 0,8-9, and 14, we have ∗ ∗ kβh − β k2 ≤ ζh, 2 where ζ is a constant only depending on M,Mk,C0, Eue , κx, κ`,M`, the H¨olderconstant (and a, b). We then move on to the non-smooth case including certain discontinuous and α-H¨olderfunctions with α < 1. Specifically, we make the following assumption about g(·).

Assumption 15. There exist absolute constants Mg > 0, Md ≥ 0, Ma ≥ 0, and 0 < α ≤ 1, and 2 set A ∈ R , such that α  |g(w1) − g(w2)| ≤ Mg|w1 − w2| + Md1 (w1, w2) ∈ A , for any w1, w2 in the range of W , and that

h 1 Wfij  i K 1(W ,W ) ∈ A ≤ M h. E h h i j a We also assume that the set A is symmetric in the sense that (w1, w2) ∈ A implies (w2, w1) ∈ A.

Example 3.1. Suppose function g : R → R is piecewise (ML, α)-H¨olderfor some α ∈ (0, 1], and have discontinuity points a1, . . . , aJ with jump size bounded in absolute value by Cg, for positive absolute constants ML and Cg. Also suppose |fW (w)| ≤ M for some positive absolute constant M. Consider set J [ n o A = (−∞, aj] × [aj, +∞) ∪ [aj, +∞) × (−∞, aj] , j=1 and consider box kernel function K(w) = 1(|w| ≤ 1/2). Then α  |g(w1) − g(w2)| ≤ (J + 1)ML · |w1 − w2| + JCg1 (w1, w2) ∈ A , for any w1, w2 ∈ R, and that

h 1 Wfij  i K 1(W ,W ) ∈ A E h h i j J 2 X Z aj Z +∞ JM 2 = 1|w − w | ≤ h/2 f (w )f (w ) dw dw ≤ h. h 1 2 W 1 W 2 1 2 4 j=1 −∞ aj

Thus we have verified two equations in Assumption 15 with Mg = (J + 1)ML, Md = JCg, and 2 Ma = JM /4.

Example 3.2. Suppose W ∼ Unif[0, 1], kernel function K(w) = 1(w ∈ [−1/2, 1/2]), and ( w, x ∈ [0, 1/2), g(w) = w + 1, x ∈ [1/2, 1]. Suppose h ≤ 1/2 and consider set A = [1/4, 1/2]×[1/2, 3/4]∪[1/2, 3/4]×[1/4, 1/2]. One can easily

15 check that |g(w1) − g(w2)| ≤ 3|w1 − w2| + 1{(w1, w2) ∈ A}, and

h 1 Wfij  i K 1(W ,W ) ∈ A = h/4. E h h i j Thus we have verified two equations in Assumption 15 with Mg = 3, Md = 1, and Ma = 1/4.

2 2 −1 Theorem 3.5. Assume h ≤ C0 for some positive constant C0, and that h ≤ κ`M` ·(4MMK κx) . Under Assumptions4-6,7 0,8-9, and 14-15, we have ∗ ∗ γ kβh − β k2 ≤ ζh , 2 where ζ > 0 is a constant only depending on M,MK ,C0,Mg,Md,Ma, E[ue ], κx, κ`,M`, and γ = α if MdMa = 0, and γ = α ∧ 1/2 if otherwise.

Theorems 3.4-3.5 verify the general smoothness condition, Assumption 11. It suffices to verify Assumption 12 in order to apply those main results in Section 3.2, i.e., Theorems 3.1-3.3, to obtain the following corollary.

1/2 Corollary 3.1. Assume there exist some absolute constants K1,C0 > 0 such that K1(log p/n) < C0, and that 1/2  4 4/3 1/3 2 hn ∈ [K1(log p/n) ,C0) and n ≥ C (log p) ∨ q (log p) ∨ q(log p) , where the quantity q and the dependence of constant C are specified in three cases below. In addition, we assume Assumptions4-6,7 0,8-10, and 14 hold.

(1) Assume that g(·) is α-H¨olderfor α ≥ 1, and g(·) has compact support when α > 1.  1/2 Set q = s. Assume further that λ ≥ C hn + (log p/n) , where C only depends on M,MK ,C0, κx, κu, κ`,M`,K1, and H¨olderparameters of g(·). Then we have ∗ 2 0 2 0 0 P(kβbhn − β k2 ≤ C sλn) ≥ 1 − c exp(−c log p) − c exp(−c n), 0 0 where C , c, c are three positive constants only depending on M,MK ,C0, κx, κ`,M`,C.

(2) Assume Assumption 15 holds with α ∈ (0, 1]. Set q = s. Assume further that λn ≥  1/2 γ C (log p/n) + hn , where C only depends on M,MK ,C0, κx, κu, κ`,M`,K1,Mg,Md,Ma, and γ = α if MdMa = 0, γ = α ∧ 1/2 if otherwise. Then we have ∗ 2 0 2 0 0 P(kβbhn − β k2 ≤ C sλn) ≥ 1 − c exp(−c log p) − c exp(−c n), 0 0 where C , c, c are three positive constants only depending on M,MK ,C0, κx, κ`,M`,C.

T (3) Assume Assumption 15 holds with α ∈ [1/4, 1]. Recall ηn = kE[XeXe |Wf = 0]k∞. Set 2γ  1/2 q = s + nhn / log p. Assume further that λn ≥ C hn + ηn(log p/n) , where C only depends on M,MK ,C0, κx, κu, κ`,M`,K1,Mg,Md,Ma and γ = α if MdMa = 0, γ = α ∧ 1/2 if otherwise. Then we have 2 2γ n ∗ 2 0 2 s log p nλnhn o 0 0 kβbh − β k ≤ C sλ + + ≥ 1 − c exp(−c log p) − c exp(−c n), P n 2 n n log p 0 0 where C , c, c are three positive constants only depending on M,MK ,C0, κx, κ`,M`,C.

16 3.3.2 Smooth case under a weaker RE condition Assumption7 0 is arguably strong. Here we provide a rate-optimality result for Lipschitz function g(·) without requiring Assumption7 0. Instead, we only need a weaker RE condition, Assumption 7. Similar conclusion with corresponding analysis proceeds to α-H¨olderfunction with α > 1 and a compact support, and thus is omitted here.

Assumption 16 (Lipschitz condition). There exists an absolute constant Mg > 0, such that |g(w1) − g(w2)| ≤ Mg|w1 − w2|, for any w1, w2 in the range of W .

To achieve this goal, instead of using the general smoothness condition, i.e., Assumption 11, and Theorem 3.1 in Section 3.2, we first characterize the bias term in a different way to directly check the perturbation level condition, Condition2.

Lemma 3.1 (Bias calculation). Assume that there exist some absolute constants K1,C0 > 0 such 1/2 that K1(log p/n) < C0. In addition, we assume Assumptions5,6,9, 10, and 16 hold and that 1/2  1/2 4 hn ∈ [K1(log p/n) ,C0), λn ≥ C hn + (log p/n) , and n ≥ C(log p) , where the constant C only depends on M,MK ,C0, κx, κu,Mg,K1. Then we have  ∗ 0 P 2 ∇kLbn(β , hn) ≤ λn for all k ∈ [p] ≥ 1 − c exp(−c log p), where c and c0 are positive constants only depending on C.

1/2 Theorem 3.6. Assume that there exist absolute constants K1,C0 > 0 such that K1(log p/n) < C0. In addition, we assume Assumptions4-10, 16 hold and that 1  1 n 4 4 1 2o hn ∈ [K1(log p/n) 2 ,C0), λn ≥ C hn + (log p/n) 2 , and n ≥ C (log p) ∨ s 3 (log p) 3 ∨ s(log p) , where the constant C only depends on M, MK , C0, κx, κu, κ`, M`, Mg, K1. Then we have ∗ 2 0 2  0 0 P kβbhn − β k2 ≤ C sλn ≥ 1 − c exp(−c log p) − c exp(−c n) − n, 0 0 where C , c, c are three positive constants only depending on M,MK ,C0, κ`,M`,C.

It is worth highlighting that Assumption 16 allows the ranges of W and g(·) unbounded, and hence covers some nontrivial function classes whose metric entropy numbers, to our knowledge, are still unknown. See Section 6.1 for more details.

3.4 Technical challenges of the analysis This section highlights two distinct technical challenges we faced in proving the previous results in Sections 3.2-3.3. All the proofs are relegated to a supplementary material.

3.4.1 On verifying the RE condition The main results of the paper, including Theorems 3.1, 3.2, 3.3, 3.6, as well as Corollary 3.1, are all based on the general method introduced in Section2. For this, one major object of interest is to verify the empirical RE condition (Assumption3 in Section2) based on the population RE conditions such as Assumption7 and its variant Assumption 13. This result is formally stated in

17 Corollary 3.2 at the end of this section. The proof rests on several advanced U-statistics exponential inequalities (Gin´eet al., 2000; Houdr´eand Reynaud-Bouret, 2003) and nonasymptotic random matrix analysis tools specifically tailored for U-matrices (Vershynin, 2012; Mitra and Zhang, 2014), and thus deserves a discussion. We start with a definition of the restricted spectral norm (Han and Liu, 2016). For an arbitrary p by p real matrix M and an integer q ∈ [p], the q-restricted spectral norm kMk2,q of M is defined to be vTMv kMk2,q := max . p T v∈R ,kvk0≤q v v As pointed in the seminal paper Rudelson and Zhou(2013), the empirical RE condition, i.e., Assumption3, is closely related to the q-restricted spectral norm of Hessian matrix for the loss function regarding a special choice of q. Our proof relies on a study of this q-restricted spectral norm. In Assumption3, letting Γbn(θ, hn) = Lbn(β, hn), simple algebra yields  −1 Tn n X 1 Wfij  To T δLbn(∆, hn) = ∆ K XeijXe ∆ = ∆ Tbn∆. 2 h h ij i

Note that Tbn is a random U-matrix, namely, a random matrix formulated as a matrix-valued U-statistic. As was discussed in the previous sections, hn is usually picked to be of the order 1/2 (log p/n) , rendering a large bump as Wfij is close to zero. Consequently, when hn is set in the −1 T 2 regime of interest, the variance of the kernel g∆(Di,Dj) = hn K(Wfij/hn)(Xeij∆) will explode at the rate of (n/ log p)1/2, leading to a loose and sub-optimal bound when using Bernstein inequality for non-degenerate U-statistics (see, e.g., Proposition 2.3(a) in Arcones and Gine(1993)). Thus a more careful study of this random U-matrix Tbn is need. The next theorem gives a concentration inequality for Tbn under the q-restricted spectral norm.

Theorem 3.7. For some q ∈ [p], suppose there exists some absolute constant C > 0 such that n ≥ C · q4/3(log p)1/3 ∨ q(log p)2 + log(1/α). Then under Assumptions5,6, and9, with probability at least 1 − α, we have 1/4 2 1/2 0 hq(log p) q(log p) log(1/α)i kTbn − ETbnk2,q ≤ C · + + , n3/4 n n 0 where C is a positive constant only depending on M,MK ,C0, κx,C.

The proof of Theorem 3.7 follows the celebrated Hoeffding’s decomposition (Hoeffding, 1948). However, there are two major challenges. We discuss them in details before closing this section with a formal verification of Assumption3. On one hand, different from most existing investigations on nonasymptotic random matrix theory, the first order term of δLbn(∆, hn), after decomposition, does not have a natural product −1 Pn T structure, namely, it cannot be written as n i=1 UiUi for some independent random vectors p {Ui ∈ R , i ∈ [n]}. Hence, we cannot directly follow those well-established arguments based on a natural product structure, but have to resort to properties of the kernel. To this end, we state the

18 following two auxiliary lemmas, which are repeatedly used in the proofs, and can be regarded as extensions to the classic results in, for example, Robinson(1988).

Lemma 3.2. Suppose we have random variables W ∈ R and Z ∈ Z, such that ∂f (w, z) W |Z ≤ M , ∂w 1 for some positive constant M1 with any z in the range of Z and any w in the range of W . Also, let R +∞ K(·) be a kernel function such that −∞ |w|K(w) dw ≤ M2 for some constant M2 > 0. Then we have for any h > 0, n 1 W  o K Z − [Z|W = 0]f (0) ≤ M M [|Z|]h. E h h E W 1 2E

Lemma 3.3. Let (W1,Z1), (W2,Z2) ∈ R × Z be i.i.d.. Assume ∂f (w, z) W1|Z1 ≤ M ∂w 1 holds for some positive constant M1 with any z in the range of Z1 and any w in the range of W . R +∞ Let K(·) be a kernel function such that −∞ |w|K(w) dw ≤ M2 for some constant M2 > 0. Let 2 ϕ : Z → R be a measurable function. Then we have for any h > 0, n 1 W1 − W2  o   K ϕ(Z1,Z2) W2,Z2 − ϕ(Z1,Z2) W1 = W2,W2,Z2 fW (W2) E h h E 1   ≤M1M2E |ϕ(Z1,Z2)| Z2 h.

On the other hand, the second order term of δLbn(∆, hn), after decomposition, forms a degenerate U-statistic, and requires further study. To control this term, one might consider using the two-term Bernstein inequality for degenerate U-statistics (see, e.g., Proposition 2.3(c) in Arcones and Gine (1993) or Theorem 4.1.2 in de la Pe˜naand Gin´e(2012)). But it will add an additional polynomial log p multiplicity term in the upper bound. Instead, we adopt the sharpest four-term Bernstein inequality discovered by Gin´eet al.(2000), get rid of several inexplicit terms (e.g., the `2 → `2 norm), and formulate it into the following user-friendly tail inequality. We state this result in the following auxiliary lemma. The constants here are able to be explicitly calculated thanks to Houdr´e and Reynaud-Bouret(2003).

2 Lemma 3.4. Let Z1,...,Zn,Z ∈ Z be i.i.d., and g : Z → R be a symmetric measurable   P   function with E g(Z1,Z2) < ∞. Write Un(g) = i

19 where we take positive absolute constants 3/2 C1 = η1() = 2(1 + ) , √ −1 C2 = η2() = 8 2(2 +  +  ), √ √ −1 2 −1  −1 2  C3 = η3() = e(1 +  ) (5/2 + 32 ) + {2 2(2 +  +  )} ∨ (1 + ) / 2 , (3.4)  −1 2 −1 2 C4 = η4() = 4e(1 +  ) (5/2 + 32 ) ∨ 4(1 + ) /3,

C5 = 2.77, for any  > 0. For cases that f(z) = 0 (corresponding to the degenerate case), t can be set as zero and the first term on the second line of (3.3) can be eliminated.

Combining Theorem 3.7 with Theorem 10 and the follow-up arguments in Rudelson and Zhou (2013), we immediately have the following corollary, which verifies the desired empirical RE con- dition corresponding to different situations. Note that Assumption7 0 is stronger than both As- sumption7 and its variant Assumption 13. Thus the results below still hold when Assumption7 0 is imposed in Section 3.3.1.

Corollary 3.2. Suppose Assumptions4-6 and8-9 are satisfied.

(1) Assume Assumption7 holds, and that n ≥ Cs4/3(log p)1/3 ∨ s(log p)2 ,

for some constant C > 0 only depending on M,MK ,C0, κx, κ`,M`. Then we have

h κ`M` 2  0 p i δLbn(∆, hn) ≥ k∆k for all ∆ ∈ ∆ ∈ : k∆Sc k1 ≤ 3k∆S k1 P 4 2 R ≥1 − c exp(−c0 log p) − c exp(−c0n), 0 where c, c are positive constants only depending on M,MK ,C0, κx, κ`,M`,C.

(2) Assume Assumption 13 holds, and that  2γ 4/3 1/3 2γ 2 n ≥ C {s + nhn / log p} (log p) ∨ {s + nhn / log p}(log p) ,

for some constant C > 0 only depending on M,MK ,C0, κx, κ`,M`, ζ, γ. Then we have

n κ`M` 2 o P δLbn(∆, hn) ≥ k∆k2 for all ∆ ∈ C 0 4 Sen ≥1 − c exp(−c0 log p) − c exp(−c0n),

 p 2 2γ where C 0 := v ∈ : kvJ c k1 ≤ 3kvJ k1 for some J ⊂ [p] and |J | ≤ s + ζ nhn / log p) , Sen R 0 and c, c are positive constants only depending on M,MK ,C0, κx, κ`,M`,C.

3.4.2 On verifying the smoothness condition In Section 3.3.1, we focused on several special function classes for g(·) by verifying the corresponding general smoothness condition, Assumption 11. This smoothness condition, or equivalently Condi- ∗ tion (1.5), bears a “dimension-free” flavor. In other words, although the number of entries, p, in βh ∗ ∗ ∗ and β is increasing as n grows, the upper bound of kβh − β k2 in (1.5) does not depend on p. At

20 the first sight, this result looks a little bit surprising. However, there is rational explanation using insights from the perturbation theory. ∗ ∗ Specifically, recall that βh and β are extrema of Lh and L0. Lh is a “perturbed” version of L0 only on a one-dimensional feature Wf, and hence the overall resulting fluctuation on the corresponding extrema is intrinsically independent of the dimension of β.

Remark 3.10. Think about a related problem in linear algebra. Let Ω be a p by p symmetric real T matrix and Ωh = Ω + he1e1 , analogous to L0 and Lh respectively. Suppose our aim is to measure the difference between the leading eigenvectors vΩ and vΩ,h of Ω and Ωh. Davis Kahan inequality (Davis and Kahan, 1970), particularly its variant in Theorem 2 in Yu et al.(2015), gives 2h sin ∠(vΩ, vΩ,h) ≤ , (3.5) λ1(Ω) − λ2(Ω) where sin ∠(vΩ, vΩ,h) represents the sine of the angle between vΩ and vΩ,h, and λ1(Ω), λ2(Ω) are the two largest singular values of Ω. Inequality (3.5) thus reveals a “dimension-free” phenomenon, which sheds light to understanding (1.5), though verifying the later condition requires a different set of analysis tools.

∗ A sketch of the proofs to Theorems 3.4-3.5 is as follows. Our target is to show that |Lh(βh) − ∗ ∗ ∗ ∗ ∗ 2 L0(β )| is upper and lower bounded by functions of kβh − β k2 and kβh − β k2 respectively. First, ∗ ∗ 0 we lower bound |Lh(βh) − L0(β )|. By Assumptions7 and8, we have 2 ∂ L0(β)  T  λmin = 2λmin XeXe Wf = 0 f (0) ≥ 2κ`M`. ∂β2 E Wf Therefore, for some β = β∗ + t(β∗ − β∗ ), t ∈ [0, 1], we have t hn hn 1 ∂2L (β) ∗ ∗ ∗ ∗ T 0 ∗ ∗ ∗ ∗ 2 L0(βh) − L0(β ) = (βh − β ) 2 (βh − β ) ≥ κ`M`kβh − β k2. (3.6) 2 ∂β β=βt p Secondly, we upper bound |Lh(β) − L0(β)| for arbitrary β ∈ R . For this, simple algebra yields

h 1 Wf  T ∗ 2i  T ∗ 2  |Lh(β) − L0(β)| ≤ K {Xe (β − β )} − {Xe (β − β )} Wf = 0 f (0) E h h E Wf h 1 Wfij  2i + K g(W ) − g(W ) E h h i j

h 1 Wf  2i  2  + K u − u Wf = 0 f (0) E h h e E e Wf

h 1 Wfij  T ∗  i + 2 K Xe (β − β ) g(Wi) − g(Wj) . E h h Each of the four terms can be upper bounded using Taylor expansion, Lemmas A2.15 and A2.16, and several properties of the function g(·) such that ∗ γ 2γ ∗ 2 2 |Lh(β) − L0(β)| ≤ 2a1kβ − β k2h + a2h + a3kβ − β k2h . (3.7) Finally, the conclusion follows easily by combining (3.6)-(3.7) and observing ∗ ∗ ∗ ∗ ∗ ∗ L0(βh) − L0(β ) ≤ |L0(βh) − Lh(βh)| + |Lh(β ) − L0(β )|. Further details can be found in the proofs to Theorems 3.4-3.5.

21 4 Extension to heavy-tailed noise

This section proves the second assertion in Section 1.2. Concerning the lasso regression, van de Geer (2010) (see, e.g., Lemma 5.1 therein) commented that, for characterizing the impact of the noise u on −1 Pn the lasso estimation accuracy, it is sufficient to consider the quantity maxj∈[p] n i=1 uiXij. For −1 Pn fixed design, one has to assume subgaussianity of u in order to ensure that maxj∈[p] n i=1 uiXij scales in the rate of (log p/n)1/2. In contrast, Lecu´eand Mendelson(2017a), among many others, pointed out that, when X is multivariate subgaussian, u can adopt a much milder tail condition. This section verifies that the same observation applies to partially linear model coupled with the pairwise difference approach. This track of study could also be compared to the parallel investiga- tion on high dimensional (Fan et al., 2016, 2017; Loh, 2017, among others). The following heavy-tailedness assumption is posed.

2+ Assumption 17. For some  ≥ 0, there exists an absolute constant Mu > 0 such that E[|ue| ] ≤ Mu.

In Assumption 17, we allow  = 0 so that a finite second moment of ue suffices. Similar to the lasso analysis, the following quantity is the key in measuring the impact of u. For k ∈ [p], define  −1 n X 1 Wfij  U1k = K Xeijkuij. 2 h h e i

Lemma 4.1. Assume that there exist some absolute constants $K_1, C_0 > 0$ and $1/(2+\epsilon) < \xi < 3/4$, such that
$$h_n \in [K_1(\log p/n)^{1/2}, C_0) \quad\text{and}\quad n \ge C(\log p)^{5/(3-4\xi)},$$
where $K_1(\log p/n)^{1/2} < C_0$, and the constant $C$ only depends on $C_0, M_u, \kappa_x, \xi$. Then under additional Assumptions 5, 6, and 9, we have
$$\mathbb{P}\Big[\max_{k\in[p]}|U_{1k} - \mathbb{E}[U_{1k}]| \ge C'(\log p/n)^{1/2}\Big] \le c\exp(-c'\log p) + c\exp(-c'\log n),$$
where $C', c, c'$ are three positive constants only depending on $M, M_K, C_0, \kappa_x, M_u, \epsilon, \xi, K_1, C$.

Lemma 4.1 immediately yields the following corollary, which provides analogues of the main results, i.e., Theorems 3.1, 3.2, 3.3, 3.6, and Corollary 3.1.

Corollary 4.1. Assume that there exist some absolute constants $K_1, C_0 > 0$ and $1/(2+\epsilon) < \xi < 3/4$, such that
$$h_n \in [K_1(\log p/n)^{1/2}, C_0) \quad\text{and}\quad n \ge C\big\{(\log p)^{5/(3-4\xi)} \vee (\log p)^3 \vee q^{4/3}(\log p)^{1/3} \vee q(\log p)^2\big\},$$
where $K_1(\log p/n)^{1/2} < C_0$, and the quantity $q$ and the dependence of the constant $C$ will be specified case by case below. Denote $\eta_n = \|\mathbb{E}[\widetilde X\widetilde X^T \mid \widetilde W = 0]\|_\infty$. Then, replacing Assumption 10 with Assumption 17 in the corresponding results, the following assertions are still true. The three positive constants $C', c, c'$ may take different values in the specific cases, but only depend on $M, M_K, C_0, \kappa_x, \kappa_\ell, M_\ell, \xi, C$.

(1) Analogy of Theorem 3.1: Assume Assumption 11 holds with $\gamma = 1$. Set $q = s$. Assume further that $\lambda_n \ge C\{h_n + (\log p/n)^{1/2}\}$, where $C$ only depends on $M, M_K, C_0, \kappa_x, M_u, \kappa_\ell, M_\ell, \xi, \epsilon, \zeta, K_1$. Then under Assumptions 4-9, 11, 12, and 17, we have
$$\mathbb{P}\big(\|\widehat\beta_{h_n} - \beta^*\|_2^2 \le C' s\lambda_n^2\big) \ge 1 - c\exp(-c'\log p) - c\exp(-c'\log n) - \epsilon_n.$$

(2) Analogy of Theorem 3.2: Assume Assumption 11 holds with a general $\gamma \in (0,1]$. Set $q = s$. Assume further that $\lambda_n \ge C\{(\log p/n)^{1/2} + h_n^{\gamma}\}$, where $C$ only depends on $M, M_K, C_0, \kappa_x, M_u, \kappa_\ell, M_\ell, \xi, \epsilon, \zeta, \gamma, K_1$. Then under Assumptions 4-9, 11, 12, and 17, we have
$$\mathbb{P}\big(\|\widehat\beta_{h_n} - \beta^*\|_2^2 \le C' s\lambda_n^2\big) \ge 1 - c\exp(-c'\log p) - c\exp(-c'\log n) - \epsilon_n.$$

(3) Analogy of Theorem 3.3: Assume Assumption 11 holds with a general $\gamma \in [1/4, 1]$. Set $q = s + nh_n^{2\gamma}/\log p$. Assume further that $\lambda_n \ge C\{h_n + \eta_n(\log p/n)^{1/2}\}$, where $C$ only depends on $M, M_K, C_0, \kappa_x, M_u, \kappa_\ell, M_\ell, \xi, \epsilon, \zeta, \gamma, K_1$. Then under Assumptions 4-6, 8-9, 11-13, and 17, we have
$$\mathbb{P}\Big(\|\widehat\beta_{h_n} - \beta^*\|_2^2 \le C'\Big\{s\lambda_n^2 + \frac{s\log p}{n} + \frac{n\lambda_n^2 h_n^{2\gamma}}{\log p}\Big\}\Big) \ge 1 - c\exp(-c'\log p) - c\exp(-c'\log n) - \epsilon_n.$$

(4) Analogy of Corollary 3.1:

a. Assume that $g(\cdot)$ is $\alpha$-Hölder for $\alpha \ge 1$, and $g(\cdot)$ has compact support when $\alpha > 1$. Set $q = s$. Assume further that $\lambda_n \ge C\{h_n + (\log p/n)^{1/2}\}$ and $n \ge (\log p)^4$, where $C$ only depends on $M, M_K, C_0, \kappa_x, M_u, \kappa_\ell, M_\ell, \xi, \epsilon, K_1$, and the Hölder parameters of $g(\cdot)$. Then under Assumptions 4-6, 7', 8-9, 14, and 17, we have
$$\mathbb{P}\big(\|\widehat\beta_{h_n} - \beta^*\|_2^2 \le C' s\lambda_n^2\big) \ge 1 - c\exp(-c'\log p) - c\exp(-c'\log n).$$

b. Assume Assumption 15 holds with $\alpha \in (0,1]$. Set $q = s$. Assume further that $\lambda_n \ge C\{(\log p/n)^{1/2} + h_n^{\gamma}\}$ and $n \ge C(\log p)^4$, where $C$ only depends on $M, M_K, C_0, \kappa_x, M_u, \kappa_\ell, M_\ell, \xi, \epsilon, K_1, M_g, M_d, M_a$, and $\gamma = \alpha$ if $M_dM_a = 0$, $\gamma = \alpha \wedge 1/2$ otherwise. Then under Assumptions 4-6, 7', 8-9, 14 and 17, we have
$$\mathbb{P}\big(\|\widehat\beta_{h_n} - \beta^*\|_2^2 \le C' s\lambda_n^2\big) \ge 1 - c\exp(-c'\log p) - c\exp(-c'\log n).$$

c. Assume Assumption 15 holds with $\alpha \in [1/4, 1]$. Set $q = s + nh_n^{2\gamma}/\log p$. Assume further that $\lambda_n \ge C\{h_n + \eta_n(\log p/n)^{1/2}\}$ and $n \ge C(\log p)^4$, where $C$ only depends on $M, M_K, C_0, \kappa_x, M_u, \kappa_\ell, M_\ell, \xi, \epsilon, K_1, M_g, M_d, M_a$, and $\gamma = \alpha$ if $M_dM_a = 0$, $\gamma = \alpha \wedge 1/2$ otherwise. Then under Assumptions 4-6, 7', 8-9, 14 and 17,
$$\mathbb{P}\Big(\|\widehat\beta_{h_n} - \beta^*\|_2^2 \le C'\Big\{s\lambda_n^2 + \frac{s\log p}{n} + \frac{n\lambda_n^2 h_n^{2\gamma}}{\log p}\Big\}\Big) \ge 1 - c\exp(-c'\log p) - c\exp(-c'\log n).$$

(5) Analogy of Theorem 3.6: Set $q = s$. Assume that $\lambda_n \ge C\{h_n + (\log p/n)^{1/2}\}$ and $n \ge C(\log p)^4$, where $C$ depends only on $M, M_K, C_0, \kappa_x, M_u, \kappa_\ell, M_\ell, \xi, \epsilon, K_1, M_g$. Then under Assumptions 4-9, 16, and 17, we have
$$\mathbb{P}\big(\|\widehat\beta_{h_n} - \beta^*\|_2^2 \le C' s\lambda_n^2\big) \ge 1 - c\exp(-c'\log p) - c\exp(-c'\log n).$$

5 Finite sample analysis

This section presents simulation studies to illustrate the finite sample performance. The purpose is not to provide an exhaustive numerical comparison of different methods, but to give some empirical evidence backing up the previous theoretical investigation. Three procedures are considered:

(1) the pairwise difference approach outlined in (1.2), denoted as "PRD";

(2) the penalized least squares approach using B-splines, as in Müller and van de Geer (2015), denoted as "B-spline";

(3) the projection approach, as in Zhu (2017) and Robinson (1988), denoted as "Projection".

For each method, the tuning parameter of the related lasso solution is selected by 10-fold cross validation as implemented in the R package glmnet (Friedman et al., 2009). For the pairwise difference approach, the kernel bandwidth $h_n$ is set to $2(\log p/n)^{1/2}$ and the kernel is set to be the box kernel in Example 3.2. An R program implementing all considered methods has been posted on the authors' website.

We move on to describe the data generating scheme. In Scenarios 1-5, we generate $n$ independent observations from (1.1). In all five scenarios, we generate $W_i$ from the uniform distribution over $[0,1]$, $X_i$ from $N(0,\Sigma)$ independent of $W_i$, and $u_i$ independent of $(X_i, W_i)$.

Scenario 1: g(w) = sin(10w), Σ = Ip, ui ∼ N(0, 1);

Scenario 2: g(w) = sin(10w) for w ≤ 0.2, and g(w) = sin(10w) + 3 for w > 0.2, Σ = Ip, ui ∼ N(0, 1);

Scenario 3: g(w) = sin(10w) for w ≤ 0.2, and g(w) = sin(10w) + 3 for w > 0.2, Σ = (a_ij) such that a_ij = 0.3^{|i−j|}, ui ∼ N(0, 1);

Scenario 4: g(w) = w^{1/3}, Σ = Ip, ui ∼ N(0, 1);

Scenario 5: g(w) = sin(10w), Σ = Ip, ui ∼ t(3)¹.

We take β∗ such that its first s elements form an arithmetic sequence decreasing from 5 to 0.1, and the remaining elements are zero. We let n = 200, p = 1,000, and take s = 20, 50. Estimation accuracy is measured by the ℓ2 norm of the difference between the estimate and the true parameter value, averaged over 1,000 independent replications. We summarize the simulation results in Table 1. There, it is observed that the pairwise difference approach is competitive with the other two approaches when s = 20 and outperforms both when s = 50; the advantage becomes more significant as s increases.

¹ t(3) stands for the t-distribution with 3 degrees of freedom.

Table 1: Averaged ℓ2 norm of errors, with standard errors in brackets.

         Method       Scenario 1      Scenario 2      Scenario 3      Scenario 4      Scenario 5
 s = 20  PRD          1.200 (0.010)   1.182 (0.005)   1.120 (0.011)   1.065 (0.004)   1.827 (0.014)
         B-spline     1.300 (0.006)   1.420 (0.006)   1.126 (0.005)   1.113 (0.005)   2.013 (0.016)
         Projection   1.173 (0.005)   1.249 (0.005)   1.021 (0.005)   1.095 (0.005)   1.897 (0.016)
 s = 50  PRD          4.979 (0.055)   5.495 (0.059)   6.509 (0.068)   5.286 (0.058)   6.816 (0.057)
         B-spline     5.947 (0.068)   6.991 (0.071)   7.045 (0.076)   6.265 (0.071)   7.534 (0.068)
         Projection   5.654 (0.064)   6.416 (0.068)   6.934 (0.070)   6.133 (0.068)   7.357 (0.063)
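To make the data generating scheme and the computation of the PRD estimator concrete, a minimal R sketch for Scenario 1 follows. The reduction of (1.2) to an unweighted lasso on kernel-selected pairwise differences (exact for a uniform box kernel, where all retained pairs receive equal weight), the reduced problem sizes, and the use of glmnet are choices of this sketch rather than the authors' reference implementation.

```r
## Scenario 1 data and one way to compute the PRD estimator; sizes reduced for speed.
library(glmnet)
set.seed(1)
n <- 100; p <- 200; s <- 10                       # paper uses n = 200, p = 1000, s = 20 or 50
beta_star <- c(seq(5, 0.1, length.out = s), rep(0, p - s))
W <- runif(n)                                     # W_i ~ Uniform[0, 1]
X <- matrix(rnorm(n * p), n, p)                   # Sigma = I_p
u <- rnorm(n)                                     # u_i ~ N(0, 1)
Y <- drop(X %*% beta_star) + sin(10 * W) + u      # Scenario 1: g(w) = sin(10 w)

h  <- 2 * sqrt(log(p) / n)                        # bandwidth h_n = 2 (log p / n)^{1/2}
ij <- t(combn(n, 2))                              # all pairs i < j
keep <- abs(W[ij[, 1]] - W[ij[, 2]]) <= h         # pairs retained by the box kernel
Xd <- X[ij[keep, 1], ] - X[ij[keep, 2], ]         # pairwise differences of covariates
Yd <- Y[ij[keep, 1]] - Y[ij[keep, 2]]             # pairwise differences of responses

fit <- cv.glmnet(Xd, Yd, intercept = FALSE, standardize = FALSE, nfolds = 10)
beta_hat <- as.numeric(coef(fit, s = "lambda.min"))[-1]
sqrt(sum((beta_hat - beta_star)^2))               # l2 estimation error
```

Note that cross validation is run directly on the (dependent) pairs here, mirroring the glmnet-based selection described above; a pair-aware fold assignment is a possible refinement.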

6 Discussions

6.1 Comparison with existing results

Specific to the high dimensional partially linear Lipschitz models, the following scaling requirements are needed for achieving the rate of convergence (1.7). (1) Müller and van de Geer (2015) and Yu et al. (2016) required $s^2\log p/n$ to be sufficiently small. This requirement is implied by Theorem 1 in Müller and van de Geer (2015) (specifically, by combining Equation (5) and the requirement $R^2 < \lambda^2$ therein) and by Lemma 2.2 in Yu et al. (2016) (by noticing $\mu^2 + \lambda^2 s_0 \le \lambda$). (2) Zhu (2017) required $(\log p)^{-3}\|\beta^*\|_1^{10}/n$ to be sufficiently small². Assuming $\|\beta^*\|_1$ is of order $s$, this reduces to demanding $s^{10}/(n(\log p)^3)$ be small enough; the requirement is implied by combining Theorem 4.1 (19) and Lemma 5.1 therein. Even if $\|\beta^*\|_1 = O(1)$ holds in a restricted and ideal setting, Zhu (2017) still required $(s^{3/2} + s\log p)/n$ to be small, as implied by Theorem 4.1 (18) and Lemma 5.1 together. In comparison, referring to Theorem 3.6, our scaling requirement is $n \ge C\{(\log p)^4 \vee s^{4/3}(\log p)^{1/3} \vee s(\log p)^2\}$. In many cases this reduces to requiring $s^{4/3}(\log p)^{1/3}/n < C$ for some absolute constant $C > 0$, which compares favorably to the existing requirements in the literature.

We list three more assumptions that are required in the literature but that, in some cases, we do not need. First, both Müller and van de Geer (2015) and Yu et al. (2016) required $X$ to be entrywise bounded (see, e.g., Condition 2.2 in Müller and van de Geer (2015) and Assumption A.4 in Yu et al. (2016)), an assumption that is arguably strong. When this requirement fails, a straightforward truncation argument will render an extra $\log p$ multiplicative term in the upper bound of $\|\widehat\beta - \beta^*\|_2^2$. In comparison, we do not need this condition (cf. Assumption 9). Secondly, Müller and van de Geer (2015), Yu et al. (2016), and Zhu (2017) all required $W$ to have a compact support if some Lipschitz function class is studied. This is due to the calculation of the metric entropy of the Lipschitz class,

² In Zhu (2017), the metric entropy condition is imposed on the function series {F_j, 0 ≤ j ≤ p} rather than on g(·).

while we do not (cf. Theorem 3.6). Lastly, referring to Corollary 3.1, the pairwise difference approach can handle α-Hölder classes with α ≤ 1/2, settings that are technically difficult to manage using least squares estimators. However, we note that these gains come at a price: for α-Hölder classes with 1/2 < α < 1, the pairwise difference approach with $s(\log p)^{1-\alpha} \lesssim n^{1-\alpha}$ or with increasing $\eta_n = \|\mathbb{E}[\widetilde X\widetilde X^T \mid \widetilde W = 0]\|_\infty$ cannot be shown by Corollary 3.1 to attain the same $s\log p/n$ rate as the least squares approaches in Müller and van de Geer (2015), Yu et al. (2016), and Zhu (2017).

6.2 Further discussion on the minimum sample size requirement

One advantage of our adaptive pairwise difference approach is a mild scaling requirement, $n \ge C\{(\log p)^4 \vee s^{4/3}(\log p)^{1/3} \vee s(\log p)^2\}$. In many cases it is better than the existing results. One may ask if this requirement can be further relaxed to the optimal one in the linear model, $s\log p/n < C$, so as to obtain minimax optimal rates for estimating the coefficients under the $\ell_2$ norm. As was discussed in Section 3.4.1, $s^{4/3}(\log p)^{1/3}/n < C$ is needed to provide a bound for $\|\widehat T_n - \mathbb{E}\widehat T_n\|_{2,q}$ in verifying the empirical RE condition. Specifically, to bound the U-process $\max_{\|\Delta\|_2 = 1, \|\Delta\|_0 \le q}|\Delta^T(\widehat T_n - \mathbb{E}\widehat T_n)\Delta|$, we apply the routine Hoeffding decomposition. While the first order term of the decomposition is just like a well-behaved (i.e., bounded variance) empirical process, yielding consistency under the weaker requirement $s\log p/n < C$, its second order term is delicate, due to the growing variance of the degenerate U-process incurred by the shrinking sequence $h_n$. The four-term Bernstein inequality in Giné et al. (2000) and Houdré and Reynaud-Bouret (2003) provides a sharp tail probability for this degenerate U-process evaluated at each fixed $\Delta$ (see Lemma 3.4). One may want to apply a routine metric entropy or chaining argument to bound it uniformly over all $q$-sparse unit vectors. However, although the first order term of its tail is regular, the tail of the U-process is substantially different from that of an empirical process, as reflected in the higher order terms in Lemma 3.4. Indeed, the increment of the U-process behaves more like an exponential distribution (Nolan and Pollard, 1987). As a result, the growing variance, together with the higher order terms in the tail of this degenerate U-process, demands a stronger sample size requirement, $s^2\log p/n < C$, if one follows the routine chaining argument. For such a degenerate U-matrix $\widehat M_n$, to prove a bound for $\|\widehat M_n\|_{2,q}$, we chose to use the trick $\max_{\|\Delta\|_2 = 1, \|\Delta\|_0 \le q}|\Delta^T\widehat M_n\Delta| \le q\|\widehat M_n\|_{\max}$ (for a given matrix $M$, $\|M\|_{\max} := \max_{j,k}|M_{jk}|$). By doing this, we do not need to take a union bound, which alleviates the influence of the higher order terms of the tail probability in Lemma 3.4, but we still require $s^{4/3}(\log p)^{1/3}/n < C$ due to the growing variance and this simple trick. It would be an interesting problem to further relax our already weak scaling requirement $s^{4/3}(\log p)^{1/3}/n < C$ via a refined analysis or a different approach.
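For completeness, the elementary calculation behind this trick reads as follows: for any unit vector $\Delta$ with $\|\Delta\|_0 \le q$,
$$\big|\Delta^T\widehat M_n\Delta\big| \le \sum_{j,k\in\mathrm{supp}(\Delta)}|\Delta_j||\Delta_k|\,|(\widehat M_n)_{jk}| \le \|\widehat M_n\|_{\max}\|\Delta\|_1^2 \le q\,\|\widehat M_n\|_{\max},$$
using $\|\Delta\|_1 \le \sqrt{q}\,\|\Delta\|_2 = \sqrt{q}$ in the last step; no union bound over sparse vectors is involved.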

6.3 Tuning parameter selection

In the paper, we have demonstrated the adaptivity of the pairwise difference approach to the smoothness of $g(\cdot)$. In addition, the choice of the bandwidth $h_n$ is tuning-insensitive in the sense that one can simply choose $h_n = 2(\log p/n)^{1/2}$ without any impact on the estimation accuracy asymptotically. However, there is another parameter $\lambda_n$ in the optimization (1.2) which needs to be tuned. One may ask if this $\lambda_n$ can be made tuning-insensitive as well. Indeed, this important issue has been investigated by Belloni et al. (2011) and Sun and Zhang (2012) for lasso regression and by Bunea et al. (2014) and

Mitra and Zhang (2016) for group lasso models without the nonparametric component $g(\cdot)$. In those models, the optimal choice of $\lambda_n$ only depends on the variance of the noise variable $u$ besides $n, p$. Thus one can obtain a tuning-insensitive $\lambda_n$ by formulating this unknown variance as an extra variable in the objective function to be estimated. However, the choice of $\lambda_n$ in our setting, as reflected in Remark 2.1 following the perturbation level condition, is determined by both a variance part and a bias part. While the variance part is due to the variance of $u$ and can be controlled in a similar manner as for lasso regression, the bias part is due to the general smoothness condition (1.5). In particular, the value $\zeta$ depends on the constant $L$ in the specific $(L,\alpha)$-Hölder function class and is unknown in practice. It would be interesting to extend the methods developed in our paper to a completely tuning-insensitive method.
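For concreteness, one such formulation is the scaled lasso of Sun and Zhang (2012), which, in the linear model without $g(\cdot)$, treats the unknown noise level as an extra optimization variable,
$$(\widehat\beta, \widehat\sigma) = \operatorname*{argmin}_{\beta\in\mathbb{R}^p,\,\sigma > 0}\Big\{\frac{\|Y - X\beta\|_2^2}{2n\sigma} + \frac{\sigma}{2} + \lambda\|\beta\|_1\Big\},$$
so that a universal penalty level of order $(\log p/n)^{1/2}$ suffices. In our setting, the bias part of $\lambda_n$ induced by (1.5) would remain to be handled even after such a reformulation.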

6.4 Extension to multidimensional W

Although a one-dimensional $W$ is the main object of interest in this paper, the pairwise difference approach can be readily extended to study a multi-dimensional $W$ via employing a higher-order kernel function $K(\cdot)$ (Li and Racine, 2007). In detail, in conducting the pairwise difference approach, we only require a first-order kernel $K(\cdot)$ (Assumption 5), rendering a bandwidth $h_n$ of exactly the order $(\log p/n)^{1/2}$ for achieving the bias-variance tradeoff. Indeed, a first-order kernel is sufficient when $W$ is one-dimensional, since the bias and variance in estimation match in this case. However, for a multi-dimensional $W \in \mathbb{R}^d$, to control the variance, $h_n$ is recommended to be chosen at the order of $(\log p/n)^{1/(2d)}$, leading to an explosion of the bias if the kernel is only of first order. In contrast, a higher order kernel can effectively reduce the bias, and holds promise for recovering the regular $s\log p/n$ estimation rate. However, this gain comes at a price: the studied problem is no longer convex, which raises some nontrivial computational challenges. It would be interesting to examine if the established nonconvex penalized M-estimation theory (e.g., Fan et al. (2014), Loh and Wainwright (2015), and Wang et al. (2014), to name just a few) can apply to our problem.

6.5 Extending the general method to studying other problems

In our studied problem, we do not need to assume any sparsity for $\beta_h^*$. This is due to the general method introduced in Section 2, which alleviates the stringent sparsity requirement in Negahban et al. (2012) and allows for a highly non-sparse $\beta_h^*$. In a related study, Lambert-Lacroix and Zwald (2011) and Fan et al. (2017) investigated robust linear regression approaches using the Huber loss. There, a perturbed loss function $\Gamma_\alpha^{\mathrm{Huber}}(\theta)$ is similarly posited, while a sparsity condition is enforced on $\theta_\alpha^* := \mathrm{argmin}_\theta\,\Gamma_\alpha^{\mathrm{Huber}}(\theta)$ instead of $\theta^* := \mathrm{argmin}_\theta\,\Gamma_{\alpha=0}^{\mathrm{Huber}}(\theta)$, where $\alpha$ controls the blending of quadratic and linear penalizations. The potential bias incurred for $\theta_\alpha^*$ is due to an asymmetric noise. In comparison, the bias of $\beta_h^*$ in our paper is due to the lack of consideration of the nonparametric component in the model. In addition, instead of analyzing the bias and variance of $\beta_h^*$ separately as Fan et al. (2017) did, the general method developed in Section 2 allows an intermediate surrogate in the analysis, as demonstrated in Theorem 3.3. In the future, it would be interesting to inspect whether the developed general method could apply to the study of robust regression problems with perturbed loss functions, and whether the sparsity condition there could be similarly relaxed.

Acknowledgement

We thank Xi Chen, Michael Jansson, and Jing Tao for discussions. We thank Jianqing Fan for pointing out relevant literature and for many comments. We thank Yanqin Fan for an introduction to the related literature in econometrics. We thank Roy Han for discussions on regression under heavy-tailed noise. We thank Adrian Raftery and Cun-Hui Zhang for discussions on the extension to multi-dimensional W. We owe thanks to Noah Simon for conversations and questions on an older version. We owe thanks to Thomas Richardson for discussions on bandwidth selection and the RE condition. We also thank all participants in the University of Washington Biostatistics, Center for Statistics and the Social Sciences (CSSS), and Econometrics seminars. All remaining errors are ours.

References

Abrevaya, J. and Shin, Y. (2011). Rank estimation of partially linear index models. The Econometrics Journal, 14(3):409–437.

Adamczak, R., Litvak, A. E., Pajor, A., and Tomczak-Jaegermann, N. (2011). Restricted isometry property of matrices with independent columns and neighborly polytopes by random sampling. Constructive Approximation, 34(1):61–88.

Amelunxen, D., Lotz, M., McCoy, M. B., and Tropp, J. A. (2014). Living on the edge: Phase transitions in convex programs with random data. Information and Inference, 3(3):224–294.

Aradillas-Lopez, A., Honoré, B. E., and Powell, J. L. (2007). Pairwise difference estimation with nonparametric control variables. International Economic Review, 48(4):1119–1158.

Arcones, M. A. and Giné, E. (1993). Limit theorems for U-processes. The Annals of Probability, 21(3):1494–1542.

Belloni, A., Chernozhukov, V., and Wang, L. (2011). Square-root lasso: Pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791–806.

Bickel, P. J., Ritov, Y., and Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732.

Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer.

Bunea, F. (2004). Consistent covariate selection and post model selection inference in semiparametric regression. The Annals of Statistics, 32(3):898–927.

Bunea, F., Lederer, J., and She, Y. (2014). The group square-root lasso: Theoretical properties and fast algorithms. IEEE Transactions on Information Theory, 60(2):1313–1325.

Bunea, F. and Wegkamp, M. H. (2004). Two-stage model selection procedures in partially linear regression. Canadian Journal of Statistics, 32(2):105–118.

Carroll, R. J., Fan, J., Gijbels, I., and Wand, M. P. (1997). Generalized partially linear single-index models. Journal of the American Statistical Association, 92(438):477–489.

Cattaneo, M. D., Jansson, M., and Newey, W. K. (2017). Alternative asymptotics and the partially linear model with many regressors. Econometric Theory, (to appear).

Chen, H. (1988). Convergence rates for parametric components in a partly linear model. The Annals of Statistics, 16(1):136–146.

Davis, C. and Kahan, W. M. (1970). The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1):1–46.

de la Peña, V. and Giné, E. (2012). Decoupling: From Dependence to Independence. Springer.

Dezeure, R., B¨uhlmann,P., Meier, L., and Meinshausen, N. (2015). High-dimensional inference: Confidence intervals, p-values and R-software hdi. Statistical Science, 30(4):533–558.

Donald, S. G. and Newey, W. K. (1994). Series estimation of semilinear models. Journal of Multivariate Analysis, 50(1):30–40.

Engle, R. F., Granger, C. W., Rice, J., and Weiss, A. (1986). Semiparametric estimates of the relation between weather and electricity sales. Journal of the American Statistical Association, 81(394):310–320.

Fan, J. and Huang, T. (2005). Profile likelihood inferences on semiparametric varying-coefficient partially linear models. Bernoulli, 11(6):1031–1057.

Fan, J., Li, Q., and Wang, Y. (2017). Estimation of high dimensional mean regression in the absence of symmetry and light tail assumptions. Journal of the Royal Statistical Society: Series B, 79(1):247–265.

Fan, J. and Li, R. (2004). New estimation and model selection procedures for semiparametric modeling in longitudinal data analysis. Journal of the American Statistical Association, 99(467):710–723.

Fan, J., Wang, W., and Zhu, Z. (2016). Robust low-rank matrix recovery. arXiv:1603.08315.

Fan, J., Xue, L., and Zou, H. (2014). Strong oracle optimality of folded concave penalized estimation. The Annals of Statistics, 42(3):819–849.

Friedman, J., Hastie, T., and Tibshirani, R. (2009). glmnet: Lasso and elastic-net regularized generalized linear models. R package version, 1(4).

Giné, E., Latała, R., and Zinn, J. (2000). Exponential and moment inequalities for U-statistics. In High Dimensional Probability II, pages 13–38. Springer.

Han, F. and Liu, H. (2016). ECA: High dimensional elliptical component analysis in non-Gaussian distributions. Journal of the American Statistical Association, (to appear).

Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. The Annals of Mathematical Statistics, 19(3):293–325.

Honoré, B. E. and Powell, J. (2005). Pairwise difference estimators for nonlinear models. In Andrews, D.W.K., Stock, J.H. (Eds.) Identification and Inference in Econometric Models. Essays in Honor of Thomas Rothenberg, pages 520–553. Cambridge University Press.

Houdré, C. and Reynaud-Bouret, P. (2003). Exponential inequalities, with constants, for U-statistics of order two. In Stochastic Inequalities and Applications, pages 55–69. Springer.

Kato, T. (2013). Perturbation Theory for Linear Operators, volume 132. Springer.

Lambert-Lacroix, S. and Zwald, L. (2011). Robust regression through the Huber’s criterion and adaptive lasso penalty. Electronic Journal of Statistics, 5:1015–1053.

Lecué, G. and Mendelson, S. (2017a). Regularization and the small-ball method I: Sparse recovery. The Annals of Statistics, (to appear).

Lecué, G. and Mendelson, S. (2017b). Sparse recovery under weak moment assumptions. Journal of the European Mathematical Society, (to appear).

Li, Q. and Racine, J. S. (2007). Nonparametric Econometrics: Theory and Practice. Princeton University Press.

Liang, H. and Li, R. (2009). Variable selection for partially linear models with measurement errors. Journal of the American Statistical Association, 104(485):234–248.

Loh, P.-L. (2017). Statistical consistency and asymptotic normality for high-dimensional robust M-estimators. The Annals of Statistics, (to appear).

Loh, P.-L. and Wainwright, M. J. (2015). Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Journal of Machine Learning Research, 16:559–616.

Mitra, R. and Zhang, C.-H. (2014). Multivariate analysis of nonparametric estimates of large correlation matrices. arXiv:1403.6195.

Mitra, R. and Zhang, C.-H. (2016). The benefit of group sparsity in group inference with de-biased scaled group lasso. Electronic Journal of Statistics, 10(2):1829–1873.

Müller, P. and van de Geer, S. (2015). The partial linear model in high dimensions. Scandinavian Journal of Statistics, 42(2):580–608.

Negahban, S. N., Ravikumar, P., Wainwright, M. J., and Yu, B. (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557.

Nolan, D. and Pollard, D. (1987). U-processes: rates of convergence. The Annals of Statistics, 15(2):780–799.

Ramdas, A., Reddi, S. J., Poczos, B., Singh, A., and Wasserman, L. (2015). Adaptivity and computation-statistics tradeoffs for kernel and distance based high dimensional two sample testing. arXiv:1508.00655.

Raskutti, G., Wainwright, M. J., and Yu, B. (2011). Minimax rates of estimation for high-dimensional linear regression over `q-balls. IEEE Transactions on Information Theory, 57(10):6976–6994.

Ren, Z., Sun, T., Zhang, C.-H., and Zhou, H. H. (2015). Asymptotic normality and optimalities in estimation of large Gaussian graphical models. The Annals of Statistics, 43(3):991–1026.

Robinson, P. M. (1988). Root-n-consistent semiparametric regression. Econometrica, 56(4):931– 954.

Rudelson, M. and Zhou, S. (2013). Reconstruction from anisotropic random measurements. IEEE Transactions on Information Theory, 59(6):3434–3447.

Sherwood, B. and Wang, L. (2016). Partially linear additive quantile regression in ultra-high dimension. The Annals of Statistics, 44(1):288–317.

Speckman, P. (1988). Kernel smoothing in partial linear models. Journal of the Royal Statistical Society: Series B, 50(3):413–436.

Sun, T. and Zhang, C.-H. (2012). Scaled sparse linear regression. Biometrika, 99(4):879–898.

Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation. Springer.

van de Geer, S. (2010). ℓ1-regularization in high-dimensional statistical models. In Proceedings of the International Congress of Mathematicians, volume 4, pages 2351–2369.

van de Geer, S. A. and Bühlmann, P. (2009). On the conditions used to prove oracle results for the Lasso. Electronic Journal of Statistics, 3:1360–1392.

van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer.

Vershynin, R. (2012). Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing, pages 210–268. Cambridge University Press.

Wahba, G. (1990). Spline Models for Observational Data. SIAM.

Wang, Z., Liu, H., and Zhang, T. (2014). Optimal computational and statistical rates of convergence for sparse nonconvex learning problems. The Annals of Statistics, 42(6):2164–2201.

Yu, Y., Wang, T., and Samworth, R. J. (2015). A useful variant of the Davis–Kahan theorem for statisticians. Biometrika, 102(2):315–323.

Yu, Z., Levine, M., and Cheng, G. (2016). Minimax optimal estimation in high dimensional semiparametric models. arXiv:1612.05906.

Zhu, Y. (2017). Nonasymptotic analysis of semiparametric regression models with high-dimensional parametric coefficients. The Annals of Statistics, (to appear).
