Adaptive Estimation of High Dimensional Partially Linear Model
Fang Han,∗ Zhao Ren,† and Yuxin Zhu‡
May 6, 2017
Abstract Consider the partially linear model (PLM) with random design: Y = XTβ∗ + g(W ) + u, where g(·) is an unknown real-valued function, X is p-dimensional, W is one-dimensional, and β∗ is s-sparse. Our aim is to efficiently estimate β∗ based on n i.i.d. observations of (Y,X,W ) with possibly n < p. The popular approaches and their theoretical properties so far have mainly been developed with an explicit knowledge of some function classes. In this paper, we propose an adaptive estimation procedure, which automatically adapts to the model via a tuning-insensitive bandwidth parameter. In many cases, the proposed procedure also proves to attain weaker scaling and noise requirements than the best existing ones. The proof rests on a general method for determining the estimation accuracy of a perturbed M-estimator and new U-statistics tools, which are of independent interest.
Keywords: partially linear model; pairwise difference approach; adaptivity; minimum sample size requirement; heavy-tailed noise; degenerate U-processes.
1 Introduction
Partially linear regression involves estimating a vector $\beta^* = (\beta_1^*, \ldots, \beta_p^*)^T \in \mathbb{R}^p$ from independent and identically distributed (i.i.d.) observations $\{(Y_i, X_i, W_i),\ i = 1, \ldots, n\}$ satisfying
$$Y_i = X_i^T\beta^* + g(W_i) + u_i, \quad \text{for } i = 1, \ldots, n. \qquad (1.1)$$
Here $X_i \in \mathbb{R}^p$, $Y_i, W_i, u_i \in \mathbb{R}$, $g(\cdot)$ is an unknown real-valued function, and $u_i$ stands for a noise term of finite variance, independent of $(X_i, W_i)$. This paper is focused on such problems when the number of covariates $p+1$ is much larger than the sample size $n$, which in the sequel we shall refer to as a high dimensional partially linear model (PLM). For this, we require $\beta^*$ to be $s$-sparse, namely, the number of nonzero elements in $\beta^*$, $s$, is smaller than $n$. According to the smoothness of the function $g(\cdot)$, the following regression problems, with sparse regression coefficient $\beta^*$, are special instances of the studied model (1.1):
(1) The ordinary linear regression models, when g(·) is constant-valued.
∗Department of Statistics, University of Washington, Seattle, WA 98195, USA; e-mail: [email protected] †Department of Statistics, University of Pittsburgh, Pittsburgh, PA 15260, USA; e-mail: [email protected] ‡Department of Biostatistics, Johns Hopkins University, Baltimore, MD 21205, USA; e-mail: [email protected]
(2) Partially linear Lipschitz models, when $g(\cdot)$ satisfies a Lipschitz-type condition (see, for example, Condition 7.1 in Li and Racine (2007) for detailed descriptions).
(3) Partially smoothing spline models (Engle et al., 1986; Wahba, 1990), when g(·) can be well approximated by splines.
(4) Partially linear jump discontinuous regression, when g(·) can contain numerous jumps.
In the literature, Bunea (2004), Bunea and Wegkamp (2004), Fan and Li (2004), Liang and Li (2009), Müller and van de Geer (2015), Sherwood and Wang (2016), Yu et al. (2016), Zhu (2017), among many others, have studied the high dimensional partially linear model, largely following least squares approaches (Chen, 1988; Robinson, 1988; Speckman, 1988; Donald and Newey, 1994; Carroll et al., 1997; Fan and Huang, 2005; Cattaneo et al., 2017). For example, Zhu (2017) proposed to estimate $\beta^*$ through the following two-stage projection strategy:
$$\hat m_j = \mathop{\mathrm{argmin}}_{\tilde m_j \in \mathcal{F}_j} \Big[\frac{1}{n}\sum_{i=1}^n \big(Z_{ij} - \tilde m_j(W_i)\big)^2 + \lambda_{j,n}\|\tilde m_j\|_{\mathcal{F}_j}^2\Big], \quad \text{where } Z_{i0} = Y_i \text{ and } Z_{ij} = X_{ij} \text{ for } j \in [p];$$
$$\hat\beta^{\mathrm{proj}} = \mathop{\mathrm{argmin}}_{\beta}\Big[\frac{1}{n}\sum_{i=1}^n \big(\hat V_{0i} - \hat V_i^T\beta\big)^2 + \lambda_n\|\beta\|_1\Big], \quad \text{where } \hat V_{0i} = Y_i - \hat m_0(W_i) \text{ and } \hat V_{ij} = X_{ij} - \hat m_j(W_i).$$
Here $\{\mathcal{F}_j, 0 \le j \le p\}$ are a series of pre-specified function classes, $\|\cdot\|_{\mathcal{F}_j}$ is the associated norm, $\|\cdot\|_q$ is the vector $\ell_q$-norm, and $[p]$ denotes the set of integers between 1 and $p$. In a related study, Müller and van de Geer (2015) proposed a penalized least squares approach:
$$(\hat\beta^{\mathrm{LSE}}, \hat g) := \mathop{\mathrm{argmin}}_{\beta \in \mathbb{R}^p,\ \tilde g \in \mathcal{G}} \Big[\frac{1}{n}\sum_{i=1}^n \big(Y_i - X_i^T\beta - \tilde g(W_i)\big)^2 + \lambda_n^2\|\beta\|_1 + \mu_n^2\|\tilde g\|_{\mathcal{G}}^2\Big].$$
Here $\mathcal{G}$ is a pre-specified function class allowed to be of infinite dimension. Two tuning parameters $\lambda_n, \mu_n$ are employed to induce sparsity and smoothness separately.

Both approaches mentioned above, exemplifying a large fraction of existing methods, require knowledge of some function classes, either $\mathcal{G}$ for penalized least squares estimators or $\{\mathcal{F}_j, 0 \le j \le p\}$ for the projection approach. In this paper, we consider adaptive estimation and adopt an alternative approach, Honoré and Powell's pairwise difference method (Honoré and Powell, 2005) with an extra lasso-type penalty, for estimating $\beta^*$. To motivate this procedure, for $i, j \in [n]$, write $\tilde Y_{ij} = Y_i - Y_j$, $\tilde X_{ij} = X_i - X_j$, $\tilde W_{ij} = W_i - W_j$, and $\tilde u_{ij} = u_i - u_j$. To avoid trivial settings of discrete-type $W$, hereafter we assume $W$ to be absolutely continuous without loss of generality. The fact that
$$\mathbb{E}[\tilde Y_{ij} \mid \tilde W_{ij} = 0, X_i, X_j] = \tilde X_{ij}^T\beta^*$$
then implies
$$\beta^* = \mathop{\mathrm{argmin}}_{\beta \in \mathbb{R}^p} L_0(\beta), \quad \text{where } L_0(\beta) := f_{\tilde W_{ij}}(0)\, \mathbb{E}\big[(\tilde Y_{ij} - \tilde X_{ij}^T\beta)^2 \mid \tilde W_{ij} = 0\big],$$
and $f_{\tilde W_{ij}}(0)$ stands for the density value of $\tilde W_{ij}$ at 0. A natural estimator of $\beta^*$ is
$$\hat\beta_{h_n} = \mathop{\mathrm{argmin}}_{\beta \in \mathbb{R}^p}\big\{\hat L_n(\beta, h_n) + \lambda_n\|\beta\|_1\big\}. \qquad (1.2)$$
Here $h_n$ is a specified bandwidth, $\lambda_n$ is a tuning parameter to control the sparsity level, and
$$\hat L_n(\beta, h_n) := \binom{n}{2}^{-1}\sum_{i<j} \frac{1}{h_n}K\Big(\frac{\tilde W_{ij}}{h_n}\Big)\big(\tilde Y_{ij} - \tilde X_{ij}^T\beta\big)^2, \qquad (1.3)$$
where $K(\cdot)$ is a kernel function.
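To fix ideas, the penalized pairwise-difference objective in (1.2)-(1.3) can be evaluated directly from data. The following is a minimal Python sketch, not the paper's code; the Gaussian kernel and all function names are our own illustrative choices:

```python
import math
from itertools import combinations

def gaussian_kernel(w):
    """Standard normal density, one admissible choice of K."""
    return math.exp(-0.5 * w * w) / math.sqrt(2.0 * math.pi)

def pairwise_loss(Y, X, W, beta, h, lam, kernel=gaussian_kernel):
    """Penalized pairwise-difference loss L_n(beta, h) + lam * ||beta||_1,
    following (1.2)-(1.3): a U-statistic over all pairs i < j."""
    n = len(Y)
    total = 0.0
    for i, j in combinations(range(n), 2):
        w_diff = (W[i] - W[j]) / h
        y_diff = Y[i] - Y[j]
        x_dot = sum(b * (X[i][k] - X[j][k]) for k, b in enumerate(beta))
        total += kernel(w_diff) / h * (y_diff - x_dot) ** 2
    return total / math.comb(n, 2) + lam * sum(abs(b) for b in beta)
```

Minimizing this convex function over `beta` (e.g., with any off-the-shelf convex solver) yields the estimator $\hat\beta_{h_n}$ of (1.2).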
1.1 Adaptivity of pairwise difference approach
One major aim of this paper is to show the adaptivity of $\hat\beta_{h_n}$ to the smoothness of the function $g(\cdot)$. From the methodological perspective, no matter how smooth (complex) the function $g(\cdot)$ is, one does not need to tailor the pairwise difference approach. In addition, we will show that the bandwidth $h_n$ bears a tuning-insensitive property (Sun and Zhang, 2012) in the sense that $h_n$ does not depend on the smoothness of $g(\cdot)$. Hence, methodologically, the procedure can automatically adapt to the model.

From the theoretical perspective, the adaptivity property of the pairwise difference approach is shown via analyzing $\|\hat\beta_{h_n} - \beta^*\|_2^2$ for a range of smooth function classes of $g(\cdot)$. To this end, we first construct a general characterization of the degree of "smoothness" of $g(\cdot)$. In contrast to commonly used smoothness conditions, a new characterization of the smoothness degree is introduced. Recall that $\hat L_n(\beta, h_n)$ is an empirical approximation of $L_0(\beta)$ as $h_n \to 0$. Define
$$\beta_h^* = \mathop{\mathrm{argmin}}_{\beta \in \mathbb{R}^p} \mathbb{E}\hat L_n(\beta, h) \qquad (1.4)$$
to be the minimizer of a perturbed version of the objective function $L_0$, and recall that $\beta^*$ is the minimizer of $L_0$. Intuitively, as $h$ gets closer to 0, $L_h(\beta) := \mathbb{E}\hat L_n(\beta, h)$ is pointwise closer to $L_0(\beta)$, and hence $\beta_h^*$ shall converge to $\beta^*$. In Assumption 11, we characterize this rate of convergence. Specifically, it is assumed that there exist two positive absolute constants $\zeta$ and $\gamma$, such that for any $h$ small enough,
$$\text{(general smoothness condition)} \qquad \|\beta_h^* - \beta^*\|_2 \le \zeta h^\gamma. \qquad (1.5)$$
Condition (1.5) is related to Honoré and Powell's condition on the existence of a Taylor expansion of $\beta_h^*$ around $\beta^*$, i.e.,
$$\beta_h^* = \beta^* + \sum_{\ell=1}^{D} b_\ell h^\ell + o(n^{-1/2}), \qquad (1.6)$$
where, when $p$ is fixed, (1.6) implies (1.5) with $\gamma = 1$. However, compared to (1.6), (1.5) is more general and suitable when $p = p_n \to \infty$ as $n \to \infty$, and can be verified for many function classes. To illustrate this, in Theorems 3.4 and 3.5, we calculate a lower bound of $\gamma$ for a series of (discontinuous) piecewise $\alpha$-Hölder function classes.
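When the perturbed minimizer $\beta_h^*$ can be computed for a grid of bandwidths (say, in a low dimensional toy model), the exponent $\gamma$ in (1.5) can be read off as a log-log slope. Below is a small sketch; the helper name and the synthetic inputs are our own, not the paper's:

```python
import math

def fit_gamma(hs, dists):
    """Least-squares slope of log(dist) against log(h): an empirical
    estimate of the exponent gamma in ||beta*_h - beta*||_2 <= zeta * h^gamma."""
    xs = [math.log(h) for h in hs]
    ys = [math.log(d) for d in dists]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

For example, distances decaying exactly like $h^{1/2}$ recover $\gamma = 1/2$, the boundary of the Hölder classes treated in Section 3.3.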
We are now ready to describe our main theoretical results. They show the adaptivity of $\hat\beta_{h_n}$ through the analysis of $\|\hat\beta_{h_n} - \beta^*\|_2^2$ with either $\gamma = 1$ (corresponding to the "smooth" case) or $\gamma < 1$ (corresponding to the "non-smooth" case). In addition to Condition (1.5), we pose the following six commonly posed conditions on the choice of kernel and the distribution of the model:
(1) the kernel function $K(\cdot)$ satisfies several routinely required conditions (Assumption 5);
(2) $W$ is not degenerate given $X$ (Assumption 6);
(3) conditioning on $\tilde W_{12} = 0$, $\tilde X_{12}$ satisfies a restricted eigenvalue condition (Bickel et al., 2009; van de Geer and Bühlmann, 2009) (Assumption 7);
(4) $\tilde W_{12}$ is informative near zero (Assumption 8);
(5) conditioning on any $W = w$, $X$ is multivariate subgaussian, i.e., any linear contrast of $X$ is subgaussian (Assumption 9);
(6) $u$ is subgaussian, or as later studied, of finite variance (Assumptions 10 and 17, respectively).
We investigate the "smooth" case first in Section 3. Theorem 3.1 proves, for every $s$-sparse vector $\beta^*$, as long as $\gamma = 1$ and $h_n, \lambda_n$ are properly chosen, with high probability,
$$\|\hat\beta_{h_n} - \beta^*\|_2^2 \lesssim s\log p/n, \qquad (1.7)$$
where by the symbol "$\lesssim$" we mean "$\le$" up to a multiplicative absolute constant. We hence recover the same rate-optimality phenomenon as those using projection-based (Zhu, 2017) or penalized least squares approaches (Müller and van de Geer, 2015; Yu et al., 2016) in high dimensions. For more discussions on this bound, see Remarks 3.7 and 3.8.

In Section 3 we further study those $g(\cdot)$ functions in the regime $\gamma < 1$. For describing the results on the "non-smooth" case, additional notation is needed. Define $\eta_n := \|\mathbb{E}[\tilde X_{12}\tilde X_{12}^T \mid \tilde W_{12} = 0]\|_\infty$ to be a characterization of the sparsity of the covariance matrix, with $\|\cdot\|_\infty$ representing the matrix induced $\ell_\infty$ norm. In Theorems 3.2 and 3.3, we prove that, when $\gamma \in [1/4, 1]$, for appropriately chosen $h_n, \lambda_n$, the inequality
$$\|\hat\beta_{h_n} - \beta^*\|_2^2 \lesssim \min\Big[\eta_n \cdot \Big\{\frac{s\log p}{n} + \Big(\frac{\log p}{n}\Big)^\gamma\Big\},\ s\Big(\frac{\log p}{n}\Big)^\gamma\Big] \qquad (1.8)$$
holds with high probability.

Inequalities (1.7) and (1.8) together demonstrate the adaptivity behavior of the pairwise difference estimator $\hat\beta_{h_n}$. Depending on the smoothness degree of $g(\cdot)$, one gets a whole range of rates from $s\log p/n$ to $\min[\eta_n \cdot \{s\log p/n + (\log p/n)^\gamma\},\ s(\log p/n)^\gamma]$, while we do not need to tailor the estimation procedure for different $g(\cdot)$ up to a selection of $(h_n, \lambda_n)$. In addition, compared to $\lambda_n$, the bandwidth $h_n$ bears a tuning-insensitive property and much less effort is required for selecting
it. In particular, we can assign $h_n = 2(\log p/n)^{1/2}$, which will automatically adapt to the $g(\cdot)$ function and achieve the best convergence rates we can derive. Interestingly, this bandwidth level is intrinsic and cannot be altered. For more details, see Remark 2.2 and Remark 3.8.

The bound (1.8) behaves differently depending on the scales of $(n, p, s)$, $\eta_n$, and $\gamma$. It also reveals an interesting rate-optimality phenomenon. In particular, the pairwise difference approach can still produce an estimator with the desired rate of convergence $s\log p/n$, provided that $\eta_n$ is upper bounded by an absolute constant and $(n/\log p)^{1-\gamma} \lesssim s$. More details on this bound are given in Remark 3.9. A lower bound of the $\gamma$-value corresponding to $\alpha$-Hölder functions or some possibly discontinuous functions is calculated in Theorem 3.5. Of note, the pairwise difference approach can handle $\alpha$-Hölder classes with $\alpha \le 1/2$ and can even achieve the convergence rate $s\log p/n$ under some scaling conditions for $1/4 \le \alpha \le 1/2$. See Corollary 3.1 for further details.

We briefly discuss the technical challenges in determining the upper bounds of $\|\hat\beta_{h_n} - \beta^*\|_2^2$. First, although $\beta^*$ is $s$-sparse, $\beta_{h_n}^*$ is usually a dense vector. Accordingly, standard analysis of high dimensional M-estimators requiring sparsity of $\beta_{h_n}^*$ (see, for example, Negahban et al. (2012)) does not apply directly. For this, the proof rests on a general method for establishing the convergence rate of a possibly perturbed M-estimator. There, of independent interest, it is shown that Condition (1.5) is near-sufficient for constructing a sharp analysis. Secondly, to achieve sharp rates of convergence under the bias-variance tradeoff, $h_n$ has to be set of the order $(\log p/n)^{1/2}$, resulting in variance explosion of the loss function $\hat L_n(\beta, h_n)$, formulated as a non-degenerate U-process.
For this, a careless analysis exploiting the first-order Bernstein inequality (Proposition 2.3(a) in Arcones and Giné (1993), for example), coupled with a truncation argument, will render an inefficient bound. To overcome this, one needs to employ a more careful analysis, separately analyzing the sample-mean and degenerate U-statistic components. Such a combined analysis, formulated as a user-friendly tail bound, is provided in Lemma 3.4. Thirdly, the verification of (1.5) for Lipschitz and possibly discontinuous piecewise Hölder functions requires some additional treatment. For this, viewing an M-estimator as the minimizer of a stochastic perturbation of the population loss function, we borrow known techniques in mathematical statistics (van der Vaart and Wellner, 1996), and couple them with some initial intuition from perturbation theory (Davis and Kahan, 1970; Kato, 2013).
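For intuition about how the rates in (1.7) and (1.8) trade off, their right-hand sides are easy to tabulate numerically. Below is a toy evaluator with all absolute constants suppressed; the function name is ours:

```python
import math

def rate_bound(n, p, s, gamma, eta):
    """Rate in (1.8), up to constants: min{eta * (s*lp + lp^gamma), s * lp^gamma},
    where lp = log(p) / n.  For gamma = 1 and bounded eta this collapses to
    the parametric-type rate s * log(p) / n of (1.7)."""
    lp = math.log(p) / n
    return min(eta * (s * lp + lp ** gamma), s * lp ** gamma)
```

Tabulating this over $\gamma$ makes visible the regime, discussed around Remark 3.9, in which the $\eta_n$-branch of the minimum recovers $s\log p/n$ even for $\gamma < 1$.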
1.2 Additional properties
The previous section highlights the adaptivity of the pairwise difference approach in handling a range of both "smooth" and "non-smooth" $g(\cdot)$ functions. As another contribution of this paper, in Sections 3 and 4, we study some additional properties of the proposed procedure.

We start with the minimum sample size requirement for estimation consistency. In high dimensional data analysis, the sample size requirement is a key feature for measuring the performance of various methods. In the literature on sparse learning, a variety of scaling requirements relating $n$ to $p$ and $s$ have been posed. See, for example, Han and Liu (2016) for an illustration of the "optimal rate under sub-optimal scaling" phenomenon in estimation of sparse principal components, Dezeure et al. (2015) for the scaling requirement in conducting inference for the lasso, and Ren et al. (2015) for the minimum scaling requirement in making inference for Gaussian graphical models.
In Theorem 3.1 and several follow-up results, provided that Condition (1.5) holds with $\gamma = 1$, it is shown that the pairwise difference approach attains the rate of convergence (1.7) under a mild scaling requirement: $\{(\log p)^4 \vee s^{4/3}(\log p)^{1/3} \vee s(\log p)^2\}/n$ small enough. In many cases, this sample size requirement compares favorably with the existing ones. See the discussions in Section 6.1 for details. The improvement is viable due to the formulation of the pairwise difference estimator as the minimizer of a smooth convex loss function $\hat L_n(\beta, h_n)$. This loss function is intrinsically a U-process as $\beta$ evolves, and is technically easier to handle than the loss functions of projection/penalized least squares estimators, which involve stochastic sequences of nonlinear functions. For more interpretation, and for why the order $s^{4/3}$ appears, see Remark 3.7 and the discussions in Section 6.2.

From a different perspective, in Section 4, we further investigate the impact of heavy-tailed noise $u$ on the performance of the pairwise difference approach. Recent progress in studying the lasso has revealed a fundamental property that, given a random design of multivariate subgaussian covariates, the lasso is able to maintain minimax rate-optimal estimation of $\beta^*$ as long as $u$ has a finite $(2+\epsilon)$-moment. See, for example, Theorem 1.4 in Lecué and Mendelson (2017a). Particularly, in Corollary 4.1, we show that the result in Lecué and Mendelson (2017a) carries over to the pairwise difference approach.
1.3 Notation
Throughout the paper, we define $\mathbb{R}$, $\mathbb{Z}$, and $\mathbb{Z}^+$ to be the sets of real numbers, integers, and positive integers. For $n \in \mathbb{Z}^+$, write $[n] = \{1, \ldots, n\}$. Let $\mathbb{1}(\cdot)$ stand for the indicator function. For arbitrary vectors $v, v' \in \mathbb{R}^p$ and $0 < q < \infty$, we define $\|v\|_0 = \sum_{j=1}^p \mathbb{1}(v_j \ne 0)$, $\|v\|_q = (\sum_{j=1}^p |v_j|^q)^{1/q}$, and $\langle v, v'\rangle = \sum_{j=1}^p v_j v_j'$. For an arbitrary matrix $\Omega = (\Omega_{ij}) \in \mathbb{R}^{p\times q}$, write $\|\Omega\|_\infty = \max_{i\in[p]}\sum_{j=1}^q |\Omega_{ij}|$. For a symmetric real matrix $\Omega$, let $\lambda_{\min}(\Omega)$ denote its smallest eigenvalue. For a set $S$, we denote $|S|$ to be its cardinality and $S^c$ to be its complement. For a vector $v \in \mathbb{R}^p$ and an index set $S$, we write $v_S \in \mathbb{R}^{|S|}$ to be the sub-vector of $v$ with components indexed by $S$. For a real function $f: \mathcal{X} \to \mathbb{R}$, let $\|f\|_\infty = \sup_{x\in\mathcal{X}} f(x)$. For an arbitrary function $f: \mathbb{R}^k \to \mathbb{R}$, we use $\nabla f = (\nabla_1 f, \ldots, \nabla_k f)^T$ to denote its gradient. For an absolutely continuous random vector $X \in \mathbb{R}^p$, let $f_X$ denote its density function, $F_X$ its distribution function, and $\Sigma_X$ its covariance matrix. For a jointly continuous random vector $(X^T, W)^T \in \mathbb{R}^{p+1}$ and a measurable function $\psi(\cdot): \mathbb{R}^p \to \mathbb{R}^m$, let $f_{W|\psi(X)}(w, z)$ denote the value of the conditional density of $W = w$ given $\psi(X) = z$. For any two numbers $a, b \in \mathbb{R}$, we define $a \vee b = \max(a, b)$ and $a \wedge b = \min(a, b)$. For any two real sequences $\{a_n\}$ and $\{b_n\}$, we write $a_n \lesssim b_n$, or equivalently $b_n \gtrsim a_n$, if there exists an absolute constant $C$ such that $|a_n| \le C|b_n|$ for any large enough $n$. We write $a_n \asymp b_n$ if $a_n \lesssim b_n$ and $b_n \lesssim a_n$. We denote $I_p$ to be the $p\times p$ identity matrix for $p \in \mathbb{Z}^+$. Let $c, c', C, C' > 0$ be generic constants, whose actual values may vary from place to place.
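The vector and matrix norms defined above translate directly into code; a small sketch for self-checking purposes (function names are our own):

```python
def l0_norm(v):
    """Number of nonzero entries, ||v||_0."""
    return sum(1 for x in v if x != 0)

def lq_norm(v, q):
    """(sum_j |v_j|^q)^(1/q) for 0 < q < infinity."""
    return sum(abs(x) ** q for x in v) ** (1.0 / q)

def matrix_inf_norm(rows):
    """Matrix induced l_infinity norm: the maximum absolute row sum."""
    return max(sum(abs(x) for x in row) for row in rows)
```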
1.4 Paper organization
The rest of this paper is organized as follows. In Section 2, we introduce a general method for determining the rate of convergence of M-estimators in which the minimizer of the population risk
can be a perturbation of the true parameter. This general method lays the foundation for later sections. In Section 3, we highlight the adaptivity property of the pairwise difference approach, and show its rate-optimality in estimating $\beta^*$, i.e., inequalities (1.7) and (1.8). We first show these main results under the general smoothness condition defined in (1.5), and then for some specific function classes of $g(\cdot)$, including some discontinuous ones. Technical challenges related to controlling U-processes are also discussed in this section. In Section 4 we extend the analysis of Section 3 to allow heavy-tailed noise. Section 5 contains some finite sample analysis. Discussions are provided in Section 6, with all the proofs relegated to a supplementary appendix.
2 A general method
Let $\theta^* \in \mathbb{R}^p$ be a $p$-dimensional parameter of interest. Suppose that $Z_1, \ldots, Z_n$ with $n < p$ follow a distribution indexed by parameters $(\theta^*, \eta^*)$, where $\eta^*$ is possibly infinite-dimensional and stands for a nuisance parameter. Suppose further that $\theta^*$ minimizes a loss function $\Gamma_0(\theta)$ defined on $\mathbb{R}^p$, which is independent of the choice of the nuisance parameter and difficult to approximate. Instead, we observe a $\{Z_1, \ldots, Z_n\}$-measurable loss function, $\hat\Gamma_n(\theta, h)$, such that $\Gamma_h(\theta) = \mathbb{E}\hat\Gamma_n(\theta, h)$ is a perturbed version of $\Gamma_0(\theta)$, namely, for each $\theta \in \mathbb{R}^p$,
$$\Gamma_h(\theta) \to \Gamma_0(\theta) \quad \text{as } h \to 0.$$
We define
$$\theta_h^* = \mathop{\mathrm{argmin}}_{\theta\in\mathbb{R}^p} \Gamma_h(\theta),$$
and, for a chosen bandwidth $h_n > 0$ and a tuning parameter $\lambda_n \ge 0$ that scale with $n$,
$$\hat\theta_{h_n} = \mathop{\mathrm{argmin}}_{\theta\in\mathbb{R}^p} \big\{\hat\Gamma_n(\theta, h_n) + \lambda_n\|\theta\|_1\big\}.$$
To motivate this setting, recall that the pairwise difference estimator defined in the Introduction can be formulated into this framework, with
$$\theta^* = \beta^*, \quad \Gamma_0(\theta) = L_0(\beta), \quad \hat\Gamma_n(\theta, h) = \hat L_n(\beta, h), \quad \text{and} \quad \Gamma_h(\theta) = \mathbb{E}\hat\Gamma_n(\theta, h).$$
In this section, we present a general method for establishing the consistency of $\hat\theta_{h_n}$ to $\theta^*$ with an explicit rate of convergence. This method clearly has its origins in Negahban et al. (2012), and is recast into the current form for the purpose of our setting.

Before introducing the main result, some more assumptions and notation are in order. For establishing consistency in high dimensions, we need a notion of intrinsic dimension (Amelunxen et al., 2014) that has to be of order smaller than $n$. This paper is focused on sparsity, an assumption that has been well-accepted in the literature (Bühlmann and van de Geer, 2011).
Assumption 1 (Sparsity condition). Assume that there exists a positive integer $s = s_n < n$ such that $\|\theta^*\|_0 \le s$.
Of note, $\theta_h^*$ is usually no longer a sparse vector. Thus, the framework in Negahban et al. (2012) cannot be directly applied to obtain the error bound between $\hat\theta_{h_n}$ and the population minimizer
$\theta_{h_n}^*$. In addition, our ultimate goal is to provide the rate of convergence of $\hat\theta_{h_n}$ to $\theta^*$ rather than to $\theta_{h_n}^*$. These two facts motivate us to characterize the behavior of $\hat\theta_{h_n}$ by carefully constructing a surrogate sparse vector $\tilde\theta_{h_n}^* \in \mathbb{R}^p$, which possibly depends on $h_n$. Then the estimation accuracy of $\hat\theta_{h_n}$ can be well controlled via bounding $\hat\theta_{h_n} - \tilde\theta_{h_n}^*$ and $\tilde\theta_{h_n}^* - \theta^*$ separately. Specifically, for the latter term we assume that there exist a positive number $\rho_n$ and an integer $\tilde s_n > 0$ such that
$$\|\tilde\theta_{h_n}^* - \theta^*\|_2 \le \rho_n \quad \text{and} \quad \|\tilde\theta_{h_n}^*\|_0 = \tilde s_n. \qquad (2.1)$$
Possible choices of $\tilde\theta_{h_n}^*$ include $\theta_{h_n}^*$ and a hard thresholding version of $\theta_{h_n}^*$.

We then proceed to characterize the first term $\hat\theta_{h_n} - \tilde\theta_{h_n}^*$ above by studying the relation between $h_n$ and $\lambda_n$ through the surrogate sparse vector $\tilde\theta_{h_n}^*$. This characterization is made via the following perturbation level condition, which corresponds to Equation (23) in Negahban et al. (2012).
Assumption 2 (Perturbation level condition). Assume a sequence $\{\epsilon_{1,n}, n \in \mathbb{Z}^+\}$ such that $\epsilon_{1,n} \to 0$ as $n \to \infty$, and for each $n \in \mathbb{Z}^+$,
$$\mathbb{P}\Big(2\big|\nabla_k\hat\Gamma_n(\tilde\theta_{h_n}^*, h_n)\big| \le \lambda_n \text{ for all } k \in [p]\Big) \ge 1 - \epsilon_{1,n}. \qquad (2.2)$$
Remark 2.1. Assumption 2 deserves some more explanation. While we often have $\mathbb{E}\nabla_k\hat\Gamma_n(\theta_{h_n}^*, h_n) = 0$, the same conclusion does not always apply to $\mathbb{E}\nabla_k\hat\Gamma_n(\tilde\theta_{h_n}^*, h_n)$. Hence, the mean value of $\nabla_k\hat\Gamma_n(\tilde\theta_{h_n}^*, h_n)$ is intrinsically a measure of the bias level of our estimator $\hat\theta_{h_n}$ away from the surrogate $\tilde\theta_{h_n}^*$, which also characterizes the perturbation of $\Gamma_h$ relative to $\Gamma_0$, especially for the choice $\tilde\theta_{h_n}^* = \theta^*$. For this reason, we call it the perturbation level condition.
Remark 2.2. Concerning the studied pairwise difference estimator, we will see in Section 3 that (2.2) usually reduces to a requirement $\lambda_n \gtrsim h_n + (\log p/n)^{1/2}$ and $h_n \gtrsim (\log p/n)^{1/2}$, which sharply regulates the scales of the pair $(h_n, \lambda_n)$. Compared to the typical requirement $\lambda_n \gtrsim (\log p/n)^{1/2}$, the extra term $h_n$ quantifies the bias level. In addition, the requirement $h_n \gtrsim (\log p/n)^{1/2}$ is intrinsic to our pairwise difference approach. In order to have a sharp choice of $\lambda_n \asymp (\log p/n)^{1/2}$, $h_n$ has to be chosen at the order of $(\log p/n)^{1/2}$ (see Remarks 3.5-3.6 for further details).
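Combining Remark 2.2 with the concrete bandwidth $h_n = 2(\log p/n)^{1/2}$ mentioned in the Introduction, the tuning choices can be written down in a few lines. The constant `c_lam` below stands for the unspecified absolute constant in the requirement $\lambda_n \gtrsim h_n + (\log p/n)^{1/2}$ and is our own placeholder:

```python
import math

def tuning(n, p, c_lam=1.0):
    """Tuning-insensitive choices: h_n of order (log p / n)^{1/2}
    (here the concrete 2 * (log p / n)^{1/2}), and lambda_n obeying
    lambda_n >= c_lam * (h_n + (log p / n)^{1/2})."""
    base = math.sqrt(math.log(p) / n)
    h = 2.0 * base
    lam = c_lam * (h + base)
    return h, lam
```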
We then move on to define an analogue of the restricted eigenvalue (RE) condition for the lasso introduced in Bickel et al. (2009) (see also the compatibility condition in van de Geer and Bühlmann (2009), among other similar conditions). To this end, we denote the first-order Taylor series error, evaluated at $\tilde\theta_{h_n}^*$, to be
$$\delta\hat\Gamma_n(\Delta, h_n) = \hat\Gamma_n(\tilde\theta_{h_n}^* + \Delta, h_n) - \hat\Gamma_n(\tilde\theta_{h_n}^*, h_n) - \langle\nabla\hat\Gamma_n(\tilde\theta_{h_n}^*, h_n), \Delta\rangle.$$
Further define the sets
$$\tilde S_n = \big\{j \in [p]: \tilde\theta_{h_n,j}^* \ne 0\big\} \quad \text{and} \quad \mathcal{C}_{\tilde S_n} = \big\{\Delta \in \mathbb{R}^p : \|\Delta_{\tilde S_n^c}\|_1 \le 3\|\Delta_{\tilde S_n}\|_1\big\}.$$
We are now ready to state the empirical RE assumption.
Assumption 3 (Empirical restricted eigenvalue condition). Assume, for any $h > 0$, $\hat\Gamma_n(\theta, h)$ is convex in $\theta$. In addition, assume that there exist a positive absolute constant $\kappa_1$ and a radius $r$ such that there exists a sequence $\{\epsilon_{2,n}, n \in \mathbb{Z}^+\}$ with $\epsilon_{2,n} \to 0$ as $n \to \infty$, and for each $n \in \mathbb{Z}^+$,
$$\mathbb{P}\Big(\delta\hat\Gamma_n(\Delta, h_n) \ge \kappa_1\|\Delta\|_2^2 \text{ for all } \Delta \in \mathcal{C}_{\tilde S_n} \cap \{\Delta \in \mathbb{R}^p : \|\Delta\|_2 \le r\}\Big) \ge 1 - \epsilon_{2,n}.$$
With $\tilde s_n = |\tilde S_n|$ and $\rho_n$ defined in (2.1), we are then able to present the general method for determining the convergence rate of $\hat\theta_{h_n}$.

Theorem 2.1. Provided $\lambda_n \le \kappa_1 r/(3\tilde s_n^{1/2})$ and Assumptions 1-3 hold, the inequality
$$\|\hat\theta_{h_n} - \theta^*\|_2^2 \le \frac{18\tilde s_n\lambda_n^2}{\kappa_1^2} + 2\rho_n^2$$
holds with probability at least $1 - \epsilon_{1,n} - \epsilon_{2,n}$.
Of note, in the sequel, $r$ can always be set to $\infty$. In those cases, the first condition in Theorem 2.1 vanishes, resulting in a theorem that (slightly) extends Theorem 1 in Negahban et al. (2012).
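Computationally, $\hat\theta_{h_n}$ is the solution of a convex $\ell_1$-penalized program, so a standard proximal gradient (ISTA-type) iteration applies. The sketch below is our own generic illustration for a caller-supplied smooth gradient, not an algorithm proposed in the paper:

```python
def soft_threshold(x, t):
    """Proximal map of t * |x|."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0

def ista(grad, theta0, lam, step, iters=500):
    """Proximal gradient descent for min_theta Gamma(theta) + lam * ||theta||_1,
    where grad(theta) returns the gradient of the smooth part Gamma."""
    theta = list(theta0)
    for _ in range(iters):
        g = grad(theta)
        theta = [soft_threshold(theta[k] - step * g[k], step * lam)
                 for k in range(len(theta))]
    return theta
```

For instance, with the one-dimensional quadratic loss $\frac{1}{2}(\theta - 2)^2$ and $\lambda = 1$, the iteration converges to the soft-thresholded value 1.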
3 Pairwise difference approach for partially linear model
The goal of this section is to prove inequalities (1.7) and (1.8) via the general method introduced in Section 2. Specifically, in Section 3.1 we first introduce some general assumptions imposed on the kernel as well as the distribution of the model. Using Theorem 2.1, Section 3.2 presents the main results, i.e., inequalities (1.7) and (1.8), under the general smoothness condition (1.5). Section 3.3 specializes the main results of Section 3.2 by focusing on several specific classes of $g(\cdot)$ functions, including Hölder function classes as well as some discontinuous ones. We close this section by highlighting two distinct technical challenges intrinsic to the pairwise difference approach.

Some additional notation is introduced. In what follows, we write $(Y, X, W, u)$ for a copy of $(Y_1, X_1, W_1, u_1)$, and $(\tilde Y, \tilde X, \tilde W, \tilde u)$ for a copy of $(\tilde Y_{12}, \tilde X_{12}, \tilde W_{12}, \tilde u_{12})$. Recall that, without loss of generality, we assume $W$ to be absolutely continuous with respect to the Lebesgue measure, since a discrete-type $W$ would render a much simpler situation for analyzing $\hat\beta_{h_n}$ in (1.2).
3.1 Regularity assumptions
We start with some common assumptions in the literature. The first assumption formally states the model described in the Introduction. Of note, we do not require either $X$ or $u$ to be mean-zero.
Assumption 4. Assume $\{(Y_i, X_i, W_i, u_i), i \in [n]\}$ are i.i.d. random variables following (1.1) with $X_i \in \mathbb{R}^p$, $Y_i, W_i, u_i \in \mathbb{R}$, and $p > n$. Assume $\beta^* \in \mathbb{R}^p$ to be $s$-sparse with $s := \|\beta^*\|_0 < n$.

Our second assumption concerns the kernel function $K(\cdot)$ specified in (1.3).

Assumption 5. Assume a positive kernel $K(\cdot)$ such that $\int_{-\infty}^{+\infty} K(w)\,dw = 1$. Further assume there exists some positive absolute constant $M_K$ such that
$$\max\Big\{\int_{-\infty}^{+\infty} |w|^3 K(w)\,dw,\ \sup_{w\in\mathbb{R}} |w|K(w),\ \sup_{w\in\mathbb{R}} K(w)\Big\} \le M_K.$$
For simplicity, we pick $M_K \ge 1$.
Remark 3.1. Assumption 5 excludes the higher order kernel functions considered in the literature (Li and Racine, 2007; Abrevaya and Shin, 2011), which would add non-convexity to the loss function (1.3), an issue that is not easy to fix with existing techniques. The assumption is mild, and (after rescaling) is satisfied by any of the classical kernel functions in Section 1.2 of Tsybakov (2009).
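Remark 3.1 asserts that classical kernels satisfy Assumption 5 after rescaling; for any concrete kernel the three conditions can also be checked numerically. Below is a sketch for the Gaussian kernel, using a plain midpoint Riemann sum on a truncated range (an approximation of our own choosing):

```python
import math

def gaussian_kernel(w):
    """Standard normal density."""
    return math.exp(-0.5 * w * w) / math.sqrt(2.0 * math.pi)

def check_assumption5(kernel, lo=-10.0, hi=10.0, n=20000):
    """Midpoint-rule check of Assumption 5: returns the total mass of K
    (should be 1) and a valid constant M_K bounding the three quantities."""
    step = (hi - lo) / n
    grid = [lo + (i + 0.5) * step for i in range(n)]
    mass = sum(kernel(w) for w in grid) * step                 # integral of K
    third = sum(abs(w) ** 3 * kernel(w) for w in grid) * step  # integral of |w|^3 K
    sup_wk = max(abs(w) * kernel(w) for w in grid)             # sup |w| K(w)
    sup_k = max(kernel(w) for w in grid)                       # sup K(w)
    return mass, max(third, sup_wk, sup_k)
```

For the Gaussian kernel the returned bound is about 1.60, driven by the third absolute moment $\mathbb{E}|Z|^3 = 2\sqrt{2/\pi} \approx 1.596$, so any $M_K \ge 1.6$ works.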
The remaining assumptions concern the distribution of $(X, W, u)$. We first characterize the "conditional non-degeneracy" of the pair $(X, W)$ in the following Assumptions 6, 7, and 7′. This is also related to the identification condition for $\beta^*$ posed in Section 7.1 of Li and Racine (2007).
Assumption 6. Assume that the conditional density of $W$ is smooth enough; more specifically, assume there exists a positive absolute constant $M$ such that
$$\sup_{w,x}\max\Big\{\Big|\frac{\partial f_{W|X}(w,x)}{\partial w}\Big|,\ f_{W|X}(w,x)\Big\} \le M.$$

Assumption 7. Define $S \subset [p]$ to be the support of $\beta^*$. Assume there exists some positive absolute constant $\kappa_\ell$ such that for any $v \in \{v' \in \mathbb{R}^p : \|v'_{S^c}\|_1 \le 3\|v'_S\|_1\}$, we have $v^T\mathbb{E}\big[\tilde X\tilde X^T \mid \tilde W = 0\big]v \ge \kappa_\ell\|v\|_2^2$.

As will be seen later, Assumption 7 is sufficient when $g(\cdot)$ is "smooth" (e.g., globally $\alpha$-Hölder with $\alpha \ge 1$; see Section 3.3.2). However, if $g(\cdot)$ is "non-smooth" (e.g., discontinuous or globally $\alpha$-Hölder with $\alpha < 1$; see Section 3.3.1), we need to pose a stronger lower eigenvalue condition, summarized in the following assumption.
Assumption 7′. There exists an absolute constant $\kappa_\ell > 0$ such that $\lambda_{\min}\big(\mathbb{E}[\tilde X\tilde X^T \mid \tilde W = 0]\big) \ge \kappa_\ell$.

Remark 3.2. Assumption 7′ is posed to achieve two goals: characterizing the bias $\|\beta_{h_n}^* - \beta^*\|_2$ and the variance. While Assumption 7 suffices for controlling the variance term, controlling the bias usually requires much stronger conditions. Indeed, it suffices to use the following two assumptions, which are consequences of Assumption 7′:
(A1) Assumption 7 together with a slight variant of it (referred to as Assumption 13), posed for proving a new empirical RE condition;
(A2) there exist positive absolute constants $C, C' > 0$ such that $\|\beta_h^* - \beta^*\|_2^2 \le C|L_0(\beta_h^*) - L_0(\beta^*)|$ for any $h \in (0, C')$, posed for bounding $\|\beta_{h_n}^* - \beta^*\|_2$.

For (1.3) to be informative, besides requiring the pair $(X, W)$ to be conditionally non-degenerate, we also need $\tilde W$ to possess nonzero probability mass in a small neighborhood of 0. The next assumption, combined with Assumption 6, guarantees this.
Assumption 8. There exists some positive absolute constant $M_\ell$ such that $f_{\tilde W}(0) \ge M_\ell$.

Before introducing the last two regularity conditions, we formally define the subgaussian distribution. We say that a random variable $X$ is subgaussian with parameter $\sigma^2$ if $\mathbb{E}\exp\{t(X - \mathbb{E}[X])\} \le \exp(t^2\sigma^2/2)$ holds for any $t \in \mathbb{R}$. We then define the following two conditions, which regulate the tail behaviors of $X$ and $u$.
Assumption 9. There exists some positive absolute constant $\kappa_x$ such that, conditional on $W = w$ for any $w$ in the range of $W$, and unconditionally, $\langle X, v\rangle$ is subgaussian with parameter at most $\kappa_x^2\|v\|_2^2$ for any $v \in \mathbb{R}^p$.
Assumption 10. There exists some positive absolute constant $\kappa_u$ such that $u$ is subgaussian with parameter at most $\kappa_u^2$.

Remark 3.3. There is no need to assume zero mean for $X$ or $u$ in Assumptions 9-10, thanks to our pairwise difference approach. Assumption 9 is arguably difficult to relax (Adamczak et al., 2011; Lecué and Mendelson, 2017b). Assumption 10 has been posed throughout the existing literature. In Section 4, however, we will show that the pairwise difference approach can accommodate a much milder moment condition on the noise $u$ when Assumption 9 holds.
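As a quick sanity check on the subgaussian definition above, a centered Rademacher variable ($\pm 1$ with equal probability) is subgaussian with parameter $\sigma^2 = 1$, since $\mathbb{E}\exp(tX) = \cosh(t) \le \exp(t^2/2)$. The numerical spot-check below is our own illustration, not part of the paper:

```python
import math

def rademacher_mgf(t):
    """E exp(t X) for X = +1 or -1 with probability 1/2 each (mean zero)."""
    return math.cosh(t)

def is_subgaussian_bound_ok(mgf, sigma2, ts):
    """Check E exp(t (X - EX)) <= exp(t^2 sigma2 / 2) on a grid of t values."""
    return all(mgf(t) <= math.exp(t * t * sigma2 / 2.0) + 1e-12 for t in ts)
```

The check passes with $\sigma^2 = 1$ on a wide grid of $t$, and fails for a too-small parameter such as $\sigma^2 = 0.1$, as expected.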
3.2 Main results
In this section we prove inequalities (1.7) and (1.8) under a general smoothness condition. The main results are stated in Theorem 3.1 for the "smooth" case and in Theorems 3.2-3.3 for the "non-smooth" case.

We first formally pose the smoothness condition (1.5). Recalling $\beta_h^*$ defined in (1.4), we assume that $\beta_h^*$ converges to $\beta^*$ at a certain polynomial rate in $h$.

Assumption 11 (General smoothness condition). Assume there exist absolute constants $\zeta > 0$, $\gamma \in (0, 1]$, and $C_0 > 0$ such that for any $h \in (0, C_0)$ we have $\|\beta_h^* - \beta^*\|_2 \le \zeta h^\gamma$.

Remark 3.4. Assumption 11 is closely related to the Taylor expansion condition (1.6) posed in Honoré and Powell (2005) and Aradillas-Lopez et al. (2007). There the authors verified the condition under several smoothness conditions on the model (see Section 3.3 in Honoré and Powell (2005) for details). Unfortunately, in general their condition is ill-defined with increasing dimension. Instead, our condition can be viewed as a high dimensional analogue of Condition (1.6), which also includes the non-smooth case $\gamma < 1$.

We further require one more assumption involving the function $g(\cdot)$. Denote
$$U_k = \binom{n}{2}^{-1}\sum_{i<j} \frac{1}{h}K\Big(\frac{\tilde W_{ij}}{h}\Big)\tilde X_{ijk}\big(g(W_i) - g(W_j)\big), \quad \text{for } k \in [p]. \qquad (3.1)$$

Remark 3.6. There is one interesting phenomenon regarding Assumption 12 that sheds some light on the advantage of the pairwise difference approach. Instead of using the U-statistic objective function $\hat L_n(\beta, h_n)$ in (1.3), one may also consider the objective function $L_n^{sp}(\beta, h_n)$ obtained by naively splitting the data into two halves, where
$$L_n^{sp}(\beta, h_n) := \frac{2}{n}\sum_{i=1}^{n/2} \frac{1}{h_n}K\Big(\frac{\tilde W_{(2i-1)(2i)}}{h_n}\Big)\big(\tilde Y_{(2i-1)(2i)} - \tilde X_{(2i-1)(2i)}^T\beta\big)^2.$$
By doing so, the statistic in Assumption 12 also becomes a data-splitting based statistic $U_k^{sp}$, which can be written as the sum of $n/2$ i.i.d. terms. However, if $U_k^{sp}$ is used, Assumption 12 is no longer valid as $h_n \to 0$. To see this, as $h_n \to 0$, the effective sample size becomes $nh_n$, and thus the concentration rate is $\{\log p/(nh_n)\}^{1/2}$.
Thanks to our pairwise difference approach, the sharp concentration can still hold as long as $h_n \gtrsim (\log p/n)^{1/2}$. Similar observations have been known in the literature (cf. Section 6 in Ramdas et al. (2015)). Also see Lemma A2.21 in the supplement for further details on verifying Assumption 12.

Our first result concerns the situation $\gamma = 1$. This corresponds to the "smooth" case, on which a vast literature on partially linear models has focused (Li and Racine, 2007).

Theorem 3.1 (Smooth case). Assume Assumption 11 holds with $\gamma = 1$ and that there exists some absolute constant $K_1 > 0$ such that $K_1(\log p/n)^{1/2} < C_0$. In addition, assume Assumptions 4-12 hold and that
$$h_n \in [K_1(\log p/n)^{1/2}, C_0), \quad \lambda_n \ge C\big\{h_n + (\log p/n)^{1/2}\big\}, \quad \text{and} \quad n \ge C\big\{(\log p)^3 \vee s^{4/3}(\log p)^{1/3} \vee s(\log p)^2\big\},$$
where the constant $C$ only depends on $M, M_K, C_0, \kappa_x, \kappa_u, \kappa_\ell, M_\ell, \zeta$, and $K_1$. Then we have
$$\mathbb{P}\big(\|\hat\beta_{h_n} - \beta^*\|_2^2 \le C's\lambda_n^2\big) \ge 1 - c\exp(-c'\log p) - c\exp(-c'n) - \epsilon_n,$$
where $C', c, c'$ are three positive constants only depending on $M, M_K, C_0, \kappa_x, M_\ell, \kappa_\ell$, and $C$.

Remark 3.7. Theorem 3.1 shows that, in many settings, the sample size requirement $n \ge Cs^{4/3}(\log p)^{1/3}$ suffices for $\hat\beta_{h_n}$ to be consistent with the convergence rate $s\log p/n$. It compares favorably to the best existing scaling requirement (see Section 6.1 for detailed comparisons with existing results). The term $s^{4/3}(\log p)^{1/3}$ shows up in verifying the empirical restricted eigenvalue condition, Assumption 3. In particular, the major effort is put on separately controlling a sample-mean-type random matrix and a degenerate U-matrix. See Theorem 3.7 for details. The sharpness of this sample size requirement is further discussed in Section 6.2.

Remark 3.8. The requirement $h_n \in [K_1(\log p/n)^{1/2}, C_0)$ in Theorem 3.1 is due to the pairwise difference approach, in order to verify the perturbation level condition, Assumption 2.
With regard to this bandwidth selection, Theorem 3.1 unveils a tuning-insensitive phenomenon, which also applies to all the following results. In particular, upon examining the theorem, it is immediate that one may choose, say, $h_n = 2(\log p/n)^{1/2}$ without any asymptotic impact on the estimation accuracy. The constant $K_1$ that achieves the best constant in the upper bound can in principle be calculated; however, we do not find it necessary to provide it.

Picking $h_n \asymp \lambda_n \asymp (\log p/n)^{1/2}$, we recover the desired rate of convergence $s\log p/n$, which proves to be minimax rate-optimal (Raskutti et al., 2011), as if there were no nonparametric component. This result has been established using other procedures in the literature under certain smoothness conditions on $g(\cdot)$. In contrast, our result shows that the pairwise difference approach can attain estimation rate-optimality in the "smooth" regime, while in various settings improving the best scaling requirement to date.

Our next result concerns the "non-smooth" case $\gamma < 1$. Its proof is a slight modification of that of Theorem 3.1.

Theorem 3.2 (Non-smooth, case I). Assume Assumption 11 holds with a general $\gamma \in (0,1]$ and that there exists some absolute constant $K_1 > 0$ such that $K_1(\log p/n)^{1/2} < C_0$. In addition, we assume Assumptions 4-12 hold and that
\[ h_n \in [K_1(\log p/n)^{1/2}, C_0), \quad \lambda_n \ge C\{(\log p/n)^{1/2} + h_n^{\gamma}\}, \quad \text{and} \quad n \ge C\{(\log p)^3 \vee s^{4/3}(\log p)^{1/3} \vee s(\log p)^2\}, \]
where the constant $C$ only depends on $M$, $M_K$, $C_0$, $\kappa_x$, $\kappa_u$, $\kappa_\ell$, $M_\ell$, $\zeta$, $\gamma$, and $K_1$. Then we have
\[ \mathbb{P}\big(\|\widehat\beta_{h_n} - \beta^*\|_2^2 \le C' s\lambda_n^2\big) \ge 1 - c\exp(-c'\log p) - c\exp(-c'n) - \epsilon_n, \]
where $C'$, $c$, $c'$ are three positive constants only depending on $M$, $M_K$, $C_0$, $\kappa_x$, $M_\ell$, $\kappa_\ell$, and $C$.

Picking $h_n \asymp (\log p/n)^{1/2}$ and $\lambda_n \asymp (\log p/n)^{\gamma/2}$, we obtain a series of upper bounds $s(\log p/n)^{\gamma}$ as $\gamma$ ranges from 1 to 0.
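To make the procedure concrete, the following sketch (ours, not the paper's; the box kernel, the toy data-generating $g$, and all variable names are illustrative assumptions) evaluates the kernel-weighted pairwise-difference loss $\widehat L_n(\beta, h_n)$ underlying $\widehat\beta_{h_n}$, using the tuning-insensitive bandwidth $h_n = 2(\log p/n)^{1/2}$ discussed above:

```python
import numpy as np

def box_kernel(t):
    """Box kernel K(t) = 1{|t| <= 1/2}, one admissible kernel choice."""
    return (np.abs(t) <= 0.5).astype(float)

def pairwise_difference_loss(beta, Y, X, W, h, kernel=box_kernel):
    """U-statistic loss: average over all pairs i < j of
    (1/h) K(W~_ij / h) (Y~_ij - X~_ij^T beta)^2, where tildes are pairwise differences."""
    n = len(Y)
    i, j = np.triu_indices(n, k=1)      # all pairs i < j
    Wd, Yd, Xd = W[i] - W[j], Y[i] - Y[j], X[i] - X[j]
    w = kernel(Wd / h) / h              # kernel weights (1/h) K(W~_ij / h)
    resid = Yd - Xd @ beta
    return np.mean(w * resid ** 2)

# toy data from the PLM: Y = X beta* + g(W) + u
rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
W = rng.uniform(size=n)
beta_star = np.zeros(p); beta_star[:2] = [1.0, -2.0]
Y = X @ beta_star + np.sin(2 * np.pi * W) + 0.1 * rng.normal(size=n)

h = 2 * np.sqrt(np.log(p) / n)          # the tuning-insensitive bandwidth choice
loss_true = pairwise_difference_loss(beta_star, Y, X, W, h)
loss_zero = pairwise_difference_loss(np.zeros(p), Y, X, W, h)
print(loss_true, loss_zero)             # loss is far smaller at beta* than at 0
```

In the paper the estimator additionally adds an $\ell_1$ penalty $\lambda_n\|\beta\|_1$ and minimizes over $\beta$; the sketch only evaluates the unpenalized loss to illustrate how close pairs in $W$ difference out the nonparametric component $g(\cdot)$.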
While the rate derived here is not always $s\log p/n$, to the best of our knowledge there are few analogous theoretical results in the literature on functions $g(\cdot)$ leading to this setting, especially when $\gamma$ is small. The only exceptions appear to be Honoré and Powell (2005) and Aradillas-Lopez et al. (2007), where the authors proved consistency under minimal assumptions on $g(\cdot)$. However, the rate of convergence was not calculated there, and the analysis was restricted to fixed dimensional settings.

By inspecting Theorem 3.2, one might conjecture that, in certain cases, the minimax optimal rate for $\widehat\beta_{h_n}$ is no longer the same as in the smooth case. Surprisingly to us, however, this is not always true. As a matter of fact, one can still recover the rate $s\log p/n$ in a certain regime of $(n, p, s)$. To this end, we first define an RE condition slightly stronger than Assumption 7.

Assumption 13. Assume there exists some positive absolute constant $\kappa_\ell$ such that for any
\[ v \in \big\{v \in \mathbb{R}^p : \|v_{J'^c}\|_1 \le 3\|v_{J'}\|_1 \text{ for any } J' \subset [p] \text{ with } |J'| \le s + \zeta^2 n h_n^{2\gamma}/\log p\big\}, \]
we have $v^{\top}\,\mathbb{E}\big[\widetilde X\widetilde X^{\top} \mid \widetilde W = 0\big]\,v \ge \kappa_\ell \|v\|_2^2$.

Theorem 3.3 (Non-smooth, case II). Assume Assumption 11 holds with a general $\gamma \in [1/4, 1]$ and that there exists some absolute constant $K_1 > 0$ such that $K_1(\log p/n)^{1/2} < C_0$. Define $\eta_n = \|\mathbb{E}[\widetilde X\widetilde X^{\top} \mid \widetilde W = 0]\|_\infty$ as a measure of sparsity of $\mathbb{E}[\widetilde X\widetilde X^{\top} \mid \widetilde W = 0]$. In addition, we further assume Assumptions 4-6 and 8-13 hold and that
\[ h_n \in [K_1(\log p/n)^{1/2}, C_0), \quad \lambda_n \ge C\{h_n + \eta_n(\log p/n)^{1/2}\}, \quad \text{and} \quad n \ge C\{(\log p)^3 \vee q^{4/3}(\log p)^{1/3} \vee q(\log p)^2\}, \]
where $q = s + n h_n^{2\gamma}/\log p$, and the constant $C > 0$ only depends on $M$, $M_K$, $C_0$, $\kappa_x$, $\kappa_u$, $\kappa_\ell$, $M_\ell$, $\zeta$, $\gamma$, and $K_1$. Then we have
\[ \mathbb{P}\Big(\|\widehat\beta_{h_n} - \beta^*\|_2^2 \le C'\Big\{s\lambda_n^2 + \frac{s\log p}{n} + \frac{n\lambda_n^2 h_n^{2\gamma}}{\log p}\Big\}\Big) \ge 1 - c\exp(-c'\log p) - c\exp(-c'n) - \epsilon_n, \]
where $C'$, $c$, $c'$ are three positive constants only depending on $M$, $M_K$, $C_0$, $\kappa_x$, $M_\ell$, $\kappa_\ell$, and $C$.

Remark 3.9.
Picking $h_n \asymp (\log p/n)^{1/2}$ and $\lambda_n \asymp \eta_n(\log p/n)^{1/2}$, Theorem 3.3 proves that, under the sample size requirement $n \ge C\{(\log p)^{3\vee(1+\gamma^{-1})} \vee s^{4/3}(\log p)^{1/3} \vee s(\log p)^2\}$, the inequality
\[ \|\widehat\beta_{h_n} - \beta^*\|_2^2 \lesssim \eta_n^2 \cdot \Big\{\frac{s\log p}{n} + \Big(\frac{\log p}{n}\Big)^{\gamma}\Big\} \]
holds with high probability. On the one hand, when $\eta_n \lesssim 1$ and $\gamma = 1$, the above bound reduces to the one presented in Theorem 3.1. On the other hand, when $\eta_n \lesssim 1$, $\gamma < 1$, and $s(\log p)^{1-\gamma} \gtrsim n^{1-\gamma}$, we can still recover the optimal rate $s\log p/n$. In addition, although $\lambda_n$ is set to be of possibly different orders in Theorems 3.1-3.3, $h_n$ is consistently chosen to be of the same order $(\log p/n)^{1/2}$. We refer to Remark 3.8 for discussions.

3.3 Case studies

This section exploits and extends the main results, Theorems 2.1, 3.1, 3.2, and 3.3, to examine several special $g(\cdot)$ functions. We first consider various Hölder function classes as well as some discontinuous ones under a lower eigenvalue condition, Assumption 7′. In Section 3.3.2, we show that, for the Lipschitz function class (and also higher order Hölder classes), rate-optimality can be attained under a weaker restricted eigenvalue condition, Assumption 7, by directly applying Theorem 2.1.

In the following, we formally define the $\alpha$-Hölder function class.

Definition 3.1. Let $\ell = \lfloor \alpha \rfloor$ denote the greatest integer strictly less than $\alpha$. Given a metric space $(T, |\cdot|)$, a function $g: T \to \mathbb{R}$ is said to belong to the $(L, \alpha)$-Hölder function class if
\[ |g^{(\ell)}(x) - g^{(\ell)}(y)| \le L|x - y|^{\alpha - \ell}, \quad \text{for any } x, y \in T. \]
Here $g^{(\ell)}$ represents the $\ell$-th derivative of $g(\cdot)$. We say that $g(\cdot)$ is $\alpha$-Hölder if $g(\cdot)$ belongs to the $(L, \alpha)$-Hölder function class for a certain absolute constant $L > 0$.

3.3.1 Cases under the general smoothness condition

This section is devoted to general $\alpha$-Hölder functions and certain discontinuous ones. For this, we need a stronger lower eigenvalue condition, Assumption 7′. We also need a smoothness condition on $f_{W|X}(w, x)$ below, slightly stronger than Assumption 6.

Assumption 14.
Assume that the conditional density of $W$ is smooth enough; more specifically, assume that there exists some positive absolute constant $M$ such that
\[ \sup_{w,x} \Big|\frac{\partial^2 f_{W|X}(w, x)}{\partial w^2}\Big| \le M. \]

We first consider any $\alpha$-Hölder function $g(\cdot)$ with $\alpha \ge 1$ and show that it yields Assumption 11 with $\gamma = 1$.

Theorem 3.4. Assume $h \le C_0$ for some positive constant $C_0$, and that $h \le \kappa_\ell M_\ell^2 \cdot (4MM_K\kappa_x^2)^{-1}$. In addition, assume $g(\cdot)$ to be 1-Hölder (or higher-order Hölder on some compact support $[a, b]$). Under Assumptions 4-6, 7′, 8-9, and 14, we have
\[ \|\beta^*_h - \beta^*\|_2 \le \zeta h, \]
where $\zeta$ is a constant only depending on $M$, $M_K$, $C_0$, $\mathbb{E}\tilde u^2$, $\kappa_x$, $\kappa_\ell$, $M_\ell$, the Hölder constant (and $a$, $b$).

We then move on to the non-smooth case, covering certain discontinuous functions as well as $\alpha$-Hölder functions with $\alpha < 1$. Specifically, we make the following assumption about $g(\cdot)$.

Assumption 15. There exist absolute constants $M_g > 0$, $M_d \ge 0$, $M_a \ge 0$, and $0 < \alpha \le 1$, and a set $A \subset \mathbb{R}^2$, such that
\[ |g(w_1) - g(w_2)| \le M_g|w_1 - w_2|^{\alpha} + M_d \mathbb{1}\{(w_1, w_2) \in A\}, \]
for any $w_1, w_2$ in the range of $W$, and that
\[ \mathbb{E}\Big[\frac{1}{h}K\Big(\frac{\widetilde W_{ij}}{h}\Big)\mathbb{1}\{(W_i, W_j) \in A\}\Big] \le M_a h. \]
We also assume that the set $A$ is symmetric in the sense that $(w_1, w_2) \in A$ implies $(w_2, w_1) \in A$.

Example 3.1. Suppose the function $g: \mathbb{R} \to \mathbb{R}$ is piecewise $(M_L, \alpha)$-Hölder for some $\alpha \in (0, 1]$, and has discontinuity points $a_1, \dots, a_J$ with jump sizes bounded in absolute value by $C_g$, for positive absolute constants $M_L$ and $C_g$. Also suppose $|f_W(w)| \le M$ for some positive absolute constant $M$. Consider the set
\[ A = \bigcup_{j=1}^{J}\Big\{(-\infty, a_j] \times [a_j, +\infty)\Big\} \cup \Big\{[a_j, +\infty) \times (-\infty, a_j]\Big\}, \]
and the box kernel function $K(w) = \mathbb{1}(|w| \le 1/2)$. Then
\[ |g(w_1) - g(w_2)| \le (J+1)M_L \cdot |w_1 - w_2|^{\alpha} + JC_g \mathbb{1}\{(w_1, w_2) \in A\} \]
for any $w_1, w_2 \in \mathbb{R}$, and
\[ \mathbb{E}\Big[\frac{1}{h}K\Big(\frac{\widetilde W_{ij}}{h}\Big)\mathbb{1}\{(W_i, W_j) \in A\}\Big] = \frac{2}{h}\sum_{j=1}^{J}\int_{-\infty}^{a_j}\int_{a_j}^{+\infty} \mathbb{1}\{|w_1 - w_2| \le h/2\} f_W(w_1) f_W(w_2)\, dw_2\, dw_1 \le \frac{JM^2}{4}h. \]
Thus we have verified the two displayed conditions in Assumption 15 with $M_g = (J+1)M_L$, $M_d = JC_g$, and $M_a = JM^2/4$.

Example 3.2.
Suppose $W \sim {\rm Unif}[0,1]$, kernel function $K(w) = \mathbb{1}(w \in [-1/2, 1/2])$, and
\[ g(w) = \begin{cases} w, & w \in [0, 1/2),\\ w + 1, & w \in [1/2, 1]. \end{cases} \]
Suppose $h \le 1/2$ and consider the set $A = [1/4, 1/2] \times [1/2, 3/4] \cup [1/2, 3/4] \times [1/4, 1/2]$. One can easily check that
\[ |g(w_1) - g(w_2)| \le 3|w_1 - w_2| + \mathbb{1}\{(w_1, w_2) \in A\}, \]
and
\[ \mathbb{E}\Big[\frac{1}{h}K\Big(\frac{\widetilde W_{ij}}{h}\Big)\mathbb{1}\{(W_i, W_j) \in A\}\Big] = h/4. \]
Thus we have verified the two conditions in Assumption 15 with $M_g = 3$, $M_d = 1$, and $M_a = 1/4$.

Theorem 3.5. Assume $h \le C_0$ for some positive constant $C_0$, and that $h \le \kappa_\ell M_\ell^2 \cdot (4MM_K\kappa_x^2)^{-1}$. Under Assumptions 4-6, 7′, 8-9, and 14-15, we have
\[ \|\beta^*_h - \beta^*\|_2 \le \zeta h^{\gamma}, \]
where $\zeta > 0$ is a constant only depending on $M$, $M_K$, $C_0$, $M_g$, $M_d$, $M_a$, $\mathbb{E}[\tilde u^2]$, $\kappa_x$, $\kappa_\ell$, $M_\ell$, and $\gamma = \alpha$ if $M_dM_a = 0$, while $\gamma = \alpha \wedge 1/2$ otherwise.

Theorems 3.4-3.5 verify the general smoothness condition, Assumption 11. It then suffices to verify Assumption 12 in order to apply the main results of Section 3.2, i.e., Theorems 3.1-3.3, and obtain the following corollary.

Corollary 3.1. Assume there exist some absolute constants $K_1, C_0 > 0$ such that $K_1(\log p/n)^{1/2} < C_0$, and that
\[ h_n \in [K_1(\log p/n)^{1/2}, C_0) \quad \text{and} \quad n \ge C\{(\log p)^4 \vee q^{4/3}(\log p)^{1/3} \vee q(\log p)^2\}, \]
where the quantity $q$ and the dependence of the constant $C$ are specified in the three cases below. In addition, we assume Assumptions 4-6, 7′, 8-10, and 14 hold.

(1) Assume that $g(\cdot)$ is $\alpha$-Hölder for $\alpha \ge 1$, and that $g(\cdot)$ has compact support when $\alpha > 1$. Set $q = s$. Assume further that $\lambda_n \ge C\{h_n + (\log p/n)^{1/2}\}$, where $C$ only depends on $M$, $M_K$, $C_0$, $\kappa_x$, $\kappa_u$, $\kappa_\ell$, $M_\ell$, $K_1$, and the Hölder parameters of $g(\cdot)$. Then we have
\[ \mathbb{P}\big(\|\widehat\beta_{h_n} - \beta^*\|_2^2 \le C' s\lambda_n^2\big) \ge 1 - c\exp(-c'\log p) - c\exp(-c'n), \]
where $C'$, $c$, $c'$ are three positive constants only depending on $M$, $M_K$, $C_0$, $\kappa_x$, $\kappa_\ell$, $M_\ell$, and $C$.

(2) Assume Assumption 15 holds with $\alpha \in (0, 1]$. Set $q = s$. Assume further that $\lambda_n \ge C\{(\log p/n)^{1/2} + h_n^{\gamma}\}$, where $C$ only depends on $M$, $M_K$, $C_0$, $\kappa_x$, $\kappa_u$, $\kappa_\ell$, $M_\ell$, $K_1$, $M_g$, $M_d$, $M_a$, and $\gamma = \alpha$ if $M_dM_a = 0$, $\gamma = \alpha \wedge 1/2$ otherwise.
Then we have
\[ \mathbb{P}\big(\|\widehat\beta_{h_n} - \beta^*\|_2^2 \le C' s\lambda_n^2\big) \ge 1 - c\exp(-c'\log p) - c\exp(-c'n), \]
where $C'$, $c$, $c'$ are three positive constants only depending on $M$, $M_K$, $C_0$, $\kappa_x$, $\kappa_\ell$, $M_\ell$, and $C$.

(3) Assume Assumption 15 holds with $\alpha \in [1/4, 1]$. Recall $\eta_n = \|\mathbb{E}[\widetilde X\widetilde X^{\top} \mid \widetilde W = 0]\|_\infty$. Set $q = s + n h_n^{2\gamma}/\log p$. Assume further that $\lambda_n \ge C\{h_n + \eta_n(\log p/n)^{1/2}\}$, where $C$ only depends on $M$, $M_K$, $C_0$, $\kappa_x$, $\kappa_u$, $\kappa_\ell$, $M_\ell$, $K_1$, $M_g$, $M_d$, $M_a$, and $\gamma = \alpha$ if $M_dM_a = 0$, $\gamma = \alpha \wedge 1/2$ otherwise. Then we have
\[ \mathbb{P}\Big(\|\widehat\beta_{h_n} - \beta^*\|_2^2 \le C'\Big\{s\lambda_n^2 + \frac{s\log p}{n} + \frac{n\lambda_n^2 h_n^{2\gamma}}{\log p}\Big\}\Big) \ge 1 - c\exp(-c'\log p) - c\exp(-c'n), \]
where $C'$, $c$, $c'$ are three positive constants only depending on $M$, $M_K$, $C_0$, $\kappa_x$, $\kappa_\ell$, $M_\ell$, and $C$.

3.3.2 Smooth case under a weaker RE condition

Assumption 7′ is arguably strong. Here we provide a rate-optimality result for Lipschitz functions $g(\cdot)$ without requiring Assumption 7′; instead, we only need the weaker RE condition, Assumption 7. A similar conclusion, with a corresponding analysis, holds for $\alpha$-Hölder functions with $\alpha > 1$ and a compact support, and is thus omitted here.

Assumption 16 (Lipschitz condition). There exists an absolute constant $M_g > 0$ such that
\[ |g(w_1) - g(w_2)| \le M_g|w_1 - w_2|, \]
for any $w_1, w_2$ in the range of $W$.

To achieve this goal, instead of invoking the general smoothness condition (Assumption 11) and Theorem 3.1 of Section 3.2, we first characterize the bias term in a different way so as to directly check the perturbation level condition, Condition 2.

Lemma 3.1 (Bias calculation). Assume that there exist some absolute constants $K_1, C_0 > 0$ such that $K_1(\log p/n)^{1/2} < C_0$. In addition, we assume Assumptions 5, 6, 9, 10, and 16 hold and that
\[ h_n \in [K_1(\log p/n)^{1/2}, C_0), \quad \lambda_n \ge C\{h_n + (\log p/n)^{1/2}\}, \quad \text{and} \quad n \ge C(\log p)^4, \]
where the constant $C$ only depends on $M$, $M_K$, $C_0$, $\kappa_x$, $\kappa_u$, $M_g$, $K_1$. Then we have
\[ \mathbb{P}\big(2|\nabla_k \widehat L_n(\beta^*, h_n)| \le \lambda_n \text{ for all } k \in [p]\big) \ge 1 - c\exp(-c'\log p), \]
where $c$ and $c'$ are positive constants only depending on $C$.

Theorem 3.6. Assume that there exist absolute constants $K_1, C_0 > 0$ such that $K_1(\log p/n)^{1/2} < C_0$.
In addition, we assume Assumptions 4-10 and 16 hold and that
\[ h_n \in [K_1(\log p/n)^{1/2}, C_0), \quad \lambda_n \ge C\{h_n + (\log p/n)^{1/2}\}, \quad \text{and} \quad n \ge C\{(\log p)^4 \vee s^{4/3}(\log p)^{1/3} \vee s(\log p)^2\}, \]
where the constant $C$ only depends on $M$, $M_K$, $C_0$, $\kappa_x$, $\kappa_u$, $\kappa_\ell$, $M_\ell$, $M_g$, $K_1$. Then we have
\[ \mathbb{P}\big(\|\widehat\beta_{h_n} - \beta^*\|_2^2 \le C' s\lambda_n^2\big) \ge 1 - c\exp(-c'\log p) - c\exp(-c'n) - \epsilon_n, \]
where $C'$, $c$, $c'$ are three positive constants only depending on $M$, $M_K$, $C_0$, $\kappa_\ell$, $M_\ell$, and $C$.

It is worth highlighting that Assumption 16 allows the ranges of $W$ and $g(\cdot)$ to be unbounded, and hence covers some nontrivial function classes whose metric entropy numbers are, to our knowledge, still unknown. See Section 6.1 for more details.

3.4 Technical challenges of the analysis

This section highlights two distinct technical challenges we faced in proving the results of Sections 3.2-3.3. All proofs are relegated to the supplementary material.

3.4.1 On verifying the RE condition

The main results of the paper, including Theorems 3.1, 3.2, 3.3, 3.6, as well as Corollary 3.1, are all based on the general method introduced in Section 2. For this, one major object of interest is to verify the empirical RE condition (Assumption 3 in Section 2) based on population RE conditions such as Assumption 7 and its variant, Assumption 13. This result is formally stated in Corollary 3.2 at the end of this section. The proof rests on several advanced U-statistics exponential inequalities (Giné et al., 2000; Houdré and Reynaud-Bouret, 2003) and nonasymptotic random matrix analysis tools specifically tailored for U-matrices (Vershynin, 2012; Mitra and Zhang, 2014), and thus deserves a discussion.

We start with the definition of the restricted spectral norm (Han and Liu, 2016). For an arbitrary $p \times p$ real matrix $M$ and an integer $q \in [p]$, the $q$-restricted spectral norm $\|M\|_{2,q}$ of $M$ is defined as
\[ \|M\|_{2,q} := \max_{v \in \mathbb{R}^p,\ \|v\|_0 \le q} \frac{v^{\top} M v}{v^{\top} v}. \]
As pointed out in the seminal paper Rudelson and Zhou (2013), the empirical RE condition, i.e., Assumption 3, is closely related to the $q$-restricted spectral norm of the Hessian matrix of the loss function for a special choice of $q$. Our proof relies on a study of this $q$-restricted spectral norm. In Assumption 3, taking $\widehat\Gamma_n(\beta, h_n) = \widehat L_n(\beta, h_n)$, simple algebra yields
\[ \delta\widehat L_n(\Delta, h_n) = \Delta^{\top}\Big\{\binom{n}{2}^{-1}\sum_{i<j}\frac{1}{h_n}K\Big(\frac{\widetilde W_{ij}}{h_n}\Big)\widetilde X_{ij}\widetilde X_{ij}^{\top}\Big\}\Delta = \Delta^{\top}\widehat T_n\Delta. \]
Note that $\widehat T_n$ is a random U-matrix, namely, a random matrix formulated as a matrix-valued U-statistic. As discussed in the previous sections, $h_n$ is usually picked to be of the order $(\log p/n)^{1/2}$, rendering a large bump when $\widetilde W_{ij}$ is close to zero. Consequently, when $h_n$ is set in the regime of interest, the variance of the kernel $g_\Delta(D_i, D_j) = h_n^{-1}K(\widetilde W_{ij}/h_n)(\widetilde X_{ij}^{\top}\Delta)^2$ will explode at the rate $(n/\log p)^{1/2}$, leading to a loose and sub-optimal bound when using the Bernstein inequality for non-degenerate U-statistics (see, e.g., Proposition 2.3(a) in Arcones and Giné (1993)). Thus a more careful study of the random U-matrix $\widehat T_n$ is needed. The next theorem gives a concentration inequality for $\widehat T_n$ under the $q$-restricted spectral norm.

Theorem 3.7. For some $q \in [p]$, suppose there exists some absolute constant $C > 0$ such that
\[ n \ge C\cdot\big[\{q^{4/3}(\log p)^{1/3} \vee q(\log p)^2\} + \log(1/\alpha)\big]. \]
Then under Assumptions 5, 6, and 9, with probability at least $1 - \alpha$, we have
\[ \|\widehat T_n - \mathbb{E}\widehat T_n\|_{2,q} \le C'\cdot\Big[\frac{q(\log p)^{1/4}}{n^{3/4}} + \frac{q(\log p)^2}{n} + \Big\{\frac{\log(1/\alpha)}{n}\Big\}^{1/2}\Big], \]
where $C'$ is a positive constant only depending on $M$, $M_K$, $C_0$, $\kappa_x$, and $C$.

The proof of Theorem 3.7 follows the celebrated Hoeffding decomposition (Hoeffding, 1948). However, there are two major challenges. We discuss them in detail before closing this section with a formal verification of Assumption 3.
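Since $\|\cdot\|_{2,q}$ may be less familiar than the ordinary spectral norm, the following brute-force sketch (ours, not the paper's; exponential in $p$ and purely illustrative) evaluates the definition directly by maximizing the Rayleigh quotient over supports of size $q$ (supports of size exactly $q$ suffice, since enlarging the support can only increase the maximum):

```python
import numpy as np
from itertools import combinations

def restricted_spectral_norm(M, q):
    """q-restricted spectral norm: max of v^T M v / v^T v over ||v||_0 <= q,
    computed as the largest eigenvalue of M restricted to each size-q support."""
    p = M.shape[0]
    Msym = (M + M.T) / 2            # v^T M v depends only on the symmetric part
    best = -np.inf
    for S in combinations(range(p), q):
        sub = Msym[np.ix_(S, S)]
        best = max(best, np.linalg.eigvalsh(sub).max())
    return best

# sanity checks on a small symmetric matrix
rng = np.random.default_rng(1)
A = rng.normal(size=(6, 6)); A = (A + A.T) / 2
full = np.linalg.eigvalsh(A).max()
print(restricted_spectral_norm(A, 6), full)   # q = p recovers the largest eigenvalue
```

For $q = 1$ the maximum reduces to the largest diagonal entry, and for $q = p$ to the largest eigenvalue; in the paper, of course, only concentration bounds on this norm are needed, never its exact computation.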
On the one hand, different from most existing investigations in nonasymptotic random matrix theory, the first order term of $\delta\widehat L_n(\Delta, h_n)$ after the decomposition does not have a natural product structure; namely, it cannot be written as $n^{-1}\sum_{i=1}^{n} U_iU_i^{\top}$ for some independent random vectors $\{U_i \in \mathbb{R}^p, i \in [n]\}$. Hence, we cannot directly follow the well-established arguments based on such a product structure, but have to resort to properties of the kernel. To this end, we state the following two auxiliary lemmas, which are used repeatedly in the proofs and can be regarded as extensions of the classic results in, for example, Robinson (1988).

Lemma 3.2. Suppose we have random variables $W \in \mathbb{R}$ and $Z \in \mathcal{Z}$ such that
\[ \Big|\frac{\partial f_{W|Z}(w, z)}{\partial w}\Big| \le M_1 \]
for some positive constant $M_1$, any $z$ in the range of $Z$, and any $w$ in the range of $W$. Also, let $K(\cdot)$ be a kernel function such that $\int_{-\infty}^{+\infty}|w|K(w)\,dw \le M_2$ for some constant $M_2 > 0$. Then we have, for any $h > 0$,
\[ \Big|\mathbb{E}\Big\{\frac{1}{h}K\Big(\frac{W}{h}\Big)Z\Big\} - \mathbb{E}[Z \mid W = 0]f_W(0)\Big| \le M_1M_2\mathbb{E}[|Z|]\,h. \]

Lemma 3.3. Let $(W_1, Z_1), (W_2, Z_2) \in \mathbb{R} \times \mathcal{Z}$ be i.i.d. Assume
\[ \Big|\frac{\partial f_{W_1|Z_1}(w, z)}{\partial w}\Big| \le M_1 \]
holds for some positive constant $M_1$, any $z$ in the range of $Z_1$, and any $w$ in the range of $W_1$. Let $K(\cdot)$ be a kernel function such that $\int_{-\infty}^{+\infty}|w|K(w)\,dw \le M_2$ for some constant $M_2 > 0$, and let $\varphi: \mathcal{Z}^2 \to \mathbb{R}$ be a measurable function. Then we have, for any $h > 0$,
\[ \Big|\mathbb{E}\Big\{\frac{1}{h}K\Big(\frac{W_1 - W_2}{h}\Big)\varphi(Z_1, Z_2)\,\Big|\,W_2, Z_2\Big\} - \mathbb{E}\big[\varphi(Z_1, Z_2) \mid W_1 = W_2, W_2, Z_2\big]f_{W_1}(W_2)\Big| \le M_1M_2\mathbb{E}\big[|\varphi(Z_1, Z_2)| \mid Z_2\big]\,h. \]

On the other hand, the second order term of $\delta\widehat L_n(\Delta, h_n)$ after the decomposition forms a degenerate U-statistic and requires further study. To control this term, one might consider using the two-term Bernstein inequality for degenerate U-statistics (see, e.g., Proposition 2.3(c) in Arcones and Giné (1993) or Theorem 4.1.2 in de la Peña and Giné (2012)). But doing so would add an extra polynomial-in-$\log p$ multiplicative term to the upper bound.
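Lemma 3.2 admits a simple numerical sanity check. The sketch below (our construction; the uniform design for $W$ and the choice $Z = 1 + W^2 + \epsilon$ are arbitrary assumptions, and the lemma's constants $M_1, M_2$ are not tracked) estimates $\mathbb{E}\{h^{-1}K(W/h)Z\}$ by Monte Carlo and compares it with $\mathbb{E}[Z \mid W = 0]f_W(0) = 1 \cdot (1/2)$; the gap vanishes with $h$, as the lemma predicts:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000
W = rng.uniform(-1.0, 1.0, size=N)           # f_W(0) = 1/2
Z = 1.0 + W ** 2 + 0.1 * rng.normal(size=N)  # E[Z | W = 0] = 1

def kernel_mean(h):
    """Monte Carlo estimate of E[(1/h) K(W/h) Z] with box kernel K(t) = 1{|t| <= 1/2}."""
    K = (np.abs(W / h) <= 0.5).astype(float)
    return np.mean(K * Z / h)

target = 1.0 * 0.5                           # E[Z | W = 0] * f_W(0)
for h in (0.5, 0.2, 0.05):
    print(h, abs(kernel_mean(h) - target))   # gaps are O(h) up to Monte Carlo noise
```

Note that shrinking $h$ also inflates the Monte Carlo variance (the effective sample size is of order $Nh$), which is exactly the variance-explosion phenomenon for the U-statistic kernel discussed before Theorem 3.7.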
Instead, we adopt the sharpest four-term Bernstein inequality discovered by Giné et al. (2000), get rid of several inexplicit terms (e.g., the $\ell_2 \to \ell_2$ norm), and formulate it into the following user-friendly tail inequality, stated as an auxiliary lemma. The constants here can be explicitly calculated thanks to Houdré and Reynaud-Bouret (2003).

Lemma 3.4. Let $Z_1, \dots, Z_n, Z \in \mathcal{Z}$ be i.i.d., and let $g: \mathcal{Z}^2 \to \mathbb{R}$ be a symmetric measurable function with $\mathbb{E}\,g(Z_1, Z_2) < \infty$. Write $U_n(g) = \sum_{i<j} g(Z_i, Z_j)$