<<

arXiv:1802.08667v5 [stat.ML] 13 Apr 2021 lwrta 1 than slower asna eoo n,aogagvnsqec fmodels). of given a along prope one, when or effects, zero treatment near average mass in example, (for tification R) hscntuto ulsuo n sdrcl nprdb ee ( efficien Newey semi-parametric by the inspired functiona of directly computation the the is of in and learner arise correct upon scores a bias builds such with a construction residual adding This by regression (RR). constructed the are of scores product Such whic case. property a this robustness, No in double (2019)). et to Syrgkanis connection Belloni and the Foster (1959); debiased emphasizes 2018); Neyman (2016, use (e.g., al. we learner et hav step nozhukov bias, that first such scores each orthogonal avoid to Neyman To fun respect on a estimator. based into (DML) biased learner learning machine badly chine a a Plugging give model. can effect treatment a in ates aucitsbitdto submitted Manuscript h rbe flann ierfntoaso ersin.Goa par 1 Global regressions. at of (estimable functionals importan linear regular of learning variety ve of This problem local volume. the their shrinking as of well regions as over of derivatives, averages p distribution directional average the global changing and include from sors, Examples effects f policy linear generally). effects, a more treatment as expressed projection, be (or can interest function of objects statistical Many fe h ersini ihdmninl eedn nmn variables many on depending dimensional, high is regression the Often aaeesUigRglrzdRezRepresenters Riesz Regularized Using Parameters eBae ahn erigo lbladLocal and Global of Learning Machine De-Biased Summary lse fmdl,tasaigit oetcndnebands Keywords confidence honest into un translating othe parameters. asymptotic models, the imply as and of non-asymptotic long classes are as results regressi dense” main “completely the Our be to can approximation representer R the the the analys either Our include robustness”: parameter. we nuisance sity orth property, additional an Neyman this as a sma achieve functional to To construct invariant parameters. approximately We is sance point. that deriv parameter a target and at the effects, fixed policy subvector effects, Example variate derivatives. treatment and average regular effects, include of policy Examples effects, (nonparame function. treatment non-regular expectation and conditional the (semi-parametric) regular for itrChernozhukov Victor E-mail: † I cnmc,5 eoilDie abig A012 USA. 02142, MA Cambridge Drive, Memorial 50 Economics, MIT / : √ n cenmteu nwymteu [email protected] [email protected], [email protected], emnotooaiy asinapoiain sparsity approximation, Gaussian orthogonality, Neyman epoieaatv neec ehd,bsdon based methods, inference adaptive provide We ae) lblprmtr a lob o-eua ne ekide weak under non-regular be also can parameters Global rates). / h cnmtisJournal Econometrics The √ n ae,adlclprmtr r o-eua etmbeat (estimable non-regular are parameters local and rate), † .INTRODUCTION 1. hte .Newey K. Whitney , p 1–49. pp. , † no h prxmto to approximation the or on fnnrglrfunctionals non-regular of s syed ek“obespar- “double weak yields is n au Singh Rahul and , ucinl nld average include functionals tvscniinlo co- a on conditional atives lprubtoso h nui- the of perturbations ll rc ierfntoasof functionals linear tric) o ohgoa n local and global both for ssffiinl “sparse”. sufficiently is r fr aiiyoe large over validity iform eta h od“double” word the that te ezrpeetrfrthe for representer iesz rhgnlsoe have scores orthogonal h ntoa faregression a of unctional sosdfie ytaking by defined rsions st crsaccumulate scores nsity rtasotn regres- transporting or gnleuto for equation ogonal l 21,21) Cher- 2015); (2014, al. o em h average the term: ion eodrvtv with derivative zero e xmlsmotivates examples t mtr r typically are ameters ybudfrregular for bound cy ’ is representer Riesz l’s ℓ 1 rmtr:average arameters: toa finterest of ctional regularization, “obe ma- /“double” uha covari- as such 94)where 1994a) † n- 2 Chernozhukov, Newey, and Singh functionals. We also remove overfitting bias (high entropy bias) by using cross-fitting, an efficient form of sample splitting, where we average over data observations different than those used by the nonparametric learners. See, e.g, Schick (1986) for early use and Chernozhukov et al. (2018) for more recent use in the context of debiased machine learning. Using closed-form solutions for Riesz representers in several examples, Chernozhukov et al. (2016, 2018) defined DML estimators in high dimensional settings and established their good properties. In comparison, the new approach proposed in this paper has the following advantages and some limitations:

1 We provide a novel algorithm based on ℓ1 regularization to automatically estimate the Riesz representer from the empirical analog of equations that implicitly char- acterize it. 2 Even when a closed-form solution for the Riesz representer is available, the method avoids estimating each of its components. For example, the method avoids explicit density derivative estimation for the average derivative, and it avoids inverting estimated propensity scores for average treatment effects. 3 The adaptive inference theory covers both regular objects (estimable at the 1/√n rate) and nonregular ones (with rates L/√n, where L is the operator of the linear functional). → ∞ 4 As far as we know, the adaptive inference theory given here is the first non- asymptotic Gaussian approximation analysis of de-biased machine learning. 5 Our approach remains interpretable under misspecification, estimating a linear functional of the projection rather than regression. (This point is made explicit in Section 4). 6 We provide a non-asymptotic analysis when using the ℓ1-penalized method to learn the regression, and an asymptotic analysis when using other modern machine learn- ing estimators to learn the regression. 7 The current analysis focuses on linear functionals. In follow-up work, Chernozhukov et al. (2018) extend the approach to nonlinear functionals through a linearization.

Sections 2 and 3 present the main ideas for a general audience. In Section 2, we de- fine global, local, and perfectly localized linear functionals of the regression, and provide orthogonal representations for these functionals. In Section 3, we present two empiri- cal examples: local and global average treatment effects, and local and global average derivatives. Sections 4 and 5 are theoretical. In Section 4, we provide estimation theory, demonstrat- ing concentration and approximate Gaussianity of the DML estimator with regression and Riesz representer estimated via regularized moment conditions. We provide rates of convergence for estimating the Riesz representer, giving both fast rates under ap- proximate sparsity and slow rates under the dense model. In Section 5, we demonstrate asymptotic consistency and Gaussianity of the DML estimator with regression estimated via general machine learning. The supplement provides supporting material. In Section A, we give a detailed account of how our work relates to previous and contemporary work. In Section B, we review prelimaries of . In Section C, we analyze the structure of the leading examples, providing bounds on operator norm, variance of the score, and kurtosis. Finally, we provide proofs for each section. DML with Riesz Representers 3 2. OVERVIEW OF TARGET FUNCTIONALS, ORTHOGONAL REPRESENTATION, ESTIMATION, AND INFERENCE 2.1. Target functionals

We consider a random element W with distribution P taking values w in its . Denote the Lq(P ) norm of a measurable function f : R and also the Lq(P ) normW W → of random variable f(W ) by f P,q = f(W ) P,q. For a differentiable map x g(x), d k k k k k ′ 7→ from R to R , we use ∂x′ g to abbreviate the partial derivatives (∂/∂x )g(x), and we ′ ′ ′ use ∂x g(x0) to mean ∂x g(x) x=x0 , etc. We use x to denote the transpose of a column vector x. | Let (Y,X) denote a random sub-vector of W taking values in their support sets, y R and x Rd, where d = is allowed. Let F denote the law of X. We define∈Y⊂ ∈X ⊂ ∞ x γ⋆(x) := E[Y X = x], 7→ 0 | as the unknown regression function of Y on X. We consider the convex parameter space ⋆ Γ0 for γ0 with elements γ. (Later, in the theoretical sections, we generalize and replace the regression function by a projection). Our goal is to construct high-quality inference methods for real-valued linear function- ⋆ ⋆ als of γ0 . To present examples below we need to endow γ0 with a causal interpretation, which requires us to assume that it is a structural function, invariant to the changes in the distribution of X under policies described below. This property is not guaranteed for an arbitrary regression problem. For the reader who is unfamiliar with these concepts, we note that a simple sufficient condition for invariance is follows: given a stochastic process x Y (x), called potential outcomes or structural function, vector X is generated to fol- low7→ distribution F independently of x Y (x) and Y is generated as Y = Y (X). In this ⋆ 7→ case we have γ0 (x)=EY (x) for any F . This condition is conventionally called exogene- ity in econometrics and random assignment in statistics. The measurability requirement here is that (x, ω) Y (x, ω) is a measurable map. We refer to Imbens and Rubin (2015), Hernan and Robins7→ (2019), and Peters et al. (2017) for the relevant formalizations that enable causal interpretation.

⋆ ⋆ Example 2.1. (Average treatment effect) Let X = (D,Z) and γ0 (X)= γ0 (D,Z), where D 0, 1 is the indicator of the receipt of the treatment. Define ∈{ } θ⋆ = (γ⋆(1,z) γ⋆(0,z))ℓ(x)dF (x), 0 0 − 0 Z where x ℓ(x) is a weighting function. This statistical parameter is a weighted average treatment7→ effect under the standard conditional exogeneity assumption, which guarantees ⋆ that γ0 is invariant to changes in the distributions of D conditional on Z. The assumption requires D to be independent of the potential outcome process d Y (d, Z) and outcome ⋆ 7→ ⋆ to be generated as Y = Y (D,Z), so that γ0 (d, z) = E[Y (d, Z) Z = z]. Here γ0 is invariant to changes in the conditional distributions of D, but not| to the changes in the distribution of Z.

Here and below, a weighting function is a measurable function x ℓ(x) such that ℓdF = 1 and ℓ2dF < . In this example, setting 7→ ∞ R R 4 Chernozhukov, Newey, and Singh ℓ(x) = 1 gives average treatment effect in the entire population, • ℓ(x)=1(d = 1)/P (D = 1) gives the average treatment effect for the treated popu- • lation, ℓ(x)=1(z N)/P (Z N) the average treatment effect conditional on the covari- • ates Z being∈ in the group∈ or neighborhood N, and so on. We can model small neighborhoods N as shrinking in volume with the sample size. The local weighting and kernel weighting discussed below are applicable to all key examples. Moreover, they are combinable with other weighting functions so that, for example, we can target inference on local average treatment effects for the treated.

Example 2.2. (Policy effect from changing distribution of X) The average causal effect of the policy that shifts the distribution of covariates from F0 to F1 with the sup- ⋆ port contained in , when γ0 is invariant over F, F0, F1 , for the weighing function x ℓ(x), is givenX by: { } 7→ θ⋆ = γ⋆(x)ℓ(x)dµ(x); µ(x)= F (x) F (x). 0 0 1 − 0 Z ⋆ Exogeneity is a sufficient condition for the stated invariance of γ0 .

Example 2.3. (Policy effect from transporting X) A weighted average effect of changing covariates X according to a transport map X T (X), where T is deterministic measurable map from to , with the weighting function7→ x ℓ(x), is given by: X X 7→ θ⋆ = [γ⋆(T (x)) γ⋆(x)]ℓ(x)dF (x). 0 0 − 0 Z This has a causal interpretation if the policy induces the equivariant change in the re- ˜ ˜ ⋆ gression function, namely the outcome Y under the policy obeys E[Y X] = γ0 (T (X)). Exogeneity is a sufficient condition. |

Example 2.4. (Average ) In the same settings as the previ- ous example, a weighted average derivative of a continuously differentiable γ0 with respect to component vector d in the direction d t(x) and weighed by x ℓ(x) is the linear functional of the form: 7→ 7→

⋆ ′ ⋆ θ0 = ℓ(x)t(x) ∂dγ0 (d, z)dF (x).

⋆ Z In causal analysis, θ0 is an approximation to 1/r times the average causal effect of the policy that shifts the distribution of covariates via the map X = (D,Z) T (X) = 7→ ⋆ (D + rt(X),Z) for small r, weighted by ℓ(X). Here we require that (d, x) ∂dγ0 (x) exists and is continuous on . 7→ X In this example, consider the case when X = (D,Z) consists of continuous treatment variable D and covariates Z. Further suppose ℓ(x) = ℓ(d) and t(x) = 1. Then the ∗ ∗ parameter of interest is θ0 = E[ℓ(D)T (D)], where T (d) = E[∂dγ0 (D,Z) D = d]. When Y = Y (D) for a potential outcome process Y (d) that is independent of| treatment D conditional on the covariates Z and differentiable in d, it was shown by Altonji and Matzkin (2005) and Florens et al. (2008) that T (d) = E[∂dY (D) D = d], which is an ∗ | average treatment effect on the treated. Thus θ0 is a weighted avarage of the effect of DML with Riesz Representers 5 treatment on the treated and would be equal to T (d) for the perfectly localized ℓ(d) = 1(D = d)/fD(d), where fD(d) is the pdf of D. Also for ℓ(d) 1, Imbens and Newey ∗ ≡ (2009) showed that θ0 = E[T (D)] = E[∂dY (D)], which is an average treatment effect. See also Rothenh¨ausler and Yu (2019). All of these statistical parameters play an important role in causal inference, counter- factual decompositions, and predictive analyses. Introduction of the weighting function ℓ(X) allows us to study subgroup effects and local effects, and these will be covered by our non-asymptotic results and asymptotic results. All of the above examples can be viewed as real-valued linear functionals of the regres- sion function.

Definition 2.1. (Target parameter) Our target is the real-valued linear functional ⋆ of γ0 : θ⋆ = θ(γ⋆), where γ θ(γ) := Em(W, γ), (2.1) 0 0 7→ γ m(w,γ) is a linear operator for each w , defined on Γ = span(Γ0), and the map w 7→ m(w,γ) is measurable with finite second∈W moment under P for each γ Γ. 7→ ∈ The linear operator γ θ(γ) has the following generating function m in these exam- ples: 7→

1 m(w,γ) = (γ(1,z) γ(0,z))ℓ(x); − 2 m(w,γ)= m(γ)= γ(x)ℓ(x)dµ(x); µ(x)= F (x) F (x); 1 − 0 3 m(w,γ)= ℓ(x)(γ(T (x)) γ(x)); ′R − 4 m(w,γ)= ℓ(x)t(x) ∂dγ(x).

In these examples, we can recognize the dependency on the weighting function by writing m(w,γ; ℓ). In Examples 1,3, and 4, we can decompose m(w,γ; ℓ)= m0(w,γ)ℓ(x). Our local functionals are defined by using the weight function that localizes the func- tionals around value d0 of a low-dimensional vector component D. Here D is a p1- dimensional component of vector X. We consider the weighting function

1 d0 D 1 d0 D ℓh(D)= K − /ω, ω =E K − , h R+, (2.2) hp1 h hp1 h ∈      where K : Rp1 R is a kernel function of order o such that K = 1 and → R ( mu)K(u)du =0, for m =1, ..., o 1, ⊗ − Z with its support contained in the cube [ 1, 1]p1 . The simplest example is the box kernel with K(u)= p1 1( 1

Definition 2.2. (Local and localized functionals) We consider the local func- tional ⋆ ⋆ θ(γ0 ; ℓh) := Em(W, γ0 ; ℓh), 6 Chernozhukov, Newey, and Singh as well as the (perfectly) localized functional ⋆ ⋆ θ(γ0 ; ℓ0) := lim θ(γ0 ; ℓh). h→0

The difficulty in targeting localized functionals is that they are not pathwise differen- tiable. A key quantity in the analysis is the operator norm (the modulus of continuity) of γ θ(γ) on Γ, defined as 7→ L := sup θ(γ) / γ P,2. (2.3) γ∈Γ\{0} | | k k We consider L = as non-regular cases, e.g. perfectly-localized functionals. We also consider cases where∞ L as n as non-regular. Indeed, the latter case arises from approximating the→ functional ∞ with→ ∞L = by functionals where L , e.g. local functionals with h 0. The L = case also arises∞ in triangular array asymptotics→ ∞ where P changes with n.→ The asymptotic∞ thought experiment, where L , approximates non-asymptotic cases where L is high. We emphasize that we derive both→ ∞ non-asymptotic results and their asymptotic corollaries (which lead to simplified statements conveying key qualitative features of non-asymptotic results).

2.2. Building an orthogonal representation of the target functional

Equation (2.1) can be thought of as a direct formulation of the target parameter. Next we introduce a dual formulation and finally an orthogonal formulation. Towards this end, we define the Riesz representer α0.

Definition 2.3. (Linear and minimal linear representer) A linear representer for the linear functional γ is α L2(F ) such that 0 ∈ θ(γ)=Eγ(X)α (X), for all γ Γ. (2.4) 0 ∈ ¯ 2 ⋆ If α0 Γ := closure(Γ) in L (F ), we call it the minimal representer and denote it by α0; if not,∈ we call it a representer. Any representer can be reduced to the minimal representer by projecting it onto Γ¯.

A minimal linear representer exists if and only if L< , as a consequence of the Riesz– Frechet theorem; see Lemma 1 below. Therefore, when∞L < , we define the following dual linear representation for the target parameter ∞ ⋆ ⋆ θ0 = θ(α0); θ(α) := E[α(X)Y ]. (2.5) To motivate the upcoming orthogonal representation, we note that either the direct or the dual identification strategies can be used for direct plug-in estimation, but this does not give good estimators, as explained in the following technical remark.

Remark 2.1. (Non-orthogonality of direct and dual formulations) Even if ⋆ we knew expectation operator E and use θ(ˆγ) or θ(ˆα) as the estimator for θ0, this esti- mator would have high biases. Indeed, neither γ θ(γ) nor α θ(α) are orthogonal to local perturbations h Γ of γ⋆ or h¯ Γ of α⋆, namely7→ 7→ ∈ 0 ∈ 0 ⋆ ⋆ ¯ ⋆ ¯ ∂tθ(γ0 + th) =Em(W, h) =0, ∂tθ(α0 + th) =Eγ0 (X)h(X) =0. t=0 6 t=0 6

DML with Riesz Representers 7 ⋆ ⋆ ⋆ Consequently, the quantities Em(W, γˆ γ0 ) and Eγ0 (ˆα α0) are first order biases for θ(ˆγ) − − ⋆ ⋆ and θ(ˆα). The regularized estimators γˆ or αˆ exploit structure of γ0 and α0 to estimate them well in high-dimensional problems, but they exhibit biases that vanish at rates slower than 1/√n, which makes θ(ˆγ) and θ(ˆα) converge at the same slow rate.

⋆ Therefore we proceed to construct another representation for θ0 that has the required Neyman orthogonality structure.

Definition 2.4. (Orthogonal representation for the target functional) We have θ⋆ = θ(α⋆,γ⋆); θ(α, γ) := E[m(W, γ)+ α(X)(Y γ(X))], (2.6) 0 0 0 − ⋆ ⋆ where (α, γ) are the nuisance parameters with the true value (α0,γ0 ).

Unlike the direct or dual representations for the functional, this representation is Ney- man orthogonal to perturbations (h,h¯ ) Γ2 of (α⋆,γ⋆) such that ∈ 0 0 ∂ ⋆ ¯ ⋆ ⋆ ⋆ ¯ θ(α0 + th, γ0 + th) =Em(W, h) Eα0(X)h(X) + E[(Y γ0 (X))h(X)]=0. (2.7) ∂t t=0 − −

In fact, a stronger property holds

θ(α, γ) θ(α⋆,γ⋆)= (γ γ⋆)(α α⋆)dF, (2.8) − 0 0 − − 0 − 0 Z which implies (2.7) as well as double robustness. This quantity is also known as the remainder of the von Mises expansion of the functional. The Neyman orthogonality property states that the representation of the target param- eter θ0 in terms of the nuisance parameters (α, γ) is invariant to the local perturbations of the values of the nuisance parameter. This property makes the orthogonal representa- ⋆ tion an excellent basis for constructing high quality point and interval estimators of θ0 in modern high-dimensional settings when we will be plugging-in biased estimators in lieu ⋆ ⋆ of γ0 and α0, where the bias occurs because of the regularization (see, e.g., Chernozhukov et al. (2016, 2018)).

2.3. The case of finite-dimensional linear regression

It is instructive to consider the case of linear finite-dimensional regression. Consider p 2 x b(x)= bj(x) j=1 as a p-dimensional dictionary of basis functions with bj L (F ) for7→ each j ={ 1, ...,} p. The regression function is assumed to obey the linear functional∈ ⋆ ′ form γ0 = b β0 for some β0. Also define G =Eb(X)b(X)′,M =Em(W, b). First, observe that for γ = b′β, θ(γ)=Em(W, b′β)=Em(W, b)′β = Mβ. For instance, in Examples 1–4:

1. M = E(b(1,Z) b(0,Z))ℓ(X) 2. M = bℓ(dF dF ), − 1 − 0 3. M = E(b(T (X)) b(X))ℓ(X) 4. M =E∂ b(D,Z)t(X)ℓ(X). − R d 8 Chernozhukov, Newey, and Singh ⋆ ⋆ Second, we make a guess that the linear representer α0 to be of the form α0(x) = ′ b(x) ρ0, for ρ0 defined below. We can define the parameters β0 and ρ0 as any minimal ℓ1-norm solutions to the system of equations: min β + ρ : Gβ =EYb(X), Gρ = M. (2.9) k k1 k k1 −1 −1 In particular, if G is full rank, the solutions are β0 = G Eb(X)Y and ρ0 = G M. We now verify the representation property for our guess: ⋆ ′ ′ ′ ′ Eγ(X)α0(X)=Eβ b(X)b(X) ρ0 = β Gρ0 = β M = θ(γ), for all β’s and hence all γ’s. The operator norm of θ(γ)= M ′β is given by M ′β β′Gρ L = sup = sup 0 = ρ′ Gρ < . | ′ | | ′ | 0 0 β∈Rp\{0} √β Gβ β∈Rp\{0} √β Gβ ∞ p We conclude that direct, dual, and orthogonal representations are given by θ(γ)= M ′β; θ(α)= ρ′Eb(X)Y ; θ(γ, α)= M ′β + ρ′Eb(X)Y ρ′Gβ, − where β is γ’s parameter and ρ is α’s parameter. These representations appear to be both novel and useful.

2.4. The case of infinite-dimensional regression

In the infinite-dimensional case, we can employ the Riesz–Frechet representation theorem and Hahn–Banach extension theorem to establish existence of the linear Riesz represen- ter.

Lemma 2.1. (Extended Riesz representation) (i) If L< , there exists a unique ⋆ ¯ ⋆ ∞ minimal representer α0 Γ and L = α0 P,2. (ii) If there exists a linear representer α on Γ with α <∈ , then L = kα⋆k α < , where α⋆, obtained by 0 k 0kP,2 ∞ k 0kP,2 ≤ k 0kP,2 ∞ 0 projecting α0 onto Γ¯, is the unique minimal representer. In both cases γ θ(γ) can be extended to Γ¯ or to the entire L2(F ) with the modulus of continuity L. 7→

The first part of the lemma shows (implicit) existence of a linear representer when L< . Our estimation results will rely only on the existence of minimal representers. In some∞ case, however, we may utilize the closed-form solutions for linear representers (see, e.g., Section C for the key examples), to improve the basis functions for estimating the minimal representers. There is also an efficiency reason to work with minimal representers rather than any linear representer, as highlighted in Section 4 analyzing semi-parametric efficiency.

2.5. Informal preview of estimation and inference results

Our estimation and inference will exploit empirical analogs of both the orthogonal rep- resentation of the parameter (2.6) and the equation defining the RR property (2.4). To approximate the regression function and the RR, we consider the p-vector of dictio- nary functions b, where the dimension p of the dictionary can be large, potentially much ⋆ ′ ⋆ larger than n. We approximate α0 by a b ρ0, and we approximate γ0 by a ′ linear form b β0, and estimate the parameters using the algorithms below. DML with Riesz Representers 9 n n 1 Let (Wi)i=1 = (Yi,Xi)i=1 denote i.i.d. copies of data vector W . We use cross- fitting to avoid biases from overfitting that can arise in high-dimensional settings. To this end, let (I , ..., I ) be a partition of the observation index set 1, ..., n into 1 K { } K distinct subsets of about equal size. Let EAf = EAf(W ) denote the empirical E E −1 average of f(W ) over i A 1, ..., n : Af := Af(W )= A i∈A f(Wi). 2 For each block k =1, ...,∈ K, we⊂{ obtain generalized} Datnzig selector| | (GDS) estimates ′ ′ P αˆk = b ρˆ andγ ˆk = b βˆk, where −1 ′ Rp ˆ E c E c ρˆk = arg minρ∈ ρ 1 : D Ik m(W, b) Ik b(X)b(X) ρ ∞ λρ, k k k −1 − ′ k ≤ (2.10) βˆk = arg minβ∈Rp β 1 : Dˆ EIc (Y b(X) β)b(X)) ∞ λβ, k k k  k − k ≤ c where I = 1, ..., n Ik is the set of observation indices leaving Ik out, λ’s are k { }\ tuning parameters, and Dˆ is a scaling detailed in Section E. Typically λ’s scale like log(p n)/n; Section 3 provides concrete choices. 3 The DML∨ estimator is an average of estimated orthogonal representations over k: p K 1 θˆ = m(W , γˆ )+ˆα (X )[Y γˆ (X )] . (2.11) n { i k k i i − k i } Xk=1 iX∈Ik The estimator of its asymptotic variance is K 1 σˆ2 = m(W , γˆ )+ˆα (X )[Y γˆ (X )] θˆ 2. (2.12) n { i k k i i − k i − } kX=1 iX∈Ik We remark that the RR estimator in step 2 is of Dantzig selector type, but is not exactly the Dantzig selector, requiring some new analysis. Next, we state the key concentration and approximate Gaussianity results informally. Key quantities in the analysis are the “true” score and its moments: ψ⋆(W ) := θ⋆ m(W, γ⋆) α⋆(X)(Y γ⋆(X)), σ2 := Eψ2(W ), κ3 := E ψ3(W ) . 0 0 − 0 − 0 − 0 0 | 0 | We establish conditions under which

γˆ γ⋆ + αˆ α⋆ /σ 0, √n (ˆγ γ⋆)(ˆα α⋆)dF/σ 0. (2.13) k k − 0 kP,2 k k − 0kP,2 → k − 0 k − 0 → Z These include a bound on the ℓ1 norm of coefficients and that either the regression function or the RR is approximately sparse with the effective dimension s less than √n. This allows either nuisance parameter to be completely dense. Given that (2.13) holds, we establish that the resulting de-biased (or “double”) ma- chine learning (DML) estimator θˆ is adaptive, namely it is approximated up to the error o(σ/√n) by the oracle estimator n θ¯ := θ⋆ n−1 ψ (W ), 0 − 0 i i=1 X ˆ ⋆ where the oracle knows the scores ψ0. Hence the approximate deviation of θ from θ0 is determined by ψ /√n, which is the standard deviation of the oracle estimator. k 0kP,2 Consequently, θˆ concentrates in a σ/√n neighborhood of the target with deviations controlled by the normal laws, ˆ ⋆ 3 sup P(√n(θ θ0 )/σ t) P(N(0, 1) t) A(κ/σ) /√n + errorn 0, t∈R − ≤ − ≤ ≤ →

10 Chernozhukov, Newey, and Singh where the errorn bound is non-asymptotic and tends to zero as n . Of course, σ/√n 0 is required for concentration. The non-asymptotic bound→ autom ∞ atically im- plies the→ uniform validity of results over large classes of probability laws P for W . There are two cases to consider:

1 Regular case: the parameters σ, κ/σ, and L are bounded, leading to 1/√n con- centration, adaptation, and Gaussian approximation. 2 Non-regular case, the parameters σ, κ/σ, and L diverge, so that we need σ/√n 0,L/√n 0, (κ/σ)/√n 0, → → → for σ/√n concentration, adaptation, and Gaussian approximation.

As we show in Section C, in the case of local functionals, the latter condition can be more succinctly stated as (κ/σ) . σ L,L/√n 0. ≍ → Finally, we establish that we can transfer learning and inference guarantees for local functionals to those for the (perfectly) localized functionals if the localization bias is sufficiently small, namely √n(θ(γ⋆; ℓ ) θ(γ⋆; ℓ )/σ 0. 0 h − 0 0 → We think it is remarkable that a single inference theory covers both regular and non- regular cases, and provides uniform validity over large classes of P .

3. APPLICATIONS 3.1. Global and local effects of 401(k) eligibility on net financial assets

First, we use our method to answer a question in household finance: what is the average treatment effect of 401(k) eligibility on net financial assets (over a horizon of about two years)? We follow the identification strategy of Poterba and Venti (1994) and Poterba et al. (1995), who assume selection on observables. The authors assume that when 401(k) was introduced, workers ignored whether a given job offered 401(k) and instead made employment decisions based on income and other observable job characteristics; after conditioning on income and job characteristics, 401(k) eligibility was exogenous at the time. We use data from the 1991 US Survey of Income and Program Participation, using sample selection and variable construction as in Abadie (2003) and Chernozhukov and Hansen (2004). The outcome Y is net financial assets defined as the sum of IRA balances, 401(k) balances, checking accounts, US saving bonds, other interest-earning accounts, stocks, mutual funds, and other interest-earning assets minus non-mortgage debt. The treatment D is an indicator of eligibility to enroll in a 401(k) plan. The raw covariates X are age, income, years of education, family size, marital status, two-earner status, benefit pension status, IRA participation, and home-ownership. We impose common support of the propensity score for the treated and untreated groups based on these covariates, yielding n = 9869 observations. We consider the fully-interacted specification b(D,X) of Chernozhukov et al. (2018) with p = 277 including polynomials of continuous covariates, interactions among all covariates, and interactions between covariates and treatment status. DML with Riesz Representers 11 Table 1. Average treatment effect of 401(k) eligibility on net financial assets. Localized average treatment effects are reported by income quintile groups. The regression is esti- mated by GDS or Lasso. Standard errors are reported in parentheses. Income quintile N treated N untreated GDS Lasso All 3682 6187 7994.79 (1335.29) 8068.91 (1331.22) 1 272 1702 4485.09 (936.81) 4467.18 (921.45) 2 527 1447 854.26 (1505.42) 873.02 (1505.38) 3 755 1219 5391.66 (1193.00) 4970.92 (1197.93) 4 962 1012 9746.77 (2160.25) 8822.58 (2157.53) 5 1166 807 17784.33 (7775.72) 13958.51 (7788.43)

Table 2. Average treatment effect of 401(k) eligibility on net financial assets. Localized average treatment effects are reported by income quintile groups. The regression is esti- mated by random forest or neural network. Standard errors are reported in parentheses. Income quintile N treated N untreated Random forest Neural network All 3682 6187 8875.23 (1403.07) 7673.53 (1818.37) 1 272 1702 4834.53 (935.70) 5127.75 (1168.08) 2 527 1447 1041.63 (1718.77) 1386.80 (1607.70) 3 755 1219 3959.34 (1491.50) 4981.23 (1279.75) 4 962 1012 11026.08 (2347.68) 8753.54 (2378.47) 5 1166 807 19919.19 (7907.88) 15049.97 (7811.69)

Tables 1 and 2 summarize results for the entire population and for each quintile of the income distribution. We use K = 5 folds in cross-fitting. To estimate the RR, we use the generalized Dantzig selector (GDS) procedure introduced in the present work. To estimate the regression, we use GDS, Lasso, random forest, or neural network. GDS is implemented using the tuning procedure described in Section E. Lasso is implemented using the tuning procedure described in Chernozhukov et al. (2018). Random forest and neural network are implemented with the same settings as Chernozhukov et al. (2018). We find ATE of 7995 (1335) using GDS for both the RR and the regression. This ATE estimate is stable across different choices of regression estimator. We find that localized ATE is not statistically significant for the second quintile, and it is statistically significant, positive, and strongly heterogeneous for the other quintiles. For comparison, Chernozhukov et al. (2018) report ATE of 7170 (1398) by DML, which estimates the RR by estimating the propensity score and plugging it into the RR functional form. Though these two estimators are asymptotically equivalent under correct specification, the lower standard error of our estimate perhaps reflects numerical stability conferred by avoiding the estimated propensity score in the denominator. The ATE results are broadly consistent with Poterba and Venti (1994) and Poterba et al. (1995), who use a simpler specification motivated by economic reasoning. The localized ATE estimates by income quintile group appear to be new empirical results and are of interest in their own right. 12 Chernozhukov, Newey, and Singh 3.2. Global and local price elasticity of gasoline demand

Second, we use our method to estimate the average price elasticity of household gasoline demand: the percentage change in demand due to a unit percentage change in price. This parameter is critical for assessing the welfare consequences of tax changes, and it has been studied in Hausman and Newey (1995); Schmalensee and Stoker (1999); Yatchew and No (2001); Blundell et al. (2012). Formally, the parameter of interest is the average derivative of log demand with respect to log price holding income and demographic characteristics fixed. We use data from the 1994-1996 Canadian National Private Vehicle Use Survey, us- ing sample selection and variable construction as in Yatchew and No (2001) and Belloni et al. (2019). The outcome Y is log gasoline consumption. The variable D with respect to which we differentiate is log price per liter. The raw covariates X are log age, log income, and log distance as well as geographical, time, and household composition dummies. In total we have n = 5001 observations. We consider the specification b(D,X) previously considered by Semenova and Chernozhukov (2020). The specification includes polynomi- als of continuous covariates, and interactions of log price (and its square) with time and household composition dummies. Altogether, p = 91. Table 3 summarizes results for the entire population and for each quintile of the income distribution. We use K = 5 folds in cross-fitting. To estimate the RR, we use the gener- alized Dantzig selector (GDS) procedure introduced in the present work. To estimate the regression, we use GDS, Lasso, random forest, or neural network. Again, GDS is imple- mented using the tuning procedure described in Section E, Lasso is implemented using the tuning procedure described in Chernozhukov et al. (2018), and random forest and neural network are implemented with the same settings as Chernozhukov et al. (2018). We find average price elasticity of 0.30 (0.06) using GDS for both the RR and the regression. Lasso gives similar results.− Note that random forest is not differentiable, and the derivative of a neural network may be difficult to extract from a black-box implemen- tation. When using these estimators, we implement a partial difference approximation of the derivative, detailed in Section E. We conjecture that this approximation explains why the results using random forest appear attenuated and why the results using neural network appear positive or statistically insignificant. Using GDS, we find that localized average price elasticity is statistically significant and negative in each income quintile, with substantial heterogeneity. For comparison, OLS regression of log consumption on log price, log age, log income, and log distance as well as geographical, time, and household composition dummies yields an estimate of 0.14 (0.06). The linear specification leads to a positive elasticity estimate, contradicting economic intuition (since it says there would be more gasoline consump- tion when prices are higher). Our localized average price elasticity results using GDS are broadly consistent with Semenova and Chernozhukov (2020), who more explicitly consider the relationship between average price elasticity and income. DML with Riesz Representers 13 Table 3. Estimated average derivative (price elasticity) of gasoline demand. Localized average derivatives are reported by income quintile groups. The regression is estimated by GDS, Lasso, random forest, or neural network. Standard errors are reported in paren- theses. Income quintile N GDS Lasso Random forest Neural network All 5001 -0.30 (0.06) -0.12 (0.06) -0.05 (0.06) 0.16 (0.06) 1 1001 -0.92 (0.13) -0.62 (0.12) -0.47 (0.14) 0.07 (0.13) 2 1000 -0.41 (0.11) -0.39 (0.11) -0.23 (0.12) 0.31 (0.12) 3 1000 -1.69 (0.16) -1.17 (0.14) -0.67 (0.13) -0.32 (0.13) 4 1000 -0.98 (0.14) -0.82 (0.13) -0.36 (0.15) 0.06 (0.14) 5 1000 -0.98 (0.15) -0.52 (0.12) -0.08 (0.13) 0.49 (0.11)

4. ESTIMATION AND INFERENCE FOR HIGH DIMENSIONAL APPROXIMATELY LINEAR MODELS 4.1. Best linear approximations for the regression function and the Riesz representer

To approximate the regression function, we consider the p-vector of dictionary functions x b(x) = (b (x))p , b L2(F ). 7→ j j=1 j ∈ The dimension p of the dictionary can be large, potentially much larger than n. Let Γb be the of L2(F ) generated by b. We assume that as n we have 2 → ∞ that p and Γb Γ¯, where Γ¯ is a linear subspace of L (F ) with the basis functions ˜b ∞ →. Here ∞ convergence→ means that any convergent sequence in Γ has its limit in Γ¯ { j}j=1 b and for each γ Γ¯ we have a sequence in Γ converging to it, with respect to the L2(F ) ∈ b norm. Note that this setup allows the dictionary b = bn to change with n, as for example with b-splines. ⋆ ¯ ⋆ Here we define γ0 as a projection of Y onto Γ, i.e. γ0 is the projection of Y on the infinite ˜ ∞ set of variables bj(X) j=1. This setup is slightly more general than in the introduction, ⋆ { } where γ0 was the conditional expectation function. Of course, if the latter is an element ¯ ⋆ of Γ, it automatically coincides with γ0 . ⋆ We approximate γ0 by the finite-dimensional best linear predictor (BLP) γ0 via ⋆ ′ γ0 = γ0 + rγ := b β0 + rγ :E[b(X)rγ(X)]=0, ′ where rγ is the approximation error, and γ0 := b β0 is the best linear predictor of Y ⋆ and best linear approximation to γ0 . We define β0 as a minimal ℓ1-norm solution to the system of equations min β :E[b(X)(γ⋆(X) b(X)′β)]=0, k k1 0 − when G =Eb(X)b(X)′ is not full rank. ⋆ Similarly, we approximate the Riesz representer α0, which exists by Lemma 1 whenever L< , via the best linear approximation α : ∞ 0 ⋆ ′ α0 = α0 + rα = b ρ0 + rα :E[rα(X)b(X)]=0.

We define ρ0 as a minimal ℓ1-norm solution to the system of equations min ρ : E[(α⋆(X) b(X)′ρ)b(X)]=0. k k1 0 − 14 Chernozhukov, Newey, and Singh ⋆ Using that Eα0(X)b(X)=Em(W, b), we note that 0 = E[r (X)b(X)] = E((α⋆(X) b(X)′ρ )b(X))=Em(W, b) Eα (X)b(X). (4.14) α 0 − 0 − 0 Hence α is the Riesz representer for Em(W, γ) for each γ Γ . Here we can interpret 0 ∈ b Γb as the collection of test functions on which the representation property (4.14) holds.

Definition 4.1. (Penultimate and ultimate target parameters) Our penultimate target is the linear functional applied to the BLP γ0: θ := E[m(W, γ )] = E[α (X)γ (X)] = E[m(W, γ )+ α (X)(Y γ (X))]. 0 0 0 0 0 0 − 0 ⋆ Our ultimate target is the linear functional applied to γ0 θ⋆ := E[m(W, γ⋆)] = E[α⋆(X)γ⋆(X)] = E[m(W, γ⋆)+ α⋆(X)(Y γ⋆(X))]. 0 0 0 0 0 0 − 0

If the approximation errors are such that

(√n/σ) r r dF 0 (4.15) α γ → Z our inference will target the ultimate parameter. In the non-regular setup, the second order error condition rαrγ dF σ/√n is weaker then what is usually required for pathwise differentiable functionals≤ (since σ is the non-regular case); there is a lower bar for oracle ratesR in non-regular problems.→ ∞ This phenomenon was also noted by Foster and Syrgkanis (2019) and Kennedy (2020). Otherwise our inference will target an interpretable penultimate parameter. We shall formally refer to the latter case as the misspecified case.

Lemma 4.1. (Basic properties of the score) Our DML estimator of θ0 will be based on the following score function: ψ(W, θ; β,ρ)= θ m(W, b)′β ρ′b(X)(Y b(X)′β), − − − which has the following properties: ∂ ψ(W, θ; β,ρ)= m(W, b)+ ρ′b(X)b(X), ∂ ψ(W, θ; β,ρ)= b(X)(Y b(X)′β), β − ρ − − 2 2 2 ′ ∂ββ′ ψ(W, θ; β,ρ)= ∂ρρ′ ψ(W, θ; β,ρ)=0, ∂βρ′ ψ(W, θ; β,ρ)= b(X)b(X) .

This score function is Neyman orthogonal at (β0,ρ0): E[∂ ψ(W, θ; β,ρ )] = E[m(W, b)] + Gρ =0, β 0 − 0 E[∂ ψ(W, θ; β ,ρ)] = E[ b(X)(Y b(X)′β )] = E[b(X)γ (X)] + Gβ =0. ρ 0 − − 0 − 0 0

The second claim of the lemma is immediate from the definition of (β0,ρ0) and the first follows from elementary calculations. The orthogonality property above says that the score function is invariant to small perturbations of the nuisance parameters ρ and β around their “true values” ρ0 and β0. This invariance property plays a crucial role in removing the impact of biased estimation of nuisance parameters ρ0 and β0 on the estimation of the main parameters θ0. DML with Riesz Representers 15 4.2. Estimators

Estimation will be carried out using the following Dantzig selector-type estimators (Can- des and Tao (2007)). In a follow-up work, Chernozhukov et al. (2018) consider Lasso-type estimators.

Definition 4.2. (Generalized Dantzig selector estimator) Consider a param- eter t T Rp, where T is a convex set. Consider the moment functions t g(t) and the estimated∈ ⊂ moment functions t gˆ(t), mapping Rp to Rp: 7→ 7→ g(t)= Gt M;g ˆ(t)= Gtˆ M,ˆ − − where G and Gˆ are p by p non-negative-definite matrices and M and Mˆ are p-vectors. Define t0 as a minimal ℓ1-norm solution to g(t)=0 and assume t0 T . Define the GDS estimator tˆ by solving ∈ tˆ arg min t : gˆ(t) λ, t T ∈ k k1 k k∞ ≤ ∈ where λ is chosen such that gˆ(t ) g(t ) λ, with probability at least 1 ǫ. k 0 − 0 k∞ ≤ − Here we record the possibility of convex restrictions on the parameter space by placing t in a convex parameter space T . If parameter restrictions are correct, then this can potentially improve theoretical guarantees by weakening the requirements on G and other primitives.

Definition 4.3. (GDS for BLP: Dantzig selector) Given a diagonal positive-definite ˆ ˆ ˆ −1 normalization matrix Dβ, define βA = Dβt, where t is the GDS estimator for t0 = Dβ β0 with G =Eb(X)b(X)′, Gˆ = E b(X)b(X)′,M = D−1EYb(X), Mˆ = D−1E Yb(X); T Rp. A β β A β ⊂ In this setting, our estimator specializes to the original Dantzig selector. In practice, p we use Tβ = R , although when we are interested in average derivative functionals, it is theoretically helpful to impose the convex restrictions of the sort T = t Rp : sup ∂ b(x)′t B , where B is some a priori known upper bound on the{ derivative.∈ x∈X | d |≤ } Ideally, D is chosen such that diag(V ar(D−1(Gβˆ Mˆ )) = I. Our practical algorithm β β 0 − given in Section E estimates Dβ from the data.

Definition 4.4. (GDS for Riesz representer) Given a diagonal positive-definite nor- malization matrix Dρ, define ρˆA = Dρtˆ, where tˆ is the GDS estimator of the parameter −1 t0 = Dρ ρ0 with G =Eb(X)b(X)′, Gˆ = E b(X)b(X)′,M = D−1Em(W, b), Mˆ = D−1E m(W, b); T Rp. A ρ ρ A ρ ⊂ In this setting, our estimator is a generalization of the original Dantzig selector. In p practice, we are using Tρ = R , even though it is possible to exploit some structured restrictions on the problem motivated by the nature of the universal Riesz representers. Ideally, D is chosen such that diag(V ar(D−1(Gρˆ Mˆ )) = I. Our practical algorithm ρ ρ 0 − given in Section E estimates Dρ from the data. We now define the DML estimator with Riesz Representers, which makes use of cross- fitting. 16 Chernozhukov, Newey, and Singh Definition 4.5. (DML with RR) Consider the partition of 1, ..., n into K 2 blocks (I )K , with m = n/K observations in I , for k

K n/K n/K θˆ = θˆ w ; w = ⌊ ⌋ if k

We provide a single non-asymptotic result that allows us to cover both global and local functionals, implying uniformly valid rates of concentration and normal approximations over large sets of P . Consider the oracle estimator based upon the true score functions: n θ¯ := θ n−1 ψ (W ), ψ (W ) := ψ(W, θ ; β ,ρ ). 0 − 0 i 0 0 0 0 i=1 X We seek to establish minimal conditions under which the DML estimator approximates the oracle estimator, and is approximately normal with distribution N(0, σ2/n), σ := ψ . k 0kP,2 For regular functionals σ is bounded, giving 1/√n concentration around θ0, and for non- regular functionals σ L requring L/√n 0 to get concentration. Our normal ∝ → ∞ → approximation is accurate if kurtosis of ψ0 does not grow too fast: (κ/σ)3/√n is small, κ := ψ . k 0kP,3 In the regular case (κ/σ)3 is bounded, but for the non-regular cases it can scale as fast as L, again requiring L/√n 0. → Fix all of these and the constants. Define the guarantee set: S = (u, v) R2p : √u′Gu r , √v′Gv σr , u′Gv σr ,β + u T ,ρ + v T , ∈ ≤ 1 ≤ 2 | |≤ 3 0 ∈ β 0 ∈ ρ We will take u = βˆ β and v =ρ ˆ ρ . As such, r measures the non-asymptotic mean − 0 k − 0 1 square rate for the BLP; r2 measures the non-asymptotic mean square rate for the RR; and r3 measures how the estimation errors interact. Note the presence of σ acting on r2 and r3, which accommodates non-regular functionals. We will instantiate (r1, r2, r3) as fast and slow rates by analyzing the GDS estimator, in Theorem 4.3 below. Next, define µ to be the smallest modulus of continuity such that on (u, v) S ∈ √V ar(( m(W, b)+ρ ′b(X)b(X))′u) µσ b′u , √V ar((Y b(X)′β )b(X)′v) µ b′v , − 0 ≤ k kP,2 − 0 ≤ k kP,2 √V ar(u′b(X)b(X)′v) µ( b′u + b′v ). ≤ k kP,2 k kP,2 In typical applications, the modulus of continuity µ is bounded. Indeed, if elements of the dictionary are bounded with probability one, b(X) C, then we can select µ = CB k k∞ ≤ DML with Riesz Representers 17 for many functionals of interest, so the assumption is plausible. If b(X) = X are sub- Gaussian, then this assumption is also easily satisfied; however, this case is not of central interest to us. Consider P that satisfies the following conditions.

ˆ K R(δ) With probability 1 ε, the estimation errors (βk β0, ρˆk ρ0) k=1 take values in SK , with quality of− the guarantee obeying { − − } σ−1(√mσr + µr (1 + σ)+ µσr ) δ. 3 1 2 ≤

R(δ) is a requirement on how the sequences (r1, r2, r3) evolve relative to (σ,µ,m). We will formally verify R(δ) for the approximately sparse setting, in Corollary 4.4 below. R(δ) is the key condition for our main result, Theorem 4.1.

Theorem 4.1. (Adaptive estimation and approximate Gaussian inference) Suppose K divides n for simplicity. Under condition R(δ), we have the adaptivity property, namely the difference between the DML and the oracle estimator is small: for any ∆ (0, 1), ∈ √n(θˆ θ¯)/σ √K4δ/∆ | − |≤ with probability at least 1 ε ∆2. − − As a consequence, θˆ concentrates in a σ/√n neighborhood of θ0, with deviations ap- proximately distributed according to the Gaussian law Φ(z)=P(N(0, 1) z): ≤ −1 3 −1/2 2 sup P(σ √n(θˆ0 θ0) z) Φ(z) A(κ/σ) n + √K2δ/∆+ ε + ∆ , z∈R − ≤ − ≤

where A< 1/2 is the sharpest absolute constant in the Berry–Esseen bound.

The constants can be chosen to yield an asymptotic result.

Corollary 4.1. (Uniform asymptotic adaptivity and Gaussianity) Let n be any nondecreasing set of probability laws P that obey condition R(δ ) where δ P0 is a n n → given sequence. Then the DML estimator θˆ is uniformly asymptotically equivalent to the oracle estimator θ¯, that is √n(θˆ θ¯)/σ = O (δ ) | − | P n uniformly in P n as n . In addition, if for each P n the kurtosis of ψ0 does not grow too fast,∈P namely: → ∞ ∈P (κ/σ)3/√n δ , ≤ n we have that √n(θˆ θ )/σ is asymptotically Gaussian uniformly in P : − 0 ∈Pn

lim sup sup PP (√n(θˆ0 θ0)/σ z) Φ(z) =0. n→∞ P ∈Pn z∈R − ≤ −

Hence the DML estimator of the linear functionals of the BLP function γ0 enjoys good properties under the stated regularity conditions. This result does not distinguish between inference on global functionals from inference on local functionals, as long as the latter are not perfectly localized. We state a separate result for perfectly localized functionals below. 18 Chernozhukov, Newey, and Singh ⋆ Corollary 4.2. (Inference on the ultimate parameter θ0) Suppose that, in ad- dition to conditions of Corollary 4.1, P satisfies the small approximation error condition:

(√n/σ) θ θ⋆ = (√n/σ) r r dF δ. (4.16) | 0 − 0| α γ ≤ Z ⋆ Then conclusions of Theorem 4.1 hold with θ0 replacing θ0, with √K4δ/∆ increased by δ, ⋆ and the same probability. Conclusions of Corollary 4.1 continue to hold with θ0 replacing θ0 for a class of probability laws n, provided each P n satisfies the conditions of Corollary 4.1 and (4.16) for the givenP δ = δ 0. ∈ P n → The approximation bias for the ultimate target can be plausibly small due to the fact that many rich function classes admit regularized linear approximations with respect to conventional dictionaries b. For instance, Tsybakov (2012) and Belloni et al. (2014) show small approximation bias using Fourier bases as dictionaries, and using Sobolev and rearranged Sobolev balls, respectively, as the function classes.

Corollary 4.3. (Inference on the perfectly localized parameter) Suppose that, in addition to conditions of Corollary 4.1, P satisfies the small approximation error con- dition: √n θ (γ ; ℓ ) θ (γ⋆; ℓ ) /σ = √n r r dF /σ δ, (4.17) | 0 0 h − 0 0 h | α γ ≤ Z and the localization bias is small:

√n θ (γ⋆; ℓ ) θ (γ⋆; ℓ ) /σ δ, (4.18) | 0 0 h − 0 0 0 | ≤ ⋆ Then conclusions of Theorem 4.1 hold with θ0(γ0 ; ℓ0) replacing θ0, with √K4δ/∆ in- creased by 2δ, and the same probability. Conclusions of Corollary 4.1 continue to hold with θ⋆(γ⋆; ℓ ) replacing θ = θ (γ ; ℓ ) for a class of probability laws , provided 0 0 0 0 0 h Pn each P n satisfies the conditions of Corollary 4.1 and (4.17)-(4.18) for the given δ = δ ∈0 P. n → 4.4. Semi-parametric efficiency

Below we use concepts from semi-parametric efficiency, as presented in Bickel et al. (1993) and Van der Vaart (2000); we do not recall them here for brevity. ˆ ⋆ The DML-RR estimator θ will be asymptotically efficient for estimating θ0, defined ⋆ ¯ as a functional of γ0 , the projection of Y on Γ. The distribution of a data observation is unrestricted in this case, so that there will only be one influence function for each functional of interest, and the estimator is asymptotically linear with that influence function. The standard semiparametric efficiency results then imply that our estimator will have the smallest asymptotic concentration among estimators that are locally regular; see Bickel et al. (1993) and Van der Vaart (2000). Our formal result stated below only implies efficiency for the regular case, where the operator norm of the function L is bounded, holding P fixed. We expect that a similar result continues to hold with L , by developing an appropriate formalization that handles P changing with n and→ rules ∞ out super-efficiency phenomena. However, this formalization requires a separate major development, which we leave to future research. ⋆ ⋆ In what follows, the notation γ0,P emphasizes the dependence of the projection γ0 on P . DML with Riesz Representers 19 ⋆ ⋆ ⋆ ⋆ ⋆ Theorem 4.2. (Efficiency) Let ψ0 (W ) := θ0 m(W, γ0 ) α0(X)(Y γ0 (X)). Sup- 2 2 − − − pose that E[Y ] < , E[ψ0(W ) ] < , and m(W, γ) is mean square continuous in γ ∞ ⋆ ∞ under P . Then θ0,P := m(w,γ0,P )dP (w) is differentiable at P , in the sense that

R θ0,Pτ θ0,P lim − =EP δ(W )ψ0(W ), τց0 τ where ψ0 is called the influence function and is unique, and the directional perturbation Pτ is defined as dPτ = dP [1 + τδ], where the direction δ is any element of the tan- gent set δ measurable : R : δdP = 0, δ ∞ < M for each 0

4.5. Properties of GDS estimators

Our goal is to verify that the guarantee R(δ) holds. In particular we have to analyze (r , r , r ) by bounding the population prediction norm v √v′Gv. This is a more 1 2 3 7→ nuanced problem than bounding the empirical prediction norm v v′Gvˆ , which has 7→ been accomplished in a variety of prior analyses done on Dantzig-typpe and Lasso-type estimators. We begin with the following condition, which only controls the max of error rates and controls the ℓ1 norm of true parameters:

MD We have that t0 T and t0 1 B, where B 1, and the empirical moments obey the following bounds∈ withk probabilityk ≤ at least 1≥ ε, for λ¯ λ − ≥ Gˆ G λ,¯ Gtˆ Mˆ λ. k − k∞ ≤ k 0 − k≤

The bounds on ℓ1 norm of coefficients are naturally motivated, for example, by working in Sobolev or rearranged Sobolev spaces (see, Tsybakov (2012) and Belloni et al. (2014), respectively). Rearranged Sobolev spaces allow the first p regression coefficients in the series expansion to be arbitrarily rearranged, allowing a much greater degree of oscillatory behaviors than in the original Sobolev spaces. The complexity of these function classes are also different. Sobolev spaces are Donsker sets under sufficient smoothness, whereas rearranged Sobolev spaces have the covering entropy bounded below by log p and are not Donsker if p . At the core→ of ∞ this approach is the restricted set S(t ,ν) := δ : Gδ ν, t + δ t ,t + δ T , 0 { k k∞ ≤ k 0 k1 ≤k 0k1 ∈ } where ν is the noise level. As demonstrated in the proof, the GDS estimator belongs to this set with high probability 1 ǫ for the noise level ν = 4Bλ,¯ where λ is the penalty level of GDS (ν scales like log(−p n)/√n in our problems). ∨ p Definition 4.6. (Effective dimension) Define the effective dimension of t0 at the noise level ν > 0 as: ′ 2 s(t0) := s(t0; ν) := sup δ Gδ /ν . δ∈S(t0,ν) | | 20 Chernozhukov, Newey, and Singh The effective dimension is defined in terms of the population (rather than sample) covari- ance matrix G, which makes it easy to verify regularity conditions. Note that if G = I and t = s, then s(t ) s. More generally, s(t ) measures the effective difficulty k 0k0 0 ≤ 0 of estimating t0 in the prediction norm, created by design G and the structure of t0. The condition imposes no conditions on the restricted or sparse eigenvalues of G. For ′ example, take G = 11 , a rank 1 matrix, and suppose t0 0 = 1. Then s(t0) 1 holds in this case, giving useful and intuitive performance bounds,k k while the standard≤ restricted eigenvalues and cone invertibility factors are all zero in this case, yielding no bounds on the performance in the population prediction norm. This type of example illustrates the possibility of accommodation of overcomplete (multiple or amalgamated) dictionar- ies in b, whose use in conjunction with ℓ1 penalization has been advocated by Donoho et al. (2005). Of course, the bounds on effective− dimension follow from the bounds on cone-invertibility factors and restricted eigenvalues. p Given a vector δ R , let δA denote a vector with the j-th component set to δj if j A and 0 if j A∈. ∈ 6∈ Lemma 4.2. (Bound on effective dimension in approximately sparse model) Suppose that t0 is approximately sparse, namely t ∗ Aj−a j =1, ..., p, | 0|j ≤ ∗ p for some finite positive constants A and a > 1, where ( t0 j )j=1 is the non-increasing p M | | p rearrangement of ( t0j )j=1. Let t0 := t0(1( t0 > ν) := (t0j 1( t0j > ν))j=1 denote the vector with components| | smaller than ν trimmed| | to 0. Then | | 6a s(t ,ν) s k−1 , tM s := (A/ν)1/a, 0 ≤ × ∨ a 1 k 0 k0 ≤  −  k is the cone invertibility factor:

Gδ ∞ k := inf |M|k k : δ =0, δ c 2 δ , δ 6 k M k1 ≤ k Mk1 k k1 = support(tM), c = 1, ..., p , and s. M 0 M { }\M |M| ≤ The cone invertibility factor is a generalization of the restricted eigenvalue condition of Bickel et al. (2009), proposed by Ye and Zhang (2010). The concept of the effective dimension does not split t0 into a sparse component and a small dense component, as is done in the now standard analysis of ℓ1-regularized estimators of approximately sparse t0. The effective dimension is simply stated in terms of t0 alone.

Lemma 4.3. (Non-asymptotic bound for GDS in population prediction norm) Suppose that MD holds. Then with probability 1 2ε the estimator tˆ exists and obeys: − (tˆ t )′G(tˆ t ) (s(t ; ν)ν2) (2Bν). − 0 − 0 ≤ 0 ∧ The bound is a minimum of what is called the “fast rate bound” and the “slow rate” bound. This result tightens the result in Chatterjee (2013) who established a “slow rate” bound (in the context of Lasso) that applies under no assumptions on G. If the effective 2 dimension is not too big, as in the examples above, the “fast rate” s(t0)ν provides a DML with Riesz Representers 21 tighter bound under weak assumptions on G. It is important to emphasize that the result is stated in terms of the population prediction norm rather than the empirical norm. We now apply this result to GDS estimators of the Riesz representer and the BLP. We impose the following conditions. Let GA denote the empirical process over f : Rp and i A, namely ∈F W → ∈ G f := G f(W ) := I −1/2 (f(W ) Pf), Pf := Pf(W ) := f(w)dP (w). A A | | i − Xi∈A Z The following is a sufficient condition that will deliver the guarantee R(δ) for δ 0. Let ℓ˜ denote a positive constant (that increases to as n in the asymptotic→ results). ∞ → ∞

−1 −1 SC (a) The ℓ1 norms of coefficients are bounded as Dρ ρ0 1 B and Dβ β0 1 B, k k ≤ −1 k −k 1 ≤ for B 1, and the scaling matrices obey Dρv µDσ v for Dρ v S(Dρ ρ0,ν) ≥ −1 k−1 k≤ k k ˜ ∈ and Dβu µD u for Dβ u S(Dβ β0,ν) for ν = 4Bℓ/√n. (b) Given a randomk subsetk ≤ A ofk k 1, ..., n of∈ size m n n/K , dictionary b obeys with { }′ ≥ − ⌊ ⌋ probability at least 1 ǫ, GAbb ∞ ℓ.˜ (c) The penalty levels λρ and λβ are chosen such that with probability− k at leastk 1≤ ǫ, D−1(G bb′β G Yb(X)) /√m λ , − k β A 0 − A k∞ ≤ ρ D−1(G bb′β G m(W, b)) /√m λ , and are not overly large, λ λ k ρ A 0 − A k∞ ≤ β β ∨ ρ ≤ ℓ/˜ √m.

SC(a) records a restriction on the ℓ1 norm of β0 and ρ0. For instance, in Examples 1–3, Dρ σI LI, which requires the ℓ1-norm of ρ0 to increase at most at the speed L σ. ≍ ≍ ≍SC(b) is a weak assumption: the bound λ¯ and the penalty level λ can be chosen proportionally to log(p n)/√n, that is ∨ p ℓ˜ log(p n) ≍ ∨ using self-normalized moderate deviationp bounds (Jing et al. (2003); Belloni et al. (2014)) or high-dimensional central limit theorems (Chernozhukov et al. (2017)), under mild moment conditions, without requiring sub-Gaussianity. For instance, Belloni et al. (2014) employ these tools to show that, for the bounded design case b ∞ C, λ can be chosen as in the Gaussian error case, provided that errors follow t(2k +k δ)≤ distribution (having above 2 bounded moments), and get the error bounds similar to the Gaussian case. Here we state a general condition as our working assumption, instead of focusing on more specific condition that get us Gaussian-type conclusions.

Theorem 4.3. (GDS for BLP and RR) Suppose SC holds. Then with probability at least 1 K4ǫ, we have that u = βˆA β0 and v =ρ ˆA ρ0 obey, for some absolute constant C, − − − u′Gu r2, v′Gv σ2r2, u′Gv σr , ≤ 1 ≤ 2 | |≤ 3 r2 = Cµ2 (B2ℓ˜2s /n) (B2ℓ/˜ √n), r2 = Cµ2 (B2ℓ˜2s /n) (B2ℓ/˜ √n), r = r r , 1 D β ∧ 2 D ρ ∧ 3 1 2 −1 −1 where sβ and sρ are the effective dimensions for parameters Dβ β0 and Dρ ρ0 for the noise level ν =4Bℓ/˜ √n

In other words, we have instantiated (r1, r2, r3) for approximately sparse models in the 22 Chernozhukov, Newey, and Singh guarantee set S. We have the following corollary, which verifies R(δ) for approximately sparse models and hence provides sufficient conditions for Theorem 4.1.

Corollary 4.4. (Sufficient condition for R(δ)) Suppose SC holds. The guaran- tee R(δ) holds with ε =1 K4ǫ, provided − either Cs √nδ2/(ℓ˜3µ2µ2 ) or Cs √nδ2/(ℓ˜3µ2µ2 ), β ≤ D ρ ≤ D for some large enough constant C that only depends on B and K.

Remark 4.1. (Sharpness of conditions: Double sparsity robustness) This gives sufficient conditions such that (ignoring slowly growing term ℓ˜) the condition R(o(1)) holds if

either sβ √n or sρ √n, ≪ ≪ −1 where sβ and sρ are measures of the effective dimensions of parameters Dβ β0 and −1 Dρ ρ0. In well-behaved exactly sparse models, these effective dimensions are proportional to the sparsity indices divided by restricted eigenvalues. The latter possibility allows one of the parameter values to be “dense”, having unbounded effective dimension, in which case this parameter can be estimated at some “slow” rate n−1/4. These types of conditions appear to be rather sharp, matching similar conditions used in Javanmard and Montanari (2018) in the case of inference on a single coefficient in Gaussian exactly sparse linear regression models.

5. ESTIMATION AND INFERENCE USING GENERAL REGRESSION LEARNERS In this section we generalize the previous analysis to allow for any regression learnerγ ˆ of E[Y X] to be used in the construction of the estimator. As we have done in preceding sections| we continue to include local functionals in our analysis, so that the results apply to nonparametric objects as well as regular ones that can be estimated √n-consistently. The only conditions we will impose on the regression learner are certain L2 convergence properties that we will specify in this section. These properties will allow for a wide variety of learners, including GDS, Lasso, neural nets, boosting, and others. Thus we provide estimators of local functions that can be constructed using many regression learners. We continue to consider estimators that use cross-fitting and have the form K 1 θˆ = m(W , γˆ )+ˆα (X )[Y γˆ (X )] . n { i k k i i − k i } kX=1 iX∈Ik whereγ ˆk denotes the regression learner computed from observations not in Ik and ′ αˆk(x)= b(x) ρˆk is the GDS learner of the Riesz representer described in previous sections. To allow for as many regression learners as possible under as weak conditions as possible we focus on asymptotic analysis in this section. The fundamental property we will require ofγ ˆ is that it have some mean square convergence rate as an estimator of the true ⋆ ⋆ conditional mean γ0 . Specifically we require that there is r1 converging to zero such that for each k, γˆ γ⋆ = O (r⋆). (5.19) k k − 0 kP,2 p 1 ⋆ For purposes of formulating regularity conditions it is also useful to work with α0 rather than α0. We will also require that α⋆(ˆγ γ⋆) = o (σ), m( , γˆ γ⋆) = o (σ). (5.20) k 0 k − 0 kP,2 p k · k − 0 kP,2 p DML with Riesz Representers 23 In the regular case these conditions generally follow from the mean square consistency of ⋆ γˆk under boundedness of α0. In nonregular cases they may impose additional conditions. For example, under the conditions of Lemma 3.4 it will be sufficient for these conditions to hold that h−p1/2 γˆ γ⋆ 0. k k − 0 kP,2 →P This condition will hold as long as h grows slowly enough relative to the mean-square convergence rate of eachγ ˆk. Recall from Theorem 4.3 that r is the convergence rate of σ−1 αˆ α . Let 2 k k − 0kP,2 r⋆ = r + σ−1 α α⋆ and 2 2 k 0 − 0kP,2 ψ⋆(W )= θ m(W, γ⋆) α⋆(X)[Y γ⋆(X)]. 0 0 − 0 − 0 − 0 Theorem 5.1. (Asymptotic Gaussian inference with general regression learner) ⋆ ⋆ Suppose V ar(Y X) is bounded; r1 0 and r2 0; equations (5.19) and (5.20) are sat- isfied; and √nr⋆|r⋆ 0. Then → → 1 2 → n 1 σ √n θˆ = θ ψ⋆(W )+ o , hence (θˆ θ ) d (0, 1). 0 − n 0 i p √n σ − 0 → N i=1 X   ˆ ⋆ This result shows that asymptotic linearity of the estimator θ will result if r2 0 ⋆ → fast enough relative to r1 . Asymptotic linearity implies asymptotic Gaussian inference by standard central limit theorem arguments. As in regular doubly robust estimation ⋆ problems it allows for a tradeoff between the speed of convergence r2 of the Riesz rep- ⋆ resenter and r1 of the regression. It only requires a mean square convergence rate for the regression learnerγ ˆk and so allows for a wide variety of first step machine learning estimators. We could also formulate a non-asymptotic analog to this asymptotic result. This would depend on the availability of non-asymptotic results for the learner γˆk. To the best of our knowledge such results are not available for many learners, such as neural nets and random forests. To allow the results of this section to include as many first steps as possible we focus here on the asymptotic result and reserve the non-asymptotic result to future work.

REFERENCES Abadie, A. (2003). Semiparametric instrumental variable estimation of treatment re- sponse models. Journal of Econometrics 113 (2), 231–263. Altonji, J. G. and R. L. Matzkin (2005). Cross section and panel data estimators for nonseparable models with endogenous regressors. Econometrica 73 (4), 1053–1102. Athey, S., G. W. Imbens, and S. Wager (2018). Approximate residual balancing: Debi- ased inference of average treatment effects in high dimensions. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80 (4), 597–623. Athey, S., J. Tibshirani, and S. Wager (2019). Generalized random forests. The Annals of Statistics 47 (2), 1148–1178. Belloni, A., D. Chen, V. Chernozhukov, and C. Hansen (2012). Sparse models and methods for optimal instruments with an application to eminent domain. Economet- rica 80 (6), 2369–2429. Belloni, A., V. Chernozhukov, D. Chetverikov, and I. Fern´andez-Val (2019). Conditional 24 Chernozhukov, Newey, and Singh quantile processes based on series or many regressors. Journal of Econometrics 213 (1), 4–29. Belloni, A., V. Chernozhukov, and C. Hansen (2011). Inference for high-dimensional sparse econometric models. arXiv:1201.0220 . Belloni, A., V. Chernozhukov, and C. Hansen (2014). Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies 81 (2), 608–650. Belloni, A., V. Chernozhukov, and K. Kato (2015). Uniform post-selection in- ference for least absolute deviation regression and other Z-estimation problems. Biometrika 102 (1), 77–94. Belloni, A., V. Chernozhukov, and L. Wang (2014). Pivotal estimation via square-root lasso in nonparametric regression. The Annals of Statistics 42 (2), 757–788. Bickel, P. J., C. A. Klaassen, Y. Ritov, and J. A. Wellner (1993). Efficient and Adaptive Estimation for Semiparametric Models, Volume 4. Johns Hopkins University Press. Bickel, P. J. and Y. Ritov (1988). Estimating integrated squared density derivatives: Sharp best order of convergence estimates. Sankhy¯a: The Indian Journal of Statistics, Series A, 381–393. Bickel, P. J., Y. Ritov, and A. B. Tsybakov (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics 37 (4), 1705–1732. Blundell, R., J. L. Horowitz, and M. Parey (2012). Measuring the price responsiveness of gasoline demand: Economic shape restrictions and nonparametric demand estimation. Quantitative Economics 3 (1), 29–51. Bradic, J. and M. Kolar (2017). Uniform inference for high-dimensional quantile regres- sion: Linear functionals and regression rank scores. arXiv:1702.06209 . Cai, T. T. and Z. Guo (2017). Confidence intervals for high-dimensional linear regression: Minimax rates and adaptivity. The Annals of Statistics 45 (2), 615–646. Candes, E. and T. Tao (2007). The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics 35 (6), 2313–2351. Chatterjee, S. (2013). Assumptionless consistency of the lasso. arXiv:1303.5817 . Chernozhukov, V., D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal 21 (1), C1–C68. Chernozhukov, V., D. Chetverikov, and K. Kato (2013). Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. The Annals of Statistics 41 (6), 2786–2819. Chernozhukov, V., D. Chetverikov, and K. Kato (2017). Central limit theorems and bootstrap in high dimensions. The Annals of Probability 45 (4), 2309–2352. Chernozhukov, V., J. C. Escanciano, H. Ichimura, W. K. Newey, and J. M. Robins (2016). Locally robust semiparametric estimation. arXiv:1608.00033 . Chernozhukov, V. and C. Hansen (2004). The effects of 401(k) participation on the wealth distribution: An instrumental quantile regression analysis. Review of Economics and Statistics 86 (3), 735–751. Chernozhukov, V., C. Hansen, and M. Spindler (2015). Valid post-selection and post- regularization inference: An elementary, general approach. Annual Review of Eco- nomics 7 (1), 649–688. Chernozhukov, V., W. K. Newey, and R. Singh (2018). Learning L2 continuous regression functionals via regularized Riesz representers. arXiv:1809.05224 . Colangelo, K. and Y.-Y. Lee (2020). Double debiased machine learning nonparametric inference with continuous treatments. arXiv:2004.03036 . DML with Riesz Representers 25 D´ıaz, I. and M. J. van der Laan (2013). Targeted data adaptive estimation of the causal dose–response curve. Journal of Causal Inference 1 (2), 171–192. Donoho, D. L., M. Elad, and V. N. Temlyakov (2005). Stable recovery of sparse over- complete representations in the presence of noise. IEEE Transactions on Information Theory 52 (1), 6–18. Fan, Q., Y.-C. Hsu, R. P. Lieli, and Y. Zhang (2019). Estimation of conditional average treatment effects with high-dimensional data. arXiv:1908.02399 . Florens, J.-P., J. J. Heckman, C. Meghir, and E. Vytlacil (2008). Identification of treat- ment effects using control functions in models with continuous, endogenous treatment and heterogeneous effects. Econometrica 76 (5), 1191–1206. Foster, D. J. and V. Syrgkanis (2019). Orthogonal statistical learning. arXiv:1901.09036 . Galvao, A. F. and L. Wang (2015). Uniformly semiparametric efficient estimation of treatment effects with a continuous treatment. Journal of the American Statistical Association 110 (512), 1528–1542. Genovese, C. and L. Wasserman (2008). Adaptive confidence bands. The Annals of Statistics 36 (2), 875–905. Guo, Z. and C.-H. Zhang (2019). Local inference in additive models with decorrelated local linear estimator. arXiv:1907.12732 . Hasminskii, R. Z. and I. A. Ibragimov (1979). On the nonparametric estimation of functionals. In Proceedings of the Second Prague Symposium on Asymptotic Statistics. Hausman, J. A. and W. K. Newey (1995). Nonparametric estimation of exact consumers surplus and deadweight loss. Econometrica, 1445–1476. Hernan, M. A. and J. M. Robins (2019). Causal Inference. CRC. Hirshberg, D. A. and S. Wager (2017). Balancing out regression error: Efficient treatment effect estimation without smooth propensities. arXiv:1712.00038v1 . Hirshberg, D. A. and S. Wager (2018). Debiased inference of average partial effects in single-index models. arXiv:1811.02547 . Hirshberg, D. A. and S. Wager (2019). Augmented minimax linear estimation. arXiv:1712.00038v5 . Imbens, G. W. and W. K. Newey (2009). Identification and estimation of triangular simultaneous equations models without additivity. Econometrica 77 (5), 1481–1512. Imbens, G. W. and D. B. Rubin (2015). Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press. Jankova, J. and S. Van De Geer (2015). Confidence intervals for high-dimensional inverse covariance estimation. Electronic Journal of Statistics 9 (1), 1205–1229. Jankova, J. and S. Van De Geer (2016). Confidence regions for high-dimensional gener- alized linear models under sparsity. arXiv:1610.01353 . Jankova, J. and S. Van De Geer (2018). Semiparametric efficiency bounds for high- dimensional models. The Annals of Statistics 46 (5), 2336–2359. Javanmard, A. and A. Montanari (2014a). Confidence intervals and hypothesis testing for high-dimensional regression. The Journal of Machine Learning Research 15 (1), 2869–2909. Javanmard, A. and A. Montanari (2014b). Hypothesis testing in high-dimensional regres- sion under the Gaussian random design model: Asymptotic theory. IEEE Transactions on Information Theory 60 (10), 6522–6554. Javanmard, A. and A. Montanari (2018). Debiasing the lasso: Optimal sample size for Gaussian designs. The Annals of Statistics 46 (6A), 2593–2622. Jing, B.-Y., Q.-M. Shao, and Q. Wang (2003). Self-normalized Cram´er-type large devia- tions for independent random variables. The Annals of Probability 31 (4), 2167–2215. 26 Chernozhukov, Newey, and Singh Kallus, N. and A. Zhou (2018). Policy evaluation and optimization with continuous treatments. In International Conference on Artificial Intelligence and Statistics, pp. 1243–1251. Kennedy, E. H. (2020). Optimal doubly robust estimation of heterogeneous causal effects. arXiv:2004.14497 . Kennedy, E. H., Z. Ma, M. D. McHugh, and D. S. Small (2017). Nonparametric methods for doubly robust estimation of continuous treatment effects. Journal of the Royal Statistical Society: Series B, Statistical Methodology 79(4), 1229. Lee, S., R. Okui, and Y.-J. Whang (2017). Doubly robust uniform confidence band for the conditional average treatment effect function. Journal of Applied Econometrics 32 (7), 1207–1225. Luedtke, A. R. and M. J. Van Der Laan (2016). Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. Annals of Statistics 44 (2), 713. Newey, W. K. (1994a). The asymptotic variance of semiparametric estimators. Econo- metrica, 1349–1382. Newey, W. K. (1994b). Kernel estimation of partial means and a general variance esti- mator. Econometric Theory 10 (2), 1–21. Newey, W. K., F. Hsieh, and J. M. Robins (1998). Undersmoothing and bias corrected functional estimation. Technical report, MIT Department of Economics. Newey, W. K., F. Hsieh, and J. M. Robins (2004). Twicing kernels and a small bias property of semiparametric estimators. Econometrica 72 (3), 947–962. Newey, W. K. and J. R. Robins (2018). Cross-fitting and fast remainder rates for semi- parametric estimation. arXiv:1801.09138 . Neykov, M., Y. Ning, J. S. Liu, and H. Liu (2018). A unified theory of confidence regions and testing for high-dimensional estimating equations. Statistical Science 33 (3), 427– 443. Neyman, J. (1959). Optimal asymptotic tests of composite statistical hypotheses. In Probability and Statistics, pp. 416–444. Wiley. Nie, X. and S. Wager (2017). Quasi-oracle estimation of heterogeneous treatment effects. arXiv:1712.04912 . Ning, Y. and H. Liu (2017). A general theory of hypothesis tests and confidence regions for sparse high dimensional models. The Annals of Statistics 45 (1), 158–195. Peters, J., D. Janzing, and B. Sch¨olkopf (2017). Elements of Causal Inference: Founda- tions and Learning Algorithms. MIT press. Poterba, J. M. and S. F. Venti (1994). 401(k) plans and tax-deferred saving. In Studies in the Economics of Aging, pp. 105–142. University of Chicago Press. Poterba, J. M., S. F. Venti, and D. A. Wise (1995). Do 401(k) contributions crowd out other personal saving? Journal of Public Economics 58 (1), 1–32. Ren, Z., T. Sun, C.-H. Zhang, and H. H. Zhou (2015). Asymptotic normality and optimal- ities in estimation of large Gaussian graphical models. The Annals of Statistics 43 (3), 991–1026. Robins, J., M. Sued, Q. Lei-Gomez, and A. Rotnitzky (2007). Comment on ”performance of double-robust estimators when inverse probability weights are highly variable”. Sta- tistical Science 22 (4), 544–559. Robins, J. M. and A. Rotnitzky (1995). Semiparametric efficiency in multivariate re- gression models with missing data. Journal of the American Statistical Associa- tion 90 (429), 122–129. DML with Riesz Representers 27 Robins, J. M. and A. Rotnitzky (2001). Comment on “inference for semiparametric models: Some questions and an answer”. Statistica Sinica 11 (4), 920–936. Robins, J. M., A. Rotnitzky, and L. P. Zhao (1995). Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. Journal of the American Statistical Association 90 (429), 106–121. Rothenh¨ausler, D. and B. Yu (2019). Incremental causal effects. arXiv:1907.13258 . Rubin, D. and M. J. van der Laan (2005). A general imputation methodology for non- parametric regression with censored data. Technical report, UC Berkeley Division of Biostatistics. Rubin, D. and M. J. van der Laan (2006). Extending marginal structural models through local, penalized, and additive learning. Technical report, UC Berkeley Division of Biostatistics. Schick, A. (1986). On asymptotically efficient estimation in semiparametric models. The Annals of Statistics 14 (3), 1139–1151. Schmalensee, R. and T. M. Stoker (1999). Household gasoline demand in the United States. Econometrica 67 (3), 645–662. Semenova, V. and V. Chernozhukov (2020). Debiased machine learning of conditional average treatment effects and other causal functions. The Econometrics Journal. Shevtsova, I. (2011). On the absolute constants in the Berry-Esseen type inequalities for identically distributed summands. arXiv:1111.6554 . Toth, B. and M. van der Laan (2016). TMLE for marginal structural models based on an instrument. Technical report, UC Berkeley Division of Biostatistics. Tsybakov, A. B. (2012). Introduction to Nonparametric Estimation. Springer. Van de Geer, S., P. B¨uhlmann, Y. Ritov, and R. Dezeure (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics 42 (3), 1166–1202. Van Der Laan, M. J. and S. Dudoit (2003). Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estima- tor: Finite sample oracle inequalities and examples. Technical report, UC Berkeley Division of Biostatistics. van der Laan, M. J. and A. R. Luedtke (2014). Targeted learning of an optimal dy- namic treatment, and statistical inference for its mean outcome. Technical report, UC Berkeley Division of Biostatistics. Van der Laan, M. J. and S. Rose (2011). Targeted Learning: Causal Inference for Obser- vational and Experimental Data. Springer Science & Business Media. Van Der Laan, M. J. and D. Rubin (2006). Targeted maximum likelihood learning. The International Journal of Biostatistics 2 (1). Van Der Vaart, A. et al. (1991). On differentiable functionals. The Annals of Statis- tics 19 (1), 178–204. Van der Vaart, A. W. (2000). Asymptotic Statistics, Volume 3. Cambridge University Press. Van Der Vaart, A. W. and J. A. Wellner (1996). Weak convergence. In Weak Convergence and Empirical Processes, pp. 16–28. Springer. Wang, L., A. Rotnitzky, and X. Lin (2010). Nonparametric regression with missing out- comes using weighted kernel estimating equations. Journal of the American Statistical Association 105 (491), 1135–1146. Yatchew, A. and J. A. No (2001). Household gasoline demand in Canada. Economet- rica 69 (6), 1697–1709. 28 Chernozhukov, Newey, and Singh Ye, F. and C.-H. Zhang (2010). Rate minimaxity of the lasso and Dantzig selector for the ℓq loss in ℓr balls. Journal of Machine Learning Research 11 (Dec), 3519–3540. Zhang, C.-H. and S. S. Zhang (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76 (1), 217–242. Zhu, Y. and J. Bradic (2017). Breaking the curse of dimensionality in regression. arXiv:1708.00430.. Zhu, Y. and J. Bradic (2018). Linear hypothesis testing in dense high-dimensional linear models. Journal of the American Statistical Association 113 (524), 1583–1600. Zimmert, M. and M. Lechner (2019). Nonparametric estimation of causal heterogeneity under high-dimensional confounding. arXiv:1908.08779 .

A. RELATED WORK A.1. Previous Learning Problems

The paper builds upon ideas in classical semi- and nonparametric learning theory with low-dimensional X, using traditional smoothing methods [Van Der Vaart et al. (1991); Newey (1994a); Bickel et al. (1993); Robins and Rotnitzky (1995); Van der Vaart (2000)], that do not apply to the current high-dimensional setting. Our paper also builds upon and contributes to the literature on modern orthogonal/debiased estimation and infer- ence [Zhang and Zhang (2014); Belloni et al. (2011, 2014, 2015); Javanmard and Mon- tanari (2014a,b, 2018); Van de Geer et al. (2014); Ning and Liu (2017); Chernozhukov et al. (2015); Neykov et al. (2018); Ren et al. (2015); Jankova and Van De Geer (2015, 2016, 2018); Bradic and Kolar (2017); Zhu and Bradic (2017, 2018)], which focuses on coefficients in high-dimensional linear and generalized linear regression models, without considering the general linear functionals analyzed here. The functionals we consider are different than those analyzed in Cai and Guo (2017). The continuity properties of functionals we consider provide additional structure that we exploit, namely the Riesz representer, an object that is not considered in Cai and Guo (2017). Targeted maximum likelihood, Van Der Laan and Rubin (2006), based on machine learners has been considered by Van der Laan and Rose (2011) and large sample theory given by Luedtke and Van Der Laan (2016), Toth and van der Laan (2016), and Zheng et al. (2016). Here we provide DML learners via regularized RR, which are relatively simple to implement and analyze, and which directly target functionals of interest and learn the RR automatically from the data.

A.2. De-biased Estimation

We build on previous work on debiased estimating equations constructed by adding an influence function. Hasminskii and Ibragimov (1979) and Bickel and Ritov (1988) suggest such estimators for functionals of a density. Newey (1994a) derives such scores as a part of the computation of the semi-parametric efficiency bound for regular functionals. Doubly robust estimating equations as in Robins et al. (1995) and Robins and Rotnitzky (1995) have this structure. Newey et al. (1998, 2004) further develop theory in this vein, in a low-dimensional nonparametric setting. In the regular case, Chernozhukov et al. (2016, 2018) analyze the double robust/debiased learners in several high-dimensional DML with Riesz Representers 29 settings. However, analysis requires an explicit formula for the Riesz representer, used in its estimation, which is often unavailable in closed form (or may be inefficient when restrictions such as additivity are used—see Section C for the explicit definition of the additive model and structure of representers in that case). In contrast, here we estimate the Riesz representer automatically from the moment conditions that characterize it, and extend the analysis to cover non-regular functionals. Various papers have considered direct estimation of the Riesz representer. Among these papers, ours is the first to present a framework for direct estimation of the Riesz represen- ter of a broad class of linear functionals, in a high-dimensional setting, without requiring strong Donsker class assumptions. The earliest reference of which we know is Robins et al. (2007), a comment on another paper, which consider only the global average treatment effect (ATE). Zhu and Bradic (2017) show that it is possible to attain √n-consistency for the coefficients of a partially linear model when the regression function is dense. Our results apply to a much broader class of functionals, and allow for tradeoffs in accu- racy of estimating the regression function and the Riesz representer. Newey and Robins (2018) present and analyze estimators based on regression splines, while we present and analyze sparse estimators methods for the high-dimensional setting. The Athey et al. (2018) estimator of the ATE is based on sparse linear regression and on approximate balancing weights when the regression is linear and strongly sparse. Our results apply to a much broader class of linear functionals and allow the regression learner to converge at relatively slow rates, including the dense case or approximately sparse case. Since the first version of this paper was posted online, subsequent work has built upon its insights. Hirshberg and Wager (2019) build upon the present work by considering the problem of learning regular functionals when the regression function belongs to a Donsker class. They utilize the orthogonal representations proposed in this paper and Chernozhukov et al. (2016), and extend the initial version of the paper, Hirshberg and Wager (2017), that had only considered the ATE example. Our approach does not re- quire a Donsker class assumption, which is too restrictive in our setting. Hirshberg and Wager (2018) consider the average derivative functional in a single index model, ana- lyzing a variant of the estimator proposed here, adapted to the single-index regression structure. Rothenh¨ausler and Yu (2019) builds upon our work, analyzing global average derivative functionals, and proposing practical Lasso-type solvers for estimating the RR. Our approach is also practical; the RR estimation is based on a Dantzig selector type estimator, which is easy to compute by linear programming methods. In follow-up work, Chernozhukov et al. (2018) consider different Lasso-type solvers for estimating RR. Com- pared to Rothenh¨ausler and Yu (2019), our analysis covers a much broader collection of functionals, and deals with both local and global versions.

A.3. Localized Functionals

A new development incorporated in this version of the paper is the inclusion of local and localized functionals, such as average treatment/policy effects and derivatives localized to certain neighborhoods of a value of a low-dimensional covariate subvector. In low- dimensional nonparametrics, the study of such functionals, called “partial means” goes back, e.g., to Newey (1994b). In contrast, here we treat the case where the ambient covariate space is very high-dimensional, but we localize with respect to a value of a low- dimensional subvector. Moreover, we must rely on orthogonalized estimating equations to eliminate the regularization biases arising due to the high-dimensional ambient space. 30 Chernozhukov, Newey, and Singh Various papers have studied debiased moment equations for certain localized functionals: conditional average treatment effect (CATE), continuous treatment effect (CTE), and regression derivative at a point. We instead present a unified analysis for the general class of localized functionals. Moreover, we cover local effects that are not perfectly localized, which may be more robust objects from an inferential point of view, as argued in Genovese and Wasserman (2008). The debiased CATE and CTE literature is vast. Prominent examples of the debiased CATE literature include Wang et al. (2010), van der Laan and Luedtke (2014), Luedtke and Van Der Laan (2016), Nie and Wager (2017), Lee et al. (2017), and most recently Kennedy (2020). Independently and contemporaneously to the present version of the paper, Fan et al. (2019) and Zimmert and Lechner (2019) define and study perfectly localized average treatment effects with high-dimensional confounders. Prominent exam- ples of the debiased CTE literature include Rubin and van der Laan (2006), D´ıaz and van der Laan (2013), Galvao and Wang (2015), Kennedy et al. (2017), Kallus and Zhou (2018), and Colangelo and Lee (2020). These works develop inference on perfectly lo- calized average potential outcomes with continuous treatment effects, using a different approach than what we develop here. Our development is complementary as it covers a much broader collection of functionals. The debiased literature on regression derivative at a point is more recent. Guo and Zhang (2019) study inference on the regression derivative ∂γ1(d) at a point d in a high- dimensional regression model, γ(D,Z)= γ1(D)+ γ2(Z), where D is univariate covariate of interest and Z is a high-dimensional vector of control covariates. Our analysis is again complementary: it covers objects like this, but also covers more general functionals like E[∂dγ(D,Z) D = d], either without additivity structure or without requiring D to be one-dimensional.| Semenova and Chernozhukov (2020) apply low-dimensional series re- gression estimators on top of the pre-estimated unbiased orthogonal signal of treatment and partial derivative effects, where pre-estimation of the orthogonal signal is done in the high-dimensional setting. Our analysis has a rather different structure (without re- liance on close-form solutions for Riesz representers), and kernels are used for localization instead of series. Our work complements existing work that considers the problem of estimating general nonpathwise differentiable functionals like the localized ones here. Early contributions include Robins and Rotnitzky (2001), Van Der Laan and Dudoit (2003), and Rubin and van der Laan (2005). More recently, Athey et al. (2019) consider this issue in the context of generalized random forests. Foster and Syrgkanis (2019) present a general theory, but without inference guarantees. Unlike previous work, we analyze finite sample Gaussian approximation.

B. NOTATION AND PRELIMINARIES B.1. Notation glossary

′ ′ n Let W = (Y,X ) be a random vector with law P on the sample space , and W1 = n W (Yi,Xi)i=1 denote i.i.d. copies of W . The law of X is denoted by F . All models and probability measure P can be indexed by n, the sample size, so that the models and their dimensions and parameters determined by P change with n. We use notation from the empirical process theory, see Van Der Vaart and Wellner (1996). Let EI f denote the empirical average of f(W ) over i I 1, ..., n : E f := E f(W )= I −1 f(W ). i ∈ ⊂{ } I I | | i∈I i P DML with Riesz Representers 31 p Let GI denote the empirical process over f : R and i I, namely GI f := G −1/2 ∈ F W → ∈ I f(W ) := I i∈I (f(Wi) Pf), where Pf := Pf(W ) := f(w)dP (w). Denote the Lq(P ) norm| | of a measurable function− f : R and also the Lq(P ) norm of random variable f(W ) by Pf = f(W ) . WeW use → to denote ℓR norm on Rd. For a k kP,q k kP,q k·kq q differentiable map x f(x), from Rd to Rk, we use ∂ ′ f(x) to abbreviate the partial 7→ x derivatives (∂/∂x′)f(x), and we use ∂ ′ f(x ) to mean ∂ ′ f(x) , etc. We use x′ to x 0 x |x=x0 denote the transpose of a column vector x. We use divd to denote the divergence of scalar dim(d) function: divd g = j=1 ∂dj g(d). We say that a . b under the asymptotics with an index n if a Cb for all n sufficiently large, and a b if both a . Cb and b . Ca for all n→sufficiently ∞ ≤P large, where C 1 is a positive constant≍ that does not depend on n. ≥ B.2. Preliminaries

To prove the first couple of lemmas we recall the following definitions and results. Given two normed vector spaces V and W over the field of real numbers R, a linear map A : V W is continuous if and only if it has a norm: → A := inf c 0 : Av c v for all v V < , k kop { ≥ k k≤ k k ∈ } ∞ where op is the operator norm. The operator norm depends on the choice of norms for the normedk·k vector spaces V and W . A is a complete linear space equipped with an inner product f,g and the norm f,f 1/2. The space L2(P ) is the Hilbert space h i |h i| with the inner product f,g = fgdP and norm f P,2. The closed linear subspaces of L2(P ) equipped with theh samei inner product andk normk are Hilbert spaces. R Hahn–Banach extension for normed vector spaces. If V is a normed with linear subspace U (not necessarily closed) and if φ : U K is continuous and linear, then there exists an extension ψ : V K of φ which is also7→ continuous and linear and which has the same operator norm as7→φ. Riesz–Frechet representation theorem. Let H be a Hilbert space over R with an inner product , , and T a bounded linear functional mapping H to R. If T is bounded then there existsh· ·i a unique g H such that for every f H we have T (f)= f,g . It is given by g = z(Tz), where z∈is unit-norm element of the∈ orthogonal complementh i of the kernel subspace K = a H : Ta = 0 . Moreover, T op = g , where T op denotes the operator norm of T{ , while∈ g denotes} the Hilbertk spacek normk k of g. k k k k Radon–Nykodym derivative. Consider a measure space ( , Σ) on which two σ- finite measure are defined, µ and ν. If ν µ (i.e. ν is absolutely continuousX with respect to µ), then there is a measurable function≪ f : [0, ), such that for any measurable set A , ν(A)= f dµ. The function f isX conventionally → ∞ denoted by dν/dµ. ⊆ X A Integration by parts. Consider a closed measurable subset of Rk equipped with R Lebesgue measure V and piecewise smooth boundary ∂ , and supposeX that v : Rk and φ : R are both C1( ), then X X → X → X

ϕ div v dV = ϕ v′dS v′ grad ϕ dV, − ZX Z∂X ZX where S is the surface measure induced by V . 32 Chernozhukov, Newey, and Singh C. STRUCTURE OF FUNCTIONALS AND THEIR SCORES IN LEADING EXAMPLES We see that the key quantities in the main inference results are the operator norm L of the linear functional and the standard deviation σ and kurtosis κ/σ of the score ψ0. In this section we establish bounds on these quantities in the key Examples 1–4, focusing on either unrestricted or additive nonparametric models.

C.1. Structure of Riesz representers for unrestricted and additive models

Below we derive linear representers through change of measure and integration by parts. These representers are universal since they apply to the unrestricted model, where Γ¯ = 2 L (F ). We remark here that these representers are universal, since they can represent θ0 even when Γ¯ = L2(F ), if they exist. These universal representers are not minimal unless Γ¯ = L2(F ). Theorem6 4.2 implies that it is better to use the minimal representer than the universal representer to attain full semi-parametric efficiency (unless Γ = L2(F )). Consider the following (some well-known) candidates for universal linear representers in Examples 1–4: α (x; ℓ) = [(1(d = 1) 1(d = 0))/P(D = d Z = z)]ℓ(x); (3.21) 0 − | α (x; ℓ) = [d(F (x) F (x))/dF (x)]ℓ(x); (3.22) 0 1 − 0 α (x; ℓ) = [d(F (x) F (x))/dF (x)]ℓ(x), F = Law(T (X)); (3.23) 0 1 − 1 α (x; ℓ)= (div (ℓ(x)t(x)f(d z))/f(d z), f(d z)= pdfof D given Z = z; (3.24) 0 − d | | | treated as formal maps α : R na , where dF /dF denotes the Radon–Nykodym 0 X → ∪{ } k derivative of measure Fk with respect to F on support(ℓ), divd denotes the divergence of scalar function: p1

divd g(d, z)= ∂dj g(d, z), j=1 X and na is “not available”. The Radon–Nykodym derivatives exist if Fk is absolutely continuous with respect to F on support(ℓ).

Lemma 3.1. (Universal representers for key examples) In Example 1–4, (i) If 2 α0(X; ℓ) is real-valued a.s. and α0( ; ℓ) L (F ), then it is the universal representer for the corresponding linear functional ·γ ∈θ(γ), and the latter is continuous. In Example 4, we require that d γ(x)ℓ(x)t(x)f(7→d z) is continuously differentiable on the support 7→ | set z = support(D Z = z), and vanishes on its boundary ∂ z, which is assumed to be piecewise-smooth,D for| each z . Further, if Γ¯ = L2(F ), theD representer is minimal; ∈ Z ⋆ ¯ otherwise, the minimal representer α0 is obtained by projecting α0 onto Γ. (ii) There are examples of P , exhibited in the proof of this lemma, such that linear functionals 2 in Examples 1–4 can be continuous on Γ¯ = L (F ), but α0(X; ℓ) = na with positive probability. 6

Part of the lemma is well known (for example, the α0(X; ℓ) representer for ATE is the Horvitz-Thompson transformation), while a part of lemma appears to be new. The first part of the lemma provides a simple sufficient condition to guarantee continuity of the target functionals. It recovers well-known sufficient conditions for nonparametric identi- fication of various functionals. The second part of the lemma states that this condition DML with Riesz Representers 33 is not necessary, and that target functionals can be continuous on some subsets of L2(F ) without these conditions. The following is a useful result in view of the wide practical use of additive models, which model the regression function as additive in the two sets of vector components x1 and x2 of x. (There is not much loss in generality in considering two sets rather than multiple sets). It is an important setting where Γ is not dense in L2(F ) and where minimal representers are not equal to the universal representers.

AM Suppose that the regression function is additive in components x1 and x2 of x: x γ(x)= γ (x )+ γ(x ), x = (x′ , x′ )′ , 7→ 1 1 2 1 2 ∈ X where γ Γ , a dense subset of L2(F ), where F denotes the probability law of 1 ∈ 01 1 1 X1. The linear functional m0 and the weighing function ℓ depends only on the first component, namely m(w,γ; ℓ)= m(w,γ1; ℓ) and ℓ(x)= ℓ(x1).

The following lemma shows that we can construct representers for additive models by taking conditional expectation of a universal representer. We can immediately see that the minimal representers can be generated as conditional expectations of the universal representers.

Lemma 3.2. (Order-preserving, contractive representers for additive models) Work with AM and assume α ( ; ℓ) L2(F ). Then on γ Γ, 0 · ∈ ∈ θ(γ)= θ(γ )= α⋆(x )γ (x )dF (x ), α⋆(x ) = E[α (X) X = x ], 1 0 1 1 1 1 0 1 0 | 1 1 Z where α0 is any linear representer for γ θ(γ) on Γ. In particular, the conditional expectation operator is order-preserving, and7→ it induces the contraction for all Lq(P ) norms for all q [1, ]: ∈ ∞ α⋆ α . k 0kP,q ≤k 0kP,q The latter properties are useful in characterizing the structure of the global and local functionals under condition AM.

C.2. Structure of global functionals and scores in key examples

Here we develop bounds on the key quantities: the standard deviation σ of the score, the kurtosis κ/σ, and the modulus of continuity L. In the regular case, these quantities are bounded. Here we would like to study how the bounds depend on L, and we analyze the non-regular cases arising from taking a sequence of models with L . To make key points, we focus on the case where either Γ=¯ L2(F→) or ∞Γ¯ L2(F ) with the additive model AM holding. Furthermore, we develop these bounds in⊂ the context of Examples 1–3, though the proofs are useful to characterize bounds in other contexts. Our goal is to fix a weighting function ℓ, and to consider how a non-regularity L can arise from modeling quantities like → ∞ 1/P(D = d Z), (d(F F )/dF ) X, (d(F F )/dF ) X, (3.25) | 1 − 0 ◦ 1 − ◦ taking high values due to the denominator taking values close to zero. We may charac- terize such cases as the weakening of overlap of supports of relevant distributions (e.g., 34 Chernozhukov, Newey, and Singh

F puts small mass on points where F1 puts a lot of mass). In Example 4, a similar issue could arise due to 1/f(D Z) taking high values; for brevity, we don’t analyze this source of non-regularity for Example| 4 and focus on localization as the source. In the sequel, we say that a . b under the asymptotics with an index n if a Cb for all n sufficiently large, and a b if both a . Cb and b . Ca for all→n ∞sufficiently≤ large, where C 1 is a positive constant≍ that does not depend on n. ≥ Lemma 3.3. (Structure of global average effects functionals and scores in Examples 1,2,3) Suppose that either (a) Γ¯ = L2(F ) or (b) that Γ¯ L2(F ) with the additive model AM ⊂ holding. Suppose that the universal Riesz representers α0(X) = α0(X; ℓ) given in for- mulae (3.21), (3.22), (3.23) for Examples 1–3 exist and are in L2(F ). Suppose that α⋆(X)= α (X) in the case (a) and α⋆(X ) = E[α⋆(X) X ] in the case (b) obey: 0 0 0 1 0 | 1 α⋆ c( α⋆ 2 1), (3.26) k 0kP,3 ≤ k 0kP,2 ∨ for some finite constant c and that U = m(W, γ⋆(X)) Em(W, γ⋆(X)) and U = Y γ⋆(X) 1 0 − 0 2 − 0 obey the bounded moment and bounded heteroscedasticity conditions: (E[ U q])1/q c,¯ 0

Here we focus on local functionals and develop bounds that relate key quantities: the standard deviation σ of the score, the kurtosis κ/σ, and the modulus of continuity L. Our first goal is examine how the localization of the weighting function ℓ creates the non-regularity L . Our inference theory outlined above covers local functionals provided L/√n is→ small, ∞ and it also covers perfectly localized functionals provided the scaled localization bias is small: √n(θ(γ⋆; ℓ ) θ(γ⋆; ℓ ))/σ 0. 0 h − 0 0 → We provide a bound on the localization bias in terms of the smoothness and the kernel order. The latter additional requirement means that the inference on perfectly localized functionals is less robust than the inference on the local functionals (analogously, to the point that was made by Genovese and Wasserman (2008)). DML with Riesz Representers 35 Lemma 3.4. (Structure of local average effects functionals and scores in Examples 1, 2, 3) Suppose that either (a) Γ¯ = L2(F ) or (b) Γ¯ L2(F ) with the additive model AM hold- ⊂ ing. Suppose the universal Riesz representer α0(X; 1), corresponding to the flat weighting function ℓ =1, given in formulae (3.21), (3.22), and (3.23), corresponding to Examples 1, 2, and 3, exists and obeys 0 < α α (X; 1) α,¯ a.s. (3.27) ≤ 0 ≤

Suppose for some h0 > 0, we have that Nh0 (d0) = d : d d0 ∞ h . Suppose that for ℓ = ℓ with h h : { k − k ≤ } ⊂ D h ≤ 0 U = m(W, γ⋆(X); ℓ) Em(W, γ⋆(X); ℓ) and U = Y γ⋆(X), 1 0 − 0 2 − 0 obey the bounded heteroscedastic moment conditions: (E[ U q])1/q c¯ ℓ , 0

Lemma 3.5. (Structure of local average derivative functionals and scores in Example 4) Suppose that either (a) Γ¯ = L2(F ) or that (b) Γ¯ L2(F ) with the additive model AM ⊂ holding. Suppose the universal Riesz representer α0(X; ℓh) given in formula (3.24) exists for all 0

T op = sup T,u1 .... uv . | | ku1k2≤1,...,kuvk2≤1 |h ⊗ ⊗ i|

Lemma 3.6. (Structure of bias in perfect localization) Suppose that for some ⋆ h0 > 0, d m(d) = E[m(W, γ0 ) D = d] and d fD(d) are continuously differentiable 7→ |sm v 7→ sm o v on Nh0 (d0) to the integer order , and for := and ∂d denoting the tensor ∂v/(∂d)v we have ∧ v v ¯ sup ∂d(m(d)fD(d)) op g¯v, sup ∂dfD(d) op fv, inf fD(d) f. d∈Nh (d0) d∈Nh0 (d0) k k ≤ d∈Nh0 (d0) k k ≤ 0 ≥ In addition, assume m(d )f (d ) g.¯ 0 D 0 ≤ We have that for all h

2 We note that Γ = span(Γ0) is a linear subspace of L (F ), and Γ¯ is a closed subspace by definition. Therefore, Γ¯ is a Hilbert space with norm g g P,2 and inner product (f,g) f,g = fgdF . 7→ k k To show7→ h claimi (i), we note that by the Hahn–Banach extension theorem, the operator R θ :Γ R can be extended to θ˜ : Γ¯ R such that θ˜ op = θ op. By the Riesz–Frechet → → ⋆ k k ˜ k k ⋆ ¯ theorem there exists a unique representer α0 such that θ(γ) = γ, α0 on γ Γ and ˜ ⋆ h i ∈ θ op = α0 P,2. k k k k ⋆ To show claim (ii), we are given a linear representer α0. Denote by α0 the projection ¯ ⋆ of α0 onto Γ. Then γ ϕ(γ) := γ, α0 = γ, α0 agrees with γ θ(γ) on γ Γ. ¯ 7→ ˜ h i h⋆ i ¯ 7→ ∈ Extend θ to Γ by defining θ(γ)= ϕ(γ)= γ, α0 for γ Γ Γ, which is well-defined by h ⋆ i ∈ \ Cauchy-Schwarz inequality. Then ϕ op = α0 P,2 α0 P,2 < , since the orthogonal projection reduces the norm. Further,k k k k ≤k k ∞ ⋆ ⋆ ˜ ˜ > α0 P,2 = sup γ, α0 / γ P,2 = sup θ(γ) / γ P,2 = θ op. ∞ k k γ∈Γ¯\{0} |h i| k k γ∈Γ¯\{0} | | k k k k

⋆ ˜ Hence α0 is a representer for the extension θ, and the Riesz–Frechet theorem implies ⋆ ✷ that α0 is unique.

E. DETAILS FOR SECTION 3 E.1. Practical implementation details

In practice we use the following generic algorithm for computing GDS estimators over subsamples A. In particular, for regression we set m(W, b)= Yb(X). DML with Riesz Representers 37

1 Obtain initial estimate tˆ using a low-dimensional sub-dictionary b0 of b: tˆ (tˆ′ , 0′)′; tˆ = Gˆ−1Mˆ ; Mˆ E m(W, b ); Gˆ E b b′ ; ← 0 0 0 0 ← A 0 0 ← A 0 0 Compute the empirical moments for the full dictionary: Mˆ E m(W, b); Gˆ E bb′. ← A ← A 2 Update the diagonal normalization matrix: Dˆ 2 diag E [ b(X)b(X)′tˆ m(W, b) 2]; j =1, ..., p . ← A { − }j 3 Update the GDS estimate, using the current estimate as the starting point in the algorithm: tˆ arg min t : Dˆ −1(Mˆ Gtˆ ) λ; λ = cΦ−1(1 a/2p)/√n, ← k k1 k − k∞ ≤ − 4 Iterate on steps 2 and 3 several times. Return the final estimate tˆ.

We note the following. First, theoretical arguments similar to Belloni et al. (2012) suggest that the data-driven algorithm behaves as the algorithm that knows the ideal −1 D, since iterations yield DDˆ I ∞ P 0. The argument works provided we can set c> 1.1 . In practice, however,k c =− 1k works→ just fine from the outset. We set a small, e.g. a =0.1. Second, Chernozhukov et al. (2013) discuss finer data-driven choices of penalty levels based on the Gaussian or empirical bootstraps: −1 ∗ ∗ λ = c [(1 α) quantile( Dˆ (Mˆ + Gˆ t) ∞ (Wi)i∈Ic )], × − − k k | k where Mˆ ∗ and Gˆ∗ are bootstrap copies of Mˆ and Gˆ. This method yields an even lower theoretically valid penalty levels, because they adapt to the correlation structure much better. For instance, for highly-correlated empirical moments, the penalty level produced by this method can be substantially lower than the simple plug-in choice made above (in the extreme case, where the moments are perfectly correlated, the penalty level of Chernozhukov et al. (2013) approximates cΦ−1(1 a/2))/√n). − E.2. Partial difference

Consider a simplification of Example 4, average derivative:

⋆ ⋆ θ0 = ∂dγ0 (d, z)ℓ(x)dF (x). Z For nonparametric regression estimators that are linear in a dictionary b(d, z), e.g. GDS and Lasso, the average derivative is straightforward to compute: apply the learned co- efficients βˆ to the derivative of the dictionary ∂db(d, z), and average across observations using weighting ℓ(x)= ℓ(d, z). Random forest is an example of a nonparametric regression estimator that is not differentiable. A neural network is differentiable, but its derivative at each observation may be difficult to access when using a black-box implementation. For this reason, when using random forest or neural network, we use an average partial difference approximation of the average derivative. Specifically, consider the average partial difference functional 1 θ∗ = [γ⋆(d + ∆/2,z) γ⋆(d ∆/2,z)] ℓ(x)dF (x). 0 0 − 0 − ∆ Z 38 Chernozhukov, Newey, and Singh The theory developed for Example 3, policy effect from transporting X, directly applies to average partial difference. In practice, we take ∆ to be one fourth of the standard deviation of D. There is an important connection between average derivative and average partial dif- ference when using a nonparametric regression estimator that is linear in a dictionary b(d, z), e.g. GDS and Lasso. If the dictionary b(d, z) is quadratic in d, then the aver- age derivative estimate must be numerically identical to the average partial difference estimate. The specification from Semenova and Chernozhukov (2020) that we use when estimating average price elasticity of gasoline is quadratic in log price. Therefore Table 3 presents average partial difference estimates that perfectly coincide with average deriva- tive estimates for GDS and Lasso, and that approximate average derivative estimates for random forest and neural network.

F. PROOFS FOR SECTION 4 F.1. Proof of Theorem 4.1

The proof uses empirical process notation: GI denotes the empirical process over f : Rp and I 1, ..., n , namely ∈F W → ⊂{ } G f := G f(W ) := I −1/2 (f(W ) Pf), Pf := Pf(W ) := f(w)dP (w). I I | | i − Xi∈I Z c Step 1. We have a random partition (Ik, Ik) of 1, ..., n into sets of size m = n/K and n n/K. Let { } − θ¯ = θ E ψ (W ). k 0 − Ik 0 Observe that in Lemma 4.1, derivatives don’t depend on θ. Hence for all θ, ∂ ψ(W, θ; β ,ρ )= m(W, b)+ ρ ′b(X)b(X) =: ∂ ψ (W ) β 0 0 − 0 β 0 ∂ ψ(W, θ; β ,ρ )= b(X)(Y b(X)′β ) =: ∂ ψ (W ) ρ 0 0 − − 0 ρ 0 2 ′ 2 ∂βρ′ ψ(X,θ; β0,ρ0)= b(X)b(X) =: ∂βρ′ ψ0(W ), where ψ0(W ) := ψ(W, θ0; β0,ρ0) as before. Define the estimation errors u := βˆ β and v :=ρ ˆ ρ . Using Lemma 4.1, we have k − 0 k − 0 by the exact Taylor expansion around (β0,ρ0)

′ ′ ′ 2 θˆ = θ¯ (E ∂ ψ (W )) u (E ∂ ψ (W )) v u (E ∂ ′ ψ (W ))v. k k − Ik β 0 − Ik ρ 0 − Ik βρ 0 Consider the event that Condition R holds. On this event: E 4 (√m/σ)(θˆ θ¯ ) = rem := rem := σ−1[G ∂ ψ (W )]′u σ−1[G ∂ ψ (W )]′v k − k k jk − Ik β 0 − Ik ρ 0 j=1 X −1 ′ 2 −1 ′ 2 σ u [G ∂ ′ ψ (W )]v σ √mu [P ∂ ′ ψ (W )]v, − Ik βρ 0 − βρ 0 where we have used that by Lemma 4.1

′ ′ P ∂βψ0(W ) u =0, P ∂ρψ0(W ) v =0. We now bound E[rem2 1( )] by analyzing each of its terms. By the law of iterated k E DML with Riesz Representers 39 expectations

4 2 2 2 E[rem 1( )] = E[E[rem 1( ) (Wi)i∈Ic ]] 4 E[E[rem 1( ) (Wi)i∈Ic ]] k E k E | k ≤ jk E | k j=1 X 2 using the fact that E J V J J EV 2 for arbitrary random variables (V )J . j=1 j ≤ j=1 j j j=1 Note that u and vare fixed once we condition on the observations (Wi)i∈Ic . On the P P k event , by condition R, rem1k, rem2k and rem3k have conditional mean 0 and conditional varianceE given by −1 −1 ′ σ √V ar[rem1k (Wi)i∈Ic ]= σ √V ar[(∂βψ0(W ) u) (Wi)i∈Ic ] | k | k −1 ′ −1 σ µσ√u Gu = σ µσr1 δ, −1 ≤ −1 ′ ≤ σ √V ar[rem2k (Wi)i∈Ic ]= σ √V ar[(∂ρψ0(W ) v) (Wi)i∈Ic ] | k | k −1 ′ −1 σ µ√v Gv = σ µσr2 δ, −1 ≤ −1 ′ ′ ≤ σ √V ar[rem3k (Wi)i∈Ic ]= σ √V ar[u b(X)b(X) v (Wi)i∈Ic ] | k | k σ−1µ(√v′Gv + √u′Gu) ≤ σ−1µ(σr + r ) δ. ≤ 2 1 ≤ On the event , rem has conditional mean and conditional variance given by E 4k −1 ′ 2 −1 σ √mu [P ∂ ′ ψ0(W )]v σ √mσr3 δ, √V ar[rem4k (Wi)i∈Ic ]=0. | βρ |≤ ≤ | k In summary, E[rem2 1( )] 4[δ2 + δ2 + δ2 + δ2]=16δ2. k E ≤ ˆ −1 K ˆ ¯ −1 K ¯ Step 2. Here we bound the difference between θ = K k=1 θk and θ = K k=1 θk: √n 1 K P√n 1 K P √n/σ θˆ θ¯ m/σ θˆ θ¯ rem . | − |≤ √m K | k − k|≤ √m K k k=1 k=1 X p X By Markov inequality we have

K K 1 1 c P remk > 4δ/∆ P remk > 4δ/∆ + P ( ) K ! ≤ K ∩E! E Xk=1 kX=1 K 2 −2 2 2 K E remk 1( ) ∆ /(16δ )+ ǫ ≤  ! E  kX=1 −2 2 2 2 2 2 K K max E(remk1( ))∆ /(16δ )+ ǫ ∆ + ǫ. ≤ k E ≤ And we have that n/m = √K. So it follows that

p √n(θˆ θ¯)/σ err =4√Kδ/∆ | − |≤ with probability at least 1 Π for Π := ∆2 + ǫ. − Step 3. To show the second claim, let Z := √n(θ¯ θ0)/σ. By the Berry–Esseen bound, for some absolute constant A, − 3 −1/2 3 −1/2 sup P(Z z) Φ(z) A ψ0/σ P,3n = A(κ/σ) n . z∈R | ≤ − |≤ k k 40 Chernozhukov, Newey, and Singh The current best estimate of A is 0.4748, due to Shevtsova (2011). Hence, using Step 2, for any z R, we have ∈ P(√n(θˆ θ )/σ z) Φ(z)=P(√n(θˆ θ¯)/σ + Z z) Φ(z) − 0 ≤ − − ≤ − = P(Z z + √n(θ¯ θˆ)/σ) Φ(z) P(Z z + err) + Π Φ(z) ≤ − − ≤ ≤ − = P(Z z + err) Φ(z + err)+Φ(z + err) Φ(z) + Π ≤ − − A(κ/σ)3n−1/2 + err/√2π + Π, ≤ where 1/√2π is the upper bound on the derivative of Φ. Similarly, conclude that P(√nσ−1(θˆ θ ) z) Φ(z) A(κ/σ)3n−1/2 err/√2π Π. − 0 ≤ − ≥ − − The result follows by noting that 4/√2π =1.5957... < 2. ✷

F.2. Proof of Theorem 4.2

We shall verify the hypotheses of Van der Vaart (2000), Theorem 25.20. Step 1. Suppose that W had Radon–Nykodym derivative dP under P with respect to some measure µ. Consider the set for some ε> 0:

= δ measurable : R, δdP =0, δ 1/(2ε) . Sε { W → k k∞ ≤ } Z Consider a parametric submodel (i.e. path) of the form

= dP (w)= dP (w) [1 + τδ (w)]: δ . P τ ∈ Sε}τ∈(0,ε) n It is standard to verify that δ is the score of dPτ , namely δ(w)= ∂τ log dPτ (w), and that quadratic mean differentiability holds:

[(√dP √dP )/τ (δ/2)d√dP ]2dµ 0, τ − − → Z which implies that deviations from P are locally asymptotically normal. The collection of scores ε therefore form the tangent set of at P . ConsiderS the parameter of interest: P

θτ = m(w,γτ )dPτ , Z ⋆ ⋆ ¯ where γτ abbreviates the heavy notation γ0,Pτ , denoting the projection of Y on Γ under ⋆ ⋆ Pτ . We will also use γ0 to denote γ0,P . Step 2 below shows the differentiability of the parameter with respect to τ: θ θ τ − 0 ψ δdP, for each δ , τ → 0 ∈ Sε Z where ψ0 is a score function. This is done in Step 2 below. This score function belongs to the L2(P ) closure of the linear span of : Sε span( )= δ L2(P ): δdP =0 . Sε ∈ n Z o so it follows that ψ0 is the projection of itself on the ε and is therefore the only influence function. S DML with Riesz Representers 41

Step 2. Because δ is bounded by 1/(2ε), the dPτ and dP dominate each other so that Γ¯ does not depend on τ. Let Eτ denote expectation under Pτ and E under P . Then for some generic positive finite constant C

E γ⋆ (X)2 CE γ⋆ (X)2 CE Y 2 CE Y 2 = C. τ ≤ τ τ ≤ τ ≤ h i h i Note that by γ⋆,γ⋆ Γ¯ and the previous inequality, as τ 0   τ 0 ∈ → ⋆ ⋆ ⋆ ⋆ E[γτ (X) γ0 (X)]=Eτ [γτ (X) γ0 (X)] + o (1)

⋆ ⋆ ⋆ 2 =Eτ [Yγ0 (X)] + o (1) = E [Yγ0 (X)] + o (1) = E γ0 (X) + o (1) . Similarly we have h i

⋆ 2 ⋆ 2 ⋆ E γτ (X) =Eτ γτ (X) + o (1) = Eτ [Yγτ (X)] + o (1) h i h i = E [Yγ⋆ (X)] + o (1) = E [γ⋆ (X) γ⋆(X)] + o (1) E[γ⋆(X)2]. τ 0 τ → 0 Therefore it follows that E γ⋆ (X) γ⋆ (X) 2 =E γ⋆ (X)2 +E γ⋆ (X)2 2E [γ⋆ (X) γ⋆ (X)] 0. { τ − 0 } τ 0 − τ 0 → h ⋆ i ⋆h i h i ⋆ ⋆ Note that E[α0 (X) γτ (X) γ (X) δ (W )] CE[ α0 (X) γτ (X) γ0 (X) ] 0 so that | { − } |≤ | | | − | → E[m (W, γ⋆)] E[m (W, γ⋆)] = E [α (X) γ⋆ (X) γ⋆ (X) ] τ − 0 0 { τ − 0 } =E [α (X) γ⋆ (X) γ⋆ (X) ] τ 0 { τ − 0 } τE[α (X) γ⋆ (X) γ⋆ (X) δ (W )] − 0 { τ − 0 } =E [α (X) Y γ⋆ (X) ]+ o (τ) τ 0 { − 0 } =E [α (X) Y γ⋆ (X) ] E[α (X) Y γ⋆ (X) ]+ o (τ) τ 0 { − 0 } − 0 { − 0 } = τE[α (X) Y γ⋆ (X) δ (W )] + o (τ) . 0 { − 0 } ⋆ Therefore E[m (W, γτ )] is differentiable at τ = 0 with ∂E[m (W, γ⋆)] /∂τ = E [α (X) Y γ⋆ (X) δ (W )] . τ 0 { − 0 } In addition, by mean-square continuity of m (W, γ⋆), ⋆ ⋆ ⋆ Eτ [m (W, γτ )] E[m(W, γτ )] = τE[m (W, γτ ) δ(W )] − ⋆ ⋆ ⋆ = τE[m (W, γ0 ) δ(W )] + τE[ m(W, γτ ) m(W, γ0 ) δ(W )] ⋆ { − } = τE[m (W, γ0 ) δ(W )] + o (τ) . It follows that E [m (W, γ⋆)] E[m(W, γ⋆)] is differentiable with τ τ − τ ∂ E [m (W, γ⋆)] E[m(W, γ⋆)] { τ τ − τ } = E [m (W, γ⋆) δ(W )] = E[ m (W, γ⋆) θ δ(W )]. ∂τ 0 { 0 − 0}

It then follows by the derivative of the sum being the sum of the derivatives that θτ = ⋆ Eτ [m(W, γτ )] is differentiable at τ = 0 and ∂θ τ = E[ψ (W )δ(W )]. ∂τ 0 ✷ 42 Chernozhukov, Newey, and Singh F.3. Proof of Lemma 4.2

First, we note that tM = s := max x : Ax−a ν = (A/ν)1/a. k 0 k0 |M| ≤ { ≥ } Define tr := t tM = t 1( t ν). 0 − 0 0 | 0|≤ Note that ∞ 1 1 a tr νs + Ax−adx = νs As−a+1 = νs νs = νs. k k1 ≤ − 1 a − 1 a a 1 Zs − − − Then δ S(t ,ν) implies that, by the repeated use of the triangle inequality: ∈ 0 M r M r t + δ t t + δ + t + δ c t + t k 0 k1 ≤k 0k1 ⇐⇒ k 0 Mk1 k 0 M k1 ≤k 0 k1 k 0k1 r r M M r = δ c t t + δ c t t + δ + t ⇒ k M k1 −k 0k1 ≤k 0 M k1 ≤k 0 k1 −k 0 Mk1 k 0k1 r r r = δ c t δ + t = δ c δ +2 t . ⇒ k M k1 −k 0k1 ≤k Mk1 k 0k1 ⇒ k M k1 ≤k Mk1 k 0k1 r If 2 t 1 δM 1, we have that δMc 1 2 δM 1, so using the definition of the cone invertibilityk k ≤k factork we obtain k k ≤ k k (k/s) δ Gδ ν = δ′Gδ δ Gδ (s/k)ν2. k k1 ≤k k∞ ≤ ⇒ ≤k k1k k∞ ≤ If 2 tr δ , then δ 6 tr k k1 ≥k Mk1 k k1 ≤ k k1 a δ′Gδ δ Gδ 6 tr ν 6 sν2. ✷ ≤k k1k k∞ ≤ k k1 ≤ a 1 − F.4. Proof of Lemma 4.3

Consider the event such that R gˆ(t ) λ, gˆ(tˆ) λ, (6.28) k 0 k∞ ≤ k k∞ ≤ holds. This event holds with probability at least 1 ǫ. The event implies that tˆ 1 t by definition of tˆ, which further implies that− for δ = tˆ t R k k ≤ k 0k1 − 0 Gδ (G Gˆ)δ + Gδˆ k k∞ ≤k − k∞ k k∞ = (G Gˆ)δ + gˆ(tˆ) gˆ(t ) k − k∞ k − 0 k∞ G Gˆ δ + gˆ(tˆ) + gˆ(t ) ≤k − k∞k k1 k k∞ k 0 k∞ λ¯2B +2λ ν.¯ ≤ ≤ Hence δ S(t0,ν) with probability 1 ǫ. The first∈ inequality now in the bound− follows from the definition of s(t ): sup δ′Gδ 0 δ∈S(t0,ν) ≤ s(t )ν2. The second bound follows by δ 2B, δ′Gδ Gδ δ ν2B. ✷ 0 k k1 ≤ ≤k k∞k k1 ≤ F.5. Proof of Theorem 4.3 and Corollary 4.4

Application of Lemma 4.3 implies that with probability at least 1 4ǫ, estimation errors u˜ = D−1(βˆ β ) andv ˜ = D−1(ˆρ ρ ) obey − β A − 0 ρ A − 0 u˜′Gu˜ C[(B2ℓ˜2s(D−1β ; ν)/n) (B2ℓ/˜ √n)], ≤ β 0 ∧ DML with Riesz Representers 43 v˜′Gv˜ C[(B2ℓ˜2s(D−1ρ ; ν)/n) (B2ℓ/˜ n)], ≤ ρ 0 ∧ where C is an absolute constant. Then p

u′Gu µ2 u˜′Gu,˜ v′Gv µ2 σ2v˜′Gv.˜ | |≤ D | |≤ D The stated bounds then follow. Hence the guarantee R(δ) holds for ε =1 K4ǫ provided that for some large enough absolute C: −

Cσ−1(√mσr + µr (1 + σ)+ µσr ) δ, 3 1 2 ≤ for r1, r2, and r3 given in the corollary. ✷

G. PROOFS FOR SECTION 5 G.1. Proof of Theorem 5.1

Let φ(w,γ,α) = α(x)[y γ(x)], ψ(w,γ,α,θ) = θ m(w,γ) φ(w,γ,α), φ¯(γ, α) = − − − φ(w,γ,α)F0(dw), andm ¯ (γ)= m(w,γ)F0(dw). Note that

R φ¯(γ⋆, α⋆) = 0, φ¯(Rγ⋆, αˆ )=0, m¯ (ˆγ γ⋆)= φ¯(ˆγ , α⋆). (7.29) 0 0 0 k k − 0 − k 0 Then we have

ˆ 1 ⋆ 1 ⋆ ⋆ θk θ0 + ψ0 (Wi)= ψ(Wi,γ0 , α0,θ0) ψ(Wi, γˆk, αˆk,θ0) − nk nk { − } iX∈Ik iX∈Ik 1 ⋆ ⋆ ⋆ ˆ ˆ = m(Wi, γˆk)+ φ(Wi, γˆk, αˆk) m(Wi,γ0 ) φ(Wi,γ0 , α0) = R1 + R2, nk { − − } iX∈Ik where

ˆ 1 ⋆ ⋆ R1 = [m(Wi, γˆk γ0 ) m¯ (ˆγk γ0 )] (7.30) nk − − − iX∈Ik 1 ⋆ ⋆ ⋆ ¯ ⋆ + [φ(Wi, γˆk, α0) φ(Wi,γ0 , α0) φ(ˆγk, α0)] nk − − iX∈Ik 1 ⋆ ⋆ ⋆ ¯ ⋆ + [φ(Wi,γ0 , αˆk) φ(Wi,γ0 , α0) φ(γ0 , αˆk)], nk − − iX∈Ik ˆ 1 ⋆ ⋆ ⋆ ⋆ R2 = [φ(Wi, γˆk, αˆk) φ(Wi, γˆk, α0) φ(Wi,γ0 , αˆk)+ φ(Wi,γ0 , α0)] nk − − iX∈Ik 1 ⋆ ⋆ = [ˆαk(Xi) α0(Xi)][ˆγk(Xi) γ0 (Xi)]. (7.31) −nk − − iX∈Ik ˆ ⋆ ⋆ c Define ∆ik = m(Wi, γˆk γ0 ) m¯ (ˆγk γ0 ) for i Ik and let k denote the observations W for i / I . Note that−γ ˆ depends− only− on c by∈ construction.W Then by independence of i ∈ k k Wk c and W ,i I we have E[∆ˆ c]=0. Also by independence of the observations, Wk { i ∈ k} ik|Wk E[∆ˆ ∆ˆ c] = 0 for i, j I . Furthermore, for i I E[∆ˆ 2 c] [m(w, γˆ ik jk|Wk ∈ k ∈ k ik|Wk ≤ k − R 44 Chernozhukov, Newey, and Singh ⋆ 2 γ0 )] F0(dw). Then by equation (5.20) we have 2 1 ˆ c 1 ˆ c 1 ˆ 2 c E ∆ik k = 2 E ∆ik k = 2 E[∆ik k]  nk ! |W  nk " ! |W # nk |W iX∈Ik iX∈Ik iX∈Ik   1 [m(w, γˆ γ⋆)]2F (dw)= o (σ2/n )= o (σ2/n). ≤ n k − 0 0 p k p k Z The conditional Markov inequality then implies that ∆ˆ /n = o (σ/√n). The i∈Ik ik p analogous results also hold for ∆ˆ = φ(W, γˆ , α⋆) φ(W, γ⋆, α⋆) φ¯(ˆγ , α⋆) and ∆ˆ = ik k 0 P 0 0 k 0 ik φ(W, γ⋆, αˆ ) φ(W, γ⋆, α⋆) φ¯(γ⋆, αˆ ) by φ¯(γ⋆, α⋆)=0− . Summing− across the three terms 0 k − 0 0 − 0 k 0 0 in Rˆ1 gives Rˆ1 = op(σ/√n). ˆ ⋆ ⋆ Next let ∆k(x)= [ˆαk(x) α0(x)][ˆγk(x) γ0 (x)]. Then by the triangle and Cauchy- Schwartz inequalities,− − −

E[ R c] ∆ˆ (x) F (dx) αˆ α⋆ γˆ γ⋆ = σσ−1 αˆ α⋆ γˆ γ⋆ | 2| |Wk ≤ k ≤k k − 0kP,2 k k − 0 kP,2 k k − 0kP,2 k k − 0 kP,2 Z σσ −1( αˆ α + α α⋆ ) γˆ γ⋆ . ≤ k k − 0kP,2 k 0 − 0kP,2 k k − 0 kP,2 ⋆ ⋆ By hypothesis r2r1 = o (1/√n) , so that by the conditional Markov inequality and the ⋆ definition of r2 , ˆ ⋆ ⋆ R2 = Op(σr2 r1 )= op(σ/√n). The conclusion then follows by the triangle inequality. ✷

H. PROOFS FOR SECTION S3 H.1. Proof of Lemma 3.1

2 Use the same notation as in the proof of the previous lemma. In all examples, α0 L (F ) 2 ∈ and γ L (F ) imply that α0,γ < α0 P,2 γ P,2 < . ∈ |h i| k k k k 1 ∞ Proof of claim (i). In Example 1, since dF (x)= k=0 P [D = k Z = z]1(k = d)dF (z) by the Bayes rule, we have | P 1(d = 1) 1(d = 0) α ,γ = γ(d, z)ℓ(x) − dF (x)= θ(γ). h 0 i P [D = d Z = z] Z | 2 dF1 dF0 In Example 2, ℓα0 L (F ) means that the Radon–Nykodym derivatives dF and dF exist on the support∈ of ℓ, so that dF dF α ,γ = γℓ 1 0 dF = γℓ(dF dF )= θ(γ). h 0 i dF − dF 1 − 0 Z   Z We can demonstrate the claim for Example 3 similarly to Example 2. In Example 4, we can write div (ℓ(x)t(x)f(d z)) α ,γ = γ(x) d | f(d z)dddF (z) h 0 i − f(d z) | Z Z | = ∂ γ(x)′t(x)ℓ(x)f(d z)dddF (z)= θ(γ), d | Z Z where we used the integration by parts and that γ(x)ℓ(x)t(x)f(d z) vanishes on the boundary of . The rest of the claim is immediate from Lemma 1.| Dz DML with Riesz Representers 45 Proof of claim (ii). We can refer to the case of linear regression discussed in Section 2. In what follows consider the case of G> 0 and ℓ = 1. In Example 1, M = E(b(1,Z) b(0,Z)). Suppose P [D =0 Z] 0, 1 with probability in [π, 1 π] for π > 0, but such− that G> 0 (this puts restrictions| ∈{ on }b). This is known − as the case of failing overlap assumption in causal inference. Then α0(X) is na with probability π. In Example 2 and 3, M = b(dF dF ) is well defined, but α (X) = na whenever 1 − 0 0 dF1/dF and dF1/dF do not exist. For instance, F1 and F0 can have point masses, where F does not, while retaining theR same support as F . In Example 4, take basis functions b and a constant direction t(X) = 1, such that M = E∂ b(D,Z) is well defined. Consider the case where f(d Z) = 0 with positive d | probability so that α0(X)= na with this probability. ✷

H.2. Proof of Lemma 3.2

2 The projection operator onto Γ¯1 = L (F1) is the conditional expectation with condition- ing on X1. The contractive property follows from Jensen’s inequality. ✷

H.3. Proof of Lemma 3.3

The proof uses the fact that m(W, γ)= m(X,γ), and that ψ⋆(X) (W )= U α⋆(X)U . 0 − 1 − 0 2 ⋆ Since EU1U2α0(X) = 0 by the LIE, using the bounded moments assumption we have: σ2 =EU 2 +EU 2α⋆2 E[E(U 2 X)α⋆2(X)] c2L2. 1 2 0 ≥ 2 | 0 ≥ The bound from above follows similarly: σ2 =EU 2 +EU 2α⋆2 c¯2 + E[E(U 2 X)α⋆2(X)] c¯2 +¯c2L2. 1 2 0 ≤ 2 | 0 ≤ Using the triangle inequality and bounded moments assumptions, we have: κ U + U α⋆ c¯ + (E(E[ U 3 X] α⋆(X) 3))1/3, ≤k 1kP,3 k 2 0kP,3 ≤ | 2| | | 0 | c¯ +¯c α⋆ c¯(1 + c(L2 1)), ≤ k 0kP,3 ≤ ∨ where the last line follows by assumption. ✷

H.4. Proof of Lemma 3.4

We shall use that m(W, γ)= m(X,γ), and ψ⋆(W )= U α⋆(X)U . 0 − 1 − 0 2 ⋆ Then by EU1U2α0(X) = 0, holding by the LIE, we have σ2 =EU 2 +EU 2α⋆2 =EU 2 + E(E[U 2 X]α⋆2(X)). 1 2 0 1 2 | 0 Then using the moment assumptions, we have c2 α⋆ 2 σ2 c¯2( ℓ 2 + α⋆ 2 ). k 0kP,2 ≤ ≤ k kP,2 k 0kP,2 Using the triangle inequality, the LIE, and the bounded heteroscedasticity assumption, 46 Chernozhukov, Newey, and Singh conclude κ U + U α⋆ c¯( ℓ + α⋆ ). ≤k 1kP,3 k 2 0kP,3 ≤ k kP,3 k 0kP,3 ⋆ For the case (a), α0(X)= α0(X; 1)ℓ(X), using the assumed bound α α0(X; 1) α¯ conclude that ≤ ≤ α ℓ L = α⋆ α¯ ℓ , α⋆ α¯ ℓ . k kP,2 ≤ k 0kP,2 ≤ k kP,2 k 0kP,3 ≤ k kP,3 For the case (b), α⋆(X ) = E[α (X; 1) X ]ℓ(X ), so that by Jensen’s inequality 0 1 0 | 1 1 α⋆ α (X; 1)ℓ(X ) α¯ ℓ k 0kP,q ≤k 0 1 kP,q ≤ k kP,q and using α E[α (X; 1) X ], ≤ 0 | 1 holding because conditional expectation preserves order, conclude that α⋆ 2 = E(E[α (X; 1) X ]2ℓ(X )2) α2 ℓ 2 . k 0kP,2 0 | 1 1 ≥ k kP,2 p1 −p1 Further, by change of variables in R : u = (d0 d)/h, so that du = h dd, we have that −

q q −p1q q d −p1(q−1) q d ℓ P,qω = h K ((d0 d)/h) fD(d) d = h K (u) fD(d0 uh) u k k Rp1 | − | Rp1 | | − Z Z so that 1/q 1/q h−p1(q−1)/qf 1/q K q ℓ ω h−p1(q−1)/qf¯1/q K q . | | ≤k kP,q ≤ | | Z  Z  Further, we have that

ω = h−p1 K((d d)/h)f (d)dd = K(u)f (d uh)du. 0 − D D 0 − Z Z Using the Taylor expansion in h around h = 0 and the Holder inequality:

ω f (d ) = K(u)h∂f (d uh˜)′udu hf¯′ u K(u) du, | − D 0 | D 0 − ≤ k k∞| | Z Z for some 0 h˜ h. Hence for all h

Similarly to the proof of Lemma 3.4, using the LIE and bounded heteroscedasticity, we obtain α⋆ 2 c2 σ2 ℓ 2 c¯2 + α⋆ 2 c¯2, k 0kP,2 ≤ ≤k kP,2 k 0kP,2 and by the triangle inequality κ ℓ c¯+ α⋆ c.¯ ≤k kP,3 k 0kP,3 It remains to bound α⋆ . To help this, introduce notation k 0kP,q v(X) := f(D Z). | Case (a). We have that ⋆ α0 = α0 = divd(ℓ)t + divd(t)ℓ + divd(v)ℓt/v. By the triangle inequality, α⋆ div (ℓ)t + div (t)ℓ + div (v)ℓt/v , k 0kP,q ≤k d kP,q k d kP,q k d kP,q α⋆ div (ℓ)t div (t)ℓ div (v)ℓt/v . k 0kP,2 ≥k d kP,2 −k d kP,2 −k d kP,2 Using the bounds assumed in the Lemma, we have div (ℓ)t div (ℓ) t¯; div (t)ℓ t¯′ ℓ ; div (v)ℓt/v ℓ (f¯′t/f¯ ). k d kP,q ≤k d kP,q k d kP,q ≤ k kP,q k d kP,q ≤k kP,q ′ By the proof of Lemma 3.4, for all h

div (ℓ) q ω−qh−qh−p1(q−1)f¯ divK q (f/2)−qh−qh−p1(q−1)f¯ divK q k d kP,q ≤ | | ≤ | | Z Z Case (b). Here we have, using the notation as above α⋆(X ) = E[α X ] = div (ℓ(X ))E[t(X ) X ] 0 1 0 | 1 d 1 1 | 1 + E[div (t(X) X ]ℓ(X ) + E[div (v(X))t(X)/v(X) X ]ℓ(X ). d | 1 1 d | 1 1 ⋆ Then by contractive property of the conditional expectation α0 P,q α0 P,q, so the upper bounds apply from case (a). k k ≤ k k 48 Chernozhukov, Newey, and Singh We only need to establish lower bound on α⋆ . By the triangle inequality, k 0kP,2 α⋆ div (ℓ)E[t X ] E[div (t) X ]ℓ E[div (t) X ]ℓ . k kP,2 ≥k d | 1 kP,2 −k d | 1 kP,2 −k d | 1 kP,2 By Jensen’s inequality, and using the same calculations as in case (a): div (ℓ(X ))E[t(X ) X ] div (ℓ(X ))t(X ) t¯ div (ℓ) ; k d 1 1 | 1 kP,2 ≤k d 1 1 kP,2 ≤ k d kP,q E[div (t) X ]ℓ div (t)ℓ t¯′ ℓ ; k d | 1 kP,2 ≤k d kP,q ≤ k kP,q E[div (v)t/v X ]ℓ div (v)ℓt/v ℓ (f¯′t/f¯ ). k d | 1 kP,2 ≤k d kP,q ≤k kP,q And, similarly to the calculation above div (ℓ)E[t X ] 2 = E[div (ℓ)2E((E[t X ])2 D)] k d | 1 kP,2 d | 1 | = ω−2h−2h−p12 (divK((d d)/h)2E((E[t X ])2 D = d)f(d)dd 0 − | 1 | Z = ω−2h−2h−p1 (divK(u)2E((E[t X ])2 D = d hu)f(d hu)du | 1 | 0 − 0 − Z ω−2h−2h−p1 t2f (divK)2 ≥ Z (2f¯)−2h−2h−p1 t2f (divK)2, ≥ Z 2 2 using the assumed bound E((E[t X1]) D = d) t for d Nh(d0). In either case (a) or (b), we now| summarize| ≥ the bounds∈ asymptotically by letting h 0: ց L . σ . h−p1/2(1 + h−1), h−p1/2(h−1 1) . L . h−p1/2(h−1 + 1), − κ . h−2p1/3(h−1 + 1), κ/σ . h−p1/6. ✷

H.6. Proof of Lemma 3.6

Introduce m(d) := E[m(W, γ⋆) D = d] and note 0 | ϑ (h)= m(d)h−p1 K((d d)/h)f (d)dd = m(d hu)K(u)f (d hu)du, 1 0 − D 0 − D 0 − Z Z ϑ (h)= h−p1 K((d d)/h)f (d)dd = K(u)f (d uh)du. 2 0 − D D 0 − Z Z Note that by K = 1,

R ϑ1(0) = m(d0)fD(d0), ϑ2(0) = fD(d0). Hence ⋆ ϑ1(h) ⋆ ϑ1(0) θ(γ0 ; ℓh)= , θ(γ0 ; ℓ0) := = m(d0). ϑ2(h) ϑ2(0) By the standard argument to control the bias of the higher-order kernel smoothers, e.g. by Lemma B2 in Newey (1994b), which employs the Taylor expansion of order v in h around h = 0, for some constants Av that depend only on v:

v v ϑ (h) ϑ (0) Avh g¯v u K(u) du, | 1 − 1 |≤ k k | | Z DML with Riesz Representers 49 v v ϑ (h) ϑ (0) Avh f¯v u K(u) du, | 2 − 2 |≤ k k | | Z where v = o sm. Then using the relation ∧ −1 −1 −1 ϑ1(h) ϑ1(0) ϑ2 (0)(ϑ1(h) ϑ1(0)) + ϑ1(0)(ϑ2 (h) ϑ2 (0)) = − −1 −1 − , ϑ (h) − ϑ (0) +(ϑ1(h) ϑ1(0))(ϑ (h) ϑ (0)) 2 2  − 2 − 2  we deduce the following bound that applies for all h