Orthogonal Statistical Learning

Dylan J. Foster (MIT) and Vasilis Syrgkanis (MSR New England)

Abstract We provide non-asymptotic excess risk guarantees for statistical learning in a setting where the population risk with respect to which we evaluate the target parameter depends on an unknown nuisance parameter that must be estimated from data. We analyze a two-stage sample splitting meta-algorithm that takes as input two arbitrary estimation algorithms: one for the target parameter and one for the nuisance parameter. We show that if the population risk satisfies a condition called Neyman orthogonality, the impact of the nuisance estimation error on the excess risk bound achieved by the meta-algorithm is of second order. Our theorem is agnostic to the particular algorithms used for the target and nuisance and only makes an assumption on their individual performance. This enables the use of a plethora of existing results from statistical learning and machine learning to give new guarantees for learning with a nuisance component. Moreover, by focusing on excess risk rather than parameter estimation, we can give guarantees under weaker assumptions than in previous works and accommodate settings in which the target parameter belongs to a complex nonparametric class. We provide conditions on the metric entropy of the nuisance and target classes such that oracle rates—rates of the same order as if we knew the nuisance parameter—are achieved. We also derive new rates for specific estimation algorithms such as variance-penalized empirical risk minimization, neural network estimation and sparse high-dimensional linear model estimation. We highlight the applicability of our results in four settings of central importance: 1) heterogeneous treatment effect estimation, 2) offline policy optimization, 3) domain adaptation, and 4) learning with missing data.

arXiv:1901.09036v3 [math.ST] 24 Sep 2020

Contents

1 Introduction
    1.1 Related work
    1.2 Organization
2 Framework: Statistical Learning with a Nuisance Component
3 Orthogonal Statistical Learning
    3.1 Fast Rates Under Strong Convexity
    3.2 Beyond Strong Convexity: Slow Rates
    3.3 Example: Treatment Effect Estimation
    3.4 Example: Policy Learning
    3.5 Construction of Orthogonal Losses
4 Empirical Risk Minimization with a Nuisance Component
    4.1 Fast Rates via Local Rademacher Complexities
    4.2 Slow Rates and Variance Penalization
5 Minimax Oracle Rates for Square Losses
    5.1 Minimax Oracle Rates
6 Minimax Oracle Rates for Generic Lipschitz Losses
7 Discussion

Part I: Additional Results
A Sufficient Conditions for Single Index Losses
    A.1 Fast Rates
    A.2 Slow Rates
    A.3 Proofs
B Additional Applications
    B.1 Policy Learning
    B.2 Domain Adaptation and Sample Bias Correction
    B.3 Missing Data
C Orthogonal Loss Construction: Examples
D Plug-in Empirical Risk Minimization: Examples
    D.1 Proofs

Part II: Proofs for Main Results
E Preliminaries
F Proofs from Section 3
G Technical Lemmas for Constrained M-Estimators
    G.1 Proofs of Lemmas for Constrained M-Estimators
H Proofs from Section 4
    H.1 Proof of Theorem 3
    H.2 Slow Rate for Plug-In ERM
    H.3 Proof of Theorem 4
    H.4 Proof of Theorem 5
I Proofs from Section 5 and 6
    I.1 Notation
    I.2 Preliminaries
    I.3 Overview of Proofs
    I.4 Skeleton Aggregation
    I.5 Rates for Specific Algorithms
    I.6 Proofs for Oracle Rates

1 Introduction

Predictive models based on modern machine learning methods are becoming increasingly widespread in policy making, with applications in healthcare, education, law enforcement, and business decision making. Most problems that arise in policy making, such as attempting to predict counterfactual outcomes for different interventions or optimizing policies over such interventions, are not pure prediction problems, but rather are causal in nature. It is important to address the causal aspect of these problems and build models that have a causal interpretation. A common paradigm in the search for causality is that to estimate a model with a causal interpretation from observational data (that is, data not collected via a randomized trial or via a known treatment policy), one typically needs to estimate many other quantities that are not of primary interest, but that can be used to de-bias a purely predictive machine learning model by formulating an appropriate loss. One example of such a nuisance parameter is the propensity for taking an action under the current policy, which can be used to form unbiased estimates for the reward for new policies, but is typically unknown in datasets that do not come from controlled experiments.

To make matters more concrete, let us walk through an example for which certain variants have been well-studied in machine learning (Dudík et al., 2011; Swaminathan and Joachims, 2015a; Nie and Wager, 2017; Kallus and Zhou, 2018). Suppose a decision maker wants to estimate the causal effect of some treatment $T \in \{0, 1\}$ on an outcome $Y$ as a function of a set of observable features $X$; the causal effect will be denoted as $\theta(X)$. Typically, the decision maker has access to data consisting of tuples $(X_i, T_i, Y_i)$, where $X_i$ is the observed feature for sample $i$, $T_i$ is the treatment taken, and $Y_i$ is the observed outcome. Due to the partially observed nature of the problem, one needs to create unbiased estimates of the unobserved outcome.
A standard approach is to make an unconfoundedness assumption (Rosenbaum and Rubin, 1983) and use the so-called doubly-robust formula, which is a combination of direct regression and inverse propensity scoring. Let $Y_i(t)$ denote the potential outcome for treatment $t$ in sample $i$, and let $m_0(x_i, t) := E[Y_i(t) \mid x_i]$ and $p_0(x_i, t) := E[1\{T_i = t\} \mid x_i]$. If $(Y_i(0), Y_i(1)) \perp T_i \mid X_i$, then the following is an unbiased estimator for each potential outcome:

$$\hat{Y}_i(t) = m_0(x_i, t) + \frac{(Y_i - m_0(x_i, t))\, 1\{T_i = t\}}{p_0(x_i, t)}. \tag{1}$$

Given such an estimator, we can estimate the treatment effect by running a regression between the unbiased estimates and the features, i.e. solve $\min_{\theta \in \Theta} \sum_i (\hat{Y}_i(1) - \hat{Y}_i(0) - \theta(X_i))^2$ over a target parameter class $\Theta$. In the population limit, with infinite samples, this corresponds to finding a parameter $\theta(x)$ that minimizes the population risk $E[(\hat{Y}(1) - \hat{Y}(0) - \theta(X))^2]$. Similarly, if the decision maker is interested in policy optimization rather than estimating treatment effects, they can use these unbiased estimates to solve $\min_{\theta \in \Theta} \sum_i (\hat{Y}_i(0) - \hat{Y}_i(1)) \cdot \theta(X_i)$ over a policy space $\Theta$ of functions mapping features to $\{0, 1\}$.

However, when dealing with observational data, the functions $m_0$ and $p_0$ are not known, and must be estimated if we wish to evaluate the proxy labels $\hat{Y}(t)$. Since these functions are only used as a means to learn the target parameter $\theta$, we may regard them as nuisance parameters. The goal of the learner is to estimate a target parameter that achieves low population risk when evaluated at the true nuisance parameters as opposed to the estimated nuisance parameters, since only then does the model have a causal interpretation.

This phenomenon is ubiquitous in causal inference and motivates us to formulate the abstract problem of statistical learning with a nuisance component: Given $n$ i.i.d. examples from a distribution $\mathcal{D}$, a learner is interested in finding a target parameter $\hat{\theta} \in \Theta$ so as to minimize a population risk function $L_{\mathcal{D}} : \Theta \times \mathcal{G} \to \mathbb{R}$. The population risk depends not just on the target parameter, but also on a nuisance parameter whose true value $g_0 \in \mathcal{G}$ is unknown to the learner. The goal of the learner is to produce an estimate that has small excess risk evaluated at the unknown true nuisance parameter:

$$L_{\mathcal{D}}(\hat{\theta}, g_0) - \inf_{\theta \in \Theta} L_{\mathcal{D}}(\theta, g_0) \to 0 \quad \text{as } n \to \infty. \tag{2}$$

Depending on the application, such an excess risk bound can take different interpretations. For many settings, such as treatment effect estimation, it is closely related to mean squared error, while in policy optimization it typically corresponds to regret. Following the tradition of statistical learning theory (Vapnik, 1995; Bousquet et al., 2004), we make excess risk the primary focus of our work, independent of the interpretation. We develop algorithms and analysis tools that generically address (2), then apply these tools to a number of applications of interest.

The problem of statistical learning with a nuisance component is strongly connected to the well-studied semiparametric inference problem (Levit, 1976; Ibragimov and Has’Minskii, 1981; Pfanzagl, 1982; Bickel, 1982; Klaassen, 1987; Robinson, 1988; Bickel et al., 1993; Newey, 1994; Robins and Rotnitzky, 1995; Ai and Chen, 2003; van der Laan and Dudoit, 2003; van der Laan and Robins, 2003; Ai and Chen, 2007; Tsiatis, 2007; Kosorok, 2008; van der Laan and Rose, 2011; Ai and Chen, 2012; Chernozhukov et al., 2016; Belloni et al., 2017; Chernozhukov et al., 2018a), which focuses on providing so-called “$\sqrt{n}$-consistent and asymptotically normal” estimates for a low-dimensional target parameter $\theta_0$ (which may be expressed as a population risk minimizer or a solution to estimating equations) in the presence of a typically nonparametric nuisance parameter. Unlike the semiparametric inference problem, statistical learning with a nuisance component does not require a well-specified model, nor a unique minimizer of the population risk.
Moreover, we do not ask for parameter recovery or asymptotic inference (e.g., asymptotically valid confidence intervals). Rather, we are content with an excess risk bound, regardless of whether there is an underlying true parameter to be identified. As a consequence, we provide guarantees even in the presence of misspecification, and when the target parameter belongs to a large, potentially nonparametric class. For example, one line of previous work gives semiparametric inference guarantees when the nuisance parameter is a neural network (Chen and White, 1999; Farrell et al., 2018); by focusing on excess risk we can give guarantees for the case where the target parameter is a neural network.

The case where the target parameter belongs to an arbitrary class has not been addressed at the level of generality we consider in the present work, but we mention some prior work that goes beyond the low-dimensional/parametric setup for special cases. Athey and Wager (2017) and Zhou et al. (2018) give guarantees based on metric entropy of the target class for the specific problem of treatment policy learning. For estimation of treatment effects, various nonparametric classes have been used for the target class on a rather case-by-case basis, including kernels (Nie and Wager, 2017), random forests (Athey et al., 2019; Oprescu et al., 2019; Friedberg et al., 2018), and high-dimensional linear models (Chernozhukov et al., 2017, 2018b). Other results allow for fairly general choices for the target parameter class in specific statistical models (Rubin and van der Laan, 2005, 2007; Díaz and van der Laan, 2013; van der Laan and Luedtke, 2014; Kennedy et al., 2017, 2019; Künzel et al., 2019). Our work unifies these directions into a single framework, and our general tools lead to improved or refined results when specialized to many of these individual settings.
Our approach is to reduce the problem of statistical learning with a nuisance component to the standard formulation of statistical learning. We build on a recent thread of research on semiparametric inference known as “double” or “debiased” machine learning (Chernozhukov et al., 2016, 2017, 2018a,c,b), which leverages sample splitting to provide inference guarantees under weak assumptions on the estimator for the nuisance parameter. Rather than directly analyzing particular algorithms and models for the target parameter (e.g., regularized regression, gradient boosting, or neural network estimation), we assume a black-box guarantee for the excess risk in the case where a nuisance value $g \in \mathcal{G}$ is fixed. Our main theorem asks only for the existence of an algorithm Alg$(\Theta, S; g)$ that, for any given nuisance parameter $g$ and data set $S$, achieves low excess risk with respect to the population risk $L_{\mathcal{D}}(\theta, g)$, i.e. with probability at least $1 - \delta$,

$$L_{\mathcal{D}}(\hat{\theta}, g) - \inf_{\theta \in \Theta} L_{\mathcal{D}}(\theta, g) \le \mathrm{Rate}_{\mathcal{D}}(\Theta, S, \delta; g). \tag{3}$$

Likewise, we assume the existence of a black-box algorithm Alg$(\mathcal{G}, S)$ to estimate the nuisance component $g_0$ from the data, with the required estimation guarantee varying from problem to problem. Given access to the two black-box algorithms, we analyze a simple sample splitting meta-algorithm for statistical learning with a nuisance component, presented as Meta-Algorithm 1.

Meta-Algorithm 1 (Two-Stage Estimation with Sample Splitting).
Input: Sample set $S = z_1, \ldots, z_n$.
• Split $S$ into subsets $S_1 = z_1, \ldots, z_{\lfloor n/2 \rfloor}$ and $S_2 = S \setminus S_1$.
• Let $\hat{g}$ be the output of Alg$(\mathcal{G}, S_1)$.
• Return $\hat{\theta}$, the output of Alg$(\Theta, S_2; \hat{g})$.

We can now state the main question addressed in this paper:

When is the excess risk achieved by sample splitting robust to nuisance component estimation error?

In more technical terms, we seek to understand when the two-stage sample splitting meta-algorithm achieves an excess risk bound with respect to $g_0$, in spite of error in the estimator $\hat{g}$ output by the first-stage algorithm. Robustness to nuisance estimation error allows the learner to use more complex models for nuisance estimation and, under certain conditions on the complexity of the target and nuisance parameter classes, to learn target parameters whose error is, up to lower order terms, as good as if the learner had known the true nuisance parameter in advance. Such a guarantee is referred to as achieving an oracle rate in semiparametric inference.
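To make the two stages concrete, the following is a minimal sketch of Meta-Algorithm 1 instantiated for the treatment effect example, using the doubly-robust proxy labels from (1). Everything here is an illustrative assumption rather than part of the paper: the toy data-generating process in `simulate`, the choice of a single binary feature, and the helper names `fit_nuisance`, `fit_target`, and `meta_algorithm`. Both stages use simple cell means, which stand in for the black-box algorithms Alg$(\mathcal{G}, S_1)$ and Alg$(\Theta, S_2; \hat{g})$.

```python
import random
from statistics import mean


def simulate(n, seed=0):
    """Toy observational data (an assumption of this sketch).

    Binary feature x, binary treatment t, outcome y. The true effect is
    theta(x) = 1 + x, and the propensity depends on x (confounding).
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.randint(0, 1)
        propensity = 0.3 + 0.4 * x            # P(T = 1 | x)
        t = 1 if rng.random() < propensity else 0
        y = 2.0 * x + (1.0 + x) * t + rng.gauss(0.0, 1.0)
        data.append((x, t, y))
    return data


def fit_nuisance(sample):
    """Stage 1: estimate m(x, t) and p(x, t) by cell means.

    Assumes every (x, t) cell is non-empty in the sample.
    """
    m_hat, p_hat = {}, {}
    for x in (0, 1):
        rows = [(t, y) for (xi, t, y) in sample if xi == x]
        for t in (0, 1):
            ys = [y for (ti, y) in rows if ti == t]
            m_hat[(x, t)] = mean(ys)
            p_hat[(x, t)] = len(ys) / len(rows)
    return m_hat, p_hat


def proxy_label(x, t_obs, y, t, m_hat, p_hat):
    """Doubly-robust proxy label for potential outcome t, as in (1)."""
    indicator = 1.0 if t_obs == t else 0.0
    return m_hat[(x, t)] + (y - m_hat[(x, t)]) * indicator / p_hat[(x, t)]


def fit_target(sample, m_hat, p_hat):
    """Stage 2: least squares of Y_hat(1) - Y_hat(0) on x.

    With a binary feature and an unrestricted class, the least squares
    solution is the per-cell mean of the proxy-label differences.
    """
    theta_hat = {}
    for x in (0, 1):
        diffs = [
            proxy_label(xi, t, y, 1, m_hat, p_hat)
            - proxy_label(xi, t, y, 0, m_hat, p_hat)
            for (xi, t, y) in sample
            if xi == x
        ]
        theta_hat[x] = mean(diffs)
    return theta_hat


def meta_algorithm(sample):
    """Split the sample, fit the nuisance on S1, fit the target on S2."""
    half = len(sample) // 2
    s1, s2 = sample[:half], sample[half:]
    m_hat, p_hat = fit_nuisance(s1)
    return fit_target(s2, m_hat, p_hat)


theta_hat = meta_algorithm(simulate(20000))
```

In this toy setup `theta_hat[0]` and `theta_hat[1]` should land near the true effects 1 and 2. The point of the sketch is only the data flow: the nuisance estimates never see $S_2$, and the target stage treats them as fixed.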

Overview of results. We use Neyman orthogonality (Neyman, 1959, 1979), a key tool in inference in semiparametric models (Newey, 1994; van der Vaart, 2000; Robins et al., 2008; Zheng and van der Laan, 2010; Belloni et al., 2017; Chernozhukov et al., 2018a), to provide oracle rates for statistical learning with a nuisance component. We show that if the population risk satisfies a functional analogue of Neyman orthogonality, the estimation error of $\hat{g}$ has a second-order impact on the overall excess risk (relative to $g_0$) achieved by $\hat{\theta}$. To gain some intuition, Neyman orthogonality is a weaker condition than double robustness, albeit similar in flavor (see, e.g., Chernozhukov et al. (2016)), and is satisfied by both the treatment effect loss and the policy learning loss described above. In more detail, our variant of the Neyman orthogonality condition asserts that a functional cross-derivative of the loss vanishes when evaluated at the optimal target and nuisance parameters. Prior work provides a number of means through which to construct Neyman orthogonal losses whenever certain moment conditions are satisfied by the data generating process (Chernozhukov et al., 2018a, 2016, 2018b). Indeed, orthogonal losses can be constructed in settings including treatment effect estimation, policy learning, missing and censored data problems, estimation of structural econometric models, and game-theoretic models.
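For orientation, the cross-derivative condition described above can be sketched in directional-derivative notation. The quantifiers shown here (all $\theta \in \Theta$ and all $g \in \mathcal{G}$) and the use of $\theta^\star$ for the optimal target parameter are assumptions of this sketch; the formal statement is deferred to Section 3.

```latex
% Sketch of Neyman orthogonality for the population risk (assumed form):
% the cross-derivative in the target and nuisance directions vanishes
% at the optimal pair (\theta^\star, g_0).
\[
  D_g D_\theta L_{\mathcal{D}}(\theta^\star, g_0)[\theta - \theta^\star,\; g - g_0] = 0
  \qquad \text{for all } \theta \in \Theta,\ g \in \mathcal{G}.
\]
```

Informally, a first-order perturbation of the nuisance around $g_0$ does not change the first-order optimality condition for the target, which is why nuisance error enters only at second order.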

We identify two regimes of excess risk behavior:

1. Fast rates. When the population risk is strongly convex with respect to the prediction of the target parameter (e.g., the treatment effect estimation loss), then typically so-called fast rates (e.g., rates of order $O(1/n)$ for parametric classes) are optimal if the true nuisance parameter is known. Letting $R_{\mathcal{G}}$ denote the estimation error of the nuisance component, in this setting we show that orthogonality implies that the first stage error has an impact on the excess risk of order $R_{\mathcal{G}}^4$ (in particular, $n^{-1/4}$ RMSE rates for the nuisance suffice when the target is parametric).

2. Slow rates. Absent any strong convexity of the population risk (e.g., for the treatment policy optimization loss), typically slow rates (e.g., rates of order $O(1/\sqrt{n})$ for parametric classes) are optimal if the true nuisance parameter is known. For this setting, we show that the impact of nuisance estimation error is of order $R_{\mathcal{G}}^2$ so, once again, $n^{-1/4}$ RMSE rates for the nuisance suffice when the target is parametric.

To make the conditions above concrete for arbitrary classes, we give conditions on the relative complexity of the target and nuisance classes, quantified via metric entropy, under which the sample splitting meta-algorithm achieves oracle rates, assuming the two black-box estimation algorithms are instantiated appropriately. This allows us to extend several prior works beyond the parametric regime to complex nonparametric target classes. Our technical results extend the works of Yang and Barron (1999) and Rakhlin et al. (2017), which provide minimax optimal rates without nuisance components and utilize the technique of aggregation in designing optimal algorithms.
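As a quick check of the exponents in the two regimes listed above, the arithmetic below substitutes a nuisance rate of $n^{-1/4}$ into the stated orders of impact. It is only a worked calculation for the parametric case, not a statement of the theorems.

```latex
% Worked arithmetic: nuisance RMSE of order n^{-1/4} in both regimes.
\[
  R_{\mathcal{G}} = O\!\left(n^{-1/4}\right)
  \;\Longrightarrow\;
  \underbrace{R_{\mathcal{G}}^{4} = O\!\left(n^{-1}\right)}_{\text{fast-rate regime}}
  \quad\text{and}\quad
  \underbrace{R_{\mathcal{G}}^{2} = O\!\left(n^{-1/2}\right)}_{\text{slow-rate regime}} .
\]
```

In each regime the nuisance contribution is then of the same order as the oracle parametric rate, $O(1/n)$ and $O(1/\sqrt{n})$ respectively.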
The flexibility of our approach allows us to instantiate our framework with any machine learning model and algorithm of interest for both nuisance and target parameter estimation, and to utilize the vast literature on generalization bounds in machine learning to establish refined (e.g., data-dependent or dimension-independent) rates for several classes of interest. For instance, our approach allows us to leverage recent work on size-independent generalization error of neural networks.

Moving beyond black-box results, we use our main theorems as a starting point to provide sharp analyses for certain general-purpose statistical learning algorithms for target estimation in the presence of nuisance parameters. First, we provide a new analysis for empirical risk minimization with plug-in estimation of nuisance parameters, wherein we extend the classical local Rademacher complexity analysis of empirical risk minimization (Koltchinskii and Panchenko, 2000; Bartlett et al., 2005) to account for the impact of the nuisance error (leveraging orthogonality). Second, in the slow rate regime we give a new analysis of variance-penalized empirical risk minimization with plug-in nuisance estimation, which allows us to recover and extend several prior results in the literature on policy learning. Our result improves upon the variance-penalized risk minimization approach of Maurer and Pontil (2009) by replacing the dependence on the metric entropy at a fixed approximation level with the critical radius, which is related to the entropy integral.

As a consequence of focusing on excess risk, we obtain oracle rates under weaker assumptions on the data generating process than in previous works. Notably, we obtain guarantees even when the target parameter is misspecified and the target parameters are not identifiable. For instance, for sparse high-dimensional linear classes, we obtain optimal prediction rates with no restricted eigenvalue assumptions.
We highlight the applicability of our results to four settings of primary importance in the literature: 1) estimation of heterogeneous treatment effects from observational data, 2) offline policy optimization, 3) domain adaptation, 4) learning with missing data. For each of these applications, our general theorems allow for the use of arbitrary estimators for the nuisance and target parameter classes and provide robustness to the nuisance estimation error.

1.1 Related work

General frameworks for learning/inference with nuisance parameters. The work of van der Laan and Dudoit (2003) and subsequent refinements and extensions (van der Laan et al., 2006, 2007) develops cross-validation methodology for a similar risk minimization setting in which the target risk parameter depends on an unknown nuisance parameter. van der Laan and Dudoit (2003) analyze a cross-validation meta-algorithm in which the learner simultaneously forms a nuisance parameter estimator and a set of candidate target parameter estimators using a set of training samples, then selects a final estimate for the target parameter by minimizing an empirical loss over a validation set. The train and validation splits may be chosen in a general fashion that encompasses K-fold and Monte Carlo validation. They provide finite-sample oracle rates for the excess risk in the case where the target parameter belongs to a finite class (in particular, rates of the type $\log|\Theta|/n$ for a class of square losses and $\sqrt{\log|\Theta|/n}$ for general losses), and also extend these guarantees to linear combinations of basis functions via pointwise ε-nets (in our language, such classes are parametric). Overall, our approach offers several new benefits:

• By completely splitting nuisance estimation and target estimation into separate stages and taking advantage of orthogonality, we can provide meta-theorems on robustness that are invariant to the choice of learning algorithm both for the first and second stage, which obviates the need to assume the target class is finite or admits a linear representation (Section 3).

• When we do specialize to algorithms such as ERM and variants, we can provide finite-sample guarantees for rich classes of target parameters in terms of sharp learning-theoretic complexity measures such as local Rademacher complexity and empirical metric entropy (Section 4). In particular, we can provide conditions under which oracle rates are attained under very general complexity assumptions on the target and nuisance parameters (Section 5, Section 6).

The methodology of van der Laan and Dudoit (2003) can be used to directly estimate a target parameter or to select the best of many candidate nuisance estimators in a data-driven fashion. van der Laan et al. (2007) refers to the use of this cross-validation methodology to perform data-adaptive estimation of nuisance parameters as the “super learner”, and subsequent work has advocated for its use for nuisance estimation within a framework for semiparametric inference known as targeted maximum likelihood estimation (TMLE). TMLE (Scharfstein et al., 1999; van der Laan and Rubin, 2006; Zheng and van der Laan, 2010; van der Laan and Rose, 2011) and its more general variant, targeted minimum loss-based estimation, are general frameworks for semiparametric inference which, like our framework, employ empirical risk minimization in the presence of nuisance parameters. TMLE estimates the target parameter by repeatedly minimizing an empirical risk (typically the negative log-likelihood) in order to refine an initial estimate. This approach easily incorporates constraints, and can be used in tandem with the super learning technique. The analysis leverages orthogonality, and is also agnostic to how the nuisance estimates are obtained. However, the main focus of this framework is on the classical semiparametric inference objective; minimizing a population risk is not the end goal as it is here.

Specific instances of risk minimization with nuisance parameters. A number of prior works employ empirical risk minimization with nuisance parameters for specific statistical models (Rubin and van der Laan, 2005, 2007; Díaz and van der Laan, 2013; van der Laan and Luedtke, 2014; Kennedy et al., 2017, 2019; Künzel et al., 2019). These results allow for general choices for the target class and nuisance class (typically subject to Donsker conditions, or with guarantees in the vein of van der Laan and Dudoit (2003)), and the main focus is semiparametric inference rather than excess risk guarantees.

Nonparametric target parameters. Outside of the risk minimization-based approaches above and the examples in the prequel (Athey et al., 2019; Nie and Wager, 2017; Athey and Wager, 2017; Zhou et al., 2018; Oprescu et al., 2019; Friedberg et al., 2018; Chernozhukov et al., 2017, 2018b), a number of other results also consider inference for nonparametric target parameters in the presence of nuisance parameters. In van der Vaart and van der Laan (2006), the target is a Lipschitz function over $[0, \infty)$ (the marginal survival function) and an estimation rate of $n^{-2/3}$ is given. Wang et al. (2010) consider estimation of smooth nonparametric target parameters in the presence of missing outcomes, and give algorithms based on kernel smoothing. Robins and Rotnitzky (2001) and Robins et al. (2008) consider settings where the target parameter is scalar, but the optimal rate is nonparametric due to the presence of complex nuisance parameters.

Sample splitting. While our use of sample splitting is directly inspired by recent use of the technique in double/debiased machine learning (Chernozhukov et al., 2016, 2018a), the basic technique dates back to the early days of semiparametric inference and it has found use in many other works to remove Donsker conditions for estimation in the presence of nuisance parameters (Bickel, 1982; Klaassen, 1987; van der Vaart, 2000; Robins et al., 2008; Zheng and van der Laan, 2010).

Limitations. Our results are quite general, but there are some applications that go beyond the scope of our framework. For example, while we consider only plug-in estimation for the nuisance parameters, several works attain refined results by using specialized estimators (van der Laan and Rubin, 2006; Hirshberg and Wager, 2017; Chernozhukov et al., 2018c; Ning et al., 2018). While our focus is on methods based on loss minimization, some problems such as nonparametric instrumental variables (Newey and Powell, 2003; Hall et al., 2005; Blundell et al., 2007; Chen and Pouzo, 2009, 2012, 2015; Chen and Christensen, 2018) are more naturally posed in terms of conditional moment restrictions.¹

1.2 Organization

Section 2 contains technical preliminaries and definitions. Section 3 presents our main theorems concerning the excess risk of Meta-Algorithm 1, as well as case studies in which we apply these theorems to treatment effect estimation and policy learning, and a generic construction of orthogonal losses. Section 4 analyzes the performance of plug-in empirical risk minimization as the second stage of the meta-algorithm. Section 5 and Section 6 give conditions on the relative complexity of the nuisance and target class under which the algorithms for stage one and stage two can be configured such that oracle excess risk bounds are achieved. We conclude with discussion in Section 7. The appendix is split into two parts. Part I contains additional results, including sufficient conditions for Neyman orthogonality and applications of our main results to specific settings. Part II contains proofs for our main results.

¹ In fact, nonparametric IV can be cast as a special case of the setup in (4), but we do not know of any estimators for this problem that satisfy the conditions required to apply our main theorems.

2 Framework: Statistical Learning with a Nuisance Component

We work in a learning setting in which observations belong to an abstract set $\mathcal{Z}$. We receive a sample set $S := z_1, \ldots, z_n$ where each $z_t$ is drawn i.i.d. from an unknown distribution $\mathcal{D}$ over $\mathcal{Z}$. Define variable subsets $\mathcal{X} \subseteq \mathcal{W} \subset \mathcal{Z}$; the restriction $\mathcal{X} \subseteq \mathcal{W}$ is not strictly necessary but simplifies notation. We focus on learning parameters that come from a target parameter class $\Theta : \mathcal{X} \to \mathcal{V}_2$ and nuisance parameter class $\mathcal{G} : \mathcal{W} \to \mathcal{V}_1$, where $\mathcal{V}_1$ and $\mathcal{V}_2$ are finite dimensional vector spaces of dimension $K_1$ and $K_2$ respectively, equipped with norms $\|\cdot\|_{\mathcal{V}_1}$ and $\|\cdot\|_{\mathcal{V}_2}$. Note that since our results are fully non-asymptotic, the classes $\Theta$ and $\mathcal{G}$ may be taken to grow with $n$.

Given an example $z_t \in \mathcal{Z}$, we write $w_t \in \mathcal{W}$ and $x_t \in \mathcal{X}$ to denote the subsets of $z_t$ that act as arguments to the nuisance and target parameters respectively. For example, we may write $g(w_t)$ for $g \in \mathcal{G}$ or $\theta(x_t)$ for $\theta \in \Theta$. We assume that the function spaces $\Theta$ and $\mathcal{G}$ are equipped with norms $\|\cdot\|_{\Theta}$ and $\|\cdot\|_{\mathcal{G}}$ respectively. In our applications, both norms take the form $\|f\|_{L_p(\mathcal{V}, \mathcal{D})} = \left(E_{z \sim \mathcal{D}} \|f(z)\|_{\mathcal{V}}^p\right)^{1/p}$ for functions $f : \mathcal{Z} \to \mathcal{V}$, where $\mathcal{V} \in \{\mathcal{V}_1, \mathcal{V}_2\}$.

We measure performance of the target predictor through the real-valued population loss functional $L_{\mathcal{D}}(\theta, g)$, which maps a target predictor $\theta$ and nuisance predictor $g$ to a loss. The subscript $\mathcal{D}$ in $L_{\mathcal{D}}$ denotes that the functional depends on the underlying distribution $\mathcal{D}$. For all of our applications, $L_{\mathcal{D}}$ has the following structure, in line with the classical statistical learning setting: First define a pointwise loss function $\ell(\theta, g; z)$, then define $L_{\mathcal{D}}(\theta, g) := E_{z \sim \mathcal{D}}[\ell(\theta, g; z)]$. Our general framework does not explicitly assume this structure, however.

Let $g_0 \in \mathcal{G}$ be the unknown true value for the nuisance parameter. Given the samples $S$, and without knowledge of $g_0$, we aim to produce a target predictor $\hat{\theta}$ that minimizes the excess risk evaluated at $g_0$:

$$L_{\mathcal{D}}(\hat{\theta}, g_0) - \inf_{\theta \in \Theta} L_{\mathcal{D}}(\theta, g_0). \tag{4}$$

As discussed in the introduction, we will always produce such a predictor via the sample splitting meta-algorithm (Meta-Algorithm 1), which makes use of a nuisance predictor $\hat{g}$. When the infimum in the excess risk is attained, we use $\theta^\star$ to denote the corresponding minimizer, in which case the excess risk can be written as

$$L_{\mathcal{D}}(\hat{\theta}, g_0) - L_{\mathcal{D}}(\theta^\star, g_0).$$

We occasionally use the notation $\theta_0$ to refer to a particular target parameter with respect to which the second stage satisfies a first-order condition, e.g. $D_\theta L_{\mathcal{D}}(\theta_0, g_0)[\theta - \theta_0] = 0$ for all $\theta \in \Theta$. If $\theta_0 \in \Theta$ and the population risk is convex, then we can take $\theta^\star = \theta_0$ without loss of generality, but we do not assume this, and in general we do not assume existence of such a parameter $\theta_0$.

Notation. We let $\langle \cdot, \cdot \rangle$ denote the standard inner product. $\|\cdot\|_p$ will denote the $\ell_p$ norm over $\mathbb{R}^d$ and $\|\cdot\|_\sigma$ will denote the spectral norm over $\mathbb{R}^{d_1 \times d_2}$. Unless otherwise stated, the expectation $E[\cdot]$, probability $P(\cdot)$, and variance $\mathrm{Var}(\cdot)$ operators will be taken with respect to the underlying distribution $\mathcal{D}$. We define empirical analogues $E_n[\cdot]$, $P_n(\cdot)$, and $\mathrm{Var}_n(\cdot)$ with respect to a sample set $z_1, \ldots, z_n$, whose value will be clear from context.

For a vector space $\mathcal{V}$ with norm $\|\cdot\|_{\mathcal{V}}$ and function $f : \mathcal{Z} \to \mathcal{V}$, we define $\|f\|_{L_p(\mathcal{V}, \mathcal{D})} = \left(E_{z \sim \mathcal{D}} \|f(z)\|_{\mathcal{V}}^p\right)^{1/p}$ for $p \in (0, \infty)$, with $L_p(\ell_q, \mathcal{D})$ referring to the special case where $\|\cdot\|_{\mathcal{V}} = \|\cdot\|_q$. For a sample set $S = z_{1:n}$, we define the empirical variant $\|f\|_{L_p(\mathcal{V}, S)} = \left(\frac{1}{n} \sum_{i=1}^n \|f(z_i)\|_{\mathcal{V}}^p\right)^{1/p}$. When $\mathcal{V} = \mathbb{R}$, we drop the first argument and write $L_p(\mathcal{D})$ and $L_p(S)$. We extend these definitions to $p = \infty$ in the natural way.

For a subset $\mathcal{X}$ of a vector space, $\mathrm{conv}(\mathcal{X})$ will denote the convex hull. For an element $x \in \mathcal{X}$, we define the star hull via

$$\mathrm{star}(\mathcal{X}, x) = \{ t \cdot x + (1 - t) \cdot x' \mid x' \in \mathcal{X},\ t \in [0, 1] \}, \tag{5}$$

and adopt the shorthand $\mathrm{star}(\mathcal{X}) := \mathrm{star}(\mathcal{X}, 0)$. Given functions $f, g : \mathcal{X} \to [0, \infty)$ where $\mathcal{X}$ is any set, we use non-asymptotic big-O notation, writing $f = O(g)$ if there exists a numerical constant $c < \infty$ such that $f(x) \le c \cdot g(x)$ for all $x \in \mathcal{X}$ and $f = \Omega(g)$ if there is a numerical constant $c > 0$ such that $f(x) \ge c \cdot g(x)$. We write $f = \tilde{O}(g)$ as shorthand for $f = O(g \max\{1, \mathrm{polylog}(g)\})$.
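The empirical norm $\|f\|_{L_p(\mathcal{V}, S)}$ defined above is straightforward to compute. The following is a minimal sketch, assuming vector values are represented as tuples and that $\|\cdot\|_{\mathcal{V}}$ defaults to the Euclidean norm; the function names and the treatment of $p = \infty$ as a maximum over the sample are illustrative choices, not notation from the paper.

```python
import math


def euclidean_norm(v):
    """Euclidean norm of a vector given as a tuple or list of floats."""
    return math.sqrt(sum(c * c for c in v))


def empirical_lp_norm(f, sample, p=2.0, vec_norm=euclidean_norm):
    """Empirical L_p(V, S) norm: ((1/n) * sum_i ||f(z_i)||_V^p)^(1/p).

    For p = infinity this sketch returns the maximum of ||f(z_i)||_V
    over the sample (an assumed reading of "the natural way").
    """
    if p <= 0:
        raise ValueError("p must be positive")
    if math.isinf(p):
        return max(vec_norm(f(z)) for z in sample)
    total = sum(vec_norm(f(z)) ** p for z in sample)
    return (total / len(sample)) ** (1.0 / p)


# Usage: a vector-valued function evaluated on a two-point sample.
sample = [3.0, 4.0]
f = lambda z: (z, 0.0)
l2 = empirical_lp_norm(f, sample, p=2.0)          # sqrt((9 + 16) / 2)
linf = empirical_lp_norm(f, sample, p=math.inf)   # max(3, 4)
```

Passing a different `vec_norm` (for example an $\ell_q$ norm) gives the empirical counterpart of $L_p(\ell_q, \mathcal{D})$.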

3 Orthogonal Statistical Learning

In this section we present our main results on orthogonal statistical learning, which state that under certain conditions on the loss function, the error due to estimation of the nuisance component $g_0$ has higher-order impact on the prediction error of the target component. The results in this section, which form the basis for all subsequent results, are algorithm-independent, and only involve assumptions on properties of the population risk $L_{\mathcal{D}}$. To emphasize the high level of generality, the results in this section invoke the learning algorithms in Meta-Algorithm 1 only through “rate” functions $\mathrm{Rate}_{\mathcal{D}}(\mathcal{G}, \ldots)$ and $\mathrm{Rate}_{\mathcal{D}}(\Theta, \ldots)$ which respectively bound the estimation error of the first stage and the excess risk of the second stage.

Definition 1 (Algorithms and Rates). The first and second stage algorithms and corresponding rate functions are defined as follows:

a) Nuisance algorithm and rate. The first stage learning algorithm Alg$(\mathcal{G}, S)$, when given a sample set $S$ from distribution $\mathcal{D}$, outputs a predictor $\hat{g}$ for which

$$\|\hat{g} - g_0\|_{\mathcal{G}} \le \mathrm{Rate}_{\mathcal{D}}(\mathcal{G}, S, \delta)$$

with probability at least $1 - \delta$.

b) Target algorithm and rate. Let $\hat{\Theta}$ be some set with $\theta^\star \in \hat{\Theta}$. The second stage learning algorithm Alg$(\Theta, S; g)$, when given sample set $S$ from distribution $\mathcal{D}$ and any $g \in \mathcal{G}$, outputs a predictor $\hat{\theta} \in \hat{\Theta}$ for which

$$L_{\mathcal{D}}(\hat{\theta}, g) - L_{\mathcal{D}}(\theta^\star, g) \le \mathrm{Rate}_{\mathcal{D}}(\Theta, S, \delta; g)$$

with probability at least $1 - \delta$.

We denote worst-case variants of the rates by $\mathrm{Rate}_{\mathcal{D}}(\mathcal{G}, n, \delta) := \sup_{S : |S| = n} \mathrm{Rate}_{\mathcal{D}}(\mathcal{G}, S, \delta)$ and $\mathrm{Rate}_{\mathcal{D}}(\Theta, n, \delta; g) := \sup_{S : |S| = n} \mathrm{Rate}_{\mathcal{D}}(\Theta, S, \delta; g)$.

Observe that if one naively applies the algorithm for the target class using the nuisance predictor $\hat{g}$ as a plug-in estimate for $g_0$, the rate stated in Definition 1 will only yield an excess risk bound of the form

$L_{\mathcal{D}}(\hat{\theta},\hat{g}) - L_{\mathcal{D}}(\theta^\star,\hat{g}) \le \mathrm{Rate}_{\mathcal{D}}(\Theta,S,\delta;\hat{g})$. (6)

This clearly does not match the desired bound (4), which involves only the true nuisance value $g_0$ and not the plug-in estimate $\hat{g}$. The bulk of our work is to show that orthogonality may be used to correct this mismatch.

Note that Definition 1 allows the target predictor $\hat{\theta}$ to belong to a class $\hat{\Theta}$ which in general has $\hat{\Theta}\neq\Theta$. This extra level of generality serves two purposes. First, it allows for refined analysis in the case where $\hat{\Theta}\subset\Theta$, which is encountered when using algorithms based on regularization that do not impose hard constraints on, e.g., the norm of the class of predictors. Second, it permits the use of improper prediction, i.e. $\hat{\Theta}\supset\Theta$, which in some settings is required to obtain optimal rates for misspecified models (Audibert, 2008; Foster et al., 2018).

Recall that for a sample set $S = z_1,\ldots,z_n$, the empirical loss is defined via $L_S(\theta,g) = \frac{1}{n}\sum_{t=1}^n \ell(\theta,g;z_t)$. Many classical results from statistical learning can be applied to the double machine learning setting by minimizing the empirical loss with plug-in estimates for $g_0$, and we can simply cite these results to provide examples of $\mathrm{Rate}_{\mathcal{D}}$ for the target class $\Theta$. Note however that this structure is not assumed by Definition 1, and we indeed consider algorithms that do not have this form.

Fast rates and slow rates. The rates presented in this section fall into two distinct categories, which we distinguish by referring to them as either fast rates or slow rates. The meaning of the word "fast" or "slow" here is two-fold. First, for fast rates, our assumptions on the loss imply that when the target class $\Theta$ is not too large (e.g. a parametric or VC-subgraph class), prediction error rates of order $O(1/n)$ are possible in the absence of nuisance parameters. For our slow rate results, the best prediction error rate that can be achieved is $O(1/\sqrt{n})$, even for small classes. This distinction is consistent with the usage of the term fast rate in statistical learning (Bousquet et al., 2004; Bartlett et al., 2005; Srebro et al., 2010), and we will see concrete examples of such rates for specific classes in later sections (Section 4, Section 5). The second meaning of "fast" versus "slow" refers to the first stage: when the estimation error for the nuisance is of order $\varepsilon$, the impact on the second stage in our fast rate results is of order $\varepsilon^4$, while for our slow rate results the impact is of order $\varepsilon^2$. The fast rate regime, and in particular the $\varepsilon^4$-type dependence on the nuisance error, will be the more familiar of the two for readers accustomed to semiparametric inference. While fast rates might at first seem to strictly improve over slow rates, these results require stronger assumptions on the loss. Our results in Section 5 and Section 6 show that which setting is more favorable will in general depend on the precise relationship between the complexity of the target parameter class and the nuisance parameter class.

3.1 Fast Rates Under Strong Convexity

We first present general conditions under which the sample splitting meta-algorithm obtains so-called fast rates for prediction. To present the conditions, we fix a representative $\theta^\star\in\arg\min_{\theta\in\Theta} L_{\mathcal{D}}(\theta,g_0)$. In general the minimizer may not be unique; indeed, by focusing on prediction we can provide guarantees even though parameter recovery is clearly impossible in this case. Thus, we assume that a single fixed representative $\theta^\star$ is used throughout all the assumptions stated in this section. Our assumptions are stated in terms of directional derivatives with respect to the target and nuisance parameters.

Definition 2 (Directional Derivative). Let $\mathcal{F}$ be a vector space of functions. For a functional $F:\mathcal{F}\to\mathbb{R}$, we define the derivative operator $D_f F(f)[h] = \frac{d}{dt}F(f+th)\big|_{t=0}$ for a pair of functions $f,h\in\mathcal{F}$. Likewise, we define $D_f^k F(f)[h_1,\ldots,h_k] = \frac{\partial^k}{\partial t_1\cdots\partial t_k}F(f+t_1h_1+\cdots+t_kh_k)\big|_{t_1=\cdots=t_k=0}$. When considering a functional in two arguments, e.g. $L_{\mathcal{D}}(\theta,g)$, we write $D_\theta L_{\mathcal{D}}(\theta,g)$ and $D_g L_{\mathcal{D}}(\theta,g)$ to make the argument with respect to which the derivative is taken explicit.

Our first assumption is the starting point for this work, and asserts that the population loss is orthogonal in the sense that certain pathwise derivatives vanish.

Assumption 1 (Orthogonal Loss). The population risk $L_{\mathcal{D}}$ is Neyman orthogonal:

$D_g D_\theta L_{\mathcal{D}}(\theta^\star,g_0)[\theta-\theta^\star,\ g-g_0] = 0 \quad \forall\theta\in\hat{\Theta},\ \forall g\in\mathcal{G}$. (7)

In addition to orthogonality, our main theorem for fast rates requires three additional assumptions, all of which are ubiquitous in results on fast rates for prediction in statistical learning. We require a first-order optimality condition for the target class, and require that the population risk is both strongly convex with respect to the target class and smooth. Assumption 2 (First Order Optimality). The minimizer for the population risk satisfies the first-order optimality condition:

$D_\theta L_{\mathcal{D}}(\theta^\star,g_0)[\theta-\theta^\star] \ge 0 \quad \forall\theta\in\mathrm{star}(\hat{\Theta},\theta^\star)$. (8)

Remark 1. The first-order condition is typically satisfied for models that are well-specified, meaning that there is some variable in z that identifies the target parameter θ0. More generally, it suffices to “almost” satisfy the first-order condition, i.e. to replace (8) by the condition

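$D_\theta L_{\mathcal{D}}(\theta^\star,g_0)[\theta-\theta^\star] \ge -o_n(\mathrm{Rate}_{\mathcal{D}}(\Theta,n,\delta;\hat{g}))$. (9)

The first-order condition is also satisfied whenever $\hat{\Theta}$ is star-shaped around $\theta^\star$, i.e. $\mathrm{star}(\hat{\Theta},\theta^\star)\subseteq\hat{\Theta}$.

Assumption 3 (Strong Convexity in Prediction). The population risk $L_{\mathcal{D}}$ is strongly convex with respect to the prediction: For all $\theta\in\hat{\Theta}$ and $g\in\mathcal{G}$,

$D_\theta^2 L_{\mathcal{D}}(\bar{\theta},g)[\theta-\theta^\star,\ \theta-\theta^\star] \ge \lambda\|\theta-\theta^\star\|_\Theta^2 - \kappa\|g-g_0\|_{\mathcal{G}}^4 \quad \forall\bar{\theta}\in\mathrm{star}(\hat{\Theta},\theta^\star)$.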

Assumption 4 (Higher-Order Smoothness). There exist constants β1 and β2 such that the following derivative bounds hold:

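a) Second-order smoothness with respect to target. For all $\theta\in\hat{\Theta}$ and all $\bar{\theta}\in\mathrm{star}(\hat{\Theta},\theta^\star)$:

$D_\theta^2 L_{\mathcal{D}}(\bar{\theta},g_0)[\theta-\theta^\star,\ \theta-\theta^\star] \le \beta_1\cdot\|\theta-\theta^\star\|_\Theta^2$.

b) Higher-order smoothness. For all $\theta\in\mathrm{star}(\hat{\Theta},\theta^\star)$, $g\in\mathcal{G}$, and $\bar{g}\in\mathrm{star}(\mathcal{G},g_0)$:

$D_g^2 D_\theta L_{\mathcal{D}}(\theta^\star,\bar{g})[\theta-\theta^\star,\ g-g_0,\ g-g_0] \le \beta_2\cdot\|\theta-\theta^\star\|_\Theta\cdot\|g-g_0\|_{\mathcal{G}}^2$.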

All of the conditions of Assumption 3 and Assumption 4 are easily satisfied whenever the population loss is obtained by applying the square loss or any other strongly convex and smooth link to the prediction of the target class; concrete examples are given in Appendix A. We now state our main theorem for fast rates.

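Theorem 1. Suppose that there is some $\theta^\star\in\arg\min_{\theta\in\Theta}L_{\mathcal{D}}(\theta,g_0)$ such that Assumptions 1 to 4 are satisfied. Then the sample splitting meta-algorithm (Meta-Algorithm 1) produces a parameter $\hat{\theta}$ such that with probability at least $1-\delta$,

$\|\hat{\theta}-\theta^\star\|_\Theta^2 \le \frac{4}{\lambda}\,\mathrm{Rate}_{\mathcal{D}}(\Theta,S_2,\delta/2;\hat{g}) + \frac{1}{\lambda}\left(\frac{\beta_2^2}{\lambda}+2\kappa\right)\cdot\left(\mathrm{Rate}_{\mathcal{D}}(\mathcal{G},S_1,\delta/2)\right)^4$, (10)

and

$L_{\mathcal{D}}(\hat{\theta},g_0) - L_{\mathcal{D}}(\theta^\star,g_0) \le \frac{2\beta_1}{\lambda}\,\mathrm{Rate}_{\mathcal{D}}(\Theta,S_2,\delta/2;\hat{g}) + \frac{\beta_1}{2\lambda}\left(\frac{\beta_2^2}{\lambda}+2\kappa\right)\cdot\left(\mathrm{Rate}_{\mathcal{D}}(\mathcal{G},S_1,\delta/2)\right)^4$. (11)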

Theorem 1 shows that for Meta-Algorithm 1, the impact of the unknown nuisance parameter on the prediction has favorable fourth-order growth: $(\mathrm{Rate}_{\mathcal{D}}(\mathcal{G},S_1,\delta/2))^4$. This means that if the desired oracle rate without nuisance parameters is of order $O(n^{-1})$, it suffices to take $\mathrm{Rate}_{\mathcal{D}}(\mathcal{G},S_1,\delta/2) = o(n^{-1/4})$.

There is one issue not addressed by Theorem 1: If the nuisance parameter $g_0$ were known, the rate for the target parameters would be $\mathrm{Rate}_{\mathcal{D}}(\Theta,\ldots;g_0)$, but the bound in (11) scales instead with $\mathrm{Rate}_{\mathcal{D}}(\Theta,\ldots;\hat{g})$. This is addressed in Section 4 and Section 5, where we show that for many standard algorithms, the cost to relate these quantities grows only as $(\mathrm{Rate}_{\mathcal{D}}(\mathcal{G},S_1,\delta/2))^4$, and so can be absorbed into the second term in (10) or (11).
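To make the arithmetic in Theorem 1 concrete, the following minimal sketch evaluates the right-hand sides of (10) and (11) for user-supplied rates and constants. The function name `fast_rate_bounds` and the numerical values are hypothetical and chosen only for illustration; they do not come from the paper.

```python
import math


def fast_rate_bounds(rate_target, rate_nuisance, lam, kappa, beta1, beta2):
    """Evaluate the right-hand sides of (10) and (11) in Theorem 1.

    rate_target   : Rate_D(Theta, S2, delta/2; g_hat)
    rate_nuisance : Rate_D(G, S1, delta/2)
    lam, kappa    : constants from Assumption 3
    beta1, beta2  : constants from Assumption 4
    """
    # Shared fourth-order nuisance term: (beta2^2 / lam + 2 kappa) * Rate(G)^4.
    nuisance_term = (beta2 ** 2 / lam + 2.0 * kappa) * rate_nuisance ** 4
    # (10): bound on the squared distance to theta*.
    param_bound = (4.0 / lam) * rate_target + (1.0 / lam) * nuisance_term
    # (11): bound on the excess risk at the true nuisance g0.
    risk_bound = (2.0 * beta1 / lam) * rate_target + (beta1 / (2.0 * lam)) * nuisance_term
    return param_bound, risk_bound


# Illustrative (assumed) values: oracle rate 1/n and nuisance rate n^(-1/4).
n = 10_000
param_bound, risk_bound = fast_rate_bounds(
    rate_target=1.0 / n, rate_nuisance=n ** -0.25,
    lam=1.0, kappa=0.5, beta1=1.0, beta2=1.0,
)
```

With these inputs the nuisance term is of the same order $1/n$ as the target term, which illustrates why any nuisance rate of order $o(n^{-1/4})$ leaves the oracle rate unchanged.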

3.2 Beyond Strong Convexity: Slow Rates

The strong convexity assumption used by Theorem 1 requires curvature only in the prediction space, not the parameter space. This is considerably weaker than what is assumed in prior works on double machine learning (e.g., Chernozhukov et al. (2018b)), and is a major advantage of analyzing prediction error rather than parameter recovery. Nonetheless, in some situations even assuming strong convexity on predictions may be unrealistic. A second advantage of studying prediction is that, while parameter recovery is not possible in this case, it is still possible to achieve low prediction error, albeit with slower rates than in the strongly convex case. We now give guarantees under which these (slower) oracle rates for prediction error can be obtained in the presence of nuisance parameters using Meta-Algorithm 1. The key technical assumption for our results here is universal orthogonality, which informally states that the loss is not simply orthogonal around $\theta^\star$, but rather is orthogonal for all $\theta\in\Theta$.

Assumption 5 (Universal Orthogonality). For all $\bar{\theta}\in\mathrm{star}(\hat{\Theta},\theta^\star)+\mathrm{star}(\hat{\Theta}-\theta^\star,0)$,

$D_\theta D_g L_{\mathcal{D}}(\bar{\theta},g_0)[g-g_0,\ \theta-\theta^\star] = 0 \quad \forall g\in\mathcal{G},\ \theta\in\Theta$.

The universal orthogonality assumption is satisfied for examples including treatment effect estimation (Section 3.3) and policy learning (Section 3.4), and is used implicitly in previous work in these settings (Nie and Wager, 2017; Athey and Wager, 2017). Beyond orthogonality, we require a mild smoothness assumption for the nuisance class.

Assumption 6. The derivatives $D_g^2 L_{\mathcal{D}}(\theta,g)$ and $D_\theta^2 D_g L_{\mathcal{D}}(\theta,g)$ are continuous. Furthermore, there exists a constant $\beta$ such that for all $\theta\in\mathrm{star}(\hat{\Theta},\theta^\star)$ and $\bar{g}\in\mathrm{star}(\mathcal{G},g_0)$,

$D_g^2 L_{\mathcal{D}}(\theta,\bar{g})[g-g_0,\ g-g_0] \le \beta\cdot\|g-g_0\|_{\mathcal{G}}^2 \quad \forall g\in\mathcal{G}$. (12)

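Our main theorem for slow rates is as follows.

Theorem 2. Suppose that there is $\theta^\star\in\arg\min_{\theta\in\Theta}L_{\mathcal{D}}(\theta,g_0)$ such that Assumption 5 and Assumption 6 are satisfied. Then with probability at least $1-\delta$, the target parameter $\hat{\theta}$ produced by Meta-Algorithm 1 enjoys the excess risk bound:

$L_{\mathcal{D}}(\hat{\theta},g_0) - L_{\mathcal{D}}(\theta^\star,g_0) \le \mathrm{Rate}_{\mathcal{D}}(\Theta,S_2,\delta/2;\hat{g}) + \beta\cdot\left(\mathrm{Rate}_{\mathcal{D}}(\mathcal{G},S_1,\delta/2)\right)^2$.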
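As a small worked example of the bound in Theorem 2, the sketch below evaluates its right-hand side along a sequence of sample sizes, under the assumed (illustrative) rates $n^{-1/2}$ for the target and $n^{-1/4}$ for the nuisance. The function name `slow_rate_bound` and the value of `beta` are hypothetical.

```python
def slow_rate_bound(rate_target, rate_nuisance, beta):
    """Right-hand side of the Theorem 2 bound: Rate(Theta) + beta * Rate(G)^2."""
    return rate_target + beta * rate_nuisance ** 2


beta = 2.0  # assumed smoothness constant from Assumption 6
ratios = []
for n in (10**2, 10**4, 10**6):
    # Assumed rates: slow oracle rate n^(-1/2), nuisance rate n^(-1/4).
    bound = slow_rate_bound(n ** -0.5, n ** -0.25, beta)
    # Ratio of the bound to the oracle rate; it stays constant in n.
    ratios.append(bound / n ** -0.5)
```

Under these assumptions the ratio equals $1+\beta$ for every $n$, so a nuisance rate of order $n^{-1/4}$ already preserves the $n^{-1/2}$ order of the slow oracle rate, and any faster nuisance rate makes the second term negligible.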

3.3 Example: Treatment Effect Estimation

To make matters concrete, we now walk through a detailed example in which we specialize our general framework to the well-studied problem of treatment effect estimation. We show how the setup falls in our framework, explain what statistical assumptions are required to apply our main theorems, and show how to interpret the resulting excess risk bounds. Following, e.g., Robinson(1988); Nie and Wager(2017), we receive examples z = (X, W, Y, T ) according to the following data generating process:

$Y = T\cdot\theta_0(X) + f_0(W) + \varepsilon_1, \quad \mathbb{E}[\varepsilon_1\mid X,W,T]=0$,
$T = e_0(W) + \varepsilon_2, \quad \mathbb{E}[\varepsilon_2\mid X,W]=0$, (13)

where $X\in\mathcal{X}$ and $W\in\mathcal{W}$ are covariates, $T\in\{0,1\}$ is the treatment variable, and $Y\in\mathbb{R}$ is the target variable. The true target parameter is $\theta_0:\mathcal{X}\to\mathbb{R}$, but we do not necessarily assume that $\theta_0\in\Theta$. The functions $e_0:\mathcal{W}\to[0,1]$ and $f_0:\mathcal{W}\to\mathbb{R}$ are unknown; we define $m_0(x,w) = \mathbb{E}[Y\mid X=x,W=w] = \theta_0(x)e_0(w)+f_0(w)$ and take $g_0=\{m_0,e_0\}$ to be the true nuisance parameter. We set $w=(X,W,T)$ and $x=(X)$, and use the loss

$\ell(\theta,\{m,e\};z) = \left(Y - m(X,W) - (T-e(W))\,\theta(X)\right)^2$. (14)

Interpreting excess risk. Let us take a moment to interpret the meaning of excess risk for the loss we have defined. It is simple to verify that if the true nuisance parameters g0 = {m0, e0} are plugged in, and if the model is well-specified in the sense that θ0 ∈ Θ, we have

$L_{\mathcal{D}}(\theta,g_0) - L_{\mathcal{D}}(\theta_0,g_0) = \mathbb{E}\left[\left((T-e_0(W))\cdot(\theta(X)-\theta_0(X))\right)^2\right]$.

Thus, if a predictor $\theta$ has low risk it must be good at predicting $\theta_0(X)$ whenever there is sufficient variation in the treatment $T$. If the model is not well-specified but $\Theta$ is convex, we can still deduce that

$L_{\mathcal{D}}(\theta,g_0) - L_{\mathcal{D}}(\theta^\star,g_0) \ge \mathbb{E}\left[\left((T-e_0(W))\cdot(\theta(X)-\theta^\star(X))\right)^2\right]$,

so in this case low excess risk implies that we predict nearly as well as the best predictor in class (again, assuming sufficient variation in $T$).

Verifying orthogonality. Establishing the basic orthogonality and first-order conditions required to apply Theorem 1 and Theorem 2 is a simple exercise (see Appendix F for a full derivation):

• The conditional expectation assumptions in (13) imply that the loss satisfies the first-order condition whenever $\theta_0\in\Theta$. On the other hand, even if $\theta_0\notin\Theta$, the first-order condition is still satisfied as long as $\Theta$ is convex.

• The loss is universally orthogonal, meaning that its partial derivatives vanish not just around $\theta_0$ but around any $\theta:\mathcal{X}\to\mathbb{R}$:

$D_e D_\theta L_{\mathcal{D}}(\theta,\{m_0,e_0\})[\theta'-\theta,\ e-e_0] = 0 \quad \forall\theta,\theta',e$

and

$D_m D_\theta L_{\mathcal{D}}(\theta,\{m_0,e_0\})[\theta'-\theta,\ m-m_0] = 0 \quad \forall\theta,\theta',m$.

This means that the orthogonality condition (7) in Assumption 1 is satisfied for any $\theta^\star$, regardless of whether or not $\theta_0\in\Theta$.

As a consequence, our general results imply that for any class, it is possible to achieve oracle rates for prediction with this loss in the presence of nuisance parameters, even when the parameter class $\Theta$ is completely misspecified. That is, if $\Theta$ is convex, then thanks to the universal orthogonality property, oracle rates are achievable so long as $\mathrm{Rate}_{\mathcal{D}}(\mathcal{G},n,\delta) = o\left(\mathrm{Rate}_{\mathcal{D}}(\Theta,n,\delta)^{1/4}\right)$, modulo regularity conditions which we verify now.
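The orthogonality claims above can be checked numerically on a toy instance. The sketch below is an illustration under assumptions that are not part of the paper: $W$ takes three values, $X=W$, and all numerical values are made up. Because the design is discrete, the population risk of the loss (14) can be computed exactly, and because the risk is a polynomial of degree at most two in each perturbation scale, the central mixed difference recovers the cross derivative exactly up to rounding. For contrast, it also evaluates a naive (non-orthogonalized) loss $(Y - f(W) - T\theta(X))^2$, whose cross derivative in $(\theta, f)$ does not vanish.

```python
import numpy as np

# Hypothetical discrete design (assumption for illustration): W in {0,1,2}, X = W.
p_w = np.array([0.2, 0.5, 0.3])       # P(W = w)
theta0 = np.array([1.0, -0.5, 2.0])   # true effect theta_0(x)
f0 = np.array([0.3, 1.2, -0.7])       # baseline f_0(w)
e0 = np.array([0.2, 0.6, 0.8])        # propensity e_0(w)
m0 = theta0 * e0 + f0                 # m_0 = E[Y | X, W]
sigma2 = 1.0                          # Var(eps_1)


def expected_sq(residual_mean):
    """Exact E[residual^2]: average over W and T, with eps_1 integrated out."""
    total = 0.0
    for t, p_t in ((1.0, e0), (0.0, 1.0 - e0)):
        total += np.sum(p_w * p_t * (residual_mean(t) ** 2 + sigma2))
    return total


def orthogonal_risk(theta, m, e):
    # Population risk of the loss (14): (Y - m - (T - e) * theta)^2.
    return expected_sq(lambda t: t * theta0 + f0 - m - (t - e) * theta)


def naive_risk(theta, f):
    # Population risk of the non-orthogonalized loss (Y - f - T * theta)^2.
    return expected_sq(lambda t: t * theta0 + f0 - f - t * theta)


def mixed_derivative(fn, h=0.1):
    """Central difference for d^2 fn / (ds du) at (0, 0)."""
    return (fn(h, h) - fn(h, -h) - fn(-h, h) + fn(-h, -h)) / (4.0 * h * h)


theta = np.array([0.0, 1.0, -1.0])    # arbitrary evaluation point, not theta_0
d_theta = np.array([1.0, 0.5, -1.0])  # direction theta' - theta
d_e = np.array([0.1, -0.2, 0.05])     # direction e - e_0
d_m = np.array([0.5, -1.0, 1.0])      # direction m - m_0 (also used for f - f_0)

cross_e = mixed_derivative(
    lambda s, u: orthogonal_risk(theta + s * d_theta, m0, e0 + u * d_e))
cross_m = mixed_derivative(
    lambda s, u: orthogonal_risk(theta + s * d_theta, m0 + u * d_m, e0))
cross_naive = mixed_derivative(
    lambda s, u: naive_risk(theta + s * d_theta, f0 + u * d_m))
# For the naive loss the cross derivative equals 2 * E[T * d_theta * d_f].
naive_expected = 2.0 * np.sum(p_w * e0 * d_theta * d_m)
```

In this toy design `cross_e` and `cross_m` vanish up to floating-point error even though `theta` differs from `theta0`, which is the universal orthogonality property, while `cross_naive` matches the nonzero value `naive_expected`.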

Fast rates, slow rates, and strong convexity. This example is a special case of a more general class of single-index losses which take the form $\ell(\theta(x),g(w);z) = \left(\langle\Lambda(g(w),v),\theta(x)\rangle - \Gamma(g(w),z)\right)^2$, where $\Lambda$ and $\Gamma$ are known functions (take $\Lambda(g(w),w) = (T-e(W))$ and $\Gamma(g(w),z) = (Y-m(X,W))$). In Appendix A, we give general conditions under which the regularity conditions required to apply our main theorems hold for losses of this type. Briefly, the regularity conditions in Assumption 4 and Assumption 6 hold given mild boundedness and smoothness assumptions, while the more restrictive strong convexity condition (Assumption 3) required by Theorem 1 for fast rates requires, when specialized to treatment effects, control of ratios of the form $\frac{\mathbb{E}(T-e_0(W))^2(\theta(X)-\theta^\star(X))^2}{\mathbb{E}(\theta(X)-\theta^\star(X))^2}$. Whether the fast rate (Theorem 1) or slow rate (Theorem 2) is better given finite samples will depend on the behavior of the data distribution and target class.

Let us first consider fast rates, for which we appeal to Theorem 1. For the square loss, assuming data is bounded or subgaussian, the strong convexity condition required by Assumption 3 specializes to

$\inf_{\theta\in\Theta}\left\{\frac{\mathbb{E}(T-e_0(W))^2(\theta(X)-\theta^\star(X))^2}{\mathbb{E}(\theta(X)-\theta^\star(X))^2}\right\} \ge \lambda$. (15)

One special case of the treatment effect setup, which was investigated in Chernozhukov et al. (2017) and Chernozhukov et al. (2018b), is where $\Theta$ is a class of high-dimensional predictors of the form $\theta(x) = \langle w,\phi(x)\rangle$, where $w\in\mathbb{R}^p$ and $\phi:\mathcal{X}\to\mathbb{R}^p$ is a fixed featurization. We allow the dimension $p$ to grow with $n$, so in general we may have $p\gg n$. For this setting, to satisfy the condition (15), it suffices that $\mathrm{Var}(\varepsilon_2\mid X)\ge\eta$ for some $\eta>0$ with no further assumptions on the data distribution or target parameter class. The latter condition is typically referred to as overlap, since for the case of a binary treatment it boils down to requiring that the treatment is not deterministic for any realization of the covariates.

Compared to Chernozhukov et al. (2017, 2018b), our main theorems allow for misspecification of the target parameter. The convergence rate depends on the quantity $\lambda_{\mathrm{si}}$ in (15), which is more benign than the minimum restricted eigenvalue of the matrix $\mathbb{E}[\phi(X)\phi(X)^\top]$, which was used in these works. Whenever the overlap condition is satisfied we have $\lambda_{\mathrm{si}}\ge\eta$, but even when overlap is not satisfied the restricted eigenvalue assumption alone is sufficient to imply (15), thereby recovering the assumptions from prior work as a special case. Note that Chernozhukov et al. (2017, 2018b) focused on parameter recovery, for which restricted eigenvalue type conditions are a minimal assumption to guarantee consistency. Since we consider mean squared error, we can provide guarantees even when parameter recovery is impossible. Such is the case, for example, when the overlap condition is satisfied but the matrix $\mathbb{E}[\phi(X)\phi(X)^\top]$ has arbitrarily bad restricted eigenvalue.

Turning to slow rates, we observe that some distributions may simply not satisfy (15). In this case, we can appeal to Theorem 2, as Assumption 6 is trivially satisfied as long as the classes are bounded, and does not require any lower bounds in the vein of (15). While the dependence on first stage estimation error in Theorem 2 is worse than in Theorem 1, orthogonality still helps out here when the target class is sufficiently large, cf. Figure 3.

3.4 Example: Policy Learning

As a second example, we show how to apply our framework to the classical problem of policy learning. Compared to our treatment effect estimation example, losses for this setting do not typically satisfy the strong convexity property, meaning that Theorem 2 is the relevant meta-theorem, and slow rates are to be expected.

In policy learning, we receive examples of the form Z = (X,T,Y ), where Y ∈ R is an incurred loss, T ∈ T is a treatment vector and X ∈ X is a vector of covariates. The treatment T is chosen based on an unknown, potentially randomized policy which depends on X. Specifically, we assume the following data generating process:

$Y = f_0(T,X) + \varepsilon_1, \quad \mathbb{E}[\varepsilon_1\mid X,T]=0$,
$T = e_0(X) + \varepsilon_2, \quad \mathbb{E}[\varepsilon_2\mid X]=0$. (16)

The learner wishes to optimize over a set of treatment policies $\Theta\subseteq(\mathcal{X}\to\mathcal{T})$ (i.e., policies take as input covariates $X$ and return a treatment). Their goal is to produce a policy $\hat{\theta}$ that achieves small regret with respect to the population risk:

$\mathbb{E}\left[f_0(\hat{\theta}(X),X)\right] - \min_{\theta\in\Theta}\mathbb{E}[f_0(\theta(X),X)]$. (17)

This formulation has been extensively studied in statistics (Qian and Murphy, 2011; Zhao et al., 2012; Zhou et al., 2017; Athey and Wager, 2017; Zhou et al., 2018) and machine learning (Beygelzimer and Langford, 2009; Dudík et al., 2011; Swaminathan and Joachims, 2015a; Kallus and Zhou, 2018); in the latter, it is sometimes referred to as counterfactual risk minimization.

The learner does not know the so-called counterfactual outcome function $f_0$, so it is treated as a nuisance parameter. Typically, orthogonalization of this nuisance parameter is possible by utilizing the secondary treatment equation in (16) and fitting a parameter for the observational policy $e_0$, which is also treated as a nuisance parameter. We can then write the expected counterfactual reward as

$f_0(t,X) = \mathbb{E}[\ell(t,f_0,e_0;Z)\mid X]$ (18)

for some known loss function $\ell$ that utilizes the treatment parameter $e_0$. Letting $g_0=\{f_0,e_0\}$, the learner's goal can be phrased as minimizing the population risk,

$\mathbb{E}[f_0(\theta(X),X)] = \mathbb{E}[\mathbb{E}[\ell(\theta(X),f_0,e_0;Z)\mid X]] = \mathbb{E}[\ell(\theta(X),f_0,e_0;Z)] =: L_{\mathcal{D}}(\theta,g_0)$, (19)

over $\theta\in\Theta$. This formulation clearly falls into our orthogonal statistical learning framework, where the target parameter is the policy $\theta$ and the counterfactual outcome $f_0$ and observed treatment policy $e_0$ together form the nuisance parameter $g_0:=\{f_0,e_0\}$.

We make this discussion concrete for the special case of a binary treatment $T\in\{0,1\}$; additional examples are discussed in Appendix B.1. To simplify notation, define $p_0(t,x) = \mathbb{P}[T=t\mid X=x]$, and observe that $p_0(t,x) = e_0(x)$ if $t=1$ and $1-e_0(x)$ if $t=0$. Then the loss function

$\ell(t,f_0,e_0;Z) = f_0(t,X) + \mathbb{1}\{T=t\}\,\frac{Y-f_0(t,X)}{p_0(t,X)}$, (20)

has the structure in (19): it evaluates to the true risk (17) whenever the true nuisance parameter is plugged in. This formulation leads to the well-known doubly-robust estimator for the counterfactual outcome (Cassel et al., 1976; Robins et al., 1994; Robins and Rotnitzky, 1995; Dudík et al., 2011). It is easy to verify that the resulting population risk is orthogonal with respect to both $f_0$ and $p_0$. We can also obtain an equivalent loss function by subtracting the loss incurred by choosing treatment 0. Taking the difference of (20) at $t=1$ and $t=0$, define

$\beta(f_0,e_0;Z) = f_0(1,X) - f_0(0,X) + T\,\frac{Y-f_0(1,X)}{e_0(X)} - (1-T)\,\frac{Y-f_0(0,X)}{1-e_0(X)}$,

and set $\ell(t,f_0,e_0;Z) = \beta(f_0,e_0;Z)\cdot t$. This formulation leads to a linear population risk:

$L_{\mathcal{D}}(\theta,\{f_0,e_0\}) = \mathbb{E}[\beta(f_0,e_0;Z)\cdot\theta(X)]$. (21)

It is straightforward to show that this population risk satisfies universal orthogonality, so that Theorem 2 can be applied whenever the nuisance parameters are bounded appropriately.
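The following sketch illustrates the doubly-robust score for a binary treatment, written as the difference of the loss (20) evaluated at $t=1$ and $t=0$. The simulated data-generating process, the function name `dr_score`, and the two policies are assumptions made only for this illustration. It checks that the score averages to $\mathbb{E}[f_0(1,X)-f_0(0,X)]$ both with the true outcome model and with a deliberately wrong outcome model paired with the true propensity.

```python
import numpy as np


def dr_score(y, t, f1, f0, e):
    """Doubly-robust score: difference of the loss (20) at t=1 and t=0.

    f1, f0 : outcome-model values f(1, X) and f(0, X)
    e      : propensity values e(X) = P(T = 1 | X)
    """
    return f1 - f0 + t * (y - f1) / e - (1 - t) * (y - f0) / (1 - e)


# Hypothetical data-generating process (assumption for illustration only).
rng = np.random.default_rng(0)
n = 200_000
x = rng.uniform(0.0, 1.0, n)
e0 = 0.3 + 0.4 * x                      # true propensity, bounded away from 0 and 1
t = rng.binomial(1, e0).astype(float)   # observed binary treatment
f_untreated = x                         # f0(0, x)
f_treated = np.ones(n)                  # f0(1, x), so the effect is 1 - x
y = np.where(t == 1.0, f_treated, f_untreated) + 0.5 * rng.normal(size=n)

# Score with the true nuisances, and with a wrong outcome model (f = 0).
beta_true = dr_score(y, t, f_treated, f_untreated, e0)
beta_wrong_f = dr_score(y, t, np.zeros(n), np.zeros(n), e0)

# Empirical version of the linear risk (21) for two simple policies.
always_treat = np.ones(n)
treat_if_large_x = (x > 0.5).astype(float)
risk_always = np.mean(beta_true * always_treat)
risk_threshold = np.mean(beta_true * treat_if_large_x)
```

In this simulation the population value of $\mathbb{E}[f_0(1,X)-f_0(0,X)]$ is $1/2$, and both score averages should be close to it; the second case illustrates the robustness to outcome-model error when the propensity is correct.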

3.5 Construction of Orthogonal Losses

While orthogonal losses are already known for many problem settings and statistical models (treatment effect estimation, policy learning, regression with missing/censored data, and so on), for new problems we often begin with a loss which is not necessarily orthogonal. A natural question, which we address now, is whether one can modify the loss to satisfy orthogonality so that our main theorems can be applied. Suppose we begin with a loss $\ell(\theta(x),g;z)$ such that the nuisance and target parameter are specified by the moment equations

$\mathbb{E}[\nabla_\zeta\ell(\theta_0(x),g_0;z)\mid x] = 0$,
$\mathbb{E}[u-g_0(w)\mid w] = 0$, (22)

where $u\subseteq z$, $x\subseteq w$, and $\nabla_\zeta$ denotes the derivative with respect to the first argument. If $L_{\mathcal{D}}(\theta,g) = \mathbb{E}_z[\ell(\theta(x),g(w);z)]$ is not orthogonal, we can construct an orthogonal loss using a generalization of a construction in Chernozhukov et al. (2018b). For simplicity, we sketch the approach for the special case where $\theta_0$ is scalar-valued.

To begin, assume that there exists a function a0 such that for all x ∈ X , we have

$D_g\,\mathbb{E}[\nabla_\zeta\ell(\theta_0(x),g_0;z)\mid x][g-g_0] = \mathbb{E}[\langle a_0(w),\ g(w)-g_0(w)\rangle\mid x]$. (23)

Under this assumption, we can expand our nuisance parameters to include $a_0$ (that is, define $\tilde{g}_0 := \{g_0,a_0\}$) and construct a new orthogonal loss:

$\tilde{\ell}(\theta(x),\tilde{g};z) := \ell(\theta(x),g;z) + \langle a(w),\ u-g(w)\rangle\cdot\theta(x)$. (24)

Letting $\tilde{L}_{\mathcal{D}}(\theta,\tilde{g}) = \mathbb{E}[\tilde{\ell}(\theta(x),\tilde{g};z)]$ be the new population risk, we have the following claim.

Lemma 1. The population risk $\tilde{L}_{\mathcal{D}}(\theta,\tilde{g})$ satisfies Assumption 1 and Assumption 2.

As a first example, in the special case where the loss depends on $g_0$ only through its evaluation at $w$ (i.e., (22) simplifies to $\mathbb{E}[\nabla_\zeta\ell(\theta_0(x),g_0(w);z)\mid x]=0$), we can take

$a_0(w) = \mathbb{E}[\nabla_\gamma\nabla_\zeta\ell(\theta_0(x),g_0(w);z)\mid w]$. (25)

Of course, to make use of the lemma, we must be able to estimate the new nuisance parameter $a_0$. This can be accomplished through an additional plug-in estimation step based on sample splitting: Split $S$ into folds $S_1$, $S_2$, $S_3$, and $S_4$ of equal size. Estimate $\hat{g}$ on $S_1$, then obtain an initial estimate $\hat{\theta}_{\mathrm{init}}$ for $\theta_0$ by solving $\arg\min_{\theta\in\Theta}L_{S_2}(\theta,\hat{g})$, where $L_{S_2}$ denotes the empirical loss over $S_2$. Next, use the initial estimator to compute a plug-in estimator $\hat{a}$ for $a_0$ by regressing onto the "targets" $\nabla_\zeta\nabla_\gamma\ell(\hat{\theta}_{\mathrm{init}}(x),\hat{g}(w);z)$ on $S_3$. Finally, produce the main estimator for the target parameter by solving $\hat{\theta} = \arg\min_{\theta\in\Theta}\tilde{L}_{S_4}(\theta,\{\hat{g},\hat{a}\})$. The key idea behind this scheme is that the initial estimator $\hat{\theta}_{\mathrm{init}}$ will not be able to take advantage of orthogonality, but its estimation error will only enter the final bound through the error of $\hat{a}$, and thus will only have higher-order impact on the rate. This approach is applicable for the problem of estimating utility functions in models of strategic competition, as used in Chernozhukov et al. (2018b); see Appendix C for a worked out example. For some models, including utility function estimation, $a_0$ is a known function of $\theta_0$ and $g_0$, so that the extra regression to estimate $\hat{a}$ given the initial estimators is not required.

A more general setting where the loss has the form in (23) is as follows. Suppose that all functions $g\in\mathcal{G}$ are conditionally square-integrable in the sense that for all $x$, $\mathbb{E}[g^2(w)\mid x]<\infty$, and suppose there exist functions $\beta_0$, $T_x$ such that we can write

$\mathbb{E}[\nabla_\zeta\ell(\theta_0(x),g;z)\mid x] = \beta_0(x) + T_x(g)$,

where $T_x(g)$ is a linear operator on $g$ with uniformly bounded operator norm:

$\sup_x\|T_x(g)\|_{\mathrm{op}} := \sup_{x,\,g\neq 0}\frac{\mathbb{E}[\nabla_\zeta\ell(\theta_0(x),g;z)\mid x]}{\sqrt{\mathbb{E}[g^2(w)\mid x]}} < \infty$. (26)

By the Riesz-Fréchet representation theorem, we can express the operator $T_x$ as

$T_x(g) = \mathbb{E}[a_0(w)\,g(w)\mid x]$, (27)

where we have used that $x\subseteq w$ to simplify. Hence, we have

$\mathbb{E}[\nabla_\zeta\ell(\theta_0(x),g;z)\mid x] = \beta_0(x) + \mathbb{E}[a_0(w)\,g(w)\mid x]$, (28)

so that (23) is satisfied for $a_0$ induced by the family of Riesz representers for operators $T_x$. This is a variant of the Riesz representer approach presented in Chernozhukov et al. (2018d). In Appendix C, we show that this construction recovers the treatment effect estimation example presented in the introduction.

4 Empirical Risk Minimization with a Nuisance Component

In this section we develop algorithms and analysis for orthogonal statistical learning with M-estimation losses, i.e. losses that take the form

$L_{\mathcal{D}}(\theta,g) = \mathbb{E}[\ell(\theta(x),g(w);z)]$. (29)

We analyze the case where the algorithm used for the target parameter (the second stage algorithm) is one of the most natural and widely used algorithms: plug-in empirical risk minimization (ERM). Specifically, we define the empirical risk via

$L_{S_2}(\theta,g) = \frac{1}{n}\sum_{i=1}^n\ell(\theta(x_i),g(w_i);z_i)$, (30)

where we adopt the convention that $|S|=2n$ with $S_2=\{z_1,\ldots,z_n\}$ to keep notation compact. The plug-in ERM algorithm returns the minimizer of the plug-in empirical loss obtained by plugging in the first-stage estimate of the nuisance component:

$\hat{\theta} = \arg\min_{\theta\in\Theta}L_{S_2}(\theta,\hat{g})$. (31)
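As an illustration of two-stage plug-in ERM with sample splitting, the sketch below applies it to the treatment effect loss (14) from Section 3.3. Everything specific here is an assumption made for the example: the simulated data-generating process (with $X=W$), the polynomial least-squares nuisance estimators, and the linear target class $\theta(x) = a + b\,x$. It is not the paper's implementation, only one instance of the template.

```python
import numpy as np

rng = np.random.default_rng(0)


def simulate(n):
    """Hypothetical partially linear model with X = W (illustration only)."""
    w = rng.uniform(-1.0, 1.0, n)
    e0 = 0.5 + 0.3 * w                        # propensity e_0(w) in [0.2, 0.8]
    t = rng.binomial(1, e0).astype(float)     # binary treatment
    theta0 = 1.0 + 2.0 * w                    # true effect theta_0(x)
    y = t * theta0 + w + rng.normal(size=n)   # f_0(w) = w, unit-variance noise
    return w, t, y


def poly(w, degree):
    """Polynomial feature matrix with columns 1, w, ..., w^degree."""
    return np.vander(w, degree + 1, increasing=True)


n = 20_000
w1, t1, y1 = simulate(n)   # fold S1: nuisance estimation
w2, t2, y2 = simulate(n)   # fold S2: target estimation

# Stage 1 (on S1): least-squares estimates of m_0 = E[Y | W] and e_0 = E[T | W].
coef_m = np.linalg.lstsq(poly(w1, 2), y1, rcond=None)[0]
coef_e = np.linalg.lstsq(poly(w1, 1), t1, rcond=None)[0]

# Stage 2 (on S2): plug-in ERM for the loss (14) over theta(x) = a + b * x.
y_res = y2 - poly(w2, 2) @ coef_m             # Y - m_hat(W)
t_res = t2 - poly(w2, 1) @ coef_e             # T - e_hat(W)
design = t_res[:, None] * poly(w2, 1)         # (T - e_hat) * [1, x]
theta_hat = np.linalg.lstsq(design, y_res, rcond=None)[0]
```

Because the plug-in loss is quadratic in the coefficients of $\theta$, the second-stage minimizer is an ordinary least-squares fit of the outcome residual on the treatment residual times the features. In this simulation `theta_hat` should be close to the true coefficients $(1, 2)$.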

The goal of this section is to provide generalization error bounds for the plug-in ERM algorithm and variants. In particular, we will upper bound the second-stage rate $\mathrm{Rate}_{\mathcal{D}}(\Theta,S_2,\delta;\hat{g})$ as a function of standard complexity measures of the target class $\Theta$, and show that the impact of $\hat{g}$ on the rate achievable by ERM is negligible: classical excess risk bounds carry over up to lower order terms and constant factors. One can easily combine our results on the rate $\mathrm{Rate}_{\mathcal{D}}(\Theta,S_2,\delta;\hat{g})$ from this section with the main theorems from the previous section to obtain oracle guarantees on the excess risk, wherein the error due to nuisance estimation is of second order.

In the fast rate regime we offer a generalization of the local Rademacher complexity analysis of Bartlett et al. (2005) in the presence of an estimated nuisance component and show that the notion of the critical radius of the class $\Theta$ still governs the rate $\mathrm{Rate}_{\mathcal{D}}(\Theta,S_2,\delta;\hat{g})$ up to second order error. This result, coupled with our main theorem in the previous section, leads to several applications of our theory to particular target classes, including sparse linear models, neural networks and kernel classes; these are discussed at the end of the section.

In the slow rate regime (i.e., for generic Lipschitz losses), we show that the Rademacher complexity of the loss governs the rate, which subsequently can be upper bounded by the entropy integral of the function class. More importantly, we offer a novel moment-penalized variant of the ERM algorithm that achieves a rate whose leading term is equal to the critical radius, multiplied by the variance of the population loss evaluated at the optimal target parameter. This offers an improvement over prior variance-penalized ERM approaches (Maurer and Pontil, 2009), whose leading term depends on the metric entropy of the target function class evaluated at a single scale, and which typically is larger than the critical radius (the latter depending on a fixed point of the entropy integral).

Technical preliminaries. To present our main results, we need to introduce additional tools from empirical process theory and statistical learning. For any real-valued function class $\mathcal{G}$, define the localized Rademacher complexity:

$\mathcal{R}_n(\mathcal{G},\delta) := \mathbb{E}_{\epsilon,z_{1:n}}\left[\sup_{g\in\mathcal{G}:\|g\|_{L_2(\mathcal{D})}\le\delta}\frac{1}{n}\sum_{i=1}^n\epsilon_i\,g(z_i)\right]$, (32)

where $\epsilon_1,\ldots,\epsilon_n$ are independent Rademacher random variables. Let $\mathcal{R}_n(\mathcal{G})$ denote the non-localized Rademacher complexity (that is, $\mathcal{R}_n(\mathcal{G},\infty)$). We also make use of the metric entropy of a function class (which is closely related to the Rademacher complexity).

Definition 3 (Metric Entropy). For any real-valued function class $\mathcal{G}$ and sample $z_{1:n}$, the empirical metric entropy $H_p(\mathcal{G},\varepsilon,z_{1:n})$ is the logarithm of the size of the smallest function class $\mathcal{G}'$ such that for any $g\in\mathcal{G}$ there exists $g'\in\mathcal{G}'$ with $\|g-g'\|_{L_p(z_{1:n})}\le\varepsilon$. Moreover, $H_p(\mathcal{G},\varepsilon,n)$ will denote the maximal empirical entropy over all possible sample sets $z_1,\ldots,z_n$.

Finally, for a vector-valued function class F, let F|t = {ft :(f1, . . . , ft, . . . , fd) ∈ F} denote the marginal real-valued function class that corresponds to coordinate t of the functions in class F.
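For intuition about the localized Rademacher complexity in (32), the following sketch gives a Monte Carlo estimate for a finite function class on a fixed sample. Two simplifications are assumptions of this illustration rather than part of the definition: the sample is held fixed instead of being averaged over, and the localization uses the empirical $L_2$ norm as a proxy for the $L_2(\mathcal{D})$ norm. The function name and the convention of returning zero for an empty localized class are likewise hypothetical.

```python
import numpy as np


def empirical_local_rademacher(values, delta, n_draws=2000, seed=0):
    """Monte Carlo estimate of a localized Rademacher complexity.

    values : array of shape (num_functions, n), values[j, i] = g_j(z_i)
    delta  : localization radius, applied to the empirical L2 norm
    """
    rng = np.random.default_rng(seed)
    n = values.shape[1]
    # Keep only the functions whose empirical L2 norm is at most delta.
    norms = np.sqrt(np.mean(values ** 2, axis=1))
    local = values[norms <= delta]
    if local.shape[0] == 0:
        return 0.0  # convention used here for an empty localized class
    # Independent Rademacher signs, one row per Monte Carlo draw.
    eps = rng.choice([-1.0, 1.0], size=(n_draws, n))
    # For each draw, the supremum over the class of (1/n) sum_i eps_i g(z_i).
    sups = (eps @ local.T / n).max(axis=1)
    return float(sups.mean())


# Toy class {+1, -1} of constant functions on n = 100 points.
n = 100
toy_class = np.vstack([np.ones(n), -np.ones(n)])
estimate = empirical_local_rademacher(toy_class, delta=1.0)
empty = empirical_local_rademacher(toy_class, delta=0.5)
```

For this toy class the supremum equals the absolute value of the average sign, so the estimate should be close to $\sqrt{2/(\pi n)}$, which is roughly $0.08$ for $n=100$.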

4.1 Fast Rates via Local Rademacher Complexities

Our first contribution is an extension of the foundational results of Bartlett et al.(2005); Koltchinskii and Panchenko(2000)—which bound the excess risk for empirical risk minimization in terms of local Rademacher complexities—to incorporate misspecification due to nuisance parameter estimation error. A crucial parameter in this approach is the critical radius δn of a function class G, defined as the smallest solution to the inequality

$\mathcal{R}_n(\mathcal{G},\delta_n) \le \delta_n^2$. (33)
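The critical radius in (33) can be found numerically once a bound on the localized complexity is available. The sketch below uses bisection to locate the smallest positive $\delta$ satisfying the inequality, under the assumption that $\delta\mapsto\mathcal{R}_n(\mathcal{G},\delta)/\delta$ is non-increasing (as holds for star-shaped classes), so that the inequality holds for every $\delta$ above the critical radius. The complexity bound $\delta\sqrt{d/n}$ used in the example is a standard upper bound for a $d$-dimensional linear class and is assumed here purely for illustration; the function name `critical_radius` is hypothetical.

```python
import math


def critical_radius(local_complexity, upper=1.0, iters=100):
    """Smallest positive delta with local_complexity(delta) <= delta^2.

    Assumes delta -> local_complexity(delta) / delta is non-increasing and
    that `upper` already satisfies the inequality.
    """
    if local_complexity(upper) > upper * upper:
        raise ValueError("`upper` must satisfy the critical inequality")
    lo, hi = 0.0, upper
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid > 0.0 and local_complexity(mid) <= mid * mid:
            hi = mid   # inequality holds: the critical radius is at most mid
        else:
            lo = mid   # inequality fails: the critical radius is above mid
    return hi


# Assumed complexity bound for a d-dimensional linear class: delta * sqrt(d / n).
n, d = 1000, 10
delta_n = critical_radius(lambda delta: delta * math.sqrt(d / n))
```

Under this assumed bound the inequality reads $\delta\sqrt{d/n}\le\delta^2$, so the bisection returns approximately $\sqrt{d/n}$, and the resulting excess risk scale $\delta_n^2$ is of order $d/n$.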

Classical work shows that in the absence of a nuisance component, if a loss $\ell(\theta(z);z)$ is Lipschitz in its first argument and satisfies standard assumptions required for fast rates (strong convexity in the first argument), then empirical risk minimization achieves an excess risk bound of order $\delta_n^2$. For the case of parametric classes, $\delta_n = \tilde{O}(n^{-1/2})$, leading to the fast $\tilde{O}(n^{-1})$ rates for strongly convex losses. For more general classes (cf. Wainwright (2019)) the critical radius is, up to constant factors, equal to the solution to an inequality on the metric entropy of the function class (cf. Appendix D.1):

$\int_{\delta_n^2/8}^{\delta_n}\sqrt{\frac{H_2(\mathcal{G}(\delta_n,z_{1:n}),\varepsilon,z_{1:n})}{n}}\,d\varepsilon \le \frac{\delta_n^2}{20}$, (34)

where $\mathcal{G}(\delta,z_{1:n}) := \{g\in\mathcal{G} : \|g\|_{L_2(z_{1:n})}\le\delta\}$; see Appendix D for concrete examples.

Our first theorem in this section extends this result in the presence of a nuisance component and bounds the excess risk of the plug-in ERM algorithm by the critical radius of the target function class $\Theta$ (more precisely, the worst-case critical radius for each coordinate of the target class, since we deal with vector-valued function classes).

Theorem 3 (Fast Rates for Plug-In ERM). Consider a function class $\Theta:\mathcal{X}\to\mathbb{R}^{K_2}$ with $R := \sup_{\theta\in\Theta}\|\theta\|_{L_\infty(\ell_2,\mathcal{D})}\vee 1$. Let $\delta_n^2 = \Omega\left(\frac{R^2 K_2\log(\log(n))}{n}\right)$ be any solution to the system of equations:

$\mathcal{R}_n(\mathrm{star}(\Theta|_t-\theta_t^\star),\delta) \le \frac{\delta^2}{R}, \quad \forall t\in\{1,\ldots,d\}$, (35)

where $\theta_t^\star$ is the projection of $\theta^\star$ onto coordinate $t$. Suppose that $\ell(\cdot,\hat{g}(w);z)$ is $L$-Lipschitz in its first argument with respect to the $\ell_2$ norm and that the population risk $L_{\mathcal{D}}$ satisfies Assumptions 1 to 4 with $\|\cdot\|_\Theta = \|\cdot\|_{L_2(\ell_2,\mathcal{D})}$ and $\|\cdot\|_{\mathcal{G}}$ arbitrary. Let $\hat{\theta}$ be the outcome of the plug-in ERM algorithm. Then with probability at least $1-\delta$,

$L_{\mathcal{D}}(\hat{\theta},\hat{g}) - L_{\mathcal{D}}(\theta^\star,\hat{g}) = O\left(C_1\cdot\left(\frac{\delta_n^2}{R^2}+\frac{\log(1/\delta)}{n}\right) + C_2\cdot\|\hat{g}-g_0\|_{\mathcal{G}}^4\right)$, (36)

2 2 L  LK2 κ β2  where C1 = K2 λ + RL and C2 = R λ ∨ λ2 .

We emphasize that Theorem 3 provides an excess risk bound relative to the plug-in estimate $\hat{g}$, and all that is required to obtain an excess risk bound at $g_0$ is to apply Theorem 1 or Theorem 2. Critically, in both theorems the error due to nuisance estimation error scales as $\|\hat{g}-g_0\|_{\mathcal{G}}^4$ due to orthogonality, meaning that we can use a complex function class for nuisance estimation without spoiling the rate for the target class. In Appendix H.1, we show (Lemma 12) that when the loss $\ell$ is smooth with respect to the first argument, the lower bound on $\delta_n$ required by Theorem 3 can be dropped.

4.2 Slow Rates and Variance Penalization

We now turn to the slow rate regime of Section 3.2, where the loss is not necessarily strongly convex in the prediction. In this setting we prove upper bounds on the generalization error of a variance penalized version of the plug-in ERM algorithm. Our main result is a slow rate that scales favorably with the variance of the loss rather than the range, and that is robust to nuisance estimation error. The basic algorithm we analyze first estimates the nuisance parameter, then estimates the optimal loss value $\mu^\star := \inf_{\theta\in\Theta}L_{\mathcal{D}}(\theta,g_0)$ using auxiliary samples, and finally performs plug-in empirical risk minimization with an empirical variance penalty which is centered using the estimate for $\mu^\star$. To simplify notation, we assume for this result that $|S|=3n$ and that $S$ is partitioned into equal splits $S = S_1\cup S_2\cup S_3$. Define the variance of the loss at $(\theta^\star,g_0)$ via

$V^\star = \mathrm{Var}(\ell(\theta^\star(\cdot),g_0(\cdot);\cdot))$.

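Our main theorem is as follows.

Theorem 4 (Plug-In ERM with Centered Second Moment Penalization). Consider the centered second moment-penalized plug-in empirical risk minimizer

$$\hat\theta = \arg\min_{\theta \in \Theta} \; L_{S_2}(\theta, \hat g) + 36\,\delta_n R^{-1} \big\|\ell(\theta(\cdot), \hat g(\cdot); \cdot) - \hat\mu\big\|_{L_2(S_2)}, \tag{37}$$

where $\hat\mu = \inf_{\theta \in \Theta} L_{S_3}(\theta, \hat g)$. Consider the function class $F = \{\ell(\theta(\cdot), \hat g(\cdot); \cdot) : \theta \in \Theta\}$, with $R := \sup_{f \in F} \|f\|_{L_\infty(D)} \vee 1$ and $f^\star := \ell(\theta^\star(\cdot), \hat g(\cdot); \cdot)$. Let $\delta_n^2 \ge 0$ be any solution to the inequality

$$\mathcal{R}_n\big(\mathrm{star}(F - f^\star), \delta\big) \le \frac{\delta^2}{R}. \tag{38}$$

Suppose that $\gamma \mapsto \ell(\theta(x), \gamma; z)$ is $L$-Lipschitz almost surely, and let $\|\cdot\|_G = \|\cdot\|_{L_2(\ell_2, D)}$. Then with probability at least $1 - \delta$,

$$L_D(\hat\theta, \hat g) - L_D(\theta^\star, \hat g) = O\left( \sqrt{V^\star}\left( \frac{\delta_n}{R} + \sqrt{\frac{\log(1/\delta)}{n}} \right) + \frac{1}{R}\left( \delta_n^2 + \mathcal{R}_n^2(\ell \circ \Theta) + L^2 \|\hat g - g_0\|_G^2 \right) + R\,\frac{\log(1/\delta)}{n} \right).$$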

Our approach offers an improvement over the rates for empirical variance penalization in Maurer and Pontil (2009), which provides a generalization error bound whose leading term is of the form

$$\sqrt{\frac{\mathrm{Var}_n\big(\ell(\theta^\star(\cdot), \hat g(\cdot), \cdot)\big)\, \mathcal{H}_\infty(\ell \circ \Theta, n^{-1}, z_{1:n})}{n}}.$$

The drawback of such a bound is that it evaluates the metric entropy at a fixed approximation level of $1/n$, which can be suboptimal compared to the critical radius. For example, this bound scales with $\sqrt{d \log n / n}$ for classes of VC dimension $d$, which we now show can be improved as a consequence of our general machinery.
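The estimator in (37) can be sketched for a finite candidate class as follows. This is a minimal illustration rather than the authors' implementation: it assumes the nuisance estimate from $S_1$ has already been plugged in, so that each candidate is represented only by its vector of losses on $S_2$ and $S_3$, and it treats `delta_n` and `loss_range` (standing in for $\delta_n$ and $R$) as user-supplied inputs. All names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def centered_second_moment_erm(
    losses_s2: Sequence[Sequence[float]],
    losses_s3: Sequence[Sequence[float]],
    delta_n: float,
    loss_range: float,
    constant: float = 36.0,
) -> Tuple[int, float]:
    """Penalized ERM over a finite class; returns (chosen index, mu_hat).

    losses_s2[k][i] is the plug-in loss of candidate k on the i-th point of S2,
    and losses_s3[k][i] is the same quantity on the auxiliary split S3.
    """
    # Estimate the optimal loss value on the auxiliary split S3.
    mu_hat = min(sum(row) / len(row) for row in losses_s3)

    best_index, best_objective = -1, math.inf
    for k, row in enumerate(losses_s2):
        empirical_risk = sum(row) / len(row)
        # Empirical L2(S2) norm of the loss centered at mu_hat.
        centered_norm = math.sqrt(sum((v - mu_hat) ** 2 for v in row) / len(row))
        objective = empirical_risk + constant * delta_n / loss_range * centered_norm
        if objective < best_objective:
            best_index, best_objective = k, objective
    return best_index, mu_hat


# Toy data (assumed): candidate 0 has constant loss, candidate 1 has a slightly
# smaller mean loss but a much larger spread.
stable = [0.5] * 100
noisy = [0.0, 0.98] * 50
candidates = [stable, noisy]
chosen, mu_hat = centered_second_moment_erm(candidates, candidates, delta_n=0.01, loss_range=1.0)
plain_erm = min(range(len(candidates)), key=lambda k: sum(candidates[k]))
print(chosen, plain_erm)
```

In this toy example the penalized rule selects the low-variance candidate, whereas unpenalized ERM selects the candidate with the marginally smaller empirical mean. Reusing the same loss vectors for both splits is a simplification made only to keep the example short.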

Application to VC Classes. We now use the general tools developed in this section to give efficient/variance-dependent rates for VC classes with general Lipschitz losses. Our main result shows that for VC classes with dimension $d$, the excess risk enjoyed by variance penalization grows exactly as $O(\sqrt{V^\star d/n})$ (where $V^\star$, as before, is the variance of the loss at the pair $(\theta^\star, g_0)$) so long as the nuisance estimator converges at a rate of $o(n^{-1/4})$. The key to our approach is to assume boundedness of the so-called Alexander capacity function, a classical quantity that arises in the study of ratio type empirical processes (Giné and Koltchinskii, 2006). To be more precise, for this example we assume that $\Theta$ is a class of binary predictors with VC dimension $d$, and let $\ell$ have the following policy learning structure:

$$\ell(\theta, g; z) = \Gamma(g, z) \cdot \theta(x), \qquad L_D(\theta, g) = E[\ell(\theta, g; z)],$$

where $\Gamma$ is a known function. Our goal is to derive a bound for which the leading term only scales with $V^\star$ rather than the loss range. Our results depend on a variant of the Alexander capacity function (Giné and Koltchinskii, 2006; Hanneke, 2014). Letting

$$\Theta_0(\varepsilon) = \left\{ \theta \in \Theta : E\big[\Gamma^2(g_0, z)(\theta(x) - \theta^\star(x))^2\big] \le \varepsilon^2 \right\},$$

the capacity function is defined as

$$\tau^2(\varepsilon) = \frac{E\big[\sup_{\theta \in \Theta_0(\varepsilon)} \Gamma^2(g_0, z)(\theta(x) - \theta^\star(x))^2\big]}{\varepsilon^2}. \tag{39}$$

When $\Gamma$ is the unweighted classification loss, this definition recovers the classical definition of the capacity function (Giné and Koltchinskii, 2006; Hanneke, 2014). Beyond boundedness of the capacity function, we make the following assumption.

Assumption 7. Assumption 5 holds, along with the following bounds:

• $|\Gamma(g, z)| \le R$ almost surely for all $g \in G$, for some $R \ge 1$.

• $E[\Gamma^2(g_0, z) \mid x] \ge \gamma$ almost surely.

• $\big(E(\Gamma(g, z) - \Gamma(g_0, z))^4\big)^{1/4} \le L\|g - g_0\|_G$ for all $g \in G$, for some norm $\|\cdot\|_G$.

• The first stage algorithm provides an estimation error bound with respect to $\|\cdot\|_G$, i.e. $\|\hat g - g_0\|_G \le \mathrm{Rate}_D(G, S, \delta)$ with probability at least $1 - \delta$ over the draw of $S$.

• Assumption 6 holds with respect to $\|\cdot\|_G$ with constant $\beta$.

Theorem 5. Suppose that Assumption 7 holds, and define $\tau_0 := \sup_{\varepsilon \ge \sqrt{d/n}} \tau(\varepsilon)$. Then the variance-penalized empirical risk minimizer guarantees that with probability at least $1 - \delta$,

$$L_D(\hat\theta, g_0) - L_D(\theta^\star, g_0) \le \tilde O\left( \sqrt{\frac{V^\star d \log \tau_0}{n}} + \frac{(R + L)\, d \log \tau_0}{n} + \big(\beta + (R + L)^2 \gamma^{-1/2}\big) \cdot \big(\mathrm{Rate}_D(G, S_1, \delta/2)\big)^2 \right).$$

Note that whenever $\mathrm{Rate}_D(G, S_1, \delta/2) = o(n^{-1/4})$, the asymptotic rate depends only on the variance at $\theta^\star$ and $g_0$, not on the problem-dependent parameters $L$, $R$, $\beta$, and $\gamma$. Furthermore, whenever the capacity function is constant the asymptotic rate is exactly $O(\sqrt{V^\star d/n})$.

Variance-dependent bounds that obtain the efficient $O(\sqrt{V^\star d/n})$ rate have been the subject of much recent investigation, and there is much interest in understanding when the $O(\sqrt{V^\star d \log n/n})$ rate obtained by naive approaches can be improved. To give a brief survey, the seminal empirical variance bound due to Maurer and Pontil (2009), when applied directly to this setting, gives a suboptimal $O(\sqrt{V^\star d \log n/n})$ rate. Recent work of Athey and Wager (2017) shows that for a specific loss and nuisance parameter setup arising in policy learning, the $\log n$ can be replaced with a certain worst-case variance parameter. Our result, Theorem 5, is complementary, and shows that the $\log n$ can be replaced by the capacity function for general losses. It appears unlikely that the $\log n$ factor can be removed without at least some type of assumption. Indeed, the results in Rakhlin et al. (2017) imply that there are VC classes for which the critical radius grows as $\sqrt{d \log n/n}$ in the worst case.

The proof of Theorem 5 can be broken into three parts. First, we apply the previous results of this section to show that the excess risk obtained by variance penalization depends on the critical radius of the class $\ell \circ \Theta$. Second, we show that in the absence of first-stage estimation error, the capacity function controls the critical radius. Finally, we show that the impact of nuisance estimation error on the capacity function is of second order.

5 Minimax Oracle Rates for Square Losses

The previous section developed sharp orthogonal learning guarantees for a specific algorithm, empirical risk minimization. We now shift our focus to understanding what rates can be achieved by any algorithm, and how this is determined by intrinsic properties of the target and nuisance classes. We give achievability results for oracle rates as a function of the relationship between the metric entropy of the nuisance and target classes. For any real-valued function class F, we will say that the complexity of F is p if

$$H_2(F, \varepsilon, n) = O\big(\max\{\varepsilon^{-p}, \log(1/\varepsilon)\}\big). \tag{40}$$

When $p = 0$, this corresponds to the case of parametric functions (e.g., VC-subgraph classes), while for $p > 0$, we recover nonparametric function classes, such as Lipschitz/smooth functions or kernel spaces. We let $p_1$ and $p_2$ denote the maximum complexity of any output coordinate projection for the nuisance and target class, respectively. Our goal is to understand for what pairs $(p_1, p_2)$ the sample splitting meta-algorithm can achieve oracle rates.

We focus on the important special case of square losses of the form

$$\ell(\theta(x), g(w); z) = \big(\langle \Lambda(g(w), v), \theta(x) \rangle - \Gamma(g(w), z)\big)^2, \qquad L_D(\theta, g) = E[\ell(\theta, g; z)],$$

where $\Lambda$ and $\Gamma$ are known functions, and where we recall from Section 2 that $x$, $w$ are subsets of the data $z$, and $v \subseteq z$ is an arbitrary auxiliary subset of the data. We assume that the nuisance parameters are defined in terms of regression problems, i.e., that $g_0(w) = E[u \mid w]$ for some known random vector $u \subseteq z$. This assumption is standard in the semiparametric literature (Bickel et al., 1993; Kosorok, 2008; van der Laan and Rose, 2011), and implies that each coordinate $t$ of $g_0$ may be expressed as the minimizer of a squared loss: $g_{0,t} = \arg\min_{g_t \in G|_t} E(g_t(w) - u_t)^2$. In this setting, a sufficient condition for orthogonality is that

$$E\big[\nabla_\gamma \nabla_\zeta\, \ell(\theta^\star(x), g_0(w), z) \mid w\big] = 0, \tag{41}$$

where $\nabla_\zeta$ and $\nabla_\gamma$ denote the gradient of $\ell$ with respect to the first and second argument, respectively.
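Condition (41) can be checked numerically in a simple instance. The sketch below is an illustration under assumptions that are not taken from this section: it uses a partially linear model $y = \theta_0 t + f(w) + \varepsilon$ with $t = e(w) + \eta$, takes $\Lambda(g(w), v) = t - g_t(w)$ and $\Gamma(g(w), z) = y - g_y(w)$ with scalar constant $\theta$, and picks the specific functions `e_w` and `f_w` arbitrarily. It reports the unconditional Monte Carlo average of the mixed derivative, which should be near zero whenever the conditional statement (41) holds.

```python
import math
import random
from typing import Callable, Tuple


def loss(theta: float, g_y: float, g_t: float, y: float, t: float) -> float:
    """Residualized square loss ((y - g_y) - (t - g_t) * theta)^2."""
    return ((y - g_y) - (t - g_t) * theta) ** 2


def mixed_partial(f: Callable[[float, float], float], a: float, b: float, h: float = 1e-3) -> float:
    """Central finite-difference estimate of d^2 f / (da db) at (a, b)."""
    return (f(a + h, b + h) - f(a + h, b - h) - f(a - h, b + h) + f(a - h, b - h)) / (4 * h * h)


def average_cross_derivatives(n: int = 20000, theta0: float = 1.5, seed: int = 0) -> Tuple[float, float]:
    """Monte Carlo averages of the mixed derivatives at the true (theta0, g0)."""
    rng = random.Random(seed)
    sum_y, sum_t = 0.0, 0.0
    for _ in range(n):
        w = rng.uniform(-1.0, 1.0)
        e_w = 0.5 * w                     # E[t | w], hypothetical choice
        f_w = math.sin(w)                 # baseline function, hypothetical choice
        t = e_w + rng.gauss(0.0, 1.0)
        y = theta0 * t + f_w + rng.gauss(0.0, 1.0)
        m_w = theta0 * e_w + f_w          # E[y | w]
        # Mixed derivative in theta and the nuisance value g_y(w).
        sum_y += mixed_partial(lambda th, gy: loss(th, gy, e_w, y, t), theta0, m_w)
        # Mixed derivative in theta and the nuisance value g_t(w).
        sum_t += mixed_partial(lambda th, gt: loss(th, m_w, gt, y, t), theta0, e_w)
    return sum_y / n, sum_t / n


avg_y, avg_t = average_cross_derivatives()
print(avg_y, avg_t)
```

Both averages should be close to zero up to Monte Carlo error, which is what orthogonality predicts for this construction. Replacing the true nuisance values `m_w` and `e_w` by incorrect ones inside the loss definition is a quick way to see the averages move away from zero.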

In the absence of nuisance parameters, the minimax optimal rates for the excess risk and the mean squared error have been characterized both for the well-specified setting, in which there exists $\theta_0 \in \Theta$ such that

$$E[\Gamma(g_0(w), z) \mid w, v] = \langle \Lambda(g_0(w), v), \theta_0(x) \rangle, \tag{42}$$

and in the misspecified setting, where this assumption is removed. In the former setting the minimax rates are of order $\Theta\big(n^{-\frac{2}{2+p_2}}\big)$ (Yang and Barron, 1999), while in the latter setting the optimal rate is $\tilde\Theta\big(n^{-(\frac{2}{2+p_2} \wedge \frac{1}{p_2})}\big)$ (Rakhlin et al., 2017). We show that under orthogonality, the optimal well-specified and misspecified rates can be achieved in the presence of nuisance parameters even when the nuisance class $G$ is larger than the target class $\Theta$, provided it is not too much larger. This generalizes the large body of results on semiparametric inference (Levit, 1976; Ibragimov and Has'Minskii, 1981; Pfanzagl, 1982; Bickel, 1982; Klaassen, 1987; Robinson, 1988; Bickel et al., 1993; Newey, 1994; Robins and Rotnitzky, 1995; Ai and Chen, 2003; van der Laan and Dudoit, 2003; van der Laan and Robins, 2003; Ai and Chen, 2007; Tsiatis, 2007; Kosorok, 2008; van der Laan and Rose, 2011; Ai and Chen, 2012; Chernozhukov et al., 2016; Belloni et al., 2017; Chernozhukov et al., 2018a), which show for various settings and assumptions that when the target class is parametric, one can obtain a $\sqrt{n}$-consistent estimator for the target as long as the nuisance estimator converges at an $n^{-1/4}$ rate.

Our main workhorse for the results in this section is the "Aggregation of ε-Nets" or "Skeleton Aggregation" algorithm described in Yang and Barron (1999) and extended to random design in Rakhlin et al. (2017). We use Skeleton Aggregation as-is to provide rates for the first stage, and provide an extension in the presence of nuisance parameters for the second stage.
Since our aim is to characterize the optimal dependence on the complexity of the classes $\Theta$ and $G$, which is already quite technical, we assume that all other problem-dependent parameters are constant. This is primarily for expository purposes.

Assumption 8. The classes are bounded in the sense that for all $\theta \in \Theta + \mathrm{star}(\Theta - \Theta)$ and $g \in G$ the following bounds hold almost surely:

a) $\langle \Lambda(g(w), w), \theta(x) \rangle \in [-1, +1]$,
b) $\Gamma(g(w), z) \in [-1, +1]$,
c) $\Lambda(g(w), v)\Lambda(g(w), v)^\top \preceq I$,
d) $\|g(w)\|_\infty, \|\theta(x)\|_\infty \le 1$.

Moreover,

• $\|u\|_\infty \le 1$ almost surely.

• The functions $\{\Lambda_t(\cdot, v)\}_{t=1}^{K_2}$ and $\Gamma(\cdot, z)$ have $O(1)$-Lipschitz gradients with respect to $\ell_2$.

• The strong convexity condition $E\langle \Lambda(g_0(w), v), \theta(x) - \theta^\star(x) \rangle^2 \ge \gamma\, E\|\theta(x) - \theta^\star(x)\|_2^2$ is satisfied for all $\theta \in \Theta + \mathrm{star}(\Theta - \Theta)$ for some $\gamma = \Omega(1)$.

• $K_1, K_2 = O(1)$.

For our results in the regime where $p_1 \ge 2$, we technically require that Assumption 8 is satisfied not just for all $g \in G$, but for all $g \in G + \mathrm{star}(G - G, 0)$. Assumption 8 implies that Assumption 3 and Assumption 4 are satisfied with respect to the norms $\|\theta\|_\Theta := \sqrt{E\langle \Lambda(g_0(w), v), \theta(x) \rangle^2}$ and $\|\cdot\|_G = \|\cdot\|_{L_2(\ell_2, D)}$. Since typical results on minimax oracle rates provide rates for the nuisance $g$ with respect to the norm $\|\cdot\|_{L_2(\ell_2, D)}$, we assume control on the ratio between these norms.

Assumption 9 (Moment Comparison). There is a constant $C_{2\to4}$ such that

$$\frac{\|g - g_0\|_{L_4(\ell_2, D)}}{\|g - g_0\|_{L_2(\ell_2, D)}} \le C_{2\to4}, \quad \forall g \in G.$$

The moment comparison condition has been used recently in statistics as a minimal assumption for learning without boundedness (Lecué and Mendelson, 2016; Mendelson, 2014; Liang et al., 2015). For example, suppose that each $g \in G$ has the form $x \mapsto \langle w, x \rangle$ for $w, x \in \mathbb{R}^d$. Then $C_{2\to4} \le 3^{1/4}$ if $x$ is mean-zero Gaussian and $C_{2\to4} \le \sqrt{8}$ if $x$ follows any distribution that is independent across all coordinates and symmetric (via the Khintchine inequality). Moment comparison is also implied by the "subgaussian class" assumption used in Mendelson (2011) and Lecué and Mendelson (2016) (see footnote 2).

We emphasize that the moment constant $C_{2\to4}$ does not enter the leading term in any of our bounds; it enters only the $\mathrm{Rate}_D(G, \ldots)$ term in Theorem 1, and so it does not affect the asymptotic rates under the conditions on metric entropy growth of $G$ that we prescribe in the sequel. We also note that this condition is not required at all for many classes of interest, where direct $L_4$ estimation rates are available (see discussion in Appendix A). Moreover, if guarantees stronger than $L_4$ are available (e.g., $L_\infty$ error or parameter recovery), then, in addition to removing the moment condition, we can obtain tighter dependence on problem parameters through an alternative version of the Cauchy-Schwarz argument used to establish Assumption 3/Assumption 4 for the setup in this section (see the proof of Lemma 2 in Appendix A). However, we adopt the condition here because it allows us to develop guarantees for arbitrary classes at the highest possible level of generality.
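The linear example for moment comparison can be checked by simulation. The following sketch is illustrative only: the weight vector `w`, the sample size, and the helper names are assumptions, and the empirical ratio is a Monte Carlo estimate of the $L_4$/$L_2$ ratio for a single linear function rather than a supremum over a class.

```python
import math
import random
from typing import Callable, List, Sequence


def moment_ratio(samples: Sequence[float]) -> float:
    """Empirical L4 norm divided by empirical L2 norm."""
    n = len(samples)
    l2 = math.sqrt(sum(s ** 2 for s in samples) / n)
    l4 = (sum(s ** 4 for s in samples) / n) ** 0.25
    return l4 / l2


def linear_function_samples(
    w: Sequence[float],
    n: int,
    draw_coordinate: Callable[[random.Random], float],
    seed: int = 0,
) -> List[float]:
    """Draw n values of <w, x>, where x has i.i.d. coordinates from draw_coordinate."""
    rng = random.Random(seed)
    return [sum(wj * draw_coordinate(rng) for wj in w) for _ in range(n)]


w = [1.0, -2.0, 0.5]  # hypothetical weight vector
gaussian = linear_function_samples(w, 50000, lambda rng: rng.gauss(0.0, 1.0))
rademacher = linear_function_samples(w, 50000, lambda rng: rng.choice((-1.0, 1.0)))

gaussian_ratio = moment_ratio(gaussian)
rademacher_ratio = moment_ratio(rademacher)
print(gaussian_ratio, 3 ** 0.25)
print(rademacher_ratio, math.sqrt(8))
```

For Gaussian coordinates the estimated ratio should sit near $3^{1/4}$, since $\langle w, x \rangle$ is itself Gaussian. For symmetric sign coordinates the ratio should fall below the $\sqrt{8}$ bound quoted above.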

5.1 Minimax Oracle Rates

We first state our main theorem for the well-specified case.

Theorem 6 (Oracle Rates, Well-Specified Case). Suppose that we are in the well-specified setting and that Assumptions 1 and 2 and Assumptions 8 and 9 are satisfied for a class $\hat\Theta$ defined below. Suppose that the following relationship holds:

$$p_1 < 2p_2 + 2. \tag{43}$$

Then for appropriate choice of sub-algorithms, the sample splitting meta-algorithm Meta-Algorithm 1 produces a predictor $\hat\theta$ that guarantees that with probability at least $1 - \delta$,

$$L_D(\hat\theta, g_0) - L_D(\theta_0, g_0) \le \tilde O\big(n^{-\frac{2}{2+p_2}}\big), \tag{44}$$

where the $\tilde O$ symbol hides problem-dependent parameters and $\log(\delta^{-1})$ terms.

Theorem 6 is summarized in Figure 1. In particular, whenever $\Theta$ is a parametric class (i.e., $H_2(\Theta, \varepsilon, n) \propto d_2 \log(1/\varepsilon)$), it suffices to take $p_1 < 2$, which recovers the usual setup for semiparametric inference.

We now focus on deriving oracle rates in the case where the second stage is misspecified. This setting has been relatively unexplored in recent results on double machine learning (Chernozhukov et al., 2018a; Mackey et al., 2018; Chernozhukov et al., 2018b); this is perhaps not surprising since for many setups the well-specified property is critical to establish orthogonality. However, for certain settings including treatment effect estimation (Section 3.3), orthogonality can indeed hold even without model correctness. Our main theorem for the misspecified case shows that we can obtain oracle rates in the presence of both misspecification and nuisance parameters as long as the nuisance class has moderate complexity.

Theorem 7 (Oracle Rates, Misspecified Case). Suppose that the target class $\Theta$ is convex. Suppose that Assumptions 1 and 2 and Assumptions 8 and 9 hold. If the relationship

$$p_1 < \max\{2p_2 + 2,\; 4p_2 - 2\} \tag{45}$$

holds, then for appropriate choice of sub-algorithms, the sample splitting meta-algorithm Meta-Algorithm 1 produces a predictor $\hat\theta$ such that with probability at least $1 - \delta$,

$$L_D(\hat\theta, g_0) - L_D(\theta_0, g_0) \le \tilde O\big(n^{-(\frac{2}{2+p_2} \wedge \frac{1}{p_2})}\big). \tag{46}$$

Footnote 2: Suppose $G$ is scalar-valued and let $\|g\|_{\psi_2} = \inf\{c > 0 \mid E\exp(g^2(w)/c^2) \le 2\}$. Then the subgaussian class assumption for our setting asserts that $\|g - g_0\|_{\psi_2} \le C\|g - g_0\|_{L_2(D)}$ for all $g \in G$.

[Figure 1 plot, titled "Strongly convex/well-specified": exponent $p_1$ (vertical axis) against exponent $p_2$ (horizontal axis), with regions labeled "Oracle rate" and "Suboptimal rate".]

Figure 1: Relationship between first and second stage for oracle rates; well-specified case.

[Figure 2 plot, titled "Strongly convex/misspecified": exponent $p_1$ (vertical axis) against exponent $p_2$ (horizontal axis), with regions labeled "Oracle rate" and "Suboptimal rate".]

Figure 2: Relationship between first and second stage for oracle rates; misspecified case.

Theorem 7 is summarized in Figure 2. Comparing to the well-specified case (Theorem 6/Figure 1), we see that in the misspecified case the condition on the nuisance parameter metric entropy is significantly more permissive when the target parameter class $\Theta$ is large (i.e., $p_2 > 2$). For example, if $p_2 = 5$ then we require $p_1 < 12$ for oracle rates in the well-specified case, but only require $p_1 < 18$ in the misspecified case. On the other hand, when $p_2 \le 2$ the conditions on the nuisance metric entropy match the well-specified case. In particular, whenever $\Theta$ is a parametric class it again suffices to take $p_1 < 2$ so that the first stage has an $n^{-1/4}$-rate.
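The comparison between conditions (43) and (45) is simple arithmetic, and the short sketch below reproduces the $p_2 = 5$ example. The function names are hypothetical, and the convention of returning the fast exponent when $p_2 = 0$ (where $1/p_2$ is undefined) is an assumption made for this illustration.

```python
def oracle_threshold(p2: float, well_specified: bool) -> float:
    """Strict upper limit on the nuisance complexity p1 from (43) or (45)."""
    if well_specified:
        return 2 * p2 + 2
    return max(2 * p2 + 2, 4 * p2 - 2)


def oracle_rate_exponent(p2: float, well_specified: bool) -> float:
    """Exponent r such that the oracle excess risk is of order n^(-r), up to log factors."""
    fast = 2 / (2 + p2)
    if well_specified or p2 == 0:
        return fast
    return min(fast, 1 / p2)


for p2 in (0, 2, 5):
    print(p2, oracle_threshold(p2, True), oracle_threshold(p2, False))
```

The two thresholds coincide for $p_2 \le 2$ and separate beyond that point, which matches the discussion of Figures 1 and 2.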

6 Minimax Oracle Rates for Generic Lipschitz Losses

The fast oracle rates developed in the previous section rely on strong convexity, which is not satisfied for all losses used in practice. For example, linear losses used in policy learning do not satisfy this property (Athey and Wager, 2017). In this section we extend the oracle rate characterization from Section 5.1 from strongly convex losses to any loss that is Lipschitz in the target prediction $\theta(x)$.

For arbitrary Lipschitz losses (in particular, for the linear loss), the optimal rate in the absence of nuisance parameters is $\tilde O\big(n^{-(\frac{1}{2} \wedge \frac{1}{p_2})}\big)$ under the metric entropy growth assumed in Section 5 (cf. Section 12.8/12.9 of Rakhlin and Sridharan (2012)). Our main theorem for this section shows that this rate is still obtained in the presence of nuisance parameters when the nuisance metric entropy parameter $p_1$ is not too much larger than $p_2$. We make the following regularity assumption on the loss.

Assumption 10. The loss $\ell$ has absolute value bounded by 1, the mapping $\zeta \mapsto \ell(\zeta, \gamma; z)$ is $O(1)$-Lipschitz with respect to $\ell_2$, and the mapping $\gamma \mapsto \ell(\zeta, \gamma; z)$ has $O(1)$-Lipschitz gradients with respect to $\ell_2$. Assumption 5 holds for all $g \in G$ when $p_1 \le 2$, and for all $g \in G + \mathrm{star}(G - G, 0)$ if $p_1 > 2$.

Compared to the oracle rates for square losses, the assumptions required for our main Lipschitz loss theorem are relaxed significantly: we no longer require strong convexity, nor do we require any type of moment comparison for the nuisance class. On the other hand, we do require the additional universal orthogonality condition from Section 3.2.

Theorem 8. Suppose that Assumption 10 is satisfied. If the relationship

$$p_1 < \max\{2,\; 2p_2 - 2\} \tag{47}$$

holds, then for appropriate choice of sub-algorithms, the sample splitting meta-algorithm produces a predictor $\hat\theta$ that guarantees that, with probability at least $1 - \delta$,

$$L_D(\hat\theta, g_0) - L_D(\theta_0, g_0) \le \tilde O\big(n^{-(\frac{1}{2} \wedge \frac{1}{p_2})}\big). \tag{48}$$

Theorem 8 is summarized in Figure 3. In particular, whenever $\Theta$ is a parametric class (i.e., $H_2(\Theta, \varepsilon, n) \propto d_2 \log(1/\varepsilon)$), it suffices to take $p_1 < 2$, as in the well-specified and misspecified square loss setups in Section 5.1, and as in standard semiparametric results (Newey, 1994; van der Vaart, 2000; Robins et al., 2008; Zheng and van der Laan, 2010; Chernozhukov et al., 2018a). That the condition matches is somewhat interesting given that the final rate in this case is slower than in the square loss setup. Note however that we cannot tolerate $p_1 > 2$ until $p_2 > 2$, as compared to our results for strongly convex losses, where the admissible value of $p_1$ is growing for all values of $p_2$. This result generalizes the condition given in Athey and Wager (2017) for the specific case of policy learning, which applies only in the parametric setting, and for a specific loss.
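Condition (47) and the rate in (48) can be tabulated in the same way. This is a small illustrative sketch with hypothetical function names; treating $p_2 = 0$ as giving exponent $1/2$ is an assumed convention, since $1/p_2$ is undefined there.

```python
def lipschitz_threshold(p2: float) -> float:
    """Strict upper limit on the nuisance complexity p1 from condition (47)."""
    return max(2.0, 2 * p2 - 2)


def lipschitz_rate_exponent(p2: float) -> float:
    """Exponent r such that the excess risk in (48) is of order n^(-r), up to log factors."""
    if p2 == 0:
        return 0.5
    return min(0.5, 1 / p2)


for p2 in (0, 1, 2, 3, 5):
    print(p2, lipschitz_threshold(p2), lipschitz_rate_exponent(p2))
```

The threshold stays at 2 until $p_2$ exceeds 2 and only then starts to grow, which is the contrast with the strongly convex case noted above.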

7 Discussion

This paper initiates the systematic study of prediction error and excess risk guarantees in the presence of nuisance parameters and Neyman orthogonality. Our results highlight that orthogonality is beneficial for learning with nuisance parameters even in the presence of possible model misspecification, and even when the target parameters belong to large nonparametric classes. We also show that many of the typical assumptions used to analyze estimation in the presence of nuisance parameters can be dropped when prediction error is the target. There are many promising future directions, including weakening assumptions, obtaining sharper guarantees for specific settings and losses of interest (e.g., doubly-robust guarantees), and analyzing further algorithms for general function classes (along the lines of Section 4).

[Figure 3 plot, titled "General loss": exponent $p_1$ (vertical axis) against exponent $p_2$ (horizontal axis), with regions labeled "Oracle rate" and "Suboptimal rate".]

Figure 3: Relationship between first and second stage for oracle rates; general loss case.

Acknowledgements. We are very grateful to the anonymous COLT reviewers and to Xiaohong Chen for pointing out additional related work. Part of this work was completed while DF was an intern at Microsoft Research, New England. DF acknowledges support from the Facebook PhD fellowship and NSF Tripods grant #1740751.

References

C. Ai and X. Chen. Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica, 71(6):1795–1843, 2003.

C. Ai and X. Chen. Estimation of possibly misspecified semiparametric conditional moment restriction models with different conditioning variables. Journal of Econometrics, 141(1):5–43, 2007.

C. Ai and X. Chen. The semiparametric efficiency bound for models of sequential moment restrictions containing unknown functions. Journal of Econometrics, 170(2):442–457, 2012.

M. Anthony and P. L. Bartlett. Neural network learning: Theoretical foundations. Cambridge University Press, 1999.

S. Arora, S. Du, W. Hu, Z. Li, and R. Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, pages 322–332, 2019.

S. Athey and S. Wager. Efficient policy learning. arXiv preprint arXiv:1702.02896, 2017.

S. Athey, J. Tibshirani, and S. Wager. Generalized random forests. The Annals of Statistics, 47(2):1148–1178, 2019.

J.-Y. Audibert. Progressive mixture rules are deviation suboptimal. In Advances in Neural Information Processing Systems, pages 41–48, 2008.

P. L. Bartlett, O. Bousquet, and S. Mendelson. Local rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.

P. L. Bartlett, D. J. Foster, and M. Telgarsky. Spectrally-normalized margin bounds for neural networks. Advances in Neural Information Processing Systems (NIPS), 2017.

P. L. Bartlett, N. Harvey, C. Liaw, and A. Mehrabian. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. Conference on Learning Theory, 2017.

A. Belloni, V. Chernozhukov, I. Fernández-Val, and C. Hansen. Program evaluation and causal inference with high-dimensional data. Econometrica, 85(1):233–298, 2017.

S. Ben-David and R. Urner. On the hardness of domain adaptation and the utility of unlabeled target samples. In International Conference on Algorithmic Learning Theory, pages 139–153. Springer, 2012.

S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In Advances in neural information processing systems, pages 137–144, 2007.

D. Bertsimas and C. McCord. Optimization over continuous and multi-dimensional decisions with observational data. In Advances in Neural Information Processing Systems, pages 2962–2970, 2018.

A. Beygelzimer and J. Langford. The offset tree for learning with partial labels. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 129–138. ACM, 2009.

P. J. Bickel. On adaptive estimation. The Annals of Statistics, pages 647–671, 1982.

P. J. Bickel, C. A. Klaassen, Y. Ritov, and J. A. Wellner. Efficient and adaptive estimation for semiparametric models, volume 4. Johns Hopkins University Press Baltimore, 1993.

J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Wortman. Learning bounds for domain adaptation. In Advances in neural information processing systems, pages 129–136, 2008.

R. Blundell, X. Chen, and D. Kristensen. Semi-nonparametric iv estimation of shape-invariant engel curves. Econometrica, 75(6):1613–1669, 2007.

O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. In Advanced lectures on machine learning, pages 169–207. Springer, 2004.

C. M. Cassel, C. E. Särndal, and J. H. Wretman. Some results on generalized difference estimation and generalized regression estimation for finite populations. Biometrika, 63(3):615–620, 1976.

X. Chen and T. M. Christensen. Optimal sup-norm rates and uniform inference on nonlinear functionals of nonparametric iv regression. Quantitative Economics, 9(1):39–84, 2018.

X. Chen and D. Pouzo. Efficient estimation of semiparametric conditional moment models with possibly nonsmooth residuals. Journal of Econometrics, 152(1):46–60, 2009.

X. Chen and D. Pouzo. Estimation of nonparametric conditional moment models with possibly nonsmooth generalized residuals. Econometrica, 80(1):277–321, 2012.

X. Chen and D. Pouzo. Sieve Wald and QLR inferences on semi/nonparametric conditional moment models. Econometrica, 83(3):1013–1079, 2015.

X. Chen and H. White. Improved rates and asymptotic normality for nonparametric neural network estimators. IEEE Transactions on Information Theory, 45(2):682–691, 1999.

V. Chernozhukov, J. C. Escanciano, H. Ichimura, and W. K. Newey. Locally robust semiparametric estimation. arXiv preprint arXiv:1608.00033, 2016.

V. Chernozhukov, M. Goldman, V. Semenova, and M. Taddy. Orthogonal machine learning for demand estimation: High dimensional causal inference in dynamic panels. arXiv preprint arXiv:1712.09988, 2017.

V. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018a.

V. Chernozhukov, D. Nekipelov, V. Semenova, and V. Syrgkanis. Plug-in regularized estimation of high-dimensional parameters in nonlinear semiparametric models. arXiv preprint arXiv:1806.04823, 2018b.

V. Chernozhukov, W. Newey, and J. Robins. Double/de-biased machine learning using regularized riesz representers. arXiv preprint arXiv:1802.08667, 2018c.

V. Chernozhukov, W. Newey, J. Robins, and R. Singh. Double/de-biased machine learning of global and local parameters using regularized riesz representers. arXiv preprint arXiv:1802.08667, 2018d.

C. Cortes, Y. Mansour, and M. Mohri. Learning bounds for importance weighting. In Advances in neural information processing systems, pages 442–450, 2010.

H. Daume III and D. Marcu. Domain adaptation for statistical classifiers. Journal of artificial Intelligence research, 26:101–126, 2006.

B. Delyon and A. Juditsky. On minimax wavelet estimators. Applied and Computational Harmonic Analysis, 3(3):215–228, 1996.

I. Díaz and M. J. van der Laan. Targeted data adaptive estimation of the causal dose–response curve. Journal of Causal Inference, 1(2):171–192, 2013.

M. Dudík, J. Langford, and L. Li. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning, pages 1097–1104. Omnipress, 2011.

M. H. Farrell, T. Liang, and S. Misra. Deep neural networks for estimation and inference: Application to causal effects and other semiparametric estimands. arXiv preprint arXiv:1809.09953, 2018.

D. J. Foster, S. Kale, H. Luo, M. Mohri, and K. Sridharan. Logistic regression: The importance of being improper. Conference on Learning Theory, 2018.

R. Friedberg, J. Tibshirani, S. Athey, and S. Wager. Local linear forests. arXiv preprint arXiv:1807.11408, 2018.

E. Giné and V. Koltchinskii. Concentration inequalities and asymptotic results for ratio type empirical processes. The Annals of Probability, 34(3):1143–1216, 2006.

N. Golowich, A. Rakhlin, and O. Shamir. Size-independent sample complexity of neural networks. Conference on Learning Theory, 2018.

B. S. Graham. Efficiency bounds for missing data models with semiparametric restrictions. Econometrica, 79(2):437–452, 2011.

A. Guntuboyina and B. Sen. Global risk bounds and adaptation in univariate convex regression. Probability Theory and Related Fields, 163(1-2):379–411, 2015.

P. Hall, J. L. Horowitz, et al. Nonparametric methods for inference in the presence of instrumental variables. The Annals of Statistics, 33(6):2904–2929, 2005.

S. Hanneke. Theory of disagreement-based active learning. Foundations and Trends® in Machine Learning, 7(2-3):131–309, 2014.

T. Hastie, R. Tibshirani, and M. Wainwright. Statistical learning with sparsity: the lasso and generalizations. CRC press, 2015.

D. Haussler. Sphere packing numbers for subsets of the boolean n-cube with bounded Vapnik-Chervonenkis dimension. Journal of Combinatorial Theory, Series A, 69(2):217–232, 1995.

D. A. Hirshberg and S. Wager. Augmented minimax linear estimation. arXiv preprint arXiv:1712.00038, 2017.

I. A. Ibragimov and R. Z. Has’Minskii. Statistical estimation: asymptotic theory. Springer-Verlag, New York, 1981.

J. Jiang and C. Zhai. Instance weighting for domain adaptation in NLP. In Proceedings of the 45th annual meeting of the association of computational linguistics, pages 264–271, 2007.

N. Kallus and A. Zhou. Policy evaluation and optimization with continuous treatments. In International Conference on Artificial Intelligence and Statistics, pages 1243–1251, 2018.

E. H. Kennedy, Z. Ma, M. D. McHugh, and D. S. Small. Non-parametric methods for doubly robust estimation of continuous treatment effects. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(4):1229–1245, 2017.

E. H. Kennedy, S. Lorch, and D. S. Small. Robust causal inference with continuous instruments using the local instrumental variable curve. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81(1):121–143, 2019.

G. Kerkyacharian, O. Lepski, and D. Picard. Nonlinear estimation in anisotropic multi-index denoising. Probability theory and related fields, 121(2):137–170, 2001.

G. Kerkyacharian, O. Lepski, and D. Picard. Nonlinear estimation in anisotropic multi-index denoising. sparse case. Theory of Probability & Its Applications, 52(1):58–77, 2008.

C. A. Klaassen. Consistent estimation of the influence function of locally asymptotically linear estimators. The Annals of Statistics, pages 1548–1562, 1987.

V. Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Springer, 2011.

V. Koltchinskii and D. Panchenko. Rademacher processes and bounding the risk of function learning. High Dimensional Probability II, 47:443–459, 2000.

M. R. Kosorok. Introduction to empirical processes and semiparametric inference. Springer, 2008.

S. R. Künzel, J. S. Sekhon, P. J. Bickel, and B. Yu. Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences, 116(10):4156–4165, 2019.

G. Lecué and S. Mendelson. Learning subgaussian classes: Upper and minimax bounds. In Topics in Learning Theory. Societe Mathematique de France, 2016.

E. L. Lehmann and G. Casella. Theory of point estimation. Springer Science & Business Media, 2006.

O. Lepski. Asymptotically minimax adaptive estimation. I: Upper bounds. optimally adaptive estimates. Theory of Probability & Its Applications, 36(4):682–697, 1992.

B. Y. Levit. On the efficiency of a class of non-parametric estimates. Theory of Probability & Its Applications, 20(4):723–740, 1976.

T. Liang, A. Rakhlin, and K. Sridharan. Learning with square loss: Localization through offset rademacher complexity. In Proceedings of The 28th Conference on Learning Theory, pages 1260–1285, 2015.

L. Mackey, V. Syrgkanis, and I. Zadik. Orthogonal machine learning: Power and limitations. In International Conference on Machine Learning, pages 3375–3383, 2018.

Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In 22nd Conference on Learning Theory (COLT) 2009, 2009.

A. Maurer. A vector-contraction inequality for Rademacher complexities. In International Conference on Algorithmic Learning Theory, pages 3–17. Springer, 2016.

A. Maurer and M. Pontil. Empirical Bernstein bounds and sample variance penalization. In The 22nd Conference on Learning Theory (COLT), 2009.

S. Mendelson. Improving the sample complexity using global data. IEEE transactions on Information Theory, 48(7):1977–1991, 2002.

S. Mendelson. Discrepancy, chaining and subgaussian processes. The Annals of Probability, 39(3): 985–1026, 2011.

S. Mendelson. Learning without concentration. In Conference on Learning Theory (COLT), pages 25–39, 2014.

W. K. Newey. The asymptotic variance of semiparametric estimators. Econometrica: Journal of the Econometric Society, pages 1349–1382, 1994.

W. K. Newey and J. L. Powell. Instrumental variable estimation of nonparametric models. Econometrica, 71(5):1565–1578, 2003.

J. Neyman. Optimal asymptotic tests of composite hypotheses. Probability and statistics, pages 213–234, 1959.

J. Neyman. c(α) tests and their use. Sankhyā: The Indian Journal of Statistics, Series A, pages 1–21, 1979.

X. Nie and S. Wager. Quasi-oracle estimation of heterogeneous treatment effects. arXiv preprint arXiv:1712.04912, 2017.

Y. Ning, S. Peng, and K. Imai. Robust estimation of causal effects via high-dimensional covariate balancing propensity score. arXiv preprint arXiv:1812.08683, 2018.

M. Oprescu, V. Syrgkanis, and Z. S. Wu. Orthogonal random forest for causal inference. In International Conference on Machine Learning, pages 4932–4941, 2019.

J. Pfanzagl. Contributions to a general asymptotic statistical theory. 1982.

M. Qian and S. A. Murphy. Performance guarantees for individualized treatment rules. Annals of statistics, 39(2):1180, 2011.

A. Rakhlin and K. Sridharan. Statistical learning and sequential prediction, 2012. Preprint. Available at http://www.mit.edu/~rakhlin/courses/stat928/stat928_notes.pdf.

A. Rakhlin, K. Sridharan, and A. B. Tsybakov. Empirical entropy, minimax regret and minimax risk. Bernoulli, 23(2):789–824, 2017.

J. Robins, L. Li, E. Tchetgen, and A. van der Vaart. Higher order influence functions and minimax estimation of nonlinear functionals. In Probability and statistics: essays in honor of David A. Freedman, pages 335–421. Institute of Mathematical Statistics, 2008.

J. M. Robins and A. Rotnitzky. Semiparametric efficiency in multivariate regression models with missing data. Journal of the American Statistical Association, 90(429):122–129, 1995.

J. M. Robins and A. Rotnitzky. Comment on the Bickel and Kwon article, Inference for semiparametric models: Some questions and an answer. Statistica Sinica, 11(4):920–936, 2001.

J. M. Robins, A. Rotnitzky, and L. P. Zhao. Estimation of regression coefficients when some regressors are not always observed. Journal of the American statistical Association, 89(427):846–866, 1994.

P. M. Robinson. Root-n-consistent semiparametric regression. Econometrica: Journal of the Econometric Society, pages 931–954, 1988.

P. R. Rosenbaum and D. B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.

D. Rubin and M. J. van der Laan. A general imputation methodology for nonparametric regression with censored data. 2005.

D. Rubin and M. J. van der Laan. A doubly robust censoring unbiased transformation. The international journal of biostatistics, 3(1), 2007.

D. O. Scharfstein, A. Rotnitzky, and J. M. Robins. Rejoinder-Adjusting for nonignorable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association, 94(448):1135–1146, 1999.

S. Shalev-Shwartz and S. Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.

H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference, 90(2):227–244, 2000.

N. Srebro, K. Sridharan, and A. Tewari. Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems, pages 2199–2207, 2010.

C. J. Stone. Optimal rates of convergence for nonparametric estimators. The annals of Statistics, pages 1348–1360, 1980.

C. J. Stone. Optimal global rates of convergence for nonparametric regression. The annals of statistics, pages 1040–1053, 1982.

A. Swaminathan and T. Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning, pages 814–823, 2015a.

A. Swaminathan and T. Joachims. The self-normalized estimator for counterfactual learning. In Advances in Neural Information Processing Systems, pages 3231–3239, 2015b.

A. Tsiatis. Semiparametric theory and missing data. Springer Science & Business Media, 2007.

A. B. Tsybakov. Pointwise and sup-norm sharp adaptive estimation of functions on the Sobolev classes. The Annals of Statistics, 26(6):2420–2469, 1998.

M. J. van der Laan and S. Dudoit. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: Finite sample oracle inequalities and examples. 2003.

M. J. van der Laan and A. R. Luedtke. Targeted learning of an optimal dynamic treatment, and statistical inference for its mean outcome. 2014.

M. J. van der Laan and J. M. Robins. Unified methods for censored longitudinal data and causality. Springer Science & Business Media, 2003.

M. J. van der Laan and S. Rose. Targeted learning: causal inference for observational and experimental data. Springer Science & Business Media, 2011.

M. J. van der Laan and D. Rubin. Targeted maximum likelihood learning. The International Journal of Biostatistics, 2(1), 2006.

M. J. van der Laan, S. Dudoit, and A. van der Vaart. The cross-validated adaptive epsilon-net estimator. Statistics & Decisions. International Mathematical Journal for Stochastic Methods and Models, 24(3):373–395, 2006.

M. J. van der Laan, E. C. Polley, and A. E. Hubbard. Super learner. Statistical applications in genetics and molecular biology, 6(1), 2007.

A. van der Vaart. Asymptotic statistics, volume 3. Cambridge university press, 2000.

A. van der Vaart and M. J. van der Laan. Estimating a survival distribution with current status data and high-dimensional covariates. The International Journal of Biostatistics, 2(1), 2006.

V. N. Vapnik. The nature of statistical learning theory. Springer, 1995.

M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019.

L. Wang, A. Rotnitzky, and X. Lin. Nonparametric regression with missing outcomes using weighted kernel estimating equations. Journal of the American Statistical Association, 105(491):1135–1146, 2010.

Y. Yang and A. Barron. Information-theoretic determination of minimax rates of convergence. Annals of Statistics, pages 1564–1599, 1999.

T. Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2(Mar):527–550, 2002.

Y. Zhao, D. Zeng, A. J. Rush, and M. R. Kosorok. Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association, 107(499):1106–1118, 2012.

W. Zheng and M. J. van der Laan. Asymptotic theory for cross-validated targeted maximum likelihood estimation. 2010.

X. Zhou, N. Mayer-Hamblett, U. Khan, and M. R. Kosorok. Residual weighted learning for estimating individualized treatment rules. Journal of the American Statistical Association, 112 (517):169–187, 2017.

Z. Zhou, S. Athey, and S. Wager. Offline multi-action policy learning: Generalization and optimization. arXiv preprint arXiv:1810.04778, 2018.

Organization of Appendix

The appendix is organized as follows. Part I contains supplemental results: sufficient conditions for the main theorems (Appendix A), applications of the main results (Appendix B), examples of orthogonal losses (Appendix C), and generalization guarantees for specific function classes (Appendix D). Part II contains proofs for the results presented in the main body of the paper: Appendix E contains preliminaries, Appendix F proves the main theorems for the sample-splitting meta-algorithm, Appendix H proves the main results for plug-in empirical risk minimization (building on technical tools developed in Appendix G), and Appendix I proves our results concerning oracle rates.

Part I Additional Results

A Sufficient Conditions for Single Index Losses

Our setup and main theorems in Section 3 are stated at a high level of generality, with abstract assumptions on the structure of the population risk. In this section of the appendix we provide conditions under which these assumptions follow from concrete structural assumptions on the risk. We give sufficient guarantees for general families of loss functions. In particular, the conditions we give here suffice to derive guarantees for the applications considered in Appendix B.

A.1 Fast Rates

In this section we give a broad class of losses under which the conditions for fast rates in Section 3.1 are satisfied. The population loss $L_D$ is defined as the expectation of a point-wise loss $\ell(\zeta, \gamma; z)$ acting on the predictions of the nuisance and target parameters. Specifically, we assume existence of functions $\Phi$ and $\Lambda$ such that the loss has the following structure:

$$\ell(\theta(x), g(w); z) = \Phi\big(\langle \Lambda(g(w), v), \theta(x)\rangle, g(w), z\big), \qquad L_D(\theta, g) = \mathbb{E}[\ell(\theta(x), g(w); z)]. \tag{49}$$

Here we recall from Section 2 that $x, w$ are subsets of the data $z$, and let $v \subseteq z$ be an auxiliary subset of the data. We also assume existence of functions $\phi(\zeta)$ and $\Gamma(\gamma, z)$ such that the partial derivative of $\Phi$ may be written as

$$\frac{\partial}{\partial \zeta}\Phi(\zeta, \gamma, z) = \phi(\zeta) - \Gamma(\gamma, z), \tag{50}$$

where $\phi$ is strictly increasing and Lipschitz. A simple example is square loss regression, where $\Phi(\zeta, \gamma, z) = \frac{1}{2}(\zeta - \Gamma(\gamma, z))^2$, so that $\phi(\zeta) = \zeta$. To provide fast rates of the type in Section 3 for losses with this structure, we make the following regularity assumption.

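Assumption 11 (Sufficient Conditions for Fast Rates: Single Index Losses). The loss $\ell$ satisfies the following conditions for parameters $(T_{si}, \tau_{si}, L_{si}, R_{si}, \lambda_{si}, \mu_{si})$:

$$T_{si} \ge \phi'(t) \ge \tau_{si}.$$

$$\mathbb{E}[\nabla_\gamma \nabla_\zeta \ell(\theta^\star(x), g_0(w); z) \mid w] = 0. \qquad (\Rightarrow \text{Assumption 1})$$

$$\mathbb{E}[\nabla_\zeta \ell(\theta^\star(x), g_0(w); z) \cdot (\theta(x) - \theta^\star(x))] \ge 0, \quad \forall \theta \in \mathrm{star}(\widehat{\Theta}, \theta^\star). \qquad (\Rightarrow \text{Assumption 2})$$

$$\Lambda(\gamma, v) \text{ is } L_{si}\text{-Lipschitz in } \gamma \text{ w.r.t. } \|\cdot\|_2 \text{ a.s.} \qquad (\Rightarrow \text{Assumption 3})$$

$$\sup_{x \in \mathcal{X},\, \theta \in \Theta} \|\theta(x)\|_2 \le R_{si}. \qquad (\Rightarrow \text{Assumption 3})$$

$$\frac{\mathbb{E}\langle \Lambda(g_0(w), v), \theta(x) - \theta^\star(x)\rangle^2}{\mathbb{E}\|\theta(x) - \theta^\star(x)\|_2^2} \ge \lambda_{si}, \quad \forall \theta \in \widehat{\Theta}. \qquad (\Rightarrow \text{Assumption 3})$$

$$\big\| \mathbb{E}[\nabla^2_{\gamma\gamma} \nabla_{\zeta_i} \ell(\theta^\star(x), g(w); z) \mid w] \big\|_\sigma \le \mu_{si}, \quad \forall i, \ \forall g \in \mathrm{star}(\mathcal{G}, g_0). \qquad (\Rightarrow \text{Assumption 4(b)})$$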

As a concrete example, consider the logistic loss, where $\Phi(t, \gamma, z) = -\big(y \cdot \log(\sigma_{\log}(t)) + (1 - y) \cdot \log(1 - \sigma_{\log}(t))\big)$, where the target label $y \in \{0, 1\}$ is a subset of the data $z$ and $\sigma_{\log}(t) := 1/(1 + e^{-t})$ is the logistic function, so that $\frac{\partial}{\partial t}\Phi(t, \gamma, z) = \sigma_{\log}(t) - y$. In general, the gradient of the loss with respect to the target index value can be written as

$$\nabla_\zeta \ell(\zeta, \gamma; z) = \big(\phi(\langle \Lambda(\gamma, v), \zeta\rangle) - \Gamma(\gamma, z)\big)\,\Lambda(\gamma, v).$$
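The gradient formula can be sanity-checked numerically for the logistic case. The sketch below is illustrative only: it assumes the negative log-likelihood sign convention (so that the derivative of Φ is σ_log(t) − y), treats Λ(γ, v) as a fixed vector `lam`, and uses hypothetical helper names that do not come from the paper.

```python
import math

def sigma_log(t):
    """Logistic function 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

def Phi(t, y):
    """Negative log-likelihood, so that dPhi/dt = sigma_log(t) - y."""
    s = sigma_log(t)
    return -(y * math.log(s) + (1 - y) * math.log(1.0 - s))

def inner(a, b):
    """Euclidean inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def loss(zeta, lam, y):
    """Single-index loss Phi(<Lambda, zeta>); lam stands in for Lambda(gamma, v)."""
    return Phi(inner(lam, zeta), y)

def grad_closed_form(zeta, lam, y):
    """(phi(<Lambda, zeta>) - Gamma) * Lambda with phi = sigma_log, Gamma = y."""
    residual = sigma_log(inner(lam, zeta)) - y
    return [residual * l for l in lam]

def grad_finite_diff(zeta, lam, y, h=1e-6):
    """Central finite differences, one coordinate of zeta at a time."""
    grad = []
    for i in range(len(zeta)):
        up = list(zeta)
        up[i] += h
        down = list(zeta)
        down[i] -= h
        grad.append((loss(up, lam, y) - loss(down, lam, y)) / (2 * h))
    return grad

zeta, lam, y = [0.3, -0.7], [1.5, 0.4], 1
print(grad_closed_form(zeta, lam, y))
print(grad_finite_diff(zeta, lam, y))
```

The two printed vectors should agree up to finite-difference error, which is what the single-index structure predicts.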

Moreover, whenever the arguments to the loss are bounded, the Hessian can be bounded above and below via

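$$T_{si} \cdot \Lambda(\gamma, v)\Lambda(\gamma, v)^\top \succeq \nabla^2_{\zeta\zeta}\ell(\zeta, \gamma; z) = \phi'(\langle \Lambda(\gamma, v), \zeta\rangle) \cdot \Lambda(\gamma, v)\Lambda(\gamma, v)^\top \succeq \tau_{si} \cdot \Lambda(\gamma, v)\Lambda(\gamma, v)^\top, \tag{51}$$

and thus the ratio condition is implied by a minimum eigenvalue assumption on the conditional covariance matrix $\mathbb{E}\big[\Lambda(g_0(w), v)\Lambda(g_0(w), v)^\top \mid x\big]$.

We now show that these conditions are sufficient to satisfy the assumptions of Theorem 1, and thus guarantee higher order impact from the nuisance parameters.

Lemma 2. If Assumption 11 holds, then Assumptions 1 to 4 are satisfied with constants $\lambda = \frac{\tau_{si}}{4}$, $\kappa = \frac{4\tau_{si}L_{si}^4R_{si}^2}{\lambda_{si}}$, $\beta_1 = T_{si}$ and $\beta_2 = \frac{\mu_{si}\sqrt{K_2}}{\sqrt{\lambda_{si}}}$, and with respect to the norms $\|\theta\|_\Theta := \big(\mathbb{E}_z\langle \Lambda(g_0(w), v), \theta(x)\rangle^2\big)^{1/2}$ and $\|\cdot\|_{\mathcal{G}} = \|\cdot\|_{L_4(\ell_2, D)}$.

Combining this lemma with the guarantee from Theorem 1 directly yields the following corollary.

Corollary 1. Suppose that there is some $\theta^\star \in \arg\min_{\theta \in \Theta} L_D(\theta, g_0)$ such that Assumption 11 is satisfied. The sample splitting meta-algorithm Meta-Algorithm 1 produces a predictor $\widehat{\theta}$ that guarantees that with probability at least $1 - \delta$,

$$L_D(\widehat{\theta}, g_0) - L_D(\theta^\star, g_0) \le \frac{16}{\tau_{si}}\,\mathrm{Rate}_D(\Theta, S_2, \delta/2; \widehat{g}) + \frac{2}{\tau_{si}}\left(\frac{4\mu_{si}^2K_2 + 16\tau_{si}L_{si}^4R_{si}^2}{\lambda_{si}}\right)\big(\mathrm{Rate}_D(\mathcal{G}, S_1, \delta/2)\big)^4.$$

Furthermore, with $\|\cdot\|_\Theta$ as in Lemma 2, the following prediction error guarantee is satisfied with probability at least $1 - \delta$,

$$\|\widehat{\theta} - \theta^\star\|_\Theta^2 \le \frac{8T_{si}}{\tau_{si}}\,\mathrm{Rate}_D(\Theta, S_2, \delta/2; \widehat{g}) + \frac{2T_{si}}{\tau_{si}}\left(\frac{4\mu_{si}^2K_2 + 8\tau_{si}L_{si}^4R_{si}^2}{\lambda_{si}}\right)\big(\mathrm{Rate}_D(\mathcal{G}, S_1, \delta/2)\big)^4.$$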

Observe that in both of the bounds in Corollary 1, the only problem-dependent parameters that affect the leading term are $T_{si}$ and $\tau_{si}$. Importantly, this implies that if the more restrictive parameters $\lambda_{si}, L_{si}, \mu_{si}, R_{si}, K_2$ and so forth are held constant as $n$ grows, then they are negligible asymptotically so long as the nuisance parameter can be estimated quickly enough. For the square loss this is particularly desirable since $T_{si} = \tau_{si} = 1$.

Estimation for the first stage. Corollary 1 provides guarantees in terms of the $L_4$ estimation rate for the nuisance parameters, i.e.

$$\mathrm{Rate}_D(\mathcal{G}, S_1, \delta) = \|\widehat{g} - g_0\|_{L_4(\ell_2, D)} = \big(\mathbb{E}_w\|\widehat{g}(w) - g_0(w)\|_2^4\big)^{1/4}.$$

Since $L_4$ error rates are somewhat less common than $L_2$ (i.e., square loss) estimation rates, let us briefly discuss conditions under which out-of-the-box algorithms can be used to give guarantees on the $L_4$ error.

First, for many nonparametric classes of interest, minimax $L_p$ error rates have been characterized and can be applied directly. This includes smooth classes (Stone, 1980, 1982), Hölder classes (Lepski, 1992; Kerkyacharian et al., 2001, 2008), Besov classes (Delyon and Juditsky, 1996), Sobolev classes (Tsybakov, 1998), and convex regression (Guntuboyina and Sen, 2015). Second, whenever $\mathcal{G}$ is a linear class or more broadly a parametric class, classical statistical theory (Lehmann and Casella, 2006) guarantees parameter recovery. Up to problem-dependent constants, this implies a bound on the $L_4$ error as soon as the fourth moment is bounded. This approach also extends to the high-dimensional setting (Hastie et al., 2015). Last, if the class $\mathcal{G}$ has well-behaved moments in the sense that $\|g - g_0\|_{L_4(\ell_2, D)} \le C\|g - g_0\|_{L_2(\ell_2, D)}$ for all $g \in \mathcal{G}$, we can directly appeal to square loss regression algorithms for the first stage; this is the approach taken in Section 5. This condition is related to the so-called “subgaussian class” assumption, and both have been explored in recent works (Lecué and Mendelson, 2016; Mendelson, 2014; Liang et al., 2015).
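To make the moment comparison concrete, the following sketch computes empirical $L_4(\ell_2)$ and $L_2(\ell_2)$ norms of a set of nuisance error vectors and their ratio, which plays the role of the constant $C$ above. The function names and the toy error vectors are assumptions made for illustration; they are not part of the paper.

```python
import math

def lp_l2_norm(errors, p):
    """Empirical L_p(l2) norm: (mean_i ||e_i||_2^p)^(1/p) over error vectors e_i."""
    norms = [math.sqrt(sum(c * c for c in e)) for e in errors]
    return (sum(n ** p for n in norms) / len(norms)) ** (1.0 / p)

def moment_ratio(errors):
    """Ratio ||e||_{L4} / ||e||_{L2}; it is always at least 1 by the power-mean inequality."""
    return lp_l2_norm(errors, 4) / lp_l2_norm(errors, 2)

# Errors concentrated on half of the sample: the L4 norm exceeds the L2 norm.
spiky = [[3.0, 4.0], [0.0, 0.0]]
# Errors of constant magnitude: the two norms coincide.
flat = [[3.0, 4.0], [-5.0, 0.0]]
print(moment_ratio(spiky), moment_ratio(flat))
```

A ratio that stays bounded over the class is exactly the situation in which an $L_2$ guarantee for the first stage transfers to the $L_4$ rate required by Corollary 1.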

A.2 Slow Rates

In the single-index setup, assumptions much weaker than Assumption 11 suffice to obtain slow rates via Theorem 2. In particular, the following conditions are sufficient.

Assumption 12 (Sufficient Slow Rate Conditions for Single Index Losses).

$$\mathbb{E}[\nabla_\zeta \nabla_\gamma \ell(\theta(x), g_0(w); z) \mid w] = 0, \quad \forall \theta \in \mathrm{star}(\widehat{\Theta}, \theta^\star) + \mathrm{star}(\widehat{\Theta} - \theta^\star, 0). \qquad (\Rightarrow \text{Assumption 5})$$

$$\mathbb{E}\big[\nabla^2_{\gamma\gamma}\ell(\theta(x), g(w); z) \mid w\big] \preceq \beta_{si} I, \quad \forall \theta \in \mathrm{star}(\widehat{\Theta}, \theta^\star), \ g \in \mathrm{star}(\mathcal{G}, g_0). \qquad (\Rightarrow \text{Assumption 6})$$

Compared to Assumption 11, the most important difference is that since we require universal orthogonality, the first condition is required to hold for all $\theta$, not just at $\theta^\star$. Assumption 12 has the following immediate consequences.

Lemma 3. If Assumption 12 holds, then Assumption 5 is satisfied and Assumption 6 is satisfied with constant $\beta = \beta_{si}$ and with respect to the norm $\|\cdot\|_{\mathcal{G}} = \|\cdot\|_{L_2(\ell_2, D)}$.

Corollary 2. Suppose Assumption 12 holds. Then with probability at least $1 - \delta$, the target predictor $\widehat{\theta}$ produced by Meta-Algorithm 1 enjoys the excess risk bound

$$L_D(\widehat{\theta}, g_0) - L_D(\theta^\star, g_0) \le \mathrm{Rate}_D(\Theta, S_2, \delta/2; \widehat{g}) + \beta_{si} \cdot \big(\mathrm{Rate}_D(\mathcal{G}, S_1, \delta/2)\big)^2.$$

A.3 Proofs

Proof of Lemma 2. We prove that the assumptions required by Theorem 1 are implied by our conditions one by one.

Assumption 1. By the definition of directional derivatives, the law of iterated expectations and the fact that X ⊆ W:

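$$\begin{aligned}
D_g D_\theta L_D(\theta^\star, g_0)[\theta - \theta^\star, g - g_0] &= \mathbb{E}\big[(\theta(x) - \theta^\star(x))^\top \nabla_\gamma \nabla_\zeta \ell(\theta^\star(x), g_0(w), z)\,(g(w) - g_0(w))\big] \\
&= \mathbb{E}\big[(\theta(x) - \theta^\star(x))^\top\, \mathbb{E}[\nabla_\gamma \nabla_\zeta \ell(\theta^\star(x), g_0(w), z) \mid w]\,(g(w) - g_0(w))\big] = 0.
\end{aligned}$$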

Assumption 2. This likewise follows immediately by expanding the directional derivative and applying the law of iterated expectation:

$$D_\theta L_D(\theta^\star, g_0)[\theta - \theta^\star] = \mathbb{E}[\nabla_\zeta \ell(\theta^\star(x), g_0(w), z) \cdot (\theta(x) - \theta^\star(x))] \ge 0.$$

We now argue about the remaining assumptions. We will repeatedly invoke the following expression for the second derivative of the population risk.

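$$D_\theta^2 L_D(\bar{\theta}, g)[\theta - \theta', \theta - \theta'] = \mathbb{E}_z\Big[\phi'\big(\langle \Lambda(g(w), v), \bar{\theta}(x)\rangle\big) \cdot \langle \Lambda(g(w), v), \theta(x) - \theta'(x)\rangle^2\Big]. \tag{52}$$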

Let us introduce some additional notation. Let g ∈ G and θ ∈ Θ be the free variables in the statements of Assumption 3 and Assumption 4. We define the following vector- and matrix-valued random variables:

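$$W_0 = \Lambda(g_0(w), v), \qquad W_n = \Lambda(g(w), v), \tag{53}$$

$$X_0 = \theta^\star(x), \qquad X_n = \theta(x), \tag{54}$$

$$V_0 = g_0(w), \qquad V_n = g(w), \tag{55}$$

$$A_0 = \mathbb{E}\big[W_0 W_0^\top \mid x\big]. \tag{56}$$

To prove the lemma, it suffices to verify Assumption 3 and Assumption 4 with respect to the norm on the $\Theta$ space defined via $\|\theta - \theta'\|_\Theta^2 = \mathbb{E}\big[\langle W_0, \theta(x) - \theta'(x)\rangle^2\big]$ and the norm in the $\mathcal{G}$ space defined via $\|g - g_0\|_{\mathcal{G}} = \|g - g_0\|_{L_4(\ell_2, D)}$. One useful observation used throughout the proof is that for all $\theta \in \widehat{\Theta}$,

$$\|\theta - \theta^\star\|_\Theta^2 = \mathbb{E}\big[(\theta(x) - \theta^\star(x))^\top A_0\,(\theta(x) - \theta^\star(x))\big] \ge \lambda_{si} \cdot \mathbb{E}\big[\|\theta(x) - \theta^\star(x)\|_2^2\big] = \lambda_{si} \cdot \|\theta - \theta^\star\|_{L_2(\ell_2, D)}^2.$$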

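Assumption 3. By (51) and (52), we have

$$\begin{aligned}
D_\theta^2 L_D(\bar{\theta}, g)[\theta - \theta^\star, \theta - \theta^\star] &\ge \tau_{si} \cdot \mathbb{E}_z\big[\langle W_n, X_n - X_0\rangle^2\big] \\
&\ge \frac{\tau_{si}}{2} \cdot \mathbb{E}_z\big[\langle W_0, X_n - X_0\rangle^2\big] - \tau_{si} \cdot \mathbb{E}_z\big[\langle W_0 - W_n, X_n - X_0\rangle^2\big] \\
&\ge \frac{\tau_{si}}{2} \cdot \mathbb{E}_z\big[\langle W_0, X_n - X_0\rangle^2\big] - \tau_{si} \cdot \mathbb{E}_z\big[\|W_0 - W_n\|_2^2\,\|X_n - X_0\|_2^2\big] \\
&\ge \frac{\tau_{si}}{2} \cdot \|\theta - \theta^\star\|_\Theta^2 - \tau_{si}L_{si}^2\,\mathbb{E}_z\big[\|V_n - V_0\|_2^2\,\|X_n - X_0\|_2^2\big].
\end{aligned}$$

Using the AM-GM inequality we have for any $\eta > 0$:

$$\begin{aligned}
&\ge \frac{\tau_{si}}{2} \cdot \|\theta - \theta^\star\|_\Theta^2 - \frac{\tau_{si}L_{si}^4}{2\eta}\,\mathbb{E}_z\|V_n - V_0\|_2^4 - \frac{\eta\tau_{si}}{2}\,\mathbb{E}\|X_n - X_0\|_2^4 \\
&\ge \frac{\tau_{si}}{2} \cdot \|\theta - \theta^\star\|_\Theta^2 - \frac{\tau_{si}L_{si}^4}{2\eta}\,\|g - g_0\|_{\mathcal{G}}^4 - \frac{4\eta\tau_{si}R_{si}^2}{2}\,\mathbb{E}\|X_n - X_0\|_2^2 \\
&\ge \frac{\tau_{si}}{2} \cdot \|\theta - \theta^\star\|_\Theta^2 - \frac{\tau_{si}L_{si}^4}{2\eta}\,\|g - g_0\|_{\mathcal{G}}^4 - \frac{4\eta\tau_{si}R_{si}^2}{2\lambda_{si}}\,\|\theta - \theta^\star\|_\Theta^2.
\end{aligned}$$

Here the second line uses $\|X_n - X_0\|_2 \le 2R_{si}$ and the third uses $\lambda_{si}\,\mathbb{E}\|X_n - X_0\|_2^2 \le \|\theta - \theta^\star\|_\Theta^2$. Choosing $\eta = \frac{\lambda_{si}}{8R_{si}^2}$ yields the inequality:

$$D_\theta^2 L_D(\bar{\theta}, g)[\theta - \theta^\star, \theta - \theta^\star] \ge \frac{\tau_{si}}{4}\|\theta - \theta^\star\|_\Theta^2 - \frac{4\tau_{si}L_{si}^4R_{si}^2}{\lambda_{si}}\|g - g_0\|_{\mathcal{G}}^4.$$

Thus, Assumption 3 is satisfied with $\lambda = \frac{\tau_{si}}{4}$ and $\kappa = \frac{4\tau_{si}L_{si}^4R_{si}^2}{\lambda_{si}}$.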
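The choice of η in the AM-GM step can be checked with exact rational arithmetic. The sketch below is a hypothetical helper (not from the paper) that plugs η = λ_si/(8R_si²) into the two coefficients of the last display of the derivation and returns them, so that they can be compared with λ = τ_si/4 and κ = 4τ_si L_si⁴ R_si²/λ_si.

```python
from fractions import Fraction

def assumption3_constants(tau, L, R, lam_si):
    """Return (eta, coefficient of ||theta - theta*||^2, coefficient of ||g - g0||^4)
    after choosing eta = lam_si / (8 R^2) in the AM-GM step."""
    eta = lam_si / (8 * R ** 2)
    # tau/2 from the main term minus the AM-GM remainder 4*eta*tau*R^2 / (2*lam_si).
    coef_theta = tau / 2 - 4 * eta * tau * R ** 2 / (2 * lam_si)
    # Coefficient tau * L^4 / (2 * eta) of the nuisance error term.
    kappa = tau * L ** 4 / (2 * eta)
    return eta, coef_theta, kappa

# Arbitrary illustrative values for (tau_si, L_si, R_si, lambda_si).
tau, L, R, lam_si = Fraction(3), Fraction(2), Fraction(5), Fraction(7, 10)
eta, coef_theta, kappa = assumption3_constants(tau, L, R, lam_si)
print(eta, coef_theta, kappa)
print(coef_theta == tau / 4, kappa == 4 * tau * L ** 4 * R ** 2 / lam_si)
```

Both comparisons hold for any positive inputs, since the identities are algebraic; the numbers above are only a spot check.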

Assumption 4(a). Using (51) and (52) we have:

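$$D_\theta^2 L_D(\bar{\theta}, g_0)[\theta - \theta^\star, \theta - \theta^\star] \le T_{si} \cdot \mathbb{E}_z\big[\langle W_0, X_n - X_0\rangle^2\big] = T_{si} \cdot \|\theta - \theta^\star\|_\Theta^2.$$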

It follows that Assumption 4(a) is satisfied with β1 = 2Tsi.

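Assumption 4(b). For simplicity of notation, define the random vectors $X_0 = \theta^\star(x)$, $X_n = \theta(x)$, $V_0 = g_0(w)$, $V_n = g(w)$ and $\Sigma(w) = \mathbb{E}\big[\nabla^2_{\gamma\gamma}\nabla_{\zeta_i}\ell(\theta^\star(x), \bar{g}(w), z) \mid w\big]$. Then invoking the assumed structure on the loss function:

$$D_g^2 D_\theta L_D(\theta^\star, \bar{g})[\theta - \theta^\star, g - g_0, g - g_0] = \sum_{i=1}^{K_2} \mathbb{E}\Big[(X_{ni} - X_{0i})\,(V_n - V_0)^\top \nabla^2_{\gamma\gamma}\nabla_{\zeta_i}\ell(\theta^\star(x), \bar{g}(w), z)\,(V_n - V_0)\Big].$$

Using that $x \subseteq w$, and writing $\mu$ for the constant $\mu_{si}$ of Assumption 11, we have

$$\begin{aligned}
&= \sum_{i=1}^{K_2} \mathbb{E}\Big[(X_{ni} - X_{0i})\,(V_n - V_0)^\top \Sigma(w)\,(V_n - V_0)\Big] \\
&\le \sum_{i=1}^{K_2} \mathbb{E}\Big[|X_{ni} - X_{0i}| \cdot \big|(V_n - V_0)^\top \Sigma(w)\,(V_n - V_0)\big|\Big] \\
&\le \mu \sum_{i=1}^{K_2} \mathbb{E}\Big[|X_{ni} - X_{0i}| \cdot \|V_n - V_0\|_2^2\Big] \\
&= \mu\, \mathbb{E}\Big[\|X_n - X_0\|_1 \cdot \|V_n - V_0\|_2^2\Big].
\end{aligned}$$

All that remains is to relate these norms to the norms appearing in the lemma statement.

$$\begin{aligned}
&\le \mu\sqrt{K_2}\, \mathbb{E}\Big[\|X_n - X_0\|_2 \cdot \|V_n - V_0\|_2^2\Big] \\
&\le \mu\sqrt{K_2} \cdot \sqrt{\mathbb{E}\big[\|X_n - X_0\|_2^2\big]} \cdot \sqrt{\mathbb{E}\big[\|V_n - V_0\|_2^4\big]} \\
&= \mu\sqrt{K_2} \cdot \|\theta - \theta^\star\|_{L_2(\ell_2, D)} \cdot \|g - g_0\|_{L_4(\ell_2, D)}^2 \\
&\le \frac{\mu\sqrt{K_2}\,C_{2\to 4}^2}{\sqrt{\lambda_{si}}}\, \|\theta - \theta^\star\|_\Theta \cdot \|g - g_0\|_{\mathcal{G}}^2.
\end{aligned}$$

Thus, we have

$$D_g^2 D_\theta L_D(\theta^\star, \bar{g})[\theta - \theta^\star, g - g_0, g - g_0] \le \frac{\mu\sqrt{K_2}\,C_{2\to 4}^2}{\sqrt{\lambda_{si}}}\, \|\theta - \theta^\star\|_\Theta \cdot \|g - g_0\|_{\mathcal{G}}^2, \tag{57}$$

and so Assumption 4(b) is satisfied with $\beta_2 = \frac{\mu\sqrt{K_2}\,C_{2\to 4}^2}{\sqrt{\lambda_{si}}}$.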

Proof of Lemma 3. Immediate.

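The following lemma is used in certain subsequent results.

Lemma 4. Suppose Assumption 11 holds. Then for any functions $\theta \in \widehat{\Theta}$, $\bar{\theta} \in \mathrm{star}(\widehat{\Theta}, \theta^\star)$, and $g \in \mathcal{G}$,

$$D_\theta^2 L_D(\bar{\theta}, g)[\theta - \theta^\star, \theta - \theta^\star] \le 3T_{si}\,\|\theta - \theta^\star\|_\Theta^2 + \frac{T_{si}L_{si}^4R_{si}^2}{\lambda_{si}}\,\|g - g_0\|_{\mathcal{G}}^4. \tag{58}$$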

Proof. We adopt the same notation as in the proof of Lemma 2. This proof is almost the same as the proof that Assumption 3 is satisfied in that lemma.

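$$\begin{aligned}
D_\theta^2 L_D(\bar{\theta}, g)[\theta - \theta^\star, \theta - \theta^\star] &\le T_{si} \cdot \mathbb{E}_z\big[\langle W_n, X_n - X_0\rangle^2\big] \\
&\le 2T_{si} \cdot \mathbb{E}_z\big[\langle W_0, X_n - X_0\rangle^2\big] + 2T_{si} \cdot \mathbb{E}_z\big[\langle W_0 - W_n, X_n - X_0\rangle^2\big] \\
&\le 2T_{si} \cdot \mathbb{E}_z\big[\langle W_0, X_n - X_0\rangle^2\big] + 2T_{si} \cdot \mathbb{E}_z\big[\|W_0 - W_n\|_2^2\,\|X_n - X_0\|_2^2\big] \\
&\le 2T_{si} \cdot \|\theta - \theta^\star\|_\Theta^2 + 2T_{si}L_{si}^2\,\mathbb{E}_z\big[\|V_n - V_0\|_2^2\,\|X_n - X_0\|_2^2\big].
\end{aligned}$$

Using the AM-GM inequality we have for any $\eta > 0$:

$$\begin{aligned}
&\le 2T_{si} \cdot \|\theta - \theta^\star\|_\Theta^2 + \frac{T_{si}L_{si}^4}{\eta}\,\mathbb{E}_z\|V_n - V_0\|_2^4 + \eta T_{si}\,\mathbb{E}\|X_n - X_0\|_2^4 \\
&\le 2T_{si} \cdot \|\theta - \theta^\star\|_\Theta^2 + \frac{T_{si}L_{si}^4}{\eta}\,\|g - g_0\|_{\mathcal{G}}^4 + \eta T_{si}R_{si}^2\,\mathbb{E}\|X_n - X_0\|_2^2 \\
&\le 2T_{si} \cdot \|\theta - \theta^\star\|_\Theta^2 + \frac{T_{si}L_{si}^4}{\eta}\,\|g - g_0\|_{\mathcal{G}}^4 + \frac{\eta T_{si}R_{si}^2}{\lambda_{si}}\,\|\theta - \theta^\star\|_\Theta^2.
\end{aligned}$$

Choosing $\eta = \frac{\lambda_{si}}{R_{si}^2}$ yields the inequality:

$$D_\theta^2 L_D(\bar{\theta}, g)[\theta - \theta^\star, \theta - \theta^\star] \le 3T_{si}\,\|\theta - \theta^\star\|_\Theta^2 + \frac{T_{si}L_{si}^4R_{si}^2}{\lambda_{si}}\,\|g - g_0\|_{\mathcal{G}}^4.$$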

B Additional Applications

In this section of the appendix we show how three additional families of applications—offline policy learning/optimization, domain adaptation/sample bias correction, and learning with missing data—fall into our general orthogonal statistical learning framework. We also sketch some statistical consequences based on our general algorithmic tools, and show how these results generalize and extend previous work.

B.1 Policy Learning

In this section we show how to view some additional policy learning models in our framework. As in Section 3.4 we consider the data-generating process in (16), but here we go beyond binary treatments.

Multiple finite treatments. The binary setting above can easily be extended to the case of $N$ possible treatments, analyzed in Zhou et al. (2018). Formally, let $T \in \{\vec{e}_1, \ldots, \vec{e}_N\}$, where $\vec{e}_i \in \{0, 1\}^N$ is the $i$-th standard basis vector. We still follow the data generating process (16), but now $e_0 : \mathcal{X} \to \Delta_N$ and $f_0 : \{0, 1\}^N \times \mathcal{X} \to \mathbb{R}$. To simplify notation, let $p_0(t, x) = \Pr[T = t \mid X = x]$, so $p_0(t, x) = e_0(x)_t$. Then the following loss function is an unbiased estimate of the counterfactual loss:

$$\ell(t, Z; f_0, e_0) = f_0(t, X) + \mathbb{1}\{T = t\}\,\frac{Y - f_0(t, X)}{p_0(t, X)}. \tag{59}$$

This formulation leads to the standard extension of the doubly-robust estimator to multiple outcomes (Dudík et al., 2011; Zhou et al., 2018). Define an $N$-dimensional vector-valued function $\beta(f_0, e_0, Z)$ whose $t$-th coordinate is equal to $\ell(t, Z; f_0, e_0)$. Then, as in the binary case, we can equivalently optimize a population risk that is linear in the target parameter:

$$L_D(\theta, \{f_0, e_0\}) = \mathbb{E}[\langle \beta(f_0, e_0, Z), \theta(x)\rangle]. \tag{60}$$

This population risk is easily shown to satisfy universal orthogonality.
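The doubly robust score in (59) can be written in a few lines. The sketch below is illustrative: treatments are indexed by integers `0..N-1` rather than basis vectors, the function names are hypothetical, and the exact-expectation helper assumes a noiseless outcome `Y = m(T, x)` so that unbiasedness can be checked by enumeration rather than simulation.

```python
def dr_score(y, t_obs, x, f, p, n_treatments):
    """Vector beta whose t-th coordinate is f(t, x) + 1{T = t} (Y - f(t, x)) / p(t, x)."""
    return [
        f(t, x) + (1.0 if t == t_obs else 0.0) * (y - f(t, x)) / p(t, x)
        for t in range(n_treatments)
    ]

def linear_policy_loss(theta_x, beta):
    """Inner product <beta, theta(x)> for a policy given as a probability vector."""
    return sum(b * w for b, w in zip(beta, theta_x))

def expected_score(m, f, p, n_treatments, x=None):
    """Exact E[beta | x] when T ~ p(., x) and the outcome is the noiseless Y = m(T, x)."""
    total = [0.0] * n_treatments
    for s in range(n_treatments):
        beta = dr_score(m(s, x), s, x, f, p, n_treatments)
        for t in range(n_treatments):
            total[t] += p(s, x) * beta[t]
    return total

m = lambda t, x: [1.0, 2.0, 4.0][t]   # true mean outcome per treatment
f = lambda t, x: 0.5 * t              # deliberately wrong outcome model
p = lambda t, x: [0.2, 0.3, 0.5][t]   # true propensities
print(expected_score(m, f, p, 3))
```

With the true propensities, the expected score recovers `m` coordinate by coordinate even though the outcome model `f` is misspecified, which is the sense in which (59) is unbiased.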

Counterfactual risk minimization and general continuous treatments. Counterfactual risk minimization (CRM) is a learning framework introduced by Swaminathan and Joachims (2015a). It is mathematically equivalent to the policy learning setup with arbitrary treatment and outcome spaces, but is motivated by a different set of real-world learning scenarios and was developed in a parallel line of research in the machine learning literature. To highlight the relationship with policy learning and the applicability of our results to this setting we will present the CRM framework using the notation of policy learning. In counterfactual risk minimization we receive data $Z = (Y, T, X)$ from the policy learning data generating process (16). The goal is to choose a hypothesis $\theta : \mathcal{X} \to \Delta(\mathcal{T})$ (i.e., the policy takes as input covariates and returns a distribution over treatments) that minimizes the population risk:

$$L_D^{1}(\theta, f_0) = \mathbb{E}_Z\,\mathbb{E}_{t \sim \theta(X)}[f_0(t, X)]. \tag{61}$$

As in policy learning, we construct an unbiased estimate of this counterfactual loss via inverse propensity scoring. Let $p_0(t, X)$ denote the probability density of treatment $t$ conditional on covariates $X$ and (overloading notation) let $\theta(t, x)$ denote the density that hypothesis $\theta$ assigns to treatment $t$. Then we can formulate a new risk function that provides an unbiased estimate of the target risk (61):

$$L_D^{2}(\theta, p_0) = \mathbb{E}\left[Y\,\frac{\theta(T, X)}{p_0(T, X)}\right]. \tag{62}$$

In the CRM framework the propensity $p_0$ is assumed to be known (Swaminathan and Joachims, 2015a,b). When the propensity is not known, we can treat it as a nuisance parameter to be estimated from data. However, the loss (62) is not orthogonal to $p_0$. We can orthogonalize the population risk by also constructing an estimate of $f_0$ (see (16)) by regressing $Y$ on $(T, X)$. This leads to an analogue of the doubly robust formulation from the finite treatment setup:

$$L_D(\theta, \{f_0, p_0\}) = \mathbb{E}\Bigg[\underbrace{\left(f_0(T, X) + \frac{Y - f_0(T, X)}{p_0(T, X)}\right)}_{=: \beta(f_0, p_0, Z)} \cdot\, \theta(T, X)\Bigg]. \tag{63}$$

For finite treatments this formulation is mathematically equivalent to the population risk for multiple finite treatments presented above.

For continuous treatments, the empirical version of the problem (63) may be ill-posed, even if we assume that the propensity $p_0$ has density lower bounded by some constant (the analogue of the overlap condition). Swaminathan and Joachims (2015a) proposed to regularize the empirical risk via variance penalization. A similar variance penalization approach is also proposed in recent work of Bertsimas and McCord (2018), who consider policy learning over arbitrary treatment spaces. The variance-penalized empirical risk minimization algorithm, in the context of Meta-Algorithm 1, can be seen as a second stage algorithm that achieves a rate whose leading term scales with the variance of the optimal policy rather than some worst-case upper bound on the risk. Hence, it can be used in our framework to achieve variance-dependent excess risk bounds.

Kallus and Zhou (2018) develop alternative algorithms for policy learning with continuous treatments via a kernel smoothing approach. This approach is equivalent to adding noise to a deterministic hypothesis space $\Pi$, e.g. $\theta(x) = \pi(x) + \zeta$ for each $\pi \in \Pi$, where $\zeta \sim \mathcal{N}(0, \sigma^2)$. In our framework $\Theta$ is the space of density functions $\theta(t, x)$ induced by this construction. The value of a deterministic policy $\pi \in \Pi$ (or equivalently the value of its corresponding density $\theta \in \Theta$) is equal to

$$\mathbb{E}[\beta(f_0, p_0, Z) \cdot K_\sigma(T - \pi(X))], \tag{64}$$

where $K_\sigma$ is the pdf of a normal distribution with standard deviation $\sigma$. This is equivalent to the formulation in Kallus and Zhou (2018), since the empirical version of this risk is the kernel-weighted loss:

$$L_S(\pi, \{f_0, p_0\}) = \frac{1}{n}\sum_{i=1}^{n} \beta(f_0, p_0, Z_i) \cdot K_\sigma(T_i - \pi(X_i)), \tag{65}$$

though we note that Kallus and Zhou (2018) do not restrict themselves to only the Gaussian kernel. This idea falls into our framework by simply defining $\Theta$ to be this space of randomly perturbed policy functions. The resulting analysis in our framework is slightly different than that of Kallus and Zhou (2018), where kernel weighting is invoked to show consistency of the empirical risk, and subsequently optimization of the empirical risk over deterministic policies is analyzed. With our framework, we directly calculate the regret with respect to randomized policies by applying our general theorem. This implies that we enjoy robustness to errors in estimating $f_0$ and $p_0$.

Observe that the rate for the second stage will depend on the magnitude of $\sigma$, since the variance of the empirical risk is governed by $\sigma$. Consequently, if one is interested in regret against deterministic strategies, we can invoke Lipschitzness of the reward function $f_0$ to control the regret added by the extra randomness we are injecting, which would typically be of order $\sigma$. We can then choose an optimal $\sigma$ as a function of the number of samples to trade off the bias and variance of the regret. If we wish to further optimize dependence on problem-dependent parameters in the resulting rates, one can use variance penalization in the kernel-based framework to achieve a regret rate whose leading term scales with the variance of the optimal policy.
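As a rough illustration of the kernel-weighted empirical risk (65), the sketch below evaluates it for a Gaussian kernel. The data layout (a list of `(y, t, x)` triples), the function names, and the toy score `beta` are assumptions made for this example; in practice `beta` would be built from first-stage estimates of $f_0$ and $p_0$.

```python
import math

def gaussian_pdf(u, sigma):
    """Density at u of a centered normal distribution with standard deviation sigma."""
    return math.exp(-0.5 * (u / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def kernel_weighted_risk(samples, policy, beta, sigma):
    """(1/n) * sum_i beta(Z_i) * K_sigma(T_i - policy(X_i)) over samples Z_i = (y, t, x)."""
    total = 0.0
    for y, t, x in samples:
        total += beta(y, t, x) * gaussian_pdf(t - policy(x), sigma)
    return total / len(samples)

# Toy data in which the logged treatment coincides with the policy's recommendation.
samples = [(1.0, 0.0, 0.0), (2.0, 1.0, 1.0)]
policy = lambda x: x          # hypothetical deterministic policy pi(x) = x
beta = lambda y, t, x: y      # placeholder score; a real one would use f0 and p0
print(kernel_weighted_risk(samples, policy, beta, sigma=1.0))
```

Smaller values of `sigma` concentrate the weight on samples whose logged treatment is close to `policy(x)`, which lowers the smoothing bias but raises the variance of the estimate, matching the trade-off described above.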

B.2 Domain Adaptation and Sample Bias Correction

Domain adaptation is a widely studied topic in machine learning (Daume III and Marcu, 2006; Jiang and Zhai, 2007; Ben-David et al., 2007; Blitzer et al., 2008; Mansour et al., 2009). The goal is to choose a hypothesis that minimizes a given loss in expectation over a target data distribution, where the target distribution may be different from the distribution of data that is already collected. We consider a particular instance of domain adaptation called covariate shift, encountered in supervised learning (Shimodaira, 2000). We assume that we have data $Z = (X, Y)$, where $X$ are covariates drawn from some distribution $D_s$ with density $p_s$ and $Y$ are labels, drawn from some distribution $D_x$ conditional on $x$. Our goal is to choose a hypothesis $\theta$ from some hypothesis space $\Theta$, so as to minimize a loss function $\ell(\theta(x), y)$ in expectation over a different distribution of covariates $D_t$ with density $p_t$. Both of the densities are unknown, and we solve this issue in the orthogonal statistical learning framework by treating their ratio as a nuisance parameter for an importance-weighted loss function. Let $f_0(x) = \frac{p_t(x)}{p_s(x)}$ and $g_0 = \{f_0\}$, so that

$$L_D(\theta, g_0) := \mathbb{E}_{D_s}\mathbb{E}_{y \mid x}[\ell(\theta(x), y) \cdot f_0(x)] = \mathbb{E}_{D_t}\mathbb{E}_{y \mid x}[\ell(\theta(x), y)]. \tag{66}$$
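The identity in (66) is a change of measure, and it can be verified exactly on a finite covariate set. The sketch below is illustrative only: the dictionaries `ps`, `pt`, and `cond_loss` (the conditional expected loss given x) are made-up inputs, and the function names are not from the paper.

```python
def weighted_source_risk(ps, pt, cond_loss):
    """E_{x ~ ps}[ f0(x) * E[loss | x] ] with density ratio f0 = pt / ps."""
    return sum(ps[x] * (pt[x] / ps[x]) * cond_loss[x] for x in ps)

def target_risk(pt, cond_loss):
    """E_{x ~ pt}[ E[loss | x] ], the risk under the target covariate distribution."""
    return sum(pt[x] * cond_loss[x] for x in pt)

def importance_weighted_empirical_risk(samples, theta, loss, f_hat):
    """Plug-in empirical risk (1/n) * sum_i loss(theta(x_i), y_i) * f_hat(x_i)."""
    return sum(loss(theta(x), y) * f_hat(x) for x, y in samples) / len(samples)

ps = {0: 0.5, 1: 0.3, 2: 0.2}            # source covariate distribution
pt = {0: 0.1, 1: 0.2, 2: 0.7}            # target covariate distribution
cond_loss = {0: 1.0, 1: 4.0, 2: 0.25}    # E[loss(theta(x), y) | x] for some fixed theta
print(weighted_source_risk(ps, pt, cond_loss), target_risk(pt, cond_loss))
```

The empirical version takes an estimated ratio `f_hat` in place of $f_0$; orthogonality is what controls the effect of that substitution on the excess risk.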

We assume the hypothesis space satisfies a realizability condition, i.e. there exists $\theta_0 \in \Theta$ such that:

$$\mathbb{E}[\nabla_\zeta \ell(\theta_0(x), y) \mid x] = 0, \tag{67}$$

where $\nabla_\zeta$ corresponds to the gradient with respect to the first input of $\ell$. For instance, for the case of the square loss $\ell(\zeta, y) = (\zeta - y)^2$, this condition has the natural interpretation that there exists $\theta_0 \in \Theta$ such that:

$$\mathbb{E}[y \mid x] = \theta_0(x). \tag{68}$$

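Observe that when we treat the density ratio $f_0(x)$ as a nuisance function, the loss function $L_D$ is orthogonal. Indeed,

$$\begin{aligned}
D_f D_\theta L_D(\theta_0, \{f_0\})[\theta - \theta_0, f - f_0] &= \mathbb{E}[\nabla_\zeta \ell(\theta_0(x), y) \cdot (\theta(x) - \theta_0(x)) \cdot (f(x) - f_0(x))] \\
&= \mathbb{E}\big[\mathbb{E}[\nabla_\zeta \ell(\theta_0(x), y) \mid x] \cdot (\theta(x) - \theta_0(x)) \cdot (f(x) - f_0(x))\big] = 0.
\end{aligned}$$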

Also, note that this setup fits into the single index structure from Appendix A by writing $L_D$ as the expectation of a new loss $\tilde{\ell}(\theta(x), g(w), z) := \ell(\theta(x), y) \cdot f(x)$. Focusing on the square loss $\ell(\zeta, y) = (\zeta - y)^2$ for concreteness, it is simple to show that all of the conditions of Assumption 11 are satisfied as long as we have $g(x) \ge \eta > 0$ for all $g \in \mathcal{G}$. Corollary 1 then implies that with probability at least $1 - \delta$, Meta-Algorithm 1 enjoys the bound

$$L_D(\widehat{\theta}, g_0) - L_D(\theta^\star, g_0) \le O\Big(\mathrm{Rate}_D(\Theta, S_2, \delta/2; \widehat{g}) + \mathrm{poly}(\eta^{-1}) \cdot \big(\mathrm{Rate}_D(\mathcal{G}, S_1, \delta/2)\big)^4\Big).$$

Note that whenever $\mathrm{Rate}_D(\mathcal{G}, S_1, \delta/2) = o\big(\mathrm{Rate}_D(\Theta, S_2, \delta/2; \widehat{g})^{1/4}\big)$ the dependence on $\eta^{-1}$ is negligible asymptotically. Of course, it is also important to develop algorithms for which the rate of the target class does not depend on $\eta^{-1}$. As one example, we can employ the variance-penalized ERM guarantee from Theorem 5. When $\Theta$ has VC dimension $d$, and the variance of the loss at $(\theta_0, g_0)$ and the capacity function $\tau_0$ are bounded, this gives $\mathrm{Rate}_D(\Theta, S_2, \delta/2; \widehat{g}) = O(\sqrt{d/n})$, with $\eta^{-1}$ entering only lower-order terms. The final result is that if $\mathrm{Rate}_D(\mathcal{G}, S_1, \delta/2) = o(n^{-1/8})$, we get an excess risk bound for which the dominant term is $O(\sqrt{d/n})$, with no dependence on $\eta^{-1}$.
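The threshold $n^{-1/8}$ is simple exponent arithmetic: a nuisance rate of order $n^{-a}$ contributes $n^{-4a}$ to the bound, which is dominated by $\sqrt{d/n}$ exactly when $4a > 1/2$. The sketch below tabulates the two terms, ignoring all constants; the polynomial form of the nuisance rate and the function name are assumptions for illustration.

```python
import math

def excess_risk_terms(n, d, a):
    """Leading terms of the bound up to constants: sqrt(d/n) for the target class
    and n^(-4a) for a nuisance estimation rate of order n^(-a)."""
    return math.sqrt(d / n), n ** (-4.0 * a)

for a in (0.2, 0.1):
    for n in (10 ** 3, 10 ** 5, 10 ** 7):
        target, nuisance = excess_risk_terms(n, d=10, a=a)
        # Ratio shrinks with n when a > 1/8 and grows when a < 1/8.
        print(a, n, nuisance / target)
```

For `a = 0.2` the ratio of the nuisance term to the target term decreases with `n`, while for `a = 0.1` it increases, so the nuisance error would eventually dominate.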

Related work. Cortes et al. (2010) gave generalization error guarantees for the importance-weighted loss (66) in the case where the densities $p_s$ and $p_t$ are known. At the other extreme, Ben-David and Urner (2012) showed strong impossibility results in the regime where the densities are unknown. Our results lie in the middle, and show that learning with unknown densities is possible in the regime where the weights belong to a nonparametric class that is not much more complex than the target predictor class $\Theta$. We remark in passing that algorithms based on discrepancy minimization (Ben-David et al., 2007; Mansour et al., 2009) offer another approach to domain adaptation that does not require importance weights, but these results are not directly comparable to our own.

B.3 Missing Data

As a final application, we apply our tools to the well-studied problem of regression with missing/censored data (Robins and Rotnitzky, 1995; van der Laan and Robins, 2003; Rubin and van der Laan, 2005; Tsiatis, 2007; Wang et al., 2010; Graham, 2011). In this setting the data is generated through a standard regression model, but label/target variables are sometimes “missing” or unobserved. The learner observes whether or not the target is missing for each example, and the conditional probability that the target is missing is treated as an unknown nuisance parameter. As usual, the target is the unknown regression function. To proceed, we formalize the setting through the following data-generating process for the observed variables $(X, W, T, \tilde{Y})$:

$$\begin{aligned}
\tilde{Y} &= T \cdot Y, \qquad Y = \theta_0(X) + \varepsilon_1, \qquad \mathbb{E}[\varepsilon_1 \mid X] = 0, \\
T &= e_0(W) + \varepsilon_2, \qquad \mathbb{E}[\varepsilon_2 \mid W] = 0.
\end{aligned} \tag{69}$$

Here $T \in \{0, 1\}$ is an auxiliary variable (observed by the learner) that indicates whether the target variable is missing, and $e_0 : \mathcal{X} \to [0, 1]$ is the unknown propensity for $T$. The parameter $\theta_0 : \mathcal{X} \to \mathbb{R}$ is the true regression function. We make the standard no-unobserved-confounders assumption that $X \subseteq W$ and $T \perp Y \mid W$. We define $h_0(w) = -2\,\frac{\mathbb{E}[Y \mid W = w] - \theta_0(x)}{e_0(w)}$, take $g_0 = \{h_0, e_0\}$, and use the loss

$$\ell(\theta, \{h, e\}, z) = \frac{T\big(\tilde{Y} - \theta(X)\big)^2}{e(W)} - \theta(X)\,h(W)\,(T - e(W)). \tag{70}$$

Observe that this loss has the property that

$$L_D(\theta, g_0) = \mathbb{E}(Y - \theta(X))^2,$$

so that the excess risk relative to $\theta_0$ precisely corresponds to prediction accuracy whenever the true nuisance parameter is plugged in.

Proposition 1. This model satisfies Assumption 1 and Assumption 2 whenever $\theta_0 \in \Theta$, i.e. we have $D_\theta L_D(\theta_0, \{h, e_0\})[\theta - \theta_0] = 0$, $D_e D_\theta L_D(\theta_0, \{h_0, e_0\})[\theta - \theta_0, e - e_0] = 0$, and $D_h D_\theta L_D(\theta_0, \{h_0, e_0\})[\theta - \theta_0, h - h_0] = 0$.

Note that the extra nuisance parameter $h$ is only required here because we consider the general setting in which $W \ne X$. Whenever $W = X$ this is unnecessary (and indeed $h_0 = 0$). This parameter can generally be estimated at a rate no worse than the rate for $e_0$ and $\theta_0$ (absent nuisance parameters); see Chernozhukov et al. (2018b) for discussion.

As to rates and algorithms, the situation here is essentially the same as that of the domain adaptation example, so we discuss it only briefly. The setup has the single index structure from Appendix A, and all of the sufficient conditions for fast rates from Assumption 11 are satisfied as long as we have $e(W) \ge \eta > 0$ for all $e$ in the nuisance class. Thus, with probability at least $1 - \delta$, Meta-Algorithm 1 enjoys the bound
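The claim that the loss (70), evaluated at the true nuisances, has population risk $\mathbb{E}(Y - \theta(X))^2$ can be checked by exact enumeration on a small discrete model. The sketch below is a toy construction, not the paper's: $W = (X, U)$ is uniform on four points, the outcome is the noiseless $Y = \mu(W)$ (so $T \perp Y \mid W$ holds trivially), and all names and numbers are hypothetical.

```python
def missing_data_loss(theta_x, h_w, e_w, t, y_tilde):
    """Loss (70): T (Y~ - theta(X))^2 / e(W) - theta(X) h(W) (T - e(W))."""
    return t * (y_tilde - theta_x) ** 2 / e_w - theta_x * h_w * (t - e_w)

def population_risk(theta, mu, e0, theta0, support):
    """Exact expected loss at the true nuisances (h0, e0), with W uniform on
    `support` (pairs w = (x, u)), T ~ Bernoulli(e0[w]) and Y = mu[w]."""
    total = 0.0
    for w in support:
        x = w[0]
        h0 = -2.0 * (mu[w] - theta0[x]) / e0[w]
        for t in (0, 1):
            prob_t = e0[w] if t == 1 else 1.0 - e0[w]
            y_tilde = t * mu[w]  # the outcome is only observed when T = 1
            total += prob_t * missing_data_loss(theta[x], h0, e0[w], t, y_tilde)
    return total / len(support)

def squared_error(theta, mu, support):
    """E (Y - theta(X))^2 under the same toy model."""
    return sum((mu[w] - theta[w[0]]) ** 2 for w in support) / len(support)

support = [(0, 0), (0, 1), (1, 0), (1, 1)]
mu = {(0, 0): 1.0, (0, 1): 3.0, (1, 0): -2.0, (1, 1): 0.0}
theta0 = {0: 2.0, 1: -1.0}   # E[Y | X = x], the average of mu over u
e0 = {(0, 0): 0.9, (0, 1): 0.4, (1, 0): 0.6, (1, 1): 0.25}
theta = {0: 0.5, 1: 1.5}     # an arbitrary candidate predictor
print(population_risk(theta, mu, e0, theta0, support), squared_error(theta, mu, support))
```

The two printed values should coincide up to floating-point error: the inverse-propensity factor undoes the missingness, and the correction term involving $h_0$ has mean zero when $e_0$ is the true propensity.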

$$L_D(\widehat{\theta}, g_0) - L_D(\theta^\star, g_0) \le O\Big(\mathrm{Rate}_D(\Theta, S_2, \delta/2; \widehat{g}) + \mathrm{poly}(\eta^{-1}) \cdot \big(\mathrm{Rate}_D(\mathcal{G}, S_1, \delta/2)\big)^4\Big).$$

As with the previous example, the variance-penalized ERM guarantees from Section 4 can be applied here to provide bounds on $\mathrm{Rate}_D(\Theta, \ldots)$ for which the dominant term in the excess risk does not scale with the inverse propensity range.

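Proof of Proposition 1. We first show that the gradient vanishes in the sense of Assumption 2 when evaluated at $\theta_0$. In particular, for any choice of $h$ we have

$$D_\theta L_D(\theta_0, \{h, e_0\})[\theta - \theta_0] = \mathbb{E}\left[\left(-2T\,\frac{\tilde{Y} - \theta_0(X)}{e_0(W)} - h(W)(T - e_0(W))\right)(\theta(X) - \theta_0(X))\right].$$

Using that $\mathbb{E}[T \mid W] = e_0(W)$ and $X \subseteq W$:

$$= \mathbb{E}\left[\left(-2T\,\frac{\tilde{Y} - \theta_0(X)}{e_0(W)}\right)(\theta(X) - \theta_0(X))\right].$$

Using that $T \in \{0, 1\}$:

$$= \mathbb{E}\left[\left(-2T\,\frac{Y - \theta_0(X)}{e_0(W)}\right)(\theta(X) - \theta_0(X))\right].$$

Using that $T \perp Y \mid W$ and that $X \subseteq W$:

$$\begin{aligned}
&= \mathbb{E}_W\left[\left(-2\,\mathbb{E}[T \mid W]\,\frac{\mathbb{E}[(Y - \theta_0(X)) \mid W]}{e_0(W)}\right)(\theta(X) - \theta_0(X))\right] \\
&= -2\,\mathbb{E}[(Y - \theta_0(X))(\theta(X) - \theta_0(X))].
\end{aligned}$$

Using that $\mathbb{E}[Y \mid X] = \theta_0(X)$:

$$= 0.$$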

To establish orthogonality with respect to e, we have

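\[ D_e D_\theta L_D(\theta_0, \{h_0, e_0\})[\theta - \theta_0, e - e_0] = \mathbb{E}\left[ \left( 2T \frac{(Y - \theta_0(X))}{e_0(W)^2} + h_0(W) \right) (\theta(X) - \theta_0(X))(e(W) - e_0(W)) \right]. \]

Using that $X \subseteq W$:
\[ = \mathbb{E}_W\left[ \left( 2 \frac{(\mathbb{E}[Y \mid W] - \theta_0(X))}{e_0(W)} + h_0(W) \right) (\theta(X) - \theta_0(X))(e(W) - e_0(W)) \right]. \]

The result follows immediately, using that $h_0(w) = -2 \frac{(\mathbb{E}[Y \mid W = w] - \theta_0(x))}{e_0(w)}$:
\[ = 0. \]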

To establish orthogonality with respect to h, we have

DhDθLD(θ0, {h0, e0})[θ − θ0, h − h0] = E[(T − e0(W ))(θ(X) − θ0(X))(h(W ) − h0(W ))] = E[ε2 · (θ(X) − θ0(X))(h(W ) − h0(W ))].

Using that E[ε2 | W ] = 0 and X ⊆ W , the expression above is seen to be equal to zero.

C Orthogonal Loss Construction: Examples

In this section of the appendix we walk through concrete examples of the orthogonal loss construction approach outlined in Section 3.5. Both examples use the loss structure in (22).

Estimating utility functions in models of strategic competition. In this setting, we have
\[ \theta(x) = \begin{pmatrix} \psi(x) \\ \Delta \end{pmatrix}, \quad x = w, \quad \text{and} \quad \nabla_\zeta \ell(\theta(x), g; z) = (L(\psi(x) + \Delta\, g(x)) - y) \cdot \begin{pmatrix} 1 \\ g(x) \end{pmatrix}, \]
where $L$ is the logistic function. The motivation of this problem stems from estimating games of incomplete information, where $y$ is the entry decision of one player, $u$ is the entry decision of the opponent, $x$ is a featurized state of the world, $\psi(x)$ is the non-strategic part of the utility of the player and $\Delta\, g$ is the competitive part of the utility, i.e. the effect of the opponent's entry decision on the player's utility. For this setting, we can take the auxiliary nuisance variable to be

\[ a_0(x) = \mathbb{E}[\nabla_\gamma \nabla_\zeta \ell(\theta_0(x), g_0(x); z) \mid x] = \Delta_0 L'(\psi_0(x) + \Delta_0\, g_0(x)) \big( 1, g_0(x) \big). \tag{71} \]

Thus, $a_0(w)$ is a known function of $\theta_0$ and $g_0$, and to estimate $a_0$ it suffices to construct preliminary estimates for $\theta_0$ and $g_0$ (e.g., using plug-in estimation with the original non-orthogonal loss). We can simplify the final orthogonal loss based on our generic construction to

\[ \tilde{\ell}(\theta(x), \tilde{g}; z) := \ell(\theta(x), g; z) + (u - g(w)) \cdot \tilde{\Delta} L'(\tilde{\psi}(x) + \tilde{\Delta}\, g(x)) \cdot (\theta(x) + \Delta g(x)), \tag{72} \]

where $\tilde{g} = \{g, \tilde{\theta}\}$ and $\tilde{\theta} = \begin{pmatrix} \tilde{\psi}(x) \\ \tilde{\Delta} \end{pmatrix}$ is a preliminary estimate for $\theta_0$.

Treatment effect estimation. In this setting, we take $z = (x, w, y)$ and $w = (x, v, d)$, where $v$ is a vector of extra control variables and $d \in \{0, 1\}$ is a binary treatment. We begin from the moment equation

\[ \nabla_\zeta \ell(\theta(x), g; z) := \theta(x) - g(x, v, 1) + g(x, v, 0), \tag{73} \]

with $g_0$ identified by the local moment equation $\mathbb{E}[y - g_0(x, v, d) \mid x, v, d] = 0$. In this case, the Riesz representer takes the form $a_0(w) = \frac{d - (1 - d)}{\Pr[D = d \mid x, v]}$, and the resulting orthogonal loss created using the generic construction takes the form

\[ \tilde{\ell}(\theta(x), \tilde{g}; z) = (\theta(x) - g(1, x, v) + g(0, x, v))^2 + (y - g(x, v, d)) \cdot a_0(w) \cdot \theta(x), \tag{74} \]

where we recall $\tilde{g} = \{g, a\}$. Minimizing this over $\theta$ is equivalent to minimizing the loss

\[ \tilde{\ell}(\theta(x), \tilde{g}; z) = (\theta(x) - g(1, x, v) + g(0, x, v) + (y - g(x, v, d)) \cdot a_0(w))^2, \tag{75} \]

which is precisely the loss presented in the introduction of the paper.
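As a purely illustrative sketch, the loss in (75) can be evaluated on a sample once first-stage estimates are available. The function names, the argument layout, and the convention that `propensity` denotes an estimate of $\Pr[D = 1 \mid x, v]$ are assumptions made here for illustration; the code transcribes (75) literally as displayed above and is not taken from the paper.

```python
import numpy as np

def riesz_representer(d, propensity):
    """a(w) = (d - (1 - d)) / Pr[D = d | x, v] for a binary treatment d.

    `propensity` is assumed to hold estimates of Pr[D = 1 | x, v], so the
    probability of the observed arm is propensity if d = 1, else 1 - propensity.
    """
    d = np.asarray(d, dtype=float)
    propensity = np.asarray(propensity, dtype=float)
    prob_observed = d * propensity + (1.0 - d) * (1.0 - propensity)
    return (2.0 * d - 1.0) / prob_observed

def orthogonal_loss(theta_x, g1, g0, g_obs, y, a):
    """Sample average of the squared loss in (75).

    theta_x: candidate theta(x); g1, g0: regression estimates under d = 1, 0;
    g_obs: regression estimate at the observed treatment; a: Riesz weights.
    """
    residual = theta_x - g1 + g0 + (y - g_obs) * a
    return float(np.mean(residual ** 2))
```

In a two-stage procedure the arrays `g1`, `g0`, `g_obs` and `a` would be computed from the first sample split, and `orthogonal_loss` would then be minimized over the target class on the second split.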

D Plug-in Empirical Risk Minimization: Examples

In this section of the appendix, we instantiate the general plug-in ERM framework from Section 4 to give concrete guarantees for some classes of interest. In all examples we use $\widetilde{O}$ to hide dependence on problem-dependent constants, $\log n$ factors, and $\log(\delta^{-1})$ factors.

Linear classes. For our first set of examples, we focus on high-dimensional linear predictors. Chernozhukov et al. (2018b) gave orthogonal/debiased estimation guarantees for high-dimensional predictors using Lasso-type algorithms. Our first example shows how to recover the type of guarantee they gave, and our second example shows that we can give similar guarantees under weaker assumptions by exploiting that we work in the excess risk / statistical learning (rather than parameter estimation) framework.

Example 1 (High-Dimensional Linear Predictors with $\ell_1$ Constraint). Suppose that $\theta^\star \in \mathbb{R}^d$ is an $s$-sparse linear function with support set $T \subset [d]$ and that $\|\theta^\star\|_1 \leq 1$ and $\|x\|_\infty \leq 1$ almost surely under $D$. Define the target class via

\[ \Theta = \left\{ x \mapsto \langle \theta, x \rangle \mid \theta \in \mathbb{R}^d, \|\theta\|_1 \leq \|\theta^\star\|_1 \right\}. \]

Given $S_2 = x_{1:n}$, define the restricted eigenvalue for the target class as

\[ \gamma_{\mathrm{re}} = \inf_{\Delta : \|\Delta_{T^c}\|_1 \leq \|\Delta_T\|_1} \frac{\frac{1}{n} \|X\Delta\|_2^2}{\|\Delta\|_2^2}, \]

where $X \in \mathbb{R}^{n \times d}$ has $x_{1:n}$ as rows. Then under the assumptions of Theorem 3, the empirical risk minimizer guarantees that with probability at least $1 - \delta$,

\[ L_D(\widehat{\theta}, g_0) - L_D(\theta^\star, g_0) \leq \widetilde{O}\left( \frac{s \log d}{n \cdot \gamma_{\mathrm{re}}} + (\mathrm{Rate}_D(\mathcal{G}, S_1, \delta/2))^4 \right). \]
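The restricted eigenvalue is an infimum over a cone and is not straightforward to compute exactly. As a rough diagnostic (an assumption of this sketch, not a procedure from the paper), the smallest eigenvalue of $X^\top X / n$ always lower-bounds $\gamma_{\mathrm{re}}$, because the infimum over the cone is at least the infimum over all of $\mathbb{R}^d$. The helper name below is hypothetical.

```python
import numpy as np

def restricted_eigenvalue_lower_bound(X):
    """Smallest eigenvalue of X^T X / n.

    This is the infimum of (1/n) * ||X v||_2^2 / ||v||_2^2 over all nonzero v,
    hence a lower bound on the same ratio restricted to the cone in gamma_re.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    gram = X.T @ X / n
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
    return float(np.linalg.eigvalsh(gram)[0])
```

When $d > n$ this bound is zero (up to rounding) and therefore uninformative; in that regime the cone restriction in the definition of $\gamma_{\mathrm{re}}$ is what makes the quantity non-trivial.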

For parameter estimation it is well known that restricted eigenvalue or related conditions are required to ensure parameter consistency. For prediction, however, such assumptions are not needed if we are willing to consider inefficient algorithms. The next example shows that ERM over predictors with a hard sparsity constraint obtains the optimal high-dimensional rate for prediction in the presence of nuisance parameters with no restricted eigenvalue assumption.

Example 2 (High-Dimensional Linear Predictors with Hard Sparsity). Suppose that $\Theta$ is a class of high-dimensional linear predictors obeying exact or "hard" sparsity:

\[ \Theta = \left\{ x \mapsto \langle \theta, x \rangle \mid \theta \in \mathbb{R}^d, \|\theta\|_0 \leq s, \|\theta\|_1 \leq 1 \right\}, \]

and suppose $\|x\|_\infty \leq 1$ almost surely under $D$. Then under the assumptions of Theorem 3, the empirical risk minimizer guarantees that with probability at least $1 - \delta$,

\[ L_D(\widehat{\theta}, g_0) - L_D(\theta^\star, g_0) \leq \widetilde{O}\left( \frac{s \log(d/s)}{n} + (\mathrm{Rate}_D(\mathcal{G}, S_1, \delta/2))^4 \right). \]

Neural networks. We now move beyond the classical linear setting to the case where the target parameters belong to a class of neural networks, a considerably more expressive class of models. Let $\sigma_{\mathrm{log}}(t) = (1 + e^{-t})^{-1}$ be the logistic link function and let $\sigma_{\mathrm{relu}}(t) = \max\{t, 0\}$ be the ReLU function.$^3$ Our first neural network example is inspired by Chen and White (1999); Farrell et al. (2018), who analyzed neural networks for nuisance parameter estimation with parametric target parameters. We depart from their approach by using neural networks to estimate target parameters.

Example 3. Suppose that the target parameters are a class of neural networks $\Theta = \sigma_{\mathrm{log}} \circ \mathcal{F}$, where

\[ \mathcal{F} = \left\{ f(x) := A_L \cdot \sigma_{\mathrm{relu}}(A_{L-1} \cdot \sigma_{\mathrm{relu}}(A_{L-2} \cdots \sigma_{\mathrm{relu}}(A_1 x) \cdots)) \mid A_i \in \mathbb{R}^{d_i \times d_{i-1}}, \|f\|_{L_\infty(D)} \leq M \right\}, \tag{76} \]

and $d_0 = d$ and $d_L = 1$. Let $W = \sum_{i=1}^{L} d_i d_{i-1}$ denote the total number of weights in the network. Under the assumptions of Theorem 3, the empirical risk minimizer guarantees that with probability at least $1 - \delta$,

\[ L_D(\widehat{\theta}, g_0) - L_D(\theta^\star, g_0) \leq \widetilde{O}\left( \frac{W L \log W \log M}{n} + (\mathrm{Rate}_D(\mathcal{G}, S_1, \delta/2))^4 \right). \]

$^3$ For vector-valued inputs $x$ we overload $\sigma_{\mathrm{log}}(t)$ and $\sigma_{\mathrm{relu}}(t)$ to denote element-wise application.

The target class in this example is well-suited to estimation of binary treatment effects. Note that in this example our only quantitative assumption on the network weights is that they guarantee boundedness of the output. However, the bound scales linearly with the number of parameters $W$, and thus may be vacuous for modern overparameterized neural networks. Our next example, which is based on neural network covering bounds from Bartlett et al. (2017), shows that by making stronger assumptions on the weight matrices we can obtain weaker dependence on the number of parameters. This comes at the price of a slower rate: $n^{-1/2}$ vs. $n^{-1}$.

Example 4. Suppose that the target parameters are a class of neural networks $\Theta = \sigma_{\mathrm{log}} \circ \mathcal{F}$, where

\[ \mathcal{F} = \left\{ f(x) := A_L \cdot \sigma_{\mathrm{relu}}(A_{L-1} \cdot \sigma_{\mathrm{relu}}(A_{L-2} \cdots \sigma_{\mathrm{relu}}(A_1 x) \cdots)) \mid A_i \in \mathbb{R}^{d_i \times d_{i-1}}, \|A_i\|_\sigma \leq s_i, \|A_i\|_{2,1} \leq b_i \right\}, \]

and $\|A\|_{2,1}$ denotes the sum of row-wise $\ell_2$-norms. Suppose $\|x\|_2 \leq 1$ almost surely under $D$. Under the assumptions of Theorem 3, the empirical risk minimizer guarantees that with probability at least $1 - \delta$,

\[ L_D(\widehat{\theta}, g_0) - L_D(\theta^\star, g_0) \leq \widetilde{O}\left( \frac{\prod_{i=1}^{L} s_i \cdot \left( \sum_{i=1}^{L} (b_i/s_i)^{2/3} \right)^{3/2}}{\sqrt{n}} + (\mathrm{Rate}_D(\mathcal{G}, S_1, \delta/2))^4 \right). \]

Let us give a concrete example where the neural network guarantees above enable oracle rates for the target class while using a more flexible parameter class for the nuisance. Suppose the target parameters belong to the class in Example 3 with $L_2$ layers and $W_2$ weights, and suppose the nuisance parameters also belong to a neural network class, but with $L_1$ layers and $W_1$ weights. In the next section we will show that under certain assumptions one can guarantee $(\mathrm{Rate}_D(\mathcal{G}, S_1, \delta/2))^4 = \widetilde{O}((W_1 L_1 / n)^2)$ for such a class. In this case, Example 3 shows that we obtain oracle rates whenever $W_1 L_1 = o(\sqrt{W_2 L_2 n})$, meaning the number of parameters in the nuisance network can be significantly larger than for the target network. Similar guarantees can be derived for Example 4. Deriving tight generalization bounds for neural networks is an active area of research and there are many more results that can be used as-is to give guarantees for the second stage in our general framework (Golowich et al., 2018; Arora et al., 2019).
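The comparison in the preceding paragraph can be checked numerically. The sketch below drops all constants and logarithmic factors (an assumption made for illustration) and compares the target term $W_2 L_2 / n$ with the nuisance term $(W_1 L_1 / n)^2$; the function names are hypothetical.

```python
def excess_risk_terms(W1, L1, W2, L2, n):
    """Return (target_term, nuisance_term), ignoring constants and log factors."""
    target_term = W2 * L2 / n            # order of the target rate for the Example 3 class
    nuisance_term = (W1 * L1 / n) ** 2   # order of the fourth power of the nuisance rate
    return target_term, nuisance_term

def nuisance_is_lower_order(W1, L1, W2, L2, n):
    """True when the nuisance term does not exceed the target term."""
    target_term, nuisance_term = excess_risk_terms(W1, L1, W2, L2, n)
    return nuisance_term <= target_term
```

For instance, with `W1 * L1 = 4000` and `W2 * L2 = 200`, the nuisance term is dominated at `n = 10**6` (where $\sqrt{W_2 L_2 n} \approx 14142$) but not at `n = 10**4` (where $\sqrt{W_2 L_2 n} \approx 1414$).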

Kernels. For our final example, we give rates for some basic kernel classes. These examples were chosen only for concreteness, and the machinery in this section and the subsequent sections can be invoked to give guarantees for richer and more general nonparametric classes.

Example 5 (Gaussian Kernels). Suppose that $\Theta \subset ([0, 1] \to \mathbb{R})$ is the unit ball in the reproducing kernel Hilbert space with the Gaussian kernel $K(x, x') = e^{-\frac{1}{2}(x - x')^2}$. Suppose $x$ is drawn from the uniform distribution over $[0, 1]$. Under the assumptions of Theorem 3, the empirical risk minimizer guarantees that with probability at least $1 - \delta$,

\[ L_D(\widehat{\theta}, g_0) - L_D(\theta^\star, g_0) \leq \widetilde{O}\left( \frac{1}{n} + (\mathrm{Rate}_D(\mathcal{G}, S_1, \delta/2))^4 \right). \]

Example 6 (Sobolev Spaces). Suppose the target class is the Sobolev space

\[ \Theta = \left\{ \theta : [0, 1] \to \mathbb{R} \mid \theta(0) = 0, \ \theta \text{ is absolutely continuous with } \theta' \in L_2[0, 1] \right\}, \]

and suppose that $x$ is drawn from the uniform distribution on $[0, 1]$. Under the assumptions of Theorem 3, the empirical risk minimizer guarantees that with probability at least $1 - \delta$,

\[ L_D(\widehat{\theta}, g_0) - L_D(\theta^\star, g_0) \leq \widetilde{O}\left( \frac{1}{n^{2/3}} + (\mathrm{Rate}_D(\mathcal{G}, S_1, \delta/2))^4 \right). \]

D.1 Proofs

Throughout this section we adopt the shorthand $\|\cdot\|_{n,2} = \|\cdot\|_{L_2(z_{1:n})}$. We first recall some basic technical lemmas which will be used in the proofs for the examples.

Lemma 5 (Mendelson (2002), Lemma 4.5). For any real-valued function class $\mathcal{F}$ with $\|f\|_{n,2} \leq 1$ for all $f \in \mathcal{F}$ and any $f^\star$ with $\|f^\star\|_{n,2} \leq 1$,

\[ \mathcal{H}_2(\mathrm{star}(\mathcal{F}, f^\star), \varepsilon, z_{1:n}) \leq \mathcal{H}_2(\mathcal{F}, \varepsilon/2, z_{1:n}) + \log(4/\varepsilon). \tag{77} \]

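Lemma 6 (Wainwright (2019), Proposition 14.1). Let $\delta_n$ be the minimal solution to

\[ \mathcal{R}_n(\mathcal{F}, \delta) \leq \delta^2, \]

where $\mathcal{F} \subseteq (\mathcal{Z} \to \mathbb{R})$ is a star-shaped set with $\sup_{f \in \mathcal{F}} \sup_{z \in \mathcal{Z}} |f(z)| \leq 1$. Then with probability at least $1 - \exp(-c n \delta_n^2)$ over the draw of data $z_{1:n}$,

\[ \delta_n \leq 34\, \widehat{\delta}_n, \]

where $\widehat{\delta}_n$ is the minimal solution to

\[ \mathcal{R}_n(\mathcal{F}, \delta, z_{1:n}) := \mathbb{E}_\epsilon \left[ \sup_{f \in \mathcal{F} : \|f\|_{n,2} \leq \delta} \frac{1}{n} \sum_{i=1}^{n} \epsilon_i f(z_i) \right] \leq \delta^2. \tag{78} \]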

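The following result is an immediate consequence of Lemma 14.

Lemma 7. Define
\[ \mathcal{F}(\delta, z_{1:n}) = \left\{ f \in \mathcal{F} : \|f\|_{n,2} \leq \delta \right\}. \tag{79} \]
Then any minimal solution to
\[ \int_{\delta^2/8}^{\delta} \sqrt{\frac{\mathcal{H}_2(\mathcal{F}(\delta, z_{1:n}), \varepsilon, z_{1:n})}{n}} \, d\varepsilon \leq \frac{\delta^2}{20} \tag{80} \]
is an upper bound for the fixed point $\widehat{\delta}_n$ in (78).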
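To build intuition for the entropy-integral condition (80), the following sketch searches a grid for the smallest radius satisfying it. It makes two simplifying assumptions that are not part of Lemma 7: the metric entropy is replaced by a user-supplied upper bound `entropy(eps)` that does not depend on $\delta$, and the integral is approximated by the trapezoid rule. All function names are hypothetical.

```python
import numpy as np

def entropy_integral(delta, entropy, n, num_points=2000):
    """Trapezoid approximation of the integral of sqrt(entropy(eps) / n)
    over the interval [delta^2 / 8, delta]."""
    eps = np.linspace(delta ** 2 / 8.0, delta, num_points)
    values = np.sqrt(entropy(eps) / n)
    return float(np.sum((values[1:] + values[:-1]) * np.diff(eps)) / 2.0)

def critical_radius(entropy, n, grid=None):
    """Smallest delta on the grid with entropy_integral(delta) <= delta^2 / 20.

    Returns None if no grid point satisfies the inequality.
    """
    if grid is None:
        grid = np.geomspace(1e-4, 1.0, 400)
    for delta in grid:
        if entropy_integral(delta, entropy, n) <= delta ** 2 / 20.0:
            return float(delta)
    return None

def parametric_entropy(eps, dim=5):
    """Illustrative entropy bound of parametric type: dim * log(1 + 2 / eps)."""
    return dim * np.log(1.0 + 2.0 / eps)
```

With an entropy bound of this parametric type the returned radius shrinks as `n` grows, roughly like $\sqrt{\log n / n}$ up to constants, which matches the way the lemma is used in the proofs of the examples.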

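Proof for Example 1. Let data $x_{1:n}$ be fixed. Let $\mathcal{G} = \{x \mapsto \langle \theta, x \rangle \mid \|\theta\|_1 \leq b\}$. Under our assumption that $\|x_t\|_\infty \leq 1$, Zhang (2002), Theorem 3, implies that
\[ \mathcal{H}_2(\mathcal{G}, \varepsilon, x_{1:n}) \leq O\left( \frac{b^2 \log(d)}{\varepsilon^2} \right). \]
We will now establish that

\[ \Theta(\delta, x_{1:n}) := \left\{ x \mapsto \langle \theta - \theta^\star, x \rangle \mid \theta \in \Theta, \|\theta - \theta^\star\|_{n,2} \leq \delta \right\} \subseteq \mathcal{G} \]

for an appropriate choice of $b$ using the restricted eigenvalue bound. Let

\[ \mathcal{C} = \left\{ \Delta \in \mathbb{R}^d \mid \|\Delta_{T^c}\|_1 \leq \|\Delta_T\|_1 \right\}. \]

We first claim $\Theta - \theta^\star \subset \mathcal{C}$. Indeed, fix $\theta \in \Theta$ and let $\Delta = \theta - \theta^\star$. Then we have

\[ \|\theta^\star\|_1 \geq \|\theta\|_1 = \|\theta^\star + \Delta\|_1 = \|\theta^\star + \Delta_T\|_1 + \|\Delta_{T^c}\|_1 \geq \|\theta^\star\|_1 - \|\Delta_T\|_1 + \|\Delta_{T^c}\|_1. \]

Rearranging, we get $\|\Delta_{T^c}\|_1 \leq \|\Delta_T\|_1$ as desired. Now observe that for any $\Delta \in \mathcal{C}$, we have
\[ \|\Delta\|_1 \leq \|\Delta_T\|_1 + \|\Delta_{T^c}\|_1 \leq 2\|\Delta_T\|_1 \leq 2\sqrt{s}\, \|\Delta\|_2 \leq \frac{2\sqrt{s}}{\sqrt{\gamma_{\mathrm{re}}}} \cdot \frac{1}{\sqrt{n}} \|X\Delta\|_2. \]
This implies that $\Theta(\delta, x_{1:n}) \subseteq \mathcal{G}_b$ for $b = \frac{2\sqrt{s}}{\sqrt{\gamma_{\mathrm{re}}}} \delta$, and as a consequence
\[ \mathcal{H}_2(\Theta(\delta, x_{1:n}), \varepsilon, x_{1:n}) \leq O\left( \frac{s \log(d)}{\gamma_{\mathrm{re}} \cdot \varepsilon^2} \delta^2 \right). \]
We plug this bound into (80) and derive an upper bound of
\[ \int_{\delta^2/8}^{\delta} \sqrt{\frac{\mathcal{H}_2(\mathrm{star}(\Theta - \theta^\star, 0), \varepsilon, x_{1:n})}{n}} \, d\varepsilon \leq O\left( \delta \cdot \sqrt{\frac{s \log(d)}{\gamma_{\mathrm{re}}\, n}} \cdot \int_{\delta^2/8}^{\delta} \frac{1}{\varepsilon} \, d\varepsilon \right) \leq O\left( \delta \log(1/\delta) \cdot \sqrt{\frac{s \log(d)}{\gamma_{\mathrm{re}}\, n}} \right), \]

where we have used that $\mathrm{star}(\Theta - \theta^\star, 0) = \Theta - \theta^\star$. Using Lemma 7, we may now take $\delta_n \leq O\left( \sqrt{\frac{s \log(d/s)}{n}} \log n \right)$ in (35), then combine with Theorem 3 and Theorem 1 to get the result.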

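Proof for Example 2. Since $\|\theta\|_1 \leq 1$ and $\|x\|_\infty \leq 1$, the standard covering number bound for linear classes states that the metric entropy at scale $\varepsilon$ for any fixed sparsity pattern is at most $C \cdot s \log(1/\varepsilon)$. We take the union over all such covers for all $\binom{d}{s} \leq \left( \frac{ed}{s} \right)^s$ sparsity patterns, which implies $\mathcal{H}_2(\Theta - \theta^\star, \varepsilon, x_{1:n}) \propto s(\log(d/s) + \log(1/\varepsilon))$. Lemma 5 further implies that

\[ \mathcal{H}_2(\mathrm{star}(\Theta - \theta^\star, 0), \varepsilon, x_{1:n}) \propto s(\log(d/s) + \log(1/\varepsilon)). \]

It is now a standard calculation to show that
\[ \int_{\delta^2/8}^{\delta} \sqrt{\frac{\mathcal{H}_2(\mathrm{star}(\Theta - \theta^\star, 0), \varepsilon, x_{1:n})}{n}} \, d\varepsilon \leq O\left( \delta \sqrt{\log(1/\delta)} \cdot \sqrt{\frac{s \log(d/s)}{n}} \right). \]
Thus, via Lemma 6 and Lemma 7, we may take $\delta_n \leq O\left( \sqrt{\frac{s \log(d/s) \log n}{n}} \right)$ in (35). The final result follows by combining Theorem 3 and Theorem 1.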

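Proof for Example 3. Since the target class is bounded, Lemma 5 implies

\[ \mathcal{H}_2(\mathrm{star}(\Theta - \theta^\star), 2\varepsilon, x_{1:n}) \leq \mathcal{H}_2(\Theta - \theta^\star, \varepsilon, x_{1:n}) + \log(2/\varepsilon) = \mathcal{H}_2(\Theta, \varepsilon, x_{1:n}) + \log(2/\varepsilon). \]

Recall
\[ \mathcal{F} = \left\{ f(x) := A_L \cdot \sigma_{\mathrm{relu}}(A_{L-1} \cdot \sigma_{\mathrm{relu}}(A_{L-2} \cdots \sigma_{\mathrm{relu}}(A_1 x) \cdots)) \mid A_i \in \mathbb{R}^{d_i \times d_{i-1}}, \|f\|_{L_\infty(D)} \leq M \right\}. \]

Then, since $\sigma_{\mathrm{relu}}$ is 1-Lipschitz and positive-homogeneous, we have $\mathcal{H}_2(\Theta, \varepsilon, x_{1:n}) \leq \mathcal{H}_2(\mathcal{F}, \varepsilon, x_{1:n})$. Recall that for $[0, M]$-valued classes of regressors we can relate the empirical $L_2$ metric to an empirical $L_1$ metric for a closely related VC class as follows. Let $Y \sim \mathrm{unif}([0, M])$, let $f, f' \in \mathcal{G}$, and write

\[ \begin{aligned} P_n(f(X) - f'(X))^2 &= M^2 P_n(P_Y(Y \leq f(X)) - P_Y(Y \leq f'(X)))^2 \\ &\leq M^2 (P_n \times P_Y)\left( \mathbb{1}\{Y \leq f(X)\} - \mathbb{1}\{Y \leq f'(X)\} \right). \end{aligned} \]

Consequently, we see that the $L_2$ covering number for $\mathcal{F}$ on the distribution $P_n$ at scale $\varepsilon$ is at most the size of the $L_1$ cover of the class $\mathcal{F}' = \{(x, y) \mapsto \mathbb{1}\{y \leq f(x)\} \mid f \in \mathcal{F}\}$ on the distribution $P_n \times P_Y$ at scale $\varepsilon^2/M$. Thus, invoking Haussler's $L_1$ covering number bound for VC classes (Haussler, 1995), we have
\[ \mathcal{H}_2(\Theta, \varepsilon, x_{1:n}) \leq 2 \cdot \mathrm{vc}(\mathcal{F}') \log\left( \frac{CM}{\varepsilon} \right) = 2 \cdot \mathrm{pdim}(\mathcal{F}) \log\left( \frac{CM}{\varepsilon} \right), \]
where $\mathrm{vc}(\cdot)$ denotes the VC dimension and $\mathrm{pdim}(\cdot)$ denotes the pseudodimension. Using Theorem 14.1 from Anthony and Bartlett (1999) and Theorem 6 from Bartlett et al. (2017), we have

\[ \mathrm{pdim}(\mathcal{F}) \leq O(LW \log(W)). \]

With this bound on the metric entropy we have
\[ \begin{aligned} \int_{\delta^2/8}^{\delta} \sqrt{\frac{\mathcal{H}_2(\mathrm{star}(\Theta - \theta^\star, 0), \varepsilon, x_{1:n})}{n}} \, d\varepsilon &\leq O\left( \sqrt{\frac{LW \log W \log M}{n}} \cdot \int_{\delta^2/8}^{\delta} \sqrt{\log(1/\varepsilon)} \, d\varepsilon \right) \\ &\leq O\left( \sqrt{\frac{LW \log W \log M}{n}} \cdot \delta \sqrt{\log(1/\delta)} \right). \end{aligned} \]

Thus, it suffices to take $\delta_n \leq O\left( \sqrt{\frac{LW \log W \log M \log n}{n}} \right)$ in (35) and appeal to Theorem 3 and Theorem 1.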

Proof for Example 4. As in Example 3, we have $\mathcal{H}_2(\mathrm{star}(\Theta - \theta^\star), \varepsilon, x_{1:n}) \leq \mathcal{H}_2(\mathcal{F}, \varepsilon, x_{1:n})$. Theorem 3.3 of Bartlett et al. (2017) implies that under our assumptions,

\[ \mathcal{H}_2(\mathcal{F}, \varepsilon, x_{1:n}) \leq O\left( \frac{n \log W}{\varepsilon^2} \prod_{i=1}^{L} s_i^2 \cdot \left( \sum_{i=1}^{L} (b_i/s_i)^{2/3} \right)^{3} \right). \]

The result follows by plugging this bound into Lemma 7 and proceeding exactly as in the previous examples.

Proof for Example 5 and Example 6. Note that each target class $\Theta$ has range bounded by 1. By Examples 14.4 and 14.3 in Wainwright (2019), we may take $\delta_n = c\sqrt{\log n / n}$ and $\delta_n = c n^{-1/3}$ in (35) for the Gaussian and Sobolev classes respectively. We combine this with Theorem 3 and Theorem 1.

Part II Proofs for Main Results

E Preliminaries

We invoke the following version of Taylor's theorem and its directional derivative generalization repeatedly.

Proposition 2 (Taylor expansion). Let $a \leq b$ be fixed and let $f : I \to \mathbb{R}$, where $I \subseteq \mathbb{R}$ is an open interval containing $a, b$. If $f$ is $(k + 1)$-times differentiable, then there exists $c \in [a, b]$ such that

\[ f(a) = f(b) + \sum_{i=1}^{k} \frac{1}{i!} f^{(i)}(b)(a - b)^i + \frac{1}{(k + 1)!} f^{(k+1)}(c)(a - b)^{k+1}. \]

Let $F : \mathcal{F} \to \mathbb{R}$, where $\mathcal{F}$ is a vector space of functions. For any $g, g' \in \mathcal{F}$, if $t \mapsto F(t \cdot g + (1 - t) \cdot g')$ is $(k + 1)$-times differentiable over an open interval containing $[0, 1]$, then there exists $\bar{g} \in \mathrm{conv}(\{g, g'\})$ such that

\[ F(g') = F(g) + \sum_{i=1}^{k} \frac{1}{i!} D_g^i F(g)[\underbrace{g' - g, \ldots, g' - g}_{i \text{ times}}] + \frac{1}{(k + 1)!} D_g^{k+1} F(\bar{g})[\underbrace{g' - g, \ldots, g' - g}_{k + 1 \text{ times}}]. \]

F Proofs from Section 3

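Proof of Theorem 1. To simplify notation we drop the subscripts from the norms $\|\cdot\|_\Theta$ and $\|\cdot\|_{\mathcal{G}}$ and abbreviate $R_\theta := \mathrm{Rate}_D(\Theta, S_2, \delta/2; \widehat{g})$ and $R_g := \mathrm{Rate}_D(\mathcal{G}, S_1, \delta/2)$. Using a second-order Taylor expansion on the risk at $\widehat{g}$, there exists $\bar{\theta} \in \mathrm{star}(\widehat{\Theta}, \theta^\star)$ such that
\[ \frac{1}{2} \cdot D_\theta^2 L_D(\bar{\theta}, \widehat{g})[\widehat{\theta} - \theta^\star, \widehat{\theta} - \theta^\star] = L_D(\widehat{\theta}, \widehat{g}) - L_D(\theta^\star, \widehat{g}) - D_\theta L_D(\theta^\star, \widehat{g})[\widehat{\theta} - \theta^\star]. \]
Using the strong convexity assumption (Assumption 3) we get:
\[ D_\theta^2 L_D(\bar{\theta}, \widehat{g})[\widehat{\theta} - \theta^\star, \widehat{\theta} - \theta^\star] \geq \lambda \cdot \|\widehat{\theta} - \theta^\star\|_\Theta^2 - \kappa \cdot \|\widehat{g} - g_0\|_{\mathcal{G}}^4 = \lambda \cdot \|\widehat{\theta} - \theta^\star\|_\Theta^2 - \kappa \cdot R_g^4. \]
Combining these statements we conclude that
\[ \frac{\lambda}{2} \cdot \|\widehat{\theta} - \theta^\star\|_\Theta^2 \leq L_D(\widehat{\theta}, \widehat{g}) - L_D(\theta^\star, \widehat{g}) + \frac{\kappa}{2} \cdot R_g^4 - D_\theta L_D(\theta^\star, \widehat{g})[\widehat{\theta} - \theta^\star]. \]
Using the assumed rate for $\widehat{\theta}$ (Definition 1), this implies the inequality
\[ \frac{\lambda}{2} \cdot \|\widehat{\theta} - \theta^\star\|_\Theta^2 \leq R_\theta + \frac{\kappa}{2} \cdot R_g^4 - D_\theta L_D(\theta^\star, \widehat{g})[\widehat{\theta} - \theta^\star]. \tag{81} \]
We now apply a second-order Taylor expansion (using the assumed derivative continuity from Assumption 4), which implies that there exists $\bar{g} \in \mathrm{star}(\mathcal{G}, g_0)$ such that
\[ \begin{aligned} -D_\theta L_D(\theta^\star, \widehat{g})[\widehat{\theta} - \theta^\star] = {} & -D_\theta L_D(\theta^\star, g_0)[\widehat{\theta} - \theta^\star] - D_g D_\theta L_D(\theta^\star, g_0)[\widehat{\theta} - \theta^\star, \widehat{g} - g_0] \\ & - \frac{1}{2} \cdot D_g^2 D_\theta L_D(\theta^\star, \bar{g})[\widehat{\theta} - \theta^\star, \widehat{g} - g_0, \widehat{g} - g_0]. \end{aligned} \]

Using orthogonality of the loss (Assumption 1), this is equal to

\[ = -D_\theta L_D(\theta^\star, g_0)[\widehat{\theta} - \theta^\star] - \frac{1}{2} \cdot D_g^2 D_\theta L_D(\theta^\star, \bar{g})[\widehat{\theta} - \theta^\star, \widehat{g} - g_0, \widehat{g} - g_0]. \]
We use the second order smoothness assumed in Assumption 4:
\[ \leq -D_\theta L_D(\theta^\star, g_0)[\widehat{\theta} - \theta^\star] + \frac{\beta_2}{2} \cdot \|\widehat{g} - g_0\|_{\mathcal{G}}^2 \cdot \|\widehat{\theta} - \theta^\star\|_\Theta. \]
Invoking the AM-GM inequality, we have that for any constant $\eta > 0$:
\[ \leq -D_\theta L_D(\theta^\star, g_0)[\widehat{\theta} - \theta^\star] + \frac{\beta_2}{4\eta} \cdot \|\widehat{g} - g_0\|_{\mathcal{G}}^4 + \frac{\beta_2 \eta}{4} \cdot \|\widehat{\theta} - \theta^\star\|_\Theta^2. \]

Lastly, we use the assumed rate for $\widehat{g}$ (Definition 1):
\[ \leq -D_\theta L_D(\theta^\star, g_0)[\widehat{\theta} - \theta^\star] + \frac{\beta_2}{4\eta} \cdot R_g^4 + \frac{\beta_2 \eta}{4} \cdot \|\widehat{\theta} - \theta^\star\|_\Theta^2. \]
Choosing $\eta = \lambda / \beta_2$, combining this string of inequalities with (81), and rearranging, we get:
\[ \|\widehat{\theta} - \theta^\star\|_\Theta^2 \leq \frac{4}{\lambda} \left( -D_\theta L_D(\theta^\star, g_0)[\widehat{\theta} - \theta^\star] + R_\theta \right) + \left( \frac{\beta_2^2}{\lambda^2} + \frac{2\kappa}{\lambda} \right) \cdot R_g^4. \tag{82} \]

Assumption 2 implies that $D_\theta L_D(\theta^\star, g_0)[\widehat{\theta} - \theta^\star] \geq 0$. Hence, we get the desired inequality (10). To conclude the inequality (11), we use another Taylor expansion, which implies that there exists $\bar{\theta} \in \mathrm{star}(\widehat{\Theta}, \theta^\star)$ such that
\[ L_D(\widehat{\theta}, g_0) - L_D(\theta^\star, g_0) = D_\theta L_D(\theta^\star, g_0)[\widehat{\theta} - \theta^\star] + \frac{1}{2} \cdot D_\theta^2 L_D(\bar{\theta}, g_0)[\widehat{\theta} - \theta^\star, \widehat{\theta} - \theta^\star]. \]
Using the smoothness bound from Assumption 4:
\[ \leq D_\theta L_D(\theta^\star, g_0)[\widehat{\theta} - \theta^\star] + \frac{\beta_1}{2} \cdot \|\widehat{\theta} - \theta^\star\|_\Theta^2. \tag{83} \]
We combine (82) with (83) to get
\[ L_D(\widehat{\theta}, g_0) - L_D(\theta^\star, g_0) \leq \frac{2\beta_1}{\lambda} \cdot R_\theta + \frac{\beta_1}{2} \left( \frac{\beta_2^2}{\lambda^2} + \frac{2\kappa}{\lambda} \right) \cdot R_g^4 - \left( \frac{2\beta_1}{\lambda} - 1 \right) \cdot D_\theta L_D(\theta^\star, g_0)[\widehat{\theta} - \theta^\star]. \]

The result follows by again using that $D_\theta L_D(\theta^\star, g_0)[\widehat{\theta} - \theta^\star] \geq 0$, along with the fact that $\beta_1/\lambda \geq 1$, without loss of generality.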
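For readers who want the rearrangement behind (82) spelled out, the following worked computation is a sketch that uses only quantities defined in the proof. The shorthand $\Delta := D_\theta L_D(\theta^\star, g_0)[\widehat{\theta} - \theta^\star]$ is introduced here for brevity and is not the paper's notation; the choice $\eta = \lambda/\beta_2$ is the one that makes the coefficient of the squared norm equal to $\lambda/4$.

```latex
\begin{align*}
\frac{\lambda}{2}\,\|\widehat{\theta}-\theta^\star\|_\Theta^2
  &\le R_\theta + \frac{\kappa}{2} R_g^4 - \Delta
     + \frac{\beta_2}{4\eta} R_g^4
     + \frac{\beta_2 \eta}{4}\,\|\widehat{\theta}-\theta^\star\|_\Theta^2
  && \text{((81) plus the AM-GM bound)} \\
\Big(\frac{\lambda}{2}-\frac{\lambda}{4}\Big)\|\widehat{\theta}-\theta^\star\|_\Theta^2
  &\le R_\theta - \Delta
     + \Big(\frac{\kappa}{2} + \frac{\beta_2^2}{4\lambda}\Big) R_g^4
  && \text{(set } \eta = \lambda/\beta_2\text{)} \\
\|\widehat{\theta}-\theta^\star\|_\Theta^2
  &\le \frac{4}{\lambda}\big(R_\theta - \Delta\big)
     + \Big(\frac{2\kappa}{\lambda} + \frac{\beta_2^2}{\lambda^2}\Big) R_g^4
  && \text{(multiply by } 4/\lambda\text{)}
\end{align*}
```

The AM-GM step itself is the elementary bound $ab \leq \frac{a^2}{2\eta} + \frac{\eta b^2}{2}$ applied with $a = \|\widehat{g} - g_0\|_{\mathcal{G}}^2$ and $b = \|\widehat{\theta} - \theta^\star\|_\Theta$, then multiplied by $\beta_2/2$.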

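Proof of Theorem 2. To begin, we use the guarantee for the second stage from Definition 1 and perform straightforward manipulation to show
\[ L_D(\widehat{\theta}, g_0) - L_D(\theta^\star, g_0) \leq (L_D(\widehat{\theta}, g_0) - L_D(\widehat{\theta}, \widehat{g})) + (L_D(\theta^\star, \widehat{g}) - L_D(\theta^\star, g_0)) + \mathrm{Rate}_D(\Theta, S_2, \delta/2; \widehat{g}). \]
Using continuity guaranteed by Assumption 6, we perform a second-order Taylor expansion with respect to $g$ for each pair of loss terms in the preceding expression to conclude that there exist $g, g' \in \mathrm{star}(\mathcal{G}, g_0)$ such that
\[ \begin{aligned} & (L_D(\widehat{\theta}, g_0) - L_D(\widehat{\theta}, \widehat{g})) + (L_D(\theta^\star, \widehat{g}) - L_D(\theta^\star, g_0)) \\ & \quad = -D_g L_D(\widehat{\theta}, g_0)[\widehat{g} - g_0] - \frac{1}{2} \cdot D_g^2 L_D(\widehat{\theta}, g)[\widehat{g} - g_0, \widehat{g} - g_0] \\ & \qquad + D_g L_D(\theta^\star, g_0)[\widehat{g} - g_0] + \frac{1}{2} \cdot D_g^2 L_D(\theta^\star, g')[\widehat{g} - g_0, \widehat{g} - g_0]. \end{aligned} \]

Using the smoothness promised by Assumption 6:

\[ \leq -D_g L_D(\widehat{\theta}, g_0)[\widehat{g} - g_0] + D_g L_D(\theta^\star, g_0)[\widehat{g} - g_0] + \beta \|\widehat{g} - g_0\|_{\mathcal{G}}^2. \]
To relate the two derivative terms, we apply another second-order Taylor expansion (which is possible due to Assumption 6), this time with respect to the target predictor:

\[ \begin{aligned} D_g L_D(\widehat{\theta}, g_0)[\widehat{g} - g_0] = {} & D_g L_D(\theta^\star, g_0)[\widehat{g} - g_0] + D_\theta D_g L_D(\theta^\star, g_0)[\widehat{g} - g_0, \widehat{\theta} - \theta^\star] \\ & + \frac{1}{2} \cdot D_\theta^2 D_g L_D(\bar{\theta}, g_0)[\widehat{g} - g_0, \widehat{\theta} - \theta^\star, \widehat{\theta} - \theta^\star], \end{aligned} \]
where $\bar{\theta} \in \mathrm{conv}(\{\widehat{\theta}, \theta^\star\})$. Universal orthogonality immediately implies that

\[ D_\theta D_g L_D(\theta^\star, g_0)[\widehat{g} - g_0, \widehat{\theta} - \theta^\star] = 0. \]
Furthermore, observe that

\[ \begin{aligned} & D_\theta^2 D_g L_D(\bar{\theta}, g_0)[\widehat{g} - g_0, \widehat{\theta} - \theta^\star, \widehat{\theta} - \theta^\star] \\ & \quad = \lim_{t \to 0} \frac{D_\theta D_g L_D(\bar{\theta} + t(\widehat{\theta} - \theta^\star), g_0)[\widehat{g} - g_0, \widehat{\theta} - \theta^\star] - D_\theta D_g L_D(\bar{\theta}, g_0)[\widehat{g} - g_0, \widehat{\theta} - \theta^\star]}{t}. \end{aligned} \]

Since $\bar{\theta} + t(\widehat{\theta} - \theta^\star) \in \mathrm{star}(\widehat{\Theta}, \theta^\star) + \mathrm{star}(\widehat{\Theta} - \theta^\star, 0)$ for all $t \in [0, 1]$, including $t = 0$, universal orthogonality (Assumption 5) implies that both terms in the numerator are zero, and hence $D_\theta^2 D_g L_D(\bar{\theta}, g_0)[\widehat{g} - g_0, \widehat{\theta} - \theta^\star, \widehat{\theta} - \theta^\star] = 0$. We conclude that $D_g L_D(\widehat{\theta}, g_0)[\widehat{g} - g_0] = D_g L_D(\theta^\star, g_0)[\widehat{g} - g_0]$. Using this identity in the excess risk upper bound, we arrive at

\[ \begin{aligned} L_D(\widehat{\theta}, g_0) - L_D(\theta^\star, g_0) &\leq \mathrm{Rate}_D(\Theta, S_2, \delta/2; \widehat{g}) + \beta \cdot \|\widehat{g} - g_0\|_{\mathcal{G}}^2 \\ &\leq \mathrm{Rate}_D(\Theta, S_2, \delta/2; \widehat{g}) + \beta \cdot (\mathrm{Rate}_D(\mathcal{G}, S_1, \delta/2))^2. \end{aligned} \]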

Proof of Orthogonality for Treatment Effects Example (Section 3.3). We first show that Assumption 2 holds whenever the second stage is well-specified (i.e. θ0 ∈ Θ). We have

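\[ D_\theta L_D(\theta_0, g_0)[\theta - \theta_0] = -2 \cdot \mathbb{E}[((Y - m_0(X, W)) - (T - e_0(W))\theta_0(X))(T - e_0(W))(\theta(X) - \theta_0(X))]. \]
In particular, for any $x$ we have

\[ \begin{aligned} & \mathbb{E}[((Y - m_0(X, W)) - (T - e_0(W))\theta_0(X))(T - e_0(W)) \mid X = x] \\ & \quad = \mathbb{E}[((T\theta_0(X) + f_0(W) + \varepsilon_1 - e_0(W)\theta_0(X) - f_0(W)) - (T - e_0(W))\theta_0(X))(T - e_0(W)) \mid X = x] \\ & \quad = \mathbb{E}[\varepsilon_1 \cdot \varepsilon_2 \mid X = x] \\ & \quad = \mathbb{E}[\mathbb{E}[\varepsilon_1 \mid X = x, \varepsilon_2] \cdot \varepsilon_2 \mid X = x] = 0, \end{aligned} \]

and so $D_\theta L_D(\theta_0, g_0)[\theta - \theta_0] = 0$. To establish orthogonality for the propensities $e$, let $\theta, \theta'$ be fixed. Then we have

\[ \begin{aligned} D_e D_\theta L_D(\theta, \{m_0, e_0\})[\theta' - \theta, e - e_0] = {} & 2 \cdot \mathbb{E}\left[ ((Y - m_0(X, W)) - (T - e_0(W))\theta(X))(\theta'(X) - \theta(X))(e(W) - e_0(W)) \right] \\ & - 2 \cdot \mathbb{E}\left[ \theta(X)(T - e_0(W))(\theta'(X) - \theta(X))(e(W) - e_0(W)) \right]. \end{aligned} \]

To handle the first term, we use that for any $x, w$,

\[ \mathbb{E}[((Y - m_0(X, W)) - (T - e_0(W))\theta(X)) \mid X = x, W = w] = \mathbb{E}[\varepsilon_1 + \varepsilon_2 \cdot (\theta_0(X) - \theta(X)) \mid X = x, W = w] = 0. \]
Similarly, the second term is handled by using that

\[ \mathbb{E}[\theta(X)(T - e_0(W)) \mid X = x, W = w] = \mathbb{E}[\theta(X) \cdot \varepsilon_2 \mid X = x, W = w] = 0. \]
To establish orthogonality for the expected value parameter $m$, for any $\theta, \theta'$ we have
\[ \begin{aligned} D_m D_\theta L_D(\theta, \{m_0, e_0\})[\theta' - \theta, m - m_0] &= 2 \cdot \mathbb{E}\left[ (T - e_0(W))(\theta'(X) - \theta(X))(m(X, W) - m_0(X, W)) \right] \\ &= 2 \cdot \mathbb{E}\left[ \varepsilon_2 \cdot (\theta'(X) - \theta(X))(m(X, W) - m_0(X, W)) \right] = 0, \end{aligned} \]
which follows from the assumption $\mathbb{E}[\varepsilon_2 \mid X, W] = 0$. Note that both of these orthogonality proofs hold for any choice of $\theta$, not just $\theta_0$, and hence universal orthogonality holds.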

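Proof of Lemma 1. The first-order condition (Assumption 2) follows immediately by using that $\mathbb{E}[\nabla_\zeta \ell(\theta_0(x), g_0; z) \mid x] = 0$, and by the assumption that $\mathbb{E}[u \mid w] = g_0(w)$. To establish orthogonality with respect to the nuisance parameter $g$, we compute

\[ \begin{aligned} & D_g D_\theta \tilde{L}_D(\theta_0, \{g_0, a_0\})[\theta - \theta_0, g - g_0] \\ & \quad = \mathbb{E}[D_g \mathbb{E}[\nabla_\zeta \ell(\theta_0(x), g_0(w); z) \mid x][g - g_0](\theta(x) - \theta_0(x))] - \mathbb{E}[\langle a_0(w), g(w) - g_0(w) \rangle (\theta(x) - \theta_0(x))] \\ & \quad = \mathbb{E}[\mathbb{E}[\langle a_0(w), g(w) - g_0(w) \rangle \mid x](\theta(x) - \theta_0(x))] - \mathbb{E}[\langle a_0(w), g(w) - g_0(w) \rangle (\theta(x) - \theta_0(x))] \\ & \quad = 0, \end{aligned} \]
where the second equality follows from the definition of $a_0$ and the final equality is the law of total expectation. For orthogonality with respect to $a$, we have
\[ D_\theta D_a \tilde{L}_D(\theta_0, \{g_0, a_0\})[\theta - \theta_0, a - a_0] = \mathbb{E}[\langle a(w) - a_0(w), u - g_0(w) \rangle (\theta(x) - \theta_0(x))] = 0, \]
where the final equality uses that $x \subseteq w$ and $\mathbb{E}[u \mid w] = g_0(w)$.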

G Technical Lemmas for Constrained M-Estimators

In this section of the appendix we give self-contained technical results for M-estimation over general function classes, in the absence of nuisance parameters. The results here serve as a building block for the results of Section 4.

Let $\mathcal{F} : \mathcal{X} \to \mathbb{R}^d$ be the function class, and let $\ell : \mathbb{R}^d \times \mathcal{Z} \to \mathbb{R}$ be the loss. We receive a sample set $S = z_1, \ldots, z_n$ drawn from distribution $D$ independently. Let $L_f$ denote the random variable $\ell(f(x), z)$ and let
\[ P L_f = \mathbb{E}[\ell(f(x), z)], \quad \text{and} \quad P_n L_f = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), z_i) \tag{84} \]
denote the population risk and empirical risk over $n$ samples. The constrained empirical risk minimizer is given by
\[ \widehat{f} = \arg\min_{f \in \mathcal{F}} P_n L_f. \tag{85} \]

Additional Notation. To keep notation compact, we adopt the abbreviations $\|f\|_{p,q} := \|f\|_{L_p(\ell_q, D)}$ and $\|f\|_{n,p,q} := \|f\|_{L_p(\ell_q, z_{1:n})}$. We drop the subscript $q$ for real-valued function classes. We recall that $\mathcal{F}|_t$ denotes the $t$th coordinate projection of $\mathcal{F}$ and, likewise, for any $f \in \mathcal{F}$, the $t$th coordinate projection is $f_t \in \mathcal{F}|_t$.

The following lemmas provide a vector-valued extension of the analysis of constrained ERM based on local Rademacher complexities given in Wainwright (2019). The key idea behind the extension is to invoke a vector-valued contraction theorem for Rademacher complexity due to Maurer (2016). For completeness we include proofs for these lemmas, even though some parts are straightforward adaptations of lemmas in Wainwright (2019).

Lemma 8. Consider a function class $\mathcal{F} : \mathcal{X} \to \mathbb{R}^d$, with $\sup_{f \in \mathcal{F}} \|f\|_{\infty,2} \leq 1$ and pick any $f^\star \in \mathcal{F}$. Assume that the loss $\ell$ is $L$-Lipschitz in its first argument with respect to the $\ell_2$-norm and let

\[ Z_n(r) = \sup_{f \in \mathcal{F} : \|f - f^\star\|_{2,2} \leq r} |P_n(L_f - L_{f^\star}) - P(L_f - L_{f^\star})|. \tag{86} \]

Then there are universal constants $c_1, c_2 > 0$ such that

\[ \Pr\left[ Z_n(r) \geq 16\, L \sum_{t=1}^{d} \mathcal{R}(\mathcal{F}|_t - f_t^\star, r) + u \right] \leq c_1 \exp\left( -\frac{c_2 n u^2}{L^2 r^2 + 2Lu} \right). \tag{87} \]

Moreover, if $\delta_n$ is any solution to the inequalities

\[ \forall t \in \{1, \ldots, d\} : \mathcal{R}(\mathrm{star}(\mathcal{F}|_t - f_t^\star), \delta) \leq \delta^2, \tag{88} \]
then for each $r \geq \delta_n$,
\[ \Pr[Z_n(r) \geq 16\, L\, d\, r\, \delta_n + u] \leq c_1 \exp\left( -\frac{c_2 n u^2}{L^2 r^2 + 2Lu} \right). \tag{89} \]

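Lemma 9. Consider a vector valued function class $\mathcal{F} : \mathcal{X} \to \mathbb{R}^d$ with $\sup_{f \in \mathcal{F}} \|f\|_{\infty,2} \leq 1$, and pick any $f^\star \in \mathcal{F}$. Let $\delta_n^2 \geq \frac{4\, d \log(41 \log(2 c_2 n))}{c_2 n}$ be any solution to the system of inequalities
\[ \forall t \in \{1, \ldots, d\} : \mathcal{R}(\mathrm{star}(\mathcal{F}|_t - f_t^\star), \delta) \leq \delta^2. \tag{90} \]

Moreover, assume that the loss $\ell$ is $L$-Lipschitz in its first argument with respect to the $\ell_2$ norm. Consider the following event:

\[ \mathcal{E}_1 = \left\{ \exists f \in \mathcal{F} : \|f - f^\star\|_{2,2} \geq \delta_n \text{ and } |P_n(L_f - L_{f^\star}) - P(L_f - L_{f^\star})| \geq 18 L\, d\, \delta_n \|f - f^\star\|_{2,2} \right\}. \tag{91} \]
Then for some universal constants $c_3, c_4$, $\Pr[\mathcal{E}_1] \leq c_3 \exp(-c_4 n \delta_n^2)$.

Lemma 10. Consider a vector valued function class $\mathcal{F} : \mathcal{X} \to \mathbb{R}^d$ and pick any $f^\star \in \mathcal{F}$. Let $\delta_n \geq 0$ be any solution to the inequalities

\[ \forall t \in \{1, \ldots, d\} : \mathcal{R}(\mathrm{star}(\mathcal{F}|_t - f_t^\star), \delta) \leq \delta^2. \tag{92} \]

Suppose $\sup_{f \in \mathcal{F}} \|f\|_{\infty,2} \leq 1$. Moreover, assume that the loss $\ell$ is $L$-Lipschitz in its first argument with respect to the $\ell_2$ norm and also linear, i.e. $L_{f + f'} = L_f + L_{f'}$ and $L_{\alpha f} = \alpha L_f$. Consider the following event:

\[ \mathcal{E}_1 = \left\{ \exists f \in \mathcal{F} : \|f - f^\star\|_{2,2} \geq \delta_n \text{ and } |P_n(L_f - L_{f^\star}) - P(L_f - L_{f^\star})| \geq 17 L\, d\, \delta_n \|f - f^\star\|_{2,2} \right\}. \tag{93} \]
Then for some universal constants $c_3, c_4$, $\Pr[\mathcal{E}_1] \leq c_3 \exp(-c_4 n \delta_n^2)$.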

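Lemma 11. Consider a function class $\mathcal{F}$, with $\sup_{f \in \mathcal{F}} \|f\|_\infty \leq 1$, and pick any $f^\star \in \mathcal{F}$. Let $\delta_n^2 \geq \frac{4\, d \log(41 \log(2 c_2 n))}{c_2 n}$ be any solution to the inequalities:

\[ \forall t \in \{1, \ldots, d\} : \mathcal{R}(\mathrm{star}(\mathcal{F}|_t - f_t^\star), \delta) \leq \delta^2. \tag{94} \]

Moreover, assume that the loss $\ell$ is $L$-Lipschitz in its first argument with respect to the $\ell_2$ norm. Then for some universal constants $c_5, c_6$, with probability $1 - c_5 \exp(-c_6 n \delta_n^2)$,

\[ |P_n(L_f - L_{f^\star}) - P(L_f - L_{f^\star})| \leq 18 L\, d\, \delta_n \{\|f - f^\star\|_{2,2} + \delta_n\}, \quad \forall f \in \mathcal{F}. \tag{95} \]

Hence, the outcome $\widehat{f}$ of constrained ERM satisfies that with the same probability,

\[ P(L_{\widehat{f}} - L_{f^\star}) \leq 18 L\, d\, \delta_n \{\|\widehat{f} - f^\star\|_{2,2} + \delta_n\}. \tag{96} \]

If the loss $L_f$ is also linear in $f$, i.e. $L_{f + f'} = L_f + L_{f'}$ and $L_{\alpha f} = \alpha L_f$, then the lower bound on $\delta_n^2$ is not required.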

G.1 Proofs of Lemmas for Constrained M-Estimators

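Proof of Lemma 8. By the Lipschitz condition on the loss and the boundedness of the functions, we have $\|L_f - L_{f^\star}\|_\infty \leq L \|f - f^\star\|_{\infty,2} \leq 2L$. Moreover,

\[ \mathrm{Var}(L_f - L_{f^\star}) \leq P(L_f - L_{f^\star})^2 \leq L^2 \|f - f^\star\|_{2,2}^2 \leq L^2 r^2. \]

Thus by Talagrand's concentration inequality (see Theorem 3.8 of Wainwright (2019) and follow-up discussion) we have

\[ \Pr[Z_n(r) \geq 2\, \mathbb{E}[Z_n(r)] + u] \leq c_1 \exp\left( -c_2 \frac{n u^2}{L^2 r^2 + 2Lu} \right). \]

Moreover, by a standard symmetrization argument,

\[ \begin{aligned} \mathbb{E}[Z_n(r)] &\leq 2\, \mathbb{E}\left[ \sup_{\|f - f^\star\|_{2,2} \leq r} \left| \frac{1}{n} \sum_{i=1}^{n} \epsilon_i \{\ell(f(x_i), z_i) - \ell(f^\star(x_i), z_i)\} \right| \right] \\ &= 2\, \mathbb{E}\left[ \sup_{\|f - f^\star\|_{2,2} \leq r,\ \delta \in \{-1, 1\}} \delta\, \frac{1}{n} \sum_{i=1}^{n} \epsilon_i \{\ell(f(x_i), z_i) - \ell(f^\star(x_i), z_i)\} \right] \\ &\leq 2\, \mathbb{E}\left[ \sum_{\delta \in \{-1, 1\}} \sup_{\|f - f^\star\|_{2,2} \leq r} \delta\, \frac{1}{n} \sum_{i=1}^{n} \epsilon_i \{\ell(f(x_i), z_i) - \ell(f^\star(x_i), z_i)\} \right] \\ &\leq 4\, \mathbb{E}\left[ \sup_{\|f - f^\star\|_{2,2} \leq r} \frac{1}{n} \sum_{i=1}^{n} \epsilon_i \{\ell(f(x_i), z_i) - \ell(f^\star(x_i), z_i)\} \right], \end{aligned} \]
where the second inequality follows from the fact that each summand is non-negative, since we can always choose $f = f^\star$. By invoking the multivariate contraction inequality of Maurer (2016), letting $\{\epsilon_{i,t}\}_{1 \leq i \leq n, 1 \leq t \leq d}$ be independent Rademacher random variables, we have

\[ \begin{aligned} &\leq 8\, L\, \mathbb{E}\left[ \sup_{\|f - f^\star\|_{2,2} \leq r} \frac{1}{n} \sum_{i=1}^{n} \sum_{t=1}^{d} \epsilon_{i,t} (f_t(x_i) - f_t^\star(x_i)) \right] \\ &\leq 8\, L \sum_{t=1}^{d} \mathbb{E}\left[ \sup_{\|f_t - f_t^\star\|_2 \leq r} \frac{1}{n} \sum_{i=1}^{n} \epsilon_{i,t} (f_t(x_i) - f_t^\star(x_i)) \right] \\ &= 8\, L \sum_{t=1}^{d} \mathcal{R}(\mathcal{F}|_t - f_t^\star, r), \end{aligned} \]

where we also used the fact that for any fixed $f^\star$, $\mathbb{E}[\epsilon_i \ell(f^\star(x_i), z_i)] = \mathbb{E}[\epsilon_{i,t} f_t^\star(x_i)] = 0$. This completes the proof of the first part of the lemma. For the second part, observe that $\mathcal{R}(\mathcal{F}|_t - f_t^\star, r) \leq \mathcal{R}(\mathrm{star}(\mathcal{F}|_t - f_t^\star), r)$. Moreover, for any star shaped function class $\mathcal{G}$, the function $r \mapsto \frac{\mathcal{R}(\mathcal{G}, r)}{r}$ is monotone non-increasing. Thus for any $r \geq \delta_n$,

\[ \frac{\mathcal{R}(\mathrm{star}(\mathcal{F}|_t - f_t^\star), r)}{r} \leq \frac{\mathcal{R}(\mathrm{star}(\mathcal{F}|_t - f_t^\star), \delta_n)}{\delta_n} \leq \delta_n. \]

Rearranging yields that $\mathcal{R}(\mathrm{star}(\mathcal{F}|_t - f_t^\star), r) \leq r \delta_n$. Hence, $\mathbb{E}[Z_n(r)] \leq 8\, L\, d\, r\, \delta_n$. This completes the proof of the second part of the lemma.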

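Proof of Lemma 9. We invoke a peeling argument. Consider the events

\[ S_m = \{f \in \mathcal{F} : \alpha^{m-1} \delta_n \leq \|f - f^\star\|_{2,2} \leq \alpha^m \delta_n\}, \]

for $\alpha = 18/17$. Since $\sup_{f \in \mathcal{F}} \|f - f^\star\|_{2,2} \leq 2 \sup_{f \in \mathcal{F}} \|f\|_{\infty,2} \leq 2$, it must be that any $f \in \mathcal{F}$ with $\|f - f^\star\|_{2,2} \geq \delta_n$ belongs to some $S_m$ for $m \in \{1, 2, \ldots, M\}$, where $M \leq \frac{\log(2/\delta_n)}{\log(\alpha)} \leq 41 \log(2/\delta_n)$. Thus by a union bound we have

\[ \Pr[\mathcal{E}_1] \leq \sum_{m=1}^{M} \Pr[\mathcal{E}_1 \cap S_m]. \]

Moreover, observe that if the event $\mathcal{E}_1 \cap S_m$ occurs then there exists an $f \in \mathcal{F}$ with $\|f - f^\star\|_{2,2} \leq \alpha^m \delta_n = r_m$, such that

\[ |P_n(L_f - L_{f^\star}) - P(L_f - L_{f^\star})| \geq 18 L\, d\, \delta_n \|f - f^\star\|_2 \geq 18 L\, d\, \delta_n\, \alpha^{m-1} \delta_n = 17 L\, d\, \delta_n\, \alpha^m \delta_n = 17 L\, d\, \delta_n\, r_m. \]

Thus, by the definition of $Z_n(r)$, we have

\[ \Pr[\mathcal{E}_1 \cap S_m] \leq \Pr[Z_n(r_m) \geq 17 L\, d\, \delta_n r_m]. \]

Applying Lemma 8 with $r = r_m$ and $u = L\, d\, r_m \delta_n$ yields that the latter probability is at most $c_1 \exp\left( -c_2 n \frac{L^2 r_m^2 \delta_n^2}{L^2 r_m^2 + L^2 d\, r_m \delta_n} \right) \leq c_1 \exp\left( -c_2 n \frac{\delta_n^2}{2d} \right)$, where we used the fact that $\delta_n \leq r_m$ in the last inequality. Subsequently, taking a union bound over the $M$ events, we have

\[ \Pr[\mathcal{E}_1] \leq c_1 M \exp\left( -c_2 n \frac{\delta_n^2}{2d} \right) = c_1 \exp\left( -c_2 n \frac{\delta_n^2}{2d} + \log(M) \right). \tag{97} \]

Since, by the assumed lower bound on $\delta_n$, we have $\log(M) \leq \log(41 \log(2/\delta_n)) \leq \log(41 \log(2 c_2 n)) \leq \frac{c_2 n \delta_n^2}{4d}$, we get
\[ \Pr[\mathcal{E}_1] \leq c_1 \exp\left( -c_2 n \frac{\delta_n^2}{4d} \right). \tag{98} \]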

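Proof of Lemma 10. For simplicity, let $\|\cdot\| = \|\cdot\|_{2,2}$. Suppose that there exists a function $f \in \mathcal{F}$, with $\|f - f^\star\| \geq \delta_n$, such that
\[ |P_n(L_f - L_{f^\star}) - P(L_f - L_{f^\star})| \geq 17 L\, d\, \delta_n \|f - f^\star\|. \]
Then we will show that there exists a function $f' \in \mathrm{star}(\mathcal{F} - f^\star)$, with $\|f' - f^\star\| = \delta_n$, such that
\[ |P_n(L_{f'} - L_{f^\star}) - P(L_{f'} - L_{f^\star})| \geq 17 L\, d\, \delta_n^2. \]
To do so, we simply choose $f'$ to satisfy
\[ f' - f^\star = \frac{\delta_n}{\|f - f^\star\|} (f - f^\star). \]

Since $\frac{\delta_n}{\|f - f^\star\|} \leq 1$ and by the definition of the star hull, we know that $f' \in \mathrm{star}(\mathcal{F} - f^\star)$. Moreover, by the definition of $f'$, we also have that $\|f' - f^\star\| = \delta_n$. Moreover, by the linearity of the loss $L_f$ with respect to $f$, we have:
\[ \begin{aligned} |P_n(L_{f'} - L_{f^\star}) - P(L_{f'} - L_{f^\star})| &= |P_n L_{f' - f^\star} - P L_{f' - f^\star}| \\ &= \frac{\delta_n}{\|f - f^\star\|} |P_n L_{f - f^\star} - P L_{f - f^\star}| \\ &= \frac{\delta_n}{\|f - f^\star\|} |P_n(L_f - L_{f^\star}) - P(L_f - L_{f^\star})| \\ &\geq \frac{\delta_n}{\|f - f^\star\|}\, 17 L\, d\, \delta_n \|f - f^\star\| = 17 L\, d\, \delta_n^2. \end{aligned} \]

Thus we have that the probability of event $\mathcal{E}_1$ is upper bounded by the probability of the event
\[ \mathcal{E}_1' = \left\{ \sup_{f \in \mathrm{star}(\mathcal{F} - f^\star) : \|f - f^\star\| \leq \delta_n} |P_n(L_f - L_{f^\star}) - P(L_f - L_{f^\star})| \geq 17 L\, d\, \delta_n^2 \right\} = \left\{ Z_n(\delta_n) \geq 17 L\, d\, \delta_n^2 \right\}. \]
Invoking Lemma 8 with $r = \delta_n$ and $u = L\, d\, \delta_n^2$, we get that the probability of the second event is also at most $\mu_1' \exp(-\mu_2' n \delta_n^2)$, for some universal constants $\mu_1', \mu_2'$.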

Proof of Lemma 11. Consider the events:
\[ \begin{aligned} \mathcal{E}_0 &= \{Z_n(\delta_n) \geq 17 L\, d\, \delta_n^2\}, \\ \mathcal{E}_1 &= \left\{ \exists f \in \mathcal{F} : \|f - f^\star\|_{2,2} \geq \delta_n \text{ and } |P_n(L_f - L_{f^\star}) - P(L_f - L_{f^\star})| \geq 18 L\, d\, \delta_n \|f - f^\star\|_{2,2} \right\}, \end{aligned} \]
with $Z_n(r)$ as defined in Lemma 8. Observe that if (95) is violated, then one of these events must occur. Applying Lemma 8 with $r = \delta_n$ and $u = L\, d\, \delta_n^2$ yields that event $\mathcal{E}_0$ happens with probability at most $c_1 \exp(-c_2' n \delta_n^2)$, where $c_2' = c_2/(L^2 + L\, d)$. Moreover, applying Lemma 9 we get that $\Pr[\mathcal{E}_1] \leq c_3 \exp(-c_4 n \delta_n^2)$. Thus by a union bound, with probability $1 - c_5 \exp(-c_6 n \delta_n^2)$, neither event occurs. If the loss $L_f$ is linear then we apply Lemma 10 instead of Lemma 9, which does not require a lower bound on $\delta_n^2$.

H Proofs from Section 4

Let Lθ,g denote the random variable `(θ(x), g(w); z) and recall that

n 1 X L = [`(θ(x), g(w); z)], and L = `(θ(x ), g(w ); z ), (99) P θ,g E Pn θ,g n i i i i=1 denote, respectively, the population risk and empirical risk over the n samples in S2. Consider the two-stage plugin ERM algorithm: θb = arg min PnLθ,g. (100) θ∈Θ b

As in Appendix G, we adopt the abbreviations $\|f\|_{p,q} := \|f\|_{L_p(\ell_q, D)}$ and $\|f\|_{n,p,q} := \|f\|_{L_p(\ell_q, S_2)}$, and drop the subscript $q$ for real-valued function classes.

Throughout this section, we repeatedly make use of the following fact: If $\delta_n$ solves a fixed point equation such as (35), then $\delta_n' := \delta_n + C\sqrt{\frac{\log(1/\delta)}{n}}$ does as well. By expanding the radius in this fashion, we can ensure that any events that occur with probability at least $1 - e^{-cn\delta_n^2}$ (e.g., Theorem 3 and Lemma 12 below) occur with probability at least $1 - \delta$, as long as $\delta_n$ is replaced by $\delta_n'$ in the final excess risk bound.
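The radius inflation above can be checked numerically. The sketch below assumes the tail has the exact form $e^{-cn\delta^2}$ and takes $C = 1/\sqrt{c}$; the function names and the default $c = 1$ are illustrative choices, not values from the paper.

```python
import math


def inflated_radius(delta_n, n, fail_prob, c=1.0):
    """Return delta_n' = delta_n + C * sqrt(log(1/fail_prob) / n) with C = 1/sqrt(c)."""
    return delta_n + math.sqrt(math.log(1.0 / fail_prob) / (c * n))


def tail_probability(radius, n, c=1.0):
    """Failure probability exp(-c * n * radius^2) of an event stated at this radius."""
    return math.exp(-c * n * radius ** 2)


radius = inflated_radius(delta_n=0.01, n=1000, fail_prob=0.05)
tail = tail_probability(radius, n=1000)
```

Since $\delta_n' \ge \sqrt{\log(1/\delta)/(cn)}$, the exponent satisfies $cn(\delta_n')^2 \ge \log(1/\delta)$, so the tail is at most $\delta$.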

H.1 Proof of Theorem 3

We split this section into two parts. In the first part we prove the main result (Theorem 3), which requires a lower bound on the critical radius $\delta_n^2$ of order $\Omega(\log\log(n)/n)$. In the second part we prove the improved result that does not require the lower bound, albeit under the extra assumption that the loss $\ell$ is smooth (Lemma 12). We prove both results for the case when $R = 1$. The general case follows by observing that ERM over $\Theta$ is equivalent to ERM over the class $\Theta/R$ with the loss $\tilde\ell(\zeta, g(w); z) := \ell(R\cdot\zeta, g(w); z)$, with the problem parameters remapped as $L \mapsto LR$, $\beta_2 \mapsto \beta_2 R$, $\kappa \mapsto \kappa$, $\lambda \mapsto \lambda R^2$, and $\delta_n \mapsto \delta_n/R$.

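Proof of Theorem 3. Since $\hat\theta$ is the outcome of the Plug-In ERM and since $\theta^\star \in \Theta$, we have
$$\mathbb{P}_n(L_{\hat\theta,\hat g} - L_{\theta^\star,\hat g}) \le 0.$$
Applying Lemma 9, with $\mathcal{F} = \Theta$, $f^\star = \theta^\star$ and $L_\cdot = L_{\cdot,\hat g}$, we know that the probability of the event
$$E_1 = \big\{\exists \theta \in \Theta : \|\theta - \theta^\star\|_{2,2} \ge \delta_n \text{ and } |\mathbb{P}_n(L_{\theta,\hat g} - L_{\theta^\star,\hat g}) - \mathbb{P}(L_{\theta,\hat g} - L_{\theta^\star,\hat g})| \ge 18 L d\,\delta_n \|\theta - \theta^\star\|_{2,2}\big\}$$
is at most $\zeta = c_3\exp(-c_4 n \delta_n^2)$. Thus, with probability $1 - \zeta$, either $\|\hat\theta - \theta^\star\|_{2,2} \le \delta_n$ or

$$|\mathbb{P}_n(L_{\hat\theta,\hat g} - L_{\theta^\star,\hat g}) - \mathbb{P}(L_{\hat\theta,\hat g} - L_{\theta^\star,\hat g})| \le 18 L d\,\delta_n \|\hat\theta - \theta^\star\|_{2,2}.$$
Invoking the first inequality, the latter also implies that

$$\mathbb{P}(L_{\hat\theta,\hat g} - L_{\theta^\star,\hat g}) \le 18 L d\,\delta_n \|\hat\theta - \theta^\star\|_{2,2}.$$
Invoking the strong convexity Assumption 3, we have

$$\mathbb{P}(L_{\hat\theta,\hat g} - L_{\theta^\star,\hat g}) \ge D_\theta L_D(\theta^\star, \hat g)[\hat\theta - \theta^\star] + \lambda\|\hat\theta - \theta^\star\|_{2,2}^2 - \kappa\|\hat g - g_0\|_{\mathcal{G}}^4.$$

Invoking Assumption 1, Assumption 2 and Assumption 4 and following the same steps as in the proof of Theorem 1, we have that for any $\eta > 0$,

$$\begin{aligned}
D_\theta L_D(\theta^\star, \hat g)[\hat\theta - \theta^\star] &\ge D_\theta L_D(\theta^\star, g_0)[\hat\theta - \theta^\star] - \frac{\beta_2}{4\eta}\|\hat g - g_0\|_{\mathcal{G}}^4 - \frac{\beta_2\eta}{4}\|\hat\theta - \theta^\star\|_{2,2}^2 \\
&\ge -\frac{\beta_2}{4\eta}\|\hat g - g_0\|_{\mathcal{G}}^4 - \frac{\beta_2\eta}{4}\|\hat\theta - \theta^\star\|_{2,2}^2.
\end{aligned}$$
Choosing $\eta = \frac{2\lambda}{\beta_2}$ and combining the last two inequalities yields
$$\mathbb{P}(L_{\hat\theta,\hat g} - L_{\theta^\star,\hat g}) \ge \frac{\lambda}{2}\|\hat\theta - \theta^\star\|_{2,2}^2 - \Big(\kappa + \frac{\beta_2^2}{8\lambda}\Big)\|\hat g - g_0\|_{\mathcal{G}}^4.$$
Combining with the upper bound on the left hand side and using the AM-GM inequality, we have for any $\eta > 0$,
$$\frac{\lambda}{2}\|\hat\theta - \theta^\star\|_{2,2}^2 - \Big(\kappa + \frac{\beta_2^2}{8\lambda}\Big)\|\hat g - g_0\|_{\mathcal{G}}^4 \le 18 L d\,\delta_n\|\hat\theta - \theta^\star\|_{2,2} \le \frac{9L^2d^2}{\eta}\delta_n^2 + \frac{\eta}{2}\|\hat\theta - \theta^\star\|_{2,2}^2.$$

Choosing $\eta = \frac{\lambda}{2}$ and re-arranging yields
$$\|\hat\theta - \theta^\star\|_{2,2}^2 \le \frac{80L^2d^2}{\lambda^2}\delta_n^2 + \frac{4}{\lambda}\Big(\kappa + \frac{\beta_2^2}{8\lambda}\Big)\|\hat g - g_0\|_{\mathcal{G}}^4.$$
Applying Lemma 11 and invoking (95) together with the fact that $\mathbb{P}_n(L_{\hat\theta,\hat g} - L_{\theta^\star,\hat g}) \le 0$, by the definition of Plug-In ERM, yields that

$$\begin{aligned}
\mathbb{P}(L_{\hat\theta,\hat g} - L_{\theta^\star,\hat g}) &\le O\Big(Ld\cdot\delta_n\big(\|\hat\theta - \theta^\star\|_{2,2} + \delta_n\big)\Big) \\
&= O\bigg(\frac{L^2d^2}{\lambda}\delta_n^2 + Ld\,\delta_n^2 + Ld\,\delta_n\Big(\frac{\kappa}{\lambda} \vee \frac{\beta_2^2}{\lambda^2}\Big)^{1/2}\|\hat g - g_0\|_{\mathcal{G}}^2\bigg) \\
&\le O\bigg(\frac{L^2d^2}{\lambda}\delta_n^2 + Ld\,\delta_n^2 + Ld\Big(\frac{\kappa}{\lambda} \vee \frac{\beta_2^2}{\lambda^2}\Big)\|\hat g - g_0\|_{\mathcal{G}}^4\bigg),
\end{aligned}$$
where the final inequality is by AM-GM.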

We now prove an extension to Theorem 3, which shows that the $\log\log n/n$ lower bound on the critical radius can be removed for smooth losses. Since this is an auxiliary result, we do not track the precise dependence on the problem-dependent constants.

Lemma 12 (Improved Fast Rates for Plug-In ERM). Consider a function class $\Theta : \mathcal{X} \to \mathbb{R}^d$, with $\sup_{\theta\in\Theta}\|\theta\|_{\infty,2} \le 1$. Let $\delta_n \ge 0$ be any solution to the inequality:
$$\forall t \in \{1, \ldots, d\} : \mathcal{R}(\mathrm{star}(\Theta|_t - \theta_t^\star), \delta) \le \delta^2. \tag{101}$$

Suppose $\ell(\cdot, \hat g(w); z)$ is $L$-Lipschitz with respect to the $\ell_2$ norm, convex, and $U$-smooth in its first argument. Moreover, suppose the population risk $L_D$ satisfies Assumptions 1 to 4, with $\|\cdot\|_\Theta = \|\cdot\|_{2,2}$ and $\|\cdot\|_{\mathcal{G}}$ arbitrary, and that Assumption 4(a) is satisfied at every $g \in \mathcal{G}$, rather than only at $g_0$. Let $\hat\theta$ be the outcome of the plug-in ERM algorithm. Then with probability at least $1 - \delta$,
$$\mathbb{P}(L_{\hat\theta,\hat g} - L_{\theta^\star,\hat g}) \le \widetilde{O}\Big(\delta_n^2 + \frac{\log(1/\delta)}{n} + \|\hat g - g_0\|_{\mathcal{G}}^4\Big), \tag{102}$$
where $\widetilde{O}$ hides polynomial dependence on problem parameters.

Proof of Lemma 12. To keep notation compact, we abbreviate $\|\theta\| = \|\theta\|_{2,2} = \sqrt{\mathbb{E}\|\theta(x)\|_2^2}$ and $\|\theta\|_n = \|\theta\|_{n,2,2} = \sqrt{\frac{1}{n}\sum_{i=1}^n\|\theta(x_i)\|_2^2}$ throughout this proof. We repeatedly make use of the fact (see Lemma 14.1 of Wainwright (2019)) that for $\delta_n$ satisfying the conditions of the theorem, with probability at least $1 - \mu_1\exp(-\mu_2 n\delta_n^2)$ (for universal constants $\mu_1, \mu_2$), for all $\theta \in \Theta$,

$$\Big|\|\theta - \theta^\star\|_n^2 - \|\theta - \theta^\star\|^2\Big| \le \frac{1}{2}\|\theta - \theta^\star\|^2 + d\,\frac{\delta_n^2}{2}. \tag{103}$$

Since $\hat\theta$ is the outcome of the plug-in ERM and since $\theta^\star \in \Theta$, we have
$$\mathbb{P}_n(L_{\hat\theta,\hat g} - L_{\theta^\star,\hat g}) \le 0.$$
Moreover, the following lemma shows that the strong convexity assumption (Assumption 3) on the population risk also implies (up to $\delta_n^2$ terms) strong convexity of the empirical risk.

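Lemma 13. Let $\delta_n \ge 0$ be any solution to the inequality:

$$\forall t \in \{1, \ldots, d\} : \mathcal{R}(\mathrm{star}(\Theta|_t - \theta_t^\star), \delta) \le \delta^2. \tag{104}$$

Suppose that the loss function $\ell$ is convex and $U$-smooth in its first argument. Moreover, suppose that the population risk satisfies Assumption 3. Then for parameters $c_1 = O(d)$, $c_2 = O(U^2d^2)$, with probability $1 - c_1\exp(-c_2 n\delta_n^2)$,

$$\mathbb{P}_n(L_{\hat\theta,\hat g} - L_{\theta^\star,\hat g}) \ge \mathbb{P}_n\big\langle \nabla_\zeta\ell(\theta^\star(x), \hat g(w); z),\ \hat\theta(z) - \theta^\star(z)\big\rangle + \frac{\lambda}{4}\|\hat\theta - \theta^\star\|^2 - \kappa\|\hat g - g_0\|_{\mathcal{G}}^4 - O(U^2d^2\delta_n^2).$$

We defer the proof of this lemma to the end of the section and finish the proof of Lemma 12. Let $L^\star_{\theta,g} = \langle\nabla_\zeta\ell(\theta^\star(x), g(w); z), \theta(z)\rangle$ denote the linearized loss around $\theta^\star$. Combining the last two inequalities we derive

$$\frac{\lambda}{2}\|\hat\theta - \theta^\star\|^2 \le -\mathbb{P}_n\big(L^\star_{\hat\theta,\hat g} - L^\star_{\theta^\star,\hat g}\big) + \kappa\|\hat g - g_0\|_{\mathcal{G}}^4 + O(U^2d^2\delta_n^2). \tag{105}$$
Consider the following events:
$$E_1 = \Big\{\sup_{\|\theta - \theta^\star\| \le \delta_n}\big|\mathbb{P}_n(L_{\theta,\hat g} - L_{\theta^\star,\hat g}) - \mathbb{P}(L_{\theta,\hat g} - L_{\theta^\star,\hat g})\big| > 17 L d\,\delta_n^2\Big\},$$
$$E_2 = \Big\{\exists\theta \in \Theta : \|\theta - \theta^\star\| \ge \delta_n \text{ and } \big|\mathbb{P}_n(L^\star_{\theta,\hat g} - L^\star_{\theta^\star,\hat g}) - \mathbb{P}(L^\star_{\theta,\hat g} - L^\star_{\theta^\star,\hat g})\big| \ge 17 L d\,\delta_n\|\theta - \theta^\star\|\Big\}.$$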

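Invoking Lemma 8, with $\mathcal{F} = \Theta$, $f^\star = \theta^\star$ and $L_\cdot = L_{\cdot,\hat g}$, $r = \delta_n$ and $u = L d\,\delta_n^2$, we get that the probability of event $E_1$ is at most $\mu_1'\exp(-\mu_2' n\delta_n^2)$, for some universal constants $\mu_1', \mu_2'$. Noting that the loss $L^\star_{\theta,\hat g}$ is linear in $\theta$ and invoking Lemma 10, with $\mathcal{F} = \Theta$, $f^\star = \theta^\star$ and $L_\cdot = L^\star_{\cdot,\hat g}$, we get that the probability of $E_2$ is also at most $\mu_1'\exp(-\mu_2' n\delta_n^2)$. Thus, we conclude by a union bound that both events fail to occur with probability at least $1 - \zeta$, for $\zeta = \mu_1''\exp(-\mu_2'' n\delta_n^2)$, for some universal constants $\mu_1'', \mu_2''$. Thus, with probability at least $1 - \zeta$, $E_1$ does not hold and either $\|\hat\theta - \theta^\star\| \le \delta_n$ or
$$\big|\mathbb{P}_n(L^\star_{\hat\theta,\hat g} - L^\star_{\theta^\star,\hat g}) - \mathbb{P}(L^\star_{\hat\theta,\hat g} - L^\star_{\theta^\star,\hat g})\big| \le 17 L d\,\delta_n\|\hat\theta - \theta^\star\|.$$

In the former case, together with the fact that $E_1$ does not hold,
$$\big|\mathbb{P}_n(L_{\hat\theta,\hat g} - L_{\theta^\star,\hat g}) - \mathbb{P}(L_{\hat\theta,\hat g} - L_{\theta^\star,\hat g})\big| \le \sup_{\|\theta - \theta^\star\| \le \delta_n}\big|\mathbb{P}_n(L_{\theta,\hat g} - L_{\theta^\star,\hat g}) - \mathbb{P}(L_{\theta,\hat g} - L_{\theta^\star,\hat g})\big| \le 17 L d\,\delta_n^2.$$
Combining with the fact that $\mathbb{P}_n(L_{\hat\theta,\hat g} - L_{\theta^\star,\hat g}) \le 0$, by the definition of ERM, yields

$$\mathbb{P}(L_{\hat\theta,\hat g} - L_{\theta^\star,\hat g}) \le 17 L d\,\delta_n^2,$$
as desired for the theorem. In the latter case, we further derive, by invoking (105), that

$$\begin{aligned}
\frac{\lambda}{2}\|\hat\theta - \theta^\star\|^2 &\le 17 L d\,\delta_n\|\hat\theta - \theta^\star\| - \mathbb{P}\big(L^\star_{\hat\theta,\hat g} - L^\star_{\theta^\star,\hat g}\big) + \kappa\|\hat g - g_0\|_{\mathcal{G}}^4 + O(U^2d^2\delta_n^2) \\
&= 17 L d\,\delta_n\|\hat\theta - \theta^\star\| - D_\theta L_D(\theta^\star, \hat g)[\hat\theta - \theta^\star] + \kappa\|\hat g - g_0\|_{\mathcal{G}}^4 + O(U^2d^2\delta_n^2).
\end{aligned}$$
Applying the AM-GM inequality, we have that for any $\eta > 0$,

$$\frac{\lambda}{2}\|\hat\theta - \theta^\star\|^2 \le \frac{17^2L^2d^2}{2\eta}\delta_n^2 + \frac{\eta}{2}\|\hat\theta - \theta^\star\|^2 - D_\theta L_D(\theta^\star, \hat g)[\hat\theta - \theta^\star] + \kappa\|\hat g - g_0\|_{\mathcal{G}}^4 + O(U^2d^2\delta_n^2).$$

Choosing $\eta = \frac{\lambda}{2}$ and re-arranging yields
$$\|\hat\theta - \theta^\star\|^2 \le O\bigg(\Big(\frac{L^2}{\lambda^2} \vee \frac{U^2}{\lambda}\Big)d^2\delta_n^2\bigg) - \frac{4}{\lambda}D_\theta L_D(\theta^\star, \hat g)[\hat\theta - \theta^\star] + \frac{4\kappa}{\lambda}\|\hat g - g_0\|_{\mathcal{G}}^4.$$
Performing a second order Taylor expansion of the population risk around $\theta^\star$ and invoking second order smoothness of the population risk,

$$\mathbb{P}(L_{\hat\theta,\hat g} - L_{\theta^\star,\hat g}) \le D_\theta L_D(\theta^\star, \hat g)[\hat\theta - \theta^\star] + \frac{\beta_1}{2}\cdot\|\hat\theta - \theta^\star\|^2.$$
Combining the latter two inequalities yields

$$\mathbb{P}(L_{\hat\theta,\hat g} - L_{\theta^\star,\hat g}) \le O\bigg(\beta_1\Big(\frac{L^2}{\lambda^2} \vee \frac{U^2}{\lambda}\Big)d^2\delta_n^2 + \frac{\kappa\beta_1}{\lambda}\|\hat g - g_0\|_{\mathcal{G}}^4\bigg) - \Big(\frac{4\beta_1}{\lambda} - 1\Big)D_\theta L_D(\theta^\star, \hat g)[\hat\theta - \theta^\star]. \tag{106}$$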

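Since $\beta_1/\lambda \ge 1$ (by standard properties of smoothness and strong convexity of a function) the coefficient $\big(\frac{4\beta_1}{\lambda} - 1\big)$ in front of the last term is non-negative. Invoking Assumption 1, Assumption 2 and Assumption 4 and following the same inequalities as in the proof of Theorem 1, we have that for any $\eta > 0$,

$$\begin{aligned}
-D_\theta L_D(\theta^\star, \hat g)[\hat\theta - \theta^\star] &\le -D_\theta L_D(\theta^\star, g_0)[\hat\theta - \theta^\star] + \frac{\beta_2}{4\eta}\|\hat g - g_0\|_{\mathcal{G}}^4 + \frac{\beta_2\eta}{4}\|\hat\theta - \theta^\star\|^2 \\
&\le \frac{\beta_2}{4\eta}\|\hat g - g_0\|_{\mathcal{G}}^4 + \frac{\beta_2\eta}{4}\|\hat\theta - \theta^\star\|^2 \\
&\le \frac{\beta_2}{4\eta}\|\hat g - g_0\|_{\mathcal{G}}^4 + \frac{\beta_2\eta}{4}\bigg(O\bigg(\Big(\frac{L^2}{\lambda^2} \vee \frac{U^2}{\lambda}\Big)d^2\delta_n^2 + \frac{\kappa}{\lambda}\|\hat g - g_0\|_{\mathcal{G}}^4\bigg) - \frac{4}{\lambda}D_\theta L_D(\theta^\star, \hat g)[\hat\theta - \theta^\star]\bigg).
\end{aligned}$$

Choosing $\eta = \frac{\lambda}{2\beta_2}$ and rearranging, we have

$$-D_\theta L_D(\theta^\star, \hat g)[\hat\theta - \theta^\star] \le O\bigg(\Big(\frac{L^2}{\lambda^2} \vee \frac{U^2}{\lambda}\Big)d^2\delta_n^2 + \Big(\kappa \vee \frac{\beta_2^2}{\lambda}\Big)\|\hat g - g_0\|_{\mathcal{G}}^4\bigg).$$

Combining with (106), this yields

$$\mathbb{P}(L_{\hat\theta,\hat g} - L_{\theta^\star,\hat g}) \le \widetilde{O}\big(\delta_n^2 + \|\hat g - g_0\|_{\mathcal{G}}^4\big), \tag{107}$$
as desired.

Proof of Lemma 13. Consider a second-order Taylor expansion of $\mathbb{P}_n(L_{\hat\theta,\hat g} - L_{\theta^\star,\hat g})$ as

$$\mathbb{P}_n\Big[\big\langle\nabla_\zeta\ell(\theta^\star(x), \hat g(w); z),\ \hat\theta(z) - \theta^\star(z)\big\rangle\Big] + \frac{1}{2}\mathbb{P}_n\Big[\big(\hat\theta(x) - \theta^\star(x)\big)^\top\nabla_{\zeta\zeta}\ell(\bar\theta(x), \hat g(w); z)\big(\hat\theta(x) - \theta^\star(x)\big)\Big],$$
for some $\bar\theta \in \mathrm{conv}\{\hat\theta, \theta^\star\}$ that is determined by $\hat\theta$ by the mean value theorem. Since $\ell$ is a convex function with respect to its first argument, we know that the Hessian is positive semidefinite. Hence, we can decompose it as

$$\nabla_{\zeta\zeta}\ell(\bar\theta(x), \hat g(w); z) = \sum_{t=1}^d v_t(\bar\theta(x), \hat g(w); z)\,v_t(\bar\theta(x), \hat g(w); z)^\top.$$

For simplicity of notation, since $\hat g$ is fixed and the rest of the variables are random, we will denote with $V_{t,\hat\theta}$ the random vector that corresponds to vector $v_t(\bar\theta(x), \hat g(w); z)$, where we recall that $\bar\theta$ is determined by $\hat\theta$. Thus we can write the second-order term in the Taylor expansion as

$$\sum_{t=1}^d\mathbb{P}_n\Big[\big(\hat\theta(x) - \theta^\star(x)\big)^\top V_{t,\hat\theta}V_{t,\hat\theta}^\top\big(\hat\theta(x) - \theta^\star(x)\big)\Big] = \sum_{t=1}^d\mathbb{P}_n\big\langle V_{t,\hat\theta},\ \hat\theta(x) - \theta^\star(x)\big\rangle^2.$$

Moreover, we note that since $\|\nabla_{\zeta\zeta}\ell(\theta(x), \hat g(w); z)\|_\infty \le U$, it must be that $\|V_{t,\theta}\|_2^2 \le U$, for all $\theta \in \Theta$, which implies that $\|V_{t,\hat\theta}\|_2 \le \sqrt{U}$. We now argue that for any $\delta_n$ that satisfies the conditions of the theorem, for some universal constants $\mu_1, \mu_2$, with probability $1 - \mu_1\exp(-\mu_2 n\delta_n^2)$,

$$\mathbb{P}_n\Big[\big\langle V_{t,\hat\theta},\ \hat\theta(x) - \theta^\star(x)\big\rangle^2\Big] \ge \frac{1}{2}\mathbb{P}\Big[\big\langle V_{t,\hat\theta},\ \hat\theta(x) - \theta^\star(x)\big\rangle^2\Big] - O(\delta_n^2).$$
Consider the function class $\mathcal{H} = \{Z \mapsto \langle V_{t,\theta}, \theta(X) - \theta^\star(X)\rangle : \theta \in \Theta\}$. Introduce the shorthand $\|h\|_n = \sqrt{\frac{1}{n}\sum_{i=1}^n h^2(Z_i)}$ and $\|h\| = \sqrt{\mathbb{E}[h^2(Z)]}$. Since $|h| \le \sqrt{U}$ for all $h \in \mathcal{H}$, invoking Theorem 4.1 of Wainwright (2019), we have that for any $\hat\delta_n$ that satisfies the inequality

$$\mathcal{R}_n(\mathrm{star}(\mathcal{H}, 0), \hat\delta_n) \le \frac{\hat\delta_n^2}{\sqrt{U}},$$

with probability at least $1 - \mu_3\exp(-\mu_4 n\hat\delta_n^2)$,

$$\|h\|_n^2 \ge \frac{1}{2}\|h\|^2 - \frac{\hat\delta_n^2}{2} \quad \forall h \in \mathcal{H}.$$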

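The latter immediately translates to

$$\mathbb{P}_n\langle V_{t,\theta}, \theta(x) - \theta^\star(x)\rangle^2 \ge \frac{1}{2}\mathbb{P}\langle V_{t,\theta}, \theta(x) - \theta^\star(x)\rangle^2 - \frac{1}{2}\hat\delta_n^2 \quad \forall\theta \in \Theta.$$

In particular, this inequality holds for $\theta = \hat\theta$. We now relate $\hat\delta_n$ to the choice of $\delta_n$ from the theorem statement. Observe that because $\|V_{t,\theta}\| \le \sqrt{U}$, the function $Z \mapsto \langle V_{t,\theta}, \theta(X) - \theta^\star(X)\rangle$ is $\sqrt{U}$-Lipschitz in $\theta(X) - \theta^\star(X)$. Thus by the vector-valued contraction lemma of Maurer (2016), we have (cf. the proof of Lemma 8) for any $r > 0$,

$$\mathcal{R}_n(\mathrm{star}(\mathcal{H}), r) \le 4\sqrt{U}\sum_{t=1}^d\mathcal{R}(\mathrm{star}(\Theta|_t - \theta_t^\star), r).$$

Now for any $r \ge \delta_n$, we have

$$\mathcal{R}_n(\mathrm{star}(\mathcal{H}), r) \le 4\sqrt{U}\sum_{t=1}^d\mathcal{R}(\mathrm{star}(\Theta|_t - \theta_t^\star), r) \le 4\sqrt{U}\,d\,\delta_n r.$$

Thus if we choose $\hat\delta_n = 4Ud\,\delta_n$, then

$$\mathcal{R}_n(\mathrm{star}(\mathcal{H}), \hat\delta_n) \le 4\sqrt{U}\,d\,\delta_n\hat\delta_n = \frac{\hat\delta_n^2}{\sqrt{U}},$$

which is what is required for $\hat\delta_n$. Thus, with probability at least $1 - \mu_3\exp(-\mu_5U^2d^2n\delta_n^2)$,

$$\mathbb{P}_n\big\langle V_{t,\hat\theta}, \hat\theta(x) - \theta^\star(x)\big\rangle^2 \ge \frac{1}{2}\mathbb{P}\big\langle V_{t,\hat\theta}, \hat\theta(x) - \theta^\star(x)\big\rangle^2 - 8U^2d^2\delta_n^2 \quad \forall\theta \in \Theta,$$
as desired. Now taking a union bound over all $t$, and since $d$ is a constant, we get that the latter holds uniformly for all $t$. Hence, we conclude that with probability at least $1 - d\exp(-\mu_5U^2d^2n\delta_n^2)$,

$$\sum_{t=1}^d\mathbb{P}_n\big\langle V_{t,\hat\theta}, \hat\theta(x) - \theta^\star(x)\big\rangle^2 \ge \frac{1}{2}\sum_{t=1}^d\mathbb{P}\big\langle V_{t,\hat\theta}, \hat\theta(x) - \theta^\star(x)\big\rangle^2 - O(U^2d^2\delta_n^2).$$
Subsequently the latter inequality implies that

$$\begin{aligned}
&\mathbb{P}_n\big(\hat\theta(x) - \theta^\star(x)\big)^\top\nabla_{\zeta\zeta}\ell(\bar\theta(x), \hat g(w); z)\big(\hat\theta(x) - \theta^\star(x)\big) \\
&\qquad\ge \frac{1}{2}\mathbb{P}\big(\hat\theta(x) - \theta^\star(x)\big)^\top\nabla_{\zeta\zeta}\ell(\bar\theta(x), \hat g(w); z)\big(\hat\theta(x) - \theta^\star(x)\big) - O(U^2d^2\delta_n^2).
\end{aligned}$$

The first term on the right hand side is equal to $D_\theta^2L_D(\bar\theta, \hat g)[\hat\theta - \theta^\star, \hat\theta - \theta^\star]$. Hence, by Assumption 3, we get that the latter is lower bounded by

$$\frac{\lambda}{2}\|\hat\theta - \theta^\star\|^2 - \frac{\kappa}{2}\|\hat g - g_0\|_{\mathcal{G}}^4 - O(U^2d^2\delta_n^2).$$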

H.2 Slow Rate for Plug-In ERM

In this section, we give a generalization bound for plug-in ERM in the slow rate regime which scales with the (normalized) entropy integral for the class $\ell\circ\Theta$. To state the result, we first formally define the entropy integral.

Definition 4 (Entropy Integral). For any real-valued function class $\mathcal{F}$ the entropy integral is given by
$$\kappa(\mathcal{F}, r) = \inf_{\alpha\ge0}\bigg\{4\alpha + 10\int_\alpha^r\sqrt{\frac{\mathcal{H}_2(\mathcal{F}, \varepsilon/2, n)}{n}}\,d\varepsilon\bigg\}. \tag{108}$$
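The entropy integral (108) can be approximated numerically once a metric entropy bound is fixed. The sketch below is only an illustration: it assumes a parametric-type entropy $\mathcal{H}_2(\mathcal{F}, \varepsilon, n) = d\log(1/\varepsilon)$, replaces the infimum over $\alpha$ by a minimum over a grid, and uses a midpoint rule for the integral. The names `entropy_integral` and `parametric_entropy`, and the values $d = 5$, $r = 1$, are hypothetical.

```python
import math


def entropy_integral(entropy, r, n, grid=50, steps=2000):
    """Approximate kappa(F, r) = inf_alpha {4 alpha + 10 int_alpha^r sqrt(H(eps/2)/n) d eps}."""
    best = float("inf")
    for k in range(grid + 1):
        alpha = r * k / grid
        width = (r - alpha) / steps
        integral = 0.0
        for j in range(steps):
            eps = alpha + (j + 0.5) * width  # midpoint rule, never evaluates at eps = 0
            integral += math.sqrt(entropy(eps / 2.0) / n) * width
        best = min(best, 4.0 * alpha + 10.0 * integral)
    return best


def parametric_entropy(scale, d=5):
    """Assumed entropy bound H_2(F, scale, n) = d * log(1 / scale), valid for scale < 1."""
    return d * math.log(1.0 / scale)


kappa_small_n = entropy_integral(parametric_entropy, r=1.0, n=100)
kappa_large_n = entropy_integral(parametric_entropy, r=1.0, n=10000)
```

Because $\alpha = r$ is always on the grid, the approximation never exceeds $4r$, and for this entropy it shrinks as $n$ grows.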

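The generalization bound is as follows.

Theorem 9 (Slow Rate for Plug-In ERM). Consider the function class

$$\ell\circ\Theta = \{\ell(\theta(\cdot), \hat g(\cdot), \cdot) : \theta \in \Theta\},$$
with $R := \sup_{\theta\in\Theta}\|\ell(\theta(\cdot), \hat g(\cdot), \cdot)\|_{L_\infty(D)}$ and $r := \sup_{\theta\in\Theta}\|\ell(\theta(\cdot), \hat g(\cdot), \cdot) - \ell(\theta^\star(\cdot), \hat g(\cdot), \cdot)\|_{L_2(D)}$. Let $\hat\theta$ be the plug-in empirical risk minimizer. Then with probability at least $1 - \delta$,
$$\begin{aligned}
L_D(\hat\theta, \hat g) - L_D(\theta^\star, \hat g) &= O\bigg(\mathcal{R}_n(\ell\circ\Theta, r) + r\sqrt{\frac{\log(1/\delta)}{n}} + R\,\frac{\log(1/\delta)}{n}\bigg) \qquad (109) \\
&= O\bigg(R\cdot\kappa(\ell\circ\Theta, r/R) + R\,\frac{\mathcal{H}_2(\ell\circ\Theta, r, n)}{n} + r\sqrt{\frac{\log(1/\delta)}{n}} + R\,\frac{\log(1/\delta)}{n}\bigg).
\end{aligned}$$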

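Proof of Theorem 9. We prove the theorem for the case of $R = 1$. The statement for general $R$ follows by a standard rescaling argument. Applying the first part of Lemma 8 with $\mathcal{F} = \ell\circ\Theta$, $f^\star = \ell(\theta^\star(\cdot), \hat g(\cdot), \cdot)$, $L_f = f$, and $r = \sup_{f\in\mathcal{F}}\|f - f^\star\|_2$, we have that with probability at least $1 - \delta$,
$$\sup_{f\in\mathcal{F}}|\mathbb{P}_n(L_f - L_{f^\star}) - \mathbb{P}(L_f - L_{f^\star})| \le 16\,\mathcal{R}(\mathcal{F} - f^\star, r) + O\bigg(r\sqrt{\frac{\log(1/\delta)}{n}} + \frac{\log(1/\delta)}{n}\bigg).$$

Hence, the latter also holds for $\hat f = \ell(\hat\theta(\cdot), \hat g(\cdot), \cdot)$. Moreover by the definition of plug-in ERM, $\mathbb{P}_n(L_{\hat f} - L_{f^\star}) \le 0$. Thus we get
$$\mathbb{P}(L_{\hat f} - L_{f^\star}) \le 16\,\mathcal{R}(\mathcal{F} - f^\star, r) + O\bigg(r\sqrt{\frac{\log(1/\delta)}{n}} + \frac{\log(1/\delta)}{n}\bigg).$$

This proves the first part of the Theorem. We now analyze the quantity $\mathcal{R}(\mathcal{F} - f^\star, r)$ so as to provide the second part.

Let $\mathcal{R}_n(\mathcal{F}, \delta, z_{1:n}) = \mathbb{E}_\epsilon\big[\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^n\epsilon_if(z_i)\big]$ denote the empirical Rademacher complexity conditional on the dataset $z_{1:n}$. Thus $\mathcal{R}(\mathcal{F} - f^\star, r) = \mathbb{E}_{z_{1:n}}[\mathcal{R}_n(\mathcal{F} - f^\star, r, z_{1:n})]$. Let $r_n = \sup_{f\in\mathcal{F}}\|f - f^\star\|_{2,n}$, which is a random variable that depends on the dataset $z_{1:n}$. Invoking Lemma 14, we have
$$\mathcal{R}_n(\mathcal{F} - f^\star, r, z_{1:n}) \le \inf_{\alpha\ge0}\bigg[4\alpha + 10\underbrace{\int_\alpha^{r_n}\sqrt{\frac{\mathcal{H}_2(\mathcal{F} - f^\star, \varepsilon, n)}{n}}\,d\varepsilon}_{=:K(\alpha, r_n)}\bigg].$$

Thus, we have that

$$\mathcal{R}(\mathcal{F} - f^\star, r) \le \inf_{\alpha\ge0}\big[4\alpha + 10\,\mathbb{E}_{z_{1:n}}[K(\alpha, r_n)]\big].$$
Since metric entropy is a non-negative, non-increasing function of the precision $\varepsilon$, we have that $K(\alpha, \cdot)$ is a non-decreasing, concave function. Thus by Jensen's inequality,

$$\mathbb{E}_{z_{1:n}}[K(\alpha, r_n)] \le K(\alpha, \mathbb{E}_{z_{1:n}}[r_n]) \le K\Big(\alpha, \sqrt{\mathbb{E}_{z_{1:n}}[r_n^2]}\Big).$$

Moreover, by the definition of $r_n$ and invoking a standard symmetrization argument for uniform convergence and Lipschitz contraction (since elements of $\mathcal{F}$ are bounded by 1), we have
$$\mathbb{E}_{z_{1:n}}\big[r_n^2 - r^2\big] \le \mathbb{E}_{z_{1:n}}\bigg[\sup_{f\in\mathcal{F}}\Big(\|f - f^\star\|_{n,2}^2 - \|f - f^\star\|_2^2\Big)\bigg] \le 4\,\mathcal{R}(\mathcal{F} - f^\star, r).$$

Moreover, by concavity of $K(\alpha, \cdot)$ and an application of the AM-GM inequality, we also have

$$\begin{aligned}
K\Big(\alpha, \sqrt{\mathbb{E}_{z_{1:n}}[r_n^2]}\Big) &\le K\Big(\alpha, r + 2\sqrt{\mathcal{R}(\mathcal{F} - f^\star, r)}\Big) && \text{(non-decreasing $K(\alpha, \cdot)$)} \\
&\le K(\alpha, r) + \frac{\partial K(\alpha, r)}{\partial r}\cdot2\sqrt{\mathcal{R}(\mathcal{F} - f^\star, r)} && \text{(concavity of $K(\alpha, \cdot)$)} \\
&= K(\alpha, r) + \sqrt{\frac{\mathcal{H}_2(\mathcal{F} - f^\star, r, n)}{n}}\cdot2\sqrt{\mathcal{R}(\mathcal{F} - f^\star, r)} && \text{(definition of derivative)} \\
&\le K(\alpha, r) + 2\,\frac{\mathcal{H}_2(\mathcal{F} - f^\star, r, n)}{n} + \frac{1}{2}\mathcal{R}(\mathcal{F} - f^\star, r). && \text{(AM-GM)}
\end{aligned}$$
Thus we conclude that
$$\mathcal{R}(\mathcal{F} - f^\star, r) \le \inf_{\alpha\ge0}[4\alpha + 10K(\alpha, r)] + 2\,\frac{\mathcal{H}_2(\mathcal{F} - f^\star, r, n)}{n} + \frac{1}{2}\mathcal{R}(\mathcal{F} - f^\star, r).$$
Rearranging yields
$$\mathcal{R}(\mathcal{F} - f^\star, r) \le \inf_{\alpha\ge0}[8\alpha + 20K(\alpha, r)] + 4\,\frac{\mathcal{H}_2(\mathcal{F} - f^\star, r, n)}{n}.$$
Moreover, observe that if $\mathcal{F}'$ is an $\varepsilon$-cover of $\mathcal{F}$, then $\mathcal{F}' - f^\star$ is an $\varepsilon$-cover of $\mathcal{F} - f^\star$. Thus, $\mathcal{H}_2(\mathcal{F} - f^\star, \varepsilon, n) \le \mathcal{H}_2(\mathcal{F}, \varepsilon, n)$. This completes the proof of the theorem.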

H.3 Proof of Theorem 4

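We first prove a theorem about non-centered $\ell_2$-moment penalization, and then show that this implies Theorem 4.

Theorem 10 (Moment-Penalized Plug-In ERM). Consider the function class $\mathcal{F} = \{\ell(\theta(\cdot), \hat g(\cdot); \cdot) : \theta \in \Theta\}$, with $R := \sup_{f\in\mathcal{F}}\|f\|_{L_\infty(D)}$ and $f^\star := \ell(\theta^\star(\cdot), \hat g(\cdot); \cdot)$. Let $\delta_n \ge 0$ be any solution to the inequality
$$\mathcal{R}_n(\mathrm{star}(\mathcal{F} - f^\star), \delta) \le \frac{\delta^2}{R}, \tag{110}$$
and $\hat\theta$ be the second moment-penalized empirical risk minimizer,

$$\hat\theta = \arg\min_{\theta\in\Theta}\ L_{S_2}(\theta, \hat g) + 36\,\delta_n\|\ell(\theta(\cdot), \hat g(\cdot); \cdot)\|_{L_2(S_2)}.$$

Then with probability at least $1 - \delta$,

$$L_D(\hat\theta, \hat g) - L_D(\theta^\star, \hat g) = O\bigg(\sqrt{\mathbb{E}[\ell(\theta^\star(\cdot), \hat g(\cdot); \cdot)^2]}\cdot\bigg(\frac{\delta_n}{R} + \sqrt{\frac{\log(1/\delta)}{n}}\bigg) + \frac{\delta_n^2}{R} + R\,\frac{\log(1/\delta)}{n}\bigg).$$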
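The second moment-penalized objective in Theorem 10 can be illustrated on a finite class. The following sketch assumes the per-sample losses of each candidate are already tabulated (with the nuisance estimate plugged in); the function name `moment_penalized_erm` and the two-candidate table are hypothetical, and the constant 36 is taken from the estimator above.

```python
import math


def moment_penalized_erm(loss_table, delta_n, penalty=36.0):
    """Pick the candidate minimizing mean loss + penalty * delta_n * empirical L2 norm."""
    def objective(losses):
        n = len(losses)
        mean = sum(losses) / n
        l2_norm = math.sqrt(sum(v * v for v in losses) / n)
        return mean + penalty * delta_n * l2_norm

    return min(loss_table, key=lambda name: objective(loss_table[name]))


# Hypothetical per-sample losses: "B" has a slightly smaller mean but a larger second moment.
loss_table = {
    "A": [0.5, 0.5, 0.5, 0.5],
    "B": [0.0, 0.0, 0.0, 1.96],
}
plain_choice = moment_penalized_erm(loss_table, delta_n=0.0)
penalized_choice = moment_penalized_erm(loss_table, delta_n=0.01)
```

With `delta_n = 0` the rule reduces to plain ERM and selects `"B"`; with the penalty active, the candidate with the smaller empirical second moment, `"A"`, is selected.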

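Proof of Theorem 10. We first consider the case where $R = 1$. Applying Lemma 11, the fact that $\|f - f^\star\|_2 \le 2\|f - f^\star\|_{n,2} + \delta_n$ with probability $1 - c_7\exp(-c_8 n\delta_n^2)$ (via Theorem 4.1 of Wainwright (2019)) and the fact that $L_f$ is linear (it is the identity function), we get that for any $\delta_n \ge 0$ that satisfies the conditions of the theorem, with probability $1 - c_9\exp(-c_{10}n\delta_n^2)$,
$$|\mathbb{P}_n(L_f - L_{f^\star}) - \mathbb{P}(L_f - L_{f^\star})| \le 18\,\delta_n\{\|f - f^\star\|_2 + \delta_n\} \le 36\,\delta_n\{\|f - f^\star\|_{n,2} + \delta_n\} \quad \forall f \in \mathcal{F}.$$
Then we know that with probability at least $1 - c_9\exp(-c_{10}n\delta_n^2)$,
$$\begin{aligned}
\mathbb{P}(L_{\hat f} - L_{f^\star}) &\le \mathbb{P}_n(L_{\hat f} - L_{f^\star}) + 36\,\delta_n\|\hat f - f^\star\|_{n,2} + 36\,\delta_n^2 \\
&\le \mathbb{P}_n(L_{\hat f} - L_{f^\star}) + 36\,\delta_n\|\hat f\|_{n,2} + 36\,\delta_n\|f^\star\|_{n,2} + 36\,\delta_n^2 \\
&\le 72\,\delta_n\|f^\star\|_{n,2} + 36\,\delta_n^2,
\end{aligned}$$
where the last inequality follows by the definition of the moment penalized algorithm, since

$$\mathbb{P}_nL_{\hat f} + 36\,\delta_n\|\hat f\|_{n,2} \le \mathbb{P}_nL_{f^\star} + 36\,\delta_n\|f^\star\|_{n,2}.$$
Now, for general values of $R$, we may apply the reasoning above to the normalized class $\mathcal{F}/R$. In particular, the condition (110) implies that

$$\mathcal{R}_n(\mathrm{star}(\mathcal{F} - f^\star)/R,\ \delta_n/R) = \frac{1}{R}\mathcal{R}_n(\mathrm{star}(\mathcal{F} - f^\star), \delta_n) \le \frac{\delta_n^2}{R^2},$$
so we may take $\delta_n' := \delta_n/R$ as the critical radius for the normalized class. Then, the inequalities so far imply that with probability at least $1 - \delta$,

$$\frac{1}{R}\Big(L_D(\hat\theta, \hat g) - L_D(\theta^\star, \hat g)\Big) = O\bigg(\frac{1}{R}\sqrt{\mathbb{E}[\ell(\theta^\star(\cdot), \hat g(\cdot); \cdot)^2]}\cdot\bigg(\frac{\delta_n}{R} + \sqrt{\frac{\log(1/\delta)}{n}}\bigg) + \frac{\delta_n^2}{R^2} + \frac{\log(1/\delta)}{n}\bigg),$$
which is equivalent to the main result.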

Proof of Theorem 4. We now show how to deduce the variance-based bound of Theorem 4 from the second moment-based bound of Theorem 10. Observe that if the optimal loss $\mu^\star := L_D(\theta^\star, g_0)$ is zero then the two are equivalent. This motivates the following approach: if one has access to a good preliminary estimate $\hat\mu$ of the value $\mu^\star$, then using moment penalization one can always attain a bound that depends on $\|f^\star - \hat\mu\|_{L_2(D)} = \sqrt{\mathrm{Var}(f^\star)} + O(|\hat\mu - \mu^\star|)$. The latter is achieved by simply re-defining the function class $\mathcal{F}$ in Theorem 10 to be the re-centered class of losses: $\{\ell(\theta(\cdot), \hat g(\cdot), \cdot) - \hat\mu : \theta \in \Theta\}$. This leads to the algorithm in the theorem statement, which penalizes the centered second moment:

$$\hat\theta = \arg\min_{\theta\in\Theta}\ L_{S_2}(\theta, \hat g) + 36\,\delta_n\|\ell(\theta(\cdot), \hat g(\cdot); \cdot) - \hat\mu\|_{L_2(S_2)}. \tag{111}$$
If the error in the preliminary estimate is vanishing, i.e. $|\hat\mu - \mu^\star| =: \varepsilon_n \to 0$, then the impact of this error on the regret is only of second order, since the final regret bound will be of the form:

$$L_D(\hat\theta, \hat g) - L_D(\theta^\star, \hat g) = O\Big(\delta_n\sqrt{\mathrm{Var}(f^\star)} + \delta_n\varepsilon_n + \delta_n^2\Big) = O\Big(\delta_n\sqrt{\mathrm{Var}(f^\star)} + \varepsilon_n^2 + \delta_n^2\Big). \tag{112}$$

More precisely, using the bound from Theorem 10, we have

$$L_D(\hat\theta, \hat g) - L_D(\theta^\star, \hat g) = O\bigg(\Big(\sqrt{\mathrm{Var}[\ell(\theta^\star(\cdot), \hat g(\cdot); \cdot)]} + \varepsilon_n\Big)\cdot\bigg(\frac{\delta_n}{R} + \sqrt{\frac{\log(1/\delta)}{n}}\bigg) + \frac{\delta_n^2}{R} + R\,\frac{\log(1/\delta)}{n}\bigg),$$
which, using the AM-GM inequality, we can bound by

$$L_D(\hat\theta, \hat g) - L_D(\theta^\star, \hat g) = O\bigg(\sqrt{\mathrm{Var}[\ell(\theta^\star(\cdot), \hat g(\cdot); \cdot)]}\cdot\bigg(\frac{\delta_n}{R} + \sqrt{\frac{\log(1/\delta)}{n}}\bigg) + \frac{\delta_n^2}{R} + R\,\frac{\log(1/\delta)}{n} + \frac{\varepsilon_n^2}{R}\bigg).$$

We can bound the error of the estimate $\hat\mu$ using vanilla (non-localized) Rademacher complexity and two-sided uniform convergence arguments over the function class $\mathcal{F}$. In particular, using standard arguments, we can guarantee that with probability at least $1 - \delta$ over the draw of $S_3$, we have
$$\Big|\hat\mu - \inf_\theta L_D(\theta, \hat g)\Big| \le O\bigg(\mathcal{R}_n(\ell\circ\Theta) + R\sqrt{\frac{\log(1/\delta)}{n}}\bigg).$$

Finally, we observe that if $\gamma \mapsto \ell(\theta(z), \gamma; z)$ is $L$-Lipschitz for all $\theta$ and $z$, then we have $|L_D(\theta, \hat g) - L_D(\theta, g_0)| = O(L\|g_0 - \hat g\|_{\mathcal{G}})$ for any $\theta \in \Theta$. Hence, it must be the case that $|L_D(\theta^\star, \hat g) - \inf_\theta L_D(\theta, \hat g)| = O(L\|g_0 - \hat g\|_{\mathcal{G}})$. Similarly, we have
$$\big|\mathrm{Var}[\ell(\theta^\star(\cdot), \hat g(\cdot); \cdot)] - \mathrm{Var}[\ell(\theta^\star(\cdot), g_0(\cdot); \cdot)]\big| \le O(L\|g_0 - \hat g\|_{\mathcal{G}}),$$
leading to the final result (after another application of the AM-GM inequality).

H.4 Proof of Theorem 5

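Define the function class $\mathcal{F} = \{z \mapsto \ell(\theta(x), \hat g(w); z) : \theta \in \Theta\}$. We assume for now that $\|f\|_\infty \le 1$ for all $f \in \mathcal{F}$, that $\Gamma$ is 1-Lipschitz (i.e. $L = R = 1$ in the theorem statement) and that $\mathbb{E}[\Gamma^2(g_0, z) \mid x] \ge \gamma$; the general case will be handled by rescaling at the end of the proof.

Our starting point is to appeal to Theorem 10. In particular, let $\delta_n \ge 0$ be any solution to the inequality:
$$\mathcal{R}_n(\mathrm{star}(\mathcal{F} - f^\star), \delta) \le \delta^2, \tag{113}$$
where $f^\star(z) := \ell(\theta^\star(x), \hat g(w); z)$. Then if $\hat\theta$ is the outcome of variance-penalized ERM, we have that by Theorem 4, with probability at least $1 - \delta$,

$$L_D(\hat\theta, \hat g) - L_D(\theta^\star, \hat g) = O\bigg(\sqrt{V^\star}\bigg(\delta_n + \sqrt{\frac{\log(1/\delta)}{n}}\bigg) + \delta_n^2 + \frac{\log(1/\delta)}{n} + (\mathrm{Rate}_D(\mathcal{G}, S_1, \delta/2))^2 + \mathcal{R}_n^2(\mathcal{F})\bigg).$$

Furthermore, using Theorem 2, universal orthogonality implies that

$$L_D(\hat\theta, g_0) - L_D(\theta^\star, g_0) = O\bigg(\sqrt{V^\star}\bigg(\delta_n + \sqrt{\frac{\log(1/\delta)}{n}}\bigg) + \delta_n^2 + \frac{\log(1/\delta)}{n} + \beta(\mathrm{Rate}_D(\mathcal{G}, S_1, \delta/2))^2 + \mathcal{R}_n^2(\mathcal{F})\bigg).$$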

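Moving to capacity function at $g_0$. Per the discussion in the previous section, we focus on bounding the critical radius in the case where $\mathcal{F}$ is bounded by 1 and $\Gamma$ is 1-Lipschitz. We wish to make use of the capacity function, which is defined at $g_0$, but the local Rademacher complexity we need to bound is that of $\mathcal{F}$, which evaluates the weight function $\Gamma$ at $\hat g$. To make progress, we show how to use the capacity function defined in the theorem statement to bound the following "plug-in" variant:
$$\hat\tau^2(\varepsilon) = \frac{\mathbb{E}\sup_{\theta:\ \mathbb{E}[\Gamma(\hat g, z)^2(\theta(x) - \theta^\star(x))^2] \le \varepsilon^2}\Gamma^2(\hat g, z)(\theta(x) - \theta^\star(x))^2}{\varepsilon^2}.$$

We first show how to relate the $L_2$ norm at $\hat g$ to the $L_2$ norm at $g_0$. Define $\|\theta\|_\Theta = \sqrt{\mathbb{E}\,\Gamma(\hat g, z)^2(\theta(x) - \theta^\star(x))^2}$. Then for any $\theta \in \Theta$ we have
$$\begin{aligned}
\mathbb{E}\,\Gamma(\hat g, z)^2(\theta(x) - \theta^\star(x))^2 &\ge \frac{1}{2}\mathbb{E}\,\Gamma(g_0, z)^2(\theta(x) - \theta^\star(x))^2 - \mathbb{E}(\Gamma(\hat g, z) - \Gamma(g_0, z))^2(\theta(x) - \theta^\star(x))^2 \\
&= \frac{1}{2}\|\theta - \theta^\star\|_\Theta^2 - \mathbb{E}(\Gamma(\hat g, z) - \Gamma(g_0, z))^2(\theta(x) - \theta^\star(x))^2.
\end{aligned}$$
Using AM-GM and boundedness of $\theta$, for any $\eta > 0$ this is lower bounded by
$$\frac{1}{2}\|\theta - \theta^\star\|_\Theta^2 - \frac{1}{2\eta}\mathbb{E}(\Gamma(\hat g, z) - \Gamma(g_0, z))^4 - \frac{\eta}{2}\mathbb{E}(\theta(x) - \theta^\star(x))^2.$$
Using the Lipschitz assumption and conditional lower bound on $\Gamma$, we further lower bound by
$$\ge \frac{1}{2}\|\theta - \theta^\star\|_\Theta^2 - \frac{1}{2\eta}\|\hat g - g_0\|_{\mathcal{G}}^4 - \frac{\eta}{2\gamma}\|\theta - \theta^\star\|_\Theta^2.$$
Hence, by choosing $\eta = \gamma/2$ and rearranging, we get
$$\|\theta - \theta^\star\|_\Theta^2 \le 4\,\mathbb{E}\,\Gamma(\hat g, z)^2(\theta(x) - \theta^\star(x))^2 + \frac{4}{\gamma}\|\hat g - g_0\|_{\mathcal{G}}^4. \tag{114}$$
We proceed to bound the capacity function $\hat\tau$. Let $\varepsilon_0 = \frac{2}{\gamma^{1/2}}\|\hat g - g_0\|_{\mathcal{G}}^2$. Let $\varepsilon \ge \varepsilon_0$ be fixed and let $\theta \in \Theta$ be any policy with $\mathbb{E}[\Gamma(\hat g, z)^2(\theta(x) - \theta^\star(x))^2] \le \varepsilon^2$. Then equation (114) implies that $\|\theta - \theta^\star\|_\Theta^2 \le 5\varepsilon^2$, and so
$$\hat\tau^2(\varepsilon) \le \frac{\mathbb{E}\sup_{\theta:\ \|\theta - \theta^\star\|_\Theta^2 \le 5\varepsilon^2}\Gamma^2(\hat g, z)(\theta(x) - \theta^\star(x))^2}{\varepsilon^2}.$$
To handle the term in the numerator we proceed similarly to the proof of (114). We have
$$\begin{aligned}
&\mathbb{E}\sup_{\theta:\ \|\theta - \theta^\star\|_\Theta^2 \le 5\varepsilon^2}\Gamma^2(\hat g, z)(\theta(x) - \theta^\star(x))^2 \\
&\quad\le 2\,\mathbb{E}\sup_{\theta:\ \|\theta - \theta^\star\|_\Theta^2 \le 5\varepsilon^2}\Gamma^2(g_0, z)(\theta(x) - \theta^\star(x))^2 + 2\,\mathbb{E}\sup_{\theta:\ \|\theta - \theta^\star\|_\Theta^2 \le 5\varepsilon^2}(\Gamma(\hat g, z) - \Gamma(g_0, z))^2(\theta(x) - \theta^\star(x))^2.
\end{aligned}$$
Fix any $\eta > 0$. We use AM-GM and boundedness of policies to upper bound the second term as
$$\begin{aligned}
&\mathbb{E}\sup_{\theta:\ \|\theta - \theta^\star\|_\Theta^2 \le 5\varepsilon^2}(\Gamma(\hat g, z) - \Gamma(g_0, z))^2(\theta(x) - \theta^\star(x))^2 \\
&\quad\le \frac{1}{\eta}\mathbb{E}(\Gamma(\hat g, z) - \Gamma(g_0, z))^4 + \eta\,\mathbb{E}_x\sup_{\theta:\ \|\theta - \theta^\star\|_\Theta^2 \le 5\varepsilon^2}(\theta(x) - \theta^\star(x))^2 \\
&\quad\le \frac{1}{\eta}\|\hat g - g_0\|_{\mathcal{G}}^4 + \frac{\eta}{\gamma}\mathbb{E}_x\sup_{\theta:\ \|\theta - \theta^\star\|_\Theta^2 \le 5\varepsilon^2}\mathbb{E}\big[\Gamma^2(g_0, z) \mid x\big](\theta(x) - \theta^\star(x))^2 \\
&\quad\le \frac{1}{\eta}\|\hat g - g_0\|_{\mathcal{G}}^4 + \frac{\eta}{\gamma}\mathbb{E}\sup_{\theta:\ \|\theta - \theta^\star\|_\Theta^2 \le 5\varepsilon^2}\Gamma^2(g_0, z)(\theta(x) - \theta^\star(x))^2.
\end{aligned}$$

We choose $\eta = \gamma$ and recall the definition of $\varepsilon_0$, which gives

$$\le \varepsilon_0^2/4 + \mathbb{E}\sup_{\theta:\ \|\theta - \theta^\star\|_\Theta^2 \le 5\varepsilon^2}\Gamma^2(g_0, z)(\theta(x) - \theta^\star(x))^2.$$
Putting everything together, we get

$$\hat\tau^2(\varepsilon) \le \frac{4\cdot\mathbb{E}\sup_{\theta:\ \|\theta - \theta^\star\|_\Theta^2 \le 5\varepsilon^2}\Gamma^2(g_0, z)(\theta(x) - \theta^\star(x))^2 + \varepsilon_0^2/2}{\varepsilon^2}.$$

Thus, for all $\varepsilon \ge \varepsilon_0$ we have
$$\hat\tau^2(\varepsilon) \le 20\cdot\tau^2(5\varepsilon) + 3. \tag{115}$$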

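Bounding the critical radius. For any class $\mathcal{F}$ we define $\mathcal{F}_\delta = \{f \in \mathcal{F} : \|f\|_2 \le \delta\}$. We work with the following empirical version of the local Rademacher complexity

$$\mathcal{R}_n(\mathcal{F}, \delta, z_{1:n}) = \mathbb{E}_\epsilon\bigg[\sup_{f\in\mathcal{F}_\delta}\frac{1}{n}\sum_{i=1}^n\epsilon_if(z_i)\bigg], \tag{116}$$
which has $\mathcal{R}_n(\mathcal{F}, \delta) = \mathbb{E}_{z_{1:n}}[\mathcal{R}_n(\mathcal{F}, \delta, z_{1:n})]$. Let the draw of $z_{1:n}$ be fixed. Invoking Lemma 14, we have
$$\mathcal{R}_n(\mathrm{star}(\mathcal{F} - f^\star), \delta, z_{1:n}) \le \inf_{\alpha\ge0}\bigg[4\alpha + 10\int_\alpha^{\sup_{h\in\mathrm{star}(\mathcal{F} - f^\star)_\delta}\|h\|_n}\sqrt{\frac{\mathcal{H}_2(\mathrm{star}(\mathcal{F} - f^\star)_\delta, \varepsilon, z_{1:n})}{n}}\,d\varepsilon\bigg].$$

Using that any $h \in \mathrm{star}(\mathcal{F} - f^\star)_\delta$ can be written as $r\cdot(f - f^\star)$, where $\|f - f^\star\|_2 \le \delta$ and $r \in [0, 1]$, a simple discretization argument (cf. proof of Lemma 5) shows that
$$\mathcal{H}_2(\mathrm{star}(\mathcal{F} - f^\star)_\delta, \varepsilon, z_{1:n}) \le \mathcal{H}_2((\mathcal{F} - f^\star)_\delta, \varepsilon/2, z_{1:n}) + \log\Big(2\sup_{h\in(\mathcal{F} - f^\star)_\delta}\|h\|_n/\varepsilon\Big).$$

Let us adopt the shorthand $v_n = \sup_{h\in(\mathcal{F} - f^\star)_\delta}\|h\|_n$. It follows from the usual symmetrization argument that $\mathbb{E}\,v_n^2 \le \delta^2 + 2\mathcal{R}_n(\mathcal{F} - f^\star, \delta)$. Letting $\alpha = 0$ be fixed, we can summarize our argument so far as

$$\mathcal{R}_n(\mathrm{star}(\mathcal{F} - f^\star), \delta, z_{1:n}) \le 10\int_0^{v_n}\sqrt{\frac{\mathcal{H}_2((\mathcal{F} - f^\star)_\delta, \varepsilon/2, z_{1:n})}{n}}\,d\varepsilon + 10\int_0^{v_n}\sqrt{\log(2v_n/\varepsilon)/n}\,d\varepsilon.$$
Furthermore, using a change of variables we have

$$\int_0^{v_n}\sqrt{\log(2v_n/\varepsilon)/n}\,d\varepsilon \le v_n\int_0^1\sqrt{\log(2/\varepsilon)/n}\,d\varepsilon \le C\cdot\frac{v_n}{\sqrt{n}}.$$
We now handle the covering integral for $(\mathcal{F} - f^\star)_\delta$. Let
$$\Theta(\delta) = \big\{\theta \in \Theta \mid \mathbb{E}\,\Gamma^2(\hat g, z)(\theta(x) - \theta^\star(x))^2 \le \delta^2\big\}.$$
Our approach is to upper bound the empirical $L_2$ covering number for $(\mathcal{F} - f^\star)_\delta$ by the covering number of the class $\Theta(\delta)$ with respect to Hamming error. Let the Hamming error on a set $S' = \{x_1', \ldots, x_M'\}$ be defined via

$$d_{H,S'}(\theta, \theta') = \frac{1}{M}\sum_{i=1}^M\mathbb{1}\{\theta(x_i') \ne \theta'(x_i')\}.$$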

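We claim that there is a choice for the dataset $S' = x_1', \ldots, x_M'$ such that for all $h, h' \in (\mathcal{F} - f^\star)_\delta$, the empirical $L_2$ error on $S_2$ is upper bounded by the empirical Hamming error of associated policies $\theta, \theta'$ on $S'$.

Let $h = (f - f^\star)$ and $h' = (f' - f^\star)$ be fixed elements of $(\mathcal{F} - f^\star)_\delta$, and let $\theta, \theta' \in \Theta(\delta)$ be such that $f(z) = \Gamma(\hat g, z)(\theta(x) - \theta^\star(x))$ and likewise for $f'$ and $\theta'$. Define $\Gamma_i = \Gamma(\hat g, z_i)$, and take $S'$ to contain $m_i := \Big\lceil\frac{\sup_{\theta\in\Theta(\delta)}\Gamma_i^2(\theta(x_i) - \theta^\star(x_i))^2}{\delta^2}\Big\rceil$ copies of example $x_i$ for each $i$. With this choice, we have

$$\begin{aligned}
d_{H,S'}(\theta, \theta') &= \frac{1}{M}\sum_{i=1}^M\mathbb{1}\{\theta(x_i') \ne \theta'(x_i')\} \\
&= \frac{1}{M}\sum_{i=1}^n\bigg\lceil\frac{\sup_{\tilde\theta\in\Theta(\delta)}\Gamma_i^2(\tilde\theta(x_i) - \theta^\star(x_i))^2}{\delta^2}\bigg\rceil\mathbb{1}\{\theta(x_i) \ne \theta'(x_i)\} \\
&\ge \frac{1}{2M}\sum_{i=1}^n\frac{\Gamma_i^2(\theta(x_i) - \theta^\star(x_i))^2 + \Gamma_i^2(\theta'(x_i) - \theta^\star(x_i))^2}{\delta^2}\mathbb{1}\{\theta(x_i) \ne \theta'(x_i)\} \\
&\ge \frac{1}{4M}\sum_{i=1}^n\frac{\Gamma_i^2(\theta(x_i) - \theta'(x_i))^2}{\delta^2}\mathbb{1}\{\theta(x_i) \ne \theta'(x_i)\} \\
&= \frac{1}{4M}\sum_{i=1}^n\frac{\Gamma_i^2(\theta(x_i) - \theta'(x_i))^2}{\delta^2} \\
&= \frac{n}{4M\delta^2}\|h - h'\|_{n,2}^2.
\end{aligned}$$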
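The replication argument above can be checked on a toy instance. The sketch below is an illustration under simplifying assumptions: a small finite class of binary policies stands in for $\Theta(\delta)$ (the membership constraint is not enforced), and the weights `gammas` play the role of $\Gamma_i$. All names and numbers are hypothetical.

```python
import math
from itertools import combinations


def hamming_vs_l2(policies, theta_star, gammas, delta):
    """Compare replicated Hamming error with (n / (4 M delta^2)) * ||h - h'||_n^2."""
    n = len(gammas)
    # m_i = ceil( sup_theta Gamma_i^2 (theta(x_i) - theta*(x_i))^2 / delta^2 ) copies of x_i.
    copies = [
        math.ceil(
            max(gammas[i] ** 2 * (p[i] - theta_star[i]) ** 2 for p in policies)
            / delta ** 2
        )
        for i in range(n)
    ]
    total = sum(copies)  # M, the size of the replicated set S'
    pairs = []
    for p, q in combinations(policies, 2):
        d_hamming = sum(copies[i] for i in range(n) if p[i] != q[i]) / total
        l2_sq = sum(gammas[i] ** 2 * (p[i] - q[i]) ** 2 for i in range(n)) / n
        pairs.append((d_hamming, n / (4 * total * delta ** 2) * l2_sq))
    return total, pairs


policies = [(0, 0, 1, 1), (1, 0, 1, 0), (0, 1, 1, 1)]  # toy policies on n = 4 points
theta_star = (0, 0, 1, 0)
gammas = [0.5, 1.0, 0.8, 0.3]
M, pairs = hamming_vs_l2(policies, theta_star, gammas, delta=0.5)
```

For every pair, the first entry (Hamming error on the replicated set) dominates the second (rescaled squared empirical $L_2$ distance), as the chain of inequalities predicts.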

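Thus, if we let $\varepsilon' = \frac{n}{4M\delta^2}\varepsilon^2$, then any $\varepsilon'$-cover in Hamming error is an $\varepsilon$-cover in $L_2$. Now define

$$u_n^2 = \frac{1}{n}\sum_{i=1}^n\sup_{\theta\in\Theta(\delta)}\Gamma_i^2(\theta(x_i) - \theta^\star(x_i))^2,$$

and note that $u_n^2 \ge v_n^2$ by definition. We invoke the following facts.
• $M \le n + \sum_{i=1}^n\frac{\sup_{\theta\in\Theta(\delta)}\Gamma_i^2(\theta(x_i) - \theta^\star(x_i))^2}{\delta^2} = n(1 + u_n^2/\delta^2)$.
• Haussler's bound (Haussler, 1995) implies that any class with VC dimension $d$ admits an $\varepsilon$-Hamming error cover of size $e(d + 1)\big(\frac{2e}{\varepsilon}\big)^d$.
Putting everything together, we have

$$\int_0^{v_n}\sqrt{\frac{\mathcal{H}_2((\mathcal{F} - f^\star)_\delta, \varepsilon/2, z_{1:n})}{n}}\,d\varepsilon \le \int_0^{v_n}\sqrt{\frac{d\log(2e(\delta^2 + u_n^2)/\varepsilon^2)}{n}}\,d\varepsilon + C\cdot v_n\sqrt{\log d/n}.$$

It follows from the usual symmetrization argument that $\mathbb{E}[v_n^2] \le \delta^2 + 2\mathcal{R}_n(\mathcal{F} - f^\star, \delta)$. Furthermore, using the Talagrand-type concentration bound in (87) (we use the assumed boundedness of elements of $\mathcal{F}$ to simplify (87) to the form that appears on this page), there exists a constant $C \ge 1$ such that for any $s > 0$, with probability at least $1 - e^{-s}$ over the draw of $z_{1:n}$,
$$v_n^2 \le C\Big(\delta^2 + \mathcal{R}_n(\mathcal{F} - f^\star, \delta) + \frac{s}{n}\Big) =: \tilde\delta^2.$$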

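Thus, conditioning on this event, we have

$$\begin{aligned}
\int_0^{v_n}\sqrt{\frac{d\log(2e(\delta^2 + u_n^2)/\varepsilon^2)}{n}}\,d\varepsilon &\le \int_0^{\tilde\delta}\sqrt{\frac{d\log(2e(\delta^2 + u_n^2)/\varepsilon^2)}{n}}\,d\varepsilon \\
&\le \tilde\delta\int_0^1\sqrt{\frac{d\log(2e(1 + u_n^2/\delta^2)/\varepsilon^2)}{n}}\,d\varepsilon \\
&\le C\cdot\tilde\delta\sqrt{\frac{d\log(2e(1 + u_n^2/\delta^2))}{n}},
\end{aligned}$$
where the second inequality uses a change of variables and that $\delta \le \tilde\delta$. Now, to summarize our developments so far, we have shown that with probability at least $1 - e^{-s}$,

$$\begin{aligned}
\mathcal{R}_n(\mathrm{star}(\mathcal{F} - f^\star), \delta, z_{1:n}) &\le C\bigg(\tilde\delta\sqrt{\frac{d\log(2e(1 + u_n^2/\delta^2))}{n}} + \tilde\delta\sqrt{\log d/n}\bigg) \\
&\le C\bigg(\tilde\delta\sqrt{\frac{d\log(2e(1 + u_n^2/\delta^2))}{n}}\bigg) \\
&= C\bigg(\Big(\delta + \sqrt{\mathcal{R}_n(\mathcal{F} - f^\star, \delta)} + \sqrt{s/n}\Big)\sqrt{\frac{d\log(2e(1 + u_n^2/\delta^2))}{n}}\bigg).
\end{aligned}$$

Using Markov's inequality, we have that with probability at least $1 - e^{-s}$, $u_n^2 \le \mathbb{E}[u_n^2]\cdot e^s$. Thus, by union bound, with probability at least $1 - 2e^{-s}$,

$$\mathcal{R}_n(\mathrm{star}(\mathcal{F} - f^\star), \delta, z_{1:n}) \le C\bigg(\sqrt{s}\Big(\delta + \sqrt{\mathcal{R}_n(\mathcal{F} - f^\star, \delta)} + \sqrt{s/n}\Big)\sqrt{\frac{d\log(2e^2(1 + \mathbb{E}_{z_{1:n}}[u_n^2]/\delta^2))}{n}}\bigg).$$

Integrating out this tail bound, we get that

$$\mathcal{R}_n(\mathrm{star}(\mathcal{F} - f^\star), \delta) \le C\bigg(\Big(\delta + \sqrt{\mathcal{R}_n(\mathcal{F} - f^\star, \delta)} + \sqrt{1/n}\Big)\sqrt{\frac{d\log(2e^2(1 + \mathbb{E}_{z_{1:n}}[u_n^2]/\delta^2))}{n}}\bigg).$$

Using AM-GM and that $\mathcal{R}_n(\mathcal{F} - f^\star, \delta) \le \mathcal{R}_n(\mathrm{star}(\mathcal{F} - f^\star), \delta)$, then rearranging, this implies
$$\mathcal{R}_n(\mathrm{star}(\mathcal{F} - f^\star), \delta) \le C\bigg(\delta\sqrt{\frac{d\log(2e^2(1 + \mathbb{E}_{z_{1:n}}[u_n^2]/\delta^2))}{n}} + \frac{d\log(2e(1 + \mathbb{E}_{z_{1:n}}[u_n^2]/\delta^2))}{n}\bigg).$$

We now bound the ratio $\mathbb{E}_{z_{1:n}}[u_n^2]/\delta^2$. We have

$$\begin{aligned}
\mathbb{E}_{z_{1:n}}[u_n^2] &= \frac{1}{n}\sum_{i=1}^n\mathbb{E}_{z_i}\sup_{\theta\in\Theta(\delta)}\Gamma^2(\hat g, z_i)(\theta(x_i) - \theta^\star(x_i))^2 = \mathbb{E}_z\sup_{\theta\in\Theta(\delta)}\Gamma^2(\hat g, z)(\theta(x) - \theta^\star(x))^2 \\
&\le \hat\tau^2(\delta)\,\delta^2.
\end{aligned}$$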

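Thus, using the relationship between $\tau$ and $\hat\tau$ established in the previous section of the proof, if $\delta > \varepsilon_0$ we have
$$\mathbb{E}_{z_{1:n}}[u_n^2]/\delta^2 \le 20\cdot\tau^2(5\delta) + 3,$$
and so for all $\delta > \varepsilon_0$,
$$\mathcal{R}_n(\mathrm{star}(\mathcal{F} - f^\star), \delta) \le C\bigg(\delta\sqrt{\frac{d\log\tau(5\delta)}{n}} + \frac{d\log\tau(5\delta)}{n}\bigg).$$

In particular, we can see from this expression that taking
$$\delta_n \propto \sqrt{\frac{d\log\tau_0}{n}} + \varepsilon_0$$
yields a valid upper bound on the critical radius.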
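As a rough numerical companion to the critical radius choice above, the sketch below evaluates $\sqrt{d\log\tau_0/n} + \varepsilon_0$ with $\varepsilon_0 = 2\gamma^{-1/2}\|\hat g - g_0\|_{\mathcal{G}}^2$. The proportionality constant is set to 1 for illustration, $\tau_0 > 1$ is assumed so that the logarithm is positive, and the function and argument names are hypothetical.

```python
import math


def critical_radius_bound(n, vc_dim, tau0, nuisance_err_sq, gamma, const=1.0):
    """Evaluate const * sqrt(d log(tau0) / n) + eps0, with eps0 = 2 * gamma^{-1/2} * ||g_hat - g0||^2."""
    eps0 = 2.0 / math.sqrt(gamma) * nuisance_err_sq
    return const * math.sqrt(vc_dim * math.log(tau0) / n) + eps0


# With an exact nuisance (zero error), the radius scales as n^{-1/2}.
r_small = critical_radius_bound(n=1_000, vc_dim=10, tau0=5.0, nuisance_err_sq=0.0, gamma=0.25)
r_large = critical_radius_bound(n=4_000, vc_dim=10, tau0=5.0, nuisance_err_sq=0.0, gamma=0.25)
# A nuisance error of order n^{-1/2} in squared norm adds a term of the same order.
r_noisy = critical_radius_bound(n=4_000, vc_dim=10, tau0=5.0, nuisance_err_sq=0.01, gamma=0.25)
```

The nuisance error enters only through $\varepsilon_0$, so it is negligible whenever $\|\hat g - g_0\|_{\mathcal{G}}^2$ decays faster than $n^{-1/2}$.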

Final bound. Putting together the excess risk bound and the critical radius bound, we have

$$\begin{aligned}
L_D(\hat\theta, g_0) - L_D(\theta^\star, g_0) &= O\bigg(\sqrt{V^\star}\bigg(\delta_n + \sqrt{\frac{\log(1/\delta)}{n}}\bigg) + \delta_n^2 + \frac{\log(1/\delta)}{n} + (1 + \beta)(\mathrm{Rate}_D(\mathcal{G}, S_1, \delta/2))^2 + \mathcal{R}_n^2(\mathcal{F})\bigg) \\
&\le O\bigg(\sqrt{\frac{V^\star d\log(\tau_0/\delta)}{n}} + \frac{d\log(\tau_0/\delta)}{n} + (\gamma^{-1/2} + \beta)(\mathrm{Rate}_D(\mathcal{G}, S_1, \delta/2))^2 + \mathcal{R}_n^2(\mathcal{F})\bigg).
\end{aligned}$$

To handle the square Rademacher complexity term, we recall that since $|\Gamma| \le 1$, the main result of Haussler (1995) implies that $\mathcal{R}_n(\mathcal{F}) \le \sqrt{\frac{d}{n}}$; since this term is squared in the final bound, its contribution is of lower order. To deduce the final bound in the general $R$-bounded $L$-Lipschitz case we divide the class by $(L + R)$, then rescale the final bound (observing that $\beta$, $\gamma$, and $V^\star$ all vary appropriately under the rescaling).

I Proofs from Section 5 and 6

I.1 Notation

We state the guarantees in this section using slightly more refined notation than in Section 5 and Section 6. In particular, we consider both the case where the target and nuisance classes are parametric and where they are nonparametric. For the nuisance parameter class G, the two cases we consider are:

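a) Parametric case. There exists a constant $d_1$ such that

$$\mathcal{H}_2(\mathcal{G}|_t, \varepsilon, n) \le O(d_1\log(1/\varepsilon)) \quad \forall t.$$

b) Nonparametric case. There exists a constant $p_1$ such that

$$\mathcal{H}_2(\mathcal{G}|_t, \varepsilon, n) \le O(\varepsilon^{-p_1}) \quad \forall t.$$

Likewise, for the target class, the two cases we consider are:

a) Parametric case. There exists a constant $d_2$ such that

$$\mathcal{H}_2(\Theta, \varepsilon, n) \le O(d_2\log(1/\varepsilon)).$$

b) Nonparametric case. There exists a constant $p_2$ such that

$$\mathcal{H}_2(\Theta, \varepsilon, n) \le O(\varepsilon^{-p_2}).$$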

I.2 Preliminaries

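Definition 5 (Empirical Rademacher Complexity). For a real-valued function class $\mathcal{F}$ and sample set $S = z_1, \ldots, z_n$, the Rademacher complexity is defined via
$$\mathcal{R}_n(\mathcal{F}, S) = \mathbb{E}_\epsilon\bigg[\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^n\epsilon_if(z_i)\bigg], \tag{117}$$
where $\epsilon = \epsilon_1, \ldots, \epsilon_n$ are i.i.d. Rademacher random variables. We define $\mathcal{R}_n(\mathcal{F}) = \sup_{S\in\mathcal{Z}^n}\mathcal{R}_n(\mathcal{F}, S)$.

We require the following technical lemmas. First is the Dudley entropy integral bound; we use the following form from Srebro et al. (2010) (Lemma A.3). In all results that follow, we use $C > 0$ to denote an absolute constant whose value may change from line to line.

Lemma 14. For any real-valued function class $\mathcal{F} \subseteq (\mathcal{Z} \to \mathbb{R})$, we have
$$\mathcal{R}_n(\mathcal{F}, S) \le \inf_{\alpha>0}\bigg\{4\alpha + 10\int_\alpha^{\sup_{f\in\mathcal{F}}\sqrt{\mathbb{E}_nf^2(z)}}\sqrt{\frac{\mathcal{H}_2(\mathcal{F}, \varepsilon, S)}{n}}\,d\varepsilon\bigg\}. \tag{118}$$

As a consequence, whenever $\mathcal{F}$ takes values in $[-1, +1]$ the following bounds hold:
• If $\mathcal{H}_2(\mathcal{F}, \varepsilon, S) \le O(\varepsilon^{-p})$, then $\mathcal{R}_n(\mathcal{F}, S) \le r_{n,p}$, where $r_{n,p}$ satisfies

$$r_{n,p} \le C_p\cdot\begin{cases} n^{-\frac{1}{2}}, & p < 2. \\ n^{-\frac{1}{2}}\cdot\log n, & p = 2. \\ n^{-\frac{1}{p}}, & p > 2. \end{cases}$$

• If $\mathcal{H}_2(\mathcal{F}, \varepsilon, S) \le O(d\log(1/\varepsilon))$, then $\mathcal{R}_n(\mathcal{F}, S) \le C\cdot\sqrt{d/n}$.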
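The case analysis following Lemma 14 can be tabulated with a small helper. This is a sketch only: the constants $C_p$ and $C$ are set to 1, and the function name `rademacher_rate` and its signature are hypothetical.

```python
import math


def rademacher_rate(n, p=None, d=None):
    """Rate of R_n(F, S) implied by Lemma 14, up to constants (set to 1 here).

    Pass d for the parametric case H_2 <= O(d log(1/eps)),
    or p for the nonparametric case H_2 <= O(eps^{-p}).
    """
    if d is not None:
        return math.sqrt(d / n)
    if p < 2:
        return n ** -0.5
    if p == 2:
        return n ** -0.5 * math.log(n)
    return n ** (-1.0 / p)


rates = {
    "p=1": rademacher_rate(10_000, p=1),
    "p=2": rademacher_rate(10_000, p=2),
    "p=4": rademacher_rate(10_000, p=4),
    "d=4": rademacher_rate(10_000, d=4),
}
```

The table makes the phase transition at $p = 2$ visible: below it the rate is the parametric-type $n^{-1/2}$, above it the rate degrades to $n^{-1/p}$.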

We also require the following lemma, which controls the rate at which the empirical $L_2$ metric converges to the population $L_2$ metric in terms of metric entropy behavior.

Lemma 15. Let $\mathcal{F} \subseteq (\mathcal{Z} \to [-1, +1])$, and let $S = z_{1:n}$ be a collection of samples in $\mathcal{Z}$ drawn i.i.d. from $D$.
• If $\mathcal{H}_2(\mathcal{F}, \varepsilon, n) \le O(\varepsilon^{-p})$ for some $p$, then with probability at least $1 - \delta$, for all $f, f' \in \mathcal{F}$,

$$\|f - f'\|_{L_2(D)}^2 \le 2d_S^2(f, f') + R_{n,p} + C\,\frac{\log(\log n/\delta)}{n},$$
where
$$R_{n,p} \le C_p\cdot\begin{cases} n^{-1}\log^3n, & p < 2. \\ n^{-1}\log^4n, & p = 2. \\ n^{-\frac{2}{p}}\log^3n, & p > 2. \end{cases}$$

• If $\mathcal{H}_2(\mathcal{F}, \varepsilon, n) \le O(d\log(1/\varepsilon))$, then with probability at least $1 - \delta$, for all $f, f' \in \mathcal{F}$,

$$\|f - f'\|_{L_2(D)}^2 \le 2d_S^2(f, f') + C\,\frac{d\log(en/d) + \log(\log n/\delta)}{n}.$$

Proof of Lemma 15. Using Lemma 8 and Lemma 9 from Rakhlin et al. (2017), it holds that with probability at least $1 - 4\delta$, for all $f, f' \in \mathcal{F}$, we have

$$\|f - f'\|_{L_2(D)}^2 \le 2d_S^2(f, f') + C(r^\star + \beta),$$

where $\beta = (\log(1/\delta) + \log\log n)/n$, and where $r^\star \le d_2\log(en/d_2)$ in the parametric case, and $r^\star \le \mathcal{R}_n^2(\mathcal{F})\log^3(n)$ in the general case. The final result follows by applying the Rademacher complexity bounds from Lemma 14.

Remark 2. Technically, the result in Rakhlin et al. (2017) we appeal to in the proof above is stated for $[0, 1]$-valued classes, but it may be applied to our $[-1, +1]$-valued setting by shifting and rescaling the class $\mathcal{F}$ (i.e., invoking with $\mathcal{F}' := (\mathcal{F} + 1)/2$). We appeal to the same reasoning throughout this section, shifting regression targets in the same fashion when necessary.

I.3 Overview of Proofs

We now sketch the high-level approach behind the main results in Section 5 and Section 6. The idea is to use out-of-the-box learning algorithms for both the nuisance and target stage. However, which algorithm gives an optimal rate will depend on the complexity of G and Θ. Moreover, some of the algorithms we employ for the target class require new analyses based on orthogonality to bound the error due to nuisance parameter estimation.

First stage. Based on our assumptions on the metric entropy, we can obtain rates by appealing to the following generic algorithms.

• Global ERM: For each $t$, select

\[ \hat g_t \in \arg\min_{g \in \mathcal{G}|_t} \sum_{i=1}^{n} \big((u_t)_i - g(w_i)\big)^2. \]

• Skeleton Aggregation/Aggregation of $\varepsilon$-nets (Yang and Barron (1999); Rakhlin et al. (2017); see Appendix I.4 for a formal description): For each $t$, run the Skeleton Aggregation algorithm with the class $\mathcal{G}|_t$ on the dataset of instance-target pairs $(w_1, (u_t)_1), \ldots, (w_n, (u_t)_n)$. Let $(\hat g)_t$ be the result.

Proposition 3 (Rates for first stage, informal). Suppose that Assumption 8 or Assumption 10 holds. Then Global ERM guarantees that with probability at least $1 - \delta$,

\[ \|\hat g - g_0\|^2_{L_2(\ell_2, \mathcal{D})} \le \begin{cases} \tilde O\big(K_1 d_1 \log(en/d_1) \cdot n^{-1}\big), & \text{Parametric case,} \\ \tilde O\big(K_1 n^{-\left(\frac{2}{2+p_1} \wedge \frac{1}{p_1}\right)}\big), & \text{Nonparametric case.} \end{cases} \]

Skeleton Aggregation guarantees that with probability at least $1 - \delta$,

\[ \|\hat g - g_0\|^2_{L_2(\ell_2, \mathcal{D})} \le \begin{cases} \tilde O\big(K_1 d_1 \log(en/d_1) \cdot n^{-1}\big), & \text{Parametric case,} \\ \tilde O\big(K_1 n^{-\frac{2}{2+p_1}}\big), & \text{Nonparametric case.} \end{cases} \]

Here the $\tilde O$ notation suppresses $\log n$, $\log(K_1)$, and $\log(\delta^{-1})$ factors. A precise version of Proposition 3 and a detailed description of the algorithms are given in Appendix I.5.

Note that the minimax rate is $\Omega(n^{-\frac{2}{2+p_1}})$ (Yang and Barron, 1999), and so Skeleton Aggregation is optimal for all values of $p_1$, while Global ERM is optimal only for $p_1 \le 2$. While these are not the only algorithms in the literature for which we have generic guarantees based on metric entropy (other choices include Star Aggregation (Liang et al., 2015) and Aggregation-of-Leaders (Rakhlin et al., 2017)), they suffice for our goal in this section, which is to characterize the spectrum of admissible rates.

In all applications we study, the dimension K1 is constant. Nevertheless, studying procedures that jointly learn all output dimensions of G and, in particular, deriving the correct statistical complexity when K1 is large is an interesting direction for future research and may be practically useful.
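The comparison between the two first-stage procedures in the nonparametric case reduces to comparing rate exponents. The sketch below is illustrative only: the helper names are hypothetical, and it drops $K_1$ and all logarithmic factors, keeping just the exponent $a$ in a rate of the form $n^{-a}$.

```python
from fractions import Fraction

def first_stage_exponents(p1):
    """Return (erm, skeleton): exponents a such that the squared
    nuisance error scales as n^{-a} in the nonparametric case
    (constants, K1 and log factors are ignored in this sketch)."""
    p = Fraction(p1)
    skeleton = 2 / (2 + p)          # matches the minimax exponent 2/(2+p1)
    erm = min(skeleton, 1 / p)      # Global ERM: (2/(2+p1)) ∧ (1/p1)
    return erm, skeleton

def erm_is_optimal(p1):
    """Global ERM matches Skeleton Aggregation exactly when p1 <= 2."""
    erm, skeleton = first_stage_exponents(p1)
    return erm == skeleton
```

For example, with $p_1 = 4$ the exponents are $1/4$ for Global ERM and $1/3$ for Skeleton Aggregation, so ERM is strictly slower; the two coincide because $\frac{2}{2+p_1} \le \frac{1}{p_1}$ holds if and only if $p_1 \le 2$.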

Second stage. The idea behind the second-stage rates we provide is that the problem of obtaining a target predictor for the second stage can be solved by reducing to the classical square loss regression setting. We map our setting onto square loss regression by defining auxiliary variables X = w and Y = Γ(g0(w), w), and by defining auxiliary predictor classes

\[ \mathcal{F}_0 = \{X \mapsto \langle \Lambda(g_0(w), w), \theta(x)\rangle \mid \theta \in \Theta\}, \qquad \hat{\mathcal{F}} = \{X \mapsto \langle \Lambda(\hat g(w), w), \theta(x)\rangle \mid \theta \in \Theta\}. \]

With these definitions, our goal to bound the excess risk in, e.g., Theorem 6 can be equivalently stated as producing a predictor $\hat f \in \mathcal{F}_0$ that enjoys a bound on

\[ \mathbb{E}\big(\hat f(X) - Y\big)^2 - \inf_{f \in \mathcal{F}_0} \mathbb{E}\big(f(X) - Y\big)^2, \tag{119} \]

which is the standard notion of square loss excess risk used in, e.g., Liang et al. (2015); Rakhlin et al. (2017). Defining $\tilde Y = \Gamma(\hat g(w), w)$, we can apply any standard algorithm for the class $\hat{\mathcal{F}}$ to the dataset $(X_1, \tilde Y_1), \ldots, (X_n, \tilde Y_n)$. Note however that, due to the use of $\hat g$ as a plug-in estimate, predictors produced via Meta-Algorithm 1 will, invoking Definition 1, give a guarantee of the form

\[ \mathbb{E}\big(\hat f(X) - \tilde Y\big)^2 - \inf_{f \in \hat{\mathcal{F}}} \mathbb{E}\big(f(X) - \tilde Y\big)^2 \le \mathrm{Rate}_{\mathcal{D}}(\Theta, S_2, \delta/2; \hat g), \tag{120} \]

where $\hat f$ and the benchmark $f$ belong to $\hat{\mathcal{F}}$ instead of $\mathcal{F}_0$. The machinery developed in Section 3 relates the left-hand side of this expression to the oracle excess risk (119). Depending on the setting, more work is required to show that the right-hand side of (120) is controlled. This challenge is only present in the well-specified setting. The difficulty is that while the original problem (119) is well-specified in this case, the presence of the plug-in estimator $\hat g$ in (120) introduces additional "misspecification". We show that for global ERM and Skeleton Aggregation the right-hand side of the expression is controlled as well, meaning that $\mathrm{Rate}_{\mathcal{D}}(\Theta, S_2, \delta/2; \hat g)$ is not much larger than the rate $\mathrm{Rate}_{\mathcal{D}}(\Theta, S_2, \delta/2; g_0)$ that would have been achieved if the true value for the nuisance predictor was plugged in. This is achieved by exploiting orthogonality yet again.

In the misspecified setting, we can simply upper bound the right-hand side of (120) by the worst-case bound $\sup_{g \in \mathcal{G}} \mathrm{Rate}_{\mathcal{D}}(\Theta, S_2, \delta/2; g)$ and get the desired growth. Since the model is misspecified to begin with, any extra misspecification introduced by using the plug-in estimate here is irrelevant.

To be precise, the algorithm configuration is as follows.

• For stage one, use Skeleton Aggregation (Yang and Barron, 1999; Rakhlin et al., 2017). If $p_1 \le 2$, global ERM can be used instead.

• For stage two, in the misspecified setting with $\Theta$ convex, use global ERM.

• For stage two, in the well-specified setting, we use Skeleton Aggregation, with a new analysis to account for the small amount of "model misspecification" introduced by the plug-in nuisance estimate $\hat g$. If $p_1 \le 2$, global ERM can be used instead; this is because Skeleton Aggregation and global ERM are both optimal for $p_1 \le 2$, even in the presence of nuisance parameters.

I.4 Skeleton Aggregation

Here we briefly describe the Skeleton Aggregation meta-algorithm for real-valued regression (Yang and Barron, 1999; Rakhlin et al., 2017). The setting is as follows: we receive $n$ examples $S = (X_1, Y_1), \ldots, (X_n, Y_n) \in (\mathcal{X} \times \mathbb{R})^n$ i.i.d. from a distribution $\mathcal{D}$. For a function class $\mathcal{F} \subseteq (\mathcal{X} \to \mathbb{R})$, we define $L_{\mathcal{D}}(f) = \mathbb{E}_{\mathcal{D}}(f(X) - Y)^2$. Our goal is to produce a predictor $\hat f_S$ for which the excess risk $L_{\mathcal{D}}(\hat f) - \inf_{f \in \mathcal{F}} L_{\mathcal{D}}(f)$ is small.

We call a sharp model selection aggregate any algorithm that, given a finite collection of $M$ functions $f_1, \ldots, f_M$ and $n$ i.i.d. samples, returns a convex combination $\hat f = \sum_{i=1}^{M} \nu_i f_i$ for which

\[ L(\hat f) \le \min_{i \in [M]} L(f_i) + C\,\frac{\log(M/\delta)}{n} \tag{121} \]

with probability at least $1 - \delta$. One such model selection aggregate is the star aggregation strategy of Audibert (2008), which produces a 2-sparse convex combination $\hat f$ with the property (121) whenever $|Y| \le 1$ almost surely and the functions $f_1, \ldots, f_M$ take values in $[-1, +1]$.

We use the following variant of skeleton aggregation, following Rakhlin et al. (2017). Given a dataset $S = (X_1, Y_1), \ldots, (X_n, Y_n)$, we split it into two equal-sized parts $S'$ and $S''$.

• Fix a scale $\varepsilon > 0$, and let $N = \mathcal{H}(\mathcal{F}, \varepsilon, S')$. Let $\{\hat f_i\}_{i \in [N]}$ be a collection of functions that realize the cover, and assume the cover is proper without loss of generality.$^4$

• Let $\hat f$ be the output of the star aggregation algorithm run with the collection $\{\hat f_i\}_{i \in [N]}$ on the dataset $S''$.

For a simple analysis of this strategy, see Section 6 of Rakhlin et al. (2017). In general, the strategy is optimal only in the well-specified setting in which $\mathbb{E}[Y \mid X] = f^\star(X)$ for some $f^\star \in \mathcal{F}$. We give a more refined analysis in the presence of nuisance parameters in the sequel. As a final remark, note that since we use a proper cover and the star aggregate is 2-sparse, the final predictor $\hat f$ lies in the class $\mathcal{F}' := \mathcal{F} + \mathrm{star}(\mathcal{F} - \mathcal{F}, 0)$.

$^4$ Any improper $\varepsilon$-cover can be made into a proper $2\varepsilon$-cover.
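The two-step structure of this procedure (cover on one half of the data, model selection on the other half) can be sketched as follows. This is a simplified illustration under loudly stated assumptions: the class is represented by a finite pool of candidate functions, the cover is built greedily, and plain empirical risk minimization over the cover stands in for the star aggregation step of Audibert (2008), so the sketch does not inherit the guarantee (121). All function names are hypothetical.

```python
import math

def empirical_l2(f, g, xs):
    """Empirical L2 distance between f and g on the points xs."""
    return math.sqrt(sum((f(x) - g(x)) ** 2 for x in xs) / len(xs))

def greedy_cover(candidates, xs, eps):
    """Greedy proper eps-cover of a finite candidate pool: every candidate
    ends up within eps of some returned center (in empirical L2 on xs)."""
    centers = []
    for f in candidates:
        if all(empirical_l2(f, c, xs) > eps for c in centers):
            centers.append(f)
    return centers

def skeleton_select(candidates, sample, eps):
    """Split the sample, cover on the first half, select on the second.

    NOTE: selection is by plain ERM over the cover, a stand-in for
    star aggregation (assumption of this sketch).
    """
    half = len(sample) // 2
    s1, s2 = sample[:half], sample[half:]
    centers = greedy_cover(candidates, [x for x, _ in s1], eps)

    def risk(f):
        return sum((f(x) - y) ** 2 for x, y in s2) / len(s2)

    return min(centers, key=risk), len(centers)
```

As a usage example, with constant candidate functions on a grid of spacing $0.1$ in $[-1, 1]$ and $\varepsilon = 0.25$, the greedy cover keeps every third grid point, and the selection step returns the center closest to the regression target.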

I.5 Rates for Specific Algorithms

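Given an example $z \in \mathcal{Z}$, we define an auxiliary example $(\tilde X, \tilde Y)$ via $\tilde X(z) = w$, $\tilde Y(z) = \Gamma(\hat g(w), z)$. For the remainder of this section we make use of the auxiliary second-stage dataset $\tilde S$ defined via

\[ \tilde S = \big\{(\tilde X(z), \tilde Y(z))\big\}_{z \in S_2}. \tag{122} \]

We make use of the following auxiliary predictor classes:

\[ \mathcal{F}_0 = \big\{\tilde X \mapsto \langle \Lambda(g_0(w), w), \theta(x)\rangle \mid \theta \in \Theta\big\}, \tag{123} \]

\[ \hat{\mathcal{F}} = \big\{\tilde X \mapsto \langle \Lambda(\hat g(w), w), \theta(x)\rangle \mid \theta \in \Theta\big\}. \tag{124} \]

Finally, we define $\tilde\ell(\hat y, y) = (\hat y - y)^2$ and $\tilde L(f) = \mathbb{E}_{\tilde X, \tilde Y}\, \tilde\ell(f(\tilde X), \tilde Y)$, where $(\tilde X, \tilde Y)$ are sampled from the distribution induced by drawing $z \sim \mathcal{D}$ and taking $(\tilde X(z), \tilde Y(z))$. With these definitions, observe that for any $f \in \hat{\mathcal{F}}$ of the form $\tilde X \mapsto \langle \Lambda(\hat g(w), w), \theta(x)\rangle$, we have

\[ \tilde L(f) = L_{\mathcal{D}}(\theta, \hat g) \quad\text{and}\quad \inf_{f \in \hat{\mathcal{F}}} \tilde L(f) \le L_{\mathcal{D}}(\theta^\star, \hat g). \tag{125} \]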
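The auxiliary dataset and predictor classes above amount to a simple data transformation. The sketch below shows one way to express it; everything about the representation is an assumption for illustration: examples are dictionaries with hypothetical keys `"x"` and `"w"`, and `g_hat`, `gamma`, `lam`, and `theta` are user-supplied callables standing in for $\hat g$, $\Gamma$, $\Lambda$, and $\theta$.

```python
def inner(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def build_aux_dataset(second_stage_sample, g_hat, gamma):
    """Return the pairs (X~(z), Y~(z)) = (w, Gamma(g_hat(w), z)) for z in S2."""
    return [(z["w"], gamma(g_hat(z["w"]), z)) for z in second_stage_sample]

def aux_predictor(theta, g_hat, lam):
    """Map theta to the predictor z -> <Lambda(g_hat(w), w), theta(x)>."""
    def f(z):
        w = z["w"]
        return inner(lam(g_hat(w), w), theta(z["x"]))
    return f

def empirical_aux_risk(f, second_stage_sample, g_hat, gamma):
    """Average square loss (f(z) - Y~(z))^2 over the second-stage sample."""
    total = 0.0
    for z in second_stage_sample:
        y_tilde = gamma(g_hat(z["w"]), z)
        total += (f(z) - y_tilde) ** 2
    return total / len(second_stage_sample)
```

Any off-the-shelf square loss regression routine can then be run on the output of `build_aux_dataset`; the empirical risk computed here is the sample analogue of $\tilde L(f)$.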

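We relate the metric entropy of the auxiliary class $\hat{\mathcal{F}}$ to that of $\Theta$ as follows.

Proposition 4. Under Assumption 8, it holds that

\[ \mathcal{H}_2(\hat{\mathcal{F}}, \varepsilon, n) \le \mathcal{H}_2(\hat\Theta, \varepsilon, n). \tag{126} \]

Lemma 16 (Rates for first stage). Global ERM guarantees that with probability at least $1 - \delta$,

\[ \|\hat g - g_0\|^2_{L_2(\ell_2, \mathcal{D})} \le K_1 \cdot \begin{cases} C \cdot \big(d_1 \log(en/d_1) + \log(\delta^{-1})\big) n^{-1}, & \text{Parametric case,} \\ C_{p_1} \cdot n^{-\frac{2}{2+p_1}} + \log(\delta^{-1}) n^{-1}, & \text{Nonparametric case, } p_1 < 2, \\ C \sqrt{(\log n + \log(\delta^{-1}))/n}, & \text{Nonparametric case, } p_1 = 2, \\ C_{p_1} \cdot n^{-\frac{1}{p_1}} + \sqrt{\log(\delta^{-1})/n}, & \text{Nonparametric case, } p_1 > 2. \end{cases} \]

Skeleton Aggregation guarantees that with probability at least $1 - \delta$,

\[ \|\hat g - g_0\|^2_{L_2(\ell_2, \mathcal{D})} \le K_1 \cdot \begin{cases} C \cdot \big(d_1 \log(en/d_1) + \log(\delta^{-1})\big) n^{-1}, & \text{Parametric case,} \\ C_{p_1} \cdot n^{-\frac{2}{2+p_1}} + \log(\delta^{-1}) n^{-1}, & \text{Nonparametric case.} \end{cases} \]

Here $C_{p_1}$ is a constant that depends only on $p_1$.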

Proof of Lemma 16. In what follows we analyze the algorithms under consideration for the class $\mathcal{G}|_i$ for a fixed coordinate $i$. The final result follows by union bounding over coordinates and summing the coordinate-wise error bounds we establish.

Global ERM. When we are either in the parametric case or the nonparametric case with $p_1 < 2$, the result is given by Theorem 5.2 of Koltchinskii (2011). See Example 3 and Example 4 that follow the theorem for precise calculations under these assumptions. See also Remark 2.

On the other hand, when $p_1 \ge 2$ we appeal to the standard Rademacher complexity bound for ERM (e.g. Shalev-Shwartz and Ben-David (2014)), which states that with probability at least $1 - \delta$,

\[ \mathbb{E}\big(\hat g_i(w) - (g_0)_i(w)\big)^2 \le 2 \cdot \mathcal{R}_{n/2}(\ell_{\mathrm{square}} \circ \mathcal{G}|_i, S_1) + \sqrt{\frac{\log(1/\delta)}{n}}, \]

where $\ell_{\mathrm{square}}(g_i, u_i) = (g_i(w) - u_i)^2$. The result follows by applying Lipschitz contraction to the Rademacher complexity (using that the class is bounded) and appealing to the Rademacher complexity bound from Lemma 14.

Skeleton Aggregation. We appeal to Section 6 of Rakhlin et al. (2017). See Remark 2.

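Lemma 17. Consider the plug-in global ERM strategy for the setting in Section 5.1, i.e.

\[ \hat\theta \in \arg\min_{\theta \in \Theta} \sum_{z \in S_2} \ell(\theta, \hat g; z). \]

Under the assumptions of Theorem 6 and Theorem 7, global ERM guarantees

\[ L_{\mathcal{D}}(\hat\theta, \hat g) - L_{\mathcal{D}}(\theta^\star, \hat g) \le C \cdot \begin{cases} \big(d_2 \log(en/d_2) + \log(\delta^{-1})\big) n^{-1}, & \text{Parametric case,} \\ C_{p_2} \cdot n^{-\frac{2}{2+p_2}} + \log(\delta^{-1}) n^{-1}, & \text{Nonparametric case, } p_2 < 2, \\ \sqrt{(\log n + \log(\delta^{-1}))/n}, & \text{Nonparametric case, } p_2 = 2, \\ C_{p_2} \cdot n^{-\frac{1}{p_2}} + \sqrt{\log(\delta^{-1})/n}, & \text{Nonparametric case, } p_2 > 2. \end{cases} \]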

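Proof of Lemma 17. To begin, let $\hat f(\tilde X) := \langle \Lambda(\hat g(w), w), \hat\theta(x)\rangle$ and observe that we can write $\hat f$ as the global ERM for the auxiliary dataset $\tilde S$:

\[ \hat f \in \arg\min_{f \in \hat{\mathcal{F}}} \sum_{(\tilde X, \tilde Y) \in \tilde S} \tilde\ell(f(\tilde X), \tilde Y). \]

Case $p_2 < 2$. In the misspecified case we appeal to Theorem 5.1 in Koltchinskii (2011), using that $\hat f$ is the global ERM for the class $\hat{\mathcal{F}}$. To invoke the theorem, we verify that a) $\hat{\mathcal{F}}$ takes values in $[-1, +1]$ under Assumption 8 (see Remark 2), b) $\hat{\mathcal{F}}$ inherits convexity from $\Theta$, and c) $\mathcal{H}_2(\hat{\mathcal{F}}, \varepsilon, n) \le \mathcal{H}_2(\Theta, \varepsilon, n)$, following Proposition 4. The theorem (see also the following discussion in Example 3 and Example 4) therefore guarantees that with probability at least $1 - \delta$,

\[ \tilde L(\hat f) - \inf_{f \in \hat{\mathcal{F}}} \tilde L(f) \le C \cdot \begin{cases} n^{-1}\big(d_2 \log(en/d_2) + \log(\delta^{-1})\big), & \text{Parametric case,} \\ C_{p_2} \cdot n^{-\frac{2}{2+p_2}} + n^{-1} \log(\delta^{-1}), & \text{Nonparametric case, } p_2 < 2. \end{cases} \]

The result now follows from (125), in particular that $\tilde L(\hat f) = L_{\mathcal{D}}(\hat\theta, \hat g)$.

Case $p_2 \ge 2$. We apply the standard Rademacher complexity bound for ERM (e.g. Shalev-Shwartz and Ben-David (2014)), which states that with probability at least $1 - \delta$,

\[ \tilde L(\hat f) - \inf_{f \in \hat{\mathcal{F}}} \tilde L(f) \le 2 \cdot \mathcal{R}_{n/2}(\tilde\ell \circ \hat{\mathcal{F}}) + C\sqrt{\frac{\log(1/\delta)}{n}} \le C' \cdot \mathcal{R}_{n/2}(\hat{\mathcal{F}}) + C\sqrt{\frac{\log(1/\delta)}{n}}, \]

where we have applied Lipschitz contraction to the Rademacher complexity (using that the class is bounded). To complete the result, we use that $\mathcal{H}_2(\hat{\mathcal{F}}, \varepsilon, n) \le \mathcal{H}_2(\Theta, \varepsilon, n)$ and appeal to the Rademacher complexity bound from Lemma 14.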

Lemma 18. Consider the following Skeleton Aggregation variant:$^5$

• Split $S_2$ into equal-sized subsets $S'$ and $S''$.

• Fix a scale $\varepsilon > 0$, and let $N = \mathcal{H}_2(\Theta, \varepsilon, S')$. Let $\{\theta_i\}_{i \in [N]}$ be a collection of functions that realize the cover, and assume the cover is proper without loss of generality. Define $f_i = \tilde X \mapsto \langle \Lambda(\hat g(w), w), \theta_i(x)\rangle$ for each $i \in [N]$.

• Let $\hat\theta \in \Theta + \mathrm{star}(\Theta - \Theta, 0) =: \hat\Theta$ realize the output of the star aggregation algorithm using the function class $\{\hat f_i\}_{i \in [N]}$ on the dataset $\{(\tilde X(z), \tilde Y(z))\}_{z \in S''}$.

Under the assumptions of Theorem 6, when the model is well-specified, Skeleton Aggregation guarantees that with probability at least $1 - \delta$,

\[ L_{\mathcal{D}}(\hat\theta, \hat g) - L_{\mathcal{D}}(\theta^\star, \hat g) \le C \cdot \|\hat g - g_0\|^4_{L_4(\ell_2, \mathcal{D})} + C' \cdot \begin{cases} \big(d_2 \log(en/d_2) + K_2 \log(K_2 \log n/\delta)\big) n^{-1}, & \text{Parametric case,} \\ C_{p_2} \cdot n^{-\frac{2}{2+p_2}} + K_2 \log(K_2 \log n/\delta)\, n^{-1}, & \text{Nonparametric case,} \end{cases} \]

so long as $K_2 = o\big(n^{\frac{p_2}{2+p_2} \wedge \frac{4}{p_2(2+p_2)}}\big)$.

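Proof of Lemma 18. Let $\hat{\mathcal{F}} = \{f_i\}_{i \in [N]}$. The Skeleton Aggregation algorithm as described outputs a predictor $\hat f \in \hat{\mathcal{F}} + \mathrm{star}(\hat{\mathcal{F}} - \hat{\mathcal{F}}, 0)$ (see Appendix I.4) such that

\[ \tilde L(\hat f) \le \min_{i \in [N]} \tilde L(\hat f_i) + C\Big(\frac{\log N}{n} + \frac{\log(1/\delta)}{n}\Big) \le \min_{i \in [N]} \tilde L(\hat f_i) + C\Big(\frac{\mathcal{H}_2(\Theta, \varepsilon, S)}{n} + \frac{\log(1/\delta)}{n}\Big). \]

Translating back into the language of the lemma statement, recall that we can express each $f_i$ via $f_i = X \mapsto \langle \Lambda(\hat g(w), w), \theta_i(x)\rangle$, with $\{\theta_i\}_{i \in [N]} \subset \Theta$ since we have assumed a proper cover. Since this parameterization is linear in $\theta$, there must be some $\hat\theta \in \Theta + \mathrm{star}(\Theta - \Theta, 0)$ that realizes $\hat f$. Using the expression for the risk in (125), this implies

\[ L_{\mathcal{D}}(\hat\theta, \hat g) \le \min_{i \in [N]} L_{\mathcal{D}}(\theta_i, \hat g) + C\Big(\frac{\mathcal{H}_2(\Theta, \varepsilon, S)}{n} + \frac{\log(1/\delta)}{n}\Big). \]

Subtracting $L_{\mathcal{D}}(\theta_0, \hat g)$ from both sides, we rewrite the inequality as

\[ L_{\mathcal{D}}(\hat\theta, \hat g) - L_{\mathcal{D}}(\theta_0, \hat g) \le \min_{i \in [N]} \big\{L_{\mathcal{D}}(\theta_i, \hat g) - L_{\mathcal{D}}(\theta_0, \hat g)\big\} + C\Big(\frac{\mathcal{H}_2(\Theta, \varepsilon, S)}{n} + \frac{\log(1/\delta)}{n}\Big). \]

The idea going forward is to use orthogonality to bound the approximation error on the right-hand side. Let $i$ be fixed. Using a second-order Taylor expansion, there exists $\bar\theta \in \mathrm{star}(\Theta, \theta_0)$ such that

\[ L_{\mathcal{D}}(\theta_i, \hat g) - L_{\mathcal{D}}(\theta_0, \hat g) = D_\theta L_{\mathcal{D}}(\theta_0, \hat g)[\theta_i - \theta_0] + \frac{1}{2} D_\theta^2 L_{\mathcal{D}}(\bar\theta, \hat g)[\theta_i - \theta_0, \theta_i - \theta_0]. \]

Using another second-order Taylor expansion, there exists $\bar g \in \mathrm{star}(\mathcal{G}, g_0)$ for which

\[ D_\theta L_{\mathcal{D}}(\theta_0, \hat g)[\theta_i - \theta_0] = D_\theta L_{\mathcal{D}}(\theta_0, g_0)[\theta_i - \theta_0] + D_g D_\theta L_{\mathcal{D}}(\theta_0, g_0)[\theta_i - \theta_0, \hat g - g_0] + D_g^2 D_\theta L_{\mathcal{D}}(\theta_0, \bar g)[\theta_i - \theta_0, \hat g - g_0, \hat g - g_0]. \]

$^5$ See Appendix I.4 for background.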

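The model is well-specified and we have assumed orthogonality, so the first two terms on the right-hand side vanish. Let $\|\theta - \theta'\|_\Theta^2 = \mathbb{E}\big[\langle \Lambda(g_0(w), w), \theta(x) - \theta'(x)\rangle^2\big]$. Using Lemma 4, we have

\[ D_\theta^2 L_{\mathcal{D}}(\bar\theta, \hat g)[\theta_i - \theta_0, \theta_i - \theta_0] = C \cdot \|\theta_i - \theta_0\|_\Theta^2 + C' \cdot \|\hat g - g_0\|^4_{L_4(\ell_2, \mathcal{D})}. \]

Next, using (57), we have

\[ D_g^2 D_\theta L_{\mathcal{D}}(\theta_0, \bar g)[\theta_i - \theta_0, \hat g - g_0, \hat g - g_0] \le C \cdot \|\theta_i - \theta_0\|_\Theta \cdot \|\hat g - g_0\|^2_{L_4(\ell_2, \mathcal{D})} \le C \cdot \big(\|\theta_i - \theta_0\|_\Theta^2 + \|\hat g - g_0\|^4_{L_4(\ell_2, \mathcal{D})}\big). \]

Furthermore, Assumption 8 implies that $\|\theta_i - \theta_0\|_\Theta \le \|\theta_i - \theta_0\|_{L_2(\ell_2, \mathcal{D})}$. Plugging these bounds back into the excess risk guarantee, there are constants $C, C', C''$ such that

\[ L_{\mathcal{D}}(\hat\theta, \hat g) - L_{\mathcal{D}}(\theta_0, \hat g) \le C \cdot \min_{i \in [N]} \|\theta_i - \theta_0\|^2_{L_2(\ell_2, \mathcal{D})} + C'\Big(\frac{\mathcal{H}_2(\mathcal{F}, \varepsilon, S)}{n} + \frac{\log(1/\delta)}{n}\Big) + C'' \cdot \|\hat g - g_0\|^4_{L_4(\ell_2, \mathcal{D})}. \]

We now invoke Lemma 15 for each of the $K_2$ output coordinates of the target space separately and union bound, which implies that with probability at least $1 - \delta$ over the draw of $S'$,

\[ \|\theta_i - \theta_0\|^2_{L_2(\ell_2, \mathcal{D})} \le 2 d^2_{2,S'}(\theta_i, \theta_0) + U, \]

where

\[ U \le K_2 R_{n,p_2} + C\,\frac{K_2 \log(K_2 \log n/\delta)}{n} \]

in the nonparametric case, and

\[ U \le C K_2\,\frac{d_2 \log(en/d_2)}{n} + C'\,\frac{K_2 \log(K_2 \log n/\delta)}{n} \]

in the parametric case. Returning to the excess risk, this implies

\[ L_{\mathcal{D}}(\hat\theta, \hat g) - L_{\mathcal{D}}(\theta_0, \hat g) \le C \cdot \min_{i \in [N]} d^2_{S'}(\theta_i, \theta_0) + C'\Big(\frac{\mathcal{H}_2(\mathcal{F}, \varepsilon, S)}{n} + \frac{\log(1/\delta)}{n}\Big) + U + C'' \cdot \|\hat g - g_0\|^4_{L_4(\ell_2, \mathcal{D})}. \]

The cover property implies that $\min_{i \in [N]} d^2_{2,S'}(\theta_i, \theta_0) \le \varepsilon^2$. So we are left with

\[ L_{\mathcal{D}}(\hat\theta, \hat g) - L_{\mathcal{D}}(\theta_0, \hat g) \le C\varepsilon^2 + C'\Big(\frac{\mathcal{H}_2(\mathcal{F}, \varepsilon, S)}{n} + \frac{\log(1/\delta)}{n}\Big) + U + C'' \cdot \|\hat g - g_0\|^4_{L_4(\ell_2, \mathcal{D})}. \]

Solving for the balance $\varepsilon^2 \asymp \frac{\mathcal{H}_2(\mathcal{F}, \varepsilon, S)}{n}$ leads the first two terms to be of order $d_2 \log(en/d_2)$ in the parametric case and $n^{-\frac{2}{2+p_2}}$ in the nonparametric case. Thus, in the parametric case, the term $U$ dominates and the final bound is $C K_2 \frac{d_2 \log(en/d_2)}{n} + C' \frac{K_2 \log(K_2 \log n/\delta)}{n}$. In the nonparametric case, our assumption on the growth of $K_2$ implies that $U$ is of lower order.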
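The balancing step in the nonparametric case, and the origin of the growth condition on $K_2$ in Lemma 18, can be checked with a short calculation. The sketch below assumes $\mathcal{H}_2(\mathcal{F}, \varepsilon, S) \le c\,\varepsilon^{-p_2}$ for a constant $c$ and ignores constants and logarithmic factors throughout, so it only tracks polynomial rates in $n$.

```latex
% Balancing the approximation and estimation terms:
\varepsilon^2 \asymp \frac{\varepsilon^{-p_2}}{n}
\;\Longleftrightarrow\;
\varepsilon^{2+p_2} \asymp n^{-1}
\;\Longleftrightarrow\;
\varepsilon \asymp n^{-\frac{1}{2+p_2}},
\qquad\text{so}\qquad
\varepsilon^2 \asymp n^{-\frac{2}{2+p_2}}.

% Requiring U to be of lower order, with U \approx K_2 R_{n,p_2}
% and R_{n,p_2} taken from Lemma 15 (log factors dropped):
p_2 < 2:\quad K_2\, n^{-1} = o\big(n^{-\frac{2}{2+p_2}}\big)
\;\Longleftrightarrow\;
K_2 = o\big(n^{1 - \frac{2}{2+p_2}}\big) = o\big(n^{\frac{p_2}{2+p_2}}\big),

p_2 > 2:\quad K_2\, n^{-\frac{2}{p_2}} = o\big(n^{-\frac{2}{2+p_2}}\big)
\;\Longleftrightarrow\;
K_2 = o\big(n^{\frac{2}{p_2} - \frac{2}{2+p_2}}\big) = o\big(n^{\frac{4}{p_2(2+p_2)}}\big).
```

Since $\frac{p_2}{2+p_2} \le \frac{4}{p_2(2+p_2)}$ exactly when $p_2 \le 2$, the two cases combine into the single exponent $\frac{p_2}{2+p_2} \wedge \frac{4}{p_2(2+p_2)}$.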

I.6 Proofs for Oracle Rates

Proof of Theorem 6. First we invoke Lemma 16, which along with Assumption 9 implies that, depending on whether $p_1 > 2$, one of either global ERM or Skeleton Aggregation guarantees that with probability at least $1 - \delta$,

\[ \|\hat g - g_0\|^2_{L_4(\ell_2, \mathcal{D})} \le \big(\mathrm{Rate}_{\mathcal{D}}(\mathcal{G}, S_1, \delta)\big)^2 \le \tilde O\big(C_{2\to4}^2 \cdot n^{-\frac{2}{2+p_1}}\big). \]

Observe that the assumption that the model is well-specified at $(g_0, \theta_0)$ implies that Assumption 2 is satisfied. We invoke Assumption 11 (implied by Assumption 8) and Corollary 1 to get

\[ L_{\mathcal{D}}(\hat\theta, g_0) - L_{\mathcal{D}}(\theta^\star, g_0) \le C \cdot \mathrm{Rate}_{\mathcal{D}}(\Theta, S_2, \delta; \hat g) + C'' \big(\mathrm{Rate}_{\mathcal{D}}(\mathcal{G}, S_1, \delta)\big)^4. \]

We now invoke Lemma 18 using that the model is assumed to be well-specified, which implies that with probability at least $1 - \delta$, Skeleton Aggregation enjoys

\[ \mathrm{Rate}_{\mathcal{D}}(\Theta, S_2, \delta; \hat g) \le \tilde O\big(n^{-\frac{2}{2+p_2}}\big) + C' \big(\mathrm{Rate}_{\mathcal{D}}(\mathcal{G}, S_1, \delta)\big)^4. \]

Combining these results, we get

\[ L_{\mathcal{D}}(\hat\theta, g_0) - L_{\mathcal{D}}(\theta^\star, g_0) \le \tilde O\big(n^{-\frac{2}{2+p_2}} + C_{2\to4}^4 \cdot n^{-\frac{4}{2+p_1}}\big). \]

The final result follows by setting $p_1$ to guarantee that the first term dominates this expression. We mention in passing that to show that global ERM achieves the desired rate for stage two when $p_2 \le 2$, one can appeal to the rates in Appendix H.

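Proof of Theorem 7. As in the proof of Theorem 6, we invoke Lemma 16, which implies that Skeleton Aggregation$^6$ guarantees that with probability at least $1 - \delta$,

\[ \|\hat g - g_0\|^2_{L_2(\ell_2, \mathcal{D})} \le C_{2\to4}^2 \big(\mathrm{Rate}_{\mathcal{D}}(\mathcal{G}, S_1, \delta)\big)^2 \le \tilde O\big(C_{2\to4}^2 \cdot n^{-\frac{2}{2+p_1}}\big). \]

Observe that since $\Theta$ is convex, Assumption 2 is satisfied, and we can invoke Assumption 11 (implied by Assumption 8) and Corollary 1 to get

\[ L_{\mathcal{D}}(\hat\theta, g_0) - L_{\mathcal{D}}(\theta^\star, g_0) \le C \cdot \mathrm{Rate}_{\mathcal{D}}(\Theta, S_2, \delta; \hat g) + C'' \big(\mathrm{Rate}_{\mathcal{D}}(\mathcal{G}, S_1, \delta)\big)^4. \]

We use global ERM for the second stage. Lemma 17 implies that since the class $\Theta$ is convex, with probability at least $1 - \delta$, global ERM guarantees

\[ \mathrm{Rate}_{\mathcal{D}}(\Theta, S_2, \delta; \hat g) \le \tilde O\big(n^{-\left(\frac{2}{2+p_2} \wedge \frac{1}{p_2}\right)}\big). \]

This leads to a final guarantee of

\[ L_{\mathcal{D}}(\hat\theta, g_0) - L_{\mathcal{D}}(\theta^\star, g_0) \le \tilde O\big(n^{-\left(\frac{2}{2+p_2} \wedge \frac{1}{p_2}\right)} + C_{2\to4}^4 \cdot n^{-\frac{4}{2+p_1}}\big), \]

and the stated result follows by setting $p_1$ to guarantee that the first term dominates this expression.

$^6$ Alternatively, global ERM can be applied when $p_1 \le 2$.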
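The final step in the proofs of Theorem 6 and Theorem 7, choosing $p_1$ so that the target term dominates, is a comparison of two rate exponents. The following sketch makes that comparison explicit; it treats $C_{2\to4}$ as a constant and ignores logarithmic factors, and the function names are hypothetical.

```python
from fractions import Fraction

def rate_exponents(p1, p2, theorem):
    """Return (target, nuisance) exponents a, b such that the two terms
    in the final bound scale as n^{-a} and n^{-b} (logs/constants dropped)."""
    p1, p2 = Fraction(p1), Fraction(p2)
    if theorem == 6:
        # Final display of the proof of Theorem 6 (well-specified case).
        target = 2 / (2 + p2)
    elif theorem == 7:
        # Final display of the proof of Theorem 7 (convex Theta, global ERM).
        target = min(2 / (2 + p2), 1 / p2)
    else:
        raise ValueError("this sketch only covers Theorems 6 and 7")
    nuisance = 4 / (2 + p1)
    return target, nuisance

def first_term_dominates(p1, p2, theorem):
    """True when the nuisance term n^{-4/(2+p1)} decays at least as fast
    as the target term, so the bound has the order of the target term."""
    target, nuisance = rate_exponents(p1, p2, theorem)
    return nuisance >= target
```

For instance, with $p_2 = 1$ in the setting of Theorem 6 the target exponent is $2/3$, and the nuisance term is of lower or equal order exactly when $\frac{4}{2+p_1} \ge \frac{2}{3}$, i.e. $p_1 \le 4$.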

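Proof of Theorem 8. Lemma 16 implies that either Skeleton Aggregation or global ERM (for $p_1 \le 2$) guarantees that with probability at least $1 - \delta$,

\[ \|\hat g - g_0\|^2_{L_2(\ell_2, \mathcal{D})} = \big(\mathrm{Rate}_{\mathcal{D}}(\mathcal{G}, S_1, \delta)\big)^2 \le \tilde O\big(n^{-\frac{2}{2+p_1}}\big). \]

Theorem 2 guarantees

\[ L_{\mathcal{D}}(\hat\theta, g_0) - L_{\mathcal{D}}(\theta^\star, g_0) \le C \cdot \mathrm{Rate}_{\mathcal{D}}(\Theta, S_2, \delta; \hat g) + C'' \big(\mathrm{Rate}_{\mathcal{D}}(\mathcal{G}, S_1, \delta)\big)^2. \]

We use global ERM for the second stage. The standard Rademacher complexity bound for ERM (e.g. Shalev-Shwartz and Ben-David (2014)) states that with probability at least $1 - \delta$, the excess risk is bounded by the Rademacher complexity of the target class $\Theta$ composed with the loss class as follows:

\[ L_{\mathcal{D}}(\hat\theta, \hat g) - L_{\mathcal{D}}(\theta^\star, \hat g) \le 2 \cdot \underbrace{\mathbb{E}_{\epsilon} \sup_{\theta \in \Theta} \frac{2}{n} \sum_{i=1}^{n/2} \epsilon_i\, \ell(\theta(x_i), \hat g(w_i); z_i)}_{=:\ \mathcal{R}_{n/2}(\ell \circ \Theta, S_2)} + C\sqrt{\frac{\log(1/\delta)}{n}}. \]

Using Lemma 14 and boundedness of the loss, we have

\[ \mathcal{R}_{n/2}(\ell \circ \Theta, S_2) \le C \cdot \inf_{\alpha > 0}\Big\{\alpha + \int_\alpha^1 \sqrt{\frac{\mathcal{H}_2(\ell \circ \Theta, \varepsilon, S_2)}{n}}\, d\varepsilon\Big\}. \]

Since the loss is 1-Lipschitz with respect to $\ell_2$, we have $\mathcal{H}_2(\ell \circ \Theta, \varepsilon, S_2) \le \mathcal{H}_2(\Theta, \varepsilon, S_2)$. Under the assumed growth of $\mathcal{H}_2(\Theta, \varepsilon, S_2) \propto \varepsilon^{-p_2}$, this gives

\[ \mathrm{Rate}_{\mathcal{D}}(\Theta, S_2, \delta; \hat g) \le \tilde O\big(n^{-\left(\frac{1}{2} \wedge \frac{1}{p_2}\right)}\big). \]

This leads to a final guarantee of

\[ L_{\mathcal{D}}(\hat\theta, g_0) - L_{\mathcal{D}}(\theta^\star, g_0) \le \tilde O\big(n^{-\left(\frac{1}{2} \wedge \frac{1}{p_2}\right)} + n^{-\frac{2}{2+p_1}}\big). \]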

The theorem statement follows by setting p1 so that the first term dominates.
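The rate $n^{-(\frac{1}{2} \wedge \frac{1}{p_2})}$ used in the proof of Theorem 8 can be recovered directly from the entropy integral. The calculation below is a sketch under simplifying assumptions: it takes $\mathcal{H}_2(\Theta, \varepsilon, S_2) \le \varepsilon^{-p}$ with $p = p_2$ and constant one, and absorbs all absolute constants.

```latex
\inf_{\alpha>0}\Big\{\alpha + \frac{1}{\sqrt{n}}\int_{\alpha}^{1}\varepsilon^{-p/2}\,d\varepsilon\Big\}
\;\le\;
\begin{cases}
\dfrac{1}{(1 - p/2)\sqrt{n}},
  & p < 2 \quad (\alpha \to 0), \\[2ex]
n^{-\frac{1}{2}}\Big(1 + \tfrac{1}{2}\log n\Big),
  & p = 2 \quad (\alpha = n^{-1/2}), \\[2ex]
n^{-\frac{1}{p}}\Big(1 + \dfrac{1}{p/2 - 1}\Big),
  & p > 2 \quad (\alpha = n^{-1/p}).
\end{cases}

% For p > 2 the integral is at most \alpha^{1-p/2}/(p/2-1), and with
% \alpha = n^{-1/p}:
\frac{\alpha^{1-p/2}}{\sqrt{n}}
= n^{-\frac{1}{p}\left(1-\frac{p}{2}\right)-\frac{1}{2}}
= n^{-\frac{1}{p}}.
```

Up to logarithmic factors, the three cases combine into $n^{-(\frac{1}{2} \wedge \frac{1}{p})}$, which is also the form of the bound on $r_{n,p}$ stated after Lemma 14.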
