Optimal oracle inequalities for solving projected fixed-point equations

Wenlong Mou, Ashwin Pananjady⋆, Martin J. Wainwright†

Department of Electrical Engineering and Computer Sciences
⋆ Simons Institute for the Theory of Computing
† Department of Statistics
University of California Berkeley

arXiv:2012.05299v1 [cs.LG] 9 Dec 2020

Abstract

Linear fixed point equations in Hilbert spaces arise in a variety of settings, including reinforcement learning, and computational methods for solving differential and integral equations. We study methods that use a collection of random observations to compute approximate solutions by searching over a known low-dimensional subspace of the Hilbert space. First, we prove an instance-dependent upper bound on the mean-squared error for a linear stochastic approximation scheme that exploits Polyak–Ruppert averaging. This bound consists of two terms: an approximation error term with an instance-dependent approximation factor, and a statistical error term that captures the instance-specific complexity of the noise when projected onto the low-dimensional subspace. Using information-theoretic methods, we also establish lower bounds showing that both of these terms cannot be improved, again in an instance-dependent sense. A concrete consequence of our characterization is that the optimal approximation factor in this problem can be much larger than a universal constant. We show how our results precisely characterize the error of a class of temporal difference learning methods for the policy evaluation problem with linear function approximation, establishing their optimality.

1 Introduction

Linear fixed point equations over a Hilbert space, with the Euclidean space being an important special case, arise in various contexts. Depending on the application, such fixed point equations take different names, including estimating equations, Bellman equations, Poisson equations and inverse systems [Ber11, KVZ+72, Woo16]. More specifically, given a Hilbert space $\mathbb{X}$, we consider a fixed point equation of the form

$$v = Lv + b, \tag{1}$$

where $b$ is some member of the Hilbert space, and $L$ is a linear operator mapping $\mathbb{X}$ to itself. When the Hilbert space is infinite-dimensional, or has a finite but very large dimension $D$, it is common to seek approximate solutions to equation (1). A standard approach is to choose a subspace $\mathbb{S}$ of the Hilbert space, of dimension $d \ll D$, and to search for solutions within this subspace. In particular, letting $\Pi_{\mathbb{S}}$ denote the orthogonal projection onto this subspace, various methods seek (approximate) solutions to the projected fixed point equation

$$v = \Pi_{\mathbb{S}}(Lv + b). \tag{2}$$

In order to set the stage, let us consider some generic examples that illustrate the projected fixed point equation (2). We eschew a fully rigorous exposition at this stage, deferring technical details and specific examples to Section 2.2.
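In the finite-dimensional Euclidean case, the projected fixed point equation (2) reduces to a $d \times d$ linear system. The following is a minimal sketch under that assumption; the function name `solve_projected_fixed_point`, the random test instance, and the rescaling of `L` to operator norm 0.9 are illustrative choices of ours, not part of the paper.

```python
import numpy as np

def solve_projected_fixed_point(L, b, basis):
    """Solve v = Pi_S(L v + b), where S is the column span of `basis`."""
    # Orthonormalize the spanning set so that Pi_S = Q Q^T.
    Q, _ = np.linalg.qr(basis)
    d = Q.shape[1]
    M = Q.T @ L @ Q                       # action of L restricted to S
    theta = np.linalg.solve(np.eye(d) - M, Q.T @ b)
    return Q @ theta                      # the solution, in ambient coordinates

rng = np.random.default_rng(0)
D, d = 50, 5
L = rng.standard_normal((D, D))
L *= 0.9 / np.linalg.norm(L, 2)           # rescale so the operator norm is 0.9
b = rng.standard_normal(D)
basis = rng.standard_normal((D, d))

v_bar = solve_projected_fixed_point(L, b, basis)

# Check that v_bar satisfies equation (2) up to round-off.
Q, _ = np.linalg.qr(basis)
residual = v_bar - Q @ (Q.T @ (L @ v_bar + b))
print(np.linalg.norm(residual))
```

Only the $d \times d$ matrix `M` is inverted, which is the computational appeal of working in the subspace when $d \ll D$.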

Example 1 (Galerkin methods for differential equations). Let $\mathbb{X}$ be a Hilbert space of suitably differentiable functions, and let $A$ be a linear differential operator of order $k$, say of the form $A(v) = \omega_0 v + \sum_{j=1}^{k} \omega_j v^{(j)}$, where $v^{(j)}$ denotes the $j$-th order derivative of the function $v \in \mathbb{X}$. Given a function $b \in \mathbb{X}$, suppose that we are interested in solving the differential equation $A(v) = b$. This represents a particular case of our fixed point equation with $L = I - A$.

Let $\mathbb{S}$ be a finite-dimensional subspace of $\mathbb{X}$, say spanned by a set of basis functions $\{\phi_j\}_{j=1}^{d}$. A Galerkin method constructs an approximate solution to the differential equation $A(v) = b$ by solving the projected fixed point equation (2) over a subspace of this type. Concretely, any function $v \in \mathbb{S}$ has a representation of the form $v = \sum_{j=1}^{d} \vartheta_j \phi_j$ for some weight vector $\vartheta \in \mathbb{R}^d$. Applying the operator $A$ to any such function yields the residual $A(v) = \sum_{j=1}^{d} \vartheta_j A(\phi_j)$, and the Galerkin method chooses the weight vector $\vartheta \in \mathbb{R}^d$ such that $v$ satisfies the equation $v = \Pi_{\mathbb{S}}((I - A)v + b)$. A specific version of the Galerkin method for a second-order differential equation called the elliptic boundary value problem is presented in detail in Section 2.2.2. ♣

Example 2 (Instrumental variable methods for nonparametric regression). Let $\mathbb{X}$ denote a suitably constrained space of square-integrable functions mapping $\mathbb{R}^p \to \mathbb{R}$, and suppose that we have a regression model of the form¹ $Y = f^*(X) + \epsilon$. Here $X$ is a random vector of covariates taking values in $\mathbb{R}^p$, the pair $(Y, \epsilon)$ denotes scalar random variables, and $f^* \in \mathbb{X}$ denotes an unknown function of interest. In the classical setup of nonparametric regression, it is assumed that $\mathbb{E}[\epsilon \mid X] = 0$, an assumption that can be violated. Instead, suppose that we have a vector of instrumental variables $Z \in \mathbb{R}^p$ such that $\mathbb{E}[\epsilon \mid Z] = 0$. Now let $T : \mathbb{X} \to \mathbb{X}$ denote a linear operator given by $T(f) = \mathbb{E}[f(X) \mid Z]$, and denote by $r = \mathbb{E}[Y \mid Z]$ a point in $\mathbb{X}$. Instrumental variable approaches to estimating $f^*$ are based on the equality

$$\mathbb{E}[Y - f^*(X) \mid Z] = r - T(f^*) = 0, \tag{3}$$

which is a linear fixed point relation of the form (1) with $L = I - T$ and $b = r$.

Now let $\{\phi_j\}_{j \geq 1}$ be an orthonormal basis of $\mathbb{X}$, and let $\mathbb{S}$ denote the subspace spanned by the first $d$ such basis functions. Then each function $f \in \mathbb{S}$ can be represented as $f = \sum_{j=1}^{d} \vartheta_j \phi_j$, and approximate solutions to the fixed point equation (3) may be obtained via solving a projected variant (2), i.e., the equation $f = \Pi_{\mathbb{S}}((I - T)f + r)$. A specific example of an instrumental variables method is the class of temporal difference methods for policy evaluation, introduced and discussed in detail in Section 2.2.3. ♣

In particular instantiations of both of the examples above, it is typical for the ambient dimension $D$ to be very large (if not infinite) and for us to only have sample access to the pair $(L, b)$.² This paper treats the setting in which $n$ observations $\{(L_i, b_i)\}_{i=1}^{n}$ are drawn i.i.d. from some distribution with mean $(L, b)$. Letting $v^*$ denote the solution to the fixed point equation (1), our goal is to use these observations in order to produce an estimate $\hat{v}_n$ of $v^*$ that satisfies an oracle inequality of the form

$$\mathbb{E}\|\hat{v}_n - v^*\|^2 \leq \alpha \cdot \inf_{v \in \mathbb{S}} \|v - v^*\|^2 + \varepsilon_n. \tag{4}$$

¹ For a more detailed discussion of existence and uniqueness of the various objects in this model, see Darolles et al. [DFFR11].
² Especially for applications in reinforcement learning, another natural setting is that of Markov noise, which we handle in a companion paper.

Here we use $\|\cdot\|$ to denote the Hilbert norm associated with $\mathbb{X}$. The three terms appearing on the right-hand side of inequality (4) all have concrete interpretations. The term

$$\mathcal{A}(\mathbb{S}, v^*) := \inf_{v \in \mathbb{S}} \|v - v^*\|^2 \tag{5}$$

defines the approximation error; this is the error incurred by an oracle procedure that knows the fixed point $v^*$ in advance and aims to output the best approximation to $v^*$ within the subspace $\mathbb{S}$. The term $\alpha$ is the approximation factor, which indicates how poorly the estimator $\hat{v}_n$ performs at carrying out the aforementioned approximation; note that $\alpha \geq 1$ by definition, and it is most desirable for $\alpha$ to be as small as possible. The final term $\varepsilon_n$ is a proxy for the statistical error incurred due to our stochastic observation model; indeed, one expects that as the sample size $n$ goes to infinity, this error should tend to zero for any reasonable estimator, indicating consistent estimation when $v^* \in \mathbb{S}$. More generally, we would like our estimator to also have as small a statistical error as possible in terms of the other parameters that define the problem instance.

In an ideal world, we would like both desiderata to hold simultaneously: the approximation factor should be as close to one as possible while the statistical error stays as small as possible. As we discuss shortly, such a “best-of-both-worlds” guarantee can indeed be obtained in many canonical problems, and “sharp” oracle inequalities (meaning ones in which the approximation factor is equal to 1) are known [RST17, DS12]. On the other hand, such oracle inequalities with unit factors are not known for the fixed point equation (1). Tsitsiklis and Van Roy [TVR97] show that if the operator $L$ is $\gamma_{\max}$-contractive in the norm $\|\cdot\|$, then the (deterministic) solution $\bar{v}$ to the projected fixed point equation (2) satisfies the bound

$$\|\bar{v} - v^*\|^2 \leq \frac{1}{1 - \gamma_{\max}^2} \inf_{v \in \mathbb{S}} \|v - v^*\|^2. \tag{6}$$

The bound (6) has a potentially large approximation factor that can be quite far from one (as would be the case for a “sharp” oracle inequality). One motivating question for our work is whether or not this bound can be improved, and if so, to what extent.³

Our work is also driven by the complementary question of whether a sharp bound can be obtained on the statistical error of an estimator that, unlike $\bar{v}$, has access only to the samples $\{(L_i, b_i)\}_{i=1}^{n}$. In particular, we would like the statistical error $\varepsilon_n$ to depend on some notion of complexity within the subspace $\mathbb{S}$, and not on the ambient space. Recent work by Bhandari et al. [BRS18] provides worst-case bounds on the statistical error of a stochastic approximation scheme, showing that the parametric rate $\varepsilon_n \lesssim d/n$ is attainable. In this paper, we study how to derive a more fine-grained bound on the statistical error that reflects the practical performance of the algorithm and depends optimally on the geometry of our problem instance.
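For intuition about where the factor in the bound (6) comes from, the following short derivation may help. It is a standard sketch that we supply here, not an argument quoted from [TVR97]. It writes $\bar{v}$ for the solution of the projected equation (2) and uses only that $\Pi_{\mathbb{S}}$ is an orthogonal projection and that $L$ is $\gamma_{\max}$-contractive.

```latex
% Step 1: subtract the projected version of v^* = L v^* + b from (2).
% Step 2: apply the Pythagorean identity, since
%         \bar{v} - \Pi_S v^* lies in S and \Pi_S v^* - v^* lies in S^\perp.
\begin{align*}
\bar{v} - \Pi_{\mathbb{S}} v^*
  &= \Pi_{\mathbb{S}}(L\bar{v} + b) - \Pi_{\mathbb{S}}(Lv^* + b)
   = \Pi_{\mathbb{S}} L(\bar{v} - v^*), \\
\|\bar{v} - v^*\|^2
  &= \|\bar{v} - \Pi_{\mathbb{S}} v^*\|^2 + \|\Pi_{\mathbb{S}} v^* - v^*\|^2 \\
  &\leq \gamma_{\max}^2 \|\bar{v} - v^*\|^2 + \inf_{v \in \mathbb{S}} \|v - v^*\|^2 .
\end{align*}
```

Rearranging the last display gives the factor $1/(1 - \gamma_{\max}^2)$. The only lossy step is the bound $\|\Pi_{\mathbb{S}} L(\bar{v} - v^*)\| \leq \gamma_{\max} \|\bar{v} - v^*\|$, which ignores how $L$ interacts with the particular subspace $\mathbb{S}$.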

1.1 Contributions and organization

The main contribution of this paper is to resolve both of the aforementioned questions, in particular by deriving upper bounds and information-theoretic lower bounds on both the approximation factor and statistical error that are instance-dependent. On one hand, these bounds demonstrate that in most cases, the optimal oracle inequality no longer has approximation factor 1 in our general setting, but on the other hand, that the optimal approximation factor can be much better in many cases than what is suggested by the worst-case bound (6). We also derive a significantly sharper bound on the statistical error of a stochastic approximation scheme that is instance-optimal in a precise sense. In more detail, we present the following results:

• Theorem 1 establishes an instance-dependent risk upper bound of the form (4) for the Polyak–Ruppert averaged stochastic approximation estimator, whose approximation factor $\alpha$ depends in a precise way on the projection of the operator $L$ onto the subspace $\mathbb{S}$, and the statistical error $\varepsilon_n$ matches the Cramér–Rao lower bound for the instance within the subspace.
• In Theorem 2, we prove an information-theoretic lower bound on the approximation factor. It is a local analysis, in that the bound depends critically on the projection of the population-level operator. This lower bound certifies that the approximation factor attained by our estimator is optimal. To the best of our knowledge, this is also the first instance of an optimal oracle inequality with a non-constant and problem-dependent approximation factor.
• In Theorem 3, we establish via a Bayesian Cramér–Rao lower bound that the leading statistical error term for our estimator is also optimal in an instance-dependent sense.
• In Section 4, we derive specific consequences of our results for several examples, including for the problem of Galerkin approximation in second-order elliptic equations and temporal difference methods for policy evaluation with linear function approximation. A particular consequence of our results shows that in a minimax sense, the approximation factor (6) is optimal for policy evaluation with linear function approximation (cf. Proposition 1).

The remainder of this paper is organized as follows. Section 1.2 contains a detailed discussion of related work. We introduce formal background and specific examples in Section 2. Our main results under the general model of projected fixed point equations are introduced and discussed in Section 3. We then specialize these results to our examples in Section 4, deriving several concrete corollaries for Galerkin methods and temporal difference methods. Our proofs are postponed to Section 5, and technical results are deferred to the appendix.

³ Note that one can achieve an approximation factor arbitrarily close to one provided that $n \gg D$. One way to do so is as follows: form the plug-in estimate that solves the original fixed point relation (1) on the sample averages $\frac{1}{n}\sum_{i=1}^{n} L_i$ and $\frac{1}{n}\sum_{i=1}^{n} b_i$, and then project this solution onto the subspace $\mathbb{S}$. In this paper, our principal interest, driven by the practical examples of Galerkin approximation and temporal difference learning, is in the regime $d \ll n \ll D$.

1.2 Related work

Our paper touches on various lines of related work, including oracle inequalities for statistical estimation, stochastic approximation and its application to reinforcement learning, and projected linear equation methods. We provide a brief discussion of these connections here.

Oracle inequalities: There is a large literature on misspecified statistical models and oracle inequalities (e.g., see the monographs [Mas07, Kol11] for overviews). Oracle inequalities in the context of penalized empirical risk minimization (ERM) are quite well-understood (e.g., [BBM05, Kol06, MN06]). Typically, the resulting approximation factor is exactly 1 or arbitrarily close to 1, and the statistical error term depends on the localized Rademacher complexity or metric entropy of the function class. Aggregation methods have been developed in order to obtain sharp oracle inequalities with approximation factor exactly 1 (e.g., [Tsy04, BTW07b, DS12, RST17]). Sharp oracle inequalities are now available in a variety of settings including for sparse linear models [BTW07a], density estimation [DS18], graphon estimation [KTV17], and shape-constrained estimation [Bel18]. As previously noted, our setting differs qualitatively from the ERM setting, in that as shown in this paper, sharp oracle inequalities are no longer possible. There is another related line of work on oracle inequalities for density estimation. Yatracos [Yat85] showed an oracle inequality with the non-standard approximation factor 3, and with a statistical error term depending on the metric entropy. This non-unit approximation factor was later shown to be optimal for the class of one-dimensional piecewise constant densities [CDSS14, BKM19, ZJT20]. The approximation factor lower bounds in these papers and in our work both make use of the birthday paradox to establish information-theoretic lower bounds.

Stochastic approximation: Stochastic approximation algorithms for linear and nonlinear fixed-point equations have played a central role in large-scale machine learning and statistics [RM51, Lai03, NJLS09]. See the books [BMP12, Bor09] for a comprehensive survey of the classical methods of analysis. The seminal works by Polyak, Ruppert, and Juditsky [Pol90, PJ92, Rup88] propose taking the average of the stochastic approximation iterates, which stabilizes the algorithm and achieves a Gaussian limiting distribution. This asymptotic result is also known to achieve the local asymptotic minimax lower bound [DR16]. Non-asymptotic guarantees matching this asymptotic behavior have also been established for stochastic approximation algorithms and their variance-reduced variants [MB11, KPR+20, MLW+20, LMWJ20].

Stochastic approximation is also a fundamental building block for reinforcement learning algorithms, wherein the method is used to produce an iterative, online solution to the Bellman equation from data; see the books [Sze10, Ber19] for a survey. Such approaches include temporal difference (TD) methods [Sut88] for the policy evaluation problem and the Q-learning algorithm [WD92] for policy optimization. Variants of these algorithms also abound, including LSTD [Boy02], SARSA [RN94], actor-critic algorithms [KT00], and gradient TD methods [SMP+09]. The analysis of these methods has received significant attention in the literature, ranging from asymptotic guarantees (e.g., [BB96, TVR97, TVR99]) to more fine-grained finite-sample bounds (e.g., [BRS18, SY19, LS18, PW20, Wai19b, Wai19c]). Our work contributes to this literature by establishing finite-sample upper bounds for temporal difference methods with Polyak–Ruppert averaging, as applied to the policy evaluation problem with linear function approximation.

Projected methods for linear equations: Galerkin [Gal15] first proposed the method of approximating the solution to a linear PDE by solving the projected equation in a finite-dimensional subspace. This method later became a cornerstone of finite-element methods in numerical methods for PDEs; see the books [Fle84, BS07] for a comprehensive survey. A fundamental tool used in the analysis of Galerkin methods is Céa's lemma [Céa64], which corresponds to a special case of the approximation factor upper bounds that we establish. As mentioned before, projected linear equations were also considered independently by Tsitsiklis and Van Roy [TVR97] in the context of reinforcement learning; they established the worst-case upper bound (6) on the approximation factor under contractivity assumptions. These contraction-based bounds were further extended to the analysis of Q-learning in optimal stopping problems [TVR99]. The connection between the Galerkin method and TD methods was discovered by Yu and Bertsekas [YB10, Ber11], and the former paper shows an instance-dependent upper bound on the approximation factor. This analysis was later applied to Monte–Carlo methods for solving linear inverse problems [PWB09, PWB12].

We note that the Bellman equation can be written in infinitely many equivalent ways (by using powers of the transition kernel and via the formalism of resolvents), leading to a continuous family of projected equations indexed by a scalar parameter $\lambda$ (see, e.g., Section 5.5 of Bertsekas [Ber19]). Some of these forms can be specifically leveraged in other observation models; for instance, by observing the trajectory of the Markov chain instead of i.i.d. samples, it becomes possible to obtain unbiased observations for integer powers of the transition kernel. This makes it possible to efficiently estimate the solution to the projected linear equation for various values of $\lambda$, and underlies the family of TD($\lambda$) methods [Sut88, Boy02]. Indeed, Tsitsiklis and Van Roy [TVR97] also showed that the worst-case approximation factor in equation (6) can be improved by using larger values of $\lambda$. Based on this observation, a line of work has studied the trade-off between approximation error and estimation error in model selection for reinforcement learning problems [Ber16, Sch10, MS08, VR06]. However, unlike this body of work, our focus in the current paper is on studying the i.i.d. observation model; we postpone a detailed investigation of the Markov setting to a companion paper.

1.3 Notation

Here we summarize some notation used throughout the paper. For a positive integer $m$, we define the set $[m] := \{1, 2, \ldots, m\}$. For any pair $(\mathbb{X}, \mathbb{Y})$ of real Hilbert spaces and a linear operator $A : \mathbb{X} \to \mathbb{Y}$, we denote by $A^* : \mathbb{Y} \to \mathbb{X}$ the adjoint operator of $A$, which by definition satisfies $\langle Ax, y \rangle = \langle x, A^* y \rangle$ for all $(x, y) \in \mathbb{X} \times \mathbb{Y}$. For a bounded linear operator $A$ from $\mathbb{X}$ to $\mathbb{Y}$, we define its operator norm as

$$|||A|||_{\mathbb{X} \to \mathbb{Y}} := \sup_{x \in \mathbb{X} \setminus \{0\}} \frac{\|Ax\|_{\mathbb{Y}}}{\|x\|_{\mathbb{X}}}.$$

We use the shorthand notation $|||A|||_{\mathbb{X}}$ to denote its operator norm when $A$ is a bounded linear operator mapping $\mathbb{X}$ to itself. When $\mathbb{X} = \mathbb{R}^{d_1}$ and $\mathbb{Y} = \mathbb{R}^{d_2}$ are finite-dimensional Euclidean spaces equipped with the standard inner product, we denote by $|||A|||_{\mathrm{op}}$ the operator norm in this case. We also use $\|\cdot\|_2$ to denote the standard Euclidean norm, in order to distinguish it from the Hilbert norm $\|\cdot\|$.

For a random object $X$, we use $\mathcal{L}(X)$ to denote its probability law. Given a vector $\mu \in \mathbb{R}^d$ and a positive semi-definite matrix $\Sigma \in \mathbb{R}^{d \times d}$, we use $\mathcal{N}(\mu, \Sigma)$ to denote the Gaussian distribution with mean $\mu$ and covariance $\Sigma$. We use $\mathcal{U}(\Omega)$ to denote the uniform distribution over a set $\Omega$. Given a Polish space $\mathcal{S}$ and a positive measure $\mu$ associated to its Borel $\sigma$-algebra, for $p \in [1, +\infty)$, we define

$$L^p(\mathcal{S}, \mu) := \Big\{ f : \mathcal{S} \to \mathbb{R}, \ \|f\|_{L^p} := \Big( \int_{\mathcal{S}} |f|^p \, d\mu \Big)^{1/p} < +\infty \Big\}.$$

When $\mathcal{S}$ is a subset of $\mathbb{R}^d$ and $\mu$ is the Lebesgue measure, we use the shorthand notation $L^p(\mathcal{S})$. For a point $x \in \mathbb{R}^d$, we use $\delta_x$ to denote the Dirac $\delta$-function at point $x$.

We use $\{e_j\}_{j=1}^{d}$ to denote the standard basis vectors in the Euclidean space $\mathbb{R}^d$, i.e., $e_j$ is a vector with a 1 in the $j$-th coordinate and zeros elsewhere. For two matrices $A \in \mathbb{R}^{d_1 \times d_2}$ and $B \in \mathbb{R}^{d_3 \times d_4}$, we denote by $A \otimes B$ their Kronecker product, a $d_1 d_3 \times d_2 d_4$ real matrix. For symmetric matrices $A, B \in \mathbb{R}^{d \times d}$, we use $A \preceq B$ to denote the fact that $B - A$ is a positive semi-definite matrix, and write $A \prec B$ when $B - A$ is positive definite.

For a positive integer $d$ and indices $i, j \in [d]$, we denote by $E_{ij}$ a $d \times d$ matrix with a 1 in the $(i, j)$ position and zeros elsewhere. More generally, given a set $\mathcal{S}$ and $s_1, s_2 \in \mathcal{S}$, we define $E_{s_1, s_2}$ to be the linear operator such that $E_{s_1, s_2} f(x) := f(s_2) 1_{x = s_1}$ for all $f : \mathcal{S} \to \mathbb{R}$.
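For readers who prefer to see the finite-dimensional notation in executable form, here is a small NumPy sketch. The helper names `loewner_leq` and `E` are ours and purely illustrative, and the code uses zero-based indices whereas the text uses one-based indices.

```python
import numpy as np

# Operator norm |||A|||_op of a matrix: its largest singular value.
A = np.array([[3.0, 0.0], [0.0, 1.0]])
op_norm = np.linalg.norm(A, 2)

# Kronecker product of a (2 x 3) and a (4 x 5) matrix has shape (8 x 15).
K = np.kron(np.ones((2, 3)), np.ones((4, 5)))

def loewner_leq(A_sym, B_sym, tol=1e-12):
    """Return True when B_sym - A_sym is positive semi-definite (A ⪯ B)."""
    return bool(np.linalg.eigvalsh(B_sym - A_sym).min() >= -tol)

def E(i, j, d):
    """d x d matrix with a 1 in position (i, j) and zeros elsewhere."""
    out = np.zeros((d, d))
    out[i, j] = 1.0
    return out

# (E_ij f)(x) = f(j) * 1{x = i}: entry j of f is copied into slot i.
f = np.array([10.0, 20.0, 30.0])
picked = E(0, 2, 3) @ f
print(op_norm, K.shape, picked)
```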

2 Background

We begin by formulating the projected fixed point problem more precisely in Section 2.1. Section 2.2 provides illustrations of this general set-up with some concrete examples.

2.1 Problem formulation

Consider a separable Hilbert space $\mathbb{X}$ with (possibly infinite) dimension $D$, equipped with the inner product $\langle \cdot, \cdot \rangle$. Let $\mathfrak{L}$ denote the set of all bounded linear operators mapping $\mathbb{X}$ to itself. Given one such operator $L \in \mathfrak{L}$ and some $b \in \mathbb{X}$, we consider the fixed point relation $v = Lv + b$, as previously defined in equation (1). We assume that the operator $I - L$ has a bounded inverse, which guarantees the existence and uniqueness of the fixed point satisfying equation (1). We let $v^*$ denote this unique solution.

As previously noted, in general, solving a fixed point equation in the Hilbert space can be computationally challenging. Consequently, a natural approach is to seek approximations to the fixed point $v^*$ based on searching over a finite-dimensional subspace of the full Hilbert space. More precisely, given some $d$-dimensional subspace $\mathbb{S}$ of $\mathbb{X}$, we seek to solve the projected fixed point equation (2).

Existence and uniqueness of projected fixed point: For concreteness in analysis, we are interested in problems for which the projected fixed point equation has a unique solution. Here we provide a sufficient condition for such existence and uniqueness. In doing so and for future reference, it is helpful to define some mappings between $\mathbb{X}$ and the subspace $\mathbb{S}$. Let us fix some orthonormal basis $\{\phi_j\}_{j \geq 1}$ of the full space $\mathbb{X}$ such that $\mathbb{S} = \mathrm{span}\{\phi_1, \ldots, \phi_d\}$. In terms of this basis, we can define the projection operator $\Phi_d : \mathbb{X} \to \mathbb{R}^d$ via $\Phi_d(x) := \big(\langle x, \phi_j \rangle\big)_{j=1}^{d}$. The adjoint operator of $\Phi_d$ is a mapping from $\mathbb{R}^d$ to $\mathbb{X}$, given by

$$\Phi_d^*(v) := \sum_{j=1}^{d} v_j \phi_j. \tag{7}$$

Using these operators, we can define the projected operator associated with $L$, namely

$$M := \Phi_d L \Phi_d^*. \tag{8}$$

Note that $M$ is simply a $d \times d$ matrix, one which describes the action of $L$ on $\mathbb{S}$ according to the basis that we have chosen. As we will see in the main theorems, our results do not depend on the specific choice of the orthonormal basis, but it is convenient to use a given one, as we have done here. Consider the quantity

$$\kappa(M) := \tfrac{1}{2} \lambda_{\max}\big(M + M^\top\big), \tag{9}$$

corresponding to the maximal eigenvalue of the symmetrized version of $M$. One sufficient condition for there to be a unique solution to the fixed point equation (2) is the bound $\kappa(M) < 1$. When this bound holds, every nonzero $u \in \mathbb{R}^d$ satisfies $u^\top (I_d - M) u \geq (1 - \kappa(M)) \|u\|_2^2 > 0$, so the matrix $(I_d - M)$ is invertible, and hence for any $b \in \mathbb{X}$, there is a unique solution $\bar{v}$ to the equation $\bar{v} = \Pi_{\mathbb{S}}(L\bar{v} + b)$.
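The condition $\kappa(M) < 1$ is weaker than contractivity of $M$. The following sketch illustrates this on a hand-picked $3 \times 3$ instance of our own choosing (not taken from the paper), in which $M$ is skew-symmetric: its operator norm is 5, yet $\kappa(M) = 0$ and the projected system is uniquely solvable. The helper names `projected_matrix` and `kappa` are hypothetical.

```python
import numpy as np

def projected_matrix(L, Phi):
    """M = Phi L Phi^T, where the rows of Phi are an orthonormal basis of S."""
    return Phi @ L @ Phi.T

def kappa(M):
    """Largest eigenvalue of the symmetrized matrix (M + M^T) / 2."""
    return 0.5 * np.linalg.eigvalsh(M + M.T).max()

L = np.array([[0.0, 5.0, 0.0],
              [-5.0, 0.0, 0.0],
              [0.0, 0.0, 0.5]])
Phi = np.eye(3)[:2]                 # S = span{e_1, e_2}
M = projected_matrix(L, Phi)        # [[0, 5], [-5, 0]]

k = kappa(M)                        # symmetrized part vanishes
op = np.linalg.norm(M, 2)           # operator norm of M

# Coefficients of the projected fixed point: (I - M) theta = Phi b.
b = np.array([1.0, 2.0, 3.0])
theta_bar = np.linalg.solve(np.eye(2) - M, Phi @ b)
print(k, op, theta_bar)
```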

Stochastic observation model: As noted in the introduction, this paper focuses on an observation model in which we observe i.i.d. random pairs $(L_i, b_i)$ for $i = 1, \ldots, n$ that are unbiased estimates of the pair $(L, b)$, so that

$$\mathbb{E}[L_i] = L, \quad \text{and} \quad \mathbb{E}[b_i] = b. \tag{10}$$

In addition to this unbiasedness, we also assume that our observations satisfy a certain second-moment bound. A weaker and a stronger version of this assumption are both considered.

Assumption 1(W) (Second-moment bound in projected space). There exist scalars $\sigma_L, \sigma_b > 0$ such that for any unit-norm vector $u \in \mathbb{S}$ and any basis vector in $\{\phi_j\}_{j=1}^{d}$ we have the bounds

$$\mathbb{E}\langle \phi_j, (L_i - L)u \rangle^2 \leq \sigma_L^2 \|u\|^2, \quad \text{and} \tag{11a}$$

$$\mathbb{E}\langle \phi_j, b_i - b \rangle^2 \leq \sigma_b^2. \tag{11b}$$

Assumption 1(S) (Second-moment bound in ambient space). There exist scalars $\sigma_L, \sigma_b > 0$ such that for any unit-norm vector $u \in \mathbb{X}$ and any basis vector in $\{\phi_j\}_{j=1}^{D}$ we have the bounds

$$\mathbb{E}\langle \phi_j, (L_i - L)u \rangle^2 \leq \sigma_L^2 \|u\|^2, \quad \text{and} \tag{12a}$$

$$\mathbb{E}\langle \phi_j, b_i - b \rangle^2 \leq \sigma_b^2. \tag{12b}$$

In words, Assumption 1(W) guarantees that the random variable obtained by projecting the “noise” onto any of the basis vectors $\phi_1, \ldots, \phi_d$ in the subspace $\mathbb{S}$ has bounded second moment. Assumption 1(S) further requires the projected noise onto any basis vector of the entire space $\mathbb{X}$ to have bounded second moment. In Section 4, we show that there are various settings, including Galerkin methods and temporal difference methods, for which at least one of these assumptions is satisfied.
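To make the observation model concrete, the following sketch runs a linear stochastic approximation recursion with Polyak–Ruppert averaging directly in the $d$-dimensional coordinates of the subspace. It is an illustration under our own assumptions rather than the exact scheme analyzed later: the update rule, the constant step size, the burn-in length, and the Gaussian noise model are all hypothetical choices, and `M_t`, `h_t` stand in for $\Phi_d L_t \Phi_d^*$ and $\Phi_d b_t$.

```python
import numpy as np

def averaged_lsa(sample_pair, d, n, step, burn_in):
    """Average the iterates of theta <- (1 - step) theta + step (M_t theta + h_t)."""
    theta = np.zeros(d)
    total = np.zeros(d)
    for t in range(n):
        M_t, h_t = sample_pair()          # one unbiased observation, cf. (10)
        theta = (1.0 - step) * theta + step * (M_t @ theta + h_t)
        if t >= burn_in:
            total += theta                # Polyak-Ruppert averaging after burn-in
    return total / (n - burn_in)

rng = np.random.default_rng(0)
M = np.array([[0.3, 0.2, 0.0],
              [-0.2, 0.3, 0.1],
              [0.0, -0.1, 0.4]])          # here kappa(M) = 0.4 < 1
h = np.array([1.0, -1.0, 0.5])

def sample_pair():
    # Additive Gaussian noise with bounded second moments (illustrative only).
    return (M + 0.1 * rng.standard_normal((3, 3)),
            h + 0.1 * rng.standard_normal(3))

theta_bar = np.linalg.solve(np.eye(3) - M, h)      # population solution
theta_hat = averaged_lsa(sample_pair, d=3, n=20000, step=0.05, burn_in=2000)
err = np.linalg.norm(theta_hat - theta_bar)
print(err)
```

The averaged iterate lands close to the population solution of $\theta = M\theta + h$, whereas the last iterate alone would keep fluctuating at a scale set by the step size.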

2.2 Examples

We now present some concrete examples to illustrate our general formulation. In particular, we discuss the problems of linear regression, temporal difference learning methods from reinforcement learning⁴, and Galerkin methods for solving partial differential equations.

2.2.1 Linear regression on a low-dimensional subspace

Our first example is the linear regression model when the true parameter is known to lie approximately in a low-dimensional subspace. This example, while rather simple, provides a useful pedagogical starting point for the others to follow.

For this example, the underlying Hilbert space $\mathbb{X}$ from our general formulation is simply the Euclidean space $\mathbb{R}^D$, equipped with the standard inner product $\langle \cdot, \cdot \rangle$. We consider zero-mean covariates $X \in \mathbb{R}^D$ and a response $Y \in \mathbb{R}$, and our goal is to estimate the best-fitting linear model $x \mapsto \langle v, x \rangle$. In particular, the mean-square optimal fit is given by $v^* := \arg\min_{v \in \mathbb{R}^D} \mathbb{E}\big(Y - \langle v, X \rangle\big)^2$. From standard results on linear regression, this vector must satisfy the normal equations $\mathbb{E}[XX^\top] v^* = \mathbb{E}[YX]$. We assume that the second-moment matrix $\mathbb{E}[XX^\top]$ is non-singular, so that $v^*$ is unique.

Let us rewrite the normal equations in a form consistent with our problem formulation. An equivalent definition of $v^*$ is in terms of the fixed point relation

$$v^* = \Big(I - \frac{1}{\beta} \mathbb{E}[XX^\top]\Big) v^* + \frac{1}{\beta} \mathbb{E}[YX], \tag{13}$$

where $\beta := \lambda_{\max}(\mathbb{E}[XX^\top])$ is the maximum eigenvalue. This fixed point condition is a special case of our general equation (1) with the operator $L = I - \frac{1}{\beta} \mathbb{E}[XX^\top]$ and vector $b = \frac{1}{\beta} \mathbb{E}[YX]$. Note that we have

$$|||L|||_{\mathrm{op}} = \Big|\Big|\Big| I - \frac{1}{\beta} \mathbb{E}[XX^\top] \Big|\Big|\Big|_{\mathrm{op}} \leq 1 - \frac{\mu}{\beta} < 1,$$

where $\mu = \lambda_{\min}(\mathbb{E}[XX^\top]) > 0$ is the minimum eigenvalue of the second-moment matrix.

In the well-specified setting of linear regression, we observe i.i.d. pairs $(X_i, Y_i) \in \mathbb{R}^D \times \mathbb{R}$ that are linked by the standard linear model

$$Y_i = \langle v^*, X_i \rangle + \varepsilon_i, \quad \text{for } i = 1, 2, \ldots, n, \tag{14}$$

where $\varepsilon_i$ denotes zero-mean noise with finite second moment. Each such observation can be used to form the matrix-vector pair

$$L_i = I - \beta^{-1} X_i X_i^\top, \quad \text{and} \quad b_i = \beta^{-1} X_i Y_i,$$

which is in the form of our assumed observation model.

Thus far, we have simply reformulated linear regression as a fixed point problem. In order to bring in the projected aspect of the problem, let us suppose that the ambient dimension $D$ is much larger than the sample size $n$, but that we have the prior knowledge that $v^*$ lies (approximately) within a known subspace $\mathbb{S}$ of $\mathbb{R}^D$, say of dimension $d \ll D$. Our goal is then to approximate the solution to the associated projected fixed-point equation. Using $\{\phi_j\}_{j=1}^{d}$ to denote an orthonormal basis of $\mathbb{S}$, the population-level projected linear equation (2) in this case takes the form

$$\mathbb{E}\big[(\Pi_{\mathbb{S}} X)(\Pi_{\mathbb{S}} X)^\top\big] \bar{v} = \mathbb{E}[Y \cdot \Pi_{\mathbb{S}} X]. \tag{15}$$

Thus, the population-level projected problem (15) corresponds to performing linear regression using the projected version of the covariates, thereby obtaining a vector of weights $\bar{v} \in \mathbb{S}$ in this low-dimensional space.

⁴ As noted by Bradtke and Barto [BB96], this method can be understood as an instrumental variable method [Woo16], and our results also apply to this more general setting.
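The reformulation above can be checked numerically at the population level. The sketch below assumes a synthetic second-moment matrix `Sigma` and parameter `v_star` of our own making, builds the pair $(L, b)$ from equation (13), and verifies three claims from the text: that $v^*$ is a fixed point, that the operator norm of $L$ equals $1 - \mu/\beta$ for this symmetric instance, and that the projected fixed point agrees with least squares on the projected covariates as in (15).

```python
import numpy as np

rng = np.random.default_rng(1)
D, d = 20, 3

# Synthetic population quantities (illustrative assumptions).
G = rng.standard_normal((D, D))
Sigma = G @ G.T / D + 0.5 * np.eye(D)     # plays the role of E[X X^T]
v_star = rng.standard_normal(D)
cross = Sigma @ v_star                    # E[Y X] under the linear model (14)

eigs = np.linalg.eigvalsh(Sigma)          # ascending order
mu, beta = eigs[0], eigs[-1]
L = np.eye(D) - Sigma / beta
b = cross / beta

fixed_point_gap = np.linalg.norm(L @ v_star + b - v_star)
op_norm = np.linalg.norm(L, 2)

# Projected problem on a random d-dimensional subspace with orthonormal basis Q.
Q, _ = np.linalg.qr(rng.standard_normal((D, d)))
theta_fp = np.linalg.solve(np.eye(d) - Q.T @ L @ Q, Q.T @ b)   # from equation (2)
theta_ls = np.linalg.solve(Q.T @ Sigma @ Q, Q.T @ cross)       # from equation (15)
print(fixed_point_gap, op_norm, 1 - mu / beta)
```

The factor $1/\beta$ cancels between the two sides of the projected system, which is why `theta_fp` and `theta_ls` coincide.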

2.2.2 Galerkin methods for second-order elliptic equations

We now turn to the Galerkin method for solving differential equations, a technique briefly introduced in Section 1. The general problem is to compute an approximate solution to a partial differential equation based on a limited number of noisy observations for the coefficients. Stochastic inverse problems of this type arise in various scientific and engineering applications [Nic17, AMÖS19].

For concreteness, we consider a second-order elliptic equation with Dirichlet boundary conditions.⁵ Given a bounded, connected and open set $\Omega \subseteq \mathbb{R}^m$ with unit Lebesgue measure, let $\partial\Omega$ denote its boundary. Consider the Hilbert space of functions

$$\mathbb{X} := \Big\{ v : \Omega \to \mathbb{R}, \ \int_{\Omega} \|\nabla v(x)\|_2^2 \, dx < \infty, \ v|_{\partial\Omega} = 0 \Big\}$$

equipped with the inner product $\langle u, v \rangle_{\dot{H}^1} := \int_{\Omega} \nabla u(x)^\top \nabla v(x) \, dx$. Given a symmetric-matrix-valued function $a$ and a square-integrable function $f \in L^2$, the boundary-value problem is to find a function $v : \Omega \to \mathbb{R}$ such that

$$\begin{cases} \nabla \cdot (a(x) \nabla v(x)) + f(x) = 0 & \text{in } \Omega, \\ v(x) = 0 & \text{on } \partial\Omega. \end{cases} \tag{16}$$

⁵ It should be noted that Galerkin methods apply to a broader class of problems, including linear PDEs of parabolic and hyperbolic type [LT08], as well as kernel integral equations [PWB09, PWB12].

We impose a form of uniform ellipticity by requiring that $\mu I_m \preceq a(x) \preceq \beta I_m$, for some positive scalars $\mu \leq \beta$, valid uniformly over $x$. The problem can be equivalently stated in terms of the elliptic operator $A := -\nabla \cdot (a \nabla)$; as shown in Appendix C.3.1, the pair $(A, f)$ induces a bounded, self-adjoint linear operator $\tilde{A}$ on $\mathbb{X}$ and a function $g \in \mathbb{X}$ such that the solution to the boundary value problem can be written as

$$v^* = \Big(I - \frac{1}{\beta} \tilde{A}\Big) v^* + \beta^{-1} g. \tag{17}$$

By construction, this is now an instance of our general fixed point equation (1) with $L := I - \frac{1}{\beta} \tilde{A}$ and $b := \beta^{-1} g$. Furthermore, our assumptions imply that $|||L|||_{\mathbb{X}} \leq 1 - \frac{\mu}{\beta}$.

We consider a stochastic observation model that is standard in the literature (see, e.g., the paper [GN20]). Independently for each $i \in [n]$, let $W_i$ denote an $m \times m$ symmetric random matrix with entries on the diagonal and upper-diagonal given by i.i.d. standard Gaussian random variables. Let $w_i' \sim \mathcal{N}(0, 1)$ denote a standard Gaussian random variable. Suppose now that we observe the pair $x_i, y_i \sim \mathcal{U}(\Omega)$; the observed values for the $i$-th sample are then given by

$$(a_i, f_i) := \big(a(x_i) + W_i, \ f(y_i) + w_i'\big) \quad \text{with } x_i, y_i \sim \mathcal{U}(\Omega). \tag{18}$$

The unbiased observations $(L_i, b_i)$ can then be constructed by replacing $(a, f)$ with $(a_i \delta_{x_i}, f_i \delta_{y_i})$ in the constructions above.

For such problems, the finite-dimensional projection not only serves as a fast and cheap way to compute solutions from simulation [LWKP20], but also makes the solution stable and robust to noise [KKV11]. Given a finite-dimensional linear subspace $\mathbb{S} \subseteq \mathbb{X}$ spanned by orthogonal basis functions $(\phi_i)_{i=1}^{d}$, we consider the projected version of equation (17), with solution denoted by $\bar{v}$:

$$\bar{v} = \Pi_{\mathbb{S}}(L\bar{v} + b). \tag{19}$$

Straightforward calculation in conjunction with Lemma 17 shows that equation (19) is equivalent to the conditions

$$\bar{v} \in \mathbb{S}, \quad \text{and} \quad \langle \tilde{A}\bar{v}, \phi_j \rangle_{\dot{H}^1} = \langle g, \phi_j \rangle_{\dot{H}^1} \quad \text{for all } j \in [d], \tag{20}$$

with the latter equality better known as the Galerkin orthogonality condition in the literature [BS07].
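As a noiseless illustration of the Galerkin conditions, consider the one-dimensional case $\Omega = (0, 1)$. The sketch below assumes the standard weak form of (16), namely $\int a \, \bar{v}' \phi_j' \, dx = \int f \phi_j \, dx$ for each basis function, which is how we read condition (20); the sine basis, the midpoint quadrature, the coefficient $a(x) = 1 + x$, and the helper name `galerkin_sine` are all choices made for this example. For $f \equiv 1$, the exact solution is $v^*(x) = -x + \log(1 + x)/\log 2$, which allows a direct comparison.

```python
import numpy as np

def galerkin_sine(a, f, d, n_quad=4000):
    """Galerkin solution of -(a v')' = f on (0, 1), v(0) = v(1) = 0,
    in the span of sin(j*pi*x), j = 1..d, using midpoint quadrature."""
    x = (np.arange(n_quad) + 0.5) / n_quad            # quadrature nodes
    w = 1.0 / n_quad                                  # equal weights
    j = np.arange(1, d + 1)
    phi = np.sin(np.pi * np.outer(j, x))              # basis values, (d, n_quad)
    dphi = (np.pi * j)[:, None] * np.cos(np.pi * np.outer(j, x))
    K = (dphi * a(x)) @ dphi.T * w                    # K[j, k] = int a phi_j' phi_k'
    load = phi @ f(x) * w                             # load[j] = int f phi_j
    theta = np.linalg.solve(K, load)
    return lambda t: theta @ np.sin(np.pi * np.outer(j, np.atleast_1d(t)))

a = lambda x: 1.0 + x               # satisfies 1 <= a(x) <= 2, so mu = 1, beta = 2
f = lambda x: np.ones_like(x)

v_bar = galerkin_sine(a, f, d=15)
grid = np.linspace(0.0, 1.0, 201)
exact = -grid + np.log1p(grid) / np.log(2.0)
max_err = np.max(np.abs(v_bar(grid) - exact))
print(max_err)
```

With 15 basis functions the maximum error on the grid is already small, reflecting the approximation error term alone since no observation noise is present in this sketch.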

2.2.3 Temporal difference methods for policy evaluation

Our final example involves the policy evaluation problem in reinforcement learning. This is a special case of an instrumental variable method, as briefly introduced in Section 1. We require some additional terminology to describe the problem of policy evaluation. Consider a Markov chain on a state space $\mathcal{S}$ and a transition kernel $P : \mathcal{S} \times \mathcal{S} \to \mathbb{R}$. It becomes a discounted Markov reward process when we introduce a reward function $r : \mathcal{S} \to \mathbb{R}$, and discount factor $\gamma \in (0, 1)$. The goal of the policy evaluation problem is to estimate the value function, which is the expected, long-term, discounted reward accrued by running the process. The value function exists under mild assumptions such as boundedness of the reward, and is given by the solution to the Bellman equation $v^* = \gamma P v^* + r$, which is a fixed point equation of the form (1) with $L = \gamma P$ and $b = r$.
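For a finite state space the Bellman equation is a small linear system. The sketch below uses a three-state chain and reward vector that we made up for illustration; it solves the system directly and also by iterating the map $v \mapsto \gamma P v + r$, which converges because this map is a $\gamma$-contraction in the sup norm.

```python
import numpy as np

# A hypothetical three-state Markov reward process.
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.9, 0.1],
              [0.1, 0.0, 0.9]])      # rows sum to one
r = np.array([1.0, 0.0, -1.0])
gamma = 0.9

# Direct solve of v = gamma * P v + r, i.e. (I - gamma P) v = r.
v_star = np.linalg.solve(np.eye(3) - gamma * P, r)

# Fixed point iteration on the same equation.
v = np.zeros(3)
for _ in range(500):
    v = gamma * P @ v + r

print(v_star, np.max(np.abs(v - v_star)))
```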

Throughout our discussion, we assume that the transition kernel $P$ is ergodic and aperiodic, so that its stationary distribution $\xi$ is unique. We define $X$ to be the Hilbert space $L^2(S, \xi)$, and for any pair of vectors $v, v' \in X$, we define the inner product as follows:

$\langle v, v' \rangle := \int_S v(s)\, v'(s)\, d\xi(s)$.

In the special case of a finite state space, the Hilbert space $X$ is a finite-dimensional Euclidean space with dimension $D = |S|$ and equipped with a weighted $\ell_2$-norm. We consider the i.i.d. observation model in this paper$^6$. For each $i = 1, 2, \ldots, n$, suppose that we observe an independent tuple $(s_i, s_i^+, R_i(s_i))$, such that

$s_i \sim \xi$, $s_i^+ \sim P(s_i, \cdot)$, and $E[R_i(s_i) \mid s_i] = r(s_i)$. (21)

The i-th observation (Li, bi) is then obtained by plugging in these observations to compute unbiased estimates of P and r, respectively. A common practice in reinforcement learning is to employ function approximation, which in its simplest form involves solving a projected linear equation on a subspace. In particular, consider a set {ψ1, ψ2, ··· , ψd} of basis functions in X, and suppose that they are linearly independent on the support of ξ. We are interested in projections onto the subspace S = span(ψ1, . . . , ψd), and in solving the population-level projected fixed point equation (2), which takes the form

v¯ = ΠS(γP v¯ + r). (22)

The basis functions $\psi_i$ are not necessarily orthogonal, and it is common for the projection operation to be carried out in a somewhat non-standard fashion. In order to describe this, it is convenient to write equation (22) in the projected space. For each $s \in S$, let $\psi(s) = [\psi_1(s)\; \psi_2(s)\; \ldots\; \psi_d(s)]^\top$ denote a vector in $R^d$, and note that we may write $\bar{v}(s) = \psi(s)^\top \bar{\vartheta}$ for a vector of coefficients $\bar{\vartheta} \in R^d$. Now observe that equation (22) can be equivalently written in terms of the coefficient vector $\bar{\vartheta}$ as

$E_{s \sim \xi}[\psi(s)\psi(s)^\top]\, \bar{\vartheta} = \gamma\, E_{s \sim \xi}\big[ E_{s^+ \sim P(s, \cdot)}[\psi(s)\psi(s^+)^\top] \big]\, \bar{\vartheta} + E_{s \sim \xi}[r(s)\psi(s)]$. (23)

Equation (23) is the population relation underlying the canonical least-squares temporal difference (LSTD) learning method [BB96, Boy02].
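To make relation (23) concrete, the following is a minimal sketch for a finite state space, where the expectations reduce to weighted matrix products. The function names, the three-state chain, the rewards, and the features are hypothetical choices made purely for illustration; they do not come from the paper.

```python
import numpy as np


def lstd_population_coeffs(P, r, xi, Psi, gamma):
    """Solve the population relation (23) for the coefficient vector.

    P   : (D, D) row-stochastic transition matrix
    r   : (D,) reward vector
    xi  : (D,) stationary distribution of P
    Psi : (D, d) feature matrix; row s is psi(s)^T
    """
    W = np.diag(xi)                  # weights of the L^2(S, xi) inner product
    gram = Psi.T @ W @ Psi           # E_{s~xi}[psi(s) psi(s)^T]
    cross = Psi.T @ W @ P @ Psi      # E_{s~xi} E_{s+~P(s,.)}[psi(s) psi(s+)^T]
    rhs = Psi.T @ W @ r              # E_{s~xi}[r(s) psi(s)]
    return np.linalg.solve(gram - gamma * cross, rhs)


def project_onto_span(Psi, xi, v):
    """xi-weighted orthogonal projection of v onto the column span of Psi."""
    W = np.diag(xi)
    coeffs = np.linalg.solve(Psi.T @ W @ Psi, Psi.T @ W @ v)
    return Psi @ coeffs


# Hypothetical 3-state chain; xi = (1/4, 1/2, 1/4) is its stationary distribution.
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
xi = np.array([0.25, 0.5, 0.25])
r = np.array([1.0, 0.0, -1.0])
Psi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0]])
gamma = 0.9

theta_bar = lstd_population_coeffs(P, r, xi, Psi, gamma)
v_bar = Psi @ theta_bar              # projected fixed point evaluated at each state
```

In this sketch, `v_bar` satisfies the projected equation (22), which can be checked by comparing it with `project_onto_span(Psi, xi, gamma * P @ v_bar + r)`.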

3 Main results for general projected linear equations

Having set up the problem and illustrated it with some examples, we now turn to the statements of our main results. We begin in Section 3.1 by stating an upper bound on the mean-squared error of a stochastic approximation scheme that uses Polyak–Ruppert averaging. We then discuss the form of this upper bound for various classes of operator $L$, with a specific focus on producing transparent bounds on the approximation factor. Section 3.2 is devoted to information-theoretic lower bounds that establish the sharpness of our upper bound.

$^6$ As mentioned before, we undertake a more in-depth study of the case with Markov observations, which is particularly relevant to the MRP example, in a companion paper.

3.1 Upper bounds

In this section, we describe a standard stochastic approximation scheme for the problem, based on combining ordinary stochastic updates with Polyak–Ruppert averaging [Pol90, PJ92, Rup88]. In particular, given an oracle that provides observations $(L_i, b_i)$, consider the stochastic recursion parameterized by a positive stepsize $\eta$:

$v_{t+1} = (1 - \eta)\, v_t + \eta\, \Pi_S\big(L_{t+1} v_t + b_{t+1}\big)$, for $t = 1, 2, \ldots$. (24a)

This is a standard stochastic approximation scheme for attempting to solve the projected fixed point relation. In order to improve it, we use the standard device of applying Polyak–Ruppert averaging so as to obtain our final estimate. For a given sample size $n \ge 2$, our final estimate $\hat{v}_n$ is given by taking the average of these iterates from time $n_0$ to $n$, that is,

$\hat{v}_n := \frac{1}{n - n_0} \sum_{t = n_0 + 1}^{n} v_t$. (24b)

Here the “burn-in” time $n_0$ is an integer parameter to be specified. The stochastic approximation procedure (24) is defined in the entire space $X$; note that it can be equivalently written as iterates in the projected space $R^d$, via the recursion

$\vartheta_{t+1} = (1 - \eta)\, \vartheta_t + \eta\, \big(\Phi_d L_{t+1} \Phi_d^* \vartheta_t + \Phi_d b_{t+1}\big)$. (25)

The original iterates can be recovered by applying the adjoint operator, that is, $v_t = \Phi_d^* \vartheta_t$ for $t = 1, 2, \ldots$.
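The recursion (25) together with the averaging step (24b) can be sketched as follows. This is an illustrative sketch only: the oracle interface `sample_oracle`, the Gaussian noise model, the matrix `M`, the vector `h`, and the numerical step size are all assumptions made here, with `(M_t, h_t)` standing in for the projected observations $(\Phi_d L_t \Phi_d^*, \Phi_d b_t)$.

```python
import numpy as np


def averaged_sa(sample_oracle, theta_init, n, eta):
    """Recursion (25) followed by the averaging step (24b), with n0 = n // 2.

    sample_oracle() returns one pair (M_t, h_t), standing in for
    (Phi_d L_t Phi_d^*, Phi_d b_t).
    """
    n0 = n // 2
    theta = np.array(theta_init, dtype=float)    # plays the role of theta_1
    running_sum = np.zeros_like(theta)
    for t in range(1, n + 1):
        if t > n0:
            running_sum += theta                 # accumulate theta_t, t = n0+1, ..., n
        if t < n:
            M_next, h_next = sample_oracle()     # observation with index t + 1
            theta = (1 - eta) * theta + eta * (M_next @ theta + h_next)
    return running_sum / (n - n0)


# Hypothetical noisy oracle around a fixed pair (M, h).
rng = np.random.default_rng(0)
M = np.array([[0.5, 0.1],
              [0.0, 0.3]])
h = np.array([1.0, -1.0])


def noisy_oracle(noise_scale=0.1):
    M_t = M + noise_scale * rng.standard_normal(M.shape)
    h_t = h + noise_scale * rng.standard_normal(h.shape)
    return M_t, h_t


theta_hat = averaged_sa(noisy_oracle, np.zeros(2), n=20000, eta=0.05)
theta_bar = np.linalg.solve(np.eye(2) - M, h)    # solves theta = M theta + h
```

Here `theta_bar` is the coefficient vector of the projected fixed point for the noise-free pair, so `theta_hat - theta_bar` is the estimation error of the averaged iterate in the projected space.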

3.1.1 A finite-sample upper bound

Having introduced the algorithm itself, we are now ready to provide a guarantee on its error. Two matrices play a key role in the statement of our upper bound. The first is the $d$-dimensional matrix $M := \Phi_d L \Phi_d^*$ that we introduced in Section 2.1. We show that the mean-squared error is upper bounded by the approximation error $\inf_{v \in S} \|v - v^*\|^2$ along with a pre-factor of the form

$\alpha(M, s) = 1 + \lambda_{\max}\big( (I - M)^{-1} (s^2 I_d - M M^\top) (I - M)^{-\top} \big)$, (26)

for $s = |||L|||_{\mathrm{op}}$. Our bounds also involve the quantity $\kappa(M) = \frac{1}{2} \lambda_{\max}(M + M^\top)$, which we abbreviate by $\kappa$ when the underlying matrix $M$ is clear from the context. The second matrix is a covariance matrix, capturing the noise structure of our observations, given by

$\Sigma^* := \mathrm{cov}\big(\Phi_d(b_1 - b)\big) + \mathrm{cov}\big(\Phi_d(L_1 - L)\bar{v}\big)$.

This matrix, along with the constants $(\sigma_L, \sigma_b)$ from Assumption 1(W), arises in the definition of two additional error terms, namely

$E_n(M, \Sigma^*) := \frac{\mathrm{trace}\big( (I - M)^{-1} \Sigma^* (I - M)^{-\top} \big)}{n}$, and (27a)

$H_n(\sigma_L, \sigma_b, \bar{v}) := \frac{\sigma_L}{(1 - \kappa)^3} \Big(\frac{d}{n}\Big)^{3/2} \big( \|\bar{v}\|^2 \sigma_L^2 + \sigma_b^2 \big)$. (27b)

As suggested by our notation, the error $H_n(\sigma_L, \sigma_b, \bar{v})$ is a higher-order term, decaying as $n^{-3/2}$ in the sample size, whereas the quantity $E_n(M, \Sigma^*)$ is the dominant source of statistical error. With this notation, we have the following:

Theorem 1. Suppose that we are given $n$ i.i.d. observations $\{(L_i, b_i)\}_{i=1}^n$ that satisfy the noise conditions in Assumption 1(W). Then there are universal constants $(c_0, c)$ such that for any sample size $n \ge \frac{c_0 \sigma_L^2 d}{(1 - \kappa)^2} \log^2\Big( \frac{\|v_0 - \bar{v}\|^2 d}{1 - \kappa} \Big)$, running the algorithm (24) with

stepsize $\eta = \frac{1}{c_0 \sigma_L \sqrt{dn}}$, and burn-in period $n_0 = n/2$

yields an estimate $\hat{v}_n$ such that

$E\|\hat{v}_n - v^*\|^2 \le (1 + \omega) \cdot \alpha(M, |||L|||_X) \inf_{v \in S} \|v - v^*\|^2 + c \big(1 + \tfrac{1}{\omega}\big) \cdot \big[ E_n(M, \Sigma^*) + H_n(\sigma_L, \sigma_b, \bar{v}) \big]$, (28)

valid for any $\omega > 0$.

We prove this theorem in Section 5.1.
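As a reading aid, the quantities entering the bound (28) can be evaluated numerically from their definitions (26), (27a) and (27b). The sketch below does this for a small hypothetical instance; the function names and all numerical values are illustrative assumptions, and the universal constants $(c_0, c)$ of the theorem are not specified by the text, so no attempt is made to evaluate the full right-hand side.

```python
import numpy as np


def kappa(M):
    """kappa(M) = (1/2) * lambda_max(M + M^T)."""
    return 0.5 * np.linalg.eigvalsh(M + M.T).max()


def alpha(M, s):
    """Approximation factor (26)."""
    d = M.shape[0]
    R = np.linalg.inv(np.eye(d) - M)
    A = R @ (s ** 2 * np.eye(d) - M @ M.T) @ R.T
    return 1.0 + np.linalg.eigvalsh((A + A.T) / 2).max()   # symmetrize against round-off


def E_n(M, Sigma, n):
    """Leading statistical error term (27a)."""
    d = M.shape[0]
    R = np.linalg.inv(np.eye(d) - M)
    return np.trace(R @ Sigma @ R.T) / n


def H_n(M, sigma_L, sigma_b, v_norm, n):
    """Higher-order term (27b); v_norm is the norm of the projected fixed point."""
    d = M.shape[0]
    return (sigma_L / (1 - kappa(M)) ** 3) * (d / n) ** 1.5 * (
        v_norm ** 2 * sigma_L ** 2 + sigma_b ** 2
    )


# Hypothetical instance: M = 0.5 * I_2, operator norm bound s = 0.8.
M = 0.5 * np.eye(2)
approx_factor = alpha(M, 0.8)
stat_term = E_n(M, np.eye(2), n=100)
higher_order = H_n(M, sigma_L=1.0, sigma_b=1.0, v_norm=1.0, n=100)
```

For this instance the definitions give $\alpha = 1 + (0.64 - 0.25)/0.25 = 2.56$ and $E_n = 8/100$, which the code reproduces.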

A few comments are in order. First, the quantity $\alpha(M, |||L|||_X) \inf_{v \in S} \|v - v^*\|^2$ is an upper bound on the approximation error $\|\bar{v} - v^*\|^2$ incurred by the (deterministic) projected fixed point $\bar{v}$. The pre-factor $\alpha(M, |||L|||_X) \ge 1$ measures the instance-specific deficiency of $\bar{v}$ relative to an optimal approximating vector from the subspace, and we provide a more in-depth discussion of this factor in Section 3.1.2 to follow. Note that Theorem 1 actually provides a family of bounds, indexed by the free parameter $\omega > 0$. By choosing $\omega$ arbitrarily close to zero, we can make the pre-factor in front of $\inf_{v \in S} \|v - v^*\|^2$ arbitrarily close to $\alpha(M, |||L|||_X)$, albeit at the expense of inflating the remaining error terms. In Theorem 2 to follow, we prove that the quantity $\alpha(M, |||L|||_X)$ is, in fact, the smallest approximation factor that can be obtained in any such bound.

The latter two terms in the bound (28) correspond to estimation error that arises from estimating $\bar{v}$ based on a set of $n$ stochastic observations. While there are two terms here in principle, we show in Corollary 1 to follow that the estimation error is dominated by the term $E_n(M, \Sigma^*)$ under some natural assumptions. Note that the leading term $E_n(M, \Sigma^*)$ scales with the local complexity for estimating $\bar{v}$, and we show in Theorem 3 that this term is also information-theoretically optimal.

In the next subsection, we undertake a more in-depth exploration of the approximation factor in this problem, discussing prior work in the context of the term $\alpha(M, |||L|||_X)$ appearing in Theorem 1.

3.1.2 Detailed discussion of the approximation error

As mentioned in the introduction, upper bounds on the approximation factor have received significant attention in the literature, and it is interesting to compare our bounds.

Past results: In the case where $\gamma_{\max} := |||L|||_X < 1$, the approximation-factor bound (6) was established by Tsitsiklis and Van Roy [TVR97], via the following argument. Letting $\tilde{v} := \Pi_S(L v^* + b)$, we have

$\|\bar{v} - v^*\|^2 \overset{(i)}{=} \|\bar{v} - \tilde{v}\|^2 + \|\tilde{v} - v^*\|^2 = \|\Pi_S(L\bar{v} + b) - \Pi_S(L v^* + b)\|^2 + \|\tilde{v} - v^*\|^2$
$\overset{(ii)}{\le} \|L\bar{v} - L v^*\|^2 + \|\tilde{v} - v^*\|^2$
$\overset{(iii)}{\le} \gamma_{\max}^2 \|\bar{v} - v^*\|^2 + \|\tilde{v} - v^*\|^2$. (29)

Step (i) uses the Pythagorean theorem; step (ii) follows from the non-expansiveness of the projection operator; and step (iii) makes use of the contraction property of the operator $L$. Note that by definition, we have $\alpha(M, |||L|||_X) \le (1 - |||L|||_X)^{-2}$, and so the approximation factor in Theorem 1 recovers the bound (6) in the worst case. In general, however, the factor $\alpha(M, |||L|||_X)$ can be significantly smaller. Yu and Bertsekas [YB10] derived two fine-grained upper bounds on the approximation factor; in terms of our notation, their bounds take the form

$\alpha_{YB}^{(1)} := 1 + |||L|||_X^2 \cdot \lambda_{\max}\big( (I - M)^{-1} (I - M)^{-\top} \big)$,
$\alpha_{YB}^{(2)} := 1 + |||(I - \Pi_S L)^{-1} \Pi_S L \Pi_{S^\perp}|||_X^2$.

It is clear from the definition that $\alpha(M, |||L|||_X) \le \alpha_{YB}^{(1)}$, but $\alpha(M, |||L|||_X)$ can often provide an improved bound. This improvement is indeed significant, as will be shown shortly in Lemma 1. On the other hand, the term $\alpha_{YB}^{(2)}$ is never larger than $\alpha(M, |||L|||_X)$, and is indeed the smallest possible bound that depends only on $L$ and not $b$. However, as pointed out by Yu and Bertsekas, the value of $\alpha_{YB}^{(2)}$ is not easily accessible in practice, since it depends on the precise behavior of the operator $L$ over the orthogonal complement $S^\perp$. Thus, estimating the quantity $\alpha_{YB}^{(2)}$ requires $O(D)$ samples. In contrast, the term $\alpha(M, |||L|||_X)$ depends only on the projected operator $M$ and the operator norm $|||L|||_X$. The former can be easily estimated using $d$ samples and at smaller computational cost, while the latter is usually known a priori. The discussion in Section 4 to follow fleshes out these distinctions.
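The ordering $\alpha_{YB}^{(2)} \le \alpha(M, |||L|||_X) \le \alpha_{YB}^{(1)}$ can be illustrated numerically in a finite-dimensional Euclidean setting. The sketch below is an assumption-laden toy example, not the simulation from the paper: it takes $X = R^D$ with the standard inner product, lets $S$ be the span of the first $d$ coordinates (so that $M$ is the top-left $d \times d$ block of $L$), and draws $L$ as a rescaled Gaussian matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
D, d = 30, 4                                   # hypothetical ambient / projected dimensions

L = rng.standard_normal((D, D))
L *= 0.9 / np.linalg.norm(L, 2)                # rescale so that the operator norm is 0.9
s = np.linalg.norm(L, 2)

M = L[:d, :d]                                  # Phi_d L Phi_d^* when S = first d coordinates
R = np.linalg.inv(np.eye(d) - M)


def lam_max(A):
    """Largest eigenvalue of a matrix that is symmetric up to round-off."""
    return np.linalg.eigvalsh((A + A.T) / 2).max()


# Approximation factor (26) and the first Yu-Bertsekas factor.
alpha = 1 + lam_max(R @ (s ** 2 * np.eye(d) - M @ M.T) @ R.T)
alpha_yb1 = 1 + s ** 2 * lam_max(R @ R.T)

# Second Yu-Bertsekas factor, which needs L on the orthogonal complement of S.
Pi_S = np.zeros((D, D))
Pi_S[:d, :d] = np.eye(d)
Pi_perp = np.eye(D) - Pi_S
op = np.linalg.inv(np.eye(D) - Pi_S @ L) @ Pi_S @ L @ Pi_perp
alpha_yb2 = 1 + np.linalg.norm(op, 2) ** 2
```

Note that `alpha` and `alpha_yb1` only use the $d \times d$ block `M` and the norm `s`, whereas `alpha_yb2` requires the full $D \times D$ matrix, mirroring the accessibility distinction discussed above.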

A simulation study: In order to compare different upper bounds on the approximation factor, we conducted a simple simulation study on the problem of value function estimation, as previously introduced in Section 2.2.3. For this problem, the approximation factor $\alpha(M, \gamma)$ is computed more explicitly in Corollary 5. The Markov transition kernel is given by the simple random walk on a graph. We consider Gaussian random feature vectors and associate them with two different random graph models, Erdős–Rényi graphs and random geometric graphs, respectively. The details for these models are described and discussed in Appendix D.

In Figure 1, we show the simulation results for the values of the approximation factor. Given a sample from the above graphs and feature vectors, we plot the values of $\alpha(M, \gamma)$, $\alpha_{YB}^{(1)}$ and $\alpha_{YB}^{(2)}$ against the discount rate $1 - \gamma$, which ranges from $10^{-5}$ to $10^{-0.5}$. Note that the two plots use different scales: Panel (a) is a linear-log plot, whereas panel (b) is a log-log plot. Figure 1 shows that the approximation factor $\alpha(M, \gamma)$ derived in Theorem 1 is always between $\alpha_{YB}^{(1)}$ and $\alpha_{YB}^{(2)}$. As mentioned before, the latter quantity depends on the particular behavior of the linear operator $L$ in the subspace $S^\perp$, which can be difficult to estimate. The improvement over $\alpha_{YB}^{(1)}$, on the other hand, can be significant.

In the Erdős–Rényi model, all three quantities are bounded by a relatively small constant, regardless of the value of $\gamma$. The bound $\alpha(M, \gamma)$ is roughly at the midpoint between the bounds $\alpha_{YB}^{(1)}$ and $\alpha_{YB}^{(2)}$. The differences are much starker in the random geometric graph case: the bound $\alpha(M, \gamma)$ improves over $\alpha_{YB}^{(1)}$ by several orders of magnitude, while being off from $\alpha_{YB}^{(2)}$ by a factor of 10 for large $\gamma$. As we discuss shortly in Lemma 1, this is because the approximation factor $\alpha(M, \gamma)$ scales as $O\big(\frac{1}{1 - \kappa(M)}\big)$ while $\alpha_{YB}^{(1)}$ scales as $O\big(\frac{1}{(1 - \kappa(M))^2}\big)$, making a big difference for the case with large correlation.

[Figure 1 (image not reproduced). Panel (a): "Approximation factor bounds in Erdős–Rényi random graphs". Panel (b): "Approximation factor bounds in soft random geometric graphs". Each panel plots the approximation factors $\alpha(M, \gamma)$, $\alpha_{YB}^{(1)}$ and $\alpha_{YB}^{(2)}$ (vertical axis) against the discount rate $1 - \gamma$ (horizontal axis).]

Figure 1. Plots of various approximation factors as a function of the discount factor $\gamma$ in the policy evaluation problem. (See the text for a discussion.) (a) Results for an Erdős–Rényi random graph model with $N = 3000$, projected dimension $d = 1000$, and $a = 3$. The resulting number of vertices in the graph $\tilde{G}$ is 2813. The value of $1 - \gamma$ is plotted in log-scale, and the value of the approximation factor is plotted on the standard scale. (b) Results for a random geometric graph model with $N = 3000$, projected dimension $d = 2$, and $r = 0.1$. The resulting number of vertices in the graph $\tilde{G}$ is 2338. Both the discount rate $1 - \gamma$ and the approximation factor are plotted on the log-scale.

Some useful bounds on α(M, |||L|||X): We conclude our discussion of the approximation factor with some bounds that can be derived under different assumptions on the operator L and its projected version M. The following lemma is useful in understanding the behavior of the approximation factor as a function of the contractivity properties of the operator L; this is particularly useful in analyzing convergence rates in numerical PDEs.

Lemma 1. Consider a projected matrix $M \in R^{d \times d}$ such that $(I - M)$ is invertible and $\kappa(M) < 1$.

(a) For any $s > 0$, we have the bound

$\alpha(M, s) \le 1 + |||(I - M)^{-1}|||_{\mathrm{op}}^2 \cdot s^2 \le 1 + \frac{s^2}{(1 - \kappa(M))^2}$. (30a)

(b) For $s \in [0, 1]$, we have

$\alpha(M, s) \le 1 + 2\,|||(I - M)^{-1}|||_{\mathrm{op}} \le 1 + \frac{2}{1 - \kappa(M)}$. (30b)

See Appendix B.1 for the proof of this lemma. A second special case, also useful, is when the matrix M is symmetric, a setting that appears in least-squares regression, value function estimation in reversible Markov chains, and self-adjoint elliptic operators. The optimal approximation factor α(M, γmax) can be explicitly computed in such cases.

Lemma 2. Suppose that $M$ is symmetric with eigenvalues $\{\lambda_j(M)\}_{j=1}^d$ such that $\lambda_{\max}(M) < 1$. Then for any $s > 0$, we have

$\alpha(M, s) = 1 + \max_{j = 1, \ldots, d} \frac{s^2 - \lambda_j^2}{(1 - \lambda_j)^2}$. (31)

See Appendix B.2 for the proof of this lemma.
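The statements of Lemmas 1 and 2 can be sanity-checked numerically against the definition (26). The following sketch does so on randomly generated matrices; the dimension, the random seed, the rescaling to operator norm 0.8, and the choice $s = 0.9$ are hypothetical values chosen only so that the assumptions of both lemmas hold.

```python
import numpy as np


def kappa(M):
    """kappa(M) = (1/2) * lambda_max(M + M^T)."""
    return 0.5 * np.linalg.eigvalsh(M + M.T).max()


def alpha(M, s):
    """Approximation factor (26)."""
    d = M.shape[0]
    R = np.linalg.inv(np.eye(d) - M)
    A = R @ (s ** 2 * np.eye(d) - M @ M.T) @ R.T
    return 1.0 + np.linalg.eigvalsh((A + A.T) / 2).max()


rng = np.random.default_rng(2)
d, s = 5, 0.9

# Symmetric case (Lemma 2): closed form in terms of the eigenvalues of M.
G = rng.standard_normal((d, d))
M_sym = (G + G.T) / 2
M_sym *= 0.8 / np.linalg.norm(M_sym, 2)        # eigenvalues now lie in [-0.8, 0.8]
lam = np.linalg.eigvalsh(M_sym)
alpha_closed_form = 1 + np.max((s ** 2 - lam ** 2) / (1 - lam) ** 2)

# General case (Lemma 1): the two chains of inequalities (30a) and (30b).
M_gen = rng.standard_normal((d, d))
M_gen *= 0.8 / np.linalg.norm(M_gen, 2)        # ensures kappa(M_gen) <= 0.8 < 1
inv_norm = np.linalg.norm(np.linalg.inv(np.eye(d) - M_gen), 2)
bound_a = (1 + inv_norm ** 2 * s ** 2, 1 + s ** 2 / (1 - kappa(M_gen)) ** 2)
bound_b = (1 + 2 * inv_norm, 1 + 2 / (1 - kappa(M_gen)))
```

Comparing `alpha(M_sym, s)` with `alpha_closed_form`, and `alpha(M_gen, s)` with the entries of `bound_a` and `bound_b`, gives a quick check of equation (31) and of the bounds (30a) and (30b) on this instance.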

Lemma 1 reveals that there is a qualitative shift between the non-expansive case $|||L|||_X \le 1$ and the complementary expansive case. In the latter case, the optimal approximation factor always scales as $O\big(\frac{1}{(1 - \kappa(M))^2}\big)$, but below the threshold $|||L|||_X = 1$, the approximation factor drastically improves to become $O\big(\frac{1}{1 - \kappa(M)}\big)$. It is worth noting that both bounds can be achieved up to universal constant factors. In the context of differential equations, the bound of the form (a) in Lemma 1 is known as Céa's lemma [Céa64], which plays a central role in the convergence rate analysis of the Galerkin methods for numerical differential equations.

However, the instance-dependent approximation factor $\alpha(M, |||L|||_X)$ can often be much smaller: the global coercive parameter needed in Céa's estimate is replaced by the bounds on the behavior of the operator $L$ in the finite-dimensional subspace. Part (b) in Lemma 1 generalizes Céa's energy estimate from the symmetric positive-definite case to the general non-expansive setting. See Corollary 4 for a more detailed discussion on the consequences of our results for elliptic PDEs.

Lemmas 1 and 2 yield the following corollary of the general bound (28) under different conditions on the operator $L$.

Corollary 1. Under the conditions of Theorem 1 and given a sample size $n \ge \frac{c_0 \sigma_L^2 d}{(1 - \kappa)^2} \log^2\Big( \frac{\|v_0 - \bar{v}\|^2 d}{1 - \kappa} \Big)$:

(a) There is a universal positive constant $c$ such that

$E\|\hat{v}_n - v^*\|^2 \le c \Big\{ \frac{|||L|||_X^2}{(1 - \kappa(M))^2} \cdot \inf_{v \in S} \|v - v^*\|^2 + \frac{\sigma_b^2 + \sigma_L^2 \|\bar{v}\|^2}{(1 - \kappa(M))^2} \cdot \frac{d}{n} \Big\}$ (32a)

for any operator $L$, and its associated projected operator $M = \Phi_d L \Phi_d^*$.

(b) Moreover, when $L$ is non-expansive ($|||L|||_X \le 1$), we have

$E\|\hat{v}_n - v^*\|^2 \le c \Big\{ \frac{1}{1 - \kappa(M)} \cdot \inf_{v \in S} \|v - v^*\|^2 + \frac{\sigma_b^2 + \sigma_L^2 \|\bar{v}\|^2}{(1 - \kappa(M))^2} \cdot \frac{d}{n} \Big\}$. (32b)

See Section 5.2 for the proof of this claim. As alluded to before, the simplified form of Corollary 1 no longer has an explicit higher order term, and the statistical error now scales at the parametric rate $d/n$. It is worth noting that the lower bound on $n$ required in the assumption of the corollary is a mild requirement: in the absence of such a condition, the statistical error term $\frac{\sigma_b^2 + \sigma_L^2 \|\bar{v}\|^2}{(1 - \kappa)^2} \cdot \frac{d}{n}$ in both bounds would blow up, rendering the guarantee vacuous.

3.2 Lower bounds

In this section, we establish information-theoretic lower bounds on the approximation factor, as well as the statistical error. Our eventual result (in Corollary 2) shows that the first two terms appearing in Theorem 1 are both optimal in a certain instance-dependent sense. However, a precise definition of the local neighborhood of instances over which the lower bound holds requires some definitions. In order to motivate these definitions more transparently and naturally arrive at both terms of the bound, the following section presents individual bounds on the approximation and estimation errors, and then combines them to obtain Corollary 2.

3.2.1 Lower bounds on the approximation error

As alluded to above, the first step involved in a lower bound is a precise definition of the collection of problem instances over which it holds; let us specify a natural such collection for lower bounds on the approximation error. Each problem instance is specified by the joint distribution of the observations $(L_i, b_i)$, which implicitly specifies a pair of means $(L, b) = (E[L_i], E[b_i])$. For notational convenience, we define this class by first defining a collection comprising instances specified solely by the mean pair $(L, b)$, and then providing restrictions on the distribution of $(L_i, b_i)$. Let us define the first such component. For a given matrix $M_0 \in R^{d \times d}$ and vector $h_0 \in R^d$, write

$C_{\mathrm{approx}}(M_0, h_0, D, \delta, \gamma_{\max}) := \Big\{ (L, b) \;:\; |||L|||_X \le \gamma_{\max},\; A(S, v^*) \le \delta^2,\; \dim(X) = D,\; \Phi_d L \Phi_d^* = M_0, \text{ and } \Phi_d b = h_0 \Big\}$.

In words, this is a collection of all instances of the pair $(L, b) \in \mathbb{L} \times R^D$ whose projections onto the subspace of interest are fixed to be the pair $(M_0, h_0)$, and whose approximation error is less than $\delta^2$. In addition, the operator $L$ satisfies a certain bound on its operator norm. Having specified a class of $(L, b)$ pairs, we now turn to the joint distribution over the pair of observations $(L_i, b_i)$, which we denote for convenience by $P_{L,b}$. Now define the collection of instances

$G_{\mathrm{var}}(\sigma_L, \sigma_b) := \big\{ P_{L,b} \;:\; (L_i, b_i) \text{ satisfies Assumption 1(S) with constants } (\sigma_L, \sigma_b) \big\}$.

This is simply the class of all distributions such that our observations satisfy Assumption 1(S) with pre-specified constants. As a point of clarification, it is useful to recall that our upper bound in Theorem 1 only needed Assumption 1(W) to hold, and we could have chosen to match this by defining the class $G_{\mathrm{var}}$ under Assumption 1(W). We comment further on this issue following the theorem statement.

We are now ready to state Theorem 2, which is a lower bound on the worst-case approximation factor over all problem instances such that $(L, b) \in C_{\mathrm{approx}}(M_0, h_0, D, \delta, \gamma_{\max})$ and $P_{L,b} \in G_{\mathrm{var}}(\sigma_L, \sigma_b)$. Note that such a collection of problem instances is indeed local around the pair $(M_0, h_0)$. Two settings are considered in the statement of the theorem: proper estimators, where $\hat{v}_n$ is restricted to take values in the subspace $S$; and improper estimators, where $\hat{v}_n$ can take values in the entire space $X$. We use $\hat{V}_S$ and $\hat{V}_X$ to denote the class of proper and improper estimators, respectively. Finally, we use the shorthand $C_{\mathrm{approx}} \equiv C_{\mathrm{approx}}(M_0, h_0, D, \delta, \gamma_{\max})$ for convenience.

Theorem 2. Suppose $M_0 \in R^{d \times d}$ is a matrix such that $I - M_0$ is invertible, and that the scalars $(\sigma_L, \sigma_b)$ are such that $\sigma_L \ge \gamma_{\max}$ and $\sigma_b \ge \delta$. If the ambient dimension satisfies $D \ge d + \frac{12}{\omega} n^2$ for some scalar $\omega \in (0, 1)$, then we have the lower bounds

$\inf_{\hat{v}_n \in \hat{V}_S} \; \sup_{(L,b) \in C_{\mathrm{approx}},\; P_{L,b} \in G_{\mathrm{var}}(\sigma_L, \sigma_b)} E\|\hat{v}_n - v^*\|^2 \ge (1 - \omega) \cdot \alpha(M_0, \gamma_{\max}) \cdot \delta^2$, and (33a)

$\inf_{\hat{v}_n \in \hat{V}_X} \; \sup_{(L,b) \in C_{\mathrm{approx}},\; P_{L,b} \in G_{\mathrm{var}}(\sigma_L, \sigma_b)} E\|\hat{v}_n - v^*\|^2 \ge (1 - \omega) \cdot \big( \alpha(M_0, \gamma_{\max}) - 1 \big) \cdot \delta^2$. (33b)

See Section 5.3 for the proof of this claim.

A few remarks are in order. First, Theorem 2 shows that the approximation factor upper bound in Theorem 1 is information-theoretically optimal in the instance-dependent sense: in the case of proper estimators, the upper and lower bound can be made arbitrarily close by choosing the constant $\omega$ arbitrarily small in both theorems. Both bounds depend on the projected matrix $M_0$, characterizing the fundamental impact of the geometry in the projected space on the complexity of the estimation problem. The lower bound for improper estimators is slightly smaller, but for most practical applications we have $\alpha(M_0, \gamma_{\max}) \gg 1$ and so this result should be viewed as almost equivalent.

Second, note that we may also extract a worst-case lower bound on the approximation factor from Theorem 2. Indeed, for a scalar $\gamma_{\max} \in (0, 1)$, consider the family of instances in the aforementioned problem classes satisfying $|||L|||_X \le \gamma_{\max}$. Setting $M_0 = \gamma_{\max}^2 I_d$ and applying Theorem 2, we see that (in a worst-case sense over this class), the risk of any estimator is lower bounded by $\frac{1}{1 - \gamma_{\max}^2} A(S, v^*)$. This establishes the optimality of the classical worst-case upper bound (6).
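The worst-case choice $M_0 = \gamma_{\max}^2 I_d$ can be checked directly against the definition (26): with $s = \gamma_{\max}$, each eigenvalue of the matrix inside $\lambda_{\max}$ equals $(s^2 - s^4)/(1 - s^2)^2 = s^2/(1 - s^2)$, so that $\alpha(M_0, \gamma_{\max}) = 1/(1 - \gamma_{\max}^2)$. The short sketch below confirms this numerically; the values of `gamma_max` and `d` are arbitrary illustrative choices.

```python
import numpy as np


def alpha(M, s):
    """Approximation factor (26)."""
    d = M.shape[0]
    R = np.linalg.inv(np.eye(d) - M)
    A = R @ (s ** 2 * np.eye(d) - M @ M.T) @ R.T
    return 1.0 + np.linalg.eigvalsh((A + A.T) / 2).max()


gamma_max = 0.95                       # hypothetical contraction factor
d = 3                                  # hypothetical projected dimension
M0 = gamma_max ** 2 * np.eye(d)

worst_case_factor = alpha(M0, gamma_max)
classical_factor = 1.0 / (1.0 - gamma_max ** 2)
```

Both `worst_case_factor` and `classical_factor` evaluate to the same number, consistent with the claim that the classical worst-case factor is attained at this instance.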

Third, notice that the theorem requires the noise variances $(\sigma_L, \sigma_b)$ to be large enough, and this is a natural requirement in spite of the fact that we seek lower bounds on the approximation error. Indeed, in the extreme case of noiseless observations, we have access to the population pair $(L, b)$ with a single sample, and can compute both $v^*$ and its projection onto the subspace $S$ without error. From a more quantitative standpoint, it is worth noting that our requirements $\sigma_L \ge \gamma_{\max}$ and $\sigma_b \ge \delta$ are both mild, since the scalars $\gamma_{\max}$ and $\delta$ are typically order 1 quantities. Indeed, if both of these bounds held with equality, then Corollary 1 yields that the statistical error would be of the order $O(d/n)$, and so strictly smaller than the approximation error we hope to capture$^7$.

Observe that Theorem 2 requires the ambient dimension $D$ to be larger than $n^2$. As mentioned in the introduction, we should not expect any non-trivial approximation factor when $n \ge D$, but this leaves open the regime $n \ll D \ll n^2$. Is a smaller approximation factor achievable when $D$ is not extremely large? We revisit this question in Section 3.2.4, showing that while there are some quantitative differences in the lower bound, the qualitative nature of the message remains unchanged.

Regarding our noise assumptions, it should be noted that the class of instances satisfying Assumption 1(W) is strictly larger than the corresponding class satisfying Assumption 1(S), and so our lower bound to follow extends immediately to the former case. Second, it is important to note that imposing only Assumption 1(W) would in principle allow the noise in the orthogonal complement $S^\perp$ to grow in an unbounded fashion, and one should expect that it is indeed optimal to return an estimate of the projected fixed point $\bar{v}$. To establish a more meaningful (but also more challenging) lower bound, we operate instead under the stronger Assumption 1(S), enforcing second moment bounds on the noise not only for basis vectors in $S$, but also its orthogonal complement. Assumption 1(S) allows for other natural estimators: for instance, the plug-in estimator of $v^*$ via the original fixed point equation (1) would now incur finite error. Nevertheless, as shown by our lower bound, the stochastic approximation estimator analyzed in Theorem 1 is optimal even if the noise in $S^\perp$ behaves as well as that in $S$.

$^7$ As a side remark, we note that our noise conditions can be further weakened, if desired, via a mini-batching trick. To be precise, given any problem instance $P_{L,b} \in G_{\mathrm{var}}(\sigma_L, \sigma_b)$ and any integer $m > 0$, one could treat the sample mean of $m$ independent samples as a single sample, resulting in a problem instance in the class $G_{\mathrm{var}}\big(\frac{\sigma_L}{\sqrt{m}}, \frac{\sigma_b}{\sqrt{m}}\big)$. The same lower bound still applies to the class $G_{\mathrm{var}}\big(\frac{\sigma_L}{\sqrt{m}}, \frac{\sigma_b}{\sqrt{m}}\big)$, at the cost of a stronger dimension requirement $D \ge d + \frac{12}{\omega} n^2 m^2$.

3.2.2 Lower bounds on the estimation error

We now turn to establishing a minimax lower bound on the estimation error that matches the statistical error term in Theorem 1. This lower bound takes a slightly different form from Theorem 2: rather than studying the total error $\|\hat{v}_n - v^*\|$ directly, we establish a lower bound on the error $\|\hat{v}_n - \bar{v}\|$ instead. Indeed, the latter term is more meaningful to study in order to characterize the estimation error, which depends on the sample size $n$, since for large sample sizes, the total error $\|\hat{v}_n - v^*\|$ will be dominated by a constant approximation error. As we demonstrate shortly, the term $\|\hat{v}_n - \bar{v}\|$ depends on the noise covariance and the geometry of the matrix $M_0$ in the projected space, while having the desired dependence on the sample size $n$. It is worth noting also that this automatically yields a lower bound on the error $\|\hat{v}_n - v^*\|$ when we have $v^* = \bar{v}$.

We are now ready to prove a local minimax lower bound for estimating $\bar{v} \in S$, which is given by the solution to the projected linear equation $\bar{v} = \Pi_S(L\bar{v} + b)$. While our objective is to prove a local lower bound around each pair $(L_0, b_0) \in \mathbb{L} \times X$, the fact that we are estimating $\bar{v}$ implies that it suffices to define our set of local instances in the $d$-dimensional space of projections. In particular, our means $(L, b)$ are specified by those pairs for which $\Phi_d L \Phi_d^*$ is close to $M_0 := \Phi_d L_0 \Phi_d^*$, and $\Phi_d b$ is close to $h_0 := \Phi_d b_0$. Specifically, let $v_0$ denote the solution to the projected linear equation $v_0 = \Pi_S(L_0 v_0 + b_0)$, and define the neighborhood

$N(M_0, h_0) := \Big\{ (M', h') \;:\; |||M' - M_0|||_F \le \sigma_L \sqrt{\tfrac{d}{n}}, \text{ and } \|h' - h_0\|_2 \le \sigma_b \sqrt{\tfrac{d}{n}} \Big\}$, (34)

which, in turn, defines a local class of problem instances $(L, b)$ given by

$C_{\mathrm{est}} := \big\{ (L, b) \;\mid\; \big(\Phi_d L \Phi_d^*, \Phi_d b\big) \in N(M_0, h_0) \big\}$.

We have thus specified our local neighborhood in terms of the mean pair $(L, b)$, and as before, it remains to define a local class of distributions on these instances. Toward this end, define the class

$G_{\mathrm{cov}}(\Sigma_L, \Sigma_b, \sigma_L, \sigma_b) := G_{\mathrm{var}}(\sigma_L, \sigma_b) \cap \big\{ P_{L,b} \;:\; \mathrm{cov}\big(\Phi_d(b_1 - b)\big) \preceq \Sigma_b \text{ and } \mathrm{cov}\big(\Phi_d(L_1 - L) v_0\big) \preceq \Sigma_L \big\}$, (35)

corresponding to distributions on the observation pair $(L_i, b_i)$ that satisfy Assumption 1(S) and whose “effective noise” covariances are dominated by the PSD matrices $\Sigma_L$ and $\Sigma_b$. Note that Assumption 1(S) implies that the diagonal elements of the above two covariance matrices are bounded by $\sigma_b^2$ and $\sigma_L^2 \|v_0\|^2$, respectively. In order to avoid conflicts between assumptions, we assume throughout that for all indices $j \in [d]$, the diagonal entries of the covariance matrices satisfy the conditions

$(\Sigma_b)_{j,j} \le \sigma_b^2$ and $(\Sigma_L)_{j,j} \le \sigma_L^2 \|v_0\|^2$. (36)

We then have the following theorem for the estimation error $\|\hat{v}_n - \bar{v}\|$, where we use the shorthand $G_{\mathrm{cov}} \equiv G_{\mathrm{cov}}(\Sigma_L, \Sigma_b, \sigma_L, \sigma_b)$ for brevity.

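Theorem 3. Under the setup above, suppose the matrix $I - M_0$ is invertible, and suppose that $n \ge 16 \sigma_L^2\, |||(I - M_0)^{-1}|||_{\mathrm{op}}^2\, d$. Then there is a universal constant $c > 0$ such that

$\inf_{\hat{v}_n \in \hat{V}_X} \; \sup_{(L,b) \in C_{\mathrm{est}},\; P_{L,b} \in G_{\mathrm{cov}}} E\|\hat{v}_n - \bar{v}\|^2 \ge c \cdot E_n(M_0, \Sigma_L + \Sigma_b)$.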

See Section 5.4 for the proof of this claim.

The estimation error lower bound in Theorem 3 matches the statistical error term $E_n(M, \Sigma^*)$ in Theorem 1, up to a universal constant. Indeed, in the asymptotic limit $n \to \infty$, the regularity of the problem can be leveraged in conjunction with classical Le Cam theory (see, e.g., [vdV00]) to show that the asymptotic optimal limiting distribution is a Gaussian law with covariance $(I - M)^{-1} \Sigma^* (I - M)^{-\top}$. (See the paper [KPR+20] for a detailed analysis of this type in the special case of policy evaluation in tabular MDPs.) This optimality result holds in a “local” sense: it is minimax optimal in a small neighborhood of radius $O(1/\sqrt{n})$ around a given problem instance $(M_0, h_0)$. Theorem 3, on the other hand, is non-asymptotic, showing that a similar result holds provided $n$ is lower bounded by an explicit, problem-dependent quantity of the order $\sigma_L^2 d\, |||(I - M_0)^{-1}|||_{\mathrm{op}}^2$. This accommodates a broader range of sample sizes than the upper bound in Theorem 1.

3.2.3 Combining the bounds

Having presented separate lower bounds on the approximation and estimation errors in conjunction with definitions of local classes of instances over which they hold, we are now ready to present a corollary which combines the two lower bounds in Theorems 2 and 3. We begin by defining the local classes of instances over which our combined bound holds. Given a matrix-vector pair $(M_0, h_0)$, covariance matrices $(\Sigma_L, \Sigma_b)$, ambient dimension $D > 0$, and scalars $\delta, \gamma_{\max}, \sigma_L, \sigma_b > 0$, we begin by specifying a collection of mean pairs $(L, b)$ via

$C_{\mathrm{final}}(M_0, h_0, D, \delta, \gamma_{\max}) := \bigcup_{(M', h') \in N_n(M_0, h_0)} C_{\mathrm{approx}}(M', h', D, \delta, \gamma_{\max})$. (37)

Clearly, this represents a natural combination of the classes Capprox and Cest introduced above. We use the shorthand Cfinal for this class for brevity. Our collection of distributions PL,b is still given by the class Gcov from equation (35). With these definitions in hand, we are now ready to state our combined lower bound.

Corollary 2. Under the setup above, suppose that the pair $(\sigma_L, \sigma_b)$ satisfies the conditions in Theorem 2 and equation (36), and that the matrix $M_0$ satisfies $|||M_0|||_{\mathrm{op}} \le \gamma_{\max} - \sigma_L \sqrt{d/n}$. Moreover, suppose that the sample size and ambient dimension satisfy $n \ge 16 \sigma_L^2\, |||(I - M_0)^{-1}|||_{\mathrm{op}}^2\, d$ and $D \ge d + 36 n^2$, respectively. Then the following minimax lower bound holds for a universal positive constant $c$:

$\inf_{\hat{v}_n \in \hat{V}_X} \; \sup_{(L,b) \in C_{\mathrm{final}},\; P_{L,b} \in G_{\mathrm{cov}}} E\|\hat{v}_n - v^*\|^2 \ge c \cdot \Big\{ \big( \alpha(M_0, \gamma_{\max}) - 1 \big) \cdot \delta^2 + E_n(M_0, \Sigma_L + \Sigma_b) \Big\}$.

We prove this corollary in Section 5.5. It is a relatively straightforward consequence of combining Theorems 2 and 3.

The combined lower bound matches the expression $\alpha(M_0, \gamma_{\max}) A(S, v^*) + E_n(M_0, \Sigma_L + \Sigma_b)$, given by the first two terms of Theorem 1, up to universal constant factors. Recall from our discussion of Theorem 1 that the high-order term $H_n(\sigma_L, \sigma_b, \bar{v})$ represents the “optimization error” of the stochastic approximation algorithm, which depends on the coercive condition $\kappa(M_0)$ instead of the natural geometry $I - M_0$ of the problem. While we do not expect this term to appear in an information-theoretic lower bound, the leading estimation error term $E_n(M_0, \Sigma_L + \Sigma_b)$ will dominate the high-order term when the sample size $n$ is large enough. For such a range of $n$, the bound in Theorem 1 is information-theoretically optimal in the local class specified above. More broadly, consider the class of all instances satisfying Assumption 1(S), with $\kappa(M) \le \kappa$ and $|||L|||_X \le 1$. Then the bound in Theorem 1 is optimal, in a worst-case sense, over this class as long as the sample size exceeds the threshold $\frac{c \sigma_L^2}{(1 - \kappa)^2} d$.

3.2.4 The intermediate regime

It remains to tie up some loose ends. Note that the lower bound in Theorem 2 requires a condition $D \gg n^2$. On the other hand, it is easy to see that the approximation factor can be made arbitrarily close to 1 when $n \gg D$. (For example, one could run the estimator based on stochastic approximation and averaging, which was analyzed in Theorem 1, with the entire Euclidean space $X$, and project the resulting estimate onto the subspace $S$.) In the middle regime $n \ll D \ll n^2$, however, it is not clear which estimator is optimal. In the following theorem, we present a lower bound for the approximation factor in this intermediate regime, which establishes the optimality of Theorem 1 up to a constant factor.

Theorem 4. Suppose $M_0 \in R^{d \times d}$ is a matrix such that $I - M_0$ is invertible, and that the scalars $(\sigma_L, \sigma_b)$ satisfy $\sigma_L \ge 1 + \gamma_{\max}$ and $\sigma_b \ge \delta$. If the ambient dimension satisfies $D \ge d + 3q\, n^{1 + 1/q}$ for some integer $q \in \Big[2,\; \log n \wedge \frac{1}{\sqrt{2(1 - \gamma_{\max} \wedge 1)}}\Big]$, then we have the lower bound

$\inf_{\hat{v}_n \in \hat{V}_X} \; \sup_{(L,b) \in C_{\mathrm{approx}},\; P_{L,b} \in G_{\mathrm{var}}(\sigma_L, \sigma_b)} E\|\hat{v}_n - v^*\|^2 \ge \frac{\alpha(M_0, \gamma_{\max}) - 1}{4 q^2}\, \delta^2$.

See Appendix A for the proof of this theorem.

Theorem 4 resolves the gap in the intermediate regime, up to a constant factor that depends on $q$. In particular, the stochastic approximation estimator (24) for projected equations still yields a near-optimal approximation factor. Compared to Theorem 2, Theorem 4 weakens the requirement on the ambient dimension $D$ and covers the entire regime $D \gg n$. Furthermore, using the same arguments as in Corollary 2, this theorem can also be combined with Theorem 3 to obtain the following lower bound in the regime $D \ge d + 3q\, n^{1 + 1/q}$, for any integer $q > 0$:

$\inf_{\hat{v}_n \in \hat{V}_X} \; \sup_{(L,b) \in C_{\mathrm{final}},\; P_{L,b} \in G_{\mathrm{cov}}} E\|\hat{v}_n - v^*\|^2 \ge c \cdot \Big\{ \frac{\alpha(M_0, \gamma_{\max}) - 1}{q^2} \cdot \delta^2 + E_n(M_0, \Sigma_L + \Sigma_b) \Big\}$.

Let us summarize our approximation factor lower bounds in the various regimes. Consider a sequence of problem instances $\big(P_{L,b}^{(n)}\big)_{n=1}^{\infty}$ with increasing ambient dimension $D_n$. Let the projected dimension $d$, noise variances $(\sigma_L, \sigma_b)$, oracle error $\delta$, projected matrix $\Phi_d L^{(n)} \Phi_d^* = M$, and the operator norm bound $|||L|||_X \le \gamma_{\max}$ be all fixed. The following table then presents a combination of our results from Theorems 1, 2, and 4; our results suggest that the optimal approximation factor exhibits a “slow” phase transition phenomenon. It is an interesting open question whether the phase transition is sharp, and to identify the asymptotically optimal approximation factor in the regime $\lim_{n \to \infty} \frac{\log D_n}{\log n} = 1$, since our lower bounds do not apply in this linear regime.

4 Consequences for specific models

We now discuss the consequences of our main theorems for the three examples introduced in Section 2.2. For brevity, we state only upper bounds for the first two examples; our third example for temporal difference learning methods includes both upper and lower bounds.

$q = \lim_{n\to\infty} \frac{\log D_n}{\log n}$ | $[2, \infty)$                 | $(1, 2)$                                | $(0, 1)$
Lower bound                                     | $\alpha(M_0, \gamma_{\max})$ | $c_q \cdot \alpha(M_0, \gamma_{\max})$ | $1$
Upper bound                                     | $\alpha(M_0, \gamma_{\max})$ | $\alpha(M_0, \gamma_{\max})$           | $1$

Table 1. Bounds on the approximation factor $\frac{\mathbb{E}\|\hat v_n - v^*\|^2}{\mathcal{A}(\mathbb{S}, v^*)}$ for proper estimators in different ranges of ambient dimension. Here, $c_q \in (0, 1)$ represents a constant depending only on the aspect ratio $q$.

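4.1 Linear regression

Recall the setting of linear regression from Section 2.2.1, including our i.i.d. observation model (14). We assume bounds on the second moment of $\varepsilon$ and fourth moment of $X$; namely, the existence of some $\varsigma > 0$ such that

$$\mathbb{E}\langle u, X\rangle^4 \le \varsigma^4, \quad \text{and} \quad \mathbb{E}(\varepsilon^2) \le \varsigma^2 \qquad \text{for all } u \in \mathbb{S}^{D-1}. \tag{38}$$

These conditions ensure that Assumption 1(W) is satisfied with $(\sigma_L, \sigma_b) = (\beta^{-1}\varsigma^2, \beta^{-1}\varsigma^2)$.$^{8}$ Recall that the (unprojected) covariance matrix satisfies the PSD relations $\mu I \preceq \mathbb{E}[XX^\top] \preceq \beta I$, and define the $d$-dimensional covariance matrix $\Sigma := \mathbb{E}\big[(\Phi_d X)(\Phi_d X)^\top\big]$ for convenience. In this case, our stochastic approximation iterates (24a) take the form

$$v_{t+1} = v_t - \eta\big(\Pi_{\mathbb{S}} X_{t+1} X_{t+1}^\top \Pi_{\mathbb{S}} v_t + Y_{t+1} \Pi_{\mathbb{S}} X_{t+1}\big), \qquad \text{for all } t = 0, 1, 2, \ldots, \tag{39}$$

and we take the averaged iterates $\hat v_n := \frac{2}{n}\sum_{t=n/2}^{n-1} v_t$. For this procedure, we have the following guarantee:

Corollary 3. Suppose that we have $n$ i.i.d. observations $\{(X_i, Y_i)\}_{i=1}^{n}$ from the model (14) satisfying the moment conditions (38). Then there are universal positive constants $(c, c_0)$ such that given a sample size $n \ge \frac{c_0 \varsigma^4 d}{\lambda_{\min}^2(\Sigma)} \log^2\big(\tfrac{\beta}{\mu}\|v_0 - \bar v\|_2^2\, d\big)$, if the stochastic approximation scheme (39) is run with step size $\eta = \frac{1}{c_0 \varsigma^2 \sqrt{dn}}$, then the averaged iterate satisfies the bound

$$\mathbb{E}\|\hat v_n - v^*\|_2^2 \le (1 + \omega) \cdot \alpha\Big(I_d - \frac{\Sigma}{\beta},\ 1 - \frac{\mu}{\beta}\Big)\, \mathcal{A}(\mathbb{S}, v^*) + c \cdot \frac{\mathrm{trace}(\Sigma^{-1}) \cdot \mathbb{E}(\varepsilon^2)}{\omega n} + \frac{c}{\omega}\Big(\frac{\varsigma^2}{\lambda_{\min}(\Sigma)}\sqrt{\frac{d}{n}}\Big)^{3}$$

for each $\omega > 0$.

This result is a direct consequence of Theorem 1 in application to this model.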

Note that the statistical error term $\frac{\mathrm{trace}(\Sigma^{-1}) \cdot \mathbb{E}(\varepsilon^2)}{n}$ in this case corresponds to the classical statistical rates for linear regression in this low-dimensional subspace. The approximation factor, by Lemma 2, admits the closed-form expression

$$\alpha\Big(I_d - \frac{\Sigma}{\beta},\ 1 - \frac{\mu}{\beta}\Big) = \max_{i \in [d]} \frac{\mu^2 + 2\beta(\lambda_i - \mu)}{\lambda_i^2},$$

where $\{\lambda_j\}_{j=1}^{d}$ denote the eigenvalues of the matrix $\Sigma$. Since $\lambda_j \in [\mu, \beta]$ for each $j \in [d]$, the approximation factor is at most of the order $O\big(\frac{\beta}{\lambda_{\min}(\Sigma)}\big)$.

Compared to known sharp oracle inequalities for linear regression (e.g., [RH15]), the approximation factor in our bound is not 1 but rather a problem-dependent quantity. This is because we study the estimation error under the standard Euclidean metric $\|\cdot\|_2$, as opposed to the prediction error under the data-dependent metric $\|\cdot\|_{L^2(P_X)}$. When the covariance matrix $\mathbb{E}[XX^\top]$ is identity, the approximation factor $\alpha\big(I_d - \frac{\Sigma}{\beta}, 1 - \frac{\mu}{\beta}\big)$ is equal to 1, recovering classical results. Another error metric of interest, motivated by applications such as transfer learning [LCL20], is the prediction error when the covariates $X$ follow a different distribution $Q$. For such a problem, the result above can be modified straightforwardly by choosing the Hilbert space $\mathbb{X}$ to be $\mathbb{R}^D$, equipped with the inner product $\langle u, v\rangle := u^\top \mathbb{E}_Q[XX^\top]^{-1} v$.

8 Note that the stochastic approximation iterates are invariant under translation, and consequently we can assume without loss of generality that $\bar v = 0$.
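As a sanity check on the closed-form expression for the regression approximation factor, the following minimal sketch compares it with a direct eigenvalue computation. It assumes the definition $\alpha(M, \gamma) = 1 + \lambda_{\max}\big((I - M)^{-1}(\gamma^2 I - MM^\top)(I - M)^{-\top}\big)$, which is the quantity appearing in Lemmas 3 and 4 of Section 5.1.1; the function names and the particular spectrum of $\Sigma$ are hypothetical choices made only for illustration.

```python
import numpy as np

def approximation_factor(M, gamma):
    """Assumed definition: 1 + lambda_max((I-M)^{-1} (gamma^2 I - M M^T) (I-M)^{-T})."""
    d = M.shape[0]
    inv = np.linalg.inv(np.eye(d) - M)
    N = inv @ (gamma**2 * np.eye(d) - M @ M.T) @ inv.T
    return 1.0 + np.linalg.eigvalsh((N + N.T) / 2.0).max()

def regression_factor_closed_form(eigs, mu, beta):
    """Closed form max_i (mu^2 + 2 beta (lambda_i - mu)) / lambda_i^2."""
    eigs = np.asarray(eigs, dtype=float)
    return np.max((mu**2 + 2.0 * beta * (eigs - mu)) / eigs**2)

# Hypothetical spectrum of Sigma, with mu <= lambda_j <= beta.
mu, beta = 0.5, 4.0
eigs = np.array([0.5, 1.0, 2.5, 4.0])
Sigma = np.diag(eigs)

alpha_def = approximation_factor(np.eye(4) - Sigma / beta, 1.0 - mu / beta)
alpha_closed = regression_factor_closed_form(eigs, mu, beta)
print(alpha_def, alpha_closed)
```

In this example the maximum is attained at an interior eigenvalue ($\lambda_i = 1$) rather than at $\lambda_{\min}(\Sigma)$, which illustrates why the factor depends on the whole spectrum of $\Sigma$.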

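4.2 Galerkin methods

We now return to the example of Galerkin methods, as previously introduced in Section 2.2.2, with the i.i.d. observation model (18). We assume the basis functions $\phi_1, \ldots, \phi_d$ to have uniformly bounded function value and gradient, and define the scalars

$$\sigma_L := 1 + \frac{2}{\beta} \max_{j \in [d]} \sup_{x \in \Omega} \|\nabla \phi_j(x)\|_2, \quad \text{and} \quad \sigma_b := \frac{\|f\|_{L^2} + 1}{\beta} \max_{j \in [d]} \sup_{x \in \Omega} |\phi_j(x)|. \tag{40}$$

These boundedness conditions are naturally satisfied by many interesting basis functions such as the Fourier basis$^{9}$, and ensure (we verify this concretely in the proof of Corollary 4 to follow) that our observation model satisfies Assumption 1(W) with parameters $(\sigma_L, \sigma_b)$. Taking the finite-dimensional representation $v = \vartheta^\top \phi$, the stochastic approximation estimator for solving equation (19) is given by

$$\vartheta_{t+1} = \vartheta_t - \beta^{-1}\eta\big(\nabla\phi(x_{t+1})^\top a_{t+1} \nabla\phi(x_{t+1})\, \vartheta_t - f_{t+1}\, \phi(y_{t+1})\big), \quad \text{for } t = 0, 1, \cdots,$$
$$\hat\vartheta_n := \frac{2}{n}\sum_{t=n/2}^{n-1} \vartheta_t, \quad \text{and} \quad \hat v_n := \hat\vartheta_n^\top \phi.$$

In order to state our statistical guarantees for $\hat v_n$, we define the following matrices:

$$M := I_d - \beta^{-1}\int_\Omega \nabla\phi(x)^\top a(x) \nabla\phi(x)\, dx,$$
$$\Sigma_L := \frac{1}{\beta^2}\int_\Omega \nabla\phi^\top a \nabla\bar v\, (\nabla\bar v)^\top a \nabla\phi\, dx - \frac{1}{\beta^2}\Big(\int_\Omega \nabla\phi^\top a \nabla\bar v\, dx\Big)\Big(\int_\Omega \nabla\phi^\top a \nabla\bar v\, dx\Big)^\top + \frac{1}{\beta^2}\int_\Omega (\nabla\phi)^\top \Big[(\nabla\bar v)(\nabla\bar v)^\top + \mathrm{diag}\big(\|\nabla\bar v\|_2^2 - (\partial_j \bar v)^2\big)_{j=1}^{m}\Big](\nabla\phi)\, dx,$$
$$\Sigma_b := \frac{1}{\beta^2}\int_\Omega \big(f(x)^2 + 1\big)\phi(x)\phi(x)^\top dx - \frac{1}{\beta^2}\Big(\int_\Omega f(x)\phi(x)\, dx\Big)\Big(\int_\Omega f(x)\phi(x)\, dx\Big)^\top.$$

With these definitions in hand, we are ready to state the consequence of our main theorems for the estimation problem of elliptic equations.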

9 In the typical application of finite-element methods, basis functions based on local interpolation are widely used [BS07]. These basis functions can have large sup-norm, but via application of the Walsh–Hadamard transform, a new basis can be obtained satisfying condition (40) with dimension-independent constants. Since the stochastic approximation algorithm is invariant under orthogonal transformation, this modification is only for the convenience of analysis and does not change the algorithm itself.

Corollary 4. Under the setup above, there are universal positive constants $(c, c_0)$ such that if $n \ge \frac{c_0 \sigma_L^2 d}{(1 - \kappa(M))^2} \log^2\big(\frac{\|v_0 - \bar v\|^2 \beta d}{\mu}\big)$ and the stochastic approximation scheme is run with step size $\eta = \frac{1}{c_0 \sigma_L \sqrt{dn}}$, then the averaged iterates satisfy

$$\mathbb{E}\|\hat v_n - v^*\|_{\mathbb{X}}^2 \le (1 + \omega)\, \alpha\Big(M,\ 1 - \frac{\mu}{\beta}\Big) \inf_{v \in \mathbb{S}} \|v - v^*\|_{\mathbb{X}}^2 + c\Big(1 + \frac{1}{\omega}\Big) \cdot \big(\mathcal{E}_n(M, \Sigma_L + \Sigma_b) + \mathcal{H}_n(\sigma_L, \sigma_b, \bar v)\big)$$

for any $\omega > 0$.

See Appendix C.3.2 for the proof of this corollary.

Note that the approximation factor $\alpha\big(M, 1 - \frac{\mu}{\beta}\big)$ is uniformly bounded on the order of $O(\beta/\mu)$, which recovers Céa's energy estimates in the symmetric and uniform elliptic case [Céa64]. On the other hand, for a suitable choice of basis vectors, the bound in Corollary 4 can often be much smaller: the parameter $\mu$ corresponding to a global coercive condition can be replaced by the smallest eigenvalue of the projected operator $M$. Furthermore, our analysis also extends directly to the more general asymmetric and semi-elliptic case, in which the global coercive condition may not hold.

It is also worth noting that the bound in Corollary 4 is given in terms of the Sobolev norm $\|\cdot\|_{\mathbb{X}} = \|\cdot\|_{\dot H^1}$, as opposed to the standard $L^2$-norm used in the nonparametric estimation literature. By the Poincaré inequality, an $\dot H^1$-norm bound implies an $L^2$-norm bound, and ensures stronger error guarantees on the gradient of the estimated function.

4.3 Temporal difference learning

We now turn to the final example previously introduced in Section 2.2.3, namely that of the TD algorithm in reinforcement learning. Recall the i.i.d. observation model (21). Also recall the equivalent form of the projected fixed point equation (23), and note that the population-level operator $L$ satisfies the norm bound

$$|||L|||_{\mathbb{X}} = \gamma \cdot \sup_{\|v\| \le 1} \|Pv\| \le \gamma =: \gamma_{\max},$$

since $\xi$ is the stationary distribution of the transition kernel $P$.

4.3.1 Upper bounds on stochastic approximation with averaging

As mentioned before, this example is somewhat non-standard in that the basis functions $\psi_i$ are not necessarily orthonormal; indeed the classical temporal difference (TD) learning update in $\mathbb{R}^d$ involves the stochastic approximation algorithm

$$\vartheta_{t+1} = \vartheta_t - \eta\big(\psi(s_{t+1})\psi(s_{t+1})^\top \vartheta_t - \gamma\, \psi(s_{t+1})\psi(s_{t+1}^+)^\top \vartheta_t - R_{t+1}(s_{t+1})\, \psi(s_{t+1})\big). \tag{41a}$$

The Polyak–Ruppert averaged estimator is then given by the relations

$$\hat\vartheta_n = \frac{2}{n}\sum_{t=n/2}^{n-1} \vartheta_t, \quad \text{and} \quad \hat v_n := \hat\vartheta_n^\top \psi. \tag{41b}$$

Note that the updates (22) are, strictly speaking, different from the canonical iterates (25), but this should not be viewed as a fundamental difference since we are ultimately interested in the value function iterates $\hat v_n$; these are obtained from the iterates $\hat\vartheta_n$ by passing back to the original Hilbert space.

Nevertheless, this cosmetic difference necessitates some natural basis transformations before stating our results. Define the matrix$^{10}$ $B \in \mathbb{R}^{d \times d}$ by $B_{ij} := \langle \psi_i, \psi_j\rangle$ for $i, j \in [d]$; this defines an orthonormal basis given by

$$\begin{bmatrix}\phi_1 & \phi_2 & \cdots & \phi_d\end{bmatrix} := \begin{bmatrix}\psi_1 & \psi_2 & \cdots & \psi_d\end{bmatrix} B^{-1/2}.$$

Let $\beta := \lambda_{\max}(B)$ and $\mu := \lambda_{\min}(B)$, so that $\beta/\mu$ is the condition number of the covariance matrix of the features. Having set up this transformation, we are now ready to state the implication of our main theorem for the case of LSTD problems. We assume the following fourth-moment condition:

$$\forall u \in \mathbb{S}^{d-1}, \quad \mathbb{E}_\xi\big[(u^\top B^{-1/2}\psi(s))^4\big] \le \varsigma^4, \quad \text{and} \quad \mathbb{E}_\xi\big[R^4(s)\big] \le \varsigma^4. \tag{42}$$

As verified in the proof of Corollary 5 to follow, equation (42) suffices to guarantee that Assumption 1(W) is satisfied with parameters $(\sigma_L, \sigma_b) = (2\varsigma^2, \varsigma^2/\sqrt{\beta})$. We also require the following matrices to be defined:

$$M := \gamma B^{-1/2}\, \mathbb{E}_\xi[\psi(s)\psi(s^+)^\top]\, B^{-1/2}, \qquad \Sigma_L := \mathrm{cov}_\xi\Big[B^{-1/2}\psi(s)\big(\psi(s) - \gamma\psi(s^+)\big)^\top \bar\vartheta\Big], \qquad \Sigma_b := \mathrm{cov}_\xi\Big[R(s) B^{-1/2}\psi(s)\Big].$$

The following corollary then provides a guarantee on the Polyak–Ruppert averaged TD(0) iterates (41).

Corollary 5. Under the set-up above, there are universal positive constants $(c, c_0)$ such that given a sample size $n \ge \frac{c_0 \varsigma^4 \beta^2 d}{\mu^2 (1 - \kappa(M))^2} \log^2\big(\frac{\|v_0 - \bar v\|_2^2\, \beta d}{\mu(1 - \kappa(M))}\big)$, when the stochastic approximation scheme (41a) is run with step size $\eta = \frac{1}{c_0 \varsigma^2 \beta \sqrt{dn}}$, the averaged iterates satisfy the bound

$$\mathbb{E}\|\hat v_n - v^*\|^2 \le (1 + \omega)\, \alpha(M, \gamma)\, \mathcal{A}(\mathbb{S}, v^*) + c\Big(1 + \frac{1}{\omega}\Big)\bigg[\mathcal{E}_n(M, \Sigma_L + \Sigma_b) + \big(1 + \|\bar v\|^2\big)\Big(\frac{\varsigma^2 \beta}{(1 - \kappa(M))\mu}\sqrt{\frac{d}{n}}\Big)^{3}\bigg] \tag{43}$$

for any $\omega > 0$.

See Appendix C.1 for the proof of this corollary.

In the worst case, the approximation factor $\alpha(M, \gamma)$ scales as $\frac{1}{1 - \gamma^2}$, recovering the classical result (6), but more generally gives a more fine-grained characterization of the approximation factor depending on the one-step auto-covariance matrix for the feature vectors. By Lemma 1, we have $\alpha(M, \gamma) \le O\big(\frac{1}{1 - \kappa(M)}\big)$, so intuitively, the approximation factor is large when the Markov chain transitions slowly in the feature space along certain directions. On the other hand, if the one-step transition is typically a big jump, then the approximation factor is smaller.

The statistical error term $\mathcal{E}_n(M, \Sigma_L + \Sigma_b)$ matches the Cramér–Rao lower bound, and gives a finer characterization than both worst-case upper bounds [BRS18] and existing instance-dependent upper bounds [LS18]. Note that the final, higher-order term depends on the condition number $\frac{\beta}{\mu}$ of the covariance matrix $B$. This ratio is 1 when the basis vectors are orthonormal, but in general, the speed of algorithmic convergence depends on this parameter.

10 Since the functions $\psi_i$ are linearly independent, we have $B \succ 0$.
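The TD(0) update (41a) and the tail average (41b) can be sketched on a toy problem as follows. Everything specific here is an assumption made for illustration: the 4-state doubly stochastic chain (so that $\xi$ is uniform), the two features chosen to be orthonormal under $\xi$ (so that $B = I$), the reward vector, the noise level, and the step size $\eta = 1/\sqrt{dn}$, which drops the unspecified constant $c_0 \varsigma^2 \beta$ from Corollary 5.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4-state chain; P is doubly stochastic, so xi is uniform.
P = np.array([[0.1, 0.6, 0.2, 0.1],
              [0.1, 0.1, 0.6, 0.2],
              [0.2, 0.1, 0.1, 0.6],
              [0.6, 0.2, 0.1, 0.1]])
xi = np.full(4, 0.25)
gamma = 0.9
r = np.array([1.0, 0.0, -1.0, 0.5])
# Features psi(s) are the rows of Psi; E_xi[psi psi^T] = I for this choice.
Psi = np.array([[1.0, 1.0], [1.0, -1.0], [1.0, 1.0], [1.0, -1.0]])
d = Psi.shape[1]

# Population-level projected fixed point: E[psi (psi - gamma psi^+)^T] theta = E[R psi].
A = Psi.T @ np.diag(xi) @ (np.eye(4) - gamma * P) @ Psi
theta_bar = np.linalg.solve(A, Psi.T @ (xi * r))

# i.i.d. observations (s, s^+, R): s ~ xi, s^+ ~ P(s, .), R = r(s) + noise.
n = 100_000
s = rng.integers(0, 4, size=n)
u = rng.random(n)
s_next = np.minimum((u[:, None] > np.cumsum(P[s], axis=1)).sum(axis=1), 3)
R = r[s] + 0.1 * rng.standard_normal(n)

eta = 1.0 / np.sqrt(d * n)  # assumed step size, constants omitted
theta = np.zeros(d)
tail_sum = np.zeros(d)
for t in range(n):
    if t >= n // 2:
        tail_sum += theta  # accumulate theta_t for t = n/2, ..., n-1
    psi, psi_next = Psi[s[t]], Psi[s_next[t]]
    td_residual = psi @ theta - gamma * (psi_next @ theta) - R[t]
    theta = theta - eta * td_residual * psi  # update (41a)
theta_hat = tail_sum / (n - n // 2)  # averaged iterate (41b)

print(theta_hat, theta_bar)
```

For this chain the projected matrix is $M = \mathrm{diag}(0.9, -0.36)$, so the direction associated with the constant feature mixes slowly and dominates the estimation error, in line with the role of $1 - \kappa(M)$ in Corollary 5.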

4.3.2 Approximation factor lower bounds for MRPs

We conclude our discussion of discounted MRPs with an information-theoretic lower bound for policy evaluation. This bound involves technical effort over and above Theorem 2 since our construction for MRPs must make use only of operators $L$ that are constructed using a valid transition kernel. To set the stage, we say that a Markov reward process $(P, \gamma, r)$ and associated basis functions $\{\psi_j\}_{j=1}^{d}$ are in the canonical set-up if the following conditions hold:

• The stationary distribution $\xi$ of $P$ exists and is unique.

• The reward function and its observations are uniformly bounded. In particular, we have $\|r\|_\infty \le 1$, and $\|R\|_\infty \le 1$ almost surely.

• The basis functions are orthonormal, i.e., $\mathbb{E}_\xi[\psi(s)\psi(s)^\top] = I_d$.

The three conditions are standard assumptions in Markov reward processes. Now given scalars $\nu \in (0, 1]$ and $\gamma \in (0, 1)$, integer $D > 0$ and scalar $\delta \in (0, 1/2)$, we consider the following class of MRPs and associated feature vectors:

$$\mathbb{C}_{\mathrm{MRP}}(\nu, \gamma, D, \delta) := \Big\{ (P, \gamma, r, \psi) \ :\ (P, \gamma, r, \psi) \text{ is in the canonical set-up},\ |\mathcal{S}| = D,\ \mathcal{A}(\mathbb{S}, v^*) \le \delta^2,\ \kappa\big(\mathbb{E}_\xi[\psi(s)\psi(s^+)^\top]\big) \le \nu \Big\}.$$

Note that under the canonical set-up, we have $M = \gamma\, \mathbb{E}_\xi[\psi(s)\psi(s^+)^\top]$, and consequently, a problem instance in the class $\mathbb{C}_{\mathrm{MRP}}(\nu, \gamma, D, \delta)$ satisfies $\kappa(M) \le \nu\gamma$ in the set-up of Corollary 5. The condition $\kappa\big(\mathbb{E}_\xi[\psi(s)\psi(s^+)^\top]\big) \le \nu$ can be seen as a "mixing" condition in the projected space: when $\nu$ is bounded away from 1, the feature vector cannot have too large a correlation with its next-step transition in any direction. We have the following minimax lower bound for this class, where we use the shorthand $\mathbb{C}_{\mathrm{MRP}} \equiv \mathbb{C}_{\mathrm{MRP}}(\nu, \gamma, D, \delta)$ for convenience.

Proposition 1. There are universal positive constants $(c, c_1)$ such that if $D \ge c_1(n^2 + d)$, then for all scalars $\nu \in (0, 1]$ and $\gamma \in (0, 1)$, we have

$$\inf_{\hat v_n \in \widehat{\mathcal{V}}_{\mathbb{X}}} \ \sup_{(P, \gamma, r, \psi) \in \mathbb{C}_{\mathrm{MRP}}} \mathbb{E}\|\hat v_n - v^*\|^2 \ \ge\ \frac{c}{1 - \nu\gamma}\, \delta^2 \wedge 1. \tag{44}$$

See Appendix C.2 for the proof of this proposition.

A few remarks are in order. First, in conjunction with Corollary 5 and the second upper bound in Lemma 1, we can conclude that the TD algorithm for policy evaluation with linear function approximation attains the minimax-optimal approximation factor over the class $\mathbb{C}_{\mathrm{MRP}}$ up to universal constants. Proposition 1 also shows that the worst-case upper bound (6) due to Tsitsiklis and Van Roy [TVR97] is sharp up to a universal constant; indeed, note that for all $\gamma \in (0, 1)$, we have $\frac{1}{1 - \gamma^2} \asymp \frac{1}{1 - \gamma}$, and that the latter factor can be obtained from the lower bound (44) by taking $\nu = 1$. Second, note that the class $\mathbb{C}_{\mathrm{MRP}}$ is defined in a more "global" sense, as opposed to the "local" class $\mathbb{C}_{\mathrm{approx}}$ used in Theorem 2. This class contains all the MRP instances satisfying the approximation error bound and the constraint on $\kappa(M)$, and a minimax lower bound over this larger class is weaker than the lower bound over the local class that imposes restrictions on the projected matrix. That being said, Proposition 1 still captures more structure in the Markov transition kernel than the fact that it is contractive in the $\xi$-norm. For example, when the Markov chain makes "local moves" in the feature space, the correlation between feature vectors

can be large, leading to a large value of $\nu$ and a larger optimal approximation factor. On the other hand, if the one-step transition of the feature vector jumps a large distance in all directions, the optimal approximation factor will be small. Finally, it is worth noting that Proposition 1 holds only for the i.i.d. observation model. If we are given the entire trajectory of the Markov reward process, the approximation factor can be made arbitrarily close to 1, using TD($\lambda$) methods [TVR97]. The trade-off inherent to the Markov observation model is left for our companion paper.

5 Proofs

We now turn to the proofs of our main results.

5.1 Proof of Theorem 1

We divide the proof into two parts, corresponding to the two components in the mean-squared error of the estimator $\hat v_n$. The first term is the approximation error $\|\bar v - v^*\|^2$ that arises from the difference between the exact solution $v^*$ to the original fixed point equation, and the exact solution $\bar v$ to the projected set of equations. The second term is the estimation error $\mathbb{E}\|\hat v_n - \bar v\|^2$, measuring the difficulty of estimating $\bar v$ on the basis of $n$ noisy samples. In particular, under the conditions of the theorem, we prove that the approximation error is upper bounded as

$$\|\bar v - v^*\|^2 \le \alpha(M, |||L|||_{\mathbb{X}}) \inf_{v \in \mathbb{S}} \|v - v^*\|^2, \tag{45a}$$

whereas the estimation error is bounded as

$$\mathbb{E}\|\hat v_n - \bar v\|^2 \le c\, \frac{\mathrm{trace}\big((I - M)^{-1}\Sigma^*(I - M)^{-\top}\big)}{n} + c\, \frac{\sigma_L}{(1 - \kappa)^3}\Big(\frac{d}{n}\Big)^{3/2}\big(\|\bar v\|^2\sigma_L^2 + \sigma_b^2\big). \tag{45b}$$

Given these two inequalities, it is straightforward to prove the bound (28) stated in the theorem. By expanding the square, we have

$$\begin{aligned}
\mathbb{E}\|\hat v_n - v^*\|^2 &= \mathbb{E}\|\hat v_n - \bar v\|^2 + \|\bar v - v^*\|^2 + 2\,\mathbb{E}\langle \hat v_n - \bar v,\ \bar v - v^*\rangle \\
&\overset{(i)}{\le} \mathbb{E}\|\hat v_n - \bar v\|^2 + \|\bar v - v^*\|^2 + 2\sqrt{\mathbb{E}\|\hat v_n - \bar v\|^2 \cdot \|\bar v - v^*\|^2} \\
&\overset{(ii)}{\le} \mathbb{E}\|\hat v_n - \bar v\|^2 + \|\bar v - v^*\|^2 + \tfrac{1}{\omega}\,\mathbb{E}\|\hat v_n - \bar v\|^2 + \omega\|\bar v - v^*\|^2 \\
&= (1 + \omega)\|\bar v - v^*\|^2 + \big(1 + \tfrac{1}{\omega}\big)\mathbb{E}\|\hat v_n - \bar v\|^2,
\end{aligned}$$

where step (i) follows from the Cauchy–Schwarz inequality; and step (ii) follows from the arithmetic-geometric mean inequality, and is valid for any $\omega > 0$. Substituting the bounds from equations (45a) and (45b) yields the claim of the theorem.

The remainder of our argument is devoted to the proofs of the bounds (45a) and (45b).

5.1.1 Proof of approximation error bound (45a)

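We begin with some decomposition relations for vectors and operators. Note that $\mathbb{S}$ is a finite-dimensional subspace, and therefore is closed. We use

$$\mathbb{S}^\perp := \{u \in \mathbb{X} \mid \langle u, v\rangle = 0 \text{ for all } v \in \mathbb{S}\}$$

to denote its orthogonal complement. The pair $(\mathbb{S}, \mathbb{S}^\perp)$ forms a direct sum decomposition of $\mathbb{X}$, and the projection operators satisfy $\Pi_{\mathbb{S}} + \Pi_{\mathbb{S}^\perp} = I$. Also define the operators $L_{\mathbb{S},\mathbb{S}} = \Pi_{\mathbb{S}} L \Pi_{\mathbb{S}}$ and $L_{\mathbb{S},\perp} = \Pi_{\mathbb{S}} L \Pi_{\mathbb{S}^\perp}$. With this notation, our proof can be broken down into two auxiliary lemmas, which we state here:

Lemma 3. The error $\|\bar v - v^*\|$ between the projected fixed point $\bar v$ and the original fixed point $v^*$ is bounded as

$$\|\bar v - v^*\|^2 \le \big(1 + |||(I - L_{\mathbb{S},\mathbb{S}})^{-1} L_{\mathbb{S},\perp}|||_{\mathbb{X}}^2\big) \inf_{v \in \mathbb{S}} \|v - v^*\|^2. \tag{46}$$

Lemma 4. Under the set-up above, we have

$$|||(I - L_{\mathbb{S},\mathbb{S}})^{-1} L_{\mathbb{S},\perp}|||_{\mathbb{X}}^2 \le \lambda_{\max}\Big((I_d - M)^{-1}\big(|||L|||_{\mathbb{X}}^2 I_d - MM^\top\big)(I_d - M)^{-\top}\Big).$$

The claimed bound (45a) on the approximation error follows by combining these two lemmas, and recalling our definition of $\alpha(M, |||L|||_{\mathbb{X}})$. We now prove these two lemmas in turn.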

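5.1.2 Proof of Lemma 3

For any vector $v \in \mathbb{X}$, we perform the orthogonal decomposition $v = v_{\mathbb{S}} + v_\perp$, where $v_{\mathbb{S}} := \Pi_{\mathbb{S}}(v)$ is a member of the set $\mathbb{S}$, and $v_\perp := \Pi_{\mathbb{S}^\perp}(v)$ is a member of the set $\mathbb{S}^\perp$. With this notation, the operator $L$ can be decomposed as

$$L = (\Pi_{\mathbb{S}} + \Pi_{\mathbb{S}^\perp}) L (\Pi_{\mathbb{S}} + \Pi_{\mathbb{S}^\perp}) = \underbrace{\Pi_{\mathbb{S}} L \Pi_{\mathbb{S}}}_{=: L_{\mathbb{S},\mathbb{S}}} + \underbrace{\Pi_{\mathbb{S}} L \Pi_{\mathbb{S}^\perp}}_{=: L_{\mathbb{S},\perp}} + \underbrace{\Pi_{\mathbb{S}^\perp} L \Pi_{\mathbb{S}}}_{=: L_{\perp,\mathbb{S}}} + \underbrace{\Pi_{\mathbb{S}^\perp} L \Pi_{\mathbb{S}^\perp}}_{=: L_{\perp,\perp}}.$$

The four operators $L_{\mathbb{S},\mathbb{S}}, L_{\mathbb{S},\perp}, L_{\perp,\mathbb{S}}, L_{\perp,\perp}$ defined in the equation above are also bounded linear operators. By the properties of projection operators, we note that $L_{\mathbb{S},\mathbb{S}}$ and $L_{\perp,\mathbb{S}}$ both map each element of $\mathbb{S}^\perp$ to 0, and $L_{\mathbb{S},\perp}$ and $L_{\perp,\perp}$ both map each element of $\mathbb{S}$ to 0. Decomposing the target vector $v^*$ in an analogous manner yields the two components

$$\tilde v := \Pi_{\mathbb{S}}(v^*), \quad \text{and} \quad v^\perp := v^* - \tilde v.$$

The fixed point equation $v^* = Lv^* + b$ can then be written using $\mathbb{S}$ and its orthogonal complement as

$$\tilde v \overset{(a)}{=} L_{\mathbb{S},\mathbb{S}}\tilde v + L_{\mathbb{S},\perp} v^\perp + b_{\mathbb{S}}, \quad \text{and} \quad v^\perp \overset{(b)}{=} L_{\perp,\mathbb{S}}\tilde v + L_{\perp,\perp} v^\perp + b_\perp. \tag{47}$$

For the projected solution $\bar v$, we have the defining equation

$$\bar v = L_{\mathbb{S},\mathbb{S}}\bar v + b_{\mathbb{S}}. \tag{48}$$

Subtracting equation (48) from equation (47)(a) yields

$$(I - L_{\mathbb{S},\mathbb{S}})(\tilde v - \bar v) = L_{\mathbb{S},\perp} v^\perp.$$

Recall the quantity $M = \Phi_d L \Phi_d^*$, and our assumption that $\kappa(M) = \frac{1}{2}\lambda_{\max}(M + M^\top) < 1$. This condition implies that $I - L_{\mathbb{S},\mathbb{S}}$ is invertible on the subspace $\mathbb{S}$. Since this operator also maps each element of $\mathbb{S}^\perp$ to itself, it is invertible on all of $\mathbb{X}$, and we have $\tilde v - \bar v = (I - L_{\mathbb{S},\mathbb{S}})^{-1} L_{\mathbb{S},\perp} v^\perp$. Applying the Pythagorean theorem then yields

$$\|\bar v - v^*\|^2 = \|\bar v - \tilde v\|^2 + \|\tilde v - v^*\|^2 = \|(I - L_{\mathbb{S},\mathbb{S}})^{-1} L_{\mathbb{S},\perp} v^\perp\|^2 + \|v^\perp\|^2 \le \big(1 + |||(I - L_{\mathbb{S},\mathbb{S}})^{-1} L_{\mathbb{S},\perp}|||_{\mathbb{X}}^2\big) \cdot \|v^\perp\|^2. \tag{49}$$

Since $\|v^\perp\|^2 = \inf_{v \in \mathbb{S}}\|v - v^*\|^2$, this is the bound (46), as claimed.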
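The Pythagorean identity and the bound (49) can be checked numerically in a finite-dimensional Euclidean setting. The sketch below takes $\mathbb{X} = \mathbb{R}^D$ with the standard inner product and represents $\Phi_d$ as a $d \times D$ matrix with orthonormal rows, so that $\Pi_{\mathbb{S}} = \Phi_d^\top \Phi_d$; the dimensions, random seed, and scaling of $L$ are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
D, d = 8, 3

# Random operator with |||L||| = 0.9, so that kappa(M) < 1 is guaranteed.
L = rng.standard_normal((D, D))
L *= 0.9 / np.linalg.norm(L, 2)
b = rng.standard_normal(D)

# Phi has orthonormal rows spanning the subspace S; Pi is the projector onto S.
Q, _ = np.linalg.qr(rng.standard_normal((D, d)))
Phi = Q.T
Pi = Phi.T @ Phi
I = np.eye(D)

v_star = np.linalg.solve(I - L, b)                        # v* = L v* + b
M = Phi @ L @ Phi.T                                       # projected matrix
v_bar = Phi.T @ np.linalg.solve(np.eye(d) - M, Phi @ b)   # v_bar = Pi L v_bar + Pi b
v_perp = v_star - Pi @ v_star                             # component of v* outside S

L_SS = Pi @ L @ Pi
L_Sperp = Pi @ L @ (I - Pi)
T = np.linalg.solve(I - L_SS, L_Sperp)                    # (I - L_SS)^{-1} L_{S,perp}
tau = np.linalg.norm(T, 2)

lhs = np.sum((v_bar - v_star) ** 2)
pythagoras = np.sum((T @ v_perp) ** 2) + np.sum(v_perp ** 2)
bound = (1.0 + tau**2) * np.sum(v_perp ** 2)
print(lhs, pythagoras, bound)
```

The first two printed numbers agree up to rounding, reflecting the equality in (49), while the third is the worst-case bound over all directions of $v^\perp$.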

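5.1.3 Proof of Lemma 4

By the definition of the operator norm, for any vector $v \in \mathbb{X}$ such that $\|v\| = 1$, we have

$$|||L|||_{\mathbb{X}}^2 \ge \|Lv\|^2 = \|L_{\mathbb{S},\mathbb{S}} v_{\mathbb{S}} + L_{\mathbb{S},\perp} v_\perp\|^2 + \|L_{\perp,\mathbb{S}} v_{\mathbb{S}} + L_{\perp,\perp} v_\perp\|^2 \ge \|L_{\mathbb{S},\mathbb{S}} v_{\mathbb{S}} + L_{\mathbb{S},\perp} v_\perp\|^2.$$

Noting the fact that $L_{\mathbb{S},\mathbb{S}} v_\perp = 0 = L_{\mathbb{S},\perp} v_{\mathbb{S}}$, we have the following norm bound on the linear operator $L_{\mathbb{S},\mathbb{S}} + L_{\mathbb{S},\perp}$:

$$|||L_{\mathbb{S},\mathbb{S}} + L_{\mathbb{S},\perp}|||_{\mathbb{X}} = \sup_{\|v\| = 1} \|(L_{\mathbb{S},\mathbb{S}} + L_{\mathbb{S},\perp})v\| = \sup_{\|v\| = 1} \|L_{\mathbb{S},\mathbb{S}} v_{\mathbb{S}} + L_{\mathbb{S},\perp} v_\perp\| \le |||L|||_{\mathbb{X}}.$$

By definition, the operator $L_{\mathbb{S},\perp}^* = \Pi_{\mathbb{S}^\perp} L^* \Pi_{\mathbb{S}}$ maps any vector to $\mathbb{S}^\perp$, and the operator $L_{\mathbb{S},\mathbb{S}}$ maps any element of $\mathbb{S}^\perp$ to 0. Therefore, we have the identity $L_{\mathbb{S},\mathbb{S}} L_{\mathbb{S},\perp}^* = 0$. A similar argument yields that $L_{\mathbb{S},\perp} L_{\mathbb{S},\mathbb{S}}^* = 0$. Consequently, we have

$$|||L|||_{\mathbb{X}}^2 \ge |||L_{\mathbb{S},\mathbb{S}} + L_{\mathbb{S},\perp}|||_{\mathbb{X}}^2 = |||(L_{\mathbb{S},\mathbb{S}} + L_{\mathbb{S},\perp})(L_{\mathbb{S},\mathbb{S}} + L_{\mathbb{S},\perp})^*|||_{\mathbb{X}} = |||\underbrace{L_{\mathbb{S},\mathbb{S}} L_{\mathbb{S},\mathbb{S}}^* + L_{\mathbb{S},\perp} L_{\mathbb{S},\perp}^*}_{=: G}|||_{\mathbb{X}}. \tag{50}$$

Note that the operator $G$ can be expressed as $G = \Pi_{\mathbb{S}}\big(L \Pi_{\mathbb{S}} L^* + L \Pi_{\mathbb{S}^\perp} L^*\big)\Pi_{\mathbb{S}}$. From this representation, we see that:

• For any vector $x \in \mathbb{X}$, we have $Gx \in \mathbb{S}$.

• For any vector $y \in \mathbb{S}^\perp$, we have $Gy = 0$.

Consequently, there exists a matrix $\tilde G \in \mathbb{R}^{d \times d}$ such that $G = \Phi_d^* \tilde G \Phi_d$. Since $G$ is a positive semi-definite operator, the matrix $\tilde G$ is positive semi-definite. Equation (50) implies that

$$\lambda_{\max}(\tilde G) = |||\tilde G|||_{\mathrm{op}} = |||G|||_{\mathbb{X}} \le |||L|||_{\mathbb{X}}^2. \tag{51}$$

Now defining $\tau := |||(I - L_{\mathbb{S},\mathbb{S}})^{-1} L_{\mathbb{S},\perp}|||_{\mathbb{X}}$, note that

$$\tau^2 = |||\underbrace{(I - L_{\mathbb{S},\mathbb{S}})^{-1} L_{\mathbb{S},\perp} L_{\mathbb{S},\perp}^* (I - L_{\mathbb{S},\mathbb{S}}^*)^{-1}}_{=: H}|||_{\mathbb{X}}. \tag{52}$$

Moreover, the operator $H$ is self-adjoint, and we have the following properties:

• The operator $L_{\mathbb{S},\perp}$ maps any vector to $\mathbb{S}$, and $(I - L_{\mathbb{S},\mathbb{S}})^{-1}$ maps $\mathbb{S}$ to itself. Consequently, for any $x \in \mathbb{X}$, the vector $Hx = \big((I - L_{\mathbb{S},\mathbb{S}})^{-1} L_{\mathbb{S},\perp} L_{\mathbb{S},\perp}^* (I - L_{\mathbb{S},\mathbb{S}}^*)^{-1}\big)x$ is a member of the set $\mathbb{S}$.

• The operator $L_{\mathbb{S},\perp}^* = \Pi_{\mathbb{S}^\perp} L^* \Pi_{\mathbb{S}}$ maps any vector from $\mathbb{S}^\perp$ to 0. Consequently, for any $y \in \mathbb{S}^\perp$, we have $Hy = \big((I - L_{\mathbb{S},\mathbb{S}})^{-1} L_{\mathbb{S},\perp} L_{\mathbb{S},\perp}^* (I - L_{\mathbb{S},\mathbb{S}}^*)^{-1}\big)y = 0$.

Owing to the facts above, there exists a matrix $\tilde H \in \mathbb{R}^{d \times d}$ such that $H = \Phi_d^* \tilde H \Phi_d$. Since the operator $H$ is positive semi-definite, so is the matrix $\tilde H$. Consequently, by equation (52), we obtain the identity $\tau^2 = |||H|||_{\mathbb{X}} = |||\tilde H|||_{\mathrm{op}} = \lambda_{\max}(\tilde H)$. In particular, letting $u \in \mathbb{S}^{d-1}$ be a maximal eigenvector of $\tilde H$, we have

$$\tilde H \succeq \tau^2 uu^\top. \tag{53}$$

Since $M = \Phi_d L_{\mathbb{S},\mathbb{S}} \Phi_d^*$ by definition, combining the above matrix inequalities (51) and (53), we arrive at the bound:

$$\begin{aligned}
|||L|||_{\mathbb{X}}^2 I_d \succeq \tilde G &= \Phi_d\big(L_{\mathbb{S},\mathbb{S}} L_{\mathbb{S},\mathbb{S}}^* + L_{\mathbb{S},\perp} L_{\mathbb{S},\perp}^*\big)\Phi_d^* \\
&= \Phi_d L_{\mathbb{S},\mathbb{S}} L_{\mathbb{S},\mathbb{S}}^* \Phi_d^* + \big(\Phi_d (I - L_{\mathbb{S},\mathbb{S}}) \Phi_d^*\big) \cdot \Phi_d (I - L_{\mathbb{S},\mathbb{S}})^{-1} L_{\mathbb{S},\perp} L_{\mathbb{S},\perp}^* (I - L_{\mathbb{S},\mathbb{S}}^*)^{-1} \Phi_d^* \cdot \big(\Phi_d (I - L_{\mathbb{S},\mathbb{S}}^*) \Phi_d^*\big) \\
&= MM^\top + (I - M)\tilde H (I - M^\top) \\
&\succeq MM^\top + \tau^2 (I - M) uu^\top (I - M^\top).
\end{aligned}$$

Re-arranging and noting that $u \in \mathbb{S}^{d-1}$, we arrive at the inequality

$$\tau^2 \le u^\top\Big[(I - M)^{-1}\big(|||L|||_{\mathbb{X}}^2 I_d - MM^\top\big)(I - M)^{-\top}\Big]u \le \lambda_{\max}\Big((I - M)^{-1}\big(|||L|||_{\mathbb{X}}^2 I_d - MM^\top\big)(I - M)^{-\top}\Big),$$

which completes the proof of Lemma 4.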
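The inequality of Lemma 4 can be probed numerically on random instances. The sketch below again assumes the Euclidean setting $\mathbb{X} = \mathbb{R}^D$ with $\Phi_d$ a matrix with orthonormal rows; the helper name `lemma4_sides`, the dimensions, and the number of trials are hypothetical choices for illustration.

```python
import numpy as np

def lemma4_sides(L, Phi):
    """Return (tau^2, lambda_max((I-M)^{-1} (|||L|||^2 I - M M^T) (I-M)^{-T}))."""
    D, d = L.shape[0], Phi.shape[0]
    Pi = Phi.T @ Phi                       # projector onto S
    M = Phi @ L @ Phi.T                    # projected matrix
    T = np.linalg.solve(np.eye(D) - Pi @ L @ Pi, Pi @ L @ (np.eye(D) - Pi))
    tau_sq = np.linalg.norm(T, 2) ** 2
    inv = np.linalg.inv(np.eye(d) - M)
    op_sq = np.linalg.norm(L, 2) ** 2
    N = inv @ (op_sq * np.eye(d) - M @ M.T) @ inv.T
    return tau_sq, np.linalg.eigvalsh((N + N.T) / 2.0).max()

rng = np.random.default_rng(2)
results = []
for _ in range(20):
    D, d = 10, 3
    L = rng.standard_normal((D, D))
    L *= 0.95 / np.linalg.norm(L, 2)       # |||L||| = 0.95, hence kappa(M) < 1
    Q, _ = np.linalg.qr(rng.standard_normal((D, d)))
    results.append(lemma4_sides(L, Q.T))

for tau_sq, rhs in results[:3]:
    print(tau_sq, rhs)
```

In every trial the first number should not exceed the second; the gap between the two indicates how conservative the matrix bound is for a particular complement block $L_{\mathbb{S},\perp}$.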

5.1.4 Proof of estimation error bound (45b)

We now turn to the proof of our claimed bound on the estimation error. Our analysis relies on two auxiliary lemmas. The first lemma provides bounds on the mean-squared error of the standard iterates {vt}t≥0—that is, without the averaging step:

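Lemma 5. Suppose that the noise conditions in Assumption 1(W) hold. Then for any stepsize $\eta \in \Big(0, \frac{1 - \kappa}{4\sigma_L^2 d + 1 + |||L|||_{\mathbb{X}}^2}\Big]$, we have the bound

$$\mathbb{E}\|v_t - \bar v\|^2 \le e^{-(1 - \kappa)\eta t/2}\, \mathbb{E}\|v_0 - \bar v\|^2 + \frac{8\eta}{1 - \kappa}\big(\|\bar v\|^2 \sigma_L^2 d + \sigma_b^2 d\big) \qquad \text{valid for } t = 1, 2, \ldots. \tag{54}$$

See Section 5.1.5 for the proof of this claim.

Our second lemma provides a bound on the PR-averaged estimate $\hat v_n$ based on $n$ observations in terms of a covariance term, along with the error of the non-averaged sequences $\{v_t\}_{t \ge 1}$:

Lemma 6. Under the setup above, we have the bound

$$\mathbb{E}\|\hat v_n - \bar v\|^2 \le \frac{6}{n - n_0}\, \mathrm{trace}\big((I - M)^{-1}\Sigma^*(I - M)^{-\top}\big) + \frac{6}{(n - n_0)^2}\sum_{t=n_0}^{n} \mathbb{E}\|(I - M)^{-1}\Phi_d (L_{t+1} - L)(v_t - \bar v)\|_2^2 + \frac{3\,\mathbb{E}\|v_n - v_{n_0}\|^2}{\eta^2 (n - n_0)^2 (1 - \kappa)^2}. \tag{55}$$

See Section 5.1.6 for the proof of this claim.

Equipped with these two lemmas, we can now complete the proof of the claimed bound (45b) on the estimation error. Recalling that $n_0 = n/2$, we see that the first term in the bound (55) matches a term in the bound (45b). As for the remaining two terms in equation (55), the second moment bounds from Assumption 1(W) combined with the assumption that $\kappa(M) < 1$ imply that

$$\begin{aligned}
\mathbb{E}\|(I - M)^{-1}\Phi_d (L_{t+1} - L)(v_t - \bar v)\|_2^2 &\le \frac{1}{(1 - \kappa)^2}\, \mathbb{E}\|\Phi_d (L_{t+1} - L)(v_t - \bar v)\|_2^2 \\
&\le \frac{1}{(1 - \kappa)^2}\sum_{j=1}^{d} \mathbb{E}\langle \phi_j, (L_{t+1} - L)(v_t - \bar v)\rangle^2 \\
&\le \frac{\sigma_L^2 d\, \mathbb{E}\|v_t - \bar v\|^2}{(1 - \kappa)^2}.
\end{aligned}$$

On the other hand, we can use Lemma 5 to control the third term in the bound (55). We begin by observing that

$$\mathbb{E}\|v_n - v_{n_0}\|^2 \le 2\,\mathbb{E}\|v_n - \bar v\|^2 + 2\,\mathbb{E}\|v_{n_0} - \bar v\|^2 \le 4 \sup_{n_0 \le t \le n} \mathbb{E}\|v_t - \bar v\|^2.$$

If we choose a burn-in time $n_0 > \frac{c_0}{(1 - \kappa)\eta}\log\big(\frac{\|v_0 - \bar v\|^2 d}{1 - \kappa}\big)$, then Lemma 5 ensures that

$$\sup_{n_0 \le t \le n} \mathbb{E}\|v_t - \bar v\|^2 \le \frac{16\eta}{1 - \kappa}\big(\|\bar v\|^2 \sigma_L^2 d + \sigma_b^2 d\big).$$

Finally, taking the step size $\eta = \frac{1}{24\sigma_L\sqrt{dn}}$, recalling that $n_0 = n/2$, and putting together the pieces yields

$$\begin{aligned}
\mathbb{E}\|\hat v_n - \bar v\|^2 &\le \frac{12}{n}\, \mathrm{trace}\big((I - M)^{-1}\Sigma^*(I - M)^{-\top}\big) + \frac{1}{(1 - \kappa)^2}\Big(\frac{12\sigma_L^2 d}{n} + \frac{48}{\eta^2 n^2}\Big)\sup_{n_0 \le t \le n} \mathbb{E}\|v_t - \bar v\|^2 \\
&\le \frac{12}{n}\, \mathrm{trace}\big((I - M)^{-1}\Sigma^*(I - M)^{-\top}\big) + \frac{48\sigma_L}{(1 - \kappa)^3}\Big(\frac{d}{n}\Big)^{3/2}\big(\|\bar v\|^2 \sigma_L^2 + \sigma_b^2\big),
\end{aligned}$$

as claimed.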

It remains to prove our two auxiliary lemmas, which we do in the following subsections.

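5.1.5 Proof of Lemma 5

We now prove Lemma 5, which provides a bound on the error of the non-averaged iterates $\{v_t\}_{t \ge 1}$, as defined in equation (24a). Using the form of the update, we expand the mean-squared error to find that

$$\begin{aligned}
\mathbb{E}\|v_{t+1} - \bar v\|^2 &= \mathbb{E}\|(I - \eta I + \eta\Pi_{\mathbb{S}} L)(v_t - \bar v) + \eta\Pi_{\mathbb{S}}(L_{t+1} - L)v_t + \eta\Pi_{\mathbb{S}}(b_{t+1} - b)\|^2 \\
&\overset{(i)}{=} \mathbb{E}\|(I - \eta I + \eta\Pi_{\mathbb{S}} L)(v_t - \bar v)\|^2 + \eta^2\, \mathbb{E}\|\Pi_{\mathbb{S}}(L_{t+1} - L)v_t + \Pi_{\mathbb{S}}(b_{t+1} - b)\|^2 \\
&\overset{(ii)}{\le} (1 - \eta(1 - \kappa))\, \mathbb{E}\|v_t - \bar v\|^2 + 2\eta^2\, \mathbb{E}\|\Pi_{\mathbb{S}}(L_{t+1} - L)(v_t - \bar v)\|^2 + 2\eta^2\, \mathbb{E}\|\Pi_{\mathbb{S}}(L_{t+1} - L)\bar v + \Pi_{\mathbb{S}}(b_{t+1} - b)\|^2.
\end{aligned} \tag{56}$$

In step (i), we have made use of the fact that the noise is unbiased, and in step (ii), we have used that for any $\Delta$ in the subspace $\mathbb{S}$ and any stepsize $\eta \in \Big(0, \frac{1 - \kappa}{1 + |||L|||_{\mathbb{X}}^2}\Big]$, we have

$$\begin{aligned}
\|(I - \eta I + \eta\Pi_{\mathbb{S}} L)\Delta\|^2 &= (1 - \eta)^2\|\Delta\|^2 + \eta^2\|\Pi_{\mathbb{S}} L\Delta\|^2 + 2(1 - \eta)\eta\langle \Delta, \Pi_{\mathbb{S}} L\Delta\rangle \\
&\le \big\{1 - 2\eta + \eta^2 + \eta^2 |||L|||_{\mathbb{X}}^2 + 2(1 - \eta)\eta\kappa\big\}\|\Delta\|^2 \\
&\le \big(1 - \eta(1 - \kappa)\big)\|\Delta\|^2.
\end{aligned}$$

Turning to the second term of equation (56), the moment bounds in Assumption 1(W) imply that

$$\mathbb{E}\|\Pi_{\mathbb{S}}(L_{t+1} - L)(v_t - \bar v)\|^2 = \sum_{j=1}^{d} \mathbb{E}\langle \phi_j, (L_{t+1} - L)(v_t - \bar v)\rangle^2 \le \mathbb{E}\|v_t - \bar v\|^2\, \sigma_L^2 d.$$

Finally, the last term of equation (56) is also handled by Assumption 1(W), whence we obtain

$$\mathbb{E}\|\Pi_{\mathbb{S}}(L_{t+1} - L)\bar v + \Pi_{\mathbb{S}}(b_{t+1} - b)\|^2 \le 2\sum_{j=1}^{d} \mathbb{E}\langle \phi_j, (L_{t+1} - L)\bar v\rangle^2 + 2\sum_{j=1}^{d} \mathbb{E}\langle \phi_j, b_{t+1} - b\rangle^2 \le 2\|\bar v\|^2 \sigma_L^2 d + 2\sigma_b^2 d.$$

Putting together the pieces, we see that provided $\eta < \frac{1 - \kappa}{4\sigma_L^2 d + 1 + |||L|||_{\mathbb{X}}^2}$, we have

$$\begin{aligned}
\mathbb{E}\|v_{t+1} - \bar v\|^2 &\le \big(1 - \eta(1 - \kappa) + 2\eta^2\sigma_L^2 d\big)\, \mathbb{E}\|v_t - \bar v\|^2 + 4\eta^2\big(\|\bar v\|^2 \sigma_L^2 d + \sigma_b^2 d\big) \\
&\le \Big(1 - \frac{\eta(1 - \kappa)}{2}\Big)\mathbb{E}\|v_t - \bar v\|^2 + 4\eta^2\big(\|\bar v\|^2 \sigma_L^2 d + \sigma_b^2 d\big).
\end{aligned}$$

Finally, rolling out the recursion yields the bound

$$\mathbb{E}\|v_n - \bar v\|^2 \le e^{-(1 - \kappa)\eta n/2}\, \mathbb{E}\|v_0 - \bar v\|^2 + \frac{8\eta}{1 - \kappa}\big(\|\bar v\|^2 \sigma_L^2 d + \sigma_b^2 d\big),$$

which completes the proof.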
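The final "rolling out" step can be illustrated with a scalar recursion. The sketch below iterates the last display with equality, $a_{t+1} = (1 - \eta(1 - \kappa)/2)\, a_t + 4\eta^2 C$, where $C$ stands for $\|\bar v\|^2\sigma_L^2 d + \sigma_b^2 d$, and compares it against the bound (54); the numerical values of $a_0$, $\eta$, $\kappa$, and $C$ are arbitrary illustrative choices.

```python
import numpy as np

def rollout(a0, eta, kappa, C, n):
    """Iterate a_{t+1} = (1 - eta (1 - kappa) / 2) a_t + 4 eta^2 C for t = 0, ..., n-1."""
    rho = 1.0 - eta * (1.0 - kappa) / 2.0
    a = np.empty(n + 1)
    a[0] = a0
    for t in range(n):
        a[t + 1] = rho * a[t] + 4.0 * eta**2 * C
    return a

def lemma5_bound(a0, eta, kappa, C, t):
    """Right-hand side of (54): exponential decay of the initial error plus a noise floor."""
    return np.exp(-(1.0 - kappa) * eta * t / 2.0) * a0 + 8.0 * eta * C / (1.0 - kappa)

# Hypothetical values; C plays the role of ||v_bar||^2 sigma_L^2 d + sigma_b^2 d.
a0, eta, kappa, C, n = 10.0, 0.01, 0.5, 3.0, 5000
a = rollout(a0, eta, kappa, C, n)
t = np.arange(n + 1)
bound = lemma5_bound(a0, eta, kappa, C, t)
print(a[-1], bound[-1])
```

The two ingredients are $(1 - x)^t \le e^{-xt}$ for the transient term and the geometric series $\sum_{k \ge 0} (1 - x)^k = 1/x$ with $x = \eta(1 - \kappa)/2$ for the noise floor, which together give the constant $8\eta/(1 - \kappa)$.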

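5.1.6 Proof of Lemma 6

Recall that $\bar v$ satisfies the fixed point equation $\bar v = \Pi_{\mathbb{S}} L\bar v + \Pi_{\mathbb{S}} b$. Using this fact, we can derive the following elementary identity:

$$\frac{v_{n_0} - v_n}{\eta(n - n_0)} = \frac{1}{n - n_0}\sum_{t=n_0}^{n-1}\big(v_t - \Pi_{\mathbb{S}} L_{t+1} v_t - \Pi_{\mathbb{S}} b_{t+1}\big) = (I - \Pi_{\mathbb{S}} L)(\hat v_n - \bar v) - \Psi_n^{(1)} - \Psi_n^{(2)}, \tag{57}$$

where $\Psi_n^{(1)} := \frac{1}{n - n_0}\sum_{t=n_0}^{n-1} \Pi_{\mathbb{S}}(L_{t+1} - L)v_t$ and $\Psi_n^{(2)} := \frac{1}{n - n_0}\sum_{t=n_0}^{n-1} \Pi_{\mathbb{S}}(b_{t+1} - b)$.

Re-arranging terms and applying the Cauchy–Schwarz inequality, we have

$$\|\hat v_n - \bar v\|^2 \le 3\Big\{\frac{1}{(n - n_0)^2\eta^2}\|(I - \Pi_{\mathbb{S}} L)^{-1}(v_n - v_{n_0})\|^2 + \|(I - \Pi_{\mathbb{S}} L)^{-1}\Psi_n^{(1)}\|^2 + \|(I - \Pi_{\mathbb{S}} L)^{-1}\Psi_n^{(2)}\|^2\Big\}.$$

Note that the quantities $(n - n_0)\Psi_n^{(1)}$ and $(n - n_0)\Psi_n^{(2)}$ are martingales adapted to the filtration $\mathcal{F}_n := \sigma(\{L_i, b_i\}_{i=1}^{n})$, so that

$$\begin{aligned}
\mathbb{E}\|\hat v_n - \bar v\|^2 &\le \frac{3}{(n - n_0)^2}\sum_{t=n_0}^{n-1} \mathbb{E}\|(I - \Pi_{\mathbb{S}} L)^{-1}\Pi_{\mathbb{S}}(L_{t+1} - L)v_t\|^2 + \frac{3}{(n - n_0)^2}\sum_{t=n_0}^{n-1} \mathbb{E}\|(I - \Pi_{\mathbb{S}} L)^{-1}\Pi_{\mathbb{S}}(b_{t+1} - b)\|^2 \\
&\quad + \frac{3}{(n - n_0)^2\eta^2}\, \mathbb{E}\|(I - \Pi_{\mathbb{S}} L)^{-1}(v_n - v_{n_0})\|^2.
\end{aligned}$$

We claim that for any vector $v \in \mathbb{X}$, we have

$$(I - \Pi_{\mathbb{S}} L)^{-1}\Pi_{\mathbb{S}} v = \Phi_d^*(I - M)^{-1}\Phi_d v. \tag{58}$$

Taking this claim as given for the moment, by applying equation (58) with $v = (L_{t+1} - L)v_t$ and $v = b_{t+1} - b$, we find that

$$\mathbb{E}\|(I - \Pi_{\mathbb{S}} L)^{-1}\Pi_{\mathbb{S}}(L_{t+1} - L)v_t\|^2 = \mathbb{E}\|(I - M)^{-1}\Phi_d(L_{t+1} - L)v_t\|_2^2 \le 2\,\mathbb{E}\|(I - M)^{-1}\Phi_d(L_{t+1} - L)\bar v\|_2^2 + 2\,\mathbb{E}\|(I - M)^{-1}\Phi_d(L_{t+1} - L)(v_t - \bar v)\|_2^2,$$

and

$$\mathbb{E}\|(I - \Pi_{\mathbb{S}} L)^{-1}\Pi_{\mathbb{S}}(b_{t+1} - b)\|^2 = \mathbb{E}\|(I - M)^{-1}\Phi_d(b_{t+1} - b)\|_2^2.$$

Putting together the pieces, we obtain

$$\begin{aligned}
\mathbb{E}\|\hat v_n - \bar v\|^2 &\le \frac{3}{n - n_0}\, \mathrm{trace}\big((I - M)^{-1} \cdot \mathrm{cov}(\Phi_d(b_1 - b)) \cdot (I - M)^{-\top}\big) + \frac{6}{n - n_0}\, \mathrm{trace}\big((I - M)^{-1} \cdot \mathrm{cov}(\Phi_d(L_1 - L)\bar v) \cdot (I - M)^{-\top}\big) \\
&\quad + \frac{6}{(n - n_0)^2}\sum_{t=n_0}^{n} \mathbb{E}\|(I - M)^{-1}\Phi_d(L_{t+1} - L)(v_t - \bar v)\|_2^2 + \frac{3\,\mathbb{E}\|v_n - v_{n_0}\|^2}{\eta^2(n - n_0)^2(1 - \kappa)^2},
\end{aligned}$$

as claimed. It remains to prove the identity (58).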

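Proof of claim (58): Note that for any vector $v \in \mathbb{X}$, the vector $z := (I - \Pi_{\mathbb{S}} L)^{-1}\Pi_{\mathbb{S}} v$ is a member of $\mathbb{S}$, since $z = \Pi_{\mathbb{S}} Lz + \Pi_{\mathbb{S}} v$. Furthermore, since $\{\phi_j\}_{j=1}^{d}$ is an orthonormal basis for $\mathbb{S}$, we have $z = \Pi_{\mathbb{S}} z = \Phi_d^*\Phi_d z$, and consequently,

$$\Phi_d z = \Phi_d L z + \Phi_d v = (\Phi_d L \Phi_d^*)\Phi_d z + \Phi_d v = M\Phi_d z + \Phi_d v.$$

Since the matrix $I_d - M$ is invertible, we have $\Phi_d z = (I_d - M)^{-1}\Phi_d v$. Consequently, we have the identity $z = \Phi_d^*\Phi_d z = \Phi_d^*(I_d - M)^{-1}\Phi_d v$, which proves the claim.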
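The identity (58) is easy to confirm numerically. The sketch below assumes the Euclidean setting $\mathbb{X} = \mathbb{R}^D$, where $\Phi_d$ is a $d \times D$ matrix with orthonormal rows and $\Phi_d^* = \Phi_d^\top$; the dimensions and random draws are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
D, d = 7, 3

L = rng.standard_normal((D, D))
L *= 0.8 / np.linalg.norm(L, 2)           # keeps I - Pi L and I - M invertible
Q, _ = np.linalg.qr(rng.standard_normal((D, d)))
Phi = Q.T                                 # Phi_d, with adjoint Phi.T
Pi = Phi.T @ Phi                          # projector onto S
M = Phi @ L @ Phi.T                       # M = Phi_d L Phi_d^*
v = rng.standard_normal(D)

# Left side: solve the D-dimensional system (I - Pi L) z = Pi v.
lhs = np.linalg.solve(np.eye(D) - Pi @ L, Pi @ v)
# Right side: solve the d-dimensional system and lift back to X.
rhs = Phi.T @ np.linalg.solve(np.eye(d) - M, Phi @ v)
print(np.max(np.abs(lhs - rhs)))
```

Beyond its role in the proof, the identity shows that all computation can be carried out in $d$ dimensions rather than in the ambient space.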

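5.2 Proof of Corollary 1

We begin by applying Theorem 1 with $\omega = 1$. Applying Lemmas 1 and 2 yields the desired bounds on the approximation error in parts (a) and (b). We also claim that

$$\mathcal{E}_n(M, \Sigma^*) \le \frac{(\sigma_L^2\|\bar v\|^2 + \sigma_b^2)\, d}{(1 - \kappa)^2 n}, \tag{59a}$$

and that given a sample size such that $n \ge \frac{c\sigma_L^2 d}{(1 - \kappa)^2}\log^2\big(\frac{\|v_0 - \bar v\|^2 d}{1 - \kappa}\big) > \frac{c\sigma_L^2 d}{(1 - \kappa)^2}$, we have

$$\mathcal{H}_n(\sigma_L, \sigma_b, \bar v) \le \frac{\sigma_L}{1 - \kappa}\sqrt{\frac{d}{n}} \cdot \frac{(\sigma_L^2\|\bar v\|^2 + \sigma_b^2)\, d}{(1 - \kappa)^2 n} \le \frac{(\sigma_L^2\|\bar v\|^2 + \sigma_b^2)\, d}{(1 - \kappa)^2 n}. \tag{59b}$$

Combining these two auxiliary claims establishes the corollary. It remains to establish the bounds (59a) and (59b).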

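Proof of claim (59a): Let us first handle the contribution to this error from the noise variables $b_i$. We begin with the following sequence of bounds:

$$\begin{aligned}
\mathrm{trace}\big((I - M)^{-1}\mathrm{cov}(\Phi_d(b_1 - b))(I - M)^{-\top}\big) &= \mathrm{trace}\big((I - M)^{-\top}(I - M)^{-1} \cdot \mathrm{cov}(\Phi_d(b_1 - b))\big) \\
&\le |||(I - M)^{-\top}(I - M)^{-1}|||_{\mathrm{op}} \cdot |||\mathrm{cov}(\Phi_d(b_1 - b))|||_{\mathrm{nuc}} \\
&\le |||(I - M)^{-1}|||_{\mathrm{op}}^2\, \mathrm{trace}\big(\mathrm{cov}(\Phi_d(b_1 - b))\big).
\end{aligned}$$

By the assumption $\kappa(M) < 1$, for any vector $u \in \mathbb{R}^d$, we have that

$$(1 - \kappa)\|u\|_2^2 \le \langle (I - M)u, u\rangle \le \|(I - M)u\|_2 \cdot \|u\|_2.$$

Consequently, we have the bound $|||(I - M)^{-1}|||_{\mathrm{op}} \le \frac{1}{1 - \kappa(M)}$. For the trace of the covariance, we note by Assumption 1(W) that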
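The operator-norm bound $|||(I - M)^{-1}|||_{\mathrm{op}} \le \frac{1}{1 - \kappa(M)}$ only uses the symmetric part of $M$, so it holds even when $M$ itself is not a contraction. The following sketch checks it on random matrices that are shifted so that $\kappa(M) = 0.7$; the helper name `kappa`, the target value 0.7, and the dimensions are hypothetical choices.

```python
import numpy as np

def kappa(M):
    """kappa(M) = (1/2) * lambda_max(M + M^T)."""
    return 0.5 * np.linalg.eigvalsh(M + M.T).max()

rng = np.random.default_rng(4)
checks = []
for _ in range(50):
    d = 5
    M = rng.standard_normal((d, d))
    # Shifting by a multiple of the identity shifts kappa by the same amount.
    M = M - (kappa(M) - 0.7) * np.eye(d)
    inv_norm = np.linalg.norm(np.linalg.inv(np.eye(d) - M), 2)
    checks.append((inv_norm, 1.0 / (1.0 - kappa(M))))

print(max(a / b for a, b in checks))
```

The printed ratio is at most 1, and it equals 1 only when the inverse acts most strongly along the top eigenvector of the symmetric part of $M$.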

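$$\mathrm{trace}\big(\mathrm{cov}(\Phi_d(b_1 - b))\big) = \sum_{j=1}^{d} \mathbb{E}\langle \phi_j, b_1 - b\rangle^2 \le \sigma_b^2 d.$$

Putting together the pieces yields $\mathrm{trace}\big((I - M)^{-1}\mathrm{cov}(\Phi_d(b_1 - b))(I - M)^{-\top}\big) \le \frac{\sigma_b^2 d}{(1 - \kappa)^2}$. Turning now to the contribution to the error from the random observation $L_i$, we have

$$\mathrm{trace}\big((I - M)^{-1}\mathrm{cov}(\Phi_d(L_1 - L)\bar v)(I - M)^{-\top}\big) \le |||(I - M)^{-1}|||_{\mathrm{op}}^2\, \mathrm{trace}\big(\mathrm{cov}(\Phi_d(L_1 - L)\bar v)\big).$$

Once again, Assumption 1(W) yields the bound

$$\mathrm{trace}\big(\mathrm{cov}(\Phi_d(L_1 - L)\bar v)\big) = \sum_{j=1}^{d} \mathbb{E}\langle \phi_j, (L_1 - L)\bar v\rangle^2 \le \sigma_L^2\|\bar v\|^2 d,$$

and combining the pieces proves the claim.

Proof of claim (59b): The proof of this claim is immediate. Simply note that for $n \ge \frac{c\sigma_L^2 d}{(1 - \kappa)^2}\log^2\big(\frac{\|v_0 - \bar v\|^2 d}{1 - \kappa}\big) > \frac{c\sigma_L^2 d}{(1 - \kappa)^2}$, we have

$$\mathcal{H}_n(\sigma_L, \sigma_b, \bar v) \le \frac{\sigma_L}{1 - \kappa}\sqrt{\frac{d}{n}} \cdot \frac{(\sigma_L^2\|\bar v\|^2 + \sigma_b^2)\, d}{(1 - \kappa)^2 n} \le \frac{(\sigma_L^2\|\bar v\|^2 + \sigma_b^2)\, d}{(1 - \kappa)^2 n}.$$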

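5.3 Proof of Theorem 2

At a high level, our proof of the lower bound proceeds by constructing two ensembles of problem instances that are hard to distinguish from each other, and such that the approximation error on at least one of them is large. The two instances are indexed by values of a bit $z \in \{-1, 1\}$, and each instance is, in turn, obtained as a mixture over $2^{D-d}$ centers; each center is indexed by a binary string $\varepsilon \in \{-1, 1\}^{D-d}$. The problem is then phrased as one of estimating the value of $z$ from the observations; this is effectively a reduction to testing and the use of Le Cam's mixture-vs-mixture method.

Specifically, let $u \in \mathbb{S}^{d-1}$ be an eigenvector associated to the largest eigenvalue of the matrix $(I - M_0)^{-1}\big(\gamma_{\max}^2 I - M_0 M_0^\top\big)(I - M_0)^{-\top}$. By the definition of the approximation factor $\alpha(M_0, \gamma_{\max})$, we have:

$$\big(\alpha(M_0, \gamma_{\max}) - 1\big) \cdot (I - M_0)uu^\top(I - M_0)^\top \preceq \gamma_{\max}^2 I - M_0 M_0^\top.$$

Based on the eigenvector $u$, we further define the $d$-dimensional vectors:

$$w := \sqrt{\alpha(M_0, \gamma_{\max}) - 1} \cdot (I - M_0)u, \quad \text{and} \quad y := \sqrt{\alpha(M_0, \gamma_{\max}) - 1} \cdot \delta u. \tag{60}$$

Substituting into the above PSD domination relation yields that

$$ww^\top + M_0 M_0^\top \preceq \gamma_{\max}^2 I. \tag{61}$$

Now consider the following class of (population-level) problem instances $(L^{(\varepsilon,z)}, b^{(\varepsilon,z)}, v^*_{\varepsilon,z})$ indexed by a binary string $\varepsilon \in \{-1, 1\}^{D-d}$ and a bit $z \in \{-1, 1\}$:

$$L^{(\varepsilon,z)} := \begin{bmatrix} M_0 & \frac{\sqrt{d}}{D-d}\varepsilon_{d+1} w & \cdots & \frac{\sqrt{d}}{D-d}\varepsilon_D w \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix}, \qquad v^*_{\varepsilon,z} := \begin{bmatrix} \sqrt{2d}\big(zy + (I - M_0)^{-1}h_0\big) \\ \sqrt{2}\, z\delta\varepsilon_{d+1} \\ \vdots \\ \sqrt{2}\, z\delta\varepsilon_D \end{bmatrix},$$

$$b^{(\varepsilon,z)} := (I - L^{(\varepsilon,z)})v^*_{\varepsilon,z} = \begin{bmatrix} \sqrt{2d}\, h_0 \\ \sqrt{2}\, z\delta\varepsilon_{d+1} \\ \vdots \\ \sqrt{2}\, z\delta\varepsilon_D \end{bmatrix}. \tag{62}$$

We take the weight vector $\xi$ to be

$$\xi = \Big[\underbrace{\tfrac{1}{2d} \ \cdots \ \tfrac{1}{2d}}_{d} \ \ \underbrace{\tfrac{1}{2(D-d)} \ \cdots \ \tfrac{1}{2(D-d)}}_{D-d}\Big],$$

and the weighted inner product $\langle \cdot, \cdot\rangle$ on the space $\mathbb{X} = \mathbb{R}^D$ is defined via

$$\langle p, q\rangle := \sum_{j=1}^{D} p_j \xi_j q_j \qquad \text{for each pair } p, q \in \mathbb{R}^D.$$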

This choice of inner product then induces the vector norm $\|\cdot\|$ and operator norm $|||\cdot|||_{\mathbb{X}}$. Next, we define the basis vectors via

$$\phi_i = \begin{cases} \sqrt{2d}\, e_i & \text{for } i = 1, 2, \ldots, d, \text{ and} \\ \sqrt{2(D-d)}\, e_i & \text{for } i = d+1, \ldots, D. \end{cases}$$

By construction, we have ensured that $\|\phi_i\| = 1$ for each $i \in [D]$. We let the subspace $\mathbb{S}$ be the span of the first $d$ standard basis vectors, i.e., $\mathbb{S} := \mathrm{span}(e_1, e_2, \ldots, e_d)$. For each binary string $\varepsilon \in \{-1, 1\}^{D-d}$ and signed bit $z \in \{-1, 1\}$, a straightforward calculation reveals that the projected problem instance satisfies the identities

$$\Phi_d L^{(\varepsilon,z)} \Phi_d^* = M_0, \quad \text{and} \quad \Phi_d b^{(\varepsilon,z)} = h_0. \tag{63a}$$

Also note that for any pair $(\varepsilon, z)$, we have by construction that

$$\inf_{v \in \mathbb{S}} \|v^*_{\varepsilon,z} - v\|^2 = \frac{1}{2(D-d)} \sum_{j=d+1}^{D} \big( \sqrt{2} z \delta \varepsilon_j \big)^2 = \delta^2. \tag{63b}$$

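In words, this shows that the $\|\cdot\|$-error of approximating $v^*_{\varepsilon,z}$ with the linear subspace $\mathbb{S}$ is always $\delta$, irrespective of which $\varepsilon \in \{-1, 1\}^{D-d}$ and $z \in \{-1, 1\}$ are chosen.

Next, we construct the random observation models for the i.i.d. observations, which are also indexed by the pair $(\varepsilon, z)$. In particular, we construct the random matrix $L_i^{(\varepsilon,z)}$ and random vector $b_i^{(\varepsilon,z)}$ via

$$L_i^{(\varepsilon,z)} := \begin{bmatrix} M_0 & 0 & \cdots & 0 & \sqrt{d}\, \varepsilon_{\tau_L^{(i)}} w & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 & 0 & 0 & \cdots & 0 \\ \vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 0 & 0 & 0 & \cdots & 0 \end{bmatrix}, \qquad b_i^{(\varepsilon,z)} := \begin{bmatrix} \sqrt{2d} h_0 \\ 0 \\ \vdots \\ 0 \\ \sqrt{2} (D-d) z \delta \varepsilon_{\tau_b^{(i)}} \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \tag{64}$$

where the vector $\sqrt{d}\, \varepsilon_{\tau_L^{(i)}} w$ occupies column $\tau_L^{(i)}$ of the matrix, the scalar $\sqrt{2}(D-d) z \delta \varepsilon_{\tau_b^{(i)}}$ occupies row $\tau_b^{(i)}$ of the vector, and the random indices $\tau_L^{(i)}$ and $\tau_b^{(i)}$ are chosen independently and uniformly at random from the set $\{d+1, d+2, \ldots, D\}$. By construction, we have ensured that for each $\varepsilon \in \{-1, 1\}^{D-d}$ and $z \in \{-1, 1\}$, the observations have mean

$$\mathbb{E}\big[ L_i^{(\varepsilon,z)} \big] = L^{(\varepsilon,z)}, \quad \text{and} \quad \mathbb{E}\big[ b_i^{(\varepsilon,z)} \big] = b^{(\varepsilon,z)}.$$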
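The unbiasedness of the observation model can be sanity-checked by averaging exactly over the $D - d$ equally likely index values. The following is a minimal sketch, not part of the proof: the numerical values of `M0`, `h0`, `w`, `delta`, and the sign pattern `eps` are arbitrary placeholders chosen for illustration, and the scalings $\sqrt{d}/(D-d)$ and $\sqrt{2}(D-d)$ are those read off from equations (62) and (64).

```python
import math

d, D, delta, z = 2, 5, 0.3, -1
eps = {j: (-1) ** j for j in range(d + 1, D + 1)}  # arbitrary signs, keyed by index j
w = [0.4, -0.2]                                     # placeholder for the vector w in (60)
h0 = [1.0, 2.0]                                     # placeholder for h_0
M0 = [[0.5, 0.1], [0.0, 0.3]]                       # placeholder for M_0

def sample_L(tau):
    """One observation L_i from (64), given the drawn column index tau."""
    L = [[0.0] * D for _ in range(D)]
    for r in range(d):
        for c in range(d):
            L[r][c] = M0[r][c]
        L[r][tau - 1] = math.sqrt(d) * eps[tau] * w[r]
    return L

def sample_b(tau):
    """One observation b_i from (64), given the drawn row index tau."""
    b = [math.sqrt(2 * d) * h for h in h0] + [0.0] * (D - d)
    b[tau - 1] = math.sqrt(2) * (D - d) * z * delta * eps[tau]
    return b

taus = range(d + 1, D + 1)
# Exact expectations: tau is uniform over the D - d indices.
mean_L = [[sum(sample_L(t)[r][c] for t in taus) / (D - d) for c in range(D)]
          for r in range(D)]
mean_b = [sum(sample_b(t)[r] for t in taus) / (D - d) for r in range(D)]

# Population quantities from (62).
pop_L = [[0.0] * D for _ in range(D)]
for r in range(d):
    for c in range(d):
        pop_L[r][c] = M0[r][c]
    for j in taus:
        pop_L[r][j - 1] = math.sqrt(d) / (D - d) * eps[j] * w[r]
pop_b = ([math.sqrt(2 * d) * h for h in h0]
         + [math.sqrt(2) * z * delta * eps[j] for j in taus])

max_err_L = max(abs(mean_L[r][c] - pop_L[r][c]) for r in range(D) for c in range(D))
max_err_b = max(abs(mean_b[r] - pop_b[r]) for r in range(D))
```

Both `max_err_L` and `max_err_b` should vanish up to floating-point rounding, reflecting that each sampled column or entry is inflated by exactly the factor $D - d$ that the uniform draw divides out.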

This concludes our description of the problem instances themselves. Since our proof proceeds via Le Cam's lemma, we require some more notation for product distributions and mixtures under this observation model. Let $\mathbb{P}^{(n)}_{\varepsilon,z}$ denote the $n$-fold product of the probability laws of the pair $\big( L_i^{(\varepsilon,z)}, b_i^{(\varepsilon,z)} \big)$. We also define the following mixture of product measures for each $z \in \{-1, 1\}$:

$$\mathbb{P}^{(n)}_z := \frac{1}{2^{D-d}} \sum_{\varepsilon \in \{\pm 1\}^{D-d}} \mathbb{P}^{(n)}_{\varepsilon,z}.$$

We seek bounds on the total variation distance $d_{\mathrm{TV}}\big( \mathbb{P}^{(n)}_1, \mathbb{P}^{(n)}_{-1} \big)$. With this setup, the following lemmas assert that (a) our construction satisfies the conditions in Assumption 1(S), and (b) the total variation distance is small provided $n \ll \sqrt{D - d}$.

Lemma 7. For each binary string $\varepsilon \in \{-1, 1\}^{D-d}$ and bit $z \in \{-1, 1\}$:
(a) The population-level matrix $L^{(\varepsilon,z)}$ defined in equation (62) satisfies $|||L^{(\varepsilon,z)}|||_{\mathbb{X}} \le \gamma_{\max}$.
(b) The random observations $\big( L_i^{(\varepsilon,z)}, b_i^{(\varepsilon,z)} \big)$ defined in equation (64) satisfy Assumption 1(S), for any scalar pair $(\sigma_L, \sigma_b)$ such that $\sigma_L \ge \gamma_{\max}$ and $\sigma_b \ge \delta$.

Lemma 8. Under the set-up above, we have $d_{\mathrm{TV}}\big( \mathbb{P}^{(n)}_1, \mathbb{P}^{(n)}_{-1} \big) \le \frac{12 n^2}{D - d}$.

Part (a) of Lemma 7 and equations (63a)–(63b) together ensure that the population-level problem instance $(L, b)$ we constructed belongs to the class $\mathbb{C}_{\mathrm{approx}}(M_0, h_0, D, \delta, \gamma_{\max})$. Part (b) of Lemma 7 further ensures that the probability distribution $\mathbb{P}_{L,b}$ belongs to the class $\mathbb{G}_{\mathrm{var}}(\sigma_L, \sigma_b)$. Lemma 8 ensures that the two mixture distributions corresponding to different choices of the bit $z$ are close provided $n$ is not too large. The final step in applying Le Cam's mixture-vs-mixture result is to show that the approximation error is large for at least one of the choices of the bit $z$. We carry out this step by splitting the rest of the proof into two cases, depending on whether or not we enforce that our estimator $\hat{v}$ is constrained to lie in the subspace $\mathbb{S}$. Throughout, we use the decomposition $\hat{v} = \begin{bmatrix} \hat{v}_1 \\ \hat{v}_2 \end{bmatrix}$, where $\hat{v}_1 \in \mathbb{R}^d$ and $\hat{v}_2 \in \mathbb{R}^{D-d}$. Also recall the definition of the vector $y$ from equation (60).

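Case I: $\hat{v} \in \mathbb{S}$. This corresponds to the "proper learning" case where the estimator is restricted to take values in the subspace $\mathbb{S}$ and $\hat{v}_2 = 0$. Note that for any $\varepsilon \in \{-1, 1\}^{D-d}$, the Pythagorean theorem gives

$$\|v^*_{\varepsilon,z} - \hat{v}\|^2 = \|v^*_{\varepsilon,z} - \Pi_{\mathbb{S}}(v^*_{\varepsilon,z})\|^2 + \|\Pi_{\mathbb{S}}(v^*_{\varepsilon,z}) - \hat{v}\|^2 = \delta^2 + \frac{1}{2d} \|\hat{v}_1 - \sqrt{2d}\, z y\|_2^2.$$

Therefore, for any $\varepsilon, \varepsilon' \in \{-1, 1\}^{D-d}$, the following chain of inequalities holds:

$$\begin{aligned} \frac{1}{2} \Big( \|v^*_{\varepsilon,1} - \hat{v}\|^2 + \|v^*_{\varepsilon',-1} - \hat{v}\|^2 \Big) &= \delta^2 + \frac{1}{4d} \Big( \|\hat{v}_1 - \sqrt{2d}\, y\|_2^2 + \|\hat{v}_1 + \sqrt{2d}\, y\|_2^2 \Big) \\ &= \delta^2 + \frac{1}{2d} \Big( \|\hat{v}_1\|_2^2 + 2d \|y\|_2^2 \Big) \\ &\ge \delta^2 + \|y\|_2^2 \\ &= \alpha(M_0, \gamma_{\max}) \cdot \delta^2. \end{aligned}$$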
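For readers who want the intermediate step spelled out, the passage from the first to the second line of the chain is the parallelogram law, and the final equality uses the fact that $u$ is a unit vector. The following worked computation is a sketch of that reasoning; the shorthand $a$ is introduced here purely for illustration.

```latex
\begin{align*}
\|\hat v_1 - a\|_2^2 + \|\hat v_1 + a\|_2^2
  &= 2\|\hat v_1\|_2^2 + 2\|a\|_2^2,
  \qquad a := \sqrt{2d}\, y, \\
\frac{1}{4d}\Big( 2\|\hat v_1\|_2^2 + 4d\,\|y\|_2^2 \Big)
  &= \frac{1}{2d}\|\hat v_1\|_2^2 + \|y\|_2^2
  \;\ge\; \|y\|_2^2
  = \big(\alpha(M_0,\gamma_{\max}) - 1\big)\,\delta^2\,\|u\|_2^2
  = \big(\alpha(M_0,\gamma_{\max}) - 1\big)\,\delta^2 .
\end{align*}
```

Adding the $\delta^2$ term contributed by the orthogonal complement then gives the stated value $\alpha(M_0, \gamma_{\max}) \cdot \delta^2$.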

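By Le Cam's lemma, we thus have

$$\inf_{\hat{v}_n \in \widehat{\mathcal{V}}_{\mathbb{S}}} \; \sup_{\substack{(L,b) \in \mathbb{C}_{\mathrm{approx}} \\ \mathbb{P}_{L,b} \in \mathbb{G}_{\mathrm{var}}(\sigma_L, \sigma_b)}} \mathbb{E} \|\hat{v}_n - v^*\|^2 \ge \alpha(M_0, \gamma_{\max}) \delta^2 \cdot \Big( 1 - d_{\mathrm{TV}}\big( \mathbb{P}^{(n)}_{-1}, \mathbb{P}^{(n)}_{1} \big) \Big) \overset{(i)}{\ge} (1 - \omega) \cdot \alpha(M_0, \gamma_{\max}) \cdot \delta^2,$$

where in step (i), we have applied Lemma 8 in conjunction with the inequality $D \ge d + \frac{12 n^2}{\omega}$.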
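To get a feel for how large the ambient dimension must be for step (i) to apply, the condition $D \ge d + 12n^2/\omega$ can be evaluated directly. The helper names below are hypothetical, and the restriction $\omega \in (0, 1)$ is an assumption made for this sketch.

```python
import math

def min_ambient_dimension(d: int, n: int, omega: float) -> int:
    """Smallest integer D satisfying D >= d + 12 n^2 / omega."""
    if not 0.0 < omega < 1.0:
        raise ValueError("omega is assumed to lie in (0, 1)")
    return d + math.ceil(12 * n * n / omega)

def tv_upper_bound(d: int, D: int, n: int) -> float:
    """Right-hand side of Lemma 8: 12 n^2 / (D - d)."""
    return 12 * n * n / (D - d)

D = min_ambient_dimension(d=5, n=100, omega=0.5)
bound = tv_upper_bound(d=5, D=D, n=100)
```

With these inputs the required dimension grows quadratically in the sample size, which is why the lower bound is stated for problems whose ambient dimension is much larger than $n^2$.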

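Case II: $\hat{v} \notin \mathbb{S}$. This corresponds to the case of "improper learning" where the estimator can take values in the entire space $\mathbb{X}$. In this case, for any pair $\varepsilon, \varepsilon' \in \{-1, 1\}^{D-d}$, we obtain

$$\|v^*_{\varepsilon,1} - v^*_{\varepsilon',-1}\| \ge \Big\| \begin{bmatrix} 2\sqrt{2d}\, y^\top & 0 & \cdots & 0 \end{bmatrix}^\top \Big\| = 2 \|y\|_2 = 2 \delta \sqrt{\alpha(M_0, \gamma_{\max}) - 1}.$$

Applying the triangle inequality and Young's inequality yields the bound

$$\frac{1}{2} \Big( \|\hat{v} - v^*_{\varepsilon,1}\|^2 + \|\hat{v} - v^*_{\varepsilon',-1}\|^2 \Big) \ge \frac{1}{4} \Big( \|\hat{v} - v^*_{\varepsilon,1}\| + \|\hat{v} - v^*_{\varepsilon',-1}\| \Big)^2 \ge \frac{1}{4} \|v^*_{\varepsilon,1} - v^*_{\varepsilon',-1}\|^2 \ge \big( \alpha(M_0, \gamma_{\max}) - 1 \big) \cdot \delta^2.$$

By Le Cam's lemma, we once again have

$$\inf_{\hat{v}_n \in \widehat{\mathcal{V}}_{\mathbb{X}}} \; \sup_{\substack{(L,b) \in \mathbb{C}_{\mathrm{approx}} \\ \mathbb{P}_{L,b} \in \mathbb{G}_{\mathrm{var}}(\sigma_L, \sigma_b)}} \mathbb{E} \|\hat{v}_n - v^*\|^2 \ge \big( \alpha(M_0, \gamma_{\max}) - 1 \big) \cdot \delta^2 \cdot \Big( 1 - d_{\mathrm{TV}}\big( \mathbb{P}^{(n)}_{-1}, \mathbb{P}^{(n)}_{1} \big) \Big) \ge (1 - \omega) \cdot \big( \alpha(M_0, \gamma_{\max}) - 1 \big) \cdot \delta^2.$$

Putting together the two cases completes the proof.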

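5.3.1 Proof of Lemma 7

We prove the two parts of the lemma separately. Once again, recall our definition of the pair $(w, y)$ from equation (60).

Proof of part (a): In order to study the operator norm of the matrix $L^{(\varepsilon,z)}$ in the Hilbert space $\mathbb{X}$, we consider a vector $p = \begin{bmatrix} p^{(1)} \\ p^{(2)} \end{bmatrix} \in \mathbb{R}^D$, with $p^{(1)} \in \mathbb{R}^d$ and $p^{(2)} \in \mathbb{R}^{D-d}$. Assuming $\|p\| = 1$, we have

$$\|L^{(\varepsilon,z)} p\|^2 = \frac{1}{2d} \Big\| M_0 p^{(1)} + w \cdot \frac{\sqrt{d}}{D-d} \sum_{j=d+1}^{D} \varepsilon_j p^{(2)}_j \Big\|_2^2.$$

By the Cauchy–Schwarz inequality, we have

$$\Big( \frac{\sqrt{d}}{D-d} \sum_{j=d+1}^{D} \varepsilon_j p^{(2)}_j \Big)^2 \le \frac{d}{(D-d)^2} \Big( \sum_{j=d+1}^{D} \varepsilon_j^2 \Big) \Big( \sum_{j=d+1}^{D} \big( p^{(2)}_j \big)^2 \Big) = \frac{d}{D-d} \|p^{(2)}\|_2^2.$$

Define the vector $a_1 := \frac{1}{\sqrt{2d}} p^{(1)} \in \mathbb{R}^d$ and the scalar $a_2 := \frac{1}{\sqrt{2(D-d)}} \|p^{(2)}\|_2$. Clearly, we have $1 = \|p\|^2 = \|a_1\|_2^2 + a_2^2$, and so

$$\begin{aligned} \|L^{(\varepsilon,z)} p\|^2 &\le \frac{1}{2d} \cdot \sup_{t \in [-1,1]} \|M_0 p^{(1)} + \sqrt{2d}\, a_2 t w\|_2^2 \\ &= \frac{1}{2d} \cdot \max \Big( \|M_0 p^{(1)} + \sqrt{2d}\, a_2 w\|_2^2, \; \|M_0 p^{(1)} - \sqrt{2d}\, a_2 w\|_2^2 \Big) \\ &= \max \Big( \|M_0 a_1 + a_2 w\|_2^2, \; \|M_0 a_1 - a_2 w\|_2^2 \Big) \\ &\le ||| \begin{bmatrix} M_0 & w \end{bmatrix} |||_{\mathrm{op}}^2. \end{aligned}$$

Equation (61) implies that $||| \begin{bmatrix} M_0 & w \end{bmatrix} |||_{\mathrm{op}}^2 = \lambda_{\max}\big( M_0 M_0^\top + w w^\top \big) \le \gamma_{\max}^2$, and therefore, for all $\varepsilon \in \{-1, 1\}^{D-d}$ and $z \in \{-1, 1\}$, we have

$$|||L^{(\varepsilon,z)}|||_{\mathbb{X}} = \sup_{\|p\| = 1} \|L^{(\varepsilon,z)} p\| \le \gamma_{\max},$$

as desired.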
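The operator-norm bound can be probed numerically in the simplest case $d = 1$. The sketch below is illustrative only: it takes a scalar $M_0 = m$ with $|m| \le \gamma_{\max}$ and $m < 1$, and assumes (as the display preceding equation (60) suggests) that $\alpha - 1 = (\gamma_{\max}^2 - m^2)/(1 - m)^2$, so that $w = \sqrt{\gamma_{\max}^2 - m^2}$ and relation (61) holds with equality. It then samples random unit vectors in the weighted norm and records the largest value of $\|L^{(\varepsilon,z)} p\|$ seen.

```python
import math
import random

m, gamma_max, D = 0.5, 0.9, 6           # scalar M0 = m, so d = 1
d = 1
w = math.sqrt(gamma_max**2 - m**2)      # scalar case of (60): w^2 + m^2 = gamma_max^2
rng = random.Random(0)
eps = [rng.choice((-1, 1)) for _ in range(D - d)]

def weighted_norm(p):
    """Norm induced by the weights 1/(2d) and 1/(2(D-d))."""
    head = sum(x * x for x in p[:d]) / (2 * d)
    tail = sum(x * x for x in p[d:]) / (2 * (D - d))
    return math.sqrt(head + tail)

def apply_L(p):
    """Apply L^(eps,z) from (62); only the first d = 1 coordinate is nonzero."""
    cross = sum(e * x for e, x in zip(eps, p[d:]))
    first = m * p[0] + math.sqrt(d) / (D - d) * w * cross
    return [first] + [0.0] * (D - d)

worst = 0.0
for _ in range(20000):
    p = [rng.gauss(0.0, 1.0) for _ in range(D)]
    scale = weighted_norm(p)
    p = [x / scale for x in p]          # normalize so that ||p|| = 1
    worst = max(worst, weighted_norm(apply_L(p)))
```

Random sampling only gives a lower estimate of the supremum, so `worst` staying below `gamma_max` is consistent with part (a) rather than a proof of it.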

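Proof of part (b): Consider any pair of vectors $p, q \in \mathbb{R}^D$ such that $\|p\| = \|q\| = 1$. Using the decompositions $p = \begin{bmatrix} p^{(1)} \\ p^{(2)} \end{bmatrix}$ and $q = \begin{bmatrix} q^{(1)} \\ q^{(2)} \end{bmatrix}$, with $p^{(1)}, q^{(1)} \in \mathbb{R}^d$ and $p^{(2)}, q^{(2)} \in \mathbb{R}^{D-d}$, we have

$$\begin{aligned} \mathbb{E} \big\langle p, (L_i^{(\varepsilon,z)} - L^{(\varepsilon,z)}) q \big\rangle^2 &\le \frac{1}{(2d)^2} \mathbb{E} \Big( \sqrt{d}\, \varepsilon_{\tau_L^{(i)}} q^{(2)}_{\tau_L^{(i)}} w^\top p^{(1)} \Big)^2 \\ &= \frac{1}{4d} \big( w^\top p^{(1)} \big)^2 \cdot \mathbb{E} \Big( q^{(2)}_{\tau_L^{(i)}} \Big)^2 \\ &\le \frac{1}{4d} \|p^{(1)}\|_2^2 \cdot \|w\|_2^2 \cdot \frac{1}{D-d} \|q^{(2)}\|_2^2 \\ &\le \|w\|_2^2 \cdot \|p\|^2 \cdot \|q\|^2 \le \|w\|_2^2. \end{aligned}$$

Recall that $M_0 M_0^\top + w w^\top \preceq \gamma_{\max}^2 I_d$ by equation (61). Consequently, we have $\|w\|_2^4 \le \|M_0^\top w\|_2^2 + \|w\|_2^4 \le \gamma_{\max}^2 \|w\|_2^2$, which implies that $\|w\|_2^2 \le \gamma_{\max}^2$. Therefore, the noise assumption in equation (12a) is satisfied with parameter $\sigma_L = \gamma_{\max}$.

For the noise on the vector $b$, we note that

$$\mathbb{E} \big\langle b_i^{(\varepsilon,z)} - b, p \big\rangle^2 \le \frac{1}{4(D-d)^2} \mathbb{E} \Big( \sqrt{2} (D-d) z \delta \varepsilon_{\tau_b^{(i)}} p^{(2)}_{\tau_b^{(i)}} \Big)^2 \le \frac{\delta^2}{2} \mathbb{E} \Big( p^{(2)}_{\tau_b^{(i)}} \Big)^2 \le \frac{\delta^2}{2} \cdot \frac{1}{D-d} \|p^{(2)}\|_2^2 \le \delta^2 \|p\|^2 = \delta^2,$$

showing that the noise assumption (12b) is satisfied with $\sigma_b = \delta$.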

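5.3.2 Proof of Lemma 8

Recall that $\tau_L^{(i)}, \tau_b^{(i)}$ are the random indices in the $i$-th sample. We define $\mathcal{E}$ to be the event that the indices $\big( \tau_L^{(i)} \big)_{i=1}^n, \big( \tau_b^{(i)} \big)_{i=1}^n$ are not all distinct, i.e.,

$$\mathcal{E} := \Big\{ \exists\, i_1, i_2 \in [n] \text{ s.t. } \tau_L^{(i_1)} = \tau_b^{(i_2)} \Big\} \cup \Big\{ \exists\, i_1 \ne i_2 \text{ s.t. } \tau_L^{(i_1)} = \tau_L^{(i_2)} \Big\} \cup \Big\{ \exists\, i_1 \ne i_2 \text{ s.t. } \tau_b^{(i_1)} = \tau_b^{(i_2)} \Big\}.$$

We claim that

$$\mathbb{P}^{(n)}_1 \big| \mathcal{E}^C = \mathbb{P}^{(n)}_{-1} \big| \mathcal{E}^C. \tag{65}$$

Assuming equation (65), we now give a proof of the upper bound on the total variation distance. We use the following lemma:

Lemma 9. Given two probability measures $\mathbb{P}_1, \mathbb{P}_2$ and an event $\mathcal{E}$ with $\mathbb{P}_1(\mathcal{E}), \mathbb{P}_2(\mathcal{E}) < 1/2$, we have

$$d_{\mathrm{TV}}(\mathbb{P}_1, \mathbb{P}_2) \le d_{\mathrm{TV}}\big( \mathbb{P}_1 | \mathcal{E}^C, \mathbb{P}_2 | \mathcal{E}^C \big) + 3 \mathbb{P}_1(\mathcal{E}) + 3 \mathbb{P}_2(\mathcal{E}).$$

In order to bound the probability of $\mathcal{E}$, we apply a union bound. Under either of the probability measures $\mathbb{P}^{(n)}_1$ and $\mathbb{P}^{(n)}_{-1}$, we have the following bound:

$$\mathbb{P}(\mathcal{E}) \le \sum_{i,j \in [n]} \mathbb{P}\big( \tau_L^{(i)} = \tau_b^{(j)} \big) + \sum_{i_1 < i_2} \mathbb{P}\big( \tau_L^{(i_1)} = \tau_L^{(i_2)} \big) + \sum_{i_1 < i_2} \mathbb{P}\big( \tau_b^{(i_1)} = \tau_b^{(i_2)} \big) \le \frac{2 n^2}{D - d}.$$

Since the conditional laws coincide by equation (65), Lemma 9 then gives $d_{\mathrm{TV}}\big( \mathbb{P}^{(n)}_1, \mathbb{P}^{(n)}_{-1} \big) \le 6 \cdot \frac{2n^2}{D-d} = \frac{12 n^2}{D - d}$, which is the claim of Lemma 8. (When $\mathbb{P}(\mathcal{E}) \ge 1/2$, this bound exceeds one and holds trivially.)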
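The event $\mathcal{E}$ is a birthday-type collision among $2n$ independent uniform draws from $D - d$ values, so its probability can be computed exactly and compared with the union bound $2n^2/(D-d)$. The function names below are hypothetical and the parameter values are arbitrary; this is a sketch for intuition, not part of the argument.

```python
def collision_probability(n: int, m: int) -> float:
    """Exact P(2n i.i.d. uniform draws from m values are not all distinct)."""
    p_distinct = 1.0
    for k in range(2 * n):
        # The (k+1)-th draw must avoid the k values already seen.
        p_distinct *= max(0.0, 1.0 - k / m)
    return 1.0 - p_distinct

def union_bound(n: int, m: int) -> float:
    """The bound 2 n^2 / (D - d), written with m = D - d."""
    return 2 * n * n / m

exact = collision_probability(n=10, m=10_000)
bound = union_bound(n=10, m=10_000)
```

The exact probability sits slightly below the union bound, since the union bound counts $2n^2 - n$ pairs and rounds up to $2n^2$.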

Proof of equation (65): For $D \ge 2n + d$, we define a probability measure $\mathbb{Q}$ through the following sampling procedure:

• Sample a subset $S \subseteq \{d+1, \ldots, D\}$ of size $2n$ uniformly at random over all $\binom{D-d}{2n}$ possible subsets.

• Partition the set $S$ into two disjoint subsets $S = S_L \cup S_b$, each of size $n$. The partition is chosen uniformly at random over all $\binom{2n}{n}$ possible partitions. Let

$$S_L := \big\{ \tilde{\tau}_L^{(1)}, \tilde{\tau}_L^{(2)}, \ldots, \tilde{\tau}_L^{(n)} \big\} \quad \text{and} \quad S_b := \big\{ \tilde{\tau}_b^{(1)}, \tilde{\tau}_b^{(2)}, \ldots, \tilde{\tau}_b^{(n)} \big\}.$$

• For each $i \in [n]$, sample two random bits $\zeta_L^{(i)}, \zeta_b^{(i)} \overset{\text{i.i.d.}}{\sim} \mathcal{U}(\{-1, 1\})$.

• Let $\mathbb{Q}$ be the probability distribution of the observations $(L_i, b_i)_{i=1}^n$ that are constructed from the tuple $\big( \tilde{\tau}_L^{(i)}, \tilde{\tau}_b^{(i)}, \zeta_L^{(i)}, \zeta_b^{(i)} \big)$ defined above. Specifically, we let

$$L_i := \begin{bmatrix} M_0 & 0 & \cdots & 0 & \sqrt{d}\, \zeta_L^{(i)} w & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 & 0 & 0 & \cdots & 0 \\ \vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 0 & 0 & 0 & \cdots & 0 \end{bmatrix}, \qquad b_i := \begin{bmatrix} \sqrt{2d} h_0 \\ 0 \\ \vdots \\ 0 \\ \sqrt{2} (D-d) \zeta_b^{(i)} \delta \\ 0 \\ \vdots \\ 0 \end{bmatrix},$$

where the vector $\sqrt{d}\, \zeta_L^{(i)} w$ appears in the $\tilde{\tau}_L^{(i)}$-th column of the matrix $L_i$, and the scalar $\sqrt{2} (D-d) \zeta_b^{(i)} \delta$ appears in the $\tilde{\tau}_b^{(i)}$-th row of the vector $b_i$.

For either choice of the bit $z \in \{\pm 1\}$, we claim that the probability measure $\mathbb{P}^{(n)}_z | \mathcal{E}^C$ is identical to the distribution $\mathbb{Q}$. To prove this claim, we first note that conditioned on the event $\mathcal{E}^C$, the indices $\big( \tau_L^{(i)}, \tau_b^{(i)} \big)_{i=1}^n$ form a uniform random subset of $\{d+1, \ldots, D\}$ with cardinality $2n$, and the partition into $\big( \tau_L^{(i)} \big)_{i=1}^n$ and $\big( \tau_b^{(i)} \big)_{i=1}^n$ is a uniform random partition, i.e.,

$$\big( \tau_L^{(i)}, \tau_b^{(i)} \big)_{i=1}^n \,\Big|\, \mathcal{E}^C \overset{d}{=} \big( \tilde{\tau}_L^{(i)}, \tilde{\tau}_b^{(i)} \big)_{i=1}^n, \quad \text{under both } \mathbb{P}^{(n)}_1 \text{ and } \mathbb{P}^{(n)}_{-1}. \tag{67}$$

Given indices $\big( t_L^{(i)}, t_b^{(i)} \big)_{i=1}^n \subseteq \{d+1, \ldots, D\}$ that are mutually distinct, conditioned on the value $\big( \tau_L^{(i)}, \tau_b^{(i)} \big)_{i=1}^n = \big( t_L^{(i)}, t_b^{(i)} \big)_{i=1}^n$, the observed random bits under the probability distribution $\mathbb{P}^{(n)}_z$ are given by

$$\varepsilon_{\tau_L^{(1)}}, \varepsilon_{\tau_L^{(2)}}, \ldots, \varepsilon_{\tau_L^{(n)}}, \; z \varepsilon_{\tau_b^{(1)}}, \ldots, z \varepsilon_{\tau_b^{(n)}},$$

which are $2n$ independent Rademacher random variables.

On the other hand, the random bits $\zeta_L^{(1)}, \zeta_L^{(2)}, \ldots, \zeta_L^{(n)}, \zeta_b^{(1)}, \zeta_b^{(2)}, \ldots, \zeta_b^{(n)}$ are also $2n$ independent Rademacher random variables. Consequently, for any indices $\big( t_L^{(i)}, t_b^{(i)} \big)_{i=1}^n \subseteq \{d+1, \ldots, D\}$ that are mutually distinct, we have the following equality in distribution:

$$\big( \varepsilon_{\tau_L^{(i)}}, z \varepsilon_{\tau_b^{(i)}} \big)_{i=1}^n \,\Big|\, \big( \tau_L^{(i)} = t_L^{(i)}, \tau_b^{(i)} = t_b^{(i)} \big)_{i=1}^n \overset{d}{=} \big( \zeta_L^{(i)}, \zeta_b^{(i)} \big)_{i=1}^n \,\Big|\, \big( \tilde{\tau}_L^{(i)} = t_L^{(i)}, \tilde{\tau}_b^{(i)} = t_b^{(i)} \big)_{i=1}^n. \tag{68}$$

Putting equations (67) and (68) together completes the proof.

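Proof of Lemma 9: Given a function $f$ with range contained in $[0, 1]$, we have

$$\begin{aligned} \Big| \int f(x) \mathbb{P}_1(dx) - \int f(x) \mathbb{P}_2(dx) \Big| &\le \int_{\mathcal{E}} f(x) \mathbb{P}_1(dx) + \int_{\mathcal{E}} f(x) \mathbb{P}_2(dx) + \Big| \int_{\mathcal{E}^C} f(x) \mathbb{P}_1(dx) - \int_{\mathcal{E}^C} f(x) \mathbb{P}_2(dx) \Big| \\ &\le \mathbb{P}_1(\mathcal{E}) + \mathbb{P}_2(\mathcal{E}) + \mathbb{P}_1(\mathcal{E}^C) \cdot \Big| \frac{\int_{\mathcal{E}^C} f(x) \mathbb{P}_1(dx)}{\mathbb{P}_1(\mathcal{E}^C)} - \frac{\int_{\mathcal{E}^C} f(x) \mathbb{P}_2(dx)}{\mathbb{P}_2(\mathcal{E}^C)} \Big| + \frac{|\mathbb{P}_1(\mathcal{E}^C) - \mathbb{P}_2(\mathcal{E}^C)|}{\mathbb{P}_2(\mathcal{E}^C)} \\ &\le \mathbb{P}_1(\mathcal{E}) + \mathbb{P}_2(\mathcal{E}) + d_{\mathrm{TV}}\big( \mathbb{P}_1 | \mathcal{E}^C, \mathbb{P}_2 | \mathcal{E}^C \big) + 2 |\mathbb{P}_1(\mathcal{E}) - \mathbb{P}_2(\mathcal{E})| \\ &\le 3 \big( \mathbb{P}_1(\mathcal{E}) + \mathbb{P}_2(\mathcal{E}) \big) + d_{\mathrm{TV}}\big( \mathbb{P}_1 | \mathcal{E}^C, \mathbb{P}_2 | \mathcal{E}^C \big), \end{aligned}$$

which completes the proof.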
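Lemma 9 is easy to test on small discrete distributions. The sketch below draws random pairs of distributions on a six-point set, takes a two-point event with small mass, and records the slack between the two sides of the inequality. All names here are hypothetical helpers introduced for illustration.

```python
import random

def tv(p, q):
    """Total variation distance between two distributions on the same finite set."""
    return 0.5 * sum(abs(p[x] - q[x]) for x in p)

def condition_on_complement(p, event):
    """Law of p conditioned on the complement of `event`."""
    mass = sum(v for x, v in p.items() if x not in event)
    return {x: (0.0 if x in event else v / mass) for x, v in p.items()}

def lemma9_gap(p, q, event):
    """Right-hand side minus left-hand side of Lemma 9."""
    pe = sum(p[x] for x in event)
    qe = sum(q[x] for x in event)
    cond_tv = tv(condition_on_complement(p, event), condition_on_complement(q, event))
    return cond_tv + 3 * pe + 3 * qe - tv(p, q)

rng = random.Random(1)
event = {0, 1}
gaps = []
for _ in range(1000):
    raw_p = [rng.random() for _ in range(6)]
    raw_q = [rng.random() for _ in range(6)]
    # Shrink the mass on the event so that it is typically well below 1/2.
    for x in event:
        raw_p[x] *= 0.1
        raw_q[x] *= 0.1
    p = {i: v / sum(raw_p) for i, v in enumerate(raw_p)}
    q = {i: v / sum(raw_q) for i, v in enumerate(raw_q)}
    gaps.append(lemma9_gap(p, q, event))
```

Every entry of `gaps` should be non-negative. The slack is typically large, which suggests the constant 3 in the lemma is chosen for convenience rather than tightness.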

5.4 Proof of Theorem 3

In order to prove our local minimax lower bound, we make use of the Bayesian Cramér–Rao bound, also known as the van Trees inequality. In particular, we use a functional version of this inequality. It applies to a parametric family of densities $\{p_\eta, \eta \in \Theta\}$ with respect to Lebesgue measure, with sufficient regularity so that the matrix $I(\eta) := \mathbb{E}_\eta \Big[ \big( \tfrac{\partial}{\partial \eta} \log p_\eta(X_1) \big) \big( \tfrac{\partial}{\partial \eta} \log p_\eta(X_1) \big)^\top \Big]$ is well-defined.

Proposition 2 (Theorem 1 of [GL95], special case). Given a prior distribution $\rho$ with bounded support contained within $\Theta$, let $T : \mathrm{supp}(\rho) \to \mathbb{R}^p$ denote a locally smooth functional. Then for any estimator $\hat{T}$ based on i.i.d. samples $X_1^n = \{X_i\}_{i=1}^n$ and for any smooth matrix-valued function $C : \mathbb{R}^d \to \mathbb{R}^{p \times d}$, we have

$$\mathbb{E}_{\eta \sim \rho} \mathbb{E}_{X_1^n \sim p_\eta} \|\hat{T}(X_1^n) - T(\eta)\|_2^2 \ge \frac{\Big( \int \mathrm{trace}\big( C(\eta) \tfrac{\partial T}{\partial \eta}(\eta)^\top \big) \rho(\eta) d\eta \Big)^2}{n \int \mathrm{trace}\big( C(\eta) I(\eta) C(\eta)^\top \big) \rho(\eta) d\eta + \int \|\nabla \cdot C(\eta) + C(\eta) \cdot \nabla \log \rho(\eta)\|_2^2 \rho(\eta) d\eta}.$$

Recall that our lower bound is local, and holds for problem instances $(L, b)$ such that the pair $(\Phi_d L \Phi_d^*, \Phi_d b)$ is within a small $\ell_2$ neighborhood of a fixed pair $(M_0, h_0)$. We proceed by constructing a careful prior on such instances that will allow us to apply Proposition 2; note that it suffices to construct our prior over the $d \times d$ matrix $\Phi_d L \Phi_d^*$ and the $d$-dimensional vector $\Phi_d b$. For the rest of the proof, we work under the orthonormal basis $\{\phi_j\}_{j=1}^d$. We also use the convenient shorthand $\bar{x}_0 := \Phi_d v_0$.

As a building block for our construction, we consider the one-dimensional density function $\mu(t) := \cos^2\big( \tfrac{\pi t}{2} \big) \cdot \mathbf{1}_{t \in [-1, 1]}$, borrowed from Section 2.7 of Tsybakov [Tsy08]. It can be verified that $\mu$ defines a probability measure supported on the interval $[-1, 1]$. We denote by $\mu^{\otimes d}$ the $d$-fold product measure of $\mu$. Let $Z$ and $Z'$ denote two random vectors drawn i.i.d. from the distribution $\mu^{\otimes d}$. We use an auxiliary pair of $\mathbb{R}^d$-valued random variables $(\psi, \lambda)$ given by

$$\psi := \frac{1}{\sqrt{n}} \Sigma_b^{1/2} Z \quad \text{and} \quad \lambda := \frac{1}{\sqrt{n}} \Sigma_L^{1/2} Z'. \tag{69}$$

Our choice of this pair is motivated by the fact that the Fisher information matrix of this distribution takes a desirable form. In particular, we have the following lemma.

Lemma 10. Let $\rho : \mathbb{R}^{2d} \to \mathbb{R}_+$ denote the density of $(\psi, \lambda)$ defined in equation (69). Then

$$I(\rho) = n \pi \begin{bmatrix} \Sigma_b^{-1} & 0 \\ 0 & \Sigma_L^{-1} \end{bmatrix}.$$
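The statement that $\mu$ is a probability density, and that it is centered (so that $\psi$ and $\lambda$ have mean zero), can be confirmed with a simple numerical quadrature. This is a minimal sketch using a midpoint rule; the function names and the number of quadrature steps are arbitrary choices made here.

```python
import math

def mu(t: float) -> float:
    """Density mu(t) = cos^2(pi t / 2) on [-1, 1], zero elsewhere."""
    return math.cos(math.pi * t / 2) ** 2 if -1.0 <= t <= 1.0 else 0.0

def midpoint_integral(f, a: float, b: float, steps: int = 100_000) -> float:
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (k + 0.5) * h) for k in range(steps))

total_mass = midpoint_integral(mu, -1.0, 1.0)
mean = midpoint_integral(lambda t: t * mu(t), -1.0, 1.0)
```

The total mass equals one because $\cos^2(\pi t / 2) = (1 + \cos(\pi t))/2$ and the cosine term integrates to zero over $[-1, 1]$; the mean vanishes by symmetry.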

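We now use the pair $(\psi, \lambda)$ in order to define the ensemble of population-level problem instances

$$M^{(\psi,\lambda)} := M_0 + \|\bar{x}_0\|_2^{-2} \lambda (\bar{x}_0)^\top \quad \text{and} \quad h^{(\psi,\lambda)} := h_0 + \psi. \tag{70}$$

In order to define the problem instance in the Hilbert space $\mathbb{X}$, we simply let $L^{(\psi,\lambda)} := \Phi_d^* M^{(\psi,\lambda)} \Phi_d$ and $b^{(\psi,\lambda)} := \Phi_d^* h^{(\psi,\lambda)}$, for a given basis $(\phi_i)_{i=1}^d$ in the space $\mathbb{S}$. The matrix-vector pair $(L^{(\psi,\lambda)}, b^{(\psi,\lambda)})$ induces the fixed point equation $\bar{x}_{\psi,\lambda} = M^{(\psi,\lambda)} \bar{x}_{\psi,\lambda} + h^{(\psi,\lambda)}$, and its solution is given by

$$\bar{x}_{\psi,\lambda} = (I - M^{(\psi,\lambda)})^{-1} h^{(\psi,\lambda)} = \big( I - M_0 - \|\bar{x}_0\|_2^{-2} \lambda (\bar{x}_0)^\top \big)^{-1} (h_0 + \psi).$$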
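The dependence of the solution $\bar{x}_{\psi,\lambda}$ on the perturbation $(\psi, \lambda)$ can be explored numerically. The sketch below, for $d = 2$, compares finite-difference derivatives of the solution map with the closed forms $(I - M^{(\psi,\lambda)})^{-1}$ (in $\psi$) and $\|\bar{x}_0\|_2^{-2} \big( \bar{x}_0^\top \bar{x}_{\psi,\lambda} \big) (I - M^{(\psi,\lambda)})^{-1}$ (in $\lambda$), which follow from differentiating the matrix inverse. The numerical values of `M0` and `h0` are placeholders, and taking $\bar{x}_0 = (I - M_0)^{-1} h_0$ is an assumption made for this illustration.

```python
def inv2(a):
    """Inverse of a 2x2 matrix given as nested lists."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det], [-a[1][0] / det, a[0][0] / det]]

def matvec(a, x):
    return [a[0][0] * x[0] + a[0][1] * x[1], a[1][0] * x[0] + a[1][1] * x[1]]

M0 = [[0.5, 0.1], [-0.2, 0.3]]                   # placeholder for M_0
h0 = [1.0, -0.5]                                 # placeholder for h_0
I_minus_M0 = [[1 - M0[0][0], -M0[0][1]], [-M0[1][0], 1 - M0[1][1]]]
x0 = matvec(inv2(I_minus_M0), h0)                # assumed: x0 solves x0 = M0 x0 + h0
c = 1.0 / (x0[0] ** 2 + x0[1] ** 2)              # the factor ||x0||_2^{-2}

def resolvent(lam):
    """(I - M)^{-1} for M = M0 + c * lam * x0^T, as in (70)."""
    return inv2([[I_minus_M0[r][s] - c * lam[r] * x0[s] for s in range(2)]
                 for r in range(2)])

def solve(psi, lam):
    """Solution of the perturbed fixed point equation."""
    return matvec(resolvent(lam), [h0[0] + psi[0], h0[1] + psi[1]])

psi, lam, step = [0.05, -0.02], [0.03, 0.04], 1e-6
R = resolvent(lam)
x = solve(psi, lam)
scale = c * (x0[0] * x[0] + x0[1] * x[1])        # ||x0||^{-2} * x0^T x_{psi,lam}

max_err = 0.0
for k in range(2):
    bump = [step if j == k else 0.0 for j in range(2)]
    x_psi = solve([psi[0] + bump[0], psi[1] + bump[1]], lam)
    x_lam = solve(psi, [lam[0] + bump[0], lam[1] + bump[1]])
    for r in range(2):
        max_err = max(max_err, abs((x_psi[r] - x[r]) / step - R[r][k]))
        max_err = max(max_err, abs((x_lam[r] - x[r]) / step - scale * R[r][k]))
```

The residual `max_err` should be of the order of the finite-difference step, since the solution is exactly linear in $\psi$ and smooth in $\lambda$ near the origin.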

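Note that by construction, the Jacobian matrix formed by taking the partial derivatives of $\bar{x}_{\psi,\lambda}$ with respect to $\psi$ and $\lambda$ is given by

$$\nabla_{\psi,\lambda} \bar{x}_{\psi,\lambda} = \begin{bmatrix} (I - M^{(\psi,\lambda)})^{-1} & \|\bar{x}_0\|_2^{-2} (\bar{x}_0)^\top (I - M^{(\psi,\lambda)})^{-1} (h_0 + \psi) \cdot (I - M^{(\psi,\lambda)})^{-1} \end{bmatrix}.$$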

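Now define the observation model via $L_i^{(\psi,\lambda)} := \Phi_d^* M_i^{(\psi,\lambda)} \Phi_d$ and $b_i^{(\psi,\lambda)} := \Phi_d^* h_i^{(\psi,\lambda)}$, where

$$M_i^{(\psi,\lambda)} := M^{(\psi,\lambda)} + \|\bar{x}_0\|_2^{-2} w_i (\bar{x}_0)^\top \quad \text{and} \quad h_i^{(\psi,\lambda)} := h^{(\psi,\lambda)} + w_i', \tag{71}$$

where $w_i \sim \mathcal{N}(0, \Sigma_L)$ and $w_i' \sim \mathcal{N}(0, \Sigma_b)$ are independent. The following lemma certifies some basic properties of the observation model constructed above.

Lemma 11. Consider the ensemble of problem instances defined in equations (70) and (71). For each pair $(\psi, \lambda)$ in the support of $\rho$, each index $j \in [d]$ and each unit vector $u \in \mathbb{S}^{d-1}$, we have

$$|||M^{(\psi,\lambda)} - M_0|||_F \le \sigma_L \sqrt{\frac{d}{n}}, \quad \text{and} \quad \|h^{(\psi,\lambda)} - h_0\|_2 \le \sigma_b \sqrt{\frac{d}{n}}, \tag{72a}$$

$$\mathrm{cov}\Big( \big( M_1^{(\psi,\lambda)} - M^{(\psi,\lambda)} \big) \bar{x}_0 \Big) = \Sigma_L, \quad \text{and} \quad \mathrm{cov}\Big( h_1^{(\psi,\lambda)} - h^{(\psi,\lambda)} \Big) = \Sigma_b, \tag{72b}$$

$$\mathbb{E} \Big( e_j^\top \big( M_1^{(\psi,\lambda)} - M^{(\psi,\lambda)} \big) u \Big)^2 \le \sigma_L^2, \quad \text{and} \quad \mathbb{E} \Big( e_j^\top \big( h_1^{(\psi,\lambda)} - h^{(\psi,\lambda)} \big) \Big)^2 \le \sigma_b^2. \tag{72c}$$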

Lemma 11 ensures that our problem instance lies in the desired class. In particular, equation (72a) guarantees that the population-level problem instance $(L^{(\psi,\lambda)}, b^{(\psi,\lambda)})$ lies in the class $\mathbb{C}_{\mathrm{est}}$; on the other hand, equations (72b) and (72c) guarantee that the probability distribution $\mathbb{P}_{L,b}$ we constructed lies in the class $\mathbb{G}_{\mathrm{cov}}$. Some calculations yield that the Fisher information matrix for this observation model is given by

$$I(\psi, \lambda) = \begin{bmatrix} \Sigma_b^{-1} & 0 \\ 0 & \Sigma_L^{-1} \end{bmatrix},$$

for any $\psi, \lambda \in \mathbb{R}^d$. We will apply Proposition 2 shortly, for which we use the following matrix $C$:

$$C(\psi, \lambda) := \nabla_{\psi,\lambda} \bar{x}_{\psi,\lambda} \big|_{(0,0)} \cdot I(\psi, \lambda)^{-1} = \begin{bmatrix} (I - M_0)^{-1} \Sigma_b & (I - M_0)^{-1} \Sigma_L \end{bmatrix}.$$

Note that by construction, the matrix $C$ does not depend on the pair $(\psi, \lambda)$.
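With this choice, the quadratic form $C \, I(\psi,\lambda) \, C^\top$ that enters the denominator of Proposition 2 simplifies considerably. The following worked computation is a sketch, assuming the covariance matrices $\Sigma_b$ and $\Sigma_L$ are symmetric and invertible; the shorthand $A$ is introduced here only for brevity.

```latex
\begin{align*}
C \, I(\psi,\lambda) \, C^\top
 &= \begin{bmatrix} A\Sigma_b & A\Sigma_L \end{bmatrix}
    \begin{bmatrix} \Sigma_b^{-1} & 0 \\ 0 & \Sigma_L^{-1} \end{bmatrix}
    \begin{bmatrix} \Sigma_b A^\top \\ \Sigma_L A^\top \end{bmatrix},
    \qquad A := (I - M_0)^{-1}, \\
 &= A \Sigma_b A^\top + A \Sigma_L A^\top
  = (I - M_0)^{-1} (\Sigma_L + \Sigma_b) (I - M_0)^{-\top} .
\end{align*}
```

The block-diagonal structure of the Fisher information is what makes the two noise sources contribute additively.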

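We also claim that if $n \ge 16 \sigma_L^2 |||(I - M_0)^{-1}|||_{\mathrm{op}}^2 d$, then the following inequalities hold for our construction:

$$T_b := \mathbb{E}_\rho \Big[ \mathrm{trace}\Big( (I - M_0)^{-1} \Sigma_b (I - M^{(\psi,\lambda)})^{-\top} \Big) \Big] \ge \frac{1}{2} \mathrm{trace}\Big( (I - M_0)^{-1} \Sigma_b (I - M_0)^{-\top} \Big), \quad \text{and} \tag{73a}$$

$$T_L := \mathbb{E}_\rho \Big[ \frac{(\bar{x}_0)^\top \bar{x}_{\psi,\lambda}}{\|\bar{x}_0\|_2^2} \mathrm{trace}\Big( (I - M_0)^{-1} \Sigma_L (I - M^{(\psi,\lambda)})^{-\top} \Big) \Big] \ge \frac{1}{3} \mathrm{trace}\Big( (I - M_0)^{-1} \Sigma_L (I - M_0)^{-\top} \Big). \tag{73b}$$

Taking these two claims as given for the moment, let us complete the proof of the theorem. First, note that

$$\begin{aligned} \mathbb{E}_\rho \Big[ \mathrm{trace}\Big( C(\psi, \lambda) \cdot \nabla_{\psi,\lambda} \bar{x}_{\psi,\lambda}^\top \Big) \Big] &= \mathbb{E}_\rho \Big[ \mathrm{trace}\Big( (I - M_0)^{-1} \Sigma_b (I - M^{(\psi,\lambda)})^{-\top} \Big) \Big] + \mathbb{E}_\rho \Big[ \frac{(\bar{x}_0)^\top \bar{x}_{\psi,\lambda}}{\|\bar{x}_0\|_2^2} \mathrm{trace}\Big( (I - M_0)^{-1} \Sigma_L (I - M^{(\psi,\lambda)})^{-\top} \Big) \Big] \\ &\ge \frac{1}{3} \mathrm{trace}\Big( (I - M_0)^{-1} (\Sigma_L + \Sigma_b) (I - M_0)^{-\top} \Big), \end{aligned} \tag{74}$$

where the inequality combines claims (73a) and (73b). Second, since $I(\cdot, \cdot)$ and $C(\cdot, \cdot)$ are both constant functionals, we have that

$$\mathbb{E} \Big[ \mathrm{trace}\Big( C(\psi, \lambda) I(\psi, \lambda) C(\psi, \lambda)^\top \Big) \Big] = \mathrm{trace}\Big( (I - M_0)^{-1} (\Sigma_L + \Sigma_b) (I - M_0)^{-\top} \Big). \tag{75}$$

Additionally, Lemma 10 yields

$$\begin{aligned} \mathbb{E} \|\nabla \cdot C(\psi, \lambda) + C(\psi, \lambda) \cdot \nabla \log \rho(\psi, \lambda)\|_2^2 &= \mathrm{trace}\Big( C(0, 0) \cdot \mathbb{E} \big[ \nabla \log \rho(\psi, \lambda) \nabla \log \rho(\psi, \lambda)^\top \big] C(0, 0)^\top \Big) \\ &= n \pi \cdot \mathrm{trace}\Big( (I - M_0)^{-1} (\Sigma_L + \Sigma_b) (I - M_0)^{-\top} \Big). \end{aligned} \tag{76}$$

We are finally in a position to put together the pieces. Applying Proposition 2 and combining equations (74), (75), and (76), we obtain the lower bound

$$\int \mathbb{E} \|\hat{x}_n(L_1^n, b_1^n) - \bar{x}_{\psi,\lambda}\|_2^2 \, \rho(d\psi, d\lambda) \ge \frac{\mathrm{trace}\Big( (I - M_0)^{-1} (\Sigma_L + \Sigma_b) (I - M_0)^{-\top} \Big)}{9 (1 + \pi) n}, \tag{77}$$

for any estimator $\hat{x}_n$ that takes values in $\mathbb{R}^d$. For the problem instances we construct, note that

$$\bar{v}^{(\psi,\lambda)} = (I - \Pi_{\mathbb{S}} L^{(\psi,\lambda)})^{-1} \Pi_{\mathbb{S}} b^{(\psi,\lambda)} = \Phi_d^* (I - M^{(\psi,\lambda)})^{-1} h^{(\psi,\lambda)} = \Phi_d^* \bar{x}_{\psi,\lambda}.$$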

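For any estimator $\hat{v}_n \in \widehat{\mathcal{V}}_{\mathbb{X}}$, we note that

$$\|\bar{v}^{(\psi,\lambda)} - \hat{v}_n\|^2 \ge \|\Pi_{\mathbb{S}}(\bar{v}^{(\psi,\lambda)} - \hat{v}_n)\|^2 = \|\Phi_d \hat{v}_n - \bar{x}_{\psi,\lambda}\|_2^2.$$

Recall by Lemma 11 that on the support of the prior distribution $\rho$, the population-level problem instance $(L^{(\psi,\lambda)}, b^{(\psi,\lambda)})$ lies in the class $\mathbb{C}_{\mathrm{est}}$, and that the probability distribution $\mathbb{P}_{L,b}$ we constructed lies in the class $\mathbb{G}_{\mathrm{cov}}$. We thus have the minimax lower bound

$$\begin{aligned} \inf_{\hat{v}_n \in \widehat{\mathcal{V}}_{\mathbb{X}}} \; \sup_{\substack{(L,b) \in \mathbb{C}_{\mathrm{est}} \\ \mathbb{P}_{L,b} \in \mathbb{G}_{\mathrm{cov}}}} \mathbb{E} \|\hat{v}_n(L_1^n, b_1^n) - \bar{v}\|^2 &\ge \inf_{\hat{x}_n} \sup_{(\psi,\lambda) \in \mathrm{supp}(\rho)} \mathbb{E} \|\hat{x}_n(L_1^n, b_1^n) - \bar{x}_{\psi,\lambda}\|_2^2 \\ &\ge \inf_{\hat{x}_n} \int \mathbb{E} \|\hat{x}_n(L_1^n, b_1^n) - \bar{x}_{\psi,\lambda}\|_2^2 \, \rho(d\psi, d\lambda) \ge c \cdot \frac{\mathrm{trace}\Big( (I - M_0)^{-1} \Sigma^* (I - M_0)^{-\top} \Big)}{n}, \end{aligned}$$

for $c = \frac{1}{9(1 + \pi)} > 0$, which completes the proof of the theorem.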
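To illustrate how the resulting bound scales with the problem instance, the quantity $c \cdot \mathrm{trace}\big( (I - M_0)^{-1} \Sigma (I - M_0)^{-\top} \big) / n$ can be evaluated for small matrices. The sketch below handles the $2 \times 2$ case in pure Python, using the constant $c = 1/(9(1 + \pi))$ stated above; the function name and the example matrices are hypothetical, and `Sigma` plays the role of the combined covariance appearing in the bound.

```python
import math

def inv2(a):
    """Inverse of a 2x2 matrix given as nested lists."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det], [-a[1][0] / det, a[0][0] / det]]

def matmul(a, b):
    return [[sum(a[r][k] * b[k][c] for k in range(2)) for c in range(2)]
            for r in range(2)]

def transpose(a):
    return [[a[c][r] for c in range(2)] for r in range(2)]

def estimation_lower_bound(M0, Sigma, n):
    """c * trace((I - M0)^{-1} Sigma (I - M0)^{-T}) / n with c = 1 / (9 (1 + pi))."""
    A = inv2([[1 - M0[0][0], -M0[0][1]], [-M0[1][0], 1 - M0[1][1]]])
    S = matmul(matmul(A, Sigma), transpose(A))
    c = 1.0 / (9.0 * (1.0 + math.pi))
    return c * (S[0][0] + S[1][1]) / n

identity = [[1.0, 0.0], [0.0, 1.0]]
easy = estimation_lower_bound([[0.0, 0.0], [0.0, 0.0]], identity, n=1000)
hard = estimation_lower_bound([[0.9, 0.0], [0.0, 0.9]], identity, n=1000)
```

Moving the eigenvalues of $M_0$ from 0 to 0.9 inflates the bound by a factor of $(1 - 0.9)^{-2} = 100$, which is the instance dependence the theorem is designed to capture.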

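5.4.1 Proof of Lemma 10

We first note that $\lambda$ is independent of $\psi$, and consequently $\rho = \rho_b \otimes \rho_L$, where $\rho_b, \rho_L$ are the marginal densities for $\psi$ and $\lambda$ respectively. Since the Fisher information tensorizes over product measures, it suffices to compute the Fisher information of $\rho_L$ and $\rho_b$ separately. By a change of variables, we have

$$\rho_L(\lambda) = n^{d/2} \det(\Sigma_L)^{-1/2} \cdot \mu^{\otimes d}\big( \sqrt{n}\, \Sigma_L^{-1/2} \lambda \big).$$

Substituting this into the expression for Fisher information, we obtain

$$\begin{aligned} I(\rho_L) &= \int \big( \nabla \log \rho_L(\lambda) \big) \big( \nabla \log \rho_L(\lambda) \big)^\top \rho_L(\lambda) d\lambda \\ &= \int \Big( \sqrt{n}\, \Sigma_L^{-1/2} \nabla \log \mu^{\otimes d}(z) \Big) \Big( \sqrt{n}\, \Sigma_L^{-1/2} \nabla \log \mu^{\otimes d}(z) \Big)^\top \mu^{\otimes d}(z) dz \\ &= n \Sigma_L^{-1/2} \cdot \underbrace{\mathbb{E}_{Z \sim \mu^{\otimes d}} \Big[ \big( \nabla \log \mu^{\otimes d}(Z) \big) \big( \nabla \log \mu^{\otimes d}(Z) \big)^\top \Big]}_{I(\mu^{\otimes d})} \cdot \Sigma_L^{-1/2}. \end{aligned}$$

Finally, since $\mu^{\otimes d}$ is a product measure, we have $I\big( \mu^{\otimes d} \big) = I(\mu) \cdot I_d = \pi I_d$, and hence $I(\rho_L) = \pi n \Sigma_L^{-1}$. Reasoning similarly for $\rho_b$, we have that $I(\rho_b) = \pi n \Sigma_b^{-1}$. This completes the proof.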

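5.4.2 Proof of Lemma 11

We prove the three facts in sequence.

Proof of equation (72a): Since the scalars $\sigma_L$ and $\sigma_b$ satisfy the compatibility condition (36), we have the bounds

$$|||M^{(\psi,\lambda)} - M_0|||_F = \|\bar{x}_0\|_2^{-1} \cdot \|\lambda\|_2 \le n^{-1/2} \|\bar{x}_0\|_2^{-1} \cdot \sqrt{\mathrm{trace}(\Sigma_L)} \le \sigma_L \sqrt{\frac{d}{n}},$$

$$\|h^{(\psi,\lambda)} - h_0\|_2 = \|\psi\|_2 \le \sqrt{n^{-1} \mathrm{trace}(\Sigma_b)} \le \sigma_b \sqrt{\frac{d}{n}},$$

which completes the proof of the first bound.

Proof of equation (72b): A straightforward calculation leads to the following identities:

$$\mathrm{cov}\Big( \big( M_1^{(\psi,\lambda)} - M^{(\psi,\lambda)} \big) \bar{x}_0 \Big) = \mathrm{cov}\Big( \|\bar{x}_0\|_2^{-2} w_1 \bar{x}_0^\top \bar{x}_0 \Big) = \mathrm{cov}(w_1) = \Sigma_L,$$

$$\mathrm{cov}\Big( h_1^{(\psi,\lambda)} - h^{(\psi,\lambda)} \Big) = \mathrm{cov}(w_1') = \Sigma_b.$$

Proof of equation (72c): Given any index $j \in [d]$ and vector $u \in \mathbb{S}^{d-1}$, we note that

$$\mathbb{E} \Big( e_j^\top \big( M_1^{(\psi,\lambda)} - M^{(\psi,\lambda)} \big) u \Big)^2 = \frac{1}{\|\bar{x}_0\|_2^4} \mathbb{E} \Big( e_j^\top w_1 \cdot \bar{x}_0^\top u \Big)^2 \le \frac{1}{\|\bar{x}_0\|_2^2} \mathbb{E} \big( e_j^\top w_1 \big)^2 = \frac{1}{\|\bar{x}_0\|_2^2} e_j^\top \Sigma_L e_j \le \sigma_L^2,$$

where the last inequality is due to the compatibility condition (36). Similarly, for the noise $h_1^{(\psi,\lambda)} - h^{(\psi,\lambda)}$, we have

$$\mathbb{E} \Big( e_j^\top \big( h_1^{(\psi,\lambda)} - h^{(\psi,\lambda)} \big) \Big)^2 = \mathbb{E} \big( e_j^\top w_1' \big)^2 = e_j^\top \Sigma_b e_j \le \sigma_b^2,$$

which completes the proof of the last condition.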
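The identity behind equation (72b) is that the rank-one noise matrix $\|\bar{x}_0\|_2^{-2} w_1 \bar{x}_0^\top$ maps $\bar{x}_0$ back to $w_1$ exactly, so the covariance of the image is that of $w_1$. A minimal deterministic check, with arbitrary illustrative vectors and a hypothetical helper name:

```python
def noise_times_x0(w, x0):
    """Compute (||x0||^{-2} w x0^T) x0, the matrix noise of (71) applied to x0."""
    sq_norm = sum(v * v for v in x0)
    rank_one = [[wi * xj / sq_norm for xj in x0] for wi in w]
    return [sum(row[j] * x0[j] for j in range(len(x0))) for row in rank_one]

w1 = [0.3, -1.2, 0.7]
x0 = [2.0, -1.0, 0.5]
image = noise_times_x0(w1, x0)
```

Here `image` agrees with `w1` up to rounding, whatever nonzero `x0` is used; this is the reason for the normalization by $\|\bar{x}_0\|_2^{-2}$ in the construction.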

5.4.3 Proof of claims (73)

We prove the two bounds separately, recalling that $T_b$ denotes the left-hand side of claim (73a) and $T_L$ the left-hand side of claim (73b).

Proof of claim (73a): We begin by applying the matrix inversion formula, which yields

$$(I - M^{(\psi,\lambda)})^{-1} = (I - M_0)^{-1} - \underbrace{\frac{1}{\|\bar{x}_0\|_2^2 - (\bar{x}_0)^\top (I - M_0) \lambda} (I - M_0)^{-1} \lambda (\bar{x}_0)^\top (I - M_0)^{-1}}_{=: H}.$$

For $n \ge 16 \sigma_L^2 d$, we have

$$(\bar{x}_0)^\top (I - M_0) \lambda \le 2 \|\bar{x}_0\|_2 \cdot \|\lambda\|_2 \le 2 \|\bar{x}_0\|_2 \sqrt{n^{-1} \mathrm{trace}(\Sigma_L)} \le 2 \sigma_L \|\bar{x}_0\|_2^2 \sqrt{\frac{d}{n}} \le \frac{1}{2} \|\bar{x}_0\|_2^2.$$

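To bound $T_b$ from below, we note that

$$\begin{aligned} T_b &= \mathrm{trace}\Big( (I - M_0)^{-1} \Sigma_b (I - M_0)^{-\top} \Big) - \mathrm{trace}\Big( (I - M_0)^{-1} \Sigma_b \mathbb{E}[H^\top] \Big) \\ &\ge \mathrm{trace}\Big( (I - M_0)^{-1} \Sigma_b (I - M_0)^{-\top} \Big) - |||(I - M_0)^{-1} \Sigma_b (I - M_0)^{-\top}|||_{\mathrm{nuc}} \cdot |||(I - M_0)^\top \mathbb{E}[H]^\top|||_{\mathrm{op}} \\ &= \mathrm{trace}\Big( (I - M_0)^{-1} \Sigma_b (I - M_0)^{-\top} \Big) \cdot \big( 1 - |||\mathbb{E}[H] (I - M_0)|||_{\mathrm{op}} \big). \end{aligned}$$

When $n \ge 4 \sigma_L^2 d$, we have

$$\begin{aligned} |||\mathbb{E}[H] (I - M_0)|||_{\mathrm{op}} &\le \sum_{k=0}^{\infty} \frac{1}{\|\bar{x}_0\|_2^2} \Big|\Big|\Big| \mathbb{E} \Big[ \Big( \frac{(\bar{x}_0)^\top (I - M_0) \lambda}{\|\bar{x}_0\|_2^2} \Big)^k (I - M_0)^{-1} \lambda (\bar{x}_0)^\top \Big] \Big|\Big|\Big|_{\mathrm{op}} \\ &\le \frac{1}{\|\bar{x}_0\|_2^2} \sum_{k=1}^{\infty} \mathbb{E} \Big( \Big| \frac{(\bar{x}_0)^\top (I - M_0) \lambda}{\|\bar{x}_0\|_2^2} \Big|^k \cdot |||(I - M_0)^{-1} \lambda (\bar{x}_0)^\top|||_F \Big) \\ &\le 2 \sigma_L^2 |||(I - M_0)^{-1}|||_{\mathrm{op}} \frac{d}{n}. \end{aligned}$$

Therefore, for $n \ge 16 \sigma_L^2 |||(I - M_0)^{-1}|||_{\mathrm{op}} d$, we obtain the lower bound

$$T_b \ge \frac{1}{2} \mathrm{trace}\Big( (I - M_0)^{-1} \Sigma_b (I - M_0)^{-\top} \Big),$$

as desired.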

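Proof of claim (73b): Since $\bar{x}_{\psi,\lambda} = \big( (I - M_0)^{-1} - H \big) (h_0 + \psi)$, we note that

$$\bar{x}_{\psi,\lambda} = \bar{x}_0 - H h_0 + (I - M_0)^{-1} \psi - H \psi.$$

Consequently, we have

$$T_L = \mathbb{E} \Big[ \frac{(\bar{x}_0)^\top \big( \bar{x}_0 - H h_0 + (I - M_0)^{-1} \psi - H \psi \big)}{\|\bar{x}_0\|_2^2} \cdot \mathrm{trace}\Big( (I - M_0)^{-1} \Sigma_L \big( (I - M_0)^{-\top} - H^\top \big) \Big) \Big].$$

Since $\lambda$ is independent of $\psi$, the matrix $H$ depends only upon $\lambda$, and $\psi$ has mean zero, taking the expectation with respect to $\psi$ gives

$$T_L = \mathbb{E} \Big[ \frac{(\bar{x}_0)^\top (\bar{x}_0 - H h_0)}{\|\bar{x}_0\|_2^2} \cdot \mathrm{trace}\Big( (I - M_0)^{-1} \Sigma_L \big( (I - M_0)^{-\top} - H^\top \big) \Big) \Big].$$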

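We note that

$$\frac{|(\bar{x}_0)^\top H h_0|}{\|\bar{x}_0\|_2^2} = \frac{|(\bar{x}_0)^\top (I - M_0)^{-1} \lambda|}{\|\bar{x}_0\|_2^2 - (\bar{x}_0)^\top (I - M_0) \lambda} \le 2 |||(I - M_0)^{-1}|||_{\mathrm{op}} \sigma_L \sqrt{\frac{d}{n}}.$$

Therefore, for $n \ge 16 \sigma_L^2 |||(I - M_0)^{-1}|||_{\mathrm{op}}^2 d$, we have the bound $\frac{1}{2} \le \frac{(\bar{x}_0)^\top (\bar{x}_0 - H h_0)}{\|\bar{x}_0\|_2^2} \le \frac{3}{2}$, and consequently, we have

$$\begin{aligned} T_L &\ge \frac{1}{2} \mathrm{trace}\Big( (I - M_0)^{-1} \Sigma_L (I - M_0)^{-\top} \Big) - \frac{3}{2} \mathbb{E} \Big[ \mathrm{trace}\Big( (I - M_0)^{-1} \Sigma_L H^\top \Big) \Big] \\ &\ge \frac{1}{2} \mathrm{trace}\Big( (I - M_0)^{-1} \Sigma_L (I - M_0)^{-\top} \Big) - \frac{3}{2} |||(I - M_0)^{-1} \Sigma_L (I - M_0)^{-\top}|||_{\mathrm{nuc}} \cdot \mathbb{E} |||H (I - M_0)|||_{\mathrm{op}} \\ &\ge \frac{1}{2} \mathrm{trace}\Big( (I - M_0)^{-1} \Sigma_L (I - M_0)^{-\top} \Big) \cdot \big( 1 - 3 \mathbb{E} |||H (I - M_0)|||_{\mathrm{op}} \big). \end{aligned}$$

For the matrix $H (I - M_0)$, we have the almost-sure upper bound

$$|||H (I - M_0)|||_{\mathrm{op}} \le \frac{2}{\|\bar{x}_0\|_2^2} |||(I - M_0)^{-1} \lambda (\bar{x}_0)^\top|||_{\mathrm{op}} \le \frac{2}{\|\bar{x}_0\|_2} \cdot \|(I - M_0)^{-1} \lambda\|_2 \le 2 |||(I - M_0)^{-1}|||_{\mathrm{op}} \sigma_L \sqrt{\frac{d}{n}}.$$

Thus, provided $n \ge 18 \sigma_L^2 |||(I - M_0)^{-1}|||_{\mathrm{op}}^2 d$, putting together the pieces yields the lower bound

$$T_L \ge \frac{1}{3} \mathrm{trace}\Big( (I - M_0)^{-1} \Sigma_L (I - M_0)^{-\top} \Big).$$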

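5.5 Proof of Corollary 2

Define the terms $\Delta_1 := \frac{2}{3} \big( \alpha(M_0, \gamma_{\max}) - 1 \big) \delta^2$ and $\Delta_2 := c \cdot \mathcal{E}_n(M_0, \Sigma_L + \Sigma_b)$. We split our proof into two cases.

Case I: $\Delta_1 \le \Delta_2$. Consider the class

$$\widetilde{\mathbb{C}} := \bigcup_{(M', h') \in N(M_0, h_0)} \mathbb{C}_{\mathrm{approx}}(M', h', D, 0, \gamma_{\max}).$$

Clearly, we have the inclusion $\widetilde{\mathbb{C}} \subseteq \mathbb{C}_{\mathrm{final}}$. Moreover, for a problem instance in $\widetilde{\mathbb{C}}$, we have $\mathcal{A}(\mathbb{S}, v^*) = 0$, and consequently $v^* = \bar{v}$. Note that the construction of problem instances in Theorem 3 can be embedded in $\mathbb{X}$ of any dimension $D$, and the linear operator $L$ constructed in the proof of Theorem 3 satisfies the bound

$$|||L^{(\psi,\lambda)}|||_{\mathbb{X}} \le |||M^{(\psi,\lambda)}|||_{\mathrm{op}} \le |||M_0|||_{\mathrm{op}} + \sigma_L \sqrt{\frac{d}{n}} \le \gamma_{\max}.$$

Consequently, the class $\widetilde{\mathbb{C}}$ contains the population-level problem instances $(M^{(\psi,\lambda)}, h^{(\psi,\lambda)})$ constructed in the proof of Theorem 3, for any choice of $\psi, \lambda \in \mathbb{R}^d$. Invoking Theorem 3, we thus obtain the sequence of bounds

$$\begin{aligned} \inf_{\hat{v}_n \in \widehat{\mathcal{V}}_{\mathbb{X}}} \; \sup_{\substack{(L,b) \in \mathbb{C}_{\mathrm{final}} \\ \mathbb{P}_{L,b} \in \mathbb{G}_{\mathrm{cov}}}} \mathbb{E} \|\hat{v}_n - v^*\|^2 &\ge \inf_{\hat{v}_n \in \widehat{\mathcal{V}}_{\mathbb{X}}} \; \sup_{\substack{(L,b) \in \widetilde{\mathbb{C}} \\ \mathbb{P}_{L,b} \in \mathbb{G}_{\mathrm{cov}}}} \mathbb{E} \|\hat{v}_n - v^*\|^2 = \inf_{\hat{v}_n \in \widehat{\mathcal{V}}_{\mathbb{X}}} \; \sup_{\substack{(L,b) \in \widetilde{\mathbb{C}} \\ \mathbb{P}_{L,b} \in \mathbb{G}_{\mathrm{cov}}}} \mathbb{E} \|\hat{v}_n - \bar{v}\|^2 \\ &= \inf_{\hat{v}_n \in \widehat{\mathcal{V}}_{\mathbb{X}}} \; \sup_{\substack{(L,b) \in \mathbb{C}_{\mathrm{est}} \\ \mathbb{P}_{L,b} \in \mathbb{G}_{\mathrm{cov}}}} \mathbb{E} \|\hat{v}_n - \bar{v}\|^2 \\ &\ge \Delta_2. \end{aligned}$$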

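Case II: $\Delta_1 > \Delta_2$. In this case, we consider the class of noise distributions

$$\widetilde{\mathbb{G}} := \mathbb{G}_{\mathrm{cov}}(0, 0, \sigma_L, \sigma_b).$$

Clearly, $\widetilde{\mathbb{G}}$ is a sub-class of $\mathbb{G}_{\mathrm{cov}}$. Note that the observation model $\big( L_i^{(\varepsilon,z)}, b_i^{(\varepsilon,z)} \big)_{i=1}^n$ constructed in the proof of Theorem 2 satisfies the following identities almost surely:

$$\Phi_d L_i^{(\varepsilon,z)} \Phi_d^* = \Phi_d L^{(\varepsilon,z)} \Phi_d^*, \qquad \Phi_d b_i^{(\varepsilon,z)} = \Phi_d b^{(\varepsilon,z)}.$$

So the problem instances constructed in the proof of Theorem 2 belong to the class $\widetilde{\mathbb{G}}$. Invoking Theorem 2, we obtain the bound

$$\inf_{\hat{v}_n \in \widehat{\mathcal{V}}_{\mathbb{X}}} \; \sup_{\substack{(L,b) \in \mathbb{C}_{\mathrm{final}} \\ \mathbb{P}_{L,b} \in \mathbb{G}_{\mathrm{cov}}}} \mathbb{E} \|\hat{v}_n - v^*\|^2 \ge \inf_{\hat{v}_n \in \widehat{\mathcal{V}}_{\mathbb{X}}} \; \sup_{\substack{(L,b) \in \mathbb{C}_{\mathrm{final}} \\ \mathbb{P}_{L,b} \in \widetilde{\mathbb{G}}}} \mathbb{E} \|\hat{v}_n - v^*\|^2 \ge \frac{2}{3} \big( \alpha(M_0, \gamma_{\max}) - 1 \big) \delta^2 = \Delta_1.$$

Combining the results in the two cases, we arrive at the lower bound

$$\begin{aligned} \inf_{\hat{v}_n \in \widehat{\mathcal{V}}_{\mathbb{X}}} \; \sup_{\substack{(L,b) \in \mathbb{C}_{\mathrm{final}} \\ \mathbb{P}_{L,b} \in \mathbb{G}_{\mathrm{cov}}}} \mathbb{E} \|\hat{v}_n - v^*\|^2 &\ge \max(\Delta_1, \Delta_2) \\ &\ge \frac{1}{2} \Delta_1 + \frac{1}{2} \Delta_2 \ge \frac{1}{3} \big( \alpha(M_0, \gamma_{\max}) - 1 \big) \delta^2 + \frac{c}{2} \mathcal{E}_n(M_0, \Sigma_L + \Sigma_b), \end{aligned}$$

which completes the proof of the corollary.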

6 Discussion

In this paper, we studied methods for computing approximate solutions to fixed point equations in Hilbert spaces, using methods that search over low-dimensional subspaces of the Hilbert space and operate on stochastic observations of the problem data. We analyzed a standard stochastic approximation scheme involving Polyak–Ruppert averaging, and proved non-asymptotic instance-dependent upper bounds on its mean-squared error. This upper bound involved a pure approximation error term, reflecting the discrepancy induced by searching over a finite-dimensional subspace as opposed to the Hilbert space, and an estimation error term, induced by the noisiness in the observations. We complemented this upper bound with an information-theoretic analysis that established instance-dependent lower bounds for both the approximation error and the estimation error. A noteworthy consequence of our analysis is that the optimal approximation factor in the oracle inequality is neither unity nor a universal constant, but a quantity depending on the projected population-level operator. As direct consequences of our general theorems, we showed oracle inequalities for three specific examples in statistical estimation: linear regression on a linear subspace, Galerkin methods for elliptic PDEs, and value function estimation via temporal difference methods in Markov reward processes.

The results of this paper leave open a number of directions for future work:

• This paper focused on the case of independently drawn observations. Another observation model, one which arises naturally in the context of reinforcement learning, is the Markov observation model. As discussed in Section 2.2.3, we consider the problem with $L = \gamma P$ and $b = r$, where $P$ is a Markov transition kernel, $\gamma$ is the discount factor and $r$ is the reward function. The observed states and rewards in this setup are given by a single trajectory of the Markov chain $P$, instead of i.i.d. draws from the stationary distribution. It is known [TVR97] that the resolvent formalism (a.k.a. TD($\lambda$)) leads to an improved approximation factor with larger $\lambda \in [0, 1)$. On the other hand, a larger choice of $\lambda$ may lead to larger variance and slower convergence for the stochastic approximation estimator, so that a model selection problem arises (see Section 2.2 in the monograph [Sze10] for a detailed discussion). An important direction for future work is to extend our fine-grained risk bounds to the case of TD($\lambda$) methods with Markov data. Leveraging the instance-dependent upper and lower bounds, one can also design and analyze estimators that achieve the optimal trade-off.

• This paper focused purely on oracle inequalities defined with respect to a subspace. However, the framework of oracle inequalities is far more general; in the context of statistical estimation, one can prove oracle inequalities for any star-shaped set with bounds on its metric entropy (see Section 13.3 in the monograph [Wai19a] for the general mechanism and examples). For all three examples considered in Section 2.2, one might imagine approximating solutions using sets with nonlinear structure, such as those defined by $\ell_1$-constraints, Sobolev ellipses, or the function class representable by a given family of neural networks. An interesting direction for future work is to understand the complexity of projected fixed point equations defined by such approximating classes.

Acknowledgement

We would like to thank Peter Bartlett for helpful discussions.

References

[AMOS19]¨ Simon Arridge, Peter Maass, Ozan Oktem,¨ and Carola-Bibiane Sch¨onlieb.Solving inverse problems using data-driven models. Acta Numerica, 28:1–174, 2019. (Cited on page9.)

[BB96] Steven J Bradtke and Andrew G Barto. Linear least-squares algorithms for temporal difference learning. Machine learning, 22(1-3):33–57, 1996. (Cited on pages5,8, and 11.)

[BBM05] Peter L Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005. (Cited on page4.)

[Bel18] Pierre C Bellec. Sharp oracle inequalities for least squares estimators in shape restricted regression. The Annals of Statistics, 46(2):745–780, 2018. (Cited on page4.)

[Ber11] Dimitri P Bertsekas. Temporal difference methods for general projected equations. IEEE Transactions on Automatic Control, 56(9):2128–2139, 2011. (Cited on pages1 and5.)

48 [Ber16] Dimitri P Bertsekas. Proximal algorithms and temporal differences for large linear systems: extrapolation, approximation, and simulation. arXiv preprint arXiv:1610.05427, 2016. (Cited on page6.) [Ber19] Dimitri P Bertsekas. Reinforcement Learning and Optimal Control. Athena Scientific Belmont, MA, 2019. (Cited on pages5 and6.) [BKM19] Olivier Bousquet, Daniel Kane, and Shay Moran. The optimal approximation factor in density estimation. In Conference on Learning Theory, pages 318–341, 2019. (Cited on page5.) [BMP12] Albert Benveniste, Michel M´etivier,and Pierre Priouret. Adaptive Algorithms and Stochastic Approximations, volume 22. Springer Science & Business Media, 2012. (Cited on page5.) [Bor09] Vivek S Borkar. Stochastic Approximation: a Dynamical Systems Viewpoint, volume 48. Springer, 2009. (Cited on page5.) [Boy02] Justin A Boyan. Technical update: Least-squares temporal difference learning. Machine learning, 49(2-3):233–246, 2002. (Cited on pages5,6, and 11.) [BRS18] Jalaj Bhandari, Daniel Russo, and Raghav Singal. A finite time analysis of temporal difference learning with linear function approximation. arXiv preprint arXiv:1806.02450, 2018. (Cited on pages3,5, and 25.) [BS07] Susanne Brenner and Ridgway Scott. The Mathematical Theory of Finite Element Methods, volume 15. Springer Science & Business Media, 2007. (Cited on pages5, 10, and 23.) [BTW07a] Florentina Bunea, Alexandre Tsybakov, and Marten Wegkamp. Sparsity oracle inequalities for the Lasso. Electronic Journal of Statistics, 1:169–194, 2007. (Cited on page4.) [BTW07b] Florentina Bunea, Alexandre B Tsybakov, and Marten H Wegkamp. Aggregation for Gaussian regression. The Annals of Statistics, 35(4):1674–1697, 2007. (Cited on page4.) [CDSS14] Siu On Chan, Ilias Diakonikolas, Rocco A Servedio, and Xiaorui Sun. Near-optimal density estimation in near-linear time using variable-width histograms. In Advances in Neural Information Processing Systems, pages 1844–1852, 2014. (Cited on page5.) 
[C´ea64] Jean C´ea. Approximation variationnelle des probl`emesaux limites. In Annales de l’institut Fourier, volume 14, pages 345–444, 1964. (Cited on pages5, 16, and 24.) [DFFR11] Serge Darolles, Yanqin Fan, Jean-Pierre Florens, and Eric Renault. Nonparametric instru- mental regression. Econometrica, 79(5):1541–1565, 2011. (Cited on page2.) [DR16] John Duchi and Feng Ruan. Asymptotic optimality in stochastic optimization. arXiv preprint arXiv:1612.05612, 2016. (Cited on page5.) [DS12] Arnak S Dalalyan and Joseph Salmon. Sharp oracle inequalities for aggregation of affine estimators. The Annals of Statistics, 40(4):2327–2355, 2012. (Cited on pages3 and4.) [DS18] Arnak S Dalalyan and Mehdi Sebbar. Optimal Kullback–Leibler aggregation in mixture density estimation by maximum likelihood. Mathematical Statistics and Learning, 1(1):1–35, 2018. (Cited on page4.)

[Dur07] Richard Durrett. Random Graph Dynamics, volume 200. Citeseer, 2007. (Cited on page 74.) [Fle84] Clive AJ Fletcher. Computational Galerkin methods. In Computational galerkin methods, pages 72–85. Springer, 1984. (Cited on page5.) [Gal15] Boris Grigoryevich Galerkin. Series solution of some problems of elastic equilibrium of rods and plates. Vestnik inzhenerov i tekhnikov, 19(7):897–908, 1915. (Cited on page5.) [GL95] Richard D Gill and Boris Y Levit. Applications of the van Trees inequality: a Bayesian Cram´er-Raobound. Bernoulli, 1(1-2):59–79, 1995. (Cited on page 41.)

[GN20] Matteo Giordano and Richard Nickl. Consistency of Bayesian inference with Gaussian process priors in an elliptic inverse problem. Inverse Problems, 36(8):085001, 2020. (Cited on page 10.)

[KKV11] Barbara Kaltenbacher, Alana Kirchner, and Boris Vexler. Adaptive discretizations for the choice of a parameter in nonlinear inverse problems. Inverse Problems, 27(12):125008, 2011. (Cited on page 10.)

[Kol06] Vladimir Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6):2593–2656, 2006. (Cited on page 4.)

[Kol11] Vladimir Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: École d'Été de Probabilités de Saint-Flour XXXVIII-2008, volume 2033. Springer Science & Business Media, 2011. (Cited on page 4.)

[KPR+20] Koulik Khamaru, Ashwin Pananjady, Feng Ruan, Martin J Wainwright, and Michael I Jordan. Is temporal difference learning optimal? An instance-dependent analysis. arXiv preprint arXiv:2003.07337, 2020. (Cited on pages 5 and 20.)

[KT00] Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in Neural Information Processing Systems, pages 1008–1014, 2000. (Cited on page 5.)

[KTV17] Olga Klopp, Alexandre B Tsybakov, and Nicolas Verzelen. Oracle inequalities for network models and sparse graphon estimation. The Annals of Statistics, 45(1):316–354, 2017. (Cited on page 4.)

[KVZ+72] Mark Aleksandrovich Krasnosel'skiĭ, Gennadi M Vaĭnikko, RP Zabreĭko, Ya B Ruticki, and V Ya Stet'senko. Approximate Solution of Operator Equations. Wolters-Noordhoff Publishing, Groningen, 1972. Translated from the Russian by D. Louvish. (Cited on page 1.)

[Lai03] Tze Leung Lai. Stochastic approximation. The Annals of Statistics, 31(2):391–406, 2003. (Cited on page 5.)

[LCL20] Sai Li, T Tony Cai, and Hongzhe Li. Transfer learning for high-dimensional linear regression: Prediction, estimation, and minimax optimality. arXiv preprint arXiv:2006.10593, 2020. (Cited on page 23.)

[LMWJ20] Chris Junchi Li, Wenlong Mou, Martin J Wainwright, and Michael I Jordan. ROOT-SGD: Sharp nonasymptotics and asymptotic efficiency in a single algorithm. arXiv preprint arXiv:2008.12690, 2020. (Cited on page 5.)

[LS18] Chandrashekar Lakshminarayanan and Csaba Szepesvári. Linear stochastic approximation: How far does constant step-size and iterate averaging go? In International Conference on Artificial Intelligence and Statistics, pages 1347–1355, 2018. (Cited on pages 5 and 25.)

[LT08] Stig Larsson and Vidar Thomée. Partial Differential Equations with Numerical Methods, volume 45. Springer Science & Business Media, 2008. (Cited on page 9.)

[LWKP20] Robert Lung, Yue Wu, Dimitris Kamilis, and Nick Polydorides. A sketched finite element method for elliptic models. Computer Methods in Applied Mechanics and Engineering, 364:112933, 2020. (Cited on page 10.)

[Mas07] Pascal Massart. Concentration Inequalities and Model Selection, volume 6. Springer, 2007. (Cited on page 4.)

[MB11] Éric Moulines and Francis R Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pages 451–459, 2011. (Cited on page 5.)

[MLW+20] Wenlong Mou, Chris Junchi Li, Martin J Wainwright, Peter L Bartlett, and Michael I Jordan. On linear stochastic approximation: Fine-grained Polyak-Ruppert and non-asymptotic concentration. In Proceedings of Thirty Third Conference on Learning Theory, volume 125, pages 2947–2997, 2020. (Cited on page 5.)

[MN06] Pascal Massart and Élodie Nédélec. Risk bounds for statistical learning. The Annals of Statistics, 34(5):2326–2366, 2006. (Cited on page 4.)

[MS08] Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(May):815–857, 2008. (Cited on page 6.)

[Nic17] Richard Nickl. On Bayesian inference for some statistical inverse problems with partial differential equations. Bernoulli News, 24(2):5–9, 2017. (Cited on page 9.)

[NJLS09] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009. (Cited on page 5.)

[Pen03] Mathew Penrose. Random Geometric Graphs, volume 5. Oxford University Press, 2003. (Cited on page 74.)

[PJ92] Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992. (Cited on pages 5 and 12.)

[Pol90] Boris T Polyak. A new method of stochastic approximation type. Automat. i Telemekh, 7(98-107):2, 1990. (Cited on pages 5 and 12.)

[PW20] Ashwin Pananjady and Martin J Wainwright. Instance-dependent ℓ∞-bounds for policy evaluation in tabular reinforcement learning. IEEE Transactions on Information Theory, to appear, 2020+. (Cited on page 5.)

[PWB09] Nick Polydorides, Mengdi Wang, and Dimitri P Bertsekas. Approximate solution of large-scale linear inverse problems with Monte Carlo simulation. Lab. for Information and Decision Systems Report, MIT, 2009. (Cited on pages 5 and 9.)

[PWB12] Nick Polydorides, Mengdi Wang, and Dimitri P Bertsekas. A quasi Monte Carlo method for large-scale inverse problems. In Monte Carlo and Quasi-Monte Carlo Methods 2010, pages 623–637. Springer, 2012. (Cited on pages 5 and 9.)

[RH15] Phillippe Rigollet and Jan-Christian Hütter. High dimensional statistics. Lecture notes for course 18S997, 2015. (Cited on page 23.)

[RM51] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951. (Cited on page 5.)

[RN94] Gavin A Rummery and Mahesan Niranjan. On-line Q-learning using connectionist systems. Technical report, Cambridge University Engineering Department, 1994. (Cited on page 5.)

[RST17] Alexander Rakhlin, Karthik Sridharan, and Alexandre B Tsybakov. Empirical entropy, minimax regret and minimax risk. Bernoulli, 23(2):789–824, 2017. (Cited on pages 3 and 4.)

[Rup88] David Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering, 1988. (Cited on pages 5 and 12.)

[Sch10] Bruno Scherrer. Should one compute the temporal difference fix point or minimize the Bellman residual? The unified oblique projection view. In Proceedings of the 27th International Conference on International Conference on Machine Learning, pages 959–966, 2010. (Cited on page 6.)

[SMP+09] Richard S Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesvári, and Eric Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 993–1000, 2009. (Cited on page 5.)

[Sut88] Richard S Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988. (Cited on pages 5 and 6.)

[SY19] Rayadurgam Srikant and Lei Ying. Finite-time error bounds for linear stochastic approximation and TD learning. In Conference on Learning Theory, pages 2803–2830. PMLR, 2019. (Cited on page 5.)

[Sze10] Csaba Szepesvári. Algorithms for Reinforcement Learning. Morgan & Claypool Publishers, 2010. (Cited on pages 5 and 48.)

[Tsy04] Alexander B Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004. (Cited on page 4.)

[Tsy08] Alexandre B Tsybakov. Introduction to Nonparametric Estimation. Springer Science & Business Media, 2008. (Cited on page 41.)

[TVR97] John N Tsitsiklis and Benjamin Van Roy. Analysis of temporal-difference learning with function approximation. In Advances in Neural Information Processing Systems, pages 1075–1081, 1997. (Cited on pages 3, 5, 6, 13, 26, 27, and 48.)

[TVR99] John N Tsitsiklis and Benjamin Van Roy. Optimal stopping of Markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives. IEEE Transactions on Automatic Control, 44(10):1840–1851, 1999. (Cited on page 5.)

[vdV00] Aad W van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 2000. (Cited on page 20.)

[VR06] Benjamin Van Roy. Performance loss bounds for approximate value iteration with state aggregation. Mathematics of Operations Research, 31(2):234–244, 2006. (Cited on page 6.)

[Wai19a] Martin J Wainwright. High-dimensional Statistics: A Non-asymptotic Viewpoint, volume 48. Cambridge University Press, 2019. (Cited on page 48.)

[Wai19b] Martin J Wainwright. Stochastic approximation with cone-contractive operators: Sharper ℓ∞-bounds for Q-learning. arXiv preprint arXiv:1905.06265, 2019. (Cited on page 5.)

[Wai19c] Martin J Wainwright. Variance-reduced Q-learning is minimax optimal. arXiv preprint arXiv:1906.04697, 2019. (Cited on page 5.)

[WD92] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992. (Cited on page 5.)

[Woo16] Jeffrey M Wooldridge. Introductory Econometrics: A Modern Approach. Nelson Education, 2016. (Cited on pages 1 and 8.)

[Yat85] Yannis G Yatracos. Rates of convergence of minimum distance estimators and Kolmogorov's entropy. The Annals of Statistics, pages 768–774, 1985. (Cited on page 5.)

[YB10] Huizhen Yu and Dimitri P Bertsekas. Error bounds for approximations from projected linear equations. Mathematics of Operations Research, 35(2):306–329, 2010. (Cited on pages 5 and 14.)

[ZJT20] Banghua Zhu, Jiantao Jiao, and David Tse. Deconstructing generative adversarial networks. IEEE Transactions on Information Theory, 2020. (Cited on page 5.)

A Proof of Theorem 4

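We begin by defining some additional notation needed in the proof. For a non-negative integer $k$, we use $H_k \in \mathbb{R}^{2^k \times 2^k}$ to denote the Hadamard matrix of order $k$, recursively defined as
$$H_0 := 1, \qquad H_k := \begin{bmatrix} H_{k-1} & H_{k-1} \\ H_{k-1} & -H_{k-1} \end{bmatrix}, \quad \text{for } k = 1, 2, \ldots$$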
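The recursion above can be sketched numerically. The following is a minimal illustration, assuming NumPy; the function name `hadamard` is hypothetical and not part of the paper. It builds $H_k$ by repeated block doubling and forms the Gram matrix, which should equal $2^k I$ because the rows are mutually orthogonal.

```python
import numpy as np

def hadamard(k: int) -> np.ndarray:
    """Return H_k (size 2^k x 2^k) via H_k = [[H, H], [H, -H]], starting from H_0 = 1."""
    H = np.array([[1]])
    for _ in range(k):
        H = np.block([[H, H], [H, -H]])
    return H

H3 = hadamard(3)
# Rows of H_k are mutually orthogonal, each with squared Euclidean norm 2^k.
gram = H3 @ H3.T
```

This orthogonality is the property used later when the columns of the Kronecker product with the Hadamard matrix serve as basis vectors for the orthogonal complement.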

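For any integer $q \geq 2$, we define
$$J_q := \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & 1 \\ 0 & 0 & \cdots & 0 & 0 \end{bmatrix},$$
which is a $q \times q$ Jordan block with zeros on the diagonal.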

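Now we turn to the proof of the theorem. We assume that $D$ is an integer multiple of $q$, and that $m := \frac{D-d}{q}$ is an integer power of 2; the complementary case can be handled by adjusting the constant factors in our bounds. Similarly to the proof of Theorem 2, we let $u \in \mathbb{S}^{d-1}$ be an eigenvector associated to the largest eigenvalue of the matrix $(I - M_0)^{-1}\big(\gamma_{\max}^2 I - M_0 M_0^\top\big)(I - M_0)^{-\top}$, and define the $d$-dimensional vectors
$$w := \sqrt{\alpha(M_0, \gamma_{\max}) - 1} \cdot (I - M_0)u, \quad \text{and} \quad y := \sqrt{\alpha(M_0, \gamma_{\max}) - 1} \cdot \delta u.$$
We first construct the following $(D-d) \times (D-d)$ matrix, indexed by the bits $(\varepsilon_{ij})_{1 \leq i \leq q,\, 1 \leq j \leq m}$:
$$J^{(\varepsilon)} := \begin{bmatrix}
I_m & \mathrm{diag}(\varepsilon_{1j}\varepsilon_{2j})_{j=1}^m & 0 & \cdots & 0 & 0 \\
0 & I_m & \mathrm{diag}(\varepsilon_{2j}\varepsilon_{3j})_{j=1}^m & \cdots & 0 & 0 \\
\vdots & & \ddots & \ddots & & \vdots \\
0 & 0 & \cdots & & I_m & \mathrm{diag}(\varepsilon_{(q-1)j}\varepsilon_{qj})_{j=1}^m \\
0 & 0 & \cdots & & 0 & I_m
\end{bmatrix}. \tag{78a}$$
Each submatrix depicted above is an $m \times m$ matrix, and the diagonal blocks are given by identity matrices. We use this construction to define the population-level instance $(L^{(\varepsilon,z)}, b^{(\varepsilon,z)})$ as follows:
$$L^{(\varepsilon,z)} = \begin{bmatrix}
M_0 & \begin{bmatrix} \tfrac{\sqrt{d/2}}{D-d}\, z\varepsilon_{11} w & \tfrac{\sqrt{d/2}}{D-d}\, z\varepsilon_{12} w & \cdots & \tfrac{\sqrt{d/2}}{D-d}\, z\varepsilon_{1m} w & 0 & \cdots & 0 \end{bmatrix} \\
0 & \tfrac{1}{2} J^{(\varepsilon)}
\end{bmatrix}, \qquad
b^{(\varepsilon,z)} = \begin{bmatrix} \sqrt{2d}\, h_0 \\ 0 \\ \vdots \\ 0 \\ \tfrac{\delta}{\sqrt{2}} \varepsilon_{q1} \\ \vdots \\ \tfrac{\delta}{\sqrt{2}} \varepsilon_{qm} \end{bmatrix}. \tag{78b}$$
It can then be verified that the solution to the fixed point equation $v^*_{\varepsilon,z} = (I - L^{(\varepsilon,z)})^{-1} b^{(\varepsilon,z)}$ is given by
$$v^*_{\varepsilon,z} = \begin{bmatrix} \tfrac{\sqrt{d}}{q}\, z y + \sqrt{2d}\,(I - M_0)^{-1} h_0 \\ \sqrt{2}\delta\varepsilon_{11} \\ \sqrt{2}\delta\varepsilon_{12} \\ \vdots \\ \sqrt{2}\delta\varepsilon_{q1} \\ \vdots \\ \sqrt{2}\delta\varepsilon_{qm} \end{bmatrix}. \tag{78c}$$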
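The claim that (78c) solves the fixed point equation for the instance (78b) can be checked on a toy example. The sketch below assumes NumPy, reads the coupling coefficient in (78b) as $\sqrt{d/2}/(D-d)$, and uses arbitrary illustrative values: the sizes `d, q, m`, the scalar `s` (standing in for $\sqrt{\alpha(M_0,\gamma_{\max})-1}$), and a random $M_0$, $h_0$, $u$ are all assumptions chosen only for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
d, q, m = 2, 2, 2             # toy sizes (assumed); D - d = q * m
delta, s, z = 0.3, 0.7, -1    # s stands in for sqrt(alpha(M0, gamma_max) - 1)

M0 = 0.3 * rng.standard_normal((d, d))
h0 = rng.standard_normal(d)
u = rng.standard_normal(d)
u /= np.linalg.norm(u)
w = s * (np.eye(d) - M0) @ u
y = s * delta * u
eps = rng.choice([-1, 1], size=(q, m))

# J^(eps): identity blocks on the diagonal, diag(eps_i * eps_{i+1}) on the superdiagonal.
J = np.eye(q * m)
for i in range(q - 1):
    J[i * m:(i + 1) * m, (i + 1) * m:(i + 2) * m] = np.diag(eps[i] * eps[i + 1])

# Population-level instance, following (78b).
L = np.zeros((d + q * m, d + q * m))
L[:d, :d] = M0
L[:d, d:d + m] = (np.sqrt(d / 2) / (q * m)) * z * np.outer(w, eps[0])
L[d:, d:] = 0.5 * J
b = np.concatenate([np.sqrt(2 * d) * h0,
                    np.zeros((q - 1) * m),
                    delta / np.sqrt(2) * eps[-1]])

# Claimed solution (78c) versus a direct linear solve of (I - L) v = b.
v_claim = np.concatenate([
    np.sqrt(d) / q * z * y + np.sqrt(2 * d) * np.linalg.solve(np.eye(d) - M0, h0),
    np.sqrt(2) * delta * eps.reshape(-1),
])
v_solved = np.linalg.solve(np.eye(d + q * m) - L, b)
```

The two vectors agree up to floating-point error, which reflects the back-substitution through the block rows of $I - \tfrac{1}{2}J^{(\varepsilon)}$ described in the text.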

Similarly to the proof of Theorem 2, we take the subspace $\mathbb{S}$ to be the one spanned by the first $d$ coordinates, and take the weight vector to be
$$\xi = \Big[ \underbrace{\tfrac{1}{2d} \;\cdots\; \tfrac{1}{2d}}_{d} \;\; \underbrace{\tfrac{1}{2(D-d)} \;\cdots\; \tfrac{1}{2(D-d)}}_{D-d} \Big].$$
The inner product is then given by $\langle p, p' \rangle := \sum_{j=1}^{D} p_j \xi_j p'_j$ for vectors $p, p' \in \mathbb{R}^D$.

It remains to define our basis vectors. For $j \in [d]$, we let $\phi_j := \sqrt{2d}\, e_j$. For the orthogonal complement $\mathbb{S}^\perp$, we use the basis vectors
$$\begin{bmatrix} \phi_{d+1} & \phi_{d+2} & \cdots & \phi_{d+qm} \end{bmatrix} = \begin{bmatrix} 0 \\ \sqrt{2q}\, H_{\log_2 m} \otimes I_q \end{bmatrix}.$$

Recall that $H_k \in \mathbb{R}^{2^k \times 2^k}$ denotes the Hadamard matrix of order $k$, and $\otimes$ denotes the Kronecker product. The first $d$ rows of the matrix are zeros, while the following $D - d = mq$ rows are given by the Kronecker product. By the definition of the Hadamard matrix, we have $\|\phi_i\| = 1$ and $\langle \phi_i, \phi_j \rangle = 0$ for $i \neq j$. As before, the construction of equations (78a)–(78c) ensures that for any choice of the binary string $\varepsilon \in \{-1, 1\}^{mq}$ and bit $z \in \{-1, 1\}$, the oracle approximation error is equal to
$$\mathcal{A}(\mathbb{S}, v^*_{\varepsilon,z}) = \inf_{v \in \mathbb{S}} \|v^*_{\varepsilon,z} - v\|^2 = \frac{1}{2(D-d)} \sum_{i=1}^{q} \sum_{j=1}^{m} \big(\sqrt{2}\delta\varepsilon_{ij}\big)^2 = \delta^2. \tag{79a}$$
Furthermore, straightforward calculation yields that the projected matrix-vector pair takes the form

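$$\Phi_d L^{(\varepsilon,z)} \Phi_d^* = M_0, \quad \text{and} \quad \Phi_d b^{(\varepsilon,z)} = h_0. \tag{79b}$$
Now, we construct our observation model from which the samples $\big(L_i^{(\varepsilon,z)}, b_i^{(\varepsilon,z)}\big)_{i=1}^n$ are generated. For each $i \in [n]$ and $j \in [m]$, we independently sample Bernoulli random variables $\chi_{0j}^{(i)}, \chi_{1j}^{(i)}, \ldots, \chi_{qj}^{(i)} \overset{\text{i.i.d.}}{\sim} \mathrm{Ber}(1/m)$. The random observations are then generated by the random matrix
$$J_i^{(\varepsilon)} := \begin{bmatrix}
I_m & \mathrm{diag}\big(m\chi_{1j}^{(i)}\varepsilon_{1j}\varepsilon_{2j}\big)_{j=1}^m & 0 & \cdots & 0 & 0 \\
0 & I_m & \mathrm{diag}\big(m\chi_{2j}^{(i)}\varepsilon_{2j}\varepsilon_{3j}\big)_{j=1}^m & \cdots & 0 & 0 \\
\vdots & & \ddots & \ddots & & \vdots \\
0 & 0 & \cdots & & I_m & \mathrm{diag}\big(m\chi_{(q-1)j}^{(i)}\varepsilon_{(q-1)j}\varepsilon_{qj}\big)_{j=1}^m \\
0 & 0 & \cdots & & 0 & I_m
\end{bmatrix}, \tag{80a}$$
where once again, the diagonal blocks correspond to $m \times m$ identity matrices. We use this random matrix to generate the observations
$$L_i^{(\varepsilon,z)} = \begin{bmatrix}
M_0 & \begin{bmatrix} \tfrac{\sqrt{d/2}}{q}\, \chi_{01}^{(i)} z\varepsilon_{11} w & \tfrac{\sqrt{d/2}}{q}\, \chi_{02}^{(i)} z\varepsilon_{12} w & \cdots & \tfrac{\sqrt{d/2}}{q}\, \chi_{0m}^{(i)} z\varepsilon_{1m} w & 0 & \cdots & 0 \end{bmatrix} \\
0 & \tfrac{1}{2} J_i^{(\varepsilon)}
\end{bmatrix} \tag{80b}$$
and
$$b_i^{(\varepsilon,z)} = \Big[ \sqrt{2d}\, h_0^\top \;\; \underbrace{0 \;\cdots\; 0}_{D-d-m} \;\; m\chi_{q1}^{(i)} \tfrac{\delta}{\sqrt{2}} \varepsilon_{q1} \;\cdots\; m\chi_{qm}^{(i)} \tfrac{\delta}{\sqrt{2}} \varepsilon_{qm} \Big]^\top. \tag{80c}$$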

This concludes our description of the problem instances themselves. As before, our proof proceeds via Le Cam's lemma, and we use similar notation for product distributions and mixtures under this observation model. Let $\mathbb{P}^{(n)}_{\varepsilon,z}$ denote the $n$-fold product of the probability laws of $\big(L_i^{(\varepsilon,z)}, b_i^{(\varepsilon,z)}\big)$. We also define the following mixture of product measures for each $z \in \{-1, 1\}$:
$$\mathbb{P}^{(n)}_z := \frac{1}{2^{D-d}} \sum_{\varepsilon \in \{\pm 1\}^{m \times q}} \mathbb{P}^{(n)}_{\varepsilon,z}.$$

We seek bounds on the total variation distance
$$\Delta := d_{\mathrm{TV}}\big(\mathbb{P}^{(n)}_1, \mathbb{P}^{(n)}_{-1}\big).$$

With this setup, the following lemmas assert that (a) our construction satisfies the operator norm condition and the noise conditions in Assumption 1(S) with the associated parameters bounded by dimension-independent constants, and (b) the total variation distance $\Delta$ is small provided $n \ll m^{1+1/q}$.

Lemma 12. For $q \in \Big[2, \tfrac{1}{\sqrt{2(1 - 1 \wedge \gamma_{\max})}}\Big]$, and any $\varepsilon \in \{-1, 1\}^{m \times q}$ and $z \in \{-1, 1\}$:
(a) The construction in equation (78b) satisfies the bound $|||L^{(\varepsilon,z)}|||_{\mathbb{X}} \leq \gamma_{\max}$.
(b) The observation model defined in equations (80a)–(80c) satisfies Assumption 1(S) with $\sigma_L = \gamma_{\max} + 1$ and $\sigma_b = \delta/q$.

Lemma 13. Under the setup above, we have
$$\Delta \leq \frac{12\, n^{q+1}}{m^q}.$$

Part (a) of Lemma 12 and equations (79a)–(79b) together ensure that the population-level problem instance $(L, b)$ we constructed belongs to the class $\mathcal{C}_{\mathrm{approx}}(M_0, h_0, D, \delta, \gamma_{\max})$. Part (b) of Lemma 12 further ensures that the probability distribution $\mathbb{P}_{L,b}$ belongs to the class $\mathcal{G}_{\mathrm{var}}(\sigma_L, \sigma_b)$. Lemma 13 ensures that the two mixture distributions corresponding to different choices of the bit $z$ are close provided $n$ is not too large. The final step in applying Le Cam's mixture-vs-mixture result is to show that the approximation error is large for at least one of the choices of the bit $z$. Given any pair $\varepsilon, \varepsilon' \in \{-1, 1\}^{q \times m}$, we note that
$$\|v^*_{\varepsilon,1} - v^*_{\varepsilon',-1}\| \geq \Big\| \big[ \tfrac{2\sqrt{d}}{q}\, y^\top \;\; 0 \;\cdots\; 0 \big]^\top \Big\| = \frac{\sqrt{2}}{q} \|y\|_2 = \frac{\sqrt{2}}{q} \sqrt{\alpha(M_0, \gamma_{\max}) - 1} \cdot \delta.$$

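Applying the triangle inequality and Young's inequality, we have the bound
$$\frac{1}{2}\Big( \|\hat v - v^*_{\varepsilon,1}\|^2 + \|\hat v - v^*_{\varepsilon',-1}\|^2 \Big) \geq \frac{1}{4}\Big( \|\hat v - v^*_{\varepsilon,1}\| + \|\hat v - v^*_{\varepsilon',-1}\| \Big)^2 \geq \frac{1}{4} \|v^*_{\varepsilon,1} - v^*_{\varepsilon',-1}\|^2 \geq \frac{\alpha(M_0, \gamma_{\max}) - 1}{2q^2}\, \delta^2.$$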

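Finally, applying Le Cam's lemma yields
$$\inf_{\hat v_n \in \widehat{\mathcal{V}}_{\mathbb{X}}} \;\sup_{\substack{(L,b) \in \mathcal{C}_{\mathrm{approx}} \\ \mathbb{P}_{L,b} \in \mathcal{G}_{\mathrm{var}}(\sigma_L, \sigma_b)}} \mathbb{E}\|\hat v_n - v^*\|^2 \geq \frac{\alpha(M_0, \gamma_{\max}) - 1}{2q^2}\, \delta^2 \cdot \Big( 1 - d_{\mathrm{TV}}\big(\mathbb{P}^{(n)}_{-1}, \mathbb{P}^{(n)}_{1}\big) \Big),$$
and using Lemma 13 in conjunction with the condition $D \geq d + 3q\, n^{1+1/q}$, we arrive at the final bound
$$\inf_{\hat v_n \in \widehat{\mathcal{V}}_{\mathbb{X}}} \;\sup_{\substack{(L,b) \in \mathcal{C}_{\mathrm{approx}} \\ \mathbb{P}_{L,b} \in \mathcal{G}_{\mathrm{var}}(\sigma_L, \sigma_b)}} \mathbb{E}\|\hat v_n - v^*\|^2 \geq \frac{\alpha(M_0, \gamma_{\max}) - 1}{2q^2}\, \delta^2,$$
as desired.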

A.1 Proof of Lemma 12

We prove the two parts of the lemma separately.

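Proof of part (a): We first show the upper bound on the operator norm. For any vector $p \in \mathbb{R}^D$, we employ the decomposition $p = \begin{bmatrix} p^{(1)} \\ p^{(2)} \end{bmatrix}$ with $p^{(1)} \in \mathbb{R}^d$ and $p^{(2)} \in \mathbb{R}^{qm}$. Assuming that $\|p\|^2 = \frac{1}{2d}\|p^{(1)}\|_2^2 + \frac{1}{2(D-d)}\|p^{(2)}\|_2^2 = 1$, we have that
$$\|L^{(\varepsilon,z)} p\|^2 = \frac{1}{2d} \Big\| M_0 p^{(1)} + \frac{z\sqrt{d/2}}{D-d}\, w \cdot \sum_{j=1}^{m} \varepsilon_{1j} p^{(2)}_j \Big\|_2^2 + \frac{1}{2(D-d)} \Big\| \tfrac{1}{2} J^{(\varepsilon)} p^{(2)} \Big\|_2^2.$$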

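Define the vector $a_1 := \frac{1}{\sqrt{2d}}\, p^{(1)}$ and scalar $a_2 := \frac{1}{\sqrt{2(D-d)}} \|p^{(2)}\|_2$ for convenience; we have the identity $\|a_1\|_2^2 + a_2^2 = 1$. Following the same arguments as in the proof of Lemma 7, we then have
$$\begin{aligned}
\frac{1}{2d} \Big\| M_0 p^{(1)} + \frac{z\sqrt{d/2}}{D-d}\, w \cdot \sum_{j=1}^{m} \varepsilon_{1j} p^{(2)}_j \Big\|_2^2
&\leq \sup_{t \in [-1,1]} \frac{1}{2d} \Big\| M_0 p^{(1)} + \frac{\sqrt{d}}{q}\, a_2 t\, w \Big\|_2^2 \\
&\leq \max\Big\{ \big\| M_0 a_1 + \tfrac{1}{\sqrt{2} q} a_2 w \big\|_2^2,\; \big\| M_0 a_1 - \tfrac{1}{\sqrt{2} q} a_2 w \big\|_2^2 \Big\} \\
&\leq |||\begin{bmatrix} M_0 & w \end{bmatrix}|||_{\mathrm{op}}^2 \Big( \|a_1\|_2^2 + \frac{1}{2q^2} a_2^2 \Big).
\end{aligned}$$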

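By the definition of the vector $w$, we have the bound
$$|||\begin{bmatrix} M_0 & w \end{bmatrix}|||_{\mathrm{op}}^2 = \lambda_{\max}\big( M_0 M_0^\top + w w^\top \big) \leq \gamma_{\max}^2.$$

On the other hand, note that
$$\frac{1}{2(D-d)} \Big\| \tfrac{1}{2} J^{(\varepsilon)} p^{(2)} \Big\|_2^2 \leq \frac{1}{8(D-d)}\, |||J^{(\varepsilon)}|||_{\mathrm{op}}^2 \cdot \|p^{(2)}\|_2^2 = \frac{1}{4}\, |||J^{(\varepsilon)}|||_{\mathrm{op}}^2\, a_2^2,$$
and consequently, that
$$\|L^{(\varepsilon,z)} p\|^2 \leq \gamma_{\max}^2 \Big( \|a_1\|_2^2 + \frac{1}{2q^2} a_2^2 \Big) + \frac{1}{4}\, |||J^{(\varepsilon)}|||_{\mathrm{op}}^2 \cdot a_2^2. \tag{81}$$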

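In order to bound the operator norm of the matrix $J^{(\varepsilon)}$, we use the following fact about the operator norm, proved at the end of this section for convenience. For any block matrix $T = [T_{ij}]_{1 \leq i,j \leq q}$, with each block $T_{ij} \in \mathbb{R}^{m \times m}$, we have
$$|||[T_{ij}]_{1 \leq i,j \leq q}|||_{\mathrm{op}} \leq |||\big[\, |||T_{ij}|||_{\mathrm{op}} \big]_{1 \leq i,j \leq q}|||_{\mathrm{op}}. \tag{82}$$

Applying equation (82) to the matrix $J^{(\varepsilon)}$ yields the bound
$$|||J^{(\varepsilon)}|||_{\mathrm{op}} \leq |||I_q + J_q|||_{\mathrm{op}}.$$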

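Recall that $J_q$ is the Jordan block of size $q$ with zeros on the diagonal. Straightforward calculation yields the bound
$$(I_q + J_q)(I_q + J_q)^\top \preceq \begin{bmatrix} 2 & 1 & 0 & \cdots & 0 & 0 \\ 1 & 2 & 1 & \cdots & 0 & 0 \\ \vdots & & \ddots & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & 2 & 1 \\ 0 & 0 & \cdots & 0 & 1 & 2 \end{bmatrix} =: C_q.$$

Note that $C_q$ is a tridiagonal Toeplitz matrix, whose norm admits the closed-form expression
$$|||C_q|||_{\mathrm{op}} = 2 + 2\cos\Big( \frac{\pi}{q+1} \Big) \leq 4 - \frac{4}{q^2}.$$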
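The closed-form norm and the two inequalities around $C_q$ are easy to confirm numerically for small $q$. The sketch below assumes NumPy; the helper names `jordan_block` and `norm_check` are hypothetical.

```python
import numpy as np

def jordan_block(q: int) -> np.ndarray:
    """q x q matrix with ones on the first superdiagonal and zeros elsewhere."""
    return np.diag(np.ones(q - 1), k=1)

def norm_check(q: int):
    """Return (|||I+J_q|||^2, |||C_q|||, closed form, 4 - 4/q^2) for a given q."""
    Jq = jordan_block(q)
    C = 2 * np.eye(q) + Jq + Jq.T                     # tridiagonal Toeplitz matrix C_q
    norm_C = np.linalg.norm(C, 2)                     # spectral norm
    closed_form = 2 + 2 * np.cos(np.pi / (q + 1))
    norm_sq = np.linalg.norm(np.eye(q) + Jq, 2) ** 2  # |||I_q + J_q|||_op^2
    return norm_sq, norm_C, closed_form, 4 - 4 / q**2

for q in range(2, 9):
    print(q, norm_check(q))
```

For $q = 2$ the last inequality holds with equality, since $2 + 2\cos(\pi/3) = 3 = 4 - 4/4$.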

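Therefore, we have $|||J^{(\varepsilon)}|||_{\mathrm{op}} \leq \sqrt{|||C_q|||_{\mathrm{op}}} \leq \sqrt{4 - 4/q^2}$. Substituting into equation (81), we obtain
$$\|L^{(\varepsilon,z)} p\|^2 \leq \gamma_{\max}^2 \Big( \|a_1\|_2^2 + \frac{1}{2q^2} a_2^2 \Big) + \Big( 1 - \frac{1}{q^2} \Big) \cdot a_2^2.$$

Invoking the condition $q \leq \tfrac{1}{\sqrt{2(1 - 1 \wedge \gamma_{\max})}}$, we have the bound
$$\|L^{(\varepsilon,z)} p\|^2 \leq \gamma_{\max}^2 \Big( \|a_1\|_2^2 + \frac{a_2^2}{2q^2} \Big) + \Big( 1 - \frac{1}{q^2} \Big) a_2^2 \leq \gamma_{\max}^2 \big( \|a_1\|_2^2 + a_2^2 \big) = \gamma_{\max}^2.$$
Since the choice of the vector $p$ is arbitrary, this yields
$$|||L^{(\varepsilon,z)}|||_{\mathbb{X}} \leq \gamma_{\max},$$
which completes the proof.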

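Proof of part (b): Next, we verify the noise conditions in Assumption 1(S). For a vector $p \in \mathbb{X}$ such that $\|p\| = 1$, we write $p = \begin{bmatrix} p^{(1)} \\ p^{(2)} \end{bmatrix}$, where $p^{(1)} \in \mathbb{R}^d$ and $p^{(2)} \in \mathbb{R}^{D-d}$. We have the identity $\frac{1}{2d}\|p^{(1)}\|_2^2 + \frac{1}{2(D-d)}\|p^{(2)}\|_2^2 = 1$.

For the noise $b_i^{(\varepsilon,z)} - b^{(\varepsilon,z)}$, we note that
$$\begin{aligned}
\mathbb{E}\big\langle p,\, b_i^{(\varepsilon,z)} - b^{(\varepsilon,z)} \big\rangle^2
&\leq \frac{1}{4(D-d)^2}\, \mathbb{E}\Big( \sum_{j=1}^{m} \big(m\chi_{qj}^{(i)} - 1\big) \tfrac{\delta}{\sqrt{2}}\, \varepsilon_{qj}\, p^{(2)}_{(q-1)m+j} \Big)^2 \\
&\leq \frac{1}{4(D-d)^2} \cdot \frac{m\delta^2}{2} \|p^{(2)}\|_2^2 \leq \frac{\delta^2}{q^2}.
\end{aligned}$$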

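Consequently, equation (12b) is satisfied for $\sigma_b = \delta/q$. In order to bound the noise in the $L$ component, we consider the first $d$ basis vectors and the last $(D-d)$ basis vectors separately. First, note that for each $k \in [d]$, we have
$$\begin{aligned}
\mathbb{E}\big\langle \phi_k,\, \big(L_i^{(\varepsilon,z)} - L^{(\varepsilon,z)}\big) p \big\rangle^2
&= \frac{1}{4d^2}\, \mathbb{E}\Big( \frac{\sqrt{d/2}}{D-d} \sum_{j=1}^{m} z\big(m\chi_{0j}^{(i)} - 1\big) \varepsilon_{1j}\, p^{(2)}_j \cdot \sqrt{2d}\, w^\top e_k \Big)^2 \\
&\leq \frac{1}{4} \|w\|_2^2 \cdot \frac{m}{(D-d)^2} \sum_{j=1}^{m} \big( p^{(2)}_j \big)^2 \\
&\leq \frac{\|w\|_2^2}{4q}.
\end{aligned}$$

Following the derivation in Lemma 7, we have $\|w\|_2 \leq \gamma_{\max}$, and consequently, we have the bound
$$\mathbb{E}\big\langle \phi_k,\, \big(L_i^{(\varepsilon,z)} - L^{(\varepsilon,z)}\big) p \big\rangle^2 \leq \gamma_{\max}^2.$$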

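On the other hand, for $k \geq d+1$, the basis vector $\phi_k$ is constructed through the Hadamard matrix. Let $\psi^{(k)} := (2q)^{-1/2} \cdot \phi_k$, and note that the entries of $\psi^{(k)}$ are uniformly bounded by 1. Letting $k - d = (k_0 - 1)m + k_1$ for some choice of integers $k_0 \in [q]$ and $k_1 \in [m]$, we have
$$\begin{aligned}
\mathbb{E}\big\langle \phi_k,\, \big(L_i^{(\varepsilon,z)} - L^{(\varepsilon,z)}\big) p \big\rangle^2
&\leq \frac{1}{4(D-d)^2}\, \mathbb{E}\Big( \sum_{j=1}^{m} \big(m\chi_{k_0 j}^{(i)} - 1\big) \varepsilon_{k_0 j} \varepsilon_{(k_0+1) j} \cdot \sqrt{2q}\, \psi^{(k)}_{d+(k_0-1)m+j} \cdot p^{(2)}_{k_0 m + j} \Big)^2 \\
&\leq \frac{2mq}{4(D-d)^2} \sum_{j=1}^{m} \big( \psi^{(k)}_{d+(k_0-1)m+j} \big)^2 \cdot \big( p^{(2)}_{k_0 m + j} \big)^2 \\
&\leq \frac{1}{2(D-d)} \|p^{(2)}\|_2^2 = \|p^{(2)}\|^2 \leq 1.
\end{aligned}$$

This verifies that equation (12a) is satisfied with parameter $\sigma_L = \gamma_{\max} + 1$.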

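Proof of equation (82): For any vector $x \in \mathbb{R}^{mq}$, consider the decomposition $x = \begin{bmatrix} x_1^\top & \cdots & x_q^\top \end{bmatrix}^\top$ with $x_j \in \mathbb{R}^m$ for each $j \in [q]$. We have the bound
$$\begin{aligned}
\|T x\|_2^2 = \sum_{i=1}^{q} \Big\| \sum_{j=1}^{q} T_{ij} x_j \Big\|_2^2
&\leq \sum_{i=1}^{q} \Big( \sum_{j=1}^{q} |||T_{ij}|||_{\mathrm{op}} \|x_j\|_2 \Big)^2 \\
&= \Big\| \big[\, |||T_{ij}|||_{\mathrm{op}} \big]_{1 \leq i,j \leq q} \cdot \big[ \|x_j\|_2 \big]_{1 \leq j \leq q} \Big\|_2^2
\leq |||\big[\, |||T_{ij}|||_{\mathrm{op}} \big]_{1 \leq i,j \leq q}|||_{\mathrm{op}}^2 \cdot \|x\|_2^2,
\end{aligned}$$
which proves this inequality.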
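The block-norm inequality just proved can be illustrated on a random matrix. This is a minimal sketch assuming NumPy; the sizes `q`, `m` and the Gaussian test matrix are arbitrary choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
q, m = 3, 4
T = rng.standard_normal((q * m, q * m))

# q x q matrix whose (i, j) entry is the spectral norm of the (i, j) block of T.
block_norms = np.array([
    [np.linalg.norm(T[i * m:(i + 1) * m, j * m:(j + 1) * m], 2) for j in range(q)]
    for i in range(q)
])

lhs = np.linalg.norm(T, 2)            # operator norm of the full block matrix
rhs = np.linalg.norm(block_norms, 2)  # operator norm of the matrix of block norms
```

In the proof of part (a), the matrix of block norms of $J^{(\varepsilon)}$ is exactly $I_q + J_q$, since every nonzero block is a diagonal matrix with entries in $\{-1, 1\}$.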

A.2 Proof of Lemma 13

For each $j \in [m]$, define the event
$$\mathscr{E}_j := \Big\{ \text{for all } i \in \{0, 1, \ldots, q\}, \text{ there exists } \ell_i \in [n] \text{ such that } \chi_{ij}^{(\ell_i)} = 1 \Big\}. \tag{83}$$

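Let $\mathscr{E} := \bigcup_{j=1}^{m} \mathscr{E}_j$. We note that under both $\mathbb{P}^{(n)}_1$ and $\mathbb{P}^{(n)}_{-1}$, we have the inequality
$$\mathbb{P}(\mathscr{E}) \leq \sum_{j=1}^{m} \mathbb{P}(\mathscr{E}_j) = \sum_{j=1}^{m} \prod_{i=0}^{q} \Big( 1 - \mathbb{P}\big\{ \chi_{ij}^{(\ell)} = 0 \text{ for all } \ell \in [n] \big\} \Big) = m \cdot \Big( 1 - \Big( \frac{m-1}{m} \Big)^n \Big)^{q+1} \leq \frac{n^{q+1}}{m^q}.$$
As before, our choice of the event $\mathscr{E}$ was guided by the fact that the two mixture distributions are identical on its complement. In particular, we claim that
$$\mathbb{P}^{(n)}_1 | \mathscr{E}^C = \mathbb{P}^{(n)}_{-1} | \mathscr{E}^C. \tag{84}$$
Taking this claim as given for the moment and applying Lemma 9, we arrive at the bound
$$d_{\mathrm{TV}}\big( \mathbb{P}^{(n)}_1, \mathbb{P}^{(n)}_{-1} \big) \leq \frac{12\, n^{q+1}}{m^q}.$$
This completes the proof of this lemma. It remains to prove equation (84).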
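The last step of the union bound uses $1 - (1 - 1/m)^n \leq n/m$. A short numerical sketch (plain Python; the function names are hypothetical) compares the exact value of $\sum_j \mathbb{P}(\mathscr{E}_j)$ with the simplified bound for a few illustrative parameter choices.

```python
def union_bound(m: int, n: int, q: int) -> float:
    """Exact value of sum_j P(E_j) = m * (1 - (1 - 1/m)^n)^(q+1)."""
    return m * (1.0 - (1.0 - 1.0 / m) ** n) ** (q + 1)

def simplified_bound(m: int, n: int, q: int) -> float:
    """The simplified upper bound n^(q+1) / m^q."""
    return n ** (q + 1) / m ** q

# Illustrative (m, n, q) triples, chosen arbitrarily.
cases = [(64, 4, 2), (256, 8, 3), (1024, 16, 2), (16, 16, 2)]
for m, n, q in cases:
    print(m, n, q, union_bound(m, n, q), simplified_bound(m, n, q))
```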

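Proof of equation (84): We first note that both $\mathbb{P}^{(n)}_1$ and $\mathbb{P}^{(n)}_{-1}$ are $m$-fold i.i.d. product distributions: given $z = 1$ or $z = -1$, the random objects $\big( (\varepsilon_{ij})_{1 \leq i \leq q},\, (\chi_{ij}^{(\ell)})_{0 \leq i \leq q,\, 1 \leq \ell \leq n} \big)_{j=1}^{m}$ are independent and identically distributed. Now for each $j \in [m]$, let $\mathbb{Q}^{(n)}_{z,j}$ be the joint law of the random object $(\rho^{(\ell)})_{\ell=1}^{n}$, where
$$\rho^{(\ell)} = \Big( z\chi_{0j}^{(\ell)} \varepsilon_{1j},\; \chi_{1j}^{(\ell)} \varepsilon_{1j}\varepsilon_{2j},\; \ldots,\; \chi_{(q-1)j}^{(\ell)} \varepsilon_{(q-1)j}\varepsilon_{qj},\; \chi_{qj}^{(\ell)} \varepsilon_{qj} \Big).$$
It suffices to show that
$$\forall j \in [m], \qquad \mathbb{Q}^{(n)}_{1,j} | \mathscr{E}_j^C = \mathbb{Q}^{(n)}_{-1,j} | \mathscr{E}_j^C. \tag{85}$$
To prove equation (85), we construct a distribution $\mathbb{Q}^{(n)}_{*,j}$, and show that it is equal to both of the conditional laws above. In analogy to the proof of Theorem 2, we construct $\mathbb{Q}^{(n)}_{*,j}$ according to the following sampling procedure:
• Sample the indicators $(\chi_{ij}^{(\ell)})_{0 \leq i \leq q,\, 1 \leq \ell \leq n}$, each from the Bernoulli distribution $\mathrm{Ber}(1/m)$, conditioned on the event $\mathscr{E}_j^C$ (see footnote 11).
• For each $i \in \{0, 1, \ldots, q\}$, sample a random bit $\zeta^{(i)} \sim \mathcal{U}(\{-1, 1\})$ independently.
• For each $\ell \in [n]$, generate the random object
$$\rho^{(\ell)} = \Big( \chi_{0j}^{(\ell)} \zeta^{(0)},\; \chi_{1j}^{(\ell)} \zeta^{(1)},\; \ldots,\; \chi_{(q-1)j}^{(\ell)} \zeta^{(q-1)},\; \chi_{qj}^{(\ell)} \zeta^{(q)} \Big).$$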

In the following, we construct a coupling between $\mathbb{Q}^{(n)}_{z,j} | \mathscr{E}_j^C$ and $\mathbb{Q}^{(n)}_{*,j}$, for any $z \in \{-1, 1\}$, and show that they are actually the same.

First, we couple the random indicators $(\chi_{ij}^{(\ell)})_{0 \leq i \leq q,\, 1 \leq \ell \leq n}$ under $\mathbb{Q}^{(n)}_{z,j} | \mathscr{E}_j^C$ and $\mathbb{Q}^{(n)}_{*,j}$ directly so that they are equal almost surely. By the first step in the sampling procedure, we know that the conditional law of these indicators is the same under both probability distributions. By definition (83), we note that
$$\mathscr{E}_j^C = \Big\{ \text{there exists } i \in \{0, 1, \ldots, q\} \text{ such that } \chi_{ij}^{(\ell)} = 0 \text{ for all } \ell \in [n] \Big\}.$$
Let the random variable $\iota \in \{0, 1, \ldots, q\}$ be the smallest such index $i$ (see footnote 12). We construct the joint distribution by conditioning on different values of $\iota$. First, note that the value of the random variable $\zeta^{(\iota)}$ is never observed, so we may set it to be an independent Rademacher random variable without loss of generality, and this does not affect the law of the random object $(\rho^{(\ell)})_{\ell=1}^{n}$ under consideration. Now consider the following three cases:

Footnote 11: Note that $\mathbb{P}(\mathscr{E}_j^C) > 0$ in this sampling procedure, so the conditional distribution is well-defined.
Footnote 12: Note that under the joint distribution we construct, $\iota$ is well-defined almost surely.

Case I: $\iota = 0$: In this case, we have $\chi_{0j}^{(\ell)} = 0$ for all $\ell \in [n]$. So the first coordinate of each $\rho^{(\ell)}$ is always zero. We define the following random variables in the probability space of $(\varepsilon_{ij})_{i=1}^{q}$:
$$\zeta^{(q)\prime} := \varepsilon_{qj}, \quad \text{and} \quad \zeta^{(i)\prime} := \varepsilon_{ij}\varepsilon_{(i+1)j} \;\text{ for } i \in \{1, 2, \ldots, q-1\}.$$

Since $(\varepsilon_{ij})_{i=1}^{q}$ are i.i.d. Rademacher random variables, it is easy to show by induction that the random sequence $(\zeta^{(i)\prime})_{i=1}^{q}$ is also i.i.d. Rademacher, which has the same law as $(\zeta^{(i)})_{i=1}^{q}$. Consequently, we can construct the coupling such that $(\zeta^{(i)})_{i=1}^{q} = (\zeta^{(i)\prime})_{i=1}^{q}$ almost surely. Under this coupling, when $\iota = 0$, the random objects $(\rho^{(\ell)})_{\ell=1}^{n}$ generated under both probability distributions are almost surely the same.

Case II: $\iota \in \{1, 2, \ldots, q-1\}$: In this case, we have $\chi_{\iota j}^{(\ell)} = 0$ for all $\ell \in [n]$. We define the following random variables in the probability space of $(\varepsilon_{ij})_{i=1}^{q}$:
$$\zeta^{(0)\prime} := z\varepsilon_{1j}, \quad \zeta^{(i)\prime} := \varepsilon_{ij}\varepsilon_{(i+1)j} \;\text{ for } i \in \{1, 2, \ldots, q-1\} \setminus \{\iota\}, \quad \text{and} \quad \zeta^{(q)\prime} := \varepsilon_{qj}.$$

Note that the tuple of random variables $(\zeta^{(i)\prime})_{i=0}^{\iota-1}$ lives in the sigma-field $\sigma(\varepsilon_{1j}, \ldots, \varepsilon_{\iota j})$, and $(\zeta^{(i)\prime})_{i=\iota+1}^{q}$, on the other hand, lives in the sigma-field $\sigma(\varepsilon_{(\iota+1)j}, \ldots, \varepsilon_{qj})$, so these two tuples are independent. For any fixed bit $z \in \{-1, 1\}$, it is easy to show by induction that $(\zeta^{(i)\prime})_{i=0}^{\iota-1}$ is an i.i.d. Rademacher random sequence. Similarly, applying the induction backwards from $q$ to $\iota + 1$, we can show that $(\zeta^{(i)\prime})_{i=\iota+1}^{q}$ is also an i.i.d. Rademacher random sequence. Putting together the pieces, we see that the random variables $(\zeta^{(i)\prime})_{0 \leq i \leq q,\, i \neq \iota}$ are i.i.d. Rademacher, and have the same law as the tuple $(\zeta^{(i)})_{0 \leq i \leq q,\, i \neq \iota}$. We can therefore construct the coupling such that they are equal almost surely. Under this coupling, the random objects $(\rho^{(\ell)})_{\ell=1}^{n}$ generated under both probability distributions are almost surely the same.

Case III: $\iota = q$: In this case, we have $\chi_{\iota j}^{(\ell)} = 0$ for all $\ell \in [n]$. We define the following random variables in the probability space of $(\varepsilon_{ij})_{i=1}^{q}$:
$$\zeta^{(0)\prime} := z\varepsilon_{1j}, \quad \text{and} \quad \zeta^{(i)\prime} := \varepsilon_{ij}\varepsilon_{(i+1)j} \;\text{ for } i \in \{1, 2, \ldots, q-1\}.$$

Note that $(\varepsilon_{ij})_{i=1}^{q}$ are i.i.d. Rademacher random variables. For each choice of the bit $z \in \{-1, 1\}$, we can show by induction that the random sequence $(\zeta^{(i)\prime})_{i=0}^{q-1}$ is also i.i.d. Rademacher, which has the same law as the tuple $(\zeta^{(i)})_{i=0}^{q-1}$. Making them equal almost surely in the coupling leads to the corresponding random objects $(\rho^{(\ell)})_{\ell=1}^{n}$ being almost surely equal.

Therefore, we have constructed a coupling between $\mathbb{Q}^{(n)}_{z,j} | \mathscr{E}_j^C$ and $\mathbb{Q}^{(n)}_{*,j}$ so that the generated random objects are always equal, for any $z \in \{-1, +1\}$. This shows that
$$\mathbb{Q}^{(n)}_{1,j} | \mathscr{E}_j^C = \mathbb{Q}^{(n)}_{*,j} = \mathbb{Q}^{(n)}_{-1,j} | \mathscr{E}_j^C,$$
which completes the proof of equation (85), and hence, the lemma.
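For very small parameters, the identity (85) can be confirmed by exhaustive enumeration rather than by coupling. The sketch below uses only the Python standard library and exact rational arithmetic; the function name `conditional_law` and the default sizes `q = 2, n = 2, m = 2` are assumptions made for illustration. It computes the exact law of $(\rho^{(\ell)})_{\ell=1}^n$ for a single column $j$, conditioned on $\mathscr{E}_j^C$, for a given sign $z$.

```python
from fractions import Fraction
from itertools import product

def conditional_law(z: int, q: int = 2, n: int = 2, m: int = 2) -> dict:
    """Exact law of (rho^(1), ..., rho^(n)) given E_j^C, for one fixed column j."""
    p1 = Fraction(1, m)
    law, mass = {}, Fraction(0)
    for chi in product([0, 1], repeat=(q + 1) * n):
        # rows[l][i] plays the role of chi_{ij}^{(l)} for i = 0, ..., q.
        rows = [chi[l * (q + 1):(l + 1) * (q + 1)] for l in range(n)]
        # E_j^C: some index i has chi_{ij}^{(l)} = 0 for every sample l.
        if not any(all(r[i] == 0 for r in rows) for i in range(q + 1)):
            continue
        w_chi = Fraction(1)
        for c in chi:
            w_chi *= p1 if c else 1 - p1
        for eps in product([-1, 1], repeat=q):
            # Signs (z*eps_1, eps_1*eps_2, ..., eps_{q-1}*eps_q, eps_q).
            signs = [z * eps[0]] + [eps[i] * eps[i + 1] for i in range(q - 1)] + [eps[-1]]
            rho = tuple(tuple(r[i] * signs[i] for i in range(q + 1)) for r in rows)
            weight = w_chi * Fraction(1, 2 ** q)
            law[rho] = law.get(rho, Fraction(0)) + weight
            mass += weight
    return {key: value / mass for key, value in law.items()}

law_plus = conditional_law(+1)
law_minus = conditional_law(-1)
```

The two dictionaries coincide exactly, which matches the conclusion of the three-case coupling argument for these toy sizes.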

B Proof of the bounds on the approximation factor

In this section, we prove the claims on the quantity α(M, γmax) that defines the optimal approximation factor.

B.1 Proof of Lemma 1

Recall that
$$\alpha(M, s) = 1 + \lambda_{\max}\Big( (I - M)^{-1} \big( s^2 I_d - M M^\top \big) (I - M)^{-\top} \Big). \tag{86}$$

In the following, we prove upper bounds for the two different cases separately.

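Bounds in the general case: By assumption, we have $|||M|||_{\mathrm{op}} \leq s$, and consequently,
$$0 \preceq s^2 I_d - M M^\top \preceq s^2 I_d.$$

Thus, we have the sequence of implications
$$\begin{aligned}
\alpha(M, s) - 1 &= \lambda_{\max}\Big( (I - M)^{-1} \big( s^2 I - M M^\top \big) (I - M)^{-\top} \Big) \\
&= |||(I - M)^{-1} \big( s^2 I - M M^\top \big) (I - M)^{-\top}|||_{\mathrm{op}} \\
&\leq |||(I - M)^{-1}|||_{\mathrm{op}} \cdot |||s^2 I_d - M M^\top|||_{\mathrm{op}} \cdot |||(I - M)^{-1}|||_{\mathrm{op}} \\
&\leq |||(I - M)^{-1}|||_{\mathrm{op}}^2 \cdot s^2,
\end{aligned}$$
which proves the bound.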

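Bounds under non-expansive condition: When $s \leq 1$, we have
$$s^2 I - M M^\top \preceq I - M M^\top = \frac{1}{2}(I - M)(I + M^\top) + \frac{1}{2}(I + M)(I - M^\top).$$
Consequently, we have the chain of bounds
$$\begin{aligned}
\alpha(M, s) - 1 &\leq \lambda_{\max}\Big( (I - M)^{-1} (I - M M^\top) (I - M)^{-\top} \Big) \\
&= \frac{1}{2} \lambda_{\max}\Big( (I + M)^\top (I - M^\top)^{-1} + (I - M)^{-1} (I + M) \Big) \\
&\leq \frac{1}{2} |||(I + M)^\top (I - M^\top)^{-1}|||_{\mathrm{op}} + \frac{1}{2} |||(I - M)^{-1} (I + M)|||_{\mathrm{op}} \\
&\leq |||(I - M)^{-1}|||_{\mathrm{op}} + |||(I - M^\top)^{-1}|||_{\mathrm{op}} \\
&= 2\, |||(I - M)^{-1}|||_{\mathrm{op}}.
\end{aligned}$$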

Finally, we note that if $\kappa(M) < 1$, then for any $u \in \mathbb{R}^d$, we have
$$(1 - \kappa(M)) \|u\|_2^2 \leq \langle (I - M)u,\, u \rangle \leq \|(I - M)u\|_2 \cdot \|u\|_2.$$

Consequently, we have $|||(I - M)^{-1}|||_{\mathrm{op}} \leq \frac{1}{1 - \kappa(M)}$, which completes the proof of this lemma.
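The bounds in this proof can be checked numerically on a random instance. The sketch below assumes NumPy and takes $\kappa(M)$ to be the largest eigenvalue of $(M + M^\top)/2$, which is an assumption here since its definition appears elsewhere in the paper; the helper name `alpha` and the test matrix are likewise illustrative.

```python
import numpy as np

def alpha(M: np.ndarray, s: float) -> float:
    """alpha(M, s) = 1 + lambda_max((I - M)^{-1} (s^2 I - M M^T) (I - M)^{-T}), as in (86)."""
    d = M.shape[0]
    inv = np.linalg.inv(np.eye(d) - M)
    S = inv @ (s**2 * np.eye(d) - M @ M.T) @ inv.T
    return 1.0 + np.linalg.eigvalsh((S + S.T) / 2).max()   # symmetrize against round-off

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
M *= 0.9 / np.linalg.norm(M, 2)       # rescale so that |||M|||_op = 0.9 <= s
s = 0.95

inv_norm = np.linalg.norm(np.linalg.inv(np.eye(4) - M), 2)
kappa = np.linalg.eigvalsh((M + M.T) / 2).max()   # assumed definition of kappa(M)

general_bound = s**2 * inv_norm**2    # general case
nonexpansive_bound = 2 * inv_norm     # case s <= 1
```

Both `general_bound` and `nonexpansive_bound` dominate `alpha(M, s) - 1`, and `inv_norm` is at most `1 / (1 - kappa)`.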

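B.2 Proof of Lemma 2

Once again, recall the definition
$$\alpha(M, s) = 1 + \lambda_{\max}\Big( (I - M)^{-1} \big( s^2 I_d - M M^\top \big) (I - M)^{-\top} \Big).$$

Since $M$ is symmetric, let $M = P \Lambda P^\top$ be its eigen-decomposition, where $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_d)$, and note that
$$\begin{aligned}
\alpha(M, s) &= 1 + \lambda_{\max}\Big( P (I - \Lambda)^{-1} (s^2 I - \Lambda^2) (I - \Lambda)^{-1} P^\top \Big) \\
&= 1 + \lambda_{\max}\Big( (I - \Lambda)^{-2} (s^2 I - \Lambda^2) \Big) \\
&= 1 + \max_{1 \leq i \leq d} \frac{s^2 - \lambda_i^2}{(1 - \lambda_i)^2}.
\end{aligned}$$
Setting $s = \gamma_{\max}$ gives the expression $1 + \max_{1 \leq i \leq d} \frac{\gamma_{\max}^2 - \lambda_i^2}{(1 - \lambda_i)^2}$, which completes the proof.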
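For a symmetric matrix, the closed form above can be compared against the definition directly. This is a minimal sketch assuming NumPy; the eigenvalues in `lam`, the value of `s`, and the helper name `alpha` are arbitrary illustrative choices.

```python
import numpy as np

def alpha(M: np.ndarray, s: float) -> float:
    """alpha(M, s) computed from the definition via an eigenvalue solve."""
    d = M.shape[0]
    inv = np.linalg.inv(np.eye(d) - M)
    S = inv @ (s**2 * np.eye(d) - M @ M.T) @ inv.T
    return 1.0 + np.linalg.eigvalsh((S + S.T) / 2).max()

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))   # random orthogonal matrix P
lam = np.array([-0.8, -0.3, 0.1, 0.5, 0.85])        # eigenvalues of M (assumed)
M = Q @ np.diag(lam) @ Q.T                          # symmetric M = P Lambda P^T
s = 0.9

alpha_direct = alpha(M, s)
alpha_closed = 1.0 + np.max((s**2 - lam**2) / (1.0 - lam) ** 2)
```

In this example the maximum is attained at the eigenvalue closest to 1, which illustrates how the approximation factor grows as the spectrum of $M$ approaches 1.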

C Proofs for the examples

In this section, we provide proofs for the results related to the three examples discussed in Section 4. Note that Corollary 3 follows directly from Theorem 1. Moreover, the proof of Corollary 4 builds on some technical results, and so we postpone the proof of all results related to elliptic equations to Appendix C.3. We begin this section with proofs of results related to temporal difference methods, i.e., Corollary 5 and Proposition 1.

C.1 Proof of Corollary 5

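Recall our definition of the positive definite matrix $B$, with $B_{ij} = \langle \psi_i, \psi_j \rangle$. Letting $\theta_t := B^{1/2} \vartheta_t$, the iterates (41a) can be equivalently written as
$$\theta_{t+1} = \theta_t - \eta B \cdot \Big( \phi(s_{t+1})\phi(s_{t+1})^\top \theta_t - \gamma \phi(s_{t+1})\phi(s_{t+1}^+)^\top \theta_t - R_{t+1}(s_{t+1})\phi(s_{t+1}) \Big), \tag{87}$$
and the Polyak–Ruppert averaged iterate is given by $\hat\theta_n := \frac{2}{n} \sum_{t=n/2}^{n-1} \theta_t$. We also define $\bar\theta := \Phi_d \bar v$, which is the solution to the projected linear equations under the orthogonal basis. Clearly, we have $\bar\theta = B^{1/2} \bar\vartheta$.

We now claim that if $n \geq \frac{c_0 \varsigma^4 \beta^2}{\mu^2 (1 - \kappa(M))^2}\, d \log^2\Big( \frac{\|\vartheta_0 - \bar\vartheta\|_2\, d\beta}{\mu (1 - \kappa(M))} \Big)$, then
$$\|\Phi_d^* \bar\theta - v^*\|^2 \leq \alpha(M, \gamma)\, \mathcal{A}(\mathbb{S}, v^*), \quad \text{and} \tag{88a}$$
$$\mathbb{E}\|\hat\theta_n - \bar\theta\|_2^2 \leq c\, \mathcal{E}_n(M, \Sigma_L + \Sigma_b) + c \big( 1 + \|\bar v\|^2 \big) \Big( \frac{\varsigma^2 \beta}{(1 - \kappa(M))\mu} \sqrt{\frac{d}{n}} \Big)^3. \tag{88b}$$

Taking both inequalities as given for now, we proceed with the proof of this corollary. Combining equation (88a) and equation (88b) via Young's inequality, we arrive at the bound
$$\begin{aligned}
\mathbb{E}\|\hat v_n - v^*\|^2 &\leq (1 + \omega) \|\Phi_d^* \bar\theta - v^*\|^2 + \Big( 1 + \frac{1}{\omega} \Big) \mathbb{E}\|\hat\theta_n - \bar\theta\|_2^2 \\
&\leq (1 + \omega)\, \alpha(M, \gamma)\, \mathcal{A}(\mathbb{S}, v^*) + c \Big( 1 + \frac{1}{\omega} \Big) \Big[ \mathcal{E}_n(M, \Sigma_L + \Sigma_b) + \big( 1 + \|\bar v\|^2 \big) \Big( \frac{\varsigma^2 \beta}{(1 - \kappa(M))\mu} \sqrt{\frac{d}{n}} \Big)^3 \Big],
\end{aligned}$$
which completes the proof of this corollary.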

Proof of equation (88a): By equation (23) and the definition of $\bar\theta$, we have
$$\bar\theta = \gamma M \bar\theta + \mathbb{E}_\xi[R(s)\phi(s)].$$
It is easy to see that $\Phi_d^* \bar\theta$ solves the projected Bellman equation (22). Note furthermore that the projected linear operator is given by
$$\Phi_d L \Phi_d^* = \gamma \Phi_d P \Phi_d^* = M.$$
Invoking the bound in equation (45a), we complete the proof of this inequality.

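Proof of equation (88b): Following the proof strategy for the bound (45b), we first show an upper bound on the iterates $\mathbb{E}\|\vartheta_t - \bar\vartheta\|_2^2$ under the non-orthogonal basis $(\psi_j)_{j \in [d]}$, and then use this bound to establish the final estimation error guarantee under the $\|\cdot\|$-norm. Recall the stochastic approximation procedure under the non-orthogonal basis:
$$\vartheta_{t+1} = \vartheta_t - \eta \Big( \psi(s_{t+1})\psi(s_{t+1})^\top \vartheta_t - \gamma \psi(s_{t+1})\psi(s_{t+1}^+)^\top \vartheta_t - R_{t+1}(s_{t+1})\psi(s_{t+1}) \Big).$$

Let $\widetilde M := I_d - \frac{1}{\beta} B^{1/2} (I_d - M) B^{1/2}$ and $\widetilde h := \frac{1}{\beta} \mathbb{E}[R(s)\psi(s)]$. We can view equation (41a) as a stochastic approximation procedure for solving the linear fixed-point equation $\bar\vartheta = \widetilde M \bar\vartheta + \widetilde h$, with stochastic observations
$$\widetilde M_t := I_d - \beta^{-1} \Big( \psi(s_t)\psi(s_t)^\top - \gamma \psi(s_t)\psi(s_t^+)^\top \Big), \quad \text{and} \quad \widetilde h_t := \beta^{-1} R(s_t)\psi(s_t).$$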
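The recursion under the non-orthogonal basis, together with the tail averaging used to form the Polyak–Ruppert iterate, can be sketched as follows. This is an illustrative fragment assuming NumPy, not the authors' implementation; the function names `td_step` and `tail_average` and all argument names are hypothetical, and the caller is assumed to supply the feature vectors $\psi(s)$, $\psi(s^+)$ and the reward.

```python
import numpy as np

def td_step(vartheta, psi_s, psi_next, reward, eta, gamma):
    """One update: vartheta - eta * (psi psi^T vartheta - gamma psi psi_next^T vartheta - R psi)."""
    direction = (psi_s * (psi_s @ vartheta)
                 - gamma * psi_s * (psi_next @ vartheta)
                 - reward * psi_s)
    return vartheta - eta * direction

def tail_average(iterates):
    """Average of the last half of the iterates, i.e. (2/n) * sum_{t=n/2}^{n-1} of iterate t."""
    n = len(iterates)
    return np.mean(iterates[n // 2:], axis=0)

# Tiny usage example with made-up numbers.
theta0 = np.zeros(2)
theta1 = td_step(theta0, np.array([1.0, 0.0]), np.array([0.0, 1.0]),
                 reward=1.0, eta=0.5, gamma=0.9)
avg = tail_average([np.array([0.0]), np.array([1.0]), np.array([2.0]), np.array([3.0])])
```

Starting from zero, the first step moves only along the observed feature direction, scaled by the step size and the reward.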

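To verify Assumption 1(W), we note that for $p, q \in \mathbb{S}^{d-1}$, the following bounds directly follow from the condition (42):
$$\begin{aligned}
\mathbb{E}\Big( p^\top \big( \widetilde M_t - \widetilde M \big) q \Big)^2
&\leq 2\beta^{-2}\, \mathbb{E}\Big[ (p^\top \psi(s_t))^2 \cdot (\psi(s_t)^\top q)^2 \Big] + 2\beta^{-2}\, \mathbb{E}\Big[ (p^\top \psi(s_t))^2 \cdot (\psi(s_t^+)^\top q)^2 \Big] \\
&\leq 2\beta^{-2} \sqrt{\mathbb{E}(p^\top \psi(s_t))^4} \cdot \sqrt{\mathbb{E}(\psi(s_t)^\top q)^4} + 2\beta^{-2} \sqrt{\mathbb{E}(p^\top \psi(s_t))^4} \cdot \sqrt{\mathbb{E}(\psi(s_t^+)^\top q)^4} \\
&\leq \frac{4\varsigma^4}{\beta^2} \|B^{1/2} p\|_2^2 \cdot \|B^{1/2} q\|_2^2 \leq 4\varsigma^4,
\end{aligned}$$
and
$$\begin{aligned}
\mathbb{E}\Big( p^\top \big( \widetilde h_t - \widetilde h \big) \Big)^2
&\leq \beta^{-2}\, \mathbb{E}\Big[ R(s_t)^2 \cdot (p^\top \psi(s_t))^2 \Big] \\
&\leq \beta^{-2} \sqrt{\mathbb{E}[R(s_t)^4]} \cdot \sqrt{\mathbb{E}(p^\top B^{1/2} \phi(s_t))^4} \leq \varsigma^4/\beta.
\end{aligned}$$
Consequently, for the stochastic approximation procedure in equation (41a), Assumption 1(W) is satisfied with $\sigma_L = 2\varsigma^2$ and $\sigma_b = \varsigma^2/\sqrt{\beta}$. To establish an upper bound on $\kappa(\widetilde M)$, we note that
$$\begin{aligned}
1 - \kappa(\widetilde M) &= \frac{1}{\beta} \lambda_{\min}\Big( B - B^{1/2}\, \frac{M + M^\top}{2}\, B^{1/2} \Big) \\
&= \frac{1}{\beta} \inf_{u \in \mathbb{S}^{d-1}} (B^{1/2} u)^\top \Big( I_d - \frac{M + M^\top}{2} \Big) (B^{1/2} u) \\
&\geq \frac{\mu}{\beta} \inf_{u \in \mathbb{S}^{d-1}} u^\top \Big( I_d - \frac{M + M^\top}{2} \Big) u \geq \frac{\mu}{\beta} (1 - \kappa(M)).
\end{aligned}$$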

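Invoking Lemma 5, for $\eta < \frac{c_0 (1 - \kappa(M)) \mu}{(\varsigma^4 d + 1)\beta^2}$, we have
$$\mathbb{E}\|\vartheta_t - \bar\vartheta\|_2^2 \leq e^{-\frac{\mu}{2}(1 - \kappa(M))\eta t}\, \mathbb{E}\|\vartheta_0 - \bar\vartheta\|_2^2 + \frac{8\eta\beta}{(1 - \kappa(M))\mu} \Big( \|\bar\vartheta\|_2^2\, \varsigma^4 d + \varsigma^4 d/\beta \Big). \tag{89}$$
On the other hand, applying Lemma 6 to the stochastic approximation procedure (87) under the orthogonal coordinates, we have the bound
$$\begin{aligned}
\mathbb{E}\|\hat v_n - \bar v\|^2 \leq{}& \frac{6}{n - n_0}\, \mathrm{trace}\Big( (I - M)^{-1} \Sigma^* (I - M)^{-\top} \Big) \\
&+ \frac{6}{(n - n_0)^2} \sum_{t=n_0}^{n} \mathbb{E}\Big\| \big( I - B^{1/2} \widetilde M B^{-1/2} \big)^{-1} B^{1/2} \big( \widetilde M_{t+1} - \widetilde M \big) B^{-1/2} (\theta_t - \bar\theta) \Big\|_2^2 \\
&+ \frac{3\, \mathbb{E}\Big\| \big( I_d - B^{1/2} \widetilde M B^{-1/2} \big)^{-1} (\theta_n - \theta_{n_0}) \Big\|_2^2}{\eta^2 \beta^2 (n - n_0)^2}.
\end{aligned} \tag{90}$$
Straightforward calculation yields
$$\mathbb{E}\Big\| \big( I - B^{1/2} \widetilde M B^{-1/2} \big)^{-1} B^{1/2} \big( \widetilde M_{t+1} - \widetilde M \big) B^{-1/2} (\theta_t - \bar\theta) \Big\|_2^2 = \beta^2\, \mathbb{E}\Big\| (I - M)^{-1} B^{-1/2} \big( \widetilde M_{t+1} - \widetilde M \big) (\vartheta_t - \bar\vartheta) \Big\|_2^2.$$
For any vector $p \in \mathbb{R}^d$, using condition (42), we note that
$$\begin{aligned}
\mathbb{E}\big\| B^{-1/2} \big( \widetilde M_t - \widetilde M \big) p \big\|_2^2
&\leq 2\beta^{-2}\, \mathbb{E}\big\| \phi(s_t)\phi(s_t)^\top B^{1/2} p \big\|_2^2 + 2\beta^{-2}\, \mathbb{E}\big\| \phi(s_t)\phi(s_t^+)^\top B^{1/2} p \big\|_2^2 \\
&\leq 2\beta^{-2} \sqrt{\mathbb{E}\|\phi(s_t)\|_2^4} \cdot \sqrt{\mathbb{E}\big( \phi(s_t)^\top B^{1/2} p \big)^4} + 2\beta^{-2} \sqrt{\mathbb{E}\|\phi(s_t)\|_2^4} \cdot \sqrt{\mathbb{E}\big( \phi(s_t^+)^\top B^{1/2} p \big)^4} \\
&\leq 4\beta^{-1} \varsigma^4 d\, \|p\|_2^2.
\end{aligned}$$
Substituting into the identity above, we obtain
$$\mathbb{E}\Big\| \big( I - B^{1/2} \widetilde M B^{-1/2} \big)^{-1} B^{1/2} \big( \widetilde M_{t+1} - \widetilde M \big) B^{-1/2} (\theta_t - \bar\theta) \Big\|_2^2 \leq \frac{4\beta \varsigma^4 d}{(1 - \kappa(M))^2}\, \mathbb{E}\|\vartheta_t - \bar\vartheta\|_2^2.$$
For the third term in equation (90), we note that
$$\begin{aligned}
\mathbb{E}\Big\| \big( I_d - B^{1/2} \widetilde M B^{-1/2} \big)^{-1} (\theta_n - \theta_{n_0}) \Big\|_2^2
&= \beta^2\, \mathbb{E}\big\| (I - M)^{-1} B^{-1/2} (\vartheta_n - \vartheta_{n_0}) \big\|_2^2 \\
&\leq \frac{2\beta^2}{\mu (1 - \kappa(M))^2} \Big( \mathbb{E}\|\vartheta_n - \bar\vartheta\|_2^2 + \mathbb{E}\|\vartheta_{n_0} - \bar\vartheta\|_2^2 \Big).
\end{aligned}$$

Putting together the pieces and invoking the bound (89), we see that if $n_0 \geq \frac{c_0}{\mu \eta (1 - \kappa)} \log\Big( \frac{d\beta}{\mu (1 - \kappa)} \Big)$, then
$$\begin{aligned}
\mathbb{E}\|\hat v_n - \bar v\|^2 &\leq 6\, \mathcal{E}_n(M, \Sigma^*) + \Big[ \frac{24 \beta \varsigma^4 d}{(1 - \kappa(M))^2\, n} + \frac{48}{\mu (1 - \kappa(M))^2\, \eta^2 n^2} \Big] \cdot \sup_{n_0 \leq t \leq n} \mathbb{E}\|\vartheta_t - \bar\vartheta\|_2^2 \\
&\leq 6\, \mathcal{E}_n(M, \Sigma^*) + c\, \frac{\beta^3}{\mu^2 (1 - \kappa(M))^3} \Big( \frac{\varsigma^4 \eta d}{n} + \frac{1}{\eta \beta^2 n^2} \Big) \Big( \|\bar\vartheta\|_2^2\, \varsigma^4 d + \varsigma^4 d/\beta \Big).
\end{aligned}$$

Now note that $\|\bar\vartheta\|_2^2 = \|B^{-1/2} \bar\theta\|_2^2 \leq \mu^{-1} \|\bar v\|^2$, and so choosing the step size $\eta := \frac{1}{c_0 \varsigma^2 \beta \sqrt{dn}}$ yields
$$\mathbb{E}\|\hat v_n - \bar v\|^2 \leq 6\, \mathcal{E}_n(M, \Sigma^*) + c\, \frac{\beta^3 \varsigma^6}{\mu^3 (1 - \kappa(M))^3} \Big( \frac{d}{n} \Big)^{3/2}.$$
This completes the proof of equation (88b), and thus the corollary.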

C.2 Proof of Proposition 1

Our construction and proof are inspired by the proof of Theorem 2, with some crucial differences in the analysis that result from the specific noise model in the MRP setting. Letting $D$ and $d$ be integer multiples of four without loss of generality, we denote the state space by $\mathcal{S} = \{1, 2, \ldots, D\}$. We decompose the state space into $\mathcal{S} = \mathcal{S}_0 \cup \mathcal{S}_1 \cup \mathcal{S}_2$, with $\mathcal{S}_0 := \{1, 2, \ldots, 2d\}$, $\mathcal{S}_1 := \{2d + 1, \ldots, d + \tfrac{D}{2}\}$, and $\mathcal{S}_2 := \{d + \tfrac{D}{2} + 1, \ldots, D\}$. Define the scalars $\rho = \min(\gamma, \nu) \in (0, 1)$ and $\tau := \frac{\delta}{\sqrt{2(1 - \rho)}} \wedge 1$.

Figure 2. A graphical illustration of the MRP instance constructed above. For this instance, we let d = 1, |S1| = 4 and |S2| = 4, so that the total number of states is D = 10. In the graph, solid circles stand for states, and arrows stand for the possible transitions. The numbers associated with the arrows stand for the transition probabilities, and the equations r = ··· stand for the reward at a state. The sets S0, S1 and S2 are separated by red dotted lines, and the sets Γ1, Γ̄1, Γ2, and Γ̄2 are marked by transparent rectangles. A blue circle stands for a state with positive value function, and an orange circle stands for a state with negative value function.

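Given a sign $z \in \{-1, 1\}$ and subsets $\Gamma_1 \subseteq S_1$ and $\Gamma_2 \subseteq S_2$ such that $|\Gamma_i| = \tfrac{1}{2}|S_i|$ for each $i \in \{1, 2\}$, we let $\bar{\Gamma}_i := S_i \setminus \Gamma_i$ for $i \in \{1, 2\}$. We then construct Markov reward processes $(P^{(\Gamma_1,\Gamma_2,z)}, r^{(\Gamma_1,\Gamma_2,z)})$ and feature vectors $(\psi^{(\Gamma_1,\Gamma_2,z)}(s_i))_{i=1}^{D}$, indexed by the tuple $(\Gamma_1, \Gamma_2, z)$. Entry $(i, j)$ of the transition matrix is given by

$$
P^{(\Gamma_1,\Gamma_2,z)}(i, j) :=
\begin{cases}
\rho & i = j \in S_0, \\
\frac{1-\rho}{2} & i, j \in S_0,\ |i - j| = d, \\
\frac{1-\rho}{|S_1|} & (i, j) \in (\{1, \cdots, d\} \times \Gamma_1) \cup (\{d + 1, \cdots, 2d\} \times \bar{\Gamma}_1), \\
\frac{2}{|S_2|} & (i, j) \in (\Gamma_1 \times \Gamma_2) \cup (\bar{\Gamma}_1 \times \bar{\Gamma}_2), \\
\frac{1}{d} & (i, j) \in (\Gamma_2 \times \{1, 2, \cdots, d\}) \cup (\bar{\Gamma}_2 \times \{d + 1, \cdots, 2d\}), \\
0 & \text{otherwise.}
\end{cases}
\tag{91a}
$$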

The reward function at state $i$ is given by

$$
r^{(\Gamma_1,\Gamma_2,z)}(i) :=
\begin{cases}
z\tau & i \in \Gamma_1, \\
-z\tau & i \in \bar{\Gamma}_1, \\
0 & \text{otherwise.}
\end{cases}
\tag{91b}
$$

This MRP is illustrated in Figure 2 for convenience. It remains to specify the feature vectors, and we use the same set of features for each tuple $(\Gamma_1, \Gamma_2, z)$. The $i$-th such feature vector is given by

$$
\psi(i) :=
\begin{cases}
\sqrt{\frac{3-\rho}{2}\,d}\; e_i & i \in \{1, 2, \cdots, d\}, \\
-\sqrt{\frac{3-\rho}{2}\,d}\; e_{i-d} & i \in \{d + 1, \cdots, 2d\}, \\
0 & \text{otherwise.}
\end{cases}
\tag{91c}
$$

It is easy to see that for any tuple $(z, \Gamma_1, \Gamma_2)$, the Markov chain is irreducible and aperiodic, and furthermore, that the stationary distribution of the transition kernel $P^{(\Gamma_1,\Gamma_2,z)}$ is independent of the tuple $(\Gamma_1, \Gamma_2, z)$, and given by

$$
\xi = \Big[\underbrace{\tfrac{1}{(3-\rho)d}\ \cdots\ \tfrac{1}{(3-\rho)d}}_{2d}\ \ \underbrace{\tfrac{1-\rho}{(3-\rho)(D-2d)}\ \cdots\ \tfrac{1-\rho}{(3-\rho)(D-2d)}}_{D-2d}\Big].
$$

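Clearly, we have $\mathbb{E}_\xi[\psi(s)\psi(s)^\top] = I_d$ under the stationary distribution. For the projected transition kernel, we have

$$
\mathbb{E}\big[\psi(s)\psi(s^+)^\top\big] = \frac{3-\rho}{2}\cdot\Big(\rho - \frac{1-\rho}{2}\Big) I_d \preceq \rho I_d \preceq \nu I_d. \tag{92}
$$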
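The construction in equations (91a)-(91c) can be instantiated numerically for small parameters. The following is a minimal sketch, not the authors' code: the function names `build_instance` and `second_moment`, the 0-based state labels, and the particular choice of $\Gamma_1, \Gamma_2$ as the first halves of $S_1, S_2$ are all assumptions made for illustration. It only checks three elementary properties: each row of $P$ sums to one, the vector $\xi$ stated above sums to one, and the feature second-moment matrix under $\xi$ is the identity.

```python
import math

def build_instance(d, D, rho, z=1, tau=1.0):
    """Build (P, r, xi, psi) following (91a)-(91c), with 0-based state labels."""
    S1 = list(range(2 * d, d + D // 2))
    S2 = list(range(d + D // 2, D))
    half1, half2 = len(S1) // 2, len(S2) // 2
    G1, G1bar = S1[:half1], S1[half1:]  # one arbitrary choice of Gamma_1
    G2, G2bar = S2[:half2], S2[half2:]  # one arbitrary choice of Gamma_2

    P = [[0.0] * D for _ in range(D)]
    for i in range(2 * d):
        P[i][i] = rho                            # self-loop in S0
        P[i][(i + d) % (2 * d)] = (1 - rho) / 2  # jump to the paired state in S0
        for j in (G1 if i < d else G1bar):       # jump into S1
            P[i][j] = (1 - rho) / len(S1)
    for src, dst in ((G1, G2), (G1bar, G2bar)):  # S1 -> S2
        for i in src:
            for j in dst:
                P[i][j] = 2 / len(S2)
    for src, dst in ((G2, range(d)), (G2bar, range(d, 2 * d))):  # S2 -> S0
        for i in src:
            for j in dst:
                P[i][j] = 1 / d

    r = [0.0] * D
    for i in G1:
        r[i] = z * tau
    for i in G1bar:
        r[i] = -z * tau

    # The distribution xi exactly as stated in the text.
    xi = [1 / ((3 - rho) * d)] * (2 * d)
    xi += [(1 - rho) / ((3 - rho) * (D - 2 * d))] * (D - 2 * d)

    scale = math.sqrt((3 - rho) * d / 2)
    psi = [[0.0] * d for _ in range(D)]
    for i in range(d):
        psi[i][i] = scale
        psi[i + d][i] = -scale
    return P, r, xi, psi

def second_moment(xi, psi):
    """Return sum_s xi[s] * psi[s] psi[s]^T as a d x d nested list."""
    d = len(psi[0])
    return [[sum(w * v[a] * v[b] for w, v in zip(xi, psi)) for b in range(d)]
            for a in range(d)]
```

For example, `build_instance(2, 12, 0.75)` gives $|S_1| = |S_2| = 4$, and `second_moment(xi, psi)` returns the $2 \times 2$ identity up to floating-point error.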

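Given the discount factor $\gamma \in (0, 1)$, let $c_0 := \frac{(1-\rho)/2}{1 - \gamma\left(\rho - (1-\rho)(1-\gamma^2)/2\right)}$ for convenience. Straightforward calculation then yields that the value function for the problem instance $(P^{(\Gamma_1,\Gamma_2,z)}, r^{(\Gamma_1,\Gamma_2,z)})$ at state $i$ is given by

$$
v^*_{\Gamma_1,\Gamma_2,z}(i) =
\begin{cases}
c_0 z\tau & i \in \{1, 2, \cdots, d\}, \\
-c_0 z\tau & i \in \{d + 1, \cdots, 2d\}, \\
(1 + \gamma^2 c_0) z\tau & i \in \Gamma_1, \\
-(1 + \gamma^2 c_0) z\tau & i \in \bar{\Gamma}_1, \\
\gamma c_0 z\tau & i \in \Gamma_2, \\
-\gamma c_0 z\tau & i \in \bar{\Gamma}_2.
\end{cases}
$$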

For $\rho > 1/2$, we have the bounds

$$
c_0 \ge \frac{1}{4}\cdot\frac{1-\rho}{1-\gamma\rho} \ge \frac{1-\rho}{4(1-\rho^2)} \ge \frac{1}{8}, \quad\text{and}\quad c_0 \le \frac{1-\rho}{1-\gamma\rho} \le 1.
$$

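Consequently, we have $|v^*_{\Gamma_1,\Gamma_2,z}(i)| \asymp |v^*_{\Gamma_1,\Gamma_2,z}(j)|$ for each pair $(i, j)$.

Note that by our construction, the subspace $\mathbb{S}$ spanned by the basis functions $\psi(1), \psi(2), \cdots, \psi(2d)$ is given by

$$
\mathbb{S} = \big\{v \in L^2(\mathcal{S}, \xi) : v(s) = 0 \text{ for } s \notin S_0, \text{ and } v(i + d) = -v(i) \text{ for all } i \in [d]\big\}.
$$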

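Consequently, we have

$$
\inf_{v \in \mathbb{S}} \|v - v^*_{\Gamma_1,\Gamma_2,z}\|^2 = \frac{1-\rho}{3-\rho}\cdot\Big(\frac{1}{2}(1 + \gamma^2 c_0)^2\tau^2 + \frac{1}{2}\gamma^2 c_0^2\tau^2\Big) \le 2(1-\rho)\tau^2 = \delta^2. \tag{93}
$$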

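Putting the equations (92) and (93) together, for any tuple $(\Gamma_1, \Gamma_2, z)$, we conclude that the problem instance $(P^{(\Gamma_1,\Gamma_2,z)}, r^{(\Gamma_1,\Gamma_2,z)}, \gamma, \psi^{(\Gamma_1,\Gamma_2,z)})$ belongs to the class $C_{\mathrm{MRP}}(\nu, \gamma, D, \delta)$. In order to apply Le Cam's lemma, we define the following mixture distributions for each $z \in \{-1, 1\}$:

$$
\mathbb{P}_z^{(n)} := \binom{|S_1|}{|S_1|/2}^{-2} \sum_{\substack{\Gamma_1 \subseteq S_1,\ \Gamma_2 \subseteq S_2 \\ |\Gamma_1| = |\Gamma_2| = \frac{1}{2}|S_1|}} \mathbb{P}^{\otimes n}_{\Gamma_1,\Gamma_2,z},
$$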

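where $\mathbb{P}_{\Gamma_1,\Gamma_2,z}$ is the law of an observed tuple $(s_i, s_i^+, r(s_i))$ under the MRP $(P^{(\Gamma_1,\Gamma_2,z)}, \gamma, r^{(\Gamma_1,\Gamma_2,z)})$, and $\mathbb{P}^{\otimes n}_{\Gamma_1,\Gamma_2,z}$ denotes its $n$-fold product. Our next result gives a bound on the total variation distance.

Lemma 14. Under the set-up above, we have $d_{\mathrm{TV}}\big(\mathbb{P}_1^{(n)}, \mathbb{P}_{-1}^{(n)}\big) \le \frac{Cn^2}{D - 2d}$.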

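Taking this lemma as given, we now turn to the proof of the proposition. Consider any estimator $\widehat{v}$ for the value function. For any pair $(\Gamma_1, \Gamma_2)$ and $(\Gamma_1', \Gamma_2')$, we have

$$
\|\widehat{v} - v^*_{\Gamma_1,\Gamma_2,1}\|^2 + \|\widehat{v} - v^*_{\Gamma_1',\Gamma_2',-1}\|^2 \ge \frac{1}{2}\|v^*_{\Gamma_1,\Gamma_2,1} - v^*_{\Gamma_1',\Gamma_2',-1}\|^2 \ge \frac{1}{2}c_0^2\tau^2 \ge \frac{\delta^2}{64(1-\rho)}.
$$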

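Invoking Le Cam's lemma, for $D > 2C(n^2 + d)$, we have

$$
\inf_{\widehat{v}_n}\ \sup_{(P,\gamma,r,\psi) \in C_{\mathrm{MRP}}} \mathbb{E}\|\widehat{v}_n - v^*\|^2 \ge \frac{c}{1-\rho}\,\delta^2\Big(1 - d_{\mathrm{TV}}\big(\mathbb{P}_1^{(n)}, \mathbb{P}_{-1}^{(n)}\big)\Big) \ge \frac{c'}{1-\nu\gamma}\,\delta^2,
$$

which completes the proof.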
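The role of the condition $D > 2C(n^2 + d)$ can be spelled out as a short worked step. This is a sketch under the extra assumption (not stated in the text) that the constant $C$ from Lemma 14 satisfies $C \ge 1$, which can be arranged by enlarging $C$:

```latex
\begin{aligned}
D > 2C(n^2 + d)
  &\;\Longrightarrow\; D - 2d > 2Cn^2 + 2(C - 1)d \;\ge\; 2Cn^2 \\
  &\;\Longrightarrow\; d_{\mathrm{TV}}\big(\mathbb{P}_1^{(n)}, \mathbb{P}_{-1}^{(n)}\big)
      \;\le\; \frac{Cn^2}{D - 2d} \;<\; \frac{1}{2} \\
  &\;\Longrightarrow\; 1 - d_{\mathrm{TV}}\big(\mathbb{P}_1^{(n)}, \mathbb{P}_{-1}^{(n)}\big) \;>\; \frac{1}{2}.
\end{aligned}
```

So the total variation factor in the Le Cam bound costs at most a factor of two, which is absorbed into the universal constant.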

C.2.1 Proof of Lemma 14

In contrast to the proof of Theorem 2, the underlying mixing components here are indexed by random subsets of a given size, instead of random bits. This sampling procedure introduces additional dependency, so that the arguments in the proof of Theorem 2 do not directly apply. Instead, we use an induction-type argument by constructing the coupling directly.

Similarly to before, we construct a probability distribution $\mathbb{Q}^{(n)}$ and bound the total variation distance between $\mathbb{Q}^{(n)}$ and $\mathbb{P}_z^{(n)}$ for each $z \in \{-1, 1\}$. In particular, for $k \in [n]$, we let $\mathbb{Q}^{(k)}$ be the law of $k$ independent samples drawn from the following observation model:

• (Initial state:) Generate the state $s_i \sim \xi$.

• (Next state:) If $s_i \in S_1$, then generate $s_i^+ \sim \mathcal{U}(S_2)$. If $s_i \in S_2$, then generate $s_i^+ \sim \mathcal{U}(S_0)$. On the other hand, if $s_i \in S_0$, then generate $S \sim \mathcal{U}(S_1)$ and let

$$
s_i^+ =
\begin{cases}
s_i & \text{w.p. } \rho, \\
(s_i + d) \bmod 2d & \text{w.p. } \frac{1-\rho}{2}, \\
S & \text{w.p. } \frac{1-\rho}{2}.
\end{cases}
\tag{94}
$$

Here the expression $a \bmod b$ denotes the remainder of $a$ divided by $b$, when $a$ and $b$ are integers.

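• (Reward:) If $s_i \in S_1$, randomly draw $\zeta^{(i)} \sim \mathcal{U}(\{-1, 1\})$, and output $R_i = \zeta^{(i)}\tau$ as the reward. Otherwise, output the reward $R_i = 0$.

To bound the total variation distance $d_{\mathrm{TV}}(\mathbb{Q}^{(n)}, \mathbb{P}_z^{(n)})$, we use the following recursive relation, which holds for each $k = 0, 1, \cdots, n - 1$:

$$
d_{\mathrm{TV}}\big(\mathbb{Q}^{(k+1)}, \mathbb{P}_z^{(k+1)}\big) \le d_{\mathrm{TV}}\big(\mathbb{Q}^{(k)}, \mathbb{P}_z^{(k)}\big) + \sup_{(s_i, s_i^+, R_i)_{i=1}^k} d_{\mathrm{TV}}\Big(\mathbb{Q}^{(k+1)} \,\big|\, (s_i, s_i^+, R_i)_{i=1}^k,\ \mathbb{P}_z^{(k+1)} \,\big|\, (s_i, s_i^+, R_i)_{i=1}^k\Big). \tag{95}
$$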

Owing to the i.i.d. nature of the sampling model for $\mathbb{Q}^{(k+1)}$, note that we have the equivalence

$$
(s_{k+1}, s_{k+1}^+, R_{k+1}) \,\big|\, (s_i, s_i^+, R_i)_{i=1}^k \;\overset{d}{=}\; (s_{k+1}, s_{k+1}^+, R_{k+1}).
$$

At this juncture, it is helpful to view the probability distributions $\mathbb{P}_1^{(k)}$ and $\mathbb{P}_{-1}^{(k)}$ via the following two-step sampling procedure: First, for $j \in \{1, 2\}$, sample the subsets $\Gamma_j \subseteq S_j$ uniformly at random from the collection of all subsets of size $|S_j|/2$. Then, generate $k$ i.i.d. samples $(s_i, s_i^+, R_i)_{i=1}^k$ according to the observation model (91a)-(91b). Consequently, for the rest of this proof, we view $\Gamma_1$ and $\Gamma_2$ as random sets. With this equivalence at hand, the following technical lemma shows that the posterior distribution of the subsets $(\Gamma_1, \Gamma_2)$ conditioned on sampling the tuple $(s_i, s_i^+, R_i)_{i=1}^k$ is very close to the distribution of subsets chosen uniformly at random.

Lemma 15. There is a universal positive constant $c$ such that for each bit $z \in \{\pm 1\}$ and indices $j \in \{1, 2\}$ and $k \in [n]$, the following statement is true almost surely. For each tuple $(s_i, s_i^+, R_i)_{i=1}^k$ in the support of $\mathbb{P}_z^{(k)}$, the posterior distribution of $\Gamma_j$ conditioned on $(s_i, s_i^+, R_i)_{i=1}^k \sim \mathbb{P}_z^{(k)}$ satisfies

$$
\max_{s \in S_j \setminus \cup_{i=1}^k \{s_i, s_i^+\}} \Big|\mathbb{P}\big(\Gamma_j \ni s \,\big|\, (s_i, s_i^+, R_i)_{i=1}^k\big) - \frac{1}{2}\Big| \le \frac{ck}{D - d}.
$$

In words, for any "observable" tuple $(s_i, s_i^+, R_i)_{i=1}^k$ and each state $s \in S_j \setminus \cup_{i=1}^k \{s_i, s_i^+\}$, the posterior probability of the event $\{\Gamma_j \ni s\}$ conditioned on observing the tuple $(s_i, s_i^+, R_i)_{i=1}^k$ is close to $1/2$ provided $D - d$ is large relative to $k$.

In addition to the sets $\Gamma_j$, $j = 1, 2$ being close to uniformly random, we also require the following analog of a "birthday-paradox" argument in this setting. For convenience, we let $T_k := \bigcup_{i=1}^k \{s_i, s_i^+\}$ denote the subset of states seen up until sample $k$.

Lemma 16. There is a universal positive constant $c$ such that for each $k \in [n]$ and each distribution $\mathbb{M}^{(k+1)} \in \big\{\mathbb{P}_1^{(k+1)}, \mathbb{P}_{-1}^{(k+1)}, \mathbb{Q}^{(k+1)}\big\}$, the following statement holds almost surely. For each tuple $(s_i, s_i^+, R_i)_{i=1}^{k+1}$ in the support of $\mathbb{M}^{(k+1)}$, the conditional law of the pair of states $(s_{k+1}, s_{k+1}^+)$ given $(s_i, s_i^+, R_i)_{i=1}^k \sim \mathbb{M}^{(k)}$ satisfies

$$
\mathbb{P}\Big(\underbrace{\{s_{k+1}, s_{k+1}^+\} \cap T_k \cap (S_1 \cup S_2) \ne \emptyset}_{=:\ \mathcal{E}^{(1)}_{k+1}} \,\Big|\, (s_i, s_i^+, R_i)_{i=1}^k\Big) \le \frac{ck}{D - d}. \tag{96}
$$

In words, Lemma 16 ensures that if $D - d$ is large relative to $k$, then the states seen in sample $k + 1$ are different from those seen up until that point (provided we only count states in the set $S_1 \cup S_2$). Lemmas 15 and 16 are both proved at the end of this section; we take them as given for the rest of this proof.

Now consider tuples $(s_{k+1}, s_{k+1}^+, R_{k+1}) \sim \mathbb{P}_z^{(k+1)} \,|\, (s_i, s_i^+, R_i)_{i=1}^k$ and $(\widetilde{s}_{k+1}, \widetilde{s}_{k+1}^+, \widetilde{R}_{k+1}) \sim \mathbb{Q}^{(k+1)} \,|\, (s_i, s_i^+, R_i)_{i=1}^k$; we will now construct a coupling between these two tuples in order to show that the total variation distance between the respective laws is small. First, note that under both $\mathbb{P}_z^{(k+1)}$ and $\mathbb{Q}^{(k+1)}$, the initial state is drawn from the stationary distribution, i.e., $s_{k+1}, \widetilde{s}_{k+1} \sim \xi$, regardless of the sequence $(s_i, s_i^+, R_i)_{i=1}^k$. We can therefore couple the two conditional laws together so that $s_{k+1} = \widetilde{s}_{k+1}$ almost surely. To construct the coupling for the rest, we consider the following three cases:

Coupling on the event $s_{k+1} \in S_0$: We begin by coupling the reward random variables; we have $R_{k+1} = \widetilde{R}_{k+1} = 0$ under both conditional distributions, so this component of the distribution can be coupled trivially. Next, we couple the next state: By construction of the observation models (91a) and (94), we have

$$
\begin{aligned}
\mathbb{P}\big(s_{k+1}^+ = s_{k+1} \,\big|\, s_{k+1}\big) &= \mathbb{P}\big(\widetilde{s}_{k+1}^+ = \widetilde{s}_{k+1} \,\big|\, \widetilde{s}_{k+1}\big) = \rho, \quad\text{and} \\
\mathbb{P}\big(s_{k+1}^+ = (s_{k+1} + d) \bmod 2d \,\big|\, s_{k+1}\big) &= \mathbb{P}\big(\widetilde{s}_{k+1}^+ = (\widetilde{s}_{k+1} + d) \bmod 2d \,\big|\, \widetilde{s}_{k+1}\big) = \frac{1-\rho}{2},
\end{aligned}
$$

and so these two components of the distribution can be coupled trivially. It remains to handle the case where $s_{k+1} \in S_0$ and $s_{k+1}^+ \in S_1$. By the symmetry of elements within the set $S_1$, we note that on the event $(\mathcal{E}^{(1)}_{k+1})^C$, both random variables $\widetilde{s}_{k+1}^+$ and $s_{k+1}^+$ are uniformly distributed on the set $S_1 \setminus T_k$. Consequently, on the event $(\mathcal{E}^{(1)}_{k+1})^C$, we can couple the conditional laws so that $s_{k+1}^+ = \widetilde{s}_{k+1}^+$ almost surely.

Coupling on the event $s_{k+1} \in S_1$: As before, we begin by coupling the rewards, but first, note that on the event $(\mathcal{E}^{(1)}_{k+1})^C$, we have $s_{k+1} \in S_1 \setminus T_k$. Invoking Lemma 15, under $\mathbb{P}_z^{(k)}$ and conditionally on the value of $s_{k+1}$, we have the bound

$$
\Big|\mathbb{P}\big(s_{k+1} \in \Gamma_1 \,\big|\, (s_i, s_i^+, R_i)_{i=1}^k\big) - \frac{1}{2}\Big| \le \frac{ck}{D - d}.
$$

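Now the reward function (91b) satisfies $r(s) = z\tau$ for $s \in \Gamma_1$ and $r(s) = -z\tau$ for $s \in \bar{\Gamma}_1$. On the other hand, under $\mathbb{Q}^{(k+1)}$, the reward $\widetilde{R}_{k+1}$ takes the values $\tau$ and $-\tau$, each with probability one half. Consequently, there exists a coupling between $R_{k+1}$ and $\widetilde{R}_{k+1}$, such that

$$
\mathbb{P}\Big(\underbrace{R_{k+1} \ne \widetilde{R}_{k+1},\ s_{k+1} \in S_1}_{=:\ \mathcal{E}^{(2)}_{k+1}} \,\Big|\, (s_i, s_i^+, R_i)_{i=1}^k\Big) \le \frac{ck}{D - d}.
$$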

Next, we construct the coupling for the next-step transition conditionally on the current step. By the symmetry of elements within the set $S_2$, we note that under $(\mathcal{E}^{(1)}_{k+1})^C$, both random variables $\widetilde{s}_{k+1}^+$ and $s_{k+1}^+$ are uniformly distributed on the set $S_2 \setminus T_k$. Consequently, on the event $(\mathcal{E}^{(1)}_{k+1})^C$, we can couple the conditional laws so that $s_{k+1}^+ = \widetilde{s}_{k+1}^+$ almost surely.

Coupling on the event $s_{k+1} \in S_2$: In this case, we have $R_{k+1} = \widetilde{R}_{k+1} = 0$ under both conditional distributions, so this coupling is once again trivial. It remains to construct a coupling between the next-step transitions $s_{k+1}^+$ and $\widetilde{s}_{k+1}^+$. On the event $(\mathcal{E}^{(1)}_{k+1})^C$, we have $s_{k+1} \in S_2 \setminus T_k$. Under $\mathbb{P}_z^{(k)}$ and conditionally on the value of $s_{k+1}$, Lemma 15 leads to the bound

$$
\Big|\mathbb{P}\big(s_{k+1} \in \Gamma_2 \,\big|\, (s_i, s_i^+, R_i)_{i=1}^k\big) - \frac{1}{2}\Big| \le \frac{ck}{D - d}.
$$

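By definition, under $\mathbb{P}_z^{(n)}$, we have that $s_{k+1}^+ \sim \mathcal{U}(\{1, 2, \cdots, d\})$ when $s_{k+1} \in \Gamma_2$, and $s_{k+1}^+ \sim \mathcal{U}(\{d + 1, \cdots, 2d\})$ when $s_{k+1} \in \bar{\Gamma}_2$. Under $\mathbb{Q}^{(n)}$, we have $\widetilde{s}_{k+1}^+ \sim \mathcal{U}(\{1, 2, \cdots, 2d\})$. Consequently, there exists a coupling such that

$$
\mathbb{P}\Big(\underbrace{s_{k+1}^+ \ne \widetilde{s}_{k+1}^+,\ s_{k+1} \in S_2}_{=:\ \mathcal{E}^{(3)}_{k+1}} \,\Big|\, (s_i, s_i^+, R_i)_{i=1}^k\Big) \le \frac{ck}{D - d}.
$$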

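Putting together our bounds from the three cases, note that for any sequence $(s_i, s_i^+, R_i)_{i=1}^k$ on the support of $\mathbb{Q}^{(k)}$ and $\mathbb{P}_z^{(k)}$, we almost surely have

$$
d_{\mathrm{TV}}\Big(\mathcal{L}\big((s_{k+1}, s_{k+1}^+, R_{k+1}) \,\big|\, (s_i, s_i^+, R_i)_{i=1}^k\big),\ \mathcal{L}\big((\widetilde{s}_{k+1}, \widetilde{s}_{k+1}^+, \widetilde{R}_{k+1}) \,\big|\, (s_i, s_i^+, R_i)_{i=1}^k\big)\Big) \le \sum_{j=1}^{3} \mathbb{P}\Big(\mathcal{E}^{(j)}_{k+1} \,\Big|\, (s_i, s_i^+, R_i)_{i=1}^k\Big) \le \frac{c'k}{D - d},
$$

where the final inequality follows from applying Lemma 16. Substituting into the recursion (95), we conclude that for any $z \in \{-1, 1\}$, we have

$$
d_{\mathrm{TV}}\big(\mathbb{Q}^{(n)}, \mathbb{P}_z^{(n)}\big) \le \sum_{k=0}^{n-1}\sum_{j=1}^{3} \sup_{(s_i, s_i^+, R_i)_{i=1}^k} \mathbb{P}\Big(\mathcal{E}^{(j)}_{k+1} \,\Big|\, (s_i, s_i^+, R_i)_{i=1}^k\Big) \le \frac{c'n^2}{D - d},
$$

which completes the proof of this lemma. It remains to prove the two helper lemmas.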
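For the case $s_{k+1} \in S_2$ in the proof above, the existence of the coupling rests on a standard fact: two laws can be coupled so that they disagree with probability equal to their total variation distance. The sketch below computes that distance exactly for the two next-state laws. The function names `tv_distance` and `next_state_laws` and the parameter `post` (the posterior probability that $s_{k+1} \in \Gamma_2$) are illustrative assumptions, not notation from the paper.

```python
from fractions import Fraction

def tv_distance(p, q):
    """Total variation distance between two pmfs given as dicts."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0) - q.get(k, 0)) for k in keys) / 2

def next_state_laws(d, post):
    """Next-state laws given s_{k+1} in S_2.

    mix : with probability `post` uniform on {1..d}, else uniform on {d+1..2d}
          (the conditional law under the mixture P_z)
    unif: uniform on {1..2d} (the law under Q)
    """
    mix = {j: post / d for j in range(1, d + 1)}
    mix.update({j: (1 - post) / d for j in range(d + 1, 2 * d + 1)})
    unif = {j: Fraction(1, 2 * d) for j in range(1, 2 * d + 1)}
    return mix, unif
```

With `post = Fraction(3, 5)` and `d = 4`, the distance evaluates to `Fraction(1, 10)`, that is, $|\texttt{post} - 1/2|$. This is why a posterior within $ck/(D-d)$ of $1/2$ yields a coupling that fails with probability at most $ck/(D-d)$.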

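Proof of Lemma 15: Given $z \in \{\pm 1\}$, we define the sets

$$
\begin{aligned}
Z_1 &:= \{s_i : i \in [k],\ s_i \in S_1,\ R_i = z\tau\}, & \bar{Z}_1 &:= \big(\{s_i\}_{i \in [k]} \cap S_1\big) \setminus Z_1, \quad\text{and} \\
Z_2 &:= \{s_i : i \in [k],\ s_i \in S_2,\ s_i^+ \in [d]\}, & \bar{Z}_2 &:= \big(\{s_i\}_{i \in [k]} \cap S_2\big) \setminus Z_2.
\end{aligned}
$$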

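By the reward model (91b) in our construction, for any valid pair of subsets $(\Gamma_1, \Gamma_2)$, under the law $\mathbb{P}^{\otimes k}_{\Gamma_1,\Gamma_2,z}$, the observations $(s_i, s_i^+, R_i)_{i=1}^k$ have positive probability if and only if $Z_1 \subseteq \Gamma_1$ and $\Gamma_1 \cap \bar{Z}_1 = \emptyset$. Furthermore, by the symmetry between the elements in $\Gamma_1$, for any $\Gamma_1$ such that $Z_1 \subseteq \Gamma_1$ and $\Gamma_1 \cap \bar{Z}_1 = \emptyset$, the probability of observing $(s_i, s_i^+, R_i)_{i=1}^k$ under $\mathbb{P}^{\otimes k}_{\Gamma_1,\Gamma_2,z}$ is independent of the choice of $\Gamma_1$. Consequently, the probability under the mixture distribution $\mathbb{P}_z^{(k)}$ can be calculated as

$$
\begin{aligned}
\mathbb{P}\big(\Gamma_1 \ni s \,\big|\, (s_i, s_i^+, R_i)_{i=1}^k\big)
&= \sum_{\substack{s \in \Gamma' \\ |\Gamma'| = |S_1|/2}} \frac{\mathbb{P}\big((s_i, s_i^+, R_i)_{i=1}^k \,\big|\, \Gamma_1 = \Gamma'\big)\cdot\mathbb{P}(\Gamma_1 = \Gamma')}{\mathbb{P}\big((s_i, s_i^+, R_i)_{i=1}^k\big)} \\
&= \frac{\big|\big\{\Gamma' \subseteq S_1 : |\Gamma'| = \tfrac{1}{2}|S_1|,\ Z_1 \subseteq \Gamma',\ \bar{Z}_1 \cap \Gamma' = \emptyset,\ s \in \Gamma'\big\}\big|}{\big|\big\{\Gamma' \subseteq S_1 : |\Gamma'| = \tfrac{1}{2}|S_1|,\ Z_1 \subseteq \Gamma',\ \bar{Z}_1 \cap \Gamma' = \emptyset\big\}\big|} \\
&= \binom{|S_1| - |Z_1| - |\bar{Z}_1|}{|S_1|/2 - |Z_1|}^{-1}\binom{|S_1| - |Z_1| - |\bar{Z}_1| - 1}{|S_1|/2 - |Z_1| - 1} \\
&= \frac{|S_1|/2 - |Z_1|}{|S_1| - |Z_1| - |\bar{Z}_1|}.
\end{aligned}
$$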
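The counting identity above can be checked by brute force on a small instance. The following is a minimal sketch: the function names `posterior_by_enumeration` and `posterior_closed_form`, and the use of `range(size)` as a stand-in for $S_1$, are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import combinations

def posterior_by_enumeration(size, Z, Zbar, s):
    """P(s in Gamma_1 | data): enumerate all half-size subsets of
    range(size) that contain Z and avoid Zbar, then count those containing s."""
    consistent = [set(G) for G in combinations(range(size), size // 2)
                  if Z <= set(G) and not (Zbar & set(G))]
    hits = sum(1 for G in consistent if s in G)
    return Fraction(hits, len(consistent))

def posterior_closed_form(size, Z, Zbar):
    """Closed form (|S_1|/2 - |Z_1|) / (|S_1| - |Z_1| - |Zbar_1|)."""
    return Fraction(size // 2 - len(Z), size - len(Z) - len(Zbar))
```

For example, with `size = 8`, `Z = {0}`, `Zbar = {1, 2}` and the unobserved state `s = 5`, both functions return `Fraction(3, 5)`: ten half-size subsets are consistent with the data, and six of them contain `s`.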

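By definition, we have $|Z_1| + |\bar{Z}_1| \le k$, and $|S_1| = \frac{D - 2d}{2}$. For $D \ge d + 8k$, this yields

$$
\Big|\mathbb{P}\big(\Gamma_1 \ni s \,\big|\, (s_i, s_i^+, R_i)_{i=1}^k\big) - \frac{1}{2}\Big| \le \frac{4k}{D - 2d} \le \frac{8k}{D - d}.
$$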

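Similarly, by the transition model (91a) in our construction, for any $\Gamma_2 \subseteq S_2$ with $|\Gamma_2| = \tfrac{1}{2}|S_2|$, under the law $\mathbb{P}^{\otimes k}_{\Gamma_1,\Gamma_2,z}$, the observations $(s_i, s_i^+, R_i)_{i=1}^k$ have positive probability if and only if $Z_2 \subseteq \Gamma_2$ and $\Gamma_2 \cap \bar{Z}_2 = \emptyset$. Following exactly the same calculation as above, we arrive at the bound

$$
\Big|\mathbb{P}\big(\Gamma_2 \ni s \,\big|\, (s_i, s_i^+, R_i)_{i=1}^k\big) - \frac{1}{2}\Big| \le \frac{8k}{D - d},
$$

as desired.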

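Proof of Lemma 16: Under the conditional distribution $\mathbb{M}^{(k+1)} \,|\, (s_i, s_i^+, R_i)_{i=1}^k$, for each $s \in S_1 \cup S_2$, we have

$$
\mathbb{P}(s_{k+1} = s) \le \frac{2}{|S_1|}, \quad\text{and}\quad \mathbb{P}\big(s_{k+1}^+ = s\big) \le \frac{2}{|S_1|}.
$$

Applying a union bound, we arrive at the inequality

$$
\begin{aligned}
\mathbb{P}\Big(\mathcal{E}^{(1)}_{k+1} \,\Big|\, (s_i, s_i^+, R_i)_{i=1}^k\Big)
&\le \sum_{\substack{i \in [k] \\ s_i \in S_1 \cup S_2}} \Big[\mathbb{P}(s_{k+1} = s_i) + \mathbb{P}\big(s_{k+1}^+ = s_i\big)\Big] + \sum_{\substack{i \in [k] \\ s_i^+ \in S_1 \cup S_2}} \Big[\mathbb{P}\big(s_{k+1} = s_i^+\big) + \mathbb{P}\big(s_{k+1}^+ = s_i^+\big)\Big] \\
&\le \frac{8k}{|S_1|} \le \frac{32k}{D - d},
\end{aligned}
$$

which completes the proof.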

C.3 Proof for elliptic equations

In this section, we prove the results for the elliptic equation example in Section 2.2.2.

C.3.1 Technical results from Section 2.2.2

The main technical result that was assumed in Section 2.2.2 is collected as a lemma below.

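Lemma 17. There exists a bounded, self-adjoint, linear operator $\widetilde{A} \in \mathfrak{L}$ and a function $g \in \dot{\mathbb{H}}^1$, such that, for all $u, v \in \dot{\mathbb{H}}^1$,

$$
\langle u, \widetilde{A}v\rangle_{\dot{\mathbb{H}}^1} = \langle u, Av\rangle_{L^2}, \tag{97a}
$$

$$
\langle u, g\rangle_{\dot{\mathbb{H}}^1} = \langle u, f\rangle_{L^2}, \tag{97b}
$$

and such that

$$
\mu\|u\|_{\dot{\mathbb{H}}^1}^2 \le \langle u, \widetilde{A}u\rangle_{\dot{\mathbb{H}}^1} \le \beta\|u\|_{\dot{\mathbb{H}}^1}^2. \tag{97c}
$$

We prove the three claims in turn.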

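Proof of equation (97a): For any pair of test functions $u, v \in \dot{\mathbb{H}}^1$, integration by parts and the uniform ellipticity condition yield

$$
\langle u, Av\rangle_{L^2} = -\int_\Omega u(x)\,\nabla\cdot\big(a(x)\nabla v(x)\big)\,dx = \int_\Omega \nabla u^\top a\,\nabla v\,dx \le \beta\|u\|_{\dot{\mathbb{H}}^1}\cdot\|v\|_{\dot{\mathbb{H}}^1}.
$$

Now given a fixed function $v \in \dot{\mathbb{H}}^1$, the above equation ensures that $\langle\cdot, Av\rangle_{L^2}$ is a bounded linear functional. By the Riesz representation theorem, there exists a unique function $v' \in \dot{\mathbb{H}}^1$ with $\|v'\|_{\dot{\mathbb{H}}^1} \le \beta\|v\|_{\dot{\mathbb{H}}^1}$, such that

$$
\forall u \in \dot{\mathbb{H}}^1, \quad \langle u, Av\rangle_{L^2} = \langle u, v'\rangle_{\dot{\mathbb{H}}^1}.
$$

Clearly, the mapping from $v$ to $v'$ is linear, and we have $\|v'\|_{\dot{\mathbb{H}}^1} \le \beta\|v\|_{\dot{\mathbb{H}}^1}$ for any $v \in \dot{\mathbb{H}}^1$. Thus, the mapping $v \mapsto v'$ is a bounded linear operator. Using $\widetilde{A}$ to denote this operator, equation (97a) then directly follows. It remains to verify that $\widetilde{A}$ is self-adjoint. Indeed, for $u, v \in \dot{\mathbb{H}}^1$, we have the identity

$$
\langle u, \widetilde{A}v\rangle_{\dot{\mathbb{H}}^1} = \langle u, Av\rangle_{L^2} = \int_\Omega \nabla u^\top a\,\nabla v\,dx = -\int_\Omega v\,\nabla\cdot(a\nabla u)\,dx = \langle v, Au\rangle_{L^2} = \langle v, \widetilde{A}u\rangle_{\dot{\mathbb{H}}^1},
$$

which proves the self-adjoint property.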

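Proof of equation (97b): Since the domain $\Omega$ is bounded and connected, there exists a constant $\rho_P$ depending only on $\Omega$, such that the following Poincaré inequality holds:

$$
\forall v \in \dot{\mathbb{H}}^1, \quad \|v\|_{L^2}^2 \le \frac{1}{\rho_P}\|v\|_{\dot{\mathbb{H}}^1}^2. \tag{98}
$$

For any test function $u \in \dot{\mathbb{H}}^1$, equation (98) leads to the bound

$$
\langle u, f\rangle_{L^2} \le \|f\|_{L^2}\cdot\|u\|_{L^2} \le \frac{1}{\rho_P}\|f\|_{L^2}\cdot\|u\|_{\dot{\mathbb{H}}^1}.
$$

So $\langle\cdot, f\rangle_{L^2}$ is a bounded linear functional on $\dot{\mathbb{H}}^1$. Again, by the Riesz representation theorem, there exists a unique $g \in \dot{\mathbb{H}}^1$, such that $\langle u, f\rangle_{L^2} = \langle u, g\rangle_{\dot{\mathbb{H}}^1}$ for all $u \in \dot{\mathbb{H}}^1$, which completes the proof.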

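Proof of equation (97c): For any test function $u \in \dot{\mathbb{H}}^1$, we note that

$$
\langle u, \widetilde{A}u\rangle_{\dot{\mathbb{H}}^1} = \langle u, Au\rangle_{L^2} = \int_\Omega \nabla u(x)^\top a(x)\,\nabla u(x)\,dx.
$$

From our uniform ellipticity condition, we know that $\mu I_m \preceq a(x) \preceq \beta I_m$ for any $x \in \Omega$. Substituting this relation yields $\mu\|u\|_{\dot{\mathbb{H}}^1}^2 \le \langle u, \widetilde{A}u\rangle_{\dot{\mathbb{H}}^1} \le \beta\|u\|_{\dot{\mathbb{H}}^1}^2$, as claimed.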

C.3.2 Proof of Corollary 4

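The matrices $M$, $\Sigma_L$, $\Sigma_b$ for the projected problem instances can be obtained by straightforward calculation. In order to apply Theorem 1, it remains to verify the assumptions. By Lemma 17, the operator $L$ is self-adjoint in $\mathbb{X}$, and is sandwiched as $0 \le \langle u, Lu\rangle_{\dot{\mathbb{H}}^1} \le \big(1 - \tfrac{\mu}{\beta}\big)\|u\|_{\dot{\mathbb{H}}^1}^2$ for all $u \in \mathbb{X}$. This yields the operator norm bound $|||L|||_{\mathbb{X}} \le 1 - \tfrac{\mu}{\beta}$.

Now we verify the conditions in Assumption 1(W). For any basis function $\phi_j$ with $j \in [d]$ and function $u \in \dot{\mathbb{H}}^1$, we have

$$
\begin{aligned}
\mathbb{E}\langle\phi_j, (L_i - L)u\rangle_{\dot{\mathbb{H}}^1}^2
&\le \frac{1}{\beta^2}\,\mathbb{E}\Big(\int_\Omega \delta_{x_i}\nabla\phi_j(x)^\top a(x_i)\nabla u(x)\,dx\Big)^2 + \frac{1}{\beta^2}\,\mathbb{E}\Big(\int_\Omega \delta_{x_i}\nabla\phi_j(x)^\top W_i\nabla u(x)\,dx\Big)^2 \\
&= \frac{1}{\beta^2}\int_\Omega \big(\nabla\phi_j(x)^\top a(x)\nabla u(x)\big)^2 dx + \frac{1}{\beta^2}\,\mathbb{E}\int_\Omega \big(\nabla\phi_j(x)^\top W_i\nabla u(x)\big)^2 dx \\
&\le \frac{1}{\beta^2}\int_\Omega \|\nabla\phi_j(x)\|_2^2\,|||a(x)|||_{\mathrm{op}}^2\,\|\nabla u(x)\|_2^2\,dx + \frac{2}{\beta^2}\int_\Omega \|\nabla\phi_j(x)\|_2^2\cdot\|\nabla u(x)\|_2^2\,dx \\
&\le \Big(1 + \frac{2}{\beta^2}\Big)\max_{j \in [d]}\sup_{x \in \Omega}\|\nabla\phi_j\|_2^2\cdot\int_\Omega \|\nabla u(x)\|_2^2\,dx \\
&\le \sigma_L^2\,\|u\|_{\dot{\mathbb{H}}^1}^2,
\end{aligned}
$$

and

$$
\begin{aligned}
\mathbb{E}\langle\phi_j, b_i - b\rangle_{\dot{\mathbb{H}}^1}^2
&\le \frac{1}{\beta^2}\,\mathbb{E}\Big(\int_\Omega \delta_{y_i}\phi_j(y)f(y_i)\,dy\Big)^2 + \frac{1}{\beta^2}\,\mathbb{E}\Big(\int_\Omega \delta_{y_i}\phi_j(y)g_i\,dy\Big)^2 \\
&= \frac{1}{\beta^2}\int_\Omega \phi_j(y)^2 f(y)^2\,dy + \frac{1}{\beta^2}\int_\Omega \phi_j(y)^2\,\mathbb{E}[g_i^2]\,dy \\
&\le \frac{1}{\beta^2}\sup_{x \in \Omega}|\phi_j|^2\int_\Omega \big(f(y)^2 + 1\big)\,dy \\
&= \frac{\|f\|_{L^2}^2 + 1}{\beta^2}\max_{j \in [d]}\sup_{x \in \Omega}|\phi_j|^2.
\end{aligned}
$$

Therefore, Assumption 1(W) is satisfied with constants $(\sigma_L, \sigma_b)$. Invoking Theorem 1 completes the proof.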

D Models underlying simulations in Figure 1

The simulation results shown in Figure 1 are generated by constructing random transition matrices based on the following random graph models:

Erdős–Rényi random graph: Given positive integers $d, N$ and a scalar $a > 1$, we consider the following sampling procedure. Let $G$ be an Erdős–Rényi random graph with $N$ vertices and edge probability $p = \frac{a}{N}$, and take $\widetilde{G}$ to be its largest connected component. (When $a > 1$, the number of vertices in $\widetilde{G}$ is of order $\Theta(N)$. See the monograph [Dur07] for details.) For each vertex $v \in V(\widetilde{G})$, we associate it with an independent standard Gaussian random vector $\phi_v \sim \mathcal{N}(0, I_d)$. Let $V(\widetilde{G})$ be the state space and let the Markov transition kernel $P$ be the simple random walk on $\widetilde{G}$. In Figure 1 (a), we take the number of vertices to be $N = 3000$ and the feature dimension to be $d = 1000$. The edge density parameter is chosen as $a = 3$. The resulting giant connected component contains 2813 vertices.

Random geometric graph: Given a pair of positive integers $(d, N)$ and a scalar $r > 0$, we consider the following sampling procedure. For each vertex $i \in [N]$, we associate it with an independent standard Gaussian random vector $\phi_i \sim \mathcal{N}(0, I_d)$. The graph $G$ is then constructed such that $(i, j) \in E(G)$ if and only if $\|\phi_i - \phi_j\|_2 \le r$. (See the monograph [Pen03] for more details of this random graph model.) Take $\widetilde{G}$ to be the largest connected component of $G$. We take $V(\widetilde{G})$ as the state space, and let $P$ be the simple random walk on $\widetilde{G}$. In Figure 1 (b), we take the number of vertices to be $N = 3000$ and the feature dimension to be $d = 2$. The distance threshold is chosen as $r = 0.1$. The resulting giant connected component contains 2338 vertices.

Despite their simplicity, the two random graph models capture distinct types of behavior of the resulting random walk in feature space: in the former model, the transition kernel makes "big jumps" in the feature space, and the correlation between two consecutive states is small; in the latter model, the transition kernel makes "local moves" in the feature space, leading to large correlation.
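The Erdős–Rényi sampling procedure described above can be sketched in a few lines of standard-library Python. This is an illustrative reconstruction under stated assumptions, not the code used for Figure 1: the function names `largest_component` and `erdos_renyi_walk`, the `seed` argument, and the dictionary representation of the transition kernel are all hypothetical choices.

```python
import random
from collections import deque

def largest_component(adj):
    """Return the sorted vertex list of the largest connected component (BFS)."""
    seen, best = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = [start], deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    comp.append(w)
                    queue.append(w)
        if len(comp) > len(best):
            best = comp
    return sorted(best)

def erdos_renyi_walk(N, a, d, seed=0):
    """Sample G(N, a/N), keep its largest component, and return
    (states, P, features): P[u][w] is the simple-random-walk probability
    of moving from u to w, and features[u] is a standard Gaussian d-vector."""
    rng = random.Random(seed)
    p = a / N
    adj = {v: set() for v in range(N)}
    for u in range(N):
        for w in range(u + 1, N):
            if rng.random() < p:  # each edge is present independently
                adj[u].add(w)
                adj[w].add(u)
    states = largest_component(adj)
    # Simple random walk: move to a uniformly chosen neighbour.
    P = {u: {w: 1.0 / len(adj[u]) for w in adj[u]} for u in states}
    features = {u: [rng.gauss(0.0, 1.0) for _ in range(d)] for u in states}
    return states, P, features
```

A call such as `erdos_renyi_walk(200, 3.0, 5, seed=1)` runs quickly. The random geometric graph model differs only in the edge rule: sample the features first and connect $i$ and $j$ when $\|\phi_i - \phi_j\|_2 \le r$.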
