<<

Instance-dependent `∞-bounds for policy evaluation in tabular reinforcement learning

Ashwin Pananjady? and Martin J. Wainwright†,‡

?Simons Institute for the Theory of Computing, UC Berkeley †Departments of EECS and , UC Berkeley ‡Voleon Group, Berkeley

September 17, 2020

Abstract

Markov reward processes (MRPs) are used to model stochastic phenomena arising in opera- tions research, control engineering, robotics, and artificial intelligence, as well as communication and transportation networks. In many of these cases, such as in the policy evaluation problem encountered in reinforcement learning, the goal is to estimate the long-term value function of such a process without access to the underlying population transition and reward functions. Working with samples generated under the synchronous model, we study the problem of esti- mating the value function of an infinite-horizon, discounted MRP on finitely many states in the `∞-norm. We analyze both the standard plug-in approach to this problem and a more robust variant, and establish non-asymptotic bounds that depend on the (unknown) problem instance, as well as -dependent bounds that can be evaluated based on the observations of state- transitions and rewards. We show that these approaches are minimax-optimal up to constant factors over natural sub-classes of MRPs. Our analysis makes use of a leave-one-out decoupling argument tailored to the policy evaluation problem, one which may be of independent interest.

1 Introduction

A variety of applications spanning science and engineering use Markov reward processes as models for real-world phenomena, including queuing systems, transportation networks, robotic exploration, game playing, and . In some of these settings, the underlying parameters that govern the process are known to the modeler, but in others, these must be estimated from observed data. A salient example of the latter setting, which forms the main motivation for this paper, is the policy evaluation problem encountered in Markov decision processes (MDPs) and reinforcement

arXiv:1909.08749v2 [stat.ML] 15 Sep 2020 learning [Ber95a, Ber95b, SB18]. Here an agent operates in an environment whose dynamics are unknown: at each step, it observes the current state of the environment, and takes an action that changes its state according to some stochastic transition function determined by the environment. The goal is to evaluate the utility of some policy—that is, a mapping from states to actions, where utility is measured using rewards that the agent receives from the environment. These rewards are usually assumed to be additive over time, and since the policy determines the action to be taken at each state, the reward obtained at any time is simply a function of the current state of the agent. Thus, this setting induces a Markov reward process (MRP) on the state space, in which both the underlying transitions and rewards are unknown to the agent. The agent only observes samples of state transitions and rewards.

1 Given these samples, the goal of the agent is to estimate the value function of the MRP. As noted above, in the context of Markov decision processes (MDPs), this problem is known as policy evaluation. The value function evaluated at a given state measures the expected long-term reward accumulated by starting at that state and running the underlying Markov chain. In applications, this value function encodes crucial information about the MRP. For example, there are MRPs in which the value function corresponds to the probability of a power grid failing [FMP08], the taxi times of flights in an airport [BGSL08], or the value of a board configuration in a game of Go [SSM07]. Moreover, policy evaluation is an important component of many policy optimization algorithms for reinforcement learning, which use it as a sub-routine while searching for good policies to deploy in the environment. The focus of this paper is on understanding the policy evaluation problem in finite-state (or tabular) MRPs in an instance-dependent manner, focusing on the the generative setting in which the agent has access to a simulator that generates samples from the underlying MRP. In particular, we would like guarantees on the sample complexity of policy evaluation—defined as the number of samples required to obtain a value function estimate of some pre-specified error tolerance—as a function of the agent’s environment, i.e., the transition and reward functions induced by the policy being evaluated. Local guarantees of this form provide more guidance for algorithm design in finite sample settings than their worst-case counterparts. Indeed, this viewpoint underpins the important sub-field of local minimax complexity studied widely in the statistics and optimization literatures (e.g., [CL04, ZCDL16, WW20]), as well as in more recent work on online reinforcement learning algorithms [ZB19]. As a natural first step towards providing local guarantees for the policy evaluation problem, we analyze the plug-in estimator for the problem, which estimates the underlying transition and reward functions from the samples, and outputs the value function of the MRP in which these estimates correspond to the ground truth parameters. We also analyze a robust variant of this approach, and provide minimax lower bounds that hold over subsets of the parameter space.

Related work: Markov reward processes have a rich history originating in the theory of Markov chains and renewal processes; we refer the reader to the classical books [Fel66] and [Dur99] for introductions to the subject. The policy evaluation problem has seen considerable interest in the stochastic control and reinforcement learning communities, and various algorithms have been analyzed in both asymptotic [Bor98, Tad04] and non-asymptotic [LS18, SY19] settings. Chapter 3 of the monograph by Szepesv´ari[Sze09] provides a brief introduction to these methods, and the recent survey by Dann et al. [DNP14] focuses on methods based on temporal differences [Sut88]. In the language of temporal difference (TD) algorithms, the plug-in approach that we analyze corresponds to the temporal difference (LSTD) solution [BB96] in the tabular setting, without function approximation. While TD algorithms for policy evaluation have been analyzed by many previous papers, their focus is typically either on (i) how function approximation affects the algorithm [TVR97], (ii) asymptotic convergence guarantees [Bor98, Tad04] or (iii) establishing convergence rates in metrics of the `2-type [Tad04, LS18, SY19]. Since `2-type metrics can be associated with an inner product, many specialized analyses can be ported over from the literature on (e.g., [BM11, NJLS09]).1 On the other hand, our focus is on providing non-asymptotic guarantees in the `∞-error metric, since these are particularly compatible with

1Here we have only referenced some representative papers; see the references in Szepesv´ari[Sze09] for a broader overview.

2 Problem Algorithm Paper Model Sample-size Guarantee Technique

[KS99], [Kak03] Synchronous Non-asymptotic Global, ` Hoeffding State-action Plug-in ∞ [AMK13] Synchronous Non-asymptotic Global, ` Bernstein value ∞ Global, estimation Stochastic [BM00] Synchronous Asymptotic ODE method conv. in dist. in MDPs approximation: Local, Asymptotic Q-learning & [DM17] Synchronous Asymptotic conv. in dist. normality variants [Wai19b], Bernstein, Synchronous Non-asymptotic Local, ` [CMSS20] ∞ Moreau envelope [AMGK11], Bernstein, [SWWY18], Synchronous Non-asymptotic Global, ` ∞ reduction [Wai19c] Optimal [AMK13] Synchronous Non-asymptotic Global, ` Bernstein Plug-in ∞ value Bernstein + [AKY20] Synchronous Non-asymptotic Global, ` estimation ∞ decoupling in MDPs Stochastic Bernstein + [SWW+18] Synchronous Non-asymptotic Global, ` approximation ∞ variance reduction Bernstein + Plug-in Current paper Synchronous Non-asymptotic Local, ` ∞ leave-one-out Policy [Tad04], [PJ92], Synchronous, Local, ` and Averaging, evaluation Stochastic [JJS94], [BM00], Asymptotic 2 trajectories conv. in dist. ODE method in MRPs approximation: [DM17] TD-learning [LS18], [BRS18], Synchronous, Averaging, Non-asymptotic Global, ` [SY19] trajectories 2 martingales Global oracle [TVR97], inequality Asymptotic TD-learning Trajectories Asymptotic [UKM+08] Local, normality with function conv. in dist. approximation [BRS18], Synchronous, Global and Population to [DMR19], Non-asymptotic trajectories local, ` sample [DSTM18] 2 of Current paper Synchronous Non-asymptotic Local, ` Robustness ∞ Table 1. A subset of results in the tabular and infinite-horizon discounted setting, both for pol- icy evaluation in MRPs and policy optimization in MDPs. For a broader overview of results, see Gosavi [Gos09] for the setting of infinite-horizon average reward, and Dann and Brunskill [DB15] for the episodic setting. The “technique” vertical of the table is only meant to showcase a representative subset of those employed. Our contributions are highlighted in red.

3 policy iteration methods. In particular, policy iteration can be shown to converge at a geometric rate when combined with policy evaluation methods that are accurate in `∞-norm (e.g., see the books [AJK19, BT96]). Also, given that we are interested in fine-grained, instance-dependent guarantees, we first study the problem without function approximation. As briefly alluded to before, there has also been some recent focus on obtaining instance- dependent guarantees in online reinforcement learning settings [MMM14, SJ19, ZKB19]. These analyses have led to more practically applicable algorithms for certain episodic MDPs [ZB19, JA18] that improve upon worst-case bounds [AOM17]. Recent work has also established some instance- dependent bounds for the problem of state-action value function estimation in Markov decision processes, for both ordinary Q-learning [Wai19b] and a variance-reduced improvement [Wai19c]. However, we currently lack the localized lower bounds that would allow us to understand the fundamental limits of the problem in a more local sense, except in some special cases for asymptotic settings; for instance, see Ueno et al. [UKM+08] and Devraj and Meyn [DM17] for bounds of this type for LSTD and stochastic approximation, respectively. We hope that our analysis of the simpler policy evaluation problem will be useful in broadening the scope of such guarantees. Portions of our analysis exploit a decoupling that is induced by a leave-one-out technique. We note that leave-one-out techniques are frequently used in probabilistic analysis (e.g., [BE02, dlPG12, MWCC18]). In the context of Markov processes, arguments that are related to but distinct from those in this paper have been used in analyzing estimates of the stationary distribution of a Markov chain [CFMW19], and for analyzing optimal policies in reinforcement learning [AKY20]. For the reader’s convenience, we have collected many of the relevant results both in policy optimization and evaluation in Table 1, along with the settings and sample-size regimes in which they apply, the nature of the guarantee, and the salient techniques used.

Contributions: We study the problem of estimating the infinite-horizon, discounted value func- tion of a tabular MRP in `∞-norm, assuming access to state transitions and reward samples under the . Our first main result, Theorem 1, analyzes the plug-in estimator, showing two types of guarantees: on one hand, we derive high-probability upper bounds on the error that can be computed based on the observed data, and on the other, we show upper bounds that depend on the underlying (unknown) population transition matrix and reward function. The latter result is achieved via a decoupling argument that we expect to be more broadly applicable to problems of this type. Corollary 1 then specializes the population-based result in Theorem 1 to natural sub-classes of MRPs. Theorem 2 provides minimax lower bounds for these sub-classes, showing—in conjunction with Corollary 1—that the plug-in approach is minimax optimal over the class of MRPs with uniformly bounded reward functions. However, these results suggest that the plug-in approach is not minimax-optimal over the class of MRPs having value functions with bounded variance under the transition model (this notion is defined precisely in Section 3 to follow). Consequently, we analyze an approach based on the median-of-means device and show that this modified estimator is minimax optimal over the class of MRPs having value functions with bounded variance. The benefits of our instance-dependent guarantees are even evident in a model as simple as the 3-state MRP illustrated in Figure 1. Suppose that we observe noiseless rewards of this MRP and wish to compute its infinite-horizon value function with discount factor γ ∈ (0, 1). Bounds based on the contractivity of the Bellman operator [KS99, Kak03, Wai19b] imply that the `∞-error of the plug-in estimate scales proportionally to 1/(1−γ)2. The worst-case bounds of Azar et al. [AMK13]

4 1/2 1/2

AAAB6nicbVBNS8NAEJ34WetX1aOXxSJ4qkkV9Fj04rGi/YA2lM120y7dbMLuRCihP8GLB0W8+ou8+W/ctjlo64OBx3szzMwLEikMuu63s7K6tr6xWdgqbu/s7u2XDg6bJk414w0Wy1i3A2q4FIo3UKDk7URzGgWSt4LR7dRvPXFtRKwecZxwP6IDJULBKFrpwTuv9kplt+LOQJaJl5My5Kj3Sl/dfszSiCtkkhrT8dwE/YxqFEzySbGbGp5QNqID3rFU0YgbP5udOiGnVumTMNa2FJKZ+nsio5Ex4yiwnRHFoVn0puJ/XifF8NrPhEpS5IrNF4WpJBiT6d+kLzRnKMeWUKaFvZWwIdWUoU2naEPwFl9eJs1qxbuoVO8vy7WbPI4CHMMJnIEHV1CDO6hDAxgM4Ble4c2Rzovz7nzMW1ecfOYI/sD5/AFZWY0u AAAB6nicbVBNS8NAEJ34WetX1aOXxSJ4qkkV9Fj04rGi/YA2lM120y7dbMLuRCihP8GLB0W8+ou8+W/ctjlo64OBx3szzMwLEikMuu63s7K6tr6xWdgqbu/s7u2XDg6bJk414w0Wy1i3A2q4FIo3UKDk7URzGgWSt4LR7dRvPXFtRKwecZxwP6IDJULBKFrpwTuv9kplt+LOQJaJl5My5Kj3Sl/dfszSiCtkkhrT8dwE/YxqFEzySbGbGp5QNqID3rFU0YgbP5udOiGnVumTMNa2FJKZ+nsio5Ex4yiwnRHFoVn0puJ/XifF8NrPhEpS5IrNF4WpJBiT6d+kLzRnKMeWUKaFvZWwIdWUoU2naEPwFl9eJs1qxbuoVO8vy7WbPI4CHMMJnIEHV1CDO6hDAxgM4Ble4c2Rzovz7nzMW1ecfOYI/sD5/AFZWY0u

1 1

AAAB7HicbVBNS8NAEJ3Ur1q/qh69LBbBU0mqoBeh6MVjBdMW2lA220m7dLMJuxuhlP4GLx4U8eoP8ua/cdvmoK0PBh7vzTAzL0wF18Z1v53C2vrG5lZxu7Szu7d/UD48auokUwx9lohEtUOqUXCJvuFGYDtVSONQYCsc3c381hMqzRP5aMYpBjEdSB5xRo2VfEVuiNcrV9yqOwdZJV5OKpCj0St/dfsJy2KUhgmqdcdzUxNMqDKcCZyWupnGlLIRHWDHUklj1MFkfuyUnFmlT6JE2ZKGzNXfExMaaz2OQ9sZUzPUy95M/M/rZCa6DiZcpplByRaLokwQk5DZ56TPFTIjxpZQpri9lbAhVZQZm0/JhuAtv7xKmrWqd1GtPVxW6rd5HEU4gVM4Bw+uoA730AAfGHB4hld4c6Tz4rw7H4vWgpPPHMMfOJ8/fKON0A== rAAAB7HicbVBNS8NAEJ3Ur1q/qh69LBbBU0mqoBeh6MVjBdMW2lA220m7dLMJuxuhlP4GLx4U8eoP8ua/cdvmoK0PBh7vzTAzL0wF18Z1v53C2vrG5lZxu7Szu7d/UD48auokUwx9lohEtUOqUXCJvuFGYDtVSONQYCsc3c381hMqzRP5aMYpBjEdSB5xRo2VfEVuiNsrV9yqOwdZJV5OKpCj0St/dfsJy2KUhgmqdcdzUxNMqDKcCZyWupnGlLIRHWDHUklj1MFkfuyUnFmlT6JE2ZKGzNXfExMaaz2OQ9sZUzPUy95M/M/rZCa6DiZcpplByRaLokwQk5DZ56TPFTIjxpZQpri9lbAhVZQZm0/JhuAtv7xKmrWqd1GtPVxW6rd5HEU4gVM4Bw+uoA730AAfGHB4hld4c6Tz4rw7H4vWgpPPHMMfOJ8/ex+Nzw== =0 r =1/2 r =1

AAAB7nicbVBNS8NAEJ34WetX1aOXxSJ4qkkV9CIUvXisYD+gDWWznbRLN5uwuxFK6I/w4kERr/4eb/4bt20O2vpg4PHeDDPzgkRwbVz321lZXVvf2CxsFbd3dvf2SweHTR2nimGDxSJW7YBqFFxiw3AjsJ0opFEgsBWM7qZ+6wmV5rF8NOME/YgOJA85o8ZKLUVuiHde7ZXKbsWdgSwTLydlyFHvlb66/ZilEUrDBNW647mJ8TOqDGcCJ8VuqjGhbEQH2LFU0gi1n83OnZBTq/RJGCtb0pCZ+nsio5HW4yiwnRE1Q73oTcX/vE5qwms/4zJJDUo2XxSmgpiYTH8nfa6QGTG2hDLF7a2EDamizNiEijYEb/HlZdKsVryLSvXhsly7zeMowDGcwBl4cAU1uIc6NIDBCJ7hFd6cxHlx3p2PeeuKk88cwR84nz9b/45F Figure 1: A simple 3-state Markov reward process. imply a rate 1/(1 − γ)3/2. But the optimal local result captured in this paper shows that the error is only proportional to 1/(1−γ). For a discount factor γ = 0.99, this improves the previous bounds by factors of 100 and 10, respectively, and consequently, the respective sample complexities by factors of 104 and 102. Instance-dependent results therefore allow us to differentiate problems that are “solvable” with finite samples from those that are not.

Notation: For a positive integer n, let [n] : = {1, 2, . . . , n}. For a finite set S, we use |S| to denote its cardinality. We use c, C, c1, c2,... to denote universal constants that may change from line to line. We use the convenient shorthand a ∨ b : = max{a, b} and a ∧ b = min{a, b}. We let 1 D denote the all-ones vector in R , and abusing notation slightly, we let 1 {E} denote the indicator D of an event E. Let ej denote the jth standard basis vector in R . We let v(i) denote the i-th order of a vector v, i.e., the i-th largest entry of v. For a pair of vectors (u, v) of compatible dimensions, we use the notation u  v to indicate that the difference vector v − u is entry-wise non-negative. The relation u  v is defined analogously. We let |u| denote the entry-wise absolute D value of a vector u ∈ R ; squares and square-roots of vectors are, analogously, taken entrywise. Note that for a positive scalar λ, the statements |u|  λ · 1 and kuk∞ ≤ λ are equivalent. Finally, we let kMk1,∞ denote the maximum `1-norm of the rows of a matrix M, and refer to it as the (1, ∞)-operator norm of a matrix.

2 Background and Problem Formulation

In this section, we introduce the basic notation required to specify a Markov reward process, and formally define the problem of estimating value functions in the generative setting.

2.1 Markov reward processes and value functions We study Markov reward processes defined on a finite set X of states, indexed as X = {1, 2,...,D}. The state evolution over time is determined by a set of transition functions {P ( · | x), x ∈ X , with the transition from state x to the next state being randomly chosen according to the distribution P ( · | x). For notational convenience, we let P ∈ [0, 1]D×D denote a row stochastic (Markov) transition matrix, where row j of this matrix—which we denote by pj—collects the transition function of the j-th state. Also associated with an MRP is a population reward function r : X 7→ R: transitioning from state x results in the reward r(x). For convenience, we engage in a minor abuse of notation by letting r also denote a vector of length D, with rj corresponding to the reward obtained at the j-th state. In this paper, we consider the infinite-horizon, discounted reward as our notion for the long-term value of a state in the MRP. In particular, for a scalar discount factor γ ∈ (0, 1), the long-term

5 value of state x in the MRP is given by " # ∞ ∗ X k θ (x) : = E γ r(xk) x0 = x , where xk ∼ P ( · | xk−1) for all k ≥ 1. k=0 In words, this measures the expected discounted reward obtained by starting at the state x, where the expectation is taken with respect to the random transitions over states. Once again, we use θ∗ ∗ to also denote a vector of length D, where θj corresponds to the value of the j-th state. A note to the reader: in the sequel, we often reference a state simply by its index, and often refer to the state space X ≡ [D]. Accordingly, we also use P ( · | j) to denote the transition function corresponding to state j ∈ [D].

2.2 Observation model Given access to the true transition and reward functions, it is straightforward, at least in principle, to compute the value function. By definition, it is the unique solution of the Bellman fixed point relation

θ∗ = r + γPθ∗. (1)

In the learning setting, the pair (P, r) is unknown, and we instead assume access to a black box that generates samples from the transition and reward functions. In this paper, we operate under a setting known as the synchronous or generative setting; it is a stylized observation model that has been used extensively in the study of Markov decision processes (see Kearns and Singh [KS99] for an introduction). Let us introduce it in the context of MRPs: for a given sample index k = 1, 2,...,N and for each state j ∈ [D], we observe a random next state Xk,j ∈ [D] drawn according to the transition function P ( · | j), and a random reward Rk,j drawn from a conditional distribution Dr( · | j). Throughout, we assume that the rewards are generated independently across states, with E[Rk,j] = rj. Letting ρ(r) denote a non-negative vector indexed by the states j ∈ [D], we assume the conditional distributions {Dr( · | j), j ∈ [D]} are ρ(r)-sub-Gaussian, meaning that for each j ∈ [D], we have

2 2 h i λ ρj (r) λ(R−rj ) 2 ER∼Dr( · |j) e ≤ e for all λ ∈ R. (2)

∗ With N such i.i.d. samples in hand, our goal is to estimate the value function θ in the `∞-error metric. Such a goal is particularly relevant to the policy evaluation problem described in the introduc- tion, since `∞-estimates of the value function can be used in conjunction with a policy improvement sub-routine to eventually arrive at an optimal policy (see, e.g., Section 1.2.2. of the recent mono- graph [AJK19]). We note in passing that bounds proved under the generative model may be translated into the more challenging online setting via the notion of Markov cover times (see, e.g., the papers [EDM03, AMGK11] for conversions of this type for Markov decision processes).

3 Main results

We now turn to the statement and discussion of our main results. We begin by providing `∞-guarantees on value function estimation for the natural plug-in approach.

6 3.1 Guarantees for the plug-in approach

A natural approach to this problem is use the observations to construct estimates (Pb, rb) of the pair (P, r), and then substitute or “plug in” these estimates into the Bellman equation, thereby obtaining the value function of the MRP having transition matrix Pb and reward vector rb. In order to define the plug-in estimator, let us introduce some helpful notation. For each time index k, we use the associated set of state samples {Xk,j, j ∈ [D]} to form a random binary matrix D×D Zk ∈ {0, 1} , in which row j has a single non-zero entry, determined by the sample Xk,j. Thus, the location of the non-zero entry in row j is drawn from the defined by pj, N the j-th row of P. Recall that our observations also include the stochastic reward vectors {Rk}k=1 sampled from the reward distribution Dr. Based on these observations, we define the sample means

N N 1 X 1 X Pb = Zk and r = Rk, (3) N b N k=1 k=1 which can be seen as unbiased estimates of the transition matrix P and the reward vector r, respectively. The estimates (Pb, rb) define a new MRP, and its value function is given by the fixed point relation

θbplug = rb+ γPbθbplug. (4)

−1 Solving this fixed point equation, we obtain the closed form expression θbplug = (I − γPb) rb for the plug-in estimator. Note that the terminology “plug-in” arises the fact that θbplug is obtained by substituting the estimates (Pb, rb) into the original Bellman equation (1). We also note that in this special case—that is, the tabular setting without function approximation—the plug-in estimate is equivalent to the LSTD solution [BB96, Boy02]. In order to establish guarantees for the estimator θbplug, we require some additional notation. As mentioned before, we are interested in non-asymptotic, instance-dependent guarantees of two types: the first is a bound that can be evaluated in practice from the observed data, and the second is a D guarantee that depends on the unknown population quantities P and r. For each vector θ ∈ R , define the vector of empirical

2 2 σb (θ) = Eb (Z − Pb)θ , where Eb denotes expectation over the empirical distribution (i.e., the random matrix Z is drawn N uniformly at random from the set {Zk}k=1). Note that given θ, this quantity is computable purely from the observed samples. On the other hand, the population result will involve the population variance vector

2 2 σ (θ) = E |(Z − P)θ| , where in this case Z is drawn according to the population model P. As a final definition, the span semi-norm of a value function θ is given by

kθkspan : = max θ(x) − min θ(x). x∈X x∈X

7 D Equivalently, the span semi-norm is equal to the variation of the vector θ ∈ R ; see Puter- man [Put05] for more details.

We now ready to state our main result for the plug-in estimator.

γ2 Theorem 1. There is a pair of universal constants (c1, c2) such that if N ≥ c1 (1−γ)2 log(8D/δ), then each of the following statements holds with probability at least 1 − δ.

(a) We have

(r   ) ∗ log(8D/δ) −1 kρ(r)k∞ log(8D/δ) γkθbplugkspan kθbplug − θ k∞ ≤ c2 γk(I − γPb) σ(θbplug)k∞ + + · . N b 1 − γ N 1 − γ (5a)

(b) We have

(r   ∗ ) ∗ log(8D/δ) −1 ∗ kρ(r)k∞ log(8D/δ) γkθ kspan kθbplug − θ k∞ ≤ c2 γ (I − γP) σ(θ ) + + · . N ∞ 1 − γ N 1 − γ (5b)

It is worth making a few comments on this theorem, which provides two instance-dependent upper bounds on the error of the plug-in approach. Assuming for simplicity of discussion2 that the maximum noise reward parameter kρ(r)k∞ is known, then part (a) of the theorem provides a bound that can be evaluated based on the observed data; bounds of this form are especially useful in downstream analyses. For instance, a central consideration in policy iteration methods is to ∗ obtain “good enough” value function estimates θb for fixed policies, in that we have kθb− θ k∞ ≤  for some prescribed tolerance . Theorem 1(a) provides a method by which such a bound may be verified for the plug-in approach: compute the statistic on the RHS of bound (5a); if this is less ∗ than , then the bound kθbplug − θ k∞ ≤  holds with probability exceeding 1 − δ. On the other hand, Theorem 1(b) provides a guarantee that depends on the unknown problem instance. From the perspective of the analysis, this is the more difficult bound to establish, since it requires a leave-one-out technique to decouple dependencies between the estimate θbplug and the matrix Pb. We expect our technique—presented in full in Section 5.2—and its variants to be more broadly useful in analyzing other problems in reinforcement learning besides the policy evaluation problem considered here. c1 Third, note that our lower bound on the sample size—which evaluates to N ≥ (1−γ)2 log(8D/δ) for any strictly positive discount factor—is unavoidable in general. In particular, for any fixed reward-noise parameter kρ(r)k∞ > 0, this condition is required in order to obtain a consistent estimate of the value function.3 On the other hand, in the special case of deterministic rewards (kρ(r)k∞ = 0), we suspect that this condition can be weakened, but leave this for future work.

2We note that when ρ(r) is not known but the reward distribution is (say) Gaussian, it is straightforward to provide an entry-wise upper bound for it by computing the empirical of rewards from samples, and using this to define a high-probability and data-dependent bound on the sub-Gaussian parameter. 3For instance, even with known transition dynamics, estimating the value function of a single state to within  1  additive error  requires Ω (1−γ)22 samples of the noisy reward.

8 Finally, it is worth noting that there are two terms in the bounds of Theorem 1: the first term corresponds to a notion of standard deviations of the estimated/true value function and reward, and the second depends on the span semi-norm of the value function. Are both of these terms necessary? What is the optimal rate at which any value function can be estimated? These questions motivate the analysis to be presented in the following section.

3.2 Is the plug-in approach optimal? In order to study the question of optimality, we adopt the notion of local minimax risk, in which the performance of an estimator is measured in a worst-case sense locally over natural subsets of the parameter space. Our upper bounds depend on the problem instance via the standard deviation function σ(θ∗), the reward standard deviation ρ(r), and the span semi-norm of θ∗. Accordingly, we define the following subsets4 of Markov reward processes (MRPs):

n ∗ o Mvar(ϑ, %) : = set of all MRPs s.t. kσ(θ )k∞ ≤ ϑ and kρ(r)k∞ ≤ % , (6a)

n ∗ o Mvfun(ζ, %) : = set of all MRPs s.t. kθ kspan ≤ ζ and kρ(r)k∞ ≤ % , and (6b) n o Mrew(rmax,%) : = set of all MRPs s.t. krk∞ ≤ rmax and kρ(r)k∞ ≤ % . (6c)

Letting M be any one of these sets, we use the shorthand θ ∈ M to that θ is the value function of some MRP in the set M. Each choice of the set M defines the local minimax risk given by

h ∗ i inf sup E kθb− θ k∞ , θb θ∗∈M where the infimum ranges over all measurable functions θb of N observations from the generative model. With this set-up, we can now state some lower bounds in terms of such local minimax risks:

1 Theorem 2. There is a pair of absolute constants (c1, c2) such that for all γ ∈ [ 2 , 1) and sample c1 sizes N ≥ 1−γ log(D/2), the following statements hold. √ (a) For each triple of positive scalars (ϑ, ζ, %) satisfying5 ϑ ≤ ζ 1 − γ, we have r h ∗ i c2 log(D/2) inf sup E kθb− θ k∞ ≥ (ϑ + %) . (7a) ∗ θb θ ∈Mvar(ϑ,%)∩Mvfun(ζ,%) 1 − γ N

q log D (b) For each pair of positive scalars (rmax,%) satisfying rmax ≥ % N , we have

r   h ∗ i c2 log(D/2) rmax inf sup E kθb− θ k∞ ≥ + % . (7b) ∗ 1/2 θb θ ∈Mrew(rmax,%) 1 − γ N (1 − γ) 4The following mnemonic device may help the reader appreciate and remember notation: the symbol ϑ, or “vartheta”, stands for a measure of the variability in the value function θ; the symbol %, or “varrho”, represents the variability in reward samples, and rmax represents the maximum absolute reward mean. 5We conjecture that this lower bound can be proved under the weaker condition ϑ ≤ ζ, thereby the condition present in Corollary 1(a).

9 Equipped with these lower bounds, we can now assess the local minimax optimality of the plug-in estimator. In order to facilitate this comparison, let us state a corollary of Theorem 1 that provides bounds on the worst-case error of the plug-in estimator over particular subsets of the parameter space. In order to further simplify the comparison, we restrict our attention to the 1 γ ∈ [ 2 , 1) covered by the lower bounds.

1 6 Corollary 1. There are absolute constants (c3, c4) such that for all γ ∈ [ 2 , 1) and sample sizes c3 N ≥ (1−γ)2 log(8D/δ), the following statements hold.

(a) Consider a triple of positive scalars (ϑ, ζ, %) such that7 ϑ ≤ ζ. Then for any value function ∗ θ ∈ Mvar(ϑ, %) ∩ Mvfun(ζ, %), we have

(r ) ∗ c4 log(8D/δ) log(8D/δ) kθbplug − θ k∞ ≤ (ϑ + %) + · ζ (8a) 1 − γ N N

with probability at least 1 − δ.

∗ (b) Consider an arbitrary pair of positive scalars (rmax,%). Then for any value function θ ∈ Mrew(rmax,%), we have

r   ∗ c4 log(8D/δ) rmax kθbplug − θ k∞ ≤ + % (8b) 1 − γ N (1 − γ)1/2

with probability at least 1 − δ.

By comparing Corollary 1(b) with Theorem 2(b), we see that the plug-in estimator is minimax optimal (up to constant factors) over the class Mrew(rmax,%). This conclusion parallels that of Azar et al. [AMK13] for the related problem of optimal state-value function estimation in MDPs. (In our notation, their work applies to the special case of % = 0, but their analysis can easily be extended to this more general setting.) A comparison of part (a) of the two results is more interesting. Here we see that the first term in the upper bound (8a) matches the lower bound (7a) up to a constant factor. The second term of inequality (8a), however, does not have an analogous component in the lower bound, and this leads us to the interesting question of whether the analysis of the plug-in estimator can be sharpened so ∗ as to remove the dependence of the error on the span semi-norm kθ kspan. Proposition 1, presented in Appendix A, shows that this is impossible in general, and that there are MRPs in which the `∞ error can be lower bounded by a term that is proportional to the span semi-norm. This raises another natural question: Is there a different estimator whose error can be bounded ∗ independently of the span semi-norm kθ kspan, and which is able achieve the lower bound (7a)? In the next section, we introduce such an estimator via a median-of-means device.

6As shown in the proof, part (a) of the corollary holds without this assumption on the sample size, but we state it here to facilitate a direct derivation of Corollary 1 from Theorem 1. 7It is worth noting that the condition ϑ ≤ ζ in part (a) of the corollary does not entail any loss of generality, since ∗ ∗ ∗ ∗ we always have kσ(θ )k∞ ≤ kθ kspan. Indeed, for MRPs in which kσ(θ )k∞  kθ kspan, the second term on the RHS of inequality (8a) will dominate the bound unless the sample size N is large.

10 3.3 Closing the gap via the median-of-means method In many situations, the span semi-norm of a value function θ∗ may be much larger its variance σ(θ∗) under the transition model. Such a discrepancy arises when there are states with extremely large positive (or negative) rewards that are visited with very low probability. In such cases, the second terms in the bounds (5) dominate the first. It is thus of interest to derive bounds that are purely “variance-dependent” and independent of the span norm. In order to do so, we analyze a slight variant of the plug-in approach. In particular, we analyze the median-of-means estimator, which is a standard robust alternative to the sample mean in other scenarios [NY83, LL20]. In the context of reinforcement learning, Pazis et al. [PPH16] made use of it for online policy optimization in MDPs. In our setting, we only employ median-of-means to obtain a better estimate of term depending on the transition matrix; we still use the estimate rb defined in equation (3) as our estimate of 8 N D the reward function. Given the data set {Zk}k=1 and some vector θ ∈ R , the median-of-means estimate Mc(θ) of the population expectation Pθ is given by the following nonlinear operation:

• First, split the data set into K equal parts denoted {D1,..., DK }, where each subset Di has size m = bN/Kc.

Second, compute the empirical mean µ (θ) : = 1 P Z θ for each i ∈ [K]. • bi m k∈Di k

• Finally, return the quantity Mc(θ) : = med(µb1(θ),..., µbK (θ)), where the median—defined for convenience as the bK/2c-th —is taken entry-wise.

The random operator Mc defines the median-of-means empirical Bellman operator, given by

MoM TbN (θ) : = rb+ γMc(θ). (9)

As shown in Lemma 6 (see Section 5), this operator is γ-contractive in the `∞-norm. Consequently, it has a unique fixed point, which we term the median-of-means value function estimate, denoted by θbMoM. In practice, the estimate θbMoM can be found by starting at an arbitrary initialization and MoM 9 repeatedly applying the γ-contractive operator TbN until convergence. The following theorem provides a population-based guarantee on the error of this estimator.

Theorem 3. Suppose that the median-of-means operator Mc is constructed with the parameter choice K = 8 log(4D/δ). Then there is a universal constant c such that we have r ∗ c log(8D/δ) ∗  kθbMoM − θ k∞ ≤ γkσ(θ )k∞ + kρ(r)k∞ (10) 1 − γ N with probability exceeding 1 − δ.

8In principle, one could run a median-of-means estimate on the combination of reward and transition, but this is not necessary in our setting due to the sub-Gaussian assumption on the reward noise (2). Slight modifications of our techniques also yield bounds for the combined median-of-means estimate assuming only that the standard deviation of the reward noise is bounded entry-wise by the vector ρ(r). 9 Since the operator is γ-contractive, it suffices to run this iterative algorithm for logγ  to obtain an -approximate fixed point in an additive sense.

11 We have thus achieved our goal of obtaining a purely variance-dependent bound. Indeed, for ∗ each pair of positive scalars (ϑ, %), any value function θ ∈ Mvar(ϑ, %), and reward distribution satisfying kρ(r)k∞ ≤ %, we have r ∗ c log(8D/δ) kθbMoM − θ k∞ ≤ (ϑ + %) , 1 − γ N with probability exceeding 1 − δ. Integrating this tail bound yields an analogous upper bound on the expected error, which matches the lower bound (7a) on the expected error up to a constant factor. As a corollary, we conclude that the minimax risk over the class Mvar(ϑ, %) scales as r h ∗ i 1 log(D) inf sup E kθb− θ k∞  (ϑ + %) , (11) ∗ θb θ ∈Mvar(ϑ,%) 1 − γ N and is achieved (up to constant factors) by the estimator θbMoM. However, our results fall short of showing that the estimator θbMoM is minimax optimal over the ∗ class Mrew(rmax,%) of MRPs with bounded rewards. Indeed, for any value function θ in the class Mrew(rmax,%), Theorem 3 yields the corollary

r   ∗ c log(8D/δ) rmax kθbMoM − θ k∞ ≤ γ + % 1 − γ N 1 − γ with probability exceeding 1 − δ. Comparing inequality (7b) with this bound, we see that our 1 upper bound on the median-of-means estimator is sub-optimal by a factor (1 − γ)− 2 in the discount complexity. From a technical standpoint, this is due to the fact that our upper bound in Theorem 3 1 ∗ −1 ∗ involves the functional 1−γ kσ(θ )k∞ and not the sharper functional k(I − γP) σ(θ )k∞ present in Theorem 1(b). We believe that this gap is not intrinsic to the MoM method, and conjecture that an upper bound depending on the latter functional can be proved for the estimator θbMoM; this would guarantee that the median-of-means estimator is also minimax optimal over the class Mrew(rmax,%).

4 Numerical

In this section, we explore the sharpness of our theoretical predictions, for both the plug-in and the median-of-means (MoM) estimator. Our bounds predict a range of behaviors depending on ∗ the scaling of the maximum standard deviation kσ(θ )k∞, and the span semi-norm (for the plug-in estimator). Let us verify these scalings via some simple experiments.

4.1 Behavior on the “hard” example used for the lower bound First, we use a simple variant of our lower bound construction illustrated in panel (a) of Figure 2. This MRP consists of D = 2 states, where state 1 stays fixed with probability p, transitions to state 2 with probability 1 − p, and state 2 is absorbing. The rewards in states 1 and 2 are given by ν and ντ, respectively. Here the triple (p, ν, τ), along with the discount factor γ, are parameters of the construction.

12 Log error versus log(1/(1 ))

= 0.00, (1.554, 1.569, 1.500) 0 = 0.50, (1.045, 1.056, 1.000) = 1.00, (0.547, 0.564, 0.500)

1 p r 1 o r

r e 2 m

r MoM o MoM n

p - 1 3 Plug

Plug g MoM

r = ⌫ r = ⌫ ⌧ o · l 4 Plug 5 (p, ⌫, ⌧) 2.0 2.5 3.0 3.5 4.0 P0 Complexity log(1/(1 )) (a) (b)

Figure 2. (a) Illustration of the MRP R0(p, ν, τ) used in the simulation, and also as a building 4γ−1 block in the lower bound construction of Theorem 2. For the simulation, we choose p = 3γ , let α ν = 1, and set τ = 1 − (1 − γ) . (b) Log-log plot of the `∞-error versus the discount complexity parameter 1/(1 − γ) for both the plug-in estimator (in + markers) and median-of-means estimator (in • markers) averaged over T = 1000 trials with N = 104 samples each. We have also plotted the least-squares fits through these points, and the slopes of these lines are provided in the legend. In ∗ particular, the legend contains the tuple of slopes (βbplug, βbMoM, β ) for each value of α. Logarithms are to the natural base.

In order to parameterize this MRP in a scalarized manner, we vary the triple (p, ν, τ) in the following way. First, we fix a scalar α in the unit interval [0, 1], and then we set

4γ−1 α p = 3γ , ν = 1, and τ = 1 − (1 − γ) . Note that this sub-family of MRPs is fully parameterized by the pair (γ, α). Let us clarify why this particular scalarization is interesting. As shown in the proof of Theorem 2 (see equation (34)), the underlying MRP has maximal standard deviation scaling as

 1 0.5−α kσ(θ∗)k ∼ . ∞ 1 − γ

Consequently, by the bound (10) from Theorem 3, for a fixed sample size N, the MoM estimator 1.5−α  1  should have `∞-norm scaling as 1−γ . As we discuss in Appendix B, the same prediction 1 also holds for the plug-in estimator, assuming that N % (1−γ) . In order to test this prediction, we fixed the parameter α ∈ [0, 1], and generated a range of MRPs with different values of the discount factor γ. For each such MRP, we drew N = 104 samples from the generative observation model and computed both the plug-in and median-of-means estimators, where the latter estimator was run with the choice K = 20. While the plug-in estimator has a simple

13 closed-form expression, the MoM estimator was obtained by running the median-of-means Bellman MoM operator TbN iteratively until it converged to its fixed point; we declared that convergence had −8 occurred when the `∞-norm of the difference between successive iterates fell below 10 . In panel (b) of Figure 2, we plot the `∞-error, of both the plug-in approach as well as the median-of-means estimator, as a function of γ. The plot shows the behavior for three distinct values α = {0, 0.5, 1}. Each point on each curve is obtained by averaging 1000 Monte Carlo trials of the . Note that on this log-log plot, we see a linear relationship between the log `∞-error and log discount complexity, with the slopes depending on the value of α. More precisely, from our calculations above, our theory predicts that the log `∞-error should be related to the log 1  complexity log 1−γ in a linear fashion with slope

β∗ = 1.5 − α.

Consequently, for both the plug-in and MoM estimators, we performed a to es- timate these slopes, denoted by βbplug and βbMoM respectively. The plot legend reports the triple ∗ ∗ (βbplug, βbMoM, β ), and for each we see good agreement between the theoretical prediction β and its empirical counterparts.

4.2 When does the MoM estimator perform better than plug-in? Our theoretical results predict that the MoM estimator should outperform the plug-in approach ∗ when the span semi-norm of the value function kθ kspan is much larger than its maximum standard ∗ deviation kσ(θ )k∞. Indeed, Proposition 1 in Appendix A demonstrates that there are MRPs on which the `∞-error of the plug-in estimator grows with the span semi-norm of the optimal value function. Let us now simulate the behavior of both the plug-in and MoM approach on this MRP, constructed by taking D/3 copies of the 3-state MRP in Figure 3(a). Our simulation is carried out on N = 104 samples from this D-state MRP, with noiseless observations of the reward. In order to parameterize the MRP via the discount factor alone, we fix the pair (q, D) in the following way. First, we fix a scalar α in the unit interval [0, 1], and then set

j  1 α k 10 D = 3 and q = . 1 − γ ND

Note that this sub-family of MRPs is fully parameterized by the pair (γ, α). The construction also ensures that kθ∗k kσ(θ∗)k ∞  √ ∞ , (12) N N for this chosen parameterization, and furthermore, that the ratio of the LHS and RHS of inequal- ity (12) increases as the dimension D increases (see the proof of Proposition 1). As shown in Proposition 1 in Appendix A, the `∞ error of the plug-in estimator for this family ∗ of MRPs can be lower bounded by kθ k∞/N. It is also straightforward to show that the error of the kσ(θ∗)k MoM estimator is upper bounded by the quantity √ ∞ . Now increasing the value of α increases N the dimension D, and so the MoM estimator should behave better and better for larger values of α. In particular, this behavior can be captured in the log-log plot of the error against 1/(1 − γ), which is presented in Figure 3(b).

14 Log error versus log(1/(1 ))

= 0.50, (0.981, 0.540) 5.0 = 0.75, (0.955, 0.301) = 1.00, (0.912, 0.053)

r 5.5 o r r e

6.0

1AAAB7HicbVBNS8NAEJ34WetX1aOXxSJ4sSRV0GPRi8cKpi20oWy2k3bpZhN3N0Ip/Q1ePCji1R/kzX/jts1BWx8MPN6bYWZemAqujet+Oyura+sbm4Wt4vbO7t5+6eCwoZNMMfRZIhLVCqlGwSX6hhuBrVQhjUOBzXB4O/WbT6g0T+SDGaUYxLQvecQZNVbyPXJOHrulsltxZyDLxMtJGXLUu6WvTi9hWYzSMEG1bntuaoIxVYYzgZNiJ9OYUjakfWxbKmmMOhjPjp2QU6v0SJQoW9KQmfp7YkxjrUdxaDtjagZ60ZuK/3ntzETXwZjLNDMo2XxRlAliEjL9nPS4QmbEyBLKFLe3EjagijJj8ynaELzFl5dJo1rxLirV+8ty7SaPowDHcAJn4MEV1OAO6uADAw7P8ApvjnRenHfnY9664uQzR/AHzucPYbuNvw== q

qAAAB6HicbVDLTgJBEOzFF+IL9ehlIjHxRHbRRI9ELx4hkUcCGzI79MLI7Ow6M2tCCF/gxYPGePWTvPk3DrAHBSvppFLVne6uIBFcG9f9dnJr6xubW/ntws7u3v5B8fCoqeNUMWywWMSqHVCNgktsGG4EthOFNAoEtoLR7cxvPaHSPJb3ZpygH9GB5CFn1Fip/tgrltyyOwdZJV5GSpCh1it+dfsxSyOUhgmqdcdzE+NPqDKcCZwWuqnGhLIRHWDHUkkj1P5kfuiUnFmlT8JY2ZKGzNXfExMaaT2OAtsZUTPUy95M/M/rpCa89idcJqlByRaLwlQQE5PZ16TPFTIjxpZQpri9lbAhVZQZm03BhuAtv7xKmpWyd1Gu1C9L1ZssjjycwCmcgwdXUIU7qEEDGCA8wyu8OQ/Oi/PufCxac042cwx/4Hz+ANzrjPk= m

r

o 6.5 n 1 1 -

7.0 g

AAAB7HicbVBNS8NAEJ3Ur1q/qh69LBbBU0mqoBeh6MVjBdMW2lA220m7dLMJuxuhlP4GLx4U8eoP8ua/cdvmoK0PBh7vzTAzL0wF18Z1v53C2vrG5lZxu7Szu7d/UD48auokUwx9lohEtUOqUXCJvuFGYDtVSONQYCsc3c381hMqzRP5aMYpBjEdSB5xRo2VfEVuiNsrV9yqOwdZJV5OKpCj0St/dfsJy2KUhgmqdcdzUxNMqDKcCZyWupnGlLIRHWDHUklj1MFkfuyUnFmlT6JE2ZKGzNXfExMaaz2OQ9sZUzPUy95M/M/rZCa6DiZcpplByRaLokwQk5DZ56TPFTIjxpZQpri9lbAhVZQZm0/JhuAtv7xKmrWqd1GtPVxW6rd5HEU4gVM4Bw+uoA730AAfGHB4hld4c6Tz4rw7H4vWgpPPHMMfOJ8/ex+Nzw== r = µ/2 r =0 AAAB8nicbVDLSgNBEOz1GeMr6tHLYBA8xd0o6EUIevEYwTxgs4TZyWwyZB7LzKwQQj7DiwdFvPo13vwbJ8keNLGgoajqprsrTjkz1ve/vZXVtfWNzcJWcXtnd2+/dHDYNCrThDaI4kq3Y2woZ5I2LLOctlNNsYg5bcXDu6nfeqLaMCUf7SilkcB9yRJGsHVSqNEN6ogMnaNqt1T2K/4MaJkEOSlDjnq39NXpKZIJKi3h2Jgw8FMbjbG2jHA6KXYyQ1NMhrhPQ0clFtRE49nJE3TqlB5KlHYlLZqpvyfGWBgzErHrFNgOzKI3Ff/zwswm19GYyTSzVJL5oiTjyCo0/R/1mKbE8pEjmGjmbkVkgDUm1qVUdCEEiy8vk2a1ElxUqg+X5dptHkcBjuEEziCAK6jBPdShAQQUPMMrvHnWe/HevY9564qXzxzBH3ifP/LDj7o= rAAAB7nicbVBNSwMxEJ2tX7V+VT16CRbBU9mtgl6EohePFewHtEvJptk2NMmGJCuUpT/CiwdFvPp7vPlvTNs9aOuDgcd7M8zMixRnxvr+t1dYW9/Y3Cpul3Z29/YPyodHLZOkmtAmSXiiOxE2lDNJm5ZZTjtKUywiTtvR+G7mt5+oNiyRj3aiaCjwULKYEWyd1NboBvVE2i9X/Ko/B1olQU4qkKPRL3/1BglJBZWWcGxMN/CVDTOsLSOcTku91FCFyRgPaddRiQU1YTY/d4rOnDJAcaJdSYvm6u+JDAtjJiJynQLbkVn2ZuJ/Xje18XWYMalSSyVZLIpTjmyCZr+jAdOUWD5xBBPN3K2IjLDGxLqESi6EYPnlVdKqVYOLau3hslK/zeMowgmcwjkEcAV1uIcGNIHAGJ7hFd485b14797HorXg5TPH8Afe5w9hUo7x = µ o l 7.5

8.0

2.0 2.5 3.0 3.5 4.0 Complexity log(1/(1 ))

(a) (b)

Figure 3. (a) Illustration of the MRP R1(q, µ). In the simulation as well as in the lower bound construction of Proposition 1, we concatenate D/3 such MRPs to produce an MRP on D states. For α j  1  k 10 the simulation, we choose D = 3 1−γ and set q = ND and µ = 1. (b) Log-log plot of the `∞-error versus the discount complexity parameter 1/(1 − γ) for both the plug-in estimator (in + markers) and median-of-means estimator (in • markers) averaged over T = 1000 trials with N = 104 samples each. We have also plotted the least-squares fits through these points, and the slopes of these

lines are provided in the legend. In particular, the legend contains the tuple of slopes (βbplug, βbMoM) for each value of α. Logarithms are to the natural base.

The plot shows the behavior for three distinct values α = {0.5, 0.75, 1}. Each point on each curve is obtained by averaging 1000 Monte Carlo trials of the experiment. As expected, the MoM estimator consistently outperforms the plug-in estimator for each value of α. Moreover, on this log-log plot, we see a linear relationship between the log `∞-error and log discount complexity, with the slopes depending on the value of α. For both the plug-in and MoM estimators, we performed a linear regression to estimate these slopes, denoted by βbplug and βbMoM respectively. The plot legend reports the pair (βbplug, βbMoM), and we see that the gap between the slopes increases as α increases.

5 Proofs

We now turn to the proofs of our main results. Throughout our proofs, the reader should recall that the values of absolute constants may change from line-to-line. We also use the following facts repeatedly. First, for a row stochastic matrix M with non-negative entries and any scalar γ ∈ [0, 1), we have the infinite series

∞ X (I − γM)−1 = (γM)t, (13a) t=0

15 which implies that the entries of (I − γM)−1 are all non-negative. Second, for any such matrix, we −1 1 also have the bound k(I − γM) k1,∞ ≤ 1−γ . Finally, for any matrix A with positive entries and a vector v of compatible dimension, we have the elementwise inequality

|Av|  A|v|. (13b)

5.1 Proof of Theorem 1, part (a)

Throughout this proof, we adopt the convenient shorthand θb ≡ θbplug for notational convenience. By the Bellman equations (1) and (4) for θ∗ and θb, respectively, we have

∗ n ∗o ∗ ∗ θb− θ = γ Pbθb− Pθ + (rb− r) = γPb(θb− θ ) + γ(Pb − P)θ + (rb− r).

Introducing the shorthand ∆b : = θb− θ∗ and re-arranging implies the relation −1 ∗ −1 ∆b = γ(I − γPb) (Pb − P)θ + (I − γPb) (rb− r), (14) and consequently, the elementwise inequality

−1 ∗ −1 |∆b |  γ(I − γPb) |(Pb − P)θ | + (I − γPb) |(rb− r)|, (15) where we have used the relation (13b) with the matrix A = (I − γPb)−1. Given the sub-Gaussian condition on the stochastic rewards, we can apply Hoeffding’s inequality combined with the union q log(8D/δ) bound to obtain the elementwise inequality |rb−r|  c N ·ρ(r), which holds with probability δ −1 1 at least 1 − 4 . Since the matrix (I − γPb) has non-negative entries and (1, ∞)-norm at most 1−γ , we have r −1 c log(8D/δ) (I − γPb) |r − r|  kρ(r)k∞ 1. (16a) b 1 − γ N with the same probability. On the other hand, by Bernstein’s inequality, we have (r ) ∗ log(8D/δ) ∗ ∗ log(8D/δ) |(Pb − P)θ |  c · σ(θ ) + kθ kspan · 1 N N

δ with probability at least 1 − 4 , and hence

(r ∗ ) −1 ∗ log(8D/δ) −1 ∗ kθ kspan log(8D/δ) (I − γPb) |(Pb − P)θ |  c · k(I − γPb) σ(θ )k∞ + · 1. N 1 − γ N (16b) Substituting the bounds (16a) and (16b) into the elementwise inequality (15), we find that

(r   ∗ ) log(8D/δ) −1 ∗ kρ(r)k∞ γkθ kspan log(8D/δ) |∆b |  c · γk(I − γPb) σ(θ )k∞ + + · 1 (17) N 1 − γ 1 − γ N

δ with probability at least 1 − 2 . ∗ ∗ Our next step is to relate the pair of population quantities (σ(θ ), kθ kspan) to their empirical analogues (σb(θb), kθbkspan). The following lemma provides such a bound.

16 Lemma 1 (Population to empirical variance). We have the element-wise inequality r ∗ 0 ∗ log(8D/δ) σ(θ )  2σ(θb) + 2|∆b | + c kθ kspan · 1 (18) b N with probability at least 1 − δ/2.

Taking this lemma as given for the , let us complete the proof.

Since the matrix (I − γPb)−1 has non-negative entries, we can multiply both sides of the ele- mentwise inequality (18) by it; doing so and taking the `∞-norm yields

0 ∗ r −1 ∗ −1 2k∆b k∞ c kθ kspan log(8D/δ) k(I − γPb) σ(θ )k∞ ≤ 2k(I − γPb) σ(θb)k∞ + + . b 1 − γ 1 − γ N

Substituting back into the elementwise inequality (17) and taking `∞-norms of both sides, we find that

(r   ∗ ) log(8D/δ) −1 kρ(r)k∞ γkθ kspan log(8D/δ) k∆b k∞ ≤ c γk(I − γPb) σ(θb)k∞ + + N b 1 − γ 1 − γ N r 2cγ log(8D/δ) + k∆b k∞. 1 − γ N

Since the span semi-norm satisfies the triangle inequality, we have

∗ kθ kspan ≤ kθbkspan + k∆b kspan ≤ kθbkspan + 2k∆b k∞.

Substituting this bound and re-arranging yields

(r   ) ∗ log(8D/δ) −1 kρ(r)k∞ γkθbkspan log(8D/δ) κkθb− θ k∞ ≤ c γk(I − γPb) σ(θb)k∞ + + . N b 1 − γ 1 − γ N

  2cγ q log(8D/δ) log(8D/δ) where we have introduced the shorthand κ : = 1 − 1−γ N + N . Finally, by choos- 2 log(8D/δ) 1 ing the pre-factor c1 in the lower bound N ≥ c1γ (1−γ)2 large enough, we can ensure that κ ≥ 2 , thereby completing the proof of Theorem 1(a).

5.1.1 Proof of Lemma 1 We now turn to the proof of the auxiliary result in Lemma 1. We begin by noting that the statement is trivially true when N ≤ log(8D/δ), since we have

∗ ∗ σ(θ )  kθ kspan1.

Thus, by adjusting the constant factors in the statement of the lemma, it suffices to prove the lemma under the assumption N ≥ c log(8D/δ) for a sufficiently large absolute constant c. Accordingly, we make this assumption for the rest of the proof.

17 We use the following convenient notation for expectations. Let E denote the vector expectation operator, with the convention that E[v] = Pv. Similarly, let Eb denote the vector empirical expec- tation operator, given by Eb[v] = Pbv. These operators are applied elementwise by definition, and we let Ei and Ebi denote the i-th entry of each operator, respectively. With this notation, we have

2 ∗ ∗ ∗ 2 σ (θ ) = E |θ − E[θ ]| 2 ∗ ∗ ∗ ∗ 2 = (E − Eb) θ − E[θ ] + Eb |θ − E[θ ]| 2 2 2 ∗ ∗ ∗ ∗ ∗ ∗  (E − Eb) θ − E[θ ] + 2 Eb[θ ] − E[θ ] + 2Eb θ − Eb[θ ] 2 2 ∗ ∗ ∗ ∗ 2 ∗ = (E − Eb) θ − E[θ ] + 2 Eb[θ ] − E[θ ] +2σb (θ ). (19) | {z } | {z } T1 T2

We claim that the terms T1 and T2 are bounded as follows: σ2(θ∗) log(8D/δ) T  + ckθ∗k2 · 1, and (20a) 1 4 span N ( ) log(8D/δ)  log(8D/δ)2 T  c · σ2(θ∗) + kθ∗k · 1 , (20b) 2 N span N

δ where each bound holds with probability at least 1 − 4 . Taking these bounds as given for the moment, as long as N ≥ c0 log(8D/δ) for a sufficiently large constant c0, we can ensure that

σ2(θ∗) log(8D/δ) T + T  + ckθ∗k2 · 1, 1 2 2 span N Substituting back into our earlier bound (19), we find that

σ2(θ∗) log(8D/δ)  2σ2(θ∗) + c0kθ∗k2 · 1. 2 b span N Rearranging and taking square roots entry-wise, we find that r r log(8D/δ) log(8D/δ) σ(θ∗)  4σ2(θ∗) + 2c0kθ∗k2 · 1  2σ(θ∗) + c0kθ∗k · 1. b span N b span N

∗ ∗ Finally noting that we have the entry-wise inequality σb(θ )  σb(θb) + |θb− θ | establishes the claim of Lemma 1. It remains to prove the bounds (20a) and (20b).

∗ ∗ 2 Proof of bound (20a): For each index i ∈ [D], define the Yi : = θJ − Ei[θ ] , where J is an index chosen at random from the distribution pi. By definition, each random variable Yi is non-negative, and so with E now denoting the regular expectation of a scalar random variable, we have lower tail bound (Proposition 2.14, [Wai19a])

 ns2  P [E[Yi] − Yi ≥ s] ≤ exp − 2 for all s > 0. 2E[Yi ]

18 ∗ 2 Moreover, we have Yi ≤ kθ kspan almost surely, from which we obtain 2 ∗ 2  ∗ ∗ 2 ∗ 2 2 ∗ E[Yi ] ≤ kθ kspanEi (θ − E[θ ]) = kθ kspanσi (θ ). Putting together the pieces yields the elementwise inequality r log(8D/δ) (i) σ2(θ∗) log(8D/δ) T  ckθ∗k · σ(θ∗)  + c0kθ∗k2 , 1 span N 8 span N with probability at least 1 − δ/3, where in step (i), we have used the inequality 2ab ≤ νa2 + ν−1b2, which holds for any triple of positive scalars (a, b, ν).

Proof of the bound (20b): From Bernstein’s inequality, we have the element-wise bound (r ) ∗ ∗ log(8D/δ) ∗ ∗ log(8D/δ) b[θ ] − [θ ]  c · σ(θ ) + kθ kspan · 1 E E N N with probability at least 1 − δ/4, and hence ( ) log(8D/δ)  log(8D/δ)2 T  c · σ2(θ∗) + kθ∗k · 1 , 2 N span N as claimed.

5.2 Proof of Theorem 1, part (b)

Once again, we employ the shorthand θb ≡ θbplug for notational convenience, and also the short- hand ∆b = θb− θ∗. Note that it suffices to show the inequality

n ∗ −1 ∗ −1 o δ P kθb− θ k∞ ≥ cγ (I − γP) |(Pb − P)θ | + c(1 − γ) kr − rk∞ ≤ , (21) ∞ b 2 from which the theorem follows by application of a Bernstein bound to the first term and Hoeffding bound to the second, in a similar fashion to the inequalities (16). We therefore dedicate the rest of the proof to establishing inequality (21).

5.2.1 Proving the bound (21) We have

∗ ∗ ∆b = θb− θ = γPbθb− γPθ + (rb− r) = γ(Pb − P)θb+ γP∆b + (rb− r), which implies that

−1 −1 −1 −1 ∗ ∆b − (I − γP) (rb− r) = γ(I − γP) (Pb − P)θb = γ(I − γP) (Pb − P)∆b + γ(I − γP) (Pb − P)θ . (22)

Since all entries of (I − γP)−1 are non-negative, we have the element-wise inequalities

−1 −1 ∗ −1 |∆b |  γ(I − γP) |(Pb − P)∆b | + γ(I − γP) |(Pb − P)θ | + (I − γP) |rb− r| −1 −1 ∗ 1  γ(I − γP) |(Pb − P)∆b | + γ(I − γP) |(Pb − P)θ | + kr − rk∞ · 1. (23) 1 − γ b

19 The second and third terms are already in terms of the desired population-level functionals in equation (21). It remains to bound the first term. Note that the key difficulty here is the fact that the two matrices Pb − P and ∆b are not independent. As a first attempt to address this dependence, one is tempted to use the fact that provided N is large enough, each row of Pb − P has small `1-norm; for instance, see Weissman et al. [WOS+03] for sharp bounds of this type. In particular, this would allow us to work with the entry-wise bounds r D |(Pb − P)∆b |  kPb − Pk1,∞k∆b k∞ · 1 C k∆b k∞ · 1, - N where the final relation hides logarithmic factors in the pair (D, δ). Proceeding in this fashion, −1q D we would then bound each entry in the first term of equation (23) by γ(1 − γ) N k∆b k∞; then −1q D choosing N large enough such that γ(1−γ) N ≤ 1/2 suffices to establish bound (21). However, γ2 this requires a sample size N & (1−γ)2 D, while we wish to obtain the bound (21) with the sample γ2 size N & (1−γ)2 . This requires a more delicate analysis. Our analysis instead proceeds entry-by-entry, and uses a leave-one-out sequence to carefully decouple the dependence between Pb − P and ∆.b Let us introduce some notation to make this precise. For each i ∈ [D], recall that we used pbi and pi to denote row i of the matrices Pb and P, respectively. Let Pb (i) denote the i-th leave-one-out transition matrix, which is identical to Pb except (i) (i) −1 with row i replaced by the population vector pi. Let θb : = (I − γPb ) r be the value function estimate based on Pb (i) and the true reward vector r, and denote the associated difference vector by ∆b (i) : = θb(i) − θ∗. Now note that we have h i (i) (i) (Pb − P)∆b = hpi − pi, ∆b i = hpi − pi, ∆b i + hpi − pi, θb− θb i. i b b b (i) This decomposition is helpful because, now, the vectors pbi − pi and ∆b are independent by con- struction, so that standard tail bounds can be used on the first term. For the second term, we use the fact that θb ≈ θb(i), since the latter is obtained by replacing just one row of the estimated transition matrix. Formally, this closeness will be argued by using the matrix inversion formula. We collect these two results in the following lemma. 0 2 log(8D/δ) Lemma 2. Suppose that the sample size is lower bounded as N ≥ c γ (1−γ)2 . Then with proba- δ bility at least 1 − 2D and for each i ∈ [D], we have ( r ) (i) log(8D/δ) ∗ γ|hpi − pi, ∆b i| ≤ c γk∆b k∞ + γ |hpi − pi, θ i| + kr − rk∞ and (24a) b N b b ( r ) (i) log(8D/δ) ∗ γ|hpi − pi, θb − θbi| ≤ c γk∆b k∞ + γ |hpi − pi, θ i| + kr − rk∞ . (24b) b N b b With this lemma in hand, let us complete the proof. Combining the bounds of Lemma 2 with a union bound over all D entries yields the elementwise inequality ( r ) ∗ log(8D/δ) γ (Pb − P)∆b  cγ (Pb − P)θ + c γk∆b k∞ + kr − rk∞ 1 N b

20 with probability at least 1 − δ/2. Since the entries of (I − γP)−1 are non-negative, we can multiply both sides of this inequality by it, thereby obtaining ( r ) −1 −1 ∗ c log(8D/δ) γ(I − γP) (Pb − P)∆b  cγ(I − γP) (Pb − P)θ + γk∆b k∞ + kr − rk∞ 1. 1 − γ N b

Returning to the upper bound (23), we have shown that r k∆b k∞ log(8D/δ) 0 −1 ∗ c k∆b k∞ ≤ cγ + c γ (I − γP) |(Pb − P)θ | + kr − rk∞. 1 − γ N ∞ 1 − γ b

0 2 log(8D/δ) Under the assumed lower bound on the sample size N ≥ c γ (1−γ)2 , this inequality implies that

0 −1 ∗ c k∆b k∞ ≤ c γ (I − γP) |(Pb − P)θ | + kr − rk∞, ∞ 1 − γ b as claimed (21). We now proceed to a proof of Lemma 2, which uses the following structural lemma relating the quantities ∆b (i) and ∆.b

0 2 log(8D/δ) Lemma 3. Suppose that the sample size is lower bounded as N ≥ c γ (1−γ)2 . Then with proba- δ bility at least 1 − 4D and for each i ∈ [D], we have

(i) c n ∗ o k∆b k∞ ≤ ck∆b k∞ + γ |hpi − pi, θ i| + kr − rk∞ . (25) 1 − γ b b

This lemma is proved in Section 5.2.3 to follow.

5.2.2 Proof of Lemma 2 We prove the two bounds in turn.

(i) Proof of inequality (24a): Note that pbi − pi and ∆b are independent by construction, so that the Hoeffding inequality yields r (i) (i) log(8D/δ) |hpi − pi, ∆b i| ≤ ck∆b k∞ (26) b N with probability at least 1 − δ/(4D). Using this in conjunction with inequality (25) from Lemma 3 yields the bound r r (i) log(8D/δ) cγ log(8D/δ)n ∗ o γ|hpi − pi, ∆b i| ≤ cγk∆b k∞ + γ |hpi − pi, θ i| + kr − rk∞ b N 1 − γ N b b (i) r log(8D/δ) ∗ ≤ cγk∆b k∞ + cγ |hpi − pi, θ i| + ckr − rk∞, N b b

0 γ2 where in step (i), we have used the lower bound on the sample size N ≥ c (1−γ)2 log(8D/δ).

21 Proof of inequality (24b): The proof of this claim is more involved. Using the relation (22) (with suitable modifications of terms), we have

(i) −1 (i) (i) −1 θb − θb = γ(I − γPb) (Pb − Pb)θb + (I − γPb) (r − rb) −1  (i)  −1 = −γ(I − γPb) ei hpbi − pi, θb i + (I − γPb) (r − rb). (27)

Moreover, the Woodbury matrix identity [HJ85] yields

(i) −1 T (i) −1  −1  −1 (I − γPb ) ei(pi − pi) (I − γPb ) M : = I − γPb − I − γPb (i) = −γ b . T (i) −1 1 − γ(pbi − pi) (I − γPb ) ei Consequently,

(i) > −1  (i)  > −1 hpbi − pi, θb − θbi = −γ(pbi − pi) (I − γPb) ei hpbi − pi, θb i + (pbi − pi) (I − γPb) (r − rb) > (i) −1  (i)  > (i) −1 = −γ(pbi − pi) (I − γPb ) ei hpbi − pi, θb i + (pbi − pi) (I − γPb ) (r − rb) >  (i)  > − γ(pbi − pi) Mei hpbi − pi, θb i + (pbi − pi) M(r − rb) 2  (i)  2Zi − Zi 1 − 2Zi = hpbi − pi, θb i · + Ti · , (28) 1 − Zi 1 − Zi where we have defined, for convenience, the random variables

> (i) −1 > (i) −1 Zi : = γ(pbi − pi) (I − γPb ) ei and Ti : = (pbi − pi) (I − γPb ) (r − rb).

(i) −1 Since pbi − pi is independent of the vector (I − γPb ) (r − rb), applying the Hoeffding bound yields the inequality r c log(8D/δ) |T | ≤ kr − rk i 1 − γ b ∞ N with probability exceeding 1 − δ/(4D). (i) −1 On the other hand, exploiting independence between the vectors pbi − pi and (I − γPb ) ei and applying the Hoeffding bound, we also have r cγ log(8D/δ) |Z | ≤ i 1 − γ N

0 γ2 with probability least 1 − δ/(4D). Taking N ≥ c (1−γ)2 log(8D/δ) for a sufficiently large constant 0 c ensures that γ|Ti| ≤ kr − rbk∞ and |Zi| ≤ 1/4, so that with probability exceeding 1 − δ/(2D), inequality (28) yields

(i) n (i) o γ|hpbi − pi, θb − θbi| ≤ c γ|hpbi − pi, θb i| + kr − rbk∞ n (i) ∗ o ≤ c γ|hpbi − pi, ∆b i| + γ|hpbi − pi, θ i| + kr − rbk∞ .

Finally, applying part (a) of Lemma 2 completes the proof.

22 5.2.3 Proof of Lemma 3

Recall our leave-one-out matrix Pb (i), and the explicit bound (26). We have r (i) (i) ∗ (i) log(8D/δ) ∗ hpi − pi, θb i ≤ hpi − pi, ∆b i + |hpi − pi, θ i| ≤ ck∆b k∞ + |hpi − pi, θ i| (29) b b b N b with probability at least 1 − δ/(4D). Substituting inequality (29) into the bound (27), we find that ( r ) (i) c (i) log(8D/δ) ∗ kθb − θbk∞ ≤ γk∆b k∞ · + γ |hpi − pi, θ i| + kr − rk∞ . (30) 1 − γ N b b

Finally, the triangle inequality yields

(i) (i) k∆b k∞ ≤ k∆b k∞ + kθb − θbk∞ ( r ) c (i) log(8D/δ) ∗ ≤ k∆b k∞ + γk∆b k∞ · + γ |hpi − pi, θ i| + kr − rk∞ . 1 − γ N b b

0 2 log(8D/δ) 0 For N ≥ c γ (1−γ)2 with c sufficiently large, we have

(i) c n ∗ o k∆b k∞ ≤ ck∆b k∞ + γ |hpi − pi, θ i| + kr − rk∞ 1 − γ b b

δ with probability at least 1 − 4D , which completes the proof of Lemma 3. Since Corollary 1 follows from Theorem 1, we prove it first before moving to a proof of Theorem 2 in Section 5.4.

5.3 Proof of Corollary 1

−1 1 In order to prove part (a), consider inequality (14) and further use the fact that k(I − γPb) k1,∞ ≤ 1−γ to obtain the element-wise bound

∗ γ ∗ krb− rk∞ |θb− θ |  k(Pb − P)θ k∞1 + · 1. 1 − γ 1 − γ

Applying Bernstein’s bound to the first term and Hoeffding’s bound to the second completes the proof. In order to prove part (b) of the corollary, we apply Lemma 7 of Azar et al. [AMK13]—in particular, equation (17) of that paper. Tailored to this setting, their result leads to the point-wise bound r k(I − γP)−1σ(θ∗)k ≤ c max . ∞ (1 − γ)3/2

We also have the bound 2r kθ∗k ≤ 2kθ∗k = 2k(I − γP)−1rk ≤ max , span ∞ ∞ 1 − γ

23 so that combining the pieces and applying Theorem 1(b), we obtain

(r   ) ∗ c log(8D/δ) rmax log(8D/δ) rmax kθb− θ k∞ ≤ γ + kρ(r)k∞ + γ · . (1 − γ) N (1 − γ)1/2 N 1 − γ

log(8D/δ) Finally, when N ≥ c1 1−γ for a sufficiently large constant c1, we have r log(8D/δ) r log(8D/δ) r max ≤ c max , N 1 − γ N (1 − γ)1/2 thereby establishing the claim.

5.4 Proof of Theorem 2 For all of our lower bounds, we assume that the reward distribution takes the Gaussian form

2 Dr( · | j) = N (rj,% ) (31) for each state j. Note that this reward distribution satisfies kρ(r)k∞ = % by construction. Let us begin with a short overview of our proof, which proceeds in two steps. First, we suppose that the transition matrix P is known exactly, and the hardness of the estimation problem is due to noisy observations of the reward function. In particular, letting MI(rmax,%) denote the class of all MRPs with the specific reward observation model (31), and for which the transition matrix is the identity matrix I and the rewards are uniformly bounded as krk∞ ≤ rmax, we show that ( r ) ∗ % log(D) rmax inf sup Ekθb− θ k∞ ≥ c · ∧ . (32) ∗ θb θ ∈MI(rmax,%) 1 − γ N 1 − γ

Note that for each pair of positive scalars (ϑ, rmax) we have the inclusions

MI(rmax,%) ⊆ Mvar(ϑ, %) and MI(rmax,%) ⊆ Mrew(rmax,%), and so that the lower bound (32) carries over to the classes Mvar(ϑ, %) and Mrew(rmax,%). Next, we suppose that the population reward function r is known exactly (% = 0), and the hardness of the estimation problem is only due to uncertainty in the transitions. Under this setting, we prove the lower bounds r ∗ ϑ log(D/2) inf sup Ekθb− θ k∞ ≥ c · , and (33a) ∗ θb θ ∈Mvar(ϑ,0) 1 − γ N r ∗ rmax log(D/2) inf sup Ekθb− θ k∞ ≥ c · . (33b) ∗ 3/2 θb θ ∈Mrew(rmax,0) (1 − γ) N

Since Mvar(ϑ, 0) ⊂ Mvar(ϑ, %) for any % > 0, these lower bounds also carry over to the more general setting. The minimax lower bounds of Theorem 2 are obtained by taking the maximum of the bounds (32) and (33). Let us now establish the two previously claimed bounds.

24 5.4.1 Proof of claim (32) For some positive scalar α to be chosen shortly, consider D distinct reward vectors {r(1), . . . , r(D)}, (i) D where the vector r ∈ R has entries ( α if i = j r(i) : = for all j ∈ [D]. j 0 otherwise,

Denote by R(i) the MRP with reward function r(i); and transition matrix I. Thus, the i-th value ∗ (i) 1 (i) function is given by the vector (θ ) : = 1−γ r . ∗ (i) ∗ (j) By construction, we have k(θ ) − (θ ) k∞ = α/(1 − γ) for each pair of distinct indices (i, j). Furthermore, the KL between Gaussians of variance %2 centered at r(i) and r(j) is given by

  kr(i) − r(j)k2 2α2 D N (r(i),%2I) k N (r(j),%2I) = 2 = . KL %2 %2

Thus, applying the local packing version of Fano’s method (§15.3.3, [Wai19a]), we have

2 2 α N + log 2! ∗ α %2 inf sup Ekθb− θ k∞ ≥ c 1 − . ∗ (i) 1 − γ log D θb θ ∈{M }i∈[D]

q log D Setting α = % 6N ∧ rmax yields the claimed lower bound.

5.4.2 Proof of claim (33) This lower bound is based on a modification of constructions used by Lattimore and Hutter [LH14] and Azar et al. [AMK13]. Our proof, however, is tailored to the generative observation model. Our proof is structured as follows. First, we construct a family of “hard” MRPs and prove a minimax lower bound as a function of parameters used to define this family. Constructing this family of hard instances requires us to first define a basic building block: a two-state MRP that was illustrated in Figure 2(a). After obtaining this general lower bound, we then set the scalars that parameterize the hard class MRP appropriately to obtain the two claimed bounds. We now describe the two-state MRP in more detail. For a pair of parameters (p, τ), each in the unit interval [0, 1], and a positive scalar ν, consider the two-state Markov reward process R0(p, ν, τ), with transition matrix and reward vector given by

p 1 − p  ν  P = and r = , 0 0 1 0 ν · τ respectively. See Figure 2 for an illustration of this MRP. A straightforward calculation yields that it has value function and corresponding standard deviation vector given by √ " 1−γ+γτ(1−p) # " (1−τ) p(1−p) # ∗ (1−γp)(1−γ) ∗ 1−γp θ (p, ν, τ) = ν τ and σ(θ ) = ν , (34) 1−γ 0

25 ∗ ∗ ∗ ν(1−τ) respectively, where we have used the shorthand θ ≡ θ (p, ν, τ). We also have kθ kspan = 1−γp ; ∗ ∗ the two scalars (ν, τ) allow us to control the quantities kσ(θ )k∞ and kθ kspan. Index the states of this MRP by the set {0, 1}, and consider now a sample drawn from this MRP under the generative model. We see a pair of states drawn according to the respective rows of the transition matrix P0; the first state is drawn according to the Bernoulli distribution Ber(p), and the second state is deterministic and equal to 1. For convenience, we use P(p) = (Ber(p), 1) to denote the distribution of this pair of states. Our hard class of instances is based in part on the difficulty of distinguishing two such MRPs that are close in a specific sense. Let us make this intuition precise. For two scalar values 0 ≤ p2 ≤ p1 ≤ 1, some algebra yields the relation

∗ ∗ (p1 − p2)(1 − τ) kθ (p1, ν, τ) − θ (p2, ν, τ)k∞ = ν · . (35) (1 − γp1)(1 − γp2) In the sequel, we work with the choices r 4γ − 1 1 p (1 − p ) p = and p = p − 1 1 log(D/2), 1 3γ 2 1 8 N

 1  which, under the assumed lower bound on the sample size N, are both scalars in the range 2 , 1  1  for all discount factors γ ∈ 2 , 1 . Moreover, it is worth noting the relations 1 − γ 1 − γ 1 − γ 1 − p = , c ≤ 1 − p ≤ c 1 3γ 1 3γ 2 2 3γ 4 1 − γp = (1 − γ), and c (1 − γ) ≤ 1 − γp ≤ c (1 − γ), (36) 1 3 1 2 2 cγ where the inequalities on the right hold provided N ≥ 1−γ log(D/2) for a sufficiently large con- stant c. Here the pair of constants (c1, c2) are universal, depend only on c, and may change from line to line. We also require the following lemma, proved in Section 5.4.3 to follow, which provides a useful bound on the KL divergence between P(p1) and P(p2). Lemma 4. For each pair p, q ∈ [1/2, 1), we have

(p − q)2 D ( (p)k (q)) ≤ . KL P P (p ∨ q)(1 − (p ∨ q)) We are now in a position to describe the hard family of MRPs over which we prove a gen- eral lower bound. Suppose that D is even for convenience, and consider a set of D/2 “master” ¯ 10 MRPs M : = {R1,..., RD/2} each on D states constructed as follows. Decompose each master MRP into D/2 sub-MRPs of two states each; index the k-th sub-MRP in the j-th master MRP by Rj,k. For each pair j, k ∈ [D/2], set ( R0(p1, ν, τ) if j 6= k Rj,k = R0(p2, ν, τ) otherwise.

10Note that this step is only required in order to “tensorize” the construction in order to obtain the optimal dependence on the dimension. If, instead of the `∞ error, one was interested in estimating the value function at a fixed state of the MRP, then this tensorization is no longer needed.

26 ∗ N Let θj denote the value function corresponding to MRP Rj, and let Pj denote the distribution of state transitions observed from the MRP Rj under the generative model. Also note that for each i ∈ [D/2], we have p ∗ (1 − τ) p1(1 − p1) kσ(θi )k∞ = ν . (37) (1 − γp1)

Lower bounding the minimax risk over this class: We again use the local packing form of Fano’s method (§15.3.3, [Wai19a]) to establish a lower bound. Choose some index J uniformly at N random from the set [D/2], and suppose that we draw N i.i.d. samples Y : = (Y1,...,YN ) from D the MRP RJ under the generative model. Here each Yi ∈ X represents a random set of D states, and the goal of the estimator is to identify the random index J and, consequently, to estimate the ∗ value function θJ . Let us now lower bound the expected error incurred in this (D/2)-ary hypothesis testing problem. Fano’s inequality yields the bound

 N  ∗ 1 ∗ ∗ I(J; Y ) + log 2 inf sup Ekθb− θ k∞ ≥ min kθj − θkk∞ 1 − , (38) θb θ∗∈M¯ 2 j6=k log(D/2) where I(J; Y N ) denotes the mutual information between J and Y N . Let us now bound the two terms that appear in inequality (38). By equation (35), we have

∗ ∗ (p1 − p2)(1 − τ) kθj − θkk∞ = ν · for all 1 ≤ j 6= k ≤ D/2. (1 − γp1)(1 − γp2)

Furthermore, since the samples Y1,...,YN are i.i.d., the chain rule of mutual information yields

1 N I(J; Y ) = I(J; Y1) ≤ max DKL(PjkPk) N j6=k (i) = DKL(P(p1)kP(p2)) + DKL(P(p2)kP(p1)) (ii) (p − p )2 ≤ 2 1 2 , p1(1 − p1) where step (i) is a consequence of the construction, which ensures that the distributions Pj and Pk coincide on all but the j-th and k-th sub-MRPs. On the other hand, step (ii) follows from Lemma 4, and the fact that p2 ≤ p1. Putting together the pieces, we now have

 2  2N (p1−p2) + log 2 ∗ ν (p1 − p2)(1 − τ) p1(1−p1) inf sup Ekθb− θ k∞ ≥ · 1 −  . θb θ∗∈M¯ 2 (1 − γp1)(1 − γp2) log(D/2)

q 1 p1(1−p1) Recall the choice p1 − p2 = 8 N log(D/2). For D ≥ 8, this ensures, for a small enough positive constant c, the bound p r ∗ (1 − τ) p1(1 − p1) log(D/2) 1 inf sup Ekθb− θ k∞ ≥ cν · . (39) θb θ∗∈M¯ (1 − γp1) N 1 − γp2 With the relation (39) at hand, we now turn to proving the two sub-claims in equation (33).

27 Proof of claim (33a): Recall equation (37); for i ∈ [D/2], we have

p ∗ (1 − τ) p1(1 − p1) ∗ (1 − τ) kσ(θi )k∞ = ν and kθi kspan = ν . (1 − γp1) (1 − γp1) √ Now for every pair of scalars (ϑ, ζ) satisfying ϑ = ζ 1 − γ, set τ = 1/2 and ν = 2ζ(1 − γp1). With ¯ this choice of parameters, we have the inclusion Mvar(ϑ, 0) ∩ Mvfun(ζ, 0) ⊆ M, and evaluating the bound (39) yields

r ∗ log(D/2) 1 inf sup Ekθb− θ k∞ ≥ cϑ ∗ θb θ ∈Mvar(ϑ,0)∩Mvfun(ζ,0) N 1 − γp2 r (ii) log(D/2) 1 = cϑ , N 1 − γ where in step (ii), we have used inequality (36). The same lower bound clearly also extends to the −1/2 set Mvar(ϑ, 0) ∩ Mvfun(ζ, 0) for ζ ≥ ϑ(1 − γ) ; this establishes part (a) of the theorem.

Proof of claim (33b): Given a value rmax, set τ = 0 and ν = rmax and note that the rewards of all the MRPs in the set M¯ satisfy krk∞ ≤ ν. Hence, we have Mrew(rmax, 0) ⊆ M¯ for this choice of parameters. Using inequality (39) and recalling the bounds (36) once again, we have

p r ∗ p1(1 − p1) log(D/2) 1 inf sup Ekθb− θ k∞ ≥ crmax · ∗ θb θ ∈Mrew(rmax,0) (1 − γp1) N 1 − γp2 r r log(D/2) ≥ c max . (1 − γ)3/2 N

5.4.3 Proof of Lemma 4

By construction, the second state of the Markov chain is absorbing, so it suffices to consider the KL divergence between the first components of the distributions P(p) and P(q). These are Bernoulli random variables Ber(p) and Ber(q), and the following calculation bounds their KL divergence:

p 1 − p D (Ber(p)kBer(q)) = p log + (1 − p) log KL q 1 − q (ii) p − q q − p ≤ p · + (1 − p) · q 1 − q (p − q)2 = , q(1 − q) where step (ii) uses the inequality log(1+x) ≤ x, which is valid for all x > −1. A similar inequality holds with the roles of p and q reversed, and the denominator of the expression is lower for the larger value p ∨ q. This completes the proof.

28 5.5 Proof of Theorem 3 Note that the median-of-means operator is applied elementwise; denote the i-th such operator by Mci. Let Mc − P denote the elementwise difference of operator Mc and the linear operator P; its i-th component is given by the operator Mci(·) − hpi, ·i. We require two technical lemmas in the proof. The power of the median-of-means device is clarified by the first lemma, which is an adaptation of classical results (see, e.g., [NY83, JVV86]).

Lemma 5. Suppose that K = 8 log(4D/δ) and m = bN/Kc. Then there is a universal constant c D such that for each index i ∈ [D] and each fixed vector θ ∈ R , we have ( r ) log(8D/δ) δ Pr |(Mci − pi)(θ)| ≥ c σi(θ) ≤ . N 4D

Comparing this lemma to the Bernstein bound (cf. equation (16b)), we see that we no longer ∗ pay in the span semi-norm kθ kspan, and this is what enables us to establish the solely variance- dependent bound (10). We also require the following lemma that guarantees that the median-of-means Bellman operator is contractive.

Lemma 6. The median-of-means operator is 1-Lipschitz in the `∞-norm, and satisfies

D |Mc(θ1) − Mc(θ2)| ≤ kθ1 − θ2k∞ for all vectors θ1, θ2 ∈ R .

MoM Consequently, the empirical operator TbN is γ-contractive in `∞-norm and satisfies

MoM MoM |TbN (θ1) − TbN (θ2)| ≤ γkθ1 − θ2k∞ for all pairs of value functions (θ1, θ2).

See Section 5.5.1 for the proof of Lemma 6.

We are now in a position to establish the theorem, where we now use the shorthand θb ≡ θbMoM for convenience. Note that the vectors θ∗ and θb satisfy the fixed point relations

∗ ∗ θ = r + γPθ , and θb = rb+ γMc(θb), respectively. Taking differences, the error vector ∆b = θb− θ∗ satisfies the relation

∗ ∗ ∗ θb− θ = γ(Mc(∆b + θ ) − Pθ ) + rb− r ∗ ∗ ∗ = γ(Mc(∆b + θ ) − Mc(θ )) + γ(Mc − P)(θ ) + (rb− r).

Taking `∞-norms on both sides and using the triangle inequality, we have

∗ ∗ ∗ k∆b k∞ ≤ γkMc(θ + ∆)b − Mc(θ )k∞ + γ|(Mc − P)(θ )| + krb− rk∞ (i) ∗ ≤ γk∆b k∞ + γ|(Mc − P)(θ )| + krb− rk∞, where step (i) is a result of Lemma 6. Finally, applying Lemma 5 in conjunction with the Hoeffding inequality and a union bound over all D indices completes the proof.

29 5.5.1 Proof of Lemma 6 The second claim follows directly from the first by noting that

MoM MoM |TbN (θ1) − TbN (θ2)| = γ|Mc(θ1) − Mc(θ2)|.

D In order to prove the first claim, recall that for each θ ∈ R , we have Mc(θ) = med(µb1(θ),..., µbK (θ)), where the median—defined as the bK/2c-th order statistic—is taken entry-wise. By definition, for each i ∈ [K], we have  

1 X kµi(θ1) − µi(θ2)k∞ =  Zk (θ1 − θ2) b b m k∈D i ∞

1 X ≤ Zk kθ1 − θ2k∞ m k∈D i 1,∞ (i) = kθ1 − θ2k∞, where step (i) is a result of the fact that 1 P Z is a row stochastic matrix with non-negative m k∈Di k entries. Finally, we have the entry-wise bound

|Mc(θ1) − Mc(θ2)| = | med(µb1(θ1),..., µbK (θ1)) − med(µb1(θ2),..., µbK (θ2))| (ii)  kθ1 − θ2k∞ · 1, where step (ii) follows from our definition of the median as the bK/2c-th order statistic, and Lemma 7 to follow. This completes the proof of Lemma 6.

Lemma 7. For each pair of vectors (u, v) of dimension D and each index i ∈ [D], we have

|u(i) − v(i)| ≤ ku − vk∞.

Proof. Assume without loss of generality that the entries of u are sorted in increasing order (so that u1 ≤ u2 ≤ ... ≤ uD), and let w denote a vector containing the entries of v sorted in increasing order. We then have

(i) |u(i) − v(i)| = |ui − wi| ≤ ku − wk∞ ≤ ku − vk∞, where step (i) follows from the rearrangement inequality applied to the `∞-norm [Vin90].

6 Discussion

Our work investigates the local minimax complexity of value function estimation in Markov reward processes. Our upper bounds are instance-dependent, and we also provide minimax lower bounds that hold over natural subsets of the parameter space. The plug-in approach is shown to be optimal over the class of MRPs with bounded rewards, and a variant based on the median-of-means device achieves optimality over the class of MRPs having value functions with bounded variance.

30 Our results also leave a few interesting technical questions unresolved. Let us start with two inter-related questions: Is Corollary 1(a) sharp, say up to a logarithmic factor in the dimension? Is the median-of-means approach minimax-optimal over the class of MRPs having bounded rewards? We conjecture that both of these questions can be answered in the affirmative, but there are technical challenges to overcome. For instance, while the median-of-means device is crucial to removing the span-norm dependence in Corollary 1(a), it leads to a non-linear update rule that needs to be much more carefully handled in order to ensure minimax optimality. A second set of technical questions concerns the definition of “locality” in our bounds. Are our results also sharp under alternative local minimax parameterizations (say in terms of the −1 ∗ functional k(I − γP) σ(θ )k∞)? Is there a more fine-grained lower bound analysis that shows the (sub)-optimality of these approaches, and are there better adaptive procedures for this problem? The literature on estimating functionals of discrete distributions [JVHW15] shows that additional refinements over the plug-in approach are usually beneficial; is that also the case here? There is also the related question of whether a minimax lower bound can be proved over a local neighborhood of every point θ∗. We remark that guarantees of this flavor exist in a variety of related problems in both the asymptotic and non-asymptotic settings [vdV00, CL04, ZCDL16]. Indeed, in a follow-up paper [KPR+20] with a superset of the current authors, we have shown such a local lower bound for this problem, which is achieved via stochastic approximation coupled with a variance reduction device. In a complementary direction, another interesting question is to ask how function approximation affects these bounds. Our techniques should be useful in answering some of these questions, and also more broadly in proving analogous guarantees in the more challenging policy optimization setting. Finally, there is the question of removing our assumption on the generative model: How does the plug-in estimator behave when it is computed on a sampled trajectory of the system? A classical solution is the method of simulating the generative model from such samples [Yu94]: given a sampled trajectory, chop it into pieces of length (roughly) equal to the mixing time of the Markov chain, and to treat the respective first sample from each of these pieces as (approximately) inde- pendent. But clearly, this approach is somewhat wasteful, and there have been recent refinements in related problems when the mixing time can become arbitrarily large [SMT+18]. It would be 2 interesting to explore these approaches and derive instance-dependent guarantees in the Lµ-norm, where µ is the stationary distribution of the Markov chain.

Acknowledgements AP was supported in part by a research fellowship from the Simons Institute for the Theory of Computing. MJW and AP were partially supported by National Science Foundation Grant DMS- 1612948 and Office of Naval Research Grant ONR-N00014-18-1-2640.

A Dependence of plug-in error on span semi-norm

In this section, we state and prove a proposition that provides a family of MRPs in which the `∞-error of the plug-in estimator can be completely characterized by the span semi-norm of the optimal value function.

31 Proposition 1. Suppose that the rewards are observed noiselessly, with ρ(r) = 0. There is a pair of universal positive constants (c1, c2) such that for any triple of positive scalars (ζ, N, D), there is a D-state MRP for which

∗ ∗ kσ(θ )k∞ 3 ζ kθ k∞ = ζ and ≤ √ · , (40) N D N and for which the error of the plug-in estimator satisfies

(a) (b) ζ h ∗ i ζ log(1 + D/3)  −1 c1γ ≤ kθbplug − θ k∞ ≤ c2γ (log log(1 + D/3)) ∧ 1 . (41) N E N A few comments are in order. First, note that equation (40) guarantees that we have

1 −1 ∗ 1 1 ∗ √ · k(I − γP) σ(θ )k∞ . √ kθ kspan N D (1 − γ)N for large values of the dimension D, so that the first term in the guarantee (5b) is dominated by 1 the second. In particular, suppose that D  (1−γ)2 ; then we have

1 −1 ∗ 1 ∗ √ · k(I − γP) σ(θ )k∞  kθ kspan. N N In other words, if our analysis was loose in that the error of the plug-in estimator depended only on the functional √1 · k(I − γP)−1σ(θ∗)k , then it would be impossible to prove a lower bound N ∞ ∗ kθ kspan that involves the quantity N . On the other hand, equation (41) shows that this such a lower ∗ kθ kspan bound can indeed be proved: the plug-in error is characterized precisely by the quantity γ N up to a logarithmic factor in the dimension D. Second, note that while equation (41) shows that the plug-in error must have some span semi- norm dependence, it falls short of showing the stronger lower bound

ζ h ∗ i c1 ≤ kθbplug − θ k∞ , (42) (1 − γ)N E which would show, for instance, that Corollary 1(a) is sharp up to a logarithmic factor. We conjecture that there is an MRP for which the bound (42) holds. Finally, it is worth commenting on the logarithmic factor that appears in the upper bound of equation (41). Note that for sufficiently large D, the logarithmic factor is proportional to log D/ log log D. This is consequence of applying Bennett’s inequality instead of Bernstein’s in- equality, and we conjecture that the same factor ought to replace the factor log D factor multiplying the span semi-norm in the upper bound (5b).

A.1 Proof of Proposition 1 In order to prove Proposition 1, it suffices to construct an MRP satisfying condition (40) and compute its plug-in estimator in closed form. With this goal in mind, suppose that for simplicity that D is divisible by three, and consider D/3 copies of the 3-state MRP from Figure 3(a). By ∗ p ∗ µ 10 construction, we have kσ(θ )k∞ = µ q(1 − q), and kθ kspan = (1−γ) . Setting q = ND , we see that µ condition (40) is immediately satisfied with ζ = (1−γ) .

32 It remains to verify the claim (41). Note that the plug-in estimator for this MRP can be computed in closed form. In particular, it is straightforward to verify that for each state i having reward µ/2, we have

N(1 − γ)  ∗ d θbplug(i) − θ = Bin(N, q) − Nq, (43) γµ i where we have used the notation θbplug(i) to denote the i-th entry of the vector θbplug. Furthermore, these D/3 random variables are independent. Thus, the (scaled) `∞-error of the plug-in estimator is equal to the maximum absolute deviation in a collection of independent binomial random variables.

Proof of inequality (41), part (a): The following technical lemma provides a lower bound on the deviation of binomials, and its proof is postponed to the end of this section.

1  Lemma 8. Let X1,...,Xk denote independent random variables with distribution Bin n, 3kn . Let Yj = Xj − E[Xj] for each 1 ≤ j ≤ k. Then, we have

  4 E max |Yj| ≥ . 1≤j≤k 9

Applying Lemma 8 with k = D/3 in conjunction with the characterization (43), and substituting our choices of the pair (µ, q) yields

h ∗ i 4 ζγ kθbplug − θ k∞ ≥ · . E 9 N

Proof of inequality (41), part (b): Corollary 3.1(ii) and Lemma 3.3 of Wellner [Wel17] yield, to the best of our knowledge, the sharpest available upper bound on the maximum absolute deviation of Bin(n, q) random variables in the regime nq(1 − q)  1:

  √ log(1 + k) E max |Yj| ≤ 12 · if log(1 + k) ≥ 5. (44) 1≤j≤k log log(1 + k)

Combining this bound with the Bernstein bound when k is small, and substituting the various quantities completes the proof.

1 Proof of Lemma 8: Employing the shorthand q = 3kn , we have     E max |Yj| ≥ (1 − nq) · Pr max Xj ≥ 1 1≤j≤k 1≤j≤k = (1 − nq) · (1 − (1 − q)nk)  s  2  1 3nk ≥ · 1 − 3 1 − 3  3nk  2   4 ≥ · 1 − e−1/3 ≥ . 3 9

33 B Calculations for the “hard” sub-class

Recall from equation (34) our previous calculation of the value function and standard deviation, from which we have

pp(1 − p) pp(1 − p) kσ(θ∗)k = ν(1 − τ) , k(I − γP)−1σ(θ∗)k = ν(1 − τ) , ∞ 1 − γp ∞ (1 − γp)2

∗ 1 4γ−1 α and kθ kspan = ν(1 − τ) 1−γp . Substituting in our choices ν = 1, p = 3γ , and τ = 1 − (1 − γ) and simplifying by employing inequality (36), we have

 1 0.5−α  1 1.5−α  1 1−α kσ(θ∗)k ∼ , k(I − γP)−1σ(θ∗)k ∼ , and kθ∗k ∼ , ∞ 1 − γ ∞ 1 − γ span 1 − γ

1 for each discount factor γ ≥ 2 . Here, the ∼ notation indicates that the LHS can be sandwiched between two terms that are proportional to the RHS such that the factors of proportionality are strictly positive and γ-independent. For the plug-in estimator, its performance will be determined by the maximum of the two terms

k(I − γP)−1σ(θ∗)k 1  1 1.5−α kθ∗k 1  1 2−α √ ∞ ∼ √ and span ∼ . N N 1 − γ (1 − γ)N N 1 − γ

1 In the regime N % 1−γ , the first term will be dominant.

References

[AJK19] A. Agarwal, N. Jiang, and S. M. Kakade. Reinforcement learning: Theory and algo- rithms. Technical Report, CS Department, UW Seattle, 2019.

[AKY20] A. Agarwal, S. M. Kakade, and L. F. Yang. Model-based reinforcement learning with a generative model is minimax optimal. In Conference on Learning Theory, pages 67–83, 2020.

[AMGK11] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. J. Kappen. Speedy Q-learning. In Advances in Neural Information Processing Systems, pages 2411–2419, 2011.

[AMK13] M. G. Azar, R. Munos, and H. J. Kappen. Minimax PAC bounds on the sample com- plexity of reinforcement learning with a generative model. Machine Learning, 91:325– 349, 2013.

[AOM17] M. G. Azar, I. Osband, and R. Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2017.

[BB96] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference learning. Machine learning, 22(1-3):33–57, 1996.

[BE02] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of machine learning research, 2(Mar):499–526, 2002.

34 [Ber95a] D. P. Bertsekas. Dynamic programming and stochastic control, volume 1. Athena Scientific, Belmont, MA, 1995.

[Ber95b] D.P. Bertsekas. Dynamic programming and stochastic control, volume 2. Athena Sci- entific, Belmont, MA, 1995.

[BGSL08] P. Balakrishna, R. Ganesan, L. Sherry, and B. S. Levy. Estimating taxi-out times with a reinforcement learning algorithm. In Proceedings of the IEEE/AIAA 27th Digital Avionics Systems Conference, pages 3–D. IEEE, 2008.

[BM00] V. S. Borkar and S. P. Meyn. The ODE method for convergence of stochastic ap- proximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447–469, 2000.

[BM11] F. Bach and E. Moulines. Non-asymptotic analysis of stochastic optimization algo- rithms for machine learning. In Advances in neural information processing systems, December 2011.

[Bor98] V. S. Borkar. Asynchronous stochastic approximations. SIAM Journal on Control and Optimization, 36(3):840–851, 1998.

[Boy02] J. A. Boyan. Technical update: Least-squares temporal difference learning. Machine learning, 49(2-3):233–246, 2002.

[BRS18] J. Bhandari, D. Russo, and R. Singal. A finite time analysis of temporal difference learning with linear function approximation. arXiv preprint arXiv:1806.02450, 2018.

[BT96] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-dynamic programming. Athena Scientific, 1996.

[CFMW19] Y. Chen, J. Fan, C. Ma, and K. Wang. Spectral method and regularized mle are both optimal for top-k . The Annals of Statistics, 47(4):2204–2235, 2019.

[CL04] T. T. Cai and M. G. Low. An adaptation theory for nonparametric confidence intervals. The Annals of statistics, 32(5):1805–1840, 2004.

[CMSS20] Z. Chen, S. T. Maguluri, S. Shakkottai, and K. Shanmugam. Finite-sample anal- ysis of stochastic approximation using smooth convex envelopes. arXiv preprint arXiv:2002.00874, 2020.

[DB15] C. Dann and E. Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, pages 2818–2826, 2015.

[dlPG12] V. de la Pena and E. Gin´e. Decoupling: from dependence to independence. Springer Science & Business Media, 2012.

[DM17] A. M. Devraj and S. P. Meyn. Fastest convergence for Q-learning. arXiv preprint arXiv:1707.03770, 2017.

35 [DMR19] T. T. Doan, S. T. Maguluri, and J. Romberg. Finite-time performance of distributed temporal difference learning with linear function approximation. arXiv preprint arXiv:1907.12530, 2019.

[DNP14] C. Dann, G. Neumann, and J. Peters. Policy evaluation with temporal differences: A survey and comparison. The Journal of Machine Learning Research, 15(1):809–883, 2014.

[DSTM18] G. Dalal, B. Sz¨or´enyi, G. Thoppe, and S. Mannor. Finite sample analyses for TD(0) with function approximation. In Thirty-Second AAAI Conference on Artificial Intelli- gence, 2018.

[Dur99] R. Durrett. Essentials of stochastic processes, volume 1. Springer, 1999.

[EDM03] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of machine learning research, 5:1–25, 2003.

[Fel66] W. Feller. An Introduction to Probability Theory and its Applications: Volume II. John Wiley and Sons, New York, 1966.

[FMP08] J. Frank, S. Mannor, and D. Precup. Reinforcement learning in the presence of rare events. In Proceedings of the International conference on Machine learning, pages 336– 343. ACM, 2008.

[Gos09] A. Gosavi. Reinforcement learning: A tutorial survey and recent advances. INFORMS Journal on Computing, 21(2):178–192, 2009.

[HJ85] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, Cam- bridge, 1985.

[JA18] N. Jiang and A. Agarwal. Open problem: The dependence of sample complexity lower bounds on planning horizon. In Conference On Learning Theory, pages 3395–3398, 2018.

[JJS94] T. Jaakkola, M. I. Jordan, and S. P. Singh. Convergence of stochastic iterative dynamic programming algorithms. In Advances in neural information processing systems, pages 703–710, 1994.

[JVHW15] J. Jiao, K. Venkat, Y. Han, and T. Weissman. Minimax estimation of functionals of discrete distributions. IEEE Transactions on Information Theory, 61(5):2835–2885, 2015.

[JVV86] M. R. Jerrum, L. G. Valiant, and V. V. Vazirani. Random generation of combinatorial structures from a uniform distribution. Theoretical Computer Science, 43:169–188, 1986.

[Kak03] S. M. Kakade. On the sample complexity of reinforcement learning. PhD thesis, Gatsby Computational Neuroscience Unit, 2003.

36 [KPR+20] K. Khamaru, A. Pananjady, F. Ruan, M. J. Wainwright, and M. I. Jordan. Is tem- poral difference learning optimal? An instance-dependent analysis. arXiv preprint arXiv:2003.07337, 2020.

[KS99] M. Kearns and S. Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. In Advances in neural information processing systems, 1999.

[LH14] T. Lattimore and M. Hutter. Near-optimal PAC bounds for discounted MDPs. Theo- retical Computer Science, 558:125–143, 2014.

[LL20] G. Lecu´eand M. Lerasle. Robust machine learning by median-of-means: theory and practice. Annals of Statistics, 48(2):906–931, 2020.

[LS18] C. Lakshminarayanan and C. Szepesv´ari. Linear stochastic approximation: How far does constant step-size and iterate averaging go? In Proceedings of the International Conference on Artificial Intelligence and Statistics, pages 1347–1355, 2018.

[MMM14] O.-A. Maillard, T. A. Mann, and S. Mannor. How hard is my MDP? the distribution- norm to the rescue. In Advances in Neural Information Processing Systems, pages 1835–1843, 2014.

[MWCC18] C. Ma, K. Wang, Y. Chi, and Y. Chen. Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval and matrix com- pletion. In Proceedings of the International Conference on Machine Learning, pages 3345–3354, 2018.

[NJLS09] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Jour. Opt., 19(4):1574–1609, 2009.

[NY83] A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley and Sons, New York, 1983.

[PJ92] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control and Optimization, 30(4):838–855, 1992.

[PPH16] J. Pazis, R. E. Parr, and J. P. How. Improving PAC exploration using the median of means. In Advances in Neural Information Processing Systems, pages 3898–3906, 2016.

[Put05] M. L. Puterman. Markov decision processes: Discrete stochastic dynamic programming. Wiley, 2005.

[SB18] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 2nd edition, 2018.

[SJ19] M. Simchowitz and K. G. Jamieson. Non-asymptotic gap-dependent regret bounds for tabular MDPs. In Advances in Neural Information Processing Systems, pages 1151– 1160, 2019.

[SMT+18] M. Simchowitz, H. Mania, S. Tu, M. I. Jordan, and B. Recht. Learning without mixing: Towards a sharp analysis of linear system identification. In Conference on Learning Theory, pages 439–473, 2018.

37 [SSM07] D. Silver, R. S. Sutton, and M. M¨uller. Reinforcement learning of local shape in the game of Go. In Proceedings of the International Joint Conferences on Artificial Intelligence, volume 7, pages 1053–1058, 2007.

[Sut88] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine learning, 3(1):9–44, 1988.

[SWW+18] A. Sidford, M. Wang, X. Wu, L. Yang, and Y. Ye. Near-optimal time and sample complexities for solving Markov decision processes with a generative model. In Advances in Neural Information Processing Systems, 2018.

[SWWY18] A. Sidford, M. Wang, X. Wu, and Y. Ye. Variance reduced value iteration and faster algorithms for solving Markov decision processes. In Proceedings of the Symposium on Discrete Algorithms (SODA), 2018.

[SY19] R. Srikant and L. Ying. Finite-time error bounds for linear stochastic approximation and TD learning. In Conference on Learning Theory, pages 2803–2830, 2019.

[Sze09] C. Szepesv´ari. Algorithms for reinforcement learning. Morgan-Claypool, 2009.

[Tad04] V. B. Tadic. On the almost sure rate of convergence of linear stochastic approximation algorithms. IEEE Transactions on Information Theory, 50(2):401–409, 2004.

[TVR97] J. N. Tsitsiklis and B. Van Roy. Analysis of temporal-diffference learning with function approximation. In Advances in neural information processing systems, pages 1075– 1081, 1997.

[UKM+08] T. Ueno, M. Kawanabe, T. Mori, S. Maeda, and S. Ishii. A semiparametric statistical approach to model-free policy evaluation. In Proceedings of the International Confer- ence on Machine Learning, pages 1072–1079, 2008.

[vdV00] A. W. van der Vaart. Asymptotic statistics, volume 3. Cambridge university press, 2000.

[Vin90] A. Vince. A rearrangement inequality and the permutahedron. Amer. Math. Monthly, 97(4):319–323, 1990.

[Wai19a] M. J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint. Cam- bridge University Press, Cambridge, UK, 2019.

[Wai19b] M. J. Wainwright. Stochastic approximation with cone-contractive operators: Sharper `∞-bounds for Q-learning. arXiv preprint arXiv:1905.06265, 2019. [Wai19c] M. J. Wainwright. Variance-reduced Q-learning is minimax optimal. arXiv preprint arXiv:1906.04697, 2019.

[Wel17] J. A. Wellner. The Bennett-Orlicz norm. Sankhya A, 79(2):355–383, 2017.

[WOS+03] T. Weissman, E. Ordentlich, G. Seroussi, S. Verdu, and M. J. Weinberger. Inequalities for the L1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep, 2003.

38 [WW20] Y. Wei and M. J. Wainwright. The local geometry of testing in ellipses: Tight con- trol via localized Kolmogorov widths. IEEE Transactions on Information Theory, 66(8):5110–5129, 2020.

[Yu94] B. Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, pages 94–116, 1994.

[ZB19] A. Zanette and E. Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In Proceedings of the International Conference on Machine Learning, volume 97, pages 7304–7312, 2019.

[ZCDL16] Y. Zhu, S. Chatterjee, J. Duchi, and J. Lafferty. Local minimax complexity of stochastic convex optimization. In Advances in Neural Information Processing Systems, pages 3431–3439, 2016.

[ZKB19] A. Zanette, M. J. Kochenderfer, and E. Brunskill. Almost horizon-free structure-aware best policy identification with a generative model. In Advances in Neural Information Processing Systems, pages 5625–5634, 2019.

39