Risk Bounds and Rademacher Complexity in Batch Reinforcement Learning

Yaqi Duan 1 Chi Jin 2 Zhiyuan Li 3

*Equal contribution. 1 Department of Operations Research and Financial Engineering, Princeton University. 2 Department of Electrical and Computer Engineering, Princeton University. 3 Department of Computer Science, Princeton University. Correspondence to: Yaqi Duan, Chi Jin, Zhiyuan Li.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

This paper considers batch Reinforcement Learning (RL) with general value function approximation. Our study investigates the minimal assumptions needed to reliably estimate or minimize the Bellman error, and characterizes the generalization performance by (local) Rademacher complexities of general function classes, which makes initial steps in bridging the gap between statistical learning theory and batch RL. Concretely, we view the Bellman error as a surrogate loss for the optimality gap, and prove the following: (1) In the double sampling regime, the excess risk of the Empirical Risk Minimizer (ERM) is bounded by the Rademacher complexity of the function class. (2) In the single sampling regime, sample-efficient risk minimization is not possible without further assumptions, regardless of the algorithm. However, with completeness assumptions, the excess risk of FQI and a minimax-style algorithm can again be bounded by the Rademacher complexity of the corresponding function classes. (3) Fast statistical rates can be achieved by using tools of local Rademacher complexity. Our analysis covers a wide range of function classes, including finite classes, linear spaces, kernel spaces, sparse linear features, etc.

1. Introduction

Statistical learning theory, since its introduction in the late 1960's, has become one of the most important frameworks in machine learning for studying problems of inference or function estimation from a given collection of data (Hastie et al., 2009; Vapnik, 2013; James et al., 2013). The development of statistical learning has led to a series of new popular algorithms including support vector machines (Cortes & Vapnik, 1995; Suykens & Vandewalle, 1999) and boosting (Freund et al., 1996; Schapire, 1999), as well as many successful applications in fields such as computer vision (Szeliski, 2010; Forsyth & Ponce, 2012), speech recognition (Juang & Rabiner, 1991; Jelinek, 1997), and bioinformatics (Baldi et al., 2001).

Notably, in the area of supervised learning, a considerable amount of effort has been spent on obtaining sharp risk bounds. These are valuable, for instance, in the problem of model selection—choosing a model of suitable complexity. Typically, these risk bounds characterize the excess risk—the suboptimality of the learned function compared to the best function within a given function class—via proper complexity measures of that function class. After a long line of extensive research (Vapnik, 2013; Vapnik & Chervonenkis, 2015; Bartlett et al., 2005; 2006), risk bounds have been proved under very weak assumptions which do not require realizability—that the prespecified function class contains the ground truth. Complexity measures for general function classes have also been developed, including but not limited to metric entropy (Dudley, 1974), VC dimension (Vapnik & Chervonenkis, 2015), and Rademacher complexity (Bartlett & Mendelson, 2002). (See, e.g., Wainwright (2019) for a textbook review.)

Concurrently, batch reinforcement learning (Lange et al., 2012; Levine et al., 2020)—a branch of Reinforcement Learning (RL) that learns from offline data—has been developed independently. This paper considers the value function approximation setting, where the learning agent aims to approximate the optimal value function from a restricted function class that encodes the prior knowledge. Batch RL with value function approximation provides an important foundation for the empirical success of modern RL, and leads to the design of many popular algorithms such as DQN (Mnih et al., 2015) and Fitted Q-Iteration with neural networks (Riedmiller, 2005; Fan et al., 2020).

Despite being a special case of supervised learning, batch RL also brings several unique challenges due to the additional requirement of learning the rich temporal structure within the data. Addressing these unique challenges has been the main focus of the field so far (Levine et al., 2020). Consequently, the fields of statistical learning and batch RL have developed relatively in parallel. In contrast to the mild assumptions required and the generic function classes allowed in classical statistical learning theory, a majority of batch RL results (Munos & Szepesvári, 2008; Antos et al., 2008; Lazaric et al., 2012; Chen & Jiang, 2019) remain under rather strong assumptions which rarely hold in practice, and are applicable only to a restricted set of function classes. This raises a natural question: can we bring the rich knowledge in statistical learning theory to advance our understanding of batch RL?

This paper makes initial steps in bridging the gap between statistical learning theory and batch RL. We investigate the minimal assumptions required to reliably estimate or minimize the Bellman error, and characterize the generalization performance of batch RL algorithms by (local) Rademacher complexities of general function classes. Concretely, we establish conditions under which the Bellman error can be viewed as a surrogate loss for the optimality gap in values. We then bound the excess risk measured in Bellman error. We prove the following:

• In the double sampling regime, the excess risk of a simple Empirical Risk Minimizer (ERM) is bounded by the Rademacher complexity of the function class, under almost no assumptions.

• In the single sampling regime, without further assumptions, no algorithm can achieve small excess risk in the worst case unless the number of samples scales polynomially with the number of states.

• In the single sampling regime, under additional completeness assumptions, the excess risks of the Fitted Q-Iteration (FQI) algorithm and a minimax-style algorithm can again be bounded by the Rademacher complexity of the corresponding function classes.

• Fast statistical rates can be achieved by using tools of local Rademacher complexity.

Finally, we specialize our generic theory to concrete examples, and show that our analysis covers a wide range of function classes, including finite classes, linear spaces, kernel spaces, sparse linear features, etc.

1.1. Related Work

We restrict the discussion in this section to RL results under function approximation.

Batch RL  There exists a stream of literature on finite-sample guarantees for batch RL with value function approximation. Among these works, fitted value iteration (Munos & Szepesvári, 2008) and policy iteration (Antos et al., 2008; Farahmand et al., 2008; Lazaric et al., 2012; Farahmand et al., 2016; Le et al., 2019) are canonical and popular approaches. When using a linear function space, the sample complexity of batch RL is shown to depend on the dimension (Lazaric et al., 2012). For general function classes, several complexity measures such as metric entropy and VC dimension have been used to bound the performance of fitted value iteration and policy iteration (Munos & Szepesvári, 2008; Antos et al., 2008; Farahmand et al., 2016).

Throughout the existing theoretical studies of batch RL, concentrability, realizability, and completeness assumptions are commonly used to prove polynomial sample complexity. Chen & Jiang (2019) justify the necessity of low concentrability and debate the roles of realizability and completeness. Xie & Jiang (2020a) develop an algorithm that relies only on the realizability of the optimal Q-function and circumvents the completeness condition; however, they use a stronger concentrability assumption and their error bound has a slower convergence rate. While the analyses in Chen & Jiang (2019) and Xie & Jiang (2020a) are restricted to discrete function classes with a finite number of elements, Wang et al. (2020a) investigate value function approximation with linear spaces, and show that data coverage and realizability conditions are not sufficient for polynomial sample complexity in the linear case.

Off-policy evaluation  Off-policy evaluation (OPE) refers to the estimation of the value function from offline data (Precup, 2000; Precup et al., 2001; Xie et al., 2019; Uehara et al., 2020; Kallus & Uehara, 2020; Yin et al., 2020; Uehara et al., 2021), and can be viewed as a subroutine of batch RL. Combining OPE with policy improvement leads to policy-iteration-based or actor-critic algorithms (Dann et al., 2014). OPE is considered a simpler problem than batch RL, and its analyses do not directly translate into guarantees for batch RL.

Online RL  RL in the online mode is in general a more difficult problem than batch RL. The role of value function approximation in online RL remains largely unclear, and calls for better tools to measure the capacity of a function class in an online manner. In the past few years, there have been investigations in this direction, including the use of Bellman rank (Jiang et al., 2017) and Eluder dimension (Wang et al., 2020b) to characterize the hardness of RL problems.

1.2. Notation

For any integer $K > 0$, let $[K]$ be the collection $\{1, 2, \ldots, K\}$. We use $\mathbb{1}[\cdot]$ to denote the indicator function. For any function $q(\cdot)$ and any measure $\rho$ over the domain of $q$, we define the norm $\|\cdot\|_\rho$ by $\|q\|_\rho^2 := \mathbb{E}_{x\sim\rho}\, q^2(x)$. Let $\rho_1$ be a measure over $\mathcal{X}_1$ and $\rho_2(\cdot \mid x_1)$ be a conditional distribution over $\mathcal{X}_2$. Define $\rho_1 \times \rho_2$ as the joint distribution over $\mathcal{X}_1 \times \mathcal{X}_2$ given by $(\rho_1 \times \rho_2)(x_1, x_2) := \rho_1(x_1)\,\rho_2(x_2 \mid x_1)$. For any finite set $\mathcal{X}$, let $\mathrm{Unif}(\mathcal{X})$ denote the uniform distribution over $\mathcal{X}$.

2. Preliminaries

We consider the setting of an episodic Markov decision process MDP($\mathcal{S}, \mathcal{A}, H, \mathbb{P}, r$), where $\mathcal{S}$ is the set of states, which possibly has infinitely many elements; $\mathcal{A}$ is a finite set of actions with $|\mathcal{A}| = A$; $H$ is the number of steps in each episode; $\mathbb{P}_h(\cdot \mid s, a)$ gives the distribution over the next state if action $a$ is taken from state $s$ at step $h \in [H]$; and $r_h : \mathcal{S} \times \mathcal{A} \to [0, 1]$ is the deterministic reward function at step $h$. (While we study deterministic reward functions for notational simplicity, our results generalize to randomized reward functions; rewards are assumed to lie in $[0,1]$ for normalization.)

In each episode of an MDP, we start with a fixed initial state $s_1$. Then, at each step $h \in [H]$, the agent observes state $s_h \in \mathcal{S}$, picks an action $a_h \in \mathcal{A}$, receives reward $r_h(s_h, a_h)$, and transitions to the next state $s_{h+1}$, which is drawn from the distribution $\mathbb{P}_h(\cdot \mid s_h, a_h)$. Without loss of generality, we assume there is a terminating state $s_{\mathrm{end}}$ to which the environment always transits at step $H+1$, and the episode terminates when $s_{\mathrm{end}}$ is reached.

A (non-stationary, stochastic) policy $\pi$ is a collection of $H$ functions $\{\pi_h : \mathcal{S} \to \Delta_{\mathcal{A}}\}_{h\in[H]}$, where $\Delta_{\mathcal{A}}$ is the probability simplex over the action set $\mathcal{A}$. We denote by $\pi_h(\cdot \mid s)$ the action distribution of policy $\pi$ at state $s$ and time $h$. Let $V_h^\pi : \mathcal{S} \to \mathbb{R}$ denote the value function at step $h$ under policy $\pi$, which gives the expected sum of remaining rewards received under policy $\pi$, starting from $s_h = s$, until the end of the episode. That is,
$$V_h^\pi(s) := \mathbb{E}_\pi\Big[\sum_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'}) \,\Big|\, s_h = s\Big].$$
Accordingly, the action-value function $Q_h^\pi : \mathcal{S}\times\mathcal{A} \to \mathbb{R}$ at step $h$ is defined as
$$Q_h^\pi(s, a) := \mathbb{E}_\pi\Big[\sum_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'}) \,\Big|\, s_h = s,\, a_h = a\Big].$$
Since the action space and the horizon are finite, there always exists (see, e.g., Puterman (2014)) an optimal policy $\pi^\star$ which gives the optimal value $V_h^\star(s) = \sup_\pi V_h^\pi(s)$ for all $s \in \mathcal{S}$ and $h \in [H]$.

For notational convenience, we use the shorthands $\mathbb{P}_h, \mathbb{P}_h^\pi, \mathbb{P}_h^\star$ as follows, where $(s, a)$ is the state-action pair for the current step and $(s', a')$ is the state-action pair for the next step:
$$[\mathbb{P}_h V](s, a) := \mathbb{E}\big[V(s') \mid s, a\big], \qquad [\mathbb{P}_h^\pi Q](s, a) := \mathbb{E}_\pi\big[Q(s', a') \mid s, a\big], \qquad [\mathbb{P}_h^\star Q](s, a) := \mathbb{E}\big[\max_{a'} Q(s', a') \mid s, a\big].$$
We further define Bellman operators $\mathcal{T}_h^\pi, \mathcal{T}_h^\star : \mathbb{R}^{\mathcal{S}\times\mathcal{A}} \to \mathbb{R}^{\mathcal{S}\times\mathcal{A}}$ for $h \in [H]$ as
$$(\mathcal{T}_h^\pi Q)(s, a) := (r_h + \mathbb{P}_h^\pi Q)(s, a), \qquad (\mathcal{T}_h^\star Q)(s, a) := (r_h + \mathbb{P}_h^\star Q)(s, a).$$
Then the Bellman equation and the Bellman optimality equation can be written as
$$Q_h^\pi(s, a) = (\mathcal{T}_h^\pi Q_{h+1}^\pi)(s, a), \qquad Q_h^\star(s, a) = (\mathcal{T}_h^\star Q_{h+1}^\star)(s, a).$$
The objective of RL is to find a near-optimal policy, where the suboptimality is measured by $V_1^\star(s_1) - V_1^\pi(s_1)$. Accordingly, we have the following definition of an $\epsilon$-optimal policy.

Definition 2.1 ($\epsilon$-optimal policy). We say a policy $\pi$ is $\epsilon$-optimal if $V_1^\star(s_1) - V_1^\pi(s_1) \le \epsilon$.

2.1. (Local) Rademacher complexity

In this paper, we leverage Rademacher complexity to characterize the complexity of a function class. For a generic real-valued function space $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ and $n$ fixed data points $X = \{x_1, \ldots, x_n\} \in \mathcal{X}^n$, the empirical Rademacher complexity is defined as
$$\widehat{\mathcal{R}}_X(\mathcal{F}) := \mathbb{E}\Big[\sup_{f\in\mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} \sigma_i f(x_i) \,\Big|\, X\Big],$$
where $\sigma_i \sim \mathrm{Uniform}(\{-1, 1\})$ are i.i.d. Rademacher random variables and the expectation is taken with respect to the randomness in $\{\sigma_i\}_{i=1}^n$. Let $\rho$ be the underlying distribution of $x_i$. We further define the population Rademacher complexity $\mathcal{R}_n^\rho(\mathcal{F}) := \mathbb{E}_\rho[\widehat{\mathcal{R}}_X(\mathcal{F})]$, with the expectation taken over the data samples $X$. Intuitively, $\mathcal{R}_n^\rho(\mathcal{F})$ measures the complexity of $\mathcal{F}$ by the extent to which functions in the class $\mathcal{F}$ correlate with the random noise $\sigma_i$.
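To make the definition concrete, the following minimal Python sketch (our own illustration, not code from the paper) estimates the empirical Rademacher complexity of a small finite function class by Monte Carlo over Rademacher sign vectors.

```python
import numpy as np

def empirical_rademacher(F_values, num_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity R_hat_X(F).

    F_values: array of shape (num_functions, n) holding f(x_i) for each f in a
    finite class F and each of the n fixed data points x_1, ..., x_n.
    """
    rng = np.random.default_rng(seed)
    num_f, n = F_values.shape
    total = 0.0
    for _ in range(num_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)   # i.i.d. Rademacher signs
        total += np.max(F_values @ sigma) / n     # sup_f (1/n) sum_i sigma_i f(x_i)
    return total / num_draws

# Tiny illustration: 8 random bounded functions evaluated at n = 100 points.
rng = np.random.default_rng(1)
F_values = rng.uniform(-1.0, 1.0, size=(8, 100))
print(empirical_rademacher(F_values))
```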

This paper further uses the tools of local Rademacher complexity to obtain results with fast statistical rates. For a generic real-valued function space $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ and data distribution $\rho$, let $T$ be a functional $T : \mathcal{F} \to \mathbb{R}^+$. We study the local Rademacher complexity of the form
$$\mathcal{R}_n^\rho\big(\{f \in \mathcal{F} \mid T(f) \le r\}\big).$$
A crucial quantity that appears in generalization error bounds based on local Rademacher complexity is the critical radius (Bartlett et al., 2005), which we define as follows.

Definition 2.2 (Sub-root function). A function $\psi : \mathbb{R}^+ \to \mathbb{R}^+$ is sub-root if it is nondecreasing and $r \mapsto \psi(r)/\sqrt{r}$ is nonincreasing for $r > 0$.

Definition 2.3 (Critical radius of local Rademacher complexity). The critical radius of the local Rademacher complexity $\mathcal{R}_n^\rho(\{f \in \mathcal{F} \mid T(f) \le r\})$ is the infimum of the set $B$ defined as follows: $r^\star \in B$ if there exists a sub-root function $\psi$ such that $r^\star$ is the fixed point of $\psi$ and, for any $r \ge r^\star$,
$$\psi(r) \ge \mathcal{R}_n^\rho\big(\{f \in \mathcal{F} \mid T(f) \le r\}\big). \tag{1}$$

We typically obtain an upper bound on this critical radius by constructing a specific sub-root function $\psi$ satisfying (1).
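As a simple worked illustration of Definition 2.3 (our own example; this particular $\psi$ is not taken from the paper's proofs), suppose the local Rademacher complexity is bounded by a function of square-root form,
$$\mathcal{R}_n^\rho\big(\{f \in \mathcal{F} \mid T(f) \le r\}\big) \;\le\; \psi(r) := \sqrt{\frac{a\,r}{n}} + \frac{b}{n}, \qquad a, b > 0.$$
Then $\psi$ is nondecreasing and $\psi(r)/\sqrt{r} = \sqrt{a/n} + b/(n\sqrt{r})$ is nonincreasing, so $\psi$ is sub-root. Solving the fixed-point equation $r^\star = \sqrt{a r^\star/n} + b/n$ (a quadratic in $\sqrt{r^\star}$) gives
$$r^\star \;\le\; \frac{2(a+b)}{n},$$
so the critical radius scales as $1/n$—the fast-rate behavior that later appears for the parametric examples in Section 6.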
3. Batch RL with Value Function Approximation

This paper focuses on the offline setting where the data, in the form of tuples $\mathcal{D} = \{(s, a, r, s', h)\}$, are collected beforehand and given to the agent. In each tuple, $(s, a)$ are the state and action at the $h$-th step, $r$ is the resulting reward, and $s'$ is the next state sampled from $\mathbb{P}_h(\cdot \mid s, a)$. For each $h \in [H]$, we have access to $n$ i.i.d. samples whose $(s, a)$ pairs follow the marginal distribution $\mu_h$ at the $h$-th step. We denote $\mu = \mu_1 \times \mu_2 \times \ldots \times \mu_H$. For each $h \in [H]$, we further denote the marginal distribution of $s'$ in tuple $(s, a, s', h)$ by $\nu_h$, and let $\nu = \nu_1 \times \nu_2 \times \ldots \times \nu_H$. Throughout this paper, we consistently use $\mu$ and $\nu$ only for the probability measures defined above.

We assume the data distribution $\mu$ is well behaved and satisfies the following assumption.

Assumption 1 (Concentrability). Given a policy $\pi$, let $P_h^\pi$ denote the marginal distribution at time step $h$, starting from $s_1$ and following $\pi$. There exists a parameter $C$ such that
$$\sup_{(s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H]} \frac{dP_h^\pi}{d\mu_h}(s, a) \le C \quad \text{for any policy } \pi.$$

Assumption 1 requires that for any state-action pair $(s, a)$, if there exists a policy $\pi$ that reaches $(s, a)$ with a decent amount of probability, then the chance that the sample $(s, a)$ appears in the dataset is not low. Intuitively, Assumption 1 ensures that the dataset $\mathcal{D}$ is representative of all the "reachable" state-action pairs. The assumption is frequently used in the batch RL literature, e.g., equation (7) in Munos (2003), Definition 5.1 in Munos (2007), Proposition 1 in Farahmand et al. (2010), and Assumption 1 in Chen & Jiang (2019). We remark that Assumption 1 is the only assumption in this paper regarding the properties of the batch data.

We consider the setting of value function approximation, where at each step $h$ we use a function $f_h$ in class $\mathcal{F}_h$ to approximate the optimal Q-value function. For notational simplicity, we denote $f := (f_1, \ldots, f_H) \in \mathcal{F}$ with $\mathcal{F} := \mathcal{F}_1 \times \cdots \times \mathcal{F}_H$. Since no reward is collected in the $(H+1)$-th step, we always use the convention that $f_{H+1} = 0$ and $\mathcal{F}_{H+1} = \{0\}$. We assume $f_h \in [-H, H]$ for any $f_h \in \mathcal{F}_h$. Each $f \in \mathcal{F}$ induces a greedy policy $\pi_f = \{\pi_{f_h}\}_{h=1}^H$, where
$$\pi_{f_h}(a \mid s) = \mathbb{1}\big[a = \arg\max_{a'} f_h(s, a')\big].$$

In value-based batch RL, we take the offline dataset $\mathcal{D}$ as input and output an estimated optimal Q-value function $f$ and the associated policy $\pi_f$. We are interested in the performance of $\pi_f$, which is measured by the suboptimality in values, i.e., $V_1^\star(s_1) - V_1^{\pi_f}(s_1)$. However, this gap is highly nonsmooth in $f$, similar to the case of supervised learning where the 0-1 losses for classification tasks are also highly nonsmooth and intractable. To mitigate this issue, a popular approach is to use a surrogate loss—the Bellman error.

Definition 3.1 (Bellman error). Under data distribution $\mu$, we define the Bellman error of a function $f = (f_1, \ldots, f_H)$ as
$$\mathcal{E}(f) := \frac{1}{H}\sum_{h=1}^{H} \big\|f_h - \mathcal{T}_h^\star f_{h+1}\big\|_{\mu_h}^2. \tag{2}$$

The Bellman error $\mathcal{E}(f)$ appears in many classical RL algorithms, including Bellman residual minimization (BRM) (Antos et al., 2008) and least-squares temporal difference (LSTD) learning (Bradtke & Barto, 1996; Lazaric et al., 2012).

The following lemma shows that, under Assumption 1, one can control the suboptimality in values by the Bellman error.

Lemma 3.2 (Bellman error to value suboptimality). Under Assumption 1, for any $f \in \mathcal{F}$, we have
$$V_1^\star(s_1) - V_1^{\pi_f}(s_1) \le 2H\sqrt{C \cdot \mathcal{E}(f)}, \tag{3}$$
where $C$ is the concentrability coefficient in Assumption 1.

Therefore, the Bellman error $\mathcal{E}(f)$ is indeed a surrogate loss for the suboptimality of $\pi_f$ under mild conditions. In the next two sections, we focus on designing efficient algorithms that minimize the Bellman error.

4. Results for Double Sampling Regime

As a starting point for Bellman error minimization, we consider an empirical version of $\mathcal{E}(f)$ computed from samples. A natural choice of this empirical proxy is
$$\hat{L}_B(f) := \frac{1}{nH} \sum_{(s,a,r,s',h)\in\mathcal{D}} \big(f_h(s, a) - r - V_{f_{h+1}}(s')\big)^2, \tag{4}$$
where $V_{f_{h+1}}(s) := \max_{a\in\mathcal{A}} f_{h+1}(s, a)$.
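For concreteness, here is a minimal sketch (our own illustration; the tabular representation, array shapes, and dataset layout are assumptions, not part of the paper) of the greedy policy induced by $f$ and the naive empirical loss $\hat{L}_B$ in (4), with each $f_h$ represented as a table over a finite state-action space.

```python
import numpy as np

def greedy_action(f_h, s):
    """Greedy policy pi_{f_h}: pick argmax_a f_h(s, a); f_h has shape (S, A)."""
    return int(np.argmax(f_h[s]))

def V(f_next, s):
    """V_{f_{h+1}}(s) = max_a f_{h+1}(s, a); f_next has shape (S, A)."""
    return float(np.max(f_next[s]))

def L_B(f, data, n):
    """Naive empirical loss (4).

    f: list of H+1 tables of shape (S, A), with f[H] = zeros (f_{H+1} = 0).
    data: list of tuples (s, a, r, s_next, h) with h in {1, ..., H}.
    """
    H = len(f) - 1
    total = 0.0
    for (s, a, r, s_next, h) in data:
        residual = f[h - 1][s, a] - r - V(f[h], s_next)
        total += residual ** 2
    return total / (n * H)
```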

Unfortunately, the estimator $\hat{L}_B$ is biased due to the error-in-variable situation (Bradtke & Barto, 1996). In particular, we have the following decomposition:
$$\mathcal{E}(f) = \mathbb{E}_\mu \hat{L}_B(f) - \frac{1}{H}\sum_{h=1}^{H} \mathbb{E}_{\mu_h}\, \mathrm{Var}_{s'\sim\mathbb{P}_h(\cdot\mid s,a)}\big(V_{f_{h+1}}(s')\big). \tag{5}$$
That is, the Bellman error and the expectation of $\hat{L}_B$ differ by a variance term. This variance term is due to the stochastic transitions in the system, and it is non-negligible even when $f$ approximates the optimal value function $Q^\star$. A direct fix of this problem is to estimate the variance by double samples, where two independent samples of $s_{h+1}$ are drawn when being in state $s_h$ (Baird, 1995).

Formally, in this section we consider the setting where, for any $(s, a, r, s', h)$ in dataset $\mathcal{D}$, there exists a paired tuple $(s, a, r, \tilde{s}', h)$ which shares the same state-action pair $(s, a)$ at step $h$, while $s', \tilde{s}'$ are two independent samples of the next state. Such data can be collected, for instance, if a simulator is available, or if the system allows an agent to revert to the previous step. For simplicity, we denote this dataset by $\widetilde{\mathcal{D}} = \{(s, a, r, s', \tilde{s}', h)\}$ without placing additional constraints.

We construct the following empirical risk, which further estimates the variance term in (5) via double samples:
$$\hat{L}_{DS}(f) := \frac{1}{nH} \sum_{(s,a,r,s',\tilde{s}',h)\in\widetilde{\mathcal{D}}} \Big[\big(f_h(s, a) - r - V_{f_{h+1}}(s')\big)^2 - \frac{1}{2}\big(V_{f_{h+1}}(s') - V_{f_{h+1}}(\tilde{s}')\big)^2\Big].$$
We can show that, for any fixed $f \in \mathcal{F}$, $\mathbb{E}\hat{L}_{DS}(f) = \mathcal{E}(f)$, i.e., $\hat{L}_{DS}$ is an unbiased estimator of the Bellman error. Our algorithm for this setting is simply the Empirical Risk Minimizer (ERM) over $\hat{L}_{DS}$ (see the sketch below), and we prove the following guarantee.
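The following sketch (again our own illustration, under the same tabular representation as the previous sketch) implements the double-sample loss $\hat{L}_{DS}$ and the ERM step over a finite candidate class; `V` is the helper defined above.

```python
def L_DS(f, data_ds, n):
    """Unbiased double-sample loss.

    data_ds: tuples (s, a, r, s1, s2, h), where s1 and s2 are two independent
    next-state samples drawn from P_h(. | s, a).
    """
    H = len(f) - 1
    total = 0.0
    for (s, a, r, s1, s2, h) in data_ds:
        total += (f[h - 1][s, a] - r - V(f[h], s1)) ** 2 \
                 - 0.5 * (V(f[h], s1) - V(f[h], s2)) ** 2
    return total / (n * H)

def erm(candidates, data_ds, n):
    """ERM estimator f_hat = argmin_{f in F} L_DS(f), with F a finite list here."""
    return min(candidates, key=lambda f: L_DS(f, data_ds, n))
```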

Theorem 4.1. There exists an absolute constant $c > 0$ such that, with probability at least $1 - \delta$, the ERM estimator $\hat{f} = \arg\min_{f\in\mathcal{F}} \hat{L}_{DS}(f)$ satisfies
$$\mathcal{E}(\hat{f}) \le \min_{f\in\mathcal{F}} \mathcal{E}(f) + cH^2\sqrt{\frac{\log(1/\delta)}{n}} + c\sum_{h=1}^{H}\Big(\mathcal{R}_n^{\mu_h}(\mathcal{F}_h) + \mathcal{R}_n^{\nu_h}(\mathcal{V}_{\mathcal{F}_{h+1}})\Big).$$
Here we use the shorthand $\mathcal{V}_{\mathcal{F}_{h+1}} := \{V_{f_{h+1}} \mid f_{h+1} \in \mathcal{F}_{h+1}\}$ for any $h \in [H]$.

Theorem 4.1 asserts that, in the double sampling regime, simple ERM has its excess risk $\mathcal{E}(\hat{f}) - \min_{f\in\mathcal{F}}\mathcal{E}(f)$ upper bounded by the Rademacher complexities of the function classes $\{\mathcal{F}_h\}_{h=1}^H$ and $\{\mathcal{V}_{\mathcal{F}_{h+1}}\}_{h=1}^H$, plus a small concentration term that scales as $\tilde{O}(1/\sqrt{n})$.

Most importantly, we remark that Theorem 4.1 holds without any assumption on the input data distribution or the properties of the MDP. The function class $\mathcal{F}$ can also be completely misspecified, in the sense that the optimal value function $Q^\star$ may be very far from $\mathcal{F}$. This allows Theorem 4.1 to be widely applicable.

However, a major limitation of Theorem 4.1 is its reliance on double samples. Double samples are not available in most dynamical systems that have no simulator or cannot be reverted to the previous step. In the next section, we analyze algorithms in the standard single sampling regime.

5. Results for Single Sampling Regime

In this section, we focus on batch RL in the standard single sampling regime, where each tuple $(s, a, r, s', h)$ in dataset $\mathcal{D}$ has a single next state $s'$ following $(s, a)$. We first present a sample complexity lower bound for minimizing the Bellman error, showing that in order to achieve an excess risk that does not scale polynomially with the number of states, it is inevitable to impose additional structural assumptions on the function class $\mathcal{F}$ and the MDP. We then analyze fitted Q-iteration (FQI) and a minimax estimator, respectively, under different completeness assumptions. In addition to Rademacher complexity upper bounds similar to Theorem 4.1, we also utilize localization techniques and prove bounds with faster statistical rates in these two schemes.

5.1. Lower bound

Recall that when double samples are available, the excess risk of the ERM estimator is controlled by the Rademacher complexities of the function classes (Theorem 4.1). In the single sampling regime, one natural question to ask is whether there exists an algorithm with a similar guarantee (i.e., an excess risk upper bounded by some complexity measure of the function class). Unfortunately, without further assumptions, the answer is negative.

Theorem 5.1. Let $\mathcal{A}$ be an arbitrary algorithm that takes any dataset $\mathcal{D}$ and function class $\mathcal{F}$ as input and outputs an estimator $\hat{f} \in \mathcal{F}$. For any $S \in \mathbb{N}^+$ and sample size $n \ge 0$, there exists an $S$-state, single-action MDP paired with a function class $\mathcal{F}$ with $|\mathcal{F}| = 2$ such that the $\hat{f}$ output by algorithm $\mathcal{A}$ satisfies
$$\mathbb{E}\,\mathcal{E}(\hat{f}) \ge \min_{f\in\mathcal{F}} \mathcal{E}(f) + \Omega\Big(\min\Big\{1, \Big(\frac{S}{n}\Big)^{1/2}\Big\}\Big). \tag{6}$$
Here, the expectation is taken over the randomness in $\mathcal{D}$.

Theorem 5.1 reveals a fundamental difference between the single sampling regime and the double sampling regime. The lower bound in inequality (6) depends polynomially on $S$—the cardinality of the state space—which is considered intractably large in the setting of function approximation.

In batch RL with single sampling, despite the use of the function class $\mathcal{F}$, the hardness of Bellman error minimization is still determined by the size of the state space. This also suggests that minimizing the Bellman error in the single sampling regime is intrinsically different from classic supervised learning, due to the additional temporal correlation structure present within the data.

We remark that unlike most lower bounds of similar type in prior works (Sutton & Barto, 2018; Sun et al., 2019), which only apply to certain restrictive classes of algorithms, Theorem 5.1 is completely information-theoretic and applies to any algorithm. Recently, Jiang (2019) proved the impossibility of estimating the Bellman error in the single sampling regime for all algorithms. Our result is stronger, since the impossibility of minimization implies the impossibility of estimation, but not vice versa.

To circumvent the hardness result in Theorem 5.1, additional structural assumptions are necessary. In the following, we provide statistical guarantees for two batch RL algorithms, under different completeness assumptions on $\mathcal{F}$.

5.2. Fitted Q-iteration (FQI)

We consider the classical FQI algorithm. We assume that the function class $\mathcal{F} = \mathcal{F}_1 \times \ldots \times \mathcal{F}_H$ is (approximately) closed under the optimal Bellman operators $\mathcal{T}_1^\star, \ldots, \mathcal{T}_H^\star$, which is commonly adopted by prior analyses of FQI (Munos & Szepesvári, 2008; Chen & Jiang, 2019).

Assumption 2. There exists $\epsilon > 0$ such that, for all $h \in [H]$,
$$\sup_{f_{h+1}\in\mathcal{F}_{h+1}} \inf_{f_h\in\mathcal{F}_h} \big\|f_h - \mathcal{T}_h^\star f_{h+1}\big\|_{\mu_h}^2 \le \epsilon.$$

The FQI algorithm is closely related to approximate dynamic programming (Bertsekas & Tsitsiklis, 1995). It starts by setting $\hat{f}_{H+1} := 0$ and then recursively computes Q-value functions at $h = H, H-1, \ldots, 1$. Each iteration of FQI is a least-squares regression problem based on the data collected at that time step. For $h \in [H]$, we denote by $\mathcal{D}_h$ the set of data at the $h$-th step. The details of FQI are specified in Algorithm 1 (see also the code sketch at the end of this subsection).

Algorithm 1  FQI
1: initialize $\hat{f}_{H+1} \leftarrow 0$.
2: for $h = H, H-1, \ldots, 1$ do
3:   $\hat{f}_h \leftarrow \arg\min_{f_h\in\mathcal{F}_h} \hat{\ell}_h(f_h, \hat{f}_{h+1}) := \frac{1}{n}\sum_{(s,a,r,s',h)\in\mathcal{D}_h} \big(f_h(s,a) - r - V_{\hat{f}_{h+1}}(s')\big)^2$.
4: return $\hat{f} = (\hat{f}_1, \ldots, \hat{f}_H)$.

In the following Theorem 5.2, we upper bound the excess risk of the output of FQI in terms of Rademacher complexity.

Theorem 5.2 (FQI, Rademacher complexity). There exists an absolute constant $c > 0$ such that, under Assumption 2, with probability at least $1 - \delta$, the output $\hat{f}$ of FQI satisfies
$$\mathcal{E}(\hat{f}) \le \epsilon + c\sum_{h=1}^{H} \mathcal{R}_n^{\mu_h}(\mathcal{F}_h) + cH^2\sqrt{\frac{\log(H/\delta)}{n}}. \tag{7}$$

We remark that Assumption 2 immediately implies that $\min_{f\in\mathcal{F}}\mathcal{E}(f) \le \epsilon$. Therefore, although the minimal Bellman error $\min_{f\in\mathcal{F}}\mathcal{E}(f)$ does not explicitly appear on the right-hand side, inequality (7) is still a variant of an excess risk bound.

For typical parametric function classes, the Rademacher complexity scales as $n^{-1/2}$ (see Section 6). Therefore, Theorem 5.2 guarantees that the excess risk decreases as $n^{-1/2}$, up to a constant error $\epsilon$ due to the approximate completeness in Assumption 2. However, since the Bellman error is an average of squared $\|\cdot\|_{\mu_h}$-norms (Definition 3.1), one may expect a faster statistical rate in this setting, similar to the case of linear regression. For this reason, we take advantage of localization techniques and develop sharper error bounds in Theorem 5.3.

Theorem 5.3 (FQI, local Rademacher complexity). There exists an absolute constant $c > 0$ such that, under Assumption 2, with probability at least $1 - \delta$, the output $\hat{f}$ of FQI satisfies
$$\mathcal{E}(\hat{f}) \le \epsilon + c\sqrt{\epsilon \cdot \Delta} + c\Delta, \qquad \Delta := H\sum_{h=1}^{H} r_h^\star + H^2\,\frac{\log(H/\delta)}{n}. \tag{8}$$
Here $r_h^\star$ is the critical radius of the local Rademacher complexity $\mathcal{R}_n^{\mu_h}(\{f_h \in \mathcal{F}_h \mid \|f_h - f_h^\dagger\|_{\mu_h}^2 \le r\})$ with $f_h^\dagger := \arg\min_{f_h\in\mathcal{F}_h} \|f_h - \mathcal{T}_h^\star \hat{f}_{h+1}\|_{\mu_h}$.

On the right-hand side of inequality (8), the first term $\epsilon$ measures model misspecification. The other two terms $c(\sqrt{\epsilon\cdot\Delta} + \Delta)$ can be viewed as statistical errors, since $\Delta \to 0$ as the sample size $n \to \infty$. For typical parametric function classes, the critical radius of the local Rademacher complexity scales as $n^{-1}$ (see Section 6), which decreases much faster than the standard Rademacher complexity. That is, Theorem 5.3 indeed guarantees a faster statistical rate compared to Theorem 5.2.

Finally, we remark that $f_h^\dagger$ in Theorem 5.3 depends on $\hat{f}_{h+1}$ and is therefore random. We will show in Section 6 that, for many examples, the critical radius can be upper bounded independently of the choice of $f_h^\dagger$, in which case the randomness in $f_h^\dagger$ does not affect the final results.
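The following is a minimal tabular sketch of Algorithm 1 (our own illustration; the `data_by_step` layout and the exact per-(s,a) average are simplifying assumptions—in the general function-approximation setting, the inner step is a least-squares fit over $\mathcal{F}_h$).

```python
import numpy as np
from collections import defaultdict

def fqi(data_by_step, S, A, H):
    """Fitted Q-Iteration (Algorithm 1), tabular sketch.

    data_by_step[h] is a list of tuples (s, a, r, s_next) collected at step h,
    for h = 1, ..., H. With F_h taken to be all (S, A) tables, the least-squares
    fit at each (s, a) is just the average of the targets r + max_a' f_{h+1}(s', a').
    """
    f = [np.zeros((S, A)) for _ in range(H + 2)]    # convention: f[H + 1] = 0
    for h in range(H, 0, -1):                       # h = H, H-1, ..., 1
        targets = defaultdict(list)
        for (s, a, r, s_next) in data_by_step[h]:
            targets[(s, a)].append(r + np.max(f[h + 1][s_next]))
        for (s, a), ys in targets.items():
            f[h][s, a] = np.mean(ys)
    return f[1:H + 1]                               # estimated (f_1, ..., f_H)
```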

5.3. Minimax Algorithm

The (approximate) completeness of $\mathcal{F}$ in Assumption 2 can sometimes be stringent. For instance, if a new function $f_h$ is added to $\mathcal{F}_h$, then for the sake of completeness we need to enlarge $\mathcal{F}_{h-1}$ by adding several approximations of $\mathcal{T}_{h-1}^\star f_h$; the same goes for $\mathcal{F}_{h-2}, \ldots, \mathcal{F}_1$. After amplifying the function classes one by one for each step, we may end up with an exceedingly large $\mathcal{F}$.

To avoid the issue above posed by the completeness assumption on $\mathcal{F}$, we introduce a new function class $\mathcal{G} = \mathcal{G}_1 \times \cdots \times \mathcal{G}_H$, where $\mathcal{G}_h$ consists of functions mapping from $\mathcal{S}\times\mathcal{A}$ to $[-H, H]$. We assume that for each $f_{h+1} \in \mathcal{F}_{h+1}$, one can always find a good approximation of $\mathcal{T}_h^\star f_{h+1}$ in this helper function class $\mathcal{G}_h$.

Assumption 3. There exists $\epsilon > 0$ such that, for all $h \in [H]$,
$$\sup_{f_{h+1}\in\mathcal{F}_{h+1}} \inf_{g_h\in\mathcal{G}_h} \big\|g_h - \mathcal{T}_h^\star f_{h+1}\big\|_{\mu_h}^2 \le \epsilon.$$

According to (5), we can approximate the Bellman error $\mathcal{E}(f)$ by subtracting the variance term from $\hat{L}_B(f)$. If $g_h$ is close to $\mathcal{T}_h^\star f_{h+1}$, then $\big(g_h(s, a) - r - V_{f_{h+1}}(s')\big)^2$, averaged over the data, provides a good estimate of the variance term. Following this intuition, we define a new loss
$$\hat{L}_{MM}(f, g) := \frac{1}{nH} \sum_{(s,a,r,s',h)\in\mathcal{D}} \Big[\big(f_h(s, a) - r - V_{f_{h+1}}(s')\big)^2 - \big(g_h(s, a) - r - V_{f_{h+1}}(s')\big)^2\Big].$$
The minimax algorithm (Antos et al., 2008; Chen & Jiang, 2019) then computes
$$\hat{f} := \arg\min_{f\in\mathcal{F}} \max_{g\in\mathcal{G}} \hat{L}_{MM}(f, g).$$
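Over finite candidate classes, the minimax estimator can be written down directly; the sketch below is our own illustration (the paper does not prescribe an optimization procedure), reusing the tabular helper `V` from the earlier sketch in Section 4.

```python
def L_MM(f, g, data, n):
    """Minimax loss: difference of the two squared residuals in the display above."""
    H = len(f) - 1
    total = 0.0
    for (s, a, r, s_next, h) in data:
        v_next = V(f[h], s_next)
        total += (f[h - 1][s, a] - r - v_next) ** 2 \
                 - (g[h - 1][s, a] - r - v_next) ** 2
    return total / (n * H)

def minimax_estimator(F_candidates, G_candidates, data, n):
    """f_hat = argmin_{f in F} max_{g in G} L_MM(f, g), for finite lists F and G."""
    return min(F_candidates,
               key=lambda f: max(L_MM(f, g, data, n) for g in G_candidates))
```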

Now we are ready to state our theoretical guarantees for the minimax algorithm.

Theorem 5.4 (Minimax algorithm, Rademacher complexity). There exists an absolute constant $c > 0$ such that, under Assumption 3, with probability at least $1 - \delta$, the minimax estimator $\hat{f}$ satisfies
$$\mathcal{E}(\hat{f}) \le \min_{f\in\mathcal{F}}\mathcal{E}(f) + \epsilon + cH^2\sqrt{\frac{\log(1/\delta)}{n}} + c\sum_{h=1}^{H}\Big(\mathcal{R}_n^{\mu_h}(\mathcal{F}_h) + \mathcal{R}_n^{\mu_h}(\mathcal{G}_h) + \mathcal{R}_n^{\nu_h}(\mathcal{V}_{\mathcal{F}_{h+1}})\Big).$$

As shown in Theorem 5.4, the excess risk is simultaneously controlled by the Rademacher complexities of $\{\mathcal{F}_h\}_{h=1}^H$, $\{\mathcal{G}_h\}_{h=1}^H$, and $\{\mathcal{V}_{\mathcal{F}_{h+1}}\}_{h=1}^H$.

Similar to the results for FQI, we can also develop risk bounds with a faster statistical rate using localization techniques. For technical reasons discussed shortly, we introduce the following assumption, which can be viewed as a variant of the concentrability coefficient in Assumption 1 under different initial distributions.

Assumption 4. For any policy $\pi$ and $h \in [H]$, let $P_{h,t}^\pi$ (or $\widetilde{P}_{h,t}^\pi$) denote the marginal distribution at time $t > h$, starting from $\mu_h$ at time step $h$ (or from $\nu_h \times \mathrm{Unif}(\mathcal{A})$ at $h+1$) and following $\pi$. There exists a parameter $\widetilde{C}$ such that
$$\sup_{\substack{(s,a)\in\mathcal{S}\times\mathcal{A}\\ h\in[H],\, t>h}} \Big(\frac{dP_{h,t}^\pi}{d\mu_t} \vee \frac{d\widetilde{P}_{h,t}^\pi}{d\mu_t}\Big)(s, a) \le \widetilde{C} \quad \text{for any policy } \pi.$$

For notational convenience, we define
$$f^\dagger = (f_1^\dagger, \ldots, f_H^\dagger) := \arg\min_{f\in\mathcal{F}} \mathcal{E}(f) \quad\text{and}\quad g_h^\dagger := \arg\min_{g_h\in\mathcal{G}_h} \big\|g_h - \mathcal{T}_h^\star f_{h+1}^\dagger\big\|_{\mu_h}.$$

We are now ready to state the excess risk bound of the minimax algorithm in terms of local Rademacher complexity.

Theorem 5.5 (Minimax algorithm, local Rademacher complexity). There exists an absolute constant $c > 0$ such that, under Assumptions 3 and 4, with probability at least $1 - \delta$, the minimax estimator $\hat{f}$ satisfies
$$\mathcal{E}(\hat{f}) \le \min_{f\in\mathcal{F}}\mathcal{E}(f) + \epsilon + c\sqrt{\Big(\min_{f\in\mathcal{F}}\mathcal{E}(f) + \epsilon\Big)\Delta} + c\Delta, \tag{9}$$
$$\Delta := H^3\sum_{h=1}^{H}\Big(\widetilde{C}\big(r_{f,h}^\star + r_{g,h}^\star + \tilde{r}_{f,h}^\star\big) + \sqrt{\widetilde{C}\, r_{g,h}^\star}\Big) + H^2\,\frac{\log(H/\delta)}{n},$$
where $\widetilde{C}$ is the concentrability coefficient in Assumption 4, and $r_{f,h}^\star$, $r_{g,h}^\star$, $\tilde{r}_{f,h}^\star$ are the critical radii of the following local Rademacher complexities, respectively:
$$\mathcal{R}_n^{\mu_h}\big(\{f_h \in \mathcal{F}_h \mid \|f_h - f_h^\dagger\|_{\mu_h}^2 \le r\}\big), \qquad \mathcal{R}_n^{\mu_h}\big(\{g_h \in \mathcal{G}_h \mid \|g_h - g_h^\dagger\|_{\mu_h}^2 \le r\}\big),$$
$$\mathcal{R}_n^{\nu_h}\big(\{V_{f_{h+1}} \mid f_{h+1} \in \mathcal{F}_{h+1},\ \|f_{h+1} - f_{h+1}^\dagger\|_{\nu_h\times\mathrm{Unif}(\mathcal{A})}^2 \le r\}\big).$$

Similar to Theorem 5.3, the upper bound in (9) can be viewed as a combination of a model misspecification error ($\min_{f\in\mathcal{F}}\mathcal{E}(f) + \epsilon$) and a statistical error ($c\sqrt{(\min_{f\in\mathcal{F}}\mathcal{E}(f)+\epsilon)\Delta} + c\Delta$). As $n \to \infty$, the model misspecification error is nonvanishing, while the statistical error tends to zero. Again, for typical parametric function classes, the critical radius of the local Rademacher complexity scales as $n^{-1}$ (see Section 6), so Theorem 5.5 shows that the excess risk of the minimax algorithm also decreases as $n^{-1}$, up to a constant model misspecification error $\epsilon$.

Intuitively, Assumption 4 is required in Theorem 5.5 so that $\mathcal{E}(f)$ being close to $\mathcal{E}(f^\dagger)$ implies that $f_h$ lies in a neighborhood of $f_h^\dagger$ for each step $h \in [H]$. We conjecture that such an additional assumption is unavoidable if we wish to upper bound the excess risk using the local Rademacher complexities of $\mathcal{F}_h$, $\mathcal{G}_h$, and $\mathcal{V}_{\mathcal{F}_{h+1}}$ for the minimax algorithm.

In Appendix C, we present an alternative version of Theorem 5.5, which does not require Assumption 4 but bounds the excess risk using the local Rademacher complexity of a composite function class depending on the loss, $\mathcal{F}$, and $\mathcal{G}$. The alternative version recovers the sharp result in Chen & Jiang (2019) when the function classes $\mathcal{F}$ and $\mathcal{G}$ both have finitely many elements.

Finally, our upper bounds for the minimax algorithm contain Rademacher complexities of the function class $\mathcal{V}_{\mathcal{F}}$. We can conveniently control them using the Rademacher complexities of the function class $\mathcal{F}$ as follows.

Proposition 5.6. Let $\mathcal{F}$ be a set of functions over $\mathcal{S}\times\mathcal{A}$ and $\rho$ be a measure over $\mathcal{S}$. We have
$$\mathcal{R}_n^{\rho}(\mathcal{V}_{\mathcal{F}}) \le 2\sqrt{A}\;\mathcal{R}_n^{\rho\times\mathrm{Unif}(\mathcal{A})}(\mathcal{F}),$$
where $A$ is the cardinality of the set $\mathcal{A}$.

6. Examples

Below we give four examples of function classes, each with an upper bound on the Rademacher complexity as well as on the critical radius of the local Rademacher complexity. Throughout this section we use the notation $r_n^{\star,\rho}(\mathcal{F}, f_o)$ for the critical radius of the local Rademacher complexity $\mathcal{R}_n^{\rho}(\{f \in \mathcal{F} \mid \|f - f_o\|_\rho^2 \le r\})$.

Finite function classes.  First, we consider a function class $\mathcal{F}$ with $|\mathcal{F}| < \infty$. Under the normalization that $f \in [0, H]$ for any $f \in \mathcal{F}$, we have the following.

Proposition 6.1. For the finite function class $\mathcal{F}$ defined above, for any data distribution $\rho$ and any anchor function $f_o \in \mathcal{F}$:
$$\mathcal{R}_n^{\rho}(\mathcal{F}) \le 2H\max\Big\{\sqrt{\frac{\log|\mathcal{F}|}{n}},\ \frac{\log|\mathcal{F}|}{n}\Big\}, \qquad r_n^{\star,\rho}(\mathcal{F}, f_o) \le \frac{2H\log|\mathcal{F}|}{n}.$$

Linear functions.  Let $\phi : \mathcal{S}\times\mathcal{A} \to \mathbb{R}^d$ be a feature map into a $d$-dimensional Euclidean space, and consider the function class $\mathcal{F} \subset \{w^\top\phi \mid w \in \mathbb{R}^d, \|w\| \le H\}$. Under the normalization that $\|\phi(s, a)\| \le 1$ for any $(s, a)$, we have the following.

Proposition 6.2. For the linear function class $\mathcal{F}$ defined above, for any data distribution $\rho$ and any anchor function $f_o \in \mathcal{F}$:
$$\mathcal{R}_n^{\rho}(\mathcal{F}) \le H\sqrt{\frac{2d}{n}}, \qquad r_n^{\star,\rho}(\mathcal{F}, f_o) \le \frac{2d}{n}.$$
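As a quick numerical sanity check of Proposition 6.2 (our own illustration with a randomly drawn design, not an experiment from the paper), one can compare a Monte Carlo estimate of the empirical Rademacher complexity of the linear class with the bound $H\sqrt{2d/n}$. For the ball class $\{w^\top\phi : \|w\| \le H\}$ the supremum is available in closed form: $\sup_{\|w\|\le H}\frac{1}{n}\sum_i \sigma_i\, w^\top\phi(x_i) = \frac{H}{n}\big\|\sum_i \sigma_i \phi(x_i)\big\|$.

```python
import numpy as np

d, n, H = 10, 500, 1.0
rng = np.random.default_rng(0)
Phi = rng.normal(size=(n, d))
Phi /= np.maximum(np.linalg.norm(Phi, axis=1, keepdims=True), 1.0)  # enforce ||phi|| <= 1

draws = 2000
est = 0.0
for _ in range(draws):
    sigma = rng.choice([-1.0, 1.0], size=n)
    est += H * np.linalg.norm(Phi.T @ sigma) / n   # closed-form supremum over the ball
est /= draws

print("Monte Carlo estimate:", est)
print("Proposition 6.2 bound:", H * np.sqrt(2 * d / n))
```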

Functions in an RKHS.  Consider a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$ associated with a positive kernel $k : (\mathcal{S}\times\mathcal{A})\times(\mathcal{S}\times\mathcal{A}) \to \mathbb{R}$. Suppose that $k\big((s,a),(s,a)\big) \le 1$ for any $(s, a) \in \mathcal{S}\times\mathcal{A}$. Consider the function class $\mathcal{F} \subseteq \{f \in \mathcal{H} \mid \|f\|_K \le H\}$, where $\|\cdot\|_K$ denotes the RKHS norm. Define an integral operator $T : L^2(\rho) \to L^2(\rho)$ by
$$T f := \mathbb{E}_{(s,a)\sim\rho}\big[k\big(\cdot, (s,a)\big)\, f(s,a)\big].$$
Suppose that $\mathbb{E}_{(s,a)\sim\rho}\big[k\big((s,a),(s,a)\big)\big] < +\infty$, and let $\{\lambda_i(T)\}_{i=1}^{\infty}$ be the eigenvalues of $T$ arranged in nonincreasing order. Then we have the following.

Proposition 6.3. For the kernel function class $\mathcal{F}$ defined above, for any data distribution $\rho$ and any anchor function $f_o \in \mathcal{F}$:
$$\mathcal{R}_n^{\rho}(\mathcal{F}) \le H\sqrt{\frac{2}{n}\sum_{i=1}^{\infty}\big(1 \wedge 4\lambda_i(T)\big)}, \qquad r_n^{\star,\rho}(\mathcal{F}, f_o) \le 2\min_{j\in\mathbb{N}}\Big\{\frac{j}{n} + H\sqrt{\frac{2}{n}\sum_{i=j+1}^{\infty}\lambda_i(T)}\Big\}.$$

Sparse linear functions.  Let $\phi : \mathcal{S}\times\mathcal{A} \to \mathbb{R}^d$ be a feature map into a $d$-dimensional Euclidean space, and consider the function class $\mathcal{F} \subset \{w^\top\phi \mid w \in \mathbb{R}^d, \|w\|_0 \le s\}$. Assume that when $(s, a) \sim \rho$, the feature $\phi(s, a)$ follows a Gaussian distribution with covariance $\Sigma$, and that $\|f\|_\rho \le H$ for any $f \in \mathcal{F}$. Furthermore, let $\kappa_s(\Sigma)$ be an upper bound such that $\kappa_s(\Sigma) \ge \lambda_{\max}(M)/\lambda_{\min}(M)$ for any matrix $M$ that is an $s\times s$ principal submatrix of $\Sigma$. Then we have the following.

Proposition 6.4. There exists an absolute constant $c > 0$ such that, for the sparse linear function class $\mathcal{F}$ defined above, if the data distribution $\rho$ satisfies the conditions specified above, then for any anchor function $f_o \in \mathcal{F}$:
$$\mathcal{R}_n^{\rho}(\mathcal{F}) \le cH\sqrt{\kappa_s(\Sigma)}\,\sqrt{\frac{s\log d}{n}}, \qquad r_n^{\star,\rho}(\mathcal{F}, f_o) \le c^2\,\kappa_s(\Sigma)\cdot\frac{s\log d}{n}.$$

End-to-end results.  Finally, to obtain an end-to-end result that upper bounds the suboptimality in values for the specific function classes listed above, we can simply combine (a) the result that bounds the value suboptimality by the Bellman error (Lemma 3.2); (b) the results that bound the Bellman error in terms of (local) Rademacher complexity (Theorems 4.1 and 5.2–5.5); and (c) the upper bounds on the (local) Rademacher complexity for the specific function classes (Propositions 6.1–6.4).

7. Conclusion

This paper studies batch RL with general value function approximation through the lens of statistical learning theory.

We identify the intrinsic difference between batch reinforcement learning and classical supervised learning (Theorem 5.1), due to the additional temporal correlation structure present in the RL data. Under mild conditions, this paper also provides upper bounds on the generalization performance of several popular batch RL algorithms in terms of the (local) Rademacher complexities of general function classes. We hope our results shed light on future research that further bridges the gap between statistical learning theory and RL.

Acknowledgement

ZL acknowledges support from NSF, ONR, Simons Foundation, Schmidt Foundation, Microsoft Research, Mozilla Research, Amazon Research, DARPA and SRC.

References

Antos, A., Szepesvári, C., and Munos, R. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.

Baird, L. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pp. 30–37. Elsevier, 1995.

Baldi, P., Brunak, S., and Bach, F. Bioinformatics: The Machine Learning Approach. MIT Press, 2001.

Bartlett, P. L. and Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

Bartlett, P. L., Bousquet, O., and Mendelson, S. Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.

Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.

Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-dynamic programming: An overview. In Proceedings of the 34th IEEE Conference on Decision and Control, volume 1, pp. 560–564. IEEE, 1995.

Bradtke, S. J. and Barto, A. G. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1):33–57, 1996.

Chen, J. and Jiang, N. Information-theoretic considerations in batch reinforcement learning. arXiv preprint arXiv:1905.00360, 2019.

Cortes, C. and Vapnik, V. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

Dann, C., Neumann, G., and Peters, J. Policy evaluation with temporal differences: A survey and comparison. Journal of Machine Learning Research, 15:809–883, 2014.

Dudley, R. M. Metric entropy of some classes of sets with differentiable boundaries. Journal of Approximation Theory, 10(3):227–236, 1974.

Fan, J., Wang, Z., Xie, Y., and Yang, Z. A theoretical analysis of deep Q-learning. In Learning for Dynamics and Control, pp. 486–489. PMLR, 2020.

Farahmand, A. M., Ghavamzadeh, M., Szepesvári, C., and Mannor, S. Regularized policy iteration. In Advances in Neural Information Processing Systems, pp. 441–448, 2008.

Farahmand, A. M., Munos, R., and Szepesvári, C. Error propagation for approximate policy and value iteration. In Advances in Neural Information Processing Systems, 2010.

Farahmand, A. M., Ghavamzadeh, M., Szepesvári, C., and Mannor, S. Regularized policy iteration with nonparametric function spaces. The Journal of Machine Learning Research, 17(1):4809–4874, 2016.

Forsyth, D. A. and Ponce, J. Computer Vision: A Modern Approach. Pearson, 2012.

Freund, Y. and Schapire, R. E. Experiments with a new boosting algorithm. In ICML, volume 96, pp. 148–156, 1996.

Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media, 2009.

James, G., Witten, D., Hastie, T., and Tibshirani, R. An Introduction to Statistical Learning, volume 112. Springer, 2013.

Jelinek, F. Statistical Methods for Speech Recognition. MIT Press, 1997.

Jiang, N. On value functions and the agent-environment boundary. arXiv preprint arXiv:1905.13341, 2019.

Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J., and Schapire, R. E. Contextual decision processes with low Bellman rank are PAC-learnable. In Proceedings of the 34th International Conference on Machine Learning, pp. 1704–1713, 2017.

Juang, B. H. and Rabiner, L. R. Hidden Markov models for speech recognition. Technometrics, 33(3):251–272, 1991.

Kallus, N. and Uehara, M. Double reinforcement learning for efficient off-policy evaluation in Markov decision processes. Journal of Machine Learning Research, 21(167):1–63, 2020.

Lange, S., Gabel, T., and Riedmiller, M. Batch reinforcement learning. In Reinforcement Learning, pp. 45–73. Springer, 2012.

Lazaric, A., Ghavamzadeh, M., and Munos, R. Finite-sample analysis of least-squares policy iteration. Journal of Machine Learning Research, 13:3041–3074, 2012.

Le, H., Voloshin, C., and Yue, Y. Batch policy learning under constraints. In International Conference on Machine Learning, pp. 3703–3712. PMLR, 2019.

Ledoux, M. and Talagrand, M. Probability in Banach Spaces: Isoperimetry and Processes. Springer Science & Business Media, 2013.

Levine, S., Kumar, A., Tucker, G., and Fu, J. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.

Maurer, A. A vector-contraction inequality for Rademacher complexities. In International Conference on Algorithmic Learning Theory, pp. 3–17. Springer, 2016.

Mendelson, S. Geometric parameters of kernel machines. In International Conference on Computational Learning Theory, pp. 29–43. Springer, 2002.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Munos, R. Error bounds for approximate policy iteration. In ICML, volume 3, pp. 560–567, 2003.

Munos, R. Performance bounds in Lp-norm for approximate value iteration. SIAM Journal on Control and Optimization, 46(2):541–561, 2007.

Munos, R. and Szepesvári, C. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(May):815–857, 2008.

Precup, D. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, p. 80, 2000.

Precup, D., Sutton, R. S., and Dasgupta, S. Off-policy temporal-difference learning with function approximation. In ICML, pp. 417–424, 2001.

Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

Riedmiller, M. Neural fitted Q iteration—first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, pp. 317–328. Springer, 2005.

Schapire, R. E. A brief introduction to boosting. In IJCAI, volume 99, pp. 1401–1406, 1999.

Sun, W., Jiang, N., Krishnamurthy, A., Agarwal, A., and Langford, J. Model-based RL in contextual decision processes: PAC bounds and exponential improvements over model-free approaches. In Conference on Learning Theory, pp. 2898–2933. PMLR, 2019.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.

Suykens, J. A. and Vandewalle, J. Least squares support vector machine classifiers. Neural Processing Letters, 9(3):293–300, 1999.

Szeliski, R. Computer Vision: Algorithms and Applications. Springer Science & Business Media, 2010.

Uehara, M., Huang, J., and Jiang, N. Minimax weight and Q-function learning for off-policy evaluation. In International Conference on Machine Learning, pp. 9659–9668. PMLR, 2020.

Uehara, M., Imaizumi, M., Jiang, N., Kallus, N., Sun, W., and Xie, T. Finite sample analysis of minimax offline reinforcement learning: Completeness, fast rates and first-order efficiency. arXiv preprint arXiv:2102.02981, 2021.

van Handel, R. Probability in high dimension. Technical report, Princeton University, 2014.

Vapnik, V. The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013.

Vapnik, V. N. and Chervonenkis, A. Y. On the uniform convergence of relative frequencies of events to their probabilities. In Measures of Complexity, pp. 11–30. Springer, 2015.

Vershynin, R. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.

Wainwright, M. J. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.

Wang, R., Foster, D. P., and Kakade, S. M. What are the statistical limits of offline RL with linear function approximation? arXiv preprint arXiv:2010.11895, 2020a.

Wang, R., Salakhutdinov, R. R., and Yang, L. Reinforcement learning with general value function approximation: Provably efficient approach via bounded Eluder dimension. Advances in Neural Information Processing Systems, 33, 2020b.

Xie, T. and Jiang, N. Batch value-function approximation with only realizability. arXiv preprint arXiv:2008.04990, 2020a.

Xie, T. and Jiang, N. Q* approximation schemes for batch reinforcement learning: A theoretical comparison. arXiv preprint arXiv:2003.03924, 2020b.

Xie, T., Ma, Y., and Wang, Y.-X. Towards optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. arXiv preprint arXiv:1906.03393, 2019.

Yin, M., Bai, Y., and Wang, Y.-X. Near optimal provable uniform convergence in off-policy evaluation for reinforcement learning. arXiv preprint arXiv:2007.03760, 2020.