
Bootstrapping Statistical Inference for Off-Policy Evaluation

Botao Hao 1   Xiang (Jack) Ji 2   Yaqi Duan 2   Hao Lu 2   Csaba Szepesvári 1 3   Mengdi Wang 1 2

*Equal contribution. 1DeepMind 2Princeton University 3University of Alberta. Correspondence to: Botao Hao. Preliminary work. arXiv:2102.03607v2 [stat.ML] 9 Feb 2021

Abstract

Bootstrapping provides a flexible and effective approach for assessing the quality of batch reinforcement learning, yet its theoretical properties are less understood. In this paper, we study the use of bootstrapping in off-policy evaluation (OPE), and in particular, we focus on the fitted Q-evaluation (FQE) method that is known to be minimax-optimal in the tabular and linear-model cases. We propose a bootstrapping FQE method for inferring the distribution of the policy evaluation error and show that this method is asymptotically efficient and distributionally consistent for off-policy statistical inference. To overcome the computational limit of bootstrapping, we further adapt a subsampling procedure that improves the runtime by an order of magnitude. We numerically evaluate the bootstrapping method in classical RL environments for confidence interval estimation, estimating the variance of an off-policy evaluator, and estimating the correlation between multiple off-policy evaluators.

1. Introduction

Off-policy evaluation (OPE) often serves as the starting point of batch reinforcement learning (RL). The objective of OPE is to estimate the value of a target policy based on batch episodes of state-transition trajectories that were generated using a possibly different and unknown behavior policy. In this paper, we investigate statistical inference for OPE. In particular, we analyze the popular fitted Q-evaluation (FQE) method, which is a basic model-free approach that fits the unknown value function from data using function approximation and backward dynamic programming (Fonteneau et al., 2013; Munos & Szepesvári, 2008; Le et al., 2019).

In practice, FQE has demonstrated robust and satisfying performance on many classical RL tasks under different metrics (Voloshin et al., 2019). A more recent study by Paine et al. (2020) demonstrated surprising scalability and effectiveness of FQE with deep neural nets in a range of complex continuous-state RL tasks. On the theoretical side, FQE was proved to be a minimax-optimal policy evaluator in the tabular and linear-model cases (Yin & Wang, 2020; Duan & Wang, 2020).

The aforementioned research mostly focuses on point estimation for OPE. In practical batch RL applications, a point estimate is far from enough; statistical inference for OPE is of great interest. For instance, one often hopes to construct a tight confidence interval around the policy value, estimate the variance of an off-policy evaluator, or evaluate multiple policies using the same data and estimate their correlations. Bootstrapping (Efron, 1982) is a conceptually simple and generalizable approach to infer the error distribution based on batch data. Therefore, in this work, we study the use of bootstrapping for off-policy inference. We provide theoretical justifications as well as numerical experiments. Our main results are summarized below:

• First, we analyze the asymptotic distribution of FQE with linear function approximation and show that the policy evaluation error asymptotically follows a normal distribution (Theorem 4.2). The asymptotic variance matches the Cramér–Rao lower bound for OPE (Theorem 4.5), which implies that this estimator is asymptotically efficient.

• We propose a bootstrapping FQE method for estimating the distribution of the off-policy evaluation error. We prove that bootstrapping FQE is asymptotically consistent in estimating the distribution of the original FQE (Theorem 5.1) and establish the consistency of the bootstrap confidence interval as well as bootstrap variance estimation. Further, we propose a subsampled bootstrap procedure to improve the computational efficiency of bootstrapping FQE.

• We highlight the necessity of bootstrapping by episodes, rather than by sample transitions as considered in previous work (Kostrikov & Nachum, 2020). The reason is that bootstrapping dependent data in general fails to characterize the right error distribution (Remark 2.1 in Singh (1981)). We illustrate this phenomenon via experiments (see Figure 1). All our theoretical analysis applies to episodic dependent data, and we do not require the i.i.d. sample-transition assumption commonly made in the OPE literature (Jiang & Huang, 2020; Kostrikov & Nachum, 2020; Dai et al., 2020).

• Finally, we evaluate subsampled bootstrapping FQE in a range of classical RL tasks, including a discrete tabular domain, a continuous control domain and a simulated healthcare example. We test variants of bootstrapping FQE with tabular representation, linear function approximation, and neural networks. We carefully examine the effectiveness and tightness of bootstrap confidence intervals, as well as the accuracy of bootstrapping for estimating the variance and correlation for OPE.

Related Work. Point estimation of OPE has received considerable attention in recent years; we include a detailed literature review in Appendix A. Confidence interval estimation of OPE is also important in many high-stake applications. Thomas et al. (2015) proposed a high-confidence OPE method based on importance sampling and the empirical Bernstein inequality. Kuzborskij et al. (2020) proposed a tighter confidence interval for contextual bandits based on an empirical Efron–Stein inequality. However, importance sampling suffers from the curse of horizon (Liu et al., 2018), and concentration-based confidence intervals are typically overly conservative since they only exploit tail information (Hao et al., 2020a). Another line of recent work formulated the estimation of confidence intervals as an optimization problem (Feng et al., 2020; 2021; Dai et al., 2020). These works are specific to confidence interval construction for OPE and do not provide a distributional consistency guarantee; thus they do not easily generalize to other statistical inference tasks. Several existing works have investigated the use of bootstrapping in OPE. Thomas et al. (2015); Hanna et al. (2017) constructed confidence intervals by bootstrapping an importance sampling estimator or learned models, but did not come with any consistency guarantee. The most related work is Kostrikov & Nachum (2020), which provided the first asymptotic consistency of a bootstrap confidence interval for OPE. Our analysis improves on their work in the following aspects. First, we study FQE with linear function approximation, while Kostrikov & Nachum (2020) only considered the tabular case. Second, we provide distributional consistency of bootstrapping FQE, which is stronger than the consistency of the confidence interval in Kostrikov & Nachum (2020).

2. Preliminary

Consider an episodic Markov decision process (MDP) defined by a tuple M = (S, A, P, r, H). Here, S is the state space, A is the action space, P(s'|s, a) is the probability of reaching state s' when taking action a in state s, r : S × A → [0, 1] is the reward function, and H is the length of the horizon. A policy π : S → P(A) maps states to a distribution over actions. The state-action value function (Q-function) is defined as, for h = 1, ..., H,

Q^\pi_h(s, a) = \mathbb{E}^\pi\Big[\sum_{h'=h}^{H} r(s_{h'}, a_{h'}) \,\Big|\, s_h = s, a_h = a\Big],

where a_{h'} ∼ π(· | s_{h'}), s_{h'+1} ∼ P(· | s_{h'}, a_{h'}), and E^π denotes expectation over the sample path generated under policy π. The Q-function satisfies the Bellman equation for policy π:

Q^\pi_{h-1}(s, a) = r(s, a) + \mathbb{E}\big[ V^\pi_h(s') \,\big|\, s, a \big],

where s' ∼ P(·|s, a) and V^π_h : S → R is the value function defined as V^π_h(s) = ∫_a Q^π_h(s, a)π(a|s)da.

We denote [n] = {1, ..., n} and λ_min(X) as the minimum eigenvalue of X. Denote I_d ∈ R^{d×d} as the identity matrix, with 1 on every diagonal entry and 0 elsewhere.

Off-policy evaluation. Suppose that the batch data D = {D_1, ..., D_K} consists of K independent episodes collected using an unknown behavior policy π̄. Each episode, denoted as D_k = {(s^k_h, a^k_h, r^k_h)}_{h∈[H]}, is a trajectory of H state-transition tuples. It is easy to generalize our analysis to multiple unknown behavior policies, since our algorithms do not require knowledge of the behavior policy. Let N = KH be the total number of sample transitions; we sometimes write D = {(s_n, a_n, r_n)}_{n∈[N]} for simplicity. The goal of OPE is to estimate the expected cumulative return (i.e., value) of a target policy π from a fixed initial distribution ξ_1, based on the dataset D. The value is defined as

v_\pi = \mathbb{E}^\pi\Big[\sum_{h=1}^{H} r(s_h, a_h) \,\Big|\, s_1 \sim \xi_1\Big].
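To fix notation and the data format, the following short sketch is our own illustration, not the paper's code. It assumes a generic simulator `env` with `reset()`/`step(s, a, rng)` and a stochastic policy `policy(state, rng) -> action`; it shows how the batch dataset D = {D_1, ..., D_K} of K length-H episodes is collected and how v_π can be approximated by Monte Carlo rollouts when the environment can be simulated.

import numpy as np

def collect_episodes(env, behavior_policy, K, H, rng):
    # Batch data D = {D_1, ..., D_K}: K independent length-H trajectories collected
    # under the (possibly unknown) behavior policy.
    data = []
    for _ in range(K):
        s = env.reset()
        episode = []
        for _ in range(H):
            a = behavior_policy(s, rng)
            s_next, r = env.step(s, a, rng)
            episode.append((s, a, r, s_next))   # one transition (s_h, a_h, r_h, s_{h+1})
            s = s_next
        data.append(episode)
    return data

def monte_carlo_value(env, target_policy, H, n_rollouts, rng):
    # v_pi = E^pi[ sum_{h=1}^H r(s_h, a_h) | s_1 ~ xi_1 ], estimated by fresh rollouts
    # of the target policy (only possible when the environment can be simulated).
    returns = []
    for _ in range(n_rollouts):
        s, total = env.reset(), 0.0
        for _ in range(H):
            a = target_policy(s, rng)
            s, r = env.step(s, a, rng)
            total += r
        returns.append(total)
    return float(np.mean(returns))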

Fitted Q-evaluation. Fitted Q-evaluation (FQE) is an instance of the fitted Q-iteration method, dating back to Fonteneau et al. (2013); Le et al. (2019). Let F be a given function class, for example a linear function class or a neural network class. Set Q̂^π_{H+1} = 0. For h = H, ..., 1, we recursively estimate Q^π_h by regression and function approximation:

\hat{Q}^\pi_h = \operatorname*{argmin}_{f \in F} \Big\{ \frac{1}{N}\sum_{n=1}^{N} \big( f(s_n, a_n) - y_n \big)^2 + \lambda \rho(f) \Big\},

where y_n = r_n + ∫_a Q̂^π_{h+1}(s_{n+1}, a)π(a|s_{n+1})da and ρ(f) is a proper regularizer. The value estimate is

\hat{v}_\pi = \mathbb{E}_{s \sim \xi_1, a \sim \pi(\cdot|s)}\big[ \hat{Q}^\pi_1(s, a) \big], \qquad (2.1)

which can be directly computed based on Q̂^π_1. See the full description of FQE in Appendix B.1.

Off-policy inference. Let v̂_π be an off-policy estimator of the target policy value v_π. In addition to the point estimator, we are primarily interested in the distribution of the off-policy evaluation error v̂_π − v_π. We aim to infer the error distribution of v̂_π − v_π in order to conduct statistical inference. Suppose F is an estimated distribution of v̂_π − v_π. Then we can use F for a range of downstream off-policy inference tasks, for example:

• Moment estimation. With F, we can estimate the p-th moment of v̂_π − v_π by ∫_R x^p dF(x). Two important examples are bias estimation and variance estimation.

• Confidence interval construction. Define the quantile function of F as G(p) = inf{x ∈ R, p ≤ F(x)}. Specify a confidence level 0 < δ ≤ 1. With F, we can construct the 1 − δ confidence interval as [v̂_π − G(1 − δ/2), v̂_π − G(δ/2)]. If F is close to the true distribution of v̂_π − v_π, this is nearly the tightest confidence interval for v_π based on v̂_π.

• Evaluating multiple policies and estimating their correlation. Suppose there are two target policies π_1, π_2 to evaluate and the corresponding off-policy estimators are v̂_{π_1}, v̂_{π_2}. Let F_{12} be the estimated joint distribution of v̂_{π_1} − v_{π_1} and v̂_{π_2} − v_{π_2}. The Pearson correlation coefficient between the two estimators is

\rho(\hat{v}_{\pi_1}, \hat{v}_{\pi_2}) = \frac{\mathrm{Cov}(\hat{v}_{\pi_1}, \hat{v}_{\pi_2})}{\sqrt{\mathrm{Var}(\hat{v}_{\pi_1})\,\mathrm{Var}(\hat{v}_{\pi_2})}}.

Both the covariance and the variances can be estimated from F_{12}, so we can further estimate the correlation between off-policy evaluators.

3. Bootstrapping Fitted Q-Evaluation (FQE)

As shown in Le et al. (2019); Voloshin et al. (2019); Duan & Wang (2020); Paine et al. (2020), FQE not only demonstrates strong empirical performance, but also enjoys provably optimal theoretical guarantees. Thus it is natural to conduct bootstrapping on top of FQE for off-policy inference.

Recall that the original dataset D consists of K episodes. We propose to bootstrap FQE by episodes: draw sample episodes D*_1, ..., D*_K independently with replacement from D. This is the standard Efron's nonparametric bootstrap (Efron, 1982). Then we run FQE on the new bootstrapped set D* = {D*_1, ..., D*_K} as in Eq. (2.1) and take the output v̂*_π as the bootstrapping FQE estimator. By repeating the above process, we may obtain multiple samples of v̂*_π, and may use these samples to further conduct off-policy inference (see Section 6.2 for details).

3.1. Bootstrap by episodes vs. bootstrap by sample transitions

Practitioners may wonder what is the right way to bootstrap a data set. This question is quite well understood in supervised learning when the data points are independent and identically distributed; there the best way to bootstrap is to resample data points directly. However, in episodic RL, although episodes may be generated independently from one another, sample transitions (s_n, a_n, r_n) in the same episode are highly dependent. Therefore, we choose to bootstrap the batch dataset by episodes, rather than by sample transitions, which was commonly done in previous literature (Kostrikov & Nachum, 2020). We argue that bootstrapping sample transitions may fail to correctly characterize the target error distribution of OPE. This is due to the in-episode dependence. To illustrate this phenomenon, we conduct numerical experiments using a toy Cliff Walking environment. We compare the true distribution of the FQE error obtained by Monte Carlo sampling with error distributions obtained using bootstrapping FQE. Figure 1 clearly shows that the bootstrap distribution of v̂*_π − v̂_π (by episodes) closely approximates the true error distribution of v̂_π − v_π, while the bootstrap distribution by sample transitions is highly irregular and incorrect. This validates our belief that it is necessary to bootstrap by episodes and handle dependent data carefully for OPE.

Figure 1. Bootstrap by episodes vs. by sample transitions. The first panel is the true FQE error distribution by Monte Carlo approximation. The second panel is the bootstrap distribution by episode, while the third one is by sample transitions. Both the behavior and target policies are the optimal policy. The number of Monte Carlo and bootstrap samples is 10000.
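The following is a minimal sketch, not the authors' code, of FQE with linear function approximation (Eq. (2.1) with a ridge penalty) and of Efron's bootstrap by episodes from Section 3. The episode format, the feature map `phi`, the finite action set `actions`, and `pi_probs` (the target policy's action probabilities) are illustrative assumptions; `rng` is a numpy random Generator.

import numpy as np

def fqe_linear(episodes, phi, pi_probs, actions, H, lam=1e-3):
    # episodes: list of K trajectories, each a list of H tuples (s, a, r, s_next)
    trans = [t for ep in episodes for t in ep]                  # all N = K*H transitions
    d = phi(trans[0][0], trans[0][1]).shape[0]
    Phi = np.stack([phi(s, a) for (s, a, r, sn) in trans])      # N x d design matrix
    Sigma = Phi.T @ Phi + lam * np.eye(d)                       # regularized Gram matrix
    w = np.zeros(d)                                             # weights of Q_hat^pi_{H+1} = 0
    for _ in range(H):                                          # backward recursion h = H, ..., 1
        # regression target y_n = r_n + E_{b ~ pi(.|s_{n+1})}[ Q_hat^pi_{h+1}(s_{n+1}, b) ]
        y = np.array([r + sum(p * (phi(sn, b) @ w)
                              for b, p in zip(actions, pi_probs(sn)))
                      for (s, a, r, sn) in trans])
        w = np.linalg.solve(Sigma, Phi.T @ y)                   # ridge regression fit of Q_hat^pi_h
    return w                                                    # weights of Q_hat^pi_1

def value_estimate(w, phi, pi_probs, actions, init_states):
    # plug-in estimate (2.1): average Q_hat^pi_1(s, a) over s ~ xi_1, a ~ pi(.|s)
    return float(np.mean([sum(p * (phi(s, b) @ w)
                              for b, p in zip(actions, pi_probs(s)))
                          for s in init_states]))

def bootstrap_fqe_values(episodes, init_states, B, rng, phi, pi_probs, actions, H):
    # Efron's nonparametric bootstrap *by episodes* (Section 3): resample whole
    # trajectories with replacement; never resample individual transitions (Section 3.1).
    K = len(episodes)
    values = []
    for _ in range(B):
        boot = [episodes[i] for i in rng.integers(0, K, size=K)]
        w_star = fqe_linear(boot, phi, pi_probs, actions, H)
        values.append(value_estimate(w_star, phi, pi_probs, actions, init_states))
    return np.array(values)                                     # bootstrap samples of v*_pi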

4. Asymptotic Distribution and Optimality of FQE

Before analyzing the use of the bootstrap, we first study the asymptotic properties of FQE estimators. For the sake of theoretical tractability, we focus our analysis on FQE with linear function approximation, because it is the most basic and universal form of function approximation. We will show that the FQE error is asymptotically normal and that its asymptotic variance exactly matches the Cramér–Rao lower bound. All the proofs are deferred to Appendix B.3 and B.4.

Notations. Given a feature map φ : S × A → R^d, we let F be the linear function class spanned by φ. Denote ν^π_h = E^π[φ(s_h, a_h) | s_1 ∼ ξ_1], as well as

\Sigma = \mathbb{E}\Big[\sum_{h=1}^{H} \phi(s_h, a_h)\phi(s_h, a_h)^\top\Big],

where E is the expectation over the population distribution generated by the behavior policy.

4.1. Asymptotic normality

We need a representation condition on the function class F, which will ensure sample-efficient policy evaluation via FQE.

Condition 4.1 (Policy completeness). For any f ∈ F, we assume P^π f ∈ F, and r ∈ F.

Policy completeness requires that the function class F can well capture the Bellman operator. It is crucial for the estimation consistency of FQE (Le et al., 2019; Duan & Wang, 2020) and implies the realizability condition Q^π_h ∈ F for h ∈ [H]. Recently, Wang et al. (2020) established a lower bound showing that the condition Q^π_h ∈ F alone is not enough for sample-efficient OPE. Thus we need the policy completeness condition in order to leverage the generalizability of the linear function class.

Next we present our first main result. The theorem establishes the asymptotic normality of FQE with linear function approximation. For any h_1 ∈ [H], h_2 ∈ [H], define the cross-time-covariance matrix as

\Omega_{h_1, h_2} = \mathbb{E}\Big[\frac{1}{H}\sum_{h'=1}^{H} \phi(s^1_{h'}, a^1_{h'})\phi(s^1_{h'}, a^1_{h'})^\top \varepsilon^1_{h_1, h'}\varepsilon^1_{h_2, h'}\Big],

where \varepsilon^1_{h_1, h'} = Q^\pi_{h_1}(s^1_{h'}, a^1_{h'}) - \big(r^1_{h'} + V^\pi_{h_1+1}(s^1_{h'+1})\big).

Theorem 4.2 (Asymptotic normality of FQE). Suppose λ_min(Σ) > 0 and Condition 4.1 holds. The FQE with linear function approximation is √N-consistent and asymptotically normal:

\sqrt{N}\,(\hat{v}_\pi - v_\pi) \overset{d}{\to} \mathcal{N}(0, \sigma^2), \quad \text{as } N \to \infty, \qquad (4.1)

where →^d denotes convergence in distribution. The asymptotic variance σ² is given by

\sigma^2 = \sum_{h=1}^{H} (\nu^\pi_h)^\top \Sigma^{-1}\Omega_{h,h}\Sigma^{-1}\nu^\pi_h + 2\sum_{h_1 < h_2} (\nu^\pi_{h_1})^\top \Sigma^{-1}\Omega_{h_1,h_2}\Sigma^{-1}\nu^\pi_{h_2}. \qquad (4.2)

The proof decomposes the evaluation error into a primary term and terms that are asymptotically negligible. For the primary term, we utilize the classical martingale central limit theorem (McLeish et al., 1974) to prove its asymptotic normality.

Remark 4.3. The second term on the right-hand side of Eq. (4.2) (the cross-product term) characterizes the dependency between two different fitted-Q steps. When considering a tabular time-inhomogeneous MDP as used in Yin & Wang (2020), this cross-product term disappears and the asymptotic variance becomes

\sum_{h=1}^{H} \mathbb{E}\bigg[\bigg(\frac{\mu^\pi_h(s^1_h, a^1_h)}{\bar{\mu}_h(s^1_h, a^1_h)}\bigg)^2 (\varepsilon^1_{h,h})^2\bigg],

where \bar{\mu}_h is the marginal distribution of (s^1_h, a^1_h) and \mu^\pi_h is the marginal distribution of (s_h, a_h) under policy π. This matches the asymptotic variance term in Remark 3.2 of Yin & Wang (2020).

Next, we give a corollary about the joint asymptotic error distribution when evaluating multiple policies. Denote Π = {π_1, ..., π_L} as a set of target policies to evaluate and denote v̂_{π_k} as the FQE estimator of the policy π_k. For each π_k ∈ Π, let \varepsilon^{1,k}_{h_1, h'} = Q^{\pi_k}_{h_1}(s^1_{h'}, a^1_{h'}) - (r^1_{h'} + V^{\pi_k}_{h_1+1}(s^1_{h'+1})). For any h_1 ∈ [H], h_2 ∈ [H], denote

\Omega^{j,k}_{h_1, h_2} = \mathbb{E}\Big[\frac{1}{H}\sum_{h'=1}^{H} \phi(s^1_{h'}, a^1_{h'})\phi(s^1_{h'}, a^1_{h'})^\top \varepsilon^{1,j}_{h_1, h'}\varepsilon^{1,k}_{h_2, h'}\Big].

Corollary 4.4 (Multiple policies). Suppose the conditions in Theorem 4.2 hold. The set of FQE estimators converges in distribution to a multivariate Gaussian distribution:

\begin{pmatrix} \sqrt{N}\,(\hat{v}_{\pi_1} - v_{\pi_1}) \\ \vdots \\ \sqrt{N}\,(\hat{v}_{\pi_L} - v_{\pi_L}) \end{pmatrix} \overset{d}{\to} \mathcal{N}(0, \Gamma),

where the covariance matrix \Gamma = (\sigma^2_{jk})_{j,k=1}^{L} \in \mathbb{R}^{L \times L} with

\sigma^2_{jk} = \sum_{h=1}^{H} (\nu^{\pi_j}_h)^\top \Sigma^{-1}\Omega^{j,k}_{h,h}\Sigma^{-1}\nu^{\pi_k}_h + 2\sum_{h_1 < h_2} (\nu^{\pi_j}_{h_1})^\top \Sigma^{-1}\Omega^{j,k}_{h_1,h_2}\Sigma^{-1}\nu^{\pi_k}_{h_2}.

5. Distributional Consistency of Bootstrapping FQE

We show that, conditioned on the data, the bootstrapping FQE estimator consistently estimates the distribution of √N(v̂_π − v_π). Consequently, we may use the method to construct confidence regions with asymptotically correct and tight coverage. All the proofs are deferred to Appendix B.5 and B.6.

Suppose that the batch dataset D is generated from a probability space (X, A, P_D), and the bootstrap weight W* is from an independent probability space (W, Ω, P_{W*}). Their joint probability measure is P_{DW*} = P_D × P_{W*}. Let P_{W*|D} denote the conditional distribution once the dataset D is given.

Theorem 5.1 (Distributional consistency). Suppose the same assumptions as in Theorem 4.2 hold. Conditioned on D, we have

\sqrt{N}\,(\hat{v}^*_\pi - \hat{v}_\pi) \overset{d}{\to} \mathcal{N}(0, \sigma^2), \quad \text{as } N \to \infty, \qquad (5.1)

where σ² is defined in Eq. (4.2). Consequently, it implies

\sup_{\alpha \in (0,1)} \Big| P_{W^*|D}\big(\sqrt{N}(\hat{v}^*_\pi - \hat{v}_\pi) \le \alpha\big) - P_D\big(\sqrt{N}(\hat{v}_\pi - v_\pi) \le \alpha\big) \Big| \to 0.

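The following short calculation is our own addition (it does not appear in the original text); it records, informally, why Theorem 5.1 combined with Theorem 4.2 yields percentile-bootstrap confidence intervals with asymptotically correct coverage. Here q*_α denotes the α-quantile of the conditional law of √N(v̂*_π − v̂_π) given D, Φ_σ the CDF of N(0, σ²), and z_α its α-quantile.

\begin{align*}
\mathbb{P}\Big(v_\pi \in \big[\hat v_\pi - q^*_{1-\delta/2}/\sqrt{N},\ \hat v_\pi - q^*_{\delta/2}/\sqrt{N}\big]\Big)
&= \mathbb{P}\Big(q^*_{\delta/2} \le \sqrt{N}\,(\hat v_\pi - v_\pi) \le q^*_{1-\delta/2}\Big)\\
&\longrightarrow \Phi_\sigma(z_{1-\delta/2}) - \Phi_\sigma(z_{\delta/2}) = (1-\delta/2) - \delta/2 = 1 - \delta,
\end{align*}

since the bootstrap quantiles q*_α converge to z_α by Theorem 5.1 and √N(v̂_π − v_π) converges in distribution to N(0, σ²) by Theorem 4.2.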
2. Then we have, for any 1 ≤ r < q,

\mathbb{E}_{W^*|D}\Big[\big(\sqrt{N}(\hat{v}^*_\pi - \hat{v}_\pi)\big)^r\Big] \to \int t^r \, d\mu(t),

where μ(·) is the distribution of N(0, σ²).

The consistency of the bootstrap variance estimate is immediately implied by setting r = 2.

6. Subsampled Bootstrapping FQE

Computing bootstrap-based quantities can be prohibitively demanding as the data size grows. Inspired by recent developments in the statistics community (Kleiner et al., 2014; Sengupta et al., 2016), we adapt a simple subsampled bootstrap procedure for FQE to accelerate the computation.

6.1. Subsampled bootstrap

Let the original dataset be D = {D_1, ..., D_K}. For any dataset D̃, we denote by v̂_π(D̃) the FQE estimator based on dataset D̃, and B is the number of bootstrap samples. The subsampled bootstrap includes the following three steps. For each b ∈ [B], we first construct a random subset D^{(b)}_{K,s} of s episodes, where each sample episode is drawn independently without replacement from dataset D. Typically s = K^γ for some 0 < γ ≤ 1. Then we generate a resample set D^{(b)*}_{K,s} of K episodes, where each sample episode is drawn independently with replacement from D^{(b)}_{K,s}. Note that when s = K, D^{(b)}_{K,s} is always equal to D, so that the subsampled bootstrap reduces to the vanilla bootstrap. In the end, we compute ε^{(b)} = v̂_π(D^{(b)*}_{K,s}) − v̂_π(D^{(b)}_{K,s}). Algorithm 1 gives the full description.

Remark 6.1 (Computational benefit). In Algorithm 1, although each run of FQE is still over a dataset of K episodes, only s of them are distinct. As a result, the runtime of running FQE on a bootstrapped set can be substantially reduced. With linear function approximation, one run of FQE requires solving H least squares problems. Thus the total runtime complexity of the subsampled bootstrapping FQE is O(B(K^{2γ}H³d + Hd³)), where 0 < γ < 1 controls the subsample size. When γ is small, we achieve a significant speedup, by an order of magnitude.

6.2. Off-policy inference via bootstrapping FQE

We describe how to conduct off-policy inference based on the output of Algorithm 1.

• Bootstrap variance estimation. To estimate the variance of FQE estimators, we calculate the bootstrap sample variance as

\widehat{\mathrm{Var}}(\hat{v}_\pi(D)) = \frac{1}{B-1}\sum_{b=1}^{B} \big(\varepsilon^{(b)} - \bar{\varepsilon}\big)^2, \qquad \text{where } \bar{\varepsilon} = \frac{1}{B}\sum_{b=1}^{B}\varepsilon^{(b)}.

• Bootstrap confidence interval. Compute the δ/2 and 1 − δ/2 quantiles of the empirical distribution {ε^{(1)}, ..., ε^{(B)}}, denoted as q̂^π_{δ/2}, q̂^π_{1−δ/2} respectively. The percentile bootstrap confidence interval is [v̂_π(D) − q̂^π_{1−δ/2}, v̂_π(D) − q̂^π_{δ/2}].

• Bootstrap correlation estimation. For any two target policies π_1 and π_2, we want to estimate the Pearson correlation coefficient between their FQE estimators. The bootstrap sample correlation can be computed as

\hat{\rho}(\hat{v}_{\pi_1}(D), \hat{v}_{\pi_2}(D)) = \frac{\sum_{b=1}^{B}(\varepsilon_1^{(b)} - \bar{\varepsilon}_1)(\varepsilon_2^{(b)} - \bar{\varepsilon}_2)}{\sqrt{\sum_{b=1}^{B}(\varepsilon_1^{(b)} - \bar{\varepsilon}_1)^2}\,\sqrt{\sum_{b=1}^{B}(\varepsilon_2^{(b)} - \bar{\varepsilon}_2)^2}}.

These computations are sketched in code below.
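The following is a minimal sketch, not the authors' implementation, of the subsampled bootstrap loop (Section 6.1) and the three downstream computations of Section 6.2. The routine `fqe_value`, mapping a list of episodes to an FQE value estimate (for instance, built from the linear-FQE sketch in Section 3), and the episode format are assumptions for illustration.

import numpy as np

def subsampled_bootstrap(episodes, fqe_value, B, gamma=0.5, rng=None):
    # For each b: (i) draw s = K^gamma episodes without replacement, (ii) resample
    # K episodes with replacement from that subset, (iii) record
    # eps_b = v_hat(resample) - v_hat(subset).  gamma = 1 recovers the vanilla
    # bootstrap by episodes.
    rng = rng or np.random.default_rng(0)
    K = len(episodes)
    s = max(1, int(round(K ** gamma)))
    eps = []
    for _ in range(B):
        sub = [episodes[i] for i in rng.choice(K, size=s, replace=False)]
        boot = [sub[i] for i in rng.integers(0, s, size=K)]
        eps.append(fqe_value(boot) - fqe_value(sub))
    return np.array(eps)

def bootstrap_variance(eps):
    # bootstrap sample variance: (1/(B-1)) * sum_b (eps_b - mean)^2
    return float(np.var(eps, ddof=1))

def percentile_ci(v_hat, eps, delta=0.1):
    # percentile bootstrap CI: [v_hat - q_{1-delta/2}, v_hat - q_{delta/2}]
    q_lo, q_hi = np.quantile(eps, [delta / 2, 1 - delta / 2])
    return v_hat - q_hi, v_hat - q_lo

def bootstrap_correlation(eps1, eps2):
    # Pearson correlation between two FQE estimators; eps1[b] and eps2[b] must be
    # computed on the same b-th resampled dataset so that the replicates are paired.
    return float(np.corrcoef(eps1, eps2)[0, 1])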

7. Experiments

In this section, we numerically evaluate the proposed bootstrapping FQE method in several RL environments. For constructing confidence intervals, we fix the confidence level at δ = 0.1. For estimating variances and correlations, we average the results over 200 trials. More details about the experiments are given in Appendix D.

7.1. Experiment with tabular discrete environment

We first consider the Cliff Walking environment (Sutton & Barto, 2018), with artificially added randomness to create stochastic transitions (see Appendix D for details). The target policy is chosen to be a near-optimal policy, trained using Q-learning. Consider three choices of the behavior policy: the same as the target policy (on-policy), an ε-greedy policy with ε = 0.1, and a softmax policy with temperature 1.0, both based on the learned optimal Q-function. The results for the softmax policy and for correlation estimation are deferred to Appendix D.

Figure 2. Off-policy CI for Cliff Walking. Left: Empirical coverage probability of CI; Right: CI width under different behavior policies. The bootstrapping-FQE confidence interval demonstrates better and tighter coverage of the ground truth. It closely resembles the oracle confidence interval, which comes from the true error distribution. [Panels: On-Policy and Epsilon-Greedy Behavior Policy; methods compared: vanilla bootstrap, subsampled bootstrap, HCOPE, oracle CI; axes: empirical coverage / interval width versus number of episodes in the dataset.]
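As an illustration of the behavior policies used above, here is a small sketch of our own (not the paper's code), assuming a learned tabular Q-function `Q` of shape (num_states, num_actions), of the ε-greedy and softmax (temperature 1.0) action distributions.

import numpy as np

def epsilon_greedy_probs(Q, state, eps=0.1):
    # uniform exploration with probability eps, greedy action otherwise
    n_actions = Q.shape[1]
    probs = np.full(n_actions, eps / n_actions)
    probs[np.argmax(Q[state])] += 1.0 - eps
    return probs

def softmax_probs(Q, state, temperature=1.0):
    # Boltzmann policy over the learned Q-values
    z = Q[state] / temperature
    z = z - z.max()                 # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()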

We test three different methods. The first two methods are subsampled bootstrapping FQE with subsample sizes s = K (the vanilla bootstrap) and s = K^{0.5} (the computationally efficient version), where B = 100. The third method is high-confidence off-policy evaluation (HCOPE) (Thomas et al., 2015), which we use as a baseline for comparison. HCOPE is a method for constructing off-policy confidence intervals for tabular MDPs; it is based on concentration inequalities and has a provable coverage guarantee. We also compare these methods with the oracle confidence interval (which is the true distribution's quantile obtained by Monte Carlo simulation).

Coverage and tightness of off-policy confidence interval (CI). We study the empirical coverage probability and interval width with different numbers of episodes. Figure 2 shows the result under different behavior policies. In the left panel of Figure 3, we report the effect of the number of bootstrap samples on the empirical coverage probability (ε-greedy behavior policy, K = 100). It is clear that the empirical coverage of our confidence interval based on bootstrapping FQE becomes increasingly close to the expected coverage (= 1 − δ) as the number of episodes increases. The width of the bootstrapping-FQE confidence interval is significantly tighter than that of HCOPE and very close to the oracle one. It is worth noting that, even in the on-policy case, our bootstrap-based confidence interval still has a clear advantage over the concentration-based confidence interval. The advantage of our method comes from the fact that it fully exploits the distribution information. However, the bootstrap confidence interval tends to under-cover when the number of episodes is extremely small (K = 10). Thus we suggest that practitioners use bootstrap methods when the sample size is moderately large (K > 50).

Further, the subsampled bootstrapping FQE demonstrates competitive performance as well as significantly reduced computation time. The saving in computation time becomes increasingly substantial as the data gets big; see the right panel of Figure 3.

Figure 3. Sample and time efficiency of bootstrapping FQE. Left: Empirical coverage of the bootstrapping-FQE CI as the number of bootstrap samples increases. Right: Runtime of bootstrapping FQE as the data size increases (with subsample size s = K^{0.5}).

Bootstrapping FQE for variance estimation. We study the performance of variance estimation using subsampled bootstrapping FQE under three different behavior policies. We vary the number of episodes, and the true Var(v̂_π(D)) is computed through the Monte Carlo method. We report the estimation error Var̂(v̂_π(D)) − Var(v̂_π(D)) across 200 trials in the left panel of Figure 4.

[Figures 4 and 5 appear here. Figure 4, left panel: variance-estimation error versus number of episodes for on-policy, ε-greedy, and softmax behavior policies; right panel: policy-value confidence intervals comparing the oracle CI, Duan & Wang (2020), and the subsampled bootstrap. Figure 5, both panels: estimated correlation versus ε. Captions follow below.]

Figure 4. Bootstrapping for variance estimation and with function approximation. Left: Error of variance estimates, as data size increases. Right: Confidence interval constructed using bootstrapping FQE with linear function approximation.

7.2. Experiment with Mountain Car using linear function approximation

Next we test the methods on the classical Mountain Car environment (Moore, 1990) with linear function approximation. We artificially added a Gaussian random force to the car's dynamics to create stochastic transitions. For the linear function approximation, we choose 400 radial basis functions (RBF) as the feature map. The target policy is chosen as the optimal policy trained by Q-learning, and the behavior policy is chosen to be the ε-greedy policy with ε = 0.1 based on the learned optimal Q-function.

For comparison, we compute an empirical Bernstein-inequality-based confidence interval (Duan & Wang, 2020), which to the best of our knowledge is the only provable CI based on FQE with function approximation (see Appendix D for its detailed form). We also compute the oracle CI using Monte Carlo simulation. The right panel of Figure 4 gives the results. According to the results, our method demonstrates good coverage of the ground truth and is much tighter than the concentration-based CI, even though both of them use linear function approximation.

7.3. Experiment with septic management using neural nets for function approximation

Lastly, we consider a real-world healthcare problem for treating sepsis in the intensive care unit (ICU). We use the septic management simulator by Oberst & Sontag (2019) for our study. It simulates a patient's vital signs, e.g., the heart rate, blood pressure, oxygen concentration, and glucose levels, with three treatment actions (antibiotics, vasopressors, and mechanical ventilation) to choose from at each time step. The reward is +1 when a patient is discharged and −1 if the patient reaches a life-critical state.

We apply bootstrapping FQE using a neural network function approximator with three fully connected layers, where the first layer uses 256 units and a ReLU activation function, the second layer uses 32 units and a SeLU activation function, and the last layer uses Softsign. The network takes as input the state-action pair (an 11-dimensional vector) and outputs a Q-value estimate. Let the behavior policy be the ε-greedy policy with ε = 0.15.

We evaluate two policies based on the same set of data. This is very common in healthcare problems, since we may have multiple treatments chosen by the doctor. One target policy is fixed to be the optimal policy, while we vary the other one with different ε-greedy noise. We expect the correlation to decrease as the difference between the two target policies increases. Figure 5 is well aligned with our expectation. In Figure 6, we plot the confidence region of the two target policies obtained by bootstrapping FQE using neural networks. According to Figures 5 and 6, the bootstrapping FQE method can effectively construct confidence regions and correlation estimates, even when using neural networks for function approximation. These results suggest that the proposed bootstrapping FQE method reliably achieves off-policy inference with more general function approximators.

Figure 5. Bootstrapping FQE with neural nets for estimating the correlation between two FQE estimators. The left panel uses 300 episodes, while the right panel uses 500 episodes.

Figure 6. Estimated confidence region for evaluating two policies using bootstrapping FQE with neural networks. The two target policies are the optimal policy and the 0.15-greedy policy. Red points are the true values of those two target policies. From left to right, the sample sizes are K = 100, 300, 500.

8. Conclusion

This paper studies bootstrapping FQE for statistical off-policy inference and establishes its asymptotic distributional consistency as a theoretical benchmark. Our experiments suggest that bootstrapping FQE is effective and efficient in a range of tasks, from tabular problems to continuous problems, with linear and neural network approximation.

References

Bickel, P. J. and Freedman, D. A. Some asymptotic theory for the bootstrap. The Annals of Statistics, pp. 1196–1217, 1981.

Dai, B., Nachum, O., Chow, Y., Li, L., Szepesvári, C., and Schuurmans, D. CoinDICE: Off-policy confidence interval estimation. arXiv preprint arXiv:2010.11652, 2020.

Duan, Y. and Wang, M. Minimax-optimal off-policy evaluation with linear function approximation. International Conference on Machine Learning, 2020.

Eck, D. J. Bootstrapping for multivariate linear regression models. Statistics & Probability Letters, 134:141–149, 2018.

Efron, B. The jackknife, the bootstrap and other resampling plans. SIAM, 1982.

Ernst, D., Geurts, P., and Wehenkel, L. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556, 2005.

Feng, Y., Ren, T., Tang, Z., and Liu, Q. Accountable off-policy evaluation with kernel Bellman statistics. Proceedings of the International Conference on Machine Learning, 2020.

Feng, Y., Tang, Z., Zhang, N., and Liu, Q. Non-asymptotic confidence intervals of off-policy evaluation: Primal and dual bounds. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=dKg5D1Z1Lm.

Fonteneau, R., Murphy, S. A., Wehenkel, L., and Ernst, D. Batch mode reinforcement learning based on the synthesis of artificial trajectories. Annals of Operations Research, 208(1):383–416, 2013.

Freedman, D. A. et al. Bootstrapping regression models. The Annals of Statistics, 9(6):1218–1228, 1981.

Hallak, A. and Mannor, S. Consistent on-line off-policy evaluation. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 1372–1383. JMLR.org, 2017.

Hanna, J. P., Stone, P., and Niekum, S. Bootstrapping with models: Confidence intervals for off-policy evaluation. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

Hao, B., Abbasi-Yadkori, Y., Wen, Z., and Cheng, G. Bootstrapping upper confidence bound. Thirty-fourth Annual Conference on Neural Information Processing Systems, 2020a.

Hao, B., Duan, Y., Lattimore, T., Szepesvári, C., and Wang, M. Sparse feature selection makes batch reinforcement learning more sample efficient. arXiv preprint arXiv:2011.04019, 2020b.

Jiang, N. and Huang, J. Minimax value interval for off-policy evaluation and policy optimization. Advances in Neural Information Processing Systems, 33, 2020.

Jiang, N. and Li, L. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 652–661, 2016.

Kallus, N. and Uehara, M. Double reinforcement learning for efficient off-policy evaluation in Markov decision processes. Journal of Machine Learning Research, 21(167):1–63, 2020.

Kato, K. A note on moment convergence of bootstrap M-estimators. Statistics & Risk Modeling, 28(1):51–61, 2011.

Kleiner, A., Talwalkar, A., Sarkar, P., and Jordan, M. I. A scalable bootstrap for massive data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), pp. 795–816, 2014.

Kostrikov, I. and Nachum, O. Statistical bootstrapping for uncertainty estimation in off-policy evaluation. arXiv preprint arXiv:2007.13609, 2020.

Kuzborskij, I., Vernade, C., György, A., and Szepesvári, C. Confident off-policy evaluation and selection through self-normalized importance weighting. arXiv preprint arXiv:2006.10460, 2020.

Lagoudakis, M. G. and Parr, R. Least-squares policy iteration. Journal of Machine Learning Research, 4(Dec):1107–1149, 2003.

Le, H. M., Voloshin, C., and Yue, Y. Batch policy learning under constraints. Proceedings of Machine Learning Research, 97:3703–3712, 2019.

Liao, P., Klasnja, P., and Murphy, S. Off-policy estimation of long-term average outcomes with applications to mobile health. arXiv preprint arXiv:1912.13088, 2019.

Liu, Q., Li, L., Tang, Z., and Zhou, D. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, pp. 5356–5366, 2018.

McLeish, D. L. et al. Dependent central limit theorems and invariance principles. The Annals of Probability, 2(4):620–628, 1974.

Moore, A. W. Efficient memory-based learning for robot control. 1990.

Munos, R. and Szepesvári, C. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(5), 2008.

Nachum, O., Chow, Y., Dai, B., and Li, L. DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. In Advances in Neural Information Processing Systems, pp. 2315–2325, 2019.

Oberst, M. and Sontag, D. Counterfactual off-policy evaluation with Gumbel-max structural causal models. In International Conference on Machine Learning, pp. 4881–4890, 2019.

Paine, T. L., Paduraru, C., Michi, A., Gulcehre, C., Zolna, K., Novikov, A., Wang, Z., and de Freitas, N. Hyperparameter selection for offline reinforcement learning. arXiv preprint arXiv:2007.09055, 2020.

Petersen, K. and Pedersen, M. The Matrix Cookbook. Technical University of Denmark, Technical Manual, 2008.

Precup, D., Sutton, R. S., and Singh, S. Eligibility traces for off-policy policy evaluation. In ICML'00 Proceedings of the Seventeenth International Conference on Machine Learning, 2000.

Sengupta, S., Volgushev, S., and Shao, X. A subsampled double bootstrap for massive data. Journal of the American Statistical Association, 111(515):1222–1232, 2016.

Shi, C., Zhang, S., Lu, W., and Song, R. Statistical inference of the value function for reinforcement learning in infinite horizon settings. arXiv preprint arXiv:2001.04515, 2020.

Singh, K. On the asymptotic accuracy of Efron's bootstrap. The Annals of Statistics, pp. 1187–1195, 1981.

Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT Press, 2018.

Thomas, P. and Brunskill, E. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 2139–2148, 2016.

Thomas, P., Theocharous, G., and Ghavamzadeh, M. High confidence policy improvement. In International Conference on Machine Learning, pp. 2380–2388. PMLR, 2015.

Uehara, M. and Jiang, N. Minimax weight and Q-function learning for off-policy evaluation. arXiv preprint arXiv:1910.12809, 2019.

Van der Vaart, A. W. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.

Voloshin, C., Le, H. M., Jiang, N., and Yue, Y. Empirical study of off-policy policy evaluation for reinforcement learning. arXiv preprint arXiv:1911.06854, 2019.

Wang, R., Foster, D. P., and Kakade, S. M. What are the statistical limits of offline RL with linear function approximation? arXiv preprint arXiv:2010.11895, 2020.

Xie, T., Ma, Y., and Wang, Y.-X. Towards optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. In Advances in Neural Information Processing Systems, pp. 9665–9675, 2019.

Yin, M. and Wang, Y.-X. Asymptotically efficient off-policy evaluation for tabular reinforcement learning. International Conference on Artificial Intelligence and Statistics, 2020.

Zhang, R., Dai, B., Li, L., and Schuurmans, D. GenDICE: Generalized offline estimation of stationary values. arXiv preprint arXiv:2002.09072, 2020a.

Zhang, S., Liu, B., and Whiteson, S. GradientDICE: Rethinking generalized offline estimation of stationary values. arXiv preprint arXiv:2001.11113, 2020b.

A. Additional Related Work

Popular approaches to OPE include direct methods (Lagoudakis & Parr, 2003; Ernst et al., 2005; Munos & Szepesvári, 2008; Le et al., 2019), doubly robust / importance sampling (Precup et al., 2000; Jiang & Li, 2016; Thomas & Brunskill, 2016), and marginalized importance sampling (Hallak & Mannor, 2017; Liu et al., 2018; Xie et al., 2019; Nachum et al., 2019; Uehara & Jiang, 2019; Zhang et al., 2020a;b). On the theoretical side, Uehara & Jiang (2019); Yin & Wang (2020) established asymptotic optimality and efficiency for OPE in the tabular setting, and Kallus & Uehara (2020) provided a complete study of semiparametric efficiency in a more general setting. Duan & Wang (2020); Hao et al. (2020b) showed that FQE with linear/sparse linear function approximation is minimax optimal, and Wang et al. (2020) studied the fundamental hardness of OPE with linear function approximation. In the statistics community, Liao et al. (2019) studied OPE in an infinite-horizon undiscounted MDP and derived the asymptotic distribution of an empirical Bellman residual minimization estimator. Their asymptotic variance has a tabular representation and thus does not show the effect of function approximation. Shi et al. (2020) considered asymptotic confidence intervals for the policy value, but under a different model assumption that the Q-function is smooth.

B. Proofs of Main Theorems

B.1. Full Algorithm of General FQE

Algorithm 2 Fitted Q-Evaluation (Le et al., 2019)
input: Dataset D = {D_1, ..., D_K}, target policy π, function class F, initial state distribution ξ_1.
1: Initialize Q̂^π_{H+1} = 0.
2: for h = H, H − 1, ..., 1 do
3:   Compute regression targets for any k ∈ [K], h' ∈ [H]:

     y^k_{h,h'} = r^k_{h'} + \int_a \hat{Q}^\pi_{h+1}(s^k_{h'+1}, a)\pi(a|s^k_{h'+1})\,da.

4:   Build the training set {((s^k_{h'}, a^k_{h'}), y^k_{h,h'})}_{k∈[K], h'∈[H]}.
5:   Solve a supervised learning problem:

     \hat{Q}^\pi_h = \operatorname*{argmin}_{f \in F} \Big\{ \frac{1}{K}\sum_{k=1}^{K}\frac{1}{H}\sum_{h'=1}^{H} \big( f(s^k_{h'}, a^k_{h'}) - y^k_{h,h'} \big)^2 + \lambda\rho(f) \Big\},

     where ρ(f) is a proper regularizer.
6: end for
output: v̂_π = ∫_s ∫_a Q̂^π_1(s, a) ξ_1(s) π(a|s) ds da.

We restate the full algorithm of FQE in Algorithm 2. Here we simply assume the initial state distribution ξ_1 is known. In practice, we always have access to samples from ξ_1 and thus can approximate it by Monte Carlo sampling.

B.2. Equivalence between FQE and model-based plug-in estimator

We show that FQE in Algorithm 2 with a linear function class F is equivalent to a plug-in estimator. This equivalence is helpful for deriving the asymptotic normality of FQE and of bootstrapping FQE. Define

\widehat{M}_\pi = \widehat{\Sigma}^{-1}\sum_{n=1}^{N}\phi(s_n, a_n)\phi^\pi(s_{n+1})^\top, \qquad \widehat{R} = \widehat{\Sigma}^{-1}\sum_{n=1}^{N} r_n\,\phi(s_n, a_n), \qquad \widehat{\Sigma} = \sum_{n=1}^{N}\phi(s_n, a_n)\phi(s_n, a_n)^\top + \lambda I_d, \qquad (B.1)

where φ^π(s) = ∫_a φ(s, a)π(a|s)da, s_{N+1} is the terminal state, and λ is the regularization parameter. Choosing ρ(f) to be the ridge penalty on the linear coefficient of f (which produces the λI_d term in Σ̂), FQE is equivalent to, for h = H, ..., 1, Q̂^π_h(s, a) = φ(s, a)^⊤ ŵ^π_h with

\hat{w}^\pi_h = \widehat{\Sigma}^{-1}\sum_{n=1}^{N}\phi(s_n, a_n)\Big( r_n + \int_a \hat{Q}^\pi_{h+1}(s_{n+1}, a)\pi(a|s_{n+1})\,da \Big)
            = \widehat{\Sigma}^{-1}\sum_{n=1}^{N}\phi(s_n, a_n)\Big( r_n + \int_a \phi(s_{n+1}, a)^\top \hat{w}^\pi_{h+1}\pi(a|s_{n+1})\,da \Big)
            = \widehat{\Sigma}^{-1}\sum_{n=1}^{N}\phi(s_n, a_n) r_n + \widehat{\Sigma}^{-1}\sum_{n=1}^{N}\phi(s_n, a_n)\phi^\pi(s_{n+1})^\top \hat{w}^\pi_{h+1}
            = \widehat{R} + \widehat{M}_\pi \hat{w}^\pi_{h+1}.

This gives us a recursive form of ŵ^π_h. Denoting ŵ^π_{H+1} = 0 and ν^π_1 = E_{s∼ξ_1, a∼π(·|s)}[φ(s, a)], the FQE estimator can be written as

\hat{v}_\pi = \int_s\int_a \hat{Q}^\pi_1(s, a)\,\xi_1(s)\pi(a|s)\,da\,ds = (\nu^\pi_1)^\top \hat{w}^\pi_1 = (\nu^\pi_1)^\top \sum_{h=0}^{H-1}(\widehat{M}_\pi)^h \widehat{R}. \qquad (B.2)

On the other hand, from Condition 4.1, there exist w_r, w^π_h ∈ R^d such that Q^π_h(·, ·) = φ(·, ·)^⊤ w^π_h for each h ∈ [H] and r(·, ·) = φ(·, ·)^⊤ w_r, and there exists M_π ∈ R^{d×d} such that φ(s, a)^⊤ M_π = E[φ^π(s')^⊤ | s, a]. From the Bellman equation and Condition 4.1,

Q^\pi_h(s, a) = r(s, a) + \mathbb{E}\Big[\int_a Q^\pi_{h+1}(s', a)\pi(a|s')\,da \,\Big|\, s, a\Big]
            = \phi(s, a)^\top w_r + \phi(s, a)^\top \mathbb{E}[\phi^\pi(s')^\top | s, a]\,w^\pi_{h+1} = \phi(s, a)^\top\big(w_r + M_\pi w^\pi_{h+1}\big) \qquad (B.3)
            = \phi(s, a)^\top \sum_{h'=0}^{H-h}(M_\pi)^{h'} w_r.

Therefore, the true scalar value function can be written as

v_\pi = \mathbb{E}_{s\sim\xi_1, a\sim\pi(\cdot|s)}\big[Q^\pi_1(s, a)\big] = (\nu^\pi_1)^\top \sum_{h=0}^{H-1}(M_\pi)^h w_r,

which implies Eq. (B.2) is a plug-in estimator.
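The following is our own small numerical check, not part of the paper, of the plug-in form (B.2): with randomly generated stand-ins for M̂_π, R̂ and ν^π_1, the backward recursion ŵ^π_h = R̂ + M̂_π ŵ^π_{h+1}, started from ŵ^π_{H+1} = 0, yields the same value as the closed form (ν^π_1)^⊤ Σ_{h=0}^{H−1}(M̂_π)^h R̂.

import numpy as np

rng = np.random.default_rng(0)
d, H = 5, 8
M_hat = rng.normal(size=(d, d)) / (2 * d)   # stand-in for Sigma_hat^{-1} sum phi(s_n,a_n) phi^pi(s_{n+1})^T
R_hat = rng.normal(size=d)                  # stand-in for Sigma_hat^{-1} sum r_n phi(s_n, a_n)
nu1 = rng.normal(size=d)                    # stand-in for E_{s~xi_1, a~pi}[phi(s, a)]

# backward recursion: w_{H+1} = 0, w_h = R_hat + M_hat w_{h+1}
w = np.zeros(d)
for _ in range(H):
    w = R_hat + M_hat @ w
v_recursive = nu1 @ w

# closed form: (nu_1)^T sum_{h=0}^{H-1} M_hat^h R_hat
v_plugin = nu1 @ sum(np.linalg.matrix_power(M_hat, h) @ R_hat for h in range(H))

assert np.allclose(v_recursive, v_plugin)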

B.3. Proof of Theorem 4.2: Asymptotic normality of FQE

π π π > π > h−1 Recall νh = E [φ(xh, ah)|x1 ∼ ξ1] and denote (νbh ) = (ν1 ) Mcπ . We follow Lemma B.3 in Duan & Wang(2020) to decompose the error term into following three parts: √ N(vπ − vbπ) = E1 + E2 + E3, where

N H 1 X X π > −1  π π  E1 = √ (νh ) Σ φ(sn, an) Qh(sn, an) − rn + Vh+1(sn+1) , N n=1 h=1 H N X  π > −1 π > −1 1 X  π π  E2 = N(νbh ) Σb − (νh ) Σ √ φ(sn, an) Qh(sn, an) − rn + Vh+1(sn+1) , h=1 N n=1 H 1 X π > −1 π E3 = λ√ (νbh ) Σb wh . N h=0 Bootstrapping Statistical Inference for Off-Policy Evaluation √ To prove the asymptotic normality of N(vπ − vbπ), we will first prove the asymptotic normality of E1 and then show both E1 and E2 are asymptotically negligible. For n = 1, 2,...,N, we denote

H 1 X π > −1  π π  en = √ (νh ) Σ φ(sn, an) Qh(sn, an) − rn+1 + Vh+1(sn+1) . N h=1 PN  Then E1 = n=1 en. Define a filtration Fn n=1,...,N with Fn generated by (s1, a1, s2),..., (sn−1, an−1, sn) and   (sn, an). From the definition of value function, it is easy to see E en Fn = 0 that implies that {en}n∈[N] is a martingale difference sequence. To show the asymptotic normality, we use the following martingale central limit theorem for triangular arrays.

Theorem B.1 (Martingale CLT, Corollary 2.8 in (McLeish et al., 1974)). Let {Xmn; n = 1, . . . , km} be a martingale difference array (row-wise) on the probability triple (Ω, F,P ). Suppose Xmn satisfy the following two conditions:

km p X 2 p 2 max |Xmn| → 0, and Xmn → σ , 1≤n≤km n=1

Pkm d 2 for km → ∞. Then n=1 Xmn → N (0, σ ).

Recall that the variance σ2 is defined as

H X X σ2 = (νπ)>Σ−1Ω Σ−1νπ + 2 (νπ )>Σ−1Ω Σ−1νπ , h h,h h h1 h1,h2 h2 (B.4) h=1 h1

H h 1 X 1 1 1 1 > 1 1 i Ωh ,h = φ(s 0 , a 0 )φ(s 0 , a 0 ) ε 0 ε 0 , 1 2 E H h h h h h1,h h2,h h0=1

1 π 1 1 1 π 1 where ε 0 = Q (s 0 , a 0 ) − (r 0 + V (s 0 )). To apply Theorem B.1, we let k = N, X = e and we need to h1,h h1 h h h h1+1 h +1 m mn n verify the following two conditions:

H   X π > −1 1  π π  p max (νh ) Σ √ φ(sn, an) Qh(sn, an) − rn+1 + Vh+1(sn+1) → 0, as N → ∞, (B.5) 1≤n≤N h=1 N and N H 2 X  1 X π > −1  π π  p 2 √ (νh ) Σ φ(sn, an) Qh(sn, an) − rn+1 + Vh+1(sn+1) → σ , as N → ∞. (B.6) n=1 N h=1

π Verify Condition B.5: Since r ∈ [0, 1], we have rn + Vh+1(sn+1) ∈ [0,H − h]. For any n ∈ [N], we have

H ! X 1   π > −1 √ π π  (νh ) Σ φ(sn, an) Qh(sn, an) − rn+1 + Vh+1(sn+1) h=1 N H 1 X √ π > −1 π π  ≤ (νh ) Σ φ(sn, an) Qh(sn, an) − rn + Vh+1(sn+1) N h=1 H 1 X √ π > −1 ≤ (H − h + 1) (νh ) Σ φ(sn, an) . N h=1

π > −1 Note that (νh ) Σ φ(sn, an) is independent of N. Then for fixed d, H, Condition B.5 is satisfied when N → ∞. Bootstrapping Statistical Inference for Off-Policy Evaluation

2 2 2 2 Verify Condition B.6: Recall the definition of σ in Eq. (B.4) and let σ = σ1 + σ2 for H 2 X π > −1 −1 π σ1 = (νh ) Σ Ωh,hΣ νh , h=1 X σ2 = 2 (νπ )>Σ−1Ω Σ−1νπ . 2 h1 h1,h2 h2 h1

N H 2 X  1 X π > −1  π π  √ (νh ) Σ φ(sn, an) Qh(sn, an) − rn+1 + Vh+1(sn+1) n=1 N h=1 N H X 1 X   = (νπ)>Σ−1φ(s , a )φ(s , a )>Σ−1νπ Qπ(s , a ) − (r + V π (s ))2 N h n n n n h h n n n h+1 n+1 n=1 h=1 N X 1 X π > −1 π > −1 + 2 (ν ) Σ φ(sn, an)(ν ) Σ φ(sn, an) N h1 h2 n=1 h1

2 2 We denote the first term as I1, the second term as I2 and separately bound I1 − σ1 and I2 − σ2 as follows:

• We rewrite I1 in terms of episodes as H K H X π > −1/2 1 X 1 X −1/2 k k k k > k 2 −1/2 −1/2 π I = (ν ) Σ Σ φ(s 0 , a 0 )φ(s 0 , a 0 ) (ε 0 ) Σ Σ ν . 1 h K H h h h h hh h h=1 k=1 h=1 Moreover, denote H −1/2 h 1 X 1 1 1 1 > 1 2i −1/2 d×d Z = Σ φ(s 0 , a 0 )φ(s 0 , a 0 ) (ε 0 ) Σ ∈ . h E H h h h h hh R h0=1 Then we have H K H 2 X π > −1/2 1 X 1 X −1/2 k k k k > k 2 −1/2  −1/2 π |I − σ | = (ν ) Σ Σ φ(s 0 , a 0 )φ(s 0 , a 0 ) (ε 0 ) Σ − Z Σ ν 1 1 h K H h h h h hh h h h=1 k=1 h0=1

H 2 K H X π > −1/2 1 X  1 X −1/2 k k k k > k 2 −1/2  ≤ (νh ) Σ Σ φ(sh0 , ah0 )φ(sh0 , ah0 ) (εhh0 ) Σ − Zh , 2 K H 2 h=1 k=1 h0=1

p 2 where the last inequality is from Cauchy–Schwarz inequality. From Lemma C.7, we reach I1 → σ1 as K → ∞.

• We rewrite I2 as K H X π > −1/2 1 X 1 X −1/2 k k k k > k k −1/2 −1/2 π I = 2 (ν ) Σ Σ φ(s 0 , a 0 )φ(s 0 , a 0 ) ε 0 ε 0 Σ Σ ν , 2 h1 h h h h h1h h2h h2 K H 0 h1

1 1 i −1/2 d×d Zh h = Σ φ(s 0 , a 0 )φ(s 0 , a 0 ) ε 0 ε 0 Σ ∈ . 1 2 E H h h h h h1h h2h R h0=1 Then we have K H 2 X π > −1/2 1 X 1 X −1/2 k k k k > k k −1/2  −1/2 π |I − σ | = 2 (ν ) Σ Σ φ(s 0 , a 0 )φ(s 0 , a 0 ) ε 0 ε 0 Σ − Z Σ ν 2 2 h1 h h h h h1h h2h h1h2 h2 K H 0 h1

−1/2 π > −1/2 X X −1/2 k k k k > k k −1/2 ≤ 2 (νh ) Σ (νh ) Σ Σ φ(sh0 , ah0 )φ(sh0 , ah0 ) εh h0 εh h0 Σ − Zh1h2 . 1 2 2 2 1 2 2 K H 0 h1

p 2 From Lemma C.7, we reach I2 → σ2 as K → ∞.

Putting the above two steps together, we have verified Condition B.6. Then applying Theorem B.1 we obtain that d 2 E1 → N (0, σ ). On the other hand, according to Lemmas B.6, B.10 in (Duan & Wang, 2020), q 3.5 π > π −1 π π 1/2 −1/2 p ln(8dH/δ)dH |E2| ≤ 15 (ν ) (Σ ) ν · (Σ ) Σ · C1κ1(2 + κ2) · √ 0 0 2 N q 2 π > π −1 π π 1/2 −1/2 5 ln(8dH/δ)C1dH |E3| ≤ (ν ) (Σ ) ν · (Σ ) Σ · √ , 0 0 2 N with probability at least 1 − δ and κ1, κ2 are some problem-dependent constants that do not depend on N. When√ N → ∞, both |E2|, |E3| converge in probability to 0. By Slutsky’s theorem, we have proven the asymptotic normality of N(vπ −vbπ). 

B.4. Proof of Theorem 4.5: Efficiency bound Influence function. Recall that our dataset D consists of K i.i.d. trajectories, each of which has length H. Denote  τ : = s1, a1, r1, s2, a2, r2, . . . , sH , aH , rH , sH+1 .

For simplicity, we assume that the reward rh is deterministic given (sh, ah), i.e. rh = r(sh, ah) for some reward function r. The distribution of τ is given by ¯ P(dτ ) =ξ1(ds1, da1)P(ds2 s1, a1)¯π(da2 | s2)P(ds3 | s2, a2)

... P(dsH | sH−1, aH−1)¯π(daH | sH )P(dsH+1 | sH , aH ).

Define Pη : = P + η∆P where ∆P satisfies (∆P)F ⊆ F under condition 4.1. Denote score functions ∂ ∂ g(τ ) : = log P (dτ ) and g(s0 | s, a) : = log P (ds0 | s, a). ∂η η ∂η η Note that H X g(τ ) = g(sh+1 | sh, ah). h=1

We consider the pointwise estimation. The objective function ψξ1 is defined as

" H # X ψ (P ) : = r (s , a ) (s , a ) ∼ ξ , P , π . ξ1 η E η h h 1 1 1 η h=1

∂ We calculate the derivative ∂η ψξ1 (Pη) and have

" H h−1 # ∂ ∂ X Z Y ψξ1 (Pη) = r(sh, ah)ξ1(ds1, da1) Pη(dsj+1 | sj, aj)π(daj+1 | sj+1) ∂η ∂η h h=1 (S×A) j=1 H h−1 h−1 X Z  X  Y = r(sh, ah) g(sj+1 | sj, aj) ξ1(ds1, da1) Pη(dsj+1 | sj, aj)π(daj+1 | sj+1). h h=1 (S×A) j=1 j=1 Bootstrapping Statistical Inference for Off-Policy Evaluation

π  PH  By using Q-functions Qη,j(sj, aj) := E h=j rη(sh, ah) (sj, aj), Pη, π for j = 1, 2,...,H, Qη,H+1 := 0, we find that

H−1 H H−1 ∂ Z X X Y ψξ1 (Pη) = g(sj+1 | sj, aj) rη(sh, ah)ξ1(ds1, da1) Pη(dsi+1 | si, ai)π(dai+1 | si+1) ∂η H (S×A) j=1 h=j+1 i=1 H−1 j Z X Y = g(sj+1 | sj, aj)ξ1(ds1, da1) Pη(dsi+1 | si, ai)π(dai+1 | si+1) H (S×A) j=1 i=1 H H−1 ! X Y · rη(sh, ah) Pη(dsi+1 | si, ai)π(dai+1 | si+1) h=j+1 i=j+1 H−1 Z j X π Y = g(sj+1 | sj, aj)Qη,j+1(sj+1, aj+1)ξ1(ds1, da1) Pη(dsi+1 | si, ai)π(dai+1 | si+1) j+1 j=1 (S×A) i=1 H X  π  = E g(sh+1 | sh, ah)Vη,h+1(sh+1) (s1, a1) ∼ ξ1, Pη, π . h=1

It follows that

" H # ∂ X π ψξ1 (Pη) =E g(sh+1 | sh, ah)Vh+1(sh+1) (s1, a1) ∼ ξ1, P, π . ∂η η=0 h=1

> −1 π > −1   Define wh(s, a) : = φ(s, a) Σ νh = φ(s, a) Σ E φ(sh, ah) (s1, a1) ∼ ξ1, P, π for h = 1, 2,...,H. For any > f ∈ H with f(s, a) = φ(s, a) wf , we have

   >  E f(sh, ah) (s1, a1) ∼ ξ1, P, π =E φ(sh, ah) wf (s1, a1) ∼ ξ1, P, π  > −1 >  =E φ(sh, ah) Σ E(s,a)∼µ¯[φ(s, a)φ(s, a) ]wf (s1, a1) ∼ ξ1, P, π h  > −1 > i =E(s,a)∼µ¯ E φ(sh, ah) (s1, a1) ∼ ξ1, P, π Σ φ(s, a)φ(s, a) wf   =E(s,a)∼µ¯ wh(s, a)f(s, a) ,

 0 π 0  H where µ¯ is the distribution of dataset D. Since the mapping (s, a) 7→ E g(s | s, a)Vh (s ) s, a belongs to , therefore,

" H # ∂ X 0 π 0 ψξ1 (Pη) =E(s,a)∼µ¯ wh(s, a)g(s | s, a)Vh+1(s ) . ∂η η=0 h=1

 0  Note that E g(s | s, a) s, a = 0, therefore,

" H # ∂ X 0  π 0  π 0  ψξ1 (Pη) = E(s,a)∼µ¯ wh(s, a)g(s | s, a) Vh+1(s ) − E Vh+1(s ) s, a . ∂η η=0 h=1

By definition of µ, we have

∂ ψξ1 (Pη) ∂η η=0 H " H #

1 X X  π  π  ¯ = wh(sj, aj)g(sj+1 | sj, aj) V (sj+1) − V (sj+1) sj, aj (s1, a1) ∼ ξ1, P, π¯ . H E h+1 E h+1 j=1 h=1 Bootstrapping Statistical Inference for Off-Policy Evaluation

 0  We use the property E g(s | s, a) s, a = 0 again and derive that

∂ ψξ1 (Pη) ∂η η=0 H " H H #   1 X X X  π  π  ¯ = wh(sj, aj) g(sl+1 | sl, al) V (sj+1) − V (sj+1) sj, aj (s1, a1) ∼ ξ1, P, π¯ H E h+1 E h+1 j=1 h=1 l=1 " H H #

1 X X  π  π  ¯ = g(τ ) wh(sj, aj) V (sj+1) − V (sj+1) sj, aj (s1, a1) ∼ ξ1, P, π¯ . H E h+1 E h+1 h=1 j=1 We can conclude that H H ˙ 1 X X  π  π  ψP (τ ) : = wh0 (sh, ah) V 0 (sh+1) − V 0 (sh+1) sh, ah , H h +1 E h +1 h=1 h0=1 is an influence function.

Efficiency bound. For notational convenience, we take shorthands

H 0 X  π 0  π 0  q(s, a, s ) : = wh(s, a) Vh+1(s ) − E Vh+1(s ) s, a , h=1 and rewrite H 1 X ψ˙ (τ ) = q(s , a , s ). P H h h h+1 h=1  0  Since E q(s, a, s ) s, a = 0, we find that

" H 2 # H 1  X  1 X h i ψ˙ 2 (τ ) = q(s , a , s ) ξ¯ , P, π¯ = q2(s , a , s ) ξ¯ , P, π¯ . E P H2 E h h h+1 1 H2 E h h h+1 1 h=1 h=1 It follows that

 ˙ 2  1 h  2 0 i 1 h  2 0 i ψ (τ ) = (s,a)∼µ¯ q (s, a, s ) s, a = (s,a)∼µ¯ q (s, a, s ) s, a E P H E E H E E " H 2# 1 > −1 X  π 0  π 0  π = (s,a)∼µ¯ φ(s, a) Σ V (s ) − V (s ) s, a ν , H E h+1 E h+1 h h=1 which coincides with the asymptotic variance of OPE estimator defined in (4.2).



B.5. Proof of Theorem 5.1: Distributional consistency of bootstrapping FQE

In order to simplify the derivation, we assume λ = 0 and that the empirical covariance matrix Σ_{n=1}^{N} φ(s_n, a_n)φ(s_n, a_n)^⊤ is invertible in this section, since the effect of λ is asymptotically negligible. For a matrix A ∈ R^{m×n}, the vec operator stacks the columns of the matrix such that vec(A) ∈ R^{mn×1}. We use the equivalent form of FQE in Eq. (B.2):

\hat{v}_\pi = (\nu^\pi_1)^\top \sum_{h=0}^{H-1}(\widehat{M}_\pi)^h \widehat{R}.

Mcπ can be viewed as the solution of the following multivariate linear regression:

π > > φ (sn+1) = φ(sn, an) Mπ + ηn, Bootstrapping Statistical Inference for Off-Policy Evaluation √ π > > where ηn = φ (sn+1) − φ(sn, an) Mπ. We first derive the asymptotic distribution of Nvec(Mcπ − Mπ) that follows: √ √ N  −1 X  π 0 > >  Nvec(Mcπ − Mπ) = vec NΣb φ(sn, an) φ (sn) − φ(sn, an) Mπ n=1 (B.7) K H −1 1 X  1 X k k  π k0 > k k >  = (NΣb ⊗ Id)√ vec √ φ(sh, ah) φ (sh ) − φ(sh, ah) Mπ , K k=1 H h=1

k π k > k k > where ⊗ is kronecker product. Define ξh = φ (sh+1) − φ(sh, ah) Mπ. From the definition of Mπ, it is easy to see Z Z π k > k k k k > k k > E[φ (sh+1) |sh, ah] = P(s|sh, ah) π(a|s)φ(s, a) dads = φ(sh, ah) Mπ. s a Again with martingale central limit theorem and independence between each episode, we have as K → ∞,

K H 1 X  1 X k k k d √ vec √ φ(sh, ah)ξh → N(0, ∆), (B.8) K k=1 H h=1

2 2 where ∆ ∈ Rd ×d is the covariance matrix defined as: for j, k ∈ [d2] H H h  1 X k k k  1 X k k k i ∆jk = E vec √ φ(sh, ah)ξh vec √ φ(sh, ah)ξh . (B.9) j k H h=1 H h=1

k k Next we start to derive the conditional bootstrap asymptotic distribution. For notation simplicity, denote φhk = φ(sh, ah) π k > H and yhk = φ (sh+1) . We rewrite the dataset combined with feature map φ(·, ·) such that Dk = {φhk, yhk, rhk}h=1. Recall that we bootstrap D by episodes such that each episode is sampled with replacement to form the starred data ∗ ∗ ∗ ∗ H Dk = {φhk, yhk, rhk}h=1 for k ∈ [K]. More specifically,

K K K ∗ X ∗ ∗ X ∗ ∗ X ∗ φhk = Wk φhk, yhk = Wk yhk, rhk = Wk rhk, k=1 k=1 k=1 ∗ ∗ ∗ ∗ where W = (W1 ,...,WK ) is the bootstrap weight. For example, W could be a multinomial random vector with parameters (K; K−1,...,K−1) that forms the standard nonparametric bootstrap. Note that for different h ∈ [H], they have the same bootstrap weight and given the original samples D1,..., DK , the resampled vectors are independent. Define the ∗ ∗ corresponding starred quantity Mcπ , Rb as

K K K H ∗ ∗−1 X X ∗ ∗ ∗ ∗−1 X X ∗ ∗ Mcπ = Σb φhkyhk, Rb = Σb rhkφhk, k=1 h=1 k=1 h=1 where K H ∗ X X ∗ ∗> Σb = φhkφhk . k=1 h=1 √ ∗ We will derive the asymptotic distribution of N(vec(Mcπ − Mcπ)) by using the following decomposition: √ √ K H ∗  ∗−1 X X ∗ ∗ ∗  Nvec(Mcπ − Mcπ) = Nvec Σb φhk(yhk − φhkMcπ) k=1 h=1 K H ∗−1  1 X 1 X ∗ ∗ ∗  = (NΣb ⊗ Id)vec √ √ φhk(yhk − φhkMcπ) . K k=1 H h=1 We denote K H K H 1 X 1 X ∗ 1 X 1 X ∗ ∗ ∗ Z = √ √ φhk(yhk − φhkMπ),Z = √ √ φhk(yhk − φhkMcπ). K k=1 H h=1 K k=1 H h=1 Bootstrapping Statistical Inference for Off-Policy Evaluation

Both Z and Z∗ are the sum of independent d × d random matrices. We prove the bootstrap consistency using the Mallows metric as a central tool. The Mallows metric, relative to the Euclidean norm k · k, for two probability measures µ, ν in Rd is defined as 1/l l Λl(µ, ν) = inf E (kU − V k ), U∼µ,V ∼ν where U and V are two random vectors that U has law µ and V has law ν. For random variables U, V , we sometimes write Λl(U, V ) as the Λl-distance between the laws of U and V . We refer Bickel & Freedman(1981); Freedman et al.(1981) for more details about the properties of Mallows metric. Suppose the common distribution of original K episodes {D1,..., DK 2Hd+H is µ and their empirical distribution is µK . Both µ and µK are probability in R . From Lemma C.1, we know that Λ4(µK , µ) → 0 a.e. as K → ∞.

∗ 1 PH ∗ ∗> • Step 1. We prove Σb /N converges in conditional probability to Σ. From the bootstrap design, H h=1 φkhφkh is 1 PH ∗ ∗> 0 independent of H h=1 φk0hφk0h for any k 6= k . According to Lemma C.3, we have

K H K H K H H  X 1 X X 1 X  X  1 X 1 X  Λ φ∗ φ∗>, φ φ> ≤ Λ φ∗ φ∗>, φ φ> 1 H kh kh H kh kh 1 H kh kh H kh kh k=1 h=1 k=1 h=1 k=1 h=1 h=1 H H  1 X 1 X  = KΛ φ∗ φ∗>, φ φ> . 1 H kh kh H kh kh h=1 h=1 Both sides of the above inequality are random variables such that the distance is computed between the conditional distribution of the starred quantity and the unconditional distribution of the unstarred quantity. Define a mapping Hd d×d d f : R → R such that for any x1, . . . , xH ∈ R ,

$$f(x_1, \ldots, x_H) = \frac{1}{H}\sum_{h=1}^{H} x_h x_h^{\top}.$$
From Lemma C.2 applied with $f$, we have, as $K$ goes to infinity,

$$\Lambda_1\Big(\frac{1}{H}\sum_{h=1}^{H}\phi_{hk}^*\phi_{hk}^{*\top},\ \frac{1}{H}\sum_{h=1}^{H}\phi_{hk}\phi_{hk}^{\top}\Big) \to 0.$$

This implies that the conditional law of $\frac{1}{H}\sum_{h=1}^{H}\phi_{hk}^*\phi_{hk}^{*\top}$ is close to the unconditional law of $\frac{1}{H}\sum_{h=1}^{H}\phi_{hk}\phi_{hk}^{\top}$. By the law of large numbers,
$$\frac{1}{K}\sum_{k=1}^{K}\frac{1}{H}\sum_{h=1}^{H}\phi_{hk}\phi_{hk}^{\top} \xrightarrow{p} \Sigma. \tag{B.10}$$
This further implies that, conditional on $\mathcal{D}$, we have $\widehat{\Sigma}^*/N \xrightarrow{p} \Sigma$.
• Step 2. We prove that $Z^*$ conditionally converges to a multivariate Gaussian distribution. From Lemma C.4,

$$\Lambda_2^2\big(\mathrm{vec}(Z^*), \mathrm{vec}(Z)\big) \le \Lambda_2^2\Big(\mathrm{vec}\Big(\frac{1}{\sqrt{H}}\sum_{h=1}^{H}\phi_{hk}^*\big(y_{hk}^* - \phi_{hk}^{*\top}\widehat{M}_\pi\big)\Big),\ \mathrm{vec}\Big(\frac{1}{\sqrt{H}}\sum_{h=1}^{H}\phi_{hk}\big(y_{hk} - \phi_{hk}^{\top} M_\pi\big)\Big)\Big).$$

Using Lemma C.5, the right-hand side converges to 0 a.e. as $K \to \infty$. This means the conditional law of $\mathrm{vec}(Z^*)$ is close to the unconditional law of $\mathrm{vec}(Z)$, and the latter converges to a multivariate Gaussian distribution with zero mean and covariance matrix $\Delta$ by Eq. (B.8).

By Slutsky's theorem, we have, conditional on $\mathcal{D}$,
$$\sqrt{N}\,\mathrm{vec}(\widehat{M}_\pi^* - \widehat{M}_\pi) \xrightarrow{d} N\big(0,\ (\Sigma^{-1}\otimes I_d)\,\Delta\,(\Sigma^{-1}\otimes I_d)\big), \tag{B.11}$$
where $\Delta$ is defined in Eq. (B.9). According to the equivalence between FQE and the plug-in estimator established in Section B.2,

$$\widehat{v}_\pi^* = (\nu_1^\pi)^\top \sum_{h=0}^{H-1} (\widehat{M}_\pi^*)^h\, \widehat{R}^*, \qquad \widehat{v}_\pi = (\nu_1^\pi)^\top \sum_{h=0}^{H-1} (\widehat{M}_\pi)^h\, \widehat{R}.$$

Define a function $g : \mathbb{R}^{d \times d} \to \mathbb{R}$ as
$$g(M) := (\nu_1^\pi)^\top \sum_{h=0}^{H-1} M^h\, w_r.$$
By the formula for high-order matrix derivatives (Petersen & Pedersen, 2008), we have

$$\frac{\partial}{\partial M}\,(\nu_1^\pi)^\top M^h w_r = \sum_{r=0}^{h-1} (M^r)^\top\, \nu_1^\pi\, w_r^\top\, (M^{h-1-r})^\top \in \mathbb{R}^{d \times d}.$$

This implies that the gradient of $g$ at $\mathrm{vec}(M_\pi)$ is

$$\nabla g(\mathrm{vec}(M_\pi)) = \mathrm{vec}\Big(\sum_{h=0}^{H-1}\sum_{r=0}^{h-1} (M_\pi^r)^\top\, \nu_1^\pi\, w_r^\top\, (M_\pi^{h-1-r})^\top\Big) = \mathrm{vec}\Big(\sum_{h=1}^{H} \nu_h^\pi\, w_r^\top \sum_{h'=1}^{H-h} (M_\pi^\top)^{h'-1}\Big) \in \mathbb{R}^{d^2 \times 1}.$$
Applying the multivariate delta method (Theorem C.6) to Eq. (B.11), we have, conditional on $\mathcal{D}$,
$$\sqrt{N}\big(g(\widehat{M}_\pi^*) - g(\widehat{M}_\pi)\big) \xrightarrow{d} N\Big(0,\ \nabla^\top g(\mathrm{vec}(\widehat{M}_\pi))\,(\Sigma^{-1}\otimes I_d)\,\Delta\,(\Sigma^{-1}\otimes I_d)\,\nabla g(\mathrm{vec}(\widehat{M}_\pi))\Big),$$

where $\Delta$ is defined in Eq. (B.9). From Eq. (B.10), we have $\widehat{\Sigma}/N \xrightarrow{p} \Sigma$. Using Slutsky's theorem and Eqs. (B.7)–(B.8), we have
$$\sqrt{N}\,\mathrm{vec}(\widehat{M}_\pi - M_\pi) \xrightarrow{d} N\big(0,\ (\Sigma^{-1}\otimes I_d)\,\Delta\,(\Sigma^{-1}\otimes I_d)\big).$$

This further implies $\widehat{M}_\pi \xrightarrow{p} M_\pi$. By the continuous mapping theorem,
$$\sqrt{N}\big(g(\widehat{M}_\pi^*) - g(\widehat{M}_\pi)\big) \xrightarrow{d} N\Big(0,\ \nabla^\top g(\mathrm{vec}(M_\pi))\,(\Sigma^{-1}\otimes I_d)\,\Delta\,(\Sigma^{-1}\otimes I_d)\,\nabla g(\mathrm{vec}(M_\pi))\Big).$$
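As a sanity check on the matrix-power derivative used above, the following sketch (illustrative; random placeholder inputs) implements $g$ and compares the stated gradient formula against central finite differences:

```python
# Sketch: check the gradient of g(M) = nu1^T sum_{h<H} M^h w_r numerically.
import numpy as np

def g(M, nu1, wr, H):
    total, Mh = 0.0, np.eye(M.shape[0])
    for _ in range(H):                     # h = 0, ..., H-1
        total += nu1 @ Mh @ wr
        Mh = Mh @ M
    return total

def grad_g(M, nu1, wr, H):
    # sum_{h=0}^{H-1} sum_{r=0}^{h-1} (M^r)^T nu1 wr^T (M^{h-1-r})^T
    d = M.shape[0]
    G = np.zeros((d, d))
    for h in range(H):
        for r_idx in range(h):
            G += (np.linalg.matrix_power(M, r_idx).T @ np.outer(nu1, wr)
                  @ np.linalg.matrix_power(M, h - 1 - r_idx).T)
    return G

rng = np.random.default_rng(0)
d, H, eps = 4, 6, 1e-6
M, nu1, wr = 0.3 * rng.normal(size=(d, d)), rng.normal(size=d), rng.normal(size=d)
num = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        E = np.zeros((d, d)); E[i, j] = eps
        num[i, j] = (g(M + E, nu1, wr, H) - g(M - E, nu1, wr, H)) / (2 * eps)
print(np.max(np.abs(num - grad_g(M, nu1, wr, H))))  # tiny finite-difference error
```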

Now we simplify the variance term as follows:

$$
\begin{aligned}
&\nabla^\top g(\mathrm{vec}(M_\pi))\,(\Sigma^{-1}\otimes I_d)\,\Delta\,(\Sigma^{-1}\otimes I_d)\,\nabla g(\mathrm{vec}(M_\pi)) \\
&\quad= \sum_{h=1}^{H}\sum_{h_1'=1}^{H-h}\sum_{h_2'=1}^{H-h}(\nu_h^\pi)^\top\Sigma^{-1}\,\mathbb{E}\Big[\frac{1}{H}\sum_{h''=1}^{H}\phi(s_{h''}^1,a_{h''}^1)\phi(s_{h''}^1,a_{h''}^1)^\top\big(\xi_{h''}^1(M_\pi)^{h_1'-1}w_r\big)\big(\xi_{h''}^1(M_\pi)^{h_2'-1}w_r\big)\Big]\Sigma^{-1}\nu_h^\pi \\
&\qquad+ 2\sum_{h_1<h_2}\sum_{h_1'=1}^{H-h_1}\sum_{h_2'=1}^{H-h_2}(\nu_{h_1}^\pi)^\top\Sigma^{-1}\,\mathbb{E}\Big[\frac{1}{H}\sum_{h''=1}^{H}\phi(s_{h''}^1,a_{h''}^1)\phi(s_{h''}^1,a_{h''}^1)^\top\big(\xi_{h''}^1(M_\pi)^{h_1'-1}w_r\big)\big(\xi_{h''}^1(M_\pi)^{h_2'-1}w_r\big)\Big]\Sigma^{-1}\nu_{h_2}^\pi,
\end{aligned}
$$
where $\xi_h^1 = \phi(s_h^1, a_h^1)^\top M_\pi - \phi^\pi(s_{h+1}^1)^\top$. Recall that we define
$$
\varepsilon_{h,h'}^1 = Q_h^\pi(s_{h'}^1, a_{h'}^1) - \big(r_{h'}^1 + V_{h+1}^\pi(s_{h'+1}^1)\big)
= \sum_{h''=1}^{H-h}\big(\phi(s_{h'}^1, a_{h'}^1)^\top M_\pi - \phi^\pi(s_{h'+1}^1)^\top\big)(M_\pi)^{h''-1} w_r
= \sum_{h''=1}^{H-h}\xi_{h'}^1\,(M_\pi)^{h''-1} w_r,
$$
where the second equality follows from Eq. (B.3). This implies
$$
\begin{aligned}
&\nabla^\top g(\mathrm{vec}(M_\pi))\,(\Sigma^{-1}\otimes I_d)\,\Delta\,(\Sigma^{-1}\otimes I_d)\,\nabla g(\mathrm{vec}(M_\pi)) \\
&\quad= \sum_{h=1}^{H}(\nu_h^\pi)^\top\Sigma^{-1}\,\mathbb{E}\Big[\frac{1}{H}\sum_{h'=1}^{H}\phi(s_{h'}^1,a_{h'}^1)\phi(s_{h'}^1,a_{h'}^1)^\top(\varepsilon_{h,h'}^1)^2\Big]\Sigma^{-1}\nu_h^\pi \\
&\qquad+ 2\sum_{h_1<h_2}(\nu_{h_1}^\pi)^\top\Sigma^{-1}\,\mathbb{E}\Big[\frac{1}{H}\sum_{h'=1}^{H}\phi(s_{h'}^1,a_{h'}^1)\phi(s_{h'}^1,a_{h'}^1)^\top\,\varepsilon_{h_1,h'}^1\,\varepsilon_{h_2,h'}^1\Big]\Sigma^{-1}\nu_{h_2}^\pi \;=\; \sigma^2.
\end{aligned}
$$

On the other hand,
$$\widehat{R}^* = (\widehat{\Sigma}^*)^{-1}\sum_{k=1}^{K}\sum_{h=1}^{H} r_{hk}^*\,\phi_{hk}^* = KH\,(\widehat{\Sigma}^*)^{-1}\,\frac{1}{K}\sum_{k=1}^{K}\frac{1}{H}\sum_{h=1}^{H} r_{hk}^*\,\phi_{hk}^*.$$
Using Lemma C.3, we have
$$\Lambda_1\Big(\frac{1}{K}\sum_{k=1}^{K}\frac{1}{H}\sum_{h=1}^{H} r_{hk}^*\,\phi_{hk}^*,\ \frac{1}{K}\sum_{k=1}^{K}\frac{1}{H}\sum_{h=1}^{H} r_{hk}\,\phi_{hk}\Big) \le \Lambda_1\Big(\frac{1}{H}\sum_{h=1}^{H} r_{hk}^*\,\phi_{hk}^*,\ \frac{1}{H}\sum_{h=1}^{H} r_{hk}\,\phi_{hk}\Big).$$
The right-hand side of the display goes to 0 as $K \to \infty$. By the law of large numbers,
$$\frac{1}{K}\sum_{k=1}^{K}\frac{1}{H}\sum_{h=1}^{H} r_{hk}\,\phi_{hk} \xrightarrow{p} \mathbb{E}\Big[\frac{1}{H}\sum_{h=1}^{H}\phi(s_h^1, a_h^1)\phi(s_h^1, a_h^1)^\top w_r\Big].$$
Combining this with the fact that the conditional law of $\widehat{\Sigma}^*/N$ concentrates around $\Sigma$, this ends the proof. $\square$
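Before turning to the corollaries, here is a minimal sketch (not the paper's implementation) of how the bootstrap distribution of $\widehat{v}_\pi^* - \widehat{v}_\pi$ is used to form the confidence interval and the variance estimate analyzed next; `fqe_value` is a hypothetical routine returning the FQE point estimate from a list of episodes.

```python
# Sketch: percentile-of-differences bootstrap CI and bootstrap variance estimate.
import numpy as np

def bootstrap_ci(episodes, fqe_value, B=200, delta=0.05, seed=0):
    rng = np.random.default_rng(seed)
    K = len(episodes)
    v_hat = fqe_value(episodes)
    v_star = np.array([
        fqe_value([episodes[i] for i in rng.integers(0, K, size=K)])
        for _ in range(B)
    ])
    diffs = v_star - v_hat
    lo, hi = np.quantile(diffs, [delta / 2, 1 - delta / 2])
    ci = (v_hat - hi, v_hat - lo)       # confidence interval for v_pi
    var_hat = np.var(diffs, ddof=1)     # bootstrap variance estimate
    return v_hat, ci, var_hat
```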

B.6. Proofs of Corollary 5.2 and Corollary 5.4
We prove the consistency of the bootstrap confidence interval using Lemma 23.3 in Van der Vaart (2000). Suppose $\Psi(t) = \mathbb{P}(N(0, \sigma^2) \le t)$. Combining Theorem 4.2 and Theorem 5.1, we have
$$\mathbb{P}_{\mathcal{D}}\big(\sqrt{N}(\widehat{v}_\pi - v_\pi) \le t\big) \to \Psi(t), \qquad \mathbb{P}_{W^*\mid \mathcal{D}}\big(\sqrt{N}(\widehat{v}_\pi^* - \widehat{v}_\pi) \le t\big) \to \Psi(t).$$
Using the quantile convergence theorem (Lemma 21.1 in Van der Vaart (2000)), this implies $q_\delta^\pi \to \Psi^{-1}(\delta)$ almost surely. Therefore,
$$\mathbb{P}_{\mathcal{D}W^*}\Big(v_\pi \le \widehat{v}_\pi - \frac{q_{\delta/2}^\pi}{\sqrt{N}}\Big) = \mathbb{P}_{\mathcal{D}W^*}\big(\sqrt{N}(\widehat{v}_\pi - v_\pi) \ge q_{\delta/2}^\pi\big) \to \mathbb{P}_{\mathcal{D}W^*}\big(N(0, \sigma^2) \ge \Psi^{-1}(\delta/2)\big) = 1 - \delta/2.$$
This finishes the proof of Corollary 5.2.
It is well known that convergence in distribution implies convergence in moments under a uniform integrability condition. The proof of the consistency of the bootstrap moment estimation is then straightforward, since the condition $\limsup_{N\to\infty}\mathbb{E}_{W^*\mid\mathcal{D}}\big[\big(\sqrt{N}(\widehat{v}_\pi^* - \widehat{v}_\pi)\big)^q\big] < \infty$ for some $q > 2$ ensures the required uniform integrability. Together with the distributional consistency in Theorem 5.1, we apply Lemma 2.1 in Kato (2011) and reach the conclusion. $\square$

C. Supporting Results

We present a series of useful lemmas about the Mallows metric.
Lemma C.1 (Lemma 8.4 in Bickel & Freedman (1981)). Let $\{X_i\}_{i=1}^n$ be independent random variables with common distribution $\mu$. Let $\mu_n$ be the empirical distribution of $X_1, \ldots, X_n$. Then $\Lambda_l(\mu_n, \mu) \to 0$ a.e.

Lemma C.2 (Lemma 8.5 in Bickel & Freedman (1981)). Suppose $X_n, X$ are random variables and $\Lambda_l(X_n, X) \to 0$. Let $f$ be a continuous function. Then $\Lambda_l(f(X_n), f(X)) \to 0$.
Lemma C.3 (Lemma 8.6 of Bickel & Freedman (1981)). Let $\{U_i\}_{i=1}^n$ and $\{V_i\}_{i=1}^n$ be independent random vectors. Then
$$\Lambda_1\Big(\sum_{i=1}^{n} U_i,\ \sum_{i=1}^{n} V_i\Big) \le \sum_{i=1}^{n}\Lambda_1(U_i, V_i).$$
Lemma C.4 (Lemma 8.7 of Bickel & Freedman (1981)). Let $\{U_i\}_{i=1}^n$ and $\{V_i\}_{i=1}^n$ be independent random vectors with $\mathbb{E}[U_j] = \mathbb{E}[V_j]$. Then
$$\Lambda_2^2\Big(\sum_{i=1}^{n} U_i,\ \sum_{i=1}^{n} V_i\Big) \le \sum_{i=1}^{n}\Lambda_2^2(U_i, V_i).$$

Let $\mu_K$ and $\mu$ be probability measures on $\mathbb{R}^{2Hd}$. A data point in $\mathbb{R}^{2Hd}$ can be written as $(x_1, \ldots, x_H, y_1, \ldots, y_H)$, where $x_h \in \mathbb{R}^d$ and $y_h \in \mathbb{R}^d$. Denote

$$\Sigma(\mu) = \int \frac{1}{H}\sum_{h=1}^{H} x_h x_h^\top\,\mu(dx_1, \ldots, dx_H, dy_1, \ldots, dy_H),$$
$$M(\mu) = \Sigma(\mu)^{-1}\int \sum_{h=1}^{H} x_h y_h^\top\,\mu(dx_1, \ldots, dx_H, dy_1, \ldots, dy_H),$$
$$\varepsilon(\mu, x_1, \ldots, x_H, y_1, \ldots, y_H) = \sum_{h=1}^{H}\big(y_h - x_h^\top M(\mu)\big).$$

Lemma C.5 (Lemma 7 in Eck (2018)). If $\Lambda_4(\mu_K, \mu) \to 0$ as $K \to \infty$, then the $\mu_K$-law of $\mathrm{vec}\big(\sum_{h=1}^{H}\varepsilon(\mu_K, x_1, \ldots, x_H, y_1, \ldots, y_H)\,x_h^\top\big)$ converges to the $\mu$-law of $\mathrm{vec}\big(\sum_{h=1}^{H}\varepsilon(\mu, x_1, \ldots, x_H, y_1, \ldots, y_H)\,x_h^\top\big)$ in $\Lambda_2$.

Theorem C.6 (Multivariate delta theorem). Suppose $\{T_n\}$ is a sequence of $k$-dimensional random vectors such that $\sqrt{n}(T_n - \theta) \xrightarrow{d} N(0, \Sigma(\theta))$. Let $g : \mathbb{R}^k \to \mathbb{R}$ be once differentiable at $\theta$ with gradient $\nabla g(\theta)$. Then

$$\sqrt{n}\big(g(T_n) - g(\theta)\big) \xrightarrow{d} N\big(0,\ \nabla^\top g(\theta)\,\Sigma(\theta)\,\nabla g(\theta)\big).$$
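A quick one-dimensional Monte Carlo illustration of Theorem C.6 (not part of the paper):

```python
# Illustration: T_n is the sample mean of Exp(1) draws (theta = 1, variance 1) and
# g(x) = x^2, so sqrt(n)(g(T_n) - g(theta)) should be approximately N(0, g'(1)^2) = N(0, 4).
import numpy as np

rng = np.random.default_rng(0)
n, reps = 1_000, 5_000
T = rng.exponential(1.0, size=(reps, n)).mean(axis=1)
Z = np.sqrt(n) * (T ** 2 - 1.0)
print(Z.var())   # close to 4
```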

We restate Lemma B.5 in Duan & Wang (2020) below; it is proven using the matrix Bernstein inequality.
Lemma C.7. Under the assumption that $\phi(s, a)^\top \Sigma^{-1}\phi(s, a) \le C_1 d$ for all $(s, a) \in \mathcal{X}$, with probability at least $1 - \delta$,

$$\Big\|\Sigma^{-1/2}\Big(\frac{1}{N}\sum_{n=1}^{N}\phi(s_n, a_n)\phi(s_n, a_n)^\top\Big)\Sigma^{-1/2} - I\Big\|_2 \le \sqrt{\frac{2\ln(2d/\delta)\,C_1 d H}{N}} + \frac{2\ln(2d/\delta)\,C_1 d H}{3N}. \tag{C.1}$$
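As an illustration only, the following sketch checks the flavor of the bound in Eq. (C.1) in the simplified i.i.d. case $H = 1$, with features uniform on a sphere of radius $\sqrt{d}$ so that $\Sigma = I_d$ and $C_1 = 1$; the paper's setting with within-episode dependence is more general.

```python
# Illustration of Lemma C.7 (simplified to H = 1): compare the realized
# operator-norm deviation with the stated high-probability bound.
import numpy as np

rng = np.random.default_rng(0)
d, N, H, C1, delta = 10, 5_000, 1, 1.0, 0.05
phi = rng.normal(size=(N, d))
phi *= np.sqrt(d) / np.linalg.norm(phi, axis=1, keepdims=True)   # ||phi||^2 = d

deviation = np.linalg.norm(phi.T @ phi / N - np.eye(d), ord=2)
bound = np.sqrt(2 * np.log(2 * d / delta) * C1 * d * H / N) \
        + 2 * np.log(2 * d / delta) * C1 * d * H / (3 * N)
print(deviation, bound)   # the deviation should fall below the bound
```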

D. Supplement for Experiments

D.1. Experiment details
The original CliffWalking environment from OpenAI gym has deterministic state transitions. That is, for any state-action pair $(s, a)$, there exists a corresponding $s' \in \mathcal{S}$ such that $P(\cdot \mid s, a) = \delta_{s'}(\cdot)$. We modify the environment in order to make it stochastic.

[Figure 7: two panels titled "Softmax Behavior Policy"; x-axis: number of episodes in the dataset (10 to 500). Left panel: empirical coverage of the CI, with the expected coverage shown for reference; right panel: interval width. Methods shown: vanilla bootstrap, subsampled bootstrap, HCOPE, and the oracle CI.]
Figure 7. Left: Empirical coverage probability of CI; Right: CI width under different behavior policies.

Specifically, we introduce randomness in the state transitions: given a state-action pair $(s, a)$, the transition takes place in the same way as in the deterministic environment with probability $1 - \epsilon$, and takes place as if the action were a random action $a'$, instead of the intended $a$, with probability $\epsilon$. This is an episodic tabular MDP, and the agent stops when falling off the cliff or reaching the terminal point. We also reduce the penalty for falling off the cliff from $-100$ to $-50$.
The original MountainCar environment from OpenAI gym also has deterministic state transitions. We modify the environment in order to make it stochastic. Specifically, we introduce randomness in state transitions by adding a Gaussian random force, namely $\mathcal{N}(0, \tfrac{1}{10})$ multiplied by the constant-magnitude force from the original environment. We also increase the gravity parameter from 0.0025 to 0.008, the force parameter from 0.001 to 0.008, and the maximum allowed speed from 0.07 to 0.2.
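A minimal sketch of the kind of action-noise wrapper described above, assuming the classic `gym` API with discrete actions (the reward change and the MountainCar force perturbation are not shown, and the exact environment details may differ from the paper's):

```python
# Sketch: with probability eps, replace the chosen action by a uniformly random one.
import gym
import numpy as np

class SlipperyActionWrapper(gym.Wrapper):
    def __init__(self, env, eps=0.1, seed=0):
        super().__init__(env)
        self.eps = eps
        self.rng = np.random.default_rng(seed)

    def step(self, action):
        if self.rng.random() < self.eps:
            action = self.env.action_space.sample()   # act as if a random action were taken
        return self.env.step(action)

# Example usage (hypothetical): env = SlipperyActionWrapper(gym.make("CliffWalking-v0"), eps=0.1)
```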

Empirical coverage probability. The preceding discussion leads to the following simulation method for estimating the coverage probability of a confidence interval. The simulation method has three steps:

1. Simulate many fresh datasets, each containing K episodes generated by the behavior policy.

2. Compute the confidence interval for each dataset.

3. Compute the proportion of datasets for which the true value of the target policy is contained in the confidence interval. That proportion is an estimate of the empirical coverage probability of the confidence interval (a compact sketch of this procedure is given below).
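A compact sketch of the three steps above; `simulate_dataset` and `confidence_interval` are hypothetical callables supplied by the user (the first draws K episodes under the behavior policy, the second maps a dataset to a (lower, upper) interval for the target policy's value, e.g., via bootstrapping FQE):

```python
# Sketch: Monte Carlo estimate of the empirical coverage probability of a CI.
import numpy as np

def empirical_coverage(simulate_dataset, confidence_interval, true_value,
                       K, n_trials=200, delta=0.05, seed=0):
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_trials):
        data = simulate_dataset(K, rng)               # step 1
        lo, hi = confidence_interval(data, delta)     # step 2
        hits += (lo <= true_value <= hi)              # step 3
    return hits / n_trials                            # coverage estimate
```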

The true value of the target policy is computed through Monte Carlo rollouts with a sufficient number of samples (10000 in our experiments). With linear function approximation, we use the confidence interval proposed in Section 6 of Duan & Wang (2020) as a baseline, since it is the only available confidence interval based on FQE. In particular, it shows that with probability at least $1 - \delta$,

$$|\widehat{v}_\pi - v_\pi| \le \sum_{h=1}^{H} (H - h + 1)\,\sqrt{(\widehat{\nu}_h^\pi)^\top\widehat{\Sigma}^{-1}\widehat{\nu}_h^\pi}\;\Bigg(\sqrt{2\lambda} + 2\sqrt{2d\,\log\Big(1 + \frac{N}{\lambda d}\Big)\log\frac{3N^2H}{\delta}} + \frac{4}{3}\log\frac{3N^2H}{\delta}\Bigg),$$

where $(\widehat{\nu}_h^\pi)^\top = (\nu_1^\pi)^\top(\widehat{M}_\pi)^{h-1}$ and $\widehat{M}_\pi$ is defined in Eq. (B.1).

D.2. Additional experiments
In Figure 7, we include the results for the softmax behavior policy in the CliffWalking environment. In Figure 8, we include the results for correlation estimation in the CliffWalking environment. The behavior policy is an $\epsilon$-greedy policy with $\epsilon = 0.1$, while the two target policies are the optimal policy and an $\epsilon$-greedy policy with $\epsilon = 0.1$.


Figure 8. Error of correlation estimates, as data size increases.