
Bootstrapping Statistical Inference for Off-Policy Evaluation

Botao Hao 1   Xiang (Jack) Ji 2   Yaqi Duan 2   Hao Lu 2   Csaba Szepesvári 1 3   Mengdi Wang 1 2

*Equal contribution. 1DeepMind 2Princeton University 3University of Alberta. Correspondence to: Botao Hao. Preliminary work. arXiv:2102.03607v2 [stat.ML] 9 Feb 2021

Abstract

Bootstrapping provides a flexible and effective approach for assessing the quality of batch reinforcement learning, yet its theoretical properties are less understood. In this paper, we study the use of bootstrapping in off-policy evaluation (OPE), and in particular, we focus on the fitted Q-evaluation (FQE) method that is known to be minimax-optimal in the tabular and linear-model cases. We propose a bootstrapping FQE method for inferring the distribution of the policy evaluation error and show that this method is asymptotically efficient and distributionally consistent for off-policy statistical inference. To overcome the computational limit of bootstrapping, we further adapt a subsampling procedure that improves the runtime by an order of magnitude. We numerically evaluate the bootstrapping method in classical RL environments for confidence interval estimation, estimating the variance of an off-policy evaluator, and estimating the correlation between multiple off-policy evaluators.

1. Introduction

Off-policy evaluation (OPE) often serves as the starting point of batch reinforcement learning (RL). The objective of OPE is to estimate the value of a target policy based on batch episodes of state-transition trajectories that were generated using a possibly different and unknown behavior policy. In this paper, we investigate statistical inference for OPE. In particular, we analyze the popular fitted Q-evaluation (FQE) method, which is a basic model-free approach that fits the unknown value function from data using function approximation and backward dynamic programming (Fonteneau et al., 2013; Munos & Szepesvári, 2008; Le et al., 2019).

In practice, FQE has demonstrated robust and satisfying performance on many classical RL tasks under different metrics (Voloshin et al., 2019). A more recent study by Paine et al. (2020) demonstrated surprising scalability and effectiveness of FQE with deep neural nets in a range of complex continuous-state RL tasks. On the theoretical side, FQE was proved to be a minimax-optimal policy evaluator in the tabular and linear-model cases (Yin & Wang, 2020; Duan & Wang, 2020).

The aforementioned research mostly focuses on point estimation for OPE. In practical batch RL applications, a point estimate is far from enough; statistical inference for OPE is of great interest. For instance, one often hopes to construct a tight confidence interval around the policy value, estimate the variance of an off-policy evaluator, or evaluate multiple policies using the same data and estimate their correlations. Bootstrapping (Efron, 1982) is a conceptually simple and generalizable approach to infer the error distribution based on batch data. Therefore, in this work, we study the use of bootstrapping for off-policy inference. We provide theoretical justifications as well as numerical experiments. Our main results are summarized below:

• First, we analyze the asymptotic distribution of FQE with linear function approximation and show that the policy evaluation error asymptotically follows a normal distribution (Theorem 4.2). The asymptotic variance matches the Cramér–Rao lower bound for OPE (Theorem 4.5), which implies that this estimator is asymptotically efficient.

• We propose a bootstrapping FQE method for estimating the distribution of the off-policy evaluation error. We prove that bootstrapping FQE is asymptotically consistent in estimating the distribution of the original FQE (Theorem 5.1) and establish the consistency of the bootstrap confidence interval as well as bootstrap variance estimation. Further, we propose a subsampled bootstrap procedure to improve the computational efficiency of bootstrapping FQE.

• We highlight the necessity of bootstrapping by episodes, rather than by sample transitions as considered in previous work (Kostrikov & Nachum, 2020). The reason is that bootstrapping dependent data in general fails to characterize the right error distribution (Remark 2.1 in Singh (1981)). We illustrate this phenomenon via experiments (see Figure 1). All our theoretical analysis applies to episodic dependent data, and we do not require the i.i.d. sample-transition assumption commonly made in the OPE literature (Jiang & Huang, 2020; Kostrikov & Nachum, 2020; Dai et al., 2020).

• Finally, we evaluate subsampled bootstrapping FQE in a range of classical RL tasks, including a discrete tabular domain, a continuous control domain and a simulated healthcare example. We test variants of bootstrapping FQE with tabular representation, linear function approximation, and neural networks. We carefully examine the effectiveness and tightness of bootstrap confidence intervals, as well as the accuracy of bootstrapping for estimating the variance and correlation for OPE.

Related Work. Point estimation of OPE has received considerable attention in recent years; we include a detailed literature review in Appendix A. Confidence interval estimation of OPE is also important in many high-stake applications. Thomas et al. (2015) proposed a high-confidence OPE method based on importance sampling and the empirical Bernstein inequality. Kuzborskij et al. (2020) proposed a tighter confidence interval for contextual bandits based on an empirical Efron–Stein inequality. However, importance sampling suffers from the curse of horizon (Liu et al., 2018), and concentration-based confidence intervals are typically overly conservative since they only exploit tail information (Hao et al., 2020a). Another line of recent work formulated the estimation of confidence intervals as an optimization problem (Feng et al., 2020; 2021; Dai et al., 2020). These works are specific to confidence interval construction for OPE and do not provide a distributional consistency guarantee; thus they do not easily generalize to other statistical inference tasks. Several existing works have investigated the use of bootstrapping in OPE. Thomas et al. (2015); Hanna et al. (2017) constructed confidence intervals by bootstrapping an importance sampling estimator or learned models, but did not come with any consistency guarantee. The most related work is Kostrikov & Nachum (2020), which provided the first asymptotic consistency of a bootstrap confidence interval for OPE. Our analysis improves on their work in the following aspects. First, we study FQE with linear function approximation, while Kostrikov & Nachum (2020) only considered the tabular case. Second, we provide distributional consistency of bootstrapping FQE, which is stronger than the consistency of the confidence interval in Kostrikov & Nachum (2020).

2. Preliminary

Consider an episodic Markov decision process (MDP) defined by a tuple M = (S, A, P, r, H). Here, S is the state space, A is the action space, P(s'|s, a) is the probability of reaching state s' when taking action a in state s, r : S × A → [0, 1] is the reward function, and H is the length of the horizon. A policy π : S → P(A) maps states to a distribution over actions. The state-action value function (Q-function) is defined as, for h = 1, ..., H,

Q^\pi_h(s, a) = \mathbb{E}^\pi\Big[\sum_{h'=h}^{H} r(s_{h'}, a_{h'}) \,\Big|\, s_h = s, a_h = a\Big],

where a_{h'} ∼ π(· | s_{h'}), s_{h'+1} ∼ P(· | s_{h'}, a_{h'}), and E^π denotes expectation over the sample path generated under policy π. The Q-function satisfies the Bellman equation for policy π:

Q^\pi_{h-1}(s, a) = r(s, a) + \mathbb{E}\big[ V^\pi_h(s') \,\big|\, s, a \big],

where s' ∼ P(·|s, a) and V^π_h : S → R is the value function defined as V^π_h(s) = ∫_a Q^π_h(s, a)π(a|s)da.

We denote [n] = {1, ..., n} and λ_min(X) as the minimum eigenvalue of X. Denote I_d ∈ R^{d×d} as the identity matrix, with 1 on every diagonal entry and 0 elsewhere.

Off-policy evaluation. Suppose that the batch data D = {D_1, ..., D_K} consists of K independent episodes collected using an unknown behavior policy π̄. Each episode, denoted as D_k = {(s^k_h, a^k_h, r^k_h)}_{h∈[H]}, is a trajectory of H state-transition tuples. It is easy to generalize our analysis to multiple unknown behavior policies, since our algorithms do not require knowledge of the behavior policy. Let N = KH be the total number of sample transitions; we sometimes write D = {(s_n, a_n, r_n)}_{n∈[N]} for simplicity. The goal of OPE is to estimate the expected cumulative return (i.e., value) of a target policy π from a fixed initial distribution ξ_1, based on the dataset D. The value is defined as

v_\pi = \mathbb{E}^\pi\Big[\sum_{h=1}^{H} r(s_h, a_h) \,\Big|\, s_1 \sim \xi_1\Big].
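To fix notation and the data format, the following short sketch is our own illustration, not the paper's code. It assumes a generic simulator `env` with `reset()`/`step(s, a, rng)` and a stochastic policy `policy(state, rng) -> action`; it shows how the batch dataset D = {D_1, ..., D_K} of K length-H episodes is collected and how v_π can be approximated by Monte Carlo rollouts when the environment can be simulated.

import numpy as np

def collect_episodes(env, behavior_policy, K, H, rng):
    # Batch data D = {D_1, ..., D_K}: K independent length-H trajectories collected
    # under the (possibly unknown) behavior policy.
    data = []
    for _ in range(K):
        s = env.reset()
        episode = []
        for _ in range(H):
            a = behavior_policy(s, rng)
            s_next, r = env.step(s, a, rng)
            episode.append((s, a, r, s_next))   # one transition (s_h, a_h, r_h, s_{h+1})
            s = s_next
        data.append(episode)
    return data

def monte_carlo_value(env, target_policy, H, n_rollouts, rng):
    # v_pi = E^pi[ sum_{h=1}^H r(s_h, a_h) | s_1 ~ xi_1 ], estimated by fresh rollouts
    # of the target policy (only possible when the environment can be simulated).
    returns = []
    for _ in range(n_rollouts):
        s, total = env.reset(), 0.0
        for _ in range(H):
            a = target_policy(s, rng)
            s, r = env.step(s, a, rng)
            total += r
        returns.append(total)
    return float(np.mean(returns))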

Fitted Q-evaluation. Fitted Q-evaluation (FQE) is an instance of the fitted Q-iteration method, dating back to Fonteneau et al. (2013); Le et al. (2019). Let F be a given function class, for example a linear function class or a neural network class. Set Q̂^π_{H+1} = 0. For h = H, ..., 1, we recursively estimate Q^π_h by regression and function approximation:

\hat{Q}^\pi_h = \operatorname*{argmin}_{f \in F} \Big\{ \frac{1}{N}\sum_{n=1}^{N} \big( f(s_n, a_n) - y_n \big)^2 + \lambda \rho(f) \Big\},

where y_n = r_n + ∫_a Q̂^π_{h+1}(s_{n+1}, a)π(a|s_{n+1})da and ρ(f) is a proper regularizer. The value estimate is

\hat{v}_\pi = \mathbb{E}_{s \sim \xi_1, a \sim \pi(\cdot|s)}\big[ \hat{Q}^\pi_1(s, a) \big], \qquad (2.1)

which can be directly computed based on Q̂^π_1. See the full description of FQE in Appendix B.1.

Off-policy inference. Let v̂_π be an off-policy estimator of the target policy value v_π. In addition to the point estimator, we are primarily interested in the distribution of the off-policy evaluation error v̂_π − v_π. We aim to infer the error distribution of v̂_π − v_π in order to conduct statistical inference. Suppose F is an estimated distribution of v̂_π − v_π. Then we can use F for a range of downstream off-policy inference tasks, for example:

• Moment estimation. With F, we can estimate the p-th moment of v̂_π − v_π by ∫_R x^p dF(x). Two important examples are bias estimation and variance estimation.

• Confidence interval construction. Define the quantile function of F as G(p) = inf{x ∈ R, p ≤ F(x)}. Specify a confidence level 0 < δ ≤ 1. With F, we can construct the 1 − δ confidence interval as [v̂_π − G(1 − δ/2), v̂_π − G(δ/2)]. If F is close to the true distribution of v̂_π − v_π, this is nearly the tightest confidence interval for v_π based on v̂_π.

• Evaluating multiple policies and estimating their correlation. Suppose there are two target policies π_1, π_2 to evaluate and the corresponding off-policy estimators are v̂_{π_1}, v̂_{π_2}. Let F_{12} be the estimated joint distribution of v̂_{π_1} − v_{π_1} and v̂_{π_2} − v_{π_2}. The Pearson correlation coefficient between the two estimators is

\rho(\hat{v}_{\pi_1}, \hat{v}_{\pi_2}) = \frac{\mathrm{Cov}(\hat{v}_{\pi_1}, \hat{v}_{\pi_2})}{\sqrt{\mathrm{Var}(\hat{v}_{\pi_1})\,\mathrm{Var}(\hat{v}_{\pi_2})}}.

Both the covariance and the variances can be estimated from F_{12}, so we can further estimate the correlation between off-policy evaluators.

3. Bootstrapping Fitted Q-Evaluation (FQE)

As shown in Le et al. (2019); Voloshin et al. (2019); Duan & Wang (2020); Paine et al. (2020), FQE not only demonstrates strong empirical performance, but also enjoys provably optimal theoretical guarantees. Thus it is natural to conduct bootstrapping on top of FQE for off-policy inference.

Recall that the original dataset D consists of K episodes. We propose to bootstrap FQE by episodes: draw sample episodes D*_1, ..., D*_K independently with replacement from D. This is the standard Efron's nonparametric bootstrap (Efron, 1982). Then we run FQE on the new bootstrapped set D* = {D*_1, ..., D*_K} as in Eq. (2.1) and take the output v̂*_π as the bootstrapping FQE estimator. By repeating the above process, we may obtain multiple samples of v̂*_π, and may use these samples to further conduct off-policy inference (see Section 6.2 for details).

3.1. Bootstrap by episodes vs. bootstrap by sample transitions

Practitioners may wonder what is the right way to bootstrap a data set. This question is quite well understood in supervised learning when the data points are independent and identically distributed; there the best way to bootstrap is to resample data points directly. However, in episodic RL, although episodes may be generated independently from one another, sample transitions (s_n, a_n, r_n) in the same episode are highly dependent. Therefore, we choose to bootstrap the batch dataset by episodes, rather than by sample transitions, which was commonly done in previous literature (Kostrikov & Nachum, 2020). We argue that bootstrapping sample transitions may fail to correctly characterize the target error distribution of OPE. This is due to the in-episode dependence. To illustrate this phenomenon, we conduct numerical experiments using a toy Cliff Walking environment. We compare the true distribution of the FQE error obtained by Monte Carlo sampling with error distributions obtained using bootstrapping FQE. Figure 1 clearly shows that the bootstrap distribution of v̂*_π − v̂_π (by episodes) closely approximates the true error distribution of v̂_π − v_π, while the bootstrap distribution by sample transitions is highly irregular and incorrect. This validates our belief that it is necessary to bootstrap by episodes and handle dependent data carefully for OPE.

Figure 1. Bootstrap by episodes vs. by sample transitions. The first panel is the true FQE error distribution by Monte Carlo approximation. The second panel is the bootstrap distribution by episode, while the third one is by sample transitions. Both the behavior and target policies are the optimal policy. The number of Monte Carlo and bootstrap samples is 10000.
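The following is a minimal sketch, not the authors' code, of FQE with linear function approximation (Eq. (2.1) with a ridge penalty) and of Efron's bootstrap by episodes from Section 3. The episode format, the feature map `phi`, the finite action set `actions`, and `pi_probs` (the target policy's action probabilities) are illustrative assumptions; `rng` is a numpy random Generator.

import numpy as np

def fqe_linear(episodes, phi, pi_probs, actions, H, lam=1e-3):
    # episodes: list of K trajectories, each a list of H tuples (s, a, r, s_next)
    trans = [t for ep in episodes for t in ep]                  # all N = K*H transitions
    d = phi(trans[0][0], trans[0][1]).shape[0]
    Phi = np.stack([phi(s, a) for (s, a, r, sn) in trans])      # N x d design matrix
    Sigma = Phi.T @ Phi + lam * np.eye(d)                       # regularized Gram matrix
    w = np.zeros(d)                                             # weights of Q_hat^pi_{H+1} = 0
    for _ in range(H):                                          # backward recursion h = H, ..., 1
        # regression target y_n = r_n + E_{b ~ pi(.|s_{n+1})}[ Q_hat^pi_{h+1}(s_{n+1}, b) ]
        y = np.array([r + sum(p * (phi(sn, b) @ w)
                              for b, p in zip(actions, pi_probs(sn)))
                      for (s, a, r, sn) in trans])
        w = np.linalg.solve(Sigma, Phi.T @ y)                   # ridge regression fit of Q_hat^pi_h
    return w                                                    # weights of Q_hat^pi_1

def value_estimate(w, phi, pi_probs, actions, init_states):
    # plug-in estimate (2.1): average Q_hat^pi_1(s, a) over s ~ xi_1, a ~ pi(.|s)
    return float(np.mean([sum(p * (phi(s, b) @ w)
                              for b, p in zip(actions, pi_probs(s)))
                          for s in init_states]))

def bootstrap_fqe_values(episodes, init_states, B, rng, phi, pi_probs, actions, H):
    # Efron's nonparametric bootstrap *by episodes* (Section 3): resample whole
    # trajectories with replacement; never resample individual transitions (Section 3.1).
    K = len(episodes)
    values = []
    for _ in range(B):
        boot = [episodes[i] for i in rng.integers(0, K, size=K)]
        w_star = fqe_linear(boot, phi, pi_probs, actions, H)
        values.append(value_estimate(w_star, phi, pi_probs, actions, init_states))
    return np.array(values)                                     # bootstrap samples of v*_pi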

4. Asymptotic Distribution and Optimality of FQE

Before analyzing the use of the bootstrap, we first study the asymptotic properties of FQE estimators. For the sake of theoretical tractability, we focus our analysis on FQE with linear function approximation, because it is the most basic and universal form of function approximation. We will show that the FQE error is asymptotically normal and that its asymptotic variance exactly matches the Cramér–Rao lower bound. All the proofs are deferred to Appendix B.3 and B.4.

Notations. Given a feature map φ : S × A → R^d, we let F be the linear function class spanned by φ. Denote ν^π_h = E^π[φ(s_h, a_h) | s_1 ∼ ξ_1], as well as

\Sigma = \mathbb{E}\Big[\sum_{h=1}^{H} \phi(s_h, a_h)\phi(s_h, a_h)^\top\Big],

where E is the expectation over the population distribution generated by the behavior policy.

4.1. Asymptotic normality

We need a representation condition on the function class F, which will ensure sample-efficient policy evaluation via FQE.

Condition 4.1 (Policy completeness). For any f ∈ F, we assume P^π f ∈ F, and r ∈ F.

Policy completeness requires that the function class F can well capture the Bellman operator. It is crucial for the estimation consistency of FQE (Le et al., 2019; Duan & Wang, 2020) and implies the realizability condition Q^π_h ∈ F for h ∈ [H]. Recently, Wang et al. (2020) established a lower bound showing that the condition Q^π_h ∈ F alone is not enough for sample-efficient OPE. Thus we need the policy completeness condition in order to leverage the generalizability of the linear function class.

Next we present our first main result. The theorem establishes the asymptotic normality of FQE with linear function approximation. For any h_1 ∈ [H], h_2 ∈ [H], define the cross-time-covariance matrix as

\Omega_{h_1, h_2} = \mathbb{E}\Big[\frac{1}{H}\sum_{h'=1}^{H} \phi(s^1_{h'}, a^1_{h'})\phi(s^1_{h'}, a^1_{h'})^\top \varepsilon^1_{h_1, h'}\varepsilon^1_{h_2, h'}\Big],

where \varepsilon^1_{h_1, h'} = Q^\pi_{h_1}(s^1_{h'}, a^1_{h'}) - \big(r^1_{h'} + V^\pi_{h_1+1}(s^1_{h'+1})\big).

Theorem 4.2 (Asymptotic normality of FQE). Suppose λ_min(Σ) > 0 and Condition 4.1 holds. The FQE with linear function approximation is √N-consistent and asymptotically normal:

\sqrt{N}\,(\hat{v}_\pi - v_\pi) \overset{d}{\to} \mathcal{N}(0, \sigma^2), \quad \text{as } N \to \infty, \qquad (4.1)

where →^d denotes convergence in distribution. The asymptotic variance σ² is given by

\sigma^2 = \sum_{h=1}^{H} (\nu^\pi_h)^\top \Sigma^{-1}\Omega_{h,h}\Sigma^{-1}\nu^\pi_h + 2\sum_{h_1 < h_2} (\nu^\pi_{h_1})^\top \Sigma^{-1}\Omega_{h_1,h_2}\Sigma^{-1}\nu^\pi_{h_2}. \qquad (4.2)

The proof decomposes the evaluation error into a primary term and terms that are asymptotically negligible. For the primary term, we utilize the classical martingale central limit theorem (McLeish et al., 1974) to prove its asymptotic normality.

Remark 4.3. The second term on the right-hand side of Eq. (4.2) (the cross-product term) characterizes the dependency between two different fitted-Q steps. When considering a tabular time-inhomogeneous MDP as used in Yin & Wang (2020), this cross-product term disappears and the asymptotic variance becomes

\sum_{h=1}^{H} \mathbb{E}\bigg[\bigg(\frac{\mu^\pi_h(s^1_h, a^1_h)}{\bar{\mu}_h(s^1_h, a^1_h)}\bigg)^2 (\varepsilon^1_{h,h})^2\bigg],

where \bar{\mu}_h is the marginal distribution of (s^1_h, a^1_h) and \mu^\pi_h is the marginal distribution of (s_h, a_h) under policy π. This matches the asymptotic variance term in Remark 3.2 of Yin & Wang (2020).

Next, we give a corollary about the joint asymptotic error distribution when evaluating multiple policies. Denote Π = {π_1, ..., π_L} as a set of target policies to evaluate and denote v̂_{π_k} as the FQE estimator of the policy π_k. For each π_k ∈ Π, let \varepsilon^{1,k}_{h_1, h'} = Q^{\pi_k}_{h_1}(s^1_{h'}, a^1_{h'}) - (r^1_{h'} + V^{\pi_k}_{h_1+1}(s^1_{h'+1})). For any h_1 ∈ [H], h_2 ∈ [H], denote

\Omega^{j,k}_{h_1, h_2} = \mathbb{E}\Big[\frac{1}{H}\sum_{h'=1}^{H} \phi(s^1_{h'}, a^1_{h'})\phi(s^1_{h'}, a^1_{h'})^\top \varepsilon^{1,j}_{h_1, h'}\varepsilon^{1,k}_{h_2, h'}\Big].

Corollary 4.4 (Multiple policies). Suppose the conditions in Theorem 4.2 hold. The set of FQE estimators converges in distribution to a multivariate Gaussian distribution:

\begin{pmatrix} \sqrt{N}\,(\hat{v}_{\pi_1} - v_{\pi_1}) \\ \vdots \\ \sqrt{N}\,(\hat{v}_{\pi_L} - v_{\pi_L}) \end{pmatrix} \overset{d}{\to} \mathcal{N}(0, \Gamma),

where the covariance matrix \Gamma = (\sigma^2_{jk})_{j,k=1}^{L} \in \mathbb{R}^{L \times L} with

\sigma^2_{jk} = \sum_{h=1}^{H} (\nu^{\pi_j}_h)^\top \Sigma^{-1}\Omega^{j,k}_{h,h}\Sigma^{-1}\nu^{\pi_k}_h + 2\sum_{h_1 < h_2} (\nu^{\pi_j}_{h_1})^\top \Sigma^{-1}\Omega^{j,k}_{h_1,h_2}\Sigma^{-1}\nu^{\pi_k}_{h_2}.

5. Distributional Consistency of Bootstrapping FQE

We show that, conditioned on the data, the bootstrapping FQE estimator consistently estimates the distribution of √N(v̂_π − v_π). Consequently, we may use the method to construct confidence regions with asymptotically correct and tight coverage. All the proofs are deferred to Appendix B.5 and B.6.

Suppose that the batch dataset D is generated from a probability space (X, A, P_D), and the bootstrap weight W* is from an independent probability space (W, Ω, P_{W*}). Their joint probability measure is P_{DW*} = P_D × P_{W*}. Let P_{W*|D} denote the conditional distribution once the dataset D is given.

Theorem 5.1 (Distributional consistency). Suppose the same assumptions as in Theorem 4.2 hold. Conditioned on D, we have

\sqrt{N}\,(\hat{v}^*_\pi - \hat{v}_\pi) \overset{d}{\to} \mathcal{N}(0, \sigma^2), \quad \text{as } N \to \infty, \qquad (5.1)

where σ² is defined in Eq. (4.2). Consequently, it implies

\sup_{\alpha \in (0,1)} \Big| P_{W^*|D}\big(\sqrt{N}(\hat{v}^*_\pi - \hat{v}_\pi) \le \alpha\big) - P_D\big(\sqrt{N}(\hat{v}_\pi - v_\pi) \le \alpha\big) \Big| \to 0.

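The following short calculation is our own addition (it does not appear in the original text); it records, informally, why Theorem 5.1 combined with Theorem 4.2 yields percentile-bootstrap confidence intervals with asymptotically correct coverage. Here q*_α denotes the α-quantile of the conditional law of √N(v̂*_π − v̂_π) given D, Φ_σ the CDF of N(0, σ²), and z_α its α-quantile.

\begin{align*}
\mathbb{P}\Big(v_\pi \in \big[\hat v_\pi - q^*_{1-\delta/2}/\sqrt{N},\ \hat v_\pi - q^*_{\delta/2}/\sqrt{N}\big]\Big)
&= \mathbb{P}\Big(q^*_{\delta/2} \le \sqrt{N}\,(\hat v_\pi - v_\pi) \le q^*_{1-\delta/2}\Big)\\
&\longrightarrow \Phi_\sigma(z_{1-\delta/2}) - \Phi_\sigma(z_{\delta/2}) = (1-\delta/2) - \delta/2 = 1 - \delta,
\end{align*}

since the bootstrap quantiles q*_α converge to z_α by Theorem 5.1 and √N(v̂_π − v_π) converges in distribution to N(0, σ²) by Theorem 4.2.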
2. Then we have, for any 1 ≤ r < q,

\mathbb{E}_{W^*|D}\Big[\big(\sqrt{N}(\hat{v}^*_\pi - \hat{v}_\pi)\big)^r\Big] \to \int t^r \, d\mu(t),

where μ(·) is the distribution of N(0, σ²).

The consistency of the bootstrap variance estimate is immediately implied by setting r = 2.

6. Subsampled Bootstrapping FQE

Computing bootstrap-based quantities can be prohibitively demanding as the data size grows. Inspired by recent developments in the statistics community (Kleiner et al., 2014; Sengupta et al., 2016), we adapt a simple subsampled bootstrap procedure for FQE to accelerate the computation.

6.1. Subsampled bootstrap

Let the original dataset be D = {D_1, ..., D_K}. For any dataset D̃, we denote by v̂_π(D̃) the FQE estimator based on dataset D̃, and B is the number of bootstrap samples. The subsampled bootstrap includes the following three steps. For each b ∈ [B], we first construct a random subset D^{(b)}_{K,s} of s episodes, where each sample episode is drawn independently without replacement from dataset D. Typically s = K^γ for some 0 < γ ≤ 1. Then we generate a resample set D^{(b)*}_{K,s} of K episodes, where each sample episode is drawn independently with replacement from D^{(b)}_{K,s}. Note that when s = K, D^{(b)}_{K,s} is always equal to D, so that the subsampled bootstrap reduces to the vanilla bootstrap. In the end, we compute ε^{(b)} = v̂_π(D^{(b)*}_{K,s}) − v̂_π(D^{(b)}_{K,s}). Algorithm 1 gives the full description.

Remark 6.1 (Computational benefit). In Algorithm 1, although each run of FQE is still over a dataset of K episodes, only s of them are distinct. As a result, the runtime of running FQE on a bootstrapped set can be substantially reduced. With linear function approximation, one run of FQE requires solving H least squares problems. Thus the total runtime complexity of the subsampled bootstrapping FQE is O(B(K^{2γ}H³d + Hd³)), where 0 < γ < 1 controls the subsample size. When γ is small, we achieve a significant speedup, by an order of magnitude.

6.2. Off-policy inference via bootstrapping FQE

We describe how to conduct off-policy inference based on the output of Algorithm 1.

• Bootstrap variance estimation. To estimate the variance of FQE estimators, we calculate the bootstrap sample variance as

\widehat{\mathrm{Var}}(\hat{v}_\pi(D)) = \frac{1}{B-1}\sum_{b=1}^{B} \big(\varepsilon^{(b)} - \bar{\varepsilon}\big)^2, \qquad \text{where } \bar{\varepsilon} = \frac{1}{B}\sum_{b=1}^{B}\varepsilon^{(b)}.

• Bootstrap confidence interval. Compute the δ/2 and 1 − δ/2 quantiles of the empirical distribution {ε^{(1)}, ..., ε^{(B)}}, denoted as q̂^π_{δ/2}, q̂^π_{1−δ/2} respectively. The percentile bootstrap confidence interval is [v̂_π(D) − q̂^π_{1−δ/2}, v̂_π(D) − q̂^π_{δ/2}].

• Bootstrap correlation estimation. For any two target policies π_1 and π_2, we want to estimate the Pearson correlation coefficient between their FQE estimators. The bootstrap sample correlation can be computed as

\hat{\rho}(\hat{v}_{\pi_1}(D), \hat{v}_{\pi_2}(D)) = \frac{\sum_{b=1}^{B}(\varepsilon_1^{(b)} - \bar{\varepsilon}_1)(\varepsilon_2^{(b)} - \bar{\varepsilon}_2)}{\sqrt{\sum_{b=1}^{B}(\varepsilon_1^{(b)} - \bar{\varepsilon}_1)^2}\,\sqrt{\sum_{b=1}^{B}(\varepsilon_2^{(b)} - \bar{\varepsilon}_2)^2}}.

These computations are sketched in code below.
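The following is a minimal sketch, not the authors' implementation, of the subsampled bootstrap loop (Section 6.1) and the three downstream computations of Section 6.2. The routine `fqe_value`, mapping a list of episodes to an FQE value estimate (for instance, built from the linear-FQE sketch in Section 3), and the episode format are assumptions for illustration.

import numpy as np

def subsampled_bootstrap(episodes, fqe_value, B, gamma=0.5, rng=None):
    # For each b: (i) draw s = K^gamma episodes without replacement, (ii) resample
    # K episodes with replacement from that subset, (iii) record
    # eps_b = v_hat(resample) - v_hat(subset).  gamma = 1 recovers the vanilla
    # bootstrap by episodes.
    rng = rng or np.random.default_rng(0)
    K = len(episodes)
    s = max(1, int(round(K ** gamma)))
    eps = []
    for _ in range(B):
        sub = [episodes[i] for i in rng.choice(K, size=s, replace=False)]
        boot = [sub[i] for i in rng.integers(0, s, size=K)]
        eps.append(fqe_value(boot) - fqe_value(sub))
    return np.array(eps)

def bootstrap_variance(eps):
    # bootstrap sample variance: (1/(B-1)) * sum_b (eps_b - mean)^2
    return float(np.var(eps, ddof=1))

def percentile_ci(v_hat, eps, delta=0.1):
    # percentile bootstrap CI: [v_hat - q_{1-delta/2}, v_hat - q_{delta/2}]
    q_lo, q_hi = np.quantile(eps, [delta / 2, 1 - delta / 2])
    return v_hat - q_hi, v_hat - q_lo

def bootstrap_correlation(eps1, eps2):
    # Pearson correlation between two FQE estimators; eps1[b] and eps2[b] must be
    # computed on the same b-th resampled dataset so that the replicates are paired.
    return float(np.corrcoef(eps1, eps2)[0, 1])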

7. Experiments

In this section, we numerically evaluate the proposed bootstrapping FQE method in several RL environments. For constructing confidence intervals, we fix the confidence level at δ = 0.1. For estimating variances and correlations, we average the results over 200 trials. More details about the experiments are given in Appendix D.

7.1. Experiment with tabular discrete environment

We first consider the Cliff Walking environment (Sutton & Barto, 2018), with artificially added randomness to create stochastic transitions (see Appendix D for details). The target policy is chosen to be a near-optimal policy, trained using Q-learning. Consider three choices of the behavior policy: the same as the target policy (on-policy), an ε-greedy policy with ε = 0.1, and a softmax policy with temperature 1.0, both based on the learned optimal Q-function. The results for the softmax policy and for correlation estimation are deferred to Appendix D.

Figure 2. Off-policy CI for Cliff Walking. Left: Empirical coverage probability of CI; Right: CI width under different behavior policies. The bootstrapping-FQE confidence interval demonstrates better and tighter coverage of the ground truth. It closely resembles the oracle confidence interval, which comes from the true error distribution. [Panels: On-Policy and Epsilon-Greedy Behavior Policy; methods compared: vanilla bootstrap, subsampled bootstrap, HCOPE, oracle CI; axes: empirical coverage / interval width versus number of episodes in the dataset.]
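As an illustration of the behavior policies used above, here is a small sketch of our own (not the paper's code), assuming a learned tabular Q-function `Q` of shape (num_states, num_actions), of the ε-greedy and softmax (temperature 1.0) action distributions.

import numpy as np

def epsilon_greedy_probs(Q, state, eps=0.1):
    # uniform exploration with probability eps, greedy action otherwise
    n_actions = Q.shape[1]
    probs = np.full(n_actions, eps / n_actions)
    probs[np.argmax(Q[state])] += 1.0 - eps
    return probs

def softmax_probs(Q, state, temperature=1.0):
    # Boltzmann policy over the learned Q-values
    z = Q[state] / temperature
    z = z - z.max()                 # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()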

We test three different methods. The first two methods are subsampled bootstrapping FQE with subsample sizes s = K (the vanilla bootstrap) and s = K^{0.5} (the computationally efficient version), where B = 100. The third method is high-confidence off-policy evaluation (HCOPE) (Thomas et al., 2015), which we use as a baseline for comparison. HCOPE is a method for constructing off-policy confidence intervals for tabular MDPs; it is based on concentration inequalities and has a provable coverage guarantee. We also compare these methods with the oracle confidence interval (which is the true distribution's quantile obtained by Monte Carlo simulation).

Coverage and tightness of off-policy confidence interval (CI). We study the empirical coverage probability and interval width with different numbers of episodes. Figure 2 shows the result under different behavior policies. In the left panel of Figure 3, we report the effect of the number of bootstrap samples on the empirical coverage probability (ε-greedy behavior policy, K = 100). It is clear that the empirical coverage of our confidence interval based on bootstrapping FQE becomes increasingly close to the expected coverage (= 1 − δ) as the number of episodes increases. The width of the bootstrapping-FQE confidence interval is significantly tighter than that of HCOPE and very close to the oracle one. It is worth noting that, even in the on-policy case, our bootstrap-based confidence interval still has a clear advantage over the concentration-based confidence interval. The advantage of our method comes from the fact that it fully exploits the distribution information. However, the bootstrap confidence interval tends to under-cover when the number of episodes is extremely small (K = 10). Thus we suggest that practitioners use bootstrap methods when the sample size is moderately large (K > 50).

Further, the subsampled bootstrapping FQE demonstrates competitive performance as well as significantly reduced computation time. The saving in computation time becomes increasingly substantial as the data gets big; see the right panel of Figure 3.

Figure 3. Sample and time efficiency of bootstrapping FQE. Left: Empirical coverage of the bootstrapping-FQE CI as the number of bootstrap samples increases. Right: Runtime of bootstrapping FQE as the data size increases (with subsample size s = K^{0.5}).

Bootstrapping FQE for variance estimation. We study the performance of variance estimation using subsampled bootstrapping FQE under three different behavior policies. We vary the number of episodes, and the true Var(v̂_π(D)) is computed through the Monte Carlo method. We report the estimation error Var̂(v̂_π(D)) − Var(v̂_π(D)) across 200 trials in the left panel of Figure 4.

[Figures 4 and 5 appear here. Figure 4, left panel: variance-estimation error versus number of episodes for on-policy, ε-greedy, and softmax behavior policies; right panel: policy-value confidence intervals comparing the oracle CI, Duan & Wang (2020), and the subsampled bootstrap. Figure 5, both panels: estimated correlation versus ε. Captions follow below.]

Figure 4. Bootstrapping for variance estimation and with function approximation. Left: Error of variance estimates, as data size increases. Right: Confidence interval constructed using bootstrapping FQE with linear function approximation.

7.2. Experiment with Mountain Car using linear function approximation

Next we test the methods on the classical Mountain Car environment (Moore, 1990) with linear function approximation. We artificially added a Gaussian random force to the car's dynamics to create stochastic transitions. For the linear function approximation, we choose 400 radial basis functions (RBF) as the feature map. The target policy is chosen as the optimal policy trained by Q-learning, and the behavior policy is chosen to be the ε-greedy policy with ε = 0.1 based on the learned optimal Q-function.

For comparison, we compute an empirical Bernstein-inequality-based confidence interval (Duan & Wang, 2020), which to the best of our knowledge is the only provable CI based on FQE with function approximation (see Appendix D for its detailed form). We also compute the oracle CI using Monte Carlo simulation. The right panel of Figure 4 gives the results. According to the results, our method demonstrates good coverage of the ground truth and is much tighter than the concentration-based CI, even though both of them use linear function approximation.

7.3. Experiment with septic management using neural nets for function approximation

Lastly, we consider a real-world healthcare problem for treating sepsis in the intensive care unit (ICU). We use the septic management simulator by Oberst & Sontag (2019) for our study. It simulates a patient's vital signs, e.g., the heart rate, blood pressure, oxygen concentration, and glucose levels, with three treatment actions (antibiotics, vasopressors, and mechanical ventilation) to choose from at each time step. The reward is +1 when a patient is discharged and −1 if the patient reaches a life-critical state.

We apply bootstrapping FQE using a neural network function approximator with three fully connected layers, where the first layer uses 256 units and a ReLU activation function, the second layer uses 32 units and a SeLU activation function, and the last layer uses Softsign. The network takes as input the state-action pair (an 11-dimensional vector) and outputs a Q-value estimate. Let the behavior policy be the ε-greedy policy with ε = 0.15.

We evaluate two policies based on the same set of data. This is very common in healthcare problems, since we may have multiple treatments chosen by the doctor. One target policy is fixed to be the optimal policy, while we vary the other one with different ε-greedy noise. We expect the correlation to decrease as the difference between the two target policies increases. Figure 5 is well aligned with our expectation. In Figure 6, we plot the confidence region of the two target policies obtained by bootstrapping FQE using neural networks. According to Figures 5 and 6, the bootstrapping FQE method can effectively construct confidence regions and correlation estimates, even when using neural networks for function approximation. These results suggest that the proposed bootstrapping FQE method reliably achieves off-policy inference with more general function approximators.

Figure 5. Bootstrapping FQE with neural nets for estimating the correlation between two FQE estimators. The left panel uses 300 episodes, while the right panel uses 500 episodes.

Figure 6. Estimated confidence region for evaluating two policies using bootstrapping FQE with neural networks. The two target policies are the optimal policy and the 0.15-greedy policy. Red points are the true values of those two target policies. From left to right, the sample sizes are K = 100, 300, 500.

8. Conclusion

This paper studies bootstrapping FQE for statistical off-policy inference and establishes its asymptotic distributional consistency as a theoretical benchmark. Our experiments suggest that bootstrapping FQE is effective and efficient in a range of tasks, from tabular problems to continuous problems, with linear and neural network approximation.

References

Bickel, P. J. and Freedman, D. A. Some asymptotic theory for the bootstrap. The Annals of Statistics, pp. 1196–1217, 1981.

Dai, B., Nachum, O., Chow, Y., Li, L., Szepesvári, C., and Schuurmans, D. CoinDICE: Off-policy confidence interval estimation. arXiv preprint arXiv:2010.11652, 2020.

Duan, Y. and Wang, M. Minimax-optimal off-policy evaluation with linear function approximation. International Conference on Machine Learning, 2020.

Eck, D. J. Bootstrapping for multivariate linear regression models. Statistics & Probability Letters, 134:141–149, 2018.

Efron, B. The jackknife, the bootstrap and other resampling plans. SIAM, 1982.

Ernst, D., Geurts, P., and Wehenkel, L. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556, 2005.

Feng, Y., Ren, T., Tang, Z., and Liu, Q. Accountable off-policy evaluation with kernel Bellman statistics. Proceedings of the International Conference on Machine Learning, 2020.

Feng, Y., Tang, Z., Zhang, N., and Liu, Q. Non-asymptotic confidence intervals of off-policy evaluation: Primal and dual bounds. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=dKg5D1Z1Lm.

Fonteneau, R., Murphy, S. A., Wehenkel, L., and Ernst, D. Batch mode reinforcement learning based on the synthesis of artificial trajectories. Annals of Operations Research, 208(1):383–416, 2013.

Freedman, D. A. et al. Bootstrapping regression models. The Annals of Statistics, 9(6):1218–1228, 1981.

Hallak, A. and Mannor, S. Consistent on-line off-policy evaluation. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 1372–1383. JMLR.org, 2017.

Hanna, J. P., Stone, P., and Niekum, S. Bootstrapping with models: Confidence intervals for off-policy evaluation. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

Hao, B., Abbasi-Yadkori, Y., Wen, Z., and Cheng, G. Bootstrapping upper confidence bound. Thirty-fourth Annual Conference on Neural Information Processing Systems, 2020a.

Hao, B., Duan, Y., Lattimore, T., Szepesvári, C., and Wang, M. Sparse feature selection makes batch reinforcement learning more sample efficient. arXiv preprint arXiv:2011.04019, 2020b.

Jiang, N. and Huang, J. Minimax value interval for off-policy evaluation and policy optimization. Advances in Neural Information Processing Systems, 33, 2020.

Jiang, N. and Li, L. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 652–661, 2016.

Kallus, N. and Uehara, M. Double reinforcement learning for efficient off-policy evaluation in Markov decision processes. Journal of Machine Learning Research, 21(167):1–63, 2020.

Kato, K. A note on moment convergence of bootstrap M-estimators. Statistics & Risk Modeling, 28(1):51–61, 2011.

Kleiner, A., Talwalkar, A., Sarkar, P., and Jordan, M. I. A scalable bootstrap for massive data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), pp. 795–816, 2014.

Kostrikov, I. and Nachum, O. Statistical bootstrapping for uncertainty estimation in off-policy evaluation. arXiv preprint arXiv:2007.13609, 2020.

Kuzborskij, I., Vernade, C., György, A., and Szepesvári, C. Confident off-policy evaluation and selection through self-normalized importance weighting. arXiv preprint arXiv:2006.10460, 2020.

Lagoudakis, M. G. and Parr, R. Least-squares policy iteration. Journal of Machine Learning Research, 4(Dec):1107–1149, 2003.

Le, H. M., Voloshin, C., and Yue, Y. Batch policy learning under constraints. Proceedings of Machine Learning Research, 97:3703–3712, 2019.

Liao, P., Klasnja, P., and Murphy, S. Off-policy estimation of long-term average outcomes with applications to mobile health. arXiv preprint arXiv:1912.13088, 2019.

Liu, Q., Li, L., Tang, Z., and Zhou, D. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, pp. 5356–5366, 2018.

McLeish, D. L. et al. Dependent central limit theorems and invariance principles. The Annals of Probability, 2(4):620–628, 1974.

Moore, A. W. Efficient memory-based learning for robot control. 1990.

Munos, R. and Szepesvári, C. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(5), 2008.

Nachum, O., Chow, Y., Dai, B., and Li, L. DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. In Advances in Neural Information Processing Systems, pp. 2315–2325, 2019.

Oberst, M. and Sontag, D. Counterfactual off-policy evaluation with Gumbel-max structural causal models. In International Conference on Machine Learning, pp. 4881–4890, 2019.

Paine, T. L., Paduraru, C., Michi, A., Gulcehre, C., Zolna, K., Novikov, A., Wang, Z., and de Freitas, N. Hyperparameter selection for offline reinforcement learning. arXiv preprint arXiv:2007.09055, 2020.

Petersen, K. and Pedersen, M. The Matrix Cookbook. Technical University of Denmark, Technical Manual, 2008.

Precup, D., Sutton, R. S., and Singh, S. Eligibility traces for off-policy policy evaluation. In ICML'00 Proceedings of the Seventeenth International Conference on Machine Learning, 2000.

Sengupta, S., Volgushev, S., and Shao, X. A subsampled double bootstrap for massive data. Journal of the American Statistical Association, 111(515):1222–1232, 2016.

Shi, C., Zhang, S., Lu, W., and Song, R. Statistical inference of the value function for reinforcement learning in infinite horizon settings. arXiv preprint arXiv:2001.04515, 2020.

Singh, K. On the asymptotic accuracy of Efron's bootstrap. The Annals of Statistics, pp. 1187–1195, 1981.

Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT Press, 2018.

Thomas, P. and Brunskill, E. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 2139–2148, 2016.

Thomas, P., Theocharous, G., and Ghavamzadeh, M. High confidence policy improvement. In International Conference on Machine Learning, pp. 2380–2388. PMLR, 2015.

Uehara, M. and Jiang, N. Minimax weight and Q-function learning for off-policy evaluation. arXiv preprint arXiv:1910.12809, 2019.

Van der Vaart, A. W. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.

Voloshin, C., Le, H. M., Jiang, N., and Yue, Y. Empirical study of off-policy policy evaluation for reinforcement learning. arXiv preprint arXiv:1911.06854, 2019.

Wang, R., Foster, D. P., and Kakade, S. M. What are the statistical limits of offline RL with linear function approximation? arXiv preprint arXiv:2010.11895, 2020.

Xie, T., Ma, Y., and Wang, Y.-X. Towards optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. In Advances in Neural Information Processing Systems, pp. 9665–9675, 2019.

Yin, M. and Wang, Y.-X. Asymptotically efficient off-policy evaluation for tabular reinforcement learning. International Conference on Artificial Intelligence and Statistics, 2020.

Zhang, R., Dai, B., Li, L., and Schuurmans, D. GenDICE: Generalized offline estimation of stationary values. arXiv preprint arXiv:2002.09072, 2020a.

Zhang, S., Liu, B., and Whiteson, S. GradientDICE: Rethinking generalized offline estimation of stationary values. arXiv preprint arXiv:2001.11113, 2020b.

A. Additional Related Work

Popular approaches to OPE include direct methods (Lagoudakis & Parr, 2003; Ernst et al., 2005; Munos & Szepesvári, 2008; Le et al., 2019), doubly robust / importance sampling (Precup et al., 2000; Jiang & Li, 2016; Thomas & Brunskill, 2016), and marginalized importance sampling (Hallak & Mannor, 2017; Liu et al., 2018; Xie et al., 2019; Nachum et al., 2019; Uehara & Jiang, 2019; Zhang et al., 2020a;b). On the theoretical side, Uehara & Jiang (2019); Yin & Wang (2020) established asymptotic optimality and efficiency for OPE in the tabular setting, and Kallus & Uehara (2020) provided a complete study of semiparametric efficiency in a more general setting. Duan & Wang (2020); Hao et al. (2020b) showed that FQE with linear/sparse linear function approximation is minimax optimal, and Wang et al. (2020) studied the fundamental hardness of OPE with linear function approximation. In the statistics community, Liao et al. (2019) studied OPE in an infinite-horizon undiscounted MDP and derived the asymptotic distribution of an empirical Bellman residual minimization estimator. Their asymptotic variance has a tabular representation and thus does not show the effect of function approximation. Shi et al. (2020) considered asymptotic confidence intervals for the policy value, but under a different model assumption that the Q-function is smooth.

B. Proofs of Main Theorems

B.1. Full Algorithm of General FQE

Algorithm 2 Fitted Q-Evaluation (Le et al., 2019)
input: Dataset D = {D_1, ..., D_K}, target policy π, function class F, initial state distribution ξ_1.
1: Initialize Q̂^π_{H+1} = 0.
2: for h = H, H − 1, ..., 1 do
3:   Compute regression targets for any k ∈ [K], h' ∈ [H]:

     y^k_{h,h'} = r^k_{h'} + \int_a \hat{Q}^\pi_{h+1}(s^k_{h'+1}, a)\pi(a|s^k_{h'+1})\,da.

4:   Build the training set {((s^k_{h'}, a^k_{h'}), y^k_{h,h'})}_{k∈[K], h'∈[H]}.
5:   Solve a supervised learning problem:

     \hat{Q}^\pi_h = \operatorname*{argmin}_{f \in F} \Big\{ \frac{1}{K}\sum_{k=1}^{K}\frac{1}{H}\sum_{h'=1}^{H} \big( f(s^k_{h'}, a^k_{h'}) - y^k_{h,h'} \big)^2 + \lambda\rho(f) \Big\},

     where ρ(f) is a proper regularizer.
6: end for
output: v̂_π = ∫_s ∫_a Q̂^π_1(s, a) ξ_1(s) π(a|s) ds da.

We restate the full algorithm of FQE in Algorithm 2. Here we simply assume the initial state distribution ξ_1 is known. In practice, we always have access to samples from ξ_1 and thus can approximate it by Monte Carlo sampling.

B.2. Equivalence between FQE and model-based plug-in estimator

We show that FQE in Algorithm 2 with a linear function class F is equivalent to a plug-in estimator. This equivalence is helpful for deriving the asymptotic normality of FQE and of bootstrapping FQE. Define

\widehat{M}_\pi = \widehat{\Sigma}^{-1}\sum_{n=1}^{N}\phi(s_n, a_n)\phi^\pi(s_{n+1})^\top, \qquad \widehat{R} = \widehat{\Sigma}^{-1}\sum_{n=1}^{N} r_n\,\phi(s_n, a_n), \qquad \widehat{\Sigma} = \sum_{n=1}^{N}\phi(s_n, a_n)\phi(s_n, a_n)^\top + \lambda I_d, \qquad (B.1)

where φ^π(s) = ∫_a φ(s, a)π(a|s)da, s_{N+1} is the terminal state, and λ is the regularization parameter. Choosing ρ(f) to be the ridge penalty on the linear coefficient of f (which produces the λI_d term in Σ̂), FQE is equivalent to, for h = H, ..., 1, Q̂^π_h(s, a) = φ(s, a)^⊤ ŵ^π_h with

\hat{w}^\pi_h = \widehat{\Sigma}^{-1}\sum_{n=1}^{N}\phi(s_n, a_n)\Big( r_n + \int_a \hat{Q}^\pi_{h+1}(s_{n+1}, a)\pi(a|s_{n+1})\,da \Big)
            = \widehat{\Sigma}^{-1}\sum_{n=1}^{N}\phi(s_n, a_n)\Big( r_n + \int_a \phi(s_{n+1}, a)^\top \hat{w}^\pi_{h+1}\pi(a|s_{n+1})\,da \Big)
            = \widehat{\Sigma}^{-1}\sum_{n=1}^{N}\phi(s_n, a_n) r_n + \widehat{\Sigma}^{-1}\sum_{n=1}^{N}\phi(s_n, a_n)\phi^\pi(s_{n+1})^\top \hat{w}^\pi_{h+1}
            = \widehat{R} + \widehat{M}_\pi \hat{w}^\pi_{h+1}.

This gives us a recursive form of ŵ^π_h. Denoting ŵ^π_{H+1} = 0 and ν^π_1 = E_{s∼ξ_1, a∼π(·|s)}[φ(s, a)], the FQE estimator can be written as

\hat{v}_\pi = \int_s\int_a \hat{Q}^\pi_1(s, a)\,\xi_1(s)\pi(a|s)\,da\,ds = (\nu^\pi_1)^\top \hat{w}^\pi_1 = (\nu^\pi_1)^\top \sum_{h=0}^{H-1}(\widehat{M}_\pi)^h \widehat{R}. \qquad (B.2)

On the other hand, from Condition 4.1, there exist w_r, w^π_h ∈ R^d such that Q^π_h(·, ·) = φ(·, ·)^⊤ w^π_h for each h ∈ [H] and r(·, ·) = φ(·, ·)^⊤ w_r, and there exists M_π ∈ R^{d×d} such that φ(s, a)^⊤ M_π = E[φ^π(s')^⊤ | s, a]. From the Bellman equation and Condition 4.1,

Q^\pi_h(s, a) = r(s, a) + \mathbb{E}\Big[\int_a Q^\pi_{h+1}(s', a)\pi(a|s')\,da \,\Big|\, s, a\Big]
            = \phi(s, a)^\top w_r + \phi(s, a)^\top \mathbb{E}[\phi^\pi(s')^\top | s, a]\,w^\pi_{h+1} = \phi(s, a)^\top\big(w_r + M_\pi w^\pi_{h+1}\big) \qquad (B.3)
            = \phi(s, a)^\top \sum_{h'=0}^{H-h}(M_\pi)^{h'} w_r.

Therefore, the true scalar value function can be written as

v_\pi = \mathbb{E}_{s\sim\xi_1, a\sim\pi(\cdot|s)}\big[Q^\pi_1(s, a)\big] = (\nu^\pi_1)^\top \sum_{h=0}^{H-1}(M_\pi)^h w_r,

which implies Eq. (B.2) is a plug-in estimator.
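The following is our own small numerical check, not part of the paper, of the plug-in form (B.2): with randomly generated stand-ins for M̂_π, R̂ and ν^π_1, the backward recursion ŵ^π_h = R̂ + M̂_π ŵ^π_{h+1}, started from ŵ^π_{H+1} = 0, yields the same value as the closed form (ν^π_1)^⊤ Σ_{h=0}^{H−1}(M̂_π)^h R̂.

import numpy as np

rng = np.random.default_rng(0)
d, H = 5, 8
M_hat = rng.normal(size=(d, d)) / (2 * d)   # stand-in for Sigma_hat^{-1} sum phi(s_n,a_n) phi^pi(s_{n+1})^T
R_hat = rng.normal(size=d)                  # stand-in for Sigma_hat^{-1} sum r_n phi(s_n, a_n)
nu1 = rng.normal(size=d)                    # stand-in for E_{s~xi_1, a~pi}[phi(s, a)]

# backward recursion: w_{H+1} = 0, w_h = R_hat + M_hat w_{h+1}
w = np.zeros(d)
for _ in range(H):
    w = R_hat + M_hat @ w
v_recursive = nu1 @ w

# closed form: (nu_1)^T sum_{h=0}^{H-1} M_hat^h R_hat
v_plugin = nu1 @ sum(np.linalg.matrix_power(M_hat, h) @ R_hat for h in range(H))

assert np.allclose(v_recursive, v_plugin)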

B.3. Proof of Theorem 4.2: Asymptotic normality of FQE

π π π > π > h−1 Recall νh = E [φ(xh, ah)|x1 ∼ ξ1] and denote (νbh ) = (ν1 ) Mcπ . We follow Lemma B.3 in Duan & Wang(2020) to decompose the error term into following three parts: √ N(vπ − vbπ) = E1 + E2 + E3, where

N H 1 X X π > −1  π π  E1 = √ (νh ) Σ φ(sn, an) Qh(sn, an) − rn + Vh+1(sn+1) , N n=1 h=1 H N X  π > −1 π > −1 1 X  π π  E2 = N(νbh ) Σb − (νh ) Σ √ φ(sn, an) Qh(sn, an) − rn + Vh+1(sn+1) , h=1 N n=1 H 1 X π > −1 π E3 = λ√ (νbh ) Σb wh . N h=0 Bootstrapping Statistical Inference for Off-Policy Evaluation √ To prove the asymptotic normality of N(vπ − vbπ), we will first prove the asymptotic normality of E1 and then show both E1 and E2 are asymptotically negligible. For n = 1, 2,...,N, we denote

H 1 X π > −1  π π  en = √ (νh ) Σ φ(sn, an) Qh(sn, an) − rn+1 + Vh+1(sn+1) . N h=1 PN  Then E1 = n=1 en. Define a filtration Fn n=1,...,N with Fn generated by (s1, a1, s2),..., (sn−1, an−1, sn) and   (sn, an). From the definition of value function, it is easy to see E en Fn = 0 that implies that {en}n∈[N] is a martingale difference sequence. To show the asymptotic normality, we use the following martingale central limit theorem for triangular arrays.

Theorem B.1 (Martingale CLT, Corollary 2.8 in (McLeish et al., 1974)). Let {Xmn; n = 1, . . . , km} be a martingale difference array (row-wise) on the probability triple (Ω, F,P ). Suppose Xmn satisfy the following two conditions:

km p X 2 p 2 max |Xmn| → 0, and Xmn → σ , 1≤n≤km n=1

Pkm d 2 for km → ∞. Then n=1 Xmn → N (0, σ ).

Recall that the variance σ2 is defined as

H X X σ2 = (νπ)>Σ−1Ω Σ−1νπ + 2 (νπ )>Σ−1Ω Σ−1νπ , h h,h h h1 h1,h2 h2 (B.4) h=1 h1

H h 1 X 1 1 1 1 > 1 1 i Ωh ,h = φ(s 0 , a 0 )φ(s 0 , a 0 ) ε 0 ε 0 , 1 2 E H h h h h h1,h h2,h h0=1

1 π 1 1 1 π 1 where ε 0 = Q (s 0 , a 0 ) − (r 0 + V (s 0 )). To apply Theorem B.1, we let k = N, X = e and we need to h1,h h1 h h h h1+1 h +1 m mn n verify the following two conditions:

H   X π > −1 1  π π  p max (νh ) Σ √ φ(sn, an) Qh(sn, an) − rn+1 + Vh+1(sn+1) → 0, as N → ∞, (B.5) 1≤n≤N h=1 N and N H 2 X  1 X π > −1  π π  p 2 √ (νh ) Σ φ(sn, an) Qh(sn, an) − rn+1 + Vh+1(sn+1) → σ , as N → ∞. (B.6) n=1 N h=1

π Verify Condition B.5: Since r ∈ [0, 1], we have rn + Vh+1(sn+1) ∈ [0,H − h]. For any n ∈ [N], we have

H ! X 1   π > −1 √ π π  (νh ) Σ φ(sn, an) Qh(sn, an) − rn+1 + Vh+1(sn+1) h=1 N H 1 X √ π > −1 π π  ≤ (νh ) Σ φ(sn, an) Qh(sn, an) − rn + Vh+1(sn+1) N h=1 H 1 X √ π > −1 ≤ (H − h + 1) (νh ) Σ φ(sn, an) . N h=1

π > −1 Note that (νh ) Σ φ(sn, an) is independent of N. Then for fixed d, H, Condition B.5 is satisfied when N → ∞. Bootstrapping Statistical Inference for Off-Policy Evaluation

2 2 2 2 Verify Condition B.6: Recall the definition of σ in Eq. (B.4) and let σ = σ1 + σ2 for H 2 X π > −1 −1 π σ1 = (νh ) Σ Ωh,hΣ νh , h=1 X σ2 = 2 (νπ )>Σ−1Ω Σ−1νπ . 2 h1 h1,h2 h2 h1

N H 2 X  1 X π > −1  π π  √ (νh ) Σ φ(sn, an) Qh(sn, an) − rn+1 + Vh+1(sn+1) n=1 N h=1 N H X 1 X   = (νπ)>Σ−1φ(s , a )φ(s , a )>Σ−1νπ Qπ(s , a ) − (r + V π (s ))2 N h n n n n h h n n n h+1 n+1 n=1 h=1 N X 1 X π > −1 π > −1 + 2 (ν ) Σ φ(sn, an)(ν ) Σ φ(sn, an) N h1 h2 n=1 h1

2 2 We denote the first term as I1, the second term as I2 and separately bound I1 − σ1 and I2 − σ2 as follows:

• We rewrite I1 in terms of episodes as H K H X π > −1/2 1 X 1 X −1/2 k k k k > k 2 −1/2 −1/2 π I = (ν ) Σ Σ φ(s 0 , a 0 )φ(s 0 , a 0 ) (ε 0 ) Σ Σ ν . 1 h K H h h h h hh h h=1 k=1 h=1 Moreover, denote H −1/2 h 1 X 1 1 1 1 > 1 2i −1/2 d×d Z = Σ φ(s 0 , a 0 )φ(s 0 , a 0 ) (ε 0 ) Σ ∈ . h E H h h h h hh R h0=1 Then we have H K H 2 X π > −1/2 1 X 1 X −1/2 k k k k > k 2 −1/2  −1/2 π |I − σ | = (ν ) Σ Σ φ(s 0 , a 0 )φ(s 0 , a 0 ) (ε 0 ) Σ − Z Σ ν 1 1 h K H h h h h hh h h h=1 k=1 h0=1

H 2 K H X π > −1/2 1 X  1 X −1/2 k k k k > k 2 −1/2  ≤ (νh ) Σ Σ φ(sh0 , ah0 )φ(sh0 , ah0 ) (εhh0 ) Σ − Zh , 2 K H 2 h=1 k=1 h0=1

p 2 where the last inequality is from Cauchy–Schwarz inequality. From Lemma C.7, we reach I1 → σ1 as K → ∞.

• We rewrite I2 as K H X π > −1/2 1 X 1 X −1/2 k k k k > k k −1/2 −1/2 π I = 2 (ν ) Σ Σ φ(s 0 , a 0 )φ(s 0 , a 0 ) ε 0 ε 0 Σ Σ ν , 2 h1 h h h h h1h h2h h2 K H 0 h1

1 1 i −1/2 d×d Zh h = Σ φ(s 0 , a 0 )φ(s 0 , a 0 ) ε 0 ε 0 Σ ∈ . 1 2 E H h h h h h1h h2h R h0=1 Then we have K H 2 X π > −1/2 1 X 1 X −1/2 k k k k > k k −1/2  −1/2 π |I − σ | = 2 (ν ) Σ Σ φ(s 0 , a 0 )φ(s 0 , a 0 ) ε 0 ε 0 Σ − Z Σ ν 2 2 h1 h h h h h1h h2h h1h2 h2 K H 0 h1

−1/2 π > −1/2 X X −1/2 k k k k > k k −1/2 ≤ 2 (νh ) Σ (νh ) Σ Σ φ(sh0 , ah0 )φ(sh0 , ah0 ) εh h0 εh h0 Σ − Zh1h2 . 1 2 2 2 1 2 2 K H 0 h1

p 2 From Lemma C.7, we reach I2 → σ2 as K → ∞.

Putting the above two steps together, we have verified Condition B.6. Then applying Theorem B.1 we obtain that d 2 E1 → N (0, σ ). On the other hand, according to Lemmas B.6, B.10 in (Duan & Wang, 2020), q 3.5 π > π −1 π π 1/2 −1/2 p ln(8dH/δ)dH |E2| ≤ 15 (ν ) (Σ ) ν · (Σ ) Σ · C1κ1(2 + κ2) · √ 0 0 2 N q 2 π > π −1 π π 1/2 −1/2 5 ln(8dH/δ)C1dH |E3| ≤ (ν ) (Σ ) ν · (Σ ) Σ · √ , 0 0 2 N with probability at least 1 − δ and κ1, κ2 are some problem-dependent constants that do not depend on N. When√ N → ∞, both |E2|, |E3| converge in probability to 0. By Slutsky’s theorem, we have proven the asymptotic normality of N(vπ −vbπ). 

B.4. Proof of Theorem 4.5: Efficiency bound Influence function. Recall that our dataset D consists of K i.i.d. trajectories, each of which has length H. Denote  τ : = s1, a1, r1, s2, a2, r2, . . . , sH , aH , rH , sH+1 .

For simplicity, we assume that the reward rh is deterministic given (sh, ah), i.e. rh = r(sh, ah) for some reward function r. The distribution of τ is given by ¯ P(dτ ) =ξ1(ds1, da1)P(ds2 s1, a1)¯π(da2 | s2)P(ds3 | s2, a2)

... P(dsH | sH−1, aH−1)¯π(daH | sH )P(dsH+1 | sH , aH ).

Define Pη : = P + η∆P where ∆P satisfies (∆P)F ⊆ F under condition 4.1. Denote score functions ∂ ∂ g(τ ) : = log P (dτ ) and g(s0 | s, a) : = log P (ds0 | s, a). ∂η η ∂η η Note that H X g(τ ) = g(sh+1 | sh, ah). h=1

We consider the pointwise estimation. The objective function ψξ1 is defined as

" H # X ψ (P ) : = r (s , a ) (s , a ) ∼ ξ , P , π . ξ1 η E η h h 1 1 1 η h=1

∂ We calculate the derivative ∂η ψξ1 (Pη) and have

" H h−1 # ∂ ∂ X Z Y ψξ1 (Pη) = r(sh, ah)ξ1(ds1, da1) Pη(dsj+1 | sj, aj)π(daj+1 | sj+1) ∂η ∂η h h=1 (S×A) j=1 H h−1 h−1 X Z  X  Y = r(sh, ah) g(sj+1 | sj, aj) ξ1(ds1, da1) Pη(dsj+1 | sj, aj)π(daj+1 | sj+1). h h=1 (S×A) j=1 j=1 Bootstrapping Statistical Inference for Off-Policy Evaluation

π  PH  By using Q-functions Qη,j(sj, aj) := E h=j rη(sh, ah) (sj, aj), Pη, π for j = 1, 2,...,H, Qη,H+1 := 0, we find that

H−1 H H−1 ∂ Z X X Y ψξ1 (Pη) = g(sj+1 | sj, aj) rη(sh, ah)ξ1(ds1, da1) Pη(dsi+1 | si, ai)π(dai+1 | si+1) ∂η H (S×A) j=1 h=j+1 i=1 H−1 j Z X Y = g(sj+1 | sj, aj)ξ1(ds1, da1) Pη(dsi+1 | si, ai)π(dai+1 | si+1) H (S×A) j=1 i=1 H H−1 ! X Y · rη(sh, ah) Pη(dsi+1 | si, ai)π(dai+1 | si+1) h=j+1 i=j+1 H−1 Z j X π Y = g(sj+1 | sj, aj)Qη,j+1(sj+1, aj+1)ξ1(ds1, da1) Pη(dsi+1 | si, ai)π(dai+1 | si+1) j+1 j=1 (S×A) i=1 H X  π  = E g(sh+1 | sh, ah)Vη,h+1(sh+1) (s1, a1) ∼ ξ1, Pη, π . h=1

It follows that

" H # ∂ X π ψξ1 (Pη) =E g(sh+1 | sh, ah)Vh+1(sh+1) (s1, a1) ∼ ξ1, P, π . ∂η η=0 h=1

> −1 π > −1   Define wh(s, a) : = φ(s, a) Σ νh = φ(s, a) Σ E φ(sh, ah) (s1, a1) ∼ ξ1, P, π for h = 1, 2,...,H. For any > f ∈ H with f(s, a) = φ(s, a) wf , we have

   >  E f(sh, ah) (s1, a1) ∼ ξ1, P, π =E φ(sh, ah) wf (s1, a1) ∼ ξ1, P, π  > −1 >  =E φ(sh, ah) Σ E(s,a)∼µ¯[φ(s, a)φ(s, a) ]wf (s1, a1) ∼ ξ1, P, π h  > −1 > i =E(s,a)∼µ¯ E φ(sh, ah) (s1, a1) ∼ ξ1, P, π Σ φ(s, a)φ(s, a) wf   =E(s,a)∼µ¯ wh(s, a)f(s, a) ,

 0 π 0  H where µ¯ is the distribution of dataset D. Since the mapping (s, a) 7→ E g(s | s, a)Vh (s ) s, a belongs to , therefore,

" H # ∂ X 0 π 0 ψξ1 (Pη) =E(s,a)∼µ¯ wh(s, a)g(s | s, a)Vh+1(s ) . ∂η η=0 h=1

 0  Note that E g(s | s, a) s, a = 0, therefore,

" H # ∂ X 0  π 0  π 0  ψξ1 (Pη) = E(s,a)∼µ¯ wh(s, a)g(s | s, a) Vh+1(s ) − E Vh+1(s ) s, a . ∂η η=0 h=1

By definition of µ, we have

∂ ψξ1 (Pη) ∂η η=0 H " H #

1 X X  π  π  ¯ = wh(sj, aj)g(sj+1 | sj, aj) V (sj+1) − V (sj+1) sj, aj (s1, a1) ∼ ξ1, P, π¯ . H E h+1 E h+1 j=1 h=1 Bootstrapping Statistical Inference for Off-Policy Evaluation

 0  We use the property E g(s | s, a) s, a = 0 again and derive that

∂ ψξ1 (Pη) ∂η η=0 H " H H #   1 X X X  π  π  ¯ = wh(sj, aj) g(sl+1 | sl, al) V (sj+1) − V (sj+1) sj, aj (s1, a1) ∼ ξ1, P, π¯ H E h+1 E h+1 j=1 h=1 l=1 " H H #

1 X X  π  π  ¯ = g(τ ) wh(sj, aj) V (sj+1) − V (sj+1) sj, aj (s1, a1) ∼ ξ1, P, π¯ . H E h+1 E h+1 h=1 j=1 We can conclude that H H ˙ 1 X X  π  π  ψP (τ ) : = wh0 (sh, ah) V 0 (sh+1) − V 0 (sh+1) sh, ah , H h +1 E h +1 h=1 h0=1 is an influence function.

Efficiency bound. For notational convenience, we take shorthands

H 0 X  π 0  π 0  q(s, a, s ) : = wh(s, a) Vh+1(s ) − E Vh+1(s ) s, a , h=1 and rewrite H 1 X ψ˙ (τ ) = q(s , a , s ). P H h h h+1 h=1  0  Since E q(s, a, s ) s, a = 0, we find that

" H 2 # H 1  X  1 X h i ψ˙ 2 (τ ) = q(s , a , s ) ξ¯ , P, π¯ = q2(s , a , s ) ξ¯ , P, π¯ . E P H2 E h h h+1 1 H2 E h h h+1 1 h=1 h=1 It follows that

 ˙ 2  1 h  2 0 i 1 h  2 0 i ψ (τ ) = (s,a)∼µ¯ q (s, a, s ) s, a = (s,a)∼µ¯ q (s, a, s ) s, a E P H E E H E E " H 2# 1 > −1 X  π 0  π 0  π = (s,a)∼µ¯ φ(s, a) Σ V (s ) − V (s ) s, a ν , H E h+1 E h+1 h h=1 which coincides with the asymptotic variance of OPE estimator defined in (4.2).



B.5. Proof of Theorem 5.1: Distributional consistency of bootstrapping FQE

In order to simplify the derivation, we assume λ = 0 and that the empirical covariance matrix Σ_{n=1}^{N} φ(s_n, a_n)φ(s_n, a_n)^⊤ is invertible in this section, since the effect of λ is asymptotically negligible. For a matrix A ∈ R^{m×n}, the vec operator stacks the columns of the matrix such that vec(A) ∈ R^{mn×1}. We use the equivalent form of FQE in Eq. (B.2):

\hat{v}_\pi = (\nu^\pi_1)^\top \sum_{h=0}^{H-1}(\widehat{M}_\pi)^h \widehat{R}.

Mcπ can be viewed as the solution of the following multivariate linear regression:

π > > φ (sn+1) = φ(sn, an) Mπ + ηn, Bootstrapping Statistical Inference for Off-Policy Evaluation √ π > > where ηn = φ (sn+1) − φ(sn, an) Mπ. We first derive the asymptotic distribution of Nvec(Mcπ − Mπ) that follows: √ √ N  −1 X  π 0 > >  Nvec(Mcπ − Mπ) = vec NΣb φ(sn, an) φ (sn) − φ(sn, an) Mπ n=1 (B.7) K H −1 1 X  1 X k k  π k0 > k k >  = (NΣb ⊗ Id)√ vec √ φ(sh, ah) φ (sh ) − φ(sh, ah) Mπ , K k=1 H h=1

k π k > k k > where ⊗ is kronecker product. Define ξh = φ (sh+1) − φ(sh, ah) Mπ. From the definition of Mπ, it is easy to see Z Z π k > k k k k > k k > E[φ (sh+1) |sh, ah] = P(s|sh, ah) π(a|s)φ(s, a) dads = φ(sh, ah) Mπ. s a Again with martingale central limit theorem and independence between each episode, we have as K → ∞,

K H 1 X  1 X k k k d √ vec √ φ(sh, ah)ξh → N(0, ∆), (B.8) K k=1 H h=1

2 2 where ∆ ∈ Rd ×d is the covariance matrix defined as: for j, k ∈ [d2] H H h  1 X k k k  1 X k k k i ∆jk = E vec √ φ(sh, ah)ξh vec √ φ(sh, ah)ξh . (B.9) j k H h=1 H h=1

k k Next we start to derive the conditional bootstrap asymptotic distribution. For notation simplicity, denote φhk = φ(sh, ah) π k > H and yhk = φ (sh+1) . We rewrite the dataset combined with feature map φ(·, ·) such that Dk = {φhk, yhk, rhk}h=1. Recall that we bootstrap D by episodes such that each episode is sampled with replacement to form the starred data ∗ ∗ ∗ ∗ H Dk = {φhk, yhk, rhk}h=1 for k ∈ [K]. More specifically,

K K K ∗ X ∗ ∗ X ∗ ∗ X ∗ φhk = Wk φhk, yhk = Wk yhk, rhk = Wk rhk, k=1 k=1 k=1 ∗ ∗ ∗ ∗ where W = (W1 ,...,WK ) is the bootstrap weight. For example, W could be a multinomial random vector with parameters (K; K−1,...,K−1) that forms the standard nonparametric bootstrap. Note that for different h ∈ [H], they have the same bootstrap weight and given the original samples D1,..., DK , the resampled vectors are independent. Define the ∗ ∗ corresponding starred quantity Mcπ , Rb as

K K K H ∗ ∗−1 X X ∗ ∗ ∗ ∗−1 X X ∗ ∗ Mcπ = Σb φhkyhk, Rb = Σb rhkφhk, k=1 h=1 k=1 h=1 where K H ∗ X X ∗ ∗> Σb = φhkφhk . k=1 h=1 √ ∗ We will derive the asymptotic distribution of N(vec(Mcπ − Mcπ)) by using the following decomposition: √ √ K H ∗  ∗−1 X X ∗ ∗ ∗  Nvec(Mcπ − Mcπ) = Nvec Σb φhk(yhk − φhkMcπ) k=1 h=1 K H ∗−1  1 X 1 X ∗ ∗ ∗  = (NΣb ⊗ Id)vec √ √ φhk(yhk − φhkMcπ) . K k=1 H h=1 We denote K H K H 1 X 1 X ∗ 1 X 1 X ∗ ∗ ∗ Z = √ √ φhk(yhk − φhkMπ),Z = √ √ φhk(yhk − φhkMcπ). K k=1 H h=1 K k=1 H h=1 Bootstrapping Statistical Inference for Off-Policy Evaluation

Both Z and Z∗ are the sum of independent d × d random matrices. We prove the bootstrap consistency using the Mallows metric as a central tool. The Mallows metric, relative to the Euclidean norm k · k, for two probability measures µ, ν in Rd is defined as 1/l l Λl(µ, ν) = inf E (kU − V k ), U∼µ,V ∼ν where U and V are two random vectors that U has law µ and V has law ν. For random variables U, V , we sometimes write Λl(U, V ) as the Λl-distance between the laws of U and V . We refer Bickel & Freedman(1981); Freedman et al.(1981) for more details about the properties of Mallows metric. Suppose the common distribution of original K episodes {D1,..., DK 2Hd+H is µ and their empirical distribution is µK . Both µ and µK are probability in R . From Lemma C.1, we know that Λ4(µK , µ) → 0 a.e. as K → ∞.

∗ 1 PH ∗ ∗> • Step 1. We prove Σb /N converges in conditional probability to Σ. From the bootstrap design, H h=1 φkhφkh is 1 PH ∗ ∗> 0 independent of H h=1 φk0hφk0h for any k 6= k . According to Lemma C.3, we have

K H K H K H H  X 1 X X 1 X  X  1 X 1 X  Λ φ∗ φ∗>, φ φ> ≤ Λ φ∗ φ∗>, φ φ> 1 H kh kh H kh kh 1 H kh kh H kh kh k=1 h=1 k=1 h=1 k=1 h=1 h=1 H H  1 X 1 X  = KΛ φ∗ φ∗>, φ φ> . 1 H kh kh H kh kh h=1 h=1 Both sides of the above inequality are random variables such that the distance is computed between the conditional distribution of the starred quantity and the unconditional distribution of the unstarred quantity. Define a mapping Hd d×d d f : R → R such that for any x1, . . . , xH ∈ R ,

$$f(x_1, \ldots, x_H) = \frac{1}{H}\sum_{h=1}^{H} x_h x_h^{\top}.$$
From Lemma C.2 applied with $f$, we have, as $K$ goes to infinity,

$$\Lambda_1\Big(\frac{1}{H}\sum_{h=1}^{H}\phi_{hk}^*\phi_{hk}^{*\top},\ \frac{1}{H}\sum_{h=1}^{H}\phi_{hk}\phi_{hk}^{\top}\Big) \to 0.$$

This implies that the conditional law of $\frac{1}{H}\sum_{h=1}^{H}\phi_{hk}^*\phi_{hk}^{*\top}$ is close to the unconditional law of $\frac{1}{H}\sum_{h=1}^{H}\phi_{hk}\phi_{hk}^{\top}$. By the law of large numbers,
$$\frac{1}{K}\sum_{k=1}^{K}\frac{1}{H}\sum_{h=1}^{H}\phi_{hk}\phi_{hk}^{\top} \xrightarrow{p} \Sigma. \tag{B.10}$$
This further implies that, conditional on $\mathcal{D}$, we have $\widehat{\Sigma}^*/N \xrightarrow{p} \Sigma$.
• Step 2. We prove that $Z^*$ conditionally converges to a multivariate Gaussian distribution. From Lemma C.4,

$$\Lambda_2^2\big(\mathrm{vec}(Z^*), \mathrm{vec}(Z)\big) \le \Lambda_2^2\Big(\mathrm{vec}\Big(\frac{1}{\sqrt{H}}\sum_{h=1}^{H}\phi_{hk}^*\big(y_{hk}^* - \phi_{hk}^{*\top}\widehat{M}_\pi\big)\Big),\ \mathrm{vec}\Big(\frac{1}{\sqrt{H}}\sum_{h=1}^{H}\phi_{hk}\big(y_{hk} - \phi_{hk}^{\top} M_\pi\big)\Big)\Big).$$

Using Lemma C.5, the right-hand side converges to 0 a.e. as $K \to \infty$. This means the conditional law of $\mathrm{vec}(Z^*)$ is close to the unconditional law of $\mathrm{vec}(Z)$, and the latter converges to a multivariate Gaussian distribution with zero mean and covariance matrix $\Delta$ by Eq. (B.8).

By Slutsky's theorem, we have, conditional on $\mathcal{D}$,
$$\sqrt{N}\,\mathrm{vec}(\widehat{M}_\pi^* - \widehat{M}_\pi) \xrightarrow{d} N\big(0,\ (\Sigma^{-1}\otimes I_d)\,\Delta\,(\Sigma^{-1}\otimes I_d)\big), \tag{B.11}$$
where $\Delta$ is defined in Eq. (B.9). According to the equivalence between FQE and the plug-in estimator established in Section B.2,

$$\widehat{v}_\pi^* = (\nu_1^\pi)^\top \sum_{h=0}^{H-1} (\widehat{M}_\pi^*)^h\, \widehat{R}^*, \qquad \widehat{v}_\pi = (\nu_1^\pi)^\top \sum_{h=0}^{H-1} (\widehat{M}_\pi)^h\, \widehat{R}.$$

Define a function $g : \mathbb{R}^{d \times d} \to \mathbb{R}$ as
$$g(M) := (\nu_1^\pi)^\top \sum_{h=0}^{H-1} M^h\, w_r.$$
By the formula for high-order matrix derivatives (Petersen & Pedersen, 2008), we have

$$\frac{\partial}{\partial M}\,(\nu_1^\pi)^\top M^h w_r = \sum_{r=0}^{h-1} (M^r)^\top\, \nu_1^\pi\, w_r^\top\, (M^{h-1-r})^\top \in \mathbb{R}^{d \times d}.$$

This implies that the gradient of $g$ at $\mathrm{vec}(M_\pi)$ is

$$\nabla g(\mathrm{vec}(M_\pi)) = \mathrm{vec}\Big(\sum_{h=0}^{H-1}\sum_{r=0}^{h-1} (M_\pi^r)^\top\, \nu_1^\pi\, w_r^\top\, (M_\pi^{h-1-r})^\top\Big) = \mathrm{vec}\Big(\sum_{h=1}^{H} \nu_h^\pi\, w_r^\top \sum_{h'=1}^{H-h} (M_\pi^\top)^{h'-1}\Big) \in \mathbb{R}^{d^2 \times 1}.$$
Applying the multivariate delta method (Theorem C.6) to Eq. (B.11), we have, conditional on $\mathcal{D}$,
$$\sqrt{N}\big(g(\widehat{M}_\pi^*) - g(\widehat{M}_\pi)\big) \xrightarrow{d} N\Big(0,\ \nabla^\top g(\mathrm{vec}(\widehat{M}_\pi))\,(\Sigma^{-1}\otimes I_d)\,\Delta\,(\Sigma^{-1}\otimes I_d)\,\nabla g(\mathrm{vec}(\widehat{M}_\pi))\Big),$$

where $\Delta$ is defined in Eq. (B.9). From Eq. (B.10), we have $\widehat{\Sigma}/N \xrightarrow{p} \Sigma$. Using Slutsky's theorem and Eqs. (B.7)–(B.8), we have
$$\sqrt{N}\,\mathrm{vec}(\widehat{M}_\pi - M_\pi) \xrightarrow{d} N\big(0,\ (\Sigma^{-1}\otimes I_d)\,\Delta\,(\Sigma^{-1}\otimes I_d)\big).$$

This further implies $\widehat{M}_\pi \xrightarrow{p} M_\pi$. By the continuous mapping theorem,
$$\sqrt{N}\big(g(\widehat{M}_\pi^*) - g(\widehat{M}_\pi)\big) \xrightarrow{d} N\Big(0,\ \nabla^\top g(\mathrm{vec}(M_\pi))\,(\Sigma^{-1}\otimes I_d)\,\Delta\,(\Sigma^{-1}\otimes I_d)\,\nabla g(\mathrm{vec}(M_\pi))\Big).$$
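As a sanity check on the matrix-power derivative used above, the following sketch (illustrative; random placeholder inputs) implements $g$ and compares the stated gradient formula against central finite differences:

```python
# Sketch: check the gradient of g(M) = nu1^T sum_{h<H} M^h w_r numerically.
import numpy as np

def g(M, nu1, wr, H):
    total, Mh = 0.0, np.eye(M.shape[0])
    for _ in range(H):                     # h = 0, ..., H-1
        total += nu1 @ Mh @ wr
        Mh = Mh @ M
    return total

def grad_g(M, nu1, wr, H):
    # sum_{h=0}^{H-1} sum_{r=0}^{h-1} (M^r)^T nu1 wr^T (M^{h-1-r})^T
    d = M.shape[0]
    G = np.zeros((d, d))
    for h in range(H):
        for r_idx in range(h):
            G += (np.linalg.matrix_power(M, r_idx).T @ np.outer(nu1, wr)
                  @ np.linalg.matrix_power(M, h - 1 - r_idx).T)
    return G

rng = np.random.default_rng(0)
d, H, eps = 4, 6, 1e-6
M, nu1, wr = 0.3 * rng.normal(size=(d, d)), rng.normal(size=d), rng.normal(size=d)
num = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        E = np.zeros((d, d)); E[i, j] = eps
        num[i, j] = (g(M + E, nu1, wr, H) - g(M - E, nu1, wr, H)) / (2 * eps)
print(np.max(np.abs(num - grad_g(M, nu1, wr, H))))  # tiny finite-difference error
```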

Now we simplify the variance term as follows:

$$
\begin{aligned}
&\nabla^\top g(\mathrm{vec}(M_\pi))\,(\Sigma^{-1}\otimes I_d)\,\Delta\,(\Sigma^{-1}\otimes I_d)\,\nabla g(\mathrm{vec}(M_\pi)) \\
&\quad= \sum_{h=1}^{H}\sum_{h_1'=1}^{H-h}\sum_{h_2'=1}^{H-h}(\nu_h^\pi)^\top\Sigma^{-1}\,\mathbb{E}\Big[\frac{1}{H}\sum_{h''=1}^{H}\phi(s_{h''}^1,a_{h''}^1)\phi(s_{h''}^1,a_{h''}^1)^\top\big(\xi_{h''}^1(M_\pi)^{h_1'-1}w_r\big)\big(\xi_{h''}^1(M_\pi)^{h_2'-1}w_r\big)\Big]\Sigma^{-1}\nu_h^\pi \\
&\qquad+ 2\sum_{h_1<h_2}\sum_{h_1'=1}^{H-h_1}\sum_{h_2'=1}^{H-h_2}(\nu_{h_1}^\pi)^\top\Sigma^{-1}\,\mathbb{E}\Big[\frac{1}{H}\sum_{h''=1}^{H}\phi(s_{h''}^1,a_{h''}^1)\phi(s_{h''}^1,a_{h''}^1)^\top\big(\xi_{h''}^1(M_\pi)^{h_1'-1}w_r\big)\big(\xi_{h''}^1(M_\pi)^{h_2'-1}w_r\big)\Big]\Sigma^{-1}\nu_{h_2}^\pi,
\end{aligned}
$$
where $\xi_h^1 = \phi(s_h^1, a_h^1)^\top M_\pi - \phi^\pi(s_{h+1}^1)^\top$. Recall that we define
$$
\varepsilon_{h,h'}^1 = Q_h^\pi(s_{h'}^1, a_{h'}^1) - \big(r_{h'}^1 + V_{h+1}^\pi(s_{h'+1}^1)\big)
= \sum_{h''=1}^{H-h}\big(\phi(s_{h'}^1, a_{h'}^1)^\top M_\pi - \phi^\pi(s_{h'+1}^1)^\top\big)(M_\pi)^{h''-1} w_r
= \sum_{h''=1}^{H-h}\xi_{h'}^1\,(M_\pi)^{h''-1} w_r,
$$
where the second equality follows from Eq. (B.3). This implies
$$
\begin{aligned}
&\nabla^\top g(\mathrm{vec}(M_\pi))\,(\Sigma^{-1}\otimes I_d)\,\Delta\,(\Sigma^{-1}\otimes I_d)\,\nabla g(\mathrm{vec}(M_\pi)) \\
&\quad= \sum_{h=1}^{H}(\nu_h^\pi)^\top\Sigma^{-1}\,\mathbb{E}\Big[\frac{1}{H}\sum_{h'=1}^{H}\phi(s_{h'}^1,a_{h'}^1)\phi(s_{h'}^1,a_{h'}^1)^\top(\varepsilon_{h,h'}^1)^2\Big]\Sigma^{-1}\nu_h^\pi \\
&\qquad+ 2\sum_{h_1<h_2}(\nu_{h_1}^\pi)^\top\Sigma^{-1}\,\mathbb{E}\Big[\frac{1}{H}\sum_{h'=1}^{H}\phi(s_{h'}^1,a_{h'}^1)\phi(s_{h'}^1,a_{h'}^1)^\top\,\varepsilon_{h_1,h'}^1\,\varepsilon_{h_2,h'}^1\Big]\Sigma^{-1}\nu_{h_2}^\pi \;=\; \sigma^2.
\end{aligned}
$$

On the other hand,
$$\widehat{R}^* = (\widehat{\Sigma}^*)^{-1}\sum_{k=1}^{K}\sum_{h=1}^{H} r_{hk}^*\,\phi_{hk}^* = KH\,(\widehat{\Sigma}^*)^{-1}\,\frac{1}{K}\sum_{k=1}^{K}\frac{1}{H}\sum_{h=1}^{H} r_{hk}^*\,\phi_{hk}^*.$$
Using Lemma C.3, we have
$$\Lambda_1\Big(\frac{1}{K}\sum_{k=1}^{K}\frac{1}{H}\sum_{h=1}^{H} r_{hk}^*\,\phi_{hk}^*,\ \frac{1}{K}\sum_{k=1}^{K}\frac{1}{H}\sum_{h=1}^{H} r_{hk}\,\phi_{hk}\Big) \le \Lambda_1\Big(\frac{1}{H}\sum_{h=1}^{H} r_{hk}^*\,\phi_{hk}^*,\ \frac{1}{H}\sum_{h=1}^{H} r_{hk}\,\phi_{hk}\Big).$$
The right-hand side of the display goes to 0 as $K \to \infty$. By the law of large numbers,
$$\frac{1}{K}\sum_{k=1}^{K}\frac{1}{H}\sum_{h=1}^{H} r_{hk}\,\phi_{hk} \xrightarrow{p} \mathbb{E}\Big[\frac{1}{H}\sum_{h=1}^{H}\phi(s_h^1, a_h^1)\phi(s_h^1, a_h^1)^\top w_r\Big].$$
Combining this with the fact that the conditional law of $\widehat{\Sigma}^*/N$ concentrates around $\Sigma$, this ends the proof. $\square$
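Before turning to the corollaries, here is a minimal sketch (not the paper's implementation) of how the bootstrap distribution of $\widehat{v}_\pi^* - \widehat{v}_\pi$ is used to form the confidence interval and the variance estimate analyzed next; `fqe_value` is a hypothetical routine returning the FQE point estimate from a list of episodes.

```python
# Sketch: percentile-of-differences bootstrap CI and bootstrap variance estimate.
import numpy as np

def bootstrap_ci(episodes, fqe_value, B=200, delta=0.05, seed=0):
    rng = np.random.default_rng(seed)
    K = len(episodes)
    v_hat = fqe_value(episodes)
    v_star = np.array([
        fqe_value([episodes[i] for i in rng.integers(0, K, size=K)])
        for _ in range(B)
    ])
    diffs = v_star - v_hat
    lo, hi = np.quantile(diffs, [delta / 2, 1 - delta / 2])
    ci = (v_hat - hi, v_hat - lo)       # confidence interval for v_pi
    var_hat = np.var(diffs, ddof=1)     # bootstrap variance estimate
    return v_hat, ci, var_hat
```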

B.6. Proofs of Corollary 5.2 and Corollary 5.4
We prove the consistency of the bootstrap confidence interval using Lemma 23.3 in Van der Vaart (2000). Suppose $\Psi(t) = \mathbb{P}(N(0, \sigma^2) \le t)$. Combining Theorem 4.2 and Theorem 5.1, we have
$$\mathbb{P}_{\mathcal{D}}\big(\sqrt{N}(\widehat{v}_\pi - v_\pi) \le t\big) \to \Psi(t), \qquad \mathbb{P}_{W^*\mid \mathcal{D}}\big(\sqrt{N}(\widehat{v}_\pi^* - \widehat{v}_\pi) \le t\big) \to \Psi(t).$$
Using the quantile convergence theorem (Lemma 21.1 in Van der Vaart (2000)), this implies $q_\delta^\pi \to \Psi^{-1}(\delta)$ almost surely. Therefore,
$$\mathbb{P}_{\mathcal{D}W^*}\Big(v_\pi \le \widehat{v}_\pi - \frac{q_{\delta/2}^\pi}{\sqrt{N}}\Big) = \mathbb{P}_{\mathcal{D}W^*}\big(\sqrt{N}(\widehat{v}_\pi - v_\pi) \ge q_{\delta/2}^\pi\big) \to \mathbb{P}_{\mathcal{D}W^*}\big(N(0, \sigma^2) \ge \Psi^{-1}(\delta/2)\big) = 1 - \delta/2.$$
This finishes the proof of Corollary 5.2.
It is well known that convergence in distribution implies convergence in moments under a uniform integrability condition. The proof of the consistency of the bootstrap moment estimation is then straightforward, since the condition $\limsup_{N\to\infty}\mathbb{E}_{W^*\mid\mathcal{D}}\big[\big(\sqrt{N}(\widehat{v}_\pi^* - \widehat{v}_\pi)\big)^q\big] < \infty$ for some $q > 2$ ensures the required uniform integrability. Together with the distributional consistency in Theorem 5.1, we apply Lemma 2.1 in Kato (2011) and reach the conclusion. $\square$

C. Supporting Results

We present a series of useful lemmas about the Mallows metric.
Lemma C.1 (Lemma 8.4 in Bickel & Freedman (1981)). Let $\{X_i\}_{i=1}^n$ be independent random variables with common distribution $\mu$. Let $\mu_n$ be the empirical distribution of $X_1, \ldots, X_n$. Then $\Lambda_l(\mu_n, \mu) \to 0$ a.e.

Lemma C.2 (Lemma 8.5 in Bickel & Freedman (1981)). Suppose $X_n, X$ are random variables and $\Lambda_l(X_n, X) \to 0$. Let $f$ be a continuous function. Then $\Lambda_l(f(X_n), f(X)) \to 0$.
Lemma C.3 (Lemma 8.6 of Bickel & Freedman (1981)). Let $\{U_i\}_{i=1}^n$ and $\{V_i\}_{i=1}^n$ be independent random vectors. Then
$$\Lambda_1\Big(\sum_{i=1}^{n} U_i,\ \sum_{i=1}^{n} V_i\Big) \le \sum_{i=1}^{n}\Lambda_1(U_i, V_i).$$
Lemma C.4 (Lemma 8.7 of Bickel & Freedman (1981)). Let $\{U_i\}_{i=1}^n$ and $\{V_i\}_{i=1}^n$ be independent random vectors with $\mathbb{E}[U_j] = \mathbb{E}[V_j]$. Then
$$\Lambda_2^2\Big(\sum_{i=1}^{n} U_i,\ \sum_{i=1}^{n} V_i\Big) \le \sum_{i=1}^{n}\Lambda_2^2(U_i, V_i).$$

Let $\mu_K$ and $\mu$ be probability measures on $\mathbb{R}^{2Hd}$. A data point in $\mathbb{R}^{2Hd}$ can be written as $(x_1, \ldots, x_H, y_1, \ldots, y_H)$, where $x_h \in \mathbb{R}^d$ and $y_h \in \mathbb{R}^d$. Denote

$$\Sigma(\mu) = \int \frac{1}{H}\sum_{h=1}^{H} x_h x_h^\top\,\mu(dx_1, \ldots, dx_H, dy_1, \ldots, dy_H),$$
$$M(\mu) = \Sigma(\mu)^{-1}\int \sum_{h=1}^{H} x_h y_h^\top\,\mu(dx_1, \ldots, dx_H, dy_1, \ldots, dy_H),$$
$$\varepsilon(\mu, x_1, \ldots, x_H, y_1, \ldots, y_H) = \sum_{h=1}^{H}\big(y_h - x_h^\top M(\mu)\big).$$

Lemma C.5 (Lemma 7 in Eck (2018)). If $\Lambda_4(\mu_K, \mu) \to 0$ as $K \to \infty$, then the $\mu_K$-law of $\mathrm{vec}\big(\sum_{h=1}^{H}\varepsilon(\mu_K, x_1, \ldots, x_H, y_1, \ldots, y_H)\,x_h^\top\big)$ converges to the $\mu$-law of $\mathrm{vec}\big(\sum_{h=1}^{H}\varepsilon(\mu, x_1, \ldots, x_H, y_1, \ldots, y_H)\,x_h^\top\big)$ in $\Lambda_2$.

Theorem C.6 (Multivariate delta theorem). Suppose $\{T_n\}$ is a sequence of $k$-dimensional random vectors such that $\sqrt{n}(T_n - \theta) \xrightarrow{d} N(0, \Sigma(\theta))$. Let $g : \mathbb{R}^k \to \mathbb{R}$ be once differentiable at $\theta$ with gradient $\nabla g(\theta)$. Then

$$\sqrt{n}\big(g(T_n) - g(\theta)\big) \xrightarrow{d} N\big(0,\ \nabla^\top g(\theta)\,\Sigma(\theta)\,\nabla g(\theta)\big).$$
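A quick one-dimensional Monte Carlo illustration of Theorem C.6 (not part of the paper):

```python
# Illustration: T_n is the sample mean of Exp(1) draws (theta = 1, variance 1) and
# g(x) = x^2, so sqrt(n)(g(T_n) - g(theta)) should be approximately N(0, g'(1)^2) = N(0, 4).
import numpy as np

rng = np.random.default_rng(0)
n, reps = 1_000, 5_000
T = rng.exponential(1.0, size=(reps, n)).mean(axis=1)
Z = np.sqrt(n) * (T ** 2 - 1.0)
print(Z.var())   # close to 4
```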

We restate Lemma B.5 in Duan & Wang (2020) below; it is proven using the matrix Bernstein inequality.
Lemma C.7. Under the assumption that $\phi(s, a)^\top \Sigma^{-1}\phi(s, a) \le C_1 d$ for all $(s, a) \in \mathcal{X}$, with probability at least $1 - \delta$,

$$\Big\|\Sigma^{-1/2}\Big(\frac{1}{N}\sum_{n=1}^{N}\phi(s_n, a_n)\phi(s_n, a_n)^\top\Big)\Sigma^{-1/2} - I\Big\|_2 \le \sqrt{\frac{2\ln(2d/\delta)\,C_1 d H}{N}} + \frac{2\ln(2d/\delta)\,C_1 d H}{3N}. \tag{C.1}$$
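As an illustration only, the following sketch checks the flavor of the bound in Eq. (C.1) in the simplified i.i.d. case $H = 1$, with features uniform on a sphere of radius $\sqrt{d}$ so that $\Sigma = I_d$ and $C_1 = 1$; the paper's setting with within-episode dependence is more general.

```python
# Illustration of Lemma C.7 (simplified to H = 1): compare the realized
# operator-norm deviation with the stated high-probability bound.
import numpy as np

rng = np.random.default_rng(0)
d, N, H, C1, delta = 10, 5_000, 1, 1.0, 0.05
phi = rng.normal(size=(N, d))
phi *= np.sqrt(d) / np.linalg.norm(phi, axis=1, keepdims=True)   # ||phi||^2 = d

deviation = np.linalg.norm(phi.T @ phi / N - np.eye(d), ord=2)
bound = np.sqrt(2 * np.log(2 * d / delta) * C1 * d * H / N) \
        + 2 * np.log(2 * d / delta) * C1 * d * H / (3 * N)
print(deviation, bound)   # the deviation should fall below the bound
```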

D. Supplement for Experiments

D.1. Experiment details
The original CliffWalking environment from OpenAI gym has deterministic state transitions. That is, for any state-action pair $(s, a)$, there exists a corresponding $s' \in \mathcal{S}$ such that $P(\cdot \mid s, a) = \delta_{s'}(\cdot)$. We modify the environment in order to make it stochastic.

[Figure 7: two panels titled "Softmax Behavior Policy"; x-axis: number of episodes in the dataset (10 to 500). Left panel: empirical coverage of the CI, with the expected coverage shown for reference; right panel: interval width. Methods shown: vanilla bootstrap, subsampled bootstrap, HCOPE, and the oracle CI.]
Figure 7. Left: Empirical coverage probability of CI; Right: CI width under different behavior policies.

Specifically, we introduce randomness in the state transitions: given a state-action pair $(s, a)$, the transition takes place in the same way as in the deterministic environment with probability $1 - \epsilon$, and takes place as if the action were a random action $a'$, instead of the intended $a$, with probability $\epsilon$. This is an episodic tabular MDP, and the agent stops when falling off the cliff or reaching the terminal point. We also reduce the penalty for falling off the cliff from $-100$ to $-50$.
The original MountainCar environment from OpenAI gym also has deterministic state transitions. We modify the environment in order to make it stochastic. Specifically, we introduce randomness in state transitions by adding a Gaussian random force, namely $\mathcal{N}(0, \tfrac{1}{10})$ multiplied by the constant-magnitude force from the original environment. We also increase the gravity parameter from 0.0025 to 0.008, the force parameter from 0.001 to 0.008, and the maximum allowed speed from 0.07 to 0.2.
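A minimal sketch of the kind of action-noise wrapper described above, assuming the classic `gym` API with discrete actions (the reward change and the MountainCar force perturbation are not shown, and the exact environment details may differ from the paper's):

```python
# Sketch: with probability eps, replace the chosen action by a uniformly random one.
import gym
import numpy as np

class SlipperyActionWrapper(gym.Wrapper):
    def __init__(self, env, eps=0.1, seed=0):
        super().__init__(env)
        self.eps = eps
        self.rng = np.random.default_rng(seed)

    def step(self, action):
        if self.rng.random() < self.eps:
            action = self.env.action_space.sample()   # act as if a random action were taken
        return self.env.step(action)

# Example usage (hypothetical): env = SlipperyActionWrapper(gym.make("CliffWalking-v0"), eps=0.1)
```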

Empirical coverage probability. The preceding discussion leads to the following simulation method for estimating the coverage probability of a confidence interval. The simulation method has three steps:

1. Simulate many fresh datasets, each containing K episodes generated by the behavior policy.

2. Compute the confidence interval for each dataset.

3. Compute the proportion of datasets for which the true value of the target policy is contained in the confidence interval. That proportion is an estimate of the empirical coverage probability of the confidence interval (a compact sketch of this procedure is given below).
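A compact sketch of the three steps above; `simulate_dataset` and `confidence_interval` are hypothetical callables supplied by the user (the first draws K episodes under the behavior policy, the second maps a dataset to a (lower, upper) interval for the target policy's value, e.g., via bootstrapping FQE):

```python
# Sketch: Monte Carlo estimate of the empirical coverage probability of a CI.
import numpy as np

def empirical_coverage(simulate_dataset, confidence_interval, true_value,
                       K, n_trials=200, delta=0.05, seed=0):
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_trials):
        data = simulate_dataset(K, rng)               # step 1
        lo, hi = confidence_interval(data, delta)     # step 2
        hits += (lo <= true_value <= hi)              # step 3
    return hits / n_trials                            # coverage estimate
```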

The true value of the target policy is computed through Monte Carlo rollouts with a sufficient number of samples (10000 in our experiments). With linear function approximation, we use the confidence interval proposed in Section 6 of Duan & Wang (2020) as a baseline, since it is the only available confidence interval based on FQE. In particular, it shows that with probability at least $1 - \delta$,

$$|\widehat{v}_\pi - v_\pi| \le \sum_{h=1}^{H} (H - h + 1)\,\sqrt{(\widehat{\nu}_h^\pi)^\top\widehat{\Sigma}^{-1}\widehat{\nu}_h^\pi}\;\Bigg(\sqrt{2\lambda} + 2\sqrt{2d\,\log\Big(1 + \frac{N}{\lambda d}\Big)\log\frac{3N^2H}{\delta}} + \frac{4}{3}\log\frac{3N^2H}{\delta}\Bigg),$$

where $(\widehat{\nu}_h^\pi)^\top = (\nu_1^\pi)^\top(\widehat{M}_\pi)^{h-1}$ and $\widehat{M}_\pi$ is defined in Eq. (B.1).

D.2. Additional experiments
In Figure 7, we include the results for the softmax behavior policy in the CliffWalking environment. In Figure 8, we include the results for correlation estimation in the CliffWalking environment. The behavior policy is an $\epsilon$-greedy policy with $\epsilon = 0.1$, while the two target policies are the optimal policy and an $\epsilon$-greedy policy with $\epsilon = 0.1$.


Figure 8. Error of correlation estimates, as data size increases.