A Semiparametric Statistical Approach to Model-Free Policy Evaluation
Tsuyoshi Ueno† [email protected]
Motoaki Kawanabe‡ [email protected]
Takeshi Mori† [email protected]
Shin-ichi Maeda† [email protected]
Shin Ishii† [email protected]
†Graduate School of Informatics, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan
‡Fraunhofer FIRST, IDA, Kekuléstr. 7, 12489 Berlin, Germany

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

Abstract

Reinforcement learning (RL) methods based on least-squares temporal difference (LSTD) have been developed recently and have shown good practical performance. However, the quality of their estimation has not been well elucidated. In this article, we discuss LSTD-based policy evaluation from the new viewpoint of semiparametric statistical inference. In fact, the estimator can be obtained from a particular estimating function which guarantees its convergence to the true value asymptotically, without specifying a model of the environment. Based on these observations, we 1) analyze the asymptotic variance of an LSTD-based estimator, 2) derive the optimal estimating function with the minimum asymptotic estimation variance, and 3) derive a suboptimal estimator to reduce the computational burden in obtaining the optimal estimating function.

1. Introduction

Reinforcement learning (RL) is a machine learning framework based on reward-related interactions with environments (Sutton & Barto, 1998). In many RL methods, policy evaluation, in which a value function is estimated from sample trajectories, is an important step for improving the current policy. Since RL problems often involve high-dimensional state spaces, the value functions are often approximated by low-dimensional parametric models. Linear function approximation has mostly been used because of its simplicity and computational convenience.

To estimate the value function with a linear model, an online procedure called temporal difference (TD) learning (Sutton & Barto, 1998) and a batch procedure called least-squares temporal difference (LSTD) learning are widely used (Bradtke & Barto, 1996). LSTD can achieve fast learning because it uses entire sample trajectories simultaneously. Recently, efficient procedures that combine policy improvement with policy evaluation by LSTD have been developed and have shown good performance in realistic problems. For example, the least squares policy iteration (LSPI) method maximizes the Q-function estimated by LSTD (Lagoudakis & Parr, 2003), and the natural actor-critic (NAC) algorithm uses the natural policy gradient obtained by LSTD (Peters et al., 2005). Although variance reduction techniques have been proposed for other RL algorithms (Greensmith et al., 2004; Mannor et al., 2007), the important issue of how to evaluate and reduce the estimation variance of LSTD learning remains unresolved.

In this article, we discuss LSTD-based policy evaluation in the framework of semiparametric statistical inference, which is new to the RL field. Estimation of linearly represented value functions can be formulated as a semiparametric inference problem, where the statistical model includes not only the parameters of interest but also additional nuisance parameters with innumerable degrees of freedom (Godambe, 1991; Amari & Kawanabe, 1997; Bickel et al., 1998). We approach this problem by using estimating functions, which provide a well-established method for semiparametric estimation (Godambe, 1991). We then show that the instrumental variable method, a technique used in LSTD learning, can be constructed from an estimating function, which guarantees its consistency (asymptotic lack of bias) by definition.

As the main results, we show the asymptotic estimation variance of a general instrumental variable method (Lemma 2) and the optimal estimating function that yields the minimum asymptotic variance of the estimation (Theorem 1). We also derive a suboptimal instrumental variable, based on the idea of the c-estimator (Amari & Kawanabe, 1997), to reduce the computational difficulty of estimating the optimal instrumental variable (Theorem 2). As a proof of concept, we compare the mean squared error (MSE) of our new estimators with that of LSTD on a simple example of a Markov decision process (MDP).
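For concreteness, here is a minimal sketch (not from the paper) of the batch LSTD estimator for the discounted setting, written in Python with NumPy; the function name and array shapes are our own illustrative assumptions, and the notation anticipates the features φ and discount factor γ introduced in Section 2. The whole trajectory enters only through the two sums A and b, which is how LSTD uses the entire sample trajectory at once.

import numpy as np

def lstd(phi, rewards, gamma):
    """Batch LSTD sketch for the discounted setting (illustrative, not the paper's code).

    phi     : (T+1, m) array whose rows are the feature vectors phi(s_0), ..., phi(s_T)
    rewards : (T,) array of the rewards r_1, ..., r_T observed along the trajectory
    gamma   : discount factor in [0, 1)
    """
    x = phi[:-1] - gamma * phi[1:]    # temporal-difference features phi_t - gamma * phi_{t+1}
    A = phi[:-1].T @ x                # sum over t of phi_t (phi_t - gamma * phi_{t+1})^T
    b = phi[:-1].T @ rewards          # sum over t of phi_t r_{t+1}
    return np.linalg.solve(A, b)      # weights theta, with V(s) approximated by phi(s)^T theta

Solving a single m-by-m linear system A theta = b at the end of the trajectory is what distinguishes this batch procedure from the many small stochastic updates of online TD learning.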
2. Background

2.1. MDPs and Policy Evaluation

RL is an approach to finding an optimal policy for sequential decision-making in an unknown environment. We consider a finite MDP, which is defined as a quadruple (S, A, p, r): S is a finite set of states; A is a finite set of actions; p(s_{t+1}|s_t, a_t) is the probability of transitioning to the next state s_{t+1} when taking action a_t at state s_t; and r(s_t, a_t, s_{t+1}) is the reward received with the state transition. Let π(s_t, a_t) = p(a_t|s_t) be the stochastic policy that the agent follows. We introduce the following assumption concerning the MDP.

Assumption 1. The MDP has a stationary state distribution d^π(s) = p(s) under the policy π(s_t, a_t).

There are two major choices in the definition of the state value function: discounted reward accumulation and average reward (Bertsekas & Tsitsiklis, 1996). With the former choice, the value function is defined as

    V^\pi(s) := E^\pi\left[ \sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\middle|\, s_0 = s \right],  (1)

where E^π[· | s_0 = s] is the expectation with respect to the sample trajectory conditioned on s_0 = s, r_{t+1} := r(s_t, a_t, s_{t+1}), and γ ∈ [0, 1) is the discount factor. With the latter choice, on the other hand, the value function is defined as

    V^\pi(s) := E^\pi\left[ \sum_{t=0}^{\infty} (r_{t+1} - \bar{r}) \,\middle|\, s_0 = s \right],  (2)

where

    \bar{r} := \sum_{s \in S} \sum_{a \in A} \sum_{s' \in S} d^\pi(s)\, \pi(s, a)\, p(s'|s, a)\, r(s, a, s')

denotes the average reward over the stationary distribution.

According to the Bellman equation, eq. (2) can be rewritten as

    V^\pi(s_t) = \sum_{s_{t+1} \in S} p(s_{t+1}|s_t)\, \bar{r}(s_t, s_{t+1}) - \bar{r} + \sum_{s_{t+1} \in S} p(s_{t+1}|s_t)\, V^\pi(s_{t+1}),  (3)

where

    p(s_{t+1}|s_t) := \sum_{a_t \in A} \pi(s_t, a_t)\, p(s_{t+1}|s_t, a_t)

and

    \bar{r}(s_t, s_{t+1}) := \frac{\sum_{a_t \in A} \pi(s_t, a_t)\, p(s_{t+1}|s_t, a_t)\, r(s_t, a_t, s_{t+1})}{p(s_{t+1}|s_t)}.

Throughout this article, we assume that the linear function approximation is faithful, and discuss only the asymptotic estimation variance. (In general cases, the bias becomes non-negligible and the selection of basis functions is more important.)

Assumption 2. The value function can be represented as a linear function of some features:

    V^\pi(s_t) = \phi(s_t)^\top \theta = \phi_t^\top \theta,  (4)

where φ(·) : S → R^m is a feature vector and θ ∈ R^m is a parameter vector.

Here, the symbol ⊤ denotes the transpose, and the dimensionality m of the feature vector is smaller than the number of states |S|. Substituting eq. (4) into eq. (3), we obtain the following equation:

    \left( \phi_t - \sum_{s_{t+1} \in S} p(s_{t+1}|s_t)\, \phi_{t+1} \right)^{\!\top} \theta = \sum_{s_{t+1} \in S} p(s_{t+1}|s_t)\, \bar{r}(s_t, s_{t+1}) - \bar{r}.  (5)

When the matrix

    E^\pi\left[ \left( \phi_t - \sum_{s_{t+1} \in S} p(s_{t+1}|s_t)\, \phi_{t+1} \right) \left( \phi_t - \sum_{s_{t+1} \in S} p(s_{t+1}|s_t)\, \phi_{t+1} \right)^{\!\top} \right]

is non-singular and p(s_{t+1}|s_t) is known, we can easily obtain the parameter θ. However, since p(s_{t+1}|s_t) is unknown in normal RL settings, we have to estimate the parameter from the sample trajectory {s_0, a_0, r_1, ..., s_{N-1}, a_{N-1}, r_N} alone, instead of using the transition probability directly.
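As a concrete illustration of the quantities defined above, the following sketch (ours, not from the paper) computes p(s_{t+1}|s_t), r̄(s_t, s_{t+1}), the stationary distribution d^π, and the average reward r̄ for a small, randomly generated MDP under a fixed policy. All array names, shapes, and constants are assumptions made for this example.

import numpy as np

rng = np.random.default_rng(1)

# A small finite MDP with a fixed stochastic policy (all quantities assumed known here).
n_s, n_a = 4, 3
P_sa = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # p(s'|s,a), shape (S, A, S)
R_sas = rng.normal(size=(n_s, n_a, n_s))              # r(s, a, s')
pi = rng.dirichlet(np.ones(n_a), size=n_s)            # pi(s, a) = p(a|s), shape (S, A)

# State-transition probability under the policy: p(s'|s) = sum_a pi(s,a) p(s'|s,a)
P_s = np.einsum('sa,sax->sx', pi, P_sa)

# Expected reward given the transition s -> s':
#   rbar(s, s') = sum_a pi(s,a) p(s'|s,a) r(s,a,s') / p(s'|s)
rbar_ss = np.einsum('sa,sax,sax->sx', pi, P_sa, R_sas) / P_s

# Stationary distribution d^pi (left eigenvector of P_s for eigenvalue 1, Assumption 1)
# and the average reward rbar appearing in eq. (2).
evals, evecs = np.linalg.eig(P_s.T)
d = np.real(evecs[:, np.argmax(np.real(evals))])
d = d / d.sum()
rbar = d @ (P_s * rbar_ss).sum(axis=1)
print("stationary distribution d^pi :", d)
print("average reward rbar          :", rbar)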
Eq. (5) can be rewritten as

    y_t = x_t^\top \theta + \epsilon_t,  (6)

where y_t, x_t and ε_t are defined as

    y_t := r_{t+1} - \bar{r}, \qquad x_t := \phi_t - \phi_{t+1},
    \epsilon_t := \left( \phi_{t+1} - \sum_{s_{t+1} \in S} p(s_{t+1}|s_t)\, \phi_{t+1} \right)^{\!\top} \theta + r_{t+1} - \sum_{s_{t+1} \in S} p(s_{t+1}|s_t)\, \bar{r}(s_t, s_{t+1}).  (7)

When we use the discounted reward accumulation for the value function, eq. (6) also holds with

    y_t := r_{t+1}, \qquad x_t := \phi_t - \gamma \phi_{t+1},
    \epsilon_t := \gamma \left( \phi_{t+1} - \sum_{s_{t+1} \in S} p(s_{t+1}|s_t)\, \phi_{t+1} \right)^{\!\top} \theta + r_{t+1} - \sum_{s_{t+1} \in S} p(s_{t+1}|s_t)\, \bar{r}(s_t, s_{t+1}).  (8)

Because E^π[ε_t] = 0, eq. (6) can be seen as a linear regression problem, where x, y and ε are the input, the output and the observation noise, respectively (Bradtke & Barto, 1996). Note that

    E^\pi[\epsilon_t\, g(s_t, s_{t-1}, \ldots, s_0)] = 0  (9)

holds for any function g(s_t, s_{t-1}, ..., s_0) because of the Markov property.

The regression problem (6) involves unknown quantities other than the parameter θ of interest: let k_x and k_ε denote the nuisance parameters specifying the marginal distribution p(x) of the input and the conditional distribution p(y|x) of the output y given x, respectively. Then, the joint distribution becomes

    p(x, y; \theta, k_x, k_\epsilon) = p(y|x; \theta, k_\epsilon)\, p(x; k_x).  (11)

We would like to estimate the parameter θ representing the value function in the presence of the extra unknowns k_x and k_ε, which can have innumerable degrees of freedom. Statistical models which contain such (possibly infinite-dimensional) nuisance parameters in addition to the parameters of interest are called semiparametric (Bickel et al., 1998). In semiparametric inference, one established way of estimating the parameters is to employ an estimating function (Godambe, 1991), which can give a consistent estimator of θ without estimating the nuisance parameters k_x and k_ε. We now begin with a short overview of estimating functions in the simple i.i.d. case, and then discuss the Markov chain case.
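To make the regression view tangible, the following self-contained sketch (ours; every constant is arbitrary) builds a small MDP whose transition matrix under the policy is known, forms y_t, x_t and ε_t for the discounted case of eq. (8) with tabular features, and checks numerically that the noise has mean zero and is uncorrelated with a function of the current state, as eq. (9) asserts for the special case g(s_t) = φ(s_t).

import numpy as np

rng = np.random.default_rng(0)

# Known state-transition matrix P[s, s'] = p(s'|s) under the policy and
# expected transition rewards R[s, s'] = rbar(s, s') (arbitrary small example).
n_states, gamma = 5, 0.9
P = rng.dirichlet(np.ones(n_states), size=n_states)
R = rng.normal(size=(n_states, n_states))

# Tabular (one-hot) features, so Assumption 2 holds exactly and theta equals
# the true discounted value function of eq. (1).
Phi = np.eye(n_states)
theta_true = np.linalg.solve(np.eye(n_states) - gamma * P, (P * R).sum(axis=1))

# Sample a trajectory and form the regression variables of eqs. (6) and (8).
T = 100_000
s = np.zeros(T + 1, dtype=int)
r = np.zeros(T)
for t in range(T):
    s[t + 1] = rng.choice(n_states, p=P[s[t]])
    r[t] = R[s[t], s[t + 1]]

y = r                                          # y_t = r_{t+1}
x = Phi[s[:-1]] - gamma * Phi[s[1:]]           # x_t = phi_t - gamma * phi_{t+1}
eps = y - x @ theta_true                       # eps_t of eq. (8), via eq. (6)

# Empirically, E^pi[eps_t] = 0 and E^pi[eps_t phi(s_t)] = 0 (a special case of eq. (9)).
print("mean of eps_t           :", eps.mean())
print("mean of eps_t * phi(s_t):", (Phi[s[:-1]] * eps[:, None]).mean(axis=0))

Note that x_t and ε_t both involve φ_{t+1} and are therefore generally correlated; this is the difficulty that the instrumental-variable construction mentioned in the Introduction is designed to handle.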