A Semiparametric Statistical Approach to Model-Free Policy Evaluation

Tsuyoshi Ueno† [email protected]
Motoaki Kawanabe‡ [email protected]
Takeshi Mori† [email protected]
Shin-ichi Maeda† [email protected]
Shin Ishii† [email protected]

†Graduate School of Informatics, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan
‡Fraunhofer FIRST, IDA, Kekuléstr. 7, 12489 Berlin, Germany

(Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).)

Abstract

Reinforcement learning (RL) methods based on least-squares temporal difference (LSTD) have been developed recently and have shown good practical performance. However, the quality of their estimation has not been well elucidated. In this article, we discuss LSTD-based policy evaluation from the new viewpoint of semiparametric statistical inference. In fact, the estimator can be obtained from a particular estimating function which guarantees its convergence to the true value asymptotically, without specifying a model of the environment. Based on these observations, we 1) analyze the asymptotic variance of an LSTD-based estimator, 2) derive the optimal estimating function with the minimum asymptotic estimation variance, and 3) derive a suboptimal estimator to reduce the computational burden in obtaining the optimal estimating function.

1. Introduction

Reinforcement learning (RL) is a machine learning framework based on reward-related interactions with environments (Sutton & Barto, 1998). In many RL methods, policy evaluation, in which a value function is estimated from sample trajectories, is an important step for improving a current policy. Since RL problems often involve high-dimensional state spaces, the value functions are often approximated by low-dimensional parametric models. Linear function approximation has mostly been used due to its simplicity and computational convenience.

To estimate the value function with a linear model, an online procedure called temporal difference (TD) learning (Sutton & Barto, 1998) and a batch procedure called least-squares temporal difference (LSTD) learning are widely used (Bradtke & Barto, 1996). LSTD can achieve fast learning because it uses entire sample trajectories simultaneously. Recently, efficient procedures for policy improvement combined with policy evaluation by LSTD have been developed and have shown good performance in realistic problems. For example, the least squares policy iteration (LSPI) method maximizes the Q-function estimated by LSTD (Lagoudakis & Parr, 2003), and the natural actor-critic (NAC) algorithm uses the natural policy gradient obtained by LSTD (Peters et al., 2005). Although variance reduction techniques have been proposed for other RL algorithms (Greensmith et al., 2004; Mannor et al., 2007), the important issue of how to evaluate and reduce the estimation variance of LSTD learning remains unresolved.

In this article, we discuss LSTD-based policy evaluation in the framework of semiparametric statistical inference, which is new to the RL field. Estimation of linearly-represented value functions can be formulated as a semiparametric inference problem, where the statistical model includes not only the parameters of interest but also additional nuisance parameters with innumerable degrees of freedom (Godambe, 1991; Amari & Kawanabe, 1997; Bickel et al., 1998). We approach this problem by using estimating functions, which provide a well-established method for semiparametric estimation (Godambe, 1991). We then show that the instrumental variable method, a technique used in LSTD learning, can be constructed from an estimating function, which guarantees its consistency (asymptotic lack of bias) by definition.

As the main results, we show the asymptotic estimation variance of a general instrumental variable method (Lemma 2) and the optimal estimating function that yields the minimum asymptotic variance of the estimation (Theorem 1). We also derive a suboptimal instrumental variable, based on the idea of the c-estimator (Amari & Kawanabe, 1997), to reduce the computational difficulty of estimating the optimal instrumental variable (Theorem 2). As a proof of concept, we compare the mean squared error (MSE) of our new estimators with that of LSTD on a simple example of a Markov decision process (MDP).
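To make the contrast between the online and the batch procedure concrete, here is a minimal Python sketch, not from the paper: the toy chain MDP, tabular features, step size, and trajectory length are all illustrative assumptions. It estimates a linear value function for a fixed policy both by TD(0) updates applied transition by transition and by an LSTD solve that uses the whole trajectory at once, and compares both against the exact discounted value function.

```python
import numpy as np

# A minimal sketch (toy numbers, not from the paper): TD(0) vs. LSTD on a
# 5-state chain with the policy folded into the transition kernel P.
rng = np.random.default_rng(0)
n_states, gamma = 5, 0.9
P = rng.dirichlet(np.ones(n_states), size=n_states)   # assumed P(s'|s) under pi
R = rng.normal(size=n_states)                          # assumed expected reward r(s)
phi = np.eye(n_states)                                 # tabular features phi(s)

def rollout(T=5000):
    s, traj = 0, []
    for _ in range(T):
        s_next = rng.choice(n_states, p=P[s])
        traj.append((s, R[s], s_next))
        s = s_next
    return traj

traj = rollout()

# Online TD(0): one stochastic update per observed transition.
theta_td, alpha = np.zeros(n_states), 0.05
for s, r, s_next in traj:
    td_error = r + gamma * phi[s_next] @ theta_td - phi[s] @ theta_td
    theta_td += alpha * td_error * phi[s]

# Batch LSTD: accumulate sufficient statistics over the whole trajectory,
# then solve A theta = b once (Bradtke & Barto, 1996).
A = sum(np.outer(phi[s], phi[s] - gamma * phi[s_next]) for s, _, s_next in traj)
b = sum(r * phi[s] for s, r, _ in traj)
theta_lstd = np.linalg.solve(A, b)

# Exact discounted value function for reference: V = (I - gamma P)^{-1} R.
V_true = np.linalg.solve(np.eye(n_states) - gamma * P, R)
print(np.round(V_true, 2), np.round(theta_lstd, 2), np.round(theta_td, 2))
```

With tabular features both procedures target the true value function; the batch solve simply extracts all the information in the trajectory at once, which is what makes LSTD attractive inside policy-iteration schemes such as LSPI and NAC.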
2. Background

2.1. MDPs and Policy Evaluation

RL is an approach to finding an optimal policy for sequential decision-making in an unknown environment. We consider a finite MDP, defined as a quadruple $(\mathcal{S}, \mathcal{A}, p, r)$: $\mathcal{S}$ is a finite set of states; $\mathcal{A}$ is a finite set of actions; $p(s_{t+1}|s_t, a_t)$ is the probability of a transition to the next state $s_{t+1}$ when taking action $a_t$ at state $s_t$; and $r(s_t, a_t, s_{t+1})$ is the reward received with the state transition. Let $\pi(s_t, a_t) = p(a_t|s_t)$ be a stochastic policy that the agent follows. We introduce the following assumption concerning the MDP.

Assumption 1. The MDP has a stationary state distribution $d^\pi(s) = p(s)$ under the policy $\pi(s_t, a_t)$.

There are two major choices for the definition of the state value function: discounted reward accumulation and average reward (Bertsekas & Tsitsiklis, 1996). With the former choice, the value function is defined as

$$V^\pi(s) := \mathbb{E}^\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\middle|\, s_0 = s\right], \qquad (1)$$

where $\mathbb{E}^\pi[\,\cdot\,|s_0 = s]$ is the expectation with respect to the sample trajectory conditioned on $s_0 = s$, $r_{t+1} := r(s_t, a_t, s_{t+1})$, and $\gamma \in [0, 1)$ is the discount factor. With the latter choice, on the other hand, the value function is defined as

$$V^\pi(s) := \sum_{t=0}^{\infty} \mathbb{E}^\pi\left[r_{t+1} - \bar{r} \,\middle|\, s_0 = s\right], \qquad (2)$$

where $\bar{r} := \sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}\sum_{s'\in\mathcal{S}} d^\pi(s)\,\pi(s, a)\,p(s'|s, a)\,r(s, a, s')$ denotes the average reward over the stationary distribution.

According to the Bellman equation, eq. (2) can be rewritten as

$$V^\pi(s_t) = \sum_{s_{t+1}\in\mathcal{S}} p(s_{t+1}|s_t)\,\bar{r}(s_t, s_{t+1}) - \bar{r} + \sum_{s_{t+1}\in\mathcal{S}} p(s_{t+1}|s_t)\,V^\pi(s_{t+1}), \qquad (3)$$

where $p(s_{t+1}|s_t) := \sum_{a_t\in\mathcal{A}} \pi(s_t, a_t)\,p(s_{t+1}|s_t, a_t)$ and $\bar{r}(s_t, s_{t+1}) := \frac{\sum_{a_t\in\mathcal{A}} \pi(s_t, a_t)\,p(s_{t+1}|s_t, a_t)\,r(s_t, a_t, s_{t+1})}{p(s_{t+1}|s_t)}$.

Throughout this article, we assume that the linear function approximation is faithful, and we discuss only the asymptotic estimation variance. (In general cases, the bias becomes non-negligible and the selection of basis functions is more important.)

Assumption 2. The value function can be represented as a linear function of some features:

$$V^\pi(s_t) = \phi(s_t)^\top \theta = \phi_t^\top \theta, \qquad (4)$$

where $\phi(s) : \mathcal{S} \to \mathbb{R}^m$ is a feature vector and $\theta \in \mathbb{R}^m$ is a parameter vector.

Here, the symbol $\top$ denotes transpose, and the dimensionality $m$ of the feature vector is smaller than the number of states $|\mathcal{S}|$. Substituting eq. (4) into eq. (3), we obtain the following equation:

$$\left(\phi_t - \sum_{s_{t+1}\in\mathcal{S}} p(s_{t+1}|s_t)\,\phi_{t+1}\right)^{\!\top} \theta = \sum_{s_{t+1}\in\mathcal{S}} p(s_{t+1}|s_t)\,\bar{r}(s_t, s_{t+1}) - \bar{r}. \qquad (5)$$

When the matrix $\mathbb{E}^\pi\!\left[\left(\phi_t - \sum_{s_{t+1}\in\mathcal{S}} p(s_{t+1}|s_t)\phi_{t+1}\right)\left(\phi_t - \sum_{s_{t+1}\in\mathcal{S}} p(s_{t+1}|s_t)\phi_{t+1}\right)^{\!\top}\right]$ is non-singular and $p(s_{t+1}|s_t)$ is known, we can easily obtain the parameter $\theta$. However, since $p(s_{t+1}|s_t)$ is unknown in normal RL settings, we have to estimate the parameter from the sample trajectory $\{s_0, a_0, r_1, \ldots, s_{N-1}, a_{N-1}, r_N\}$ alone, instead of using it directly.
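The remark that $\theta$ is easily obtained once $p(s_{t+1}|s_t)$ is known can be made concrete with a short sketch. The following Python fragment is a hedged illustration with an assumed toy MDP, random features, and random mean rewards (none of these numbers come from the paper): it stacks eq. (5) over all states under the average-reward formulation, computes $\bar{r}$ from the stationary distribution of Assumption 1, and solves for $\theta$.

```python
import numpy as np

# A hedged sketch (toy numbers, not from the paper): with the model known,
# eq. (5) becomes a linear system in theta under the average-reward setting.
rng = np.random.default_rng(1)
n_states, m = 6, 3
P = rng.dirichlet(np.ones(n_states), size=n_states)   # assumed p(s'|s) under pi
R_bar = rng.normal(size=(n_states, n_states))          # assumed mean reward r_bar(s, s')
Phi = rng.normal(size=(n_states, m))                   # assumed feature matrix, rows phi(s)^T

# Stationary distribution d_pi (Assumption 1): left eigenvector of P for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
d_pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
d_pi = d_pi / d_pi.sum()

# Average reward over the stationary distribution.
r_avg = d_pi @ (P * R_bar).sum(axis=1)

# Stack eq. (5) over all states:
#   (phi(s) - sum_s' p(s'|s) phi(s'))^T theta = sum_s' p(s'|s) r_bar(s, s') - r_avg
X = Phi - P @ Phi                       # left-hand-side row for each state
y = (P * R_bar).sum(axis=1) - r_avg     # right-hand-side for each state

# With arbitrary random features the stacked system is generally only
# approximately consistent, so we take the least-squares solution; under
# Assumption 2 (faithful linear approximation) it is exact.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta)
```

The model-free problem treated in the rest of the paper is precisely what happens when $P$ and $\bar{r}(s_t, s_{t+1})$ in this sketch are unavailable and only a sampled trajectory can be used.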
Eq. (5) can be rewritten as

$$y_t = x_t^\top \theta + \epsilon_t, \qquad (6)$$

where $y_t$, $x_t$ and $\epsilon_t$ are defined as

$$y_t := r_{t+1} - \bar{r}, \qquad x_t := \phi_t - \phi_{t+1},$$
$$\epsilon_t := \left(\phi_{t+1} - \sum_{s_{t+1}\in\mathcal{S}} p(s_{t+1}|s_t)\,\phi_{t+1}\right)^{\!\top} \theta + r_{t+1} - \sum_{s_{t+1}\in\mathcal{S}} p(s_{t+1}|s_t)\,\bar{r}(s_t, s_{t+1}). \qquad (7)$$

When we use the discounted reward accumulation for the value function, eq. (6) also holds with

$$y_t := r_{t+1}, \qquad x_t := \phi_t - \gamma\phi_{t+1},$$
$$\epsilon_t := \gamma\left(\phi_{t+1} - \sum_{s_{t+1}\in\mathcal{S}} p(s_{t+1}|s_t)\,\phi_{t+1}\right)^{\!\top} \theta + r_{t+1} - \sum_{s_{t+1}\in\mathcal{S}} p(s_{t+1}|s_t)\,\bar{r}(s_t, s_{t+1}). \qquad (8)$$

Because $\mathbb{E}^\pi[\epsilon_t] = 0$, eq. (6) can be seen as a linear regression problem, where $x$, $y$ and $\epsilon$ are an input, an output and observation noise, respectively (Bradtke & Barto, 1996). Note that

$$\mathbb{E}^\pi\!\left[\epsilon_t\, g(s_t, s_{t-1}, \ldots, s_0)\right] = 0 \qquad (9)$$

holds for any function $g(s_t, s_{t-1}, \ldots, s_0)$ because of the Markov property. The regression problem (6) has [...]

[...] $p(x)$ and the conditional distribution $p(y|x)$ of output $y$ given $x$, respectively. Then, the joint distribution becomes

$$p(x, y; \theta, k_x, k_\epsilon) = p(y|x; \theta, k_\epsilon)\,p(x; k_x). \qquad (11)$$

We would like to estimate the parameter $\theta$ representing the value function in the presence of the extra unknowns $k_x$ and $k_\epsilon$, which can have innumerable degrees of freedom. Statistical models which contain such (possibly infinite-dimensional) nuisance parameters in addition to parameters of interest are called semiparametric (Bickel et al., 1998). In semiparametric inference, one established way of estimating parameters is to employ an estimating function (Godambe, 1991), which can give a consistent estimator of $\theta$ without estimation of the nuisance parameters $k_x$ and $k_\epsilon$. Now we begin with a short overview of the estimating function in the simple i.i.d. case, and then discuss the Markov chain case.
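Eq. (9) is what makes a model-free estimator possible: any function $g$ of the past states is uncorrelated with the noise $\epsilon_t$, so it can serve as an instrumental variable for the regression (6). The sketch below is an illustrative toy example rather than the paper's construction: it uses the discounted formulation of eq. (8), takes $g = \phi_t$ as the instrument, and solves the empirical moment condition $\sum_t \phi_t (y_t - x_t^\top \theta) = 0$, which is the LSTD estimator of Bradtke & Barto (1996). The toy MDP, feature map, and trajectory length are assumptions.

```python
import numpy as np

# A minimal sketch (toy numbers, not from the paper): eq. (9) says eps_t is
# uncorrelated with any function of past states, so phi_t is a valid
# instrumental variable for the regression (6), discounted case of eq. (8).
rng = np.random.default_rng(2)
n_states, gamma = 5, 0.95
P = rng.dirichlet(np.ones(n_states), size=n_states)   # assumed p(s'|s) under pi
R = rng.normal(size=(n_states, n_states))              # assumed reward r(s, s')
phi = np.eye(n_states)                                 # tabular features

# Sample a trajectory and build the pairs of eq. (6):
#   y_t = r_{t+1},  x_t = phi_t - gamma * phi_{t+1},  instrument z_t = phi_t.
s, Z, X, Y = 0, [], [], []
for _ in range(20000):
    s_next = rng.choice(n_states, p=P[s])
    Z.append(phi[s])
    X.append(phi[s] - gamma * phi[s_next])
    Y.append(R[s, s_next])
    s = s_next
Z, X, Y = map(np.asarray, (Z, X, Y))

# Estimating-equation solve: (sum_t z_t x_t^T) theta = sum_t z_t y_t.
theta_iv = np.linalg.solve(Z.T @ X, Z.T @ Y)

# Reference: exact discounted value function V = (I - gamma P)^{-1} E[r | s].
V_true = np.linalg.solve(np.eye(n_states) - gamma * P, (P * R).sum(axis=1))
print(np.round(V_true, 2))
print(np.round(theta_iv, 2))
```

An ordinary least-squares regression of $y_t$ on $x_t$ would be biased here, because $x_t$ contains $\phi_{t+1}$ and is therefore correlated with $\epsilon_t$; the role of the instrument is precisely to restore the moment condition (9).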
