A Semiparametric Statistical Approach to Model-Free Policy Evaluation

Tsuyoshi Ueno† ([email protected])    Motoaki Kawanabe‡ ([email protected])    Takeshi Mori† ([email protected])    Shin-ichi Maeda† ([email protected])    Shin Ishii† ([email protected])

†Graduate School of Informatics, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan
‡Fraunhofer FIRST, IDA, Kekuléstr. 7, 12489 Berlin, Germany

Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).

Abstract

Reinforcement learning (RL) methods based on least-squares temporal difference (LSTD) have been developed recently and have shown good practical performance. However, the quality of their estimation has not been well elucidated. In this article, we discuss LSTD-based policy evaluation from the new viewpoint of semiparametric statistical inference. In fact, the LSTD estimator can be obtained from a particular estimating function, which guarantees its convergence to the true value asymptotically without specifying a model of the environment. Based on these observations, we 1) analyze the asymptotic variance of an LSTD-based estimator, 2) derive the optimal estimating function with the minimum asymptotic estimation variance, and 3) derive a suboptimal estimator to reduce the computational burden in obtaining the optimal estimating function.

1. Introduction

Reinforcement learning (RL) is a machine learning framework based on reward-related interactions with environments (Sutton & Barto, 1998). In many RL methods, policy evaluation, in which a value function is estimated from sample trajectories, is an important step for improving the current policy. Since RL problems often involve high-dimensional state spaces, the value functions are often approximated by low-dimensional parametric models. Linear function approximation has mostly been used because of its simplicity and computational convenience.

To estimate the value function with a linear model, an online procedure called temporal difference (TD) learning (Sutton & Barto, 1998) and a batch procedure called least-squares temporal difference (LSTD) learning (Bradtke & Barto, 1996) are widely used. LSTD can achieve fast learning because it uses entire sample trajectories simultaneously. Recently, efficient procedures for policy improvement combined with policy evaluation by LSTD have been developed and have shown good performance in realistic problems. For example, the least-squares policy iteration (LSPI) method maximizes the Q-function estimated by LSTD (Lagoudakis & Parr, 2003), and the natural actor-critic (NAC) algorithm uses the natural policy gradient obtained by LSTD (Peters et al., 2005). Although variance reduction techniques have been proposed for other RL algorithms (Greensmith et al., 2004; Mannor et al., 2007), the important issue of how to evaluate and reduce the estimation variance of LSTD learning remains unresolved.

In this article, we discuss LSTD-based policy evaluation in the framework of semiparametric statistical inference, which is new to the RL field. Estimation of linearly represented value functions can be formulated as a semiparametric inference problem, where the statistical model includes not only the parameters of interest but also additional nuisance parameters with innumerable degrees of freedom (Godambe, 1991; Amari & Kawanabe, 1997; Bickel et al., 1998). We approach this problem by using estimating functions, which provide a well-established method for semiparametric estimation (Godambe, 1991). We then show that the instrumental variable method, a technique used in LSTD learning, can be constructed from an estimating function, which guarantees its consistency (asymptotic lack of bias) by definition.

As the main results, we derive the asymptotic estimation variance of a general instrumental variable method (Lemma 2) and the optimal estimating function that yields the minimum asymptotic variance of the estimation (Theorem 1). We also derive a suboptimal instrumental variable, based on the idea of the c-estimator (Amari & Kawanabe, 1997), to reduce the computational difficulty of estimating the optimal instrumental variable (Theorem 2). As a proof of concept, we compare the mean squared error (MSE) of our new estimators with that of LSTD on a simple example of a Markov decision process (MDP).

2. Background

2.1. MDPs and Policy Evaluation

RL is an approach to finding an optimal policy for sequential decision-making in an unknown environment. We consider a finite MDP, which is defined as a quadruple (S, A, p, r): S is a finite set of states; A is a finite set of actions; p(s_{t+1}|s_t, a_t) is the transition probability to the next state s_{t+1} when taking action a_t at state s_t; and r(s_t, a_t, s_{t+1}) is the reward received with the state transition. Let π(s_t, a_t) = p(a_t|s_t) be a stochastic policy that the agent follows. We introduce the following assumption concerning the MDP.

Assumption 1. The MDP has a stationary state distribution d^π(s) = p(s) under the policy π(s_t, a_t).

There are two major choices in the definition of the state value function: discounted reward accumulation and average reward (Bertsekas & Tsitsiklis, 1996). With the former choice, the value function is defined as

V^\pi(s) := E^\pi\Big[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\Big|\, s_0 = s\Big],   (1)

where E^\pi[\,\cdot\,|s_0 = s] is the expectation with respect to the sample trajectory conditioned on s_0 = s, r_{t+1} := r(s_t, a_t, s_{t+1}), and γ ∈ [0, 1) is the discount factor. With the latter choice, on the other hand, the value function is defined as

V^\pi(s) := \sum_{t=0}^{\infty} E^\pi\big[r_{t+1} - \bar{r} \,\big|\, s_0 = s\big],   (2)

where \bar{r} := \sum_{s \in S} \sum_{a \in A} \sum_{s' \in S} d^\pi(s)\,\pi(s,a)\,p(s'|s,a)\,r(s,a,s') denotes the average reward over the stationary distribution.

According to the Bellman equation, eq. (2) can be rewritten as

V^\pi(s_t) = \sum_{s_{t+1} \in S} p(s_{t+1}|s_t)\,\bar{r}(s_t, s_{t+1}) - \bar{r} + \sum_{s_{t+1} \in S} p(s_{t+1}|s_t)\,V^\pi(s_{t+1}),   (3)

where p(s_{t+1}|s_t) := \sum_{a_t \in A} \pi(s_t, a_t)\,p(s_{t+1}|s_t, a_t) and \bar{r}(s_t, s_{t+1}) := \sum_{a_t \in A} \pi(s_t, a_t)\,p(s_{t+1}|s_t, a_t)\,r(s_t, a_t, s_{t+1}) \,/\, p(s_{t+1}|s_t).

Throughout this article, we assume that the linear function approximation is faithful, and we discuss only the asymptotic estimation variance. (In general cases, the bias becomes non-negligible and the selection of basis functions is more important.)

Assumption 2. The value function can be represented as a linear function of some features:

V^\pi(s_t) = \phi(s_t)^\top \theta = \phi_t^\top \theta,   (4)

where φ(s) : S → R^m is a feature vector and θ ∈ R^m is a parameter vector.

Here, the symbol ⊤ denotes a transpose, and the dimensionality m of the feature vector is smaller than the number of states |S|. Substituting eq. (4) into eq. (3), we obtain the following equation:

\Big(\phi_t - \sum_{s_{t+1} \in S} p(s_{t+1}|s_t)\,\phi_{t+1}\Big)^\top \theta = \sum_{s_{t+1} \in S} p(s_{t+1}|s_t)\,\bar{r}(s_t, s_{t+1}) - \bar{r}.   (5)

When the matrix E^\pi\big[\big(\phi_t - \sum_{s_{t+1}} p(s_{t+1}|s_t)\phi_{t+1}\big)\big(\phi_t - \sum_{s_{t+1}} p(s_{t+1}|s_t)\phi_{t+1}\big)^\top\big] is non-singular and p(s_{t+1}|s_t) is known, we can easily obtain the parameter θ. However, since p(s_{t+1}|s_t) is unknown in normal RL settings, we have to estimate the parameter from the sample trajectory {s_0, a_0, r_1, ..., s_{N-1}, a_{N-1}, r_N} alone, instead of using the transition probability directly.

Eq. (5) can be rewritten as

y_t = x_t^\top \theta + \epsilon_t,   (6)

where y_t, x_t and \epsilon_t are defined as

y_t := r_{t+1} - \bar{r}, \qquad x_t := \phi_t - \phi_{t+1},
\epsilon_t := \Big(\phi_{t+1} - \sum_{s_{t+1} \in S} p(s_{t+1}|s_t)\,\phi_{t+1}\Big)^\top \theta + r_{t+1} - \sum_{s_{t+1} \in S} p(s_{t+1}|s_t)\,\bar{r}(s_t, s_{t+1}).   (7)
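To make the construction of eqs. (6)-(7) concrete, here is a minimal NumPy sketch (the function name and array layout are ours, not from the paper) that turns one sampled trajectory into the regression pairs (x_t, y_t) for the average-reward case. The stationary average reward r̄ is unknown in practice; the sketch replaces it by the empirical mean reward, which is only an approximation.

```python
import numpy as np

def regression_pairs_average_reward(features, rewards):
    """Build the pairs (x_t, y_t) of eqs. (6)-(7) from one sampled trajectory.

    features : (N+1, m) array holding phi(s_0), ..., phi(s_N)
    rewards  : (N,) array holding r_1, ..., r_N
    """
    r_bar = rewards.mean()               # stand-in for the unknown average reward
    x = features[:-1] - features[1:]     # x_t = phi_t - phi_{t+1}
    y = rewards - r_bar                  # y_t = r_{t+1} - r_bar
    return x, y
```

For the discounted case discussed next, the same construction applies with x_t = φ_t − γφ_{t+1} and y_t = r_{t+1}.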

When we use the discounted reward accumulation for the value function, eq. (6) also holds with

y_t := r_{t+1}, \qquad x_t := \phi_t - \gamma\phi_{t+1},
\epsilon_t := \gamma\Big(\phi_{t+1} - \sum_{s_{t+1} \in S} p(s_{t+1}|s_t)\,\phi_{t+1}\Big)^\top \theta + r_{t+1} - \sum_{s_{t+1} \in S} p(s_{t+1}|s_t)\,\bar{r}(s_t, s_{t+1}).   (8)

Because E^\pi[\epsilon_t] = 0, eq. (6) can be seen as a linear regression problem, where x, y and \epsilon are an input, an output and observation noise, respectively (Bradtke & Barto, 1996). Note that

E^\pi[\epsilon_t\, g(s_t, s_{t-1}, \ldots, s_0)] = 0   (9)

holds for any function g(s_t, s_{t-1}, ..., s_0) because of the Markov property. The regression problem (6) has an undesirable property, however, which is known as an "error-in-variable problem" (Young, 1984): the input x_t and the observation noise \epsilon_t are mutually dependent.

It is not easy to solve such an error-in-variable problem in a rigorous manner; in particular, the simple least-squares method lacks consistency. Therefore, LSTD learning has used the instrumental variable method (Bradtke & Barto, 1996), a standard method for the error-in-variable problem that employs an "instrumental variable" to remove the effect of the correlation between the input and the observation noise. With X = [x_0, x_1, ..., x_{N-1}] and y = [y_0, y_1, ..., y_{N-1}]^\top, the estimator of the instrumental variable method for the observed trajectory is given by

\hat{\theta} = [Z X^\top]^{-1} [Z y],   (10)

where Z = [z_0, z_1, ..., z_{N-1}] and z_t is an instrumental variable that is assumed to be correlated with the input x_t but uncorrelated with the observation noise \epsilon_t.

2.2. Semiparametric Model and Estimating Functions

In the error-in-variable problem, if it is possible to assume a reasonable model with a small number of parameters for the joint input-output probability p(x, y), a proper estimator with consistency can be obtained by the maximum likelihood method. Since the transition probability p(s_{t+1}|s_t) is unknown and usually difficult to estimate, however, it is practically impossible to construct such a parametric model. Let k_x and k_\epsilon be parameters which characterize the input distribution p(x) and the conditional distribution p(y|x) of the output y given x, respectively. Then, the joint distribution becomes

p(x, y; \theta, k_x, k_\epsilon) = p(y|x; \theta, k_\epsilon)\, p(x; k_x).   (11)

We would like to estimate the parameter θ representing the value function in the presence of the extra unknowns k_x and k_\epsilon, which can have innumerable degrees of freedom. Statistical models which contain such (possibly infinite-dimensional) nuisance parameters in addition to the parameters of interest are called semiparametric (Bickel et al., 1998). In semiparametric inference, one established way of estimating parameters is to employ an estimating function (Godambe, 1991), which can give a consistent estimator of θ without estimating the nuisance parameters k_x and k_\epsilon. We begin with a short overview of estimating functions in the simple i.i.d. case, and then discuss the Markov chain case.

We consider a general semiparametric model p(x; \theta, \kappa), where θ is an m-dimensional parameter and κ is a nuisance parameter. An m-dimensional vector function f(x; θ) is called an estimating function when it satisfies the following conditions for any θ, κ:

E[f(x; \theta) \,|\, \theta, \kappa] = 0   (12)
\det\big| E\big[\tfrac{\partial}{\partial\theta} f(x; \theta) \,\big|\, \theta, \kappa\big] \big| \neq 0   (13)
E\big[\|f(x; \theta)\|^2 \,\big|\, \theta, \kappa\big] < \infty,   (14)

where E[\,\cdot\,|\theta, \kappa] denotes the expectation with respect to x, which obeys the distribution p(x; \theta, \kappa). The notations \det|\cdot| and \|\cdot\| denote the determinant and the Euclidean norm, respectively. Suppose that i.i.d. samples {x_0, x_1, ..., x_{N-1}} are obtained from the true model p(x; \theta = \theta^*, \kappa = \kappa^*) = p(x; \theta^*, \kappa^*). If there is an estimating function f, then by solving the estimating equation

\sum_{t=0}^{N-1} f(x_t; \hat{\theta}) = 0,   (15)

we can obtain an estimator \hat{\theta} with good asymptotic properties. A solution of eq. (15) is called an "M-estimator" in statistics; the M-estimator is consistent, i.e., it converges to the true parameter θ* regardless of the true nuisance parameter κ* as the sample size N goes to infinity. In addition, its asymptotic variance AV[\hat{\theta}] is given by

AV[\hat{\theta}] = E[(\hat{\theta} - \theta^*)(\hat{\theta} - \theta^*)^\top] = \tfrac{1}{N} A^{-1} M A^{-\top},   (16)

where A = E\big[\tfrac{\partial}{\partial\theta} f(x; \theta) \,\big|\, \theta^*, \kappa^*\big] and M = E\big[f(x; \theta) f^\top(x; \theta) \,\big|\, \theta^*, \kappa^*\big]. The symbol -\top denotes the transpose of the inverse matrix. We omit the time index t unless it is needed for clarity. Note that the asymptotic variance AV depends on the true parameters θ* and κ*, not on the samples {x_0, x_1, ..., x_{N-1}}.
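As an illustration of eqs. (10) and (15) (not code from the paper; the function name is ours), the instrumental-variable estimate for the linear model (6) is simply the root of the empirical estimating equation:

```python
import numpy as np

def iv_estimate(Z, X, y):
    """Instrumental-variable estimate of eq. (10): theta = (Z X^T)^{-1} (Z y).

    Z : (m, N) array whose columns are the instrumental variables z_t
    X : (m, N) array whose columns are the inputs x_t
    y : (N,) array of the outputs y_t
    """
    # Solving (Z X^T) theta = Z y is the same as solving the empirical
    # estimating equation sum_t z_t (x_t^T theta - y_t) = 0.
    return np.linalg.solve(Z @ X.T, Z @ y)
```

Choosing z_t = φ_t recovers the LSTD estimator discussed below.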

The notion of the estimating function can be extended to cases in which the samples are generated by a stochastic process (Godambe, 1985). In the semiparametric model for policy evaluation, under Assumption 1, there exist sufficient conditions for estimating functions which are almost the same as eqs. (12)-(14). The instrumental variable method is a type of estimating function method for semiparametric problems in which the unknown distribution is given by eq. (11).

Lemma 1. Suppose {x_t, y_t} is given by eq. (7) or (8), and z_t is a function of {s_t, ..., s_{t-T}}. If E^\pi[z_t x_t^\top] is nonsingular and E^\pi[\|z_t(x_t^\top\theta - y_t)\|^2] is finite, then

z_t (x_t^\top \theta - y_t)   (17)

is an estimating function for the parameter θ. Therefore, the estimating equation is given by

\sum_{t=0}^{N-1} z_t (x_t^\top \theta - y_t) = 0.   (18)

Proof. For all t, the conditions corresponding to (13) and (14) are satisfied by the assumptions, and condition (12) is satisfied as

E^\pi[z_t(x_t^\top\theta - y_t)] = E^\pi[z_t \epsilon_t] = 0

from the property in eq. (9). (Q.E.D.)

LSTD is specifically an instrumental variable method in which the feature vector z_t = \phi(s_t) = \phi_t is used as the instrumental variable:

f_{\mathrm{LSTD}} = \phi_t (x_t^\top \theta - y_t).   (19)

The solution of the estimating equation is an M-estimator, and its asymptotic variance is given as follows.

Lemma 2. Let z_t be a function of {s_t, ..., s_{t-T}} satisfying the two conditions in Lemma 1, and let \epsilon_t^* = x_t^\top\theta^* - y_t be the residual at the true parameter θ*. Then, the solution \hat{\theta} of the estimating equation (18) has the asymptotic variance

AV[\hat{\theta}] = \tfrac{1}{N} A_{\mathrm{IV}}^{-1} M_{\mathrm{IV}} A_{\mathrm{IV}}^{-\top},   (20)

where A_{\mathrm{IV}} = E_d[z_t x_t^\top] and M_{\mathrm{IV}} = E_d[(\epsilon_t^*)^2 z_t z_t^\top], and E_d[\cdot] denotes the expectation when the sample trajectory starts from the stationary distribution d^\pi(s_0). (We remark that the definitions of A_IV and M_IV do not depend on the time index t; nevertheless, we keep the index t for clarity.)

Proof. The estimating equation (18) can be expressed as

Z y = Z X^\top \hat{\theta},   (21)

where Z = [z_0, ..., z_{N-1}], X = [x_0, ..., x_{N-1}] and y = [y_0, ..., y_{N-1}]^\top. On the other hand, from eq. (6), the left-hand side of eq. (21) is equal to Z X^\top \theta^* + Z\epsilon, where \epsilon = [\epsilon_0, ..., \epsilon_{N-1}]^\top. Thus the asymptotic variance of the estimator \hat{\theta} is obtained as

E^\pi[(\hat{\theta} - \theta^*)(\hat{\theta} - \theta^*)^\top] = E^\pi[(Z X^\top)^{-1} Z\epsilon\epsilon^\top Z^\top (X Z^\top)^{-1}] \;\xrightarrow{N\to\infty}\; \tfrac{1}{N^2}\, A_{\mathrm{IV}}^{-1}\, E^\pi[Z\epsilon\epsilon^\top Z^\top]\, A_{\mathrm{IV}}^{-\top},

where we used the fact that the matrix Z X^\top has the limit

\tfrac{1}{N}(Z X^\top) = \tfrac{1}{N}\sum_{t=0}^{N-1} z_t x_t^\top \;\xrightarrow{N\to\infty}\; A_{\mathrm{IV}}.

Also, the matrix E^\pi[Z\epsilon\epsilon^\top Z^\top] has the following limit:

\tfrac{1}{N}\, E^\pi[Z\epsilon\epsilon^\top Z^\top] = \tfrac{1}{N}\sum_{t=0}^{N-1} E^\pi[(\epsilon_t^*)^2 z_t z_t^\top] \;\xrightarrow{N\to\infty}\; M_{\mathrm{IV}},

where we used the property in eq. (9). Therefore, we have

E^\pi[(\hat{\theta} - \theta^*)(\hat{\theta} - \theta^*)^\top] \;\xrightarrow{N\to\infty}\; \tfrac{1}{N}\, A_{\mathrm{IV}}^{-1} M_{\mathrm{IV}} A_{\mathrm{IV}}^{-\top}.

(Q.E.D.)

To summarize, if we have an instrumental variable which satisfies the assumptions in Lemmas 1 and 2, we can obtain an M-estimator from the estimating equation (18) with the asymptotic variance of eq. (20). When more than one instrumental variable exists, it is appropriate to choose the one whose estimator has the minimum asymptotic variance.

3. Main Results

In this section, we show that the estimating functions for the semiparametric model of policy evaluation are limited to the type of equation used in the instrumental variable method. Furthermore, we derive the optimal instrumental variable, which has the minimum asymptotic variance of the estimation.

We first remark on an invariance property of the instrumental variable method.

Lemma 3. The value function estimate \hat{V}(s_t) = \phi_t^\top\hat{\theta} = \phi_t^\top[Z X^\top]^{-1}[Z y] is invariant with respect to the application of any regular linear transformation to either the instrumental variable z_t or the basis functions \phi_t.

Proof. Assume that the instrumental variable and the basis functions are transformed by regular matrices W_z and W_\phi as z_t' = W_z z_t and \phi_t' = W_\phi \phi_t. Noting that the linear transformation of \phi_t yields the linear transformation of the input, x_t' = W_\phi x_t, the estimator of the instrumental variable method given by eq. (10) becomes \hat{\theta}' = [Z'(X')^\top]^{-1}[Z' y] = W_\phi^{-\top}\hat{\theta}. This means that the estimated value function is invariant: (\phi_t')^\top\hat{\theta}' = \phi_t^\top\hat{\theta}. (Q.E.D.)

When the basis functions span the whole space of functions of the state, any set of basis functions can be represented by applying a linear transformation to another set of basis functions. This observation leads to the following corollary.

Corollary 1. When the basis functions \phi_t span the whole space of functions of the state, the value function estimate is invariant with respect to the choice of the basis functions and of the instrumental variable.

An instrumental variable may depend not only on the current state s_t but also on the previous states {s_{t-1}, ..., s_{t-T}}, because such an instrumental variable does not violate the condition cov[z_t, \epsilon_t] = 0. However, we do not need to consider such instrumental variables, as the following lemma shows.

Lemma 4. Let z_t(s_t, ..., s_{t-T}) be any instrumental variable depending on the current and previous states which satisfies the conditions in Lemmas 1 and 2. Then, there is an instrumental variable depending only on the current state whose corresponding estimator has an equal or smaller asymptotic variance.

Proof. We show that the conditional expectation \tilde{z}_t = E^\pi[z_t|s_t], which depends only on the current state s_t, gives an equally good or better estimator. The matrices in the asymptotic variance, eq. (20), can be calculated as

A_z = E_d[\tilde{z}_t x_t^\top] + E_d[(z_t - \tilde{z}_t) x_t^\top] = A_{\tilde{z}},
M_z = E_d[(\epsilon_t^*)^2 (\tilde{z}_t + z_t - \tilde{z}_t)(\tilde{z}_t + z_t - \tilde{z}_t)^\top] = M_{\tilde{z}} + E_d[(\epsilon_t^*)^2 (z_t - \tilde{z}_t)(z_t - \tilde{z}_t)^\top],

where we have used eq. (9). This implies that

AV[\hat{\theta}_z] = \tfrac{1}{N} A_z^{-1} M_z A_z^{-\top} \;\succeq\; \tfrac{1}{N} A_{\tilde{z}}^{-1} M_{\tilde{z}} A_{\tilde{z}}^{-\top} = AV[\hat{\theta}_{\tilde{z}}].

(Q.E.D.)

Here, the inequality \succeq denotes positive semidefiniteness of the difference. Now we consider the general form of estimating functions for inference of the value function. In the following, we consider only 'admissible' estimating functions; more precisely, we discard 'inadmissible' estimating functions whose estimators are always inferior to those of other estimating functions in the sense of asymptotic variance. To simplify the analysis, we restrict ourselves to estimating functions which are defined on a one-step sample {s, a, s'}.

Proposition 1. For the semiparametric model of eqs. (6), (7) or eqs. (6), (8), all admissible estimating functions of only the one-step sample {s, a, s'} must have the form f = z(y - x^\top\theta), where z is any function which does not depend on s' and satisfies the assumptions in Lemma 1.

Proof. Due to space limitations, we only outline the proof. To be an estimating function, the function f must satisfy

E_d[f] = \sum_{s \in S} d^\pi(s) \sum_{s' \in S} \sum_{a \in A} p(s'|s,a)\,\pi(s,a)\, f(s,a,s') = 0.

Because we can prove that the stationary distribution d^\pi(s) can take any probability vector, \sum_{s \in S} d^\pi(s)\, v(s) = 0 implies that v(s) = 0 for every state s, where v(s) := \sum_{s' \in S} \sum_{a \in A} p(s'|s,a)\,\pi(s,a)\, f(s,a,s'). Furthermore, the Bellman equation (6) holds whatever the transition probability p(s'|s,a) is. To fulfil v = 0, f must therefore have the form f = z(y - x^\top\theta) + h, where z does not depend on s' or a, and h is any function that satisfies \sum_{a \in A} \pi(s,a)\, h(s,a,s') = 0. However, the addition of such a function h necessarily enlarges the asymptotic variance of the estimation. Therefore, the admissible estimating functions are restricted to the form f = z(y - x^\top\theta). (Q.E.D.)

We conjecture that Proposition 1 can be extended to general estimating functions that depend on all previous states and actions; this is work in progress. If the conjecture is true, then by Lemma 4 it is sufficient to consider the instrumental variable method with z_t depending only on the current state s_t for this semiparametric inference problem. We therefore next discuss the optimal instrumental variable of this type in terms of the asymptotic variance, which corresponds to the optimal estimating function.
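As a concrete illustration of Lemma 2 (not code from the paper; names are ours), the following sketch computes the LSTD estimate, i.e., the choice z_t = φ_t of eq. (19), together with a plug-in estimate of its asymptotic covariance (1/N) A_IV^{-1} M_IV A_IV^{-⊤} from eq. (20), using the empirical residuals in place of ε_t^*.

```python
import numpy as np

def lstd_with_asymptotic_cov(phi, x, y):
    """LSTD estimate (z_t = phi_t) and a plug-in estimate of eq. (20).

    phi : (N, m) array of features phi(s_t), used as instrumental variables
    x   : (N, m) array of inputs x_t
    y   : (N,) array of outputs y_t
    """
    N = len(y)
    A = phi.T @ x / N                        # A_IV ~ E_d[z_t x_t^T]
    theta = np.linalg.solve(A, phi.T @ y / N)
    eps2 = (x @ theta - y) ** 2              # empirical residuals for (eps_t^*)^2
    M = (phi * eps2[:, None]).T @ phi / N    # M_IV ~ E_d[(eps_t^*)^2 z_t z_t^T]
    A_inv = np.linalg.inv(A)
    cov = A_inv @ M @ A_inv.T / N            # (1/N) A^{-1} M A^{-T}
    return theta, cov
```

The trace of cov is the quantity that the optimal instrumental variable derived in the next part is designed to minimize.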

Theorem 1. The instrumental variable that gives the minimum asymptotic variance is

z_t^* = E^\pi[(\epsilon_t^*)^2 \,|\, s_t]^{-1}\,\big(\phi_t - \gamma\,E^\pi[\phi_{t+1}|s_t]\big)   (discounted reward accumulation)
z_t^* = E^\pi[(\epsilon_t^*)^2 \,|\, s_t]^{-1}\,\big(\phi_t - E^\pi[\phi_{t+1}|s_t]\big)   (average reward).   (22)

The proof is given in Appendix A. Note that the definition of the optimal instrumental variable involves both the residual \epsilon_t^* and the conditional expectations E^\pi[\phi_{t+1}|s_t] and E^\pi[(\epsilon_t^*)^2|s_t]. To make this estimator practical, we replace the residual \epsilon_t^* with that of the LSTD estimator, and approximate the expectations E^\pi[\phi_{t+1}|s_t] and E^\pi[(\epsilon_t^*)^2|s_t] by function approximation. We call this procedure "gLSTD learning" (see Algorithm 1 for its pseudo code).

Algorithm 1. The pseudo code of gLSTD.

    gLSTD(D, φ)
    // D = {s_0, r_1, ..., s_{N-1}, r_N}: sample sequence;  φ: basis functions
    // Calculate the initial parameter and its residuals
    θ̂_0 ← [ Σ_{t=0}^{N-1} φ_t x_t^⊤ ]^{-1} [ Σ_{t=0}^{N-1} φ_t y_t ]
    ε̂_t ← x_t^⊤ θ̂_0 − y_t
    // Estimate the conditional expectations Ê^π[(ε̂_t)^2|s_t] and Ê^π[x_t|s_t]
    // and construct the instrumental variable
    ẑ_t ← Ê^π[(ε̂_t)^2|s_t]^{-1} Ê^π[x_t|s_t]
    // Calculate the parameter
    θ̂_g ← [ Σ_{t=0}^{N-1} ẑ_t x_t^⊤ ]^{-1} [ Σ_{t=0}^{N-1} ẑ_t y_t ]
    Return θ̂_g

To avoid estimating the functions of the current state, E^\pi[\phi_{t+1}|s_t] and E^\pi[(\epsilon_t^*)^2|s_t], which appear in the optimal instrumental variable, we can simply replace them by constants. When z is an instrumental variable, the addition of any constant to z, z' = z + c, leads to another valid instrumental variable; by Lemma 1, it is easily confirmed that f_c = (z_t + c)(x_t^\top\theta - y_t) is an estimating function. Therefore, the optimal constant c yields a suboptimal instrumental variable, namely the best one within the class of instrumental variables produced by constant shifts.

Theorem 2. The optimal shift is given by

c^* := -\,\frac{E_d[(\epsilon_t^*)^2 z_t] - E_d[(\epsilon_t^*)^2 z_t z_t^\top]\, E_d[x_t z_t^\top]^{-1} E_d[x_t]}{E_d[(\epsilon_t^*)^2] - E_d[(\epsilon_t^*)^2 z_t^\top]\, E_d[x_t z_t^\top]^{-1} E_d[x_t]}.   (23)

The proof is given in Appendix B. In eq. (23), however, the residual \epsilon_t^* is again unknown; hence we need to approximate it too, as in gLSTD learning. We call this procedure "LSTDc learning" (see Algorithm 2 for its pseudo code).

Algorithm 2. The pseudo code of LSTDc.

    LSTDc(D, φ)
    // D = {s_0, r_1, ..., s_{N-1}, r_N}: sample sequence;  φ: basis functions
    // Calculate the initial parameter and its residuals
    θ̂_0 ← [ Σ_{t=0}^{N-1} φ_t x_t^⊤ ]^{-1} [ Σ_{t=0}^{N-1} φ_t y_t ]
    ε̂_t ← x_t^⊤ θ̂_0 − y_t
    // Construct the suboptimal instrumental variable with the optimal shift
    ĉ ← − [ Σ_t ε̂_t^2 φ_t − (Σ_t ε̂_t^2 φ_t φ_t^⊤)(Σ_t x_t φ_t^⊤)^{-1}(Σ_t x_t) ] / [ Σ_t ε̂_t^2 − (Σ_t ε̂_t^2 φ_t^⊤)(Σ_t x_t φ_t^⊤)^{-1}(Σ_t x_t) ]
    ẑ_t ← φ_t + ĉ
    // Calculate the parameter
    θ̂_c ← [ Σ_{t=0}^{N-1} ẑ_t x_t^⊤ ]^{-1} [ Σ_{t=0}^{N-1} ẑ_t y_t ]
    Return θ̂_c
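The following is a minimal NumPy sketch of LSTDc (Algorithm 2) for the discounted-reward case; the function name, array layout and the choice of gamma are ours, and eq. (23) is evaluated with empirical sums and the base instrumental variable z_t = φ_t (the common 1/N factors cancel between numerator and denominator).

```python
import numpy as np

def lstd_c(phi, phi_next, rewards, gamma=0.95):
    """Minimal sketch of LSTDc (Algorithm 2), discounted-reward case.

    phi, phi_next : (N, m) arrays of features phi(s_t), phi(s_{t+1})
    rewards       : (N,) array of rewards r_{t+1}
    """
    x = phi - gamma * phi_next                 # x_t = phi_t - gamma*phi_{t+1}
    y = rewards                                # y_t = r_{t+1}

    # Initial LSTD estimate (z_t = phi_t) and its residuals, eqs. (10)/(19)
    theta0 = np.linalg.solve(phi.T @ x, phi.T @ y)
    eps2 = (x @ theta0 - y) ** 2               # squared residuals

    # Empirical version of the optimal shift c*, eq. (23), with z_t = phi_t
    u = np.linalg.solve(x.T @ phi, x.sum(axis=0))      # [sum x phi^T]^{-1} [sum x]
    num = phi.T @ eps2 - ((phi * eps2[:, None]).T @ phi) @ u
    den = eps2.sum() - (phi.T @ eps2) @ u
    c = -num / den

    # Shifted instrumental variable and final estimate
    z = phi + c
    return np.linalg.solve(z.T @ x, z.T @ y)
```

For the average-reward setting used in the experiment below, the same function applies after substituting x_t = φ_t − φ_{t+1} and y_t = r_{t+1} − r̄.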

4. Simulation Experiments

So far, we have discussed the asymptotic variance under the assumption that we have an infinite number of samples. In this section, we evaluate the performance of the proposed estimators in a practical situation with a finite number of samples. We use an MDP defined on a simple Markov random walk, which was also used in a previous study (Lagoudakis & Parr, 2003). This MDP incorporates a one-dimensional chain walk with four states (Figure 1). Two actions, "left" (L) and "right" (R), are available at every state. Rewards 1 and 0.5 are given when states '2' and '3' are visited, respectively.

[Figure 1. A four-state MDP: states 1-4 arranged in a chain, with rewards r = 0, 1, 0.5, 0 at states 1, 2, 3, 4, respectively.]

We adopt the simplest direct representation of states; the state variable took the value s = 1, 2, 3 or 4 when the corresponding state was visited. The value function was defined as the average reward, eq. (2), and was approximated by a linear function with a three-dimensional basis function: φ(s) = [s, s², s³]^⊤. The policy was set to be random, and at the beginning of each episode an initial state was randomly selected according to the stationary distribution of this Markov chain.

Under these conditions, we performed 100 episodes, each of which consisted of 100 random walk steps. We evaluated the "mean squared error" (MSE) of the value function, i.e., Σ_{i∈{1,2,3,4}} d^π(i) |V̂(i) − V*(i)|², where V̂ and V* denote V̂(i) = φ(s = i)^⊤ θ̂ and V*(i) = φ(s = i)^⊤ θ*, respectively.

Figure 2 shows box-plots of the MSEs of LSTD, LSTDc and gLSTD. For this example, the estimators of the conditional expectations in gLSTD can be calculated by sample averages within each state, because there are only four discrete states. In continuous-state problems, however, estimation of such conditional expectations would become much harder.

[Figure 2. Simulation result: box-plots of the value-function MSE for LSTD, LSTDc and gLSTD (average MSE 2.1733, 0.5311 and 0.5333, respectively).]

In Figure 2, the y-axis denotes the MSE of the value function. The center line and the upper and lower sides of each box denote the median of the MSEs and the upper and lower quartiles, respectively. The number above each box represents the average MSE. There is a significant difference between the MSE of LSTD and those of LSTDc and gLSTD: the estimators for LSTDc and for gLSTD both achieved a much smaller MSE than that for the ordinary LSTD.
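For reference, the sketch below generates trajectories for this kind of experiment and computes per-episode LSTD estimates. The paper does not fully specify the chain dynamics, the random policy, or the stationary initial distribution, so we assume a uniformly random policy whose chosen move succeeds with probability 0.9 and is reversed otherwise (as in the chain walk of Lagoudakis & Parr, 2003), a uniform initial state, and the empirical mean reward in place of r̄; the numbers will therefore not exactly reproduce Figure 2.

```python
import numpy as np

# Hypothetical dynamics for the four-state chain of Figure 1 (an assumption,
# not taken from the paper): a uniformly random policy whose chosen move
# succeeds with probability 0.9 and is reversed otherwise, with the walker
# staying in place when it would leave the chain.
REWARD = {1: 0.0, 2: 1.0, 3: 0.5, 4: 0.0}

def run_episode(length, rng):
    s = int(rng.integers(1, 5))                    # crude stand-in for the stationary start
    states, rewards = [s], []
    for _ in range(length):
        move = rng.choice([-1, 1])                 # random policy: left or right
        if rng.random() >= 0.9:
            move = -move                           # the action fails
        s = min(max(s + move, 1), 4)
        states.append(s)
        rewards.append(REWARD[s])
    return np.array(states), np.array(rewards, dtype=float)

def features(states):
    s = states.astype(float)
    return np.stack([s, s**2, s**3], axis=1)       # phi(s) = [s, s^2, s^3]^T

rng = np.random.default_rng(0)
estimates = []
for _ in range(100):                               # 100 episodes of 100 steps
    states, rewards = run_episode(100, rng)
    ph = features(states)
    x = ph[:-1] - ph[1:]                           # average-reward inputs x_t
    y = rewards - rewards.mean()                   # r_bar ~ empirical mean reward
    estimates.append(np.linalg.solve(ph[:-1].T @ x, ph[:-1].T @ y))   # LSTD
print(np.mean(estimates, axis=0))                  # average LSTD parameter
```

Computing the MSE of Figure 2 additionally requires the true values V*(i), which can be obtained exactly from the (assumed) transition matrix of the chain; the LSTDc and gLSTD estimates can be obtained by substituting the earlier lstd_c sketch (adapted to the average-reward pairs) or per-state sample averages for the conditional expectations.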
5. Conclusion

In this study, we have discussed LSTD-based policy evaluation in the framework of semiparametric statistical inference. We showed that the standard LSTD algorithm is indeed an estimating function method, which is guaranteed to be consistent regardless of the stochastic properties of the environment. Based on the optimal estimating functions in two classes of estimating functions, we constructed two new policy evaluation methods called gLSTD and LSTDc. We also evaluated the asymptotic variance of general instrumental variable methods for MDPs. Moreover, we showed that the form of possible estimating functions for value function estimation is restricted to be the same as that used in the instrumental variable methods. We then demonstrated, through an experiment using a simple MDP problem, that the gLSTD and LSTDc estimators substantially reduce the asymptotic variance of the LSTD estimator.

Further work is necessary to construct procedures for policy updating based on evaluation by gLSTD and LSTDc. It should be possible to incorporate our proposed ideas into least-squares policy iteration (Lagoudakis & Parr, 2003) and the natural actor-critic method (Peters et al., 2005).

A. Proof of Theorem 1: The Optimal Instrumental Variable

As shown in eq. (20), the asymptotic variance of the estimator \hat{\theta}_z is given by

AV[\hat{\theta}_z] = \tfrac{1}{N} A_z^{-1} M_z A_z^{-\top},

where A_z := E_d[z_t x_t^\top] and M_z := E_d[(\epsilon_t^*)^2 z_t z_t^\top]. If we add a small change \delta_t(s_t, ..., s_{t-T}) to the instrumental variable z_t, the matrices become

A_{z+\delta} = A_z + E_d[\delta_t x_t^\top],
M_{z+\delta} = M_z + E_d[(\epsilon_t^*)^2 (\delta_t z_t^\top + z_t \delta_t^\top)].

Therefore, the deviation of the trace of the asymptotic variance can be calculated, to first order in \delta_t, as

\mathrm{Tr}\big[A_{z+\delta}^{-1} M_{z+\delta} A_{z+\delta}^{-\top}\big] - \mathrm{Tr}\big[A_z^{-1} M_z A_z^{-\top}\big]
= -\,\mathrm{Tr}\big[A_z^{-1} E_d[\delta_t x_t^\top] A_z^{-1} M_z A_z^{-\top}\big] - \mathrm{Tr}\big[A_z^{-1} M_z A_z^{-\top} E_d[x_t \delta_t^\top] A_z^{-\top}\big] + \mathrm{Tr}\big[A_z^{-1} E_d[(\epsilon_t^*)^2 (z_t \delta_t^\top + \delta_t z_t^\top)] A_z^{-\top}\big]
= 2\, E_d\big[\delta_t^\top A_z^{-\top} A_z^{-1} E^\pi[(\epsilon_t^*)^2 | s_t, ..., s_{t-T}]\, z_t\big] - 2\, E_d\big[\delta_t^\top A_z^{-\top} A_z^{-1} M_z A_z^{-\top} E^\pi[x_t | s_t, ..., s_{t-T}]\big].

By using the condition that this deviation becomes 0 for any small change \delta_t(s_t, ..., s_{t-T}), the optimal instrumental variable can be obtained as

z_t^* = E^\pi[(\epsilon_t^*)^2 | s_t]^{-1}\, M_z A_z^{-\top}\, E^\pi[x_t | s_t].

Considering Lemma 3, the optimal instrumental variable is restricted to

z_t^* = E^\pi[(\epsilon_t^*)^2 | s_t]^{-1}\, E^\pi[x_t | s_t],

or to its transformation by any regular matrix.

Now, we show that eq. (22) also satisfies global optimality. Substituting z_t^* into the matrix A_{z^*}, we obtain

A_{z^*} = M_{z^*} A_{z^*}^{-\top} F,

where

F := E_d\big[E^\pi[(\epsilon_t^*)^2|s_t]^{-1}\, E^\pi[x_t|s_t]\, E^\pi[x_t^\top|s_t]\big].

Furthermore, the matrices at z_t^* + \delta_t become

A_{z^*+\delta} = M_{z^*} A_{z^*}^{-\top} F + E_d[\delta_t x_t^\top],
M_{z^*+\delta} = M_{z^*} A_{z^*}^{-\top} F A_{z^*}^{-1} M_{z^*} + M_{z^*} A_{z^*}^{-\top} E_d[x_t \delta_t^\top] + E_d[\delta_t x_t^\top] A_{z^*}^{-1} M_{z^*} + E_d[(\epsilon_t^*)^2 \delta_t \delta_t^\top].

Therefore,

A_{z^*+\delta}^{-1} M_{z^*+\delta} A_{z^*+\delta}^{-\top} - A_{z^*}^{-1} M_{z^*} A_{z^*}^{-\top}
= A_{z^*+\delta}^{-1}\big(M_{z^*+\delta} - A_{z^*+\delta} A_{z^*}^{-1} M_{z^*} A_{z^*}^{-\top} A_{z^*+\delta}^{\top}\big) A_{z^*+\delta}^{-\top}
= A_{z^*+\delta}^{-1}\big(E_d[(\epsilon_t^*)^2 \delta_t \delta_t^\top] - E_d[\delta_t x_t^\top]\, F^{-1}\, E_d[x_t \delta_t^\top]\big) A_{z^*+\delta}^{-\top}
= A_{z^*+\delta}^{-1}\, E_d\Big[E^\pi[(\epsilon_t^*)^2|s_t]\,\big(\delta_t - E^\pi[(\epsilon_t^*)^2|s_t]^{-1} E_d[\delta_t x_t^\top] F^{-1} E^\pi[x_t|s_t]\big)\big(\delta_t - E^\pi[(\epsilon_t^*)^2|s_t]^{-1} E_d[\delta_t x_t^\top] F^{-1} E^\pi[x_t|s_t]\big)^\top\Big]\, A_{z^*+\delta}^{-\top}
\;\succeq\; 0.

The equality holds only when \delta_t(s_t) \propto z_t^*. (Q.E.D.)

B. Proof of Theorem 2: The Optimal c

By differentiating the trace of eq. (20), the optimal constant c must satisfy

E_d[(\epsilon_t^*)^2]\, c + E_d[(\epsilon_t^*)^2 \phi_t] = M_c A_c^{-\top} E_d[x_t].

Using the well-known matrix inversion lemma (Horn & Johnson, 1985), the solution can be obtained as eq. (23). In addition, the global optimality among the instrumental variables obtained by constant shifts can be proved by an argument similar to that in Appendix A. (Q.E.D.)

Acknowledgments

The authors wish to thank the anonymous reviewers for their helpful comments and suggestions.

References

Amari, S., & Kawanabe, M. (1997). Information geometry of estimating functions in semi-parametric statistical models. Bernoulli, 3, 29–54.

Bertsekas, D., & Tsitsiklis, J. (1996). Neuro-Dynamic Programming. Athena Scientific.

Bickel, P., Klaassen, C., Ritov, Y., & Wellner, J. (1998). Efficient and Adaptive Estimation for Semiparametric Models. Springer.

Bradtke, S., & Barto, A. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22, 33–57.

Godambe, V. (1985). The foundations of finite sample estimation in stochastic processes. Biometrika, 72, 419–428.

Godambe, V. (Ed.). (1991). Estimating Functions. Oxford Science.

Greensmith, E., Bartlett, P., & Baxter, J. (2004). Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5, 1471–1530.

Horn, R., & Johnson, C. (1985). Matrix Analysis. Cambridge University Press.

Lagoudakis, M., & Parr, R. (2003). Least-squares policy iteration. Journal of Machine Learning Research, 4, 1107–1149.

Mannor, S., Simester, D., Sun, P., & Tsitsiklis, J. N. (2007). Bias and variance approximation in value function estimates. Management Science, 53, 308–322.

Peters, J., Vijayakumar, S., & Schaal, S. (2005). Natural actor-critic. Proceedings of the 16th European Conference on Machine Learning (pp. 280–291).

Sutton, R., & Barto, A. (1998). Reinforcement Learning: An Introduction. MIT Press.

Young, P. (1984). Recursive Estimation and Time-series Analysis. Springer-Verlag.