
Constructing States for Reinforcement Learning

M. M. Hassan Mahmud ([email protected]), The Australian National University, Canberra, 0200 ACT, Australia

Abstract

POMDPs are the models of choice for reinforcement learning (RL) tasks where the environment cannot be observed directly. In many applications we need to learn the POMDP structure and parameters from experience, and this is considered to be a difficult problem. In this paper we address this issue by modeling the hidden environment with a novel class of models that are less expressive, but easier to learn and plan with, than POMDPs. We call these models deterministic Markov models (DMMs); they are the deterministic-probabilistic finite automata from learning theory, extended with actions to the sequential (rather than i.i.d.) setting. Conceptually, we extend the Utile Suffix Memory method of McCallum to handle long term memory. We describe DMMs, give Bayesian methods for learning and planning with them, and present experimental results on some standard POMDP tasks and on tasks designed to illustrate their efficacy.

1. Introduction

In this paper we derive a method to estimate the hidden structure of environments in general reinforcement learning problems. In such problems, at each discrete time step the agent takes an action and in turn receives just an observation and a reward. The goal of the agent is to take actions (i.e. plan) to maximize its future time-averaged discounted rewards (Bertsekas & Shreve, 1996). Clearly, we need to impose some structure on the environment to solve this problem.

The most popular way to do this is to assume that the environment is a partially observable Markov decision process (POMDP), which is (essentially) a hidden Markov model with actions. Given the model (structure and parameters) of a POMDP, there exist effective heuristic algorithms for planning (see the survey by Ross et al. (2008)), although exact planning is undecidable in general (Madani et al., 2003). However, in many important problems the POMDP model is not available a priori and has to be learned from experience. While there are some promising new approaches (e.g. Doshi-Velez (2009) using HDP-POMDPs), this problem is as yet unsolved (and is in fact NP-hard even under severe constraints (Sabbadin et al., 2007)).

One way to bypass this difficult learning problem is to consider simpler environment models. In particular, in this paper we assume that each history deterministically maps to one of finitely many states and that this state is a sufficient statistic of the history (McCallum, 1995; Shalizi & Klinkner, 2004; Hutter, 2009). Given this history-state map the environment becomes an MDP, which can then be used to plan. So the learning problem now is to learn this map. Indeed, the well known USM algorithm (McCallum, 1995) used Prediction Suffix Trees (Ron et al., 1994) for these maps (each history is mapped to exactly one leaf/state) and was quite successful in benchmark POMDP domains. However, PSTs lack long term memory and had difficulty with noisy environments, and so USM was for the most part not followed up on. In our work we consider a Bayesian setup, replace PSTs with finite state machines, and thereby endow the agent with long term memory. The resulting model is a proper subclass of POMDPs, but hopefully retains the computational simplicity and efficiency that comes with considering deterministic history-state maps.

We note that the belief states of POMDPs are also deterministic functions of the history. But this state space is infinite, and so POMDP model learning algorithms try to estimate the hidden states instead (see for instance (Doshi-Velez, 2009)). As a result, these methods are quite different from algorithms using deterministic history-state maps. Other notable methods for learning the environment model include PSRs (Littman et al., 2002); unfortunately we lack space and do not discuss these further. Superficially, finite state controllers for POMDPs (Kaelbling et al., 1998) seem closely related to our work, but these are not quite model learning algorithms. They are (powerful) planning algorithms that assume a hidden but known environment model (at the very least, implicitly, an estimate of the number of hidden states).

We now proceed as follows. We define a general RL environment and then our model, the deterministic Markov model (DMM), and show how to use it to model the environment, infer the state given a history and compute the optimal policy. We then describe our Bayesian inference framework and derive a Bayesian, heuristic Monte-Carlo style algorithm for model learning and planning. Finally, we describe experiments on standard benchmark POMDP domains and on some novel domains to illustrate the advantage of our method over USM. Due to lack of space, formal proofs of our results are given in (Mahmud, 2010); our focus here is on motivating our approach via discussion and experiments.

2. Modeling Environments

To recap, we model a general RL environment by our model, the DMM, and then use the MDP derived from the DMM to plan for the problem. In the following we introduce notation (Sect. 2.1), define a general RL environment (Sect. 2.2), define our model, the DMM (Sect. 2.3), and show how DMMs can model the environment and construct the requisite MDP to plan with (Sect. 2.4). We then describe the DMM inference criterion and the learning algorithm in Sect. 3 and 4.

2.1. Preliminaries

E_{P(x)}[f(x)] denotes the expectation of the function f with respect to the distribution P. We let A be a finite set of actions, O a finite set of observations and R ⊂ IR a finite set of rewards. We set H := (A × O)* to be the set of histories and γ ∈ [0, 1) to be a fixed discount rate. We also need some notation for sequences over finite alphabets: x_{0:n} denotes a string of length n + 1, and λ denotes the empty history.

2.2. General RL Environments

A general RL environment, denoted by grle, is defined by a tuple (A, R, O, RO, γ), where all the quantities except RO are as defined above. RO defines the dynamics of the environment: at step t, when action a is applied, the next reward×observation pair is selected according to RO(ro | h, a), where h = āō_{0:t−1} is the history before step t. We write the marginals of RO over R and O as R and O respectively.

The actions at each step are chosen using a policy π : H → A; the value function of π is defined as

$$ V^{\pi}(h) = E_{R(r|h,a)}[r] + \gamma\, E_{O(o|h,a)}[V^{\pi}(hao)], \quad a := \pi(h) \qquad (1) $$

The goal in RL problems is to learn the optimal policy π*, which satisfies V^{π*}(h) ≥ V^{π}(h) for every policy π and history h. In particular, the value function of this policy is given by

$$ V^{\pi^*}(h) := V^*(h) := \max_a \Big\{ E_{R(r|h,a)}[r] + \gamma\, E_{O(o|h,a)}[V^*(hao)] \Big\} \qquad (2) $$

The existence of these functions follows via standard results (Bertsekas & Shreve, 1996). For the sequel we fix a particular RL environment and denote it by grle := (A, R, O, RO, γ).

2.3. Deterministic Markov Models

Our model is a deterministic probabilistic finite automaton (Vidal et al., 2005) extended with actions, and as such is defined by a finite set of states S with a start state q◦, a deterministic transition function δ : S × A × O → S, and, for each state s and action a, an observation distribution φ_{s,a} and a reward distribution θ_{s,a} (see Fig. 1).

Figure 1. A DMM with A = {a}, R := {0, 1} and O := {o, o', o''}. The edges are labeled with the symbols z ∈ A × O that cause the transition, along with the probability of that transition (parameters φ_{s,a}). The reward distribution for each state appears above the state (parameters θ_{s,a}).
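To make the objects in Fig. 1 concrete, the following is a minimal Python sketch of a DMM as a table of deterministic transitions over action-observation symbols, together with per state-action observation and reward parameters, and of the deterministic history-to-state map the model induces. The class and field names are ours, and the toy numbers below are an illustration, not the exact model of Fig. 1.

```python
from dataclasses import dataclass, field

@dataclass
class DMM:
    """Sketch of a deterministic Markov model: transitions on (action, observation)
    symbols are deterministic, while observations and rewards at each (state, action)
    pair are drawn from the distributions phi and theta (cf. Fig. 1)."""
    start: int = 0
    delta: dict = field(default_factory=dict)   # (state, action, obs) -> next state
    phi: dict = field(default_factory=dict)     # (state, action) -> {obs: prob}
    theta: dict = field(default_factory=dict)   # (state, action) -> {reward: prob}

    def state_of(self, history):
        """Deterministically map a history [(a_0, o_0), ..., (a_t, o_t)] to a state."""
        s = self.start
        for a, o in history:
            s = self.delta[(s, a, o)]
        return s

# A two-state toy model with a single action 'a' and observations 'o', "o'".
toy = DMM(
    delta={(0, 'a', 'o'): 1, (0, 'a', "o'"): 0,
           (1, 'a', 'o'): 1, (1, 'a', "o'"): 0},
    phi={(0, 'a'): {'o': 0.5, "o'": 0.5}, (1, 'a'): {'o': 0.7, "o'": 0.3}},
    theta={(0, 'a'): {0: 0.4, 1: 0.6}, (1, 'a'): {0: 0.2, 1: 0.8}},
)
print(toy.state_of([('a', 'o'), ('a', "o'"), ('a', 'o')]))  # -> 1
```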

3. The DMM Inference Criterion

Given data ārō_{0:n} and a DMM ξ with induced state sequence s̄_{0:n}, let m_{s,a}(r) denote the number of times reward r was received when action a was taken at state s. Placing a symmetric Dirichlet prior w with parameters 1/|R| on each reward distribution θ_{s,a} and integrating it out gives

$$ L(\bar{r}_{0:n} \mid \bar{a}\bar{o}_{0:n}, s, a, \xi) := \int \prod_{r} \tilde{\theta}_{s,a}(r)^{m_{s,a}(r)}\, w(\tilde{\theta}_{s,a})\, d\tilde{\theta}_{s,a} = \Gamma(1)\, \frac{\prod_{r'} \Gamma\big(m_{s,a}(r') + \tfrac{1}{|R|}\big)}{\Gamma\big(\sum_{r'} m_{s,a}(r') + 1\big)} \qquad (8) $$

where Γ is the Gamma function. L is a distribution over rewards at state s for action a. The marginal likelihood of ξ is a distribution over reward sequences (with s̄_{0:n} as the corresponding state sequence) and is given by

$$ \Pr(\bar{r}_{0:n} \mid \bar{a}\bar{o}_{0:n}, \xi) = \prod_{s,a} L(\bar{r}_{0:n} \mid \bar{a}\bar{o}_{0:n}, s, a, \xi) \qquad (9) $$

Finally, we denote the space of DMMs by H and set it to the set of all DMMs with < M states (say M = 2500 (!)), and we put a uniform prior W over this space (so the posterior ratio is equal to the marginal likelihood ratio for DMMs). Given the model space and the prior, the posterior probability of a ξ given ārō_{0:n} is

$$ \Pr(\xi \mid \bar{a}\bar{r}\bar{o}_{0:n}) := \frac{\Pr(\bar{r}_{0:n} \mid \bar{a}\bar{o}_{0:n}, \xi)\, W(\xi)}{\sum_{\xi' \in H} \Pr(\bar{r}_{0:n} \mid \bar{a}\bar{o}_{0:n}, \xi')\, W(\xi')} \qquad (10) $$

In the next section, during learning, we use this posterior as the model selection criterion. To use a learned model for planning, we also need the predictive distributions of rewards and next states for each pair s, a given the data, which are, respectively,

$$ \Pr(r \mid s, a, \bar{a}\bar{r}\bar{o}_{0:n}, \xi) := \frac{m_{s,a}(r) + \tfrac{1}{|R|}}{1 + \sum_{r'} m_{s,a}(r')} \qquad (11) $$

$$ \Pr(s' \mid s, a, \bar{a}\bar{r}\bar{o}_{0:n}, \xi) := \frac{\hat{m}_{s,a}(s') + \tfrac{1}{|S|}}{1 + \sum_{q} \hat{m}_{s,a}(q)} \qquad (12) $$

where m̂_{s,a}(s') is the number of times s̄_i = s', s̄_{i−1} = s and ā_i = a in ās̄_{0:n}. (Equation (12) was obtained by performing the same trick as above on the multinomials Pr(s' | φ_{s,a}, ξ), using the conjugate Dirichlet prior.) Now we are ready to describe our learning and planning algorithm.
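As a concrete reading of (8)-(12), the following sketch (our own code, reusing the toy DMM class sketched in Sect. 2.3 above) accumulates the counts m_{s,a}(r) and m̂_{s,a}(s') along the induced state sequence and computes the log of the displayed marginal likelihood together with the smoothed predictive estimates.

```python
from collections import defaultdict
from math import lgamma

def sufficient_stats(dmm, actions, rewards, observations):
    """Counts m_{s,a}(r) and m_hat_{s,a}(s') along the state sequence induced by dmm."""
    m_r = defaultdict(lambda: defaultdict(int))   # (s, a) -> {r: count}
    m_s = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': count}
    s = dmm.start
    for a, r, o in zip(actions, rewards, observations):
        s_next = dmm.delta[(s, a, o)]
        m_r[(s, a)][r] += 1
        m_s[(s, a)][s_next] += 1
        s = s_next
    return m_r, m_s

def log_marginal_likelihood(m_r, num_rewards):
    """Log of (9): sum over (s, a) of the log of the integral displayed in (8)."""
    total = 0.0
    for counts in m_r.values():
        n_sa = sum(counts.values())
        total -= lgamma(n_sa + 1.0)      # Gamma(1) in (8) contributes log 1 = 0
        total += sum(lgamma(c + 1.0 / num_rewards) for c in counts.values())
        # rewards never seen at (s, a) contribute a factor Gamma(1/|R|) each in (8)
        total += (num_rewards - len(counts)) * lgamma(1.0 / num_rewards)
    return total

def predictive(counts, x, support_size):
    """Smoothed predictive estimates (11) and (12): (count + 1/K) / (1 + total count)."""
    return (counts.get(x, 0) + 1.0 / support_size) / (1.0 + sum(counts.values()))
```

For instance, `predictive(m_r[(s, a)], r, len(R))` corresponds to (11) and `predictive(m_s[(s, a)], s2, num_states)` to (12).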
4. Online Learning and Planning

We give a heuristic algorithm for learning DMMs and planning with them online. The algorithm interleaves phases of acting in the environment (and in the process collecting data) with improving its estimate of the DMM of the target environment via learning. The algorithm is given in Alg. 1. The initializations are self explanatory, so we focus on the main for loop.

Algorithm 1 Algorithm for learning and planning.
Require: T, the length of each phase of acting in the world; N, the number of phases.
1: Generate a random sample ārō_{0:T} from the environment by choosing actions at random.
2: Let ξ_0 be the initial estimate of the target DMM, with a single state (i.e. all transitions are to q◦).
3: for i = 1 to N do
4:   Learn ξ_i from ξ_{i−1} using the history so far, by setting ξ_i to search(ξ_{i−1}, ārō_{0:T·i}).
5:   Estimate the optimal policy π̂* for the MDP corresponding to ξ_i using Q-learning.
6:   Generate a sample ārō_{T·i:T·(i+1)} from the environment by choosing actions according to π̂*.
7: end for

The first step (line 4), learning a better estimate, is the heart of the algorithm and is discussed in detail below. The second step (line 5) is performed as follows. First, an MDP corresponding to ξ_i is constructed using the method in Sect. 2.4, but with the estimates (11) and (12) in place of the MDP reward function (5) and state-transition function (6) respectively. Then the optimal policy is learned offline by running Q-learning on the constructed MDP (we do not use value iteration because Q-learning was seen to be quicker, without any noticeable difference in performance).

In the third step (line 6) we follow the policy as described at the end of Sect. 2.4. Additionally, whenever the MDP/DMM transitions from state s to s' on action a and observes reward r, we perform a Q-learning update on the Q-value of s at a (the Q-values learned in line 5 are used as initial values for this step). This additional Q-learning is a heuristic to ensure that we overcome incorrect parameter estimates, due to mismatch between δ_i, δ_g and S_i, S_g of ξ_i and ξ_g, and still properly explore the domain. This is our approach to dealing with the infamous 'exacerbated exploration problem' that plagues general RL (Chrisman, 1992). During this Q-learning we also take a random action with probability 0.1, to further aid exploration.
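Line 5 of Alg. 1 learns a policy offline by Q-learning on the constructed MDP. The following is a minimal tabular sketch of that step (our own code, with the MDP given directly by the smoothed estimates (11) and (12) as dictionaries; the step size, episode length and restart scheme are assumptions, not taken from the paper).

```python
import random

def q_learning_on_mdp(states, actions, trans, rew, gamma,
                      episodes=2000, horizon=50, alpha=0.1, eps=0.1):
    """Tabular Q-learning on an MDP with sampled dynamics.

    trans[(s, a)] is a dict {s': prob} (estimate (12));
    rew[(s, a)]   is a dict {r: prob}  (estimate (11))."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(episodes):
        s = random.choice(states)
        for _ in range(horizon):
            a = (random.choice(actions) if random.random() < eps
                 else max(actions, key=lambda b: Q[(s, b)]))
            s2 = sample(trans[(s, a)])
            r = sample(rew[(s, a)])
            target = r + gamma * max(Q[(s2, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q

def sample(dist):
    """Draw a key of dist with probability proportional to its value."""
    u, acc = random.random(), 0.0
    for x, p in dist.items():
        acc += p
        if u <= acc:
            return x
    return x
```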

4.1. Estimating ξ_i

We now discuss how to estimate ξ_i in the first step of the for loop (line 4). DMMs are sequential versions of DPFAs (Vidal et al., 2005), and learning the latter is NP-hard when states may have equal distributions. Since this is in fact often the case in RL problems, we need to build significant bias into our search algorithm.

We introduce this bias by restricting our algorithm to a particular class of DMMs. Each DMM in this class is a tree, such that the outgoing edges at each node can go only to a child or to an ancestor; only leaf nodes are allowed to transition to other nodes that are not their descendants (see Fig. 2). This is similar to a looping suffix tree (Holmes & Isbell-Jr., 2006), but our model does not have a suffix tree structure: each node in the graph is in fact a state. It is straightforward to show that all DMMs have such a loopy representation, although possibly with exponentially many states. For typical POMDP domains this representation seems particularly suitable, as the dynamics in such domains have a neighborhood structure in which we can only transition back to states that were visited recently.

Figure 2. Example of a looped DMM. The edge labels have been removed for clarity. The root node/state is at the top and is the start state; the leaf nodes are at the bottom, and the parent of each node is the node above it.
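As an illustration of this representational constraint, a minimal node structure might look as follows (our own sketch; the paper does not prescribe a data structure). The two legal modifications, extending a leaf with a new child and looping an unextended edge back to an ancestor, appear as methods.

```python
class Node:
    """A state in a looped-tree DMM; children and loops are keyed by (action, obs)."""
    def __init__(self, parent=None):
        self.parent = parent
        self.children = {}   # (a, o) -> child Node
        self.loops = {}      # (a, o) -> ancestor Node

    def ancestors(self):
        n, out = self.parent, []
        while n is not None:
            out.append(n)
            n = n.parent
        return out           # nearest ancestor first

    def extend(self, ao):
        """Grow the tree: the edge ao at this node now leads to a new leaf."""
        assert ao not in self.children and ao not in self.loops
        self.children[ao] = Node(parent=self)
        return self.children[ao]

    def loop(self, ao, ancestor):
        """Send the unextended edge ao back to one of this node's ancestors."""
        assert ao not in self.children and ancestor in self.ancestors()
        self.loops[ao] = ancestor
```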

Now our search algorithm (search, Alg. 2) is a stochastic search over this space, with the posterior (10) as the selection criterion. At each step, search performs one of two possible operations: it chooses a leaf node and extends it by adding another leaf node (extend, Alg. 3), or it chooses an internal or a leaf node and loops one of its unextended edges back to an ancestor (loop, Alg. 4). The modification is then accepted as the estimate of ξ in the next iteration if it satisfies a Metropolis-Hastings (MH) style condition (line 5). The condition contains a parameter α, which was set to 25 in all our experiments (this simulates the proposal distribution ratio in the MH acceptance probability).

Algorithm 2 Main stochastic search.
function: search(ξ, ārō_{0:n})
Require: K, the number of iterations.
1: Set ξ_cur to ξ.
2: for i = 1 to K do
3:   With probability 0.6 and 0.4 respectively, set ξ_prop to extend(ξ_cur, ārō_{0:n}) or loop(ξ_cur, ārō_{0:n}).
4:   Choose rnd ∈ [0, 1] uniformly at random.
5:   If Pr(ξ_prop | ārō_{0:n}) / (α Pr(ξ_cur | ārō_{0:n})) > rnd, set ξ_cur to ξ_prop.
6: end for
7: Return ξ_cur.

Algorithm 3 Function for extending edges.
function: extend(ξ, ārō_{0:n})
1: Sample a leaf node/state s according to its frequency in s̄_{0:n} (where s̄_k = δ(q◦, āō_{0:k})).
2: Sample an outgoing edge ao at s according to the empirical (w.r.t. ārō_{0:n}) next-step reward at ao.
3: Create a new state s_new with context x_s·ao, where x_s is the context of s (the context of the root is empty).
4: Set δ(s, ao) to s_new; set δ(s_new, a'o') = arg max_{s'} { length(x_{s'}) | x_{s'} is a suffix of x_{s_new} }.
5: Add s_new to the state set of ξ.
6: Return ξ.

Algorithm 4 Function for looping edges.
function: loop(ξ, ārō_{0:n})
1: Sample a leaf node/state s according to its frequency in s̄_{0:n} (where s̄_k = δ(q◦, āō_{0:k})).
2: Sample an outgoing, unextended edge ao at s according to the empirical (w.r.t. ārō_{0:n}) next-step reward at ao.
3: Choose an ancestor s' of s according to 2^{−m}/Z, where m is the number of ancestors of s between s and s'.
4: Set δ(s, ao) to s'.
5: Return ξ.

We now discuss the heuristics used in extend and loop to choose the nodes to modify. In line 1 of both procedures, we sample states according to their frequency in the data, because we expect the impact on the likelihood to be greatest if we modify high-frequency states. Then, in line 2 of both, we sample an outgoing edge ao of the state s chosen in line 1 according to the average reward seen at s after taking action a and observing o. The intuition is that if the reward at that edge is high, then it may be a good idea to create a new state there (in extend) or to move the edge to a different state (in loop) to get a better model. A more sophisticated version of this heuristic would use the future discounted reward at the edge. In lines 3 and 4, extend constructs the new state for the chosen edge. In line 3, loop chooses an ancestor to loop back to, choosing a closer ancestor with higher probability; this is because in typical POMDP domains we expect the state to transition back to states visited very recently. loop then loops the edge in line 4.
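A log-space reading of Alg. 2, together with the geometric ancestor choice of line 3 of Alg. 4, might look as follows (our own sketch; `log_posterior`, `propose_extend` and `propose_loop` are placeholders standing for (10) and for Algs. 3 and 4, and `Node` is the class sketched above).

```python
import math
import random

def search(xi, data, log_posterior, propose_extend, propose_loop,
           K=5000, alpha=25.0, p_extend=0.6):
    """Sketch of Alg. 2: MH-style stochastic search over looped-tree DMMs.

    log_posterior(xi, data) returns the log of (10) up to a constant
    (with the uniform prior this is just the log marginal likelihood (9))."""
    cur, cur_score = xi, log_posterior(xi, data)
    for _ in range(K):
        proposer = propose_extend if random.random() < p_extend else propose_loop
        prop = proposer(cur, data)
        prop_score = log_posterior(prop, data)
        # Accept if Pr(prop | data) / (alpha * Pr(cur | data)) > rnd, in log space.
        rnd = max(random.random(), 1e-300)        # guard against log(0)
        if prop_score - cur_score - math.log(alpha) > math.log(rnd):
            cur, cur_score = prop, prop_score
    return cur

def sample_ancestor(node):
    """Heuristic of Alg. 4, line 3: pick an ancestor with probability ~ 2^-m."""
    anc = node.ancestors()
    assert anc, "the root has no ancestors"
    weights = [2.0 ** -m for m in range(len(anc))]
    return random.choices(anc, weights=weights, k=1)[0]
```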
5. Experiments

5.1. Setup

We ran experiments to illustrate that we do in fact extend McCallum's Utile Suffix Memory in desirable ways. We first ran experiments on three POMDP maze problems to evaluate our model on standard problems. Then we ran experiments on three novel, non-episodic domains to test the long term memory ability of our method. The domains for the first set are given in Fig. 3 (numbered 1, 2 and 3 respectively); they have the same dynamics as in (McCallum, 1995), with state transition and observation noise.

Figure 3. The three POMDP mazes: Half Cheese Maze (3 × 7), Full Cheese Maze (5 × 7) and Bigger Maze (8 × 9). S is the start state and G is the goal state.

The domains for the second type of experiment are a set of 'harvesting' tasks with m crops. A crop i becomes ready to be harvested after being developed through k_i different phases. It is ready to be harvested for just one time step once development is complete, and then it spoils immediately, requiring it to be grown again. The agent has m + 1 different actions. Action a_i develops crop i to its next phase with probability 0.7 and develops some other crop with probability 0.3. The remaining action allows the agent to harvest any crop that is ready to be harvested at that step. The observation at each step is the crop that was developed at that step. So essentially this is a counting task, where the agent has to remember which phase each crop is at, and hence a finite history suffix is not sufficient to make a decision. The agent receives a −1.0 reward for each action and a 5.0 reward for a successful harvest. There are no episodes; each problem instance essentially runs forever.

We used several different values of [m, (k_1, k_2, ..., k_m)]: [4, (2, 2, 2, 2)], [5, (2, 2, 2, 2, 2)] and [7, (2, 2, 2, 2, 2, 2, 2)] (referred to as harvesting problems 1, 2 and 3 respectively from now on). The POMDP representations of these problems have, respectively, 4·2^4 = 64, 5·2^5 = 160 and 7·2^7 = 896 states. In each of these cases we let USM and the DMM run for 3000 steps and report the total accumulated reward for each method, averaged over 10 trials.
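To make the harvesting dynamics concrete, here is a small simulator sketch of our reading of the description above. The class and method names are ours, and details the text does not specify, such as how the 'other crop' is chosen on a failed development, which crop is observed when harvesting, and what happens when several crops are ready at once, are assumptions here.

```python
import random

class HarvestEnv:
    """Sketch of a harvesting task with m crops; crop i needs k[i] development phases."""

    def __init__(self, k):
        self.k = list(k)              # phases needed per crop
        self.m = len(k)
        self.phase = [0] * self.m     # current development phase of each crop

    def step(self, action):
        """action in 0..m-1 develops a crop, action == m harvests.
        Returns (observation, reward); observation is the crop developed (or m if none)."""
        reward, obs = -1.0, self.m
        ready = [i for i in range(self.m) if self.phase[i] == self.k[i]]
        if action == self.m:                      # harvest action
            if ready:
                reward += 5.0
                for i in ready:
                    self.phase[i] = 0             # harvested crops are grown again
        else:
            # Develop the chosen crop with prob. 0.7, otherwise some other crop
            # (assumption: the other crop is picked uniformly at random).
            obs = action if random.random() < 0.7 else \
                  random.choice([j for j in range(self.m) if j != action])
            self.phase[obs] = min(self.phase[obs] + 1, self.k[obs])
            for i in ready:                       # unharvested ready crops spoil
                self.phase[i] = 0
        return obs, reward
```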

5.2. Results

The results for the three maze domains are given in Fig. 4, where we report the median time (number of steps) per trial to reach the goal state, to be comparable to (McCallum, 1995); the performance of our implementation of USM matches that of (McCallum, 1995). In these experiments, T in Alg. 1 is variable, equal to the number of steps the agent takes to reach the goal, while N := 30 and K := 5000.

As can be seen, our algorithm more or less matches USM, except at the beginning, where its number of steps is much higher. This is because our model is trying to learn a recursive model and hence gets confused when there is little data. Unsurprisingly, the average number of states estimated by our method was 45.3, 67.5 and 90.5, compared to 204.5, 231.5 and 623.5 for USM. Similarly, our stochastic search required significantly longer off-line processing to learn (5 minutes, compared to tens of seconds for USM). So for these types of episodic tasks it might be better to first learn a model using USM and then compress it to a DMM (Shalizi & Klinkner, 2004). Regardless, the experiments give evidence that DMMs are viable for modeling hidden RL environment structure, and that our inference criterion (7) is correct.

Figure 4. Results for the maze domains (number of steps to goal vs. trial number); USM i is the number of steps to the goal state for maze i at the given trial, and similarly for DMM i.

The results for the three harvesting/counter domains are given in Fig. 5. Here the difference between the two methods becomes quite significant, with our method dominating USM completely. The difference in computation time was quite similar to that in the previous set of experiments. The average number of states inferred differed by about 400, with USM continually creating new states.

Figure 5. Results for the harvesting domains (total cumulative reward vs. number of steps ×100); USM i is the total reward accumulated by USM on harvesting problem i at the given step, and similarly for DMM i.

6. Conclusion

In this paper we introduced a novel model for learning hidden environment structure in RL problems. We showed how to use these models to approximate the environment and compute optimal policies. We further introduced a novel inference criterion and a heuristic algorithm to learn and plan with these models. We then performed experiments to show the potential of this approach.

The avenues for future research are many. The most important at this time seems to be developing a principled algorithm for learning these models that will apply across many different tasks. Another important generalization would be to consider extending DMMs to more structured, possibly relational, state spaces. It would also be interesting to extend the notion of DMMs to arbitrary measurable state spaces and see how the inference criterion works in that situation.

Acknowledgments

We would like to thank John Lloyd and Marcus Hutter for many useful comments and fruitful discussions. The research was supported by the Australian Research Council Discovery Project DP0877635 "Foundation and Architectures for Agent Systems".

References

Bertsekas, D. P. and Shreve, S. E. Stochastic Optimal Control: The Discrete-Time Case. Athena Scientific, paperback edition, 1996.

Chipman, H. A., George, E. I., and McCulloch, R. E. Bayesian CART model search. Journal of the American Statistical Association, 93:935–948, 1998.

Chrisman, L. Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In Proceedings of the Tenth National Conference on Artificial Intelligence, pp. 183–188, 1992.

Doshi-Velez, F. The infinite partially observable Markov decision process. In Proceedings of the 20th Neural Information Processing Systems Conference, 2009.

Holmes, M. P. and Isbell-Jr., C. L. Looping suffix tree-based inference of partially observable hidden state. In Proceedings of the 23rd International Conference on Machine Learning, 2006.

Hopcroft, J. E., Motwani, R., and Ullman, J. D. Introduction to Automata Theory, Languages and Computation. Addison Wesley, 3rd edition, 2006.

Hutter, M. Feature Markov decision processes. In Proceedings of the 2nd AGI Conference, 2009.

Kaelbling, L. P., Littman, M. L., and Cassandra, A. R. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99–134, 1998.

Littman, M., Sutton, R., and Singh, S. Predictive representations of state. In Proceedings of the 15th Neural Information Processing Systems Conference, 2002.

Madani, O., Hanks, S., and Condon, A. On the undecidability of probabilistic planning and related stochastic optimization problems. Artificial Intelligence, 147(1-2):5–34, 2003.

Mahmud, M. M. H. A deterministic probabilistic approach to modeling general RL environments. Technical report, Australian National University, 2010.

McCallum, A. Instance-based utile distinctions for reinforcement learning. In Proceedings of the 12th International Machine Learning Conference, 1995.

Ron, D., Singer, Y., and Tishby, N. Learning probabilistic automata with variable memory length. In Proceedings of the 7th Conference on Computational Learning Theory, 1994.

Ross, S., Pineau, J., Paquet, S., and Chaib-draa, B. Online planning algorithms for POMDPs. Journal of Artificial Intelligence Research, 32:663–704, 2008.

Sabbadin, R., Lang, J., and Ravaonjanahary, N. Purely epistemic Markov decision processes. In Proceedings of the 22nd National Conference on Artificial Intelligence, 2007.

Shalizi, C. and Klinkner, K. Blind construction of optimal nonlinear recursive predictors for discrete sequences. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, 2004.

Vidal, E., Thollard, F., Higuera, C., Casacuberta, F., and Carrasco, R. C. Probabilistic finite-state machines, part I. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(7):1013–1025, 2005.