PEGASUS: A policy search method for large MDPs and POMDPs

Andrew Y. Ng
Computer Science Division
UC Berkeley
Berkeley, CA 94720

Michael Jordan
Computer Science Division & Department of Statistics
UC Berkeley
Berkeley, CA 94720

Abstract

We propose a new approach to the problem of searching a space of policies for a Markov decision process (MDP) or a partially observable Markov decision process (POMDP), given a model. Our approach is based on the following observation: Any (PO)MDP can be transformed into an "equivalent" POMDP in which all state transitions (given the current state and action) are deterministic. This reduces the general problem of policy search to one in which we need only consider POMDPs with deterministic transitions. We give a natural way of estimating the value of all policies in these transformed POMDPs. Policy search is then simply performed by searching for a policy with high estimated value. We also establish conditions under which our value estimates will be good, recovering theoretical results similar to those of Kearns, Mansour and Ng [7], but with "sample complexity" bounds that have only a polynomial rather than exponential dependence on the horizon time. Our method applies to arbitrary POMDPs, including ones with infinite state and action spaces. We also present empirical results for our approach on a small discrete problem, and on a complex continuous state/continuous action problem involving learning to ride a bicycle.

1 Introduction

In recent years, there has been growing interest in algorithms for approximate planning in (exponentially or even infinitely) large Markov decision processes (MDPs) and partially observable MDPs (POMDPs). For such large domains, the value and $Q$-functions are sometimes complicated and difficult to approximate, even though there may be simple, compactly representable policies that perform very well. This observation has led to particular interest in direct policy search methods (e.g., [16, 8, 15, 1, 7]), which attempt to choose a good policy from some restricted class of policies.

Most approaches to policy search assume access to the POMDP either in the form of the ability to execute trajectories in the POMDP, or in the form of a black-box "generative model" that enables the learner to try actions from arbitrary states. In this paper, we will assume a stronger model than these: roughly, we assume we have an implementation of a generative model, with the difference that it has no internal random number generator, so that it has to ask us to provide it with random numbers whenever it needs them (such as if it needs a source of randomness to draw samples from the POMDP's transition distributions). This small change to a generative model results in what we will call a deterministic simulative model, and makes it surprisingly powerful.

We show how, given a deterministic simulative model, we can reduce the problem of policy search in an arbitrary POMDP to one in which all the transitions are deterministic; that is, a POMDP in which taking an action $a$ in a state $s$ will always deterministically result in transitioning to some fixed state $s'$. (The initial state in this POMDP may still be random.) This reduction is achieved by transforming the original POMDP into an "equivalent" one that has only deterministic transitions.

Our policy search algorithm then operates on these "simplified" transformed POMDPs. We call our method PEGASUS (for Policy Evaluation-of-Goodness And Search Using Scenarios, for reasons that will become clear). Our algorithm also bears some similarity to one used in Van Roy [12] for value determination in the setting of fully observable MDPs.

The remainder of this paper is structured as follows: Section 2 defines the notation that will be used in this paper, and formalizes the concepts of deterministic simulative models and of families of realizable dynamics. Section 3 then describes how we transform POMDPs into ones with only deterministic transitions, and gives our policy search algorithm. Section 4 goes on to establish conditions under which we may give guarantees on the performance of the algorithm, Section 5 describes our experimental results, and Section 6 closes with conclusions.
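To make this distinction concrete, the following minimal sketch (in Python; the function names and toy dynamics are illustrative assumptions only, not part of the method) contrasts a generative model, which hides its randomness, with a deterministic simulative model, which takes the random numbers as an explicit argument.

```python
import random

# Toy one-dimensional dynamics, used only for illustration: action a nudges
# the state up or down, with the direction decided by a random number p.
def _next_state(s, a, p):
    return s + a if p < 0.5 else s - a

# A black-box "generative model": it draws its own random numbers
# internally, so repeated calls with the same (s, a) can return
# different successor states.
def generative_model(s, a):
    return _next_state(s, a, random.random())

# A deterministic simulative model: the same simulator, except that the
# random number p must be supplied by the caller, so the output is a
# deterministic function of (s, a, p).
def deterministic_simulative_model(s, a, p):
    return _next_state(s, a, p)

if __name__ == "__main__":
    print(generative_model(0.0, 1.0))                      # varies from run to run
    print(deterministic_simulative_model(0.0, 1.0, 0.3))   # always 1.0
```

The only change is where the randomness lives: it moves from inside the simulator into its argument list, which is precisely the property exploited in the rest of the paper.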
2 Preliminaries

This section gives our notation, and introduces the concept of the set of realizable dynamics of a POMDP under a policy class.

A Markov decision process (MDP) is a tuple $(S, D, A, \{P_{sa}(\cdot)\}, \gamma, R)$ where: $S$ is a set of states; $D$ is the initial-state distribution, from which the start state $s_0$ is drawn; $A$ is a set of actions; $P_{sa}(\cdot)$ are the transition probabilities, with $P_{sa}(\cdot)$ giving the next-state distribution upon taking action $a$ in state $s$; $\gamma \in [0, 1)$ is the discount factor; and $R$ is the reward function, bounded by $R_{\max}$. For the sake of concreteness, we will assume, unless otherwise stated, that $S = [0, 1]^{d_S}$ is a $d_S$-dimensional hypercube. For simplicity, we also assume rewards are deterministic, and written $R(s)$ rather than $R(s, a)$, the extensions being trivial. Lastly, everything that needs to be measurable is assumed to be measurable.

A policy is any mapping $\pi : S \to A$. The value function $V^{\pi} : S \to \mathbb{R}$ of a policy $\pi$ is a map so that $V^{\pi}(s)$ gives the expected discounted sum of rewards for executing $\pi$ starting from state $s$. With some abuse of notation, we also define the value of a policy, with respect to the initial-state distribution $D$, according to

$V(\pi) = E_{s_0 \sim D}[V^{\pi}(s_0)]$    (1)

(where the subscript $s_0 \sim D$ indicates that the expectation is with respect to $s_0$ drawn according to $D$). When we are considering multiple MDPs and wish to make explicit that a value function is for a particular MDP $M$, we will also write $V_M^{\pi}$, $V_M(\pi)$, etc.

In the policy search setting, we have some fixed class $\Pi$ of policies, and desire to find a good policy $\pi \in \Pi$. More precisely, for a given MDP $M$ and policy class $\Pi$, define

$\mathrm{opt}(M, \Pi) = \sup_{\pi \in \Pi} V_M(\pi).$    (2)

Our goal is to find a policy $\hat{\pi} \in \Pi$ so that $V(\hat{\pi})$ is close to $\mathrm{opt}(M, \Pi)$.

Note that this framework also encompasses cases where our family $\Pi$ consists of policies that depend only on certain aspects of the state. In particular, in POMDPs, we can restrict attention to policies that depend only on the observables. This restriction results in a subclass of stochastic memory-free policies.[1] By introducing artificial "memory variables" into the process state, we can also define stochastic limited-memory policies [9] (which certainly permits some belief state tracking).

[1] Although we have not explicitly addressed stochastic policies ...

Since we are interested in the "planning" problem, we assume that we are given a model of the (PO)MDP. Much previous work has studied the case of (PO)MDPs specified via a generative model [7, 13], which is a stochastic function that takes as input any $(s, a)$ state-action pair, and outputs $s'$ drawn according to $P_{sa}(\cdot)$ (and the associated reward). In this paper, we assume a stronger model. We assume we have a deterministic function $g : S \times A \times [0, 1]^{d_P} \to S$, so that for any fixed $(s, a)$-pair, if $p$ is distributed Uniform$[0, 1]^{d_P}$, then $g(s, a, p)$ is distributed according to the transition distribution $P_{sa}(\cdot)$. In other words, to draw a sample from $P_{sa}(\cdot)$ for some fixed $s$ and $a$, we need only draw $p$ uniformly in $[0, 1]^{d_P}$, and then take $g(s, a, p)$ to be our sample. We will call such a model a deterministic simulative model for a (PO)MDP.

Since a deterministic simulative model allows us to simulate a generative model, it is clearly a stronger model. However, most computer implementations of generative models also provide deterministic simulative models. Consider a generative model that is implemented via a procedure that takes $s$ and $a$, makes at most $d_P$ calls to a random number generator, and then outputs $s'$ drawn according to $P_{sa}(\cdot)$. Then this procedure is already providing a deterministic simulative model. The only difference is that the deterministic simulative model has to make explicit (or "expose") its interface to the random number generator, via $p$. (A generative model implemented via a physical simulation of an MDP with "resets" to arbitrary states does not, however, readily lead to a deterministic simulative model.)
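To illustrate how the model just defined is used, the following sketch draws samples from $P_{sa}(\cdot)$ by feeding uniformly distributed random numbers to $g$, and forms a naive horizon-truncated Monte Carlo estimate of $V(\pi)$ in Eq. (1). It is intended only to make the definitions concrete; it is not the PEGASUS estimator developed later in the paper, and the helper names and the toy problem in the usage lines are our illustrative assumptions.

```python
import random

def sample_next_state(g, s, a, d_p):
    """Draw s' ~ P_sa(.) using a deterministic simulative model g:
    draw p uniformly from [0, 1]^d_p and evaluate g(s, a, p)."""
    p = [random.random() for _ in range(d_p)]
    return g(s, a, p)

def naive_value_estimate(g, R, pi, sample_s0, gamma, d_p,
                         n_rollouts=1000, horizon=50):
    """Naive Monte Carlo estimate of V(pi) = E_{s0 ~ D}[V^pi(s0)] (Eq. 1),
    using independently sampled, horizon-truncated rollouts.  (This only
    illustrates the definitions; it is not the PEGASUS estimator.)"""
    total = 0.0
    for _ in range(n_rollouts):
        s = sample_s0()                      # s0 drawn from D
        ret, discount = 0.0, 1.0
        for _ in range(horizon):
            ret += discount * R(s)           # deterministic reward R(s)
            s = sample_next_state(g, s, pi(s), d_p)
            discount *= gamma
        total += ret
    return total / n_rollouts

if __name__ == "__main__":
    # Toy problem (illustrative): 1-d state, reward peaks at the origin,
    # and d_p = 1 random number is consumed per transition.
    g = lambda s, a, p: s + 0.1 * a + 0.2 * (p[0] - 0.5)
    R = lambda s: -abs(s)
    pi = lambda s: -1.0 if s > 0 else 1.0    # push the state toward 0
    sample_s0 = lambda: random.uniform(-1.0, 1.0)
    print(naive_value_estimate(g, R, pi, sample_s0, gamma=0.95, d_p=1))
```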
Let us examine some simple examples of deterministic simulative models. Suppose that for a state-action pair $(s, a)$ and some states $s_1'$ and $s_2'$, we have $P_{sa}(s_1') = P_{sa}(s_2') = 0.5$. Then we may choose $d_P = 1$, so that $p$ is just a real number, and let $g(s, a, p) = s_1'$ if $p \leq 0.5$, and $g(s, a, p) = s_2'$ otherwise. As another example, suppose $S = \mathbb{R}$, and $P_{sa}(\cdot)$ is a normal distribution with cumulative distribution function $F_{sa}(\cdot)$. Again letting $d_P = 1$, we may choose $g(s, a, p)$ to be $F_{sa}^{-1}(p)$.

It is a fact of probability and measure theory that, given any transition distribution $P_{sa}(\cdot)$, such a deterministic simulative model $g$ can always be constructed for it (see, e.g., [4]). Indeed, some texts (e.g. [2]) routinely define POMDPs using essentially deterministic simulative models. However, there will often be many different choices of $g$ for representing a (PO)MDP, and it will be up to the user to decide which one is most "natural" to implement. As we will see later, the particular choice of $g$ that the user makes can indeed impact the performance of our algorithm, and "simpler" (in a sense to be formalized) implementations are generally preferred.

To close this section, we introduce a concept that will be useful later, which captures the family of dynamics that a (PO)MDP and policy class can exhibit.
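As a concrete companion to the two examples above, the following sketch writes out both constructions of $g$: the two-successor case via thresholding a single uniform random number, and the Gaussian case via inverse-transform sampling with the inverse CDF. The helper names and parameter values are illustrative assumptions.

```python
from statistics import NormalDist

def g_two_outcome(s, a, p, s1, s2, prob_s1=0.5):
    """Deterministic simulative model for a transition that reaches s1 with
    probability prob_s1 and s2 otherwise: threshold a single uniform p in [0, 1]."""
    return s1 if p <= prob_s1 else s2

def g_gaussian(s, a, p, mean, std):
    """Deterministic simulative model for a Gaussian transition distribution
    P_sa = N(mean, std^2), via inverse-transform sampling: apply the inverse
    CDF to a single uniform p (p must lie strictly in (0, 1))."""
    return NormalDist(mean, std).inv_cdf(p)

if __name__ == "__main__":
    # With p = 0.3 <= 0.5 the first successor is chosen deterministically.
    print(g_two_outcome(s=0, a=0, p=0.3, s1="s1", s2="s2"))      # -> "s1"
    # p = 0.5 maps to the mean of the Gaussian; p near 1 maps to its upper tail.
    print(g_gaussian(s=0.0, a=0.0, p=0.5, mean=1.0, std=2.0))    # -> 1.0
    print(g_gaussian(s=0.0, a=0.0, p=0.975, mean=0.0, std=1.0))  # ~ 1.96
```

The same pattern extends to $d_P > 1$: the model simply consumes a vector of uniform random numbers per transition.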