PEGASUS: A policy search method for large MDPs and POMDPs

Andrew Y. Ng
Computer Science Division
UC Berkeley
Berkeley, CA 94720

Michael Jordan
Computer Science Division & Department of Statistics
UC Berkeley
Berkeley, CA 94720

Abstract

We propose a new approach to the problem of searching a space of policies for a Markov decision process (MDP) or a partially observable Markov decision process (POMDP), given a model. Our approach is based on the following observation: Any (PO)MDP can be transformed into an "equivalent" POMDP in which all state transitions (given the current state and action) are deterministic. This reduces the general problem of policy search to one in which we need only consider POMDPs with deterministic transitions. We give a natural way of estimating the value of all policies in these transformed POMDPs. Policy search is then simply performed by searching for a policy with high estimated value. We also establish conditions under which our value estimates will be good, recovering theoretical results similar to those of Kearns, Mansour and Ng [7], but with "sample complexity" bounds that have only a polynomial rather than exponential dependence on the horizon time. Our method applies to arbitrary POMDPs, including ones with infinite state and action spaces. We also present empirical results for our approach on a small discrete problem, and on a complex continuous-state/continuous-action problem involving learning to ride a bicycle.

1 Introduction

In recent years, there has been growing interest in algorithms for approximate planning in (exponentially or even infinitely) large Markov decision processes (MDPs) and partially observable MDPs (POMDPs). For such large domains, the value and $Q$-functions are sometimes complicated and difficult to approximate, even though there may be simple, compactly representable policies that perform very well. This observation has led to particular interest in direct policy search methods (e.g., [16, 8, 15, 1, 7]), which attempt to choose a good policy from some restricted class of policies.

Most approaches to policy search assume access to the POMDP either in the form of the ability to execute trajectories in the POMDP, or in the form of a black-box "generative model" that enables the learner to try actions from arbitrary states. In this paper, we will assume a stronger model than these: roughly, we assume we have an implementation of a generative model, with the difference that it has no internal random number generator, so that it has to ask us to provide it with random numbers whenever it needs them (such as if it needs a source of randomness to draw samples from the POMDP's transition distributions). This small change to a generative model results in what we will call a deterministic simulative model, and makes it surprisingly powerful.

We show how, given a deterministic simulative model, we can reduce the problem of policy search in an arbitrary POMDP to one in which all the transitions are deterministic; that is, a POMDP in which taking an action $a$ in a state $s$ will always deterministically result in transitioning to some fixed state $s'$. (The initial state in this POMDP may still be random.) This reduction is achieved by transforming the original POMDP into an "equivalent" one that has only deterministic transitions. Our policy search algorithm then operates on these "simplified" transformed POMDPs. We call our method PEGASUS (for Policy Evaluation-of-Goodness And Search Using Scenarios, for reasons that will become clear). Our algorithm also bears some similarity to one used in Van Roy [12] for value determination in the setting of fully observable MDPs.

The remainder of this paper is structured as follows: Section 2 defines the notation that will be used in this paper, and formalizes the concepts of deterministic simulative models and of families of realizable dynamics. Section 3 then describes how we transform POMDPs into ones with only deterministic transitions, and gives our policy search algorithm. Section 4 goes on to establish conditions under which we may give guarantees on the performance of the algorithm, Section 5 describes our experimental results, and Section 6 closes with conclusions.

2 Preliminaries

This section gives our notation, and introduces the concept of the set of realizable dynamics of a POMDP under a policy class.

A Markov decision process (MDP) is a tuple $(S, D, A, \{P_{sa}\}, \gamma, R)$, where: $S$ is a set of states; $D$ is the initial-state distribution, from which the start-state $s_0$ is drawn; $A$ is a set of actions; the $P_{sa}$ are the transition probabilities, with $P_{sa}$ giving the next-state distribution upon taking action $a$ in state $s$; $\gamma \in [0,1)$ is the discount factor; and $R$ is the reward function, bounded by $R_{\max}$. For the sake of concreteness, we will assume, unless otherwise stated, that $S = [0,1]^{d_S}$ is a $d_S$-dimensional hypercube. For simplicity, we also assume rewards are deterministic, and written $R(s)$ rather than $R(s,a)$, the extensions being trivial. Lastly, everything that needs to be measurable is assumed to be measurable.

A policy is any mapping $\pi : S \to A$. The value function of a policy $\pi$ is a map $V^\pi : S \to \mathbb{R}$, so that $V^\pi(s)$ gives the expected discounted sum of rewards for executing $\pi$ starting from state $s$. With some abuse of notation, we also define the value of a policy, with respect to the initial-state distribution $D$, according to

$$V(\pi) = E_{s_0 \sim D}\left[ V^\pi(s_0) \right] \quad (1)$$

(where the subscript $s_0 \sim D$ indicates that the expectation is with respect to $s_0$ drawn according to $D$). When we are considering multiple MDPs and wish to make explicit that a value function is for a particular MDP $M$, we will also write $V_M^\pi(s)$, $V_M(\pi)$, etc.

In the policy search setting, we have some fixed class $\Pi$ of policies, and desire to find a good policy $\pi \in \Pi$. More precisely, for a given MDP $M$ and policy class $\Pi$, define

$$\mathrm{opt}(M, \Pi) = \sup_{\pi \in \Pi} V_M(\pi) \quad (2)$$

Our goal is to find a policy $\hat\pi \in \Pi$ so that $V(\hat\pi)$ is close to $\mathrm{opt}(M, \Pi)$.

Note that this framework also encompasses cases where our family $\Pi$ consists of policies that depend only on certain aspects of the state. In particular, in POMDPs, we can restrict attention to policies that depend only on the observables. This restriction results in a subclass of stochastic memory-free policies.[1] By introducing artificial "memory variables" into the process state, we can also define stochastic limited-memory policies [9] (which certainly permits some belief state tracking).

[1] Although we have not explicitly addressed stochastic policies so far, they are a straightforward generalization (e.g., using the transformation to deterministic policies given in [7]).

Since we are interested in the "planning" problem, we assume that we are given a model of the (PO)MDP. Much previous work has studied the case of (PO)MDPs specified via a generative model [7, 13], which is a stochastic function that takes as input any $(s, a)$ state-action pair, and outputs a successor state $s'$ drawn according to $P_{sa}$ (and the associated reward). In this paper, we assume a stronger model. We assume we have a deterministic function $g : S \times A \times [0,1]^{d_P} \to S$, so that for any fixed $(s,a)$-pair, if $p$ is distributed Uniform$([0,1]^{d_P})$, then $g(s, a, p)$ is distributed according to the transition distribution $P_{sa}$. In other words, to draw a sample from $P_{sa}$ for some fixed $s$ and $a$, we need only draw $p$ uniformly in $[0,1]^{d_P}$, and then take $g(s, a, p)$ to be our sample. We will call such a model a deterministic simulative model for a (PO)MDP.

Since a deterministic simulative model allows us to simulate a generative model, it is clearly a stronger model. However, most computer implementations of generative models also provide deterministic simulative models. Consider a generative model that is implemented via a procedure that takes $s$ and $a$, makes at most $d_P$ calls to a random number generator, and then outputs $s'$ drawn according to $P_{sa}$. Then this procedure is already providing a deterministic simulative model. The only difference is that the deterministic simulative model has to make explicit (or "expose") its interface to the random number generator, via $p$. (A generative model implemented via a physical simulation of an MDP with "resets" to arbitrary states does not, however, readily lead to a deterministic simulative model.)

Let us examine some simple examples of deterministic simulative models. Suppose that for a state-action pair $(s,a)$, the transition distribution puts probability $q$ on a state $s'_1$ and probability $1-q$ on a state $s'_2$. Then we may choose $d_P = 1$, so that $p$ is just a real number, and let $g(s, a, p) = s'_1$ if $p \le q$, and $g(s, a, p) = s'_2$ otherwise. As another example, suppose $S = \mathbb{R}$ and $P_{sa}$ is a normal distribution with cumulative distribution function $\Phi_{sa}$. Again letting $d_P = 1$, we may choose $g(s, a, p)$ to be $\Phi_{sa}^{-1}(p)$.

It is a fact of probability and measure theory that, given any transition distribution $P_{sa}$, such a deterministic simulative model can always be constructed for it (see, e.g., [4]). Indeed, some texts (e.g., [2]) routinely define POMDPs using essentially deterministic simulative models. However, there will often be many different choices of $g$ for representing a (PO)MDP, and it will be up to the user to decide which one is most "natural" to implement. As we will see later, the particular choice of $g$ that the user makes can indeed impact the performance of our algorithm, and "simpler" (in a sense to be formalized) implementations are generally preferred.

To close this section, we introduce a concept that will be useful later, which captures the family of dynamics that a (PO)MDP and policy class can exhibit. Assume a deterministic simulative model $g$, and fix a policy $\pi \in \Pi$. If we are executing $\pi$ from some state $s$, the successor state is determined by $f_\pi(s, p) = g(s, \pi(s), p)$, which is a function of $s$ and $p$. Varying $\pi$ over $\Pi$, we get a whole family of functions

$$F = \{ f_\pi : f_\pi(s, p) = g(s, \pi(s), p),\ \pi \in \Pi \}$$

mapping from $S \times [0,1]^{d_P}$ into successor states in $S$. This set of functions should be thought of as the family of dynamics realizable by the POMDP and $\Pi$, though since its definition does depend on the particular deterministic simulative model $g$ that we have chosen, this is "as expressed with respect to $g$." For each $f_\pi$, also let $f_{\pi,i}$ be the $i$-th coordinate function (so that $f_{\pi,i}(s, p)$ is the $i$-th coordinate of $f_\pi(s, p)$), and let $F_i$ be the corresponding families of coordinate functions mapping from $S \times [0,1]^{d_P}$ into $[0,1]$. Thus, $F_i$ captures all the ways that coordinate $i$ of the state can evolve.
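As a concrete illustration, the two examples above can be sketched in code. This is our sketch, not the paper's implementation: the branch probability $q = 0.3$ and the Gaussian parameters are arbitrary choices, and the successor-state labels are placeholders.

```python
# Sketch of two deterministic simulative models g(s, a, p), where the
# single uniform random number p is supplied explicitly by the caller.

from statistics import NormalDist

def g_discrete(s, a, p, q=0.3):
    """Two-outcome transition: probability q of s1', 1-q of s2'.
    (q = 0.3 and the state labels are illustrative assumptions.)"""
    s1_next, s2_next = "s1'", "s2'"
    return s1_next if p <= q else s2_next

def g_gaussian(s, a, p, mu=0.0, sigma=1.0):
    """Continuous transition: next state ~ Normal(mu, sigma), produced by
    applying the inverse CDF (quantile function) to the uniform p."""
    return NormalDist(mu, sigma).inv_cdf(p)
```

Note that calling either function twice with the same $p$ yields the same successor state; this repeatability of the exposed randomness is exactly the property the policy search method below exploits.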

We are now ready to describe our policy search method.

3 Policy search method

In this section, we show how we transform a (PO)MDP into an "equivalent" one that has only deterministic transitions. This then leads to natural estimates $\hat{V}(\pi)$ of the policies' values $V(\pi)$. Finally, we may search over policies $\pi$ to optimize $\hat{V}(\pi)$, to find a (hopefully) good policy.

3.1 Transformation of (PO)MDPs

Given a (PO)MDP $M = (S, D, A, \{P_{sa}\}, \gamma, R)$ and a policy class $\Pi$, we describe how, using a deterministic simulative model $g$ for $M$, we construct our transformed POMDP $M' = (S', D', A, \{P'_{sa}\}, \gamma, R')$ and a corresponding class of policies $\Pi'$, so that $M'$ has only deterministic transitions (though its initial state may still be random). To simplify the exposition, we assume $d_P = 1$, so that the $p$ terms are just real numbers.

$M'$ is constructed as follows. The action space and discount factor for $M'$ are the same as in $M$. The state space for $M'$ is $S' = S \times [0,1]^\infty$. In other words, a typical state in $M'$ can be written as a vector $(s, p_1, p_2, \ldots)$: this consists of a state $s$ from the original state space $S$, followed by an infinite sequence of real numbers in $[0,1]$.

The rest of the transformation is straightforward. Upon taking action $a$ in state $(s, p_1, p_2, \ldots)$ in $M'$, we deterministically transition to the state $(s', p_2, p_3, \ldots)$, where $s' = g(s, a, p_1)$. In other words, the $s$-portion of the state (which should be thought of as the "actual" state) changes to $s'$, and one number in the infinite sequence is used up to generate $s'$ from the correct distribution. By the definition of the deterministic simulative model $g$, we see that so long as $p_1 \sim \mathrm{Uniform}[0,1]$, the "next-state" distribution of $s'$ is the same as if we had taken action $a$ in state $s$ (with the randomization over $p_1$).

Finally, we choose $D'$, the initial-state distribution over $S' = S \times [0,1]^\infty$, so that for $(s, p_1, p_2, \ldots)$ drawn according to $D'$, we have $s \sim D$, and the $p_i$'s distributed i.i.d. Uniform$[0,1]$. For each policy $\pi \in \Pi$, also let there be a corresponding $\pi' \in \Pi'$, given by $\pi'((s, p_1, p_2, \ldots)) = \pi(s)$, and let the reward be given by $R'((s, p_1, p_2, \ldots)) = R(s)$.

If one observes only the $s$-portion (but not the $p$'s) of a sequence of states generated in the POMDP $M'$ using policy $\pi'$, one obtains a sequence that is drawn from the same distribution as would have been generated from the original (PO)MDP $M$ under the corresponding policy $\pi \in \Pi$. It follows that, for corresponding policies $\pi \in \Pi$ and $\pi' \in \Pi'$, we have $V_M(\pi) = V_{M'}(\pi')$. This also implies that the best possible expected returns in both (PO)MDPs are the same: $\mathrm{opt}(M, \Pi) = \mathrm{opt}(M', \Pi')$.

To summarize, we have shown how, using a deterministic simulative model, we can transform any POMDP $M$ and policy class $\Pi$ into an "equivalent" POMDP $M'$ and policy class $\Pi'$, so that the transitions in $M'$ are deterministic; i.e., given a state $s' \in S'$ and an action $a \in A$, the next state in $M'$ is exactly determined. Since corresponding policies in $\Pi$ and $\Pi'$ have the same values, if we can find a policy $\pi' \in \Pi'$ that does well in $M'$ starting from $D'$, then the corresponding policy $\pi \in \Pi$ will also do well for the original POMDP $M$ starting from $D$. Hence, the problem of policy search in general POMDPs is reduced to the problem of policy search in POMDPs with deterministic transition dynamics. In the next section, we show how we can exploit this fact to derive a simple and natural policy search method.

3.2 PEGASUS: A method for policy search

As discussed, it suffices for policy search to find a good policy $\pi' \in \Pi'$ for the transformed POMDP, since the corresponding policy $\pi \in \Pi$ will be just as good. To do this, we first construct an approximation $\hat{V}(\cdot)$ to $V(\cdot)$, and then search over policies $\pi'$ to optimize $\hat{V}(\pi')$ (as a proxy for optimizing the hard-to-compute $V(\pi')$), and thus find a (hopefully) good policy.

Recall that $V(\pi')$ is given by

$$V(\pi') = E_{s_0 \sim D'}\left[ V^{\pi'}(s_0) \right] \quad (3)$$

where the expectation is over the initial state $s_0 \in S'$ drawn according to $D'$. The first step in the approximation is to replace the expectation over the distribution $D'$ with a finite sample of states. More precisely, we first draw a sample $s_0^{(1)}, \ldots, s_0^{(m)}$ of $m$ initial states according to $D'$. These states, also called "scenarios" (a term from the stochastic optimization literature; see, e.g., [3]), define an approximation to $V(\pi')$:

$$\hat{V}(\pi') = \frac{1}{m} \sum_{i=1}^m V^{\pi'}(s_0^{(i)}) \quad (4)$$
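In implementation terms, the estimate (4) amounts to rolling each scenario forward deterministically and averaging the truncated discounted returns. The sketch below is ours, not the paper's code; the interfaces (a state sampler `draw_s0`, a model `g`, a reward `R`) are assumptions, and the infinite discounted sum is truncated at a finite horizon `H` in the standard way.

```python
import random

def draw_scenarios(m, H, draw_s0, seed=0):
    """Draw m scenarios: an initial state s0 ~ D plus the H uniform
    numbers p_1..p_H that the transformed POMDP will consume."""
    rng = random.Random(seed)
    return [(draw_s0(rng), [rng.random() for _ in range(H)]) for _ in range(m)]

def value_estimate(policy, scenarios, g, R, gamma):
    """PEGASUS-style estimate: average truncated discounted return over
    the fixed scenarios. Because the randomization is fixed in advance,
    re-evaluating the same policy always returns the same number, so the
    estimate is a deterministic function of the policy."""
    total = 0.0
    for s0, ps in scenarios:
        s, ret, disc = s0, 0.0, 1.0
        for p in ps:                  # one p is "used up" per step
            ret += disc * R(s)
            s = g(s, policy(s), p)    # deterministic transition in M'
            disc *= gamma
        total += ret
    return total / len(scenarios)
```

The same `scenarios` list is reused when comparing different policies, which is the crucial difference from ordinary Monte Carlo evaluation with fresh randomness per rollout.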

Since the transitions in $M'$ are deterministic, for a given state $s_0^{(i)}$ and a policy $\pi' \in \Pi'$, the sequence of states that will be visited upon executing $\pi'$ from $s_0^{(i)}$ is exactly determined; hence the sum of discounted rewards for executing $\pi'$ from $s_0^{(i)}$ is also exactly determined. Thus, to calculate one of the terms $V^{\pi'}(s_0^{(i)})$ in the summation in Equation (4), corresponding to scenario $s_0^{(i)}$, we need only use our deterministic simulative model to find the sequence of states visited by executing $\pi'$ from $s_0^{(i)}$, and sum up the resulting discounted rewards. Naturally, this would be an infinite sum, so the second (and standard) part of the approximation is to truncate this sum after some number $H$ of steps, where $H$ is called the horizon time. Here, we choose $H$ to be the $\epsilon$-horizon time $H_\epsilon = \lceil \log_\gamma (\epsilon(1-\gamma)/R_{\max}) \rceil$, so that (because of discounting) the truncation introduces at most $\epsilon$ error into the approximation.

To summarize, given $m$ scenarios $s_0^{(1)}, \ldots, s_0^{(m)}$, our approximation to $V(\pi')$ is the deterministic function

$$\hat{V}(\pi') = \frac{1}{m} \sum_{i=1}^m \left[ R(s_0^{(i)}) + \gamma R(s_1^{(i)}) + \cdots + \gamma^{H_\epsilon} R(s_{H_\epsilon}^{(i)}) \right]$$

where $(s_0^{(i)}, s_1^{(i)}, \ldots, s_{H_\epsilon}^{(i)})$ is the sequence of states deterministically visited by $\pi'$ starting from $s_0^{(i)}$. Given $m$ scenarios, this defines an approximation $\hat{V}(\pi')$ to $V(\pi')$ for all policies $\pi' \in \Pi'$.

The final implementational detail is that, since the scenarios $s_0^{(i)} \in S \times [0,1]^\infty$ are infinite-dimensional vectors, we have no way of representing them (and their successor states) explicitly. But because we will be simulating only $H_\epsilon$ steps, we need only represent the portion $(s^{(i)}, p_1^{(i)}, \ldots, p_{H_\epsilon}^{(i)})$ of each scenario, and so we will do just that. Viewed in the space of the original, untransformed POMDP, evaluating a policy this way is therefore also akin to generating Monte Carlo trajectories and taking their empirical average return, but with the crucial difference that all the randomization is "fixed" in advance and "reused" for evaluating different $\pi$.

Having used $m$ scenarios to define $\hat{V}(\pi')$ for all $\pi'$, we may search over policies $\pi'$ to optimize $\hat{V}(\pi')$. We call this policy search method PEGASUS: Policy Evaluation-of-Goodness And Search Using Scenarios. Since $\hat{V}(\pi')$ is a deterministic function, the search procedure only needs to optimize a deterministic function, and any number of standard optimization methods may be used. In the case that the action space is continuous and $\Pi = \{\pi_\theta\}$ is a smoothly parameterized family of policies (so that $\pi_\theta(s)$ is differentiable in $\theta$ for all $s$), then if all the relevant quantities are differentiable, it is also possible to find the derivatives $\partial \hat{V}(\pi_\theta)/\partial \theta$, and gradient ascent methods can be used to optimize $\hat{V}(\pi_\theta)$. One common barrier to doing this is that $R$ is often discontinuous, being (say) 1 within a goal region and 0 elsewhere. One approach to dealing with this problem is to smooth $R$ out, possibly in combination with "continuation" methods that gradually unsmooth it again. An alternative approach that may be useful in the setting of continuous dynamical systems is to alter the reward function to use a continuous-time model of discounting.[2] Assuming that the time at which the agent enters the goal region is differentiable, $\hat{V}(\pi_\theta)$ is then again differentiable.

[2] More precisely, if the agent enters the goal region on some time step, then rather than giving it a reward of 1, we determine what fraction $\delta$ of that time step (measured in continuous time) the agent had taken to enter the goal region, and then give it reward $\gamma^\delta$ instead. Assuming $\delta$ is differentiable in the system's dynamics, $\gamma^\delta$ and hence $\hat{V}$ are then also differentiable (other than on a usually measure-0 set, arising for example from truncation at $H_\epsilon$ steps).

4 Main theoretical results

PEGASUS samples a number of scenarios from $D'$, and uses them to form an approximation $\hat{V}(\pi)$ to $V(\pi)$. If $\hat{V}$ is a uniformly good approximation to $V$, then we can guarantee that optimizing $\hat{V}$ will result in a policy with value close to $\mathrm{opt}(M, \Pi)$. This section establishes conditions under which this occurs.

4.1 The case of finite action spaces

We begin by considering the case of two actions, $A = \{a_1, a_2\}$. Studying policy search in a similar setting, Kearns, Mansour and Ng [7] established conditions under which their algorithm gives uniformly good estimates of the values of policies. A key to that result was that uniform convergence can be established so long as the policy class $\Pi$ has low "complexity." This is analogous to the setting of supervised learning, where a learning algorithm that uses a hypothesis class of low complexity (such as in the sense of low VC dimension) will also enjoy uniform convergence of its error estimates to their means.

In our setting, since $\Pi$ is just a class of functions mapping from $S$ into $A = \{a_1, a_2\}$, it is just a set of boolean functions. Hence $\mathrm{VC}(\Pi)$, its Vapnik-Chervonenkis dimension [14], is well defined. That is, we say $\Pi$ shatters a set of $m$ states if it can realize each of the $2^m$ possible action combinations on them, and $\mathrm{VC}(\Pi)$ is just the size of the largest set shattered by $\Pi$. The result of Kearns et al. then suffices to give the following theorem.

Theorem 1 Let a POMDP with two actions $A = \{a_1, a_2\}$ be given, and let $\Pi$ be a class of strategies for this POMDP, with Vapnik-Chervonenkis dimension $d = \mathrm{VC}(\Pi)$. Also let any $\epsilon, \delta > 0$ be fixed, and let $\hat{V}$ be the policy-value estimates determined by PEGASUS using $m$ scenarios and a horizon time of $H_\epsilon$. If

$$m = O\!\left( \left( \frac{R_{\max}}{\epsilon(1-\gamma)} \right)^{2} \left( H_\epsilon\, d\, \log\frac{1}{\epsilon(1-\gamma)} + \log\frac{1}{\delta} \right) \right) \quad (5)$$

then with probability at least $1-\delta$, $\hat{V}$ will be uniformly close to $V$:

$$|\hat{V}(\pi) - V(\pi)| \le 2\epsilon \quad \text{for all } \pi \in \Pi \quad (6)$$

[3] The algorithm of Kearns, Mansour and Ng uses a "trajectory tree" method to find the estimates $\hat{V}$; since each trajectory tree is of size $O(2^{H_\epsilon})$, they were very expensive to build. Each scenario in PEGASUS can be viewed as a compact representation of a trajectory tree (with the technical difference that different subtrees are not constructed independently), and the proof given in Kearns et al. then applies without modification to give Theorem 1.

Using the transformation given in Kearns et al., the case of a finite action space with $|A| > 2$ also gives rise to essentially the same uniform-convergence result, so long as $\Pi$ has low "complexity."

The bound given in the theorem has no dependence on the size of the state space or on the "complexity" of the POMDP's transitions and rewards. Thus, so long as $\Pi$ has low VC dimension, uniform convergence will occur, independently of how complicated the POMDP is. As in Kearns et al., this theorem therefore recovers the best analogous results in supervised learning, in which uniform convergence occurs so long as the hypothesis class has low VC dimension, regardless of the size or "complexity" of the underlying space and target function.

4.2 The case of infinite action spaces: "Simple" is insufficient for uniform convergence

We now consider the case of infinite action spaces. Whereas, in the two-action case, $\Pi$ being "simple" was sufficient to ensure uniform convergence, this is not the case in POMDPs with infinite action spaces.

Suppose $A$ is a (countably or uncountably) infinite set of actions. A "simple" class of policies would be $\Pi = \{ \pi_a : \pi_a(s) \equiv a,\ a \in A \}$, the set of all policies that always choose the same action, regardless of the state. Intuitively, this is the simplest policy class that actually uses an infinite action space; moreover, any reasonable notion of complexity of policy classes should assign it a low "dimension." If it were true that simple policy classes imply uniform convergence, then certainly this $\Pi$ should always enjoy uniform convergence. Unfortunately, this is not the case, as we now show.

Theorem 2 Let $A$ be an infinite set of actions, and let $\Pi = \{ \pi_a : \pi_a(s) \equiv a,\ a \in A \}$ be the corresponding set of all "constant valued" policies. Then there exists a finite-state MDP with action space $A$, and a deterministic simulative model for it, so that PEGASUS' estimates using the deterministic simulative model do not uniformly converge to their means. That is, there is an $\epsilon > 0$ so that, for estimates $\hat{V}$ derived using any finite number $m$ of scenarios and any finite horizon time, there is a policy $\pi \in \Pi$ so that

$$|\hat{V}(\pi) - V(\pi)| > \epsilon. \quad (7)$$

The proof of this theorem, which is not difficult, is in Appendix A. This result shows that simplicity of $\Pi$ is not sufficient for uniform convergence in the case of infinite action spaces. However, the counterexample used in the proof of Theorem 2 has a very complex $g$ despite the MDP being quite simple. Indeed, a different choice for $g$ would have made uniform convergence occur. Thus, it is natural to hypothesize that assumptions on the "complexity" of $g$ are also needed to ensure uniform convergence. As we will shortly see, this intuition is roughly correct. Since actions affect transitions only through $g$, the crucial quantity is actually the composition of policies and the deterministic simulative model; in other words, the class $F$ of the dynamics realizable in the POMDP and policy class, using a particular deterministic simulative model. In the next section, we show how assumptions on the complexity of $F$ lead to uniform convergence bounds of the type we desire.

4.3 Uniform convergence in the case of infinite action spaces

For the remainder of this section, assume $S = [0,1]^{d_S}$. Then $F$ is a class of functions mapping from $[0,1]^{d_S} \times [0,1]^{d_P}$ into $[0,1]^{d_S}$, and so a simple way to capture its "complexity" is to capture the complexity of its families of coordinate functions $F_i$, $i = 1, \ldots, d_S$. Each $F_i$ is a family of functions mapping from $[0,1]^{d_S} \times [0,1]^{d_P}$ into $[0,1]$, the $i$-th coordinate of the state vector. Thus, $F_i$ is just a family of real-valued functions: the family of $i$-th coordinate dynamics that $\Pi$ can realize, with respect to $g$.

The complexity of a class of boolean functions is measured by its VC dimension, defined to be the size of the largest set shattered by the class. To capture the "complexity" of real-valued families of functions such as $F_i$, we need a generalization of the VC dimension. The pseudo-dimension, due to Pollard [10], is defined as follows:

Definition (Pollard, 1990). Let $F$ be a family of functions mapping from a space $X$ into $\mathbb{R}$. Let a sequence of points $x_1, \ldots, x_d \in X$ be given. We say $F$ shatters $x_1, \ldots, x_d$ if there exists a sequence of real numbers $t_1, \ldots, t_d$ such that the subset of $\mathbb{R}^d$ given by $\{ (f(x_1) - t_1, \ldots, f(x_d) - t_d) : f \in F \}$ intersects all $2^d$ orthants of $\mathbb{R}^d$ (equivalently, if for any sequence of $d$ bits $b_1, \ldots, b_d \in \{0,1\}$, there is a function $f \in F$ such that $f(x_i) \ge t_i \Leftrightarrow b_i = 1$, for all $i = 1, \ldots, d$). The pseudo-dimension of $F$, denoted $\dim_p(F)$, is the size of the largest set that $F$ shatters, or infinite if $F$ can shatter arbitrarily large sets.

The pseudo-dimension generalizes the VC dimension, and coincides with it in the case that $F$ maps into $\{0,1\}$. We will use it to capture the "complexity" of the classes $F_i$ of the POMDP's realizable dynamics. We also remind readers of the definition of Lipschitz continuity.

Definition. A function $f : \mathbb{R}^n \to \mathbb{R}^m$ is Lipschitz continuous (with respect to the Euclidean norm on its range and domain) if there exists a constant $B$ such that for all $x, y \in \mathbb{R}^n$, $\|f(x) - f(y)\| \le B \|x - y\|$. Here, $B$ is called a Lipschitz bound. A family of functions mapping from $\mathbb{R}^n$ into $\mathbb{R}^m$ is uniformly Lipschitz continuous with Lipschitz bound $B$ if every function in the family is Lipschitz continuous with Lipschitz bound $B$.

We now state our main theorem, with a corollary regarding when optimizing $\hat{V}$ will result in a provably good policy.

¦Ì2 ( *5¨-,6›7 Theorem 3 Let a POMDP with state space 9 , and a possibly infinite action space be given. Also let 5 Experiments

a policy class X , and a deterministic simulative model

¦ l¸ Ÿl‹( *5¨-,-687 ACÀ¦

q In this section, we report the results from two experiments. j¸? for the POMDP be given. Let be the corresponding family of realizable dynamics in the The first, run to examine the behavior of PEGASUS para-

metrically, involved a simple gridworld POMDP. The sec- ‰

POMDP, and the resulting families of coordinate func-

¥  2^,{¨-d%d-d¨ û

~ ond studied a complex continuous state/continuous action

—°ž

Œ‰ : : Š t tions. Suppose that for each ; , problem involving riding a bicycle. and that each family Œ‰ is uniformly Lipschitz continuous

with Lipschitz bound at most  , and that the reward func- Figure 1a shows the finite state and action POMDP used

.©01 .©0 1

" ¦HAC ( ·Œ" ¨" 6

tion ? is also Lipschitz continuous in our first experiment. In this problem, the agent starts

¨ Ê‹Ë *

·#, ²

with Lipschitz bound at most  . Finally, let be in the lower-left corner, and receives a reinforcement f

given, and let E be the policy-value estimates determined per step until it reaches the absorbing state in the upper- ±=³

by PEGASUS using m scenarios and a horizon time of H. If m is polynomially large in 1/ε, log(1/δ), H, R_max, and the complexity parameters of the transformed POMDP (the Lipschitz bounds and pseudo-dimensions involved), then with probability at least 1−δ, V̂ will be uniformly close to V:

    |V̂(π) − V(π)| ≤ ε  for all π ∈ Π.                                (8)

Corollary 4 Under the conditions of Theorem 1 or 3, let m be chosen as in the Theorem. Then with probability at least 1−δ, the policy π̂ chosen by optimizing the value estimates, given by π̂ = arg max_{π∈Π} V̂(π), will be near-optimal:

    V(π̂) ≥ sup_{π∈Π} V(π) − 2ε.                                      (9)

Remark. The (Lipschitz) continuity assumptions give a sufficient but not necessary set of conditions for the theorem, and other sets of sufficient conditions can be envisaged. For example, if we assume that the distribution on states induced by any policy at each time step has a bounded density, then we can show uniform convergence for a large class of ("reasonable") discontinuous reward functions, such as R(s) = 1 if s > 0 and 0 otherwise.²

Figure 1a shows the 5x5 gridworld used in our first experiment; the goal is the square in the right corner. The eight possible observations, also shown in the figure, indicate whether each of the eight squares adjoining the current position contains a wall. The policy class is small, consisting of all functions mapping from the eight possible observations to the four actions corresponding to trying to move in each of the compass directions. Actions are noisy, and result in moving in a random direction 20% of the time. Since the policy class is small enough to exhaustively enumerate, our optimization algorithm for searching over policies was simply exhaustive search, trying all 4^8 policies on the scenarios and picking the best one. Our experiments were done with γ = 0.99 and a horizon time of 100, and all results reported are averages over independent runs. The deterministic simulative model was

    g(s, a, p) = s∘N   if 0.00 ≤ p < 0.05
                 s∘E   if 0.05 ≤ p < 0.10
                 s∘S   if 0.10 ≤ p < 0.15
                 s∘W   if 0.15 ≤ p < 0.20
                 s∘a   otherwise,

where s∘a denotes the result of moving one step from s in the direction indicated by a, and is s if this move would result in running into a wall.
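For illustration, this model can be written directly as code (the grid layout, coordinate convention, and wall handling below are our own illustrative assumptions, not the paper's exact implementation):

```python
# Sketch of a deterministic simulative model for the 5x5 gridworld.
# The caller supplies the random number p; the function itself is deterministic.

MOVES = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}

def step(s, direction, size=5):
    """Move one step from s in the given direction; stay put at walls."""
    x, y = s
    dx, dy = MOVES[direction]
    nx, ny = x + dx, y + dy
    if 0 <= nx < size and 0 <= ny < size:
        return (nx, ny)
    return (x, y)  # bumping into a wall leaves the state unchanged

def g(s, a, p):
    """Deterministic simulative model: 20% noise split over four directions."""
    if p < 0.05:
        return step(s, "N")
    elif p < 0.10:
        return step(s, "E")
    elif p < 0.15:
        return step(s, "S")
    elif p < 0.20:
        return step(s, "W")
    return step(s, a)  # with probability 0.8, the intended move
```

When p is drawn Uniform[0,1] by the caller, g(s, a, p) has exactly the transition distribution of the noisy gridworld.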

Figure 1b shows the result of running this experiment, for different numbers of scenarios. The value of the best policy in Π is indicated by the topmost horizontal line, and the solid curve below that is the mean policy value when using our algorithm. As we see, even using surprisingly small numbers of scenarios, the algorithm manages to find good policies, and as m becomes large, the value also approaches the optimal value.

[Figure 1b plot: mean policy value (about −16 to −9) versus the number of scenarios m (0 to 30), on the 5x5 gridworld.]

Figure 1: (a) 5x5 gridworld, with the 8 observations. (b) PEGASUS results using the normal and complex deterministic simulative models. The topmost horizontal line shows the value of the best policy in Π; the solid curve is the mean policy value using the normal model; the lower curve is the mean policy value using the complex model. The (almost negligible) 1 s.e. bars are also plotted.

²Space constraints preclude a detailed discussion, but briefly, this is done by constructing two Lipschitz continuous reward functions R₋ and R₊ that are "close to" R and which lower- and upper-bound R (and which hence give value estimates that also lower- and upper-bound our value estimates under R); using the assumption of bounded densities to show that our values under R₋ and R₊ are ε-close to those under R; applying Theorem 3 to show uniform convergence occurs with R₋ and R₊; and lastly deducing from this that uniform convergence occurs with R as well.
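Concretely, the evaluate-and-search loop used in this experiment can be sketched as follows (the scenario representation and the toy interfaces — `policy(s)`, `reward(s)`, `g(s, a, p)` — are our own illustrative conventions, not the paper's code):

```python
def estimate_value(policy, scenarios, g, reward, gamma=0.99, horizon=50):
    """PEGASUS-style value estimate: average discounted return over a FIXED
    set of scenarios.  Each scenario is (s0, ps): an initial state plus the
    pre-drawn random numbers that the deterministic model g will consume."""
    total = 0.0
    for s0, ps in scenarios:
        s, ret = s0, 0.0
        for t in range(horizon):
            ret += (gamma ** t) * reward(s)
            s = g(s, policy(s), ps[t])
        total += ret
    return total / len(scenarios)

def policy_search(policies, scenarios, g, reward):
    """Exhaustive search: score every policy on the SAME scenarios and keep
    the best, as in the gridworld experiment."""
    return max(policies, key=lambda pi: estimate_value(pi, scenarios, g, reward))
```

Because the scenarios are held fixed, every policy is scored against the same draws of randomness, which is what makes V̂ a deterministic function of the policy.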

We had previously predicted that a "complicated" deterministic simulative model g can lead to poor results. For each (s, a)-pair, let h_{s,a} : [0,1] → [0,1] be a hash function chosen so that it maps a Uniform[0,1] random variable into another Uniform[0,1] random variable.³ Then if g is a deterministic simulative model, g'(s, a, p) = g(s, a, h_{s,a}(p)) is another one that, because of the presence of the hash functions, is a much more "complex" model than g. (Here, we appeal to the reader's intuition about complex functions, rather than formal measures of complexity.) We would therefore predict that using PEGASUS with g' would give worse results than with g, and indeed this prediction is borne out by the results shown in Figure 1b (dashed curve). The difference between the curves is not large, but this is also not unexpected given the small size of the problem.

Our second experiment used Randløv and Alstrøm's [11] bicycle simulator, where the objective is to ride to a goal one kilometer away. The actions are the torque applied to the handlebars and the displacement of the rider's center-of-gravity from the bicycle's center. The six-dimensional state used in [11] includes variables for the bicycle's tilt angle and orientation, and the handlebar's angle. If the bicycle tilt exceeds π/15, it falls over and enters an absorbing state, receiving a large negative reward. The randomness in the simulator is from a uniformly distributed term added to the intended displacement of the center-of-gravity. Rescaled appropriately, this became the p term of our deterministic simulative model.

We performed policy search over the following space: We selected a vector of fifteen (simple, manually-chosen but not fine-tuned) features of each state; the two actions were then given by sigmoids of weighted combinations of these features, scaled to the allowed action ranges. Note that since our approach can handle continuous actions directly, we did not, unlike [11], have to discretize the actions. The initial-state distribution was manually chosen to be representative of a "typical" state distribution when riding a bicycle, and was also not fine-tuned. We used only a small number m = 30 of scenarios, with the continuous-time model of discounting discussed earlier, and (essentially) gradient ascent to optimize over the weights.⁴ Shaping rewards, to reward progress towards the goal, were also used.⁵

We ran 10 trials using our policy search algorithm, testing each of the resulting solutions on 50 rides. Doing so, the median riding distances to the goal of the 10 different policies ranged from about 0.995km to 1.07km.⁶ In all 500 evaluation runs for the 10 policies, the worst distance we observed was also about 1.07km. These results are significantly better than those of [11], which reported riding distances of about 7km (since their policies often took very "non-linear" paths to the goal), and a single "best-ever" trial of about 1.7km.⁷

³In our experiments, this was implemented by choosing, for each (s, a) pair, a random integer k, and then letting h_{s,a}(p) = frac(kp), where frac(z) denotes the fractional part of z.

⁴Running experiments without the continuous-time model of discounting, we also obtained, using a non-gradient based hill-climbing algorithm, equally good results as those reported here. Our implementation of gradient ascent, using numerically evaluated derivatives, was run with a bound on the length of the step taken on any iteration, to avoid problems near g's discontinuities.

⁵Other experimental details: The shaping reward was proportional to, and signed the same as, the amount of progress towards the goal. As in [11], we did not include the distance-from-goal as one of the state variables during training; training therefore proceeded "infinitely distant" from the goal.

⁶Distances under 1km are possible since, as in [11], the goal has a 10m radius.

⁷Theory predicts that the difference between the performance of the chosen policy and that of the best policy in the search space should be at most on the order of the estimation error; see [7].
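The hash-function construction described in the footnote above can be sketched as follows (the integer range, the seeding scheme, and the toy model are our own illustrative choices):

```python
import random

def make_hash(seed):
    """Fix one random integer k for a given (s, a) pair; p -> frac(k * p)
    maps a Uniform[0,1] random variable to another Uniform[0,1] variable."""
    k = random.Random(seed).randrange(1000, 10 ** 6)  # illustrative range
    return lambda p: (k * p) % 1.0

def complexify(g, hash_for_pair):
    """g'(s, a, p) = g(s, a, h_{s,a}(p)): the transition distributions are
    unchanged, but the dependence on p becomes far more irregular."""
    def g_prime(s, a, p):
        return g(s, a, hash_for_pair(s, a)(p))
    return g_prime
```

Since each h_{s,a} is measure-preserving on [0,1], g' simulates exactly the same POMDP as g; only the smoothness of the map p ↦ g'(s, a, p) is destroyed, which is what degrades the scenario-based estimates.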

6 Conclusions

We have shown how any POMDP can be transformed into an "equivalent" one in which all transitions are deterministic. By approximating the transformed POMDP's initial state distribution with a sample of scenarios, we defined an estimate for the value of every policy, and finally performed policy search by optimizing these estimates. Conditions were established under which these estimates will be uniformly good, and experimental results showed our method working well. It is also straightforward to extend these methods and results to the cases of finite-horizon undiscounted reward, and infinite-horizon average reward with an ε-mixing time T.

Acknowledgements

We thank Jette Randløv and Preben Alstrøm for the use of their bicycle simulator, and Ben Van Roy for helpful comments. A. Ng is supported by a Berkeley Fellowship. This work was also supported by ARO MURI DAAH04-96-0341, ONR MURI N00014-00-1-0637, and NSF grant IIS-9988642.

References

[1] L. Baird and A. W. Moore. Gradient descent for general reinforcement learning. In NIPS 11, 1999.
[2] D. Bertsekas. Dynamic Programming and Optimal Control, Vol. 1. Athena Scientific, 1995.
[3] J. R. Birge and F. Louveaux. Introduction to Stochastic Programming. Springer, 1997.
[4] R. Durrett. Probability: Theory and Examples, 2nd edition. Duxbury, 1996.
[5] P. W. Goldberg and M. R. Jerrum. Bounding the Vapnik-Chervonenkis dimension of concept classes parameterized by real numbers. Machine Learning, 18:131–148, 1995.
[6] D. Haussler. Decision-theoretic generalizations of the PAC model for neural networks and other applications. Information and Computation, 100:78–150, 1992.
[7] M. Kearns, Y. Mansour, and A. Y. Ng. Approximate planning in large POMDPs via reusable trajectories. (Extended version of paper in NIPS 12), 1999.
[8] H. Kimura, M. Yamamura, and S. Kobayashi. Reinforcement learning by stochastic hill climbing on discounted reward. In Proceedings of the Twelfth International Conference on Machine Learning, 1995.
[9] N. Meuleau, L. Peshkin, K.-E. Kim, and L. P. Kaelbling. Learning finite-state controllers for partially observable environments. In Uncertainty in Artificial Intelligence, Proceedings of the Fifteenth Conference, 1999.
[10] D. Pollard. Empirical Processes: Theory and Applications. NSF-CBMS Regional Conference Series in Probability and Statistics, Vol. 2. Institute of Mathematical Statistics and American Statistical Association, 1990.
[11] J. Randløv and P. Alstrøm. Learning to drive a bicycle using reinforcement learning and shaping. In Proceedings of the Fifteenth International Conference on Machine Learning, 1998.
[12] B. Van Roy. Learning and Value Function Approximation in Complex Decision Processes. PhD thesis, Massachusetts Institute of Technology, 1998.
[13] R. S. Sutton and A. G. Barto. Reinforcement Learning. MIT Press, 1998.
[14] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, 1982.
[15] J. K. Williams and S. Singh. Experiments with an algorithm which learns stochastic memoryless policies for POMDPs. In NIPS 11, 1999.
[16] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.

Appendix A: Proof of Theorem 2

Proof (of Theorem 2). We construct an MDP with states s0, s1 and s2, plus an absorbing state. The reward function is R(s1) = 1 and R(s2) = −1, with R zero elsewhere. Discounting is ignored in this construction. Both s1 and s2 transition with probability 1 to the absorbing state regardless of the action taken. The initial state s0 has a .5 chance of transitioning to each of s1 and s2.

We now construct g, which will depend in a complicated way on the p term. Let B be the countable set of all finite unions of intervals with rational endpoints in [0,1]. Let B' be the countable subset of B that contains all elements of B that have total length (Lebesgue measure) exactly 0.5. For example, [1/4, 3/4] is in B', as is any finite disjoint union of rational intervals whose lengths sum to 0.5. Let B_1, B_2, ... be an enumeration of the elements of B'. Also let a_1, a_2, ... be an enumeration of (some countably infinite subset of) the actions. The deterministic simulative model on these actions is given by:

    g(s0, a_i, p) = s1  if p ∈ B_i
                    s2  otherwise.

So, Pr_p[g(s0, a_i, p) = s1] = Pr_p[g(s0, a_i, p) = s2] = 0.5 for all a_i, and this is a correct model for the MDP. Note also that V(π) = 0 for all policies π.
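The key property of B' used in this construction — that any finite sample of random numbers can be trapped inside a measure-one-half union of rational intervals — can be made concrete (the equal-cell construction below is our own; the proof only needs existence):

```python
from fractions import Fraction

def covering_half_measure_set(ps):
    """Given finitely many points in [0, 1), return a union of intervals with
    rational endpoints, of total length exactly 1/2, containing every point."""
    m = len(ps)
    n = 4 * m                          # partition [0,1) into n equal cells
    used = {int(p * n) for p in ps}    # cells containing the sample points
    need = 2 * m - len(used)           # extra cells to reach measure 1/2
    spare = [i for i in range(n) if i not in used][:need]
    cells = sorted(set(used) | set(spare))
    return [(Fraction(i, n), Fraction(i + 1, n)) for i in cells]

def measure(intervals):
    return sum(b - a for a, b in intervals)

def contains(intervals, p):
    return any(a <= p < b for a, b in intervals)
```

Each sample point occupies one of n = 4m cells, so at most m cells are forced; padding with unused cells brings the total measure to exactly 2m/(4m) = 1/2.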

For any finite sample of m scenarios (s0, p^(1)), (s0, p^(2)), ..., (s0, p^(m)), there exists some B_i ∈ B' containing the relevant random number of every scenario (for any finite set of points in [0,1], some finite union of rational intervals of total length exactly 0.5 contains them all). Thus, evaluating the policy that always takes action a_i using this set of scenarios, all simulated trajectories will transition from s0 to s1, so the value estimate for this policy (assuming a horizon of at least one step) is 1. Since this argument holds for any finite number m of scenarios, we have shown that V̂(π) does not uniformly converge to V(π) (over π).

Appendix B: Proof of Theorem 3

Due to space constraints, this proof will be slightly dense. The proof techniques we use are due to Haussler [6] and Pollard [10]. Haussler [6], to which we will be repeatedly referring, provides a readable introduction to most of the methods used here.

We begin with some standard definitions from [6]. For a subset C of a space X endowed with (pseudo-)metric d, we say C is an ε-cover for X if, for every x ∈ X, there is some c ∈ C such that d(x, c) ≤ ε. For each ε > 0, let N(ε, X, d) denote the size of the smallest ε-cover for X.

Let F be a family of functions mapping from a set X into a bounded pseudo-metric space (Y, d_Y), and let D be a probability measure on X. Define a pseudo-metric on F by d_{L1(D)}(f, f') = E_{x∼D}[d_Y(f(x), f'(x))]. (This is slightly inconsistent with the definition used in [6], which has an additional factor.) Define the capacity of F to be C(ε, F, d_Y) = sup_D N(ε, F, d_{L1(D)}), where the sup is over all probability measures D on X. The quantity C(ε, F, d_Y) thus measures the "richness" of the class F. Note that N and C are both decreasing functions of ε, and that C(ε, F, d_Y) = C(kε, F, k·d_Y) for any k > 0.

The main results obtained with the pseudo-dimension are uniform convergence of the empirical means of classes of random variables to their true means. Let F be a family of functions mapping from X into [0, M], and let x^(1), ..., x^(m) (the "training set") be m i.i.d. draws from some probability measure D over X. Then for each f ∈ F, let Ê[f] = (1/m) Σ_{j=1}^{m} f(x^(j)) be the empirical mean of f. Also let E[f] = E_{x∼D}[f(x)] be the true mean.

We now state a few results from [6]. In [6], these are Theorem 6 combined with Theorem 12; Lemma 7; Lemma 8; and Theorem 9 (applied with one of the classes being a singleton set). Below, d_{L1} and d_{L2} respectively denote the Manhattan and Euclidean metrics on R^n.

Lemma 5 Let F be a family of functions mapping from X into [0, M], with pseudo-dimension dim(F). Then for any probability measure D on X and any 0 < ε ≤ M, we have

    N(ε, F, d_{L1(D)}) ≤ 2 ( (2eM/ε) ln(2eM/ε) )^{dim(F)}.

Lemma 6 Let each F_i (i = 1, ..., n) be a family of functions mapping from X into [0, 1]. The free product of the F_i's is the class of functions F = { x ↦ (f_1(x), ..., f_n(x)) : f_i ∈ F_i }, mapping from X into [0,1]^n. Then for any probability measure D on X and ε > 0,

    N(ε, F, d_{L1(D)}) ≤ Π_{i=1}^{n} N(ε, F_i, d_{L1(D)}).           (10)

Lemma 7 Let X_1, ..., X_{k+1} be bounded metric spaces, and for each i = 1, ..., k, let F_i be a class of functions mapping from X_i into X_{i+1}. Suppose that each F_i is uniformly Lipschitz continuous (with respect to the metric d_i on its domain, and d_{i+1} on its range), with some Lipschitz bound b ≥ 1. Let F = F_k ∘ ··· ∘ F_1 = { f_k ∘ ··· ∘ f_1 : f_i ∈ F_i } be the class of functions mapping from X_1 into X_{k+1} given by composition of the functions in the F_i's. Let ε > 0 be given, and let ε' = ε/(k b^k). Then

    C(ε, F, d_{k+1}) ≤ Π_{i=1}^{k} C(ε', F_i, d_{i+1}).              (11)

Lemma 8 Let F be a family of functions mapping from X into [0, M], and let D be a probability measure on X. Let x^(1), ..., x^(m) be generated by m independent draws from D, and assume ε > 0. Then

    Pr[ ∃ f ∈ F : |Ê[f] − E[f]| > ε ] ≤ 4 C(ε/16, F, d_{L1}) e^{−ε²m/(64M²)}.    (12)
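The ε-cover notion above can be illustrated with a tiny greedy construction (our own illustrative code; greedy covers need not be smallest, so their size only upper-bounds N(ε, X, d)):

```python
def greedy_eps_cover(points, eps, dist):
    """Greedily build an eps-cover of a finite (pseudo-)metric space:
    afterwards, every point is within eps of some element of the cover."""
    cover = []
    for x in points:
        if all(dist(x, c) > eps for c in cover):
            cover.append(x)
    return cover
```

Any point skipped by the loop is, by construction, within eps of an already-kept element, so the returned set is a valid ε-cover.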

We are now ready to prove Theorem 3. No serious attempt has been made to tighten the polynomial factors in the bound.

Proof (of Theorem 3). Our proof is in three parts. First, V̂(π) gives an estimate of the discounted rewards summed over the first H+1 steps; we reduce the problem of showing uniform convergence of V̂(π) to one of proving that our estimates of the expected rewards on the t-th step, t = 0, ..., H, all converge uniformly. Second, we carefully define the mapping from the scenarios to the t-th step rewards, and use Lemmas 5, 6 and 7 to bound its capacity. Lastly, applying Lemma 8 gives our result. To simplify the notation in this proof, assume that each simulation step consumes one random number from the scenario, and that R_max ≥ 1.

Part I: Reduction to uniform convergence of t-th step rewards. V̂(π) was defined by

    V̂(π) = (1/m) Σ_{i=1}^{m} [ R(s_0^(i)) + γ R(s_1^(i)) + ··· + γ^H R(s_H^(i)) ].

For each t, let V̂_t(π) = (1/m) Σ_{i=1}^{m} γ^t R(s_t^(i)) be the empirical mean of the discounted reward on the t-th step, and let V_t(π) = E[γ^t R(s_t)] be the true expected (discounted) reward on the t-th step (starting from s_0 ∼ D and executing π). Thus, V̂(π) = Σ_{t=0}^{H} V̂_t(π).

Suppose we can show, for each t = 0, ..., H, that with probability at least 1 − δ/(H+1),

    |V̂_t(π) − V_t(π)| ≤ ε/(2(H+1))  for all π ∈ Π.                  (13)

Then by the union bound, we know that with probability at least 1 − δ, |V̂_t(π) − V_t(π)| ≤ ε/(2(H+1)) holds simultaneously for all t = 0, ..., H and for all π ∈ Π. This implies that, for all π ∈ Π,

    |V̂(π) − V(π)| ≤ Σ_{t=0}^{H} |V̂_t(π) − V_t(π)| + | Σ_{t=H+1}^{∞} γ^t E[R(s_t)] |
                  ≤ (H+1) · ε/(2(H+1)) + ε/2
                  = ε,

where we used the fact that the rewards past step H contribute at most ε/2, by construction of the ε/2-horizon time. But this is exactly the desired result. Thus, we need only prove that Equation (13) holds with high probability for each t = 0, ..., H.

Part II: Bounding the capacity. Let t be fixed. We now write out the mapping from a scenario to the t-th step reward. Since this mapping depends only on the first t elements of the "p"s portion of the scenario, we will, with some abuse of notation, write the scenario as (s, p_1, ..., p_t), and ignore its other coordinates.

Given a family of functions mapping from S × [0,1]^t into some range, we extend its domain to S × [0,1]^{t'} (for any finite t' > t) simply by having it ignore the extra coordinates. Note this extension of the domain does not change the pseudo-dimension of a family of functions. Also, for each π, define a mapping f_π from S × [0,1] into S according to f_π(s, p) = g(s, π(s), p); where necessary, f_π's domain is also extended as we have just described.

For each i = 1, ..., t+1, let X_i denote the space of partial scenarios remaining after i − 1 simulation steps; for example, X_1 is just the space of scenarios (with only the first t elements of the p's kept). For each i = 1, ..., t, define a family of maps F_i from X_i into X_{i+1} as the free product of {f_π : π ∈ Π} with the identity maps on the unused p-coordinates (where the definition of the free product of sets of functions is as given in Lemma 6); note such an F_i has Lipschitz bound at most max{b, 1}, where b is the Lipschitz bound of the model. Also let F_{t+1} be a singleton set containing the reward function R, mapping into [−R_max, R_max]. Finally, let F = F_{t+1} ∘ F_t ∘ ··· ∘ F_1 be the family of maps from scenarios into [−R_max, R_max] given by composition of the F_i's.

Now, let r_π be the reward received on the t-th step when executing π from a scenario. As we let π vary over Π, this defines a family of maps from scenarios into [−R_max, R_max], and clearly this family is a subset of F. Thus, if we can bound the capacity of F (and hence prove uniform convergence over F), we have also proved uniform convergence for the t-th step rewards (over all π ∈ Π).

For each i, Lemma 5 bounds the covering numbers of the class {f_π : π ∈ Π} in terms of its pseudo-dimension. Moreover, clearly C(ε, F_{t+1}, d) = 1, since F_{t+1} is a singleton set. Combined with Lemma 6, this implies that, for each i and ε,

    N(ε, F_i, d_{L1(D)}) is bounded by the product of the per-coordinate covering numbers,    (14)

where we have used the fact that N is decreasing in its ε parameter; by taking a sup over probability measures D, the same expression also bounds the capacity C(ε, F_i, d_{L1}). Finally, applying Lemma 7 with each of the metrics being the L1 norm on the appropriate space, k = t + 1, and ε' = ε/((t+1) max{b,1}^{t+1}), we find that C(ε, F, d) is bounded by the product over i of C(ε', F_i, d_{L1}).

Part III: Proving uniform convergence. Applying Lemma 8 with the above bound on C(ε, F, d), we find that for there to be a probability at least 1 − δ/(H+1) of our estimate of the expected t-th step reward being ε/(2(H+1))-close to its mean, it suffices (by inverting Equation (12), with M on the order of R_max) that

    m ≥ (64 M² / ε_t²) ( ln C(ε_t/16, F, d_{L1}) + ln(4(H+1)/δ) ),  where ε_t = ε/(2(H+1)),

which is polynomial in all of the quantities named in the theorem statement.

EF ? £

, ,

Now, let be the reward ,

¨ ´ ¨ ¨´ ¨´ ¨ ¨ d 2·¶ ç<` ´°é ç ¨ >

received on the ± -th step when executing from a scenario

ìŒì

Ê ,귋

š{µ šµ šµ š

 ´ : : :

; t

&è¦

²

£ > X

¢ . As we let vary over , this defines a family ·Œ"/.©01<¨ "/.©0 1¤6

of maps from scenarios into ( . Clearly, this ‹ family of maps is a subset of Æ . Thus, if we can bound the This completes the proof of the Theorem.