
Journal of Artificial Intelligence Research

Reinforcement Learning: A Survey

Leslie Pack Kaelbling    lpk@cs.brown.edu

Michael L. Littman    mlittman@cs.brown.edu

Computer Science Department, Brown University

Providence, RI, USA

Andrew W. Moore    awm@cs.cmu.edu

Smith Hall, Carnegie Mellon University, Forbes Avenue

Pittsburgh, PA, USA

Abstract

This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.

1. Introduction

Reinforcement learning dates back to the early days of cybernetics and work in statistics, psychology, neuroscience, and computer science. In the last five to ten years, it has attracted rapidly increasing interest in the machine learning and artificial intelligence communities. Its promise is beguiling: a way of programming agents by reward and punishment without needing to specify how the task is to be achieved. But there are formidable computational obstacles to fulfilling the promise.

This paper surveys the historical basis of reinforcement learning and some of the current work from a computer science perspective. We give a high-level overview of the field and a taste of some specific approaches. It is, of course, impossible to mention all of the important work in the field; this should not be taken to be an exhaustive account.

Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment. The work described here has a strong family resemblance to eponymous work in psychology, but differs considerably in the details and in the use of the word "reinforcement." It is appropriately thought of as a class of problems, rather than as a set of techniques.

There are two main strategies for solving reinforcement-learning problems. The first is to search in the space of behaviors in order to find one that performs well in the environment. This approach has been taken by work in genetic algorithms and genetic programming, as well as some more novel search techniques (Schmidhuber). The second is to use statistical techniques and dynamic programming methods to estimate the utility of taking actions in states of the world. This paper is devoted almost entirely to the second set of techniques, because they take advantage of the special structure of reinforcement-learning problems that is not available in optimization problems in general. It is not yet clear which set of approaches is best in which circumstances.

Figure 1: The standard reinforcement-learning model.

The rest of this section is devoted to establishing notation and describing the basic reinforcement-learning model. Section 2 explains the trade-off between exploration and exploitation and presents some solutions to the most basic case of reinforcement-learning problems, in which we want to maximize the immediate reward. Section 3 considers the more general problem in which rewards can be delayed in time from the actions that were crucial to gaining them. Section 4 considers some classic model-free algorithms for reinforcement learning from delayed reward: adaptive heuristic critic, TD(λ) and Q-learning. Section 5 demonstrates a continuum of algorithms that are sensitive to the amount of computation an agent can perform between actual steps of action in the environment. Generalization, the cornerstone of mainstream machine learning research, has the potential of considerably aiding reinforcement learning, as described in Section 6. Section 7 considers the problems that arise when the agent does not have complete perceptual access to the state of the environment. Section 8 catalogs some of reinforcement learning's successful applications. Finally, Section 9 concludes with some speculations about important open problems and the future of reinforcement learning.

1.1 Reinforcement-Learning Model

In the standard reinforcement-learning model, an agent is connected to its environment via perception and action, as depicted in Figure 1. On each step of interaction the agent receives as input, i, some indication of the current state, s, of the environment; the agent then chooses an action, a, to generate as output. The action changes the state of the environment, and the value of this state transition is communicated to the agent through a scalar reinforcement signal, r. The agent's behavior, B, should choose actions that tend to increase the long-run sum of values of the reinforcement signal. It can learn to do this over time by systematic trial and error, guided by a wide variety of algorithms that are the subject of later sections of this paper.


Formally, the model consists of

a discrete set of environment states, S;

a discrete set of agent actions, A; and

a set of scalar reinforcement signals, typically {0, 1} or the real numbers.

The figure also includes an input function I, which determines how the agent views the environment state; we will assume that it is the identity function (that is, the agent perceives the exact state of the environment) until we consider partial observability in Section 7.

An intuitive way to understand the relation between the agent and its environment is with the following example dialogue:

Environment: You are in state 65. You have 4 possible actions.

Agent: I'll take action 2.

Environment: You received a reinforcement of 7 units. You are now in state 15. You have 2 possible actions.

Agent: I'll take action 1.

Environment: You received a reinforcement of -4 units. You are now in state 65. You have 4 possible actions.

Agent: I'll take action 2.

Environment: You received a reinforcement of 5 units. You are now in state 44. You have 5 possible actions.

The agent's job is to find a policy π, mapping states to actions, that maximizes some long-run measure of reinforcement. We expect, in general, that the environment will be non-deterministic; that is, taking the same action in the same state on two different occasions may result in different next states and/or different reinforcement values. This happens in our example above: from state 65, applying action 2 produces differing reinforcements and differing states on the two occasions. However, we assume the environment is stationary; that is, the probabilities of making state transitions or receiving specific reinforcement signals do not change over time.

Reinforcement learning differs from the more widely studied problem of supervised learning in several ways. The most important difference is that there is no presentation of input/output pairs. Instead, after choosing an action the agent is told the immediate reward and the subsequent state, but is not told which action would have been in its best long-term interests. It is necessary for the agent to gather useful experience about the possible system states, actions, transitions and rewards actively to act optimally. Another difference from supervised learning is that on-line performance is important: the evaluation of the system is often concurrent with learning.

(Footnote: This stationarity assumption may be disappointing; after all, operation in non-stationary environments is one of the motivations for building learning systems. In fact, many of the algorithms described in later sections are effective in slowly-varying non-stationary environments, but there is very little theoretical analysis in this area.)


Some aspects of reinforcement learning are closely related to search and planning issues in artificial intelligence. AI search algorithms generate a satisfactory trajectory through a graph of states. Planning operates in a similar manner, but typically within a construct with more complexity than a graph, in which states are represented by compositions of logical expressions instead of atomic symbols. These AI algorithms are less general than the reinforcement-learning methods, in that they require a predefined model of state transitions, and with a few exceptions assume determinism. On the other hand, reinforcement learning, at least in the kind of discrete cases for which theory has been developed, assumes that the entire state space can be enumerated and stored in memory, an assumption to which conventional search algorithms are not tied.

1.2 Models of Optimal Behavior

Before we can start thinking about algorithms for learning to behave optimally, we have to decide what our model of optimality will be. In particular, we have to specify how the agent should take the future into account in the decisions it makes about how to behave now. There are three models that have been the subject of the majority of work in this area.

The finite-horizon model is the easiest to think about: at a given moment in time, the agent should optimize its expected reward for the next h steps,

E\left( \sum_{t=0}^{h} r_t \right);

it need not worry about what will happen after that. In this and subsequent expressions, r_t represents the scalar reward received t steps into the future. This model can be used in two ways. In the first, the agent will have a non-stationary policy; that is, one that changes over time. On its first step it will take what is termed an h-step optimal action. This is defined to be the best action available given that it has h steps remaining in which to act and gain reinforcement. On the next step it will take an (h-1)-step optimal action, and so on, until it finally takes a 1-step optimal action and terminates. In the second, the agent does receding-horizon control, in which it always takes the h-step optimal action. The agent always acts according to the same policy, but the value of h limits how far ahead it looks in choosing its actions. The finite-horizon model is not always appropriate. In many cases we may not know the precise length of the agent's life in advance.

The infinite-horizon discounted model takes the long-run reward of the agent into account, but rewards that are received in the future are geometrically discounted according to discount factor γ (where 0 ≤ γ < 1):

E\left( \sum_{t=0}^{\infty} \gamma^{t} r_t \right).

We can interpret γ in several ways. It can be seen as an interest rate, a probability of living another step, or as a mathematical trick to bound the infinite sum. The model is conceptually similar to receding-horizon control, but the discounted model is more mathematically tractable than the finite-horizon model. This is a dominant reason for the wide attention this model has received.


Another optimality criterion is the average-reward model, in which the agent is supposed to take actions that optimize its long-run average reward:

\lim_{h \to \infty} E\left( \frac{1}{h} \sum_{t=0}^{h} r_t \right).

Such a policy is referred to as a gain optimal policy; it can be seen as the limiting case of the infinite-horizon discounted model as the discount factor approaches 1 (Bertsekas). One problem with this criterion is that there is no way to distinguish between two policies, one of which gains a large amount of reward in the initial phases and the other of which does not. Reward gained on any initial prefix of the agent's life is overshadowed by the long-run average performance. It is possible to generalize this model so that it takes into account both the long-run average and the amount of initial reward that can be gained. In the generalized, bias optimal model, a policy is preferred if it maximizes the long-run average and ties are broken by the initial extra reward.
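To make the three optimality criteria concrete, the following minimal Python sketch computes each measure for a finite sequence of observed rewards; the function names are illustrative, and the infinite-horizon and average-reward quantities are necessarily truncated at the end of the observed sequence.

def finite_horizon_return(rewards, h):
    # Finite-horizon model: sum of the next h rewards.
    return sum(rewards[:h])

def discounted_return(rewards, gamma):
    # Infinite-horizon discounted model: sum of gamma^t * r_t,
    # truncated here at the end of the observed sequence.
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def average_reward(rewards):
    # Average-reward model: average of the observed rewards.
    return sum(rewards) / len(rewards)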

Figure 2 contrasts these models of optimality by providing an environment in which changing the model of optimality changes the optimal policy. In this example, circles represent the states of the environment and arrows are state transitions. There is only a single action choice from every state except the start state, which is in the upper left and marked with an incoming arrow. All rewards are zero except where marked. Under a finite-horizon model with h = 4, the first action should be chosen; under an infinite-horizon discounted model with γ = 0.9, the second action should be chosen; and under the average-reward model, the third action should be chosen, since it leads to the best long-run average reward. If h is made much larger and γ much smaller, then the second action is optimal for the finite-horizon model and the first for the infinite-horizon discounted model; however, the average-reward model will always prefer the best long-term average. Since the choice of optimality model and parameters matters so much, it is important to choose it carefully in any application.

Figure 2: Comparing models of optimality. All unlabeled arrows produce a reward of zero. (The rows of the figure are labeled "Finite horizon, h=4", "Infinite horizon, γ=0.9", and "Average reward"; the marked rewards are +2, +10, and +11.)

The finite-horizon model is appropriate when the agent's lifetime is known; one important aspect of this model is that as the length of the remaining lifetime decreases, the agent's policy may change. A system with a hard deadline would be appropriately modeled this way. The relative usefulness of infinite-horizon discounted and bias-optimal models is still under debate. Bias-optimality has the advantage of not requiring a discount parameter; however, algorithms for finding bias-optimal policies are not yet as well-understood as those for finding optimal infinite-horizon discounted policies.

1.3 Measuring Learning Performance

The criteria given in the previous section can be used to assess the policies learned by a given algorithm. We would also like to be able to evaluate the quality of learning itself. There are several incompatible measures in use.

Eventual convergence to optimal: Many algorithms come with a provable guarantee of asymptotic convergence to optimal behavior (Watkins & Dayan). This is reassuring, but useless in practical terms. An agent that quickly reaches a plateau near optimality may, in many applications, be preferable to an agent that has a guarantee of eventual optimality but a sluggish early learning rate.

Speed of convergence to optimality: Optimality is usually an asymptotic result, and so convergence speed is an ill-defined measure. More practical is the speed of convergence to near-optimality. This measure begs the definition of how near to optimality is sufficient. A related measure is level of performance after a given time, which similarly requires that someone define the given time.

It should be noted that here we have another difference between reinforcement learning and conventional supervised learning. In the latter, expected future predictive accuracy or statistical efficiency are the prime concerns. For example, in the well-known PAC framework (Valiant), there is a learning period during which mistakes do not count, then a performance period during which they do. The framework provides bounds on the necessary length of the learning period in order to have a probabilistic guarantee on the subsequent performance. That is usually an inappropriate view for an agent with a long existence in a complex environment.

In spite of the mismatch between embedded reinforcement learning and the train/test perspective, Fiechter provides a PAC analysis for Q-learning (described in Section 4.2) that sheds some light on the connection between the two views.

Measures related to speed of learning have an additional weakness. An algorithm that merely tries to achieve optimality as fast as possible may incur unnecessarily large penalties during the learning period. A less aggressive strategy, taking longer to achieve optimality but gaining greater total reinforcement during its learning, might be preferable.

A more appropriate measure, then, is the expected decrease in reward gained due to executing the learning algorithm instead of behaving optimally from the very beginning. This measure is known as regret (Berry & Fristedt). It penalizes mistakes wherever they occur during the run. Unfortunately, results concerning the regret of algorithms are quite hard to obtain.


1.4 Reinforcement Learning and Adaptive Control

Adaptive control (Burghes & Graham; Stengel) is also concerned with algorithms for improving a sequence of decisions from experience. Adaptive control is a much more mature discipline that concerns itself with dynamic systems in which states and actions are vectors and system dynamics are smooth: linear or locally linearizable around a desired trajectory. A very common formulation of cost functions in adaptive control is quadratic penalties on deviation from desired state and action vectors. Most importantly, although the dynamic model of the system is not known in advance and must be estimated from data, the structure of the dynamic model is fixed, leaving model estimation as a parameter estimation problem. These assumptions permit deep, elegant and powerful mathematical analysis, which in turn leads to robust, practical, and widely deployed adaptive control algorithms.

2. Exploitation versus Exploration: The Single-State Case

One major difference between reinforcement learning and supervised learning is that a reinforcement-learner must explicitly explore its environment. In order to highlight the problems of exploration, we treat a very simple case in this section. The fundamental issues and approaches described here will, in many cases, transfer to the more complex instances of reinforcement learning discussed later in the paper.

The simplest possible reinforcement-learning problem is known as the k-armed bandit problem, which has been the subject of a great deal of study in the statistics and applied mathematics literature (Berry & Fristedt). The agent is in a room with a collection of k gambling machines (each called a "one-armed bandit" in colloquial English). The agent is permitted a fixed number of pulls, h. Any arm may be pulled on each turn. The machines do not require a deposit to play; the only cost is in wasting a pull playing a suboptimal machine. When arm i is pulled, machine i pays off 1 or 0, according to some underlying probability parameter p_i, where payoffs are independent events and the p_i's are unknown. What should the agent's strategy be?

This problem illustrates the fundamental tradeoff between exploitation and exploration. The agent might believe that a particular arm has a fairly high payoff probability; should it choose that arm all the time, or should it choose another one that it has less information about, but seems to be worse? Answers to these questions depend on how long the agent is expected to play the game; the longer the game lasts, the worse the consequences of prematurely converging on a suboptimal arm, and the more the agent should explore.

There is a wide variety of solutions to this problem. We will consider a representative selection of them, but for a deeper discussion and a number of important theoretical results, see the book by Berry and Fristedt. We use the term "action" to indicate the agent's choice of arm to pull. This eases the transition into delayed reinforcement models in Section 3. It is very important to note that bandit problems fit our definition of a reinforcement-learning environment with a single state with only self transitions.

Section 2.1 discusses three solutions to the basic one-state bandit problem that have formal correctness results. Although they can be extended to problems with real-valued rewards, they do not apply directly to the general multi-state delayed-reinforcement case.


Section 2.2 presents three techniques that are not formally justified, but that have had wide use in practice, and can be applied (with similar lack of guarantee) to the general case.

2.1 Formally Justified Techniques

There is a fairly well-developed formal theory of exploration for very simple problems. Although it is instructive, the methods it provides do not scale well to more complex problems.

2.1.1 Dynamic-Programming Approach

If the agent is going to be acting for a total of h steps, it can use basic Bayesian reasoning to solve for an optimal strategy (Berry & Fristedt). This requires an assumed prior joint distribution for the parameters {p_i}, the most natural of which is that each p_i is independently uniformly distributed between 0 and 1. We compute a mapping from belief states (summaries of the agent's experiences during this run) to actions. Here, a belief state can be represented as a tabulation of action choices and payoffs: {n_1, w_1, n_2, w_2, ..., n_k, w_k} denotes a state of play in which each arm i has been pulled n_i times with w_i payoffs. We write V^*(n_1, w_1, ..., n_k, w_k) as the expected payoff remaining, given that a total of h pulls are available, and we use the remaining pulls optimally.

If \sum_i n_i = h, then there are no remaining pulls, and V^*(n_1, w_1, ..., n_k, w_k) = 0. This is the basis of a recursive definition. If we know the V^* value for all belief states with t pulls remaining, we can compute the V^* value of any belief state with t+1 pulls remaining:

V^*(n_1, w_1, \ldots, n_k, w_k) = \max_i E(\text{future payoff if agent takes action } i\text{, then acts optimally for remaining pulls})
= \max_i \left[ \rho_i \left( 1 + V^*(n_1, w_1, \ldots, n_i + 1, w_i + 1, \ldots, n_k, w_k) \right) + (1 - \rho_i) \, V^*(n_1, w_1, \ldots, n_i + 1, w_i, \ldots, n_k, w_k) \right],

where \rho_i is the posterior subjective probability of action i paying off given n_i, w_i and our prior probability. For the uniform priors, which result in a beta distribution, \rho_i = (w_i + 1)/(n_i + 2).

The expense of filling in the table of V^* values in this way for all attainable belief states is linear in the number of belief states times actions, and thus exponential in the horizon.
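The recursion can be evaluated directly by memoizing over belief states. The sketch below is an illustrative Python rendering (not code from the literature) for Bernoulli arms with independent uniform priors; its cost grows with the number of attainable belief states and is therefore exponential in the horizon, as noted above.

from functools import lru_cache

@lru_cache(maxsize=None)
def v_star(belief, pulls_left):
    # belief is a tuple of (n_i, w_i) pairs: pulls and payoffs observed for each arm.
    if pulls_left == 0:
        return 0.0
    best = 0.0
    for i, (n, w) in enumerate(belief):
        rho = (w + 1) / (n + 2)            # posterior probability that arm i pays off
        win = list(belief)
        win[i] = (n + 1, w + 1)
        lose = list(belief)
        lose[i] = (n + 1, w)
        value = (rho * (1.0 + v_star(tuple(win), pulls_left - 1))
                 + (1.0 - rho) * v_star(tuple(lose), pulls_left - 1))
        best = max(best, value)
    return best

# Example: expected payoff of optimal play for a 2-armed bandit with 4 pulls remaining.
print(v_star(((0, 0), (0, 0)), 4))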

2.1.2 Gittins Allocation Indices

Gittins gives an "allocation index" method for finding the optimal choice of action at each step in k-armed bandit problems (Gittins). The technique only applies under the discounted expected reward criterion. For each action, consider the number of times it has been chosen, n, versus the number of times it has paid off, w. For certain discount factors, there are published tables of "index values," I(n, w), for each pair of n and w. Look up the index value for each action i, I(n_i, w_i). It represents a comparative measure of the combined value of the expected payoff of action i (given its history of payoffs) and the value of the information that we would get by choosing it. Gittins has shown that choosing the action with the largest index value guarantees the optimal balance between exploration and exploitation.


Figure 3: A Tsetlin automaton with 2N states. The top row shows the state transitions that are made when the previous action resulted in a reward of 1; the bottom row shows transitions after a reward of 0. In states in the left half of the figure, action 0 is taken; in those on the right, action 1 is taken.

Because of the guarantee of optimal exploration, and the simplicity of the technique (given the table of index values), this approach holds a great deal of promise for use in more complex applications. This method proved useful in an application to robotic manipulation with immediate reward (Salganicoff & Ungar). Unfortunately, no one has yet been able to find an analog of index values for delayed reinforcement problems.

2.1.3 Learning Automata

A branch of the theory of adaptive control is devoted to learning automata, surveyed by Narendra and Thathachar, which were originally described explicitly as finite state automata. The Tsetlin automaton shown in Figure 3 provides an example that solves a 2-armed bandit arbitrarily near optimally as N approaches infinity.

It is inconvenient to describe algorithms as finite-state automata, so a move was made to describe the internal state of the agent as a probability distribution according to which actions would be chosen. The probabilities of taking different actions would be adjusted according to their previous successes and failures.

An example, which stands among a set of algorithms independently developed in the mathematical psychology literature (Hilgard & Bower), is the linear reward-inaction algorithm. Let p_i be the agent's probability of taking action i.

When action a_i succeeds:
p_i := p_i + \alpha (1 - p_i)
p_j := p_j - \alpha p_j   for j \neq i

When action a_i fails, p_j remains unchanged (for all j).

This algorithm converges with probability 1 to a vector containing a single 1 and the rest 0's (choosing a particular action with probability 1). Unfortunately, it does not always converge to the correct action; but the probability that it converges to the wrong one can be made arbitrarily small by making α small (Narendra & Thathachar). There is no literature on the regret of this algorithm.
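A minimal sketch of the linear reward-inaction automaton for a k-armed Bernoulli bandit; the arm probabilities, the step size alpha, and the number of steps below are illustrative choices, not values from the text.

import random

def linear_reward_inaction(success_probs, alpha=0.05, steps=10000):
    # success_probs are the (unknown to the learner) payoff probabilities of the k arms.
    k = len(success_probs)
    p = [1.0 / k] * k                              # action probabilities maintained by the automaton
    for _ in range(steps):
        i = random.choices(range(k), weights=p)[0]
        if random.random() < success_probs[i]:
            # action i succeeded: shift probability mass toward i
            for j in range(k):
                p[j] = p[j] + alpha * (1 - p[j]) if j == i else p[j] - alpha * p[j]
        # on failure, the probabilities are left unchanged ("inaction")
    return p

print(linear_reward_inaction([0.2, 0.8]))          # tends toward choosing the better arm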


2.2 Ad-Hoc Techniques

In reinforcement-learning practice, some simple ad hoc strategies have been popular. They are rarely, if ever, the best choice for the models of optimality we have used, but they may be viewed as reasonable, computationally tractable heuristics. Thrun has surveyed a variety of these techniques.

2.2.1 Greedy Strategies

The first strategy that comes to mind is to always choose the action with the highest estimated payoff. The flaw is that early unlucky sampling might indicate that the best action's reward is less than the reward obtained from a suboptimal action. The suboptimal action will always be picked, leaving the true optimal action starved of data and its superiority never discovered. An agent must explore to ameliorate this outcome.

A useful heuristic is optimism in the face of uncertainty, in which actions are selected greedily, but strongly optimistic prior beliefs are put on their payoffs so that strong negative evidence is needed to eliminate an action from consideration. This still has a measurable danger of starving an optimal but unlucky action, but the risk of this can be made arbitrarily small. Techniques like this have been used in several reinforcement learning algorithms, including the interval exploration method (Kaelbling, 1993b) described shortly, the exploration bonus in Dyna (Sutton), curiosity-driven exploration (Schmidhuber, 1991a), and the exploration mechanism in prioritized sweeping (Moore & Atkeson).

2.2.2 Randomized Strategies

Another simple exploration strategy is to take the action with the best estimated expected reward by default, but with probability p, choose an action at random. Some versions of this strategy start with a large value of p to encourage initial exploration, which is slowly decreased.

An objection to the simple strategy is that, when it experiments with a non-greedy action, it is no more likely to try a promising alternative than a clearly hopeless alternative. A slightly more sophisticated strategy is Boltzmann exploration. In this case, the expected reward for taking action a, ER(a), is used to choose an action probabilistically according to the distribution

P(a) = \frac{e^{ER(a)/T}}{\sum_{a' \in A} e^{ER(a')/T}}.

The temperature parameter T can be decreased over time to decrease exploration. This method works well if the best action is well separated from the others, but suffers somewhat when the values of the actions are close. It may also converge unnecessarily slowly unless the temperature schedule is manually tuned with great care.
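A sketch of Boltzmann exploration as just described; subtracting the maximum preference before exponentiating is only for numerical stability and does not change the distribution.

import math
import random

def boltzmann_action(expected_rewards, temperature):
    # P(a) is proportional to exp(ER(a) / T) over the estimated expected rewards.
    prefs = [er / temperature for er in expected_rewards]
    m = max(prefs)
    weights = [math.exp(x - m) for x in prefs]
    return random.choices(range(len(weights)), weights=weights)[0]

# With a high temperature the choice is nearly uniform; lowering the
# temperature makes the action with the highest estimated reward dominate.
print(boltzmann_action([1.0, 1.5, 0.2], temperature=0.1))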

2.2.3 Interval-based Techniques

Exploration is often more efficient when it is based on second-order information about the certainty or variance of the estimated values of actions. Kaelbling's interval estimation algorithm (1993b) stores statistics for each action a_i: w_i is the number of successes and n_i the number of trials. An action is chosen by computing the upper bound of a confidence interval on the success probability of each action and choosing the action with the highest upper bound. Smaller values of the α parameter encourage greater exploration. When payoffs are boolean, the normal approximation to the binomial distribution can be used to construct the confidence interval (though the binomial should be used for small n). Other payoff distributions can be handled using their associated statistics or with nonparametric methods. The method works very well in empirical trials. It is also related to a certain class of statistical techniques known as experiment design methods (Box & Draper), which are used for comparing multiple treatments (for example, fertilizers or drugs) to determine which treatment, if any, is best in as small a set of experiments as possible.
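A sketch of the interval-estimation choice rule for boolean payoffs, using the normal approximation to the binomial; the width multiplier z below is an illustrative stand-in for the confidence parameter discussed above (a larger z widens the interval and so encourages more exploration).

import math

def interval_estimation_action(wins, trials, z=1.96):
    # wins[i] = w_i successes and trials[i] = n_i trials for action i.
    best_action, best_upper = 0, -1.0
    for i, (w, n) in enumerate(zip(wins, trials)):
        if n == 0:
            return i                               # an untried action has an unbounded upper bound
        p_hat = w / n
        upper = p_hat + z * math.sqrt(p_hat * (1 - p_hat) / n)
        if upper > best_upper:
            best_action, best_upper = i, upper
    return best_action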

2.3 More General Problems

When there are multiple states, but reinforcement is still immediate, then any of the above solutions can be replicated, once for each state. However, when generalization is required, these solutions must be integrated with generalization methods (see Section 6); this is straightforward for the simple ad-hoc methods, but it is not understood how to maintain theoretical guarantees.

Many of these techniques focus on converging to some regime in which exploratory actions are taken rarely or never; this is appropriate when the environment is stationary. However, when the environment is non-stationary, exploration must continue to take place in order to notice changes in the world. Again, the more ad-hoc techniques can be modified to deal with this in a plausible manner (keep temperature parameters from going to 0; decay the statistics in interval estimation), but none of the theoretically guaranteed methods can be applied.

3. Delayed Reward

In the general case of the reinforcement learning problem, the agent's actions determine not only its immediate reward, but also (at least probabilistically) the next state of the environment. Such environments can be thought of as networks of bandit problems, but the agent must take into account the next state as well as the immediate reward when it decides which action to take. The model of long-run optimality the agent is using determines exactly how it should take the value of the future into account. The agent will have to be able to learn from delayed reinforcement: it may take a long sequence of actions, receiving insignificant reinforcement, then finally arrive at a state with high reinforcement. The agent must be able to learn which of its actions are desirable based on reward that can take place arbitrarily far in the future.

3.1 Markov Decision Processes

Problems with delayed reinforcement are well modeled as Markov decision processes (MDPs). An MDP consists of

a set of states S,

a set of actions A,

a reward function R : S × A → ℜ, and

a state transition function T : S × A → Π(S), where a member of Π(S) is a probability distribution over the set S (i.e., it maps states to probabilities). We write T(s, a, s') for the probability of making a transition from state s to state s' using action a.

The state transition function probabilistically specifies the next state of the environment as a function of its current state and the agent's action. The reward function specifies expected instantaneous reward as a function of the current state and action. The model is Markov if the state transitions are independent of any previous environment states or agent actions. There are many good references to MDP models (Bellman; Bertsekas; Howard; Puterman).

Although general MDPs may have infinite (even uncountable) state and action spaces, we will only discuss methods for solving finite-state and finite-action problems. In Section 6, we discuss methods for solving problems with continuous input and output spaces.
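For concreteness, the code sketches accompanying the algorithms below assume a finite MDP stored as plain Python dictionaries, as in the purely illustrative two-state example here: T[s][a] is a list of (next state, probability) pairs and R[s][a] is the expected immediate reward.

STATES = ["s0", "s1"]
ACTIONS = ["stay", "go"]
# T(s, a, s'): probability of reaching s' from s under action a.
T = {
    "s0": {"stay": [("s0", 1.0)], "go": [("s0", 0.2), ("s1", 0.8)]},
    "s1": {"stay": [("s1", 1.0)], "go": [("s0", 1.0)]},
}
# R(s, a): expected immediate reward.
R = {
    "s0": {"stay": 0.0, "go": 0.0},
    "s1": {"stay": 1.0, "go": 0.0},
}
GAMMA = 0.9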

3.2 Finding a Policy Given a Model

Before we consider algorithms for learning to behave in MDP environments, we will explore techniques for determining the optimal policy given a correct model. These dynamic programming techniques will serve as the foundation and inspiration for the learning algorithms to follow. We restrict our attention mainly to finding optimal policies for the infinite-horizon discounted model, but most of these algorithms have analogs for the finite-horizon and average-case models as well. We rely on the result that, for the infinite-horizon discounted model, there exists an optimal deterministic stationary policy (Bellman).

We will speak of the optimal value of a state: it is the expected infinite discounted sum of reward that the agent will gain if it starts in that state and executes the optimal policy. Using π as a complete decision policy, it is written

V^*(s) = \max_{\pi} E\left( \sum_{t=0}^{\infty} \gamma^{t} r_t \right).

This optimal value function is unique and can be defined as the solution to the simultaneous equations

V^*(s) = \max_{a} \left( R(s,a) + \gamma \sum_{s' \in S} T(s,a,s') V^*(s') \right), \quad \forall s \in S,

which assert that the value of a state s is the expected instantaneous reward plus the expected discounted value of the next state, using the best available action. Given the optimal value function, we can specify the optimal policy as

\pi^*(s) = \arg\max_{a} \left( R(s,a) + \gamma \sum_{s' \in S} T(s,a,s') V^*(s') \right).

3.2.1 Value Iteration

One way, then, to find an optimal policy is to find the optimal value function. It can be determined by a simple iterative algorithm called value iteration that can be shown to converge to the correct V^* values (Bellman; Bertsekas).


initialize V(s) arbitrarily
loop until policy good enough
    loop for s ∈ S
        loop for a ∈ A
            Q(s,a) := R(s,a) + γ Σ_{s'∈S} T(s,a,s') V(s')
        V(s) := max_a Q(s,a)
    end loop
end loop
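The loop above translates directly into code. The sketch below assumes the dictionary representation of T and R from the example at the end of Section 3.1 and stops when the largest change in any state's value during a sweep (the Bellman residual discussed next) falls below a threshold.

def value_iteration(states, actions, T, R, gamma, epsilon=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # full backup: Q(s,a) = R(s,a) + gamma * sum over s' of T(s,a,s') V(s')
            q = [R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a])
                 for a in actions]
            new_v = max(q)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < epsilon:                        # small Bellman residual: stop
            return V

# Example with the two-state MDP defined earlier:
# print(value_iteration(STATES, ACTIONS, T, R, GAMMA))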

It is not obvious when to stop the value iteration algorithm. One important result bounds the performance of the current greedy policy as a function of the Bellman residual of the current value function (Williams & Baird, 1993b). It says that if the maximum difference between two successive value functions is less than ε, then the value of the greedy policy (the policy obtained by choosing, in every state, the action that maximizes the estimated discounted reward, using the current estimate of the value function) differs from the value function of the optimal policy by no more than 2εγ/(1 − γ) at any state. This provides an effective stopping criterion for the algorithm. Puterman discusses another stopping criterion, based on the span semi-norm, which may result in earlier termination. Another important result is that the greedy policy is guaranteed to be optimal in some finite number of steps, even though the value function may not have converged (Bertsekas). And in practice, the greedy policy is often optimal long before the value function has converged.

Value iteration is very flexible. The assignments to V need not be done in strict order as shown above, but instead can occur asynchronously in parallel, provided that the value of every state gets updated infinitely often on an infinite run. These issues are treated extensively by Bertsekas, who also proves convergence results.

Updates based on this kind of full Bellman backup are known as full backups, since they make use of information from all possible successor states. It can be shown that updates of the form

Q(s,a) := Q(s,a) + \alpha \left( r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right)

can also be used, as long as each pairing of a and s is updated infinitely often, s' is sampled from the distribution T(s,a,·), r is sampled with mean R(s,a) and bounded variance, and the learning rate α is decreased slowly. This type of sample backup (Singh) is critical to the operation of the model-free methods discussed in the next section.

The computational complexity of the value-iteration algorithm with full backups, per iteration, is quadratic in the number of states and linear in the number of actions. Commonly, the transition probabilities T(s,a,s') are sparse. If there are on average a constant number of next states with non-zero probability, then the cost per iteration is linear in the number of states and linear in the number of actions. The number of iterations required to reach the optimal value function is polynomial in the number of states and the magnitude of the largest reward if the discount factor is held constant. However, in the worst case the number of iterations grows polynomially in 1/(1 − γ), so the convergence rate slows considerably as the discount factor approaches 1 (Littman, Dean, & Kaelbling, 1995b).


3.2.2 Policy Iteration

The policy iteration algorithm manipulates the policy directly, rather than finding it indirectly via the optimal value function. It operates as follows:

choose an arbitrary policy π'
loop
    π := π'
    compute the value function of policy π:
        solve the linear equations
            V_π(s) = R(s,π(s)) + γ Σ_{s'∈S} T(s,π(s),s') V_π(s')
    improve the policy at each state:
        π'(s) := arg max_a ( R(s,a) + γ Σ_{s'∈S} T(s,a,s') V_π(s') )
until π = π'

The value function of a policy is just the expected infinite discounted reward that will be gained, at each state, by executing that policy. It can be determined by solving a set of linear equations. Once we know the value of each state under the current policy, we consider whether the value could be improved by changing the first action taken. If it can, we change the policy to take the new action whenever it is in that situation. This step is guaranteed to strictly improve the performance of the policy. When no improvements are possible, then the policy is guaranteed to be optimal.
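A sketch of policy iteration under the same dictionary representation of the model; for brevity, the policy-evaluation step here approximates V_π by repeated backups rather than solving the linear system exactly, which the algorithm above calls for.

def policy_iteration(states, actions, T, R, gamma, eval_sweeps=500):
    policy = {s: actions[0] for s in states}       # arbitrary initial policy
    while True:
        # approximate policy evaluation: V_pi(s) = R(s,pi(s)) + gamma * sum T(s,pi(s),s') V_pi(s')
        V = {s: 0.0 for s in states}
        for _ in range(eval_sweeps):
            for s in states:
                a = policy[s]
                V[s] = R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a])
        # policy improvement: act greedily with respect to V_pi
        improved = {}
        for s in states:
            improved[s] = max(actions,
                              key=lambda a: R[s][a] +
                              gamma * sum(p * V[s2] for s2, p in T[s][a]))
        if improved == policy:
            return policy, V
        policy = improved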

Since there are at most |A|^{|S|} distinct policies, and the sequence of policies improves at each step, this algorithm terminates in at most an exponential number of iterations (Puterman). However, it is an important open question how many iterations policy iteration takes in the worst case. It is known that the running time is pseudopolynomial and that, for any fixed discount factor, there is a polynomial bound in the total size of the MDP (Littman, Dean, & Kaelbling, 1995b).

3.2.3 Enhancement to Value Iteration and Policy Iteration

In practice, value iteration is much faster per iteration, but policy iteration takes fewer iterations. Arguments have been put forth to the effect that each approach is better for large problems. Puterman's modified policy iteration algorithm (Puterman & Shin) provides a method for trading between iteration time and iteration improvement in a smoother way. The basic idea is that the expensive part of policy iteration is solving for the exact value of V_π. Instead of finding an exact value for V_π, we can perform a few steps of a modified value-iteration step where the policy is held fixed over successive iterations. This can be shown to produce an approximation to V_π that converges linearly in γ. In practice, this can result in substantial speedups.

Several standard numerical-analysis techniques that speed the convergence of dynamic programming can be used to accelerate value and policy iteration. Multigrid methods can be used to quickly seed a good initial approximation to a high-resolution value function by initially performing value iteration at a coarser resolution (Rude). State aggregation works by collapsing groups of states to a single meta-state, solving the abstracted problem (Bertsekas & Castanon).


3.3 Computational Complexity

Value iteration works by producing successive approximations of the optimal value function. Each iteration can be performed in O(|A||S|^2) steps, or faster if there is sparsity in the transition function. However, the number of iterations required can grow exponentially in the discount factor (Condon); as the discount factor approaches 1, the decisions must be based on results that happen farther and farther into the future. In practice, policy iteration converges in fewer iterations than value iteration, although the per-iteration costs of O(|A||S|^2 + |S|^3) can be prohibitive. There is no known tight worst-case bound available for policy iteration (Littman et al., 1995b). Modified policy iteration (Puterman & Shin) seeks a trade-off between cheap and effective iterations and is preferred by some practitioners (Rust).

Linear programming (Schrijver) is an extremely general problem, and MDPs can be solved by general-purpose linear-programming packages (Derman; D'Epenoux; Hoffman & Karp). An advantage of this approach is that commercial-quality linear-programming packages are available, although the time and space requirements can still be quite high. From a theoretic perspective, linear programming is the only known algorithm that can solve MDPs in polynomial time, although the theoretically efficient algorithms have not been shown to be efficient in practice.

4. Learning an Optimal Policy: Model-free Methods

In the previous section we reviewed methods for obtaining an optimal policy for an MDP, assuming that we already had a model. The model consists of knowledge of the state transition probability function T(s,a,s') and the reinforcement function R(s,a). Reinforcement learning is primarily concerned with how to obtain the optimal policy when such a model is not known in advance. The agent must interact with its environment directly to obtain information which, by means of an appropriate algorithm, can be processed to produce an optimal policy.

At this point, there are two ways to proceed.

Model-free: Learn a controller without learning a model.

Model-based: Learn a model, and use it to derive a controller.

Which approach is better? This is a matter of some debate in the reinforcement-learning community. A number of algorithms have been proposed on both sides. This question also appears in other fields, such as adaptive control, where the dichotomy is between direct and indirect adaptive control.

This section examines model-free learning, and Section 5 examines model-based methods.

The biggest problem facing a reinforcement-learning agent is temporal credit assignment. How do we know whether the action just taken is a good one, when it might have far-reaching effects? One strategy is to wait until the "end" and reward the actions taken if the result was good and punish them if the result was bad. In ongoing tasks, it is difficult to know what the "end" is, and this might require a great deal of memory. Instead, we will use insights from value iteration to adjust the estimated value of a state based on the immediate reward and the estimated value of the next state. This class of algorithms is known as temporal difference methods (Sutton). We will consider two different temporal-difference learning strategies for the discounted infinite-horizon model.

Figure 4: Architecture for the adaptive heuristic critic. (The critic, labeled AHC, maps the state s and reward r to a heuristic value v; the reinforcement-learning component, labeled RL, uses s and v to choose an action a.)

4.1 Adaptive Heuristic Critic and TD(λ)

The adaptive heuristic critic algorithm is an adaptive version of policy iteration (Barto, Sutton, & Anderson) in which the value-function computation is no longer implemented by solving a set of linear equations, but is instead computed by an algorithm called TD(0). A block diagram for this approach is given in Figure 4. It consists of two components: a critic (labeled AHC) and a reinforcement-learning component (labeled RL). The reinforcement-learning component can be an instance of any of the k-armed bandit algorithms, modified to deal with multiple states and non-stationary rewards. But instead of acting to maximize instantaneous reward, it will be acting to maximize the heuristic value, v, that is computed by the critic. The critic uses the real external reinforcement signal to learn to map states to their expected discounted values given that the policy being executed is the one currently instantiated in the RL component.

We can see the analogy with modified policy iteration if we imagine these components working in alternation. The policy π implemented by RL is fixed and the critic learns the value function V_π for that policy. Now we fix the critic and let the RL component learn a new policy π' that maximizes the new value function, and so on. In most implementations, however, both components operate simultaneously. Only the alternating implementation can be guaranteed to converge to the optimal policy, under appropriate conditions. Williams and Baird explored the convergence properties of a class of AHC-related algorithms they call "incremental variants of policy iteration" (Williams & Baird, 1993a).

It remains to explain how the critic can learn the value of a policy. We define ⟨s, a, r, s'⟩ to be an experience tuple summarizing a single transition in the environment. Here s is the agent's state before the transition, a is its choice of action, r the instantaneous reward it receives, and s' its resulting state. The value of a policy is learned using Sutton's TD(0) algorithm (Sutton), which uses the update rule

V(s) := V(s) + \alpha \left( r + \gamma V(s') - V(s) \right).

Whenever a state s is visited, its estimated value is updated to be closer to r + γV(s'), since r is the instantaneous reward received and V(s') is the estimated value of the actually occurring next state. This is analogous to the sample-backup rule from value iteration; the only difference is that the sample is drawn from the real world rather than by simulating a known model. The key idea is that r + γV(s') is a sample of the value of V(s), and it is more likely to be correct because it incorporates the real r. If the learning rate α is adjusted properly (it must be slowly decreased) and the policy is held fixed, TD(0) is guaranteed to converge to the optimal value function.
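A sketch of the critic's TD(0) rule applied to a stream of experience tuples ⟨s, a, r, s'⟩ generated by some fixed policy; how the experience is produced is not shown.

from collections import defaultdict

def td0_evaluate(experience, gamma, alpha=0.1):
    V = defaultdict(float)                         # value estimates, initially zero
    for s, a, r, s_next in experience:
        # move V(s) toward the sampled target r + gamma * V(s')
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return dict(V)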

The TD(0) rule as presented above is really an instance of a more general class of algorithms called TD(λ), with λ = 0. TD(0) looks only one step ahead when adjusting value estimates; although it will eventually arrive at the correct answer, it can take quite a while to do so. The general TD(λ) rule is similar to the TD(0) rule given above,

V(u) := V(u) + \alpha \left( r + \gamma V(s') - V(s) \right) e(u),

but it is applied to every state according to its eligibility e(u), rather than just to the immediately previous state, s. One version of the eligibility trace is defined to be

e(s) = \sum_{k=1}^{t} (\lambda \gamma)^{t-k} \delta_{s,s_k}, \quad \text{where } \delta_{s,s_k} = 1 \text{ if } s = s_k \text{ and } 0 \text{ otherwise.}

The eligibility of a state s is the degree to which it has been visited in the recent past; when a reinforcement is received, it is used to update all the states that have been recently visited, according to their eligibility. When λ = 0 this is equivalent to TD(0). When λ = 1, it is roughly equivalent to updating all the states according to the number of times they were visited by the end of a run. Note that we can update the eligibility online as follows:

e(s) := γλ e(s) + 1   if s is the current state,
e(s) := γλ e(s)       otherwise.

It is computationally more expensive to execute the general TD(λ), though it often converges considerably faster for large λ (Dayan; Dayan & Sejnowski). There has been some recent work on making the updates more efficient (Cichosz & Mulawka) and on changing the definition to make TD(λ) more consistent with the certainty-equivalent method (Singh & Sutton), which is discussed in Section 5.1.
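A sketch of the on-line form of TD(λ) using the eligibility update given above: the current state's eligibility is incremented, every eligible state is moved by the same TD error in proportion to its eligibility, and all eligibilities then decay by γλ.

from collections import defaultdict

def td_lambda_evaluate(experience, gamma, lam, alpha=0.1):
    V = defaultdict(float)
    e = defaultdict(float)                         # eligibility traces
    for s, a, r, s_next in experience:
        delta = r + gamma * V[s_next] - V[s]       # one-step TD error
        e[s] += 1.0                                # current state: e(s) := gamma*lambda*e(s) + 1
        for u in list(e):
            V[u] += alpha * delta * e[u]
            e[u] *= gamma * lam                    # decay every eligibility
    return dict(V)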

4.2 Q-learning

The work of the two components of AHC can be accomplished in a unified manner by Watkins' Q-learning algorithm (Watkins; Watkins & Dayan). Q-learning is typically easier to implement. In order to understand Q-learning, we have to develop some additional notation. Let Q^*(s,a) be the expected discounted reinforcement of taking action a in state s, then continuing by choosing actions optimally. Note that V^*(s) is the value of s assuming the best action is taken initially, and so V^*(s) = \max_{a'} Q^*(s,a'). Q^*(s,a) can hence be written recursively as

Q^*(s,a) = R(s,a) + \gamma \sum_{s' \in S} T(s,a,s') \max_{a'} Q^*(s',a').

Note also that, since V^*(s) = \max_{a'} Q^*(s,a'), we have \pi^*(s) = \arg\max_{a} Q^*(s,a) as an optimal policy.

Because the Q function makes the action explicit, we can estimate the Q values on-line using a method essentially the same as TD(0), but also use them to define the policy, because an action can be chosen just by taking the one with the maximum Q value for the current state.

The Q-learning rule is

Q(s,a) := Q(s,a) + \alpha \left( r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right),

where ⟨s, a, r, s'⟩ is an experience tuple as described earlier. If each action is executed in each state an infinite number of times on an infinite run and α is decayed appropriately, the Q values will converge with probability 1 to Q^* (Watkins; Tsitsiklis; Jaakkola, Jordan, & Singh). Q-learning can also be extended to update states that occurred more than one step previously, as in TD(λ) (Peng & Williams).
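A sketch of the Q-learning backup, together with a simple epsilon-greedy action chooser of the kind discussed in Section 2.2; Q is assumed to be a defaultdict(float) keyed by (state, action) pairs.

import random
from collections import defaultdict

def q_backup(Q, s, a, r, s_next, actions, alpha, gamma):
    # Q(s,a) := Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    # explore with probability epsilon, otherwise act greedily on the current Q values
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

Q = defaultdict(float)                             # example container, filled in as experience arrives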

When the Q values are nearly converged to their optimal values, it is appropriate for the agent to act greedily, taking, in each situation, the action with the highest Q value. During learning, however, there is a difficult exploitation versus exploration trade-off to be made. There are no good, formally justified approaches to this problem in the general case; standard practice is to adopt one of the ad hoc methods discussed in Section 2.2.

AHC architectures seem to be more difficult to work with than Q-learning on a practical level. It can be hard to get the relative learning rates right in AHC so that the two components converge together. In addition, Q-learning is exploration insensitive: that is, the Q values will converge to the optimal values, independent of how the agent behaves while the data is being collected (as long as all state-action pairs are tried often enough). This means that, although the exploration-exploitation issue must be addressed in Q-learning, the details of the exploration strategy will not affect the convergence of the learning algorithm. For these reasons, Q-learning is the most popular and seems to be the most effective model-free algorithm for learning from delayed reinforcement. It does not, however, address any of the issues involved in generalizing over large state and/or action spaces. In addition, it may converge quite slowly to a good policy.

4.3 Model-free Learning With Average Reward

As described, Q-learning can be applied to discounted infinite-horizon MDPs. It can also be applied to undiscounted problems as long as the optimal policy is guaranteed to reach a reward-free absorbing state and the state is periodically reset.

Schwartz examined the problem of adapting Q-learning to an average-reward framework. Although his R-learning algorithm seems to exhibit convergence problems for some MDPs, several researchers have found the average-reward criterion closer to the true problem they wish to solve than a discounted criterion and therefore prefer R-learning to Q-learning (Mahadevan).

With that in mind, researchers have studied the problem of learning optimal average-reward policies. Mahadevan surveyed model-based average-reward algorithms from a reinforcement-learning perspective and found several difficulties with existing algorithms. In particular, he showed that existing reinforcement-learning algorithms for average reward (and some dynamic programming algorithms) do not always produce bias-optimal policies. Jaakkola, Jordan and Singh described an average-reward learning algorithm with guaranteed convergence properties. It uses a Monte Carlo component to estimate the expected future reward for each state as the agent moves through the environment. In addition, Bertsekas presents a Q-learning-like algorithm for average-case reward in his new textbook. Although this recent work provides a much needed theoretical foundation to this area of reinforcement learning, many important problems remain unsolved.

5. Computing Optimal Policies by Learning Models

The previous section showed how it is possible to learn an optimal policy without knowing the models T(s,a,s') or R(s,a) and without even learning those models en route. Although many of these methods are guaranteed to find optimal policies eventually and use very little computation time per experience, they make extremely inefficient use of the data they gather, and therefore often require a great deal of experience to achieve good performance. In this section we still begin by assuming that we don't know the models in advance, but we examine algorithms that do operate by learning these models. These algorithms are especially important in applications in which computation is considered to be cheap and real-world experience costly.

5.1 Certainty-Equivalent Methods

We begin with the most conceptually straightforward method: first, learn the T and R functions by exploring the environment and keeping statistics about the results of each action; next, compute an optimal policy using one of the methods of Section 3.2. This method is known as certainty equivalence (Kumar & Varaiya).

There are some serious objections to this method:

It makes an arbitrary division between the learning phase and the acting phase.

How should it gather data about the environment initially? Random exploration might be dangerous, and in some environments is an immensely inefficient method of gathering data, requiring exponentially more data (Whitehead) than a system that interleaves experience gathering with policy-building more tightly (Koenig & Simmons). See Figure 5 for an example.

The possibility of changes in the environment is also problematic. Breaking up an agent's life into a pure learning phase and a pure acting phase entails a considerable risk that the optimal controller based on early life becomes, without detection, a suboptimal controller if the environment changes.

A variation on this idea is certainty equivalence, in which the model is learned continually through the agent's lifetime and, at each step, the current model is used to compute an optimal policy and value function. This method makes very effective use of available data, but still ignores the question of exploration and is extremely computationally demanding, even for fairly small state spaces. Fortunately, there are a number of other model-based algorithms that are more practical.

5.2 Dyna

Sutton's Dyna architecture exploits a middle ground, yielding strategies that are both more effective than model-free learning and more computationally efficient than the certainty-equivalence approach. It simultaneously uses experience to build a model (\hat{T} and \hat{R}), uses experience to adjust the policy, and uses the model to adjust the policy.

Figure 5: In this environment, due to Whitehead, random exploration would take O(2^n) steps to reach the goal even once, whereas a more intelligent exploration strategy (e.g., "assume any untried action leads directly to the goal") would require only O(n^2) steps. (The figure shows a corridor of states 1, 2, 3, ..., n followed by a goal state.)

Dyna operates in a loop of interaction with the environment. Given an experience tuple ⟨s, a, s', r⟩, it behaves as follows:

1. Update the model, incrementing statistics for the transition from s to s' on action a and for receiving reward r for taking action a in state s. The updated models are \hat{T} and \hat{R}.

2. Update the policy at state s based on the newly updated model, using the rule

Q(s,a) := \hat{R}(s,a) + \gamma \sum_{s'} \hat{T}(s,a,s') \max_{a'} Q(s',a'),

which is a version of the value-iteration update for Q values.

3. Perform k additional updates: choose k state-action pairs at random and update them according to the same rule as before:

Q(s_k,a_k) := \hat{R}(s_k,a_k) + \gamma \sum_{s'} \hat{T}(s_k,a_k,s') \max_{a'} Q(s',a').

4. Choose an action a' to perform in state s', based on the Q values, but perhaps modified by an exploration strategy.

The Dyna algorithm requires about k times the computation of Q-learning per instance, but this is typically vastly less than for the naive model-based method. A reasonable value of k can be determined based on the relative speeds of computation and of taking action.

Figure 6 shows a grid world in which, in each cell, the agent has four actions (N, S, E, W) and transitions are made deterministically to an adjacent cell, unless there is a block, in which case no movement occurs. As we will see in Table 1, Dyna requires an order of magnitude fewer steps of experience than does Q-learning to arrive at an optimal policy. Dyna requires about six times more computational effort, however.
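A sketch of one pass through the Dyna loop, with the model \hat{T} and \hat{R} estimated from transition counts and reward sums; the class structure and the default k are illustrative choices.

import random
from collections import defaultdict

class DynaAgent:
    def __init__(self, actions, gamma, k=10):
        self.actions, self.gamma, self.k = actions, gamma, k
        self.Q = defaultdict(float)
        self.next_counts = defaultdict(lambda: defaultdict(int))   # transition counts for each (s, a)
        self.reward_sum = defaultdict(float)                       # summed rewards for each (s, a)
        self.visits = defaultdict(int)                             # visit counts for each (s, a)

    def _backup(self, s, a):
        # value-iteration style update on the estimated model
        n = self.visits[(s, a)]
        r_hat = self.reward_sum[(s, a)] / n
        future = sum((c / n) * max(self.Q[(s2, a2)] for a2 in self.actions)
                     for s2, c in self.next_counts[(s, a)].items())
        self.Q[(s, a)] = r_hat + self.gamma * future

    def experience(self, s, a, r, s_next):
        # 1. update the model statistics
        self.next_counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r
        self.visits[(s, a)] += 1
        # 2. update the policy at the state just visited
        self._backup(s, a)
        # 3. k additional updates at randomly chosen, previously seen state-action pairs
        seen = list(self.visits)
        for _ in range(self.k):
            s0, a0 = random.choice(seen)
            self._backup(s0, a0)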


Figure 6: A grid world. This was formulated as a shortest-path reinforcement-learning problem, which yields the same result as if a reward of 1 is given at the goal, a reward of zero elsewhere, and a discount factor is used.

                        Steps before     Backups before
                        convergence      convergence
Q-learning
Dyna
prioritized sweeping

Table 1: The performance of three algorithms described in the text. All methods used the exploration heuristic of optimism in the face of uncertainty: any state not previously visited was assumed by default to be a goal state. Q-learning used its optimal learning rate parameter for a deterministic maze. Dyna and prioritized sweeping were permitted to take k backups per transition. For prioritized sweeping, the priority queue often emptied before all backups were used.


5.3 Prioritized Sweeping / Queue-Dyna

Although Dyna is a great improvement on previous methods, it suffers from being relatively undirected. It is particularly unhelpful when the goal has just been reached or when the agent is stuck in a dead end; it continues to update random state-action pairs, rather than concentrating on the "interesting" parts of the state space. These problems are addressed by prioritized sweeping (Moore & Atkeson) and Queue-Dyna (Peng & Williams), which are two independently-developed but very similar techniques. We will describe prioritized sweeping in some detail.

The algorithm is similar to Dyna, except that updates are no longer chosen at random and values are now associated with states (as in value iteration) instead of state-action pairs (as in Q-learning). To make appropriate choices, we must store additional information in the model. Each state remembers its predecessors: the states that have a non-zero transition probability to it under some action. In addition, each state has a priority, initially set to zero.

Instead of updating $k$ random state-action pairs, prioritized sweeping updates the $k$ states with the highest priority. For each high-priority state $s$, it works as follows:

1. Remember the current value of the state: $V_{old} := V(s)$.

2. Update the state's value:
$$V(s) := \max_a \left( \hat{R}(s,a) + \gamma \sum_{s'} \hat{T}(s,a,s') V(s') \right) .$$

3. Set the state's priority back to zero.

4. Compute the value change $\Delta = |V_{old} - V(s)|$.

5. Use $\Delta$ to modify the priorities of the predecessors of $s$.

If we have updated the $V$ value for state $s'$ and it has changed by amount $\Delta$, then the immediate predecessors of $s'$ are informed of this event. Any state $s$ for which there exists an action $a$ such that $\hat{T}(s,a,s') \neq 0$ has its priority promoted to $\Delta \cdot \hat{T}(s,a,s')$, unless its priority already exceeded that value.
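A minimal sketch of the priority bookkeeping described above, assuming a tabular model ($\hat{T}$, $\hat{R}$), a precomputed predecessor list, and a heap-based priority queue; the helper names, the small enqueueing threshold, and the use of duplicate heap entries in place of in-place priority promotion are simplifying assumptions.

```python
import heapq

def prioritized_sweep(V, T_hat, R_hat, preds, queue, gamma=0.9, k=10, eps=1e-4):
    """Perform up to k backups on the highest-priority states.

    V: dict state -> value; T_hat[(s, a, s2)]: estimated transition probability;
    R_hat[(s, a)]: estimated reward (assumed defined for every state-action pair);
    preds[s2]: list of (s, a) pairs with nonzero T_hat(s, a, s2);
    queue: heap of (-priority, state); states are assumed comparable for tie-breaks.
    """
    actions = {a for (_, a) in R_hat}
    for _ in range(k):
        if not queue:
            break                         # the queue often empties before k backups
        _, s = heapq.heappop(queue)
        v_old = V[s]
        # Value-iteration backup at s.
        V[s] = max(R_hat[(s, a)] +
                   gamma * sum(T_hat.get((s, a, s2), 0.0) * V[s2] for s2 in V)
                   for a in actions)
        delta = abs(v_old - V[s])
        # Promote the priorities of the predecessors of s.
        for (p, a) in preds.get(s, []):
            priority = delta * T_hat.get((p, a, s), 0.0)
            if priority > eps:
                # Pushing a duplicate entry stands in for raising an existing priority.
                heapq.heappush(queue, (-priority, p))
```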

The global behavior of this algorithm is that when a real-world transition is surprising (the agent happens upon a goal state, for instance), lots of computation is directed to propagate this new information back to relevant predecessor states. When the real-world transition is boring (the actual result is very similar to the predicted result), computation continues in the most deserving part of the space.

Running prioritized sweeping on the problem in Figure, we see a large improvement over Dyna. The optimal policy is reached in about half the number of steps of experience and one-third the computation as Dyna required (and therefore about 20 times fewer steps and twice the computational effort of Q-learning).


Other Model-Based Methods

Methods proposed for solving MDPs given a model can be used in the context of model-based methods as well.

RTDP (real-time dynamic programming) (Barto, Bradtke, & Singh) is another model-based method that uses Q-learning to concentrate computational effort on the areas of the state space that the agent is most likely to occupy. It is specific to problems in which the agent is trying to achieve a particular goal state and the reward everywhere else is zero. By taking into account the start state, it can find a short path from the start to the goal, without necessarily visiting the rest of the state space.
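A minimal sketch of that real-time flavor, under the assumptions above (a goal state, zero reward elsewhere, and an available model): value backups are performed only at the states the greedy policy actually visits, so states far from the start-to-goal path may never be touched. The model interface and parameters are illustrative assumptions.

```python
import random

def rtdp_trial(start, goal, actions, transitions, reward, V, gamma=0.95, max_steps=1000):
    """One RTDP trial. transitions(s, a) -> list of (probability, next_state);
    reward(s, a) -> immediate reward; V: dict state -> current value estimate."""
    s = start
    for _ in range(max_steps):
        if s == goal:
            break
        def q(a):
            return reward(s, a) + gamma * sum(p * V.get(s2, 0.0)
                                              for p, s2 in transitions(s, a))
        best_a = max(actions, key=q)
        V[s] = q(best_a)                       # backup only at the visited state
        outcomes = transitions(s, best_a)      # follow the greedy action stochastically
        s = random.choices([s2 for _, s2 in outcomes],
                           weights=[p for p, _ in outcomes])[0]
    return V
```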

The Plexus planning system (Dean, Kaelbling, Kirman, & Nicholson; Kirman) exploits a similar intuition. It starts by making an approximate version of the MDP which is much smaller than the original one. The approximate MDP contains a set of states, called the envelope, that includes the agent's current state and the goal state, if there is one. States that are not in the envelope are summarized by a single "out" state. The planning process is an alternation between finding an optimal policy on the approximate MDP and adding useful states to the envelope. Action may take place in parallel with planning, in which case irrelevant states are also pruned out of the envelope.

Generalization

All of the previous discussion has tacitly assumed that it is possible to enumerate the state and action spaces and store tables of values over them. Except in very small environments, this means impractical memory requirements. It also makes inefficient use of experience. In a large, smooth state space we generally expect similar states to have similar values and similar optimal actions. Surely, therefore, there should be some more compact representation than a table. Most problems will have continuous or large discrete state spaces; some will have large or continuous action spaces. The problem of learning in large spaces is addressed through generalization techniques, which allow compact storage of learned information and transfer of knowledge between similar states and actions.

The large literature of generalization techniques from inductive concept learning can be applied to reinforcement learning. However, techniques often need to be tailored to specific details of the problem. In the following sections, we explore the application of standard function-approximation techniques, adaptive resolution models, and hierarchical methods to the problem of reinforcement learning.

The reinforcement-learning architectures and algorithms discussed above have included the storage of a variety of mappings, including $S \rightarrow A$ (policies), $S \rightarrow \mathbb{R}$ (value functions), $S \times A \rightarrow \mathbb{R}$ ($Q$ functions and rewards), $S \times A \rightarrow S$ (deterministic transitions), and $S \times A \times S \rightarrow [0,1]$ (transition probabilities). Some of these mappings, such as transitions and immediate rewards, can be learned using straightforward supervised learning, and can be handled using any of the wide variety of function-approximation techniques for supervised learning that support noisy training examples. Popular techniques include various neural-network methods (Rumelhart & McClelland; Berenji; Lee), CMAC (Albus), and local memory-based methods (Moore, Atkeson, & Schaal), such as generalizations of nearest neighbor methods. Other mappings, especially the policy mapping, typically need specialized algorithms, because training sets of input-output pairs are not available.

Generalization over Input

A reinforcement-learning agent's current state plays a central role in its selection of reward-maximizing actions. Viewing the agent as a state-free black box, a description of the current state is its input. Depending on the agent architecture, its output is either an action or an evaluation of the current state that can be used to select an action. The problem of deciding how the different aspects of an input affect the value of the output is sometimes called the "structural credit-assignment" problem. This section examines approaches to generating actions or evaluations as a function of a description of the agent's current state.

The first group of techniques covered here is specialized to the case when reward is not delayed; the second group is more generally applicable.

Immediate Reward

When the agent's actions do not influence state transitions, the resulting problem becomes one of choosing actions to maximize immediate reward as a function of the agent's current state. These problems bear a resemblance to the bandit problems discussed in Section, except that the agent should condition its action selection on the current state. For this reason, this class of problems has been described as associative reinforcement learning.

The algorithms in this section address the problem of learning from immediate boolean reinforcement, where the state is vector valued and the action is a boolean vector. Such algorithms can and have been used in the context of delayed reinforcement, for instance as the RL component in the AHC architecture described in Section. They can also be generalized to real-valued reward through reward-comparison methods (Sutton).

CRBP: The complementary reinforcement backpropagation algorithm (Ackley & Littman) (crbp) consists of a feed-forward network mapping an encoding of the state to an encoding of the action. The action is determined probabilistically from the activation of the output units: if output unit $i$ has activation $y_i$, then bit $i$ of the action vector has value 1 with probability $y_i$, and 0 otherwise. Any neural-network supervised training procedure can be used to adapt the network as follows. If the result of generating action $a$ is $r = 1$, then the network is trained with input-output pair $\langle s, a \rangle$. If the result is $r = 0$, then the network is trained with input-output pair $\langle s, \bar{a} \rangle$, where $\bar{a} = (1 - a_1, \ldots, 1 - a_n)$.

The idea behind this training rule is that whenever an action fails to generate reward, crbp will try to generate an action that is different from the current choice. Although it seems like the algorithm might oscillate between an action and its complement, that does not happen. One step of training a network will only change the action slightly, and since the output probabilities will tend to move toward 0.5, this makes action selection more random and increases search. The hope is that the random distribution will generate an action that works better, and then that action will be reinforced.
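A minimal sketch of this complementary training rule, assuming boolean reward and a single-layer network with sigmoid outputs trained by an ordinary delta-rule step; the tiny network, sizes, and learning rate are illustrative assumptions, not Ackley and Littman's original architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 8, 4                     # sizes of the state and action encodings (assumed)
W = rng.normal(scale=0.1, size=(n_out, n_in))

def forward(s):
    return 1.0 / (1.0 + np.exp(-W @ s))           # output activations y_i in (0, 1)

def act(s):
    y = forward(s)
    return (rng.random(n_out) < y).astype(float)  # bit i is 1 with probability y_i

def crbp_update(s, a, r, lr=0.1):
    """Train toward a if rewarded, toward the complement of a otherwise."""
    global W
    target = a if r == 1 else 1.0 - a
    y = forward(s)
    # One supervised (delta-rule / logistic-regression) step toward the target.
    W += lr * np.outer(target - y, s)
```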

ARC: The associative reinforcement comparison (arc) algorithm (Sutton) is an instance of the AHC architecture for the case of boolean actions, consisting of two feed-forward networks. One learns the value of situations; the other learns a policy. These can be simple linear networks or can have hidden units.

In the simplest case, the entire system learns only to optimize immediate reward. First, let us consider the behavior of the network that learns the policy, a mapping from a vector describing $s$ to a 0 or 1. If the output unit has activation $y$, then $a$, the action generated, will be 1 if $y + \nu > 0$, where $\nu$ is normal noise, and 0 otherwise.

The adjustment for the output unit is, in the simplest case,
$$e = r \left( a - \tfrac{1}{2} \right) ,$$
where the first factor is the reward received for taking the most recent action and the second encodes which action was taken. The actions are encoded as 0 and 1, so the second factor always has the same magnitude; if the reward and the action encoding have the same sign, then action 1 will be made more likely, otherwise action 0 will be.

As described, the network will tend to seek actions that give positive reward. To extend this approach to maximize reward, we can compare the reward to some baseline, $b$. This changes the adjustment to
$$e = (r - b) \left( a - \tfrac{1}{2} \right) ,$$
where $b$ is the output of the second network. The second network is trained in a standard supervised mode to estimate $r$ as a function of the input state $s$.

Variations of this approach have been used in a variety of applications (Anderson; Barto et al.; Lin b; Sutton).
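A minimal sketch of the two-network arrangement just described, using simple linear units; the learning rates and the Gaussian noise scale are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8                        # length of the state vector (assumed)
w_policy = np.zeros(n)       # policy network (single linear unit)
w_value = np.zeros(n)        # baseline network, trained to predict r

def act(s, noise=0.5):
    y = w_policy @ s
    return 1.0 if y + rng.normal(scale=noise) > 0 else 0.0

def arc_update(s, a, r, lr_policy=0.1, lr_value=0.1):
    b = w_value @ s                        # predicted reward (the baseline)
    e = (r - b) * (a - 0.5)                # reward-comparison adjustment
    w_policy += lr_policy * e * s          # push the output toward the rewarded action
    w_value += lr_value * (r - b) * s      # standard supervised update of the baseline
```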

REINFORCE Algorithms: Williams studied the problem of choosing actions to maximize immediate reward. He identified a broad class of update rules that perform gradient ascent on the expected reward and showed how to integrate these rules with backpropagation. This class, called reinforce algorithms, includes linear reward-inaction (Section) as a special case.

The generic reinforce update for a parameter $w_{ij}$ can be written
$$\Delta w_{ij} = \alpha_{ij} (r - b_{ij}) \frac{\partial \ln g_i}{\partial w_{ij}} ,$$
where $\alpha_{ij}$ is a non-negative factor, $r$ the current reinforcement, $b_{ij}$ a reinforcement baseline, and $g_i$ is the probability density function used to randomly generate actions based on unit activations. Both $\alpha_{ij}$ and $b_{ij}$ can take on different values for each $w_{ij}$; however, when $\alpha_{ij}$ is constant throughout the system, the expected update is exactly in the direction of the expected reward gradient. Otherwise, the update is in the same half space as the gradient, but not necessarily in the direction of steepest increase.

Williams points out that the choice of baseline, $b_{ij}$, can have a profound effect on the convergence speed of the algorithm.
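A minimal sketch of this update for a single Bernoulli output unit with a sigmoid activation, for which $\partial \ln g / \partial w_j$ takes the familiar closed form $(a - y)\,x_j$; the constant step size and running-average baseline are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
w = np.zeros(n)
baseline = 0.0

def reinforce_step(x, reward_fn, alpha=0.05, beta=0.01):
    """One REINFORCE update for a single stochastic (Bernoulli) output unit."""
    global w, baseline
    y = 1.0 / (1.0 + np.exp(-w @ x))        # probability of emitting action 1
    a = 1.0 if rng.random() < y else 0.0    # sample the action
    r = reward_fn(a)
    # d/dw of ln g(a; y) for a Bernoulli unit with sigmoid activation is (a - y) * x.
    w += alpha * (r - baseline) * (a - y) * x
    baseline += beta * (r - baseline)       # slowly track average reward as the baseline
    return a, r
```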

Logic-Based Methods: Another strategy for generalization in reinforcement learning is to reduce the learning problem to an associative problem of learning boolean functions. A boolean function has a vector of boolean inputs and a single boolean output. Taking inspiration from mainstream machine learning work, Kaelbling developed two algorithms for learning boolean functions from reinforcement: one uses the bias of $k$-DNF to drive the generalization process (Kaelbling b); the other searches the space of syntactic descriptions of functions using a simple generate-and-test method (Kaelbling a).

The restriction to a single boolean output makes these techniques difficult to apply. In very benign learning situations, it is possible to extend this approach to use a collection of learners to independently learn the individual bits that make up a complex output. In general, however, that approach suffers from the problem of very unreliable reinforcement: if a single learner generates an inappropriate output bit, all of the learners receive a low reinforcement value. A further method (Kaelbling b) allows a collection of learners to be trained collectively to generate appropriate joint outputs; it is considerably more reliable, but can require additional computational effort.

Delayed Reward

Another method to allow reinforcement-learning techniques to be applied in large state spaces is modeled on value iteration and Q-learning. Here, a function approximator is used to represent the value function by mapping a state description to a value.

Many researchers have experimented with this approach: Boyan and Moore used local memory-based methods in conjunction with value iteration; Lin used backpropagation networks for Q-learning; Watkins used CMAC for Q-learning; Tesauro used backpropagation for learning the value function in backgammon (described in Section); Zhang and Dietterich used backpropagation and TD(λ) to learn good strategies for job-shop scheduling.
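A minimal sketch of this general recipe, here a gradient-style Q-learning update on a linear approximator over state features; the feature map, step size, and discount are illustrative assumptions, and, as the next paragraphs discuss, such combinations are not guaranteed to converge.

```python
import numpy as np

def q_value(theta, phi, s, a):
    """theta: dict mapping each discrete action to a NumPy weight vector;
    phi: function mapping a state to a NumPy feature vector."""
    return float(theta[a] @ phi(s))

def q_update(theta, phi, s, a, r, s2, actions, alpha=0.05, gamma=0.95):
    """One Q-learning step with a linear function approximator."""
    target = r + gamma * max(q_value(theta, phi, s2, a2) for a2 in actions)
    td_error = target - q_value(theta, phi, s, a)
    theta[a] += alpha * td_error * phi(s)   # gradient of the squared TD error w.r.t. theta[a]
    return td_error
```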

Although there have been some positive examples, in general there are unfortunate interactions between function approximation and the learning rules. In discrete environments there is a guarantee that any operation that updates the value function (according to the Bellman equations) can only reduce the error between the current value function and the optimal value function. This guarantee no longer holds when generalization is used. These issues are discussed by Boyan and Moore, who give some simple examples of value function errors growing arbitrarily large when generalization is used with value iteration. Their solution to this, applicable only to certain classes of problems, discourages such divergence by only permitting updates whose estimated values can be shown to be near-optimal via a battery of Monte-Carlo experiments.

Thrun and Schwartz theorize that function approximation of value functions is also dangerous because the errors in value functions due to generalization can become compounded by the max operator in the definition of the value function.

Several recent results (Gordon; Tsitsiklis & Van Roy) show how the appropriate choice of function approximator can guarantee convergence, though not necessarily to the optimal values. Baird's residual gradient technique (Baird) provides guaranteed convergence to locally optimal solutions.

Perhaps the gloominess of these counterexamples is misplaced. Boyan and Moore report that their counterexamples can be made to work with problem-specific hand-tuning, despite the unreliability of untuned algorithms that provably converge in discrete domains. Sutton shows how modified versions of Boyan and Moore's examples can converge successfully. An open question is whether general principles, ideally supported by theory, can help us understand when value function approximation will succeed. In Sutton's comparative experiments with Boyan and Moore's counterexamples, he changes four aspects of the experiments:

1. Small changes to the task specifications.

2. A very different kind of function approximator, CMAC (Albus), that has weak generalization.

3. A different learning algorithm: sarsa (Rummery & Niranjan) instead of value iteration (see the sketch below).

4. A different training regime. Boyan and Moore sampled states uniformly in state space, whereas Sutton's method sampled along empirical trajectories.

There are intuitive reasons to believe that the fourth factor is particularly important, but more careful research is needed.
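For reference, a minimal sketch of the sarsa update mentioned above: unlike the Q-learning rule sketched earlier, the target uses the action actually taken in the next state. The linear feature representation, step size, and discount are illustrative assumptions.

```python
def sarsa_update(theta, phi, s, a, r, s2, a2, alpha=0.05, gamma=0.95):
    """One sarsa step: on-policy TD update toward r + gamma * Q(s2, a2).

    theta: dict mapping each action to a NumPy weight vector; phi: state feature map.
    """
    target = r + gamma * float(theta[a2] @ phi(s2))
    td_error = target - float(theta[a] @ phi(s))
    theta[a] += alpha * td_error * phi(s)
    return td_error
```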

Adaptive Resolution Models: In many cases, what we would like to do is partition the environment into regions of states that can be considered the same for the purposes of learning and generating actions. Without detailed prior knowledge of the environment, it is very difficult to know what granularity or placement of partitions is appropriate. This problem is overcome in methods that use adaptive resolution; during the course of learning, a partition is constructed that is appropriate to the environment.

Decision Trees: In environments that are characterized by a set of boolean or discrete-valued variables, it is possible to learn compact decision trees for representing Q values. The G-learning algorithm (Chapman & Kaelbling) works as follows. It starts by assuming that no partitioning is necessary and tries to learn Q values for the entire environment as if it were one state. In parallel with this process, it gathers statistics based on individual input bits; it asks whether there is some bit $b$ in the state description such that the Q values for states in which $b = 1$ are significantly different from the Q values for states in which $b = 0$. If such a bit is found, it is used to split the decision tree. Then the process is repeated in each of the leaves. This method was able to learn very small representations of the Q function in the presence of an overwhelming number of irrelevant, noisy state attributes. It outperformed Q-learning with backpropagation in a simple video-game environment and was used by McCallum (in conjunction with other techniques for dealing with partial observability) to learn behaviors in a complex driving simulator. It cannot, however, acquire partitions in which attributes are only significant in combination (such as those needed to solve parity problems).
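A minimal sketch of the per-bit relevance test at the heart of such an algorithm, assuming that backed-up Q values observed at a leaf are collected into two samples according to each input bit and compared with a simple t statistic; the threshold and the use of Welch's t-test are illustrative assumptions, not Chapman and Kaelbling's exact statistic.

```python
import math

def split_bit(samples, n_bits, threshold=3.0):
    """samples: list of (state_bits, backed_up_q). Return a bit worth splitting on, or None."""
    for b in range(n_bits):
        ones = [q for bits, q in samples if bits[b] == 1]
        zeros = [q for bits, q in samples if bits[b] == 0]
        if len(ones) < 2 or len(zeros) < 2:
            continue
        mean1, mean0 = sum(ones) / len(ones), sum(zeros) / len(zeros)
        var1 = sum((q - mean1) ** 2 for q in ones) / (len(ones) - 1)
        var0 = sum((q - mean0) ** 2 for q in zeros) / (len(zeros) - 1)
        se = math.sqrt(var1 / len(ones) + var0 / len(zeros))
        if se > 0 and abs(mean1 - mean0) / se > threshold:
            return b   # Q values differ significantly with this bit: split the tree here
    return None
```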

Variable Resolution Dynamic Programming: The VRDP algorithm (Moore) enables conventional dynamic programming to be performed in real-valued multivariate state spaces where straightforward discretization would fall prey to the curse of dimensionality. A kd-tree (similar to a decision tree) is used to partition state space into coarse regions. The coarse regions are refined into detailed regions, but only in parts of the state space which are predicted to be important. This notion of importance is obtained by running "mental trajectories" through state space. This algorithm proved effective on a number of problems for which full high-resolution arrays would have been impractical. It has the disadvantage of requiring a guess at an initially valid trajectory through state space.

Figure: (a) A two-dimensional maze problem. The point robot must find a path from start to goal without crossing any of the barrier lines. (b) The path taken by PartiGame during the entire first trial. It begins with intense exploration to find a route out of the almost entirely enclosed start region. Having eventually reached a sufficiently high resolution, it discovers the gap and proceeds greedily towards the goal, only to be temporarily blocked by the goal's barrier region. (c) The second trial.

PartiGame Algorithm: Moore's PartiGame algorithm (Moore) is another solution to the problem of learning to achieve goal configurations in deterministic high-dimensional continuous spaces by learning an adaptive-resolution model. It also divides the environment into cells, but in each cell the actions available consist of aiming at the neighboring cells; this aiming is accomplished by a local controller, which must be provided as part of the problem statement. The graph of cell transitions is solved for shortest paths in an online, incremental manner, but a minimax criterion is used to detect when a group of cells is too coarse to prevent movement between obstacles or to avoid limit cycles. The offending cells are split to higher resolution. Eventually, the environment is divided up just enough to choose appropriate actions for achieving the goal, but no unnecessary distinctions are made. An important feature is that, as well as reducing memory and computational requirements, it also structures exploration of state space in a multi-resolution manner. Given a failure, the agent will initially try something very different to rectify the failure, and only resort to small local changes when all the qualitatively different strategies have been exhausted.

Figure (a) shows a two-dimensional continuous maze. Figure (b) shows the performance of a robot using the PartiGame algorithm during the very first trial. Figure (c) shows the second trial, started from a slightly different position.

This is a very fast algorithm, learning policies in spaces of up to nine dimensions in less than a minute. The restriction of the current implementation to deterministic environments limits its applicability, however. McCallum suggests some related tree-structured methods.


Generalization over Actions

The networks described in Section generalize over state descriptions presented as inputs. They also produce outputs in a discrete, factored representation and thus could be seen as generalizing over actions as well.

In cases such as this, when actions are described combinatorially, it is important to generalize over actions to avoid keeping separate statistics for the huge number of actions that can be chosen. In continuous action spaces, the need for generalization is even more pronounced.

When estimating Q values using a neural network, it is possible to use either a distinct network for each action, or a network with a distinct output for each action. When the action space is continuous, neither approach is possible. An alternative strategy is to use a single network with both the state and action as input and the Q value as output. Training such a network is not conceptually difficult, but using the network to find the optimal action can be a challenge. One method is to do a local gradient-ascent search on the action in order to find one with high value (Baird & Klopf).
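A minimal sketch of that action-selection step for a Q function taking both state and action as input; in practice the gradient with respect to the action would come from backpropagating through the Q network, but here a gradient function is simply passed in, and the toy quadratic Q surface in the example is purely illustrative.

```python
import numpy as np

def best_action(q, dq_da, s, a0, steps=50, lr=0.1, low=-1.0, high=1.0):
    """Local gradient ascent on the action input of a state-action value function.

    q(s, a) -> scalar value; dq_da(s, a) -> gradient of q with respect to a
    (normally supplied by backpropagating through the Q network).
    """
    a = np.clip(np.array(a0, dtype=float), low, high)
    for _ in range(steps):
        a = np.clip(a + lr * dq_da(s, a), low, high)   # climb the Q surface in action space
    return a

if __name__ == "__main__":
    # Toy example: Q is maximized at a fixed target action (illustrative only).
    target = np.array([0.3, -0.5])
    q = lambda s, a: -float(np.sum((a - target) ** 2))
    dq_da = lambda s, a: -2.0 * (a - target)
    print(best_action(q, dq_da, s=None, a0=[0.0, 0.0]))
```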

Gullapalli has developed a "neural" reinforcement-learning unit for use in continuous action spaces. The unit generates actions with a normal distribution; it adjusts the mean and variance based on previous experience. When the chosen actions are not performing well, the variance is high, resulting in exploration of the range of choices. When an action performs well, the mean is moved in that direction and the variance decreased, resulting in a tendency to generate more action values near the successful one. This method was successfully employed to learn to control a robot arm with many continuous degrees of freedom.
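A minimal sketch in the spirit of such a unit: actions are drawn from a normal distribution whose mean is nudged toward actions that beat the recent average reward and whose standard deviation shrinks as performance improves. The update constants and the running-average reward estimate are illustrative assumptions, not Gullapalli's exact rule.

```python
import numpy as np

class GaussianActionUnit:
    def __init__(self, mean=0.0, sigma=1.0, lr=0.1, sigma_min=0.01):
        self.mean, self.sigma = mean, sigma
        self.lr, self.sigma_min = lr, sigma_min
        self.avg_reward = 0.0
        self.rng = np.random.default_rng(0)

    def act(self):
        return self.rng.normal(self.mean, self.sigma)

    def update(self, action, reward):
        advantage = reward - self.avg_reward
        # Move the mean toward actions that did better than average, away otherwise.
        self.mean += self.lr * advantage * (action - self.mean)
        # Shrink the exploration width when doing well, widen it when doing badly.
        self.sigma = max(self.sigma_min, self.sigma * (1.0 - self.lr * advantage))
        self.avg_reward += self.lr * (reward - self.avg_reward)
```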

Hierarchical Methods

Another strategy for dealing with large state spaces is to treat them as a hierarchy of learning problems. In many cases, hierarchical solutions introduce slight sub-optimality in performance, but potentially gain a good deal of efficiency in execution time, learning time, and space.

Hierarchical learners are commonly structured as gated behaviors, as shown in Figure. There is a collection of behaviors that map environment states into low-level actions, and a gating function that decides, based on the state of the environment, which behavior's actions should be switched through and actually executed. Maes and Brooks used a version of this architecture in which the individual behaviors were fixed a priori and the gating function was learned from reinforcement. Mahadevan and Connell (b) used the dual approach: they fixed the gating function, and supplied reinforcement functions for the individual behaviors, which were learned. Lin (a) and Dorigo and Colombetti both used this approach, first training the behaviors and then training the gating function. Many of the other hierarchical learning methods can be cast in this framework.

Feudal Q-learning

Feudal Q-learning (Dayan & Hinton; Watkins) involves a hierarchy of learning modules. In the simplest case, there is a high-level master and a low-level slave. The master receives reinforcement from the external environment. Its actions consist of commands that it can give to the low-level learner. When the master generates a particular command to the slave, it must reward the slave for taking actions that satisfy the command, even if they do not result in external reinforcement. The master, then, learns a mapping from states to commands. The slave learns a mapping from commands and states to external actions. The set of commands and their associated reinforcement functions are established in advance of the learning.

Figure: A structure of gated behaviors. (The state s feeds a set of behaviors b1, b2, b3; the gating function g selects which behavior's action a is passed to the environment.)

This is really an instance of the general "gated behaviors" approach, in which the slave can execute any of the behaviors, depending on its command. The reinforcement functions for the individual behaviors (commands) are given, but learning takes place simultaneously at both the high and low levels.

Compositional Qlearning

Singhs comp ositional Qlearning b a CQL consists of a hierarchy based on

the temp oral sequencing of subgoals The elemental tasks are b ehaviors that achieve some

recognizable condition The highlevel goal of the system is to achieve some set of condi

tions in sequential order The achievement of the conditions provides reinforcement for the

elemental tasks which are trained rst to achieve individual subgoals Then the gating

function learns to switch the elemental tasks in order to achieve the appropriate highlevel

sequential goal This metho d was used by Tham and Prager to learn to control a

simulated multilink rob ot arm

Hierarchical Distance to Goal

Especially if we consider reinforcement-learning modules to be part of larger agent architectures, it is important to consider problems in which goals are dynamically input to the learner. Kaelbling's HDG algorithm (a) uses a hierarchical approach to solving problems when goals of achievement (the agent should get to a particular state as quickly as possible) are given to an agent dynamically.

The HDG algorithm works by analogy with navigation in a harbor. The environment is partitioned (a priori, but more recent work (Ashar) addresses the case of learning the partition) into a set of regions whose centers are known as "landmarks." If the agent is currently in the same region as the goal, then it uses low-level actions to move to the goal. If not, then high-level information is used to determine the next landmark on the shortest path from the agent's closest landmark to the goal's closest landmark. Then, the agent uses low-level information to aim toward that next landmark. If errors in action cause deviations in the path, there is no problem; the best aiming point is recomputed on every step.

Figure: An example of a partially observable environment. (The figure shows an office from which the agent may move into one of two identical-looking halls, with transition probabilities 2/5, 1/5, and 2/5; reaching the printer yields a reward of +100.)

Partially Observable Environments

In many real-world environments, it will not be possible for the agent to have perfect and complete perception of the state of the environment. Unfortunately, complete observability is necessary for learning methods based on MDPs. In this section, we consider the case in which the agent makes observations of the state of the environment, but these observations may be noisy and provide incomplete information. In the case of a robot, for instance, it might observe whether it is in a corridor, an open room, a T-junction, etc., and those observations might be error-prone. This problem is also referred to as the problem of incomplete perception, perceptual aliasing, or hidden state.

In this section, we will consider extensions to the basic MDP framework for solving partially observable problems. The resulting formal model is called a partially observable Markov decision process, or POMDP.

State-Free Deterministic Policies

The most naive strategy for dealing with partial observability is to ignore it: that is, to treat the observations as if they were the states of the environment and try to learn to behave. Figure shows a simple environment in which the agent is attempting to get to the printer from an office. If it moves from the office, there is a good chance that the agent will end up in one of two places that look like "hall," but that require different actions for getting to the printer. If we consider these states to be the same, then the agent cannot possibly behave optimally. But how well can it do?

The resulting problem is not Markovian, and Q-learning cannot be guaranteed to converge. Small breaches of the Markov requirement are well handled by Q-learning, but it is possible to construct simple environments that cause Q-learning to oscillate (Chrisman & Littman). It is possible to use a model-based approach, however: act according to some policy and gather statistics about the transitions between observations, then solve for the optimal policy based on those observations. Unfortunately, when the environment is not Markovian, the transition probabilities depend on the policy being executed, so this new policy will induce a new set of transition probabilities. This approach may yield plausible results in some cases, but again there are no guarantees.

It is reasonable, though, to ask what the optimal policy (mapping from observations to actions, in this case) is. It is NP-hard (Littman b) to find this mapping, and even the best mapping can have very poor performance. In the case of our agent trying to get to the printer, for instance, any deterministic state-free policy takes an infinite number of steps to reach the goal on average.

State-Free Stochastic Policies

Some improvement can be gained by considering stochastic policies; these are mappings from observations to probability distributions over actions. If there is randomness in the agent's actions, it will not get stuck in the hall forever. Jaakkola, Singh, and Jordan have developed an algorithm for finding locally-optimal stochastic policies, but finding a globally optimal policy is still NP-hard.

In our example, it turns out that the optimal stochastic policy is for the agent, when in a state that looks like a hall, to go east and west with particular fixed probabilities that turn out to be irrational numbers. This policy can be found by solving a simple (in this case, quadratic) program. The fact that such a simple example can produce irrational numbers gives some indication that it is a difficult problem to solve exactly.

Policies with Internal State

The only way to behave truly effectively in a wide range of environments is to use memory of previous actions and observations to disambiguate the current state. There are a variety of approaches to learning policies with internal state.

Recurrent Q-learning: One intuitively simple approach is to use a recurrent neural network to learn Q values. The network can be trained using backpropagation through time (or some other suitable technique) and learns to retain "history features" to predict value. This approach has been used by a number of researchers (Meeden, McGraw, & Blank; Lin & Mitchell; Schmidhuber b). It seems to work effectively on simple problems, but can suffer from convergence to local optima on more complex problems.

Classifier Systems: Classifier systems (Holland; Goldberg) were explicitly developed to solve problems with delayed reward, including those requiring short-term memory. The internal mechanism typically used to pass reward back through chains of decisions, called the bucket brigade algorithm, bears a close resemblance to Q-learning. In spite of some early successes, the original design does not appear to handle partially observed environments robustly.

Recently, this approach has been re-examined using insights from the reinforcement-learning literature, with some success. Dorigo did a comparative study of Q-learning and classifier systems (Dorigo & Bersini). Cliff and Ross start with Wilson's zeroth-level classifier system (Wilson) and add one- and two-bit memory registers. They find that, although their system can learn to use short-term memory registers effectively, the approach is unlikely to scale to more complex environments.

Dorigo and Colombetti applied classifier systems to a moderately complex problem of learning robot behavior from immediate reinforcement (Dorigo; Dorigo & Colombetti).

Figure: Structure of a POMDP agent. (The state estimator SE maps the current observation i, the last action a, and the previous belief state into a new belief state b, which the policy component π maps to an action.)

Finite-History-Window Approach: One way to restore the Markov property is to allow decisions to be based on the history of recent observations (and perhaps actions). Lin and Mitchell used a fixed-width finite history window to learn a pole balancing task. McCallum describes the "utile suffix memory," which learns a variable-width window that serves simultaneously as a model of the environment and a finite-memory policy. This system has had excellent results in a very complex driving-simulation domain (McCallum). Ring has a neural-network approach that uses a variable history window, adding history when necessary to disambiguate situations.

POMDP Approach: Another strategy consists of using hidden Markov model (HMM) techniques to learn a model of the environment, including the hidden state, then to use that model to construct a perfect memory controller (Cassandra, Kaelbling, & Littman; Lovejoy; Monahan).

Chrisman showed how the forward-backward algorithm for learning HMMs could be adapted to learning POMDPs. He, and later McCallum, also gave heuristic state-splitting rules to attempt to learn the smallest possible model for a given environment. The resulting model can then be used to integrate information from the agent's observations in order to make decisions.

Figure illustrates the basic structure for a perfect-memory controller. The component on the left is the state estimator, which computes the agent's belief state, $b$, as a function of the old belief state, the last action, $a$, and the current observation, $i$. In this context, a belief state is a probability distribution over states of the environment, indicating the likelihood, given the agent's past experience, that the environment is actually in each of those states. The state estimator can be constructed straightforwardly using the estimated world model and Bayes' rule.
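A minimal sketch of that state-estimation step: the new belief is obtained by pushing the old belief through the transition model for the chosen action, weighting by the probability of the observation actually received, and renormalizing (Bayes' rule). The dictionary-based model interface is an illustrative assumption.

```python
def update_belief(b, a, obs, T, O):
    """Bayes-rule belief update for a POMDP.

    b: dict state -> probability (the old belief state);
    T[(s, a, s2)]: transition probability; O[(s2, a, obs)]: observation probability.
    """
    new_b = {}
    for s2 in b:
        predicted = sum(T.get((s, a, s2), 0.0) * p for s, p in b.items())
        new_b[s2] = O.get((s2, a, obs), 0.0) * predicted
    total = sum(new_b.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {s2: p / total for s2, p in new_b.items()}
```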

Now we are left with the problem of finding a policy mapping belief states into action. This problem can be formulated as an MDP, but it is difficult to solve using the techniques described earlier, because the input space is continuous. Chrisman's approach does not take into account future uncertainty, but yields a policy after a small amount of computation. A standard approach from the operations-research literature is to solve for the optimal policy (or a close approximation thereof) based on its representation as a piecewise-linear and convex function over the belief space. This method is computationally intractable, but may serve as inspiration for methods that make further approximations (Cassandra et al.; Littman, Cassandra, & Kaelbling a).

Reinforcement Learning Applications

One reason that reinforcement learning is popular is that it serves as a theoretical tool for studying the principles of agents learning to act. But it is unsurprising that it has also been used by a number of researchers as a practical computational tool for constructing autonomous systems that improve themselves with experience. These applications have ranged from robotics, to industrial manufacturing, to combinatorial search problems such as computer game playing.

Practical applications provide a test of the efficacy and usefulness of learning algorithms. They are also an inspiration for deciding which components of the reinforcement-learning framework are of practical importance. For example, a researcher with a real robotic task can provide a data point to questions such as:

How important is optimal exploration? Can we break the learning period into exploration phases and exploitation phases?

What is the most useful model of long-term reward: finite horizon? discounted? infinite horizon?

How much computation is available between agent decisions, and how should it be used?

What prior knowledge can we build into the system, and which algorithms are capable of using that knowledge?

Let us examine a set of practical applications of reinforcement learning, while bearing these questions in mind.

Game Playing

Game playing has dominated the Artificial Intelligence world as a problem domain ever since the field was born. Two-player games do not fit into the established reinforcement-learning framework, since the optimality criterion for games is not one of maximizing reward in the face of a fixed environment, but one of maximizing reward against an optimal adversary (minimax). Nonetheless, reinforcement-learning algorithms can be adapted to work for a very general class of games (Littman a), and many researchers have used reinforcement learning in these environments. One application, spectacularly far ahead of its time, was Samuel's checkers playing system (Samuel). This learned a value function represented by a linear function approximator, and employed a training scheme similar to the updates used in value iteration, temporal differences, and Q-learning.

More recently, Tesauro applied the temporal difference algorithm to backgammon. Backgammon has approximately $10^{20}$ states, making table-based reinforcement learning impossible. Instead, Tesauro used a backpropagation-based three-layer neural network as a function approximator for the value function

Board Position $\rightarrow$ Probability of victory for current player.

                        Training Games    Hidden Units    Results
    Basic TD-Gammon                                       Poor
    TD-Gammon 1.0                                         Lost by points in games
    TD-Gammon 2.0                                         Lost by points in games
    TD-Gammon 2.1                                         Lost by point in games

Table: TD-Gammon's performance in games against the top human professional players. A backgammon tournament involves playing a series of games for points until one player reaches a set target. TD-Gammon won none of these tournaments but came sufficiently close that it is now considered one of the best few players in the world.

Two versions of the learning algorithm were used. The first, which we will call Basic TD-Gammon, used very little predefined knowledge of the game, and the representation of a board position was virtually a raw encoding, sufficiently powerful only to permit the neural network to distinguish between conceptually different positions. The second, TD-Gammon, was provided with the same raw state information, supplemented by a number of hand-crafted features of backgammon board positions. Providing hand-crafted features in this manner is a good example of how inductive biases from human knowledge of the task can be supplied to a learning algorithm.

The training of both learning algorithms required several months of computer time, and was achieved by constant self-play. No exploration strategy was used; the system always greedily chose the move with the largest expected probability of victory. This naive exploration strategy proved entirely adequate for this environment, which is perhaps surprising given the considerable work in the reinforcement-learning literature which has produced numerous counterexamples to show that greedy exploration can lead to poor learning performance. Backgammon, however, has two important properties. Firstly, whatever policy is followed, every game is guaranteed to end in finite time, meaning that useful reward information is obtained fairly frequently. Secondly, the state transitions are sufficiently stochastic that, independent of the policy, all states will occasionally be visited; a wrong initial value function has little danger of starving us from visiting a critical part of state space from which important information could be obtained.

The results (Table) of TD-Gammon are impressive. It has competed at the very top level of international human play. Basic TD-Gammon played respectably, but not at a professional standard.

Figure: Schaal and Atkeson's devil-sticking robot. The tapered stick is hit alternately by each of the two hand sticks. The task is to keep the devil stick from falling for as many hits as possible. The robot has three motors, indicated by torque vectors.

Although experiments with other games have, in some cases, produced interesting learning behavior, no success close to that of TD-Gammon has been repeated. Other games that have been studied include Go (Schraudolph, Dayan, & Sejnowski) and Chess (Thrun). It is still an open question as to if and how the success of TD-Gammon can be repeated in other domains.

Robotics and Control

In recent years there have been many robotics and control applications that have used reinforcement learning. Here we will concentrate on the following four examples, although many other interesting ongoing robotics investigations are underway.

Schaal and Atkeson constructed a two-armed robot, shown in Figure, that learns to juggle a device known as a devil-stick. This is a complex non-linear control task involving a six-dimensional state space and less than a second per control decision. After a few dozen initial attempts, the robot learns to keep juggling for hundreds of hits. A typical human learning the task requires an order of magnitude more practice to achieve proficiency at mere tens of hits.

The juggling robot learned a world model from experience, which was generalized to unvisited states by a function approximation scheme known as locally weighted regression (Cleveland & Devlin; Moore & Atkeson). Between each trial, a form of dynamic programming specific to linear control policies and locally linear transitions was used to improve the policy. The form of dynamic programming is known as linear-quadratic-regulator design (Sage & White).


Mahadevan and Connell (a) discuss a task in which a mobile robot pushes large boxes for extended periods of time. Box-pushing is a well-known difficult robotics problem, characterized by immense uncertainty in the results of actions. Q-learning was used in conjunction with some novel clustering techniques designed to enable a higher-dimensional input than a tabular approach would have permitted. The robot learned to perform competitively with the performance of a human-programmed solution. Another aspect of this work, mentioned in Section, was a pre-programmed breakdown of the monolithic task description into a set of lower-level tasks to be learned.

Mataric describes a robotics experiment with, from the viewpoint of theoretical reinforcement learning, an unthinkably high-dimensional state space containing many dozens of degrees of freedom. Four mobile robots traveled within an enclosure, collecting small disks and transporting them to a destination region. There were three enhancements to the basic Q-learning algorithm. Firstly, pre-programmed signals called progress estimators were used to break the monolithic task into subtasks. This was achieved in a robust manner in which the robots were not forced to use the estimators, but had the freedom to profit from them. Secondly, control was decentralized: each robot learned its own policy independently, without explicit communication with the others. Thirdly, state space was brutally quantized into a small number of discrete states according to values of a small number of pre-programmed boolean features of the underlying sensors. The performance of the Q-learned policies was almost as good as a simple hand-crafted controller for the job.

Q-learning has been used in an elevator dispatching task (Crites & Barto). The problem, which has been implemented in simulation only at this stage, involved four elevators servicing ten floors. The objective was to minimize the average squared wait time for passengers, discounted into future time. The problem can be posed as a discrete Markov system, but there are about $10^{22}$ states even in the most simplified version of the problem. Crites and Barto used neural networks for function approximation and provided an excellent comparison study of their Q-learning approach against the most popular and the most sophisticated elevator dispatching algorithms. The squared wait time of their controller was less than that of the best alternative algorithm (the "Empty the System" heuristic with a receding horizon controller) and less than half the squared wait time of the controller most frequently used in real elevator systems.

The final example concerns an application of reinforcement learning by one of the authors of this survey to a packaging task from a food processing industry. The problem involves filling containers with variable numbers of non-identical products. The product characteristics also vary with time, but can be sensed. Depending on the task, various constraints are placed on the container-filling procedure. Here are three examples:

The mean weight of all containers produced by a shift must not be below the manufacturer's declared weight, W.

The number of containers below the declared weight must be less than P.

No containers may be produced below a specified minimum weight.

Such tasks are controlled by machinery which operates according to various setpoints. Conventional practice is that setpoints are chosen by human operators, but this choice is not easy, as it is dependent on the current product characteristics and the current task constraints. The dependency is often difficult to model and highly non-linear. The task was posed as a finite-horizon Markov decision task, in which the state of the system is a function of the product characteristics, the amount of time remaining in the production shift, and the mean wastage and percent below declared in the shift so far. The state space was discretized, and locally weighted regression was used to learn and generalize a transition model. Prioritized sweeping was used to maintain an optimal value function as each new piece of transition information was obtained. In simulated experiments the savings were considerable, typically with wastage reduced by a factor of ten. Since then the system has been deployed successfully in several factories within the United States.

Some interesting aspects of practical reinforcement learning come to light from these examples. The most striking is that in all cases, to make a real system work it proved necessary to supplement the fundamental algorithm with extra pre-programmed knowledge. Supplying extra knowledge comes at a price: more human effort and insight is required, and the system is subsequently less autonomous. But it is also clear that, for tasks such as these, a knowledge-free approach would not have achieved worthwhile performance within the finite lifetime of the robots.

What forms did this pre-programmed knowledge take? It included an assumption of linearity for the juggling robot's policy, and a manual breaking up of the task into subtasks for the two mobile-robot examples, while the box-pusher also used a clustering technique for the Q values which assumed locally consistent Q values. The four disk-collecting robots additionally used a manually discretized state space. The packaging example had far fewer dimensions and so required correspondingly weaker assumptions, but there, too, the assumption of local piecewise continuity in the transition model enabled massive reductions in the amount of learning data required.

The exploration strategies are interesting too. The juggler used careful statistical analysis to judge where to profitably experiment. However, both mobile-robot applications were able to learn well with greedy exploration: always exploiting, without deliberate exploration. The packaging task used optimism in the face of uncertainty. None of these strategies mirrors theoretically optimal (but computationally intractable) exploration, and yet all proved adequate.

Finally, it is also worth considering the computational regimes of these experiments. They were all very different, which indicates that the differing computational demands of various reinforcement learning algorithms do indeed have an array of differing applications. The juggler needed to make very fast decisions with low latency between each hit, but had long periods (many seconds) between each trial to consolidate the experiences collected on the previous trial and to perform the more aggressive computation necessary to produce a new reactive controller for the next trial. The box-pushing robot was meant to operate autonomously for hours and so had to make decisions with a uniform-length control cycle. The cycle was sufficiently long for quite substantial computations beyond simple Q-learning backups. The four disk-collecting robots were particularly interesting. Each robot had a short life, due to battery constraints, meaning that substantial number crunching was impractical and any significant combinatorial search would have used a significant fraction of the robot's learning lifetime. The packaging task had easy constraints. One decision was needed every few minutes. This provided opportunities for fully computing the optimal value function for the discretized state space between every control cycle, in addition to performing massive cross-validation-based optimization of the transition model being learned.

A great deal of further work is currently in progress on practical implementations of reinforcement learning. The insights and task constraints that they produce will have an important effect on shaping the kind of algorithms that are developed in the future.

Conclusions

There are a variety of reinforcement-learning techniques that work effectively on a variety of small problems. But very few of these techniques scale well to larger problems. This is not because researchers have done a bad job of inventing learning techniques, but because it is very difficult to solve arbitrary problems in the general case. In order to solve highly complex problems, we must give up tabula rasa learning techniques and begin to incorporate bias that will give leverage to the learning process.

The necessary bias can come in a variety of forms, including the following:

shaping: The technique of shaping is used in training animals (Hilgard & Bower); a teacher presents very simple problems to solve first, then gradually exposes the learner to more complex problems. Shaping has been used in supervised-learning systems, and can be used to train hierarchical reinforcement-learning systems from the bottom up (Lin), and to alleviate problems of delayed reinforcement by decreasing the delay until the problem is well understood (Dorigo & Colombetti; Dorigo).

local reinforcement signals: Whenever possible, agents should be given reinforcement signals that are local. In applications in which it is possible to compute a gradient, rewarding the agent for taking steps up the gradient, rather than just for achieving the final goal, can speed learning significantly (Mataric).

imitation: An agent can learn by watching another agent perform the task (Lin). For real robots, this requires perceptual abilities that are not yet available. But another strategy is to have a human supply appropriate motor commands to a robot through a joystick or steering wheel (Pomerleau).

problem decomposition: Decomposing a huge learning problem into a collection of smaller ones, and providing useful reinforcement signals for the subproblems, is a very powerful technique for biasing learning. Most interesting examples of robotic reinforcement learning employ this technique to some extent (Connell & Mahadevan).

reflexes: One thing that keeps agents that know nothing from learning anything is that they have a hard time even finding the interesting parts of the space; they wander around at random, never getting near the goal, or they are always killed immediately. These problems can be ameliorated by programming a set of "reflexes" that cause the agent to act initially in some way that is reasonable (Mataric; Singh, Barto, Grupen, & Connolly). These reflexes can eventually be overridden by more detailed and accurate learned knowledge, but they at least keep the agent alive and pointed in the right direction while it is trying to learn. Recent work by Millan explores the use of reflexes to make robot learning safer and more efficient.

With appropriate biases, supplied by human programmers or teachers, complex reinforcement-learning problems will eventually be solvable. There is still much work to be done and many interesting questions remaining for learning techniques, and especially regarding methods for approximating, decomposing, and incorporating bias into problems.

Acknowledgements

Thanks to Marco Dorigo and three anonymous reviewers for comments that have helped to improve this paper. Also thanks to our many colleagues in the reinforcement-learning community who have done this work and explained it to us.

Leslie Pack Kaelbling was supported in part by NSF grants. Michael Littman was supported in part by Bellcore. Andrew Moore was supported in part by an NSF Research Initiation Award and by 3M Corporation.

References

Ackley, D. H., & Littman, M. L. Generalization and scaling in reinforcement learning. In Touretzky, D. S. (Ed.), Advances in Neural Information Processing Systems, San Mateo, CA. Morgan Kaufmann.

Albus, J. S. A new approach to manipulator control: The cerebellar model articulation controller (CMAC). Journal of Dynamic Systems, Measurement, and Control.

Albus, J. S. Brains, Behavior, and Robotics. BYTE Books, Subsidiary of McGraw-Hill, Peterborough, New Hampshire.

Anderson, C. W. Learning and Problem Solving with Multilayer Connectionist Systems. Ph.D. thesis, University of Massachusetts, Amherst, MA.

Ashar, R. R. Hierarchical learning in stochastic domains. Master's thesis, Brown University, Providence, Rhode Island.

Baird, L. Residual algorithms: Reinforcement learning with function approximation. In Prieditis, A., & Russell, S. (Eds.), Proceedings of the Twelfth International Conference on Machine Learning, San Francisco, CA. Morgan Kaufmann.

Baird, L. C., & Klopf, A. H. Reinforcement learning with high-dimensional continuous actions. Tech. rep. WL-TR, Wright Laboratory, Wright-Patterson Air Force Base, Ohio.


Barto, A. G., Bradtke, S. J., & Singh, S. P. Learning to act using real-time dynamic programming. Artificial Intelligence.

Barto, A. G., Sutton, R. S., & Anderson, C. W. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC.

Bellman, R. Dynamic Programming. Princeton University Press, Princeton, NJ.

Berenji, H. R. Artificial neural networks and approximate reasoning for intelligent control in space. In American Control Conference.

Berry, D. A., & Fristedt, B. Bandit Problems: Sequential Allocation of Experiments. Chapman and Hall, London, UK.

Bertsekas, D. P. Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall, Englewood Cliffs, NJ.

Bertsekas, D. P. Dynamic Programming and Optimal Control. Athena Scientific, Belmont, Massachusetts. Volumes 1 and 2.

Bertsekas, D. P., & Castanon, D. A. Adaptive aggregation for infinite horizon dynamic programming. IEEE Transactions on Automatic Control.

Bertsekas, D. P., & Tsitsiklis, J. N. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Englewood Cliffs, NJ.

Box, G. E. P., & Draper, N. R. Empirical Model-Building and Response Surfaces. Wiley.

Boyan, J. A., & Moore, A. W. Generalization in reinforcement learning: Safely approximating the value function. In Tesauro, G., Touretzky, D. S., & Leen, T. K. (Eds.), Advances in Neural Information Processing Systems, Cambridge, MA. The MIT Press.

Burghes, D., & Graham, A. Introduction to Control Theory, including Optimal Control. Ellis Horwood.

Cassandra, A. R., Kaelbling, L. P., & Littman, M. L. Acting optimally in partially observable stochastic domains. In Proceedings of the Twelfth National Conference on Artificial Intelligence, Seattle, WA.

Chapman, D., & Kaelbling, L. P. Input generalization in delayed reinforcement learning: An algorithm and performance comparisons. In Proceedings of the International Joint Conference on Artificial Intelligence, Sydney, Australia.

Chrisman, L. Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In Proceedings of the Tenth National Conference on Artificial Intelligence, San Jose, CA. AAAI Press.


Chrisman, L., & Littman, M. Hidden state and short-term memory. Presentation at the Reinforcement Learning Workshop, Machine Learning Conference.

Cichosz, P., & Mulawka, J. J. Fast and efficient reinforcement learning with truncated temporal differences. In Prieditis, A., & Russell, S. (Eds.), Proceedings of the Twelfth International Conference on Machine Learning, San Francisco, CA. Morgan Kaufmann.

Cleveland, W. S., & Devlin, S. J. Locally weighted regression: An approach to regression analysis by local fitting. Journal of the American Statistical Association.

Cliff, D., & Ross, S. Adding temporary memory to ZCS. Adaptive Behavior.

Condon, A. The complexity of stochastic games. Information and Computation.

Connell, J., & Mahadevan, S. Rapid task learning for real robots. In Robot Learning. Kluwer Academic Publishers.

Crites, R. H., & Barto, A. G. Improving elevator performance using reinforcement learning. In Touretzky, D., Mozer, M., & Hasselmo, M. (Eds.), Neural Information Processing Systems.

Dayan, P. The convergence of TD(λ) for general λ. Machine Learning.

Dayan, P., & Hinton, G. E. Feudal reinforcement learning. In Hanson, S. J., Cowan, J. D., & Giles, C. L. (Eds.), Advances in Neural Information Processing Systems, San Mateo, CA. Morgan Kaufmann.

Dayan, P., & Sejnowski, T. J. TD(λ) converges with probability 1. Machine Learning.

Dean, T., Kaelbling, L. P., Kirman, J., & Nicholson, A. Planning with deadlines in stochastic domains. In Proceedings of the Eleventh National Conference on Artificial Intelligence, Washington, DC.

D'Epenoux, F. A probabilistic production and inventory problem. Management Science.

Derman, C. Finite State Markovian Decision Processes. Academic Press, New York.

Dorigo, M., & Bersini, H. A comparison of Q-learning and classifier systems. In From Animals to Animats: Proceedings of the Third International Conference on the Simulation of Adaptive Behavior, Brighton, UK.

Dorigo, M., & Colombetti, M. Robot shaping: Developing autonomous agents through learning. Artificial Intelligence.

Dorigo, M. Alecsys and the AutonoMouse: Learning to control a real robot by distributed classifier systems. Machine Learning.

Fiechter, C.-N. Efficient reinforcement learning. In Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory. Association for Computing Machinery.

Gittins, J. C. Multi-armed Bandit Allocation Indices. Wiley-Interscience Series in Systems and Optimization. Wiley, Chichester, NY.

Goldberg, D. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, MA.

Gordon, G. J. Stable function approximation in dynamic programming. In Prieditis, A., & Russell, S. (Eds.), Proceedings of the Twelfth International Conference on Machine Learning, San Francisco, CA. Morgan Kaufmann.

Gullapalli, V. A stochastic reinforcement learning algorithm for learning real-valued functions. Neural Networks.

Gullapalli, V. Reinforcement learning and its application to control. PhD thesis, University of Massachusetts, Amherst, MA.

Hilgard, E. R., & Bower, G. H. Theories of Learning (fourth edition). Prentice-Hall, Englewood Cliffs, NJ.

Hoffman, A. J., & Karp, R. M. On nonterminating stochastic games. Management Science.

Holland, J. H. Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, MI.

Howard, R. A. Dynamic Programming and Markov Processes. The MIT Press, Cambridge, MA.

Jaakkola, T., Jordan, M. I., & Singh, S. P. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation.

Jaakkola, T., Singh, S. P., & Jordan, M. I. Monte Carlo reinforcement learning in non-Markovian decision problems. In Tesauro, G., Touretzky, D. S., & Leen, T. K. (Eds.), Advances in Neural Information Processing Systems. The MIT Press, Cambridge, MA.

Kaelbling, L. P. (a). Hierarchical learning in stochastic domains: Preliminary results. In Proceedings of the Tenth International Conference on Machine Learning, Amherst, MA. Morgan Kaufmann.

Kaelbling, L. P. (b). Learning in Embedded Systems. The MIT Press, Cambridge, MA.

Kaelbling, L. P. (a). Associative reinforcement learning: A generate and test algorithm. Machine Learning.

Kaelbling, L. P. (b). Associative reinforcement learning: Functions in k-DNF. Machine Learning.

Kirman, J. Predicting Real-Time Planner Performance by Domain Characterization. PhD thesis, Department of Computer Science, Brown University.

Koenig, S., & Simmons, R. G. Complexity analysis of real-time reinforcement learning. In Proceedings of the Eleventh National Conference on Artificial Intelligence, Menlo Park, California. AAAI Press/MIT Press.

Kumar, P. R., & Varaiya, P. P. Stochastic Systems: Estimation, Identification, and Adaptive Control. Prentice Hall, Englewood Cliffs, New Jersey.

Lee, C. C. A self-learning rule-based controller employing approximate reasoning and neural net concepts. International Journal of Intelligent Systems.

Lin, L.-J. Programming robots using reinforcement learning and teaching. In Proceedings of the Ninth National Conference on Artificial Intelligence.

Lin, L.-J. (a). Hierarchical learning of robot skills by reinforcement. In Proceedings of the International Conference on Neural Networks.

Lin, L.-J. (b). Reinforcement Learning for Robots Using Neural Networks. PhD thesis, Carnegie Mellon University, Pittsburgh, PA.

Lin, L.-J., & Mitchell, T. M. Memory approaches to reinforcement learning in non-Markovian domains. Tech. rep. CMU-CS, Carnegie Mellon University, School of Computer Science.

Littman, M. L. (a). Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, San Francisco, CA. Morgan Kaufmann.

Littman, M. L. (b). Memoryless policies: Theoretical limitations and practical results. In Cliff, D., Husbands, P., Meyer, J.-A., & Wilson, S. W. (Eds.), From Animals to Animats: Proceedings of the Third International Conference on Simulation of Adaptive Behavior. The MIT Press, Cambridge, MA.

Littman, M. L., Cassandra, A., & Kaelbling, L. P. (a). Learning policies for partially observable environments: Scaling up. In Prieditis, A., & Russell, S. (Eds.), Proceedings of the Twelfth International Conference on Machine Learning, San Francisco, CA. Morgan Kaufmann.

Littman, M. L., Dean, T. L., & Kaelbling, L. P. (b). On the complexity of solving Markov decision problems. In Proceedings of the Eleventh Annual Conference on Uncertainty in Artificial Intelligence (UAI), Montreal, Quebec, Canada.

Lovejoy, W. S. A survey of algorithmic methods for partially observable Markov decision processes. Annals of Operations Research.

Maes, P., & Brooks, R. A. Learning to coordinate behaviors. In Proceedings of the Eighth National Conference on Artificial Intelligence. Morgan Kaufmann.

Mahadevan, S. To discount or not to discount in reinforcement learning: A case study comparing R-learning and Q-learning. In Proceedings of the Eleventh International Conference on Machine Learning, San Francisco, CA. Morgan Kaufmann.

Mahadevan, S. Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning.

Mahadevan, S., & Connell, J. (a). Automatic programming of behavior-based robots using reinforcement learning. In Proceedings of the Ninth National Conference on Artificial Intelligence, Anaheim, CA.

Mahadevan, S., & Connell, J. (b). Scaling reinforcement learning to robotics by exploiting the subsumption architecture. In Proceedings of the Eighth International Workshop on Machine Learning.

Mataric, M. J. Reward functions for accelerated learning. In Cohen, W. W., & Hirsh, H. (Eds.), Proceedings of the Eleventh International Conference on Machine Learning. Morgan Kaufmann.

McCallum, A. K. Reinforcement Learning with Selective Perception and Hidden State. PhD thesis, Department of Computer Science, University of Rochester.

McCallum, R. A. Overcoming incomplete perception with utile distinction memory. In Proceedings of the Tenth International Conference on Machine Learning, Amherst, Massachusetts. Morgan Kaufmann.

McCallum, R. A. Instance-based utile distinctions for reinforcement learning with hidden state. In Proceedings of the Twelfth International Conference on Machine Learning, San Francisco, CA. Morgan Kaufmann.

Meeden, L., McGraw, G., & Blank, D. Emergent control and planning in an autonomous vehicle. In Touretzky, D. (Ed.), Proceedings of the Fifteenth Annual Meeting of the Cognitive Science Society. Lawrence Erlbaum Associates, Hillsdale, NJ.

Millan, J. del R. Rapid, safe, and incremental learning of navigation strategies. IEEE Transactions on Systems, Man, and Cybernetics.

Monahan, G. E. A survey of partially observable Markov decision processes: Theory, models, and algorithms. Management Science.

Moore, A. W. Variable resolution dynamic programming: Efficiently learning action maps in multivariate real-valued spaces. In Proc. Eighth International Machine Learning Workshop.

Moore, A. W. The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces. In Cowan, J. D., Tesauro, G., & Alspector, J. (Eds.), Advances in Neural Information Processing Systems, San Mateo, CA. Morgan Kaufmann.

Moore, A. W., & Atkeson, C. G. An investigation of memory-based function approximators for learning control. Tech. rep., MIT Artificial Intelligence Laboratory, Cambridge, MA.

Moore, A. W., & Atkeson, C. G. Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning.

Moore, A. W., Atkeson, C. G., & Schaal, S. Memory-based learning for control. Tech. rep. CMU-RI-TR, CMU Robotics Institute.

Narendra, K., & Thathachar, M. A. L. Learning Automata: An Introduction. Prentice-Hall, Englewood Cliffs, NJ.

Narendra, K. S., & Thathachar, M. A. L. Learning automata: A survey. IEEE Transactions on Systems, Man, and Cybernetics.

Peng, J., & Williams, R. J. Efficient learning and planning within the Dyna framework. Adaptive Behavior.

Peng, J., & Williams, R. J. Incremental multi-step Q-learning. In Proceedings of the Eleventh International Conference on Machine Learning, San Francisco, CA. Morgan Kaufmann.

Pomerleau, D. A. Neural Network Perception for Mobile Robot Guidance. Kluwer Academic Publishing.

Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY.

Puterman, M. L., & Shin, M. C. Modified policy iteration algorithms for discounted Markov decision processes. Management Science.

Ring, M. B. Continual Learning in Reinforcement Environments. PhD thesis, University of Texas at Austin, Austin, Texas.

Rude, U. Mathematical and Computational Techniques for Multilevel Adaptive Methods. Society for Industrial and Applied Mathematics, Philadelphia, Pennsylvania.

Rumelhart, D. E., & McClelland, J. L. (Eds.). Parallel Distributed Processing: Explorations in the Microstructures of Cognition. Volume 1: Foundations. The MIT Press, Cambridge, MA.

Rummery, G. A., & Niranjan, M. On-line Q-learning using connectionist systems. Tech. rep. CUED/F-INFENG/TR, Cambridge University.

Rust, J. Numerical dynamic programming in economics. In Handbook of Computational Economics. Elsevier, North Holland.

Sage, A. P., & White, C. C. Optimum Systems Control. Prentice Hall.

Salganicoff, M., & Ungar, L. H. Active exploration and learning in real-valued spaces using multi-armed bandit allocation indices. In Prieditis, A., & Russell, S. (Eds.), Proceedings of the Twelfth International Conference on Machine Learning, San Francisco, CA. Morgan Kaufmann.

Samuel, A. L. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development. Reprinted in E. A. Feigenbaum and J. Feldman (editors), Computers and Thought, McGraw-Hill, New York.

Schaal, S., & Atkeson, C. Robot juggling: An implementation of memory-based learning. Control Systems Magazine.

Schmidhuber, J. A general method for multi-agent learning and incremental self-improvement in unrestricted environments. In Yao, X. (Ed.), Evolutionary Computation: Theory and Applications. Scientific Publ. Co., Singapore.

Schmidhuber, J. H. (a). Curious model-building control systems. In Proc. International Joint Conference on Neural Networks, Singapore. IEEE.

Schmidhuber, J. H. (b). Reinforcement learning in Markovian and non-Markovian environments. In Lippman, D. S., Moody, J. E., & Touretzky, D. S. (Eds.), Advances in Neural Information Processing Systems, San Mateo, CA. Morgan Kaufmann.

Schraudolph, N. N., Dayan, P., & Sejnowski, T. J. Temporal difference learning of position evaluation in the game of Go. In Cowan, J. D., Tesauro, G., & Alspector, J. (Eds.), Advances in Neural Information Processing Systems, San Mateo, CA. Morgan Kaufmann.

Schrijver, A. Theory of Linear and Integer Programming. Wiley-Interscience, New York, NY.

Schwartz, A. A reinforcement learning method for maximizing undiscounted rewards. In Proceedings of the Tenth International Conference on Machine Learning, Amherst, Massachusetts. Morgan Kaufmann.

Singh, S. P., Barto, A. G., Grupen, R., & Connolly, C. Robust reinforcement learning in motion planning. In Cowan, J. D., Tesauro, G., & Alspector, J. (Eds.), Advances in Neural Information Processing Systems, San Mateo, CA. Morgan Kaufmann.

Singh, S. P., & Sutton, R. S. Reinforcement learning with replacing eligibility traces. Machine Learning.

Singh, S. P. (a). Reinforcement learning with a hierarchy of abstract models. In Proceedings of the Tenth National Conference on Artificial Intelligence, San Jose, CA. AAAI Press.

Singh, S. P. (b). Transfer of learning by composing solutions of elemental sequential tasks. Machine Learning.

Singh, S. P. Learning to Solve Markovian Decision Processes. PhD thesis, Department of Computer Science, University of Massachusetts. Also CMPSCI Technical Report.

Stengel, R. F. Stochastic Optimal Control. John Wiley and Sons.

Sutton, R. S. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Touretzky, D., Mozer, M., & Hasselmo, M. (Eds.), Neural Information Processing Systems.

Sutton, R. S. Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, Amherst, MA.

Sutton, R. S. Learning to predict by the methods of temporal differences. Machine Learning.

Sutton, R. S. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning, Austin, TX. Morgan Kaufmann.

Sutton, R. S. Planning by incremental dynamic programming. In Proceedings of the Eighth International Workshop on Machine Learning. Morgan Kaufmann.

Tesauro, G. Practical issues in temporal difference learning. Machine Learning.

Tesauro, G. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation.

Tesauro, G. Temporal difference learning and TD-Gammon. Communications of the ACM.

Tham, C.-K., & Prager, R. W. A modular Q-learning architecture for manipulator task decomposition. In Proceedings of the Eleventh International Conference on Machine Learning, San Francisco, CA. Morgan Kaufmann.

Thrun, S. Learning to play the game of chess. In Tesauro, G., Touretzky, D. S., & Leen, T. K. (Eds.), Advances in Neural Information Processing Systems. The MIT Press, Cambridge, MA.

Thrun, S., & Schwartz, A. Issues in using function approximation for reinforcement learning. In Mozer, M., Smolensky, P., Touretzky, D., Elman, J., & Weigend, A. (Eds.), Proceedings of the Connectionist Models Summer School. Lawrence Erlbaum, Hillsdale, NJ.

Thrun, S. B. The role of exploration in learning control. In White, D. A., & Sofge, D. A. (Eds.), Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches. Van Nostrand Reinhold, New York, NY.

Tsitsiklis, J. N. Asynchronous stochastic approximation and Q-learning. Machine Learning.

Tsitsiklis, J. N., & Van Roy, B. Feature-based methods for large scale dynamic programming. Machine Learning.

Valiant, L. G. A theory of the learnable. Communications of the ACM.

Watkins, C. J. C. H. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, UK.

Watkins, C. J. C. H., & Dayan, P. Q-learning. Machine Learning.

Whitehead, S. D. Complexity and cooperation in Q-learning. In Proceedings of the Eighth International Workshop on Machine Learning, Evanston, IL. Morgan Kaufmann.

Williams, R. J. A class of gradient-estimating algorithms for reinforcement learning in neural networks. In Proceedings of the IEEE First International Conference on Neural Networks, San Diego, CA.

Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning.

Williams, R. J., & Baird, L. C., III (a). Analysis of some incremental variants of policy iteration: First steps toward understanding actor-critic learning systems. Tech. rep. NU-CCS, Northeastern University, College of Computer Science, Boston, MA.

Williams, R. J., & Baird, L. C., III (b). Tight performance bounds on greedy policies based on imperfect value functions. Tech. rep. NU-CCS, Northeastern University, College of Computer Science, Boston, MA.

Wilson, S. Classifier fitness based on accuracy.

Zhang, W., & Dietterich, T. G. A reinforcement learning approach to job-shop scheduling. In Proceedings of the International Joint Conference on Artificial Intelligence.