
Reinforcement Learning: Chapter 1


This excerpt is from Reinforcement Learning: An Introduction, by Richard S. Sutton and Andrew G. Barto, © 1998 The MIT Press.

1 Introduction

The idea that we learn by interacting with our environment is probably the first to occur to us when we think about the nature of learning. When an infant plays, waves its arms, or looks about, it has no explicit teacher, but it does have a direct sensorimotor connection to its environment. Exercising this connection produces a wealth of information about cause and effect, about the consequences of actions, and about what to do in order to achieve goals. Throughout our lives, such interactions are undoubtedly a major source of knowledge about our environment and ourselves. Whether we are learning to drive a car or to hold a conversation, we are acutely aware of how our environment responds to what we do, and we seek to influence what happens through our behavior. Learning from interaction is a foundational idea underlying nearly all theories of learning and intelligence.

In this book we explore a computational approach to learning from interaction. Rather than directly theorizing about how people or animals learn, we explore idealized learning situations and evaluate the effectiveness of various learning methods. That is, we adopt the perspective of an artificial intelligence researcher or engineer. We explore designs for machines that are effective in solving learning problems of scientific or economic interest, evaluating the designs through mathematical analysis or computational experiments. The approach we explore, called reinforcement learning, is much more focused on goal-directed learning from interaction than are other approaches to machine learning.

1.1 Reinforcement Learning

Reinforcement learning is learning what to do, how to map situations to actions, so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them. In the most interesting and challenging cases, actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards. These two characteristics, trial-and-error search and delayed reward, are the two most important distinguishing features of reinforcement learning.

Reinforcement learning is defined not by characterizing learning methods, but by characterizing a learning problem. Any method that is well suited to solving that problem, we consider to be a reinforcement learning method. A full specification of the reinforcement learning problem in terms of optimal control of Markov decision processes must wait until Chapter 3, but the basic idea is simply to capture the most important aspects of the real problem facing a learning agent interacting with its environment to achieve a goal. Clearly, such an agent must be able to sense the state of the environment to some extent and must be able to take actions that affect the state. The agent also must have a goal or goals relating to the state of the environment. The formulation is intended to include just these three aspects (sensation, action, and goal) in their simplest possible forms without trivializing any of them.

Reinforcement learning is different from supervised learning, the kind of learning studied in most current research in machine learning, statistical pattern recognition, and artificial neural networks. Supervised learning is learning from examples provided by a knowledgeable external supervisor. This is an important kind of learning, but alone it is not adequate for learning from interaction. In interactive problems it is often impractical to obtain examples of desired behavior that are both correct and representative of all the situations in which the agent has to act. In uncharted territory, where one would expect learning to be most beneficial, an agent must be able to learn from its own experience.

One of the challenges that arise in reinforcement learning and not in other kinds of learning is the trade-off between exploration and exploitation. To obtain a lot of reward, a reinforcement learning agent must prefer actions that it has tried in the past and found to be effective in producing reward. But to discover such actions, it has to try actions that it has not selected before. The agent has to exploit what it already knows in order to obtain reward, but it also has to explore in order to make better action selections in the future. The dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at the task. The agent must try a variety of actions and progressively favor those that appear to be best. On a stochastic task, each action must be tried many times to gain a reliable estimate of its expected reward. The exploration-exploitation dilemma has been intensively studied by mathematicians for many decades (see Chapter 2). For now, we simply note that the entire issue of balancing exploration and exploitation does not even arise in supervised learning as it is usually defined.

Another key feature of reinforcement learning is that it explicitly considers the whole problem of a goal-directed agent interacting with an uncertain environment.
This is in contrast with many approaches that consider subproblems without addressing how they might fit into a larger picture. For example, we have mentioned that much of machine learning research is concerned with supervised learning without explicitly specifying how such an ability would finally be useful. Other researchers have developed theories of planning with general goals, but without considering planning's role in real-time decision-making, or the question of where the predictive models necessary for planning would come from. Although these approaches have yielded many useful results, their focus on isolated subproblems is a significant limitation.

Reinforcement learning takes the opposite tack, starting with a complete, interactive, goal-seeking agent. All reinforcement learning agents have explicit goals, can sense aspects of their environments, and can choose actions to influence their environments. Moreover, it is usually assumed from the beginning that the agent has to operate despite significant uncertainty about the environment it faces. When reinforcement learning involves planning, it has to address the interplay between planning and real-time action selection, as well as the question of how environmental models are acquired and improved. When reinforcement learning involves supervised learning, it does so for specific reasons that determine which capabilities are critical and which are not. For learning research to make progress, important subproblems have to be isolated and studied, but they should be subproblems that play clear roles in complete, interactive, goal-seeking agents, even if all the details of the complete agent cannot yet be filled in.

One of the larger trends of which reinforcement learning is a part is that toward greater contact between artificial intelligence and other engineering disciplines. Not all that long ago, artificial intelligence was viewed as almost entirely separate from control theory and statistics. It had to do with logic and symbols, not numbers. Artificial intelligence was large LISP programs, not linear algebra, differential equations, or statistics. Over the last decades this view has gradually eroded. Modern artificial intelligence researchers accept statistical and control algorithms, for example, as relevant competing methods or simply as tools of their trade. The previously ignored areas lying between artificial intelligence and conventional engineering are now among the most active, including new fields such as neural networks and our topic, reinforcement learning. In reinforcement learning we extend ideas from optimal control and stochastic approximation to address the broader and more ambitious goals of artificial intelligence.
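The exploration-exploitation trade-off described above can be made concrete with a small sketch. This is not part of the book's text (the n-armed bandit problem is treated properly in Chapter 2); the function pull_arm, the epsilon value, and the use of sample-average estimates are illustrative assumptions only.

```python
import random

def epsilon_greedy_bandit(pull_arm, n_arms, n_steps=1000, epsilon=0.1):
    """Sketch of balancing exploration and exploitation on a stochastic task.

    pull_arm(a) is an assumed function returning a stochastic reward for arm a.
    Action values are estimated by sample averages, so each arm must be tried
    many times before its estimate is reliable.
    """
    estimates = [0.0] * n_arms   # current estimate of each arm's expected reward
    counts = [0] * n_arms        # how many times each arm has been tried

    for _ in range(n_steps):
        if random.random() < epsilon:
            a = random.randrange(n_arms)                         # explore
        else:
            a = max(range(n_arms), key=lambda i: estimates[i])   # exploit
        r = pull_arm(a)
        counts[a] += 1
        estimates[a] += (r - estimates[a]) / counts[a]           # incremental average
    return estimates
```

Pursuing only the greedy choice (epsilon = 0) or only random choices (epsilon = 1) would, as the text notes, fail at the task; the small epsilon lets the agent progressively favor the actions that appear best while still sampling the others.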

1.2 Examples

A good way to understand reinforcement learning is to consider some of the examples and possible applications that have guided its development.

- A master chess player makes a move. The choice is informed both by planning, anticipating possible replies and counterreplies, and by immediate, intuitive judgments of the desirability of particular positions and moves.
- An adaptive controller adjusts parameters of a petroleum refinery's operation in real time. The controller optimizes the yield/cost/quality trade-off on the basis of specified marginal costs without sticking strictly to the set points originally suggested by engineers.
- A gazelle calf struggles to its feet minutes after being born. Half an hour later it is running at 20 miles per hour.
- A mobile robot decides whether it should enter a new room in search of more trash to collect or start trying to find its way back to its battery recharging station. It makes its decision based on how quickly and easily it has been able to find the recharger in the past.
- Phil prepares his breakfast. Closely examined, even this apparently mundane activity reveals a complex web of conditional behavior and interlocking goal-subgoal relationships: walking to the cupboard, opening it, selecting a cereal box, then reaching for, grasping, and retrieving the box. Other complex, tuned, interactive sequences of behavior are required to obtain a bowl, spoon, and milk jug. Each step involves a series of eye movements to obtain information and to guide reaching and locomotion. Rapid judgments are continually made about how to carry the objects or whether it is better to ferry some of them to the dining table before obtaining others. Each step is guided by goals, such as grasping a spoon or getting to the refrigerator, and is in service of other goals, such as having the spoon to eat with once the cereal is prepared and ultimately obtaining nourishment.

These examples share features that are so basic that they are easy to overlook. All involve interaction between an active decision-making agent and its environment, within which the agent seeks to achieve a goal despite uncertainty about its environment. The agent's actions are permitted to affect the future state of the environment (e.g., the next chess position, the level of reservoirs of the refinery, the next location of the robot), thereby affecting the options and opportunities available to the agent at later times. Correct choice requires taking into account indirect, delayed consequences of actions, and thus may require foresight or planning.

At the same time, in all these examples the effects of actions cannot be fully predicted; thus the agent must monitor its environment frequently and react appropriately. For example, Phil must watch the milk he pours into his cereal bowl to keep it from overflowing. All these examples involve goals that are explicit in the sense that the agent can judge progress toward its goal based on what it can sense directly. The chess player knows whether or not he wins, the refinery controller knows how much petroleum is being produced, the mobile robot knows when its batteries run down, and Phil knows whether or not he is enjoying his breakfast.

In all of these examples the agent can use its experience to improve its performance over time. The chess player refines the intuition he uses to evaluate positions, thereby improving his play; the gazelle calf improves the efficiency with which it can run; Phil learns to streamline making his breakfast. The knowledge the agent brings to the task at the start, either from previous experience with related tasks or built into it by design or evolution, influences what is useful or easy to learn, but interaction with the environment is essential for adjusting behavior to exploit specific features of the task.

1.3 Elements of Reinforcement Learning

Beyond the agent and the environment, one can identify four main subelements of a reinforcement learning system: a policy, a reward function, a value function, and, optionally, a model of the environment.

A policy defines the learning agent's way of behaving at a given time. Roughly speaking, a policy is a mapping from perceived states of the environment to actions to be taken when in those states. It corresponds to what in psychology would be called a set of stimulus-response rules or associations. In some cases the policy may be a simple function or lookup table, whereas in others it may involve extensive computation such as a search process. The policy is the core of a reinforcement learning agent in the sense that it alone is sufficient to determine behavior. In general, policies may be stochastic.

A reward function defines the goal in a reinforcement learning problem. Roughly speaking, it maps each perceived state (or state-action pair) of the environment to a single number, a reward, indicating the intrinsic desirability of that state. A reinforcement learning agent's sole objective is to maximize the total reward it receives in the long run. The reward function defines what are the good and bad events for the agent. In a biological system, it would not be inappropriate to identify rewards with pleasure and pain. They are the immediate and defining features of the problem faced by the agent. As such, the reward function must necessarily be unalterable by the agent. It may, however, serve as a basis for altering the policy. For example, if an action selected by the policy is followed by low reward, then the policy may be changed to select some other action in that situation in the future. In general, reward functions may be stochastic.

Whereas a reward function indicates what is good in an immediate sense, a value function specifies what is good in the long run. Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state. Whereas rewards determine the immediate, intrinsic desirability of environmental states, values indicate the long-term desirability of states after taking into account the states that are likely to follow, and the rewards available in those states. For example, a state might always yield a low immediate reward but still have a high value because it is regularly followed by other states that yield high rewards. Or the reverse could be true. To make a human analogy, rewards are like pleasure (if high) and pain (if low), whereas values correspond to a more refined and farsighted judgment of how pleased or displeased we are that our environment is in a particular state. Expressed this way, we hope it is clear that value functions formalize a basic and familiar idea.

Rewards are in a sense primary, whereas values, as predictions of rewards, are secondary. Without rewards there could be no values, and the only purpose of estimating values is to achieve more reward. Nevertheless, it is values with which we are most concerned when making and evaluating decisions. Action choices are made based on value judgments. We seek actions that bring about states of highest value, not highest reward, because these actions obtain the greatest amount of reward for us over the long run. In decision-making and planning, the derived quantity called value is the one with which we are most concerned. Unfortunately, it is much harder to determine values than it is to determine rewards.
Rewards are basically given directly by the environment, but values must be estimated and reestimated from the sequences of observations an agent makes over its entire lifetime. In fact, the most important component of almost all reinforcement learning algorithms is a method for efficiently estimating values. The central role of value estimation is arguably the most important thing we have learned about reinforcement learning over the last few decades.

Although all the reinforcement learning methods we consider in this book are structured around estimating value functions, it is not strictly necessary to do this to solve reinforcement learning problems. For example, search methods such as genetic algorithms, genetic programming, simulated annealing, and other function optimization methods have been used to solve reinforcement learning problems. These methods search directly in the space of policies without ever appealing to value functions. We call these evolutionary methods because their operation is analogous to the way biological evolution produces organisms with skilled behavior even when they do not learn during their individual lifetimes. If the space of policies is sufficiently small, or can be structured so that good policies are common or easy to find, then evolutionary methods can be effective. In addition, evolutionary methods have advantages on problems in which the learning agent cannot accurately sense the state of its environment.

Nevertheless, what we mean by reinforcement learning involves learning while interacting with the environment, which evolutionary methods do not do. It is our belief that methods able to take advantage of the details of individual behavioral interactions can be much more efficient than evolutionary methods in many cases. Evolutionary methods ignore much of the useful structure of the reinforcement learning problem: they do not use the fact that the policy they are searching for is a function from states to actions; they do not notice which states an individual passes through during its lifetime, or which actions it selects. In some cases this information can be misleading (e.g., when states are misperceived), but more often it should enable more efficient search. Although evolution and learning share many features and can naturally work together, as they do in nature, we do not consider evolutionary methods by themselves to be especially well suited to reinforcement learning problems. For simplicity, in this book when we use the term "reinforcement learning" we do not include evolutionary methods.

The fourth and final element of some reinforcement learning systems is a model of the environment. This is something that mimics the behavior of the environment. For example, given a state and action, the model might predict the resultant next state and next reward. Models are used for planning, by which we mean any way of deciding on a course of action by considering possible future situations before they are actually experienced. The incorporation of models and planning into reinforcement learning systems is a relatively new development. Early reinforcement learning systems were explicitly trial-and-error learners; what they did was viewed as almost the opposite of planning. Nevertheless, it gradually became clear that reinforcement learning methods are closely related to dynamic programming methods, which do use models, and that they in turn are closely related to state-space planning methods.
In Chapter 9 we explore reinforcement learning systems that simultaneously learn by trial and error, learn a model of the environment, and use the model for planning. Modern reinforcement learning spans the spectrum from low-level, trial-and-error learning to high-level, deliberative planning.
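The four elements just described can be summarized, informally, as a minimal program skeleton. The sketch below is not from the book; the names (Policy, Model, Agent) and the Python type aliases are illustrative assumptions, and, as the text requires, the reward function is left to the environment rather than to the agent.

```python
from typing import Callable, Dict, Hashable, Optional, Tuple

State = Hashable
Action = Hashable

# A policy maps perceived states to actions (in general it could be stochastic).
Policy = Callable[[State], Action]

# A model predicts the resultant next state and next reward for a state-action pair.
Model = Callable[[State, Action], Tuple[State, float]]

class Agent:
    """Skeleton grouping the four elements named in this section."""

    def __init__(self, policy: Policy, model: Optional[Model] = None):
        self.policy = policy                   # 1. policy: how to behave in each state
        self.model = model                     # 4. model (optional): used for planning
        self.values: Dict[State, float] = {}   # 3. value function: long-run desirability

    def act(self, state: State) -> Action:
        return self.policy(state)

# 2. The reward function belongs to the problem definition (the environment);
#    the agent cannot alter it, it can only alter its policy in response to it.
```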

1.4 An Extended Example: Tic-Tac-Toe

To illustrate the general idea of reinforcement learning and contrast it with other approaches, we next consider a single example in more detail.

Consider the familiar child's game of tic-tac-toe. Two players take turns playing on a three-by-three board. One player plays Xs and the other Os until one player wins by placing three marks in a row, horizontally, vertically, or diagonally, as the X player has in this game:

[Board diagram omitted: a tic-tac-toe position in which the X player has completed three in a row.]

If the board fills up with neither player getting three in a row, the game is a draw. Because a skilled player can play so as never to lose, let us assume that we are playing against an imperfect player, one whose play is sometimes incorrect and allows us to win. For the moment, in fact, let us consider draws and losses to be equally bad for us. How might we construct a player that will find the imperfections in its opponent's play and learn to maximize its chances of winning?

Although this is a simple problem, it cannot readily be solved in a satisfactory way through classical techniques. For example, the classical "minimax" solution from game theory is not correct here because it assumes a particular way of playing by the opponent. For example, a minimax player would never reach a game state from which it could lose, even if in fact it always won from that state because of incorrect play by the opponent. Classical optimization methods for sequential decision problems, such as dynamic programming, can compute an optimal solution for any opponent, but require as input a complete specification of that opponent, including the probabilities with which the opponent makes each move in each board state. Let us assume that this information is not available a priori for this problem, as it is not for the vast majority of problems of practical interest. On the other hand, such information can be estimated from experience, in this case by playing many games against the opponent. About the best one can do on this problem is first to learn a model of the opponent's behavior, up to some level of confidence, and then apply dynamic programming to compute an optimal solution given the approximate opponent model. In the end, this is not that different from some of the reinforcement learning methods we examine later in this book.

An evolutionary approach to this problem would directly search the space of possible policies for one with a high probability of winning against the opponent. Here, a policy is a rule that tells the player what move to make for every state of the game, that is, every possible configuration of Xs and Os on the three-by-three board. For each policy considered, an estimate of its winning probability would be obtained by playing some number of games against the opponent. This evaluation would then direct which policy or policies would be considered next. A typical evolutionary method would hill-climb in policy space, successively generating and evaluating policies in an attempt to obtain incremental improvements. Or, perhaps, a genetic-style algorithm could be used that would maintain and evaluate a population of policies. Literally hundreds of different optimization methods could be applied. By directly searching the policy space we mean that entire policies are proposed and compared on the basis of scalar evaluations.

Here is how the tic-tac-toe problem would be approached using reinforcement learning and approximate value functions. First we set up a table of numbers, one for each possible state of the game. Each number will be the latest estimate of the probability of our winning from that state. We treat this estimate as the state's value, and the whole table is the learned value function. State A has higher value than state B, or is considered "better" than state B, if the current estimate of the probability of our winning from A is higher than it is from B. Assuming we always play Xs, then for all states with three Xs in a row the probability of winning is 1, because we have already won.
Similarly, for all states with three Os in a row, or that are filled up, the correct probability is 0, as we cannot win from them. We set the initial values of all the other states to 0.5, representing a guess that we have a 50% chance of winning.

We play many games against the opponent. To select our moves we examine the states that would result from each of our possible moves (one for each blank space on the board) and look up their current values in the table. Most of the time we move greedily, selecting the move that leads to the state with greatest value, that is, with the highest estimated probability of winning. Occasionally, however, we select randomly from among the other moves instead. These are called exploratory moves because they cause us to experience states that we might otherwise never see. A sequence of moves made and considered during a game can be diagrammed as in Figure 1.1.

While we are playing, we change the values of the states in which we find ourselves during the game. We attempt to make them more accurate estimates of the probabilities of winning.

[Figure omitted: a tree of tic-tac-toe positions beginning at the starting position, alternating the opponent's moves and our moves.]

Figure 1.1  A sequence of tic-tac-toe moves. The solid lines represent the moves taken during a game; the dashed lines represent moves that we (our reinforcement learning player) considered but did not make. Our second move was an exploratory move, meaning that it was taken even though another sibling move, the one leading to e*, was ranked higher. Exploratory moves do not result in any learning, but each of our other moves does, causing backups as suggested by the curved arrows and detailed in the text.

" " probabilitiesof winning. To do this, we back up the value of the stateafter each greedymove to the statebefore the move, as suggestedby the arrowsin Figure 1.1. More precisely, the current value of the earlier stateis adjustedto be closer to the ' value of the later state. This can be done by moving the earlier states value a fraction of the way towardthe value of the later state. If we let 8 denotethe statebefore ' the greedy move, and 8 the stateafter the move, then the updateto the estimated valueof 8, denotedV (8), can be written as ' - V(s) +- V(s) + a [ V(s ) V(s)] , wherea is a small positivefraction called the step-sizeparameter , which influences the rate of learning. This updaterule is an exampleof a temporal-differencelearning . . . ~ 1.4 An ExtendedExample : 1ic-Tac-Toe method, so called becauseits changesare basedon a difference, V (Sf) - V (s), betweenestimates at two different times. The methoddescribed above perfonns quite well on this task. For example, if the step-size parameteris reducedproperly over time, this methodconverges , for any fixed opponent, to the true probabilities of winning from each stategiven optimal play by our player. Furthennore, the movesthen taken ( excepton exploratorymoves ) are in fact the optimal moves against the opponent. In other words, the method convergesto an optimal policy for playing the game. If the step-size parameteris not reducedall the way to zero over time, then this player also plays well against opponentsthat slowly changetheir way of playing. This exampleillustrates the differencesbetween evolutionary methods and methods that learn value functions. To evaluatea policy, an evolutionarymethod must hold it fixed and play many gamesagainst the opponent, or simulatemany games using a model of the opponent. The frequencyof wins gives an unbiasedestimate of the probability of winning with that policy, and can be used to direct the next policy selection. But eachpolicy changeis madeonly after many games, and only the final outcomeof eachgame is used: what happensduring the gamesis ignored. For example, if the player wins, then all of its behaviorin the gameis given credit, independentlyof how specific movesmight havebeen critical to the win. Credit is evengiven to movesthat neveroccurred ! Valuefunction methods, in contrast, allow individual statesto be evaluated. In the end, both evolutionaryand value function methodssearch the spaceof policies, but learning a value function takesadvantage of information availableduring the courseof play. This simpleexample illustrates some of the key featuresof reinforcementlearning methods. First, there is the emphasison learning while interactingwith an environment , in this casewith an opponentplayer . Second, thereis a clear goal, and correct behavior requiresplanning or foresight that takes into accountdelayed effects of ' ones choices. For example, the simple reinforcementlearning player would learn to setup multimovetraps for a shortsightedopponent . It is a striking featureof the reinforcement learningsolution that it can achievethe effectsof planningand lookahead without using a model of the opponentand without conductingan explicit search over possiblesequences of future statesand actions. While this exampleillustrates some of the key featuresof reinforcementlearning , it is so simple that it might give the impressionthat reinforcementlearning is more limited than it really is. 
Although tic-tac-toe is a two-person game, reinforcement learning also applies in the case in which there is no external adversary, that is, in the case of a "game against nature." Reinforcement learning also is not restricted to problems in which behavior breaks down into separate episodes, like the separate games of tic-tac-toe, with reward only at the end of each episode. It is just as applicable when behavior continues indefinitely and when rewards of various magnitudes can be received at any time.

Tic-tac-toe has a relatively small, finite state set, whereas reinforcement learning can be used when the state set is very large, or even infinite. For example, Gerry Tesauro (1992, 1995) combined the algorithm described above with an artificial neural network to learn to play backgammon, which has approximately 10^20 states. With this many states it is impossible ever to experience more than a small fraction of them. Tesauro's program learned to play far better than any previous program, and now plays at the level of the world's best human players (see Chapter 11). The neural network provides the program with the ability to generalize from its experience, so that in new states it selects moves based on information saved from similar states faced in the past, as determined by its network. How well a reinforcement learning system can work in problems with such large state sets is intimately tied to how appropriately it can generalize from past experience. It is in this role that we have the greatest need for combining supervised learning methods with reinforcement learning. Neural networks are not the only, or necessarily the best, way to do this.

In this tic-tac-toe example, learning started with no prior knowledge beyond the rules of the game, but reinforcement learning by no means entails a tabula rasa view of learning and intelligence. On the contrary, prior information can be incorporated into reinforcement learning in a variety of ways that can be critical for efficient learning. We also had access to the true state in the tic-tac-toe example, whereas reinforcement learning can also be applied when part of the state is hidden, or when different states appear to the learner to be the same. That case, however, is substantially more difficult, and we do not cover it significantly in this book.

Finally, the tic-tac-toe player was able to look ahead and know the states that would result from each of its possible moves. To do this, it had to have a model of the game that allowed it to "think about" how its environment would change in response to moves that it might never make. Many problems are like this, but in others even a short-term model of the effects of actions is lacking. Reinforcement learning can be applied in either case. No model is required, but models can easily be used if they are available or can be learned.
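The tic-tac-toe method of this section can be sketched in a few lines of code. This is an illustrative sketch rather than the authors' implementation: the board encoding, the helper afterstate(state, move), and the parameter values are assumptions; terminal states are assumed to be seeded separately with value 1 (we have won) or 0 (loss or draw); and, as the text specifies, the backup is applied only after greedy moves.

```python
import random

ALPHA = 0.1      # step-size parameter (the alpha in the update rule above)
EPSILON = 0.1    # fraction of exploratory moves

# state -> current estimate of the probability of winning from that state.
# Terminal states would be seeded with 1.0 or 0.0; everything else starts at 0.5.
values = {}

def value(state):
    return values.setdefault(state, 0.5)

def choose_move(state, legal_moves, afterstate):
    """Select a move by looking up the values of the resulting states.

    `afterstate(state, move)` is an assumed helper returning the board that
    results from making `move` in `state`. Returns the chosen move and a flag
    saying whether the move was exploratory."""
    if random.random() < EPSILON:
        return random.choice(legal_moves), True          # exploratory move
    best = max(legal_moves, key=lambda m: value(afterstate(state, m)))
    return best, False                                    # greedy move

def backup(state_before, state_after):
    """Temporal-difference backup after a greedy move:
    V(s) <- V(s) + alpha * [V(s') - V(s)]."""
    values[state_before] = value(state_before) + ALPHA * (
        value(state_after) - value(state_before))
```

With the step-size parameter reduced appropriately over time, repeated play against a fixed opponent would drive this table toward the true winning probabilities, as described in the text.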

Exercise 1.1: Self-Play  Suppose, instead of playing against a fixed opponent, the reinforcement learning algorithm described above played against itself. What do you think would happen in this case? Would it learn a different way of playing?

Exercise 1.2: Symmetries  Many tic-tac-toe positions appear different but are really the same because of symmetries. How might we amend the reinforcement learning algorithm described above to take advantage of this? In what ways would this improve it?

Now think again. Suppose the opponent did not take advantage of symmetries. In that case, should we? Is it true, then, that symmetrically equivalent positions should necessarily have the same value?

Exercise 1.3: Greedy Play  Suppose the reinforcement learning player was greedy, that is, it always played the move that brought it to the position that it rated the best. Would it learn to play better, or worse, than a nongreedy player? What problems might occur?

Exercise 1.4: Learning from Exploration  Suppose learning updates occurred after all moves, including exploratory moves. If the step-size parameter is appropriately reduced over time, then the state values would converge to a set of probabilities. What are the two sets of probabilities computed when we do, and when we do not, learn from exploratory moves? Assuming that we do continue to make exploratory moves, which set of probabilities might be better to learn? Which would result in more wins?

Exercise 1.5: Other Improvements  Can you think of other ways to improve the reinforcement learning player? Can you think of any better way to solve the tic-tac-toe problem as posed?

1.5 Summary

Reinforcement learning is a computational approach to understanding and automating goal-directed learning and decision-making. It is distinguished from other computational approaches by its emphasis on learning by the individual from direct interaction with its environment, without relying on exemplary supervision or complete models of the environment. In our opinion, reinforcement learning is the first field to seriously address the computational issues that arise when learning from interaction with an environment in order to achieve long-term goals.

Reinforcement learning uses a formal framework defining the interaction between a learning agent and its environment in terms of states, actions, and rewards. This framework is intended to be a simple way of representing essential features of the artificial intelligence problem. These features include a sense of cause and effect, a sense of uncertainty and nondeterminism, and the existence of explicit goals.

The concepts of value and value functions are the key features of the reinforcement learning methods that we consider in this book. We take the position that value functions are essential for efficient search in the space of policies. Their use of value functions distinguishes reinforcement learning methods from evolutionary methods that search directly in policy space guided by scalar evaluations of entire policies.

1.6 History of Reinforcement Learning

The history of reinforcement learning has two main threads, both long and rich, that were pursued independently before intertwining in modern reinforcement learning. One thread concerns learning by trial and error and started in the psychology of animal learning. This thread runs through some of the earliest work in artificial intelligence and led to the revival of reinforcement learning in the early 1980s. The other thread concerns the problem of optimal control and its solution using value functions and dynamic programming. For the most part, this thread did not involve learning. Although the two threads have been largely independent, the exceptions revolve around a third, less distinct thread concerning temporal-difference methods such as the one used in the tic-tac-toe example in this chapter. All three threads came together in the late 1980s to produce the modern field of reinforcement learning as we present it in this book.

The thread focusing on trial-and-error learning is the one with which we are most familiar and about which we have the most to say in this brief history. Before doing that, however, we briefly discuss the optimal control thread.

The term "optimal control" came into use in the late 1950s to describe the problem of designing a controller to minimize a measure of a dynamical system's behavior over time. One of the approaches to this problem was developed in the mid-1950s by Richard Bellman and others through extending a nineteenth-century theory of Hamilton and Jacobi. This approach uses the concepts of a dynamical system's state and of a value function, or "optimal return function," to define a functional equation, now often called the Bellman equation. The class of methods for solving optimal control problems by solving this equation came to be known as dynamic programming (Bellman, 1957a). Bellman (1957b) also introduced the discrete stochastic version of the optimal control problem known as Markovian decision processes (MDPs), and Ron Howard (1960) devised the policy iteration method for MDPs. All of these are essential elements underlying the theory and algorithms of modern reinforcement learning.

Dynamic programming is widely considered the only feasible way of solving general stochastic optimal control problems. It suffers from what Bellman called "the curse of dimensionality," meaning that its computational requirements grow exponentially with the number of state variables, but it is still far more efficient and more widely applicable than any other general method. Dynamic programming has been extensively developed since the late 1950s, including extensions to partially observable MDPs (surveyed by Lovejoy, 1991), many applications (surveyed by White, 1985, 1988, 1993), approximation methods (surveyed by Rust, 1996), and asynchronous methods (Bertsekas, 1982, 1983). Many excellent modern treatments of dynamic programming are available (e.g., Bertsekas, 1995; Puterman, 1994; Ross, 1983; and Whittle, 1982, 1983). Bryson (1996) provides an authoritative history of optimal control.

In this book, we consider all of the work in optimal control also to be, in a sense, work in reinforcement learning. We define reinforcement learning as any effective way of solving reinforcement learning problems, and it is now clear that these problems are closely related to optimal control problems, particularly those formulated as MDPs. Accordingly, we must consider the solution methods of optimal control, such as dynamic programming, also to be reinforcement learning methods.
Of course, almost all of these methods require complete knowledge of the system to be controlled, and for this reason it feels a little unnatural to say that they are part of reinforcement learning. On the other hand, many dynamic programming methods are incremental and iterative. Like learning methods, they gradually reach the correct answer through successive approximations. As we show in the rest of this book, these similarities are far more than superficial. The theories and solution methods for the cases of complete and incomplete knowledge are so closely related that we feel they must be considered together as part of the same subject matter.

Let us return now to the other major thread leading to the modern field of reinforcement learning, that centered on the idea of trial-and-error learning. This thread began in psychology, where "reinforcement" theories of learning are common. Perhaps the first to succinctly express the essence of trial-and-error learning was Edward Thorndike. We take this essence to be the idea that actions followed by good or bad outcomes have their tendency to be reselected altered accordingly. In Thorndike's words:

Of several responses made to the same situation, those which are accompanied or closely followed by satisfaction to the animal will, other things being equal, be more firmly connected with the situation, so that, when it recurs, they will be more likely to recur; those which are accompanied or closely followed by discomfort to the animal will, other things being equal, have their connections with that situation weakened, so that, when it recurs, they will be less likely to occur. The greater the satisfaction or discomfort, the greater the strengthening or weakening of the bond. (Thorndike, 1911, p. 244)

" " Thorndikecalled this the Law of Effect becauseit describesthe effect of reinforcing eventson the tendencyto selectactions . Although sometimescontroversial ( e.g., seeKimble , 1961, 1967; Mazur, 1994), the Law of Effect is widely regardedas an obvious basic principle underlying much behavior( e.g., Hilgard and Bower, 1975; Dennett, 1978; Campbell, 1960; Cziko, 1995). The Law of Effect includes the two most important aspectsof what we mean - - by trial anderror learning. First, it is selectional, meaningthat it involves trying alternativesand selectingamong them by comparingtheir consequences. Second, it is associative, meaning that the alternativesfound by selection are associated with particular situations. Natural selection in evolution is a prime exampleof a selectionalprocess , but it is not associative. Supervisedlearning is associative, but not selectional. It is the combinationof thesetwo that is essentialto the Law of Effect and to trial-and-error learning. Another way of saying this is that the Law of Effect is an elementaryway of combiningsearch and memory: searchin the fonn of trying and selectingamong many actions in each situation, and memory in the fonn of rememberingwhat actionsworked best , associatingthem with the situations in which they were best. Combining searchand memory in this way is essentialto reinforcementlearning . The idea of programminga computerto learn by trial and error datesback to the earliestspeculations about computers and intelligence( e.g., Turing, 1950). The ear- liest computationalinvestigations of trial-and-error learningto be publishedin detail were perhapsthose by Minsky and by Farley and Clark, both in 1954. In his PhiD. dissertation, Minsky discussedcomputational models of reinforcementlearning and describedhis constructionof an analogmachine composed of componentshe called - SN ARCs (StochasticNeural Analog ReinforcementCalculators ). Farley and Clark describedanother neural -network learning machinedesigned to learn by trial and " " " " error. In the 1960sthe tenns reinforcement and reinforcementlearning were widely usedin the engineeringliterature for the first time (e.g., Waltz and Fu, 1965; Mendel, 1966; Fu, 1970; Mendel and McClaren, 1970). Particularlyinfluential was ' " " Minsky s paper StepsToward Artificial Intelligence (Minsky, 1961), which discussed severalissues relevant to reinforcementlearning , including what he called the credit assignmentproblem : How do you distributecredit for successamong the many decisionsthat may havebeen involved in producingit ? All of the methodswe discussin this book are, in a sense, directedtoward solving this problem. The interestsof Farley and Clark (1954; Clark and Farley, 1955) shifted from - trial and-error learning to generalizationand pattern recognition, that is, from reinforcement learningto supervisedlearning . This begana patternof confusionabout the relationshipbetween these types of learning. Many researchersseemed to believe 1.6 History of ReinforcementLearning that they were studyingreinforcement learning when they were actually studyingsupervised learning. For example, neural network pioneerssuch as Rosenblatt( 1962) and Widrow and Hoff (1960) were clearly motivatedby reinforcementlearning - they usedthe languageof rewardsand punishments- but the systemsthey studied were supervisedlearning systemssuitable for pattern recognition and perceptual learning. Even today, researchersand textbooksoften minimize or blur the distinction betweenthese types of learning. 
Some modern neural-network textbooks use the term "trial-and-error" to describe networks that learn from training examples because they use error information to update connection weights. This is an understandable confusion, but it substantially misses the essential selectional character of trial-and-error learning.

Partly as a result of these confusions, research into genuine trial-and-error learning became rare in the 1960s and 1970s. In the next few paragraphs we discuss some of the exceptions and partial exceptions to this trend.

One of these was the work by a New Zealand researcher named John Andreae. Andreae (1963) developed a system called STeLLA that learned by trial and error in interaction with its environment. This system included an internal model of the world and, later, an "internal monologue" to deal with problems of hidden state (Andreae, 1969a). Andreae's later work (1977) placed more emphasis on learning from a teacher, but still included trial and error. Unfortunately, his pioneering research was not well known, and did not greatly impact subsequent reinforcement learning research.

More influential was the work of Donald Michie. In 1961 and 1963 he described a simple trial-and-error learning system for learning how to play tic-tac-toe (or naughts and crosses) called MENACE (for Matchbox Educable Naughts and Crosses Engine). It consisted of a matchbox for each possible game position, each matchbox containing a number of colored beads, a different color for each possible move from that position. By drawing a bead at random from the matchbox corresponding to the current game position, one could determine MENACE's move. When a game was over, beads were added to or removed from the boxes used during play to reinforce or punish MENACE's decisions. Michie and Chambers (1968) described another tic-tac-toe reinforcement learner called GLEE (Game Learning Expectimaxing Engine) and a reinforcement learning controller called BOXES. They applied BOXES to the task of learning to balance a pole hinged to a movable cart on the basis of a failure signal occurring only when the pole fell or the cart reached the end of a track. This task was adapted from the earlier work of Widrow and Smith (1964), who used supervised learning methods, assuming instruction from a teacher already able to balance the pole. Michie and Chambers's version of pole-balancing is one of the best early examples of a reinforcement learning task under conditions of incomplete knowledge. It influenced much later work in reinforcement learning, beginning with some of our own studies (Barto, Sutton, and Anderson, 1983; Sutton, 1984). Michie has consistently emphasized the role of trial and error and learning as essential aspects of artificial intelligence (Michie, 1974).

Widrow, Gupta, and Maitra (1973) modified the LMS algorithm of Widrow and Hoff (1960) to produce a reinforcement learning rule that could learn from success and failure signals instead of from training examples. They called this form of learning "selective bootstrap adaptation" and described it as "learning with a critic" instead of "learning with a teacher." They analyzed this rule and showed how it could learn to play blackjack. This was an isolated foray into reinforcement learning by Widrow, whose contributions to supervised learning were much more influential.

Research on learning automata had a more direct influence on the trial-and-error thread leading to modern reinforcement learning research.
These are methods for solving a nonassociative, purely selectional learning problem known as the n-armed bandit by analogy to a slot machine, or "one-armed bandit," except with n levers (see Chapter 2). Learning automata are simple, low-memory machines for solving this problem. Learning automata originated in Russia with the work of Tsetlin (1973) and have been extensively developed since then within engineering (see Narendra and Thathachar, 1974, 1989). Barto and Anandan (1985) extended these methods to the associative case.

John Holland (1975) outlined a general theory of adaptive systems based on selectional principles. His early work concerned trial and error primarily in its nonassociative form, as in evolutionary methods and the n-armed bandit. In 1986 he introduced classifier systems, true reinforcement learning systems including association and value functions. A key component of Holland's classifier systems was always a genetic algorithm, an evolutionary method whose role was to evolve useful representations. Classifier systems have been extensively developed by many researchers to form a major branch of reinforcement learning research (e.g., see Goldberg, 1989; Wilson, 1994), but genetic algorithms, which by themselves are not reinforcement learning systems, have received much more attention.

The individual most responsible for reviving the trial-and-error thread to reinforcement learning within artificial intelligence was Harry Klopf (1972, 1975, 1982). Klopf recognized that essential aspects of adaptive behavior were being lost as learning researchers came to focus almost exclusively on supervised learning. What was missing, according to Klopf, were the hedonic aspects of behavior, the drive to achieve some result from the environment, to control the environment toward desired ends and away from undesired ends. This is the essential idea of trial-and-error learning.

Klopf's ideas were especially influential on the authors because our assessment of them (Barto and Sutton, 1981a) led to our appreciation of the distinction between supervised and reinforcement learning, and to our eventual focus on reinforcement learning. Much of the early work that we and colleagues accomplished was directed toward showing that reinforcement learning and supervised learning were indeed different (Barto, Sutton, and Brouwer, 1981; Barto and Sutton, 1981b; Barto and Anandan, 1985). Other studies showed how reinforcement learning could address important problems in neural network learning, in particular, how it could produce learning algorithms for multilayer networks (Barto, Anderson, and Sutton, 1982; Barto and Anderson, 1985; Barto and Anandan, 1985; Barto, 1985, 1986; Barto and Jordan, 1987).

We turn now to the third thread of the history of reinforcement learning, that concerning temporal-difference learning. Temporal-difference learning methods are distinctive in being driven by the difference between temporally successive estimates of the same quantity, for example, of the probability of winning in the tic-tac-toe example. This thread is smaller and less distinct than the other two, but it has played a particularly important role in the field, in part because temporal-difference methods seem to be new and unique to reinforcement learning.

The origins of temporal-difference learning are in part in animal learning psychology, in particular, in the notion of secondary reinforcers. A secondary reinforcer is a stimulus that has been paired with a primary reinforcer such as food or pain and, as a result, has come to take on similar reinforcing properties. Minsky (1954) may have been the first to realize that this psychological principle could be important for artificial learning systems. Arthur Samuel (1959) was the first to propose and implement a learning method that included temporal-difference ideas, as part of his celebrated checkers-playing program. Samuel made no reference to Minsky's work or to possible connections to animal learning. His inspiration apparently came from Claude Shannon's (1950) suggestion that a computer could be programmed to use an evaluation function to play chess, and that it might be able to improve its play by modifying this function on-line. (It is possible that these ideas of Shannon's also influenced Bellman, but we know of no evidence for this.) Minsky (1961) extensively discussed Samuel's work in his "Steps" paper, suggesting the connection to secondary reinforcement theories, both natural and artificial.

As we have discussed, in the decade following the work of Minsky and Samuel, little computational work was done on trial-and-error learning, and apparently no computational work at all was done on temporal-difference learning. In 1972, Klopf brought trial-and-error learning together with an important component of temporal-difference learning. Klopf was interested in principles that would scale to learning in large systems, and thus was intrigued by notions of local reinforcement, whereby subcomponents of an overall learning system could reinforce one another. He developed the idea of "generalized reinforcement," whereby every component (nominally, every neuron) views all of its inputs in reinforcement terms: excitatory inputs as rewards and inhibitory inputs as punishments. This is not the same idea as what we now know as temporal-difference learning, and in retrospect it is farther from it than was Samuel's work.
On the other hand, Klopf linked the idea with trial-and-error learning and related it to the massive empirical database of animal learning psychology.

Sutton (1978a, 1978b, 1978c) developed Klopf's ideas further, particularly the links to animal learning theories, describing learning rules driven by changes in temporally successive predictions. He and Barto refined these ideas and developed a psychological model of classical conditioning based on temporal-difference learning (Sutton and Barto, 1981a; Barto and Sutton, 1982). There followed several other influential psychological models of classical conditioning based on temporal-difference learning (e.g., Klopf, 1988; Moore et al., 1986; Sutton and Barto, 1987, 1990). Some neuroscience models developed at this time are well interpreted in terms of temporal-difference learning (Hawkins and Kandel, 1984; Byrne, Gingrich, and Baxter, 1990; Gelperin, Hopfield, and Tank, 1985; Tesauro, 1986; Friston et al., 1994), although in most cases there was no historical connection. A recent summary of links between temporal-difference learning and neuroscience ideas is provided by Schultz, Dayan, and Montague (1997).

Our early work on temporal-difference learning was strongly influenced by animal learning theories and by Klopf's work. Relationships to Minsky's "Steps" paper and to Samuel's checkers players appear to have been recognized only afterward. By 1981, however, we were fully aware of all the prior work mentioned above as part of the temporal-difference and trial-and-error threads. At this time we developed a method for using temporal-difference learning in trial-and-error learning, known as the actor-critic architecture, and applied this method to Michie and Chambers's pole-balancing problem (Barto, Sutton, and Anderson, 1983). This method was extensively studied in Sutton's (1984) Ph.D. dissertation and extended to use backpropagation neural networks in Anderson's (1986) Ph.D. dissertation. Around this time, Holland (1986) incorporated temporal-difference ideas explicitly into his classifier systems. A key step was taken by Sutton in 1988 by separating temporal-difference learning from control, treating it as a general prediction method. That paper also introduced the TD(λ) algorithm and proved some of its convergence properties.

As we were finalizing our work on the actor-critic architecture in 1981, we discovered a paper by Ian Witten (1977) that contains the earliest known publication of a temporal-difference learning rule. He proposed the method that we now call tabular TD(0) for use as part of an adaptive controller for solving MDPs.

Witten's work was a descendant of Andreae's early experiments with STeLLA and other trial-and-error learning systems. Thus, Witten's 1977 paper spanned both major threads of reinforcement learning research, trial-and-error learning and optimal control, while making a distinct early contribution to temporal-difference learning.

Finally, the temporal-difference and optimal control threads were fully brought together in 1989 with Chris Watkins's development of Q-learning. This work extended and integrated prior work in all three threads of reinforcement learning research. Paul Werbos (1987) contributed to this integration by arguing for the convergence of trial-and-error learning and dynamic programming since 1977. By the time of Watkins's work there had been tremendous growth in reinforcement learning research, primarily in the machine learning subfield of artificial intelligence, but also in neural networks and artificial intelligence more broadly. In 1992, the remarkable success of Gerry Tesauro's backgammon-playing program, TD-Gammon, brought additional attention to the field. Other important contributions made in the recent history of reinforcement learning are too numerous to mention in this brief account; we cite these at the end of the individual chapters in which they arise.

1.7 Bibliographical Remarks

For additional general coverage of reinforcement learning, we refer the reader to the books by Bertsekas and Tsitsiklis (1996) and Kaelbling (1993a). Two special issues of the journal Machine Learning focus on reinforcement learning: Sutton (1992) and Kaelbling (1996). Useful surveys are provided by Barto (1995b); Kaelbling, Littman, and Moore (1996); and Keerthi and Ravindran (1997).

The example of Phil's breakfast in this chapter was inspired by Agre (1988). We direct the reader to Chapter 6 for references to the kind of temporal-difference method we used in the tic-tac-toe example.

Modern attempts to relate the kinds of algorithms used in reinforcement learning to the nervous system are made by Hampson (1989), Friston et al. (1994), Barto (1995a), Houk, Adams, and Barto (1995), Montague, Dayan, and Sejnowski (1996), and Schultz, Dayan, and Montague (1997).