Chapter 3: The Reinforcement Learning Problem (Markov Decision Processes, or MDPs)
Objectives of this chapter:
❐ present Markov decision processes—an idealized form of the AI problem for which we have precise theoretical results
❐ introduce key components of the mathematics: value functions and Bellman equations
R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction

The Agent–Environment Interface

[Figure 3.1: The agent–environment interaction in reinforcement learning. At each step the agent receives state St and reward Rt and emits action At; the environment responds with Rt+1 and St+1.]

Agent and environment interact at discrete time steps: t = 0, 1, 2, ...
    Agent observes state at step t: St ∈ S
    produces action at step t: At ∈ A(St)
    gets resulting reward: Rt+1 ∈ R ⊂ ℝ
    and resulting next state: St+1 ∈ S

More specifically, the agent and environment interact at each of a sequence of discrete time steps, t = 0, 1, 2, 3, ....² At each time step t, the agent receives some representation of the environment's state, St ∈ S, where S is the set of possible states, and on that basis selects an action, At ∈ A(St), where A(St) is the set of actions available in state St. One time step later, in part as a consequence of its action, the agent receives a numerical reward, Rt+1 ∈ R ⊂ ℝ, where R is the set of possible rewards, and finds itself in a new state, St+1.³ Figure 3.1 diagrams the agent–environment interaction.

At each time step, the agent implements a mapping from states to probabilities of selecting each possible action. This mapping is called the agent's policy and is denoted πt, where πt(a|s) is the probability that At = a if St = s. Reinforcement learning methods specify how the agent changes its policy as a result of its experience. The agent's goal, roughly speaking, is to maximize the total amount of reward it receives over the long run.

This framework is abstract and flexible and can be applied to many different problems in many different ways. For example, the time steps need not refer to fixed intervals of real time; they can refer to arbitrary successive stages of decision-making and acting. The actions can be low-level controls, such as the voltages applied to the motors of a robot arm, or high-level decisions, such as whether or not to have lunch or to go to graduate school. Similarly, the states can take a wide variety of forms.

² We restrict attention to discrete time to keep things as simple as possible, even though many of the ideas can be extended to the continuous-time case (e.g., see Bertsekas and Tsitsiklis, 1996; Werbos, 1992; Doya, 1996).
³ We use Rt+1 instead of Rt to denote the immediate reward due to the action taken at time t because it emphasizes that the next reward and the next state, St+1, are jointly determined.

Summary of Notation

Capital letters are used for random variables and major algorithm variables. Lower case letters are used for the values of random variables and for scalar functions. Quantities that are required to be real-valued vectors are written in bold and in lower case (even if random variables).

s: state
a: action
S: set of all nonterminal states
S⁺: set of all states, including the terminal state
A(s): set of actions possible in state s
R: set of possible rewards
t: discrete time step
T: final time step of an episode
St: state at t
At: action at t
Rt: reward at t, dependent, like St, on At−1 and St−1
Gt: return (cumulative discounted reward) following t
Gt^(n): n-step return (Section 7.1)
Gt^λ: λ-return (Section 7.2)
π: policy, decision-making rule
π(s): action taken in state s under deterministic policy π
π(a|s): probability of taking action a in state s under stochastic policy π
p(s′|s, a): probability of transition from state s to state s′ under action a
r(s, a, s′): expected immediate reward on transition from s to s′ under action a
vπ(s): value of state s under policy π (expected return)
v*(s): value of state s under the optimal policy
qπ(s, a): value of taking action a in state s under policy π
q*(s, a): value of taking action a in state s under the optimal policy
Vt: estimate (a random variable) of vπ or v*
Qt: estimate (a random variable) of qπ or q*
v̂(s, w): approximate value of state s given a vector of weights w
q̂(s, a, w): approximate value of state–action pair s, a given weights w
w, wt: vector of (possibly learned) weights underlying an approximate value function
x(s): vector of features visible when in state s
w⊤x: inner product of vectors, w⊤x = ∑ᵢ wᵢxᵢ; e.g., v̂(s, w) = w⊤x(s)
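The interaction loop of Figure 3.1 can be sketched in a few lines of Python. The `reset`/`step` interface and the toy two-state environment below are illustrative assumptions, not anything defined in the text:

```python
import random

class ToyEnv:
    """A hypothetical 2-state environment: action 1 taken in state 0 yields +1."""
    def reset(self):
        self.state = 0
        return self.state                    # S_0

    def step(self, action):
        reward = 1.0 if (self.state == 0 and action == 1) else 0.0
        self.state = 1 - self.state          # deterministic state flip
        return self.state, reward            # S_{t+1}, R_{t+1}

def random_policy(state):
    # pi_t(a|s): here uniform over the two actions, ignoring the state
    return random.choice([0, 1])

env = ToyEnv()
s = env.reset()
total = 0.0
for t in range(100):
    a = random_policy(s)                     # A_t sampled from pi_t(.|S_t)
    s, r = env.step(a)                       # environment returns R_{t+1}, S_{t+1}
    total += r
```

The loop mirrors the figure exactly: observe St, emit At, receive Rt+1 and St+1, repeat.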
3.6 Markov Decision Processes

A reinforcement learning task that satisfies the Markov property is called a Markov decision process, or MDP. If the state and action spaces are finite, then it is called a finite Markov decision process (finite MDP). Finite MDPs are particularly important to the theory of reinforcement learning. We treat them extensively throughout this book; they are all you need to understand 90% of modern reinforcement learning.

❐ If a reinforcement learning task has the Markov Property, it is basically a Markov Decision Process (MDP).
❐ If state and action sets are finite, it is a finite MDP.
❐ To define a finite MDP, you need to give:
    - state and action sets
    - one-step "dynamics"

A particular finite MDP is defined by its state and action sets and by the one-step dynamics of the environment. Given any state and action s and a, the probability of each possible pair of next state and reward, s′, r, is denoted

    p(s′, r | s, a) = Pr{St+1 = s′, Rt+1 = r | St = s, At = a}.   (3.6)

These quantities completely specify the dynamics of a finite MDP. Most of the theory we present in the rest of this book implicitly assumes the environment is a finite MDP.

Given the dynamics as specified by (3.6), one can compute anything else one might want to know about the environment, such as the expected rewards for state–action pairs,

    r(s, a) = E[Rt+1 | St = s, At = a] = ∑_{r∈R} r ∑_{s′∈S} p(s′, r | s, a),   (3.7)

the state-transition probabilities,

    p(s′ | s, a) = Pr{St+1 = s′ | St = s, At = a} = ∑_{r∈R} p(s′, r | s, a),   (3.8)

and the expected rewards for state–action–next-state triples,

    r(s, a, s′) = E[Rt+1 | St = s, At = a, St+1 = s′] = ∑_{r∈R} r p(s′, r | s, a) / p(s′ | s, a).   (3.9)

In the first edition of this book, the dynamics were expressed exclusively in terms of the latter two quantities, which were denoted P^a_{ss′} and R^a_{ss′} respectively. One weakness of that notation is that it still did not fully characterize the dynamics of the rewards, giving only their expectations. Another weakness is the excess of subscripts and superscripts. In this edition we will predominantly use the explicit notation of (3.6), while sometimes referring directly to the transition probabilities (3.8).
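Equations (3.6)–(3.8) can be checked numerically. In the sketch below, a hypothetical finite MDP's dynamics are stored as a table mapping each (s, a) to (s′, r, probability) triples, i.e. p(s′, r | s, a) as in (3.6); the expected reward r(s, a) and the transition probabilities p(s′ | s, a) then fall out as marginals, as in (3.7) and (3.8). The states, actions, and numbers are made up (loosely inspired by the recycling robot), not taken from Example 3.7:

```python
# Hypothetical dynamics: (s, a) -> list of (next_state, reward, probability),
# together forming p(s', r | s, a) as in Eq. (3.6).
dynamics = {
    ('high', 'search'):   [('high', 2.0, 0.7), ('low', 2.0, 0.3)],
    ('high', 'wait'):     [('high', 1.0, 1.0)],
    ('low',  'search'):   [('low', 2.0, 0.6), ('high', -3.0, 0.4)],
    ('low',  'wait'):     [('low', 1.0, 1.0)],
    ('low',  'recharge'): [('high', 0.0, 1.0)],
}

def expected_reward(s, a):
    """r(s, a) = sum over s', r of r * p(s', r | s, a)   -- Eq. (3.7)."""
    return sum(r * p for (_, r, p) in dynamics[(s, a)])

def transition_prob(s_next, s, a):
    """p(s' | s, a) = sum over r of p(s', r | s, a)      -- Eq. (3.8)."""
    return sum(p for (sn, _, p) in dynamics[(s, a)] if sn == s_next)

# Each distribution p(., . | s, a) must sum to 1 for the MDP to be well formed.
assert all(abs(sum(p for *_, p in triples) - 1.0) < 1e-9
           for triples in dynamics.values())
```

For instance, `expected_reward('high', 'search')` is 2.0·0.7 + 2.0·0.3 = 2.0, and `transition_prob('low', 'high', 'search')` is 0.3.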
is to actively At each such time the robot decides whether it should (1) actively search for a can, (2) remain stationary and wait for someone to bring it a can, or (3) go back to home base to recharge its battery. Suppose the environment works as follows. The best way to find cans is to actively search for them, but this runs down the robot’s battery, whereas waiting does not. Whenever the robot is searching, the possibility exists that its battery will become depleted. In The Agent Learns a Policy
Policy at step t = πt = a mapping from states to action probabilities:
    πt(a | s) = probability that At = a when St = s
Special case - deterministic policies:
    πt(s) = the action taken with prob = 1 when St = s
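A stochastic policy πt(a|s) can be represented as a simple probability table, with a deterministic policy as the special case that puts all probability on one action. The states and actions below are placeholders, not from the text:

```python
import random

# A stochastic policy as a table: pi[s][a] = pi_t(a | s).
pi = {
    's0': {'left': 0.8, 'right': 0.2},
    's1': {'left': 0.5, 'right': 0.5},
}

def sample_action(pi, s):
    """Draw A_t = a with probability pi(a | s)."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs)[0]

# Deterministic special case: all probability mass on a single action.
pi_det = {'s0': {'left': 1.0, 'right': 0.0}}
assert sample_action(pi_det, 's0') == 'left'
```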
❐ Reinforcement learning methods specify how the agent changes its policy as a result of experience.
❐ Roughly, the agent’s goal is to get as much reward as it can over the long run.
The Meaning of Life (goals, rewards, and returns)
Return
Suppose the sequence of rewards after step t is:
Rt+1, Rt+2, Rt+3, ...

What do we want to maximize?
At least three cases, but in all of them,
we seek to maximize the expected return, E{Gt }, on each step t.
• Total reward, Gt = sum of all future reward in the episode
• Discounted reward, Gt = sum of all future discounted reward
• Average reward, Gt = average reward per time step
Episodic Tasks
Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze
In episodic tasks, we almost always use simple total reward:
Gt = Rt+1 + Rt+2 + ··· + RT,
where T is a final time step at which a terminal state is reached, ending an episode.
Continuing Tasks
Continuing tasks: interaction does not have natural episodes, but just goes on and on...
In this class, for continuing tasks we will always use discounted return:

    Gt = Rt+1 + γRt+2 + γ^2 Rt+3 + ··· = ∑_{k=0}^{∞} γ^k Rt+k+1,

where γ, 0 ≤ γ ≤ 1, is the discount rate.
shortsighted 0 ← γ → 1 farsighted
Typically, γ = 0.9
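The discounted return is easy to compute for any finite reward sequence. As a sanity check on γ = 0.9: a constant reward of +1 forever gives Gt = 1/(1 − γ) = 10, and a long finite sum gets arbitrarily close to that limit. A minimal sketch:

```python
def discounted_return(rewards, gamma):
    """G_t = sum over k of gamma^k * R_{t+k+1}, for a finite reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Constant reward +1 with gamma = 0.9: the infinite sum is 1/(1-0.9) = 10,
# and 1000 terms are already within 1e-6 of it.
g = discounted_return([1.0] * 1000, 0.9)
assert abs(g - 10.0) < 1e-6
```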
An Example: Pole Balancing
Avoid failure: the pole falling beyond a critical angle or the cart hitting end of track
As an episodic task where episode ends upon failure:
    reward = +1 for each step before failure
    ⇒ return = number of steps before failure

As a continuing task with discounted return:
    reward = −1 upon failure; 0 otherwise
    ⇒ return = −γ^k, for k steps before failure
In either case, return is maximized by avoiding failure for as long as possible.
Another Example: Mountain Car
Get to the top of the hill as quickly as possible.
reward = −1 for each step where not at top of hill
⇒ return = −(number of steps before reaching top of hill)
Return is maximized by minimizing number of steps to reach the top of the hill.
A Trick to Unify Notation for Returns
❐ In episodic tasks, we number the time steps of each episode starting from zero.
❐ We usually do not have to distinguish between episodes, so instead of writing St,j for states in episode j, we write just St.
❐ Think of each episode as ending in an absorbing state that always produces a reward of zero:
[Diagram: S0 → S1 → S2 → absorbing state, with rewards R1 = +1, R2 = +1, R3 = +1, then R4 = 0, R5 = 0, ... forever after.]
❐ We can cover all cases by writing

    Gt = ∑_{k=0}^{∞} γ^k Rt+k+1,

where γ can be 1 only if a zero-reward absorbing state is always reached.
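The absorbing-state trick can be verified directly: padding an episode's rewards with the absorbing state's zeros leaves the return unchanged, even with γ = 1. A small sketch using the episode from the diagram above (R1 = R2 = R3 = +1):

```python
def unified_return(rewards, gamma):
    """G_t = sum over k of gamma^k * R_{t+k+1}; after the terminal step the
    episode is padded with the absorbing state's zero rewards."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Episode: R1 = R2 = R3 = +1, then the absorbing state yields 0 forever.
episode = [1.0, 1.0, 1.0]
padded = episode + [0.0] * 100   # stand-in for the infinite tail of zeros
assert unified_return(episode, 1.0) == unified_return(padded, 1.0) == 3.0
```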
Rewards and returns

• The objective in RL is to maximize long-term future reward
• That is, to choose At so as to maximize Rt+1, Rt+2, Rt+3, ...
• But what exactly should be maximized?
• The discounted return at time t:

    Gt = Rt+1 + γRt+2 + γ^2 Rt+3 + γ^3 Rt+4 + ···,

  where γ ∈ [0, 1) is the discount rate.
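The definition immediately gives the recursion Gt = Rt+1 + γ Gt+1, which underlies the Bellman equations developed in this chapter. A quick numerical check, with made-up rewards:

```python
def G(rewards, t, gamma):
    """Discounted return G_t for a finite reward list (R_1, R_2, ...)."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards[t:]))

rewards = [2.0, -1.0, 0.5, 3.0]   # hypothetical R_1..R_4
gamma = 0.9
# The recursion G_t = R_{t+1} + gamma * G_{t+1} holds at every step.
for t in range(len(rewards) - 1):
    assert abs(G(rewards, t, gamma)
               - (rewards[t] + gamma * G(rewards, t + 1, gamma))) < 1e-9
```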