
Chapter 3: The Learning Problem (Markov Decision Processes, or MDPs)

Objectives of this chapter:

❐ present Markov decision processes—an idealized form of the AI problem for which we have precise theoretical results
❐ introduce key components of the mathematics: value functions and Bellman equations

The Agent-Environment Interface

[Figure 3.1: The agent–environment interaction in reinforcement learning. At each step the agent receives state St and reward Rt from the environment, selects action At, and the environment responds with reward Rt+1 and next state St+1.]

Agent and environment interact at discrete time steps: t = 0, 1, 2, 3, ...
    Agent observes state at step t:       St ∈ S
    produces action at step t:            At ∈ A(St)
    gets resulting reward:                Rt+1 ∈ R
    and resulting next state:             St+1 ∈ S

    ... St, At, Rt+1, St+1, At+1, Rt+2, St+2, At+2, Rt+3, St+3, ...

Here S is the set of possible states, A(St) is the set of actions available in state St, and R ⊂ ℝ is the set of possible rewards. We restrict attention to discrete time to keep things as simple as possible, even though many of the ideas can be extended to the continuous-time case (e.g., see Bertsekas and Tsitsiklis, 1996; Werbos, 1992; Doya, 1996). We use Rt+1 instead of Rt to denote the immediate reward due to the action taken at time t because it emphasizes that the next reward and the next state, St+1, are jointly determined.

At each time step, the agent implements a mapping from states to probabilities of selecting each possible action. This mapping is called the agent's policy and is denoted πt, where πt(a|s) is the probability that At = a if St = s. Reinforcement learning methods specify how the agent changes its policy as a result of its experience. The agent's goal, roughly speaking, is to maximize the total amount of reward it receives over the long run.

This framework is abstract and flexible and can be applied to many different problems in many different ways. For example, the time steps need not refer to fixed intervals of real time; they can refer to arbitrary successive stages of decision-making and acting. The actions can be low-level controls, such as the voltages applied to the motors of a robot arm, or high-level decisions, such as whether or not to have lunch or to go to graduate school. Similarly, the states can take a wide variety of forms.
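To make the interface concrete, here is a minimal sketch of the agent–environment loop in Python. The `Environment` and `Agent` objects and their `reset`/`step`/`act`/`observe` methods are hypothetical placeholders, not from the book; only the ordering of St, At, Rt+1, St+1 follows the interface above.

```python
# Minimal sketch of the agent-environment loop (hypothetical env/agent
# interfaces; only the ordering S_t, A_t, R_{t+1}, S_{t+1} follows the text).

def run_episode(env, agent, max_steps=1000):
    state = env.reset()                              # S_0
    for t in range(max_steps):
        action = agent.act(state)                    # A_t ~ pi_t(. | S_t)
        next_state, reward, done = env.step(action)  # S_{t+1}, R_{t+1}
        agent.observe(state, action, reward, next_state)  # learning update
        state = next_state
        if done:                                     # terminal state reached
            break
```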


3.6 Markov Decision Processes

A reinforcement learning task that satisfies the Markov property is called a Markov decision process, or MDP. If the state and action spaces are finite, then it is called a finite Markov decision process (finite MDP). Finite MDPs are particularly important to the theory of reinforcement learning. We treat them extensively throughout this book; they are all you need to understand 90% of modern reinforcement learning.

A particular finite MDP is defined by its state and action sets and by the one-step dynamics of the environment. Given any state and action, s and a, the probability of each possible pair of next state and reward, s′, r, is denoted

    p(s′, r | s, a) = Pr{St+1 = s′, Rt+1 = r | St = s, At = a}.        (3.6)

These quantities completely specify the dynamics of a finite MDP. Most of the theory we present in the rest of this book implicitly assumes the environment is a finite MDP.

Given the dynamics as specified by (3.6), one can compute anything else one might want to know about the environment, such as the expected rewards for state–action pairs,

    r(s, a) = E[Rt+1 | St = s, At = a] = ∑_{r∈R} r ∑_{s′∈S} p(s′, r | s, a),        (3.7)

the state-transition probabilities,

    p(s′ | s, a) = Pr{St+1 = s′ | St = s, At = a} = ∑_{r∈R} p(s′, r | s, a),        (3.8)

and the expected rewards for state–action–next-state triples,

    r(s, a, s′) = E[Rt+1 | St = s, At = a, St+1 = s′] = ∑_{r∈R} r p(s′, r | s, a) / p(s′ | s, a).        (3.9)

In the first edition of this book, the dynamics were expressed exclusively in terms of the latter two quantities, which were denoted P^a_{ss′} and R^a_{ss′} respectively. One weakness of that notation is that it still did not fully characterize the dynamics of the rewards, giving only their expectations. Another weakness is the excess of subscripts and superscripts. In this edition we will predominantly use the explicit notation of (3.6), while sometimes referring directly to the transition probabilities (3.8).
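As a concrete check of equations (3.7) and (3.8), the sketch below stores the joint dynamics p(s′, r | s, a) as a dictionary for a small made-up two-state MDP, loosely patterned on the recycling robot of Example 3.7 below, and marginalizes it to obtain r(s, a) and p(s′ | s, a). The specific numbers are illustrative assumptions, not values from the book.

```python
# Joint dynamics p(s', r | s, a) for a small made-up MDP, stored as
# {(s, a): {(s', r): probability}}.  Numbers are illustrative only.
p = {
    ("high", "search"):   {("high", 1.0): 0.7, ("low", 1.0): 0.3},
    ("high", "wait"):     {("high", 0.2): 1.0},
    ("low",  "search"):   {("low", 1.0): 0.6, ("high", -3.0): 0.4},
    ("low",  "recharge"): {("high", 0.0): 1.0},
}

def expected_reward(s, a):
    # r(s, a) = sum over (s', r) of r * p(s', r | s, a)        (eq. 3.7)
    return sum(r * prob for (s_next, r), prob in p[(s, a)].items())

def transition_prob(s_next, s, a):
    # p(s' | s, a) = sum over r of p(s', r | s, a)             (eq. 3.8)
    return sum(prob for (s2, r), prob in p[(s, a)].items() if s2 == s_next)

print(expected_reward("high", "search"))          # 1.0
print(transition_prob("low", "high", "search"))   # 0.3
```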
Example 3.7: Recycling Robot MDP. The recycling robot (Example 3.3) can be turned into a simple example of an MDP by simplifying it and providing some more details. (Our aim is to produce a simple example, not a particularly realistic one.) Recall that the agent makes a decision at times determined by external events (or by other parts of the robot's control system). At each such time the robot decides whether it should (1) actively search for a can, (2) remain stationary and wait for someone to bring it a can, or (3) go back to home base to recharge its battery. Suppose the environment works as follows. The best way to find cans is to actively search for them, but this runs down the robot's battery, whereas waiting does not. Whenever the robot is searching, the possibility exists that its battery will become depleted.

The Agent Learns a Policy

Policy at step t = πt = a mapping from states to action probabilities

πt(a | s) = probability that At = a when St = s

Special case - deterministic policies:

πt(s) = the action taken with prob = 1 when St = s

❐ Reinforcement learning methods specify how the agent changes its policy as a result of experience.
❐ Roughly, the agent's goal is to get as much reward as it can over the long run.
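A stochastic policy can be represented directly as the table π(a|s). The sketch below uses hypothetical state and action names (not from the slides) to show how At can be sampled from πt(·|St), and what the deterministic special case looks like.

```python
import random

# A stochastic policy as a table: pi[s][a] = probability of action a in state s.
pi = {
    "s1": {"left": 0.8, "right": 0.2},
    "s2": {"left": 0.5, "right": 0.5},
}

def sample_action(pi, s):
    # Draw A_t ~ pi(. | S_t = s).
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs, k=1)[0]

# Deterministic special case: pi(s) names the single action taken with prob 1.
pi_deterministic = {"s1": "left", "s2": "right"}
```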

The Meaning of Life (goals, rewards, and returns)

Return

Suppose the sequence of rewards after step t is:

Rt+1, Rt+2, Rt+3, ... What do we want to maximize?

At least three cases, but in all of them,

we seek to maximize the expected return, E{Gt }, on each step t.

• Total reward, Gt = sum of all future reward in the episode
• Discounted reward, Gt = sum of all future discounted reward
• Average reward, Gt = average reward per time step

Episodic Tasks

Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze

In episodic tasks, we almost always use simple total reward:

Gt = Rt+1 + Rt+2 + ... + RT,

where T is a final time step at which a terminal state is reached, ending an episode.

Continuing Tasks

Continuing tasks: interaction does not have natural episodes, but just goes on and on...

In this class, for continuing tasks we will always use discounted return:

    Gt = Rt+1 + γ Rt+2 + γ² Rt+3 + ... = ∑_{k=0}^{∞} γ^k Rt+k+1,

where γ, 0 ≤ γ ≤ 1, is the discount rate.

shortsighted 0 ← γ → 1 farsighted

Typically, γ = 0.9
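A small sketch of the discounted return defined above, assuming the rewards are given as a finite list [Rt+1, Rt+2, ...] with everything after the list treated as zero (the function name and format are my own, not from the slides):

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = sum_k gamma**k * R_{t+k+1}, where rewards = [R_{t+1}, R_{t+2}, ...]
    and rewards after the end of the list are assumed to be zero."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

print(discounted_return([0, 0, 2], gamma=0.9))   # 0.9**2 * 2, approx 1.62
```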

An Example: Pole Balancing

Avoid failure: the pole falling beyond a critical angle or the cart hitting end of track

As an episodic task where episode ends upon failure:
    reward = +1 for each step before failure
    ⇒ return = number of steps before failure

As a continuing task with discounted return:
    reward = −1 upon failure; 0 otherwise
    ⇒ return = −γ^k, for k steps before failure

In either case, return is maximized by avoiding failure for as long as possible.

Another Example: Mountain Car

Get to the top of the hill as quickly as possible.

reward = −1 for each step where not at top of hill
⇒ return = − number of steps before reaching top of hill

Return is maximized by minimizing number of steps to reach the top of the hill.

A Trick to Unify Notation for Returns

❐ In episodic tasks, we number the time steps of each episode starting from zero.
❐ We usually do not have to distinguish between episodes, so instead of writing St,j for states in episode j, we write just St.
❐ Think of each episode as ending in an absorbing state that always produces reward of zero:

[Diagram: states S0 → S1 → S2 → absorbing terminal state, with rewards R1 = +1, R2 = +1, R3 = +1, then R4 = 0, R5 = 0, ...]

❐ We can cover all cases by writing

    Gt = ∑_{k=0}^{∞} γ^k Rt+k+1,

where γ can be 1 only if a zero-reward absorbing state is always reached.

Rewards and returns

• The objective in RL is to maximize long-term future reward

• That is, to choose At so as to maximize Rt+1, Rt+2, Rt+3, ...

• But what exactly should be maximized?

• The discounted return at time t:

    Gt = Rt+1 + γ Rt+2 + γ² Rt+3 + γ³ Rt+4 + ···,    where γ ∈ [0, 1) is the discount rate

    γ              Reward sequence          Return
    0.5 (or any)   1, 0, 0, 0, …            1
    0.5            0, 0, 2, 0, 0, 0, …      0.5
    0.9            0, 0, 2, 0, 0, 0, …      1.62
    0.5            −1, 2, 6, 3, 2, 0, …     2

• Suppose γ = 0.5 and the reward sequence is R1 = 1, R2 = 6, R3 = −12, R4 = 16, then zeros for R5 and later. What are the following returns?

    G4 = 0,  G3 = 16,  G2 = −4,  G1 = 4,  G0 = 3

• Suppose γ = 0.5 and the reward sequence is all 1s. Then

    G = 1 / (1 − γ) = 2

• Suppose γ = 0.5 and the reward sequence is R1 = 1, R2 = 13, R3 = 13, R4 = 13, and so on, all 13s. Then

    G2 = 26,  G1 = 26,  G0 = 14

• And if γ = 0.9?

    G1 = 130,  G0 = 118
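The answers above can be checked with the recursion Gt = Rt+1 + γ Gt+1, working backward from the end of the reward sequence. A short sketch (function name and list format are my own):

```python
def returns(rewards, gamma):
    """Given rewards = [R_1, R_2, ..., R_T] (zero afterwards), return
    [G_0, G_1, ..., G_T] computed by G_t = R_{t+1} + gamma * G_{t+1}."""
    G = [0.0] * (len(rewards) + 1)          # G_T = 0
    for t in reversed(range(len(rewards))):
        G[t] = rewards[t] + gamma * G[t + 1]
    return G

print(returns([1, 6, -12, 16], 0.5))        # [3.0, 4.0, -4.0, 16.0, 0.0]
```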

4 value functions

                  state values    action values

    prediction        vπ               qπ

    control           v*               q*

• All theoretical objects, mathematical ideals (expected values)

• Distinct from their estimates:

    Vt(s)    Qt(s, a)

Values are expected returns

• The value of a state, given a policy:

    vπ(s) = E{Gt | St = s, At:∞ ~ π}        vπ : S → ℝ

• The value of a state-action pair, given a policy:

    qπ(s, a) = E{Gt | St = s, At = a, At+1:∞ ~ π}        qπ : S × A → ℝ

• The optimal value of a state:

    v*(s) = max_π vπ(s)        v* : S → ℝ

• The optimal value of a state-action pair:

    q*(s, a) = max_π qπ(s, a)        q* : S × A → ℝ

• Optimal policy: π* is an optimal policy if and only if

    π*(a|s) > 0 only where q*(s, a) = max_b q*(s, b),    for all s ∈ S

• In other words, π* is optimal iff it is greedy wrt q*
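Because these values are expectations of the return, the simplest estimators just average sampled returns. The sketch below assumes a hypothetical helper `run_episode_from` that follows π from a given start (optionally forcing the first action) and returns one observed Gt; it is an illustration, not code from the slides.

```python
def mc_state_value(s, policy, run_episode_from, n_episodes=1000):
    # v_pi(s) is approximated by the average return observed from state s
    # while following policy pi.
    total = 0.0
    for _ in range(n_episodes):
        total += run_episode_from(s, policy)                  # one sampled G_t
    return total / n_episodes

def mc_action_value(s, a, policy, run_episode_from, n_episodes=1000):
    # q_pi(s, a): first action forced to a, policy pi followed thereafter.
    total = 0.0
    for _ in range(n_episodes):
        total += run_episode_from(s, policy, first_action=a)
    return total / n_episodes
```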

optimal policy example

Gridworld

❐ Actions: north, south, east, west; deterministic.
❐ If would take agent off the grid: no move but reward = –1
❐ Other actions produce reward = 0, except actions that move agent out of special states A and B as shown.

State-value function for equiprobable random policy; γ = 0.9

Golf

❐ State is ball location

❐ Reward of –1 for each stroke until the ball is in the hole
❐ Value of a state?
❐ Actions:
    ! putt (use putter)
    ! driver (use driver)
❐ putt succeeds anywhere on the green

[Figure: contours of vputt, the state-value function for always using the putter; values run from −1 near the hole to −6 far from it, with the sand and the green marked.]

[Figure: contours of q*(s, driver), with the sand and the green marked.]

What we learned so far

❐ Finite Markov decision processes
    ! States, actions, and rewards
    ! And returns
    ! And time, discrete time
    ! They capture essential elements of life — state, causality
❐ The goal is to optimize expected returns
    ! returns are discounted sums of future rewards
❐ Thus we are interested in values — expected returns
❐ There are four value functions
    ! state vs state-action values
    ! values for a policy vs values for the optimal policy

Summary of Notation

Capital letters are used for random variables and major algorithm variables; lower case letters are used for the values of random variables and for scalar functions. Quantities that are required to be real-valued vectors are written in bold and in lower case (even if random variables).

    s, a             state, action
    S                set of all nonterminal states
    S+               set of all states, including the terminal state
    A(s)             set of actions possible in state s
    t                discrete time step
    T                final time step of an episode
    St, At           state and action at t
    Rt               reward at t, dependent, like St, on At−1 and St−1
    Gt               return (cumulative discounted reward) following t
    π                policy, decision-making rule
    π(s)             action taken in state s under deterministic policy π
    π(a|s)           probability of taking action a in state s under stochastic policy π
    p(s′|s, a)       probability of transition from state s to state s′ under action a
    r(s, a, s′)      expected immediate reward on transition from s to s′ under action a
    vπ(s), qπ(s, a)  value of state s, and of taking action a in state s, under policy π
    v*(s), q*(s, a)  value of state s, and of taking action a in state s, under the optimal policy
    Vt, Qt           estimates (random variables) of vπ or v*, and of qπ or q*
    v̂(s, w), q̂(s, a, w)   approximate values given a vector of weights w
    w, wt            vector of (possibly learned) weights underlying an approximate value function
    x(s)             vector of features visible when in state s
    w⊤x              inner product of vectors, w⊤x = ∑i wi xi; e.g., v̂(s, w) = w⊤x(s)

Optimal Value Functions

❐ For finite MDPs, policies can be partially ordered:
    π ≥ π′ if and only if vπ(s) ≥ vπ′(s) for all s ∈ S
❐ There are always one or more policies that are better than or equal to all the others. These are the optimal policies. We denote them all π*.
❐ Optimal policies share the same optimal state-value function:
    v*(s) = max_π vπ(s) for all s ∈ S
❐ Optimal policies also share the same optimal action-value function:
    q*(s, a) = max_π qπ(s, a) for all s ∈ S and a ∈ A(s)
    This is the expected return for taking action a in state s and thereafter following an optimal policy.

Why Optimal State-Value Functions are Useful

Any policy that is greedy with respect to v * is an optimal policy.

Therefore, given v * , one-step-ahead search produces the long-term optimal actions.
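A sketch of that one-step-ahead search: given the dynamics p(s′, r | s, a) stored as in the earlier sketch and a table for v*, pick the action that maximizes the expected one-step backup. The data-structure format is an assumption, not the book's code.

```python
def greedy_with_v(s, actions, p, v_star, gamma=0.9):
    # argmax over a of  sum_{s', r} p(s', r | s, a) * (r + gamma * v_star[s'])
    def backup(a):
        return sum(prob * (r + gamma * v_star[s_next])
                   for (s_next, r), prob in p[(s, a)].items())
    return max(actions, key=backup)
```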

E.g., back to the gridworld:

[Figure: a) gridworld (special states A and B, moved to A′ and B′ with rewards +10 and +5),  b) optimal state-value function v*,  c) an optimal policy π*]

    v*:
    22.0  24.4  22.0  19.4  17.5
    19.8  22.0  19.8  17.8  16.0
    17.8  19.8  17.8  16.0  14.4
    16.0  17.8  16.0  14.4  13.0
    14.4  16.0  14.4  13.0  11.7

Optimal Value Function for Golf

❐ We can hit the ball farther with driver than with putter, but with less accuracy
❐ q*(s, driver) gives the value of using driver first, then using whichever actions are best

[Figure: contours of vputt (top) and of q*(s, driver) (bottom), with the sand and the green marked.]

What About Optimal Action-Value Functions?

Given q * , the agent does not even have to do a one-step-ahead search:

π*(s) = argmax_a q*(s, a)
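With q* stored as a table, that argmax is a one-liner; no model of the environment dynamics is needed. The table layout is an assumption for illustration.

```python
def greedy_with_q(s, actions, q_star):
    # pi*(s) = argmax_a q*(s, a)
    return max(actions, key=lambda a: q_star[(s, a)])
```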

Value Functions x 4

Bellman Equations x 4

Bellman Equation for a Policy π

The basic idea:

    Gt = Rt+1 + γ Rt+2 + γ² Rt+3 + γ³ Rt+4 + ···
       = Rt+1 + γ ( Rt+2 + γ Rt+3 + γ² Rt+4 + ··· )
       = Rt+1 + γ Gt+1

So:

    vπ(s) = Eπ{Gt | St = s}
          = Eπ{Rt+1 + γ vπ(St+1) | St = s}

Or, without the expectation operator:

    vπ(s) = ∑_a π(a|s) ∑_{s′,r} p(s′, r | s, a) [ r + γ vπ(s′) ]


More on the Bellman Equation

    vπ(s) = ∑_a π(a|s) ∑_{s′,r} p(s′, r | s, a) [ r + γ vπ(s′) ]

This is a set of equations (in fact, linear), one for each state. The value function for π is its unique solution.

Backup diagrams:

[Backup diagrams for vπ and for qπ]
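Because the equations are linear, vπ can be found by solving the linear system directly or, as sketched below, by sweeping the Bellman equation as an update until it converges (iterative policy evaluation, in the spirit of Chapter 4). The dictionary-based MDP format is an assumption carried over from the earlier sketches.

```python
def evaluate_policy(states, actions, p, pi, gamma=0.9, tol=1e-8):
    """Iterate v(s) <- sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma v(s')].
    p maps (s, a) -> {(s', r): probability}; pi maps s -> {a: probability};
    actions(s) returns the actions available in s."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = sum(pi[s][a] * sum(prob * (r + gamma * v[s_next])
                                       for (s_next, r), prob in p[(s, a)].items())
                        for a in actions(s))
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v
```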


Gridworld

❐ Actions: north, south, east, west; deterministic.
❐ If would take agent off the grid: no move but reward = –1
❐ Other actions produce reward = 0, except actions that move agent out of special states A and B as shown.

State-value function for equiprobable random policy; γ = 0.9

Bellman Optimality Equation for v*

The value of a state under an optimal policy must equal the expected return for the best action from that state:

    v*(s) = max_a qπ*(s, a)
          = max_a E[Rt+1 + γ v*(St+1) | St = s, At = a]
          = max_a ∑_{s′,r} p(s′, r | s, a) [ r + γ v*(s′) ]

The relevant backup diagram:

[Backup diagram for v*: from state s, a max over the available actions a, then an expectation over the resulting s′ and r.]

v * is the unique solution of this system of nonlinear equations.


q*(s, driver). These are the values of each state if we first play a stroke with the driver and afterward select either the driver or the putter, whichever is better. The driver enables us to hit the ball farther, but with less accuracy. We can reach the hole in one shot using the driver only if we are already very close; thus the −1 contour for q*(s, driver) covers only a small portion of the green. If we have two strokes, however, then we can reach the hole from much farther away, as shown by the −2 contour. In this case we don't have to drive all the way to within the small −1 contour, but only to anywhere on the green; from there we can use the putter. The optimal action-value function gives the values after committing to a particular first action, in this case, to the driver, but afterward using whichever actions are best. The −3 contour is still farther out and includes the starting tee. From the tee, the best sequence of actions is two drives and one putt, sinking the ball in three strokes.

Because v* is the value function for a policy, it must satisfy the self-consistency condition given by the Bellman equation for state values (3.12). Because it is the optimal value function, however, v*'s consistency condition can be written in a special form without reference to any specific policy. This is the Bellman equation for v*, or the Bellman optimality equation. Intuitively, the Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state:

    v*(s) = max_{a∈A(s)} qπ*(s, a)
          = max_a Eπ*[ Gt | St = s, At = a ]
          = max_a Eπ*[ ∑_{k=0}^{∞} γ^k Rt+k+1 | St = s, At = a ]
          = max_a Eπ*[ Rt+1 + γ ∑_{k=0}^{∞} γ^k Rt+k+2 | St = s, At = a ]
          = max_a E[ Rt+1 + γ v*(St+1) | St = s, At = a ]                        (3.16)
          = max_{a∈A(s)} ∑_{s′,r} p(s′, r | s, a) [ r + γ v*(s′) ].              (3.17)

The last two equations are two forms of the Bellman optimality equation for v*.

Bellman Optimality Equation for q*

The Bellman optimality equation for q* is

    q*(s, a) = E[ Rt+1 + γ max_{a′} q*(St+1, a′) | St = s, At = a ]
             = ∑_{s′,r} p(s′, r | s, a) [ r + γ max_{a′} q*(s′, a′) ].

The relevant backup diagram:

[Backup diagram for q*: from the pair (s, a), an expectation over s′ and r, then a max over the next actions a′.]

q * is the unique solution of this system of nonlinear equations.

Solving the Bellman Optimality Equation

❐ Finding an optimal policy by solving the Bellman Optimality Equation requires the following:
    ! accurate knowledge of environment dynamics;
    ! we have enough space and time to do the computation;
    ! the Markov Property.
❐ How much space and time do we need?
    ! polynomial in number of states (via dynamic programming methods; Chapter 4), as sketched below,
    ! BUT, number of states is often huge (e.g., backgammon has about 10^20 states).
❐ We usually have to settle for approximations.
❐ Many RL methods can be understood as approximately solving the Bellman Optimality Equation.
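Dynamic programming turns the Bellman optimality equation into an update. A minimal value-iteration sketch, under the same dictionary-based MDP format assumed in the earlier sketches (this is an illustration of the idea, not the book's algorithm listing):

```python
def value_iteration(states, actions, p, gamma=0.9, tol=1e-8):
    """Iterate v(s) <- max_a sum_{s',r} p(s',r|s,a) [r + gamma v(s')]."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = max(sum(prob * (r + gamma * v[s_next])
                            for (s_next, r), prob in p[(s, a)].items())
                        for a in actions(s))
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v
```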

Summary

❐ Agent-environment interaction
    ! States
    ! Actions
    ! Rewards
❐ Policy: stochastic rule for selecting actions
❐ Return: the function of future rewards agent tries to maximize
❐ Episodic and continuing tasks
❐ Markov Property
❐ Markov Decision Process
    ! Transition probabilities
    ! Expected rewards
❐ Value functions
    ! State-value function for a policy
    ! Action-value function for a policy
    ! Optimal state-value function
    ! Optimal action-value function
❐ Optimal value functions
❐ Optimal policies
❐ Bellman Equations
❐ The need for approximation
