The Reinforcement Learning Problem (Markov Decision Processes, Or Mdps)

Chapter 3: The Reinforcement Learning Problem (Markov Decision Processes, or MDPs) Objectives of this chapter: ❐ present Markov decision processes—an idealized form of the AI problem for which we have precise theoretical results ❐ introduce key components of the mathematics: value functions and Bellman equations R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 44 CHAPTER 3. THE REINFORCEMENT LEARNING PROBLEM 44 CHAPTER 3. THE REINFORCEMENT LEARNING PROBLEM Agent Agent 44 CHAPTER 3. THE REINFORCEMENT LEARNING PROBLEM state reward action state reward St Rt action At St Rt Rt+1 At SUMMARY OF NOTATIONAgent xiii Rt+1 St+1 Environment SUMMARYstate reward OF NOTATION xiii St+1 EnvironmentThe Agent-Environment Interface action St Rt Summary of Notation At Figure 3.1: TheRt+1 agent–environment interaction in reinforcement learning. SUMMARYSummary OF NOTATION of Notation xiii Figure 3.1: The agent–environment interaction in reinforcement learning. St+1 Environment Agent Capital letters are used for random variables and major algorithm variables. SummaryLowerFiguregivesCapital 3.1: case rise The letters letters to agent–environment ofrewards, are are Notation used used special for for interaction the numerical random values in reinforcement variables of values random that and learning. variables the major agent algorithm and tries for to scalar maximize variables. gives rise to rewards, special numerical valuesstate thatreward the agent tries to maximizefunctions.overLower time. Quantitiescaseaction A letters complete that are areused specification required for the to values of be an real-valued of environment random vectors variables defines are and awrittentask for,one scalar St Rt A over time. A complete specification of an environment defines a task,onegivesin bold riseinstancefunctions. to and rewards, in of lower thet Quantities special reinforcement case numerical (even that values if are learning random that required the agentvariables). problem. to tries be to real-valued maximize vectors are written Rt+1 Capital letters are used for random variables and major algorithm variables. instance of the reinforcement learning problem. over time.in bold A complete and in specification lower case of (even an environment if random defines variables). a task,one St+1 EnvironmentLower caseMore letters specifically, are used the for agent the values and environment of random variables interact and at each for scalar of a sequence instance of the reinforcement learning problem. More specifically, the agent and environment interact at each of a sequencefunctions.s of discretestate Quantities time steps, that aret =0 required, 1, 2, 3,... to be.2 At real-valued each time vectors step t, are the written agent receives 2 aMores specifically,actionstate the agent and environment interact at each of a sequence of discrete time steps, t =0, 1, 2, 3,.... At each time step t, the agent receivesin boldsome and representation in lower case (even of the2 if environment’s random variables).state, St S, where S is the set of Agent and environment interact atof diS discretescreate ti timemeset s steps,teps ofaction: alltt =0= nonterminal0, 1, 22,,3K,.... At states each time step t, the agent receives2 some representation of the environment’s state, St S, where S is the set of possible states, and on that basis selects an action, At A(St), where A(St) 2 some+ representation of the environment’s state, St S, where S is the set of Agent observeAs state at step t:A SSt ∈ S set ofset all of states, all nonterminal including the states2 terminal state 2 possible states, and on that basis selects an action, At (St), where possible(sSt) is states, thestate set and of on actions that basis available selects an inaction state, At StA.(S Onet), where timeA( stepSt) later, in part as a 2 S+ set of all states, including the2 terminal state is the set of actions available in state St. One produc timees stepaction later, at step in t : part At ∈is asA the(( aSstconsequence)setofactionspossibleinstate set) of actions available of its action, in state theSt. One agent time receivess step later, a numerical in part as a reward, R R a A action t+1 2 ⇢ consequence of its action, the agent receives age numericalts resultingreward reward:, R R ∈ consequenceR (s)setofactionspossibleinstate of itsR action, the agent receives a numerical rewards , Rt+1 R,and 3 t+1t+1 RS R, whereset of allis the nonterminal set3 of possible states rewards, and finds2 itself in a new state, St+1. 2 finds⇢3 itself in a new state, St+1. Figure 3.1 diagrams the agent–environment , where R is the set of possible rewards, and finds itself in a new state, St+1t+. Figurediscrete 3.1 diagrams time step the agent–environment interaction. R and resulting next state: St+1 ∈interaction.S set of all states, including the terminal state Figure 3.1 diagrams the agent–environment interaction. T! =t s0,a0final,s1,adiscrete time1,... step time of an step episode A(s)setofactionspossibleinstateAt each time step, the agent implementss a mapping from states to prob- SAt eachT timestate step,final at thet time agent step implements of an a episode mapping from states to prob- abilitiest of selecting each possible action. This mapping is called the agent’s At each time step, the agent implements. a mappingR fromt+1 statesRt+ toThe2 prob- otherabilitiesRSt+3 random ofstate variables selecting. at t are each a possiblefunction action. of this sequence. This mapping The transitional is called the agent’s St St+1 St+2 At t St+3action at t abilities of selecting each possible action. This mappingAt is calledAt+ the1 agent’spolicyt At+2 and isdiscrete denotedAt+3 ⇡ timet, where step⇡t⇡(a s)istheprobabilitythat⇡ a s At = a if St = s. A a S s target rtpolicy+1At is aand functionaction is denoted of at stt, t|,at where, and stt(+1.)istheprobabilitythat The termination conditiont =βt, if t = . ReinforcementTRt finalreward learning time at methods stept, dependent, of specify an episode how like theS agent|t,on changesAt 1 and its policySt 1 as policy and is denoted ⇡t, where ⇡t(a s)istheprobabilitythatAt = a ifterminalSt = s. Reinforcement target z , and prediction learning methodsy , are functions specify how of−s thealone. agent− changes its policy as R. S. Sutton and A. G.| Barto: Reinforcement Learning: An Introduction aresultofitsexperience.Theagent’sgoal,roughlyspeaking,istomaximizeG Rt returnt reward (cumulative at t, dependent, discountedt 2 like reward)St,ont followingAt 1 andt St 1 Reinforcement learning methods specify how the agent changes its policySt ast state at t − − the( totaln)aresultofitsexperience.Theagent’sgoal,roughlyspeaking,istomaximizeGt amount ofreturn reward it (cumulative receives over the discounted long run. reward) following t AGtt (n) (n) actionn-step at returnt (Section 7.1) (n 1) aresultofitsexperience.Theagent’sgoal,roughlyspeaking,istomaximizeR the= totalrt+1 + amountβt+1zt+1 of+(1 rewardβ itt+1 receives)R − over the long run. Thisλt G frameworkt isn-step abstract return and flexible (Section and can 7.1)t+1 be applied to many di↵erent RGtt rewardλ-return at t (Section, dependent, 7.2)− like St,onAt 1 and St 1 the total amount of reward it receives over the long run. problemsλ in many di↵erent ways. For example, the time− steps need− not refer Gt This frameworkλ-return is (Section abstract 7.2) and flexible and can be applied to many di↵erent toG fixedt (0) intervalsreturn of real (cumulative time; they can discounted refer to arbitrary reward) successive following stagest of This framework is abstract and flexible and can be applied to many di↵erentR(nt)problems= yt in many di↵erent ways. For example, the time steps need not refer decision-makingG⇡ npolicy,-step and acting. return decision-making The (Section actions 7.1) can rule be low-level controls, such as the problems in many di↵erent ways. For example, the time steps need not refert to⇡ fixed intervalspolicy, of decision-making real time; they rule can refer to arbitrary successive stages of voltagesG⇡λ(s)actiontakeninstate appliedλ-return to the (Section motors of 7.2) a robots under arm, ordeterministic high-level decisions,policy such⇡ t λ 1 n 1 (n) to fixed intervals of real time; they can refer to arbitrary successive stagesasR whetherof decision-making⇡=(1(s)actiontakeninstate or notλ) to haveλ − and lunchR acting. or to go The to graduates actionsunder school.deterministic can be Similarly, low-level thepolicy controls,⇡ such as the ⇡(at s)probabilityoftakingaction− t a in state s under stochastic policy ⇡ decision-making and acting. The actions can be low-level controls, such asstates the |voltages can⇡(a takes)probabilityoftakingaction a wide appliedn=1 variety to of the forms. motors They can of be a completely robota in arm, state determined ors under high-level by stochastic decisions,policy such⇡ ⇡p(s0 s, a)probabilityoftransitionfromstate| policy,X decision-making rule s to state s0 under action a voltages applied to the motors of a robot arm, or high-level decisions, such as|p(s whether0 s, a)probabilityoftransitionfromstate or not to have lunch or to go to graduates to state school.s0 under Similarly, action a the wider⇡r((ss,)actiontakeninstate audience. a, s0)expectedimmediaterewardontransitionfrom| s under deterministic policy s⇡ to s0 under action a as whether or not to have lunch or to go to graduate school. Similarly, the2 statesr(⇡s,( a,st can,a s0)expectedimmediaterewardontransitionfromt) take a wide variety of forms. They can be completelys to s determined0 under action by a ⇡(⇢Weat s= restrict)probabilityoftakingaction attention to discrete time to keep thingsa asin simple state ass possible,under evenstochastic though policy ⇡ states can take a wide variety of forms. They can be completely determinedmany by| of theb( ideasst,a cant) be extended to the continuous-time case (e.g., see Bertsekas and Tsitsiklis,pv(⇡s(0ss,)valueofstate a 1996;)probabilityoftransitionfromstate Werbos, 1992; Doya,s 1996).under policy ⇡ (expecteds to state return)s0 under action a 3 widerv (s)valueofstate audience.

The Reinforcement Learning Problem (Markov Decision Processes, Or Mdps)

Survey on Reinforcement Learning for Language Processing

Probabilistic Goal Markov Decision Processes∗

1 Introduction

Arxiv:2004.13965V3 [Cs.LG] 16 Feb 2021 Optimal Policy

A Deep Reinforcement Learning Neural Network Folding Proteins

An Introduction to Deep Reinforcement Learning

A Brief Taster of Markov Decision Processes and Stochastic Games

Modelling Working Memory Using Deep Recurrent Reinforcement Learning

Reinforcement Learning: a Survey

Markov Decision Processes

An Energy Efficient Edgeai Autoencoder Accelerator For

Sparse Cooperative Q-Learning