Lecture 31: Introduction to Reinforcement Learning
CS109B Data Science 2
Pavlos Protopapas, Mark Glickman, and Chris Tanner

The Basic Intuition

A formal definition

Reinforcement learning is an area of machine learning concerned with how software agents ought to take actions in an environment in order to maximize the notion of cumulative reward. Reinforcement Learning (RL) is a machine learning technique that enables an agent to learn in an interactive environment by trial and error, using feedback from its own actions and experiences. It is a mathematical formalisation of a decision-making process.

Where is it used?

Traffic, robotics, and games.

Elements of Reinforcement Learning

Exploration vs Exploitation

Exploitation ("Do I just go left like before? It's a safe option, right?"): better immediate reward.
Exploration ("Can I try going right? I might just find the cheese faster!"): the long-term return is increased.

Markov Decision Process

Markov Property

The future is independent of the past given the present. A state $S_t$ is said to be Markov if and only if

$\mathbb{P}[S_{t+1} \mid S_1, S_2, \ldots, S_t] = \mathbb{P}[S_{t+1} \mid S_t]$

Hence we require the state to encapsulate all the necessary information from the history.

Markov Process

A Markov Process gives the entire dynamics of the environment. A Markov Process is a tuple ⟨S, P⟩ where
• S is the set of states (finite)
• P is the state transition matrix

[Figure: a Markov chain over the states Facebook, Class 1, Class 2, Class 3, Pass, Club, and Sleep, with transition probabilities on the edges. Adapted from David Silver's RL course.]
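As a quick illustration of these definitions, here is a minimal sketch in Python that samples a trajectory from a Markov chain like the one in the figure. The state names and transition probabilities are assumptions read off the figure, so treat them as illustrative rather than authoritative.

```python
import numpy as np

# States of the student Markov chain (assumed from the figure above).
states = ["Class 1", "Class 2", "Class 3", "Pass", "Club", "Facebook", "Sleep"]

# P[i][j] = probability of moving from states[i] to states[j].
# Each row sums to 1; the numbers are read off the figure and are
# illustrative assumptions.
P = np.array([
    #  C1    C2    C3   Pass  Club   FB   Sleep
    [0.0,  0.5,  0.0,  0.0,  0.0,  0.5,  0.0],   # Class 1
    [0.0,  0.0,  0.8,  0.0,  0.0,  0.0,  0.2],   # Class 2
    [0.0,  0.0,  0.0,  0.6,  0.4,  0.0,  0.0],   # Class 3
    [0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  1.0],   # Pass
    [0.2,  0.4,  0.4,  0.0,  0.0,  0.0,  0.0],   # Club
    [0.1,  0.0,  0.0,  0.0,  0.0,  0.9,  0.0],   # Facebook
    [0.0,  0.0,  0.0,  0.0,  0.0,  0.0,  1.0],   # Sleep (absorbing)
])

def sample_trajectory(start="Class 1", max_steps=50, rng=np.random.default_rng(0)):
    """Roll out one trajectory by repeatedly sampling the next state
    from the current state's row of the transition matrix."""
    i = states.index(start)
    trajectory = [states[i]]
    for _ in range(max_steps):
        i = rng.choice(len(states), p=P[i])
        trajectory.append(states[i])
        if states[i] == "Sleep":   # absorbing state: stop the rollout
            break
    return trajectory

print(sample_trajectory())
```

Note that the Markov property is built into the sampler: the next state depends only on the current row of P, never on how the trajectory got there.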
Markov Decision Process (MDP)

An MDP provides a method to formalize sequential decision making, where actions influence immediate rewards, subsequent states, and future rewards.

A Markov Decision Process is a tuple ⟨S, A, P, R, γ⟩ where
• S is the set of states (finite)
• A is a finite set of actions
• P is the state transition matrix, $P^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$, with one matrix for each action in the action space
• R is the reward function
• γ is the discount factor

Tasks

A task is either an episodic task or a continuing task.

Episodic Tasks

A task that consists of "episodes", which end naturally in a special state called the terminal state $S_T$. After the terminal state, we reset back to the start state. The termination time T is a random variable and varies from one episode to another. The return in this task type is a simple summation of the rewards:

$G_t = \sum_{k=t+1}^{T} R_k$

[Figure: game screenshots illustrating one episode: venture, terminate, reset.]

Continuing Tasks

A task where there is no natural break into identifiable episodes: it goes on continually without a definitive limit, i.e. the final step is $T = \infty$. Hence, a discount factor γ is introduced when computing the return:

$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$, where $0 \le \gamma \le 1$.

γ determines the present value of future rewards: a reward received k steps from now is worth only $\gamma^{k-1}$ times the immediate reward. If γ < 1, the sum is finite. If γ = 0, we maximise only the immediate reward.

Episodic and Continuing Tasks

To combine episodic and continuing tasks, we introduce an absorbing state. On termination of an episode, we go to this state, which transitions only to itself and generates only rewards of zero. The return over T rewards is then the same as the return over infinitely many rewards:

$G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$, where $0 \le \gamma \le 1$.
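As a small numerical sketch of these return definitions, the snippet below computes both the undiscounted episodic return and the discounted return. The reward sequence and the value of γ are illustrative assumptions (the rewards happen to match the example on the next slide).

```python
# A minimal sketch of the return definitions above; the reward
# sequence and discount factor are illustrative assumptions.

def episodic_return(rewards):
    """Undiscounted return G_t = R_{t+1} + ... + R_T."""
    return sum(rewards)

def discounted_return(rewards, gamma):
    """Discounted return G_t = sum_k gamma^k * R_{t+k+1}."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [10, 10, 10, 0, 0, 0]   # hypothetical rewards R_1..R_6

print(episodic_return(rewards))          # 30
print(discounted_return(rewards, 0.9))   # 10 + 9 + 8.1 = 27.1
```

Because every reward after the third is zero, the tail of the sum contributes nothing, which is exactly the effect of the absorbing state described above.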
Rewards

Example: $R_1$, $R_2$, $R_3$ are equal to 10, and all rewards after this are equal to 0.

Policy

Consider the following scenario. We have a mouse in a state; let's call this state $S_0$. The mouse can take 3 possible actions: go up, left, or right (no backward move in this policy). If the mouse were not very smart, then the probability of it taking any one of those actions is 1/3.

However, a smarter mouse would realize that going left would give it a slight electric shock, hence it drastically reduces the probability of taking that action. Further, the mouse can smell the cheese from somewhere above it, hence it is likely to want to try going up. Thus, the new probabilities become: Up = 1/2, Left = 1/6, Right = 1/3.

Assume the mouse takes an action and goes up; it is now in a new state. Again, we have the same set of possible actions, but the mouse now knows that here it will get an electric shock when it goes right, not left as in the previous state. Thus, the probability of taking each action in this state changes.

This probability, which defines the action taken by an agent in a given state, is what is called a policy. The probability of an action changes for each state the agent is present in.

A policy π provides a probability mapping given a state. If an agent is said to follow a policy π at time t, then π(a|s) is the probability that $A_t = a$ if $S_t = s$:

$\pi(a \mid s) = \Pr\{A_t = a \mid S_t = s\}$

At time t, under policy π, the probability of taking action a in state s is π(a|s). For each state $s \in S$, π is a probability distribution over $a \in A(s)$, i.e. over all actions permissible in that state.

Value Function

A value function is either a state value function or an action value function.

Let's now think of a scenario where we have 2 mice starting from an initial state, $S_0$.
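To connect the policy definition above to code, here is a minimal sketch of a tabular policy π(a|s) for the mouse example. The state names and probabilities are hypothetical, chosen to mirror the story above rather than taken from an actual grid world.

```python
import random

# A minimal sketch of a tabular policy pi(a|s): one probability
# distribution over actions per state. The entries are hypothetical,
# chosen to mirror the mouse story above.
policy = {
    "S0": {"up": 1/2, "left": 1/6, "right": 1/3},   # shock on the left
    "S1": {"up": 1/2, "left": 1/3, "right": 1/6},   # shock on the right
}

def sample_action(pi, state, rng=random.Random(0)):
    """Sample an action a ~ pi(. | s) for the given state."""
    actions, probs = zip(*pi[state].items())
    return rng.choices(actions, weights=probs, k=1)[0]

print(sample_action(policy, "S0"))
print(sample_action(policy, "S1"))
```

The dictionary makes the key point of the slides explicit: the same action set gets a different probability distribution in each state, which is exactly what a policy is.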