Reinforcement Learning
PROF XIAOHUI XIE SPRING 2019
CS 273P Machine Learning and Data Mining
Slides courtesy of Sameer Singh, David Silver
Machine Learning
Intro to Reinforcement Learning
Markov Processes
Markov Reward Processes
Markov Decision Processes
2 Reinforcement Learning
3 What makes it different?
• No direct supervision, only rewards
• Feedback is delayed, not instantaneous
• Time really matters, i.e. data is sequential
• Agent’s actions affect what data it will receive
Examples
• Fly stunt maneuvers in a helicopter
• Defeat the world champion at Backgammon or Go
• Manage an investment portfolio
• Control a power station
• Make a humanoid robot walk
• Play many different Atari games better than humans
4 Agent-Environment Interface
Agent
• decides on an action
• receives next observation
• receives next reward
Environment
• executes the action
• computes next observation
• computes next reward
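A minimal sketch of the interaction loop above (the `agent` and `env` methods named here are illustrative assumptions, not an API from the slides):

```python
# Sketch of the agent-environment interface: at each step the agent decides
# on an action; the environment executes it and computes the next
# observation and reward. `agent` and `env` are hypothetical objects.

def run_episode(agent, env, max_steps=1000):
    observation, reward = env.reset(), 0.0
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(observation, reward)       # agent decides
        observation, reward, done = env.step(action)  # environment responds
        total_reward += reward
        if done:
            break
    return total_reward  # the cumulative reward the agent tries to maximize
```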
5 Reward, Rt
A scalar signal of how well the agent is doing: +, positive (Good); −, negative (Bad). It says nothing about WHY the agent is doing well, and could have little to do with At-1.
Agent is trying to maximize its cumulative reward
6 Example of Rewards
• Fly stunt maneuvers in a helicopter
  • +ve reward for following desired trajectory
  • −ve reward for crashing
• Defeat the world champion at Backgammon
  • +/−ve reward for winning/losing a game
• Manage an investment portfolio
  • +ve reward for each $ in bank
• Control a power station
  • +ve reward for producing power
  • −ve reward for exceeding safety thresholds
• Make a humanoid robot walk
  • +ve reward for forward motion
  • −ve reward for falling over
• Play many different Atari games better than humans
  • +/−ve reward for increasing/decreasing score
7 Sequential Decision Making
Actions have long term consequences
Rewards may be delayed
May be better to sacrifice short term reward for long term benefit
Examples
• A financial investment (may take months to mature)
• Refueling a helicopter (might prevent a crash later)
• Blocking opponent moves (might eventually help win)
• Spend a lot of money and go to college (earn more later)
• Don’t commit crimes (rewarded by not going to jail)
• Get started on final project early (avoid stress later)
A key aspect of intelligence: How far ahead are you able to plan?
8 Reinforcement Learning
Given an environment (produces observations and rewards)
Reinforcement Learning
Automated agent that selects actions to maximize total rewards in the environment
9 Let’s look at the Agent
What does the choice of action depend on?
• Can you ignore Ot completely?
• Is just Ot enough? Or (Ot, At)?
• Is it last few observations?
• Is it all observations so far?
10 Agent State, St
History: everything that happened so far
Ht = O1, R1, A1, O2, R2, A2, O3, R3, …, At-1, Ot, Rt

State St can be:
• Ot
• (Ot, Rt)
• (At-1, Ot, Rt)
• (Ot-3, Ot-2, Ot-1, Ot)

In general, St = f(Ht); you, as the AI designer, specify this function.
11 Agent Policy, π
Current state St → policy π → next action At
Good policy: leads to larger cumulative reward
Bad policy: leads to worse cumulative reward
(we will explore this later)
12 Example: Atari
Rules are unknown
• What makes the score increase?
Dynamics are unknown
• How do actions change pixels?
13 Video Time!
https://www.youtube.com/watch?v=V1eYniJ0Rnk
15 Example: Robotic Soccer
https://www.youtube.com/watch?v=CIF2SBVY-J0
15 AlphaGo
https://www.youtube.com/watch?v=I2WFvGl4y8c
16 Overview of Lecture
States Markov Processes
+ Rewards
Markov Reward Processes
+ Actions
Markov Decision Processes
17 Machine Learning
Intro to Reinforcement Learning
Markov Processes
Markov Reward Processes
Markov Decision Processes
18 Markov Property
19 Markov Property
20 Markov Property
21 State Transition Matrix
22 State Transition Matrix
23 Markov Processes
24 Markov Processes
25 Student Markov Chain
26 Student MC: Episodes
27 Student MC: Episodes
28 Student MC: Transition Matrix
29 Machine Learning
Intro to Reinforcement Learning
Markov Processes
Markov Reward Processes
Markov Decision Processes
30 Markov Reward Process
31 Markov Reward Process
32 The Student MRP
33 Return
34 Return
35 Return
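These slides define the return Gt as the total discounted reward; restating the standard definition from the source lecture for reference:

```latex
% Return from time step t, with discount factor gamma in [0, 1]
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
    = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
```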
36 Why discount?
37 Why discount?
38 Why discount?
39 Why discount?
40 Why discount?
41 Value Function
42 Value Function
43 Student MRP: Returns
44 Student MRP: Returns
45 Student MRP: Value Function
46 Student MRP: Value Function
47 Student MRP: Value Function
48 Bellman Equations for MRP
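For reference, the MRP value function and its Bellman decomposition in their standard form, matching the source lecture:

```latex
% Value = expected return; the Bellman equation splits it into the
% immediate reward plus the discounted value of the successor state.
v(s) = \mathbb{E}\left[ G_t \mid S_t = s \right]
     = \mathbb{E}\left[ R_{t+1} + \gamma\, v(S_{t+1}) \mid S_t = s \right]
     = \mathcal{R}_s + \gamma \sum_{s'} \mathcal{P}_{ss'}\, v(s')
```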
49 Backup Diagrams for MRP
50 Student MRP: Bellman Eq
51 Matrix Form of Bellman Eq
52 Solving the Bellman Equation
53 Solving the Bellman Equation
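Because the MRP Bellman equation is linear, v = R + γPv, it has the closed-form solution v = (I − γP)⁻¹R. A minimal numpy sketch (practical only for small state spaces, since the direct solve is O(n³)):

```python
import numpy as np

def solve_mrp(P, R, gamma):
    """Exact MRP state values: solve (I - gamma * P) v = R.
    P: (n, n) transition matrix; R: (n,) expected-reward vector."""
    n = len(R)
    return np.linalg.solve(np.eye(n) - gamma * np.asarray(P), np.asarray(R))
```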
54 Dynamic Programming
v2(C1) = -2 + γ ( .5 v1(FB) + .5 v1(C2) )
v2(FB) = -1 + γ ( .9 v1(FB) + .1 v1(C1) )
v3(FB) = -1 + γ ( .9 v2(FB) + .1 v2(C1) )
…
γ = 0.5
       C1     C2     C3     Pass   Pub    FB     Sleep
t=1    -2     -2     -2     10     1      -1     0
t=2    -2.75  -2.8   1.2    10     0      -1.55  0
t=3    -3.09  -1.52  1      10     0.41   -1.83  0
t=4    -2.84  -1.6   1.08   10     0.59   -1.98  0
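A sketch that reproduces this table. The C1 and FB transition rows match the equations above; the remaining rows are taken from the Student MRP diagram (an assumption here, but they regenerate the table values exactly):

```python
import numpy as np

states = ["C1", "C2", "C3", "Pass", "Pub", "FB", "Sleep"]
R = np.array([-2.0, -2.0, -2.0, 10.0, 1.0, -1.0, 0.0])  # reward per state
P = np.array([
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],   # C1   -> C2 or FB
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],   # C2   -> C3 or Sleep
    [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],   # C3   -> Pass or Pub
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Pass -> Sleep
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],   # Pub  -> C1 / C2 / C3
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],   # FB   -> C1 or FB
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Sleep (terminal self-loop)
])
gamma = 0.5

v = np.zeros(len(states))
for t in range(1, 5):
    v = R + gamma * P @ v          # one synchronous Bellman backup
    print(f"t={t}", np.round(v, 2))
```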
55 Machine Learning
Intro to Reinforcement Learning
Markov Processes
Markov Reward Processes
Markov Decision Processes
57 Markov Decision Process
58 The Student MDP
59 Policies
60 Policies
61 Policies
62 MPs → MRPs → MDPs
63 MPs → MRPs → MDPs
64 Value Function
65 Value Function
66 Student MDP: Value Function
67 Bellman Expectation Equation
68 Bellman Expectation Equation
69 Bellman Expectation Equation, V
70 Bellman Expectation Equation, Q
71 Bellman Expectation Equation, V
72 Bellman Expectation Equation, Q
73 Student MDP: Bellman Exp Eq.
74 Bellman Exp Eq: Matrix Form
75 Optimal Value Function
76 Optimal Value Function
77 Optimal Value Function
78 Student MDP: Optimal V
79 Student MDP: Optimal Q
80 Optimal Policy
81 Optimal Policy
82 Finding Optimal Policy
83 Student MDP: Optimal Policy
84 Bellman Optimality Eq, V
85 Bellman Optimality Eq, Q
86 Bellman Optimality Eq, V
87 Bellman Optimality Eq, Q
88 Student MDP: Bellman Optimality
89 Solving Bellman Equations in MDP
Not easy:
● Not a linear equation
● No “closed-form” solutions
90 Overview
MDPs: States, Transitions, Actions, Rewards
Prediction: given a policy π, estimate the state value function and the action value function
Control: estimate the optimal value functions and the optimal policy
Does the agent know the MDP?
Yes! It’s “planning”: the agent knows everything.
No! It’s “model-free RL”: the agent observes everything as it goes.
91 Overview
             Evaluate Policy π      Find Best Policy π*
MDP Known    Policy Evaluation      Policy/Value Iteration
MDP Unknown  MC and TD Learning     Q-Learning
92 Overview
                        Evaluate Policy π      Find Best Policy π*
MDP Known (“Planning”)  Policy Evaluation      Policy/Value Iteration
MDP Unknown             MC and TD Learning     Q-Learning
93 Dynamic Programming
94 Requirements for Dynamic Programming
95 Requirements for Dynamic Programming
96 Planning by Dynamic Programming
97 Applications of DPs
98 Overview
             Evaluate Policy π      Find Best Policy π*
MDP Known    Policy Evaluation      Policy/Value Iteration
MDP Unknown  MC and TD Learning     Sarsa + Q-Learning
99 Iterative Policy Evaluation
100 Iterative Policy Evaluation
101 Iterative Policy Evaluation
102 Iterative Policy Evaluation
103 Random Policy: Grid World
104 Policy Evaluation: Grid World
Time 0: do nothing, stop; no cost (v0 = 0).
Time 1: move (reward -1), then follow v0 (k=0). Unless in goal: reward 0.
Time 2: move (reward -1), then follow v1 (k=1).
  Most states: (-1) + [v1 = -1] = -2
  States next to the goal: (-1) + ¾·[v1 = -1] + ¼·[v1 = 0] = -1.75
105 Policy Evaluation: Grid World
106 Policy Evaluation: Grid World
107 Policy Evaluation: Grid World
In general: acting greedily w.r.t. vk gives the best policy and value for “one step, then follow the random policy” (always at least as good as the random policy!)
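A sketch of these evaluation sweeps, assuming the classic 4×4 grid world (terminal corners, reward -1 per move, equiprobable random policy); the slides’ exact layout may differ:

```python
import numpy as np

N = 4
terminal = {(0, 0), (N - 1, N - 1)}           # goal corners (assumed layout)
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # up, down, left, right

v = np.zeros((N, N))
for k in range(100):
    new_v = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            if (i, j) in terminal:
                continue                       # terminal states stay at 0
            for di, dj in moves:
                # Moves off the grid leave the state unchanged.
                ni = min(max(i + di, 0), N - 1)
                nj = min(max(j + dj, 0), N - 1)
                new_v[i, j] += 0.25 * (-1 + v[ni, nj])
    v = new_v
# k=1 gives -1 everywhere; k=2 gives -2 in most states and -1.75 next to
# the goals, matching the arithmetic above.
print(np.round(v, 1))
```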
108 Overview
             Evaluate Policy π      Find Best Policy π*
MDP Known    Policy Evaluation      Policy/Value Iteration
MDP Unknown  MC and TD Learning     Sarsa + Q-Learning
109 Improving a Policy!
110 Improving a Policy!
111 Improving a Policy!
112 Policy Iteration
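A sketch of the policy iteration loop named on this slide: evaluate the current policy, then improve it greedily, until it stops changing. The array-based MDP representation (P[s, a] as a next-state distribution, R[s, a] as an expected reward) is an illustrative assumption:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, n_eval_sweeps=200):
    """P: (S, A, S) transition tensor; R: (S, A) expected rewards."""
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)
    while True:
        # 1. Policy evaluation: sweep the Bellman expectation backup.
        v = np.zeros(n_states)
        for _ in range(n_eval_sweeps):
            v = R[np.arange(n_states), policy] + gamma * np.einsum(
                "sn,n->s", P[np.arange(n_states), policy], v)
        # 2. Policy improvement: act greedily w.r.t. the evaluated values.
        q = R + gamma * np.einsum("san,n->sa", P, v)
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, v      # policy is stable, hence optimal
        policy = new_policy
```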
113 Jack’s Car Rental
114 Policy Iteration in Car Rental
115 Policy Improvement
116 Policy Improvement
117 Policy Improvement
118 Modified Policy Iteration
119 Modified Policy Iteration
120 Modified Policy Iteration
121 Modified Policy Iteration
122 Generalized Policy Iteration
123 Overview
             Evaluate Policy π      Find Best Policy π*
MDP Known    Policy Evaluation      Policy/Value Iteration
MDP Unknown  MC and TD Learning     Sarsa + Q-Learning
124 Monte Carlo RL
125 Monte Carlo RL
126 Monte Carlo RL
127 Monte Carlo Policy Evaluation
128 Every-Visit MC Policy Evaluation
Equivalent, “incremental tracking” form: V(St) ← V(St) + (1/N(St)) · (Gt − V(St))
Looks like SGD to minimize MSE from the mean value…
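A minimal sketch of this every-visit update, assuming an episode arrives as a time-ordered list of (state, reward) steps (an illustrative representation):

```python
from collections import defaultdict

V = defaultdict(float)   # running value estimates
N = defaultdict(int)     # visit counts per state

def every_visit_mc(episode, gamma=1.0):
    """episode: list of (state, reward) steps; reward follows the state."""
    G = 0.0
    # Walk backwards so G is the discounted return following each step.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        N[state] += 1
        V[state] += (G - V[state]) / N[state]   # incremental mean, as above
```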
129 Blackjack Example
[Figure: blackjack policy diagram showing stand/hit regions]
130 Blackjack Value Function
131 Blackjack Value Function
132 Temporal Difference Learning
133 Temporal Difference Learning
134 MC and TD
135 MC and TD
136 MC and TD
137 MC and TD
138 Driving Home Example
139 Driving Home: MC vs TD
140 Driving Home: MC vs TD
141 Finite Episodes: AB Example
MC & TD can give different answers on fixed data:
V(B) = 6 / 8
V(A) = 0 ? (Direct MC estimate)
V(A) = 6 / 8? (TD estimate)
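These numbers match the classic eight-episode dataset (one episode A,0,B,0; six episodes B,1; one episode B,0), which is assumed here since only the counts appear on the slide. A sketch of why the two estimates differ:

```python
# Assumed dataset: consistent with V(B) = 6/8 above.
episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]

# Monte Carlo: average the complete returns observed from each state.
returns = {"A": [], "B": []}
for ep in episodes:
    for i, (s, _) in enumerate(ep):
        returns[s].append(sum(r for _, r in ep[i:]))   # undiscounted return
mc = {s: sum(g) / len(g) for s, g in returns.items()}
print(mc)   # {'A': 0.0, 'B': 0.75} -- direct MC says V(A) = 0

# TD implicitly fits the Markov model: A always moves to B with reward 0,
# so its estimate is V(A) = 0 + V(B) = 6/8.
```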
142 MC vs TD
Monte Carlo
• Wait till end of episode to learn
• Only for terminating worlds
• High variance, low bias
• Not sensitive to initial value
• Good convergence properties
• Doesn’t exploit Markov property
• Minimizes squared error

Temporal Difference
• Learn online after every step
• Non-terminating worlds OK
• Low variance, high bias
• Sensitive to initial value
• Much more efficient
• Exploits Markov property
• Maximizes log-likelihood
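The two update targets side by side, as a sketch (constant step size α; V is a dict of value estimates; names are illustrative):

```python
def mc_update_step(V, state, G, alpha=0.1):
    # Monte Carlo target: the actual complete return G from this state.
    V[state] += alpha * (G - V[state])

def td0_update_step(V, state, reward, next_state, gamma=1.0, alpha=0.1):
    # TD(0) target: one real reward plus the current estimate of the rest.
    td_target = reward + gamma * V[next_state]
    V[state] += alpha * (td_target - V[state])
```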
143 Unified View: Monte Carlo
144 Unified View: TD Learning
145 Unified View: Dynamic Prog.
146 Unified View of RL (Prediction)
147 Overview
             Evaluate Policy π      Find Best Policy π*
MDP Known    Policy Evaluation      Policy/Value Iteration
MDP Unknown  MC and TD Learning     Sarsa + Q-Learning
148 Model-free Control
Learn a policy π to maximize rewards in the environment
149 On and Off Policy Learning
150 Generalized Policy Iteration
151 Gen Policy Improvement?
152 Not quite!
153 Learn Q function directly…
154 Greedy Action Selection?
155 Greedy Action Selection?
156 Greedy Action Selection?
157 ∊-Greedy Exploration
158 ∊-Greedy Exploration
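A minimal sketch of ∊-greedy action selection (Q is a dict of action values keyed by (state, action), an illustrative representation):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore uniformly at random;
    otherwise exploit the currently greedy action."""
    if random.random() < epsilon:
        return random.choice(actions)                              # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))     # exploit
```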
159 Which Policy Evaluation?
160 Sarsa: TD for Policy Evaluation
161 On-Policy Control w/ Sarsa
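A sketch of the Sarsa backup (on-policy: the target uses the action the agent actually takes next); the table representation is an assumption:

```python
from collections import defaultdict

Q = defaultdict(float)   # action values keyed by (state, action)

def sarsa_update(s, a, r, s_next, a_next, gamma=1.0, alpha=0.1):
    # On-policy TD target: bootstrap from the (S, A, R, S', A') tuple.
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```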
162 Off-Policy Learning
163 Off-Policy Learning
164 Q-Learning
165 Q-Learning
166 Q-Learning
167 Off-Policy w/ Q-Learning
168 Off-Policy w/ Q-Learning
169 Off-Policy w/ Q-Learning
170 Q-Learning Control Algorithm
(SARSAMAX)
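A sketch of the Q-learning (SARSAMAX) backup: behavior can follow, e.g., an ∊-greedy policy while the target bootstraps from the greedy action, which makes it off-policy. The table representation is an assumption:

```python
from collections import defaultdict

Q = defaultdict(float)   # action values keyed by (state, action)

def q_learning_update(s, a, r, s_next, actions, gamma=1.0, alpha=0.1):
    # Off-policy TD target: bootstrap from the best action in s_next,
    # regardless of which action the behavior policy actually takes there.
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```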
171 Relation between DP and TD
172 Update Eqns for DP and TD