Reinforcement Learning

PROF XIAOHUI XIE SPRING 2019

CS 273P: Machine Learning and Data Mining

Slides courtesy of Sameer Singh, David Silver

Machine Learning

Intro to Reinforcement Learning

Markov Processes

Markov Reward Processes

Markov Decision Processes

2 Reinforcement Learning

3 What makes it different?

• No direct supervision, only rewards
• Feedback is delayed, not instantaneous
• Time really matters, i.e. data is sequential
• Agent’s actions affect what data it will receive

Examples
• Fly stunt maneuvers in a helicopter
• Defeat the world champion at Backgammon or Go
• Manage an investment portfolio
• Control a power station
• Make a humanoid robot walk
• Play many different Atari games better than humans

4 Agent-Environment Interface

Agent

• decides on an action
• receives next observation
• receives next reward

Environment

• executes the action
• computes next observation
• computes next reward
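A minimal sketch of this interaction loop; the env.reset, env.step, and agent.act interfaces are hypothetical placeholders for illustration, not a specific library's API.

```python
# Sketch of the agent-environment loop described above (hypothetical interfaces).

def run_episode(env, agent, max_steps=1000):
    """Alternate between the agent acting and the environment responding."""
    obs, reward = env.reset(), 0.0
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(obs, reward)        # agent decides on an action
        obs, reward, done = env.step(action)   # env computes next observation and reward
        total_reward += reward                 # the agent tries to maximize this sum
        if done:
            break
    return total_reward
```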

5 Reward, Rt

How well the agent is doing: +, positive (Good); −, negative (Bad).
Says nothing about WHY it is doing well, and could have little to do with At-1.

Agent is trying to maximize its cumulative reward

6 Example of Rewards

• Fly stunt maneuvers in a helicopter
  • +ve reward for following desired trajectory
  • −ve reward for crashing
• Defeat the world champion at Backgammon
  • +/−ve reward for winning/losing a game
• Manage an investment portfolio
  • +ve reward for each $ in bank
• Control a power station
  • +ve reward for producing power
  • −ve reward for exceeding safety thresholds
• Make a humanoid robot walk
  • +ve reward for forward motion
  • −ve reward for falling over
• Play many different Atari games better than humans
  • +/−ve reward for increasing/decreasing score

7 Sequential Decision Making

Actions have long term consequences

Rewards may be delayed

May be better to sacrifice short term reward for long term benefit

Examples
• A financial investment (may take months to mature)
• Refueling a helicopter (might prevent a crash later)
• Blocking opponent moves (might eventually help win)
• Spend a lot of money and go to college (earn more later)
• Don’t commit crimes (rewarded by not going to jail)
• Get started on final project early (avoid stress later)

A key aspect of intelligence: How far ahead are you able to plan?

8 Reinforcement Learning

Given an environment (produces observations and rewards)

Reinforcement Learning

Automated agent that selects actions to maximize total rewards in the environment

9 Let’s look at the Agent

What does the choice of action depend on?

• Can you ignore Ot completely?

• Is just Ot enough? Or (Ot,At)?
• Is it the last few observations?
• Is it all observations so far?

10 Agent State, St

History: everything that happened so far

Ht = O1 R1 A1 O2 R2 A2 O3 R3 … At-1 Ot Rt

State St can be:
• Ot
• Ot Rt
• At-1 Ot Rt
• Ot-3 Ot-2 Ot-1 Ot

In general, St = f(Ht). You, as AI designer, specify this function.
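As a hedged illustration, one possible choice of f: keep only the last few observations (the Ot-3 Ot-2 Ot-1 Ot option above), as is often done for frame-based games.

```python
# Hypothetical state function S_t = f(H_t): the k most recent observations.

def make_state(history, k=4):
    """history: list of (observation, reward, action) tuples, oldest first."""
    recent = history[-k:]
    return tuple(obs for (obs, reward, action) in recent)
```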

11 Agent Policy, π

The policy π maps the current state to the next action: St → π → At

Good policy: leads to larger cumulative reward
Bad policy: leads to smaller cumulative reward
(we will explore this later)
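As an illustration (with made-up states and actions loosely echoing the Student example later in the lecture), a policy can be a deterministic mapping or a distribution over actions per state.

```python
import random

# Hypothetical deterministic policy: a fixed mapping from state to action.
deterministic_policy = {"Class1": "Study", "Class2": "Study"}

# Hypothetical stochastic policy: a distribution over actions for each state.
stochastic_policy = {
    "Class1": {"Study": 0.9, "Facebook": 0.1},
    "Class2": {"Study": 0.8, "Facebook": 0.2},
}

def sample_action(policy, state):
    """Draw A_t ~ pi(.|S_t) from a stochastic policy."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]
```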

12 Example: Atari

Rules are unknown
• What makes the score increase?
Dynamics are unknown
• How do actions change pixels?

13 Video Time!

https://www.youtube.com/watch?v=V1eYniJ0Rnk

14 Example: Robotic Soccer https://www.youtube.com/watch?v=CIF2SBVY-J0

15 AlphaGo

https://www.youtube.com/watch?v=I2WFvGl4y8c

16 Overview of Lecture

States → Markov Processes

+ Rewards → Markov Reward Processes

+ Actions → Markov Decision Processes

17 Machine Learning

Intro to Reinforcement Learning

Markov Processes

Markov Reward Processes

Markov Decision Processes

18 Markov Property

19 Markov Property

20 Markov Property

21 State Transition Matrix

22 State Transition Matrix

23 Markov Processes

24 Markov Processes

25 Student

26 Student MC: Episodes

27 Student MC: Episodes

28 Student MC: Transition Matrix

29 Machine Learning

Intro to Reinforcement Learning

Markov Processes

Markov Reward Processes

Markov Decision Processes

30 Markov Reward Process

31 Markov Reward Process

32 The Student MRP

33 Return

34 Return

35 Return

36 Why discount?

37 Why discount?

38 Why discount?

39 Why discount?

40 Why discount?

41 Value Function

42 Value Function

43 Student MRP: Returns

44 Student MRP: Returns

45 Student MRP: Value Function

46 Student MRP: Value Function

47 Student MRP: Value Function

48 Bellman Equations for MRP

49 Backup Diagrams for MRP

50 Student MRP: Bellman Eq

51 Matrix Form of Bellman Eq

52 Solving the Bellman Equation

53 Solving the Bellman Equation

v2(C1) = -2 + γ ( 0.5 v1(FB) + 0.5 v1(C2) )
v2(FB) = -1 + γ ( 0.9 v1(FB) + 0.1 v1(C1) )  …
v3(FB) = -1 + γ ( 0.9 v2(FB) + 0.1 v2(C1) )  …

γ = 0.5   C1      C2      C3     Pass   Pub    FB      Sleep
t=1       -2      -2      -2     10     1      -1      0
t=2       -2.75   -2.8    1.2    10     0      -1.55   0
t=3       -3.09   -1.52   1      10     0.41   -1.83   0
t=4       -2.84   -1.6    1.08   10     0.59   -1.98   0
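The sweeps above can be reproduced in a few lines of NumPy. The transition probabilities below are an assumption (the standard Student MRP from David Silver's lecture, not shown in the extracted slide); with them the iterates match the table, and because the MRP Bellman equation is linear it can also be solved directly.

```python
import numpy as np

states = ["C1", "C2", "C3", "Pass", "Pub", "FB", "Sleep"]
R = np.array([-2, -2, -2, 10, 1, -1, 0], dtype=float)
# Assumed Student MRP transition matrix (rows sum to 1).
P = np.array([
    #  C1   C2   C3   Pass Pub  FB   Sleep
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],   # C1
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],   # C2
    [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],   # C3
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Pass
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],   # Pub
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],   # FB
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Sleep (absorbing)
])
gamma = 0.5

# Iterative Bellman backups: v_{k+1} = R + gamma * P * v_k
v = np.zeros(len(states))
for t in range(1, 5):
    v = R + gamma * P @ v
    print(f"t={t}", np.round(v, 2))

# Linear equation, so a direct solution also exists: v = (I - gamma*P)^(-1) R
v_exact = np.linalg.solve(np.eye(len(states)) - gamma * P, R)
```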

55 Machine Learning

Intro to Reinforcement Learning

Markov Processes

Markov Reward Processes

Markov Decision Processes

56

57 Markov Decision Process

58 The Student MDP

59 Policies

60 Policies

61 Policies

62 MPs → MRPs → MDPs

63 MPs → MRPs → MDPs

64 Value Function

65 Value Function

66 Student MDP: Value Function

67 Bellman Expectation Equation

68 Bellman Expectation Equation

69 Bellman Expectation Equation, V

70 Bellman Expectation Equation, Q

71 Bellman Expectation Equation, V

72 Bellman Expectation Equation, Q

73 Student MDP: Bellman Exp Eq.

74 Bellman Exp Eq: Matrix Form

75 Optimal Value Function

76 Optimal Value Function

77 Optimal Value Function

78 Student MDP: Optimal V

79 Student MDP: Optimal Q

80 Optimal Policy

81 Optimal Policy

82 Finding Optimal Policy

83 Student MDP: Optimal Policy

84 Bellman Optimality Eq, V

85 Bellman Optimality Eq, Q

86 Bellman Optimality Eq, V

87 Bellman Optimality Eq, Q

88 Student MDP: Bellman Optimality

90 Solving Bellman Equations in MDP

Not easy:
● Not a linear equation
● No “closed-form” solutions
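Because the max over actions makes the equation nonlinear, it is solved iteratively instead, e.g. by value iteration (previewed in the overview that follows). A minimal sketch, assuming a hypothetical MDP encoding P[s][a] = list of (probability, next state, reward); this data layout and the toy MDP are illustrations, not from the slides.

```python
import numpy as np

def value_iteration(P, num_states, gamma=0.9, tol=1e-8):
    """Repeatedly apply the Bellman optimality backup until the values converge."""
    v = np.zeros(num_states)
    while True:
        v_new = np.array([
            max(sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                for a in P[s])
            for s in range(num_states)
        ])
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

# Tiny hypothetical 2-state MDP for illustration.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 5.0)]},
    1: {"stay": [(1.0, 1, 1.0)]},
}
print(value_iteration(P, num_states=2))
```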

90 Overview

MDPs: States, Transitions, Actions, Rewards

Prediction: Given a policy π, estimate the state value function and action value function

Control: Estimate the optimal value functions and the optimal policy

Does the agent know the MDP?
• Yes! It’s “planning”: the agent knows everything
• No! It’s “Model-free RL”: the agent observes everything as it goes

91 Overview

              Evaluate Policy π       Find Best Policy π*
MDP Known     Policy Evaluation       Policy/Value Iteration
MDP Unknown   MC and TD Learning      Q-Learning

92 Overview

              Evaluate Policy π       Find Best Policy π*
MDP Known     Policy Evaluation       Policy/Value Iteration    (“Planning”)
MDP Unknown   MC and TD Learning      Q-Learning

93 Dynamic Programming

94 Requirements for Dynamic Programming

95 Requirements for Dynamic Programming

96 Planning by Dynamic Programming

97 Applications of DPs

98 Overview

              Evaluate Policy π       Find Best Policy π*
MDP Known     Policy Evaluation       Policy/Value Iteration
MDP Unknown   MC and TD Learning      Sarsa + Q-Learning

99 Iterative Policy Evaluation

100 Iterative Policy Evaluation

101 Iterative Policy Evaluation

102 Iterative Policy Evaluation

103 Random Policy: Grid World

104 Policy Evaluation: Grid World

Time 0: do nothing, stop; no cost.

Time 1: move (reward -1); then use the k=0 values. Unless in goal: reward 0.

Time 2: move (reward -1); then use the k=1 values.
  Most states: move (-1) + [v1 = -1] = -2
  Some states (next to the goal): move (-1) + ¾ [v1 = -1] + ¼ [v1 = 0] = -1.75
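A minimal sketch of these sweeps, assuming the standard small gridworld setup (4x4 grid, terminal corners, reward -1 per move, uniform random policy over four moves); the grid itself is an assumption, only the -1 / -2 / -1.75 pattern above comes from the slide.

```python
import numpy as np

N = 4
terminals = {(0, 0), (N - 1, N - 1)}
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def step(s, m):
    """Move within the grid; bumping into a wall leaves the state unchanged."""
    r, c = s[0] + m[0], s[1] + m[1]
    return (r, c) if 0 <= r < N and 0 <= c < N else s

v = np.zeros((N, N))
for k in range(1, 3):                       # two sweeps: k=1, k=2
    v_new = np.zeros((N, N))
    for r in range(N):
        for c in range(N):
            if (r, c) in terminals:
                continue                    # goal states keep value 0
            # Uniform random policy: average over the four moves.
            v_new[r, c] = np.mean([-1 + v[step((r, c), m)] for m in moves])
    v = v_new
    print(f"k={k}\n", np.round(v, 2))       # k=1: -1 everywhere; k=2: -2 and -1.75
```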

105 Policy Evaluation: Grid World

106 Policy Evaluation: Grid World

107 Policy Evaluation: Grid World

In general: this gives the best policy & value for “one step, then follow the random policy” (always a better policy than random!)

108 Overview

              Evaluate Policy π       Find Best Policy π*
MDP Known     Policy Evaluation       Policy/Value Iteration
MDP Unknown   MC and TD Learning      Sarsa + Q-Learning

109 Improving a Policy!

110 Improving a Policy!

111 Improving a Policy!

112 Policy Iteration

113 Jack’s Car Rental

114 Policy Iteration in Car Rental

115 Policy Improvement

116 Policy Improvement

117 Policy Improvement

118 Modified Policy Iteration

119 Modified Policy Iteration

120 Modified Policy Iteration

121 Modified Policy Iteration

122 Generalized Policy Iteration

123 Overview

              Evaluate Policy π       Find Best Policy π*
MDP Known     Policy Evaluation       Policy/Value Iteration
MDP Unknown   MC and TD Learning      Sarsa + Q-Learning

124 Monte Carlo RL

125 Monte Carlo RL

126 Monte Carlo RL

127 Monte Carlo Policy Evaluation

128 Every-Visit MC Policy Evaluation

Equivalent, “incremental tracking” form:
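A minimal sketch of this incremental form, assuming episodes are given as lists of (state, reward) pairs where each reward is the one received after leaving that state; the data format and function name are hypothetical.

```python
from collections import defaultdict

def every_visit_mc(episodes, gamma=1.0):
    """Every-visit Monte Carlo policy evaluation with an incremental mean update."""
    V = defaultdict(float)     # value estimate per state
    N = defaultdict(int)       # visit count per state
    for episode in episodes:
        G = 0.0
        # Walk backwards so G is the return from each visited state onward.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            N[state] += 1
            V[state] += (G - V[state]) / N[state]   # incremental tracking of the mean
    return V
```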

Looks like SGD to minimize MSE from the mean value…

129 Blackjack Example


130 Blackjack Value Function


131 Blackjack Value Function


132 Temporal Difference Learning

133 Temporal Difference Learning

134 MC and TD

135 MC and TD

136 MC and TD

137 MC and TD

138 Driving Home Example

139 Driving Home: MC vs TD

140 Driving Home: MC vs TD

141 Finite Episodes: AB Example

MC & TD can give different answers on fixed data:

V(B) = 6 / 8

V(A) = 0 ? (Direct MC estimate)

V(A) = 6 / 8? (TD estimate)
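These numbers can be reproduced assuming the standard eight-episode dataset for this example (the episodes are not listed on the slide): one episode A,0,B,0; six episodes B,1; one episode B,0.

```python
# Assumed eight episodes, each a list of (state, reward) pairs.
episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]

# Monte Carlo: average the observed returns from each visit to A.
returns_A = [sum(r for _, r in ep) for ep in episodes if ep[0][0] == "A"]
V_A_mc = sum(returns_A) / len(returns_A)    # = 0

# TD / certainty-equivalence: fit the Markov model, then bootstrap.
rewards_B = [r for ep in episodes for s, r in ep if s == "B"]
V_B = sum(rewards_B) / len(rewards_B)       # = 6/8
V_A_td = 0 + 1.0 * V_B                      # A always transitions to B with reward 0
print(V_A_mc, V_B, V_A_td)                  # 0.0, 0.75, 0.75
```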

142 MC vs TD

Monte Carlo:
• Wait till end of episode to learn
• Only for terminating worlds
• High variance, low bias
• Not sensitive to initial value
• Good convergence properties
• Doesn’t exploit Markov property
• Minimizes squared error

Temporal Difference:
• Learn online after every step
• Non-terminating worlds ok
• Low variance, high bias
• Sensitive to initial value
• Much more efficient
• Exploits Markov property
• Maximizes log-likelihood

143 Unified View: Monte Carlo

144 Unified View: TD Learning

145 Unified View: Dynamic Prog.

146 Unified View of RL (Prediction)

147 Overview

              Evaluate Policy π       Find Best Policy π*
MDP Known     Policy Evaluation       Policy/Value Iteration
MDP Unknown   MC and TD Learning      Sarsa + Q-Learning

148 Model-free Control

Learn a policy π to maximize rewards in the environment

149 On and Off Policy Learning

150 Generalized Policy Iteration

151 Gen Policy Improvement?

152 Not quite!

153 Learn Q function directly…

154 Greedy Action Selection?

155 Greedy Action Selection?

156 Greedy Action Selection?

157 ∊-Greedy Exploration

158 ∊-Greedy Exploration

159 Which Policy Evaluation?

160 Sarsa: TD for Policy Evaluation

161 On-Policy Control w/ Sarsa

162 Off-Policy Learning

163 Off-Policy Learning

164 Q-Learning

165 Q-Learning

166 Q-Learning

167 Off-Policy w/ Q-Learning

168 Off-Policy w/ Q-Learning

169 Off-Policy w/ Q-Learning

170 Q-Learning Control

(SARSAMAX)

171 Relation between DP and TD

172 Update Eqns for DP and TD

173