Machine Learning and Data Mining
Reinforcement Learning: Markov Decision Processes
Kalev Kask

Overview
• Intro
• Markov Decision Processes
• Reinforcement Learning
  – Sarsa
  – Q-learning
• Exploration vs. Exploitation tradeoff

Resources
• Book: Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto
• UCL Course on Reinforcement Learning, David Silver
  – https://www.youtube.com/watch?v=2pWv7GOvuf0
  – https://www.youtube.com/watch?v=lfHX2hHRMVQ
  – https://www.youtube.com/watch?v=Nd1-UUMVfz4
  – https://www.youtube.com/watch?v=PnHCvfgC_ZA
  – https://www.youtube.com/watch?v=0g4j2k_Ggc4
  – https://www.youtube.com/watch?v=UoPei5o4fps

Branches of Machine Learning
• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning

Why is RL different?
• No target values to predict
• Feedback comes in the form of rewards
  – May be delayed, not instantaneous
• There is a goal: maximize reward
• There is a timeline: actions are taken along the arrow of time
• Actions affect what data the agent will receive

Agent and Environment
At each step t the agent:
• Executes action At
• Receives observation Ot
• Receives scalar reward Rt
The environment:
• Receives action At
• Emits observation Ot+1
• Emits scalar reward Rt+1
t increments at each environment step.

Sequential Decision Making
• Actions have long-term consequences
• Goal: maximize cumulative (long-term) reward
  – Rewards may be delayed
  – May need to sacrifice short-term reward
• Devise a plan to maximize cumulative reward
Reward examples:
• A financial investment (may take months to mature)
• Refuelling a helicopter (might prevent a crash in several hours)
• Blocking opponent moves (might help winning chances many moves from now)

Reinforcement Learning
Learn a behavior strategy (policy) that maximizes the long-term sum of rewards in an unknown and stochastic environment. (Emma Brunskill)

Planning under Uncertainty
Learn a behavior strategy (policy) that maximizes the long-term sum of rewards in a known stochastic environment. (Emma Brunskill)

Atari Example: Reinforcement Learning
• Rules of the game are unknown
• Learn directly from interactive game-play
• Pick actions on the joystick, see pixels and scores

Demos
• https://www.youtube.com/watch?v=V1eYniJ0Rnk
• https://www.youtube.com/watch?v=CIF2SBVY-J0
• https://www.youtube.com/watch?v=I2WFvGl4y8c

Markov Processes and Markov Reward Processes (slide topics)
• Markov Property
• State Transition
• Markov Process
• Student Markov Chain
• Student MC: Episodes
• Student MC: Transition Matrix
• Return
• Value
• Student MRP
• Student MRP: Returns
• Student MRP: Value Function
• Bellman Equation for MRP
• Backup Diagrams for MRP
• Bellman Equation: Student MRP

Solving the Bellman Equation
The Bellman equation for an MRP is a linear equation, so it can be solved directly:
  v = R + γPv
  (I − γP)v = R
  v = (I − γP)⁻¹R
• Computational complexity is O(n³) for n states
• Direct solution is only possible for small MRPs (a small numerical sketch follows below)
• There are many iterative methods for large MRPs, e.g.
  – Dynamic programming
  – Monte-Carlo evaluation
  – Temporal-Difference learning
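As a quick illustration of the direct solution above, here is a minimal NumPy sketch. It uses a made-up 3-state MRP (not the Student MRP from the slides), solves v = (I − γP)⁻¹R, and compares the result with the iterative fixed-point update v ← R + γPv:

```python
import numpy as np

# A small, made-up 3-state MRP (illustration only, not the Student MRP).
# P[i, j] = probability of moving from state i to state j.
P = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],              # state 2 is absorbing
])
R = np.array([-1.0, -2.0, 0.0])   # expected immediate reward in each state
gamma = 0.9

# Direct solution: solve (I - gamma * P) v = R, i.e. v = (I - gamma * P)^(-1) R.
v_direct = np.linalg.solve(np.eye(3) - gamma * P, R)

# Iterative solution: repeatedly apply the Bellman backup v <- R + gamma * P v.
v_iter = np.zeros(3)
for _ in range(500):
    v_iter = R + gamma * P @ v_iter

print(v_direct)   # the two solutions agree for this small MRP
print(v_iter)
```

The direct solve is the O(n³) route mentioned above; the iterative loop is the fixed-point view that dynamic programming, Monte-Carlo, and temporal-difference methods build on for larger problems.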
Markov Decision Processes (slide topics)
• Markov Decision Process
• Student MDP
• Policies
• MP → MRP → MDP
• Value Function

Bellman Equation for MDPs
• Evaluating the Bellman equation translates into a 1-step lookahead
• Bellman Equation for V
• Bellman Equation for q
• Student MDP: Bellman Equation
• Bellman Equation: Matrix Form

Optimal Value Functions (slide topics)
• Optimal Value Function
• Student MDP: Optimal V
• Student MDP: Optimal Q

Optimal Policy
Define a partial ordering over policies:
  π ≥ π' if vπ(s) ≥ vπ'(s) for all s
Theorem. For any Markov Decision Process:
• There exists an optimal policy π∗ that is better than or equal to all other policies: π∗ ≥ π for all π
• All optimal policies achieve the optimal value function: vπ∗(s) = v∗(s)
• All optimal policies achieve the optimal action-value function: qπ∗(s, a) = q∗(s, a)

Finding an Optimal Policy
• An optimal policy can be found by maximising over q∗(s, a)
• There is always a deterministic optimal policy for any MDP
• If we know q∗(s, a), we immediately have the optimal policy

Additional slides: Student MDP: Optimal Policy; Bellman Optimality Equation for V; Student MDP: Bellman Optimality

Solving the Bellman Optimality Equation
• The Bellman optimality equation is non-linear, so it is not easy to solve
• No closed-form solution (in general)
• Many iterative solution methods (a value-iteration sketch follows below):
  – Value Iteration
  – Policy Iteration
  – Q-learning
  – Sarsa
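To make the iterative idea concrete, here is a minimal value-iteration sketch in NumPy. The 2-state, 2-action MDP below is invented for illustration (it is not the Student MDP from the slides); the update is the Bellman optimality backup v(s) ← max_a [ R(s, a) + γ Σ_s' P(s'|s, a) v(s') ].

```python
import numpy as np

# A tiny, made-up MDP (illustration only): 2 states, 2 actions.
# P[a, s, s2] = probability of reaching s2 after taking action a in state s.
P = np.array([
    [[0.9, 0.1],      # action 0: rows are current states, columns next states
     [0.2, 0.8]],
    [[0.1, 0.9],      # action 1
     [0.0, 1.0]],
])
R = np.array([        # R[a, s] = expected immediate reward for action a in state s
    [1.0, 0.0],
    [0.0, 2.0],
])
gamma = 0.95

v = np.zeros(2)
for _ in range(10_000):
    q = R + gamma * (P @ v)        # q[a, s]: one-step lookahead for every (a, s)
    v_new = q.max(axis=0)          # Bellman optimality backup: greedy over actions
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new

policy = q.argmax(axis=0)          # deterministic greedy policy extracted from q*
print("v* ≈", v, " greedy policy:", policy)
```

Value iteration applies the optimality backup directly to the value function; policy iteration (sketched after the dynamic-programming slides below) instead alternates full policy evaluation with greedy improvement. Both converge for a known, finite MDP.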
Maze Example
• Start and Goal states
• Rewards: −1 per time-step
• Actions: N, E, S, W
• States: agent's location

Maze Example: Policy
• Arrows represent the policy π(s) for each state s

Maze Example: Value Function
• Numbers represent the value vπ(s) of each state s

Maze Example: Model
• The agent may have an internal model of the environment
  – Dynamics: how actions change the state
  – Rewards: how much reward comes from each state
• The model may be imperfect
• Grid layout represents the transition model P^a_{ss'}
• Numbers represent the immediate reward R^a_s from each state s (same for all a)

Algorithms for MDPs
• Model
• Algorithms (cont.): Prediction and Control

Learning and Planning
Two fundamental problems in sequential decision making:
• Reinforcement Learning:
  – The environment is initially unknown
  – The agent interacts with the environment
  – The agent improves its policy
• Planning:
  – A model of the environment is known
  – The agent performs computations with its model (without any external interaction)
  – The agent improves its policy
  – a.k.a. deliberation, reasoning, introspection, pondering, thought, search

Major Components of an RL Agent
An RL agent may include one or more of these components:
• Policy: the agent's behaviour function
• Value function: how good each state and/or action is
• Model: the agent's representation of the environment

Dynamic Programming (slide topics)
• Dynamic Programming
• Requirements for DP
• Applications of DP

Planning by Dynamic Programming
• Dynamic programming assumes full knowledge of the MDP
• It is used for planning in an MDP
• For prediction:
  – Input: MDP (S, A, P, R, γ) and policy π, or MRP (S, P^π, R^π, γ)
  – Output: value function vπ
• For control:
  – Input: MDP (S, A, P, R, γ)
  – Output: optimal value function v∗ and optimal policy π∗

Iterative Policy Evaluation
• Problem: evaluate a given policy π
• Solution: iterative application of the Bellman expectation backup
  v1 → v2 → ... → vπ
• Using synchronous backups: at each iteration k + 1, for all states s ∈ S, update vk+1(s) from vk(s'), where s' is a successor state of s
• We will discuss asynchronous backups later
• Convergence to vπ can be proven

Evaluating a Random Policy in the Small Gridworld
• Undiscounted episodic MDP (γ = 1)
• Nonterminal states 1, ..., 14
• One terminal state (shown twice as shaded squares)
• Actions leading out of the grid leave the state unchanged
• Reward is −1 until the terminal state is reached
• The agent follows the uniform random policy π(n|·) = π(e|·) = π(s|·) = π(w|·) = 0.25

Policy Evaluation: Grid World

Most of the story in a nutshell

Finding the Best Policy

Policy Improvement
• Given a policy π, evaluate it:
  vπ(s) = E[Rt+1 + γRt+2 + ... | St = s]
• Improve the policy by acting greedily with respect to vπ:
  π' = greedy(vπ)
• In the Small Gridworld the improved policy was already optimal, π' = π∗
• In general, more iterations of improvement / evaluation are needed
• But this process of policy iteration always converges to π∗

Policy Iteration (a code sketch follows below)
• Policy evaluation: estimate vπ (iterative policy evaluation)
• Policy improvement: generate π' ≥ π (greedy policy improvement)

Jack's Car Rental
• Policy Iteration in Car Rental

Policy Improvement (2)
If improvements stop,
  qπ(s, π'(s)) = max_{a∈A} qπ(s, a) = qπ(s, π(s)) = vπ(s)
then the Bellman optimality equation has been satisfied,
  vπ(s) = max_{a∈A} qπ(s, a),
so vπ(s) = v∗(s) for all s ∈ S, and π is an optimal policy.
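Tying the last few slides together, here is a minimal policy-iteration sketch: iterative policy evaluation followed by greedy improvement, repeated until the greedy policy stops changing (the stopping condition on the slide above). The MDP is again a tiny invented example, not the Small Gridworld or Jack's Car Rental.

```python
import numpy as np

# Policy iteration on a tiny, made-up MDP (illustration only).
# P[a, s, s2] = transition probability, R[a, s] = expected immediate reward.
P = np.array([
    [[0.8, 0.2],      # action 0
     [0.3, 0.7]],
    [[0.2, 0.8],      # action 1
     [0.6, 0.4]],
])
R = np.array([
    [0.0, 1.0],
    [1.0, 0.0],
])
gamma = 0.9
n_states = 2

policy = np.zeros(n_states, dtype=int)          # arbitrary initial deterministic policy
while True:
    # Policy evaluation: iterate the Bellman expectation backup under the current policy.
    v = np.zeros(n_states)
    for _ in range(10_000):
        v_new = np.array([R[policy[s], s] + gamma * P[policy[s], s] @ v
                          for s in range(n_states)])
        if np.max(np.abs(v_new - v)) < 1e-10:
            break
        v = v_new

    # Policy improvement: act greedily with respect to v_pi.
    q = R + gamma * (P @ v)                     # q[a, s]
    new_policy = q.argmax(axis=0)
    if np.array_equal(new_policy, policy):      # improvements stopped => policy is optimal
        break
    policy = new_policy

print("optimal policy:", policy, " v* ≈", v)
```

Each pass of the outer loop is one evaluation / improvement cycle; when acting greedily no longer changes the policy, qπ(s, π(s)) = max_a qπ(s, a), so the Bellman optimality equation holds and the current policy is optimal.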
