
Lecture 21: Reinforcement Learning
Justin Johnson, December 4, 2019

Assignment 5: Object Detection
- Single-stage detector
- Two-stage detector
Due on Monday 12/9, 11:59pm

Assignment 6: Generative Models
- Generative Adversarial Networks
Due on Tuesday 12/17, 11:59pm

So far: Supervised Learning
Data: (x, y), where x is data and y is a label
Goal: Learn a function to map x -> y
Examples: Classification, regression, object detection, semantic segmentation, image captioning, etc.

So far: Unsupervised Learning
Data: x. Just data, no labels!
Goal: Learn some underlying hidden structure of the data
Examples: Clustering, dimensionality reduction, feature learning (e.g. autoencoders), density estimation, etc.

Today: Reinforcement Learning
Problems where an agent performs actions in an environment and receives rewards.
Goal: Learn how to take actions that maximize reward.

Overview
- What is reinforcement learning?
- Algorithms for reinforcement learning
  - Q-Learning
  - Policy Gradients

Reinforcement Learning
[Diagram: an Agent interacting with an Environment]
- The agent sees a state s_t, which may be noisy or incomplete.
- The agent takes an action a_t based on what it sees.
- A reward r_t tells the agent how well it is doing.
- The action causes a change to the environment, and the agent learns.
- The process then repeats: s_{t+1}, a_{t+1}, r_{t+1}, ...

Example: Cart-Pole Problem
Objective: Balance a pole on top of a movable cart
State: Angle, angular speed, position, horizontal velocity
Action: Horizontal force applied on the cart
Reward: 1 at each time step if the pole is upright

Example: Robot Locomotion
Objective: Make the robot move forward
State: Angle, position, velocity of all joints
Action: Torques applied on the joints
Reward: 1 at each time step the robot is upright, plus forward movement
Figure from: Schulman et al, "High-Dimensional Continuous Control Using Generalized Advantage Estimation", ICLR 2016

Example: Atari Games
Objective: Complete the game with the highest score
State: Raw pixel inputs of the game screen
Action: Game controls, e.g. Left, Right, Up, Down
Reward: Score increase/decrease at each time step
Mnih et al, "Playing Atari with Deep Reinforcement Learning", NeurIPS Deep Learning Workshop, 2013

Example: Go
Objective: Win the game!
State: Position of all pieces
Action: Where to put the next piece down
Reward: On the last turn, 1 if you won, 0 if you lost
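All of these examples fit the same interaction loop described above: observe a state, take an action, receive a reward, repeat. Below is a minimal sketch of that loop, assuming a hypothetical gym-style environment with reset() and step() methods; the class and function names (ToyCartPoleEnv, random_policy) are illustrative stand-ins, not from the lecture.

```python
import random

# A toy stand-in for an environment; NOT a real cart-pole simulator.
class ToyCartPoleEnv:
    def reset(self):
        """Start a new episode and return the initial state s_0."""
        self.t = 0
        return (0.0, 0.0, 0.0, 0.0)  # (angle, angular speed, position, velocity)

    def step(self, action):
        """Apply an action, return (next_state, reward, done)."""
        self.t += 1
        next_state = tuple(random.uniform(-0.05, 0.05) for _ in range(4))
        reward = 1.0                 # +1 for every step the pole stays upright
        done = self.t >= 200         # episode ends after a fixed horizon here
        return next_state, reward, done

def random_policy(state):
    """An agent that pushes the cart left or right at random."""
    return random.choice([-1.0, +1.0])

env = ToyCartPoleEnv()
state = env.reset()
total_reward, done = 0.0, False
while not done:
    action = random_policy(state)            # agent acts based on what it sees
    state, reward, done = env.step(action)   # environment returns reward and next state
    total_reward += reward
print("episode return:", total_reward)
```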
Reinforcement Learning vs Supervised Learning
[Diagram: the RL loop (Agent, Environment, states s_t, actions a_t, rewards r_t) shown side by side with a supervised learning loop (Model, Dataset, inputs x_t, predictions y_t, losses L_t)]
Why is RL different from normal supervised learning?
- Stochasticity: Rewards and state transitions may be random.
- Credit assignment: Reward r_t may not directly depend on action a_t.
- Nondifferentiable: Can't backprop through the world; can't compute dr_t/da_t.
- Nonstationary: What the agent experiences depends on how it acts.

Markov Decision Process (MDP)
Mathematical formalization of the RL problem: a tuple $(S, A, R, P, \gamma)$
- $S$: Set of possible states
- $A$: Set of possible actions
- $R$: Distribution of reward given a (state, action) pair
- $P$: Transition probability: distribution over the next state given a (state, action) pair
- $\gamma$: Discount factor (tradeoff between future and present rewards)
Markov Property: The current state completely characterizes the state of the world. Rewards and next states depend only on the current state, not on the history.

The agent executes a policy $\pi$ giving a distribution over actions conditioned on states.
Goal: Find the policy $\pi^*$ that maximizes the cumulative discounted reward $\sum_{t} \gamma^t r_t$.

Markov Decision Process (MDP)
- At time step t=0, the environment samples an initial state $s_0 \sim p(s_0)$
- Then, for t=0 until done:
  - The agent selects an action $a_t \sim \pi(a \mid s_t)$
  - The environment samples a reward $r_t \sim R(r \mid s_t, a_t)$
  - The environment samples the next state $s_{t+1} \sim P(s \mid s_t, a_t)$
  - The agent receives reward r_t and next state s_{t+1}
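This sampling loop is easy to spell out in code. Here is a minimal sketch of rolling out a policy in a tiny tabular MDP; the dictionary-based representation and the names (transition, reward_fn, rollout) are illustrative assumptions, not part of the lecture.

```python
import random

# A tiny tabular MDP, written out explicitly for illustration.
actions = ["left", "right"]
# P: distribution over the next state given (state, action)
transition = {
    ("A", "left"):  {"A": 0.9, "B": 0.1},
    ("A", "right"): {"B": 1.0},
    ("B", "left"):  {"A": 1.0},
    ("B", "right"): {"terminal": 1.0},
}
# R: (deterministic here) reward given (state, action)
reward_fn = {("A", "left"): -1.0, ("A", "right"): -1.0,
             ("B", "left"): -1.0, ("B", "right"): 0.0}
gamma = 0.9  # discount factor

def policy(state):
    """A simple stochastic policy pi(a | s): pick an action uniformly at random."""
    return random.choice(actions)

def rollout(s0="A", max_steps=100):
    """Sample a trajectory s_0, a_0, r_0, s_1, ... and return its discounted return."""
    s, ret = s0, 0.0
    for t in range(max_steps):
        if s == "terminal":
            break
        a = policy(s)                          # a_t ~ pi(a | s_t)
        r = reward_fn[(s, a)]                  # r_t from R(r | s_t, a_t)
        nxt = transition[(s, a)]               # next-state distribution P(s' | s_t, a_t)
        s = random.choices(list(nxt.keys()), weights=list(nxt.values()))[0]
        ret += (gamma ** t) * r                # accumulate sum_t gamma^t r_t
    return ret

print("sampled discounted return:", rollout())
```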
A simple MDP: Grid World
States: Cells of a grid; ★ marks the terminal states
Actions: 1. Right, 2. Left, 3. Up, 4. Down
Reward: Set a negative "reward" for each transition (e.g. r = -1)
Objective: Reach one of the terminal states in as few moves as possible
[Figure: a bad policy vs. the optimal policy on the grid]

Finding Optimal Policies
Goal: Find the optimal policy $\pi^*$ that maximizes the (discounted) sum of rewards.
Problem: Lots of randomness! Initial state, transition probabilities, rewards.
Solution: Maximize the expected sum of rewards:
$$\pi^* = \arg\max_\pi \; \mathbb{E}\left[ \sum_{t \ge 0} \gamma^t r_t \;\middle|\; \pi \right] \quad \text{where } s_0 \sim p(s_0),\ a_t \sim \pi(a \mid s_t),\ s_{t+1} \sim P(s \mid s_t, a_t)$$

Value Function and Q Function
Following a policy $\pi$ produces sample trajectories (or paths) s_0, a_0, r_0, s_1, a_1, r_1, ...
How good is a state? The value function at state s is the expected cumulative reward from following the policy starting from state s:
$$V^\pi(s) = \mathbb{E}\left[ \sum_{t \ge 0} \gamma^t r_t \;\middle|\; s_0 = s, \pi \right]$$
How good is a state-action pair? The Q function at state s and action a is the expected cumulative reward from taking action a in state s and then following the policy:
$$Q^\pi(s, a) = \mathbb{E}\left[ \sum_{t \ge 0} \gamma^t r_t \;\middle|\; s_0 = s, a_0 = a, \pi \right]$$

Bellman Equation
Optimal Q-function: $Q^*(s, a)$ is the Q-function for the optimal policy $\pi^*$.
It gives the max possible future reward when taking action a in state s:
$$Q^*(s, a) = \max_\pi \; \mathbb{E}\left[ \sum_{t \ge 0} \gamma^t r_t \;\middle|\; s_0 = s, a_0 = a, \pi \right]$$
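The definitions of V and Q above are expectations over trajectories, so they can be estimated by averaging sampled discounted returns. Below is a minimal Monte Carlo sketch on a two-state toy MDP; the environment, the fixed policy, and the function names are my own illustrative assumptions, not from the lecture.

```python
import random

# Monte Carlo estimates of V^pi(s) and Q^pi(s, a) for a fixed policy pi,
# obtained by averaging sampled discounted returns on a two-state toy MDP.
gamma = 0.9
actions = ["stay", "go"]
# Deterministic toy dynamics: (reward, next_state) for each (state, action).
step = {
    ("A", "stay"): (-1.0, "A"),
    ("A", "go"):   (-1.0, "B"),
    ("B", "stay"): (-1.0, "B"),
    ("B", "go"):   ( 0.0, "end"),   # "end" is terminal
}

def pi(s):
    """A fixed stochastic policy pi(a | s): uniform over actions."""
    return random.choice(actions)

def sample_return(s, first_action=None, max_steps=50):
    """Discounted return of one trajectory from s (optionally forcing the first action)."""
    ret = 0.0
    for t in range(max_steps):
        if s == "end":
            break
        a = first_action if (t == 0 and first_action is not None) else pi(s)
        r, s = step[(s, a)]
        ret += (gamma ** t) * r
    return ret

def V(s, n=2000):
    """V^pi(s): expected discounted return starting at s, following pi."""
    return sum(sample_return(s) for _ in range(n)) / n

def Q(s, a, n=2000):
    """Q^pi(s, a): expected discounted return taking a in s, then following pi."""
    return sum(sample_return(s, a) for _ in range(n)) / n

print("V^pi(A)     ~", round(V("A"), 2))
print("Q^pi(A, go) ~", round(Q("A", "go"), 2))
```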