Lecture 21: Reinforcement Learning
Justin Johnson, December 4, 2019

Assignment 5: Object Detection

Single-stage detector Two-stage detector

Due on Monday 12/9, 11:59pm

Assignment 6: Generative Models

Generative Adversarial Networks

Due on Tuesday 12/17, 11:59pm

So far: Supervised Learning

Data: (x, y); x is data, y is label

Goal: Learn a function to map x -> y

Examples: classification, regression, object detection, semantic segmentation, image captioning, etc.

So far: Unsupervised Learning

Data: x. Just data, no labels!

Goal: Learn some underlying hidden structure of the data

Examples: Clustering, dimensionality reduction, feature learning, density estimation, etc.

Today: Reinforcement Learning

Problems where an agent performs actions in an environment, and receives rewards.

Goal: Learn how to take actions that maximize reward.

Overview

- What is reinforcement learning?
- Algorithms for reinforcement learning:
  - Q-Learning
  - Policy Gradients

Reinforcement Learning

The agent interacts with an environment in a loop:

- State s_t: the agent sees a state, which may be noisy or incomplete.
- Action a_t: the agent takes an action based on what it sees.
- Reward r_t: the reward tells the agent how well it is doing.
- The action causes a change to the environment, and the agent learns.
- The process repeats: s_{t+1}, a_{t+1}, r_{t+1}, ...

Example: Cart-Pole Problem

Objective: Balance a pole on top of a movable cart

State: angle, angular speed, position, horizontal velocity

Action: horizontal force applied on the cart

Reward: 1 at each time step if the pole is upright

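As a concrete way to poke at this state/action/reward structure, here is a minimal sketch using OpenAI Gym's CartPole environment. It assumes the classic pre-0.26 Gym API (reset() returns an observation, step() returns (obs, reward, done, info)); the environment itself is not part of the lecture slides.

```python
import gym

# Minimal sketch: inspect the cart-pole MDP via OpenAI Gym (classic pre-0.26 API).
env = gym.make("CartPole-v1")

obs = env.reset()             # state: [cart position, cart velocity, pole angle, pole angular velocity]
print(env.observation_space)  # the 4-dimensional continuous state
print(env.action_space)       # Discrete(2): push the cart left or right

obs, reward, done, info = env.step(env.action_space.sample())  # apply a random push
print(reward)                 # +1 for every step the pole stays upright
```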

Example: Robot Locomotion

Objective: Make the robot move forward

State: Angle, position, velocity of all joints

Action: Torques applied on joints

Reward: 1 at each time step for staying upright + forward movement

Figure from: Schulman et al, “High-Dimensional Continuous Control Using Generalized Advantage Estimation”, ICLR 2016

Example: Atari Games

Objective: Complete the game with the highest score

State: Raw pixel inputs of the game screen
Action: Game controls, e.g. Left, Right, Up, Down
Reward: Score increase/decrease at each time step

Mnih et al, “Playing Atari with Deep Reinforcement Learning”, NeurIPS Deep Learning Workshop, 2013

Example: Go

Objective: Win the game!

State: Position of all pieces

Action: Where to put the next piece down

Reward: On last turn: 1 if you won, 0 if you lost

Reinforcement Learning vs Supervised Learning

In RL, the agent sees state s_t, takes action a_t, and receives reward r_t; then the process repeats with s_{t+1}, a_{t+1}, r_{t+1}.

In supervised learning, the model sees input x_t, makes prediction y_t, and incurs loss L_t; then the process repeats with x_{t+1}, y_{t+1}, L_{t+1}.

Why is RL different from normal supervised learning?

- Stochasticity: Rewards and state transitions may be random
- Credit assignment: Reward r_t may not directly depend on action a_t
- Nondifferentiable: Can't backprop through the world; can't compute dr_t/da_t
- Nonstationary: What the agent experiences depends on how it acts

Markov Decision Process (MDP)

Mathematical formalization of the RL problem: a tuple (S, A, R, P, γ)

- S: set of possible states
- A: set of possible actions
- R: distribution of reward given a (state, action) pair
- P: transition probability: distribution over the next state given a (state, action) pair
- γ: discount factor (tradeoff between future and present rewards)

Markov Property: The current state completely characterizes the state of the world. Rewards and next states depend only on current state, not history.

The agent executes a policy π giving a distribution over actions conditioned on states.

Goal: Find the policy π* that maximizes the cumulative discounted reward $\sum_{t \ge 0} \gamma^t r_t$.
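As a tiny worked example of the discounted sum $\sum_{t \ge 0} \gamma^t r_t$ (the rewards and γ below are made up for illustration):

```python
# Discounted return for a short, made-up reward sequence.
gamma = 0.9
rewards = [1.0, 1.0, 1.0, 0.0, 5.0]
ret = sum((gamma ** t) * r for t, r in enumerate(rewards))
print(ret)   # 1 + 0.9 + 0.81 + 0 + 5 * 0.9**4 = 5.9905
```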

- At time step t = 0, the environment samples the initial state $s_0 \sim p(s_0)$
- Then, for t = 0 until done:
  - Agent selects action $a_t \sim \pi(a_t \mid s_t)$
  - Environment samples reward $r_t \sim R(r_t \mid s_t, a_t)$
  - Environment samples next state $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$
  - Agent receives reward $r_t$ and next state $s_{t+1}$
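A minimal code sketch of this interaction loop; env and policy are assumed interfaces (env.reset() -> s0, env.step(a) -> (s_next, r, done), policy.sample(s) -> a), not a specific library.

```python
# Sketch of the MDP loop above; `env` and `policy` are assumed interfaces.
def run_episode(env, policy, gamma=0.99):
    s = env.reset()                       # s_0 ~ p(s_0)
    trajectory, total, discount = [], 0.0, 1.0
    done = False
    while not done:
        a = policy.sample(s)              # a_t ~ pi(a_t | s_t)
        s_next, r, done = env.step(a)     # r_t ~ R(. | s_t, a_t), s_{t+1} ~ P(. | s_t, a_t)
        trajectory.append((s, a, r))
        total += discount * r             # accumulate the discounted reward
        discount *= gamma
        s = s_next
    return trajectory, total
```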

A simple MDP: Grid World

Actions: (1) right, (2) left, (3) up, (4) down
States: cells of the grid; ★ marks the terminal states
Reward: set a negative "reward" for each transition (e.g. r = -1)

Objective: Reach one of the terminal states in as few moves as possible

Figure: arrows in each grid cell comparing a bad policy with the optimal policy.

Finding Optimal Policies

Goal: Find the optimal policy π* that maximizes the (discounted) sum of rewards.

Problem: Lots of randomness! Initial state, transition probabilities, rewards.

Solution: Maximize the expected sum of rewards:

$$\pi^* = \arg\max_\pi \; \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \,\middle|\, \pi\right] \quad \text{with } s_0 \sim p(s_0),\; a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} \sim P(\cdot \mid s_t, a_t)$$

Value Function and Q Function

Following a policy π produces sample trajectories (or paths) s_0, a_0, r_0, s_1, a_1, r_1, …

How good is a state? The value function at state s is the expected cumulative reward from following the policy starting from state s:

$$V^\pi(s) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \,\middle|\, s_0 = s, \pi\right]$$

How good is a state-action pair? The Q function at state s and action a is the expected cumulative reward from taking action a in state s and then following the policy:

$$Q^\pi(s, a) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \,\middle|\, s_0 = s, a_0 = a, \pi\right]$$

Bellman Equation

Optimal Q-function: Q*(s, a) is the Q-function for the optimal policy π*. It gives the max possible future reward when taking action a in state s:

$$Q^*(s, a) = \max_\pi \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \,\middle|\, s_0 = s, a_0 = a, \pi\right]$$

Q* encodes the optimal policy: $\pi^*(s) = \arg\max_{a'} Q^*(s, a')$

Bellman Equation: Q* satisfies the following recurrence relation:

$$Q^*(s, a) = \mathbb{E}_{r, s'}\left[r + \gamma \max_{a'} Q^*(s', a')\right], \quad \text{where } r \sim R(s, a),\ s' \sim P(s, a)$$

Intuition: After taking action a in state s, we get reward r and move to a new state s'. After that, the max possible reward we can get is $\max_{a'} Q^*(s', a')$.

Solving for the optimal policy: Value Iteration

Idea: If we find a function Q(s, a) that satisfies the Bellman Equation, then it must be Q*. Start with a random Q, and use the Bellman Equation as an update rule:

$$Q_{i+1}(s, a) = \mathbb{E}_{r, s'}\left[r + \gamma \max_{a'} Q_i(s', a')\right], \quad \text{where } r \sim R(s, a),\ s' \sim P(s, a)$$

Amazing fact: Q_i converges to Q* as i → ∞.

Problem: We need to keep track of Q(s, a) for every (state, action) pair, which is impossible if the state space is infinite.

Solution: Approximate Q(s, a) with a neural network, and use the Bellman Equation as the loss!
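Before the neural-network version, here is a minimal tabular sketch of this value-iteration update on a tiny deterministic grid world; the grid size, the -1-per-move reward, and γ = 0.9 are illustrative assumptions, not values from the lecture.

```python
import numpy as np

H, W = 3, 4                                   # tiny grid world (assumed layout)
terminals = {(0, 0), (H - 1, W - 1)}          # ★ terminal states
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
gamma = 0.9

def step(s, a):
    """Deterministic transition: move within the grid, reward -1 per move."""
    r, c = s[0] + a[0], s[1] + a[1]
    s_next = (min(max(r, 0), H - 1), min(max(c, 0), W - 1))
    return s_next, -1.0

Q = np.zeros((H, W, len(actions)))
for _ in range(100):                          # value-iteration sweeps
    Q_new = np.zeros_like(Q)
    for i in range(H):
        for j in range(W):
            if (i, j) in terminals:
                continue                      # Q stays 0 at terminal states
            for k, a in enumerate(actions):
                s_next, r = step((i, j), a)
                # Bellman update: Q(s, a) <- r + gamma * max_a' Q(s', a')
                Q_new[i, j, k] = r + gamma * Q[s_next].max()
    Q = Q_new

print(Q.argmax(axis=-1))                      # greedy policy w.r.t. the converged Q
```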

Solving for the optimal policy: Deep Q-Learning

Train a neural network (with weights θ) to approximate Q*: $Q^*(s, a) \approx Q(s, a; \theta)$

Use the Bellman Equation to tell us what Q should output for a given state and action:

$$y_{s,a,\theta} = \mathbb{E}_{r, s'}\left[r + \gamma \max_{a'} Q(s', a'; \theta)\right], \quad \text{where } r \sim R(s, a),\ s' \sim P(s, a)$$

Use this to define the loss for training Q:

$$L(s, a) = \left(Q(s, a; \theta) - y_{s,a,\theta}\right)^2$$

Problem: Nonstationary! The "target" for Q(s, a) depends on the current weights θ!

Problem: How to sample batches of data for training?
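A minimal PyTorch-style sketch of one deep Q-learning update along these lines; q_net (any module mapping states to per-action Q-values), the replay-buffer batch (s, a, r, s_next, done), and γ = 0.99 are assumptions, and practical DQN additionally uses a separate, periodically-updated target network rather than just detaching the target.

```python
import torch
import torch.nn.functional as F

def q_learning_loss(q_net, s, a, r, s_next, done, gamma=0.99):
    # Q(s, a; theta) for the actions that were actually taken
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    # Bellman target y = r + gamma * max_a' Q(s', a'; theta); no gradient
    # flows through the target (this is where the nonstationarity bites).
    with torch.no_grad():
        q_next = q_net(s_next).max(dim=1).values
        y = r + gamma * (1.0 - done) * q_next

    return F.mse_loss(q_sa, y)   # (Q(s, a; theta) - y)^2, averaged over the batch
```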

Case Study: Playing Atari Games

Objective: Complete the game with the highest score

State: Raw pixel inputs of the game screen
Action: Game controls, e.g. Left, Right, Up, Down
Reward: Score increase/decrease at each time step

Mnih et al, “Playing Atari with Deep Reinforcement Learning”, NeurIPS Deep Learning Workshop, 2013

Network output: Q-values Q(s_t, a; θ) for all actions. With 4 actions, the last layer gives Q(s_t, a_1), Q(s_t, a_2), Q(s_t, a_3), Q(s_t, a_4).

Network (weights θ): Conv(4->16, 8x8, stride 4) -> Conv(16->32, 4x4, stride 2) -> FC-256 -> FC-A (Q-values)

Network input: state st: 4x84x84 stack of last 4 frames (after RGB->grayscale conversion, downsampling, and cropping)
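A PyTorch sketch of the network described above (input 4x84x84, Conv(4->16, 8x8, stride 4), Conv(16->32, 4x4, stride 2), FC-256, FC-A); the ReLU placement follows the Mnih et al. paper, and the flattened feature size is computed from the stated strides.

```python
import torch.nn as nn

class AtariQNet(nn.Module):
    def __init__(self, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 4x84x84 -> 16x20x20
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # 16x20x20 -> 32x9x9
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),                  # FC-256
            nn.ReLU(),
            nn.Linear(256, num_actions),                 # FC-A: one Q-value per action
        )

    def forward(self, x):                                # x: (B, 4, 84, 84)
        return self.net(x)
```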

https://www.youtube.com/watch?v=V1eYniJ0Rnk

Q-Learning vs Policy Gradients

Q-Learning: Train a network Q(s, a) to estimate future rewards for every (state, action) pair.

Problem: For some problems this can be a hard function to learn. For some problems it is easier to learn a mapping from states to actions.

Policy Gradients: Train a network π_θ(a | s) that takes the state as input and gives a distribution over which action to take in that state.

Objective function: expected future rewards when following policy π_θ:

$$J(\theta) = \mathbb{E}_{r \sim p_\theta}\left[\sum_{t \ge 0} \gamma^t r_t\right]$$

Find the optimal policy by maximizing $\theta^* = \arg\max_\theta J(\theta)$ (use gradient ascent!)

Policy Gradients

Problem: Nondifferentiability! We don't know how to compute $\partial J / \partial \theta$.

General formulation: we want to compute $\frac{\partial J}{\partial \theta}$ for

$$J(\theta) = \mathbb{E}_{x \sim p_\theta}\left[f(x)\right]$$

Policy Gradients: REINFORCE Algorithm

General formulation: $J(\theta) = \mathbb{E}_{x \sim p_\theta}[f(x)]$; we want to compute $\frac{\partial J}{\partial \theta}$.

$$\frac{\partial J}{\partial \theta} = \frac{\partial}{\partial \theta} \mathbb{E}_{x \sim p_\theta}\left[f(x)\right] = \frac{\partial}{\partial \theta} \int_X p_\theta(x) f(x)\, dx = \int_X f(x) \frac{\partial p_\theta(x)}{\partial \theta}\, dx$$

Use the log-derivative trick:

$$\frac{\partial}{\partial \theta} \log p_\theta(x) = \frac{1}{p_\theta(x)} \frac{\partial p_\theta(x)}{\partial \theta} \;\Rightarrow\; \frac{\partial p_\theta(x)}{\partial \theta} = p_\theta(x) \frac{\partial}{\partial \theta} \log p_\theta(x)$$

Therefore:

$$\frac{\partial J}{\partial \theta} = \int_X f(x)\, p_\theta(x) \frac{\partial}{\partial \theta} \log p_\theta(x)\, dx = \mathbb{E}_{x \sim p_\theta}\left[f(x) \frac{\partial}{\partial \theta} \log p_\theta(x)\right]$$

Approximate the expectation via sampling!
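A small numerical sanity check of this estimator on a toy distribution (a 3-way softmax over θ with made-up "rewards" f; everything below is an illustrative assumption): the sampled estimate of ∂J/∂θ should closely match the exact gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.5, -0.2, 0.1])
f = np.array([1.0, 3.0, -2.0])                 # made-up reward per outcome

p = np.exp(theta) / np.exp(theta).sum()        # p_theta(x) = softmax(theta)

# Exact gradient: dJ/dtheta_j = sum_i f_i p_i (delta_ij - p_j)
exact = (f * p) @ (np.eye(3) - p)

# REINFORCE estimate: mean of f(x) * d/dtheta log p_theta(x) over samples,
# where d/dtheta_j log p_theta(x) = delta_{xj} - p_j for a softmax.
xs = rng.choice(3, size=100_000, p=p)
estimate = (f[xs, None] * (np.eye(3)[xs] - p)).mean(axis=0)

print(exact, estimate)                         # the two should closely agree
```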

Policy Gradients: REINFORCE Algorithm

Goal: Train a network π_θ(a | s) that takes the state as input and gives a distribution over which action to take in that state.

Define: Let $x = (s_0, a_0, s_1, a_1, \ldots)$ be the sequence of states and actions we get when following policy π_θ. It's random: $x \sim p_\theta(x)$.

$$p_\theta(x) = \prod_{t \ge 0} P(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t) \;\Rightarrow\; \log p_\theta(x) = \sum_{t \ge 0} \log P(s_{t+1} \mid s_t, a_t) + \sum_{t \ge 0} \log \pi_\theta(a_t \mid s_t)$$

The first term contains the transition probabilities of the environment; we can't compute this. The second term contains the action probabilities of the policy; we are learning this! When we differentiate with respect to θ, the environment term drops out:

$$\frac{\partial}{\partial \theta} \log p_\theta(x) = \sum_{t \ge 0} \frac{\partial}{\partial \theta} \log \pi_\theta(a_t \mid s_t)$$

Expected reward under π_θ:

$$J(\theta) = \mathbb{E}_{x \sim p_\theta}\left[f(x)\right]$$

$$\frac{\partial J}{\partial \theta} = \mathbb{E}_{x \sim p_\theta}\left[f(x) \frac{\partial}{\partial \theta} \log p_\theta(x)\right] = \mathbb{E}_{x \sim p_\theta}\left[f(x) \sum_{t \ge 0} \frac{\partial}{\partial \theta} \log \pi_\theta(a_t \mid s_t)\right]$$

Here x is the sequence of states and actions when following policy π_θ, f(x) is the reward we get from that sequence, and $\frac{\partial}{\partial \theta} \log \pi_\theta(a_t \mid s_t)$ is the gradient of the predicted action scores with respect to the model weights, computed by backprop through the model π_θ.

REINFORCE algorithm:
1. Initialize random weights θ
2. Collect trajectories x and rewards f(x) using policy π_θ
3. Compute dJ/dθ
4. Gradient ascent step on θ
5. GOTO 2

Intuition:
When f(x) is high: increase the probability of the actions we took.
When f(x) is low: decrease the probability of the actions we took.

So far: Q-Learning and Policy Gradients

Q-Learning: Train a network Q(s, a; θ) to estimate future rewards for every (state, action) pair. Use the Bellman Equation to define the loss function for training Q:

$$y_{s,a,\theta} = \mathbb{E}_{r, s'}\left[r + \gamma \max_{a'} Q(s', a'; \theta)\right], \quad \text{where } r \sim R(s, a),\ s' \sim P(s, a)$$
$$L(s, a) = \left(Q(s, a; \theta) - y_{s,a,\theta}\right)^2$$

Policy Gradients: Train a network π_θ(a | s) that takes the state as input and gives a distribution over which action to take in that state. Use the REINFORCE rule for computing gradients:

$$\frac{\partial J}{\partial \theta} = \mathbb{E}_{x \sim p_\theta}\left[f(x) \frac{\partial}{\partial \theta} \log p_\theta(x)\right] = \mathbb{E}_{x \sim p_\theta}\left[f(x) \sum_{t \ge 0} \frac{\partial}{\partial \theta} \log \pi_\theta(a_t \mid s_t)\right]$$

Improving policy gradients: add a baseline to reduce the variance of the gradient estimator.

Other approaches

Actor-Critic: Train an actor that predicts actions (like policy gradients) and a critic that predicts the future rewards we get from taking those actions (like Q-Learning).
Sutton and Barto, “Reinforcement Learning: An Introduction”, 1998; Degris et al, “Model-free reinforcement learning with continuous action in practice”, 2012; Mnih et al, “Asynchronous Methods for Deep Reinforcement Learning”, ICML 2016

Model-Based RL: Learn a model of the world's state transition function P(s_{t+1} | s_t, a_t) and then use planning through the model to make decisions.

Imitation Learning: Gather data about how experts perform in the environment, and learn a function to imitate what they do (a supervised learning approach).

Inverse Reinforcement Learning: Gather data of experts performing in the environment; learn a reward function that they seem to be optimizing, then use RL on that reward function.
Ng et al, “Algorithms for Inverse Reinforcement Learning”, ICML 2000

Adversarial Learning: Learn to fool a discriminator that classifies actions as real/fake.
Ho and Ermon, “Generative Adversarial Imitation Learning”, NeurIPS 2016

Case Study: Playing Games

AlphaGo (January 2016)
- Used imitation learning + tree search + RL
- Beat 18-time world champion Lee Sedol

AlphaGo Zero (October 2017)
- Simplified version of AlphaGo
- No longer using imitation learning
- Beat (at the time) #1 ranked Ke Jie

AlphaZero (December 2018)
- Generalized to other games: Chess and Shogi

MuZero (November 2019)
- Plans through a learned model of the game

November 2019: Lee Sedol announces retirement. “With the debut of AI in Go games, I've realized that I'm not at the top even if I become the number one through frantic efforts.” “Even if I become the number one, there is an entity that cannot be defeated.”

Silver et al, “Mastering the game of Go with deep neural networks and tree search”, Nature 2016
Silver et al, “Mastering the game of Go without human knowledge”, Nature 2017
Silver et al, “A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play”, Science 2018
Schrittwieser et al, “Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model”, arXiv 2019
Quotes from: https://en.yna.co.kr/view/AEN20191127004800315

More Complex Games

StarCraft II: AlphaStar (October 2019)
Vinyals et al, “Grandmaster level in StarCraft II using multi-agent reinforcement learning”, Nature 2019

Dota 2: OpenAI Five (April 2019)
No paper, only a blog post: https://openai.com/five/

Reinforcement Learning: Interacting With the World


Normally we use RL to train agents that interact with a (noisy, nondifferentiable) environment

Reinforcement Learning: Stochastic Computation Graphs

We can also use RL to train neural networks with nondifferentiable components!

Example: a small "routing" network sends an image to one of K networks.

- The routing network outputs a probability for each of the K networks, e.g. P(orange) = 0.2, P(blue) = 0.1, P(green) = 0.7
- Sample a network from this distribution (e.g. green) and run the image through it
- Compute the loss; the reward is the negative loss
- Update the routing network with policy gradient
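A minimal sketch of this routing idea in PyTorch; router, the experts ModuleList, the optimizers, and the batch (images, labels) are assumptions, and routing the whole batch with a single sampled expert is a simplification. The sampled choice is nondifferentiable, so the router is trained with the policy gradient (reward = -loss), while the chosen expert is trained by ordinary backprop.

```python
import torch
import torch.nn as nn

def routing_step(router, experts, router_opt, expert_opt, images, labels):
    criterion = nn.CrossEntropyLoss()
    probs = torch.softmax(router(images).mean(dim=0), dim=-1)    # P(choose each of the K experts)
    dist = torch.distributions.Categorical(probs=probs)
    k = dist.sample()                                            # nondifferentiable choice

    loss = criterion(experts[int(k)](images), labels)            # task loss of the chosen expert
    reward = -loss.detach()                                      # reward = -loss

    router_loss = -reward * dist.log_prob(k)                     # REINFORCE objective for the router
    router_opt.zero_grad()
    expert_opt.zero_grad()
    (loss + router_loss).backward()
    router_opt.step()
    expert_opt.step()
```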

Stochastic Computation Graphs: Attention

Recall: image captioning with attention. At each timestep, use a weighted combination of features from different spatial positions (soft attention).


Hard Attention: At each timestep, select features from exactly one spatial location. Train with policy gradient.

Xu et al, “Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015

Summary: Reinforcement Learning

RL trains agents that interact with an environment and learn to maximize reward.

Q-Learning: Train a network Q(s, a) to estimate future rewards for every (state, action) pair. Use the Bellman Equation to define the loss function for training Q.

Policy Gradients: Train a network π_θ(a | s) that takes the state as input and gives a distribution over which action to take in that state. Use the REINFORCE rule for computing gradients.

Next Time: Course Recap; Open Problems in Computer Vision
