Lecture 21: Reinforcement Learning
Justin Johnson Lecture 21 - 1 December 4, 2019 Assignment 5: Object Detection
Single-stage detector Two-stage detector
Due on Monday 12/9, 11:59pm
Justin Johnson Lecture 21 - 2 December 4, 2019 Assignment 6: Generative Models
Generative Adversarial Networks
Due on Tuesday 12/17, 11:59pm
Justin Johnson Lecture 21 - 3 December 4, 2019 So far: Supervised Learning Supervised Learning Classification Data: (x, y) x is data, y is label
Goal: Learn a function to map x -> y
Examples: Classification, regression, Cat object detection, semantic segmentation, image captioning, etc. This image is CC0 public domain Justin Johnson Lecture 21 - 4 December 4, 2019 So far: Unsupervised Learning Unsupervised Learning Feature Learning (e.g. autoencoders) Data: x Just data, no labels!
Goal: Learn some underlying hidden structure of the data
Examples: Clustering, dimensionality reduction, feature learning, density estimation, etc.
Justin Johnson Lecture 21 - 5 December 4, 2019 Today: Reinforcement Learning Action Problems where an Agent Environment agent performs actions in environment, and receives rewards
Goal: Learn how to take actions that maximize reward Reward
Earth photo is in the public domain Robot image is in the public domain Justin Johnson Lecture 21 - 6 December 4, 2019 Overview
- What is reinforcement learning? - Algorithms for reinforcement learning - Q-Learning - Policy Gradients
Justin Johnson Lecture 21 - 7 December 4, 2019 Reinforcement Learning
Environment
Agent
Justin Johnson Lecture 21 - 8 December 4, 2019 Reinforcement Learning
Environment
State The agent sees a state; may st be noisy or incomplete
Agent
Justin Johnson Lecture 21 - 9 December 4, 2019 Reinforcement Learning
Environment
State Action The makes an action st at based on what it sees
Agent
Justin Johnson Lecture 21 - 10 December 4, 2019 Reinforcement Learning
Environment
State Action Reward Reward tells the agent st at rt how well it is doing
Agent
Justin Johnson Lecture 21 - 11 December 4, 2019 Reinforcement Learning Action causes change to environment Environment Environment
State Action Reward st at rt
Agent Agent
Agent learns
Justin Johnson Lecture 21 - 12 December 4, 2019 Reinforcement Learning Process repeats
Environment Environment
State Action Reward State Action Reward st at rt st+1 at+1 rt+1
Agent Agent
Justin Johnson Lecture 21 - 13 December 4, 2019 Example: Cart-Pole Problem Objective: Balance a pole on top of a movable cart
State: angle, angular speed, position, horizontal velocity
Action: horizontal force applied on the cart
Reward: 1 at each time step if the pole is upright
This image is CC0 public domain
Justin Johnson Lecture 21 - 14 December 4, 2019 Example: Robot Locomotion Objective: Make the robot move forward
State: Angle, position, velocity of all joints
Action: Torques applied on joints
Reward: 1 at each time step upright + forward movement Figure from: Schulman et al, “High-Dimensional Continuous Control Using Generalized Advantage Estimation”, ICLR 2016 Justin Johnson Lecture 21 - 15 December 4, 2019 Example: Atari Games
Objective: Complete the game with the highest score
State: Raw pixel inputs of the game screen Action: Game controls e.g. Left, Right, Up, Down Reward: Score increase/decrease at each time step
Mnih et al, “Playing Atari with Deep Reinforcement Learning”, NeurIPS Deep Learning Workshop, 2013 Justin Johnson Lecture 21 - 16 December 4, 2019 Example: Go
Objective: Win the game!
State: Position of all pieces
Action: Where to put the next piece down
Reward: On last turn: 1 if you won, 0 if you lost
This image is CC0 public domain Justin Johnson Lecture 21 - 17 December 4, 2019 Reinforcement Learning vs Supervised Learning
Environment Environment
State Action Reward State Action Reward st at rt st+1 at+1 rt+1
Agent Agent
Justin Johnson Lecture 21 - 18 December 4, 2019 Reinforcement Learning vs Supervised Learning
Dataset Dataset
Input Prediction Loss Input Prediction Loss xt yt Lt xt+t yt+1 Lt+1
Model Model
Why is RL different from normal supervised learning?
Justin Johnson Lecture 21 - 19 December 4, 2019 Reinforcement Learning vs Supervised Learning
Environment Environment
State Action Reward State Action Reward st at rt st+1 at+1 rt+1
Agent Agent
Stochasticity: Rewards and state transitions may be random
Justin Johnson Lecture 21 - 20 December 4, 2019 Reinforcement Learning vs Supervised Learning
Environment Environment
State Action Reward State Action Reward st at rt st+1 at+1 rt+1
Agent Agent
Credit assignment: Reward rt may not directly depend on action at
Justin Johnson Lecture 21 - 21 December 4, 2019 Reinforcement Learning vs Supervised Learning
Environment Environment
State Action Reward State Action Reward st at rt st+1 at+1 rt+1
Agent Agent
Nondifferentiable: Can’t backprop through world; can’t compute drt/dat
Justin Johnson Lecture 21 - 22 December 4, 2019 Reinforcement Learning vs Supervised Learning
Environment Environment
State Action Reward State Action Reward st at rt st+1 at+1 rt+1
Agent Agent
Nonstationary: What the agent experiences depends on how it acts
Justin Johnson Lecture 21 - 23 December 4, 2019 Markov Decision Process (MDP)
Mathematical formalization of the RL problem: A tuple (�, �, �, �, �)
S: Set of possible states A: Set of possible actions R: Distribution of reward given (state, action) pair P: Transition probability: distribution over next state given (state, action) �: Discount factor (tradeoff between future and present rewards)
Markov Property: The current state completely characterizes the state of the world. Rewards and next states depend only on current state, not history.
Justin Johnson Lecture 21 - 24 December 4, 2019 Markov Decision Process (MDP)
Mathematical formalization of the RL problem: A tuple (�, �, �, �, �)
S: Set of possible states A: Set of possible actions R: Distribution of reward given (state, action) pair P: Transition probability: distribution over next state given (state, action) �: Discount factor (tradeoff between future and present rewards)
Agent executes a policy � giving distribution of actions conditioned on states
Justin Johnson Lecture 21 - 25 December 4, 2019 Markov Decision Process (MDP)
Mathematical formalization of the RL problem: A tuple (�, �, �, �, �)
S: Set of possible states A: Set of possible actions R: Distribution of reward given (state, action) pair P: Transition probability: distribution over next state given (state, action) �: Discount factor (tradeoff between future and present rewards)
Agent executes a policy � giving distribution of actions conditioned on states * Goal: Find policy � that maximizes cumulative discounted reward: ∑ � �
Justin Johnson Lecture 21 - 26 December 4, 2019 Markov Decision Process (MDP)
- At time step t=0, environment samples initial state � ~ �(� ) - Then, for t=0 until done: - Agent selects action � ~ � � � ) - Environment samples reward � ~ � � � , � ) - Environment samples next state � ~ � � | � , � - Agent receives reward rt and next state st+1
Justin Johnson Lecture 21 - 27 December 4, 2019 A simple MDP: Grid World
Actions: States Reward 1. Right ★ Set a negative 2. Left “reward” for 3. Up ★ each transition 4. Down (e.g. r = -1)
Objective: Reach one of the terminal states in as few moves as possible
Justin Johnson Lecture 21 - 28 December 4, 2019 A simple MDP: Grid World
Bad policy Optimal Policy
★ ★
★ ★
Justin Johnson Lecture 21 - 29 December 4, 2019 Finding Optimal Policies
Goal: Find the optimal policy �* that maximizes (discounted) sum of rewards.
Justin Johnson Lecture 21 - 30 December 4, 2019 Finding Optimal Policies
Goal: Find the optimal policy �* that maximizes (discounted) sum of rewards.
Problem: Lots of randomness! Initial state, transition probabilities, rewards
Justin Johnson Lecture 21 - 31 December 4, 2019 Finding Optimal Policies
Goal: Find the optimal policy �* that maximizes (discounted) sum of rewards.
Problem: Lots of randomness! Initial state, transition probabilities, rewards
Solution: Maximize the expected sum of rewards
� ~ � � ∗ � = arg max � � � | � � ~ � � | � � ~ � � | � , �
Justin Johnson Lecture 21 - 32 December 4, 2019 Value Function and Q Function
Following a policy � produces sample trajectories (or paths) s0, a0, r0, s1, a1, r1, …
Justin Johnson Lecture 21 - 33 December 4, 2019 Value Function and Q Function
Following a policy � produces sample trajectories (or paths) s0, a0, r0, s1, a1, r1, …
How good is a state? The value function at state s, is the expected cumulative reward from following the policy from state s: