Deep Reinforcement Learning via Imitation Learning
Sergey Levine

[Figure: the sensorimotor loop: perception → action (run away)]

End-to-end vision:
standard computer vision: low-level features (e.g. HOG) → mid-level features (e.g. DPM) → classifier (e.g. SVM)  [Felzenszwalb '08]
deep learning: end to end, pixels to labels  [Krizhevsky '12]

End-to-end control:
standard robotic control: observations → state estimation (e.g. vision) → modeling & prediction → motion planning → low-level controller (e.g. PD) → motor torques
deep sensorimotor learning: observations → motor torques, end to end
indirect supervision; actions have consequences

Contents
Imitation learning
Imitation without a human
Research frontiers

Terminology & notation
1. run away  2. ignore  3. pet
a bit of history…
control ("управление"): Lev Pontryagin, Richard Bellman

Imitation Learning
[Figure: training data → supervised learning → policy]
Images: Bojarski et al. '16, NVIDIA

Does it work? No!
Does it work? Yes!
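The "supervised learning" box above is behavioral cloning: fit a policy to (observation, action) pairs collected from the demonstrator, with no interaction during training. A minimal sketch with a 1-D linear policy fit by least squares (the scalar setup is illustrative, not NVIDIA's network):

```python
def behavioral_clone(observations, actions):
    """Fit a 1-D linear policy a = w*o + b to expert (observation,
    action) pairs by least squares: plain supervised learning,
    with no environment interaction during training."""
    n = len(observations)
    mean_o = sum(observations) / n
    mean_a = sum(actions) / n
    cov = sum((o - mean_o) * (a - mean_a)
              for o, a in zip(observations, actions))
    var = sum((o - mean_o) ** 2 for o in observations)
    w = cov / var
    b = mean_a - w * mean_o
    return lambda o: w * o + b

# Toy expert steers proportionally to lane offset; the clone recovers it.
policy = behavioral_clone([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
```

The catch, as the following slides show, is that small prediction errors compound: the cloned policy drifts to states the expert never visited, where its predictions are unreliable.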
Video: Bojarski et al. '16, NVIDIA

Why did that work?
Bojarski et al. '16, NVIDIA
Can we make it work more often?
cost
stability

Learning from a stabilizing controller
(more on this later)
DAgger: Dataset Aggregation (Ross et al. '11)

DAgger Example
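The DAgger loop itself is short: roll out the current policy, label the visited states with the expert, aggregate, retrain. A toy sketch in pure Python (the 1-nearest-neighbour "policy", scalar dynamics, and the expert/env functions are all stand-ins for illustration):

```python
def dagger(expert, env_step, init_state, iters=3, horizon=20):
    """Toy DAgger loop (Ross et al. '11): roll out the CURRENT policy,
    label the visited states with the expert, aggregate, retrain."""
    data = []  # aggregated (state, expert_action) pairs

    def train(pairs):
        frozen = list(pairs)  # snapshot so the policy is fixed per rollout
        def policy(s):
            if not frozen:
                return 0.0  # arbitrary initial action
            # toy "policy": 1-nearest-neighbour lookup; a real system
            # would fit a neural network to the aggregated dataset
            return min(frozen, key=lambda p: abs(p[0] - s))[1]
        return policy

    policy = train(data)
    for _ in range(iters):
        s = init_state
        for _ in range(horizon):
            a = policy(s)                # act with the learner...
            data.append((s, expert(s)))  # ...but label with the expert
            s = env_step(s, a)
        policy = train(data)             # retrain on the aggregate
    return policy, data

# Toy task: the expert drives the state toward zero.
policy, data = dagger(expert=lambda s: -s,
                      env_step=lambda s, a: s + 0.5 * a,
                      init_state=1.0)
```

The key difference from behavioral cloning is the labeling step: the dataset covers states the learner actually visits, which is what fixes the distribution mismatch.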
What's the problem? (Ross et al. '11)

Imitation learning: recap
[Figure: training data → supervised learning → policy]

• Usually (but not always) insufficient by itself
  – Distribution mismatch problem
• Sometimes works well
  – Hacks (e.g. left/right images)
  – Samples from a stable trajectory distribution
  – Add more on-policy data, e.g. using DAgger
Imitation without a Human

Trajectory optimization
Probabilistic version
Probabilistic version (in pictures)

DAgger without Humans
Ross et al. '11

Another problem

PLATO: Policy Learning with Adaptive Trajectory Optimization
Kahn, Zhang, Levine, Abbeel '16
path replanned!
avoids high cost!
Input substitution trick: need state at training time, but not at test time!
Kahn, Zhang, Levine, Abbeel '16
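The idea of PLATO's adaptive teacher can be sketched in a few lines. In this sketch (which is illustrative: `cost`, `candidates`, and `lam` stand in for the paper's MPC objective, not its exact form), the teacher acts on full state but compromises toward what the learner would do, so training-time states resemble the states the learner will visit:

```python
def plato_teacher_action(state, obs, learner, cost, candidates, lam=1.0):
    """Sketch of PLATO's adaptive teacher (Kahn, Zhang, Levine, Abbeel
    '16): choose an action trading off task cost against staying close
    to the current learner policy's action. The teacher sees the full
    state; the learner sees only the observation (input substitution).
    `cost`, `candidates`, and `lam` are illustrative stand-ins."""
    a_learner = learner(obs)
    return min(candidates,
               key=lambda a: cost(state, a) + lam * (a - a_learner) ** 2)

# Toy example: the task wants a = 2, the learner currently outputs 0,
# so the teacher compromises between the two.
a = plato_teacher_action(state=None, obs=None,
                         learner=lambda o: 0.0,
                         cost=lambda s, a: (a - 2.0) ** 2,
                         candidates=[0.0, 1.0, 2.0])
```

The supervision pairs stored for the learner are (observation, teacher action), which is why state is needed only at training time.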
Beyond driving & flying

Trajectory Optimization with Unknown Dynamics
[L. et al. NIPS '14]

Learning on PR2
[L. et al. ICRA '15]

Combining with Policy Learning
[Figure: expectation under current policy; trajectory distribution(s); Lagrange multiplier]
L. et al. ICML '14 (dual descent); can also use BADMM (L. et al. '15)

Guided Policy Search
trajectory-centric RL + supervised learning; training time vs. test time
[see L. et al. NIPS '14 for details]
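The structure behind these slides can be written schematically as a constrained problem (notation loosely follows L. et al. '14; the divergence $D$ is schematic, and the BADMM variant uses a different penalty):

```latex
\min_{\theta,\,p}\; \mathbb{E}_{p(\tau)}\big[c(\tau)\big]
\quad \text{s.t.} \quad
p(\mathbf{u}_t \mid \mathbf{x}_t) = \pi_\theta(\mathbf{u}_t \mid \mathbf{x}_t)\;\;\forall t,
\qquad
\mathcal{L}(p,\theta,\lambda) \;=\; \mathbb{E}_{p(\tau)}\big[c(\tau)\big]
\;+\; \sum_t \lambda_t \, D\big(p(\mathbf{u}_t \mid \mathbf{x}_t),\; \pi_\theta(\mathbf{u}_t \mid \mathbf{x}_t)\big).
```

Dual descent alternates three steps: optimize the trajectory distribution(s) $p$ with trajectory-centric RL, optimize $\theta$ by supervised learning so the policy matches the actions of $p$, and take an ascent step on the Lagrange multipliers $\lambda$.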
L.*, Finn*, Darrell, Abbeel '16 (~92,000 parameters)

Experimental Tasks

Generalization Experiments

Comparisons
end-to-end training vs. pose prediction (trained on pose only) vs. pose features (trained on pose only)

task                | pose prediction | pose features | end-to-end training
coat hanger         | 55.6%           | 88.9%         | 100%
shape sorting cube  | 0%              | 70.4%         | 96.3%
toy claw hammer     | 8.9%            | 62.2%         | 91.1%
bottle cap          | n/a             | 55.6%         | 88.9%
(bottle cap image: Meeussen et al., Willow Garage)

Guided Policy Search Applications
manipulation, dexterous hands, soft hands
with N. Wagener and P. Abbeel; with V. Kumar and E. Todorov; with A. Gupta, C. Eppner, P. Abbeel
locomotion; aerial vehicles
with G. Kahn, T. Zhang, P. Abbeel; with V. Koltun

A note about terminology… the "R" word
a bit of history…
reinforcement learning (the problem statement): Lev Pontryagin, Richard Bellman
reinforcement learning without using the model (the method): Andrew Barto, Richard Sutton
Research frontiers

ingredients for success in learning:
supervised learning: data, computation, algorithms
learning sensorimotor skills: computation, algorithms, data ~?
Grasping with Learned Hand-Eye Coordination
L., Pastor, Krizhevsky, Quillen '16

• 800,000 grasp attempts for training (3,000 robot-hours)
• monocular camera (no depth)
• 2-5 Hz update
• no prior knowledge
[Figure: monocular RGB camera, 7 DoF arm, 2-finger gripper, object bin]
Using Grasp Success Prediction
L., Pastor, Krizhevsky, Quillen '16
[Figure: training vs. testing]

Open-Loop vs. Closed-Loop Grasping
open-loop grasping: failure rate 33.7%
closed-loop grasping: failure rate 17.5%
depth + segmentation: failure rate 35%
Pinto & Gupta, 2015

Grasping Experiments
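Closed-loop grasping can be read as a small servoing loop around the learned success predictor: repeatedly score candidate motor commands against the current image and execute the best one. A sketch (the predictor and motion sampler below are stand-ins for the trained CNN and the candidate optimizer, not the paper's exact procedure):

```python
def servo_step(success_prob, image, sample_motion, n_candidates=64):
    """One step of closed-loop grasp servoing: sample candidate motor
    commands, score each with the learned grasp-success predictor on
    the CURRENT camera image, and return the best-scoring command.
    Re-running this every frame (2-5 Hz) lets the gripper correct
    itself as the scene changes."""
    candidates = [sample_motion() for _ in range(n_candidates)]
    return max(candidates, key=lambda m: success_prob(image, m))

# Toy stand-ins: the "network" prefers motions near 0.3.
motions = iter([0.0, 0.25, 0.5, 0.75])
best = servo_step(lambda img, m: -abs(m - 0.3),
                  image=None,
                  sample_motion=lambda: next(motions),
                  n_candidates=4)
```

Because the prediction is re-evaluated continuously, the system can recover from slips and perturbations, which is the gap between the 33.7% and 17.5% failure rates above.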
L., Pastor, Krizhevsky, Quillen '16

Continuous Learning in the Real World
• breadth and diversity of data
• learning new tasks quickly
• leveraging prior data
• task success supervision

Learning from Prior Experience
with J. Fu

Learning what Success Means
can we learn the cost with visual features?
with C. Finn, P. Abbeel

Challenges & Frontiers
• Algorithms
  – Sample complexity
  – Safety
  – Scalability
• Supervision
  – Automatically evaluate success
  – Learn cost functions
• Transfer from prior experience

Acknowledgements
Greg Kahn Tianhao Zhang Chelsea Finn Trevor Darrell Pieter Abbeel
Justin Fu Peter Pastor Alex Krizhevsky Deirdre Quillen