Deep Reinforcement Learning Via Imitation Learning
Sergey Levine

The sensorimotor loop
• Perception drives action (e.g. "run away"), and each action changes what is perceived next.

End-to-end vision
• Standard computer vision pipeline: hand-designed features (e.g. HOG, Felzenszwalb '08) → mid-level features (e.g. DPM) → classifier (e.g. SVM).
• Deep learning replaces this pipeline with end-to-end training (Krizhevsky '12).

End-to-end control
• Standard robotic control pipeline: observations (e.g. vision) → state estimation → modeling & prediction → motion planning → low-level controller (e.g. PD) → motor torques.
• Deep sensorimotor learning maps observations directly to motor torques.
• Compared to supervised learning, the supervision is indirect and actions have consequences.

Contents
• Imitation learning
• Imitation without a human
• Research frontiers

Terminology & notation
• Running example: given an observation, choose among actions: 1. run away, 2. ignore, 3. pet.

A bit of history
• Optimal control (управление, "control"): Lev Pontryagin, Richard Bellman.

Imitation learning
• Collect demonstration data (observation–action pairs) and fit a policy with supervised learning (images: Bojarski et al. '16, NVIDIA).
• Does it work? In general, no: the learned policy drifts into states the demonstrations never covered. In practice it can work (video: Bojarski et al. '16, NVIDIA).
• Why did that work? The NVIDIA system augments the demonstrations (e.g. with left/right camera images) so the policy learns to correct small deviations (Bojarski et al. '16).
• Can we make it work more often? Relevant considerations are cost and stability, e.g. learning from a stabilizing controller (more on this later).

DAgger: Dataset Aggregation (Ross et al. '11)
• Run the current policy, ask the expert to label the states it visits, aggregate the new labels with the old data, and retrain.
• What's the problem? The expert (typically a human) must label every on-policy state, which is expensive and hard to do well.

Imitation learning: recap
• Training data + supervised learning.
• Usually (but not always) insufficient by itself, because of the distribution mismatch problem.
• Sometimes works well:
  – hacks (e.g. left/right images)
  – samples from a stable trajectory distribution
  – adding more on-policy data, e.g. using DAgger (see the sketch below).
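The recap above mentions DAgger as a way to add on-policy data. Below is a minimal sketch of that loop, not a reference implementation: `env`, `policy`, and `expert_action` are hypothetical stand-ins with a gym-style interface and a generic supervised learner.

```python
import numpy as np

def dagger(env, policy, expert_action, n_iters=10, horizon=200):
    """Minimal DAgger loop (after Ross et al. '11), sketched for illustration.

    Assumed (hypothetical) interfaces:
      - env.reset() -> obs, env.step(action) -> (obs, reward, done, info)
      - policy.act(obs) -> action
      - policy.fit(obs_array, act_array) runs supervised learning on the
        aggregated dataset
      - expert_action(obs) -> the expert's label for this observation
    """
    dataset_obs, dataset_acts = [], []
    for _ in range(n_iters):
        obs = env.reset()
        for _ in range(horizon):
            # Act with the *current* policy so the dataset covers the states
            # this policy actually visits (the distribution-mismatch fix).
            action = policy.act(obs)
            # Aggregate: every visited state gets an expert label.
            dataset_obs.append(obs)
            dataset_acts.append(expert_action(obs))
            obs, _, done, _ = env.step(action)
            if done:
                break
        # Supervised learning on all data collected so far.
        policy.fit(np.array(dataset_obs), np.array(dataset_acts))
    return policy
```

Plain behavioral cloning corresponds to fitting once on expert rollouts and never revisiting the learner's own state distribution; the next section replaces the human `expert_action` with an automatic supervisor.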
Imitation without a human

Trajectory optimization
• Deterministic trajectory optimization and its probabilistic version (also shown in pictures).

DAgger without humans (Ross et al. '11)
• Replace the human labeler with an automatic supervisor, e.g. a trajectory optimizer.
• Another problem: data must still be collected by acting in the real world, where mistakes are costly.

PLATO: Policy Learning with Adaptive Trajectory Optimization (Kahn, Zhang, Levine, Abbeel '16)
• An adaptive trajectory optimizer generates the training actions and supervises the policy; when needed the path is replanned, so high cost is avoided.
• Input substitution trick: the supervisor needs the full state at training time, but the learned policy needs only raw observations at test time.

Beyond driving & flying

Trajectory optimization with unknown dynamics [L. et al. NIPS '14]
• Alternate between fitting local dynamics models and improving the trajectory, keeping the new trajectory distribution close to the old one.

Learning on PR2 [L. et al. ICRA '15]

Combining with policy learning
• Constrain the trajectory distribution(s) to agree with the expectation under the current policy, enforced with a Lagrange multiplier via dual descent (L. et al. ICML '14); BADMM can also be used (L. et al. '15).

Guided Policy Search
• Alternates trajectory-centric RL with supervised learning of the policy [see L. et al. NIPS '14 for details].
• The trajectory optimizer is needed only at training time; at test time the learned policy runs on its own.
• End-to-end visuomotor policy with ~92,000 parameters (L.*, Finn*, Darrell, Abbeel '16).

Experimental tasks and generalization experiments

Comparisons
• Conditions: end-to-end training, pose prediction (vision trained on pose only), and pose features (vision trained on pose only).
• Success rates:

  task                  pose prediction   pose features   end-to-end training
  coat hanger           55.6%             88.9%           100%
  shape sorting cube    0%                70.4%           96.3%
  toy claw hammer       8.9%              62.2%           91.1%
  bottle cap            n/a               55.6%           88.9%

• The slide also notes "2 cm" for the shape sorting cube and cites Meeussen et al. (Willow Garage).

Guided Policy Search applications
• Manipulation (with N. Wagener and P. Abbeel)
• Dexterous hands (with V. Kumar and E. Todorov)
• Soft hands (with A. Gupta, C. Eppner, P. Abbeel)
• Locomotion (with G. Kahn, T. Zhang, P. Abbeel)
• Aerial vehicles (with V. Koltun)

A note about terminology
• The "R" word, and a bit of history: "reinforcement learning" names both the problem statement (the optimal control setting of Lev Pontryagin and Richard Bellman) and a family of methods that learn without using the model (Andrew Barto, Richard Sutton).

Research frontiers

Ingredients for success in learning
• Supervised learning succeeded through computation, algorithms, and data.
• Learning sensorimotor skills has the computation and the algorithms; the analogue of the data is still an open question (~?).

Grasping with Learned Hand-Eye Coordination (L., Pastor, Krizhevsky, Quillen '16)
• 800,000 grasp attempts for training (3,000 robot-hours).
• Monocular RGB camera (no depth), 7 DoF arm, 2-finger gripper, objects in a bin.
• 2–5 Hz update, no prior knowledge.

Using grasp success prediction
• A network trained to predict grasp success from the camera image and a candidate gripper motion is used at test time to choose motions (see the closed-loop sketch at the end of this document).

Open-loop vs. closed-loop grasping
• Open-loop grasping: 33.7% failure rate.
• Depth + segmentation (Pinto & Gupta, 2015): 35% failure rate.
• Closed-loop grasping: 17.5% failure rate.

Grasping experiments (L., Pastor, Krizhevsky, Quillen '16)

Continuous learning in the real world
• Breadth and diversity of data
• Learning new tasks quickly
• Leveraging prior data
• Task success supervision

Learning from prior experience (with J. Fu)

Learning what success means (with C. Finn, P. Abbeel)
• Can we learn the cost with visual features?

Challenges & frontiers
• Algorithms: sample complexity, safety, scalability
• Supervision: automatically evaluate success, learn cost functions
• Transfer from prior experience

Acknowledgements
Greg Kahn, Tianhao Zhang, Chelsea Finn, Trevor Darrell, Pieter Abbeel, Justin Fu, Peter Pastor, Alex Krizhevsky, Deirdre Quillen.
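As referenced in the grasping section above, the sketch below illustrates the closed-loop idea: repeatedly score candidate gripper motions with a learned success predictor and execute a small step toward the best one. It is only a sketch under assumed interfaces; `camera`, `robot`, and `predict_success` are hypothetical stand-ins, and the published system selects motions with a more careful optimization (e.g. a CEM-style search) rather than naive sampling.

```python
import numpy as np

def servo_to_grasp(camera, robot, predict_success, n_candidates=64,
                   max_steps=40, commit_threshold=0.9):
    """Closed-loop grasping sketch: re-observe, re-score, re-plan each step.

    Assumed (hypothetical) interfaces:
      - camera.read() -> current monocular RGB image
      - robot.sample_motions(n) -> list of candidate end-effector motions
      - robot.execute(motion) moves a short distance along the chosen motion
      - robot.close_gripper() commits to the grasp
      - predict_success(image, motion) -> learned probability of grasp success
    """
    for _ in range(max_steps):
        image = camera.read()
        candidates = robot.sample_motions(n_candidates)
        scores = np.array([predict_success(image, m) for m in candidates])
        best = candidates[int(np.argmax(scores))]
        if scores.max() >= commit_threshold:
            robot.close_gripper()   # confident enough: commit to the grasp
            return True
        robot.execute(best)         # otherwise take a small step and re-plan
    return False
```

An open-loop baseline would pick one grasp from the initial image and execute it blindly; re-planning at every step is what lets the learned model correct small errors, consistent with the lower closed-loop failure rate quoted above.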