
Deep Reinforcement Learning

Sergey Levine

perception → action (e.g., run away): the sensorimotor loop

End-to-end vision
standard computer vision: observations → features (e.g. HOG) → mid-level features (e.g. DPM) → classifier (e.g. SVM)   [Felzenszwalb '08]
deep learning: observations → learned features, trained end-to-end   [Krizhevsky '12]

End-to-end control
standard robotic control: observations (e.g. vision) → state estimation → modeling & prediction → motion planning → low-level controller (e.g. PD) → motor torques
deep sensorimotor learning: observations → motor torques, trained end-to-end
indirect supervision: actions have consequences

Contents

Imitation learning

Imitation without a human

Research frontiers

Terminology & notation

example actions: 1. run away   2. ignore   3. pet
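For reference, a hedged statement of the notation assumed in the rest of these notes (the standard sensorimotor-learning convention; the specific symbols are an assumption, not quoted from the slides):

    % o_t: observation, x_t: state, u_t: action, c: cost
    % the learned policy maps observations to actions
    \pi_\theta(\mathbf{u}_t \mid \mathbf{o}_t), \qquad
    \mathbf{x}_{t+1} \sim p(\mathbf{x}_{t+1} \mid \mathbf{x}_t, \mathbf{u}_t)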

a bit of history…

optimal control: Lev Pontryagin, Richard Bellman

Contents

Imitation learning

Imitation without a human

Research frontiers

Imitation Learning

training data → supervised learning
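A minimal behavior-cloning sketch of this diagram: imitation as supervised learning from observation → action pairs. The network shape, optimizer, and feature dimensionality below are illustrative assumptions, not details from the talk.

    import torch
    import torch.nn as nn

    # small policy network: observation features -> steering command
    policy = nn.Sequential(
        nn.Linear(512, 128), nn.ReLU(),
        nn.Linear(128, 1),
    )
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

    def behavior_cloning_step(obs_batch, act_batch):
        """One supervised step: regress the expert's action from the observation."""
        loss = nn.functional.mse_loss(policy(obs_batch), act_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()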

Images: Bojarski et al. ‘16, NVIDIA

Does it work? No!

Does it work? Yes!

Video: Bojarski et al. ‘16, NVIDIA

Why did that work?

Bojarski et al. ‘16, NVIDIA

Can we make it work more often?
cost, stability

Learning from a stabilizing controller
(more on this later)

Can we make it work more often?

DAgger: Dataset Aggregation

Ross et al. ‘11

DAgger Example
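The DAgger loop is short enough to sketch. A hedged, minimal version follows; the env, expert, and policy objects are placeholder interfaces for illustration, not the code of Ross et al. ‘11.

    def dagger(env, expert, policy, n_iters=10, horizon=200):
        """Dataset Aggregation: train on states visited by the learned policy,
        labeled with the expert's actions (sketch of Ross et al. '11)."""
        dataset = []
        for _ in range(n_iters):
            obs = env.reset()
            for _ in range(horizon):
                action = policy.act(obs)                # run the *current* policy...
                dataset.append((obs, expert.act(obs)))  # ...but label with the expert's action
                obs, done = env.step(action)
                if done:
                    break
            policy.fit(dataset)                         # supervised learning on the aggregate
        return policy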

Ross et al. ‘11

What’s the problem?

Ross et al. ‘11

Imitation learning: recap

training data → supervised learning

• Usually (but not always) insufficient by itself
  – Distribution mismatch problem
• Sometimes works well
  – Hacks (e.g. left/right images)
  – Samples from a stable trajectory distribution
  – Add more on-policy data, e.g. using DAgger

Contents

Imitation learning

Imitation without a human

Research frontiers

Terminology & notation

example actions: 1. run away   2. ignore   3. pet

Trajectory optimization

Probabilistic version

Probabilistic version (in pictures)
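As a hedged summary of what these slides set up (standard trajectory-optimization notation, matching the symbols above rather than quoting the slides):

    % deterministic trajectory optimization
    \min_{\mathbf{u}_1,\dots,\mathbf{u}_T} \sum_{t=1}^{T} c(\mathbf{x}_t, \mathbf{u}_t)
    \quad \text{s.t.} \quad \mathbf{x}_{t+1} = f(\mathbf{x}_t, \mathbf{u}_t)

    % probabilistic version: expected cost under stochastic dynamics
    \min_{\mathbf{u}_1,\dots,\mathbf{u}_T} \; \mathbb{E}\Big[\sum_{t=1}^{T} c(\mathbf{x}_t, \mathbf{u}_t)\Big],
    \qquad \mathbf{x}_{t+1} \sim p(\mathbf{x}_{t+1} \mid \mathbf{x}_t, \mathbf{u}_t)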

DAgger without Humans

Ross et al. ‘11

Another problem

PLATO: Policy Learning with Adaptive Trajectory Optimization

Kahn, Zhang, Levine, Abbeel ‘16

path replanned!

avoids high cost!

input substitution trick: need state at training time, but not at test time!
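A hedged sketch of that input substitution trick: a trajectory-optimization teacher plans with privileged state, while the policy is trained only on raw observations, so no state estimation is needed at test time. (In PLATO the teacher is additionally adapted toward the current policy so the visited states stay relevant; that part is omitted here. The interfaces below are illustrative assumptions.)

    def collect_labeled_observations(env, mpc_teacher, horizon=200):
        """Gather (observation, teacher action) pairs for supervised policy training."""
        data = []
        state, obs = env.reset()              # training-time setup exposes both state and observation
        for _ in range(horizon):
            action = mpc_teacher.plan(state)  # teacher uses the full state
            data.append((obs, action))        # but the label is attached to the raw observation
            state, obs, done = env.step(action)
            if done:
                break
        return data                           # train the policy: observation -> action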

Kahn, Zhang, Levine, Abbeel ‘16

Beyond driving & flying

Trajectory Optimization with Unknown Dynamics

[L. et al. NIPS ‘14]

Trajectory Optimization with Unknown Dynamics
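A hedged summary of the idea in [L. et al. NIPS ‘14]: fit local time-varying linear-Gaussian dynamics to samples, then improve the trajectory distribution under a KL trust region against the previous (old) one:

    % fitted local dynamics
    p(\mathbf{x}_{t+1} \mid \mathbf{x}_t, \mathbf{u}_t) \approx
      \mathcal{N}\big(\mathbf{A}_t \mathbf{x}_t + \mathbf{B}_t \mathbf{u}_t + \mathbf{c}_t,\; \mathbf{F}_t\big)

    % constrained trajectory update
    \min_{p_{\text{new}}} \; \mathbb{E}_{p_{\text{new}}}\Big[\sum_t c(\mathbf{x}_t, \mathbf{u}_t)\Big]
    \quad \text{s.t.} \quad D_{\mathrm{KL}}\big(p_{\text{new}}(\tau) \,\|\, p_{\text{old}}(\tau)\big) \le \epsilon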

new vs. old trajectory distribution

[L. et al. NIPS ‘14]

Learning on PR2

[L. et al. ICRA ‘15]

Combining with Policy Learning
expectation under current policy

trajectory distribution(s)

L. et al. ICML ’14 (dual descent); can also use BADMM (L. et al. ’15)

Lagrange multiplier

Guided Policy Search

trajectory-centric RL ↔ supervised learning   [see L. et al. NIPS ‘14 for details]
training time vs. test time
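A hedged statement of the guided policy search formulation alluded to above: trajectory-centric RL and supervised learning alternate as the two halves of one constrained problem, with the constraint enforced via Lagrange multipliers (dual descent) or BADMM:

    \min_{\theta,\, p} \; \mathbb{E}_{p(\tau)}\Big[\sum_t c(\mathbf{x}_t, \mathbf{u}_t)\Big]
    \quad \text{s.t.} \quad p(\mathbf{u}_t \mid \mathbf{x}_t) = \pi_\theta(\mathbf{u}_t \mid \mathbf{x}_t) \;\; \forall t
    % p is improved with trajectory-centric RL (training time, uses state);
    % \pi_\theta is fit to p with supervised learning (test time, uses observations only)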

L.*, Finn*, Darrell, Abbeel ‘16
~92,000 parameters

Experimental Tasks

Generalization Experiments

Comparisons

end-to-end training
pose prediction (trained on pose only)
pose features (trained on pose only)

success rate          coat hanger   shape sorting cube   toy claw hammer   bottle cap
pose prediction       55.6%         0%                   8.9%              n/a
pose features         88.9%         70.4%                62.2%             55.6%
end-to-end training   100%          96.3%                91.1%             88.9%

Meeussen et al. (Willow Garage)

Guided Policy Search Applications
manipulation (with N. Wagener and P. Abbeel)
dexterous hands (with V. Kumar and E. Todorov)
soft hands (with A. Gupta, C. Eppner, P. Abbeel)
locomotion (with V. Koltun)
aerial vehicles (with G. Kahn, T. Zhang, P. Abbeel)

A note about terminology… the “R” word

a bit of history…
reinforcement learning (the problem statement)
reinforcement learning without using the model (the method)
Lev Pontryagin, Richard Bellman, Andrew Barto, Richard Sutton

Contents

Imitation learning

Imitation without a human

Research frontiers

ingredients for success in learning
supervised learning: computation, algorithms, data
learning sensorimotor skills: computation, algorithms, data ~ ?

L., Pastor, Krizhevsky, Quillen ‘16

Grasping with Learned Hand-Eye Coordination

• 800,000 grasp attempts for training (3,000 robot-hours)
• monocular RGB camera (no depth)
• 2-5 Hz update
• no prior knowledge
setup: 7 DoF arm, 2-finger gripper, object bin

L., Pastor, Krizhevsky, Quillen ‘16

Using Grasp Success Prediction

training vs. testing
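A hedged sketch of how such a grasp-success predictor can drive closed-loop control: a network scores candidate gripper motions from the current image, and the robot repeatedly executes the best-scoring motion. The sampler below is a generic cross-entropy method; the grasp_net interface and the motion dimensionality are illustrative assumptions.

    import numpy as np

    def servo_step(image, grasp_net, n_samples=64, n_elite=6, n_iters=3):
        """One closed-loop step: search over candidate end-effector motions."""
        mean, std = np.zeros(3), 0.05 * np.ones(3)         # Cartesian displacement (m)
        for _ in range(n_iters):
            candidates = mean + std * np.random.randn(n_samples, 3)
            scores = grasp_net.predict(image, candidates)  # predicted P(grasp success)
            elite = candidates[np.argsort(scores)[-n_elite:]]
            mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
        return mean                                        # motion command to execute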

L., Pastor, Krizhevsky, Quillen ‘16

Open-Loop vs. Closed-Loop Grasping

open-loop grasping: failure rate 33.7%
closed-loop grasping: failure rate 17.5%
depth + segmentation: failure rate 35%

Pinto & Gupta, 2015

Grasping Experiments

L., Pastor, Krizhevsky, Quillen ‘16

Continuous Learning in the Real World

• breadth and diversity of data
• learning new tasks quickly
• leveraging prior data
• task success supervision

Learning from Prior Experience

with J. Fu

Learning what Success Means

can we learn the cost with visual features?

with C. Finn, P. Abbeel

Challenges & Frontiers

• Algorithms
  – Sample complexity
  – Safety
  – Scalability
• Supervision
  – Automatically evaluate success
  – Learn cost functions
• Transfer from prior experience

Acknowledgements

Greg Kahn, Tianhao Zhang, Chelsea Finn, Trevor Darrell, Pieter Abbeel

Justin Fu, Peter Pastor, Alex Krizhevsky, Deirdre Quillen