An Algorithmic Perspective on Imitation Learning

Foundations and Trends® in Robotics, Vol. 7, No. 1-2 (2018), pp. 1–179
© 2018 T. Osa, J. Pajarinen, G. Neumann, J. A. Bagnell, P. Abbeel and J. Peters
DOI: 10.1561/2300000053

Takayuki Osa, University of Tokyo
Joni Pajarinen, Technische Universität Darmstadt
Gerhard Neumann, University of Lincoln
J. Andrew Bagnell, Carnegie Mellon University
Pieter Abbeel, University of California, Berkeley
Jan Peters, Technische Universität Darmstadt

Contents

1 Introduction
  1.1 Key Successes in Imitation Learning
  1.2 Imitation Learning from the Point of View of Robotics
  1.3 Differences between Imitation Learning and Supervised Learning
  1.4 Insights for Machine Learning and Robotics Research
  1.5 Statistical Machine Learning Background
    1.5.1 Notation and Mathematical Formalization
    1.5.2 Markov Property
    1.5.3 Markov Decision Process
    1.5.4 Entropy
    1.5.5 Kullback-Leibler (KL) Divergence
    1.5.6 Information and Moment Projections
    1.5.7 The Maximum Entropy Principle
    1.5.8 Background: Reinforcement Learning
  1.6 Formulation of the Imitation Learning Problem

2 Design of Imitation Learning Algorithms
  2.1 Design Choices for Imitation Learning Algorithms
  2.2 Behavioral Cloning and Inverse Reinforcement Learning
  2.3 Model-Free and Model-Based Imitation Learning Methods
  2.4 Observability
    2.4.1 Trajectories in Fully Observable Settings
    2.4.2 Trajectories in Partially Observable Settings
    2.4.3 Differences in Observability between the Expert and the Learner
  2.5 Policy Representation in Imitation Learning
    2.5.1 Levels of Policy Abstraction
    2.5.2 Hierarchical vs. Monolithic Policies
    2.5.3 Feedback vs. Open-Loop/Feedback-Free Policies
    2.5.4 Stationarity and Stochasticity of Policies
      2.5.4.1 Stationary vs. Non-Stationary Policies
      2.5.4.2 Deterministic Policy
      2.5.4.3 Stochastic Policy
  2.6 Behavior Descriptors
    2.6.1 State-Action Distribution
    2.6.2 Trajectory Feature Expectation
    2.6.3 Trajectory Feature Distribution
  2.7 Information Theoretic Understanding of Feature Matching
    2.7.1 Information Theoretic Understanding of Imitation Learning Algorithms for Trajectory Learning
    2.7.2 Information Theoretic Understanding of Imitation Learning Algorithms in Action-State Space

3 Behavioral Cloning
  3.1 Problem Statement
  3.2 Design Choices for Behavioral Cloning
    3.2.1 Choice of Surrogate Loss Functions for Behavioral Cloning
      3.2.1.1 Quadratic Loss Function
      3.2.1.2 ℓ1-Loss Function
      3.2.1.3 Log Loss Function
      3.2.1.4 Hinge Loss Function
      3.2.1.5 Kullback-Leibler Divergence
    3.2.2 Choice of Regression Methods for Behavioral Cloning
  3.3 Model-Free and Model-Based Behavioral Cloning Methods
  3.4 Model-Free Behavioral Cloning Methods in Action-State Space
    3.4.1 Model-Free Behavioral Cloning as Supervised Learning
    3.4.2 Imitation as Supervised Learning with Neural Networks
      3.4.2.1 Recent Successes of Imitation Learning with Neural Networks
      3.4.2.2 Learning with Recurrent Neural Networks
    3.4.3 Teacher-Student Interaction during Behavioral Cloning
      3.4.3.1 Reduction of Structured Prediction to Iterative Learning of Simple Classification
      3.4.3.2 Confidence-Based Approach
      3.4.3.3 Data Aggregation Approach: DAgger
  3.5 Model-Free Behavioral Cloning for Learning Trajectories
    3.5.1 Trajectory Representation
      3.5.1.1 Keyframe/Via-Point Based Approaches
      3.5.1.2 Representation with Hidden Markov Models
      3.5.1.3 Dynamic Movement Primitives
      3.5.1.4 Probabilistic Movement Primitives
      3.5.1.5 Trajectory Representation with Time-Invariant Dynamical Systems
    3.5.2 Comparison of Trajectory Representations
    3.5.3 Generalization of Demonstrated Trajectories
    3.5.4 Information Theoretic Understanding of Model-Free BC
    3.5.5 Time Alignment of Multiple Demonstrations
    3.5.6 Learning Coupled Movements
      3.5.6.1 Learning Coupled Movements with DMPs
      3.5.6.2 Learning Coupled Movements with Gaussian Conditioning
      3.5.6.3 Learning Coupled Movements with Time-Invariant Dynamical Systems
    3.5.7 Incremental Trajectory Learning
    3.5.8 Combining Multiple Expert Policies
  3.6 Model-Free Behavioral Cloning for Task-Level Planning
    3.6.1 Segmentation and Clustering for Task-Level Planning
    3.6.2 Learning a Sequence of Primitive Motions
  3.7 Model-Based Behavioral Cloning Methods
    3.7.1 Model-Based Behavioral Cloning Methods with Forward Dynamics Models
      3.7.1.1 Imitation with a Gaussian Mixture Forward Model
      3.7.1.2 Imitation with a Gaussian Process Forward Model
    3.7.2 Imitation Learning through Iterative Learning Control
    3.7.3 Information Theoretic Understandings of Model-Based Behavioral Cloning Methods
  3.8 Robot Applications with Model-Free BC Methods
    3.8.1 Learning to Hit a Ball with DMP
    3.8.2 Learning Hand-Over Tasks with ProMPs
    3.8.3 Learning to Tie a Knot by Modeling the Trajectory Distribution with Gaussian Processes
  3.9 Robot Applications with Model-Based BC Methods
    3.9.1 Learning Acrobatic Helicopter Flights
    3.9.2 Learning to Hit a Ball with an Underactuated Robot
    3.9.3 Learning to Control with DAgger

4 Inverse Reinforcement Learning
  4.1 Problem Statement
  4.2 Model-Based and Model-Free IRL Methods
  4.3 Design Choices for Inverse Reinforcement Learning Methods
  4.4 Model-Based Inverse Reinforcement Learning Methods
    4.4.1 Feature Expectation Matching
    4.4.2 Maximum Margin Planning
    4.4.3 Inverse Reinforcement Learning Based on the Maximum Entropy Principle
      4.4.3.1 Maximum Entropy Inverse Reinforcement Learning
      4.4.3.2 Maximum Causal Entropy Inverse Reinforcement Learning
      4.4.3.3 IRL from Failed Demonstrations
      4.4.3.4 Connection of Maximum Entropy Methods to Economics
    4.4.4 Miscellaneous Important Model-Based IRL Methods
      4.4.4.1 Linearly-Solvable MDPs
      4.4.4.2 IRL Methods Based on a Bayesian Framework
    4.4.5 Learning Nonlinear Reward Functions
      4.4.5.1 Boosting Methods
      4.4.5.2 Deep Network Methods
      4.4.5.3 Gaussian Process IRL
    4.4.6 Guided Cost Learning
  4.5 Model-Free Inverse Reinforcement Learning Methods
    4.5.1 Relative Entropy Inverse Reinforcement Learning
    4.5.2 Generative Adversarial Imitation Learning
  4.6 Interpretation of IRL with the Maximum Entropy Principle
  4.7 Inverse Reinforcement Learning under Partial Observability
    4.7.1 IRL from Partially Observable Demonstrations
    4.7.2 IRL with Incomplete Expert Observations
    4.7.3 Active Inverse Reinforcement Learning as a POMDP
    4.7.4 Cooperative Inverse Reinforcement Learning
  4.8 Robot Applications with IRL Methods
    4.8.1 Learning to Drive a Car in a Simulator
    4.8.2 Learning Path Planning with MMP
    4.8.3 Learning Motion Planning with Deep Guided-Cost Learning
    4.8.4 Learning a Ball-in-a-Cup Task with Relative Entropy Inverse Reinforcement Learning

5 Challenges in Imitation Learning for Robotics
  5.1 Behavioral Cloning vs. Inverse Reinforcement Learning
  5.2 Open Questions in Imitation Learning
    5.2.1 Problems Related to Demonstrated Data
    5.2.2 Open Questions Related to Design Choices
    5.2.3 Problems Related to Algorithms
    5.2.4 Performance Evaluation

Acknowledgements

References

Abstract

As robots and other intelligent agents move from simple environments and problems to more complex, unstructured settings, manually programming their behavior has become increasingly challenging and expensive. Often, it is easier for a teacher to demonstrate a desired behavior than to manually engineer it. This process of learning from demonstrations, and the study of algorithms to do so, is called imitation learning. This work provides an introduction to imitation learning. It covers the underlying assumptions, approaches, and how they relate; the rich set of algorithms developed to tackle the problem; and advice on effective tools and implementation.

We intend this paper to serve two audiences. First, we want to familiarize machine learning experts with the challenges of imitation learning, particularly those arising in robotics, and the interesting theoretical and practical distinctions between it and more familiar frameworks such as statistical supervised learning theory and reinforcement learning. Second, we want to give roboticists and experts in applied artificial intelligence a broader appreciation of the frameworks and tools available for imitation learning.

We organize our work by dividing imitation learning into directly replicating desired behavior (sometimes called behavioral cloning [Bain and Sammut, 1996]) and learning the hidden objectives of the desired behavior from demonstrations (called inverse optimal control [Kalman, 1964] or inverse reinforcement learning [Russell, 1998]). In addition to analyzing methods, we discuss the design decisions a practitioner must make when selecting an imitation learning approach. Moreover, application examples, such as robots that play table tennis [Kober and Peters, 2009] and programs that play the game of Go [Silver et al., 2016], illustrate the properties and motivations behind different forms of imitation learning.
