An Algorithmic Perspective on Imitation Learning
Foundations and Trends® in Robotics
Vol. 7, No. 1-2 (2018) 1–179
© 2018 T. Osa, J. Pajarinen, G. Neumann, J. A. Bagnell, P. Abbeel and J. Peters
DOI: 10.1561/2300000053

Takayuki Osa, University of Tokyo
Joni Pajarinen, Technische Universität Darmstadt
Gerhard Neumann, University of Lincoln
J. Andrew Bagnell, Carnegie Mellon University
Pieter Abbeel, University of California, Berkeley
Jan Peters, Technische Universität Darmstadt

Contents

1 Introduction
  1.1 Key Successes in Imitation Learning
  1.2 Imitation Learning from the Point of View of Robotics
  1.3 Differences between Imitation Learning and Supervised Learning
  1.4 Insights for Machine Learning and Robotics Research
  1.5 Statistical Machine Learning Background
    1.5.1 Notation and Mathematical Formalization
    1.5.2 Markov Property
    1.5.3 Markov Decision Process
    1.5.4 Entropy
    1.5.5 Kullback-Leibler (KL) Divergence
    1.5.6 Information and Moment Projections
    1.5.7 The Maximum Entropy Principle
    1.5.8 Background: Reinforcement Learning
  1.6 Formulation of the Imitation Learning Problem

2 Design of Imitation Learning Algorithms
  2.1 Design Choices for Imitation Learning Algorithms
  2.2 Behavioral Cloning and Inverse Reinforcement Learning
  2.3 Model-Free and Model-Based Imitation Learning Methods
  2.4 Observability
    2.4.1 Trajectories in Fully Observable Settings
    2.4.2 Trajectories in Partially Observable Settings
    2.4.3 Differences in Observability between the Expert and the Learner
  2.5 Policy Representation in Imitation Learning
    2.5.1 Levels of Policy Abstraction
    2.5.2 Hierarchical vs Monolithic Policies
    2.5.3 Feedback vs Open-Loop/Feedback-Free Policies
    2.5.4 Stationarity and Stochasticity of Policies
      2.5.4.1 Stationary vs. Non-Stationary Policies
      2.5.4.2 Deterministic Policy
      2.5.4.3 Stochastic Policy
  2.6 Behavior Descriptors
    2.6.1 State-Action Distribution
    2.6.2 Trajectory Feature Expectation
    2.6.3 Trajectory Feature Distribution
  2.7 Information Theoretic Understanding of Feature Matching
    2.7.1 Information Theoretic Understanding of Imitation Learning Algorithms for Trajectory Learning
    2.7.2 Information Theoretic Understanding of Imitation Learning Algorithms in Action-State Space

3 Behavioral Cloning
  3.1 Problem Statement
  3.2 Design Choices for Behavioral Cloning
    3.2.1 Choice of Surrogate Loss Functions for Behavioral Cloning
      3.2.1.1 Quadratic Loss Function
      3.2.1.2 ℓ1-Loss Function
      3.2.1.3 Log Loss Function
      3.2.1.4 Hinge Loss Function
      3.2.1.5 Kullback-Leibler Divergence
    3.2.2 Choice of Regression Methods for Behavioral Cloning
  3.3 Model-Free and Model-Based Behavioral Cloning Methods
  3.4 Model-Free Behavioral Cloning Methods in Action-State Space
    3.4.1 Model-Free Behavioral Cloning as Supervised Learning
    3.4.2 Imitation as Supervised Learning with Neural Networks
      3.4.2.1 Recent Successes of Imitation Learning with Neural Networks
      3.4.2.2 Learning with Recurrent Neural Networks
    3.4.3 Teacher-Student Interaction during Behavioral Cloning
      3.4.3.1 Reduction of Structured Prediction to Iterative Learning of Simple Classification
      3.4.3.2 Confidence-Based Approach
      3.4.3.3 Data Aggregation Approach: DAgger
  3.5 Model-Free Behavioral Cloning for Learning Trajectories
    3.5.1 Trajectory Representation
      3.5.1.1 Keyframe/Via-Point Based Approaches
      3.5.1.2 Representation with Hidden Markov Models
      3.5.1.3 Dynamic Movement Primitives
      3.5.1.4 Probabilistic Movement Primitives
      3.5.1.5 Trajectory Representation with Time-Invariant Dynamical Systems
    3.5.2 Comparison of Trajectory Representations
    3.5.3 Generalization of Demonstrated Trajectories
    3.5.4 Information Theoretic Understanding of Model-Free BC
    3.5.5 Time Alignment of Multiple Demonstrations
    3.5.6 Learning Coupled Movements
      3.5.6.1 Learning Coupled Movements with DMPs
      3.5.6.2 Learning Coupled Movements with Gaussian Conditioning
      3.5.6.3 Learning Coupled Movements with Time-Invariant Dynamical Systems
    3.5.7 Incremental Trajectory Learning
    3.5.8 Combining Multiple Expert Policies
  3.6 Model-Free Behavioral Cloning for Task-Level Planning
    3.6.1 Segmentation and Clustering for Task-Level Planning
    3.6.2 Learning a Sequence of Primitive Motions
  3.7 Model-Based Behavioral Cloning Methods
    3.7.1 Model-Based Behavioral Cloning Methods with Forward Dynamics Models
      3.7.1.1 Imitation with a Gaussian Mixture Forward Model
      3.7.1.2 Imitation with a Gaussian Process Forward Model
    3.7.2 Imitation Learning through Iterative Learning Control
    3.7.3 Information Theoretic Understandings of Model-Based Behavioral Cloning Methods
  3.8 Robot Applications with Model-Free BC Methods
    3.8.1 Learning to Hit a Ball with DMP
    3.8.2 Learning Hand-Over Tasks with ProMPs
    3.8.3 Learning to Tie a Knot by Modeling the Trajectory Distribution with Gaussian Processes
  3.9 Robot Applications with Model-Based BC Methods
    3.9.1 Learning Acrobatic Helicopter Flights
    3.9.2 Learning to Hit a Ball with an Underactuated Robot
    3.9.3 Learning to Control with DAgger

4 Inverse Reinforcement Learning
  4.1 Problem Statement
  4.2 Model-Based and Model-Free IRL Methods
  4.3 Design Choices for Inverse Reinforcement Learning Methods
  4.4 Model-Based Inverse Reinforcement Learning Methods
    4.4.1 Feature Expectation Matching
    4.4.2 Maximum Margin Planning
    4.4.3 Inverse Reinforcement Learning Based on the Maximum Entropy Principle
      4.4.3.1 Maximum Entropy Inverse Reinforcement Learning
      4.4.3.2 Maximum Causal Entropy Inverse Reinforcement Learning
      4.4.3.3 IRL from Failed Demonstrations
      4.4.3.4 Connection of Maximum Entropy Methods to Economics
    4.4.4 Miscellaneous Important Model-Based IRL Methods
      4.4.4.1 Linearly-Solvable MDPs
      4.4.4.2 IRL Methods Based on a Bayesian Framework
    4.4.5 Learning Nonlinear Reward Functions
      4.4.5.1 Boosting Methods
      4.4.5.2 Deep Network Methods
      4.4.5.3 Gaussian Process IRL
    4.4.6 Guided Cost Learning
  4.5 Model-Free Inverse Reinforcement Learning Methods
    4.5.1 Relative Entropy Inverse Reinforcement Learning
    4.5.2 Generative Adversarial Imitation Learning
  4.6 Interpretation of IRL with the Maximum Entropy Principle
  4.7 Inverse Reinforcement Learning under Partial Observability
    4.7.1 IRL from Partially Observable Demonstrations
    4.7.2 IRL with Incomplete Expert Observations
    4.7.3 Active Inverse Reinforcement Learning as a POMDP
    4.7.4 Cooperative Inverse Reinforcement Learning
  4.8 Robot Applications with IRL Methods
    4.8.1 Learning to Drive a Car in a Simulator
    4.8.2 Learning Path Planning with MMP
    4.8.3 Learning Motion Planning with Deep Guided-Cost Learning
    4.8.4 Learning a Ball-in-a-Cup Task with Relative Entropy Inverse Reinforcement Learning

5 Challenges in Imitation Learning for Robotics
  5.1 Behavioral Cloning vs Inverse Reinforcement Learning
  5.2 Open Questions in Imitation Learning
    5.2.1 Problems Related to Demonstrated Data
    5.2.2 Open Questions Related to Design Choices
    5.2.3 Problems Related to Algorithms
    5.2.4 Performance Evaluation

Acknowledgements

References

Abstract

As robots and other intelligent agents move from simple environments and problems to more complex, unstructured settings, manually programming their behavior has become increasingly challenging and expensive. Often, it is easier for a teacher to demonstrate a desired behavior than to engineer it by hand. This process of learning from demonstrations, and the study of algorithms to do so, is called imitation learning. This work provides an introduction to imitation learning. It covers the underlying assumptions, the main approaches and how they relate; the rich set of algorithms developed to tackle the problem; and advice on effective tools and implementation.

We intend this paper to serve two audiences. First, we want to familiarize machine learning experts with the challenges of imitation learning, particularly those arising in robotics, and the interesting theoretical and practical distinctions between it and more familiar frameworks such as statistical supervised learning theory and reinforcement learning. Second, we want to give roboticists and experts in applied artificial intelligence a broader appreciation for the frameworks and tools available for imitation learning.

We organize our work by dividing imitation learning into directly replicating desired behavior (sometimes called behavioral cloning [Bain and Sammut, 1996]) and learning the hidden objectives of the desired behavior from demonstrations (called inverse optimal control [Kalman, 1964] or inverse reinforcement learning [Russell, 1998]). In addition to analyzing these methods, we discuss the design decisions a practitioner must make when selecting an imitation learning approach. Moreover, application examples, such as robots that play table tennis [Kober and Peters, 2009] and programs that play the game of Go [Silver et al., 2016], illustrate the properties and motivations behind the different forms of imitation learning.