Direct Loss Minimization Inverse Optimal Control

Andreas Doerr∗, Nathan Ratliff∗†, Jeannette Bohg†, Marc Toussaint∗ and Stefan Schaal†‡
∗Machine Learning & Robotics Lab, University of Stuttgart, Germany
†Max Planck Institute, Autonomous Motion Department, Tübingen, Germany
‡University of Southern California, Computational Learning and Motor Control Lab, Los Angeles, CA, USA

Abstract—Inverse Optimal Control (IOC) has strongly impacted the systems engineering process, enabling automated planner tuning through straightforward and intuitive demonstration. The most successful and established applications, though, have been in lower dimensional problems such as navigation planning where exact optimal planning or control is feasible. In higher dimensional systems, such as humanoid robots, research has made substantial progress toward generalizing the ideas to model-free or locally optimal settings, but these systems are complicated to the point where demonstration itself can be difficult. Typically, real-world applications are restricted to at best noisy or even partial or incomplete demonstrations that prove cumbersome in existing frameworks. This work derives a very flexible method of IOC based on a form of Structured Prediction known as Direct Loss Minimization. The resulting algorithm is essentially Policy Search on a reward function that rewards similarity to demonstrated behavior (using Covariance Matrix Adaptation (CMA) in our experiments). Our framework blurs the distinction between IOC, other forms of Imitation Learning, and Reinforcement Learning, enabling us to derive simple, versatile, and practical algorithms that blend imitation and reinforcement signals into a unified framework. Our experiments analyze various aspects of its performance and demonstrate its efficacy on conveying preferences for motion shaping and combined reach and grasp quality optimization.

Fig. 1: Learning from demonstrated behavior. A humanoid robot is used as a running example in the experimental section of this work for learning of motion policies.

I. INTRODUCTION

Implementing versatile and generalizable robot behavior can take hundreds if not thousands of person-hours. Typical systems integrate sensor processing, state estimation, multiple layers of planning, low-level control, and even reactive behaviors to induce successful and generalizable actions. With that complexity comes huge parameter spaces that take experts typically days of tuning to get right. Still, performance may be suboptimal, and any change to the system requires additional calibration. Much of how experts tweak systems remains an art, but recent machine learning advances have led to some very powerful learning from demonstration tools that have significantly simplified the process [1, 33].

One method of choice for learning from demonstration, especially in practical high-performance applications, is Inverse Optimal Control (IOC) [18, 34, 3, 13]. Since most planning-based systems are designed to transform sensor readings into cost functions interpretable by planners, they already produce generalizable behavior by design. Tapping into this architecture automates the time-consuming tweaking process needed to make this mapping from features to costs reliable. Where applicable, IOC has become an invaluable tool for real-world applications [25]. IOC has been very successful in lower dimensional problems, but interestingly has struggled to make the leap to higher dimensional systems. Two factors contribute to making high-dimensional systems hard.

First, the Optimal Control problem itself is intractable in most high-dimensional, especially continuous, domains (e.g. as found in humanoids). Recent advances in motion optimization, though, have made significant progress on that problem. Algorithms like CHOMP [21], STOMP [10], iTOMP [4], TrajOpt [24], KOMO [30], and RIEMO [22] have incrementally improved motion optimization to the point where it is now often a central tool for high-dimensional motion generation.

The second issue, though, is more fundamental: it is very difficult to provide accurate, full policy demonstrations for high-dimensional systems. This problem is often overlooked or ignored in existing approaches (cf. Section III).

This work narrows the gap between Imitation Learning (specifically, Inverse Optimal Control) and Reinforcement Learning (Policy Search) by tracing the development of IOC through its connections to Structured Prediction and relating further advances in Structured Prediction back to the policy learning problem. What results is a unified algorithm that blends naturally between Inverse Optimal Control and Policy Search Reinforcement Learning, enabling both learning from noisy partial demonstrations and optimization of high-level reward functions to join in tuning a high-performance optimal planner. The fundamental connection we draw, which Section II discusses in detail, is that if we apply an advanced form of Structured Prediction, known as Direct Loss Minimization [15], to the IOC problem, what results is effectively a Policy Search algorithm that optimizes a reward promoting similarity to expert demonstrations. This connection of Policy Search RL through Direct Loss Minimization to IOC suggests that the problem's difficulty results mostly from the shape of the reward landscape itself, and less from the problem formulation. Reward functions that reward similarity to expert demonstrations are naturally more discriminating than high-level success-oriented rewards. This work closes the gap between Reinforcement Learning and Imitation Learning by straightforwardly posing them both as blackbox optimization problems, letting the reward/loss functions distinguish between whether the problem is Inverse Optimal Control, Reinforcement Learning, or some combination of the two. Section V presents applications of this hybrid learning methodology using both noisy complete demonstrations and partial information to address learning motion styles.

Our high-level goal is to create tools to help engineers navigate the complicated high-dimensional parameter spaces that arise in sophisticated real-world systems.
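To make this blackbox framing concrete, the sketch below hands a blended objective to an off-the-shelf CMA-ES optimizer through the pycma package. It is an illustration under stated assumptions rather than the authors' implementation: the problem-specific pieces (a planner producing a trajectory from cost weights w and context γ, an imitation loss comparing a plan to a demonstration, and an optional high-level task reward) are hypothetical callables supplied by the caller as `plan`, `imitation_loss`, and `task_reward`.

```python
# Minimal sketch (not the paper's implementation): IOC and RL posed as one
# blackbox optimization over planner cost weights w, solved with CMA-ES via
# the pycma package. The problem-specific callables (plan, imitation_loss,
# task_reward) are hypothetical placeholders supplied by the caller.
import numpy as np
import cma


def blended_objective(w, demos, contexts, plan, imitation_loss,
                      task_reward=None, alpha=1.0, beta=0.0):
    """Cost to minimize: alpha * sum_i L(xi_i, xi(w, gamma_i))
                         - beta  * sum_i reward(xi(w, gamma_i), gamma_i).

    alpha > 0, beta = 0: pure imitation (reward similarity to demonstrations).
    alpha = 0, beta > 0: pure policy search on a high-level reward.
    Both positive:       a blend of imitation and reinforcement signals.
    """
    total = 0.0
    for xi_demo, gamma in zip(demos, contexts):
        xi = plan(w, gamma)                  # trajectory produced under cost weights w
        total += alpha * imitation_loss(xi_demo, xi)
        if task_reward is not None:
            total -= beta * task_reward(xi, gamma)
    return total


def optimize_weights(w0, demos, contexts, plan, imitation_loss,
                     task_reward=None, alpha=1.0, beta=0.0, sigma0=0.5):
    """Run CMA-ES on the blended blackbox objective; return the best weights found."""
    es = cma.CMAEvolutionStrategy(np.asarray(w0, dtype=float), sigma0)
    while not es.stop():
        candidates = es.ask()
        es.tell(candidates, [blended_objective(np.asarray(c), demos, contexts,
                                               plan, imitation_loss, task_reward,
                                               alpha, beta)
                             for c in candidates])
    return es.result.xbest
```

In this framing, the choice of loss alone determines whether the optimizer performs Inverse Optimal Control, Reinforcement Learning, or a combination of the two.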
II. METHODOLOGY

Inverse Optimal Control is strongly related to Structured Prediction. Maximum Margin Planning [18], for instance, reduces the problem directly to Maximum Margin Structured Classification [28, 31], and Maximum Entropy IOC [34] develops a Bayesian framework strongly related to Conditional Random Fields [12]. These two bodies of literature have been driving significant algorithmic advances from both sides [20, 17, 32]. In many ways, we can view IOC explicitly as a form of Structured Prediction where the set of all policies is the structured set of labels and the underlying learning problem is to predict the correct policy given expert demonstrations. Advances in Structured Prediction usually lead to advances in IOC.

Structured Prediction has gone through a number of incarnations, but one prominent formalization of the problem is Maximum Margin Structured Classification (MMSC) [28], a generalization of the Support Vector Machine (SVM). Just as the hinge loss is a simple piecewise linear upper bound to the 0-1 binary loss function in an SVM, the generalized "hinge-loss" for MMSC is also a piecewise linear upper bound to the structured loss function of the Structured Prediction problem. The resulting convex proxy objective is formulated in terms of a weight vector $w \in \mathbb{R}^d$ for $d$ features $f(\xi_i, \xi) = f_i(\xi)$ and a regularization constant $\lambda \in \mathbb{R}_+$. A simple subgradient method for optimizing this objective suggests an update rule of the form

$$w_{t+1} = w_t - \eta_t \sum_i \big( f_i(\xi_i) - f_i(\xi_i^*) \big), \tag{1}$$

where $\xi_i^* = \arg\min_{\xi \in \Xi} \big( w^\top f_i(\xi) - L_i(\xi) \big)$ (see [28] and [18] for details). This proxy objective and update rule have also been used in graphics (see [14]) under a margin-less "Perceptron" form [5], where the authors learned directly from motion capture data. In all cases (Structured Prediction, IOC, and graphics), formulating the convex proxy and the resulting subgradient update rule was critical since it was unclear how to efficiently optimize the loss function directly.
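The displayed form of the MMSC proxy objective does not appear in the excerpt above, so the following LaTeX block sketches a standard Maximum Margin Planning style form. It is an assumption consistent with the definitions of $w$, $f_i$, $L_i$, and $\lambda$ given here, and with update (1) as a subgradient step on its data term; see [28, 18] for the exact formulation.

```latex
% Reconstructed (assumed) MMSC / Maximum Margin Planning proxy objective,
% consistent with the definitions of w, f_i, L_i, and lambda in the text:
\begin{equation*}
  R(w) \;=\; \sum_i \Big( w^\top f_i(\xi_i)
      \;-\; \min_{\xi \in \Xi} \big( w^\top f_i(\xi) - L_i(\xi) \big) \Big)
      \;+\; \frac{\lambda}{2}\, \lVert w \rVert^2 .
\end{equation*}
% A subgradient of the summed data term is sum_i ( f_i(xi_i) - f_i(xi_i^*) ),
% so a step of size eta_t on that term recovers update (1).
```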
But in 2010, [15] demonstrated that an algorithm very similar in nature to a subgradient approach to MMSC, but which directly estimated the gradient of the underlying structured loss function, not only worked, but often performed better than approaches leveraging this convex MMSC objective. The authors demonstrated empirically and theoretically that the shape of $L(\xi_i, \xi)$ itself was sufficiently structured to admit direct optimization. Parameterizing $\xi$ by a weight vector $w$ and problem context $\gamma_i$, they showed that they could directly optimize $\mathcal{L}(w) = \sum_i L(\xi_i, \xi(w, \gamma_i))$. Interestingly, the update rule that approximates that gradient bears a striking resemblance to that given in Equation 1:

$$w_{t+1} = w_t + \eta_t \sum_i \big( f_i(\xi_i) - f_i(\xi^*_{i,\mathrm{direct}}) \big), \tag{2}$$

where $\xi^*_{i,\mathrm{direct}} = \arg\min_{\xi \in \Xi} \big( w^\top f_i(\xi) + L_i(\xi) \big)$. In the parlance of policy learning, this update rule compares the features seen by the example policy $\xi_i$ to those seen by a policy coaxed downhill slightly by increasing the cost of high-loss states.

Turning back to the policy learning problem, these observations suggest that a strong form of IOC would be to directly optimize this loss function $\mathcal{L}(w)$. The above direct loss gradient estimate gives one update rule that is particularly applicable to general Structured Prediction, but on the policy side, this sort of direct optimization problem has been studied for a while and there…
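To make the contrast between updates (1) and (2) concrete in code, the following minimal sketch implements both feature-difference steps. The loss-augmented planner `loss_augmented_plan(w, i, sign)`, which returns $\arg\min_{\xi \in \Xi} \big( w^\top f_i(\xi) + \mathrm{sign} \cdot L_i(\xi) \big)$, and the feature accessors are hypothetical callables supplied by the caller.

```python
# Sketch of the MMSC subgradient step (Eq. 1) and the direct-loss step (Eq. 2).
# Assumptions: loss_augmented_plan(w, i, sign) is a hypothetical planner that
# returns argmin_xi ( w^T f_i(xi) + sign * L_i(xi) ); features_demo[i] holds
# f_i(xi_i) for the i-th demonstration; features_of(i, xi) evaluates f_i(xi).
import numpy as np


def mmsc_subgradient_step(w, eta, features_demo, features_of, loss_augmented_plan):
    """Eq. (1): w <- w - eta * sum_i (f_i(xi_i) - f_i(xi_i^*)),
    with xi_i^* = argmin_xi (w^T f_i(xi) - L_i(xi)), i.e. the loss is subtracted."""
    grad = np.zeros_like(w)
    for i, f_demo in enumerate(features_demo):
        xi_star = loss_augmented_plan(w, i, sign=-1.0)
        grad += f_demo - features_of(i, xi_star)
    return w - eta * grad


def direct_loss_step(w, eta, features_demo, features_of, loss_augmented_plan):
    """Eq. (2): w <- w + eta * sum_i (f_i(xi_i) - f_i(xi_{i,direct}^*)),
    with xi_{i,direct}^* = argmin_xi (w^T f_i(xi) + L_i(xi)): the planner is
    coaxed downhill by making high-loss states more expensive."""
    step = np.zeros_like(w)
    for i, f_demo in enumerate(features_demo):
        xi_direct = loss_augmented_plan(w, i, sign=+1.0)
        step += f_demo - features_of(i, xi_direct)
    return w + eta * step
```

The two updates differ only in the sign with which the loss enters the planner's objective and the sign of the step; the comparison of demonstrated features against the features of the loss-perturbed plan is shared.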
