Optimal Control with Learned Local Models: Application to Dexterous Manipulation

Vikash Kumar1, Emanuel Todorov1, and Sergey Levine2


Fig. 1: Learned hand manipulation behavior involving clockwise rotation of the object

1 Computer Science & Engineering, University of Washington; 2 Electrical Engineering & Computer Science, University of California, Berkeley. This work was supported by the US National Science Foundation & MAST Program, Army Research Office.

Abstract— We describe a method for learning dexterous manipulation skills with a pneumatically-actuated tendon-driven 24-DoF hand. The method combines iteratively refitted time-varying linear models with trajectory optimization, and can be seen as an instance of model-based reinforcement learning or as adaptive optimal control. Its appeal lies in the ability to handle challenging problems with surprisingly little data. We show that we can achieve sample-efficient learning of tasks that involve intermittent contact dynamics and under-actuation. Furthermore, we can control the hand directly at the level of the pneumatic valves, without the use of a prior model that describes the relationship between valve commands and joint torques. We compare results from learning in simulation and on the physical system. Even though the learned policies are local, they are able to control the system in the face of substantial variability in initial state.

I. INTRODUCTION

Dexterous manipulation is among the most challenging control problems in robotics, and remains largely unsolved. This is due to a combination of factors including high dimensionality, intermittent contact dynamics, and under-actuation in the case of dynamic object manipulation. Here we describe our efforts to tackle this problem in a principled way. We do not rely on manually designed controllers. Instead we synthesize controllers automatically, by optimizing high-level cost functions. The resulting controllers are able to manipulate freely-moving objects; see Fig. 1. While the present results are limited to learning local models and control policies, the performance we obtain, together with the small amount of data we need (around 60 trials), indicates that the approach can be extended with more global learning methods such as [1].

We use our Adroit platform [2], which is a ShadowHand skeleton augmented with high-performance pneumatic actuators. This system has a 100-dimensional continuous state space, including the positions and velocities of 24 joints, the pressures in 40 pneumatic cylinders, and the position and velocity of the object being manipulated.

Pneumatics have non-negligible time constants (around 20 ms in our system), which is why the cylinder pressures represent additional state variables, making it difficult to apply torque-control techniques. The system also has a 40-dimensional continuous control space – namely the commands to the proportional valves regulating the flow of compressed air to the cylinders. The cylinders act on the joints through tendons. The tendons do not introduce additional state variables (since we avoid slack via pre-tensioning) but nevertheless complicate the dynamics. Overall this is a daunting system to model, let alone control.

Depending on one's preference of terminology, our method can be classified as model-based Reinforcement Learning (RL), or as adaptive optimal control [3]. While RL aims to solve the same general problem as optimal control, its uniqueness comes from the emphasis on model-free learning in stochastic domains [4]. The idea of learning policies without having models still dominates RL, and forms the basis of the most remarkable success stories, both old [5] and new [6]. However, RL with learned models has also been considered. Adaptive control, on the other hand, mostly focuses on learning the parameters of a model with predefined structure, essentially interleaving system identification with control [7].

Our approach here lies somewhere in between (to fix terminology, we call it RL in subsequent sections). We rely on a model, but that model does not have any informative predefined structure. Instead, it is a time-varying linear model learned from data, using a generic prior for regularization. Related ideas have been pursued previously [8], [1], [9]. Nevertheless, as with most approaches to automatic control and computational intelligence in general, the challenge is not only in formulating ideas but also in getting them to scale to hard problems – which is our main contribution here. In particular, we demonstrate scaling from a 14-dimensional state space in [9] to a 100-dimensional state space here. This is important in light of the curse of dimensionality. Indeed, RL has been successfully applied to a range of robotic tasks [10], [11], [12], [13]; however, dimensionality and sample complexity have presented major challenges [14], [15].

The manipulation skills we learn are represented as time-varying linear-Gaussian controllers. These controllers are fundamentally trajectory-centric, but otherwise are extremely flexible, since they can represent any trajectory with any linear stabilization strategy. Since the controllers are time-varying, the overall learned control law is nonlinear, but is locally linear at each time step. These types of controllers have been employed previously for controlling lower-dimensional robotic arms [16], [9].
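To make this representation concrete, the short sketch below (our illustration, not the authors' code) rolls out one trajectory under a time-varying linear-Gaussian controller; step_dynamics is a hypothetical stand-in for the robot or for the simulator described in Section III.

# Minimal sketch: executing p(u_t | x_t) = N(K_t x_t + k_t, C_t) for T steps.
# `step_dynamics` is a hypothetical callable returning the next state.
import numpy as np

def rollout(K, k, C, x0, step_dynamics, rng=None):
    """K: (T, dU, dX) gains, k: (T, dU) offsets, C: (T, dU, dU) covariances; x0: (dX,)."""
    rng = rng or np.random.default_rng()
    x, states, actions = x0, [], []
    for t in range(len(k)):
        u = rng.multivariate_normal(K[t] @ x + k[t], C[t])  # sample the Gaussian policy
        states.append(x)
        actions.append(u)
        x = step_dynamics(x, u)  # advance the robot or simulator by one time step
    return np.array(states), np.array(actions)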

II. REINFORCEMENT LEARNING WITH LOCAL LINEAR MODELS

In this section, we describe the reinforcement learning algorithm (summarized in Algorithm 1) that we use to control our pneumatically-driven five finger hand. The derivation in this section follows previous work [1], but we describe the algorithm here for completeness. The aim of the method is to learn a time-varying linear-Gaussian controller of the form p(u_t|x_t) = N(K_t x_t + k_t, C_t), where x_t and u_t are the state and action at time step t. The actions in our system correspond to the pneumatic valve's input voltage, while the state space is described in the preceding section. The aim of the algorithm is to minimize the expectation E_{p(τ)}[ℓ(τ)] over trajectories τ = {x_1, u_1, ..., x_T, u_T}, where ℓ(τ) = Σ_{t=1}^T ℓ(x_t, u_t) is the total cost, and the expectation is under p(τ) = p(x_1) Π_{t=1}^T p(x_{t+1}|x_t, u_t) p(u_t|x_t), where p(x_{t+1}|x_t, u_t) is the dynamics distribution.

Algorithm 1 RL with linear-Gaussian controllers
1: initialize p(u_t|x_t)
2: for iteration k = 1 to K do
3:   run p(u_t|x_t) to collect trajectory samples {τ_i}
4:   fit dynamics p(x_{t+1}|x_t, u_t) to {τ_i} using linear regression with GMM prior
5:   fit p = arg min_p E_{p(τ)}[ℓ(τ)] s.t. D_KL(p(τ) || p̂(τ)) ≤ ε
6: end for

A. Optimizing Linear-Gaussian Controllers

The simple structure of time-varying linear-Gaussian controllers admits a very efficient optimization procedure that works well even under unknown dynamics. The method is summarized in Algorithm 1. At each iteration, we run the current controller p(u_t|x_t) on the robot to gather N samples (N = 5 in all of our experiments), then use these samples to fit time-varying linear-Gaussian dynamics of the form p(x_{t+1}|x_t, u_t) = N(f_{x,t} x_t + f_{u,t} u_t + f_{c,t}, F_t). This is done by using linear regression with a Gaussian mixture model prior, which makes it feasible to fit the dynamics even when the number of samples is much lower than the dimensionality of the system [1]. We also compute a second order expansion of the cost function around each of the samples, and average the expansions together to obtain a local approximate cost function of the form

ℓ(x_t, u_t) ≈ (1/2) [x_t; u_t]^T ℓ_{xu,xu,t} [x_t; u_t] + [x_t; u_t]^T ℓ_{xu,t} + const,

where subscripts denote derivatives, e.g. ℓ_{xu,t} is the gradient of ℓ with respect to [x_t; u_t], while ℓ_{xu,xu,t} is the Hessian. The particular cost functions used in our experiments are described in the next section.
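As an illustration of the dynamics-fitting step (step 4 of Algorithm 1), the sketch below fits a separate linear-Gaussian model at every time step from N sampled rollouts. For brevity it substitutes plain ridge regularization for the Gaussian mixture model prior of [1], so it is a simplified stand-in rather than the authors' implementation.

# Fit p(x_{t+1} | x_t, u_t) = N(fx x_t + fu u_t + fc, F) per time step, by
# regularized least squares (ridge term `reg` replaces the GMM prior of [1]).
import numpy as np

def fit_linear_dynamics(X, U, reg=1e-6):
    """X: (N, T+1, dX) states, U: (N, T, dU) actions from N rollouts."""
    N, T, dU = U.shape
    dX = X.shape[2]
    fx = np.zeros((T, dX, dX)); fu = np.zeros((T, dX, dU))
    fc = np.zeros((T, dX));     F  = np.zeros((T, dX, dX))
    for t in range(T):
        Z = np.hstack([X[:, t], U[:, t], np.ones((N, 1))])   # regressors [x_t, u_t, 1]
        Y = X[:, t + 1]                                       # targets x_{t+1}
        W = np.linalg.solve(Z.T @ Z + reg * np.eye(Z.shape[1]), Z.T @ Y)
        fx[t], fu[t], fc[t] = W[:dX].T, W[dX:dX + dU].T, W[-1]
        R = Y - Z @ W                                         # residuals
        F[t] = R.T @ R / max(N - 1, 1)                        # noise covariance estimate
    return fx, fu, fc, F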

When the cost function is quadratic and the dynamics are linear-Gaussian, the optimal time-varying linear-Gaussian controller of the form p(u_t|x_t) = N(K_t x_t + k_t, C_t) can be obtained by using the LQR method. This type of iterative approach can be thought of as a variant of iterative LQR [17], where the dynamics are fitted to data. Under this model of the dynamics and cost function, the optimal policy can be computed by recursively computing the quadratic Q-function and value function, starting with the last time step. These functions are given by

V(x_t) = (1/2) x_t^T V_{x,x,t} x_t + x_t^T V_{x,t} + const
Q(x_t, u_t) = (1/2) [x_t; u_t]^T Q_{xu,xu,t} [x_t; u_t] + [x_t; u_t]^T Q_{xu,t} + const.

We can express them with the following recurrence:

Q_{xu,xu,t} = ℓ_{xu,xu,t} + f_{xu,t}^T V_{x,x,t+1} f_{xu,t}
Q_{xu,t} = ℓ_{xu,t} + f_{xu,t}^T V_{x,t+1}
V_{x,x,t} = Q_{x,x,t} − Q_{u,x,t}^T Q_{u,u,t}^{−1} Q_{u,x,t}
V_{x,t} = Q_{x,t} − Q_{u,x,t}^T Q_{u,u,t}^{−1} Q_{u,t},

which allows us to compute the optimal control law as g(x_t) = û_t + k_t + K_t (x_t − x̂_t), where K_t = −Q_{u,u,t}^{−1} Q_{u,x,t} and k_t = −Q_{u,u,t}^{−1} Q_{u,t}. If we consider p(τ) to be the trajectory distribution formed by the deterministic control law g(x_t) and the stochastic dynamics p(x_{t+1}|x_t, u_t), LQR can be shown to optimize the standard objective

min_{g(x_t)} Σ_{t=1}^T E_{p(x_t,u_t)}[ℓ(x_t, u_t)].     (1)

However, we can also form a time-varying linear-Gaussian controller p(u_t|x_t), and optimize the following objective:

min_{p(u_t|x_t)} Σ_{t=1}^T E_{p(x_t,u_t)}[ℓ(x_t, u_t)] − H(p(u_t|x_t)).

As shown in previous work [18], this objective is in fact optimized by setting p(u_t|x_t) = N(K_t x_t + k_t, C_t), where C_t = Q_{u,u,t}^{−1}. While we ultimately aim to minimize the standard controller objective in Equation (1), this maximum entropy formulation will be a useful intermediate step for a practical learning algorithm trained with fitted time-varying linear dynamics.
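For concreteness, this backward pass can be written as the following sketch (ours, not the authors' code), assuming the fitted f_{x,t}, f_{u,t} and the cost derivatives ℓ_{xu,xu,t}, ℓ_{xu,t} are available for every time step; it also returns the maximum entropy covariance C_t = Q_{u,u,t}^{−1} derived above.

import numpy as np

def lqr_backward(fx, fu, l_xuxu, l_xu):
    """fx: (T, dX, dX), fu: (T, dX, dU), l_xuxu: (T, dX+dU, dX+dU), l_xu: (T, dX+dU)."""
    T, dX, dU = fu.shape[0], fx.shape[1], fu.shape[2]
    K = np.zeros((T, dU, dX)); k = np.zeros((T, dU)); C = np.zeros((T, dU, dU))
    Vxx = np.zeros((dX, dX)); Vx = np.zeros(dX)       # value function beyond the horizon
    for t in reversed(range(T)):
        fxu = np.hstack([fx[t], fu[t]])               # d x_{t+1} / d [x_t; u_t]
        Qxuxu = l_xuxu[t] + fxu.T @ Vxx @ fxu         # Q_{xu,xu,t}
        Qxu = l_xu[t] + fxu.T @ Vx                    # Q_{xu,t}
        Qxx, Qux, Quu = Qxuxu[:dX, :dX], Qxuxu[dX:, :dX], Qxuxu[dX:, dX:]
        Qx, Qu = Qxu[:dX], Qxu[dX:]
        Quu_inv = np.linalg.inv(Quu)
        K[t] = -Quu_inv @ Qux                         # K_t = -Q_{u,u,t}^{-1} Q_{u,x,t}
        k[t] = -Quu_inv @ Qu                          # k_t = -Q_{u,u,t}^{-1} Q_{u,t}
        C[t] = Quu_inv                                # maximum-entropy covariance
        Vxx = Qxx - Qux.T @ Quu_inv @ Qux             # value function recursion
        Vx = Qx - Qux.T @ Quu_inv @ Qu
    return K, k, C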

B. KL-Constrained Optimization

In order for this learning method to produce good results, it is important to bound the change in the controller p(u_t|x_t) at each iteration. The standard iterative LQR method can change the controller drastically at each iteration, which can cause it to visit parts of the state space where the fitted dynamics are arbitrarily incorrect, leading to divergence. Furthermore, due to the non-deterministic nature of real world domains, line search based methods can get misguided, leading to unreliable progress. To address these issues, we solve the following optimization problem at each iteration:

min_{p(u_t|x_t)} E_{p(τ)}[ℓ(τ)] s.t. D_KL(p(τ) || p̂(τ)) ≤ ε,

where p̂(τ) is the trajectory distribution induced by the previous controller. Using KL-divergence constraints for controller optimization has been proposed in a number of prior works [19], [20], [21]. In the case of linear-Gaussian controllers, a simple modification to the LQR algorithm described above can be used to solve this constrained problem. Recall that the trajectory distributions are given by p(τ) = p(x_1) Π_{t=1}^T p(x_{t+1}|x_t, u_t) p(u_t|x_t). Since the dynamics of the new and old trajectory distributions are assumed to be the same, the KL-divergence is given by

D_KL(p(τ) || p̂(τ)) = Σ_{t=1}^T E_{p(x_t,u_t)}[− log p̂(u_t|x_t)] − H(p),

and the Lagrangian of the constrained optimization problem is given by

L_traj(p, η) = E_p[ℓ(τ)] + η [D_KL(p(τ) || p̂(τ)) − ε]
             = [ Σ_t E_{p(x_t,u_t)}[ℓ(x_t, u_t) − η log p̂(u_t|x_t)] ] − η H(p(τ)) − η ε.

The constrained optimization can be solved with dual gradient descent [22], where we alternate between minimizing the Lagrangian with respect to the primal variables, which are the parameters of p, and taking a subgradient step on the dual variable η. The optimization with respect to p can be performed efficiently using the LQG algorithm, by observing that the Lagrangian is simply the expectation of a quantity that does not depend on p and an entropy term. As described above, LQR can be used to solve maximum entropy control problems where the objective consists of a term that does not depend on p, and another term that encourages high entropy. We can convert the Lagrangian primal minimization into a problem of the form

min_{p(u_t|x_t)} Σ_{t=1}^T E_{p(x_t,u_t)}[ℓ̃(x_t, u_t)] − H(p(u_t|x_t))

by using the cost ℓ̃(x_t, u_t) = (1/η) ℓ(x_t, u_t) − log p̂(u_t|x_t). This objective is simply obtained by dividing the Lagrangian by η. Since there is only one dual variable, dual gradient descent typically converges very quickly, usually in under 10 iterations, and because LQR is a very efficient trajectory optimization method, the entire procedure can be implemented to run very quickly.

We initialize p(u_t|x_t) with a fixed covariance C_t and zero mean, to produce random actuation on the first iteration. The Gaussian noise used to sample from p(u_t|x_t) is generated in advance and smoothed with a Gaussian kernel with a standard deviation of two time steps, in order to produce more temporally coherent noise.
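A schematic of the resulting dual gradient descent loop is given below. maxent_lqr_step and trajectory_kl are hypothetical helpers (the LQR pass sketched earlier and a KL estimate under the fitted dynamics), and the multiplicative update of eta stands in for the subgradient step, so this illustrates the structure of the update rather than the exact procedure.

def kl_constrained_update(cost_terms, log_p_prev, epsilon,
                          maxent_lqr_step, trajectory_kl,
                          eta=1.0, iters=10):
    for _ in range(iters):
        # Surrogate cost: l~(x, u) = l(x, u) / eta - log p_prev(u | x)
        surrogate = lambda x, u, t, eta=eta: cost_terms(x, u, t) / eta - log_p_prev(x, u, t)
        policy = maxent_lqr_step(surrogate)    # minimize E[l~] - H(p) with LQR
        kl = trajectory_kl(policy)             # D_KL(p || p_prev) under fitted dynamics
        if abs(kl - epsilon) < 0.1 * epsilon:
            break
        # Stand-in for the subgradient step on the dual variable: raise eta if the
        # step was too large (kl > epsilon), lower it otherwise.
        eta = eta * 2.0 if kl > epsilon else eta * 0.5
    return policy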
Fig. 2: Positioning task. (a) Start pose, against gravity; (b) end pose, against gravity; (c) start pose, with gravity; (d) end pose, with gravity.

III. SYSTEM AND TASK DESCRIPTION

Modularity and the ease of switching robotic platforms formed the overarching philosophy of our system design. The training can either happen locally (on the machine controlling the robot) or remotely (if more computational power is needed). The learning algorithm (Algorithm 1) has no dependency on the selected robotic platform except for step 3, where the policies are shipped to the robotic platform for evaluation and the resulting trajectories are collected. For the present work, the system was deployed locally on a 12-core 3.47 GHz Intel(R) Xeon(R) processor with 12 GB memory running Windows x64. Learning and evaluation were studied for the two different platforms detailed below.

A. Hardware platform

The Adroit platform is described in detail in [2]. Here we summarize the features relevant to the present context. As already noted, the hand has 24 joints and 40 Airpel cylinders acting on the joints via tendons. Each cylinder is supplied with compressed air via a high-performance Festo valve. The cylinders are fitted with solid-state pressure sensors. The pressures together with the joint positions and velocities (sensed by potentiometers in each joint) are provided as state variables to our controller.

The manipulation task also involves an object – which is a long tube filled with coffee beans, inspired by earlier work on grasping [23]. The object is fitted with PhaseSpace active infrared markers on each end. The markers are used to estimate the object position and velocity (both linear and angular), which are also provided as state variables. Since all our sensors have relatively low noise, we apply a minimal amount of filtering before sending the sensor data to the controller.

TABLE I: Different task variations learned
Task: Hand positioning
  Hardware:  5 (move assisted by gravity), 2 (move against gravity)
  Simulated: 5 (move assisted by gravity), 3 (move against gravity)
Task: Object manipulation
  Hardware + an elongated object (Fig. 3): 2 ({clockwise & anti-clockwise} object rotations along vertical)
  Simulated + 4 object variations: 13 ({clockwise, anti-clockwise, clockwise then anti-clockwise} object rotation along vertical), 8 ({clockwise, anti-clockwise} object rotation along horizontal)

B. Simulation platform

We model the entire system, including the air dynamics, tendon actuation and hand-object interactions, in the MuJoCo simulator we have developed [24]. Here MuJoCo is used as a MEX file called from MATLAB. Simulating a 5 s trajectory at 2 ms timestep takes around 0.7 s of CPU time, or around 0.3 ms of CPU time per simulation step. This includes evaluating the feedback control law (which needs to be interpolated because the trajectory optimizer uses 50 ms time steps) and advancing the physics simulation.

Having a fast simulator enables us to prototype and quickly evaluate candidate learning algorithms and cost function designs, before testing them on the hardware. Apart from being able to run much faster than real-time, the simulator can automatically reset itself to a specified initial state (which needs to be done manually on the hardware platform).

Ideally, the learning on the hardware platform should be seeded from the results of the software platform. This however is hard in practice due to (a) the difficulty in aligning the state space of the two platforms, (b) the non-deterministic nature of the real world. State space alignment requires precise system identification and sensor calibration, which otherwise are not necessary, as our algorithm can learn the local state space information directly from the raw sensor values.
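The text above notes that the feedback control law must be interpolated from the optimizer's 50 ms steps to the 2 ms simulation step, but does not specify the scheme; the sketch below assumes simple linear interpolation of the time-varying gains and offsets, purely as an illustration.

import numpy as np

def interpolated_gains(t_sec, K, k, optimizer_dt=0.05):
    """Linearly interpolate (K_t, k_t), defined every 50 ms, at an arbitrary time."""
    s = np.clip(t_sec / optimizer_dt, 0.0, len(k) - 1.0)
    lo = int(np.floor(s))
    hi = min(lo + 1, len(k) - 1)
    a = s - lo                                   # fractional position between steps
    return (1.0 - a) * K[lo] + a * K[hi], (1.0 - a) * k[lo] + a * k[hi]

At run time, u = K x + k is then evaluated with the interpolated gains at every simulation or hardware control step.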

C. Tasks and cost functions

Various tasks we trained for (detailed in Table I and the accompanying video) can be classified into two broad categories.

1) Hand behaviors: We started with simple positioning tasks: moving the hand to a specified pose from a given initial pose. We arranged the pair of poses such that in one task-set the finger motions were helped by gravity, and in another task-set they had to overcome gravity; see Figure 2. Note that for a system of this complexity even achieving a desired pose can be challenging, especially since the tendon actuators are in agonist-antagonist pairs and the forces have to balance to maintain posture. The running cost was

ℓ(x_t, u_t) = ||q_t − q*||^2 + 0.001 ||u_t||^2,

where x = (q, q̇, a). Here q denotes the vector of hand joint angles, a the vector of cylinder pressures, and u_t the vector of valve command signals. At the final time we used

ℓ(x_t, u_t) = 10 ||q_t − q*||^2.
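Since the algorithm only consumes a local quadratic expansion of the cost (Section II-A), the positioning cost above can be supplied together with its exact gradient and Hessian. The sketch below assumes the state layout x = [q, qdot, a]; it is our illustration, not the authors' code.

import numpy as np

def positioning_cost(x, u, q_star, nq, final=False):
    """l(x,u) = ||q - q*||^2 + 0.001 ||u||^2 (running), 10 ||q - q*||^2 (final)."""
    q = x[:nq]                            # hand joint angles are the first nq entries
    dX, dU = x.size, u.size
    w_q = 10.0 if final else 1.0          # weight on the pose error
    w_u = 0.0 if final else 0.001         # control penalty dropped at the final step
    l = w_q * np.sum((q - q_star) ** 2) + w_u * np.sum(u ** 2)
    grad = np.zeros(dX + dU)              # gradient with respect to [x; u]
    grad[:nq] = 2.0 * w_q * (q - q_star)
    grad[dX:] = 2.0 * w_u * u
    hess = np.zeros((dX + dU, dX + dU))   # Hessian with respect to [x; u]
    hess[:nq, :nq] = 2.0 * w_q * np.eye(nq)
    hess[dX:, dX:] = 2.0 * w_u * np.eye(dU)
    return l, grad, hess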

2) Object manipulation behaviours: The manipulation tasks we focused on require in-hand rotation of elongated objects. We chose this task because it involves intermittent contacts with multiple fingers and is quite dynamic, while at the same time having a certain amount of intrinsic stability. We studied different variations (Table I) of this task with different objects: rotation clockwise (Figure 1), rotation counter-clockwise, rotation clockwise followed by rotation counter-clockwise, and rotation clockwise without using the wrist joint (to encourage finger oriented maneuvers) – which was physically locked in that condition. Figure 3 illustrates the start and end poses and object configurations in the task learned on the Adroit hardware platform.

Fig. 3: Object rotation task: end poses for the two rotation directions (a-b: clockwise, c-d: anticlockwise), comparing the learned controller (a, c: end pose, learned) to the movement of a human who has not seen the robot perform the task (b, d: end pose, human).

Here the cost function included an extra term for desired object position and orientation. The relative magnitudes of the different cost terms were as follows: hand pose: 0.01, object position: 1, object orientation: 10, control: 0.001. The final cost was scaled by a factor of 2 relative to the running cost, and did not include the control term.

IV. RESULTS

The controller was trained as described above. Once the order of the cost parameters was properly established, minimal parameter tuning was required. Training consisted of around 15 iterations. In each iteration we performed 5 movements with different instantiations of the exploration noise in the controls.¹ The progress of training as well as the final performance is illustrated in the video accompanying the submission, and in the figure at the beginning of the paper.

¹ At each iteration, the current controller p(u_t|x_t) is deployed on the robot to gather N samples (N = 5 in all of our experiments).

Here we quantify the performance and the robustness to noise. Figure 4 shows how the total cost for the movement (as measured by the cost functions defined above) decreased over iterations of the algorithm. The solid curves are data from the physical system. Note that in all tasks and task variations we observe very rapid convergence. Surprisingly, the manipulation task, which is much harder from a control viewpoint, takes about the same number of iterations to learn.

Fig. 4: Learning curves for the positioning (top) and manipulation (bottom) tasks. [Plots of cost vs. iteration: top panel "Hand posing progress" with curves for curling with and against gravity, on hardware and in simulation; bottom panel "Object twirling progress" with curves for rotate clockwise, rotate anticlockwise, and rotate clockwise without wrist (WR).]

In the positioning task we also performed a systematic comparison between learning in the physical system and learning in simulation. Performance early in training was comparable, but eventually the algorithm was able to find better policies in simulation. Although it is not shown in the figure, training on the simulation platform happens a lot faster, because the robot can only run in real-time while the simulated platform runs faster than real-time, and because resetting between repetitions needs to be done manually on the robot.

Fig. 5: Effect of noise smoothing on learning. [Plot "Effects of injected noise": cost vs. iteration on the Adroit hardware for Sigma = 1, 2, 3.] Sigma = 1 (the width of the Gaussian kernel used for smoothing the noise) takes a slow start but maintains constant progress. A higher sigma favors a steeper descent but fails to maintain the progress, as it is unable to maintain the stability of the object being manipulated and ends up dropping it. The algorithm then incurs a huge cost penalty and restarts its descent from there.

We further investigated the effects of exploration noise magnitude injected during training. Figure 5 shows that for a relatively small amount of noise the cost decreases monotonically. As we increase the noise magnitude, sometimes we see faster improvement early on, but the behavior of the algorithm is no longer monotonic. These are data on the Adroit hardware platform.
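For reference, the temporally coherent exploration noise described in Section II-B, whose kernel width sigma is what Figure 5 varies, can be generated roughly as follows; the rescaling after smoothing (to keep the per-step variance near one) is our own choice and is not specified in the paper.

import numpy as np

def smoothed_exploration_noise(T, dU, sigma=2.0, rng=None):
    """White Gaussian noise smoothed along time with a Gaussian kernel (std = sigma steps)."""
    rng = rng or np.random.default_rng()
    eps = rng.standard_normal((T, dU))
    half = int(np.ceil(3 * sigma))
    w = np.exp(-0.5 * (np.arange(-half, half + 1) / sigma) ** 2)
    w /= w.sum()
    smooth = np.column_stack([np.convolve(eps[:, i], w, mode="same") for i in range(dU)])
    return smooth / np.linalg.norm(w)    # rescale so the per-step variance stays ~1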

Finally, we used the simulation platform to investigate robustness to perturbations more quantitatively, in the manipulation task. We wanted to quantify how robust our controllers are to changes in initial state (recall that the controllers are local). Furthermore, we wanted to see if training with noisy initial states, in addition to exploration noise injected in the controls, will result in more robust controllers. Naïvely adding initial state noise at each iteration of the algorithm (Algorithm 1) severely hindered the overall progress. However, adding initial state noise after the policy was partially learned (iteration ≥ 10 in our case) resulted in much more robust controllers.

Fig. 6: Robustness to noise in initial state. Each column corresponds to a different noise level: 5, 10, 20% of the range of each state variable. The top row is a controller trained with noise in the initial state. The bottom row is a controller trained with the same initial state (no noise) for all trials. [Plots of object orientation (radians) vs. time (seconds) for the 5%, 10%, and 20% noise columns.]

The results of these simulations are shown in Figure 6. We plot the orientation of the object around the vertical axis as a function of time. The black curve is the unperturbed trajectory. As expected, noise injected in the initial state makes the movements more variable, especially for the controller that was trained without such noise. Adding initial state noise during training substantially improved the ability of the controller to suppress perturbations in initial state. Overall, we were surprised at how much noise we could add (up to 20% of the range of each state variable) without the hand dropping the object, in the case of the controller trained with noise. The controller trained without noise dropped the object in 4 out of 20 test trials. Thus injecting some noise in the initial state (around 2.5%) helps improve robustness. Of course on the real robot we cannot avoid injecting such noise, because exact repositioning is very difficult.
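The initial-state noise curriculum described above reduces to a few lines. The 10-iteration threshold comes from the text; noise_frac, the per-state range vector, and the use of uniform noise are our assumptions for illustration.

import numpy as np

def perturbed_initial_state(x0_nominal, state_range, iteration, noise_frac, rng=None):
    """Inject initial-state noise only after the policy is partially learned (iteration >= 10)."""
    rng = rng or np.random.default_rng()
    if iteration < 10:
        return x0_nominal.copy()          # early iterations: fixed initial state
    noise = noise_frac * state_range * rng.uniform(-1.0, 1.0, size=x0_nominal.shape)
    return x0_nominal + noise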

V. DISCUSSION AND FUTURE WORK

We demonstrated learning-based control of a complex, high-dimensional, pneumatically-driven hand. Our results show simple tasks, such as reaching a target pose, as well as dynamic manipulation behaviors that involve repositioning a freely-moving cylindrical object. Aside from the high-level objective encoded in the cost function, the learning algorithm does not use domain knowledge about the task or the hardware, learning a low-level valve control strategy for the pneumatic actuators entirely from scratch. The experiments show that effective manipulation strategies can be automatically discovered in this way.

While the linear-Gaussian controllers we employ offer considerable flexibility and closed-loop control, they are inherently limited in their ability to generalize to new situations, since a time-varying linear strategy may not be effective when the initial state distribution is more diverse. Previous work has addressed this issue by combining multiple linear-Gaussian controllers into a single nonlinear policy, using methods such as guided policy search [9] and trajectory-based methods [25]. Applying these methods to our manipulation tasks could allow us to train more generalizable manipulation skills, and also bring in additional sensory modalities, such as haptics and vision, as described in recent work on guided policy search [26].

REFERENCES

[1] S. Levine and P. Abbeel, "Learning neural network policies with guided policy search under unknown dynamics," in Advances in Neural Information Processing Systems (NIPS), 2014.
[2] V. Kumar, Z. Xu, and E. Todorov, "Fast, strong and compliant pneumatic actuation for dexterous tendon-driven hands," in International Conference on Robotics and Automation (ICRA), 2013.
[3] R. Bellman and R. Kalaba, "A mathematical theory of adaptive control processes," Proceedings of the National Academy of Sciences, vol. 45, no. 8, pp. 1288–1290, 1959.
[4] R. Sutton and A. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.
[5] G. Tesauro, "TD-Gammon, a self-teaching backgammon program, achieves master-level play," Neural Computation, vol. 6, no. 2, pp. 215–219, 1994.
[6] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[7] K. J. Åström and B. Wittenmark, Adaptive Control. Courier Corporation, 2013.
[8] D. Mitrovic, S. Klanke, and S. Vijayakumar, "Adaptive optimal feedback control with learned internal dynamics models," in From Motor Learning to Interaction Learning in Robots, 2010, vol. 264, pp. 65–84.
[9] S. Levine, N. Wagener, and P. Abbeel, "Learning contact-rich manipulation skills with guided policy search," in International Conference on Robotics and Automation (ICRA), 2015.
[10] R. Tedrake, T. Zhang, and H. Seung, "Stochastic policy gradient reinforcement learning on a simple 3D biped," in International Conference on Intelligent Robots and Systems (IROS), 2004.
[11] J. Kober, E. Oztop, and J. Peters, "Reinforcement learning to adjust robot movements to new situations," in Robotics: Science and Systems (RSS), 2010.
[12] P. Pastor, H. Hoffmann, T. Asfour, and S. Schaal, "Learning and generalization of motor skills by learning from demonstration," in International Conference on Robotics and Automation (ICRA), 2009.
[13] M. Deisenroth, C. Rasmussen, and D. Fox, "Learning to control a low-cost manipulator using data-efficient reinforcement learning," in Robotics: Science and Systems (RSS), 2011.
[14] J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: A survey," International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
[15] M. Deisenroth, G. Neumann, and J. Peters, "A survey on policy search for robotics," Foundations and Trends in Robotics, vol. 2, no. 1–2, pp. 1–142, 2013.
[16] R. Lioutikov, A. Paraschos, G. Neumann, and J. Peters, "Sample-based information-theoretic stochastic optimal control," in International Conference on Robotics and Automation (ICRA), 2014.
[17] W. Li and E. Todorov, "Iterative linear quadratic regulator design for nonlinear biological movement systems," in ICINCO (1), 2004, pp. 222–229.
[18] S. Levine and V. Koltun, "Guided policy search," in International Conference on Machine Learning (ICML), 2013.
[19] J. A. Bagnell and J. Schneider, "Covariant policy search," in International Joint Conference on Artificial Intelligence (IJCAI), 2003.
[20] J. Peters and S. Schaal, "Reinforcement learning of motor skills with policy gradients," Neural Networks, vol. 21, no. 4, pp. 682–697, 2008.
[21] J. Peters, K. Mülling, and Y. Altün, "Relative entropy policy search," in AAAI Conference on Artificial Intelligence, 2010.
[22] S. Boyd and L. Vandenberghe, Convex Optimization. New York, NY, USA: Cambridge University Press, 2004.
[23] J. R. Amend Jr., E. Brown, N. Rodenberg, H. M. Jaeger, and H. Lipson, "A positive pressure universal gripper based on the jamming of granular material," IEEE Transactions on Robotics, vol. 28, no. 2, pp. 341–350, 2012.
[24] E. Todorov, T. Erez, and Y. Tassa, "MuJoCo: A physics engine for model-based control," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2012, pp. 5026–5033.
[25] C. Atkeson and J. Morimoto, "Nonparametric representation of policies and value functions: A trajectory-based approach," in Advances in Neural Information Processing Systems (NIPS), 2002.
[26] S. Levine, C. Finn, T. Darrell, and P. Abbeel, "End-to-end training of deep visuomotor policies," arXiv preprint arXiv:1504.00702, 2015.