Robot Trajectory Optimization Using Approximate Inference
Marc Toussaint, [email protected]
TU Berlin, Franklinstr 28/29 FR6-9, 10587 Berlin, Germany

Abstract

The general stochastic optimal control (SOC) problem in robotics scenarios is often too complex to be solved exactly and in near real time. A classical approximate solution is to first compute an optimal (deterministic) trajectory and then solve a local linear-quadratic-gaussian (LQG) perturbation model to handle the system stochasticity. We present a new algorithm for this approach which improves upon previous algorithms like iLQG. We consider a probabilistic model for which the maximum likelihood (ML) trajectory coincides with the optimal trajectory and which, in the LQG case, reproduces the classical SOC solution. The algorithm then utilizes approximate inference methods (similar to expectation propagation) that efficiently generalize to non-LQG systems. We demonstrate the algorithm on a simulated 39-DoF humanoid robot.

1. Introduction

Trajectory optimization is a key problem in robotics, in particular when the time needed for optimization is critical. For instance, in interactive scenarios with a human, robots need to react to dynamic changes in the environment, and the optimization time should be no more than the temporal duration of the movement itself so that online sequencing is possible in principle. The aim of this work is to push the performance of trajectory optimization algorithms a little further using approximate inference techniques.

The transfer of inference methods to stochastic control and Reinforcement Learning (RL) problems has a long history. Examples in the context of decision making and RL include the early work on inference in influence diagrams (Cooper, 1988; Shachter, 1988), the relation between expectation-maximization and (immediate reward) RL observed by Dayan & Hinton (1997), Attias' (2003) inference approach to planning, and an EM approach to solving MDPs and POMDPs (Toussaint et al., 2006). In the control literature, the relation between stochastic optimal control (SOC) and inference (Kalman estimation) has long been known for the special LQG case (e.g., Stengel, 1986). However, one of the most interesting and not fully solved questions in the field is how this transfer can be generalized to non-LQG problems; we will discuss such work in some more detail below.

In this paper we do not try to propose an exact and general reformulation of SOC problems as an inference problem. Instead, in its present form (focusing on Gaussian messages), our approach aims at a local approximate solution to the SOC problem that is fast to compute. This is in the tradition of the classical SOC literature (see, e.g., Bryson & Ho, 1969; Stengel, 1986), where the LQG case is typically introduced as a (variational) model to control perturbations from an optimal trajectory. That is, the general non-linear SOC problem is approximately solved by first computing an optimal trajectory for the noise-free system, and then deriving a local LQG model of the perturbations around this optimal trajectory. As long as the system is not perturbed too far from the optimal trajectory, the corresponding linear quadratic regulator (LQR) is a reasonable solution to the SOC problem. The second stage, computing the local LQR, is solved analytically by the Riccati equations (see below); the main problem arises with the non-linear trajectory optimization problem in the first stage.

The trajectory optimization problem can be solved, e.g., by gradient methods (typically with a spline-based trajectory encoding, e.g., Chen, 1991; Zhang & Knoll, 1995; Schlemmer & Gruebel, 1998) or by sequential quadratic programming (SQP) schemes. In every iteration of SQP one devises a quadratic approximation of the objective (cost) function, the optimum of which can be computed using the Riccati equations. Hence, SQP typically involves iterating LQG solution methods. An instance of such an algorithm is iLQG (Todorov & Li, 2005). An aspect of such methods which contrasts with our approach is that in each SQP iteration one computes a global trajectory x_{0:T} (by "global" we mean "over the full time interval 0..T") and a global LQG approximation. Our framework will naturally lead to iterative updates of local messages rather than a global trajectory.

In the LQG case, the similarity between the Riccati equations and Kalman's equation for state estimation is referred to as "Kalman duality" (Stengel, 1986; Todorov, 2008), but it strictly holds only in the LQG case. The crucial question is how to generalize this duality to the non-LQG case. Todorov (2008) introduces a special class of SOC problems which can be solved via inference (see also Kappen et al., 2009). An alternative to seeking an exact duality is to focus on optimal trajectories and consider a probabilistic model for which the ML solution coincides with the optimal trajectory (Todorov, 2008). We follow this latter approach and will utilize approximate inference methods to efficiently find the ML trajectory and the local LQG solution around this trajectory. In previous work (Toussaint & Goerick, 2007) we applied the same idea to a special case of coupled task and joint space planning. In this paper we focus on the general theory and the SOC case.

In the next section we briefly introduce the framework of SOC, the LQG case, and the SQP method iLQG. In section 3 we formulate the general probabilistic model. We derive an exact Gaussian message passing algorithm for the special LQG case and show how it is related to the Riccati equations. Then we generalize to the non-LQG case based on Expectation Propagation (Minka, 2001b). Finally, section 4 presents an evaluation on a simulated humanoid reaching problem.

2. Stochastic optimal control

We consider a discrete time stochastic controlled system of the form

    x_{t+1} = f_t(x_t, u_t) + \xi, \qquad \xi \sim \mathcal{N}(0, Q_t),    (1)

with time step t, state x_t \in \mathbb{R}^n, control u_t \in \mathbb{R}^m, and Gaussian noise \xi of covariance Q_t. We also use the notation P(x_{t+1} \mid u_t, x_t) = \mathcal{N}(x_{t+1} \mid f_t(x_t, u_t), Q_t), where

    \mathcal{N}(x \mid a, A) \propto \exp\{-\tfrac{1}{2} (x - a)^\top A^{-1} (x - a)\}    (2)

is a Gaussian over x with mean a and covariance A. For a given state-control sequence x_{0:T}, u_{0:T} we define the cost as

    C(x_{0:T}, u_{0:T}) = \sum_{t=0}^{T} c_t(x_t, u_t).    (3)

The optimal value function J_t(x) gives the expected future cost when in state x at time t for the best controls and obeys the Bellman optimality equation

    J_t(x) = \min_u \Big[ c_t(x, u) + \int_{x'} P(x' \mid u, x)\, J_{t+1}(x') \Big].    (4)

There are two versions of stochastic optimal control problems: The open-loop control problem is to find a control sequence u^*_{1:T} that minimizes the expected cost. The closed-loop (feedback) control problem is to find a control policy \pi^*_t : x_t \mapsto u_t (that exploits the true state observation in each time step and maps it to a feedback control signal) that minimizes the expected cost.
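To make definitions (1) and (3) concrete, the following is a minimal sketch (not code from the paper) of how the expected cost of a fixed open-loop control sequence could be estimated by sampling noisy rollouts; the dynamics f, the cost c, and the toy single-integrator example are placeholder assumptions chosen only for illustration.

```python
import numpy as np

def rollout_cost(f, c, x0, u_seq, Q, rng):
    """Sample one noisy rollout x_{t+1} = f_t(x_t, u_t) + xi, eq. (1),
    and accumulate the trajectory cost C = sum_t c_t(x_t, u_t), eq. (3)."""
    x, cost = np.asarray(x0, dtype=float), 0.0
    T = len(u_seq) - 1                       # controls u_0 .. u_T
    for t in range(T + 1):
        cost += c(t, x, u_seq[t])
        if t < T:
            xi = rng.multivariate_normal(np.zeros(x.shape[0]), Q)
            x = f(t, x, u_seq[t]) + xi
    return cost

def expected_cost(f, c, x0, u_seq, Q, n_samples=200, seed=0):
    """Monte-Carlo estimate of the expected cost of an open-loop control sequence."""
    rng = np.random.default_rng(seed)
    return np.mean([rollout_cost(f, c, x0, u_seq, Q, rng) for _ in range(n_samples)])

# Toy example: 2-D single integrator with quadratic costs (placeholder system).
f = lambda t, x, u: x + 0.1 * u
c = lambda t, x, u: x @ x + 0.01 * u @ u
u_seq = [np.zeros(2) for _ in range(21)]     # T = 20
print(expected_cost(f, c, np.array([1.0, 0.0]), u_seq, 1e-3 * np.eye(2)))
```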
The linear quadratic gaussian (LQG) case plays an important role as a perturbation model or as an ingredient in iterative solution methods. LQG is a linear control process with Gaussian noise,

    P(x_{t+1} \mid x_t, u_t) = \mathcal{N}(x_{t+1} \mid A_t x_t + a_t + B_t u_t, Q_t),

and quadratic costs,

    c_t(x_t, u_t) = x_t^\top R_t x_t - 2 r_t^\top x_t + u_t^\top H_t u_t.    (5)

The LQG process is defined by the matrices and vectors A_{0:T}, a_{0:T}, B_{0:T}, Q_{0:T}, R_{0:T}, r_{0:T}, H_{0:T}. As a convention, in the remainder of this paper we drop the subscript t for A, a, B, Q, R, r, H; wherever a subscript is missing we refer to time t.

The LQG case allows us to derive an exact backward recursion, called the Riccati equation, for the computation of the value function. The value function will always be a quadratic form of the state. Assume that at time t+1 the optimal value function can be expressed as

    J_{t+1}(x) = x^\top V_{t+1} x - 2 v_{t+1}^\top x    (6)

for some matrix V_{t+1} and some vector v_{t+1}. Applying equation (4) in a straightforward manner one can derive that J_t(x) is of the form

    J_t(x) = x^\top V_t x - 2 x^\top v_t + \text{terms independent of } x,
    V_t = R + A^\top V_{t+1} A - K V_{t+1} A    (7)
    v_t = r + A^\top (v_{t+1} - V_{t+1} a) - K (v_{t+1} - V_{t+1} a)    (8)
    \text{where } K := A^\top V_{t+1} (V_{t+1} + B^{-\top} H B^{-1})^{-1},

and the minimization in (4) is given by

    u^*_t(x) = -(H + B^\top V_{t+1} B)^{-1} B^\top \big( V_{t+1}(A x + a) - v_{t+1} \big).    (9)

Equations (7)-(9) are the Riccati equations (Stengel, 1986). (Using the Woodbury identity they can be rewritten in other forms.) Initialized with V_T = R_T and v_T = r_T, the backward recursion yields the value function, and with it the optimal linear feedback controller (9), for all time steps.
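Equations (7)-(9) translate directly into a short backward pass. The following is a minimal NumPy sketch (not code from the paper); it uses the Woodbury-rewritten form of the gain term mentioned above so that B need not be inverted, and all function and variable names are chosen here for illustration.

```python
import numpy as np

def lqr_backward(A, a, B, R, r, H):
    """Backward Riccati recursion, eqs. (6)-(8).
    A, a, B, R, r, H are lists of the time-indexed LQG matrices/vectors
    (A_0..A_T etc.); returns lists V, v with J_t(x) = x'V_t x - 2 v_t'x."""
    T = len(A) - 1
    V, v = [None] * (T + 1), [None] * (T + 1)
    V[T], v[T] = R[T], r[T]                  # initialization: V_T = R_T, v_T = r_T
    for t in range(T - 1, -1, -1):
        Vn, vn = V[t + 1], v[t + 1]
        # Gain term of eqs. (7)-(8): K = A' S with
        #   S = V_{t+1} B (H + B' V_{t+1} B)^{-1} B'
        # (the Woodbury-rewritten form, so B need not be invertible).
        S = Vn @ B[t] @ np.linalg.solve(H[t] + B[t].T @ Vn @ B[t], B[t].T)
        w = vn - Vn @ a[t]
        V[t] = R[t] + A[t].T @ (Vn - S @ Vn) @ A[t]      # eq. (7)
        v[t] = r[t] + A[t].T @ (w - S @ w)               # eq. (8)
    return V, v

def lqr_control(x, t, A, a, B, H, V, v):
    """Optimal feedback control of eq. (9) for state x at time t < T."""
    Vn, vn = V[t + 1], v[t + 1]
    return -np.linalg.solve(H[t] + B[t].T @ Vn @ B[t],
                            B[t].T @ (Vn @ (A[t] @ x + a[t]) - vn))
```

Solving the small m x m system with np.linalg.solve rather than forming explicit inverses keeps the recursion numerically stable; for a time-invariant problem the lists simply repeat the same matrices.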
Algorithm 1: iLQG
1: Input: initial trajectory x_{0:T}, convergence rate α, control costs H_{0:T}, functions A_t(x), a_t(x), B_t(x), R_t(x), r_t(x)
2: Output: trajectory x_{0:T}, LQR V_{0:T}, v_{0:T}
3: repeat
4:   access R_T(x_T), r_T(x_T)
5:   V_T ← R_T, v_T ← r_T
6:   for t = T-1 to 0 do   // backward Riccati recursion

The two notions of optimality are not fully equivalent (there is as yet no general version of Kalman's duality known), but both are interesting definitions of optimality. Accordingly, the formulation we present below will not in all respects be equivalent to the problem of expected cost minimization, but we can ensure that the ML trajectory of the conditioned probabilistic model coincides with the optimal trajectory in the classical case.
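Algorithm 1 wraps the backward Riccati recursion in an SQP-style outer loop that repeatedly re-linearizes around the current trajectory. The sketch below illustrates how one such iteration might look, reusing lqr_backward and lqr_control from the earlier sketch; it is an assumption-based illustration rather than the paper's exact Algorithm 1. In particular, the helpers f, linearize, and quadratize, and the damped trajectory update with rate alpha, are choices made here for the example.

```python
import numpy as np

def ilqg_step(f, x_traj, u_traj, linearize, quadratize, H, alpha):
    """One SQP-style iteration in the spirit of Algorithm 1 (a sketch under
    assumptions, not the paper's exact procedure): build a local LQG model
    around the current trajectory, solve it with the backward Riccati
    recursion, and blend the resulting trajectory into the old one.

    Assumed helper signatures (hypothetical):
      f(t, x, u)        -> next state of the noise-free system
      linearize(t, x)   -> (A_t, a_t, B_t)   local linear dynamics model
      quadratize(t, x)  -> (R_t, r_t)        local quadratic state-cost model
    """
    T = len(x_traj) - 1
    A, a, B, R, r = [], [], [], [], []
    for t in range(T + 1):
        At, at, Bt = linearize(t, x_traj[t])
        Rt, rt = quadratize(t, x_traj[t])
        A.append(At); a.append(at); B.append(Bt); R.append(Rt); r.append(rt)

    V, v = lqr_backward(A, a, B, R, r, H)     # backward Riccati recursion

    # Forward pass: roll the noise-free system forward under the feedback law (9).
    x_new, u_new = [x_traj[0]], []
    for t in range(T):
        u = lqr_control(x_new[t], t, A, a, B, H, V, v)
        u_new.append(u)
        x_new.append(f(t, x_new[t], u))
    u_new.append(np.zeros_like(u_traj[T]))    # at t = T only u'H u remains, so u = 0

    # Damped update of the nominal trajectory with convergence rate alpha.
    x_traj = [(1 - alpha) * xo + alpha * xn for xo, xn in zip(x_traj, x_new)]
    u_traj = [(1 - alpha) * uo + alpha * un for uo, un in zip(u_traj, u_new)]
    return x_traj, u_traj, V, v
```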