Constrained Unscented Dynamic Programming
Brian Plancher, Zachary Manchester, and Scott Kuindersma

Abstract— Differential Dynamic Programming (DDP) has become a popular approach to performing trajectory optimization for complex, underactuated robots. However, DDP presents two practical challenges. First, the evaluation of dynamics derivatives during optimization creates a computational bottleneck, particularly in implementations that capture second-order dynamic effects. Second, constraints on the states (e.g., boundary conditions, collision constraints, etc.) require additional care since the state trajectory is implicitly defined from the inputs and dynamics. This paper addresses both of these problems by building on recent work on Unscented Dynamic Programming (UDP)—which eliminates dynamics derivative computations in DDP—to support general nonlinear state and input constraints using an augmented Lagrangian. The resulting algorithm has the same computational cost as first-order penalty-based DDP variants, but can achieve constraint satisfaction to high precision without the numerical ill-conditioning associated with penalty methods. We present results demonstrating its favorable performance on several simulated robot systems including a quadrotor and a 7-DoF robot arm.

I. INTRODUCTION

Trajectory optimization algorithms [1] are a powerful set of tools for synthesizing dynamic motions for complex robots [2], [3], [4], [5], [6]. Most robot tasks of interest include state constraints relating to, e.g., obstacle avoidance, reaching a desired goal state, and maintaining contact with the environment. Direct transcription methods for trajectory optimization parameterize both the state and input trajectories, allowing them to easily handle such constraints by formulating a large (and sparse) nonlinear program that can be solved using off-the-shelf Sequential Quadratic Programming (SQP) packages. In contrast, Differential Dynamic Programming (DDP) parameterizes only the input trajectory and uses Bellman's optimality principle to iteratively solve a sequence of much smaller optimization problems [7], [8]. Recently, DDP and its variants have received increased attention due to growing evidence that online planning is possible for high-dimensional robots [4], [9]. However, constraints are typically addressed approximately by augmenting the native cost with constraint penalty functions. This paper introduces a variant of DDP that captures nonlinear constraints on states and inputs with high accuracy while maintaining favorable convergence properties.

The classical DDP algorithm begins by simulating the dynamics forward from an initial state with an initial guess for the input trajectory. An affine control law is then computed in a backwards pass using local quadratic approximations of the cost-to-go. The forward pass is then performed again using this control law to update the input trajectory. This process is repeated until convergence.

Second derivatives of the dynamics (rank-three tensors) are required to compute the quadratic cost-to-go approximations used in DDP. This computational bottleneck led to the development of the iterative Linear Quadratic Regulator (iLQR) algorithm [10], which uses only first derivatives of the dynamics to reduce computation time at the expense of slower convergence. Recent work has proposed a completely derivative-free variant of DDP that uses a deterministic sampling scheme inspired by the unscented Kalman filter. The resulting algorithm, Unscented Dynamic Programming (UDP) [11], has the same computational complexity per iteration as iLQR with finite-difference derivatives, but provides empirical convergence approaching that of the full second-order DDP algorithm. The primary contribution of this paper is an extension of UDP that supports nonlinear constraints on states and inputs.

Several authors have proposed methods for adding constraints to DDP methods. For the special case of box constraints on input variables, quadratic programs (QPs) can be solved in the backwards pass [12], [13]. In this setting, bounds on the inputs are enforced by projecting the feedback terms onto the constraint surface. Farshidian et al. [14] extend this to equality constraints on both the state and input through a similar projection framework, while pure state constraints are still handled via penalty functions.

Designing effective penalty functions and continuation schedules can be difficult. One common approach is to increase penalty weighting coefficients in the cost function until convergence to a result that satisfies the constraints [14]. However, it is well known that these methods often lead to numerical ill-conditioning before reaching the desired constraint tolerance [15], [16]. In practice, this leads to infeasible trajectories or collisions when run on hardware. Despite this, penalty methods have seen broad application in robotics. For example, van den Berg [17] uses exponential barrier cost terms in the LQR Smoothing algorithm to prevent a quadrotor from colliding with cylindrical obstacles.

The augmented Lagrangian approach to solving constrained nonlinear optimization problems was conceived to overcome the conditioning problems of penalty methods by adding a linear Lagrange multiplier term to the objective [16]. As pointed out by other researchers [18], these methods may be particularly well-suited to trajectory optimization problems as they allow the solver to temporarily traverse infeasible regions and aggressively move towards local optima before making incremental adjustments to satisfy constraints. A theoretical formulation of this method was proposed for use in the DDP context over two decades ago [19] and was later used to develop a hybrid-DDP algorithm [20], but this work appears to have received little attention in the robotics community. We compare our constrained UDP algorithm against this method.
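For concreteness, the textbook form of the idea [16] for an equality-constrained problem, minimize f(x) subject to c(x) = 0 (generic notation, not symbols defined in this paper), augments the quadratic penalty with a linear multiplier term:

    \mathcal{L}_A(x, \lambda; \mu) = f(x) + \lambda^T c(x) + \frac{\mu}{2} \| c(x) \|^2,

with the multiplier update \lambda \leftarrow \lambda + \mu\, c(x) applied after each unconstrained minimization. Because the multiplier term carries the asymptotic burden of enforcing c(x) = 0, the penalty weight \mu can remain bounded, which is what avoids the ill-conditioning of pure penalty continuation.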
In the remainder of the paper, we review key concepts from DDP, augmented Lagrangian methods, and the unscented transform (Section II), introduce the constrained UDP algorithm (Section III), and describe our experimental results on an inverted pendulum, a quadrotor flying through a virtual forest, and a manipulator avoiding obstacles (Section IV). Several practical considerations are also discussed.

School of Engineering and Applied Sciences, Harvard University, Cambridge, MA. brian [email protected], {zmanchester, scottk} [email protected]

II. BACKGROUND

In the following subsections, we summarize key ideas from DDP, augmented Lagrangian methods, and the unscented transform, all of which form the basis for the constrained UDP algorithm described in Section III.

A. Differential Dynamic Programming

We assume a discrete-time nonlinear dynamical system of the form:

    x_{k+1} = f(x_k, u_k),    (1)

where x \in \mathbb{R}^n is the system state and u \in \mathbb{R}^m is a control input. The goal is to find an input trajectory, U = \{u_0, \ldots, u_{N-1}\}, that minimizes an additive cost function:

    J(x_0, U) = \ell_f(x_N) + \sum_{k=0}^{N-1} \ell(x_k, u_k),    (2)

where x_0 is the initial state and x_1, \ldots, x_N are computed by integrating forward the dynamics (1).

Using Bellman's principle of optimality [21], we can define the optimal cost-to-go, V_k(x), by the recurrence relation:

    V_N(x) = \ell_f(x_N)
    V_k(x) = \min_u \left[ \ell(x, u) + V_{k+1}(f(x, u)) \right].    (3)

When interpreted as an update procedure, this relationship leads to classical dynamic programming algorithms [21]. However, the curse of dimensionality prevents direct application of dynamic programming to most systems of interest to the robotics community. In addition, while V_N(x) = \ell_f(x_N) often has a simple analytical form, V_k(x) will typically have complex geometry that is difficult to represent due to the nonlinearity of the dynamics (1). DDP avoids these difficulties by settling for local approximations to the cost-to-go along a trajectory.

We define Q(\delta x, \delta u) as the local change in the minimization argument in (3) under perturbations, \delta x, \delta u:

    Q(\delta x, \delta u) = \ell(x + \delta x, u + \delta u) + V'(f(x + \delta x, u + \delta u)) - \ell(x, u) - V'(f(x, u)).    (4)

Taking the second-order Taylor expansion of Q, we have:

    Q(\delta x, \delta u) \approx \frac{1}{2}
    \begin{bmatrix} 1 \\ \delta x \\ \delta u \end{bmatrix}^T
    \begin{bmatrix} 0 & Q_x^T & Q_u^T \\ Q_x & Q_{xx} & Q_{xu} \\ Q_u & Q_{xu}^T & Q_{uu} \end{bmatrix}
    \begin{bmatrix} 1 \\ \delta x \\ \delta u \end{bmatrix},    (5)

where the block matrices are computed as:

    Q_{xx} = \ell_{xx} + f_x^T V'_{xx} f_x + V'_x \cdot f_{xx}
    Q_{uu} = \ell_{uu} + f_u^T V'_{xx} f_u + V'_x \cdot f_{uu}
    Q_{xu} = \ell_{xu} + f_x^T V'_{xx} f_u + V'_x \cdot f_{xu}    (6)
    Q_x = \ell_x + f_x^T V'_x
    Q_u = \ell_u + f_u^T V'_x.

Following the notation used elsewhere [4], we have dropped explicit time indices and used a prime to indicate the next timestep. Derivatives with respect to x and u are denoted with subscripts. The rightmost terms in the equations for Q_{xx}, Q_{uu}, and Q_{xu} involve second derivatives of the dynamics, which are rank-three tensors. As mentioned previously, these tensor calculations are relatively expensive and are often omitted, resulting in the iLQR algorithm [10].

Minimizing equation (5) with respect to \delta u results in the following correction to the control trajectory:

    \delta u = -Q_{uu}^{-1} (Q_{ux} \delta x + Q_u) \equiv K \delta x + d,    (7)

which consists of an affine term d and a linear feedback term K \delta x. These terms can be substituted back into equation (5) to obtain an updated quadratic model of V:

    \Delta V = -\frac{1}{2} Q_u^T Q_{uu}^{-1} Q_u
    V_x = Q_x - Q_{xu} Q_{uu}^{-1} Q_u    (8)
    V_{xx} = Q_{xx} - Q_{xu} Q_{uu}^{-1} Q_{ux}.

Therefore, a backward update pass can be performed starting at the final state, x_N, by setting V_N = \ell_f(x_N) and iteratively applying the above computations. A forward simulation pass is then performed to compute a new state trajectory using the updated controls. This forward-backward process is repeated until the algorithm converges within a specified tolerance.

DDP, like other variants of Newton's method, can achieve quadratic convergence near a local optimum [8], [22]. However, care must be taken to ensure good convergence behavior from arbitrary initialization: a line search parameter, \alpha, must be added to the forward pass to ensure a satisfactory decrease in cost, and a regularization term, \rho, must be added to Q_{uu} in equation (7) to ensure positive definiteness [23].
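To make the recursion concrete, the following is a minimal numpy sketch of one forward rollout, equations (1)-(2), and one backward-pass step, equations (6)-(8), in the first-order (iLQR-style) variant that drops the tensor terms in (6). The function names, signatures, and the explicit matrix inverse are illustrative conveniences, not the authors' implementation (a production solver would use a factorization of the regularized Q_{uu}).

    import numpy as np

    def rollout(f, ell, ell_f, x0, U):
        """Integrate the dynamics (1) forward and evaluate the cost (2)."""
        X, J = [x0], 0.0
        for u in U:
            J += ell(X[-1], u)
            X.append(f(X[-1], u))
        return X, J + ell_f(X[-1])

    def backward_step(lx, lu, lxx, luu, lxu, fx, fu, Vx, Vxx, rho=0.0):
        """One step of the backward recursion, equations (6)-(8), with the
        second-derivative tensor terms of (6) omitted (iLQR-style)."""
        Qx  = lx + fx.T @ Vx
        Qu  = lu + fu.T @ Vx
        Qxx = lxx + fx.T @ Vxx @ fx
        Quu = luu + fu.T @ Vxx @ fu
        Qxu = lxu + fx.T @ Vxx @ fu
        # Regularize Quu so it is positive definite before inverting in (7).
        Quu_inv = np.linalg.inv(Quu + rho * np.eye(Quu.shape[0]))
        K = -Quu_inv @ Qxu.T          # feedback gain in (7); Qux = Qxu^T
        d = -Quu_inv @ Qu             # feedforward correction in (7)
        # Updated quadratic model of the cost-to-go, equation (8).
        dV      = -0.5 * Qu @ Quu_inv @ Qu
        Vx_new  = Qx - Qxu @ Quu_inv @ Qu
        Vxx_new = Qxx - Qxu @ Quu_inv @ Qxu.T
        return K, d, Vx_new, Vxx_new, dV

In a full solver, backward_step would be applied from k = N-1 down to 0 starting from V_N = \ell_f(x_N), followed by a forward pass in which the affine term d is scaled by the line search parameter \alpha.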
B. Unscented Dynamic Programming

UDP replaces the gradient and Hessian calculations in equation (6) with approximations computed from a set of sample points [11].
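UDP's particular sampling scheme is detailed in [11]; as a rough illustration of the underlying idea only, the standard unscented-transform construction generates 2n symmetric sample points from the columns of a matrix square root of a covariance-like matrix P (the choice of P and scaling shown here is one common convention, not necessarily the one used by UDP):

    import numpy as np

    def sigma_points(x, P, scale=1.0):
        """Generate 2n symmetric sample points about x from a matrix
        square root of P (standard unscented-transform construction)."""
        n = x.size
        L = np.linalg.cholesky(scale * n * P)  # columns span the sample spread
        return np.array([x + L[:, i] for i in range(n)] +
                        [x - L[:, i] for i in range(n)])

Propagating such points through the dynamics (1) and fitting a quadratic model to the results yields derivative-free stand-ins for the terms in equation (6).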