Variational Inference MPC Using Tsallis Divergence
Ziyi Wang∗†, Oswin So∗, Jason Gibson, Bogdan Vlahov, Manan S. Gandhi, Guan-Horng Liu and Evangelos A. Theodorou
Autonomous Control and Decision Systems Lab
Georgia Institute of Technology, Atlanta, Georgia
∗These authors contributed equally
†Correspondence to: [email protected]

Abstract—In this paper, we provide a generalized framework for Variational Inference-Stochastic Optimal Control by using the non-extensive Tsallis divergence. By incorporating the deformed exponential function into the optimality likelihood function, a novel Tsallis Variational Inference-Model Predictive Control algorithm is derived, which includes prior works such as Variational Inference-Model Predictive Control, Model Predictive Path Integral Control, Cross Entropy Method, and Stein Variational Inference Model Predictive Control as special cases. The proposed algorithm allows for effective control of the cost/reward transform and is characterized by superior performance in terms of mean and variance reduction of the associated cost. The aforementioned features are supported by a theoretical and numerical analysis on the level of risk sensitivity of the proposed algorithm, as well as simulation experiments on 5 different robotic systems with 3 different policy parameterizations.

I. INTRODUCTION

Variational Inference (VI) is a powerful tool for approximating the posterior distribution of unobserved random variables [1]. VI recasts the approximation problem as an optimization problem. Instead of directly approximating the target distribution p(z|x) of the latent variable z, VI minimizes the Kullback-Leibler (KL) divergence between a tractable variational distribution q(z) and the target distribution. Due to its faster convergence and comparable performance to Markov Chain Monte Carlo sampling methods, VI has received increasing attention in machine learning and robotics [2, 3, 4].

VI has been applied to Stochastic Optimal Control (SOC) problems recently. In Okada and Taniguchi [5], the authors formulated the SOC problem as a VI problem by setting the desired policy distribution as the target distribution. The VI-SOC framework works directly in the space of policy distributions instead of the specific policy parameterizations used in most SOC and Reinforcement Learning (RL) frameworks. This gives rise to a unified derivation for a variety of parametric policy distributions, such as the unimodal Gaussian and Gaussian mixture in Okada and Taniguchi [5]. Lambert et al. [6] derived the VI-SOC algorithm for a non-parametric policy distribution using Stein variational gradient descent. Apart from the policy distribution, the VI-SOC framework is characterized by two components: the optimality likelihood function and the distributional distance metric. The optimality likelihood function measures the likelihood of trajectory samples being optimal and defines the cost/reward transform, allowing for different algorithmic developments. The distributional distance metric is the measure of distance between the variational distribution and the target distribution.

In existing VI-SOC works, the KL divergence is used as the distributional distance metric due to its simplicity. On the other hand, recent advances in VI research involve extending the framework to other statistical divergences, such as the α-divergence [7] and χ-divergence [8]. In Wang et al. [9] and Regli and Silva [10], the authors proposed variants of the α-divergence to improve the performance and robustness of the inference algorithm. Wan et al. [11] further extended the VI framework to the f-divergence, a broad statistical divergence family that recovers the KL, α- and χ-divergences as special cases.

Tsallis divergence is another generalized divergence, rooted in non-extensive statistical mechanics [12]. The Tsallis divergence and Tsallis entropy are generalizations of the KL divergence and Shannon entropy, respectively. The Tsallis entropy is non-additive (hence non-extensive) in the sense that, for a system composed of (probabilistically) independent subsystems, the total entropy differs from the sum of the entropies of the subsystems [13]. Over the last two decades, an increasing number of complex natural, artificial and social systems have verified the predictions and consequences derived from the Tsallis entropy and divergence [14, 15, 16, 17]. Lee et al. [17] demonstrated that Tsallis entropy regularization leads to improved performance and faster convergence in RL applications.
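As a concrete illustration of the non-additivity described above, the following is a minimal numerical sketch (not from the paper) using the discrete form of the Tsallis entropy, S_r(p) = (1 − Σ_i p_i^r)/(r − 1). For independent subsystems A and B it satisfies S_r(A, B) = S_r(A) + S_r(B) + (1 − r) S_r(A) S_r(B), so the joint entropy equals the sum only in the Shannon limit r = 1.

```python
import numpy as np

def tsallis_entropy(p, r):
    """Discrete Tsallis entropy; recovers the Shannon entropy as r -> 1."""
    p = np.asarray(p, dtype=float)
    if np.isclose(r, 1.0):
        return -np.sum(p * np.log(p))
    return (1.0 - np.sum(p**r)) / (r - 1.0)

p_a = np.array([0.7, 0.3])           # subsystem A
p_b = np.array([0.5, 0.4, 0.1])      # subsystem B, independent of A
p_ab = np.outer(p_a, p_b).ravel()    # joint distribution of the composed system

for r in (1.0, 2.0):
    joint = tsallis_entropy(p_ab, r)
    summed = tsallis_entropy(p_a, r) + tsallis_entropy(p_b, r)
    print(r, round(joint, 4), round(summed, 4))  # equal only at r = 1
```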
In this paper, we provide a generalized formulation of the VI-SOC framework using the Tsallis divergence and introduce a novel Model Predictive Control (MPC) algorithm. The main contribution of our work is threefold:
• We propose the Tsallis VI-MPC algorithm, which allows for additional control of the shape of the cost transform compared to previous VI-MPC algorithms using the KL divergence.
• We provide a holistic view of the connections between Tsallis VI-SOC and the state-of-the-art Model Predictive Path Integral (MPPI) control, Cross Entropy Method (CEM) and Stochastic Search (SS) methods.
• We show, both analytically and numerically, that the proposed Tsallis VI-SOC framework achieves lower cost variance than MPPI and CEM. We further demonstrate the superior performance of Tsallis VI-SOC in mean cost and variance minimization on simulated systems from control theory and robotics for 3 different choices of policy distributions.

The rest of this paper is organized as follows: in Section II, we review KL VI-SOC and derive the Tsallis VI-SOC framework. In Section III, we discuss the connections between Tsallis VI-SOC and related works. A reparameterization and analysis of the framework is included in Section IV, and we propose the novel Tsallis VI-MPC algorithm in Section V. We showcase the performance of the proposed algorithm against related works in Section VI and conclude the paper in Section VII.

TABLE I: Notation comparison between Variational Inference and Variational Inference-Stochastic Optimal Control.

            VI                                     VI-SOC
Notation    Meaning                    Notation    Meaning
x           data                       o           optimality
z           latent variable            τ           trajectory
p(z)        prior distribution         p(τ)        prior distribution
q(z)        variational distribution   q(τ)        controlled distribution
p(x|z)      generative model           p(o|τ)      optimality likelihood
p(z|x)      posterior distribution     p(τ|o)      optimal distribution

II. TSALLIS VARIATIONAL INFERENCE-STOCHASTIC OPTIMAL CONTROL

In this section, we review the VI-SOC formulation using the KL divergence [5] and derive the novel Tsallis divergence VI-SOC framework.

A. SOC Problem Formulation

In discrete time SOC, we work with trajectories τ := {X, U} of states, X := {x_0, x_1, ..., x_T} ∈ R^{n_x×(T+1)}, and controls, U := {u_0, u_1, ..., u_{T−1}} ∈ R^{n_u×T}, over a finite time horizon T > 0. The goal in SOC problems is to minimize the expected cost defined by an arbitrary cost function J : R^{n_x×(T+1)} × R^{n_u×T} → R_+, with initial state distribution p(x_0) and state transition probability p(x_{t+1}|x_t, u_t) corresponding to stochastic dynamics x_{t+1} = F(x_t, u_t, ε_t), where ε_t ∈ R^n is the system stochasticity.
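For concreteness, the following is a minimal sketch (not from the paper) of a single stochastic rollout under this problem setup; the double-integrator dynamics, quadratic cost, and Gaussian noise model are placeholder assumptions used only for illustration.

```python
import numpy as np

def rollout_cost(F, J, x0, U, noise_std, rng):
    """One stochastic rollout of x_{t+1} = F(x_t, u_t, eps_t) and its cost J(X, U).

    Averaging over many rollouts (and over draws of x_0 ~ p(x_0)) gives a Monte
    Carlo estimate of the expected cost that the SOC problem seeks to minimize.
    """
    X = [np.asarray(x0, dtype=float)]
    for u in np.asarray(U):                                    # U has shape (T, n_u)
        eps = rng.normal(scale=noise_std, size=X[-1].shape)    # system stochasticity eps_t
        X.append(F(X[-1], u, eps))
    return J(np.stack(X), U)                                   # X stacked to shape (T + 1, n_x)

# Placeholder system: noisy double integrator with a quadratic trajectory cost.
dt = 0.1
F = lambda x, u, eps: np.array([x[0] + dt * x[1], x[1] + dt * u[0]]) + eps
J = lambda X, U: float(np.sum(X**2) + 0.1 * np.sum(np.asarray(U)**2))

rng = np.random.default_rng(0)
print(rollout_cost(F, J, x0=[1.0, 0.0], U=np.zeros((20, 1)), noise_std=0.01, rng=rng))
```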
B. KL VI-SOC

We can formulate the SOC problem as an inference problem and apply methods from VI. To apply VI to SOC, we introduce a dummy optimality variable o ∈ {0, 1}, with o = 1 indicating that the trajectory τ = {X, U} is optimal. Table I compares the differences in notation between conventional VI methods and SOC formulated as a VI problem.

The objective of VI-SOC is to sample from the posterior distribution

    p(\tau \mid o) = \frac{p(o = 1 \mid \tau)\, p(\tau)}{p(o)} = \frac{p(o = 1 \mid \tau)}{p(o = 1)}\, p(x_0) \prod_{t=0}^{T-1} p(x_{t+1} \mid x_t, u_t)\, p(u_t).    (1)

Here p(x_{t+1}|x_t, u_t) is the state transition probability, p(x_0) is the initial state distribution, and p(u_t) is some prior control distribution (e.g. a zero-mean Gaussian or uniform distribution). For simplicity, we use o to indicate o = 1 and o′ for o = 0 from here on. We can now formulate the VI-SOC objective as minimizing the distance between a controlled distribution q(τ) and the target distribution p(τ|o):

    q^*(\tau) = \arg\min_{q(\tau)} D_{KL}\big(q(\tau) \,\|\, p(\tau \mid o)\big)
              = \arg\min_{q(\tau)} \mathbb{E}_{q(\tau)}\left[\log \frac{p(X \mid U)\, q(U)}{\frac{p(o \mid \tau)}{p(o)}\, p(X \mid U)\, p(U)}\right]
              = \arg\min_{q(\tau)} \mathbb{E}_{q(\tau)}\left[-\log p(o \mid \tau) + \sum_{t=0}^{T-1} \log \frac{q(u_t)}{p(u_t)}\right],    (2)

where the constant p(o) is dropped. The first term in the objective measures the likelihood of a trajectory being optimal, while the second term serves as regularization. Splitting the expectation in the first term, we get

    q^*(U) = \arg\min_{q(U)} \left\{ -\mathbb{E}_{q(U)}\big[\mathbb{E}_{p(X \mid U)}[\log p(o \mid \tau)]\big] + D_{KL}\big(q(U) \,\|\, p(U)\big) \right\},    (3)

where p(U) and q(U) represent \prod_{t=0}^{T-1} p(u_t) and \prod_{t=0}^{T-1} q(u_t) respectively due to independence, and p(X \mid U) = p(x_0) \prod_{t=0}^{T-1} p(x_{t+1} \mid x_t, u_t).

Optimality Likelihood: The optimality likelihood in Eq. (3) can be parameterized by a non-increasing function of the trajectory cost J(X, U) as p(o|τ) := f(J(X, U)). The monotonicity requirement ensures that trajectories incurring higher costs are always less likely to be optimal. Common choices of f include f(x) = exp(−x) and f(x) = 1_{\{x ≤ γ\}}. In this paper, we choose f(x) = exp(−x), such that log p(o|τ) = −J(X, U). To avoid excessive notation, we abuse notation to define J(U) := \mathbb{E}_{p(X \mid U)}[J(X, U)] and its N-sample empirical approximation J = \sum_{n=1}^{N} J^{(n)}. Hence, Eq. (3) takes the form

    q^*(U) = \arg\min_{q(U)} \mathbb{E}_{q(U)}[J(X, U)] + D_{KL}\big(q(U) \,\|\, p(U)\big),    (4)

which can be interpreted as an application of the famous maximum entropy principle.

C. Tsallis VI-SOC

In this subsection, we use the Tsallis divergence as the regularization function and derive the Tsallis VI-SOC algorithm. First, we define the deformed logarithm and exponential as

    \log_r(x) = \frac{x^{r-1} - 1}{r - 1},    (5)

    \exp_r(x) = \big(1 + (r - 1)x\big)_+^{\frac{1}{r-1}},    (6)

where (·)_+ := max(0, ·) and r > 0. Using \log_r and \exp_r, we can define the Tsallis entropy and the corresponding Tsallis divergence [12] as

    S_r(q(z)) := -\mathbb{E}_q[\log_r q(z)] = -\frac{1}{r - 1}\left(\int q(z)^r \, \mathrm{d}z - 1\right),    (7)

    D_r\big(q(z) \,\|\, p(z)\big) := \mathbb{E}_q\left[\log_r \frac{q(z)}{p(z)}\right] = \frac{1}{r - 1}\left(\int q(z) \left(\frac{q(z)}{p(z)}\right)^{r-1} \mathrm{d}z - 1\right).    (8)

Note that as r → 1, \log_r → \log, \exp_r → \exp, and D_r → D_{KL}.

... and by the empirical distribution \tilde{q}^*(U) with weights w^{(n)}:

    \tilde{q}^*(U) = \sum_{n=1}^{N} w^{(n)} \mathbb{1}_{\{U = U^{(n)}\}},    (12)

    w^{(n)} = \frac{q^*(U^{(n)})}{\sum_{n'=1}^{N} q^*(U^{(n')})}.    (13)

We now solve Eq.
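To make the deformed functions of Eqs. (5)-(6) and the self-normalized weights of Eqs. (12)-(13) concrete, the following is a minimal numerical sketch. Weighting sampled control sequences by \exp_r of their negative costs is an illustrative assumption in the spirit of the deformed optimality likelihood described in the abstract, not the exact update derived later in the paper.

```python
import numpy as np

def log_r(x, r):
    """Deformed logarithm, Eq. (5); recovers np.log as r -> 1."""
    x = np.asarray(x, dtype=float)
    return np.log(x) if np.isclose(r, 1.0) else (x**(r - 1) - 1.0) / (r - 1)

def exp_r(x, r):
    """Deformed exponential, Eq. (6); recovers np.exp as r -> 1."""
    x = np.asarray(x, dtype=float)
    if np.isclose(r, 1.0):
        return np.exp(x)
    return np.maximum(0.0, 1.0 + (r - 1) * x) ** (1.0 / (r - 1))

# Illustrative weights over N = 4 sampled control sequences with costs J^{(n)}:
# larger r truncates high-cost samples entirely, while r -> 1 recovers the
# standard exponential (MPPI-style) weighting.
costs = np.array([1.0, 2.0, 5.0, 0.5])
for r in (1.0, 1.5, 2.0):
    w = exp_r(-costs, r)
    w = w / w.sum()              # self-normalization as in Eq. (13)
    print(r, np.round(w, 3))
```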