
Robust Filtering and Smoothing with Gaussian Processes

Marc Peter Deisenroth, Ryan Turner, Member, IEEE, Marco F. Huber, Member, IEEE, Uwe D. Hanebeck, Member, IEEE, and Carl Edward Rasmussen

Abstract—We propose a principled algorithm for robust Bayesian filtering and smoothing in nonlinear stochastic dynamic systems when both the transition and the measurement function are described by non-parametric Gaussian process (GP) models. GPs are gaining increasing importance in machine learning, signal processing, robotics, and control for representing unknown system functions by posterior probability distributions. This modern way of "system identification" is more robust than finding point estimates of a parametric function representation. Our principled filtering/smoothing approach for GP dynamic systems is based on analytic moment matching in the context of the forward-backward algorithm. Our numerical evaluations demonstrate the robustness of the proposed approach in situations where other state-of-the-art Gaussian filters and smoothers can fail.

Index Terms—Nonlinear systems, smoothing, filtering, Gaussian processes, machine learning

This work was supported in part by ONR MURI under Grant N00014-09-1-1052, by Intel Labs, and by DataPath, Inc. M. P. Deisenroth is with the TU Darmstadt, Germany, and also with the University of Washington, Seattle, USA (email: [email protected]). R. D. Turner is with Winton Capital, London, UK, and the University of Cambridge, UK (email: [email protected]). M. F. Huber is with the AGT Group (R&D) GmbH, Darmstadt, Germany (email: [email protected]). U. D. Hanebeck is with the Karlsruhe Institute of Technology (KIT), Germany (email: [email protected]). C. E. Rasmussen is with the University of Cambridge, UK, and also with the Max Planck Institute for Biological Cybernetics, Tübingen, Germany (email: [email protected]).

I. INTRODUCTION

Filtering and smoothing in the context of dynamic systems refers to a Bayesian methodology for computing posterior distributions of the latent state based on a history of noisy measurements. This kind of methodology can be found, e.g., in navigation, control engineering, robotics, and machine learning [1]–[4]. Solutions to filtering [1]–[5] and smoothing [6]–[9] in linear dynamic systems are well known, and numerous approximations for nonlinear systems have been proposed, for both filtering [10]–[15] and smoothing [16]–[19].

In this note, we focus on Gaussian filtering and smoothing in Gaussian process (GP) dynamic systems. GPs are a robust non-parametric method for approximating unknown functions by a posterior distribution over them [20], [21]. Although GPs have been around for decades, they only recently became computationally interesting for applications in robotics, control, and machine learning [22]–[26].

The contribution of this note is the derivation of a novel, principled, and robust Rauch-Tung-Striebel (RTS) smoother for GP dynamic systems, which we call the GP-RTSS. The GP-RTSS computes a Gaussian approximation to the smoothing distribution in closed form. The posterior filtering and smoothing distributions can be computed without linearization [10] or approximations of densities [11].

We provide numerical evidence that the GP-RTSS is more robust than state-of-the-art nonlinear Gaussian filtering and smoothing algorithms, including the extended Kalman filter (EKF) [10], the unscented Kalman filter (UKF) [11], the cubature Kalman filter (CKF) [15], the GP-UKF [12], and their corresponding RTS smoothers. Robustness refers to the ability of an inferred distribution to explain the "true" state/measurement.

The paper is structured as follows: In Sections I-A and I-B, we introduce the problem setup and necessary background on Gaussian smoothing and GP dynamic systems. In Section II, we briefly introduce Gaussian process regression, discuss the expressiveness of a GP, and explain how to train GPs. Section III details our proposed method (GP-RTSS) for smoothing in GP dynamic systems. In Section IV, we provide experimental evidence of the robustness of the GP-RTSS. Section V concludes the paper with a discussion.

A. Problem Formulation and Notation

In this note, we consider discrete-time stochastic systems

    x_t = f(x_{t-1}) + w_t    (1)
    z_t = g(x_t) + v_t        (2)

where x_t ∈ R^D is the state, z_t ∈ R^E is the measurement at time step t, w_t ∼ N(0, Σ_w) is Gaussian system noise, v_t ∼ N(0, Σ_v) is Gaussian measurement noise, f is the transition function (or system function), and g is the measurement function. The discrete time steps t run from 0 to T. The initial state x_0 of the time series is distributed according to a Gaussian prior distribution p(x_0) = N(μ^x_0, Σ^x_0).

The purpose of filtering and smoothing is to find approximations to the posterior distributions p(x_t | z_{1:τ}), where 1:τ in a subindex abbreviates 1,...,τ with τ = t during filtering and τ = T during smoothing. In this note, we consider Gaussian approximations p(x_t | z_{1:τ}) ≈ N(x_t | μ^x_{t|τ}, Σ^x_{t|τ}) of the latent state posterior distributions p(x_t | z_{1:τ}). We use the short-hand notation a^d_{b|c}, where a = μ denotes the mean and a = Σ denotes the covariance, b denotes the time step under consideration, c denotes the time step up to which we consider measurements, and d ∈ {x, z} denotes either the latent space (x) or the observed space (z).

B. Gaussian RTS Smoothing

Given the filtering distributions p(x_t | z_{1:t}) = N(x_t | μ^x_{t|t}, Σ^x_{t|t}), t = 1,...,T, a sufficient condition for Gaussian smoothing is the computation of Gaussian approximations of the joint distributions p(x_{t-1}, x_t | z_{1:t-1}), t = 1,...,T [19].

In Gaussian smoothers, the standard smoothing distribution for the dynamic system in (1)–(2) is always

    p(x_{t-1} | z_{1:T}) = N(x_{t-1} | μ^x_{t-1|T}, Σ^x_{t-1|T}), where    (3)
    μ^x_{t-1|T} = μ^x_{t-1|t-1} + J_{t-1}(μ^x_{t|T} − μ^x_{t|t-1})    (4)
    Σ^x_{t-1|T} = Σ^x_{t-1|t-1} + J_{t-1}(Σ^x_{t|T} − Σ^x_{t|t-1}) J_{t-1}^⊤    (5)
    J_{t-1} := Σ^x_{t-1,t|t-1} (Σ^x_{t|t-1})^{-1},   t = T,...,1.    (6)

Depending on the methodology of computing this joint distribution, we can directly derive arbitrary RTS smoothing algorithms, including the URTSS [16], the EKS [1], [10], the CKS [19], a smoothing extension to the CKF [15], or the GP-URTSS, a smoothing extension to the GP-UKF [12]. The individual smoothers (URTSS, EKS, CKS, GP-based smoothers, etc.) simply differ in the way of computing/estimating the means and covariances required in (4)–(6) [19].

To derive the GP-URTSS, we closely follow the derivation of the URTSS [16]. The GP-URTSS is a novel smoother, but its derivation is relatively straightforward and therefore not detailed in this note. Instead, we detail the derivation of the GP-RTSS, a robust Rauch-Tung-Striebel smoother for GP dynamic systems, which is based on the analytic computation of the means and (cross-)covariances in (4)–(6).

In GP dynamic systems, the transition function f and the measurement function g in (1)–(2) are modeled by Gaussian processes. This setup is getting more relevant in practical applications such as robotics and control, where it can be difficult to find an accurate parametric form of f and g, respectively [25], [27]. Given the increasing use of GP models in robotics and control, the robustness of Bayesian state estimation is important.
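The recursion (3)–(6) only requires the filtered moments and the predicted joint moments of (x_{t−1}, x_t), independently of how these moments are computed (EKF-, UKF-, or GP-based). The following sketch (our own illustration, not the paper's published Matlab code) makes this generic backward pass explicit; the argument names are assumptions of the sketch.

```python
import numpy as np

def rts_backward_pass(mu_f, P_f, mu_p, P_p, C_p):
    """Generic Gaussian RTS backward recursion, cf. (3)-(6).

    mu_f[t], P_f[t] : filtered mean/cov of x_t given z_{1:t}
    mu_p[t], P_p[t] : predicted mean/cov of x_t given z_{1:t-1}
    C_p[t]          : cross-covariance cov[x_{t-1}, x_t | z_{1:t-1}]
    (Entries with index t refer to the prediction from t-1 to t; entry 0 is unused.)
    """
    T = len(mu_f) - 1
    mu_s = [m.copy() for m in mu_f]          # initialize with the filter results
    P_s = [P.copy() for P in P_f]            # the smoother at t = T equals the filter
    for t in range(T, 0, -1):
        J = np.linalg.solve(P_p[t], C_p[t].T).T          # J_{t-1}, cf. (6)
        mu_s[t-1] = mu_f[t-1] + J @ (mu_s[t] - mu_p[t])  # cf. (4)
        P_s[t-1] = P_f[t-1] + J @ (P_s[t] - P_p[t]) @ J.T  # cf. (5)
    return mu_s, P_s
```

The GP-RTSS derived in Section III plugs its analytically computed predicted and cross-covariance moments into exactly this recursion.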

II. GAUSSIAN PROCESSES

In the standard GP regression model, we assume that the data D := {X := [x_1,...,x_n]^⊤, y := [y_1,...,y_n]^⊤} have been generated according to y_i = h(x_i) + ε_i, where h : R^D → R and ε_i ∼ N(0, σ²_ε) is independent (measurement) noise. GPs consider h a random function and infer a posterior distribution over h from data. The posterior is used to make predictions about function values h(x_*) for arbitrary inputs x_* ∈ R^D.

Similar to a Gaussian distribution, which is fully specified by a mean vector and a covariance matrix, a GP is fully specified by a mean function m_h(·) and a covariance function

    k_h(x, x') := E_h[(h(x) − m_h(x))(h(x') − m_h(x'))]    (7)
                = cov_h[h(x), h(x')] ∈ R,   x, x' ∈ R^D    (8)

which specifies the covariance between any two function values. Here, E_h denotes the expectation with respect to the function h. The covariance function k_h(·,·) is also called a kernel.

Unless stated otherwise, we consider a prior mean function m_h ≡ 0 and use the squared exponential (SE) covariance function with automatic relevance determination

    k_SE(x_p, x_q) := α² exp(−½ (x_p − x_q)^⊤ Λ^{-1} (x_p − x_q))    (9)

for x_p, x_q ∈ R^D, plus a noise covariance function k_noise := δ_pq σ²_ε, such that k_h = k_SE + k_noise. Here, δ_pq denotes the Kronecker symbol that is unity when p = q and zero otherwise, resulting in i.i.d. measurement noise. In (9), Λ = diag([ℓ²_1,...,ℓ²_D]) is a diagonal matrix of squared characteristic length-scales ℓ_i, i = 1,...,D, and α² is the signal variance of the latent function h. By using the SE covariance function from (9), we assume that the latent function h is smooth and stationary. Smoothness and stationarity are easier to justify than a fixed parametric form of the underlying function. Examples of covariance functions that encode different model assumptions are given in [21].
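For concreteness, a minimal NumPy sketch of the SE covariance function (9) with automatic relevance determination is shown below (illustrative code of ours; the names k_se, alpha2, and lengthscales are not from the paper).

```python
import numpy as np

def k_se(Xp, Xq, alpha2, lengthscales):
    """SE (ARD) covariance function, cf. (9).

    Xp: (n_p, D) inputs, Xq: (n_q, D) inputs,
    alpha2: signal variance alpha^2,
    lengthscales: (D,) characteristic length-scales l_1..l_D.
    Returns the (n_p, n_q) kernel matrix.
    """
    diff = (Xp[:, None, :] - Xq[None, :, :]) / lengthscales   # scaled differences
    sq_dist = np.sum(diff ** 2, axis=-1)                      # (x_p-x_q)^T Lambda^{-1} (x_p-x_q)
    return alpha2 * np.exp(-0.5 * sq_dist)

# Example: Gram matrix k_h = k_SE + k_noise on training inputs X
# K = k_se(X, X, alpha2, lengthscales) + sigma_eps**2 * np.eye(len(X))
```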
A. Expressiveness of the Model

Although the SE covariance function k_SE and the prior mean function m_h ≡ 0 are common defaults, they retain a great deal of expressiveness. Inspired by [20], [28], we demonstrate this expressiveness and show the correspondence of our GP model to a universal function approximator: Consider a function

    h(x) = lim_{N→∞} (1/N) ∑_{i∈Z} ∑_{n=1}^{N} γ_n exp(−(x − (i + n/N))² / λ²)    (10)

where γ_n ∼ N(0, 1), n = 1,...,N. Note that in the limit, h(x) is represented by infinitely many Gaussian-shaped basis functions along the real axis with width λ/√2 and prior (Gaussian) random weights γ_n, for x ∈ R and for all i ∈ Z. The model in (10) is considered a universal function approximator. Writing the sums in (10) as an integral over the real axis R, we obtain

    h(x) = ∑_{i∈Z} ∫_i^{i+1} γ(s) exp(−(x − s)²/λ²) ds
         = ∫_{−∞}^{∞} γ(s) exp(−(x − s)²/λ²) ds = (γ ∗ K)(x)    (11)

where γ(s) ∼ N(0, 1) is a white-noise process and K is a Gaussian convolution kernel. The function values of h are jointly normal, which follows from the convolution γ ∗ K. We now analyze the mean function and the covariance function of h, which fully specify the distribution of h. The only random variables are the weights γ(s). Computing the expected function of this model (prior mean function) requires averaging over γ(s) and yields

    E_γ[h(x)] = ∫ h(x) p(γ(s)) dγ(s)    (12)
              = ∫ exp(−(x − s)²/λ²) ∫ γ(s) p(γ(s)) dγ(s) ds = 0    (13)

where we used (11) and E_γ[γ(s)] = 0. Hence, the mean function of h equals zero everywhere. Let us now find the covariance function. Since the mean function equals zero, for any x, x' ∈ R we obtain

    cov_γ[h(x), h(x')] = ∫ h(x) h(x') p(γ(s)) dγ(s)
      = ∫ exp(−(x − s)²/λ²) exp(−(x' − s)²/λ²) ∫ γ(s)² p(γ(s)) dγ(s) ds    (14)

where we used the definition of h in (11). Using var_γ[γ(s)] = 1 and completing the squares yields

    cov_γ[h(x), h(x')] = ∫ exp(−[2(s − (x + x')/2)² + (x − x')²/2] / λ²) ds
                       = α² exp(−(x − x')²/(2λ²))    (15)

for a suitable α².

From (13) and (15), we see that the mean function and the covariance function of the universal function approximator in (10) correspond to the GP model assumptions we made earlier: a prior mean function m_h ≡ 0 and the SE covariance function in (9) for a one-dimensional input space. Hence, the considered GP prior implicitly assumes latent functions h that can be described by the universal function approximator in (11).
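The correspondence between (11) and the SE covariance (15) can also be verified numerically. The sketch below (our own check, assuming the standard √Δs scaling for discretized white noise, which (10)–(11) leave implicit) draws realizations of the convolution construction and compares their empirical correlation with exp(−(x − x')²/(2λ²)).

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 0.5                                   # basis-function parameter lambda in (10)-(11)
s = np.arange(-4.0, 4.0, 0.01)              # integration grid for the white-noise process
ds = s[1] - s[0]
x = np.linspace(-1.0, 1.0, 21)              # test inputs

# Basis matrix exp(-(x - s)^2 / lambda^2), cf. (11)
Phi = np.exp(-((x[:, None] - s[None, :]) ** 2) / lam ** 2)

# Realizations h(x) = sum_s gamma_s exp(-(x - s)^2 / lambda^2) sqrt(ds)
gammas = rng.standard_normal((len(s), 10000))
H = Phi @ gammas * np.sqrt(ds)              # each column is one realization of h on the grid x

C_emp = np.cov(H)                           # empirical covariance of the function values
corr_emp = C_emp[0, :] / C_emp[0, 0]        # empirical correlation with h(x_0)
corr_se = np.exp(-((x - x[0]) ** 2) / (2 * lam ** 2))  # SE correlation, cf. (15)
print(np.max(np.abs(corr_emp - corr_se)))   # small, up to Monte Carlo error
```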
B. Training via Evidence Maximization

For E target dimensions, we train E GPs assuming that the target dimensions are independent at a deterministically given test input (if the test input is uncertain, the target dimensions covary): After observing a data set D, for each (training) target dimension, we learn the D + 1 hyper-parameters of the covariance function and the noise variance of the data using evidence maximization [20], [21]: Collecting all (D + 2)E hyper-parameters in the vector θ, evidence maximization yields a point estimate θ̂ ∈ argmax_θ log p(y | X, θ). Evidence maximization automatically trades off data fit with function complexity and avoids overfitting [21].

From here onward, we consider the GP dynamic system setup, where two GP models have been trained using evidence maximization: GP_f, which models the mapping x_{t−1} ↦ x_t, R^D → R^D, see (1), and GP_g, which models the mapping x_t ↦ z_t, R^D → R^E, see (2). To keep the notation uncluttered, we do not explicitly condition on the hyper-parameters θ̂ and the training data D in the following.
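A minimal sketch of evidence maximization for a single target dimension is given below (our own illustration; it uses the standard log marginal likelihood log p(y|X, θ) = −½ y^⊤ K^{-1} y − ½ log |K| − (n/2) log 2π with K = K_SE + σ²_ε I, reuses k_se from the sketch in Section II, and relies on a generic optimizer rather than analytic gradients).

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_evidence(log_theta, X, y):
    """Negative log marginal likelihood -log p(y | X, theta) for one target dimension.

    log_theta = [log l_1, ..., log l_D, log alpha^2, log sigma_eps^2].
    """
    D = X.shape[1]
    lengthscales = np.exp(log_theta[:D])
    alpha2 = np.exp(log_theta[D])
    sigma2 = np.exp(log_theta[D + 1])
    K = k_se(X, X, alpha2, lengthscales) + sigma2 * np.eye(len(X))  # k_h = k_SE + k_noise
    L = np.linalg.cholesky(K)
    a = np.linalg.solve(L.T, np.linalg.solve(L, y))                 # K^{-1} y
    return (0.5 * y @ a + np.sum(np.log(np.diag(L)))
            + 0.5 * len(y) * np.log(2 * np.pi))

# Point estimate via evidence maximization (one GP per target dimension):
# res = minimize(neg_log_evidence, x0=np.zeros(X.shape[1] + 2), args=(X, y_a))
# theta_hat = np.exp(res.x)
```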
III. ROBUST SMOOTHING IN GAUSSIAN PROCESS DYNAMIC SYSTEMS

Analytic moment-based filtering in GP dynamic systems has been proposed in [13], where the filter distribution is given by

    p(x_t | z_{1:t}) = N(x_t | μ^x_{t|t}, Σ^x_{t|t})    (16)
    μ^x_{t|t} = μ^x_{t|t−1} + Σ^{xz}_{t|t−1} (Σ^z_{t|t−1})^{-1} (z_t − μ^z_{t|t−1})    (17)
    Σ^x_{t|t} = Σ^x_{t|t−1} − Σ^{xz}_{t|t−1} (Σ^z_{t|t−1})^{-1} Σ^{zx}_{t|t−1}    (18)

for t = 1,...,T. Here, we extend these filtering results to analytic moment-based smoothing, where we explicitly take nonlinearities into account (no linearization required) while propagating full Gaussian densities (no sigma/cubature-point representation required) through nonlinear GP models.

In the following, we detail our novel RTS smoothing approach for GP dynamic systems. We fit our smoother in the standard frame of (4)–(6). For this, we compute the means and covariances of the Gaussian approximation

    N( [x_{t−1}; x_t] | [μ^x_{t−1|t−1}; μ^x_{t|t−1}], [Σ^x_{t−1|t−1}, Σ^x_{t−1,t|t−1}; Σ^x_{t,t−1|t−1}, Σ^x_{t|t−1}] )    (19)

to the joint p(x_{t−1}, x_t | z_{1:t−1}), after which the smoother is fully determined [19]. Our approximation does not involve sampling, linearization, or numerical integration. Instead, we present closed-form expressions of a deterministic Gaussian approximation of the joint distribution in (19).

In our case, the mapping x_{t−1} ↦ x_t is not known, but instead it is distributed according to GP_f, a distribution over system functions. For robust filtering and smoothing, we therefore need to take the GP (model) uncertainty into account by Bayesian averaging according to the GP distribution [13], [29]. The marginal p(x_{t−1} | z_{1:t−1}) = N(μ^x_{t−1|t−1}, Σ^x_{t−1|t−1}) is known from filtering [13]. In Section III-A, we compute the mean and covariance of the second marginal p(x_t | z_{1:t−1}), and then in Section III-B the cross-covariance terms Σ^x_{t−1,t|t−1} = cov[x_{t−1}, x_t | z_{1:t−1}].

A. Marginal Distribution

1) Marginal Mean: Using the system equation (1) and integrating over all three sources of uncertainty (the system noise, the state x_{t−1}, and the system function itself), we apply the law of total expectation and obtain the marginal mean

    μ^x_{t|t−1} = E_{x_{t−1}}[ E_f[f(x_{t−1}) | x_{t−1}] | z_{1:t−1} ].    (20)

The expectations in (20) are taken with respect to the posterior GP distribution p(f) and the filter distribution p(x_{t−1} | z_{1:t−1}) = N(μ^x_{t−1|t−1}, Σ^x_{t−1|t−1}) at time step t − 1. Equation (20) can be rewritten as μ^x_{t|t−1} = E_{x_{t−1}}[m_f(x_{t−1}) | z_{1:t−1}], where m_f(x_{t−1}) := E_f[f(x_{t−1}) | x_{t−1}] is the posterior mean function of GP_f. Writing m_f as a finite sum over the SE kernels centered at all n training inputs [21], the predicted mean for each target dimension a = 1,...,D is

    (μ^x_{t|t−1})_a = ∫ m_{f_a}(x_{t−1}) p(x_{t−1} | z_{1:t−1}) dx_{t−1}    (21)
                    = ∑_{i=1}^{n} β^x_{a_i} ∫ k_{f_a}(x_{t−1}, x_i) p(x_{t−1} | z_{1:t−1}) dx_{t−1}

where p(x_{t−1} | z_{1:t−1}) = N(x_{t−1} | μ^x_{t−1|t−1}, Σ^x_{t−1|t−1}) is the filter distribution at time t − 1. Moreover, x_i, i = 1,...,n, are the training inputs of GP_f, k_{f_a} is the covariance function of GP_f for the a-th target dimension (GP hyper-parameters are not shared across dimensions), and β^x_a := (K_{f_a} + σ²_{w_a} I)^{-1} y_a ∈ R^n. For dimension a, K_{f_a} denotes the kernel matrix (Gram matrix), where K_{f_a}(i, j) = k_{f_a}(x_i, x_j), i, j = 1,...,n. Moreover, y_a are the training targets, and σ²_{w_a} is the learned system noise variance. The vector β^x_a has been pulled out of the integration since it is independent of x_{t−1}. Note that x_{t−1} serves as a test input from the perspective of the GP regression model.

For the SE covariance function in (9), the integral in (21) can be computed analytically (other tractable choices are covariance functions containing combinations of squared exponentials, trigonometric functions, and polynomials). The marginal mean is given as

    (μ^x_{t|t−1})_a = (β^x_a)^⊤ q^x_a    (22)

where we defined

    q^x_{a_i} := α²_{f_a} |Σ^x_{t−1|t−1} Λ_a^{-1} + I|^{−1/2} exp(−½ (x_i − μ^x_{t−1|t−1})^⊤ S^{-1} (x_i − μ^x_{t−1|t−1}))    (23)
    S := Σ^x_{t−1|t−1} + Λ_a    (24)

i = 1,...,n, being the solution to the integral in (21). Here, α²_{f_a} is the signal variance of the a-th target dimension of GP_f, a learned hyper-parameter of the SE covariance function, see (9).
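A compact sketch of the predicted mean (22)–(24) for a single target dimension a is given below (our own illustrative code; the arguments mirror β^x_a, α²_{f_a}, Λ_a, μ^x_{t−1|t−1}, and Σ^x_{t−1|t−1}).

```python
import numpy as np

def predicted_mean_dim(X_train, beta_a, alpha2_a, Lambda_a, mu_filt, Sigma_filt):
    """Mean of the GP prediction under a Gaussian input, cf. (22)-(24).

    X_train   : (n, D) training inputs of GP_f
    beta_a    : (n,)   beta_a^x = (K_fa + sigma_wa^2 I)^{-1} y_a
    alpha2_a  : signal variance alpha_fa^2
    Lambda_a  : (D, D) diagonal matrix of squared length-scales
    mu_filt   : (D,)   filter mean mu_{t-1|t-1}^x
    Sigma_filt: (D, D) filter covariance Sigma_{t-1|t-1}^x
    """
    D = X_train.shape[1]
    S = Sigma_filt + Lambda_a                                   # cf. (24)
    diff = X_train - mu_filt                                    # x_i - mu_{t-1|t-1}^x
    det_term = 1.0 / np.sqrt(np.linalg.det(Sigma_filt @ np.linalg.inv(Lambda_a) + np.eye(D)))
    quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(S), diff)
    q_a = alpha2_a * det_term * np.exp(-0.5 * quad)             # cf. (23)
    return beta_a @ q_a                                         # cf. (22)
```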

2) Marginal Covariance Matrix: We now explicitly compute the entries of the corresponding covariance Σ^x_{t|t−1}. Using the law of total covariance, we obtain for a, b = 1,...,D

    (Σ^x_{t|t−1})_{ab} = cov_{x_{t−1},f,w}[x_t^{(a)}, x_t^{(b)} | z_{1:t−1}]
      = E_{x_{t−1}}[ cov_{f,w}[f_a(x_{t−1}) + w_a, f_b(x_{t−1}) + w_b | x_{t−1}] | z_{1:t−1} ]
      + cov_{x_{t−1}}[ E_{f_a}[f_a(x_{t−1}) | x_{t−1}], E_{f_b}[f_b(x_{t−1}) | x_{t−1}] | z_{1:t−1} ]    (25)

where we exploited in the last term that the system noise w has mean zero. Note that (25) is the sum of the covariance of (conditional) expected values and the expectation of a (conditional) covariance. We analyze these terms in the following. The covariance of the expectations in (25) is

    ∫ m_{f_a}(x_{t−1}) m_{f_b}(x_{t−1}) p(x_{t−1}) dx_{t−1} − (μ^x_{t|t−1})_a (μ^x_{t|t−1})_b    (26)

where we used that E_f[f(x_{t−1}) | x_{t−1}] = m_f(x_{t−1}). With β^x_a = (K_{f_a} + σ²_{w_a} I)^{-1} y_a and m_{f_a}(x_{t−1}) = k_{f_a}(X, x_{t−1})^⊤ β^x_a, we obtain

    cov_{x_{t−1}}[m_{f_a}(x_{t−1}), m_{f_b}(x_{t−1}) | z_{1:t−1}] = (β^x_a)^⊤ Q β^x_b − (μ^x_{t|t−1})_a (μ^x_{t|t−1})_b.    (27)

Following [30], the entries of Q ∈ R^{n×n} are given as

    Q_{ij} = k_{f_a}(x_i, μ^x_{t−1|t−1}) k_{f_b}(x_j, μ^x_{t−1|t−1}) |R|^{−1/2} exp(½ z_{ij}^⊤ R^{-1} Σ^x_{t−1|t−1} z_{ij}) = exp(n²_{ij}) |R|^{−1/2}    (28)
    n²_{ij} = log(α²_{f_a}) + log(α²_{f_b}) − ½ ( ζ_i^⊤ Λ_a^{-1} ζ_i + ζ_j^⊤ Λ_b^{-1} ζ_j − z_{ij}^⊤ R^{-1} Σ^x_{t−1|t−1} z_{ij} )

where we defined R := Σ^x_{t−1|t−1}(Λ_a^{-1} + Λ_b^{-1}) + I, ζ_i := x_i − μ^x_{t−1|t−1}, and z_{ij} := Λ_a^{-1} ζ_i + Λ_b^{-1} ζ_j.

The expected covariance in (25) is given as

    E_{x_{t−1}}[ cov_f[f_a(x_{t−1}), f_b(x_{t−1}) | x_{t−1}] | z_{1:t−1} ] + δ_{ab} σ²_{w_a}    (29)

since the noise covariance matrix Σ_w is diagonal. Following our GP training assumption that different target dimensions do not covary if the input is deterministically given, (29) is only non-zero if a = b, i.e., (29) plays a role only for the diagonal entries of Σ^x_{t|t−1}. For these diagonal entries (a = b), the expected covariance in (29) is

    α²_{f_a} − tr((K_{f_a} + σ²_{w_a} I)^{-1} Q) + σ²_{w_a}.    (30)

Hence, the desired marginal covariance matrix in (25) is

    (Σ^x_{t|t−1})_{ab} = Eq. (27) + Eq. (30) if a = b,   and   (Σ^x_{t|t−1})_{ab} = Eq. (27) otherwise.    (31)

We have now solved for the marginal distribution p(x_t | z_{1:t−1}) in (19). Since the approximate Gaussian filter distribution p(x_{t−1} | z_{1:t−1}) = N(μ^x_{t−1|t−1}, Σ^x_{t−1|t−1}) is also known, it remains to compute the cross-covariance Σ^x_{t−1,t|t−1} to fully determine the Gaussian approximation in (19).

B. Cross-Covariance

By the definition of a covariance and the system equation (1), the missing cross-covariance matrix Σ^x_{t−1,t|t−1} in (19) is

    Σ^x_{t−1,t|t−1} = E_{x_{t−1},f,w_t}[ x_{t−1} (f(x_{t−1}) + w_t)^⊤ | z_{1:t−1} ] − μ^x_{t−1|t−1} (μ^x_{t|t−1})^⊤    (32)

where μ^x_{t−1|t−1} is the mean of the filter update at time step t − 1 and μ^x_{t|t−1} is the mean of the time update, see (20). Note that we explicitly average out the model uncertainty about f. Using the law of total expectation, we obtain

    Σ^x_{t−1,t|t−1} = E_{x_{t−1}}[ x_{t−1} E_{f,w_t}[f(x_{t−1}) + w_t | x_{t−1}]^⊤ | z_{1:t−1} ] − μ^x_{t−1|t−1} (μ^x_{t|t−1})^⊤    (33)
                    = E_{x_{t−1}}[ x_{t−1} m_f(x_{t−1})^⊤ | z_{1:t−1} ] − μ^x_{t−1|t−1} (μ^x_{t|t−1})^⊤    (34)

where we used the fact that E_{f,w_t}[f(x_{t−1}) + w_t | x_{t−1}] = m_f(x_{t−1}) is the mean function of GP_f, which models the mapping x_{t−1} ↦ x_t, evaluated at x_{t−1}. We thus obtain

    Σ^x_{t−1,t|t−1} = ∫ x_{t−1} m_f(x_{t−1})^⊤ p(x_{t−1} | z_{1:t−1}) dx_{t−1} − μ^x_{t−1|t−1} (μ^x_{t|t−1})^⊤.    (35)

Writing m_f(x_{t−1}) as a finite sum over kernels [21] and moving the integration into this sum, the integration in (35) turns into

    ∫ x_{t−1} m_{f_a}(x_{t−1}) p(x_{t−1} | z_{1:t−1}) dx_{t−1} = ∑_{i=1}^{n} β^x_{a_i} ∫ x_{t−1} k_{f_a}(x_{t−1}, x_i) p(x_{t−1} | z_{1:t−1}) dx_{t−1}

for each state dimension a = 1,...,D. With the SE covariance function k_SE defined in (9), we compute the integral analytically and obtain

    ∫ x_{t−1} m_{f_a}(x_{t−1}) p(x_{t−1} | z_{1:t−1}) dx_{t−1}
      = ∑_{i=1}^{n} β^x_{a_i} ∫ x_{t−1} c_3 N(x_{t−1} | x_i, Λ_a) N(x_{t−1} | μ^x_{t−1|t−1}, Σ^x_{t−1|t−1}) dx_{t−1}    (36)

where we defined c_3 := α²_{f_a} (2π)^{D/2} |Λ_a|^{1/2}, such that k_{f_a}(x_{t−1}, x_i) = c_3 N(x_{t−1} | x_i, Λ_a). In the definition of c_3, α²_{f_a} is a hyper-parameter of GP_f responsible for the variance of the latent function in dimension a. Using the definition of S in (24), the product of the two Gaussians in (36) results in a new (unnormalized) Gaussian c_4^{-1} N(x_{t−1} | ψ_i, Ψ) with

    c_4^{-1} = (2π)^{−D/2} |Λ_a + Σ^x_{t−1|t−1}|^{−1/2} exp(−½ (x_i − μ^x_{t−1|t−1})^⊤ S^{-1} (x_i − μ^x_{t−1|t−1}))
    Ψ = (Λ_a^{-1} + (Σ^x_{t−1|t−1})^{-1})^{-1}
    ψ_i = Ψ (Λ_a^{-1} x_i + (Σ^x_{t−1|t−1})^{-1} μ^x_{t−1|t−1}).

Pulling all constants outside the integral in (36), the integral determines the expected value of the product of the two Gaussians, ψ_i. For a = 1,...,D, we obtain

    E[x_{t−1} x_{t_a} | z_{1:t−1}] = ∑_{i=1}^{n} c_3 c_4^{-1} β^x_{a_i} ψ_i.

Using c_3 c_4^{-1} = q^x_{a_i}, see (23), and some matrix identities, we finally obtain

    cov_{x_{t−1},f,w_t}[x_{t−1}, x_{t_a} | z_{1:t−1}] = ∑_{i=1}^{n} β^x_{a_i} q^x_{a_i} Σ^x_{t−1|t−1} (Σ^x_{t−1|t−1} + Λ_a)^{-1} (x_i − μ^x_{t−1|t−1}).    (37)

Hence, the joint covariance matrix of p(x_{t−1}, x_t | z_{1:t−1}) and, thus, the full Gaussian approximation in (19) is completely determined.

With the mean and the covariance of the joint distribution p(x_{t−1}, x_t | z_{1:t−1}) given by (22), (31), (37), and the filter step, all necessary components are provided to compute the smoothing distribution p(x_t | z_{1:T}) analytically [19].
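Note that (37) reuses the vector q^x_a from (23), so the cross-covariance comes almost for free once the marginal mean has been computed. A minimal sketch for a single target dimension (our own illustration, with the same argument conventions as the sketch in Section III-A) is:

```python
import numpy as np

def cross_cov_dim(X_train, beta_a, q_a, Lambda_a, mu_filt, Sigma_filt):
    """a-th column of the cross-covariance Sigma_{t-1,t|t-1}^x, cf. (37).

    X_train : (n, D) training inputs of GP_f
    beta_a  : (n,)   beta_a^x
    q_a     : (n,)   the vector defined in (23)
    """
    diff = X_train - mu_filt                               # x_i - mu_{t-1|t-1}^x, shape (n, D)
    A = Sigma_filt @ np.linalg.inv(Sigma_filt + Lambda_a)  # Sigma (Sigma + Lambda_a)^{-1}
    weights = beta_a * q_a                                 # beta_{a_i} q_{a_i}
    return A @ (diff.T @ weights)                          # (D,) vector: cov[x_{t-1}, x_{t,a}]
```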
For the numerical analysis, x n Z a linear grid in the interval [−3, 3] of mean values (µ0 )i, i = X x x x 1,..., 100, was defined. Then, a single latent (initial) state x(i) was = βai xt−1c3N (xi, Λa)N (µt−1|t−1, Σt−1|t−1) dxt−1 0 (i) x 2 i=1 sampled from p(x0 ) = N ((µ0 )i, σ0 ), i = 1,..., 100. D For the dynamic system in (38)–(39), we analyzed the robustness −1 2 2 p −1 where we defined c3 = (αfa (2π) |Λa|) , such that in a single filter step of the EKF, the UKF, the CKF, and a SIR PF 2 kfa (xt−1, xi) = c3N (xt−1 | xi, Λa). In the definition of c3, αfa is (sequential importance resampling particle filter) with 200 particles, a hyper-parameter of GPf responsible for the variance of the latent the GP-UKF, and the GP-ADF against the ground truth, closely function in dimension a. Using the definition of S in (24), the product approximated by the Gibbs-filter [19]. Compared to the evaluation of the two Gaussians in (36) results in a new (unnormalized) Gaussian of longer trajectories, evaluating a single filter step makes it easier −1 c4 N (xt−1 | ψi, Ψ) with to analyze the robustness of individual filtering algorithms. D 1 −1 − x − Table I summarizes the expected performances (RMSE: root- 2 2 c4 = (2π) |Λa + Σt−1|t−1| mean-square error, MAE: mean-absolute error, NLL: negative log-  1 x > −1 x  likelihood) of the EKF, the UKF, the CKF, the GP-UKF, the GP- × exp − (xi − µt−1|t−1) S (xi − µt−1|t−1) 2 ADF, the Gibbs-filter, and the SIR PF for estimating the latent −1 x −1−1 Ψ = Λa + (Σt−1|t−1) state x. The results in the table are based on averages over 1,000 −1 x −1 x  test runs and 100 randomly sampled start states per test run (see ψi = Ψ Λa xi + (Σt−1|t−1) µt−1|t−1 . experimental setup). The table also reports the 95% standard error of Pulling all constants outside the integral in (36), the integral deter- the expected performances. Table I indicates that the GP-ADF was mines the expected value of the product of the two Gaussians, ψi. the most robust filter and statistically significantly outperformed all For a = 1,...,D, we obtain filters but the sampling-based Gibbs-filter and the SIR PF. The green n E X −1 x color highlights a near-optimal Gaussian filter (Gibbs-filter) and the [xt−1 xta |z1:t−1] = c3c4 βai ψi . i=1 near-optimal particle filter. Amongst all other filters the GP-ADF was 5

TABLE I
AVERAGE FILTER PERFORMANCES (RMSE, MAE, NLL) WITH STANDARD ERRORS (95% CONFIDENCE INTERVAL) AND p-VALUES TESTING THE HYPOTHESIS THAT THE OTHER FILTERS ARE BETTER THAN THE GP-ADF USING A ONE-SIDED T-TEST.

Filter            | RMSE_x (p-value)                 | MAE_x (p-value)                  | NLL_x (p-value)
EKF [10]          | 3.62 ± 0.212 (p = 4.1 × 10^−2)   | 2.36 ± 0.176 (p = 0.38)          | 3.05 × 10^3 ± 3.02 × 10^2 (p < 10^−4)
UKF [11]          | 10.5 ± 1.08 (p < 10^−4)          | 8.58 ± 0.915 (p < 10^−4)         | 25.6 ± 3.39 (p < 10^−4)
CKF [15]          | 9.24 ± 1.13 (p = 2.8 × 10^−4)    | 7.31 ± 0.941 (p = 4.2 × 10^−4)   | 2.22 × 10^2 ± 17.5 (p < 10^−4)
GP-UKF [12]       | 5.36 ± 0.461 (p = 7.9 × 10^−4)   | 3.84 ± 0.352 (p = 3.3 × 10^−3)   | 6.02 ± 0.497 (p < 10^−4)
GP-ADF [13]       | 2.85 ± 0.174                     | 2.17 ± 0.151                     | 1.97 ± 6.55 × 10^−2
Gibbs-filter [19] | 2.82 ± 0.171 (p = 0.54)          | 2.12 ± 0.148 (p = 0.56)          | 1.96 ± 6.62 × 10^−2 (p = 0.55)
SIR PF            | 1.57 ± 7.66 × 10^−2 (p = 1.0)    | 0.36 ± 2.28 × 10^−2 (p = 1.0)    | 1.03 ± 7.30 × 10^−2 (p = 1.0)

For the dynamic system in (38)–(39), we analyzed the robustness in a single filter step of the EKF, the UKF, the CKF, a SIR PF (sequential importance resampling particle filter) with 200 particles, the GP-UKF, and the GP-ADF against the ground truth, closely approximated by the Gibbs-filter [19]. Compared to the evaluation of longer trajectories, evaluating a single filter step makes it easier to analyze the robustness of individual filtering algorithms.

Table I summarizes the expected performances (RMSE: root-mean-square error, MAE: mean-absolute error, NLL: negative log-likelihood) of the EKF, the UKF, the CKF, the GP-UKF, the GP-ADF, the Gibbs-filter, and the SIR PF for estimating the latent state x. The results in the table are based on averages over 1,000 test runs with 100 randomly sampled start states per test run (see experimental setup). The table also reports the 95% standard error of the expected performances. Table I indicates that the GP-ADF was the most robust filter and statistically significantly outperformed all filters but the sampling-based Gibbs-filter and the SIR PF, which serve as a near-optimal Gaussian filter and a near-optimal particle filter, respectively. Amongst all other filters, the GP-ADF was the closest Gaussian filter to the computationally expensive Gibbs-filter [19]. Note that the SIR PF is not a Gaussian filter and is able to express multi-modality in distributions. Therefore, its performance is typically better than the one of Gaussian filters. The difference between the SIR PF and a near-optimal Gaussian filter, the Gibbs-filter, is expressed in Table I. The performance difference essentially depicts how much we lost by using a Gaussian filter instead of a particle filter. The NLL values for the SIR PF were obtained by moment-matching the particles.

The poor performance of the EKF was due to linearization errors. The filters based on small sample approximations of densities (UKF, GP-UKF, CKF) suffered from the degeneracy of these approximations, which is illustrated in Figure 1. Note that the CKF uses a smaller set of cubature points than the UKF to determine predictive distributions, which makes the CKF statistically even less robust than the UKF.

B. Smoother Robustness

We considered a pendulum tracking example taken from [13]. We evaluated the performances of the following filters and smoothers: the EKF/EKS, the UKF/URTSS, the GP-UKF/GP-URTSS, the CKF/CKS, the Gibbs-filter, and the GP-ADF/GP-RTSS. The pendulum had mass m = 1 kg and length l = 1 m. The state x = [φ̇, φ]^⊤ of the pendulum was given by the angle φ (measured anti-clockwise from hanging down) and the angular velocity φ̇. The pendulum could exert a constrained torque u ∈ [−5, 5] Nm. We assumed a frictionless system such that the transition function f was

    f(x_t, u_t) = ∫_t^{t+Δt} [ (u(τ) − 0.5 m l g sin φ(τ)) / (0.25 m l² + I) ;  φ̇(τ) ] dτ    (40)

where I is the moment of inertia and g the acceleration of gravity. Then, the successor state

    x_{t+1} = x_{t+Δt} = f(x_t, u_t) + w_t    (41)

was computed using an ODE solver for (40) with a zero-order-hold control signal u(τ). In (41), we set Σ_w = diag([0.5², 0.1²]). In our experiment, the torque was sampled randomly according to u ∼ U[−5, 5] Nm and implemented using a zero-order-hold controller. Every time increment Δt = 0.2 s, the state was measured according to

    z_t = arctan( (−1 − l sin(φ_t)) / (0.5 − l cos(φ_t)) ) + v_t,   v_t ∼ N(0, σ²_v = 0.05²).    (42)

Note that the scalar measurement equation (42) solely depends on the angle. Thus, the full distribution of the latent state x had to be inferred using the cross-correlation information between the angle and the angular velocity.

Trajectories of length T = 6 s = 30 time steps were started from a state sampled from the prior p(x_0) = N(μ^x_0, Σ^x_0) with μ^x_0 = [0, 0]^⊤ and Σ^x_0 = diag([0.01², (π/16)²]). For each trajectory, GP models GP_f and GP_g are learned based on randomly generated data using either 250 or 20 data points.
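A sketch of this data-generating process is given below (our own illustration; the gravitational acceleration g = 9.81 m/s² and the moment of inertia I = m l²/12 of a uniform rod are assumptions of the sketch, since the paper does not state them explicitly).

```python
import numpy as np
from scipy.integrate import solve_ivp

m, l, g0 = 1.0, 1.0, 9.81            # mass [kg], length [m], gravity [m/s^2] (g0 assumed)
I = m * l ** 2 / 12.0                # moment of inertia about the center (assumed uniform rod)
dt = 0.2                             # time increment [s]
Sigma_w = np.diag([0.5 ** 2, 0.1 ** 2])
sigma_v = 0.05
rng = np.random.default_rng(2)

def pendulum_ode(t, x, u):
    """State x = [dphi, phi]; dynamics as in (40) with zero-order-hold torque u."""
    dphi, phi = x
    ddphi = (u - 0.5 * m * l * g0 * np.sin(phi)) / (0.25 * m * l ** 2 + I)
    return [ddphi, dphi]

def step(x, u):
    """Successor state (41): integrate (40) over dt and add system noise."""
    sol = solve_ivp(pendulum_ode, (0.0, dt), x, args=(u,))
    return sol.y[:, -1] + rng.multivariate_normal(np.zeros(2), Sigma_w)

def measure(x):
    """Scalar bearings-like measurement (42); depends on the angle only."""
    phi = x[1]
    ratio = (-1.0 - l * np.sin(phi)) / (0.5 - l * np.cos(phi))
    return np.arctan(ratio) + sigma_v * rng.standard_normal()

# Generate one trajectory of T = 30 steps with random zero-order-hold torques
x = rng.multivariate_normal([0.0, 0.0], np.diag([0.01 ** 2, (np.pi / 16) ** 2]))
traj, meas = [x], []
for _ in range(30):
    u = rng.uniform(-5.0, 5.0)
    x = step(x, u)
    traj.append(x)
    meas.append(measure(x))
```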
Table II reports the expected values of the NLL_x-measure for the EKF/EKS, the UKF/URTSS, the GP-UKF/GP-URTSS, the GP-ADF/GP-RTSS, and the CKF/CKS when tracking the pendulum over a horizon of 6 s, averaged over 1,000 runs. The ⋆ indicates a method developed in this paper. As in the example in Section IV-A, the NLL_x-measure emphasizes the robustness of our proposed method: the GP-RTSS is the only method that consistently reduced the negative log-likelihood value compared with the corresponding filtering algorithm. Increasing NLL_x-values from filtering to smoothing (EKS, URTSS, CKS, and GP-URTSS in Table II) occurred when the filter distribution could not explain the latent state/measurement, an example of which is given in Figure 1(b). Even with only 20 training points, the GP-ADF/GP-RTSS outperformed the commonly used EKF/EKS, UKF/URTSS, and CKF/CKS.

We experimented with even smaller signal-to-noise ratios. The GP-RTSS remained robust, while the other smoothers remained unstable.


Fig. 1. Degeneracy of the unscented transformation (UT) underlying the UKF. Input distributions to the UT are the Gaussians in the bottom subfigures of each panel. The functions the UT is applied to are shown in the top right subfigures, i.e., the transition mapping (38) in Panel (a) and the measurement mapping (39) in Panel (b). Sigma points are marked by red dots. The predictive distributions are shown in the left subfigures of each panel. The true predictive distributions are the shaded areas; the UT predictive distributions are the solid Gaussians. The predictive distribution of the time update in Panel (a) equals the input distribution at the bottom of Panel (b). (a) The UKF time update p(x_1|∅) misses out substantial probability mass of the true predictive distribution. (b) The UKF determines p(z_1|∅), which is too sensitive and cannot explain the actual measurement z_1 (black dot, left subfigure).

TABLE II
EXPECTED FILTERING AND SMOOTHING PERFORMANCES (PENDULUM TRACKING) WITH 95% CONFIDENCE INTERVALS.

Filters   | EKF [10]          | UKF [11]    | CKF [15]    | GP-UKF_250 [12] | GP-ADF_250 [13] | GP-ADF_20 [13]
E[NLL_x]  | 1.6 × 10^2 ± 29.1 | 6.0 ± 3.02  | 28.5 ± 9.83 | 4.4 ± 1.32      | 1.44 ± 0.117    | 6.63 ± 0.149
Smoothers | EKS [10]          | URTSS [16]  | CKS [19]    | GP-URTSS_250 ⋆  | GP-RTSS_250 ⋆   | GP-RTSS_20 ⋆
E[NLL_x]  | 3.3 × 10^2 ± 60.5 | 17.2 ± 10.0 | 72.0 ± 25.1 | 10.3 ± 3.85     | 1.04 ± 0.204    | 6.57 ± 0.148

V. DISCUSSION AND CONCLUSION

In this paper, we presented the GP-RTSS, an analytic Rauch-Tung-Striebel smoother for GP dynamic systems, where the GPs with SE covariance functions are practical implementations of universal function approximators. We showed that the GP-RTSS is more robust to nonlinearities than state-of-the-art smoothers. There are two main reasons for this: First, the GP-RTSS relies neither on linearization (EKS) nor on density approximations (URTSS/CKS) to compute an optimal Gaussian approximation of the predictive distribution when mapping a Gaussian distribution through a nonlinear function. This property avoids incoherent estimates of the filtering and smoothing distributions as discussed in Section IV-A. Second, GPs allow for more robust "system identification" than standard methods since they coherently represent uncertainties about the system and measurement functions at locations that have not been encountered in the data collection phase. The GP-RTSS is a robust smoother since it accounts for model uncertainties in a principled Bayesian way.

After training the GPs, which can be performed offline, the computational complexity of the GP-RTSS (including filtering) is O(T(E³ + n²(D³ + E³))) for a time series of length T. Here, n is the size of the GP training sets, and D and E are the dimensions of the state and the measurements, respectively. The computational complexity is due to the inversion of the D- and E-dimensional covariance matrices, and the computation of the matrix Q ∈ R^{n×n} in (28), required for each entry of a D- and E-dimensional covariance matrix. The computational complexity scales linearly with the number of time steps. The computational demand of classical Gaussian smoothers, such as the URTSS and the EKS, is O(T(D³ + E³)). Although not reported here, we verified the computational complexity experimentally. Approximating the online computations of the GP-RTSS by numerical integration or grids scales poorly with increasing dimension. These problems already appear in the histogram filter [3]. By explicitly providing equations for the solution of the involved integrals, we show that numerical integration is not necessary and the GP-RTSS is a practical approach to filtering in GP dynamic systems.

Although the GP-RTSS is computationally more involved than the URTSS, the EKS, and the CKS, this does not necessarily imply that smoothing with the GP-RTSS is slower: function evaluations, which are heavily used by the EKS/CKS/URTSS, are not necessary in the GP-RTSS (after training). In the pendulum example, repeatedly calling the ODE solver caused the EKS/CKS/URTSS to be slower than the GP-RTSS (with 250 training points) by a factor of two.

The increasing use of GPs for model learning in robotics and control will eventually require principled smoothing methods for GP dynamic systems. To our best knowledge, the proposed GP-RTSS is the most principled GP-smoother since all computations can be performed analytically without function linearization or sigma/cubature-point representation of densities, while exactly integrating out the model uncertainty induced by the GP distribution.

Matlab code is publicly available at http://mloss.org.

REFERENCES

[1] B. D. O. Anderson and J. B. Moore, Optimal Filtering. Dover Publications, 2005.
[2] K. J. Åström, Introduction to Stochastic Control Theory. Dover Publications, Inc., 2006.
[3] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics. The MIT Press, 2005.
[4] S. T. Roweis and Z. Ghahramani, Kalman Filtering and Neural Networks. Wiley, 2001, ch. Learning Nonlinear Dynamical Systems using the EM Algorithm, pp. 175–220.
[5] R. E. Kalman, "A New Approach to Linear Filtering and Prediction Problems," Transactions of the ASME—Journal of Basic Engineering, vol. 82, pp. 35–45, 1960.
[6] H. E. Rauch, F. Tung, and C. T. Striebel, "Maximum Likelihood Estimates of Linear Dynamical Systems," AIAA Journal, vol. 3, pp. 1445–1450, 1965.
[7] S. Roweis and Z. Ghahramani, "A Unifying Review of Linear Gaussian Models," Neural Computation, vol. 11, no. 2, pp. 305–345, 1999.
[8] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger, "Factor Graphs and the Sum-Product Algorithm," IEEE Transactions on Information Theory, vol. 47, pp. 498–519, 2001.
[9] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
[10] P. S. Maybeck, Stochastic Models, Estimation, and Control. Academic Press, Inc., 1979, vol. 141.
[11] S. J. Julier and J. K. Uhlmann, "Unscented Filtering and Nonlinear Estimation," Proceedings of the IEEE, vol. 92, no. 3, pp. 401–422, 2004.
[12] J. Ko and D. Fox, "GP-BayesFilters: Bayesian Filtering using Gaussian Process Prediction and Observation Models," Autonomous Robots, vol. 27, no. 1, pp. 75–90, 2009.
[13] M. P. Deisenroth, M. F. Huber, and U. D. Hanebeck, "Analytic Moment-based Gaussian Process Filtering," in International Conference on Machine Learning, 2009, pp. 225–232.
[14] U. D. Hanebeck, "Optimal Filtering of Nonlinear Systems Based on Pseudo Gaussian Densities," in Symposium on System Identification, 2003, pp. 331–336.
[15] I. Arasaratnam and S. Haykin, "Cubature Kalman Filters," IEEE Transactions on Automatic Control, vol. 54, no. 6, pp. 1254–1269, 2009.
[16] S. Särkkä, "Unscented Rauch-Tung-Striebel Smoother," IEEE Transactions on Automatic Control, vol. 53, no. 3, pp. 845–849, 2008.
[17] S. J. Godsill, A. Doucet, and M. West, "Monte Carlo Smoothing for Nonlinear Time Series," Journal of the American Statistical Association, vol. 99, no. 465, pp. 438–449, 2004.
[18] G. Kitagawa, "Monte Carlo Filter and Smoother for Non-Gaussian Nonlinear State Space Models," Journal of Computational and Graphical Statistics, vol. 5, no. 1, pp. 1–25, 1996.
[19] M. P. Deisenroth and H. Ohlsson, "A General Perspective on Gaussian Filtering and Smoothing: Explaining Current and Deriving New Algorithms," in American Control Conference, 2011.
[20] D. J. C. MacKay, "Introduction to Gaussian Processes," in Neural Networks and Machine Learning. Springer, 1998, vol. 168, pp. 133–165.
[21] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. The MIT Press, 2006.
[22] D. Nguyen-Tuong, M. Seeger, and J. Peters, "Local Gaussian Process Regression for Real Time Online Model Learning," in Advances in Neural Information Processing Systems, 2009, pp. 1193–1200.
[23] R. Murray-Smith, D. Sbarbaro, C. E. Rasmussen, and A. Girard, "Adaptive, Cautious, Predictive Control with Gaussian Process Priors," in Symposium on System Identification, 2003.
[24] J. Kocijan, R. Murray-Smith, C. E. Rasmussen, and B. Likar, "Predictive Control with Gaussian Process Models," in IEEE Region 8 Eurocon 2003: Computer as a Tool, 2003, pp. 352–356.
[25] M. P. Deisenroth, C. E. Rasmussen, and D. Fox, "Learning to Control a Low-Cost Manipulator using Data-Efficient Reinforcement Learning," in Robotics: Science & Systems, 2011.
[26] M. P. Deisenroth and C. E. Rasmussen, "PILCO: A Model-Based and Data-Efficient Approach to Policy Search," in International Conference on Machine Learning, 2011.
[27] C. G. Atkeson and J. C. Santamaría, "A Comparison of Direct and Model-Based Reinforcement Learning," in International Conference on Robotics and Automation, 1997.
[28] J. Kern, "Bayesian Process-Convolution Approaches to Specifying Spatial Dependence Structure," Ph.D. dissertation, Institute of Statistics and Decision Sciences, Duke University, 2000.
[29] J. Quiñonero-Candela, A. Girard, J. Larsen, and C. E. Rasmussen, "Propagation of Uncertainty in Bayesian Kernel Models—Application to Multiple-Step Ahead Forecasting," in International Conference on Acoustics, Speech and Signal Processing, 2003, pp. 701–704.
[30] M. P. Deisenroth, Efficient Reinforcement Learning using Gaussian Processes. KIT Scientific Publishing, 2010, vol. 9.
[31] A. Doucet, S. J. Godsill, and C. Andrieu, "On Sequential Monte Carlo Sampling Methods for Bayesian Filtering," Statistics and Computing, vol. 10, pp. 197–208, 2000.