MANUSCRIPT

Bayesian Filtering: From Kalman Filters to Particle Filters, and Beyond

ZHE CHEN
Abstract— In this self-contained survey/review paper, we systematically investigate the roots of Bayesian filtering as well as its rich leaves in the literature. Stochastic filtering theory is briefly reviewed with emphasis on nonlinear and non-Gaussian filtering. Following Bayesian statistics, different Bayesian filtering techniques are developed for different scenarios. Under the linear quadratic Gaussian (LQG) circumstance, the celebrated Kalman filter can be derived within the Bayesian framework. Optimal/suboptimal nonlinear filtering techniques are extensively investigated. In particular, we focus our attention on the Bayesian filtering approach based on sequential Monte Carlo sampling, the so-called particle filters. Many variants of the particle filter, as well as their features (strengths and weaknesses), are discussed. Related theoretical and practical issues are addressed in detail. In addition, some other (new) directions in Bayesian filtering are also explored.

Index Terms— Stochastic filtering, Bayesian filtering, Bayesian inference, particle filter, sequential Monte Carlo, sequential state estimation, Monte Carlo methods.

"The probability of any event is the ratio between the value at which an expectation depending on the happening of the event ought to be computed, and the value of the thing expected upon its happening."
— Thomas Bayes (1702-1761), [29]

"Statistics is the art of never having to say you're wrong. Variance is what any two statisticians are at."
— C. J. Bradfield

The work is supported by the Natural Sciences and Engineering Research Council of Canada. Z. Chen was also partially supported by a Clifton W. Sherman Scholarship.
The author is with the Communications Research Laboratory, McMaster University, Hamilton, Ontario, Canada L8S 4K1, e-mail: [email protected], Tel: (905) 525-9140 x27282, Fax: (905) 521-2922.

Contents

I Introduction
  I-A Stochastic Filtering Theory
  I-B Bayesian Theory and Bayesian Filtering
  I-C Monte Carlo Methods and Monte Carlo Filtering
  I-D Outline of Paper
II Mathematical Preliminaries and Problem Formulation
  II-A Preliminaries
  II-B Notations
  II-C Stochastic Filtering Problem
  II-D Nonlinear Stochastic Filtering Is an Ill-posed Inverse Problem
    II-D.1 Inverse Problem
    II-D.2 Differential Operator and Integral Equation
    II-D.3 Relations to Other Problems
  II-E Stochastic Differential Equations and Filtering
III Bayesian Statistics and Bayesian Estimation
  III-A Bayesian Statistics
  III-B Recursive Bayesian Estimation
IV Bayesian Optimal Filtering
  IV-A Optimal Filtering
  IV-B Kalman Filtering
  IV-C Optimum Nonlinear Filtering
    IV-C.1 Finite-dimensional Filters
V Numerical Approximation Methods
  V-A Gaussian/Laplace Approximation
  V-B Iterative Quadrature
  V-C Multigrid Method and Point-Mass Approximation
  V-D Moment Approximation
  V-E Gaussian Sum Approximation
  V-F Deterministic Sampling Approximation
  V-G Monte Carlo Sampling Approximation
    V-G.1 Importance Sampling
    V-G.2 Rejection Sampling
    V-G.3 Sequential Importance Sampling
    V-G.4 Sampling-Importance Resampling
    V-G.5 Stratified Sampling
    V-G.6 Markov Chain Monte Carlo
    V-G.7 Hybrid Monte Carlo
    V-G.8 Quasi-Monte Carlo
VI Sequential Monte Carlo Estimation: Particle Filters
  VI-A Sequential Importance Sampling (SIS) Filter
  VI-B Bootstrap/SIR Filter
  VI-C Improved SIS/SIR Filters
  VI-D Auxiliary Particle Filter
  VI-E Rejection Particle Filter
  VI-F Rao-Blackwellization
  VI-G Kernel Smoothing and Regularization
  VI-H Data Augmentation
    VI-H.1 Data Augmentation is an Iterative Kernel Smoothing Process
    VI-H.2 Data Augmentation as a Bayesian Sampling Method
  VI-I MCMC Particle Filter
  VI-J Mixture Kalman Filters
  VI-K Mixture Particle Filters
  VI-L Other Monte Carlo Filters
  VI-M Choices of Proposal Distribution
    VI-M.1 Prior Distribution
    VI-M.2 Annealed Prior Distribution
    VI-M.3 Likelihood
    VI-M.4 Bridging Density and Partitioned Sampling
    VI-M.5 Gradient-Based Transition Density
    VI-M.6 EKF as Proposal Distribution
    VI-M.7 Unscented Particle Filter
  VI-N Bayesian Smoothing
    VI-N.1 Fixed-point smoothing
    VI-N.2 Fixed-lag smoothing
    VI-N.3 Fixed-interval smoothing
  VI-O Likelihood Estimate
  VI-P Theoretical and Practical Issues
    VI-P.1 Convergence and Asymptotic Results
    VI-P.2 Bias-Variance
    VI-P.3 Robustness
    VI-P.4 Adaptive Procedure
    VI-P.5 Evaluation and Implementation
VII Other Forms of Bayesian Filtering and Inference
  VII-A Conjugate Analysis Approach
  VII-B Differential Geometrical Approach
  VII-C Interacting Multiple Models
  VII-D Bayesian Kernel Approaches
  VII-E Dynamic Bayesian Networks
VIII Selected Applications
  VIII-A Target Tracking
  VIII-B Computer Vision and Robotics
  VIII-C Digital Communications
  VIII-D Speech Enhancement and Speech Recognition
  VIII-E Machine Learning
  VIII-F Others
  VIII-G An Illustrative Example: Robot-Arm Problem
IX Discussion and Critique
  IX-A Parameter Estimation
  IX-B Joint Estimation and Dual Estimation
  IX-C Prior
  IX-D Localization Methods
  IX-E Dimensionality Reduction and Projection
  IX-F Unanswered Questions
X Summary and Concluding Remarks

I. Introduction

THE contents of this paper span three major scientific areas: stochastic filtering theory, Bayesian theory, and Monte Carlo methods. All of them are discussed closely around the subject of our interest: Bayesian filtering. In the course of telling this long story, some relevant theories are briefly reviewed to provide the reader with a complete picture. Mathematical preliminaries and background material are also provided in detail to keep the paper self-contained.

A. Stochastic Filtering Theory

Stochastic filtering theory was first established in the early 1940s through the pioneering work of Norbert Wiener [487], [488] and Andrey N. Kolmogorov [264], [265], and it culminated in 1960 with the publication of the classic Kalman filter (KF) [250] (and the subsequent Kalman-Bucy filter in 1961 [249]),1 though much credit is also due to earlier work by Bode and Shannon [46], Zadeh and Ragazzini [502], [503], Swerling [434], Levinson [297], and others. Without any exaggeration, it seems fair to say that the Kalman filter (and its numerous variants) has dominated adaptive filter theory for decades in the signal processing and control areas. Nowadays, Kalman filters are applied in various engineering and scientific areas, including communications, machine learning, neuroscience, economics, finance, political science, and many others. Bearing in mind that the Kalman filter is limited by its assumptions, numerous nonlinear filtering methods along its line have been proposed and developed to overcome its limitations.

Footnote 1: Another important event in 1960 was the publication of the celebrated least-mean-squares (LMS) algorithm [485]. The LMS filter is not discussed in this paper; the reader can refer to [486], [205], [207], [247] for more information.

B. Bayesian Theory and Bayesian Filtering

Bayesian theory2 was originally developed by the British researcher Thomas Bayes in a posthumous publication in 1763 [29]. The well-known Bayes theorem describes the fundamental probability law governing the process of logical inference. However, Bayesian theory did not gain its deserved attention in the early days, until its modern form was rediscovered by the French mathematician Pierre-Simon de Laplace in Théorie analytique des probabilités.3 Bayesian inference [38], [388], [375], devoted to applying Bayesian statistics to statistical inference, has become one of the important branches of statistics, and has been applied successfully to statistical decision, detection and estimation, pattern recognition, and machine learning. In particular, the November 19, 1999 issue of Science magazine gave the Bayesian research boom a four-page special report [320]. In many scenarios, the solutions obtained through Bayesian inference are viewed as "optimal".

Not surprisingly, Bayesian theory has also been studied in the filtering literature. One of the first explorations of iterative Bayesian estimation is found in Ho and Lee's paper [212], in which they specified the principle and procedure of Bayesian filtering. Sprangins [426] discussed the iterative application of Bayes' rule to sequential parameter estimation and called it "Bayesian learning". Lin and Yau [301] and Chien and Fu [92] discussed the Bayesian approach to optimization of adaptive systems. Bucy [62] and Bucy and Senne [63] also explored the point-mass approximation method within the Bayesian filtering framework.

Footnote 2: A generalized Bayesian theory is the so-called quasi-Bayesian theory (e.g. [100]), built on a convex set of probability distributions and a relaxed set of axioms about preferences, which we do not discuss in this paper.
Footnote 3: An interesting history of Thomas Bayes and his famous essay is found in [110].

C. Monte Carlo Methods and Monte Carlo Filtering

The early idea of Monte Carlo4 can be traced back to the problem of Buffon's needle, posed when Buffon attempted in 1777 to estimate π (see e.g. [419]). But the modern formulation of Monte Carlo methods started in the 1940s in physics [330], [329], [393] and later, in the 1950s, in statistics [198]. During World War II, John von Neumann, Stanislaw Ulam, Nick Metropolis, and others initiated the Monte Carlo method at Los Alamos Laboratory. Von Neumann also used the Monte Carlo method to calculate the elements of an inverse matrix, in the course of which the "Russian roulette" and "splitting" methods were redefined [472]. In recent decades, Monte Carlo techniques have been rediscovered independently in statistics, physics, and engineering, and many new Monte Carlo methodologies (e.g. Bayesian bootstrap, hybrid Monte Carlo, quasi-Monte Carlo) have been rejuvenated and developed. Roughly speaking, the Monte Carlo technique is a stochastic sampling approach aimed at tackling complex systems that are analytically intractable.

Footnote 4: The method is named after the city in the Monaco principality, because of a roulette wheel, a simple random-number generator. The name was first suggested by Stanislaw Ulam.
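To make the Monte Carlo idea concrete, here is a toy sketch (our own illustration, not an algorithm from this paper) that estimates π by uniform random sampling, in the same spirit as Buffon's needle experiment:

```python
import random

def estimate_pi(num_samples, seed=0):
    """Monte Carlo estimate of pi: the fraction of uniform points in the
    unit square falling inside the quarter disc x^2 + y^2 <= 1
    converges to pi/4."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(num_samples)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * hits / num_samples

print(estimate_pi(100_000))  # close to 3.1416; error shrinks as O(1/sqrt(N))
```

The standard error of such an estimate decays as O(1/√N) regardless of the dimension of the integral, which is the basic appeal of Monte Carlo integration.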
The power of Monte Carlo methods is that they can attack difficult numerical integration problems. In recent years, sequential Monte Carlo approaches have attracted more and more attention from researchers in different areas, with many successful applications in statistics (see e.g. the March 2001 special issue of Annals of the Institute of Statistical Mathematics), signal processing (see e.g. the February 2002 special issue of IEEE Transactions on Signal Processing), machine learning, econometrics, automatic control, tracking, communications, biology, and many others (see e.g. [141] and the references therein). One of the attractive merits of sequential Monte Carlo approaches lies in the fact that they allow on-line estimation by combining powerful Monte Carlo sampling methods with Bayesian inference, at a reasonable computational cost. In particular, the sequential Monte Carlo approach has been used in parameter estimation and state estimation, for the latter of which it is sometimes called the particle filter.5 The basic idea of the particle filter is to use a number of independent random variables called particles,6 sampled directly from the state space, to represent the posterior probability, and to update the posterior by incorporating the new observations; the "particle system" is properly located, weighted, and propagated recursively according to Bayes' rule. In retrospect, the earliest ideas of Monte Carlo methods used in statistical inference are found in [200], [201], and later in [5], [6], [506], [433], [258], but the formal establishment of the particle filter seems fairly attributable to Gordon, Salmond and Smith [193], who introduced a novel resampling technique into the formulation. Almost at the same time, a number of statisticians independently rediscovered and developed the sampling-importance-resampling (SIR) idea [414], [266], [303], which had originally been proposed by Rubin [395], [397] in a non-dynamic framework.7 Particle filters were rediscovered and enjoyed a renaissance in the mid-1990s (e.g. [259], [222], [229], [304], [307], [143], [40]) after a long dormant period, partially thanks to ever-increasing computing power. Recently, much work has been done to improve the performance of particle filters [69], [189], [428], [345], [456], [458], [357]. Also, many doctoral theses have been devoted to Monte Carlo filtering and inference from different perspectives [191], [142], [162], [118], [221], [228], [35], [97], [365], [467], [86].

It is noted that the particle filter is not the only leaf on the Bayesian filtering tree, in the sense that Bayesian filtering can also be tackled with other techniques, such as the differential geometry approach, variational methods, or the conjugate method. Some potential future directions will consider combining these methods with Monte Carlo sampling techniques, as we will discuss in the paper. The attention of this paper, however, remains on Monte Carlo methods and particularly on sequential Monte Carlo estimation.

Footnote 5: Many other terminologies also exist in the literature, e.g. the SIS filter, SIR filter, bootstrap filter, sequential imputation, or the CONDENSATION algorithm (see [224] for many others), though they are addressed differently in different areas. In this paper, we treat them as different variants within the generic Monte Carlo filter family. Not all Monte Carlo filters perform sequential Monte Carlo estimation.
Footnote 6: The particle filter is called normal if it produces i.i.d. samples; sometimes negative correlations are deliberately introduced among the particles for the sake of variance reduction.
Footnote 7: The earliest idea of multiple imputation, due to Rubin, was published in 1978 [394].

D. Outline of Paper

In this paper, we present a comprehensive review of stochastic filtering theory from the Bayesian perspective. [It happens to be almost three decades after the 1974 publication of Prof. Thomas Kailath's illuminating review paper "A view of three decades of linear filtering theory" [244]; we take this opportunity to dedicate this paper to him, who has contributed so much to the literature on stochastic filtering theory.] With the tool of Bayesian statistics, it turns out that the celebrated Kalman filter is a special case of Bayesian filtering under the LQG (linear, quadratic, Gaussian) circumstance, a fact first observed by Ho and Lee [212]; particle filters are also essentially rooted in Bayesian statistics, in the spirit of recursive Bayesian estimation. Of particular interest to us are the nonlinear, non-Gaussian, and non-stationary situations that we mostly encounter in the real world. Generally, for nonlinear filtering no exact solution can be obtained, or the solution is infinite-dimensional,8 hence various numerical approximation methods come in to address the intractability. In particular, we focus our attention on the sequential Monte Carlo method, which allows on-line estimation from a Bayesian perspective. The historical roots of, and remarks on, Monte Carlo filtering are traced. Bayesian filtering approaches outside the Monte Carlo framework are also reviewed. Besides, we extend our discussion from Bayesian filtering to Bayesian inference, for the latter of which the well-known hidden Markov model (HMM) (a.k.a. the HMM filter), dynamic Bayesian networks (DBN), and Bayesian kernel machines are also briefly discussed.

Nowadays Bayesian filtering has become such a broad topic, involving many scientific areas, that a comprehensive survey and detailed treatment seems crucial to cater to the ever-growing demand for understanding this important field among novices, though it is noticed by the author that there already exist in the literature a number of excellent tutorial papers on particle filters and Monte Carlo filters [143], [144], [19], [438], [443], as well as relevant edited volumes [141] and books [185], [173], [306], [82]. Unfortunately, as observed in our comprehensive bibliographies, many papers were written by statisticians or physicists with special terminologies that might be unfamiliar to many engineers. Besides, the papers were written with different nomenclatures for different purposes (e.g., convergence and asymptotic results are rarely of concern in engineering but are important to statisticians). The author thus felt obligated to write a tutorial paper on this emerging and promising area for a readership of engineers, and to introduce to the reader many techniques developed in statistics and physics.

Footnote 8: Or the sufficient statistics are infinite-dimensional.
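The particle-filter recipe sketched in Section I-C (propagate particles through the state equation, weight them by the likelihood of the new observation, then resample) can be illustrated with a minimal bootstrap filter. The scalar random-walk model, noise levels, and particle count below are our own illustrative assumptions, not an example from the paper:

```python
import math
import random

rng = random.Random(42)

def bootstrap_particle_filter(observations, num_particles=500,
                              process_std=1.0, obs_std=1.0):
    """Bootstrap (SIR) particle filter for the toy model
    x[n+1] = x[n] + d[n],  y[n] = x[n] + v[n]  (Gaussian noises).
    Returns the filtered estimate E[x_n | y_0:n] at each time step."""
    # Initialize particles from a diffuse prior over the initial state.
    particles = [rng.gauss(0.0, 1.0) for _ in range(num_particles)]
    estimates = []
    for y in observations:
        # 1. Propagate each particle through the state (transition) equation.
        particles = [x + rng.gauss(0.0, process_std) for x in particles]
        # 2. Weight each particle by the likelihood p(y | x) of the observation.
        weights = [math.exp(-0.5 * ((y - x) / obs_std) ** 2) for x in particles]
        total = sum(weights) or 1.0  # guard against degenerate underflow
        weights = [w / total for w in weights]
        # 3. Posterior mean under the weighted point-mass approximation.
        estimates.append(sum(w * x for w, x in zip(weights, particles)))
        # 4. Resample (multinomial) to get an equally weighted particle set.
        particles = rng.choices(particles, weights=weights, k=num_particles)
    return estimates

# Track a noisy random walk generated from the same model.
truth, observations, x = [], [], 0.0
for _ in range(50):
    x += rng.gauss(0.0, 1.0)
    truth.append(x)
    observations.append(x + rng.gauss(0.0, 1.0))
estimates = bootstrap_particle_filter(observations)
```

The weighted particle set is exactly the point-mass (empirical) approximation of the posterior density; resampling combats weight degeneracy at the cost of extra Monte Carlo variance.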
For this purpose again, for the variety of particle filter algorithms, the basic ideas rather than the mathematical derivations are emphasized; further details and experimental results are indicated in the references. Due to the dual tutorial/review nature of the current paper, only a few simple examples and simulations are presented to illustrate the essential ideas, and no comparative results are available at this stage (see [88]); however, this does not prevent us from presenting new thoughts. Moreover, many graphical and tabular illustrations are presented. Since this is also a survey paper, extensive bibliographies are included in the references, but there is no claim that the bibliographies are complete, owing to the author's knowledge limitations as well as to space constraints.

The rest of this paper is organized as follows: In Section II, some basic mathematical preliminaries of stochastic filtering theory are given, and the stochastic filtering problem is mathematically formulated. Section III presents the essential Bayesian theory, particularly Bayesian statistics and Bayesian inference. In Section IV, the Bayesian filtering theory is systematically investigated. Following the simplest LQG case, the celebrated Kalman filter is briefly derived, followed by a discussion of optimal nonlinear filtering. Section V discusses many popular numerical approximation techniques, with special emphasis on Monte Carlo sampling methods, which result in the various forms of particle filters in Section VI. In Section VII, some other new Bayesian filtering approaches beyond Monte Carlo sampling are reviewed. Section VIII presents some selected applications and one illustrative example of particle filters. We give some discussion and critiques in Section IX and conclude the paper in Section X.

II. Mathematical Preliminaries and Problem Formulation

A. Preliminaries

Definition 1: Let S be a set and F a family of subsets of S. F is a σ-algebra if (i) ∅ ∈ F; (ii) A ∈ F implies A^c ∈ F; (iii) A_1, A_2, ... ∈ F implies ∪_{i=1}^∞ A_i ∈ F. A σ-algebra is closed under complement and under union of countably infinitely many sets.

Definition 2: A probability space is defined by the elements {Ω, F, P}, where F is a σ-algebra of Ω and P is a complete, σ-additive probability measure on F. In other words, P is a set function whose arguments are random events (elements of F) such that the axioms of probability hold.

Definition 3: Let p(x) = dP(x)/dμ denote the Radon-Nikodým density of the probability distribution P(x) w.r.t. a measure μ. When x ∈ X is discrete and μ is a counting measure, p(x) is a probability mass function (pmf); when x is continuous and μ is a Lebesgue measure, p(x) is a probability density function (pdf).

Intuitively, the true distribution P(x) can be replaced by the empirical distribution constructed from the simulated samples {x^(i)} (see Fig. 1 for an illustration):

  P̂(x) = (1/N_p) Σ_{i=1}^{N_p} δ(x − x^(i)),

where δ(·) is a Radon-Nikodým density w.r.t. μ of the point-mass distribution concentrated at the point x^(i). When x ∈ X is discrete, δ(x − x^(i)) is 1 for x = x^(i) and 0 elsewhere. When x ∈ X is continuous, δ(x − x^(i)) is a Dirac delta function, δ(x − x^(i)) = 0 for all x ≠ x^(i), and ∫_X dP̂(x) = ∫_X p̂(x) dx = 1.

[Fig. 1. Empirical probability distribution (density) function constructed from the discrete observations {x^(i)}.]

B. Notations

Throughout this paper, bold font refers to a vector or matrix; the subscript t (t ∈ R^+) refers to an index in the continuous-time domain, and n (n ∈ N) to an index in the discrete-time domain. p(x) refers to a pdf w.r.t. a Lebesgue measure or a pmf w.r.t. a counting measure. E[·] and Var[·] (Cov[·]) are the expectation and variance (covariance) operators, respectively. Unless specified otherwise, expectations are taken w.r.t. the true pdf. The notations x_{0:n} and y_{0:n}9 refer to the state and observation sets with elements collected from time step 0 up to n. The Gaussian (normal) distribution is denoted by N(μ, Σ). x_n represents the true state at time step n, whereas x̂_n (or x̂_{n|n}) and x̂_{n|n−1} represent the filtered and predicted estimates of x_n, respectively. f and g are used to represent the vector-valued state function and measurement function, respectively; f is also used for a generic (vector- or scalar-valued) nonlinear function. Additional nomenclature will be given wherever clarification is necessary.

For the reader's convenience, a complete list of the notations used in this paper is summarized in Appendix G.

Footnote 9: Sometimes it is also denoted by y_{1:n}, which differs in the assumed ordering of the state and measurement equations.

C. Stochastic Filtering Problem

Before we run into the mathematical formulation of the stochastic filtering problem, it is necessary to clarify some basic concepts:

Filtering is an operation that involves the extraction of information about a quantity of interest at time t by using data measured up to and including t.
Prediction is an a priori form of estimation. Its aim is to input ut-1 ut u derive information about what the quantity of interest t+1 will be like at some time t + τ in the future (τ> 0) by using data measured up to and including time ( ) ft-1 ft( ) t. Unless specified otherwise, prediction is referred to state xt-1 xt xt+1 one-step ahead prediction in this paper.
Smoothing is an a posteriori form of estimation in that g t-1 ( ) g t( ) g t+1 ( ) data measured after the time of interest are used for measurement y y the estimation. Specifically, the smoothed estimate at yt-1 t t+1 time t is obtained by using data measured over the interval [0,t], where t Now, let us consider the following generic stochastic fil- tering problem in a dynamic state-space form [238], [422]: of mean and state-error correlation matrix are calculated and propagated. In equations (3a) and (3b), Fn+1,n, Gn x˙ t = f(t, xt, ut, dt), (1a) are called transition matrix and measurement matrix, re- yt = g(t, xt, ut, vt), (1b) spectively. Described as a generic state-space model, the stochastic where equations (1a) and (1b) are called state equation and filtering problem can be illustrated by a graphical model measurement equation, respectively; xt represents the state (Fig. 2). Given initial density p(x0), transition density vector, yt is the measurement vector, ut represents the sys- p(xn|xn− ), and likelihood p(yn|xn), the objective of the tem input vector (as driving force) in a controlled environ- 1 N N N N filtering is to estimate the optimal current state at time n ment; f : R x → R x and g : R x → R y are two vector- given the observations up to time n, which is in essence valued functions, which are potentially time-varying; dt amount to estimating the posterior density p(xn|y0:n)or and vt represent the process (dynamical) noise and mea- p(x n|y n). Although the posterior density provides a surement noise respectively, with appropriate dimensions. 0: 0: complete solution of the stochastic filtering problem, the The above formulation is discussed in the continuous-time problem still remains intractable since the density is a func- domain, in practice however, we are more concerned about tion rather than a finite-dimensional point estimate. 
We the discrete-time filtering.10 In this context, the following should also keep in mind that most of physical systems are practical filtering problem is concerned:11 not finite dimensional, thus the infinite-dimensional system xn+1 = f(xn, dn), (2a) can only be modeled approximately by a finite-dimensional filter, in other words, the filter can only be suboptimal yn = g(xn, vn), (2b) in this sense. Nevertheless, in the context of nonlinear where dn and vn can be viewed as white noise random filtering, it is still possible to formulate the exact finite- sequences with unknown statistics in the discrete-time do- dimensional filtering solution, as we will discuss in Section main. The state equation (2a) characterizes the state tran- IV. sition probability p(xn+1|xn), whereas the measurement In Table I, a brief and incomplete development history of equation (2b) describes the probability p(yn|xn)whichis stochastic filtering theory (from linear to nonlinear, Gaus- further related to the measurement noise model. sian to non-Gaussian, stationary to non-stationary) is sum- The equations (2a)(2b) reduce to the following special marized. Some detailed reviews are referred to [244], [423], case where a linear Gaussian dynamic system is consid- [247], [205]. ered:12 D. Nonlinear Stochastic Filtering Is an Ill-posed Inverse xn+1 = Fn+1,nxn + dn, (3a) Problem yn = Gnxn + vn, (3b) D.1 Inverse Problem for which the analytic filtering solution is given by the Stochastic filtering is an inverse problem: Given collected Kalman filter [250], [253], in which the sufficient statistics13 yn at discrete time steps (hence y0:n), provided f and g are 10 The continuous-time dynamic system can be always converted known, one needs to find the optimal or suboptimal xˆn.In into a discrete-time system by sampling the outputs and using “zero- another perspective, this problem can be interpreted as an order holds” on the inputs. 
Hence the derivative will be replaced by the difference, the operator will become a matrix. inverse mapping learning problem: Find the inputs sequen- 11For discussion simplicity, no driving-force in the dynamic system tially with a (composite) mapping function which yields the (which is often referred to the stochastic control problem) is consid- output data. In contrast to the forward learning (given in- ered in this paper. However, the extension to the driven system is straightforward. puts find outputs) which is a many-to-one mapping prob- 12An excellent and illuminating review of linear filtering theory is lem, the inversion learning problem is one-to-many, in a found in [244] (see also [385], [435], [61]); for a complete treatment of sense that the mapping from output to input space is gen- linear estimation theory, see the classic textbook [247]. 13Sufficient statistics is referred to a collection of quantities which erally non-unique. uniquely determine a probability density in its entirety. A problem is said to be well-posed if it satisfies three con- MANUSCRIPT 6 TABLE I A Development History of Stochastic Filtering Theory. 
author(s) (year) method solution comment Kolmogorov (1941) innovations exact linear, stationary Wiener (1942) spectral factorization exact linear, stationary, infinite memory Levinson (1947) lattice filter approximate linear, stationary, finite memory Bode & Shannon (1950) innovations, whitening exact linear, stationary, Zadeh & Ragazzini (1950) innovations, whitening exact linear, non-stationary Kalman (1960) orthogonal projection exact LQG, non-stationary, discrete Kalman & Bucy (1961) recursive Riccati equation exact LQG, non-stationary, continuous Stratonovich (1960) conditional Markov process exact nonlinear, non-stationary Kushner (1967) PDE exact nonlinear, non-stationary Zakai (1969) PDE exact nonlinear, non-stationary Handschin & Mayne (1969) Monte Carlo approximate nonlinear, non-Gaussian, non-stationary Bucy & Senne (1971) point-mass, Bayes approximate nonlinear, non-Gaussian, non-stationary Kailath (1971) innovations exact linear, non-Gaussian, non-stationary Beneˇs (1981) Beneˇs exact solution of Zakai eqn. nonlinear, finite-dimensional Daum (1986) Daum, virtual measurement exact solution of FPK eqn. nonlinear, finite-dimensional Gordon, Salmond, & Smith (1993) bootstrap, sequential Monte Carlo approximate nonlinear, non-Gaussian, non-stationary Julier & Uhlmann (1997) unscented transformation approximate nonlinear, (non)-Gaussian, derivative-free ditions: existence, uniqueness and stability, otherwise it is where the second integral is Itˆo stochastic integral (named said to be ill posed [87]. In this context, stochastic filtering after Japanese mathematician Kiyosi Ito [233]).15 problem is ill-posed in the following sense: (i) The ubiqui- Mathematically, the ill-posed nature of stochastic filter- tous presence of the unknown noise corrupts the state and ing problem can be understood from the operator theory. 
measurement equations, given limited noisy observations, the solution is non-unique; (ii) supposing the state equation is a diffeomorphism (i.e., differentiable and regular),^14 the measurement function is possibly a many-to-one mapping (e.g., g(ξ) = ξ^2 or g(ξ) = sin(ξ); see also the illustrative example in Section VIII-G), which also violates the uniqueness condition; (iii) the filtering problem is per se a conditional posterior distribution (density) estimation problem, which is known to be stochastically ill posed, especially in high-dimensional spaces [463], let alone in on-line processing [412]. The notion of well-posedness is made precise as follows:

Definition 4: [274], [87] Let A : Y → X be an operator from a normed space Y to X. The equation AY = X is said to be well posed if A is bijective and the inverse operator A^{-1} : X → Y is continuous; otherwise the equation is called ill posed.

D.2 Differential Operator and Integral Equation

In what follows, we present a rigorous analysis of the stochastic filtering problem in the continuous-time domain. To simplify the analysis, we first consider the simple irregular stochastic differential equation (SDE)

  dx_t/dt = f(t, x_t) + d_t,   t ∈ T,                          (4)

where x_t is a second-order stochastic process, ω_t = ∫_0^t d_s ds is a Wiener process (Brownian motion), and d_t can be regarded as a white noise. Here f : T × L_2(Ω, F, P) → L_2(Ω, F, P) is a mapping into the (Lebesgue square-integrable) Hilbert space L_2(Ω, F, P) of processes with finite second-order moments. The solution of (4) is given by the stochastic integral^15

  x_t = x_0 + ∫_0^t f(s, x_s) ds + ∫_0^t dω_s.                 (5)

From an operator-theoretic viewpoint, (4) is a special case of a stochastic operator equation:

Definition 5: [418] Suppose H is a Hilbert space, let A = A(γ) be a stochastic operator mapping Ω × H into H, and let X = X(γ) be a generalized random variable (or function) in H; then

  A(γ)Y = X(γ)                                                 (6)

is a generalized stochastic operator equation for the element Y ∈ H.

Since γ is an element of a measurable space (Ω, F) on which a complete probability measure P is defined, the stochastic operator equation (6) is a family of equations; the family has a unique member when P is a Dirac measure. Suppose Y is a smooth functional with continuous first n derivatives; then (6) can be written as

  A(γ)Y(γ) = Σ_{k=0}^{N} a_k(t, γ) d^k Y / dt^k = X(γ),        (7)

which can be represented in the form of stochastic integral equations of Fredholm or Volterra type [418], with an appropriately defined kernel K:

  Y(t, γ) = X(t, γ) + ∫ K(t, τ, γ) Y(τ, γ) dτ,                 (8)

which takes a form similar to the continuous-time Wiener-Hopf equation (see, e.g., [247]) when K is translation invariant.

Definition 6: [418] Any mapping Y(γ) : Ω → H which satisfies A(γ)Y(γ) = X(γ) for every γ ∈ Ω is said to be a wide-sense solution of (6). The wide-sense solution is a stochastic solution if it is measurable w.r.t. P and Pr{γ : A(γ)Y(γ) = X(γ)} = 1.

The existence and uniqueness conditions of the solution to the stochastic operator equation (6) are given by the probabilistic Fixed-Point Theorem [418]. The essential idea of the Fixed-Point Theorem is to prove that A(γ) is a stochastic contractive operator, which unfortunately is not always true for the stochastic filtering problem.

Let us turn our attention to the measurement equation in an integral form,

  y_t = ∫_0^t g(s, x_s) ds + v_t,                              (9)

where g : R^{N_x} → R^{N_y}. For any φ(·) ∈ R^{N_x}, the optimal (in the mean-square sense) filter φ̂(x_t) is the one that attains the minimum mean-square error, as given by

  φ̂(x_t) ≡ arg min E[ ||φ − φ̂||^2 ] = ∫ π(x_t|y_{0:t}) φ(x) dx_t / ∫ π(x_t|y_{0:t}) dx_t,   (10)

where π(·) is an unnormalized filtering density. A common way to study the unnormalized filtering density is to treat it as the solution of the Zakai equation, as will be detailed in Section II-E.

D.3 Relations to Other Problems

It is conducive to a better understanding of the stochastic filtering problem to compare it with other ill-posed problems that share common features from different perspectives:

• System identification: System identification has much in common with stochastic filtering; both belong to the class of statistical inference problems. Sometimes identification is also understood as filtering in the stochastic control realm, especially with a driving force as input. However, the measurement equation can admit feedback of the previous output, i.e., y_n = g(x_n, y_{n−1}, v_n). Besides, identification is often more concerned with the parameter estimation problem instead of state estimation. We will revisit this issue in Section IX.

• Regression: From some perspective, filtering can be viewed as a sequential linear/nonlinear regression problem if the state equation reduces to a random walk. But regression differs from filtering in the following sense: regression aims to find a deterministic mapping between input and output given a finite number of observation pairs {x_i, y_i}, which is usually done off-line; whereas filtering aims to sequentially infer the signal or state process given some observations, assuming knowledge of the state and measurement models.

• Missing data problem: The missing data problem is well addressed in statistics; it is concerned with probabilistic inference or model fitting given limited data. Statistical approaches (e.g., the EM algorithm, data augmentation) serve this goal by assuming auxiliary missing variables (unobserved data) with tractable (on-line or off-line) inference.

• Density estimation: Density estimation shares some common ground with filtering in that both target a dependency estimation problem. Generally, filtering is nothing but learning the conditional probability distribution. However, density estimation is more difficult in the sense that it has no prior knowledge of the data (though sometimes assumptions are made, e.g., a mixture distribution), and it usually works directly on the state (i.e., the observation process is tantamount to the state process). Most density estimation techniques are off-line.

• Nonlinear dynamic reconstruction: Nonlinear dynamic reconstruction arises from physical phenomena (e.g., sea clutter) in the real world. Given some limited observations (possibly not continuously or evenly recorded), it is concerned with inferring the physically meaningful state information. In this sense, it is very similar to the filtering problem. However, it is much more difficult than the filtering problem in that the nonlinear dynamics involving f are totally unknown (usually a nonparametric model is assumed for estimation) and potentially complex (e.g., chaotic), and the prior knowledge of the state equation is very limited; the problem is thereby severely ill posed [87]. Likewise, dynamic reconstruction allows off-line estimation.

E. Stochastic Differential Equations and Filtering

In the following, we formulate the continuous-time stochastic filtering problem by SDE theory. Suppose {x_t} is a Markov process with an infinitesimal generator; rewriting the state-space equations (1a)(1b) in the following Itô SDE form [418], [360]:

  dx_t = f(t, x_t) dt + σ(t, x_t) dω_t,                        (11a)
  dy_t = g(t, x_t) dt + dv_t,                                  (11b)

where f(t, x_t) is often called the nonlinear drift and σ(t, x_t) the volatility or diffusion coefficient. Again, the noise processes {ω_t, v_t, t ≥ 0} are two Wiener processes, with x_t ∈ R^{N_x} and y_t ∈ R^{N_y}. First, let us look at the state equation (a.k.a. the diffusion equation). For all t ≥ 0, we define a backward diffusion operator L_t as^16

  L_t = Σ_{i=1}^{N_x} f_t^i ∂/∂x_i + (1/2) Σ_{i,j=1}^{N_x} a_t^{ij} ∂^2/(∂x_i ∂x_j),   (12)

where a_t^{ij} = σ^i(t, x_t) σ^j(t, x_t).

^14 A diffeomorphism is a smooth one-to-one mapping with a smooth inverse.
^15 The Itô stochastic integral is defined as ∫_0^t σ(s) dω(s) = lim_{n→∞} Σ_{j=1}^n σ(t_{j−1}) Δω_j. The Itô calculus satisfies dω^2(t) = dt, dω(t)dt = 0, and dt^{N+1} = dω^{N+2}(t) = 0 (N > 1). See [387], [360] for a detailed background on Itô calculus and Itô SDEs.
^16 L_t is a partial differential operator.
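Sample paths of the Itô diffusion (11a) can be generated numerically by the Euler-Maruyama scheme, which discretizes dx_t = f dt + σ dω_t with independent Gaussian increments. The sketch below is illustrative only: the Ornstein-Uhlenbeck-style drift, the diffusion constant, and the step size are our own choices, not from the text.

```python
import numpy as np

def euler_maruyama(f, sigma, x0, T=1.0, N=1000, seed=0):
    """Simulate dx_t = f(t, x) dt + sigma(t, x) dw_t on [0, T]."""
    rng = np.random.default_rng(seed)
    dt = T / N
    t = np.linspace(0.0, T, N + 1)
    x = np.empty(N + 1)
    x[0] = x0
    for k in range(N):
        dw = rng.normal(0.0, np.sqrt(dt))    # Wiener increment, Var = dt
        x[k + 1] = x[k] + f(t[k], x[k]) * dt + sigma(t[k], x[k]) * dw
    return t, x

# Illustrative choice: mean-reverting drift f = -x, constant diffusion 0.3
t, x = euler_maruyama(lambda t, x: -x, lambda t, x: 0.3, x0=2.0)
print(x[-1])
```

Averaging many such paths approximates expectations under the diffusion, and a histogram of many endpoints approximates the density whose evolution is governed by the FPK equation (16) below.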
The operator L_t corresponds to an infinitesimal generator of the diffusion process {x_t, t ≥ 0}. The goal now is to deduce conditions under which one can find a recursive and finite-dimensional (closed-form) scheme to compute the conditional probability distribution p(x_t|Y_t), given the filtration Y_t produced by the observation process (1b).^17

Let us define an innovations process^18

  e_t = y_t − ∫_0^t E[g(s, x_s)|Y_s] ds,                       (13)

where E[g(s, x_s)|Y_s] is described by

  ĝ(x_t) = E[g(t, x_t)|Y_t] = ∫_{−∞}^{∞} g(x_t) p(x_t|Y_t) dx.   (14)

For any test function φ ∈ R^{N_x}, the forward diffusion operator L̃_t is defined as

  L̃_t φ = − Σ_{i=1}^{N_x} f_t^i ∂φ/∂x_i + (1/2) Σ_{i,j=1}^{N_x} a_t^{ij} ∂^2 φ/(∂x_i ∂x_j),   (15)

which essentially is the Fokker-Planck operator. Given the initial condition p(x_0) at t = 0 as the boundary condition, it turns out that the pdf of the diffusion process satisfies the Fokker-Planck-Kolmogorov equation (FPK; a.k.a. the Kolmogorov forward equation, [387])^19

  ∂p(x_t)/∂t = L̃_t p(x_t).                                    (16)

By involving the innovations process (13) and assuming the measurement-noise covariance Cov[v_t] = Σ_{v,t}, we have the following Kushner equation (e.g., [284]):

  dp(x_t|Y_t) = L̃_t p(x_t|Y_t) dt + p(x_t|Y_t) e_t Σ_{v,t}^{−1} dt,   (t ≥ 0)   (17)

which reduces to the FPK equation (16) when there are no observations or filtration Y_t. Integrating (17), we have

  p(x_t|Y_t) = p(x_0) + ∫_0^t L̃_s p(x_s|Y_s) ds + ∫_0^t p(x_s|Y_s) e_s Σ_{v,s}^{−1} ds.   (18)

Given the conditional pdf (18), suppose we want to calculate φ̂(x_t) = E[φ(x_t)|Y_t] for some nonlinear function φ ∈ R^{N_x}. By interchanging the order of integration, we have

  φ̂(x_t) = ∫_{−∞}^{∞} φ(x) p(x_t|Y_t) dx
          = ∫_{−∞}^{∞} φ(x) p(x_0) dx + ∫_0^t ∫_{−∞}^{∞} φ(x) L̃_s p(x_s|Y_s) dx ds
            + ∫_0^t ∫_{−∞}^{∞} φ(x) p(x_s|Y_s) e_s Σ_{v,s}^{−1} dx ds
          = E[φ(x_0)] + ∫_0^t ∫_{−∞}^{∞} p(x_s|Y_s) L_s φ(x) dx ds
            + ∫_0^t ( ∫_{−∞}^{∞} φ(x) g(s, x) p(x_s|Y_s) dx − ĝ(x_s) ∫_{−∞}^{∞} φ(x) p(x_s|Y_s) dx ) Σ_{v,s}^{−1} ds.

The Kushner equation lends itself to a recursive form of the filtering solution, but the conditional mean requires all of the higher-order conditional moments and thus leads to an infinite-dimensional system.

On the other hand, under some mild conditions, the unnormalized conditional density of x_t given Y_t, denoted π(x_t|Y_t), is the unique solution of the following stochastic partial differential equation (PDE), the so-called Zakai equation (see [505], [238], [285]):

  dπ(x_t|Y_t) = L̃ π(x_t|Y_t) dt + g(t, x_t) π(x_t|Y_t) dy_t,   (19)

with the same L̃ defined in (15). The Zakai equation and the Kushner equation have a one-to-one correspondence, but the Zakai equation is much simpler;^20 hence one usually turns to solving the Zakai equation instead of the Kushner equation. In the early history of nonlinear filtering, the common way was to discretize the Zakai equation and seek a numerical solution. Numerous efforts were devoted along this line [285], [286], e.g., separation of variables [114], the adaptive local grid [65], and the particle (quadrature) method [66]. However, these methods are neither recursive nor computationally efficient.

III. Bayesian Statistics and Bayesian Estimation

A. Bayesian Statistics

Bayesian theory (e.g., [38]) is a branch of mathematical probability theory that allows people to model the uncertainty about the world and the outcomes of interest by incorporating prior knowledge and observational evidence.^21

^17 One can imagine the filtration as a sort of information encoding the previous history of the state and measurement.
^18 The innovations process is defined as a white Gaussian noise process. See [245], [247] for a detailed treatment.
^19 The stochastic process is determined equivalently by the FPK equation (16) or the SDE (11a). The FPK equation can be interpreted as follows: the first term is the equation of motion for a cloud of particles whose distribution is p(x_t), each point of which obeys the equation of motion dx/dt = f(x_t, t); the second term describes the disturbance due to Brownian motion. The solution of (16) can be solved
exactly by Fourier transform; by inverting the Fourier transform, one obtains

  p(x, t+Δt | x_0, t) = (1/√(2π σ_0 Δt)) exp( −(x − x_0 − f(x_0)Δt)^2 / (2 σ_0 Δt) ),

which is a Gaussian distribution of a deterministic path.
^20 This is true because (19) is linear w.r.t. π(x_t|Y_t), whereas (17) involves a certain nonlinearity. We do not extend the discussion here due to space constraints.
^21 In the circle of statistics, there are slightly different treatments of probability. The frequentists condition on a hypothesis of choice and put the probability distribution on the data, either observed or not; only one hypothesis is regarded as true, and they regard probability as frequency. The Bayesians condition only on the observed data and consider probability distributions on the hypotheses; they put probability distributions on the several hypotheses given some priors, and probability is not viewed as equivalent to frequency. See [388], [38], [320] for more information.

Bayesian analysis, interpreting probability as a conditional measure of uncertainty, is one of the popular methods for solving inverse problems. Before turning to Bayesian inference and Bayesian estimation, we first introduce some fundamental Bayesian statistics.

Definition 7: (Bayesian Sufficient Statistics) Let p(x|Y) denote the probability density of x conditioned on measurements Y. A statistic, Ψ(x), is said to be "sufficient" if the distribution of x conditionally on Ψ does not depend on Y. In other words, p(x|Y) = p(x|Y′) for any two sets Y and Y′ such that Ψ(Y) = Ψ(Y′).

The sufficient statistic Ψ(x) contains all of the information brought by x about Y. The Rao-Blackwell Theorem says that when an estimator is evaluated under a convex loss, the optimal procedure depends only on the sufficient statistic. The Sufficiency Principle and the Likelihood Principle are two axiomatic principles in Bayesian inference [388].

There are three types of intractable problems inherently related to Bayesian statistics:

• Normalization: Given the prior p(x) and the likelihood p(y|x), the posterior p(x|y) is obtained as the product of prior and likelihood divided by a normalizing factor,

  p(x|y) = p(y|x) p(x) / ∫_X p(y|x) p(x) dx.                   (20)

• Marginalization: Given the joint posterior p(x, z|y), the marginal posterior is

  p(x|y) = ∫_Z p(x, z|y) dz;                                   (21)

as shown later, marginalization and factorization play an important role in Bayesian inference.

• Expectation: Given the conditional pdf, some averaged statistics of interest can be calculated:

  E_{p(x|y)}[f(x)] = ∫_X f(x) p(x|y) dx.                       (22)

In Bayesian inference, all uncertainties (including the states, the parameters, which are either time-varying or fixed but unknown, and the priors) are treated as random variables.^22 The inference is performed within the Bayesian framework given all of the available information, and the objective of Bayesian inference is to use priors and causal knowledge, quantitatively and qualitatively, to infer the conditional probability given finite observations. There are usually three levels of probabilistic reasoning in Bayesian analysis (so-called hierarchical Bayesian analysis): (i) starting with model selection given the data and assumed priors; (ii) estimating the parameters to fit the data given the model and priors; (iii) updating the hyperparameters of the prior. Optimization and integration are two fundamental numerical problems arising in statistical inference. Bayesian inference can be illustrated by a directed graph: a Bayesian network (or belief network) is a probabilistic graphical model with a set of vertices and edges (or arcs), in which the probability dependency is described by a directed arrow between two nodes that represent two random variables. Graphical models also allow the possibility of constructing more complex hierarchical statistical models [239], [240].

^22 This is the true spirit of Bayesian estimation, which differs from other estimation schemes (e.g., least-squares) where the unknown parameters are usually regarded as deterministic.

B. Recursive Bayesian Estimation

In the following, we present a detailed derivation of recursive Bayesian estimation, which underlies the principle of sequential Bayesian filtering. Two assumptions are used to derive the recursive Bayesian filter: (i) the states follow a first-order Markov process, p(x_n|x_{0:n−1}) = p(x_n|x_{n−1}); and (ii) the observations are conditionally independent given the states. For notational simplicity, we denote by Y_n the set of observations y_{0:n} := {y_0, ..., y_n}, and let p(x_n|Y_n) denote the conditional pdf of x_n. From Bayes' rule we have

  p(x_n|Y_n) = p(Y_n|x_n) p(x_n) / p(Y_n)
             = p(y_n, Y_{n−1}|x_n) p(x_n) / p(y_n, Y_{n−1})
             = p(y_n|Y_{n−1}, x_n) p(Y_{n−1}|x_n) p(x_n) / ( p(y_n|Y_{n−1}) p(Y_{n−1}) )
             = p(y_n|Y_{n−1}, x_n) p(x_n|Y_{n−1}) p(Y_{n−1}) p(x_n) / ( p(y_n|Y_{n−1}) p(Y_{n−1}) p(x_n) )
             = p(y_n|x_n) p(x_n|Y_{n−1}) / p(y_n|Y_{n−1}).     (23)

As shown in (23), the posterior density p(x_n|Y_n) is described by three terms:

• Prior: The prior p(x_n|Y_{n−1}) defines the knowledge of the model,

  p(x_n|Y_{n−1}) = ∫ p(x_n|x_{n−1}) p(x_{n−1}|Y_{n−1}) dx_{n−1},   (24)

where p(x_n|x_{n−1}) is the transition density of the state.

• Likelihood: The likelihood p(y_n|x_n) essentially determines the measurement noise model in equation (2b).

• Evidence: The denominator involves an integral,

  p(y_n|Y_{n−1}) = ∫ p(y_n|x_n) p(x_n|Y_{n−1}) dx_n.          (25)

The calculation or approximation of these three terms is the essence of Bayesian filtering and inference.

IV. Bayesian Optimal Filtering

Bayesian filtering aims to apply Bayesian statistics and Bayes' rule to probabilistic inference problems, and specifically to the stochastic filtering problem. To our knowledge, Ho and Lee [212] were among the first authors to discuss iterative Bayesian filtering; they discussed in principle the sequential state estimation problem and included the Kalman filter as a special case. In the past few decades, numerous authors have investigated Bayesian filtering in a dynamic state-space framework [270], [271], [421], [424], [372], [480]-[484].

A. Optimal Filtering

An optimal filter is said to be "optimal" only in some specific sense [12]; in other words, one should define a criterion that measures the optimality. For example, some potential criteria for measuring optimality are:

1. Minimum mean-squared error (MMSE): It can be defined in terms of the prediction or filtering error (or, equivalently, the trace of the state-error covariance),

  E[ ||x_n − x̂_n||^2 | y_{0:n} ] = ∫ ||x_n − x̂_n||^2 p(x_n|y_{0:n}) dx_n,

which aims to find the conditional mean

  x̂_n = E[x_n|y_{0:n}] = ∫ x_n p(x_n|y_{0:n}) dx_n.

2. Maximum a posteriori (MAP): It aims to find the mode of the posterior probability p(x_n|y_{0:n}),^23 which is equivalent to minimizing the loss function

  E = E[ 1 − I_{x_n : ||x_n − x̂_n|| ≤ ζ}(x_n) ],

where I(·) is an indicator function and ζ is a small scalar.

3. Maximum likelihood (ML): which reduces to a special case of MAP in which the prior is neglected.^24

4. Minimax: which is to find the median of the posterior p(x_n|y_{0:n}). See Fig. 3 for an illustration of the difference between the mode, mean and median.

5. Minimum conditional inaccuracy:^25 namely,

  E_{p(x,y)}[ −log p̂(x|y) ] = ∫ p(x, y) log( 1/p̂(x|y) ) dx dy.

6. Minimum conditional KL divergence [276]: the conditional KL divergence is given by

  KL = ∫ p(x, y) log( p(x, y) / ( p̂(x|y) p(x) ) ) dx dy.

7. Minimum free energy:^26 this is a lower bound of the maximum log-likelihood, and the aim is to optimize

  F(Q; P) ≡ E_{Q(x)}[ −log P(x|y) ]
          = E_{Q(x)}[ log( Q(x)/P(x|y) ) ] − E_{Q(x)}[ log Q(x) ],

where Q(x) is an arbitrary distribution of x. The first term is the Kullback-Leibler (KL) divergence between the distributions Q(x) and P(x|y); the second term is the entropy w.r.t. Q(x). The minimization of the free energy can be implemented iteratively by the expectation-maximization (EM) algorithm [130]:

  Q^{(n+1)} ←− arg max_Q F(Q; x^{(n)}),
  x^{(n+1)} ←− arg max_x F(Q^{(n+1)}; x).

Fig. 3. Left: An illustration of three optimality criteria that seek different solutions for a skewed unimodal distribution, in which the mean, mode and median do not coincide. Right: MAP is misleading for a multimodal distribution where multiple modes (maxima) exist.

Remarks:
• The above criteria are valid not only for state estimation but also for parameter estimation (by viewing x as the unknown parameters).
• Both the MMSE and MAP methods require estimation of the posterior distribution (density), but MAP doesn't require calculation of the denominator (integration) and is thereby computationally less expensive, whereas the former requires full knowledge of the prior, likelihood and evidence. Note, however, that the MAP estimate has a drawback, especially in a high-dimensional space: high probability density does not imply high probability mass. A narrow spike with very small width (support) can have a very high density, but the actual probability of the estimated state (or parameter) belonging to it is small. Hence, the width of the mode is more important than its height in the high-dimensional case.
• The last three criteria are all ML oriented, minimizing the negative log-likelihood −log p̂(x|y) and taking the expectation w.r.t. a fixed or variational pdf. Criterion 5 takes the expectation w.r.t. the joint pdf p(x, y); when Q(x) = p(x, y), it is equivalent to Criterion 7; Criterion 6 is a modified version of the upper bound of Criterion 5.
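The recursion (23)-(25) and the optimality criteria above can be made concrete on a one-dimensional grid, where the integrals become sums. The sketch below is illustrative only — the random-walk model, grid, and noise levels are our own choices, not from the text. It propagates a discretized posterior through prediction and update, then reads off the MMSE (conditional-mean) and MAP (mode) estimates.

```python
import numpy as np

# Illustrative 1-D model (our own choice):
#   x_n = x_{n-1} + d_n,  d_n ~ N(0, q)   (random-walk state)
#   y_n = x_n + v_n,      v_n ~ N(0, r)   (noisy observation)
q, r = 0.5**2, 1.0**2
grid = np.linspace(-10.0, 10.0, 2001)      # discretized state space
dx = grid[1] - grid[0]

def gauss(z, var):
    return np.exp(-0.5 * z * z / var) / np.sqrt(2.0 * np.pi * var)

# transition density p(x_n | x_{n-1}) evaluated on the grid
trans = gauss(grid[:, None] - grid[None, :], q)

posterior = gauss(grid, 4.0)               # initial prior p(x_0)
posterior /= posterior.sum() * dx

rng = np.random.default_rng(0)
x_true, estimates = 0.0, []
for n in range(50):
    x_true += rng.normal(0.0, np.sqrt(q))          # simulate the state
    y = x_true + rng.normal(0.0, np.sqrt(r))       # simulate the observation
    prior = (trans @ posterior) * dx               # prediction, cf. (24)
    posterior = gauss(y - grid, r) * prior         # likelihood x prior, cf. (23)
    posterior /= posterior.sum() * dx              # divide by the evidence, cf. (25)
    x_mmse = (grid * posterior).sum() * dx         # criterion 1: conditional mean
    x_map = grid[np.argmax(posterior)]             # criterion 2: posterior mode
    estimates.append((x_mmse, x_map))

print(estimates[-1])
```

For this unimodal, near-Gaussian posterior, the mean and the mode nearly coincide; for a skewed or multimodal posterior (Fig. 3), the two criteria give different answers, which is exactly why the criteria must be distinguished.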
^23 When the mode and the mean of a distribution coincide, the MAP estimate is correct; however, for multimodal distributions, the MAP estimate can be arbitrarily bad. See Fig. 3.
^24 This can be viewed as a least-informative prior with a uniform distribution.
^25 It is a generalization of Kerridge's inaccuracy to the case of i.i.d. data.
^26 Free energy is a variational approximation of ML in order to minimize its upper bound. This criterion is usually used in off-line Bayesian estimation.
^27 For a discussion of the difference between Bayesian risk and frequentist risk, see [388].

The criterion of optimality used for Bayesian filtering is the Bayes risk of MMSE.^27 Bayesian filtering is optimal in the sense that it seeks the posterior distribution, which integrates and uses all of the available information expressed by probabilities (assuming they are quantitatively correct). However, as time proceeds, one needs infinite computing power and unlimited memory to calculate the "optimal" solution, except in some special cases (e.g., the linear Gaussian or conjugate family case); hence, in general, we can only seek a suboptimal or locally optimal solution.

B. Kalman Filtering

Fig. 4. Schematic illustration of the Kalman filter's update as a predictor-corrector: the time update produces the one-step prediction of the measurement; the measurement update applies the correction to the state estimate.

Kalman filtering, in the spirit of the Kalman filter [250], [253], or the Kalman-Bucy filter [249], consists of an iterative prediction-correction process (see Fig. 4). In the prediction step, the time update is taken, where the one-step-ahead prediction of the observation is calculated; in the correction step, the measurement update is taken, where the correction to the estimate of the current state is calculated. In a stationary situation, where the matrices A_n, B_n, C_n, D_n in (3a) and (3b) are constant, the Kalman filter is precisely the Wiener filter for stationary least-squares smoothing; in other words, the Kalman filter is a time-variant Wiener filter [11], [12]. Under the LQG circumstance, the Kalman filter was originally derived with the orthogonal projection method. In the late 1960s, Kailath [245] used the innovations approach developed by Wold and Kolmogorov to reformulate the Kalman filter, with the tools of martingale theory.^28 From the innovations point of view, the Kalman filter is a whitening filter.^29 The Kalman filter is also optimal in the sense that it is unbiased, E[x̂_n] = E[x_n], and is a minimum-variance estimate. A detailed history of the Kalman filter and its many variants can be found in [385], [244], [246], [247], [238], [12], [423], [96], [195]. The Kalman filter has a very nice Bayesian interpretation [212], [497], [248], [366].
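The prediction-correction cycle of Fig. 4 is short enough to state in full. The sketch below implements the standard time and measurement updates for a linear Gaussian model; the notation follows the text (F the transition matrix, G the measurement matrix, Σ_d and Σ_v the noise covariances), while the numerical values are illustrative choices of ours. Note that the gain used here is the common filtered form K_n = P_{n,n−1} G_n^T (G_n P_{n,n−1} G_n^T + Σ_v)^{−1}; the text's eq. (38) carries an extra transition-matrix factor because it is written for the predictor form.

```python
import numpy as np

def kalman_step(x_hat, P, y, F, G, Sigma_d, Sigma_v):
    """One predict-correct cycle of the Kalman filter."""
    # Time update (prediction)
    x_pred = F @ x_hat                         # one-step state prediction
    P_pred = F @ P @ F.T + Sigma_d             # predicted error covariance
    # Measurement update (correction)
    e = y - G @ x_pred                         # innovation
    S = G @ P_pred @ G.T + Sigma_v             # innovation covariance
    K = P_pred @ G.T @ np.linalg.inv(S)        # Kalman gain (filtered form)
    x_new = x_pred + K @ e
    P_new = (np.eye(len(x_hat)) - K @ G) @ P_pred
    return x_new, P_new

# Illustrative position/velocity example (values are our own choice)
F = np.array([[1.0, 1.0], [0.0, 1.0]])         # constant-velocity transition
G = np.array([[1.0, 0.0]])                     # only the position is observed
Sigma_d = 0.01 * np.eye(2)
Sigma_v = np.array([[1.0]])

x_hat, P = np.zeros(2), np.eye(2)
for y in [0.9, 2.1, 2.9, 4.2, 5.0]:            # noisy position readings
    x_hat, P = kalman_step(x_hat, P, np.array([y]), F, G, Sigma_d, Sigma_v)

print(x_hat)   # estimated [position, velocity]
```

After a handful of updates, the filter tracks both the observed position and the never-observed velocity, which is the practical payoff of propagating the full Gaussian posterior rather than the measurement alone.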
In the following, we will show that the celebrated Kalman filter can be derived within the Bayesian framework; more specifically, it reduces to a MAP solution. The derivation is somewhat similar to the ML solution given by [384]. For simplicity of presentation, we assume that the dynamic and measurement noises are both Gaussian distributed with zero mean and constant covariance. The derivation of the Kalman filter in the linear Gaussian scenario is based on the following assumptions:
• E[d_n d_m^T] = Σ_d δ_mn and E[v_n v_m^T] = Σ_v δ_mn.
• The state and process noise are mutually independent: E[x_n d_m^T] = 0 for n ≤ m, and E[x_n v_m^T] = 0 for all n, m.
• The process noise and measurement noise are mutually independent: E[d_n v_m^T] = 0 for all n, m.

Let x̂_n^{MAP} denote the MAP estimate of x_n that maximizes p(x_n|Y_n), or equivalently log p(x_n|Y_n). By using Bayes' rule, we may express p(x_n|Y_n) as

  p(x_n|Y_n) = p(x_n, Y_n) / p(Y_n) = p(x_n, y_n, Y_{n−1}) / p(y_n, Y_{n−1}),   (26)

where the joint pdf in the numerator can be further expressed as

  p(x_n, y_n, Y_{n−1}) = p(y_n|x_n, Y_{n−1}) p(x_n, Y_{n−1})
                       = p(y_n|x_n, Y_{n−1}) p(x_n|Y_{n−1}) p(Y_{n−1})
                       = p(y_n|x_n) p(x_n|Y_{n−1}) p(Y_{n−1}).   (27)

The third step is based on the fact that v_n does not depend on Y_{n−1}. Substituting (27) into (26), we obtain

  p(x_n|Y_n) = p(y_n|x_n) p(x_n|Y_{n−1}) p(Y_{n−1}) / ( p(y_n|Y_{n−1}) p(Y_{n−1}) )
             = p(y_n|x_n) p(x_n|Y_{n−1}) / p(y_n|Y_{n−1}),     (28)

which shares the same form as (23). Under the Gaussian assumption on the process and measurement noises, the mean and covariance of p(y_n|x_n) are calculated as

  E[y_n|x_n] = E[G_n x_n + v_n] = G_n x_n                      (29)

and

  Cov[y_n|x_n] = Cov[v_n|x_n] = Σ_v,                           (30)

respectively, and the conditional pdf p(y_n|x_n) can be further written as

  p(y_n|x_n) = A_1 exp( −(1/2)(y_n − G_n x_n)^T Σ_v^{−1} (y_n − G_n x_n) ),   (31)

where A_1 = (2π)^{−N_y/2} |Σ_v|^{−1/2}.

Consider the conditional pdf p(x_n|Y_{n−1}); its mean and covariance are calculated as

  E[x_n|Y_{n−1}] = E[F_{n,n−1} x_{n−1} + d_{n−1} | Y_{n−1}] = F_{n,n−1} x̂_{n−1} = x̂_{n|n−1}   (32)

and

  Cov[x_n|Y_{n−1}] = Cov[x_n − x̂_{n|n−1}] = Cov[e_{n,n−1}],   (33)

respectively, where x̂_{n|n−1} ≡ x̂(n|n−1) represents the state estimate at time n given the observations up to n−1, and e_{n,n−1} is the state-error vector. Denoting the covariance of e_{n,n−1} by P_{n,n−1}, by the Gaussian assumption we may obtain

  p(x_n|Y_{n−1}) = A_2 exp( −(1/2)(x_n − x̂_{n|n−1})^T P_{n,n−1}^{−1} (x_n − x̂_{n|n−1}) ),   (34)

where A_2 = (2π)^{−N_x/2} |P_{n,n−1}|^{−1/2}. By substituting equations (31) and (34) into (26), it further follows that

  p(x_n|Y_n) ∝ A exp( −(1/2)(y_n − G_n x_n)^T Σ_v^{−1} (y_n − G_n x_n)
                      −(1/2)(x_n − x̂_{n|n−1})^T P_{n,n−1}^{−1} (x_n − x̂_{n|n−1}) ),   (35)

where A = A_1 A_2 is a constant. Since the denominator is a normalizing constant, (35) can be regarded as an unnormalized density; this fact doesn't affect the following derivation.

Since the MAP estimate of the state is defined by the condition

  ∂ log p(x_n|Y_n) / ∂x_n |_{x_n = x̂^{MAP}} = 0,              (36)

substituting equation (35) into (36) yields

  x̂_n^{MAP} = ( G_n^T Σ_v^{−1} G_n + P_{n,n−1}^{−1} )^{−1} ( P_{n,n−1}^{−1} x̂_{n|n−1} + G_n^T Σ_v^{−1} y_n ).

By using the matrix inverse lemma,^30 this is simplified to

  x̂_n^{MAP} = x̂_{n|n−1} + K_n (y_n − G_n x̂_{n|n−1}),         (37)

where K_n is the Kalman gain, defined by

  K_n = F_{n+1,n} P_{n,n−1} G_n^T ( G_n P_{n,n−1} G_n^T + Σ_v )^{−1}.   (38)

Observing that

  e_{n,n−1} = x_n − x̂_{n|n−1}
            = F_{n,n−1} x_{n−1} + d_{n−1} − F_{n,n−1} x̂_{n−1}^{MAP}
            = F_{n,n−1} e_{n−1}^{MAP} + d_{n−1},               (39)

and by virtue of P_{n−1} = Cov[e_{n−1}^{MAP}], we have

  P_{n,n−1} = Cov[e_{n,n−1}] = F_{n,n−1} P_{n−1} F_{n,n−1}^T + Σ_d.   (40)

Since

  e_n = x_n − x̂_n^{MAP} = x_n − x̂_{n|n−1} − K_n (y_n − G_n x̂_{n|n−1}),   (41)

and noting that e_{n,n−1} = x_n − x̂_{n|n−1} and y_n = G_n x_n + v_n, we further have

  e_n = e_{n,n−1} − K_n (G_n e_{n,n−1} + v_n) = (I − K_n G_n) e_{n,n−1} − K_n v_n,   (42)

and it further follows that

  P_n = Cov[e_n^{MAP}] = (I − K_n G_n) P_{n,n−1} (I − K_n G_n)^T + K_n Σ_v K_n^T.

Rearranging the above equation, it reduces to

  P_n = P_{n,n−1} − F_{n,n+1} K_n G_n P_{n,n−1}.               (43)

Thus far, the Kalman filter has been completely derived from the MAP principle; the expression for x̂_n^{MAP} is exactly the same solution as the one derived from the innovations framework (or others).

The above procedure can be easily extended to the ML case without much effort [384]. Suppose we want to maximize the likelihood p(x_n|Y_n), which is equivalent to maximizing the log-likelihood

  log p(x_n|Y_n) = log p(x_n, Y_n) − log p(Y_n),               (44)

and the optimal estimate near the solution should satisfy

  ∂ log p(x_n|Y_n) / ∂x_n |_{x_n = x̂^{ML}} = 0.               (45)

Substituting (35) into (45), we actually want to minimize the cost function of two combined Mahalanobis norms,^31

  E = ||y_n − G_n x_n||^2_{Σ_v^{−1}} + ||x_n − x̂_{n|n−1}||^2_{P_{n,n−1}^{−1}}.   (46)

Taking the derivative of E with respect to x_n and setting it to zero, we obtain the same solution as (37).

^28 The martingale process was first introduced by Doob and discussed in detail in [139].
^29 The innovations concept can be used straightforwardly in nonlinear filtering [7]. From the innovations point of view, one criterion to justify the optimality of the solution to a nonlinear filtering problem is to check how white the pseudo-innovations are: the whiter, the more optimal.
^30 For A = B^{−1} + C D^{−1} C^T, it follows from the matrix inverse lemma that A^{−1} = B − B C (D + C^T B C)^{−1} C^T B.
^31 The Mahalanobis norm is defined as a weighted norm: ||A||^2_B = A^T B A.

Remarks:
• The derivation of the Kalman-Bucy filter [249] was rooted in SDE theory [387], [360]; it can also be derived within the Bayesian framework [497], [248].
• The optimal filtering solution described by the Wiener-Hopf equation is achieved by the spectral factorization technique [487]. By admitting a state-space formulation, the Kalman filter elegantly overcomes the stationarity assumption and provides a fresh look at the filtering problem. The signal process (i.e., the "state") is regarded as a linear stochastic dynamical system driven by white noise; the optimal filter thus has a stochastic differential structure, which makes recursive estimation possible. Spectral factorization is replaced by the solution of an ordinary differential equation (ODE) with known initial conditions. The Wiener filter doesn't distinguish between white and colored noises and also permits infinite-dimensional systems, whereas the Kalman filter works for finite-dimensional systems with the white-noise assumption.
• The Kalman filter is an unbiased minimum-variance estimator under the LQG circumstance. When the Gaussian assumption on the noise is violated, the Kalman filter is still optimal in the mean-square sense, but the estimate does not produce the conditional mean (i.e., it is biased), nor the minimum variance. The Kalman filter is not robust, because of the underlying assumption of the noise density model.
• The Kalman filter provides an exact solution for the linear Gaussian prediction and filtering problem. Concerning the smoothing problem, the off-line estimation version of the Kalman filter is given by the Rauch-Tung-Striebel (RTS) smoother [384], which consists of a forward filter in the form of a Kalman filter and a backward recursive smoother. The RTS smoother is more computationally efficient than the optimal smoother [206].
• The conventional Kalman filter is a point-valued filter; it can also be extended to set-valued filtering [39], [339], [80].
• In the literature there exist many variants of the Kalman filter, e.g., the covariance filter, the information filter, and square-root Kalman filters. See [205], [247] for more details and [403] for a unifying review.

C. Optimum Nonlinear Filtering

In practice, the use of the Kalman filter is limited by the ubiquitous nonlinearity and non-Gaussianity of the physical world. Hence, since the publication of the Kalman filter, numerous efforts have been devoted to the generic filtering problem, mostly in the Kalman filtering framework. A number of pioneers, including Zadeh [503], Bucy [61], [60], Wonham [496], Zakai [505], Kushner [282]-[285], and Stratonovich [430], [431], investigated the nonlinear filtering problem; see also the papers seeking optimal nonlinear filters [420], [289], [209]. In general, the nonlinear filtering problem per se consists in finding the conditional probability distribution (or density) of the state given the observations up to the current time [420]. In particular, the solution of the nonlinear filtering problem using the theory of conditional Markov processes [430], [431] is very attractive from the Bayesian perspective and has a number of advantages over the other methods; the recursive transformations of the posterior measures are characteristic of this theory. Strictly speaking, the number of variables replacing the density function is infinite, but not all of them are of equal importance; thus it is advisable to select the important ones and reject the remainder.

The solutions of the nonlinear filtering problem fall into two categories: global methods and local methods. In the global approach, one attempts to solve a PDE, instead of an ODE as in the linear case, e.g., the Zakai equation or the Kushner-Stratonovich equation, which are mostly analytically intractable; hence numerical approximation techniques are needed to solve the equation. In special scenarios (e.g., the exponential family) with some assumptions, nonlinear filtering can admit tractable solutions. In the local approach, finite-sum approximation (e.g., the Gaussian sum filter) or linearization techniques (i.e., the EKF) are usually used. In the EKF, by defining

  F̂_{n+1,n} = df(x)/dx |_{x = x̂_n},   Ĝ_n = dg(x)/dx |_{x = x̂_{n|n−1}},

the equations (2a)(2b) can be linearized into (3a)(3b), and the conventional Kalman filtering technique is then employed. The details of the EKF can be found in many books, e.g., [238], [12], [96], [80], [195], [205], [206]. Because the EKF always approximates the posterior p(x_n|y_{0:n}) as a Gaussian, it works well for some types of nonlinear problems, but it may provide poor performance in some cases where the true posterior is non-Gaussian (e.g., heavily skewed or multimodal). Gelb [174] provided an early overview of the uses of the EKF. It is noted that the estimate given by the EKF is usually biased, since in general E[f(x)] ≠ f(E[x]).

In summary, a number of methods have been developed for nonlinear filtering problems:
• Linearization methods: first-order Taylor series expansion (i.e., the EKF), and higher-order filters [20], [437].
• Approximation by finite-dimensional nonlinear filters: the Beneš filter [33], [34], the Daum filter [111]-[113], and the projection filter [202], [55].
• Classic PDE methods, e.g., [282], [284], [285], [505], [496], [497], [235].
• Spectral methods [312].
• Neural filter methods, e.g., [209].
• Numerical approximation methods, as to be discussed in Section V.

C.1 Finite-dimensional Filters

The on-line solution of the FPK equation can be avoided if the unnormalized filtered density admits a finite-dimensional sufficient statistic. Beneš [33], [34] first explored the exact finite-dimensional filter^32 in the nonlinear filtering scenario. Daum [111] extended the framework to a more general case and included the Kalman filter and the Beneš filter as special cases [113]. Some new developments of the Daum filter with virtual measurements were summarized in [113]. The recently proposed projection filters [202], [53]-[57] also belong to the finite-dimensional filter family.

^32 Roughly speaking, a finite-dimensional filter is one that can be implemented by integrating a finite number of ODEs, or one that has a sufficient statistic with finitely many variables.

In [111], starting from SDE filtering theory, Daum introduced a gradient function

  r(t, x) = (∂/∂x) ln ψ(t, x),

where ψ(t, x) is the solution of the FPK equation of (11a), of the form

  ∂ψ(t, x)/∂t = − (∂ψ(t, x)/∂x) f − ψ tr( ∂f/∂x ) + (1/2) tr( A ∂^2 ψ/(∂x ∂x^T) ),

with an appropriate initial condition (see [111]), and A = σ(t, x_t) σ(t, x_t)^T. When the measurement equation (11b) is linear with Gaussian noise (recalling the discrete-time version (3b)), the Daum filter admits a finite-dimensional solution

  p(x_t|Y_t) = ψ^s(x_t) exp( −(1/2)(x_t − m_t)^T P_t^{−1} (x_t − m_t) ),

where s is a real number in the interval 0

... the MAP estimate, which is partially justified by the fact that under certain regularity conditions the posterior distribution asymptotically approaches a Gaussian distribution as the number of samples increases to infinity. The Laplace approximation is useful in the MAP or ML framework; this