
Bayesian Inference for Dynamic Models with Dirichlet Process Mixtures


Francois Caron1, Manuel Davy1, Arnaud Doucet2, Emmanuel Duflos1, Philippe Vanheeghe1

1 LAGIS (UMR CNRS 8146), Ecole Centrale de Lille, BP48, Villeneuve d'Ascq, France, fi[email protected]
2 Department of Statistics and Department of Computer Science, University of British Columbia, Vancouver, Canada, [email protected]

Abstract – Using Kalman techniques, it is possible to perform optimal estimation in linear Gaussian state-space models. We address here the case where the noise probability density functions are of unknown functional form. A flexible Bayesian nonparametric noise model based on Dirichlet process mixtures is introduced. Efficient Markov chain Monte Carlo and Sequential Monte Carlo methods are then developed to perform optimal estimation in such contexts.

Keywords: Bayesian nonparametrics, Dirichlet Process Mixture, Markov chain Monte Carlo, Rao-Blackwellisation, Particle filter.

1 Introduction

Let us consider the following dynamic linear model

x_t = F_t x_{t-1} + C_t u_t + G_t v_t    (1)
z_t = H_t x_t + w_t    (2)

where x_t is the hidden state vector, z_t is the observation, and v_t and w_t are sequences of independent random variables. F_t and H_t are the known state and observation matrices, u_t is a known input, C_t is the input transfer matrix and G_t is the state transfer matrix. For any sequence {a_t}, we write a_{i:j} for (a_i, a_{i+1}, ..., a_j). We are interested in estimating the hidden state x_t given the observations z_{1:t} (filtering) or z_{1:T} for T ≥ t (smoothing). If the noise sequences are assumed Gaussian with known parameters, optimal estimation can be performed using the Kalman filter/smoother. However, these Gaussian assumptions can be very inappropriate for many real-world models. In this paper, we address the problem of optimal state estimation when the probability density functions (pdf) of the noise sequences are unknown and need to be estimated on-line or off-line from the data. To simplify the presentation, we limit ourselves here to the estimation of the pdf of the state noise sequence {v_t} and assume that {w_t} is Gaussian. However, our algorithms can be extended straightforwardly to the case where the observation noise pdf must also be estimated.

Our methodology relies on the introduction of a Dirichlet Process Mixture (DPM) for the state noise pdf. The DPM is a flexible Bayesian nonparametric model which has become very popular in statistics over the last few years for nonparametric density estimation. To the best of our knowledge, this powerful class of models has never been used in the context of dynamic models. We demonstrate here that such tools can drastically improve the performance of standard algorithms when the noise pdfs are unknown.

The rest of this paper is organized as follows. In Section 2, we recall the basics of Bayesian nonparametric density estimation and in Section 3 we derive the dynamic model with unknown noise distribution. In Section 4, we develop an efficient Markov chain Monte Carlo algorithm to perform optimal estimation in the batch case. In Section 5, we develop an efficient Sequential Monte Carlo/particle filter to perform optimal estimation in the sequential case. All these algorithms can be interpreted as generalized Rao-Blackwellised methods. Finally, we demonstrate our algorithms on two applications: blind deconvolution of impulse processes and robust regression.
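All of the algorithms below rely on standard Kalman recursions for the conditionally linear Gaussian model (1)-(2). As a point of reference, here is a minimal one-step Kalman filter sketch (our own illustrative Python, not from the paper; the helper name kf_step and the argument layout are assumptions):

```python
import numpy as np

def kf_step(m, P, z, F, C, u, G, H, Q, R):
    """One Kalman filter step for x_t = F x_{t-1} + C u + G v_t, z_t = H x_t + w_t,
    with v_t ~ N(0, Q) and w_t ~ N(0, R). Returns the filtered mean/covariance
    together with the innovation mean and covariance."""
    # Prediction
    m_pred = F @ m + C @ u
    P_pred = F @ P @ F.T + G @ Q @ G.T
    # Innovation
    z_pred = H @ m_pred
    S = H @ P_pred @ H.T + R
    # Update
    K = P_pred @ H.T @ np.linalg.inv(S)
    m_filt = m_pred + K @ (z - z_pred)
    P_filt = (np.eye(len(m)) - K @ H) @ P_pred
    return m_filt, P_filt, z_pred, S
```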

2 Bayesian nonparametric density estimation

We briefly review here modern Bayesian tools to perform nonparametric density estimation.

2.1 Density estimation

Let y_1, ..., y_n be a statistically exchangeable sequence distributed according to the pdf F,

y_k ∼ F(·).

We are interested in estimating F and we consider the following nonparametric model

F(y) = ∫_Θ f(y|θ) dG(θ)    (3)

where θ ∈ Θ is called the latent variable or cluster variable, f(·|θ) is the mixed pdf and G is the mixing distribution. Under a Bayesian framework, it is assumed that G is a Random Probability Measure (RPM) distributed according to a prior distribution (i.e., a distribution over the set of probability distributions). We select here the RPM to follow a Dirichlet Process (DP) prior.

2.2 Dirichlet Processes

Ferguson [1] introduced the Dirichlet Process (DP) as a probability measure on the space of probability measures. Given a probability measure G_0 on a (measurable) space (T, A) and a positive real number α, a random probability measure G distributed according to a DP of base distribution G_0 and scale factor α, denoted G ∼ DP(G_0, α), satisfies, for any partition A_1, ..., A_k of T,

(G(A_1), ..., G(A_k)) ∼ D( αG_0(A_1), ..., αG_0(A_k) )

where D is a standard Dirichlet distribution. Sethuraman [2] established that the realizations of a Dirichlet process are discrete with probability one and admit the so-called stick-breaking representation

G = Σ_{k=1}^{∞} π_k δ_{θ_k}    (4)

with θ_k ∼ G_0, π_k = β_k Π_{j=1}^{k−1} (1 − β_j) and β_k ∼ B(1, α), where B denotes the Beta distribution and δ_{θ_k} denotes the Dirac delta measure located at θ_k. Using (3), it follows that the following flexible prior model is adopted for the unknown distribution F:

F(y) = Σ_{k=1}^{∞} π_k f(y|θ_k).

Apart from its flexibility, a fundamental motivation for using the DP model is the simplicity of the posterior update. Let θ_1, ..., θ_n be n random samples from G,

θ_k | G ∼ G  (i.i.d.)

where G ∼ DP(G_0, α); then the posterior distribution of G | θ_{1:n} is also a DP,

G | θ_{1:n} ∼ DP( (α/(α+n)) G_0 + (1/(α+n)) Σ_{k=1}^{n} δ_{θ_k}, α + n ).

Moreover, it can be shown that the predictive distribution, computed by integrating out the RPM G, admits the following Polya urn representation [3]:

θ_{n+1} | θ_{1:n} ∼ (1/(α+n)) Σ_{k=1}^{n} δ_{θ_k} + (α/(α+n)) G_0.

2.3 Dirichlet Process Mixture

Using these modelling tools, it is now possible to reformulate the density estimation problem using the following hierarchical model, known as a DPM [4]:

G ∼ DP(G_0, α),
θ_k | G ∼ G,
y_k | θ_k ∼ f(·|θ_k).

The objective of density estimation boils down to estimating the posterior distribution p(θ_{1:n} | y_{1:n}). Although DPMs were introduced in the 70's, these models were too complex to handle numerically before the introduction of MCMC [5, 6]. They have now become extremely popular.
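As an illustration of the stick-breaking and Polya urn representations above, the following sketch (our own illustrative Python; the truncation level K and the Gaussian base distribution are assumptions made only for the example) draws an approximate DP realization by truncated stick-breaking and a sequence of cluster variables from the Polya urn scheme:

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(alpha, K, base_sampler):
    """Truncated stick-breaking construction of G = sum_k pi_k delta_{theta_k}."""
    betas = rng.beta(1.0, alpha, size=K)
    pis = betas * np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    thetas = base_sampler(K)
    return pis, thetas

def polya_urn(alpha, n, base_sampler):
    """Draw theta_1, ..., theta_n by integrating out G (Polya urn scheme)."""
    thetas = [base_sampler(1)[0]]
    for i in range(1, n):
        if rng.random() < alpha / (alpha + i):
            thetas.append(base_sampler(1)[0])        # new cluster drawn from G_0
        else:
            thetas.append(thetas[rng.integers(i)])   # existing cluster, prob. 1/(alpha+i) each
    return np.array(thetas)

# Example with a standard Gaussian base distribution G_0 (assumption for the sketch)
base = lambda k: rng.normal(0.0, 1.0, size=k)
pis, thetas = stick_breaking(alpha=2.0, K=100, base_sampler=base)
draws = polya_urn(alpha=2.0, n=50, base_sampler=base)
```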

3 Dynamic Linear Model with Unknown Noise Distribution

In this article, we are interested in the class of models (1)-(2) where {w_t} is distributed according to w_t ∼ N(0, R_t) whereas the pdf of {v_t} is unknown. We adopt a flexible Bayesian nonparametric model for this noise pdf: {v_t} is assumed to be distributed according to a DPM with Gaussian mixed distribution N(μ_t, Σ_t), scale parameter α and Normal-inverse Wishart base distribution G_0 [7], denoted G_0 = NIW(μ_0, κ_0, ν_0, Λ_0), where μ_0, κ_0, ν_0, Λ_0 are hyperparameters assumed known and fixed. To summarize, we have the following model:

G ∼ DP(G_0, α),
(μ_t, Σ_t) ∼ G,
v_t ∼ N(μ_t, Σ_t).

This model is much more flexible than a standard mixture of Gaussians. It is equivalent to saying that v_t is sampled from a fixed but unknown distribution F which admits the following representation:

F(v_t) = ∫ N(v_t; μ, Σ) dG(μ, Σ).    (5)

F is a countably infinite mixture of Gaussian pdfs with unknown parameters, and the mixing distribution G is sampled from a Dirichlet process. We denote by θ_t = {μ_t, Σ_t} the latent cluster variable giving the mean and covariance matrix for that cluster.

The MCMC algorithms available in the literature to estimate these Bayesian nonparametric models – e.g. [5, 6] – do not apply in our case because we do not observe the sequence {v_t} but only the observations {z_t} generated by the dynamic model (1)-(2). In the following sections, we propose two strategies to perform inference in this more complex framework. Despite the complexity of the model, we develop efficient computational methods based on a Rao-Blackwellisation approach.
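To make the model concrete, the sketch below (our own illustrative Python; the one-dimensional scalar model, the Normal-inverse-Gamma stand-in for the NIW base distribution and all numerical values are assumptions for the example) simulates data from (1)-(2) with state noise drawn from a DPM of Gaussians via the Polya urn scheme:

```python
import numpy as np

rng = np.random.default_rng(1)
T, alpha = 200, 1.0
F_t, G_t, H_t, R_t = 0.95, 1.0, 1.0, 0.5   # scalar model, values assumed for illustration

def sample_base():
    # Normal-inverse-Gamma stand-in for the NIW base distribution G_0 (1-D case)
    var = 1.0 / rng.gamma(shape=3.0, scale=1.0)
    mu = rng.normal(0.0, np.sqrt(var))
    return mu, var

clusters, assignments = [], []
x, xs, zs = 0.0, [], []
for t in range(T):
    # Polya urn draw of the cluster variable theta_t = (mu_t, Sigma_t)
    if not assignments or rng.random() < alpha / (alpha + len(assignments)):
        clusters.append(sample_base())
        k = len(clusters) - 1
    else:
        k = assignments[rng.integers(len(assignments))]
    assignments.append(k)
    mu_t, var_t = clusters[k]
    v_t = rng.normal(mu_t, np.sqrt(var_t))
    x = F_t * x + G_t * v_t                      # state equation (1), no input here
    z = H_t * x + rng.normal(0.0, np.sqrt(R_t))  # observation equation (2)
    xs.append(x); zs.append(z)
```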

4 MCMC algorithm

Assume we are interested in estimating the posterior p(x_{0:T}, θ_{1:T} | z_{1:T}), where x is the state vector and θ_t = {μ_t, Σ_t} is the latent variable defined above. This posterior satisfies

p(x_{0:T}, θ_{1:T} | z_{1:T}) = p(x_{0:T} | θ_{1:T}, z_{1:T}) p(θ_{1:T} | z_{1:T}).    (6)

Conditional upon θ_t, (1) may be rewritten as

x_t = F_t x_{t−1} + u′_t(θ_t) + G_t v′_t(θ_t)

where u′_t(θ_t) = C_t u_t + G_t μ_t is a known input and v′_t(θ_t) is a centered white Gaussian noise of known covariance matrix Σ_t. Thus p(x_{0:T} | θ_{1:T}, z_{1:T}) is a Gaussian distribution whose parameters can be computed using a Kalman smoother [8], whereas the marginal posterior p(θ_{1:T} | z_{1:T}) can be approximated through MCMC using the following Gibbs sampler [9]:

Algorithm 1  Gibbs sampler to sample from p(θ_{1:T} | z_{1:T})
• Initialization: for t = 1, ..., T, sample θ_t^(1).
• Iteration i, i ≥ 2: for t = 1, ..., T, sample θ_t^(i) ∼ p(θ_t | z_{1:T}, θ_{−t}^(i)) where θ_{−t}^(i) = {θ_1^(i), ..., θ_{t−1}^(i), θ_{t+1}^(i−1), ..., θ_T^(i−1)}.

The Gibbs sampler needs to sample from p(θ_t | z_{1:T}, θ_{−t}), where

p(θ_t | z_{1:T}, θ_{−t}) ∝ p(z_{1:T} | θ_{1:T}) p(θ_t | θ_{−t}).

From the Polya urn representation, p(θ_t | θ_{−t}) equals

(1/(α+T−1)) Σ_{k=1, k≠t}^{T} δ_{θ_k}(θ_t) + (α/(α+T−1)) G_0(θ_t).

Thus p(θ_t | z_{1:T}, θ_{−t}) is proportional to

p(z_{1:T} | θ_{1:T}) × [ Σ_{k=1, k≠t}^{T} δ_{θ_k}(θ_t) + α G_0(θ_t) ].

We can sample from this density with a Metropolis-Hastings step, where the candidate pdf is the conditional prior p(θ_t | θ_{−t}). This is given by Algorithm 2.

Algorithm 2  Metropolis-Hastings step to sample from p(θ_t | z_{1:T}, θ_{−t})
• Sample a candidate cluster
  θ_t^(i)* ∼ (1/(α+T−1)) Σ_{k=1, k≠t}^{T} δ_{θ_k} + (α/(α+T−1)) G_0.
• With probability ρ(θ_t^(i), θ_t^(i)*), where
  ρ(θ_t^(i), θ_t^(i)*) = min( 1, p(z_{1:T} | θ_t^(i)*, θ_{−t}^(i)) / p(z_{1:T} | θ_t^(i), θ_{−t}^(i)) ),
  set θ_t^(i) = θ_t^(i)*, otherwise θ_t^(i) = θ_t^(i−1).

The computation of the acceptance probability requires the likelihood p(z_{1:T} | θ_t^(i), θ_{−t}^(i)). This can be done in O(T) operations using a Kalman filter. However, this has to be done for t = 1, ..., T, so one finally obtains an algorithm of computational complexity O(T²). Here, we propose to use instead the backward-forward recursion developed in [10], to obtain an algorithm of complexity O(T). This algorithm uses the following likelihood decomposition:

p(z_{1:T} | θ_{1:T}) = p(z_{1:t−1} | θ_{1:t−1}) p(z_t | θ_{1:t}, z_{1:t−1}) × ∫_X p(z_{t+1:T} | x_t, θ_{t+1:T}) p(x_t | z_{1:t}, θ_{1:t}) dx_t    (7)

with

p(z_{t:T} | x_{t−1}, θ_{t:T}) = ∫_X p(z_{t+1:T} | x_t, θ_{t+1:T}) p(z_t, x_t | θ_t, x_{t−1}) dx_t.    (8)

The first two terms of the r.h.s. of Eq. (7) are computed by a forward recursion based on the Kalman filter [10]. The third term can be evaluated by a backward recursion according to Eq. (8), see the Appendix for details. Based on Eq. (7), the density p(θ_t | z_{1:T}, θ_{−t}) is expressed as

p(θ_t | z_{1:T}, θ_{−t}) ∝ p(θ_t | θ_{−t}) p(z_t | θ_{1:t}, z_{1:t−1}) ∫_X p(z_{t+1:T} | x_t, θ_{t+1:T}) p(x_t | z_{1:t}, θ_{1:t}) dx_t.
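Before turning to the full algorithm, here is a minimal sketch of the Metropolis-Hastings move of Algorithm 2 above (our own illustrative Python; the callback log_likelihood, which would wrap the Kalman-filter or backward-forward likelihood evaluation, and the representation of clusters as opaque objects are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

def mh_cluster_step(t, thetas, alpha, sample_G0, log_likelihood, z):
    """One Metropolis-Hastings update of theta_t using the Polya urn conditional prior
    as proposal; log_likelihood(z, thetas) must return log p(z_{1:T} | theta_{1:T})."""
    T = len(thetas)
    # Propose from the conditional prior p(theta_t | theta_{-t})
    if rng.random() < alpha / (alpha + T - 1):
        candidate = sample_G0()
    else:
        others = [k for k in range(T) if k != t]
        candidate = thetas[others[rng.integers(T - 1)]]
    proposal = list(thetas)
    proposal[t] = candidate
    # Accept with probability min(1, likelihood ratio); the prior proposal terms cancel
    log_ratio = log_likelihood(z, proposal) - log_likelihood(z, thetas)
    if np.log(rng.random()) < min(0.0, log_ratio):
        return proposal
    return list(thetas)
```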

The full steps are given in Algorithm 3. It can easily be established that the simulated Markov chain {θ_{1:T}^(i)} is ergodic with limiting distribution p(θ_{1:T} | z_{1:T}). After N iterations of the algorithm, the MMSE estimates of θ_{1:T} and x_{0:T} are computed using

θ̂_{1:T} = (1/N) Σ_{i=1}^{N} θ_{1:T}^(i),   x̂_t = (1/N) Σ_{i=1}^{N} x̂_t^(i).

Algorithm 3  MCMC algorithm to sample from p(θ_{1:T} | z_{1:T})
Initialization i = 1
• For t = 1, ..., T, sample θ_t^(1).
Iteration i, i ≥ 2
• Backward recursion: for t = T, ..., 1, compute and store P′^{-1}_{t|t+1}(θ_{t+1:T}^(i−1)) and P′^{-1}_{t|t+1}(θ_{t+1:T}^(i−1)) m′_{t|t+1}(θ_{t+1:T}^(i−1)).
• Forward recursion: for t = 1, ..., T
  – Perform a Kalman filter step with θ_t^(i) = θ_t^(i−1); store m_{t|t}(θ_{1:t−1}^(i), θ_t^(i−1)) and P_{t|t}(θ_{1:t−1}^(i), θ_t^(i−1)).
  – Metropolis-Hastings step:
    ∗ Sample a candidate cluster θ_t^(i)* ∼ (1/(α+T−1)) Σ_{k=1,k≠t}^{T} δ_{θ_k} + (α/(α+T−1)) G_0.
    ∗ Perform a Kalman filter step with θ_t^(i) = θ_t^(i)*; store m_{t|t}(θ_{1:t−1}^(i), θ_t^(i)*) and P_{t|t}(θ_{1:t−1}^(i), θ_t^(i)*).
    ∗ With probability ρ(θ_t^(i), θ_t^(i)*), where
      ρ(θ_t^(i), θ_t^(i)*) = min( 1, p(z_{1:T} | θ_t^(i)*, θ_{−t}^(i)) / p(z_{1:T} | θ_t^(i), θ_{−t}^(i)) ),
      set θ_t^(i) = θ_t^(i)*, otherwise θ_t^(i) = θ_t^(i−1).
State post-sampling
• For i = 1, ..., N, compute x̂_t^(i) = E[x_t | θ_{1:T}^(i)] for all t with a Kalman smoother.

5 Rao-Blackwellized Particle Filter algorithm

In many applications, we want to process the data on-line. In this case, MCMC methods are inadequate as they are batch iterative algorithms. We propose here some original Sequential Monte Carlo methods, also known as particle filters, to approximate on-line the sequence of probability distributions {p(x_{0:t}, θ_{1:t} | z_{1:t}), t = 1, 2, ...}. Similarly to the batch case, it is possible to exploit the structure of the dynamic model to reduce the dimension of the parameter space. We have the following decomposition of the posterior distribution:

p(x_{0:t}, θ_{1:t} | z_{1:t}) = p(x_{0:t} | θ_{1:t}, z_{1:t}) p(θ_{1:t} | z_{1:t}).

As p(x_{0:t} | θ_{1:t}, z_{1:t}) can be computed using Kalman techniques, we only need to estimate through particle methods the marginal posterior p(θ_{1:t} | z_{1:t}). This is a generalization of the Rao-Blackwellized particle filter [11] to DPMs. At time t, p(x_t, θ_{1:t} | z_{1:t}) is approximated through a set of N particles θ_{1:t}^(1), ..., θ_{1:t}^(N) by the following empirical distribution:

P̂_N(x_t, θ_{1:t} | z_{1:t}) = Σ_{i=1}^{N} w_t^(i) p(x_t | θ_{1:t}^(i), z_{1:t})

with

p(x_t | θ_{1:t}^(i), z_{1:t}) = N( x_t; x̂_{t|t}(θ_{1:t}^(i)), Σ_{t|t}(θ_{1:t}^(i)) ).

The parameters x̂_{t|t}(θ_{1:t}^(i)) and Σ_{t|t}(θ_{1:t}^(i)) are computed recursively for each particle i using the Kalman filter [8]. In order to build the algorithm, we note that

p(θ_{1:t}^(i) | z_{1:t}) ∝ p(θ_{1:t−1}^(i) | z_{1:t−1}) p(z_t | θ_{1:t}^(i), z_{1:t−1}) p(θ_t^(i) | θ_{1:t−1}^(i))

where

p(z_t | θ_{1:t}^(i), z_{1:t−1}) = N( z_t; ẑ_t(θ_{1:t}^(i)), S_t(θ_{1:t}^(i)) )

with

ẑ_t(θ_{1:t}^(i)) = H_t [ F_t x̂_{t−1|t−1}(θ_{1:t−1}^(i)) + C_t u_t + G_t μ_t^(i) ],
S_t(θ_{1:t}^(i)) = H_t [ F_t Σ_{t−1|t−1}(θ_{1:t−1}^(i)) F_t^T + G_t Σ_t^(i) G_t^T ] H_t^T + R_t.

The Rao-Blackwellized particle filter (RBPF) steps are given in Algorithm 4.

Algorithm 4  Rao-Blackwellised Particle Filter to sample from p(θ_{1:t} | z_{1:t})
At time 1
• For i = 1, ..., N, sample (x_0^(i), Σ_{0|0}^(i)) ∼ p_0(x_0).
• Set w_0^(i) ← 1/N.
At each time t (t ≥ 2)
• For i = 1, ..., N, sample θ_t^(i) ∼ q(θ_t | θ_{1:t−1}^(i), z_{1:t}).
• For i = 1, ..., N, compute
  ( x̂_{t|t−1}(θ_{1:t}^(i)), Σ_{t|t−1}(θ_{1:t}^(i)), x̂_{t|t}(θ_{1:t}^(i)), Σ_{t|t}(θ_{1:t}^(i)) ) = KF1step( x̂_{t−1|t−1}(θ_{1:t−1}^(i)), Σ_{t−1|t−1}(θ_{1:t−1}^(i)), θ_t^(i), z_t ).
• For i = 1, ..., N, update the weights
  w_t^(i) ∝ w_{t−1}^(i) p(z_t | θ_{1:t−1}^(i), θ_t^(i), z_{1:t−1}) p(θ_t^(i) | θ_{1:t−1}^(i)) / q(θ_t^(i) | θ_{1:t−1}^(i), z_{1:t}),   with Σ_{i=1}^{N} w_t^(i) = 1.
• Compute N_eff = [ Σ_{i=1}^{N} (w_t^(i))² ]^{−1}.
• If N_eff ≤ η, resample the particles and set w_t^(i) = 1/N.
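The KF1step routine invoked in Algorithm 4 is a Kalman filter update evaluated conditionally on that particle's cluster (μ_t^(i), Σ_t^(i)). The following minimal sketch (our own illustrative Python, not from the paper; the helper name particle_kf_step and the use of scipy for the Gaussian log-density are assumptions) computes the predictive log-likelihood log N(z_t; ẑ_t, S_t) and the filtered moments for one particle:

```python
import numpy as np
from scipy.stats import multivariate_normal

def particle_kf_step(m_prev, P_prev, theta, z, F, C, u, G, H, R):
    """log p(z_t | theta_{1:t}, z_{1:t-1}) for one particle, where theta = (mu, Sigma)
    is the cluster drawn for time t. Also returns the updated filtered moments."""
    mu, Sigma = theta
    # Predictive moments of z_t given the particle's cluster at time t
    m_pred = F @ m_prev + C @ u + G @ mu
    P_pred = F @ P_prev @ F.T + G @ Sigma @ G.T
    z_pred = H @ m_pred
    S = H @ P_pred @ H.T + R
    loglik = multivariate_normal.logpdf(z, mean=z_pred, cov=S)
    # Kalman update to obtain the filtered moments carried by the particle
    K = P_pred @ H.T @ np.linalg.inv(S)
    m_filt = m_pred + K @ (z - z_pred)
    P_filt = P_pred - K @ H @ P_pred
    return loglik, m_filt, P_filt
```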

As in other particle filtering algorithms, the performance depends highly on the importance distribution selected. Here, the optimal importance distribution is q(θ_t | θ_{1:t−1}^(i), z_{1:t}) = p(θ_t | θ_{1:t−1}^(i), z_{1:t}), which equals

[ Σ_{j=1}^{t−1} N( z_t; ẑ_t(θ_{1:t−1}^(i), θ_j^(i)), S_t(θ_{1:t−1}^(i), θ_j^(i)) ) δ_{θ_j^(i)}(θ_t) + α I(θ_{1:t−1}^(i)) H_0(θ_{1:t−1}^(i), θ_t) ]
/ [ Σ_{j=1}^{t−1} N( z_t; ẑ_t(θ_{1:t−1}^(i), θ_j^(i)), S_t(θ_{1:t−1}^(i), θ_j^(i)) ) + α I(θ_{1:t−1}^(i)) ]

with

H_0(θ_{1:t−1}^(i), θ_t) = N( z_t; ẑ_t(θ_{1:t−1}^(i), θ_t), S_t(θ_{1:t−1}^(i), θ_t) ) G_0(θ_t) / ∫_Θ N( z_t; ẑ_t(θ_{1:t−1}^(i), θ_t), S_t(θ_{1:t−1}^(i), θ_t) ) G_0(θ_t) dθ_t,

I(θ_{1:t−1}^(i)) = ∫_Θ N( z_t; ẑ_t(θ_{1:t−1}^(i), θ_t), S_t(θ_{1:t−1}^(i), θ_t) ) G_0(θ_t) dθ_t.

The weights are then updated with

w_t^(i) ∝ w_{t−1}^(i) I(θ_{1:t−1}^(i)).

However, this optimal importance distribution cannot be used, as the associated importance weights do not admit a closed-form expression. It is nevertheless possible to derive various approximations of it so as to design efficient importance distributions q(θ_t | θ_{1:t−1}, z_{1:t}).

From the particles, the MMSE estimate and posterior covariance matrix of x_t are given by

x̂_{t|t} = Σ_{i=1}^{N} w_t^(i) x̂_{t|t}^(i),
Σ_{t|t} = Σ_{i=1}^{N} w_t^(i) [ Σ_{t|t}^(i) + (x̂_{t|t}^(i) − x̂_{t|t})(x̂_{t|t}^(i) − x̂_{t|t})^T ].
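A compact sketch of one step of the particle loop is given below (our own illustrative Python; the particle representation as dictionaries and the helpers sample_proposal, log_q, log_prior and log_pred_lik are hypothetical, and multinomial resampling is used for simplicity where other schemes would also be valid):

```python
import numpy as np

rng = np.random.default_rng(3)

def rbpf_step(particles, log_w, z, sample_proposal, log_q, log_prior, log_pred_lik, eta):
    """One RBPF step over the cluster variables. `particles` is a list of per-particle
    states (cluster history + filtered Kalman moments); the helpers are problem-specific."""
    N = len(particles)
    for i, p in enumerate(particles):
        theta = sample_proposal(p, z)       # theta_t^(i) ~ q(. | theta_{1:t-1}^(i), z_{1:t})
        ll = log_pred_lik(p, theta, z)      # log N(z_t; z_hat_t, S_t); assumed to also update p's Kalman moments
        log_w[i] += ll + log_prior(p, theta) - log_q(p, theta, z)
        p["thetas"].append(theta)
    # Normalize weights and compute the effective sample size
    w = np.exp(log_w - np.max(log_w))
    w /= w.sum()
    n_eff = 1.0 / np.sum(w ** 2)
    if n_eff <= eta:                        # resample (multinomial for simplicity)
        idx = rng.choice(N, size=N, p=w)
        particles = [{**particles[j], "thetas": list(particles[j]["thetas"])} for j in idx]
        w = np.full(N, 1.0 / N)
    return particles, np.log(w)
```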

6 Applications

In this section, we present two applications of the above model and algorithms. We address, first, blind deconvolution and, second, robust regression. In both cases, we assume that the statistics of the state noise are unknown and modelled as a DPM.

6.1 Blind deconvolution of impulse processes

Blind deconvolution finds many applications in various fields of engineering and physics, such as image de-blurring, spectroscopic data analysis, audio source restoration, etc. We follow here the model presented in [12], which is recalled below.

6.1.1 Statistical Model

Let H = (1  h_1  ...  h_L) = (1  h) and x_t = (v_t  v_{t−1}  ...  v_{t−L})^T. The observed signal z_t is the convolution of the sequence x_t with a finite impulse response filter H, observed in additive white Gaussian noise w_t. The observation model is then

z_t = H x_t + w_t

where w_t ∼ N(0, σ_w²) with σ_w² the assumed known variance of w_t. The state space model can be written as follows:

x_t = F x_{t−1} + G v_t

where

F = ( 0  0_{1×L} ; I_L  0_{L×1} ),   G = ( 1 ; 0_{L×1} ),

0_{m×n} is the null matrix of size m × n and I_m is the identity matrix of size m × m. The state transition noise v_t is supposed to be independent of w_t, and distributed according to the mixture

v_t ∼ λ F + (1 − λ) δ_0    (9)

where δ_0 is the Dirac delta function at 0 and F is a DPM of Gaussians defined in Eq. (5). In other words, the noise is alternatively zero, or distributed according to a DPM of Gaussians.

For simplicity, we introduce latent Bernoulli variables r_t ∈ {0, 1} such that Pr(r_t = 1) = λ and v_t | (r_t = 1) ∼ f(·|θ_t), v_t | (r_t = 0) ∼ δ_0. Consider the cluster variable φ_t defined by φ_t = θ_t if r_t = 1 and φ_t = (0, 0) (i.e. parameters corresponding to the delta-mass) if r_t = 0, that is, φ_t ∼ λ F + (1 − λ) δ_{(0,0)}. By integrating out F, one has

φ_t | φ_{−t} ∼ λ p(φ_t | φ_{−t}, r_t = 1) + (1 − λ) δ_{(0,0)}

where p(φ_t | φ_{−t}, r_t = 1) is the Polya urn representation on the set φ̃_{−t} = {φ ∈ φ_{−t} | φ ≠ δ_{(0,0)}} of size T′, given by

φ_t | (φ_{−t}, r_t = 1) ∼ ( Σ_{k=1, k≠t}^{T′} δ_{φ̃_k} + α G_0 ) / (α + T′).

The hyperparameters are Φ = (α  h  λ) (the hyperparameters of the base distribution G_0 are assumed fixed and known). These hyperparameters are assumed random with prior distribution p(Φ) = p(α) p(h) p(λ), where

p(α) = G(η/2, ν/2),   p(h) = N(0, Σ_h),   p(λ) = B(ζ, τ)

and η, ν, Σ_h, ζ and τ are known. We have the following conditional posteriors:

p(h | x_{1:T}, φ_{1:T}, α, λ, z_{1:T}) ∝ p(h) p(z_{1:T} | x_{1:T}, h)

and p(λ | x_{1:T}, φ_{1:T}, α, h, z_{1:T}) = p(λ | r_{1:T}) with

p(λ | r_{1:T}) = B( ζ + Σ_{t=1}^{T} r_t, τ + Σ_{t=1}^{T} (1 − r_t) )

where r_t = 0 if φ_t = (0, 0) and r_t = 1 otherwise. The aim is to approximate by MCMC the joint posterior pdf p(v_{1:T}, φ_{1:T}, Φ | z_{1:T}). This is done by applying Algorithm 3 for the cluster variable, whereas the other variables are sampled by Metropolis-Hastings or direct sampling w.r.t. their conditional posteriors.
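The state-space matrices of this deconvolution model are easy to build programmatically; a small sketch follows (our own illustrative Python; the helper name build_deconv_model is hypothetical, and the filter taps are taken from the simulation settings of Section 6.1.2):

```python
import numpy as np

def build_deconv_model(h, sigma_w2):
    """Build H, F, G, R for the blind deconvolution state-space model,
    with x_t = (v_t, v_{t-1}, ..., v_{t-L})^T and z_t = H x_t + w_t."""
    h = np.asarray(h, dtype=float)
    L = len(h)
    H = np.concatenate(([1.0], h)).reshape(1, L + 1)   # H = (1, h_1, ..., h_L)
    F = np.zeros((L + 1, L + 1))
    F[1:, :L] = np.eye(L)                              # shift register: lower-left block I_L
    G = np.zeros((L + 1, 1))
    G[0, 0] = 1.0                                      # new impulse v_t enters the first component
    R = np.array([[sigma_w2]])
    return H, F, G, R

# Settings from Section 6.1.2
H, F, G, R = build_deconv_model(h=[-1.5, 0.5, -0.2], sigma_w2=0.1)
```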

6.1.2 Simulation results

This model has been simulated with the following parameters: T = 120, L = 3, h = (−1.5  0.5  −0.2), λ = 0.6, σ_w² = 0.1, F = 0.7 N(2, 1) + 0.3 N(−1, 1). For the estimation, 5000 MCMC iterations are performed. Fig. 1 (top) displays the MMSE estimate of v_{1:T} together with its true value. As can be seen in Fig. 1 (bottom), the residual is very small. Also, as can be seen in Fig. 2, the estimated pdf F is quite close to the true one.

Figure 1: Top: true and estimated signal v_{1:T}. Bottom: residual between the true and estimated signals.

Figure 2: True and estimated pdf F.

6.2 Robust Regression

We now consider the problem of robust regression as described in [13]. The statistical model considered here is presented below.

6.2.1 Statistical model

Let us consider the problem of estimating a continuous regression function g(·) defined on a domain I ⊂ R. Suppose we have T observations z_t, t = 1, ..., T, in additive Gaussian noise

z_t = g(y_t) + w_t

where y_0 < y_1 < ... < y_T and w_t ∼ i.i.d. N(0, σ_w²). The function g(·) is supposed to follow the stochastic differential equation g″(y) = v(y|θ_t) for all y ∈ [y_{t−1}, y_t], where v(y|θ_t) ∼ N(μ_t, σ_t²), i.e., the second derivative of the function g is piecewise constant and follows a normal pdf of mean and variance given by θ_t = {μ_t, σ_t²}. Conditional on θ_t, this may be formulated in the following state space form:

x_t = F x_{t−1} + u′_t + v′_t,
z_t = H x_t + w_t

where x_t = ( g(y_t)  g′(y_t) )^T,

F = ( 1  Δt ; 0  1 ),   u′_t = ( Δt²/2 ; Δt ) μ_t,

v′_t are independent N(0, σ_t² × V(Δt)) with

V(Δt) = ( Δt³/3  Δt²/2 ; Δt²/2  Δt ),   H = ( 1  0 ),

and Δt is the constant sampling period.

6.2.2 Simulation results

We present simulation results obtained with the following settings: σ_w² = 0.1, Δt = 1, T = 50. The true function g is zero over [0, 10], switches to 4 + sinc(0.2(t − 10)) over [10, 20] and switches again to −5 × sinc(0.2(t − 20)) over [20, 50]. The regression function has been estimated with both the MCMC algorithm (5000 iterations) and the Rao-Blackwellized particle filter (1000 particles). The true and estimated regression functions, together with the measurements, are plotted in Figures 3 and 4.

Figure 3: True and estimated regression functions with the MCMC algorithm (5000 iterations).

Figure 4: True and estimated regression functions with the Rao-Blackwellized particle filter algorithm (1000 particles).
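The regression state-space is equally simple to set up; a small sketch follows (our own illustrative Python; build_regression_model is a hypothetical helper and the sampling period follows the settings of Section 6.2.2):

```python
import numpy as np

def build_regression_model(dt):
    """State-space matrices for the robust regression model with
    x_t = (g(y_t), g'(y_t))^T and piecewise-constant second derivative."""
    F = np.array([[1.0, dt],
                  [0.0, 1.0]])
    u_dir = np.array([dt**2 / 2.0, dt])        # u'_t = u_dir * mu_t
    V = np.array([[dt**3 / 3.0, dt**2 / 2.0],  # v'_t ~ N(0, sigma_t^2 * V)
                  [dt**2 / 2.0, dt]])
    H = np.array([[1.0, 0.0]])
    return F, u_dir, V, H

F, u_dir, V, H = build_regression_model(dt=1.0)
```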

7 Conclusion

In this article we have presented a Bayesian nonparametric model which allows us to estimate the state noise pdf in a linear dynamic model. The methodology presented here can be straightforwardly extended to estimate the observation noise pdf. The Dirichlet process mixture considered here is flexible, and we have presented two simulation-based algorithms based on Rao-Blackwellisation which allow us to perform inference efficiently. We are currently investigating the following extensions of our methodology. First, it would be of interest to consider nonlinear dynamic models. Second, it would be important to develop non-stationary Dirichlet process mixture models in cases where the noise statistics are assumed to evolve over time.

A Backward-forward recursion

We recall that, knowing θ_t, the state space model may be written as

x_t = F_t x_{t−1} + u′_t(θ_t) + G_t v′_t(θ_t)

with u′_t(θ_t) = C_t u_t + G_t μ_t a known input and v′_t(θ_t) a centered white Gaussian noise of known covariance matrix Σ_t. The backward-forward algorithm uses the following likelihood decomposition:

p(z_{1:T} | θ_{1:T}) = p(z_{1:t−1} | θ_{1:t−1}) p(z_t | θ_{1:t}, z_{1:t−1})    (10)
    × ∫_X p(z_{t+1:T} | x_t, θ_{t+1:T}) p(x_t | z_{1:t}, θ_{1:t}) dx_t    (11)

with

p(z_{t:T} | x_{t−1}, θ_{t:T}) = ∫_X p(z_{t+1:T} | x_t, θ_{t+1:T}) p(z_t, x_t | θ_t, x_{t−1}) dx_t.    (12)

It is shown in [10] that if ∫_X p(z_{t:T} | x_{t−1}, θ_{t:T}) dx_{t−1} < ∞ then

p(z_{t:T} | x_{t−1}, θ_{t:T}) / ∫_X p(z_{t:T} | x_{t−1}, θ_{t:T}) dx_{t−1}

is a Gaussian distribution w.r.t. x_{t−1}, of mean m′_{t−1|t}(θ_{t:T}) and covariance P′_{t−1|t}(θ_{t:T}). The quantities P′^{-1}_{t−1|t}(θ_{t:T}) and P′^{-1}_{t−1|t}(θ_{t:T}) m′_{t−1|t}(θ_{t:T}) always satisfy the following backward information filter recursion.

• Initialization:

P′^{-1}_{T|T}(θ_T) = H_T^T R_T^{−1} H_T,
P′^{-1}_{T|T}(θ_T) m′_{T|T}(θ_T) = H_T^T R_T^{−1} z_T.

• Backward recursion: for t = T − 1, ..., 1,

Δ_{t+1} = [ I_{n_v} + B^T(θ_{t+1}) P′^{-1}_{t+1|t+1}(θ_{t+1:T}) B(θ_{t+1}) ]^{−1}

P′^{-1}_{t|t+1}(θ_{t+1:T}) = F_{t+1}^T P′^{-1}_{t+1|t+1}(θ_{t+1:T}) × ( I_{n_x} − B(θ_{t+1}) Δ_{t+1}(θ_{t+1:T}) B^T(θ_{t+1}) P′^{-1}_{t+1|t+1}(θ_{t+1:T}) ) F_{t+1}

P′^{-1}_{t|t+1}(θ_{t+1:T}) m′_{t|t+1}(θ_{t+1:T}) = F_{t+1}^T × ( I_{n_x} − P′^{-1}_{t+1|t+1}(θ_{t+1:T}) B(θ_{t+1}) Δ_{t+1}(θ_{t+1:T}) B^T(θ_{t+1}) ) × P′^{-1}_{t+1|t+1}(θ_{t+1:T}) ( m′_{t+1|t+1}(θ_{t+1:T}) − u′_{t+1}(θ_{t+1}) )

P′^{-1}_{t|t}(θ_{t:T}) = P′^{-1}_{t|t+1}(θ_{t+1:T}) + H_t^T R_t^{−1} H_t

P′^{-1}_{t|t}(θ_{t:T}) m′_{t|t}(θ_{t:T}) = P′^{-1}_{t|t+1}(θ_{t+1:T}) m′_{t|t+1}(θ_{t+1:T}) + H_t^T R_t^{−1} z_t

where B(θ_t) = G_t × chol(Σ_t)^T.

For the Metropolis-Hastings ratio, we only need to compute the acceptance probability up to a proportionality constant:

p(z_{1:T} | θ_{1:T}) ∝ p(z_t | θ_{1:t}, z_{1:t−1}) × ∫_X p(z_{t+1:T} | x_t, θ_{t+1:T}) p(x_t | z_{1:t}, θ_{1:t}) dx_t.    (13)

If P_{t|t}(θ_{1:t}) ≠ 0 then there exist Π_{t|t}(θ_{1:t}) and Q_{t|t}(θ_{1:t}) such that P_{t|t}(θ_{1:t}) = Q_{t|t}(θ_{1:t}) Π_{t|t}(θ_{1:t}) Q_{t|t}^T(θ_{1:t}). The matrices Q_{t|t}(θ_{1:t}) and Π_{t|t}(θ_{1:t}) are straightforwardly obtained using the singular value decomposition of P_{t|t}(θ_{1:t}). The matrix Π_{t|t}(θ_{1:t}) is an n_t × n_t, 1 ≤ n_t ≤ n_x, diagonal matrix with the nonzero eigenvalues of P_{t|t}(θ_{1:t}) as elements. Then one has

p(z_{1:T} | θ_{1:T}) ∝ N( z_t; ẑ_{t|t−1}(θ_{1:t}), S_t(θ_{1:t}) )
× | Π_{t|t}(θ_{1:t}) Q_{t|t}^T(θ_{1:t}) P′^{-1}_{t|t+1}(θ_{t+1:T}) Q_{t|t}(θ_{1:t}) + I_{n_t} |^{−1/2}
× exp( −(1/2) [ m_{t|t}^T(θ_{1:t}) P′^{-1}_{t|t+1}(θ_{t+1:T}) m_{t|t}(θ_{1:t})
− 2 m_{t|t}^T(θ_{1:t}) P′^{-1}_{t|t+1}(θ_{t+1:T}) m′_{t|t+1}(θ_{t+1:T})
− ( m′_{t|t+1}(θ_{t+1:T}) − m_{t|t}(θ_{1:t}) )^T P′^{-1}_{t|t+1}(θ_{t+1:T}) A_{t|t}(θ_{1:t}) P′^{-1}_{t|t+1}(θ_{t+1:T}) ( m′_{t|t+1}(θ_{t+1:T}) − m_{t|t}(θ_{1:t}) ) ] )

where

A_{t|t}(θ_{1:t}) = Q_{t|t}(θ_{1:t}) [ Π_{t|t}^{−1}(θ_{1:t}) + Q_{t|t}^T(θ_{1:t}) P′^{-1}_{t|t+1}(θ_{t+1:T}) Q_{t|t}(θ_{1:t}) ]^{−1} Q_{t|t}^T(θ_{1:t}).

The quantities m_{t|t}(θ_{1:t}), P_{t|t}(θ_{1:t}), ẑ_{t|t−1}(θ_{1:t}) and S_t(θ_{1:t}) are, respectively, the filtered estimate and covariance matrix of x_t, the one-step-ahead innovation at time t, and the covariance of this innovation. These quantities are given by the Kalman filter, the system being linear Gaussian conditional upon θ_{1:t}.

Acknowledgment

François Caron is supported by the Centre National de la Recherche Scientifique (CNRS) and the Région Nord-Pas de Calais.

References

[1] T.S. Ferguson. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1:209–230, 1973.

[2] J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650, 1994.

[3] D. Blackwell and J.B. MacQueen. Ferguson distributions via Polya urn schemes. The Annals of Statistics, 1:353–355, 1973.

[4] C.E. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2:1152–1174, 1974.

[5] M.D. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90:577–588, 1995.

[6] R.M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9:249–265, 2000.

[7] A. Gelman, J.B. Carlin, H.S. Stern, and D.B. Rubin. Bayesian Data Analysis. Chapman and Hall, 1995.

[8] B.D.O. Anderson and J.B. Moore. Optimal Filtering. Prentice-Hall, 1979.

[9] C.P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer-Verlag, 1999.

[10] A. Doucet and C. Andrieu. Iterative algorithms for state estimation of jump Markov linear systems. IEEE Transactions on Signal Processing, 49(6):1216–1227, 2001.

[11] A. Doucet, N. de Freitas, and N. Gordon, editors. Sequential Monte Carlo Methods in Practice. Springer-Verlag, 2001.

[12] A. Doucet and P. Duvaut. Bayesian estimation of state-space models applied to deconvolution of Bernoulli-Gaussian processes. Signal Processing, 57:147–161, 1997.

[13] M. West, P. Muller, and M.D. Escobar. Hierarchical priors and mixture models, with application in regression and density estimation. In P.R. Freeman and A.F.M. Smith, editors, Aspects of Uncertainty, pages 363–386. John Wiley, 1994.