<<

arXiv:1412.8695v2 [stat.CO] 10 Sep 2015

Statistical Science
2015, Vol. 30, No. 3, 328–351
DOI: 10.1214/14-STS511
© Institute of Mathematical Statistics, 2015

On Particle Methods for Parameter Estimation in State-Space Models

Nikolas Kantas, Arnaud Doucet, Sumeetpal S. Singh, Jan Maciejowski and Nicolas Chopin

N. Kantas is Lecturer, Department of Mathematics, Imperial College London, London SW7 2BZ, United Kingdom (e-mail: [email protected]). A. Doucet is Professor, Department of Statistics, University of Oxford, Oxford OX1 3TG, United Kingdom (e-mail: [email protected]). S. S. Singh is Senior Lecturer, Department of Engineering, University of Cambridge, Cambridge CB1 2PZ, United Kingdom (e-mail: [email protected]). J. M. Maciejowski is Professor, Department of Engineering, University of Cambridge, Cambridge CB1 2PZ, United Kingdom (e-mail: [email protected]). N. Chopin is Professor, CREST-ENSAE and HEC Paris, Malakoff, France (e-mail: [email protected]).

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in Statistical Science, 2015, Vol. 30, No. 3, 328–351. This reprint differs from the original in pagination and typographic detail.

Abstract. Nonlinear non-Gaussian state-space models are ubiquitous in statistics, econometrics, information engineering and signal processing. Particle methods, also known as Sequential Monte Carlo (SMC) methods, provide reliable numerical approximations to the associated state inference problems. However, in most applications, the state-space model of interest also depends on unknown static parameters that need to be estimated from the data. In this context, standard particle methods fail and it is necessary to rely on more sophisticated algorithms. The aim of this paper is to present a comprehensive review of particle methods that have been proposed to perform static parameter estimation in state-space models. We discuss the advantages and limitations of these methods and illustrate their performance on simple models.

Key words and phrases: Bayesian inference, maximum likelihood inference, particle filtering, Sequential Monte Carlo, state-space models.

1. INTRODUCTION

State-space models, also known as hidden Markov models, are a very popular class of time series models that have found numerous applications in fields as diverse as statistics, ecology, econometrics, engineering and environmental sciences; see [11, 30, 34, 87]. Formally, a state-space model is defined by two stochastic processes {X_n}_{n≥0} and {Y_n}_{n≥0}. The process {X_n}_{n≥0} is an X-valued latent Markov process of initial density µ_θ(x_0) and Markov transition density f_θ(x′|x), that is,

(1.1)  X_0 ∼ µ_θ(·),  X_{n+1} | (X_{0:n} = x_{0:n}) ∼ f_θ(· | x_n),

whereas the Y-valued observations {Y_n}_{n≥0} satisfy

(1.2)  Y_n | (X_{0:n} = x_{0:n}, Y_{0:n−1} = y_{0:n−1}) ∼ g_θ(· | x_n),

where g_θ(y|x) denotes the conditional marginal density of Y_n given X_n = x. Here θ ∈ Θ is the parameter of the model and z_{i:j} denotes the components (z_i, z_{i+1}, ..., z_j) of a sequence {z_n}. The spaces X and Y can be Euclidean, but what follows applies to more general state spaces as well.

The popularity of state-space models stems from the fact that they are flexible and easily interpretable. Applications of state-space models include stochastic volatility models, where X_n is the volatility of an asset and Y_n its observed log-return [52], biochemical network models where X_n corresponds to the population of various biochemical species and Y_n are imprecise measurements of the size of a subset
early on that this naive approach is problematic [54] due to the parameter space not being explored adequately.
This has motivated over the past fifteen of these species [93], neuroscience models where Xn years the development of many particle methods for is a state vector determining the neuron’s stimulus– the parameter estimation problem, but numerically response function and Yn some spike train data [77]. robust methods have only been proposed recently. However, nonlinear non-Gaussian state-space mod- The main objective of this paper is to provide a els are also notoriously difficult to fit to data and comprehensive overview of this literature. This pa- it is only recently, thanks to the advent of powerful per thus differs from recent survey papers on parti- simulation techniques, that it has been possible to cle methods which all primarily focus on estimating fully realize their potential. the state sequence X0: n or discuss a much wider To illustrate the complexity of inference in state- range of topics, for example, [32, 55, 58, 65]. We space models, consider first the scenario where the will present the main features of each method and parameter θ is known. On-line and off-line infer- comment on their pros and cons. No attempt, how- ever, is made to discuss the intricacies of the specific ence about the state process Xn given the ob- { } implementations. For this we refer the reader to the servations Yn is only feasible analytically for sim- ple models{ such} as the linear Gaussian state-space original references. We have chosen to broadly classify the methods as model. In nonlinear non-Gaussian scenarios, numer- follows: Bayesian or Maximum Likelihood (ML) and ous approximation schemes, such as the Extended whether they are implemented off-line or on-line. In Kalman filter or the Gaussian sum filter [1], have the Bayesian approach, the unknown parameter is been proposed over the past fifty years to solve these assigned a prior distribution and the posterior den- so-called optimal filtering and smoothing problems, sity of this parameter given the observations is to be but these methods lack rigor and can be unreliable characterized. In the ML approach, the parameter in practice in terms of accuracy, while determinis- estimate is the maximizing argument of the likeli- tic integration methods are difficult to implement. hood of θ given the data. Both these inference pro- Monte Carlo (MCMC) methods can cedures can be carried out off-line or on-line. Specifi- obviously be used, but they are impractical for on- cally, in an off-line framework we infer θ using a fixed line inference; and even for off-line inference, it can observation record y0: T . In contrast, on-line meth- be difficult to build efficient high-dimensional pro- ods update the parameter estimate sequentially as posal distributions for such algorithms. For nonlin- observations yn n≥0 become available. ear non-Gaussian state-space models particle algo- The rest of{ the} paper is organized as follows. In rithms have emerged as the most successful. Their Section 2 we present the main computational chal- widespread popularity is due to the fact that they lenges associated to parameter inference in state- are easy to implement, suitable for parallel imple- space models. In Section 3 we review particle meth- mentation [60] and, more importantly, have been ods for filtering when the model does not include demonstrated in numerous settings to yield more any unknown parameters, whereas Section 4 is ded- accurate estimates than the standard alternatives, icated to smoothing. These filtering and smoothing for example, see [11, 23, 30, 67]. 
techniques are at the core of the off-line and on-line In most practical situations, the model (1.1)–(1.2) ML parameter procedures described in Section 5. depends on an unknown parameter vector θ that In Section 6 we discuss particle methods for off-line needs to be inferred from the data either in an on- and on-line Bayesian parameter inference. The per- line or off-line manner. In fact inferring the param- formance of some of these algorithms are illustrated eter θ is often the primary problem of interest; for on simple examples in Section 7. Finally, we sum- example, for biochemical networks, we are not inter- marize the main advantages and drawbacks of the ested in the population of the species per se, but we methods presented and discuss some open problems want to infer some chemical rate constants, which in Section 8. ′ are parameters of the transition prior fθ(x x). Al- though it is possible to define an extended| state 2. COMPUTATIONAL CHALLENGES ASSOCIATED TO PARAMETER INFERENCE that includes the original state Xn and the param- eter θ and then apply standard particle methods to A key ingredient of ML and Bayesian parame- perform parameter inference, it was recognized very ter inference is the likelihood function pθ(y0: n) of ON PARTICLE METHODS FOR PARAMETER ESTIMATION IN STATE-SPACE MODELS 3

θ which satisfies 3. FILTERING AND PARTICLE APPROXIMATIONS (2.1) pθ(y0: n)= pθ(x0: n,y0: n) dx0: n, Z In this section the parameter θ is assumed known where pθ(x0: n,y0: n) denotes the joint density of and we focus on the problem of estimating the la- (X0: n,Y0: n) which is given from equations (1.1)– tent process Xn n≥0 sequentially given the obser- (1.2) by vations. An important{ } by-product of this so-called pθ(x0: n,y0: n) filtering task from a parameter estimation viewpoint (2.2) n n is that it provides us with an on-line scheme to com- = µ (x ) f (x x ) g (y x ). pute pθ(y0: n) n≥0. As outlined in Section 2, the θ 0 θ k| k−1 θ k| k { } k=1 k=0 particle approximation of these likelihood terms is Y Y a key ingredient of numerous particle-based param- The likelihood function is also the normalizing con- stant of the posterior density p (x y ) of the eter inference techniques discussed further on. θ 0: n| 0: n latent states X0: n given data y0: n, 3.1 Filtering pθ(x0: n,y0: n) (2.3) pθ(x0: n y0: n)= . Filtering usually denotes the task of estimating | pθ(y0: n) recursively in time the sequence of marginal poste- This posterior density is itself useful for comput- riors p (x y ) , known as the filtering den- { θ n| 0: n }n≥0 ing the score vector θℓn(θ) associated to the log- sities. However, we will adopt here a more gen- ∇ likelihood ℓn(θ) = log pθ(y0: n), as Fisher’s identity eral definition and will refer to filtering as the yields task of estimating the sequence of joint posteriors pθ(x0: n y0: n) n≥0 recursively in time, but we will θℓn(θ)= θ log pθ(x0: n,y0: n) { | } ∇ ∇ still refer to the marginals pθ(xn y0: n) n≥0 as the (2.4) Z filtering densities. { | } p (x y ) dx . · θ 0: n| 0: n 0: n It is easy to verify from (2.1) and (2.3) that the The main practical issue associated to parame- posterior pθ(x0: n y0: n) and the likelihood pθ(y0: n) ter inference in nonlinear non-Gaussian state-space satisfy the following| fundamental recursions: for n models is that the likelihood function is intractable. 1, ≥ As performing ML parameter inference requires maximizing this intractable function, it prac- pθ(x0: n y0: n) (3.1) | tically that it is necessary to obtain reasonably low- f (x x )g (y x ) variance Monte Carlo estimates of it, or of the as- = p (x y ) θ n| n−1 θ n| n θ 0: n−1| 0: n−1 p (y y ) sociated score vector if this maximization is car- θ n| 0: n−1 ried out using gradient-based methods. Both tasks and involve approximating high-dimensional integrals, (3.2) p (y )= p (y )p (y y ), (2.1) and (2.4), whenever n is large. On-line infer- θ 0: n θ 0: n−1 θ n| 0: n−1 ence requires additionally that these integrals be ap- where proximated on the fly, ruling out the applications of standard computational tools such as MCMC. pθ(yn y0: n−1) Bayesian parameter inference is even more chal- | lenging, as it requires approximating the posterior (3.3) = g (y x )f (x x ) θ n| n θ n| n−1 density Z pθ(xn−1 y0: n−1) dxn−1: n. pθ(y0: n)p(θ) · | (2.5) p(θ y0: n)= , | pθ(y0: n)p(θ) dθ There are essentially two classes of models for which p (x y ) and p (y ) can be computed where p(θ) is the priorR density. Here not only θ 0: n| 0: n θ 0: n pθ(y0: n) but also p(y0: n)= pθ(y0: n)p(θ) dθ are in- exactly: the class of linear Gaussian models, for tractable and, once more, these integrals must be which the above recursions may be implemented us- approximated on-line if oneR wants to update the ing Kalman techniques, and when is a finite state X posterior density sequentially. 
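As a concrete illustration of the linear Gaussian case just mentioned, the following minimal Python sketch (our own illustration, not code from the paper) simulates data from the scalar model X_n = ρX_{n−1} + τW_n, Y_n = X_n + σV_n used later in Section 7.1 and computes the exact log-likelihood log p_θ(y_{0:T}) through the decomposition (3.2) with a Kalman filter; the stationary initial law N(0, τ²/(1 − ρ²)) for µ_θ is our assumption.

import numpy as np

def simulate_lgssm(T, rho, tau, sigma, rng=None):
    """Simulate X_n = rho*X_{n-1} + tau*W_n, Y_n = X_n + sigma*V_n (cf. (7.1)).
    The stationary initial law N(0, tau^2/(1 - rho^2)) is assumed for mu_theta."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.empty(T + 1)
    y = np.empty(T + 1)
    x[0] = rng.normal(0.0, tau / np.sqrt(1.0 - rho**2))
    y[0] = x[0] + sigma * rng.normal()
    for n in range(1, T + 1):
        x[n] = rho * x[n - 1] + tau * rng.normal()
        y[n] = x[n] + sigma * rng.normal()
    return x, y

def kalman_loglik(y, rho, tau, sigma):
    """Exact log p_theta(y_{0:T}) via the decomposition (3.2)."""
    m, P = 0.0, tau**2 / (1.0 - rho**2)        # predictive mean/variance of X_0
    ll = 0.0
    for obs in y:
        S = P + sigma**2                       # variance of Y_n given y_{0:n-1}
        ll += -0.5 * (np.log(2.0 * np.pi * S) + (obs - m)**2 / S)
        K = P / S                              # Kalman gain
        m, P = m + K * (obs - m), (1.0 - K) * P        # filtering moments
        m, P = rho * m, rho**2 * P + tau**2            # predictive moments for n+1
    return ll

# Example with theta* = (0.8, 0.1, 1) as in Section 7.1:
# x, y = simulate_lgssm(1000, rho=0.8, tau=0.1, sigma=1.0)
# print(kalman_loglik(y, 0.8, 0.1, 1.0))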
We will show in this space; see, for example, [11]. For other models these review that particle methods are particularly well quantities are typically intractable, that is, the den- suited to these integration tasks. sities in (3.1)–(3.3) cannot be computed exactly. 4 N. KANTAS ET AL.

Algorithm 1 Auxiliary particle filtering

• At time n = 0, for all i ∈ {1, ..., N}:
  1. Sample X_0^i ∼ q_θ(x_0 | y_0).
  2. Compute W̄_1^i ∝ w_0(X_0^i) q_θ(y_1 | X_0^i), with Σ_{i=1}^N W̄_1^i = 1.
  3. Resample X̄_0^i ∼ Σ_{i=1}^N W̄_1^i δ_{X_0^i}(dx_0).

• At time n ≥ 1, for all i ∈ {1, ..., N}:
  1. Sample X_n^i ∼ q_θ(x_n | y_n, X̄_{n−1}^i) and set X_{0:n}^i ← (X̄_{0:n−1}^i, X_n^i).
  2. Compute W̄_{n+1}^i ∝ w_n(X_{n−1:n}^i) q_θ(y_{n+1} | X_n^i), with Σ_{i=1}^N W̄_{n+1}^i = 1.
  3. Resample X̄_{0:n}^i ∼ Σ_{i=1}^N W̄_{n+1}^i δ_{X_{0:n}^i}(dx_{0:n}).
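The following is a minimal Python sketch (ours, not the authors' implementation) of the bootstrap special case of Algorithm 1, obtained by taking q_θ(x_n | y_n, x_{n−1}) = f_θ(x_n | x_{n−1}) and q_θ(y_{n+1} | x_n) = 1, for the scalar linear Gaussian model of Section 7.1. It uses multinomial resampling at every step and returns the estimate of log p_θ(y_{0:T}) obtained by multiplying the successive estimates of p_θ(y_n | y_{0:n−1}).

import numpy as np

def bootstrap_filter(y, rho, tau, sigma, N=1000, rng=None):
    """Bootstrap particle filter (Algorithm 1 with q_theta = f_theta and
    q_theta(y_{n+1}|x_n) = 1) for X_n = rho*X_{n-1} + tau*W_n, Y_n = X_n + sigma*V_n.
    Returns the final particle cloud and log p_hat_theta(y_{0:T})."""
    rng = np.random.default_rng() if rng is None else rng
    T = len(y) - 1
    x = rng.normal(0.0, tau / np.sqrt(1.0 - rho**2), size=N)   # draw from mu_theta (assumed)
    loglik = 0.0
    for n in range(T + 1):
        if n > 0:
            x = rho * x + tau * rng.normal(size=N)             # propagate with f_theta
        logw = -0.5 * (np.log(2.0 * np.pi * sigma**2) + (y[n] - x)**2 / sigma**2)
        wmax = logw.max()
        w = np.exp(logw - wmax)
        loglik += wmax + np.log(w.mean())                      # log p_hat(y_n | y_{0:n-1})
        W = w / w.sum()
        idx = rng.choice(N, size=N, p=W)                       # multinomial resampling
        x = x[idx]
    return x, loglik

# Usage with data y simulated from the same model:
# particles, ll_hat = bootstrap_filter(y, 0.8, 0.1, 1.0, N=500)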

3.2 Particle Filtering One recovers the SISR algorithm as a special case of Algorithm 1 by taking q (y x ) = 1 [or, more 3.2.1 Algorithm Particle filtering methods are a θ n n−1 generally, by taking q (y x )=| h (y ), some ar- set of simulation-based techniques which approxi- θ n n−1 θ n bitrary positive function].| Further, one recovers mate numerically the recursions (3.1) to (3.3). We the bootstrap filter by taking q (x y ,x ) = focus here on the APF (auxiliary particle filter [78]) θ n n n−1 f (x x ). This is an important special| case, as for two reasons: first, this is a popular approach, in θ n n−1 some| complex models are such that one may sample particular, in the context of parameter estimation from f (x x ), but not compute the correspond- (see, e.g., Section 6.2.3); second, the APF covers as θ n n−1 ing density;| in such a case the bootstrap filter is special cases a large class of particle algorithms, such the only implementable algorithm. For models such as the bootstrap filter [46] and SISR (Sequential Im- portance Sampling Resampling [31, 69]). that the density fθ(xn xn−1) is tractable, [78] rec- ommend selecting q (x| y ,x )= p (x y ,x ) Let θ n| n n−1 θ n| n n−1 and qθ(yn xn−1)= pθ(yn xn−1) when these quanti- (3.4) q (x ,y x )= q (x y ,x )q (y x ), ties are tractable,| and using| approximations of these θ n n| n−1 θ n| n n−1 θ n| n−1 quantities in scenarios when they are not. The intu- where q (x y ,x ) is a probability density func- θ n n n−1 ition for these recommendations is that this should tion which is| easy to sample from and q (y x ) θ n n−1 make the weight function (3.6) nearly constant. is not necessarily required to be a probability| den- The computational complexity of Algorithm 1 is sity function but just a nonnegative function of (N) per time step; in particular, see, for example, (x ,y ) one can evaluate. [For n = 0, n−1 n [O31], page 201, for a (N) implementation of the remove the∈X dependency ×Y on x , i.e., q (x ,y )= n−1 θ 0 0 resampling step. At timeO n, the approximations of q (x y )q (y ).] θ 0 0 θ 0 p (x y ) and p (y y ) presented earlier in The| algorithm relies on the following importance θ 0: n 0: n θ n 0: n−1 (2.3) and| (3.3), respectively,| are given by weights: N gθ(y0 x0)µθ(x0) i (3.5) w0(x0)= | , (3.7) pˆθ(dx0: n y0: n)= WnδXi (dx0: n), qθ(x0 y0) | 0: n | Xi=1 gθ(yn xn)fθ(xn xn−1) N wn(xn−1: n)= | | 1 i qθ(xn,yn xn−1) pˆ (y y )= w (X ) | θ n| 0: n−1 N n n−1: n (3.6) i=1 ! for n 1. (3.8) X ≥ N In order to alleviate the notational burden, we omit W i q (y Xi ) , · n−1 θ n| n−1 the dependence of the importance weights on θ; we i=1 ! will do so in the remainder of the paper when no X i i N i confusion is possible. The auxiliary particle filter can where Wn wn(Xn−1: n), Wn = 1 andp ˆθ(y0)= ∝ i=1 be summarized in Algorithm 1 [12, 78]. 1 N w (Xi ). In practice, one uses (3.7) mostly N i=1 0 0 P P ON PARTICLE METHODS FOR PARAMETER ESTIMATION IN STATE-SPACE MODELS 5 to obtain approximations of posterior moments model, Aθ,n,p typically grows exponentially with n. N This is intuitively not surprising, as the dimension i i W ϕ(X ) E[ϕ(X ) y ], of the target density pθ(x0: n y0: n) is increasing with n 0: n ≈ 0: n | 0: n | i=1 n. Moreover, the successive resampling steps lead to X a depletion of the particle population; p (x y ) but expressing particle filtering as a method for θ 0: m| 0: n approximating distributions (rather than moments) will eventually be approximated by a single unique particle as n m increases. 
This is referred to as turns out to be a more convenient formalization. The − likelihood (3.2) is then estimated through the degeneracy problem in the literature ([11], Fig- n ure 8.4, page 282). This is a fundamental weakness (3.9) pˆ (y ) =p ˆ (y ) pˆ (y y ). of particle methods: given a fixed number of parti- θ 0: n θ 0 θ k| 0: k−1 cles N, it is impossible to approximate pθ(x0: n y0: n) kY=1 | The resampling procedure is introduced to replicate accurately when n is large enough. particles with high weights and discard particles Fortunately, it is also possible to establish much with low weights. It serves to focus the computa- more positive results. Many state-space models pos- tional efforts on the “promising” regions of the state sess the so-called exponential forgetting property space. We have presented above the simplest resam- ([23], Chapter 4). This property states that for ′ pling scheme. Lower variance resampling schemes any x0,x and data y0: n, there exist constants 0 ∈X have been proposed in [53, 69], as well as more ad- Bθ < and λ [0, 1) such that ∞ ∈ vanced particle algorithms with better overall per- ′ formance, for example, the Resample–Move algo- pθ(dxn y1: n,x0) pθ(dxn y1: n,x0) TV (3.11) k | − | k rithm [44]. For the sake of simplicity, we have also n Bθλ , presented a version of the algorithm that operates ≤ resampling at every iteration n. It may be more effi- where TV is the total variation distance, that is, cient to trigger resampling only when a certain crite- the optimalk · k filter forgets exponentially fast its initial rion regarding the degeneracy of the weights is met; condition. This property is typically satisfied when see [31] and [68], pages 35 and 74. the signal process X is a uniformly ergodic { n}n≥0 3.2.2 Convergence results Many sharp convergen- Markov chain and the observations Yn n≥0 are not too informative ([23], Chapter 4), or{ when} Y ce results are available for particle methods [23]. A { n}n≥0 selection of these results that gives useful insights on are informative enough that it effectively restricts the difficulties of estimating static parameters with the hidden state to a bounded region around it [76]. particle methods is presented below. Weaker conditions can be found in [29, 90]. When Under minor regularity assumptions, one can exponential forgetting holds, it is possible to estab- show that for any n 0, N > 1 and any bounded lish much stronger uniform-in-time convergence re- ≥ test function ϕ : n+1 [ 1, 1], there exist con- sults for functions ϕ that depend only on recent n X → − n stants Aθ,n,p < such that for any p 1 states. Specifically, for an integer L> 0 and any ∞ ≥ L bounded test function ΨL : [ 1, 1], there ex- E X → − ϕn(x0: n) ist constants Cθ,L,p < such that for any p 1,  Z n L 1, ∞ ≥ p ≥ − (3.10) pˆ (dx y ) p (dx y ) p · { θ 0: n| 0: n − θ 0: n| 0: n } E  Ψ(xn−L+1: n)∆θ,n(dxn−L+1: n) X L Aθ,n,p (3.12)  Z  , p/2 C ≤ N θ,L,p , where the expectation is with respect to the law ≤ N p/2 of the particle filter. In addition, for more general where classes of functions, we can obtain for any fixed n a (CLT) as N + ([17] ∆θ,n(dxn−L+1: n) and [23], Proposition 9.4.2). Such results→ are reassur-∞ ing but weak, as they reveal nothing regarding long- (3.13) = pˆθ(dx0: n y0: n) n−L+1 { | time behavior. For instance, without further restric- Zx0: n−L∈X tions on the class of functions ϕ and the state-space p (dx y ) . n − θ 0: n| 0: n } 6 N. KANTAS ET AL.

This result explains why particle filtering is an fixed batch of observations y0: T . Smoothing for a effective computational tool in many applications fixed parameter θ is at the core of the two main par- such as tracking, where one is only interested in ticle ML parameter inference techniques described pθ(xn−L+1: n y0: n), as the approximation error is in Section 5, as these procedures require computing | uniformly bounded over time. smoothed additive functionals of the form (3.14). Similar positive results hold forp ˆθ(y0: n). This es- Clearly, one could unfold the recursion (3.1) from timate is unbiased for any N 1 ([23], Theorem n = 0 to n = T to obtain p (x y ). However, ≥ θ 0: T | 0: T 7.4.2, page 239), and, under assumption (3.11), the as pointed out in the previous section, the path relative variance of the likelihood estimatep ˆθ(y0: n), space approximation (3.7) suffers from the degener- that is the variance of the ratiop ˆθ(y0: n)/pθ(y0: n), acy problem and yields potentially high variance es- is bounded above by D n/N [14, 90]. This is a great θ timates of (3.14) as (3.15) holds. This has motivated improvement over the exponential increase with n the development of alternative particle approaches that holds for standard importance sampling tech- to approximate p (x y ) and its marginals. niques; see, for instance, [32]. However, the con- θ 0: T | 0: T stants Cθ,L,p and Dθ are typically exponential in nx, 4.1 Fixed-lag Approximation the dimension of the state vector Xn. We note that nonstandard particle methods designed to minimize For state-space models with “good” forgetting the variance of the estimate of pθ(y0: n) have recently properties [e.g., (3.11)], we have been proposed [92]. (4.1) p (x y ) p (x y ) Finally, we recall the theoretical properties of par- θ 0: n| 0: T ≈ θ 0: n| 0 : (n+L)∧T ticles estimates of the following so-called smoothed for L large enough, that is, observations collected additive functional ([11], Section 8.3 and [74]), at times k>n + L do not bring any significant n θ additional information about X0: n. In particular, n = sk(xk−1: k) θ S n+1 when having to evaluate T of the form (3.14) ZX (k=1 ) S (3.14) X we can approximate the expectation of sn(xn−1: n) p (x y ) dx . w.r.t. pθ(xn−1: n y0: T ) by its expectation w.r.t. θ 0: n 0: n 0: n | · | p (x y ). Such quantities are critical when implementing ML θ n−1: n| 0 : (n+L)∧T Algorithmically, a particle implementation of (4.1) parameter estimation procedures; see Section 5. If i means not resampling the components X0: n of the we substitutep ˆθ(dx0: n y0: n) to pθ(x0: n y0: n) dx0: n | | particles Xi obtained by particle filtering at times to approximate θ, then we obtain an estimate θ 0: k n n k>n + L. This was first suggested in [56] and which can be computedS recursively in time; see, forS used in [11], Section 8.3, and [74]. This algorithm is example, [11], Section 8.3. For the remainder of thisb paper we will refer to this approximation as the path simple to implement, but the main practical prob- space approximation. Even when (3.11) holds, there lem is the choice of L. If taken too small, then p (x y ) is a poor approximation of exists 0 < Fθ, Gθ < such that the asymptotic bias θ 0: n| 0 : (n+L)∧T ) ∞ p (x y ). If taken too large, the degeneracy re- [23] and variance [81] satisfy θ 0: n| 0: T mains substantial. 
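A minimal sketch (ours) of this fixed-lag implementation of (4.1): store the particle trajectories in an array and, when resampling at time k, apply the resampling indices only to the trajectory components with time index at least k − L, so that components older than the lag are frozen.

import numpy as np

def fixed_lag_trajectories(y, rho, tau, sigma, N=500, L=10, rng=None):
    """Bootstrap filter storing trajectories; components X_{0:k-L} are not
    resampled at time k, which implements the fixed-lag approximation (4.1)."""
    rng = np.random.default_rng() if rng is None else rng
    T = len(y) - 1
    traj = np.empty((N, T + 1))
    traj[:, 0] = rng.normal(0.0, tau / np.sqrt(1.0 - rho**2), size=N)
    for k in range(T + 1):
        if k > 0:
            traj[:, k] = rho * traj[:, k - 1] + tau * rng.normal(size=N)
        w = np.exp(-0.5 * (y[k] - traj[:, k]) ** 2 / sigma**2)
        idx = rng.choice(N, size=N, p=w / w.sum())
        start = max(0, k - L)                          # freeze components before k - L
        traj[:, start:k + 1] = traj[idx, start:k + 1]
    return traj                                        # row i approximates a draw of X_{0:T}

# Smoothed additive functionals, e.g. the sum of E[X_{k-1} X_k | y_{0:T}] over k,
# can then be estimated by averaging the corresponding sums over the rows of traj.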
Moreover, even as N , this n n2 →∞ (3.15) E( θ) θ F , V( θ) G particle approximation will have a nonvanishing bias | Sn − Sn|≤ θ N Sn ≥ θ N since pθ(x0: n y0: T ) = pθ(x0: n y0 : (n+L)∧T ). 2 | 6 | for sp : b[ 1, 1] where the varianceb is w.r.t. the law of theX → particle− filter. The fact that the variance 4.2 Forward–Backward Smoothing grows at least quadratically in time follows from the 4.2.1 Principle The joint smoothing density θ degeneracy problem and makes n unsuitable for p (x y ) can be expressed as a function of the S θ 0: T | 0: T some on-line likelihood based parameter estimation filtering densities p (x y ) T using the follow- schemes discussed in Section 5. { θ n| 0: n }n=0 b ing key decomposition: 4. SMOOTHING pθ(x0: T y0: T ) (4.2) | In this section the parameter θ is still assumed T −1 known and we focus on smoothing, that is, the prob- = pθ(xT y0: T ) pθ(xn y0: n,xn+1), lem of estimating the latent variables X0: T given a | | nY=0 ON PARTICLE METHODS FOR PARAMETER ESTIMATION IN STATE-SPACE MODELS 7 where p (x y ,x ) is a backward (in time) averages that approximate smoothing expectations θ n| 0: n n+1 Markov transition density given by E[ϕ(X0: T ) y0: T ]. In that scenario, the first approach costs (N 2|(T +1)), while the second approach costs fθ(xn+1 xn)pθ(xn y0: n) O (4.3) p (x y ,x )= | | . (N(T + 1)) on average. In some applications, the θ n| 0: n n+1 p (x y ) O θ n+1| 0: n rejection sampling procedure can be computation- T ally costly as the acceptance probability can be very A backward in time recursion for pθ(xn y0: T ) { | }n=0 small for some particles; see, for example, Section 4.3 follows by integrating out x0: n−1 and xn+1: T in (4.2) while applying (4.3), in [75] for empirical results. This has motivated the development of hybrid procedures combining FF- pθ(xn y0: T ) BSa and rejection sampling [85]. | We can also directly approximate the marginals (4.4) = pθ(xn y0: n) T | pθ(xn y0: T ) . Assuming we have an approxima- { | }n=0 f (xn+1 xn)p (xn+1 y0: T ) N i θ θ tionp ¯θ(dxn+1 y0: T ) = W δXi (dxn+1) | | dxn+1. | i=1 n+1|T n+1 · pθ(xn+1 y0: n) i i Z | where WT |T = WT , then byP using (4.4) and (4.5), we This is referred to as forward–backward smooth- obtain the approximationp ¯θ(dxn y0: T ) = T N i | ing, as a forward pass yields pθ(xn y0: n) n=0 which W δ i (dx ) with { | } i=1 n|T Xn n can be used in a backward pass to obtain pθ(xn T { T −1| N j j i y ) . Combined to p (x y ,x ) , P W fθ(X Xn) 0: T n=0 θ n 0: n n+1 n=0 i i n+1|T n+1| } { θ | } (4.6) Wn|T = Wn j . this allows us to obtain T . An alternative to × N l l S j=1 l=1 Wnfθ(Xn+1 Xn) these forward–backward procedures is the general- X | ized two-filter formula [6]. This Forward FilteringP Backward Smoothing (FFBSm, where “m” stands for “marginal”) pro- 4.2.2 Particle implementation The decomposi- cedure requires (N 2(T + 1)) operations to approx- tion (4.2) suggests that it is possible to sample ap- imate p (x y O ) T instead of (N(T + 1)) for { θ n| 0: T }n=0 O proximately from pθ(x0: T y0: T ) by running a par- the path space and fixed-lag methods. However, | ticle filter from time n = 0 to T , storing the ap- this high computational complexity of forward– T proximate filtering distributions pˆθ(dxn y0: n) , backward estimates can be reduced using fast com- { | }n=0 that is, the marginals of (3.7), then sampling XT putational methods [57]. 
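To make this backward-sampling procedure concrete, here is a minimal Python sketch (ours) for a scalar Gaussian transition f_θ(x′ | x) = N(x′; ρx, τ²): a bootstrap filter stores the weighted particle clouds, and a trajectory is then drawn backward with probabilities proportional to W_n^i f_θ(X_{n+1} | X_n^i), as in equation (4.5) below. Each backward draw costs O(N(T + 1)); the rejection sampling variant discussed in the text reduces this further when f_θ is bounded.

import numpy as np

def forward_filter(y, rho, tau, sigma, N=500, rng=None):
    """Bootstrap filter; stores the weighted clouds {(X_n^i, W_n^i)} for all n."""
    rng = np.random.default_rng() if rng is None else rng
    T = len(y) - 1
    X = np.empty((T + 1, N))
    W = np.empty((T + 1, N))
    x = rng.normal(0.0, tau / np.sqrt(1.0 - rho**2), size=N)
    for n in range(T + 1):
        if n > 0:
            x = rho * x + tau * rng.normal(size=N)
        w = np.exp(-0.5 * (y[n] - x) ** 2 / sigma**2)
        X[n], W[n] = x, w / w.sum()
        x = x[rng.choice(N, size=N, p=W[n])]          # resample before the next step
    return X, W

def backward_sample(X, W, rho, tau, rng=None):
    """Draw one path approximately from p_theta(x_{0:T} | y_{0:T}) by backward sampling."""
    rng = np.random.default_rng() if rng is None else rng
    T, N = X.shape[0] - 1, X.shape[1]
    path = np.empty(T + 1)
    path[T] = X[T, rng.choice(N, p=W[T])]
    for n in range(T - 1, -1, -1):
        bw = W[n] * np.exp(-0.5 * (path[n + 1] - rho * X[n]) ** 2 / tau**2)
        path[n] = X[n, rng.choice(N, p=bw / bw.sum())]
    return path

# X, W = forward_filter(y, 0.8, 0.1, 1.0)
# paths = np.array([backward_sample(X, W, 0.8, 0.1) for _ in range(100)])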
Particle approximations ∼ pˆ (dx y ) and for n = T 1, T 2,..., 0 sam- of generalized two-filter smoothing procedures have θ T | 0: T − − pling Xn pˆθ(dxn y0: n, Xn+1) where this distribu- also been proposed in [6, 38]. tion is obtained∼ by| substitutingp ˆ (dx y ) for θ n| 0: n 4.3 Forward Smoothing pθ(dxn y0: n) in (4.3): | 4.3.1 Principle Whenever we are interested in θ pˆθ(dxn y0: n, Xn+1) computing the sequence n≥0 recursively in (4.5) | {Sn} N i i time, the forward–backward procedure described W fθ(Xn+1 X )δXi (dxn) = i=1 n | n n . above is cumbersome, as it requires performing a N i i P i=1 Wnfθ(Xn+1 Xn) new backward pass with n + 1 steps at time n. An | important but not well-known result is that it is pos- This Forward FilteringP Backward Sampling (FF- sible to implement exactly the forward–backward BSa) procedure was proposed in [45]. It requires procedure using only a forward procedure. This re- (N(T + 1)) operations to generate a single path sult is at the core of [34], but its exposition relies on OX , as sampling from (4.5) costs (N) oper- 0: T O tools which are nonstandard for statisticians. We fol- ations. However, as noted in [28], it is possible low here the simpler derivation proposed in [24, 25] to sample using rejection from an alternative ap- which simply consists of rewriting (3.14) as proximation of pθ(xn y0: n, Xn+1) in (1) opera- | O θ θ tions if we use an unweighted particle approxima- (4.7) = V (xn)p (xn y0: n) dxn, Sn n θ | tion of pθ(xn y0: n) in (4.3) and if the transition Z | ′ where prior satisfies fθ(x x) C < . Hence, with this n approach, sampling| a path≤ X ∞ costs, on average, 0: T V θ(x ) := s (x ) only (T + 1) operations. A related rejection tech- n n k k−1: k O Z (k=1 ) nique was proposed in [48]. In practice, one may gen- (4.8) X erate N such trajectories to compute Monte Carlo p (x y ,x ) dx . · θ 0: n−1| 0: n−1 n 0: n−1 8 N. KANTAS ET AL.

θ It can be easily checked using (4.2) that Vn (xn) sat- 4.4 Convergence Results for Particle Smoothing isfies the following forward recursion for n 0: ≥ Empirically, for a fixed number of particles, these smoothing procedures perform significantly much V θ (x )= V θ(x )+ s (x ) n+1 n+1 { n n n+1 n : n+1 } better than the naive path space approach to (4.9) Z smoothing (i.e., simply propagating forward the p (x y ,x ) dx , · θ n| 0: n n+1 n complete state trajectory within a particle filter- with V θ(x ) = 0 and where p (x y ,x ) is ing algorithm). Many theoretical results validating 0 0 θ n| 0: n n+1 given by (4.3). In practice, we shall approximate these empirical findings have been established un- der assumption (3.11) and additional regularity as- the function V θ on a certain grid of values x , as n n sumptions. The particle estimate of θ based on explained in the next section. n the fixed-lag approximation (4.1) has anS asymptotic 4.3.2 Particle implementation We can easily pro- variance in n/N with a nonvanishing (as N ) →∞ vide a particle approximation of the forward smooth- bias proportional to n and a constant decreasing ing recursion. Assume you have access to approxi- exponentially fast with L [74]. In [24, 25, 28], it is θ i θ i mations Vn (Xn) of Vn (Xn) at time n, where shown that when (3.11) holds, there exists 0 < Fθ, { }N {i } pˆ (dx y )= W δ i (dx ). Then when up- Hθ < such that the asymptotic bias and variance θ n 0: n i=1 n Xn n ∞ | b of the particle estimate of θ computed using the dating our particle filter to obtainp ˆθ(dxn+1 y0: n+1)= Sn N i P | forward–backward procedures satisfy W δ i (dxn+1), we can directly compute i=1 n+1 Xn+1 θ i θ θ n θ n the particle approximations V (X ) by plug- (4.13) E( ) Fθ , V( ) Hθ . P { n+1 n+1 } | Sn − Sn|≤ N Sn ≤ N ging (4.5) andp ˆθ(dxn y0: n) in (4.7)–(4.9) to obtain | The bias forb the path space and forward–backwardb b θ N estimators of n are actually equal [24]. Recently, V θ (Xi )= W jf (Xi Xj ) it has also beenS established in [75] that, under sim- n+1 n+1 n θ n+1| n j=1 ilar regularity assumptions, the estimate obtained X b through (4.12) also admits an asymptotic variance θ j j i in n/N whenever K 2. (4.10) Vn (Xn)+ sn+1(Xn, Xn+1) ≥ · { }! 5. MAXIMUM LIKELIHOOD PARAMETER N b j i j ESTIMATION Wnfθ(Xn+1 Xn) , | ! . Xj=1 We describe in this section how the particle fil- N tering and smoothing techniques introduced in Sec- θ i θ i tions 3 and 4 can be used to implement maximum (4.11) n = WnVn (Xn). S likelihood parameter estimation techniques. Xi=1 b b This approach requires (N 2(n + 1)) operations 5.1 Off-Line Methods O to compute θ at iteration n. A variation over We recall that ℓ (θ) denote the log-likelihood Sn T this idea recently proposed in [75] and [88] consists function associated to data y0: T introduced in Sec- θ i i,j of approximatingb Vn+1(Xn+1) by sampling Xn tion 2. So as to maximize ℓT (θ), one can rely on i ∼ pˆθ(dxn y0: n, Xn+1) for j = 1,...,K to obtain standard nonlinear optimization methods, for ex- | ample, using quasi-Newton or gradient-ascent tech- θ i Vn+1(Xn+1) niques. We will limit ourselves to these approaches (4.12) even if they are sensitive to initialization and might K b 1 θ i,j i,j i get trapped in a local maximum. = Vn (Xn )+ sn+1(Xn , Xn+1) . K { } 5.1.1 Likelihood function evaluation We have seen Xj=1 in Section 3 that ℓT (θ) can be approximated us- b i When it is possible to sample fromp ˆθ(dxn y0: n, X ) ing particle methods, for any fixed θ Θ. 
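As a concrete illustration of the forward-only smoothing recursion (4.9)–(4.11), the following minimal sketch (ours, again for a scalar Gaussian transition combined with a bootstrap filter) propagates the particle values V̂_n^θ(X_n^i) alongside the filter and outputs the estimates Ŝ_n^θ of the smoothed additive functional; its cost is O(N²) per time step. Passing the additive term s_n as a callable is our design choice.

import numpy as np

def forward_smoothing(y, rho, tau, sigma, s_fun, N=200, rng=None):
    """O(N^2)-per-step particle version of the forward smoothing recursion
    (4.9)-(4.11) for S_n = E[ sum_{k<=n} s_k(X_{k-1}, X_k) | y_{0:n} ].
    s_fun(x_prev, x_curr) must accept NumPy arrays and act elementwise."""
    rng = np.random.default_rng() if rng is None else rng
    T = len(y) - 1
    x = rng.normal(0.0, tau / np.sqrt(1.0 - rho**2), size=N)
    w = np.exp(-0.5 * (y[0] - x) ** 2 / sigma**2)
    W = w / w.sum()
    V = np.zeros(N)                                    # V_0^theta = 0
    S = [np.dot(W, V)]                                 # S_0, cf. (4.11)
    for n in range(1, T + 1):
        idx = rng.choice(N, size=N, p=W)
        x_new = rho * x[idx] + tau * rng.normal(size=N)        # propagate resampled particles
        # backward weights W_{n-1}^j f_theta(x_new^i | x^j): an (N, N) matrix
        f = np.exp(-0.5 * (x_new[:, None] - rho * x[None, :]) ** 2 / tau**2)
        bw = W[None, :] * f
        bw = bw / bw.sum(axis=1, keepdims=True)
        V = (bw * (V[None, :] + s_fun(x[None, :], x_new[:, None]))).sum(axis=1)   # (4.10)
        w = np.exp(-0.5 * (y[n] - x_new) ** 2 / sigma**2)
        W = w / w.sum()
        S.append(np.dot(W, V))                         # (4.11)
        x = x_new
    return np.array(S)

# e.g. the statistic sum_k X_{k-1} X_k used in Section 7.1:
# S_hat = forward_smoothing(y, 0.8, 0.1, 1.0, s_fun=lambda xp, xc: xp * xc)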
One may | n+1 in (1) operations using rejection sampling, (4.12) wish then to treat ML estimation as an∈ optimization O provides a Monte Carlo approximation to (4.10) of problem using Monte Carlo evaluations of ℓT (θ). overall complexity (NK). When optimizing a function calculated with a Monte O ON PARTICLE METHODS FOR PARAMETER ESTIMATION IN STATE-SPACE MODELS 9

Carlo error, a popular strategy is to make the evalu- 5.1.2 Gradient ascent The log-likelihood ℓT (θ) ated function continuous by using common random may be maximized with the following steepest as- numbers over different evaluations to ease the opti- cent algorithm: at iteration k + 1 mization. Unfortunately, this strategy is not helpful (5.1) θk+1 = θk + γk+1 θℓT (θ) θ=θ , in the particle context. Indeed, in the resampling ∇ | k i N where θℓT (θ) θ=θk is the gradient of ℓT (θ) w.r.t. θ stage, particles Xn i=1 are resampled according to ∇ | { N} i evaluated at θ = θk and γk is a sequence of positive the distribution W δ i (dx ) which admits { } i=1 n+1 Xn n real numbers, called the step-size sequence. Typi- a piecewise constant and hence discontinuous cumu- P cally, γk is determined adaptively at iteration k us- lative distribution function (c.d.f.). A small change ing a line search or the popular Barzilai–Borwein al- in θ will cause a small change in the importance ternative. Both schemes guarantee convergence to a i N weights W n+1 and this will potentially gener- local maximum under weak regularity assumptions; { }i=1 ate a different set of resampled particles. As a result, see [95] for a survey. the log-likelihood function estimate will not be con- The score vector ℓ (θ) can be computed by us- ∇θ T tinuous in θ even if ℓT (θ) is continuous. ing Fisher’s identity given in (2.4). Given (2.2), it is To bypass this problem, an importance sampling easy to check that the score is of the form (3.14). An method was introduced in [49], but it has compu- alternative to Fisher’s identity to compute the score tational complexity (N 2(T + 1)) and only pro- is presented in [20], but this also requires computing O vides low variance estimates in the neighborhood an expectation of the form (3.14). of a suitably preselected parameter value. In the These score estimation methods are not appli- restricted scenario where R, an elegant solu- cable in complex scenarios where it is possible to X ⊆ ′ tion to the discontinuity problem was proposed in sample from fθ(x x), but the analytical expres- [72]. The method uses common random numbers sion of this transition| kernel is unavailable [51]. For and introduces a “continuous” version of the re- those models, a naive approach is to use a finite sampling step by finding a permutation σ such that difference estimate of the gradient; however, this σ(1) σ(2) σ(N) Xn Xn Xn and defining a piece- might generate too high a variance estimate. An wise linear≤ approximation≤···≤ of the resulting c.d.f. from interesting alternative presented in [50], under the which particles are resampled, that is, name of iterated filtering, consists of deriving an ap- proximation of θℓT (θ) θ=θk based on the posterior k−1 σ(k−1) ∇ | T σ(i) σ(k) x Xn moments E(ϑn y0: n), V(ϑn y0: n) of an artifi- F (x)= W + W − , { | | }n=0 n n+1 n+1 σ(k) σ(k−1) cial state-space model with latent Markov process i=1 ! Xn Xn T X − Zn = (Xn,ϑn) n=0, σ(k−1) σ(k) { } Xn x Xn . (5.2) ϑ = ϑ + ε , X f ( x ), ≤ ≤ n+1 n n+1 n+1 ∼ ϑn+1 ·| n This method requires (N(T + 1) log N) operations and observed process Y g ( x ). Here O n+1 ∼ ϑn+1 ·| n+1 due to the sorting of the particles, but the result- εn n≥1 is a zero- sequence with { } 2 ing continuous estimate of ℓT (θ) can be maximized variance σ Σ, E(ϑn+1 ϑn)= ϑn, E(ϑ0)= θk, V(ϑ0)= using standard optimization techniques. Extensions τ 2Σ. 
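A hedged sketch (ours) of the state augmentation in (5.2), specialised for simplicity to estimating θ = ρ in the scalar model of Section 7.1 with τ and σ held fixed: a bootstrap filter is run on Z_n = (X_n, ϑ_n) and returns the filtered means and variances of ϑ_n, the posterior moments on which the iterated filtering gradient approximation of [50] is built (the update rule itself is not reproduced here).

import numpy as np

def augmented_bootstrap_filter(y, theta_mean, tau_theta, sigma_theta,
                               tau, sigma, N=2000, rng=None):
    """Bootstrap filter for the artificial model (5.2) with Z_n = (X_n, theta_n),
    theta playing the role of rho; tau and sigma are kept fixed (our simplification).
    Returns the filtered means E(theta_n | y_{0:n}) and variances V(theta_n | y_{0:n})."""
    rng = np.random.default_rng() if rng is None else rng
    T = len(y) - 1
    theta = rng.normal(theta_mean, tau_theta, size=N)          # V(theta_0) = tau_theta^2
    safe = np.clip(theta, -0.99, 0.99)
    x = rng.normal(0.0, tau / np.sqrt(1.0 - safe**2))          # initial states given theta_0
    means, variances = [], []
    for n in range(T + 1):
        if n > 0:
            theta = theta + sigma_theta * rng.normal(size=N)   # artificial random walk on theta
            x = theta * x + tau * rng.normal(size=N)           # X_n ~ f_{theta_n}(. | x_{n-1})
        w = np.exp(-0.5 * (y[n] - x) ** 2 / sigma**2)
        W = w / w.sum()
        means.append(np.dot(W, theta))
        variances.append(np.dot(W, (theta - means[-1]) ** 2))
        idx = rng.choice(N, size=N, p=W)
        x, theta = x[idx], theta[idx]
    return np.array(means), np.array(variances)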
It is shown in [50| ] that this approximation im- nx 2 2 2 2 to the multivariate case where R (with nx > proves as σ ,τ 0 and σ /τ 0. Clearly, as the X ⊆ 2 → → 1) have been proposed in [59] and [22]. However, variance σ of the artificial dynamic noise εn on the scheme [59] does not guarantee continuity of the θ-component decreases, it will be necessary{ } to the likelihood function estimate and only provides use more particles to approximate θℓT (θ) θ=θk as log-likelihood estimates which are positively corre- the properties of the artificial∇ dynamic| model lated for neighboring values in the parameter space, deteriorates. whereas the scheme in [22] has (N 2) computa- O 5.1.3 Expectation–Maximization Gradient ascent tional complexity and relies on a nonstandard par- algorithms can be numerically unstable as they re- ticle filtering scheme. quire to scale carefully the components of the score When θ is high dimensional, the optimization over vector. The Expectation Maximization (EM) algo- the parameter space may be made more efficient if rithm is a very popular alternative procedure for provided with estimates of the gradient. This is ex- maximizing ℓT (θ) [27]. At iteration k + 1, we set ploited by the algorithms described in the forthcom- (5.3) θk+1 = argmax Q(θk,θ), ing sections. θ 10 N. KANTAS ET AL. where (N −2(T + 1)), but the MSE of the path space es- Otimate is variance dominated, whereas the forward– Q(θk,θ)= log pθ(x0: T ,y0: T ) backward estimates are bias dominated. This can be (5.4) Z understood by decomposing the MSE as the sum of pθ (x0: T y0: T ) dx0: T . the squared bias and the variance and then substi- · k | tuting appropriately for N 2 particles in (3.15) for The sequence ℓ (θ ) generated by this algo- T k k≥0 the path space method and for N particles in (4.13) rithm is nondecreasing.{ } The EM is usually favored for the forward–backward estimates. We confirm ex- by practitioners whenever it is applicable, as it is perimentally this fact in Section 7.1. numerically more stable than gradient techniques. These experimental results suggest that these par- In terms of implementation, the EM consists of ticle smoothing estimates might thus be of limited computing a n -dimensional summary statistic of s interest compared to the path based estimates for the form (3.14) when p (x ,y ) belongs to the θ 0: T 0: T ML parameter inference when accounting for com- exponential family, and the maximizing argument putational complexity. However, this comparison ig- of Q(θ ,θ) can be characterized explicitly through a k nores that the (N 2) computational complexity suitable function Λ : Rns Θ, that is, O → of these particle smoothing estimates can be re- −1 θk (5.5) θk+1 = Λ(T T ). duced to (N) by sampling approximately from S p (x y O ) with the FFBSa procedure in Sec- Discussion of particle implementations θ 0: T 0: T 5.1.4 The tion 4.2 |or by using fast computational methods [57]. path space approximation (3.7) can be used to ap- Related (N) approaches have been developed for proximate the score (2.4) and the summary statis- generalizedO two-filter smoothing [7, 38]. When ap- tics of the EM algorithm at the computational cost plicable, these fast computational methods should of (N(T + 1)); see [11], Section 8.3, and [74, 81]. be favored. Experimentally,O the variance of the associated esti- mates increases typically quadratically with T [81]. 
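For completeness, here is a minimal sketch (ours) of an off-line particle EM of the form (5.3)–(5.5), with the E-step computed by the O(N) path-space method discussed above. The additive statistic s_k and the M-step map Λ are passed as the callables s_fun and m_step, which are hypothetical placeholders here, since their exact form depends on the chosen model's sufficient statistics.

import numpy as np

def particle_e_step(y, theta, s_fun, N=500, rng=None):
    """Path-space (O(N) per step) estimate of E[ sum_k s_k(X_{k-1}, X_k, y_k) | y_{0:T} ]
    for the scalar model of Section 7.1 with theta = (rho, tau, sigma)."""
    rho, tau, sigma = theta
    rng = np.random.default_rng() if rng is None else rng
    T = len(y) - 1
    x = rng.normal(0.0, tau / np.sqrt(1.0 - rho**2), size=N)
    stats = 0.0                                     # running sum of s_k along each trajectory
    for n in range(T + 1):
        if n > 0:
            x_prev = x
            x = rho * x + tau * rng.normal(size=N)
            stats = stats + s_fun(x_prev, x, y[n])  # one value (or vector) per particle
        w = np.exp(-0.5 * (y[n] - x) ** 2 / sigma**2)
        W = w / w.sum()
        if n == T:
            return np.dot(W, stats)
        idx = rng.choice(N, size=N, p=W)
        x = x[idx]
        if n > 0:
            stats = stats[idx]                      # trajectories inherit their running sums

def particle_em(y, theta_init, s_fun, m_step, n_iter=50, N=500):
    """Off-line particle EM, cf. (5.3)-(5.5): E-step by particle smoothing,
    M-step by the user-supplied map m_step (the paper's Lambda, applied there
    to the time-normalised statistics)."""
    theta = theta_init
    for _ in range(n_iter):
        S = particle_e_step(y, theta, s_fun, N=N)
        theta = m_step(S)
    return theta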
5.2 On-Line Methods To obtain estimates whose variance increases only For a long observation sequence the computation typically linearly with T with similar computational of the gradient of ℓT (θ) can be prohibitive, and cost, one can use the fixed-lag approximation pre- moreover, we might have real-time constraints. An sented in Section 4.1 or a more recent alternative alternative would be a recursive procedure in which where the path space method is used, but the addi- the data is run through once sequentially. If θn is tive functional of interest, which is a sum of terms the estimate of the model parameter after the first over n = 0,...,T , is approximated by a sum of sim- n observations, a recursive method would update ilar terms which are now exponentially weighted the estimate to θn+1 after receiving the new data w.r.t. n [73]. These methods introduce a nonvanish- yn. Several on-line variants of the ML procedures ing asymptotic bias difficult to quantify but appear described earlier are now presented. For these meth- to perform well in practice. ods to be justified, it is crucial for the observation To improve over the path space method without process to be ergodic for the limiting averaged like- introducing any such asymptotic bias, the FFBSm lihood function ℓT (θ)/T to have a well-defined limit and forward smoothing discussed in Sections 4.2 and ℓ(θ) as T + . 4.3 as well as the generalized two-filter smoother → ∞ have been used [6, 24, 25, 81, 82]. Experimen- 5.2.1 On-line gradient ascent An alternative to tally, the variance of the associated estimates in- gradient ascent is the following parameter update creases typically linearly with T [81] in agreement scheme at time n 0: ≥ with the theoretical results in [24, 25, 28]. However, (5.6) θ = θ + γ log p (y y ) , the computational complexity of these techniques n+1 n n+1∇ θ n| 0: n−1 |θ=θn is (N 2(T + 1)). For a fixed computational com- where the positive nonincreasing step-size sequence O 2 2 plexity of order (N (T + 1)), an informal com- γn n≥1 satisfies n γn = and n γn < [5, 64], O { } −α ∞ ∞ parison of the performance of the path space esti- for example, γn = n for 0.5 < α 1. Upon receiv- 2 P P≤ mate using N particles and the forward–backward ing yn, the parameter estimate is updated in the estimate using N particles suggest that both esti- direction of ascent of the conditional density of this mates admit a Mean Square Error (MSE) of order new observation. In other words, one recognizes in ON PARTICLE METHODS FOR PARAMETER ESTIMATION IN STATE-SPACE MODELS 11

(5.6) the update of the gradient ascent algorithm algorithm computed sequentially based on y0: n−1. (5.1), except that the partial (up to time n) like- When yn is received, we compute lihood is used. The algorithm in the present form is, however, not suitable for on-line implementation, θ n = γn+1 sn(xn−1: n) S 0: because evaluating the gradient of log p (y y ) Z θ n| 0: n−1 at the current parameter estimate requires comput- pθ0: n (xn−1,xn y0: n) dxn−1: n (5.10) · | ing the filter from time 0 to time n using the current n n parameter value θn. + (1 γ ) (1 γ ) γ − n+1 − i k+1 An algorithm bypassing this problem has been k=0 i=k+2 ! proposed in the literature for a finite state-space la- X Y tent process in [64]. It relies on the following update s (x )p (x y ) dx , · k k−1: k θ0: k k−1: k| 0: k k−1: k scheme: Z where γn n≥1 needs to satisfy n γn = and 2 { } ∞ (5.7) θn+1 = θn + γn+1 log pθ0: n (yn y0: n−1), γ < . Then the standard maximization step ∇ | n n ∞ P where log p (y y ) is defined as (5.5) is used as in the batch version ∇ θ0: n n| 0: n−1 P (5.11) θn+1 = Λ( θ0: n ). log pθ0: n (yn y0: n−1) S (5.8) ∇ | The recursive calculation of θ0: n is achieved by set- = log pθ0: n (y0: n) log pθ0: n−1 (y0: n−1), S ∇ −∇ ting Vθ0 = 0, then computing with the notation log pθ (y0: n) corresponding to ∇ 0: n V (x )= γ s (x ,x ) a “time-varying” score which is computed with a θ0: n n { n+1 n n−1 n filter using the parameter θ at time p. The update Z p (5.12) + (1 γ )V (x ) rule (5.7) can be thought of as an approximation to − n+1 θ0: n−1 n−1 } the update rule (5.6). If we use Fisher’s identity to p (x y ,x ) dx · θ0: n n−1| 0: n−1 n n−1 compute this “time-varying” score, then we have for and, finally, 1 p n, ≤ ≤

(5.13)  S^{θ_{0:n}} = ∫ V_{θ_{0:n}}(x_n) p_{θ_{0:n}}(x_n | y_{0:n}) dx_n.

(5.9)  s_p(x_{p−1:p}) = ∇_θ log f_θ(x_p | x_{p−1})|_{θ=θ_p}

+ log gθ(yp xp) θ=θp . Again, the subscript θ0: n on pθ0: n (x0: n y0: n) indi- ∇ | | cates that the posterior density is being| computed The asymptotic properties of the recursion (5.7) sequentially using the parameter θp at time p n. (i.e., the behavior of θn in the limit as n goes to infin- The filtering density then is advanced from time≤ ity) has been studied in [64] for a finite state-space n 1 to time n by using fθn (xn xn−1), gθn (yn xn) HMM. It is shown that under regularity conditions − | | and pθn (yn y0: n) in the fraction of the r.h.s. of (3.1). this algorithm converges toward a local maximum Whereas the| convergence of the EM algorithm to- of the average log-likelihood ℓ(θ), ℓ(θ) being max- ward a local maximum of the average log-likelihood imized at the “true” parameter value under iden- ℓ(θ) has been established for i.i.d. data [10], its con- tifiability assumptions. Similar results hold for the vergence for state-space models remains an open recursion (5.6). problem despite empirical evidence it does [8, 9, 24]. 5.2.2 On-line Expectation–Maximization It is also This has motivated the development of modified ver- possible to propose an on-line version of the EM sions of the on-line EM algorithm for which conver- algorithm. This was originally proposed for finite gence results are easier to establish [4, 62]. However, state-space and linear Gaussian models in [35, 42]; the on-line EM presented here usually performs em- pirically better [63]. see [9] for a detailed presentation in the finite state- space case. Assume that pθ(x0: n,y0: n) is in the 5.2.3 Discussion of particle implementations Both exponential family. In the on-line implementation the on-line gradient and EM procedures require of EM, running averages of the sufficient statistics approximating terms (5.8) and (5.10) of the form −1 θ n n are computed [8, 35]. Let θp 0≤p≤n be the (3.14), except that the expectation is now w.r.t. the sequenceS of parameter estimates of{ the} on-line EM posterior density p (x y ) which is updated θ0: n 0: n| 0: n 12 N. KANTAS ET AL. using the parameter θp at time p n. In this on-line presentation of the Particle Marginal Metropolis– framework, only the path space, fixed-lag≤ smoothing Hastings (PMMH) sampler, which is an approxima- and forward smoothing estimates are applicable; the tion of an ideal MMH sampler for sampling from fixed-lag approximation is applicable but introduces p(x ,θ y ) which would utilize the following 0: T | 0: T a nonvanishing bias. For the on-line EM algorithm, proposal density: similarly to the batch case discussed in Section 5.1.4, ′ ′ q((x0: T ,θ ) (x0: T ,θ)) the benefits of using the forward smoothing estimate (6.1) | [24] compared to the path space estimate [8] with ′ ′ = q(θ θ)pθ′ (x0: T y0: T ), N 2 particles are rather limited, as experimentally | | ′ demonstrated in Section 7.1. However, for the on- where q(θ θ) is a proposal density to obtain a can- ′ | line gradient ascent algorithm, the gradient term didate θ when we are at location θ. The acceptance probability of this sampler is log pθ0: n (yn y0: n−1) in (5.7) is a difference be- ∇ | ′ ′ tween two score-like vectors (5.8) and the behavior pθ′ (y0: T )p(θ )q(θ θ ) of its particle estimates differs significantly from its (6.2) 1 ′| . ∧ pθ(y0: T )p(θ)q(θ θ) EM counterpart. 
Indeed, the variance of the particle | Unfortunately, this ideal algorithm cannot be imple- path estimate of log pθ0: n (yn y0: n−1) increases lin- ∇ | mented, as we cannot sample exactly from p ′ (x early with n, yielding an unreliable gradient ascent θ 0: T | procedure, whereas the particle forward smooth- y0: T ) and we cannot compute the likelihood terms ing estimate has a variance uniformly bounded in pθ(y0: T ) and pθ′ (y0: T ) appearing in the acceptance time under appropriate regularity assumptions and probability. yields a stable gradient ascent procedure [26]. Hence, The PMMH sampler is an approximation of this the use of a procedure of computational complexity ideal MMH sampler which relies on the particle ap- (N 2) is clearly justified in this context. The very proximations of these unknown terms. Given θ and recentO paper [88] reports that the computationally a particle approximationp ˆθ(y0: T ) of pθ(y0: T ), we sample θ′ q(θ′ θ), then run a particle filter to ob- cheaper estimate (4.12) appears to exhibit similar ∼ | tain approximationsp ˆ ′ (dx y ) andp ˆ ′ (y ) properties whenever K 2 and might prove an at- θ 0: T | 0: T θ 0: T ≥ of pθ′ (dx0: T y0: T ) and pθ′ (y0: T ). We then sam- tractive alternative. ′ | ple X0: T pˆθ′ (dx0: T y0: T ), that is, we choose ran- domly one∼ of N particles| generated by the particle 6. BAYESIAN PARAMETER ESTIMATION i filter, with probability WT for particle i, and accept ′ ′ In the Bayesian setting, we assign a suitable (θ , X0: T ) [andp ˆθ′ (y0: T )] with probability prior density p(θ) for θ and inference is based on ′ ′ pˆθ′ (y0: T )p(θ )q(θ θ ) the joint posterior density p(x0: T ,θ y0: T ) in the (6.3) 1 | . | ∧ pˆ (y )p(θ)q(θ′ θ) off-line case or the sequence of posterior densities θ 0: T | p(x0: n,θ y0: n) n≥0 in the on-line case. The acceptance probability (6.3) is a simple approx- { | } 6.1 Off-Line Methods imation of the “ideal” acceptance probability (6.2). This algorithm was first proposed as a heuris- 6.1.1 Particle meth- tic to sample from p(θ y0: T ) in [39]. Its remark- ods Using MCMC is a standard approach to ap- able feature established| in [3] is that it does ad- proximate p(x ,θ y ). Unfortunately, designing mit p(x ,θ y ) as invariant distribution what- 0: T | 0: T 0: T 0: T efficient MCMC sampling algorithms for nonlin- ever the number| of particles N used in the particle ear non-Gaussian state-space models is a difficult approximation [3]. However, the choice of N has an task: one-variable-at-a-time typi- impact on the performance of the algorithm. Using cally mixes very poorly for such models, whereas large values of N usually results in PMMH aver- blocking strategies that have been proposed in the ages with variances lower than the corresponding av- literature are typically very model-dependent; see, erages using fewer samples, but the computational for instance, [52]. cost of constructingp ˆθ(y0: T ) increases with N. A Particle MCMC are a class of MCMC tech- simplified analysis of this algorithm suggests that N niques which rely on particle methods to build ef- should be selected such that the standard deviation ficient high-dimensional proposal distributions in a of the logarithm of the particle likelihood estimate generic manner [3]. We limit ourselves here to the should be around 0.9 if the ideal MMH sampler was ON PARTICLE METHODS FOR PARAMETER ESTIMATION IN STATE-SPACE MODELS 13

′ ′ using the perfect proposal q(θ θ)= p(θ y0: n) [79] each successive resampling step reduces the diversity and around 1.8 if one uses an isotropic| normal| ran- of the sample of θ values; after a certain time n, the dom walk proposal for a target that is a product approximationp ˆ(dθ y0: n) contains a single unique of d i.i.d. components with d [83]. For gen- value for θ. This is| clearly a poor approach. Even eral proposal and target densities,→∞ a recent theoret- in the much simpler case when there is no latent ical analysis and empirical results suggest that this variable X0: n, it is shown in [17], Theorem 4, that standard deviation should be selected around 1.2– the asymptotic variance of the corresponding parti- 1.3 [33]. As the variance of this estimate typically cle estimates diverges at least at a polynomial rate, increases linearly with T , this means that the com- which grows with the dimension of θ. putational complexity is of order (T 2) by iteration. A pragmatic approach that has proven useful in O A particle version of the Gibbs sampler is also some applications is to introduce artificial dynamics available [3] which mimicks the two-component for the parameter θ [54], Gibbs sampler sampling iteratively from p(θ x ,y ) and p (x y ). These algorithms| (6.4) θn+1 = θn + εn+1, 0: T 0: T θ 0: T | 0: T rely on a nonstandard version of the particle fil- where εn n≥0 is an artificial dynamic noise with ter where N 1 particles are generated conditional decreasing{ } variance. Standard particle methods − upon a “fixed” particle. Recent improvements over can now be applied to approximate p(x ,θ { 0: n 0: n| this particle Gibbs sampler introduce mechanisms y0: n) n≥0. A related kernel density estimation to rejuvenate the fixed particle, using forward or method} also appeared in [67], which proposes to backward sampling procedures [66, 89, 91]. These use a kernel density estimate p(θ y0: n) from which methods perform empirically extremely well, but, one samples from. As before, the| static parameter contrary to the PMMH, it is still unclear how one is transformed to a slowly time-varying one, whose should scale N with T . dynamics is related to the kernel bandwidth. To 6.2 On-Line Methods mitigate the artificial variance inflation, a shrink- age correction is introduced. An improved version In this context, we are interested in approxi- of this method has been recently proposed in [41]. mating on-line the sequence of posterior densities It is difficult to quantify how much bias is intro- p(x ,θ y ) . We emphasize that, contrary { 0: n | 0: n }n≥0 duced in the resulting estimates by the introduc- to the on-line ML parameter estimation procedures, tion of this artificial dynamics. Additionally, these none of the methods presented in this section by- methods require a significant amount of tuning, for pass the particle degeneracy problem. This should example, choosing the variance of the artificial dy- come as no surprise. As discussed in Section 3.2.2, namic noise or the kernel width. However, they can even for a fixed θ, the particle estimate of pθ(y0: n) perform satisfactorily in practice [41, 67]. has a relative variance that increases linearly with n under favorable mixing assumptions. The methods 6.2.2 Practical filtering The practical filtering ap- in this section attempt to approximate p(θ y ) proach proposed in [80] relies on the following fixed- | 0: n ∝ pθ(y0: n)p(θ). 
This is a harder problem, as it implic- lag approximation: itly requires having to approximate pθi (y0: n) for all (6.5) p(x ,θ y ) p(x ,θ y ) i 0: n−L 0: n−1 0: n−L 0: n the particles θ approximating p(θ y0: n). | ≈ | { } | for L large enough; that is, observations coming af- 6.2.1 Augmenting the state with the parameter ter n 1 presumably bring little information on At first sight, it seems that estimating the se- − x0: n−L. To sample approximately from p(θ y0: n), quence of posterior densities p(x ,θ y ) | { 0: n | 0: n }n≥0 one uses the following iterative process: at time n, can be easily achieved using standard particle meth- several MCMC chains are run in parallel to sample ods by merely introducing the extended state Zn = from (Xn,θn), with initial density p(θ0)µθ (x0) and tran- 0 i sition density f (x x )δ (θ ), that is, θ = p(xn−L+1: n,θ y0: n, X ) θn n n−1 θn−1 n n | 0: n−L θ . However, this| extended process Z clearly n−1 n = p(x ,θ y , Xi ), does not possess any forgetting property (as dis- n−L+1: n | n−L+1: n n−L i cussed in Section 3), so the algorithm is bound to where the Xn−L have been obtained at the pre- degenerate. Specifically, the parameter space is ex- vious iteration and are such that (approximately) plored only in the initial step of the algorithm. Then, Xi p(x y ) p(x y ). Then one n−L ∼ n−L| 0: n−1 ≈ n−L| 0: n 14 N. KANTAS ET AL.

i collects the first component Xn−L+1 of the simu- p˜θ(x0: n,y0: n) is in the exponential family and thus i lated sample Xn−L+1: n, increments the time index can be summarized by a set of fixed-dimensional suf- n and runs several new MCMC chains in parallel to ficient statistics s (x0: n,y0: n). This type of method sample from p(x ,θ y , Xi ) was first used to perform on-line Bayesian parame- n−L+2: n+1 | n−L+2: n+1 n−L+1 and so on. The algorithm is started at time L 1, ter estimation in a context wherep ˜θ(x0: n,y0: n) is − with MCMC chains that target p(x0: L−1 y0: L−1). in the exponential family [36, 44]. Similar strategies Like all methods based on fixed-lag approximation,| were adopted in [2] and [84]. In the particular sce- the choice of the lag L is difficult and this introduces nario where qθ(xn yn,xn−1)= pθ(xn yn,xn−1) and | | a nonvanishing bias which is difficult to quantify. qθ(yn xn−1)= pθ(yn xn−1), this method was men- | | However, the method performs well on the exam- tioned in [2, 86] and is discussed at length in [70] ples presented in [80]. who named it particle learning. Extensions of this strategy to parameter estimation in conditionally 6.2.3 Using MCMC steps within particle meth- linear Gaussian models, where a part of the state ods To avoid the introduction of an artificial dy- is integrated out using Kalman techniques [15, 31], namic model or of a fixed-lag approximation, an is proposed in [13]. approach originally proposed independently in [36] As opposed to the methods relying on kernel or and [44] consists of adding MCMC steps to re- artificial dynamics, these MCMC-based approaches introduce “diversity” among the particles. Assum- have the advantage of adding diversity to the par- ing we use an auxiliary particle filter to approximate ticles approximating p(θ y0: n) without perturbing i i | p(x0: n,θ y0: n) n≥0, then the particles X ,θ the target distribution. Unfortunately, these algo- { | } { 0: n n} obtained after the sampling step at time n are ap- rithms rely implicitly on the particle approximation proximately distributed according to of the density p(x y ) even if algorithmically 0: n| 0: n p˜(x ,θ y ) it is only necessary to store some fixed-dimensional 0: n | 0: n sufficient statistics sn(Xi ,y ) . Hence, in this { 0: n 0: n } p(x0: n−1,θ y0: n−1)qθ(xn,yn xn−1). respect they suffer from the degeneracy problem. ∝ | | We havep ˜(x ,θ y )= p(x ,θ y ) if q (x This was noticed as early as in [2]; see also the word 0: n | 0: n 0: n | 0: n θ n| of caution in the conclusion of [4, 36] and [18]. The yn,xn−1)= pθ(xn yn,xn−1) and qθ(yn xn−1)= pθ(yn x ). To add diversity| in this population| of parti-| practical implications are that one observes empir- n−1 ically that the resulting Monte Carlo estimates can cles, we introduce an MCMC kernel K (d(x′ ,θ′) n 0: n display quite a lot of variability over multiple runs as (x ,θ)) with invariant densityp ˜(x ,θ y ) and| 0: n 0: n 0: n demonstrated in Section 7.2. This should not come replace, at the end of each iteration, the set| of resam- as a surprise, as the sequence of posterior distribu- i ¯i pled particles, (X0: n, θn) with N “mutated” parti- tions does not have exponential forgetting proper- i ˜i cles (X0: n, θn) simulated from, for i = 1,...,N, ties, hence, there is an accumulation of Monte Carlo i ˜i i ¯i errors over time. e(X0: n, θn) Kn(d(x0: n,θ) (X0: n, θn)). 
If we use the SISR algorithm, then we can alternatively use an MCMC step of invariant density $p(x_{0:n},\theta \mid y_{0:n})$ after the resampling step at time $n$. Contrary to standard applications of MCMC, the kernel does not have to be ergodic. Ensuring ergodicity would indeed require one to sample an increasing number of variables as $n$ increases—this algorithm would have an increasing cost per iteration, which would prevent its use in on-line scenarios, but it can be an interesting alternative to standard MCMC and was suggested in [61]. In practice, one therefore sets $\tilde X^{i}_{0:n-L} = X^{i}_{0:n-L}$ and only samples $\tilde\theta^{i}$ and $\tilde X^{i}_{n-L+1:n}$, where $L$ is a small integer; often $L = 0$ (only $\tilde\theta^{i}$ is updated). Note that the memory requirements for this method do not increase over time if $\tilde p_\theta(x_{0:n},y_{0:n})$ is in the exponential family and thus can be summarized by a set of fixed-dimensional sufficient statistics $s_n(x_{0:n},y_{0:n})$. This type of method was first used to perform on-line Bayesian parameter estimation in a context where $\tilde p_\theta(x_{0:n},y_{0:n})$ is in the exponential family [36, 44]. Similar strategies were adopted in [2] and [84]. In the particular scenario where $q_\theta(x_n \mid y_n, x_{n-1}) = p_\theta(x_n \mid y_n, x_{n-1})$ and $q_\theta(y_n \mid x_{n-1}) = p_\theta(y_n \mid x_{n-1})$, this method was mentioned in [2, 86] and is discussed at length in [70], who named it particle learning. Extensions of this strategy to parameter estimation in conditionally linear Gaussian models, where a part of the state is integrated out using Kalman techniques [15, 31], are proposed in [13].

As opposed to the methods relying on kernel or artificial dynamics, these MCMC-based approaches have the advantage of adding diversity to the particles approximating $p(\theta \mid y_{0:n})$ without perturbing the target distribution. Unfortunately, these algorithms rely implicitly on the particle approximation of the density $p(x_{0:n} \mid y_{0:n})$ even if algorithmically it is only necessary to store some fixed-dimensional sufficient statistics $\{s_n(X^{i}_{0:n}, y_{0:n})\}$. Hence, in this respect they suffer from the degeneracy problem. This was noticed as early as in [2]; see also the word of caution in the conclusion of [4, 36] and [18]. The practical implication is that one observes empirically that the resulting Monte Carlo estimates can display quite a lot of variability over multiple runs, as demonstrated in Section 7.2. This should not come as a surprise, as the sequence of posterior distributions does not have exponential forgetting properties; hence, there is an accumulation of Monte Carlo errors over time.

6.2.4 The SMC2 algorithm

The SMC2 algorithm introduced simultaneously in [19] and [43] may be considered as the particle equivalent of particle MCMC. It mimics an "ideal" particle algorithm proposed in [16] approximating sequentially $\{p(\theta \mid y_{0:n})\}_{n\geq 0}$, where $N_\theta$ particles (in the $\theta$-space) are used to explore these distributions. The $N_\theta$ particles at time $n$ are reweighted according to $p_\theta(y_{0:n+1})/p_\theta(y_{0:n})$ at time $n+1$. As these likelihood terms are unknown, we substitute for them $\hat p_\theta(y_{0:n+1})/\hat p_\theta(y_{0:n})$, where $\hat p_\theta(y_{0:n})$ is a particle approximation of the partial likelihood $p_\theta(y_{0:n})$, obtained by running a particle filter of $N_x$ particles in the $x$-dimension, up to time $n$, for each of the $N_\theta$ $\theta$-particles. When particle degeneracy (in the $\theta$-dimension) reaches a certain threshold, $\theta$-particles are refreshed through the succession of a resampling step and an MCMC step, which in these particular settings takes the form of a PMCMC update. The cost per iteration of this algorithm is not constant and, additionally, it is advised to increase $N_x$ with $n$ for the relative variance of $\hat p_\theta(y_{0:n})$ not to increase; therefore, it cannot be used in truly on-line scenarios. Yet there are practical situations where it may be useful to approximate jointly all the posteriors $p(\theta \mid y_{1:n})$, for $1 \leq n \leq T$, for instance, to assess the predictive power of the model.
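The following fragment (again our own illustration, on the same toy model as the previous sketch) shows only the reweighting backbone of such a scheme: each $\theta$-particle carries its own bootstrap filter of $N_x$ state particles, and its weight is multiplied at each time by that filter's estimate of $p_\theta(y_n \mid y_{0:n-1})$. The ESS-triggered resampling and PMCMC rejuvenation of the $\theta$-particles, which are essential parts of SMC2, are deliberately omitted here for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(T, theta_true=0.5, seed=1):
    # Same illustrative model as before: X_n = X_{n-1} + sqrt(theta) W_n, Y_n = X_n + V_n.
    r = np.random.default_rng(seed)
    x, y = 0.0, np.zeros(T + 1)
    for n in range(T + 1):
        if n > 0:
            x = x + np.sqrt(theta_true) * r.normal()
        y[n] = x + r.normal()
    return y

def smc2_reweighting(y, N_theta=200, N_x=100, a0=1.0, b0=1.0):
    theta = b0 / rng.gamma(a0, size=N_theta)       # theta-particles from the prior
    logW = np.zeros(N_theta)                       # their log-weights
    xs = rng.normal(size=(N_theta, N_x))           # one inner x-filter per theta-particle
    for n in range(1, len(y)):
        xs = xs + np.sqrt(theta)[:, None] * rng.normal(size=(N_theta, N_x))
        logg = -0.5 * ((y[n] - xs) ** 2 + np.log(2 * np.pi))   # log N(y_n; x_n, 1)
        m = logg.max(axis=1, keepdims=True)
        # hat p_theta(y_n | y_{0:n-1}) = mean of the x-weights; fold it into the theta-weight
        logW += m[:, 0] + np.log(np.exp(logg - m).mean(axis=1))
        w_x = np.exp(logg - m); w_x /= w_x.sum(axis=1, keepdims=True)
        for i in range(N_theta):                   # resample each inner filter
            xs[i] = xs[i, rng.choice(N_x, size=N_x, p=w_x[i])]
        # A full SMC^2 implementation would monitor the ESS of the theta-weights here and,
        # when it is too low, resample the theta-particles and rejuvenate them by a PMCMC move.
    W = np.exp(logW - logW.max()); W /= W.sum()
    return theta, W

theta, W = smc2_reweighting(simulate(100))
print("weighted posterior mean of theta:", np.sum(W * theta))
print("ESS of the theta-particles:", 1.0 / np.sum(W ** 2))
```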
7. EXPERIMENTAL RESULTS

We focus on illustrating numerically a few algorithms and the impact of the degeneracy problem on parameter inference. This last point is motivated by the fact that particle degeneracy seems to have been overlooked by many practitioners. In this way numerical results may provide valuable insights.

We will consider the following simple scalar linear Gaussian state-space model:

(7.1)  $X_n = \rho X_{n-1} + \tau W_n, \qquad Y_n = X_n + \sigma V_n$,

where $V_n, W_n$ are independent zero-mean and unit-variance Gaussians and $\rho \in [-1, 1]$. The main reason for choosing this model is that Kalman recursions can be implemented to provide the exact values of the summary statistics $S_n^\theta$ used for ML estimation through the EM algorithm and to compute the exact likelihood $p_\theta(y_{0:n})$. Hence, using a fine discretization of the low-dimensional parameter space, we can compute a very good approximation of the true posterior density $p(\theta \mid y_{0:n})$. In this model it is straightforward to present numerical evidence of some effects of degeneracy for parameter estimation and to show how it can be overcome by choosing an appropriate particle method.

7.1 Maximum Likelihood Methods

As ML methods require approximating smoothed additive functionals $S_n^\theta$ of the form (3.14), we begin by investigating the empirical bias, variance and MSE of two standard particle estimates of $S_n^\theta$, where we set $s_k(x_{k-1},x_k) = x_{k-1}x_k$ for the model described in (7.1). The first estimate relies on the path-space method with computational cost $O(N)$ per time, which uses $\hat p_\theta(dx_{0:n} \mid y_{0:n})$ in (3.7) to approximate $S_n^\theta$ as $\hat S_n^\theta$; see [11], Section 8.3, for more details. The second estimate relies on the forward implementation of FFBSm presented in Section 4.3 using (4.7)–(4.11); see [24]. Recall that this procedure has a computational cost that is $O(N^2)$ per time for $N$ particles and provides the same estimates as the standard forward–backward implementation of FFBSm. For the sake of brevity, we will not consider the remaining smoothing methods of Section 4; for the fixed-lag and the exponentially weighted approximations we refer the reader to [74] and [73], respectively, for numerical experiments.

We use a simulated data set of size $6 \times 10^4$ obtained using $\theta^* = (\rho^*, \tau^{2*}, \sigma^{2*}) = (0.8, 0.1, 1)$ and then generate 300 independent replications of each method in order to compute the empirical bias and variance of $\hat S_n^{\theta^*}$ when $\theta$ is fixed to $\theta^*$. In order to make a comparison that takes into account the computational cost, we use $N^2$ particles for the $O(N)$ method and $N$ for the $O(N^2)$ one. We look separately at the behavior of the bias of $\hat S_n^\theta$ and the variance and MSE of the rescaled estimates $\hat S_n^\theta/\sqrt{n}$. The results are presented in Figure 1 for $N = 50$, $100$, $200$.

For both methods the bias grows linearly with time, this growth being higher for the $O(N^2)$ method. For the variance of $\hat S_n^\theta/\sqrt{n}$, we observe a linear growth with time for the $O(N)$ method with $N^2$ particles, whereas this variance appears roughly constant for the $O(N^2)$ method. Finally, the MSE of $\hat S_n^\theta/\sqrt{n}$ grows linearly with time for both methods, as expected. In this particular scenario, the constants of proportionality are such that the MSE is lower for the $O(N)$ method than for the $O(N^2)$ method. In general, we can expect the $O(N)$ method to be superior in terms of the bias and the $O(N^2)$ method to be superior in terms of the variance. These results are in agreement with the theoretical results in the literature [24, 25, 28], but additionally show that the lower bound on the variance growth of $\hat S_n^\theta$ for the $O(N)$ method of [81] appears sharp.

We proceed to see how the bias and variance of the estimates of $S_n^\theta$ affect the ML estimates when the former are used within both an off-line and an on-line EM algorithm; see Figures 2 and 3, respectively. For the model in (7.1) the E-step corresponds to computing $S_n^\theta$ where $s_k(x_{k-1},x_k) = (-(y_k - x_k)^2,\; x_{k-1}^2,\; x_{k-1}x_k,\; x_k^2)$, and the M-step update function is given by

$\Lambda(z_1, z_2, z_3, z_4) = \left(\dfrac{z_3}{z_2},\; z_4 - \dfrac{z_3^2}{z_2},\; -z_1\right)$.
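As a concrete illustration of how the path-space E-step and the map $\Lambda$ fit together, the sketch below (ours, not the code used for the experiments) runs a bootstrap filter for model (7.1), accumulates the four additive statistics along the surviving particle ancestries (the $O(N)$ path-space estimate) and then applies $\Lambda$ to the time-averaged statistics to obtain one EM update of $(\rho, \tau^2)$, with $\sigma^2$ held fixed. The initial law of $X_0$ and the starting parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(T, rho=0.8, tau=1.0, sigma=0.2, seed=1):
    # Data from model (7.1): X_n = rho X_{n-1} + tau W_n,  Y_n = X_n + sigma V_n.
    r = np.random.default_rng(seed)
    x, y = r.normal(), np.zeros(T + 1)
    y[0] = x + sigma * r.normal()
    for n in range(1, T + 1):
        x = rho * x + tau * r.normal()
        y[n] = x + sigma * r.normal()
    return y

def Lambda(z1, z2, z3, z4):
    # M-step map applied to the *time-averaged* statistics: returns (rho, tau^2, sigma^2).
    return z3 / z2, z4 - z3 ** 2 / z2, -z1

def path_space_e_step(y, rho, tau2, sig2, N=2000):
    # O(N) path-space estimate of the four averaged statistics for model (7.1).
    T = len(y) - 1
    x = rng.normal(size=N)                    # X_0 particles (assumed N(0, 1) initial law)
    S = np.zeros((N, 4))                      # per-particle running sums of s_k
    for n in range(1, T + 1):
        x_new = rho * x + np.sqrt(tau2) * rng.normal(size=N)
        logw = -0.5 * (y[n] - x_new) ** 2 / sig2
        w = np.exp(logw - logw.max()); w /= w.sum()
        idx = rng.choice(N, size=N, p=w)      # multinomial resampling
        s_k = np.stack([-(y[n] - x_new[idx]) ** 2, x[idx] ** 2,
                        x[idx] * x_new[idx], x_new[idx] ** 2], axis=1)
        S = S[idx] + s_k                      # statistics inherited along surviving paths
        x = x_new[idx]
    return S.mean(axis=0) / T                 # particle estimate of the averaged statistics

y = simulate(1000)
rho, tau2, sig2 = 0.1, 0.1, 0.2 ** 2          # illustrative starting point; sigma^2 kept fixed
for it in range(5):
    z = path_space_e_step(y, rho, tau2, sig2)
    rho, tau2, _ = Lambda(*z)                 # one EM update of (rho, tau^2)
    print(it, round(rho, 3), round(tau2, 3))
```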

Fig. 1. Estimating smoothed additive functionals: empirical bias of the estimate of $S_n^\theta$ (top panel), empirical variance (middle panel) and MSE (bottom panel) for the estimate of $S_n^\theta/\sqrt{n}$. Left column: $O(N)$ method using $N = 2500$, $10{,}000$, $40{,}000$ particles. Right column: $O(N^2)$ method using $N = 50$, $100$, $200$ particles. In every subplot, the top line corresponds to using $N = 50$, the middle to $N = 100$ and the lower to $N = 200$.

We compare the estimates of $\theta^*$ when the E-step is computed using the $O(N)$ and the $O(N^2)$ methods described in the previous section with $150^2$ and $150$ particles, respectively. A simulated data set for $\theta^* = (\rho^*, \tau^*, \sigma^*) = (0.8, 1, 0.2)$ will be used. In every case we will initialize the algorithm using $\theta_0 = (0.1, 0.1, 0.2)$ and assume $\sigma^*$ is known. In Figures 2 and 3 we present the results obtained using 150 independent replications of the algorithm. For the off-line EM, we use 25 iterations for $T = 100$, $1000$, $2500$, $5000$, $10{,}000$. For the on-line EM, we use $T = 10^5$ with the step size set as $\gamma_n = n^{-0.8}$, and for the first 50 iterations no M-step update is performed. This "freezing" phase is required to allow for a reasonable estimation of the summary statistic; see [8, 9] for more details. Note that in Figure 3 we plot only the results after the algorithm has converged, that is, for $n \geq 5 \times 10^4$.

In each case, both the $O(N)$ and the $O(N^2)$ methods yield fairly accurate results given the low number of particles used. However, we note, as observed previously in the literature, that the on-line EM as well as the on-line gradient ascent method requires a substantial number of observations, that is, over 10,000, before achieving convergence [8, 9, 24, 81]. For smaller data sets, these algorithms can also be used by going through the data, say, $K$ times. Typically, this is cheaper than iterating (5.1) or (5.4)–(5.5) $K$ times for the off-line algorithms and can yield comparable parameter estimates [94]. Experimentally, the properties of the estimates of $S_n^\theta$ discussed earlier appear to translate into properties of the resulting parameter estimates: the $O(N)$ method provides estimates with less bias but more variance than the $O(N^2)$ method.
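The step-size and "freezing" logic described above can be summarized as follows. This is a schematic path-space variant written for illustration only (the on-line EM algorithms of [8, 9, 24] smooth the statistics more carefully), with the same assumed conventions as the previous sketch; here each particle carries a running average of its own statistics, updated by stochastic approximation.

```python
import numpy as np

rng = np.random.default_rng(0)

def Lambda(z1, z2, z3, z4):
    # M-step map on time-averaged statistics, as in Section 7.1.
    return z3 / z2, z4 - z3 ** 2 / z2, -z1

def online_em(y, rho=0.1, tau2=0.1, sig2=0.04, N=2000, n_freeze=50):
    x = rng.normal(size=N)
    Sbar = np.zeros((N, 4))                    # per-particle running *averaged* statistics
    for n in range(1, len(y)):
        gamma = n ** (-0.8)                     # step size used in the experiments
        x_new = rho * x + np.sqrt(tau2) * rng.normal(size=N)
        logw = -0.5 * (y[n] - x_new) ** 2 / sig2
        w = np.exp(logw - logw.max()); w /= w.sum()
        idx = rng.choice(N, size=N, p=w)
        s = np.stack([-(y[n] - x_new[idx]) ** 2, x[idx] ** 2,
                      x[idx] * x_new[idx], x_new[idx] ** 2], axis=1)
        Sbar = (1.0 - gamma) * Sbar[idx] + gamma * s    # stochastic-approximation update
        x = x_new[idx]
        if n > n_freeze:                        # no M-step during the "freezing" phase
            rho, tau2, _ = Lambda(*Sbar.mean(axis=0))   # sigma^2 kept fixed (assumed known)
    return rho, tau2

# Illustrative data from model (7.1); a long record is needed before convergence.
T, x_true, y = 5000, 0.0, np.zeros(5001)
for n in range(1, T + 1):
    x_true = 0.8 * x_true + rng.normal()
    y[n] = x_true + 0.2 * rng.normal()
print(online_em(y))
```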

Fig. 2. Off-line EM: boxplots of $\hat\theta_n$ for various $T$ using 25 iterations of off-line EM and 150 realizations of the algorithms. Top panels: $O(N)$ method using $N = 150^2$ particles. Bottom panels: $O(N^2)$ method with $N = 150$. The dotted horizontal lines are the ML estimate for each time $T$ obtained using Kalman filtering on a grid.
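The reference lines in Figures 2 and 3, and the exact posterior benchmarks used later in Section 7.2, exploit the fact that the likelihood of model (7.1) is available in closed form through the Kalman filter. A minimal version of that benchmark computation is sketched below; the initial law of $X_0$ and the grid range are assumptions of our own choosing.

```python
import numpy as np

def kalman_loglik(y, rho, tau2, sig2, m0=0.0, P0=1.0):
    # Exact log-likelihood log p_theta(y_{0:T}) for model (7.1), assuming X_0 ~ N(m0, P0).
    m, P, ll = m0, P0, 0.0
    for n, yn in enumerate(y):
        if n > 0:                              # prediction step
            m, P = rho * m, rho ** 2 * P + tau2
        S = P + sig2                           # innovation variance
        ll += -0.5 * (np.log(2 * np.pi * S) + (yn - m) ** 2 / S)
        K = P / S                              # Kalman gain and update step
        m, P = m + K * (yn - m), (1.0 - K) * P
    return ll

# Grid benchmark: exact ML estimate of rho with (tau, sigma) fixed at their true values.
rng = np.random.default_rng(1)
T, rho_true, tau, sigma = 2000, 0.8, 1.0, 0.2
x, y = rng.normal(), np.zeros(T + 1)
y[0] = x + sigma * rng.normal()
for n in range(1, T + 1):
    x = rho_true * x + tau * rng.normal()
    y[n] = x + sigma * rng.normal()

grid = np.linspace(-0.99, 0.99, 397)
ll = np.array([kalman_loglik(y, r, tau ** 2, sigma ** 2) for r in grid])
print("grid ML estimate of rho:", grid[np.argmax(ll)])
# A grid approximation of p(rho | y_{0:T}) follows by adding the log-prior and normalizing.
```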

For more numerical examples regarding the remaining methods discussed in Section 5, we refer the reader to [50, 51] for iterated filtering, to [24, 25, 81] for comparisons of the $O(N)$ and $O(N^2)$ methods for EM and gradient ascent, to [8] for the $O(N)$ on-line EM, to [72] and [59], Chapter 10, for smooth likelihood function methods, and to [11], Chapters 10–11, for a detailed exposition of off-line EM methods.

Fig. 3. On-line EM: boxplots of $\hat\theta_n$ for $n \geq 5 \times 10^4$ using 150 realizations of the algorithms. We also plot the ML estimate at time $n$ obtained using Kalman filtering on a grid (black).

7.2 Bayesian Methods

We still consider the model in (7.1), but simplify it further by fixing either $\rho$ or $\tau$. This is done in order to keep the computations of the benchmarks that use Kalman computations on a grid relatively inexpensive. For those parameters that are not fixed, we shall use the following independent priors: a uniform on $[-1, 1]$ for $\rho$, and inverse gamma for $\tau^2$, $\sigma^2$ with the shape and scale parameter pairs being $(a, b)$ and $(c, d)$, respectively, with $a = b = c = d = 1$. In all the subsequent examples, we will initialize the algorithms by sampling $\theta$ from the prior.

We proceed to examine the particle algorithms with MCMC moves that we described in Section 6.2.3. We focus on an efficient implementation of this idea discussed in [70] which can be put in practice for the simple model under consideration. We investigate the effect of the degeneracy problem in this context. The numerical results obtained in this section have been produced in Matlab (code available from the first author) and double-checked using the R program available on the personal web page of the first author of [70, 71].

We first focus on the estimate of the posterior of $\theta = (\tau^2, \sigma^2)$ given a long sequence of simulated observations with $\tau = \sigma = 1$. In this scenario, $p_\theta(x_{0:n}, y_{0:n})$ admits the following two-dimensional sufficient statistics, $s_n(x_{0:n}, y_{0:n}) = \left(\sum_{k=1}^{n}(x_k - x_{k-1})^2,\; \sum_{k=0}^{n}(y_k - x_k)^2\right)$, and $\theta$ can be updated using Gibbs steps. We use $T = 5 \times 10^4$ and $N = 5000$. We ran the algorithm over 100 independent runs over the same data set. We present the results only for $\tau^2$ and omit the ones for $\sigma^2$, as these were very similar.
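With the inverse-gamma priors above, the Gibbs refresh of each particle's parameter reduces to two standard conjugate draws from the stored statistics. The sketch below takes the transition statistic at face value (the residual is $x_k - x_{k-1}$, as in the expression for $s_n$ above) and uses the shape/scale parameterization of the inverse gamma; it is an illustration of the update written by us, not the implementation used for the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_refresh(s1, s2, n, a=1.0, b=1.0, c=1.0, d=1.0):
    # One conjugate Gibbs draw of (tau2, sigma2) given a particle's sufficient statistics
    #   s1 = sum_{k=1}^{n} (x_k - x_{k-1})^2   and   s2 = sum_{k=0}^{n} (y_k - x_k)^2,
    # under the priors tau2 ~ IG(a, b) and sigma2 ~ IG(c, d):
    #   tau2   | x_{0:n}           ~ IG(a + n/2,       b + s1/2)
    #   sigma2 | x_{0:n}, y_{0:n}  ~ IG(c + (n+1)/2,   d + s2/2)
    tau2 = (b + 0.5 * s1) / rng.gamma(a + 0.5 * n, size=np.shape(s1))
    sigma2 = (d + 0.5 * s2) / rng.gamma(c + 0.5 * (n + 1), size=np.shape(s2))
    return tau2, sigma2

# In the filter, each of the N particles carries its own (s1, s2), accumulated recursively;
# here we refresh all parameters at once from placeholder statistics.
N = 5000
s1, s2 = rng.gamma(50.0, size=N), rng.gamma(50.0, size=N)
tau2, sigma2 = gibbs_refresh(s1, s2, n=100)
print(tau2.mean(), sigma2.mean())
```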
The top left panel of Figure 4 shows the box plots for the estimates of the posterior mean, and the top right panel shows how the corresponding relative variance of the estimator for the posterior mean evolves with time. Here the relative variance is defined as the ratio of the empirical variance (over different independent runs) of the posterior mean estimates at time $n$ over the true posterior variance at time $n$, which in this case is approximated using a Kalman filter on a fine grid. This quantity exhibits a steep increasing trend when $n \geq 15{,}000$ and confirms the aforementioned variability of the estimates of the posterior mean. In the bottom left panel of Figure 4 we plot the average (over different runs) of the estimators of the variance of $p(\tau^2 \mid y_{0:n})$. This average variance is also scaled/normalized by the actual posterior variance. The latter is again computed using Kalman filtering on a grid. This ratio between the average estimated variance of the posterior and the true one decreases with time $n$, and it shows that the supports of the approximate posterior densities provided by this method cover, on average, only a small portion of the support of the true posterior. These experiments confirm that in this example the particle method with MCMC steps fails to adequately explore the space of $\theta$. Although the box plots provide some false sense of security, the relative and scaled average variances clearly indicate that any posterior estimates obtained from a single run of the particle method with MCMC steps should be used with caution.

Furthermore, in the bottom right panel of Figure 4 we also investigate experimentally the empirical relative variance of the marginal likelihood estimates $\{\hat p(y_{0:n})\}_{n \geq 0}$. This relative variance appears to increase quadratically with $n$ for the particle method with MCMC moves instead of linearly, as it does for state-space models with good mixing properties. This suggests that one should increase the number of particles quadratically with the time index to obtain an estimate of the marginal likelihood whose relative variance remains uniformly bounded with respect to the time index. Although we attribute this quadratic relative variance growth to the degeneracy problem, the estimate $\hat p(y_{0:n})$ is not the particle approximation of a smoothed additive functional, thus there is not yet any theoretical convergence result explaining rigorously this phenomenon.
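The marginal-likelihood estimates examined in the bottom right panel of Figure 4 are the usual particle-filter by-product, namely the running product of the averaged incremental weights. The fragment below illustrates how $\log \hat p$ is accumulated and how its spread over independent runs can be summarized; for simplicity it runs the filter at a fixed $\theta$ (and starts from $y_1$), whereas in the experiment above $\hat p(y_{0:n})$ is produced by the parameter-learning filter itself.

```python
import numpy as np

def log_evidence_path(y, rho=0.8, tau2=1.0, sig2=1.0, N=1000, seed=0):
    # Running log \hat p(y_{1:n}) from one bootstrap filter run with theta held fixed.
    rng = np.random.default_rng(seed)
    x = rng.normal(size=N)
    logZ, path = 0.0, []
    for n in range(1, len(y)):
        x = rho * x + np.sqrt(tau2) * rng.normal(size=N)
        logw = -0.5 * ((y[n] - x) ** 2 / sig2 + np.log(2 * np.pi * sig2))
        m = logw.max()
        logZ += m + np.log(np.mean(np.exp(logw - m)))   # fold in \hat p(y_n | y_{1:n-1})
        w = np.exp(logw - m); w /= w.sum()
        x = x[rng.choice(N, size=N, p=w)]
        path.append(logZ)
    return np.array(path)

rng = np.random.default_rng(1)
T, x_true, y = 500, 0.0, np.zeros(501)
for n in range(1, T + 1):
    x_true = 0.8 * x_true + rng.normal()          # tau = sigma = 1, as in the first example
    y[n] = x_true + rng.normal()

runs = np.array([log_evidence_path(y, seed=s) for s in range(20)])
# Spread of the evidence estimates across independent runs, summarized on the log scale:
print("std of log-evidence at n = T over 20 runs:", runs.std(axis=0)[-1])
```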

Fig. 4. Top left: box plots for estimates of the posterior mean of $\tau^2$ at $n = 1000, 2000, \ldots, 50{,}000$. Top right: relative variance, that is, empirical variance (over independent runs) for the estimator of the mean of $p(\tau^2 \mid y_{0:n})$ using the particle method with MCMC steps, normalized by the true posterior variance computed using Kalman filtering on a grid. Bottom left: average (over independent runs) of the estimated variance of $p(\tau^2 \mid y_{0:n})$ using the particle method with MCMC steps, normalized by the true posterior variance. Bottom right: relative variance of $\{\hat p(y_{0:n})\}_{n \geq 0}$. All plots are computed using $N = 5000$ and over 100 independent runs.

One might argue that these particle methods with MCMC moves are meant to be used with larger $N$ and/or shorter data sets $T$. We shall consider this time a slightly different example where $\tau = 0.1$ is known and we are interested in estimating the posterior of $\theta = (\rho, \sigma^2)$ given a sequence of observations obtained using $\rho = 0.5$ and $\sigma = 1$. In that case, the sufficient statistics are $s_n(x_{0:n}, y_{0:n}) = \left(\sum_{k=1}^{n} x_{k-1}x_k,\; \sum_{k=0}^{n-1} x_k^2,\; \sum_{k=0}^{n}(y_k - x_k)^2\right)$, and the parameters can be rejuvenated through a single Gibbs update. In addition, we let $T = 5000$ and use $N = 10^4$ particles. In Figure 5 we display the estimated marginal posteriors $p(\rho \mid y_{0:n})$ and $p(\sigma^2 \mid y_{0:n})$ obtained from 50 independent replications of the particle method. On this simple problem, the estimated posteriors seem consistently rather inaccurate for $\rho$, whereas they perform better for $\sigma^2$ but with some nonnegligible variability over runs, which increases as $T$ increases. Similar observations have been reported in [18] and remain unexplained: for some parameters this methodology appears to provide reasonable results despite the degeneracy problem and for others it provides very unreliable results.

We investigate further the performance of this method in this simple example by considering the same example for $T = 1000$, but now with two larger numbers of particles, $N = 7.5 \times 10^4$ and $N = 6 \times 10^5$, over 50 different runs. Additionally, we compare the resulting estimates with estimates provided by the particle Gibbs sampler of [66] using the same computational cost, that is, $N = 50$ particles with 3000 and 24,000 iterations, respectively. The results are displayed in Figures 6 and 7. As expected, we improve the performance of the particle method with MCMC moves when $N$ increases for a fixed time horizon $T$. For a fixed computational complexity, the particle Gibbs sampler estimates appear to display less variability. For a higher-dimensional parameter $\theta$ and/or very vague priors, this comparison would be more favorable to the particle Gibbs sampler, as illustrated in [3], pages 336–338.

8. CONCLUSION

Fig. 5. Particle method with MCMC steps, $\theta = (\rho, \sigma^2)$; estimated marginal posterior densities for $n = 10^3, 2 \times 10^3, \ldots, 5 \times 10^3$ over 50 runs (red) versus ground truth (blue).

Most particle methods proposed originally in the literature to perform inference about static parameters in general state-space models were computationally inefficient as they suffered from the degeneracy problem. Several approaches have been proposed to deal with this problem by either adding an artificial dynamic on the static parameter [40, 54, 67] or introducing a fixed-lag approximation [56, 74, 80]. These methods can work very well in practice, but it remains unfortunately difficult/impossible to quantify the bias introduced in most realistic applications. Various asymptotically bias-free methods with good statistical properties and a reasonable computational cost have recently appeared in the literature.

To perform batch ML estimation, the forward filter backward sampler/smoother and generalized two-filter procedures are recommended whenever the $O(N^2 T)$ computational complexity per iteration of their direct implementations can be lowered to $O(NT)$ using, for example, the methods described in [7, 28, 38, 57].

Fig. 6. Estimated marginal posterior densities for $\theta = (\rho, \sigma^2)$ with $T = 10^3$ over 50 runs (black dashed) versus ground truth (green). Top: particle method with MCMC steps, $N = 7.5 \times 10^4$. Bottom: particle Gibbs with 3000 iterations and $N = 50$.

Fig. 7. Estimated marginal posterior densities for $\theta = (\rho, \sigma^2)$ with $T = 10^3$ over 50 runs (black dashed) versus ground truth (green). Top: particle method with MCMC steps, $N = 6 \times 10^5$. Bottom: particle Gibbs with 24,000 iterations and $N = 50$.

Otherwise, besides a lowering of memory requirements, not much can be gained from these techniques compared to simply using a standard particle filter with $N^2$ particles.

In an on-line ML context, the situation is markedly different. Whereas for the on-line EM algorithm the forward smoothing approach in [24, 81], of complexity $O(N^2)$ per time step, will similarly be of limited interest compared to a standard particle filter using $N^2$ particles, it is crucial to use this approach when performing on-line gradient ascent, as demonstrated empirically and established theoretically in [26]. In on-line scenarios where one can admit a random computational complexity at each time step, the method presented in [75] is an interesting alternative when it is applicable. Empirically, these on-line ML methods converge rather slowly and will be primarily useful for large data sets.

In a Bayesian framework, batch inference can be conducted using particle MCMC methods [3, 66]. However, these methods are computationally expensive as, for example, an efficient implementation of the PMMH has a computational complexity of order $O(T^2)$ per iteration [33]. On-line Bayesian inference remains a challenging open problem as all methods currently available, including particle methods with MCMC moves [13, 36, 84], suffer from the degeneracy problem. These methods should not be ruled out, but should be used cautiously, as they can provide unreliable results even in simple scenarios, as demonstrated in our experiments.

Very recent papers in this dynamic research area have proposed to combine individual parameter estimation techniques so as to design more efficient inference algorithms. For example, [21] suggests to use the score estimation techniques developed for ML parameter estimation to design better proposal distributions for the PMMH algorithm, whereas [37] demonstrates that particle methods with MCMC moves might be fruitfully used in batch scenarios when plugged into a particle MCMC scheme.

ACKNOWLEDGMENTS

N. Kantas supported in part by the Engineering and Physical Sciences Research Council (EPSRC) under Grant EP/J01365X/1 and programme grant on Control For Energy and Sustainability (EP/G066477/1). S. S. Singh was supported by the EPSRC (grant number EP/G037590/1). A. Doucet's research funded in part by EPSRC (EP/K000276/1 and EP/K009850/1). N. Chopin's research funded in part by the ANR as part of the "Investissements d'Avenir" program (ANR-11-LABEX-0047).

REFERENCES

[1] Alspach, D. and Sorenson, H. (1972). Nonlinear Bayesian estimation using Gaussian sum approximations. IEEE Trans. Automat. Control 17 439–448.
[2] Andrieu, C., De Freitas, J. F. G. and Doucet, A. (1999). Sequential MCMC for Bayesian model selection. In Proc. IEEE Workshop Higher Order Statistics 130–134. IEEE, New York.
[3] Andrieu, C., Doucet, A. and Holenstein, R. (2010). Particle Markov chain Monte Carlo methods. J. R. Stat. Soc. Ser. B. Stat. Methodol. 72 269–342. MR2758115
[4] Andrieu, C., Doucet, A. and Tadić, V. B. (2005). On-line parameter estimation in general state-space models. In Proc. 44th IEEE Conf. on Decision and Control 332–337. IEEE, New York.
[5] Benveniste, A., Métivier, M. and Priouret, P. (1990). Adaptive Algorithms and Stochastic Approximations. Applications of Mathematics (New York) 22. Springer, Berlin. MR1082341
[6] Briers, M., Doucet, A. and Maskell, S. (2010). Smoothing algorithms for state-space models. Ann. Inst. Statist. Math. 62 61–89. MR2577439
[7] Briers, M., Doucet, A. and Singh, S. S. (2005). Sequential auxiliary particle belief propagation. In Proc. Conf. Fusion. Philadelphia, PA.
[8] Cappé, O. (2009). Online sequential Monte Carlo EM algorithm. In Proc. 15th IEEE Workshop on Statistical Signal Processing 37–40. IEEE, New York.
[9] Cappé, O. (2011). Online EM algorithm for hidden Markov models. J. Comput. Graph. Statist. 20 728–749. MR2878999
[10] Cappé, O. and Moulines, E. (2009). On-line expectation–maximization algorithm for latent data models. J. R. Stat. Soc. Ser. B. Stat. Methodol. 71 593–613. MR2749909
[11] Cappé, O., Moulines, E. and Rydén, T. (2005). Inference in Hidden Markov Models. Springer, New York. MR2159833
[12] Carpenter, J., Clifford, P. and Fearnhead, P. (1999). An improved particle filter for non-linear problems. IEE Proceedings—Radar, Sonar and Navigation 146 2–7.
[13] Carvalho, C. M., Johannes, M. S., Lopes, H. F. and Polson, N. G. (2010). Particle learning and smoothing. Statist. Sci. 25 88–106. MR2741816
[14] Cérou, F., Del Moral, P. and Guyader, A. (2011). A nonasymptotic theorem for unnormalized Feynman–Kac particle models. Ann. Inst. Henri Poincaré Probab. Stat. 47 629–649. MR2841068
[15] Chen, R. and Liu, J. S. (2000). Mixture Kalman filters. J. R. Stat. Soc. Ser. B. Stat. Methodol. 62 493–508. MR1772411

[16] Chopin, N. (2002). A sequential particle filter method for static models. Biometrika 89 539–551. MR1929161
[17] Chopin, N. (2004). Central limit theorem for sequential Monte Carlo methods and its application to Bayesian inference. Ann. Statist. 32 2385–2411. MR2153989
[18] Chopin, N., Iacobucci, A., Marin, J. M., Mengersen, K., Robert, C. P., Ryder, R. and Schäfer, C. (2011). On particle learning. In Bayesian Statistics 9 (J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith and M. West, eds.) 317–360. Oxford Univ. Press, Oxford. MR3204011
[19] Chopin, N., Jacob, P. E. and Papaspiliopoulos, O. (2013). SMC2: An efficient algorithm for sequential analysis of state space models. J. R. Stat. Soc. Ser. B. Stat. Methodol. 75 397–426. MR3065473
[20] Coquelin, P. A., Deguest, R. and Munos, R. (2009). Sensitivity analysis in HMMs with application to likelihood maximization. In Proc. 22nd Conf. NIPS. Vancouver.
[21] Dahlin, J., Lindsten, F. and Schön, T. B. (2015). Particle Metropolis–Hastings using gradient and Hessian information. Stat. Comput. 25 81–92. MR3304908
[22] DeJong, D. N., Liesenfeld, R., Moura, G. V., Richard, J.-F. and Dharmarajan, H. (2013). Efficient likelihood evaluation of state-space representations. Rev. Econ. Stud. 80 538–567. MR3054070
[23] Del Moral, P. (2004). Feynman–Kac Formulae: Genealogical and Interacting Particle Systems with Applications. Springer, New York. MR2044973
[24] Del Moral, P., Doucet, A. and Singh, S. S. (2009). Forward smoothing using sequential Monte Carlo. Technical Report 638, CUED-F-INFENG, Cambridge Univ. Preprint. Available at arXiv:1012.5390.
[25] Del Moral, P., Doucet, A. and Singh, S. S. (2010). A backward particle interpretation of Feynman–Kac formulae. ESAIM Math. Model. Numer. Anal. 44 947–975. MR2731399
[26] Del Moral, P., Doucet, A. and Singh, S. S. (2015). Uniform stability of a particle approximation of the optimal filter derivative. SIAM J. Control Optim. 53 1278–1304. MR3348115
[27] Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B 39 1–38. MR0501537
[28] Douc, R., Garivier, A., Moulines, E. and Olsson, J. (2011). Sequential Monte Carlo smoothing for general state space hidden Markov models. Ann. Appl. Probab. 21 2109–2145. MR2895411
[29] Douc, R., Moulines, E. and Ritov, Y. (2009). Forgetting of the initial condition for the filter in general state-space hidden Markov chain: A coupling approach. Electron. J. Probab. 14 27–49. MR2471658
[30] Doucet, A., De Freitas, J. F. G. and Gordon, N. J., eds. (2001). Sequential Monte Carlo Methods in Practice. Springer, New York. MR1847783
[31] Doucet, A., Godsill, S. J. and Andrieu, C. (2000). On sequential Monte Carlo sampling methods for Bayesian filtering. Stat. Comput. 10 197–208.
[32] Doucet, A. and Johansen, A. M. (2011). A tutorial on particle filtering and smoothing: Fifteen years later. In The Oxford Handbook of Nonlinear Filtering 656–704. Oxford Univ. Press, Oxford. MR2884612
[33] Doucet, A., Pitt, M. K., Deligiannidis, G. and Kohn, R. (2015). Efficient implementation of Markov chain Monte Carlo when using an unbiased likelihood estimator. Biometrika 102 295–313.
[34] Elliott, R. J., Aggoun, L. and Moore, J. B. (1995). Hidden Markov Models: Estimation and Control. Applications of Mathematics (New York) 29. Springer, New York. MR1323178
[35] Elliott, R. J., Ford, J. J. and Moore, J. B. (2000). On-line consistent estimation of hidden Markov models. Technical report, Dept. Systems Engineering, Australian National Univ., Canberra.
[36] Fearnhead, P. (2002). Markov chain Monte Carlo, sufficient statistics, and particle filters. J. Comput. Graph. Statist. 11 848–862. MR1951601
[37] Fearnhead, P. and Meligkotsidou, L. (2014). Augmentation schemes for particle MCMC. Preprint. Available at arXiv:1408.6980.
[38] Fearnhead, P., Wyncoll, D. and Tawn, J. (2010). A sequential smoothing algorithm with linear computational cost. Biometrika 97 447–464. MR2650750
[39] Fernández-Villaverde, J. and Rubio-Ramírez, J. F. (2007). Estimating macroeconomic models: A likelihood approach. Rev. Econ. Stud. 74 1059–1087. MR2353620
[40] Flury, T. and Shephard, N. (2009). Learning and filtering via simulation: Smoothly jittered particle filters. Series Working Papers 469.
[41] Flury, T. and Shephard, N. (2011). Bayesian inference based only on simulated likelihood: Particle filter analysis of dynamic economic models. Econometric Theory 27 933–956. MR2843833
[42] Ford, J. J. (1998). Adaptive hidden Markov model estimation and applications. Ph.D. thesis, Dept. Systems Engineering, Australian National Univ., Canberra. Available at http://infoeng.rsise.anu.edu.au/files/jason_ford_thesis.pdf.
[43] Fulop, A. and Li, J. (2013). Efficient learning via simulation: A marginalized resample–move approach. J. Econometrics 176 146–161. MR3084050
[44] Gilks, W. R. and Berzuini, C. (2001). Following a moving target—Monte Carlo inference for dynamic Bayesian models. J. R. Stat. Soc. Ser. B. Stat. Methodol. 63 127–146. MR1811995
[45] Godsill, S. J., Doucet, A. and West, M. (2004). Monte Carlo smoothing for nonlinear time series. J. Amer. Statist. Assoc. 99 156–168. MR2054295
[46] Gordon, N. J., Salmond, D. J. and Smith, A. F. M. (1993). Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proc. F, Comm., Radar, Signal Proc. 140 107–113.
[47] Higuchi, T. (2001). Self-organizing time series model. In Sequential Monte Carlo Methods in Practice. Stat. Eng. Inf. Sci. 429–444. Springer, New York. MR1847803
[48] Hürzeler, M. and Künsch, H. R. (1998). Monte Carlo approximations for general state-space models. J. Comput. Graph. Statist. 7 175–193. MR1649366
[49] Hürzeler, M. and Künsch, H. R. (2001). Approximating and maximising the likelihood for a general state-space model. In Sequential Monte Carlo Methods in Practice. Stat. Eng. Inf. Sci. 159–175. Springer, New York. MR1847791
[50] Ionides, E. L., Bhadra, A., Atchadé, Y. and King, A. (2011). Iterated filtering. Ann. Statist. 39 1776–1802. MR2850220
[51] Ionides, E. L., Bretó, C. and King, A. A. (2006). Inference for nonlinear dynamical systems. Proc. Natl. Acad. Sci. USA 103 18438–18443.
[52] Kim, S., Shephard, N. and Chib, S. (1998). Stochastic volatility: Likelihood inference and comparison with ARCH models. Rev. Econ. Stud. 65 361–393.
[53] Kitagawa, G. (1996). Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. J. Comput. Graph. Statist. 5 1–25. MR1380850
[54] Kitagawa, G. (1998). A self-organizing state-space model. J. Amer. Statist. Assoc. 93 1203–1215.
[55] Kitagawa, G. (2014). Computational aspects of sequential Monte Carlo filter and smoother. Ann. Inst. Statist. Math. 66 443–471. MR3211870
[56] Kitagawa, G. and Sato, S. (2001). Monte Carlo smoothing and self-organising state-space model. In Sequential Monte Carlo Methods in Practice. Stat. Eng. Inf. Sci. 177–195. Springer, New York. MR1847792
[57] Klaas, M., Briers, M., De Freitas, N., Doucet, A., Maskell, S. and Lang, D. (2006). Fast particle smoothing: If I had a million particles. In Proc. International Conf. Machine Learning 481–488. Pittsburgh, PA.
[58] Künsch, H. R. (2013). Particle filters. Bernoulli 19 1391–1403. MR3102556
[59] Lee, A. (2008). Towards smoother multivariate particle filters. M.Sc. thesis, Computer Science, Univ. British Columbia, Vancouver, BC.
[60] Lee, A. and Whiteley, N. (2014). Forest resampling for distributed sequential Monte Carlo. Preprint. Available at arXiv:1406.6010.
[61] Lee, D. S. and Chia, K. K. (2002). A particle algorithm for sequential Bayesian parameter estimation and model selection. IEEE Trans. Signal Process. 50 326–336.
[62] Le Corff, S. and Fort, G. (2013). Online expectation maximization based algorithms for inference in hidden Markov models. Electron. J. Stat. 7 763–792. MR3040559
[63] Le Corff, S. and Fort, G. (2013). Convergence of a particle-based approximation of the block online expectation maximization algorithm. ACM Trans. Model. Comput. Simul. 23 Art. 2, 22. MR3034212
[64] Le Gland, F. and Mevel, M. (1997). Recursive estimation in hidden Markov models. In Proc. 36th IEEE Conf. Decision and Control 3468–3473. San Diego, CA.
[65] Lin, M., Chen, R. and Liu, J. S. (2013). Lookahead strategies for sequential Monte Carlo. Statist. Sci. 28 69–94. MR3075339
[66] Lindsten, F., Jordan, M. I. and Schön, T. B. (2014). Particle Gibbs with ancestor sampling. J. Mach. Learn. Res. 15 2145–2184. MR3231604
[67] Liu, J. and West, M. (2001). Combined parameter and state estimation in simulation-based filtering. In Sequential Monte Carlo Methods in Practice. Springer, New York. MR1847793
[68] Liu, J. S. (2001). Monte Carlo Strategies in Scientific Computing. Springer, New York. MR1842342
[69] Liu, J. S. and Chen, R. (1998). Sequential Monte Carlo methods for dynamic systems. J. Amer. Statist. Assoc. 93 1032–1044. MR1649198
[70] Lopes, H. F., Carvalho, C. M., Johannes, M. S. and Polson, N. G. (2011). Particle learning for sequential Bayesian computation. In Bayesian Statistics 9 (J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith and M. West, eds.). Oxford Univ. Press, Oxford. MR3204011
[71] Lopes, H. F. and Tsay, R. S. (2011). Particle filters and Bayesian inference in financial econometrics. J. Forecast. 30 168–209. MR2758809
[72] Malik, S. and Pitt, M. K. (2011). Particle filters for continuous likelihood evaluation and maximisation. J. Econometrics 165 190–209. MR2846644
[73] Nemeth, C., Fearnhead, P. and Mihaylova, L. (2013). Particle approximations of the score and observed information matrix for parameter estimation in state space models with linear computational cost. Preprint. Available at arXiv:1306.0735.
[74] Olsson, J., Cappé, O., Douc, R. and Moulines, E. (2008). Sequential Monte Carlo smoothing with application to parameter estimation in nonlinear state space models. Bernoulli 14 155–179. MR2401658
[75] Olsson, J. and Westerborn, J. (2014). Efficient particle-based online smoothing in general hidden Markov models: The PaRIS algorithm. Preprint. Available at arXiv:1412.7550.
[76] Oudjane, N. and Rubenthaler, S. (2005). Stability and uniform particle approximation of nonlinear filters in case of non ergodic signals. Stoch. Anal. Appl. 23 421–448. MR2140972
[77] Paninski, L., Ahmadian, Y., Ferreira, D. G., Koyama, S., Rad, K. R., Vidne, M., Vogelstein, J. and Wu, W. (2010). A new look at state-space models for neural data. J. Comput. Neurosci. 29 107–126. MR2721336
[78] Pitt, M. K. and Shephard, N. (1999). Filtering via simulation: Auxiliary particle filters. J. Amer. Statist. Assoc. 94 590–599. MR1702328
[79] Pitt, M. K., Silva, R. d. S., Giordani, P. and Kohn, R. (2012). On some properties of Markov chain Monte Carlo simulation methods based on the particle filter. J. Econometrics 171 134–151. MR2991856
[80] Polson, N. G., Stroud, J. R. and Müller, P. (2008). Practical filtering with sequential parameter learning. J. R. Stat. Soc. Ser. B. Stat. Methodol. 70 413–428. MR2424760
[81] Poyiadjis, G., Doucet, A. and Singh, S. S. (2011). Particle approximations of the score and observed information matrix in state space models with application to parameter estimation. Biometrika 98 65–80. MR2804210
[82] Schön, T. B., Wills, A. and Ninness, B. (2011). System identification of nonlinear state-space models. Automatica J. IFAC 47 39–49. MR2878244
[83] Sherlock, C., Thiery, A. H., Roberts, G. O. and Rosenthal, J. S. (2015). On the efficiency of pseudo-marginal random walk Metropolis algorithms. Ann. Statist. 43 238–275. MR3285606
[84] Storvik, G. (2002). Particle filters in state space models with the presence of unknown static parameters. IEEE Trans. Signal Process. 50 281–289.
[85] Taghavi, E., Lindsten, F., Svensson, L. and Schön, T. B. (2013). Adaptive stopping for fast particle smoothing. In Proc. IEEE ICASSP 6293–6297. Vancouver, BC.
[86] Vercauteren, T., Toledo, A. and Wang, X. (2005). Online Bayesian estimation of hidden Markov models with unknown transition matrix and applications to IEEE 802.11 networks. In Proc. IEEE ICASSP, Vol. IV 13–16. Philadelphia, PA.
[87] West, M. and Harrison, J. (1997). Bayesian Forecasting and Dynamic Models, 2nd ed. Springer, New York. MR1482232
[88] Westerborn, J. and Olsson, J. (2014). Efficient particle-based online smoothing in general hidden Markov models. In Proc. IEEE ICASSP 8003–8007. Florence.
[89] Whiteley, N. (2010). Discussion of Particle Markov chain Monte Carlo methods. J. Royal Stat. Soc. 72 306–307.
[90] Whiteley, N. (2013). Stability properties of some particle filters. Ann. Appl. Probab. 23 2500–2537. MR3127943
[91] Whiteley, N., Andrieu, C. and Doucet, A. (2010). Efficient Bayesian inference for switching state-space models using discrete particle Markov chain Monte Carlo methods. Preprint. Available at arXiv:1011.2437.
[92] Whiteley, N. and Lee, A. (2014). Twisted particle filters. Ann. Statist. 42 115–141. MR3178458
[93] Wilkinson, D. J. (2012). Stochastic Modelling for Systems Biology, 2nd ed. CRC Press, Boca Raton, FL.
[94] Yildirim, S., Singh, S. S. and Doucet, A. (2013). An online expectation–maximization algorithm for changepoint models. J. Comput. Graph. Statist. 22 906–926. MR3173749
[95] Yuan, Y.-x. (2008). Step-sizes for the gradient method. In Third International Congress of Chinese Mathematicians, Part 1, 2. AMS/IP Stud. Adv. Math. 42 785–796. Amer. Math. Soc., Providence, RI. MR2409671