Linköping University Post Print

System identification of nonlinear state-space models

Thomas Schön, Adrian Wills and Brett Ninness

N.B.: When citing this work, cite the original article.

Original Publication:

Thomas Schön, Adrian Wills and Brett Ninness, System identification of nonlinear state-space models, 2011, AUTOMATICA, (47), 1, 39-49. http://dx.doi.org/10.1016/j.automatica.2010.10.013 Copyright: Elsevier Science B.V., Amsterdam. http://www.elsevier.com/

Postprint available at: Linköping University Electronic Press http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-65958

System Identification of Nonlinear State-Space Models ⋆

Thomas B. Schön a, Adrian Wills b, Brett Ninness b

a Division of Automatic Control, Linköping University, SE-581 83 Linköping, Sweden
b School of Electrical Engineering and Computer Science, University of Newcastle, Callaghan, NSW 2308, Australia

Abstract

This paper is concerned with the estimation of a general class of nonlinear dynamic systems in state-space form. More specifically, a Maximum Likelihood (ML) framework is employed and an Expectation Maximisation (EM) algorithm is derived to compute these ML estimates. The Expectation (E) step involves solving a nonlinear state estimation problem, where the smoothed estimates of the states are required. This problem lends itself perfectly to the particle smoother, which provides arbitrarily good estimates. The Maximisation (M) step is solved using standard techniques from numerical optimisation theory. Simulation examples demonstrate the efficacy of our proposed solution.

Key words: System identification, nonlinear models, dynamic systems, smoothing filters, expectation maximisation algorithm, particle methods.

1 Introduction

The significance and difficulty of estimating nonlinear systems is widely recognised [1, 31, 32]. As a result, there is a very large and active research effort directed towards the problem. A key aspect of this activity is that it generally focuses on specific system classes such as those described by Volterra kernels [4], neural networks [37], nonlinear ARMAX (NARMAX) [29], and Hammerstein–Wiener [41] structures, to name just some examples. In relation to this, the paper here considers Maximum Likelihood (ML) estimation of the parameters specifying a relatively general class of nonlinear systems that can be represented in state-space form.

Of course, the use of an ML approach (for example, with regard to linear dynamic systems) is common, and it is customary to employ a gradient based search technique such as a damped Gauss–Newton method to actually compute estimates [30, 46]. This requires the computation of a cost Jacobian, which typically necessitates implementing one filter, derived (in the linear case) from a Kalman filter, for each parameter that is to be estimated. An alternative, recently explored in [17] in the context of bilinear systems, is to employ the Expectation Maximisation algorithm [8] for the computation of ML estimates.

Unlike gradient based search, which is applicable to maximisation of any differentiable cost function, EM methods are only applicable to maximisation of likelihood functions. However, a dividend of this specialisation is that while some gradient calculations may be necessary, the gradient of the likelihood is not required, which will prove to be very important in this paper. In addition to this advantage, EM methods are widely recognised for their numerical stability [28].

Given these recommendations, this paper develops and demonstrates an EM-based approach to nonlinear system identification. This will require the computation of smoothed state estimates that, in the linear case, could be found by standard linear smoothing methods [17]. In the fairly general nonlinear (and possibly non-Gaussian) context considered in this work we propose a "particle based" approach whereby the required smoothed state estimates are approximated by Monte Carlo based empirical averages [10].

It is important to acknowledge that there is a very significant body of previous work on the problems addressed here. Many approaches using various suboptimal nonlinear filters (such as the extended Kalman filter) to approximate the cost Jacobian have been proposed [5, 22, 27]. Additionally, there has been significant work [3, 12, 40] investigating the employment of particle filters to compute the Jacobians necessary for a gradient based search approach.

There has also been previous work on various approximate EM-based approaches. Several authors have considered using suboptimal solutions to the associated nonlinear smoothing problem, typically using an extended Kalman smoother [13, 15, 19, 43].

As already mentioned, this paper is considering particle based approaches in order to solve the involved nonlinear smoothing problem. This idea has been partially reported by the authors in two earlier conference publications [45, 47].

An interesting extension, handling the case of missing data, is addressed in [20]. Furthermore, in [26], the authors introduce an EM algorithm using a particle smoother, similar to the algorithm we propose here, but tailored to stochastic volatility models. The survey paper [3] is one of the earliest papers to note the possibility of EM-based methods employing particle smoothing methods.

⋆ Parts of this paper were presented at the 14th IFAC Symposium on System Identification, Newcastle, Australia, March 2006 and at the 17th IFAC World Congress, Seoul, South Korea, July 2008. Corresponding author: T. B. Schön. Tel. +46-13-281373. Fax +46-13-139282. Email addresses: [email protected] (Thomas B. Schön), [email protected] (Adrian Wills), [email protected] (Brett Ninness).

2 Problem Formulation

This paper considers the problem of identifying the parameters θ for certain members of the following nonlinear state-space model structure

    x_{t+1} = f_t(x_t, u_t, v_t, \theta),   (1a)
    y_t = h_t(x_t, u_t, e_t, \theta).   (1b)

Here, xt ∈ R^{nx} denotes the state variable, with ut ∈ R^{nu} and yt ∈ R^{ny} denoting (respectively) observed input and output responses. Furthermore, θ ∈ R^{nθ} is a vector of (unknown) parameters that specifies the mappings ft(·) and ht(·), which may be nonlinear and time-varying. Finally, vt and et represent mutually independent vector i.i.d. processes described by probability density functions (pdf's) pv(·) and pe(·). These are assumed to be of known form (e.g., Gaussian) but parameterized (e.g., mean and variance) by values that can be absorbed into θ for estimation if they are unknown.

Due to the random components vt and et, the model (1) can also be represented via the stochastic description

    x_{t+1} \sim p_\theta(x_{t+1} \mid x_t),   (2a)
    y_t \sim p_\theta(y_t \mid x_t),   (2b)

where pθ(xt+1 | xt) is the pdf describing the dynamics for given values of xt, ut and θ, and pθ(yt | xt) is the pdf describing the measurements. As is common practise, in (2) the same symbol pθ is used for different pdf's that depend on θ, with the argument to the pdf denoting what is intended. Furthermore, note that we have, for brevity, dispensed with the input signal ut in the notation (2). However, everything we derive throughout this paper is valid also if an input signal is present.

The formulation (1) and its alternative formulation (2) capture a relatively broad class of nonlinear systems, and we consider the members of this class where pθ(xt+1 | xt) and pθ(yt | xt) can be explicitly expressed and evaluated.
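Since the algorithms developed below only require that pθ(xt+1 | xt) and pθ(yt | xt) can be evaluated and that (1a) can be simulated, it may help to see how such a model is typically encoded. The following is a minimal Python sketch, not taken from the paper, for the common special case of additive Gaussian noise; the function names and the scalar example dynamics are purely illustrative.

```python
# Illustrative sketch (not from the paper): a model of the form (1) with
# additive Gaussian noise, so that the densities in (2) are available in
# closed form.  All names below are assumptions for illustration only.
import numpy as np

def f(x, t, theta):
    """State transition mean in (1a); placeholder scalar example."""
    a = theta[0]
    return a * x

def h(x, t, theta):
    """Measurement mean in (1b); placeholder scalar example."""
    return x

def log_gaussian(z, mean, var):
    """log N(z; mean, var), used to evaluate the densities in (2)."""
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (z - mean) ** 2 / var

def log_p_transition(x_next, x, t, theta, q):
    """log p_theta(x_{t+1} | x_t) for v_t ~ N(0, q), cf. (2a)."""
    return log_gaussian(x_next, f(x, t, theta), q)

def log_p_measurement(y, x, t, theta, r):
    """log p_theta(y_t | x_t) for e_t ~ N(0, r), cf. (2b)."""
    return log_gaussian(y, h(x, t, theta), r)
```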

The problem addressed here is the formation of an estimate θ̂ of the parameter vector θ based on N measurements UN = [u1, ···, uN], YN = [y1, ···, yN] of observed system input–output responses. Concerning the notation, sometimes we will make use of Yt:N, which is used to denote [yt, ···, yN]. However, as defined above, for brevity we denote Y1:N simply as YN. Hence, it is here implicitly assumed that the index starts at 1.

One approach is to employ the general prediction error (PE) framework [30] to deliver θ̂ according to

    \hat{\theta} = \arg\min_{\theta \in \Theta} V(\theta),   (3)

with cost function V(θ) of the form

    V(\theta) = \sum_{t=1}^{N} \ell(\varepsilon_t(\theta)), \qquad \varepsilon_t(\theta) = y_t - \hat{y}_{t|t-1}(\theta),   (4)

and with Θ ⊆ R^{nθ} denoting a compact set of permissible values of the unknown parameter θ. Here,

    \hat{y}_{t|t-1}(\theta) = \mathrm{E}_\theta\{y_t \mid Y_{t-1}\} = \int y_t\, p_\theta(y_t \mid Y_{t-1})\, \mathrm{d}y_t   (5)

is the mean square optimal one-step ahead predictor of yt based on the model (1). The function ℓ(·) is an arbitrary and user-chosen positive function.

This PE solution has its roots in the Maximum Likelihood (ML) approach, which involves maximising the joint density (likelihood) pθ(YN) of the observations:

    \hat{\theta} = \arg\max_{\theta \in \Theta} p_\theta(y_1, \cdots, y_N).   (6)

To compute this, Bayes' rule may be used to decompose the joint density according to

    p_\theta(y_1, \cdots, y_N) = p_\theta(y_1) \prod_{t=2}^{N} p_\theta(y_t \mid Y_{t-1}).   (7)

Accordingly, since the logarithm is a monotonic function, the maximisation problem (6) is equivalent to the minimisation problem

    \hat{\theta} = \arg\min_{\theta \in \Theta} -L_\theta(Y_N),   (8)

where Lθ(YN) is the log-likelihood

    L_\theta(Y_N) \triangleq \log p_\theta(Y_N) = \log p_\theta(y_1) + \sum_{t=2}^{N} \log p_\theta(y_t \mid Y_{t-1}).   (9)

The PE and ML approaches both enjoy well understood theoretical properties including strong consistency, asymptotic normality, and in some situations asymptotic efficiency. They are therefore both an attractive solution, but there are two important challenges to their implementation.

First, both methods require knowledge of the prediction density pθ(yt | Yt−1). In the linear and Gaussian case, a Kalman filter can be employed. In the nonlinear case (1) an alternate solution must be found.

Second, the optimisation problems (3) or (8) must be solved. Typically, the costs V(θ) or Lθ(YN) are differentiable, and this is exploited by employing a gradient based search method to compute the estimate [30]. Unfortunately, these costs will generally possess multiple local minima that can complicate this approach.

3 Prediction Density Computation

Turning to the first challenge of computing the prediction density, note that by the law of total probability and the Markov nature of (2)

    p_\theta(y_t \mid Y_{t-1}) = \int p_\theta(y_t \mid x_t)\, p_\theta(x_t \mid Y_{t-1})\, \mathrm{d}x_t,   (10)

where xt is the state of the underlying dynamic system. Furthermore, using the Markov property of (2) and Bayes' rule we obtain

    p_\theta(x_t \mid Y_t) = \frac{p_\theta(y_t \mid x_t)\, p_\theta(x_t \mid Y_{t-1})}{p_\theta(y_t \mid Y_{t-1})}.   (11)

Finally, by the law of total probability and the Markov nature of (2)

    p_\theta(x_{t+1} \mid Y_t) = \int p_\theta(x_{t+1} \mid x_t)\, p_\theta(x_t \mid Y_t)\, \mathrm{d}x_t.   (12)

Together, (11), (10) are known as the "measurement update" and (12) the "time update" equations, which provide a recursive formulation of the required prediction density pθ(yt | Yt−1) as well as the predicted and filtered state densities pθ(xt | Yt−1), pθ(xt | Yt).

In the linear and Gaussian case, the associated integrals have closed form solutions which lead to the Kalman filter [25]. In general though, they do not. Therefore, while in principle (10)-(12) provide a solution to the computation of V(θ) or Lθ(YN), there is a remaining obstacle of numerically evaluating the required nx-dimensional integrals.

In what follows, the recently popular methods of sequential importance resampling (SIR, or particle filtering) will be employed to address this problem.

However, there is a remaining difficulty which is related to the second challenge mentioned at the end of Section 2. Namely, if gradient-based search is to be employed to compute the estimate θ̂, then not only is pθ(yt | Yt−1) required, but also its derivative

    \frac{\partial}{\partial \theta}\, p_\theta(y_t \mid Y_{t-1}).   (13)

Unfortunately the SIR technique does not lend itself to the simple computation of this derivative. One approach to deal with this is to simply numerically evaluate the necessary derivative based on differencing. Another is to employ a search method that does not require gradient information. Here, there exist several possibilities, such as Nelder–Mead simplex methods or annealing approaches [44, 48].

This paper explores a further possibility which is known as the Expectation Maximisation (EM) algorithm, and is directed at computing an ML estimate. Instead of using the smoothness of Lθ, it is capable of employing an alternative feature. Namely, the fact that Lθ is the logarithm of a probability density pθ(YN), which has unit area for all values of θ. How the EM algorithm is capable of utilising this simple fact to deliver an alternate search procedure is now profiled.

4 The Expectation Maximisation Algorithm

Like gradient based search, the EM algorithm is an iterative procedure that at the k'th step seeks a value θk such that the likelihood is increased in that Lθk(YN) > Lθk−1(YN). Again like gradient based search, an approximate model of Lθ(YN) is employed to achieve this. However, unlike gradient based search, the model is capable of guaranteeing increases in Lθ(YN).

The essence of the EM algorithm [8, 33] is the postulation of a "missing" data set XN = {x1, ···, xN}. In this paper, it will be taken as the state sequence in the model structure (1), but other choices are possible, and it can be considered a design variable. The key idea is then to consider the joint likelihood function

    L_\theta(X_N, Y_N) = \log p_\theta(X_N, Y_N),   (14)

with respect to both the observed data YN and the missing data XN. Underlying this strategy is an assumption that maximising the "complete" log likelihood Lθ(XN, YN) is easier than maximising the incomplete one Lθ(YN).

As a concrete example, if the model structure (1) was linear and time-invariant, then knowledge of the state xt would allow the system matrices A, B, C, D to be estimated by simple linear regression. See [16] for more detail, and [34] for further examples.

The EM algorithm then copes with XN being unavailable by forming an approximation Q(θ, θk) of Lθ(XN, YN). The approximation used is the minimum variance estimate of Lθ(XN, YN) given the observed available data YN, and an assumption θk of the true parameter value. This minimum variance estimate is given by the conditional mean [2]

    Q(\theta, \theta_k) \triangleq \mathrm{E}_{\theta_k}\{L_\theta(X_N, Y_N) \mid Y_N\}   (15a)
     = \int L_\theta(X_N, Y_N)\, p_{\theta_k}(X_N \mid Y_N)\, \mathrm{d}X_N.   (15b)

The utility of this approach depends on the relationship between Lθ(YN) and the approximation Q(θ, θk) of Lθ(XN, YN). This may be examined by using the definition of conditional probability to write

    \log p_\theta(X_N, Y_N) = \log p_\theta(X_N \mid Y_N) + \log p_\theta(Y_N).   (16)

Taking the conditional mean Eθk{· | YN} of both sides then yields

    Q(\theta, \theta_k) = L_\theta(Y_N) + \int \log p_\theta(X_N \mid Y_N)\, p_{\theta_k}(X_N \mid Y_N)\, \mathrm{d}X_N.   (17)

Therefore

    L_\theta(Y_N) - L_{\theta_k}(Y_N) = Q(\theta, \theta_k) - Q(\theta_k, \theta_k) + \int \log \frac{p_{\theta_k}(X_N \mid Y_N)}{p_\theta(X_N \mid Y_N)}\, p_{\theta_k}(X_N \mid Y_N)\, \mathrm{d}X_N.   (18)

The rightmost integral in (18) is the Kullback–Leibler divergence metric which is non-negative. This follows directly upon noting that since for x ≥ 0, − log x ≥ 1 − x,

    -\int \log \frac{p_\theta(X_N \mid Y_N)}{p_{\theta_k}(X_N \mid Y_N)}\, p_{\theta_k}(X_N \mid Y_N)\, \mathrm{d}X_N \;\geq\; \int \left(1 - \frac{p_\theta(X_N \mid Y_N)}{p_{\theta_k}(X_N \mid Y_N)}\right) p_{\theta_k}(X_N \mid Y_N)\, \mathrm{d}X_N = 0,   (19)

where the equality to zero is due to the fact that pθ(XN | YN) is of unit area for any value of θ. As a consequence of this simple fact

    L_\theta(Y_N) - L_{\theta_k}(Y_N) \;\geq\; Q(\theta, \theta_k) - Q(\theta_k, \theta_k).   (20)

This delivers the key to the EM algorithm. Namely, choosing θ so that Q(θ, θk) > Q(θk, θk) implies that the log likelihood is also increased in that Lθ(YN) > Lθk(YN). The EM algorithm exploits this to deliver a sequence of values θk, k = 1, 2, ··· designed to be increasingly good approximations of the ML estimate (6) via the following strategy.

Algorithm 1 (EM Algorithm)

(1) Set k = 0 and initialise θk such that Lθk(YN) is finite;
(2) (Expectation (E) Step):

    Calculate: Q(\theta, \theta_k);   (21)

(3) (Maximisation (M) Step):

    Compute: \theta_{k+1} = \arg\max_{\theta \in \Theta} Q(\theta, \theta_k);   (22)

(4) If not converged, update k ↦ k + 1 and return to step 2.

The termination decision in step 4 is performed using a standard criterion such as the relative increase of Lθ(YN) or the relative increase of Q(θ, θk) falling below a predefined threshold [9].
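For readers who prefer code, the iteration in Algorithm 1 can be sketched as follows. This is a generic skeleton under the assumption that the E step returns Q(·, θk) as a callable and the M step maximises it; the names e_step and m_step and the simple absolute-increase termination test are illustrative choices, not part of the paper.

```python
# Sketch of Algorithm 1 (EM).  `e_step(theta_k)` is assumed to return a callable
# Q(theta) approximating (15a); `m_step(Q, theta_k)` is assumed to return
# arg max_theta Q(theta).  An absolute-increase test stands in for the relative
# criteria mentioned in the text.
def em(theta0, e_step, m_step, max_iter=100, tol=1e-6):
    theta_k = theta0
    for k in range(max_iter):
        Q = e_step(theta_k)                    # E step, cf. (21)
        theta_next = m_step(Q, theta_k)        # M step, cf. (22)
        if Q(theta_next) - Q(theta_k) < tol:   # terminate when Q no longer increases
            return theta_next
        theta_k = theta_next
    return theta_k
```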

The first challenge in implementing the EM algorithm is the computation of Q(θ, θk) according to the definition (15a). To address this, note that via Bayes' rule and the Markov property associated with the model structure (1)

    L_\theta(X_N, Y_N) = \log p_\theta(Y_N \mid X_N) + \log p_\theta(X_N) = \log p_\theta(x_1) + \sum_{t=1}^{N-1} \log p_\theta(x_{t+1} \mid x_t) + \sum_{t=1}^{N} \log p_\theta(y_t \mid x_t).   (23)

When the model structure (1) is linear and the stochastic components vt and et are Gaussian, the log pθ terms are either linear or quadratic functions of the state xt. Taking the conditional expectation (15a) in order to compute Q(θ, θk) is then simply achieved by invoking a modification of a standard Kalman smoother [16, 24].

In the more general setting of this paper, the situation is more complicated and requires an alternative approach. To develop it, application of the conditional expectation operator Eθk{· | YN} to both sides of (23) yields

    Q(\theta, \theta_k) = I_1 + I_2 + I_3,   (24)

where

    I_1 = \int \log p_\theta(x_1)\, p_{\theta_k}(x_1 \mid Y_N)\, \mathrm{d}x_1,   (25a)
    I_2 = \sum_{t=1}^{N-1} \iint \log p_\theta(x_{t+1} \mid x_t)\, p_{\theta_k}(x_{t+1}, x_t \mid Y_N)\, \mathrm{d}x_t\, \mathrm{d}x_{t+1},   (25b)
    I_3 = \sum_{t=1}^{N} \int \log p_\theta(y_t \mid x_t)\, p_{\theta_k}(x_t \mid Y_N)\, \mathrm{d}x_t.   (25c)

Computing Q(θ, θk) therefore requires knowledge of densities such as pθk(xt | YN) and pθk(xt+1, xt | YN) associated with a nonlinear smoothing problem. Additionally, integrals with respect to these must be evaluated. Outside the linear case, there is no hope of any analytical solution to these challenges. This paper therefore takes the approach of evaluating (25a)-(25c) numerically.

5 Computing State Estimates

The quantities I1, I2, I3 in (25) that determine Q(θ, θk) depend primarily on evaluating the smoothed density pθk(xt | YN) and expectations with respect to it. To perform these computations, this paper employs sequential importance resampling (SIR) methods. These are often discussed under the informal title of "particle filters", and the main ideas underlying them date back half a century [35, 36]. However, it was not until 1993 that the first working particle filter was discovered by [21]. As will be detailed, this approach first requires dealing with the filtered density pθ(xt | Yt), and hence the discussion will begin by examining this.

5.1 Particle Filtering

The essential idea is to evaluate integrals by a randomised approach that employs the strong law of large numbers (SLLN). For example, if it is possible to build a random number generator that delivers (suitably uncorrelated) realisations {xi} with respect to a given target probability density π(x), then by the SLLN, for a given (measurable) function g

    \frac{1}{M} \sum_{i=1}^{M} g(x^i) \approx \mathrm{E}_\pi\{g(x)\} = \int g(x)\, \pi(x)\, \mathrm{d}x,   (26)

with equality (with probability one) in the limit as M → ∞.

Certainly, for some special cases such as the Gaussian density, random number generator constructions are well known. Denote by q(x) the density for which such a generator is available, and denote by x̃i ∼ q(x̃) a realisation drawn using this generator.

A realisation xj ∼ π(x) that is distributed according to the target density π(x) is then achieved by choosing the j'th realisation xj to be equal to the value x̃i with a certain probability w(x̃i). More specifically, for j = 1, ..., M, a realisation xj is selected as x̃i randomly according to

    \mathrm{P}(x^j = \tilde{x}^i) = \frac{1}{\kappa}\, w(\tilde{x}^i),   (27)

where

    w(\tilde{x}^i) = \frac{\pi(\tilde{x}^i)}{q(\tilde{x}^i)}, \qquad \kappa = \sum_{i=1}^{M} w(\tilde{x}^i).   (28)

This step is known as "resampling", and the random assignment is done in an independent fashion. The assignment rule (27) works, since by the independence, the probability that as a result xj takes on the value x̃i is the probability q(x̃i) that x̃i was realised, times the probability w(x̃i) that xj is then assigned this value. Hence, with x̃i viewed as a continuous variable, rather than one from a discrete set {x̃1, ···, x̃M},

    \mathrm{P}(x^j = \tilde{x}^i) \propto q(\tilde{x}^i)\, \frac{\pi(\tilde{x}^i)}{q(\tilde{x}^i)} = \pi(\tilde{x}^i),   (29)

so that xj is a realisation from the required density π(x).
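The construction (26)-(29) can be made concrete with a few lines of code. The sketch below is an illustration only; the particular target and proposal densities are arbitrary choices of ours. It draws from a proposal q, forms the weights (28) and resamples according to (27); both the weighted sum and the resampled particles then approximate the expectation in (26).

```python
# Sketch of importance resampling (27)-(29) for an illustrative target
# (standard normal) and proposal (zero-mean normal with standard deviation 2).
import numpy as np

rng = np.random.default_rng(0)
M = 10_000

def log_pi(x):                       # target density pi(x)
    return -0.5 * x**2 - 0.5 * np.log(2 * np.pi)

def log_q(x):                        # proposal density q(x) we can sample from
    return -0.5 * (x / 2.0) ** 2 - np.log(2.0) - 0.5 * np.log(2 * np.pi)

x_tilde = rng.normal(0.0, 2.0, size=M)            # x~^i ~ q
w = np.exp(log_pi(x_tilde) - log_q(x_tilde))      # w(x~^i) = pi/q, cf. (28)
w /= w.sum()                                      # normalise by kappa
idx = rng.choice(M, size=M, replace=True, p=w)    # resampling rule (27)
x = x_tilde[idx]                                  # x^j approximately ~ pi, cf. (29)

g = lambda z: z**2
print(np.sum(w * g(x_tilde)), np.mean(g(x)))      # both approximate E_pi{g(x)} = 1, cf. (26)
```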

The challenge in achieving this is clearly the specification of a density q(x) from which it is both feasible to generate realisations {x̃i}, and for which the ratio w(x) in (28) can be computed. To address this, consider the following selections:

    \pi(x_t) = p_\theta(x_t \mid Y_t), \qquad q(\tilde{x}_t) = p_\theta(\tilde{x}_t \mid x_{t-1}).   (30)

This choice of proposal density q is feasible since a realisation x̃ti ∼ pθ(x̃t | xt−1) may be obtained by simply generating a realisation vti ∼ pv, and substituting it, a given xt−1, a measured ut and a model-implied θ into ft in (1a) in order to deliver a realisation x̃ti.

Furthermore, if xt−1 used in (30) is a realisation distributed as xt−1 ∼ pθ(xt−1 | Yt−1) then the unconditional proposal density q is given by the law of total probability as

    q(\tilde{x}_t) = \int p_\theta(\tilde{x}_t \mid x_{t-1})\, p_\theta(x_{t-1} \mid Y_{t-1})\, \mathrm{d}x_{t-1},   (31)

and hence by the time update equation (12)

    q(\tilde{x}_t) = p_\theta(\tilde{x}_t \mid Y_{t-1}).   (32)

As a result, the ratio w = π/q implied by the choice (30) can be expressed as

    w(\tilde{x}_t^i) = \frac{p_\theta(\tilde{x}_t^i \mid Y_t)}{q(\tilde{x}_t^i)} = \frac{p_\theta(\tilde{x}_t^i \mid Y_t)}{p_\theta(\tilde{x}_t^i \mid Y_{t-1})} = \frac{p_\theta(y_t \mid \tilde{x}_t^i)}{p_\theta(y_t \mid Y_{t-1})},   (33)

where the measurement update equation (11) is used in progressing to the last equality.

According to the model (1), the numerator in this expression is simply the pdf of ht(xt, ut, et, θ) for given x̃t, ut, θ and hence computable. Additionally, the denominator in (33) is independent of x̃t, and hence simply a normalising constant to ensure unit total probability so that

    w(\tilde{x}_t^i) = \frac{1}{\kappa}\, p_\theta(y_t \mid \tilde{x}_t^i), \qquad \kappa = \sum_{i=1}^{M} p_\theta(y_t \mid \tilde{x}_t^i).   (34)

This analysis suggests a recursive technique of taking realisations xt−1^i ∼ pθ(xt−1 | Yt−1), using them to generate candidates x̃t^i via the proposal (30), and then resampling them using the density (34) to deliver realisations xt^i ∼ pθ(xt | Yt). Such an approach is known as sequential importance resampling (SIR) or, more informally, the realisations {xt^j}, {x̃t^i} are known as particles, and the method is known as particle filtering.

Algorithm 2 Basic Particle Filter

(1) Initialize particles, {x0^i}_{i=1}^M ∼ pθ(x0) and set t = 1;
(2) Predict the particles by drawing M i.i.d. samples according to

    \tilde{x}_t^i \sim p_\theta(\tilde{x}_t \mid x_{t-1}^i), \qquad i = 1, \ldots, M.   (35)

(3) Compute the importance weights {wt^i}_{i=1}^M,

    w_t^i \triangleq w(\tilde{x}_t^i) = \frac{p_\theta(y_t \mid \tilde{x}_t^i)}{\sum_{j=1}^{M} p_\theta(y_t \mid \tilde{x}_t^j)}, \qquad i = 1, \ldots, M.   (36)

(4) For each j = 1, ..., M draw a new particle xt^j with replacement (resample) according to

    \mathrm{P}(x_t^j = \tilde{x}_t^i) = w_t^i, \qquad i = 1, \ldots, M.   (37)

(5) If t < N increment t ↦ t + 1 and return to step 2, otherwise terminate.

It is important to note that a key feature of the resampling step (37) is that it takes an independent sequence {x̃t^i} and delivers a dependent one {xt^i}. Unfortunately, this will degrade the accuracy of approximations such as (26), since by the fundamental theory underpinning the SLLN, the rate of convergence of the sum to the integral decreases as the correlation in {xt^i} increases [38]. To address this, note that the proposal values {x̃t^i} are by construction independent, but distributed as x̃t^i ∼ pθ(x̃t | Yt−1). Using them, and again appealing to the law of large numbers,

    \frac{1}{M} \sum_{i=1}^{M} g(\tilde{x}_t^i)\, w(\tilde{x}_t^i) \approx \int g(\tilde{x}_t)\, w(\tilde{x}_t)\, p_\theta(\tilde{x}_t \mid Y_{t-1})\, \mathrm{d}\tilde{x}_t   (38a)
     = \int g(\tilde{x}_t)\, \frac{p_\theta(\tilde{x}_t \mid Y_t)}{p_\theta(\tilde{x}_t \mid Y_{t-1})}\, p_\theta(\tilde{x}_t \mid Y_{t-1})\, \mathrm{d}\tilde{x}_t   (38b)
     = \int g(\tilde{x}_t)\, p_\theta(\tilde{x}_t \mid Y_t)\, \mathrm{d}\tilde{x}_t = \mathrm{E}_\theta\{g(\tilde{x}_t) \mid Y_t\},   (38c)

where the transition from (38a) to (38b) follows by (33). Note that the expectation in (38c) is identical to that in (26) with π(xt) = pθ(xt | Yt). However, since the sum in (38a) involves independent {x̃t^i} rather than the dependent {xt^i} used in (26), it will generally be a more accurate approximation to the expectation.

As a result it is preferable to use the left hand side of (38a) rather than the right hand side of (26). The former, due to use of the "weights" {w(x̃t^i)}, is an example of what is known as "importance sampling" [42]. This explains the middle term in the SIR name given to Algorithm 2.

Of course, this suggests that the resampling step (37) is not essential, and one could simplify Algorithm 2 by removing it and simply propagating the weights {wt^i} for a set of particles {xi} whose positions are fixed. Unfortunately this extreme does not work over time since the resampling is critical to being able to track movements in the target density pθ(xt | Yt).

Recognising that while resampling is necessary, it need not be done at each time step t, and recognising the possibility for alternatives to the choice (32) for the proposal density, have led to a range of different particle filtering methods [10]. All deliver values {wt^i}, {x̃t^i}, {xt^i} such that arbitrary integrals with respect to a target density pθ(xt | Yt) can be approximately computed via sums such as (26) and (38a).

A mathematical abstraction, which is a useful way of encapsulating this deliverable, is the discrete Dirac delta approximation of pθ(xt | Yt) given by

    p_\theta(x_t \mid Y_t) \approx \hat{p}_\theta(x_t \mid Y_t) = \sum_{i=1}^{M} w_t^i\, \delta(x_t - \tilde{x}_t^i).   (39)

Underlying this abstraction is the understanding that substituting p̂θ for pθ delivers finite sum approximations to integrals involving pθ.
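A compact implementation of Algorithm 2 is sketched below. It assumes the model is supplied as a sampler for (2a) and an (unnormalised) evaluator of log pθ(yt | xt), in the spirit of the model sketch in Section 2; the interface and the log-weight normalisation guard are implementation choices of ours, not prescribed by the paper.

```python
# Sketch of Algorithm 2 (basic SIR particle filter) for a scalar state.
# `sample_transition(x, t)` draws from p_theta(x_t | x_{t-1}) as in (35);
# `log_meas(y_t, x, t)` evaluates log p_theta(y_t | x_t) used in the weights (36).
import numpy as np

def particle_filter(y, M, sample_x0, sample_transition, log_meas, rng):
    N = len(y)
    x_pred = np.zeros((N, M))          # predicted particles x~_t^i
    w = np.zeros((N, M))               # normalised importance weights w_t^i
    x = sample_x0(M)                   # step 1: particles from p_theta(x_0)
    for t in range(N):
        x_pred[t] = sample_transition(x, t)                 # step 2, cf. (35)
        logw = log_meas(y[t], x_pred[t], t)                 # step 3, cf. (36)
        logw -= logw.max()                                  # guard against underflow
        w[t] = np.exp(logw)
        w[t] /= w[t].sum()
        idx = rng.choice(M, size=M, replace=True, p=w[t])   # step 4, cf. (37)
        x = x_pred[t][idx]
    return x_pred, w                   # the representation (39), reused by Algorithm 3
```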

5.2 Particle Smoother

The stochastic sampling approach for computing expectations with respect to the filtered density pθ(xt | Yt) can be extended to accommodate the smoothed density pθ(xt | YN). The same abstraction just introduced of

    p_\theta(x_t \mid Y_N) \approx \hat{p}_\theta(x_t \mid Y_N) = \sum_{i=1}^{M} w_{t|N}^i\, \delta(x_t - \tilde{x}_t^i)   (40)

will be used to encapsulate the resulting importance sampling approximations. To achieve this, note that using the definition of conditional probability several times

    p_\theta(x_t \mid x_{t+1}, Y_N) = p_\theta(x_t \mid x_{t+1}, Y_t, Y_{t+1:N})   (41a)
     = \frac{p_\theta(x_t, x_{t+1}, Y_t, Y_{t+1:N})}{p_\theta(x_{t+1}, Y_t, Y_{t+1:N})}   (41b)
     = \frac{p_\theta(Y_{t+1:N} \mid x_t, x_{t+1}, Y_t)\, p_\theta(x_t, x_{t+1}, Y_t)}{p_\theta(x_{t+1}, Y_t, Y_{t+1:N})}   (41c)
     = \frac{p_\theta(Y_{t+1:N} \mid x_t, x_{t+1}, Y_t)\, p_\theta(x_t \mid x_{t+1}, Y_t)\, p_\theta(x_{t+1}, Y_t)}{p_\theta(x_{t+1}, Y_t, Y_{t+1:N})}   (41d)
     = \frac{p_\theta(Y_{t+1:N} \mid x_t, x_{t+1}, Y_t)\, p_\theta(x_t \mid x_{t+1}, Y_t)}{p_\theta(Y_{t+1:N} \mid x_{t+1}, Y_t)}   (41e)
     = p_\theta(x_t \mid x_{t+1}, Y_t),   (41f)

where the last equality follows from the fact that given xt+1, by the Markov property of the model (1) there is no further information about Yt+1:N available in xt and hence pθ(Yt+1:N | xt, xt+1, Yt) = pθ(Yt+1:N | xt+1, Yt).

Consequently, via the law of total probability and Bayes' rule

    p_\theta(x_t \mid Y_N) = \int p_\theta(x_t \mid x_{t+1}, Y_t)\, p_\theta(x_{t+1} \mid Y_N)\, \mathrm{d}x_{t+1}   (42a)
     = \int \frac{p_\theta(x_{t+1} \mid x_t)\, p_\theta(x_t \mid Y_t)}{p_\theta(x_{t+1} \mid Y_t)}\, p_\theta(x_{t+1} \mid Y_N)\, \mathrm{d}x_{t+1}   (42b)
     = p_\theta(x_t \mid Y_t) \int \frac{p_\theta(x_{t+1} \mid x_t)\, p_\theta(x_{t+1} \mid Y_N)}{p_\theta(x_{t+1} \mid Y_t)}\, \mathrm{d}x_{t+1}.   (42c)

This expresses the smoothing density pθ(xt | YN) in terms of the filtered density pθ(xt | Yt) times an xt-dependent integral. To compute this integral, note first that again by the law of total probability, the denominator of the integrand can be written as

    p_\theta(x_{t+1} \mid Y_t) = \int p_\theta(x_{t+1} \mid x_t)\, p_\theta(x_t \mid Y_t)\, \mathrm{d}x_t.   (43)

As explained in the previous section, the particle filter (39) may be used to compute this via importance sampling according to

    p_\theta(x_{t+1} \mid Y_t) \approx \sum_{i=1}^{M} w_t^i\, p_\theta(x_{t+1} \mid \tilde{x}_t^i).   (44)

To complete the integral computation, note that for the particular case of t = N, the smoothing density and the filtering density are the same, and hence the weights in (40) may be initialised as wN|N^i = wN^i and likewise the particles x̃N^i are identical. Working backwards in time then, we assume an importance sampling approximation (40) is available at time t + 1, and use it and (44) to compute the integral in (42c) as

    \int \frac{p_\theta(x_{t+1} \mid x_t)\, p_\theta(x_{t+1} \mid Y_N)}{p_\theta(x_{t+1} \mid Y_t)}\, \mathrm{d}x_{t+1} \approx \sum_{k=1}^{M} \frac{w_{t+1|N}^k\, p_\theta(\tilde{x}_{t+1}^k \mid x_t)}{\sum_{i=1}^{M} w_t^i\, p_\theta(\tilde{x}_{t+1}^k \mid \tilde{x}_t^i)}.   (45a)

The remaining pθ(xt | Yt) term in (42c) may be represented by the particle filter (39) so that the smoothed density pθ(xt | YN) is represented by

    p_\theta(x_t \mid Y_N) \approx \hat{p}_\theta(x_t \mid Y_N) = \sum_{i=1}^{M} w_{t|N}^i\, \delta(x_t - \tilde{x}_t^i),   (46a)
    w_{t|N}^i = w_t^i \sum_{k=1}^{M} w_{t+1|N}^k\, \frac{p_\theta(\tilde{x}_{t+1}^k \mid \tilde{x}_t^i)}{v_t^k},   (46b)
    v_t^k \triangleq \sum_{i=1}^{M} w_t^i\, p_\theta(\tilde{x}_{t+1}^k \mid \tilde{x}_t^i).   (46c)

These developments can be summarised by the following particle smoothing algorithm.

Algorithm 3 Basic Particle Smoother

(1) Run the particle filter (Algorithm 2) and store the predicted particles {x̃t^i}_{i=1}^M and their weights {wt^i}_{i=1}^M, for t = 1, ..., N.
(2) Initialise the smoothed weights to be the terminal filtered weights {wt^i} at time t = N,

    w_{N|N}^i = w_N^i, \qquad i = 1, \ldots, M,   (47)

and set t = N − 1.
(3) Compute the smoothed weights {wt|N^i}_{i=1}^M using the filtered weights {wt^i}_{i=1}^M and particles {x̃t^i, x̃t+1^i}_{i=1}^M via the formulae (46b), (46c).
(4) Update t ↦ t − 1. If t > 0 return to step 3, otherwise terminate.

Like the particle filter Algorithm 2, this particle smoother is not new [11]. Its derivation is presented here so that the reader can fully appreciate the rationale and approximating steps that underly it. This is important since they are key aspects underlying the novel estimation methods derived here.
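The backward recursion (46b)-(46c) of Algorithm 3 is sketched below, reusing the particles and weights returned by the particle filter sketch above. The pairwise evaluation of pθ(x̃t+1^k | x̃t^i) makes the overall cost O(NM²). The log_trans callable, which evaluates log pθ(xt+1 | xt) for all particle pairs via broadcasting, is an assumed interface of ours.

```python
# Sketch of Algorithm 3 (backward smoothing weights (46b)-(46c)).
# `log_trans(x_next, x, t)` must broadcast over its array arguments and return
# log p_theta(x_next | x); here it is evaluated for all particle pairs.
import numpy as np

def particle_smoother(x_pred, w, log_trans):
    N, M = w.shape
    w_s = np.zeros_like(w)
    w_s[N - 1] = w[N - 1]                          # initialisation (47)
    for t in range(N - 2, -1, -1):
        # p[k, i] = p_theta(x~_{t+1}^k | x~_t^i), up to a common constant factor
        lp = log_trans(x_pred[t + 1][:, None], x_pred[t][None, :], t)
        p = np.exp(lp - lp.max())
        v = p @ w[t]                               # v_t^k, cf. (46c)
        w_s[t] = w[t] * ((w_s[t + 1] / v) @ p)     # w_{t|N}^i, cf. (46b)
        w_s[t] /= w_s[t].sum()                     # numerical safety only
    return w_s
```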

Note also that there are alternatives to this algorithm for providing stochastic sampling approximations to functions of the smoothed state densities [6, 7, 14, 18, 39]. The new estimation methods developed in this paper are compatible with any method the user chooses to employ, provided it is compatible with the approximation format embodied by (40). The results presented in this paper used the method just presented as Algorithm 3.

6 The E Step: Computing Q(θ, θk)

These importance sampling approaches will now be employed in order to compute approximations to the terms I1, I2 and I3 in (25) that determine Q(θ, θk) via (24). Beginning with I1 and I3, the particle smoother representation (46) achieved by Algorithm 3 directly provides the importance sampling approximations

    I_1 \approx \hat{I}_1 \triangleq \sum_{i=1}^{M} w_{1|N}^i \log p_\theta(\tilde{x}_1^i),   (48a)
    I_3 \approx \hat{I}_3 \triangleq \sum_{t=1}^{N} \sum_{i=1}^{M} w_{t|N}^i \log p_\theta(y_t \mid \tilde{x}_t^i).   (48b)

A vital point is that when forming these approximations, the weights {wt|N^i} are computed by Algorithms 2 and 3 run with respect to the model structure (1), (2) parameterised by θk.

Evaluating I2 given by (25b) is less straightforward, due to it depending on the joint density pθ(xt+1, xt | YN). Nevertheless, using the particle filtering representation (39) together with the smoothing representation (46a) leads to the following importance sampling approximation.

Lemma 6.1 The quantity I2 defined in (25b) may be computed by an importance sampling approximation Î2 based on the particle filtering and smoothing representations (39), (44) that is given by

    I_2 \approx \hat{I}_2 \triangleq \sum_{t=1}^{N-1} \sum_{i=1}^{M} \sum_{j=1}^{M} w_{t|N}^{ij} \log p_\theta(\tilde{x}_{t+1}^j \mid \tilde{x}_t^i),   (49)

where the weights wt|N^{ij} are given by

    w_{t|N}^{ij} = \frac{w_t^i\, w_{t+1|N}^j\, p_{\theta_k}(\tilde{x}_{t+1}^j \mid \tilde{x}_t^i)}{\sum_{l=1}^{M} w_t^l\, p_{\theta_k}(\tilde{x}_{t+1}^j \mid \tilde{x}_t^l)}.   (50)

PROOF. First, by the definition of conditional probability

    p_\theta(x_{t+1}, x_t \mid Y_N) = p_\theta(x_t \mid x_{t+1}, Y_N)\, p_\theta(x_{t+1} \mid Y_N).   (51)

Furthermore, by (41a)-(41f)

    p_\theta(x_t \mid x_{t+1}, Y_N) = p_\theta(x_t \mid x_{t+1}, Y_t).   (52)

Substituting (52) into (51) and using Bayes' rule in conjunction with the Markov property of the model (1) delivers

    p_\theta(x_{t+1}, x_t \mid Y_N) = p_\theta(x_t \mid x_{t+1}, Y_t)\, p_\theta(x_{t+1} \mid Y_N)   (53a)
     = \frac{p_\theta(x_{t+1} \mid x_t)\, p_\theta(x_t \mid Y_t)}{p_\theta(x_{t+1} \mid Y_t)}\, p_\theta(x_{t+1} \mid Y_N).   (53b)

Therefore, the particle filter and smoother representations (39), (46a) may be used to deliver an importance sampling approximation to I2 according to

    \iint \log p_\theta(x_{t+1} \mid x_t)\, p_{\theta_k}(x_{t+1}, x_t \mid Y_N)\, \mathrm{d}x_t\, \mathrm{d}x_{t+1}
     = \iint \log p_\theta(x_{t+1} \mid x_t)\, p_{\theta_k}(x_{t+1} \mid x_t)\, \frac{p_{\theta_k}(x_{t+1} \mid Y_N)}{p_{\theta_k}(x_{t+1} \mid Y_t)}\, p_{\theta_k}(x_t \mid Y_t)\, \mathrm{d}x_t\, \mathrm{d}x_{t+1}
     \approx \sum_{i=1}^{M} w_t^i \int \frac{p_{\theta_k}(x_{t+1} \mid Y_N)}{p_{\theta_k}(x_{t+1} \mid Y_t)} \log p_\theta(x_{t+1} \mid \tilde{x}_t^i)\, p_{\theta_k}(x_{t+1} \mid \tilde{x}_t^i)\, \mathrm{d}x_{t+1}
     \approx \sum_{i=1}^{M} \sum_{j=1}^{M} w_t^i\, w_{t+1|N}^j\, \frac{p_{\theta_k}(\tilde{x}_{t+1}^j \mid \tilde{x}_t^i)}{p_{\theta_k}(\tilde{x}_{t+1}^j \mid Y_t)} \log p_\theta(\tilde{x}_{t+1}^j \mid \tilde{x}_t^i).

Finally, the law of total probability in combination with the particle filter (39) provides an importance sampling approximation to the denominator term given by

    p_{\theta_k}(\tilde{x}_{t+1}^j \mid Y_t) = \int p_{\theta_k}(\tilde{x}_{t+1}^j \mid x_t)\, p_{\theta_k}(x_t \mid Y_t)\, \mathrm{d}x_t   (54a)
     \approx \sum_{l=1}^{M} w_t^l\, p_{\theta_k}(\tilde{x}_{t+1}^j \mid \tilde{x}_t^l).   (54b)  □

Again, all weights and particles in this approximation are computed by Algorithms 2 and 3 run with respect to the model structure (1), (2) parametrised by θk.

Using these importance sampling approaches, the function Q(θ, θk) given by (24), (25) may be approximately computed as Q̂M(θ, θk) defined by

    \hat{Q}_M(\theta, \theta_k) = \hat{I}_1 + \hat{I}_2 + \hat{I}_3,   (55)

where Î1, Î2 and Î3 are given by (48a), (49) and (48b), respectively. Furthermore, the quality of this approximation can be made arbitrarily good as the number M of particles is increased.
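To make the E step concrete, the sketch below assembles Q̂M(θ, θk) from (48a), (48b) and (49)-(50), using the output of the particle filter and smoother sketches above. The log-density callables that take θ as a free argument, and the θk-fixed log_trans_k used for the weights (50), are interface assumptions of ours chosen for illustration.

```python
# Sketch of the E step (55): Q_hat_M = I1_hat + I2_hat + I3_hat.
# Particles/weights come from Algorithms 2 and 3 run under theta_k; log_prior,
# log_trans and log_meas are evaluated at the free variable theta, while
# log_trans_k is fixed at theta_k and only enters the weights (50).
import numpy as np

def Q_hat(theta, y, x_pred, w, w_s, log_prior, log_trans, log_meas, log_trans_k):
    N, M = w.shape
    I1 = np.sum(w_s[0] * log_prior(x_pred[0], theta))                       # (48a)
    I3 = sum(np.sum(w_s[t] * log_meas(y[t], x_pred[t], t, theta))           # (48b)
             for t in range(N))
    I2 = 0.0                                                                # (49), O(N M^2)
    for t in range(N - 1):
        lp_k = log_trans_k(x_pred[t + 1][:, None], x_pred[t][None, :], t)   # under theta_k
        p_k = np.exp(lp_k - lp_k.max())                                     # p_k[j, i]
        denom = p_k @ w[t]                                                  # denominator of (50)
        w_ij = w[t][None, :] * (w_s[t + 1] / denom)[:, None] * p_k          # weights (50)
        I2 += np.sum(w_ij * log_trans(x_pred[t + 1][:, None],
                                      x_pred[t][None, :], t, theta))
    return I1 + I2 + I3
```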

7 The M Step: Maximisation of Q̂M(θ, θk)

With an approximation Q̂M(θ, θk) of the function Q(θ, θk) required in the E step (21) of the EM Algorithm 1 available, attention now turns to the M step (22). This requires that the approximation Q̂M(θ, θk) is maximised with respect to θ in order to compute a new iterate θk+1 of the maximum likelihood estimate.

In certain cases, such as when the nonlinearities ft and ht in the model structure (1) are linear in the parameter vector θ, it is possible to maximise Q̂M(θ, θk) using closed-form expressions. An example of this will be discussed in Section 10.

In general however, a closed form maximiser will not be available. In these situations, this paper proposes a gradient based search technique. For this purpose, note that via (55), (48) and (49) the gradient of Q̂M(θ, θk) with respect to θ is simply computable via

    \frac{\partial}{\partial \theta} \hat{Q}_M(\theta, \theta_k) = \frac{\partial \hat{I}_1}{\partial \theta} + \frac{\partial \hat{I}_2}{\partial \theta} + \frac{\partial \hat{I}_3}{\partial \theta},   (56a)
    \frac{\partial \hat{I}_1}{\partial \theta} = \sum_{i=1}^{M} w_{1|N}^i\, \frac{\partial \log p_\theta(\tilde{x}_1^i)}{\partial \theta},   (56b)
    \frac{\partial \hat{I}_2}{\partial \theta} = \sum_{t=1}^{N-1} \sum_{i=1}^{M} \sum_{j=1}^{M} w_{t|N}^{ij}\, \frac{\partial \log p_\theta(\tilde{x}_{t+1}^j \mid \tilde{x}_t^i)}{\partial \theta},   (56c)
    \frac{\partial \hat{I}_3}{\partial \theta} = \sum_{t=1}^{N} \sum_{i=1}^{M} w_{t|N}^i\, \frac{\partial \log p_\theta(y_t \mid \tilde{x}_t^i)}{\partial \theta}.   (56d)

With this gradient available, there are a wide variety of algorithms that can be employed to develop a sequence of iterates θ = β0, β1, ··· that terminate at a value β∗ which seeks to maximise Q̂M(θ, θk).

A common theme in these approaches is that after initialisation with β0 = θk, the iterations are updated according to

    \beta_{j+1} = \beta_j + \alpha_j p_j, \qquad p_j = H_j g_j, \qquad g_j = \left. \frac{\partial}{\partial \theta} \hat{Q}_M(\theta, \theta_k) \right|_{\theta = \beta_j}.   (57)

Here Hj is a positive definite matrix that is used to deliver a search direction pj by modifying the gradient direction. The scalar term αj is a step length that is chosen to ensure that Q̂M(βj + αj pj, θk) ≥ Q̂M(βj, θk). The search typically terminates when incremental increases in Q̂M(β, θk) fall below a user specified tolerance. Commonly this is judged via the gradient itself according to a test such as |pjᵀ gj| ≤ ε for some user specified ε > 0.

In relation to this, it is important to appreciate that it is in fact not necessary to find a global maximiser of Q̂M(θ, θk). All that is necessary is to find a value θk+1 for which Q(θk+1, θk) > Q(θk, θk) since via (20) this will guarantee that L(θk+1) > L(θk). Hence, the resulting iteration θk+1 will be a better approximation than θk of the maximum likelihood estimate (8).

8 Final Identification Algorithm

The developments of the previous sections are now summarised in a formal definition of the EM-based algorithm this paper has derived for nonlinear system identification.

Algorithm 4 (Particle EM Algorithm)

(1) Set k = 0 and initialise θk such that Lθk(Y) is finite;
(2) (Expectation (E) Step):
  (a) Run Algorithms 2 and 3 in order to obtain the particle filter (39) and particle smoother (46a) representations.
  (b) Use this information together with (48a), (48b) and (49) to

    Calculate: \hat{Q}_M(\theta, \theta_k) = \hat{I}_1 + \hat{I}_2 + \hat{I}_3.   (58)

(3) (Maximisation (M) Step):

    Compute: \theta_{k+1} = \arg\max_{\theta \in \Theta} \hat{Q}_M(\theta, \theta_k)   (59)

explicitly if possible, otherwise according to (57).
(4) Check the non-termination condition Q(θk+1, θk) − Q(θk, θk) > ε for some user chosen ε > 0. If satisfied update k ↦ k + 1 and return to step 2, otherwise terminate.

It is worth emphasising a point made earlier, that while the authors have found the simple particle filtering and smoothing Algorithms 2 and 3 to be effective, the user is free to substitute alternatives if desired, provided the results they offer are compatible with the representations (39), (46a).

It is natural to question the computational requirements of this proposed algorithm. Some specific comments relating to this will be made in the example section following. More generally, it is possible to identify the computation of Î2 given by (49) and its gradient (56c) as a dominating component of both the E and M steps. As is evident, it requires O(NM²) floating point operations. This indicates that the computing load is sensitive to the number M of particles employed. Balancing this, the experience of the authors has been that useful results can be achieved without requiring M to be prohibitively large. The following simulation section will provide an example illustrating this point with M = 100, and 1000 iterations of Algorithm 4 requiring approximately one minute of processor time on a standard desktop computing platform.
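When no closed-form maximiser exists, step 3 of Algorithm 4 falls back on the search (57). The sketch below is one simple instance of that family: steepest ascent (Hj = I) with a backtracking choice of the step length αj and the gradient-based termination test quoted in the text. The helper names and the specific backtracking rule are our own illustrative choices, not the authors' implementation.

```python
# Sketch of the M-step search (57) with H_j = I and a backtracking step length.
# `Q` is any callable approximating Q_hat_M(., theta_k), e.g. built from the
# E-step sketch above; `grad_Q` may implement (56) or a finite-difference
# approximation of it.
import numpy as np

def m_step_search(Q, grad_Q, theta_k, alpha0=1.0, eps=1e-6, max_iter=200):
    beta = np.asarray(theta_k, dtype=float)
    for _ in range(max_iter):
        g = grad_Q(beta)                   # g_j in (57)
        p = g                              # p_j = H_j g_j with H_j = I
        if abs(p @ g) <= eps:              # termination test |p_j^T g_j| <= eps
            break
        alpha, Q0 = alpha0, Q(beta)
        while Q(beta + alpha * p) < Q0 and alpha > 1e-12:
            alpha *= 0.5                   # backtrack until Q_hat_M does not decrease
        beta = beta + alpha * p
    return beta
```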

9 Convergence

It is natural to question the convergence properties of this iterative parameter estimation procedure. These will derive from the general EM Algorithm 1 on which it is based, for which the most fundamental convergence property is as follows. If the EM algorithm terminates at a point θk+1 because it is a stationary point of Q(θ, θk), then it is also a stationary point of the log likelihood L(θ). Otherwise, the likelihood is increased in progressing from θk to θk+1.

Lemma 9.1 Let θk+1 be generated from θk by an iteration of the EM Algorithm (21), (22). Then

    L(\theta_{k+1}) \geq L(\theta_k) \qquad \forall k = 0, 1, 2, \ldots   (60)

Furthermore, equality holds in this expression if and only if both

    Q(\theta_{k+1}, \theta_k) = Q(\theta_k, \theta_k),   (61)

and

    p_{\theta_{k+1}}(X_N \mid Y_N) = p_{\theta_k}(X_N \mid Y_N),   (62)

hold for almost all (with respect to Lebesgue measure) XN.

PROOF. See Theorem 5.1 in [16]. □

An important point is that the proof of this result only depends on Q(θk+1, θk) ≥ Q(θk, θk) being non-decreasing at each iteration. It does not require that θk+1 be a maximiser of Q(θ, θk).

This provides an important theoretical underpinning for the EM method foundation of Algorithm 4 developed here. Its application is complicated by the fact that only an approximation Q̂M(θ, θk) of Q(θ, θk) is available. However, this approximation is arbitrarily accurate for a sufficiently large number M of particles.

Lemma 9.2 Consider the function Q(θ, θk) defined by (24)-(25c) and its SIR approximation Q̂M(θ, θk) defined by (48a)-(49) and (55) which is based on M particles. Suppose that

    p_\theta(y_t \mid x_t) < \infty, \qquad p_\theta(x_{t+1} \mid x_t) < \infty,   (63)

    \mathrm{E}\left\{ |Q(\theta, \theta_k)|^4 \mid Y_N \right\} < \infty,   (64)

hold for all θ, θk ∈ Θ. Then with probability one

    \lim_{M \to \infty} \hat{Q}_M(\theta, \theta_k) = Q(\theta, \theta_k), \qquad \forall \theta, \theta_k \in \Theta.   (65)

PROOF. By application of Corollary 6.1 in [23]. □

Together, Lemmas 9.1 and 9.2 do not establish convergence of Algorithm 4, and are not meant to imply it. Indeed, one drawback of the EM algorithm is that except under restrictive assumptions (such as a convex likelihood), it is not possible to establish convergence of the iterates {θk}, even when exact computation of the E-step is possible [34, 49].

The point of Lemma 9.1 is to establish that any algorithmic test that Q(θ, θk) has not decreased (such as step (4) of Algorithm 4) guarantees a non-decrease of L(θ). Hence EM is capable of the same guaranteed non cost-decreasing property as gradient based search.

Of course, this depends on the accuracy with which Q(θ, θk) can be calculated. The point of Lemma 9.2 is to establish that the particle-based approximant Q̂M(θ, θk) used in this paper is an arbitrarily accurate approximation of Q(θ, θk). Hence Lemma 9.2 establishes a scientific basis for employing Q̂M(θ, θk).

10 Numerical Illustrations

In this section the utility and performance of the new Algorithm 4 is demonstrated on two simulation examples. The first is a linear time-invariant Gaussian system. This is profiled since an exact solution for the expectation step can be computed using the Kalman smoother [16]. Comparing the results obtained by employing both this, and the particle based approximations used in Algorithm 4, therefore allows the effect of the particle approximation on estimation accuracy to be judged.

The performance of Algorithm 4 on a second example involving a well studied and challenging nonlinear system is then illustrated.

10.1 Linear Gaussian System

The first example to be considered is the following simple linear state space system

    x_{t+1} = a x_t + v_t, \qquad y_t = c x_t + e_t, \qquad \begin{bmatrix} v_t \\ e_t \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} q & 0 \\ 0 & r \end{bmatrix} \right),   (66a)

with the true parameters given by

    \theta_\star = [a_\star, c_\star, q_\star, r_\star] = [0.9,\ 0.5,\ 0.1,\ 0.01].   (66b)

The estimation problem is to determine just the θ = a parameter on the basis of the observations YN.

Using EM methods it is straightforward to also estimate the c, q and r parameters as well [16]. However, this example concentrates on a simpler case in order to focus attention on the effect of the particle filter/smoother approximations employed in Algorithm 4.

More specifically, via Algorithm 4, a particle based approximation Q̂M(a, ak) can be expressed as

    \hat{Q}_M(a, a_k) = -\gamma(a_k)\, a^2 + 2\psi(a_k)\, a + d,   (67)

where d is a constant term that is independent of a, and ψ(·) and γ(·) are defined as

    \psi(a_k) = \sum_{t=1}^{N-1} \sum_{i=1}^{M} \sum_{j=1}^{M} w_{t|N}^{ij}\, \tilde{x}_{t+1}^j \tilde{x}_t^i,   (68a)
    \gamma(a_k) = \sum_{t=1}^{N} \sum_{i=1}^{M} w_{t|N}^i\, (\tilde{x}_t^i)^2.   (68b)

Since Q̂M(a, ak) in (67) is quadratic in a, it is straightforward to solve the M step in closed form as

    a_{k+1} = \frac{\psi(a_k)}{\gamma(a_k)}.   (69)

Furthermore, in this linear Gaussian situation Q(θ, θk) can be computed exactly using a modified Kalman smoother [16]. In this case, the exact Q(a, ak) is again of the quadratic form (67) after straightforward re-definitions of ψ and γ, so the "exact" M step also has the closed form solution (69).

This "exact EM" solution can then be profiled versus the new particle filter/smoother based EM method (67)-(69) of this paper in order to assess the effect of the approximations implied by the particle approach.

This comparison was made by conducting a Monte Carlo study over 1000 different realisations of data YN with N = 100. For each realisation, ML estimates â were computed using the exact EM solution provided by [16], and via the approximate EM method of Algorithm 4. The latter was done for two cases of M = 10 and M = 500 particles. In all cases, the initial value a0 was set to the true value a⋆.

The results are shown in Figure 1. There, for each of the 1000 realisations, a point is plotted with x co-ordinate the likelihood value L(â) achieved by 100 iterations of the exact EM method, and y co-ordinate the value achieved by 100 iterations of Algorithm 4.

Clearly, if both approaches produced the same estimate, all the points plotted in this manner should lie on the solid y = x line shown in Figure 1. For the case of M = 500 particles, where the points are plotted with a cross 'x', this is very close to being the case. This illustrates that with a sufficient number of particles, the use of the approximation Q̂M in Algorithm 4 can have negligible detrimental effect on the final estimate produced.

Also plotted in Figure 1, using an 'o' symbol, are the results obtained using only M = 10 particles. Despite this being what could be considered a very small number of particles, there is still generally reasonable, and often good, agreement between the associated approximate and exact estimation results.

[Figure 1 omitted in this postprint. Caption: Comparison of the likelihood values for the final estimates after 100 iterations of the exact EM method (horizontal axis) and the particle EM method given in Algorithm 4 (vertical axis), using both M = 10 ('o') and M = 500 ('x') particles.]
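For this scalar example the M step is available in closed form, and the quantities (68a)-(68b) can be accumulated directly from the output of the particle filter and smoother sketches given earlier. The following sketch is illustrative only; the index convention matches those sketches, and the pairwise weights (50) are formed as in the E-step sketch.

```python
# Sketch of the closed-form M step (67)-(69) for the scalar system (66):
# psi(a_k) and gamma(a_k) are built from the filter/smoother output obtained
# under a_k; log_trans_k evaluates log p_{a_k}(x_{t+1} | x_t) with broadcasting.
import numpy as np

def m_step_linear(x_pred, w, w_s, log_trans_k):
    N, M = w.shape
    psi = 0.0
    for t in range(N - 1):                                        # (68a)
        lp_k = log_trans_k(x_pred[t + 1][:, None], x_pred[t][None, :], t)
        p_k = np.exp(lp_k - lp_k.max())
        denom = p_k @ w[t]
        w_ij = w[t][None, :] * (w_s[t + 1] / denom)[:, None] * p_k   # weights (50)
        psi += np.sum(w_ij * x_pred[t + 1][:, None] * x_pred[t][None, :])
    gamma = sum(np.sum(w_s[t] * x_pred[t] ** 2) for t in range(N))   # (68b) as printed
    return psi / gamma                                               # a_{k+1}, cf. (69)
```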

10.2 A Nonlinear and Non-Gaussian System

A more challenging situation is now considered that involves the following nonlinear and time-varying system

    x_{t+1} = a x_t + b\, \frac{x_t}{1 + x_t^2} + c \cos(1.2 t) + v_t,   (70a)
    y_t = d x_t^2 + e_t,   (70b)
    \begin{bmatrix} v_t \\ e_t \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} q & 0 \\ 0 & r \end{bmatrix} \right),   (70c)

where the true parameters in this case are

    \theta_\star = [a_\star, b_\star, c_\star, d_\star, q_\star, r_\star] = [0.5,\ 25,\ 8,\ 0.05,\ 0,\ 0.1].   (71)

This system has been chosen due to it being acknowledged as a challenging estimation problem in several previous studies in the area [11, 18, 21].
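To connect this benchmark with the earlier sketches, the system (70) can be written as the callables assumed by the particle filter sketch in Section 5.1. The initial state density and the zero-based handling of the time index below are our own assumptions; the paper does not specify an implementation.

```python
# Sketch of the benchmark model (70) in the interface assumed by the particle
# filter sketch above, with theta = [a, b, c, d, q, r].  Purely illustrative.
import numpy as np

def make_model(theta, rng):
    a, b, c, d, q, r = theta

    def sample_x0(M):
        return rng.normal(0.0, 1.0, size=M)     # assumed initial state density

    def sample_transition(x, t):                # draw from (70a); t is zero-based here
        mean = a * x + b * x / (1.0 + x ** 2) + c * np.cos(1.2 * (t + 1))
        return mean + np.sqrt(q) * rng.standard_normal(x.shape)

    def log_meas(y_t, x, t):                    # log of the Gaussian density implied by (70b)
        return -0.5 * np.log(2 * np.pi * r) - 0.5 * (y_t - d * x ** 2) ** 2 / r

    return sample_x0, sample_transition, log_meas
```

These callables can be passed directly to the particle_filter sketch, which is one simple way to reproduce the kind of experiment described next.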

To test the effectiveness of Algorithm 4 in this situation, a Monte Carlo study was again performed using 10^4 different data realisations YN of length N = 100. For each of these cases, an estimate θ̂ was computed using 1000 iterations of Algorithm 4 with the initialisation θ0 being chosen randomly, but such that each entry of θ0 lay in an interval equal to 50% of the corresponding entry in the true parameter vector θ⋆. In all cases M = 100 particles were used.

Using these choices, each computation of θ̂ using Algorithm 4 took 58 seconds to complete on a 3 GHz quad-core Xeon running Mac OS 10.5.

The results of this Monte Carlo examination are provided in Table 1, where the rightmost column gives the sample mean of the parameter estimate across the Monte Carlo trials plus/minus the sample standard deviation. Note that 8 of the 10^4 trials were not included in these calculations due to capture in local minima, which was defined according to the relative error test |(θ̂i − θi⋆)/θi⋆| > 0.1 for any i'th component. Considering the random initialisation, this small number of required restarts and the results in Table 1 are considered successful results.

Table 1
True and estimated parameter values for (70); mean value and standard deviations are shown for the estimates based on 10^4 Monte Carlo runs.

Parameter   True    Estimated
a           0.5     0.50 ± 0.0019
b           25.0    25.0 ± 0.99
c           8.0     7.99 ± 0.13
d           0.05    0.05 ± 0.0026
q           0       7.78 × 10^−5 ± 7.6 × 10^−5
r           0.1     0.106 ± 0.015

It is instructive to further examine the nature of both this estimation problem and the EM-based solution. For this purpose consider the situation where only the b and q parameters are to be estimated. In this case, the log-likelihood Lθ(Y) as a function of b with q = q⋆ = 0 is shown as the solid line in Figure 2. Clearly the log-likelihood exhibits quite erratic behaviour with very many local maxima. These could reasonably be expected to create significant difficulties for iterative search methods, such as gradient based search schemes for maximising Lθ(Y).

[Figure 2 omitted in this postprint. Caption: The true log-likelihood function is shown as a function of the b parameter. Superimposed onto this plot are three instances of the Q(θ, θk) function, defined in (15a), at iterations 1, 10 and 100. Clearly, as the EM algorithm evolves, then locally around the global maximiser, the approximation Q(θ, θk) resembles the log-likelihood Lθ(Y) more closely.]

However, in this simplified case, the EM-based method of this paper seems quite robust against capture in these local maxima. For example, the trajectories of the parameter estimates over 100 iterations of Algorithm 4, over 100 different length N = 100 data realisations and 100 random initialisations for the b parameter, with the q parameter initialised at q = 0.001, are shown in Figure 3. Here, M = 50 particles were employed, and in all cases an effective estimate of the true parameter value b = 25 was obtained.

[Figure 3 omitted in this postprint. Caption: Top: parameter b estimates as a function of iteration number (horizontal line indicates the true parameter value at b = 25). Bottom: log10(q) parameter estimates as a function of iteration number.]

The mechanism whereby the EM-based Algorithm 4 achieves this is illustrated in Figure 2 by profiling the function Q(θ, θk) initialised at [b0, q0] = [40, 0.001] for k = 1, 10 and 100 as the dotted, dash-dotted and dashed lines, respectively. Clearly, in each case the Q(θ, θk) function is a much more straightforward maximisation problem than that of the log likelihood Lθ(Y). Furthermore, by virtue of the essential property (20), at each iteration directions of increasing Q(θ, θk) can be seen to coincide with directions of increasing Lθ(Y). As a result, difficulties associated with the local maxima of Lθ(Y) are avoided.

To study this further, the trajectories of the EM-based estimates θk = [bk, qk]ᵀ for this example are plotted in relation to the two dimensional log-likelihood surface Lθ(Y) in Figure 4. Clearly, the iterates have taken a path circumventing the highly irregular "slice" at q = 0 illustrated in Figure 2. As a result, the bulk of them lie in much better behaved regions of the likelihood surface.

[Figure 4 omitted in this postprint. Caption: The log-likelihood is here plotted as a function of the two parameters b and q. Overlaying this are the parameter estimates θk = [bk, qk]ᵀ produced by Algorithm 4.]

This type of behaviour, with its associated robustness against getting captured in local minima, is widely recognised and associated with the EM algorithm in the statistics literature [34]. Within this literature, there are broad explanations for this advantage, such as the fact that (20) implies that Q(θ, θk) forms a global approximation to the log likelihood Lθ(Y) as opposed to the local approximations that are implicit to gradient based search schemes. However, a detailed understanding of this phenomenon is an important open research question deserving further study.

A further intriguing feature of the EM algorithm is that while (20) implies that local maxima of Lθ(Y) may be fixed points of the algorithm, there may be further fixed points. For example, in the situation just studied where the true q⋆ = 0, if the EM algorithm is initialised with q0 = 0, then all iterations θk will be equal to θ0, regardless of what the other entries in θ0 are.

This occurs because with vt = 0 in (1a), the smoothing step delivers state estimates completely consistent with θ0 (a deterministic simulation arises in the sampling (2a)), and hence the maximisation step then delivers back re-estimates that reflect this, and hence are unchanged. While this is easily avoided by always initialising q0 as non-zero, a full understanding of this aspect and the question of further fixed points are also worthy of further study.

11 Conclusion

The contribution in this paper is a novel algorithm for identifying the unknown parameters in general stochastic nonlinear state-space models. To formulate the problem a maximum likelihood criterion was employed, mainly due to the general statistical efficiency of such an approach. This problem is then solved using the expectation maximisation algorithm, which in turn required a nonlinear smoothing problem to be solved. This was handled using a particle smoothing algorithm. Finally, the utility and performance of the new algorithm was demonstrated using two simulation examples.

Acknowledgements

This work was supported by: the strategic research center MOVIII, funded by the Swedish Foundation for Strategic Research (SSF), and CADICS, a Linnaeus Center funded by the Swedish Research Council; and the Australian Research Council.

References

[1] Bode Lecture: Challenges of Nonlinear System Identification, December 2003.
[2] B. D. O. Anderson and J. B. Moore. Optimal Filtering. Prentice-Hall, New Jersey, 1979.
[3] C. Andrieu, A. Doucet, S. S. Singh, and V. B. Tadić. Particle methods for change detection, system identification, and control. Proceedings of the IEEE, 92(3):423–438, March 2004.
[4] J. S. Bendat. Nonlinear System Analysis and Identification from Random Data. Wiley Interscience, 1990.
[5] T. Bohlin. Practical Grey-box Process Identification: Theory and Applications. Springer, 2006.
[6] Y. Bresler. Two-filter formulae for discrete-time non-linear Bayesian smoothing. International Journal of Control, 43(2):629–641, 1986.
[7] M. Briers, A. Doucet, and S. R. Maskell. Smoothing algorithms for state-space models. Annals of the Institute of Statistical Mathematics (to appear), 2009.
[8] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.
[9] J. E. Dennis and R. B. Schnabel. Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Prentice Hall, 1983.
[10] A. Doucet, N. de Freitas, and N. Gordon, editors. Sequential Monte Carlo Methods in Practice. Springer Verlag, 2001.
[11] A. Doucet, S. J. Godsill, and C. Andrieu. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10(3):197–208, 2000.
[12] A. Doucet and V. B. Tadić. Parameter estimation in general state-space models using particle methods. Annals of the Institute of Statistical Mathematics, 55:409–422, 2003.
[13] S. Duncan and M. Gyöngy. Using the EM algorithm to estimate the disease parameters for smallpox in 17th century London. In Proceedings of the IEEE International Conference on Control Applications, pages 3312–3317, Munich, Germany, October 2006.

13 [14] P. Fearnhead, D. Wyncoll, and J. Tawn. A sequential [32] Lennart Ljung. Perspectives on system identification. In smoothing algorithm with linear computational cost. Plenary Talk at the 17th IFAC World Congress, Seoul, Korea, Technical report, Department of Mathematics and Statistics, July 6–11 2008. Lancaster University, Lancaster, UK, May 2008. [33] G. McLachlan and T. Krishnan. The EM Algorithm and [15] Z. Ghaharamani and S. T. Roweis. Learning nonlinear Extensions. Whiley Series in Probability and Statistics. John dynamical systems using an EM algorithm. In Advances in Wiley & Sons, New York, USA, 2 edition, 2008. Neural Information Processing Systems, volume 11, pages [34] G. McLachlan and T. Krishnan. The EM Algorithm and 599–605. MIT Press, 1999. Extensions (2nd Edition). John Wiley and Sons, 2008. [16] S. Gibson and B. Ninness. Robust maximum-likelihood [35] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. estimation of multivariable dynamic systems. Automatica, Teller, and E. Teller. Equations of state calculations by 41(10):1667–1682, 2005. fast computing machine. Journal of Chemical Physics, [17] S. Gibson, A. Wills, and B. Ninness. Maximum-likelihood 21(6):1087–1092, 1953. parameter estimation of bilinear systems. IEEE Transactions [36] N. Metropolis and S. Ulam. The Monte Carlo method. on Automatic Control, 50(10):1581–1596, 2005. Journal of the American Statistical Association, 44(247):335– [18] S. J. Godsill, A. Doucet, and M. West. Monte Carlo 341, 1949. smoothing for nonlinear time series. Journal of the American [37] K. Narendra and K. Parthasarathy. Identification and Statistical Association, 99(465):156–168, March 2004. control of dynamical systems using neural networks. IEEE [19] G. C. Goodwin and J. C. Ag¨uero. Approximate EM Transactions on Neural Networks, 1:4–27, 1990. algorithms for parameter and state estimation in nonlinear [38] B. Ninness. Strong laws of large numbers under weak stochastic models. In Proceedings of the 44th IEEE conference assumptions with application. IEEE Trans. Automatic on decision and control (CDC) and the European Control Control, 45(11):2117–2122, 2000. Conference (ECC), pages 368–373, Seville, Spain, December [39] G. Pillonetto and B. M. Bell. Optimal smoothing of non- 2005. linear dynamic systems via Monte Carlo Markov Chains. [20] R. B. Gopaluni. A particle filter approach to identification Automatica, 44(7):1676–1685, July 2008. of nonlinear processes under missing observations. The [40] G. Poyiadjis, A. Doucet, and S. S. Singh. Maximum likelihhod Canadian Journal of Chemical Engineering, 86(6):1081– parameter estimation in general state-space models using 1092, December 2008. particle methods. In Proceedings of the American Statistical [21] N. J. Gordon, D. J. Salmond, and A. F. M. Smith. A Association, Minneapolis, USA, August 2005. novel approach to nonlinear/non-Gaussian Bayesian state [41] S. Rangan, G. Wolodkin, and K. Poolla. New results for estimation. In IEE Proceedings on Radar and Signal Hammerstein system identification. In Proceedings of the Processing, volume 140, pages 107–113, 1993. 34th IEEE Conference on Decision and Control, pages 697– [22] S. Graebe. Theory and Implementation of Gray Box 702, New Orleans, USA, December 1995. Identification. PhD thesis, Royal Institute of Technology, [42] B.D. Ripley. Stochastic Simulation. Wiley, 1987. Stockholm, Sweden, June 1990. [43] S. T. Roweis and Z. Ghaharamani. Kalman filtering and [23] X.-L. Hu, T. B. Sch¨on, and L. Ljung. 
A basic convergence neural networks, chapter 6. Learning nonlinear dynamical result for particle filtering. IEEE Transactions on Signal systems using the expectation maximization algorithm, Processing, 56(4):1337–1348, April 2008. Haykin, S. (ed), pages 175–216. John Wiley & Sons, 2001. [24] A. H. Jazwinski. Stochastic processes and filtering theory. [44] P. Salamon, P. Sibani, and R. Frost. Facts, conjectures, and Mathematics in science and engineering. Academic Press, Improvements fir Simulated Annealing. SIAM, Philadelphia, New York, USA, 1970. 2002. [25] R. E. Kalman. A new approach to linear filtering and [45] T. B. Sch¨on, A. Wills, and B. Ninness. Maximum likelihood prediction problems. Transactions of the ASME, Journal of nonlinear system estimation. In Proceedings of the 14th IFAC Basic Engineering, 82:35–45, 1960. Symposium on System Identification (SYSID), pages 1003– [26] J. Kim and D. S. Stoffer. Fitting stochastic volatility models 1008, Newcastle, Australia, March 2006. in the presence of irregular sampling via particle methods and [46] T. S¨oderstr¨omand P. Stoica. System identification. Systems the em algorithm. Journal of time series analysis, 29(5):811– and Control Engineering. Prentice Hall, 1989. 833, September 2008. [47] A. G. Wills, T. B. Sch¨on, and B. Ninness. Parameter [27] N. R. Kristensen, H. Madsen, and S. B. Jorgensen. Parameter estimation for discrete-time nonlinear systems using em. In estimation in stochastic grey-box models. Automatica, Proceedings of the 17th IFAC World Congress, Seoul, South 40(2):225–237, February 2004. Korea, July 2008. [28] Kenneth Lange. A gradient algorithm locally equivalent to [48] M. H. Wright. Direct search methods: once scorned, now the em algorithm. Journal of the Royal Statistical Society, respectable. In Numerical analysis 1995 (Dundee, 1995), 57(2):425–437, 1995. pages 191–208. Longman, Harlow, 1996. [29] I.J. Leontaritis and S.A. Billings. Input-output parametric [49] C. Wu. the convergence properties of the EM algorithm, models for non-linear systems. part ii: stochastic non-linear 1983. systems. International Journal of Control, 41(2):329–344, 1985. [30] L. Ljung. System identification, Theory for the user. System sciences series. Prentice Hall, Upper Saddle River, NJ, USA, second edition, 1999. [31] L. Ljung and A. Vicino, editors. Special Issue ‘System Identification: Linear vs Nonlinear’. IEEE Transactions on Automatic Control, 2005.
