Learning Nonlinear State-Space Models Using Smooth Particle-Filter-Based Likelihood Approximations

Learning Nonlinear State-Space Models Using Smooth Particle-Filter-Based Likelihood Approximations Andreas Svensson,∗ Fredrik Lindsten, and Thomas B. Schön Department of Information Technology, Uppsala University Abstract When classical particle filtering algorithms are used for maximum likelihood parameter estimation in nonlinear state- space models, a key challenge is that estimates of the likelihood function and its derivatives are inherently noisy. The key idea in this paper is to run a particle filter based on a current parameter estimate, but then use the output from this particle filter to re-evaluate the likelihood function approximation also for other parameter values. This results in a (local) deterministic approximation of the likelihood and any standard optimization routine can be applied to find the maximum of this local approximation. By iterating this procedure we eventually arrive at a final parameter estimate. 1 Introduction Consider the nonlinear state-space model xt j xt−1 ∼ fθ(xt j xt−1); (1a) yt j xt ∼ gθ(yt j xt); (1b) N n where the hidden states xt 2 X ⊂ R and observations yt 2 Y ⊂ R y evolves over time t = 0; 1;::: . We assume that the probability densities fθ and gθ are parameterized by unknown parameters θ, which we are interested to learn from data. More specifically we are seeking the maximum likelihood estimate of θ from given observations fy1; : : : ; yT g , y1:T , i.e., to solve θb = arg max pθ(y1:T ): (2) θ We shall refer to pθ(y1:T ) as the likelihood function when considered a function of θ. It follows from (1) and the initial state density p(x0) that T Z Y pθ(y1:T ) = p(x0) fθ(xt j xt−1)gθ(yt j xt)dx0:T : (3) t=1 This can be evaluated analytically only for a few special cases. Approximations, such as the particle filter, are needed for the nonlinear case. For any given θ, the particle filter can be run to give a Monte Carlo based estimate zbθ of (3). It is well-known (Del Moral, 2004) that zbθ is unbiased, i.e., E [zbθ] = pθ(y1:T ), where the expectation is over the stochasticity arXiv:1711.10765v1 [stat.CO] 29 Nov 2017 in the particle filter algorithm itself. Albeit unbiased, zbθ is also stochastic (i.e., a different value is obtained every time the particle filter algorithm is run, since it is a Monte Carlo solution), often with a considerable variance and non-Gaussian characteristics, and standard optimization routines can therefore not be used to solve (2). To this end, schemes using, optimization for noisy function evaluations have been applied (e.g., Wills and Schön 2017; Dahlin and Lindsten 2014). We propose in this paper to iteratively run a particle filter with some current parameter estimate θk−1 and scrutinize the particle filter algorithm to construct a deterministic local approximation of the likelihood function around θk−1. In a sense we hereby manage to circumvent the stochasticity from the Monte Carlo procedure in a consistent way, and can allow for standard optimization routines to search for a new parameter value θk, for which we can run a new particle filter, etc. The idea is outlined as Algorithm 1. Related ideas have been proposed earlier (Le Gland, 2007), but they have to the best of the authors knowledge not yet been fully explored in the system identification context. ∗[email protected] 1 Algorithm 1: Identification idea 1 for k = 1;::: do n n N 2 Run a particle filter with θk−1 and save all the generated random numbers fx0:T ; a1:T gn=1 . n n N 3 Re-write the likelihood estimator zbθ as a deterministic function of the particle system fx0:T ; a1:T gn=1 , and use conventional optimization to find its maximizing argument θk. 2 Background on particle filtering The particle filter was originally proposed as a Monte Carlo solution to the state filtering problem (Gordon et al., 1993), i.e., to compute pθ(xt j y1:t). It was soon (Kitagawa, 1996) realized that the particle filter could also be used to estimate the likelihood function for given parameter values, essentially by inserting the particles (Monte Carlo samples) into the integral in (3) and thereby obtain an estimate zbθ of pθ(y1:T ). Thanks to this likelihood estimate, the particle filter can be 1 used for system identification purposes. As mentioned, zbθ is unbiased, but it often has a heavy-tailed and asymmetric distribution with a non-negligible variance. Its exact properties depends on the particle filter settings and the model. A textbook introduction to particle filters is given in, e.g., Schön and Lindsten (2017); Doucet et al. (2001). We summarize a rather general formulation of the particle filter in Algorithm 2, a version of the auxiliary article filter (Pitt and Shephard, 2 n 1999). We use q(xt j xt−1; yt) to denote an almost arbitrary proposal distribution. Furthermore, νt are the resampling n n N weights, wt the importance weights, and C(fνt gn=1 ) denotes the categorical distribution on the set f1;:::;Ng with n N (possibly unnormalized) weights fνt gn=1 , and N is the number of particles. Let us be explicit about the stochastic elements of Algorithm 2: Particles are initially drawn from the initial distribution n p(x0) on line 1 (which we assume to be independent of θ). Furthermore, the ancestor indices at on line 3 are drawn with n n respect to the resampling weights νt , and finally the propagation of particles xt on line 4, where the new particles are drawn from the proposal q(xt j xt−1; yt). n Whereas fθ(xt j xt−1), gθ(yt j xt) and p(x0) in Algorithm 2 are given by the model specification and wt follows from n the algorithm, the choices for q(xt j xt−1; yt), νt and N are left to the user. The number of particles N is usually taken as large as the computational budget permits, and two common choices for weights and proposals are • the bootstrap particle filter (Gordon et al., 1993) with q(xt j xt−1; yt) = fθ(xt j yt) and n n νt = wt ; n n and consequently wt = gθ(yt j xt ). This choice is generic and requires very little from the user, but has inferior performance compared to • the fully adapted particle filter (Pitt and Shephard, 1999), with q(xt j xt−1; yt) = pθ(xt j xt−1; yt) and n νt = pθ(yt j xt−1): This choice is superior to the bootstrap choice in terms of variance of the obtained approximation, but it is only available for a quite limited set of models. The literature on approximations to this choice is therefore rich (e.g., Naesseth et al. 2015; Doucet et al. 2000). We will in this paper exploit the relatively large freedom that is available when it comes to choosing the proposal density and the resampling weights. By making a choice that only depends on the current parameter value θk−1 it is possible to evaluate the likelihood estimator (which we will denote by zθk 1 (θ)) for different values of θ, while at the same time b − n n N making use of the same realization of fx0:T ; a1:T gn=1 . 1There are several alternative ways in which the particle filter can be used for system identification, for example approaches based on the EM algorithm or Gibbs sampling. 2 The support of q has to cover the support of fθ. 2 Algorithm 2: The auxiliary particle filter n n n 1 Draw x0 ∼ p(x0) and set w0 = 1; ν0 = 1. 2 for t = 1 to T do n j N 3 Draw at ∼ C(fνt−1gj=1): n n at 4 Propagate xt ∼ q(xt j xt−1; yt). n n at PN j n at n wt 1= j=1 wt 1 fθ (xt j xt 1) n 5 Set w − j− n − gθ(yt j x ). t an a n at t t PN t q(x j x ;yt) νt 1= j=1 νt 1 t t 1 − − − 1 PN n 6 Set zt N n=1 wt . QT 7 Compute zbθ = t=1 zt. All statements with n are for n = 1;:::;N. 3 Related work The use of the likelihood estimator zbθ as an objective function in optimization, in the search for a maximum likelihood estimate θb, has been subject to several studies: Doucet and Tadic´ (2003) differentiate zbθ and use it in a stochastic gradient descent scheme, whereas Dahlin and Lindsten (2014); Wills and Schön (2017) use an optimization scheme based on Gaussian processes. Malik and Pitt (2011) use a fixed random seed and run the particle filter for different θ. For a fixed random seed, zbθ is indeed deterministic, however with discontinuous ‘jumps’ due to different resampling decisions being made for different θ. To this end, Malik and Pitt proposes an approximative smoothing to obtain a continuous function. n n The idea used in this paper, i.e., to make the ancestor indices a1:T and particles x0:T depending only on the current, or reference, parameter value θk−1 (instead of θ) has been analyzed by Le Gland (2007), but has (to the best of the authors knowledge) not yet been applied to the system identification problem. The work by Le Gland concerns central limit theorems for the likelihood estimator zθk 1 (θ). The application of the results from Le Gland (2007) to this work is b − subject for further work.

Load more