Getting Started with Particle Metropolis-Hastings for Inference in Nonlinear Dynamical Models
Johan Dahlin (Linköping University) and Thomas B. Schön (Uppsala University)
Abstract

This tutorial provides a gentle introduction to the particle Metropolis-Hastings (PMH) algorithm for parameter inference in nonlinear state-space models together with a software implementation in the statistical programming language R. We employ a step-by-step approach to develop an implementation of the PMH algorithm (and the particle filter within) together with the reader. The final implementation is also available as the package pmhtutorial from the Comprehensive R Archive Network (CRAN) repository. Throughout the tutorial, we provide some intuition as to how the algorithm operates and discuss some solutions to problems that might occur in practice. To illustrate the use of PMH, we consider parameter inference in a linear Gaussian state-space model with synthetic data and a nonlinear stochastic volatility model with real-world data.
Keywords: Bayesian inference, state-space models, particle filtering, particle Markov chain Monte Carlo, stochastic volatility model.
1. Introduction
We are concerned with Bayesian parameter inference in nonlinear state-space models (SSMs). This is an important problem as SSMs are ubiquitous in, e.g., automatic control (Ljung 1999), econometrics (Durbin and Koopman 2012), finance (Tsay 2005) and many other fields. An overview of some concrete applications of state-space modeling is given by Langrock (2011).

The major problem with Bayesian parameter and state inference in SSMs is that it is analytically intractable and cannot be solved in closed form. Hence, approximations are required and these typically fall into two categories: analytical approximations and statistical simulations. The focus of this tutorial is on a member of the latter category known as particle Metropolis-Hastings (PMH; Andrieu et al. 2010), which approximates the posterior distribution by employing a specific Markov chain to sample from it. This requires us to be able to compute unbiased point-wise estimates of the posterior. In PMH, these are provided by the particle filter (Gordon et al. 1993; Doucet and Johansen 2011), which is another important and rather general statistical simulation algorithm.

The main aim and contribution of this tutorial is to give a gentle introduction (both in terms of methodology and software) to the PMH algorithm. We assume that the reader is familiar with traditional time series analysis and Kalman filtering at the level of Brockwell and Davis (2002). Some familiarity with Monte Carlo approximations, importance sampling and Markov chain Monte Carlo (MCMC) at the level of Ross (2012) is also beneficial.

arXiv:1511.01707v8 [stat.CO] 12 Mar 2019
The focus is on implementing the particle filter and PMH algorithms step by step in the programming language R (R Core Team 2018). We employ literate programming¹ to build up the code using a top-down approach, see Section 1.1 in Pharr and Humphreys (2010). The final implementation is available as an R package pmhtutorial (Dahlin 2018) distributed under the GPL-2 license via the Comprehensive R Archive Network (CRAN) repository and available at https://CRAN.R-project.org/package=pmhtutorial. The R source code is also provided via GitHub with corresponding versions for MATLAB (The MathWorks Inc. 2017) and Python (van Rossum et al. 2011) to enable learning by readers from many different backgrounds, see Section 8 for details.

There are a number of existing tutorials which relate to this one. A thorough introduction to particle filtering is provided by Arulampalam et al. (2002), where a number of different implementations are discussed and pseudo-code is provided to facilitate implementation. This is an excellent start for the reader who is particularly interested in particle filtering. However, they do not discuss nor implement PMH. The software LibBi and the connected tutorial by Murray (2015) include high-performance implementations and examples of particle filtering and PMH. The focus of the present tutorial is a bit different as it tries to explain all the steps of the algorithms and provides basic implementations in several different programming languages. Finally, Ala-Luhtala et al. (2016) present an advanced tutorial paper on twisted particle filtering. It also includes pseudo-code for PMH but its focus is on particle filtering and it does not discuss, e.g., how to tune the proposal in PMH. We return to related work and software throughout this tutorial in Sections 3.3, 4.3, 6 and 7.

The disposition of this tutorial is as follows. In Section 2, we introduce the Bayesian parameter and state inference problem in SSMs and discuss how and why the PMH algorithm works.
In Sections 3 and 4, we implement the particle filter and PMH algorithm for a toy model and study their properties, as well as give a small survey of results suitable for further study. In Sections 5 and 6, the PMH implementation is adapted to solve a real-world problem from finance. Furthermore, some more advanced modifications and improvements to PMH are discussed and exemplified. Finally, we conclude this tutorial in Section 7, where related software implementations are discussed.
2. Overview of the PMH algorithm
In this section, we introduce the Bayesian inference problem in SSMs related to the parameters and the latent states. The PMH algorithm is then introduced to solve these intractable problems. The inclusion of the particle filter into PMH is discussed from two complementary perspectives: as an approximation of the acceptance probability and as auxiliary variables in an augmented posterior distribution.
¹The skeleton of the code is outlined as a collection of different variables denoted by …
··· ──→ x_{t-1} ──→ x_t ──→ x_{t+1} ──→ ···
            │          │         │
            ▼          ▼         ▼
        y_{t-1}       y_t     y_{t+1}
Figure 1: The structure of an SSM represented as a graphical model.
2.1. Bayesian inference in SSMs

In Figure 1, we present a graphical model of the SSM with the latent states (top) and the observations (bottom), but without exogenous inputs². The latent state x_t only depends on the previous state x_{t-1} of the process as it is a (first-order) Markov process. That is, all the information in the past states x_{0:t-1} ≜ {x_s}_{s=0}^{t-1} is summarized in the most recent state x_{t-1}. The observations y_{1:t} are conditionally independent as they only relate to each other through the states.
We assume that the states and observations are real-valued, i.e., y_t ∈ Y ⊆ R^{n_y} and x_t ∈ X ⊆ R^{n_x}. The initial state x_0 is distributed according to the density µ_θ(x_0) parameterized by some unknown static parameters θ ∈ Θ ⊂ R^p. The latent state and the observation processes are governed by the densities f_θ(x_t | x_{t-1}) and g_θ(y_t | x_t), respectively³. The density describing the state process gives the probability that the next state is x_t given the previous state x_{t-1}. An analogous interpretation holds for the observation process. With the notation in place, a general nonlinear SSM can be expressed as
x_0 ∼ µ_θ(x_0),   x_t | x_{t-1} ∼ f_θ(x_t | x_{t-1}),   y_t | x_t ∼ g_θ(y_t | x_t),   (1)
where no notational distinction is made between a random variable and its realization. The main objective in Bayesian parameter inference is to obtain an estimate of the parameters θ given the measurements y1:T by computing the parameter posterior (distribution) given by
π_θ(θ) ≜ p(θ | y_{1:T}) = p(θ) p(y_{1:T} | θ) / Z_θ ≜ γ_θ(θ) / Z_θ,   Z_θ = ∫_Θ p(θ') p(y_{1:T} | θ') dθ',   (2)

where p(θ) and p(y_{1:T} | θ) ≜ p_θ(y_{1:T}) denote the parameter prior (distribution) and the data distribution, respectively. The un-normalized posterior distribution, denoted by γ_θ(θ) = p(θ) p(y_{1:T} | θ), is an important component in the construction of PMH. The denominator Z_θ is usually referred to as the marginal likelihood or the model evidence.
²It is straightforward to modify the algorithms presented for vector-valued quantities and also to include a known exogenous input u_t.
³In this tutorial, we assume that x_t | x_{t-1} and y_t | x_t can be modeled as continuous random variables with a density function. However, the algorithms discussed can be applied to deterministic states and discrete states/observations as well.
The data distribution pθ(y1:T ) is intractable for most interesting SSMs due to the fact that the state sequence is unknown. The problem can be seen by studying the decomposition
p_θ(y_{1:T}) = p_θ(y_1) ∏_{t=2}^{T} p_θ(y_t | y_{1:t-1}),   (3)

where the so-called predictive likelihood is given by

p_θ(y_t | y_{1:t-1}) = ∫∫ g_θ(y_t | x_t) f_θ(x_t | x_{t-1}) π_{t-1}(x_{t-1}) dx_t dx_{t-1}.   (4)
The marginal filtering distribution π_{t-1}(x_{t-1}) ≜ p_θ(x_{t-1} | y_{1:t-1}) can be obtained from the Bayesian filtering recursions (Anderson and Moore 2005), which for a fixed θ are given by
π_t(x_t) = [g_θ(y_t | x_t) / p_θ(y_t | y_{1:t-1})] ∫_X f_θ(x_t | x_{t-1}) π_{t-1}(x_{t-1}) dx_{t-1},   for 0 < t ≤ T,   (5)
with π0(x0) = µθ(x0) as the initialization. In theory, it is possible to construct a sequence of filtering distributions by iteratively applying (5). However in practice, this strategy cannot be implemented for many models of interest as the marginalization over xt−1 (expressed by the integral) cannot be carried out in closed form.
2.2. Constructing the Markov chain

The PMH algorithm offers an elegant solution to both of these intractabilities by leveraging statistical simulation. Firstly, a particle filter is employed to approximate π_{t-1}(x_{t-1}) and obtain point-wise unbiased estimates of the likelihood. Secondly, an MCMC algorithm (Robert and Casella 2004) known as Metropolis-Hastings (MH) is employed to approximate π_θ(θ) based on the likelihood estimator provided by the particle filter. PMH emerges as the combination of these two algorithms.

MCMC methods generate samples from a so-called target distribution by constructing a specific Markov chain, which can be used to form an empirical approximation. In this tutorial, the target distribution will be either the parameter posterior π_θ(θ) or the posterior of the parameters and states π_{θ,T}(θ, x_{0:T}) = p(θ, x_{0:T} | y_{1:T}). We focus here on the former case; the latter follows analogously. Executing the PMH algorithm results in K correlated samples θ^{(1:K)} ≜ {θ^{(k)}}_{k=1}^{K}, which can be used to construct a Monte Carlo approximation of π_θ(θ). This empirical approximation of the parameter posterior distribution is given by
π̂_θ^K(dθ) = (1/K) ∑_{k=1}^{K} δ_{θ^{(k)}}(dθ),   (6)
which corresponds to a collection of Dirac delta functions δ_{θ'}(dθ) located at θ = θ' with equal weights. In practice, histograms or kernel density estimators are employed to visualize the estimate of the parameter posterior obtained from (6).

The construction of the Markov chain within PMH amounts to carrying out two steps to obtain one sample from π_θ(θ). The first step is to propose a so-called candidate parameter θ' from a proposal distribution q(θ' | θ^{(k-1)}) given the previous state of the Markov chain denoted by θ^{(k-1)}. The user is quite free to choose this proposal, but its support should cover the support of the target distribution. The second step is to determine if the state of the Markov chain should be changed to the candidate parameter θ' or if it should remain in the previous state θ^{(k-1)}. This decision is stochastic and the candidate parameter is assigned as the next state of the Markov chain, i.e., θ^{(k)} ← θ', with a certain acceptance probability α(θ', θ^{(k-1)}), which is given by
α(θ', θ^{(k-1)}) = min{1, [π_θ(θ') q(θ^{(k-1)} | θ')] / [π_θ(θ^{(k-1)}) q(θ' | θ^{(k-1)})]}
               = min{1, [γ_θ(θ') q(θ^{(k-1)} | θ')] / [γ_θ(θ^{(k-1)}) q(θ' | θ^{(k-1)})]},   (7)

where Z_θ cancels as it is independent of the current state of the Markov chain. The stochastic behavior introduced by (7) facilitates exploration of (in theory) the entire posterior and is also a necessary condition for the Markov chain to actually have the target as its stationary distribution. The latter is known as detailed balance, which ensures that the Markov chain is reversible and has the correct stationary distribution, that is, that the samples generated by PMH actually are from π_θ(θ). Hence, the stochastic acceptance decision can be seen as a correction of the Markov chain generated by the proposal distribution. The stationary distribution of the corrected Markov chain is the desired target distribution. In a sense this is similar to importance sampling, where samples from a proposal distribution are corrected by weighting to be approximately distributed according to the target distribution.

The intuition for the acceptance probability (7) (disregarding the influence of the proposal q) is that we always accept a candidate parameter θ' if π_θ(θ') > π_θ(θ^{(k-1)}), that is, if θ' increases the value of the target compared with the previous state θ^{(k-1)}. This results in a mode-seeking behavior which is similar to that of an optimization algorithm estimating the maximum of the posterior distribution. However, from (7) we also note that a small decrease in the posterior value can be accepted to facilitate exploration of the entire posterior. This is the main difference between PMH and an optimization algorithm: the former maps the entire posterior whereas the latter only seeks the location of its mode.
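To make the two-step construction concrete, the following is a minimal sketch in R of a standard MH sampler for a one-dimensional toy problem, where a standard Gaussian stands in for the target π_θ; the names logTarget and stepSize are ours, not from the paper. With a symmetric random-walk proposal, the ratio of proposal densities in (7) cancels.

```r
# Minimal MH sketch on a toy target; not yet PMH (the exact log-target
# here stands in for the particle-filter estimate introduced later).
set.seed(10)
K <- 5000
logTarget <- function(theta) dnorm(theta, mean = 0, sd = 1, log = TRUE)
stepSize <- 0.5

theta <- numeric(K)
theta[1] <- 1
for (k in 2:K) {
  candidate <- theta[k - 1] + stepSize * rnorm(1)   # symmetric proposal
  logAlpha <- logTarget(candidate) - logTarget(theta[k - 1])
  if (log(runif(1)) < logAlpha) {
    theta[k] <- candidate                           # accept the candidate
  } else {
    theta[k] <- theta[k - 1]                        # reject, keep old state
  }
}
```

The empirical mean and standard deviation of theta should be close to 0 and 1; PMH later replaces logTarget with the logarithm of the estimator in (12).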
2.3. Approximating test functions

In Bayesian parameter inference, the interest often lies in computing the expectation of a so-called test function, which is a well-behaved (integrable) function ϕ : Θ → R^{n_ϕ} mapping the parameters to a value on the real space. The expectation with respect to the parameter posterior is given by

π_θ[ϕ] ≜ E[ϕ(θ) | y_{1:T}] = ∫_Θ ϕ(θ) π_θ(θ) dθ,   (8)

where, e.g., choosing ϕ(θ) = θ corresponds to computing the mean of the parameter posterior. The expectation in (8) can be estimated via the empirical approximation in (6) according to
π̂_θ^K[ϕ] ≜ ∫_Θ ϕ(θ) π̂_θ^K(dθ) = (1/K) ∑_{k=1}^{K} ∫_Θ ϕ(θ) δ_{θ^{(k)}}(dθ) = (1/K) ∑_{k=1}^{K} ϕ(θ^{(k)}),   (9)

where the last equality follows from the properties of the Dirac delta function. This estimator is well-behaved and it is possible to establish a law of large numbers (LLN) and a central limit
theorem (CLT), see Meyn and Tweedie (2009) for more information. From the LLN, we know that the estimator is consistent (and asymptotically unbiased as K → ∞). Moreover, from the CLT we know that the error is approximately Gaussian with a variance decreasing as 1/K, which is the usual Monte Carlo rate. Note that the LLN usually assumes independent samples, but the ergodic theorem gives a similar result (under some assumptions) even when the samples θ^{(1:K)} are correlated⁴. However, the asymptotic variance is usually larger than if the samples were uncorrelated.
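As a small illustration (with made-up draws standing in for PMH output), the estimator (9) with ϕ(θ) = θ is simply the sample mean of the chain, and other test functions follow the same pattern:

```r
# Estimator (9): averages of phi evaluated at the (correlated) samples.
# The draws below are illustrative, not actual PMH output.
thetaSamples <- c(0.2, 0.4, 0.3, 0.5, 0.1)
posteriorMean <- mean(thetaSamples)                      # phi(theta) = theta
posteriorVar <- mean((thetaSamples - posteriorMean)^2)   # phi(theta) = (theta - mean)^2
```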
2.4. Approximating the acceptance probability

One major obstacle remains to be solved before the PMH algorithm can be implemented. The acceptance probability (7) is intractable and cannot be computed as the likelihood is unknown. From above, we know that the particle filter can provide an unbiased point-wise estimator of the likelihood and therefore also of the posterior π_θ(θ). It turns out that this unbiasedness property is crucial for PMH to provide a valid empirical approximation of π_θ(θ). This approach is known as an exact approximation: the likelihood is replaced by an estimate, but the algorithm still retains its validity, see Andrieu and Roberts (2009) for more information.
The particle filter generates a set of samples from πt(xt) for each t, which can be used to create an empirical approximation. These samples are generated by sequentially applying importance sampling to approximate the solution to (5). The approximation of the filtering distribution is then given by
π̂_t^N(dx_t) ≜ p̂_θ^N(dx_t | y_{1:t}) = ∑_{i=1}^{N} w_t^{(i)} δ_{x_t^{(i)}}(dx_t),   w_t^{(i)} ≜ v_t^{(i)} / ∑_{j=1}^{N} v_t^{(j)},   (10)
where the particles x_t^{(i)} and their normalized weights w_t^{(i)} constitute the so-called particle system generated during a run of the particle filter. The un-normalized weights are denoted by v_t^{(i)}. It is possible to prove that the empirical approximation converges to the true distribution as N → ∞ under some regularity assumptions⁵. The likelihood can then be estimated via the decomposition (3) by inserting the empirical filtering distribution (10) into (4). This results in the estimator

p̂_θ^N(y_{1:T}) = p̂_θ^N(y_1) ∏_{t=2}^{T} p̂_θ^N(y_t | y_{1:t-1}) = ∏_{t=1}^{T} [ (1/N) ∑_{i=1}^{N} v_t^{(i)} ],   (11)

where a point-wise estimate of the un-normalized parameter posterior is given by
γ̂_θ^N(θ) = p(θ) p̂_θ^N(y_{1:T}).   (12)

⁴The theoretical details underpinning MH have been deliberately omitted. For the ergodic theorem to hold, the Markov chain needs to be ergodic, which means that the Markov chain should be able to explore the entire state-space and not get stuck in certain parts of it. The interested reader can find more about the theoretical details in, e.g., Tierney (1994) and Meyn and Tweedie (2009) for MH and Andrieu et al. (2010) for PMH.
⁵Again, we have omitted many of the important theoretical details regarding the particle filter. For more information about these and many other important topics related to this algorithm, see, e.g., Crisan and Doucet (2002) and Doucet and Johansen (2011).
Here, it is assumed that the prior can be computed point-wise by closed-form expressions. The acceptance probability (7) can be approximated by plugging in (12) resulting in
α(θ', θ^{(k-1)}) = min{1, [γ̂_θ^N(θ') q(θ^{(k-1)} | θ')] / [γ̂_θ^N(θ^{(k-1)}) q(θ' | θ^{(k-1)})]}
               = min{1, [p(θ') p̂_{θ'}^N(y_{1:T}) q(θ^{(k-1)} | θ')] / [p(θ^{(k-1)}) p̂_{θ^{(k-1)}}^N(y_{1:T}) q(θ' | θ^{(k-1)})]}.   (13)
This approximation is only valid if p̂_θ^N(y_{1:T}) is an unbiased estimator of the likelihood. Fortunately, this is the case when employing the particle filter as its likelihood estimator is unbiased (Del Moral 2013; Pitt et al. 2012) for any N ≥ 1.

As in the parameter inference problem, interest often lies in computing an expectation of a well-behaved test function ϕ : X → R^{n_ϕ} with respect to π_t(x_t) given by

π_t[ϕ] ≜ E[ϕ(x_t) | y_{1:t}] = ∫_X ϕ(x_t) π_t(x_t) dx_t,

where again choosing the test function ϕ(x_t) = x_t corresponds to computing the mean of the marginal filtering distribution. This expectation is intractable as π_t(x_t) is unknown. Again, we replace it with an estimate provided by the empirical approximation (10), which results in
π̂_t^N[ϕ] ≜ ∫_X ϕ(x_t) π̂_t^N(dx_t) = ∑_{i=1}^{N} w_t^{(i)} ∫_X ϕ(x_t) δ_{x_t^{(i)}}(dx_t) = ∑_{i=1}^{N} w_t^{(i)} ϕ(x_t^{(i)}),   (14)

for some 0 ≤ t ≤ T by the properties of the Dirac delta function. In the subsequent sections, we are especially interested in the estimator for the mean of the marginal filtering distribution,
x̂_t^N ≜ π̂_t^N[x] = ∑_{i=1}^{N} w_t^{(i)} x_t^{(i)},   (15)

which is referred to as the filtered state estimate. Under some assumptions, the properties of π̂_t^N[ϕ] are similar to those of the estimator in the PMH algorithm, see Crisan and Doucet (2002) and Doucet and Johansen (2011) for more information. Hence, the estimator is consistent (and asymptotically unbiased as N → ∞) and the error is approximately Gaussian with a variance decreasing as 1/N.
2.5. Outlook: Pseudo-marginal algorithms

The viewpoint adopted in the previous discussion of PMH is that it is an MH algorithm which employs a particle filter to approximate the otherwise intractable acceptance probability. Another, more general viewpoint is to consider PMH as a pseudo-marginal algorithm (Andrieu and Roberts 2009). In this type of algorithm, some auxiliary variables are introduced to facilitate computations, but these are marginalized away during a run of the algorithm. As a result, the PMH algorithm can be seen as a standard MH algorithm operating on the non-standard extended space Θ × U, where Θ and U denote the parameter space and the space of the auxiliary variables, respectively. The resulting extended target is given by
π_{θ,u}(θ, u) = π̂_θ^N(θ | u) m(u).   (16)

Here, the parameter posterior is augmented with u ∈ U, which denotes some multivariate random variables with density m(u). From the discussion above, we know that u can be used to construct a point-wise estimator π̂_θ^N(θ | u) via the particle filter by (12). The unbiasedness property of the likelihood estimator based on the particle filter gives

E_u[γ̂_θ^N(θ | u)] = ∫_U γ̂_θ^N(θ | u) m(u) du = γ_θ(θ).   (17)
This means that the un-normalized parameter posterior γ_θ(θ) can be recovered by marginalizing over all the auxiliary variables u. In the implementation, this means that the sampled values of u can simply be disregarded. Hence, we will not store them in the subsequent implementations but only keep the samples of θ (and x_t). We conclude by deriving the pseudo-marginal algorithm following from (16). A proposal for θ and u is selected with the form
q(θ', u' | θ^{(k-1)}, u^{(k-1)}) = q_θ(θ' | θ^{(k-1)}, u^{(k-1)}) q_u(u' | u^{(k-1)}),   (18)
q_θ(θ' | θ^{(k-1)}, u^{(k-1)}) = q_θ(θ' | θ^{(k-1)}),   q_u(u' | u^{(k-1)}) = m(u').   (19)
This corresponds to an independent proposal for u and a proposal for θ that does not include u. Other options are a topic of current research, see Section 4.3 for some references. The resulting acceptance probability from this choice of target and proposal is given by

α(θ', u', θ^{(k-1)}, u^{(k-1)}) = min{1, [π_{θ,u}(θ', u') q(θ^{(k-1)}, u^{(k-1)} | θ', u')] / [π_{θ,u}(θ^{(k-1)}, u^{(k-1)}) q(θ', u' | θ^{(k-1)}, u^{(k-1)})]}
                              = min{1, [γ̂_θ^N(θ' | u') q_θ(θ^{(k-1)} | θ')] / [γ̂_θ^N(θ^{(k-1)} | u^{(k-1)}) q_θ(θ' | θ^{(k-1)})]}.

Note that the resulting acceptance probability is the same as in (13). These arguments provide a sketch of the proof that PMH generates samples from the target distribution and were first presented by Flury and Shephard (2011). For a more formal treatment and proof of the validity of PMH, see Andrieu and Roberts (2009) and Andrieu et al. (2010).
Figure 2: Simulated data from the LGSS model with latent state (orange), observations (green) and autocorrelation function (ACF) of the observations (light green).
3. Estimating the states in a linear Gaussian SSM
We are now ready to start implementing the PMH algorithm based on the material in Section 2. To simplify the presentation, this section discusses how to estimate the filtering distributions {π_t(x_t)}_{t=0}^{T} for a linear Gaussian state-space (LGSS) model. These distributions can be used to compute x̂_{0:T}^N and p̂_θ^N(y_{1:T}), i.e., the estimates of the filtered states and the estimate of the likelihood, respectively. The parameter inference problem is treated in the subsequent section. The particular LGSS model considered is given by
x_0 ∼ δ_{x_0}(x),   x_t | x_{t-1} ∼ N(x_t; φ x_{t-1}, σ_v²),   y_t | x_t ∼ N(y_t; x_t, σ_e²),   (20)

where the parameters are denoted by θ = {φ, σ_v, σ_e}. Here, φ ∈ (−1, 1) determines the persistence of the state, while σ_v, σ_e ∈ R_+ denote the standard deviations of the state transition noise and the observation noise, respectively. The Gaussian density is denoted by N(x; µ, σ²) with mean µ and standard deviation σ > 0. Figure 2 presents a simulated data record from the model with T = 250 observations using θ = {0.75, 1.00, 0.10} and x_0 = 0. The complete code for the data generation is available in generateData.
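A minimal sketch of what generateData does under (20) is given below; the interface of the actual package function may differ.

```r
# Simulate T observations from the LGSS model (20) with x_0 = 0.
set.seed(10)
T <- 250
phi <- 0.75; sigmav <- 1.00; sigmae <- 0.10

x <- numeric(T + 1)
y <- numeric(T + 1)
x[1] <- 0                                    # initial state x_0
y[1] <- NA                                   # no observation at time 0
for (t in 2:(T + 1)) {
  x[t] <- phi * x[t - 1] + sigmav * rnorm(1) # state transition
  y[t] <- x[t] + sigmae * rnorm(1)           # observation
}
```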
3.1. Implementing the particle filter

The complete source code for the implementation of the particle filter is available in the function particleFilter. Its code skeleton is given by:

particleFilter <- function(y, theta, noParticles, initialState) {
The function particleFilter has the inputs: y (vector with T + 1 observations), theta (parameters {φ, σ_v, σ_e}), noParticles and initialState. The outputs will be the estimates of the filtered states x̂_{0:T}^N and the estimate of the likelihood p̂_θ^N(y_{1:T}). Note that particleFilter iterates over t = 2, ..., T, which corresponds to time indices t = 1, ..., T − 1 in (20), as the numbering of vector elements starts from 1 in R. The iteration terminates at time index T − 1 as the future observation y_{t+1} is required at each iteration. That is, we assume that the new observation arrives to the algorithm between the propagation and weighting steps. It therefore makes sense to use this information in the weighting step when possible.
T <- length(y) - 1
particles <- matrix(0, nrow = noParticles, ncol = T + 1)
ancestorIndices <- matrix(0, nrow = noParticles, ncol = T + 1)
weights <- matrix(1, nrow = noParticles, ncol = T + 1)
normalizedWeights <- matrix(0, nrow = noParticles, ncol = T + 1)
xHatFiltered <- matrix(0, nrow = T, ncol = 1)
logLikelihood <- 0
P[a_t^{(i)} = j] = w_{t-1}^{(j)},   j = 1, ..., N.
The resampling step is implemented by a call to the function sample by:
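A sketch of this call, consistent with the variable names in the skeleton above (the small illustrative setup values are ours), is:

```r
# Multinomial resampling of the ancestor indices via the built-in
# sample(); uniform weights and small dimensions are illustrative only.
set.seed(10)
noParticles <- 4
T <- 3
t <- 2
normalizedWeights <- matrix(1 / noParticles, noParticles, T + 1)
ancestorIndices <- matrix(rep(1:noParticles, T + 1), noParticles, T + 1)

newAncestors <- sample(noParticles, replace = TRUE,
                       prob = normalizedWeights[, t - 1])
# Resample the particle lineage together with the particles (bookkeeping
# of the genealogy of the particle system).
ancestorIndices[, 1:(t - 1)] <- ancestorIndices[newAncestors, 1:(t - 1)]
ancestorIndices[, t] <- newAncestors
```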
where the resulting ancestor indices a_t^{(1:N)} are returned as newAncestors and stored in ancestorIndices for bookkeeping. Note that the particle lineage is also resampled at the same time as the particles. All this is done to keep track of the genealogy of the particle system over time, that is, the entire history of the particle system.
x_t^{(i)} | x_{t-1}^{a_t^{(i)}} ∼ p_θ(x_t | x_{t-1}^{a_t^{(i)}}, y_t),   (21)
where information from the resampled previous particle x_{t-1}^{a_t^{(i)}} and the current measurement y_t can be included. There are a few different choices for the particle proposal and the most common one is to make use of the model itself, i.e., p_θ(x_t | x_{t-1}, y_t) = f_θ(x_t | x_{t-1}). However, for the LGSS model, an optimal proposal can be derived using the properties of the Gaussian distribution, resulting in
p_θ^{opt}(x_t | x_{t-1}^{a_t^{(i)}}, y_t) ∝ g_θ(y_t | x_t) f_θ(x_t | x_{t-1}^{a_t^{(i)}}) = N(x_t; σ² [σ_e^{-2} y_t + σ_v^{-2} φ x_{t-1}^{a_t^{(i)}}], σ²),   (22)

with σ^{-2} = σ_v^{-2} + σ_e^{-2}. This particle proposal is optimal in the sense that it minimizes the variance of the incremental particle weights at the current time step⁷. For most other
⁶This unbiasedness property is crucial for the PMH algorithm to be valid. As a consequence, adaptive methods (Doucet and Johansen 2011, Section 3.5) that only resample the particle system at certain iterations cannot be used within PMH.
⁷However, it is unclear exactly how this influences the entire particle system, i.e., if this is the globally optimal choice.
SSMs, the optimal proposal is intractable and the state transition model is used instead. The propagation step is implemented by:
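A sketch of the propagation step using the optimal proposal (22); variable names are assumed from the skeleton above and the time-indexing conventions are simplified for illustration.

```r
# Propagate resampled particles through the optimal proposal (22);
# setup values are illustrative only.
set.seed(10)
noParticles <- 4
t <- 2
phi <- 0.75; sigmav <- 1.00; sigmae <- 0.10
y <- c(0, 0.12, -0.25)
particles <- matrix(0, noParticles, 3)
newAncestors <- c(1, 1, 2, 3)

sigma2 <- 1 / (sigmav^(-2) + sigmae^(-2))    # proposal variance in (22)
proposalMean <- sigma2 * (sigmae^(-2) * y[t] +
                          sigmav^(-2) * phi * particles[newAncestors, t - 1])
particles[, t] <- proposalMean + sqrt(sigma2) * rnorm(noParticles)
```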
From (22), the ratio between the noise variances is seen to determine the shape of the proposal. Essentially, there are two different extreme situations: (i) σ_e²/σ_v² ≪ 1 and (ii) σ_e²/σ_v² ≫ 1. In the first extreme (i), the location of the proposal is essentially governed by y_t and the scale is mainly determined by σ_e. This corresponds to essentially simulating particles from g_θ(y_t | x_t). When σ_e is small, this usually allows for running the particle filter using only a small number of particles. In the second extreme (ii), the proposal essentially simulates from f_θ(x_t | x_{t-1}) and does not take the observation into account. In summary, the performance and characteristics of the optimal proposal are therefore related to the noise variances of the model and their relative sizes.
Remember that in this implementation, the new observation yt+1 is introduced in the particle filter between the propagation and the weighting steps for the first time. This is the reason for the slightly complicated choice of time indices. However, the inclusion of the information in yt+1 in this step is beneficial as it improves the accuracy of the filter. That is, we keep particles that after propagation are likely to result in the observation yt+1. In many situations, the resulting weights are small and it is therefore beneficial to work with shifted log-weights to avoid problems with numerical precision. This is done by applying the transformation
ṽ_t^{(i)} = log v_t^{(i)} − v_max,   for i = 1, ..., N,
where v_max denotes the largest element of log v_t^{(1:N)}. The weights are then normalized (ensuring that they sum to one) by
w_t^{(i)} = exp(ṽ_t^{(i)}) / ∑_{j=1}^{N} exp(ṽ_t^{(j)}) = [exp(−v_max) exp(log v_t^{(i)})] / [exp(−v_max) ∑_{j=1}^{N} exp(log v_t^{(j)})] = v_t^{(i)} / ∑_{j=1}^{N} v_t^{(j)},   (23)

where the shifts −v_max cancel and do not affect the relative sizes of the weights. Hence, the first and third expressions in (23) are equivalent, but the first expression enjoys better numerical precision. The computation of the weights is implemented by:
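A sketch of the normalization using the shifted log-weights in (23); the log-weights below are illustrative values rather than output of the weighting function.

```r
# Normalize weights via the shifted log-sum-exp trick in (23).
logWeights <- c(-1.2, -0.7, -2.1, -0.9)       # illustrative log v_t^(i)
vMax <- max(logWeights)
shiftedWeights <- exp(logWeights - vMax)      # shift for numerical precision
normalizedWeights <- shiftedWeights / sum(shiftedWeights)
```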
We remind the reader that we compare y[t + 1] and particles[, t] due to the convention for indexing arrays in R; these correspond to y_t and x_{t-1}^{(1:N)} in the model, respectively. This is also the reason for the comment regarding the for-loop in particleFilter. We now see that the weights depend on the next observation, which is why the loop needs to run to T − 1 and not T.
log p̂_θ^N(y_{1:t}) = log p̂_θ^N(y_{1:t-1}) + v_max + log[ ∑_{i=1}^{N} exp(ṽ_t^{(i)}) ] − log N,

where the log-shifted weights are used to increase the numerical precision. Note that the shift by v_max needs to be compensated for in the estimation of the log-likelihood. This recursion is implemented by:
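A sketch of this update (reusing illustrative log-weights); note the identity v_max + log Σ_i exp(ṽ_t^{(i)}) − log N = log of the average un-normalized weight, which is the predictive likelihood term in (11).

```r
# Update the running log-likelihood estimate with the predictive
# likelihood of the current time step; values are illustrative.
noParticles <- 4
logWeights <- c(-1.2, -0.7, -2.1, -0.9)
vMax <- max(logWeights)
predictiveLogLik <- vMax + log(sum(exp(logWeights - vMax))) - log(noParticles)
logLikelihood <- 0
logLikelihood <- logLikelihood + predictiveLogLik
```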
The estimate of the posterior distribution follows from inserting the estimate of the log- likelihood into (12).
x̂_t^N = (1/N) ∑_{i=1}^{N} x_t^{(i)},   (24)

which is implemented by:
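In this filter, (24) is a plain average over the particles since the weights are uniform; a sketch with illustrative particle values:

```r
# Filtered state estimate (24): average of the particles at time t.
particlesAtT <- c(0.1, -0.3, 0.2, 0.4)       # illustrative x_t^(i)
xHatFilteredAtT <- mean(particlesAtT)
```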
Note that this corresponds to the weights w_t^{(i)} = 1/N, which according to the expression in (23) would correspond to all particles being identical. However, this is not the case
Number of particles (N):   10      20      50      100     200     500     1000
log-bias:                 −3.70   −3.96   −4.57   −4.85   −5.19   −5.67   −6.08
log-MSE:                  −6.94   −7.49   −8.72   −9.29   −9.91  −10.87  −11.67

Table 1: The log-bias and the log-MSE of the filtered states for varying N.

in general, and the estimator (24) is the result of making use of the optimal choices in the propagation and weighting steps of the particle filter. This choice corresponds to a particular type of algorithm known as the fully-adapted particle filter (faPF; Pitt and Shephard 1999), see Section 3.3 for more details. In the faPF, we make use of separate weights in the resampling step and when constructing the empirical approximation of π_t(x_t). In Section 5, we introduce another type of particle filter with another estimator for x̂_t^N, which can be used for most SSMs.
3.2. Numerical illustration

We make use of the implementation of particleFilter to estimate the filtered states and to investigate the properties of this estimate for finite N. The complete implementation is found in the script/function example1_lgss. We use the data presented in Figure 2, which was generated in the beginning of this section. The estimates from the particle filter are compared with the corresponding estimates from the Kalman filter. The latter follows from using the properties of the Gaussian distribution to solve (5) exactly, which is only possible for an LGSS model. In the middle of Figure 3, we present the difference between the optimal state estimate from the Kalman filter and the estimate from the particle filter using N = 20 particles. Two alternative measures of accuracy are the bias (absolute error) and the mean squared error (MSE) of the state estimate. These are computed according to
Bias(x̂_t^N) = (1/T) ∑_{t=1}^{T} |x̂_t^N − x̂_t|,   MSE(x̂_t^N) = (1/T) ∑_{t=1}^{T} (x̂_t^N − x̂_t)²,

where x̂_t denotes the optimal state estimate obtained by the Kalman filter. In Table 1 and in the bottom part of Figure 3, we present the logarithm of the bias and the MSE for different values of N. We note that the bias and the MSE decrease rapidly when increasing N. Hence, we conclude that for this model N does not need to be large for x̂_t^N to be accurate.
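These accuracy measures can be computed as below, where kfEstimate and pfEstimate are illustrative stand-ins for the Kalman and particle filter outputs rather than actual runs.

```r
# Log-bias and log-MSE of a particle filter estimate relative to the
# Kalman filter reference; values are illustrative.
kfEstimate <- c(0.00, 0.10, -0.20, 0.30)
pfEstimate <- c(0.01, 0.12, -0.18, 0.29)
logBias <- log(mean(abs(pfEstimate - kfEstimate)))
logMSE  <- log(mean((pfEstimate - kfEstimate)^2))
```

Since the per-time errors are smaller than one, the MSE is smaller than the bias and thus logMSE lies below logBias, matching the ordering seen in Table 1.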
3.3. Outlook: Design choices and generalizations

We make a short detour before the tutorial continues with the implementation of the PMH algorithm. In this section, we provide some references to important improvements and further studies of the particle filter. Note that the scientific literature concerning the particle filter is vast and this is only a small and naturally biased selection of topics and references.
Figure 3: Top and middle: A simulated set of observations (top) from the LGSS model and the error in the latent state estimate (middle) using a particle filter with N = 20. Bottom: the estimated log-bias (left) and log-MSE (right) for the particle filter when varying the number of particles N.
More comprehensive surveys and tutorials of particle filtering are given by Doucet and Johansen (2011), Cappé et al. (2007), Arulampalam et al. (2002) and Ala-Luhtala et al. (2016), which also provide pseudo-code. The alterations discussed within these papers can easily be incorporated into the implementation developed within this section. In Section 3.1, we employed the faPF (Pitt and Shephard 1999) to estimate the filtered state and the likelihood. This algorithm makes use of the optimal choices for the particle proposal and weighting function, which corresponds to being able to sample from p(xt+1|xt, yt+1) and to evaluate p(yt+1|xt). This is not possible for many SSMs. One possibility is to create Gaussian approximations of the required quantities; see Doucet et al. (2001) and Pitt et al. (2012). However, these methods rely on quasi-Newton optimization that can be computationally prohibitive if N is large. See Arulampalam et al. (2002) for a discussion and for pseudo-code. Another possibility is to make use of a mixture of proposals and weighting functions in the particle filter as described by Kronander and Schön (2014). This type of filter is based on multiple importance sampling, which is commonly used in, e.g., computer graphics. A different approach is to make use of an additional particle filter to approximate the fully adapted proposal, resulting in a nested construction where one particle filter is run within another to construct its proposal distribution. The resulting construction is referred to as nested sequential Monte Carlo (SMC); see Naesseth et al. (2015) for details. The nested SMC construction makes it possible to consider state-spaces where nx is large, something that other types of particle filters struggle with. The bootstrap particle filter (bPF) is the default choice as it can be employed for most SSMs.
A drawback with the bPF is that it suffers from poor statistical accuracy when the dimension of the state-space nx grows beyond about 10. In this setting, e.g., the faPF or nested SMC is required for state inference. However, there are some approaches that could improve the performance of the bPF in some cases and should be considered for efficient implementations. These include: (i) parallel implementations, (ii) better resampling schemes, (iii) exploiting linear substructures of the SSM and (iv) using quasi-Monte Carlo. Parallel implementations of the particle filter (i) are a topic of ongoing research, but some encouraging results are reported by, e.g., Lee et al. (2010) and Lee and Whiteley (2016). In this tutorial, we make use of multinomial resampling in the particle filter. Alternative resampling schemes (ii) can be useful in decreasing the variance of the estimates; see Hol et al. (2006) and Douc and Cappé (2005) for some comparisons. In general, systematic resampling is recommended for particle filtering. Another possible improvement is the combination of Kalman and particle filtering (iii), which is possible if the model is conditionally linear and Gaussian in some of its states. The idea is then to make use of Kalman filtering to estimate these states and particle filtering for the remaining states, while keeping the linear ones fixed to their Kalman estimates. These types of models are common in engineering, and Rao-Blackwellization schemes like this can lead to a substantial decrease in variance. See, e.g., Doucet et al. (2000), Chen and Liu (2000) and Schön et al. (2005) for more information and comparisons. Quasi-Monte Carlo (iv) is based on so-called quasi-random numbers, which are generated by deterministic recursions to better fill the space compared with standard random numbers. These are useful in standard Monte Carlo to decrease the variance of the estimates.
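As an illustration of point (ii), systematic resampling replaces the N independent draws of multinomial resampling with a single uniform draw that stratifies the whole sample. The following sketch is not part of the paper's implementation; the function name is illustrative:

```r
# Systematic resampling: one uniform draw generates N evenly spaced
# points in (0, 1), which are matched against the cumulative weights.
# This typically gives lower variance than multinomial resampling.
systematicResampling <- function(weights) {
  N <- length(weights)
  u <- (runif(1) + 0:(N - 1)) / N            # N ordered, stratified points
  cumWeights <- cumsum(weights / sum(weights))
  indices <- rep(0, N)
  j <- 1
  for (i in 1:N) {
    # Advance to the first particle whose cumulative weight covers u[i]
    while (cumWeights[j] < u[i]) j <- j + 1
    indices[i] <- j
  }
  indices
}
```

In the particle filter from Section 3.1, this function could be used as a drop-in replacement for the call to sample() in the multinomial resampling step.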
The use of quasi-Monte Carlo in particle filtering is discussed by Gerber and Chopin (2015). Finally, particle filtering is an instance of the SMC method (Del Moral et al. 2006), which represents a general class of algorithms based on importance sampling. SMC can be employed for inference in many statistical models and is a complement/alternative to MCMC. A particularly interesting member is the SMC2 algorithm (Chopin et al. 2013; Fulop and Li 2013), which is essentially a two-level particle filter similar to nested SMC. The outer particle filter maintains a particle system targeting πθ(θ), where each particle contains a hypothesis of the value of θ. The inner particle filter targets πt(xt) and is run for each particle in the outer filter. The likelihood estimates from the inner filter are used to compute the weights in the outer filter. Hence, SMC2 is an alternative to PMH for parameter inference; see Svensson and Schön (2016) for a comparison.
4. Estimating the parameters in a linear Gaussian SSM
This tutorial now proceeds with the main event, where the PMH algorithm is implemented according to the outline provided in Section 2. The particle filter implementation from Section 3.1 is employed to approximate the acceptance probability (13) in the PMH algorithm. We keep the LGSS model as a toy example to illustrate the implementation and provide an outlook of the PMH literature at the end of this section. A prior is required to be able to carry out Bayesian parameter inference. The objective in this section is to estimate the parameter posterior for θ = φ while keeping {σv, σe} fixed to their true values. Hence, we select the prior p(φ) = TN_{(-1,1)}(φ; 0, 0.5) to ensure that the system is stable (i.e., that the value of xt is bounded for every t). The truncated Gaussian distribution with mean µ and standard deviation σ on the interval z ∈ [a, b] is defined by
$$
\mathcal{TN}_{[a,b]}(z; \mu, \sigma^2) = \mathbb{I}(a < z < b) \, \mathcal{N}(z; \mu, \sigma^2),
$$

where I(s) denotes the indicator function with value one if s is true and zero otherwise.
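In the MH acceptance ratio, the prior only needs to be known up to its normalising constant, since that constant cancels. A sketch of the corresponding log-density is given below; the function name logPrior is illustrative, and we assume that 0.5 in the prior above denotes the variance σ², matching the definition TN(z; µ, σ²):

```r
# Unnormalised log-density of the truncated Gaussian prior
# TN_{(-1,1)}(phi; 0, 0.5): a Gaussian log-density combined with an
# indicator on the support. The missing normalising constant cancels
# in the PMH acceptance probability.
logPrior <- function(phi, mu = 0, sigma2 = 0.5, a = -1, b = 1) {
  if (phi <= a || phi >= b) return(-Inf)   # indicator: outside the support
  dnorm(phi, mean = mu, sd = sqrt(sigma2), log = TRUE)
}

logPrior(0.75)   # finite: inside the support
logPrior(1.50)   # -Inf: proposal would be rejected with probability one
```

Returning -Inf outside (-1, 1) means that any proposed φ violating the stability constraint is rejected automatically.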
4.1. Implementing particle Metropolis-Hastings

The complete source code for the implementation of the PMH algorithm is available in the function particleMetropolisHastings. Its code skeleton is given by:
particleMetropolisHastings <- function(y, initialPhi, sigmav, sigmae,
  noParticles, initialState, noIterations, stepSize) {
  # body developed in the remainder of this section
}
This function has the inputs: y (vector with T + 1 observations), initialPhi (the initial value φ(0) for φ), sigmav, sigmae (parameters {σv, σe}), noParticles, initialState, noIterations (no. PMH iterations K) and stepSize (step size in the proposal). The output from the function particleMetropolisHastings is φ(1:K), i.e., the correlated samples approximately distributed according to the parameter posterior πθ.
phi[1] <- initialPhi
theta <- c(phi[1], sigmav, sigmae)
outputPF <- particleFilter(y, theta, noParticles, initialState)
logLikelihood[1] <- outputPF$logLikelihood
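Before filling in the rest of the loop, it may help to preview the bare random-walk Metropolis mechanics that PMH builds on. The self-contained toy below uses an exact log-density of a Gaussian target in place of the particle filter's log-likelihood estimate; all names are illustrative and not part of the paper's code:

```r
# Random-walk Metropolis on a toy Gaussian target N(1, 2^2): the same
# accept/reject mechanics as in PMH, but with an exact log-target
# instead of a particle filter estimate of the log-likelihood.
set.seed(10)
logTarget <- function(theta) dnorm(theta, mean = 1, sd = 2, log = TRUE)

noIterations <- 5000
stepSize <- 1.0
theta <- rep(0, noIterations)
logPost <- logTarget(theta[1])
for (k in 2:noIterations) {
  thetaProposed <- theta[k - 1] + stepSize * rnorm(1)  # random walk proposal
  logPostProposed <- logTarget(thetaProposed)
  # Accept with probability min(1, exp(logPostProposed - logPost))
  if (log(runif(1)) < logPostProposed - logPost) {
    theta[k] <- thetaProposed
    logPost <- logPostProposed
  } else {
    theta[k] <- theta[k - 1]       # reject: keep the previous sample
  }
}
mean(theta)   # Monte Carlo estimate of the target mean
```

In PMH, the only structural change is that logTarget is replaced by the sum of the log-prior and the log-likelihood estimate returned by particleFilter, which is exactly what the initialization above prepares for.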