
Getting Started with Particle Metropolis-Hastings for Inference in Nonlinear Dynamical Models

Johan Dahlin, Linköping University
Thomas B. Schön, Uppsala University

Abstract

This tutorial provides a gentle introduction to the particle Metropolis-Hastings (PMH) algorithm for parameter inference in nonlinear state-space models together with a software implementation in the statistical programming language R. We employ a step-by-step approach to develop an implementation of the PMH algorithm (and the particle filter within it) together with the reader. The final implementation is also available as the package pmhtutorial from the Comprehensive R Archive Network (CRAN) repository. Throughout the tutorial, we provide some intuition as to how the algorithm operates and discuss some solutions to problems that might occur in practice. To illustrate the use of PMH, we consider parameter inference in a linear Gaussian state-space model with synthetic data and in a nonlinear volatility model with real-world data.

Keywords: state-space models, particle filtering, particle Monte Carlo, stochastic volatility model.

1. Introduction

We are concerned with Bayesian parameter inference in nonlinear state-space models (SSMs). This is an important problem as SSMs are ubiquitous in, e.g., automatic control (Ljung 1999), time series analysis (Durbin and Koopman 2012), finance (Tsay 2005) and many other fields. An overview of some concrete applications of state-space modeling is given by Langrock (2011).

The major problem with Bayesian parameter and state inference in SSMs is that it is analytically intractable and cannot be solved in closed form. Hence, approximations are required, and these typically fall into two categories: analytical approximations and statistical simulation. The focus of this tutorial is on a member of the latter category known as particle Metropolis-Hastings (PMH; Andrieu et al. 2010), which approximates the posterior distribution by employing a specific Markov chain to sample from it. This requires us to be able to compute unbiased point-wise estimates of the posterior. In PMH, these are provided by the particle filter (Gordon et al. 1993; Doucet and Johansen 2011), which is another important and rather general statistical simulation algorithm.

The main aim and contribution of this tutorial is to give a gentle introduction (both in terms of methodology and software) to the PMH algorithm. We assume that the reader is familiar with traditional time series analysis and Kalman filtering at the level of Brockwell and Davis (2002). Some familiarity with Monte Carlo approximations, importance sampling and Markov chain Monte Carlo (MCMC) at the level of Ross (2012) is also beneficial.

The focus is on implementing the particle filter and PMH algorithms step-by-step in the programming language R (R Core Team 2018). We employ literate programming¹ to build up the code using a top-down approach, see Section 1.1 in Pharr and Humphreys (2010). The final implementation is available as an R package pmhtutorial (Dahlin 2018) distributed under the GPL-2 license via the Comprehensive R Archive Network (CRAN) repository and available at https://CRAN.R-project.org/package=pmhtutorial. The R source code is also provided via GitHub with corresponding versions for MATLAB (The MathWorks Inc. 2017) and Python (van Rossum et al. 2011) to enable learning by readers from many different backgrounds, see Section 8 for details.

There are a number of existing tutorials which relate to this one. A thorough introduction to particle filtering is provided by Arulampalam et al. (2002), where a number of different implementations are discussed and pseudo-code is provided to facilitate implementation. This is an excellent start for the reader who is particularly interested in particle filtering. However, they neither discuss nor implement PMH. The software LibBi and the connected tutorial by Murray (2015) include high-performance implementations and examples of particle filtering and PMH. The focus of the present tutorial is a bit different, as it tries to explain all the steps of the algorithms and provides basic implementations in several different programming languages. Finally, Ala-Luhtala et al. (2016) present an advanced tutorial paper on twisted particle filtering. It also includes pseudo-code for PMH, but its focus is on particle filtering and it does not discuss, e.g., how to tune the proposal in PMH. We return to related work and software throughout this tutorial in Sections 3.3, 4.3, 6 and 7.

The disposition of this tutorial is as follows. In Section 2, we introduce the Bayesian parameter and state inference problem in SSMs and discuss how and why the PMH algorithm works. In Sections 3 and 4, we implement the particle filter and PMH algorithm for a toy model and study their properties, as well as give a small survey of results suitable for further study. In Sections 5 and 6, the PMH implementation is adapted to solve a real-world problem from finance. Furthermore, some more advanced modifications and improvements to PMH are discussed and exemplified. Finally, we conclude this tutorial with Section 7, where related software implementations are discussed.
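For readers who prefer to experiment with the finished code before working through the step-by-step development, a minimal way to obtain and run it is sketched below. The example function names are assumed to match the scripts discussed in Sections 3-5; consult the package documentation if they differ in the installed version.

install.packages("pmhtutorial")
library("pmhtutorial")

# Run the worked examples from this tutorial (assumed exported names).
example1_lgss()  # state estimation in the LGSS model (Section 3)
example2_lgss()  # parameter inference in the LGSS model (Section 4)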

2. Overview of the PMH algorithm

In this section, we introduce the Bayesian inference problems in SSMs related to the parameters and the latent state. The PMH algorithm is then introduced to solve these intractable problems. The inclusion of the particle filter into PMH is discussed from two complementary perspectives: as an approximation of the acceptance probability and as auxiliary variables in an augmented posterior distribution.

¹The skeleton of the code is outlined as a collection of different variables denoted by ⟨name⟩. These parts of the code are then populated with content using a C++-like syntax. Hence, the operator = results in the assignment of a piece of code to the relevant variable (overwriting what is already stored within it). On the other hand, += denotes the operation of appending some additional code after the current code stored in the variable. The reader is strongly encouraged to print the source code from GitHub and read it in parallel with the tutorial to enable rapid learning of the material.

[Graphical model: a Markov chain of latent states $\cdots \to x_{t-1} \to x_t \to x_{t+1} \to \cdots$ (top), where each observation $y_{t-1}, y_t, y_{t+1}, \ldots$ (bottom) depends only on the corresponding state.]

Figure 1: The structure of an SSM represented as a graphical model.

2.1. Bayesian inference in SSMs

In Figure 1, we present a graphical model of the SSM with the latent states (top) and the observations (bottom), but without exogenous inputs². The latent state $x_t$ only depends on the previous state $x_{t-1}$ of the process as it is a (first-order) Markov process. That is, all the information in the past states $x_{0:t-1} \triangleq \{x_s\}_{s=0}^{t-1}$ is summarized in the most recent state $x_{t-1}$. The observations $y_{1:t}$ are conditionally independent as they only relate to each other through the states.

We assume that the states and observations are real-valued, i.e., $y_t \in \mathcal{Y} \subseteq \mathbb{R}^{n_y}$ and $x_t \in \mathcal{X} \subseteq \mathbb{R}^{n_x}$. The initial state $x_0$ is distributed according to the density $\mu_\theta(x_0)$ parameterized by some unknown static parameters $\theta \in \Theta \subset \mathbb{R}^p$. The latent state and the observation processes³ are governed by the densities $f_\theta(x_t \mid x_{t-1})$ and $g_\theta(y_t \mid x_t)$, respectively. The density describing the state process gives the probability that the next state is $x_t$ given the previous state $x_{t-1}$. An analogous interpretation holds for the observation process. With the notation in place, a general nonlinear SSM can be expressed as

$$x_0 \sim \mu_\theta(x_0), \qquad x_t \mid x_{t-1} \sim f_\theta(x_t \mid x_{t-1}), \qquad y_t \mid x_t \sim g_\theta(y_t \mid x_t), \qquad (1)$$

where no notational distinction is made between a random variable and its realization. The main objective in Bayesian parameter inference is to obtain an estimate of the parameters $\theta$ given the measurements $y_{1:T}$ by computing the parameter posterior (distribution) given by

$$\pi_\theta(\theta) \triangleq p(\theta \mid y_{1:T}) = \frac{p(\theta)\, p(y_{1:T} \mid \theta)}{\int_\Theta p(\theta')\, p(y_{1:T} \mid \theta')\, \mathrm{d}\theta'} \triangleq \frac{\gamma_\theta(\theta)}{Z_\theta}, \qquad (2)$$

where $p(\theta)$ and $p(y_{1:T} \mid \theta) \triangleq p_\theta(y_{1:T})$ denote the parameter prior (distribution) and the data distribution, respectively. The un-normalized posterior distribution denoted by $\gamma_\theta(\theta) = p(\theta)\, p(y_{1:T} \mid \theta)$ is an important component in the construction of PMH. The denominator $Z_\theta$ is usually referred to as the marginal likelihood or the model evidence.

²It is straightforward to modify the algorithms presented for vector-valued quantities and also to include a known exogenous input $u_t$.

³In this tutorial, we assume that $x_t \mid x_{t-1}$ and $y_t \mid x_t$ can be modeled as continuous random variables with a density function. However, the algorithms discussed can be applied to deterministic states and discrete states/observations as well.

The data distribution $p_\theta(y_{1:T})$ is intractable for most interesting SSMs due to the fact that the state is unknown. The problem can be seen by studying the decomposition

$$p_\theta(y_{1:T}) = p_\theta(y_1) \prod_{t=2}^{T} p_\theta(y_t \mid y_{1:t-1}), \qquad (3)$$

where the so-called predictive likelihood is given by

$$p_\theta(y_t \mid y_{1:t-1}) = \iint g_\theta(y_t \mid x_t)\, f_\theta(x_t \mid x_{t-1})\, \pi_{t-1}(x_{t-1})\, \mathrm{d}x_t\, \mathrm{d}x_{t-1}. \qquad (4)$$

The marginal filtering distribution $\pi_{t-1}(x_{t-1}) \triangleq p_\theta(x_{t-1} \mid y_{1:t-1})$ can be obtained from the Bayesian filtering recursions (Anderson and Moore 2005), which for a fixed $\theta$ are given by

$$\pi_t(x_t) = \frac{g_\theta(y_t \mid x_t)}{p_\theta(y_t \mid y_{1:t-1})} \int_{\mathcal{X}} f_\theta(x_t \mid x_{t-1})\, \pi_{t-1}(x_{t-1})\, \mathrm{d}x_{t-1}, \qquad 0 < t \leq T, \qquad (5)$$

with $\pi_0(x_0) = \mu_\theta(x_0)$ as the initialization. In theory, it is possible to construct a sequence of filtering distributions by iteratively applying (5). However, in practice this strategy cannot be implemented for many models of interest as the marginalization over $x_{t-1}$ (expressed by the integral) cannot be carried out in closed form.

2.2. Constructing the Markov chain

The PMH algorithm offers an elegant solution to both of these intractabilities by leveraging statistical simulation. Firstly, a particle filter is employed to approximate $\pi_{t-1}(x_{t-1})$ and obtain point-wise unbiased estimates of the likelihood. Secondly, an MCMC algorithm (Robert and Casella 2004) known as Metropolis-Hastings (MH) is employed to approximate $\pi_\theta(\theta)$ based on the likelihood estimator provided by the particle filter. PMH emerges as the combination of these two algorithms.

MCMC methods generate samples from a so-called target distribution by constructing a specific Markov chain, which can be used to form an empirical approximation. In this tutorial, the target distribution will be either the parameter posterior $\pi_\theta(\theta)$ or the joint posterior of the parameters and states $\pi_{\theta,T}(\theta, x_{0:T}) = p(\theta, x_{0:T} \mid y_{1:T})$. We focus here on the former case and the latter follows analogously. Executing the PMH algorithm results in $K$ correlated samples $\theta^{(1:K)} \triangleq \{\theta^{(k)}\}_{k=1}^{K}$, which can be used to construct a Monte Carlo approximation of $\pi_\theta(\theta)$. This empirical approximation of the parameter posterior distribution is given by

$$\widehat{\pi}_\theta^K(\mathrm{d}\theta) = \frac{1}{K} \sum_{k=1}^{K} \delta_{\theta^{(k)}}(\mathrm{d}\theta), \qquad (6)$$

which corresponds to a collection of Dirac delta functions $\delta_{\theta'}(\mathrm{d}\theta)$ located at $\theta = \theta'$ with equal weights. In practice, histograms or kernel density estimators are employed to visualize the estimate of the parameter posterior obtained from (6).

The construction of the Markov chain within PMH amounts to carrying out two steps to obtain one sample from $\pi_\theta(\theta)$. The first step is to propose a so-called candidate parameter $\theta'$ from a proposal distribution $q(\theta' \mid \theta^{(k-1)})$ given the previous state of the Markov chain denoted by $\theta^{(k-1)}$. The user is quite free to choose this proposal, but its support should cover the support of the target distribution. The second step is to determine if the state of the Markov chain should be changed to the candidate parameter $\theta'$ or if it should remain in the previous state $\theta^{(k-1)}$. This decision is stochastic and the candidate parameter is assigned as the next state of the Markov chain, i.e., $\theta^{(k)} \leftarrow \theta'$, with a certain acceptance probability $\alpha(\theta', \theta^{(k-1)})$ which is given by

$$\alpha\big(\theta', \theta^{(k-1)}\big) = \min\left\{1,\; \frac{\pi_\theta(\theta')\, q\big(\theta^{(k-1)} \mid \theta'\big)}{\pi_\theta\big(\theta^{(k-1)}\big)\, q\big(\theta' \mid \theta^{(k-1)}\big)}\right\} = \min\left\{1,\; \frac{\gamma_\theta(\theta')\, q\big(\theta^{(k-1)} \mid \theta'\big)}{\gamma_\theta\big(\theta^{(k-1)}\big)\, q\big(\theta' \mid \theta^{(k-1)}\big)}\right\}, \qquad (7)$$

where $Z_\theta$ cancels as it is independent of the current state of the Markov chain. The stochastic behavior introduced by (7) facilitates exploration of (in theory) the entire posterior and is also the necessary condition for the Markov chain to actually have the target as its stationary distribution. The latter is known as detailed balance, which ensures that the Markov chain is reversible and has the correct stationary distribution. That is, it ensures that the samples generated by PMH actually are from $\pi_\theta(\theta)$. Hence, the stochastic acceptance decision can be seen as a correction of the Markov chain generated by the proposal distribution. The stationary distribution of the corrected Markov chain is the desired target distribution. In a sense this is similar to importance sampling, where samples from a proposal distribution are corrected by weighting to be approximately distributed according to the target distribution.

The intuition for the acceptance probability (7) (disregarding the influence of the proposal $q$) is that we always accept a candidate parameter $\theta'$ if $\pi_\theta(\theta') > \pi_\theta(\theta^{(k-1)})$, i.e., if $\theta'$ increases the value of the target compared with the previous state $\theta^{(k-1)}$. This results in a mode-seeking behavior which is similar to that of an optimization algorithm estimating the maximum of the posterior distribution. However, from (7) we also note that a small decrease in the posterior value can be accepted to facilitate exploration of the entire posterior. This is the main difference between PMH and an optimization algorithm: the former aims to map the entire posterior, whereas the latter only would like to find the location of its mode.
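To make these accept/reject mechanics concrete before particle filtering enters the picture, the following minimal sketch runs a standard random-walk MH sampler on a toy target where $\gamma(\theta)$ can be evaluated exactly (a Gaussian with mean 1). This example is not part of the tutorial's implementation; the target and step size are chosen arbitrarily for illustration.

# Random-walk Metropolis-Hastings on a toy Gaussian target.
# The proposal is symmetric, so its ratio cancels in (7).
logTarget <- function(theta) dnorm(theta, mean = 1, sd = 2, log = TRUE)

K <- 5000
theta <- rep(0, K)
for (k in 2:K) {
  thetaProposed <- theta[k - 1] + 0.5 * rnorm(1)  # symmetric random walk
  acceptProbability <- min(1, exp(logTarget(thetaProposed) - logTarget(theta[k - 1])))
  if (runif(1) < acceptProbability) {
    theta[k] <- thetaProposed   # accept the candidate
  } else {
    theta[k] <- theta[k - 1]    # keep the previous state
  }
}
mean(theta)  # close to 1, the mean of the toy target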

2.3. Approximating test functions

In Bayesian parameter inference, the interest often lies in computing the expectation of a so-called test function, which is a well-behaved (integrable) function $\varphi: \Theta \to \mathbb{R}^{n_\varphi}$ mapping the parameters to a value on the real space. The expectation with respect to the parameter posterior is given by

$$\pi_\theta[\varphi] \triangleq \mathbb{E}\big[\varphi(\theta) \mid y_{1:T}\big] = \int_\Theta \varphi(\theta)\, \pi_\theta(\theta)\, \mathrm{d}\theta, \qquad (8)$$

where, e.g., choosing $\varphi(\theta) = \theta$ corresponds to computing the mean of the parameter posterior. The expectation in (8) can be estimated via the empirical approximation in (6) according to

$$\widehat{\pi}_\theta^K[\varphi] \triangleq \int_\Theta \varphi(\theta)\, \widehat{\pi}_\theta^K(\mathrm{d}\theta) = \frac{1}{K} \sum_{k=1}^{K} \int_\Theta \varphi(\theta)\, \delta_{\theta^{(k)}}(\mathrm{d}\theta) = \frac{1}{K} \sum_{k=1}^{K} \varphi\big(\theta^{(k)}\big), \qquad (9)$$

where the last equality follows from the properties of the Dirac delta function. This estimator is well-behaved and it is possible to establish a law of large numbers (LLN) and a central limit theorem (CLT), see Meyn and Tweedie (2009) for more information.

From the LLN, we know that the estimator is consistent (and asymptotically unbiased as $K \to \infty$). Moreover, from the CLT we know that the error is approximately Gaussian with a variance decreasing as $1/K$, which is the usual Monte Carlo rate. Note that the LLN usually assumes independent samples, but the ergodic theorem gives a similar result (under some assumptions) even when $\theta^{(1:K)}$ are correlated⁴. However, the asymptotic variance is usually larger than if the samples were uncorrelated.
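As a concrete illustration of (9), once samples from the posterior are available, any such expectation is estimated by averaging the test function over the samples. A minimal sketch, using synthetic draws as a stand-in for PMH output:

# Estimating posterior expectations from samples via (9).
theta <- rnorm(5000, mean = 0.7, sd = 0.05)  # stand-in for PMH output

mean(theta)        # phi(theta) = theta: posterior mean estimate
mean(theta > 0.7)  # phi(theta) = I(theta > 0.7): posterior probability estimate
var(theta)         # posterior variance estimate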

2.4. Approximating the acceptance probability

One major obstacle remains to be solved before the PMH algorithm can be implemented. The acceptance probability (7) is intractable and cannot be computed as the likelihood is unknown. From above, we know that the particle filter can provide an unbiased point-wise estimator of the likelihood and therefore also of the posterior $\pi_\theta(\theta)$. It turns out that the unbiasedness property is crucial for PMH to be able to provide a valid empirical approximation of $\pi_\theta(\theta)$. This approach is known as an exact approximation, because the likelihood is replaced by an estimate but the algorithm still retains its validity, see Andrieu and Roberts (2009) for more information.

The particle filter generates a set of samples from $\pi_t(x_t)$ for each $t$, which can be used to create an empirical approximation. These samples are generated by sequentially applying importance sampling to approximate the solution to (5). The approximation of the filtering distribution is then given by

$$\widehat{\pi}_t^N(\mathrm{d}x_t) \triangleq \widehat{p}_\theta^N(\mathrm{d}x_t \mid y_{1:t}) = \sum_{i=1}^{N} \underbrace{\frac{v_t^{(i)}}{\sum_{j=1}^{N} v_t^{(j)}}}_{\triangleq\, w_t^{(i)}}\, \delta_{x_t^{(i)}}(\mathrm{d}x_t), \qquad (10)$$

where the particles $x_t^{(i)}$ and their normalized weights $w_t^{(i)}$ constitute the so-called particle system generated during a run of the particle filter. The un-normalized weights are denoted by $v_t^{(i)}$. It is possible to prove that the empirical approximation converges to the true distribution when $N \to \infty$ under some regularity assumptions⁵. The likelihood can then be estimated via the decomposition (3) by inserting the empirical filtering distribution (10) into (4). This results in the estimator

$$\widehat{p}_\theta^N(y_{1:T}) = \widehat{p}_\theta^N(y_1) \prod_{t=2}^{T} \widehat{p}_\theta^N(y_t \mid y_{1:t-1}) = \prod_{t=1}^{T} \left[\frac{1}{N} \sum_{i=1}^{N} v_t^{(i)}\right], \qquad (11)$$

where a point-wise estimate of the un-normalized parameter posterior is given by

$$\widehat{\gamma}_\theta^N(\theta) = p(\theta)\, \widehat{p}_\theta^N(y_{1:T}). \qquad (12)$$

⁴The theoretical details underpinning MH have been deliberately omitted. For the ergodic theorem to hold, the Markov chain needs to be ergodic, which means that the Markov chain should be able to explore the entire state-space and not get stuck in certain parts of it. The interested reader can find more about the theoretical details in, e.g., Tierney (1994) and Meyn and Tweedie (2009) for MH and Andrieu et al. (2010) for PMH.

⁵Again we have omitted many of the important theoretical details regarding the particle filter. For more information about these and many other important topics related to this algorithm, see, e.g., Crisan and Doucet (2002) and Doucet and Johansen (2011).

Here, it is assumed that the prior can be computed point-wise by closed-form expressions. The acceptance probability (7) can be approximated by plugging in (12) resulting in

$$\alpha\big(\theta', \theta^{(k-1)}\big) = \min\left\{1,\; \frac{\widehat{\gamma}_\theta^N(\theta')\, q\big(\theta^{(k-1)} \mid \theta'\big)}{\widehat{\gamma}_\theta^N\big(\theta^{(k-1)}\big)\, q\big(\theta' \mid \theta^{(k-1)}\big)}\right\} = \min\left\{1,\; \frac{p(\theta')\, \widehat{p}_{\theta'}^N(y_{1:T})\, q\big(\theta^{(k-1)} \mid \theta'\big)}{p\big(\theta^{(k-1)}\big)\, \widehat{p}_{\theta^{(k-1)}}^N(y_{1:T})\, q\big(\theta' \mid \theta^{(k-1)}\big)}\right\}. \qquad (13)$$

This approximation is only valid if $\widehat{p}_\theta^N(y_{1:T})$ is an unbiased estimator of the likelihood. Fortunately, this is the case when employing the particle filter, as its likelihood estimator is unbiased (Del Moral 2013; Pitt et al. 2012) for any $N \geq 1$.

As in the parameter inference problem, interest often lies in computing an expectation of a well-behaved test function $\varphi: \mathcal{X} \to \mathbb{R}^{n_\varphi}$ with respect to $\pi_t(x_t)$ given by

$$\pi_t[\varphi] \triangleq \mathbb{E}\big[\varphi(x_t) \mid y_{1:t}\big] = \int_{\mathcal{X}} \varphi(x_t)\, \pi_t(x_t)\, \mathrm{d}x_t,$$

where again choosing the test function $\varphi(x_t) = x_t$ corresponds to computing the mean of the marginal filtering distribution. This expectation is intractable as $\pi_t(x_t)$ is unknown. Again, we replace it with an estimate provided by the empirical approximation (10), which results in

$$\widehat{\pi}_t^N[\varphi] \triangleq \int_{\mathcal{X}} \varphi(x_t)\, \widehat{\pi}_t^N(\mathrm{d}x_t) = \sum_{i=1}^{N} w_t^{(i)} \int_{\mathcal{X}} \varphi(x_t)\, \delta_{x_t^{(i)}}(\mathrm{d}x_t) = \sum_{i=1}^{N} w_t^{(i)}\, \varphi\big(x_t^{(i)}\big), \qquad (14)$$

for some $0 \leq t \leq T$ by the properties of the Dirac delta function. In the subsequent sections, we are especially interested in the estimator for the mean of the marginal filtering distribution,

$$\widehat{x}_t^N \triangleq \widehat{\pi}_t^N[x] = \sum_{i=1}^{N} w_t^{(i)}\, x_t^{(i)}, \qquad (15)$$

which is referred to as the filtered state estimate. Under some assumptions, the properties of $\widehat{\pi}_t^N[\varphi]$ are similar to those of the estimator in the PMH algorithm, see Crisan and Doucet (2002) and Doucet and Johansen (2011) for more information. Hence, we have that the estimator is consistent (and asymptotically unbiased when $N \to \infty$) and the error is approximately Gaussian with a variance decreasing as $1/N$.

2.5. Outlook: Pseudo-marginal algorithms

The viewpoint adopted in the previous discussion on PMH is that it is an MH algorithm which employs a particle filter to approximate the otherwise intractable acceptance probability. Another, more general viewpoint is to consider PMH as a pseudo-marginal algorithm (Andrieu and Roberts 2009). In this type of algorithm, some auxiliary variables are introduced to facilitate computations, but these are marginalized away during a run of the algorithm. As a result, the PMH algorithm can be seen as a standard MH algorithm operating on the non-standard extended space $\Theta \times \mathcal{U}$, where $\Theta$ and $\mathcal{U}$ denote the parameter space and the space of the auxiliary variables, respectively. The resulting extended target is given by

$$\pi_{\theta,u}(\theta, u) = \widehat{\pi}_\theta^N(\theta \mid u)\, m(u). \qquad (16)$$

Here, the parameter posterior is augmented with $u \in \mathcal{U}$, which denotes some multivariate random variables with density $m(u)$. From the discussion above, we know that $u$ can be used to construct a point-wise estimator of the un-normalized posterior via the particle filter by (12). The unbiasedness property of the likelihood estimator based on the particle filter gives

$$\mathbb{E}_u\left[\widehat{\gamma}_\theta^N(\theta \mid u)\right] = \int_{\mathcal{U}} \widehat{\gamma}_\theta^N(\theta \mid u)\, m(u)\, \mathrm{d}u = \gamma_\theta(\theta). \qquad (17)$$

This means that the un-normalized parameter posterior $\gamma_\theta(\theta)$ can be recovered by marginalizing over all the auxiliary variables $u$. In the implementation, this means that the sampled values of $u$ can simply be disregarded. Hence, we will not store them in the subsequent implementations but only keep the samples of $\theta$ (and $x_t$). We conclude by deriving the pseudo-marginal algorithm following from (16). A proposal for $\theta$ and $u$ is selected with the form

$$q\big(\theta', u' \mid \theta^{(k-1)}, u^{(k-1)}\big) = q_\theta\big(\theta' \mid \theta^{(k-1)}, u^{(k-1)}\big)\, q_u\big(u' \mid u^{(k-1)}\big), \qquad (18)$$

$$q_\theta\big(\theta' \mid \theta^{(k-1)}, u^{(k-1)}\big) = q_\theta\big(\theta' \mid \theta^{(k-1)}\big), \qquad q_u\big(u' \mid u^{(k-1)}\big) = m(u'). \qquad (19)$$

This corresponds to an independent proposal for $u$ and a proposal for $\theta$ that does not include $u$. Other options are a topic of current research, see Section 4.3 for some references. The resulting acceptance probability from this choice of target and proposal is given by

$$\alpha\big(\theta', u', \theta^{(k-1)}, u^{(k-1)}\big) = \min\left\{1,\; \frac{\pi_{\theta,u}(\theta', u')\, q\big(\theta^{(k-1)}, u^{(k-1)} \mid \theta', u'\big)}{\pi_{\theta,u}\big(\theta^{(k-1)}, u^{(k-1)}\big)\, q\big(\theta', u' \mid \theta^{(k-1)}, u^{(k-1)}\big)}\right\} = \min\left\{1,\; \frac{\widehat{\gamma}_\theta^N(\theta' \mid u')\, q_\theta\big(\theta^{(k-1)} \mid \theta'\big)}{\widehat{\gamma}_\theta^N\big(\theta^{(k-1)} \mid u^{(k-1)}\big)\, q_\theta\big(\theta' \mid \theta^{(k-1)}\big)}\right\}.$$

Note that the resulting acceptance probability is the same as in (13). These arguments provide a sketch of the proof that PMH generates samples from the target distribution and were first presented by Flury and Shephard (2011). For a more formal treatment and proof of the validity of PMH, see Andrieu and Roberts (2009) and Andrieu et al. (2010).


Figure 2: Simulated data from the LGSS model with latent state (orange), observations (green) and autocorrelation function (ACF) of the observations (light green).

3. Estimating the states in a linear Gaussian SSM

We are now ready to start implementing the PMH algorithm based on the material in Section 2. To simplify the presentation, this section discusses how to estimate the filtering distributions $\{\pi_t(x_t)\}_{t=0}^{T}$ for a linear Gaussian state-space (LGSS) model. These distributions can be used to compute $\widehat{x}_{0:T}^N$ and $\widehat{p}_\theta^N(y_{1:T})$, i.e., the estimate of the filtered states and the estimate of the likelihood, respectively. The parameter inference problem is treated in the subsequent section. The particular LGSS model considered is given by

$$x_0 \sim \delta_{x_0}(x), \qquad x_t \mid x_{t-1} \sim \mathcal{N}\big(x_t;\, \phi x_{t-1},\, \sigma_v^2\big), \qquad y_t \mid x_t \sim \mathcal{N}\big(y_t;\, x_t,\, \sigma_e^2\big), \qquad (20)$$

where the parameters are denoted by $\theta = \{\phi, \sigma_v, \sigma_e\}$. Here, $\phi \in (-1, 1)$ determines the persistence of the state, while $\sigma_v, \sigma_e \in \mathbb{R}_+$ denote the standard deviations of the state transition noise and the observation noise, respectively. The Gaussian density is denoted by $\mathcal{N}(x; \mu, \sigma^2)$ with mean $\mu$ and standard deviation $\sigma > 0$. Figure 2 presents a simulated data record from the model with $T = 250$ observations using $\theta = \{0.75, 1.00, 0.10\}$ and $x_0 = 0$. The complete code for the data generation is available in generateData; a sketch of how such a function can be written is given below.
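Generating the data amounts to a direct simulation of (20). The following is a minimal sketch of what generateData might look like; the interface of the actual function in the package may differ.

# Simulate T observations from the LGSS model (20).
# theta = c(phi, sigmav, sigmae); the initial state is deterministic.
generateData <- function(theta, T, initialState) {
  phi <- theta[1]
  sigmav <- theta[2]
  sigmae <- theta[3]
  x <- rep(initialState, T + 1)
  y <- rep(0, T + 1)
  for (t in 2:(T + 1)) {
    x[t] <- phi * x[t - 1] + sigmav * rnorm(1)  # state transition
    y[t] <- x[t] + sigmae * rnorm(1)            # observation
  }
  list(x = x, y = y)
}

data <- generateData(c(0.75, 1.00, 0.10), T = 250, initialState = 0)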

3.1. Implementing the particle filter

The complete source code for the implementation of the particle filter is available in the function particleFilter. Its code skeleton is given by:

particleFilter <- function(y, theta, noParticles, initialState) {
  # initialization
  for (t in 2:T) {
    # resampling, propagation, weighting and estimation
  }
  # return the estimates
}

The function particleFilter has the inputs: y (vector with $T + 1$ observations), theta (parameters $\{\phi, \sigma_v, \sigma_e\}$), noParticles and initialState. The outputs will be the estimates of the filtered states $\widehat{x}_{0:T}^N$ and the estimate of the likelihood $\widehat{p}_\theta^N(y_{1:T})$. Note that particleFilter iterates over $t = 2, \ldots, T$, which corresponds to time indices $t = 1, \ldots, T - 1$ in (20) as the numbering of vector elements starts from 1 in R. The iteration terminates at time index $T - 1$ as the future observation $y_{t+1}$ is required at each iteration. That is, we assume that the new observation arrives to the algorithm between the propagation and weighting steps. It therefore makes sense to use this information in the weighting step when possible.

The particle filter is initialized by allocating memory for the variables: particles, ancestorIndices, weights and normalizedWeights. These correspond to the particles, their ancestors, the un-normalized weights and the normalized weights, respectively. Moreover, the two variables xHatFiltered and logLikelihood are allocated to store $\widehat{x}_{0:T}^N$ and $\log \widehat{p}_\theta^N(y_{1:T})$, respectively. This is all implemented by:

T <- length(y) - 1

particles <- matrix(0, nrow = noParticles, ncol = T + 1)
ancestorIndices <- matrix(0, nrow = noParticles, ncol = T + 1)
weights <- matrix(1, nrow = noParticles, ncol = T + 1)
normalizedWeights <- matrix(0, nrow = noParticles, ncol = T + 1)
xHatFiltered <- matrix(0, nrow = T, ncol = 1)
logLikelihood <- 0

For the LGSS model (20), we have that $\mu_\theta(x_0) = \delta_{x_0}(x_0)$, so all the particles are initially set to $x_0^{(1:N)} = x_0 = 0$ and all weights to $w_0^{(1:N)} = 1/N$ (as all particles are identical). This operation is implemented by:

ancestorIndices[, 1] <- 1:noParticles
particles[, 1] <- initialState
xHatFiltered[, 1] <- initialState
normalizedWeights[, 1] <- 1 / noParticles

The particles in the filter correspond to a number of different hypotheses of what the value of the latent state could be. The weights represent the probability that a given particle has generated the observation under the model. In the resampling step, this information is used to focus the computational effort of the particle filter on the relevant part of the state-space, i.e., on particles with large weights. This is done by randomly duplicating particles with large weights and discarding particles with small weights, such that the total number of particles always remains the same. Note that the resampling step is unbiased in the sense that the expected proportions of the resampled particles are given by the particle weights⁶. This step is important as the particle system otherwise would consist of only a single unique particle after a few iterations, which would result in a large variance in (14).

In our implementation, we make use of multinomial resampling, which is also known as a weighted bootstrap with replacement. The output from this procedure is the set of so-called ancestor indices $a_t^{(i)}$, where $a_t^{(i)}$ can be interpreted as the parent index of particle $i$ at time $t$. For each $i = 1, \ldots, N$, the ancestor index is sampled from the multinomial (categorical) distribution with

$$\mathbb{P}\left[a_t^{(i)} = j\right] = w_{t-1}^{(j)}, \qquad j = 1, \ldots, N.$$

The resampling step is implemented by a call to the built-in function sample:

newAncestors <- sample(noParticles, replace = TRUE,
  prob = normalizedWeights[, t - 1])
ancestorIndices[, 1:(t - 1)] <- ancestorIndices[newAncestors, 1:(t - 1)]
ancestorIndices[, t] <- newAncestors

where the resulting ancestor indices $a_t^{(1:N)}$ are returned as newAncestors and stored in ancestorIndices for bookkeeping. Note that the particle lineages are also resampled at the same time as the particles. All this is done to keep track of the genealogy of the particle system over time, that is, the entire history of the particle system.

The hypotheses regarding the state are updated by propagating the particle system forward in time using the model. This corresponds to sampling new particles $x_t^{(i)}$ from the particle proposal distribution

$$x_t^{(i)} \,\Big|\, x_{t-1}^{a_t^{(i)}} \sim p_\theta\Big(x_t \,\Big|\, x_{t-1}^{a_t^{(i)}}, y_t\Big), \qquad (21)$$

where information from the previous particle $x_{t-1}^{a_t^{(i)}}$ and the current measurement $y_t$ can be included. There are a few different choices for the particle proposal and the most common one is to make use of the model itself, i.e., $p_\theta(x_t \mid x_{t-1}, y_t) = f_\theta(x_t \mid x_{t-1})$. However, for the LGSS model an optimal proposal can be derived using the properties of the Gaussian distribution, resulting in

$$p_\theta^{\text{opt}}\Big(x_t \,\Big|\, x_{t-1}^{a_t^{(i)}}, y_t\Big) \propto g_\theta(y_t \mid x_t)\, f_\theta\Big(x_t \,\Big|\, x_{t-1}^{a_t^{(i)}}\Big) = \mathcal{N}\Big(x_t;\; \sigma^2\Big[\sigma_e^{-2}\, y_t + \sigma_v^{-2}\, \phi\, x_{t-1}^{a_t^{(i)}}\Big],\; \sigma^2\Big), \qquad (22)$$

with $\sigma^{-2} = \sigma_v^{-2} + \sigma_e^{-2}$. This particle proposal is optimal in the sense that it minimizes the variance of the incremental particle weights at the current time step⁷.

⁶This unbiasedness property is crucial for the PMH algorithm to be valid. As a consequence, adaptive methods (Doucet and Johansen 2011, Section 3.5) that only resample the particle system at certain iterations cannot be used within PMH.

⁷However, it is unclear exactly how this influences the entire particle system, i.e., if this is the globally optimal choice.

For most other SSMs, the optimal proposal is intractable and the state transition model is used instead. The propagation step is implemented by:

part1 <- (sigmav^(-2) + sigmae^(-2))^(-1)
part2 <- sigmae^(-2) * y[t]
part2 <- part2 + sigmav^(-2) * phi * particles[newAncestors, t - 1]
particles[, t] <- part1 * part2 + rnorm(noParticles, 0, sqrt(part1))

From (22), the ratio between the noise variances is seen to determine the shape of the proposal. Essentially, there are two extreme situations: (i) $\sigma_e^2/\sigma_v^2 \ll 1$ and (ii) $\sigma_e^2/\sigma_v^2 \gg 1$. In the first extreme (i), the location of the proposal is essentially governed by $y_t$ and the scale is mainly determined by $\sigma_e$. This corresponds to essentially simulating particles from $g_\theta(y_t \mid x_t)$. When $\sigma_e$ is small, this usually allows for running the particle filter using only a small number of particles. In the second extreme (ii), the proposal essentially simulates from $f_\theta(x_t \mid x_{t-1})$ and does not take the observation into account. In summary, the performance and characteristics of the optimal proposal are therefore related to the noise variances of the model and their relative sizes.

In the weighting step, the weights required for the resampling step are computed. The weights can be computed using different so-called weighting functions, and a standard choice is the observation model, i.e., $v_t(x_t) = g_\theta(y_t \mid x_t)$. However, for the LGSS model we can instead derive an optimal weighting function by again applying the properties of the Gaussian distribution to obtain

$$v_t^{(i)} \triangleq p\big(y_{t+1} \mid x_t^{(i)}\big) = \int g_\theta(y_{t+1} \mid x_{t+1})\, f_\theta\big(x_{t+1} \mid x_t^{(i)}\big)\, \mathrm{d}x_{t+1} = \mathcal{N}\big(y_{t+1};\; \phi\, x_t^{(i)},\; \sigma_v^2 + \sigma_e^2\big).$$

Remember that in this implementation, the new observation yt+1 is introduced in the particle filter between the propagation and the weighting steps for the first time. This is the reason for the slightly complicated choice of time indices. However, the inclusion of the information in yt+1 in this step is beneficial as it improves the accuracy of the filter. That is, we keep particles that after propagation are likely to result in the observation yt+1. In many situations, the resulting weights are small and it is therefore beneficial to work with shifted log-weights to avoid problems with numerical precision. This is done by applying the transformation

$$\widetilde{v}_t^{(i)} = \log v_t^{(i)} - v_{\max}, \qquad i = 1, \ldots, N,$$

where $v_{\max}$ denotes the largest element of $\log v_t^{(1:N)}$. The weights are then normalized (ensuring that they sum to one) by

$$w_t^{(i)} = \frac{\exp\big(\widetilde{v}_t^{(i)}\big)}{\sum_{j=1}^{N} \exp\big(\widetilde{v}_t^{(j)}\big)} = \frac{\exp(-v_{\max}) \exp\big(\log v_t^{(i)}\big)}{\exp(-v_{\max}) \sum_{j=1}^{N} \exp\big(\log v_t^{(j)}\big)} = \frac{v_t^{(i)}}{\sum_{j=1}^{N} v_t^{(j)}}, \qquad (23)$$

where the shifts $-v_{\max}$ cancel and do not affect the relative sizes of the weights. Hence, the first and third expressions in (23) are equivalent, but the first expression enjoys better numerical precision. The computation of the weights is implemented by:

yhatMean <- phi * particles[, t]
yhatVariance <- sqrt(sigmae^2 + sigmav^2)  # standard deviation passed to dnorm
weights[, t] <- dnorm(y[t + 1], yhatMean, yhatVariance, log = TRUE)

maxWeight <- max(weights[, t])
weights[, t] <- exp(weights[, t] - maxWeight)
sumWeights <- sum(weights[, t])
normalizedWeights[, t] <- weights[, t] / sumWeights

We remind the reader that we compare y[t + 1] and particles[, t] due to the convention for indexing arrays in R; these correspond to $y_t$ and $x_{t-1}^{(1:N)}$ in the model, respectively. This is also the reason for the comment regarding the for-loop in particleFilter: we now see that the weights depend on the next observation, and that is why the loop needs to run to $T - 1$ and not $T$.

The likelihood $p_\theta(y_{1:t})$ can be estimated by inserting the un-normalized weights into (11). However, it is beneficial to instead estimate the log-likelihood to enjoy better numerical precision, as the likelihood often is small. This results in rewriting (11) to obtain the recursion

$$\log \widehat{p}_\theta^N(y_{1:t}) = \log \widehat{p}_\theta^N(y_{1:t-1}) + v_{\max} + \log\left[\sum_{i=1}^{N} \exp\big(\widetilde{v}_t^{(i)}\big)\right] - \log N,$$

where the log-shifted weights are used to increase the numerical precision. Note that the shift by $v_{\max}$ needs to be compensated for in the estimation of the log-likelihood. This recursion is implemented by:

predictiveLikelihood <- maxWeight + log(sumWeights) - log(noParticles)
logLikelihood <- logLikelihood + predictiveLikelihood

The estimate of the posterior distribution follows from inserting the estimate of the log-likelihood into (12).

The latent state $x_t$ at time $t$ can be estimated by (15) given the observations $y_{1:(t+1)}$, which corresponds to the estimator

$$\widehat{x}_t^N = \frac{1}{N} \sum_{i=1}^{N} x_t^{(i)}, \qquad (24)$$

which is implemented by:

xHatFiltered[t] <- mean(particles[, t])

Number of particles (N)    10      20      50      100     200     500      1000
log-bias                  −3.70   −3.96   −4.57   −4.85   −5.19   −5.67    −6.08
log-MSE                   −6.94   −7.49   −8.72   −9.29   −9.91   −10.87   −11.67

Table 1: The log-bias and the log-MSE of the filtered states for varying N.

Note that this corresponds to the weights $w_t^{(i)} = 1/N$, which, according to the expression in (23), would correspond to all particles being identical. However, this is not the case in general; the estimator (24) is instead the result of making use of the optimal choices in the propagation and weighting steps of the particle filter. This choice corresponds to a particular type of algorithm known as the fully-adapted particle filter (faPF; Pitt and Shephard 1999), see Section 3.3 for more details. In the faPF, we make use of separate weights in the resampling step and when constructing the empirical approximation of $\pi_t(x_t)$. In Section 5, we introduce another type of particle filter with a different estimator for $\widehat{x}_t^N$, which can be used for most SSMs. The outputs from particleFilter are returned by:

list(xHatFiltered = xHatFiltered, logLikelihood = logLikelihood)

3.2. Numerical illustration

We make use of the implementation of particleFilter to estimate the filtered state and to investigate the properties of this estimate for finite $N$. The complete implementation is found in the script/function example1_lgss. We use the data presented in Figure 2, which was generated in the beginning of this section. The estimates from the particle filter are compared with the corresponding estimates from the Kalman filter. The latter follows from using the properties of the Gaussian distribution to solve (5) exactly, which is only possible for an LGSS model. In the middle of Figure 3, we present the difference between the optimal state estimate from the Kalman filter and the estimate from the particle filter using $N = 20$ particles. Two alternative measures of accuracy are the bias (absolute error) and the mean square error (MSE) of the state estimate. These are computed according to

$$\text{Bias}\big(\widehat{x}_t^N\big) = \frac{1}{T} \sum_{t=1}^{T} \big|\widehat{x}_t^N - \widehat{x}_t\big|, \qquad \text{MSE}\big(\widehat{x}_t^N\big) = \frac{1}{T} \sum_{t=1}^{T} \big(\widehat{x}_t^N - \widehat{x}_t\big)^2,$$

where $\widehat{x}_t$ denotes the optimal state estimate obtained by the Kalman filter. In Table 1 and in the bottom part of Figure 3, we present the logarithm of the bias and the MSE for different values of $N$. We note that the bias and the MSE decrease rapidly when increasing $N$. Hence, we conclude that for this model $N$ does not need to be large for $\widehat{x}_t^N$ to be accurate.
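For completeness, a minimal sketch of a Kalman filter for the scalar LGSS model (20) is given below. This is the standard textbook recursion (not code from the package) and computes the reference estimate $\widehat{x}_t$ used in the comparison above, under the convention that y[1] corresponds to time 0 and is unused.

# Kalman filter for the LGSS model (20): exact filtered means xHat.
kalmanFilter <- function(y, theta, initialState) {
  phi <- theta[1]
  sigmav <- theta[2]
  sigmae <- theta[3]
  T <- length(y) - 1
  xHat <- rep(initialState, T + 1)
  P <- 0  # zero initial variance: x0 is known exactly
  for (t in 2:(T + 1)) {
    xPredicted <- phi * xHat[t - 1]                  # prediction step
    PPredicted <- phi^2 * P + sigmav^2
    K <- PPredicted / (PPredicted + sigmae^2)        # Kalman gain
    xHat[t] <- xPredicted + K * (y[t] - xPredicted)  # measurement update
    P <- (1 - K) * PPredicted
  }
  xHat
}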

3.3. Outlook: Design choices and generalizations

We make a short detour before the tutorial continues with the implementation of the PMH algorithm. In this section, we provide some references to important improvements and further studies of the particle filter. Note that the scientific literature concerning the particle filter is vast and this is only a small and naturally biased selection of topics and references.


Figure 3: Top and middle: A simulated set of observations (top) from the LGSS model and the error in the latent state estimate (middle) using a particle filter with N = 20. Bottom: the estimated log-bias (left) and log-MSE (right) for the particle filter when varying the number of particles N. 16 Getting Started with PMH for Inference in Nonlinear Dynamical Models

More comprehensive surveys and tutorials on particle filtering are given by Doucet and Johansen (2011), Cappé et al. (2007), Arulampalam et al. (2002) and Ala-Luhtala et al. (2016), which also provide pseudo-code. The alterations discussed within these papers can easily be incorporated into the implementation developed within this section.

In Section 3.1, we employed the faPF (Pitt and Shephard 1999) to estimate the filtered state and likelihood. This algorithm made use of the optimal choices for the particle proposal and weighting function, which corresponds to being able to sample from $p(x_{t+1} \mid x_t, y_{t+1})$ and to evaluate $p(y_{t+1} \mid x_t)$. This is not possible for many SSMs. One possibility is to create Gaussian approximations of the required quantities, see Doucet et al. (2001) and Pitt et al. (2012). However, these methods rely on quasi-Newton optimization that can be computationally prohibitive if $N$ is large. See Arulampalam et al. (2002) for a discussion and for pseudo-code. Another possibility is to make use of a mixture of proposals and weighting functions in the particle filter as described by Kronander and Schön (2014). This type of filter is based on multiple importance sampling, which is commonly used in, e.g., computer graphics. A different approach is to make use of an additional particle filter to approximate the fully adapted proposal, resulting in a nested construction where one particle filter is run within another to construct a proposal distribution for it. The resulting construction is referred to as nested sequential Monte Carlo (SMC), see Naesseth et al. (2015) for details. The nested SMC construction makes it possible to consider state-spaces where $n_x$ is large, something that other types of particle filters struggle with.

The bootstrap particle filter (bPF) is the default choice as it can be employed for most SSMs. A drawback with the bPF is that it suffers from poor statistical accuracy when the dimension of the state-space $n_x$ grows beyond about 10. In this setting, e.g., the faPF or nested SMC are required for state inference. However, there are some approaches which could possibly improve the performance of the bPF in some cases and should be considered for efficient implementations. These include: (i) parallel implementations, (ii) better resampling schemes, (iii) making use of linear substructures of the SSM and (iv) using quasi-Monte Carlo.

Parallel implementations of the particle filter (i) are a topic of ongoing research, but some encouraging results are reported by, e.g., Lee et al. (2010) and Lee and Whiteley (2016). In this tutorial, we make use of multinomial resampling in the particle filter. Alternative resampling schemes (ii) can be useful in decreasing the variance of the estimates, see Hol et al. (2006) and Douc and Cappé (2005) for some comparisons. In general, systematic resampling is recommended for particle filtering; a sketch is given below. Another possible improvement is the combination of Kalman and particle filtering (iii), which is possible if the model is conditionally linear and Gaussian in some of its states. The idea is then to make use of Kalman filtering to estimate these states and particle filtering for the remaining states while keeping the linear ones fixed to their Kalman estimates. These types of models are common in engineering, and Rao-Blackwellization schemes like this can lead to a substantial decrease in variance. See, e.g., Doucet et al. (2000), Chen and Liu (2000) and Schön et al. (2005) for more information and comparisons.
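As an illustration of alternative (ii), the following is a minimal sketch of systematic resampling; under the assumption that the weights are normalized, it could replace the call to sample in the resampling step above. Systematic resampling is unbiased in the sense required by PMH, since the expected number of copies of each particle is proportional to its weight.

# Systematic resampling: a single uniform draw is stratified over N
# equally spaced positions, which typically gives a lower resampling
# variance than multinomial resampling.
systematicResampling <- function(normalizedWeights) {
  N <- length(normalizedWeights)
  positions <- (runif(1) + 0:(N - 1)) / N
  cumulativeWeights <- cumsum(normalizedWeights)
  newAncestors <- rep(0, N)
  i <- 1
  for (j in 1:N) {
    while (cumulativeWeights[i] < positions[j]) {
      i <- i + 1
    }
    newAncestors[j] <- i
  }
  newAncestors
}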
Quasi-Monte Carlo (iv) is based on so-called quasi-random numbers, which are generated by deterministic recursions to better fill the space compared with standard random numbers. These are useful in standard Monte Carlo to decrease the variance of estimates. The use of quasi-Monte Carlo in particle filtering is discussed by Gerber and Chopin (2015).

Finally, particle filtering is an instance of the SMC method (Del Moral et al. 2006), which represents a general class of algorithms based on importance sampling. SMC can be employed for inference in many statistical models and is a complement/alternative to MCMC. A particularly interesting member is the SMC² algorithm (Chopin et al. 2013; Fulop and Li 2013), which basically is a two-level particle filter similar to nested SMC. The outer particle filter maintains a particle system targeting $\pi_\theta(\theta)$. The inner particle filter targets $\pi_t(x_t)$ and is run for each particle in the outer filter, as these contain the hypotheses of the value of $\theta$. The likelihood estimates from the inner filter are used to compute the weights in the outer filter. Hence, SMC² is an alternative to PMH for parameter inference, see Svensson and Schön (2016) for a comparison.

4. Estimating the parameters in a linear Gaussian SSM

This tutorial now proceeds with the main event, where the PMH algorithm is implemented according to the outline provided in Section 2. The particle filter implementation from Section 3.1 is employed to approximate the acceptance probability (13) in the PMH algorithm. We keep the LGSS model as a toy example to illustrate the implementation and provide an outlook of the PMH literature at the end of this section.

A prior is required to be able to carry out Bayesian parameter inference. The objective in this section is to estimate the parameter posterior for $\theta = \phi$ while keeping $\{\sigma_v, \sigma_e\}$ fixed to their true values. Hence, we select the prior $p(\phi) = \mathcal{TN}_{(-1,1)}(\phi;\, 0, 0.5)$ to ensure that the system is stable (i.e., that the value of $x_t$ is bounded for every $t$). The truncated Gaussian distribution with mean $\mu$ and standard deviation $\sigma$ on the interval $z \in [a, b]$ is defined by

$$\mathcal{TN}_{[a,b]}(z;\, \mu, \sigma^2) = \mathbb{I}(a < z < b)\, \mathcal{N}(z;\, \mu, \sigma^2),$$

where $\mathbb{I}(s)$ denotes the indicator function with value one if $s$ is true and zero otherwise.
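In code, the indicator and the Gaussian part can be handled separately. The following is a minimal sketch of a log-prior evaluation (a hypothetical helper, not part of the package, and assuming the 0.5 in the prior above denotes the variance):

# Log-density of the (un-normalized) truncated Gaussian prior for phi.
logPriorPhi <- function(phi, mu = 0, sigma = sqrt(0.5), a = -1, b = 1) {
  if (phi <= a || phi >= b) {
    return(-Inf)  # indicator: zero density outside (a, b)
  }
  dnorm(phi, mean = mu, sd = sigma, log = TRUE)
}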

4.1. Implementing particle Metropolis-Hastings

The complete source code for the implementation of the PMH algorithm is available in the function particleMetropolisHastings. Its code skeleton is given by:

particleMetropolisHastings <- function(y, initialPhi, sigmav, sigmae,
    noParticles, initialState, noIterations, stepSize) {
  # initialization
  for (k in 2:noIterations) {
    # propose parameters, compute the acceptance probability, accept/reject
  }
  phi
}

This function has the inputs: y (vector with $T + 1$ observations), initialPhi ($\phi^{(0)}$, the initial value for $\phi$), sigmav, sigmae (parameters $\{\sigma_v, \sigma_e\}$), noParticles, initialState, noIterations (the number of PMH iterations $K$) and stepSize (the step size in the proposal). The output from particleMetropolisHastings is $\phi^{(1:K)}$, i.e., the correlated samples approximately distributed according to the parameter posterior $\pi_\theta$.

We allocate some variables to store the current state of the Markov chain phi, the proposed state phiProposed, the current log-likelihood logLikelihood and the proposed log-likelihood logLikelihoodProposed. Furthermore, we allocate the binary variable proposedPhiAccepted assuming the value 1 if the proposed parameter is accepted and 0 otherwise. Finally, we run the particle filter with the parameters $\{\phi^{(0)}, \sigma_v, \sigma_e\}$ to estimate the initial likelihood. The initialization is implemented by:

phi <- matrix(0, nrow = noIterations, ncol = 1)
phiProposed <- matrix(0, nrow = noIterations, ncol = 1)
logLikelihood <- matrix(0, nrow = noIterations, ncol = 1)
logLikelihoodProposed <- matrix(0, nrow = noIterations, ncol = 1)
proposedPhiAccepted <- matrix(0, nrow = noIterations, ncol = 1)

phi[1] <- initialPhi
theta <- c(phi[1], sigmav, sigmae)
outputPF <- particleFilter(y, theta, noParticles, initialState)
logLikelihood[1] <- outputPF$logLikelihood

There are many choices for the proposal distribution as discussed in Section 2. In this tutorial, we make use of a Gaussian proposal given by

$$q\big(\theta' \mid \theta^{(k-1)}\big) = \mathcal{N}\big(\theta';\, \theta^{(k-1)}, \epsilon^2\big), \qquad (25)$$

where $\epsilon > 0$ denotes the step size of the random walk, i.e., the standard deviation of the increment. The proposal step is implemented by:

phiProposed[k] <- phi[k - 1] + stepSize * rnorm(1)

The acceptance probability (13) can be simplified since (25) is symmetric in $\theta$, i.e.,

$$q\big(\theta' \mid \theta^{(k-1)}\big) = q\big(\theta^{(k-1)} \mid \theta'\big),$$

so that (13) can be expressed as

$$\alpha\big(\theta', \theta^{(k-1)}\big) = \min\left\{1,\; \exp\left(\log\left[\frac{p(\theta')}{p\big(\theta^{(k-1)}\big)}\right] + \log\left[\frac{\widehat{p}_{\theta'}^N(y_{1:T})}{\widehat{p}_{\theta^{(k-1)}}^N(y_{1:T})}\right]\right)\right\}.$$

The log-likelihoods and log-priors are used to avoid loss of numerical precision. The prior implies that $\alpha(\theta', \theta^{(k-1)}) = 0$ when $|\phi| > 1$. Therefore, the particle filter is not run in this case to save computations and the acceptance probability is set to zero to ensure that $\theta'$ is rejected. The computation of the acceptance probability is implemented by:

if (abs(phiProposed[k]) < 1.0) {
  theta <- c(phiProposed[k], sigmav, sigmae)
  outputPF <- particleFilter(y, theta, noParticles, initialState)
  logLikelihoodProposed[k] <- outputPF$logLikelihood
}

# Gaussian part of the truncated prior; the truncation is handled by
# the indicator in the last line below.
priorPart <- dnorm(phiProposed[k], log = TRUE)
priorPart <- priorPart - dnorm(phi[k - 1], log = TRUE)
likelihoodDifference <- logLikelihoodProposed[k] - logLikelihood[k - 1]
acceptProbability <- exp(priorPart + likelihoodDifference)
acceptProbability <- acceptProbability * (abs(phiProposed[k]) < 1.0)

Finally, we need to make a decision on accepting or rejecting the proposed parameter. This is done by simulating a uniform random variable $\omega$ over $[0, 1]$ using the built-in R command runif. We accept $\theta'$ if $\omega < \alpha(\theta', \theta^{(k-1)})$ by storing it and its corresponding log-likelihood as the current state of the Markov chain. Otherwise, we keep the values of the state and the log-likelihood from the previous iteration. The accept/reject step is implemented by:

uniformRandomVariable <- runif(1)

if (uniformRandomVariable < acceptProbability) {
  phi[k] <- phiProposed[k]
  logLikelihood[k] <- logLikelihoodProposed[k]
  proposedPhiAccepted[k] <- 1
} else {
  phi[k] <- phi[k - 1]
  logLikelihood[k] <- logLikelihood[k - 1]
  proposedPhiAccepted[k] <- 0
}

4.2. Numerical illustration

We make use of particleMetropolisHastings to estimate $\theta = \phi$ using the data in Figure 2. The complete implementation and code is available in the function/script example2_lgss. We initialize the Markov chain at $\phi^{(0)} = 0.5$ and make use of K = 5,000 iterations (noIterations), discarding the first 1,000 iterations (noBurnInIterations) as burn-in. That is, we only make use of the last 4,000 samples to construct the empirical approximation of the parameter posterior, to make certain that the Markov chain has in fact reached its stationary regime, see Section 6 for more information. In Figure 4, three runs of PMH are presented using different step sizes: $\epsilon = 0.01$ (left), $0.10$ (center) and $0.50$ (right). The resulting estimate of the posterior mean is $\widehat{\phi} = 0.66$ for the case $\epsilon = 0.10$. This is computed by

R> mean(phi[noBurnInIterations:noIterations])

From the autocorrelation function (ACF), we see that the choice of $\epsilon$ influences the correlation in the Markov chain. A good choice of $\epsilon$ is therefore important to obtain an efficient algorithm. We return to discussing this issue in Section 6.3.


Figure 4: The estimate of $\pi_\theta[\phi]$ in the LGSS model obtained with the PMH algorithm using three different step sizes: $\epsilon = 0.01$ (left), $0.10$ (center) and $0.50$ (right). Top: the estimate of $\pi_\theta$ presented as a histogram and kernel density estimate (solid line). Middle: the state of the Markov chain during 1,000 iterations after the burn-in. Bottom: the estimated ACF of the Markov chain. Dotted lines in the top and middle plots indicate the estimate of the posterior mean. The dotted lines in the bottom plot indicate the 95% confidence intervals of the ACF coefficients.

Number of observations (T)      10      20      50      100     200     500
Estimated posterior mean        0.596   0.794   0.765   0.727   0.696   0.719
Estimated posterior variance    0.040   0.013   0.009   0.006   0.003   0.001

Table 2: The estimated posterior mean and variance when varying T.

We note that the parameter estimate differs slightly from the true value 0.75 and that the uncertainty in the estimate of the parameter posterior is rather large. This is due to the relatively small sample size $T$ (and the finite $K$). From the asymptotic theory of the Bayesian estimator, we know that the posterior mass tends to concentrate around the true parameter as $T$ (and $K$) tends to infinity. We exemplify this in Table 2 by estimating the posterior mean and variance using the same setup while $T$ increases. This small study supports the claim that the true parameter is indeed recovered by the posterior mean estimate in the limit. However, the rate of this convergence is determined by the model and therefore it is not possible to give any general guidelines for how large $T$ needs to be to achieve a certain accuracy.

4.3. Outlook: Generalizations

We devote this section to giving some references for important improvements and further studies of the PMH algorithm. As for the particle filter, the scientific literature related to PMH is vast and quickly growing. This is therefore only a small and biased selection of recent developments. For a broader view of parameter inference in SSMs, see, e.g., the surveys by Kantas et al. (2015) and Schön et al. (2015).

As discussed in Section 2, the PMH algorithm is a member of the family of exact approximation or pseudo-marginal algorithms (Andrieu and Roberts 2009). Here, the particle filter is used to estimate the target, but it is also possible to make use of, e.g., importance sampling for this end. PMH is also an instance of the so-called particle MCMC (PMCMC; Andrieu et al. 2010) framework, which also includes particle versions of Gibbs sampling (Andrieu et al. 2010; Lindsten et al. 2014). PMCMC algorithms are a topic of current research and much effort has been given to improve their performance, see Section 6 for some examples of this effort.

In this tutorial, we make use of an independent proposal for $u$ as discussed in Section 2.5. This essentially means that all particle filters are independent. However, intuitively there could be something to gain by correlating the particle filters, as the state estimates are often quite similar for small changes in $\theta$. It turns out that correlating $u$ results in a positive correlation in the estimates of the log-likelihood, which decreases the variance of the estimate of the acceptance probability. In practice, this means that $N$ does not need to scale as rapidly with $T$ as in the case when $u$ are uncorrelated between iterations. This is particularly useful when $T$ is large, as it decreases the computational cost of PMH. See Dahlin et al. (2015a), Choppala et al. (2016) and Deligiannidis et al. (2018) for more information and source code.

5. Application example: Estimating the volatility in OMXS30

We continue with a concrete application of the PMH algorithm to infer the parameters of a stochastic volatility (SV; Hull and White 1987) model. This is a nonlinear SSM with

Gaussian noise, and inference in this type of model is an important problem as the log-volatility (the latent state in this model) is useful for risk management and for pricing various financial contracts. See, e.g., Tsay (2005) and Hull (2009) for more information. A particular parameterization of the SV model is given by

$$x_0 \sim \mathcal{N}\left(x_0;\, \mu,\, \frac{\sigma_v^2}{1 - \phi^2}\right), \qquad (26a)$$
$$x_{t+1} \mid x_t \sim \mathcal{N}\big(x_{t+1};\, \mu + \phi(x_t - \mu),\, \sigma_v^2\big), \qquad (26b)$$
$$y_t \mid x_t \sim \mathcal{N}\big(y_t;\, 0,\, \exp(x_t)\big), \qquad (26c)$$

where the parameters are denoted by $\theta = \{\mu, \phi, \sigma_v\}$. Here, $\mu \in \mathbb{R}$, $\phi \in [-1, 1]$ and $\sigma_v \in \mathbb{R}_+$ denote the mean value, the persistence and the standard deviation of the state process, respectively. Note that this model is quite similar to the LGSS model, but here the state $x_t$ scales the variance of the observation noise. Hence, we have Gaussian observations with zero mean and a state-dependent standard deviation given by $\exp(x_t/2)$.
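To build intuition for (26), one can simulate directly from the model. The following is a minimal sketch (a hypothetical helper, not part of the package, with parameter values chosen only for illustration):

# Simulate T observations from the SV model (26).
simulateSVmodel <- function(theta, T) {
  mu <- theta[1]
  phi <- theta[2]
  sigmav <- theta[3]
  x <- rep(0, T + 1)
  y <- rep(0, T + 1)
  x[1] <- mu + sigmav / sqrt(1 - phi^2) * rnorm(1)  # stationary initial state
  for (t in 2:(T + 1)) {
    x[t] <- mu + phi * (x[t - 1] - mu) + sigmav * rnorm(1)
    y[t] <- exp(x[t] / 2) * rnorm(1)  # log-return with state-dependent scale
  }
  list(x = x, y = y)
}

sim <- simulateSVmodel(c(0, 0.98, 0.16), T = 500)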

In econometrics, volatility is another word for standard deviation and therefore $x_t$ is known as the log-volatility. The measurements $y_t$ in this model are so-called log-returns,

$$y_t = 100 \log\left(\frac{s_t}{s_{t-1}}\right) = 100\big[\log(s_t) - \log(s_{t-1})\big],$$

where $s_t$ denotes the price of some financial asset (e.g., an index, stock or commodity) at time $t$. Here, $\{s_t\}_{t=1}^{T}$ are the daily closing prices of the NASDAQ OMXS30 index, i.e., a weighted average of the 30 most traded stocks at the Stockholm stock exchange. We extract the data from Quandl⁸ for the period between January 2, 2012 and January 2, 2014. The resulting log-returns are presented in the top part of Figure 5. Note the time-varying persistent volatility in the log-returns, i.e., periods of small and large variations. This is known as the volatility clustering effect and is one of the features of real-world data that SV models aim to capture. Looking at (26), this can be achieved when $|\phi|$ is close to one and $\sigma_v$ is small, as these choices result in a first-order autoregressive process with a large autocorrelation.
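Computing the log-returns from a vector of closing prices is a one-liner in R. A minimal sketch, assuming the prices have already been loaded into a numeric vector s (e.g., downloaded from the URL in the footnote):

# Percentage log-returns from daily closing prices s:
# diff(log(s)) gives log(s_t) - log(s_{t-1}) for t = 2, ..., T.
y <- 100 * diff(log(s))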

The objectives in this application are to estimate the parameters $\theta$ and the log-volatility $x_{0:T}$ from the observed data $y_{1:T}$. We can estimate both quantities using PMH, as samples from the posteriors of both the parameters and the states can be obtained at each iteration of the algorithm. To complete the SV model, we assume some priors for the parameters based on domain knowledge of their usual ranges, i.e.,

$$p(\mu) = \mathcal{N}(\mu;\, 0, 1), \qquad p(\phi) = \mathcal{TN}_{[-1,1]}\big(\phi;\, 0.95, 0.05^2\big), \qquad p(\sigma_v) = \mathcal{G}(\sigma_v;\, 2, 10).$$

Here, $\mathcal{G}(a, b)$ denotes a Gamma distribution with shape $a$ and rate $b$, i.e., with mean $a/b$.

5.1. Modifying the implementation

The implementations of the particle filter and PMH need to be adapted to this new model. We outline the necessary modifications by replacing parts of the code in the skeletons of the two algorithms. The resulting implementations and source code are found in the functions particleFilterSVmodel and particleMetropolisHastingsSVmodel, respectively. In the particle filter, we need to modify all steps except the resampling.

⁸The data is available for download from: https://www.quandl.com/data/NASDAQOMX/OMXS30.

In the initialization, we need to simulate the initial particle system from µθ(x0) by replacing:

ancestorIndices[, 1] <- 1:noParticles
normalizedWeights[, 1] <- 1 / noParticles
particles[, 1] <- rnorm(noParticles, mu, sigmav / sqrt(1 - phi^2))

Furthermore, we need to choose a (particle) proposal distribution and a weighting function for the particle filter implementation. For the SV model, we cannot compute closed-form expressions for the optimal choices as for the LGSS model. Therefore, a bPF is employed, which corresponds to making use of the state dynamics as the proposal (21), given by

x_t^(i) | x_{t−1}^{a_t^(i)} ∼ f_θ(x_t | x_{t−1}^{a_t^(i)}) = N(x_t; µ + φ(x_{t−1}^{a_t^(i)} − µ), σ_v²),

and the observation model as the weighting function by

W_t^(i) = g_θ(y_t | x_t^(i)) = N(y_t; 0, exp(x_t^(i))).

These two choices result in the estimator x̂_t^N = π̂_t^N[x] being changed to

x̂_t^N = Σ_{i=1}^N w_t^(i) x_t^(i).

These three alterations to the particle filter are implemented by replacing:

part1 <- mu + phi * (particles[newAncestors, t - 1] - mu)
particles[, t] <- part1 + rnorm(noParticles, 0, sigmav)

yhatMean <- 0
# Note: exp(x_t / 2) is the standard deviation of the observation noise.
yhatStandardDeviation <- exp(particles[, t] / 2)
weights[, t] <- dnorm(y[t - 1], yhatMean, yhatStandardDeviation, log = TRUE)
xHatFiltered[t] <- sum(normalizedWeights[, t] * particles[, t])

We also need to generalize the PMH code to handle more than one parameter. This is straightforward and we refer the reader to the source code for the necessary changes. The major change is to replace the vectors phi and phiProposed with the matrices theta and thetaProposed. That is, the state of the Markov chain is now three-dimensional, corresponding to the three elements in θ. Moreover, the parameter proposal is selected as a multivariate Gaussian distribution centered around the previous parameters θ^(k−1) with covariance matrix Σ = diag(ε), where ε denotes a vector with three elements (one step size per parameter). This is implemented by:

thetaProposed[k, ] <- rmvnorm(1, mean = theta[k - 1, ], sigma = stepSize)

The computation of the acceptance probability is also altered to include the new prior distributions. For this model, it is required that |φ| < 1 and σ_v > 0 to ensure that the state process is stable and that its standard deviation is positive. This is implemented by:

if ((abs(thetaProposed[k, 2]) < 1.0) && (thetaProposed[k, 3] > 0.0)) {
  res <- particleFilterSVmodel(y, thetaProposed[k, ], noParticles)
  logLikelihoodProposed[k] <- res$logLikelihood
  xHatFilteredProposed[k, ] <- res$xHatFiltered
}

# Log-prior ratios for each parameter (proposed minus current).
priorMu <- dnorm(thetaProposed[k, 1], 0, 1, log = TRUE)
priorMu <- priorMu - dnorm(theta[k - 1, 1], 0, 1, log = TRUE)
priorPhi <- dnorm(thetaProposed[k, 2], 0.95, 0.05, log = TRUE)
priorPhi <- priorPhi - dnorm(theta[k - 1, 2], 0.95, 0.05, log = TRUE)
priorSigmaV <- dgamma(thetaProposed[k, 3], 2, 10, log = TRUE)
priorSigmaV <- priorSigmaV - dgamma(theta[k - 1, 3], 2, 10, log = TRUE)
prior <- priorMu + priorPhi + priorSigmaV

likelihoodDifference <- logLikelihoodProposed[k] - logLikelihood[k - 1]
acceptProbability <- exp(prior + likelihoodDifference)

# Force rejection of candidates that violate the parameter constraints.
acceptProbability <- acceptProbability * (abs(thetaProposed[k, 2]) < 1.0)
acceptProbability <- acceptProbability * (thetaProposed[k, 3] > 0.0)

Furthermore, the remaining parts of the implementation need to be altered to take into account that theta is now a vector; see the source code for the details.

5.2. Estimating the log-volatility

It is possible to compute an estimate of the log-volatility that takes the uncertainty in θ into account by using the pseudo-marginal view of PMH. The particle filter is a deterministic algorithm given the random variables u; hence, u is equivalent to an estimate of the filtering distribution. The result is that it is quite simple to compute the state estimate by including it in the Markov chain generated by PMH.

This is done by modifying the code for the particle filter. After one run of the algorithm, we sample a single trajectory by sampling a particle at time T with probabilities given by w_T^(1:N). We then follow the ancestor lineage back to t = 0 and extract the corresponding path in the state-space. This enables us to obtain a better estimate of the log-volatility, as both past and future information is utilized, which often reduces the variance of the estimate. This is done using the stored resampled ancestor indices and the sampled state trajectory, by replacing:

ancestorIndex <- sample(noParticles, 1, prob = normalizedWeights[, T])
sampledTrajectory <- cbind(ancestorIndices[ancestorIndex, ], 1:(T + 1))
xHatFiltered <- particles[sampledTrajectory]
list(xHatFiltered = xHatFiltered, logLikelihood = logLikelihood)

Figure 5: Top: the daily log-returns (dark green) and estimated log-volatility (orange) with 95% confidence intervals of the NASDAQ OMXS30 index for the period between January 2, 2012 and January 2, 2014. Bottom: the posterior estimate (left), the trace of the Markov chain (middle) and the corresponding ACF (right) of µ (purple), φ (magenta) and σ_v (green) obtained from PMH. The dotted and solid gray lines in the left and middle plots indicate the parameter posterior mean and the parameter priors, respectively.

in the function particleFilterSVmodel. The sampled state trajectory in xHatFiltered is then treated in the same manner as the candidate parameter in PMH, see the source code for details. As a result, we can compute the posterior means of the parameters and the log-volatility, together with their standard deviations, by:

R> thetaStationary <- pmhOutput$theta[noBurnInIterations:noIterations, ]
R> thetaHatMean <- colMeans(thetaStationary)
R> thetaHatStandardDeviation <- apply(thetaStationary, 2, sd)
R> xHatStationary <- pmhOutput$xHatFiltered[noBurnInIterations:noIterations, ]
R> xhatFilteredMean <- colMeans(xHatStationary)
R> xhatFilteredStandardDeviation <- apply(xHatStationary, 2, sd)

where pmhOutput denotes the output variable from particleMetropolisHastingsSVmodel.

5.3. Numerical illustration

We now apply particleMetropolisHastingsSVmodel and particleFilterSVmodel to the SV model using the OMXS30 data introduced in the beginning of this section. The complete implementation is available in the function/code example3_sv. The resulting posterior estimates, traces and ACFs are presented in Figure 5. We see that the Markov chain and the posterior estimate are clearly concentrated around the posterior mean estimate θ̂ = {−0.23, 0.97, 0.15}, with standard deviations {0.37, 0.02, 0.06}. This confirms our belief that the log-volatility is a slowly varying process, as φ is close to one and σ_v is small. An estimate of the log-volatility given this parameter is also computed and presented in the second row of Figure 5. We note that the volatility is larger around t = 100 and t = 370, which corresponds well to what is seen in the log-returns.

6. Selected advanced PMH topics

In this section, we outline a selected number of possible improvements and best practices for the implementation in Section 5. We discuss initialization, convergence diagnostics and how to improve the so-called mixing of the Markov chain. For the latter, we consider three different approaches: tuning the random walk proposal, re-parameterizing the model and tuning the number of particles. The section is concluded by a small survey of more advanced proposal distributions and post-processing methods.

6.1. Initialization

It is important to initialize the Markov chain in areas of high posterior probability mass to obtain an efficient algorithm. In theory, we can initialize the chain anywhere in the parameter space and still obtain a convergent Markov chain. However, in practice, numerical difficulties can occur when the parameter posterior assumes small values or is relatively insensitive to changes in θ. It is therefore advisable to obtain a rough estimate of the parameters and initialize the chain closer to the posterior mode.

In the LGSS model, we can make use of standard optimization algorithms in combination with a Kalman filter to estimate the mode of the posterior. However, this is not possible for general SSMs, as only noisy estimates of the posterior can be obtained from the particle

filter. Standard optimization methods can therefore get stuck in local minima or fail to converge entirely. One approach to mitigate this problem is simultaneous perturbation stochastic approximation (SPSA; Spall 1987), which can be applied in combination with the particle filter to estimate the posterior mode.

6.2. Diagnosing convergence

It is in general difficult to prove that the Markov chain has reached its stationary regime and that the samples obtained are actually samples from πθ(θ). A simple check is to initialize the algorithm at different points in the parameter space and compare the resulting posterior estimates; if the estimates are similar, this gives an indication that the chain has reached its stationary regime.

Another alternative is to make use of the Kolmogorov-Smirnov (KS; Massey 1951) test to establish that the posterior estimate does not change after the burn-in. This is done by dividing the samples {θ^(k)}_{k=1}^K into three partitions: the burn-in and two sets with an equal number of samples from the stationary phase. As the KS test requires uncorrelated samples, thinning can be applied to decrease the autocorrelation. The Markov chain is thinned by keeping every l-th sample, where l is selected such that the autocorrelation between two retained consecutive samples is negligible. A standard KS test is then conducted to conclude whether the samples in the two thinned partitions are from the same (stationary) distribution or not. Other methods for diagnosing convergence are discussed by Robert and Casella (2009) and Gelman et al. (2013).

6.3. Improving mixing

Standard Monte Carlo methods assume that independent samples can be obtained from the distribution of interest or from some instrumental distribution. However, the samples obtained from PMH are correlated, as they are a realization of a Markov chain. Intuitively, such correlated samples contain less information about the target distribution than independent samples: most samples are similar and the target is not fully explored when the autocorrelation in the Markov chain is large. The degree of autocorrelation is known as the mixing of the chain, which is a very important concept in the MCMC literature and the key quantity for comparing the efficiency of different MCMC algorithms.

It turns out that the mixing influences the performance of an MCMC algorithm by scaling the asymptotic variance of the estimates. This is apparent from the CLT governing π̂_θ^K[ϕ] in (9), which under some assumptions9 has the form

√K (π_θ[ϕ] − π̂_θ^K[ϕ]) →d N(0, VAR_π[ϕ] · IACT(θ^(1:K))),   K → ∞,   (27)

when VAR_π[ϕ] = π_θ[(ϕ − π_θ[ϕ])²] < ∞. Here, IACT(θ^(1:K)) denotes the integrated autocorrelation time (IACT) of the Markov chain, which is computed as the area under the ACF. An interpretation of the IACT is that it estimates the number of iterations of an MCMC algorithm between obtaining two uncorrelated samples. Hence, the IACT is one if the samples

9These assumptions include that the Markov chain should reach its stationary regime quickly enough. This requirement is known as uniform ergodicity, which means that the total variation norm between the distribution of the Markov chain and its stationary distribution should decrease at a geometric rate regardless of the initialization of the chain, see, e.g., Tierney (1994) for more details.

are independent, which is the case in, e.g., importance sampling. Minimizing the IACT is therefore the same as maximizing the mixing of an MCMC algorithm. In the right column of Figure 5, we present the ACF for the three parameters in the SV model. This information can be used to compute the corresponding IACTs by

IACT(θ^(1:K)) = 1 + 2 Σ_{τ=1}^∞ ρ_τ(θ^(1:K)),

where ρ_τ = E[(θ^(k) − π_θ[θ])(θ^(k+τ) − π_θ[θ])] / π_θ[(θ − π_θ[θ])²] denotes the autocorrelation coefficient at lag τ for θ^(1:K). In practice, we cannot compute the IACT exactly, as we do not know the autocorrelation coefficients for all possible lags. It is also difficult to estimate ρ_τ when τ is large and K is rather small. A possible solution to this problem is to cut off the ACF estimate at some lag and only consider lags smaller than this limit L. In this tutorial, we make use of L = 100 lags to estimate the IACT by

ÎACT(θ^(1:K)) = 1 + 2 Σ_{τ=1}^{100} ρ̂_τ^K(θ^(1:K)),

where ρ̂_τ^K = COR(ϕ(θ^(k)), ϕ(θ^(k+τ))) denotes the estimate of the lag-τ autocorrelation of ϕ. For the results presented in Figure 5, this corresponds to IACT values of {135, 86, 66} for the three parameters.

Tuning the proposal

The mixing can be improved by tuning the proposal distribution to better fit the posterior distribution. This requires an estimate of the posterior covariance, which acts as a pre-conditioning matrix P. The covariance is quite simple to estimate from a pilot run by

R> var(thetaStationary)

where thetaStationary denotes the trace of the Markov chain generated by PMH after the burn-in has been discarded. For the problem considered in Section 5, this results in the pre-conditioning matrix

1371 −16 15  −4   Pb = 10 −16 5 −10 , 15 −10 31

which can be used to form a new improved proposal given by

q(θ′ | θ^(k−1)) = N(θ′; θ^(k−1), (2.562²/3) · P̂).   (28)

This scaling of the covariance matrix was proposed by Sherlock et al. (2015) to minimize the IACT when sampling from a Gaussian target distribution. In Figure 6, we present the marginal posteriors (in color) for each pair of parameters in (26) together with the original (top) and the tuned (bottom) proposals from (28). The tuned proposals fit the posteriors better, which is expected to improve the mixing of the Markov chain.


Figure 6: The estimated marginal posteriors for µ and φ (purple), µ and σ_v (magenta) and φ and σ_v (green) obtained from PMH. The dark contours (top/bottom) indicate the original proposal from Section 5 and the new (improved) proposal estimated from the pilot run, respectively. Both proposals are centered at the posterior mean estimate.

In Figure 7, we present the results obtained by the implementation from Section 5 when using the tuned proposal instead. This code is available in the function example4_sv. The resulting IACT estimates are {32, 32, 28}, which is a clear improvement in mixing for µ and σ_v compared with the results obtained in Section 5.3. The consequence is that K (and therefore the computational cost) can be cut by a factor of 135/32 ≈ 4.2 while retaining the same variance in the estimates. Note that it is the maximum IACTs that are compared, as these are the limiting factors. We can also compare the support of the posterior estimates in Figures 5 and 7. According to (27), the improved mixing should decrease the asymptotic variance of the estimate. However, no such improvement can be seen in this case. This is probably because the variance in the posterior estimates largely results from the finite amount of information in the data.

Re-parameterizing the model

Another approach to improve mixing is to re-parameterize the model to obtain unconstrained parameters, which can assume any value on the real line. In Section 5, we constrained φ and σ_v to the regions |φ| < 1 and σ_v > 0 to obtain a stable and valid SSM.


Figure 7: Top: the daily log-returns (dark green) and estimated log-volatility (orange) with 95% confidence intervals of the NASDAQ OMXS30 index for the period between January 2, 2012 and January 2, 2014. Bottom: the posterior estimate (left), the trace of the Markov chain (middle) and the corresponding ACF (right) of µ (purple), φ (magenta) and σ_v (green) obtained from PMH. The dotted and solid gray lines in the left and middle plots indicate the parameter posterior mean and the parameter priors, respectively.

This results in poor mixing if many candidate parameters are proposed that violate these constraints, as each such rejection increases the autocorrelation. The problem can be mitigated by a re-parameterization of the model given by

φ = tanh(ψ),   σ_v = exp(ς),   (29)

such that ψ, ς ∈ R are unconstrained parameters. Hence, the target of PMH is changed to the posterior of ϑ = {µ, ψ, ς}. This transformation of the parameters can be compensated for, so that we still obtain samples from the posterior of θ. This is done by taking the Jacobians of (29) into account, which are given by

∂ tanh⁻¹(ψ) / ∂ψ = 1 / (1 − ψ²),   ∂ log(ς) / ∂ς = ς⁻¹.   (30)

The resulting acceptance probability is calculated according to

α(ϑ′, ϑ_{k−1}) = min{ 1, [p(θ′) p̂_{ϑ′}^N(y_{1:T} | u′)] / [p(θ_{k−1}) p̂_{ϑ_{k−1}}^N(y_{1:T} | u_{k−1})] · [(1 − (φ′)²) σ′_v] / [(1 − φ²_{k−1}) σ_{v,k−1}] },   (31)

where the original parameters are used to compute the prior and to estimate the likelihood. Hence, the proposal operates on ϑ to propose a new candidate parameter ϑ′, but (29) is applied to obtain θ′, which is used to compute the acceptance probability (31). After a run of this implementation, samples from the posterior of θ can be obtained by applying (29) to the samples from the posterior of ϑ. To implement this, we need to change the call to the particle filter and the computation of the acceptance probability. The complete implementation and code are available in the function example5_sv. The resulting mean estimate of the parameter posterior is {−0.16, 0.96, 0.17} with standard deviations {0.24, 0.02, 0.06} and IACTs {21, 29, 17}. Note that the maximum IACT is even smaller than when tuning the proposal, which results in a speed-up of 135/29 ≈ 4.7 compared with the implementation in Section 5.

Selecting the number of particles

The particle filter plays an important role in the PMH algorithm and greatly influences the mixing. If the log-likelihood estimates are too noisy (N too small in the particle filter), the chain tends to get stuck for several iterations, which leads to bad performance. This is a result of the fact that occasionally p_θ′(y_{1:T}) ≪ p̂_θ′^N(y_{1:T}), due to the stochasticity of the estimator. Thus, balancing N and K to obtain good performance at a reasonable computational cost is an important problem: a small N might require a large K and vice versa. This problem is investigated by Pitt et al. (2012) and Doucet et al. (2015). A simple rule-of-thumb is to select N such that the standard deviation of the log-likelihood estimates is between 1.0 and 1.7. These results are derived under certain strict assumptions that might not hold in practice, so we would like to investigate this rule-of-thumb in a practical setting.

We make use of the implementation with the tuned proposal and re-run it for different choices of N. We also estimate the standard deviation of the log-likelihood estimator from the particle filter for each choice of N, using the estimate of the posterior mean as the parameters and 1,000 independent Monte Carlo runs. The proposal distribution is tuned using a pilot run as before and the Markov chain is initialized at the estimate of the posterior mean. The IACT is computed using L = 100 lags.

Number of particles (N)                    50     100    200    300    400    500
Standard deviation of p̂_θ^N(y_{1:T})      2.6     1.8    1.2    1.0    0.9    0.7
Acceptance probability (%)                  6      14     22     28     29     33
Maximum IACT                              165     139    128     92    114     97
Time per PMH iteration (s)               0.03    0.05   0.09   0.12   0.15   0.19
Time per sample (s)                         5       7     11     11     17     18

Table 3: The standard deviation of the log-likelihood estimate, the acceptance probability, the maximum IACT for θ, the computational time per iteration of PMH and a benchmark quantity (see main text) for varying N.

Table 3 presents some quantities related to the particle filter and PMH as N is varied (keeping all other settings fixed). According to the rule-of-thumb proposed by Doucet et al. (2015), the optimal choice for N is between 100 and 300, depending on the choice of proposal. The objective for the user is often to minimize the computational time of PMH required to obtain a certain number of uncorrelated samples from the posterior. Hence, a suitable benchmark quantity is the maximum IACT multiplied by the computational time for one iteration. Remember that the IACT is the estimated number of iterations between two uncorrelated samples; the benchmark quantity (last row of Table 3) is therefore an estimate of the computational time per uncorrelated sample. The results are a bit inconclusive, because the IACT is very challenging to estimate and thereby noisy. However, the rule-of-thumb seems to be valid for this model, and setting N to 100 seems to be a good choice.

6.4. Including geometric information

In practical applications, we typically encounter performance issues when the number of parameters p grows beyond 5, or, as previously discussed, when the chain is initialized far from the areas of high posterior probability. This problem occurs even if the rule-of-thumb from Sherlock et al. (2015) is used to tune the proposal. The reason is that the problem lies in the use of a random walk, which is known to explore high-dimensional spaces poorly.

To mitigate this problem, it can be useful to take geometric information about the posterior into account. Girolami and Calderhead (2011) show how to make use of the gradient and the Hessian of the log-posterior to guide the Markov chain to areas of high posterior probability. In their paper, this is motivated by diffusion processes on Riemann manifolds; a perhaps simpler analogy is found in optimization, where we can make use of noisy gradient-ascent or Newton updates in the proposal. Gradient information is useful for guiding the Markov chain towards the area of interest during the burn-in and for keeping it in this area afterwards. The Hessian information can be used to rescale the parameter posterior to make it closer to isotropic, which greatly simplifies sampling.

In Dahlin et al. (2015b), the authors show how to make use of this type of proposal in the PMH algorithm. We refer to the proposal that makes use of only gradient information as PMH1 (for first-order) and to the proposal that makes use of both gradient and Hessian information as PMH2 (for second-order), as sketched below. The challenge here is to obtain good estimates of the gradient and Hessian, which are both analytically intractable for a nonlinear SSM. Furthermore, the computation of these quantities usually requires a particle smoother with a high computational cost, which is prohibitive inside the PMH algorithm. To mitigate this, the authors propose to make use of the faster but less accurate fixed-lag particle smoother and to regularize the Hessian estimate when it is non-positive definite. In Dahlin et al. (2015c), the authors describe a quasi-Newton PMH2 proposal (qPMH2) based on a noisy quasi-Newton update that does not require any Hessian information but constructs a local approximation of the Hessian from gradient information. PMH1 and similar algorithms are theoretically studied by Nemeth et al. (2016), who offer a rule-of-thumb for tuning the step sizes based on an estimate of the posterior covariance.

6.5. Control variates

Control variates are a common and useful variance reduction technique for standard Monte Carlo, which can also be applied to MCMC algorithms such as PMH. Some interesting work along these lines for MH is found in Mira et al. (2013), Papamarkou et al. (2014), Mijatović and Vogrinc (2018) and Dellaportas and Kontoyiannis (2012), which should be quite straightforward to adapt to PMH. However, to the knowledge of the authors of this paper, control variates have not yet been properly investigated for PMH; some encouraging preliminary results are presented in Chapter 4.3 of Dahlin (2016), based on Papamarkou et al. (2014).

7. Related software

There are a number of software packages related to the current tutorial which implement (i) PMH and/or (ii) particle filtering and SMC. These can be used to quickly carry out Bayesian parameter inference in new SSMs or as building blocks for the user who would like to create his/her own implementations of PMH.

The software LibBi (Murray 2015) provides a platform for Bayesian inference in SSMs using both serial and parallel hardware, with a particular focus on high-performance computing. It is written in C++ and allows the user to define new models using a modeling language similar to JAGS (Plummer 2003) or BUGS (Lunn et al. 2009). This enables the user to quickly solve the parameter inference problem using PMH and SMC2 in many different types of SSMs. An example of using LibBi for inference in the SV model introduced in Section 5 is offered via the homepage10, which is helpful in learning how to use the software. Furthermore, interfaces for R and Octave (Eaton et al. 2017) are available as RBi via CRAN and as OctBi via the LibBi homepage, respectively. This software is suitable for the user who would like to leverage the power of multi-core computing to carry out inference quickly on large datasets.

Another alternative is Biips (Todeschini et al. 2014), which is implemented in C++ with interfaces to R and MATLAB via the package Rbiips and the toolbox Matbiips, respectively. The functionality of Biips is similar to LibBi, with the crucial difference being the focus on parallel computation in the latter. Hence, Biips allows the user to define a model using the BUGS language and carry out inference using PMH. The homepage11 connected to Biips contains some example code that implements the model from Section 5.

10http://libbi.org/
11https://biips.github.io/

The R package pomp (King et al. 2016) includes an impressive suite of different inference methods for SSMs. Among others, this includes particle filtering, PMH and also approximate Bayesian computation (ABC; Marin et al. 2012). This software is a good choice for the reader who uses R on a regular basis and would like to explore some recent developments in statistical computing. It also contains a number of pre-defined models for epidemiology. The model specification in pomp is a bit more involved compared with LibBi and Biips, which could be a drawback for some users.

Probabilistic programming languages (PPLs) constitute a fairly recent contribution to software for statistical modeling. They extend graphical models with stochastic branches, such as loops and recursions. Three popular PPLs which allow for carrying out inference using PMH in SSMs are Anglican (Tolpin et al. 2016), Venture (Mansinghka et al. 2014) and Probabilistic C (Paige and Wood 2014). Anglican is the most developed language of the three and runs as a standalone application. Implementing PMH in Anglican requires some additional coding compared with, e.g., LibBi. However, it is a more general framework which allows for inference in a larger class of models.

A C++ template for particle filtering and more general SMC algorithms is provided by SMCTC (Johansen 2009). This template is complemented with an interface to R via the package RcppSMC (Eddelbuettel and Johansen 2018) and extended to parallel computations in C++ via the software vSMC (Zhou 2015). These templates require additional implementation work by the user to be able to carry out inference. They do not allow for easy specification of new models or for carrying out inference using, e.g., PMH directly. However, they give the user full control of the particle filter, which allows for adding extensions and tailoring the algorithm to a specific problem. A parallel implementation of SMC using CUDA is provided by Lee et al. (2010) and the source code is available online12. A MATLAB toolbox for running SMC algorithms in parallel, called DeCo, is developed by Casarin et al. (2015). DeCo makes use of SMC for combining predictive densities, but could potentially be generalized to other applications.

To conclude, we would like to briefly mention some additional software. The Python library pyParticleEst (Nordh 2017) provides functionality for state estimation using different types of particle filters. It also includes parameter estimation using maximum likelihood via the expectation-maximization (EM; Dempster et al. 1977; McLachlan and Krishnan 2008) algorithm. MH is implemented in Ox (Doornik 2007) by Bos (2011) to estimate the states and parameters of the SV model in Section 5; in this setting, MH is applied to estimate both the states and the parameters, so no particle filter is required. The R package smfsb (Wilkinson 2018), accompanying Wilkinson (2011), contains demonstration code for using PMH in a systems biology application. The GitHub repository13 of the first author of this tutorial contains Python code for PMH1, PMH2, qPMH2 (Dahlin et al. 2015b,c) and pseudo-marginal MH with correlated random variables (Dahlin et al. 2015a). This code is quite similar to the Python code supplied with this tutorial and is a good starting point for the interested reader who would like to try out more advanced implementations of PMH.

12http://www.oxford-man.ox.ac.uk/gpuss/cuda_mc.html
13https://www.github.com/compops.

8. Conclusions

We have described the PMH algorithm for Bayesian parameter inference in nonlinear SSMs. This includes the particle filter, which plays an important role in PMH by providing a non-negative and unbiased estimator of the likelihood. Furthermore, we have applied PMH for inference in an LGSS model and an SV model using both synthetic and real-world data. We have also identified and discussed a wide range of practical matters related to PMH, including the initialization and tuning of the parameter proposal, which are important practical problems. Furthermore, we have provided many references for the reader who would like to continue his/her learning of particle filtering and PMH.

The implementation developed within this tutorial can be seen as a compilation of minimal working examples. Hopefully, these code snippets can be of use for the interested reader as a starting point to develop his/her own implementations of the algorithms. The implementations developed in this tutorial are available as the package pmhtutorial from CRAN. Source code for the implementation in R as well as similar code for MATLAB and Python is available from GitHub at https://github.com/compops/pmh-tutorial. See the README.md files in the directory corresponding to each programming language for specific comments and dependencies.

Acknowledgments

This work was supported by the projects Probabilistic modeling of dynamical systems (Contract number: 621-2013-5524) and CADICS, a Linnaeus Center, both funded by the Swedish Research Council, and ASSEMBLE (Contract number: RIT15-0012), funded by the Swedish Foundation for Strategic Research (SSF). The authors would like to give a big thanks to Christian Andersson Naesseth, Wilfried Bonou, Manon Kok, Joel Kronander, Fredrik Lindsten, Andreas Svensson and Patricio Valenzuela for comments and suggestions that greatly improved this tutorial. The majority of the work was carried out while Johan Dahlin was affiliated with the Division of Automatic Control, Linköping University, Sweden.

References

Ala-Luhtala J, Whiteley N, Heine K, Piché R (2016). “An Introduction to Twisted Particle Filters and Parameter Estimation in Non-Linear State-Space Models.” IEEE Transactions on Signal Processing, 64(18), 4875–4890. doi:10.1109/tsp.2016.2563387.

Anderson BDO, Moore JB (2005). Optimal Filtering. Courier Publications.

Andrieu C, Doucet A, Holenstein R (2010). “Particle Markov Chain Monte Carlo Methods.” Journal of the Royal Statistical Society B, 72(3), 269–342. doi:10.1111/j.1467-9868.2009.00736.x.

Andrieu C, Roberts GO (2009). “The Pseudo-Marginal Approach for Efficient Monte Carlo Computations.” The Annals of Statistics, 37(2), 697–725. doi:10.1214/07-aos574.

Arulampalam MS, Maskell S, Gordon N, Clapp T (2002). “A Tutorial on Particle Filters for Online Nonlinear/Non-Gaussian Bayesian Tracking.” IEEE Transactions on Signal Processing, 50(2), 174–188. doi:10.1109/78.978374.

Bos C (2011). “A Bayesian Analysis of Unobserved Component Models Using Ox.” Journal of Statistical Software, 41(13), 1–24. doi:10.18637/jss.v041.i13.

Brockwell PJ, Davis RA (2002). Introduction to Time Series and Forecasting. Springer-Verlag. doi:10.1007/b97391.

Cappé O, Godsill SJ, Moulines E (2007). “An Overview of Existing Methods and Recent Advances in Sequential Monte Carlo.” Proceedings of the IEEE, 95(5), 899–924. doi:10.1109/jproc.2007.893250.

Casarin R, Grassi S, Ravazzolo F, van Dijk H (2015). “Parallel Sequential Monte Carlo for Efficient Density Combination: The DeCo MATLAB Toolbox.” Journal of Statistical Software, 68(3), 1–30. doi:10.18637/jss.v068.i03.

Chen R, Liu JS (2000). “Mixture Kalman Filters.” Journal of the Royal Statistical Society B, 62(3), 493–508. doi:10.1111/1467-9868.00246.

Chopin N, Jacob PE, Papaspiliopoulos O (2013). “SMC2: An Efficient Algorithm for Sequential Analysis of State Space Models.” Journal of the Royal Statistical Society B, 75(3), 397–426. doi:10.1111/j.1467-9868.2012.01046.x.

Choppala P, Gunawan D, Chen J, Tran MN, Kohn R (2016). “Bayesian Inference for State Space Models Using Block and Correlated Pseudo Marginal Methods.” arXiv:1612.07072v1 [stat.ME], URL http://arxiv.org/abs/1612.07072v1.

Crisan D, Doucet A (2002). “A Survey of Convergence Results on Particle Filtering Methods for Practitioners.” IEEE Transactions on Signal Processing, 50(3), 736–746. doi:10.1109/78.984773.

Dahlin J (2016). Accelerating Monte Carlo Methods for Bayesian Inference in Dynamical Models. Ph.D. thesis, Linköping University.

Dahlin J (2018). pmhtutorial: Minimal Working Examples for Particle Metropolis-Hastings. R package version 1.3, URL https://CRAN.R-project.org/package=pmhtutorial.

Dahlin J, Lindsten F, Kronander J, Schön TB (2015a). “Accelerating Pseudo-Marginal Metropolis-Hastings by Correlating Auxiliary Variables.” arXiv:1511.05483v1 [stat.CO], URL http://arxiv.org/abs/1511.05483v1.

Dahlin J, Lindsten F, Schön TB (2015b). “Particle Metropolis-Hastings Using Gradient and Hessian Information.” Statistics and Computing, 25(1), 81–92. doi:10.1007/s11222-014-9510-0.

Dahlin J, Lindsten F, Schön TB (2015c). “Quasi-Newton Particle Metropolis-Hastings.” In Proceedings of the 17th IFAC Symposium on System Identification (SYSID). Beijing, China.

Del Moral P (2013). Mean Field Simulation for Monte Carlo Integration. CRC Press. doi:10.1201/b14924.

Del Moral P, Doucet A, Jasra A (2006). “Sequential Monte Carlo Samplers.” Journal of the Royal Statistical Society B, 68(3), 411–436. doi:10.1111/j.1467-9868.2006.00553.x.

Deligiannidis G, Doucet A, Pitt MK (2018). “The Correlated Pseudomarginal Method.” Journal of the Royal Statistical Society B, 80(5), 839–870. doi:10.1111/rssb.12280.

Dellaportas P, Kontoyiannis I (2012). “Control Variates for Estimation Based on Reversible Markov Chain Monte Carlo Samplers.” Journal of the Royal Statistical Society B, 74(1), 133–161. doi:10.1111/j.1467-9868.2011.01000.x.

Dempster A, Laird N, Rubin D (1977). “Maximum Likelihood from Incomplete Data via the EM Algorithm.” Journal of the Royal Statistical Society B, 39(1), 1–38.

Doornik JA (2007). Object-Oriented Matrix Programming Using Ox. 3rd edition. Timberlake Consultants Press and Oxford, London.

Douc R, Cappé O (2005). “Comparison of Resampling Schemes for Particle Filtering.” In Proceedings of the 4th International Symposium on Image and Signal Processing and Analysis (ISPA), pp. 64–69. Zagreb, Croatia.

Doucet A, de Freitas N, Gordon N (2001). Sequential Monte Carlo Methods in Practice. Statistics for Engineering and Information Science. Springer-Verlag. doi:10.1007/978-1-4757-3437-9.

Doucet A, de Freitas N, Murphy K, Russell S (2000). “Rao-Blackwellised Particle Filtering for Dynamic Bayesian Networks.” In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pp. 176–183. Stanford, USA.

Doucet A, Johansen A (2011). “A Tutorial on Particle Filtering and Smoothing: Fifteen Years Later.” In D Crisan, B Rozovsky (eds.), The Oxford Handbook of Nonlinear Filtering. Oxford University Press.

Doucet A, Pitt MK, Deligiannidis G, Kohn R (2015). “Efficient Implementation of Markov Chain Monte Carlo When Using an Unbiased Likelihood Estimator.” Biometrika, 102(2), 295–313. doi:10.1093/biomet/asu075.

Durbin J, Koopman SJ (2012). Time Series Analysis by State Space Methods. 2nd edition. Oxford University Press.

Eaton JW, Bateman D, Hauberg S, Wehbring R (2017). GNU Octave Version 4.2.1 Manual: A High-Level Interactive Language for Numerical Computations. CreateSpace Independent Publishing Platform. URL https://www.gnu.org/software/octave/.

Eddelbuettel D, Johansen AM (2018). RcppSMC: Rcpp Bindings for Sequential Monte Carlo. R package version 0.2.1, URL https://CRAN.R-project.org/package=RcppSMC.

Flury T, Shephard N (2011). “Bayesian Inference Based Only on Simulated Likelihood: Particle Filter Analysis of Dynamic Economic Models.” Econometric Theory, 27(5), 933–956. doi:10.1017/s0266466610000599.

Fulop A, Li J (2013). “Efficient Learning via Simulation: A Marginalized Resample-Move Approach.” Journal of Econometrics, 176(2), 146–161. doi:10.1016/j.jeconom.2013.05.002.

Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB (2013). Bayesian Data Analysis. 3rd edition. Chapman & Hall/CRC.

Gerber M, Chopin N (2015). “Sequential Quasi Monte Carlo.” Journal of the Royal Statistical Society B, 77(3), 509–579. doi:10.1111/rssb.12104.

Girolami M, Calderhead B (2011). “Riemann Manifold Langevin and Hamiltonian Monte Carlo Methods.” Journal of the Royal Statistical Society B, 73(2), 123–214. doi:10.1111/j.1467-9868.2010.00765.x.

Gordon NJ, Salmond DJ, Smith AFM (1993). “Novel Approach to Nonlinear/Non-Gaussian Bayesian State Estimation.” IEEE Proceedings of Radar and Signal Processing, 140(2), 107–113. doi:10.1049/ip-f-2.1993.0015.

Hol JD, Schön TB, Gustafsson F (2006). “On Resampling Algorithms for Particle Filters.” In Proceedings of the Nonlinear Statistical Signal Processing Workshop. Cambridge, UK.

Hull J (2009). Options, Futures, and Other Derivatives. 7th edition. Pearson.

Hull J, White A (1987). “The Pricing of Options on Assets with Stochastic Volatilities.” The Journal of , 42(2), 281–300. doi:10.2307/2328253.

Johansen AM (2009). “SMCTC: Sequential Monte Carlo in C++.” Journal of Statistical Software, 30(6), 1–41. doi:10.18637/jss.v030.i06.

Kantas N, Doucet A, Singh SS, Maciejowski JM, Chopin N (2015). “On Particle Methods for Parameter Estimation in General State-Space Models.” Statistical Science, 30(3), 328–351. doi:10.1214/14-sts511.

King A, Nguyen D, Ionides E (2016). “Statistical Inference for Partially Observed Markov Processes via the R Package pomp.” Journal of Statistical Software, 69(12), 1–43. doi:10.18637/jss.v069.i12.

Kronander J, Schön TB (2014). “Robust Auxiliary Particle Filters Using Multiple Importance Sampling.” In Proceedings of the 2014 IEEE Statistical Signal Processing Workshop (SSP). Gold Coast, Australia.

Langrock R (2011). “Some Applications of Nonlinear and Non-Gaussian State-Space Modelling by Means of Hidden Markov Models.” Journal of Applied Statistics, 38(12), 2955–2970. doi:10.1080/02664763.2011.573543.

Lee A, Whiteley N (2016). “Forest Resampling for Distributed Sequential Monte Carlo.” Statistical Analysis and Data Mining: The ASA Data Science Journal, 9(4), 230–248. doi:10.1002/sam.11280.

Lee A, Yau C, Giles MB, Doucet A, Holmes CC (2010). “On the Utility of Graphics Cards to Perform Massively Parallel Simulation of Advanced Monte Carlo Methods.” Journal of Computational and Graphical Statistics, 19(4), 769–789. doi:10.1198/jcgs.2010.10039.

Lindsten F, Jordan MI, Schön TB (2014). “Particle Gibbs with Ancestor Sampling.” Journal of Machine Learning Research, 15, 2145–2184.

Ljung L (1999). System Identification: Theory for the User. Prentice Hall.

Lunn D, Spiegelhalter D, Thomas A, Best N (2009). “The BUGS Project: Evolution, Critique and Future Directions.” Statistics in Medicine, 28(25), 3049–3067. doi:10.1002/sim.3680.

Mansinghka VK, Selsam D, Perov YN (2014). “Venture: A Higher-Order Probabilistic Programming Platform with Programmable Inference.” arXiv:1404.0099 [cs.AI], URL http://arxiv.org/abs/1404.0099.

Marin JM, Pudlo P, Robert CP, Ryder RJ (2012). “Approximate Bayesian Computational Methods.” Statistics and Computing, 22(6), 1167–1180. doi:10.1007/s11222-011-9288-2.

Massey Jr FJ (1951). “The Kolmogorov-Smirnov Test for Goodness of Fit.” Journal of the American Statistical Association, 46(253), 68–78. doi:10.1080/01621459.1951.10500769.

McLachlan GJ, Krishnan T (2008). The EM Algorithm and Extensions. 2nd edition. John Wiley & Sons.

Meyn SP, Tweedie RL (2009). Markov Chains and Stochastic Stability. Cambridge University Press.

Mijatović A, Vogrinc J (2018). “On the Poisson Equation for Metropolis-Hastings Chains.” Bernoulli, 24(3), 2401–2428. doi:10.3150/17-bej932.

Mira A, Solgi R, Imparato D (2013). “Zero Variance Markov Chain Monte Carlo for Bayesian Estimators.” Statistics and Computing, 23(5), 653–662. doi:10.1007/s11222-012-9344-6.

Murray LM (2015). “Bayesian State-Space Modelling on High-Performance Hardware Using LibBi.” Journal of Statistical Software, 67(10), 1–36. doi:10.18637/jss.v067.i10.

Naesseth CA, Lindsten F, Schön TB (2015). “Nested Sequential Monte Carlo Methods.” In Proceedings of the 32nd International Conference on Machine Learning (ICML). Lille, France.

Nemeth C, Sherlock C, Fearnhead P (2016). “Particle Metropolis-Adjusted Langevin Algorithms.” Biometrika, 103(3), 701–717. doi:10.1093/biomet/asw020.

Nordh J (2017). “pyParticleEst: A Python Framework for Particle-Based Estimation Methods.” Journal of Statistical Software, 78(3), 1–25. doi:10.18637/jss.v078.i03.

Paige B, Wood F (2014). “A Compilation Target for Probabilistic Programming Languages.” In Proceedings of the 31st International Conference on Machine Learning (ICML). Bejing, China.

Papamarkou T, Mira A, Girolami M (2014). “Zero Variance Differential Geometric Markov Chain Monte Carlo Algorithms.” Bayesian Analysis, 9(1), 97–128. doi:10.1214/13-ba848.

Pharr M, Humphreys G (2010). Physically Based Rendering: From Theory to Implementation. Morgan Kaufmann.

Pitt MK, Shephard N (1999). “Filtering via Simulation: Auxiliary Particle Filters.” Journal of the American Statistical Association, 94(446), 590–599. doi:10.1080/01621459.1999.10474153.

Pitt MK, Silva RS, Giordani P, Kohn R (2012). “On Some Properties of Markov Chain Monte Carlo Simulation Methods Based on the Particle Filter.” Journal of Econometrics, 171(2), 134–151. doi:10.1016/j.jeconom.2012.06.004.

Plummer M (2003). “JAGS: A Program for Analysis of Bayesian Graphical Models Using Gibbs Sampling.” In K Hornik, F Leisch, A Zeileis (eds.), Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003). Technische Universität Wien, Vienna, Austria. URL https://www.R-project.org/conferences/DSC-2003/Proceedings/Plummer.pdf.

R Core Team (2018). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

Robert CP, Casella G (2004). Monte Carlo Statistical Methods. 2nd edition. Springer-Verlag. doi:10.1007/978-1-4757-4145-2.

Robert CP, Casella G (2009). Introducing Monte Carlo Methods with R. Springer-Verlag.

Ross SM (2012). Simulation. 5th edition. Academic Press.

Schön T, Gustafsson F, Nordlund PJ (2005). “Marginalized Particle Filters for Mixed Linear/Nonlinear State-Space Models.” IEEE Transactions on Signal Processing, 53(7), 2279–2289. doi:10.1109/tsp.2005.849151.

Schön TB, Lindsten F, Dahlin J, Wågberg J, Naesseth CA, Svensson A, Dai L (2015). “Sequential Monte Carlo Methods for System Identification.” In Proceedings of the 17th IFAC Symposium on System Identification (SYSID). Beijing, China.

Sherlock C, Thiery AH, Roberts GO, Rosenthal JS (2015). “On the Efficiency of Pseudo-Marginal Random Walk Metropolis Algorithms.” The Annals of Statistics, 43(1), 238–275. doi:10.1214/14-aos1278.

Spall JC (1987). “A Stochastic Approximation Technique for Generating Maximum Likelihood Parameter Estimates.” In Proceedings of the 6th American Control Conference (ACC), pp. 1161–1167. Minneapolis, MN, USA.

Svensson A, Schön TB (2016). “Comparing Two Recent Particle Filter Implementations of Bayesian System Identification.” Technical Report 2016-008, Department of Information Technology, Uppsala, Sweden.

The MathWorks Inc (2017). MATLAB – The Language of Technical Computing, Version R2017b. Natick, Massachusetts. URL http://www.mathworks.com/products/matlab/.

Tierney L (1994). “Markov Chains for Exploring Posterior Distributions.” The Annals of Statistics, 22(4), 1701–1728. doi:10.1214/aos/1176325750.

Todeschini A, Caron F, Fuentes M, Legrand P, Del Moral P (2014). “Biips: Software for Bayesian Inference with Interacting Particle Systems.” arXiv:1412.3779 [stat.CO], URL https://arxiv.org/abs/1412.3779.

Tolpin D, van de Meent JW, Yang H, Wood F (2016). “Design and Implementation of Probabilistic Programming Language Anglican.” In Proceedings of the 28th Symposium on the Implementation and Application of Functional Programming Languages. doi:10.1145/3064899.3064910.

Tsay RS (2005). Analysis of Financial Time Series. 2nd edition. John Wiley & Sons. doi:10.1002/0471746193.

van Rossum G, et al. (2011). Python Programming Language. URL http://www.python.org.

Wilkinson D (2018). smfsb: Stochastic Modelling for Systems Biology. R package version 1.3, URL https://CRAN.R-project.org/package=smfsb.

Wilkinson DJ (2011). Stochastic Modelling for Systems Biology. 2nd edition. Chapman & Hall/CRC. doi:10.1201/b11812.

Zhou Y (2015). “vSMC: Parallel Sequential Monte Carlo in C++.” Journal of Statistical Software, 62(9), 1–49. doi:10.18637/jss.v062.i09.

Affiliation:
Johan Dahlin
Department of Computer and Information Science
Linköping University
SE-581 83 Linköping, Sweden
E-mail: [email protected]
URL: http://www.johandahlin.com/

Thomas B. Schön
Department of Information Technology
Uppsala University
Box 337
SE-751 05 Uppsala, Sweden
E-mail: [email protected]
URL: http://user.it.uu.se/~thosc112/