Inference for dynamic and latent variable models via iterated, perturbed Bayes maps

Edward L. Ionides^a,1, Dao Nguyen^a, Yves Atchadé^a, Stilian Stoev^a, and Aaron A. King^b,c

Departments of ^aStatistics, ^bEcology and Evolutionary Biology, and ^cMathematics, University of Michigan, Ann Arbor, MI 48109

Edited by Peter J. Bickel, University of California, Berkeley, CA, and approved December 9, 2014 (received for review June 6, 2014)

Iterated filtering algorithms are stochastic optimization procedures for latent variable models that recursively combine parameter perturbations with latent variable reconstruction. Previously, theoretical support for these algorithms has been based on the use of conditional moments of perturbed parameters to approximate derivatives of the log likelihood function. Here, a theoretical approach is introduced based on the convergence of an iterated Bayes map. An algorithm supported by this theory displays substantial numerical improvement on the computational challenge of inferring parameters of a partially observed Markov process.

sequential Monte Carlo | particle filter | maximum likelihood | Markov process

Significance

Many scientific challenges involve the study of stochastic dynamic systems for which only noisy or incomplete measurements are available. Inference for partially observed Markov process models provides a framework for formulating and answering questions about these systems. Except when the system is small, or approximately linear and Gaussian, state-of-the-art statistical methods are required to make efficient use of available data. Evaluation of the likelihood for a partially observed Markov process model can be formulated as a filtering problem. Iterated filtering algorithms carry out repeated Monte Carlo filtering operations to maximize the likelihood. We develop a new theoretical framework for iterated filtering and construct a new algorithm that dramatically outperforms previous approaches on a challenging inference problem in disease ecology.

Author contributions: E.L.I., D.N., Y.A., and A.A.K. designed research; E.L.I., D.N., Y.A., S.S., and A.A.K. performed research; E.L.I., D.N., Y.A., and A.A.K. analyzed data; and E.L.I., D.N., Y.A., S.S., and A.A.K. wrote the paper. The authors declare no conflict of interest. This article is a PNAS Direct Submission. ^1To whom correspondence should be addressed. Email: [email protected]. This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1410597112/-/DCSupplemental.

An iterated filtering algorithm was originally proposed for maximum likelihood inference on partially observed Markov process (POMP) models by Ionides et al. (1). Variations on the original algorithm have been proposed to extend it to general latent variable models (2) and to improve numerical performance (3, 4). In this paper, we study an iterated filtering algorithm that generalizes the data cloning method (5, 6) and is therefore also related to other Monte Carlo methods for likelihood-based inference (7–9). Data cloning methodology is based on the observation that iterating a Bayes map converges to a point mass at the maximum likelihood estimate. Combining such iterations with perturbations of model parameters improves the numerical stability of data cloning and provides a foundation for stable algorithms in which the Bayes map is numerically approximated by sequential Monte Carlo computations.

We investigate convergence of a sequential Monte Carlo implementation of an iterated filtering algorithm that combines data cloning, in the sense of Lele et al. (5), with the stochastic parameter perturbations used by the iterated filtering algorithm of ref. 1. Lindström et al. (4) proposed a similar algorithm, termed fast iterated filtering, but the theoretical support for that algorithm involved unproved conjectures. We present convergence results for our algorithm, which we call IF2. Empirically, it can dramatically outperform the previous iterated filtering algorithm of ref. 1, which we refer to as IF1. Although IF1 and IF2 both involve recursively filtering through the data, the theoretical justification and practical implementations of these algorithms are fundamentally different. IF1 approximates the Fisher score function, whereas IF2 implements an iterated Bayes map. IF1 has been used in applications for which no other computationally feasible algorithm for statistically efficient, likelihood-based inference was known (10–15). The extra capabilities offered by IF2 open up further possibilities for drawing inferences about nonlinear partially observed stochastic dynamic models from time series data.

Iterated filtering algorithms implemented using basic sequential Monte Carlo techniques have the property that they do not need to evaluate the transition density of the latent Markov process. Algorithms with this property have been called plug-and-play (12, 16). Various other plug-and-play methods for POMP models have been proposed recently (17–20), due largely to the convenience of this property in scientific applications.
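To see the data cloning observation concretely, the following sketch (ours, not from the original; it uses a hypothetical one-dimensional Gaussian log likelihood on a grid) iterates the Bayes map, which multiplies by the likelihood and renormalizes, and shows the density concentrating at the MLE:

```python
import numpy as np

# Data cloning in one dimension: iterating the Bayes map
#   f(theta) -> l(theta) * f(theta) / integral of l * f
# drives the density toward a point mass at the MLE.
theta = np.linspace(-3.0, 3.0, 2001)     # parameter grid
dtheta = theta[1] - theta[0]
loglik = -0.5 * (theta - 1.0) ** 2       # hypothetical log likelihood, MLE at 1.0
f = np.full_like(theta, 1.0 / 6.0)       # flat initial density on [-3, 3]

for m in range(50):                      # 50 iterations of the Bayes map
    f = f * np.exp(loglik)
    f = f / (f.sum() * dtheta)           # renormalize to a probability density

sd = np.sqrt(np.sum((theta - 1.0) ** 2 * f) * dtheta)
print("mode:", theta[np.argmax(f)], "SD:", sd)   # mode near 1.0; SD shrinks like 1/sqrt(m)
```

Without perturbations, the swarm variance shrinks at every iteration, which is why a Monte Carlo implementation eventually degenerates; the parameter perturbations in IF2 counteract this.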

An Algorithm and Related Questions

A general POMP model consists of an unobserved stochastic process $\{X(t), t \geq t_0\}$ with observations $Y_1, \ldots, Y_N$ made at times $t_1, \ldots, t_N$. We suppose that $X(t)$ takes values in $\mathsf{X} \subset \mathbb{R}^{\dim(X)}$, $Y_n$ takes values in $\mathsf{Y} \subset \mathbb{R}^{\dim(Y)}$, and there is an unknown parameter $\theta$ taking values in $\Theta \subset \mathbb{R}^{\dim(\Theta)}$. We adopt the notation $y_{m:n} = (y_m, y_{m+1}, \ldots, y_n)$ for integers $m \leq n$, so we write the collection of observations as $Y_{1:N}$. Writing $X_n = X(t_n)$, the joint density of $X_{0:N}$ and $Y_{1:N}$ is assumed to exist, and the Markovian property of $X_{0:N}$ together with the conditional independence of the observation process means that this joint density can be written as

\[
f_{X_{0:N}, Y_{1:N}}(x_{0:N}, y_{1:N}; \theta) = f_{X_0}(x_0; \theta) \prod_{n=1}^{N} f_{X_n \mid X_{n-1}}(x_n \mid x_{n-1}; \theta)\, f_{Y_n \mid X_n}(y_n \mid x_n; \theta).
\]

The data consist of a sequence of observations, $y_{1:N}^*$. We write $f_{Y_{1:N}}(y_{1:N}; \theta)$ for the marginal density of $Y_{1:N}$, and the likelihood function is defined to be $\ell(\theta) = f_{Y_{1:N}}(y_{1:N}^*; \theta)$. We look for a maximum likelihood estimate (MLE), i.e., a value $\hat\theta$ maximizing $\ell(\theta)$.
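Evaluation of $\ell(\theta)$ is a filtering problem, and a bootstrap particle filter solves it using only a simulator for the latent process. The sketch below is ours, not from the paper; `rinit`, `rprocess`, and `dmeasure` are hypothetical user-supplied functions standing in for the simulator and evaluator required by the algorithm in the next section:

```python
import numpy as np

def pf_loglik(theta, y, rinit, rprocess, dmeasure, J=1000, rng=None):
    """Bootstrap particle filter estimate of log l(theta).

    rinit(theta, J, rng)       -> array (J, dim_X) of draws from f_{X_0}
    rprocess(x, n, theta, rng) -> array (J, dim_X), one step of f_{X_n | X_{n-1}}
    dmeasure(y_n, x, n, theta) -> array (J,) of densities f_{Y_n | X_n}(y_n | x)
    Only simulation of the latent process is needed: plug-and-play.
    """
    rng = rng or np.random.default_rng()
    x = rinit(theta, J, rng)
    loglik = 0.0
    for n in range(len(y)):
        x = rprocess(x, n, theta, rng)            # prediction particles
        w = dmeasure(y[n], x, n, theta)           # measurement weights
        w = np.maximum(w, 1e-300)                 # guard against total weight underflow
        loglik += np.log(w.mean())                # log of conditional likelihood estimate
        k = rng.choice(J, size=J, p=w / w.sum())  # multinomial resampling
        x = x[k]
    return loglik
```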

Algorithm IF2. Iterated filtering

```
input:
  Simulator for f_{X_0}(x_0; theta)
  Simulator for f_{X_n|X_{n-1}}(x_n | x_{n-1}; theta), n in 1:N
  Evaluator for f_{Y_n|X_n}(y_n | x_n; theta), n in 1:N
  Data, y*_{1:N}
  Number of iterations, M
  Number of particles, J
  Initial parameter swarm, {Theta^0_j, j in 1:J}
  Perturbation density, h_n(theta | phi; sigma), n in 0:N
  Perturbation sequence, sigma_{1:M}
output:
  Final parameter swarm, {Theta^M_j, j in 1:J}

For m in 1:M
  Theta^{F,m}_{0,j} ~ h_0(theta | Theta^{m-1}_j; sigma_m) for j in 1:J
  X^{F,m}_{0,j} ~ f_{X_0}(x_0; Theta^{F,m}_{0,j}) for j in 1:J
  For n in 1:N
    Theta^{P,m}_{n,j} ~ h_n(theta | Theta^{F,m}_{n-1,j}; sigma_m) for j in 1:J
    X^{P,m}_{n,j} ~ f_{X_n|X_{n-1}}(x_n | X^{F,m}_{n-1,j}; Theta^{P,m}_{n,j}) for j in 1:J
    w^m_{n,j} = f_{Y_n|X_n}(y*_n | X^{P,m}_{n,j}; Theta^{P,m}_{n,j}) for j in 1:J
    Draw k_{1:J} with P(k_j = i) = w^m_{n,i} / sum_{u=1}^J w^m_{n,u}
    Theta^{F,m}_{n,j} = Theta^{P,m}_{n,k_j} and X^{F,m}_{n,j} = X^{P,m}_{n,k_j} for j in 1:J
  End For
  Set Theta^m_j = Theta^{F,m}_{N,j} for j in 1:J
End For
```
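A direct transcription of the pseudocode into Python might look as follows. This is our sketch, not the pomp implementation; the plug-and-play ingredients here have the per-particle signatures `rinit(theta, rng)`, `rprocess(x, n, theta, rng)`, and `dmeasure(y_n, x, n, theta)`, since each particle carries its own parameter, and the perturbation densities $h_0, \ldots, h_N$ are taken to be Gaussian with SD $\sigma_m$:

```python
import numpy as np

def if2(swarm, y, rinit, rprocess, dmeasure, sigmas, rng=None):
    """IF2 sketch. swarm: (J, dim_Theta) initial parameter swarm;
    sigmas: perturbation sequence sigma_1, ..., sigma_M.
    Returns the final parameter swarm."""
    rng = rng or np.random.default_rng()
    theta = swarm.copy()
    J = theta.shape[0]
    for sigma in sigmas:                                              # m = 1, ..., M
        theta = theta + sigma * rng.standard_normal(theta.shape)      # h_0 perturbation
        x = np.array([rinit(th, rng) for th in theta])                # initial states
        for n in range(len(y)):
            theta = theta + sigma * rng.standard_normal(theta.shape)  # h_n perturbation
            x = np.array([rprocess(xj, n, th, rng) for xj, th in zip(x, theta)])
            w = np.array([dmeasure(y[n], xj, n, th) for xj, th in zip(x, theta)])
            w = np.maximum(w, 1e-300)                                 # guard against underflow
            k = rng.choice(J, size=J, p=w / w.sum())                  # resample pairs (theta, x)
            theta, x = theta[k], x[k]
    return theta        # swarm concentrates near the MLE as sigma_m decreases
```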

The IF2 algorithm defined above provides a plug-and-play Monte Carlo approach to obtaining $\hat\theta$. A simplification of IF2 arises when $N = 1$, in which case iterated filtering is called iterated importance sampling (2) (SI Text, Iterated Importance Sampling). Algorithms similar to IF2 with a single iteration ($M = 1$) have been proposed in the context of Bayesian inference (21, 22) (SI Text, Applying Liu and West's Method to the Toy Example and Fig. S1). When $M = 1$ and $h_n(\theta \mid \varphi; \sigma)$ degenerates to a point mass at $\varphi$, the IF2 algorithm becomes a standard particle filter (23, 24). In the IF2 algorithm description, $\Theta^{F,m}_{n,j}$ and $X^{F,m}_{n,j}$ are the $j$th particles at time $n$ in the Monte Carlo representation of the $m$th iteration of a filtering recursion. The filtering recursion is coupled with a prediction recursion, represented by $\Theta^{P,m}_{n,j}$ and $X^{P,m}_{n,j}$. The resampling indices $k_{1:J}$ in IF2 are taken to be a multinomial draw for our theoretical analysis, but systematic resampling is preferable in practice (23).
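Systematic resampling (our sketch, a standard implementation rather than code from the paper) draws the indices $k_{1:J}$ with the same marginal probabilities as the multinomial scheme in the pseudocode, but uses a single uniform variate, reducing resampling variance:

```python
import numpy as np

def systematic_resample(w, rng=None):
    """Return J indices k_1, ..., k_J with E[#{j : k_j = i}] = J * w_i / sum(w)."""
    rng = rng or np.random.default_rng()
    J = len(w)
    points = (rng.uniform() + np.arange(J)) / J    # one stratified point per particle
    cumulative = np.cumsum(w) / np.sum(w)
    return np.searchsorted(cumulative, points)     # stratum containing each point
```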
A natural choice of $h_n(\theta \mid \varphi; \sigma)$ is a multivariate normal density with mean $\varphi$ and variance $\sigma^2 \Sigma$ for some covariance matrix $\Sigma$, but in general, $h_n$ could be any conditional density parameterized by $\sigma$. Combining the perturbations over all of the time points, we define

\[
h(\theta_{0:N} \mid \varphi; \sigma) = h_0(\theta_0 \mid \varphi; \sigma) \prod_{n=1}^{N} h_n(\theta_n \mid \theta_{n-1}; \sigma).
\]

We define an extended likelihood function on $\Theta^{N+1}$ by

\[
\hat\ell(\theta_{0:N}) = \int \cdots \int dx_0 \cdots dx_N\, f_{X_0}(x_0; \theta_0) \prod_{n=1}^{N} f_{X_n \mid X_{n-1}}(x_n \mid x_{n-1}; \theta_n)\, f_{Y_n \mid X_n}\big(y_n^* \mid x_n; \theta_n\big).
\]

Each iteration of IF2 is a Monte Carlo approximation to a map

\[
T_\sigma f(\theta_N) = \frac{\int \hat\ell(\theta_{0:N})\, h(\theta_{0:N} \mid \varphi; \sigma)\, f(\varphi)\, d\varphi\, d\theta_{0:N-1}}{\int \hat\ell(\theta_{0:N})\, h(\theta_{0:N} \mid \varphi; \sigma)\, f(\varphi)\, d\varphi\, d\theta_{0:N}}, \quad [1]
\]

with $f$ and $T_\sigma f$ approximating the initial and final density of the parameter swarm. For our theoretical analysis, we consider the case when the SD of the parameter perturbations is held fixed at $\sigma_m = \sigma > 0$ for $m = 1, \ldots, M$. In this case, IF2 is a Monte Carlo approximation to $T_\sigma^M f(\theta)$. We call the fixed-$\sigma$ version of IF2 "homogeneous iterated filtering," since each iteration implements the same map. For any fixed $\sigma$, one cannot expect a procedure such as IF2 to converge to a point mass at the MLE. However, for fixed but small $\sigma$, we show that IF2 does approximately maximize the likelihood, with an error that shrinks to zero in a limit as $\sigma \to 0$ and $M \to \infty$. An immediate motivation for studying the homogeneous case is simplicity; it turns out that even with this simplifying assumption, the theoretical analysis is not entirely straightforward. Moreover, the homogeneous analysis gives at least as much insight as an asymptotic analysis into the practical properties of IF2 when $\sigma_m$ decreases down to some positive level $\sigma > 0$ but never completes the asymptotic limit $\sigma_m \to 0$. Iterated filtering algorithms have been developed primarily in the context of making progress on complex models for which successfully achieving and validating global likelihood optimization is challenging. In such situations, it is advisable to run multiple searches and continue each search up to the limits of available computation (25). If no single search can reliably locate the global maximum, a theory assuring convergence to a neighborhood of the maximum is as relevant as a theory assuring convergence to the maximum itself in a practically unattainable limit.

The map $T_\sigma$ can be expressed as a composition of a parameter perturbation with a Bayes map that multiplies by the likelihood and renormalizes. Iteration of the Bayes map alone has a central limit theorem (CLT) (5) that forms the theoretical basis for the data cloning methodology of refs. 5 and 6. Repetitions of the parameter perturbation may also be expected to follow a CLT. One might therefore imagine that the composition of these two operations also has a Gaussian limit. This is not generally true, since the rescaling involved in the perturbation CLT prevents the Bayes map CLT from applying (SI Text, A Class of Exact Non-Gaussian Limits for Iterated Importance Sampling).
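To make this composition explicit, Eq. 1 can be rewritten (our notation, not displayed in the original) as a perturbation step $P_\sigma$ followed by a Bayes step $B$:

\[
(P_\sigma f)(\theta_{0:N}) = \int h(\theta_{0:N} \mid \varphi; \sigma)\, f(\varphi)\, d\varphi, \qquad
(B g)(\theta_{0:N}) = \frac{\hat\ell(\theta_{0:N})\, g(\theta_{0:N})}{\int \hat\ell(\vartheta_{0:N})\, g(\vartheta_{0:N})\, d\vartheta_{0:N}},
\]

with $T_\sigma f(\theta_N)$ the $\theta_N$ marginal of $B(P_\sigma f)$: each IF2 iteration first diffuses the parameter swarm and then reweights it by the extended likelihood.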

Our agenda is to seek conditions guaranteeing the following:

(A1) For every fixed $\sigma > 0$, $\lim_{m \to \infty} T_\sigma^m f = f_\sigma$ exists.
(A2) When $J$ and $M$ become large, IF2 numerically approximates $f_\sigma$.
(A3) As the noise intensity becomes small, $\lim_{\sigma \to 0} f_\sigma$ approaches a point mass at the MLE, if it exists.

Stability of filtering problems and uniform convergence of sequential Monte Carlo numerical approximations are closely related, and so A1 and A2 are studied together in Theorem 1. Each iteration of IF2 involves standard sequential Monte Carlo filtering techniques applied to an extended model in which the latent variable space is augmented to include a time-varying parameter. Indeed, all $M$ iterations together can be represented as a filtering problem for this extended POMP model on $M$ replications of the data. The proof of Theorem 1 therefore leans on existing results. The novel issue of A3 is then addressed in Theorem 2.

Convergence of IF2

First, we set up some notation. Let $\{\hat\Theta^m_{0:N}, m = 1, 2, \ldots\}$ be a Markov chain taking values in $\Theta^{N+1}$ such that $\hat\Theta^1_{0:N}$ has density $\int h(\theta_{0:N} \mid \varphi; \sigma)\, f(\varphi)\, d\varphi$, and $\hat\Theta^m_{0:N}$ has conditional density $h(\theta_{0:N} \mid \varphi_N; \sigma)$ given $\hat\Theta^{m-1}_{0:N} = \varphi_{0:N}$, for $m \geq 2$. Suppose that $\{\hat\Theta^m_{0:N}, m \geq 1\}$ is constructed on the canonical probability space $\Omega = \{(\theta^1_{0:N}, \theta^2_{0:N}, \ldots)\}$, with $\theta^m_{0:N} = \hat\Theta^m_{0:N}(\vartheta)$ for $\vartheta = (\theta^1_{0:N}, \theta^2_{0:N}, \ldots) \in \Omega$. Let $\{\mathcal{F}^m\}$ be the corresponding Borel filtration. To consider a time-rescaled limit of $\{\hat\Theta^m_{0:N}, m = 1, 2, \ldots\}$ as $\sigma \to 0$, let $\{W_\sigma(t), t \geq 0\}$ be a continuous-time, right-continuous, piecewise constant process defined at its points of discontinuity by $W_\sigma(k\sigma^2) = \hat\Theta^{k+1}_N$ when $k$ is a nonnegative integer. Let $\{\hat Z^m_{0:N}, m = 1, 2, \ldots\}$ be the filtered process defined such that, for any event $E \in \mathcal{F}^M$,

\[
\hat P_Z(E) = \frac{\hat E_\Theta\big[\hat\ell_{1:M}\, \mathbb{1}_E\big]}{\hat E_\Theta\big[\hat\ell_{1:M}\big]}, \quad [2]
\]

where $\mathbb{1}_E$ is the indicator function for event $E$ and

\[
\hat\ell_{1:M}(\vartheta) = \prod_{m=1}^{M} \hat\ell\big(\theta^m_{0:N}\big).
\]

In Eq. 2, $\hat P_Z(E)$ denotes probability under the law of $\{\hat Z^m_n\}$, and $\hat E_\Theta$ denotes expectation under the law of $\{\hat\Theta^m_n\}$. The process $\{\hat Z^m_n\}$ is constructed so that $\hat Z^m_N$ has density $T_\sigma^m f$. We make the following assumptions.

(B1) $\{W_\sigma(t), 0 \leq t \leq 1\}$ converges weakly as $\sigma \to 0$ to a diffusion $\{W(t), 0 \leq t \leq 1\}$, in the space of right-continuous functions with left limits equipped with the uniform convergence topology. For any open set $A \subset \Theta$ with positive Lebesgue measure and $\epsilon > 0$, there is a $\delta(A, \epsilon) > 0$ such that $P[W(t) \in A \text{ for all } \epsilon \leq t \leq 1 \mid W(0)] > \delta$.
(B2) For some $t_0(\sigma)$ and $\sigma_0 > 0$, $W_\sigma(t)$ has a positive density on $\Theta$, uniformly over the distribution of $W(0)$, for all $t > t_0$ and $\sigma < \sigma_0$.
(B3) $\ell(\theta)$ is continuous in a neighborhood $\{\theta : \ell(\theta) > \lambda_1\}$ for some $\lambda_1 < \sup_\varphi \ell(\varphi)$.
(B4) There is an $\epsilon > 0$ with $\epsilon < f_{Y_n \mid X_n}(y_n^* \mid x_n; \theta) < \epsilon^{-1}$ for all $1 \leq n \leq N$, $x_n \in \mathsf{X}$, and $\theta \in \Theta$.
(B5) There is a $C_1$ such that $h_n(\theta \mid \varphi; \sigma) = 0$ when $|\theta - \varphi| > C_1 \sigma$, for all $\sigma$.
(B6) There is a $C_2$ such that $\sup_{1 \leq n \leq N} |\theta_n - \theta_{n-1}| < C_1 \sigma$ implies $|\hat\ell(\theta_{0:N}) - \ell(\theta_N)| < C_2 \sigma$, for all $\sigma$.

Conditions B1 and B2 hold when $h_n(\theta \mid \varphi; \sigma)$ corresponds to a reflected Gaussian random walk and $\{W(t)\}$ is a reflected Brownian motion (SI Text, Checking Conditions B1 and B2). More generally, when $h_n(\theta \mid \varphi; \sigma)$ is a location-scale family with mean $\varphi$ away from a boundary, then $\{W(t)\}$ will behave like Brownian motion in the interior of $\Theta$. B4 follows if $\mathsf{X}$ is compact and $f_{Y_n \mid X_n}(y_n^* \mid x_n; \theta)$ is positive and continuous as a function of $\theta$ and $x_n$. B5 can be guaranteed by construction. B3 and B6 are undemanding regularity conditions on the likelihood and extended likelihood. A formalization of A1 and A2 can now be stated as follows.

Theorem 1. Let $T_\sigma$ be the map of Eq. 1 and suppose B2 and B4. There is a unique probability density $f_\sigma$ such that, for any probability density $f$ on $\Theta$,

\[
\lim_{m \to \infty} \big\| T_\sigma^m f - f_\sigma \big\|_1 = 0, \quad [3]
\]

where $\|f\|_1$ is the $L^1$ norm of $f$. Let $\{\Theta^M_j, j = 1, \ldots, J\}$ be the output of IF2, with $\sigma_m = \sigma > 0$. There is a finite constant $C > 0$ such that, for any function $\phi : \Theta \to \mathbb{R}$ and all $M$,

\[
E\, \bigg| \frac{1}{J} \sum_{j=1}^{J} \phi\big(\Theta^M_j\big) - \int \phi(\theta)\, f_\sigma(\theta)\, d\theta \bigg| \leq \frac{C \sup_\theta |\phi(\theta)|}{\sqrt{J}}. \quad [4]
\]

Proof. B2 and B4 imply that $T_\sigma^k$ is mixing, in the sense of ref. 26, for all sufficiently large $k$. The results of ref. 26 are based on the contractive properties of mixing maps in the Hilbert projective metric. Although ref. 26 stated their results in the case where $T$ itself is mixing, the required geometric contraction in the Hilbert metric holds as long as $T^k$ is mixing for all $K \leq k \leq 2K - 1$ for some $K \geq 1$ (ref. 27, theorem 2.5.1). Corollary 4.2 of ref. 26 implies Eq. 3, noting the equivalence of the Hilbert projective metric and the total variation norm shown in their lemma 3.4. Then, corollary 5.12 of ref. 26 implies Eq. 4, completing the proof of Theorem 1. A longer version of this proof is given in SI Text, Additional Details for the Proof of Theorem 1.

Results similar to Theorem 1 can be obtained using Dobrushin contraction techniques (28). Results appropriate for noncompact spaces can be obtained using drift conditions on a potential function (29). Now we move on to our formalization of A3.

Theorem 2. Assume B1–B6. For $\lambda_2 < \sup_\varphi \ell(\varphi)$,

\[
\lim_{\sigma \to 0} \int f_\sigma(\theta)\, \mathbb{1}_{\{\ell(\theta) < \lambda_2\}}\, d\theta = 0.
\]

Proof. Let $\lambda_0 = \sup_\varphi \ell(\varphi)$ and $\lambda_3 = \inf_\varphi \ell(\varphi)$. From B4, $\infty > \lambda_0 \geq \lambda_3 > 0$. For positive constants $\epsilon_1$, $\epsilon_2$, $\eta_1$, $\eta_2$ and $\lambda_1 < \lambda_0$, define

\[
e_1 = (1 - \epsilon_1) \log(\lambda_0 + \epsilon_2) + \epsilon_1 \log(\lambda_2 + \epsilon_2),
\]
\[
e_2 = (1 - \eta_1) \log(\lambda_1 - \eta_2) + \eta_1 \log(\lambda_3 - \eta_2).
\]

We can pick $\epsilon_1$, $\epsilon_2$, $\eta_1$, $\eta_2$, and $\lambda_1$ so that $e_1 < e_2$. Suppose that $\{\hat\Theta^m_n\}$ is initialized with the stationary distribution $f = f_\sigma$ identified in Theorem 1. Now, set $M$ to be the greatest integer less than $1/\sigma^2$, and let $F_1$ be the event that $\{\hat\Theta^m_N, m = 1, \ldots, M\}$ spends at least a fraction of time $\epsilon_1$ in $\{\theta : \ell(\theta) < \lambda_2\}$. Formally,

\[
F_1 = \bigg\{ \vartheta \in \Omega : \frac{1}{M} \sum_{m=1}^{M} \mathbb{1}_{\{\ell(\theta^m_N) < \lambda_2\}} > \epsilon_1 \bigg\}.
\]

We wish to show that $\hat P_\Theta[F_1]$ is small for small $\sigma$. Let $F_2$ be the set of sample paths that spend at least a fraction of time $(1 - \eta_1)$ up to time $M$ in $\{\theta : \ell(\theta) > \lambda_1\}$, i.e.,

\[
F_2 = \bigg\{ \vartheta \in \Omega : \frac{1}{M} \sum_{m=1}^{M} \mathbb{1}_{\{\ell(\theta^m_N) > \lambda_1\}} > (1 - \eta_1) \bigg\}.
\]

Then, we calculate

\[
\begin{aligned}
\hat P_Z[F_1] &= \frac{\hat E_\Theta\big[\hat\ell_{1:M}\, \mathbb{1}_{F_1}\big]}{\hat E_\Theta\big[\hat\ell_{1:M}\big]} \leq \frac{\hat E_\Theta\big[\hat\ell_{1:M}\, \mathbb{1}_{F_1}\big]}{\hat E_\Theta\big[\hat\ell_{1:M}\, \mathbb{1}_{F_2}\big]} \quad [5] \\
&\leq \frac{\hat E_\Theta\Big[\prod_{m=1}^{M} \big\{\ell(\theta^m_N) + C_2\sigma\big\}\, \mathbb{1}_{F_1}\Big]}{\hat E_\Theta\Big[\prod_{m=1}^{M} \big\{\ell(\theta^m_N) - C_2\sigma\big\}\, \mathbb{1}_{F_2}\Big]} \leq \frac{\hat E_\Theta\big[\exp\{M e_1\}\, \mathbb{1}_{F_1}\big]}{\hat E_\Theta\big[\exp\{M e_2\}\, \mathbb{1}_{F_2}\big]} \quad [6] \\
&= \exp\{(e_1 - e_2) M\}\, \frac{\hat P_\Theta[F_1]}{\hat P_\Theta[F_2]}. \quad [7]
\end{aligned}
\]
We used B5 and B6 to arrive at Eq. 5, then, to get to Eq. 6,we σ σ < e σ < η have taken small enough that C2 2 and C2 2. From B3, fθ : ℓðθÞ > λ1g is an open set, and B1 therefore ensures each P P 7 of the probabilities Θ1:M ½F1 and Θ1:M ½F2 in Eq. tends to a positive limit as σ → 0 given by the probability under the limiting distribution fWðtÞg (SI Text, Lemma S1). The term expfðe1 − e2ÞMg tendstozeroasσ → 0 since, by construction, M^→ ∞ and e1 < e2.SettingL = fθ : ℓðθÞ ≤ λ2g, and noting that m; = ; ; ... fZN m 1 2 g is constructed to have stationary marginal density fσ,wehave Z   XM ^ 1 m fσðθÞdθ = P^ Z ∈ LF1 P^½F1 M Z N Z m=1 L Fig. 1. Results for the simulation study of the toy example. (A) IF1 point   estimates from 30 replications (circles) and the MLE (green triangle). The ^ Â Ã m region of parameter space with likelihood within 3 log units of the maxi- + P^ Z ∈ L Fc P^ Fc ; Z N 1 Z 1 mum (white), within 10 log units (red), within 100 log units (orange), and lower (yellow). (B) IF2 point estimates from 30 replications (circles) with the ≤ e + P^½F ; same algorithmic settings as IF1. (C) Final parameter value of 30 PMCMC 1 Z 1 chains (circles). (D) Kernel density estimates of the posterior for θ1 for the e σ first eight of these 30 PMCMC chains (solid lines), with the true posterior which can be made arbitrarily small by picking 1 small and distribution (dotted black line). small, completing the proof.

Demonstration of IF2 with Nonconvex Superlevel Sets advantage of the structure of this particular problem. However, Theorems 1 and 2 do not involve any Taylor series expansions, this example demonstrates a feature that makes tuning algo- which are basic in the justification of IF1 (2). This might suggest rithms tricky: The nonlinear ridge along contours of constant that IF2 can be effective on likelihood functions without good θ2 expðθ1Þ becomes increasingly steep as θ1 increases, so no sin- low-order polynomial approximations. In practice, this can be gle global estimate of the second derivative of the likelihood is seen by comparing IF2 with IF1 on a simple 2D toy example appropriate. Reparameterization can linearize the ridge in this ðdimðΘÞ = dimðXÞ = dimðYÞ = 2Þ in which the superlevel sets toy example, but in practical problems with much larger param- fθ : ℓðθÞ > λg are connected but not convex. We also compare eter spaces, one does not always know how to find appropriate with particle Monte Carlo (PMCMC) imple- reparameterizations, and a single reparameterization may not be mented as the particle marginal Metropolis–Hastings algo- appropriate throughout the parameter space. rithm of ref. 17. The justification of PMCMC also does not Fig. 1 compares the performance of the three methods, based depend on Taylor series expansions, but PMCMC is compu- on 30 Monte Carlo replications. These replications investigate tationally expensive compared with iterated filtering (30). Our the likelihood and posterior distribution for a single draw from toy example has a constant and nonrandom latent process, our toy model, since our interest is in the Monte Carlo behavior for a given dataset. For this simulated dataset, the MLE is X = ðexpfθ g; θ expfθ gÞ for n = 1; ...; N. The known mea- n 1 2 1 θ = ð1:20; 0:81Þ, shown as a green triangle in Fig. 1 A−C. In this surement model is toy example, the posterior distribution can also be computed    directly by numerical integration. In Fig. 1A, we see that IF1 100 0 f ðyjx ; θÞ ∼ Normal x; ; performs poorly on this challenge. None of the 30 replications YnjXn 01 approach the MLE. The linear combination of perturbed parameters involved in the IF1 update formula can all too easily This example was designed so that a nonlinear combination of knock the search off a nonlinear ridge. Fig. 1B shows that IF2 the parameters is well identified whereas each parameter is performs well on this test, with almost all of the Monte Carlo marginally weakly identified. For the truth, we took θ = ð1; 1Þ. We supposed that θ is suspected to fall in the interval ½−2; 2 replications clustering in the region of highest likelihood. Fig. 1C 1 shows the end points of the PMCMC replications, which are and θ2 is expected in ½0; 10. We used a uniform distribution on this rectangle to specify the prior for PMCMC and to generate nicely spread around the region of high posterior probability. random starting points for all of the algorithms. We set N = 100 However, Fig. 1D shows that mixing of the PMCMC Markov observations, and we used a Monte Carlo sample size of J = 100 chains was problematic. particles. For IF1 and IF2, we used M = 100 filtering iterations, Application to a Model with initial random walk SD 0.1 decreasing geometrically down to 0.01. For PMCMC, we used 104 filtering iterations with ran- Highly nonlinear, partially observed, stochastic dynamic systems dom walk SD 0.1, awarding PMCMC 100 times the computa- are ubiquitous in the study of biological processes. 
The physical tional resources offered to IF1 and IF2. Independent, normally scale of the systems vary widely from molecular biology (31) to distributed parameter perturbations were used for IF1, IF2, and population ecology and epidemiology (32), but POMP models PMCMC. The random walk SD for PMCMC is not immediately arise naturally at all scales. In the face of biological complexity, comparable to that for IF1 and IF2, since the latter add the noise it is necessary to determine which scientific aspects of a system at each observation time whereas the former adds it only be- are critical for the investigation. Giving consideration to a range tween filtering iterations. All three methods could have their of potential mechanisms, and their interactions, may require parameters fine-tuned, or be modified in other ways to take working with highly parameterized models. Limitations in the

Application to a Cholera Model

Highly nonlinear, partially observed, stochastic dynamic systems are ubiquitous in the study of biological processes. The physical scale of these systems varies widely, from molecular biology (31) to population ecology and epidemiology (32), but POMP models arise naturally at all scales. In the face of biological complexity, it is necessary to determine which scientific aspects of a system are critical for the investigation. Giving consideration to a range of potential mechanisms, and their interactions, may require working with highly parameterized models. Limitations in the available data may result in some combinations of parameters being weakly identifiable. Despite this, other combinations of parameters may be adequately identifiable and give rise to interesting statistical inferences. To demonstrate the capabilities of IF2 for such analyses, we fit a model for cholera epidemics in historic Bengal developed by King et al. (10). The model, the data, and the implementations of IF1 and IF2 used below are all contained in the open source R package pomp (33). The code generating the results in this article is provided as supplementary data (Datasets S1 and S2).

Cholera is a diarrheal disease caused by the bacterial pathogen Vibrio cholerae. Without appropriate medical treatment, severe infections can rapidly result in death by dehydration. Many questions regarding cholera transmission remain unresolved: What is the epidemiological role of free-living environmental vibrio? How important are mild and asymptomatic infections for the transmission dynamics? How long does protective immunity last following infection? The model we consider splits the study population of $P(t)$ individuals into those who are susceptible, $S(t)$, infected, $I(t)$, and recovered, $R(t)$. $P(t)$ is assumed known from census data. To allow flexibility in representing immunity, $R(t)$ is subdivided into $R_1(t), \ldots, R_k(t)$, where we take $k = 3$. Cumulative cholera mortality in each month is tracked with a variable $M(t)$ that resets to zero at the beginning of each observation period. The state process, $\{X(t) = (S(t), I(t), R_1(t), \ldots, R_k(t), M(t)),\; t \geq t_0\}$, follows a stochastic differential equation,

\[
\begin{aligned}
dS &= \big\{ k\epsilon R_k + \delta(P - S) - \lambda(t) S \big\}\, dt + dP - (\sigma S I / P)\, dB, \\
dI &= \big\{ \lambda(t) S - (m + \delta + \gamma) I \big\}\, dt + (\sigma S I / P)\, dB, \\
dR_1 &= \big\{ \gamma I - (k\epsilon + \delta) R_1 \big\}\, dt, \\
&\;\;\vdots \\
dR_k &= \big\{ k\epsilon R_{k-1} - (k\epsilon + \delta) R_k \big\}\, dt,
\end{aligned}
\]

driven by a Brownian motion $\{B(t)\}$. Nonlinearity arises through the force of infection, $\lambda(t)$, specified as

\[
\lambda(t) = \bar\beta \exp\bigg\{ \beta_{\mathrm{trend}}(t - t_0) + \sum_{j=1}^{N_s} \beta_j s_j(t) \bigg\}\, (I/P) + \bar\omega \exp\bigg\{ \sum_{j=1}^{N_s} \omega_j s_j(t) \bigg\}.
\]
where fsjðtÞ; j = 1; ...; Nsg is a periodic cubic B-spline basis; β ; = ; ...; ω ; = ; ...; f j j 1 Nsg model seasonality of transmission; f j j 1 Nsg model seasonality of the environmental reservoir; ω and β −1 are scaling constants set to ω = β = 1 y , and we set Ns = 6. The data, consisting of monthly counts of choleraR mortality, are mod- 2 2 tn eled via Yn ∼ NormalðMn; τ M Þ for Mn = m IðsÞds. n tn−1 The inference goal used to assess IF1 and IF2 is to find high- likelihood parameter values starting from randomly drawn starting values in a large hyperrectangle (Table S1). A single search cannot necessarily be expected to reliably obtain the maximum of the likelihood, due to multimodality, weak identi- Fig. 2. Comparison of IF1 and IF2 on the cholera model. Points are the log fiability, and considerable Monte Carlo error in evaluating the likelihood of the parameter vector output by IF1 and IF2, both started at a uniform draw from a large hyperrectangle (Table S1). Likelihoods were likelihood. Multiple starts and restarts may be needed both for evaluated as the median of 10 particle filter replications (i.e., IF2 applied effective optimization and for assessing the evidence to validate 4 with M = 1 and σ1 = 0) each with J = 2 × 10 particles. Seventeen poorly per- effective optimization. However, optimization progress made on forming searches are off the scale of this plot (15 due to the IF1 estimate, 2 an initial search provides a concrete criterion to compare due to the IF2 estimate). Dotted lines show the maximum log likelihood methodologies. Since IF1 and IF2 have essentially the same reported by ref. 10. computational cost, for a given Monte Carlo sample size and number of iterations, shared fixed values of these algorithmic

parameters provide an appropriate comparison. STATISTICS available data may result in some combinations of parameters Fig. 2 compares results for 100 searches with J = 104 particles being weakly identifiable. Despite this, other combinations of and M = 100 iterations of the search. An initial Gaussian ran- parameters may be adequately identifiable and give rise to some dom walk SD of 0.1 geometrically decreasing down to a final interesting statistical inferences. To demonstrate the capa- value of 0.01 was used for all parameters except S0, I0, R1;0, R2;0, bilities of IF2 for such analyses, we fit a model for cholera and R3;0. For those initial value parameters, the random walk epidemics in historic Bengal developed by King et al. (10). The SD decreased geometrically from 0.2 down to 0.02, but these model, the data, and the implementations of IF1 and IF2 used perturbations were applied only at time t . Since some starting 0 ECOLOGY below are all contained in the open source R package pomp points may lead both IF1 and IF2 to fail to approach the global (33). The code generating the results in this article is provided maximum, Fig. 2 plots the likelihoods of parameter vectors as supplementary data (Datasets S1 and S2). output by IF1 and IF2 for each starting point. Fig. 2 shows that, Cholera is a diarrheal disease caused by the bacterial pathogen on this problem, IF2 is considerably more effective than IF1. Vibrio cholerae. Without appropriate medical treatment, severe This maximization was considered challenging for IF1, and (10) infections can rapidly result in death by dehydration. Many required multiple restarts and refinements of the optimization procedure. Our implementation of PMCMC failed to converge questions regarding cholera transmission remain unresolved: on this inference problem (SI Text, Applying PMCMC to the What is the epidemiological role of free-living environmental Cholera Model and Fig. S2), and we are not aware of any pre- vibrio? How important are mild and asymptomatic infections vious successful PMCMC solution for a comparable situation. for the transmission dynamics? How long does protective im- For IF2, however, this situation appears routine. Some Monte munity last following infection? The model we consider splits Carlo replication is needed because searches occasionally fail up the study population of PðtÞ individuals into those who are to approach the global optimum, but replication is always ap- susceptible, SðtÞ,infected,IðtÞ, and recovered, RðtÞ. PðtÞ is as- propriate for Monte Carlo optimization procedures. sumed known from census data. To allow flexibility in repre- A fair numerical comparison of methods is difficult. For ex- senting immunity, RðtÞ is subdivided into R1ðtÞ; ...; RkðtÞ, ample, it could hypothetically be the case that the algorithmic where we take k = 3. Cumulative cholera mortality in each settings used here favor IF2. However, the settings used are month is tracked with a variable MðtÞ that resets to zero at the those that were developed for IF1 by ref. 10 and reflect con- siderable amounts of trial and error with that method. Likeli- beginning of each observation period. 
The state process, fXðtÞ = hood-based inference for general partially observed nonlinear ðSðtÞ; IðtÞ; R ðtÞ; ...; R ðtÞ; MðtÞÞ; t ≥ t g, follows a stochastic dif- 1 k 0 stochastic dynamic models was considered computationally un- ferential equation, feasible before the introduction of IF1, even in situations con- siderably simpler than the one investigated in this section (19). dS = fkeRk + δðS − HÞ − λðtÞSgdt + dP − ðσSI=PÞdB; We have shown that IF2 offers a substantial improvement on dI = fλðtÞS − ðm + δ + γÞIgdt + ðσSI=PÞdB; IF1, by demonstrating that it functions effectively on a problem at the limit of the capabilities of IF1. dR1 = fγI − ðke + δÞR1gdt; . . Discussion

Discussion

Theorems 1 and 2 assert convergence without giving insights into the rate of convergence. In the particular case of a quadratic log likelihood function and additive Gaussian parameter perturbations, $\lim_{M \to \infty} T_\sigma^M f$ is Gaussian, and explicit calculations are available (SI Text, Gaussian and Near-Gaussian Analysis of Iterated Importance Sampling). If $\log \ell(\theta)$ is close to quadratic and the parameter perturbation is close to additive Gaussian noise, then $\lim_{M \to \infty} T_\sigma^M f$ exists and is close to the limit for the approximating Gaussian system (SI Text, Gaussian and Near-Gaussian Analysis of Iterated Importance Sampling). These Gaussian and near-Gaussian situations also demonstrate that the compactness conditions for Theorem 2 are not always necessary.
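As a concrete instance (our own scalar sketch of the $N = 1$ case with a single perturbation per iteration, not the SI derivation): if $\ell(\theta) \propto \exp\{-(\theta - \hat\theta)^2 / (2\gamma^2)\}$ and the perturbation adds $N(0, \sigma^2)$ noise, then $T_\sigma$ maps a Gaussian density with variance $v$ to a Gaussian whose variance $v'$ satisfies

\[
\frac{1}{v'} = \frac{1}{v + \sigma^2} + \frac{1}{\gamma^2},
\]

so the fixed point is $v^* = \tfrac{\sigma}{2}\big(\sqrt{\sigma^2 + 4\gamma^2} - \sigma\big) \approx \sigma\gamma$ for small $\sigma$, with fixed-point mean exactly $\hat\theta$: the stationary density contracts onto the MLE as $\sigma \to 0$, consistent with Theorem 2.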

In the case $N = 1$, IF2 applies to the more general class of latent variable models. The latent variable model, extended to include a parameter vector that varies over iterations, nevertheless has the formal structure of a POMP in the context of the IF2 algorithm. Some simplifications arise when $N = 1$ (SI Text, Iterated Importance Sampling, Gaussian and Near-Gaussian Analysis, and A Class of Exact Non-Gaussian Limits), but the proofs of Theorems 1 and 2 do not greatly change.

A variation on iterated filtering, making white noise perturbations to the parameter rather than random walk perturbations, has favorable asymptotic properties (3). However, practical algorithms based on this theoretical insight have not yet been published. Our experience suggests that white noise perturbations can be effective in a neighborhood of the MLE but fail to match the performance of IF2 on global optimization problems for complex models.

The main theoretical innovation of this paper is Theorem 2, which does not depend on the specific sequential Monte Carlo filter used in IF2. One could, for example, modify IF2 to use an ensemble Kalman filter (20, 34) or an unscented Kalman filter (35). Or, one could take advantage of variations of sequential Monte Carlo that may improve the numerical performance (36). However, basic sequential Monte Carlo is a general and widely used nonlinear filtering technique that provides a simple yet theoretically supported foundation for the IF2 algorithm. The numerical stability of sequential Monte Carlo for the extended POMP model constructed by IF2 is comparable, in our cholera example, to that for the model with fixed parameters (SI Text, Consequences of Perturbing Parameters for the Numerical Stability of SMC and Fig. S3).

ACKNOWLEDGMENTS. We acknowledge constructive comments by two anonymous referees, the editor, and Joon Ha Park. Funding was provided by National Science Foundation Grants DMS-1308919 and DMS-1106695; National Institutes of Health Grants 1-R01-AI101155, 1-U54-GM111274, and 1-U01-GM110712; and the Research and Policy for Infectious Disease Dynamics program of the Department of Homeland Security and the National Institutes of Health Fogarty International Center.

1. Ionides EL, Bretó C, King AA (2006) Inference for nonlinear dynamical systems. Proc Natl Acad Sci USA 103(49):18438–18443.
2. Ionides EL, Bhadra A, Atchadé Y, King AA (2011) Iterated filtering. Ann Stat 39(3):1776–1802.
3. Doucet A, Jacob PE, Rubenthaler S (2013) Derivative-free estimation of the score vector and observed information matrix with application to state-space models. arXiv:1304.5768.
4. Lindström E, Ionides EL, Frydendall J, Madsen H (2012) Efficient iterated filtering. System Identification (Elsevier, New York), Vol 16, pp 1785–1790.
5. Lele SR, Dennis B, Lutscher F (2007) Data cloning: Easy maximum likelihood estimation for complex ecological models using Bayesian Markov chain Monte Carlo methods. Ecol Lett 10(7):551–563.
6. Lele SR, Nadeem K, Schmuland B (2010) Estimability and likelihood inference for generalized linear mixed models using data cloning. J Am Stat Assoc 105(492):1617–1625.
7. Doucet A, Godsill SJ, Robert CP (2002) Marginal maximum a posteriori estimation using Markov chain Monte Carlo. Stat Comput 12(1):77–84.
8. Gaetan C, Yao J-F (2003) A multiple-imputation Metropolis version of the EM algorithm. Biometrika 90(3):643–654.
9. Jacquier E, Johannes M, Polson N (2007) MCMC maximum likelihood for latent state models. J Econom 137(2):615–640.
10. King AA, Ionides EL, Pascual M, Bouma MJ (2008) Inapparent infections and cholera dynamics. Nature 454(7206):877–880.
11. Laneri K, et al. (2010) Forcing versus feedback: Epidemic malaria and monsoon rains in NW India. PLoS Comput Biol 6(9):e1000898.
12. He D, Ionides EL, King AA (2010) Plug-and-play inference for disease dynamics: Measles in large and small populations as a case study. J R Soc Interface 7(43):271–283.
13. Blackwood JC, Cummings DAT, Broutin H, Iamsirithaworn S, Rohani P (2013) Deciphering the impacts of vaccination and immunity on pertussis epidemiology in Thailand. Proc Natl Acad Sci USA 110(23):9595–9600.
14. Shrestha S, Foxman B, Weinberger DM, Steiner C, Viboud C, Rohani P (2013) Identifying the interaction between influenza and pneumococcal pneumonia using incidence data. Sci Transl Med 5:191ra84.
15. Blake IM, et al. (2014) The role of older children and adults in wild poliovirus transmission. Proc Natl Acad Sci USA 111(29):10604–10609.
16. Bretó C, He D, Ionides EL, King AA (2009) Time series analysis via mechanistic models. Ann Appl Stat 3:319–348.
17. Andrieu C, Doucet A, Holenstein R (2010) Particle Markov chain Monte Carlo methods. J R Stat Soc Ser B 72(3):269–342.
18. Toni T, Welch D, Strelkowa N, Ipsen A, Stumpf MP (2009) Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. J R Soc Interface 6(31):187–202.
19. Wood SN (2010) Statistical inference for noisy nonlinear ecological dynamic systems. Nature 466(7310):1102–1104.
20. Shaman J, Karspeck A (2012) Forecasting seasonal outbreaks of influenza. Proc Natl Acad Sci USA 109(50):20425–20430.
21. Kitagawa G (1998) A self-organising state-space model. J Am Stat Assoc 93:1203–1215.
22. Liu J, West M (2001) Combined parameter and state estimation in simulation-based filtering. Sequential Monte Carlo Methods in Practice, eds Doucet A, de Freitas N, Gordon NJ (Springer, New York), pp 197–224.
23. Arulampalam MS, Maskell S, Gordon N, Clapp T (2002) A tutorial on particle filters for online nonlinear, non-Gaussian Bayesian tracking. IEEE Trans Signal Process 50(2):174–188.
24. Doucet A, de Freitas N, Gordon NJ, eds (2001) Sequential Monte Carlo Methods in Practice (Springer, New York).
25. Ingber L (1993) Simulated annealing: Practice versus theory. Math Comput Model 18:29–57.
26. Le Gland F, Oudjane N (2004) Stability and uniform approximation of nonlinear filters using the Hilbert metric and application to particle filters. Ann Appl Probab 14(1):144–187.
27. Eveson SP (1995) Hilbert's projective metric and the spectral properties of positive linear operators. Proc London Math Soc 3(2):411–440.
28. Del Moral P, Doucet A (2004) Particle motions in absorbing medium with hard and soft obstacles. Stochastic Anal Appl 22(5):1175–1207.
29. Whiteley N, Kantas N, Jasra A (2012) Linear variance bounds for particle approximations of time-homogeneous Feynman–Kac formulae. Stochastic Process Appl 122(4):1840–1865.
30. Bhadra A (2010) Discussion of 'Particle Markov chain Monte Carlo methods' by C. Andrieu, A. Doucet and R. Holenstein. J R Stat Soc B 72:314–315.
31. Wilkinson DJ (2012) Stochastic Modelling for Systems Biology (Chapman & Hall, Boca Raton, FL).
32. Keeling M, Rohani P (2009) Modeling Infectious Diseases in Humans and Animals (Princeton Univ Press, Princeton, NJ).
33. King AA, Ionides EL, Bretó CM, Ellner S, Kendall B (2009) pomp: Statistical inference for partially observed Markov processes (R package). Available at cran.r-project.org/web/packages/pomp.
34. Yang W, Karspeck A, Shaman J (2014) Comparison of filtering methods for the modeling and retrospective forecasting of influenza epidemics. PLOS Comput Biol 10(4):e1003583.
35. Julier S, Uhlmann J (2004) Unscented filtering and nonlinear estimation. Proc IEEE 92(3):401–422.
36. Cappé O, Godsill S, Moulines E (2007) An overview of existing methods and recent advances in sequential Monte Carlo. Proc IEEE 95(5):899–924.
