Iterated Filtering Algorithms

Yves Atchadé

Department of Statistics, University of Michigan. Joint work with Edward Ionides, Dao Nguyen, and Aaron King (University of Michigan).

May 15, 2015

Introduction

- Discrete-time state space model:

\[
X_1 \sim \nu_\theta, \qquad X_k \mid \mathcal{F}_{k-1} \sim p_\theta(\cdot \mid X_{k-1}), \qquad Y_k \mid \{\mathcal{F}_{k-1}, X_k\} \sim p_\theta(\cdot \mid X_k), \qquad k \ge 1.
\]

- The process $\{X_k, k \ge 1\}$ is latent (unobserved); we observe $\{Y_k, k \ge 1\}$.
- The goal is to estimate the parameter $\theta \in \mathbb{R}^p$ from the observations $\{y_k, 1 \le k \le n\}$ of $\{Y_k, 1 \le k \le n\}$.

References:
- E. L. Ionides, A. Bhadra, Y. F. Atchadé and A. King (2011). Iterated filtering. Annals of Statistics 39, 1776-1802.
- E. L. Ionides, D. Nguyen, Y. F. Atchadé, S. Stoev, and A. King (2015). Inference for dynamic and latent variable models via iterated, perturbed Bayes maps. Proceedings of the National Academy of Sciences 112(3), 719-724.

Introduction

- Given data $y_{1:n}$, the log-likelihood function is

\[
\ell_n(\theta) \stackrel{\mathrm{def}}{=} \log p_\theta(y_{1:n}),
\]

where $p_\theta(y_{1:n})$ is the density of $Y_{1:n}$:
\[
p_\theta(y_{1:n}) = \int p_\theta(x_{1:n})\, p_\theta(y_{1:n} \mid x_{1:n})\, dx_{1:n}.
\]

- Our goal is to compute the MLE $\hat\theta_n = \operatorname{Argmax}_{\theta \in \Theta} \ell_n(\theta)$.

- This is a standard but challenging computational problem.

- We are particularly interested in the case where the transition densities $p_\theta(x_k \mid x_{k-1})$ are intractable.

Introduction

Example.

- \[
dX_t = \mu_\theta(X_t)\, dt + C_\theta(X_t)\, dW_t, \qquad Y_t = \int_0^t h_\theta(X_s)\, ds + \epsilon_t.
\]

- We observe $\{Y_{t_1}, \ldots, Y_{t_n}\}$ at non-random times $t_1 < \cdots < t_n$.

- We wish to estimate $\theta$.

- The transition densities of $X_{t_1}, \ldots, X_{t_n}$ are available in closed form only in a few special cases.

Introduction

- There is a large literature on this problem; C. Li (Annals of Statistics, 2013) provides a good review of the main methods.

- A very common strategy is to use an Euler discretization of the SDE, as sketched below. Assuming $t_k = t_{k-1} + \delta$,

\[
X_{t_k} = X_{t_{k-1}} + \delta\, \mu_\theta(X_{t_{k-1}}) + \sqrt{\delta}\, C_\theta(X_{t_{k-1}})\, Z_k.
\]

- This can introduce a large bias if the observation mesh is not fine enough.
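
A minimal Euler-Maruyama simulation sketch (my own illustration; the drift $\mu_\theta$, diffusion $C_\theta$, and the Ornstein-Uhlenbeck example below are hypothetical placeholders, not the talk's model):

```python
import numpy as np

def euler_maruyama(x0, theta, delta, n_steps, mu, C, rng):
    """Simulate X_{t_k} = X_{t_{k-1}} + delta*mu_theta(X) + sqrt(delta)*C_theta(X)*Z_k."""
    x = float(x0)
    path = [x]
    for _ in range(n_steps):
        x = x + delta * mu(x, theta) + np.sqrt(delta) * C(x, theta) * rng.standard_normal()
        path.append(x)
    return np.array(path)

# Example with an Ornstein-Uhlenbeck drift mu_theta(x) = -theta*x and unit diffusion.
rng = np.random.default_rng(0)
path = euler_maruyama(x0=1.0, theta=0.5, delta=0.01, n_steps=1000,
                      mu=lambda x, th: -th * x, C=lambda x, th: 1.0, rng=rng)
```

Making $\delta$ small reduces the discretization bias; using only the coarse observation times $t_1 < \cdots < t_n$ as the mesh is what produces the large bias mentioned above.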

Introduction

- You can improve on this with data augmentation: add additional imputation times $t_i < \kappa_1^{(i)} < \cdots < \kappa_{m_i}^{(i)} < t_{i+1}$ to reduce the discretization error.

- Then estimate $\theta$ while treating the $X_{\kappa_j^{(i)}}$ as nuisance variables.

- However, there are limits to how much imputation can be done.

Introduction

- Recall that the likelihood function is
\[
L_n(\theta) = \int p_\theta(x_{1:n})\, p_\theta(y_{1:n} \mid x_{1:n})\, dx_{1:n}.
\]

- One of the simplest ideas is to approximate the likelihood function by importance sampling (see the sketch at the end of this slide):

\[
\hat L_N(\theta) = \frac{1}{N} \sum_{i=1}^N \frac{p_\theta(y_{1:n} \mid X_{1:n}^{(i)})\, p_\theta(X_{1:n}^{(i)})}{q(X_{1:n}^{(i)})}, \qquad X_{1:n}^{(i)} \stackrel{\mathrm{i.i.d.}}{\sim} q,
\]
where $q$ is a probability measure on the sample space $\mathsf{X}^n$.

- Well-known difficulty: it is hard to choose $q$. The variance can be terribly large, in particular if $q(x_{1:n}) = p_\theta(x_{1:n})$.

- Sequential Monte Carlo can produce better estimates of $\ell_n(\theta)$, but these estimates are typically discontinuous functions of $\theta$.

- Also, approximating and maximizing $L_n(\theta)$ is typically not a stable way of computing the MLE. Solving $\nabla \ell_n(\theta) = 0$ is a more stable problem, where $\nabla \ell_n(\theta) \stackrel{\mathrm{def}}{=} \nabla \log L_n(\theta)$.
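
A minimal sketch of the naive importance-sampling likelihood estimator $\hat L_N(\theta)$ (on the log scale), for a toy Gaussian random-walk model of my own choosing and with the prior $p_\theta(x_{1:n})$ used as the proposal $q$:

```python
import numpy as np

def is_loglik(y, theta, N, rng):
    """Log of the IS estimate of L_n(theta) for a toy model:
       x_1 ~ N(0, theta), x_k = x_{k-1} + N(0, theta), y_k ~ N(x_k, 1).
       With q = p_theta(x_{1:n}), the importance weights reduce to p_theta(y_{1:n} | x_{1:n})."""
    n = len(y)
    steps = rng.normal(0.0, np.sqrt(theta), size=(N, n))
    x = np.cumsum(steps, axis=1)                             # N latent paths drawn from the prior
    logw = -0.5 * np.sum((y - x) ** 2, axis=1) - 0.5 * n * np.log(2 * np.pi)
    m = logw.max()
    return m + np.log(np.mean(np.exp(logw - m)))             # log of (1/N) * sum of weights

rng = np.random.default_rng(1)
y = rng.normal(size=20)                                      # synthetic placeholder data
print(is_loglik(y, theta=1.0, N=10_000, rng=rng))
```

Even on this toy model the weights are extremely variable for moderate $n$, which is the difficulty referred to above.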

Introduction

- Since
\[
L_n(\theta) = \int p_\theta(x_{1:n})\, p_\theta(y_{1:n} \mid x_{1:n})\, dx_{1:n},
\]

we have (also known as Louis's identity):
\[
\nabla \ell_n(\theta) = \int \nabla_\theta \log p_\theta(x_{1:n}, y_{1:n})\; p_\theta(x_{1:n} \mid y_{1:n})\, dx_{1:n}
= \int \nabla_\theta \Big[ \log\big(\nu_\theta(x_1)\, q_\theta(y_1 \mid x_1)\big) + \sum_{j=2}^n \log\big(f_\theta(x_j \mid x_{j-1})\, q_\theta(y_j \mid x_j)\big) \Big]\, p_\theta(x_{1:n} \mid y_{1:n})\, dx_{1:n}.
\]

- If
\[
\hat p_\theta(dx_{1:n} \mid y_{1:n}) = \sum_{i=1}^N W_i\, \delta_{\bar X_{1:n}^{(i)}}(dx_{1:n})
\]

is an SMC approximation of $p_\theta(x_{1:n} \mid y_{1:n})\, dx_{1:n}$, we can easily estimate $\nabla \ell_n(\theta)$ by
\[
\widehat G_N(\theta) = \int \nabla_\theta \log p_\theta(x_{1:n}, y_{1:n})\; \hat p_\theta(dx_{1:n} \mid y_{1:n}).
\]

- Kantas, Doucet, Singh, Maciejowski, and Chopin (arXiv:1412.8695v1, 2014) review the different approaches.

Introduction

- This leads to a stochastic gradient algorithm.

Algorithm (Stochastic Gradient method)

Given $\theta_k$:

- Compute $G_{k+1} = \widehat G_{N_{k+1}}(\theta_k)$.
- Update

\[
\theta_{k+1} = \theta_k + \gamma_{k+1} G_{k+1}
= \theta_k + \gamma_{k+1} \nabla_\theta \ell_n(\theta_k) + \gamma_{k+1}\big(G_{k+1} - \nabla_\theta \ell_n(\theta_k)\big).
\]
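
A schematic Robbins-Monro loop matching the update above, with a synthetic noisy gradient standing in for the SMC estimate $\widehat G_{N_{k+1}}(\theta_k)$ (the quadratic toy objective and all constants are my own illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
theta_star = np.array([1.0, -2.0])           # maximizer of the toy log-likelihood
theta = np.zeros(2)

def noisy_grad(theta, n_particles):
    """Unbiased but noisy gradient of ell(theta) = -0.5 * ||theta - theta_star||^2;
       the noise shrinks like 1/sqrt(n_particles), mimicking a Monte Carlo estimate."""
    return (theta_star - theta) + rng.normal(scale=1.0 / np.sqrt(n_particles), size=2)

for k in range(1, 2001):
    gamma = 1.0 / k                          # gamma_k -> 0 with sum gamma_k = infinity
    G = noisy_grad(theta, n_particles=100)
    theta = theta + gamma * G                # theta_{k+1} = theta_k + gamma_{k+1} G_{k+1}

print(theta)                                 # close to theta_star
```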

Introduction

- Typically, for convergence one chooses $\gamma_k \to 0$, $\sum_k \gamma_k = \infty$, and $\sum_k \gamma_k / N_k < \infty$, where $N_k$ is the number of particles used at iteration $k$.

- However, if $\ell_n$ is strongly concave and has a Lipschitz gradient, one can do much better with a fixed step size $\gamma = 2/(\mu + L)$. Then $N_k \uparrow \infty$ is enough for convergence.

- However, these SMC estimates of $G_k$ are not feasible if the transition densities $p_\theta(x_k \mid x_{k-1})$ cannot be computed.

Introduction

- This talk presents a series of algorithms (IF1, IF1+, and IF2) that solve this problem without having to evaluate the transition densities $p_\theta(x_k \mid x_{k-1})$.

- We only need the ability to simulate from the transition densities.

- I will present some convergence results informally. We are still working to say more about these algorithms, particularly in the non-concave case and on comparing the schemes.

- I will close with some simulation results.

Simplified version

- A standard latent variable model: $y \sim p_\theta(\cdot)$, $\Theta \subseteq \mathbb{R}^d$, where
\[
p_\theta(y) = \int_{\mathsf{X}} \underbrace{f_\theta(y \mid x)\, p_\theta(x)}_{\bar p_\theta(y,\, x)}\, dx.
\]

- Given the observation $y$, we wish to compute
\[
\hat\theta \stackrel{\mathrm{def}}{=} \operatorname{Argmax}_{\theta \in \Theta} \ell(\theta), \qquad \text{where } \ell(\theta) = \log \int_{\mathsf{X}} \bar p_\theta(y, x)\, dx.
\]

- $p_\theta(x)$ cannot be computed, but it is easy to draw samples from it.

Iterated importance sampling

- Let $K$ be a density on $\Theta$ with $\int_\Theta u\, K(u)\, du = 0$ and $\int_\Theta u u'\, K(u)\, du = I_d$. Think of $\Theta = \mathbb{R}^d$ and $K$ the density of $N(0, I_d)$.
- Let $\ell : \Theta \to \mathbb{R}$ be some arbitrary log-likelihood function.

- For $\sigma > 0$ and $\theta \in \Theta$, consider the probability measure

\[
\pi_{\theta,\sigma}(du) = \frac{\frac{1}{\sigma^d} K\!\left(\frac{u - \theta}{\sigma}\right) e^{\ell(u)}\, du}{\int \frac{1}{\sigma^d} K\!\left(\frac{z - \theta}{\sigma}\right) e^{\ell(z)}\, dz}.
\]

- $\pi_{\theta,\sigma}(\cdot)$ is the posterior distribution of a model where the log-likelihood is $\ell$ and the prior is $\frac{1}{\sigma^d} K\!\left(\frac{u - \theta}{\sigma}\right)$ (the density of $\theta + \sigma Z$, with $Z \sim K$).

Iterated importance sampling

- Statisticians are familiar with the idea of the posterior distribution as a compromise between prior and likelihood.

- The next result should be very intuitive to a Bayesian.

Theorem. Fix $\sigma > 0$, $\theta \in \Theta$. Under appropriate smoothness assumptions on $\ell$,
\[
\int u\, \pi_{\theta,\sigma}(du) = \theta + \sigma^2 \nabla_\theta \ell(\theta) + \sigma^3 \epsilon(\sigma, \theta),
\]

where $\epsilon$ is uniformly bounded for $\sigma$ in a neighborhood of $0$ and for $\theta$ in a compact subset of $\Theta$.
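
A quick numerical check of this expansion, using a one-dimensional toy log-likelihood (my own choice) and quadrature on a grid; $(\int u\,\pi_{\theta,\sigma}(du) - \theta)/\sigma^2$ should be close to $\ell'(\theta)$:

```python
import numpy as np

ell  = lambda u: -0.25 * u**4 + u            # toy concave log-likelihood (an assumption)
dell = lambda u: -u**3 + 1                   # its exact derivative

theta, sigma = 0.5, 0.05
u = np.linspace(theta - 1.0, theta + 1.0, 20001)           # quadrature grid
log_post = ell(u) - 0.5 * ((u - theta) / sigma) ** 2       # ell(u) + log N(u; theta, sigma^2), up to a constant
w = np.exp(log_post - log_post.max())
post_mean = np.sum(u * w) / np.sum(w)

print((post_mean - theta) / sigma**2)        # ~0.87, close to dell(theta) = 0.875
print(dell(theta))
```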

Iterated importance sampling

- The remainder can be improved to $O(\sigma^4)$ if we use a density $K$ with zero third moments (Doucet, Jacob and Rubenthaler, arXiv:1304.5768v2, 2013).
- If we ignore the remainder $\sigma^3 \epsilon(\sigma, \theta)$, we have
\[
\int u\, \pi_{\theta,\sigma}(du) \approx \theta + \sigma^2 \nabla_\theta \ell(\theta).
\]

- For $\sigma$ small we can therefore approximate $\nabla_\theta \ell(\theta)$ by
\[
\frac{1}{\sigma^2} \int (u - \theta)\, \pi_{\theta,\sigma}(du).
\]

Iterated importance sampling

- Consider now the case where
\[
\ell(\theta) = \log \int_{\mathsf{X}} \bar p_\theta(y, x)\, dx.
\]

- Fix $\theta$. Pretend that the (unobserved) signal is $(\check\Theta, \check X)$, where $\check\Theta = \theta + \sigma Z$ and $\check X \sim p_{\check\Theta}(\cdot)$, and that we observe $\check Y \sim f_{\check\Theta}(\cdot \mid \check X)$, with $Z \sim K$.

- Using the observation $y$, we attempt to recover $\check\Theta$. The previous result says:

\[
\sigma^{-2}\big( \mathbb{E}\big[\check\Theta \mid \check Y = y\big] - \theta \big) = \nabla \ell(\theta) + O(\sigma).
\]

Iterated importance sampling

- Since
\[
\mathbb{E}\big[ \sigma^{-2}\big(\check\Theta - \theta\big) \mid \check Y = y \big] = \nabla \ell(\theta) + O(\sigma),
\]
we can approximate $\nabla \ell(\theta)$ by importance sampling.

- Let $Z_{1:N} \stackrel{\mathrm{i.i.d.}}{\sim} K$. Set $\check\Theta_k = \theta + \sigma Z_k$, and draw $\check X_k \mid \check\Theta_k \sim p_{\check\Theta_k}(\cdot)$.
- Set
\[
W_k = \frac{f_{\check\Theta_k}(y \mid \check X_k)}{\sum_{j=1}^N f_{\check\Theta_j}(y \mid \check X_j)}.
\]

Proposition. As $N \to \infty$,
\[
\sum_{k=1}^N W_k\, \frac{\check\Theta_k - \theta}{\sigma^2}
\]
converges in probability to $\nabla \ell(\theta)$.
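
A minimal sketch of this estimator on a toy model of my own choosing, for which the exact score is available as a check ($x \sim N(\theta, 1)$, $y \mid x \sim N(x, 1)$, so $\nabla\ell(\theta) = (y - \theta)/2$):

```python
import numpy as np

def grad_is(y, theta, sigma, N, rng):
    """Estimate sum_k W_k (Theta_k - theta) / sigma^2 for the toy model above."""
    Theta = theta + sigma * rng.standard_normal(N)      # Theta_k = theta + sigma Z_k
    X = Theta + rng.standard_normal(N)                  # X_k ~ p_{Theta_k} = N(Theta_k, 1)
    logw = -0.5 * (y - X) ** 2                          # f_{Theta_k}(y | X_k), up to a constant
    W = np.exp(logw - logw.max())
    W /= W.sum()                                        # self-normalized weights
    return np.sum(W * (Theta - theta)) / sigma**2

rng = np.random.default_rng(3)
y, theta = 1.3, 0.0
print(grad_is(y, theta, sigma=0.1, N=200_000, rng=rng))  # roughly (y - theta)/2 = 0.65
print((y - theta) / 2)
```

Note the $1/\sigma^2$ factor: shrinking $\sigma$ reduces the $O(\sigma)$ bias but inflates the Monte Carlo variance, the trade-off the Iterated IS algorithm below has to balance.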

Iterated importance sampling

Algorithm (Iterated IS)

Given $\theta_k$:
- Generate $Z_{1:N_{k+1}} \stackrel{\mathrm{i.i.d.}}{\sim} K(\cdot)$. Set $\check\Theta^{(j)} = \theta_k + \sigma_k Z^{(j)}$, and draw $\check X^{(j)} \mid \check\Theta^{(j)} \sim p_{\check\Theta^{(j)}}(\cdot)$.
- Set

\[
W^{(j)} = \frac{f_{\check\Theta^{(j)}}(y \mid \check X^{(j)})}{\sum_{i=1}^{N_{k+1}} f_{\check\Theta^{(i)}}(y \mid \check X^{(i)})},
\qquad
\widehat G_{k+1} = \sigma_k^{-2} \sum_{j=1}^{N_{k+1}} W^{(j)}\big(\check\Theta^{(j)} - \theta_k\big).
\]

- $\theta_{k+1} = \theta_k + \gamma_{k+1} \widehat G_{k+1}$.
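
A self-contained sketch of the whole Iterated IS loop on the same toy model ($x \sim N(\theta, 1)$, $y \mid x \sim N(x, 1)$, MLE $\hat\theta = y$); the schedules $\gamma_k$, $\sigma_k$, $N_k$ are my own illustrative choices, picked so that the conditions on the next slide hold:

```python
import numpy as np

rng = np.random.default_rng(4)
y, theta = 1.3, -2.0                                        # observation and starting point

for k in range(1, 101):
    gamma = 2.0 / k                                         # sum gamma_k = infinity
    sigma = 0.3 / np.sqrt(k)                                # sum gamma_k * sigma_k < infinity
    N = int(500 * k**1.5)                                   # sum gamma_k / (sigma_k^2 N_k) < infinity
    Theta = theta + sigma * rng.standard_normal(N)          # Theta^(j) = theta_k + sigma_k Z^(j)
    X = Theta + rng.standard_normal(N)                      # X^(j) ~ N(Theta^(j), 1)
    logw = -0.5 * (y - X) ** 2
    W = np.exp(logw - logw.max())
    W /= W.sum()
    G = np.sum(W * (Theta - theta)) / sigma**2              # IS estimate of the score at theta_k
    theta = theta + gamma * G                               # theta_{k+1} = theta_k + gamma_{k+1} G_hat_{k+1}

print(theta)                                                # close to the MLE y = 1.3
```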

Iterated importance sampling

\[
\begin{aligned}
\theta_{k+1} &= \theta_k + \gamma_{k+1} \widehat G_{k+1} \\
&= \theta_k + \gamma_{k+1} \nabla \ell(\theta_k)
+ \gamma_{k+1}\Big( \widehat G_{k+1} - \frac{1}{\sigma_k^2} \int (u - \theta_k)\, \pi_{\theta_k,\sigma_k}(du) \Big)
+ \gamma_{k+1}\Big( \frac{1}{\sigma_k^2} \int (u - \theta_k)\, \pi_{\theta_k,\sigma_k}(du) - \nabla \ell(\theta_k) \Big) \\
&= \theta_k + \gamma_{k+1} \nabla \ell(\theta_k) + \gamma_{k+1}\epsilon^{(1)}_{k+1} + \gamma_{k+1}\epsilon^{(2)}_{k+1}.
\end{aligned}
\]

- For convergence, we roughly need $\sigma_k \downarrow 0$,

\[
\sum_k \gamma_k = \infty, \qquad \sum_k \gamma_{k+1}\sigma_k < \infty, \qquad \sum_k \frac{\gamma_{k+1}}{\sigma_k^2 N_k} < \infty.
\]

Iterated importance sampling

Theorem. Under some regularity conditions, and assuming

\[
\sum_k \gamma_k = \infty, \qquad \sum_k \gamma_{k+1}\sigma_k < \infty, \qquad \sum_k \frac{\gamma_{k+1}}{\sigma_k^2 N_k} < \infty,
\]

the sequence $\{\theta_n, n \ge 1\}$ produced by the Iterated IS algorithm converges almost surely to $\{\theta \in \mathbb{R}^d : \nabla \ell(\theta) = 0\}$.

- Regularity condition: we need $\{\theta_n, n \ge 0\}$ to remain in a compact set.

Iterated filtering

- We now consider a state space model $\{(X_i, Y_i), 1 \le i \le n\}$, where $X_1 \sim \nu_\theta$, $X_i \mid X_{i-1} \sim p_\theta(x_i \mid x_{i-1})$, and $Y_i \mid Y_{1:i-1}, X_{1:i} \sim p_\theta(y_i \mid x_i)$.

- Given observations $y = (y_1, \ldots, y_n)$, the log-likelihood function of $\theta$ is

\[
\ell_n(\theta) = \log \int \nu_\theta(x_1)\, p_\theta(y_1 \mid x_1) \prod_{i=2}^n p_\theta(x_i \mid x_{i-1})\, p_\theta(y_i \mid x_i)\, dx_{1:n}.
\]

Iterated filtering

- Fix $\theta$. Consider the extended state space model:

\[
\check\Theta_1 = \theta + \tau Z_1, \qquad \check X_1 \sim \nu_{\check\Theta_1}(\cdot), \qquad \check Y_1 \sim p_{\check\Theta_1}(\cdot \mid \check X_1),
\]

and for $i = 2, \ldots, n$,
\[
\check\Theta_i = \check\Theta_{i-1} + \sigma Z_i, \qquad \check X_i \mid \mathcal{F}_{i-1} \sim p_{\check\Theta_i}(\cdot \mid \check X_{i-1}), \qquad \check Y_i \mid \mathcal{F}_{i-1}, \check X_i \sim p_{\check\Theta_i}(\cdot \mid \check X_i),
\]

where $Z_1, \ldots, Z_n$ are i.i.d. $K$.
- We look at $\mathbb{E}\big[\check\Theta_n \mid \check Y_{1:n} = y_{1:n}\big]$.

Iterated filtering

Lemma. Assume $\ell_n$ has derivatives up to order 3 that are bounded on compact sets. Then

\[
\mathbb{E}\big[\check\Theta_n \mid \check Y_{1:n} = y_{1:n}\big] - \theta - \tau^2 \nabla \ell_n(\theta) = O\!\Big(\tau + \frac{\sigma^2}{\tau^2}\Big).
\]

We can easily combine this with a bootstrap particle filter to estimate the MLE.

Iterated Filtering

Algorithm (IF1)

Given $\theta_k$:
1. Approximate the mean of the filtering distribution in the following state space model, using $N_{k+1}$ particles:

\[
\check\Theta_1 = \theta_k + \tau_{k+1} Z_1, \qquad \check X_1 \sim \nu_{\check\Theta_1}(\cdot), \qquad \check Y_1 \sim p_{\check\Theta_1}(\cdot \mid \check X_1),
\]

and for $i = 2, \ldots, n$,
\[
\check\Theta_i = \check\Theta_{i-1} + \sigma_{k+1} Z_i, \qquad \check X_i \mid \mathcal{F}_{i-1} \sim p_{\check\Theta_i}(\cdot \mid \check X_{i-1}), \qquad \check Y_i \mid \mathcal{F}_{i-1}, \check X_i \sim p_{\check\Theta_i}(\cdot \mid \check X_i).
\]

2. Compute
\[
\theta_{k+1} = \theta_k + \gamma_{k+1}\left( \frac{G_{k+1} - \theta_k}{\sigma_k^2} \right),
\]

where $G_{k+1}$ is the filtered mean obtained in Step 1.

Iterated Filtering

Algorithm (Iterated IF1)

Given $\theta_k$:
- For $j = 1, \ldots, N_{k+1}$: set $\vartheta_1^{(j)} = \theta_k + \tau_{k+1} Z_1^{(j)}$, draw $\tilde X_1^{(j)} \sim \nu_{\vartheta_1^{(j)}}(\cdot)$, and compute $w_1^{(j)} \propto p_{\vartheta_1^{(j)}}(y_1 \mid \tilde X_1^{(j)})$, normalized so that $\sum_{j=1}^{N_{k+1}} w_1^{(j)} = 1$.

- Resample $(\check\Theta_1^{(j)}, \check X_1^{(j)}) \stackrel{\mathrm{i.i.d.}}{\sim} \sum_{j=1}^{N_{k+1}} w_1^{(j)}\, \delta_{(\vartheta_1^{(j)},\, \tilde X_1^{(j)})}(\cdot)$.
- For $i = 2, \ldots, n$:
  - For $j = 1, \ldots, N_{k+1}$: set $\vartheta_i^{(j)} = \check\Theta_{i-1}^{(j)} + \sigma_{k+1} Z_i^{(j)}$ and draw $\tilde X_i^{(j)} \sim p_{\vartheta_i^{(j)}}(\cdot \mid \check X_{i-1}^{(j)})$. Set $\vartheta_{1:i}^{(j)} = (\check\Theta_{1:i-1}^{(j)}, \vartheta_i^{(j)})$, $\tilde X_{1:i}^{(j)} = (\check X_{1:i-1}^{(j)}, \tilde X_i^{(j)})$, and compute $w_i^{(j)} \propto p_{\vartheta_i^{(j)}}(y_i \mid \tilde X_i^{(j)})$.
  - Resample $(\check\Theta_{1:i}^{(j)}, \check X_{1:i}^{(j)}) \stackrel{\mathrm{i.i.d.}}{\sim} \sum_{j=1}^{N_{k+1}} w_i^{(j)}\, \delta_{(\vartheta_{1:i}^{(j)},\, \tilde X_{1:i}^{(j)})}(\cdot)$.
- Update
\[
\theta_{k+1} = \theta_k + \gamma_{k+1}\left( \frac{1}{N_{k+1}} \sum_{j=1}^{N_{k+1}} \check\Theta_n^{(j)} - \theta_k \right).
\]
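
A self-contained IF1 sketch on a toy state space model of my own choosing (an AR(1) latent state observed with noise through a level parameter $\theta$); the transition density is only simulated, never evaluated. I use the initial perturbation scale $\tau$ in the update in place of the $\sigma_k^2$ written above, since it plays that role in this simplified setting; all schedules and constants are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy model: x_1 ~ N(0, 1), x_i = 0.8 x_{i-1} + N(0, 0.6^2), y_i ~ N(x_i + theta, 1); true theta = 1.
n, theta_true = 100, 1.0
x = np.zeros(n)
x[0] = rng.normal()
for i in range(1, n):
    x[i] = 0.8 * x[i - 1] + rng.normal(scale=0.6)
y = x + theta_true + rng.normal(size=n)

def perturbed_filter_mean(theta, tau, sigma, N):
    """Bootstrap particle filter for the extended model with a random-walk parameter:
       vartheta_1 = theta + tau Z_1, vartheta_i = vartheta_{i-1} + sigma Z_i.
       Returns the filtered mean of vartheta_n given y_{1:n}."""
    vtheta = theta + tau * rng.standard_normal(N)
    xs = rng.standard_normal(N)                               # X_1 ~ N(0, 1)
    for i in range(n):
        if i > 0:
            vtheta = vtheta + sigma * rng.standard_normal(N)  # parameter perturbation
            xs = 0.8 * xs + rng.normal(scale=0.6, size=N)     # state transition (simulation only)
        logw = -0.5 * (y[i] - xs - vtheta) ** 2               # measurement density, up to a constant
        w = np.exp(logw - logw.max())
        w /= w.sum()
        idx = rng.choice(N, size=N, p=w)                      # multinomial resampling
        vtheta, xs = vtheta[idx], xs[idx]
    return vtheta.mean()

theta, tau, sigma = 0.0, 0.1, 0.005
for k in range(1, 51):
    gamma = 0.1 / k
    G = perturbed_filter_mean(theta, tau, sigma, N=2000)
    theta = theta + gamma * (G - theta) / tau**2              # (G - theta)/tau^2 approximates the score

print(theta)   # settles near the maximizer of the likelihood for this data set (theta_true = 1)
```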

Iterated Filtering

We have a similar result.

Theorem. Under some regularity conditions, and assuming

\[
\sum_k \gamma_k = \infty, \qquad \sum_k \gamma_{k+1}\sigma_k < \infty, \qquad \sum_k \frac{\gamma_{k+1}}{\sigma_k^2 N_k} < \infty,
\]

the sequence $\{\theta_n, n \ge 1\}$ produced by the IF1 algorithm converges almost surely to $\{\theta \in \mathbb{R}^d : \nabla \ell_n(\theta) = 0\}$.

Iterated filtering

- The issue with the algorithm is that the function $\sigma_k^{-2}(\theta - \theta_k)$ grows as $k \to \infty$.

- Hence (as reflected in the condition $\sum_k \gamma_{k+1} / (\sigma_k^2 N_k) < \infty$), we need to choose $\gamma_k$ small or increase the number of particles rapidly.

- It turns out that the IF algorithm is related to another well-known algorithm: the proximal-point algorithm.

- This connection can be exploited to build a better IF algorithm: IF1+.

Iterated filtering+

- Back to the simplified latent variable model: $y \sim p_\theta(\cdot)$, $\Theta \subseteq \mathbb{R}^d$, where
\[
p_\theta(y) = \int_{\mathsf{X}} \underbrace{f_\theta(y \mid x)\, p_\theta(x)}_{\bar p_\theta(y,\, x)}\, dx.
\]
- IF1 iteratively computes $\mathbb{E}\big[\check\Theta \mid \check Y = y\big]$ in the model
\[
\check\Theta = \theta + \sigma Z, \qquad \check X \sim p_{\check\Theta}(\cdot), \qquad \check Y \sim f_{\check\Theta}(\cdot \mid \check X),
\]

where $Z \sim N(0, I_d)$.
- It then estimates $\nabla \ell(\theta)$ by $\sigma^{-2}\big( \mathbb{E}\big[\check\Theta \mid \check Y = y\big] - \theta \big)$.

Iterated filtering+

- The posterior mean $\mathbb{E}\big[\check\Theta \mid \check Y = y\big]$ is the mean of the density
\[
\pi(u \mid y) \propto \exp\Big( \ell(u) - \frac{1}{2\sigma^2}\|u - \theta\|^2 \Big).
\]

- This mean should be close to the mode of that distribution, which is given by the proximal operator of $\ell$:

\[
\operatorname{Prox}_\sigma(\theta) \stackrel{\mathrm{def}}{=} \operatorname{Argmin}_u \Big\{ -\ell(u) + \frac{1}{2\sigma^2}\|u - \theta\|^2 \Big\}.
\]

- The proximal operator is a fundamental tool in convex optimization for finding critical points of non-smooth functions.

Iterated filtering+

\[
\operatorname{Prox}_\sigma(\theta) \stackrel{\mathrm{def}}{=} \operatorname{Argmin}_u \Big\{ -\ell(u) + \frac{1}{2\sigma^2}\|u - \theta\|^2 \Big\}.
\]

- If $\ell$ is bounded above then $\operatorname{Prox}_\sigma(\theta)$ is non-empty, and it is unique if $\ell$ is concave.

- Any fixed point of $\operatorname{Prox}_\sigma$ is a critical point of $\ell$.

- If $\ell$ is strongly concave then $\operatorname{Prox}_\sigma$ is a contraction.
- If $\ell$ is proper and concave, and $\sum_k \sigma_k^2 = \infty$, then the sequence $\theta_{k+1} = \operatorname{Prox}_{\sigma_k}(\theta_k)$ converges to a minimizer of $-\ell$.

- Notice that $\theta_{k+1} = \operatorname{Prox}_\sigma(\theta_k)$ is equivalent to

\[
\theta_{k+1} = \theta_k + \sigma^2 \nabla \ell(\theta_{k+1}),
\]

an “implicit gradient descent” step, which typically has better properties than the usual gradient descent.
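
A tiny numerical illustration of the proximal-point (implicit gradient) iteration, for a toy concave $\ell$ of my own choosing, with the prox computed by brute-force grid search:

```python
import numpy as np

ell = lambda u: -(u - 2.0) ** 2                  # toy strongly concave log-likelihood, maximum at u = 2
sigma = 0.5
grid = np.linspace(-5.0, 5.0, 20001)

def prox(theta):
    """Prox_sigma(theta) = argmin_u { -ell(u) + (u - theta)^2 / (2 sigma^2) }, by grid search."""
    return grid[np.argmin(-ell(grid) + (grid - theta) ** 2 / (2 * sigma**2))]

theta = -3.0
for _ in range(100):
    theta = prox(theta)                          # theta_{k+1} = Prox_sigma(theta_k)
print(theta)                                     # converges to the maximizer u = 2

# Each step solves theta_{k+1} = theta_k + sigma^2 * ell'(theta_{k+1}),
# the implicit-gradient form noted above.
```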

Iterated filtering+

Hence, instead of using the particle filter to estimate the gradient, we use it to approximate the proximal operator.

Algorithm (IF1+)

Given $\theta_k$:
1. Approximate the mean of the filtering distribution in the following state space model, using $N_{k+1}$ particles:

\[
\check\Theta_1 = \theta_k + \tau_{k+1} Z_1, \qquad \check X_1 \sim \nu_{\check\Theta_1}(\cdot), \qquad \check Y_1 \sim p_{\check\Theta_1}(\cdot \mid \check X_1),
\]

and for $i = 2, \ldots, n$,
\[
\check\Theta_i = \check\Theta_{i-1} + \sigma_{k+1} Z_i, \qquad \check X_i \mid \mathcal{F}_{i-1} \sim p_{\check\Theta_i}(\cdot \mid \check X_{i-1}), \qquad \check Y_i \mid \mathcal{F}_{i-1}, \check X_i \sim p_{\check\Theta_i}(\cdot \mid \check X_i).
\]

2. Set
\[
\theta_{k+1} = G_{k+1},
\]

where $G_{k+1}$ is the filtered mean obtained in Step 1.

Iterated filtering+

- One can view this algorithm as a noisy proximal iteration:

\[
\theta_{k+1} = \operatorname{Prox}_{\sigma_k}(\theta_k) + \epsilon_{k+1}.
\]

- Can we prove convergence by leveraging known results on the proximal iteration?

IF2

- IF1+ is related to an algorithm fairly popular in ecology known as data cloning (Lele et al. (2007), Ecology Letters).

- Let $\ell$ be a log-likelihood and $p$ a prior. Define

\[
\pi_k(\theta) \propto p(\theta)\, e^{k \ell(\theta)}.
\]

- If $X_k \sim \pi_k$, then $X_k$ converges in probability to $\operatorname{Argmax}_\theta \ell(\theta)$. This is basically simulated annealing.

- In the latent variable model, $\pi_k$ is the marginal of

\[
\pi(\theta, x_1, \ldots, x_k \mid y) \propto p(\theta) \prod_{i=1}^k f_\theta(x_i)\, f_\theta(y \mid x_i).
\]

IF2

- Data cloning and IF1+ suggest IF2.

- For any $f \in L^1(\mathsf{X})$, define $f_\sigma(u) \stackrel{\mathrm{def}}{=} \int f(z)\, K_\sigma(z, u)\, dz$.

- The Bayes operator associated with $\ell$ is the map $B$ that transforms $f \in L^1(\mathsf{X})$ into $B[f] \in L^1(\mathsf{X})$, where

\[
B[f](\theta) = \frac{f(\theta)\, e^{\ell(\theta)}}{\int_{\mathsf{X}} e^{\ell(\theta)}\, f(\theta)\, d\theta}, \qquad \theta \in \mathsf{X}.
\]

- Finally, we consider the transformation $T_\sigma$ that maps $f \in L^1(\mathsf{X})$ into

\[
T_\sigma[f] = B[f_\sigma]. \tag{1}
\]
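
A grid-based sketch of iterating the perturbed Bayes map $T_\sigma$ for a toy one-dimensional log-likelihood (my own choice); after enough iterations the density concentrates around the maximizer, consistent with the theorem on the next slide:

```python
import numpy as np

ell = lambda u: -2.0 * (u - 1.0) ** 2              # toy log-likelihood, maximized at theta_hat = 1

grid = np.linspace(-4.0, 6.0, 2001)
du = grid[1] - grid[0]
sigma = 0.1

# K[i, j] = N(grid[j]; grid[i], sigma^2): the Gaussian perturbation kernel K_sigma(z, u)
K = np.exp(-0.5 * ((grid[None, :] - grid[:, None]) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

f = np.ones_like(grid)
f /= f.sum() * du                                  # flat starting density

for _ in range(50):
    f_sigma = (f * du) @ K                         # f_sigma(u) = int f(z) K_sigma(z, u) dz
    f = f_sigma * np.exp(ell(grid))                # Bayes step: multiply by the likelihood ...
    f /= f.sum() * du                              # ... and renormalize: T_sigma[f] = B[f_sigma]

print(np.sum(grid * f) * du)                       # mean of T_sigma^50[f], close to theta_hat = 1
```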

IF2

Theorem. Assume $\ell$ is concave with a Lipschitz gradient. Suppose $\Theta_n \sim T_\sigma^n[f]$. There exist $\sigma_0 > 0$ and a finite constant $C$ such that, for $\sigma \in (0, \sigma_0]$,

\[
\mathbb{E}\big[\|\Theta_n - \hat\theta\|^2\big] \le C\sigma, \qquad \text{for all } n \ge \lceil 1/\sigma \rceil.
\]

Iterated filtering 2

Algorithm (IF2)

For $k = 1, 2, \ldots, K$:

1. Given $\hat p_k$, let $\hat p_{k+1}$ be the SMC approximation of the filtering distribution in the following state space model, using $N_{k+1}$ particles:

\[
\check\Theta_1 \sim \hat p_k, \qquad \check X_1 \sim \nu_{\check\Theta_1}(\cdot), \qquad \check Y_1 \sim p_{\check\Theta_1}(\cdot \mid \check X_1),
\]

and for $i = 2, \ldots, n$,
\[
\check\Theta_i = \check\Theta_{i-1} + \sigma_{k+1} Z_i, \qquad \check X_i \mid \mathcal{F}_{i-1} \sim p_{\check\Theta_i}(\cdot \mid \check X_{i-1}), \qquad \check Y_i \mid \mathcal{F}_{i-1}, \check X_i \sim p_{\check\Theta_i}(\cdot \mid \check X_i).
\]

Return $\hat p_K$.
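
A self-contained IF2 sketch, reusing the same toy state space model as in the IF1 sketch above; here the whole parameter swarm (rather than a single point estimate) is carried from one filtering pass to the next, with the perturbation scale $\sigma_m$ shrinking across passes. All model and tuning choices are my own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy model: x_1 ~ N(0, 1), x_i = 0.8 x_{i-1} + N(0, 0.6^2), y_i ~ N(x_i + theta, 1); true theta = 1.
n, theta_true = 100, 1.0
x = np.zeros(n)
x[0] = rng.normal()
for i in range(1, n):
    x[i] = 0.8 * x[i - 1] + rng.normal(scale=0.6)
y = x + theta_true + rng.normal(size=n)

N, n_pass = 2000, 40
swarm = rng.uniform(-3.0, 3.0, size=N)                     # initial parameter swarm (p_hat_0)
for m in range(n_pass):
    sigma = 0.1 * 0.95 ** m                                # perturbation scale shrinks across passes
    vtheta = swarm.copy()                                  # Theta_1^(j) ~ p_hat_m
    xs = rng.standard_normal(N)                            # X_1 ~ N(0, 1)
    for i in range(n):
        if i > 0:
            vtheta = vtheta + sigma * rng.standard_normal(N)
            xs = 0.8 * xs + rng.normal(scale=0.6, size=N)  # transitions are only simulated
        logw = -0.5 * (y[i] - xs - vtheta) ** 2
        w = np.exp(logw - logw.max())
        w /= w.sum()
        idx = rng.choice(N, size=N, p=w)
        vtheta, xs = vtheta[idx], xs[idx]
    swarm = vtheta                                         # p_hat_{m+1}: filtered parameter swarm

print(swarm.mean(), swarm.std())   # the swarm concentrates near the MLE as sigma_m decreases
```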

IF2

- Each iteration of IF2 is a Monte Carlo approximation to the map $T_\sigma$.

- $f$ and $T_\sigma[f]$ approximate the initial and final densities of the IF2 parameter swarm.

- When the standard deviation of the parameter perturbations is held fixed at $\sigma_m = \sigma > 0$, IF2 is a Monte Carlo approximation to $T_\sigma^K[f]$.

- Standard SMC theory can be used to show that $\hat p_K$ approximates $T_\sigma^K[f]$.

- In the case $\sigma = 0$, the iterated Bayes map corresponds to the data cloning approach of Lele et al. (2007).

- Taking $\sigma \neq 0$ adds numerical stability, which is necessary for the convergence of the SMC approximations.

Numerical examples

- We compare IF1, IF2, and the particle Markov chain Monte Carlo (PMCMC) method of Andrieu et al. (2010).

- PMCMC is an SMC-based plug-and-play algorithm for full-information Bayesian inference on POMPs.

- Computations were done using the pomp R package: King, Nguyen & Ionides (2015). Statistical inference for partially observed Markov processes via the R package pomp. To appear in the Journal of Statistical Software.

- Data and code reproducing our results are provided as a supplement to Ionides, Nguyen, Atchadé, Stoev & King (2015). Inference for dynamic and latent variable models via iterated, perturbed Bayes maps. Proceedings of the National Academy of Sciences of the USA.

Toy example

- $X(t) = \big( \exp\{\theta_1\},\ \theta_2 \exp\{\theta_1\} \big)$, constant for all $t$.

- 100 independent observations: given $X(t) = x$,
\[
Y_n \sim \mathrm{Normal}\!\left( x,\ \begin{pmatrix} 100 & 0 \\ 0 & 1 \end{pmatrix} \right).
\]

[Figure: A. IF1 point estimates from 30 replications and the MLE (green triangle). B. IF2 point estimates from 30 replications and the MLE (green triangle). C. Final parameter values of 30 PMCMC chains with $10^4$ filtering iterations. D. Kernel density estimates from 8 of these 30 PMCMC chains, and the true posterior distribution (dotted black line).]

Why is IF2 so much better than IF1 on this problem?

- IF1 updates the parameters by a linear combination of filtered parameter estimates for the extended model with time-varying parameters.

- Taking linear combinations can knock the optimizer off nonlinear ridges of the likelihood function.

- IF2 does not have this vulnerability.

- A heuristic argument suggests that IF2 has second-order convergence (as if the first and second derivatives were computable) whereas IF1 has first-order convergence. Formalizing this is an open problem.

Epidemiological applications: a review of disease dynamics

- Communicable diseases have long had a major global health impact (tuberculosis, among others).

- Emerging diseases need to be understood and controlled (HIV, Ebola, bird flu, SARS, etc.).

- Central to mathematical models is an infected population, $I(t)$, which interacts with a susceptible population, $S(t)$. Susceptible individuals become infected at a nonlinear rate $\beta\, I(t)\, S(t)$, where $\beta$ is a contact rate.

- The inherent stochasticity of biological populations, and our partial ability to observe epidemics, therefore lead to nonlinear POMP model inference problems.

Application to a model

The study population $P(t)$ is split into susceptibles, $S(t)$, infecteds, $I(t)$, and $k$ recovered classes $R_1(t), \ldots, R_k(t)$. The state process $X(t) = (S(t), I(t), R_1(t), \ldots, R_k(t))$ follows a stochastic differential equation driven by a Brownian motion $\{B(t)\}$:
\[
\begin{aligned}
dS &= \big( k R_k + \delta(P - S) - \lambda(t)\, S \big)\, dt + dP - (\sigma I / P)\, dB, \\
dI &= \big( \lambda(t)\, S - (m + \delta + \gamma) I \big)\, dt + (\sigma I / P)\, dB, \\
dR_1 &= \big( \gamma I - (k + \delta) R_1 \big)\, dt, \\
&\;\;\vdots \\
dR_k &= \big( k R_{k-1} - (k + \delta) R_k \big)\, dt.
\end{aligned}
\]
The nonlinearity arises through the force of infection, $\lambda(t)$, specified as
\[
\lambda(t) = \bar\beta \exp\Big\{ \beta_{\mathrm{trend}}(t - t_0) + \sum_{j=1}^{N_s} \beta_j s_j(t) \Big\}\, I/P \;+\; \bar\omega \exp\Big\{ \sum_{j=1}^{N_s} \omega_j s_j(t) \Big\},
\]
where $\{s_j(t),\ j = 1, \ldots, N_s\}$ is a periodic cubic B-spline basis. The data are monthly counts of cholera mortality, modeled as
\[
Y_n \sim \mathrm{Normal}(M_n, \tau^2 M_n^2), \qquad \text{where } M_n = m \int_{t_{n-1}}^{t_n} I(s)\, ds.
\]

[Figure: monthly reported cholera deaths in Dhaka, 1891-1940.]


Comparison of IF1 and IF2 on the cholera model. Algorithmic tuning parameters for both IF1 and IF2 were set at the values chosen by King et al. (2008) for IF1.

- Log-likelihoods of the parameter vectors output by IF1 and IF2, both started at a uniform draw from a large 23-dimensional hyper-rectangle.

- Dotted lines show the maximum log-likelihood.

Concluding remarks

- Iterated filtering is a novel framework for computing the MLE in state space models where the transition density of the state process is intractable.

- It has been applied to some interesting analyses of cholera dynamics (King et al., Nature, 2008).

- IF2 performs much better than IF1. Several questions about the theoretical understanding of IF1+ and IF2 remain open.