Iterated Filtering Algorithms
Yves Atchadé
Department of Statistics, University of Michigan
Joint work with Edward Ionides, Dao Nguyen, and Aaron King (University of Michigan).
May 15, 2015

Introduction
- Discrete state space model:

  \(X_1 \sim \nu_\theta\), \(\quad X_k \mid \mathcal{F}_{k-1} \sim p_\theta(\cdot \mid X_{k-1})\),

  and \(Y_k \mid \{\mathcal{F}_{k-1}, X_k\} \sim p_\theta(\cdot \mid X_k)\), \(\quad k \geq 1\).

- The Markov chain \(\{X_k, k \geq 1\}\) is latent (unobserved). We observe \(\{Y_k, k \geq 1\}\).
- The goal is to estimate the parameter \(\theta \in \mathbb{R}^p\) from the observation \(\{y_k, 1 \leq k \leq n\}\) of \(\{Y_k, 1 \leq k \leq n\}\).
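To fix ideas, here is a minimal simulation sketch of such a model in Python. The linear-Gaussian choices of \(\nu_\theta\), \(p_\theta(\cdot \mid x_{k-1})\), and \(p_\theta(\cdot \mid x_k)\) are hypothetical, chosen only for illustration; the same toy model is reused in the sketches below. Note that only forward simulation is required, which is the setting the algorithms in this talk exploit.

```python
import numpy as np

def simulate_pomp(theta, n, rng):
    """Simulate the toy model X_1 ~ N(0,1), X_k = theta*X_{k-1} + N(0,1),
    Y_k = X_k + N(0, 0.5^2). Only forward simulation is needed."""
    x = np.empty(n)
    x[0] = rng.normal()                          # X_1 ~ nu_theta
    for k in range(1, n):
        x[k] = theta * x[k - 1] + rng.normal()   # X_k | X_{k-1} ~ p_theta
    y = x + 0.5 * rng.normal(size=n)             # Y_k | X_k ~ p_theta(.|x_k)
    return x, y

rng = np.random.default_rng(0)
x, y = simulate_pomp(theta=0.8, n=100, rng=rng)
```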
- E. L. Ionides, A. Bhadra, Y. F. Atchadé and A. King (2011). Iterated filtering. Annals of Statistics 39, 1776-1802.
- E. L. Ionides, D. Nguyen, Y. F. Atchadé, S. Stoev, and A. King (2015). Inference for dynamic and latent variable models via iterated, perturbed Bayes maps. Proceedings of the National Academy of Sciences 112 (3), 719-724.

Introduction
- Given data \(y_{1:n}\), the log-likelihood function is

  \(\ell_n(\theta) \stackrel{\mathrm{def}}{=} \log p_\theta(y_{1:n}),\)

  where \(p_\theta(y_{1:n})\) is the density of \(Y_{1:n}\):

  \(p_\theta(y_{1:n}) = \int p_\theta(x_{1:n})\, p_\theta(y_{1:n} \mid x_{1:n})\, dx_{1:n}.\)
- Our goal is to compute the MLE \(\hat\theta_n = \mathrm{Argmax}_{\theta \in \Theta}\, \ell_n(\theta)\).
- A standard but challenging computational problem.
- We are particularly interested in the case where the transition densities \(p_\theta(x_k \mid x_{k-1})\) are intractable.

Introduction
Example.

- \(dX_t = \mu_\theta(X_t)\,dt + C_\theta(X_t)\,dW_t\), \(\qquad Y_t = \int_0^t h_\theta(X_s)\,ds + \epsilon_t\).
- We observe \(\{Y_{t_1}, \ldots, Y_{t_n}\}\) at some non-random times \(t_1 < \cdots < t_n\).
- We wish to estimate \(\theta\).
- The transition densities of \(X_{t_1}, \ldots, X_{t_n}\) are available only in a few cases.

Introduction
- There is a large literature on this problem; C. Li (Annals of Statistics, 2013) provides a good review of the main methods.
- A very common strategy is to discretize the SDE (Euler scheme, sketched below). Assuming \(t_k = t_{k-1} + \delta\),

  \(X_{t_k} = X_{t_{k-1}} + \delta\, \mu_\theta(X_{t_{k-1}}) + \sqrt{\delta}\, C_\theta(X_{t_{k-1}})\, Z_k.\)

- This can introduce a huge bias if the observation mesh is not tight enough.
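A minimal sketch of the Euler scheme for a scalar SDE, with hypothetical drift and diffusion functions `mu_theta` and `C_theta` (any concrete model would substitute its own):

```python
import numpy as np

def euler_step(x, theta, delta, rng, mu_theta, C_theta):
    """One Euler transition: X_{t_k} = X_{t_{k-1}} + delta*mu + sqrt(delta)*C*Z."""
    z = rng.normal()
    return x + delta * mu_theta(x, theta) + np.sqrt(delta) * C_theta(x, theta) * z

# Example with an Ornstein-Uhlenbeck-type drift (hypothetical choice).
rng = np.random.default_rng(1)
mu = lambda x, th: -th * x        # mu_theta(x) = -theta * x
C = lambda x, th: 1.0             # constant diffusion
x, path = 0.0, [0.0]
for _ in range(100):
    x = euler_step(x, theta=0.5, delta=0.1, rng=rng, mu_theta=mu, C_theta=C)
    path.append(x)
```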
Introduction

- One can improve this a bit with data augmentation: add some additional imputation times \(t_i < \kappa_1^{(i)} < \cdots < \kappa_{m_i}^{(i)} < t_{i+1}\) to reduce the discretization error.
- Then estimate \(\theta\) while treating the \(X_{\kappa_j^{(i)}}\) as nuisance variables.
- However, there are limits to how much imputation can be done.

Introduction
- Recall that the likelihood function is

  \(L_n(\theta) = \int p_\theta(x_{1:n})\, p_\theta(y_{1:n} \mid x_{1:n})\, dx_{1:n}.\)

- One of the simplest ideas is to approximate the likelihood function by importance sampling (see the sketch below):

  \(\hat L_N(\theta) = \frac{1}{N} \sum_{i=1}^N \frac{p_\theta(y_{1:n} \mid X_{1:n}^{(i)})\, p_\theta(X_{1:n}^{(i)})}{q(X_{1:n}^{(i)})}, \quad \text{where } X_{1:n}^{(i)} \stackrel{\text{i.i.d.}}{\sim} q,\)

  and \(q\) is a probability measure on the sample space \(\mathsf{X}^n\).
- Well-known difficulty: it is hard to choose \(q\). The variance can be terribly large, particularly if \(q(x_{1:n}) = p_\theta(x_{1:n})\).
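A minimal sketch of this importance sampling estimator for the toy linear-Gaussian model above, using the latent process law as the proposal \(q = p_\theta\), precisely the simple choice the previous bullet warns against:

```python
import numpy as np

def is_likelihood(theta, y, N, rng):
    """Importance sampling estimate of L_n(theta) with proposal q = p_theta
    (the law of the latent chain), for the toy model above. With this choice
    the weights reduce to p_theta(y_{1:n} | X_{1:n}), whose variance grows
    quickly with the length of the series."""
    n = len(y)
    est = 0.0
    for _ in range(N):
        x = np.empty(n)
        x[0] = rng.normal()
        for k in range(1, n):
            x[k] = theta * x[k - 1] + rng.normal()
        loglik = np.sum(-0.5 * ((y - x) / 0.5) ** 2
                        - np.log(0.5 * np.sqrt(2.0 * np.pi)))
        est += np.exp(loglik) / N
    return est
```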
- Sequential Monte Carlo can produce better estimates of \(\ell_n(\theta)\), but these estimates are typically discontinuous functions of \(\theta\).
- Also, approximating and maximizing \(L_n(\theta)\) is typically not a stable way of computing the MLE. Solving \(\nabla \ell_n(\theta) = 0\) is a more stable problem, where \(\ell_n(\theta) \stackrel{\mathrm{def}}{=} \log L_n(\theta)\).

Introduction
- Since

  \(L_n(\theta) = \int p_\theta(x_{1:n})\, p_\theta(y_{1:n} \mid x_{1:n})\, dx_{1:n},\)

  we have (Fisher's identity):

  \(\nabla \ell_n(\theta) = \int \nabla \log p_\theta(x_{1:n}, y_{1:n})\, p_\theta(x_{1:n} \mid y_{1:n})\, dx_{1:n}\)
  \(= \int \Big[ \nabla \log\big(\nu_\theta(x_1)\, q_\theta(y_1 \mid x_1)\big) + \sum_{j=2}^n \nabla \log\big(f_\theta(x_j \mid x_{j-1})\, q_\theta(y_j \mid x_j)\big) \Big]\, p_\theta(x_{1:n} \mid y_{1:n})\, dx_{1:n}.\)

Introduction
- If

  \(\hat p_\theta(dx_{1:n} \mid y_{1:n}) = \sum_{i=1}^N W_i\, \delta_{\bar X_{1:n}^{(i)}}(dx_{1:n})\)

  is an SMC approximation of \(p_\theta(x_{1:n} \mid y_{1:n})\, dx_{1:n}\), we can easily estimate \(\nabla \ell_n(\theta)\) by

  \(\widehat{G}_N(\theta) = \int \nabla \log p_\theta(x_{1:n}, y_{1:n})\, \hat p_\theta(dx_{1:n} \mid y_{1:n}).\)
- Kantas, Doucet, Singh, Maciejowski, and Chopin (arXiv:1412.8695v1, 2014) review the different approaches.

Introduction
- This leads to a stochastic gradient algorithm.

Algorithm (Stochastic gradient method)

Given \(\theta_k\):

- Compute \(G_{k+1} = \widehat{G}_{N_{k+1}}(\theta_k)\).
- Update

  \(\theta_{k+1} = \theta_k + \gamma_{k+1} G_{k+1}\)
  \(= \theta_k + \gamma_{k+1} \nabla_\theta \ell_n(\theta_k) + \gamma_{k+1}\big(G_{k+1} - \nabla_\theta \ell_n(\theta_k)\big).\)
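A minimal sketch of this iteration, treating the SMC gradient estimate as a black box `grad_estimate(theta, N)` (a hypothetical callable; any noisy estimator of \(\nabla \ell_n\) could be plugged in):

```python
import numpy as np

def stochastic_gradient(theta0, grad_estimate, K, gamma0=0.1, N0=100):
    """theta_{k+1} = theta_k + gamma_{k+1} * G_{k+1}, with decreasing steps
    gamma_k ~ 1/k and increasing particle numbers N_k."""
    theta = np.asarray(theta0, dtype=float)
    for k in range(1, K + 1):
        gamma_k = gamma0 / k              # gamma_k -> 0, sum gamma_k = inf
        N_k = N0 * k                      # N_k increases with k
        G = grad_estimate(theta, N_k)     # noisy estimate of grad l_n(theta)
        theta = theta + gamma_k * G
    return theta
```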
Introduction

- Typically, for convergence one chooses \(\gamma_k \to 0\), \(\sum_k \gamma_k = \infty\), and \(\sum_k \frac{\gamma_k}{N_k} < \infty\), where \(N_k\) is the number of particles used at iteration \(k\).
- However, if \(\ell_n\) is strongly concave (with constant \(\mu\)) and has an \(L\)-Lipschitz gradient, one can do much better with a fixed step size \(\gamma = 2/(\mu + L)\). Then \(N_k \uparrow \infty\) is enough for convergence.
- However, these SMC estimates of \(G_k\) are not feasible if the transition densities \(p_\theta(x_k \mid x_{k-1})\) cannot be computed.

Introduction
- This talk presents a series of algorithms (IF1, IF1+, and IF2) that solve this problem without having to evaluate the transition densities \(p_\theta(x_k \mid x_{k-1})\).
- We will only need the ability to simulate from the transition densities.
- I will informally present some convergence results. We are still looking to say more about these algorithms, particularly in the non-concave case and in comparing the schemes.
- I will close with some simulation results.

Simplified version
- A standard latent variable model: \(y \sim p_\theta(\cdot)\), \(\Theta \subseteq \mathbb{R}^d\), where

  \(p_\theta(y) = \int_{\mathsf{X}} \underbrace{f_\theta(y \mid x)\, p_\theta(x)}_{\bar p_\theta(y, x)}\, dx.\)

- Given an observation \(y\), we wish to compute

  \(\hat\theta \stackrel{\mathrm{def}}{=} \mathrm{Argmax}_{\theta \in \Theta}\, \ell(\theta), \quad \text{where } \ell(\theta) = \log \int_{\mathsf{X}} \bar p_\theta(y, x)\, dx.\)

- \(p_\theta(x)\) cannot be computed, but it is easy to draw samples from it.

Iterated importance sampling
- Let \(K\) be a density on \(\Theta\) with \(\int_\Theta u\, K(u)\, du = 0\) and \(\int_\Theta u u'\, K(u)\, du = I_d\). Think of \(\Theta = \mathbb{R}^d\) and \(K\) the density of \(N(0, I_d)\).
- Let \(\ell : \Theta \to \mathbb{R}\) be some arbitrary log-likelihood function.
- For \(\sigma > 0\) and \(\theta \in \Theta\), consider the probability measure

  \(\pi_{\theta,\sigma}(du) = \frac{\frac{1}{\sigma^d} K\!\left(\frac{u-\theta}{\sigma}\right) e^{\ell(u)}\, du}{\int \frac{1}{\sigma^d} K\!\left(\frac{z-\theta}{\sigma}\right) e^{\ell(z)}\, dz}.\)

- \(\pi_{\theta,\sigma}(\cdot)\) is the posterior distribution of a model where the log-likelihood is \(\ell\) and the prior is \(\frac{1}{\sigma^d} K\!\left(\frac{u-\theta}{\sigma}\right)\) (the density of \(\theta + \sigma Z\), with \(Z \sim K\)).

Iterated importance sampling
- Statisticians are familiar with the idea of a posterior distribution as a mix of prior and likelihood.
- The next result would be very intuitive to a Bayesian.

Theorem. Fix \(\sigma > 0\), \(\theta \in \Theta\). Under appropriate smoothness assumptions on \(\ell\),

  \(\int u\, \pi_{\theta,\sigma}(du) = \theta + \sigma^2 \nabla_\theta \ell(\theta) + \sigma^3 \epsilon(\sigma, \theta),\)

where \(\epsilon\) is uniformly bounded for \(\sigma\) in a neighborhood of 0 and for \(\theta\) in a compact subset of \(\Theta\).

Iterated importance sampling
- The remainder can be improved to \(O(\sigma^4)\) if we use a density \(K\) with zero third moments (Doucet, Jacob and Rubenthaler, arXiv:1304.5768v2, 2013).
- If we ignore the remainder \(\sigma^3 \epsilon(\sigma, \theta)\), we have

  \(\int u\, \pi_{\theta,\sigma}(du) \approx \theta + \sigma^2 \nabla_\theta \ell(\theta).\)

- For \(\sigma\) small we can therefore approximate \(\nabla_\theta \ell(\theta)\) by

  \(\frac{1}{\sigma^2} \int (u - \theta)\, \pi_{\theta,\sigma}(du),\)

  as illustrated in the sketch below.
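A minimal numerical check of this approximation (my own illustration, not from the talk): for a Gaussian log-likelihood \(\ell(u) = -u^2/2\), whose gradient \(\nabla\ell(\theta) = -\theta\) is known in closed form, the mean of \(\pi_{\theta,\sigma}\) can be estimated by self-normalized importance sampling from the prior:

```python
import numpy as np

rng = np.random.default_rng(2)
theta, sigma, N = 1.5, 0.1, 200_000

# pi_{theta,sigma}: prior N(theta, sigma^2), log-likelihood l(u) = -u^2/2.
u = theta + sigma * rng.normal(size=N)         # draws from the prior
w = np.exp(-0.5 * u**2)                        # importance weights e^{l(u)}
w /= w.sum()

grad_est = np.sum(w * (u - theta)) / sigma**2  # sigma^{-2} * (E[U] - theta)
print(grad_est, -theta)                        # approx. nabla l(theta) = -1.5
```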
Iterated importance sampling

- Consider now the case where

  \(\ell(\theta) = \log \int_{\mathsf{X}} \bar p_\theta(y, x)\, dx.\)

- Fix \(\theta\). Pretend that the (unobserved) signal is \((\check\Theta, \check X)\), where \(\check\Theta = \theta + \sigma Z\), \(\check X \sim p_{\check\Theta}(\cdot)\), and we observe \(\check Y \sim f_{\check\Theta}(\cdot \mid \check X)\), where \(Z \sim K\).
- Using the observation \(y\), we attempt to recover \(\check\Theta\). The previous result says:

  \(\sigma^{-2}\left( \mathbb{E}\big[\check\Theta \mid \check Y = y\big] - \theta \right) = \nabla \ell(\theta) + O(\sigma).\)

Iterated importance sampling
- Since

  \(\sigma^{-2}\, \mathbb{E}\big[\check\Theta - \theta \mid \check Y = y\big] = \nabla \ell(\theta) + O(\sigma),\)

  we can approximate \(\nabla \ell(\theta)\) by importance sampling.
- Let \(Z_{1:N} \stackrel{\text{i.i.d.}}{\sim} K\). Set \(\check\Theta_k = \theta + \sigma Z_k\), and \(\check X_k \mid \check\Theta_k \sim p_{\check\Theta_k}(\cdot)\).
- Set

  \(W_k = \frac{f_{\check\Theta_k}(y \mid \check X_k)}{\sum_{j=1}^N f_{\check\Theta_j}(y \mid \check X_j)}.\)

Proposition. As \(N \to \infty\),

  \(\sum_{k=1}^N W_k\, \frac{(\check\Theta_k - \theta)}{\sigma^2}\)

converges in probability to \(\nabla \ell(\theta)\).

Iterated importance sampling
Algorithm (Iterated IS)

Given \(\theta_k\):

- Generate \(Z_{1:N_{k+1}} \stackrel{\text{i.i.d.}}{\sim} K(\cdot)\). Set \(\check\Theta^{(j)} = \theta_k + \sigma_k Z^{(j)}\), and \(\check X^{(j)} \mid \check\Theta^{(j)} \sim p_{\check\Theta^{(j)}}(\cdot)\).
- Set

  \(W^{(j)} = \frac{f_{\check\Theta^{(j)}}(y \mid \check X^{(j)})}{\sum_{i=1}^{N_{k+1}} f_{\check\Theta^{(i)}}(y \mid \check X^{(i)})}, \quad \text{and} \quad \widehat G_{k+1} = \sigma_k^{-2} \sum_{j=1}^{N_{k+1}} W^{(j)} (\check\Theta^{(j)} - \theta_k).\)

- Update \(\theta_{k+1} = \theta_k + \gamma_{k+1} \widehat G_{k+1}\).
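A minimal sketch of the full iterated IS loop for a simple latent variable model \(X \sim N(\theta, 1)\), \(Y \mid X \sim N(X, 1)\) (my own hypothetical example, chosen so that sampling \(p_\theta\) is easy; the schedules below are one choice satisfying the conditions of the theorem that follows):

```python
import numpy as np

def iterated_is(y, theta0, K_iters, N0=50, rng=None):
    """Iterated importance sampling for the toy model
    X ~ N(theta, 1), Y | X ~ N(X, 1). Only sampling of p_theta is used."""
    rng = rng or np.random.default_rng(3)
    theta = float(theta0)
    for k in range(1, K_iters + 1):
        sigma_k = 0.5 / np.sqrt(k)        # sigma_k -> 0
        gamma_k = 1.0 / k                 # steps with sum gamma_k = inf
        N_k = N0 * k * k                  # N_k grows so the noise is summable
        Z = rng.normal(size=N_k)
        Theta = theta + sigma_k * Z       # perturbed parameters
        X = Theta + rng.normal(size=N_k)  # X^{(j)} ~ p_{Theta^{(j)}}
        logw = -0.5 * (y - X) ** 2        # log f_Theta(y | X)
        w = np.exp(logw - logw.max())
        w /= w.sum()
        G = np.sum(w * (Theta - theta)) / sigma_k**2
        theta = theta + gamma_k * G
    return theta

# The MLE for this toy model is theta_hat = y; iterates move toward it.
print(iterated_is(y=2.0, theta0=0.0, K_iters=50))
```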
Iterated importance sampling

- Writing the update as

  \(\theta_{k+1} = \theta_k + \gamma_{k+1} \widehat G_{k+1} = \theta_k + \gamma_{k+1} \nabla \ell(\theta_k) + \gamma_{k+1} \epsilon^{(1)}_{k+1} + \gamma_{k+1} \epsilon^{(2)}_{k+1},\)

  where

  \(\epsilon^{(1)}_{k+1} = \widehat G_{k+1} - \frac{1}{\sigma_k^2} \int (u - \theta_k)\, \pi_{\theta_k,\sigma_k}(du), \qquad \epsilon^{(2)}_{k+1} = \frac{1}{\sigma_k^2} \int (u - \theta_k)\, \pi_{\theta_k,\sigma_k}(du) - \nabla \ell(\theta_k).\)

- For convergence, we roughly need \(\sigma_k \downarrow 0\),

  \(\sum_k \gamma_k = \infty, \qquad \sum_k \gamma_{k+1} \sigma_k < \infty, \qquad \sum_k \frac{\gamma_{k+1}}{\sigma_k^2 N_k} < \infty.\)

Iterated importance sampling
Theorem. Under some regularity conditions, and assuming

  \(\sum_k \gamma_k = \infty, \qquad \sum_k \gamma_{k+1} \sigma_k < \infty, \qquad \sum_k \frac{\gamma_{k+1}}{\sigma_k^2 N_k} < \infty,\)

the sequence \(\{\theta_n, n \geq 1\}\) produced by the Iterated IS algorithm converges almost surely to \(\{\theta \in \mathbb{R}^d : \nabla \ell(\theta) = 0\}\).

- Regularity condition: we need \(\{\theta_n, n \geq 0\}\) to remain in a compact set.

Iterated filtering
- We now consider a state space model \(\{(X_i, Y_i), 1 \leq i \leq n\}\), where \(X_1 \sim \nu_\theta\), \(X_i \mid X_{i-1} \sim p_\theta(x_i \mid x_{i-1})\), and \(Y_i \mid Y_{1:i-1}, X_{1:i} \sim p_\theta(y_i \mid x_i)\).
- Given observations \(y = (y_1, \ldots, y_n)\), the log-likelihood function of \(\theta\) is

  \(\ell_n(\theta) = \log \int \nu_\theta(x_1)\, p_\theta(y_1 \mid x_1) \prod_{i=2}^n p_\theta(x_i \mid x_{i-1})\, p_\theta(y_i \mid x_i)\, dx_{1:n}.\)

Iterated filtering
- Fix \(\theta\). Consider the extended state space model:

  \(\check\Theta_1 = \theta + \tau Z_1, \quad \check X_1 \sim \nu_{\check\Theta_1}(\cdot), \quad \check Y_1 \sim p_{\check\Theta_1}(\cdot \mid \check X_1),\)

  and for \(i = 2, \ldots, n\),

  \(\check\Theta_i = \check\Theta_{i-1} + \sigma Z_i, \quad \check X_i \mid \mathcal{F}_{i-1} \sim p_{\check\Theta_i}(\cdot \mid \check X_{i-1}), \quad \check Y_i \mid \mathcal{F}_{i-1}, \check X_i \sim p_{\check\Theta_i}(\cdot \mid \check X_i),\)

  where \(Z_1, \ldots, Z_n\) are i.i.d. \(K\).
- We look at \(\mathbb{E}\big[\check\Theta_n \mid \check Y_{1:n} = y_{1:n}\big]\).

Iterated filtering
Lemma. Assume \(\ell_n\) has derivatives up to order 3 that are bounded on compact sets. Then

  \(\mathbb{E}\big[\check\Theta_n \mid \check Y_{1:n} = y_{1:n}\big] - \theta - \tau^2 \nabla \ell_n(\theta) = O\!\left(\tau^2\Big(\tau + \frac{\sigma^2}{\tau^2}\Big)\right).\)
- We can easily combine this with a bootstrap particle filter to estimate the MLE.

Iterated Filtering
Algorithm (IF1)

Given \(\theta_k\):

1. Approximate the mean of the filtering distribution \(\mathbb{E}[\check\Theta_n \mid \check Y_{1:n} = y_{1:n}]\) in the following state space model, using \(N_{k+1}\) particles:

   \(\check\Theta_1 = \theta_k + \tau_{k+1} Z_1, \quad \check X_1 \sim \nu_{\check\Theta_1}(\cdot), \quad \check Y_1 \sim p_{\check\Theta_1}(\cdot \mid \check X_1),\)

   and for \(i = 2, \ldots, n\),

   \(\check\Theta_i = \check\Theta_{i-1} + \sigma_{k+1} Z_i, \quad \check X_i \mid \mathcal{F}_{i-1} \sim p_{\check\Theta_i}(\cdot \mid \check X_{i-1}), \quad \check Y_i \mid \mathcal{F}_{i-1}, \check X_i \sim p_{\check\Theta_i}(\cdot \mid \check X_i).\)

2. Update

   \(\theta_{k+1} = \theta_k + \gamma_{k+1}\, \frac{G_{k+1} - \theta_k}{\sigma_k^2},\)

   where \(G_{k+1}\) is the mean obtained in Step 1.

Iterated Filtering

Algorithm (Iterated IF1)
Given \(\theta_k\):

- For \(j = 1, \ldots, N_{k+1}\): \(\vartheta_1^{(j)} = \theta_k + \tau_{k+1} Z_1^{(j)}\), \(\tilde X_1^{(j)} \sim \nu_{\vartheta_1^{(j)}}(\cdot)\), and compute \(w_1^{(j)} \propto p_{\vartheta_1^{(j)}}(y_1 \mid \tilde X_1^{(j)})\), normalized so that \(\sum_{j=1}^{N_{k+1}} w_1^{(j)} = 1\).
- Resample \((\check\Theta_1^{(j)}, \check X_1^{(j)}) \stackrel{\text{i.i.d.}}{\sim} \sum_{j=1}^{N_{k+1}} w_1^{(j)}\, \delta_{(\vartheta_1^{(j)}, \tilde X_1^{(j)})}(\cdot)\).
- For \(i = 2, \ldots, n\):
  - For \(j = 1, \ldots, N_{k+1}\): \(\vartheta_i^{(j)} = \check\Theta_{i-1}^{(j)} + \sigma_{k+1} Z_i^{(j)}\), \(\tilde X_i^{(j)} \sim p_{\vartheta_i^{(j)}}(\cdot \mid \check X_{i-1}^{(j)})\). Set \(\vartheta_{1:i}^{(j)} = (\check\Theta_{1:i-1}^{(j)}, \vartheta_i^{(j)})\), \(\tilde X_{1:i}^{(j)} = (\check X_{1:i-1}^{(j)}, \tilde X_i^{(j)})\), and compute \(w_i^{(j)} \propto p_{\vartheta_i^{(j)}}(y_i \mid \tilde X_i^{(j)})\).
  - Resample \((\check\Theta_{1:i}^{(j)}, \check X_{1:i}^{(j)}) \stackrel{\text{i.i.d.}}{\sim} \sum_{j=1}^{N_{k+1}} w_i^{(j)}\, \delta_{(\vartheta_{1:i}^{(j)}, \tilde X_{1:i}^{(j)})}(\cdot)\).
- Update

  \(\theta_{k+1} = \theta_k + \gamma_{k+1} \left( \frac{1}{N_{k+1}} \sum_{j=1}^{N_{k+1}} \check\Theta_n^{(j)} - \theta_k \right).\)
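A minimal sketch of one IF1 iteration for the toy linear-Gaussian model used earlier (the schedules and model densities are my own illustrative choices, not the talk's):

```python
import numpy as np

def if1_update(theta, y, N, tau, sigma, gamma, rng):
    """One IF1 iteration for the toy model X_1 ~ N(0,1),
    X_i = theta*X_{i-1} + N(0,1), Y_i ~ N(X_i, 0.5^2): run a bootstrap
    particle filter on the pair (Theta, X) with perturbed parameters,
    then move theta toward the filtered parameter mean."""
    n = len(y)
    th = theta + tau * rng.normal(size=N)           # Theta_1 perturbations
    x = rng.normal(size=N)                          # X_1 ~ nu (theta-free here)
    for i in range(n):
        if i > 0:
            th = th + sigma * rng.normal(size=N)    # parameter random walk
            x = th * x + rng.normal(size=N)         # X_i | X_{i-1}
        logw = -0.5 * ((y[i] - x) / 0.5) ** 2       # measurement density
        w = np.exp(logw - logw.max()); w /= w.sum()
        idx = rng.choice(N, size=N, p=w)            # resample the pair
        th, x = th[idx], x[idx]
    return theta + gamma * (th.mean() - theta)      # IF1 parameter update
```

Iterating `if1_update` with decreasing schedules \(\tau_k, \sigma_k, \gamma_k\) gives the full IF1 loop.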
Iterated Filtering

We have a similar result.

Theorem. Under some regularity conditions, and assuming

  \(\sum_k \gamma_k = \infty, \qquad \sum_k \gamma_{k+1} \sigma_k < \infty, \qquad \sum_k \frac{\gamma_{k+1}}{\sigma_k^2 N_k} < \infty,\)

the sequence \(\{\theta_n, n \geq 1\}\) produced by the IF1 algorithm converges almost surely to \(\{\theta \in \mathbb{R}^d : \nabla \ell_n(\theta) = 0\}\).

Iterated filtering
- An issue with the algorithm is that the function \(\sigma_k^{-2}(\theta - \theta_k)\) grows as \(k \to \infty\).
- Hence (as in the condition \(\sum_k \frac{\gamma_{k+1}}{\sigma_k^2 N_k} < \infty\)), we need to choose \(\gamma_k\) small or increase the number of particles rapidly.
- It turns out that the IF algorithm is related to another well-known algorithm: the proximal-point algorithm.
- This connection can be exploited to build a better IF algorithm: IF1+.

Iterated filtering+
- Back to the simplified latent variable model: \(y \sim p_\theta(\cdot)\), \(\Theta \subseteq \mathbb{R}^d\), where

  \(p_\theta(y) = \int_{\mathsf{X}} \underbrace{f_\theta(y \mid x)\, p_\theta(x)}_{\bar p_\theta(y, x)}\, dx.\)

- IF1 iteratively computes \(\mathbb{E}[\check\Theta \mid \check Y = y]\) in the model

  \(\check\Theta = \theta + \sigma Z, \quad \check X \sim p_{\check\Theta}(\cdot), \quad \text{and we observe } \check Y \sim f_{\check\Theta}(\cdot \mid \check X),\)

  where \(Z \sim N(0, I_d)\),
- and estimates \(\nabla \ell(\theta)\) by \(\sigma^{-2}\big(\mathbb{E}[\check\Theta \mid \check Y = y] - \theta\big)\).

Iterated filtering+
- The posterior mean \(\mathbb{E}[\check\Theta \mid \check Y = y]\) is the mean of the density

  \(\pi(u \mid y) \propto \exp\!\Big( \ell(u) - \frac{1}{2\sigma^2} \|u - \theta\|^2 \Big),\)

- which should be close to the set of modes of that distribution, i.e. to the proximal operator of \(\ell\):

  \(\mathrm{Prox}_\sigma(\theta) \stackrel{\mathrm{def}}{=} \mathrm{Argmin}_u \left\{ -\ell(u) + \frac{1}{2\sigma^2} \|u - \theta\|^2 \right\}.\)

- The proximal operator is a fundamental tool in convex optimization for finding critical points of non-smooth functions.

Iterated filtering+
  \(\mathrm{Prox}_\sigma(\theta) \stackrel{\mathrm{def}}{=} \mathrm{Argmin}_u \left\{ -\ell(u) + \frac{1}{2\sigma^2} \|u - \theta\|^2 \right\}.\)

- If \(\ell\) is bounded then \(\mathrm{Prox}_\sigma(\theta)\) is non-empty, and unique if \(\ell\) is concave.
- Any fixed point of \(\mathrm{Prox}_\sigma\) is a critical point of \(\ell\).
- If \(\ell\) is strongly concave then \(\mathrm{Prox}_\sigma\) is a contraction.
- If \(\ell\) is proper concave, and \(\sum_k \sigma_k^2 = \infty\), then the sequence \(\theta_{k+1} = \mathrm{Prox}_{\sigma_k}(\theta_k)\) converges to a minimizer of \(-\ell\).
- Notice that \(\theta_{k+1} = \mathrm{Prox}_\sigma(\theta_k)\) is equivalent to

  \(\theta_{k+1} = \theta_k + \sigma^2 \nabla \ell(\theta_{k+1}),\)

  an "implicit gradient descent" step, which typically has better properties than the usual gradient descent (see the sketch below).
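A minimal illustration (my own, not from the talk): for a Gaussian log-likelihood \(\ell(u) = -\frac{1}{2}(u - m)^2\) the proximal operator is available in closed form, and iterating it with \(\sum_k \sigma_k^2 = \infty\) converges to the maximizer \(m\):

```python
import numpy as np

m = 2.0                              # maximizer of l(u) = -(u - m)^2 / 2

def prox(theta, sigma):
    """Closed-form Prox_sigma for l(u) = -(u-m)^2/2:
    argmin_u { (u-m)^2/2 + (u-theta)^2/(2*sigma^2) }."""
    return (sigma**2 * m + theta) / (1.0 + sigma**2)

theta = -5.0
for k in range(1, 200):
    sigma_k = 1.0 / np.sqrt(k)       # sum sigma_k^2 = infinity
    theta = prox(theta, sigma_k)
print(theta)                         # close to m = 2.0
```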
Iterated filtering+

Hence, instead of running the particle filter to estimate the gradient, we approximate the proximal operator.

Algorithm (IF1+)

Given \(\theta_k\):

1. Approximate the mean of the filtering distribution in the following state space model, using \(N_{k+1}\) particles:

   \(\check\Theta_1 = \theta_k + \tau_{k+1} Z_1, \quad \check X_1 \sim \nu_{\check\Theta_1}(\cdot), \quad \check Y_1 \sim p_{\check\Theta_1}(\cdot \mid \check X_1),\)

   and for \(i = 2, \ldots, n\),

   \(\check\Theta_i = \check\Theta_{i-1} + \sigma_{k+1} Z_i, \quad \check X_i \mid \mathcal{F}_{i-1} \sim p_{\check\Theta_i}(\cdot \mid \check X_{i-1}), \quad \check Y_i \mid \mathcal{F}_{i-1}, \check X_i \sim p_{\check\Theta_i}(\cdot \mid \check X_i).\)

2. Set \(\theta_{k+1} = G_{k+1}\), where \(G_{k+1}\) is the mean obtained in Step 1.

Iterated filtering+
- One can view this algorithm as a noisy proximal iteration:

  \(\theta_{k+1} = \mathrm{Prox}_{\sigma_k}(\theta_k) + \epsilon_{k+1}.\)

- Can we prove convergence by leveraging known results on proximal iterations?

IF2
- IF1+ is related to an algorithm fairly popular in ecology, known as data cloning (Lele et al. (2007), Ecology Letters).
- Let \(\ell\) be a log-likelihood and \(p\) a prior. Define

  \(\pi_k(\theta) \propto p(\theta)\, e^{k \ell(\theta)}.\)

- If \(X_k \sim \pi_k\), then \(X_k\) converges in probability to \(\mathrm{Argmax}\, \ell(\theta)\). This is basically simulated annealing.
- In the latent variable model, \(\pi_k\) is the marginal of

  \(\pi(\theta, x_1, \ldots, x_k \mid y) \propto p(\theta) \prod_{i=1}^k p_\theta(x_i)\, f_\theta(y \mid x_i).\)

IF2
- Data cloning and IF1+ suggest IF2.
- For any \(f \in L^1(\mathsf{X})\), define \(f_\sigma(u) \stackrel{\mathrm{def}}{=} \int f(z)\, K_\sigma(z, u)\, dz\).
- The Bayes operator associated to \(\ell\) is the map \(B\) that transforms \(f \in L^1(\mathsf{X})\) into \(B[f] \in L^1(\mathsf{X})\), where

  \(B[f](\theta) = \frac{f(\theta)\, e^{\ell(\theta)}}{\int_{\mathsf{X}} e^{\ell(u)}\, f(u)\, du}, \quad \theta \in \mathsf{X}.\)

- Finally, we consider the transformation \(T_\sigma\) that maps \(f \in L^1(\mathsf{X})\) into

  \(T_\sigma[f] = B[f_\sigma].\)

A minimal sketch follows below.
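Here is a minimal sketch of the perturbed Bayes map \(T_\sigma[f] = B[f_\sigma]\) on a discrete grid (my own illustration, with a Gaussian \(\ell\) and a Gaussian kernel \(K_\sigma\)), showing the iterates concentrating near the maximizer of \(\ell\):

```python
import numpy as np

grid = np.linspace(-5, 5, 1001)
dx = grid[1] - grid[0]
ell = -0.5 * (grid - 2.0) ** 2             # log-likelihood, maximized at 2
sigma = 0.2

# Gaussian perturbation kernel K_sigma(z, u), row-normalized on the grid.
Kmat = np.exp(-0.5 * ((grid[None, :] - grid[:, None]) / sigma) ** 2)
Kmat /= Kmat.sum(axis=1, keepdims=True) * dx

def T_sigma(f):
    """One application of T_sigma[f] = B[f_sigma]: perturb, then Bayes update."""
    f_sig = f @ Kmat * dx                  # f_sigma: convolution with K_sigma
    g = f_sig * np.exp(ell)                # multiply by the likelihood
    return g / (g.sum() * dx)              # normalize (the Bayes operator B)

f = np.ones_like(grid) / (grid[-1] - grid[0])   # flat initial density
for _ in range(50):
    f = T_sigma(f)
print(grid[np.argmax(f)])                  # mode near the maximizer, 2.0
```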
IF2

Theorem. Assume \(\ell\) is concave with Lipschitz gradient, and suppose \(\Theta_n \sim T_\sigma^n[f]\). There exist \(\sigma_0 > 0\) and a finite constant \(C\) such that, for \(\sigma \in (0, \sigma_0]\),

  \(\mathbb{E}\left[ \big\| \Theta_n - \hat\theta \big\|^2 \right] \leq C \sigma, \quad \text{for all } n \geq \lceil 1/\sigma \rceil.\)

Iterated filtering 2
Algorithm (IF2)

- For \(k = 1, 2, \ldots, K\):

  1. Given \(\hat p_k\), let \(\hat p_{k+1}\) be the SMC approximation of the filtering distribution of the following state space model, using \(N_{k+1}\) particles:

     \(\check\Theta_1 \sim \hat p_k, \quad \check X_1 \sim \nu_{\check\Theta_1}(\cdot), \quad \check Y_1 \sim p_{\check\Theta_1}(\cdot \mid \check X_1),\)

     and for \(i = 2, \ldots, n\),

     \(\check\Theta_i = \check\Theta_{i-1} + \sigma_{k+1} Z_i, \quad \check X_i \mid \mathcal{F}_{i-1} \sim p_{\check\Theta_i}(\cdot \mid \check X_{i-1}), \quad \check Y_i \mid \mathcal{F}_{i-1}, \check X_i \sim p_{\check\Theta_i}(\cdot \mid \check X_i).\)

- Return \(\hat p_K\).

A minimal sketch follows below.
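Here is a minimal sketch of IF2 for the same toy linear-Gaussian model used earlier; the parameter swarm carried across iterations plays the role of \(\hat p_k\), and the cooling schedule is a hypothetical choice:

```python
import numpy as np

def if2(y, N, K_iters, sigma0=0.2, rng=None):
    """IF2 for the toy model X_1 ~ N(0,1), X_i = theta*X_{i-1} + N(0,1),
    Y_i ~ N(X_i, 0.5^2). The parameter swarm approximates p_hat_k and is
    carried across iterations; sigma_k is cooled slowly."""
    rng = rng or np.random.default_rng(4)
    n = len(y)
    th = rng.uniform(-1, 1, size=N)                 # initial parameter swarm
    for k in range(1, K_iters + 1):
        sigma_k = sigma0 * 0.95 ** k                # cooling schedule
        x = rng.normal(size=N)                      # X_1 ~ nu
        for i in range(n):
            th = th + sigma_k * rng.normal(size=N)  # perturb parameters
            if i > 0:
                x = th * x + rng.normal(size=N)     # propagate states
            logw = -0.5 * ((y[i] - x) / 0.5) ** 2
            w = np.exp(logw - logw.max()); w /= w.sum()
            idx = rng.choice(N, size=N, p=w)        # resample the pair
            th, x = th[idx], x[idx]
    return th                                       # swarm approximating p_hat_K

# th = if2(y, N=1000, K_iters=30); th.mean() then estimates the MLE.
```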
IF2

- Each iteration of IF2 is a Monte Carlo approximation to the map \(T_\sigma\).
- \(f\) and \(T_\sigma[f]\) approximate the initial and final density of the IF2 parameter swarm.
- When the standard deviation of the parameter perturbations is held fixed at \(\sigma_m = \sigma > 0\), IF2 is a Monte Carlo approximation to \(T_\sigma^K[f]\).
- Standard SMC theory can be used to show that \(\hat p_K\) approximates \(T_\sigma^K[f]\).
- In the case \(\sigma = 0\), the iterated Bayes map corresponds to the data cloning approach of Lele et al. (2007).
- Taking \(\sigma \neq 0\) adds numerical stability, which is necessary for convergence of SMC approximations.

Numerical examples
- We compare IF1, IF2, and the particle Markov chain Monte Carlo (PMCMC) method of Andrieu et al. (2010).
- PMCMC is an SMC-based plug-and-play algorithm for full-information Bayesian inference on POMPs.
- Computations were done using the pomp R package: King, Nguyen & Ionides (2015). Statistical inference for partially observed Markov processes via the R package pomp. To appear in Journal of Statistical Software.
- Data and code reproducing our results are a supplement to Ionides, Nguyen, Atchadé, Stoev & King (2015). Inference for dynamic and latent variable models via iterated, perturbed Bayes maps. Proceedings of the National Academy of Sciences of the USA.

Toy example.