Iterated Filtering Algorithms

Yves Atchadé

Department of Statistics, University of Michigan. Joint work with Edward Ionides, Dao Nguyen, and Aaron King (University of Michigan).

May 15, 2015

Introduction

- Discrete-time state space model:

\[
X_1 \sim \nu_\theta, \qquad X_k \mid \mathcal{F}_{k-1} \sim p_\theta(\cdot \mid X_{k-1}), \qquad Y_k \mid \{\mathcal{F}_{k-1}, X_k\} \sim p_\theta(\cdot \mid X_k), \qquad k \ge 1.
\]

- The process $\{X_k, k \ge 1\}$ is latent (unobserved); we observe $\{Y_k, k \ge 1\}$.
- The goal is to estimate the parameter $\theta \in \mathbb{R}^p$ from the observations $\{y_k, 1 \le k \le n\}$ of $\{Y_k, 1 \le k \le n\}$.

References:
- E. L. Ionides, A. Bhadra, Y. F. Atchadé and A. King (2011). Iterated filtering. Annals of Statistics 39, 1776-1802.
- E. L. Ionides, D. Nguyen, Y. F. Atchadé, S. Stoev, and A. King (2015). Inference for dynamic and latent variable models via iterated, perturbed Bayes maps. Proceedings of the National Academy of Sciences 112(3), 719-724.

Introduction

- Given data $y_{1:n}$, the log-likelihood function is

\[
\ell_n(\theta) \stackrel{\mathrm{def}}{=} \log p_\theta(y_{1:n}),
\]

where $p_\theta(y_{1:n})$ is the density of $Y_{1:n}$:
\[
p_\theta(y_{1:n}) = \int p_\theta(x_{1:n})\, p_\theta(y_{1:n} \mid x_{1:n})\, dx_{1:n}.
\]

- Our goal is to compute the MLE $\hat\theta_n = \operatorname{Argmax}_{\theta \in \Theta} \ell_n(\theta)$.

- This is a standard but challenging computational problem.

- We are particularly interested in the case where the transition densities $p_\theta(x_k \mid x_{k-1})$ are intractable.

Introduction

Example.

- \[
dX_t = \mu_\theta(X_t)\, dt + C_\theta(X_t)\, dW_t, \qquad Y_t = \int_0^t h_\theta(X_s)\, ds + \epsilon_t.
\]

- We observe $\{Y_{t_1}, \ldots, Y_{t_n}\}$ at non-random times $t_1 < \cdots < t_n$.

- We wish to estimate $\theta$.

- The transition densities of $X_{t_1}, \ldots, X_{t_n}$ are available in closed form only in a few special cases.

Introduction

- There is a large literature on this problem; C. Li (Annals of Statistics, 2013) provides a good review of the main methods.

- A very common strategy is to use an Euler discretization of the SDE, as sketched below. Assuming $t_k = t_{k-1} + \delta$,

\[
X_{t_k} = X_{t_{k-1}} + \delta\, \mu_\theta(X_{t_{k-1}}) + \sqrt{\delta}\, C_\theta(X_{t_{k-1}})\, Z_k.
\]

- This can introduce a large bias if the observation mesh is not fine enough.
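
A minimal Euler-Maruyama simulation sketch (my own illustration; the drift $\mu_\theta$, diffusion $C_\theta$, and the Ornstein-Uhlenbeck example below are hypothetical placeholders, not the talk's model):

```python
import numpy as np

def euler_maruyama(x0, theta, delta, n_steps, mu, C, rng):
    """Simulate X_{t_k} = X_{t_{k-1}} + delta*mu_theta(X) + sqrt(delta)*C_theta(X)*Z_k."""
    x = float(x0)
    path = [x]
    for _ in range(n_steps):
        x = x + delta * mu(x, theta) + np.sqrt(delta) * C(x, theta) * rng.standard_normal()
        path.append(x)
    return np.array(path)

# Example with an Ornstein-Uhlenbeck drift mu_theta(x) = -theta*x and unit diffusion.
rng = np.random.default_rng(0)
path = euler_maruyama(x0=1.0, theta=0.5, delta=0.01, n_steps=1000,
                      mu=lambda x, th: -th * x, C=lambda x, th: 1.0, rng=rng)
```

Making $\delta$ small reduces the discretization bias; using only the coarse observation times $t_1 < \cdots < t_n$ as the mesh is what produces the large bias mentioned above.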

Introduction

- You can improve on this with data augmentation: add additional imputation times $t_i < \kappa_1^{(i)} < \cdots < \kappa_{m_i}^{(i)} < t_{i+1}$ to reduce the discretization error.

- Then estimate $\theta$ while treating the $X_{\kappa_j^{(i)}}$ as nuisance variables.

- However, there are limits to how much imputation can be done.

Introduction

- Recall that the likelihood function is
\[
L_n(\theta) = \int p_\theta(x_{1:n})\, p_\theta(y_{1:n} \mid x_{1:n})\, dx_{1:n}.
\]

- One of the simplest ideas is to approximate the likelihood function by importance sampling (see the sketch at the end of this slide):

\[
\hat L_N(\theta) = \frac{1}{N} \sum_{i=1}^N \frac{p_\theta(y_{1:n} \mid X_{1:n}^{(i)})\, p_\theta(X_{1:n}^{(i)})}{q(X_{1:n}^{(i)})}, \qquad X_{1:n}^{(i)} \stackrel{\mathrm{i.i.d.}}{\sim} q,
\]
where $q$ is a probability measure on the sample space $\mathsf{X}^n$.

- Well-known difficulty: it is hard to choose $q$. The variance can be terribly large, in particular if $q(x_{1:n}) = p_\theta(x_{1:n})$.

- Sequential Monte Carlo can produce better estimates of $\ell_n(\theta)$, but these estimates are typically discontinuous functions of $\theta$.

- Also, approximating and maximizing $L_n(\theta)$ is typically not a stable way of computing the MLE. Solving $\nabla \ell_n(\theta) = 0$ is a more stable problem, where $\nabla \ell_n(\theta) \stackrel{\mathrm{def}}{=} \nabla \log L_n(\theta)$.
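
A minimal sketch of the naive importance-sampling likelihood estimator $\hat L_N(\theta)$ (on the log scale), for a toy Gaussian random-walk model of my own choosing and with the prior $p_\theta(x_{1:n})$ used as the proposal $q$:

```python
import numpy as np

def is_loglik(y, theta, N, rng):
    """Log of the IS estimate of L_n(theta) for a toy model:
       x_1 ~ N(0, theta), x_k = x_{k-1} + N(0, theta), y_k ~ N(x_k, 1).
       With q = p_theta(x_{1:n}), the importance weights reduce to p_theta(y_{1:n} | x_{1:n})."""
    n = len(y)
    steps = rng.normal(0.0, np.sqrt(theta), size=(N, n))
    x = np.cumsum(steps, axis=1)                             # N latent paths drawn from the prior
    logw = -0.5 * np.sum((y - x) ** 2, axis=1) - 0.5 * n * np.log(2 * np.pi)
    m = logw.max()
    return m + np.log(np.mean(np.exp(logw - m)))             # log of (1/N) * sum of weights

rng = np.random.default_rng(1)
y = rng.normal(size=20)                                      # synthetic placeholder data
print(is_loglik(y, theta=1.0, N=10_000, rng=rng))
```

Even on this toy model the weights are extremely variable for moderate $n$, which is the difficulty referred to above.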

Introduction

- Since
\[
L_n(\theta) = \int p_\theta(x_{1:n})\, p_\theta(y_{1:n} \mid x_{1:n})\, dx_{1:n},
\]

we have (also known as Louis's identity):
\[
\nabla \ell_n(\theta) = \int \nabla_\theta \log p_\theta(x_{1:n}, y_{1:n})\; p_\theta(x_{1:n} \mid y_{1:n})\, dx_{1:n}
= \int \nabla_\theta \Big[ \log\big(\nu_\theta(x_1)\, q_\theta(y_1 \mid x_1)\big) + \sum_{j=2}^n \log\big(f_\theta(x_j \mid x_{j-1})\, q_\theta(y_j \mid x_j)\big) \Big]\, p_\theta(x_{1:n} \mid y_{1:n})\, dx_{1:n}.
\]

- If
\[
\hat p_\theta(dx_{1:n} \mid y_{1:n}) = \sum_{i=1}^N W_i\, \delta_{\bar X_{1:n}^{(i)}}(dx_{1:n})
\]

is an SMC approximation of $p_\theta(x_{1:n} \mid y_{1:n})\, dx_{1:n}$, we can easily estimate $\nabla \ell_n(\theta)$ by
\[
\widehat G_N(\theta) = \int \nabla_\theta \log p_\theta(x_{1:n}, y_{1:n})\; \hat p_\theta(dx_{1:n} \mid y_{1:n}).
\]

- Kantas, Doucet, Singh, Maciejowski, and Chopin (arXiv:1412.8695v1, 2014) review the different approaches.

Introduction

- This leads to a stochastic gradient algorithm.

Algorithm (Stochastic Gradient method)

Given $\theta_k$:

- Compute $G_{k+1} = \widehat G_{N_{k+1}}(\theta_k)$.
- Update

\[
\theta_{k+1} = \theta_k + \gamma_{k+1} G_{k+1}
= \theta_k + \gamma_{k+1} \nabla_\theta \ell_n(\theta_k) + \gamma_{k+1}\big(G_{k+1} - \nabla_\theta \ell_n(\theta_k)\big).
\]
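
A schematic Robbins-Monro loop matching the update above, with a synthetic noisy gradient standing in for the SMC estimate $\widehat G_{N_{k+1}}(\theta_k)$ (the quadratic toy objective and all constants are my own illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
theta_star = np.array([1.0, -2.0])           # maximizer of the toy log-likelihood
theta = np.zeros(2)

def noisy_grad(theta, n_particles):
    """Unbiased but noisy gradient of ell(theta) = -0.5 * ||theta - theta_star||^2;
       the noise shrinks like 1/sqrt(n_particles), mimicking a Monte Carlo estimate."""
    return (theta_star - theta) + rng.normal(scale=1.0 / np.sqrt(n_particles), size=2)

for k in range(1, 2001):
    gamma = 1.0 / k                          # gamma_k -> 0 with sum gamma_k = infinity
    G = noisy_grad(theta, n_particles=100)
    theta = theta + gamma * G                # theta_{k+1} = theta_k + gamma_{k+1} G_{k+1}

print(theta)                                 # close to theta_star
```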

Introduction

- Typically, for convergence one chooses $\gamma_k \to 0$, $\sum_k \gamma_k = \infty$, and $\sum_k \gamma_k / N_k < \infty$, where $N_k$ is the number of particles used at iteration $k$.

- However, if $\ell_n$ is strongly concave and has a Lipschitz gradient, one can do much better with a fixed step size $\gamma = 2/(\mu + L)$. Then $N_k \uparrow \infty$ is enough for convergence.

- However, these SMC estimates of $G_k$ are not feasible if the transition densities $p_\theta(x_k \mid x_{k-1})$ cannot be computed.

Introduction

- This talk presents a series of algorithms (IF1, IF1+, and IF2) that solve this problem without having to evaluate the transition densities $p_\theta(x_k \mid x_{k-1})$.

- We only need the ability to simulate from the transition densities.

- I will present some convergence results informally. We are still working to say more about these algorithms, particularly in the non-concave case and on comparing the schemes.

- I will close with some simulation results.

Simplified version

- A standard latent variable model: $y \sim p_\theta(\cdot)$, $\Theta \subseteq \mathbb{R}^d$, where
\[
p_\theta(y) = \int_{\mathsf{X}} \underbrace{f_\theta(y \mid x)\, p_\theta(x)}_{\bar p_\theta(y,\, x)}\, dx.
\]

- Given the observation $y$, we wish to compute
\[
\hat\theta \stackrel{\mathrm{def}}{=} \operatorname{Argmax}_{\theta \in \Theta} \ell(\theta), \qquad \text{where } \ell(\theta) = \log \int_{\mathsf{X}} \bar p_\theta(y, x)\, dx.
\]

- $p_\theta(x)$ cannot be computed, but it is easy to draw samples from it.

Iterated importance sampling

- Let $K$ be a density on $\Theta$ with $\int_\Theta u\, K(u)\, du = 0$ and $\int_\Theta u u'\, K(u)\, du = I_d$. Think of $\Theta = \mathbb{R}^d$ and $K$ the density of $N(0, I_d)$.
- Let $\ell : \Theta \to \mathbb{R}$ be some arbitrary log-likelihood function.

- For $\sigma > 0$ and $\theta \in \Theta$, consider the probability measure

\[
\pi_{\theta,\sigma}(du) = \frac{\frac{1}{\sigma^d} K\!\left(\frac{u - \theta}{\sigma}\right) e^{\ell(u)}\, du}{\int \frac{1}{\sigma^d} K\!\left(\frac{z - \theta}{\sigma}\right) e^{\ell(z)}\, dz}.
\]

- $\pi_{\theta,\sigma}(\cdot)$ is the posterior distribution of a model where the log-likelihood is $\ell$ and the prior is $\frac{1}{\sigma^d} K\!\left(\frac{u - \theta}{\sigma}\right)$ (the density of $\theta + \sigma Z$, with $Z \sim K$).

Iterated importance sampling

- Statisticians are familiar with the idea of the posterior distribution as a compromise between prior and likelihood.

- The next result should be very intuitive to a Bayesian.

Theorem. Fix $\sigma > 0$, $\theta \in \Theta$. Under appropriate smoothness assumptions on $\ell$,
\[
\int u\, \pi_{\theta,\sigma}(du) = \theta + \sigma^2 \nabla_\theta \ell(\theta) + \sigma^3 \epsilon(\sigma, \theta),
\]

where $\epsilon$ is uniformly bounded for $\sigma$ in a neighborhood of $0$ and for $\theta$ in a compact subset of $\Theta$.
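
A quick numerical check of this expansion, using a one-dimensional toy log-likelihood (my own choice) and quadrature on a grid; $(\int u\,\pi_{\theta,\sigma}(du) - \theta)/\sigma^2$ should be close to $\ell'(\theta)$:

```python
import numpy as np

ell  = lambda u: -0.25 * u**4 + u            # toy concave log-likelihood (an assumption)
dell = lambda u: -u**3 + 1                   # its exact derivative

theta, sigma = 0.5, 0.05
u = np.linspace(theta - 1.0, theta + 1.0, 20001)           # quadrature grid
log_post = ell(u) - 0.5 * ((u - theta) / sigma) ** 2       # ell(u) + log N(u; theta, sigma^2), up to a constant
w = np.exp(log_post - log_post.max())
post_mean = np.sum(u * w) / np.sum(w)

print((post_mean - theta) / sigma**2)        # ~0.87, close to dell(theta) = 0.875
print(dell(theta))
```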

Iterated importance sampling

- The remainder can be improved to $O(\sigma^4)$ if we use a density $K$ with zero third moments (Doucet, Jacob and Rubenthaler, arXiv:1304.5768v2, 2013).
- If we ignore the remainder $\sigma^3 \epsilon(\sigma, \theta)$, we have
\[
\int u\, \pi_{\theta,\sigma}(du) \approx \theta + \sigma^2 \nabla_\theta \ell(\theta).
\]

- For $\sigma$ small we can therefore approximate $\nabla_\theta \ell(\theta)$ by
\[
\frac{1}{\sigma^2} \int (u - \theta)\, \pi_{\theta,\sigma}(du).
\]

Iterated importance sampling

- Consider now the case where
\[
\ell(\theta) = \log \int_{\mathsf{X}} \bar p_\theta(y, x)\, dx.
\]

- Fix $\theta$. Pretend that the (unobserved) signal is $(\check\Theta, \check X)$, where $\check\Theta = \theta + \sigma Z$ and $\check X \sim p_{\check\Theta}(\cdot)$, and that we observe $\check Y \sim f_{\check\Theta}(\cdot \mid \check X)$, with $Z \sim K$.

- Using the observation $y$, we attempt to recover $\check\Theta$. The previous result says:

\[
\sigma^{-2}\big( \mathbb{E}\big[\check\Theta \mid \check Y = y\big] - \theta \big) = \nabla \ell(\theta) + O(\sigma).
\]

Iterated importance sampling

- Since
\[
\mathbb{E}\big[ \sigma^{-2}\big(\check\Theta - \theta\big) \mid \check Y = y \big] = \nabla \ell(\theta) + O(\sigma),
\]
we can approximate $\nabla \ell(\theta)$ by importance sampling.

- Let $Z_{1:N} \stackrel{\mathrm{i.i.d.}}{\sim} K$. Set $\check\Theta_k = \theta + \sigma Z_k$, and draw $\check X_k \mid \check\Theta_k \sim p_{\check\Theta_k}(\cdot)$.
- Set
\[
W_k = \frac{f_{\check\Theta_k}(y \mid \check X_k)}{\sum_{j=1}^N f_{\check\Theta_j}(y \mid \check X_j)}.
\]

Proposition. As $N \to \infty$,
\[
\sum_{k=1}^N W_k\, \frac{\check\Theta_k - \theta}{\sigma^2}
\]
converges in probability to $\nabla \ell(\theta)$.
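
A minimal sketch of this estimator on a toy model of my own choosing, for which the exact score is available as a check ($x \sim N(\theta, 1)$, $y \mid x \sim N(x, 1)$, so $\nabla\ell(\theta) = (y - \theta)/2$):

```python
import numpy as np

def grad_is(y, theta, sigma, N, rng):
    """Estimate sum_k W_k (Theta_k - theta) / sigma^2 for the toy model above."""
    Theta = theta + sigma * rng.standard_normal(N)      # Theta_k = theta + sigma Z_k
    X = Theta + rng.standard_normal(N)                  # X_k ~ p_{Theta_k} = N(Theta_k, 1)
    logw = -0.5 * (y - X) ** 2                          # f_{Theta_k}(y | X_k), up to a constant
    W = np.exp(logw - logw.max())
    W /= W.sum()                                        # self-normalized weights
    return np.sum(W * (Theta - theta)) / sigma**2

rng = np.random.default_rng(3)
y, theta = 1.3, 0.0
print(grad_is(y, theta, sigma=0.1, N=200_000, rng=rng))  # roughly (y - theta)/2 = 0.65
print((y - theta) / 2)
```

Note the $1/\sigma^2$ factor: shrinking $\sigma$ reduces the $O(\sigma)$ bias but inflates the Monte Carlo variance, the trade-off the Iterated IS algorithm below has to balance.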

Iterated importance sampling

Algorithm (Iterated IS)

Given $\theta_k$:
- Generate $Z_{1:N_{k+1}} \stackrel{\mathrm{i.i.d.}}{\sim} K(\cdot)$. Set $\check\Theta^{(j)} = \theta_k + \sigma_k Z^{(j)}$, and draw $\check X^{(j)} \mid \check\Theta^{(j)} \sim p_{\check\Theta^{(j)}}(\cdot)$.
- Set

\[
W^{(j)} = \frac{f_{\check\Theta^{(j)}}(y \mid \check X^{(j)})}{\sum_{i=1}^{N_{k+1}} f_{\check\Theta^{(i)}}(y \mid \check X^{(i)})},
\qquad
\widehat G_{k+1} = \sigma_k^{-2} \sum_{j=1}^{N_{k+1}} W^{(j)}\big(\check\Theta^{(j)} - \theta_k\big).
\]

- $\theta_{k+1} = \theta_k + \gamma_{k+1} \widehat G_{k+1}$.
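
A self-contained sketch of the whole Iterated IS loop on the same toy model ($x \sim N(\theta, 1)$, $y \mid x \sim N(x, 1)$, MLE $\hat\theta = y$); the schedules $\gamma_k$, $\sigma_k$, $N_k$ are my own illustrative choices, picked so that the conditions on the next slide hold:

```python
import numpy as np

rng = np.random.default_rng(4)
y, theta = 1.3, -2.0                                        # observation and starting point

for k in range(1, 101):
    gamma = 2.0 / k                                         # sum gamma_k = infinity
    sigma = 0.3 / np.sqrt(k)                                # sum gamma_k * sigma_k < infinity
    N = int(500 * k**1.5)                                   # sum gamma_k / (sigma_k^2 N_k) < infinity
    Theta = theta + sigma * rng.standard_normal(N)          # Theta^(j) = theta_k + sigma_k Z^(j)
    X = Theta + rng.standard_normal(N)                      # X^(j) ~ N(Theta^(j), 1)
    logw = -0.5 * (y - X) ** 2
    W = np.exp(logw - logw.max())
    W /= W.sum()
    G = np.sum(W * (Theta - theta)) / sigma**2              # IS estimate of the score at theta_k
    theta = theta + gamma * G                               # theta_{k+1} = theta_k + gamma_{k+1} G_hat_{k+1}

print(theta)                                                # close to the MLE y = 1.3
```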

Iterated importance sampling

\[
\begin{aligned}
\theta_{k+1} &= \theta_k + \gamma_{k+1} \widehat G_{k+1} \\
&= \theta_k + \gamma_{k+1} \nabla \ell(\theta_k)
+ \gamma_{k+1}\Big( \widehat G_{k+1} - \frac{1}{\sigma_k^2} \int (u - \theta_k)\, \pi_{\theta_k,\sigma_k}(du) \Big)
+ \gamma_{k+1}\Big( \frac{1}{\sigma_k^2} \int (u - \theta_k)\, \pi_{\theta_k,\sigma_k}(du) - \nabla \ell(\theta_k) \Big) \\
&= \theta_k + \gamma_{k+1} \nabla \ell(\theta_k) + \gamma_{k+1}\epsilon^{(1)}_{k+1} + \gamma_{k+1}\epsilon^{(2)}_{k+1}.
\end{aligned}
\]

- For convergence, we roughly need $\sigma_k \downarrow 0$,

\[
\sum_k \gamma_k = \infty, \qquad \sum_k \gamma_{k+1}\sigma_k < \infty, \qquad \sum_k \frac{\gamma_{k+1}}{\sigma_k^2 N_k} < \infty.
\]

Iterated importance sampling

Theorem. Under some regularity conditions, and assuming

\[
\sum_k \gamma_k = \infty, \qquad \sum_k \gamma_{k+1}\sigma_k < \infty, \qquad \sum_k \frac{\gamma_{k+1}}{\sigma_k^2 N_k} < \infty,
\]

the sequence $\{\theta_n, n \ge 1\}$ produced by the Iterated IS algorithm converges almost surely to $\{\theta \in \mathbb{R}^d : \nabla \ell(\theta) = 0\}$.

- Regularity condition: we need $\{\theta_n, n \ge 0\}$ to remain in a compact set.

Iterated filtering

- We now consider a state space model $\{(X_i, Y_i), 1 \le i \le n\}$, where $X_1 \sim \nu_\theta$, $X_i \mid X_{i-1} \sim p_\theta(x_i \mid x_{i-1})$, and $Y_i \mid Y_{1:i-1}, X_{1:i} \sim p_\theta(y_i \mid x_i)$.

- Given observations $y = (y_1, \ldots, y_n)$, the log-likelihood function of $\theta$ is

\[
\ell_n(\theta) = \log \int \nu_\theta(x_1)\, p_\theta(y_1 \mid x_1) \prod_{i=2}^n p_\theta(x_i \mid x_{i-1})\, p_\theta(y_i \mid x_i)\, dx_{1:n}.
\]

Iterated filtering

- Fix $\theta$. Consider the extended state space model:

\[
\check\Theta_1 = \theta + \tau Z_1, \qquad \check X_1 \sim \nu_{\check\Theta_1}(\cdot), \qquad \check Y_1 \sim p_{\check\Theta_1}(\cdot \mid \check X_1),
\]

and for $i = 2, \ldots, n$,
\[
\check\Theta_i = \check\Theta_{i-1} + \sigma Z_i, \qquad \check X_i \mid \mathcal{F}_{i-1} \sim p_{\check\Theta_i}(\cdot \mid \check X_{i-1}), \qquad \check Y_i \mid \mathcal{F}_{i-1}, \check X_i \sim p_{\check\Theta_i}(\cdot \mid \check X_i),
\]

where $Z_1, \ldots, Z_n$ are i.i.d. $K$.
- We look at $\mathbb{E}\big[\check\Theta_n \mid \check Y_{1:n} = y_{1:n}\big]$.

Iterated filtering

Lemma. Assume $\ell_n$ has derivatives up to order 3 that are bounded on compact sets. Then

\[
\mathbb{E}\big[\check\Theta_n \mid \check Y_{1:n} = y_{1:n}\big] - \theta - \tau^2 \nabla \ell_n(\theta) = O\!\Big(\tau + \frac{\sigma^2}{\tau^2}\Big).
\]

We can easily combine this with a bootstrap particle filter to estimate the MLE.

Iterated Filtering

Algorithm (IF1)

Given $\theta_k$:
1. Approximate the mean of the filtering distribution in the following state space model, using $N_{k+1}$ particles:

\[
\check\Theta_1 = \theta_k + \tau_{k+1} Z_1, \qquad \check X_1 \sim \nu_{\check\Theta_1}(\cdot), \qquad \check Y_1 \sim p_{\check\Theta_1}(\cdot \mid \check X_1),
\]

and for $i = 2, \ldots, n$,
\[
\check\Theta_i = \check\Theta_{i-1} + \sigma_{k+1} Z_i, \qquad \check X_i \mid \mathcal{F}_{i-1} \sim p_{\check\Theta_i}(\cdot \mid \check X_{i-1}), \qquad \check Y_i \mid \mathcal{F}_{i-1}, \check X_i \sim p_{\check\Theta_i}(\cdot \mid \check X_i).
\]

2. Compute
\[
\theta_{k+1} = \theta_k + \gamma_{k+1}\left( \frac{G_{k+1} - \theta_k}{\sigma_k^2} \right),
\]

where $G_{k+1}$ is the filtered mean obtained in Step 1.

Iterated Filtering

Algorithm (Iterated IF1)

Given $\theta_k$:
- For $j = 1, \ldots, N_{k+1}$: set $\vartheta_1^{(j)} = \theta_k + \tau_{k+1} Z_1^{(j)}$, draw $\tilde X_1^{(j)} \sim \nu_{\vartheta_1^{(j)}}(\cdot)$, and compute $w_1^{(j)} \propto p_{\vartheta_1^{(j)}}(y_1 \mid \tilde X_1^{(j)})$, normalized so that $\sum_{j=1}^{N_{k+1}} w_1^{(j)} = 1$.

- Resample $(\check\Theta_1^{(j)}, \check X_1^{(j)}) \stackrel{\mathrm{i.i.d.}}{\sim} \sum_{j=1}^{N_{k+1}} w_1^{(j)}\, \delta_{(\vartheta_1^{(j)},\, \tilde X_1^{(j)})}(\cdot)$.
- For $i = 2, \ldots, n$:
  - For $j = 1, \ldots, N_{k+1}$: set $\vartheta_i^{(j)} = \check\Theta_{i-1}^{(j)} + \sigma_{k+1} Z_i^{(j)}$ and draw $\tilde X_i^{(j)} \sim p_{\vartheta_i^{(j)}}(\cdot \mid \check X_{i-1}^{(j)})$. Set $\vartheta_{1:i}^{(j)} = (\check\Theta_{1:i-1}^{(j)}, \vartheta_i^{(j)})$, $\tilde X_{1:i}^{(j)} = (\check X_{1:i-1}^{(j)}, \tilde X_i^{(j)})$, and compute $w_i^{(j)} \propto p_{\vartheta_i^{(j)}}(y_i \mid \tilde X_i^{(j)})$.
  - Resample $(\check\Theta_{1:i}^{(j)}, \check X_{1:i}^{(j)}) \stackrel{\mathrm{i.i.d.}}{\sim} \sum_{j=1}^{N_{k+1}} w_i^{(j)}\, \delta_{(\vartheta_{1:i}^{(j)},\, \tilde X_{1:i}^{(j)})}(\cdot)$.
- Update
\[
\theta_{k+1} = \theta_k + \gamma_{k+1}\left( \frac{1}{N_{k+1}} \sum_{j=1}^{N_{k+1}} \check\Theta_n^{(j)} - \theta_k \right).
\]
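
A self-contained IF1 sketch on a toy state space model of my own choosing (an AR(1) latent state observed with noise through a level parameter $\theta$); the transition density is only simulated, never evaluated. I use the initial perturbation scale $\tau$ in the update in place of the $\sigma_k^2$ written above, since it plays that role in this simplified setting; all schedules and constants are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy model: x_1 ~ N(0, 1), x_i = 0.8 x_{i-1} + N(0, 0.6^2), y_i ~ N(x_i + theta, 1); true theta = 1.
n, theta_true = 100, 1.0
x = np.zeros(n)
x[0] = rng.normal()
for i in range(1, n):
    x[i] = 0.8 * x[i - 1] + rng.normal(scale=0.6)
y = x + theta_true + rng.normal(size=n)

def perturbed_filter_mean(theta, tau, sigma, N):
    """Bootstrap particle filter for the extended model with a random-walk parameter:
       vartheta_1 = theta + tau Z_1, vartheta_i = vartheta_{i-1} + sigma Z_i.
       Returns the filtered mean of vartheta_n given y_{1:n}."""
    vtheta = theta + tau * rng.standard_normal(N)
    xs = rng.standard_normal(N)                               # X_1 ~ N(0, 1)
    for i in range(n):
        if i > 0:
            vtheta = vtheta + sigma * rng.standard_normal(N)  # parameter perturbation
            xs = 0.8 * xs + rng.normal(scale=0.6, size=N)     # state transition (simulation only)
        logw = -0.5 * (y[i] - xs - vtheta) ** 2               # measurement density, up to a constant
        w = np.exp(logw - logw.max())
        w /= w.sum()
        idx = rng.choice(N, size=N, p=w)                      # multinomial resampling
        vtheta, xs = vtheta[idx], xs[idx]
    return vtheta.mean()

theta, tau, sigma = 0.0, 0.1, 0.005
for k in range(1, 51):
    gamma = 0.1 / k
    G = perturbed_filter_mean(theta, tau, sigma, N=2000)
    theta = theta + gamma * (G - theta) / tau**2              # (G - theta)/tau^2 approximates the score

print(theta)   # settles near the maximizer of the likelihood for this data set (theta_true = 1)
```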

Iterated Filtering

We have a similar result.

Theorem. Under some regularity conditions, and assuming

\[
\sum_k \gamma_k = \infty, \qquad \sum_k \gamma_{k+1}\sigma_k < \infty, \qquad \sum_k \frac{\gamma_{k+1}}{\sigma_k^2 N_k} < \infty,
\]

the sequence $\{\theta_n, n \ge 1\}$ produced by the IF1 algorithm converges almost surely to $\{\theta \in \mathbb{R}^d : \nabla \ell_n(\theta) = 0\}$.

Iterated filtering

- The issue with the algorithm is that the function $\sigma_k^{-2}(\theta - \theta_k)$ grows as $k \to \infty$.

- Hence (as reflected in the condition $\sum_k \gamma_{k+1} / (\sigma_k^2 N_k) < \infty$), we need to choose $\gamma_k$ small or increase the number of particles rapidly.

- It turns out that the IF algorithm is related to another well-known algorithm: the proximal-point algorithm.

- This connection can be exploited to build a better IF algorithm: IF1+.

Iterated filtering+

- Back to the simplified latent variable model: $y \sim p_\theta(\cdot)$, $\Theta \subseteq \mathbb{R}^d$, where
\[
p_\theta(y) = \int_{\mathsf{X}} \underbrace{f_\theta(y \mid x)\, p_\theta(x)}_{\bar p_\theta(y,\, x)}\, dx.
\]
- IF1 iteratively computes $\mathbb{E}\big[\check\Theta \mid \check Y = y\big]$ in the model
\[
\check\Theta = \theta + \sigma Z, \qquad \check X \sim p_{\check\Theta}(\cdot), \qquad \check Y \sim f_{\check\Theta}(\cdot \mid \check X),
\]

where $Z \sim N(0, I_d)$.
- It then estimates $\nabla \ell(\theta)$ by $\sigma^{-2}\big( \mathbb{E}\big[\check\Theta \mid \check Y = y\big] - \theta \big)$.

Iterated filtering+

- The posterior mean $\mathbb{E}\big[\check\Theta \mid \check Y = y\big]$ is the mean of the density
\[
\pi(u \mid y) \propto \exp\Big( \ell(u) - \frac{1}{2\sigma^2}\|u - \theta\|^2 \Big).
\]

- This mean should be close to the mode of that distribution, which is given by the proximal operator of $\ell$:

\[
\operatorname{Prox}_\sigma(\theta) \stackrel{\mathrm{def}}{=} \operatorname{Argmin}_u \Big\{ -\ell(u) + \frac{1}{2\sigma^2}\|u - \theta\|^2 \Big\}.
\]

- The proximal operator is a fundamental tool in convex optimization for finding critical points of non-smooth functions.

Iterated filtering+

\[
\operatorname{Prox}_\sigma(\theta) \stackrel{\mathrm{def}}{=} \operatorname{Argmin}_u \Big\{ -\ell(u) + \frac{1}{2\sigma^2}\|u - \theta\|^2 \Big\}.
\]

- If $\ell$ is bounded above then $\operatorname{Prox}_\sigma(\theta)$ is non-empty, and it is unique if $\ell$ is concave.

- Any fixed point of $\operatorname{Prox}_\sigma$ is a critical point of $\ell$.

- If $\ell$ is strongly concave then $\operatorname{Prox}_\sigma$ is a contraction.
- If $\ell$ is proper and concave, and $\sum_k \sigma_k^2 = \infty$, then the sequence $\theta_{k+1} = \operatorname{Prox}_{\sigma_k}(\theta_k)$ converges to a minimizer of $-\ell$.

- Notice that $\theta_{k+1} = \operatorname{Prox}_\sigma(\theta_k)$ is equivalent to

\[
\theta_{k+1} = \theta_k + \sigma^2 \nabla \ell(\theta_{k+1}),
\]

an “implicit gradient descent” step, which typically has better properties than the usual gradient descent.
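
A tiny numerical illustration of the proximal-point (implicit gradient) iteration, for a toy concave $\ell$ of my own choosing, with the prox computed by brute-force grid search:

```python
import numpy as np

ell = lambda u: -(u - 2.0) ** 2                  # toy strongly concave log-likelihood, maximum at u = 2
sigma = 0.5
grid = np.linspace(-5.0, 5.0, 20001)

def prox(theta):
    """Prox_sigma(theta) = argmin_u { -ell(u) + (u - theta)^2 / (2 sigma^2) }, by grid search."""
    return grid[np.argmin(-ell(grid) + (grid - theta) ** 2 / (2 * sigma**2))]

theta = -3.0
for _ in range(100):
    theta = prox(theta)                          # theta_{k+1} = Prox_sigma(theta_k)
print(theta)                                     # converges to the maximizer u = 2

# Each step solves theta_{k+1} = theta_k + sigma^2 * ell'(theta_{k+1}),
# the implicit-gradient form noted above.
```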

Iterated filtering+

Hence, instead of using the particle filter to estimate the gradient, we use it to approximate the proximal operator.

Algorithm (IF1+)

Given $\theta_k$:
1. Approximate the mean of the filtering distribution in the following state space model, using $N_{k+1}$ particles:

\[
\check\Theta_1 = \theta_k + \tau_{k+1} Z_1, \qquad \check X_1 \sim \nu_{\check\Theta_1}(\cdot), \qquad \check Y_1 \sim p_{\check\Theta_1}(\cdot \mid \check X_1),
\]

and for $i = 2, \ldots, n$,
\[
\check\Theta_i = \check\Theta_{i-1} + \sigma_{k+1} Z_i, \qquad \check X_i \mid \mathcal{F}_{i-1} \sim p_{\check\Theta_i}(\cdot \mid \check X_{i-1}), \qquad \check Y_i \mid \mathcal{F}_{i-1}, \check X_i \sim p_{\check\Theta_i}(\cdot \mid \check X_i).
\]

2. Set
\[
\theta_{k+1} = G_{k+1},
\]

where $G_{k+1}$ is the filtered mean obtained in Step 1.

Iterated filtering+

- One can view this algorithm as a noisy proximal iteration:

\[
\theta_{k+1} = \operatorname{Prox}_{\sigma_k}(\theta_k) + \epsilon_{k+1}.
\]

- Can we prove convergence by leveraging known results on the proximal iteration?

IF2

- IF1+ is related to an algorithm fairly popular in ecology known as data cloning (Lele et al. (2007), Ecology Letters).

- Let $\ell$ be a log-likelihood and $p$ a prior. Define

\[
\pi_k(\theta) \propto p(\theta)\, e^{k \ell(\theta)}.
\]

- If $X_k \sim \pi_k$, then $X_k$ converges in probability to $\operatorname{Argmax}_\theta \ell(\theta)$. This is basically simulated annealing.

- In the latent variable model, $\pi_k$ is the marginal of

\[
\pi(\theta, x_1, \ldots, x_k \mid y) \propto p(\theta) \prod_{i=1}^k f_\theta(x_i)\, f_\theta(y \mid x_i).
\]

IF2

- Data cloning and IF1+ suggest IF2.

- For any $f \in L^1(\mathsf{X})$, define $f_\sigma(u) \stackrel{\mathrm{def}}{=} \int f(z)\, K_\sigma(z, u)\, dz$.

- The Bayes operator associated with $\ell$ is the map $B$ that transforms $f \in L^1(\mathsf{X})$ into $B[f] \in L^1(\mathsf{X})$, where

\[
B[f](\theta) = \frac{f(\theta)\, e^{\ell(\theta)}}{\int_{\mathsf{X}} e^{\ell(\theta)}\, f(\theta)\, d\theta}, \qquad \theta \in \mathsf{X}.
\]

- Finally, we consider the transformation $T_\sigma$ that maps $f \in L^1(\mathsf{X})$ into

\[
T_\sigma[f] = B[f_\sigma]. \tag{1}
\]
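
A grid-based sketch of iterating the perturbed Bayes map $T_\sigma$ for a toy one-dimensional log-likelihood (my own choice); after enough iterations the density concentrates around the maximizer, consistent with the theorem on the next slide:

```python
import numpy as np

ell = lambda u: -2.0 * (u - 1.0) ** 2              # toy log-likelihood, maximized at theta_hat = 1

grid = np.linspace(-4.0, 6.0, 2001)
du = grid[1] - grid[0]
sigma = 0.1

# K[i, j] = N(grid[j]; grid[i], sigma^2): the Gaussian perturbation kernel K_sigma(z, u)
K = np.exp(-0.5 * ((grid[None, :] - grid[:, None]) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

f = np.ones_like(grid)
f /= f.sum() * du                                  # flat starting density

for _ in range(50):
    f_sigma = (f * du) @ K                         # f_sigma(u) = int f(z) K_sigma(z, u) dz
    f = f_sigma * np.exp(ell(grid))                # Bayes step: multiply by the likelihood ...
    f /= f.sum() * du                              # ... and renormalize: T_sigma[f] = B[f_sigma]

print(np.sum(grid * f) * du)                       # mean of T_sigma^50[f], close to theta_hat = 1
```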

IF2

Theorem. Assume $\ell$ is concave with a Lipschitz gradient. Suppose $\Theta_n \sim T_\sigma^n[f]$. There exist $\sigma_0 > 0$ and a finite constant $C$ such that, for $\sigma \in (0, \sigma_0]$,

\[
\mathbb{E}\big[\|\Theta_n - \hat\theta\|^2\big] \le C\sigma, \qquad \text{for all } n \ge \lceil 1/\sigma \rceil.
\]

Iterated filtering 2

Algorithm (IF2)

For $k = 1, 2, \ldots, K$:

1. Given $\hat p_k$, let $\hat p_{k+1}$ be the SMC approximation of the filtering distribution in the following state space model, using $N_{k+1}$ particles:

\[
\check\Theta_1 \sim \hat p_k, \qquad \check X_1 \sim \nu_{\check\Theta_1}(\cdot), \qquad \check Y_1 \sim p_{\check\Theta_1}(\cdot \mid \check X_1),
\]

and for $i = 2, \ldots, n$,
\[
\check\Theta_i = \check\Theta_{i-1} + \sigma_{k+1} Z_i, \qquad \check X_i \mid \mathcal{F}_{i-1} \sim p_{\check\Theta_i}(\cdot \mid \check X_{i-1}), \qquad \check Y_i \mid \mathcal{F}_{i-1}, \check X_i \sim p_{\check\Theta_i}(\cdot \mid \check X_i).
\]

Return $\hat p_K$.
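
A self-contained IF2 sketch, reusing the same toy state space model as in the IF1 sketch above; here the whole parameter swarm (rather than a single point estimate) is carried from one filtering pass to the next, with the perturbation scale $\sigma_m$ shrinking across passes. All model and tuning choices are my own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy model: x_1 ~ N(0, 1), x_i = 0.8 x_{i-1} + N(0, 0.6^2), y_i ~ N(x_i + theta, 1); true theta = 1.
n, theta_true = 100, 1.0
x = np.zeros(n)
x[0] = rng.normal()
for i in range(1, n):
    x[i] = 0.8 * x[i - 1] + rng.normal(scale=0.6)
y = x + theta_true + rng.normal(size=n)

N, n_pass = 2000, 40
swarm = rng.uniform(-3.0, 3.0, size=N)                     # initial parameter swarm (p_hat_0)
for m in range(n_pass):
    sigma = 0.1 * 0.95 ** m                                # perturbation scale shrinks across passes
    vtheta = swarm.copy()                                  # Theta_1^(j) ~ p_hat_m
    xs = rng.standard_normal(N)                            # X_1 ~ N(0, 1)
    for i in range(n):
        if i > 0:
            vtheta = vtheta + sigma * rng.standard_normal(N)
            xs = 0.8 * xs + rng.normal(scale=0.6, size=N)  # transitions are only simulated
        logw = -0.5 * (y[i] - xs - vtheta) ** 2
        w = np.exp(logw - logw.max())
        w /= w.sum()
        idx = rng.choice(N, size=N, p=w)
        vtheta, xs = vtheta[idx], xs[idx]
    swarm = vtheta                                         # p_hat_{m+1}: filtered parameter swarm

print(swarm.mean(), swarm.std())   # the swarm concentrates near the MLE as sigma_m decreases
```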

IF2

- Each iteration of IF2 is a Monte Carlo approximation to the map $T_\sigma$.

- $f$ and $T_\sigma[f]$ approximate the initial and final densities of the IF2 parameter swarm.

- When the standard deviation of the parameter perturbations is held fixed at $\sigma_m = \sigma > 0$, IF2 is a Monte Carlo approximation to $T_\sigma^K[f]$.

- Standard SMC theory can be used to show that $\hat p_K$ approximates $T_\sigma^K[f]$.

- In the case $\sigma = 0$, the iterated Bayes map corresponds to the data cloning approach of Lele et al. (2007).

- Taking $\sigma \neq 0$ adds numerical stability, which is necessary for the convergence of the SMC approximations.

Numerical examples

- We compare IF1, IF2, and the particle Markov chain Monte Carlo (PMCMC) method of Andrieu et al. (2010).

- PMCMC is an SMC-based plug-and-play algorithm for full-information Bayesian inference on POMPs.

- Computations were done using the pomp R package: King, Nguyen & Ionides (2015). Statistical inference for partially observed Markov processes via the R package pomp. To appear in the Journal of Statistical Software.

- Data and code reproducing our results are provided as a supplement to Ionides, Nguyen, Atchadé, Stoev & King (2015). Inference for dynamic and latent variable models via iterated, perturbed Bayes maps. Proceedings of the National Academy of Sciences of the USA.

Toy example

- $X(t) = \big( \exp\{\theta_1\},\ \theta_2 \exp\{\theta_1\} \big)$, constant for all $t$.

- 100 independent observations: given $X(t) = x$,
\[
Y_n \sim \mathrm{Normal}\!\left( x,\ \begin{pmatrix} 100 & 0 \\ 0 & 1 \end{pmatrix} \right).
\]

[Figure: A. IF1 point estimates from 30 replications and the MLE (green triangle). B. IF2 point estimates from 30 replications and the MLE (green triangle). C. Final parameter values of 30 PMCMC chains with $10^4$ filtering iterations. D. Kernel density estimates from 8 of these 30 PMCMC chains, and the true posterior distribution (dotted black line).]

Why is IF2 so much better than IF1 on this problem?

- IF1 updates the parameters by a linear combination of filtered parameter estimates for the extended model with time-varying parameters.

- Taking linear combinations can knock the optimizer off nonlinear ridges of the likelihood function.

- IF2 does not have this vulnerability.

- A heuristic argument suggests that IF2 has second-order convergence (as if the first and second derivatives were computable) whereas IF1 has first-order convergence. Formalizing this is an open problem.

Epidemiological applications: a review of disease dynamics

- Communicable diseases have long had a major global health impact (tuberculosis, among others).

- Emerging diseases need to be understood and controlled (HIV, Ebola, bird flu, SARS, etc.).

- Central to mathematical models is an infected population, $I(t)$, which interacts with a susceptible population, $S(t)$. Susceptible individuals become infected at a nonlinear rate $\beta\, I(t)\, S(t)$, where $\beta$ is a contact rate.

- The inherent stochasticity of biological populations, and our partial ability to observe epidemics, therefore lead to nonlinear POMP model inference problems.

Application to a model

The study population $P(t)$ is split into susceptibles, $S(t)$, infecteds, $I(t)$, and $k$ recovered classes $R_1(t), \ldots, R_k(t)$. The state process $X(t) = (S(t), I(t), R_1(t), \ldots, R_k(t))$ follows a stochastic differential equation driven by a Brownian motion $\{B(t)\}$:
\[
\begin{aligned}
dS &= \big( k R_k + \delta(P - S) - \lambda(t)\, S \big)\, dt + dP - (\sigma I / P)\, dB, \\
dI &= \big( \lambda(t)\, S - (m + \delta + \gamma) I \big)\, dt + (\sigma I / P)\, dB, \\
dR_1 &= \big( \gamma I - (k + \delta) R_1 \big)\, dt, \\
&\;\;\vdots \\
dR_k &= \big( k R_{k-1} - (k + \delta) R_k \big)\, dt.
\end{aligned}
\]
The nonlinearity arises through the force of infection, $\lambda(t)$, specified as
\[
\lambda(t) = \bar\beta \exp\Big\{ \beta_{\mathrm{trend}}(t - t_0) + \sum_{j=1}^{N_s} \beta_j s_j(t) \Big\}\, I/P \;+\; \bar\omega \exp\Big\{ \sum_{j=1}^{N_s} \omega_j s_j(t) \Big\},
\]
where $\{s_j(t),\ j = 1, \ldots, N_s\}$ is a periodic cubic B-spline basis. The data are monthly counts of cholera mortality, modeled as
\[
Y_n \sim \mathrm{Normal}(M_n, \tau^2 M_n^2), \qquad \text{where } M_n = m \int_{t_{n-1}}^{t_n} I(s)\, ds.
\]

[Figure: monthly reported cholera deaths in Dhaka, 1891-1940.]


Comparison of IF1 and IF2 on the cholera model. Algorithmic tuning parameters for both IF1 and IF2 were set at the values chosen by King et al. (2008) for IF1.

- Log-likelihoods of the parameter vectors output by IF1 and IF2, both started at a uniform draw from a large 23-dimensional hyper-rectangle.

- Dotted lines show the maximum log-likelihood.

Concluding remarks

- Iterated filtering is a novel framework for computing the MLE in state space models where the transition density of the state process is intractable.

- It has been applied to some interesting analyses of cholera dynamics (King et al., Nature, 2008).

- IF2 performs much better than IF1. Several questions about the theoretical understanding of IF1+ and IF2 remain open.