
Iterated Filtering Algorithms

Yves Atchadé, Department of Statistics, University of Michigan
Joint work with Edward Ionides, Dao Nguyen, and Aaron King, University of Michigan
May 15, 2015

Introduction
- Discrete state space model: X_1 ∼ ν_θ; X_k | F_{k−1} ∼ p_θ(· | X_{k−1}); and Y_k | {F_{k−1}, X_k} ∼ p_θ(· | X_k), k ≥ 1.
- The Markov chain {X_k, k ≥ 1} is latent (unobserved). We observe {Y_k, k ≥ 1}.
- The goal is to estimate the parameter θ ∈ R^p from the observations {y_k, 1 ≤ k ≤ n} of {Y_k, 1 ≤ k ≤ n}.
- References:
  - E. L. Ionides, A. Bhadra, Y. F. Atchadé and A. King (2011). Iterated filtering. Annals of Statistics 39, 1776-1802.
  - E. L. Ionides, D. Nguyen, Y. F. Atchadé, S. Stoev, and A. King (2015). Inference for dynamic and latent variable models via iterated, perturbed Bayes maps. Proceedings of the National Academy of Sciences 112 (3), 719-724.

Introduction
- Given data y_{1:n}, the log-likelihood function is
    ℓ_n(θ) := log p_θ(y_{1:n}),
  where p_θ(y_{1:n}) is the density of Y_{1:n}:
    p_θ(y_{1:n}) = ∫ p_θ(x_{1:n}) p_θ(y_{1:n} | x_{1:n}) dx_{1:n}.
- Our goal is to compute the MLE
    θ̂_n = Argmax_{θ ∈ Θ} ℓ_n(θ).
- This is a standard but challenging computational problem.
- We are particularly interested in the case where the transition densities p_θ(x_k | x_{k−1}) are intractable.

Introduction
- Example: a diffusion observed with noise,
    dX_t = μ_θ(X_t) dt + C_θ(X_t) dW_t,
    Y_t = ∫_0^t h_θ(X_s) ds + ε_t.
- We observe {Y_{t_1}, ..., Y_{t_n}} at some non-random times t_1 < ... < t_n.
- We wish to estimate θ.
- The transition densities of X_{t_1}, ..., X_{t_n} are available in closed form only in a few cases.

Introduction
- There is a large literature on this problem; C. Li (Annals of Statistics, 2013) provides a good review of the main methods.
- A very common strategy is to use a discretization of the SDE (Euler scheme). Assuming t_k = t_{k−1} + δ,
    X_{t_k} = X_{t_{k−1}} + δ μ_θ(X_{t_{k−1}}) + √δ C_θ(X_{t_{k−1}}) Z_k.
- This can introduce a huge bias if the observation mesh is not fine enough.

Introduction
- One can improve this somewhat with data augmentation: add additional imputation times t_i < κ_1^{(i)} < ... < κ_{m_i}^{(i)} < t_{i+1} to reduce the discretization error,
- and estimate θ while treating the X_{κ_j^{(i)}} as nuisance variables.
- However, there are limits to how much imputation can be done.

Introduction
- Recall that the likelihood function is
    L_n(θ) = ∫ p_θ(x_{1:n}) p_θ(y_{1:n} | x_{1:n}) dx_{1:n}.
- One of the simplest ideas is to approximate the likelihood function by importance sampling:
    L̂_N(θ) = (1/N) Σ_{i=1}^N p_θ(y_{1:n} | X^{(i)}_{1:n}) p_θ(X^{(i)}_{1:n}) / q(X^{(i)}_{1:n}),   where X^{(i)}_{1:n} ∼ q, i.i.d.,
  and q is a probability measure on the sample space X^n.
- Well-known difficulty: it is hard to choose q. The variance can be terribly large, in particular if q(x_{1:n}) = p_θ(x_{1:n}).
- Sequential Monte Carlo can produce better estimates of ℓ_n(θ), but these estimates are typically discontinuous functions of θ.
- Also, approximating and maximizing L_n(θ) is typically not a stable way of computing the MLE. Solving ∇ℓ_n(θ) = 0, where ∇ℓ_n(θ) := ∇ log L_n(θ), is a more stable problem.

Introduction
- Since
    L_n(θ) = ∫ p_θ(x_{1:n}) p_θ(y_{1:n} | x_{1:n}) dx_{1:n},
  we have (also known as Louis's identity)
    ∇ℓ_n(θ) = ∫ ∇ log p_θ(x_{1:n}, y_{1:n}) p_θ(x_{1:n} | y_{1:n}) dx_{1:n}
            = ∫ ∇ [ log{ν_θ(x_1) q_θ(y_1 | x_1)} + Σ_{j=2}^n log{f_θ(x_j | x_{j−1}) q_θ(y_j | x_j)} ] p_θ(x_{1:n} | y_{1:n}) dx_{1:n}.

Introduction
- If
    p̂_θ^N(dx_{1:n} | y_{1:n}) = Σ_{i=1}^N W_i δ_{X̄^{(i)}_{1:n}}(dx_{1:n})
  is an SMC approximation of p_θ(x_{1:n} | y_{1:n}) dx_{1:n}, we can easily estimate ∇ℓ_n(θ) by
    Ĝ_N(θ) = ∫ ∇ log p_θ(x_{1:n}, y_{1:n}) p̂_θ^N(dx_{1:n} | y_{1:n}).
- Kantas, Doucet, Singh, Maciejowski, and Chopin (arXiv:1412.8695v1, 2014) review the different approaches.

Introduction
- This leads to a stochastic gradient algorithm.

Algorithm (Stochastic gradient method). Given θ_k:
- Compute G_{k+1} = Ĝ_{N_{k+1}}(θ_k).
- Update
    θ_{k+1} = θ_k + γ_{k+1} G_{k+1}
            = θ_k + γ_{k+1} ∇_θ ℓ_n(θ_k) + γ_{k+1} (G_{k+1} − ∇_θ ℓ_n(θ_k)).
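As an illustration, here is a minimal Python sketch of the stochastic gradient recursion above. The helper smc_score_estimate(theta, y, N) is a hypothetical, user-supplied function returning the SMC score estimate Ĝ_N(θ); the step-size and particle-number schedules shown are just one choice consistent with the convergence conditions discussed next.

```python
import numpy as np

def stochastic_gradient_mle(theta0, y, smc_score_estimate,
                            n_iter=200, gamma0=0.1, N0=100):
    """Stochastic gradient ascent theta_{k+1} = theta_k + gamma_{k+1} G_{k+1}.

    smc_score_estimate(theta, y, N) is a hypothetical user-supplied helper
    returning an SMC estimate of the score grad l_n(theta) with N particles.
    """
    theta = np.asarray(theta0, dtype=float)
    for k in range(n_iter):
        gamma = gamma0 / (k + 1)      # decreasing steps: sum of gamma_k diverges
        N_k = N0 * (k + 1)            # growing particle numbers: sum of gamma_k / N_k converges
        G = smc_score_estimate(theta, y, N_k)
        theta = theta + gamma * G
    return theta
```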
Introduction
- Typically, for convergence one chooses γ_k → 0 with Σ_k γ_k = ∞ and Σ_k γ_k / N_k < ∞, where N_k is the number of particles used at iteration k.
- However, if ℓ_n is strongly convex (with constant μ) and its gradient is L-Lipschitz, one can do much better with a fixed step size γ = 2/(μ + L). Then N_k ↑ ∞ is enough for convergence.
- However, these SMC estimates of G_k are not feasible when the transition densities p_θ(x_k | x_{k−1}) cannot be computed.

Introduction
- This talk presents a series of algorithms (IF1, IF1+, and IF2) that solve this problem without having to evaluate the transition densities p_θ(x_k | x_{k−1}).
- We will only need the ability to simulate from the transition densities.
- I will present some convergence results informally. We are still looking to say more about these algorithms, particularly in the non-concave case and on comparing the schemes.
- I will close with some simulation results.

Simplified version
- A standard latent variable model: y ∼ p_θ(·), Θ ⊆ R^d, where
    p_θ(y) = ∫_X f_θ(y | x) p_θ(x) dx,   with   p̄_θ(y, x) := f_θ(y | x) p_θ(x).
- Given the observation y, we wish to compute
    θ̂ := Argmax_{θ ∈ Θ} ℓ(θ),   where   ℓ(θ) = log ∫_X p̄_θ(y, x) dx.
- p_θ(x) cannot be computed, but it is easy to draw samples from it.

Iterated importance sampling
- Let K be a density on Θ with ∫_Θ u K(u) du = 0 and ∫_Θ u u' K(u) du = I_d. Think of Θ = R^d and K the density of N(0, I_d).
- Let ℓ : Θ → R be some arbitrary log-likelihood function.
- For σ > 0 and θ ∈ Θ, consider the probability measure
    π_{θ,σ}(du) = σ^{−d} K((u − θ)/σ) e^{ℓ(u)} du / ∫ σ^{−d} K((z − θ)/σ) e^{ℓ(z)} dz.
- π_{θ,σ}(·) is the posterior distribution of a model in which the log-likelihood is ℓ and the prior is σ^{−d} K((u − θ)/σ) (the density of θ + σZ, with Z ∼ K).

Iterated importance sampling
- Statisticians are familiar with the idea of the posterior distribution as a blend of prior and likelihood.
- The next result would be very intuitive to a Bayesian.

Theorem. Fix σ > 0 and θ ∈ Θ. Under appropriate smoothness assumptions on ℓ,
    ∫ u π_{θ,σ}(du) = θ + σ^2 ∇_θ ℓ(θ) + σ^3 ε(σ, θ),
where ε is uniformly bounded for σ in a neighborhood of 0 and for θ in a compact subset of Θ.

Iterated importance sampling
- The remainder can be improved to O(σ^4) if we use a density K with vanishing third moments (Doucet, Jacob and Rubenthaler, arXiv:1304.5768v2, 2013).
- If we ignore the remainder σ^3 ε(σ, θ), we have
    ∫ u π_{θ,σ}(du) ≈ θ + σ^2 ∇_θ ℓ(θ).
- For σ small we can therefore approximate ∇_θ ℓ(θ) by
    σ^{−2} ∫ (u − θ) π_{θ,σ}(du).

Iterated importance sampling
- Consider now the case where
    ℓ(θ) = log ∫_X p̄_θ(y, x) dx.
- Fix θ. Pretend that the (unobserved) signal is (Θ̌, X̌), where
    Θ̌ = θ + σZ,   X̌ ∼ p_Θ̌(·),   and we observe Y̌ ∼ f_Θ̌(· | X̌),   with Z ∼ K.
- Using the observation y, we attempt to recover Θ̌. The previous theorem says
    σ^{−2} (E[Θ̌ | Y̌ = y] − θ) = ∇ℓ(θ) + O(σ).

Iterated importance sampling
- Since E[σ^{−2}(Θ̌ − θ) | Y̌ = y] = ∇ℓ(θ) + O(σ), we can approximate ∇ℓ(θ) by importance sampling.
- Let Z_{1:N} ∼ K, i.i.d. Set Θ̌_k = θ + σ Z_k and X̌_k | Θ̌_k ∼ p_{Θ̌_k}(·).
- Set
    W_k = f_{Θ̌_k}(y | X̌_k) / Σ_{j=1}^N f_{Θ̌_j}(y | X̌_j).

Proposition. As N → ∞,
    Σ_{k=1}^N W_k (Θ̌_k − θ) / σ^2
converges in probability to ∇ℓ(θ).

Iterated importance sampling

Algorithm (Iterated IS). Given θ_k:
- Generate Z^{(1)}, ..., Z^{(N_k)} ∼ K(·), i.i.d. Set Θ̌^{(j)} = θ_k + σ_k Z^{(j)} and X̌^{(j)} | Θ̌^{(j)} ∼ p_{Θ̌^{(j)}}(·).
- Set
    W^{(j)} = f_{Θ̌^{(j)}}(y | X̌^{(j)}) / Σ_{i=1}^{N_k} f_{Θ̌^{(i)}}(y | X̌^{(i)}),
  and
    Ĝ_{k+1} = σ_k^{−2} Σ_{j=1}^{N_k} W^{(j)} (Θ̌^{(j)} − θ_k).
- Update θ_{k+1} = θ_k + γ_{k+1} Ĝ_{k+1}.
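The following Python sketch implements one pass of the Iterated IS update under the simplified latent variable model. The hooks sample_latent(theta, rng), drawing X ∼ p_θ(·), and obs_density(y, x, theta), evaluating f_θ(y | x), are assumed, hypothetical names; note that the latent density p_θ(x) is only simulated from, never evaluated.

```python
import numpy as np

def iterated_is_step(theta_k, y, sigma_k, gamma_k, N_k,
                     sample_latent, obs_density, rng=None):
    """One iteration of the Iterated IS algorithm sketched above.

    sample_latent(theta, rng) draws X ~ p_theta(.), and
    obs_density(y, x, theta) evaluates f_theta(y | x); both are assumed,
    user-supplied pieces of the latent variable model.
    """
    rng = np.random.default_rng() if rng is None else rng
    theta_k = np.asarray(theta_k, dtype=float)
    # Perturb the parameter: Theta^(j) = theta_k + sigma_k Z^(j), Z^(j) ~ K = N(0, I_d)
    Z = rng.standard_normal((N_k, theta_k.size))
    theta_pert = theta_k + sigma_k * Z
    # Simulate the latent variable at each perturbed parameter, weight by f_theta(y | x)
    w = np.array([obs_density(y, sample_latent(th, rng), th) for th in theta_pert])
    W = w / w.sum()                     # self-normalised importance weights
    # Gradient estimate: sigma_k^{-2} * sum_j W^(j) (Theta^(j) - theta_k)
    G_hat = (W[:, None] * (theta_pert - theta_k)).sum(axis=0) / sigma_k**2
    return theta_k + gamma_k * G_hat
```

Iterating this step with σ_k ↓ 0 and step sizes γ_k chosen as in the conditions below gives the full procedure.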
Iterated importance sampling
- The update can be decomposed as
    θ_{k+1} = θ_k + γ_{k+1} Ĝ_{k+1}
            = θ_k + γ_{k+1} ∇ℓ(θ_k)
              + γ_{k+1} [ Ĝ_{k+1} − σ_k^{−2} ∫ (u − θ_k) π_{θ_k,σ_k}(du) ]
              + γ_{k+1} [ σ_k^{−2} ∫ (u − θ_k) π_{θ_k,σ_k}(du) − ∇ℓ(θ_k) ]
            = θ_k + γ_{k+1} ∇ℓ(θ_k) + γ_{k+1} ε^{(1)}_{k+1} + γ_{k+1} ε^{(2)}_{k+1}.
- For convergence, we roughly need σ_k ↓ 0,
    Σ_k γ_k = ∞,   Σ_k γ_{k+1} σ_k < ∞,   Σ_k γ_{k+1} / (σ_k^2 N_k) < ∞.

Iterated importance sampling

Theorem. Under some regularity conditions, and assuming
    Σ_k γ_k = ∞,   Σ_k γ_{k+1} σ_k < ∞,   Σ_k γ_{k+1} / (σ_k^2 N_k) < ∞,
the sequence {θ_n, n ≥ 1} produced by the Iterated IS algorithm converges almost surely to the set {θ ∈ R^d : ∇ℓ(θ) = 0}.
- Regularity condition: we need {θ_n, n ≥ 0} to remain in a compact set.

Iterated filtering
- We now return to the state space model {(X_i, Y_i), 1 ≤ i ≤ n}, where X_1 ∼ ν_θ, X_i | X_{i−1} ∼ p_θ(x_i | x_{i−1}), and Y_i | {Y_{1:i−1}, X_{1:i}} ∼ p_θ(y_i | x_i).
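In this state-space setting, the key practical point is that the algorithms of the talk only require forward simulation of the latent process. As a concrete illustration of that requirement (a generic sketch, not the IF1/IF1+/IF2 algorithms themselves), a bootstrap particle filter needs only to simulate X_1 ∼ ν_θ and X_i | X_{i−1} ∼ p_θ(· | x_{i−1}) and to evaluate p_θ(y_i | x_i); the helper names sample_init, sample_transition, and obs_density below are assumptions.

```python
import numpy as np

def bootstrap_particle_filter(y, theta, N, sample_init, sample_transition,
                              obs_density, rng=None):
    """Generic bootstrap particle filter for the state space model above.

    It only simulates from nu_theta and p_theta(x_i | x_{i-1}) and evaluates
    p_theta(y_i | x_i); the transition densities themselves are never computed.
    sample_init, sample_transition and obs_density are assumed, user-supplied
    hooks. Returns an estimate of the log-likelihood l_n(theta).
    """
    rng = np.random.default_rng() if rng is None else rng
    particles = sample_init(theta, N, rng)                       # X_1^(1:N) ~ nu_theta
    loglik = 0.0
    for i, y_i in enumerate(y):
        if i > 0:
            particles = sample_transition(particles, theta, rng)  # X_i | X_{i-1}
        w = obs_density(y_i, particles, theta)                   # p_theta(y_i | x_i), per particle
        loglik += np.log(np.mean(w))
        idx = rng.choice(N, size=N, p=w / np.sum(w))             # multinomial resampling
        particles = particles[idx]
    return loglik
```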