Importance Sampling & Sequential Importance Sampling
Arnaud Doucet
Departments of Statistics & Computer Science, University of British Columbia

Generic Problem

Consider a sequence of probability distributions $\{\pi_n\}_{n \geq 1}$ defined on a sequence of (measurable) spaces $\{(E_n, \mathcal{F}_n)\}_{n \geq 1}$, where $E_1 = E$, $\mathcal{F}_1 = \mathcal{F}$, and $E_n = E_{n-1} \times E$, $\mathcal{F}_n = \mathcal{F}_{n-1} \times \mathcal{F}$.

Each distribution $\pi_n(dx_{1:n}) = \pi_n(x_{1:n})\,dx_{1:n}$ is known up to a normalizing constant, i.e.

$$\pi_n(x_{1:n}) = \frac{\gamma_n(x_{1:n})}{Z_n}.$$

We want to estimate expectations of test functions $\varphi_n : E_n \to \mathbb{R}$,

$$\mathbb{E}_{\pi_n}(\varphi_n) = \int \varphi_n(x_{1:n})\,\pi_n(dx_{1:n}),$$

and/or the normalizing constants $Z_n$.

We want to do this sequentially: first $\pi_1$ and/or $Z_1$ at time 1, then $\pi_2$ and/or $Z_2$ at time 2, and so on.
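To make the setup concrete, the following minimal Python sketch gives one possible instance of the generic problem; the target below is an illustrative assumption, not taken from the lecture. The unnormalized density $\gamma_n$ is easy to evaluate pointwise, its normalizing constant $Z_n$ is intractable, and its dimension grows by one at each time step.

```python
import numpy as np

# Toy instance of the generic problem (an assumed example): gamma_n is a
# Gaussian random-walk prior on x_{1:n} multiplied by a nonlinear factor.
# It can be evaluated pointwise, but Z_n (the integral of gamma_n over
# E_n = R^n) has no closed form, and time n+1 appends a coordinate x_{n+1}.

def log_gamma_n(x_1n):
    x = np.asarray(x_1n, dtype=float)
    log_prior = -0.5 * x[0] ** 2 - 0.5 * np.sum(np.diff(x) ** 2)
    log_factor = -np.sum(np.abs(np.sin(x)))  # the term making Z_n intractable
    return log_prior + log_factor

print(log_gamma_n([0.1]))             # time 1: evaluates log gamma_1(x_1)
print(log_gamma_n([0.1, -0.3, 0.5]))  # time 3: evaluates log gamma_3(x_{1:3})
```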
Using Monte Carlo Methods

Problem 1: For most problems of interest, we cannot sample from $\pi_n(x_{1:n})$. The standard approach to sampling from high-dimensional distributions is to use iterative Markov chain Monte Carlo algorithms, but this is not appropriate in our context.

Problem 2: Even if we could sample exactly from $\pi_n(x_{1:n})$, the computational complexity of the algorithm would most likely increase with $n$, whereas we typically want an algorithm of fixed computational complexity at each time step.

Summary: We cannot use standard MC sampling in our case and, even if we could, this would not solve our problem.

Plan of the Lectures

- Review of Importance Sampling.
- Sequential Importance Sampling.
- Applications.

Importance Sampling

Importance Sampling (IS) identity: for any distribution $q$ such that $\pi(x) > 0 \Rightarrow q(x) > 0$,

$$\pi(x) = \frac{w(x)\,q(x)}{\int w(x)\,q(x)\,dx}, \qquad \text{where } w(x) = \frac{\gamma(x)}{q(x)},$$

where $q$ is called the importance distribution and $w$ the importance weight.

$q$ can be chosen arbitrarily; in particular, it can be chosen to be easy to sample from:

$$X^{(i)} \overset{\text{i.i.d.}}{\sim} q(\cdot) \;\Rightarrow\; \widehat{q}(dx) = \frac{1}{N}\sum_{i=1}^{N} \delta_{X^{(i)}}(dx).$$

Plugging this expression into the IS identity,

$$\widehat{\pi}(dx) = \sum_{i=1}^{N} W^{(i)}\,\delta_{X^{(i)}}(dx), \quad \text{where } W^{(i)} \propto w\big(X^{(i)}\big), \;\; \sum_{i=1}^{N} W^{(i)} = 1,$$

$$\widehat{Z} = \frac{1}{N}\sum_{i=1}^{N} w\big(X^{(i)}\big).$$

$\pi(x)$ is now approximated by a weighted sum of delta-masses: the weights compensate for the discrepancy between $\pi$ and $q$.
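The estimators above are easy to check numerically. The following Python sketch implements self-normalized importance sampling under assumed toy choices: the unnormalized target is $\gamma(x) = e^{-|x|}$, i.e. the double-exponential distribution of the example below, for which $Z = 2$ and $\mathbb{E}_\pi[X^2] = 2$, and the importance distribution $q$ is a Student-t.

```python
import numpy as np
from math import gamma as gamma_fn

# Minimal self-normalized importance sampling sketch (assumed toy setup):
# unnormalized target gamma(x) = exp(-|x|), so pi is double-exponential
# with Z = 2; the proposal q is a Student-t with nu degrees of freedom.

rng = np.random.default_rng(0)
N, nu = 100_000, 3

def gamma_u(x):                      # unnormalized target gamma(x)
    return np.exp(-np.abs(x))

def t_pdf(x, nu):                    # Student-t density: the proposal q(x)
    c = gamma_fn((nu + 1) / 2) / (np.sqrt(nu * np.pi) * gamma_fn(nu / 2))
    return c * (1.0 + x ** 2 / nu) ** (-(nu + 1) / 2)

X = rng.standard_t(nu, size=N)       # X^(i) drawn i.i.d. from q
w = gamma_u(X) / t_pdf(X, nu)        # importance weights w(X^(i))

Z_hat = w.mean()                     # (1/N) sum_i w(X^(i)); estimates Z = 2
W = w / w.sum()                      # normalized weights W^(i)
print(Z_hat, np.sum(W * X ** 2))     # both should be close to 2
```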
Practical recommendations

Select $q$ as close to $\pi$ as possible.

The variance of the weights is bounded if and only if

$$\int \frac{\gamma^2(x)}{q(x)}\,dx < \infty.$$

In practice, try to ensure that the weights are bounded, i.e.

$$\sup_x w(x) = \sup_x \frac{\gamma(x)}{q(x)} < \infty.$$

Note that in this case, rejection sampling could also be used to sample from $\pi(x)$.

Example

[Figure: target double-exponential distribution together with two IS distributions, a Gaussian and a heavy-tailed t-Student.]

[Figure: IS approximation (values of the weights against values of $x$) obtained using a Gaussian IS distribution.]

[Figure: IS approximation (values of the weights against values of $x$) obtained using a Student-t IS distribution.]
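The behaviour shown in the figures can be reproduced numerically. In the hedged sketch below (same assumed toy target as above), the double-exponential target paired with a Gaussian proposal gives $\gamma^2(x)/q(x) \propto e^{x^2/2 - 2|x|}$, so the weight variance is infinite and rare tail samples receive enormous weights, whereas a heavy-tailed Student-t proposal keeps $\sup_x w(x) < \infty$.

```python
import numpy as np
from math import gamma as gamma_fn

# Comparing proposal tails for the target gamma(x) = exp(-|x|) (an assumed
# toy setup): a light-tailed Gaussian q yields unbounded weights with
# infinite variance, while a heavy-tailed Student-t q yields bounded weights.

rng = np.random.default_rng(1)
N, nu = 100_000, 3

def gamma_u(x):
    return np.exp(-np.abs(x))

def normal_pdf(x):
    return np.exp(-x ** 2 / 2) / np.sqrt(2.0 * np.pi)

def t_pdf(x, nu):
    c = gamma_fn((nu + 1) / 2) / (np.sqrt(nu * np.pi) * gamma_fn(nu / 2))
    return c * (1.0 + x ** 2 / nu) ** (-(nu + 1) / 2)

X = rng.standard_normal(N)           # samples from the Gaussian proposal
Y = rng.standard_t(nu, size=N)       # samples from the Student-t proposal
w_gauss = gamma_u(X) / normal_pdf(X)
w_t = gamma_u(Y) / t_pdf(Y, nu)

for name, w in [("Gaussian q", w_gauss), ("Student-t q", w_t)]:
    # A large max/mean ratio signals the weight degeneracy in the figures.
    print(f"{name}: max/mean weight ratio = {w.max() / w.mean():.1f}")
```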