Importance Sampling & Sequential Importance Sampling
Arnaud Doucet Departments of Statistics & Computer Science University of British Columbia
Generic Problem

Consider a sequence of probability distributions $\{\pi_n\}_{n\geq 1}$ defined on a sequence of (measurable) spaces $\{(E_n, \mathcal{F}_n)\}_{n\geq 1}$ where $E_1 = E$, $\mathcal{F}_1 = \mathcal{F}$ and $E_n = E_{n-1}\times E$, $\mathcal{F}_n = \mathcal{F}_{n-1}\times\mathcal{F}$.

Each distribution $\pi_n(dx_{1:n}) = \pi_n(x_{1:n})\,dx_{1:n}$ is known up to a normalizing constant, i.e.

$$\pi_n(x_{1:n}) = \frac{\gamma_n(x_{1:n})}{Z_n}.$$

We want to estimate expectations of test functions $\varphi_n : E_n \to \mathbb{R}$,

$$\mathbb{E}_{\pi_n}(\varphi_n) = \int \varphi_n(x_{1:n})\,\pi_n(dx_{1:n}),$$

and/or the normalizing constants $Z_n$.

We want to do this sequentially; i.e. first $\pi_1$ and/or $Z_1$ at time 1, then $\pi_2$ and/or $Z_2$ at time 2, and so on.
Using Monte Carlo Methods

Problem 1: For most problems of interest, we cannot sample from $\pi_n(x_{1:n})$. The standard approach to sampling from high-dimensional distributions is to use iterative Markov chain Monte Carlo algorithms, but these are not appropriate in our context.

Problem 2: Even if we could sample exactly from $\pi_n(x_{1:n})$, the computational complexity of the algorithm would most likely increase with $n$, whereas we typically want an algorithm of fixed computational complexity at each time step.

Summary: We cannot use standard MC sampling in our case and, even if we could, it would not solve our problem.
Plan of the Lectures

Review of Importance Sampling.
Sequential Importance Sampling.
Applications.
Importance Sampling

Importance Sampling (IS) identity. For any distribution $q$ such that $\pi(x) > 0 \Rightarrow q(x) > 0$,

$$\pi(x) = \frac{w(x)\,q(x)}{\int w(x)\,q(x)\,dx} \quad\text{where}\quad w(x) = \frac{\gamma(x)}{q(x)},$$

where $q$ is called the importance distribution and $w$ the importance weight.

$q$ can be chosen arbitrarily, in particular easy to sample from:

$$X^{(i)} \overset{\text{i.i.d.}}{\sim} q(\cdot) \;\Rightarrow\; \widehat{q}(dx) = \frac{1}{N}\sum_{i=1}^N \delta_{X^{(i)}}(dx).$$
Plugging this expression into the IS identity,

$$\widehat{\pi}(dx) = \sum_{i=1}^N W^{(i)}\,\delta_{X^{(i)}}(dx) \quad\text{where}\quad W^{(i)} \propto w\big(X^{(i)}\big),$$
$$\widehat{Z} = \frac{1}{N}\sum_{i=1}^N w\big(X^{(i)}\big).$$

$\pi(x)$ is now approximated by a weighted sum of delta-masses $\Rightarrow$ the weights compensate for the discrepancy between $\pi$ and $q$.
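As a minimal numerical sketch of these estimators (the unnormalized target $\gamma(x) = e^{-x^2/2}$, the $\mathcal{N}(0, 2^2)$ proposal and the test function $\varphi(x) = x^2$ are illustrative choices, not from the lecture):

```python
import math
import random

random.seed(0)

def gamma_(x):
    # Unnormalized target: gamma(x) = exp(-x^2/2), so Z = sqrt(2*pi)
    return math.exp(-0.5 * x * x)

def q_pdf(x):
    # Proposal q = N(0, 2^2): easy to sample from, wider than the target
    return math.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * math.sqrt(2 * math.pi))

N = 200_000
xs = [random.gauss(0.0, 2.0) for _ in range(N)]
ws = [gamma_(x) / q_pdf(x) for x in xs]          # w(x) = gamma(x) / q(x)

Z_hat = sum(ws) / N                              # estimates Z = sqrt(2*pi)
# Self-normalized estimate of E_pi[X^2] = 1 (target is standard normal)
phi_hat = sum(w * x * x for w, x in zip(ws, xs)) / sum(ws)
print(Z_hat, phi_hat)
```

The normalized weights $W^{(i)}$ appear implicitly in the self-normalized ratio on the last line.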
Practical recommendations

Select $q$ as close to $\pi$ as possible. The variance of the weights is bounded if and only if

$$\int \frac{\gamma^2(x)}{q(x)}\,dx < \infty.$$

In practice, try to ensure that the weights are bounded, i.e.

$$w(x) = \frac{\gamma(x)}{q(x)} \leq C < \infty \text{ for all } x.$$

Note that in this case, rejection sampling could be used to sample from $\pi(x)$.
Example

Figure: Target double exponential distribution and two IS distributions (a Gaussian and a heavy-tailed t-Student).
Figure: IS approximation (values of the weights against values of x) obtained using a Gaussian IS distribution.
Figure: IS approximation (values of the weights against values of x) obtained using a Student-t IS distribution.
We try to compute, using Monte Carlo,

$$\int \frac{x^2}{1-x}\,\pi(x)\,dx$$

where

$$\pi(x) = \frac{\Gamma((\nu+1)/2)}{\sqrt{\nu\pi}\,\Gamma(\nu/2)}\left(1+\frac{x^2}{\nu}\right)^{-(\nu+1)/2}$$

is a Student-t distribution with $\nu > 1$ (you can sample from it by composition: $\mathcal{N}(0,1)/\sqrt{\mathcal{G}a(\nu/2,\nu/2)}$).

We use $q_1(x) = \pi(x)$, $q_2(x) = \frac{\Gamma(1)}{\sqrt{\nu\pi}\,\Gamma(1/2)}\left(1+\frac{x^2}{\nu}\right)^{-1}$ (a Cauchy-type distribution) and $q_3(x) = \mathcal{N}\left(x; 0, \frac{\nu}{\nu-2}\right)$ (variance chosen to match the variance of $\pi(x)$). It is easy to see that

$$\frac{\pi(x)}{q_1(x)} < \infty, \qquad \int \frac{\pi(x)^2}{q_3(x)}\,dx = \infty, \qquad \frac{\pi(x)}{q_3(x)} \text{ is unbounded.}$$
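A quick numerical illustration of this example (a sketch with $\nu = 12$; the sampler and densities follow the formulas above): plain Monte Carlo with $q_1 = \pi$ is well behaved, while the variance-matched Gaussian $q_3$ occasionally produces very large weights, reflecting the unbounded ratio $\pi/q_3$:

```python
import math
import random

random.seed(1)
nu = 12.0

def t_pdf(x):
    # Student-t density with nu degrees of freedom
    c = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return c * (1 + x * x / nu) ** (-(nu + 1) / 2)

def t_sample():
    # Composition: N(0,1) / sqrt(Ga(nu/2, rate nu/2)); gammavariate takes (shape, scale)
    g = random.gammavariate(nu / 2, 2.0 / nu)
    return random.gauss(0, 1) / math.sqrt(g)

s2 = nu / (nu - 2)  # variance of the target, matched by the Gaussian proposal q3

def q3_pdf(x):
    return math.exp(-x * x / (2 * s2)) / math.sqrt(2 * math.pi * s2)

N = 50_000
# q1 = pi itself: plain Monte Carlo, all weights constant
mc = sum(t_sample() ** 2 for _ in range(N)) / N   # estimates E[X^2] = nu/(nu-2) = 1.2

# q3 = variance-matched Gaussian: weights pi/q3 blow up in the tails
xs = [random.gauss(0, math.sqrt(s2)) for _ in range(N)]
ws = [t_pdf(x) / q3_pdf(x) for x in xs]
print(mc, sum(ws) / N, max(ws))   # mean weight ~ 1, but the largest weights are big
```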
A.D.() 11 / 40 Figure: Performance for ν = 12 with q1 (solid line), q2 (dashes) and q3 (light dots). Final values 1.14, 1.14 and 1.16 vs true value 1.13
We now try to compute

$$\int_{2.1}^{\infty} x^5\,\pi(x)\,dx.$$

We try the same importance distributions, but we can also use the fact that, with the change of variables $u = 1/x$,

$$\int_{2.1}^{\infty} x^5\,\pi(x)\,dx = \int_0^{1/2.1} u^{-7}\,\pi(1/u)\,du = \frac{1}{2.1}\int_0^{1/2.1} 2.1\,u^{-7}\,\pi(1/u)\,du,$$

which is the expectation of $\frac{u^{-7}\,\pi(1/u)}{2.1}$ with respect to $\mathcal{U}[0, 1/2.1]$.
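The change-of-variables estimator is a few lines of Python (a sketch for $\nu = 12$; the slide quotes a true value of about 6.54):

```python
import math
import random

random.seed(2)
nu = 12.0

def t_pdf(x):
    # Student-t density with nu degrees of freedom
    c = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return c * (1 + x * x / nu) ** (-(nu + 1) / 2)

# E_U[h(U)] with U ~ U[0, 1/2.1] and h(u) = u^{-7} pi(1/u) / 2.1 equals the tail integral.
N = 100_000
b = 1.0 / 2.1
est = 0.0
for _ in range(N):
    u = random.uniform(0.0, b)
    est += u ** (-7) * t_pdf(1.0 / u) / 2.1
est /= N
print(est)
```

The integrand $u^{-7}\pi(1/u)$ is smooth and bounded on $[0, 1/2.1]$, which is why this estimator has low variance.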
A.D.() 13 / 40 Figure: Performance for ν = 12 with q1 (solid), q2 (short dashes), q3 (dots), uniform (long dashes). Final values 6.75, 6.48, 7.06 and 6.48 vs true value 6.54
Application to Bayesian Statistics

Consider a Bayesian model: prior $\pi(\theta)$ and likelihood $f(x|\theta)$.

The posterior distribution is given by

$$\pi(\theta|x) = \frac{\pi(\theta)\,f(x|\theta)}{\int_\Theta \pi(\theta)\,f(x|\theta)\,d\theta} \propto \gamma(\theta|x)$$

where $\gamma(\theta|x) = \pi(\theta)\,f(x|\theta)$.

We can use the prior distribution as a candidate distribution, $q(\theta) = \pi(\theta)$. We also get an estimate of the marginal likelihood

$$\int_\Theta \pi(\theta)\,f(x|\theta)\,d\theta.$$
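As an illustration of using the prior as the IS candidate (the Bernoulli likelihood and uniform prior below are an illustrative model, not from the lecture), the weights reduce to likelihood values and $\widehat{Z}$ estimates the marginal likelihood:

```python
import math
import random

random.seed(3)

m, k = 10, 7  # m coin flips, k heads observed

def likelihood(theta):
    # f(x | theta) for the fixed data set
    return theta ** k * (1 - theta) ** (m - k)

N = 100_000
thetas = [random.random() for _ in range(N)]   # q(theta) = prior = U(0,1)
ws = [likelihood(t) for t in thetas]           # weights = likelihood values

Z_hat = sum(ws) / N                            # marginal likelihood estimate
Z_true = 1.0 / ((m + 1) * math.comb(m, k))     # exact Beta integral
# Self-normalized posterior mean; exact value is (k+1)/(m+2)
post_mean = sum(w * t for w, t in zip(ws, thetas)) / sum(ws)
print(Z_hat, Z_true, post_mean)
```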
Example: application to the Bayesian analysis of a Markov chain. Consider a two-state Markov chain with transition matrix

$$\begin{pmatrix} p_1 & 1-p_1 \\ 1-p_2 & p_2 \end{pmatrix},$$

that is, $\Pr(X_{t+1}=1|X_t=1) = 1 - \Pr(X_{t+1}=2|X_t=1) = p_1$ and $\Pr(X_{t+1}=2|X_t=2) = 1 - \Pr(X_{t+1}=1|X_t=2) = p_2$. Physical constraints tell us that $p_1 + p_2 < 1$.

Assume we observe $x_1,\ldots,x_m$ and the prior is

$$\pi(p_1,p_2) = 2\,\mathbb{I}_{p_1+p_2\leq 1};$$

then the posterior is

$$\pi(p_1,p_2|x_{1:m}) \propto p_1^{m_{1,1}}\,(1-p_1)^{m_{1,2}}\,(1-p_2)^{m_{2,1}}\,p_2^{m_{2,2}}\,\mathbb{I}_{p_1+p_2\leq 1}$$

where

$$m_{i,j} = \sum_{t=1}^{m-1} \mathbb{I}_{x_t=i}\,\mathbb{I}_{x_{t+1}=j}.$$

The posterior does not admit a standard expression and its normalizing constant is unknown. We can sample from it using rejection sampling.

If there were no constraint $p_1+p_2 < 1$ and $\pi(p_1,p_2)$ were uniform on $[0,1]\times[0,1]$, then the posterior would be
$$\pi_0(p_1,p_2|x_{1:m}) = \mathcal{B}e(p_1; m_{1,1}+1, m_{1,2}+1)\,\mathcal{B}e(p_2; m_{2,2}+1, m_{2,1}+1),$$

but this is inefficient, as for the given data $(m_{1,1}, m_{1,2}, m_{2,2}, m_{2,1})$ we have $\pi_0(p_1+p_2<1|x_{1:m}) = 0.21$.

The form of the posterior suggests using a Dirichlet distribution with density

$$\pi_1(p_1,p_2|x_{1:m}) \propto p_1^{m_{1,1}}\,p_2^{m_{2,2}}\,(1-p_1-p_2)^{m_{1,2}+m_{2,1}},$$

but $\pi(p_1,p_2|x_{1:m})/\pi_1(p_1,p_2|x_{1:m})$ is unbounded.

We are interested in estimating $\mathbb{E}[\varphi_i(p_1,p_2)|x_{1:m}]$ for $\varphi_1(p_1,p_2) = p_1$, $\varphi_2(p_1,p_2) = p_2$, $\varphi_3(p_1,p_2) = p_1/(1-p_1)$, $\varphi_4(p_1,p_2) = p_2/(1-p_2)$ and $\varphi_5(p_1,p_2) = \log\frac{p_1(1-p_2)}{p_2(1-p_1)}$ using Importance Sampling.
π3 (p1, p2 x1 m ) = e (p1; m1,1 + 1, m1,2 + 1) π3 (p2 x1 m, p1) j : B j : m2,1 m2,2 where π (p2 x1:m, p1) ∝ (1 p2) p2 Ip2 1 p1 is badly j 2 approximated through π3 (p2 x1:m, p1) = 2 p2Ip2 1 p1 . It is (1 p1 ) j straightforward to check that π (p1, p2 x1:m ) /π3 (p1, p2 x1:m ) ∝ m2,1 m2,2 2 j j (1 p2) p2 / 2 p2 < ∞. (1 p1 )
(Geweke, 1989) proposed using the normal approximation to the binomial distribution 2 π2 (p1, p2 x1 m ) ∝ exp (m1,1 + m1,2)(p1 p1) / (2p1 (1 p1)) j : 2 exp (m2,1 + m2,2)(p2 p2) / (2p2 (1 p2)) Ip1 +p2 1 b b b where p = m / (m + m ) , p = m / (m + m ). Then to 1 1,1 1,1 1,2 1 2,2 2b,2 2,1b b simulate from this distribution, we simulate …rst π2 (p1 x1 m ) and j : then πb2 (p2 x1:m, p1) which are univariateb truncated Gaussian distributionj which can be sampled using the inverse cdf method. The ratio π (p1, p2 x1 m ) /π2 (p1, p2 x1 m ) is upper bounded. j : j :
A.D.() 18 / 40 (Geweke, 1989) proposed using the normal approximation to the binomial distribution 2 π2 (p1, p2 x1 m ) ∝ exp (m1,1 + m1,2)(p1 p1) / (2p1 (1 p1)) j : 2 exp (m2,1 + m2,2)(p2 p2) / (2p2 (1 p2)) Ip1 +p2 1 b b b where p = m / (m + m ) , p = m / (m + m ). Then to 1 1,1 1,1 1,2 1 2,2 2b,2 2,1b b simulate from this distribution, we simulate …rst π2 (p1 x1 m ) and j : then πb2 (p2 x1:m, p1) which are univariateb truncated Gaussian distributionj which can be sampled using the inverse cdf method. The ratio π (p1, p2 x1:m ) /π2 (p1, p2 x1:m ) is upper bounded. A …nal one consistsj of using j
π3 (p1, p2 x1 m ) = e (p1; m1,1 + 1, m1,2 + 1) π3 (p2 x1 m, p1) j : B j : m2,1 m2,2 where π (p2 x1:m, p1) ∝ (1 p2) p2 Ip2 1 p1 is badly j 2 approximated through π3 (p2 x1:m, p1) = 2 p2Ip2 1 p1 . It is (1 p1 ) j straightforward to check that π (p1, p2 x1:m ) /π3 (p1, p2 x1:m ) ∝ m2,1 m2,2 2 j j (1 p2) p2 / 2 p2 < ∞. (1 p1 ) A.D.() 18 / 40 Sampling from π using rejection sampling works well but is computationally expensive. π3 is computationally much cheaper whereas π1 does extremely poorly as expected.
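The truncated-Gaussian step can be sketched with the inverse cdf method as follows (the mean, standard deviation and truncation interval are illustrative placeholders, not the values derived from the data):

```python
import math
import random

random.seed(4)

def norm_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def norm_ppf(p, lo=-10.0, hi=10.0):
    # Simple bisection inverse of norm_cdf; ample precision after 80 halvings
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if norm_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def truncated_gauss(mu, sigma, a, b):
    # Inverse cdf method: map a uniform on [F(a), F(b)] back through the cdf
    Fa = norm_cdf((a - mu) / sigma)
    Fb = norm_cdf((b - mu) / sigma)
    u = Fa + random.random() * (Fb - Fa)
    return mu + sigma * norm_ppf(u)

samples = [truncated_gauss(0.7, 0.1, 0.0, 1.0) for _ in range(10_000)]
print(sum(samples) / len(samples))
```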
Performance for N = 10, 000
Distribution ϕ1 ϕ2 ϕ3 ϕ4 ϕ5 π1 0.748 0.139 3.184 0.163 2.957 π2 0.689 0.210 2.319 0.283 2.211 π3 0.697 0.189 2.379 0.241 2.358 π 0.697 0.189 2.373 0.240 2.358
A.D.() 19 / 40 Performance for N = 10, 000
Distribution ϕ1 ϕ2 ϕ3 ϕ4 ϕ5 π1 0.748 0.139 3.184 0.163 2.957 π2 0.689 0.210 2.319 0.283 2.211 π3 0.697 0.189 2.379 0.241 2.358 π 0.697 0.189 2.373 0.240 2.358
Sampling from π using rejection sampling works well but is computationally expensive. π3 is computationally much cheaper whereas π1 does extremely poorly as expected.
Effective Sample Size

In statistics, we are usually not interested in a specific $\varphi$ but in several functions, and we prefer having $q(x)$ as close as possible to $\pi(x)$.

For flat functions, one can approximate the variance by

$$\mathbb{V}_{\widehat{\pi}_N}\big(\mathbb{E}(\varphi(X))\big) \approx \frac{\mathbb{V}_\pi(\varphi(X))}{N}\big(1 + \mathbb{V}_q(w(X))\big).$$

Simple interpretation: the $N$ weighted samples are approximately equivalent to $M$ unweighted samples from $\pi$, where

$$M = \frac{N}{1 + \mathbb{V}_q(w(X))} \leq N.$$
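A standard plug-in version of this quantity, computed from the realized unnormalized weights, is $\mathrm{ESS} = (\sum_i w_i)^2 / \sum_i w_i^2$:

```python
# Plug-in effective sample size computed from unnormalized importance weights.
def ess(weights):
    s = sum(weights)
    return s * s / sum(w * w for w in weights)

print(ess([1.0] * 100))          # equal weights (q = pi): ESS = N = 100
print(ess([1.0] + [1e-8] * 99))  # one dominant weight: ESS collapses towards 1
```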
Limitations of Importance Sampling

Consider the case where the target is defined on $\mathbb{R}^n$ and

$$\pi(x_{1:n}) = \prod_{k=1}^n \mathcal{N}(x_k; 0, 1), \quad \gamma(x_{1:n}) = \prod_{k=1}^n \exp\left(-\frac{x_k^2}{2}\right), \quad Z = (2\pi)^{n/2}.$$

We select an importance distribution

$$q(x_{1:n}) = \prod_{k=1}^n \mathcal{N}(x_k; 0, \sigma^2).$$

In this case, we have $\mathbb{V}_{IS}[\widehat{Z}] < \infty$ only for $\sigma^2 > \frac{1}{2}$, and

$$\frac{\mathbb{V}_{IS}[\widehat{Z}]}{Z^2} = \frac{1}{N}\left[\left(\frac{\sigma^4}{2\sigma^2-1}\right)^{n/2} - 1\right].$$

It can easily be checked that $\frac{\sigma^4}{2\sigma^2-1} > 1$ for any $\frac{1}{2} < \sigma^2 \neq 1$.

The variance increases exponentially with $n$ even in this simple case. For example, if we select $\sigma = 1.2$ (i.e. $\sigma^2 = 1.44$) then we have a reasonably good importance distribution, as $q(x_k) \approx \pi(x_k)$, but

$$\frac{N\,\mathbb{V}_{IS}[\widehat{Z}]}{Z^2} \approx (1.103)^{n/2},$$

which is approximately equal to $1.9 \times 10^{21}$ for $n = 1000$! We would need to use $N \approx 2 \times 10^{23}$ particles to obtain a relative variance $\frac{\mathbb{V}_{IS}[\widehat{Z}]}{Z^2} = 0.01$.
A.D.() 22 / 40 We want to know which strategy performs the best.
Importance Sampling versus Rejection Sampling
Given N samples from q, we estimate Eπ (ϕ (X )) through IS
N (i) (i) ∑i=1 w X ϕ X EIS (ϕ (X )) = πN N (i) ∑i=1 w X b or we “…lter” the samples through rejection and propose instead
1 K ERS (ϕ (X )) = ϕ X (ik ) πN ∑ K k=1 b where K N is a random variable corresponding to the number of samples accepted.
A.D.() 23 / 40 Importance Sampling versus Rejection Sampling
Given N samples from q, we estimate Eπ (ϕ (X )) through IS
N (i) (i) ∑i=1 w X ϕ X EIS (ϕ (X )) = πN N (i) ∑i=1 w X b or we “…lter” the samples through rejection and propose instead
1 K ERS (ϕ (X )) = ϕ X (ik ) πN ∑ K k=1 b where K N is a random variable corresponding to the number of samples accepted. We want to know which strategy performs the best.
Define the artificial target $\overline{\pi}(x,y)$ on $E\times[0,1]$, where $C$ is the rejection constant satisfying $\gamma(x) \leq C\,q(x)$, as

$$\overline{\pi}(x,y) = \begin{cases} \dfrac{C\,q(x)}{Z} & \text{for } (x,y): x \in E \text{ and } y \in \left[0, \dfrac{\gamma(x)}{C\,q(x)}\right], \\ 0 & \text{otherwise}; \end{cases}$$

then

$$\int \overline{\pi}(x,y)\,dy = \int_0^{\gamma(x)/(C q(x))} \frac{C\,q(x)}{Z}\,dy = \frac{\gamma(x)}{Z} = \pi(x).$$

Now let us consider the proposal distribution

$$\overline{q}(x,y) = q(x)\,\mathcal{U}_{[0,1]}(y) \quad\text{for } (x,y) \in E\times[0,1].$$
Then rejection sampling is nothing but IS on $E\times[0,1]$, where

$$w(x,y) \propto \frac{\overline{\pi}(x,y)}{q(x)\,\mathcal{U}_{[0,1]}(y)} = \begin{cases} \dfrac{C}{Z} & \text{for } y \in \left[0, \dfrac{\gamma(x)}{C\,q(x)}\right], \\ 0 & \text{otherwise.} \end{cases}$$

We have

$$\widehat{\mathbb{E}}^{RS}_{\pi_N}(\varphi(X)) = \frac{1}{K}\sum_{k=1}^K \varphi\big(X^{(i_k)}\big) = \frac{\sum_{i=1}^N w\big(X^{(i)}, Y^{(i)}\big)\,\varphi\big(X^{(i)}\big)}{\sum_{i=1}^N w\big(X^{(i)}, Y^{(i)}\big)}.$$

Compared to standard IS, RS performs IS on an enlarged space.
The variance of the importance weights from RS is higher than for standard IS: $\mathbb{V}[w(X,Y)] \geq \mathbb{V}[w(X)]$. More precisely, we have

$$\mathbb{V}[w(X,Y)] = \mathbb{V}\big[\mathbb{E}[w(X,Y)|X]\big] + \mathbb{E}\big[\mathbb{V}[w(X,Y)|X]\big] = \mathbb{V}[w(X)] + \mathbb{E}\big[\mathbb{V}[w(X,Y)|X]\big].$$

To compute integrals, RS is inefficient and you should simply use IS.
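This decomposition can be checked empirically on a toy example (the two-point target and proposal below are illustrative choices): the RS weight equals $C/Z$ on the acceptance region and 0 otherwise, and its variance dominates that of the IS weight.

```python
import random

random.seed(5)

# Target pi on {0,1} and proposal q; C chosen so that pi <= C q pointwise.
pi, q = (0.8, 0.2), (0.5, 0.5)
C = max(p / s for p, s in zip(pi, q))   # C = 1.6

N = 200_000
w_is, w_rs = [], []
for _ in range(N):
    x = 0 if random.random() < q[0] else 1
    w = pi[x] / q[x]                    # IS weight
    y = random.random()                 # auxiliary uniform coordinate
    w_is.append(w)
    w_rs.append(C if y <= w / C else 0.0)  # RS weight: constant on acceptance, else 0

def var(v):
    m = sum(v) / len(v)
    return sum((u - m) ** 2 for u in v) / len(v)

print(var(w_is), var(w_rs))             # var(w_rs) exceeds var(w_is)
```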
Introduction to Sequential Importance Sampling

Aim: design an IS method to approximate $\{\pi_n\}_{n\geq 1}$ sequentially and to compute $\{Z_n\}_{n\geq 1}$.

At time 1, assume we have approximated $\pi_1(x_1)$ and $Z_1$ using an IS density $q_1(x_1)$; that is,

$$\widehat{\pi}_1(dx_1) = \sum_{i=1}^N W_1^{(i)}\,\delta_{X_1^{(i)}}(dx_1) \quad\text{where}\quad W_1^{(i)} \propto w_1\big(X_1^{(i)}\big),$$
$$\widehat{Z}_1 = \frac{1}{N}\sum_{i=1}^N w_1\big(X_1^{(i)}\big),$$

with

$$w_1(x_1) = \frac{\gamma_1(x_1)}{q_1(x_1)}.$$
A.D.() 27 / 40 (i) We want to reuse the samples X1 from q1 (x1) use to build the IS approximation of π1 (x1) . Thisn onlyo makes sense if π2 (x1) π1 (x1). We select q2 (x1 2) = q1 (x1) q2 (x2 x1) : j (i) so that to obtain X1:2 q2 (x1:2) we only need to sample (i) (i) (i) X X q2 x2 X . 2 1 j 1
At time 2, we want to approximate π2 (x1:2) and Z2 using an IS density q2 (x1:2) .
A.D.() 28 / 40 We select q2 (x1 2) = q1 (x1) q2 (x2 x1) : j (i) so that to obtain X1:2 q2 (x1:2) we only need to sample (i) (i) (i) X X q2 x2 X . 2 1 j 1
At time 2, we want to approximate π2 (x1:2) and Z2 using an IS density q2 (x1:2) . (i) We want to reuse the samples X1 from q1 (x1) use to build the IS approximation of π1 (x1) . Thisn onlyo makes sense if π2 (x1) π1 (x1).
A.D.() 28 / 40 At time 2, we want to approximate π2 (x1:2) and Z2 using an IS density q2 (x1:2) . (i) We want to reuse the samples X1 from q1 (x1) use to build the IS approximation of π1 (x1) . Thisn onlyo makes sense if π2 (x1) π1 (x1). We select q2 (x1 2) = q1 (x1) q2 (x2 x1) : j (i) so that to obtain X1:2 q2 (x1:2) we only need to sample (i) (i) (i) X X q2 x2 X . 2 1 j 1
A.D.() 28 / 40 For the normalized weights
(i) γ X (i) (i) 2 1:2 W ∝ W 2 1 (i) (i) (i) γ1 X1 q2 X2 X1
Updating the IS approximation
We have to compute the weights
γ2 (x1:2) γ2 (x1:2) w2 (x1:2) = = q2 (x1 2) q1 (x1) q2 (x2 x1) : j γ (x ) γ (x ) = 1 1 2 1:2 q1 (x1) γ (x1) q2 (x2 x1) 1 j γ2 (x1:2) = w1 (x1) γ1 (x1) q2 (x2 x1) previous weight j incremental weigh | {z } | {z }
A.D.() 29 / 40 Updating the IS approximation
We have to compute the weights
γ2 (x1:2) γ2 (x1:2) w2 (x1:2) = = q2 (x1 2) q1 (x1) q2 (x2 x1) : j γ (x ) γ (x ) = 1 1 2 1:2 q1 (x1) γ (x1) q2 (x2 x1) 1 j γ2 (x1:2) = w1 (x1) γ1 (x1) q2 (x2 x1) previous weight j incremental weigh | {z } For the normalized weights | {z }
(i) γ X (i) (i) 2 1:2 W ∝ W 2 1 (i) (i) (i) γ1 X1 q2 X2 X1
A.D.() 29 / 40 The importance weights are updated according to
γn (x1:n) γn (x1:n) wn (x1:n) = = wn 1 (x1:n 1) qn (x1:n) γn 1 (x1:n 1) qn (xn x1:n 1) j
Generally speaking, we use at time n
qn (x1:n) = qn 1 (x1:n 1) qn (xn x1:n 1) j = q1 (x1) q2 (x2 x1) qn (xn x1:n 1) j j (i) so if X1:n 1 qn 1 (x1:n 1) then we only need to sample (i) (i) ( i) Xn Xn 1 qn xn X1:n 1 . j
A.D.() 30 / 40 Generally speaking, we use at time n
qn (x1:n) = qn 1 (x1:n 1) qn (xn x1:n 1) j = q1 (x1) q2 (x2 x1) qn (xn x1:n 1) j j (i) so if X1:n 1 qn 1 (x1:n 1) then we only need to sample (i) (i) ( i) Xn Xn 1 qn xn X1:n 1 . j The importance weights are updated according to
γn (x1:n) γn (x1:n) wn (x1:n) = = wn 1 (x1:n 1) qn (x1:n) γn 1 (x1:n 1) qn (xn x1:n 1) j
A.D.() 30 / 40 (i) (i) sample Xn qn X j 1:n 1 (i) γ X compute w X (i) = w X (i) n 1:n . n 1:n n 1 1:n 1 (i) (i) (i) γn 1 X1:n 1 qn Xn X1:n 1
At time n 2
At any time n, we have
(i) γ X (i) (i) n 1:n X1:n qn (x1:n) , wn X1:n = q X (i) n 1:n thus we can obtain easily an IS approximation of πn (x1:n) and of Zn.
Sequential Importance Sampling
(i) (i) (i) γ1 X1 At time n = 1, sample X1 q1 ( ) and set w1 X1 = (i) . q1 X1
A.D.() 31 / 40 (i) (i) sample Xn qn X j 1:n 1 (i) γ X compute w X (i) = w X (i) n 1:n . n 1:n n 1 1:n 1 (i) (i) (i) γn 1 X1:n 1 qn Xn X1:n 1
At any time n, we have
(i) γ X (i) (i) n 1:n X1:n qn (x1:n) , wn X1:n = q X (i) n 1:n thus we can obtain easily an IS approximation of πn (x1:n) and of Zn.
Sequential Importance Sampling
(i) (i) (i) γ1 X1 At time n = 1, sample X1 q1 ( ) and set w1 X1 = (i) . q1 X1 At time n 2
A.D.() 31 / 40 (i) γ X compute w X (i) = w X (i) n 1:n . n 1:n n 1 1:n 1 (i) (i) (i) γn 1 X1:n 1 qn Xn X1:n 1
At any time n, we have
(i) γ X (i) (i) n 1:n X1:n qn (x1:n) , wn X1:n = q X (i) n 1:n thus we can obtain easily an IS approximation of πn (x1:n) and of Zn.
Sequential Importance Sampling
(i) (i) (i) γ1 X1 At time n = 1, sample X1 q1 ( ) and set w1 X1 = (i) . q1 X1 At time n 2 (i) (i) sample Xn qn X1:n 1 j
A.D.() 31 / 40 At any time n, we have
(i) γ X (i) (i) n 1:n X1:n qn (x1:n) , wn X1:n = q X (i) n 1:n thus we can obtain easily an IS approximation of πn (x1:n) and of Zn.
Sequential Importance Sampling
(i) (i) (i) γ1 X1 At time n = 1, sample X1 q1 ( ) and set w1 X1 = (i) . q1 X1 At time n 2 (i) (i) sample Xn qn X j 1:n 1 (i) γ X compute w X (i) = w X (i) n 1:n . n 1:n n 1 1:n 1 (i) (i) (i) γn 1 X1:n 1 qn Xn X1:n 1
A.D.() 31 / 40 Sequential Importance Sampling
(i) (i) (i) γ1 X1 At time n = 1, sample X1 q1 ( ) and set w1 X1 = (i) . q1 X1 At time n 2 (i) (i) sample Xn qn X j 1:n 1 (i) γ X compute w X (i) = w X (i) n 1:n . n 1:n n 1 1:n 1 (i) (i) (i) γn 1 X1:n 1 qn Xn X1:n 1
At any time n, we have
(i) γ X (i) (i) n 1:n X1:n qn (x1:n) , wn X1:n = q X (i) n 1:n thus we can obtain easily an IS approximation of πn (x1:n) and of Zn.
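The algorithm can be sketched on the i.i.d. Gaussian example from the previous slides ($\gamma_n(x_{1:n}) = \prod_k e^{-x_k^2/2}$ and $q_n(x_n|x_{1:n-1}) = \mathcal{N}(0,\sigma^2)$), accumulating log-weights for numerical stability:

```python
import math
import random

random.seed(6)

sigma = 1.2           # proposal std, so sigma^2 = 1.44 as in the earlier example
N, n_steps = 10_000, 10

def q_pdf(x):
    return math.exp(-x * x / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

log_w = [0.0] * N     # log-weights, updated sequentially
for _ in range(n_steps):
    for i in range(N):
        x = random.gauss(0.0, sigma)              # extend path i by one coordinate
        log_w[i] += -0.5 * x * x - math.log(q_pdf(x))  # incremental log-weight

Z_hat = sum(math.exp(lw) for lw in log_w) / N
Z_true = (2 * math.pi) ** (n_steps / 2)
print(Z_hat / Z_true)  # close to 1 for small n; the relative error grows with n_steps
```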
A.D.() 31 / 40 Assume we receive y1:n, we are interested in sampling from
p (x1:n, y1:n) πn (x1:n) = p (x1:n y1:n) = j p (y1:n)
and estimating p (y1:n) where n n γn (x1:n) = p (x1:n, y1:n) = µ (x1) ∏ f (xk xk 1) ∏ g (yk xk ) , k=2 j k=1 j n n Zn = p (y1:n) = µ (x1) f (xk xk 1) g (yk xk ) dx1:n. ∏ ∏ Z Z k=2 j k=1 j
Sequential Importance Sampling for State-Space Models
State-space models
Hidden Markov process: X1 µ, Xk (Xk 1 = xk 1) f ( xk 1) j j Observation process: Yk (Xk = xk ) g ( xk ) j j
A.D.() 32 / 40 Sequential Importance Sampling for State-Space Models
State-space models
Hidden Markov process: X1 µ, Xk (Xk 1 = xk 1) f ( xk 1) j j Observation process: Yk (Xk = xk ) g ( xk ) j j Assume we receive y1:n, we are interested in sampling from
p (x1:n, y1:n) πn (x1:n) = p (x1:n y1:n) = j p (y1:n)
and estimating p (y1:n) where n n γn (x1:n) = p (x1:n, y1:n) = µ (x1) ∏ f (xk xk 1) ∏ g (yk xk ) , k=2 j k=1 j n n Zn = p (y1:n) = µ (x1) f (xk xk 1) g (yk xk ) dx1:n. ∏ ∏ Z Z k=2 j k=1 j
We can select q_1(x_1) = μ(x_1) and q_n(x_n | x_{1:n-1}) = q_n(x_n | x_{n-1}) = f(x_n | x_{n-1}).

At time n = 1, sample X_1^{(i)} ~ μ(·) and set w_1(X_1^{(i)}) = g(y_1 | X_1^{(i)}).

At time n ≥ 2,

sample X_n^{(i)} ~ f(· | X_{n-1}^{(i)}),

compute w_n(X_{1:n}^{(i)}) = w_{n-1}(X_{1:n-1}^{(i)}) g(y_n | X_n^{(i)}).

At any time n, we have

X_{1:n}^{(i)} ~ μ(x_1) ∏_{k=2}^{n} f(x_k | x_{k-1}),  w_n(X_{1:n}^{(i)}) = ∏_{k=1}^{n} g(y_k | X_k^{(i)}),

thus we can easily obtain an IS approximation of p(x_{1:n} | y_{1:n}) and of p(y_{1:n}).
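The prior-proposal scheme above can be sketched as follows; the model, its parameters, and the observation sequence below are assumptions made for illustration, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical linear Gaussian model, for illustration only:
# mu = N(0, 1), f(x_n | x_{n-1}) = N(x_n; alpha x_{n-1}, 1), g(y_n | x_n) = N(y_n; x_n, sw2).
alpha, sw2 = 0.9, 1.0

def log_g(yn, x):
    return -0.5 * (yn - x) ** 2 / sw2 - 0.5 * np.log(2 * np.pi * sw2)

def sis_prior(y, n_particles):
    """SIS with the prior proposal: the weight is the running product of likelihoods."""
    x = rng.standard_normal(n_particles)                   # X_1^(i) ~ mu
    log_w = log_g(y[0], x)                                 # w_1 = g(y_1 | X_1)
    for yn in y[1:]:
        x = alpha * x + rng.standard_normal(n_particles)   # X_n^(i) ~ f(. | X_{n-1}^(i))
        log_w += log_g(yn, x)                              # w_n = w_{n-1} g(y_n | X_n)
    return x, log_w

y = np.array([0.3, -0.1, 0.5, 0.2, -0.4])   # made-up observations
x_n, log_w = sis_prior(y, n_particles=20_000)
p_y_hat = np.mean(np.exp(log_w))            # IS estimate of p(y_{1:n})
```

In this linear Gaussian case the exact p(y_{1:n}) is available from the Kalman filter, which makes the model convenient for checking the estimate.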
A.D.() 33 / 40

Application to Stochastic Volatility Model
[Figure: Histograms of the base-10 logarithm of the importance weights W_n^{(i)} for n = 1 (top), n = 50 (middle) and n = 100 (bottom); the log10 weights range from about -25 to 0.]
The algorithm's performance collapses as n increases: after a few time steps, only a very small number of particles have non-negligible weights.

A.D.() 34 / 40
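This collapse is easy to reproduce. A sketch (mine) using a linear Gaussian stand-in for the stochastic volatility model (whose exact dynamics are not specified here), tracking the effective sample size ESS = (Σ_i w_n^{(i)})² / Σ_i (w_n^{(i)})²; all parameter values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Linear Gaussian stand-in (hypothetical parameters) for the degeneracy effect:
# run SIS with the prior proposal and track the effective sample size over time.
alpha, sigma_w = 0.9, 0.5
N, T = 1000, 100

# simulate a data set from the model
x_true = np.empty(T)
x_true[0] = rng.standard_normal()
for k in range(1, T):
    x_true[k] = alpha * x_true[k - 1] + rng.standard_normal()
y = x_true + sigma_w * rng.standard_normal(T)

x = rng.standard_normal(N)
log_w = np.zeros(N)
ess = []
for n in range(T):
    if n > 0:
        x = alpha * x + rng.standard_normal(N)
    log_w += -0.5 * (y[n] - x) ** 2 / sigma_w**2   # additive constants cancel in ESS
    w = np.exp(log_w - log_w.max())                # stabilized weights
    ess.append(w.sum() ** 2 / np.sum(w**2))
# ess[0] is typically a sizeable fraction of N, while ess[-1] is close to 1:
# without resampling, almost all mass ends up on a handful of particles.
```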
Structure of the Optimal Distribution

The optimal zero-variance density at time n is simply given by

q_n(x_{1:n}) = π_n(x_{1:n}).

As we have

π_n(x_{1:n}) = π_n(x_1) π_n(x_2 | x_1) ··· π_n(x_n | x_{1:n-1}),

where π_n(x_k | x_{1:k-1}) ∝ γ_n(x_k | x_{1:k-1}), it means that we have

q_k^opt(x_k | x_{1:k-1}) = π_n(x_k | x_{1:k-1}).

Obviously this result does depend on n, so it is only useful if we are interested in one specific target π_n(x_{1:n}); in such scenarios we typically need to approximate π_n(x_k | x_{1:k-1}).
A.D.() 35 / 40
Locally Optimal Importance Distribution

One sensible strategy consists of selecting q_n(x_n | x_{1:n-1}) at time n so as to minimize the variance of the importance weights.

We have for the importance weight

w_n(x_{1:n}) = γ_n(x_{1:n}) / [q_{n-1}(x_{1:n-1}) q_n(x_n | x_{1:n-1})]
             = Z_n π_n(x_{1:n-1}) π_n(x_n | x_{1:n-1}) / [q_{n-1}(x_{1:n-1}) q_n(x_n | x_{1:n-1})].

It follows directly that we have

q_n^opt(x_n | x_{1:n-1}) = π_n(x_n | x_{1:n-1})

and

w_n(x_{1:n}) = w_{n-1}(x_{1:n-1}) γ_n(x_{1:n}) / [γ_{n-1}(x_{1:n-1}) π_n(x_n | x_{1:n-1})]
             = w_{n-1}(x_{1:n-1}) γ_n(x_{1:n-1}) / γ_{n-1}(x_{1:n-1}).

A.D.() 36 / 40
This locally optimal importance density will be used again and again.

It is often impossible to sample from π_n(x_n | x_{1:n-1}) and/or to compute

γ_n(x_{1:n-1}) = ∫ γ_n(x_{1:n}) dx_n.

In such cases, it is necessary to approximate π_n(x_n | x_{1:n-1}) and γ_n(x_{1:n-1}).
A.D.() 37 / 40
Application to State-Space Models

In the case of state-space models, we have

q_n^opt(x_n | x_{1:n-1}) = p(x_n | y_{1:n}, x_{1:n-1}) = p(x_n | y_n, x_{n-1})
                         = g(y_n | x_n) f(x_n | x_{n-1}) / p(y_n | x_{n-1}).

In this case,

w_n(x_{1:n}) = w_{n-1}(x_{1:n-1}) p(x_{1:n}, y_{1:n}) / [p(x_{1:n-1}, y_{1:n-1}) p(x_n | y_n, x_{n-1})]
             = w_{n-1}(x_{1:n-1}) p(y_n | x_{n-1}).

Example: Consider f(x_n | x_{n-1}) = N(x_n; α(x_{n-1}), β(x_{n-1})) and g(y_n | x_n) = N(y_n; x_n, σ_w²); then p(x_n | y_n, x_{n-1}) = N(x_n; m(x_{n-1}), σ²(x_{n-1})) with

σ²(x_{n-1}) = β(x_{n-1}) σ_w² / (β(x_{n-1}) + σ_w²),  m(x_{n-1}) = σ²(x_{n-1}) [α(x_{n-1}) / β(x_{n-1}) + y_n / σ_w²].
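The Gaussian conjugacy step in the example can be checked numerically: multiplying g(y_n | x_n) f(x_n | x_{n-1}) and normalizing recovers the stated mean and variance, and the normalizer is the incremental weight p(y_n | x_{n-1}) = N(y_n; α(x_{n-1}), β(x_{n-1}) + σ_w²). The specific values of α(x_{n-1}), β(x_{n-1}), σ_w² and y_n below are arbitrary.

```python
import numpy as np

# Numerical check of the Gaussian conjugacy step (illustrative values).
a, b = 0.7, 1.3    # alpha(x_{n-1}), beta(x_{n-1}) at some fixed x_{n-1}
sw2 = 0.6          # sigma_w^2
yn = 0.9

# closed form from the slide
s2 = b * sw2 / (b + sw2)
m = s2 * (a / b + yn / sw2)

# numerical posterior of x_n proportional to g(y_n | x_n) f(x_n | x_{n-1})
xs = np.linspace(-10.0, 10.0, 200_001)
dx = xs[1] - xs[0]
joint = (np.exp(-0.5 * (xs - a) ** 2 / b) / np.sqrt(2 * np.pi * b)
         * np.exp(-0.5 * (yn - xs) ** 2 / sw2) / np.sqrt(2 * np.pi * sw2))
p_y = joint.sum() * dx        # normalizer: p(y_n | x_{n-1})
dens = joint / p_y
m_num = (xs * dens).sum() * dx
s2_num = ((xs - m_num) ** 2 * dens).sum() * dx
```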
A.D.() 38 / 40
Application to Linear Gaussian State-Space Models

Consider the simple model

X_n = α X_{n-1} + V_n,  Y_n = X_n + σ_w W_n,

where X_1 ~ N(0, 1), V_n ~ N(0, 1) i.i.d., W_n ~ N(0, 1) i.i.d.

We use the prior proposal

q_n(x_n | x_{1:n-1}) = f(x_n | x_{n-1}) = N(x_n; α x_{n-1}, 1)

and the locally optimal proposal

q_n^opt(x_n | x_{1:n-1}) = p(x_n | y_n, x_{n-1}) = N(x_n; (σ_w² / (σ_w² + 1)) (α x_{n-1} + y_n / σ_w²), σ_w² / (σ_w² + 1)).
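The benefit of the locally optimal proposal can be seen on a single time step: both incremental weights have the same expectation, but the optimal one depends only on x_{n-1} and so varies strictly less (a Rao-Blackwell argument). A sketch with assumed parameter values and an arbitrary observation:

```python
import numpy as np

rng = np.random.default_rng(4)

# One-step comparison of the two proposals' incremental importance weights
# for X_n = alpha X_{n-1} + V_n, Y_n = X_n + sigma_w W_n with sigma_w^2 = 1.
alpha, sw2 = 0.9, 1.0
N = 100_000
yn = 1.0                          # a fixed observation, for illustration
x_prev = rng.standard_normal(N)   # particles standing in for the previous-step sample

# prior proposal: x_n ~ N(alpha x_prev, 1), incremental weight g(y_n | x_n)
x_prior = alpha * x_prev + rng.standard_normal(N)
w_prior = np.exp(-0.5 * (yn - x_prior) ** 2 / sw2) / np.sqrt(2 * np.pi * sw2)

# locally optimal proposal: incremental weight p(y_n | x_prev) = N(y_n; alpha x_prev, 1 + sw2)
w_opt = np.exp(-0.5 * (yn - alpha * x_prev) ** 2 / (1 + sw2)) / np.sqrt(2 * np.pi * (1 + sw2))

# Same mean, strictly smaller spread for the locally optimal weights.
```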
A.D.() 39 / 40

Summary

Sequential Importance Sampling is an attractive idea: it is sequential and parallelizable, and only requires designing low-dimensional proposal distributions.

Sequential Importance Sampling can only work for moderate-size problems.

Is there a way to partially fix this problem?

A.D.() 40 / 40