
Importance Sampling & Sequential Importance Sampling

Arnaud Doucet, Departments of Statistics & Computer Science, University of British Columbia


Generic Problem

Consider a sequence of probability distributions $\{\pi_n\}_{n\geq 1}$ defined on a sequence of (measurable) spaces $(E_n, \mathcal{F}_n)_{n\geq 1}$ where $E_1 = E$, $\mathcal{F}_1 = \mathcal{F}$ and $E_n = E_{n-1} \times E$, $\mathcal{F}_n = \mathcal{F}_{n-1} \times \mathcal{F}$.

Each distribution $\pi_n(dx_{1:n}) = \pi_n(x_{1:n})\, dx_{1:n}$ is known up to a normalizing constant, i.e.
$$\pi_n(x_{1:n}) = \frac{\gamma_n(x_{1:n})}{Z_n}.$$

We want to estimate expectations of test functions $\varphi_n : E_n \to \mathbb{R}$,
$$\mathbb{E}_{\pi_n}(\varphi_n) = \int \varphi_n(x_{1:n})\, \pi_n(dx_{1:n}),$$
and/or the normalizing constants $Z_n$.

We want to do this sequentially; i.e. first $\pi_1$ and/or $Z_1$ at time 1, then $\pi_2$ and/or $Z_2$ at time 2, and so on.


Using Monte Carlo Methods

Problem 1: For most problems of interest, we cannot sample from $\pi_n(x_{1:n})$. A standard approach to sampling from high-dimensional distributions is to use iterative Markov chain Monte Carlo algorithms, but this is not appropriate in our context.

Problem 2: Even if we could sample exactly from $\pi_n(x_{1:n})$, the computational complexity of the algorithm would most likely increase with $n$, whereas we typically want an algorithm of fixed computational complexity at each time step.

Summary: We cannot use standard MC sampling in our case and, even if we could, it would not solve our problem.


Plan of the Lectures

Review of Importance Sampling.
Sequential Importance Sampling.
Applications.


Importance Sampling

Importance Sampling (IS) identity. For any distribution $q$ such that $\pi(x) > 0 \Rightarrow q(x) > 0$,
$$\pi(x) = \frac{w(x)\, q(x)}{\int w(x)\, q(x)\, dx} \quad \text{where } w(x) = \frac{\gamma(x)}{q(x)},$$
where $q$ is called the importance distribution and $w$ the importance weight.

$q$ can be chosen arbitrarily; in particular, it can be chosen easy to sample from:
$$X^{(i)} \overset{\text{i.i.d.}}{\sim} q(\cdot) \;\Rightarrow\; \widehat{q}(dx) = \frac{1}{N}\sum_{i=1}^{N} \delta_{X^{(i)}}(dx).$$


Plugging this expression into the IS identity,
$$\widehat{\pi}(dx) = \sum_{i=1}^{N} W^{(i)} \delta_{X^{(i)}}(dx) \quad \text{where } W^{(i)} \propto w\big(X^{(i)}\big), \qquad \widehat{Z} = \frac{1}{N}\sum_{i=1}^{N} w\big(X^{(i)}\big).$$

$\pi(x)$ is now approximated by a weighted sum of delta-masses $\Rightarrow$ the weights compensate for the discrepancy between $\pi$ and $q$.

A.D.() 6 / 40 The varianec of the weights is bounded if and only if

γ2 (x) dx < ∞. q (x) Z In practice, try to ensure

γ (x) w (x) = < ∞. q (x)

Note that in this case, could be used to sample from π (x) .

Practical recommendations

Select q as close to π as possible.

A.D.() 7 / 40 In practice, try to ensure

γ (x) w (x) = < ∞. q (x)

Note that in this case, rejection sampling could be used to sample from π (x) .

Practical recommendations

Select q as close to π as possible. The varianec of the weights is bounded if and only if

γ2 (x) dx < ∞. q (x) Z

Practical recommendations

Select $q$ as close to $\pi$ as possible.

The variance of the weights is bounded if and only if
$$\int \frac{\gamma^2(x)}{q(x)}\, dx < \infty.$$

In practice, try to ensure
$$w(x) = \frac{\gamma(x)}{q(x)} < \infty.$$

Note that in this case, rejection sampling could be used to sample from $\pi(x)$.

A.D.() 7 / 40 Example

[Figure: Target double exponential distribution and two IS distributions. Panels: double exponential (target), Gaussian, t-Student (heavy tailed).]


[Figure: IS approximation obtained using a Gaussian IS distribution (values of the weights against values of x).]


[Figure: IS approximation obtained using a Student-t IS distribution (values of the weights against values of x).]

A.D.() 10 / 40 Γ(1) x 1 We use q (x) = (x), q (x) = 1 + (Cauchy 1 π 2 pνπΓ(1/2) νσ ν distribution) and q3 (x) = x; 0, (variance chosen to match N ν 2  the variance of π (x))  It is easy to see that

π (x) π (x)2 π (x) < ∞ and dx = ∞, is unbounded q (x) q (x) q (x) 1 Z 3 3

We try to compute x 2 π (x) dx 1 x Z   where Γ ((ν + 1) /2) x (ν+1)/2 π (x) = 1 + pνπΓ (ν/2) ν   is a t-student distribution with ν > 1 (you can sample from it by composition (0, 1) / a (ν/2, ν/2)) using Monte Carlo. N G

A.D.() 11 / 40 It is easy to see that

π (x) π (x)2 π (x) < ∞ and dx = ∞, is unbounded q (x) q (x) q (x) 1 Z 3 3

We try to compute x 2 π (x) dx 1 x Z   where Γ ((ν + 1) /2) x (ν+1)/2 π (x) = 1 + pνπΓ (ν/2) ν   is a t-student distribution with ν > 1 (you can sample from it by composition (0, 1) / a (ν/2, ν/2)) using Monte Carlo. N G Γ(1) x 1 We use q (x) = (x), q (x) = 1 + (Cauchy 1 π 2 pνπΓ(1/2) νσ ν distribution) and q3 (x) = x; 0, (variance chosen to match N ν 2  the variance of π (x)) 

A.D.() 11 / 40 We try to compute x 2 π (x) dx 1 x Z   where Γ ((ν + 1) /2) x (ν+1)/2 π (x) = 1 + pνπΓ (ν/2) ν   is a t-student distribution with ν > 1 (you can sample from it by composition (0, 1) / a (ν/2, ν/2)) using Monte Carlo. N G Γ(1) x 1 We use q (x) = (x), q (x) = 1 + (Cauchy 1 π 2 pνπΓ(1/2) νσ ν distribution) and q3 (x) = x; 0, (variance chosen to match N ν 2  the variance of π (x))  It is easy to see that

π (x) π (x)2 π (x) < ∞ and dx = ∞, is unbounded q (x) q (x) q (x) 1 Z 3 3

A.D.() 11 / 40 Figure: Performance for ν = 12 with q1 (solid line), q2 (dashes) and q3 (light dots). Final values 1.14, 1.14 and 1.16 vs true value 1.13


A.D.() 13 / 40 We now try to compute

$$\int_{2.1}^{\infty} x^5\, \pi(x)\, dx.$$

We try the same importance distributions but also use the fact that, with the change of variables $u = 1/x$, we have
$$\int_{2.1}^{\infty} x^5\, \pi(x)\, dx = \int_0^{1/2.1} u^{-7}\, \pi(1/u)\, du = \frac{1}{2.1}\int_0^{1/2.1} 2.1\, u^{-7}\, \pi(1/u)\, du,$$
which is the expectation of $\frac{1}{2.1}\, u^{-7}\, \pi(1/u)$ with respect to $\mathcal{U}[0, 1/2.1]$.

A.D.() 13 / 40 Figure: Performance for ν = 12 with q1 (solid), q2 (short dashes), q3 (dots), uniform (long dashes). Final values 6.75, 6.48, 7.06 and 6.48 vs true value 6.54

A.D.() 14 / 40 The posterior distribution is given by

π(θ)f ( x θ) ( x) = j ( x) π θ π(θ)f ( x θ)d θ ∝ γ θ j Θ j j where γ ( θ x) = π (θ) f (x θ) . jR j We can use the prior distribution as a candidate distribution q (θ) = π (θ). We also get an estimate of the marginal likelihood

π (θ) f (x θ) dθ. ZΘ j

Application to Bayesian Statistics

Consider a Bayesian model: prior π (θ) and likelihood f (x θ) . j

A.D.() 15 / 40 We can use the prior distribution as a candidate distribution q (θ) = π (θ). We also get an estimate of the marginal likelihood

π (θ) f (x θ) dθ. ZΘ j

Application to Bayesian Statistics

Consider a Bayesian model: prior π (θ) and likelihood f (x θ) . j The posterior distribution is given by

π(θ)f ( x θ) ( x) = j ( x) π θ π(θ)f ( x θ)d θ ∝ γ θ j Θ j j where γ ( θ x) = π (θ) f (x θ) . jR j

A.D.() 15 / 40 We also get an estimate of the marginal likelihood

π (θ) f (x θ) dθ. ZΘ j

Application to Bayesian Statistics

Consider a Bayesian model: prior π (θ) and likelihood f (x θ) . j The posterior distribution is given by

π(θ)f ( x θ) ( x) = j ( x) π θ π(θ)f ( x θ)d θ ∝ γ θ j Θ j j where γ ( θ x) = π (θ) f (x θ) . jR j We can use the prior distribution as a candidate distribution q (θ) = π (θ).

A.D.() 15 / 40 Application to Bayesian Statistics

Consider a Bayesian model: prior $\pi(\theta)$ and likelihood $f(x \mid \theta)$.

The posterior distribution is given by
$$\pi(\theta \mid x) = \frac{\pi(\theta)\, f(x \mid \theta)}{\int_\Theta \pi(\theta)\, f(x \mid \theta)\, d\theta} \propto \gamma(\theta \mid x) \quad \text{where } \gamma(\theta \mid x) = \pi(\theta)\, f(x \mid \theta).$$

We can use the prior distribution as a candidate distribution, $q(\theta) = \pi(\theta)$. We also get an estimate of the marginal likelihood
$$\int_\Theta \pi(\theta)\, f(x \mid \theta)\, d\theta.$$
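
A minimal sketch of prior-as-proposal IS, using a made-up conjugate model (Gaussian prior, Gaussian likelihood, a single observation) so that the IS estimates can be checked against closed forms; the model and numbers are illustrative assumptions, not from the slides.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Illustrative model: theta ~ N(0, 1) prior, x | theta ~ N(theta, 0.5^2), one observation x = 1.3.
x_obs, sigma = 1.3, 0.5
prior = stats.norm(0.0, 1.0)
lik = lambda theta: stats.norm(theta, sigma).pdf(x_obs)

N = 100_000
theta = prior.rvs(size=N, random_state=rng)    # candidate distribution q(theta) = prior
w = lik(theta)                                 # weights reduce to the likelihood f(x | theta)

Z_hat = w.mean()                               # IS estimate of the marginal likelihood p(x)
Z_true = stats.norm(0.0, np.sqrt(1.0 + sigma**2)).pdf(x_obs)
post_mean_hat = np.sum(w * theta) / np.sum(w)  # self-normalized posterior mean estimate

print(Z_hat, Z_true)
print(post_mean_hat, x_obs / (1.0 + sigma**2))  # exact posterior mean for this conjugate model
```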

A.D.() 15 / 40 Assume we observe x1, ..., xm and the prior is

π (p1, p2) = 2Ip1 +p2 1  then the posterior is

m1,1 m1,2 m2,1 m2,2 π (p1, p2 x1:m ) ∝ p1 (1 p1) (1 p2) p2 Ip1 +p2 1 j  where m 1 m I I i,j = ∑ xt =i xt+1 =i t=1 The posterior does not admit a standard expression and its normalizing constant is unknown. We can sample from it using rejection sampling.

Example: Application to Bayesian analysis of Markov chain. Consider a two state Markov chain with transition matrix F

p1 1 p1 1 p2 p2   that is Pr (Xt+1 = 1 Xt = 1) = 1 Pr (Xt+1 = 2 Xt = 1) = p1 and j j Pr (Xt+1 = 2 Xt = 2) = 1 Pr (Xt+1 = 1 Xt = 2) = p2. Physical j j constraints tell us that p1 + p2 < 1.

A.D.() 16 / 40 The posterior does not admit a standard expression and its normalizing constant is unknown. We can sample from it using rejection sampling.

Example: Application to Bayesian analysis of Markov chain. Consider a two state Markov chain with transition matrix F

p1 1 p1 1 p2 p2   that is Pr (Xt+1 = 1 Xt = 1) = 1 Pr (Xt+1 = 2 Xt = 1) = p1 and j j Pr (Xt+1 = 2 Xt = 2) = 1 Pr (Xt+1 = 1 Xt = 2) = p2. Physical j j constraints tell us that p1 + p2 < 1. Assume we observe x1, ..., xm and the prior is

π (p1, p2) = 2Ip1 +p2 1  then the posterior is

m1,1 m1,2 m2,1 m2,2 π (p1, p2 x1:m ) ∝ p1 (1 p1) (1 p2) p2 Ip1 +p2 1 j  where m 1 m I I i,j = ∑ xt =i xt+1 =i t=1

A.D.() 16 / 40 Example: Application to Bayesian analysis of Markov chain. Consider a two state Markov chain with transition matrix F

$$F = \begin{pmatrix} p_1 & 1-p_1 \\ 1-p_2 & p_2 \end{pmatrix},$$
that is, $\Pr(X_{t+1}=1 \mid X_t=1) = 1 - \Pr(X_{t+1}=2 \mid X_t=1) = p_1$ and $\Pr(X_{t+1}=2 \mid X_t=2) = 1 - \Pr(X_{t+1}=1 \mid X_t=2) = p_2$. Physical constraints tell us that $p_1 + p_2 < 1$.

Assume we observe $x_1, \ldots, x_m$ and the prior is
$$\pi(p_1, p_2) = 2\, \mathbb{I}_{p_1 + p_2 \leq 1};$$
then the posterior is
$$\pi(p_1, p_2 \mid x_{1:m}) \propto p_1^{m_{1,1}} (1-p_1)^{m_{1,2}} (1-p_2)^{m_{2,1}} p_2^{m_{2,2}}\, \mathbb{I}_{p_1 + p_2 \leq 1},$$
where
$$m_{i,j} = \sum_{t=1}^{m-1} \mathbb{I}_{x_t = i}\, \mathbb{I}_{x_{t+1} = j}.$$

The posterior does not admit a standard expression and its normalizing constant is unknown. We can sample from it using rejection sampling.


A.D.() 17 / 40 We are interested in estimating E [ ϕ (p1, p2) x1 m ] for i j : ϕ1 (p1, p2) = p1, ϕ2 (p1, p2) = p2, ϕ3 (p1, p2) = p1/ (1 p1), p1 (1 p2 ) ϕ (p1, p2) = p2/ (1 p2) and ϕ (p1, p2) = log using 4 5 p2 (1 p1 ) Importance Sampling.

If there were no constraint $p_1 + p_2 < 1$ and $\pi(p_1,p_2)$ were uniform on $[0,1] \times [0,1]$, then the posterior would be
$$\pi_0(p_1, p_2 \mid x_{1:m}) = \mathcal{B}e(p_1;\, m_{1,1}+1,\, m_{1,2}+1)\; \mathcal{B}e(p_2;\, m_{2,2}+1,\, m_{2,1}+1),$$
but this is inefficient as, for the given data $(m_{1,1}, m_{1,2}, m_{2,2}, m_{2,1})$, we have $\pi_0(p_1 + p_2 < 1 \mid x_{1:m}) = 0.21$.

The form of the posterior suggests using a Dirichlet distribution with density
$$\pi_1(p_1, p_2 \mid x_{1:m}) \propto p_1^{m_{1,1}}\, p_2^{m_{2,2}}\, (1 - p_1 - p_2)^{m_{1,2} + m_{2,1}},$$
but $\pi(p_1, p_2 \mid x_{1:m}) / \pi_1(p_1, p_2 \mid x_{1:m})$ is unbounded.


A.D.() 18 / 40 (Geweke, 1989) proposed using the normal approximation to the binomial distribution 2 π2 (p1, p2 x1 m ) ∝ exp (m1,1 + m1,2)(p1 p1) / (2p1 (1 p1)) j :  2  exp (m2,1 + m2,2)(p2 p2) / (2p2 (1 p2)) Ip1 +p2 1  b b b    where p = m / (m + m ) , p = m / (m + m ). Then to 1 1,1 1,1 1,2 1 2,2 2b,2 2,1b b simulate from this distribution, we simulate …rst π2 (p1 x1 m ) and j : then πb2 (p2 x1:m, p1) which are univariateb truncated Gaussian distributionj which can be sampled using the inverse cdf method. The ratio π (p1, p2 x1:m ) /π2 (p1, p2 x1:m ) is upper bounded. A …nal one consistsj of using j

A final possibility consists of using
$$\pi_3(p_1, p_2 \mid x_{1:m}) = \mathcal{B}e(p_1;\, m_{1,1}+1,\, m_{1,2}+1)\; \pi_3(p_2 \mid x_{1:m}, p_1),$$
where $\pi(p_2 \mid x_{1:m}, p_1) \propto (1-p_2)^{m_{2,1}}\, p_2^{m_{2,2}}\, \mathbb{I}_{p_2 \leq 1-p_1}$ is crudely approximated through $\pi_3(p_2 \mid x_{1:m}, p_1) = \frac{2}{(1-p_1)^2}\, p_2\, \mathbb{I}_{p_2 \leq 1-p_1}$. It is straightforward to check that
$$\pi(p_1, p_2 \mid x_{1:m}) / \pi_3(p_1, p_2 \mid x_{1:m}) \propto (1-p_2)^{m_{2,1}}\, p_2^{m_{2,2}} \Big/ \frac{2\, p_2}{(1-p_1)^2} < \infty.$$


A.D.() 19 / 40 Performance for N = 10, 000

Distribution   phi_1    phi_2    phi_3    phi_4    phi_5
pi_1           0.748    0.139    3.184    0.163    2.957
pi_2           0.689    0.210    2.319    0.283    2.211
pi_3           0.697    0.189    2.379    0.241    2.358
pi             0.697    0.189    2.373    0.240    2.358

Sampling from $\pi$ using rejection sampling works well but is computationally expensive. $\pi_3$ is computationally much cheaper, whereas $\pi_1$ does extremely poorly, as expected.

A.D.() 19 / 40 For ‡at functions, one can approximate the variance by

Vπ (ϕ (X )) V E (ϕ (X )) (1 + Vq (w (X ))) . πN  N  Simple interpretationb : The N weighted samples are approximately equivalent to M unweighted samples from π where N M = N. 1 + Vq (w (X )) 

E¤ective Sample Size

In statistics, we are usually not interested in a speci…c ϕ but in several functions and we prefer having q (x) as close as possible to π (x) .

A.D.() 20 / 40 Simple interpretation: The N weighted samples are approximately equivalent to M unweighted samples from π where N M = N. 1 + Vq (w (X )) 

E¤ective Sample Size

In statistics, we are usually not interested in a speci…c ϕ but in several functions and we prefer having q (x) as close as possible to π (x) . For ‡at functions, one can approximate the variance by

Vπ (ϕ (X )) V E (ϕ (X )) (1 + Vq (w (X ))) . πN  N b 

A.D.() 20 / 40 E¤ective Sample Size

In statistics, we are usually not interested in one specific $\varphi$ but in several functions, and we prefer having $q(x)$ as close as possible to $\pi(x)$.

For flat functions, one can approximate the variance by
$$\mathbb{V}\!\left[\widehat{\mathbb{E}}_{\widehat{\pi}_N}(\varphi(X))\right] \approx \frac{\mathbb{V}_\pi(\varphi(X))}{N}\left(1 + \mathbb{V}_q(w(X))\right).$$

Simple interpretation: the $N$ weighted samples are approximately equivalent to $M$ unweighted samples from $\pi$, where
$$M = \frac{N}{1 + \mathbb{V}_q(w(X))} \leq N.$$
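
The usual empirical counterpart of this interpretation is the effective sample size $\mathrm{ESS} = (\sum_i w_i)^2 / \sum_i w_i^2$, which estimates $N/(1 + \mathbb{V}_q(w(X)))$; it equals $N$ when all weights are equal and collapses when the weights degenerate. A small sketch with an arbitrary Gaussian target/proposal pair:

```python
import numpy as np
from scipy import stats

def ess(w):
    """Effective sample size (sum w)^2 / sum w^2, an estimate of N / (1 + Var_q(w))."""
    return w.sum() ** 2 / np.sum(w**2)

rng = np.random.default_rng(5)
N = 10_000
target = stats.norm(0.0, 1.0)

for scale in [1.0, 1.5, 3.0, 0.5]:   # proposal standard deviations (illustrative; 0.5 gives unbounded weights)
    q = stats.norm(0.0, scale)
    x = q.rvs(size=N, random_state=rng)
    w = target.pdf(x) / q.pdf(x)
    print(f"proposal sd = {scale:3.1f}   ESS = {ess(w):8.1f}  out of N = {N}")
```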

A.D.() 20 / 40 We select an importance distribution n 2 q (x1:n) = ∏ xk ; 0, σ . k=1 N  2 1 In this case, we have VIS Z < ∞ only for σ > 2 and h i n 2 VIS Z b1 σ4 / = 1 . Zh2 i N 2σ2 1 b "  # 4 It can easily be checked that σ 1 for any 1 2 1. 2σ2 1 > 2 < σ = 6

Limitations of Importance Sampling

Consider the case where the target is de…ned on Rn and n π (x1:n) = ∏ (xk ; 0, 1) , k=1 N n 2 xk γ (x1 n) = exp , : ∏ 2 k=1   Z = (2π)n/2 .

A.D.() 21 / 40 2 1 In this case, we have VIS Z < ∞ only for σ > 2 and h i n 2 VIS Z b1 σ4 / = 1 . Zh2 i N 2σ2 1 b "  # 4 It can easily be checked that σ 1 for any 1 2 1. 2σ2 1 > 2 < σ = 6

Limitations of Importance Sampling

Consider the case where the target is de…ned on Rn and n π (x1:n) = ∏ (xk ; 0, 1) , k=1 N n 2 xk γ (x1 n) = exp , : ∏ 2 k=1   Z = (2π)n/2 . We select an importance distribution n 2 q (x1:n) = ∏ xk ; 0, σ . k=1 N 

A.D.() 21 / 40 Limitations of Importance Sampling

Consider the case where the target is defined on $\mathbb{R}^n$ and
$$\pi(x_{1:n}) = \prod_{k=1}^{n} \mathcal{N}(x_k;\, 0, 1), \qquad \gamma(x_{1:n}) = \prod_{k=1}^{n} \exp\!\left(-\frac{x_k^2}{2}\right), \qquad Z = (2\pi)^{n/2}.$$

We select an importance distribution
$$q(x_{1:n}) = \prod_{k=1}^{n} \mathcal{N}(x_k;\, 0, \sigma^2).$$

In this case, we have $\mathbb{V}_{\mathrm{IS}}\big[\widehat{Z}\big] < \infty$ only for $\sigma^2 > \frac{1}{2}$, and
$$\frac{\mathbb{V}_{\mathrm{IS}}\big[\widehat{Z}\big]}{Z^2} = \frac{1}{N}\left[\left(\frac{\sigma^4}{2\sigma^2 - 1}\right)^{n/2} - 1\right].$$
It can easily be checked that $\frac{\sigma^4}{2\sigma^2 - 1} > 1$ for any $\frac{1}{2} < \sigma^2 \neq 1$.

The variance increases exponentially with n even in this simple case.

A.D.() 22 / 40 We would need to use N 2 1023 particles to obtain a relative VIS [Z ]   variance Z 2 = 0.01. b

The variance increases exponentially with n even in this simple case. For example, if we select σ2 = 1.2 then we have a reasonably good VIS [Z ] n/2 importance distribution as q (xk ) π (xk ) but N 2 (1.103)  Z  which is approximately equal to 1.9 1021 for n = 1000!b 

A.D.() 22 / 40 The variance increases exponentially with n even in this simple case. For example, if we select σ2 = 1.2 then we have a reasonably good VIS [Z ] n/2 importance distribution as q (xk ) π (xk ) but N 2 (1.103)  Z  which is approximately equal to 1.9 1021 for n = 1000!b  We would need to use N 2 1023 particles to obtain a relative VIS [Z ]   variance Z 2 = 0.01. b


A.D.() 23 / 40 Importance Sampling versus Rejection Sampling

Given $N$ samples from $q$, we estimate $\mathbb{E}_\pi(\varphi(X))$ through IS,
$$\widehat{\mathbb{E}}^{\mathrm{IS}}_{\pi_N}(\varphi(X)) = \frac{\sum_{i=1}^{N} w\big(X^{(i)}\big)\, \varphi\big(X^{(i)}\big)}{\sum_{i=1}^{N} w\big(X^{(i)}\big)},$$
or we "filter" the samples through rejection and propose instead
$$\widehat{\mathbb{E}}^{\mathrm{RS}}_{\pi_N}(\varphi(X)) = \frac{1}{K} \sum_{k=1}^{K} \varphi\big(X^{(i_k)}\big),$$
where $K \leq N$ is a random variable corresponding to the number of samples accepted.

We want to know which strategy performs the best.

A.D.() 23 / 40 Now let us consider the proposal distribution

q (x, y) = q (x) (y) for (x, y) E [0, 1] . U[0,1] 2 

De…ne the arti…cial target π (x, y) on E [0, 1] as  Cq(x ) , for (x, y) : x E and y 0, γ(x ) π (x, y) = Z 2 2 Cq(x ) ( 0 otherwisen h io then γ(x ) Cq(x ) Cq (x) π (x, y) dy = dy = π (x) . Z Z Z0

A.D.() 24 / 40 De…ne the arti…cial target π (x, y) on E [0, 1] as  Cq(x ) , for (x, y) : x E and y 0, γ(x ) π (x, y) = Z 2 2 Cq(x ) ( 0 otherwisen h io then γ(x ) Cq(x ) Cq (x) π (x, y) dy = dy = π (x) . Z Z Z0 Now let us consider the proposal distribution

$$q(x, y) = q(x)\, \mathcal{U}_{[0,1]}(y) \quad \text{for } (x, y) \in E \times [0,1].$$

A.D.() 24 / 40 We have

N (i) (i) (i) 1 K ∑i=1 w X , Y ϕ X ERS (ϕ (X )) = ϕ X (ik ) = . πN ∑ N (i)  (i)  K k=1 ∑i=1 w X , Y   b Compared to standard IS, RS performs IS on an enlarged space.

Then rejection sampling is nothing but IS on [0, 1] where X  x, y C q(x )dx γ(x ) π ( ) Z for y 0, Cq(x ) w (x, y) ∝ = R 2 q (x) [0,1] (y) 0, otherwise. U ( h i

A.D.() 25 / 40 Compared to standard IS, RS performs IS on an enlarged space.

Then rejection sampling is nothing but IS on [0, 1] where X  x, y C q(x )dx γ(x ) π ( ) Z for y 0, Cq(x ) w (x, y) ∝ = R 2 q (x) [0,1] (y) 0, otherwise. U ( h i We have

N (i) (i) (i) 1 K ∑i=1 w X , Y ϕ X ERS (ϕ (X )) = ϕ X (ik ) = . πN ∑ N (i)  (i)  K k=1 ∑i=1 w X , Y   b 

A.D.() 25 / 40 Then rejection sampling is nothing but IS on [0, 1] where X  x, y C q(x )dx γ(x ) π ( ) Z for y 0, Cq(x ) w (x, y) ∝ = R 2 q (x) [0,1] (y) 0, otherwise. U ( h i We have

N (i) (i) (i) 1 K ∑i=1 w X , Y ϕ X ERS (ϕ (X )) = ϕ X (ik ) = . πN ∑ N (i)  (i)  K k=1 ∑i=1 w X , Y   b Compared to standard IS, RS performs IS on an enlarged space.


A.D.() 26 / 40 The variance of the importance weights from RS is higher than for standard IS: V [w (X , Y )] V [w (X )] .  More precisely, we have

V [w (X , Y )] = V [E [w (X , Y ) X ]] + E [V [w (X , Y ) X ]] j j = V [w (X )] + E [V [w (X , Y ) X ]] . j To compute integrals, RS is ine¢ cient and you should simply use IS.

A.D.() 26 / 40 At time 1, assume we have approximate π1 (x1) and Z1 using an IS density q1 (x1); that is

N (i) (i) (i) (dx ) = W (i) (dx) where W w X , π1 1 ∑ 1 δX 1 ∝ 1 1 i=1 1   N b 1 (i) Z1 = ∑ w1 X1 N i=1   with b γ1 (x1) w1 (x1) = . q1 (x1)

Introduction to Sequential Importance Sampling

Aim: Design an IS method to approximate sequentially πn n 1 and f g  to compute Zn n 1. f g 

A.D.() 27 / 40 Introduction to Sequential Importance Sampling

Aim: design an IS method to approximate $\{\pi_n\}_{n\geq 1}$ sequentially and to compute $\{Z_n\}_{n\geq 1}$.

At time 1, assume we have approximated $\pi_1(x_1)$ and $Z_1$ using an IS density $q_1(x_1)$; that is,
$$\widehat{\pi}_1(dx_1) = \sum_{i=1}^{N} W_1^{(i)} \delta_{X_1^{(i)}}(dx_1) \quad \text{where } W_1^{(i)} \propto w_1\big(X_1^{(i)}\big), \qquad \widehat{Z}_1 = \frac{1}{N}\sum_{i=1}^{N} w_1\big(X_1^{(i)}\big),$$
with
$$w_1(x_1) = \frac{\gamma_1(x_1)}{q_1(x_1)}.$$

A.D.() 27 / 40 (i) We want to reuse the samples X1 from q1 (x1) use to build the IS approximation of π1 (x1) . Thisn onlyo makes sense if π2 (x1) π1 (x1).  We select q2 (x1 2) = q1 (x1) q2 (x2 x1) : j (i) so that to obtain X1:2 q2 (x1:2) we only need to sample (i) (i) (i) X X q2 x2 X . 2 1  j 1  

At time 2, we want to approximate π2 (x1:2) and Z2 using an IS density q2 (x1:2) .

A.D.() 28 / 40 We select q2 (x1 2) = q1 (x1) q2 (x2 x1) : j (i) so that to obtain X1:2 q2 (x1:2) we only need to sample (i) (i) (i) X X q2 x2 X . 2 1  j 1  

At time 2, we want to approximate π2 (x1:2) and Z2 using an IS density q2 (x1:2) . (i) We want to reuse the samples X1 from q1 (x1) use to build the IS approximation of π1 (x1) . Thisn onlyo makes sense if π2 (x1) π1 (x1). 

A.D.() 28 / 40 At time 2, we want to approximate π2 (x1:2) and Z2 using an IS density q2 (x1:2) . (i) We want to reuse the samples X1 from q1 (x1) use to build the IS approximation of π1 (x1) . Thisn onlyo makes sense if π2 (x1) π1 (x1).  We select q2 (x1 2) = q1 (x1) q2 (x2 x1) : j (i) so that to obtain X1:2 q2 (x1:2) we only need to sample (i) (i) (i) X X q2 x2 X . 2 1  j 1  

A.D.() 28 / 40 For the normalized weights

(i) γ X (i) (i) 2 1:2 W ∝ W 2 1 (i)  (i) (i) γ1 X1 q2 X2 X1    

Updating the IS approximation

We have to compute the weights

γ2 (x1:2) γ2 (x1:2) w2 (x1:2) = = q2 (x1 2) q1 (x1) q2 (x2 x1) : j γ (x ) γ (x ) = 1 1 2 1:2 q1 (x1) γ (x1) q2 (x2 x1) 1 j γ2 (x1:2) = w1 (x1) γ1 (x1) q2 (x2 x1) previous weight j incremental weigh | {z } | {z }

A.D.() 29 / 40 Updating the IS approximation

We have to compute the weights
$$w_2(x_{1:2}) = \frac{\gamma_2(x_{1:2})}{q_2(x_{1:2})} = \frac{\gamma_2(x_{1:2})}{q_1(x_1)\, q_2(x_2 \mid x_1)} = \frac{\gamma_1(x_1)}{q_1(x_1)}\, \frac{\gamma_2(x_{1:2})}{\gamma_1(x_1)\, q_2(x_2 \mid x_1)} = \underbrace{w_1(x_1)}_{\text{previous weight}}\; \underbrace{\frac{\gamma_2(x_{1:2})}{\gamma_1(x_1)\, q_2(x_2 \mid x_1)}}_{\text{incremental weight}}.$$

For the normalized weights,
$$W_2^{(i)} \propto W_1^{(i)}\, \frac{\gamma_2\big(X_{1:2}^{(i)}\big)}{\gamma_1\big(X_1^{(i)}\big)\, q_2\big(X_2^{(i)} \mid X_1^{(i)}\big)}.$$

A.D.() 29 / 40 The importance weights are updated according to

γn (x1:n) γn (x1:n) wn (x1:n) = = wn 1 (x1:n 1) qn (x1:n) γn 1 (x1:n 1) qn (xn x1:n 1) j

Generally speaking, we use at time n

qn (x1:n) = qn 1 (x1:n 1) qn (xn x1:n 1) j = q1 (x1) q2 (x2 x1) qn (xn x1:n 1) j    j (i) so if X1:n 1 qn 1 (x1:n 1) then we only need to sample (i) (i)  (i) Xn Xn 1 qn xn X1:n 1 .  j  

A.D.() 30 / 40 Generally speaking, we use at time n

$$q_n(x_{1:n}) = q_{n-1}(x_{1:n-1})\, q_n(x_n \mid x_{1:n-1}) = q_1(x_1)\, q_2(x_2 \mid x_1) \cdots q_n(x_n \mid x_{1:n-1}),$$
so if $X_{1:n-1}^{(i)} \sim q_{n-1}(x_{1:n-1})$ then we only need to sample $X_n^{(i)} \mid X_{1:n-1}^{(i)} \sim q_n\big(x_n \mid X_{1:n-1}^{(i)}\big)$.

The importance weights are updated according to
$$w_n(x_{1:n}) = \frac{\gamma_n(x_{1:n})}{q_n(x_{1:n})} = w_{n-1}(x_{1:n-1})\, \frac{\gamma_n(x_{1:n})}{\gamma_{n-1}(x_{1:n-1})\, q_n(x_n \mid x_{1:n-1})}.$$

A.D.() 30 / 40 (i) (i) sample Xn qn X  j 1:n 1 (i)   γ X compute w X (i) = w X (i) n 1:n . n 1:n n 1 1:n 1 (i)  (i) (i) γn 1 X1:n 1 qn Xn X1:n 1        

At time n 2 

At any time n, we have

(i) γ X (i) (i) n 1:n X1:n qn (x1:n) , wn X1:n =  q X (i)    n 1:n   thus we can obtain easily an IS approximation of πn (x1:n) and of Zn.

Sequential Importance Sampling

(i) (i) (i) γ1 X1 At time n = 1, sample X1 q1 ( ) and set w1 X1 =  (i)  .   q1 X1    

A.D.() 31 / 40 (i) (i) sample Xn qn X  j 1:n 1 (i)   γ X compute w X (i) = w X (i) n 1:n . n 1:n n 1 1:n 1 (i)  (i) (i) γn 1 X1:n 1 qn Xn X1:n 1        

At any time n, we have

(i) γ X (i) (i) n 1:n X1:n qn (x1:n) , wn X1:n =  q X (i)    n 1:n   thus we can obtain easily an IS approximation of πn (x1:n) and of Zn.

Sequential Importance Sampling

(i) (i) (i) γ1 X1 At time n = 1, sample X1 q1 ( ) and set w1 X1 =  (i)  .   q1 X1   At time n 2   

A.D.() 31 / 40 (i) γ X compute w X (i) = w X (i) n 1:n . n 1:n n 1 1:n 1 (i)  (i) (i) γn 1 X1:n 1 qn Xn X1:n 1        

At any time n, we have

(i) γ X (i) (i) n 1:n X1:n qn (x1:n) , wn X1:n =  q X (i)    n 1:n   thus we can obtain easily an IS approximation of πn (x1:n) and of Zn.

Sequential Importance Sampling

(i) (i) (i) γ1 X1 At time n = 1, sample X1 q1 ( ) and set w1 X1 =  (i)  .   q1 X1   At time n 2    (i) (i) sample Xn qn X1:n 1  j  

A.D.() 31 / 40 At any time n, we have

(i) γ X (i) (i) n 1:n X1:n qn (x1:n) , wn X1:n =  q X (i)    n 1:n   thus we can obtain easily an IS approximation of πn (x1:n) and of Zn.

Sequential Importance Sampling

(i) (i) (i) γ1 X1 At time n = 1, sample X1 q1 ( ) and set w1 X1 =  (i)  .   q1 X1   At time n 2    (i) (i) sample Xn qn X  j 1:n 1 (i)   γ X compute w X (i) = w X (i) n 1:n . n 1:n n 1 1:n 1 (i)  (i) (i) γn 1 X1:n 1 qn Xn X1:n 1        

A.D.() 31 / 40 Sequential Importance Sampling

At time $n = 1$, sample $X_1^{(i)} \sim q_1(\cdot)$ and set
$$w_1\big(X_1^{(i)}\big) = \frac{\gamma_1\big(X_1^{(i)}\big)}{q_1\big(X_1^{(i)}\big)}.$$

At time $n \geq 2$,
sample $X_n^{(i)} \sim q_n\big(\cdot \mid X_{1:n-1}^{(i)}\big)$;
compute
$$w_n\big(X_{1:n}^{(i)}\big) = w_{n-1}\big(X_{1:n-1}^{(i)}\big)\, \frac{\gamma_n\big(X_{1:n}^{(i)}\big)}{\gamma_{n-1}\big(X_{1:n-1}^{(i)}\big)\, q_n\big(X_n^{(i)} \mid X_{1:n-1}^{(i)}\big)}.$$

At any time $n$, we have
$$X_{1:n}^{(i)} \sim q_n(x_{1:n}), \qquad w_n\big(X_{1:n}^{(i)}\big) = \frac{\gamma_n\big(X_{1:n}^{(i)}\big)}{q_n\big(X_{1:n}^{(i)}\big)},$$
thus we can easily obtain an IS approximation of $\pi_n(x_{1:n})$ and of $Z_n$.
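
A generic sketch of this recursion in Python, worked on the log scale to avoid numerical underflow (a standard precaution, not something stated on the slide). The target and proposal are illustrative choices (the Gaussian target used earlier, with an independent $\mathcal{N}(0, 1.5^2)$ proposal), so the true $\log Z_n$ is known and the estimate can be checked.

```python
import numpy as np

rng = np.random.default_rng(7)
N, T = 1_000, 20

# Illustrative target: gamma_n(x_{1:n}) = prod_k exp(-x_k^2 / 2), so Z_n = (2*pi)^(n/2).
log_gamma_increment = lambda x_new: -0.5 * x_new**2
# Illustrative proposal q_n(x_n | x_{1:n-1}) = N(0, 1.5^2), independent of the past here.
q_scale = 1.5
log_q = lambda x_new: -0.5 * x_new**2 / q_scale**2 - 0.5 * np.log(2.0 * np.pi * q_scale**2)

log_w = np.zeros(N)                                   # log w_0 = 0
for n in range(1, T + 1):
    x_new = rng.normal(0.0, q_scale, size=N)          # sample X_n^(i) ~ q_n(. | X_{1:n-1}^(i))
    # w_n = w_{n-1} * gamma_n(x_{1:n}) / (gamma_{n-1}(x_{1:n-1}) * q_n(x_n | x_{1:n-1}))
    log_w += log_gamma_increment(x_new) - log_q(x_new)
    m = log_w.max()
    W = np.exp(log_w - m); W /= W.sum()               # normalized weights W_n^(i)
    log_Z_hat = m + np.log(np.mean(np.exp(log_w - m)))
    print(f"n = {n:2d}   log Z_hat = {log_Z_hat:8.3f}   true = {0.5 * n * np.log(2.0 * np.pi):8.3f}"
          f"   ESS = {1.0 / np.sum(W**2):7.1f}")
```

Note how the effective sample size printed above decays with $n$, foreshadowing the degeneracy discussed later.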

A.D.() 31 / 40 Assume we receive y1:n, we are interested in sampling from

p (x1:n, y1:n) πn (x1:n) = p (x1:n y1:n) = j p (y1:n)

and estimating p (y1:n) where n n γn (x1:n) = p (x1:n, y1:n) = µ (x1) ∏ f (xk xk 1) ∏ g (yk xk ) , k=2 j k=1 j n n Zn = p (y1:n) = µ (x1) f (xk xk 1) g (yk xk ) dx1:n. ∏ ∏ Z    Z k=2 j k=1 j

Sequential Importance Sampling for State-Space Models

State-space models

Hidden Markov process: X1 µ, Xk (Xk 1 = xk 1) f ( xk 1)  j  j Observation process: Yk (Xk = xk ) g ( xk ) j  j

A.D.() 32 / 40 Sequential Importance Sampling for State-Space Models

State-space models:
Hidden Markov process: $X_1 \sim \mu$, $X_k \mid (X_{k-1} = x_{k-1}) \sim f(\cdot \mid x_{k-1})$.
Observation process: $Y_k \mid (X_k = x_k) \sim g(\cdot \mid x_k)$.

Assume we receive $y_{1:n}$; we are interested in sampling from
$$\pi_n(x_{1:n}) = p(x_{1:n} \mid y_{1:n}) = \frac{p(x_{1:n}, y_{1:n})}{p(y_{1:n})}$$
and estimating $p(y_{1:n})$, where
$$\gamma_n(x_{1:n}) = p(x_{1:n}, y_{1:n}) = \mu(x_1) \prod_{k=2}^{n} f(x_k \mid x_{k-1}) \prod_{k=1}^{n} g(y_k \mid x_k),$$
$$Z_n = p(y_{1:n}) = \int \cdots \int \mu(x_1) \prod_{k=2}^{n} f(x_k \mid x_{k-1}) \prod_{k=1}^{n} g(y_k \mid x_k)\, dx_{1:n}.$$

A.D.() 32 / 40 (i) (i) sample Xn f X1:n 1  j (i)  (i) (i) compute wn X1:n = wn 1 X1:n 1 g yn Xn . j      

(i) At time n = 1, sample X1 µ ( ) and set (i) (i)   w1 X = g y1 X . 1 j 1 At time n 2   

At any time n, we have

n n (i) (i) (i) X1:n µ (x1) ∏ f (xk xk 1) , wn X1:n = ∏ g yk Xk  k=2 j k=1 j    

thus we can obtain easily an IS approximation of p (x1 n y1 n) and of : j : p (y1:n) .

We can select q1 (x1) = µ (x1) and qn (xn x1:n 1) = qn (xn xn 1) = f (xn xn 1) . j j j

A.D.() 33 / 40 (i) (i) sample Xn f X1:n 1  j (i)  (i) (i) compute wn X1:n = wn 1 X1:n 1 g yn Xn . j      

At time n 2 

At any time n, we have

n n (i) (i) (i) X1:n µ (x1) ∏ f (xk xk 1) , wn X1:n = ∏ g yk Xk  k=2 j k=1 j    

thus we can obtain easily an IS approximation of p (x1 n y1 n) and of : j : p (y1:n) .

We can select q1 (x1) = µ (x1) and qn (xn x1:n 1) = qn (xn xn 1) = f (xn xn 1) . j j (i) j At time n = 1, sample X1 µ ( ) and set (i) (i)   w1 X = g y1 X . 1 j 1    

A.D.() 33 / 40 (i) (i) sample Xn f X1:n 1  j (i)  (i) (i) compute wn X1:n = wn 1 X1:n 1 g yn Xn . j At any time n, we have    

n n (i) (i) (i) X1:n µ (x1) ∏ f (xk xk 1) , wn X1:n = ∏ g yk Xk  k=2 j k=1 j    

thus we can obtain easily an IS approximation of p (x1 n y1 n) and of : j : p (y1:n) .

We can select q1 (x1) = µ (x1) and qn (xn x1:n 1) = qn (xn xn 1) = f (xn xn 1) . j j (i) j At time n = 1, sample X1 µ ( ) and set (i) (i)   w1 X = g y1 X . 1 j 1 At time n 2   

A.D.() 33 / 40 (i) (i) (i) compute wn X1:n = wn 1 X1:n 1 g yn Xn . j At any time n, we have    

n n (i) (i) (i) X1:n µ (x1) ∏ f (xk xk 1) , wn X1:n = ∏ g yk Xk  k=2 j k=1 j    

thus we can obtain easily an IS approximation of p (x1 n y1 n) and of : j : p (y1:n) .

We can select q1 (x1) = µ (x1) and qn (xn x1:n 1) = qn (xn xn 1) = f (xn xn 1) . j j (i) j At time n = 1, sample X1 µ ( ) and set (i) (i)   w1 X = g y1 X . 1 j 1 At time n 2    (i) (i) sample Xn f X1:n 1  j  

A.D.() 33 / 40 At any time n, we have

n n (i) (i) (i) X1:n µ (x1) ∏ f (xk xk 1) , wn X1:n = ∏ g yk Xk  k=2 j k=1 j    

thus we can obtain easily an IS approximation of p (x1 n y1 n) and of : j : p (y1:n) .

We can select q1 (x1) = µ (x1) and qn (xn x1:n 1) = qn (xn xn 1) = f (xn xn 1) . j j (i) j At time n = 1, sample X1 µ ( ) and set (i) (i)   w1 X = g y1 X . 1 j 1 At time n 2    (i) (i) sample Xn f X1:n 1  j (i)  (i) (i) compute wn X1:n = wn 1 X1:n 1 g yn Xn . j      

A.D.() 33 / 40 We can select q1 (x1) = µ (x1) and qn (xn x1:n 1) = qn (xn xn 1) = f (xn xn 1) . j j (i) j At time n = 1, sample X1 µ ( ) and set (i) (i)   w1 X = g y1 X . 1 j 1 At time n 2    (i) (i) sample Xn f X1:n 1  j (i)  (i) (i) compute wn X1:n = wn 1 X1:n 1 g yn Xn . j At any time n, we have    

$$X_{1:n}^{(i)} \sim \mu(x_1) \prod_{k=2}^{n} f(x_k \mid x_{k-1}), \qquad w_n\big(X_{1:n}^{(i)}\big) = \prod_{k=1}^{n} g\big(y_k \mid X_k^{(i)}\big),$$
thus we can easily obtain an IS approximation of $p(x_{1:n} \mid y_{1:n})$ and of $p(y_{1:n})$.
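
A sketch of this prior-as-proposal SIS scheme applied to the linear Gaussian model used later in these slides; the parameters, data simulation, horizon and seed are illustrative assumptions. Because there is no resampling, the weights degenerate, which is exactly the behaviour shown on the next slide.

```python
import numpy as np

rng = np.random.default_rng(8)
alpha, sigma_w = 0.9, 0.5          # illustrative parameters, not from the slides
N, T = 1_000, 100

# Simulate data from X_n = alpha*X_{n-1} + V_n,  Y_n = X_n + sigma_w*W_n,  X_1 ~ N(0, 1).
x_true = np.empty(T)
x_true[0] = rng.normal()
for n in range(1, T):
    x_true[n] = alpha * x_true[n - 1] + rng.normal()
y = x_true + sigma_w * rng.normal(size=T)

log_g = lambda yn, xn: -0.5 * ((yn - xn) / sigma_w) ** 2 - np.log(sigma_w * np.sqrt(2.0 * np.pi))

# SIS with the prior as proposal: sample from f, multiply the weight by g(y_n | x_n).
x = rng.normal(size=N)             # X_1^(i) ~ mu = N(0, 1)
log_w = log_g(y[0], x)             # w_1 = g(y_1 | X_1)
for n in range(1, T):
    x = alpha * x + rng.normal(size=N)      # X_n^(i) ~ f(. | X_{n-1}^(i))
    log_w = log_w + log_g(y[n], x)          # w_n = w_{n-1} * g(y_n | X_n)
    if (n + 1) % 20 == 0:
        W = np.exp(log_w - log_w.max()); W /= W.sum()
        print(f"n = {n+1:3d}   ESS = {1.0/np.sum(W**2):8.1f}   largest weight = {W.max():.3f}")
```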

A.D.() 33 / 40 Application to Stochastic Volatility Model

[Figure: Histograms of the base-10 logarithm of the normalized importance weights $W_n^{(i)}$ for $n = 1$ (top), $n = 50$ (middle) and $n = 100$ (bottom).]

The algorithm's performance collapses as $n$ increases: after a few time steps, only a very small number of particles have non-negligible weights.


A.D.() 35 / 40 Structure of the Optimal Distribution

The optimal zero-variance density at time $n$ is simply given by
$$q_n(x_{1:n}) = \pi_n(x_{1:n}).$$

As we have
$$\pi_n(x_{1:n}) = \pi_n(x_1)\, \pi_n(x_2 \mid x_1) \cdots \pi_n(x_n \mid x_{1:n-1}),$$
where $\pi_n(x_k \mid x_{1:k-1}) \propto \gamma_n(x_k \mid x_{1:k-1})$, it means that we have
$$q_k^{\mathrm{opt}}(x_k \mid x_{1:k-1}) = \pi_n(x_k \mid x_{1:k-1}).$$

Obviously this result depends on $n$, so it is only useful if we are interested in one specific target $\pi_n(x_{1:n})$, and in such scenarios we typically need to approximate $\pi_n(x_k \mid x_{1:k-1})$.

A.D.() 35 / 40 We have for the importance weight

γn (x1:n) wn (x1:n) = qn 1 (x1:n 1) qn (xn x1:n 1) j Znπn (x1:n 1) πn (xn x1:n 1) = j qn 1 (x1:n 1) qn (xn x1:n 1) j It follows directly that we have opt qn (xn x1:n 1) = πn (xn x1:n 1) j j and

γn (x1:n) wn (x1:n) = wn 1 (x1:n 1) γn 1 (x1:n 1) πn (xn x1:n 1) j γn (x1:n 1) = wn 1 (x1:n 1) γn 1 (x1:n 1)

Locally Optimal Importance Distribution

One sensible strategy consists of selecting qn (xn x1:n 1) at time n so as to minimize the variance of the importance weights.j

A.D.() 36 / 40 It follows directly that we have opt qn (xn x1:n 1) = πn (xn x1:n 1) j j and

γn (x1:n) wn (x1:n) = wn 1 (x1:n 1) γn 1 (x1:n 1) πn (xn x1:n 1) j γn (x1:n 1) = wn 1 (x1:n 1) γn 1 (x1:n 1)

Locally Optimal Importance Distribution

One sensible strategy consists of selecting qn (xn x1:n 1) at time n so as to minimize the variance of the importance weights.j We have for the importance weight

γn (x1:n) wn (x1:n) = qn 1 (x1:n 1) qn (xn x1:n 1) j Znπn (x1:n 1) πn (xn x1:n 1) = j qn 1 (x1:n 1) qn (xn x1:n 1) j

A.D.() 36 / 40 Locally Optimal Importance Distribution

One sensible strategy consists of selecting $q_n(x_n \mid x_{1:n-1})$ at time $n$ so as to minimize the variance of the importance weights.

We have for the importance weight
$$w_n(x_{1:n}) = \frac{\gamma_n(x_{1:n})}{q_{n-1}(x_{1:n-1})\, q_n(x_n \mid x_{1:n-1})} = \frac{Z_n\, \pi_n(x_{1:n-1})\, \pi_n(x_n \mid x_{1:n-1})}{q_{n-1}(x_{1:n-1})\, q_n(x_n \mid x_{1:n-1})}.$$

It follows directly that we have
$$q_n^{\mathrm{opt}}(x_n \mid x_{1:n-1}) = \pi_n(x_n \mid x_{1:n-1})$$
and
$$w_n(x_{1:n}) = w_{n-1}(x_{1:n-1})\, \frac{\gamma_n(x_{1:n})}{\gamma_{n-1}(x_{1:n-1})\, \pi_n(x_n \mid x_{1:n-1})} = w_{n-1}(x_{1:n-1})\, \frac{\gamma_n(x_{1:n-1})}{\gamma_{n-1}(x_{1:n-1})}.$$


This locally optimal importance density will be used again and again.

It is often impossible to sample from $\pi_n(x_n \mid x_{1:n-1})$ and/or to compute $\gamma_n(x_{1:n-1}) = \int \gamma_n(x_{1:n})\, dx_n$. In such cases, it is necessary to approximate $\pi_n(x_n \mid x_{1:n-1})$ and $\gamma_n(x_{1:n-1})$.

A.D.() 37 / 40 In this case,

p (x1:n, y1:n) wn (x1:n) = wn 1 (x1:n 1) p (x1:n 1, y1:n 1) p (xn yn, xn 1) j = wn 1 (x1:n 1) p (yn xn 1) . j Example: Consider f (xn xn 1) = (xn; α (xn 1) , β (xn 1)) and 2 j N g (yn xn) = xn; σw then j N 2 p (xn yn, xn 1) = xn; m (xn 1) , σ (xn 1) with j N  2 2 β (xn 1) σw 2  α (xn 1) yn σ (xn 1) = 2 , m (xn 1) = σ (xn 1) + 2 . β (xn 1) + σw β (xn 1) σw  

Application to State-Space Models

In the case of state-space models, we have opt qn (xn x1:n 1) = p (xn y1:n, x1:n 1) = p (xn yn, xn 1) j j j g (yn xn) f (xn xn 1) = j j p (yn xn 1) j

A.D.() 38 / 40 Example: Consider f (xn xn 1) = (xn; α (xn 1) , β (xn 1)) and 2 j N g (yn xn) = xn; σw then j N 2 p (xn yn, xn 1) = xn; m (xn 1) , σ (xn 1) with j N  2 2 β (xn 1) σw 2  α (xn 1) yn σ (xn 1) = 2 , m (xn 1) = σ (xn 1) + 2 . β (xn 1) + σw β (xn 1) σw  

Application to State-Space Models

In the case of state-space models, we have opt qn (xn x1:n 1) = p (xn y1:n, x1:n 1) = p (xn yn, xn 1) j j j g (yn xn) f (xn xn 1) = j j p (yn xn 1) j In this case,

p (x1:n, y1:n) wn (x1:n) = wn 1 (x1:n 1) p (x1:n 1, y1:n 1) p (xn yn, xn 1) j = wn 1 (x1:n 1) p (yn xn 1) . j

Application to State-Space Models

In the case of state-space models, we have
$$q_n^{\mathrm{opt}}(x_n \mid x_{1:n-1}) = p(x_n \mid y_{1:n}, x_{1:n-1}) = p(x_n \mid y_n, x_{n-1}) = \frac{g(y_n \mid x_n)\, f(x_n \mid x_{n-1})}{p(y_n \mid x_{n-1})}.$$

In this case,
$$w_n(x_{1:n}) = w_{n-1}(x_{1:n-1})\, \frac{p(x_{1:n}, y_{1:n})}{p(x_{1:n-1}, y_{1:n-1})\, p(x_n \mid y_n, x_{n-1})} = w_{n-1}(x_{1:n-1})\, p(y_n \mid x_{n-1}).$$

Example: Consider $f(x_n \mid x_{n-1}) = \mathcal{N}(x_n;\, \alpha(x_{n-1}), \beta(x_{n-1}))$ and $g(y_n \mid x_n) = \mathcal{N}(y_n;\, x_n, \sigma_w^2)$; then $p(x_n \mid y_n, x_{n-1}) = \mathcal{N}(x_n;\, m(x_{n-1}), \sigma^2(x_{n-1}))$ with
$$\sigma^2(x_{n-1}) = \frac{\beta(x_{n-1})\, \sigma_w^2}{\beta(x_{n-1}) + \sigma_w^2}, \qquad m(x_{n-1}) = \sigma^2(x_{n-1}) \left(\frac{\alpha(x_{n-1})}{\beta(x_{n-1})} + \frac{y_n}{\sigma_w^2}\right).$$

A.D.() 38 / 40 We use qn (xn x1:n 1) = f (xn xn 1) = (xn; αxn 1, 1) , j j N

qn (xn x1:n 1) = f (xn xn 1) = (xn; αxn 1, 1) , j j N

opt qn (xn x1:n 1) = p (xn yn, xn 1) j j 2 2 σw yn σw = xn; αxn 1 + , . N σ2 + 1 σ2 σ2 + 1  w  w  w 

Application to Linear Gaussian State-Space Models

Consider the simple model

Xn = αXn 1 + Vn, Yn = Xn + σWn

i.i.d. i.i.d. where X1 (0, 1), Vn (0, 1) , Wn (0, 1) .  N  N  N

A.D.() 39 / 40 Application to Linear Gaussian State-Space Models

Consider the simple model
$$X_n = \alpha X_{n-1} + V_n, \qquad Y_n = X_n + \sigma_w W_n,$$
where $X_1 \sim \mathcal{N}(0,1)$, $V_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,1)$ and $W_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,1)$.

We use the prior proposal
$$q_n(x_n \mid x_{1:n-1}) = f(x_n \mid x_{n-1}) = \mathcal{N}(x_n;\, \alpha x_{n-1}, 1)$$
and the locally optimal proposal
$$q_n^{\mathrm{opt}}(x_n \mid x_{1:n-1}) = p(x_n \mid y_n, x_{n-1}) = \mathcal{N}\!\left(x_n;\ \frac{\sigma_w^2}{\sigma_w^2 + 1}\left(\alpha x_{n-1} + \frac{y_n}{\sigma_w^2}\right),\ \frac{\sigma_w^2}{\sigma_w^2 + 1}\right).$$
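
A sketch that runs SIS on simulated data from this model with both proposals and reports the final effective sample size. The parameters ($\alpha = 0.9$, $\sigma_w = 0.5$), the horizon, the seed and the choice $q_1 = \mu$ for both schemes are illustrative assumptions; without resampling both weight sets still degenerate eventually, but the locally optimal proposal typically degrades more slowly.

```python
import numpy as np

rng = np.random.default_rng(9)
alpha, sigma_w = 0.9, 0.5          # illustrative parameters, not from the slides
N, T = 1_000, 100

# Simulate data from X_n = alpha*X_{n-1} + V_n,  Y_n = X_n + sigma_w*W_n,  X_1 ~ N(0, 1).
x_true = np.empty(T)
x_true[0] = rng.normal()
for n in range(1, T):
    x_true[n] = alpha * x_true[n - 1] + rng.normal()
y = x_true + sigma_w * rng.normal(size=T)

log_norm = lambda z, m, s: -0.5 * ((z - m) / s) ** 2 - np.log(s * np.sqrt(2.0 * np.pi))

def ess(log_w):
    W = np.exp(log_w - log_w.max())
    W /= W.sum()
    return 1.0 / np.sum(W**2)

s2 = sigma_w**2 / (sigma_w**2 + 1.0)    # variance of p(x_n | y_n, x_{n-1})

# Both schemes start from q_1 = mu with w_1 = g(y_1 | x_1), for simplicity.
x_pr = rng.normal(size=N); log_w_pr = log_norm(y[0], x_pr, sigma_w)
x_op = rng.normal(size=N); log_w_op = log_norm(y[0], x_op, sigma_w)

for n in range(1, T):
    # Prior proposal: x_n ~ f(. | x_{n-1}), weight times g(y_n | x_n).
    x_pr = alpha * x_pr + rng.normal(size=N)
    log_w_pr += log_norm(y[n], x_pr, sigma_w)
    # Locally optimal proposal: weight times p(y_n | x_{n-1}) = N(y_n; alpha*x_{n-1}, 1 + sigma_w^2),
    # then sample x_n ~ p(x_n | y_n, x_{n-1}).
    log_w_op += log_norm(y[n], alpha * x_op, np.sqrt(1.0 + sigma_w**2))
    x_op = s2 * (alpha * x_op + y[n] / sigma_w**2) + np.sqrt(s2) * rng.normal(size=N)

print("ESS after", T, "steps:   prior proposal:", round(ess(log_w_pr), 1),
      "   optimal proposal:", round(ess(log_w_op), 1))
```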

A.D.() 39 / 40 Sequential Importance Sampling can only work for moderate size problems. Is there a way to partially …x this problem?

Summary

Sequential Importance Sampling is an attractive idea: sequential and parallelizable, only requires designing low-dimensional proposal distributions.

A.D.() 40 / 40 Is there a way to partially …x this problem?

Summary

Sequential Importance Sampling is an attractive idea: sequential and parallelizable, only requires designing low-dimensional proposal distributions. Sequential Importance Sampling can only work for moderate size problems.

A.D.() 40 / 40 Summary

Sequential Importance Sampling is an attractive idea: it is sequential and parallelizable, and it only requires designing low-dimensional proposal distributions.

However, Sequential Importance Sampling can only work for moderate-size problems.

Is there a way to partially fix this problem?
