
Chapter 4. Importance sampling

For Monte-Carlo integration one has a choice of how to split the integrand up into a function $g(x)$ and a probability density $p(x)$, and different choices produce different rates of convergence (as the number of samples increases). Even if we have made such a choice already, or the probability distribution has been given to us, we can still modify the Monte-Carlo integration to improve the results. To see this, suppose we write
$$I = E_p[g(X)] = \int_{-\infty}^{\infty} g(x)\,p(x)\,dx = \int_{-\infty}^{\infty} g(x)\,\frac{p(x)}{p^*(x)}\,p^*(x)\,dx = E_{p^*}\!\left[ g(X)\,\frac{p(X)}{p^*(X)}\right],$$
where $p^*(x)$ is some other probability distribution (it is called the biasing distribution). Note that $X$ and $x$ can be vectors, and the integral can be multidimensional. If we then do Monte-Carlo sampling, we obtain the estimate
$$\hat I = \frac{1}{M}\sum_{j=1}^{M} g(X_j^*)\,\frac{p(X_j^*)}{p^*(X_j^*)},$$
where now the $M$ random samples $X_j^*$ are drawn (sampled) according to the distribution $p^*(x)$. The quantity $p(x)/p^*(x)$ is called the likelihood ratio. Note that this generates an unbiased estimate, since $E_{p^*}[\hat I] = I$.

The utility of using a different distribution becomes clear when one looks at the variance of our biased estimate $\hat I$,
$$\mathrm{Var}_{p^*}\!\left[g(X)\,\frac{p(X)}{p^*(X)}\right] = \int_{-\infty}^{\infty}\left( g(x)\,\frac{p(x)}{p^*(x)} - I\right)^{2} p^*(x)\,dx.$$
Since the variance depends upon $p^*(x)$, we can choose the biasing distribution to reduce it. Obviously, we get zero variance, and this is optimal, if we choose
$$p^*(x) = \frac{g(x)\,p(x)}{I}.$$
The only problem with this, of course, is that we must know the answer $I$ in order to construct this distribution. Note, however, that we can also write
$$\mathrm{Var}_{p^*}\!\left[g(X)\,\frac{p(X)}{p^*(X)}\right] = \int_{-\infty}^{\infty}\left[ g(x)\,\frac{p(x)}{p^*(x)} - I\right]^{2} p^*(x)\,dx = \int_{-\infty}^{\infty} g^{2}(x)\,\frac{p(x)}{p^*(x)}\,p(x)\,dx - I^{2}.$$


We get, of course, a similar result for the original variance if we just set $p^*(x) = p(x)$. Subtracting the two, we have

$$\mathrm{Var}\!\left[g(X)\right] - \mathrm{Var}_{p^*}\!\left[g(X)\,\frac{p(X)}{p^*(X)}\right] = \int_{-\infty}^{\infty} g^{2}(x)\left[1 - \frac{p(x)}{p^*(x)}\right] p(x)\,dx.$$
Thus, we can reduce the variance if we choose $p^*(x) > p(x)$ where $g^{2}(x)p(x)$ is large and $p^*(x) < p(x)$ where $g^{2}(x)p(x)$ is small. If we do this, the probability mass is redistributed in accordance with its relative importance as measured by the weight $g(x)p(x)$. Both the magnitude of $g(x)$ and the relative likelihood of the corresponding values of $x$ are important. It is for this reason that this method of biasing the Monte-Carlo sampling distribution is known as importance sampling.

We have an exact formula for the variance of $\hat I$, and while we can't compute the integral directly we can still construct a sampled estimate of this variance,

$$\hat\sigma_I^{2} = \frac{1}{M-1}\sum_{j=1}^{M}\left( g(X_j^*)\,\frac{p(X_j^*)}{p^*(X_j^*)} - \hat I\right)^{2}.$$
Since the estimator $\hat I$ is just the mean of random variables all having this variance, we can then construct an estimate of the variance in $\hat I$ by dividing by the number of samples $M$, i.e.,

$$\hat\sigma_{\hat I}^{2} = \frac{1}{M}\,\hat\sigma_I^{2} = \frac{1}{M(M-1)}\sum_{j=1}^{M}\left( g(X_j^*)\,\frac{p(X_j^*)}{p^*(X_j^*)} - \hat I\right)^{2}.$$
It's usually very helpful to have an estimate of the variance to accompany some value one is trying to compute via Monte-Carlo integration.
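To make the preceding formulas concrete, the following is a minimal Python sketch (an illustration, not a prescribed implementation) of the importance-sampled estimator $\hat I$ and its standard-error estimate; the functions g, p, p_star and the sampler draw_from_p_star are placeholders that the user must supply.

\begin{verbatim}
import numpy as np

def is_estimate(g, p, p_star, draw_from_p_star, M, seed=0):
    """Importance-sampled Monte-Carlo estimate of I = E_p[g(X)].

    g, p, p_star     : vectorized functions of x
    draw_from_p_star : callable (M, rng) -> array of M samples from the biasing density p*
    Returns (I_hat, stderr), where stderr is the estimated standard error of I_hat.
    """
    rng = np.random.default_rng(seed)
    X = draw_from_p_star(M, rng)               # samples X_j^* drawn from p*
    vals = g(X) * p(X) / p_star(X)             # g(X_j^*) times the likelihood ratio
    I_hat = vals.mean()                        # unbiased estimate of I
    stderr = vals.std(ddof=1) / np.sqrt(M)     # sqrt of (sample variance of the terms)/M
    return I_hat, stderr
\end{verbatim}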

4.1 A coin flipping example

Let’s consider a specific problem: suppose we flip a coin N times and count the number of heads. This problem is discrete, of course, and so we will need to modify the previous continuous theory. The notation for the continuous case will provide guidance for how to do importance sampling in this discrete case.

In this case we will let $X_j$ be the result of flipping the coin the $j$th time, and set

$$X_j = \begin{cases} 1 & \text{for heads}, \\ 0 & \text{for tails}. \end{cases}$$

For a fair coin, $P(X_j = 1) = 1/2$ and $P(X_j = 0) = 1/2$.


Suppose we are interested in the probability that the number of heads is greater than or equal to $m$, i.e.,
$$P(Z_N \ge m), \qquad \text{where} \qquad Z_N = \sum_j X_j.$$
Since the number of heads follows a binomial distribution, we can compute this probability exactly,
$$P(Z_N \ge m) = \sum_{j=m}^{N} \binom{N}{j}\left(\frac{1}{2}\right)^{j}\left(\frac{1}{2}\right)^{N-j} = \frac{1}{2^{N}}\sum_{j=m}^{N}\binom{N}{j}.$$
We will focus on rare events, such as the case $N = 100$ and $m = 85$, for which the probability is $2.4\times 10^{-13}$. Rare events, or large deviations, are of interest in many situations, particularly when the specific event has a large negative consequence, such as a large rogue ocean wave or danger from a volcanic hazard (Bayarri et al., 2009; Osborne et al., 2000).
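For reference, the quoted probability is easy to reproduce from the binomial sum above; a minimal sketch using only the Python standard library:

\begin{verbatim}
from math import comb

N, m = 100, 85
# P(Z_N >= m) = 2^{-N} * sum_{j=m}^{N} C(N, j), evaluated with exact integer arithmetic
P_m = sum(comb(N, j) for j in range(m, N + 1)) / 2**N
print(f"P(Z_N >= {m}) = {P_m:.2e}")   # prints about 2.4e-13
\end{verbatim}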

To put this in the context of the general theory, we can define an indicator function

$$I_m(s) = \begin{cases} 1, & s \ge m, \\ 0, & s < m. \end{cases}$$
Then we can rewrite the whole thing a bit more symbolically as

$$P(Z_N \ge m) = \sum_{j=0}^{N} I_m(j)\,P(Z_N = j) = \sum_{x_1}\sum_{x_2}\cdots\sum_{x_N} I_m(x_1 + x_2 + \cdots + x_N)\,p(x_1)\,p(x_2)\cdots p(x_N),$$
where in the last expression $x_1 = 0$ or 1, $x_2 = 0$ or 1, ..., $x_N = 0$ or 1, and $p(1) = 1/2$ is the probability of a head and $p(0) = 1/2$ is the probability of a tail. Note the indicator function cuts off all terms in the sum for $x_1 + x_2 + \cdots + x_N < m$. Furthermore, if we write

$$I_m(x_1 + x_2 + \cdots + x_N) = I_m(\vec{x})$$
and
$$p(x_1)\,p(x_2)\cdots p(x_N) = \prod_{i=1}^{N} p(x_i) = p(\vec{x}),$$
we get the compact notation

$$P(Z_N \ge m) = P_m = \sum_{\vec{x}} I_m(\vec{x})\,p(\vec{x}).$$


Remember here that each component of $\vec{x}$ is either 0 or 1. The nice thing about this notation is that we immediately see how to convert this to a Monte-Carlo estimate of the probability:

$$\hat P_m = \frac{1}{M}\sum_{j=1}^{M} I_m(\vec{X}_j),$$
where $\vec{X}_j = (X_j^{(1)}, X_j^{(2)}, \ldots, X_j^{(N)})$ is the $j$th random sample of an $N$-vector of 0's and 1's, with each component selected using the appropriate probability for tails/heads.
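A direct Monte-Carlo implementation of this estimator is only a few lines; the sketch below (with an assumed sample size of one million trials) shows why it is hopeless for the rare event of interest: since $M \ll 1/P_m$, essentially every trial contributes zero.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
N, m, M = 100, 85, 10**6

heads = rng.binomial(N, 0.5, size=M)   # number of heads Z_N in each of M unbiased trials
P_hat = np.mean(heads >= m)            # fraction of trials with at least m heads
print(P_hat)                           # almost certainly 0.0, since P_m is about 2.4e-13
\end{verbatim}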

For $m$ deviating strongly from the mean (e.g., 85 or more heads from 100 flips) the probability of generating $m$ or more heads from $N$ flips will be low, so low that there is no way we can simulate it with the standard Monte-Carlo method; this would require an astronomical number of samples. We can, however, simulate it with importance sampling. The idea is to use an unfair coin, i.e., one with the probability of a head being $p^*$. First, we rewrite the above exact result as
$$P_m = \sum_{\vec{x}} I_m(\vec{x})\,p(\vec{x}) = \sum_{\vec{x}} I_m(\vec{x})\,\frac{p(\vec{x})}{p^*(\vec{x})}\,p^*(\vec{x}).$$
Then we see that the importance-sampled (IS) estimator, for $M$ trials and a vector of $N$ heads or tails $\vec{X}^*$, is

$$\hat P_m = \frac{1}{M}\sum_{j=1}^{M} I_m(\vec{X}_j^*)\,\frac{p(\vec{X}_j^*)}{p^*(\vec{X}_j^*)} = \frac{1}{M}\sum_{j=1}^{M} I_m\!\left(\sum_{i=1}^{N} X_j^{*(i)}\right)\prod_{i=1}^{N}\frac{p(X_j^{*(i)})}{p^*(X_j^{*(i)})}.$$
Here, the $X_j^{*(i)}$ are still 0 or 1, but they are drawn from the biased distribution $p^*(x)$ with $p^*(1) = p^*$ and $p^*(0) = 1 - p^*$. Keeping track of the total number of heads (the argument of the indicator function) is simple, of course: it's just the same sum as before. Now, however, we are adding up not just the contributions from the indicator function (i.e., a count of 1 for every trial with number of heads greater than or equal to $m$), but rather the indicator function weighted by the overall likelihood ratio. The likelihood ratio is easy to calculate, fortunately: since the flips are independent of one another both in the unbiased and biased (importance-sampled) cases, the joint probabilities in the numerator and denominator are both products of the probabilities of each individual flip:

$$\frac{p(X_j^{*(i)})}{p^*(X_j^{*(i)})} = \begin{cases} \dfrac{1}{2p^*} & \text{if } X_j^{*(i)} = 1, \\[2ex] \dfrac{1}{2(1-p^*)} & \text{if } X_j^{*(i)} = 0. \end{cases}$$
Thus, we just multiply the likelihood ratios for each individual flip to get the overall likelihood ratio. If $p^* > 1/2$, on a typical flip a head will be more likely to occur, which means that there will be many events with a single-flip likelihood ratio of $1/(2p^*) < 1$. (Some tails will occur, of course, for which the individual-flip likelihood ratio is $1/[2(1-p^*)] > 1$, but heads will strongly outweigh tails if $p^* > 1/2$.) Thus, we expect the overall likelihood ratio to be smaller than 1 as well. It can actually get quite small, as we will see.

Figure 4.1: Standard deviation of the importance-sampled symmetric random walk as a function of the biasing probability $p^*$ for $N = 100$ and $m = 85$.

The one thing we have not addressed yet for this example is the best choice for the biasing probability $p^*$. In this case we can actually calculate the variance as a function of $p^*$,

$$\mathrm{Var}_{p^*}\!\left[I_m(\vec{X})\,\frac{p(\vec{X})}{p^*(\vec{X})}\right] = \sum_{\vec{x}} I_m^{2}(\vec{x})\,\frac{p(\vec{x})}{p^*(\vec{x})}\,p(\vec{x}) - P_m^{2} = \sum_{j=m}^{N}\binom{N}{j}\left[\left(\frac{1}{2p^*}\right)^{j}\left(\frac{1}{2(1-p^*)}\right)^{N-j}\right]\frac{1}{2^{N}} - P_m^{2},$$
where we have used $I_m^{2}(\vec{x}) = I_m(\vec{x})$ since the indicator function is either 0 or 1. Figure 4.1 shows the resulting standard deviation $\sigma_{p^*}$ as a function of $p^*$. Note that the minimum of $\sigma_{p^*}$ (which is roughly $5.6\times 10^{-13}$) occurs near $p^* = 0.85$, which is the value of $p^*$ for which the expected number of heads ($p^* N = 85$) falls at the lower end of the desired range ($m \le Z_N \le N$) for the count of heads. This makes sense intuitively; if $p^*$ is much smaller than $m/N$, there will be too few samples that produce the required total number of heads. If $p^*$ is much larger than $m/N$, however, the number of heads will be not only larger than $m$, but typically much larger, and we will miss the region near $m$ where the contribution to the overall probability is largest.
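The variance formula above is easy to evaluate numerically; the following sketch (a check added here, with an assumed grid of $p^*$ values) scans $p^*$ and reproduces the minimum standard deviation of roughly $5.6\times 10^{-13}$ near $p^* = 0.85$.

\begin{verbatim}
import numpy as np
from math import comb

N, m = 100, 85
P_m = sum(comb(N, j) for j in range(m, N + 1)) / 2**N

def is_variance(p_star):
    """Exact per-sample variance of the IS estimator, from the formula above."""
    second_moment = sum(
        comb(N, j) * (0.5 / p_star)**j * (0.5 / (1 - p_star))**(N - j)
        for j in range(m, N + 1)
    ) / 2**N
    return second_moment - P_m**2

p_grid = np.linspace(0.55, 0.99, 441)
sigma = np.sqrt([is_variance(p) for p in p_grid])
i_min = np.argmin(sigma)
print(p_grid[i_min], sigma[i_min])   # roughly p* = 0.85 and sigma = 5.6e-13
\end{verbatim}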

Another way to understand what's going on is to remember that the optimal importance sampling distribution (the one that produces zero variance) is, in our current notation,

$$p^*(\vec{x}) = p^*(z_N) = \frac{I_m(z_N)\,p(z_N)}{P_m},$$



Figure 4.2: Top: $I_m(z_N)\,p(z_N)$, i.e., the probability of getting $z_N$ heads for $z_N \ge m$. Note this is the tail of the binomial probabilities. The optimal biasing distribution is this truncated binomial renormalized so that it is a proper probability distribution. Bottom: binomial probability distributions for a biased coin, where the probability of getting a head is $p^* = 0.75$, 0.85 and 0.95. The greatest overlap with the optimal distribution occurs when $p^*$ is near 0.85.

where $z_N = \sum_j x_j$ is the total number of heads, the $p(z_N)$ are the binomial probabilities, the indicator $I_m(z_N) = 1$ if $z_N \ge m$ and 0 otherwise, and $P_m = P(Z_N \ge m)$ is the probability we are trying to calculate. Note that in this particular case we know how to compute the probability distribution of the sum from the individual probability $p$ of a head, so we can write the result solely in terms of the total number of heads $z_N$. (This will not be true for a general problem, of course, but here we can use it to help understand how this works.) The top part of Fig. 4.2 shows the product of the unbiased binomial distribution and the indicator function, i.e., the tail of the binomial for $z_N \ge m$. This tail of the binomial, renormalized so that it is a proper probability distribution, is the optimal biasing distribution. (Note that normalizing by dividing by the sum of the probabilities in the tail of the binomial is, by definition, dividing by $P_m$.) For comparison, the bottom part of Fig. 4.2 shows three biased binomial probability distributions, one with $p^* = 0.75$, one with $p^* = 0.85$ and one with $p^* = 0.95$. It should be clear that the greatest overlap with the optimal biasing distribution occurs when $p^*$ is near 0.85.
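One way to quantify the "overlap" in Fig. 4.2 (an illustration added here; the notes make the comparison only graphically) is to compute the shared probability mass between the optimal distribution, i.e., the renormalized binomial tail, and each biased binomial:

\begin{verbatim}
import numpy as np
from math import comb

N, m = 100, 85
z = np.arange(N + 1)

def binom_pmf(p):
    return np.array([comb(N, k) * p**k * (1 - p)**(N - k) for k in z])

p_fair = binom_pmf(0.5)
P_m = p_fair[m:].sum()
p_opt = np.where(z >= m, p_fair, 0.0) / P_m   # optimal biasing distribution: tail / P_m

for p_bias in (0.75, 0.85, 0.95):
    shared_mass = np.minimum(p_opt, binom_pmf(p_bias)).sum()
    print(p_bias, shared_mass)                # the shared mass is largest near p* = 0.85
\end{verbatim}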

There is another way to view how big an improvement importance sampling gives. First, we note that when $p^* = 1/2$ (the unbiased case) we have
$$\mathrm{Var}\!\left[I_m(\vec{X})\right] = P_m - P_m^{2} \approx P_m = 2.4\times 10^{-13},$$



Figure 4.3: Expected number of trials needed to produce a standard error that is 10% of the mean, as a function of the biasing probability $p^*$.

which is a standard deviation of $4.9\times 10^{-7}$. This standard deviation is much, much larger than the value we are trying to calculate. To see the consequence of this, let's assume that we would like to produce an estimate with a standard error of the mean that is 10% of the value $P_m$. Since the squared standard error of the estimate is the sample variance $\hat\sigma_{p^*}^{2}$ divided by the number of trials, this means we want

$$\frac{\hat\sigma_{p^*}^{2}}{N} = \left(\frac{P_m}{10}\right)^{2} \quad\Longrightarrow\quad N = 100\,\frac{\hat\sigma_{p^*}^{2}}{P_m^{2}}.$$
A plot of the resulting $N$ as a function of $p^*$ is shown in Fig. 4.3. Note the number of trials is extremely large unless one is close to the optimal value of $p^*$. In the unbiased case ($p^* = 0.5$) we need approximately $4.2\times 10^{14}$ trials to estimate the probability with a standard error of 10%, while near the minimum ($p^* = 0.85$) this number drops to less than 1000 (roughly 540). Thus, importance sampling speeds up this Monte-Carlo simulation by almost 12 orders of magnitude.

These importance-sampled simulations are relatively straightforward to do. One merely draws random numbers from a uniform distribution and declares a head if the number is less than or equal to $p^*$; otherwise one gets a tail. The overall likelihood ratio is just the product of the individual likelihood ratios, of course. Figures 4.4 and 4.5 show the computed mean and standard deviation for this particular importance-sampled Monte-Carlo simulation.
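A minimal sketch of the procedure just described (with 100,000 trials assumed; the likelihood ratio is accumulated in log form only to avoid floating-point underflow) is:

\begin{verbatim}
import numpy as np

def is_coin_estimate(p_star, N=100, m=85, M=100_000, seed=0):
    """Importance-sampled estimate of P(Z_N >= m) for a fair coin,
    using a biased coin with head probability p_star."""
    rng = np.random.default_rng(seed)
    heads = (rng.random((M, N)) < p_star).sum(axis=1)   # biased flips, then count heads
    # overall likelihood ratio: (1/(2 p*))^heads * (1/(2(1-p*)))^(N-heads), in log form
    log_L = heads * np.log(0.5 / p_star) + (N - heads) * np.log(0.5 / (1 - p_star))
    vals = (heads >= m) * np.exp(log_L)                 # indicator weighted by L
    return vals.mean(), vals.std(ddof=1) / np.sqrt(M)   # estimate and its standard error

print(is_coin_estimate(0.85))   # close to the exact 2.4e-13, with a small standard error
\end{verbatim}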

4.2 Mean translation vs. variance scaling with a sum of Gaussians

For this example suppose we want to estimate $P(Z_N \ge m)$, where
$$Z_N = \sum_{i=1}^{N} X_i,$$



Figure 4.4: Importance-sampled Monte-Carlo result for the probability that flipping a coin 100 times results in 85 or more heads, as a function of the biasing probability $p^*$. Green is the exact result, blue is the numerical result when 100,000 trials are used at each value of $p^*$.


Figure 4.5: Importance-sampled Monte-Carlo result for the standard deviation obtained from the experiment of flipping a coin 100 times to get 85 or more heads, as a function of the biasing probability $p^*$. Green is the exact result, blue is the numerical result when 100,000 trials are used at each value of $p^*$.


where the $X_i$ are iid (independent, identically distributed) random variables. Let $I_m(z) = 1$ if $z \ge m$, and 0 otherwise. Then we want
$$P(Z_N \ge m) = \int_{-\infty}^{\infty} I_m(z)\,p(z)\,dz = \int_{-\infty}^{\infty} I_m(\vec{x})\,p(\vec{x})\,d\vec{x} = \int_{-\infty}^{\infty}\!\!\cdots\!\int_{-\infty}^{\infty} I_m(x_1 + \cdots + x_N)\,p(x_1)\,p(x_2)\cdots p(x_N)\,dx_1\,dx_2\cdots dx_N.$$
If $N$ is large, the difficulty is that it can be hard to find all of the regions in the $N$-dimensional space that contribute significantly to the integral.

With importance sampling, the above becomes

$$P(Z_N \ge m) = \int_{-\infty}^{\infty} I_m(\vec{x})\,\frac{p(\vec{x})}{p^*(\vec{x})}\,p^*(\vec{x})\,d\vec{x},$$
and when we do Monte-Carlo sampling we get

$$\hat P_m = \frac{1}{M}\sum_{j=1}^{M} I_m(\vec{X}_j^*)\,\frac{p(\vec{X}_j^*)}{p^*(\vec{X}_j^*)},$$
where $M$ is the number of samples, with an estimated variance of

$$\hat\sigma^{2}_{\hat P_m} = \frac{1}{M(M-1)}\sum_{j=1}^{M}\left( I_m(\vec{X}_j^*)\,\frac{p(\vec{X}_j^*)}{p^*(\vec{X}_j^*)} - \hat P_m\right)^{2}.$$
What makes this work is that we know how to compute $I_m(\vec{X}_j^*) = I_m(X_1^{(j)} + \cdots + X_N^{(j)})$; in this last expression the subscripts denote the component and the superscripts the particular trial. Also, since the components of $\vec{X}_j^*$ are independent, we have

$$\frac{p(\vec{X}_j^*)}{p^*(\vec{X}_j^*)} = \prod_{i=1}^{N}\frac{p(X_i^{(j)})}{p^*(X_i^{(j)})},$$
i.e., we can compute the overall likelihood ratio as a product of individual likelihood ratios.

As an example, suppose the $X_i$ are Gaussian random variables with zero mean and variance 1. Then, of course, because the sum of $N$ such Gaussians is also Gaussian with mean zero and variance $N$, we know that

$$P = P(Z_N \ge m) = \frac{1}{\sqrt{2\pi N}}\int_{m}^{\infty} e^{-x^{2}/2N}\,dx = \frac{1}{2}\,\mathrm{erfc}\!\left(\frac{m}{\sqrt{2N}}\right).$$


Even though we know the exact answer, it's still instructive to compute this probability with importance sampling. The key step is determining the biasing distribution $p^*(x)$. One simple choice, for which many answers can be obtained analytically, is a Gaussian distribution with mean $\mu$ and variance $\sigma^{2}$. Note that with this particular choice we are doing the biasing parametrically, i.e., we are choosing a distribution and adjusting its parameters to do the biasing.

Thus, we choose
$$p^*(x_i) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\,e^{-(x_i-\mu)^{2}/2\sigma^{2}}.$$
Of course, we get the same result for $E_{p^*}\!\left[I_m(\vec{X})\,p(\vec{X})/p^*(\vec{X})\right] = E[\hat P_m] = P_m$ as for the unbiased case; that's the result that we're looking for. The interesting result is the variance,

$$\mathrm{Var}_{p^*}\!\left[I_m(\vec{X})\,\frac{p(\vec{X})}{p^*(\vec{X})}\right] = \int_{-\infty}^{\infty} I_m(\vec{x})\left[\frac{p(\vec{x})}{p^*(\vec{x})}\right]^{2} p^*(\vec{x})\,d\vec{x} - P_m^{2} = \int_{-\infty}^{\infty} I_m(\vec{x})\,\frac{p(\vec{x})}{p^*(\vec{x})}\,p(\vec{x})\,d\vec{x} - P_m^{2}.$$

Suppose we first try $\mu = 0$; the idea here is to increase the variance to spread the distribution out, and thus get more samples at larger values. Then in the integral we need
$$\frac{p(\vec{x})}{p^*(\vec{x})}\,p(\vec{x}) = \prod_{i=1}^{N}\frac{p^{2}(x_i)}{p^*(x_i)} = \prod_{i=1}^{N}\frac{\sqrt{2\pi\sigma^{2}}}{2\pi}\,\frac{e^{-x_i^{2}}}{e^{-x_i^{2}/2\sigma^{2}}} = \prod_{i=1}^{N}\frac{\sigma}{\sqrt{2\pi}}\,e^{-(1 - 1/2\sigma^{2})x_i^{2}} = \sigma^{N}\hat\sigma^{N}\prod_{i=1}^{N}\frac{1}{\sqrt{2\pi}\,\hat\sigma}\,e^{-x_i^{2}/2\hat\sigma^{2}},$$
where $\hat\sigma = \sigma/\sqrt{2\sigma^{2}-1}$. We now have an integral that looks like we are computing the probability for a sum of Gaussians with variance $\hat\sigma^{2}$ to be bigger than $m$, with a correction factor of $\sigma^{N}\hat\sigma^{N}$. Thus, the result must be
$$\mathrm{Var}_{p^*}\!\left[I_m(\vec{X})\,\frac{p(\vec{X})}{p^*(\vec{X})}\right] = \sigma^{N}\hat\sigma^{N}\,\frac{1}{2}\,\mathrm{erfc}\!\left(\frac{m}{\sqrt{2N\hat\sigma^{2}}}\right) - P_m^{2}.$$
In the above, let's assume that $N$ is large but that $m/\sqrt{N}$ is $O(1)$. The behavior as a function of $\sigma$ will therefore be dominated by the prefactor, $(\sigma\hat\sigma)^{N}$. We are interested in minimizing the variance, which means that we want $\sigma\hat\sigma$ to be minimal. Taking the logarithmic derivative of
$$\log(\sigma\hat\sigma) = \log\frac{\sigma^{2}}{\sqrt{2\sigma^{2}-1}} = 2\ln\sigma - \frac{1}{2}\ln(2\sigma^{2}-1)$$
and setting it to zero gives
$$\frac{2}{\sigma} - \frac{1}{2}\,\frac{4\sigma}{2\sigma^{2}-1} = 0 \quad\Longrightarrow\quad \frac{2}{\sigma} = \frac{2\sigma}{2\sigma^{2}-1},$$


which gives $\sigma^{2} = 2\sigma^{2} - 1$, or $\sigma = 1$. In this case $\mathrm{Var}_{p^*}\!\left[I_m(\vec{X})\,p(\vec{X})/p^*(\vec{X})\right] = P_m - P_m^{2}$, which means that the coefficient of variation (standard deviation divided by the mean) is $O(1/\sqrt{P_m})$, so many, many samples will be required to determine the value with Monte-Carlo sampling when $P_m$ is small. In this particular case, importance sampling is really of no benefit. This particular issue is known as the "dimensionality problem" of variance scaling.

On the other hand, suppose we set $\sigma = 1$ and take $\mu$ nonzero. Then we get instead
$$\frac{p(\vec{x})}{p^*(\vec{x})}\,p(\vec{x}) = \prod_{i=1}^{N}\frac{p^{2}(x_i)}{p^*(x_i)} = \prod_{i=1}^{N}\frac{1}{\sqrt{2\pi}}\,\frac{e^{-x_i^{2}}}{e^{-(x_i-\mu)^{2}/2}} = \prod_{i=1}^{N}\frac{1}{\sqrt{2\pi}}\,e^{-(x_i+\mu)^{2}/2 + \mu^{2}} = e^{N\mu^{2}}\prod_{i=1}^{N}\frac{1}{\sqrt{2\pi}}\,e^{-(x_i+\mu)^{2}/2}.$$
Now we have an integral that looks like we are computing the probability for a sum of Gaussians, each with mean $-\mu$ and variance 1, to be bigger than $m$, with a correction factor of $e^{N\mu^{2}}$. Thus, the result will be
$$\frac{1}{2}\,e^{N\mu^{2}}\,\mathrm{erfc}\!\left(\frac{m + N\mu}{\sqrt{2N}}\right) - P_m^{2}.$$
In this expression, if we assume $N$ is large we can use the asymptotic expansion

$$\mathrm{erfc}(z) \approx \frac{1}{\sqrt{\pi}\,z}\,e^{-z^{2}}.$$

The above expression is then approximately
$$\frac{1}{\sqrt{\pi}}\,e^{N\mu^{2}}\,\frac{\sqrt{2N}}{m + N\mu}\,e^{-(m+N\mu)^{2}/2N} - P_m^{2}.$$
Taking the logarithmic derivative of the first term, differentiating with respect to $\mu$, and neglecting some small terms, we get
$$2N\mu - \frac{2N(m+N\mu)}{2N} - \frac{N}{m+N\mu} = 0 \quad\Longrightarrow\quad N\mu - m - \frac{N}{m+N\mu} = 0$$
$$\Longrightarrow\quad (N\mu)^{2} = m^{2} + N \quad\Longrightarrow\quad \mu = \frac{1}{N}\sqrt{m^{2} + N}.$$
If, in addition, $m \gg \sqrt{N}$ (i.e., $P_m$ is small) this becomes $\mu \approx m/N$. In addition, in this case it is easy to check that $\mathrm{Var}_{p^*}\!\left[I_m(\vec{X})\,p(\vec{X})/p^*(\vec{X})\right] = O(P_m^{2})$, which means that it's possible to get fairly good results with a relatively small number of samples.

Figure 4.6: Histogram of importance-sampled Monte-Carlo results. Each individual numerical result is the estimated probability from 10,000 trials that the sum of 10 standard Gaussians is larger than or equal to 15. The histogram shows the results of 10,000 numerical experiments.

A simpler way to see this result is to determine the most probable way for $X_1 + X_2 + \cdots + X_N$ to achieve a particular sum $S$. Maximizing the probability for the sum of $N$ Gaussians is equivalent to minimizing $X_1^{2} + X_2^{2} + \cdots + X_N^{2}$. If we also want the sum to achieve a particular value $m$, we want to

$$\text{minimize } \sum_i X_i^{2} \quad\text{subject to the constraint}\quad \sum_i X_i = m,$$
which is a simple Lagrange multiplier problem, i.e.,

$$\text{minimize } \sum_i X_i^{2} + \lambda\left(\sum_i X_i - m\right).$$

Differentiating with respect to $X_k$, we find $2X_k + \lambda = 0$. Using the constraint then gives the solution $X_i = m/N$, i.e., the largest contribution to the overall probability comes from the same location in each Gaussian. This suggests, then, that when performing importance sampling we should shift the mean of each Gaussian by the same amount, $m/N$.

This type of biasing, where one shifts the mean of the distribution, is known as mean translation. Generally speaking, this method tends to work reasonably well in practice. The difficulty, of course, is figuring out a good shift of the mean.

As a specific example, take $N = 10$ and $m = 15$. The exact probability for the sum of 10 Gaussians to be larger than 15 is $1.05\times 10^{-6}$. Using Gaussians with a mean shifted by $m/N = 1.5$, producing 10,000 trials of the sum generates the sample mean value $1.05\times 10^{-6}$ with a sample standard deviation of $2.4\times 10^{-6}$, and a standard error of the mean of $2.4\times 10^{-8}$ (or a coefficient of variation of 0.023). The histogram of these results is shown in Fig. 4.6.
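The numbers quoted in this example are straightforward to reproduce; a minimal sketch (assuming the same $N = 10$, $m = 15$, $\mu = 1.5$, and 10,000 trials) is:

\begin{verbatim}
import numpy as np
from math import erfc, sqrt

N, m, M = 10, 15, 10_000
mu = m / N                          # mean translation suggested by the analysis above
rng = np.random.default_rng(1)

X = rng.normal(loc=mu, scale=1.0, size=(M, N))    # samples from the biased density
# likelihood ratio: prod_i p(x_i)/p*(x_i) = exp(-mu * sum_i x_i + N * mu^2 / 2)
L = np.exp(-mu * X.sum(axis=1) + N * mu**2 / 2)
vals = (X.sum(axis=1) >= m) * L                   # indicator times likelihood ratio
P_hat, stderr = vals.mean(), vals.std(ddof=1) / np.sqrt(M)

exact = 0.5 * erfc(m / sqrt(2 * N))
print(P_hat, stderr, exact)   # roughly 1.05e-6, a standard error near 2e-8, and 1.05e-6
\end{verbatim}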


4.3 Multiple importance sampling and balance heuristics

In many practical cases no single choice of biasing distribution can efficiently capture all the regions of sample space that give rise to the events of interest. In these cases, it is necessary to use importance sampling with more than one biasing distribution. The simultaneous use of different biasing methods is called multiple importance sampling. When using several biasing distributions $p_k^*(\vec{x})$, a difficulty arises about how to correctly weight the results coming from the different distributions. One solution to this problem can be found by assigning a weight $w_k(\vec{x})$ to each distribution and by rewriting the probability $P$ as:

$$P = \sum_{k=1}^{K} P_k = \sum_{k=1}^{K}\int w_k(\vec{x})\,I(\vec{x})\,L_k(\vec{x})\,p_k^*(\vec{x})\,d\vec{x}, \tag{4.1}$$
where $K$ is the number of different biasing distributions used and $L_k(\vec{x}) = p(\vec{x})/p_k^*(\vec{x})$ is the likelihood ratio for the $k$-th distribution. Note that the weights $w_k(\vec{x})$ depend on the value of the random variables for each individual sample. From Eq. (4.1), a multiply-importance-sampled Monte Carlo estimator for $P$ can now be written as

$$\hat P = \sum_{k=1}^{K} \hat P_k = \sum_{k=1}^{K}\frac{1}{M_k}\sum_{j=1}^{M_k} w_k(\vec{X}_{k,j}^*)\,I(\vec{X}_{k,j}^*)\,L_k(\vec{X}_{k,j}^*), \tag{4.2}$$
where $M_k$ is the number of samples drawn from the $k$-th distribution $p_k^*(\vec{x})$, and $\vec{X}_{k,j}^*$ is the $j$-th such sample.

Several ways exist to choose the weights $w_k(\vec{x})$, the particulars of which we will discuss momentarily. Generally, however, the quantity $\hat P$ is an unbiased estimator for $P$ (i.e., the expectation value of $\hat P$ is equal to $P$) for any choice of weights such that $\sum_{k=1}^{K} w_k(\vec{x}) = 1$ for all $\vec{x}$. Thus, each choice of weights corresponds to a different way of partitioning the total probability. The simplest possibility is just to set $w_k(\vec{x}) = 1/K$ for all $\vec{x}$, meaning that each distribution is assigned an equal weight in all regions of sample space. This choice is not advantageous, however, as we will see shortly.

If $\hat P$ is a multiply-importance-sampled Monte Carlo estimator defined according to Eq. (4.2), then, similarly to previous results, one can show that an unbiased estimator of its variance is

$$\hat\sigma^{2}_{\hat P} = \sum_{k=1}^{K}\frac{1}{M_k(M_k-1)}\sum_{j=1}^{M_k}\left( w_k(\vec{X}_{k,j}^*)\,L_k(\vec{X}_{k,j}^*)\,I(\vec{X}_{k,j}^*) - \hat P_k\right)^{2}. \tag{4.3}$$
Recursion relations can also be written so that $\hat\sigma^{2}_{\hat P}$ can be obtained without the need of storing all the individual samples until the end of the simulation:

$$\hat\sigma^{2}_{\hat P} = \sum_{k=1}^{K}\frac{1}{M_k(M_k-1)}\,\hat S_{k,M_k}, \tag{4.4}$$


with $\hat P = \sum_{k=1}^{K} \hat P_{k,M_k}$ and
$$\hat S_{k,j} = \hat S_{k,j-1} + \frac{j-1}{j}\left( w_k(\vec{X}_{k,j}^*)\,L_k(\vec{X}_{k,j}^*)\,I(\vec{X}_{k,j}^*) - \hat P_{k,j-1}\right)^{2}, \tag{4.5a}$$
$$\hat P_{k,j} = \frac{j-1}{j}\,\hat P_{k,j-1} + \frac{1}{j}\, w_k(\vec{X}_{k,j}^*)\,L_k(\vec{X}_{k,j}^*)\,I(\vec{X}_{k,j}^*). \tag{4.5b}$$
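A minimal sketch of these recursions (one accumulator per biasing distribution $k$; the class name and interface are chosen here purely for illustration) is:

\begin{verbatim}
class RunningISAccumulator:
    """Running mean and variance for one biasing distribution, following
    Eqs. (4.5a)-(4.5b); feed it the value w_k * L_k * I one sample at a time."""

    def __init__(self):
        self.j = 0        # number of samples processed so far
        self.P = 0.0      # running estimate   P_hat_{k,j}
        self.S = 0.0      # running accumulator S_hat_{k,j}

    def update(self, value):          # value = w_k(X) * L_k(X) * I(X) for one sample
        self.j += 1
        self.S += (self.j - 1) / self.j * (value - self.P) ** 2   # Eq. (4.5a)
        self.P += (value - self.P) / self.j                       # Eq. (4.5b), rearranged

    def contribution(self):
        """P_hat_{k,M_k} and its variance term S_hat/(M_k (M_k - 1)) from Eq. (4.4)."""
        var_k = self.S / (self.j * (self.j - 1)) if self.j > 1 else float("nan")
        return self.P, var_k
\end{verbatim}

Summing the two returned quantities over the $K$ distributions gives $\hat P$ and $\hat\sigma^{2}_{\hat P}$ from Eqs. (4.2) and (4.4).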

When using multiple importance sampling, the choice of weights $w_k(\vec{x})$ is almost as important as the choice of biasing distributions $p_k^*(\vec{x})$. Different weighting functions result in different values for the variance of the combined estimator. A poor choice of weights can result in a large variance, thus partially negating the gains obtained by importance sampling. The best weighting strategies are the ones that yield the smallest variance.

For example, consider the case where the weighting functions are constant over the whole domain. In this case,

$$P = \sum_{k=1}^{K} w_k\int I(\vec{x})\,L_k(\vec{x})\,p_k^*(\vec{x})\,d\vec{x} = \sum_{k=1}^{K} w_k\,E_{p_k^*}\!\left[I(\vec{x})\,L_k(\vec{x})\right]. \tag{4.6}$$
That is, the estimator is simply a weighted combination of the estimators obtained by using each of the biasing techniques. Unfortunately, the variance of $\hat P$ is also a weighted sum of the individual variances, $\sigma^{2}_{\hat P} = \sum_{k=1}^{K} w_k^{2}\sigma_k^{2}$, and if any of the sampling techniques is bad in a given region, then $\hat P$ will also have a high variance.

A relatively simple and particularly useful choice of weights is the balance heuristic (Veach, 1997; Owen and Zhou, 2000). In this case, the weights $w_k(\vec{x})$ are assigned according to

$$w_k(\vec{x}) = \frac{M_k\,p_k^*(\vec{x})}{\sum_{k'=1}^{K} M_{k'}\,p_{k'}^*(\vec{x})} = \frac{M_k/L_k(\vec{x})}{\sum_{k'=1}^{K} M_{k'}/L_{k'}(\vec{x})}. \tag{4.7}$$

The quantity $q_k(\vec{x}) = M_k\,p_k^*(\vec{x})$ is proportional to the expected number of hits from the $k$-th distribution. Thus, under the balance heuristic the weight associated with a sample $\vec{x}$ is given by the probability of realizing that sample with the $k$-th distribution relative to the total probability of realizing that same sample with all the distributions. Thus, Eq. (4.7) weights each distribution $p_k^*(\vec{x})$ most heavily in those regions of sample space where $p_k^*(\vec{x})$ is largest. The balance heuristic has been mathematically proven to be asymptotically close to optimal as the number of realizations becomes large (Veach, 1997). Results for the sum-of-Gaussians example are shown in Figs. 4.7 and 4.8.

Dividing both numerator and denominator by $p(\vec{x})$ allows Eq. (4.7) to be written alternatively as
$$w_k(\vec{x}) = \frac{M_k/L_k(\vec{x})}{\sum_{k'=1}^{K} M_{k'}/L_{k'}(\vec{x})}. \tag{4.8}$$



Figure 4.7: Multiple-importance sampled probability distribution (via histograms) for sum-of- Gaussians, showing results of individual biasing distributions.

This form, which only involves likelihood ratios, can be particularly convenient for use in Eq. (4.2), because of several simplifications which occur upon substitution,

$$\hat P = \sum_{k=1}^{K}\frac{1}{M_k}\sum_{j=1}^{M_k}\frac{M_k/L_k(\vec{X}_{k,j}^*)}{\sum_{k'=1}^{K} M_{k'}/L_{k'}(\vec{X}_{k,j}^*)}\,I(\vec{X}_{k,j}^*)\,L_k(\vec{X}_{k,j}^*) = \sum_{k=1}^{K}\sum_{j=1}^{M_k}\frac{I(\vec{X}_{k,j}^*)}{\sum_{k'=1}^{K} M_{k'}/L_{k'}(\vec{X}_{k,j}^*)}. \tag{4.9}$$
While it's a little difficult to interpret the weighted importance sampling result in this form, it does show that for each distribution and sample one only needs to keep track of a weighted harmonic average of the likelihood ratios. Note, however (and this is important), that one needs to evaluate all of the likelihood ratios for every sample drawn from every distribution. It may seem a little strange that when drawing samples $\vec{X}_{k,j}^*$ from distribution $k$ one needs to calculate likelihood ratios $L_{k'}(\vec{X}_{k,j}^*)$ for all $k'$, but one must remember that the likelihood ratios for $k' \ne k$ contain the information needed (albeit in a somewhat hidden form) to assess the relative probability of obtaining that particular sample from the different biasing distributions.
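To make Eq. (4.9) concrete, here is a minimal sketch of multiply-importance-sampled estimation with the balance heuristic for the sum-of-Gaussians example; the two mean shifts and the sample counts are illustrative choices, not the ones used to generate the figures.

\begin{verbatim}
import numpy as np

N, m = 10, 15
shifts = [1.0, 2.0]           # means of the two biasing distributions (illustrative)
M_k = [5_000, 5_000]          # number of samples drawn from each
rng = np.random.default_rng(2)

def log_L(x, mu):
    """log of the likelihood ratio p(x)/p_k*(x) for N-vectors x and a mean shift mu."""
    return -mu * x.sum(axis=1) + N * mu**2 / 2

P_hat = 0.0
for mu_k, Mk in zip(shifts, M_k):
    X = rng.normal(loc=mu_k, scale=1.0, size=(Mk, N))    # samples from p_k*
    # denominator of Eq. (4.9): sum over ALL distributions k' of M_k' / L_k'(X)
    denom = sum(Mkp * np.exp(-log_L(X, mup)) for mup, Mkp in zip(shifts, M_k))
    P_hat += np.sum((X.sum(axis=1) >= m) / denom)        # indicator over the combined weight

print(P_hat)    # estimate of P(Z_N >= m); the exact value is about 1.05e-6
\end{verbatim}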

References

Bayarri, M. J., Berger, J. O., Calder, E. S., et al. Using statistical and computer models to quantify volcanic hazards. Technometrics, 51(4):402–413, 2009.

Osborne, A. R., Onorato, M., and Serio, M. The nonlinear dynamics of rogue waves and holes in deep-water gravity wave trains. Physics Letters A, 275:386–393, 2000.



Figure 4.8: Multiple-importance sampled probability distribution for sum-of-Gaussians, show- ing individual biasing distributions weighted, and then combined, with balance heuristics.

Owen, Art and Zhou, Yi. Safe and effective importance sampling. Journal of the American Statistical Association, 95:135–143, 2000.

Veach, Eric. Robust Monte Carlo Methods for Light Transport Simulation. Ph.D. thesis, Stanford University, 1997.
