
Chapter 4. Importance sampling

For Monte-Carlo integration one has a choice of how to split the integrand up into a function $g(x)$ and a probability density $p(x)$, and different choices produce different rates of convergence (as the number of samples increases). Even if we have made such a choice already, or the probability distribution has been given to us, we can still modify the Monte-Carlo integration to improve the results. To see this, suppose we write
$$I = E_p[g(X)] = \int_{-\infty}^{\infty} g(x)\,p(x)\,dx = \int_{-\infty}^{\infty} g(x)\,\frac{p(x)}{p^*(x)}\,p^*(x)\,dx = E_{p^*}\!\left[ g(X)\,\frac{p(X)}{p^*(X)}\right],$$
where $p^*(x)$ is some other probability distribution (it is called the biasing distribution). Note that $X$ and $x$ can be vectors, and the integral can be multidimensional. If we then do Monte-Carlo sampling, we obtain the estimate
$$\hat I = \frac{1}{M}\sum_{j=1}^{M} g(X_j^*)\,\frac{p(X_j^*)}{p^*(X_j^*)},$$
where now the $M$ random samples $X_j^*$ are drawn (sampled) according to the distribution $p^*(x)$. The quantity $p(x)/p^*(x)$ is called the likelihood ratio. Note that this generates an unbiased estimate, since $E_{p^*}[\hat I] = I$.

The utility of using a different distribution becomes clear when one looks at the variance of our biased estimate $\hat I$,
$$\mathrm{Var}_{p^*}\!\left[g(X)\,\frac{p(X)}{p^*(X)}\right] = \int_{-\infty}^{\infty}\left( g(x)\,\frac{p(x)}{p^*(x)} - I\right)^{2} p^*(x)\,dx.$$
Since the variance depends upon $p^*(x)$, we can choose the biasing distribution to reduce it. Obviously, we get zero variance, and this is optimal, if we choose
$$p^*(x) = \frac{g(x)\,p(x)}{I}.$$
The only problem with this, of course, is that we must know the answer $I$ in order to construct this distribution. Note, however, that we can also write
$$\mathrm{Var}_{p^*}\!\left[g(X)\,\frac{p(X)}{p^*(X)}\right] = \int_{-\infty}^{\infty}\left[ g(x)\,\frac{p(x)}{p^*(x)} - I\right]^{2} p^*(x)\,dx = \int_{-\infty}^{\infty} g^{2}(x)\,\frac{p(x)}{p^*(x)}\,p(x)\,dx - I^{2}.$$


We get, of course, a similar result for the original variance if we just set $p^*(x) = p(x)$. Subtracting the two, we have

$$\mathrm{Var}\!\left[g(X)\right] - \mathrm{Var}_{p^*}\!\left[g(X)\,\frac{p(X)}{p^*(X)}\right] = \int_{-\infty}^{\infty} g^{2}(x)\left[1 - \frac{p(x)}{p^*(x)}\right] p(x)\,dx.$$
Thus, we can reduce the variance if we choose $p^*(x) > p(x)$ where $g^{2}(x)p(x)$ is large and $p^*(x) < p(x)$ where $g^{2}(x)p(x)$ is small. If we do this, the probability mass is redistributed in accordance with its relative importance as measured by the weight $g(x)p(x)$. Both the magnitude of $g(x)$ and the relative likelihood of the corresponding values of $x$ are important. It is for this reason that this method of biasing the Monte-Carlo sampling distribution is known as importance sampling.

We have an exact formula for the variance of $\hat I$, and while we can't compute the integral directly we can still construct a sampled estimate of this variance,

$$\hat\sigma_I^{2} = \frac{1}{M-1}\sum_{j=1}^{M}\left( g(X_j^*)\,\frac{p(X_j^*)}{p^*(X_j^*)} - \hat I\right)^{2}.$$
Since the estimator $\hat I$ is just the mean of random variables all having this variance, we can then construct an estimate of the variance in $\hat I$ by dividing by the number of samples $M$, i.e.,

$$\hat\sigma_{\hat I}^{2} = \frac{1}{M}\,\hat\sigma_I^{2} = \frac{1}{M(M-1)}\sum_{j=1}^{M}\left( g(X_j^*)\,\frac{p(X_j^*)}{p^*(X_j^*)} - \hat I\right)^{2}.$$
It's usually very helpful to have an estimate of the variance to accompany some value one is trying to compute via Monte-Carlo integration.
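To make the preceding formulas concrete, the following is a minimal Python sketch (an illustration, not a prescribed implementation) of the importance-sampled estimator $\hat I$ and its standard-error estimate; the functions g, p, p_star and the sampler draw_from_p_star are placeholders that the user must supply.

\begin{verbatim}
import numpy as np

def is_estimate(g, p, p_star, draw_from_p_star, M, seed=0):
    """Importance-sampled Monte-Carlo estimate of I = E_p[g(X)].

    g, p, p_star     : vectorized functions of x
    draw_from_p_star : callable (M, rng) -> array of M samples from the biasing density p*
    Returns (I_hat, stderr), where stderr is the estimated standard error of I_hat.
    """
    rng = np.random.default_rng(seed)
    X = draw_from_p_star(M, rng)               # samples X_j^* drawn from p*
    vals = g(X) * p(X) / p_star(X)             # g(X_j^*) times the likelihood ratio
    I_hat = vals.mean()                        # unbiased estimate of I
    stderr = vals.std(ddof=1) / np.sqrt(M)     # sqrt of (sample variance of the terms)/M
    return I_hat, stderr
\end{verbatim}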

4.1 A coin flipping example

Let’s consider a specific problem: suppose we flip a coin N times and count the number of heads. This problem is discrete, of course, and so we will need to modify the previous continuous theory. The notation for the continuous case will provide guidance for how to do importance sampling in this discrete case.

In this case we will let $X_j$ be the result of flipping the coin the $j$th time, and set

$$X_j = \begin{cases} 1 & \text{for heads}, \\ 0 & \text{for tails}. \end{cases}$$

For a fair coin, $P(X_j = 1) = 1/2$ and $P(X_j = 0) = 1/2$.


Suppose we are interested in the probability that the number of heads is greater than or equal to $m$, i.e.,
$$P(Z_N \ge m), \qquad \text{where} \qquad Z_N = \sum_j X_j.$$
Since the number of heads follows a binomial distribution, we can compute this probability exactly,
$$P(Z_N \ge m) = \sum_{j=m}^{N} \binom{N}{j}\left(\frac{1}{2}\right)^{j}\left(\frac{1}{2}\right)^{N-j} = \frac{1}{2^{N}}\sum_{j=m}^{N}\binom{N}{j}.$$
We will focus on rare events, such as the case $N = 100$ and $m = 85$, for which the probability is $2.4\times 10^{-13}$. Rare events, or large deviations, are of interest in many situations, particularly when the specific event has a large negative consequence, such as a large rogue ocean wave or danger from a volcanic hazard (Bayarri et al., 2009; Osborne et al., 2000).
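For reference, the quoted probability is easy to reproduce from the binomial sum above; a minimal sketch using only the Python standard library:

\begin{verbatim}
from math import comb

N, m = 100, 85
# P(Z_N >= m) = 2^{-N} * sum_{j=m}^{N} C(N, j), evaluated with exact integer arithmetic
P_m = sum(comb(N, j) for j in range(m, N + 1)) / 2**N
print(f"P(Z_N >= {m}) = {P_m:.2e}")   # prints about 2.4e-13
\end{verbatim}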

To put this in the context of the general theory, we can define an indicator function

$$I_m(s) = \begin{cases} 1, & s \ge m, \\ 0, & s < m. \end{cases}$$
Then we can rewrite the whole thing a bit more symbolically as

$$P(Z_N \ge m) = \sum_{j=0}^{N} I_m(j)\,P(Z_N = j) = \sum_{x_1}\sum_{x_2}\cdots\sum_{x_N} I_m(x_1 + x_2 + \cdots + x_N)\,p(x_1)\,p(x_2)\cdots p(x_N),$$
where in the last expression $x_1 = 0$ or 1, $x_2 = 0$ or 1, ..., $x_N = 0$ or 1, and $p(1) = 1/2$ is the probability of a head and $p(0) = 1/2$ is the probability of a tail. Note the indicator function cuts off all terms in the sum for $x_1 + x_2 + \cdots + x_N < m$. Furthermore, if we write

$$I_m(x_1 + x_2 + \cdots + x_N) = I_m(\vec{x})$$
and
$$p(x_1)\,p(x_2)\cdots p(x_N) = \prod_{i=1}^{N} p(x_i) = p(\vec{x}),$$
we get the compact notation

$$P(Z_N \ge m) = P_m = \sum_{\vec{x}} I_m(\vec{x})\,p(\vec{x}).$$


Remember here that each component of $\vec{x}$ is either 0 or 1. The nice thing about this notation is that we immediately see how to convert this to a Monte-Carlo estimate of the probability:

$$\hat P_m = \frac{1}{M}\sum_{j=1}^{M} I_m(\vec{X}_j),$$
where $\vec{X}_j = (X_j^{(1)}, X_j^{(2)}, \ldots, X_j^{(N)})$ is the $j$th random sample of an $N$-vector of 0's and 1's, with each component selected using the appropriate probability for tails/heads.
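A direct Monte-Carlo implementation of this estimator is only a few lines; the sketch below (with an assumed sample size of one million trials) shows why it is hopeless for the rare event of interest: since $M \ll 1/P_m$, essentially every trial contributes zero.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
N, m, M = 100, 85, 10**6

heads = rng.binomial(N, 0.5, size=M)   # number of heads Z_N in each of M unbiased trials
P_hat = np.mean(heads >= m)            # fraction of trials with at least m heads
print(P_hat)                           # almost certainly 0.0, since P_m is about 2.4e-13
\end{verbatim}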

For $m$ deviating strongly from the mean (e.g., 85 or more heads from 100 flips) the probability of generating $m$ or more heads from $N$ flips will be low, so low that there is no way we can simulate it with the standard Monte-Carlo method; this would require an astronomical number of samples. We can, however, simulate it with importance sampling. The idea is to use an unfair coin, i.e., one with the probability of a head being $p^*$. First, we rewrite the above exact result as
$$P_m = \sum_{\vec{x}} I_m(\vec{x})\,p(\vec{x}) = \sum_{\vec{x}} I_m(\vec{x})\,\frac{p(\vec{x})}{p^*(\vec{x})}\,p^*(\vec{x}).$$
Then we see that the importance-sampled (IS) estimator, for $M$ trials and a vector of $N$ heads or tails $\vec{X}^*$, is

$$\hat P_m = \frac{1}{M}\sum_{j=1}^{M} I_m(\vec{X}_j^*)\,\frac{p(\vec{X}_j^*)}{p^*(\vec{X}_j^*)} = \frac{1}{M}\sum_{j=1}^{M} I_m\!\left(\sum_{i=1}^{N} X_j^{*(i)}\right)\prod_{i=1}^{N}\frac{p(X_j^{*(i)})}{p^*(X_j^{*(i)})}.$$
Here, the $X_j^{*(i)}$ are still 0 or 1, but they are drawn from the biased distribution $p^*(x)$ with $p^*(1) = p^*$ and $p^*(0) = 1 - p^*$. Keeping track of the total number of heads (the argument of the indicator function) is simple, of course: it's just the same sum as before. Now, however, we are adding up not just the contributions from the indicator function (i.e., a count of 1 for every trial with number of heads greater than or equal to $m$), but rather the indicator function weighted by the overall likelihood ratio. The likelihood ratio is easy to calculate, fortunately: since the flips are independent of one another both in the unbiased and biased (importance-sampled) cases, the joint probabilities in the numerator and denominator are both products of the probabilities of each individual flip:

$$\frac{p(X_j^{*(i)})}{p^*(X_j^{*(i)})} = \begin{cases} \dfrac{1}{2p^*} & \text{if } X_j^{*(i)} = 1, \\[2ex] \dfrac{1}{2(1-p^*)} & \text{if } X_j^{*(i)} = 0. \end{cases}$$
Thus, we just multiply the likelihood ratios for each individual flip to get the overall likelihood ratio. If $p^* > 1/2$, on a typical flip a head will be more likely to occur, which means that there will be many events with a single-flip likelihood ratio of $1/(2p^*) < 1$. (Some tails will occur, of course, for which the individual-flip likelihood ratio is $1/[2(1-p^*)] > 1$, but heads will strongly outweigh tails if $p^* > 1/2$.) Thus, we expect the overall likelihood ratio to be smaller than 1 as well. It can actually get quite small, as we will see.

Figure 4.1: Standard deviation of the importance-sampled symmetric random walk as a function of the biasing probability $p^*$ for $N = 100$ and $m = 85$.

The one thing we have not addressed yet for this example is the best choice for the biasing probability $p^*$. In this case we can actually calculate the variance as a function of $p^*$,

$$\mathrm{Var}_{p^*}\!\left[I_m(\vec{X})\,\frac{p(\vec{X})}{p^*(\vec{X})}\right] = \sum_{\vec{x}} I_m^{2}(\vec{x})\,\frac{p(\vec{x})}{p^*(\vec{x})}\,p(\vec{x}) - P_m^{2} = \sum_{j=m}^{N}\binom{N}{j}\left[\left(\frac{1}{2p^*}\right)^{j}\left(\frac{1}{2(1-p^*)}\right)^{N-j}\right]\frac{1}{2^{N}} - P_m^{2},$$
where we have used $I_m^{2}(\vec{x}) = I_m(\vec{x})$ since the indicator function is either 0 or 1. Figure 4.1 shows the resulting standard deviation $\sigma_{p^*}$ as a function of $p^*$. Note that the minimum of $\sigma_{p^*}$ (which is roughly $5.6\times 10^{-13}$) occurs near $p^* = 0.85$, which is the value of $p^*$ for which the expected number of heads ($p^* N = 85$) falls at the lower end of the desired range ($m \le Z_N \le N$) for the count of heads. This makes sense intuitively; if $p^*$ is much smaller than $m/N$, there will be too few samples that produce the required total number of heads. If $p^*$ is much larger than $m/N$, however, the number of heads will be not only larger than $m$, but typically much larger, and we will miss the region near $m$ where the contribution to the overall probability is largest.
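The variance formula above is easy to evaluate numerically; the following sketch (a check added here, with an assumed grid of $p^*$ values) scans $p^*$ and reproduces the minimum standard deviation of roughly $5.6\times 10^{-13}$ near $p^* = 0.85$.

\begin{verbatim}
import numpy as np
from math import comb

N, m = 100, 85
P_m = sum(comb(N, j) for j in range(m, N + 1)) / 2**N

def is_variance(p_star):
    """Exact per-sample variance of the IS estimator, from the formula above."""
    second_moment = sum(
        comb(N, j) * (0.5 / p_star)**j * (0.5 / (1 - p_star))**(N - j)
        for j in range(m, N + 1)
    ) / 2**N
    return second_moment - P_m**2

p_grid = np.linspace(0.55, 0.99, 441)
sigma = np.sqrt([is_variance(p) for p in p_grid])
i_min = np.argmin(sigma)
print(p_grid[i_min], sigma[i_min])   # roughly p* = 0.85 and sigma = 5.6e-13
\end{verbatim}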

Another way to understand what's going on is to remember that the optimal importance sampling distribution (the one that produces zero variance) is, in our current notation,

$$p^*(\vec{x}) = p^*(z_N) = \frac{I_m(z_N)\,p(z_N)}{P_m},$$



Figure 4.2: Top: $I_m(z_N)\,p(z_N)$, i.e., the probability of getting $z_N$ heads for $z_N \ge m$. Note this is the tail of the binomial probabilities. The optimal biasing distribution is this truncated binomial renormalized so that it is a proper probability distribution. Bottom: binomial probability distributions for a biased coin, where the probability of getting a head is $p^* = 0.75$, 0.85 and 0.95. The greatest overlap with the optimal distribution occurs when $p^*$ is near 0.85.

where $z_N = \sum_j x_j$ is the total number of heads, the $p(z_N)$ are the binomial probabilities, the indicator $I_m(z_N) = 1$ if $z_N \ge m$ and 0 otherwise, and $P_m = P(Z_N \ge m)$ is the probability we are trying to calculate. Note that in this particular case we know how to compute the probability distribution of the sum from the individual probability $p$ of a head, so we can write the result solely in terms of the total number of heads $z_N$. (This will not be true for a general problem, of course, but here we can use it to help understand how this works.) The top part of Fig. 4.2 shows the product of the unbiased binomial distribution and the indicator function, i.e., the tail of the binomial for $z_N \ge m$. This tail of the binomial, renormalized so that it is a proper probability distribution, is the optimal biasing distribution. (Note that normalizing by dividing by the sum of the probabilities in the tail of the binomial is, by definition, dividing by $P_m$.) For comparison, the bottom part of Fig. 4.2 shows three biased binomial probability distributions, one with $p^* = 0.75$, one with $p^* = 0.85$ and one with $p^* = 0.95$. It should be clear that the greatest overlap with the optimal biasing distribution occurs when $p^*$ is near 0.85.
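One way to quantify the "overlap" in Fig. 4.2 (an illustration added here; the notes make the comparison only graphically) is to compute the shared probability mass between the optimal distribution, i.e., the renormalized binomial tail, and each biased binomial:

\begin{verbatim}
import numpy as np
from math import comb

N, m = 100, 85
z = np.arange(N + 1)

def binom_pmf(p):
    return np.array([comb(N, k) * p**k * (1 - p)**(N - k) for k in z])

p_fair = binom_pmf(0.5)
P_m = p_fair[m:].sum()
p_opt = np.where(z >= m, p_fair, 0.0) / P_m   # optimal biasing distribution: tail / P_m

for p_bias in (0.75, 0.85, 0.95):
    shared_mass = np.minimum(p_opt, binom_pmf(p_bias)).sum()
    print(p_bias, shared_mass)                # the shared mass is largest near p* = 0.85
\end{verbatim}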

There is another way to view how big an improvement importance sampling gives. First, we note that when $p^* = 1/2$ (the unbiased case) we have
$$\mathrm{Var}\!\left[I_m(\vec{X})\right] = P_m - P_m^{2} \approx P_m = 2.4\times 10^{-13},$$



Figure 4.3: Expected number of trials needed to produce a standard error that is 10% of the mean, as a function of the biasing probability $p^*$.

which is a standard deviation of $4.9\times 10^{-7}$. This standard deviation is much, much larger than the value we are trying to calculate. To see the consequence of this, let's assume that we would like to produce an estimate with a standard error of the mean that is 10% of the value $P_m$. Since the squared standard error of the estimate is the sample variance $\hat\sigma_{p^*}^{2}$ divided by the number of trials, this means we want

$$\frac{\hat\sigma_{p^*}^{2}}{N} = \left(\frac{P_m}{10}\right)^{2} \quad\Longrightarrow\quad N = 100\,\frac{\hat\sigma_{p^*}^{2}}{P_m^{2}}.$$
A plot of the resulting $N$ as a function of $p^*$ is shown in Fig. 4.3. Note the number of trials is extremely large unless one is close to the optimal value of $p^*$. In the unbiased case ($p^* = 0.5$) we need approximately $4.2\times 10^{14}$ trials to estimate the probability with a standard error of 10%, while near the minimum ($p^* = 0.85$) this number drops to less than 1000 (roughly 540). Thus, importance sampling speeds up this Monte-Carlo simulation by almost 12 orders of magnitude.

These importance-sampled simulations are relatively straightforward to do. One merely draws random numbers from a uniform distribution and declares a head if the number is less than or equal to $p^*$; otherwise one gets a tail. The overall likelihood ratio is just the product of the individual likelihood ratios, of course. Figures 4.4 and 4.5 show the computed mean and standard deviation for this particular importance-sampled Monte-Carlo simulation.
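A minimal sketch of the procedure just described (with 100,000 trials assumed; the likelihood ratio is accumulated in log form only to avoid floating-point underflow) is:

\begin{verbatim}
import numpy as np

def is_coin_estimate(p_star, N=100, m=85, M=100_000, seed=0):
    """Importance-sampled estimate of P(Z_N >= m) for a fair coin,
    using a biased coin with head probability p_star."""
    rng = np.random.default_rng(seed)
    heads = (rng.random((M, N)) < p_star).sum(axis=1)   # biased flips, then count heads
    # overall likelihood ratio: (1/(2 p*))^heads * (1/(2(1-p*)))^(N-heads), in log form
    log_L = heads * np.log(0.5 / p_star) + (N - heads) * np.log(0.5 / (1 - p_star))
    vals = (heads >= m) * np.exp(log_L)                 # indicator weighted by L
    return vals.mean(), vals.std(ddof=1) / np.sqrt(M)   # estimate and its standard error

print(is_coin_estimate(0.85))   # close to the exact 2.4e-13, with a small standard error
\end{verbatim}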

4.2 Mean translation vs. variance scaling with a sum of Gaussians

For this example suppose we want to estimate $P(Z_N \ge m)$, where
$$Z_N = \sum_{i=1}^{N} X_i,$$



Figure 4.4: Importance-sampled Monte-Carlo result for the probability that flipping a coin 100 times results in 85 or more heads, as a function of the biasing probability $p^*$. Green is the exact result, blue is the numerical result when 100,000 trials are used at each value of $p^*$.


Figure 4.5: Importance-sampled Monte-Carlo result for the standard deviation obtained from the experiment of flipping a coin 100 times to get 85 or more heads, as a function of the biasing probability $p^*$. Green is the exact result, blue is the numerical result when 100,000 trials are used at each value of $p^*$.


where the $X_i$ are iid (independent, identically distributed) random variables. Let $I_m(z) = 1$ if $z \ge m$, and 0 otherwise. Then we want
$$P(Z_N \ge m) = \int_{-\infty}^{\infty} I_m(z)\,p(z)\,dz = \int_{-\infty}^{\infty} I_m(\vec{x})\,p(\vec{x})\,d\vec{x} = \int_{-\infty}^{\infty}\!\!\cdots\!\int_{-\infty}^{\infty} I_m(x_1 + \cdots + x_N)\,p(x_1)\,p(x_2)\cdots p(x_N)\,dx_1\,dx_2\cdots dx_N.$$
If $N$ is large, the difficulty is that it can be hard to find all of the regions in the $N$-dimensional space that contribute significantly to the integral.

With importance sampling, the above becomes

$$P(Z_N \ge m) = \int_{-\infty}^{\infty} I_m(\vec{x})\,\frac{p(\vec{x})}{p^*(\vec{x})}\,p^*(\vec{x})\,d\vec{x},$$
and when we do Monte-Carlo sampling we get

$$\hat P_m = \frac{1}{M}\sum_{j=1}^{M} I_m(\vec{X}_j^*)\,\frac{p(\vec{X}_j^*)}{p^*(\vec{X}_j^*)},$$
where $M$ is the number of samples, with an estimated variance of

$$\hat\sigma^{2}_{\hat P_m} = \frac{1}{M(M-1)}\sum_{j=1}^{M}\left( I_m(\vec{X}_j^*)\,\frac{p(\vec{X}_j^*)}{p^*(\vec{X}_j^*)} - \hat P_m\right)^{2}.$$
What makes this work is that we know how to compute $I_m(\vec{X}_j^*) = I_m(X_1^{(j)} + \cdots + X_N^{(j)})$; in this last expression the subscripts denote the component and the superscripts the particular trial. Also, since the components of $\vec{X}_j^*$ are independent, we have

$$\frac{p(\vec{X}_j^*)}{p^*(\vec{X}_j^*)} = \prod_{i=1}^{N}\frac{p(X_i^{(j)})}{p^*(X_i^{(j)})},$$
i.e., we can compute the overall likelihood ratio as a product of individual likelihood ratios.

As an example, suppose the $X_i$ are Gaussian random variables with zero mean and variance 1. Then, of course, because the sum of $N$ such Gaussians is also Gaussian with mean zero and variance $N$, we know that

$$P = P(Z_N \ge m) = \frac{1}{\sqrt{2\pi N}}\int_{m}^{\infty} e^{-x^{2}/2N}\,dx = \frac{1}{2}\,\mathrm{erfc}\!\left(\frac{m}{\sqrt{2N}}\right).$$


Even though we know the exact answer, it's still instructive to compute this probability with importance sampling. The key step is determining the biasing distribution $p^*(x)$. One simple choice, for which many answers can be obtained analytically, is a Gaussian distribution with mean $\mu$ and variance $\sigma^{2}$. Note that with this particular choice we are doing the biasing parametrically, i.e., we are choosing a distribution and adjusting its parameters to do the biasing.

Thus, we choose
$$p^*(x_i) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\,e^{-(x_i-\mu)^{2}/2\sigma^{2}}.$$
Of course, we get the same result for $E_{p^*}\!\left[I_m(\vec{X})\,p(\vec{X})/p^*(\vec{X})\right] = E[\hat P_m] = P_m$ as for the unbiased case; that's the result that we're looking for. The interesting result is the variance,

$$\mathrm{Var}_{p^*}\!\left[I_m(\vec{X})\,\frac{p(\vec{X})}{p^*(\vec{X})}\right] = \int_{-\infty}^{\infty} I_m(\vec{x})\left[\frac{p(\vec{x})}{p^*(\vec{x})}\right]^{2} p^*(\vec{x})\,d\vec{x} - P_m^{2} = \int_{-\infty}^{\infty} I_m(\vec{x})\,\frac{p(\vec{x})}{p^*(\vec{x})}\,p(\vec{x})\,d\vec{x} - P_m^{2}.$$

Suppose we first try $\mu = 0$; the idea here is to increase the variance to spread the distribution out, and thus get more samples at larger values. Then in the integral we need
$$\frac{p(\vec{x})}{p^*(\vec{x})}\,p(\vec{x}) = \prod_{i=1}^{N}\frac{p^{2}(x_i)}{p^*(x_i)} = \prod_{i=1}^{N}\frac{\sqrt{2\pi\sigma^{2}}}{2\pi}\,\frac{e^{-x_i^{2}}}{e^{-x_i^{2}/2\sigma^{2}}} = \prod_{i=1}^{N}\frac{\sigma}{\sqrt{2\pi}}\,e^{-(1 - 1/2\sigma^{2})x_i^{2}} = \sigma^{N}\hat\sigma^{N}\prod_{i=1}^{N}\frac{1}{\sqrt{2\pi}\,\hat\sigma}\,e^{-x_i^{2}/2\hat\sigma^{2}},$$
where $\hat\sigma = \sigma/\sqrt{2\sigma^{2}-1}$. We now have an integral that looks like we are computing the probability for a sum of Gaussians with variance $\hat\sigma^{2}$ to be bigger than $m$, with a correction factor of $\sigma^{N}\hat\sigma^{N}$. Thus, the result must be
$$\mathrm{Var}_{p^*}\!\left[I_m(\vec{X})\,\frac{p(\vec{X})}{p^*(\vec{X})}\right] = \sigma^{N}\hat\sigma^{N}\,\frac{1}{2}\,\mathrm{erfc}\!\left(\frac{m}{\sqrt{2N\hat\sigma^{2}}}\right) - P_m^{2}.$$
In the above, let's assume that $N$ is large but that $m/\sqrt{N}$ is $O(1)$. The behavior as a function of $\sigma$ will therefore be dominated by the prefactor, $(\sigma\hat\sigma)^{N}$. We are interested in minimizing the variance, which means that we want $\sigma\hat\sigma$ to be minimal. Taking the logarithmic derivative of
$$\log(\sigma\hat\sigma) = \log\frac{\sigma^{2}}{\sqrt{2\sigma^{2}-1}} = 2\ln\sigma - \frac{1}{2}\ln(2\sigma^{2}-1)$$
and setting it to zero gives
$$\frac{2}{\sigma} - \frac{1}{2}\,\frac{4\sigma}{2\sigma^{2}-1} = 0 \quad\Longrightarrow\quad \frac{2}{\sigma} = \frac{2\sigma}{2\sigma^{2}-1},$$


which gives $\sigma^{2} = 2\sigma^{2} - 1$, or $\sigma = 1$. In this case $\mathrm{Var}_{p^*}\!\left[I_m(\vec{X})\,p(\vec{X})/p^*(\vec{X})\right] = P_m - P_m^{2}$, which means that the coefficient of variation (standard deviation divided by the mean) is $O(1/\sqrt{P_m})$, so many, many samples will be required to determine the value with Monte-Carlo sampling when $P_m$ is small. In this particular case, importance sampling is really of no benefit. This particular issue is known as the "dimensionality problem" of variance scaling.

On the other hand, suppose we set $\sigma = 1$ and take $\mu$ nonzero. Then we get instead
$$\frac{p(\vec{x})}{p^*(\vec{x})}\,p(\vec{x}) = \prod_{i=1}^{N}\frac{p^{2}(x_i)}{p^*(x_i)} = \prod_{i=1}^{N}\frac{1}{\sqrt{2\pi}}\,\frac{e^{-x_i^{2}}}{e^{-(x_i-\mu)^{2}/2}} = \prod_{i=1}^{N}\frac{1}{\sqrt{2\pi}}\,e^{-(x_i+\mu)^{2}/2 + \mu^{2}} = e^{N\mu^{2}}\prod_{i=1}^{N}\frac{1}{\sqrt{2\pi}}\,e^{-(x_i+\mu)^{2}/2}.$$
Now we have an integral that looks like we are computing the probability for a sum of Gaussians, each with mean $-\mu$ and variance 1, to be bigger than $m$, with a correction factor of $e^{N\mu^{2}}$. Thus, the result will be
$$\frac{1}{2}\,e^{N\mu^{2}}\,\mathrm{erfc}\!\left(\frac{m + N\mu}{\sqrt{2N}}\right) - P_m^{2}.$$
In this expression, if we assume $N$ is large we can use the asymptotic expansion

$$\mathrm{erfc}(z) \approx \frac{1}{\sqrt{\pi}\,z}\,e^{-z^{2}}.$$

The above expression is then approximately
$$\frac{1}{\sqrt{\pi}}\,e^{N\mu^{2}}\,\frac{\sqrt{2N}}{m + N\mu}\,e^{-(m+N\mu)^{2}/2N} - P_m^{2}.$$
Taking the logarithmic derivative of the first term, differentiating with respect to $\mu$, and neglecting some small terms, we get
$$2N\mu - \frac{2N(m+N\mu)}{2N} - \frac{N}{m+N\mu} = 0 \quad\Longrightarrow\quad N\mu - m - \frac{N}{m+N\mu} = 0$$
$$\Longrightarrow\quad (N\mu)^{2} = m^{2} + N \quad\Longrightarrow\quad \mu = \frac{1}{N}\sqrt{m^{2} + N}.$$
If, in addition, $m \gg \sqrt{N}$ (i.e., $P_m$ is small) this becomes $\mu \approx m/N$. In addition, in this case it is easy to check that $\mathrm{Var}_{p^*}\!\left[I_m(\vec{X})\,p(\vec{X})/p^*(\vec{X})\right] = O(P_m^{2})$, which means that it's possible to get fairly good results with a relatively small number of samples.

Figure 4.6: Histogram of importance-sampled Monte-Carlo results. Each individual numerical result is the estimated probability from 10,000 trials that the sum of 10 standard Gaussians is larger than or equal to 15. The histogram shows the results of 10,000 numerical experiments.

A simpler way to see this result is to determine the most probable way for $X_1 + X_2 + \cdots + X_N$ to achieve a particular sum $S$. Maximizing the probability for the sum of $N$ Gaussians is equivalent to minimizing $X_1^{2} + X_2^{2} + \cdots + X_N^{2}$. If we also want the sum to achieve a particular value $m$, we want to

$$\text{minimize } \sum_i X_i^{2} \quad\text{subject to the constraint}\quad \sum_i X_i = m,$$
which is a simple Lagrange multiplier problem, i.e.,

$$\text{minimize } \sum_i X_i^{2} + \lambda\left(\sum_i X_i - m\right).$$

Differentiating with respect to $X_k$, we find $2X_k + \lambda = 0$. Using the constraint then gives the solution $X_i = m/N$, i.e., the largest contribution to the overall probability comes from the same location in each Gaussian. This suggests, then, that when performing importance sampling we should shift the mean of each Gaussian by the same amount, $m/N$.

This type of biasing, where one shifts the mean of the distribution, is known as mean translation. Generally speaking, this method tends to work reasonably well in practice. The difficulty, of course, is figuring out a good shift of the mean.

As a specific example, take $N = 10$ and $m = 15$. The exact probability for the sum of 10 Gaussians to be larger than 15 is $1.05\times 10^{-6}$. Using Gaussians with a mean shifted by $m/N = 1.5$, producing 10,000 trials of the sum generates the sample mean value $1.05\times 10^{-6}$ with a sample standard deviation of $2.4\times 10^{-6}$, and a standard error of the mean of $2.4\times 10^{-8}$ (or a coefficient of variation of 0.023). The histogram of these results is shown in Fig. 4.6.
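The numbers quoted in this example are straightforward to reproduce; a minimal sketch (assuming the same $N = 10$, $m = 15$, $\mu = 1.5$, and 10,000 trials) is:

\begin{verbatim}
import numpy as np
from math import erfc, sqrt

N, m, M = 10, 15, 10_000
mu = m / N                          # mean translation suggested by the analysis above
rng = np.random.default_rng(1)

X = rng.normal(loc=mu, scale=1.0, size=(M, N))    # samples from the biased density
# likelihood ratio: prod_i p(x_i)/p*(x_i) = exp(-mu * sum_i x_i + N * mu^2 / 2)
L = np.exp(-mu * X.sum(axis=1) + N * mu**2 / 2)
vals = (X.sum(axis=1) >= m) * L                   # indicator times likelihood ratio
P_hat, stderr = vals.mean(), vals.std(ddof=1) / np.sqrt(M)

exact = 0.5 * erfc(m / sqrt(2 * N))
print(P_hat, stderr, exact)   # roughly 1.05e-6, a standard error near 2e-8, and 1.05e-6
\end{verbatim}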


4.3 Multiple importance sampling and balance heuristics

In many practical cases no single choice of biasing distribution can efficiently capture all the regions of sample space that give rise to the events of interest. In these cases, it is necessary to use importance sampling with more than one biasing distribution. The simultaneous use of different biasing methods is called multiple importance sampling. When using several biasing distributions $p_k^*(\vec{x})$, a difficulty arises about how to correctly weight the results coming from the different distributions. One solution to this problem can be found by assigning a weight $w_k(\vec{x})$ to each distribution and by rewriting the probability $P$ as:

$$P = \sum_{k=1}^{K} P_k = \sum_{k=1}^{K}\int w_k(\vec{x})\,I(\vec{x})\,L_k(\vec{x})\,p_k^*(\vec{x})\,d\vec{x}, \tag{4.1}$$
where $K$ is the number of different biasing distributions used and $L_k(\vec{x}) = p(\vec{x})/p_k^*(\vec{x})$ is the likelihood ratio for the $k$-th distribution. Note that the weights $w_k(\vec{x})$ depend on the value of the random variables for each individual sample. From Eq. (4.1), a multiply-importance-sampled Monte Carlo estimator for $P$ can now be written as

$$\hat P = \sum_{k=1}^{K} \hat P_k = \sum_{k=1}^{K}\frac{1}{M_k}\sum_{j=1}^{M_k} w_k(\vec{X}_{k,j}^*)\,I(\vec{X}_{k,j}^*)\,L_k(\vec{X}_{k,j}^*), \tag{4.2}$$
where $M_k$ is the number of samples drawn from the $k$-th distribution $p_k^*(\vec{x})$, and $\vec{X}_{k,j}^*$ is the $j$-th such sample.

Several ways exist to choose the weights $w_k(\vec{x})$, the particulars of which we will discuss momentarily. Generally, however, the quantity $\hat P$ is an unbiased estimator for $P$ (i.e., the expectation value of $\hat P$ is equal to $P$) for any choice of weights such that $\sum_{k=1}^{K} w_k(\vec{x}) = 1$ for all $\vec{x}$. Thus, each choice of weights corresponds to a different way of partitioning the total probability. The simplest possibility is just to set $w_k(\vec{x}) = 1/K$ for all $\vec{x}$, meaning that each distribution is assigned an equal weight in all regions of sample space. This choice is not advantageous, however, as we will see shortly.

If $\hat P$ is a multiply-importance-sampled Monte Carlo estimator defined according to Eq. (4.2), then, similarly to previous results, one can show that an unbiased estimator of its variance is

$$\hat\sigma^{2}_{\hat P} = \sum_{k=1}^{K}\frac{1}{M_k(M_k-1)}\sum_{j=1}^{M_k}\left( w_k(\vec{X}_{k,j}^*)\,L_k(\vec{X}_{k,j}^*)\,I(\vec{X}_{k,j}^*) - \hat P_k\right)^{2}. \tag{4.3}$$
Recursion relations can also be written so that $\hat\sigma^{2}_{\hat P}$ can be obtained without the need of storing all the individual samples until the end of the simulation:

$$\hat\sigma^{2}_{\hat P} = \sum_{k=1}^{K}\frac{1}{M_k(M_k-1)}\,\hat S_{k,M_k}, \tag{4.4}$$


with $\hat P = \sum_{k=1}^{K} \hat P_{k,M_k}$ and
$$\hat S_{k,j} = \hat S_{k,j-1} + \frac{j-1}{j}\left( w_k(\vec{X}_{k,j}^*)\,L_k(\vec{X}_{k,j}^*)\,I(\vec{X}_{k,j}^*) - \hat P_{k,j-1}\right)^{2}, \tag{4.5a}$$
$$\hat P_{k,j} = \frac{j-1}{j}\,\hat P_{k,j-1} + \frac{1}{j}\, w_k(\vec{X}_{k,j}^*)\,L_k(\vec{X}_{k,j}^*)\,I(\vec{X}_{k,j}^*). \tag{4.5b}$$
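A minimal sketch of these recursions (one accumulator per biasing distribution $k$; the class name and interface are chosen here purely for illustration) is:

\begin{verbatim}
class RunningISAccumulator:
    """Running mean and variance for one biasing distribution, following
    Eqs. (4.5a)-(4.5b); feed it the value w_k * L_k * I one sample at a time."""

    def __init__(self):
        self.j = 0        # number of samples processed so far
        self.P = 0.0      # running estimate   P_hat_{k,j}
        self.S = 0.0      # running accumulator S_hat_{k,j}

    def update(self, value):          # value = w_k(X) * L_k(X) * I(X) for one sample
        self.j += 1
        self.S += (self.j - 1) / self.j * (value - self.P) ** 2   # Eq. (4.5a)
        self.P += (value - self.P) / self.j                       # Eq. (4.5b), rearranged

    def contribution(self):
        """P_hat_{k,M_k} and its variance term S_hat/(M_k (M_k - 1)) from Eq. (4.4)."""
        var_k = self.S / (self.j * (self.j - 1)) if self.j > 1 else float("nan")
        return self.P, var_k
\end{verbatim}

Summing the two returned quantities over the $K$ distributions gives $\hat P$ and $\hat\sigma^{2}_{\hat P}$ from Eqs. (4.2) and (4.4).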

When using multiple importance sampling, the choice of weights $w_k(\vec{x})$ is almost as important as the choice of biasing distributions $p_k^*(\vec{x})$. Different weighting functions result in different values for the variance of the combined estimator. A poor choice of weights can result in a large variance, thus partially negating the gains obtained by importance sampling. The best weighting strategies are the ones that yield the smallest variance.

For example, consider the case where the weighting functions are constant over the whole domain. In this case,

$$P = \sum_{k=1}^{K} w_k\int I(\vec{x})\,L_k(\vec{x})\,p_k^*(\vec{x})\,d\vec{x} = \sum_{k=1}^{K} w_k\,E_{p_k^*}\!\left[I(\vec{x})\,L_k(\vec{x})\right]. \tag{4.6}$$
That is, the estimator is simply a weighted combination of the estimators obtained by using each of the biasing techniques. Unfortunately, the variance of $\hat P$ is also a weighted sum of the individual variances, $\sigma^{2}_{\hat P} = \sum_{k=1}^{K} w_k^{2}\sigma_k^{2}$, and if any of the sampling techniques is bad in a given region, then $\hat P$ will also have a high variance.

A relatively simple and particularly useful choice of weights is the balance heuristic (Veach, 1997; Owen and Zhou, 2000). In this case, the weights $w_k(\vec{x})$ are assigned according to

$$w_k(\vec{x}) = \frac{M_k\,p_k^*(\vec{x})}{\sum_{k'=1}^{K} M_{k'}\,p_{k'}^*(\vec{x})} = \frac{M_k/L_k(\vec{x})}{\sum_{k'=1}^{K} M_{k'}/L_{k'}(\vec{x})}. \tag{4.7}$$

The quantity $q_k(\vec{x}) = M_k\,p_k^*(\vec{x})$ is proportional to the expected number of hits from the $k$-th distribution. Thus, under the balance heuristic the weight associated with a sample $\vec{x}$ is given by the probability of realizing that sample with the $k$-th distribution relative to the total probability of realizing that same sample with all the distributions. Thus, Eq. (4.7) weights each distribution $p_k^*(\vec{x})$ most heavily in those regions of sample space where $p_k^*(\vec{x})$ is largest. The balance heuristic has been mathematically proven to be asymptotically close to optimal as the number of realizations becomes large (Veach, 1997). Results for the sum-of-Gaussians example are shown in Figs. 4.7 and 4.8.

Dividing both numerator and denominator by $p(\vec{x})$ allows Eq. (4.7) to be written alternatively as
$$w_k(\vec{x}) = \frac{M_k/L_k(\vec{x})}{\sum_{k'=1}^{K} M_{k'}/L_{k'}(\vec{x})}. \tag{4.8}$$



Figure 4.7: Multiple-importance sampled probability distribution (via histograms) for sum-of- Gaussians, showing results of individual biasing distributions.

This form, which only involves likelihood ratios, can be particularly convenient for use in Eq. (4.2), because of several simplifications which occur upon substitution,

$$\hat P = \sum_{k=1}^{K}\frac{1}{M_k}\sum_{j=1}^{M_k}\frac{M_k/L_k(\vec{X}_{k,j}^*)}{\sum_{k'=1}^{K} M_{k'}/L_{k'}(\vec{X}_{k,j}^*)}\,I(\vec{X}_{k,j}^*)\,L_k(\vec{X}_{k,j}^*) = \sum_{k=1}^{K}\sum_{j=1}^{M_k}\frac{I(\vec{X}_{k,j}^*)}{\sum_{k'=1}^{K} M_{k'}/L_{k'}(\vec{X}_{k,j}^*)}. \tag{4.9}$$
While it's a little difficult to interpret the weighted importance sampling result in this form, it does show that for each distribution and sample one only needs to keep track of a weighted harmonic average of the likelihood ratios. Note, however (and this is important), that one needs to evaluate all of the likelihood ratios for every sample drawn from every distribution. It may seem a little strange that when drawing samples $\vec{X}_{k,j}^*$ from distribution $k$ one needs to calculate likelihood ratios $L_{k'}(\vec{X}_{k,j}^*)$ for all $k'$, but one must remember that the likelihood ratios for $k' \ne k$ contain the information needed (albeit in a somewhat hidden form) to assess the relative probability of obtaining that particular sample from the different biasing distributions.
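To make Eq. (4.9) concrete, here is a minimal sketch of multiply-importance-sampled estimation with the balance heuristic for the sum-of-Gaussians example; the two mean shifts and the sample counts are illustrative choices, not the ones used to generate the figures.

\begin{verbatim}
import numpy as np

N, m = 10, 15
shifts = [1.0, 2.0]           # means of the two biasing distributions (illustrative)
M_k = [5_000, 5_000]          # number of samples drawn from each
rng = np.random.default_rng(2)

def log_L(x, mu):
    """log of the likelihood ratio p(x)/p_k*(x) for N-vectors x and a mean shift mu."""
    return -mu * x.sum(axis=1) + N * mu**2 / 2

P_hat = 0.0
for mu_k, Mk in zip(shifts, M_k):
    X = rng.normal(loc=mu_k, scale=1.0, size=(Mk, N))    # samples from p_k*
    # denominator of Eq. (4.9): sum over ALL distributions k' of M_k' / L_k'(X)
    denom = sum(Mkp * np.exp(-log_L(X, mup)) for mup, Mkp in zip(shifts, M_k))
    P_hat += np.sum((X.sum(axis=1) >= m) / denom)        # indicator over the combined weight

print(P_hat)    # estimate of P(Z_N >= m); the exact value is about 1.05e-6
\end{verbatim}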

References

Bayarri, M. J., Berger, J. O., Calder, E. S., et al. Using statistical and computer models to quantify volcanic hazards. Technometrics, 51(4):402–413, 2009.

Osborne, A. R., Onorato, M., and Serio, M. The nonlinear dynamics of rogue waves and holes in deep-water gravity wave trains. Physics Letters A, 275:386–393, 2000.



Figure 4.8: Multiple-importance sampled probability distribution for sum-of-Gaussians, show- ing individual biasing distributions weighted, and then combined, with balance heuristics.

Owen, Art and Zhou, Yi. Safe and effective importance sampling. Journal of the American Statistical Association, 95:135–143, 2000.

Veach, Eric. Robust Monte Carlo Methods for Light Transport Simulation. Ph.D. thesis, Stanford University, 1997.
