Monte Carlo Methods and Importance Sampling
Lecture Notes for Stat 578C, Statistical Genetics, 20 October 1999
© Eric C. Anderson (subbin' for E. A. Thompson)

History and definition: The term "Monte Carlo" was apparently first used by Ulam and von Neumann as a Los Alamos code word for the stochastic simulations they applied to building better atomic bombs. Their methods, involving the laws of chance, were aptly named after the international gaming destination; the moniker stuck, and soon after the War a wide range of sticky problems yielded to the new techniques. Despite the widespread use of the methods, and numerous descriptions of them in articles and monographs, it is virtually impossible to find a succinct definition of "Monte Carlo method" in the literature. Perhaps this is owing to the intuitive nature of the topic, which spawns many definitions by way of specific examples. Some authors prefer to use the term "stochastic simulation" for almost everything, reserving "Monte Carlo" only for Monte Carlo integration and Monte Carlo tests (cf. Ripley 1987). Others seem less concerned about blurring the distinction between simulation studies and Monte Carlo methods. Be that as it may, a suitable definition can be good to have, if for nothing other than to avoid the awkwardness of trying to define the Monte Carlo method by appealing to a whole bevy of examples of it. Since I am (so Elizabeth claims!) unduly influenced by my advisor's ways of thinking, I like to define Monte Carlo in the spirit of definitions she has used before. In particular, I use:

Definition: Monte Carlo is the art of approximating an expectation by the sample mean of a function of simulated random variables.
We will find that this definition is broad enough to cover everything that has been called Monte Carlo, and yet makes clear its essence in very familiar terms: Monte Carlo is about invoking laws of large numbers to approximate expectations.¹ While most Monte Carlo simulations are done by computer today, there were many applications of Monte Carlo methods using coin-flipping, card-drawing, or needle-tossing (rather than computer-generated pseudo-random numbers) as early as the turn of the century, long before the name Monte Carlo arose.

In more mathematical terms: Consider a (possibly multidimensional) random variable $X$ having probability mass function or probability density function $f_X(x)$ which is greater than zero on a set of values $\mathcal{X}$. Then the expected value of a function $g$ of $X$ is

$$E(g(X)) = \sum_{x \in \mathcal{X}} g(x) f_X(x) \qquad (1)$$

if $X$ is discrete, and

$$E(g(X)) = \int_{x \in \mathcal{X}} g(x) f_X(x)\,dx \qquad (2)$$

if $X$ is continuous. Now, if we were to take an $n$-sample of $X$'s, $(x_1, \ldots, x_n)$, and we computed the mean of $g(x)$ over the sample, then we would have the Monte Carlo estimate

$$\tilde{g}_n(x) = \frac{1}{n} \sum_{i=1}^{n} g(x_i)$$

of $E(g(X))$. We could, alternatively, speak of the random variable

$$\tilde{g}_n(X) = \frac{1}{n} \sum_{i=1}^{n} g(X_i),$$

which we call the Monte Carlo estimator of $E(g(X))$. If $E(g(X))$ exists, then the weak law of large numbers tells us that for any arbitrarily small $\epsilon > 0$,

$$\lim_{n \to \infty} P\bigl(|\tilde{g}_n(X) - E(g(X))| \geq \epsilon\bigr) = 0.$$

This tells us that as $n$ gets large, there is small probability that $\tilde{g}_n(X)$ deviates much from $E(g(X))$. For our purposes, the strong law of large numbers says much the same thing; the important part is that so long as $n$ is large enough, $\tilde{g}_n(x)$ arising from a Monte Carlo experiment will be close to $E(g(X))$, as desired. One other thing to note at this point is that $\tilde{g}_n(X)$ is unbiased for $E(g(X))$:

$$E(\tilde{g}_n(X)) = E\left(\frac{1}{n}\sum_{i=1}^{n} g(X_i)\right) = \frac{1}{n}\sum_{i=1}^{n} E(g(X_i)) = E(g(X)).$$

¹ This applies when the simulated variables are independent of one another, and might apply when they are correlated with one another (for example if they are states visited by an ergodic Markov chain). For now we will just deal with independent simulated random variables, but all of this extends to samples from Markov chains via the weak law of large numbers for the number of passages through a recurrent state in an ergodic Markov chain (see Feller 1957). You will encounter this later when talking about MCMC.

© Eric C. Anderson. You may duplicate this freely for pedagogical purposes.

Making this useful: The preceding section comes to life and becomes useful when one realizes that very many quantities of interest may be cast as expectations. Most importantly for applications in statistical genetics, it is possible to express all probabilities, integrals, and summations as expectations:

Probabilities: Let $Y$ be a random variable. The probability that $Y$ takes on some value in a set $A$ can be expressed as an expectation using the indicator function:

$$P(Y \in A) = E(I_{\{A\}}(Y)) \qquad (3)$$

where $I_{\{A\}}(Y)$ is the indicator function that takes the value 1 when $Y \in A$ and 0 when $Y \notin A$.

Integrals: Consider a problem now which is completely deterministic: integrating a function $q(x)$ from $a$ to $b$ (as in high-school calculus). So we have $\int_a^b q(x)\,dx$. This can be expressed as an expectation with respect to a uniformly distributed, continuous random variable $U$ between $a$ and $b$. $U$ has density function $f_U(u) = 1/(b-a)$, so if we rewrite the integral we get

$$\int_a^b q(x)\,dx = (b-a)\int_a^b q(x) f_U(x)\,dx = (b-a)\,E(q(U)) \quad \ldots \text{voil\`a!}$$

Discrete sums: The discrete version of the above is just the sum of a function $q(x)$ over the countably many values of $x$ in a set $A$. If we have a random variable $W$ which takes values in $A$, all with equal probability $p$ (so that $\sum_{w \in A} p = 1$), then the sum may be cast as the expectation

$$\sum_{x \in A} q(x) = \frac{1}{p} \sum_{x \in A} q(x)\,p = \frac{1}{p}\,E(q(W)).$$

The immediate consequence of this is that all probabilities, integrals, and summations can be approximated by the Monte Carlo method.
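To make these three identities concrete, here is a minimal Python sketch (not part of the original notes; the function $q(x) = x^2$ and the sets involved are made up for illustration) that approximates a probability, an integral, and a discrete sum by sample means of simulated uniform random variables:

```python
import random

random.seed(1)
n = 200_000

# Probability: P(Y in A) = E[I_A(Y)] for Y ~ Uniform(0, 1) and A = [0, 0.3);
# estimated by the fraction of simulated Y's that land in A.  True value: 0.3.
prob_est = sum(1 for _ in range(n) if random.random() < 0.3) / n

# Integral: for q(x) = x^2 on [0, 1], the integral is (1 - 0) * E[q(U)]
# with U ~ Uniform(0, 1).  True value: 1/3.
integral_est = sum(random.random() ** 2 for _ in range(n)) / n

# Discrete sum: sum of q(x) over A = {0, ..., 9} equals (1/p) * E[q(W)]
# with W uniform on A and p = 1/10.  True value: sum of x^2 = 285.
A = range(10)
p = 1.0 / len(A)
sum_est = (1.0 / p) * sum(random.choice(A) ** 2 for _ in range(n)) / n

print(prob_est, integral_est, sum_est)
```

With $n = 200{,}000$ draws, each estimate should land within about one percent of the exact values $0.3$, $1/3$, and $285$.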
A crucial thing to note, however, is that there is no restriction that says $U$ or $W$ above must have uniform distributions. That was just for easy illustration of the points above. We will explore this point more while considering importance sampling.

Example I: Approximating probabilities by Monte Carlo. Consider a Wright-Fisher population of size $N$ diploid individuals in which $X_t$ counts the number of copies of a certain allelic type in the population at time $t$. At time zero, there are $x_0$ copies of the allele. Given this, what is the probability that the allele will be lost from the population in $t$ generations, i.e., $P(X_t = 0 \mid X_0 = x_0)$? This can be computed exactly by multiplying transition probability matrices together, or by employing the Baum (1972) algorithm (which you will learn about later), but it can also be approximated by Monte Carlo. It is simple to simulate genetic drift in a Wright-Fisher population; thus we can easily simulate values for $X_t$ given $X_0 = x_0$. Then,

$$P(X_t = 0 \mid X_0 = x_0) = E(I_{\{0\}}(X_t) \mid X_0 = x_0)$$

where $I_{\{0\}}(X_t)$ takes the value 1 when $X_t = 0$ and 0 otherwise. Denoting the $i$th simulated value of $X_t$ by $x_t^{(i)}$, our Monte Carlo estimate would be

$$\tilde{P}(X_t = 0 \mid X_0 = x_0) = \frac{1}{n} \sum_{i=1}^{n} I_{\{0\}}(x_t^{(i)}).$$

Example II: Monte Carlo approximations to distributions. A simple extension of the above example is to approximate the whole probability distribution $P(X_t \mid X_0 = x_0)$ by Monte Carlo. Consider the histogram below:

[Figure: histogram of Monte Carlo estimates of probability (vertical axis, 0 to 0.04) against number of copies of the allele (horizontal axis, 0 to 200).]

It represents the results of simulations in which $n = 50{,}000$, $x_0 = 60$, $t = 14$, the Wright-Fisher population size $N = 100$ diploids, and each rectangle represents a Monte Carlo approximation to $P(a \leq X_{14} < a+2 \mid X_0 = 60)$, for $a = 0, 2, 4, \ldots, 200$.
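The simulation recipe behind these estimates can be sketched in a few lines of Python. The code below is an illustrative implementation, not the author's: each generation's allele count is drawn by binomial sampling of the $2N$ gene copies, and the loss probability of Example I is estimated as the fraction of replicates with $X_t = 0$. The parameter values here are made up (and smaller than those in the text) to keep the run fast:

```python
import random

def simulate_wright_fisher(x0, t, N, rng):
    """Simulate t generations of genetic drift in a Wright-Fisher
    population of N diploids (2N gene copies), starting from x0 copies
    of the allele; return X_t."""
    x = x0
    for _ in range(t):
        freq = x / (2 * N)
        # Each of the 2N copies in the next generation is of the focal
        # allelic type with probability equal to its current frequency.
        x = sum(rng.random() < freq for _ in range(2 * N))
    return x

def mc_loss_probability(x0, t, N, n, rng):
    """Monte Carlo estimate of P(X_t = 0 | X_0 = x0): the fraction of n
    simulated trajectories in which the allele is lost by time t."""
    return sum(simulate_wright_fisher(x0, t, N, rng) == 0
               for _ in range(n)) / n

rng = random.Random(1999)
# A rare allele (x0 = 2 copies out of 2N = 100) is quite likely to
# drift to loss within t = 20 generations.
est = mc_loss_probability(x0=2, t=20, N=50, n=1000, rng=rng)
print(est)
```

Approximating the whole distribution, as in Example II, just means tallying the simulated $x_t^{(i)}$ values into bins instead of checking only for zero.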
For each such probability, the approximation follows from

$$P(a \leq X_{14} < a+2 \mid X_0 = 60) = E\bigl(I_{\{a \leq X < a+2\}}(X_{14})\bigr) \approx \frac{1}{n} \sum_{i=1}^{n} I_{\{a \leq X < a+2\}}(x_{14}^{(i)}), \qquad a = 0, 2, \ldots, 200.$$

Example III: A discrete sum over latent variables. In many applications in statistical genetics, the probability $P(Y)$ of an observed event $Y$ must be computed as the sum over very many latent variables $X$ of the joint probability $P(Y, X)$. In such a case, $Y$ is typically fixed, i.e., we have observed $Y = y$, so we are interested in $P(Y = y)$, but we cannot observe the values of the latent variables, which may take values in the space $\mathcal{X}$. Though it follows from the laws of probability that

$$P(Y = y) = \sum_{x \in \mathcal{X}} P(Y = y, X = x),$$

quite often $\mathcal{X}$ is such a large space (contains so many elements) that it is impossible to compute the sum. Application of the law of conditional probability, however, gives

$$P(Y = y) = \sum_{x \in \mathcal{X}} P(Y = y, X = x) = \sum_{x \in \mathcal{X}} P(Y = y \mid X = x)\,P(X = x).$$
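This last sum is an expectation over $X$, namely $E_X[P(Y = y \mid X)]$, so it can be approximated by averaging $P(Y = y \mid X = x_i)$ over simulated draws $x_i$ from $P(X)$. The toy Python sketch below checks the two expressions against each other; the three-state latent distribution `p_x` and the conditionals `p_y_given_x` are made up for illustration, not from the notes:

```python
import random

# Toy latent-variable model: X takes three values with probabilities
# p_x, and P(Y = y | X = x) is given per latent state.
p_x = {0: 0.5, 1: 0.3, 2: 0.2}
p_y_given_x = {0: 0.1, 1: 0.6, 2: 0.9}

# Exact: P(Y = y) = sum over x of P(Y = y | X = x) * P(X = x).
exact = sum(p_y_given_x[x] * p_x[x] for x in p_x)

# Monte Carlo: P(Y = y) = E_X[ P(Y = y | X) ], estimated by averaging
# the conditional probability over draws x_i simulated from P(X).
rng = random.Random(42)
states, weights = zip(*p_x.items())
n = 100_000
draws = rng.choices(states, weights=weights, k=n)
mc = sum(p_y_given_x[x] for x in draws) / n

print(exact, mc)
```

Here the exact sum is $0.5 \cdot 0.1 + 0.3 \cdot 0.6 + 0.2 \cdot 0.9 = 0.41$, and the Monte Carlo average should agree with it to within a few thousandths.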