Monte Carlo Integration and Variance Reduction
1 Classical Monte Carlo Integration

1. The generic problem is evaluating the integral
$$E_f[h(X)] = \int_{\mathcal{X}} h(x) f(x)\,dx.$$
Simulate a random sample $X_i \sim f$, $i = 1, \dots, n$, and approximate the integral $E_f[h(X)]$ by the empirical average $\bar{h}_n = \sum_{i=1}^n h(X_i)/n$. This is called the Monte Carlo method.

By the SLLN, $\bar{h}_n \to E_f[h(X)]$ w.p. 1. We estimate the variance $\mathrm{var}(\bar{h}_n)$ with
$$v_n = \frac{1}{n}\,\mathrm{var}(h(X)) = \frac{1}{n}\int_{\mathcal{X}} \big(h(x) - E_f[h(X)]\big)^2 f(x)\,dx \approx \frac{1}{n(n-1)}\sum_{i=1}^n \big[h(X_i) - \bar{h}_n\big]^2.$$
For large $n$,
$$\frac{\bar{h}_n - E_f[h(X)]}{\sqrt{v_n}} \sim N(0, 1)$$
approximately. This leads to the construction of hypothesis tests and confidence limits for $E_f[h(X)]$.

2. Example: find $\int_0^1 h(x)\,dx$, where $h(x) = [\cos(50x) + \sin(20x)]^2$.

# A simple Monte Carlo integration
nsim = 10000; u = runif(nsim)
# The function to be integrated
mci.ex = function(x) (cos(50*x) + sin(20*x))^2
plot(function(x) mci.ex(x), xlim=c(0,1), ylim=c(0,4))
# The Monte Carlo estimate
sum(mci.ex(u))/nsim
# The true value
integrate(mci.ex, 0, 1)

After 10,000 iterations, the empirical average is 0.9785042; the true value is 0.965.

3. Example: Normal CDF. Based on an iid sample $X_1, \dots, X_n \sim N(0, 1)$, the Monte Carlo approximation of
$$\Phi(t) = \int_{-\infty}^{t} \frac{1}{\sqrt{2\pi}} e^{-y^2/2}\,dy$$
is
$$\hat{\Phi}(t) = \frac{1}{n}\sum_{i=1}^n I_{X_i \le t}, \qquad \mathrm{Var}(\hat{\Phi}(t)) = \frac{1}{n}\mathrm{Var}(I_{X_1 \le t}) = \frac{\Phi(t)(1 - \Phi(t))}{n}.$$
For values of $t$ around 0 the variance is approximately $\frac{1}{4n}$, so achieving a precision of four decimals requires on the order of $n = 10^8$ (100 million) simulations: too many! The variance of the estimate of a tail probability is smaller and thus requires relatively fewer simulations; yet problems still occur even when estimating tail probabilities, as demonstrated by a little R program:

# Problem with simple Monte Carlo for the normal tail probability P(X > a)
a = 2.5
nsim = 1000; y = rnorm(nsim); x = (y > a)
mean(x)
plot(cumsum(x)/c(1:nsim), type="l", ylim=c(0,.02))
abline(h=1-pnorm(a), lty="77", col="green")

2 Importance Sampling

1. Simulation from the true pdf $f$ is not necessarily optimal, as in the normal CDF example, which requires a large $n$. The method of importance sampling evaluates $E_f[h(X)]$ from an alternative representation: assuming $\mathrm{supp}(g) \supset \mathrm{supp}(f)$,
$$E_f[h(X)] = \int_{\mathcal{X}} h(x) f(x)\,dx = \int_{\mathcal{X}} h(x) \frac{f(x)}{g(x)}\, g(x)\,dx = \int_{\mathcal{X}} h(x) w(x) g(x)\,dx = E_g[w(X) h(X)],$$
where $w(x) := f(x)/g(x)$. Thus we can generate a sample $\{X_1, \dots, X_n\}$ from a given distribution $g$ and approximate
$$E_f[h(X)] = E_g[w(X) h(X)] \approx \frac{1}{n}\sum_{j=1}^n w(X_j) h(X_j).$$
The strong law of large numbers guarantees
$$\frac{1}{n}\sum_{j=1}^n w(X_j) h(X_j) \to E_f[h(X)] \quad \text{as } n \to \infty, \text{ w.p. } 1.$$

2. A simple example:

# Compute the mean and variance of X, where the target pdf is Gamma(3, 2/3),
# via importance sampling from a candidate pdf Exp(1).
# The true mean is 2 and the true variance is 4/3.
#
# This is not an accept-reject algorithm, so the candidate does NOT
# need to dominate the target pdf.
plot(function(x) dgamma(x, shape=3, scale=2/3), xlim=c(0,7), ylim=c(0,.5), col="red")
par(new=T)
plot(function(x) dexp(x), xlab="", xlim=c(0,7), ylim=c(0,.5), ylab="", xaxt="n", yaxt="n")
target <- function(x) (27/16)*(x^2)*exp(-3*x/2)
candidate <- function(x) exp(-x)
# Importance sampling from the candidate pdf Exp(1)
nsim = 10000; x = rexp(nsim)
m = sum(x*target(x)/candidate(x))/nsim
v = sum(((x-m)^2)*target(x)/candidate(x))/nsim
m; v
# Direct sampling from the target pdf Gamma(3, 2/3)
x = rgamma(nsim, shape=3, scale=2/3)
mean(x); var(x)

> set.seed(7)
> m; v
[1] 2.018805
[1] 1.376811
> mean(x); var(x)
[1] 1.980619
[1] 1.294792
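A quick diagnostic on the example above is to inspect the importance weights $w(X_j) = f(X_j)/g(X_j)$ themselves: their sample average should be close to 1, and dividing by the realized weight sum gives a self-normalized estimate (a point taken up in the facts below). A minimal sketch, reusing the target and candidate functions defined above:

# A minimal sketch (not part of the original example): inspect the importance
# weights and form a self-normalized estimate of the mean, reusing target()
# and candidate() from the example above.
nsim = 10000
x = rexp(nsim)                 # draws from the candidate Exp(1)
w = target(x)/candidate(x)     # importance weights f(x)/g(x)
mean(w)                        # should be close to 1
max(w)/sum(w)                  # one weight carrying a large share signals a poor candidate
sum(w*x)/sum(w)                # self-normalized estimate of the mean (true value 2)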
3. Example: Normal tail probabilities $P(Z > a) = \int_a^\infty \phi(x)\,dx$ for $a = 4.5$.

(a) "Naive" approach. Sample $X_i$, $i = 1, \dots, n$, directly from the target pdf $N(0, 1)$ and use
$$\frac{1}{n}\sum_{i=1}^n I_{X_i > a} \approx \int_a^\infty \phi(x)\,dx.$$

(b) Importance sampling approach. Sample $X_i$, $i = 1, \dots, n$, from a candidate pdf $g$ such that $g(x) = a e^{-a(x - a)}$, $x > a$, and use
$$\frac{1}{n}\sum_{i=1}^n \frac{\phi(X_i)}{g(X_i)} \approx \int_a^\infty \phi(x)\,dx.$$

(c) Transformation to uniform. Knowing that, with $y = 1/x$,
$$\int_a^\infty \phi(x)\,dx = \int_0^{1/a} \frac{\phi(1/y)}{y^2}\,dy = \int_{-\infty}^{\infty} \frac{\phi(1/y)}{y^2}\, I(0 < y < 1/a)\,dy = \int_{-\infty}^{\infty} \frac{\phi(1/y)}{y^2}\cdot\frac{1}{a}\cdot a\, I(0 < y < 1/a)\,dy,$$
sample $U_i \sim U(0, 1/a)$, i.e., use the candidate pdf $g(y) = a$ on $(0, 1/a)$, and use
$$\frac{1}{n}\sum_{i=1}^n \frac{\phi(1/U_i)}{a U_i^2} \approx \int_a^\infty \phi(x)\,dx.$$

4. Importance sampling facts:

(a) The estimator converges for any choice of the distribution $g$ as long as $\mathrm{supp}(g) \supset \mathrm{supp}(f)$.
(b) The instrumental distribution $g$ is chosen among distributions that are easy to sample from.
(c) The same sample generated from $g$ can be re-used repeatedly for different functions $h$ and/or $f$.
(d) Some choices are more desirable than others: for instance, the tail of $g$ should be thicker than that of $f$. If $\sup f/g = \infty$, the weights $f(x_j)/g(x_j)$ vary widely, giving too much importance to a few values of $x_j$.
(e) While $E_g(w(X)) = \int_{\mathcal{X}} w(x) g(x)\,dx = \int_{\mathcal{X}} \frac{f(x)}{g(x)}\, g(x)\,dx = 1$, the weights $w(X_i)$, $i = 1, \dots, n$, do not necessarily average to 1, so one might consider a self-normalized estimate $\tilde{\mu}$ of $\mu := E_f[h(X)]$,
$$\tilde{\mu} := \frac{\sum_{j=1}^n w(X_j) h(X_j)}{\sum_{j=1}^n w(X_j)}.$$
(f) While $\hat{\mu}$ is unbiased and $\tilde{\mu}$ is not, it can be shown that the latter estimate $\tilde{\mu}$ has a smaller MSE under some conditions.
(g) The variance of the estimator $\hat{E}_f[h(X)]$ is
$$\mathrm{var}(\hat{E}_f[h(X)]) = \frac{1}{n}\,\mathrm{var}\!\left(h(X_1)\frac{f(X_1)}{g(X_1)}\right).$$
Theoretically, the "best" $g$, the one that minimizes the variance of the importance sampling estimator, is
$$g^*(x) = \frac{|h(x)|\, f(x)}{\int_{\mathcal{W}} |h(w)|\, f(w)\,dw}.$$
Proof. Note that
$$\mathrm{var}\!\left(h(X)\frac{f(X)}{g(X)}\right) = E_g\!\left[h^2(X)\frac{f^2(X)}{g^2(X)}\right] - \left(E_g\!\left[h(X)\frac{f(X)}{g(X)}\right]\right)^2$$
and the second term does not depend on $g$. Thus it suffices to minimize the first term. Applying Jensen's inequality (the quadratic function is convex) to the first term, we have
$$E_g\!\left[h^2(X)\frac{f^2(X)}{g^2(X)}\right] \ge \left(E_g\!\left[|h(X)|\frac{f(X)}{g(X)}\right]\right)^2 = \left(\int_{\mathcal{W}} |h(w)|\, f(w)\,dw\right)^2,$$
which provides a lower bound independent of $g$. It is straightforward to verify that this lower bound is attained by the $g^*$ above. In practice the lower bound cannot be attained, because the ideal $g^*$ requires knowledge of the original integral $\int h(x) f(x)\,dx$, but it gives an idea of how to reduce the variance: choose a $g$ close to the ideal, in other words a $g$ such that $h(X)\frac{f(X)}{g(X)}$ is more or less constant.

5. Student's t distribution. Let $X \sim T(\nu, \theta, \sigma^2)$, with pdf
$$f(x) = \frac{\Gamma((\nu + 1)/2)}{\sigma\,\Gamma(\nu/2)\sqrt{\nu\pi}}\left(1 + \frac{(x - \theta)^2}{\nu\sigma^2}\right)^{-(\nu + 1)/2}.$$
W.l.o.g., take $\theta = 0$ and $\sigma = 1$. Compute the integral
$$\int_{2.1}^\infty x^5 f(x)\,dx.$$
Consider the candidates:
(a) $f$ itself
(b) Cauchy
(c) $N(0, 1)$
(d) $U(0, 1/2.1)$ after the $y = 1/x$ transform

Simulation results:
(a) $f$ itself (bad)
(b) Cauchy (good)
(c) $N(0, 1)$ (worst)
(d) $U(0, 1/2.1)$ (best)

Why? Look at the plots of $h(X)\frac{f(X)}{g(X)}$ for the various $g$!
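To make the comparison concrete, here is a minimal R sketch of the four estimators. The notes leave $\nu$ generic, so $\nu = 12$ is assumed here purely for illustration, with $h(x) = x^5\, I(x > 2.1)$ and $f$ the $t_\nu$ density.

# A minimal sketch of the four candidates above (nu = 12 is an assumption made
# here for illustration; the notes leave nu generic).
set.seed(1)
nsim = 10000
nu = 12
h = function(x) x^5 * (x > 2.1)
# (a) sample from f itself
x.f = rt(nsim, df=nu)
est.f = mean(h(x.f))
# (b) Cauchy candidate
x.c = rcauchy(nsim)
est.c = mean(h(x.c) * dt(x.c, df=nu) / dcauchy(x.c))
# (c) N(0,1) candidate
x.n = rnorm(nsim)
est.n = mean(h(x.n) * dt(x.n, df=nu) / dnorm(x.n))
# (d) U(0, 1/2.1) candidate after the y = 1/x transform
u = runif(nsim, 0, 1/2.1)
est.u = mean((1/u)^5 * dt(1/u, df=nu) / (2.1 * u^2))
est.f; est.c; est.n; est.u

Re-running with different seeds should show (c) fluctuating wildly while (d) stays very stable, consistent with the ranking above.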
3 Antithetic Sampling (Variates)

In estimating the integral
$$E_f[h(X)] = \int_{\mathcal{X}} h(x) f(x)\,dx,$$
we simulate a random sample $X_i \sim f$, $i = 1, \dots, n$, and then approximate the integral $E_f[h(X)]$ by the empirical average $\sum_{i=1}^n h(X_i)/n$.

The method of antithetic sampling is based on the idea that higher efficiency can be brought about by correlation. Given two samples $X_i \sim f$, $i = 1, \dots, n$, and $Y_i \sim f$, $i = 1, \dots, n$, the estimator
$$\frac{1}{2n}\sum_{i=1}^n \big(h(X_i) + h(Y_i)\big)$$
is more efficient than an estimator based on an iid sample of size $2n$ if the variables $h(X_i)$ and $h(Y_i)$ are negatively correlated. In this setting, the $Y_i$'s are called the antithetic variates.

The correlation between $h(X_i)$ and $h(Y_i)$ depends on $X_i$, $Y_i$, and $h$, and it remains to develop a method for generating these variables in a useful manner.
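As a minimal illustration of the idea: for $U \sim U(0, 1)$, the standard construction takes $Y = 1 - U$ as the antithetic variate, and when $h$ is monotone, $h(U)$ and $h(1 - U)$ are negatively correlated. The integrand below is chosen here only for illustration and is not from the notes.

# A minimal illustration of antithetic sampling: estimate E[h(U)] with
# h(x) = exp(x), so the true value is exp(1) - 1. The integrand is an
# illustrative choice, not from the notes.
set.seed(1)
n = 5000
h = function(x) exp(x)
u  = runif(n)
u2 = runif(n)                          # a second independent sample
# plain Monte Carlo with 2n iid draws
est.iid  = mean(c(h(u), h(u2)))
var.iid  = var(c(h(u), h(u2))) / (2*n)
# antithetic estimator based on the pairs (U, 1 - U)
pair.avg = (h(u) + h(1 - u)) / 2
est.anti = mean(pair.avg)
var.anti = var(pair.avg) / n
c(est.iid, est.anti)                   # both close to exp(1) - 1 = 1.718...
c(var.iid, var.anti)                   # the antithetic variance is much smaller
cor(h(u), h(1 - u))                    # the negative correlation drives the gain

More generally, if $X = F^{-1}(U)$ for a cdf $F$, then $Y = F^{-1}(1 - U)$ has the same marginal distribution, which is one common way to produce antithetic pairs beyond the uniform case.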