
Monte Carlo Integration and Variance Reduction

1 Classical Monte Carlo Integration

1. Generic problem of evaluating the integral

E_f[h(X)] = \int_X h(x) f(x) dx.

Simulate a random sample X_i ∼ f, i = 1, ..., n, and approximate the integral E_f[h(X)] by the empirical average

\bar{h}_n = \frac{1}{n} \sum_{i=1}^n h(X_i).

This is called the classical Monte Carlo method. By the SLLN, \bar{h}_n → E_f[h(X)] w.p. 1.

We estimate the variance var(\bar{h}_n) by

v_n = \widehat{var}(\bar{h}_n) = \frac{1}{n} var(h(X)) = \frac{1}{n} \int_X (h(x) - E_f[h(X)])^2 f(x) dx \approx \frac{1}{n(n-1)} \sum_{i=1}^n [h(X_i) - \bar{h}_n]^2.

For large n,

\frac{\bar{h}_n - E_f[h(X)]}{\sqrt{v_n}} ∼ N(0, 1) approximately.

This can lead to the construction of hypothesis tests and confidence limits for Ef [h(X)].

2. Example: find \int_0^1 h(x) dx, where h(x) = [\cos(50x) + \sin(20x)]^2.

###This gives a simple Monte Carlo integration###
nsim=10000; u=runif(nsim)

#The function to be integrated
mci.ex = function(x){(cos(50*x)+sin(20*x))^2}
plot(function(x) mci.ex(x), xlim=c(0,1), ylim=c(0,4))

#The Monte Carlo sum
sum(mci.ex(u))/nsim
#The true value is
integrate(mci.ex,0,1)

After 10,000 iterations, the empirical average is 0.9785042. The true value is about 0.965.
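A minimal sketch (reusing nsim, u, and mci.ex from the chunk above) of the variance estimate v_n and the resulting approximate 95% confidence limits from item 1:

#Variance estimate and approximate 95% confidence limits
#(reuses nsim, u, and mci.ex from the chunk above)
h=mci.ex(u)
hbar=mean(h)
vn=var(h)/nsim                           #estimate of var(hbar_n)
hbar+c(-1,1)*qnorm(.975)*sqrt(vn)        #approximate 95% confidence limits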

3. Example: Normal CDF. The approximation of

\Phi(t) = \int_{-\infty}^{t} \frac{1}{\sqrt{2\pi}} e^{-y^2/2} dy

by the Monte Carlo method is

\hat{\Phi}(t) = \frac{1}{n} \sum_{i=1}^n I_{X_i \le t},

with

Var(\hat{\Phi}(t)) = \frac{1}{n} Var(I_{X_1 \le t}) = \frac{\Phi(t)(1 - \Phi(t))}{n}.

For values of t around 0 the variance is approximately 1/(4n), so achieving a precision of four decimal places requires n of about 100 million, which is too many! The variance of the estimate of a tail probability is smaller and thus requires a relatively smaller n; yet problems still occur even when estimating tail probabilities, as demonstrated by the little R program below.

#
#Problem with simple Monte Carlo for normal tail probability P(X>a)
#
#Simple Monte Carlo
a=2.5
nsim=1000; y=rnorm(nsim); x=(y>a); mean(x)
plot(cumsum(x)/c(1:nsim),type="l",ylim=c(0,.02))
abline(h=1-pnorm(a),lty="77",col="green")

2 Importance Sampling

1. Simulation from the true pdf f is not necessarily optimal, as in the normal CDF example, which requires a large n. The method of Importance Sampling evaluates E_f[h(X)] using an alternative representation: assuming supp(g) ⊃ supp(f), we have

E_f[h(X)] = \int_X h(x) f(x) dx = \int_X h(x) \frac{f(x)}{g(x)} g(x) dx = \int_X h(x) w(x) g(x) dx = E_g[w(X) h(X)],

where w(x) := f(x)/g(x).

Thus, we can generate a sample {X_1, ..., X_n} from a given distribution g and approximate

E_f[h(X)] = E_g[w(X) h(X)] \approx \frac{1}{n} \sum_{j=1}^n w(X_j) h(X_j).

The strong law of large numbers guarantees

\frac{1}{n} \sum_{j=1}^n w(X_j) h(X_j) → E_f[h(X)] as n → ∞ w.p. 1.

2. A simple example:

#
#Compute the mean and variance of X where X follows the target pdf
#Gamma(3,2/3)
#via importance sampling from a candidate pdf exp(1)
#The true mean is 2, and true variance is 4/3
#
#This is not an accept-reject algorithm; thus the candidate does NOT
#need to dominate the target pdf

plot(function(x)dgamma(x,shape=3,scale=2/3),xlim=c(0,7),
     ylim=c(0,.5),col="red")
par(new=T)
plot(function(x)dexp(x),xlab="",xlim=c(0,7),
     ylim=c(0,.5),ylab="",xaxt="n",yaxt="n")

target <- function(x)((27/16)*(x^2)*exp(-3*x/2))
candidate <- function(x)(exp(-x))

#Compute the mean and variance of X
#via importance sampling from a candidate pdf exp(1)
nsim=10000; x=rexp(nsim)
m=sum(x*target(x)/candidate(x))/nsim
v=sum(((x-m)^2)*target(x)/candidate(x))/nsim
m; v

#Compute the mean and variance of X
#via direct sampling from the target pdf Gamma(3,2/3)
x=rgamma(nsim,shape=3,scale=2/3)
mean(x); var(x)

> set.seed(7)
> m;v
[1] 2.018805
[1] 1.376811
> mean(x); var(x)
[1] 1.980619
[1] 1.294792

3. Example: Normal Tail Probabilities P(Z > a) = \int_a^{\infty} \phi(x) dx for a = 4.5.

(a) "Naive" approach. Sample X_i, i = 1, ..., n directly from the target pdf N(0, 1) and use

\frac{1}{n} \sum_{i=1}^n I_{X_i > a} \approx \int_a^{\infty} \phi(x) dx.

(b) Importance Sampling approach. Sample X_i, i = 1, ..., n from a candidate pdf g such that g(x) = a e^{-a(x-a)}, x > a, and use

\frac{1}{n} \sum_{i=1}^n \frac{\phi(X_i)}{g(X_i)} \approx \int_a^{\infty} \phi(x) dx.

(c) Transformation to uniform. Knowing that

\int_a^{\infty} \phi(x) dx = \int_0^{1/a} \frac{\phi(1/y)}{y^2} dy = \int_{-\infty}^{\infty} \frac{\phi(1/y)}{y^2} I(0 < y < 1/a) dy = \int_{-\infty}^{\infty} \frac{\phi(1/y)}{y^2} \cdot \frac{1}{a} \cdot u_{(0,1/a)}(y) dy,

where y = 1/x and u_{(0,1/a)} denotes the U(0, 1/a) density, sample U_i ∼ U(0, 1/a), i.e., the candidate pdf is g(y) = a on (0, 1/a), and use

\frac{1}{n} \sum_{i=1}^n \frac{\phi(1/U_i)}{a U_i^2} \approx \int_a^{\infty} \phi(x) dx.

A small R comparison of the three approaches is sketched below.
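This is a minimal sketch for a = 4.5; the shifted exponential candidate in (b) is sampled as a + rexp(n, rate = a):

#Normal tail probability P(Z > a) for a = 4.5, three approaches
a=4.5; nsim=10000
#(a) naive: sample directly from N(0,1)
x=rnorm(nsim); mean(x>a)
#(b) importance sampling: candidate g(x)=a*exp(-a*(x-a)), x>a,
#    i.e. a shifted Exponential(rate=a)
y=a+rexp(nsim,rate=a); mean(dnorm(y)/(a*exp(-a*(y-a))))
#(c) transformation y=1/x: sample U ~ U(0,1/a)
u=runif(nsim,0,1/a); mean(dnorm(1/u)/(a*u^2))
#exact value
pnorm(a,lower.tail=FALSE)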

4. Importance Sampling facts

(a) Converges for any choice of the distribution g as long as supp(g) ⊃ supp(f).

(b) The instrumental distribution g is chosen from distributions that are easy to sample.

(c) The same sample generated from g can be re-used repeatedly for different functions h and/or different f.

(d) Some choices are preferable: e.g., the tail of g should be thicker than that of f. If sup f/g = ∞, the weights f(x_j)/g(x_j) vary widely, giving too much importance to a few values of x_j.

(e) While E_g(w(X)) = \int_X w(x) g(x) dx = \int_X \frac{f(x)}{g(x)} g(x) dx = 1, the weights w(X_i), i = 1, ..., n do not necessarily average to 1, so one might consider the self-normalized estimate \tilde{\mu} of \mu := E_f[h(X)],

\tilde{\mu} := \frac{\sum_{j=1}^n w(X_j) h(X_j)}{\sum_{j=1}^n w(X_j)}.

(f) While \hat{\mu} is unbiased and \tilde{\mu} is not, it can be shown that the latter estimate \tilde{\mu} has a smaller MSE under some conditions.

(g) The variance of the estimator \hat{E}_f[h(X)] is

var(\hat{E}_f[h(X)]) = \frac{1}{n} var\left( h(X_1) \frac{f(X_1)}{g(X_1)} \right).

Theoretically, the "best" g that minimizes the variance of the importance sampling estimator is

g^*(x) = \frac{|h(x)| f(x)}{\int_W |h(w)| f(w) dw}.

Proof. Note that

var\left( h(X) \frac{f(X)}{g(X)} \right) = E_g\left( h^2(X) \frac{f^2(X)}{g^2(X)} \right) - \left( E_g\left( h(X) \frac{f(X)}{g(X)} \right) \right)^2

and the second term does not depend on g. Thus it suffices to minimize the first term. Applying Jensen's inequality (the quadratic function is convex) to the first term, we have

E_g\left( h^2(X) \frac{f^2(X)}{g^2(X)} \right) \ge \left( E_g\left( |h(X)| \frac{f(X)}{g(X)} \right) \right)^2 = \left( \int_W |h(w)| f(w) dw \right)^2,

which provides a lower bound that is independent of g. It is straightforward to verify that this lower bound is achieved by the g^* above. In practice the lower bound cannot be attained, because the ideal g^* requires knowledge of the original integral \int h(x) f(x) dx, but it gives us an idea of how to reduce the variance: choose a g close to the ideal. In other words, choose g such that h(x) f(x)/g(x) is more or less constant.

5. Student's t distribution. Let X ∼ T(ν, θ, σ^2), with pdf

f(x) = \frac{\Gamma((\nu+1)/2)}{\sigma \Gamma(\nu/2) \sqrt{\nu\pi}} \left( 1 + \frac{(x - \theta)^2}{\nu\sigma^2} \right)^{-(\nu+1)/2}.

W.l.o.g., take θ = 0 and σ = 1. Compute the integral

\int_{2.1}^{\infty} x^5 f(x) dx.

Consider candidates:

(a) f itself
(b) Cauchy
(c) N(0, 1)
(d) U(0, 1/2.1) after the y = 1/x transform

Simulation results:

(a) f itself (bad)
(b) Cauchy (good)
(c) N(0, 1) (worst)
(d) U(0, 1/2.1) (best)

Why? Look at the plots of h(x) f(x)/g(x) for the various g!
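A small simulation sketch of the four candidates is given below; the notes do not fix ν, so ν = 12 is purely an illustrative assumption:

#Importance sampling for the integral of x^5*f(x) over (2.1, Inf)
#nu=12 is an illustrative choice; the notes leave nu unspecified
nu=12; nsim=50000
h=function(x) x^5*(x>2.1)
#(a) sample from f itself
x=rt(nsim,df=nu); mean(h(x))
#(b) Cauchy candidate
x=rcauchy(nsim); mean(h(x)*dt(x,df=nu)/dcauchy(x))
#(c) N(0,1) candidate: thin tails, unstable weights
x=rnorm(nsim); mean(h(x)*dt(x,df=nu)/dnorm(x))
#(d) U(0,1/2.1) candidate after the y=1/x transform
u=runif(nsim,0,1/2.1); mean(dt(1/u,df=nu)/(2.1*u^7))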

3 Antithetic Sampling (Variates)

In estimating the integral

E_f[h(X)] = \int_X h(x) f(x) dx,

we simulate a random sample X_i ∼ f, i = 1, ..., n, and then approximate the integral E_f[h(X)] by the empirical average \sum_{i=1}^n h(X_i)/n.

The method of antithetic sampling is based on the idea that higher efficiency can be brought about by correlation. Given two samples X_i ∼ f, i = 1, ..., n and Y_i ∼ f, i = 1, ..., n, the estimator

\frac{1}{2n} \sum_{i=1}^n (h(X_i) + h(Y_i))

is more efficient than an estimator based on an iid sample of size 2n if the variables h(X_i) and h(Y_i) are negatively correlated. In this setting, the Y_i's are called the antithetic variates. The correlation between h(X_i) and h(Y_i) depends on X_i, Y_i, and h, and it remains to develop a method for generating these variables in a useful manner. Rubinstein (1981) proposed to use the uniform variables U_i to generate the X_i and the variables 1 - U_i to generate the Y_i using the Inverse Transform method. He showed that h(X_i) and h(Y_i) are negatively correlated if h is a monotonic function.

Example: (e.g. 5.6 Rizzo) To estimate \theta = \Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-t^2/2} dt using antithetic variates. Note that

\Phi(x) = .5 + \frac{1}{\sqrt{2\pi}} \int_0^x e^{-t^2/2} dt = .5 + \frac{1}{\sqrt{2\pi}} \int_0^1 x e^{-(xy)^2/2} dy

by the substitution y = t/x. Thus, the target parameter to be estimated is

\theta = E_U\left( x e^{-(xU)^2/2} \right),

where U ∼ U(0, 1). The antithetic variate approach achieves approximately a 99.5% reduction in variance at x = 1.95.
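A minimal sketch along the lines of Rizzo's Example 5.6, comparing plain Monte Carlo and antithetic pairs over repeated replications:

#Antithetic variates for estimating Phi(1.95)
x=1.95; nsim=1000; m=1000
MC1=MC2=numeric(m)
for (i in 1:m) {
  u=runif(nsim/2); v=runif(nsim/2)
  MC1[i]=.5+mean(x*exp(-(x*c(u,v))^2/2))/sqrt(2*pi)    #nsim iid uniforms
  MC2[i]=.5+mean(x*exp(-(x*c(u,1-u))^2/2))/sqrt(2*pi)  #antithetic pairs
}
mean(MC1); mean(MC2)
pnorm(1.95)                 #true value
1-var(MC2)/var(MC1)         #approximate proportion of variance reduced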

4 Control Variates

Another way to reduce the variance of a Monte Carlo estimator of \theta = E_f[h(X)] = \int_X h(x) f(x) dx is the use of control variates. Suppose that there is a function h_1 such that \mu = E_f[h_1(X)] is known, and h_1(X) is correlated with h(X). Then for a constant c, it is easy to check that

\hat{\theta}_c = h(X) + c(h_1(X) - \mu)

is an unbiased estimator of \theta. The variance

var(\hat{\theta}_c) = var(h(X)) + c^2 var(h_1(X)) + 2c\, cov(h(X), h_1(X))

is a quadratic function of c. It is minimized at c = c^*, where

c^* = - \frac{cov(h(X), h_1(X))}{var(h_1(X))},

and the minimum variance is

var(\hat{\theta}_{c^*}) = var(h(X)) - \frac{cov(h(X), h_1(X))^2}{var(h_1(X))}.

To compute the constant c^* in practice, we need to estimate cov(h(X), h_1(X)) and var(h_1(X)) from a pilot Monte Carlo study.

The rv h_1(X) is called a control variate for the rv h(X). The estimators \hat{\theta} = h(X) and \hat{\theta}_c = h(X) + c(h_1(X) - \mu) are both unbiased, but the second (with c = c^*) has a smaller variance if h(X) and h_1(X) are strongly correlated. No reduction of variance is possible when h(X) and h_1(X) are uncorrelated. Note that the proportion of reduction in variance is

\frac{cov(h(X), h_1(X))^2 / var(h_1(X))}{var(h(X))} = cor(h(X), h_1(X))^2.

Example: (e.g. 5.8 Rizzo) To estimate

\theta = \int_0^1 \frac{e^{-x}}{1 + x^2} dx,

the parameter of interest is \theta = E(g(X)) with g(x) = \frac{e^{-x}}{1 + x^2}, where X is uniform on (0, 1). We seek a function f(x) that is 'close' to g(x), with known expected value, such that g(X) and f(X) are strongly correlated. For example, the function f(x) = e^{-.5}(1 + x^2)^{-1} works OK, and

E(f(U)) = e^{-.5} \int_0^1 \frac{1}{1 + u^2} du = e^{-.5} \tan^{-1}(1) = e^{-.5} \pi/4.
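A minimal sketch of the control variate estimate for this example, estimating c^* from a pilot run:

#Control variate estimate of theta
g=function(x) exp(-x)/(1+x^2)
f=function(x) exp(-.5)/(1+x^2)
mu=exp(-.5)*pi/4                      #known E f(U)
#pilot study to estimate c*
u=runif(10000)
cstar=-cov(g(u),f(u))/var(f(u))
#control variate estimate versus plain Monte Carlo
u=runif(10000)
mean(g(u))                            #plain Monte Carlo
mean(g(u)+cstar*(f(u)-mu))            #control variate estimate
cor(g(u),f(u))^2                      #approximate proportion of variance reduced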

4.1 Antithetic Variate as Control Variate

The antithetic variate estimator is a special case of the control variate estimator. First note that the control variate estimator is a linear combination of unbiased estimators of \theta, because

\hat{\theta}_c = h(X) + c(h_1(X) - \mu) = c(h(X) + h_1(X) - \mu) + (1 - c) h(X).

In general, if \hat{\theta}_1 and \hat{\theta}_2 are any two unbiased estimators of \theta, then for every constant c,

\hat{\theta}_c = c\hat{\theta}_1 + (1 - c)\hat{\theta}_2

is also unbiased for \theta. The variance of

\hat{\theta}_c = c\hat{\theta}_1 + (1 - c)\hat{\theta}_2 = \hat{\theta}_2 + c(\hat{\theta}_1 - \hat{\theta}_2)

is

var(\hat{\theta}_2) + c^2 var(\hat{\theta}_1 - \hat{\theta}_2) + 2c\, cov(\hat{\theta}_2, \hat{\theta}_1 - \hat{\theta}_2).

In the special case of antithetic variates, \hat{\theta}_1 and \hat{\theta}_2 are identically distributed with cor(\hat{\theta}_1, \hat{\theta}_2) = -1. Thus cov(\hat{\theta}_1, \hat{\theta}_2) = -var(\hat{\theta}_1), and so the variance of \hat{\theta}_c equals

(4c^2 - 4c + 1)\, var(\hat{\theta}_1)

after algebraic simplification. The optimal constant is c^* = 1/2. The control variate estimator in this case is

\hat{\theta}_c = \frac{\hat{\theta}_1 + \hat{\theta}_2}{2},

which is the antithetic variates estimator of \theta.

4.2 Control Variates and Regression

In this subsection we will see the relationship between control variates and simple linear regression. This will provide insight into how control variates reduce the variance in Monte Carlo integration. In addition, we obtain a convenient method for estimating the optimal constant c^*, the target parameter \theta, the percent reduction in variance, and the standard error of the estimator, all by fitting a simple linear regression model.

Consider fitting a standard regression model to the data {(X_i, Y_i), i = 1, 2, ..., n}, with means (\mu_X, \mu_Y) and variances (\sigma_X^2, \sigma_Y^2),

Y = \beta_0 + \beta_1 X + \epsilon,

with

\hat{\beta}_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \frac{\widehat{cov}(X, Y)}{\widehat{var}(X)}

and \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}.

Now replace X by h_1(X) and Y by h(X); the regression model for the data {(h_1(X_i), h(X_i)), i = 1, 2, ..., n}, with means (E(h_1(X)), E(h(X))) and variances (var(h_1(X)), var(h(X))), is

h(X) = \beta_0 + \beta_1 h_1(X) + \epsilon.

Taking expectations on both sides, we have

E(h(X)) = \beta_0 + \beta_1 E(h_1(X)).

The least squares estimator of the slope is

\hat{\beta}_1 = \frac{\widehat{cov}(h_1(X), h(X))}{\widehat{var}(h_1(X))} = -\hat{c}^*.

Alternatively, we note that the estimator of the target parameter \theta, \hat{\theta}_{\hat{c}^*}, is the predicted value of the response variable at the point \mu = E(h_1(X)), i.e.,

\hat{\theta}_{\hat{c}^*} = \hat{E}(h(X)) = \hat{\beta}_0 + \hat{\beta}_1 \mu = (\hat{\beta}_0 + \hat{\beta}_1 \bar{h}_1(X)) - \hat{\beta}_1(\bar{h}_1(X) - \mu) = \bar{h}(X) + \hat{c}^*(\bar{h}_1(X) - \mu),

thus, again, \hat{\beta}_1 = -\hat{c}^*. Now, since

SE^2(\hat{Y} | X = x) = SE^2(\hat{\beta}_1)(x - \bar{x})^2 + \hat{\sigma}^2/n,

the variance of the control variate estimator \hat{\theta}_{\hat{c}^*} is of the order \widehat{var}(\hat{\theta}_{\hat{c}^*}) = \hat{\sigma}^2/n. This variance is much smaller (if the regression model fits well) than var(h(X))/n, the variance of the crude Monte Carlo estimate of \theta without any control variate(s). Lastly, recall that the proportion of reduction in variance for the control variate is cor(h(X), h_1(X))^2, which is the coefficient of determination in the simple regression model.

Example: (e.g. 5.9 Rizzo) Repeat the estimate of

\theta = \int_0^1 \frac{e^{-x}}{1 + x^2} dx

by fitting a regression model.
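A minimal sketch of the same estimate obtained by fitting lm(), so that the estimates of c^*, of theta, and the proportion of variance reduced all come from the regression output:

#Control variate estimate via simple linear regression
u=runif(10000)
Y=exp(-u)/(1+u^2)                        #h(X), the response
X=exp(-.5)/(1+u^2)                       #h1(X), the control, with known mean mu
mu=exp(-.5)*pi/4
fit=lm(Y~X)
-coef(fit)[2]                            #estimate of c* = -beta1hat
predict(fit,newdata=data.frame(X=mu))    #control variate estimate of theta
summary(fit)$r.squared                   #approximate proportion of variance reduced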

4.3 Several Control Variates

The idea of combining unbiased estimators of the target parameter \theta to reduce variance can be extended to several control variates. The corresponding control variate estimator is

\hat{\theta}_c = h(X) + \sum_{i=1}^k c_i^* (h_i(X) - \mu_i),

where \mu_i = E(h_i(X)), i = 1, 2, ..., k. The controlled estimate \hat{\theta}_{\hat{c}^*} and the estimates of the optimal constants c_i^* can be obtained by fitting a linear regression model.
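A minimal sketch with two controls for the same integral as above; the second control h_2(x) = x (with known mean 1/2) is a hypothetical choice added only for illustration:

#Several control variates via multiple regression
u=runif(10000)
Y=exp(-u)/(1+u^2)                         #h(X)
X1=exp(-.5)/(1+u^2); mu1=exp(-.5)*pi/4    #control 1, known mean
X2=u; mu2=.5                              #control 2 (illustrative), known mean
fit=lm(Y~X1+X2)
predict(fit,newdata=data.frame(X1=mu1,X2=mu2))   #controlled estimate of theta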

5 Stratified Sampling

This aims to reduce the variance of the estimator by dividing the interval of integration into strata and estimating the integral on each stratum with a smaller variance. The reduction in variance from stratification is larger when the means of the strata are widely different, as shown in the following.

5.1 Strata chosen with equal probabilities

Denote the standard Monte Carlo estimate of \theta = E_f[h(X)] with n replicates by \hat{\theta}^{MC}, and the Stratified Sampling estimator by \hat{\theta}^{SS}. Divide the domain of integration of the unknown parameter \theta = E_f[h(X)] into k strata {J = j, j = 1, ..., k} with equal probabilities. If the domain of integration is the real line, the strata are intervals with the percentiles as cutoff points, i.e., F^{-1}(j/k), j = 1, ..., k - 1, where F is the cdf corresponding to f. Denote the mean and variance of h(X) on stratum j by \theta_j and \sigma_j^2, respectively, i.e.,

\theta_j = E_f[h(X) | J = j] = E_{f_j}[h(X)], and

\sigma_j^2 = var_f[h(X) | J = j] = var_{f_j}[h(X)],

where f_j is the conditional pdf of f on stratum j. Note that f_j is 0 everywhere outside stratum j, and is a legitimate pdf on stratum j. In fact,

f_j(x) = f_{X|J=j}(x) = \frac{f(x,\, x \in \text{stratum } j)}{P(X \in \text{stratum } j)} = \begin{cases} \frac{f(x)}{1/k} = k f(x), & \text{if } x \in \text{stratum } j \\ 0, & \text{otherwise} \end{cases}

For the stratified estimator with k strata, each stratum is estimated using m = n/k replicates by standard Monte Carlo. Estimate each stratum separately using the standard Monte Carlo estimator \hat{\theta}_j = \hat{E}_{f_j}[h(X)] with m replicates. Define the stratified estimator as

\hat{\theta}^{SS} = \frac{1}{k} \sum_{j=1}^k \hat{\theta}_j = \frac{1}{k} \sum_{j=1}^k \hat{E}_{f_j}[h(X)].

We have

E(\hat{\theta}^{SS}) = \theta, and var(\hat{\theta}^{SS}) \le var(\hat{\theta}^{MC}).

Proof. On unbiasedness: note that

\theta = E_f[h(X)] = E(E_f[h(X) | J]) = \sum_{j=1}^k E_f[h(X) | J = j] P(J = j) = \frac{1}{k} \sum_{j=1}^k \theta_j,

and \hat{\theta}_j is an unbiased estimate of \theta_j. Thus,

E(\hat{\theta}^{SS}) = \frac{1}{k} \sum_{j=1}^k E(\hat{\theta}_j) = \frac{1}{k} \sum_{j=1}^k \theta_j = \theta.

On variance: by independence, we have

var(\hat{\theta}^{SS}) = var\left( \frac{1}{k} \sum_{j=1}^k \hat{\theta}_j \right) = \frac{1}{k^2} \sum_{j=1}^k \frac{\sigma_j^2}{m} = \frac{1}{nk} \sum_{j=1}^k \sigma_j^2.

Since J is a random stratum selected with uniform probability 1/k, applying the conditional variance formula gives

var(\hat{\theta}^{MC}) = \frac{var(h(X))}{n}
= \frac{1}{n} \left( var(E(h(X)|J)) + E(var(h(X)|J)) \right)
= \frac{1}{n} \left( var(\theta_J) + E(\sigma_J^2) \right)
= \frac{1}{n} \left( var(\theta_J) + \frac{1}{k} \sum_{j=1}^k \sigma_j^2 \right)
= \frac{1}{n} var(\theta_J) + var(\hat{\theta}^{SS})
\ge var(\hat{\theta}^{SS}).

Equality holds only when all strata have identical means.

Example: (e.g. 5.11&12 Rizzo) Repeat the estimate of

\theta = \int_0^1 \frac{e^{-x}}{1 + x^2} dx

by stratified sampling. Consider 4 strata with equal probabilities as follows:

\theta = \int_0^1 \frac{e^{-x}}{1 + x^2} dx = \int_0^{.25} \frac{e^{-x}}{1 + x^2} dx + \int_{.25}^{.5} \frac{e^{-x}}{1 + x^2} dx + \int_{.5}^{.75} \frac{e^{-x}}{1 + x^2} dx + \int_{.75}^{1.0} \frac{e^{-x}}{1 + x^2} dx,

which equals

\frac{1}{4} \left\{ \int \frac{e^{-x}}{1 + x^2} u_{(0,.25)}(x) dx + \int \frac{e^{-x}}{1 + x^2} u_{(.25,.5)}(x) dx + \int \frac{e^{-x}}{1 + x^2} u_{(.5,.75)}(x) dx + \int \frac{e^{-x}}{1 + x^2} u_{(.75,1.0)}(x) dx \right\},

where u_{(a,b)} denotes the U(a, b) density.
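A minimal sketch of the stratified estimate with k = 4 equal-probability strata, alongside the standard Monte Carlo estimate on the same total number of replicates:

#Stratified sampling estimate of theta with 4 equal-probability strata
g=function(x) exp(-x)/(1+x^2)
nsim=10000; k=4; m=nsim/k
mean(g(runif(nsim)))                                  #standard Monte Carlo estimate
est=numeric(k)
for (j in 1:k) est[j]=mean(g(runif(m,(j-1)/k,j/k)))   #estimate on stratum j
mean(est)                                             #stratified sampling estimate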

5.2 Strata chosen with unequal probabilities

For the stratified estimator with k strata chosen with unequal probabilities P(J = j) = p_j, where \sum p_j = 1, each stratum is estimated using m_j = n p_j replicates by standard Monte Carlo. Estimate each stratum separately using the standard Monte Carlo estimator \hat{\theta}_j = \hat{E}_{f_j}[h(X)] with m_j replicates, \sum m_j = n. The stratified estimator is

\hat{\theta}^{SS} = \sum_{j=1}^k p_j \hat{\theta}_j = \sum_{j=1}^k p_j \hat{E}_{f_j}[h(X)].

Once again, we have E(\hat{\theta}^{SS}) = \theta and var(\hat{\theta}^{SS}) \le var(\hat{\theta}^{MC}). A similar proof applies here and is skipped for brevity.

6 Example: Stochastic Shortest Path

Consider the graph below. We would like to determine the expected shortest time to travel from point A to point B in the graph, where the times X_1, X_2, ..., X_5 follow some distribution(s).

[Figure: a bridge network from A to B with five random edge times; X_1 and X_2 are the edges leaving A, X_4 and X_5 the edges entering B, and X_3 the crossing edge between the two intermediate vertices.]

We want to estimate the expected length of the shortest path from vertex A to vertex B, where the edge lengths are independent random variables X_1, X_2, ..., X_5.

Let X be the random vector

X = (X_1 + X_4,\ X_1 + X_3 + X_5,\ X_2 + X_3 + X_4,\ X_2 + X_5)^T

and h(X) = min{X_1 + X_4, X_1 + X_3 + X_5, X_2 + X_3 + X_4, X_2 + X_5}. The problem is to estimate \theta = E(h(X)). How do you approach the problem using Monte Carlo? Which variance reduction techniques can be applied to reduce the variance of the Monte Carlo estimate?
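One possible sketch: the distributions of the X_i are not specified in the notes, so the code below assumes, purely for illustration, that X_1, ..., X_5 are iid U(0,1); since h is increasing in each coordinate, antithetic uniforms are one natural variance reduction choice.

#Expected shortest travel time from A to B
h=function(x) pmin(x[,1]+x[,4], x[,1]+x[,3]+x[,5],
                   x[,2]+x[,3]+x[,4], x[,2]+x[,5])
nsim=10000
u=matrix(runif(nsim*5),ncol=5)           #illustrative assumption: X_i ~ U(0,1)
mean(h(u))                               #crude Monte Carlo estimate
mean((h(u)+h(1-u))/2)                    #antithetic variates (h is monotone in each X_i)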
