Monte Carlo Methods

Section 2. Monte Carlo Methods
Bob Carpenter
Columbia University

Part I: Monte Carlo Integration

Monte Carlo Calculation of π

• Computing π = 3.14... via simulation is the textbook application of Monte Carlo methods.
• Generate points uniformly at random within the square.
• Calculate the proportion of points within the circle (x² + y² < 1) and multiply by the square's area (4) to produce the area of the circle.
• This area is π (the radius is 1, so the area is πr² = π).

[Plot by Mysid Yoderj, courtesy of Wikipedia.]

Monte Carlo Calculation of π (cont.)

• R code to calculate π with Monte Carlo simulation:

    > x <- runif(1e6, -1, 1)
    > y <- runif(1e6, -1, 1)
    > prop_in_circle <- sum(x^2 + y^2 < 1) / 1e6
    > 4 * prop_in_circle
    [1] 3.144032

Accuracy of Monte Carlo

• Monte Carlo is not an approximation! It can be made exact to within any desired tolerance.
• Monte Carlo draws are i.i.d. by definition.
• Central limit theorem: expected error decreases at a rate of 1/√N.
• 3 decimal places of accuracy with sample size 1e6.
• Need a 100× larger sample for each additional digit of accuracy.

General Monte Carlo Integration

• MC can calculate arbitrary definite integrals, ∫_a^b f(x) dx.
• Let d be an upper bound on f(x) over (a, b); the tightness of the bound determines computational efficiency.
• Then generate random points uniformly in the rectangle bounded by (a, b) and (0, d).
• Multiply the proportion of draws (x, y) with y < f(x) by the area of the rectangle, d × (b − a). (See the R sketch at the end of Part I below.)
• Can be generalized to multiple dimensions in the obvious way.

[Image courtesy of Jeff Cruzan, http://www.drcruzan.com/NumericalIntegration.html]

Expectations of a Function of a Random Variable

• Suppose f(θ) is a function of a random variable vector θ.
• Suppose the density of θ is p(θ).
  – Warning: θ is overloaded as both a random variable and a bound variable.
• Then f(θ) is also a random variable, with expectation

    E[f(θ)] = ∫_Θ f(θ) p(θ) dθ,

  – where Θ is the support of p(θ), i.e., Θ = {θ | p(θ) > 0}.

QoI as Expectations

• Most Bayesian quantities of interest (QoI) are expectations over the posterior p(θ | y) of functions f(θ).
• Bayesian parameter estimation: θ̂
  – f(θ) = θ
  – θ̂ = E[θ | y] minimizes expected square error.
• Bayesian parameter (co)variance estimation: var[θ | y]
  – f(θ) = (θ − θ̂)²
• Bayesian event probability: Pr[A | y]
  – f(θ) = I(θ ∈ A)

Expectations via Monte Carlo

• Generate draws θ^(1), θ^(2), ..., θ^(M) from p(θ).
• The Monte Carlo estimator plugs in the average for the expectation:

    E[f(θ) | y] ≈ (1/M) Σ_{m=1}^{M} f(θ^(m))

• Can be made as accurate as desired, because

    E[f(θ)] = lim_{M→∞} (1/M) Σ_{m=1}^{M} f(θ^(m))

• Reminder: by the CLT, the error goes down as 1/√M.
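To make the rectangle recipe and the plug-in estimator concrete, here is a minimal R sketch (not from the original slides); the integrand sin(x), the interval (0, π), and the bound d = 1 are illustrative choices whose true integral is 2.

    # Minimal sketch of hit-or-miss Monte Carlo integration, assuming the
    # integrand f(x) = sin(x) on (0, pi) with upper bound d = 1 (true value: 2).
    set.seed(1234)
    M <- 1e6
    a <- 0; b <- pi; d <- 1
    x <- runif(M, a, b)              # points uniform on (a, b)
    y <- runif(M, 0, d)              # points uniform on (0, d)
    prop_under <- mean(y < sin(x))   # proportion of points under the curve
    prop_under * d * (b - a)         # approximately 2

    # The same integral written as an expectation and estimated by plug-in:
    # integral = (b - a) * E[f(U)] with U ~ Uniform(a, b).
    (b - a) * mean(sin(x))

The second estimator is the plug-in average from the last slide above, applied with p(θ) taken to be Uniform(a, b).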
Part II: Markov Chain Monte Carlo

Markov Chain Monte Carlo

• Standard Monte Carlo generates i.i.d. draws θ^(1), ..., θ^(M) according to a probability function p(θ).
• Drawing an i.i.d. sample is often impossible when dealing with complex densities like Bayesian posteriors p(θ | y).
• So in these cases we use Markov chain Monte Carlo (MCMC) and draw θ^(1), ..., θ^(M) from a Markov chain.

Markov Chains

• A Markov chain is a sequence of random variables θ^(1), θ^(2), ..., θ^(M) such that θ^(m) only depends on θ^(m−1), i.e.,

    p(θ^(m) | y, θ^(1), ..., θ^(m−1)) = p(θ^(m) | y, θ^(m−1))

• Drawing θ^(1), ..., θ^(M) from a Markov chain according to p(θ^(m) | θ^(m−1), y) is more tractable.
• We require the marginal of each draw, p(θ^(m) | y), to be equal to the true posterior.

Applying MCMC

• Plug in just like ordinary (non-Markov chain) Monte Carlo.
• Adjust standard errors for the dependence in the Markov chain.

MCMC for Posterior Mean

• The standard Bayesian estimator is the posterior mean,

    θ̂ = ∫_Θ θ p(θ | y) dθ

  – The posterior mean minimizes expected square error.
• The estimate is a conditional expectation, θ̂ = E[θ | y].
• Compute it by averaging:

    θ̂ ≈ (1/M) Σ_{m=1}^{M} θ^(m)

MCMC for Posterior Variance

• Posterior variance works the same way:

    E[(θ − E[θ | y])² | y] = E[(θ − θ̂)² | y] ≈ (1/M) Σ_{m=1}^{M} (θ^(m) − θ̂)²

MCMC for Event Probability

• Event probabilities are also expectations, e.g.,

    Pr[θ₁ > θ₂] = E[I[θ₁ > θ₂]] = ∫_Θ I[θ₁ > θ₂] p(θ | y) dθ

• Estimation via MCMC is just another plug-in:

    Pr[θ₁ > θ₂] ≈ (1/M) Σ_{m=1}^{M} I[θ₁^(m) > θ₂^(m)]

• Again, this can be made as accurate as necessary.

MCMC for Quantiles (incl. median)

• These are not expectations, but we still plug in.
• An alternative Bayesian estimator is the posterior median.
  – The posterior median minimizes expected absolute error.
• Estimate it as the median draw of θ^(1), ..., θ^(M):
  – just sort the draws and take the halfway value;
  – e.g., Stan reports the 50% point (and other quantiles).
• Other quantiles, including interval bounds, are similar:
  – estimate with the corresponding quantile of the draws;
  – estimation error goes up in the tails (based on fewer draws).

Part III: MCMC Algorithms

Random-Walk Metropolis

• Draw a random initial parameter vector θ^(1) (in the support).
• For m ∈ 2:M
  – Sample a proposal from a (symmetric) jumping distribution, e.g.,

      θ* ~ MultiNormal(θ^(m−1), σ I),

    where I is the identity matrix.
  – Draw u^(m) ~ Uniform(0, 1) and set

      θ^(m) = θ*          if u^(m) < p(θ* | y) / p(θ^(m−1) | y)
      θ^(m) = θ^(m−1)     otherwise

• (A runnable R sketch of this sampler follows the detailed-balance slide below.)

Metropolis and Normalization

• Metropolis only uses the posterior in a ratio:

    p(θ* | y) / p(θ^(m−1) | y)

• This allows the use of unnormalized densities.
• Recall Bayes's rule: p(θ | y) ∝ p(y | θ) p(θ).
• Thus we only need to evaluate the sampling distribution (likelihood) and the prior,
  – i.e., there is no need to compute the normalizing integral for p(y),

      ∫_Θ p(y | θ) p(θ) dθ

Metropolis-Hastings

• Generalizes Metropolis to asymmetric proposals.
• The acceptance ratio is

    [J(θ^(m−1) | θ*) × p(θ* | y)] / [J(θ* | θ^(m−1)) × p(θ^(m−1) | y)],

  where J is the (potentially asymmetric) proposal density,
• i.e.,

    [probability of being at θ* and jumping to θ^(m−1)] / [probability of being at θ^(m−1) and jumping to θ*]

Metropolis-Hastings (cont.)

• The general form ensures equilibrium by maintaining detailed balance.
• Like Metropolis, it only requires ratios.
• Many algorithms involve a Metropolis-Hastings "correction",
  – including vanilla HMC, RHMC, and ensemble samplers.

Detailed Balance & Reversibility

• The definition is measure theoretic, but it applies to densities,
  – just like Bayes's rule.
• Assume the Markov chain has stationary density p(a).
• Suppose π(a | b) is the density of transitioning from b to a.
  – The use of π indicates a different measure on Θ than p.
• Detailed balance is a reversibility equilibrium condition:

    p(a) π(b | a) = p(b) π(a | b)
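As a concrete illustration of the random-walk Metropolis update described above, here is a minimal R sketch (not from the original slides); the target, a standard normal given by its unnormalized log density, the proposal scale sigma = 1, the chain length, and the helper name log_p are illustrative assumptions.

    # Minimal random-walk Metropolis sketch for a one-dimensional target,
    # assuming a standard normal target; only an unnormalized log density is needed.
    set.seed(1234)
    log_p <- function(theta) -0.5 * theta^2     # log of the unnormalized target
    M <- 1e4                                    # number of draws
    sigma <- 1                                  # proposal scale (tuning parameter)
    theta <- numeric(M)
    theta[1] <- 0                               # initial value in the support
    for (m in 2:M) {
      proposal <- rnorm(1, theta[m - 1], sigma)            # symmetric jump
      log_ratio <- log_p(proposal) - log_p(theta[m - 1])   # log acceptance ratio
      if (log(runif(1)) < log_ratio) {
        theta[m] <- proposal                    # accept the proposal
      } else {
        theta[m] <- theta[m - 1]                # reject: repeat the previous draw
      }
    }
    mean(theta)   # plug-in posterior mean estimate (near 0)
    var(theta)    # plug-in posterior variance estimate (near 1)

Working on the log scale avoids underflow; the test log(u) < log ratio is equivalent to u < p(θ* | y) / p(θ^(m−1) | y).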
Optimal Proposal Scale?

• The proposal scale σ is a free parameter; too low or too high is inefficient.
• Traceplots show the parameter value on the y axis and iterations on the x axis.
• Tuning is an empirical problem, though theoretical optima exist for some cases.
  – Roberts and Rosenthal (2001). Optimal Scaling for Various Metropolis-Hastings Algorithms. Statistical Science.

Convergence

• Imagine releasing a hive of bees in a sealed house:
  – they disperse, but eventually reach an equilibrium where the same number of bees leave a room as enter it (on average).
• It may take many iterations for a Markov chain to reach equilibrium.

Convergence: Example

• Four chains with different starting points.
  – Left: 50 iterations
  – Center: 1000 iterations
  – Right: draws from the second half of each chain

[Figure from Gelman et al., Bayesian Data Analysis.]

Potential Scale Reduction (R̂)

• Gelman and Rubin recommend M chains of N draws with diffuse initializations.
• Measure whether each chain has the same posterior mean and variance.
• If not, the chains may be stuck in multiple modes or just not converged yet.
• Define a statistic R̂ of the chains such that at convergence, R̂ → 1.
  – R̂ ≫ 1 implies non-convergence.
  – R̂ ≈ 1 does not guarantee convergence.
  – It only measures marginals.

Split R̂

• Vanilla R̂ may not diagnose non-stationarity,
  – e.g., a sequence of chains with an increasing parameter.
• Split R̂: Stan splits each chain into a first and second half:
  – start with M Markov chains of N draws each;
  – split each in half to create 2M chains of N/2 draws;
  – then apply R̂ to the 2M chains.

Calculating the R̂ Statistic: Between

• M chains of N draws each.
• Between-sample variance estimate:

    B = (N / (M − 1)) Σ_{m=1}^{M} (θ̄_m − θ̄)²,

  where

    θ̄_m = (1/N) Σ_{n=1}^{N} θ_m^(n)   and   θ̄ = (1/M) Σ_{m=1}^{M} θ̄_m.

Calculating R̂ (cont.)

• M chains of N draws each.
• Within-sample variance estimate:

    W = (1/M) Σ_{m=1}^{M} s_m²,

  where

    s_m² = (1/(N − 1)) Σ_{n=1}^{N} (θ_m^(n) − θ̄_m)².

Calculating the R̂ Statistic (cont.)

• Pooled variance estimate:

    var̂⁺(θ | y) = ((N − 1)/N) W + (1/N) B,

  where W is the within-chain and B the between-chain variance estimate.
• Potential scale reduction statistic ("R hat"):

    R̂ = sqrt( var̂⁺(θ | y) / W )
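The between- and within-chain formulas above translate directly into R. A minimal sketch (not from the original slides): the function name rhat and the example data are illustrative, and the draws are assumed to be stored in an N × M matrix with one column per chain; for split R̂, split each chain in half and pass the 2M half-chains as columns.

    # Minimal R-hat sketch for draws in an N x M matrix (one column per chain).
    rhat <- function(theta) {
      N <- nrow(theta)                        # draws per chain
      M <- ncol(theta)                        # number of chains
      chain_means <- colMeans(theta)
      B <- N * var(chain_means)               # between-chain variance estimate
      W <- mean(apply(theta, 2, var))         # within-chain variance estimate
      var_plus <- (N - 1) / N * W + B / N     # pooled variance estimate
      sqrt(var_plus / W)
    }

    # Example: four well-mixed chains of independent draws give R-hat near 1.
    theta <- matrix(rnorm(4000), nrow = 1000, ncol = 4)
    rhat(theta)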
Correlations in Posterior Draws

• Markov chains typically display autocorrelation in the series of draws θ^(1), ..., θ^(M).
• Without i.i.d. draws, the central limit theorem does not apply directly.
• The effective sample size N_eff divides out the autocorrelation.
• N_eff must be estimated from the sample.
  – A fast Fourier transform computes correlations at all lags.
• Estimation accuracy is proportional to 1/√N_eff.

Reducing Posterior Correlation

• Tune algorithm parameters to ensure good mixing.
• Recall the Metropolis traceplots of Roberts and Rosenthal: a good jump scale σ produces good mixing and a high N_eff.

Effective Sample Size

• The autocorrelation at lag t is the correlation between the subsequences
  – (θ^(1), ..., θ^(N−t)) and (θ^(1+t), ..., θ^(N)).
• Suppose the chain has density p(θ) with
  – E[θ] = µ and Var[θ] = σ².
• The autocorrelation ρ_t at lag t ≥ 0 is

    ρ_t = (1/σ²) E[(θ^(n) − µ)(θ^(n+t) − µ)]

• Because p(θ^(n)) = p(θ^(n+t)) = p(θ) at convergence, this simplifies to

    ρ_t = (1/σ²) (E[θ^(n) θ^(n+t)] − µ²)

Estimating Autocorrelations

• The effective sample size (for a chain of N draws) is defined by

    N_eff = N / Σ_{t=−∞}^{∞} ρ_t = N / (1 + 2 Σ_{t=1}^{∞} ρ_t)

• Estimate it in terms of variograms at lag t (over M chains),

    V_t = (1/M) Σ_{m=1}^{M} [ (1/(N_m − t)) Σ_{n=t+1}^{N_m} (θ_m^(n) − θ_m^(n−t))² ]

  – calculated with a fast Fourier transform (FFT).
• Adjust the autocorrelation at lag t using the cross-chain variance estimate:

    ρ̂_t = 1 − V_t / (2 var̂⁺)

• If the chains have not converged, var̂⁺ overestimates the variance.

Estimating N_eff

• Let T′ be the first lag s.t.
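The final slide above is cut off in this copy. As a rough illustration of how N_eff is estimated in practice, here is a minimal single-chain sketch in R; the function name ess, the use of R's acf() rather than the multi-chain variogram/FFT estimator on the slides, and the rule of truncating the sum at the first negative autocorrelation estimate are all assumptions, not the slides' exact procedure.

    # Minimal effective-sample-size sketch for a single chain `theta_draws`,
    # truncating the autocorrelation sum at the first negative estimate
    # (an assumed rule; the slides' multi-chain FFT estimator differs).
    ess <- function(theta_draws) {
      N <- length(theta_draws)
      rho <- acf(theta_draws, lag.max = min(N - 1, 1000), plot = FALSE)$acf[-1]
      T_prime <- which(rho < 0)[1]            # first lag with a negative estimate
      if (is.na(T_prime)) T_prime <- length(rho) + 1
      N / (1 + 2 * sum(rho[seq_len(T_prime - 1)]))
    }

    # Independent draws should give N_eff close to N:
    ess(rnorm(1e4))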