Monte Carlo Methods
Total Page:16
File Type:pdf, Size:1020Kb
Section 2. Monte Carlo Methods Bob Carpenter Columbia University 1 Part I Monte Carlo Integration 2 Monte Carlo Calculation of π • Computing π = 3:14 ::: via simulation is the textbook application of Monte Carlo methods. • Generate points uniformly at random within the square • Calculate proportion within circle (x2 + y 2 < 1) and multiply by square’s area (4) to produce the area of the circle. • This area is π (radius is 1, so area is πr 2 = π) Plot by Mysid Yoderj courtesy of Wikipedia. 3 Monte Carlo Calculation of π (cont.) • R code to calcuate π with Monte Carlo simulation: > x <- runif(1e6,-1,1) > y <- runif(1e6,-1,1) > prop_in_circle <- sum(x^2 + y^2 < 1) / 1e6 > 4 * prop_in_circle [1] 3.144032 4 Accuracy of Monte Carlo • Monte Carlo is not an approximation! • It can be made exact to within any • Monte Carlo draws are i.i.d. by definition • Central limit theorem: expected error decreases at rate of 1 p N • 3 decimal places of accuracy with sample size 1e6 • Need 100× larger sample for each digit of accuracy 5 General Monte Carlo Integration • MC can calculate arbitrary definite inte- grals, Z b f (x) dx a • Let d upper bound f (x) in (a; b); tightness determines computational efficiency • Then generate random points uniformly in the rectangle bounded by (a; b) and (0; d) • Multiply proportion of draws (x, y) where y < f (x) by area of rectangle, d × (b − a). • Can be generalized to multiple dimensions in obvious way Image courtesy of Jeff Cruzan, http://www.drcruzan.com/NumericalIntegration.html 6 Expectations of Function of R.V. • Suppose f (θ) is a function of random variable vector θ • Suppose the density of θ is p(θ) – Warning: θ overloaded as random and bound variable • Then f (θ) is also random variable, with expectation Z E[f (θ)] = f (θ) p(θ) dθ: Θ – where Θ is support of p(θ) (i.e., Θ = fθ j p(θ) > 0g 7 QoI as Expectations • Most Bayesian quantities of interest (QoI) are expectations over the posterior p(θ j y) of functions f (θ) • Bayesian parameter estimation: θˆ – f (θ) = θ – θˆ = E[θjy] minimizes expected square error • Bayesian parameter (co)variance estimation: var[θ j y] – f (θ) = (θ − θ)ˆ 2 • Bayesian event probability: Pr[A j y] – f (θ) = I(θ 2 A) 8 Expectations via Monte Carlo • Generate draws θ(1); θ(2); : : : ; θ(M) drawn from p(θ) • Monte Carlo Estimator plugs in average for expectation: M 1 X E[f (θ)jy] ≈ f (θ(m)) M m=1 • Can be made as accurate as desired, because M 1 X E[f (θ)] = lim f (θ(m)) M!1 M m=1 p • Reminder: By CLT, error goes down as 1 = M 9 Part II Markov Chain Monte Carlo 10 Markov Chain Monte Carlo • Standard Monte Carlo draws i.i.d. draws θ(1); : : : ; θ(M) according to a probability function p(θ) • Drawing an i.i.d. sample is often impossible when dealing with complex densities like Bayesian posteriors p(θjy) • So we use Markov chain Monte Carlo (MCMC) in these cases and draw θ(1); : : : ; θ(M) from a Markov chain 11 Markov Chains • A Markov Chain is a sequence of random variables θ(1); θ(2); : : : ; θ(M) such that θ(m) only depends on θ(m−1), i.e., p(θ(m)jy; θ(1); : : : ; θ(m−1)) = p(θ(m)jy; θ(m−1)) • Drawing θ(1); : : : ; θ(M) from a Markov chain according to p(θ(m) j θ(m−1); y) is more tractable • Require marginal of each draw, p(θ(m)jy), to be equal to true posterior 12 Applying MCMC • Plug in just like ordinary (non-Markov chain) Monte Carlo • Adjust standard errors for dependence in Markov chain 13 MCMC for Posterior Mean • Standard Bayesian estimator is posterior mean Z θˆ = θ p(θjy) dθ Θ – Posterior mean minimizes expected square error • Estimate is a conditional expectation θˆ = E[θjy] • Compute by averaging M 1 X θˆ ≈ θ M m=1 14 MCMC for Posterior Variance • Posterior variance works the same way, E[(θ − E[θ j y])2 j y] = E[(θ − θ)ˆ 2] M 1 X ≈ (θ(m) − θ)ˆ 2 M m=1 15 MCMC for Event Probability • Event probabilities are also expectations, e.g., Z Pr[θ1 > θ2] = E[I[θ1 > θ2]] = I[θ1 > θ2] p(θjy)dθ: Θ • Estimation via MCMC just another plug-in: M 1 X Pr[θ > θ ] ≈ I[θ(m) > θ(m)] 1 2 M 1 2 m=1 • Again, can be made as accurate as necessary 16 MCMC for Quantiles (incl. median) • These are not expectations, but still plug in • Alternative Bayesian estimator is posterior median – Posterior median minimizes expected absolute error • Estimate as median draw of θ(1); : : : ; θ(M) – just sort and take halfway value – e.g., Stan shows 50% point (or other quantiles) • Other quantiles including interval bounds similar – estimate with quantile of draws – estimation error goes up in tail (based on fewer draws) 17 Part III MCMC Algorithms 18 Random-Walk Metropolis • Draw random initial parameter vector θ(1) (in support) • For m 2 2:M – Sample proposal from a (symmetric) jumping distribution, e.g., θ∗ ∼ MultiNormal(θ(m−1); σ I) where I is the identity matrix – Draw u(m) ∼ Uniform(0; 1) and set 8 p(θ∗jy) >θ∗ if u(m) < < (m−1)j θ(m) = p(θ y) > :>θ(m−1) otherwise 19 Metropolis and Normalization • Metropolis only uses posterior in a ratio: p(θ∗ j y) p(θ(m) j y) • This allows the use of unnormalized densities • Recall Bayes’s rule: p(θjy) / p(yjθ) p(θ) • Thus we only need to evaluate sampling (likelihood) and prior – i.e., no need to compute normalizing integral for p(y), Z p(yjθ) p(θ)dθ Θ 20 Metropolis-Hastings • Generalizes Metropolis to asymmetric proposals • Acceptance ratio is J(θ(m)jθ∗) × p(θ∗jy) J(θ∗jθ(m−1)) × p(θ(m)jy) where J is the (potentially asymmetric) proposal density • i.e., probabilty of being at θ∗ and jumping to θ(m−1) probability of being at θ(m−1) and jumping to θ∗ 21 Metropolis-Hastings (cont.) • General form ensures equilibrium by maintaining detailed balance • Like Metropolis, only requires ratios • Many algorithms involve a Metropolis-Hastings “correction” – Including vanilla HMC and RHMC and ensemble samplers 22 Detailed Balance & Reversibility • Definition is measure theoretic, but applies to densities – just like Bayes’s rule • Assume Markov chain has stationary density p(a) • Suppose π(ajb) is density of transitioning from b to a – use of π to indicates different measure on Θ than p • Detailed balance is a reversibility equilibrium condition p(a) π(bja) = p(b) π(ajb) 23 Optimal Proposal Scale? • Proposal scale σ is a free; too low or high is inefficient • Traceplots show parameter value on y axis, iterations on x • Empirical tuning problem; theoretical optima exist for some cases Roberts and Rosenthal (2001) Optimal Scaling for Various Metropolis-Hastings Algorithms. Statistical Science. 24 Convergence • Imagine releasing a hive of bees in a sealed house – they disperse, but eventually reach equilibrium where the same number of bees leave a room as enter it (on average) • May take many iterations for Markov chain to reach equi- librium 25 Convergence: Example • Four chains with different starting points – Left: 50 iterations – Center: 1000 iterations – Right: Draws from second half of each chain Gelman et al., Bayesian Data Analysis 26 Potential Scale Reduction (Rˆ) • Gelman & Rubin recommend M chains of N draws with diffuse initializations • Measure that each chain has same posterior mean and variance • If not, may be stuck in multiple modes or just not con- verged yet • Define statistic Rˆ of chains s.t. at convergence, Rˆ ! 1 – Rˆ >> 1 implies non-convergence – Rˆ ≈ 1 does not guarantee convergence – Only measures marginals 27 Split Rˆ • Vanilla Rˆ may not diagnose non-stationarity – e.g., a sequence of chains with an increasing parameter • Split Rˆ: Stan splits each chain into first and second half – start with M Markov chains of N draws each – split each in half to creates 2M chains of N=2 draws – then apply Rˆ to the 2M chains 28 Calculating Rˆ Statistic: Between • M chains of N draws each • Between-sample variance estimate M N X B = (θ¯(•) − θ¯(•))2; M − 1 m • m=1 where N M 1 X 1 X θ¯(•) = θ(n) and θ¯(•) = θ¯(•): m N m • M m n=1 m=1 29 Calculating Rˆ (cont.) • M chains of N draws each • Within-sample variance estimate: M 1 X W = s2 ; M m m=1 where N 1 X s2 = (θ(n) − θ¯(•))2: m N − 1 m m n=1 30 Calculating Rˆ Statistic (cont.) • Variance estimate: + N − 1 1 var (θjy) = W + B: d N N recall that W is within-chain variance and B between-chain • Potential scale reduction statistic (“R hat”) s + var (θjy) Rˆ = d : W 31 Correlations in Posterior Draws • Markov chains typically display autocorrelation in the se- ries of draws θ(1); : : : ; θ(m) • Without i.i.d. draws, central limit theorem does not apply • Effective sample size Neff divides out autocorrelation • Neff must be estimated from sample – Fast Fourier transform computes correlations at all lags • Estimation accuracy proportional to 1 p Neff 32 Reducing Posterior Correlation • Tuning algorithm parameters to ensure good mixing • Recall Metropolis traceplots of Roberts and Rosenthal: • Good jump scale σ produces good mixing and high Neff 33 Effective Sample Size • Autocorrelation at lag t is correlation between subseqs – (θ(1); : : : ; θ(N−t)) and (θ(1+t); : : : ; θ(N)) • Suppose chain has density p(θ) with – E[θ] = µ and Var[θ] = σ 2 • Autocorrelation ρt at lag t ≥ 0: 1 Z ρ = (θ(n) − µ)(θ(n+t) − µ) p(θ) dθ t 2 σ Θ • Because p(θ(n)) = p(θ(n+t)) = p(θ) at convergence, 1 Z ρ = θ(n) θ(n+t) p(θ) dθ t 2 σ Θ 34 Estimating Autocorrelations • Effective sample size (N draws in chain) is defined by N N Neff = P1 = P1 t=−∞ ρt 1 + 2 t=1 ρt • Estimate in terms of variograms (M chains) at lag t – Calculate with fast Fourier transform (FFT) M 0 N 1 m 2 1 X 1 X (n) (n−t) Vt = @ θ − θ A − m m M m=1 Nm t n=t+1 • Adjust autocorrelation at lag t using cross-chain variance as Vt ρˆt = 1 − + 2 vard + • If not converged, vard overestimates variance 35 Estimating Nef f 0 • Let T be first lag s.t.