
Section 2. Monte Carlo Methods

Bob Carpenter Columbia University

1 Part I

2 Monte Carlo Calculation of π

• Computing π = 3.14... via simulation is the textbook application of Monte Carlo methods.

• Generate points uniformly at random within the square with corners (−1, −1) and (1, 1).

• Calculate the proportion of points within the circle (x^2 + y^2 < 1) and multiply by the square's area (4) to produce the area of the circle.

• This area is π (radius is 1, so area is πr^2 = π).

Plot by Mysid Yoderj courtesy of Wikipedia.

3 Monte Carlo Calculation of π (cont.)

Code to calculate π with Monte Carlo simulation:

> x <- runif(1e6, -1, 1)
> y <- runif(1e6, -1, 1)
> prop_in_circle <- sum(x^2 + y^2 < 1) / 1e6
> 4 * prop_in_circle
[1] 3.144032

4 Accuracy of Monte Carlo

• Monte Carlo is not an approximation!

• It can be made accurate to within any ε > 0, given enough draws

• Monte Carlo draws are i.i.d. by definition

• Central limit theorem: expected error decreases at a rate of 1 / √N

• 3 decimal places of accuracy with sample size 1e6

• Need a 100× larger sample for each additional digit of accuracy
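The 1/√N error rate can be checked against the π example; a minimal sketch in Python (the deck's own code is R), where `mc_pi` and the seed are illustrative choices:

```python
import math
import random

def mc_pi(n, rng):
    """Estimate pi as 4 times the proportion of uniform draws inside the unit circle."""
    inside = sum(1 for _ in range(n)
                 if rng.uniform(-1, 1) ** 2 + rng.uniform(-1, 1) ** 2 < 1)
    return 4 * inside / n

# By the CLT, the standard error of the estimate is 4 * sqrt(p * (1 - p) / N)
# with p = pi / 4, so each additional digit of accuracy needs a 100x larger sample.
rng = random.Random(1234)
p = math.pi / 4
for n in (10 ** 2, 10 ** 4, 10 ** 6):
    se = 4 * math.sqrt(p * (1 - p) / n)
    print(n, round(mc_pi(n, rng), 4), "theoretical s.e. ~", round(se, 5))
```

The printed standard errors shrink by a factor of 10 for each 100× increase in N.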

5 General Monte Carlo Integration

• MC can calculate arbitrary definite integrals,

∫_a^b f(x) dx

• Let d upper bound f(x) on (a, b); tightness determines computational efficiency

• Then generate random points uniformly in the rectangle bounded by (a, b) and (0, d)

• Multiply proportion of draws (x, y) where y < f(x) by area of rectangle, d × (b − a)

• Can be generalized to multiple dimensions in the obvious way

Image courtesy of Jeff Cruzan, http://www.drcruzan.com/NumericalIntegration.html
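The rectangle method above can be sketched in a few lines of Python; `mc_integrate` is an illustrative helper, applied here to a function with a known integral:

```python
import random

def mc_integrate(f, a, b, d, n, rng):
    """Estimate the integral of f over (a, b), assuming 0 <= f(x) <= d there,
    as the proportion of uniform points in the (a,b) x (0,d) rectangle that
    fall under the curve, scaled by the rectangle's area d * (b - a)."""
    hits = sum(1 for _ in range(n) if rng.uniform(0, d) < f(rng.uniform(a, b)))
    return (hits / n) * d * (b - a)

rng = random.Random(42)
# integral of x^2 over (0, 1) is exactly 1/3
print(mc_integrate(lambda x: x * x, 0.0, 1.0, 1.0, 100_000, rng))
```

A loose bound d still gives a correct answer; it only wastes draws, which is the efficiency remark above.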

6 Expectations of Function of R.V.

• Suppose f (θ) is a function of vector θ

• Suppose the density of θ is p(θ) – Warning: θ overloaded as random and bound variable

• Then f(θ) is also a random variable, with expectation

E[f(θ)] = ∫_Θ f(θ) p(θ) dθ.

– where Θ is the support of p(θ) (i.e., Θ = {θ | p(θ) > 0})

7 QoI as Expectations

• Most Bayesian quantities of interest (QoI) are expectations over the posterior p(θ | y) of functions f (θ)

• Bayesian parameter estimation: θ̂ – f(θ) = θ – θ̂ = E[θ|y] minimizes expected square error

• Bayesian parameter (co)variance estimation: var[θ | y] – f(θ) = (θ − θ̂)^2

• Bayesian event probability: Pr[A | y] – f (θ) = I(θ ∈ A)

8 Expectations via Monte Carlo

• Generate draws θ(1), θ(2), . . . , θ(M) from p(θ)

• Monte Carlo plugs in average for expectation:

E[f(θ) | y] ≈ (1/M) Σ_{m=1}^M f(θ(m))

• Can be made as accurate as desired, because

E[f(θ)] = lim_{M→∞} (1/M) Σ_{m=1}^M f(θ(m))

• Reminder: by the CLT, error goes down as 1 / √M
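The plug-in average above, sketched in Python with an assumed standard normal for p(θ) and f(θ) = θ^2 (illustrative choices, so the exact answer E[θ^2] = 1 is known):

```python
import random

def mc_expectation(f, draws):
    """Monte Carlo plug-in: approximate E[f(theta)] by averaging f over draws."""
    return sum(f(th) for th in draws) / len(draws)

rng = random.Random(0)
draws = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
print(mc_expectation(lambda th: th * th, draws))  # ~ E[theta^2] = 1
```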

9 Part II Monte Carlo

10 Markov Chain Monte Carlo

• Standard Monte Carlo uses i.i.d. draws

θ(1), . . . , θ(M)

according to a probability function p(θ)

• Drawing an i.i.d. sample is often impossible when dealing with complex densities like Bayesian posteriors p(θ|y)

• So we use Markov chain Monte Carlo (MCMC) in these cases and draw θ(1), . . . , θ(M) from a Markov chain

11 Markov Chains

• A Markov Chain is a sequence of random variables

θ(1), θ(2), . . . , θ(M)

such that θ(m) only depends on θ(m−1), i.e.,

p(θ(m)|y, θ(1), . . . , θ(m−1)) = p(θ(m)|y, θ(m−1))

• Drawing θ(1), . . . , θ(M) from a Markov chain according to p(θ(m) | θ(m−1), y) is more tractable

• Require marginal of each draw, p(θ(m)|y), to be equal to true posterior

12 Applying MCMC

• Plug in just like ordinary (non-Markov chain) Monte Carlo

• Adjust standard errors for dependence in Markov chain

13 MCMC for Posterior Mean

• The standard Bayesian estimator is the posterior mean

θ̂ = ∫_Θ θ p(θ|y) dθ

– Posterior mean minimizes expected square error

• Estimate is a conditional expectation

θˆ = E[θ|y]

• Compute by averaging

θ̂ ≈ (1/M) Σ_{m=1}^M θ(m)

14 MCMC for Posterior Variance

• Posterior variance works the same way,

E[(θ − E[θ | y])^2 | y] = E[(θ − θ̂)^2 | y]

≈ (1/M) Σ_{m=1}^M (θ(m) − θ̂)^2

15 MCMC for Event Probability

• Event probabilities are also expectations, e.g.,

Pr[θ1 > θ2] = E[I[θ1 > θ2]] = ∫_Θ I[θ1 > θ2] p(θ|y) dθ.

• Estimation via MCMC just another plug-in:

Pr[θ1 > θ2] ≈ (1/M) Σ_{m=1}^M I[θ1(m) > θ2(m)]

• Again, can be made as accurate as necessary
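As a sketch of the event-probability plug-in, assuming independent normal posteriors for θ1 and θ2 (an illustrative choice with a known answer, Φ(1/√2) ≈ 0.76):

```python
import random

rng = random.Random(1)
M = 100_000
# illustrative posterior draws: theta1 ~ N(1, 1), theta2 ~ N(0, 1), independent
draws = [(rng.gauss(1.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(M)]
# plug-in: average of the indicator I[theta1 > theta2] over the draws
pr = sum(1 for t1, t2 in draws if t1 > t2) / M
print(pr)  # ~ Pr[N(1, sqrt(2)) > 0] = Phi(1/sqrt(2)) ~ 0.76
```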

16 MCMC for Quantiles (incl. median)

• These are not expectations, but still plug in

• Alternative Bayesian estimator is posterior median – Posterior median minimizes expected absolute error

• Estimate as the median draw of θ(1), . . . , θ(M) – just sort the draws and take the halfway value – i.e., the 50% point (other quantiles work the same way)

• Other quantiles including interval bounds similar – estimate with quantile of draws – estimation error goes up in tail (based on fewer draws)
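Sort-and-index is all the estimator needs; a minimal sketch with a hypothetical `quantile` helper, on illustrative standard-normal draws:

```python
import random

def quantile(draws, q):
    """Plug-in quantile: sort the draws and index into the sorted list."""
    s = sorted(draws)
    return s[min(len(s) - 1, int(q * len(s)))]

rng = random.Random(2)
draws = [rng.gauss(0.0, 1.0) for _ in range(100_001)]
print(quantile(draws, 0.5))                            # posterior median, ~ 0
print(quantile(draws, 0.025), quantile(draws, 0.975))  # central 95% interval
```

The tail estimates (0.025, 0.975) are noisier than the median, since they rest on fewer nearby draws.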

17 Part III MCMC

18 Random-Walk Metropolis

• Draw random initial parameter vector θ(1) (in support)

• For m ∈ 2:M – Sample proposal from a (symmetric) jumping distribution, e.g., θ∗ ∼ MultiNormal(θ(m−1), σ I) where I is the identity matrix

– Draw u(m) ∼ Uniform(0, 1) and set

θ(m) = θ∗ if u(m) < p(θ∗|y) / p(θ(m−1)|y), and θ(m) = θ(m−1) otherwise
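The algorithm above fits in a few lines; a minimal 1-D random-walk Metropolis sketch in Python with a normal (symmetric) proposal, where the target, scale, and seed are illustrative:

```python
import math
import random

def metropolis(log_p, theta0, scale, M, rng):
    """Random-walk Metropolis in 1-D with Normal(theta, scale) proposals.
    log_p may be unnormalized: only the ratio p(theta*) / p(theta) is used."""
    draws, theta, lp = [], theta0, log_p(theta0)
    for _ in range(M):
        prop = rng.gauss(theta, scale)       # symmetric jumping distribution
        lp_prop = log_p(prop)
        # accept with probability min(1, p(prop)/p(theta)), computed on log scale
        if rng.random() < math.exp(min(0.0, lp_prop - lp)):
            theta, lp = prop, lp_prop
        draws.append(theta)                  # on rejection, repeat current draw
    return draws

# target: standard normal, known only up to a constant (log density -theta^2/2)
rng = random.Random(3)
draws = metropolis(lambda t: -0.5 * t * t, 0.0, 2.4, 50_000, rng)
print(sum(draws) / len(draws))  # ~ 0
```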

19 Metropolis and Normalization

• Metropolis only uses the posterior in a ratio:

p(θ∗ | y) / p(θ(m−1) | y)

• This allows the use of unnormalized densities

• Recall Bayes’s rule:

p(θ|y) ∝ p(y|θ) p(θ)

• Thus we only need to evaluate sampling (likelihood) and prior – i.e., no need to compute the normalizing integral for p(y),

∫_Θ p(y|θ) p(θ) dθ

20 Metropolis-Hastings

• Generalizes Metropolis to asymmetric proposals

• Acceptance ratio is

[ J(θ(m−1)|θ∗) × p(θ∗|y) ] / [ J(θ∗|θ(m−1)) × p(θ(m−1)|y) ]

where J is the (potentially asymmetric) proposal density

• i.e.,

probability of being at θ∗ and jumping to θ(m−1), divided by the probability of being at θ(m−1) and jumping to θ∗

21 Metropolis-Hastings (cont.)

• General form ensures equilibrium by maintaining detailed balance

• Like Metropolis, only requires ratios

• Many algorithms involve a Metropolis-Hastings “correction”

– Including vanilla HMC, RHMC, and ensemble samplers

22 Detailed Balance & Reversibility

• Definition is measure theoretic, but applies to densities – just like Bayes’s rule

• Assume Markov chain has stationary density p(a)

• Suppose π(a|b) is the density of transitioning from b to a – use of π indicates a different measure on Θ than p

• Detailed balance is a reversibility equilibrium condition

p(a) π(b|a) = p(b) π(a|b)
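The detailed-balance condition can be verified directly for a small discrete chain; a sketch with a 3-state target and a uniform Metropolis proposal over the other states (both illustrative):

```python
# target stationary probabilities over 3 states (illustrative)
p = [0.2, 0.3, 0.5]
S = len(p)

def trans(a, b):
    """Metropolis transition probability pi(b | a), proposal J(b|a) = 1/(S-1)."""
    if a == b:
        # stay probability: whatever is left after the moves away
        return 1.0 - sum(trans(a, c) for c in range(S) if c != a)
    return (1.0 / (S - 1)) * min(1.0, p[b] / p[a])

# detailed balance: p(a) pi(b|a) = p(b) pi(a|b) for every pair of states
for a in range(S):
    for b in range(S):
        assert abs(p[a] * trans(a, b) - p[b] * trans(b, a)) < 1e-12

# detailed balance implies stationarity: sum_a p(a) pi(b|a) = p(b)
for b in range(S):
    assert abs(sum(p[a] * trans(a, b) for a in range(S)) - p[b]) < 1e-12

print("detailed balance and stationarity hold")
```

Summing the balance equation over a gives the stationarity check in the last loop, which is why detailed balance ensures equilibrium.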

23 Optimal Proposal Scale?

• Proposal scale σ is a free parameter; too low or too high is inefficient

• Traceplots show parameter value on y axis, iterations on x

• Empirical tuning problem; theoretical optima exist for some cases

Roberts and Rosenthal (2001) Optimal Scaling for Various Metropolis-Hastings Algorithms. Statistical Science.

24 Convergence

• Imagine releasing a hive of bees in a sealed house – they disperse, but eventually reach equilibrium where the same number of bees leave a room as enter it (on average)

• May take many iterations for the Markov chain to reach equilibrium

25 Convergence: Example

• Four chains with different starting points – Left: 50 iterations – Center: 1000 iterations – Right: Draws from second half of each chain

Gelman et al., Bayesian Data Analysis

26 Potential Scale Reduction (Rˆ)

• Gelman & Rubin recommend M chains of N draws with diffuse initializations

• Measure that each chain has same posterior mean and variance

• If not, the chains may be stuck in multiple modes or just not converged yet

• Define statistic R̂ of chains such that at convergence, R̂ → 1 – R̂ ≫ 1 implies non-convergence – R̂ ≈ 1 does not guarantee convergence – Only measures marginals

27 Split Rˆ

• Vanilla Rˆ may not diagnose non-stationarity – e.g., a sequence of chains with an increasing parameter

• Split R̂: Stan splits each chain into first and second halves – start with M Markov chains of N draws each – split each in half to create 2M chains of N/2 draws – then apply R̂ to the 2M chains

28 Calculating Rˆ Statistic: Between

• M chains of N draws each

• Between-sample variance estimate

B = (N / (M − 1)) Σ_{m=1}^M (θ̄_m(•) − θ̄_•(•))^2,

where

θ̄_m(•) = (1/N) Σ_{n=1}^N θ_m(n)   and   θ̄_•(•) = (1/M) Σ_{m=1}^M θ̄_m(•).

29 Calculating Rˆ (cont.)

• M chains of N draws each

• Within-sample variance estimate:

W = (1/M) Σ_{m=1}^M s_m^2,

where

s_m^2 = (1/(N − 1)) Σ_{n=1}^N (θ_m(n) − θ̄_m(•))^2.

30 Calculating Rˆ Statistic (cont.)

• Variance estimate:

var̂+(θ|y) = ((N − 1)/N) W + (1/N) B

– recall that W is the within-chain variance estimate and B the between-chain estimate

• Potential scale reduction statistic (“R hat”)

R̂ = √( var̂+(θ|y) / W ).
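Putting slides 27–30 together, a sketch of split R̂ in Python; names like `split_rhat` are illustrative, and this is not Stan's implementation:

```python
import random
import statistics

def split_rhat(chains):
    """Split potential scale reduction: halve each chain, then compute
    B (between), W (within), var+ = (N-1)/N * W + B/N, and sqrt(var+/W)."""
    halves = []
    for c in chains:
        mid = len(c) // 2
        halves += [c[:mid], c[mid:2 * mid]]
    M, N = len(halves), len(halves[0])
    means = [statistics.fmean(h) for h in halves]
    grand = statistics.fmean(means)
    B = N / (M - 1) * sum((m - grand) ** 2 for m in means)
    W = statistics.fmean([statistics.variance(h) for h in halves])  # mean of s_m^2
    var_plus = (N - 1) / N * W + B / N
    return (var_plus / W) ** 0.5

rng = random.Random(4)
mixed = [[rng.gauss(0, 1) for _ in range(1000)] for _ in range(4)]      # well mixed
stuck = [[rng.gauss(mu, 1) for _ in range(1000)] for mu in (0, 0, 0, 3)]  # one stuck mode
print(split_rhat(mixed))  # ~ 1
print(split_rhat(stuck))  # >> 1
```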

31 Correlations in Posterior Draws

• Markov chains typically display autocorrelation in the series of draws θ(1), . . . , θ(M)

• Without i.i.d. draws, central limit theorem does not apply

• Effective sample size Neff divides out autocorrelation

• Neff must be estimated from sample – Fast Fourier transform computes correlations at all lags

• Estimation error is proportional to 1 / √Neff

32 Reducing Posterior Correlation

• Tuning parameters to ensure good mixing

• Recall Metropolis traceplots of Roberts and Rosenthal:

• Good jump scale σ produces good mixing and high Neff

33 Effective Sample Size

• Autocorrelation at lag t is the correlation between subsequences – (θ(1), . . . , θ(N−t)) and (θ(1+t), . . . , θ(N))

• Suppose the chain has density p(θ) with – E[θ] = µ and Var[θ] = σ^2

• Autocorrelation ρ_t at lag t ≥ 0:

ρ_t = (1/σ^2) ∫_Θ (θ(n) − µ)(θ(n+t) − µ) p(θ) dθ

• Because p(θ(n)) = p(θ(n+t)) = p(θ) at convergence,

ρ_t = (1/σ^2) ∫_Θ θ(n) θ(n+t) p(θ) dθ

34 Estimating Autocorrelations

• Effective sample size (N draws in a chain) is defined by

Neff = N / Σ_{t=−∞}^{∞} ρ_t = N / (1 + 2 Σ_{t=1}^{∞} ρ_t)

• Estimate in terms of variograms (M chains) at lag t – calculate with the fast Fourier transform (FFT)

V_t = (1/M) Σ_{m=1}^M ( (1/(N_m − t)) Σ_{n=t+1}^{N_m} (θ_m(n) − θ_m(n−t))^2 )

• Adjust autocorrelation at lag t using cross-chain variance as

ρ̂_t = 1 − V_t / (2 var̂+)

• If not converged, var̂+ overestimates the variance

35 Estimating Neff

• Let T′ be the first lag such that ρ̂_{T′+1} < 0

• Estimate autocorrelation by

N̂eff = M N / (1 + 2 Σ_{t=1}^{T′} ρ̂_t).

• NUTS avoids negative autocorrelations, so truncating at the first negative autocorrelation estimate is reasonable

• For basics (not our estimates), see Charles Geyer (2013) Introduction to MCMC. In Handbook of MCMC. (free online at http://www.mcmchandbook.net/index.html)
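A sketch of the truncated Neff estimate for a single chain (M = 1), using direct sums in place of the FFT mentioned above; the function name, constants, and test chains are illustrative:

```python
import random

def ess(chain):
    """Effective sample size of one chain: N / (1 + 2 * sum of autocorrelations),
    truncating the sum at the first negative autocorrelation estimate."""
    N = len(chain)
    mu = sum(chain) / N
    var = sum((x - mu) ** 2 for x in chain) / N
    tail = 0.0
    for t in range(1, N):
        rho = sum((chain[n] - mu) * (chain[n + t] - mu)
                  for n in range(N - t)) / (N * var)
        if rho < 0:          # truncate at first negative estimate
            break
        tail += rho
    return N / (1 + 2 * tail)

rng = random.Random(5)
iid = [rng.gauss(0, 1) for _ in range(2000)]
ar = [0.0]
for _ in range(1999):
    ar.append(0.9 * ar[-1] + rng.gauss(0, 1))  # AR(1): highly autocorrelated
print(ess(iid))  # close to 2000
print(ess(ar))   # much smaller
```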

36 Gibbs Sampling

• Draw random initial parameter vector θ(1) (in support)

• For m ∈ 2:M – For n ∈ 1:N:

* draw θn(m) according to the conditional

p(θn | θ1(m), . . . , θn−1(m), θn+1(m−1), . . . , θN(m−1), y).

• e.g., with θ = (θ1, θ2, θ3):

– draw θ1(m) according to p(θ1 | θ2(m−1), θ3(m−1), y)
– draw θ2(m) according to p(θ2 | θ1(m), θ3(m−1), y)
– draw θ3(m) according to p(θ3 | θ1(m), θ2(m), y)
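As a concrete instance of the scheme above, a Gibbs sampler sketch for an illustrative target: a bivariate normal with unit variances and correlation ρ, whose full conditionals are Normal(ρ × other, √(1 − ρ^2)):

```python
import random

def gibbs_bivariate_normal(rho, M, rng):
    """Gibbs sampler for (theta1, theta2) ~ bivariate normal, correlation rho.
    Each full conditional is Normal(rho * other, sqrt(1 - rho^2))."""
    sd = (1 - rho ** 2) ** 0.5
    t1 = t2 = 0.0
    draws = []
    for _ in range(M):
        t1 = rng.gauss(rho * t2, sd)  # draw theta1 | theta2, y
        t2 = rng.gauss(rho * t1, sd)  # draw theta2 | theta1, y
        draws.append((t1, t2))
    return draws

rng = random.Random(6)
draws = gibbs_bivariate_normal(0.8, 50_000, rng)
print(sum(t1 for t1, _ in draws) / len(draws))  # ~ 0
```

With high ρ the components move in small conditional steps, previewing the slow-mixing complaint about Gibbs on correlated posteriors later in the deck.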

37 Generalized Gibbs

• “Proper” Gibbs requires conditional Monte Carlo draws – typically works only for conjugate priors

• In the general case, may need to use less efficient conditional draws – slice sampling is a popular general technique that works for discrete or continuous θn (JAGS) – adaptive rejection sampling is another alternative (BUGS) – very difficult in more than one or two dimensions

38 Sampling Efficiency

• We care only about Neff per second

• Decompose into

1. Iterations per second 2. Effective sample size per iteration

• Gibbs and Metropolis have high iterations per second (especially Metropolis)

• But they have low effective sample size per iteration (especially Metropolis)

• Both are particularly weak when there is high correlation among the parameters in the posterior

39 HMC & NUTS

• Slower iterations per second than Gibbs or Metropolis

• Much higher effective sample size per iteration for complex posteriors (i.e., high curvature and correlation)

• Overall, much higher Neff per second

• Details in the next talk . . .

• Along with details of how Stan implements HMC and NUTS
