
Stan: a Probabilistic Programming Language

Core Development Team (in order of joining): Andrew Gelman, Bob Carpenter, Matt Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, Allen Riddell, Marco Inacio, Jeffrey Arnold, Mitzi Morris, Rob Trangucci, Rob Goedman, Brian Lau, Jonah Sol Gabry, Alp Kucukelbir, Robert L. Grant, Dustin Tran, Krzysztof Sakrejda, Aki Vehtari, Rayleigh Lei, Sebastian Weber

Stan 2.9.0 (April 2016) http://mc-stan.org

1 Get the Slides

http://mc-stan.org/workshops/ness2016

2 Example I Male Birth Ratio

3 Birth Rate by Sex

• Laplace's data on live births in Paris from 1745–1770:

  sex      live births
  female   241 945
  male     251 527

• Question 1 (Estimation) What is the birth rate of boys vs. girls?

• Question 2 (Event Probability) Is a boy more likely to be born than a girl?

• Bayes (1763) set up the “Bayesian” model

• Laplace (1781, 1786) solved for the posterior

4 Repeated Binary Trial Model

• Data
  – N ∈ {0, 1, . . .}: number of trials (constant)
  – y_n ∈ {0, 1}: trial n success (known, modeled data)

• Parameter
  – θ ∈ [0, 1]: chance of success (unknown)

• Prior
  – p(θ) = Uniform(θ | 0, 1) = 1

• Likelihood

  $p(y \mid \theta) = \prod_{n=1}^{N} \mathrm{Bernoulli}(y_n \mid \theta) = \prod_{n=1}^{N} \theta^{y_n} (1 - \theta)^{1 - y_n}$
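• A minimal R sketch of evaluating this log likelihood (toy data assumed for illustration):

y <- c(1, 0, 1, 1, 0)                          # toy Bernoulli data
log_lik <- function(theta)
  sum(dbinom(y, size = 1, prob = theta, log = TRUE))
sapply(c(0.2, 0.5, 0.6), log_lik)              # largest near mean(y) = 0.6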

5 Stan Program

data {
  int<lower=0> N;                // number of trials
  int<lower=0, upper=1> y[N];    // success on trial n
}
parameters {
  real<lower=0, upper=1> theta;  // chance of success
}
model {
  theta ~ uniform(0, 1);         // prior
  for (n in 1:N)
    y[n] ~ bernoulli(theta);     // likelihood
}

6 A Stan Program

• Defines log (posterior) density up to constant, so...

• Equivalent to defining the log density directly:

model {
  increment_log_prob(0);   // log Uniform(theta | 0, 1) = log 1 = 0
  for (n in 1:N)
    increment_log_prob(log(theta^y[n] * (1 - theta)^(1 - y[n])));
}

• Equivalent to (a) drop constant prior, (b) vectorize likelihood:

model { y ~ bernoulli(theta); }

7 Simulating Repeated Binary Trials

• Code: rbinom(10, N, 0.3)

  – N = 10 trials (10% to 50% success rate)
    2 2 1 3 3 2 3 2 2 5

  – N = 100 trials (27% to 34% success rate)
    29 34 27 31 25 31 27 29 32 26

  – N = 1000 trials (29% to 32% success rate)
    291 297 289 322 305 296 294 297 314 292

  – N = 10,000 trials (29.5% to 30.7% success rate)
    3014 3031 3017 2886 2995 2944 3067 3069 3051 3068

• Deviation goes down at rate O(1/√N)
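• A quick R sketch of this shrinking deviation (seed and simulation counts assumed for illustration):

set.seed(1234)
for (N in c(10, 100, 1000, 10000)) {
  draws <- rbinom(10, N, 0.3)           # 10 simulations of N trials each
  cat("N =", N, " sd of success rate =", sd(draws / N), "\n")
}
# sd of draws/N shrinks by roughly sqrt(10) for each 10x increase in N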

8 R: Simulate Data

> theta <- 0.30;
> N <- 20;
> y <- rbinom(N, 1, 0.3);

> y

[1] 1 1 1 1 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 1

> sum(y) / N

[1] 0.4

9 RStan: Fit

> library(rstan);

> fit <- stan("bern.stan", data = list(y = y, N = N));

> print(fit, probs=c(0.1, 0.9));

Inference for Stan model: bern. 4 chains, each with iter=2000; warmup=1000; thin=1; post-warmup draws per chain=1000, total post-warmup draws=4000.

        mean se_mean   sd    10%    90% n_eff Rhat
theta   0.41    0.00 0.10   0.28   0.55  1580    1
lp__  -15.40    0.02 0.71 -16.26 -14.89  1557    1

Samples drawn using NUTS(diag_e) at Thu Apr 21 19:38:16 2016.

10 RStan: Posterior Sample

> theta_draws <- extract(fit)$theta;
> mean(theta_draws);

[1] 0.4128373

> quantile(theta_draws, probs=c(0.10, 0.90));

      10%       90%
0.2830349 0.5496858

11 ggplot2: Plotting

theta_draws_df <- data.frame(list(theta = theta_draws));
plot <- ggplot(theta_draws_df, aes(x = theta)) +
  geom_histogram(bins = 20, color = "gray");
plot;

[Histogram of the posterior draws of theta: count (0–400) vs. theta (0.2–0.8)]

12

• Don’t know order of births, only total.

• If $y_1, \ldots, y_N \sim \mathrm{Bernoulli}(\theta)$, then $(y_1 + \cdots + y_N) \sim \mathrm{Binomial}(N, \theta)$

• The analytic form is

  $\mathrm{Binomial}(y \mid N, \theta) = \binom{N}{y}\, \theta^{y}\, (1 - \theta)^{N - y}$

  where the binomial coefficient normalizes for permutations (i.e., which subset of n has $y_n = 1$),

  $\binom{N}{y} = \frac{N!}{y!\,(N - y)!}$

13 Calculating Laplace’s Answers

transformed data {
  int male;
  int female;
  male <- 251527;
  female <- 241945;
}
parameters {
  real<lower=0, upper=1> theta;
}
model {
  male ~ binomial(male + female, theta);
}
generated quantities {
  int<lower=0, upper=1> theta_gt_half;
  theta_gt_half <- (theta > 0.5);
}

14 And the Answer is...

> fit <- stan("laplace.stan", iter=100000);
> print(fit, probs=c(0.005, 0.995), digits=3)

               mean  0.5% 99.5%
theta          0.51 0.508 0.512
theta_gt_half  1.00 1.000 1.000

• Q1: θ is 99% certain to lie in (0.508, 0.512)

• Q2: Laplace “morally certain” boys more prevalent

15 Parameter Estimates

• Estimate the probability that a parameter lies in an interval, e.g., Pr[θ ∈ (0.508, 0.512) | y]

• Conditions on observed data y

• Bayesian parameter estimates are probabilistic

16 Event Probabilities

• Defined by an indicator function, e.g.,

theta_gt_half <- (theta > 0.5);

• Indicators are random variables – with boolean (0 or 1) values – defined in terms of parameters (and data)

• For Laplace's problem, this shows

  $\Pr[\theta \le 0.5 \mid y] \approx 10^{-42}$

• Event probabilities are expectations of indicators

17 Part I Bayesian Inference

18 Observations, Counterfactuals, and Random Variables

• Assume we observe data y = y_1, . . . , y_N

• Statistical modeling assumes that even though y is observed, the values could have been different

• John Stuart Mill first characterized this counterfactual nature of statistical modeling in: A System of Logic, Ratiocinative and Inductive (1843)

• In measure-theoretic language, y modeled as a random variable

19 Bayesian Data Analysis

• “By Bayesian data analysis, we mean practical methods for making inferences from data using probability models for quantities we observe and about which we wish to learn.”

• “The essential characteristic of Bayesian methods is their explicit use of probability for quantifying uncertainty in inferences based on statistical analysis.”

Gelman et al., Bayesian Data Analysis, 3rd edition, 2013

20 Bayesian Methodology

• Set up full probability model – for all observable & unobservable quantities – consistent w. problem knowledge & data collection

• Condition on observed data – to calculate posterior probability of unobserved quantities (e.g., parameters, predictions, missing data)

• Evaluate – model fit and implications of posterior

• Repeat as necessary

Ibid.

21 Where do Models Come from?

• Sometimes model comes first, based on substantive considerations – toxicology, economics, ecology, . . .

• Sometimes model chosen based on data collection – traditional statistics of surveys and experiments

• Other times the data comes first – observational studies, meta-analysis, . . .

• Usually it's a mix

22 (Donald) Rubin’s Philosophy

• All statistics is inference about missing data

• Question 1: What would you do if you had all the data?

• Question 2: What were you doing before you had any data?

(as relayed in course notes by Andrew Gelman)

23 Model Checking

• Do the inferences make sense? – are parameter values consistent with model's prior? – does simulating from parameter values produce reasonable fake data? – are marginal predictions consistent with the data?

• Do predictions and event probabilities for new data make sense?

• Not: Is the model true?

• Not: What is Pr[model is true]?

• Not: Can we “reject” the model?

24 Model Improvement

• Expanding the model – hierarchical and multilevel structure . . . – more flexible distributions (overdispersion, covariance) – more structure (geospatial, time series) – more modeling of measurement methods and errors – ...

• Including more data – breadth (more predictors or kinds of observations) – depth (more observations)

25 Using Bayesian Inference

• Finds parameters consistent with prior info and data∗

– ∗ if such agreement is possible

• Automatically includes uncertainty and variability

• Inferences can be plugged in directly – risk assessment – decision analysis

26 Notation for Basic Quantities

• Basic Quantities

– y: observed data – θ: parameters (and other unobserved quantities) – x: constants, predictors for conditional (aka “discriminative”) models

• Basic Predictive Quantities

– ỹ: unknown, potentially observable quantities – x̃: constants, predictors for unknown quantities

27 Naming Conventions

• Joint: p(y, θ)

• Sampling / Likelihood: p(y | θ) – Sampling is a function of y with θ fixed (probability function) – Likelihood is a function of θ with y fixed (not a probability function)

• Prior: p(θ)

• Posterior: p(θ | y)

• Data Marginal (Evidence): p(y)

• Posterior Predictive: p(ỹ | y)

28 Bayes’s Rule for Posterior

$p(\theta \mid y) = \frac{p(y, \theta)}{p(y)}$   [def of conditional]

$= \frac{p(y \mid \theta)\, p(\theta)}{p(y)}$   [chain rule]

$= \frac{p(y \mid \theta)\, p(\theta)}{\int_\Theta p(y, \theta')\, d\theta'}$   [law of total prob]

$= \frac{p(y \mid \theta)\, p(\theta)}{\int_\Theta p(y \mid \theta')\, p(\theta')\, d\theta'}$   [chain rule]

• Inversion: Final result depends only on sampling distribution (likelihood) p(y | θ) and prior p(θ)

29 Bayes’s Rule up to Proportion

• If data y is fixed, then

$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)} \propto p(y \mid \theta)\, p(\theta) = p(y, \theta)$

• Posterior proportional to likelihood times prior

• Equivalently, posterior proportional to joint

• The nasty integral for the data marginal p(y) goes away

30 Posterior Predictive Distribution

• Predict new data y˜ based on observed data y

• Marginalize out parameters from the posterior

  $p(\tilde{y} \mid y) = \int_\Theta p(\tilde{y} \mid \theta)\, p(\theta \mid y)\, d\theta$

• Averages predictions p(ỹ | θ), weighted by posterior p(θ | y)
  – Θ = {θ | p(θ | y) > 0} is the support of p(θ | y)

• Allows continuous, discrete, or mixed parameters
  – integral notation is shorthand for sums and/or integrals

31 Event Probabilities

• Recall that an event A is a collection of outcomes

• Suppose event A is determined by an indicator on parameters

  $f(\theta) = \begin{cases} 1 & \text{if } \theta \in A \\ 0 & \text{if } \theta \notin A \end{cases}$

• e.g., f(θ) = I(θ₁ > θ₂) for Pr[θ₁ > θ₂ | y]

• Bayesian event probabilities calculate posterior mass

  $\Pr[A] = \int_\Theta f(\theta)\, p(\theta \mid y)\, d\theta$

• Not frequentist, because involves parameter probabilities

32 Detour: Beta Distribution

• For parameters α, β > 0 and θ ∈ (0, 1),

  $\mathrm{Beta}(\theta \mid \alpha, \beta) \propto \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$

• Euler's Beta function is used to normalize,

  $B(\alpha, \beta) = \int_0^1 u^{\alpha - 1} (1 - u)^{\beta - 1}\, du = \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha + \beta)}$

  so that

  $\mathrm{Beta}(\theta \mid \alpha, \beta) = \frac{1}{B(\alpha, \beta)}\, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$

• Note: Beta(θ | 1, 1) = Uniform(θ | 0, 1)

• Note: Γ() is the continuous generalization of the factorial, with Γ(n + 1) = n!
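• A small R check of these identities (R's beta() and gamma() are Euler's functions; the arguments are arbitrary):

dbeta(0.3, 1, 1)                             # = 1: Beta(1,1) is Uniform(0,1)
beta(2.5, 3.7)                               # Euler's Beta function B(2.5, 3.7)
gamma(2.5) * gamma(3.7) / gamma(2.5 + 3.7)   # same value via Gamma functions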

33 Laplace Turns the Crank

• From Bayes’s rule, the posterior is

$p(\theta \mid y, N) = \frac{\mathrm{Binomial}(y \mid N, \theta)\; \mathrm{Uniform}(\theta \mid 0, 1)}{\int_\Theta \mathrm{Binomial}(y \mid N, \theta')\, p(\theta')\, d\theta'}$

• Laplace calculated the posterior analytically

$p(\theta \mid y, N) = \mathrm{Beta}(\theta \mid y + 1,\; N - y + 1).$

34 Beta Distribution — Examples

• Unnormalized posterior density assuming uniform prior and y successes out of n trials (all with mean 0.6).

[Figure 2.1 of Gelman et al. (2013) Bayesian Data Analysis, 3rd Edition: unnormalized posterior density for binomial parameter θ, based on a uniform prior distribution and y successes out of n trials; curves displayed for several values of n and y]

35 Estimation

• Posterior is Beta(θ | 1 + 241 945, 1 + 251 527)

• Posterior mean:

  $\frac{1 + 241\,945}{1 + 241\,945 + 1 + 251\,527} \approx 0.4902913$

• Maximum likelihood estimate is the same as the posterior mode (because of uniform prior)

  $\frac{241\,945}{241\,945 + 251\,527} \approx 0.4902912$

• As the number of observations approaches ∞, the MLE approaches the posterior mean

36 Event Probability Inference

• What is the probability that a male live birth is more likely than a female live birth?

  $\Pr[\theta > 0.5] = \int_\Theta \mathrm{I}[\theta > 0.5]\, p(\theta \mid y, N)\, d\theta = \int_{0.5}^{1} p(\theta \mid y, N)\, d\theta = 1 - F_{\theta \mid y, N}(0.5) \approx 10^{-42}$

• I[φ] = 1 if condition φ is true and 0 otherwise.

• F_{θ | y, N} is the posterior cumulative distribution function (cdf).
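• A one-line R sketch of this tail probability using the analytic Beta posterior (taking θ as the chance of a female birth, so the upper tail is the tiny one; accuracy at these extremes depends on pbeta's numerics):

y <- 241945                                        # female live births
N <- 241945 + 251527                               # total live births
pbeta(0.5, y + 1, N - y + 1, lower.tail = FALSE)   # on the order of 1e-42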

37 Math vs. Simulation

• Luckily, we don’t have to be as good at math as Laplace

• Nowadays, we calculate all these integrals by computer using tools like Stan

If you wanted to do foundational research in statistics in the mid-twentieth century, you had to be a bit of a mathematician, whether you wanted to or not. . . . if you want to do statistical research at the turn of the twenty-first century, you have to be a computer programmer. —from Andrew's blog

38 Bayesian “Fisher Exact Test”

• Suppose we observe the following data on handedness

          sinister   dexter   TOTAL
  male    9 (y1)     43       52 (N1)
  female  4 (y2)     44       48 (N2)

• Assume likelihoods Binomial(y_k | N_k, θ_k), uniform priors

• Are men more likely to be left-handed?

  $\Pr[\theta_1 > \theta_2 \mid y, N] = \int_\Theta \mathrm{I}[\theta_1 > \theta_2]\, p(\theta \mid y, N)\, d\theta \approx 0.91$

• Directly interpretable result; not a frequentist procedure
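• A Monte Carlo sketch of the same computation in R: with independent binomial likelihoods and uniform priors the two posteriors are independent Betas, so we can draw from each and compare (should land near the 0.91 above):

M <- 1e6
theta1 <- rbeta(M, 9 + 1, 43 + 1)   # men: 9 of 52 left-handed
theta2 <- rbeta(M, 4 + 1, 44 + 1)   # women: 4 of 48
mean(theta1 > theta2)               # estimates Pr[theta1 > theta2 | y, N]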

39 Stan Binomial Comparison

data {
  int<lower=0> y[2];
  int<lower=0> N[2];
}
parameters {
  vector<lower=0, upper=1>[2] theta;
}
model {
  y ~ binomial(N, theta);
}

40 Visualizing Posterior Difference

• Plot of posterior difference, p(θ₁ − θ₂ | y, N) (men − women)

• Vertical bars: central 95% posterior interval (−0.05, 0.22)

41 Technical Interlude Conjugate Priors

42 Conjugate Priors

• Family F is a conjugate prior for family G if
  – prior in F and
  – likelihood in G
  – entails posterior in F

• Before MCMC techniques became practical, Bayesian analysis mostly involved conjugate priors

• Still widely used because analytic solutions are more efficient than MCMC

43 Beta is Conjugate to Binomial

• Prior: p(θ | α, β) = Beta(θ | α, β)

• Likelihood: p(y | N, θ) = Binomial(y | N, θ)

• Posterior:

$p(\theta \mid y, N, \alpha, \beta) \propto p(\theta \mid \alpha, \beta)\, p(y \mid N, \theta)$

$= \mathrm{Beta}(\theta \mid \alpha, \beta)\; \mathrm{Binomial}(y \mid N, \theta)$

$= \frac{1}{B(\alpha, \beta)}\, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}\, \binom{N}{y}\, \theta^{y} (1 - \theta)^{N - y}$

$\propto \theta^{y + \alpha - 1} (1 - \theta)^{N - y + \beta - 1}$

$\propto \mathrm{Beta}(\theta \mid \alpha + y,\; \beta + (N - y))$

44 Chaining Updates

• Start with prior Beta(θ | α, β)

• Receive binomial data in K stages (y₁, N₁), . . . , (y_K, N_K)

• After (y₁, N₁), the posterior is Beta(θ | α + y₁, β + N₁ − y₁)

• Use it as the prior for (y₂, N₂), with posterior

  $\mathrm{Beta}(\theta \mid \alpha + y_1 + y_2,\; \beta + (N_1 - y_1) + (N_2 - y_2))$

• Lather, rinse, repeat, until the final posterior

  $\mathrm{Beta}(\theta \mid \alpha + y_1 + \cdots + y_K,\; \beta + (N_1 + \cdots + N_K) - (y_1 + \cdots + y_K))$

• Same result as if we'd updated with the combined data (y₁ + · · · + y_K, N₁ + · · · + N_K)
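• An R sketch of the chaining (batch counts are made up for illustration):

alpha <- 1; beta <- 1                    # Beta(1, 1) = uniform prior
y1 <- 3; N1 <- 10                        # first batch: 3 successes in 10
y2 <- 4; N2 <- 12                        # second batch
a1 <- alpha + y1; b1 <- beta + N1 - y1   # posterior after batch 1
a2 <- a1 + y2;    b2 <- b1 + N2 - y2     # posterior after batch 2
a  <- alpha + y1 + y2                    # combined-data update
b  <- beta + (N1 + N2) - (y1 + y2)
c(a2, b2) == c(a, b)                     # TRUE TRUE: same posterior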

45 Part II (Un-)Bayesian Point Estimation

46 MAP Estimator

• For a Bayesian model p(y, θ) = p(y | θ) p(θ), the max a posteriori (MAP) estimate maximizes the posterior,

$\theta^* = \arg\max_\theta\; p(\theta \mid y) = \arg\max_\theta\; \frac{p(y \mid \theta)\, p(\theta)}{p(y)} = \arg\max_\theta\; p(y \mid \theta)\, p(\theta) = \arg\max_\theta\; \log p(y \mid \theta) + \log p(\theta).$

• not Bayesian because it doesn’t integrate over uncertainty

• not frequentist because of distributions over parameters

47 MAP and the MLE

• MAP estimate reduces to the MLE if the prior is uniform, i.e., p(θ) = c, because

$\theta^* = \arg\max_\theta\; p(y \mid \theta)\, p(\theta) = \arg\max_\theta\; p(y \mid \theta)\, c = \arg\max_\theta\; p(y \mid \theta).$

48 Penalized Maximum Likelihood

• The MAP estimate can be made palatable to frequentists via a philosophical sleight of hand

• Treat the negative log prior −log p(θ) as a “penalty”

• e.g., a Normal(θ | µ, σ) prior becomes the penalty function (dropping constants)

  $\lambda_{\theta,\mu,\sigma} = \log \sigma + \frac{1}{2} \left( \frac{\theta - \mu}{\sigma} \right)^2$

• Maximize the sum of the log likelihood and the negative penalty

  $\theta^* = \arg\max_\theta\; \log p(y \mid \theta, x) - \lambda_{\theta,\mu,\sigma} = \arg\max_\theta\; \log p(y \mid \theta, x) + \log p(\theta \mid \mu, \sigma)$

49 Proper Bayesian Point Estimates

• Choose estimate to minimize some loss function

• To minimize expected squared error (L2 loss), E[(θ − θ′)² | y], use the posterior mean

  $\hat{\theta} = \arg\min_{\theta'}\; \mathbb{E}[(\theta - \theta')^2 \mid y] = \int_\Theta \theta \times p(\theta \mid y)\, d\theta.$

• To minimize expected absolute error (L1 loss), E[|θ − θ′| | y], use the posterior median.

• Other loss (utility) functions are possible; their study falls under decision theory

• All share the property of involving full Bayesian inference.

50 Point Estimates for Inference?

• Common in machine learning to generate a point estimate θ*, then use it for inference, p(ỹ | θ*)

• This is defective because it underestimates uncertainty.

• To properly estimate uncertainty, apply full Bayes

• A major focus of statistics and decision theory is estimating uncertainty in our inferences

51 Philosophical Interlude What is Statistics?

52 Exchangeability

• Roughly, an exchangeable probability function is such that for a sequence of random variables y = y₁, . . . , y_N,

  $p(y) = p(\pi(y))$

  for every N-permutation π (i.e., a one-to-one mapping of {1, . . . , N})

• i.i.d. implies exchangeability, but not vice-versa

53 Exchangeability Assumptions

• Models almost always make some kind of exchangeability assumption

• Typically when other knowledge is not available – e.g., treat voters as conditionally i.i.d. given their age, sex, income, education level, religious affiliation, and state of residence – But voters have many more properties (hair color, height, profession, employment status, marital status, car ownership, gun ownership, etc.) – Missing predictors introduce additional error (on top of measurement error)

54 Random Parameters: Doxastic or Epistemic?

• Bayesians treat distributions over parameters as epistemic (i.e., about knowledge)

• They do not treat them as being doxastic (i.e., about beliefs)

• Priors encode our knowledge before seeing the data

• Posteriors encode our knowledge after seeing the data

• Bayes’s rule provides the way to update our knowledge

• People like to pretend models are ontological (i.e., about what exists)

55 Arbitrariness: Priors vs. Likelihood

• Bayesian analyses often criticized as subjective (arbitrary)

• Choosing priors is no more arbitrary than choosing a likelihood function (or an exchangeability/i.i.d. assumption)

• As George Box famously wrote (1987),

“All models are wrong, but some are useful.”

• This does not just apply to Bayesian models!

56 Part III Hierarchical Models

57 Baseball At-Bats

• For example, consider baseball batting ability. – Baseball is sort of like cricket, but with round bats, a one-way field, stationary “bowlers”, four bases, short games, and no draws

• Batters have a number of “at-bats” in a season, out of which they get a number of “hits” (hits are a good thing)

• Nobody has had a higher than 40% success rate since the 1950s.

• No player (excluding “bowlers”) bats much less than 20%.

• Same approach applies to hospital pediatric surgery complications (a BUGS example), reviews on Yelp, test scores in multiple classrooms, . . .

58 Baseball Data

• Hits versus at-bats for the 2006 American League season

• Not much variation in ability!

• Ignore skill vs. at-bats relation

• Note uncertainty of MLE

[Scatterplot: avg = hits / at bats (0.0–0.6) vs. at bats (1–500, log scale)]

59 Pooling Data

• How do we estimate the ability of a player who we observe getting 6 hits in 10 at-bats? Or 0 hits in 5 at-bats? Estimates of 60% or 0% are absurd!

• Same logic applies to players with 152 hits in 537 at bats.

• No pooling: estimate each player separately

• Complete pooling: estimate all players together (assume no difference in abilities)

• Partial pooling: somewhere in the middle – use information about other players (i.e., the population) to estimate a player’s ability

60 Hierarchical Models

• Hierarchical models are principled way of determining how much pooling to apply.

• Pull estimates toward the population mean based on amount of variation in population – low variance population: more pooling – high variance population: less pooling

• In the limit – as variance goes to 0, get complete pooling – as variance goes to ∞, get no pooling

61 Hierarchical Batting Ability

• Instead of fixed priors, estimate priors along with other parameters

• Still only uses data once for a single model fit

• Data: yn,Bn: hits, at-bats for player n

• Parameters: θn: ability for player n

• Hyperparameters: α, β: population mean and variance

• Hyperpriors: fixed priors on α and β (hardcoded)

62 Hierarchical Batting Model (cont.)

$y_n \sim \mathrm{Binomial}(B_n, \theta_n)$
$\theta_n \sim \mathrm{Beta}(\alpha, \beta)$
$\alpha / (\alpha + \beta) \sim \mathrm{Uniform}(0, 1)$
$(\alpha + \beta) \sim \mathrm{Pareto}(1.5)$

• Sampling notation is syntactic sugar for:

  $p(y, \theta, \alpha, \beta) = \mathrm{Pareto}(\alpha + \beta \mid 1.5) \prod_{n=1}^{N} \left[ \mathrm{Binomial}(y_n \mid B_n, \theta_n)\; \mathrm{Beta}(\theta_n \mid \alpha, \beta) \right]$

• Pareto provides a power law: $\mathrm{Pareto}(u \mid \alpha) \propto \frac{\alpha}{u^{\alpha + 1}}$

• Should use more informative hyperpriors!

63 Stan Program

data {
  int<lower=0> N;                      // items
  int<lower=0> K[N];                   // initial trials
  int<lower=0> y[N];                   // initial successes
}
parameters {
  real<lower=0, upper=1> phi;          // population chance of success
  real<lower=1> kappa;                 // population concentration
  vector<lower=0, upper=1>[N] theta;   // chance of success
}
model {
  kappa ~ pareto(1, 1.5);                        // hyperprior
  theta ~ beta(phi * kappa, (1 - phi) * kappa);  // prior
  y ~ binomial(K, theta);                        // likelihood
}
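• A sketch of fitting this model from R (the file name hier.stan and the simulated data are assumptions for illustration):

library(rstan)
N <- 50
K <- rep(45, N)                                  # 45 at-bats per player
theta_true <- rbeta(N, 0.27 * 400, 0.73 * 400)   # population like the slides
y <- rbinom(N, K, theta_true)
fit <- stan("hier.stan", data = list(N = N, K = K, y = y))
print(fit, pars = c("phi", "kappa"))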

64 Hierarchical Prior Posterior

• Draws from posterior (crosshairs at posterior mean)

• Prior population mean: 0.271

• Prior population scale: 400

• Together yield prior std dev of 0.022

• Mean is better estimated than scale (typical)

65 Posterior Ability (High Avg Players)

• Histogram of posterior draws for high-average players

• Note uncertainty grows with lower at-bats

66 Multiple Comparisons

• Who has the highest ability (based on this data)?

• Probability player n is best is

  Average  At-Bats  Pr[best]
    .347     521      0.12
    .343     623      0.11
    .342     482      0.08
    .330     648      0.04
    .330     607      0.04
    .367      60      0.02
    .322     695      0.02

• No clear winner—sample size matters.

• In the last game (of 162), Mauer (Minnesota) edged out Jeter (NY)

67 Efron & Morris (1975) Data

   FirstName  LastName    Hits  At.Bats  Rest.At.Bats  Rest.Hits
1  Roberto    Clemente      18       45           367        127
2  Frank      Robinson      17       45           426        127
3  Frank      Howard        16       45           521        144
4  Jay        Johnstone     15       45           275         61
5  Ken        Berry         14       45           418        114
6  Jim        Spencer       14       45           466        126
7  Don        Kessinger     13       45           586        155
8  Luis       Alvarado      12       45           138         29
9  Ron        Santo         11       45           510        137
10 Ron        Swaboda       11       45           200         46
11 Rico       Petrocelli    10       45           538        142

68 Pooling Estimates

• Case Study: Repeated Binary Trials (mc-stan.org)

69 Ranking

generated quantities {
  int rnk[N];      // rank of player n
  {
    int dsc[N];
    dsc <- sort_indices_desc(theta);
    for (n in 1:N)
      rnk[dsc[n]] <- n;
  }
}

70 Posterior Ranks

71 Who is Best? Stan Code

generated quantities {
  ...
  int is_best[N];  // Pr[player n highest chance of success]
  ...
  for (n in 1:N)
    is_best[n] <- (rnk[n] == 1);
  ...

72 Who is Best? Posterior

73 Posterior Predictive Inference

• How do we predict new outcomes (e.g., rest of season)?

data {
  int K_new[N];    // new trials
  int y_new[N];    // new successes
  ...
generated quantities {
  int z[N];        // posterior prediction
  for (n in 1:N)
    z[n] <- binomial_rng(K_new[n], theta[n]);

74 Posterior Predictions

75 Posterior Predictive Check

• Replicate data from parameters

generated quantities {
  ...
  for (n in 1:N)
    y_rep[n] <- binomial_rng(K[n], theta[n]);
  for (n in 1:N)
    y_pop_rep[n] <- binomial_rng(K[n],
                      beta_rng(phi * kappa, (1 - phi) * kappa));
  min_y_rep <- min(y_rep);
  sd_y_rep <- sd(to_vector(y_rep));
  p_min <- (min_y_rep >= min_y);
  p_sd <- (sd_y_rep >= sd_y);
}

76 Posterior p-Values

77 Part IV Stan Overview

78 Stan’s Namesake

• Stanislaw Ulam (1909–1984)

• Co-inventor of Monte Carlo method (and hydrogen bomb)

• Ulam holding the Fermiac, Enrico Fermi’s physical Monte Carlo simulator for random neutron diffusion

79 What is Stan?

• Stan is an imperative probabilistic programming language – cf., BUGS: declarative; Church: functional; Figaro: object-oriented

• Stan program – declares data and (constrained) parameter variables – defines log posterior (or penalized likelihood)

• Stan inference – MCMC for full Bayesian inference – VB for approximate Bayesian inference – MLE for penalized maximum likelihood estimation

80 Platforms and Interfaces

• Platforms: Linux, Mac OS X, Windows

• C++ API: portable, standards compliant (C++03)

• Interfaces – CmdStan: Command-line or shell interface (direct executable) – RStan: R interface (Rcpp in memory) – PyStan: Python interface (Cython in memory) – MatlabStan: MATLAB interface (external process) – Stan.jl: Julia interface (external process) – StataStan: Stata interface (external process) [under testing]

• Posterior Visualization & Exploration – ShinyStan: Shiny (R) web-based

81 Higher-Level Interfaces

• R Interfaces – RStanArm: Regression modeling with R expressions – ShinyStan: Web-based posterior visualization, exploration – Loo: Approximate leave-one-out cross-validation

• Containers – Dockerized Jupyter (iPython) Notebooks (R, Python, or Julia)

82 Who’s Using Stan?

• 1600+ users group registrations; 10,000 manual downloads (2.5.0); 300+ Google Scholar citations

• Biological sciences: clinical drug trials, entomology, ophthalmology, neurology, genomics, agriculture, botany, fisheries, cancer biology, epidemiology, population ecology

• Physical sciences: astrophysics, molecular biology, oceanography, climatology, biogeochemistry

• Social sciences: population dynamics, psycholinguistics, social networks, political science, surveys

• Other: materials engineering, finance, actuarial, sports, public health, recommender systems, educational testing, equipment maintenance

83 Documentation

• Stan User’s Guide and Reference Manual – 500+ pages – Example models, modeling and programming advice – Introduction to Bayesian and frequentist statistics – Complete language specification and execution guide – Descriptions of algorithms (NUTS, R-hat, n_eff) – Guide to built-in distributions and functions

• Installation and getting started manuals by interface – RStan, PyStan, CmdStan, MatlabStan, Stan.jl – RStan vignette

84 Books and Model Sets

• Model Sets Translated to Stan – BUGS and JAGS examples (most of all 3 volumes) – Gelman and Hill (2009) Data Analysis Using Regression and Multilevel/Hierarchical Models – Wagenmakers and Lee (2014) Bayesian Cognitive Modeling

• Books with Sections on Stan – Gelman et al. (2013) Bayesian Data Analysis, 3rd Edition. – Kruschke (2014) Doing Bayesian Data Analysis, Second Edition: A Tutorial with R, JAGS, and Stan – Korner-Nievergelt et al. (2015) Bayesian Data Analysis in Ecology Using Linear Models with R, BUGS, and Stan

85 Scaling and Evaluation

[Plot: data size in bytes (1e6–1e18) vs. model complexity; the “big model, big data” approach moves toward the state of the art]

• Types of Scaling: data, parameters, models

• Time to converge and per effective sample size: 0.5–∞ times faster than BUGS & JAGS

• Memory usage: 1–10% of BUGS & JAGS

86 The No-U-Turn Sampler: NUTS vs. Gibbs and Metropolis

• Two dimensions of a highly correlated 250-dim normal

• 1,000,000 draws from Metropolis and Gibbs (thinned to 1,000 for display)

• 1,000 draws from NUTS; 1,000 independent draws

[Figure 7 of Hoffman and Gelman: samples generated by random-walk Metropolis, Gibbs sampling, and NUTS for a highly correlated 250-dimensional distribution; only the first two dimensions shown]

87 Stan's Autodiff vs. Alternatives

• Among C++ open-source offerings: Stan is fastest (for gradients), most general (functions supported), and most easily extensible (simple OO)

[Benchmark plots: time relative to Stan vs. dimensions for matrix_product_eigen and normal_log_density_stan, comparing the systems adept, adolc, cppad, double, sacado, and stan]

88 Stan is a Programming Language

• Not a graphical specification language like BUGS or JAGS

• Stan is a Turing-complete imperative programming language for specifying differentiable log densities – reassignable local variables and scoping – full conditionals and loops – functions (including recursion)

• With automatic “black-box” inference on top (though even that is tunable)

• Programs computing same thing may have different efficiency

89 Basic Program Blocks

• data (once) – content: declare data types, sizes, and constraints – execute: read from data source, validate constraints

• parameters (every log prob eval) – content: declare parameter types, sizes, and constraints – execute: transform to constrained, Jacobian

• model (every log prob eval) – content: statements defining posterior density – execute: execute statements

90 Derived Variable Blocks

• transformed data (once after data) – content: declare and define transformed data variables – execute: execute definition statements, validate constraints

• transformed parameters (every log prob eval) – content: declare and define transformed parameter vars – execute: execute definition statements, validate constraints

• generated quantities (once per draw, double type) – content: declare and define generated quantity variables; includes pseudo-random number generators (for posterior predictions, event probabilities, decision making) – execute: execute definition statements, validate constraints

91 Variable and Expression Types

Variables and expressions are strongly, statically typed.

• Primitive: int, real

• Matrix: matrix[M,N], vector[M], row_vector[N]

• Bounded: primitive or matrix, with <lower=...>, <upper=...>

• Constrained Vectors: simplex[K], ordered[N], positive_ordered[N], unit_vector[N]

• Constrained Matrices: cov_matrix[K], corr_matrix[K], cholesky_factor_cov[M,N], cholesky_factor_corr[K]

• Arrays: of any type (and dimensionality)

92 Integers vs. Reals

• Different types (conflated in BUGS, JAGS, and R)

• Distributions and assignments care

• Integers may be assigned to reals but not vice-versa

• Reals have not-a-number, and positive and negative infinity

• Integers single-precision up to +/- 2 billion

• Integer division rounds (Stan provides warning)

• Real arithmetic is inexact and reals should not be (usually) compared with ==

93 Arrays vs. Matrices

• Stan separates arrays, matrices, vectors, row vectors

• Which to use?

• Arrays allow most efficient access (no copying)

• Arrays stored first-index major (i.e., 2D are row major)

• Vectors and matrices required for matrix and linear algebra functions

• Matrices stored column-major

• Are not assignable to each other, but there are conversion functions

94 Logical Operators

Op.   Prec.  Assoc.  Placement      Description
||    9      left    binary infix   logical or
&&    8      left    binary infix   logical and
==    7      left    binary infix   equality
!=    7      left    binary infix   inequality
<     6      left    binary infix   less than
<=    6      left    binary infix   less than or equal
>     6      left    binary infix   greater than
>=    6      left    binary infix   greater than or equal

95 Arithmetic and Matrix Operators

Op.   Prec.  Assoc.  Placement       Description
+     5      left    binary infix    addition
-     5      left    binary infix    subtraction
*     4      left    binary infix    multiplication
/     4      left    binary infix    (right) division
\     3      left    binary infix    left division
.*    2      left    binary infix    elementwise multiplication
./    2      left    binary infix    elementwise division
!     1      n/a     unary prefix    logical negation
-     1      n/a     unary prefix    negation
+     1      n/a     unary prefix    promotion (no-op in Stan)
^     2      right   binary infix    exponentiation
'     0      n/a     unary postfix   transposition
()    0      n/a     prefix, wrap    function application
[]    0      left    prefix, wrap    array, matrix indexing

96 Built-in Math Functions

• All built-in C++ functions and operators: C math, TR1, C++11, including all trig, pow, and special functions: log1m, erf, erfc, fma, atan2, etc.

• Extensive library of statistical functions e.g., softmax, log gamma and digamma functions, beta functions, Bessel functions of first and second kind, etc.

• Efficient, arithmetically stable compound functions e.g., multiply log, log sum of exponentials, log inverse logit

97 Built-in Matrix Functions

• Basic arithmetic: all arithmetic operators

• Elementwise arithmetic: vectorized operations

• Solvers: matrix division, (log) determinant, inverse

• Decompositions: QR, Eigenvalues and Eigenvectors, Cholesky factorization, singular value decomposition

• Compound Operations: quadratic forms, variance scaling, etc.

• Ordering, Slicing, Broadcasting: sort, rank, block, rep

• Reductions: sum, product, norms

• Specializations: triangular, positive-definite, . . .

98 User-Defined Functions

• functions (compiled with model) – content: declare and define general (recursive) functions (use them elsewhere in program) – execute: compile with model

• Example

functions {
  real relative_difference(real u, real v) {
    return 2 * fabs(u - v) / (fabs(u) + fabs(v));
  }
}

99 Differential Equation Solver

• System expressed as function – given state (y), time (t), parameters (θ), and data (x) – returns derivatives (∂y/∂t) of state w.r.t. time

• Simple harmonic oscillator diff eq

real[] sho(real t,         // time
           real[] y,       // system state
           real[] theta,   // params
           real[] x_r,     // real data
           int[] x_i) {    // int data
  real dydt[2];
  dydt[1] <- y[2];
  dydt[2] <- -y[1] - theta[1] * y[2];
  return dydt;
}

100 Differential Equation Solver

• Solution via functional, given initial state (y0), initial time (t0), desired solution times (ts)

mu_y <- integrate_ode(sho, y0, t0, ts, theta, x_r, x_i);

• Use noisy measurements of y to estimate θ

y ~ normal(mu_y, sigma);

– Pharmacokinetics/pharmacodynamics (PK/PD), – soil carbon respiration with biomass input and breakdown

101 Diff Eq Derivatives

• Need derivatives of solution w.r.t. parameters

• Couple derivatives of system w.r.t. parameters

  $\frac{\partial}{\partial t} \left( y,\; \frac{\partial y}{\partial \theta} \right)$

• Calculate coupled system via nested autodiff of second term

  $\frac{\partial}{\partial \theta} \frac{\partial y}{\partial t}$

102 Distribution Library

• Each distribution has – log density or mass function – cumulative distribution functions, plus complementary versions, plus log scale – pseudo-random number generators

• Alternative parameterizations (e.g., Cholesky-based multi-normal, log-scale Poisson, logit-scale Bernoulli)

• New multivariate correlation matrix density: LKJ degrees of freedom controls shrinkage to (expansion from) unit matrix

103 Statements

• Sampling: y ~ normal(mu,sigma) (increments log probability)

• Log probability: increment_log_prob(lp);

• Assignment: y_hat <- x * beta;

• For loop: for (n in 1:N) ...

• While loop: while (cond) ...

• Conditional: if (cond) ...; else if (cond) ...; else ...;

• Block: { ... } (allows local variables)

• Print: print("theta=",theta);

• Reject: reject("arg to foo must be positive, found y=", y);

104 “Sampling” Increments Log Prob

• A Stan program defines a log posterior – typically through log joint and Bayes’s rule

• Sampling statements are just “syntactic sugar”

• A shorthand for incrementing the log posterior

• The following define the same∗ posterior

– y ~ poisson(lambda); – increment_log_prob(poisson_log(y, lambda));

• ∗ up to a constant

• Sampling statement drops constant terms

105 Local Variable Scope Blocks

• y ~ bernoulli(theta);

is more efficient with sufficient statistics

{
  int sum_y;                 // local variable
  sum_y <- 0;
  for (n in 1:N)
    sum_y <- sum_y + y[n];   // reassignment
  sum_y ~ binomial(N, theta);
}

• Simpler, but roughly same efficiency:

sum(y) ~ binomial(N, theta);

106 Print and Reject

• Print statements are for debugging – printed every log prob evaluation – print values in the middle of programs – check when log density becomes undefined – can embed in conditionals

• Reject statements are for error checking – typically function argument checks – cause a rejection of current state (0 density)

107 Prob Function Vectorization

• Stan's probability functions are vectorized for speed – removes repeated computations (e.g., −log σ in normal) – reduces size of expression graph for differentiation

• Consider: y ~ normal(mu, sigma);

• Each of y, mu, and sigma may be any of – scalars (integer or real) – vectors (row or column) – 1D arrays

• All dimensions must be scalars or have matching sizes

• Scalars are broadcast (repeated)

108 Transforms: Precision

parameters {
  real<lower=0> tau;    // precision
  ...
}
transformed parameters {
  real<lower=0> sigma;  // sd
  sigma <- 1 / sqrt(tau);
}

109 Transforms: “Matt Trick”

parameters {
  vector[K] beta_raw;   // non-centered
  real mu;
  real<lower=0> sigma;
}
transformed parameters {
  vector[K] beta;       // centered
  beta <- mu + sigma * beta_raw;
}
model {
  mu ~ cauchy(0, 2.5);
  sigma ~ cauchy(0, 2.5);
  beta_raw ~ normal(0, 1);
}

110 Linear Regression (Normal Noise)

• Likelihood:
  – y_n = α + β x_n + ε_n
  – ε_n ∼ Normal(0, σ) for n ∈ 1:N

• Equivalently,
  – y_n ∼ Normal(α + β x_n, σ)

• Priors (improper)
  – σ ∼ Uniform(0, ∞)
  – α, β ∼ Uniform(−∞, ∞)

• Stan allows improper priors; requires a proper posterior.

111 Linear Regression: Stan Code

data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  y ~ normal(alpha + beta * x, sigma);
  // equivalent, but slower:
  // for (n in 1:N)
  //   y[n] ~ normal(alpha + beta * x[n], sigma);
}

112 Logistic Regression

data {
  int<lower=1> K;
  int<lower=0> N;
  matrix[N, K] x;
  int<lower=0, upper=1> y[N];
}
parameters {
  vector[K] beta;
}
model {
  beta ~ cauchy(0, 2.5);          // prior
  y ~ bernoulli_logit(x * beta);  // likelihood
}

113 Time Series Autoregressive: AR(1)

data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  for (n in 2:N)
    y[n] ~ normal(alpha + beta * y[n-1], sigma);
}

114 Covariance Random-Effects Priors

parameters {
  vector[2] beta[G];
  cholesky_factor_corr[2] L_Omega;
  vector<lower=0>[2] sigma;
  ...
}
model {
  sigma ~ cauchy(0, 2.5);
  L_Omega ~ lkj_corr_cholesky(4);
  beta ~ multi_normal_cholesky(rep_vector(0, 2),
                               diag_pre_multiply(sigma, L_Omega));
  for (n in 1:N)
    y[n] ~ bernoulli_logit(... + x[n] * beta[gg[n]]);
  ...
}

115 Example: Gaussian Process Estimation

data {
  int<lower=1> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real<lower=0> eta_sq;
  real<lower=0> inv_rho_sq;
  real<lower=0> sigma_sq;
}
transformed parameters {
  real<lower=0> rho_sq;
  rho_sq <- inv(inv_rho_sq);
}
model {
  matrix[N, N] Sigma;
  for (i in 1:(N-1)) {
    for (j in (i+1):N) {
      Sigma[i, j] <- eta_sq * exp(-rho_sq * square(x[i] - x[j]));
      Sigma[j, i] <- Sigma[i, j];
    }
  }
  for (k in 1:N)
    Sigma[k, k] <- eta_sq + sigma_sq;
  eta_sq ~ cauchy(0, 5);
  inv_rho_sq ~ cauchy(0, 5);
  sigma_sq ~ cauchy(0, 5);
  y ~ multi_normal(rep_vector(0, N), Sigma);
}

116 Posterior Predictive Inference

• Parameters θ, observed data y, and data to predict ỹ

  $p(\tilde{y} \mid y) = \int_\Theta p(\tilde{y} \mid \theta)\, p(\theta \mid y)\, d\theta$

data {
  int N_tilde;
  matrix[N_tilde, K] x_tilde;
  ...
parameters {
  vector[N_tilde] y_tilde;
  ...
model {
  y_tilde ~ normal(x_tilde * beta, sigma);

117 Predict w. Generated Quantities

• Replace sampling with pseudo-random number generation

generated quantities {
  vector[N_tilde] y_tilde;
  for (n in 1:N_tilde)
    y_tilde[n] <- normal_rng(x_tilde[n] * beta, sigma);
}

• Must include noise for predictive uncertainty

• PRNGs only allowed in generated quantities block – more computationally efficient per iteration – more statistically efficient with i.i.d. samples (i.e., MC, not MCMC)

118 Part V Monte Carlo Integration

119 Monte Carlo Calculation of π

• Computing π = 3.14... via simulation is the textbook application of Monte Carlo methods.

• Generate points uniformly at random within the square

• Calculate the proportion within the circle (x² + y² < 1) and multiply by the square's area (4) to produce the area of the circle.

• This area is π (radius is 1, so area is πr² = π)

Plot by Mysid Yoderj courtesy of Wikipedia.

120 Monte Carlo Calculation of π (cont.)

• R code to calculate π with Monte Carlo simulation:

> x <- runif(1e6, -1, 1)
> y <- runif(1e6, -1, 1)
> prop_in_circle <- sum(x^2 + y^2 < 1) / 1e6
> 4 * prop_in_circle
[1] 3.144032

121 Accuracy of Monte Carlo

• Monte Carlo is not an approximation!

• It can be made exact to within any ε > 0

• Monte Carlo draws are i.i.d. by definition

• Central limit theorem: expected error decreases at rate of 1/√N

• 3 decimal places of accuracy with sample size 1e6

• Need a 100× larger sample for each additional digit of accuracy

123 General Monte Carlo Integration

• MC can calculate arbitrary definite integrals,

  $\int_a^b f(x)\, dx$

• Let d upper bound f(x) on (a, b); tightness determines computational efficiency

• Then generate random points uniformly in the rectangle bounded by (a, b) and (0, d)

• Multiply the proportion of draws (x, y) where y < f(x) by the area of the rectangle, d × (b − a).

• Can be generalized to multiple dimensions in the obvious way

Image courtesy of Jeff Cruzan, http://www.drcruzan.com/NumericalIntegration.html
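• A minimal R sketch of this rectangle method, with f(x) = sin(x) on (0, π) chosen for illustration (exact integral 2):

set.seed(1)
a <- 0; b <- pi; d <- 1            # d = 1 bounds sin(x) on (0, pi)
M <- 1e6
x <- runif(M, a, b)
y <- runif(M, 0, d)
mean(y < sin(x)) * d * (b - a)     # approx 2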

123 Expectations of Function of R.V.

• Suppose f (θ) is a function of random variable vector θ

• Suppose the density of θ is p(θ) – Warning: θ overloaded as random and bound variable

• Then f(θ) is also a random variable, with expectation

  $\mathbb{E}[f(\theta)] = \int_\Theta f(\theta)\, p(\theta)\, d\theta.$

  – where Θ is the support of p(θ) (i.e., Θ = {θ | p(θ) > 0})

124 QoI as Expectations

• Most Bayesian quantities of interest (QoI) are expectations over the posterior p(θ | y) of functions f(θ)

• Bayesian parameter estimation: θ̂
  – f(θ) = θ
  – θ̂ = E[θ | y] minimizes expected square error

• Bayesian parameter (co)variance estimation: var[θ | y]
  – f(θ) = (θ − θ̂)²

• Bayesian event probability: Pr[A | y]
  – f(θ) = I(θ ∈ A)

125 Expectations via Monte Carlo

• Generate draws θ^(1), θ^(2), . . . , θ^(M) from p(θ)

• Monte Carlo estimator plugs in the average for the expectation:

  $\mathbb{E}[f(\theta) \mid y] \approx \frac{1}{M} \sum_{m=1}^{M} f(\theta^{(m)})$

• Can be made as accurate as desired, because

  $\mathbb{E}[f(\theta)] = \lim_{M \to \infty} \frac{1}{M} \sum_{m=1}^{M} f(\theta^{(m)})$

• Reminder: By the CLT, error goes down as 1/√M
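• For instance, with Beta draws standing in for posterior draws (an illustrative R sketch, shape parameters assumed):

draws <- rbeta(4000, 10, 44)          # stand-in for posterior draws
mean(draws)                           # estimates E[theta | y] = 10/54
mean((draws - mean(draws))^2)         # estimates var[theta | y]
mean(draws > 0.25)                    # estimates Pr[theta > 0.25 | y]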

126 Part VI Markov Chain Monte Carlo

127 Markov Chain Monte Carlo

• Standard Monte Carlo draws i.i.d. draws

θ(1), . . . , θ(M)

according to a probability function p(θ)

• Drawing an i.i.d. sample is often impossible when dealing with complex densities like Bayesian posteriors p(θ | y)

• So we use Markov chain Monte Carlo (MCMC) in these cases and draw θ^(1), . . . , θ^(M) from a Markov chain

128 Markov Chains

• A Markov chain is a sequence of random variables

  θ^(1), θ^(2), . . . , θ^(M)

  such that θ^(m) only depends on θ^(m−1), i.e.,

  $p(\theta^{(m)} \mid y, \theta^{(1)}, \ldots, \theta^{(m-1)}) = p(\theta^{(m)} \mid y, \theta^{(m-1)})$

• Drawing θ^(1), . . . , θ^(M) from a Markov chain according to p(θ^(m) | θ^(m−1), y) is more tractable

• Require the marginal of each draw, p(θ^(m) | y), to be equal to the true posterior

129 Applying MCMC

• Plug in just like ordinary (non-Markov chain) Monte Carlo

• Adjust standard errors for dependence in Markov chain

130 MCMC for Posterior Mean

• Standard Bayesian estimator is the posterior mean

  $\hat{\theta} = \int_\Theta \theta\, p(\theta \mid y)\, d\theta$

  – Posterior mean minimizes expected square error

• Estimate is a conditional expectation

  $\hat{\theta} = \mathbb{E}[\theta \mid y]$

• Compute by averaging

  $\hat{\theta} \approx \frac{1}{M} \sum_{m=1}^{M} \theta^{(m)}$

131 MCMC for Posterior Variance

• Posterior variance works the same way,

  $\mathbb{E}[(\theta - \mathbb{E}[\theta \mid y])^2 \mid y] = \mathbb{E}[(\theta - \hat{\theta})^2 \mid y] \approx \frac{1}{M} \sum_{m=1}^{M} (\theta^{(m)} - \hat{\theta})^2$

132 MCMC for Event Probability

• Event probabilities are also expectations, e.g.,

  $\Pr[\theta_1 > \theta_2] = \mathbb{E}[\mathrm{I}[\theta_1 > \theta_2]] = \int_\Theta \mathrm{I}[\theta_1 > \theta_2]\, p(\theta \mid y)\, d\theta.$

• Estimation via MCMC is just another plug-in:

  $\Pr[\theta_1 > \theta_2] \approx \frac{1}{M} \sum_{m=1}^{M} \mathrm{I}[\theta_1^{(m)} > \theta_2^{(m)}]$

• Again, can be made as accurate as necessary

133 MCMC for Quantiles (incl. median)

• These are not expectations, but still plug in

• Alternative Bayesian estimator is posterior median – Posterior median minimizes expected absolute error

• Estimate as median draw of θ(1), . . . , θ(M) – just sort and take halfway value – e.g., Stan shows 50% point (or other quantiles)

• Other quantiles including interval bounds similar – estimate with quantile of draws – estimation error goes up in tail (based on fewer draws)

134 Part VII MCMC Algorithms

135 Random-Walk Metropolis

• Draw random initial parameter vector θ(1) (in support)

• For m ∈ 2:M

  – Sample a proposal from a (symmetric) jumping distribution, e.g.,

    $\theta^* \sim \mathrm{MultiNormal}(\theta^{(m-1)}, \sigma I)$

    where I is the identity matrix

  – Draw u^(m) ∼ Uniform(0, 1) and set

    $\theta^{(m)} = \begin{cases} \theta^* & \text{if } u^{(m)} < \dfrac{p(\theta^* \mid y)}{p(\theta^{(m-1)} \mid y)} \\ \theta^{(m-1)} & \text{otherwise} \end{cases}$
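• A minimal R sketch of this recipe in one dimension, with a toy Beta(10, 44) target and an assumed proposal scale:

set.seed(1234)
log_p <- function(theta)                    # unnormalized log target;
  dbeta(theta, 10, 44, log = TRUE)          # -Inf outside (0, 1)
M <- 10000; sigma <- 0.1                    # sigma is the proposal scale
theta <- numeric(M)
theta[1] <- 0.5                             # initial value in support
for (m in 2:M) {
  proposal <- rnorm(1, theta[m - 1], sigma)
  accept <- log(runif(1)) < log_p(proposal) - log_p(theta[m - 1])
  theta[m] <- if (accept) proposal else theta[m - 1]
}
mean(theta)                                 # approx 10/54 = 0.185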

136 Metropolis and Normalization

• Metropolis only uses the posterior in a ratio:

  $\frac{p(\theta^* \mid y)}{p(\theta^{(m)} \mid y)}$

• This allows the use of unnormalized densities

• Recall Bayes's rule:

  $p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta)$

• Thus we only need to evaluate the sampling distribution (likelihood) and prior
  – i.e., no need to compute the normalizing integral for p(y),

  $\int_\Theta p(y \mid \theta)\, p(\theta)\, d\theta$

137 Metropolis-Hastings

• Generalizes Metropolis to asymmetric proposals

• Acceptance ratio is

  $\frac{J(\theta^{(m-1)} \mid \theta^*) \times p(\theta^* \mid y)}{J(\theta^* \mid \theta^{(m-1)}) \times p(\theta^{(m-1)} \mid y)}$

  where J is the (potentially asymmetric) proposal density

• i.e.,

  $\frac{\text{probability of being at } \theta^* \text{ and jumping to } \theta^{(m-1)}}{\text{probability of being at } \theta^{(m-1)} \text{ and jumping to } \theta^*}$

138 Metropolis-Hastings (cont.)

• General form ensures equilibrium by maintaining detailed balance

• Like Metropolis, only requires ratios

• Many algorithms involve a Metropolis-Hastings “correction”

– Including vanilla HMC and RHMC and ensemble samplers

139 Detailed Balance & Reversibility

• Definition is measure theoretic, but applies to densities – just like Bayes’s rule

• Assume Markov chain has stationary density p(a)

• Suppose π(a | b) is the density of transitioning from b to a
  – use of π indicates a different measure on Θ than p

• Detailed balance is a reversibility equilibrium condition

  $p(a)\, \pi(b \mid a) = p(b)\, \pi(a \mid b)$

140 Optimal Proposal Scale?

• Proposal scale σ is a free parameter; too low or too high is inefficient

• Traceplots show parameter value on y axis, iterations on x

• Empirical tuning problem; theoretical optima exist for some cases

Roberts and Rosenthal (2001) Optimal Scaling for Various Metropolis-Hastings Algorithms. Statistical Science.

141 Convergence

• Imagine releasing a hive of bees in a sealed house – they disperse, but eventually reach equilibrium where the same number of bees leave a room as enter it (on average)

• May take many iterations for a Markov chain to reach equilibrium

142 Convergence: Example

• Four chains with different starting points – Left: 50 iterations – Center: 1000 iterations – Right: Draws from second half of each chain

Gelman et al., Bayesian Data Analysis

143 Potential Scale Reduction (R̂)

• Gelman & Rubin recommend M chains of N draws with diffuse initializations

• Measure that each chain has same posterior mean and variance

• If not, may be stuck in multiple modes or just not converged yet

• Define statistic R̂ of chains s.t. at convergence, R̂ → 1
  – R̂ ≫ 1 implies non-convergence
  – R̂ ≈ 1 does not guarantee convergence
  – Only measures marginals

144 Split R̂

• Vanilla R̂ may not diagnose non-stationarity
  – e.g., a sequence of chains with an increasing parameter

• Split R̂: Stan splits each chain into first and second half
  – start with M Markov chains of N draws each
  – split each in half to create 2M chains of N/2 draws
  – then apply R̂ to the 2M chains

145 Calculating R̂ Statistic: Between

• M chains of N draws each

• Between-sample variance estimate

  $B = \frac{N}{M - 1} \sum_{m=1}^{M} (\bar{\theta}_m^{(\bullet)} - \bar{\theta}^{(\bullet\bullet)})^2,$

  where

  $\bar{\theta}_m^{(\bullet)} = \frac{1}{N} \sum_{n=1}^{N} \theta_m^{(n)} \quad\text{and}\quad \bar{\theta}^{(\bullet\bullet)} = \frac{1}{M} \sum_{m=1}^{M} \bar{\theta}_m^{(\bullet)}.$

146 Calculating R̂ (cont.)

• M chains of N draws each

• Within-sample variance estimate:

  $W = \frac{1}{M} \sum_{m=1}^{M} s_m^2,$

  where

  $s_m^2 = \frac{1}{N - 1} \sum_{n=1}^{N} (\theta_m^{(n)} - \bar{\theta}_m^{(\bullet)})^2.$

147 Calculating R̂ Statistic (cont.)

• Variance estimate:

  $\widehat{\mathrm{var}}^{+}(\theta \mid y) = \frac{N - 1}{N}\, W + \frac{1}{N}\, B,$

  recall that W is the within-chain variance and B the between-chain variance

• Potential scale reduction statistic (“R hat”)

  $\hat{R} = \sqrt{\frac{\widehat{\mathrm{var}}^{+}(\theta \mid y)}{W}}.$

148 Correlations in Posterior Draws

• Markov chains typically display autocorrelation in the series of draws θ^(1), . . . , θ^(m)

• Without i.i.d. draws, the central limit theorem does not apply

• Effective sample size N_eff divides out the autocorrelation

• N_eff must be estimated from the sample
  – fast Fourier transform computes correlations at all lags

• Estimation accuracy is proportional to 1/√N_eff

149 Reducing Posterior Correlation

• Tuning algorithm parameters to ensure good mixing

• Recall Metropolis traceplots of Roberts and Rosenthal:

• Good jump scale σ produces good mixing and high Neff

150 Effective Sample Size

• Autocorrelation at lag t is the correlation between subsequences
  – (θ^(1), . . . , θ^(N−t)) and (θ^(1+t), . . . , θ^(N))

• Suppose the chain has density p(θ) with
  – E[θ] = µ and Var[θ] = σ²

• Autocorrelation ρ_t at lag t ≥ 0:

  $\rho_t = \frac{1}{\sigma^2} \int_\Theta (\theta^{(n)} - \mu)(\theta^{(n+t)} - \mu)\, p(\theta)\, d\theta$

• Because p(θ^(n)) = p(θ^(n+t)) = p(θ) at convergence,

  $\rho_t = \frac{1}{\sigma^2} \int_\Theta \theta^{(n)}\, \theta^{(n+t)}\, p(\theta)\, d\theta$

151 Estimating Autocorrelations

• Effective sample size (N draws in the chain) is defined by

  $N_{\mathrm{eff}} = \frac{N}{\sum_{t=-\infty}^{\infty} \rho_t} = \frac{N}{1 + 2 \sum_{t=1}^{\infty} \rho_t}$

• Estimate in terms of variograms (M chains) at lag t
  – calculate with fast Fourier transform (FFT)

  $V_t = \frac{1}{M} \sum_{m=1}^{M} \frac{1}{N_m - t} \sum_{n=t+1}^{N_m} \left( \theta_m^{(n)} - \theta_m^{(n-t)} \right)^2$

• Adjust autocorrelation at lag t using cross-chain variance as

  $\hat{\rho}_t = 1 - \frac{V_t}{2\, \widehat{\mathrm{var}}^{+}}$

• If not converged, $\widehat{\mathrm{var}}^{+}$ overestimates variance

152 Estimating N_eff

• Let T′ be the first lag s.t. ρ_{T′+1} < 0

• Estimate the effective sample size by

  $\hat{N}_{\mathrm{eff}} = \frac{MN}{1 + 2 \sum_{t=1}^{T'} \hat{\rho}_t}$

• NUTS avoids negative autocorrelations, so the first negative autocorrelation estimate is reasonable

• For basics (not our estimates), see Charles Geyer (2013) Introduction to MCMC. In Handbook of MCMC. (free online at http://www.mcmchandbook.net/index.html)
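• A crude single-chain R sketch of this cutoff rule (not Stan's multi-chain FFT estimator):

ess <- function(theta) {
  rho <- acf(theta, lag.max = length(theta) - 1, plot = FALSE)$acf[-1]
  T_prime <- which(rho < 0)[1] - 1        # last lag before first negative
  if (is.na(T_prime)) T_prime <- length(rho)
  length(theta) / (1 + 2 * sum(rho[seq_len(T_prime)]))
}
ess(rnorm(1000))           # close to 1000 for i.i.d. draws
ess(cumsum(rnorm(1000)))   # far smaller for a random walk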

153 Gibbs Sampling

• Draw random initial parameter vector θ(1) (in support)

• For m ∈ 2:M
  – For n ∈ 1:N:
    * draw θ_n^(m) according to the conditional

      $p(\theta_n \mid \theta_1^{(m)}, \ldots, \theta_{n-1}^{(m)}, \theta_{n+1}^{(m-1)}, \ldots, \theta_N^{(m-1)}, y).$

• e.g., with θ = (θ₁, θ₂, θ₃):
  – draw θ₁^(m) according to p(θ₁ | θ₂^(m−1), θ₃^(m−1), y)
  – draw θ₂^(m) according to p(θ₂ | θ₁^(m), θ₃^(m−1), y)
  – draw θ₃^(m) according to p(θ₃ | θ₁^(m), θ₂^(m), y)
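• An R sketch of Gibbs for a standard bivariate normal with correlation ρ, whose conditionals are known in closed form (ρ and length assumed):

set.seed(7)
rho <- 0.9; M <- 5000
theta <- matrix(0, M, 2)
for (m in 2:M) {
  # theta1 | theta2 ~ Normal(rho * theta2, sqrt(1 - rho^2)), and vice versa
  theta[m, 1] <- rnorm(1, rho * theta[m - 1, 2], sqrt(1 - rho^2))
  theta[m, 2] <- rnorm(1, rho * theta[m, 1],     sqrt(1 - rho^2))
}
cor(theta)   # off-diagonal approx rho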

154 Generalized Gibbs

• “Proper” Gibbs requires conditional Monte Carlo draws – typically works only for conjugate priors

• In the general case, may need to use less efficient conditional draws
  – Slice sampling is a popular general technique that works for discrete or continuous θ_n (JAGS)
  – Adaptive rejection sampling is another alternative (BUGS)
  – Very difficult in more than one or two dimensions

155 Sampling Efficiency

• We care only about N_eff per second

• Decompose into
  1. iterations per second
  2. effective sample size per iteration

• Gibbs and Metropolis have high iterations per second (especially Metropolis)

• But they have low effective sample size per iteration (especially Metropolis)

• Both are particularly weak when there is high correlation among the parameters in the posterior

156 Hamiltonian Monte Carlo & NUTS

• Slower iterations per second than Gibbs or Metropolis

• Much higher effective sample size per iteration for complex posteriors (i.e., high curvature and correlation)

• Overall, much higher N_eff per second

• Details in the next talk . . .

• Along with details of how Stan implements HMC and NUTS

157 Part VIII What Stan Does

158 Full Bayes: No-U-Turn Sampler

• Adaptive Hamiltonian Monte Carlo (HMC) – Potential Energy: negative log posterior – Kinetic Energy: random standard normal per iteration

• Adaptation during warmup – step size adapted to target total acceptance rate – mass matrix estimated with regularization

• Adaptation during sampling – simulate forward and backward in time until U-turn

• Slice sample along path

(Hoffman and Gelman 2011, 2014)

159 Posterior Inference

• Generated quantities block for inference (predictions, decisions, and event probabilities)

• Extractors for draws in sample in RStan and PyStan

• Coda-like posterior summary – posterior mean w. MCMC std. error, std. dev., quantiles

– split-R̂ multi-chain convergence diagnostic (Gelman/Rubin) – multi-chain effective sample size estimation (FFT algorithm)

• Model comparison with WAIC – in-sample approximation to cross-validation

160 Penalized MLE

• Posterior mode finding via L-BFGS optimization (uses model gradient, efficiently approximates Hessian)

• Disables Jacobians for parameter inverse transforms

• Standard errors on unconstrained scale (estimated using curvature of the penalized log likelihood function)

• Models, data, initialization as in MCMC

• Very Near Future – Standard errors on constrained scale (sample unconstrained approximation and inverse transform)

161 “Black Box” Variational Inference

• Black box so can fit any Stan model

• Multivariate normal approx to unconstrained posterior – covariance: diagonal mean-field or full rank – not Laplace approx — around posterior mean, not mode – transformed back to constrained space (built-in Jacobians)

• Stochastic gradient-descent optimization – ELBO gradient estimated via Monte Carlo + autodiff

• Returns approximate posterior mean / covariance

• Returns sample transformed to constrained space

162 Posterior Analysis: Estimates

• For each parameter (and lp__)
  – Posterior mean
  – Posterior standard deviation
  – Posterior MCMC error estimate: sd/√N_eff
  – Posterior quantiles
  – Number of effective samples
  – R̂ convergence statistic

• . . . and much much more in ShinyStan

163 Stan as a Research Tool

• Stan can be used to explore algorithms

• Models transformed to unconstrained support on R^n

• Once a model is compiled, have – log probability, gradient (soon: Hessian) – data I/O and parameter initialization – model provides variable names and dimensionalities – transforms to and from constrained representation (with or without Jacobian)

164 Part IX How Stan Works

165 Model: Read and Transform Data

• Only done once for optimization or sampling (per chain)

• Read data – read data variables from memory or file stream – validate data

• Generate transformed data
– execute transformed data statements
– validate variable constraints when done

166 Model: Log Density

• Given parameter values on unconstrained scale

• Builds expression graph for log density (start at 0)

• Inverse transform parameters to constrained scale
– constraints involve non-linear transforms
– e.g., positive-constrained x to unconstrained y = log x

• Account for curvature in change of variables
– e.g., unconstrained y to positive x = log⁻¹(y) = exp(y)
– e.g., add log Jacobian determinant, log |d/dy exp(y)| = y

• Execute model block statements to increment log density
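• A small R sketch of the adjustment for a hypothetical positive-constrained scale sigma ~ Exponential(1):

lp <- function(y) {
  sigma <- exp(y)                        # inverse transform to (0, Inf)
  dexp(sigma, rate = 1, log = TRUE) + y  # log density + log Jacobian: log |d/dy exp(y)| = y
}
lp(0.5)  # unconstrained log density at sigma = exp(0.5)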

167 Model: Log Density Gradient

• Log density evaluation builds up expression graph
– templated overloads of functions and operators
– efficient arena-based memory management

• Compute gradient in backward pass on expression graph
– propagate partial derivatives via chain rule
– work backwards from final log density to parameters
– dynamic programming for shared subexpressions

• Linear multiple of time to evaluate log density

168 Model: Generated Quantities

• Given parameter values

• Once per iteration (not once per leapfrog step)

• May involve (pseudo) random-number generation
– execute generated quantities statements
– validate values satisfy constraints

• Typically used for
– Event probability estimation
– Predictive posterior estimation

• Efficient because evaluated with double types (no autodiff)

169 Optimize: L-BFGS

• Initialize unconstrained parameters and Hessian
– Random values on unconstrained scale uniform in (−2, 2)
* or user specified on constrained scale, transformed
– Hessian approximation initialized to unit matrix

• While not converged
– Move unconstrained parameters toward optimum based on Hessian approximation and step size (Newton step)
– If diverged (arithmetic, support), reduce step size, continue
– else if converged (parameter change, log density change, gradient value), return value
– else update Hessian approx. based on calculated gradient

170 Sample: Hamiltonian Flow

• Generate random kinetic energy
– random Normal(0, 1) in each parameter

• Use negative log posterior as potential energy

• Hamiltonian is kinetic plus potential energy

• Leapfrog Integration: for fixed stepsize (time discretization), number of steps (total time), and mass matrix,
– update momentum half-step based on potential (gradient)
– update position full step based on momentum
– update momentum half-step based on potential

• Numerical solution of Hamilton’s first-order version of Newton’s second-order diff-eqs of motion (force = mass × acceleration)

171 Sample: Leapfrog Example

[Figure 5.3 from Neal: position coordinates, momentum coordinates, and the value of the Hamiltonian over a 25-step leapfrog trajectory for a 2D Gaussian]

• Trajectory of 25 leapfrog steps for correlated 2D normal (ellipses at 1 sd from mean), stepsize of 0.25, initial state of (−1.5, −1.55), and initial momentum of (−1, 1)

Radford Neal (2013) MCMC using Hamiltonian Dynamics. In Handbook of MCMC. (free online at http://www.mcmchandbook.net/index.html)

172 Sample: No-U-Turn Sampler (NUTS)

• Adapts Hamiltonian simulation time
– goal to maximize mixing, maintaining detailed balance
– too short devolves to random walk
– too long does extra work (i.e., orbits)

• For exponentially increasing number of steps up to max
– Randomly choose to extend forward or backward in time
– Move forward or backward in time number of steps

* stop if any subtree (size 2, 4, 8, ...) makes U-turn
* remove all current steps if subtree U-turns (not ends)

• Randomly select param with density above slice (or reject)
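• A sketch of the U-turn check in R (names hypothetical; theta_minus/theta_plus are the trajectory endpoints, r_minus/r_plus their momenta):

u_turn <- function(theta_minus, theta_plus, r_minus, r_plus) {
  d <- theta_plus - theta_minus                # vector across the trajectory
  sum(d * r_minus) < 0 || sum(d * r_plus) < 0  # endpoints moving back toward each other
}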

173 Sample: NUTS Binary Tree

Figure• Example 1: Example of of repeated building a binary doubling tree via repeated building doubling. binary Each tree doubling forward proceeds and backwardby choosing in time a direction until (forwards U-turn. or backwards in time) uniformly at random, then simulating Hamiltonian dynamics for 2j leapfrog steps in that direction, Hoffmanwhere andj is the Gelman. number2014. of previous The doublings No-U-Turn (and the Sampler. height ofJMLR the binary. (free tree). online at http://jmlr.org/papers/v15/hoffman14a.htmlThe figures at top show a trajectory in two dimensions (with corresponding) binary tree in dashed lines) as it evolves over four doublings, and the figures below show the evolution of the binary tree. In this example, the directions chosen were forward (light orange node), backward (yellow nodes), backward (blue nodes), and forward (green nodes). 174 no longer increase the distance between the proposal ✓˜ and the initial value of ✓.Weuse a convenient criterion based on the dot product betweenr ˜ (the current momentum) and ✓˜ ✓ (the vector from our initial position to our current position), which is the derivative with respect to time (in the Hamiltonian system) of half the squared distance between the initial position ✓ and the current position ✓˜:


174 Sample: NUTS U-Turn

[Figure 2 from the NUTS paper: a trajectory generated during one iteration of NUTS]

• Example of trajectory from one iteration of NUTS
• Blue ellipse is contour of 2D normal
• Black circles are leapfrog steps
• Solid red circles excluded below slice
• U-turn made with blue and magenta arrows
• Red crossed circles excluded for detailed balance


175 Sample: HMC/NUTS Warmup

• Estimate stepsize
– too small requires too many leapfrog steps
– too large induces numerical inaccuracy
– need to balance

• Estimate mass matrix
– Diagonal accounts for parameter scales
– Dense optionally accounts for rotation

176 Sample: Warmup (cont.)

• Initialize unconstrained parameters as for optimization

• For exponentially increasing block sizes
– for each iteration in block

* generate random kinetic energy
* simulate Hamiltonian flow (HMC fixed time, NUTS adapts)
* choose next state (Metropolis for HMC, slice for NUTS)
– update regularized point estimate of mass matrix

* use parameter draws from current block
* shrink diagonal toward unit; dense toward diagonal
– tune stepsize (line search) for target acceptance rate

177 Sample: HMC/NUTS Sampling

• Fix stepsize and mass matrix

• For sampling iterations
– generate random kinetic energy
– simulate Hamiltonian flow
– apply Metropolis accept/reject (HMC) or slice (NUTS)

178 NUTS vs. Gibbs and Metropolis

[Figure 7 from the NUTS paper: samples from random-walk Metropolis, Gibbs sampling, and NUTS on a highly correlated 250-dimensional distribution; only the first two dimensions shown]

• Two dimensions of highly correlated 250-dim normal
• 1,000,000 draws from Metropolis and Gibbs (thinned to 1000)
• 1000 draws from NUTS; 1000 independent draws


179 NUTS vs. Basic HMC

[Figure 6 from the NUTS paper (top rows): effective sample size per gradient evaluation as a function of the target acceptance rate δ]

• 250-D normal and logistic regression models
• Vertical axis is effective sample size per sample (bigger better)
• Left) NUTS; Right) HMC with increasing simulation length t = εL

180 NUTS vs. Basic HMC II

[Figure 6 from the NUTS paper (bottom rows): effective sample size per gradient evaluation as a function of δ and (for HMC) simulation length εL]

• Hierarchical logistic regression and stochastic volatility models
• Simulation time t is εL: step size (ε) times number of steps (L)
• NUTS can beat optimally tuned HMC (latter very expensive)


181 Part X Under Stan’s Hood

182 Euclidean Hamiltonian

• Phase space: q position (parameters); p momentum

• Posterior density: π(q)

• Mass matrix: M

• Potential energy: V(q) = −log π(q)
• Kinetic energy: T(p) = (1/2) pᵀ M⁻¹ p
• Hamiltonian: H(p, q) = V(q) + T(p)
• Diff eqs: dq/dt = +∂H/∂p, dp/dt = −∂H/∂q

183 Leapfrog Integrator Steps

• Solves Hamilton’s equations by simulating dynamics (symplectic [volume preserving]; O(ε³) error per step, O(ε²) total error)

• Given: step size ε, mass matrix M, parameters q

• Initialize kinetic energy, p ∼ Normal(0, I)
• Repeat for L leapfrog steps:

p ← p − (ε/2) ∂V(q)/∂q   [half step in momentum]
q ← q + ε M⁻¹ p   [full step in position]
p ← p − (ε/2) ∂V(q)/∂q   [half step in momentum]
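• A toy R sketch (not Stan’s implementation) for a standard normal target with M = I, so V(q) = q²/2 and ∂V/∂q = q:

grad_V <- function(q) q
leapfrog <- function(q, p, eps, L) {
  for (l in 1:L) {
    p <- p - 0.5 * eps * grad_V(q)  # half step in momentum
    q <- q + eps * p                # full step in position (M = I)
    p <- p - 0.5 * eps * grad_V(q)  # half step in momentum
  }
  list(q = q, p = p)
}
leapfrog(q = 1, p = rnorm(1), eps = 0.1, L = 25)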

184 Reverse-Mode Auto Diff

• Eval gradient in (usually small) multiple of function eval time
– independent of dimensionality
– time proportional to number of expressions evaluated

• Result accurate to machine precision (cf. finite diffs)

• Function evaluation builds up expression tree

• Dynamic program propagates chain rule in reverse pass

• Reverse mode computes ∇f in one pass for a function f : RN → R

185 Autodiff Expression Graph

[Figure: expression graph for the normal log density; function nodes with operand edges, independent variables y, µ, σ at the bottom, the log density at the top; constants need not enter derivative calculations]

f(y, µ, σ) = log(Normal(y | µ, σ)) = −(1/2)((y − µ)/σ)² − log σ − (1/2) log(2π)

∂f/∂y = −(y − µ)σ⁻²
∂f/∂µ = (y − µ)σ⁻²
∂f/∂σ = (y − µ)²σ⁻³ − σ⁻¹

186 Autodiff Partials

var   value              partials
v1    y
v2    µ
v3    σ
v4    v1 − v2            ∂v4/∂v1 = 1;      ∂v4/∂v2 = −1
v5    v4 / v3            ∂v5/∂v4 = 1/v3;   ∂v5/∂v3 = −v4 v3⁻²
v6    (v5)²              ∂v6/∂v5 = 2 v5
v7    (−0.5) v6          ∂v7/∂v6 = −0.5
v8    log v3             ∂v8/∂v3 = 1/v3
v9    v7 − v8            ∂v9/∂v7 = 1;      ∂v9/∂v8 = −1
v10   v9 − (0.5 log 2π)  ∂v10/∂v9 = 1

187 Autodiff: Reverse Pass

var    operation           adjoint result
a1..a9 : = 0               a1..a9 = 0
a10    = 1                 a10 = 1
a9    += a10 × (1)         a9 = 1
a7    += a9 × (1)          a7 = 1
a8    += a9 × (−1)         a8 = −1
a3    += a8 × (1/v3)       a3 = −1/v3
a6    += a7 × (−0.5)       a6 = −0.5
a5    += a6 × (2 v5)       a5 = −v5
a4    += a5 × (1/v3)       a4 = −v5/v3
a3    += a5 × (−v4 v3⁻²)   a3 = −1/v3 + v5 v4 v3⁻²
a1    += a4 × (1)          a1 = −v5/v3
a2    += a4 × (−1)         a2 = v5/v3
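• The same passes written out in R for concrete values (a sketch; the final adjoints match the closed-form gradient of the normal log density):

y <- 1.3; mu <- 0.5; sigma <- 2
v4 <- y - mu; v5 <- v4 / sigma        # forward pass values
a1 <- -v5 / sigma                     # adjoint of y:     -(y - mu) / sigma^2
a2 <-  v5 / sigma                     # adjoint of mu:     (y - mu) / sigma^2
a3 <- -1 / sigma + v5 * v4 / sigma^2  # adjoint of sigma:  (y - mu)^2 / sigma^3 - 1/sigma
c(a1, a2, a3)                         # gradient of log Normal(y | mu, sigma)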

188 Stan’s Reverse-Mode

• Easily extensible object-oriented design

• Code nodes in expression graph for primitive functions
– requires partial derivatives
– built-in flexible abstract base classes
– lazy evaluation of chain rule saves memory

• Autodiff through templated C++ functions
– templating on each argument avoids excess promotion

189 Stan’s Reverse-Mode (cont.)

• Arena-based memory management
– specialized C++ operator new for reverse-mode variables
– custom functions inherit memory management through base

• Nested application to support ODE solver

190 Stan’s Autodiff vs. Alternatives

• Stan is fastest (and uses least memory)
– among open-source C++ alternatives

[Benchmark plots: time relative to Stan vs. number of dimensions for matrix_product_eigen and normal_log_density_stan, comparing the adept, adolc, cppad, double, sacado, and stan systems]

191 Forward-Mode Auto Diff

• Evaluates expression graph forward from one independent variable to any number of dependent variables

• Function evaluation propagates chain rule forward

• In one pass, computes ∂f(x)/∂x for a function f : R → RN
– derivative of N outputs with respect to a single input

192 Stan’s Forward Mode

• Templated scalar type for value and tangent
– allows higher-order derivatives

• Primitive functions propagate derivatives

• No need to build expression graph in memory
– much less memory intensive than reverse mode

• Autodiff through templated functions (as reverse mode)

193 Second-Order Derivatives

• Compute Hessian (matrix of second-order partials)

Hi,j = ∂²f(x) / ∂xi ∂xj

• Required for Laplace covariance approximation (MLE)

• Required for curvature (Riemannian HMC)

• Nest reverse-mode in forward for second order

• N forward passes: takes gradient of derivative

194 Third-Order Derivatives

• Compute gradients of Hessians (tensor of third-order partials)

∂³f(x) / ∂xi ∂xj ∂xk

– Required for SoftAbs metric (Riemannian HMC)
– N² forward passes: gradient of derivative of derivative

195 Jacobians

• Assume function f : RN → RM
• Partials for multivariate function (matrix of first-order partials):

Ji,j = ∂fj(x) / ∂xi

• Required for stiff ordinary differential equations
– differentiate the coupled sensitivity system by autodiff through the ODE system

• Two execution strategies

1. Multiple reverse passes for rows
2. Forward pass per column (required for stiff ODE)

196 Autodiff Functionals

• Functionals map templated functors to derivatives
– fully encapsulates and hides all autodiff types

• Autodiff functionals supported

– gradients: O(1)
– Jacobians: O(N)
– gradient-vector product (i.e., directional derivative): O(1)
– Hessian-vector product: O(N)
– Hessian: O(N)
– gradient of trace of matrix-Hessian product: O(N²) (for SoftAbs RHMC)

197 Variable Transforms

• Code HMC and optimization with Rn support

• Transform constrained parameters to unconstrained

– lower (upper) bound: offset (negated) log transform
– lower and upper bound: scaled, offset logit transform
– simplex: centered, stick-breaking logit transform
– ordered: free first element, log transform offsets
– unit length: spherical coordinates
– covariance matrix: Cholesky factor positive diagonal
– correlation matrix: rows unit length via quadratic stick-breaking
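• For example, the (0, 1) special case of the lower-and-upper-bound transform is inverse logit; an R sketch with its log Jacobian:

inv_logit <- function(y) 1 / (1 + exp(-y))
constrain01 <- function(y) {
  theta <- inv_logit(y)
  list(theta = theta,
       log_jacobian = log(theta) + log1p(-theta))  # log |d theta / d y|
}
constrain01(0.3)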

198 Variable Transforms (cont.)

• Inverse transform from unconstrained Rn

• Evaluate log probability in model block on natural scale

• Optionally adjust log probability for change of variables
– adjustment for MCMC and variational, not MLE
– add log determinant of inverse transform Jacobian
– automatically differentiable

199 Parsing and Compilation

• Stan code parsed to abstract syntax tree (AST) (Boost Spirit Qi, recursive descent, lazy semantic actions)

• C++ model class code generation from AST (Boost Variant)

• C++ code compilation

• Dynamic linking for RStan, PyStan

200 Coding Probability Functions

• Vectorized to allow scalar or container arguments (containers all same shape; scalars broadcast as necessary)

• Avoid repeated computations, e.g. log σ in

log Normal(y | µ, σ) = Σn=1..N log Normal(yn | µ, σ)
                     = Σn=1..N ( −log √(2π) − log σ − (yn − µ)² / (2σ²) )

• recursive expression templates to broadcast and cache scalars, generalize containers (arrays, matrices, vectors)

• traits metaprogram to drop constants (e.g., −log √2π, or log σ if σ is constant) and calculate intermediate and return types
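• The saving is visible even in a plain R sketch: log σ enters once for all N observations:

log_normal_vec <- function(y, mu, sigma) {
  N <- length(y)
  -N * (log(sigma) + 0.5 * log(2 * pi)) - sum((y - mu)^2) / (2 * sigma^2)
}
y <- rnorm(10)
all.equal(log_normal_vec(y, 0, 2), sum(dnorm(y, 0, 2, log = TRUE)))  # TRUE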

201 Part XI Stan for Big Data

202 Scaling and Evaluation

[Diagram (Big Model and Big Data): model complexity vs. data size (1e6 to 1e18 bytes); the state of the art keeps advancing along both the big-model and big-data axes]

• Types of Scaling: data, parameters, models

203 Riemannian Manifold HMC

• Best mixing MCMC method (fixed # of continuous params)

• Moves on Riemannian manifold rather than Euclidean
– adapts to position-dependent curvature

• geoNUTS generalizes NUTS to RHMC (Betancourt arXiv)

• SoftAbs metric (Betancourt arXiv)
– eigendecompose Hessian and condition
– computationally feasible alternative to original Fisher info metric of Girolami and Calderhead (JRSS, Series B)
– requires third-order derivatives and implicit integrator

• Code complete; awaiting higher-order auto-diff

204 Adiabatic Sampling

• Physically motivated alternative to “simulated” annealing and tempering (not really simulated!)

• Supplies external heat bath

• Operates through contact manifold

• System relaxes more naturally between energy levels

• Betancourt paper on arXiv

• Prototype complete

205 “Black Box” Variational Inference

• Black box so can fit any Stan model

• Multivariate normal approx to unconstrained posterior
– covariance: diagonal mean-field or full rank
– not Laplace approx (around posterior mean, not mode)
– transformed back to constrained space (built-in Jacobians)

• Stochastic gradient-descent optimization
– ELBO gradient estimated via Monte Carlo + autodiff

• Returns approximate posterior mean / covariance

• Returns sample transformed to constrained space

206 “Black Box” EP

• Fast, approximate inference (like VB)
– VB and EP minimize divergence in opposite directions
– especially useful for Gaussian processes

• Asynchronous, data-parallel expectation propagation (EP)

• Cavity distributions control subsample variance

• Prototype stage

• collaborating with Seth Flaxman, Aki Vehtari, Pasi Jylänki, John Cunningham, Nicholas Chopin, Christian Robert

207 The Cavity Distribution

•Figure Two 1: Sketch parameters, illustrating the with benefits data of expectation split intopropagationy1, .(EP) . . , ideas y5 in Bayesian compu- tation. In this simple example, the parameter space ✓ has two dimensions, and the data have been split into five pieces. Each oval represents a contour of the likelihood p(y ✓) provided by a single • Contours of likelihood p(y θ) for k 1:5 k| partition of the data. A simple parallel computationk| of each∈ piece separately would be inefficient because it would require the inference for each partition to cover its entire oval. By combining with the cavity distribution g k(✓) in a manner inspired by EP, we can devote most of our computational • g k(θ) is cavity distribution (current approx. without yk) effort− to the area of overlap. • Separately computing for y reqs each partition to cover its area 2. Initialization. Choose initial site approximationsk gk(✓) from some restricted family (for example, multivariate normal distributions in ✓). Let the initial approximation to the posterior K • Combiningdensity be g(✓)= likelihoodp(✓) k=1 gk(✓ with). cavity focuses on overlap 3. EP-like iteration. ForQk =1,...,K (in serial or parallel):


208 Maximum Marginal Likelihood

• Fast, approx. inference for hierarchical models: p(φ, α)
• Marginalize out lower-level params: p(φ) = ∫ p(φ, α) dα

• Optimize higher-level parameters φ∗ and fix

• Optimize lower-level parameters given higher-level: p(φ∗, α)

• Errors estimated as in MLE

• aka “empirical Bayes”
– but not fully Bayesian
– and no more empirical than full Bayes

• Design complete; awaiting parameter tagging

209 Part XII Posterior Modes & Laplace Approximation

210 Laplace Approximation

• Multivariate normal approximation to posterior

• Compute posterior mode via optimization

θ∗ = arg maxθ p(θ | y)

• Laplace approximation to the posterior is

p(θ | y) ≈ MultiNormal(θ | θ∗, −H⁻¹)

• H is the Hessian of the log posterior

Hi,j = ∂² log p(θ | y) / ∂θi ∂θj
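• An R sketch for the earlier Bernoulli model on the unconstrained logit scale (y and N as before; the uniform prior enters only through the Jacobian term):

neg_lp <- function(alpha) {                 # alpha = logit(theta)
  theta <- 1 / (1 + exp(-alpha))
  -(sum(y) * log(theta) + (N - sum(y)) * log(1 - theta)
    + log(theta) + log(1 - theta))          # log likelihood + log Jacobian
}
opt <- optim(0, neg_lp, method = "BFGS", hessian = TRUE)
c(mode = opt$par, sd = sqrt(1 / opt$hessian[1, 1]))  # Laplace mean and std. dev.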

211 Stan’s Laplace Approximation

• Operates on unconstrained parameters

• L-BFGS to compute posterior mode θ∗

• Automatic differentiation to compute H
– currently: finite differences of gradients
– soon: second-order automatic differentiation

• Draw a sample from approximate posterior
– transform back to constrained scale
– allows Monte Carlo computation of expectations

212 Part XIII Variational Bayes

213 VB in a Nutshell

• y is observed data, θ parameters

• Goal is to approximate posterior p(θ | y)
• with a convenient approximating density g(θ | φ)
– φ is a vector of parameters of approximating density

• Given data y, VB computes φ∗ minimizing KL-divergence
– from approximation g(θ | φ) to posterior p(θ | y)

φ∗ = arg minφ KL[ g(θ | φ) || p(θ | y) ]
   = arg maxφ ∫Θ log( p(θ | y) / g(θ | φ) ) g(θ | φ) dθ

214 VB vs. Laplace

[Figure: a target density with its Laplace and variational approximations, plotted on the density scale and the negative-log-density scale]

• solid yellow: target; red: Laplace; green: VB
• Laplace located at posterior mode
• VB located at approximate posterior mean

— Bishop (2006) Pattern Recognition and Machine Learning, fig. 10.1

215 KL-Divergence Example

[Figure: contours of a correlated Gaussian p and a factorized approximation, fit by minimizing the KL-divergence in each direction]

• Green: true distribution p; Red: best approximation g
• (a) VB-like: KL[g || p]
• (b) EP-like: KL[p || g]
• VB systematically underestimates the posterior variance

— Bishop (2006) Pattern Recognition and Machine Learning, fig. 10.2


216 Stan’s “Black-Box” VB

• Typically custom g() per model
– based on conjugacy and analytic updates

• Stan uses “black-box VB” with multivariate Gaussian g

g(θ | φ) = MultiNormal(θ | µ, Σ)

for the unconstrained posterior
– e.g., scales σ log-transformed with Jacobian

• Stan provides two versions
– Mean field: Σ diagonal
– General: Σ dense

217 Stan’s VB: Computation

• Use L-BFGS optimization to optimize φ∗

• Requires differentiable KL[ g(θ | φ) || p(θ | y) ]
– only up to constant (i.e., use evidence lower bound (ELBO))

• Approximate KL-divergence and gradient via Monte Carlo

– KL divergence is an expectation w.r.t. approximation g(θ | φ)
– Monte Carlo draws i.i.d. from approximating multi-normal
– only need approximate gradient calculation for soundness
– so only a few Monte Carlo iterations are enough
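• A one-dimensional R sketch of the Monte Carlo ELBO estimate (log_p, mu_g, sd_g are hypothetical names):

elbo_hat <- function(mu_g, sd_g, log_p, M = 100) {
  theta <- rnorm(M, mu_g, sd_g)  # i.i.d. draws from the approximation g
  mean(log_p(theta) - dnorm(theta, mu_g, sd_g, log = TRUE))
}
elbo_hat(0, 1, function(th) dnorm(th, 2, 0.5, log = TRUE))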

218 Stan’s VB: Computation (cont.)

• To support compatible plug-in inference

– draw Monte Carlo sample θ(1), . . . , θ(M) with

θ(m) ∼ MultiNormal(θ | µ∗, Σ∗)

– inverse transform from unconstrained to constrained scale
– report to user in same way as MCMC draws

• Future: reweight θ(m) via importance sampling
– with respect to true posterior
– to improve expectation calculations

219 Near Future: Stochastic VB

• Data-streaming form of VB
– Scales to billions of observations
– Hoffman et al. (2013) Stochastic variational inference. JMLR 14.

• Mashup of stochastic gradient (Robbins and Monro 1951) and VB
– subsample data (e.g., stream in minibatches)
– upweight each minibatch to full data set size
– use to make unbiased estimate of true gradient
– take gradient step to minimize KL-divergence

• Prototype code complete
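• A toy R sketch of the upweighting step for a normal-location model (all names hypothetical): scaling a minibatch gradient by N/B gives an unbiased estimate of the full-data gradient:

N_full <- 1e5
y_all  <- rnorm(N_full, mean = 3)
grad_hat <- function(mu, B = 100) {
  yb <- sample(y_all, B)        # minibatch
  (N_full / B) * sum(yb - mu)   # upweighted log-likelihood gradient in mu
}
grad_hat(0)  # noisy but unbiased estimate of sum(y_all - 0)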

220