Petteri Piiroinen
University of Helsinki
Spring 2020
Petteri Piiroinen (University of Helsinki) Bayesian inference Spring 2020 1 / 62 3 Summarizing the posterior distribution
In principle, the posterior distribution contains all the information about the possible parameter values. In practice, we must also present the posterior distribution somehow:
- Plotting (for 1D and 2D): scatterplots, histograms of simulated values.
- For higher-dimensional cases, we could study the marginal posterior distributions of some of the parameters.
- The usual summary statistics that are used to summarize probability distributions, such as the mean, median, mode, variance, standard deviation and different quantiles, can also be used to summarize the posterior.
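As a small illustration of these summaries, the sketch below computes them from simulated values. The Gamma(28, 6) distribution and the number of draws are assumptions chosen only for the example; any vector of posterior draws could be used in place of `theta_sim`.

```r
# Sketch: summary statistics of a posterior computed from simulated values.
# Assumption: the posterior is Gamma(shape = 28, rate = 6); it is only a stand-in.
set.seed(1)
theta_sim <- rgamma(1e5, shape = 28, rate = 6)

post_summary <- c(
  mean   = mean(theta_sim),
  median = median(theta_sim),
  sd     = sd(theta_sim),
  quantile(theta_sim, probs = c(0.025, 0.975))  # adds the names "2.5%" and "97.5%"
)
round(post_summary, 3)
```

The two quantiles at the end are the kind of summary that the next section turns into a credible interval.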
3.1 Credible intervals
A credible interval is a "Bayesian confidence interval". It has a very intuitive interpretation: we can say that a 95% credible interval contains the true parameter value with 95% probability, given the model and the observed data.
3.1.1 Credible interval definition
Definition. A $(1-\alpha)$-credible set is a subset $I_\alpha \subset \Omega \subset \mathbb{R}^d$ containing a proportion $1-\alpha$ of the probability mass of the posterior distribution:
$$P(\Theta \in I_\alpha \mid Y = y) = 1 - \alpha.$$
If the set is a region, we call it a credible region, and if in addition $d = 1$, we call it a credible interval.
Usually we talk about a $(1-\alpha) \cdot 100\%$ credible interval; for example, if the confidence level is $\alpha = 0.05$, we talk about the 95% credible interval.
3.1.2 Equal-tailed interval
Definition. An equal-tailed interval (also called a central interval) of confidence level $\alpha$ is an interval
$$I_\alpha = [q_{\alpha/2},\ q_{1-\alpha/2}],$$
where $q_z$ is a $z$-quantile of the posterior distribution $p(\cdot \mid y)$.
Since we have assumed the parameter to be continuous, the quantiles of $p(\cdot \mid y)$ are defined; only zeros of the density in the middle of its support might cause issues, since a quantile is then not unique.
If we can solve the posterior distribution in a closed form, quantiles can be obtained via the quantile function of the posterior distribution:
$$P(\Theta \le q_z \mid Y = y) = z \iff q_z = F^{-1}_{\Theta \mid Y}(z \mid y).$$
This quantile function $F^{-1}_{\Theta \mid Y}$ is the inverse of the cumulative distribution function (cdf) $F_{\Theta \mid Y}$ of the posterior distribution. If there are zeros in the posterior density, the quantile function needs to be defined in another way.
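A quick way to see the inverse relationship is to compose the two functions. The sketch below uses R's `qgamma` (quantile function) and `pgamma` (cdf); the Gamma(28, 6) parameters are illustrative assumptions.

```r
# Sketch: the quantile function inverts the cdf.
# Assumption: a Gamma(shape = 28, rate = 6) posterior, chosen for illustration.
shape <- 28
rate <- 6
z <- c(0.025, 0.5, 0.975)

q_z <- qgamma(z, shape = shape, rate = rate)       # q_z = F^{-1}(z)
recovered <- pgamma(q_z, shape = shape, rate = rate)  # F(q_z) should return z
cbind(z, q_z, recovered)
```

The first and last entries of `q_z` are exactly the endpoints of the 95% equal-tailed interval for this distribution.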
Usually, when a credible interval is mentioned without specifying which type of credible interval it is, an equal-tailed interval is meant. However, unless the posterior distribution is unimodal and symmetric, we might prefer using the highest posterior density criterion for choosing the credible interval. But first, an example.
3.1.3 Numerical example
This is a continuation of the Poisson-gamma Example 2.1.1. We have observed a data set $y = (4, 3, 11, 3, 6)$.
Model: sampling distribution / likelihood
$$Y_1, \dots, Y_n \sim \text{Poisson}(\lambda), \quad \text{conditionally independent given } \lambda.$$
Prior: gamma distribution with hyperparameters $\alpha = \beta = 1$:
$$\lambda \sim \text{Gamma}(1, 1).$$
We want to compute a 95% credible interval for the parameter $\lambda$.
Data, hyperparameters and a confidence level:
y <- c(4, 3, 11, 3, 6)
n <- length(y)
alpha <- 1
beta <- 1
alpha_conf <- 0.05

Posterior distribution for $\lambda$:
$$\lambda \mid Y \sim \text{Gamma}(n\bar{Y} + \alpha,\ n + \beta).$$
alpha_1 <- sum(y) + alpha
beta_1 <- n + beta

The quantiles are hence:
q_lower <- qgamma(alpha_conf / 2, alpha_1, beta_1)
q_upper <- qgamma(1 - alpha_conf / 2, alpha_1, beta_1)
c(q_lower, q_upper)
## [1] 3.100966 6.547264
[Figure: prior and posterior densities $p(\lambda \mid y)$ plotted against $\lambda \in [0, 7]$.]
lambda <- seq(0, 7, by = 0.001) # set up grid for plotting
lambda_true <- 3

plot(lambda, dgamma(lambda, alpha_1, beta_1), type = 'l', lwd = 2,
     col = 'violet', ylim = c(0, 1.5), xlab = expression(lambda),
     ylab = expression(paste('p(', lambda, '|y)')))

y_val <- dgamma(lambda, alpha_1, beta_1)
x_coord <- c(q_lower, lambda[lambda >= q_lower & lambda <= q_upper], q_upper)
y_coord <- c(0, y_val[lambda >= q_lower & lambda <= q_upper], 0)

polygon(x_coord, y_coord, col = 'pink', lwd = 2, border = 'violet')
abline(v = lambda_true, lty = 2)

lines(lambda, dgamma(lambda, alpha, beta), type = 'l', lwd = 2, col = 'orange')
legend('topright', inset = .02, legend = c('prior', 'posterior'),
       col = c('orange', 'violet'), lwd = 2)
[Figure: eight panels showing the prior and the posterior after n = 1, 2, 5, 10, 50, 100 and 200 observations, each over $\lambda \in [0, 7]$.]

3.1.3 Numerical example – prior effects
As we observe more data, the credible interval gets narrower.
Orange area: the credible interval that is computed using the prior distribution.
Stronger prior: $\lambda \sim \text{Gamma}(10, 10)$ (with the same expectation $E\lambda = \alpha/\beta = 1$).
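The effect of the sample size and of the prior strength on the interval width can be sketched numerically. To keep the example deterministic, it assumes that the sample mean equals 3 for every $n$ (so the sum of the observations is $3n$); this is an assumption of the sketch, not the data behind the figures, and `interval_width` is a hypothetical helper.

```r
# Sketch: width of the 95% equal-tailed interval under a weak and a strong prior.
# Assumption: the sample mean is exactly 3 for every n, so sum(y) = 3 * n.
n_obs <- c(1, 2, 5, 10, 50, 100, 200)
alpha_conf <- 0.05

interval_width <- function(n, alpha, beta, y_bar = 3) {
  shape <- alpha + n * y_bar   # posterior shape: alpha + sum(y)
  rate  <- beta + n            # posterior rate:  beta + n
  qgamma(1 - alpha_conf / 2, shape, rate) - qgamma(alpha_conf / 2, shape, rate)
}

widths <- data.frame(
  n      = n_obs,
  weak   = interval_width(n_obs, alpha = 1, beta = 1),     # Gamma(1, 1) prior
  strong = interval_width(n_obs, alpha = 10, beta = 10)    # Gamma(10, 10) prior
)
round(widths, 3)
```

Under this assumption the stronger prior gives a much narrower interval for small $n$, while for $n = 200$ the two widths nearly coincide because the data dominate both priors.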
[Figure: eight panels (prior, n = 1, 2, 5, 10, 50, 100, 200) over $\lambda \in [0, 7]$, following the introduction of the stronger Gamma(10, 10) prior.]

3.1.4 Highest posterior density region
Definition. A highest posterior density (HPD) set of confidence level $\alpha$ is a $(1-\alpha)$-credible set $I_\alpha$ for which the posterior density at every point in the set is at least as high as the posterior density at any point outside the set:
$$f_{\Theta \mid Y}(\theta \mid y) \ge f_{\Theta \mid Y}(\theta' \mid y)$$
for all $\theta \in I_\alpha$ and $\theta' \notin I_\alpha$.
This means that a $(1-\alpha)$-highest posterior density set is a smallest possible $(1-\alpha)$-credible set.
The HPD set is not necessarily an interval (or a connected region in a higher-dimensional case): if the posterior distribution is multimodal, the HPD set of this distribution may be a union of distinct intervals (or distinct contiguous regions in a higher-dimensional case). This means that HPD sets are not always strictly credible intervals or regions. However, it is very common to talk simply about HPD intervals, even though they may not always be intervals.
3.1.4. HPD - example
Let's create a bimodal example (mixing two beta distributions). Note: we have seen the mixing before!
The mixture distribution of $Y$ is
$$\begin{cases} Y \mid (\Theta = \theta_i) \sim \text{Beta}(\alpha_i, \beta_i), \\ \Theta \sim 1 + \text{Bernoulli}(\tfrac{1}{2}). \end{cases}$$
Therefore, the marginal likelihood is
$$f(y) = \frac{1}{2}\left( f(y \mid \theta_1) + f(y \mid \theta_2) \right).$$
alpha_conf <- .05
alpha_1 <- 11; beta_1 <- 30
alpha_2 <- 25; beta_2 <- 8

mixture_density <- function(x, alpha_1, alpha_2, beta_1, beta_2) {
  .5 * dbeta(x, alpha_1, beta_1) + .5 * dbeta(x, alpha_2, beta_2)
}

3.1.4. HPD - example (numerical way)
# generate data to compute empirical quantiles
n_sim <- 1000000
theta_1 <- rbeta(n_sim / 2, alpha_1, beta_1)
theta_2 <- rbeta(n_sim / 2, alpha_2, beta_2)
theta <- sort(c(theta_1, theta_2))
lower_idx <- round((alpha_conf / 2) * n_sim)
upper_idx <- round((1 - alpha_conf / 2) * n_sim)
q_lower <- theta[lower_idx]
q_upper <- theta[upper_idx]
c(q_lower, q_upper)
## [1] 0.1627391 0.8690430
3.1.4. HPD - example (credible interval)
x <- seq(0, 1, by = 0.001)
y_val <- mixture_density(x, alpha_1, alpha_2, beta_1, beta_2)
x_coord <- c(q_lower, x[x >= q_lower & x <= q_upper], q_upper)
y_coord <- c(0, y_val[x >= q_lower & x <= q_upper], 0)

plot(x, mixture_density(x, alpha_1, alpha_2, beta_1, beta_2), type = 'l',
     col = 'violet', lwd = 2, xlab = expression(theta), ylab = 'density')
polygon(x_coord, y_coord, col = 'pink', lwd = 2, border = 'violet')

[Figure: mixture density against $\theta \in [0, 1]$, with the equal-tailed credible interval shaded.]
[Figure: density plotted against $\theta$ on $[0, 1]$.]
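For comparison with the equal-tailed interval, an HPD set of the same mixture can be approximated on a grid. The following is a minimal sketch and not the code behind the slides: the grid increment `h` and the names `in_hpd` and `hpd` are assumptions, and the approach (keep the grid cells with the highest density until they hold $1 - \alpha$ of the mass) is one simple way to apply the HPD definition numerically.

```r
# Sketch: grid-based HPD set for the bimodal beta mixture (parameters from the example).
alpha_conf <- 0.05
mixture_density <- function(x) {
  0.5 * dbeta(x, 11, 30) + 0.5 * dbeta(x, 25, 8)
}

h <- 0.0005                                  # grid increment (assumed)
x <- seq(h / 2, 1 - h / 2, by = h)           # midpoints of the grid cells
dens <- mixture_density(x)
mass <- dens / sum(dens)                     # approximate probability mass per cell

# keep cells in order of decreasing density until 1 - alpha_conf of the mass is covered
ord <- order(dens, decreasing = TRUE)
n_keep <- which(cumsum(mass[ord]) >= 1 - alpha_conf)[1]
in_hpd <- logical(length(x))
in_hpd[ord[seq_len(n_keep)]] <- TRUE

# collapse consecutive selected cells into intervals
runs <- rle(in_hpd)
ends <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1
hpd <- cbind(lower = x[starts[runs$values]] - h / 2,
             upper = x[ends[runs$values]] + h / 2)
hpd
```

For this mixture the result consists of two separate intervals, one around each mode, and their total length is shorter than the single equal-tailed interval computed above.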
4 Approximate inference
So far we have studied inference and prediction with conjugate models. For these, the marginal likelihood, the posterior and the posterior predictive distributions can be expressed in a closed form.
However, usually these are intractable, and the posterior cannot be solved analytically.
We usually have to approximate the posterior distribution $p(\theta \mid y)$ and use this approximation to compute other quantities of interest:
- posterior mean
- posterior mode (MAP, maximum a posteriori)
- credible intervals, ...
In general, there are two ways to approximate the posterior distribution:
1. Simulation: generate a random sample from the posterior distribution and use the empirical distribution function as an approximation of the posterior.
2. Distributional approximation: approximate the posterior directly by some simpler parametric distribution (normal distribution, ...).
4 Approximate inference - normal approximation
A simple form of distributional approximation is a normal approximation, motivated by the Central Limit Theorem.
It is used a lot in frequentist statistics (many test statistics are based on asymptotic normality, like the Wald, Rao and likelihood-ratio tests).
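One common way to build such an approximation is to centre a normal distribution at the posterior mode and to take its variance from the curvature of the log posterior there. Whether this is the exact variant meant here is an assumption. The sketch applies it to the Gamma(28, 6) posterior of the numerical example in section 3.1.3, where the exact interval is available for comparison; the search interval and the step `eps` are also assumed values.

```r
# Sketch: normal approximation centred at the posterior mode.
# Unnormalized log posterior of the Poisson-gamma example (Gamma(28, 6) up to a constant).
log_q <- function(lambda) 27 * log(lambda) - 6 * lambda

# posterior mode by one-dimensional optimization over an assumed search interval
mode_hat <- optimize(log_q, interval = c(0.01, 20), maximum = TRUE)$maximum

# curvature of the log posterior at the mode (central finite difference)
eps <- 1e-3
d2 <- (log_q(mode_hat + eps) - 2 * log_q(mode_hat) + log_q(mode_hat - eps)) / eps^2
sd_hat <- sqrt(-1 / d2)

approx_interval <- qnorm(c(0.025, 0.975), mean = mode_hat, sd = sd_hat)
exact_interval  <- qgamma(c(0.025, 0.975), shape = 28, rate = 6)
rbind(approx_interval, exact_interval)
```

The approximate interval is symmetric around the mode, so it sits somewhat to the left of the exact one: the gamma posterior is skewed to the right and the normal approximation cannot capture that.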
4 Approximate inference - variational inference
More generally, approximating the posterior density by some tractable density $q(\theta)$ is called variational inference.
We will touch on this only slightly (in exercises and during model selection), and only at the end of the course if time permits.
4.1. Simulation methods
We will assume that we are given some Bayesian model (with some sampling density $f(\cdot \mid \theta)$ and prior density $p(\cdot)$):
$$\begin{cases} Y \mid \Theta \sim f(\cdot \mid \Theta), \\ \Theta \sim p(\cdot). \end{cases}$$
The first step in simulation methods is to generate a random sample
$$\theta_1, \dots, \theta_S$$
from the posterior distribution $p(\cdot \mid y)$.
This is easy in R and Python for the usual posterior distributions (but then you don't actually need to approximate :)
Simulating may still be the easiest way to evaluate some integrals over the posterior distribution.
More interesting case: no closed form for the posterior distribution.
Even though the normalizing constant $f(y)$ is unknown, it is sufficient to generate a random sample from an unnormalized posterior density $q$:
$$q(\cdot\,; y) \propto p(\cdot \mid y) \propto p(\cdot)\, f(y \mid \cdot).$$
4.1. Simulation methods – these are in CompStat
A few methods:
- Importance sampling (variance reduction in Monte Carlo)
- Rejection sampling (this will be shortly explained on the blackboard)
- Nowadays: automated probabilistic programming tools; we will consider Stan (named after Stanislaw Ulam) on this course (via RStan). So we will return to this.
4.1 Simulation methods – rejection sampling
rej_sample <- function(f, g, rg, M) {
  candidate <- function() {
    x <- rg()                    # propose a value from g
    u <- M * g(x) * runif(1)     # uniform height under the envelope M * g
    # NOTE: the source is cut off after "if (u"; the rest of this function is a
    # reconstruction of the standard accept/reject step, not the original code.
    if (u < f(x)) x else NA
  }
  repeat {
    x <- candidate()
    if (!is.na(x)) return(x)
  }
}

y_vec2 <- sapply(1:10000, function(x) {
  rej_sample(function(x) dbeta(x, 5, 3), function(x) dbeta(x, 1, 1),
             function() rbeta(1, 1, 1), 2.5)
})
x_coord <- seq(0, 1, 0.01)
y_vec2[1:4]
## [1] 0.6972523 0.8640759 0.6817323 0.8263721

[Figure: density plot on $[0, 1]$ with the true posterior overlaid (legend: True posterior; vertical axis: Density).]

4.1.1 Grid approximation
For our example we will use a simple grid approximation, or direct discrete approximation:
1. Create an evenly spaced grid $g_1 = a + h/2, \dots, g_m = b - h/2$, where
   - $(a, b)$ is the interval of evaluation,
   - $h$ is the increment of the grid,
   - $m$ is the number of grid points.
2. Evaluate the unnormalized posterior density $q$ at the grid points, $q(g_1; y), \dots, q(g_m; y)$, and normalize the values to obtain the estimated values of the posterior distribution at the grid points:
   $$p_1 := \frac{q(g_1; y)}{\Sigma}, \ \dots, \ p_m := \frac{q(g_m; y)}{\Sigma}, \quad \text{where } \Sigma = \sum_{i} q(g_i; y).$$
3. For every $s = 1, \dots, S$:
   - Generate $\lambda'_s$ from the discrete distribution with pmf $P(\lambda'_s = g_i) = p_i$.
   - Add jitter which is uniformly distributed around zero: $\lambda_s = \lambda'_s + X$, $X \sim U(-h/2, h/2)$.

4.1.1 Grid approximation – exactly simple numerical integration
The grid approximation corresponds to performing a numerical integration by sampling.
Like numerical integration, we can only simulate from a finite interval.
The computation cost is of order $m^d$ (way too high for high-dimensional inference).

4.1.2 Grid approximation – Poisson-gamma example
lambda_true <- 3
alpha <- beta <- 1
n <- 5
set.seed(111111)
y <- rpois(n, lambda_true)
y
## [1] 4 3 11 3 6

q <- function(lambda, y, n, alpha, beta) {
  lambda^(alpha + sum(y) - 1) * exp(-(n + beta) * lambda)
}

Parameter space (how to choose?): now, with prior knowledge, $(0, 20)$.
lower_lim <- 0
upper_lim <- 20
i <- 0.01
grid <- seq(lower_lim + i/2, upper_lim - i/2, by = i)
n_sim <- 1e4
n_grid <- length(grid)
grid_values <- q(grid, y, n, alpha, beta)
normalized_values <- grid_values / sum(grid_values)

Now we generate the samples:
idx_sim <- sample(1:n_grid, n_sim, prob = normalized_values, replace = TRUE)
lambda_sim <- grid[idx_sim]
X <- runif(n_sim, -i/2, i/2)
lambda_sim <- lambda_sim + X

[Figure: histogram of the simulated values with the true posterior overlaid; vertical axis: Density, horizontal axis: $\lambda \in [0, 10]$.]

hist(lambda_sim, col = 'violet', breaks = seq(0, 10, by = .25),
     probability = TRUE, main = '', xlab = expression(lambda), xlim = c(0, 10))
lines(grid, dgamma(grid, alpha + sum(y), beta + n), type = 'l',
      col = 'green', lwd = 3)
legend('topright', legend = 'True posterior', bty = 'n', col = 'green',
       lwd = 2, inset = .02)
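Because this example is conjugate, the grid sample can be checked against the exact posterior. The sketch below repeats the three steps of the grid approximation in compact form and compares the empirical 95% equal-tailed interval with the one from `qgamma`. The seed and the names `sim_interval` and `exact_interval` are assumptions of the sketch.

```r
# Sketch: grid approximation for the Poisson-gamma example, checked against qgamma.
set.seed(1)                                   # assumed seed, for reproducibility only
y <- c(4, 3, 11, 3, 6)
n <- length(y)
alpha <- 1
beta <- 1

# unnormalized posterior density
q <- function(lambda) lambda^(alpha + sum(y) - 1) * exp(-(n + beta) * lambda)

# steps 1 and 2: grid of cell midpoints on (0, 20) and normalized values
i <- 0.01
grid <- seq(i / 2, 20 - i / 2, by = i)
p <- q(grid) / sum(q(grid))

# step 3: sample grid points and add uniform jitter
n_sim <- 1e4
lambda_sim <- sample(grid, n_sim, replace = TRUE, prob = p) +
  runif(n_sim, -i / 2, i / 2)

sim_interval   <- quantile(lambda_sim, c(0.025, 0.975), names = FALSE)
exact_interval <- qgamma(c(0.025, 0.975), alpha + sum(y), beta + n)
rbind(sim_interval, exact_interval)
```

The two rows should agree up to Monte Carlo error, which shrinks as `n_sim` grows.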
[Figure: true posterior and estimated density plotted against $\lambda \in [0, 10]$; vertical axis: Density.]

4.1.3 Grid approximation – Poisson-Log-normal example
Let's study a non-conjugate prior with the Poisson sampling distribution.
Model: sampling distribution / likelihood
$$Y_1, \dots, Y_n \sim \text{Poisson}(\lambda), \quad \text{conditionally independent given } \lambda.$$
Prior: log-normal distribution with hyperparameters $\mu \in \mathbb{R}$ and $\sigma^2 > 0$:
$$\lambda \sim \text{Log-normal}(\mu, \sigma^2).$$

4.1.3 Grid approximation – Log-normal distribution
We have $\log(\lambda) \sim N(\mu, \sigma^2)$, and therefore
$$p(\lambda) = \frac{1}{\lambda \sqrt{2\pi\sigma^2}} \exp\left( -\frac{(\log \lambda - \mu)^2}{2\sigma^2} \right).$$

[Figure: Log-normal distribution vs gamma distribution: densities of Gamma(1, 1) and Log-normal(0, 1) against $\lambda \in [0, 10]$.]

q <- function(lambda, y, n, mu, sigma_squared) {
  lambda^(sum(y) - 1) *
    exp(-n * lambda - (log(lambda) - mu)^2 / (2 * sigma_squared))
}
mu <- 0
sigma_squared <- 1

[Figure: density of Gamma(28, 6) against $\lambda \in [0, 10]$.]

The green line is the density of the posterior with the Gamma(1, 1) prior. This time our posterior is concentrated on slightly higher values. This is because the Log-normal(0, 1) distribution has a higher mean ($E\lambda = e^{1/2} \approx 1.649$) and an (essentially) heavier right tail than the Gamma(1, 1) distribution.
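The sampling step for this model is the same as in the Poisson-gamma example, only with the new unnormalized density. A minimal sketch follows, reusing the data $y = (4, 3, 11, 3, 6)$ and the grid on $(0, 20)$ from section 4.1.2. The seed is an assumption, and `q` is written here with the data and hyperparameters taken from the enclosing environment rather than passed as arguments.

```r
# Sketch: grid approximation of the Poisson-log-normal posterior.
set.seed(1)                                   # assumed seed, for reproducibility only
y <- c(4, 3, 11, 3, 6)
n <- length(y)
mu <- 0
sigma_squared <- 1

# unnormalized posterior: Poisson likelihood times log-normal prior
q <- function(lambda) {
  lambda^(sum(y) - 1) *
    exp(-n * lambda - (log(lambda) - mu)^2 / (2 * sigma_squared))
}

i <- 0.01
grid <- seq(i / 2, 20 - i / 2, by = i)        # cell midpoints on (0, 20)
p <- q(grid) / sum(q(grid))                   # normalized grid values

n_sim <- 1e4
lambda_sim <- sample(grid, n_sim, replace = TRUE, prob = p) +
  runif(n_sim, -i / 2, i / 2)

# posterior mean under the log-normal prior vs. the Gamma(28, 6) posterior mean
c(lognormal_prior = mean(lambda_sim), gamma_prior = 28 / 6)
```

The simulated posterior mean comes out above $28/6 \approx 4.67$, in line with the remark that the posterior is now concentrated on slightly higher values.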
[Figure: density of Gamma(28, 6) and the estimated posterior against $\lambda \in [0, 10]$; vertical axis: Density.]

[Figure: Log-normal distribution vs gamma distribution: densities of Gamma(1.5, 0.90979598956895) and Log-normal(0, 1) against $\lambda \in [0, 10]$.]

[Figure: density of Gamma(28.5, 5.90979598956895) and the estimated posterior against $\lambda \in [0, 10]$; vertical axis: Density.]
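The rate in the label Gamma(1.5, 0.90979598956895) is consistent with choosing the gamma prior so that its mean matches the mean of the Log-normal(0, 1) prior. That reading is an assumption, since the slides do not state how the value was obtained; the shape 1.5 is simply taken from the figure label. A short sketch of the arithmetic:

```r
# Sketch: a gamma prior whose mean matches the Log-normal(0, 1) mean.
# Assumption: the shape 1.5 is taken as given from the figure label.
mu <- 0
sigma_squared <- 1
shape <- 1.5

lognormal_mean <- exp(mu + sigma_squared / 2)   # E(lambda) = e^{1/2}
rate <- shape / lognormal_mean                  # gamma mean = shape / rate

# conjugate update with sum(y) = 27 and n = 5, as in the Poisson-gamma example
c(rate = rate, posterior_shape = shape + 27, posterior_rate = rate + 5)
```

With these values the conjugate posterior is Gamma(28.5, 5.90979598956895), which is the distribution named in the last figure.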