Bayesian inference

Petteri Piiroinen

University of Helsinki

Spring 2020

3 Summarizing the posterior distribution

In principle, the posterior distribution contains all the information about the possible parameter values. In practice, we must also present the posterior distribution somehow. For one- and two-dimensional cases we can plot the density or a scatterplot of simulated values. For higher-dimensional cases, we can study the marginal posterior distributions of some of the parameters.


The usual summary statistics, such as the mean, median, mode, variance, standard deviation and different quantiles, can be used to summarize the posterior distribution.

3.1 Credible intervals

A credible interval is a "Bayesian confidence interval". It has a very intuitive interpretation: we can say that a 95% credible interval actually contains the true parameter value with 95% probability!

3.1.1 Credible interval definition

Definition. A (1 − α)-credible set is a subset Iα ⊂ Ω ⊂ R^d containing a proportion 1 − α of the probability mass of the posterior distribution:

P(Θ ∈ Iα | Y = y) = 1 − α.

If the set is a region, we call it a credible region, and if in addition d = 1, we call the credible region a credible interval.

Usually we talk about a (1 − α) · 100% credible interval; for example, if the confidence level is α = 0.05, we talk about the 95% credible interval.

3.1.2 Equal-tailed interval

Definition. An equal-tailed interval (also called a central interval) of confidence level α is an interval

Iα = [qα/2, q1−α/2],

where qz is the z-quantile of the posterior distribution p(·|y).

Since we have assumed the parameter to be continuous, the quantiles are well defined for p(·|y) (only regions of zero density in the middle of the support might cause issues).


If we can solve the posterior distribution in a closed form, quantiles can be obtained via the quantile function of the posterior distribution:

P(Θ ≤ qz | Y = y) = z  ⟺  qz = F⁻¹Θ|Y(z | y).

This quantile function F⁻¹Θ|Y is the inverse of the cumulative distribution function (cdf) FΘ|Y of the posterior distribution. If there are zeros in the posterior density, the quantile function needs to be defined in another way.
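For concreteness, the equal-tailed interval is just two calls to the quantile function; a small helper along these lines (the function name is illustrative, not from the slides):

# Equal-tailed (central) credible interval from a posterior quantile function qf
equal_tailed_ci <- function(qf, alpha_conf = 0.05) {
  c(qf(alpha_conf / 2), qf(1 - alpha_conf / 2))
}
# e.g. for a Gamma(2, 1) posterior:
equal_tailed_ci(function(p) qgamma(p, shape = 2, rate = 1))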


Usually, when a credible interval is mentioned without specifying which type of credible interval is meant, an equal-tailed interval is intended. However, unless the posterior distribution is unimodal and symmetric, we might prefer using the highest posterior density criterion for choosing the credible interval. But before that, an example.

3.1.3 Numerical example

Continuation of the Poisson-gamma Example 2.1.1: we have observed the data set y = (4, 3, 11, 3, 6). Model: sampling distribution / likelihood

Y1, . . . , Yn ∼ Poisson(λ), conditionally independent given λ

prior: gamma-distribution with α = β = 1

λ ∼ Gamma(1, 1)

we want to compute a 95% credible interval for the parameter λ


Data, hyperparameters and a confidence level:

y <- c(4, 3, 11, 3, 6)
n <- length(y)
alpha <- 1
beta <- 1

alpha_conf <- 0.05


The posterior distribution for λ:

λ | Y ∼ Gamma(nȳ + α, n + β).

alpha_1 <- sum(y) + alpha
beta_1 <- n + beta


The quantiles are hence:

q_lower <- qgamma(alpha_conf / 2, alpha_1, beta_1)
q_upper <- qgamma(1 - alpha_conf / 2, alpha_1, beta_1)
c(q_lower, q_upper)

## [1] 3.100966 6.547264
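The same interval can also be approximated by simulating from the posterior; a quick sanity check (not on the slides), using the Gamma(28, 6) posterior derived above:

lambda_draws <- rgamma(1e5, alpha_1, beta_1)   # simulate from the Gamma(28, 6) posterior
quantile(lambda_draws, c(alpha_conf / 2, 1 - alpha_conf / 2))
# should be close to the exact quantiles 3.10 and 6.55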


[Figure: prior (orange) and posterior (violet) densities of λ with the 95% equal-tailed credible interval shaded; x-axis λ from 0 to 7, y-axis p(λ|y).]


lambda <- seq(0, 7, by = 0.001)  # set up grid for plotting
lambda_true <- 3

plot(lambda, dgamma(lambda, alpha_1, beta_1), type = 'l', lwd = 2,
     col = 'violet', ylim = c(0, 1.5), xlab = expression(lambda),
     ylab = expression(paste('p(', lambda, '|y)')))

y_val <- dgamma(lambda, alpha_1, beta_1)
x_coord <- c(q_lower, lambda[lambda >= q_lower & lambda <= q_upper], q_upper)
y_coord <- c(0, y_val[lambda >= q_lower & lambda <= q_upper], 0)


polygon(x_coord, y_coord, col = 'pink', lwd = 2, border = 'violet')
abline(v = lambda_true, lty = 2)

lines(lambda, dgamma(lambda, alpha, beta), type = 'l', lwd = 2, col = 'orange')
legend('topright', inset = .02, legend = c('prior', 'posterior'),
       col = c('orange', 'violet'), lwd = 2)

[Figure: posterior densities and shaded 95% credible intervals under the Gamma(1, 1) prior; panels for the prior and for n = 1, 2, 5, 10, 50, 100, 200 observations; x-axis λ from 0 to 7.]

3.1.3 Numerical example – prior effects

As we observe more data, the credible interval gets narrower. The orange area is the credible interval computed using the prior distribution. Next, a stronger prior λ ∼ Gamma(10, 10) (with the same expectation Eλ = α/β = 1).
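The interval based on the prior alone can be computed from the prior quantiles in the same way; a minimal sketch (not from the slides) for the two priors used here:

qgamma(c(0.025, 0.975), 1, 1)    # 95% equal-tailed interval under the Gamma(1, 1) prior
qgamma(c(0.025, 0.975), 10, 10)  # 95% equal-tailed interval under the Gamma(10, 10) prior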

[Figure: the same panels (prior and n = 1, 2, 5, 10, 50, 100, 200) with the stronger Gamma(10, 10) prior; x-axis λ from 0 to 7.]

3.1.4 Highest posterior density region

Definition. A highest posterior density (HPD) set of confidence level α is a (1 − α)-credible set Iα such that the posterior density at every point in the set is at least as high as the posterior density at any point outside the set:

fΘ|Y(θ | y) ≥ fΘ|Y(θ′ | y)  for all θ ∈ Iα and θ′ ∉ Iα.

This means that a (1 − α)-highest posterior density set is the smallest possible (1 − α)-credible set.


The HPD set is not necessarily an interval (or a connected region in a higher-dimensional case): if the posterior distribution is multimodal, the HPD set may be a union of distinct intervals (or distinct contiguous regions in a higher-dimensional case). This means that HPD sets are not necessarily always strictly credible intervals or regions. However, it is very common to talk simply about HPD intervals, even though they may not always be intervals.
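When the posterior is unimodal, the HPD interval can be approximated from a posterior sample as the shortest interval containing a proportion 1 − α of the sorted draws. A minimal sketch, not from the slides, and valid only when the HPD set really is a single interval:

hpd_interval <- function(draws, alpha = 0.05) {
  draws <- sort(draws)
  n <- length(draws)
  k <- ceiling((1 - alpha) * n)         # number of draws inside the interval
  widths <- draws[k:n] - draws[1:(n - k + 1)]
  i <- which.min(widths)                # left endpoint of the shortest interval
  c(draws[i], draws[i + k - 1])
}
hpd_interval(rgamma(1e5, 28, 6))        # e.g. for the Poisson-gamma posterior above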

3.1.4 HPD - example

Let's create a bimodal example (mixing two beta distributions). Note: we have seen mixing before! The mixture distribution of Y is

Y | (Θ = θi) ∼ Beta(αi, βi),   Θ ∼ 1 + Bernoulli(1/2).

Therefore, the marginal likelihood is

f(y) = (1/2) · ( f(y | θ1) + f(y | θ2) ).


alpha_conf <- .05
alpha_1 <- 11; beta_1 <- 30
alpha_2 <- 25; beta_2 <- 8

mixture_density <- function(x, alpha_1, alpha_2, beta_1, beta_2) {
  .5 * dbeta(x, alpha_1, beta_1) + .5 * dbeta(x, alpha_2, beta_2)
}

3.1.4 HPD - example (numerical way)

# generate data to compute empirical quantiles
n_sim <- 1000000
theta_1 <- rbeta(n_sim / 2, alpha_1, beta_1)
theta_2 <- rbeta(n_sim / 2, alpha_2, beta_2)
theta <- sort(c(theta_1, theta_2))

[Figure: density of the bimodal mixture distribution; x-axis from 0 to 1, y-axis from 0 to 3.]


lower_idx <- round((alpha_conf / 2) * n_sim)
upper_idx <- round((1 - alpha_conf / 2) * n_sim)
q_lower <- theta[lower_idx]
q_upper <- theta[upper_idx]
c(q_lower, q_upper)

## [1] 0.1627391 0.8690430

3.1.4 HPD - example (credible interval)

x <- seq(0, 1, by = 0.001)
y_val <- mixture_density(x, alpha_1, alpha_2, beta_1, beta_2)
x_coord <- c(q_lower, x[x >= q_lower & x <= q_upper], q_upper)
y_coord <- c(0, y_val[x >= q_lower & x <= q_upper], 0)

[Figure: mixture posterior density with the 95% equal-tailed credible interval shaded; x-axis θ from 0 to 1, y-axis density.]


plot(x, mixture_density(x, alpha_1, alpha_2, beta_1, beta_2), type = 'l',
     col = 'violet', lwd = 2, xlab = expression(theta), ylab = 'density')
polygon(x_coord, y_coord, col = 'pink', lwd = 2, border = 'violet')

[Figure: mixture posterior density with the shaded interval produced by the plotting code above; x-axis θ from 0 to 1, y-axis density.]
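For this bimodal density the HPD set can instead be approximated on the plotting grid by lowering a density threshold until 1 − α of the mass is covered. A sketch, not from the slides, reusing x and mixture_density from above; here the result comes out as a union of two intervals:

dens <- mixture_density(x, alpha_1, alpha_2, beta_1, beta_2)
w <- dens / sum(dens)                          # grid approximation of the probabilities
ord <- order(dens, decreasing = TRUE)
hpd_points <- sort(x[ord[cumsum(w[ord]) <= 1 - alpha_conf]])

# read off the interval endpoints from the gaps in the retained grid points
gap <- which(diff(hpd_points) > 1.5 * 0.001)   # 0.001 is the grid spacing
rbind(start = c(min(hpd_points), hpd_points[gap + 1]),
      end   = c(hpd_points[gap], max(hpd_points)))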

4 Approximate inference

So far we have studied inference and prediction with conjugate models. For these, the marginal likelihood, the posterior and the posterior predictive distributions can be expressed in closed form. However, usually these distributions are intractable, and the posterior cannot be solved analytically.


Usually we have to approximate the posterior distribution p(θ|y) and use this approximation to compute other quantities of interest: the posterior mean, the posterior mode (MAP, maximum a posteriori), credible intervals, . . .


In general, there are two ways to approximate the posterior distribution:

1. Simulation: generate a random sample from the posterior distribution and use the empirical distribution function as an approximation of the posterior.
2. Distributional approximation: approximate the posterior directly by some simpler parametric distribution (normal distribution, . . . ).

4 Approximate inference - normal approximation

A simple form of distributional approximation is the normal approximation, motivated by the asymptotic normality of the posterior. Normal approximations are used a lot in frequentist statistics as well (many test statistics are based on asymptotic normality, like the Wald, Rao and likelihood-ratio statistics).

4 Approximate inference - variational inference

More generally, approximating the posterior density by some tractable density q(θ) is called variational inference. We will touch on this only slightly (in the exercises and during …), and only at the end of the course if time permits.

4.1 Simulation methods

We will assume that we are given some Bayesian model (with a sampling density f(· | θ) and a prior density p(·)):

Y | (Θ = θ) ∼ f(· | θ),   Θ ∼ p(·)

The first step in simulation methods is to generate a random sample

θ1,..., θS

from the posterior distribution p(· | y). This is easy in R and Python for the usual posterior distributions (but then you do not actually need to approximate :); still, simulation may be the easiest way to evaluate some integrals over the posterior distribution.
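Once such a sample is available, posterior summaries are just Monte Carlo averages over the draws; a small sketch (not from the slides), using the Poisson-gamma posterior from the earlier example:

theta_draws <- rgamma(1e5, 28, 6)            # S draws from the posterior
mean(theta_draws)                            # approximates E(lambda | y)
mean(theta_draws > 5)                        # approximates P(lambda > 5 | y)
quantile(theta_draws, c(0.025, 0.975))       # approximate 95% credible interval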


A more interesting case: there is no closed form for the posterior distribution. Even though the normalizing constant f(y) is unknown, it is sufficient to generate a random sample from an unnormalized posterior density q:

q(·; y) ∝ p(· | y) ∝ p(·)f (y | ·)

4.1 Simulation methods – these are in CompStat

A few methods: importance sampling (a variance reduction technique in Monte Carlo) and rejection sampling (this will be briefly explained on the blackboard). Nowadays there are also automated probabilistic programming tools; on this course we will consider Stan (named after Stanislaw Ulam) via RStan, so we will return to this.

4.1 Simulation methods – rejection sampling

rej_sample <- function(f, g, rg, M) {
  candidate <- function() {
    x <- rg()                          # propose from g
    u <- M * g(x) * runif(1)           # uniform height below M * g(x)
    if (u <= f(x)) x else candidate()  # accept if below f, otherwise retry
  }
  candidate()
}

y_vec2 <- sapply(1:10000, function(x) {
  rej_sample(function(x) dbeta(x, 5, 3),
             function(x) dbeta(x, 1, 1),
             function() rbeta(1, 1, 1), 2.5)
})
x_coord <- seq(0, 1, 0.01)
y_vec2[1:4]

## [1] 0.6972523 0.8640759 0.6817323 0.8263721
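Rejection sampling requires f(x) ≤ M g(x) for all x. Here g is the uniform Beta(1, 1) density, so M only needs to dominate the maximum of the Beta(5, 3) target density; a quick check (not on the slides):

max(dbeta(seq(0, 1, by = 0.001), 5, 3))   # about 2.30, so M = 2.5 is a valid bound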

[Figure: histogram of the rejection samples with the Beta(5, 3) density overlaid ('True posterior'); x-axis from 0 to 1, y-axis density.]

4.1.1 Grid approximation

For our example we will use a simple grid approximation or direct discrete approximation:

1. Create an evenly spaced grid g1 = a + h/2, . . . , gm = b − h/2, where (a, b) is the interval of evaluation, h is the increment of the grid, and m is the number of grid points.


2. Evaluate the unnormalized posterior density q at the grid points, q(g1; y), . . . , q(gm; y), and normalize these values to obtain estimated values of the posterior distribution at the grid points:

p1 := q(g1; y) / Σᵢ q(gᵢ; y), . . . , pm := q(gm; y) / Σᵢ q(gᵢ; y).


3. For every s = 1, . . . , S:
Generate λ′s from the discrete distribution with pmf P(λ′s = gi) = pi.
Add jitter which is uniformly distributed around zero: λs = λ′s + X, X ∼ U(−h/2, h/2).
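The three steps can be wrapped into one small reusable function; a sketch for a one-dimensional parameter (the name grid_sample and its arguments are illustrative, not from the slides):

grid_sample <- function(q_fun, a, b, h, n_sim) {
  g <- seq(a + h / 2, b - h / 2, by = h)                           # step 1: the grid
  qg <- q_fun(g)                                                   # step 2: evaluate q on the grid
  draws <- sample(g, n_sim, prob = qg / sum(qg), replace = TRUE)   # step 3: discrete draws
  draws + runif(n_sim, -h / 2, h / 2)                              # step 3: add uniform jitter
}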

4.1.1 Grid approximation – exactly simple numerical integration

This corresponds to performing numerical integration by sampling. Like numerical integration, we can only simulate from a finite interval, and the computational cost is of order m^d (way too high for high-dimensional inference).

4.1.2 Grid approximation – Poisson-gamma example

lambda_true <- 3
alpha <- beta <- 1
n <- 5
set.seed(111111)
y <- rpois(n, lambda_true)
y

## [1] 4 3 11 3 6


q <- function(lambda, y, n, alpha, beta) {
  lambda^(alpha + sum(y) - 1) * exp(-(n + beta) * lambda)
}


Parameter space (how to choose?): here, using prior knowledge, we take (0, 20).

lower_lim <- 0
upper_lim <- 20
i <- 0.01
grid <- seq(lower_lim + i/2, upper_lim - i/2, by = i)

n_sim <- 1e4
n_grid <- length(grid)
grid_values <- q(grid, y, n, alpha, beta)
normalized_values <- grid_values / sum(grid_values)


Now we generate the samples:

idx_sim <- sample(1:n_grid, n_sim, prob = normalized_values, replace = TRUE)
lambda_sim <- grid[idx_sim]

X <- runif(n_sim, -i/2, i/2)
lambda_sim <- lambda_sim + X


[Figure: histogram of the grid-approximation sample with the true posterior density overlaid ('True posterior'); x-axis λ from 0 to 10, y-axis density.]


hist(lambda_sim, col = 'violet', breaks = seq(0, 10, by = .25), probability = TRUE,
     main = '', xlab = expression(lambda), xlim = c(0, 10))
lines(grid, dgamma(grid, alpha + sum(y), beta + n), type = 'l', col = 'green', lwd = 3)
legend('topright', legend = 'True posterior', bty = 'n', col = 'green', lwd = 2, inset = .02)


[Figure: the same comparison with an estimated density curve instead of a histogram ('True posterior', 'Estimated density'); x-axis λ from 0 to 10.]

4.1.3 Grid approximation – Poisson-Log-normal example

Let's study a non-conjugate prior with the Poisson model: sampling distribution / likelihood

Y1, . . . , Yn ∼ Poisson(λ), conditionally independent given λ

prior: log-normal distribution with hyperparameters µ ∈ R and σ² > 0:

λ ∼ Log-normal(µ, σ²)

4.1.3 Grid approximation – Log-normal distribution

We have log(λ) ∼ N(µ, σ²) and therefore

p(λ) = 1 / (λ √(2πσ²)) · exp( −(log λ − µ)² / (2σ²) ).
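This is the same density that R's built-in dlnorm evaluates; a quick check (not on the slides) comparing the formula with dlnorm for µ = 0, σ² = 1:

lam <- c(0.5, 1, 2, 5)
manual <- exp(-(log(lam) - 0)^2 / 2) / (lam * sqrt(2 * pi))   # the formula above
all.equal(manual, dlnorm(lam, meanlog = 0, sdlog = 1))        # TRUE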

4.1.3 Grid approximation – Log-normal distribution vs gamma distribution

[Figure: Gamma(1, 1) and Log-normal(0, 1) densities compared; x-axis λ from 0 to 10, y-axis density.]

4.1.3 Grid approximation – Log-normal distribution

q <- function(lambda, y, n, mu, sigma_squared) {
  lambda^(sum(y) - 1) * exp(-n * lambda - (log(lambda) - mu)^2 / (2 * sigma_squared))
}
mu <- 0
sigma_squared <- 1
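With this unnormalized density the grid approximation proceeds exactly as in the Poisson-gamma example; a sketch (the slides do not show this step explicitly) reusing grid, n_grid, n_sim and i from above:

grid_values <- q(grid, y, n, mu, sigma_squared)
normalized_values <- grid_values / sum(grid_values)
idx_sim <- sample(1:n_grid, n_sim, prob = normalized_values, replace = TRUE)
lambda_sim <- grid[idx_sim] + runif(n_sim, -i / 2, i / 2)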

4.1.3 Grid approximation – Poisson-Log-normal example

[Figure: histogram of the Poisson-log-normal grid-approximation sample with the Gamma(28, 6) density overlaid; x-axis λ from 0 to 10, y-axis density.]


The green line is the density of the posterior under the Gamma(1, 1) prior. This time our posterior is concentrated on slightly higher values. This is because the Log-normal(0, 1) distribution has a higher mean (Eλ = e^(1/2) ≈ 1.649) and an (essentially) heavier right tail than the Gamma(1, 1) distribution.


[Figure: estimated density of the Poisson-log-normal posterior sample compared with the Gamma(28, 6) density; x-axis λ from 0 to 10.]

4.1.3 Grid approximation – Log-normal distribution vs gamma distribution

[Figure: Gamma(1.5, 0.9098) and Log-normal(0, 1) densities compared; x-axis λ from 0 to 10, y-axis density.]

4.1.3 Grid approximation – Poisson-Log-normal example

[Figure: estimated density of the Poisson-log-normal posterior sample compared with the Gamma(28.5, 5.9098) density; x-axis λ from 0 to 10.]