Bayesian Inference 2019
Ville Hyvönen, Topias Tolonen
2019-04-21
These lecture notes were originally written by Ville for the course at the University of Helsinki in 2017 and updated for the Spring 2019 iteration by Topias.

Contents
1 Introduction
  1.1 Motivating example: thumbtack tossing
  1.2 Components of Bayesian inference
  1.3 Prediction
2 Conjugate distributions
  2.1 One-parameter conjugate models
  2.2 Prior distributions
3 Summarizing the posterior distribution
  3.1 Credible intervals
  3.2 Posterior mean as a convex combination of means
4 Approximate inference
  4.1 Simulation methods
  4.2 Monte Carlo integration
  4.3 Monte Carlo Markov chain (MCMC) methods
  4.4 Probabilistic programming
  4.5 Sampling from posterior predictive distribution
5 Multiparameter models
  5.1 Marginal posterior distribution
  5.2 Inference for the normal distribution with known variance
  5.3 Inference for the normal distribution with noninformative prior
6 Hierarchical models
  6.1 Two-level hierarchical model
  6.2 Conditional conjugacy
  6.3 Hierarchical model example
7 Linear model
  7.1 Classical linear model
  7.2 Posterior for classical linear regression
  7.3 Posterior distribution of β
  7.4 Full model with the predictors
8 Hypothesis testing and Bayes factor
  8.1 Bayes factors for point hypotheses
  8.2 Bayes factors for composite hypotheses
  8.3 Example hypotheses regarding population prevalence
Chapter 1
Introduction
1.1 Motivating example: thumbtack tossing
A classical toy example of a random experiment in probability calculus is coin tossing. But this is a rather boring example, since we know a priori (at least if the coin is fair) that the probability of both heads and tails is very close to 0.5. Instead, let's consider a slightly more interesting toy example: thumbtack tossing. If we define success as the thumbtack landing with its point up, we can only make a vague guess about the success probability before conducting the experiment. Let's toss a thumbtack n times and count the number of times it lands with its point up; denote this quantity by y. We are interested in deducing the true success probability θ.

Probably our first intuition is simply to use the proportion of successes y/n as an estimate of the true success probability θ. But consider an outcome where you tossed the thumbtack n = 3 times, and each time it landed point down, so that the observed value is y = 0. Would it be sensible to conclude that the true success probability is θ = y/n = 0/3 = 0? It clearly makes no sense to conclude that the true underlying success probability θ is equal to the observed proportion y/n. On the other hand, if we toss the thumbtack n = 3000 times and observe zero successes, the proportion of successes is still y/n = 0, but now it would make much more sense to conclude that the thumbtack landing point up is actually impossible, or at least a very rare event. So in addition to the most probable value of θ, we also need to measure the uncertainty of our estimates. Finding the most likely parameter values and quantifying our uncertainty about them is called statistical inference.
1.1.1 Modelling thumbtack tossing

To generate some real-world data, I threw a thumbtack n = 30 times. It landed point up 16 times and point down 14 times, so we observed the data y = 16. Let's define a proper statistical model to quantify our uncertainty about the true probability of the thumbtack landing point up. We can consider the observed number of successes y as a realization of a random variable Y. As we remember from the probability calculus course, a repeated random experiment with a binary outcome, constant success probability and independent repetitions is modelled with the binomial distribution:
Y ∼ Bin(n, θ), 0 < θ < 1.
This means that random variable Y follows a binomial distribution with a (fixed) sample size n and a success probability θ. Unknown quantities in the model, such as θ here, are called parameters of the model.
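Before doing any inference, it can be helpful to see what kind of data this model generates. Below is a minimal simulation sketch; the value theta_true = 0.55 is a purely hypothetical choice, not an estimate from any data:

```r
# Simulate 1000 repeated thumbtack experiments from the binomial model.
# theta_true = 0.55 is a hypothetical 'true' success probability.
set.seed(1)
n <- 30
theta_true <- 0.55
y_sim <- rbinom(1000, size = n, prob = theta_true)
mean(y_sim / n)  # simulated proportions concentrate around theta_true
```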
The functional form of the probability mass function (pmf) of Y,

$$f(y; n, \theta) = \binom{n}{y} \theta^y (1 - \theta)^{n - y},$$

is fixed, and the value of the parameter θ determines what it looks like. Let's draw some pmfs of Y with a fixed sample size n = 30 and different parameter values:

```r
par(mar = c(4, 4, .1, .1))
n <- 30
y <- 0:30
theta <- c(3, 10, 25) / n
plot(y, dbinom(y, size = n, prob = theta[1]), lwd = 2, col = 'blue',
     type = 'b', ylab = 'P(Y=y)')
lines(y, dbinom(y, size = n, prob = theta[2]), lwd = 2, col = 'green',
      type = 'b')
lines(y, dbinom(y, size = n, prob = theta[3]), lwd = 2, col = 'red',
      type = 'b')
legend('top', inset = .02,
       legend = c('Bin(30, 1/10)', 'Bin(30, 1/3)', 'Bin(30, 5/6)'),
       col = c('blue', 'green', 'red'), lwd = 2)
```
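As a point of contact with the observed data, we can also evaluate the same pmf at the observed count y = 16 for a few candidate parameter values; the values 0.3, 0.5 and 0.7 below are arbitrary illustrations:

```r
# Probability of observing exactly y = 16 successes out of n = 30
# under a few arbitrary candidate values of theta:
dbinom(16, size = 30, prob = c(0.3, 0.5, 0.7))
```

Viewed as a function of θ for the fixed observed data, these probabilities are exactly the likelihood function introduced in the next subsection.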
1.1.2 Frequentist thumbtack tossing

In classical (sometimes called frequentist) statistics we consider the likelihood function L(θ; y); this is just the pmf/pdf of the observations considered as a function of the parameter θ:

$$\theta \mapsto f(y; \theta).$$

Then we can find the most likely value of the parameter by maximizing the likelihood function with respect to θ. (In practice we usually maximize the natural logarithm of the likelihood function, l(θ; y) = log L(θ; y), often called the log-likelihood, because it is computationally more convenient.) This means that we find the parameter value that has the highest probability of producing this particular data set. The parameter value θ̂ that maximizes the likelihood function is called a maximum likelihood estimate:

$$\hat{\theta}(y) = \arg\max_{\theta} L(\theta; y).$$
The maximum likelihood estimate is the most likely value of the parameter given the data. Let's derive the maximum likelihood estimate for our binomial model. Because the logarithm is a monotonically increasing function, the global maximum point of the log-likelihood also maximizes the likelihood function. The log-likelihood for this model is:
$$l(\theta; y) = \log f(y; \theta) \propto \log\left(\theta^y (1 - \theta)^{n - y}\right) = y \log \theta + (n - y) \log(1 - \theta).$$
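Before finishing the maximization analytically, we can sanity-check the maximizer numerically. Below is a minimal sketch using R's optimize() with the observed data y = 16 and n = 30 from above:

```r
# Numerically maximize the log-likelihood over theta.
# y = 16 and n = 30 are the observed thumbtack data.
# The search interval avoids the endpoints, where log(0) = -Inf.
n <- 30
y <- 16
loglik <- function(theta) y * log(theta) + (n - y) * log(1 - theta)
optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)$maximum
```

The result is very close to the observed proportion y/n = 16/30 ≈ 0.533, which matches the well-known binomial maximum likelihood estimate θ̂ = y/n.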