Module 1: Introduction to Bayesian Statistics, Part I
Rebecca C. Steorts

Agenda
I Motivations
I Traditional inference
I Bayesian inference
I Bernoulli, Beta
I Connection to the Binomial distribution
I Posterior of Beta-Bernoulli
I Example with 2012 election data
I Marginal likelihood
I Posterior Prediction

Motivating applications:

I Social networks
I Precision Medicine
I Estimating Rent Prices in Small Domains

[Figure: benchmarked estimates and benchmarked estimates with smoothing]

Traditional inference
You are given data X, and there is an unknown parameter θ that you wish to estimate. How would you estimate θ?

I Find an unbiased estimator of θ.
I Find the maximum likelihood estimate (MLE) of θ by maximizing the likelihood of the data.
I If you cannot remember the definition of an unbiased estimator or the MLE, review these before our next class.

Bayesian Motivation
[credit: Peter Orbanz, Columbia University]

Bayesian inference
Bayesian methods trace their origins to the 18th century and the English Reverend Thomas Bayes, who, along with Pierre-Simon Laplace, discovered what we now call Bayes' Theorem.
I p(x | θ): likelihood
I p(θ): prior
I p(θ | x): posterior
I p(x): marginal likelihood

p(θ | x) = p(θ, x) / p(x) = p(x | θ) p(θ) / p(x) ∝ p(x | θ) p(θ)

Bernoulli distribution
The Bernoulli distribution is very common because many outcomes are binary.

I Consider flipping a coin (heads or tails).
I We can represent this as a binary random variable where the probability of heads is θ and the probability of tails is 1 − θ.

We write the random variable as X ∼ Bernoulli(θ), where 0 < θ < 1. It follows that the likelihood is

p(x | θ) = θ^x (1 − θ)^(1 − x) 1(0 < θ < 1).

I Exercise: what is the mean and the variance of X?

Bernoulli distribution
I Suppose that X1, ..., Xn ∼ Bernoulli(θ), iid. Then for x1, ..., xn ∈ {0, 1}, what is the likelihood?

Notation

I ∝ means "proportional to"
I x1:n denotes x1, ..., xn

Likelihood

p(x1:n | θ) = P(X1 = x1, ..., Xn = xn | θ)
            = ∏_{i=1}^n P(Xi = xi | θ)
            = ∏_{i=1}^n p(xi | θ)
            = ∏_{i=1}^n θ^(xi) (1 − θ)^(1 − xi)
            = θ^(Σ xi) (1 − θ)^(n − Σ xi).
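As a quick sanity check (a sketch, not part of the slides), we can confirm numerically that the product of Bernoulli pmfs collapses to the θ^(Σ xi) (1 − θ)^(n − Σ xi) form; the data vector here is hypothetical.

```r
# Sketch: the product of Bernoulli pmfs equals the collapsed form
# theta^(sum xi) * (1 - theta)^(n - sum xi).
x <- c(1, 0, 1, 1, 0)   # hypothetical binary data
theta <- 0.3
prod_form <- prod(theta^x * (1 - theta)^(1 - x))
collapsed <- theta^sum(x) * (1 - theta)^(length(x) - sum(x))
all.equal(prod_form, collapsed)   # TRUE
```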
Beta distribution

Given a, b > 0, we write θ ∼ Beta(a, b) to mean that θ has pdf

p(θ) = Beta(θ | a, b) = (1 / B(a, b)) θ^(a − 1) (1 − θ)^(b − 1) 1(0 < θ < 1),

i.e., p(θ) ∝ θ^(a − 1) (1 − θ)^(b − 1) on the interval from 0 to 1.

I Here, B(a, b) = Γ(a) Γ(b) / Γ(a + b).
I The mean is E(θ) = ∫ θ p(θ) dθ = a / (a + b).

Posterior of Bernoulli-Beta
Let's derive the posterior of θ | x1:n:

p(θ | x1:n) ∝ p(x1:n | θ) p(θ)
            = θ^(Σ xi) (1 − θ)^(n − Σ xi) (1 / B(a, b)) θ^(a − 1) (1 − θ)^(b − 1) 1(0 < θ < 1)
            ∝ θ^(a + Σ xi − 1) (1 − θ)^(b + n − Σ xi − 1) 1(0 < θ < 1)
            ∝ Beta(θ | a + Σ xi, b + n − Σ xi).
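The conjugate update above is just parameter bookkeeping, which we can sketch in R (an illustrative helper, not from the slides; the data and prior values are hypothetical):

```r
# Sketch: a Beta(a, b) prior combined with iid Bernoulli data x
# yields a Beta(a + sum(x), b + n - sum(x)) posterior.
beta_bern_update <- function(x, a, b) {
  c(a = a + sum(x), b = b + length(x) - sum(x))
}
beta_bern_update(c(1, 1, 0, 1), a = 1, b = 1)   # Beta(4, 2) posterior
```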
Approval ratings of Obama

What is the proportion of people who approve of President Obama in PA?

I We take a random sample of 10 people in PA and find that 6 approve of President Obama.
I The national approval rating (Zogby poll) of President Obama in mid-September 2015 was 45%. We'll assume that in PA his approval rating is approximately 50%.
I Based on this prior information, we'll use a Beta prior for θ, and we'll choose a and b.

Obama Example
n = 10
# Fixing values of a, b.
a = 21/8
b = 0.04
th = seq(0, 1, length = 500)
x = 6

# We set the likelihood, prior, and posterior with
# th as the sequence that we plot on the x-axis.
# dbeta(th, c, d): c and d are the shape parameters.
like = dbeta(th, x + 1, n - x + 1)
prior = dbeta(th, a, b)
post = dbeta(th, x + a, n - x + b)

Likelihood
plot(th, like, type = 'l', ylab = "Density", lty = 3, lwd = 3,
     xlab = expression(theta))

[Figure: likelihood as a function of θ]

Prior
plot(th, prior, type = 'l', ylab = "Density", lty = 3, lwd = 3,
     xlab = expression(theta))

[Figure: prior density of θ]

Posterior
plot(th, post, type = 'l', ylab = "Density", lty = 3, lwd = 3,
     xlab = expression(theta))

[Figure: posterior density of θ]

Likelihood, Prior, and Posterior

[Figure: prior, likelihood, and posterior densities of θ overlaid]

Cast of characters
I Observed data: x
I Note this could consist of many data points, e.g., x = x1:n = (x1, ..., xn).

likelihood: p(x | θ)
prior: p(θ)
posterior: p(θ | x)
marginal likelihood: p(x)
posterior predictive: p(x_{n+1} | x_{1:n})
loss function: ℓ(s, a)
posterior expected loss: ρ(a, x)
risk / frequentist risk: R(θ, δ)
integrated risk: r(δ)

Marginal likelihood
The marginal likelihood is

p(x) = ∫ p(x | θ) p(θ) dθ.
I What is the marginal likelihood for the Bernoulli-Beta?

Posterior predictive distribution
I We may wish to predict a new data point x_{n+1}.
I We assume that x_{1:(n+1)} are independent given θ.

p(x_{n+1} | x_{1:n}) = ∫ p(x_{n+1}, θ | x_{1:n}) dθ
                     = ∫ p(x_{n+1} | θ, x_{1:n}) p(θ | x_{1:n}) dθ
                     = ∫ p(x_{n+1} | θ) p(θ | x_{1:n}) dθ.

Example: Back to the Beta-Bernoulli
Suppose θ ∼ Beta(a, b) and X1, ..., Xn | θ ∼ Bernoulli(θ), iid. Then the marginal likelihood is

p(x_{1:n}) = ∫ p(x_{1:n} | θ) p(θ) dθ
           = ∫_0^1 θ^(Σ xi) (1 − θ)^(n − Σ xi) (1 / B(a, b)) θ^(a − 1) (1 − θ)^(b − 1) dθ
           = B(a + Σ xi, b + n − Σ xi) / B(a, b),

by the integral definition of the Beta function.
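This closed form is easy to check numerically (a sketch with hypothetical prior parameters and data, not from the slides); R's base `beta` function is the Beta function B(a, b).

```r
# Sketch: check B(a + sum x, b + n - sum x) / B(a, b) against
# direct numerical integration of the likelihood times the prior.
a <- 2; b <- 2                        # hypothetical prior parameters
x <- c(1, 0, 1, 1); n <- length(x); s <- sum(x)
closed  <- beta(a + s, b + n - s) / beta(a, b)
numeric <- integrate(function(th) th^s * (1 - th)^(n - s) * dbeta(th, a, b),
                     lower = 0, upper = 1)$value
all.equal(closed, numeric, tolerance = 1e-6)   # TRUE
```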
Example continued

Let a_n = a + Σ xi and b_n = b + n − Σ xi. It follows that the posterior distribution is p(θ | x_{1:n}) = Beta(θ | a_n, b_n). The posterior predictive can be derived to be

P(X_{n+1} = 1 | x_{1:n}) = ∫ P(X_{n+1} = 1 | θ) p(θ | x_{1:n}) dθ
                         = ∫ θ Beta(θ | a_n, b_n) dθ
                         = a_n / (a_n + b_n),

hence the posterior predictive p.m.f. is

p(x_{n+1} | x_{1:n}) = (a_n / (a_n + b_n))^(x_{n+1}) (b_n / (a_n + b_n))^(1 − x_{n+1}) 1(x_{n+1} ∈ {0, 1}).
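Plugging in the Obama example's values (a = 21/8, b = 0.04, n = 10, six approvals) gives the predictive probability that an eleventh sampled person approves; this sketch also cross-checks a_n / (a_n + b_n) against the integral definition.

```r
# Posterior predictive for the Obama example: P(X_{n+1} = 1 | x_{1:n}).
a <- 21/8; b <- 0.04
n <- 10; s <- 6                      # 6 approvals out of 10
an <- a + s; bn <- b + n - s
pred <- an / (an + bn)               # posterior predictive P(approve)
# cross-check against the integral of theta * Beta(theta | an, bn)
all.equal(pred, integrate(function(th) th * dbeta(th, an, bn), 0, 1)$value)
```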