Module 1: Introduction to Bayesian Statistics, Part I

Rebecca C. Steorts

Agenda

- Motivations
- Traditional inference
- Bayesian inference
- Bernoulli, Beta
- Connection to the Binomial
- Posterior of Beta-Bernoulli
- Example with 2012 election data
- Marginal likelihood
- Posterior Prediction

Traditional inference

You are given data X, and there is an unknown parameter θ that you wish to estimate. How would you estimate θ?

- Find an unbiased estimator of θ.
- Find the maximum likelihood estimate (MLE) of θ by looking at the likelihood of the data.
- If you cannot remember the definition of an unbiased estimator or the MLE, review these before our next class.

Bayesian Motivation

[Figure; credit: Peter Orbanz, Columbia University]

Bayesian inference

Bayesian methods trace their origin to the 18th century and the English Reverend Thomas Bayes, who along with Pierre-Simon Laplace discovered what we now call Bayes' Theorem:

- p(x | θ): likelihood
- p(θ): prior
- p(θ | x): posterior
- p(x): marginal likelihood

$$
p(\theta \mid x) = \frac{p(\theta, x)}{p(x)} = \frac{p(x \mid \theta)\,p(\theta)}{p(x)} \propto p(x \mid \theta)\,p(\theta)
$$

Bernoulli distribution

The Bernoulli distribution is very common because many outcomes are binary.

- Consider flipping a coin (heads or tails).
- We can represent this as a binary random variable where the probability of heads is θ and the probability of tails is 1 − θ.

We write the random variable as X ∼ Bernoulli(θ) 1(0 < θ < 1). It follows that the likelihood is

$$
p(x \mid \theta) = \theta^{x}(1 - \theta)^{1 - x}\,\mathbf{1}(0 < \theta < 1).
$$

- Exercise: what are the mean and the variance of X?

Bernoulli distribution

- Suppose that $X_1, \ldots, X_n \overset{\text{iid}}{\sim} \text{Bernoulli}(\theta)$. Then for $x_1, \ldots, x_n \in \{0, 1\}$, what is the likelihood?

Notation

- ∝ means “proportional to”
- $x_{1:n}$ denotes $x_1, \ldots, x_n$

Likelihood

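$$
\begin{aligned}
p(x_{1:n} \mid \theta) &= P(X_1 = x_1, \ldots, X_n = x_n \mid \theta) \\
&= \prod_{i=1}^{n} P(X_i = x_i \mid \theta) \\
&= \prod_{i=1}^{n} p(x_i \mid \theta) \\
&= \prod_{i=1}^{n} \theta^{x_i}(1 - \theta)^{1 - x_i} \\
&= \theta^{\sum x_i}(1 - \theta)^{n - \sum x_i}.
\end{aligned}
$$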
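The last step of this derivation says that the likelihood depends on the data only through the sum of the x_i. A minimal R sketch of that fact follows. The sample `x` and the grid `theta_grid` are made up for illustration and do not come from the slides.

```r
# Hypothetical binary sample (not from the slides), for illustration only.
x <- c(1, 0, 1, 1, 0, 1, 0, 1, 1, 0)
n <- length(x)
s <- sum(x)  # number of ones, i.e. the sum of the x_i

# Likelihood as a product of the individual Bernoulli terms.
lik_product <- function(theta) prod(theta^x * (1 - theta)^(1 - x))

# The same likelihood written through sum(x) only.
lik_collapsed <- function(theta) theta^s * (1 - theta)^(n - s)

# Evaluate both forms on a grid of theta values inside (0, 1).
theta_grid <- seq(0.05, 0.95, by = 0.05)
lp <- sapply(theta_grid, lik_product)
lc <- sapply(theta_grid, lik_collapsed)

# The two forms agree up to floating-point error.
forms_agree <- isTRUE(all.equal(lp, lc))

# On this grid the likelihood is largest at the sample mean s / n,
# which is the MLE of theta for the Bernoulli model.
theta_best <- theta_grid[which.max(lc)]
```

Because the collapsed form only needs `s` and `n`, any two samples of the same size with the same number of ones give the same likelihood function.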

Given a, b > 0, we write θ ∼ Beta(a, b) to mean that θ has pdf

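$$
p(\theta) = \text{Beta}(\theta \mid a, b) = \frac{1}{B(a, b)}\,\theta^{a - 1}(1 - \theta)^{b - 1}\,\mathbf{1}(0 < \theta < 1),
$$

i.e., $p(\theta) \propto \theta^{a - 1}(1 - \theta)^{b - 1}$ on the interval from 0 to 1.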
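The density above can be checked against R's built-in `dbeta()`. This is a small sketch and not part of the original slides. The shape parameters are hypothetical, and `beta_pdf` is an illustrative helper name. R's `beta(a, b)` function evaluates the normalizing constant B(a, b).

```r
# Hypothetical shape parameters chosen for illustration (not from the slides).
a <- 2
b <- 5

# Beta density written out from the formula; beta(a, b) is R's Beta function.
beta_pdf <- function(theta, a, b) {
  ifelse(theta > 0 & theta < 1,
         theta^(a - 1) * (1 - theta)^(b - 1) / beta(a, b),
         0)
}

theta <- seq(0.01, 0.99, by = 0.01)

# The hand-written density matches dbeta() on the grid.
matches_dbeta <- isTRUE(all.equal(beta_pdf(theta, a, b), dbeta(theta, a, b)))

# Dividing by B(a, b) makes the density integrate to 1 over (0, 1).
total_mass <- integrate(function(t) beta_pdf(t, a, b), lower = 0, upper = 1)$value
```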

- Here, $B(a, b) = \dfrac{\Gamma(a)\Gamma(b)}{\Gamma(a + b)}$.
- The mean is $E(\theta) = \int \theta\, p(\theta)\, d\theta = a/(a + b)$.

Posterior of Bernoulli-Beta

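Let's derive the posterior of $\theta \mid x_{1:n}$:

$$
\begin{aligned}
p(\theta \mid x_{1:n}) &\propto p(x_{1:n} \mid \theta)\,p(\theta) \\
&= \theta^{\sum x_i}(1 - \theta)^{n - \sum x_i}\,\frac{1}{B(a, b)}\,\theta^{a - 1}(1 - \theta)^{b - 1}\,\mathbf{1}(0 < \theta < 1) \\
&\propto \theta^{a + \sum x_i - 1}(1 - \theta)^{b + n - \sum x_i - 1}\,\mathbf{1}(0 < \theta < 1) \\
&\propto \text{Beta}\Big(\theta \,\Big|\, a + \sum x_i,\; b + n - \sum x_i\Big).
\end{aligned}
$$

Approval ratings of Obama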

What is the proportion of people that approve of President Obama in PA?

- We take a random sample of 10 people in PA and find that 6 approve of President Obama.
- The national approval rating (Zogby poll) of President Obama in mid-September 2015 was 45%. We'll assume that in PA his approval rating is approximately 50%.
- Based on this prior information, we'll use a Beta prior for θ and we'll choose a and b.

Obama Example

    n = 10
    # Fixing values of a, b.
    a = 21/8
    b = 0.04
    th = seq(0, 1, length.out = 500)
    x = 6

    # We set the likelihood, prior, and posterior with
    # th (theta) as the sequence that we plot on the x-axis.
    # dbeta(th, c, d) evaluates the Beta(c, d) density; c and d are the shape parameters.
    # As a function of theta, the likelihood is proportional to a Beta(x+1, n-x+1) density.
    like = dbeta(th, x+1, n-x+1)
    prior = dbeta(th, a, b)
    post = dbeta(th, x+a, n-x+b)

Likelihood

    plot(th, like, type='l', ylab = "Density", lty = 3, lwd = 3, xlab = expression(theta))

[Figure: likelihood plotted against θ; x-axis θ, y-axis Density]

Prior

    plot(th, prior, type='l', ylab = "Density", lty = 3, lwd = 3, xlab = expression(theta))

[Figure: prior density plotted against θ; x-axis θ, y-axis Density]

Posterior

    plot(th, post, type='l', ylab = "Density", lty = 3, lwd = 3, xlab = expression(theta))

[Figure: posterior density plotted against θ; x-axis θ, y-axis Density]

Likelihood, Prior, and Posterior

[Figure: prior, likelihood, and posterior overlaid against θ; x-axis θ, y-axis Density; legend: Prior, Likelihood, Posterior]

Cast of characters

- Observed data: x
  - Note this could consist of many data points, e.g., $x = x_{1:n} = (x_1, \ldots, x_n)$.

likelihood                  p(x | θ)
prior                       p(θ)
posterior                   p(θ | x)
marginal likelihood         p(x)
posterior predictive        p(x_{n+1} | x_{1:n})
loss function               ℓ(s, a)
posterior expected loss     ρ(a, x)
risk / frequentist risk     R(θ, δ)
integrated risk             r(δ)

Marginal likelihood

The marginal likelihood is

$$
p(x) = \int p(x \mid \theta)\,p(\theta)\,d\theta
$$

- What is the marginal likelihood for the Bernoulli-Beta?

Posterior predictive distribution

- We may wish to predict a new data point $x_{n+1}$.
- We assume that $x_{1:(n+1)}$ are independent given θ.

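$$
\begin{aligned}
p(x_{n+1} \mid x_{1:n}) &= \int p(x_{n+1}, \theta \mid x_{1:n})\,d\theta \\
&= \int p(x_{n+1} \mid \theta, x_{1:n})\,p(\theta \mid x_{1:n})\,d\theta \\
&= \int p(x_{n+1} \mid \theta)\,p(\theta \mid x_{1:n})\,d\theta.
\end{aligned}
$$

Example: Back to the Beta-Bernoulli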

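Suppose $\theta \sim \text{Beta}(a, b)$ and $X_1, \ldots, X_n \mid \theta \overset{\text{iid}}{\sim} \text{Bernoulli}(\theta)$. Then the marginal likelihood is

$$
\begin{aligned}
p(x_{1:n}) &= \int p(x_{1:n} \mid \theta)\,p(\theta)\,d\theta \\
&= \int_0^1 \theta^{\sum x_i}(1 - \theta)^{n - \sum x_i}\,\frac{1}{B(a, b)}\,\theta^{a - 1}(1 - \theta)^{b - 1}\,d\theta \\
&= \frac{B\big(a + \sum x_i,\; b + n - \sum x_i\big)}{B(a, b)},
\end{aligned}
$$

by the integral definition of the Beta function.

Example continued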
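As a numerical check on this example, the closed-form marginal likelihood for the Beta-Bernoulli model can be compared with direct integration, and Bayes' theorem can be checked pointwise against the Beta posterior. The sketch below is an illustration, not part of the original slides. It uses the values from the approval-rating example (n = 10, 6 approvals, a = 21/8, b = 0.04). The names `s`, `joint`, `marg_closed`, and `marg_numeric` are illustrative.

```r
# Values from the approval-rating example in the slides.
n <- 10        # sample size
s <- 6         # number of approvals, i.e. the sum of the x_i
a <- 21 / 8    # prior shape parameters
b <- 0.04

# Likelihood of one particular binary sequence with s ones, times the
# Beta(a, b) prior density. The powers are combined algebraically so the
# function stays finite at the endpoints even though b < 1.
joint <- function(theta) {
  theta^(a + s - 1) * (1 - theta)^(b + n - s - 1) / beta(a, b)
}

# Marginal likelihood in closed form, via the Beta function ...
marg_closed <- beta(a + s, b + n - s) / beta(a, b)
# ... and by numerical integration over (0, 1).
marg_numeric <- integrate(joint, lower = 0, upper = 1)$value

# Bayes' theorem: likelihood * prior / marginal should equal the
# Beta(a + s, b + n - s) density at every theta in (0, 1).
theta <- c(0.2, 0.5, 0.8)
post_bayes <- joint(theta) / marg_closed
post_conj  <- dbeta(theta, a + s, b + n - s)
posterior_matches <- isTRUE(all.equal(post_bayes, post_conj))
```

The marginal likelihood here is for one specific ordered sequence of 0s and 1s, so it carries no binomial coefficient.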

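Let $a_n = a + \sum x_i$ and $b_n = b + n - \sum x_i$. It follows that the posterior distribution is $p(\theta \mid x_{1:n}) = \text{Beta}(\theta \mid a_n, b_n)$. The posterior predictive can be derived to be

$$
\begin{aligned}
P(X_{n+1} = 1 \mid x_{1:n}) &= \int P(X_{n+1} = 1 \mid \theta)\,p(\theta \mid x_{1:n})\,d\theta \\
&= \int \theta\,\text{Beta}(\theta \mid a_n, b_n)\,d\theta = \frac{a_n}{a_n + b_n},
\end{aligned}
$$

hence, the posterior predictive p.m.f. is

$$
p(x_{n+1} \mid x_{1:n}) = \frac{a_n^{x_{n+1}}\, b_n^{1 - x_{n+1}}}{a_n + b_n}\,\mathbf{1}(x_{n+1} \in \{0, 1\}).
$$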
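The posterior predictive can also be approximated by simulation: draw θ from the posterior, then draw a new observation given θ. The R sketch below compares this with the closed form above, using the values from the approval-rating example (n = 10, 6 approvals, a = 21/8, b = 0.04). It is an illustration and not part of the original slides. The seed, the number of draws, and all variable names are arbitrary choices.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Values from the approval-rating example in the slides.
n <- 10
s <- 6         # number of approvals, i.e. the sum of the x_i
a <- 21 / 8
b <- 0.04

# Posterior parameters.
a_n <- a + s
b_n <- b + n - s

# Closed-form posterior predictive probability that X_{n+1} = 1.
pred_closed <- a_n / (a_n + b_n)

# Monte Carlo version: theta ~ Beta(a_n, b_n), then X_{n+1} | theta ~ Bernoulli(theta).
n_draws <- 1e5
theta_draws <- rbeta(n_draws, a_n, b_n)
x_next <- rbinom(n_draws, size = 1, prob = theta_draws)
pred_mc <- mean(x_next)

# Posterior predictive p.m.f. on {0, 1}, written as in the formula above.
pred_pmf <- function(x) {
  ifelse(x %in% c(0, 1), a_n^x * b_n^(1 - x) / (a_n + b_n), 0)
}
```

`pred_mc` should be close to `pred_closed`, and `pred_pmf(0) + pred_pmf(1)` equals 1 up to floating-point error.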