Module 1: Introduction to Bayesian Inference, Part I

Rebecca C. Steorts

Agenda

- Motivations
- Traditional inference
- Bayesian inference
- Bernoulli, Beta
- Connection to the
- Posterior of Beta-Bernoulli
- Example with 2012 election data
- Marginal likelihood
- Posterior Prediction

Social networks

Precision Medicine

Estimating Rent Prices in Small Domains

[Figure: benchmarked rent estimates, with and without smoothing]

Traditional inference

You are given data X, and there is an unknown parameter θ that you wish to estimate. How would you estimate θ?

- Find an unbiased estimator of θ.
- Find the maximum likelihood estimate (MLE) of θ by looking at the likelihood of the data (see the sketch after this list).
- If you cannot remember the definition of an unbiased estimator or the MLE, review these before our next class.
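As a quick refresher on the second bullet, here is a minimal R sketch (with hypothetical data) showing that the MLE for a Bernoulli sample is the sample proportion:

set.seed(1)
x <- c(1, 0, 1, 1, 0)   # hypothetical binary data
# log-likelihood of theta: sum(x)*log(theta) + (n - sum(x))*log(1 - theta)
loglik <- function(theta) sum(x) * log(theta) + (length(x) - sum(x)) * log(1 - theta)
optimize(loglik, interval = c(0.01, 0.99), maximum = TRUE)$maximum  # ~ 0.6
mean(x)                                                             # 0.6, the sample proportion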

Bayesian Motivation

[credit: Peter Orbanz, Columbia University]

Bayesian inference

Bayesian methods trace their origins to the 18th century and the English Reverend Thomas Bayes, who, along with Pierre-Simon Laplace, discovered what we now call Bayes' Theorem.

- p(x | θ): likelihood
- p(θ): prior
- p(θ | x): posterior
- p(x): marginal likelihood

p(θ | x) = p(θ, x) / p(x) = p(x | θ) p(θ) / p(x) ∝ p(x | θ) p(θ)
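To make the theorem concrete, here is a minimal R sketch that carries out the update numerically for a hypothetical parameter taking only two values:

# Hypothetical discrete prior over θ ∈ {0.3, 0.7}, with one observed success x = 1.
prior <- c(0.25, 0.75)
like  <- dbinom(1, size = 1, prob = c(0.3, 0.7))   # p(x = 1 | θ) for each θ
post  <- like * prior / sum(like * prior)          # Bayes' theorem; sum(...) is p(x)
post                                               # 0.125, 0.875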

Bernoulli distribution

The Bernoulli distribution is very common because many outcomes of interest are binary.

- Consider flipping a coin (heads or tails).
- We can represent this as a binary outcome, where the probability of heads is θ and the probability of tails is 1 − θ.
- We write the random variable as X ∼ Bernoulli(θ), where 0 < θ < 1.

It follows that the likelihood is

p(x | θ) = θ^{x} (1 − θ)^{1−x} 1(0 < θ < 1).
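As a quick numeric check, R's built-in dbinom with size = 1 gives the Bernoulli pmf and matches the formula above (θ = 0.6 is a hypothetical value):

theta <- 0.6                        # hypothetical success probability
x <- c(0, 1)
dbinom(x, size = 1, prob = theta)   # built-in Bernoulli pmf: 0.4, 0.6
theta^x * (1 - theta)^(1 - x)       # the formula above gives the same values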

- Exercise: what is the mean and the variance of X?

Bernoulli distribution

- Suppose that X1, ..., Xn are iid Bernoulli(θ). Then for x1, ..., xn ∈ {0, 1}, what is the likelihood?

Notation

- ∝ means “proportional to”
- x1:n denotes x1, ..., xn

Likelihood

p(x1:n | θ) = P(X1 = x1, ..., Xn = xn | θ)
            = ∏_{i=1}^n P(Xi = xi | θ)
            = ∏_{i=1}^n p(xi | θ)
            = ∏_{i=1}^n θ^{xi} (1 − θ)^{1−xi}
            = θ^{Σ xi} (1 − θ)^{n − Σ xi}.
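A minimal R check (with hypothetical data) that the product of individual Bernoulli pmfs matches the closed form:

theta <- 0.4                               # hypothetical value of θ
x <- c(1, 0, 1, 1, 0); n <- length(x)
prod(theta^x * (1 - theta)^(1 - x))        # product over i of the individual pmfs
theta^sum(x) * (1 - theta)^(n - sum(x))    # closed form above, same number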

Beta distribution

Given a, b > 0, we write θ ∼ Beta(a, b) to mean that θ has pdf

p(θ) = Beta(θ | a, b) = (1 / B(a, b)) θ^{a−1} (1 − θ)^{b−1} 1(0 < θ < 1),

i.e., p(θ) ∝ θ^{a−1} (1 − θ)^{b−1} on the interval from 0 to 1.

- Here, B(a, b) = Γ(a) Γ(b) / Γ(a + b).
- The mean is E(θ) = ∫ θ p(θ) dθ = a / (a + b).
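A quick numeric check of the mean formula in R (a = 3 and b = 2 are hypothetical shape parameters):

a <- 3; b <- 2
integrate(function(t) t * dbeta(t, a, b), 0, 1)$value  # numeric E(θ), about 0.6
a / (a + b)                                            # analytic mean: 0.6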

Posterior of Bernoulli-Beta

Let's derive the posterior of θ | x1:n:

p(θ | x1:n) ∝ p(x1:n | θ) p(θ)
            = θ^{Σ xi} (1 − θ)^{n − Σ xi} (1 / B(a, b)) θ^{a−1} (1 − θ)^{b−1} 1(0 < θ < 1)
            ∝ θ^{a + Σ xi − 1} (1 − θ)^{b + n − Σ xi − 1} 1(0 < θ < 1)
            ∝ Beta(θ | a + Σ xi, b + n − Σ xi).
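A minimal R sketch of this conjugate update on simulated data (the Beta(1, 1) prior and θ = 0.3 are hypothetical choices):

set.seed(1)
a <- 1; b <- 1                          # hypothetical uniform prior
x <- rbinom(20, size = 1, prob = 0.3)   # simulated Bernoulli data
a_post <- a + sum(x)                    # posterior is Beta(a + Σ xi,
b_post <- b + length(x) - sum(x)        #                   b + n − Σ xi)
a_post / (a_post + b_post)              # posterior mean as an estimate of θ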

Approval ratings of Obama

What is the proportion of people who approve of President Obama in PA?

- We take a random sample of 10 people in PA and find that 6 approve of President Obama.
- The national approval rating (Zogby poll) of President Obama in mid-September 2015 was 45%. We'll assume that in PA his approval rating is approximately 50%.
- Based on this prior information, we'll use a Beta prior for θ, and we'll choose a and b.

Obama Example

n = 10
# Fixing values of a, b.
a = 21/8
b = 0.04
th = seq(0, 1, length = 500)
x = 6

# We set the likelihood, prior, and posterior with
# th as the sequence that we plot on the x-axis.
# dbeta(th, c, d) is the Beta(c, d) density, where c and d are shape parameters.
like = dbeta(th, x + 1, n - x + 1)   # normalized likelihood is a Beta(x + 1, n - x + 1) density
prior = dbeta(th, a, b)
post = dbeta(th, x + a, n - x + b)

Likelihood

plot(th, like, type = 'l', ylab = "Density", lty = 3, lwd = 3,
     xlab = expression(theta))

[Figure: likelihood density over θ]

Prior

plot(th, prior, type = 'l', ylab = "Density", lty = 3, lwd = 3,
     xlab = expression(theta))

[Figure: prior density over θ]

Posterior

plot(th, post, type = 'l', ylab = "Density", lty = 3, lwd = 3,
     xlab = expression(theta))

[Figure: posterior density over θ]

Likelihood, Prior, and Posterior
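The code for the combined plot is not shown in the slides; a minimal sketch, reusing the th, like, prior, and post vectors defined above, might look like this:

plot(th, post, type = 'l', lwd = 3, ylab = "Density",
     xlab = expression(theta))                 # posterior (solid)
lines(th, like, lty = 3, lwd = 3)              # likelihood (dotted)
lines(th, prior, lty = 2, lwd = 3)             # prior (dashed)
legend("topleft", legend = c("Prior", "Likelihood", "Posterior"),
       lty = c(2, 3, 1), lwd = 3)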

[Figure: prior, likelihood, and posterior densities overlaid, with legend]

Cast of characters

- Observed data: x
- Note this could consist of many data points, e.g., x = x1:n = (x1, ..., xn).

likelihood               p(x | θ)
prior                    p(θ)
posterior                p(θ | x)
marginal likelihood      p(x)
posterior predictive     p(xn+1 | x1:n)
loss function            ℓ(s, a)
posterior expected loss  ρ(a, x)
risk / frequentist risk  R(θ, δ)
integrated risk          r(δ)

Marginal likelihood

The marginal likelihood is

p(x) = ∫ p(x | θ) p(θ) dθ

- What is the marginal likelihood for the Bernoulli-Beta?

Posterior predictive distribution

- We may wish to predict a new data point xn+1.
- We assume that x1:(n+1) are independent given θ.

p(xn+1 | x1:n) = ∫ p(xn+1, θ | x1:n) dθ
               = ∫ p(xn+1 | θ, x1:n) p(θ | x1:n) dθ
               = ∫ p(xn+1 | θ) p(θ | x1:n) dθ.
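This integral can also be read as a simulation recipe: draw θ from the posterior, then draw xn+1 given θ. A minimal R sketch, assuming a hypothetical Beta(7, 5) posterior:

set.seed(1)
theta_draws <- rbeta(1e5, 7, 5)                      # draws from p(θ | x1:n)
x_new <- rbinom(1e5, size = 1, prob = theta_draws)   # draws from p(xn+1 | x1:n)
mean(x_new)                                          # ≈ 7/12, the predictive P(Xn+1 = 1)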

Example: Back to the Beta-Bernoulli

Suppose θ ∼ Beta(a, b) and X1, ..., Xn | θ are iid Bernoulli(θ). Then the marginal likelihood is

p(x1:n) = ∫ p(x1:n | θ) p(θ) dθ
        = ∫_0^1 θ^{Σ xi} (1 − θ)^{n − Σ xi} (1 / B(a, b)) θ^{a−1} (1 − θ)^{b−1} dθ
        = B(a + Σ xi, b + n − Σ xi) / B(a, b),

by the integral definition of the Beta function.
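A numeric check of this closed form in R (the data and prior values are hypothetical):

a <- 2; b <- 3                                    # hypothetical prior
x <- c(1, 0, 1, 1, 0); n <- length(x); s <- sum(x)
integrate(function(t) t^s * (1 - t)^(n - s) * dbeta(t, a, b), 0, 1)$value
beta(a + s, b + n - s) / beta(a, b)               # closed form, same value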

Example continued

Let an = a + Σ xi and bn = b + n − Σ xi. It follows that the posterior distribution is p(θ | x1:n) = Beta(θ | an, bn). The posterior predictive can be derived to be

P(Xn+1 = 1 | x1:n) = ∫ P(Xn+1 = 1 | θ) p(θ | x1:n) dθ
                   = ∫ θ Beta(θ | an, bn) dθ
                   = an / (an + bn),

hence, the posterior predictive p.m.f. is

p(xn+1 | x1:n) = (an / (an + bn))^{xn+1} (bn / (an + bn))^{1−xn+1} 1(xn+1 ∈ {0, 1}).
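And a matching numeric check of the predictive probability (same hypothetical values as in the marginal likelihood check):

a <- 2; b <- 3
x <- c(1, 0, 1, 1, 0)
an <- a + sum(x); bn <- b + length(x) - sum(x)
integrate(function(t) t * dbeta(t, an, bn), 0, 1)$value  # ∫ θ Beta(θ | an, bn) dθ
an / (an + bn)                                           # same value: P(Xn+1 = 1 | x1:n)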