
CS340 Machine learning Bayesian 3

1 Outline
• Conjugate analysis of µ and σ²
• Bayesian model selection
• Summarizing the posterior

2 Unknown mean and precision
• The likelihood is

p(D|μ, λ) = λ^(n/2) / (2π)^(n/2) · exp( −(λ/2) Σ_{i=1}^n (xᵢ − μ)² )
          = λ^(n/2) / (2π)^(n/2) · exp( −(λ/2) [ n(μ − x̄)² + Σ_{i=1}^n (xᵢ − x̄)² ] )

• The natural conjugate prior is the normal-gamma

p(μ, λ) = NG(μ, λ | μ₀, κ₀, α₀, β₀)
        ≝ N(μ | μ₀, (κ₀λ)⁻¹) · Ga(λ | α₀, rate = β₀)
        = (1/Z_NG) λ^(α₀ − 1/2) exp( −(λ/2) [ κ₀(μ − μ₀)² + 2β₀ ] )

3 Gamma distribution
• Used for positive reals

Ga(x | shape = a, rate = b) = (b^a / Γ(a)) x^(a−1) e^(−bx),  x, a, b > 0   [Bishop]
Ga(x | shape = α, scale = β) = (1 / (β^α Γ(α))) x^(α−1) e^(−x/β)   [Matlab]
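The two gamma conventions are easy to confuse; a minimal Python sketch (function names are mine) checks that they agree when scale = 1/rate:

```python
import math

def gamma_pdf_rate(x, a, b):
    # Ga(x | shape=a, rate=b) = b^a / Gamma(a) * x^(a-1) * exp(-b*x)   [Bishop]
    return math.exp(a * math.log(b) - math.lgamma(a)
                    + (a - 1) * math.log(x) - b * x)

def gamma_pdf_scale(x, alpha, beta):
    # Ga(x | shape=alpha, scale=beta) = x^(alpha-1) exp(-x/beta) / (beta^alpha Gamma(alpha))   [Matlab]
    return math.exp((alpha - 1) * math.log(x) - x / beta
                    - alpha * math.log(beta) - math.lgamma(alpha))
```

So a prior stated with rate b must be passed to Matlab's gampdf/gamrnd with scale 1/b.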

4 Posterior is also NG • Just update the hyper-parameters

p(μ, λ | D) = NG(μ, λ | μₙ, κₙ, αₙ, βₙ)

μₙ = (κ₀μ₀ + n x̄) / (κ₀ + n)
κₙ = κ₀ + n

αₙ = α₀ + n/2
βₙ = β₀ + (1/2) Σ_{i=1}^n (xᵢ − x̄)² + κ₀ n (x̄ − μ₀)² / (2(κ₀ + n))
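The updates are mechanical to code; a minimal sketch (function name and toy data are mine):

```python
def ng_posterior(xs, mu0, kappa0, alpha0, beta0):
    # Normal-gamma posterior hyper-parameters, following the update
    # equations above
    n = len(xs)
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)        # sum of squared deviations
    mu_n = (kappa0 * mu0 + n * xbar) / (kappa0 + n)
    kappa_n = kappa0 + n
    alpha_n = alpha0 + n / 2
    beta_n = (beta0 + 0.5 * ss
              + kappa0 * n * (xbar - mu0) ** 2 / (2 * (kappa0 + n)))
    return mu_n, kappa_n, alpha_n, beta_n
```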

Derivation of this result not on exam

5 Posterior marginals
• Precision: p(λ|D) = Ga(λ | αₙ, βₙ)
• Mean: p(μ|D) = t_{2αₙ}(μ | μₙ, βₙ/(αₙκₙ))

Student t distribution

Derivation of this result not on exam

6 Student t distribution
• Approaches a Gaussian as ν → ∞

t_ν(x | μ, σ²) ∝ [ 1 + (1/ν)((x − μ)/σ)² ]^(−(ν+1)/2)
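The limiting behaviour can be verified numerically; a sketch using the normalized density, whose constant Γ((ν+1)/2) / (Γ(ν/2) √(νπ) σ) is standard for the Student t:

```python
import math

def student_t_pdf(x, nu, mu, sigma2):
    # normalized t_nu(x | mu, sigma2) density; the constant is
    # Gamma((nu+1)/2) / (Gamma(nu/2) * sqrt(nu*pi) * sigma)
    sigma = math.sqrt(sigma2)
    z = (x - mu) / sigma
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi) - math.log(sigma))
    return math.exp(log_c - (nu + 1) / 2 * math.log1p(z * z / nu))

def gauss_pdf(x, mu, sigma2):
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)
```

For large ν the two densities coincide to many decimals, while for small ν the t has much heavier tails.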

7 Robustness of t distribution

Student t less affected by outliers

[Figure: Gaussian vs Student t fit to data with outliers; Bishop Fig. 2.16]

8 Posterior predictive distribution
• Also a t distribution (fatter tails than a Gaussian, due to the uncertainty in λ)

p(x|D) = t_{2αₙ}(x | μₙ, βₙ(κₙ + 1)/(αₙκₙ))
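One way to sanity-check this result is Monte Carlo: draw λ from its Gamma posterior, μ given λ, then x given both, and compare the sample moments against those of the t above. A sketch (the posterior hyper-parameter values are made up for illustration):

```python
import math
import random

random.seed(0)
# made-up posterior hyper-parameters, for illustration only
mu_n, kappa_n, alpha_n, beta_n = 0.0, 3.0, 5.0, 4.0
nu = 2 * alpha_n                                        # t degrees of freedom
sigma2 = beta_n * (kappa_n + 1) / (alpha_n * kappa_n)   # squared scale above

xs = []
for _ in range(100000):
    lam = random.gammavariate(alpha_n, 1.0 / beta_n)    # Ga(alpha_n, rate=beta_n)
    mu = random.gauss(mu_n, math.sqrt(1.0 / (kappa_n * lam)))
    xs.append(random.gauss(mu, math.sqrt(1.0 / lam)))

mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
target_var = sigma2 * nu / (nu - 2)                     # variance of t_nu
```

The sample mean and variance should match μₙ and σ²ν/(ν−2) up to Monte Carlo error.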

Derivation of this result not on exam

9 Uninformative prior
• It can be shown (see handout) that an uninformative prior has the form p(μ, λ) ∝ 1/λ
• This can be emulated using the following hyper-parameters

κ₀ = 0
α₀ = −1/2
β₀ = 0

• This prior is improper (does not integrate to 1), but the posterior is proper if n ≥ 1
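Plugging these hyper-parameters into the update equations of slide 4 recovers the classical frequentist result: p(μ|D) becomes a t with n−1 degrees of freedom, centred at the sample mean, with squared scale s²/n. A quick check (toy data mine):

```python
xs = [2.1, 3.4, 1.7, 4.0, 2.8]              # toy data (mine)
n = len(xs)
xbar = sum(xs) / n
ss = sum((x - xbar) ** 2 for x in xs)

# NG update with kappa0 = 0, alpha0 = -1/2, beta0 = 0
mu_n = xbar                                  # (kappa0*mu0 + n*xbar)/(kappa0 + n)
kappa_n = n
alpha_n = -0.5 + n / 2                       # = (n - 1)/2
beta_n = 0.5 * ss                            # the kappa0 term vanishes

# marginal p(mu|D) = t_{2 alpha_n}(mu | mu_n, beta_n / (alpha_n * kappa_n))
scale2 = beta_n / (alpha_n * kappa_n)
s2 = ss / (n - 1)                            # unbiased sample variance
# scale2 equals s2/n and 2*alpha_n equals n-1: the classical t interval
```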

Derivation of this result not on exam

10 Outline
• Conjugate analysis of µ and σ²
• Bayesian model selection
• Summarizing the posterior

11 Bayesian model selection
• Suppose we have K possible models, each with its own parameters θᵢ. The posterior over models is defined using the marginal likelihood ("evidence") p(D|M = i), which is the normalizing constant from the posterior over parameters:

p(M = i|D) = p(M = i) p(D|M = i) / p(D)

p(D|M = i) = ∫ p(D|θ, M = i) p(θ|M = i) dθ

p(θ|D, M = i) = p(D|θ, M = i) p(θ|M = i) / p(D|M = i)

12 Bayes factors
• To compare two models, use the posterior odds

O_ij = p(Mᵢ|D) / p(Mⱼ|D) = [ p(D|Mᵢ) / p(D|Mⱼ) ] × [ p(Mᵢ) / p(Mⱼ) ]
                              (Bayes factor)          (prior odds)

• The Bayes factor BF(i,j) is a Bayesian version of the likelihood ratio test; it can be used to compare models of different complexity

13 Marginal likelihood for Beta-Bernoulli
• We know that p(θ|D) = Beta(θ | α₁′, α₀′), where α₁′ = α₁ + N₁ and α₀′ = α₀ + N₀

p(θ)p(D|θ) p(θ|D) = p(D)

= (1/p(D)) · (1/B(α₁, α₀)) θ^(α₁−1) (1 − θ)^(α₀−1) · θ^(N₁) (1 − θ)^(N₀)
= θ^(α₁′−1) (1 − θ)^(α₀′−1) / B(α₁′, α₀′)

• Hence the marginal likelihood is a ratio of normalizing constants:

p(D) = ∫ p(D|θ) p(θ) dθ = B(α₁′, α₀′) / B(α₁, α₀)
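This closed form can be checked against brute-force numerical integration; a sketch using only log-gamma (helper names and the small example are mine):

```python
import math

def log_beta(a, b):
    # log of the Beta function via log-gamma
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marglik(N1, N0, a1, a0):
    # log p(D) = log B(a1+N1, a0+N0) - log B(a1, a0), as derived above
    return log_beta(a1 + N1, a0 + N0) - log_beta(a1, a0)

# brute-force check on a small example: integrate p(D|theta) p(theta) on a grid
N1, N0, a1, a0 = 3, 2, 1.0, 1.0
log_z = log_beta(a1, a0)
steps = 100000
h = 1.0 / steps
integral = h * sum(
    math.exp((N1 + a1 - 1) * math.log(i * h)
             + (N0 + a0 - 1) * math.log(1 - i * h) - log_z)
    for i in range(1, steps))
```

For 3 heads and 2 tails under a uniform prior, both routes give p(D) = B(4,3) = 1/60.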

14 Example: is the Eurocoin biased?
• Suppose we toss a coin N = 250 times and observe N₁ = 141 heads and N₀ = 109 tails.
• Consider two hypotheses: H₀, that θ = 0.5, and H₁, that θ ≠ 0.5. In fact we can let H₁ be p(θ) = U(0,1), since p(θ = 0.5|H₁) = 0 (it is a pdf).
• For H₀, the marginal likelihood is

p(D|H₀) = 0.5^N

• For H₁, the marginal likelihood is

P(D|H₁) = ∫ P(D|θ, H₁) P(θ|H₁) dθ = B(α₁ + N₁, α₀ + N₀) / B(α₁, α₀)

• Hence the Bayes factor is

BF(1,0) = P(D|H₁) / P(D|H₀) = B(α₁ + N₁, α₀ + N₀) / (B(α₁, α₀) · 0.5^N)

15 Bayes factor vs prior strength
• Let α₁ = α₀ range from 0 to 1000.
• The largest BF in favor of H₁ (biased coin) is only about 2.0, which is very weak evidence of bias.
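The whole sweep fits in a few lines; a sketch (grid over integer α from 1 to 1000, since α = 0 gives an improper prior):

```python
import math

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

N1, N0 = 141, 109
N = N1 + N0

def bayes_factor(alpha):
    # BF(1,0) = B(alpha+N1, alpha+N0) / (B(alpha, alpha) * 0.5^N)
    return math.exp(log_beta(alpha + N1, alpha + N0)
                    - log_beta(alpha, alpha) - N * math.log(0.5))

bf_uniform = bayes_factor(1.0)                       # Beta(1,1), i.e. uniform prior
best = max(bayes_factor(a) for a in range(1, 1001))  # sweep prior strength
```

Under the uniform prior the BF actually comes out a little below 1 (slightly favoring H₀); even the most favorable prior strength yields only weak evidence for bias, as the slide states.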

[Figure: BF(1,0) as a function of the prior strength α]

16 Bayesian Occam's razor
• The use of the marginal likelihood p(D|M) automatically penalizes overly complex models, since they spread their probability mass very widely (they predict that everything is possible), so the probability of the actual data is small.

[Figure: p(D|M) for three models across possible datasets D — too simple (cannot predict D), just right, too complex (can predict everything, so assigns low probability to any one dataset); Bishop Fig. 3.13]

17 Bayesian Occam's razor for biased coin

• Blue line: p(D|H₀) = 0.5^N
• Red curve: p(D|H₁) = ∫ p(D|θ) Beta(θ|1,1) dθ

[Figure: marginal likelihood of a biased coin vs number of heads observed in 5 tosses]
• If we have already observed 4 heads, it is much more likely to observe a 5th head than a tail, since θ gets updated sequentially.
• If we observe 2 or 3 heads out of 5, the simpler model is more likely.

18 Bayesian Information Criterion (BIC)
• If we make a Gaussian approximation to p(θ|D) (Laplace approximation), and approximate |H| ≈ N^d, the log marginal likelihood becomes

log p(D) ≈ log p(D|θ_ML) − (d/2) log N

• Here d is the dimension / number of free parameters.
• AIC (Akaike information criterion) is defined as

log p(D) ≈ log p(D|θ_ML) − d

• Can use penalized log-likelihood for model selection instead of cross-validation.
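As an illustration, the penalized scores can be used to compare a fixed-mean Gaussian against one with a free mean; a sketch (toy data and the particular model pair are mine):

```python
import math

def gauss_loglik(xs, mu, sigma2):
    # Gaussian log-likelihood of the data
    n = len(xs)
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma2))

def bic_score(loglik, d, n):
    return loglik - 0.5 * d * math.log(n)   # log p(D|theta_ML) - (d/2) log N

def aic_score(loglik, d):
    return loglik - d

xs = [0.1, -0.2, 0.3, 0.05, -0.1, 0.15]     # toy data (mine), nearly zero-mean
n = len(xs)
xbar = sum(xs) / n

# model 0: mean fixed at 0, only the variance is fitted (d = 1)
s2_0 = sum(x ** 2 for x in xs) / n
ll0 = gauss_loglik(xs, 0.0, s2_0)

# model 1: mean and variance both fitted (d = 2)
s2_1 = sum((x - xbar) ** 2 for x in xs) / n
ll1 = gauss_loglik(xs, xbar, s2_1)

bic0, bic1 = bic_score(ll0, 1, n), bic_score(ll1, 2, n)
aic0, aic1 = aic_score(ll0, 1), aic_score(ll1, 2)
```

The richer model always fits at least as well, but here its small likelihood gain does not pay for the extra parameter's penalty, so both criteria prefer the simpler model.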

19 Outline
• Conjugate analysis of µ and σ²
• Bayesian model selection
• Summarizing the posterior

20 Summarizing the posterior
• If p(θ|D) is too complex to plot, we can compute various summary statistics, such as the posterior mean, mode and median

θ̂_mean = E[θ|D]

θ̂_MAP = argmax_θ p(θ|D)

θ̂_median = t such that p(θ > t|D) = 0.5
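For a Beta posterior the mean and mode have closed forms, while the median does not; a sketch that inverts the cdf numerically (helper names and the trapezoid/bisection scheme are mine):

```python
import math

a, b = 48.0, 54.0       # Beta posterior, e.g. 47 heads in 100 with a Beta(1,1) prior
log_z = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def pdf(th):
    # normalized Beta(a, b) density
    if th <= 0.0 or th >= 1.0:
        return 0.0
    return math.exp((a - 1) * math.log(th) + (b - 1) * math.log(1 - th) - log_z)

post_mean = a / (a + b)              # E[theta|D]
post_mode = (a - 1) / (a + b - 2)    # argmax of the density (needs a, b > 1)

def cdf(x, steps=4000):
    # trapezoid rule; the integrand vanishes at 0, so this is accurate enough
    h = x / steps
    return h * (sum(pdf(i * h) for i in range(1, steps)) + 0.5 * pdf(x))

# median: bisect for the t with p(theta <= t | D) = 0.5
lo, hi = 0.0, 1.0
for _ in range(50):
    mid = 0.5 * (lo + hi)
    if cdf(mid) < 0.5:
        lo = mid
    else:
        hi = mid
post_median = 0.5 * (lo + hi)
```

For this slightly skewed posterior the three summaries differ, with the median between mode and mean.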

21 Bayesian credible intervals
• We can represent our uncertainty using a posterior credible interval, p(ℓ ≤ θ ≤ u|D) ≥ 1 − α
• We set ℓ = F⁻¹(α/2) and u = F⁻¹(1 − α/2), where F is the cdf of p(θ|D)

22 Example
• We see 47 heads out of 100 trials.
• Using a Beta(1,1) prior, what is the 95% credible interval for the probability of heads?

S = 47; N = 100;
a = S + 1; b = (N - S) + 1;
alpha = 0.05;
l = betainv(alpha/2, a, b);
u = betainv(1 - alpha/2, a, b);
CI = [l, u]    % 0.3749  0.5673
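The same computation in plain Python needs a numerical stand-in for betainv; a self-contained sketch (trapezoid cdf plus bisection, accurate here to roughly four decimals):

```python
import math

S, N = 47, 100
a, b = S + 1, (N - S) + 1            # Beta(1,1) prior -> Beta(48, 54) posterior
alpha = 0.05
log_z = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def pdf(th):
    # normalized Beta(a, b) posterior density
    if th <= 0.0 or th >= 1.0:
        return 0.0
    return math.exp((a - 1) * math.log(th) + (b - 1) * math.log(1 - th) - log_z)

def cdf(x, steps=4000):
    # trapezoid rule, accurate enough for this smooth density
    h = x / steps
    return h * (sum(pdf(i * h) for i in range(1, steps)) + 0.5 * pdf(x))

def betainv(q):
    # quantile by bisection, mirroring Matlab's betainv(q, a, b)
    lo, hi = 0.0, 1.0
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if cdf(mid) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

l, u = betainv(alpha / 2), betainv(1 - alpha / 2)   # ~ [0.3749, 0.5673]
```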

23 Posterior sampling
• If θ is high-dimensional, it is hard to visualize p(θ|D).
• A common strategy is to draw typical values θˢ ∼ p(θ|D) and analyze the resulting samples.
• E.g. we can generate fake data xˢ ∼ p(x|θˢ) to see if it looks like the real data (a simple kind of posterior predictive check of model adequacy).
• See handout for some examples.
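For the coin example above, posterior sampling plus a predictive check is a few lines; a sketch (sample sizes are mine):

```python
import random

random.seed(1)
a, b = 48, 54            # Beta posterior from the coin example (47 heads in 100)
n_samples, n_flips = 2000, 100

fake_head_counts = []
for _ in range(n_samples):
    theta_s = random.betavariate(a, b)              # theta^s ~ p(theta|D)
    heads = sum(1 for _ in range(n_flips) if random.random() < theta_s)
    fake_head_counts.append(heads)                  # summary of x^s ~ p(x|theta^s)

avg_heads = sum(fake_head_counts) / n_samples
# replicated datasets should look like the real data (47 heads);
# if they did not, the model would be suspect
```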
