
CS340 Machine learning
Bayesian model selection

Bayesian model selection
• Suppose we have several models, each with potentially different numbers of parameters.
• Example: M0 = constant, M1 = straight line, M2 = quadratic, M3 = cubic
• The posterior over models is defined using Bayes' rule, where p(D|m) is called the marginal likelihood or "evidence" for m

$$p(m|D) = \frac{p(m)\,p(D|m)}{p(D)}$$

$$p(D|m) = \int p(D|\theta, m)\, p(\theta|m)\, d\theta, \qquad p(D) = \sum_{m \in \mathcal{M}} p(D|m)\, p(m)$$
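As a minimal sketch (not from the slides), the posterior over models can be computed from log marginal likelihoods and a model prior exactly as above; the log-evidence values here are hypothetical placeholders, not real results.

```python
import numpy as np

def model_posterior(log_evidence, log_prior):
    """Compute p(m|D) from log p(D|m) and log p(m) via Bayes' rule."""
    log_joint = log_evidence + log_prior      # log p(D|m) + log p(m)
    log_joint = log_joint - log_joint.max()   # stabilize before exponentiating
    post = np.exp(log_joint)
    return post / post.sum()                  # normalize by p(D)

# Hypothetical log-evidence values for M0..M3 (placeholders only)
log_ev = np.array([-40.0, -25.0, -22.0, -23.5])
log_prior = np.log(np.full(4, 1 / 4))         # uniform prior p(m) = 1/4
print(model_posterior(log_ev, log_prior))
```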

Polynomial regression, n=8
[Figure: polynomial fits to n=8 data points; truth = quadratic (green curve)]

logev(m) = log p(D|m), with model prior p(m) = 1/4

With little data, we choose a simple model.

Polynomial regression, n=32

[Figure: polynomial fits to n=32 data points; truth = quadratic (green curve)]

Shape of cubic changes a lot – high variance estimator

With more data, we choose a more complex model.

Bayesian Occam's razor
• The use of the marginal likelihood p(D|M) automatically penalizes overly complex models, since they spread their probability mass very widely (predict that everything is possible), so the probability of the actual data is small.

[Figure (Bishop 3.13): p(D|m) for three models over possible data sets – too simple (cannot predict D), just right, too complex (can predict everything)]

Bayesian Occam's razor

[Figure (MacKay 28.6): samples from each model vs. the actual data. Model 1 can only generate a few types of data; Model 3 can generate many data sets (its prior is broad, its posterior is peaked).]

Computing marginal likelihoods
• Let p'(D|θ) and p'(θ) be the unnormalized likelihood and prior. Then
$$p(\theta|D) = \frac{1}{p(D)}\,\frac{1}{Z_l}\,p'(D|\theta)\,\frac{1}{Z_0}\,p'(\theta) = \frac{1}{Z_n}\,p'(\theta|D)$$
$$\frac{1}{Z_n} = \frac{1}{p(D)}\,\frac{1}{Z_l}\,\frac{1}{Z_0} \quad\Rightarrow\quad p(D) = \frac{Z_n}{Z_0\,Z_l}$$
• Eg. Beta-Bernoulli model:
$$p(D) = \frac{B(\alpha_1 + N_1, \alpha_0 + N_0)}{B(\alpha_1, \alpha_0)}$$
• Eg. Normal-Gamma-Normal model:

$$p(D) = \frac{\Gamma(\alpha_n)\,\beta_0^{\alpha_0}}{\Gamma(\alpha_0)\,\beta_n^{\alpha_n}} \left(\frac{\kappa_0}{\kappa_n}\right)^{1/2} \left(\frac{1}{2\pi}\right)^{n/2}$$
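As a small sketch (assuming SciPy is available), the Beta-Bernoulli marginal likelihood above – the ratio of posterior to prior normalizers – can be evaluated in log space with betaln to avoid underflow:

```python
from scipy.special import betaln

def log_marglik_beta_bernoulli(N1, N0, a1=1.0, a0=1.0):
    """log p(D) = log B(a1 + N1, a0 + N0) - log B(a1, a0)."""
    return betaln(a1 + N1, a0 + N0) - betaln(a1, a0)

# Example: uniform Beta(1,1) prior and the coin counts from the next slide
print(log_marglik_beta_bernoulli(N1=141, N0=109))
```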

Bayesian hypothesis testing
• Suppose we toss a coin N=250 times and observe N1=141 heads and N0=109 tails.
• Consider two hypotheses: H0 that θ = 0.5 and H1 that θ ≠ 0.5. Actually, we can let H1 be p(θ) = U(0,1), since p(θ = 0.5 | H1) = 0 (pdf).

• For H0, the marginal likelihood is
$$p(D|H_0) = 0.5^N$$

• For H1, the marginal likelihood is
$$P(D|H_1) = \int_0^1 P(D|\theta, H_1)\, P(\theta|H_1)\, d\theta = \frac{B(\alpha_1 + N_1, \alpha_0 + N_0)}{B(\alpha_1, \alpha_0)}$$

Bayes factors
• To compare two models, use the posterior odds

$$O_{ij} = \frac{p(M_i|D)}{p(M_j|D)} = \frac{p(D|M_i)\,p(M_i)}{p(D|M_j)\,p(M_j)}$$

Posterior odds = Bayes factor BF(i,j) × prior odds

• If the priors are equal, it suffices to use the BF.
• The BF is a Bayesian version of a likelihood ratio test that can be used to compare models of different complexity. If BF(i,j) >> 1, prefer model i.
• For the coin example,

$$BF(1,0) = \frac{P(D|H_1)}{P(D|H_0)} = \frac{B(\alpha_1 + N_1, \alpha_0 + N_0)}{B(\alpha_1, \alpha_0)}\,\frac{1}{0.5^N}$$

Bayes factor vs prior strength
• Let α1 = α0 range from 0 to 1000.
• The largest BF in favor of H1 (biased coin) is only 2.0, which is very weak evidence of bias.

[Figure: BF(1,0) as a function of the prior strength α]
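A short sketch (assuming SciPy) of how such a sweep could be reproduced: it evaluates BF(1,0) in log space for the N1=141, N0=109 data over a grid of symmetric prior strengths α1 = α0 = α.

```python
import numpy as np
from scipy.special import betaln

N1, N0 = 141, 109
N = N1 + N0

def log_bf10(a):
    """log BF(1,0) for a symmetric Beta(a, a) prior vs. the point null theta = 0.5."""
    return betaln(a + N1, a + N0) - betaln(a, a) - N * np.log(0.5)

# grid of prior strengths; alpha = 0 itself is excluded since Beta(0,0) is improper
alphas = np.linspace(0.1, 1000, 10000)
bfs = np.exp([log_bf10(a) for a in alphas])
print("max BF(1,0) over the grid:", bfs.max())
```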

Bayesian Occam's razor for biased coin

Blue line: p(D|H0) = 0.5^N
Red curve: p(D|H1) = ∫ p(D|θ) Beta(θ|1,1) dθ

[Figure: marginal likelihood of the biased coin, evaluated for all 32 possible sequences of 5 tosses, grouped by number of heads]

If we have already observed 4 heads, it is much more likely to observe a 5th head than a tail, since θ gets updated sequentially.

If we observe 2 or 3 heads out of 5, the simpler model is more likely.
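A minimal sketch (assuming SciPy) of the comparison behind this figure: for each possible number of heads in 5 tosses, it evaluates the probability of one particular sequence under the fair-coin model and under the Beta(1,1) biased-coin model.

```python
import numpy as np
from scipy.special import betaln

N = 5
for N1 in range(N + 1):
    N0 = N - N1
    p_h0 = 0.5 ** N                                          # p(D|H0) for any particular sequence
    p_h1 = np.exp(betaln(1 + N1, 1 + N0) - betaln(1, 1))     # p(D|H1) = B(1+N1, 1+N0) / B(1,1)
    print(f"N1={N1}: p(D|H0)={p_h0:.4f}, p(D|H1)={p_h1:.4f}")
```

For sequences with 2 or 3 heads the fair-coin model assigns the higher probability, matching the remark above.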

CS340 Machine learning
Frequentist parameter estimation

Parameter estimation
• We have seen how the Bayesian approach offers a principled solution to the parameter estimation problem.
• However, when the number of samples (relative to the number of parameters) is large, we can often approximate the posterior as a delta function centered on the MAP estimate.

$$\hat{\theta}_{MAP} = \arg\max_\theta\, p(D|\theta)\, p(\theta)$$

• An even simpler approximation is to just use the maximum likelihood estimate

$$\hat{\theta}_{MLE} = \arg\max_\theta\, p(D|\theta)$$

Why maximum likelihood?
• Recall that the KL divergence from the true distribution p to the approximation q is
$$KL(p\,||\,q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = \text{const} - \sum_x p(x) \log q(x)$$
• Let p be the empirical distribution
$$p_{emp}(x) = \frac{1}{n} \sum_{i=1}^n \delta(x - x_i)$$

ML = min KL to empirical
• The KL to the empirical distribution is
$$KL(p_{emp}\,||\,q) = C - \sum_x \Big[\frac{1}{n}\sum_i \delta(x - x_i)\Big] \log q(x) = C - \frac{1}{n} \sum_i \log q(x_i)$$
• Hence minimizing KL is equivalent to minimizing the average negative log likelihood on the training set.
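As a quick numerical check (a sketch, not from the slides), minimizing the average negative log likelihood of some hypothetical Bernoulli data over a grid of parameters recovers the closed-form MLE N1/N derived on the next slide:

```python
import numpy as np

data = np.array([1, 0, 1, 1, 0, 1, 0, 1])         # hypothetical coin flips, 1 = heads
thetas = np.linspace(0.001, 0.999, 999)            # grid of candidate parameters

# average negative log likelihood for each candidate theta
nll = -np.mean(data[:, None] * np.log(thetas)
               + (1 - data[:, None]) * np.log(1 - thetas), axis=0)

print("grid minimizer:", thetas[np.argmin(nll)])   # approx the closed-form MLE
print("closed-form MLE N1/N:", data.mean())
```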

Computing the Bernoulli MLE
• We maximize the log-likelihood
$$\ell(\theta) = N_1 \log\theta + N_0 \log(1 - \theta)$$
$$\frac{d\ell}{d\theta} = \frac{N_1}{\theta} - \frac{N - N_1}{1 - \theta} = 0 \;\Rightarrow\; \hat{\theta} = \frac{N_1}{N}$$
i.e. the empirical fraction of heads, e.g. 47/100.

Regularization
• Suppose we toss a coin N=3 times and see 3 tails. We would estimate the probability of heads as $\hat{\theta} = 0/3 = 0$.
• Intuitively, this seems unreasonable. Maybe we just haven't seen enough data yet (the sparse data problem).

• We can add pseudo counts C0 and C1 (e.g., 0.1) to the sufficient statistics N0 and N1 to get a better behaved estimate:
$$\hat{\theta} = \frac{N_1 + C_1}{N_0 + N_1 + C_0 + C_1}$$

• This is the MAP estimate using a Beta prior.
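A tiny sketch of the two estimators on the 3-tails example (the pseudo count value 0.1 follows the slide's suggestion):

```python
N1, N0 = 0, 3            # 3 tosses, all tails
C1, C0 = 0.1, 0.1        # pseudo counts

theta_mle = N1 / (N1 + N0)                       # = 0, the sparse data problem
theta_map = (N1 + C1) / (N0 + N1 + C0 + C1)      # pulled away from 0 by the prior
print(theta_mle, theta_map)
```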

MLE for the multinomial
• If x_n ∈ {1,…,K}, the likelihood is
$$P(D|\theta) \propto \prod_{n=1}^N \prod_{k=1}^K \theta_k^{I(x_n = k)} = \prod_k \theta_k^{\sum_n I(x_n = k)} = \prod_k \theta_k^{N_k}$$

• The N_k are the sufficient statistics.
• The log-likelihood is
$$\ell(\theta) = \sum_k N_k \log \theta_k$$

Computing the multinomial MLE
• We maximize ℓ(θ) subject to the constraint ∑_k θ_k = 1.
• We enforce the constraint using a Lagrange multiplier λ.

$$\tilde{\ell} = \sum_k N_k \log\theta_k + \lambda\Big(1 - \sum_k \theta_k\Big)$$
• Taking derivatives wrt θ_k:
$$\frac{\partial\tilde{\ell}}{\partial\theta_k} = \frac{N_k}{\theta_k} - \lambda = 0$$

• Taking derivatives wrt λ yields the constraint

$$\frac{\partial\tilde{\ell}}{\partial\lambda} = 1 - \sum_k \theta_k = 0$$

Computing the multinomial MLE
• Using the sum-to-one constraint, we get

$$N_k = \lambda\,\theta_k$$

$$\sum_k N_k = \lambda \sum_k \theta_k = \lambda \quad\Rightarrow\quad \hat{\theta}_k = \frac{N_k}{\sum_k N_k}$$
i.e. the empirical fraction of counts.
• Example: N1 = 100 spam, N2 = 10 urgent, N3 = 20 normal, θ = (100/130, 10/130, 20/130).
• Can add pseudo counts if some classes are rare.
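A small sketch of the multinomial MLE on the spam/urgent/normal counts, with an optional pseudo count (its value 1.0 is an arbitrary illustration):

```python
import numpy as np

counts = np.array([100, 10, 20])           # N_k for spam, urgent, normal
theta_mle = counts / counts.sum()           # (100/130, 10/130, 20/130)

pseudo = 1.0                                # hypothetical pseudo count per class
theta_smoothed = (counts + pseudo) / (counts + pseudo).sum()
print(theta_mle, theta_smoothed)
```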

Computing the Gaussian MLE
• The log likelihood is
$$p(D|\mu, \sigma^2) = \prod_{n=1}^N \mathcal{N}(x_n|\mu, \sigma^2) = \prod_n (2\pi\sigma^2)^{-\frac{1}{2}} \exp\Big(-\frac{1}{2\sigma^2}(x_n - \mu)^2\Big)$$
$$\ell(\mu, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^N (x_n - \mu)^2 - \frac{N}{2}\ln\sigma^2 - \frac{N}{2}\ln(2\pi)$$
• The MLE for the mean is the sample mean:
$$\frac{\partial\ell}{\partial\mu} = \frac{2}{2\sigma^2} \sum_n (x_n - \mu) = 0 \quad\Rightarrow\quad \hat{\mu} = \frac{1}{N}\sum_{n=1}^N x_n$$

Estimating σ
• The log likelihood is
$$\ell(\mu, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^N (x_n - \mu)^2 - \frac{N}{2}\ln\sigma^2 - \frac{N}{2}\ln(2\pi)$$
• The MLE for the variance is the sample variance (see handout for proof):
$$\frac{\partial\ell}{\partial\sigma^2} = \frac{1}{2}\sigma^{-4} \sum_n (x_n - \hat{\mu})^2 - \frac{N}{2\sigma^2} = 0$$
$$\hat{\sigma}^2_{ML} = \frac{1}{N} \sum_{n=1}^N (x_n - \hat{\mu})^2 = \frac{1}{N}\sum_n x_n^2 - (\hat{\mu})^2$$
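A brief sketch checking the closed-form Gaussian MLEs on synthetic data (the true parameters below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)    # hypothetical data with mu=2, sigma=1.5

mu_hat = x.mean()                                 # MLE for the mean
sigma2_ml = np.mean((x - mu_hat) ** 2)            # MLE for the variance, (1/N) sum (x_n - mu_hat)^2
sigma2_alt = np.mean(x ** 2) - mu_hat ** 2        # equivalent form (1/N) sum x_n^2 - mu_hat^2

print(mu_hat, sigma2_ml, sigma2_alt)              # the two variance expressions agree
```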

Sampling distribution
• MLE returns a point estimate θ̂(D).
• In frequentist (classical/orthodox) statistics, we treat D as random and θ as fixed, and ask how the estimate would change if D changed. This is called the sampling distribution of the estimator, $p(\hat{\theta}(D) \mid D \sim \theta)$.
• The sampling distribution is often approximately Gaussian.
• In Bayesian statistics, we treat D as fixed and θ as random, and model our uncertainty with the posterior p(θ|D).

Unbiased estimators
• The bias of an estimator is defined as
$$\text{bias}(\hat{\theta}) = E\big[\hat{\theta}(D) - \theta \mid D \sim \theta\big]$$
• An estimator is unbiased if bias = 0.
• Eg. the MLE for the Gaussian mean is unbiased:
$$E[\hat{\mu}] = E\Big[\frac{1}{N}\sum_{n=1}^N X_n\Big] = \frac{1}{N}\sum_n E[X_n] = \frac{1}{N} N\mu = \mu$$

Estimators for σ²
• The MLE for the Gaussian variance is biased (HW3):
$$E[\hat{\sigma}^2] = \frac{N-1}{N}\sigma^2$$
• It is common to use the following unbiased estimator instead:
$$\hat{\sigma}^2_{N-1} = \frac{N}{N-1}\hat{\sigma}^2$$
• This is unbiased:
$$E[\hat{\sigma}^2_{N-1}] = E\Big[\frac{N}{N-1}\hat{\sigma}^2\Big] = \frac{N}{N-1}\,\frac{N-1}{N}\sigma^2 = \sigma^2$$
• In Matlab, var(X) returns the unbiased estimate σ̂²_{N−1}, whereas var(X,1) returns the MLE σ̂².
• The MLE underestimates the variance (e.g., N=1 gives zero variance) since we use an estimated µ, which is shifted from the true µ towards the data (see HW3).
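A quick simulation sketch of this bias: averaging the two estimators over many synthetic datasets puts the ML estimate near (N−1)/N · σ² and the corrected one near σ². (NumPy's ddof=0 and ddof=1 play the roles of Matlab's var(X,1) and var(X).)

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma2 = 5, 4.0
trials = 100_000

X = rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))
var_ml = X.var(axis=1, ddof=0)         # divides by N      (like Matlab var(X,1))
var_unbiased = X.var(axis=1, ddof=1)   # divides by N - 1  (like Matlab var(X))

print(var_ml.mean(), (N - 1) / N * sigma2)   # ML estimator averages to about (N-1)/N * sigma^2
print(var_unbiased.mean(), sigma2)           # unbiased estimator averages to about sigma^2
```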

Is being unbiased enough?
• Consider the estimator
$$\tilde{\mu}(x_1, \ldots, x_N) = x_1$$
• This is unbiased:

$$E[\tilde{\mu}(X_1, \ldots, X_N)] = E[X_1] = \mu$$
• But intuitively it is unreasonable, since it will not improve no matter how many samples N we have.

Consistent estimators
• An estimator is consistent if it converges (in probability) to the true value with enough data:

$$P\big(|\hat{\theta}(D) - \theta| > \epsilon \mid D \sim \theta\big) \to 0 \;\text{ as } |D| \to \infty$$

• MLE is a consistent estimator.

Bias-variance tradeoff
• Being unbiased is not necessarily desirable! Suppose our loss function is the mean squared error
$$MSE = E\big[(\hat{\theta}(D) - \theta)^2 \mid D \sim \theta\big]$$
• To minimize MSE, we can trade off bias against variance. Define
$$\bar{\theta} = E[\hat{\theta}(D) \mid D \sim \theta]$$
• Then
$$E_D(\hat{\theta}(D) - \theta)^2 = E_D(\hat{\theta}(D) - \bar{\theta} + \bar{\theta} - \theta)^2$$
$$= E_D(\hat{\theta}(D) - \bar{\theta})^2 + 2(\bar{\theta} - \theta)\,E_D(\hat{\theta}(D) - \bar{\theta}) + (\bar{\theta} - \theta)^2$$
$$= E_D(\hat{\theta}(D) - \bar{\theta})^2 + (\bar{\theta} - \theta)^2 = V(\hat{\theta}) + \text{bias}^2(\hat{\theta})$$
where the cross term vanishes because $E_D(\hat{\theta}(D) - \bar{\theta}) = \bar{\theta} - \bar{\theta} = 0$.

We will frequently use biased estimators! (Not on exam)
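A simulation sketch of this decomposition, reusing the two variance estimators from above: for each, the empirical MSE matches its empirical variance plus squared bias.

```python
import numpy as np

rng = np.random.default_rng(1)
N, sigma2 = 5, 4.0
trials = 200_000

X = rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))

for ddof in (0, 1):                                   # 0 = ML estimator, 1 = unbiased estimator
    est = X.var(axis=1, ddof=ddof)
    mse = np.mean((est - sigma2) ** 2)                # empirical MSE
    bias = est.mean() - sigma2                        # empirical bias
    var = est.var()                                   # empirical variance of the estimator
    print(f"ddof={ddof}: MSE={mse:.4f}, variance + bias^2 = {var + bias ** 2:.4f}")
```

In this particular simulation the biased ML estimator attains the lower MSE, illustrating why biased estimators are often acceptable.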