
Maximum Likelihood vs. Bayesian Estimation

Ronald J. Williams, CSG 220, Spring 2007
Contains numerous slides downloaded from www.cs.huji.ac.il/course/2003/pmai/tirguls/tirgul10.ppt (apparently authored by Nir Friedman)


Example: Binomial Experiment (Statistics 101)

(Figure: a tossed thumbtack, with its two landing positions labeled Head and Tail)

- When tossed, it can land in one of two positions: Head or Tail
- We denote by θ the (unknown) probability P(H)

Estimation task:
- Given a sequence of toss samples x[1], x[2], …, x[M], we want to estimate the probabilities P(H) = θ and P(T) = 1 − θ

Statistical Parameter Fitting

- Consider instances x[1], x[2], …, x[M] such that:
  - The set of values that x can take is known
  - Each is sampled from the same distribution
  - Each is sampled independently of the rest
  (i.e., the instances are i.i.d. samples)

- Here we focus on multinomial distributions
  - Only finitely many possible values for x
  - Special case: binomial, with values H(ead) and T(ail)


The Likelihood Function

- How good is a particular θ? It depends on how likely it is to generate the observed data D:
  $L(\theta : D) = P(D \mid \theta) = \prod_m P(x[m] \mid \theta)$

- The likelihood for the sequence H, T, T, H, H is
  $L(\theta : D) = \theta \cdot (1 - \theta) \cdot (1 - \theta) \cdot \theta \cdot \theta$

(Figure: plot of L(θ : D) as a function of θ over [0, 1])
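To make this concrete, here is a minimal Python sketch (mine, not from the slides) that evaluates the likelihood of the example sequence H, T, T, H, H at a few values of θ; the grid of values is an arbitrary choice:

```python
# Likelihood of an i.i.d. binomial sample as a function of theta.
def likelihood(theta, tosses):
    """L(theta : D) = product over tosses of P(x | theta)."""
    p = 1.0
    for x in tosses:
        p *= theta if x == "H" else (1.0 - theta)
    return p

tosses = ["H", "T", "T", "H", "H"]
for theta in [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]:
    print(f"theta={theta:.1f}  L={likelihood(theta, tosses):.4f}")
```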

Maximum Likelihood Estimation

MLE Principle: Choose the parameter values that maximize the likelihood function

- This is one of the most commonly used estimators in statistics

- Intuitively appealing


Example: MLE in Binomial Data

- It can be shown that the MLE for the probability of heads is given by
  $\hat{\theta} = \dfrac{N_H}{N_H + N_T}$
  (which coincides with what one would expect). We prove this after the next slide.

- Example: $(N_H, N_T) = (3, 2)$, so the MLE estimate is 3/5 = 0.6

(Figure: plot of L(θ : D) over [0, 1], with its maximum at θ = 0.6)
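As a quick cross-check (my sketch, not part of the slides), the count-ratio formula can be compared against a brute-force maximization of the likelihood over a grid of θ values; the grid resolution is arbitrary:

```python
# MLE for the binomial: closed form vs. brute-force grid search.
def likelihood(theta, n_heads, n_tails):
    return theta ** n_heads * (1.0 - theta) ** n_tails

n_heads, n_tails = 3, 2

theta_mle = n_heads / (n_heads + n_tails)  # closed form: N_H / (N_H + N_T)
grid = [i / 1000 for i in range(1001)]
theta_grid = max(grid, key=lambda t: likelihood(t, n_heads, n_tails))

print(theta_mle, theta_grid)  # both ~0.6
```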

From Binomial to Multinomial

- For example, suppose X can have the values 1, 2, …, K
- We want to learn the parameters θ1, θ2, …, θK

Observations:
- N1, N2, …, NK: the number of times each outcome is observed

Likelihood function:
  $L(\Theta : D) = \prod_{k=1}^{K} \theta_k^{N_k}$

MLE:
  $\hat{\theta}_k = \dfrac{N_k}{\sum_l N_l}$
  (we prove this on the next several slides; a small code sketch follows)
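A minimal sketch of this estimator (the function name is my own invention): the multinomial MLE is just the vector of relative frequencies.

```python
from collections import Counter

def multinomial_mle(samples):
    """MLE theta_k = N_k / sum_l N_l, i.e., relative frequencies."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {k: n / total for k, n in counts.items()}

print(multinomial_mle([1, 2, 2, 3, 1, 1]))  # {1: 0.5, 2: 0.333..., 3: 0.166...}
```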

MLE for Multinomial

Theorem: For the multinomial distribution, the MLE for the probability P(x = k) is given by
  $\hat{\theta}_k = \dfrac{N_k}{\sum_l N_l}$

Proof: The likelihood function is
  $L(\Theta : D) = \prod_{k=1}^{K} \theta_k^{N_k}$
To maximize it, it is equivalent to maximize the log-likelihood
  $LL(\theta_1, \theta_2, \ldots, \theta_K) = \ln L = \sum_l N_l \ln \theta_l$
But we must impose the constraints $\sum_l \theta_l = 1$ and $\theta_l \geq 0 \ \forall l$.

MLE for Multinomial (cont.)

We use the method of Lagrange multipliers. Since there is one constraint equation, we introduce one Lagrange multiplier λ.

We want to find $\theta_1, \theta_2, \ldots, \theta_K$ and λ so that the Lagrangian function
  $G(\theta_1, \ldots, \theta_K; \lambda) = LL(\theta_1, \ldots, \theta_K) - \lambda \left( \sum_l \theta_l - 1 \right)$
attains a maximum as the θk values vary (and a minimum as λ varies).


MLE for Multinomial (cont.)

- Take partial derivatives:
  $\dfrac{\partial G}{\partial \theta_k} = \dfrac{N_k}{\theta_k} - \lambda \ \forall k, \qquad \dfrac{\partial G}{\partial \lambda} = 1 - \sum_l \theta_l$

- Equate to zero and rearrange:
  $\theta_k = \dfrac{N_k}{\lambda} \ \forall k, \qquad \sum_l \theta_l = 1$

- Thus $\theta_k \propto N_k \ \forall k$.


MLE for Multinomial (cont.)

- Normalizing so the probabilities sum to 1 yields
  $\theta_k = \dfrac{N_k}{\sum_l N_l} \ \forall k$

- To see that this is a maximum as the θk values vary, it's sufficient to observe that the second partial derivatives of G satisfy
  $\dfrac{\partial^2 G}{\partial \theta_i \partial \theta_j} = 0 \ \forall i \neq j, \qquad \dfrac{\partial^2 G}{\partial \theta_i^2} = -\dfrac{N_i}{\theta_i^2} < 0 \ \forall i$
  (a numerical sanity check follows)
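Here is a small numerical sanity check of the theorem (my own sketch; the counts and number of random trials are arbitrary): the count-ratio solution should achieve at least as high a log-likelihood as any other point on the probability simplex.

```python
import math
import random

def log_likelihood(theta, counts):
    return sum(n * math.log(t) for n, t in zip(counts, theta))

counts = [5, 3, 2]
total = sum(counts)
mle = [n / total for n in counts]  # theta_k = N_k / sum_l N_l

best_random = -float("inf")
for _ in range(10_000):
    # A random point on the simplex (normalized i.i.d. exponentials).
    raw = [random.expovariate(1.0) for _ in counts]
    s = sum(raw)
    best_random = max(best_random, log_likelihood([r / s for r in raw], counts))

print(log_likelihood(mle, counts) >= best_random)  # True
```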


Is MLE all we need?

- Suppose that after 10 observations,
  - ML estimates P(H) = 0.7 for the thumbtack
  - Would you bet on heads for the next toss?

- Suppose now that after 10 observations,
  - ML estimates P(H) = 0.7 for a coin
  - Would you place the same bet?



Frequentist Approach:
- Assumes there is an unknown but fixed parameter θ
- Estimates θ with some confidence
- Prediction by using the estimated parameter value

Bayesian Approach:
- Represents uncertainty about the unknown parameter
- Uses probability to quantify this uncertainty: unknown parameters are treated as random variables
- Prediction follows from the rules of probability: expectation over the unknown parameters


Example: Binomial Data Revisited

- Prior: uniform for θ in [0, 1], i.e., P(θ) = 1
- Then P(θ | D) is proportional to the likelihood L(θ : D):
  $P(\theta \mid x[1], \ldots, x[M]) \propto P(x[1], \ldots, x[M] \mid \theta) \cdot P(\theta)$

Example: $(N_H, N_T) = (4, 1)$
- MLE for P(X = H) is 4/5 = 0.8
- Bayesian prediction is
  $P(x[M+1] = H \mid D) = \int \theta \cdot P(\theta \mid D) \, d\theta = \dfrac{5}{7} = 0.7142\ldots$

(Figure: plot of the posterior P(θ | D) over [0, 1])
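To see where the 5/7 comes from, a small sketch (mine, not from the slides): under a uniform prior the predictive probability has the closed form (N_H + 1) / (N_H + N_T + 2), and a direct numerical integration of the posterior agrees with it.

```python
# Bayesian prediction for the binomial under a uniform prior on theta.
n_heads, n_tails = 4, 1

# Closed form (uniform prior = Dirichlet(1, 1)):
closed_form = (n_heads + 1) / (n_heads + n_tails + 2)

# Numerical check: P(H | D) = int theta * L(theta) dtheta / int L(theta) dtheta
steps = 100_000
num = den = 0.0
for i in range(1, steps):
    theta = i / steps
    lik = theta ** n_heads * (1 - theta) ** n_tails
    num += theta * lik
    den += lik

print(closed_form, num / den)  # both ~0.714285 (= 5/7)
```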

Bayesian Inference and MLE

- In our example, MLE and Bayesian prediction differ
- But…
  - If the prior is well-behaved (i.e., does not assign 0 density to any "feasible" parameter value),
  - then both MLE and Bayesian prediction converge to the same value as the amount of training data increases (see the simulation sketch below)
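A brief simulation sketch of this convergence (mine; the true bias, seed, and sample sizes are arbitrary choices):

```python
import random

random.seed(0)
true_p = 0.7
counts = [0, 0]  # [tails, heads]

for m in range(1, 10_001):
    counts[random.random() < true_p] += 1  # the bool indexes the list
    if m in (10, 100, 1_000, 10_000):
        mle = counts[1] / m
        bayes = (counts[1] + 1) / (m + 2)  # uniform (Dirichlet(1,1)) prior
        print(f"M={m:>5}  MLE={mle:.4f}  Bayes={bayes:.4f}")
```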


Dirichlet Priors

- Recall that the likelihood function is
  $L(\Theta : D) = \prod_{k=1}^{K} \theta_k^{N_k}$

- A Dirichlet prior with hyperparameters $\alpha_1, \ldots, \alpha_K$ is defined as
  $P(\Theta) \propto \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}$  (for legal θ1, …, θK)

- Then the posterior has the same form, with hyperparameters $\alpha_1 + N_1, \ldots, \alpha_K + N_K$:
  $P(\Theta \mid D) \propto P(\Theta) \, P(D \mid \Theta) \propto \prod_{k=1}^{K} \theta_k^{\alpha_k - 1} \prod_{k=1}^{K} \theta_k^{N_k} = \prod_{k=1}^{K} \theta_k^{\alpha_k + N_k - 1}$

Dirichlet Priors (cont.)

- We can compute the prediction on a new event in closed form:

- If P(Θ) is Dirichlet with hyperparameters $\alpha_1, \ldots, \alpha_K$, then
  $P(X[1] = k) = \int \theta_k \cdot P(\Theta) \, d\Theta = \dfrac{\alpha_k}{\sum_l \alpha_l}$
  (we won't prove this)

- Since the posterior is also Dirichlet, we get
  $P(X[M+1] = k \mid D) = \int \theta_k \cdot P(\Theta \mid D) \, d\Theta = \dfrac{\alpha_k + N_k}{\sum_l (\alpha_l + N_l)}$
  (a small helper sketch follows)
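A minimal helper implementing this posterior-predictive rule (the function name is my own; not from the slides):

```python
def dirichlet_predictive(alphas, counts):
    """P(X[M+1] = k | D) = (alpha_k + N_k) / sum_l (alpha_l + N_l)."""
    total = sum(a + n for a, n in zip(alphas, counts))
    return [(a + n) / total for a, n in zip(alphas, counts)]

# Uniform Dirichlet(1, 1) prior with the earlier (N_H, N_T) = (4, 1) data:
print(dirichlet_predictive([1, 1], [4, 1]))  # [0.7142..., 0.2857...]
```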

Dirichlet Priors: Example

(Figure: densities of the Dirichlet(0.5, 0.5), Dirichlet(1, 1), Dirichlet(2, 2), and Dirichlet(5, 5) priors, plotted for θ over [0, 1])


Prior Knowledge

- The hyperparameters $\alpha_1, \ldots, \alpha_K$ can be thought of as "imaginary" counts from our prior experience

- Equivalent sample size = $\alpha_1 + \ldots + \alpha_K$

- The larger the equivalent sample size, the more confident we are in our prior


Effect of Priors

Prediction of P(X = H) after seeing data with $N_H = 0.25 \cdot N_T$, for different sample sizes

(Figure: two panels plotting the predicted P(X = H) against the number of samples, 0 to 100: one panel with a fixed ratio $\alpha_H / \alpha_T$ and different prior strengths $\alpha_H + \alpha_T$, the other with a fixed strength $\alpha_H + \alpha_T$ and different ratios $\alpha_H / \alpha_T$)
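A sketch of the kind of computation behind this figure (my reconstruction; the prior strengths and batch structure are arbitrary choices), assuming the data arrive as one head per four tails:

```python
def predict(alpha_h, alpha_t, n_h, n_t):
    """Beta/Dirichlet posterior predictive for P(X = H)."""
    return (alpha_h + n_h) / (alpha_h + alpha_t + n_h + n_t)

# Data with N_H = 0.25 * N_T: one head per four tails.
for batches in (1, 4, 20):
    n_h, n_t = batches, 4 * batches
    # Fixed ratio (1:1) with different prior strengths alpha_H + alpha_T:
    row = [predict(s / 2, s / 2, n_h, n_t) for s in (2, 10, 50)]
    print(n_h + n_t, [f"{p:.3f}" for p in row])
```

Stronger priors hold the prediction near 0.5 longer; all settings eventually approach the empirical frequency 0.2.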


Effect of Priors (cont.)

- In real data, Bayesian estimates are less sensitive to noise in the data

(Figure: P(X = 1 | D) plotted against the number of tosses N, from 5 to 50, for the MLE and for Dirichlet(0.5, 0.5), Dirichlet(1, 1), Dirichlet(5, 5), and Dirichlet(10, 10) priors, with the toss results shown along the N axis)
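A small simulation in the same spirit (mine; the seed, true bias, and prior are arbitrary), comparing how much the MLE and a Dirichlet(5, 5) estimate fluctuate on a short toss sequence:

```python
import random

random.seed(1)
true_p = 0.2
n_h = 0

for n in range(1, 26):
    n_h += random.random() < true_p  # the bool adds 0 or 1
    mle = n_h / n
    bayes = (n_h + 5) / (n + 10)  # Dirichlet(5, 5) prior
    if n % 5 == 0:
        print(f"N={n:>2}  MLE={mle:.3f}  Dirichlet(5,5)={bayes:.3f}")
```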

One reason to prefer the Bayesian method

- If any value fails to occur in the training data, the MLE for the corresponding probability will be zero
- But even with a uniform prior, the Bayesian estimate for this same probability will be non-zero
- Probability estimates of zero can have very bad effects on just about any learning algorithm (see the sketch below)
  - We only want zero probability estimates when non-occurrence of an event is justified by prior belief
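A minimal illustration (my example, not from the slides) of the unseen-outcome problem and how a uniform Dirichlet(1, …, 1) prior avoids it:

```python
from collections import Counter

samples = ["a", "a", "b", "a", "b"]  # outcome "c" never occurs
outcomes = ["a", "b", "c"]
counts = Counter(samples)
m = len(samples)

mle = {k: counts[k] / m for k in outcomes}
bayes = {k: (counts[k] + 1) / (m + len(outcomes)) for k in outcomes}

print(mle)    # {'a': 0.6, 'b': 0.4, 'c': 0.0}     <- zero for the unseen outcome
print(bayes)  # {'a': 0.5, 'b': 0.375, 'c': 0.125}  <- non-zero everywhere
```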

