
Lecture 24: Maximum Likelihood
Statistics 104, Colin Rundel, April 18, 2012 (deGroot 7.5, 7.6)

Likelihood

Last time, as part of discussing Bayesian inference, we defined $P(X|\theta)$ as the likelihood, which is the probability of the data $X$ given the model parameters $\theta$.

In the case where the observations are iid we saw that

$$f(x|\theta) = \prod_{i=1}^n f(x_i|\theta)$$

Our inference goal is still the same: given the observed data $(x_1, \ldots, x_n)$ we want to come up with an estimate $\hat\theta$ for the parameters $\theta$.
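As a concrete illustration of the product formula, here is a minimal sketch, assuming numpy and scipy are available; the standard normal model and the data values are made-up stand-ins.

```python
# Evaluate an iid likelihood as a product of per-observation densities:
# f(x|theta) = prod_i f(x_i|theta). The N(0, 1) model is a hypothetical choice.
import numpy as np
from scipy.stats import norm

x = np.array([0.3, -1.2, 0.8, 0.1])                # observed data (made up)
likelihood = np.prod(norm.pdf(x, loc=0, scale=1))  # product over observations
print(likelihood)
```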

Maximum Likelihood

The maximum likelihood approach is straightforward: if we don't know $\theta$, we might as well guess the value that maximizes the likelihood, or in other words pick the value of $\theta$ that has the greatest probability of producing the observed data.

To do this we construct a likelihood by considering the likelihood as a function of the parameter(s) given the data:

$$L(\theta|X) = f(x|\theta) = \prod_{i=1}^n f(x_i|\theta)$$

The maximum likelihood estimate (estimator), MLE, is therefore given by

$$\hat\theta_{MLE} = \arg\max_\theta \, L(\theta|X)$$

a definition illustrated numerically in the sketch after the list below.

Why use the MLE?

The maximum likelihood estimator has the following properties:

- Consistency - as the sample size tends to infinity, the MLE tends to the 'true' value of the parameter(s).
- Asymptotic normality - as the sample size increases, the distribution of the MLE tends to the normal distribution.
- Efficiency - as the sample size tends to infinity, no other unbiased estimator has a lower mean squared error.
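A minimal grid-search sketch of the arg-max definition, assuming numpy and scipy are available; the exponential model and the data values are hypothetical, chosen because its MLE (the sample mean) is easy to check against the grid result.

```python
# Grid-search sketch of theta_hat = argmax_theta L(theta|X).
# The Exp(scale=theta) model and the data values are hypothetical.
import numpy as np
from scipy.stats import expon

x = np.array([0.5, 1.7, 0.9, 2.3, 0.4])    # observed data (made up)
thetas = np.linspace(0.1, 5, 1000)         # candidate parameter values
L = np.array([np.prod(expon.pdf(x, scale=t)) for t in thetas])
print(thetas[np.argmax(L)], x.mean())      # both are approximately 1.16
```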

Example - Discrete Parameter

Assume that you have a box where you keep trick coins (biased towards heads or tails), and the labels have fallen off three of them: one comes up heads 25% of the time, one comes up heads 75% of the time, and one is fair (heads 50% of the time). If you pick one of the coins at random, flip it 100 times, and get 65 heads, which coin do you think it is?

$$\theta \in \{0.25, 0.5, 0.75\}$$

$$L(\theta|X = 65) = P(X = 65|\theta) = \binom{100}{65} \theta^{65} (1-\theta)^{35}$$

$$L(\theta = 0.25|X = 65) = \binom{100}{65} 0.25^{65} (1-0.25)^{35} \approx 0$$

$$L(\theta = 0.5|X = 65) = \binom{100}{65} 0.5^{65} (1-0.5)^{35} \approx 0.00086$$

$$L(\theta = 0.75|X = 65) = \binom{100}{65} 0.75^{65} (1-0.75)^{35} \approx 0.00702$$

Therefore, according to a maximum likelihood approach, you should label the coin as the 75% heads coin.

Example - Continuous Parameter Space

Consider the same situation, but the label has fallen off of only one coin and you have no idea how biased it might be. You once again flip the coin 100 times and get 65 heads. What would the MLE be for $\theta$?

$$\theta \in [0, 1]$$

$$L(\theta|X = 65) = P(X = 65|\theta) = \binom{100}{65} \theta^{65} (1-\theta)^{35}$$

$$\frac{d}{d\theta} L(\theta|X = 65) = \binom{100}{65} \left[ 65\,\theta^{64} (1-\theta)^{35} - 35\,\theta^{65} (1-\theta)^{34} \right]$$

$$0 = \theta^{64} (1-\theta)^{34} \left( 65(1-\theta) - 35\,\theta \right)$$

$$\theta \in \{0,\ 65/100,\ 1\}$$

Since the likelihood is 0 at $\theta = 0$ and $\theta = 1$,

$$\hat\theta_{MLE} = 65/100$$

Therefore, according to a maximum likelihood approach, you should label the coin as a 65% heads coin.
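Both coin examples can be checked numerically; a quick sketch, assuming numpy and scipy are available.

```python
# Numerical check of both coin examples using the binomial pmf.
import numpy as np
from scipy.stats import binom

heads, flips = 65, 100

# Discrete parameter space: evaluate the likelihood at each candidate coin.
for theta in (0.25, 0.5, 0.75):
    print(theta, binom.pmf(heads, flips, theta))
# theta = 0.75 gives the largest likelihood (~0.00702)

# Continuous parameter space: maximize over a fine grid of theta values.
grid = np.linspace(0, 1, 10001)
print(grid[np.argmax(binom.pmf(heads, flips, grid))])  # 0.65
```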

Maximum log-Likelihood

For the previous examples it was easy to write down the likelihood function and, in the case of the latter, take the derivative to find the maximum. For many common likelihoods this can be difficult; consider the case of $n$ observations from a normal distribution with known variance $\sigma^2$, where the likelihood function has the form

$$L(\mu|X, \sigma^2) = \left( \frac{1}{\sqrt{2\pi}\,\sigma} \right)^n \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \right]$$

Such problems can be simplified by instead maximizing the log-likelihood function, $\ell(\theta|X)$, which is equivalent since log is a monotone function.

$$\ell(\theta|X) = \log L(\theta|X) = \sum_{i=1}^n \log f(x_i|\theta)$$

$$\hat\theta_{MLE} = \arg\max_\theta \, L(\theta|X) = \arg\max_\theta \, \ell(\theta|X)$$

Example - Normal with known Variance

Let $X_1, \ldots, X_n \overset{iid}{\sim} N(\mu, \sigma^2)$; then the MLE of $\mu$ is found as follows:

$$L(\mu|X, \sigma^2) = \left( \frac{1}{\sqrt{2\pi}\,\sigma} \right)^n \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \right]$$

$$\ell(\mu|X, \sigma^2) = -\frac{n}{2} \log 2\pi - n \log \sigma - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2$$

$$= -\frac{n}{2} \log 2\pi - n \log \sigma - \frac{1}{2\sigma^2} \left( \sum_{i=1}^n x_i^2 - 2\mu \sum_{i=1}^n x_i + n\mu^2 \right)$$

$$\frac{d}{d\mu} \ell(\mu|X, \sigma^2) = -\frac{1}{2\sigma^2} \left( -2 \sum_{i=1}^n x_i + 2n\mu \right)$$

$$0 = \sum_{i=1}^n x_i - n\mu$$

$$\hat\mu_{MLE} = \frac{1}{n} \sum_{i=1}^n x_i$$
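A quick numerical confirmation that the sample mean maximizes this log-likelihood; a sketch assuming numpy and scipy are available, with made-up simulation parameters.

```python
# Verify mu_hat = mean(x) by numerically maximizing the normal log-likelihood
# with sigma known. The true parameters are made up for the simulation.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma = 2.0
x = rng.normal(5.0, sigma, size=500)

res = minimize_scalar(lambda mu: -norm.logpdf(x, loc=mu, scale=sigma).sum())
print(res.x, x.mean())  # the two values agree
```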

Example - Normal with known Mean

Let $X_1, \ldots, X_n \overset{iid}{\sim} N(\mu, \sigma^2)$; then the MLE of $\sigma^2$ is found as follows:

$$\ell(\sigma|X, \mu) = -\frac{n}{2} \log 2\pi - n \log \sigma - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2$$

$$\frac{d}{d\sigma} \ell(\sigma|X, \mu) = -\frac{n}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^n (x_i - \mu)^2$$

$$0 = -\frac{n}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^n (x_i - \mu)^2$$

$$\hat\sigma^2_{MLE} = \frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2$$

Example - Poisson

Let $X_1, \ldots, X_n \overset{iid}{\sim} \text{Pois}(\lambda)$; then the MLE of $\lambda$ is found as follows:

$$L(\lambda|X) = \prod_{i=1}^n \frac{e^{-\lambda} \lambda^{x_i}}{x_i!} = \frac{e^{-n\lambda} \lambda^{\sum_i x_i}}{\prod_i x_i!}$$

$$\ell(\lambda|X) = -n\lambda + \left( \sum_{i=1}^n x_i \right) \log \lambda - \sum_{i=1}^n \log x_i!$$

$$0 = \frac{d}{d\lambda} \ell(\lambda|X) = -n + \frac{\sum_{i=1}^n x_i}{\lambda}$$

$$\hat\lambda_{MLE} = \frac{\sum_{i=1}^n x_i}{n}$$
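Both closed forms are easy to verify numerically; a sketch assuming numpy and scipy are available, with made-up simulation parameters.

```python
# Check sigma^2_hat = mean((x - mu)^2) and lambda_hat = mean(y) by
# numerically maximizing each log-likelihood. Parameters are made up.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm, poisson

rng = np.random.default_rng(0)

# Normal with known mean.
mu = 2.0
x = rng.normal(mu, 1.5, size=1000)
res = minimize_scalar(lambda s: -norm.logpdf(x, loc=mu, scale=s).sum(),
                      bounds=(0.01, 10), method="bounded")
print(res.x**2, np.mean((x - mu) ** 2))  # the two values agree

# Poisson.
y = rng.poisson(3.0, size=1000)
res = minimize_scalar(lambda lam: -poisson.logpmf(y, lam).sum(),
                      bounds=(0.01, 20), method="bounded")
print(res.x, y.mean())  # the two values agree
```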


Example - Uniform - Open vs Closed

Let $X_1, \ldots, X_n \overset{iid}{\sim} \text{Unif}(0, \theta)$; then the MLE of $\theta$ is found as follows:

$$f(x|\theta) = \begin{cases} 1/\theta & \text{if } 0 \le x \le \theta \\ 0 & \text{otherwise} \end{cases}$$

$$L(\theta|X) = \frac{1}{\theta^n}$$

Clearly, $L(\theta|X)$ is a decreasing function of $\theta$; therefore to maximize $L$ we need to minimize $\theta$, but we have an additional constraint on $\theta$ from the likelihood: $0 \le x_i \le \theta$ for all $x_i$, which requires $\theta \ge \max(x_1, \ldots, x_n)$.

Therefore, $\hat\theta_{MLE} = \max(x_1, \ldots, x_n)$.

What would happen if we had defined the uniform on the open interval $(0, \theta)$?

Example - Uniform - Uniqueness

Let $X_1, \ldots, X_n \overset{iid}{\sim} \text{Unif}(\theta, \theta + 1)$; then the MLE of $\theta$ is found as follows:

$$f(x|\theta) = \begin{cases} 1 & \text{if } \theta \le x \le \theta + 1 \\ 0 & \text{otherwise} \end{cases}$$

Based on our previous result, maximizing $L$ does not depend on $\theta$, but we still have to choose a $\theta$ such that $\theta \le x_i \le \theta + 1$ for all $x_i$. From the lower bound it is clear that $\theta \le \min(x_1, \ldots, x_n)$, and from the upper bound $\theta \ge \max(x_1, \ldots, x_n) - 1$. Obviously, there are many potential values of $\theta$ that will satisfy these conditions, so the MLE is not unique.
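Both uniform results can be seen directly in a simulation; a sketch assuming numpy is available, with made-up true parameters.

```python
# Illustrate the two uniform MLE results on simulated data.
import numpy as np

rng = np.random.default_rng(1)

# Unif(0, theta): the MLE is the sample maximum.
x = rng.uniform(0, 4.0, size=100)
print(x.max())  # close to (and always below) the true theta = 4.0

# Unif(theta, theta + 1): every theta in [max(x) - 1, min(x)] attains the
# maximum likelihood, so the MLE is not unique.
y = rng.uniform(2.5, 3.5, size=100)
print(y.max() - 1, y.min())  # endpoints of the interval of maximizers
```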

Example - Gamma

Let $X_1, \ldots, X_n \overset{iid}{\sim} \text{Gamma}(k, 1)$; then the MLE of $k$ is found as follows:

$$f(x|k) = \frac{1}{\Gamma(k)} x^{k-1} e^{-x}$$

$$L(k|X) = \frac{1}{\Gamma(k)^n} \exp\left( -\sum_{i=1}^n x_i \right) \left( \prod_{i=1}^n x_i \right)^{k-1}$$

$$\ell(k|X) = -n \log \Gamma(k) - \sum_{i=1}^n x_i + (k-1) \sum_{i=1}^n \log x_i$$

$$0 = \frac{d}{dk} \ell(k|X) = -n \frac{\Gamma'(k)}{\Gamma(k)} + \sum_{i=1}^n \log x_i$$

$$\frac{\Gamma'(k)}{\Gamma(k)} = \frac{1}{n} \sum_{i=1}^n \log x_i$$

$$\hat k_{MLE} = ?$$

This equation has no closed-form solution, so $\hat k_{MLE}$ must be found numerically.
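The left-hand side is the digamma function $\psi(k) = \Gamma'(k)/\Gamma(k)$, so a root-finder can solve the equation; a sketch assuming numpy and scipy are available, with a made-up true shape of 3.

```python
# Solve digamma(k) = mean(log x) numerically to obtain k_hat.
import numpy as np
from scipy.optimize import brentq
from scipy.special import digamma

rng = np.random.default_rng(2)
x = rng.gamma(shape=3.0, scale=1.0, size=1000)

target = np.mean(np.log(x))
# digamma is increasing on (0, inf), so a simple bracket suffices.
k_hat = brentq(lambda k: digamma(k) - target, 1e-6, 100.0)
print(k_hat)  # close to the true shape parameter, 3.0
```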