
Lecture 24: Maximum Likelihood
Statistics 104, Colin Rundel, April 18, 2012 (deGroot 7.5, 7.6)

Likelihood

Last time, as part of discussing Bayesian inference, we defined $P(X|\theta)$ as the likelihood, which is the probability of the data $X$ given the model parameters $\theta$.

In the case where the observations are iid we saw that

$$f(x|\theta) = \prod_{i=1}^n f(x_i|\theta)$$

Our inference goal is still the same: given the observed data $(x_1, \ldots, x_n)$ we want to come up with an estimate $\hat\theta$ for the parameters $\theta$.
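As a concrete illustration of the product formula, here is a minimal sketch, assuming numpy and scipy are available; the standard normal model and the data values are made-up stand-ins.

```python
# Evaluate an iid likelihood as a product of per-observation densities:
# f(x|theta) = prod_i f(x_i|theta). The N(0, 1) model is a hypothetical choice.
import numpy as np
from scipy.stats import norm

x = np.array([0.3, -1.2, 0.8, 0.1])                # observed data (made up)
likelihood = np.prod(norm.pdf(x, loc=0, scale=1))  # product over observations
print(likelihood)
```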

Maximum Likelihood

The maximum likelihood approach is straightforward: if we don't know $\theta$, we might as well guess the value that maximizes the likelihood, or in other words pick the value of $\theta$ that has the greatest probability of producing the observed data.

To do this we construct a likelihood by considering the likelihood as a function of the parameter(s) given the data:

$$L(\theta|X) = f(x|\theta) = \prod_{i=1}^n f(x_i|\theta)$$

The maximum likelihood estimate (estimator), MLE, is therefore given by

$$\hat\theta_{MLE} = \arg\max_\theta \, L(\theta|X)$$

a definition illustrated numerically in the sketch after the list below.

Why use the MLE?

The maximum likelihood estimator has the following properties:

- Consistency - as the sample size tends to infinity, the MLE tends to the 'true' value of the parameter(s).
- Asymptotic normality - as the sample size increases, the distribution of the MLE tends to the normal distribution.
- Efficiency - as the sample size tends to infinity, no other unbiased estimator has a lower mean squared error.
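A minimal grid-search sketch of the arg-max definition, assuming numpy and scipy are available; the exponential model and the data values are hypothetical, chosen because its MLE (the sample mean) is easy to check against the grid result.

```python
# Grid-search sketch of theta_hat = argmax_theta L(theta|X).
# The Exp(scale=theta) model and the data values are hypothetical.
import numpy as np
from scipy.stats import expon

x = np.array([0.5, 1.7, 0.9, 2.3, 0.4])    # observed data (made up)
thetas = np.linspace(0.1, 5, 1000)         # candidate parameter values
L = np.array([np.prod(expon.pdf(x, scale=t)) for t in thetas])
print(thetas[np.argmax(L)], x.mean())      # both are approximately 1.16
```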

Example - Discrete Parameter

Assume that you have a box where you keep trick coins (biased towards heads or tails), and the labels have fallen off three of them: one comes up heads 25% of the time, one comes up heads 75% of the time, and one is fair (heads 50% of the time). If you pick one of the coins at random, flip it 100 times, and get 65 heads, which coin do you think it is?

$$\theta \in \{0.25, 0.5, 0.75\}$$

$$L(\theta|X = 65) = P(X = 65|\theta) = \binom{100}{65} \theta^{65} (1-\theta)^{35}$$

$$L(\theta = 0.25|X = 65) = \binom{100}{65} 0.25^{65} (1-0.25)^{35} \approx 0$$

$$L(\theta = 0.5|X = 65) = \binom{100}{65} 0.5^{65} (1-0.5)^{35} \approx 0.00086$$

$$L(\theta = 0.75|X = 65) = \binom{100}{65} 0.75^{65} (1-0.75)^{35} \approx 0.00702$$

Therefore, according to a maximum likelihood approach, you should label the coin as the 75% heads coin.

Example - Continuous Parameter Space

Consider the same situation, but the label has fallen off of only one coin and you have no idea how biased it might be. You once again flip the coin 100 times and get 65 heads. What would the MLE be for $\theta$?

$$\theta \in [0, 1]$$

$$L(\theta|X = 65) = P(X = 65|\theta) = \binom{100}{65} \theta^{65} (1-\theta)^{35}$$

$$\frac{d}{d\theta} L(\theta|X = 65) = \binom{100}{65} \left[ 65\,\theta^{64} (1-\theta)^{35} - 35\,\theta^{65} (1-\theta)^{34} \right]$$

$$0 = \theta^{64} (1-\theta)^{34} \left( 65(1-\theta) - 35\,\theta \right)$$

$$\theta \in \{0,\ 65/100,\ 1\}$$

Since the likelihood is 0 at $\theta = 0$ and $\theta = 1$,

$$\hat\theta_{MLE} = 65/100$$

Therefore, according to a maximum likelihood approach, you should label the coin as a 65% heads coin.
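Both coin examples can be checked numerically; a quick sketch, assuming numpy and scipy are available.

```python
# Numerical check of both coin examples using the binomial pmf.
import numpy as np
from scipy.stats import binom

heads, flips = 65, 100

# Discrete parameter space: evaluate the likelihood at each candidate coin.
for theta in (0.25, 0.5, 0.75):
    print(theta, binom.pmf(heads, flips, theta))
# theta = 0.75 gives the largest likelihood (~0.00702)

# Continuous parameter space: maximize over a fine grid of theta values.
grid = np.linspace(0, 1, 10001)
print(grid[np.argmax(binom.pmf(heads, flips, grid))])  # 0.65
```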

Maximum log-Likelihood

For the previous examples it was easy to write down the likelihood function and, in the case of the latter, take the derivative to find the maximum. For many common likelihoods this can be difficult; consider the case of $n$ observations from a normal distribution with known variance $\sigma^2$, where the likelihood function has the form

$$L(\mu|X, \sigma^2) = \left( \frac{1}{\sqrt{2\pi}\,\sigma} \right)^n \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \right]$$

Such problems can be simplified by instead maximizing the log-likelihood function, $\ell(\theta|X)$, which is equivalent since log is a monotone function.

$$\ell(\theta|X) = \log L(\theta|X) = \sum_{i=1}^n \log f(x_i|\theta)$$

$$\hat\theta_{MLE} = \arg\max_\theta \, L(\theta|X) = \arg\max_\theta \, \ell(\theta|X)$$

Example - Normal with known Variance

Let $X_1, \ldots, X_n \overset{iid}{\sim} N(\mu, \sigma^2)$; then the MLE of $\mu$ is found as follows:

$$L(\mu|X, \sigma^2) = \left( \frac{1}{\sqrt{2\pi}\,\sigma} \right)^n \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \right]$$

$$\ell(\mu|X, \sigma^2) = -\frac{n}{2} \log 2\pi - n \log \sigma - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2$$

$$= -\frac{n}{2} \log 2\pi - n \log \sigma - \frac{1}{2\sigma^2} \left( \sum_{i=1}^n x_i^2 - 2\mu \sum_{i=1}^n x_i + n\mu^2 \right)$$

$$\frac{d}{d\mu} \ell(\mu|X, \sigma^2) = -\frac{1}{2\sigma^2} \left( -2 \sum_{i=1}^n x_i + 2n\mu \right)$$

$$0 = \sum_{i=1}^n x_i - n\mu$$

$$\hat\mu_{MLE} = \frac{1}{n} \sum_{i=1}^n x_i$$
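A quick numerical confirmation that the sample mean maximizes this log-likelihood; a sketch assuming numpy and scipy are available, with made-up simulation parameters.

```python
# Verify mu_hat = mean(x) by numerically maximizing the normal log-likelihood
# with sigma known. The true parameters are made up for the simulation.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma = 2.0
x = rng.normal(5.0, sigma, size=500)

res = minimize_scalar(lambda mu: -norm.logpdf(x, loc=mu, scale=sigma).sum())
print(res.x, x.mean())  # the two values agree
```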

Example - Normal with known Mean

Let $X_1, \ldots, X_n \overset{iid}{\sim} N(\mu, \sigma^2)$; then the MLE of $\sigma^2$ is found as follows:

$$\ell(\sigma|X, \mu) = -\frac{n}{2} \log 2\pi - n \log \sigma - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2$$

$$\frac{d}{d\sigma} \ell(\sigma|X, \mu) = -\frac{n}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^n (x_i - \mu)^2$$

$$0 = -\frac{n}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^n (x_i - \mu)^2$$

$$\hat\sigma^2_{MLE} = \frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2$$

Example - Poisson

Let $X_1, \ldots, X_n \overset{iid}{\sim} \text{Pois}(\lambda)$; then the MLE of $\lambda$ is found as follows:

$$L(\lambda|X) = \prod_{i=1}^n \frac{e^{-\lambda} \lambda^{x_i}}{x_i!} = \frac{e^{-n\lambda} \lambda^{\sum_i x_i}}{\prod_i x_i!}$$

$$\ell(\lambda|X) = -n\lambda + \left( \sum_{i=1}^n x_i \right) \log \lambda - \sum_{i=1}^n \log x_i!$$

$$0 = \frac{d}{d\lambda} \ell(\lambda|X) = -n + \frac{\sum_{i=1}^n x_i}{\lambda}$$

$$\hat\lambda_{MLE} = \frac{\sum_{i=1}^n x_i}{n}$$
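Both closed forms are easy to verify numerically; a sketch assuming numpy and scipy are available, with made-up simulation parameters.

```python
# Check sigma^2_hat = mean((x - mu)^2) and lambda_hat = mean(y) by
# numerically maximizing each log-likelihood. Parameters are made up.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm, poisson

rng = np.random.default_rng(0)

# Normal with known mean.
mu = 2.0
x = rng.normal(mu, 1.5, size=1000)
res = minimize_scalar(lambda s: -norm.logpdf(x, loc=mu, scale=s).sum(),
                      bounds=(0.01, 10), method="bounded")
print(res.x**2, np.mean((x - mu) ** 2))  # the two values agree

# Poisson.
y = rng.poisson(3.0, size=1000)
res = minimize_scalar(lambda lam: -poisson.logpmf(y, lam).sum(),
                      bounds=(0.01, 20), method="bounded")
print(res.x, y.mean())  # the two values agree
```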


Example - Uniform - Open vs Closed

Let $X_1, \ldots, X_n \overset{iid}{\sim} \text{Unif}(0, \theta)$; then the MLE of $\theta$ is found as follows:

$$f(x|\theta) = \begin{cases} 1/\theta & \text{if } 0 \le x \le \theta \\ 0 & \text{otherwise} \end{cases}$$

$$L(\theta|X) = \frac{1}{\theta^n}$$

Clearly, $L(\theta|X)$ is a decreasing function of $\theta$; therefore to maximize $L$ we need to minimize $\theta$, but we have an additional constraint on $\theta$ from the likelihood: $0 \le x_i \le \theta$ for all $x_i$, which requires $\theta \ge \max(x_1, \ldots, x_n)$.

Therefore, $\hat\theta_{MLE} = \max(x_1, \ldots, x_n)$.

What would happen if we had defined the uniform on the open interval $(0, \theta)$?

Example - Uniform - Uniqueness

Let $X_1, \ldots, X_n \overset{iid}{\sim} \text{Unif}(\theta, \theta + 1)$; then the MLE of $\theta$ is found as follows:

$$f(x|\theta) = \begin{cases} 1 & \text{if } \theta \le x \le \theta + 1 \\ 0 & \text{otherwise} \end{cases}$$

Based on our previous result, maximizing $L$ does not depend on $\theta$, but we still have to choose a $\theta$ such that $\theta \le x_i \le \theta + 1$ for all $x_i$. From the lower bound it is clear that $\theta \le \min(x_1, \ldots, x_n)$, and from the upper bound $\theta \ge \max(x_1, \ldots, x_n) - 1$. Obviously, there are many potential values of $\theta$ that will satisfy these conditions, so the MLE is not unique.
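Both uniform results can be seen directly in a simulation; a sketch assuming numpy is available, with made-up true parameters.

```python
# Illustrate the two uniform MLE results on simulated data.
import numpy as np

rng = np.random.default_rng(1)

# Unif(0, theta): the MLE is the sample maximum.
x = rng.uniform(0, 4.0, size=100)
print(x.max())  # close to (and always below) the true theta = 4.0

# Unif(theta, theta + 1): every theta in [max(x) - 1, min(x)] attains the
# maximum likelihood, so the MLE is not unique.
y = rng.uniform(2.5, 3.5, size=100)
print(y.max() - 1, y.min())  # endpoints of the interval of maximizers
```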

Example - Gamma

Let $X_1, \ldots, X_n \overset{iid}{\sim} \text{Gamma}(k, 1)$; then the MLE of $k$ is found as follows:

$$f(x|k) = \frac{1}{\Gamma(k)} x^{k-1} e^{-x}$$

$$L(k|X) = \frac{1}{\Gamma(k)^n} \exp\left( -\sum_{i=1}^n x_i \right) \left( \prod_{i=1}^n x_i \right)^{k-1}$$

$$\ell(k|X) = -n \log \Gamma(k) - \sum_{i=1}^n x_i + (k-1) \sum_{i=1}^n \log x_i$$

$$0 = \frac{d}{dk} \ell(k|X) = -n \frac{\Gamma'(k)}{\Gamma(k)} + \sum_{i=1}^n \log x_i$$

$$\frac{\Gamma'(k)}{\Gamma(k)} = \frac{1}{n} \sum_{i=1}^n \log x_i$$

$$\hat k_{MLE} = ?$$

This equation has no closed-form solution, so $\hat k_{MLE}$ must be found numerically.
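The left-hand side is the digamma function $\psi(k) = \Gamma'(k)/\Gamma(k)$, so a root-finder can solve the equation; a sketch assuming numpy and scipy are available, with a made-up true shape of 3.

```python
# Solve digamma(k) = mean(log x) numerically to obtain k_hat.
import numpy as np
from scipy.optimize import brentq
from scipy.special import digamma

rng = np.random.default_rng(2)
x = rng.gamma(shape=3.0, scale=1.0, size=1000)

target = np.mean(np.log(x))
# digamma is increasing on (0, inf), so a simple bracket suffices.
k_hat = brentq(lambda k: digamma(k) - target, 1e-6, 100.0)
print(k_hat)  # close to the true shape parameter, 3.0
```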