Lecture 13 — Maximum Likelihood Estimation

STATS 200: Introduction to Statistical Inference, Autumn 2016

Last lecture, we introduced the method of moments for estimating one or more parameters $\theta$ in a parametric model. This lecture, we discuss a different method called maximum likelihood estimation. The focus of this lecture will be on how to compute this estimate; subsequent lectures will study its statistical properties.

13.1 Maximum likelihood estimation

Consider data $X_1, \ldots, X_n \overset{\text{IID}}{\sim} f(x \mid \theta)$, for a parametric model $\{f(x \mid \theta) : \theta \in \Omega\}$. Given the observed values $X_1, \ldots, X_n$ of the data, the function
\[
\mathrm{lik}(\theta) = f(X_1 \mid \theta) \times \cdots \times f(X_n \mid \theta)
\]
of the parameter $\theta$ is called the likelihood function. If $f(x \mid \theta)$ is the PMF of a discrete distribution, then $\mathrm{lik}(\theta)$ is simply the probability of observing the values $X_1, \ldots, X_n$ if the true parameter were $\theta$. The maximum likelihood estimator (MLE) of $\theta$ is the value of $\theta \in \Omega$ that maximizes $\mathrm{lik}(\theta)$. Intuitively, it is the value of $\theta$ that makes the observed data "most probable" or "most likely".

The idea of maximum likelihood is related to the use of the likelihood ratio statistic in the Neyman-Pearson lemma. Recall that for testing
\[
H_0 : (X_1, \ldots, X_n) \sim g, \qquad H_1 : (X_1, \ldots, X_n) \sim h,
\]
where $g$ and $h$ are joint PDFs or PMFs for $n$ random variables, the most powerful test rejects for small values of the likelihood ratio
\[
L(X_1, \ldots, X_n) = \frac{g(X_1, \ldots, X_n)}{h(X_1, \ldots, X_n)}.
\]
In the context of a parametric model, we may consider testing $H_0 : X_1, \ldots, X_n \overset{\text{IID}}{\sim} f(x \mid \theta_0)$ versus $H_1 : X_1, \ldots, X_n \overset{\text{IID}}{\sim} f(x \mid \theta_1)$, for two different parameter values $\theta_0, \theta_1 \in \Omega$. Then
\[
g(X_1, \ldots, X_n) = f(X_1 \mid \theta_0) \times \cdots \times f(X_n \mid \theta_0), \qquad
h(X_1, \ldots, X_n) = f(X_1 \mid \theta_1) \times \cdots \times f(X_n \mid \theta_1),
\]
so the likelihood ratio is exactly $\mathrm{lik}(\theta_0)/\mathrm{lik}(\theta_1)$. The MLE (if it exists and is unique) is the value of $\theta \in \Omega$ for which $\mathrm{lik}(\theta)/\mathrm{lik}(\theta') > 1$ for any other value $\theta' \in \Omega$.

13.2 Examples

Computing the MLE is an optimization problem. Maximizing $\mathrm{lik}(\theta)$ is equivalent to maximizing its (natural) logarithm
\[
l(\theta) = \log(\mathrm{lik}(\theta)) = \sum_{i=1}^n \log f(X_i \mid \theta),
\]
which in many examples is easier to work with, as it involves a sum rather than a product. Let's work through several examples.

Example 13.1. Let $X_1, \ldots, X_n \overset{\text{IID}}{\sim} \text{Poisson}(\lambda)$. Then
\[
l(\lambda) = \sum_{i=1}^n \log \frac{\lambda^{X_i} e^{-\lambda}}{X_i!}
= \sum_{i=1}^n \left( X_i \log \lambda - \lambda - \log(X_i!) \right)
= (\log \lambda) \sum_{i=1}^n X_i - n\lambda - \sum_{i=1}^n \log(X_i!).
\]
This is differentiable in $\lambda$, so we maximize $l(\lambda)$ by setting its first derivative equal to 0:
\[
0 = l'(\lambda) = \frac{1}{\lambda} \sum_{i=1}^n X_i - n.
\]
Solving for $\lambda$ yields the estimate $\hat{\lambda} = \bar{X}$. Since $l(\lambda) \to -\infty$ as $\lambda \to 0$ or $\lambda \to \infty$, and since $\hat{\lambda} = \bar{X}$ is the unique value for which $0 = l'(\lambda)$, this must be the maximum of $l$. In this example, $\hat{\lambda}$ is the same as the method-of-moments estimate.

Example 13.2. Let $X_1, \ldots, X_n \overset{\text{IID}}{\sim} \mathcal{N}(\mu, \sigma^2)$. Then
\[
l(\mu, \sigma^2) = \sum_{i=1}^n \log \left( \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(X_i - \mu)^2}{2\sigma^2}} \right)
= \sum_{i=1}^n \left( -\frac{1}{2} \log(2\pi\sigma^2) - \frac{(X_i - \mu)^2}{2\sigma^2} \right)
= -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (X_i - \mu)^2.
\]
Considering $\sigma^2$ (rather than $\sigma$) as the parameter, we maximize $l(\mu, \sigma^2)$ by setting its partial derivatives with respect to $\mu$ and $\sigma^2$ equal to 0:
\[
0 = \frac{\partial l}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^n (X_i - \mu), \qquad
0 = \frac{\partial l}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^n (X_i - \mu)^2.
\]
Solving the first equation yields $\hat{\mu} = \bar{X}$, and substituting this into the second equation yields $\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2$. Since $l(\mu, \sigma^2) \to -\infty$ as $\mu \to -\infty$, $\mu \to \infty$, $\sigma^2 \to 0$, or $\sigma^2 \to \infty$, and as $(\hat{\mu}, \hat{\sigma}^2)$ is the unique value for which $0 = \frac{\partial l}{\partial \mu}$ and $0 = \frac{\partial l}{\partial \sigma^2}$, this must be the maximum of $l$. Again, the MLEs are the same as the method-of-moments estimates.

Example 13.3. Let $X_1, \ldots, X_n \overset{\text{IID}}{\sim} \text{Gamma}(\alpha, \beta)$. Then
\[
l(\alpha, \beta) = \sum_{i=1}^n \log \left( \frac{\beta^\alpha}{\Gamma(\alpha)} X_i^{\alpha - 1} e^{-\beta X_i} \right)
= \sum_{i=1}^n \left( \alpha \log \beta - \log \Gamma(\alpha) + (\alpha - 1) \log X_i - \beta X_i \right)
= n\alpha \log \beta - n \log \Gamma(\alpha) + (\alpha - 1) \sum_{i=1}^n \log X_i - \beta \sum_{i=1}^n X_i.
\]
To maximize $l(\alpha, \beta)$, we set its partial derivatives equal to 0:
\[
0 = \frac{\partial l}{\partial \alpha} = n \log \beta - \frac{n \Gamma'(\alpha)}{\Gamma(\alpha)} + \sum_{i=1}^n \log X_i, \qquad
0 = \frac{\partial l}{\partial \beta} = \frac{n\alpha}{\beta} - \sum_{i=1}^n X_i.
\]
The second equation implies that the MLEs $\hat{\alpha}$ and $\hat{\beta}$ satisfy $\hat{\beta} = \hat{\alpha}/\bar{X}$. Substituting into the first equation and dividing by $n$, $\hat{\alpha}$ satisfies
\[
0 = \log \hat{\alpha} - \frac{\Gamma'(\hat{\alpha})}{\Gamma(\hat{\alpha})} - \log \bar{X} + \frac{1}{n} \sum_{i=1}^n \log X_i. \tag{13.1}
\]
The function $f(\alpha) = \log \alpha - \frac{\Gamma'(\alpha)}{\Gamma(\alpha)}$ decreases from $\infty$ to 0 as $\alpha$ increases from 0 to $\infty$, and the value $-\log \bar{X} + \frac{1}{n} \sum_{i=1}^n \log X_i$ is always negative (by Jensen's inequality); hence (13.1) always has a single unique root $\hat{\alpha}$, which is the MLE for $\alpha$. The MLE for $\beta$ is then $\hat{\beta} = \hat{\alpha}/\bar{X}$. Unfortunately, there is no closed-form expression for this root $\hat{\alpha}$. (In particular, the MLE $\hat{\alpha}$ is not the method-of-moments estimator for $\alpha$.)

We may compute the root numerically using the Newton-Raphson method: we start with an initial guess $\alpha^{(0)}$, which (for example) may be the method-of-moments estimator
\[
\alpha^{(0)} = \frac{\bar{X}^2}{\frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2}.
\]
Having computed $\alpha^{(t)}$ for any $t = 0, 1, 2, \ldots$, we compute the next iterate $\alpha^{(t+1)}$ by approximating equation (13.1) with a linear equation using a first-order Taylor expansion around $\hat{\alpha} = \alpha^{(t)}$, and set $\alpha^{(t+1)}$ as the value of $\hat{\alpha}$ that solves this linear equation. In detail, let $f(\alpha) = \log \alpha - \frac{\Gamma'(\alpha)}{\Gamma(\alpha)}$. A first-order Taylor expansion around $\hat{\alpha} = \alpha^{(t)}$ in (13.1) yields the linear approximation
\[
0 \approx f(\alpha^{(t)}) + (\hat{\alpha} - \alpha^{(t)}) f'(\alpha^{(t)}) - \log \bar{X} + \frac{1}{n} \sum_{i=1}^n \log X_i,
\]
and we set $\alpha^{(t+1)}$ to be the value of $\hat{\alpha}$ solving this linear equation, i.e.
\[
\alpha^{(t+1)} = \alpha^{(t)} + \frac{-f(\alpha^{(t)}) + \log \bar{X} - \frac{1}{n} \sum_{i=1}^n \log X_i}{f'(\alpha^{(t)})}.
\]
(If this update yields $\alpha^{(t+1)} \le 0$, we may reset $\alpha^{(t+1)}$ to be a very small positive value.) The iterations $\alpha^{(0)}, \alpha^{(1)}, \alpha^{(2)}, \ldots$ converge to the MLE $\hat{\alpha}$.
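This iteration is straightforward to implement. Below is a minimal sketch in Python, not part of the original notes: the function name `gamma_mle`, the tolerance, and the simulated data are illustrative assumptions, and `scipy.special.digamma` and `polygamma` supply $\Gamma'(\alpha)/\Gamma(\alpha)$ and its derivative.

```python
import numpy as np
from scipy.special import digamma, polygamma   # digamma(a) = Gamma'(a)/Gamma(a)

def gamma_mle(x, tol=1e-10, max_iter=100):
    """Newton-Raphson iteration of Example 13.3 for the Gamma(alpha, beta) MLE."""
    x = np.asarray(x, dtype=float)
    xbar = x.mean()
    mean_log = np.log(x).mean()              # (1/n) * sum_i log X_i
    c = np.log(xbar) - mean_log              # > 0 by Jensen's inequality
    alpha = xbar**2 / x.var()                # method-of-moments start alpha^(0)
    for _ in range(max_iter):
        f = np.log(alpha) - digamma(alpha)           # f(alpha) from (13.1)
        f_prime = 1.0 / alpha - polygamma(1, alpha)  # f'(alpha)
        alpha_new = alpha + (c - f) / f_prime        # Newton-Raphson update
        if alpha_new <= 0:                   # reset to a small positive value if needed
            alpha_new = 1e-8
        if abs(alpha_new - alpha) < tol:
            alpha = alpha_new
            break
        alpha = alpha_new
    return alpha, alpha / xbar               # (alpha-hat, beta-hat = alpha-hat / X-bar)

# Illustrative usage on simulated data: true alpha = 2.5, rate beta = 1.3,
# so the Gamma(alpha, beta) of the lecture corresponds to scale = 1/beta.
rng = np.random.default_rng(0)
x = rng.gamma(shape=2.5, scale=1 / 1.3, size=1000)
print(gamma_mle(x))
```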
Example 13.4. Let $(X_1, \ldots, X_k) \sim \text{Multinomial}(n, (p_1, \ldots, p_k))$. (This is not quite the setting of $n$ IID observations from a parametric model, as we have been considering, although you can think of $(X_1, \ldots, X_k)$ as a summary of $n$ such observations $Y_1, \ldots, Y_n$ from the parametric model $\text{Multinomial}(1, (p_1, \ldots, p_k))$, where $Y_i$ indicates which of $k$ possible outcomes occurred for the $i$th observation.) The log-likelihood is given by
\[
l(p_1, \ldots, p_k) = \log \left( \binom{n}{X_1, \ldots, X_k} p_1^{X_1} \cdots p_k^{X_k} \right)
= \log \binom{n}{X_1, \ldots, X_k} + \sum_{i=1}^k X_i \log p_i,
\]
and the parameter space is
\[
\Omega = \{ (p_1, \ldots, p_k) : 0 \le p_i \le 1 \text{ for all } i \text{ and } p_1 + \cdots + p_k = 1 \}.
\]
To maximize $l(p_1, \ldots, p_k)$ subject to the linear constraint $p_1 + \cdots + p_k = 1$, we may use the method of Lagrange multipliers: consider the Lagrangian
\[
L(p_1, \ldots, p_k, \lambda) = \log \binom{n}{X_1, \ldots, X_k} + \sum_{i=1}^k X_i \log p_i + \lambda (p_1 + \cdots + p_k - 1),
\]
for a constant $\lambda$ to be chosen later. Clearly, subject to $p_1 + \cdots + p_k = 1$, maximizing $l(p_1, \ldots, p_k)$ is the same as maximizing $L(p_1, \ldots, p_k, \lambda)$. Ignoring momentarily the constraint $p_1 + \cdots + p_k = 1$, the unconstrained maximizer of $L$ is obtained by setting, for each $i = 1, \ldots, k$,
\[
0 = \frac{\partial L}{\partial p_i} = \frac{X_i}{p_i} + \lambda,
\]
which yields $\hat{p}_i = -X_i/\lambda$. For the specific choice of constant $\lambda = -n$, we obtain $\hat{p}_i = X_i/n$ and $\sum_{i=1}^k \hat{p}_i = \sum_{i=1}^k X_i/n = 1$, so the constraint is satisfied. As $\hat{p}_i = X_i/n$ is the unconstrained maximizer of $L(p_1, \ldots, p_k, -n)$, it must also be the constrained maximizer of $L(p_1, \ldots, p_k, -n)$, and hence the constrained maximizer of $l(p_1, \ldots, p_k)$. So the MLE is given by $\hat{p}_i = X_i/n$ for $i = 1, \ldots, k$.
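As a quick numerical sanity check of Example 13.4 (a sketch only, not part of the original notes; the sample size, the true probabilities, and the helper name `loglik` are illustrative), we can verify that $\hat{p}_i = X_i/n$ attains a log-likelihood at least as large as other points of the probability simplex:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated counts (X_1, ..., X_k); the true probabilities are illustrative.
n = 500
p_true = np.array([0.2, 0.3, 0.1, 0.4])
x = rng.multinomial(n, p_true)

p_hat = x / n                    # the MLE of Example 13.4: p_i-hat = X_i / n

def loglik(p):
    """Log-likelihood of Example 13.4, dropping the constant multinomial coefficient."""
    return np.sum(x * np.log(p))

# The MLE should score at least as well as any other point of the simplex;
# compare against a few random alternatives drawn from a Dirichlet distribution.
for _ in range(5):
    p_other = rng.dirichlet(np.ones(len(p_hat)))
    assert loglik(p_hat) >= loglik(p_other)

print(p_hat)
```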
