Topic 15: Maximum Likelihood Estimation
15.1 Introduction

The principle of maximum likelihood is relatively straightforward to state. As before, we begin with observations $X = (X_1,\dots,X_n)$ of random variables chosen according to one of a family of probabilities $P_\theta$. In addition, $f(x|\theta)$, $x = (x_1,\dots,x_n)$, will be used to denote the density function for the data when $\theta$ is the true state of nature. The principle of maximum likelihood then yields a choice of the estimator $\hat\theta$ as the value for the parameter that makes the observed data most probable.

Definition 15.1. The likelihood function is the density function regarded as a function of $\theta$,
$$L(\theta|x) = f(x|\theta), \qquad \theta \in \Theta. \qquad (15.1)$$
The maximum likelihood estimate (MLE) is
$$\hat\theta(x) = \arg\max_\theta L(\theta|x). \qquad (15.2)$$

Thus, we are presuming that a unique global maximum exists. We will learn that, especially for large samples, maximum likelihood estimators have many desirable properties. However, especially for high-dimensional data, the likelihood can have many local maxima, so finding the global maximum can be a major computational challenge.

This class of estimators has an important invariance property. If $\hat\theta(x)$ is a maximum likelihood estimate for $\theta$, then $g(\hat\theta(x))$ is a maximum likelihood estimate for $g(\theta)$. For example, if $\theta$ is a parameter for the variance and $\hat\theta$ is the maximum likelihood estimate for the variance, then $\sqrt{\hat\theta}$ is the maximum likelihood estimate for the standard deviation. This flexibility in the estimation criterion is not available in the case of unbiased estimators.

For independent observations, the likelihood is the product of density functions. Because the logarithm of a product is the sum of the logarithms, finding zeroes of the score function, $\partial \ln L(\theta|x)/\partial\theta$, the derivative of the logarithm of the likelihood, will be easier. Having the parameter values be the variable of interest is somewhat unusual, so we will next look at several examples of the likelihood function.

15.2 Examples

Example 15.2 (Bernoulli trials). If the experiment consists of $n$ Bernoulli trials with success probability $p$, then
$$L(p|x) = p^{x_1}(1-p)^{1-x_1} \cdots p^{x_n}(1-p)^{1-x_n} = p^{x_1+\cdots+x_n}(1-p)^{n-(x_1+\cdots+x_n)},$$
$$\ln L(p|x) = \ln p \sum_{i=1}^n x_i + \ln(1-p)\Big(n - \sum_{i=1}^n x_i\Big) = n\big(\bar x \ln p + (1-\bar x)\ln(1-p)\big).$$
Differentiating,
$$\frac{\partial}{\partial p}\ln L(p|x) = n\Big(\frac{\bar x}{p} - \frac{1-\bar x}{1-p}\Big) = n\,\frac{\bar x - p}{p(1-p)}.$$
This equals zero when $p = \bar x$.

Exercise 15.3. Check that this is a maximum.

Thus, $\hat p(x) = \bar x$. In this case the maximum likelihood estimator is also unbiased.

[Figure 15.1: Likelihood function (top row) and its logarithm (bottom row) for Bernoulli trials. The left column is based on 20 trials having 8 and 11 successes; the right column is based on 40 trials having 16 and 22 successes. Notice that the maximum likelihood is approximately $10^{-6}$ for 20 trials and $10^{-12}$ for 40. In addition, the peaks are narrower for 40 trials than for 20. We shall later be able to associate this property with the variance of the maximum likelihood estimator.]
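To make the Bernoulli calculation concrete, here is a short R sketch (an editorial illustration, not part of the original text) that evaluates the log-likelihood for the first panel of Figure 15.1, 20 trials with 8 successes, and confirms numerically that the maximizer is the sample proportion $\bar x = 0.4$. The use of optimize and the search interval are editorial choices; any one-dimensional maximizer would do.

# Bernoulli log-likelihood for n trials with s successes
n <- 20          # number of trials (left column of Figure 15.1)
s <- 8           # number of successes, so xbar = 0.4
loglik <- function(p) s * log(p) + (n - s) * log(1 - p)

# maximize the log-likelihood numerically over (0, 1)
fit <- optimize(loglik, interval = c(0.01, 0.99), maximum = TRUE)
fit$maximum      # approximately 0.4

# compare with the closed-form maximum likelihood estimate, phat = xbar
s / n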
Example 15.4 (Normal data). Maximum likelihood estimation can be applied to a vector-valued parameter. For a simple random sample of $n$ normal random variables, we can use the properties of the exponential function to simplify the likelihood function:
$$L(\mu,\sigma^2|x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big(-\frac{(x_1-\mu)^2}{2\sigma^2}\Big)\cdots\frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big(-\frac{(x_n-\mu)^2}{2\sigma^2}\Big) = \frac{1}{\sqrt{(2\pi\sigma^2)^n}}\exp\Big(-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2\Big).$$
The log-likelihood is
$$\ln L(\mu,\sigma^2|x) = -\frac{n}{2}(\ln 2\pi + \ln\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2.$$
The score function is now a vector, $\big(\tfrac{\partial}{\partial\mu}\ln L(\mu,\sigma^2|x),\ \tfrac{\partial}{\partial\sigma^2}\ln L(\mu,\sigma^2|x)\big)$. Next we find its zeros to determine the maximum likelihood estimators $\hat\mu$ and $\hat\sigma^2$:
$$\frac{\partial}{\partial\mu}\ln L(\hat\mu,\hat\sigma^2|x) = \frac{1}{\hat\sigma^2}\sum_{i=1}^n (x_i-\hat\mu) = \frac{1}{\hat\sigma^2}\,n(\bar x - \hat\mu) = 0.$$
Because the second partial derivative with respect to $\mu$ is negative, $\hat\mu(x) = \bar x$ is the maximum likelihood estimator. For the derivative of the log-likelihood with respect to the parameter $\sigma^2$,
$$\frac{\partial}{\partial\sigma^2}\ln L(\mu,\sigma^2|x) = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n (x_i-\mu)^2 = -\frac{n}{2(\sigma^2)^2}\Big(\sigma^2 - \frac{1}{n}\sum_{i=1}^n (x_i-\mu)^2\Big) = 0.$$
Recalling that $\hat\mu(x) = \bar x$, we obtain
$$\hat\sigma^2(x) = \frac{1}{n}\sum_{i=1}^n (x_i-\bar x)^2.$$
Note that the maximum likelihood estimator is a biased estimator.

Example 15.5 (Lincoln-Peterson method of mark and recapture). Let's recall the variables in mark and recapture:
• t, the number captured and tagged,
• k, the number in the second capture,
• r, the number in the second capture that are tagged, and
• N, the total population.
Here t and k are set by the experimental design; r is an observation that may vary. The total population N is unknown. The likelihood function for N is the hypergeometric distribution,
$$L(N|r) = \frac{\binom{t}{r}\binom{N-t}{k-r}}{\binom{N}{k}}.$$

Exercise 15.6. Show that the maximum likelihood estimator is
$$\hat N = \Big[\frac{tk}{r}\Big],$$
where $[\,\cdot\,]$ denotes the greatest integer less than or equal to its argument. Thus, the maximum likelihood estimator is, in this case, obtained from the method of moments estimator by rounding down to the next integer.

Let's look at the example of mark and recapture from the previous topic. There N = 2000, the number of fish in the population, is unknown to us. We tag t = 200 fish in the first capture event, and obtain k = 400 fish in the second capture.

> N<-2000
> t<-200
> fish<-c(rep(1,t),rep(0,N-t))

This creates a vector of length N with t ones representing tagged fish and N−t zeroes representing the untagged fish.

> k<-400
> r<-sum(sample(fish,k))
> r
[1] 42

This samples k fish for the recapture and adds up the ones to obtain, in this simulation, the number r = 42 of recaptured tagged fish. For the likelihood function, we look at a range of values for N that is symmetric about 2000. Here, the maximum likelihood estimate is $\hat N = [200 \cdot 400/42] = 1904$.

> N<-c(1800:2200)
> L<-dhyper(r,t,N-t,k)
> plot(N,L,type="l",ylab="L(N|42)",col="green")

The likelihood function for this example is shown in Figure 15.2.

[Figure 15.2: Likelihood function L(N|42) for mark and recapture with t = 200 tagged fish and k = 400 fish in the second capture, r = 42 of which carry tags and thus are recaptures. Note that the maximum likelihood estimator for the total fish population is $\hat N = 1904$.]
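As a complement to Exercise 15.6, the following R sketch (an editorial addition, not from the original text) locates the maximizer of the hypergeometric likelihood over a grid of candidate population sizes and compares it with the closed-form value $[tk/r]$, using the same values t = 200, k = 400, r = 42 as above. The grid range and the use of which.max are editorial choices.

# grid search for the maximizer of the hypergeometric likelihood L(N | r)
t <- 200                      # tagged in the first capture
k <- 400                      # size of the second capture
r <- 42                       # tagged fish observed in the second capture

Ngrid <- seq(t + k - r, 4000) # the population must contain at least t + (k - r) fish
L <- dhyper(r, t, Ngrid - t, k)

Ngrid[which.max(L)]           # numerical maximizer of the likelihood
floor(t * k / r)              # closed-form MLE [tk/r] = 1904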
Example 15.7 (Linear regression). Our data are n observations with one explanatory variable and one response variable. The model is that the responses $y_i$ are linearly related to the explanatory variable $x_i$ with an "error" $\epsilon_i$, i.e.,
$$y_i = \alpha + \beta x_i + \epsilon_i.$$
Here we take the $\epsilon_i$ to be independent, mean 0, normal random variables. The (unknown) variance is $\sigma^2$. Consequently, our model has three parameters: the intercept $\alpha$, the slope $\beta$, and the variance of the error, $\sigma^2$. Thus, the joint density for the $\epsilon_i$ is
$$\frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big(-\frac{\epsilon_1^2}{2\sigma^2}\Big)\cdot\frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big(-\frac{\epsilon_2^2}{2\sigma^2}\Big)\cdots\frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big(-\frac{\epsilon_n^2}{2\sigma^2}\Big) = \frac{1}{\sqrt{(2\pi\sigma^2)^n}}\exp\Big(-\frac{1}{2\sigma^2}\sum_{i=1}^n \epsilon_i^2\Big).$$
Since $\epsilon_i = y_i - (\alpha + \beta x_i)$, the likelihood function is
$$L(\alpha,\beta,\sigma^2|y,x) = \frac{1}{\sqrt{(2\pi\sigma^2)^n}}\exp\Big(-\frac{1}{2\sigma^2}\sum_{i=1}^n \big(y_i - (\alpha+\beta x_i)\big)^2\Big).$$
The logarithm is
$$\ln L(\alpha,\beta,\sigma^2|y,x) = -\frac{n}{2}(\ln 2\pi + \ln\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n \big(y_i - (\alpha+\beta x_i)\big)^2. \qquad (15.3)$$
Consequently, maximizing the likelihood function over the parameters $\alpha$ and $\beta$ is equivalent to minimizing
$$SS(\alpha,\beta) = \sum_{i=1}^n \big(y_i - (\alpha+\beta x_i)\big)^2.$$
Thus, the principle of maximum likelihood is equivalent to the least squares criterion for ordinary linear regression. The maximum likelihood estimators $\hat\alpha$ and $\hat\beta$ give the regression line
$$\hat y_i = \hat\alpha + \hat\beta x_i, \quad\text{with}\quad \hat\beta = \frac{\mathrm{cov}(x,y)}{\mathrm{var}(x)},$$
and $\hat\alpha$ determined by solving $\bar y = \hat\alpha + \hat\beta\bar x$.

Exercise 15.8. Show that the maximum likelihood estimator for $\sigma^2$ is
$$\hat\sigma^2_{MLE} = \frac{1}{n}\sum_{i=1}^n (y_i - \hat y_i)^2. \qquad (15.4)$$
Frequently, software will report the unbiased estimator. For ordinary least squares procedures, this is
$$\hat\sigma^2 = \frac{1}{n-2}\sum_{i=1}^n (y_i - \hat y_i)^2,$$
where the denominator n − 2 accounts for the two estimated regression parameters.
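To connect these formulas with computation, here is a brief R sketch (an editorial illustration on simulated data, not part of the original text) that computes $\hat\alpha$, $\hat\beta$, the biased estimator $\hat\sigma^2_{MLE}$ of equation (15.4), and the unbiased estimator, and checks them against R's built-in lm fit. The data-generating values below (intercept 1, slope 2, sd 0.5) are arbitrary choices for the example.

# simulate a small regression data set
set.seed(1)
n <- 50
x <- runif(n, 0, 10)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)      # true intercept 1, slope 2, sd 0.5

# maximum likelihood (= least squares) estimates of slope and intercept
beta.hat  <- cov(x, y) / var(x)
alpha.hat <- mean(y) - beta.hat * mean(x)

# residuals and the two variance estimates
res <- y - (alpha.hat + beta.hat * x)
sigma2.mle      <- sum(res^2) / n        # biased MLE, equation (15.4)
sigma2.unbiased <- sum(res^2) / (n - 2)  # unbiased estimator reported by software

# compare with R's lm(): coefficients and squared residual standard error
fit <- lm(y ~ x)
coef(fit)                 # should match alpha.hat and beta.hat
summary(fit)$sigma^2      # should match sigma2.unbiased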