
B.Sc./Cert./M.Sc. Qualif. - : Theory and Practice

10 Point Estimation using Maximum Likelihood

10.1 The method of maximum likelihood

Definition
For each x ∈ X, let θ̂(x) be such that, with x held fixed, the likelihood function {L(θ; x): θ ∈ Θ} attains its maximum value as a function of θ at θ̂(x). The estimator θ̂(X) is then said to be a maximum likelihood estimator (MLE) of θ.

• Given any observed value x, the value of a MLE, θ̂(x), can be interpreted as the most likely value of θ to have given rise to the observed value x.

• To find a MLE, it is often more convenient to maximize the log likelihood function, ln L(θ; x), which is equivalent to maximizing the likelihood function.

• It should be noted that a MLE may not exist: there may be an x ∈ X for which there is no θ that maximizes the likelihood function {L(θ; x): θ ∈ Θ}. In practice, however, such cases are exceptional.

• The MLE, if it does exist, may not be unique: given x ∈ X, there may be more than one value of θ that maximizes the likelihood function {L(θ; x): θ ∈ Θ}. Again, in practice, such cases arise only very exceptionally.

• There are simple cases in which the MLE may be found analytically, possibly just using the standard methods of calculus. In other cases, especially if the dimension q of the vector of parameters is large, the maximization has to be carried out using numerical methods within computer packages.

• The maximum of the likelihood function may be in the interior of Θ, but there are cases in which it lies on the boundary of Θ. In the former case, routine analytical or numerical methods are generally available, but the possibility that the latter case may occur has to be borne in mind.

Assuming that the likelihood function L(θ; x) is a continuously differentiable function of θ, given x ∈ X, an interior stationary point of ln L(θ; x), or of L(θ; x), is given by a solution of the likelihood equations,

∂ ln L(θ; x)/∂θj = 0,   j = 1, . . . , q.   (1)

A solution of Equations (1) may or may not be unique, and may or may not give us a MLE. In many of the standard cases, solving Equations (1) does give us the MLE, but we cannot take this for granted. We may wish to check that a solution of Equations (1) gives at least a local maximum of the likelihood function. If L(θ; x) is twice continuously differentiable, the criterion to check is that the Hessian matrix, the matrix of second order partial derivatives,

∂² ln L(θ; x)/∂θj ∂θk,   j, k = 1, . . . , q,

is negative definite at the solution point. However, even if this condition can be shown to be satisfied, it does not prove that the solution point is a global maximum. If possible, it is better to use alternative, direct methods of proving that a global maximum has been found.
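As a brief illustration of the numerical route mentioned above, the following Python sketch (an assumed example, not part of the formal development; the simulated data, the choice of a gamma model, and the helper function hessian are all invented for illustration) minimizes the negative log likelihood with scipy.optimize.minimize and then checks that a finite-difference Hessian of the log likelihood is negative definite at the solution, i.e. that at least a local maximum has been found.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import gamma

    rng = np.random.default_rng(0)
    x = rng.gamma(shape=2.0, scale=3.0, size=200)   # simulated data, illustrative only

    def neg_log_lik(theta):
        a, scale = theta
        if a <= 0 or scale <= 0:                    # stay inside the parameter space
            return np.inf
        return -np.sum(gamma.logpdf(x, a, scale=scale))

    fit = minimize(neg_log_lik, x0=[1.0, 1.0], method="Nelder-Mead")

    # finite-difference Hessian of the log likelihood (central differences)
    def hessian(f, t, h=1e-4):
        t = np.asarray(t, dtype=float)
        q = len(t)
        H = np.zeros((q, q))
        for j in range(q):
            for k in range(q):
                ej, ek = np.eye(q)[j] * h, np.eye(q)[k] * h
                H[j, k] = (f(t + ej + ek) - f(t + ej - ek)
                           - f(t - ej + ek) + f(t - ej - ek)) / (4 * h * h)
        return H

    H = hessian(lambda t: -neg_log_lik(t), fit.x)
    print(fit.x)                                    # the MLE found numerically
    print(np.all(np.linalg.eigvalsh(H) < 0))        # local-maximum check: expect True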

Example 1
For a random sample X1, X2, . . . , Xn of size n from a N(µ, σ²) distribution, we found in Chapter 9 that the log likelihood function is given by

ln L(µ, σ²; x) = −(n/2) ln(2πσ²) − (1/(2σ²)) (sxx + n(x̄ − µ)²),   (2)

where sxx = Σ_{i=1}^n (xi − x̄)². We could find the MLE by solving the likelihood equations, but the following argument provides a clear-cut demonstration that we really have found the MLE. For any given σ² > 0, the expression (2) is maximized with respect to µ by taking the value µ̂ = x̄ of µ. Using this fact, we reduce the problem of finding a MLE to the univariate problem of maximizing

ln L(µ̂, σ²; x) = −(n/2) ln(2πσ²) − sxx/(2σ²)   (3)

with respect to σ² > 0. Differentiating the expression (3) with respect to σ², we find

∂ ln L(µ̂, σ²; x)/∂σ² = −n/(2σ²) + sxx/(2σ⁴) = (n/(2σ⁴)) (sxx/n − σ²).

Hence ∂ ln L(µ̂, σ²; x)/∂σ² is positive for σ² < sxx/n, zero for σ² = sxx/n, and negative for σ² > sxx/n. Thus ln L(µ̂, σ²; x) is a unimodal function of σ² which attains its maximum at the value σ̂² = sxx/n of σ². It follows that the MLE of (µ, σ²) is given by

(µ̂, σ̂²) = ( X̄, (1/n) Σ_{i=1}^n (Xi − X̄)² ).
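The closed form just derived can be checked numerically. The short sketch below (the simulated data and all numerical choices are my own, not from the notes) evaluates the profile log likelihood (3) over a grid of σ² values and confirms that the maximizing value agrees with sxx/n.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(loc=5.0, scale=2.0, size=100)     # simulated data, illustrative only
    n, xbar = x.size, x.mean()
    sxx = np.sum((x - xbar) ** 2)

    def profile_loglik(sigma2):                      # expression (3)
        return -0.5 * n * np.log(2 * np.pi * sigma2) - sxx / (2 * sigma2)

    grid = np.linspace(0.1, 20.0, 20000)
    print(grid[np.argmax(profile_loglik(grid))])     # numerical maximizer over the grid
    print(sxx / n)                                   # analytic MLE of sigma^2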

Example 2
Obtain the maximum likelihood estimator of the parameter p ∈ (0, 1) based on a random sample X1, X2, . . . , Xn of size n drawn from the Bin(m, p) distribution:

fX(x; p) = (m choose x) p^x (1 − p)^(m−x),   x = 0, 1, . . . , m,

where m ∈ Z⁺ is known.

Solution:

L(p; x) = fX(x; p) = Π_{i=1}^n fXi(xi; p) = Π_{i=1}^n (m choose xi) p^(xi) (1 − p)^(m−xi)
        = [ Π_{i=1}^n (m choose xi) ] p^(Σ_{i=1}^n xi) (1 − p)^(n(m−x̄)).

Therefore, since Σ_{i=1}^n xi = n x̄,

log L(p; x) = const + n x̄ log p + n(m − x̄) log(1 − p).

Taking the derivative with respect to p and setting it equal to zero yields the equation satisfied by the MLE:

(d/dp) log L(p; x) |_{p=p̂} = n x̄/p̂ − n(m − x̄)/(1 − p̂) = 0,

i.e. n x̄ = n m p̂, i.e. p̂ = x̄/m.
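A quick numerical check of this result (the values of m and p and the simulated data are invented for illustration): evaluating the binomial log likelihood on a grid of p values, the maximizer agrees with x̄/m up to the grid resolution.

    import numpy as np
    from scipy.stats import binom

    rng = np.random.default_rng(2)
    m, p_true = 10, 0.3                              # assumed values, illustrative only
    x = rng.binomial(m, p_true, size=50)

    def loglik(p):                                   # log L(p; x) for a vector of p values
        return np.sum(binom.logpmf(x[:, None], m, p), axis=0)

    p_grid = np.linspace(0.001, 0.999, 9999)
    print(p_grid[np.argmax(loglik(p_grid))])         # numerical maximizer over the grid
    print(x.mean() / m)                              # phat = xbar / m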

Example 3
Let X1, X2, . . . , Xn denote a random sample of size n from the Poisson distribution with unknown parameter µ > 0 such that, for each i = 1, . . . , n,

fXi(xi; µ) = e^(−µ) µ^(xi) / xi!,   xi = 0, 1, 2, . . .

Find the maximum likelihood estimator of µ.

Solution: Since

L(µ; x) = fX(x; µ) = (1 / Π_{i=1}^n xi!) e^(−nµ) µ^(Σ_{i=1}^n xi),

then

log L(µ; x) = −nµ + (Σ_{i=1}^n xi) log µ − Σ_{i=1}^n log(xi!).

Hence

(d/dµ) log L(µ; x) = −n + (1/µ) Σ_{i=1}^n xi = 0,

which, upon solving for µ, yields

µ̂ = (1/n) Σ_{i=1}^n xi.
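Equivalently, the Poisson score equation just written down can be solved numerically. The sketch below (simulated data, my own illustration) finds its root with a standard bracketing root finder and compares it with x̄.

    import numpy as np
    from scipy.optimize import brentq

    rng = np.random.default_rng(3)
    x = rng.poisson(lam=4.2, size=80)                # simulated data, illustrative only
    n, s = x.size, x.sum()

    def score(mu):                                   # d/dmu log L(mu; x)
        return -n + s / mu

    print(brentq(score, 1e-6, 1e6))                  # root of the score equation
    print(x.mean())                                  # xbar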

Example 4
Suppose that the random variable Y has a probability density function given by

fY(y; θ) = θ y^(θ−1),   0 < y < 1,

where θ > 0. Let Y1,Y2,...,Yn be a random sample, of size n, drawn from the same distribution as that of Y .

Based on this random sample, show that the maximum likelihood estimator of θ is given by

θ̂ = n / ( −Σ_{i=1}^n log Yi ).

Solution:

L(θ; y) = fY(y; θ) = Π_{i=1}^n fYi(yi; θ) = Π_{i=1}^n θ yi^(θ−1) = θⁿ ( Π_{i=1}^n yi )^(θ−1).

log L(θ; y) = n log θ + (θ − 1) Σ_{i=1}^n log yi.

Thus

(d/dθ) log L(θ; y) = n/θ + Σ_{i=1}^n log yi,

which, when equated to 0 and solved for θ, yields

θ̂(y) = n / ( −Σ_{i=1}^n log yi ).
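As an illustrative aside (not part of the notes), observations with this density can be simulated by the inverse transform Y = U^(1/θ) with U uniform on (0, 1); the sketch below uses this to check that the estimator θ̂ = n/(−Σ log Yi) is close to the true θ for a large simulated sample.

    import numpy as np

    rng = np.random.default_rng(4)
    theta_true, n = 2.5, 5000                        # assumed values, illustrative only
    y = rng.uniform(size=n) ** (1.0 / theta_true)    # Y = U^(1/theta) has density theta*y^(theta-1)

    theta_hat = -n / np.sum(np.log(y))
    print(theta_hat)                                 # should be close to 2.5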

The following two theorems, which we state without proof, give two general properties of MLEs that lend support to their use. The first theorem formally demonstrates that the method of maximum likelihood is tied in with the concept of sufficiency, something that is not generally true for estimators derived under other procedures.

Theorem 1
Let X be a vector of observations from a family F of distributions and suppose that a MLE θ̂ exists and is unique, i.e., for each x ∈ X there exists a unique θ̂(x) that maximizes ln L(θ; x). Then, for any sufficient statistic T, the estimator θ̂ is a function of T.

Rather than estimating the vector of parameters θ, we may wish to estimate some function τ(θ) of the parameters. Another useful result about MLEs is the following.

Theorem 2 (The Invariance Property of MLEs)
If θ̂ is a MLE of θ then, for any function τ(θ) of θ, τ(θ̂) is a MLE of τ(θ).

Example 5
To give a simple illustration of the use of Theorem 2, for a random sample of size n from a N(µ, σ²) distribution, as we have seen, Σ_{i=1}^n (Xi − X̄)²/n is the MLE of σ². It follows that √( Σ_{i=1}^n (Xi − X̄)²/n ) is the MLE of σ.
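The invariance property can also be seen numerically. In the sketch below (my own illustration; the data are simulated), the normal log likelihood is maximized directly over σ with µ fixed at x̄, and the maximizer agrees with the square root of the MLE of σ².

    import numpy as np
    from scipy.optimize import minimize_scalar
    from scipy.stats import norm

    rng = np.random.default_rng(5)
    x = rng.normal(loc=1.0, scale=3.0, size=200)     # simulated data, illustrative only
    xbar = x.mean()
    sigma2_hat = np.mean((x - xbar) ** 2)            # MLE of sigma^2 from Example 1

    def nll(sigma):                                  # -log L(xbar, sigma; x) as a function of sigma
        return -np.sum(norm.logpdf(x, loc=xbar, scale=sigma))

    fit = minimize_scalar(nll, bounds=(1e-3, 100.0), method="bounded")
    print(fit.x)                                     # MLE of sigma found directly
    print(np.sqrt(sigma2_hat))                       # square root of the MLE of sigma^2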

10.2 The method of least squares for one-way models

Recall the linear statistical model for the completely randomized one-way design,

Yij = µ + τi + εij,   i = 1, . . . , a;  j = 1, . . . , ni,   (4)

for which the systematic part of Yij is

E[Yij] ≡ µ + τi ≡ µi.

Given estimates µ̂ and τ̂i, i = 1, . . . , a, of µ and the τi, respectively, the corresponding fitted value ŷij is given by substituting the estimates of the parameters into the expression for E[Yij], i.e., ŷij = µ̂ + τ̂i.

According to the method of least squares, given the observed values yij, i = 1, . . . , a; j = 1, . . . , ni, we choose as our estimates¹, µ̂ and τ̂i, i = 1, . . . , a, those values of the parameters µ and τi, i = 1, . . . , a, which jointly minimize

L = Σ_{i=1}^a Σ_{j=1}^{ni} (yij − µ − τi)².   (5)

Setting the partial derivative of L with respect to µ equal to zero, and evaluating at the parameter estimates, we obtain

N µ̂ + n1 τ̂1 + n2 τ̂2 + · · · + na τ̂a = y.. ,   (6)

and, setting the partial derivative of L with respect to each of the τi equal to zero, and evaluating at the parameter estimates, we obtain

ni µ̂ + ni τ̂i = yi. ,   i = 1, . . . , a.   (7)

The a + 1 equations (6) and (7) for the a + 1 least squares estimates µ̂ and τ̂i, i = 1, . . . , a, are known as the (least squares) normal equations. However, Equations (6) and (7) are not linearly independent, since adding together the a Equations (7) we obtain Equation (6). To obtain uniquely defined estimates, corresponding to the constraint Σi ni τi = 0 on the model parameters, we apply the constraint

Σ_{i=1}^a ni τ̂i = 0.   (8)

From the Equations (7), without yet using the constraint (8), we obtain

τ̂i = ȳi. − µ̂,   i = 1, . . . , a.

Substituting the constraint (8) into Equation (6), we obtain µ̂ = ȳ.. . Thus the equations (6), (7) and (8) taken together have the unique solution

µ̂ = ȳ..   (9)

and

τ̂i = ȳi. − ȳ.. ,   i = 1, . . . , a,   (10)

which are the estimates given in Section 8.1.

¹ This is a slight abuse of notation, since we are using the same symbols for both the estimators and the estimates.

• Note that, in the normal equations, totals of observed values of the response variable are equated to estimates of their expected values.

• Note the convenience of the particular constraint (8) adopted in obtaining simple expressions for the estimates.

• A fitted model is one in which the unknown parameter values are replaced by their estimates.

From the expressions (9) and (10) (or using Equation (7)) we obtain for the fitted values ŷij ≡ µ̂ + τ̂i the simple expression

ŷij = ȳi. ,   i = 1, . . . , a;  j = 1, . . . , ni.

The residuals ε̂ij, with corresponding observed values eij, are the differences between the observed and fitted values and are given by

eij = yij − ŷij
    = yij − µ̂ − τ̂i
    = yij − ȳi. ,   i = 1, . . . , a;  j = 1, . . . , ni.

The corresponding minimized value of L as defined in Equation (5) is

Σ_{i=1}^a Σ_{j=1}^{ni} eij² = Σ_{i=1}^a Σ_{j=1}^{ni} (yij − ȳi.)²,

which is identical with the observed residual sum of squares as defined for the one-way ANOVA.
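The following Python sketch (the observations and group sizes are invented for illustration) computes the least squares estimates (9) and (10), the fitted values ŷij = ȳi., the residuals and the residual sum of squares for a small unbalanced one-way layout, and verifies that the constraint (8) is satisfied.

    import numpy as np

    # invented observations y_ij for a = 3 treatment groups (unbalanced)
    groups = [np.array([23.0, 25.0, 21.0]),
              np.array([30.0, 28.0, 29.0, 31.0]),
              np.array([26.0, 24.0])]

    y_all = np.concatenate(groups)
    mu_hat = y_all.mean()                                        # (9): grand mean ybar..
    tau_hat = np.array([g.mean() - mu_hat for g in groups])      # (10): ybar_i. - ybar..

    fitted = [np.full(g.size, g.mean()) for g in groups]         # yhat_ij = ybar_i.
    resid = [g - f for g, f in zip(groups, fitted)]
    rss = sum(np.sum(e ** 2) for e in resid)                     # residual sum of squares

    print(mu_hat, tau_hat, rss)
    print(sum(g.size * t for g, t in zip(groups, tau_hat)))      # constraint (8): ~ 0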

10.3 Equivalence of Maximum Likelihood and Method of Least Squares for one-way models

In this section we will show that, for the linear statistical model that arises from the completely randomized one-way design, maximum likelihood estimation is actually equivalent to least squares estimation. The result is also true for all similar linear statistical models in which the error terms are i.i.d. normal. To see this for the example at hand, recall the definition of the multivariate normal distribution from Chapter 4. A p × 1 random vector Y is said to have a multivariate normal (MVN) distribution if its joint p.d.f. is given by

fY(y) = (1 / ((2π)^(p/2) |Σ|^(1/2))) exp{ −(1/2) (y − µ)′ Σ⁻¹ (y − µ) }   (11)

for y ∈ R^p, where Σ is a p × p symmetric, positive definite matrix. We write Y ∼ Np(µ, Σ).

For the linear statistical model at hand, we set p = n = Σ_{i=1}^a ni,

µ = [µ + τ1, . . . , µ + τ1, µ + τ2, . . . , µ + τ2, . . . , µ + τa, . . . , µ + τa]′

and

Σ = σ² In,

where In is an n × n identity matrix. Thus, the likelihood function can be expressed in the following way:

L(µ, {τi}, σ²; y) = fY(y; µ, {τi}, σ²)
                  = (1 / ((2π)^(n/2) σⁿ)) exp{ −(1/(2σ²)) (y − µ)′(y − µ) }.

It follows that the log-likelihood is given by

log L(µ, {τi}, σ²; y) = −(n/2) log(2πσ²) − (1/(2σ²)) Σ_{i=1}^a Σ_{j=1}^{ni} (yij − µ − τi)²
                      = −(n/2) log(2πσ²) − (1/(2σ²)) L,

where L is the least squares criterion of Equation (5). Thus, for any given σ² > 0 (which we would normally estimate using the residual mean square), the values of µ and {τi} that minimize L are precisely those which maximize the log likelihood.
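A small numerical check of the identity above (data simulated under assumed parameter values of my own choosing): summing the normal log densities over all observations reproduces −(n/2) log(2πσ²) − L/(2σ²) exactly, so minimizing L is the same as maximizing the log likelihood.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(6)
    mu, tau, sigma = 10.0, np.array([-1.0, 0.5, 0.5]), 2.0       # assumed parameter values
    n_i = [4, 3, 5]
    y = [rng.normal(mu + t, sigma, size=m) for t, m in zip(tau, n_i)]

    n = sum(n_i)
    L = sum(np.sum((yi - mu - t) ** 2) for yi, t in zip(y, tau))  # criterion (5) at these parameters
    loglik = sum(np.sum(norm.logpdf(yi, loc=mu + t, scale=sigma)) for yi, t in zip(y, tau))

    print(loglik)
    print(-0.5 * n * np.log(2 * np.pi * sigma ** 2) - L / (2 * sigma ** 2))   # identical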

Appendix: Judging the performance of estimators using the Mean Squared Error

Given observations from a family F of probability distributions, how do we evaluate and compare the various estimators of θ that are available to us? In the present discussion, we shall restrict attention to the estimation of a single parameter θ or, more generally, of a real-valued function τ(θ) of the vector of parameters.

Definition
The mean squared error (MSE) of an estimator τ̂ of τ(θ) is the function MSE(τ̂; θ) of θ defined by

MSE(τ̂; θ) = Eθ[(τ̂(X) − τ(θ))²],   θ ∈ Θ.

• We are assuming the existence of finite second moments.

• For any given θ ∈ Θ, the MSE is the expectation of the square of the difference between the estimator τ̂ and τ(θ). Broadly speaking, the smaller the MSE of an estimator, the better.

• As is generally the case in statistics, the square of the difference is quite a mathematically tractable measure of discrepancy.

Definition
The bias of an estimator τ̂ of τ(θ) is the function b(θ) defined by

b(θ) = Eθ[τ̂] − τ(θ),   θ ∈ Θ.

Definition
An estimator τ̂ for which b(θ) = 0 for all θ ∈ Θ is said to be an unbiased estimator of τ(θ).

Equivalently, τ̂ is an unbiased estimator of τ(θ) if and only if

Eθ[τ̂] = τ(θ),   θ ∈ Θ.

In a frequentist approach to inference, for any given θ ∈ Θ, imagine carrying out a sequence of independent repetitions of the statistical experiment that generates observations from a family F of distributions. In the long run, the average value of an unbiased estimator τ̂ is equal to the true value of the function τ(θ) that it is estimating. Unbiasedness is often regarded as a desirable property of an estimator. However, as we shall see, there may be good reasons to prefer a biased estimator to an unbiased one. There is also the problem that an unbiased estimator does not always exist.

Theorem 3

MSE(τ̂; θ) = varθ(τ̂) + b(θ)²,   θ ∈ Θ.

Proof. For any random variable Y with a finite second moment and for any constant α, it is an elementary result that

E[(Y − α)²] = var(Y) + (µ − α)²,

where µ is the mean of Y. The result of the theorem, for any θ ∈ Θ, follows by taking Y = τ̂(X) and α = τ(θ).

We see from the result of Theorem 3 that the MSE of an estimator may be thought of as the sum of two components, its variance and the square of its bias. The MSE of an unbiased estimator is identical with its variance. It is possible, however, for there to be a biased estimator that has not only a smaller variance but also a smaller MSE than an unbiased estimator for all θ ∈ Θ.

Given observations from a family F of probability distributions and a function τ(θ) of the parameters, we might consider finding an estimator of τ(θ) with minimum MSE. However, the MSE is a function of θ, and it will not in general be possible to find an estimator τ̂ that simultaneously minimizes the MSE for all θ ∈ Θ, unless we impose some restriction on the class of estimators to be considered.
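The points above can be illustrated by a small Monte Carlo experiment (entirely my own example, with assumed values of n and σ²): for normal samples, the biased variance estimator sxx/n has a larger bias but a smaller MSE than the unbiased sxx/(n − 1), and in both cases the simulated MSE agrees with variance plus squared bias, as in Theorem 3.

    import numpy as np

    rng = np.random.default_rng(7)
    n, sigma2, reps = 10, 4.0, 200_000               # assumed values, illustrative only
    x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
    sxx = np.sum((x - x.mean(axis=1, keepdims=True)) ** 2, axis=1)

    for label, est in (("sxx/n", sxx / n), ("sxx/(n-1)", sxx / (n - 1))):
        bias = est.mean() - sigma2                   # Monte Carlo estimate of the bias
        mse = np.mean((est - sigma2) ** 2)           # Monte Carlo estimate of the MSE
        print(label, round(bias, 3), round(mse, 3), round(est.var() + bias ** 2, 3))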
