Math 541: Statistical Theory II

Maximum Likelihood Estimation

Lecturer: Songfeng Zheng

1 Maximum Likelihood Estimation

Maximum likelihood is a relatively simple method of constructing an estimator for an unknown parameter θ. It was introduced by R. A. Fisher, a great English mathematical statistician, in 1912. Maximum likelihood estimation (MLE) can be applied in most problems, it has a strong intuitive appeal, and it often yields a reasonable estimator of θ. Furthermore, if the sample size is large, the method will yield an excellent estimator of θ. For these reasons, the method of maximum likelihood is probably the most widely used method of estimation in statistics.

Suppose that the random variables X1, ··· , Xn form a random sample from a distribution f(x|θ); if X is a continuous random variable, f(x|θ) is the pdf; if X is discrete, f(x|θ) is the point mass function (pmf). We use the notation f(x|θ) to indicate that the distribution also depends on a parameter θ, where θ could be a real-valued unknown parameter or a vector of parameters. For every observed random sample x1, ··· , xn, we define

$$f(x_1, \cdots, x_n|\theta) = f(x_1|\theta) \cdots f(x_n|\theta) \qquad (1)$$

If f(x|θ) is a pdf, f(x1, ··· , xn|θ) is the joint density function; if f(x|θ) is a pmf, f(x1, ··· , xn|θ) is the joint probability mass function. We call f(x1, ··· , xn|θ) the likelihood function. As we can see, the likelihood function depends on the unknown parameter θ, and it is usually denoted by L(θ).

Suppose, for the moment, that the observed random sample x1, ··· , xn came from a discrete distribution. If an estimate of θ must be selected, we would certainly not consider any value of θ for which it would have been impossible to obtain the sample x1, ··· , xn that was actually observed. Furthermore, suppose that the probability f(x1, ··· , xn|θ) of obtaining the actual observed data x1, ··· , xn is very high when θ has a particular value, say θ = θ0, and is very small for every other value of θ. Then we would naturally estimate the value of θ to be θ0. When the sample comes from a continuous distribution, it would again be natural to try to find a value of θ for which the probability density f(x1, ··· , xn|θ) is large, and to use this value as an estimate of θ. For any given observed data x1, ··· , xn, we are led by this reasoning to consider a value of θ for which the likelihood function L(θ) is a maximum and to use this value as an estimate of θ.


The meaning of maximum likelihood is as follows: we choose the value of the parameter that makes the likelihood of the data at hand as large as possible. For a discrete distribution, the likelihood is the same as the probability, so we choose the parameter value that maximizes the probability of the observed data. Before any data are observed, maximizing the likelihood function gives a function of the n random variables X1, ··· , Xn, which we call the maximum likelihood estimator θ̂. When actual data are observed, the estimator takes a particular numerical value, which is the maximum likelihood estimate.

MLE requires us to maximize the likelihood function L(θ) with respect to the unknown parameter θ. From Eqn. 1, L(θ) is defined as a product of n terms, which is not easy to maximize directly. However, maximizing L(θ) is equivalent to maximizing log L(θ), because log is a monotonically increasing function. We call log L(θ) the log likelihood function and denote it by l(θ), i.e.,

$$l(\theta) = \log L(\theta) = \log \prod_{i=1}^{n} f(X_i|\theta) = \sum_{i=1}^{n} \log f(X_i|\theta)$$

Maximizing l(θ) with respect to θ gives the MLE.

2 Examples

Example 1: Suppose that X is a discrete random variable with the following probability mass function, where 0 ≤ θ ≤ 1 is a parameter:

X      0      1     2            3
P(X)   2θ/3   θ/3   2(1 − θ)/3   (1 − θ)/3

The following 10 independent observations were taken from such a distribution: (3, 0, 2, 1, 3, 2, 1, 0, 2, 1). What is the maximum likelihood estimate of θ?

Solution: Since the sample is (3, 0, 2, 1, 3, 2, 1, 0, 2, 1), the likelihood is

$$L(\theta) = P(X=3)\,P(X=0)\,P(X=2)\,P(X=1)\,P(X=3) \times P(X=2)\,P(X=1)\,P(X=0)\,P(X=2)\,P(X=1) \qquad (2)$$

Substituting from the table above, we have

$$L(\theta) = \prod_{i=1}^{n} P(X_i|\theta) = \left(\frac{2\theta}{3}\right)^2 \left(\frac{\theta}{3}\right)^3 \left(\frac{2(1-\theta)}{3}\right)^3 \left(\frac{1-\theta}{3}\right)^2$$

Clearly, the likelihood function L(θ) is not easy to maximize.

Let us look at the log likelihood function

$$l(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log P(X_i|\theta)$$
$$= 2\left(\log\frac{2}{3} + \log\theta\right) + 3\left(\log\frac{1}{3} + \log\theta\right) + 3\left(\log\frac{2}{3} + \log(1-\theta)\right) + 2\left(\log\frac{1}{3} + \log(1-\theta)\right)$$
$$= C + 5\log\theta + 5\log(1-\theta)$$

where C is a constant that does not depend on θ. It can be seen that the log likelihood function is easier to maximize than the likelihood function. Let the derivative of l(θ) with respect to θ be zero:

$$\frac{dl(\theta)}{d\theta} = \frac{5}{\theta} - \frac{5}{1-\theta} = 0$$

and the solution gives us the MLE, which is θ̂ = 0.5. Recall that the method of moments estimate is θ̂ = 5/12, which is different from the MLE.

Example 2: Suppose X1, X2, ··· , Xn are i.i.d. random variables with density function

$$f(x|\sigma) = \frac{1}{2\sigma}\exp\left(-\frac{|x|}{\sigma}\right);$$

please find the maximum likelihood estimate of σ.

Solution: The log-likelihood function is

$$l(\sigma) = \sum_{i=1}^{n}\left[-\log 2 - \log\sigma - \frac{|X_i|}{\sigma}\right]$$

Let the derivative with respect to σ be zero:

$$l'(\sigma) = \sum_{i=1}^{n}\left[-\frac{1}{\sigma} + \frac{|X_i|}{\sigma^2}\right] = -\frac{n}{\sigma} + \frac{\sum_{i=1}^{n}|X_i|}{\sigma^2} = 0$$

and this gives us the MLE for σ as

$$\hat\sigma = \frac{\sum_{i=1}^{n}|X_i|}{n}$$

Again this is different from the method of moments estimate, which is

$$\hat\sigma = \sqrt{\frac{\sum_{i=1}^{n}X_i^2}{2n}}$$

Example 3: Find the maximum likelihood estimates of the parameters μ and σ for the normal density

$$f(x|\mu,\sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\},$$

based on a random sample X1, ··· , Xn.

Solution: In this example we have two unknown parameters, μ and σ, so the parameter θ = (μ, σ) is a vector. We first write out the log likelihood function as

$$l(\mu,\sigma) = \sum_{i=1}^{n}\left[-\log\sigma - \frac{1}{2}\log 2\pi - \frac{1}{2\sigma^2}(X_i-\mu)^2\right] = -n\log\sigma - \frac{n}{2}\log 2\pi - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(X_i-\mu)^2$$

Setting the partial derivatives to be 0, we have

$$\frac{\partial l(\mu,\sigma)}{\partial\mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(X_i-\mu) = 0$$

$$\frac{\partial l(\mu,\sigma)}{\partial\sigma} = -\frac{n}{\sigma} + \sigma^{-3}\sum_{i=1}^{n}(X_i-\mu)^2 = 0$$

Solving these equations will give us the MLEs for μ and σ:

$$\hat\mu = \bar X \qquad\text{and}\qquad \hat\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(X_i-\bar X)^2}$$

This time the MLE is the same as the result of the method of moments.

From these examples, we can see that the maximum likelihood estimate may or may not be the same as the method of moments estimate.

Example 4: The Pareto distribution has been used in economics as a model for a density function with a slowly decaying tail:

$$f(x|x_0,\theta) = \theta x_0^{\theta} x^{-\theta-1}, \qquad x \ge x_0,\ \theta > 1$$

Assume that x0 > 0 is given and that X1, X2, ··· , Xn is an i.i.d. sample. Find the MLE of θ.

Solution: The log-likelihood function is

$$l(\theta) = \sum_{i=1}^{n}\log f(X_i|\theta) = \sum_{i=1}^{n}\left(\log\theta + \theta\log x_0 - (\theta+1)\log X_i\right) = n\log\theta + n\theta\log x_0 - (\theta+1)\sum_{i=1}^{n}\log X_i$$

Let the derivative with respect to θ be zero:

$$\frac{dl(\theta)}{d\theta} = \frac{n}{\theta} + n\log x_0 - \sum_{i=1}^{n}\log X_i = 0$$

Solving the equation yields the MLE of θ:

$$\hat\theta_{MLE} = \frac{n}{\sum_{i=1}^{n}\log X_i - n\log x_0} = \frac{1}{\overline{\log X} - \log x_0}$$

where $\overline{\log X} = \frac{1}{n}\sum_{i=1}^{n}\log X_i$ is the average of the log observations.

Example 5: Suppose that X1, ··· , Xn form a random sample from a uniform distribution on the interval (0, θ), where the value of the parameter θ is unknown (θ > 0). Please find the MLE of θ.

Solution: The pdf of each observation has the following form:

$$f(x|\theta) = \begin{cases} 1/\theta, & \text{for } 0 \le x \le \theta \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$

Therefore, the likelihood function has the form

$$L(\theta) = \begin{cases} \frac{1}{\theta^n}, & \text{for } 0 \le x_i \le \theta \ (i = 1, \cdots, n) \\ 0, & \text{otherwise} \end{cases}$$

It can be seen that the MLE of θ must be a value of θ for which θ ≥ xi for i = 1, ··· , n and which maximizes 1/θ^n among all such values. Since 1/θ^n is a decreasing function of θ, the estimate will be the smallest possible value of θ such that θ ≥ xi for i = 1, ··· , n. This value is θ̂ = max(x1, ··· , xn), so it follows that the MLE of θ is θ̂ = max(X1, ··· , Xn).

It should be remarked that in this example, the MLE θ̂ does not seem to be a suitable estimator of θ. We know that max(X1, ··· , Xn) < θ with probability 1, and therefore θ̂ surely underestimates the value of θ.

Example 6: Suppose again that X1, ··· , Xn form a random sample from a uniform distribution on the interval (0, θ), where the value of the parameter θ is unknown (θ > 0). However, suppose now we write the density function as

$$f(x|\theta) = \begin{cases} 1/\theta, & \text{for } 0 < x < \theta \\ 0, & \text{otherwise} \end{cases} \qquad (4)$$

We will prove that in this case, the MLE for θ does not exist.

Proof: The only difference between Eqn. 3 and Eqn. 4 is that the value of the pdf at the two endpoints 0 and θ has been changed by replacing the weak inequalities in Eqn. 3 with strict inequalities in Eqn. 4. Either equation could be used as the pdf of the uniform distribution.

However, if Eqn. 4 is used as the pdf, then an MLE of θ will be a value of θ for which θ > xi for i = 1, ··· , n and which maximizes 1/θ^n among all such values. It should be noted that the possible values of θ no longer include the value θ = max(x1, ··· , xn), since θ must be strictly greater than each observed value xi for i = 1, ··· , n. Since θ can be chosen arbitrarily close to the value max(x1, ··· , xn) but cannot be chosen equal to this value, it follows that the MLE of θ does not exist in this case.

Examples 5 and 6 illustrate one shortcoming of the concept of an MLE. We know that it is irrelevant whether the pdf of the uniform distribution is chosen to be equal to 1/θ over the open interval 0 < x < θ or over the closed interval 0 ≤ x ≤ θ. Now, however, we see that the existence of an MLE depends on this typically irrelevant and unimportant choice. This difficulty is easily avoided in Example 5 by using the pdf given by Eqn. 3 rather than that given by Eqn. 4. In many other problems as well, in which there is a difficulty of this type regarding the existence of an MLE, the difficulty can be avoided simply by choosing one particular appropriate version of the pdf to represent the given distribution.

Example 7: Suppose that X1, ··· , Xn form a random sample from a uniform distribution on the interval (θ, θ + 1), where the value of the parameter θ is unknown (−∞ < θ < ∞). Clearly, the density function is

$$f(x|\theta) = \begin{cases} 1, & \text{for } \theta \le x \le \theta + 1 \\ 0, & \text{otherwise} \end{cases}$$

We will see that the MLE for θ is not unique.

Proof: In this example, the likelihood function is

$$L(\theta) = \begin{cases} 1, & \text{for } \theta \le x_i \le \theta + 1 \ (i = 1, \cdots, n) \\ 0, & \text{otherwise} \end{cases}$$

The condition that θ ≤ xi for i = 1, ··· , n is equivalent to the condition that θ ≤ min(x1, ··· , xn). Similarly, the condition that xi ≤ θ + 1 for i = 1, ··· , n is equivalent to the condition that θ ≥ max(x1, ··· , xn) − 1. Therefore, we can rewrite the likelihood function as

$$L(\theta) = \begin{cases} 1, & \text{for } \max(x_1, \cdots, x_n) - 1 \le \theta \le \min(x_1, \cdots, x_n) \\ 0, & \text{otherwise} \end{cases}$$

Thus, we can select any value in the interval [max(x1, ··· , xn) − 1, min(x1, ··· , xn)] as the MLE for θ. Therefore, the MLE is not uniquely specified in this example.

3 Exercises

Exercise 1: Let X1, ··· , Xn be an i.i.d. sample from a Poisson distribution with parameter λ, i.e.,

$$P(X=x|\lambda) = \frac{\lambda^x e^{-\lambda}}{x!}.$$

Please find the MLE of the parameter λ.

Exercise 2: Let X1, ··· , Xn be an i.i.d. sample from an exponential distribution with the density function

$$f(x|\beta) = \frac{1}{\beta}e^{-x/\beta}, \qquad 0 \le x < \infty.$$

Please find the MLE of the parameter β.

Exercise 3: The Gamma distribution has density function

$$f(x|\alpha,\lambda) = \frac{1}{\Gamma(\alpha)}\lambda^{\alpha}x^{\alpha-1}e^{-\lambda x}, \qquad 0 \le x < \infty.$$

Suppose the parameter α is known; please find the MLE of λ based on an i.i.d. sample X1, ··· , Xn.

Exercise 4: Suppose that X1, ··· , Xn form a random sample from a distribution for which the pdf f(x|θ) is as follows:

$$f(x|\theta) = \begin{cases} \theta x^{\theta-1}, & \text{for } 0 < x < 1 \\ 0, & \text{otherwise} \end{cases}$$

Also suppose that the value of θ is unknown (θ > 0). Find the MLE of θ.

Exercise 5: Suppose that X1, ··· , Xn form a random sample from a distribution for which the pdf f(x|θ) is as follows:

$$f(x|\theta) = \frac{1}{2}e^{-|x-\theta|}, \qquad -\infty < x < \infty$$

Also suppose that the value of θ is unknown (−∞ < θ < ∞). Find the MLE of θ.

Exercise 6: Suppose that X1, ··· , Xn form a random sample from a distribution for which the pdf f(x|θ) is as follows:

$$f(x|\theta) = \begin{cases} e^{\theta-x}, & \text{for } x > \theta \\ 0, & \text{for } x \le \theta \end{cases}$$

Also suppose that the value of θ is unknown (−∞ < θ < ∞).

a) Show that the MLE of θ does not exist.

b) Determine another version of the pdf of this same distribution for which the MLE of θ will exist, and find this estimate.

Exercise 7: Suppose that X1, ··· , Xn form a random sample from a uniform distribution on the interval (θ1, θ2), with the pdf as follows:

$$f(x|\theta) = \begin{cases} \frac{1}{\theta_2-\theta_1}, & \text{for } \theta_1 \le x \le \theta_2 \\ 0, & \text{otherwise} \end{cases}$$

Also suppose that the values of θ1 and θ2 are unknown (−∞ < θ1 < θ2 < ∞). Find the MLEs of θ1 and θ2.