Maximum Likelihood Estimates Class 10, 18.05 Jeremy Orloﬀ and Jonathan Bloom

Maximum Likelihood Estimates Class 10, 18.05 Jeremy Orloff and Jonathan Bloom 1 Learning Goals 1. Be able to define the likelihood function for a parametric model given data. 2. Be able to compute the maximum likelihood estimate of unknown parameter(s). 2 Introduction Suppose we know we have data consisting of values x1; : : : ; xn drawn from an exponential distribution. The question remains: which exponential distribution?! We have casually referred to the exponential distribution or the binomial distribution or the normal distribution. In fact the exponential distribution exp(λ) is not a single distribution but rather a one-parameter family of distributions. Each value of λ defines a different dis- −λx tribution in the family, with pdf fλ(x) = λe on [0; ). Similarly, a binomial distribution 1 bin(n; p) is determined by the two parameters n and p, and a normal distribution N(µ, σ2) is determined by the two parameters µ and σ2 (or equivalently, µ and σ). Parameterized families of distributions are often called parametric distributions or parametric models. We are often faced with the situation of having random data which we know (or believe) is drawn from a parametric model, whose parameters we do not know. For example, in an election between two candidates, polling data constitutes draws from a Bernoulli(p) distribution with unknown parameter p. In this case we would like to use the data to estimate the value of the parameter p, as the latter predicts the result of the election. Similarly, assuming gestational length follows a normal distribution, we would like to use the data of the gestational lengths from a random sample of pregnancies to draw inferences about the values of the parameters µ and σ2. Our focus so far has been on computing the probability of data arising from a parametric model with known parameters. Statistical inference flips this on its head: we will estimate the probability of parameters given a parametric model and observed data drawn from it. In the coming weeks we will see how parameter values are naturally viewed as hypotheses, so we are in fact estimating the probability of various hypotheses given the data. 3 Maximum Likelihood Estimates There are many methods for estimating unknown parameters from data. We will first consider the maximum likelihood estimate (MLE), which answers the question: For which parameter value does the observed data have the biggest probability? The MLE is an example of a point estimate because it gives a single value for the unknown parameter (later our estimates will involve intervals and probabilities). Two advantages of 1 18.05 class 10, Maximum Likelihood Estimates , Spring 2014 2 the MLE are that it is often easy to compute and that it agrees with our intuition in simple examples. We will explain the MLE through a series of examples. Example 1. A coin is flipped 100 times. Given that there were 55 heads, find the maximum likelihood estimate for the probability p of heads on a single toss. Before actually solving the problem, let's establish some notation and terms. We can think of counting the number of heads in 100 tosses as an experiment. For a given value of p, the probability of getting 55 heads in this experiment is the binomial probability 100 P (55 heads) = p55(1 p)45: 55 − The probability of getting 55 heads depends on the value of p, so let's include p in by using the notation of conditional probability: 100 P (55 heads p) = p55(1 p)45: j 55 − You should read P (55 heads p) as: j `the probability of 55 heads given p,' or more precisely as `the probability of 55 heads given that the probability of heads on a single toss is p.' Here are some standard terms we will use as we do statistics. Experiment: Flip the coin 100 times and count the number of heads. • Data: The data is the result of the experiment. In this case it is `55 heads'. • Parameter(s) of interest: We are interested in the value of the unknown parameter p. • Likelihood, or likelihood function: this is P (data p): Note it is a function of both the • j data and the parameter p. In this case the likelihood is 100 P (55 heads p) = p55(1 p)45: j 55 − Notes: 1. The likelihood P (data p) changes as the parameter of interest p changes. j 2. Look carefully at the definition. One typical source of confusion is to mistake the likelihood P (data p) for P (p data). We know from our earlier work with Bayes' theorem that j j P (data p) and P (p data) are usually very different. j j Definition: Given data the maximum likelihood estimate (MLE) for the parameter p is the value of p that maximizes the likelihood P (data p). That is, the MLE is the value of j p for which the data is most likely. answer: For the problem at hand, we saw above that the likelihood 100 P (55 heads p) = p55(1 p)45: j 55 − 18.05 class 10, Maximum Likelihood Estimates , Spring 2014 3 We'll use the notation p^ for the MLE. We use calculus to find it by taking the derivative of the likelihood function and setting it to 0. d 100 P (data p) = (55p54(1 p)45 45p55(1 p)44) = 0: dp j 55 − − − Solving this for p we get 55p54(1 p)45 = 45p55(1 p)44 − − 55(1 p) = 45p − 55 = 100p the MLE is p^ = :55 Note: 1. The MLE for p turned out to be exactly the fraction of heads we saw in our data. 2. The MLE is computed from the data. That is, it is a statistic. 3. Officially you should check that the critical point is indeed a maximum. You can do this with the second derivative test. 3.1 Log likelihood If is often easier to work with the natural log of the likelihood function. For short this is simply called the log likelihood. Since ln(x) is an increasing function, the maxima of the likelihood and log likelihood coincide. Example 2. Redo the previous example using log likelihood. answer: We had the likelihood P (55 heads p) = 100p55(1 p)45. Therefore the log j 55 − likelihood is 100 ln(P (55 heads p) = ln + 55 ln(p) + 45 ln(1 p): j 55 − Maximizing likelihood is the same as maximizing log likelihood. We check that calculus gives us the same answer as before: d d 100 (log likelihood) = ln + 55 ln(p) + 45 ln(1 p) dp dp 55 − 55 45 = = 0 p − 1 p − 55(1 p) = 45p ) − p^ = :55 ) 3.2 Maximum likelihood for continuous distributions For continuous distributions, we use the probability density function to define the likelihood. We show this in a few examples. In the next section we explain how this is analogous to what we did in the discrete case. 18.05 class 10, Maximum Likelihood Estimates , Spring 2014 4 Example 3. Light bulbs Suppose that the lifetime of Badger brand light bulbs is modeled by an exponential distribution with (unknown) parameter λ. We test 5 bulbs and find they have lifetimes of 2, 3, 1, 3, and 4 years, respectively. What is the MLE for λ? answer: We need to be careful with our notation. With five different values it is best to th use subscripts. Let Xj be the lifetime of the i bulb and let xi be the value Xi takes. Then −λxi each Xi has pdf fXi (xi) = λe . We assume the lifetimes of the bulbs are independent, so the joint pdf is the product of the individual densities: −λx1 −λx2 −λx3 −λx4 −λx5 5 −λ(x1+x2+x3+x4+x5) f(x1 ; x 2 ; x 3 ; x 4 ; x 5 λ) = (λe )(λe )(λe )(λe )(λe ) = λ e : j Note that we write this as a conditional density, since it depends on λ. Viewing the data as fixed and λ as variable, this density is the likelihood function. Our data had values x1 = 2; x2 = 3; x3 = 1; x4 = 3; x5 = 4: So the likelihood and log likelihood functions with this data are f(2; 3; 1; 3; 4 λ) = λ5e−13λ; ln(f(2; 3; 1; 3; 4 λ) = 5 ln(λ) 13λ j j − Finally we use calculus to find the MLE: d 5 5 (log likelihood) = 13 = 0 λ^ = : dλ λ − ) 13 Note: 1. In this example we used an uppercase letter for a random variable and the corresponding lowercase letter for the value it takes. This will be our usual practice. 2. The MLE for λ turned out to be the reciprocal of the sample mean x¯, so X exp(λ^) ∼ satisfies E(X) = x¯. The following example illustrates how we can use the method of maximum likelihood to estimate multiple parameters at once. Example 4. Normal distributions 2 Suppose the data x1; x2; : : : ; xn is drawn from a N(µ, σ ) distribution, where µ and σ are unknown. Find the maximum likelihood estimate for the pair (µ, σ2). answer: Let's be precise and phrase this in terms of random variables and densities. Let 2 uppercase X1;:::;Xn be i.i.d. N(µ, σ ) random variables, and let lowercase xi be the value Xi takes. The density for each Xi is 2 1 − (xi−µ) fX (xi) = e 2σ2 : i p2π σ Since the Xi are independent their joint pdf is the product of the individual pdf's: n 2 1 − Pn (xi−µ) f(x1; : : : ; xn µ, σ) = e i=1 2σ2 : j p2π σ For the fixed data x1; : : : ; xn, the likelihood and log likelihood are n 2 n 2 1 − Pn (xi−µ) X (xi µ) f(x1; : : : ; xn µ, σ) = e i=1 2σ2 ; ln(f(x1; : : : ; xn µ, σ)) = n ln(p2π) n ln(σ) − : j p j − − − 2σ2 2π σ i=1 18.05 class 10, Maximum Likelihood Estimates , Spring 2014 5 Since ln(f(x1; : : : ; xn µ, σ)) is a function of the two variables µ, σ we use partial derivatives j to find the MLE.

Load more