

Course: Mathematical Statistics, Term: Fall 2018, Instructor: Gordan Žitković

Lecture 11 Likelihood, MLE and sufficiency

11.1 Likelihood - definition and examples

As we have seen many times before, the inference based on a random sample depends heavily on the model we assume for the data. We would estimate the unknown parameter (say, the mean of the distribution) in vastly different ways when the data are normal, compared to the case of uniformly distributed data. The information about the assumed model is captured in the joint pdf (or pmf) of our data and its dependence on the parameters. In a huge number of situations, this joint pdf/pmf has a simple explicit form and, as we will see below, many important conclusions can be reached by looking at it in the right way.

Definition 11.1.1. Given a random sample Y1,..., Yn from a discrete distribution D with an unknown parameter θ, we define the likelihood (function) by

L(θ; y1,..., yn) = p^θ_{Y1,...,Yn}(y1,..., yn) = p^θ(y1) p^θ(y2) ··· p^θ(yn),

where p^θ is the pmf of (each) Yi.

If Y1,..., Yn come from a continuous distribution, we set

L(θ; y1,..., yn) = f^θ_{Y1,...,Yn}(y1,..., yn) = f^θ(y1) f^θ(y2) ··· f^θ(yn),

where f^θ is the pdf of (each) Yi.

Simply put, the likelihood is the same as the joint pdf (pmf), but with the emphasis placed on the dependence on the parameter. When considering likelihoods, we think of y1,..., yn as fixed and of L(θ; y1,..., yn) as the “likelihood” of it being the parameter θ that actually produced y1,..., yn. This should not be confused with a probability - as a function of θ, the likelihood L(θ; y1,..., yn) is not a pdf (or a pmf) of a random variable. In order to be able to interpret the likelihood as a probability, we need a completely different paradigm, namely the Bayesian one.


In these notes, Y1,..., Yn are always independent, and, so the likelihood can be written as a product of individual pdfs (by the factorization criterion). In general, when Y1,..., Yn are dependent random variables, the notion of a likelihood can still be used if the joint distribution (pmf or pdf) of Y1,..., Yn is specified. Almost everything we cover below will apply to this case, as well.

Example 11.1.2. Here are the likelihood functions for random samples from some of our favorite distributions:

1. Bernoulli. Suppose that Y1,..., Yn are independent and Yi ∼ B(p). The pmf of Yi can be written as

p(y) = P[Yi = y] = p for y = 1 and 1 − p for y = 0, i.e., p(y) = p^y (1 − p)^{1−y} for y = 0, 1.

While it may look strange at first, the right-most expression p^y(1 − p)^{1−y} happens to be very useful. For example, it allows us to write the full likelihood in a very compact form

L(p; y1,..., yn) = p^{y1}(1 − p)^{1−y1} × ··· × p^{yn}(1 − p)^{1−yn} = p^{∑i yi} × (1 − p)^{n−∑i yi}.

2. Normal. For a random sample Y1,..., Yn from a normal N(µ, σ)-distribution, we have

f(y) = (1/(σ√(2π))) e^{−(y−µ)^2/(2σ^2)},

and, so,

L(µ, σ; y1,..., yn) = (1/(2πσ^2)^{n/2}) ∏_{i=1}^n e^{−(yi−µ)^2/(2σ^2)} = (1/(2πσ^2)^{n/2}) e^{−∑_{i=1}^n (yi−µ)^2/(2σ^2)}.

We can go a step further and try to isolate µ and σ by expanding each square (yi − µ)^2:

L(µ, σ; y1,..., yn) = (1/(σ^n (2π)^{n/2})) e^{−(1/(2σ^2)) ∑_{i=1}^n yi^2 + (µ/σ^2) ∑_{i=1}^n yi − n µ^2/(2σ^2)}.

3. Uniform. Let Y1,..., Yn be a random sample from a uniform distribution U(0, θ), with an unknown θ > 0. The pdf of a single Yi is

f(y) = (1/θ) 1_{0≤y≤θ},


and, so

L(θ; y1,..., yn) = (1/θ^n) 1_{0≤y1≤θ} × ··· × 1_{0≤yn≤θ} = (1/θ^n) 1_{0≤y1,...,yn≤θ}.

The condition 0 ≤ y1,..., yn ≤ θ is equivalent to the two conditions 0 ≤ min(y1,..., yn) and max(y1,..., yn) ≤ θ. Therefore, we can write

L(θ; y1,..., yn) = (1/θ^n) 1_{min(y1,...,yn)≥0} 1_{max(y1,...,yn)≤θ}.
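As a quick companion to Example 11.1.2, here is a small numerical sketch (in Python, using numpy) of the three likelihood functions derived above. The sample values and parameter values are arbitrary choices made only for illustration; nothing here is part of the original notes.

```python
import numpy as np

def bernoulli_likelihood(p, y):
    # L(p; y1,...,yn) = p^(sum yi) * (1 - p)^(n - sum yi)
    y = np.asarray(y)
    return p ** y.sum() * (1 - p) ** (len(y) - y.sum())

def normal_likelihood(mu, sigma, y):
    # L(mu, sigma; y1,...,yn) = (2 pi sigma^2)^(-n/2) * exp(-sum (yi - mu)^2 / (2 sigma^2))
    y = np.asarray(y)
    return (2 * np.pi * sigma ** 2) ** (-len(y) / 2) * np.exp(-np.sum((y - mu) ** 2) / (2 * sigma ** 2))

def uniform_likelihood(theta, y):
    # L(theta; y1,...,yn) = theta^(-n) * 1_{max(y) <= theta}, for yi >= 0
    y = np.asarray(y)
    return theta ** (-len(y)) if y.max() <= theta else 0.0

print(bernoulli_likelihood(0.6, [1, 0, 1, 1, 0]))      # likelihood of p = 0.6 for this sample
print(normal_likelihood(0.0, 1.0, [0.2, -1.1, 0.5]))   # likelihood of (mu, sigma) = (0, 1)
print(uniform_likelihood(2.0, [0.3, 1.7, 0.9]))        # likelihood of theta = 2
```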

11.2 Maximum-likelihood estimation

We mentioned that the word “likelihood” above refers to the parameter, but that we cannot think of it as a probability without changing our entire worldview. What we can do is compare likelihoods for different values of the parameter and think of parameter values with a higher likelihood as “more likely” to have produced the observations y1,..., yn.

Example 11.2.1. Three buses (of unknown sizes, and not necessarily the same) carrying football fans arrive at a game between the Green team and the Orange team. Suppose that 90% of the people in the first bus are fans of the Green team, and 10% fans of the Orange team. The composition of the second bus is almost exactly opposite: 15% Green-team fans, and 85% Orange-team fans. The third bus carries the same number of Green-team and Orange-team fans. Once the buses arrive at the game, the two populations mix (say, as they enter the stadium) and a person is randomly selected from the crowd. It turns out she is a fan of the Green team. What is your best guess of the bus she came to the game in?

The situation can be modeled as follows: the (unknown) parameter θ - corresponding to the bus our fan came from - can take only three values, 1, 2 or 3. The observation Y can take only two values, G (for the Green-team fans) and O (for the Orange-team fans). The likelihood function L(θ; y) is given by

L(θ; G) = 0.9 for θ = 1, 0.15 for θ = 2, 0.5 for θ = 3,   and   L(θ; O) = 0.1 for θ = 1, 0.85 for θ = 2, 0.5 for θ = 3.

Since the randomly picked person was a fan of the Green team, we focus on L(θ; G). The three values it can take, namely 0.9, 0.15 and 0.5, cannot be interpreted as probabilities, as they do not add up to 1. We can still say that, in this case, θ = 1 is much more likely than θ = 2, and we should probably guess that our fan came in the first bus (the one that carried mostly the Green-team fans).
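The reasoning in Example 11.2.1 amounts to a one-line computation. Here is a minimal Python sketch; the three likelihood values are exactly the ones from the example, nothing else is assumed:

```python
# Likelihood of each bus (theta = 1, 2, 3) given that the selected person is a Green-team fan.
likelihood_given_green = {1: 0.90, 2: 0.15, 3: 0.50}

# The "most likely" bus is the one with the largest likelihood value.
best_bus = max(likelihood_given_green, key=likelihood_given_green.get)
print(best_bus)   # prints 1: the bus carrying mostly Green-team fans
```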

The thinking we used to formulate our guess in the above example was simple: pick the value of the parameter which yields the highest likelihood, given the observed data. If we follow this procedure systematically, we arrive at one of the most important classes of estimators:

Definition 11.2.2. An estimator θˆ = θˆ(y1,..., yn) is called the maximum-likelihood estimator (MLE) if it has the property that for any other estimator θˆ′ = θˆ′(y1,..., yn) we have

L(θˆ; y1,..., yn) ≥ L(θˆ′; y1,..., yn), for all y1,..., yn.

Maximum-likelihood estimators are often easy to find whenever explicit expressions for the likelihood functions are available. Unlike in Example 11.2.1 above, the unknown parameters often vary continuously and we can use calculus to find the values that maximize the likelihood. A very useful trick is to maximize the log-likelihood log L(θ; y1,..., yn) instead of the likelihood L. We get the same maximizers (as x ↦ log(x) is an increasing function), but the expressions involved are often much simpler.
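For instance (a standard computation spelled out here for illustration; it is not part of Example 11.2.3 below), in the Bernoulli model of Example 11.1.2 the likelihood p^{∑i yi}(1 − p)^{n−∑i yi} is awkward to differentiate directly, while its logarithm

log L(p; y1,..., yn) = (∑i yi) log p + (n − ∑i yi) log(1 − p)

has the derivative (∑i yi)/p − (n − ∑i yi)/(1 − p), which vanishes exactly at pˆ = (1/n) ∑i yi = y¯, the sample proportion of 1s.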

Example 11.2.3.

1. Normal (with a known variance). Let Y1,..., Yn be a random sample from a normal model with an unknown mean µ and a known variance σ^2 = 1. Thanks to Example 11.1.2 above, we have the following expression for the likelihood function

L(µ; y1,..., yn) = (1/(2π)^{n/2}) e^{−(1/2) ∑_{i=1}^n (yi − µ)^2}.

To find the MLE µˆ for µ, we need to find the value of µ (depending on y1,..., yn) which maximizes L(µ; y1,..., yn). We could use the standard technique and differentiate L(µ; y1,..., yn) in µ, set the obtained value to 0 and solve for µ. The log-likelihood

log L(µ; y1,..., yn) = −(n/2) log(2π) − (1/2) ∑_{i=1}^n (yi − µ)^2


is much easier to differentiate:

∂/∂µ log L(µ; y1,..., yn) = −(1/2) ∑_{i=1}^n ∂/∂µ (yi − µ)^2 = ∑_{i=1}^n (yi − µ) = ∑_{i=1}^n yi − nµ.

We set the obtained expression to 0 and solve for µ, obtaining

µˆ = (1/n) ∑_{i=1}^n Yi.

It can be shown that this µˆ is indeed the maximum of L(µ; y1,..., yn) (and not a minimum or an inflection point). It should not be too surprising that we obtained the sample mean - it has already been shown to be the best estimator for µ in the mean-squared-error sense.

2. Exponential. The likelihood function for a random sample from the exponential distribution with parameter τ is given by

L(τ; y1,..., yn) = (1/τ^n) e^{−(1/τ) ∑_{i=1}^n yi} for y1,..., yn > 0.

As above, the log-likelihood is going to be easier to differentiate:

∂/∂τ log L(τ; y1,..., yn) = ∂/∂τ ( −n log(τ) − (1/τ) ∑i yi ) = −n/τ + τ^{−2} ∑i yi.

Setting this to 0, we obtain the equation 0 = −n/τ + τ^{−2} ∑i yi, with the solution

τˆ = (1/n) ∑i yi = y¯.

3. Uniform. We have derived in Example 11.1.2 above that the likelihood function for a random sample from a uniform U(0, θ)-distribution is given by

L(θ; y1,..., yn) = (1/θ^n) 1_{max(y1,...,yn)≤θ},

when y1,..., yn ≥ 0 (which we can safely assume). As a function of θ it looks like this:


Figure 1. The likelihood function for a sample from U(0, θ), as a function of θ where m = max(y1,..., yn).

It is clear from the picture above that L attains its maximal value for θ = max(y1,..., yn), but this fact cannot be obtained by differentiation. Indeed, as a function of θ, the likelihood is not differentiable (or even continuous). Nevertheless, the MLE is well-defined and equals θˆ = max(Y1,..., Yn).
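The closed-form MLEs in Example 11.2.3 can be checked against a generic numerical maximizer of the log-likelihood. Below is a minimal Python sketch (assuming numpy and scipy are available; the sample sizes, random seed and true parameter values are arbitrary choices made only for illustration):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)

# 1. Normal with known sigma = 1: closed form mu_hat = sample mean.
y = rng.normal(loc=2.0, scale=1.0, size=200)
res = minimize_scalar(lambda mu: 0.5 * np.sum((y - mu) ** 2))   # negative log-likelihood, up to constants
print(res.x, y.mean())                                          # the two numbers should agree closely

# 2. Exponential with mean tau: closed form tau_hat = sample mean.
y = rng.exponential(scale=3.0, size=200)
res = minimize_scalar(lambda tau: len(y) * np.log(tau) + y.sum() / tau,
                      bounds=(1e-6, 100.0), method="bounded")
print(res.x, y.mean())

# 3. Uniform U(0, theta): no calculus; the likelihood is maximized at the sample maximum.
y = rng.uniform(low=0.0, high=5.0, size=200)
print(y.max())                                                  # theta_hat = max(y1, ..., yn)
```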

11.3 Sufficient statistics

Consider three pollsters each of whom collects candidate preference data from a sample of n = 1000 voters (as always, the candidates are A and B, and each voter prefers one to the other). The first pollster reports her findings as follows:

Voter 1: A, Voter 2: B, Voter 3: B, . . . , Voter 1000: A.

The second pollster decides that it is not important which particular voters prefer A and which prefer B. For him, it is enough to report the data in the following format:

Number of voters who prefer A to B: 550.

The third pollster argues that that is overkill, too. She simply reports

The majority prefers: A.

The first pollster reports every detail of his data set, while the other two summarize it to different degrees. Intuitively, however, we feel that there is no real difference between what the first and the second pollster reported, but that the third pollster left out some important information. Indeed, whether the number of votes for A is 501 or 1000, the report of the third pollster looks exactly the same. In the former case, we would not be wrong to say that the race is a “dead heat”, while, in the latter, we could be practically sure that A would win.

The notion of an estimator we introduced some time ago will come in handy to give a mathematical explanation of what is going on here. Remember that an estimator is any function of the data Y1,..., Yn. In the present context, where there is no specific parameter singled out, we also use the word statistic as a synonym. Therefore, if Y1,..., Yn denote the voters' preferences, encoded by A ↦ 1 and B ↦ 0, the quantities reported by the three pollsters are

1. T1 = (Y1,..., Yn),

2. T2 = Y1 + ··· + Yn, and

3. T3 = 1_{Y1+···+Yn > n/2}.

It is clear that T2 is a function of T1, i.e., that if we know T1, we can easily compute T2. Similarly, T3 is a function of T2. It does not work the other way around. If someone tells us the value of T2, i.e., the number of voters in the sample who prefer A, there is no way for us to work out the exact preference of the first sampled voter, the second sampled voter, etc. Similarly, if all we know is that the majority of voters prefer A, we cannot tell what that majority actually is.

There is, however, a big difference between our inability to recover T1 from T2 and our inability to recover T2 from T3. In the first case, the missing information is irrelevant for the parameter of interest (i.e., p, the proportion of A voters in the entire population). Indeed, once the value of T2 is known, any arrangement of 1000 voters in a sequence (while making sure that the total number of A voters is exactly T2) is equally likely, no matter what the value of p is. In fact, for all we know, the first pollster also recorded only the number of A voters and deleted all the data about the preferences of particular voters. He might then have been told by his boss to report every detail of his sample, and, fearing for his job, cooked up a fictitious data set by randomly selecting 550 different numbers between 1 and 1000 and claiming that the voters with those numbers preferred A, while all other voters preferred B. If he in fact did that, there would be no way of proving that he cheated, even if the true value of p were revealed.

In the second case, the information contained in T3 is not sufficient to fake the value of T2, let alone the value of T1. If the third pollster found herself in the same situation as the second pollster above, she would not be able to come up with a plausible value of T2. Indeed, the number of A voters could have been anywhere between 501 and 1000, and the knowledge of the true value of p would be very helpful to make a good guess. If we translate the above discussion into mathematics, the two scenarios differ in one significant way:


1. the conditional distribution of T1 = (Y1,..., Yn), given T2 = Y1 + ··· + Yn does not depend on p, while

2. the conditional distribution of T1 = (Y1,..., Yn), given T3 = 1_{Y1+···+Yn > n/2}, does.

Let us recall the mathematical definitions of conditional probability and of conditional distributions. The basic definition, known from a basic probability class, is the following:

P[A|B] = P[A ∩ B]/P[B] when P[B] > 0.

We usually interpret P[A|B] as the probability that (the event) A will happen, if we already know that B happened. More generally, if Y and T are discrete random variables, we talk about the conditional distribution of Y, given T = t, as the collection of probabilities P[Y = y|T = t], as y “runs” through the support of Y, i.e., through the set of all possible values of Y. In order to define the notion of sufficiency, we will need a similar concept, but with a single random variable Y replaced by an entire random vector (Y1,..., Yn). Starting with the discrete case, we remember that the pmf of a discrete random vector (Y1,..., Yn, T) is the function of y1,..., yn and t, given by

p_{Y1,...,Yn,T}(y1,..., yn, t) = P[Y1 = y1, Y2 = y2,..., Yn = yn, T = t].

Definition 11.3.1. Let (Y1,..., Yn, T) be a discrete random vector with the pmf p_{Y1,...,Yn,T}. When P[T = t] > 0, we define the conditional pmf of Y1,..., Yn given T = t by

p_{Y1,...,Yn|T}(y1,..., yn|t) := P[Y1 = y1, Y2 = y2,..., Yn = yn | T = t].

Using the definition of the conditional probability, we see that

p_{Y1,...,Yn|T}(y1,..., yn|t) = P[Y1 = y1, Y2 = y2,..., Yn = yn, T = t] / P[T = t] = p_{Y1,...,Yn,T}(y1,..., yn, t) / pT(t).

To give a definition for the continuous case, we replace all pmfs by pdfs:

Definition 11.3.2. Let (Y1,..., Yn, T) be a random vector with a continuous distribution and the joint pdf f_{Y1,...,Yn,T}(y1,..., yn, t). The conditional pdf of Y1,..., Yn given T = t, for t with fT(t) > 0, is given by

f_{Y1,...,Yn|T}(y1,..., yn|t) = f_{Y1,...,Yn,T}(y1,..., yn, t) / fT(t),


where fT is the marginal pdf of T.

We can now give a mathematical definition of the notion of a sufficient statistic:

Definition 11.3.3. A statistic (estimator) T in the random sample Y1,..., Yn is said to be sufficient for the unknown parameter θ if the conditional distribution of (Y1,..., Yn), given T = t, does not depend on θ (for all t for which it is defined).

In words, T is sufficient if the distribution of the “extra information” in the sample (Y1,..., Yn), in addition to that already contained in T, does not depend on the parameter. This is why the second pollster could cook up the data in a convincing way and the third one could not.

Example 11.3.4. Let us analyze the statistics T2 and T3 introduced above in the light of Definition 11.3.3. Remember that n = 1000, T2 = Y1 + ··· + Y1000 is the total number of votes for A, and T3 is the indicator of the event in which A wins, i.e., that Y1 + ··· + Y1000 > 500.

- T2 is sufficient for p. We start from the definition of the conditional pmf of (Y1,..., Yn), given T2 = t:

p_{Y1,...,Yn|T2}(y1,..., yn|t) = P[Y1 = y1,..., Yn = yn, T2 = t] / P[T2 = t].

To make progress, let us differentiate between two cases, depending on the value of t:

1. y1 + ··· + yn ≠ t. In this case P[Y1 = y1,..., Yn = yn, T2 = t] = 0, since it is impossible to have all these equalities hold at the same time.

2. y1 + ··· + yn = t. In this case we have T2 = t as soon as Y1 = y1,..., Yn = yn, so that

P[Y1 = y1,..., Yn = yn, T2 = t] = P[Y1 = y1,..., Yn = yn] = p^{∑i yi} (1 − p)^{n−∑i yi} = p^t (1 − p)^{n−t}.

The random variable T2 is binomially distributed with parameters n and p, and, so, for t = 0, 1, . . . , n, we have

P[T2 = t] = (n choose t) p^t (1 − p)^{n−t}.

Putting everything together, we get

P[Y1 = y1,..., Yn = yn | T2 = t] = 0 if ∑i yi ≠ t, and = (n choose t)^{−1} if ∑i yi = t.


Since this expression does not feature p anywhere, we conclude that T2 is a sufficient statistic.

- T3 is not sufficient for p. The computation here parallels that from above, but we will not be able to get everything in closed form. It still helps to differentiate between two cases, depending on the relationship between (y1,..., yn) and t ∈ {0, 1}. Since the cases t = 0 and t = 1 are very similar, we assume from now on that t = 1.

1. y1 + ··· + yn ≤ n/2. Like above, this is an impossible case and

P[Y1 = y1,..., Yn = yn|T3 = 1] = 0.

2. y1 + ··· + yn > n/2. In this case, just like above, we have

P[Y1 = y1,..., Yn = yn, T3 = 1] = P[Y1 = y1,..., Yn = yn] = p^{∑i yi} (1 − p)^{n−∑i yi}.

The random variable T3 has a Bernoulli distribution whose parameter is

P[T3 = 1] = P[Y1 + ··· + Yn > n/2] = ∑_{k=501}^{1000} (1000 choose k) p^k (1 − p)^{1000−k}.

Putting everything together, we get

P[Y1 = y1,..., Yn = yn | T3 = 1] = 0 when ∑i yi ≤ n/2, and, when ∑i yi > n/2,

P[Y1 = y1,..., Yn = yn | T3 = 1] = p^{∑i yi} (1 − p)^{n−∑i yi} / ( ∑_{k=501}^{1000} (1000 choose k) p^k (1 − p)^{1000−k} ).

This expression cannot be simplified any further, but we can graph it as a function of p (we pick ∑i yi = 600):

Figure 2. The conditional probability P[Y1 = y1,..., Yn = yn|T3 = 1] when ∑i yi = 600 as a function of the parameter p.


The obtained graph shows a clear dependence on p (it is not a horizontal line), so T3 is not a sufficient statistic for p.
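The two conclusions of Example 11.3.4 can also be verified exactly on a much smaller sample, where all outcomes can be enumerated. The Python sketch below uses n = 5 instead of n = 1000 (an arbitrary choice made only to keep the enumeration cheap) and a fixed outcome with three votes for A:

```python
from itertools import product

n = 5
y_fixed = (1, 1, 0, 1, 0)   # an arbitrary outcome with sum = 3 (a majority for A)

def joint_prob(y, p):
    # joint pmf of an i.i.d. Bernoulli(p) sample: p^(sum) * (1 - p)^(n - sum)
    s = sum(y)
    return p ** s * (1 - p) ** (len(y) - s)

for p in (0.3, 0.7):
    p_T2 = sum(joint_prob(y, p) for y in product([0, 1], repeat=n) if sum(y) == sum(y_fixed))
    p_T3 = sum(joint_prob(y, p) for y in product([0, 1], repeat=n) if sum(y) > n / 2)
    print(p,
          joint_prob(y_fixed, p) / p_T2,   # P[Y = y | T2 = 3]: equals 1/(5 choose 3) = 0.1 for both p
          joint_prob(y_fixed, p) / p_T3)   # P[Y = y | T3 = 1]: changes with p
```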

The computations in the example above are tedious and, in the case of T3, require numerical computation (graphing) to reach a conclusion. Luckily, there is a simple criterion for sufficiency - called the Fisher-Neyman criterion - which is much easier to apply.

Theorem 11.3.5 (The Fisher-Neyman factorization criterion). Let Y1,..., Yn be a random sample with the likelihood function L(θ; y1,..., yn). The statistic T = T(Y1,..., Yn) is sufficient for θ if and only if L admits the following factorization

L(θ; y1,..., yn) = g(θ, T(y1,..., yn)) × h(y1,..., yn),    (11.3.1)

where the function h does not depend on θ.

We omit the proof, but note that it is not very complicated - one simply needs to follow the definitions and use properties of conditional probabilities. Here are some examples, though:

Example 11.3.6.

1. Bernoulli. Example 11.1.2 gave us the following form for the likelihood in this case:

L(p; y1,..., yn) = p^{∑i yi} (1 − p)^{n−∑i yi}.

The Fisher-Neyman criterion is easy to apply. We simply take

g(p, T) = p^T (1 − p)^{n−T} and h(y1,..., yn) = 1,

so that L(p; y1,..., yn) = g(p, ∑i yi) h(y1,..., yn). In other words, the likelihood, in the form above, is already factorized as it only depends on y1,..., yn through their sum ∑i yi. We note that Y¯ is also a sufficient statistic for p. Indeed, we can write g(p, y¯) = p^{n y¯}(1 − p)^{n−n y¯}. In fact, if k is any bijection (and not only k(y) = y/n) and T is a sufficient statistic, then so is k(T).


2. Normal (with a known variance). In the normal model with σ known to be equal to 1, the likelihood is given by

L(µ; y1,..., yn) = (1/(2π)^{n/2}) exp( −(1/2) ∑_{i=1}^n (yi − µ)^2 ).

We expand the squares and obtain the following, equivalent, expression:

L(µ; y1,..., yn) = (1/(2π)^{n/2}) exp( −(1/2)( ∑i yi^2 − 2µ ∑i yi + nµ^2 ) )
= (1/(2π)^{n/2}) exp( −(1/2) ∑i yi^2 ) × exp( µ ∑i yi − (1/2) nµ^2 ),

where the first factor plays the role of h(y1,..., yn) and the second that of g(µ, ∑i yi).

It follows that T(Y1,..., Yn) = ∑i Yi is a sufficient statistic for µ. As above, another sufficient statistic is Y¯.

3. Uniform. For a random sample from U(0, θ), we derived the following form for the likelihood function in Example 11.1.2:

L(θ; y1,..., yn) = (1/θ^n) 1_{0≤min(y1,...,yn)} 1_{max(y1,...,yn)≤θ},

for y1,..., yn ≥ 0. This is almost in the factorized form already. Indeed, we can take T(y1,..., yn) = max(y1,..., yn) so that

L(θ; y1,..., yn) = g(θ, T(y1,..., yn)) × h(y1,..., yn),    (11.3.2)

for

g(θ, T) = θ^{−n} 1_{T≤θ} and h(y1,..., yn) = 1_{0≤min(y1,...,yn)}.

Thus, T = max(Y1,..., Yn) is a sufficient statistic for θ.
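One practical consequence of the factorization (11.3.1) is easy to check numerically: if two data sets y and y′ give the same value of a sufficient statistic T, then L(θ; y)/L(θ; y′) = h(y)/h(y′) does not depend on θ. The Python sketch below does this for the normal model of part 2 above; the two small data sets are arbitrary, chosen only so that their sums agree:

```python
import numpy as np

def normal_loglik(mu, y):
    # log-likelihood of a normal sample with sigma = 1, as in Example 11.3.6, part 2
    y = np.asarray(y)
    return -0.5 * np.sum((y - mu) ** 2) - 0.5 * len(y) * np.log(2 * np.pi)

y1 = [0.0, 1.0, 2.0]    # sum = 3
y2 = [1.5, 1.5, 0.0]    # a different sample with the same sum = 3
for mu in (-1.0, 0.0, 2.5):
    # the difference log L(mu; y1) - log L(mu; y2) is the same for every mu
    print(mu, normal_loglik(mu, y1) - normal_loglik(mu, y2))
```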

11.4 (*) Some important theorems in statistics

Sufficient statistics are not only important because they point exactly where, in the data, the information about the value of the parameter lies. They can also help us build good estimators, among other things. In fact, some of the deepest theorems in theoretical statistics are about sufficient statistics. To state some of them (two, to be precise), we need three related definitions. In all of these definitions, we assume that Y1,..., Yn is a random sample from a distribution with an unknown parameter θ.


Definition 11.4.1. A sufficient statistic T is said to be minimal sufficient if for any other sufficient statistic S we have T = f(S) for some (non-random) function f.

Intuitively, a minimal sufficient statistic captures all the information about θ more efficiently than any other statistic. It does not have “unnecessary” parts. Our second definition is the one of completeness. We state it without worrying too much about various mathematical subtleties (a reader with appropriate background will find a few):

Definition 11.4.2. A statistic T is said to be complete if for any function g we have

Eθ[g(T)] = 0 for all θ if and only if g(T) = 0.

Finally, we need a notion somewhat opposite to that of a sufficient statistic:

Definition 11.4.3. A statistic T is said to be ancillary if its distribution does not depend on θ.

A statistic is ancillary if it contains no information about the parameter θ. For example, a constant statistic T = 12 is always ancillary. Less trivially, if Y1,..., Yn is a random sample from the normal N(µ, 1), then T = Y2 − Y1 is an ancillary statistic. Indeed, its distribution is N(0, √2). Now that we have all our terms defined, here is how to check that a given statistic is complete and minimal sufficient. We start with a large class of parametric families in which complete and sufficient statistics are easy to find:

Definition 11.4.4. A family of distributions with the unknown parameter θ = (θ1,..., θm) is called an exponential family if its pdfs (or pmfs) have the following form:

f^θ(y) = exp( ∑_{i=1}^m ηi(θ) Ti(y) − A(θ) ) h(y)    (11.4.1)

Proposition 11.4.5. If Y1,..., Yn is a random sample from the exponential


family (11.4.1) with the following, additional, property:

if ∑_{i=1}^m ηi(θ)(Ti(y) − Ti(y′)) = 0 for all θ, then T(y) = T(y′),

for all y, y′ with h(y) > 0 and h(y′) > 0. Then the statistic T(y1,..., yn) = (T1(y1) + ··· + T1(yn),..., Tm(y1) + ··· + Tm(yn)) is complete and minimal sufficient.

It turns out that most of the distribution families in these notes form exponential families (try to write some of them in the form (11.4.1)). The only exceptions are the Student's t and the uniform.
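As one concrete instance of (11.4.1) (a standard computation spelled out here for convenience), the Bernoulli pmf from Example 11.1.2 can be written as

p^y (1 − p)^{1−y} = exp( y log(p/(1 − p)) + log(1 − p) ), y = 0, 1,

which is of the form (11.4.1) with m = 1, η1(p) = log(p/(1 − p)), T1(y) = y, A(p) = −log(1 − p) and h(y) = 1. By Proposition 11.4.5, the statistic T = Y1 + ··· + Yn is then complete and minimal sufficient for p.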

We are ready now for our important theorems:

Theorem 11.4.6 (Basu). Let T be a complete and minimal sufficient statistic and S an ancillary statistic. Then T and S are independent for any value of the parameter θ.

This theorem, e.g., tells us that Y¯ and S2 are independent in the normal model (how, exactly?).
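Here is one way to see it (a sketch, not spelled out in the notes): fix σ; then Y¯ is complete and minimal sufficient for µ, while S^2, being a function of the deviations Yi − Y¯ whose joint distribution does not change with µ, is ancillary, so Basu's theorem gives the independence for each fixed σ. A quick Python simulation (all numbers below are arbitrary) at least illustrates the lack of association; a sample correlation near zero does not prove independence, of course:

```python
import numpy as np

rng = np.random.default_rng(2)
samples = rng.normal(loc=1.0, scale=2.0, size=(50_000, 10))   # 50,000 normal samples of size 10
ybar = samples.mean(axis=1)                                    # sample means
s2 = samples.var(axis=1, ddof=1)                               # sample variances
print(np.corrcoef(ybar, s2)[0, 1])                             # close to 0
```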

Theorem 11.4.7 (Lehmann-Scheffé). Let S be an unbiased estimator for θ. If it can be written as S = f (T) for some complete and sufficient statistic T, then S is the unique UMVUE for θ.

The Lehmann-Scheffé theorem makes finding a UMVUE for θ an easy task. It is enough to pick a complete and sufficient statistic T and then apply a function f to it so that S = f(T) is unbiased. It will automatically be a UMVUE.
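For instance (a standard example, added here for illustration): in the Bernoulli B(p) model, T = ∑i Yi is sufficient (Example 11.3.6) and complete (Proposition 11.4.5, via the exponential-family form of the Bernoulli pmf above), and the statistic Y¯ = T/n satisfies

E_p[Y¯] = (1/n) ∑_{i=1}^n E_p[Yi] = p,

so it is unbiased and a function of T. By Theorem 11.4.7, Y¯ is then the unique UMVUE for p.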

11.5 Problems

Problem 11.5.1. Write down the likelihood functions, and then compute the MLEs (maximum likelihood estimators) for the unknown parameter in the following situations:

1. a random sample of size n from the Poisson distribution with parameter θ, i.e., the distribution with the pmf p(y) = e^{−θ} θ^y / y!, y ∈ N0.

2. a random sample of size n, where the pdf is given by

f(y) = (1/(y√(2π))) e^{−(1/2)(ln y − θ)^2} 1_{y>0}.


(Note: This is known as the log-normal distribution.)

3. A random sample from the Γ(k, τ)-distribution (where k ∈ N is considered known, so that we are looking for the MLE of τ only). (Note: The pdf of a Γ(k, τ) distribution for k ∈ N is f(y) = y^{k−1} e^{−y/τ} / (τ^k (k−1)!) 1_{y>0}.)

4. a random sample of size 1 (single observation) from a normal distribution where both the mean and the standard deviation are unknown, but are known to be equal to each other (with value θ), i.e. Y1 ∼ N(θ, θ). Is the obtained estimator (the MLE) unbiased?

Problem 11.5.2. Let Y1,..., Yn be a random sample from the geometric g(p) distribution with the unknown parameter p ∈ (0, 1). The MLE for p is

(a) 1/(n + Y¯)  (b) ∏i Yi/(1 + Yi)  (c) Y¯/(n + Y¯)  (d) ∑i Yi/(n + ∑i Yi)  (e) none of the above

Problem 11.5.3. Let Y1,..., Yn be a random sample from the normal distribution with a known mean µ = 0 and an unknown standard deviation σ. The MLE (maximum likelihood estimator) for σ is

(a) (1/n) ∑_{i=1}^n Yi  (b) √((1/n) ∑_{i=1}^n Yi^2)  (c) √((1/(n−1)) ∑_{i=1}^n (Yi − Y¯)^2)  (d) √((1/n) ∑_{i=1}^n (Yi − Y¯)^2)  (e) none of the above

Problem 11.5.4. Let Y1,..., Yn be a random sample from the Poisson P(λ) distribution with an unknown parameter λ > 0. The MLE for λ is

(a) ∑_{i=1}^n Yi

(b) ∑_{i=1}^n Yi^2

(c) ∑_{i=1}^n log(Yi)

(d) (1/n) ∑_{i=1}^n log(Yi)

(e) none of the above

Problem 11.5.5. In all of the below, let (Y1,..., Yn) be a random sample from the stated distribution. Show that T is a sufficient statistic for the unknown parameters:

1. E(τ), T = Y¯

2. g(p), T = ∑i Yi.

3. Y1,..., Yn are a random sample from a distribution with pdf

fY(y) = c(β) y^3 (1 − y)^β 1_{0<y<1},

with an unknown parameter β > 0, where c(β) is the constant chosen so that ∫_0^1 fY(y) dy = 1 (no need to compute it), and T = ∑i log(1 − Yi).


Problem 11.5.6. Let Y1,..., Yn be a random sample from a distribution with pdf f(y) = (1/2) θ^3 y^2 e^{−θy} 1_{y>0}, where θ > 0 is an unknown parameter. Which one of the following is not a sufficient statistic for θ?

(a) Y¯  (b) 1/Y¯  (c) ∏_{i=1}^n Yi^2  (d) ∏_{i=1}^n e^{Yi}  (e) all of the above are sufficient

Problem 11.5.7. Let Y1,..., Yn be a random sample from the uniform distribution U(0, θ) where θ > 0 is an unknown parameter. Then

(a) Y¯ is an MLE for θ (b) Y¯ is a sufficient statistic for θ

(c) min(Y1,..., Yn) is an MLE for θ

(d) max(Y1,..., Yn) is a sufficient statistic for θ (e) none of the above

Problem 11.5.8. Let Y1,..., Yn be a random sample from a normal distribution with the known mean µ = 0 and an unknown standard deviation σ > 0.

1. Write down the likelihood function and find a sufficient statistic for σ.

2. What is the MLE σˆ for σ?

3. Is σˆ^2 (where σˆ is as in 2.) an unbiased estimator for σ^2?

Problem 11.5.9. We consider the normal model here, as in Problem 11.5.8 above, but now both µ and σ are unknown. This makes the parameter θ = (µ, σ) two-dimensional.

1. Write down the likelihood function L(µ, σ; y1,..., yn), the log-likelihood function and the equations for the MLE for (µ, σ). Solve them. (Note: Since there are two parameters, the MLE θˆ = (µˆ, σˆ) itself will be two-dimensional, i.e., there are going to be two MLEs, one for µ and one for σ. There are also going to be two equations, one obtained by differentiating log L in µ and setting the result to 0, and the other obtained by differentiating in σ.)

2. With µˆ and σˆ as above, is µˆ unbiased for µ? How about σˆ^2? Is it unbiased for σ^2?

3. The notion of sufficiency in this case needs to be updated in a similar way. We will need two sufficient statistics instead of the usual one to completely summarize the information in the sample. Show that the pair (Y¯, S′^2) is a (two-dimensional) sufficient statistic for (µ, σ). (Note: The factorization criterion still applies. You simply need to factorize into a


function g of µ, σ, y¯ = (1/n) ∑i yi and s′^2 = (1/n) ∑_{i=1}^n (yi − y¯)^2, and a function h which does not depend on the parameters.) (Hint: First show and then use the following algebraic identity: (1/n) ∑i yi^2 = s′^2 + y¯^2.)

Last Updated: September 25, 2019