Loglikelihood and Confidence Intervals

Stat 504, Lecture 3 1 Stat 504, Lecture 3 2 ' $ ' $ Review (contd.): The likelihood of the sample is the joint PDF (or Loglikelihood and PMF) Confidence Intervals n L(θ) = f(x1, .., xn; θ) = f(xi; θ) i=1 Y Review: The maximum likelihood estimate (MLE) θˆMLE maximizes L(θ): Let X1, X2, ..., Xn be a simple random sample from a probability distribution f(x; θ). L(θˆMLE ) L(θ), θ ≥ ∀ A parameter θ of f(x; θ) is a variable that is If we use Xi’s instead of xi’s then θˆ is the maximum characteristic of f(x; θ). likelihood estimator. A statistic T is any quantity that can be calculated Usually, MLE: from a sample; it’s a function of X1, ..., Xn. is unbiased, E(θˆ) = θ An estimate θˆ for θ is a single number that is a • reasonable value for θ. is consistent, θˆ θ, as n • → → ∞ An estimator θˆ for θ is a statistic that gives the is efficient, has small SE(θˆ) as n • → ∞ formula for computing the estimate θˆ. ˆ− is asymptotically normal, (θ θ) N(0, 1) • SE(θˆ) ≈ & % & % Stat 504, Lecture 3 3 Stat 504, Lecture 3 4 ' $ ' $ The loglikelihood function is defined to be the natural logarithm of the likelihood function, l(θ ; x) = log L(θ ; x). The loglikelihood, however, is the sum of the For a variety of reasons, statisticians often work with individual loglikelihoods: the loglikelihood rather than with the likelihood. One reason is that l(θ ; x) tends to be a simpler function l(θ ; x) = log f(x ; θ) than L(θ ; x). When we take logs, products are n = log f(xi ; θ) changed to sums. If X = (X1, X2, . , Xn) is an iid i=1 sample from a probability distribution f(x; θ), the n Y overall likelihood is the product of the likelihoods for = log f(xi ; θ) i=1 the individual Xi’s: Xn L(θ ; x) = f(x ; θ) = l(θ ; xi). i=1 n X = f(xi ; θ) Below are some examples of loglikelihood functions. i=1 Yn = L(θ ; xi) i=1 Y & % & % Stat 504, Lecture 3 5 Stat 504, Lecture 3 6 ' $ ' $ Poisson. Suppose X = (X1, X2, . , Xn) is an iid sample from a Poisson distribution with parameter λ. The likelihood is Binomial. Suppose X Bin(n, p) where n is known. n − ∼ λxi e λ The likelihood function is L(λ ; x) = xi! i=1 n! x n−x Y n x −nλ L(p ; x) = p (1 p) λ i=1 i e x! (n x)! − = P , − x1! x2! xn! and so the loglikelihood function is · · · and the loglikelihood is l(p ; x) = k + x log p + (n x) log(1 p), − − n l(λ ; x) = xi log λ nλ, where k is a constant that doesn’t involve the − i=1 ! parameter p. In the future we will omit the constant, X ignoring the constant terms that don’t depend on λ. because it’s statistically irrelevant. As the above examples show, l(θ ; x) often looks nicer than L(θ ; x) because the products become sums and the exponents become multipliers. & % & % Stat 504, Lecture 3 7 Stat 504, Lecture 3 8 ' $ ' $ Asymptotic confidence intervals In-Class Exercise: Suppose that we observe X = 1 from a binomial distribution with n = 4 and p Loglikelihood forms the basis for many approximate unknown. Calculate the loglikelihood. What does the confidence intervals and hypothesis tests, because it graph of likelihood look like? Find the MLE (do you behaves in a predictable manner as the sample size understand the difference between the estimator and grows. The following example illustrates what the estimate)? Locate the MLE on the graph of the happens to l(θ ; x) as n becomes large. likelihood. & % & % Stat 504, Lecture 3 9 Stat 504, Lecture 3 10 ' $ ' $ The MLE is pˆ = 1/4 = .25. Ignoring constants, the loglikelihood is l(p ; x) = log p + 3 log(1 p), − (For clarity, I omitted from the plot all values of p which looks like this: beyond .8, because for p > .8 the loglikelihood drops down so low that including these values of p would -2.5 distort the plot’s appearance. When plotting -3.0 loglikelihoods, we don’t need to include all θ values in -3.5 the parameter space; in fact, it’s a good idea to limit l(p;x) -4.0 the domain to those θ’s for which the loglikelihood is -4.5 no more than 2 or 3 units below the maximum value -5.0 l(θˆ; x) because, in a single-parameter problem, any θ 0.0 0.2 0.4 0.6 0.8 1.0 p whose loglikelihood is more than 2 or 3 units below Here is a sample code for plotting this function in R: the maximum is highly implausible.) p<-seq(from=.01,to=.80,by=.01) loglik<-log(p) + 3*log(1-p) plot(p,loglik,xlab="",ylab="",type="l",xlim=c(0,1)) & % & % Stat 504, Lecture 3 11 Stat 504, Lecture 3 12 ' $ 'As n gets larger, two things are happening to the $ Now suppose that we observe X = 10 from a binomial loglikelihood. First, l(p ; x) is becoming more sharply distribution with n = 40. The MLE is again peaked around pˆ. Second, l(p ; x) is becoming more pˆ = 10/40 = .25, but the loglikelihood is symmetric about pˆ. l(p ; x) = 10 log p + 30 log(1 p), − The first point shows that as the sample size grows, we are becoming more confident that the true -22.5 parameter lies close to pˆ. If the loglikelihood is highly -23.0 peaked—that is, if it drops sharply as we move away -23.5 l(p;x) from the MLE—then the evidence is strong that p is -24.0 near pˆ. A flatter loglikelihood, on the other hand, -24.5 means that p is not well estimated and the range of -25.0 0.0 0.2 0.4 0.6 0.8 1.0 plausible values is wide. In fact, the curvature of the p Finally, suppose that we observe X = 100 from a loglikelihood (i.e. the second derivative of l(θ ; x) with binomial with n = 400. The MLE is still respect to θ) is an important measure of statistical pˆ = 100/400 = .25, but the loglikelihood is now information about θ. l(p ; x) = 100 log p + 300 log(1 p), The second point, that the loglikelihood function − becomes more symmetric about the MLE as the sample size grows, forms the basis for constructing -225.0 asymptotic (large-sample) confidence intervals for the -226.0 unknown parameter. In a wide variety of problems, as l(p;x) the sample size grows the loglikelihood approaches a -227.0 quadratic function (i.e. a parabola) centered at the MLE. -228.0 0.0 0.2 0.4 0.6 0.8 1.0 p & % & % Stat 504, Lecture 3 13 Stat 504, Lecture 3 14 'The parabola is significant because that is the shape $ ' $ of the loglikelihood from the normal distribution. If we had a random sample of any size from a normal distribution with known variance σ2 and unknown mean µ, the loglikelihood would be a perfect parabola n centered at the MLE µˆ = x¯ = i=1 xi/n. P From elementary statistics, we know that if we have a sample from a normal distribution with known The confidence interval (1) is valid because over variance σ2, a 95% confidence interval for the mean µ repeated samples the estimate x¯ is normally is distributed about the true value µ with a standard σ x¯ 1.96 . (1) deviation of σ/√n. ± √n The quantity σ/√n is called the standard error; it measures the variability of the sample mean x¯ about the true mean µ. The number 1.96 comes from a table of the standard normal distribution; the area under the standard normal density curve between 1.96 and 1.96 is .95 or 95%. − & % & % Stat 504, Lecture 3 15 Stat 504, Lecture 3 16 ' $ ' $ There is much confusion about how to interpret a confidence interval (CI). In Stat 504, the parameter of interest will not be the mean of a normal population, but some other A CI is NOT a probability statement about θ since θ parameter θ pertaining to a discrete probability is a fixed value, not a random variable. distribution. One interpretation: if we took many samples, most of our intervals would capture true parameter (e.g. 95% We will often estimate the parameter by its MLE θˆ. of out intervals will contain the true parameter). Example: The nationwide telephone poll was But because in large samples the loglikelihood conducted by NY Times/CBS News between Jan. function l(θ ; x) approaches a parabola centered at θˆ, 14-18. with 1118 adults. About 58% of respondents we will be able to use a method similar to (1) to form feel optimistic about next four years. The results are approximate confidence intervals for θ. reported with a margin of error of 3%. & % & % Stat 504, Lecture 3 17 Stat 504, Lecture 3 18 ' $ ' $ Just as x¯ is normally distributed about µ, θˆ is approximately normally distributed about θ in large samples. This property is called the “asymptotic normality of the MLE,” and the technique of forming confidence Of course, we can also form intervals with confidence intervals is called the “asymptotic normal coefficients other than 95%. approximation.” This method works for a wide variety of statistical models, including all the models All we need to do is to replace 1.96 in (2) by z, a value that we will use in this course.

Loglikelihood and Confidence Intervals

Fast Inference in Generalized Linear Models Via Expected Log-Likelihoods

Empirical Bayes Methods for Combining Likelihoods Bradley EFRON

Stat 3701 Lecture Notes: Statistical Models, Part II

Observed Information Matrix for MUB Models

Method for Computation of the Fisher Information Matrix in the Expectation–Maximization Algorithm

Partially Observed Information and Inference About Non-Gaussian

Maximum Likelihood Estimation ∗ Contents

Curvature and Inference for Maximum Likelihood Estimates

Generalized Linear Models

Estimating Standard Errors and Efficient Goodness-Of-Fit Tests For

Generalized Linear Models Lecture 2

Arxiv:1905.09722V1 [Stat.ME] 23 May 2019 Sample Size Is ﬁxed, and the Second Stage Sample Size Is Large