8 | Maximum likelihood estimation

Maximum likelihood (ML) estimation is a general principle for deriving point estimators in probabilistic models. ML estimation was popularized by Fisher at the beginning of the 20th century, but it already found application in the works of Laplace (1749-1827) and Gauss (1777-1855) (Aldrich, 1997). ML estimation is based on the following intuition: the most likely parameter value of a probabilistic model that generated an observed data set should be that parameter value for which the probability of the data under the model is maximal. In this Section, we first make this intuition more precise and introduce the notions of (log) likelihood functions and ML estimators (Section 8.1). We then exemplify the ML approach by discussing ML parameter estimation for univariate Gaussian samples (Section 8.2). Finally, we consider ML parameter estimation for the GLM, relate it to ordinary least squares estimation, and introduce the restricted ML estimator for the variance parameter of the GLM (Section 8.3).

8.1 Likelihood functions and maximum likelihood estimators

Likelihood functions

The fundamental idea of ML estimation is to select, as a point estimate of the true, but unknown, parameter value that gave rise to the data, that parameter value which maximizes the probability of the data under the model of interest. To implement this intuition, the notions of the likelihood function and of its maximization are invoked. To introduce the likelihood function, consider a parametric probabilistic model $p_\theta(y)$ which specifies the probability distribution of a random entity $y$. Here, $y$ models data and $\theta$ denotes the model's parameter with parameter space $\Theta$. Given a parametric probabilistic model $p_\theta(y)$, the function

$$ L_y : \Theta \to \mathbb{R}_{\geq 0}, \quad \theta \mapsto L_y(\theta) := p_\theta(y) \tag{8.1} $$

is called the likelihood function of the parameter $\theta$ for the data $y$. Note that the specific nature of $\theta$ and $y$ is left unspecified, i.e., $\theta$ and $y$ may be scalars, vectors, or matrices. Notably, the likelihood function is a function of the parameter $\theta$, while it also depends on $y$. Because $y$ is a random entity, different data samples from the probabilistic model $p_\theta(y)$ result in different likelihood functions. In this sense, there is a distribution of likelihood functions for each probabilistic model, but once a data realization has been obtained, the likelihood function is a (deterministic) function of the parameter value only. This is in stark contrast with PDFs and PMFs, which are functions of the random variable's outcome values (Section 5 | Probability and random variables). Stated differently, the input argument of a PDF or PMF is the value of a random variable, and the output of a PDF or PMF is the probability density or mass of this value for a fixed value of the model's parameter. In contrast, the input argument of a likelihood function is a parameter value, and the output of the likelihood function is the probability density or mass of a fixed value of the random variable modelling the data, evaluated at this parameter value under the probability model of interest. If the random variable value and the parameter value submitted to a model's PDF or PMF and to its corresponding likelihood function are identical, then so are the outputs of both functions. It is thus the functional dependencies that distinguish likelihood functions from PDFs and PMFs, not their functional form.

Maximum likelihood estimators

The ML estimator of a given probabilistic model $p_\theta(y)$ is that parameter value which maximizes the likelihood function.
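Before stating this formally, the following minimal sketch illustrates the notion of a likelihood function numerically. It assumes a Gaussian model with known unit variance; the hypothetical data vector y, the helper function likelihood, and the parameter grid are illustrative choices and not part of the text.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical data realization y (values chosen for illustration only)
y = np.array([1.2, 0.4, 2.1, 1.7, 0.9])

# Likelihood function L_y(theta) of the expectation parameter theta for a
# Gaussian model with known variance sigma^2 = 1 (cf. eq. (8.1)):
# for fixed data y, it maps each candidate theta to p_theta(y).
def likelihood(theta, y, sigma=1.0):
    return np.prod(norm.pdf(y, loc=theta, scale=sigma))

# Evaluate the likelihood function on a grid of parameter values
thetas = np.linspace(-1.0, 3.0, 401)
L = np.array([likelihood(t, y) for t in thetas])

# The grid value with the highest likelihood approximates the ML estimate,
# which for this model is the sample mean (anticipating Section 8.2)
print(thetas[np.argmax(L)], y.mean())
```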
Formally, the ML estimator can be expressed as

$$ \hat{\theta}_{\mathrm{ML}} := \arg\max_{\theta \in \Theta} L_y(\theta). \tag{8.2} $$

Eq. (8.2) should be read as follows: $\hat{\theta}_{\mathrm{ML}}$ is defined as that argument of the likelihood function $L_y$ for which $L_y(\theta)$ assumes its maximal value over all possible parameter values $\theta$ in the parameter space $\Theta$. Note that from a mathematical viewpoint, the above definition is not overly general, because it tacitly assumes that $L_y$ in fact has a maximizing argument and that this argument is unique. Also note that instead of values for $\hat{\theta}_{\mathrm{ML}}$, one is often interested in functional forms that express $\hat{\theta}_{\mathrm{ML}}$ as a function of the data $y$. Concrete numerical values of $\hat{\theta}_{\mathrm{ML}}$ are referred to as ML estimates, while functional forms of $\hat{\theta}_{\mathrm{ML}}$ are referred to as ML estimators.

There are essentially two approaches to ML estimation. The first approach aims to obtain functional forms of ML estimators (sometimes referred to as closed-form solutions) by analytically maximizing the likelihood function with respect to $\theta$. The second approach, often encountered in applied computing, systematically varies $\theta$ for a given observation of $y$ while monitoring the numerical value of the likelihood function. Once this value appears to be maximal, the variation of $\theta$ stops, and the resulting parameter value is used as an ML estimate. In the following, we consider the first approach, which is of immediate relevance for basic parameter estimation in the GLM, in more detail.

From Section 3 | Calculus, we know that candidate values for the ML estimator $\hat{\theta}_{\mathrm{ML}}$ fulfil the requirement

$$ \frac{d}{d\theta} L_y(\theta) \Big|_{\theta = \hat{\theta}_{\mathrm{ML}}} = 0. \tag{8.3} $$

Eq. (8.3) is known as the likelihood equation and should be read as follows: at the location of $\hat{\theta}_{\mathrm{ML}}$, the derivative $\frac{d}{d\theta} L_y$ of the likelihood function with respect to $\theta$ is equal to zero. If $\theta \in \mathbb{R}^p$ with $p > 1$, the statement implies that at the location of $\hat{\theta}_{\mathrm{ML}}$, the gradient $\nabla L_y$ with respect to $\theta$ is equal to the zero vector $0_p$. Clearly, eq. (8.3) corresponds to the necessary condition for extrema of functions. By evaluating the required derivatives of the likelihood function and setting them to zero, one may thus obtain a set of equations which can hopefully be solved for an ML estimator.

The log likelihood function

To simplify the analytical approach for finding ML estimators as sketched above, one usually considers the logarithm of the likelihood function, the so-called log likelihood function. The log likelihood function is defined as (cf. eq. (8.1))

$$ \ell_y : \Theta \to \mathbb{R}, \quad \theta \mapsto \ell_y(\theta) := \ln L_y(\theta) = \ln p_\theta(y). \tag{8.4} $$

Because the logarithm is a monotonically increasing function, the location in parameter space at which the likelihood function assumes its maximal value corresponds to the location in parameter space at which the log likelihood function assumes its maximal value. Using either the likelihood function or the log likelihood function to find a maximum likelihood estimator is thus equivalent, as both will identify the same maximizing value (if it exists). The use of log likelihood functions instead of likelihood functions in ML estimation is primarily of a pragmatic nature: first, probabilistic models often involve PDFs with exponential terms that are resolved by the log transform. Second, independence assumptions often give rise to factorized probability distributions which are simplified to sums by the log transform. Finally, from a numerical perspective, one often deals with PDF or PMF values that are rather close to zero and that are mapped to a broader, numerically more convenient range by the log transform.
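To complement the analytical route followed below, the following sketch illustrates the second, numerical approach in combination with the log likelihood function. It is a minimal example under the assumption of a Gaussian model with unknown expectation and standard deviation; the hypothetical data values, the log parameterization of the standard deviation, and the use of scipy.optimize.minimize are illustrative choices rather than prescriptions from the text.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Hypothetical data realization
y = np.array([1.2, 0.4, 2.1, 1.7, 0.9])

# Negative log likelihood of a Gaussian model with parameters
# theta = (mu, ln(sigma)); the log parameterization keeps sigma positive.
# Independence turns the log of the product of densities into a sum.
def neg_log_likelihood(theta, y):
    mu, log_sigma = theta
    return -np.sum(norm.logpdf(y, loc=mu, scale=np.exp(log_sigma)))

# Maximizing the log likelihood is implemented as minimizing its negative
result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), args=(y,))
mu_ml, sigma_ml = result.x[0], np.exp(result.x[1])

# For this model, the analytical ML estimators are the sample mean and the
# (biased) sample variance, which the numerical estimates should reproduce
print(mu_ml, y.mean())
print(sigma_ml ** 2, y.var())
```

Minimizing the negative log likelihood rather than maximizing the likelihood directly reflects the numerical advantages of the log transform discussed above.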
In analogy to eq. (8.3), the log likelihood equation for the maximum likelihood estimator is given by

$$ \frac{d}{d\theta} \ell_y(\theta) \Big|_{\theta = \hat{\theta}_{\mathrm{ML}}} = 0. \tag{8.5} $$

Like eq. (8.3), the log likelihood equation can be extended to multivariate $\theta$ in terms of the gradient of $\ell_y$, and like eq. (8.3), it can be solved for $\hat{\theta}_{\mathrm{ML}}$.

We next aim to exemplify the idea of ML estimation in a first example (Section 8.2). To do so, we first discuss two additional assumptions that simplify the application of the ML approach considerably: the assumption of a concave log likelihood function and the assumption of independent data random variables with associated PDFs. Finally, we summarize the ML method in a recipe-like manner.

Concave log likelihood functions

If the log likelihood function is concave, then the necessary condition for a maximum of the log likelihood function is also sufficient. Recall that a multivariate real-valued function $f : \mathbb{R}^n \to \mathbb{R}$ is called concave if, for all input arguments $a, b \in \mathbb{R}^n$, the straight line connecting $f(a)$ and $f(b)$ lies below the function's graph. Formally,

$$ f(ta + (1 - t)b) \geq t f(a) + (1 - t) f(b) \quad \text{for } a, b \in \mathbb{R}^n \text{ and } t \in [0, 1]. \tag{8.6} $$

Here, $ta + (1 - t)b$ for $t \in [0, 1]$ describes a straight line in the domain of the function, while $t f(a) + (1 - t) f(b)$ for $t \in [0, 1]$ describes a straight line in the range of the function. Leaving mathematical subtleties aside, it is roughly correct that concave functions have a single maximum, or in other words, that a critical point at which the gradient vanishes is guaranteed to be a maximum of the function. Thus, if the log likelihood function is concave, finding a parameter value for which the log likelihood equation holds is sufficient to identify a maximum at this location. In principle, whenever applying the ML method based on the log likelihood equation, it is thus necessary to show that the log likelihood function is concave and that the necessary condition for a maximum is hence also sufficient. However, such an approach is beyond the level of rigour adopted herein, and we content ourselves with stating without proof that the log likelihood functions of interest in the following are concave.

Independent data random variables with probability density functions

A second assumption that simplifies the application of the ML method is the assumption of independent data random variables with associated PDFs.
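As a minimal numerical illustration of why this assumption is convenient, anticipating the factorization-to-sum property of the log transform mentioned above, the following sketch checks that, for independent data, the logarithm of the factorized joint density equals the sum of the individual log densities. The Gaussian model and the data values are again hypothetical choices for illustration.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical sample of independent, identically distributed Gaussian data
y = np.array([1.2, 0.4, 2.1, 1.7, 0.9])
mu, sigma = 1.0, 1.0

# Under independence, the joint density p_theta(y) factorizes over
# observations into a product of univariate densities ...
L_joint = np.prod(norm.pdf(y, loc=mu, scale=sigma))

# ... and the log transform turns this product into a sum of log densities
ell_sum = np.sum(norm.logpdf(y, loc=mu, scale=sigma))

# Both routes yield the same log likelihood value
print(np.log(L_joint), ell_sum)
```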