


Maximum Likelihood. Class Notes. Manuel Arellano. Revised: January 28, 2020.

1 Likelihood models

Given some data $w_1, \dots, w_n$, a likelihood model is a family of density functions $\{f_n(w_1, \dots, w_n; \theta),\ \theta \in \Theta \subset \mathbb{R}^p\}$ that is specified as a probability model for the observed data. If the data are a random sample from some population with density function $g(w_i)$, then $f_n(w_1, \dots, w_n; \theta)$ is a model for $\prod_{i=1}^{n} g(w_i)$. If the model is correctly specified, $f_n(w_1, \dots, w_n; \theta) = \prod_{i=1}^{n} f(w_i, \theta)$ and $g(w_i) = f(w_i, \theta_0)$ for some value $\theta_0$ in $\Theta$. The function $f$ may be a conditional or an unconditional density. It may be flexible enough to be unrestricted or it may place restrictions on the form of the population density.¹

In the likelihood function $\prod_{i=1}^{n} f(w_i, \theta)$, the data are given and $\theta$ is the argument. Usually we work with the log likelihood function:

$$L(\theta) = \sum_{i=1}^{n} \ln f(w_i, \theta). \qquad (1)$$

$L(\theta)$ measures the ability of different densities within the pre-specified class to generate the data. The maximum likelihood estimator (MLE) is the value of $\theta$ associated with the largest likelihood:

$$\hat{\theta} = \arg\max_{\theta \in \Theta} L(\theta). \qquad (2)$$

Example 1: Regression under normality. The model is $y \mid X \sim \mathcal{N}(X\beta_0, \sigma_0^2 I_n)$, so that
$$f_n(y_1, \dots, y_n \mid x_1, \dots, x_n; \theta) = \prod_{i=1}^{n} f(y_i \mid x_i; \theta)$$
where $\theta = (\beta', \sigma^2)'$, $\theta_0 = (\beta_0', \sigma_0^2)'$ and
$$f(y_i \mid x_i; \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2}\left( y_i - x_i'\beta \right)^2 \right\} \qquad (3)$$
$$L(\theta) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left( y_i - x_i'\beta \right)^2. \qquad (4)$$
The MLE $\hat{\theta} = (\hat{\beta}', \hat{\sigma}^2)'$ consists of the OLS estimator $\hat{\beta}$ and the residual variance $\hat{\sigma}^2$ without degrees-of-freedom adjustment. Letting $\hat{u} = y - X\hat{\beta}$ we have:
$$\hat{\beta} = (X'X)^{-1}X'y, \qquad \hat{\sigma}^2 = \frac{\hat{u}'\hat{u}}{n}. \qquad (5)$$

¹ This note follows the practice of using the term density both for continuous random variables and for the probability function of discrete random variables; as, for example, in David Cox, Principles of Statistical Inference, 2006.
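For concreteness, here is a minimal numerical sketch of (4)-(5): it computes the OLS/MLE formulas and checks that they attain a log likelihood at least as large as the one at the data-generating values. The simulated data and sample size are assumptions made purely for the illustration.

```python
import numpy as np

# Minimal sketch of (5): the Gaussian MLE in the linear regression model is OLS
# plus the residual variance without degrees-of-freedom adjustment.
rng = np.random.default_rng(0)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta0, sigma0 = np.array([1.0, 0.5, -0.3]), 1.5
y = X @ beta0 + sigma0 * rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X'X)^{-1} X'y
u_hat = y - X @ beta_hat
sigma2_hat = u_hat @ u_hat / n                 # no degrees-of-freedom correction

def loglik(beta, sigma2):
    """Log likelihood (4) evaluated at (beta, sigma2)."""
    e = y - X @ beta
    return -0.5 * n * np.log(2 * np.pi) - 0.5 * n * np.log(sigma2) - e @ e / (2 * sigma2)

# The closed form (5) should attain a (weakly) larger value of (4) than any other candidate.
assert loglik(beta_hat, sigma2_hat) >= loglik(beta0, sigma0**2)
print(beta_hat, sigma2_hat)
```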

Example 2: Logit regression. There is a binary dependent variable $y_i$ that takes only two values, 0 and 1. Therefore, in this case $f(y_i \mid x_i) = p_i^{y_i}(1 - p_i)^{1 - y_i}$ where $p_i = \Pr(y_i = 1 \mid x_i)$. In the logit model the log odds ratio depends linearly on $x_i$:

$$\ln\left( \frac{p_i}{1 - p_i} \right) = x_i'\theta,$$
so that $p_i = \Lambda(x_i'\theta)$ where $\Lambda$ is the logistic cdf $\Lambda(r) = 1/(1 + \exp(-r))$.
Assuming that $f_n(y_1, \dots, y_n \mid x_1, \dots, x_n; \theta) = \prod_{i=1}^{n} f(y_i \mid x_i; \theta)$, the log likelihood function is
$$L(\theta) = \sum_{i=1}^{n}\left[ y_i \ln \Lambda(x_i'\theta) + (1 - y_i)\ln\left( 1 - \Lambda(x_i'\theta) \right) \right].$$
The first and second partial derivatives of $L(\theta)$ are:

$$\frac{\partial L(\theta)}{\partial\theta} = \sum_{i=1}^{n} x_i\left( y_i - \Lambda(x_i'\theta) \right), \qquad
\frac{\partial^2 L(\theta)}{\partial\theta\partial\theta'} = -\sum_{i=1}^{n} x_i x_i' \Lambda_i (1 - \Lambda_i).$$
Since the Hessian matrix is negative semidefinite, as long as it is nonsingular there exists a single maximum of the likelihood function.² The first order conditions $\partial L(\theta)/\partial\theta = 0$ are nonlinear, but we can find their root $\hat{\theta}$ using the Newton-Raphson method of successive approximations. Namely, we begin by finding the root $\theta_1$ of a linear approximation of $\partial L(\theta)/\partial\theta$ around some initial value $\theta_0$ and iterate the procedure until convergence:
$$\theta_{j+1} = \theta_j - \left[ \frac{\partial^2 L(\theta_j)}{\partial\theta\partial\theta'} \right]^{-1} \frac{\partial L(\theta_j)}{\partial\theta} \qquad (j = 0, 1, 2, \dots).$$
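As an illustration of the Newton-Raphson iteration above, the following sketch applies it to the logit score and Hessian on simulated data; the starting value, tolerance and data-generating parameters are assumptions made for the example.

```python
import numpy as np

def logit_mle(X, y, tol=1e-10, max_iter=50):
    """Newton-Raphson for the logit MLE: theta_{j+1} = theta_j - H(theta_j)^{-1} s(theta_j)."""
    theta = np.zeros(X.shape[1])              # initial value theta_0
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ theta))  # Lambda(x_i' theta)
        score = X.T @ (y - p)                 # first derivative of L(theta)
        hessian = -(X * (p * (1 - p))[:, None]).T @ X   # second derivative
        step = np.linalg.solve(hessian, score)
        theta = theta - step
        if np.max(np.abs(step)) < tol:        # stop when the update is negligible
            break
    return theta

# Simulated data, purely illustrative.
rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
theta_true = np.array([-0.5, 1.0])
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X @ theta_true))).astype(float)
print(logit_mle(X, y))
```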

Example 3: Duration data. In the analysis of a duration variable $Y$ it is common to study its hazard rate, which is given by $h(y) = f(y)/\Pr(Y \geq y)$. If $Y$ is discrete, $h(y) = \Pr(Y = y \mid Y \geq y)$, whereas if $Y$ is continuous, $h(y) = \lim_{\Delta y \to 0} \Pr(y \leq Y < y + \Delta y \mid Y \geq y)/\Delta y = f(y)/[1 - F(y)]$.³ A simple model is to assume that the hazard function is constant: $h(y) = \lambda$ with $\lambda > 0$, which for a continuous $Y$ gives rise to the exponential distribution with pdf $f(y) = \lambda\exp(-\lambda y)$ and cdf $F(y) = 1 - \exp(-\lambda y)$. A popular conditional model is the proportional hazard specification: $h(y, x) = \lambda(y)\varphi(x)$. A special case is an exponential model in which the log hazard depends linearly on $x_i$:

$$\ln h(y_i, x_i) = x_i'\theta.$$
Here the log likelihood of a sample of conditionally independent complete and incomplete durations is

$$L(\theta) = \sum_{\text{complete}} \left( x_i'\theta - y_i e^{x_i'\theta} \right) + \sum_{\text{incomplete}} \left( - y_i e^{x_i'\theta} \right)
= \sum_{i=1}^{n}\left[ (1 - c_i)\left( x_i'\theta - y_i e^{x_i'\theta} \right) - c_i\, y_i e^{x_i'\theta} \right]$$
where $c_i = 0$ if a duration is complete and $c_i = 1$ if not. The contribution to the log likelihood of a complete duration is $\ln f(y_i)$, whereas the contribution of an incomplete duration is $\ln \Pr(Y > y_i)$.
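A minimal sketch of this censored exponential likelihood, maximized numerically with scipy's general-purpose optimizer; the simulated durations and the fixed censoring point are assumptions made for the illustration.

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik(theta, X, y, c):
    """Negative of the censored exponential log likelihood above.
    c[i] = 0 for a complete duration, 1 for an incomplete (right-censored) one."""
    lam = np.exp(X @ theta)                      # hazard exp(x_i' theta)
    return -np.sum((1 - c) * (X @ theta - y * lam) - c * y * lam)

# Simulated data, purely illustrative: exponential durations censored at 2.
rng = np.random.default_rng(2)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
theta_true = np.array([-0.2, 0.5])
latent = rng.exponential(1.0 / np.exp(X @ theta_true))
y = np.minimum(latent, 2.0)
c = (latent > 2.0).astype(float)

res = minimize(neg_loglik, x0=np.zeros(2), args=(X, y, c), method="BFGS")
print(res.x)
```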

² To see that the Hessian is negative semidefinite, note that, using the decomposition $\kappa_i'\kappa_i = \Lambda_i(1 - \Lambda_i)$ with $\kappa_i' = \left( (1 - \Lambda_i)\Lambda_i^{1/2},\ \Lambda_i(1 - \Lambda_i)^{1/2} \right)$, the Hessian can be written as $\partial^2 L(\theta)/\partial\theta\partial\theta' = -\sum_{i=1}^{n} x_i \kappa_i'\kappa_i x_i'$.

³ The hazard rate fully characterizes the distribution of $Y$. To recover $F(y)$ from $h(y)$ note that $\ln[1 - F(y)] = \sum_{s=1}^{y} \ln[1 - h(s)]$ in the discrete case, and $\ln[1 - F(y)] = \int_0^y [-h(s)]\, ds$ in the continuous one.

What does $\hat{\theta}$ estimate? The population counterpart of the sample calculation (2) is

$$\theta_0 = \arg\max_{\theta \in \Theta} E\left[ \ln f(W, \theta) \right] \qquad (6)$$
where $W$ is a random variable with density $g(w)$, so that $E[\ln f(W, \theta)] = \int \ln f(w, \theta)\, g(w)\, dw$.

If the population density $g(w)$ belongs to the $f(w, \theta)$ family, then $\theta_0$ as defined in (6) is the true value. In effect, due to Jensen's inequality:

$$\int \ln f(w, \theta)\, f(w, \theta_0)\, dw - \int \ln f(w, \theta_0)\, f(w, \theta_0)\, dw
= \int \ln\left( \frac{f(w, \theta)}{f(w, \theta_0)} \right) f(w, \theta_0)\, dw
\leq \ln \int \frac{f(w, \theta)}{f(w, \theta_0)}\, f(w, \theta_0)\, dw = \ln \int f(w, \theta)\, dw = \ln 1 = 0.$$
Thus, when $g(w) = f(w, \theta_0)$ we have

$$E[\ln f(W, \theta)] - E[\ln f(W, \theta_0)] \leq 0 \quad \text{for all } \theta.$$
The value $\theta_0$ can be interpreted more generally as follows:⁴
$$\theta_0 = \arg\min_{\theta \in \Theta} E\left[ \ln \frac{g(W)}{f(W, \theta)} \right]. \qquad (7)$$
The quantity $E\left[ \ln \frac{g(W)}{f(W,\theta)} \right] \equiv \int \ln\left( \frac{g(w)}{f(w,\theta)} \right) g(w)\, dw$ is the Kullback-Leibler divergence (KLD) from $f(w, \theta)$ to $g(w)$. The KLD is the expected log difference between $g(w)$ and $f(w, \theta)$ when the expectation is taken using $g(w)$. Thus, $f(w, \theta_0)$ can be regarded as the best approximation to $g(w)$ in the class $f(w, \theta)$ when the approximation is understood in the KLD sense.

If $g(w) = f(w, \theta_0)$ then $\theta_0$ is called the "true value". If $g(w)$ does not belong to the $f(w, \theta)$ class and $f(w, \theta_0)$ is just the best approximation to $g(w)$ in the KLD sense, then $\theta_0$ is called a "pseudo-true value". The extent to which a pseudo-true value remains an interesting quantity is model specific. For example, (3) is a restrictive model of the conditional distribution of $y_i$ given $x_i$, first because it assumes that the dependence of $y_i$ on $x_i$ occurs exclusively through the conditional mean $E(y_i \mid x_i)$, and secondly because this conditional mean is assumed to be a linear function of $x_i$. However, if $y_i$ depends on $x_i$ in other ways, for example through the conditional variance, the parameter values $\theta_0 = (\beta_0', \sigma_0^2)'$ remain interpretable quantities: $\beta_0$ as $\partial E(y_i \mid x_i)/\partial x_i$ and $\sigma_0^2$ as the unconditional variance of the errors $u_i = y_i - x_i'\beta_0$. If $E(y_i \mid x_i)$ is a nonlinear function, $\beta_0$ and $\sigma_0^2$ can only be characterized as the linear projection coefficient vector and the linear projection error variance, respectively.
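A small numerical sketch of the pseudo-true value idea: fitting the normal linear model by Gaussian pseudo maximum likelihood when the true errors are heteroskedastic still recovers the linear projection coefficients and the unconditional error variance. The data-generating process below is an assumption made purely for the illustration.

```python
import numpy as np

# Pseudo-true values under misspecification: the Gaussian-likelihood fit of a linear
# model converges to the linear projection coefficients even when the true conditional
# distribution is heteroskedastic (so the normal likelihood is misspecified).
rng = np.random.default_rng(3)
n = 200_000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
u = (1.0 + 0.8 * np.abs(x)) * rng.normal(size=n)   # conditional variance depends on x
y = 1.0 + 2.0 * x + u

beta_pml = np.linalg.solve(X.T @ X, X.T @ y)        # Gaussian PML = OLS
sigma2_pml = np.mean((y - X @ beta_pml) ** 2)       # unconditional error variance

print(beta_pml)     # close to the projection coefficients (1, 2)
print(sigma2_pml)   # close to E[(1 + 0.8|x|)^2]
```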

Pseudo maximum likelihood estimation. The estimator $\hat{\theta}$ is the maximum likelihood estimator under the assumption that $g(w)$ belongs to the $f(w, \theta)$ class. In the absence of this assumption, $\hat{\theta}$ is a pseudo-maximum likelihood (PML) estimator based on the $f(w, \theta)$ family of densities. Sometimes $\hat{\theta}$ is called a quasi-maximum likelihood estimator.

⁴ Since $E\left[ \ln \frac{g(W)}{f(W,\theta)} \right] = E[\ln g(W)] - E[\ln f(W, \theta)]$ and $E[\ln g(W)]$ does not depend on $\theta$, the arg min in (7) is the same as the arg max in (6).

2 Consistency and asymptotic normality of PML

Under regularity and identification conditions a PML estimator $\hat{\theta}$ is a consistent estimator of the

(pseudo) true value $\theta_0$. Since $\hat{\theta}$ may not have a closed form expression, we need a method for establishing the consistency of an estimator that maximizes an objective function. The following theorem, taken from Newey and McFadden (1994), provides such a method. The requirements are compactness of the parameter space, uniform convergence of the objective function to some nonstochastic continuous limit, and that the limiting objective function is uniquely maximized at the truth (identification).

Consistency Theorem. Suppose that $\hat{\theta}$ maximizes the objective function $S_n(\theta)$ in the parameter space $\Theta$. Assume the following:

(a) $\Theta$ is a compact set.

(b) The function Sn (θ) converges uniformly in probability to S0 (θ).

(c) S0 (θ) is continuous.

(d) S0 (θ) is uniquely maximized at θ0.

Then $\hat{\theta} \xrightarrow{p} \theta_0$.

In the PML context $S_n(\theta) = (1/n)\sum_{i=1}^{n} \ln f(w_i, \theta)$ and $S_0(\theta) = E[\ln f(W, \theta)]$. In particular, in the regression example, by the law of large numbers:⁵

$$S_0(\theta) = -\frac{1}{2}\ln(2\pi) - \frac{1}{2}\ln\sigma^2 - \frac{1}{2\sigma^2} E\left[ \left( y_i - x_i'\beta \right)^2 \right] \qquad (8)$$
and, noting that $y_i - x_i'\beta \equiv u_i - x_i'(\beta - \beta_0)$, also
$$S_0(\theta) = -\frac{1}{2}\ln(2\pi) - \frac{1}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\left[ \sigma_0^2 + (\beta - \beta_0)'\, E(x_i x_i')\, (\beta - \beta_0) \right]. \qquad (9)$$
In this example, $S_0(\theta)$ is uniquely maximized at $\theta_0$ as long as $E(x_i x_i')$ has full rank.

Asymptotic normality. To discuss asymptotic normality, in addition to the conditions required for consistency, we assume that $f(w_i, \theta)$ has first and second derivatives in a neighborhood of $\theta_0$, and that $\theta_0$ is an interior point of $\Theta$.⁶ For simplicity, use the notation $\ell_i(\theta) = \ln f(w_i, \theta)$ and $q_i(\theta) = \partial\ell_i(\theta)/\partial\theta$.

Note that if the data are iid, the score $q_i(\theta_0)$ is also iid with zero mean vector and covariance matrix

$$V = E\left[ \frac{\partial\ell_i(\theta_0)}{\partial\theta} \frac{\partial\ell_i(\theta_0)}{\partial\theta'} \right]. \qquad (10)$$

⁵ Given the equivalence in this case between pointwise and uniform convergence.

⁶ We can proceed as if $\hat{\theta}$ were an interior point of $\Theta$, since consistency of $\hat{\theta}$ for $\theta_0$ and the assumption that $\theta_0$ is interior to $\Theta$ imply that the probability that $\hat{\theta}$ is not interior goes to zero as $n \to \infty$.

Next, because of the central limit theorem we have

$$\frac{1}{\sqrt{n}} \frac{\partial L(\theta_0)}{\partial\theta} \equiv \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{\partial\ell_i(\theta_0)}{\partial\theta} \xrightarrow{d} \mathcal{N}(0, V). \qquad (11)$$
As for the Hessian, its convergence follows from the law of large numbers:

$$\frac{1}{n} \frac{\partial^2 L(\theta_0)}{\partial\theta\partial\theta'} \equiv \frac{1}{n} \sum_{i=1}^{n} \frac{\partial^2\ell_i(\theta_0)}{\partial\theta\partial\theta'} \equiv H_n(\theta_0) \xrightarrow{p} E\left[ \frac{\partial^2\ell_i(\theta_0)}{\partial\theta\partial\theta'} \right] \equiv H. \qquad (12)$$

We assume that $H$ is a non-singular matrix and that $H_n(\tilde{\theta}) \xrightarrow{p} H$ for any $\tilde{\theta}$ such that $\tilde{\theta} \xrightarrow{p} \theta_0$. Now, using the mean value theorem:
$$0 = \frac{\partial L(\hat{\theta})}{\partial\theta_j} = \frac{\partial L(\theta_0)}{\partial\theta_j} + \sum_{\ell=1}^{p} \frac{\partial^2 L(\tilde{\theta}_{[j]})}{\partial\theta_j\partial\theta_\ell} \left( \hat{\theta}_\ell - \theta_{0\ell} \right) \qquad (j = 1, \dots, p) \qquad (13)$$
where $\hat{\theta}_\ell$ is the $\ell$-th element of $\hat{\theta}$, and $\tilde{\theta}_{[j]}$ denotes a $p \times 1$ random vector such that $\|\tilde{\theta}_{[j]} - \theta_0\| \leq \|\hat{\theta} - \theta_0\|$.⁷

Note that $\hat{\theta} \xrightarrow{p} \theta_0$ implies $\tilde{\theta}_{[j]} \xrightarrow{p} \theta_0$ and also
$$\frac{1}{n} \frac{\partial^2 L(\tilde{\theta}_{[j]})}{\partial\theta_j\partial\theta_\ell} \xrightarrow{p} (j, \ell) \text{ element of } H,$$
which leads to the asymptotic linear representation of the estimation error:⁸

$$\sqrt{n}\left( \hat{\theta} - \theta_0 \right) = -H^{-1} \frac{1}{\sqrt{n}} \frac{\partial L(\theta_0)}{\partial\theta} + o_p(1). \qquad (14)$$
Finally, using (11) and Cramér's theorem we obtain:

$$\sqrt{n}\left( \hat{\theta} - \theta_0 \right) \xrightarrow{d} \mathcal{N}(0, W) \qquad (15)$$
where $W = H^{-1} V H^{-1}$ or, at length:

$$W = \left[ E\left( \frac{\partial^2\ell_i(\theta_0)}{\partial\theta\partial\theta'} \right) \right]^{-1} E\left[ \frac{\partial\ell_i(\theta_0)}{\partial\theta} \frac{\partial\ell_i(\theta_0)}{\partial\theta'} \right] \left[ E\left( \frac{\partial^2\ell_i(\theta_0)}{\partial\theta\partial\theta'} \right) \right]^{-1}. \qquad (16)$$

Asymptotic standard errors. A consistent estimator of the asymptotic variance matrix $W$ is:

$$\widehat{W} = \left[ \frac{1}{n}\sum_{i=1}^{n} \frac{\partial^2\ell_i(\hat{\theta})}{\partial\theta\partial\theta'} \right]^{-1} \left[ \frac{1}{n}\sum_{i=1}^{n} \frac{\partial\ell_i(\hat{\theta})}{\partial\theta} \frac{\partial\ell_i(\hat{\theta})}{\partial\theta'} \right] \left[ \frac{1}{n}\sum_{i=1}^{n} \frac{\partial^2\ell_i(\hat{\theta})}{\partial\theta\partial\theta'} \right]^{-1}. \qquad (17)$$

⁷ The expansion has to be made element by element since $\tilde{\theta}_{[j]}$ may be different for each $j$.

⁸ The notation $o_p(1)$ denotes a term that converges to zero in probability.
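As a concrete sketch of (17), the code below computes sandwich standard errors for the logit PML of Example 2 from the analytical score and Hessian; the simulated data and the use of a generic optimizer are assumptions made for the illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch of the sandwich estimator (17) for the logit model of Example 2.
rng = np.random.default_rng(4)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
theta_true = np.array([0.2, -0.7])
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X @ theta_true))).astype(float)

def negloglik(t):
    z = X @ t
    return np.sum(np.logaddexp(0.0, z) - y * z)    # minus the logit log likelihood

theta_hat = minimize(negloglik, np.zeros(2), method="BFGS").x

p = 1.0 / (1.0 + np.exp(-X @ theta_hat))
scores = X * (y - p)[:, None]                      # rows: d l_i / d theta at theta_hat
H_hat = -(X * (p * (1 - p))[:, None]).T @ X / n    # average Hessian
V_hat = scores.T @ scores / n                      # average outer product of scores
W_hat = np.linalg.inv(H_hat) @ V_hat @ np.linalg.inv(H_hat)   # sandwich (17)

se = np.sqrt(np.diag(W_hat) / n)                   # standard errors of theta_hat
print(theta_hat, se)
```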

3 The information matrix identity

As long as f (w, θ) is a density function it integrates to one:

$$\int f(w, \theta)\, dw = 1. \qquad (18)$$
Taking partial derivatives in (18) with respect to $\theta$ we get the zero mean property of the score:

$$\int \frac{\partial \ln f(w, \theta)}{\partial\theta}\, f(w, \theta)\, dw = 0. \qquad (19)$$
Next, taking partial derivatives again:

$$\int \frac{\partial^2 \ln f(w, \theta)}{\partial\theta\partial\theta'}\, f(w, \theta)\, dw + \int \frac{\partial \ln f(w, \theta)}{\partial\theta} \frac{\partial \ln f(w, \theta)}{\partial\theta'}\, f(w, \theta)\, dw = 0. \qquad (20)$$
Therefore, if $g(w) = f(w, \theta_0)$ we have

$$E\left[ \frac{\partial\ell_i(\theta_0)}{\partial\theta} \frac{\partial\ell_i(\theta_0)}{\partial\theta'} \right] = -E\left[ \frac{\partial^2\ell_i(\theta_0)}{\partial\theta\partial\theta'} \right]. \qquad (21)$$

This result is known as the information matrix identity. It says that, when evaluated at $\theta_0$, the covariance matrix of the score coincides with minus the expected Hessian of the log likelihood for observation $i$. It is an identity in the sense of (20), but in general it need not hold if the expectations in (21) are taken with respect to $g(w)$ and $g(w) \neq f(w, \theta_0)$.
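A quick Monte Carlo sketch of the identity (21) in a correctly specified logit model: the average outer product of scores and minus the average Hessian, both evaluated at $\theta_0$, should be close. The sample size and parameter values are assumptions made for the illustration.

```python
import numpy as np

# Monte Carlo check of the information matrix identity (21) in a correctly
# specified logit model: E[score score'] should equal -E[Hessian].
rng = np.random.default_rng(5)
n = 200_000
theta0 = np.array([0.3, -0.6])
X = np.column_stack([np.ones(n), rng.normal(size=n)])
p = 1.0 / (1.0 + np.exp(-X @ theta0))
y = (rng.uniform(size=n) < p).astype(float)

scores = X * (y - p)[:, None]                      # score of observation i at theta0
V = scores.T @ scores / n                          # average outer product of scores
minus_H = (X * (p * (1 - p))[:, None]).T @ X / n   # minus the average Hessian

print(np.round(V, 3))
print(np.round(minus_H, 3))   # the two matrices should be close
```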

The implication is that under correct specification $V = -H$ in the sandwich formula (16), and therefore:
$$\sqrt{n}\left( \hat{\theta} - \theta_0 \right) \xrightarrow{d} \mathcal{N}\left( 0, [I(\theta_0)]^{-1} \right) \qquad (22)$$
where

$$I(\theta_0) = -E\left[ \frac{\partial^2\ell_i(\theta_0)}{\partial\theta\partial\theta'} \right] \equiv -\operatorname*{plim}_{n\to\infty} \frac{1}{n} \frac{\partial^2 L(\theta_0)}{\partial\theta\partial\theta'}. \qquad (23)$$

The matrix $I(\theta_0)$ is known as the information matrix or the Fisher information matrix, after the work of R. A. Fisher. It is called information because it can be regarded as a measure of the amount of information that the random variable $W$ with density $f(w, \theta)$ contains about the unknown parameter

θ. Intuitively, the greater the expected curvature of the log likelihood at θ = θ0 the greater the information and the smaller the asymptotic variance of the maximum likelihood estimator, which is given by the inverse of the information matrix.

The asymptotic Cramér-Rao inequality. Under suitable regularity conditions, the matrix $[I(\theta_0)]^{-1}$ is a lower bound for the asymptotic variance of any consistent estimator of $\theta_0$. Furthermore, this lower bound is attained by the maximum likelihood estimator.

Estimating the information matrix. There are a variety of consistent estimators of $I(\theta_0)$. One possibility is to use the observed Hessian evaluated at $\hat{\theta}$:

$$\widehat{I} = -\frac{1}{n} \frac{\partial^2 L(\hat{\theta})}{\partial\theta\partial\theta'}. \qquad (24)$$
Another possibility is the expected Hessian evaluated at $\hat{\theta}$, as long as its functional form is known:

$$\widetilde{I} = I(\hat{\theta}). \qquad (25)$$
Yet another possibility, in a conditional likelihood model $f(y \mid x; \theta)$, is to use a sample average of the expected Hessian conditioned on $x_i$ and evaluated at $\hat{\theta}$:

$$\widetilde{\widetilde{I}} = -\frac{1}{n} \sum_{i=1}^{n} E\left[ \frac{\partial^2 \ln f(y \mid x_i; \hat{\theta})}{\partial\theta\partial\theta'} \,\Big|\, x_i \right]. \qquad (26)$$
Finally, one can use the variance-of-the-score form of the information matrix to obtain an estimate:
$$\widehat{\widehat{I}} = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial\ell_i(\hat{\theta})}{\partial\theta} \frac{\partial\ell_i(\hat{\theta})}{\partial\theta'}. \qquad (27)$$
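To make the menu of estimators concrete, this sketch computes the observed-Hessian estimate (24) and the score-variance estimate (27) for the logit model, where the two should agree closely under correct specification; the data and the crude Newton solver are assumptions made for the illustration.

```python
import numpy as np

# Compare two estimates of the information matrix for the logit model:
# (24) minus the average observed Hessian, and (27) the average outer product of scores.
rng = np.random.default_rng(6)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
theta0 = np.array([0.5, -1.0])
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X @ theta0))).astype(float)

theta = np.zeros(2)                                # Newton-Raphson for the MLE
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    theta += np.linalg.solve((X * (p * (1 - p))[:, None]).T @ X, X.T @ (y - p))

p = 1.0 / (1.0 + np.exp(-X @ theta))
I_hessian = (X * (p * (1 - p))[:, None]).T @ X / n                 # estimate (24)
I_score = (X * (y - p)[:, None]).T @ (X * (y - p)[:, None]) / n    # estimate (27)
print(np.round(I_hessian, 3))
print(np.round(I_score, 3))
```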

4 Example: Normal linear regression

Letting $u_i = y_i - x_i'\beta_0$, the score at the true value for the log likelihood function in (4) is

$$\frac{\partial\ell_i(\theta_0)}{\partial\theta} = \frac{1}{\sigma_0^2}
\begin{pmatrix} x_i u_i \\[4pt] \frac{1}{2\sigma_0^2}\left( u_i^2 - \sigma_0^2 \right) \end{pmatrix}.$$

The covariance matrix of the score is
$$V = E\begin{pmatrix}
\frac{1}{\sigma_0^4} u_i^2 x_i x_i' & \frac{1}{2\sigma_0^6} x_i u_i (u_i^2 - \sigma_0^2) \\[4pt]
\frac{1}{2\sigma_0^6} x_i' u_i (u_i^2 - \sigma_0^2) & \frac{1}{4\sigma_0^8} (u_i^2 - \sigma_0^2)^2
\end{pmatrix}
= \begin{pmatrix}
\frac{1}{\sigma_0^4} E(u_i^2 x_i x_i') & \frac{1}{2\sigma_0^6} E(u_i^3 x_i) \\[4pt]
\frac{1}{2\sigma_0^6} E(u_i^3 x_i') & \frac{1}{4\sigma_0^8} E(u_i^4 - \sigma_0^4)
\end{pmatrix}.$$

The expected Hessian is
$$H = E\begin{pmatrix}
-\frac{1}{\sigma_0^2} x_i x_i' & -\frac{1}{\sigma_0^4} x_i u_i \\[4pt]
-\frac{1}{\sigma_0^4} x_i' u_i & \frac{1}{2\sigma_0^4} - \frac{1}{\sigma_0^6} u_i^2
\end{pmatrix}
= -\begin{pmatrix}
\frac{1}{\sigma_0^2} E(x_i x_i') & 0 \\[4pt]
0' & \frac{1}{2\sigma_0^4}
\end{pmatrix}.$$
The sandwich formula in (16) is:

$$W = \begin{pmatrix}
\sigma_0^2 [E(x_i x_i')]^{-1} & 0 \\[4pt]
0' & 2\sigma_0^4
\end{pmatrix}
\begin{pmatrix}
\frac{1}{\sigma_0^4} E(u_i^2 x_i x_i') & \frac{1}{2\sigma_0^6} E(u_i^3 x_i) \\[4pt]
\frac{1}{2\sigma_0^6} E(u_i^3 x_i') & \frac{1}{4\sigma_0^8} E(u_i^4 - \sigma_0^4)
\end{pmatrix}
\begin{pmatrix}
\sigma_0^2 [E(x_i x_i')]^{-1} & 0 \\[4pt]
0' & 2\sigma_0^4
\end{pmatrix}.$$

Note that under misspecification the information matrix identity does not hold, since $V \neq -H$. The first block-diagonal components of $V$ and $-H$ will coincide under conditional homoskedasticity, that is, if $E(u_i^2 \mid x_i) = \sigma_0^2$. The off-diagonal block of $V$ is zero under conditional symmetry, that is, if $E(u_i^3 \mid x_i) = 0$. Lastly, the second block-diagonal terms of $V$ and $-H$ will coincide under the normal kurtosis condition $E(u_i^4) = 3\sigma_0^4$. These conditions are satisfied when model (4) is correctly specified but not in general.
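A brief simulation sketch of this point: with heteroskedastic errors the $\beta$-block of the sandwich matrix differs from the classical block $\sigma_0^2 [E(x_i x_i')]^{-1}$. The data-generating process below is an assumption made for the illustration.

```python
import numpy as np

# Under heteroskedasticity, the beta-block of the sandwich W differs from the
# classical sigma^2 [E(x x')]^{-1} block, because E(u^2 x x') != sigma^2 E(x x').
rng = np.random.default_rng(7)
n = 200_000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
u = np.abs(x) * rng.normal(size=n)          # Var(u | x) = x^2: heteroskedastic
y = 1.0 + 2.0 * x + u

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
Exx = X.T @ X / n
classical = np.mean(e**2) * np.linalg.inv(Exx)                   # sigma^2 [E(xx')]^{-1}
meat = (X * (e**2)[:, None]).T @ X / n                           # E(u^2 x x')
sandwich = np.linalg.inv(Exx) @ meat @ np.linalg.inv(Exx)        # beta-block of W
print(np.round(classical, 3))
print(np.round(sandwich, 3))
```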

Under correct specification:
$$\sqrt{n}\left( \hat{\theta} - \theta_0 \right) \xrightarrow{d} \mathcal{N}\left( 0, \begin{pmatrix} \sigma_0^2 [E(x_i x_i')]^{-1} & 0 \\[4pt] 0' & 2\sigma_0^4 \end{pmatrix} \right). \qquad (28)$$

5 Estimation subject to constraints

We may wish to estimate subject to constraints. Sometimes one seeks to ensure internal consistency in a model that is required to answer a question of interest such as a welfare calculation, for example enforcing symmetry of cross-price elasticities in a demand system. Another common situation is an interest in the value or adequacy of economic restrictions, such as constant returns to scale in a production function. Finally, one may be simply willing to consider a restricted version of a model as a way of producing a simpler or tighter summary of data. Constraints on parameters may be expressed as equation restrictions

$$h(\theta_0) = 0 \qquad (29)$$
where $h(\theta)$ is a vector of $r$ restrictions: $h(\theta) = (h_1(\theta), \dots, h_r(\theta))'$. Alternatively, constraints may be expressed in parametric form

$$\theta_0 = \theta(\alpha_0) \qquad (30)$$
whereby $\theta$ is functionally related to a free parameter vector $\alpha$ of smaller dimension than $\theta$, such that the number of restrictions is $r = \dim(\theta) - \dim(\alpha)$. Depending on the problem it may be more convenient to express restrictions in one way or the other (or in mixed form).

One way of obtaining a constrained estimator of θ0 is to maximize the log-likelihood function L (θ) subject to the constraints h (θ) = 0:

$$\left( \tilde{\theta}, \tilde{\lambda} \right) = \arg\max_{\theta, \lambda} \left[ \frac{1}{n} L(\theta) - \lambda' h(\theta) \right] \qquad (31)$$
where $\tilde{\lambda}$ is an $r \times 1$ vector of Lagrange multipliers.
Alternatively, if the restrictions have been parameterized as in (30), the log-likelihood function can be maximized with respect to $\alpha$ as in an unrestricted problem:

$$\tilde{\alpha} = \arg\max_{\alpha} L[\theta(\alpha)]. \qquad (32)$$

Restricted estimates of $\theta_0$ are then given by $\tilde{\theta} = \theta(\tilde{\alpha})$.
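A minimal sketch of the reparameterization approach (30)-(32): the restriction is imposed by writing $\theta = \theta(\alpha)$ and maximizing the log likelihood over the free parameter $\alpha$. The particular restriction (two slope coefficients being equal) and the simulated data are assumptions made for the illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Restricted Gaussian-regression MLE via reparameterization theta = theta(alpha):
# model y = b0 + b1*x1 + b2*x2 + u with the single restriction b1 = b2.
rng = np.random.default_rng(8)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, 0.5]) + rng.normal(size=n)

def theta_of_alpha(a):
    """Map the free parameters alpha = (b0, common slope, log sigma2) into theta."""
    return np.array([a[0], a[1], a[1]]), np.exp(a[2])

def neg_loglik(a):
    beta, sigma2 = theta_of_alpha(a)
    e = y - X @ beta
    return 0.5 * n * np.log(2 * np.pi * sigma2) + e @ e / (2 * sigma2)

alpha_tilde = minimize(neg_loglik, np.zeros(3), method="BFGS").x
beta_tilde, sigma2_tilde = theta_of_alpha(alpha_tilde)
print(beta_tilde, sigma2_tilde)   # restricted estimates theta_tilde = theta(alpha_tilde)
```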

Asymptotic normality of constrained estimators. The asymptotic variance of $\sqrt{n}(\tilde{\alpha} - \alpha_0)$ can be obtained as an application of the result in (15)-(17) to the log likelihood $L^*(\alpha) = L[\theta(\alpha)]$. Letting $G = G(\alpha_0)$ where $G(\alpha) = \partial\theta(\alpha)/\partial\alpha'$, using the chain rule we have
$$\frac{\partial L^*(\alpha_0)}{\partial\alpha} = G' \frac{\partial L(\theta_0)}{\partial\theta}, \qquad (33)$$
and⁹
$$E\left[ \frac{\partial^2 L^*(\alpha_0)}{\partial\alpha\partial\alpha'} \right] = G' H G. \qquad (34)$$
Moreover,

$$\frac{1}{\sqrt{n}} \frac{\partial L^*(\alpha_0)}{\partial\alpha} \xrightarrow{d} \mathcal{N}\left( 0, G'VG \right). \qquad (35)$$
Therefore,

$$\sqrt{n}\left( \tilde{\alpha} - \alpha_0 \right) \xrightarrow{d} \mathcal{N}\left( 0, (G'HG)^{-1}\, G'VG\, (G'HG)^{-1} \right). \qquad (36)$$
The asymptotic distribution of $\tilde{\theta}$ then follows from the delta method:

$$\sqrt{n}\left( \tilde{\theta} - \theta_0 \right) \xrightarrow{d} \mathcal{N}(0, W_R) \qquad (37)$$
where $W_R$ is a matrix of reduced rank given by

$$W_R = G (G'HG)^{-1}\, G'VG\, (G'HG)^{-1} G'. \qquad (38)$$
Under correct specification $V = -H$, so that the previous results become
$$\sqrt{n}\left( \tilde{\alpha} - \alpha_0 \right) \xrightarrow{d} \mathcal{N}\left( 0, [G' I(\theta_0) G]^{-1} \right) \qquad (39)$$
and

$$\sqrt{n}\left( \tilde{\theta} - \theta_0 \right) \xrightarrow{d} \mathcal{N}\left( 0, G [G' I(\theta_0) G]^{-1} G' \right). \qquad (40)$$
Under correct specification, $\tilde{\theta}$ is asymptotically more efficient than $\hat{\theta}$, since the difference between their asymptotic variance matrices is positive semidefinite:
$$[I(\theta_0)]^{-1} - G [G' I(\theta_0) G]^{-1} G' = A^{-1}\left[ I - G^* (G^{*\prime} G^*)^{-1} G^{*\prime} \right] (A^{-1})' \geq 0 \qquad (41)$$
where $I(\theta_0) = A'A$ and $G^* = AG$. However, under misspecification the sandwich matrix $W = H^{-1} V H^{-1}$ and $W_R$ in (38) cannot be ordered.

⁹ The matrix $\partial^2 L^*(\alpha_0)/\partial\alpha\partial\alpha'$ contains an additional term, which is equal to zero in expectation.

The likelihood ratio statistic. By construction an unrestricted maximum is greater than a restricted one: $L(\hat{\theta}) \geq L(\tilde{\theta})$. It seems therefore natural to look at the log-likelihood change $L(\hat{\theta}) - L(\tilde{\theta})$ as a measure of the cost of imposing the restrictions. It can be shown that, under correct specification (or if the information matrix identity holds) and if the $r$ restrictions on $\theta_0$ also hold, then
$$LR = 2\left[ L(\hat{\theta}) - L(\tilde{\theta}) \right] \xrightarrow{d} \chi_r^2. \qquad (42)$$
This result is used to construct a large-sample test of the restrictions: the rule is to reject the restrictions if $LR > \kappa$, where $\kappa$ is the $(1-\alpha)$-quantile of the $\chi_r^2$ distribution and $\alpha$ is the chosen size of the test.
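A short sketch of an LR test in the logit model of Example 2, testing that one coefficient is zero; the simulated data and the use of scipy's optimizer and chi-square quantile are assumptions made for the illustration.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

def neg_loglik(theta, Z, y):
    """Negative logit log likelihood."""
    z = Z @ theta
    return np.sum(np.logaddexp(0.0, z) - y * z)

rng = np.random.default_rng(9)
n = 800
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-(0.2 + 0.8 * X[:, 1])))).astype(float)

# Unrestricted fit, and restricted fit imposing the single restriction theta_3 = 0.
L_hat = -minimize(neg_loglik, np.zeros(3), args=(X, y), method="BFGS").fun
L_tilde = -minimize(neg_loglik, np.zeros(2), args=(X[:, :2], y), method="BFGS").fun

LR = 2 * (L_hat - L_tilde)
print(LR, chi2.ppf(0.95, df=1))   # reject the restriction if LR exceeds the 0.95 quantile
```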

The Wald statistic. We can also learn about the adequacy of the restrictions by examining how far the unrestricted estimates are from satisfying the constraints. We are thus led to look at unrestricted estimates of the constraints, given by $h(\hat{\theta})$. Letting $D = D(\theta_0)$ and $\widehat{D} = D(\hat{\theta})$ where $D(\theta) = \partial h(\theta)/\partial\theta'$, under the assumption that $h(\theta_0) = 0$, from the delta method we have
$$\sqrt{n}\, h(\hat{\theta}) \xrightarrow{d} \mathcal{N}\left( 0, D W D' \right) \qquad (43)$$
and

$$\mathcal{W}_R = n\, h(\hat{\theta})' \left[ \widehat{D} \widehat{W} \widehat{D}' \right]^{-1} h(\hat{\theta}) \xrightarrow{d} \chi_r^2. \qquad (44)$$
The quantity $\mathcal{W}_R$ is a Wald statistic. Like the $LR$ statistic, it can be used to construct a large-sample test of the restrictions with a similar rejection region. However, while the calculation of $LR$ requires both $\hat{\theta}$ and $\tilde{\theta}$, the calculation of $\mathcal{W}_R$ only requires the unrestricted estimate $\hat{\theta}$.
Another difference is that, contrary to $LR$, $\mathcal{W}_R$ still has a large-sample chi-square distribution under misspecification if it relies on a robust estimate of the variance of $\hat{\theta}$, as in (44). A Wald statistic that is directly comparable to the $LR$ statistic would be a non-robust version of the form:
$$\mathcal{W} = n\, h(\hat{\theta})' \left[ \widehat{D} \left[ I(\hat{\theta}) \right]^{-1} \widehat{D}' \right]^{-1} h(\hat{\theta}). \qquad (45)$$

The Lagrange Multiplier statistic. Another angle on the cost of imposing the restrictions is to examine how far the estimated Lagrange multiplier vector $\tilde{\lambda}$ is from zero. The first-order conditions from the optimization problem in (31) are:
$$\frac{1}{n} \frac{\partial L(\tilde{\theta})}{\partial\theta} = D(\tilde{\theta})' \tilde{\lambda} \qquad (46)$$
$$h(\tilde{\theta}) = 0, \qquad (47)$$
which lead to the asymptotic linear representation of $\tilde{\lambda}$:¹⁰

$$\sqrt{n}\, \tilde{\lambda} = \left( D H^{-1} D' \right)^{-1} D H^{-1} \frac{1}{\sqrt{n}} \frac{\partial L(\theta_0)}{\partial\theta} + o_p(1) \qquad (48)$$
and

$$\sqrt{n}\, \tilde{\lambda} \xrightarrow{d} \mathcal{N}\left( 0, \left( D H^{-1} D' \right)^{-1} D H^{-1} V H^{-1} D' \left( D H^{-1} D' \right)^{-1} \right), \qquad (49)$$
and also

$$LM_R = n\, \tilde{\lambda}'\, \widetilde{D} \widetilde{H}^{-1} \widetilde{D}' \left[ \widetilde{D} \widetilde{H}^{-1} \widetilde{V} \widetilde{H}^{-1} \widetilde{D}' \right]^{-1} \widetilde{D} \widetilde{H}^{-1} \widetilde{D}'\, \tilde{\lambda} \xrightarrow{d} \chi_r^2 \qquad (50)$$
where

$$\widetilde{D} = D(\tilde{\theta}), \qquad
\widetilde{H} = \frac{1}{n}\sum_{i=1}^{n} \frac{\partial^2\ell_i(\tilde{\theta})}{\partial\theta\partial\theta'}, \qquad
\widetilde{V} = \frac{1}{n}\sum_{i=1}^{n} \frac{\partial\ell_i(\tilde{\theta})}{\partial\theta} \frac{\partial\ell_i(\tilde{\theta})}{\partial\theta'}.$$
The quantity $LM_R$ is a Lagrange Multiplier statistic. Like the Wald statistic in (44), it can be used to construct a large-sample test of the restrictions, and it remains valid if the objective function is only a pseudo likelihood function that does not satisfy the information matrix identity.
In view of (46) we can replace $\widetilde{D}'\tilde{\lambda}$ in (50) with $n^{-1}\partial L(\tilde{\theta})/\partial\theta$, which produces the score form of the statistic. The score function is exactly equal to zero when evaluated at $\hat{\theta}$, but not when evaluated at $\tilde{\theta}$. If the constraints are true, we would expect both $n^{-1}\partial L(\tilde{\theta})/\partial\theta$ and $\tilde{\lambda}$ to be small quantities, so that the rejection region of the null hypothesis $h(\theta_0) = 0$ is associated with large values of $LM_R$.
Under correct specification $V = -H$, we get
$$\sqrt{n}\, \tilde{\lambda} \xrightarrow{d} \mathcal{N}\left( 0, \left[ D [I(\theta_0)]^{-1} D' \right]^{-1} \right) \qquad (51)$$
and

$$LM = n\, \tilde{\lambda}'\, \widetilde{D} \left[ I(\tilde{\theta}) \right]^{-1} \widetilde{D}'\, \tilde{\lambda}
\equiv \frac{1}{n} \frac{\partial L(\tilde{\theta})}{\partial\theta'} \left[ I(\tilde{\theta}) \right]^{-1} \frac{\partial L(\tilde{\theta})}{\partial\theta} \xrightarrow{d} \chi_r^2. \qquad (52)$$
The statistic $LM$ is a non-robust version of the Lagrange Multiplier statistic that is directly comparable to the $LR$ statistic.

¹⁰ Using the mean value theorem for each component of the first-order conditions:

$$D(\tilde{\theta})'\tilde{\lambda} = \frac{1}{n}\frac{\partial L(\theta_0)}{\partial\theta} + \frac{1}{n}\frac{\partial^2 L(\theta^*)}{\partial\theta\partial\theta'}\left( \tilde{\theta} - \theta_0 \right)$$
$$0 = h(\tilde{\theta}) = h(\theta_0) + D(\theta^{**})\left( \tilde{\theta} - \theta_0 \right)$$
and combining the two expressions, using that $h(\theta_0) = 0$, we get:

$$D(\theta^{**})\left[ \frac{1}{n}\frac{\partial^2 L(\theta^*)}{\partial\theta\partial\theta'} \right]^{-1} D(\tilde{\theta})'\tilde{\lambda}
= D(\theta^{**})\left[ \frac{1}{n}\frac{\partial^2 L(\theta^*)}{\partial\theta\partial\theta'} \right]^{-1} \frac{1}{n}\frac{\partial L(\theta_0)}{\partial\theta}.$$

6 Example: LR, Wald and LM in the normal linear regression model

The model is the same as in (4). We consider the partitions $X = (X_1, X_2)$ and $\beta = (\beta_1', \beta_2')'$ where

$X_1$ is of order $n \times r$, $X_2$ is $n \times (k - r)$, $\beta_1$ is $r \times 1$ and $\beta_2$ is $(k - r) \times 1$, and the $r$ restrictions are

$$\beta_1 = 0. \qquad (53)$$

The unrestricted estimates are $\hat{\theta} = (\hat{\beta}', \hat{\sigma}^2)'$ as in (5), whereas the restricted estimates are

$$\tilde{\beta}_1 = 0, \qquad \tilde{\beta}_2 = (X_2'X_2)^{-1} X_2' y, \qquad \tilde{\sigma}^2 = \frac{\tilde{u}'\tilde{u}}{n} \qquad (54)$$
where $\tilde{u} = y - X_2\tilde{\beta}_2$.
The LR statistic is given by
$$LR = n \ln\left( \frac{\tilde{u}'\tilde{u}}{\hat{u}'\hat{u}} \right). \qquad (55)$$
If we modify the example to assume that $\sigma_0^2$ is known, $\hat{\beta}$ and $\tilde{\beta}$ remain unchanged, but in this case
$$LR_\sigma = \frac{\tilde{u}'\tilde{u} - \hat{u}'\hat{u}}{\sigma_0^2}, \qquad (56)$$
which is exactly distributed as $\chi_r^2$ under normality. Turning to the Wald statistic, recall that

$$\sqrt{n}\left( \hat{\beta} - \beta_0 \right) \xrightarrow{d} \mathcal{N}\left( 0, \sigma_0^2 \operatorname{plim}\left( X'X/n \right)^{-1} \right)$$
and introduce the partition

$$(X'X)^{-1} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}. \qquad (57)$$
Thus, if the restrictions hold,

$$\sqrt{n}\, \hat{\beta}_1 \xrightarrow{d} \mathcal{N}\left( 0, \sigma_0^2 \operatorname{plim}(n A_{11}) \right) \qquad (58)$$
and therefore the (non-robust) Wald statistic is given by:

$$\mathcal{W} = \frac{\hat{\beta}_1' A_{11}^{-1} \hat{\beta}_1}{\hat{\sigma}^2}. \qquad (59)$$
It can be shown that $\hat{\beta}_1' A_{11}^{-1} \hat{\beta}_1 = \tilde{u}'\tilde{u} - \hat{u}'\hat{u}$, so that also¹¹
$$\mathcal{W} = \frac{\tilde{u}'\tilde{u} - \hat{u}'\hat{u}}{\hat{\sigma}^2}. \qquad (60)$$

¹¹ Using the partitioned inverse matrix result $A_{11}^{-1} = X_1'(I - M_2)X_1$ and $M M_2 = M_2$, where $M_2 = X_2(X_2'X_2)^{-1}X_2'$ and $M = X(X'X)^{-1}X'$. Premultiplying $y = X\hat{\beta} + \hat{u}$ by $(I - M_2)$ and taking squares we get $y'(I - M_2)y = \hat{\beta}_1' X_1'(I - M_2)X_1 \hat{\beta}_1 + \hat{u}'(I - M_2)\hat{u}$, which equals $\tilde{u}'\tilde{u} = \hat{\beta}_1' A_{11}^{-1} \hat{\beta}_1 + \hat{u}'\hat{u}$.

Moreover, if $\sigma_0^2$ is known,

$$\mathcal{W}_\sigma = \frac{\tilde{u}'\tilde{u} - \hat{u}'\hat{u}}{\sigma_0^2},$$
so that $\mathcal{W}_\sigma = LR_\sigma$. With unknown $\sigma_0^2$ it can be shown that $\mathcal{W} \geq LR$, even though both statistics have the same asymptotic distribution.
Finally, turning to the LM statistic, the components of the score are:
$$\frac{\partial L(\theta)}{\partial\beta_1} = \frac{1}{\sigma^2} X_1'(y - X\beta), \qquad
\frac{\partial L(\theta)}{\partial\beta_2} = \frac{1}{\sigma^2} X_2'(y - X\beta), \qquad
\frac{\partial L(\theta)}{\partial\sigma^2} = \frac{n}{2\sigma^4}\left[ \frac{1}{n}(y - X\beta)'(y - X\beta) - \sigma^2 \right], \qquad (61)$$
which, once evaluated at the restricted estimates, are
$$\frac{\partial L(\tilde{\theta})}{\partial\beta_1} = \frac{1}{\tilde{\sigma}^2} X_1'\tilde{u} \neq 0, \qquad
\frac{\partial L(\tilde{\theta})}{\partial\beta_2} = \frac{1}{\tilde{\sigma}^2} X_2'\tilde{u} = 0, \qquad
\frac{\partial L(\tilde{\theta})}{\partial\sigma^2} = \frac{n}{2\tilde{\sigma}^4}\left[ \frac{1}{n}\tilde{u}'\tilde{u} - \tilde{\sigma}^2 \right] = 0. \qquad (62)$$
Moreover, the information matrix in this case is
$$I(\theta_0) = \begin{pmatrix} \frac{1}{\sigma_0^2} \operatorname{plim}\frac{X'X}{n} & 0 \\[4pt] 0' & \frac{1}{2\sigma_0^4} \end{pmatrix}, \qquad (63)$$
so that, using

$$I(\tilde{\theta}) = \begin{pmatrix} \frac{1}{\tilde{\sigma}^2} \frac{X'X}{n} & 0 \\[4pt] 0' & \frac{1}{2\tilde{\sigma}^4} \end{pmatrix}, \qquad (64)$$
the LM statistic becomes
$$LM = \begin{pmatrix} \frac{\tilde{u}'X_1}{\tilde{\sigma}^2} & 0' & 0 \end{pmatrix}
\begin{pmatrix} \tilde{\sigma}^2 A_{11} & \tilde{\sigma}^2 A_{12} & 0 \\ \tilde{\sigma}^2 A_{21} & \tilde{\sigma}^2 A_{22} & 0 \\ 0' & 0' & 2\tilde{\sigma}^4/n \end{pmatrix}
\begin{pmatrix} \frac{X_1'\tilde{u}}{\tilde{\sigma}^2} \\ 0 \\ 0 \end{pmatrix} \qquad (65)$$
and
$$LM = \frac{\tilde{u}'X_1 A_{11} X_1'\tilde{u}}{\tilde{\sigma}^2}. \qquad (66)$$
It can be shown that

$$\tilde{u}'X_1 A_{11} X_1'\tilde{u} = \hat{\beta}_1' A_{11}^{-1} \hat{\beta}_1 = \tilde{u}'\tilde{u} - \hat{u}'\hat{u}, \qquad (67)$$
so that also
$$LM = \frac{\tilde{u}'\tilde{u} - \hat{u}'\hat{u}}{\tilde{\sigma}^2}. \qquad (68)$$
Moreover, if $\sigma_0^2$ is known,
$$LM_\sigma = \frac{\tilde{u}'\tilde{u} - \hat{u}'\hat{u}}{\sigma_0^2},$$
so that $LM_\sigma = \mathcal{W}_\sigma = LR_\sigma$. With unknown $\sigma_0^2$ it is easy to show that $\mathcal{W} \geq LR \geq LM$.
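A compact numerical sketch of this section: for simulated normal-regression data it computes $LR$ (55), the Wald statistic (60) and the LM statistic (68) for the restriction $\beta_1 = 0$, and checks the ordering $\mathcal{W} \geq LR \geq LM$. All data choices are assumptions made for the illustration.

```python
import numpy as np

# LR, Wald and LM for beta_1 = 0 in the normal linear regression model,
# computed from unrestricted and restricted residual sums of squares.
rng = np.random.default_rng(10)
n, r = 300, 1
X1 = rng.normal(size=(n, r))                  # regressors under test
X2 = np.column_stack([np.ones(n), rng.normal(size=n)])
X = np.column_stack([X1, X2])
y = X2 @ np.array([1.0, 0.5]) + rng.normal(size=n)   # beta_1 = 0 holds in the DGP

def rss(Z):
    b = np.linalg.lstsq(Z, y, rcond=None)[0]
    e = y - Z @ b
    return e @ e

rss_u, rss_r = rss(X), rss(X2)                # unrestricted and restricted fits
sigma2_hat, sigma2_tilde = rss_u / n, rss_r / n

LR = n * np.log(rss_r / rss_u)                # (55)
W = (rss_r - rss_u) / sigma2_hat              # (60)
LM = (rss_r - rss_u) / sigma2_tilde           # (68)
print(W, LR, LM)                              # W >= LR >= LM
```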
