
Notes on Maximum Likelihood Estimation (First Part)

Christopher Flinn

Fall 2004

Most maximum likelihood estimation begins with the specification of an entire probability distribution for the data (i.e., the dependent variables of the analysis). We will concentrate on the case of one dependent variable, and begin with the case of no exogenous variables just for simplicity. When the distribution of the dependent variable is specified to depend on a finite-dimensional parameter vector, we speak of parametric maximum likelihood estimation. When not, such as the case of trying to estimate the entire distribution of the data making only some assumptions regarding the nature of its cumulative distribution function, we speak of nonparametric m.l.e. In this case the objective is to estimate an entire function, F, rather than a few parameters that characterize the distribution of the data under some specific functional form assumption regarding F.

1 Example 1. A Bernoulli Random Variable

Start with the simplest case of a binary dependent variable, like a coin toss. We are not sure how the coin is weighted - if it were weighted evenly, then the probability of a head, H, on any toss would be .5. If we call an H a ‘1’ and a T a ‘0’, then the probability distribution of the outcome is simply

Event   Probability
  1        π
  0      1 − π

We flip the coin N times, and the data consist of a string of N 1’s and 0’s. Let the outcome of flip i be denoted by d_i. Then the joint probability of the data is given by

\[
L(\{d_i\}_{i=1}^N, \pi) = \prod_{i=1}^N \pi^{d_i} (1-\pi)^{1-d_i}
= \pi^{\sum_i d_i} (1-\pi)^{\sum_i (1-d_i)}
= \pi^{N_1} (1-\pi)^{N_0},
\]

where N_1 is the number of times ‘1’ appeared and N_0 is the number of times ‘0’ appeared, with N_1 + N_0 = N. The function L(\{d_i\}_{i=1}^N, \pi) is a joint probability function of the sample \{d_i\}_{i=1}^N given the parameter value π. In this case, we don’t know the value of π, and will use the one piece of information available, the sample \{d_i\}_{i=1}^N, to attempt to infer its value. Now we treat L as a function of the unknown parameter π, the sample itself being implicitly an argument of the function, and choose the value of π that maximizes L as our estimate. Formally,

\[
\hat{\pi}_{ML} = \arg\max_{\pi} L(\pi).
\]
The value of π that maximizes L will also maximize any increasing transformation of L, such as the natural logarithm. Then define

\[
\mathcal{L}(\pi) \equiv \ln L(\pi) = N_1 \ln \pi + N_0 \ln(1-\pi).
\]
Then the maximum likelihood estimator of π is the solution to the first order condition

\[
\frac{d\mathcal{L}(\hat{\pi}_{ML})}{d\pi} = 0
\;\Rightarrow\; \frac{N_1}{\hat{\pi}_{ML}} - \frac{N_0}{1-\hat{\pi}_{ML}} = 0
\;\Rightarrow\; N_1(1-\hat{\pi}_{ML}) - N_0\hat{\pi}_{ML} = 0
\;\Rightarrow\; \hat{\pi}_{ML} = \frac{N_1}{N_1+N_0} = \frac{N_1}{N}.
\]
We know that for this particular example the maximum likelihood estimator has good

“small sample” properties. For example, it is unbiased, since
\[
E\hat{\pi}_{ML} = E\frac{N_1}{N} = \frac{1}{N}E N_1 = \frac{1}{N}E\sum_{i=1}^N d_i = \frac{1}{N}\sum_{i=1}^N E d_i = \frac{1}{N}\sum_{i=1}^N \pi = \frac{N\pi}{N} = \pi.
\]
We even have the small sample variance of the estimator, which is simply

\[
VAR(\hat{\pi}_{ML}) = \frac{\pi(1-\pi)}{N},
\]
an estimate of which is given by

\[
\widehat{VAR}(\hat{\pi}_{ML}) = \frac{\hat{\pi}_{ML}(1-\hat{\pi}_{ML})}{N}.
\]
One of the reasons for the good small sample properties of the ML estimator in this case is that it coincides with the sample mean - which is just the proportion of successes out of the N draws. Remember that in a random sample of N draws, the sample mean is an unbiased estimator of the population mean, with the variance of the sample mean equal to the variance of the random variable divided by the sample size - exactly what we have above. Proving consistency of the maximum likelihood estimator in this case is straightforward, since the estimator is unbiased and the limiting value of the variance is 0.
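As a quick numerical check (a minimal sketch, not part of the original notes), the following code simulates coin flips and computes the ML estimate and its estimated variance. The true value of π, the seed, and the sample size N are arbitrary illustrative choices.

```python
import numpy as np

# A minimal sketch of the Bernoulli MLE. The true value of pi, the seed,
# and the sample size N are arbitrary choices made only for illustration.
rng = np.random.default_rng(0)
true_pi, N = 0.3, 1000

d = rng.binomial(1, true_pi, size=N)   # the sample {d_i}, i = 1, ..., N
N1 = d.sum()                           # number of '1' outcomes

pi_hat = N1 / N                        # MLE: the proportion of successes
var_hat = pi_hat * (1 - pi_hat) / N    # estimated small-sample variance

print(f"pi_hat = {pi_hat:.4f}, estimated variance = {var_hat:.6f}")
```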

2 Example 2. The Negative Exponential Distribution

In many cases the dependent variable is continuous, unlike in the previous example, but is restricted to a subset of the real line - most often the positive real line R+. The simplest distribution defined on this set is the negative exponential, with cumulative distribution function (c.d.f.) and probability density function (p.d.f.) given by

\[
F(t) = 1 - \exp(-\alpha t), \qquad \alpha > 0,\; t > 0,
\]
\[
f(t) = \alpha \exp(-\alpha t).
\]

Say that we wanted to estimate the parameter α, which, given our functional form assumption, would serve to completely characterize the distribution of the random variable T. Let’s think of T as the duration of time an individual spends unemployed while looking for a job. We have a random sample of N individuals, and we consider each individual’s unemployment duration as an independent draw from the same distribution F. In this case, the joint p.d.f. of all of the sample draws, the likelihood function, is given by

\[
L(t_1, \ldots, t_N, \alpha) = \prod_{i=1}^N \alpha \exp(-\alpha t_i)
= \alpha^N \exp\Bigl(-\alpha \sum_{i=1}^N t_i\Bigr).
\]
Then the log likelihood function is given by

\[
\mathcal{L}(\alpha) = N \ln \alpha - \alpha \sum_{i=1}^N t_i.
\]
We want to maximize this function with respect to α. The solution, if there is exactly one, is the maximum likelihood estimator for the problem. Does this function have a unique solution? The first derivative is
\[
\frac{d\mathcal{L}(\alpha)}{d\alpha} = \frac{N}{\alpha} - \sum_{i=1}^N t_i,
\]
and the second derivative is
\[
\frac{d^2\mathcal{L}(\alpha)}{d\alpha^2} = -\frac{N}{\alpha^2} < 0.
\]
Since the second derivative is negative, the solution to the first order condition determines the unique maximizer of the function, and is the maximum likelihood estimator.

\[
\frac{d\mathcal{L}(\hat{\alpha}_{ML})}{d\alpha} = 0
\;\Rightarrow\; \frac{N}{\hat{\alpha}_{ML}} - \sum_{i=1}^N t_i = 0
\;\Rightarrow\; \hat{\alpha}_{ML} = \frac{N}{\sum_{i=1}^N t_i} = \frac{1}{\bar{t}_N},
\]
where \bar{t}_N is the sample mean.
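A minimal numerical sketch of this result follows; the value of α, the seed, and the sample size are illustrative assumptions, not taken from the notes.

```python
import numpy as np

# A minimal sketch of the negative exponential MLE; alpha, the seed,
# and N below are illustrative values.
rng = np.random.default_rng(1)
alpha, N = 0.7, 500

# numpy parameterizes the exponential by its mean, which is 1/alpha here.
t = rng.exponential(scale=1.0 / alpha, size=N)

alpha_hat = 1.0 / t.mean()             # MLE: the inverse of the sample mean
print(f"alpha_hat = {alpha_hat:.4f}")
```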

3 Example 2 Continued, with Observable Heterogeneity

Oftentimes we have access to information on observable characteristics of individuals in addition to the value of the dependent variable for each one. For example, in the unemployment case discussed above, we may know each individual’s gender, years of schooling completed, region of the country in which they live, etc. Let the row vector of these characteristics for individual i be given by X_i. We assume that there are K columns in X_i, one of which may be a ‘1’ for each individual (corresponding to a sort of intercept term). Say we continue to assume that the distribution of times in unemployment is negative exponential, and that one individual’s unemployment duration is unrelated to (i.e., independent of) another’s. We just want to relax the “identical” part of the i.i.d. assumption. The easiest way to accomplish this, though by no means the only way, is to write

\[
\alpha_i = \exp(X_i \beta),
\]
where β is an unknown K × 1 parameter vector to be estimated. The likelihood function is now defined in terms of the parameter vector β, and we have

\[
L(t_1, \ldots, t_N, \beta \mid X_1, \ldots, X_N) = \prod_{i=1}^N \exp(X_i\beta)\exp(-\exp(X_i\beta)\, t_i),
\]
and the log likelihood function is written as

\[
\mathcal{L}(\beta) = \sum_{i=1}^N \{X_i\beta - \exp(X_i\beta)\, t_i\}.
\]
The first partial derivative of this function with respect to the parameter vector β is

\[
\frac{\partial \mathcal{L}(\beta)}{\partial \beta} = \sum_{i=1}^N \{X_i' - \exp(X_i\beta)\, X_i'\, t_i\},
\]
and the second partial derivative of the log likelihood function with respect to β is

\[
\frac{\partial^2 \mathcal{L}(\beta)}{\partial \beta\, \partial \beta'} = \sum_{i=1}^N \{-\exp(X_i\beta)\, X_i' X_i\, t_i\} < 0
\]
(that is, the matrix of second partial derivatives is negative definite), so that there is a unique maximum likelihood estimator of β. The estimator solves the first order conditions, or
\[
\frac{\partial \mathcal{L}(\hat{\beta}_{ML})}{\partial \beta} = \sum_{i=1}^N \{X_i' - \exp(X_i\hat{\beta}_{ML})\, X_i'\, t_i\} = 0.
\]
There is no “closed form” solution to the first order conditions, and the solution must be obtained iteratively using numerical methods. We will say a few words on this topic below.
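Since there is no closed form, the maximizing β must be found numerically. The sketch below does this with scipy’s general-purpose minimizer applied to the negative of the log likelihood above; the simulated data, the true β, and the choice of the BFGS routine are all illustrative assumptions, not part of the notes.

```python
import numpy as np
from scipy.optimize import minimize

# A sketch of numerical maximization of the exponential-duration log likelihood.
rng = np.random.default_rng(2)
N = 400
X = np.column_stack([np.ones(N), rng.normal(size=N)])   # a '1' column plus one covariate
beta_true = np.array([-0.5, 0.8])                        # illustrative values
t = rng.exponential(scale=1.0 / np.exp(X @ beta_true))  # durations with alpha_i = exp(X_i beta)

def neg_loglik(beta):
    # L(beta) = sum_i { X_i beta - exp(X_i beta) t_i }; minimize its negative.
    xb = X @ beta
    return -np.sum(xb - np.exp(xb) * t)

res = minimize(neg_loglik, x0=np.zeros(X.shape[1]), method="BFGS")
print("beta_hat =", res.x)
```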

4 Example 3. The Latent Variable Discrete Choice Framework

We saw that there were some interpretive problems with the linear probability model - namely, the predicted probability of a “success” was often outside of the interval (0,1) - which is not a good thing for a probability. Instead we can set the problem up in a latent variable framework. We illustrate the technique in the binary choice case. Think of a case in which the choice is to buy one unit of some good or not, for example, a car. There is a utility gain from buying the car, but of course the cost of the car reduces resources available to spend on other goods. Let U*(X) be the net utility gain from buying the car, which depends on characteristics X, such as the price of the car, the individual’s residence and workplace (as indicators of how much she will use it), characteristics of the car such as color, type, etc., and the individual’s total income. We will assume a functional form for this indirect utility function, typically

\[
U_i^* = X_i\beta + \varepsilon_i,
\]
where X_i are the characteristics of individual i and the good in question and ε_i is a random “preference shock.” We will make a parametric assumption regarding ε_i, and assume that it is independently and identically distributed across consumers, with c.d.f. F(ε). We will assume that E(ε) = 0. The individual purchases the good if and only if the net utility from doing so is positive. If d_i = 1 when individual i purchases the good, then
\[
d_i = \begin{cases} 1 & \text{iff } U_i^* > 0 \\ 0 & \text{iff } U_i^* \le 0 \end{cases}.
\]
Then the probability that the individual purchases the good is given by

\[
p(d_i = 1 \mid X_i) = p(X_i\beta + \varepsilon > 0) = p(\varepsilon > -X_i\beta) = 1 - F(-X_i\beta),
\]
and of course p(d_i = 0 | X_i) = F(−X_iβ). If we have a random sample of observations on \{d_i, X_i\}_{i=1}^N, we define the likelihood function as
\[
L(d_1, \ldots, d_N, \beta \mid X_1, \ldots, X_N) = \prod_{i=1}^N p(d_i = 1 \mid X_i)^{d_i}\, p(d_i = 0 \mid X_i)^{1-d_i}
= \prod_{i=1}^N \bigl(1 - F(-X_i\beta)\bigr)^{d_i} F(-X_i\beta)^{1-d_i},
\]
with the log likelihood function given by

\[
\mathcal{L}(\beta) = \sum_{i=1}^N \bigl\{ d_i \ln\bigl(1 - F(-X_i\beta)\bigr) + (1 - d_i)\ln F(-X_i\beta) \bigr\}.
\]
Now let’s look at the first partials of the log likelihood function, which are

\[
\frac{\partial \mathcal{L}(\beta)}{\partial \beta} = \sum_{i=1}^N \Bigl\{ \frac{d_i}{1 - F(-X_i\beta)} - \frac{1 - d_i}{F(-X_i\beta)} \Bigr\} f(-X_i\beta)\, X_i'.
\]
Under many distributional assumptions regarding F it is possible to prove that the matrix of second partial derivatives is negative definite, and so the solutions to the first order conditions give the unique maximum likelihood estimates of the parameters β, if these parameters are “identified.” We will see what this means in the context of the probit model, where we assume that F is a normal distribution. Specifically, we will begin by assuming that ε follows the N(0, σ_ε²) distribution. Then we have that the probability of d = 1 is given by

\[
p(d_i = 1 \mid X_i) = 1 - F(-X_i\beta) = 1 - \Phi\Bigl(\frac{-X_i\beta}{\sigma_\varepsilon}\Bigr),
\]
where Φ is the standard normal cumulative distribution function. Because the normal is a symmetric distribution around the mean, and the mean is 0 in this case, we know that

\[
1 - \Phi(-z) = \Phi(z),
\]
so that we can write
\[
p(d_i = 1 \mid X_i) = \Phi\Bigl(\frac{X_i\beta}{\sigma_\varepsilon}\Bigr).
\]
It follows that the log likelihood is written as

\[
\mathcal{L}(\beta, \sigma_\varepsilon^2) = \sum_{i=1}^N \bigl\{ d_i \ln \Phi(X_i\beta/\sigma_\varepsilon) + (1 - d_i)\ln\bigl(1 - \Phi(X_i\beta/\sigma_\varepsilon)\bigr) \bigr\}.
\]
We note something immediately from the log likelihood function concerning its dependence on the unknown parameters of the model. By assuming that ε was normally distributed with a variance equal to σ_ε², we added an additional parameter to estimate, σ_ε². However, we can see from inspection of the log likelihood that β and σ_ε always appear in the same combination β/σ_ε everywhere. This means that these parameters cannot be separately identified. Whatever values are taken by β and σ_ε, if we multiply each parameter by some scalar k we will end up with the same ratio β/σ_ε. Thus only the ratio of β to

σ_ε can be estimated, and not each term individually. Sometimes we simply “normalize” the standard deviation of ε to be equal to 1, and in that way say that we have estimated β uniquely. But the important thing to remember is that the estimate is not really unique unless we have fixed the value of σ_ε by assumption. In light of this discussion, let’s simply assume that σ_ε = 1, so that we only have the K parameters in β to estimate. The first order conditions in this case are given by

\[
\frac{\partial \mathcal{L}(\hat{\beta}_{ML})}{\partial \beta} = 0
= \sum_{i=1}^N \Bigl\{ \frac{d_i}{\Phi(X_i\hat{\beta}_{ML})} - \frac{1 - d_i}{1 - \Phi(X_i\hat{\beta}_{ML})} \Bigr\} \phi(X_i\hat{\beta}_{ML})\, X_i',
\]
where φ denotes the standard normal probability density function. This can be rewritten as
\[
0 = \sum_{i=1}^N \Bigl\{ \frac{d_i - \Phi(X_i\hat{\beta}_{ML})}{\Phi(X_i\hat{\beta}_{ML})\bigl(1 - \Phi(X_i\hat{\beta}_{ML})\bigr)} \Bigr\} \phi(X_i\hat{\beta}_{ML})\, X_i'.
\]
Note that the maximum likelihood estimator is chosen to set a certain average of the residuals equal to 0 in the sample, where the residual is

\[
d_i - \Phi(X_i\hat{\beta}_{ML}),
\]
since the expected value of d_i is given by Φ(X_i β̂_ML). The logit model is developed in exactly the same way, except that we assume that the distribution of ε is logistic instead of normal. The inferences one derives from a binary choice model are typically not much different no matter what distribution of ε is assumed.
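A minimal sketch of estimating the probit model numerically is given below, with σ_ε normalized to 1 as in the text. The simulated data, the starting values, and the choice of scipy’s BFGS routine are illustrative assumptions; a clip keeps the logarithms finite.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# A sketch of probit estimation by numerical maximization of the log likelihood.
rng = np.random.default_rng(3)
N = 1000
X = np.column_stack([np.ones(N), rng.normal(size=N)])       # intercept plus one characteristic
beta_true = np.array([0.2, 1.0])                             # illustrative values
d = (X @ beta_true + rng.normal(size=N) > 0).astype(float)   # d_i = 1 iff U_i* > 0

def neg_loglik(beta):
    p = norm.cdf(X @ beta)                                   # Phi(X_i beta), sigma_eps = 1
    p = np.clip(p, 1e-10, 1 - 1e-10)                         # keep the logs finite
    return -np.sum(d * np.log(p) + (1 - d) * np.log(1 - p))

res = minimize(neg_loglik, x0=np.zeros(X.shape[1]), method="BFGS")
print("beta_hat =", res.x)
```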

5 Computation of Maximum Likelihood Estimates

In most cases the computation of maximum likelihood estimators must be carried out numerically, since the first order conditions do not have closed form solutions. We typically solve the likelihood equations - that is, the first order conditions - iteratively, using methods based on low order Taylor series expansions. Let’s work with a scalar parameter example for simplicity, which we will denote by θ. The maximum likelihood estimate of θ, being the solution to the first order condition, sets

\[
\frac{\partial \mathcal{L}(\hat{\theta}_{ML})}{\partial \theta} = 0.
\]
Assuming that the log likelihood is continuously differentiable, let us expand this function around some

initial guess of the parameter value, which we will denote by θ_0. Then write

\[
0 = \frac{\partial \mathcal{L}(\hat{\theta}_{ML})}{\partial \theta}
\simeq \frac{\partial \mathcal{L}(\theta_0)}{\partial \theta} + \bigl(\hat{\theta}_{ML} - \theta_0\bigr)\frac{\partial^2 \mathcal{L}(\theta_0)}{\partial \theta^2}.
\]
We can rearrange this last expression and write

\[
\hat{\theta}_{ML} = \theta_0 - \Bigl[\frac{\partial^2 \mathcal{L}(\theta_0)}{\partial \theta^2}\Bigr]^{-1}\frac{\partial \mathcal{L}(\theta_0)}{\partial \theta}. \tag{1}
\]
Now because this first-order Taylor series expansion does not hold exactly unless the log likelihood happens to be quadratic in θ, the value on the left hand side of (1) does not actually satisfy the first order conditions. Instead, we would take the value labelled θ̂_ML and call it θ_1. If the log likelihood function is well-behaved, i.e., globally concave in θ, then θ_1 should be closer to the true solution to the likelihood equation than θ_0. We would proceed with our iteration of (1) until the values θ_0, θ_1, ..., θ_m converged according to some criterion we would have to define. This criterion usually involves the first derivative ∂\mathcal{L}(θ_m)/∂θ being sufficiently close to 0, small changes in the value of the log likelihood between successive iterations, or small relative changes in the parameter guesses θ_m and θ_{m+1} between iterations.
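The updating rule in (1) translates directly into a few lines of code. The sketch below implements the iteration for a scalar parameter, taking the first and second derivative functions as inputs; the function name, the convergence tolerance, and the iteration cap are arbitrary choices for illustration.

```python
def newton_mle(score, hessian, theta0, tol=1e-8, max_iter=100):
    """Iterate theta_{m+1} = theta_m - score(theta_m)/hessian(theta_m), as in (1),
    until the first derivative is close to zero (scalar-parameter sketch)."""
    theta = theta0
    for _ in range(max_iter):
        theta = theta - score(theta) / hessian(theta)
        if abs(score(theta)) < tol:       # convergence criterion on the first derivative
            break
    return theta
```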

5.1 Example: MLE for the Negative Exponential Distribution Parameter α

We know from our earlier example that the MLE for α is simply the inverse of the sample mean. Let’s look at the computation of α. Say that we have a sample of N = 100, and that the total time in unemployment of the sample members was T ≡ \sum_{i=1}^N t_i = 148 months. Then since
\[
\frac{d\mathcal{L}(\alpha)}{d\alpha} = \frac{N}{\alpha} - \sum_{i=1}^N t_i = \frac{100}{\alpha} - 148,
\]
and
\[
\frac{d^2\mathcal{L}(\alpha)}{d\alpha^2} = -\frac{100}{\alpha^2},
\]
the updating equation is written as

\[
\alpha_{m+1} = \alpha_m - \Bigl[-\frac{100}{\alpha_m^2}\Bigr]^{-1}\Bigl(\frac{100}{\alpha_m} - 148\Bigr)
= \alpha_m + \frac{\alpha_m^2}{100}\Bigl(\frac{100}{\alpha_m} - 148\Bigr).
\]

For illustration we chose an initial guess of α_0 away from the true value, which we know is 100/148 ≈ .675676. We set α_0 = .5, and found that:

α_1   .630000
α_2   .672588
α_3   .675662
α_4   .675676
α_5   .675676

Thus convergence was very rapid using this procedure for this simple model. In practice, to solve the likelihood equations in more complicated cases, certain modifications are necessary to obtain good convergence results.
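Plugging the derivatives for this example into the generic iteration sketched above reproduces the table; the short check below assumes the hypothetical newton_mle function from that sketch.

```python
# Check of the arithmetic for the worked example: N = 100, sum of t_i = 148 months.
score = lambda a: 100.0 / a - 148.0       # first derivative of the log likelihood
hessian = lambda a: -100.0 / a**2         # second derivative

alpha_hat = newton_mle(score, hessian, theta0=0.5)
print(alpha_hat)                          # converges to 100/148 = 0.675676...
```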
