
18 BAYESIAN ESTIMATION AND INFERENCE

18.1 INTRODUCTION

The preceding chapters (and those that follow this one) are focused primarily on parametric specifications and classical estimation methods. These elements of the econometric method present a bit of a methodological dilemma for the researcher. They appear to straightjacket the analyst into a fixed and immutable specification of the model. But in any analysis, there is uncertainty as to the magnitudes, sometimes the signs, and, at the extreme, even the meaning of parameters. It is rare that the presentation of a set of empirical results has not been preceded by at least some exploratory analysis. Proponents of the Bayesian methodology argue that the process of “estimation” is not one of deducing the values of fixed parameters, but rather, in accordance with the scientific method, one of continually updating and sharpening our subjective beliefs about the state of the world. Of course, this adherence to a subjective approach to model building is not necessarily a virtue. If one holds that “models” and “parameters” represent objective truths that the analyst seeks to discover, then the subjectivity of Bayesian methods may be less than perfectly comfortable.

Contemporary applications of Bayesian methods typically advance little of this theological debate. The modern practice of Bayesian econometrics is much more pragmatic. As we will see below in several examples, Bayesian methods have produced some remarkably efficient solutions to difficult estimation problems. Researchers often choose the techniques on practical grounds, rather than in adherence to their philosophical basis; indeed, for some, the Bayesian estimator is merely an algorithm.1

Bayesian methods have been employed by econometricians since well before Zellner’s classic (1971) presentation of the methodology to economists, but until fairly recently, were more or less at the margin of the field. With recent advances in technique (notably the Gibbs sampler) and the advances in computer software and hardware that have made simulation-based estimation routine, Bayesian methods that rely heavily on both have become widespread throughout the social sciences. There are libraries of work on Bayesian econometrics and a rapidly expanding applied literature.2

1For example, from the home website of MLWin, a widely used program for multilevel (random parameters) modeling, http://www.cmm.bris.ac.uk/MLwiN/features/mcmc.shtml, we find “(MCMC) methods allow Bayesian models to be fitted, where prior distributions for the model parameters are specified. By default MLwiN sets diffuse priors which can be used to approximate maximum likelihood estimation.” Train (2001) is an interesting application that compares Bayesian and classical estimators of a random parameters model.


This chapter will introduce the vocabulary and techniques of Bayesian econometrics. Section 18.2 lays out the essential foundation for the method. The canonical application, the linear regression model, is developed in Section 18.3. Section 18.4 continues the methodological development. The fundamental tool of contemporary Bayesian econometrics, the Gibbs sampler, is presented in Section 18.5. Three applications and several more limited examples are presented in Sections 18.6, 18.7, and 18.8. Section 18.6 shows how to use the Gibbs sampler to estimate the parameters of a probit model without maximizing the likelihood function. This application also introduces the technique of data augmentation. Bayesian counterparts to the panel data random and fixed effects models are presented in Section 18.7. A hierarchical Bayesian treatment of the random parameters model is presented in Section 18.8 with a comparison to the classical treatment of the same model. Some conclusions are drawn in Section 18.9. The presentation here is nontechnical. A much more extensive entry-level presentation is given by Lancaster (2004). Intermediate-level presentations appear in Cameron and Trivedi (2005, Chapter 13) and Koop (2003). A more challenging treatment is offered in Geweke (2005). The other sources listed in footnote 2 are oriented to applications.

18.2 BAYES THEOREM AND THE POSTERIOR DENSITY

The centerpiece of the Bayesian methodology is the Bayes theorem: for events A and B, the conditional probability of event A given that B has occurred is

$$P(A\,|\,B) = \frac{P(B\,|\,A)P(A)}{P(B)}. \tag{18-1}$$

Paraphrased for our applications here, we would write

$$P(\text{parameters}\,|\,\text{data}) = \frac{P(\text{data}\,|\,\text{parameters})P(\text{parameters})}{P(\text{data})}.$$

In this setting, the data are viewed as constants whose distributions do not involve the parameters of interest. For the purpose of the study, we treat the data as only a fixed set of additional information to be used in updating our beliefs about the parameters. Note the similarity to (14-1). Thus, we write

$$P(\text{parameters}\,|\,\text{data}) \propto P(\text{data}\,|\,\text{parameters})P(\text{parameters}) \tag{18-2}$$
$$= \text{Likelihood function} \times \text{Prior density}.$$

The symbol ∝ means “is proportional to.” In the preceding equation, we have dropped the marginal density of the data, so what remains is not a proper density until it is scaled by what will be an inessential proportionality constant. The first term on the right is the joint distribution of the observed random variables y, given the parameters.

2Recent additions to the dozens of books on the subject include Gelman et al. (2004), Geweke (2005), Gill (2002), Koop (2003), Lancaster (2004), Congdon (2005), and Rossi et al. (2005). Readers with an historical bent will find Zellner (1971) and Leamer (1978) worthwhile reading. There are also many methodological surveys. Poirier and Tobias (2006) as well as Poirier (1988, 1995) sharply focus the nature of the methodological distinctions between the classical (frequentist) and Bayesian approaches.


As we shall analyze it here, this distribution is the normal distribution we have used in our previous analysis—see (14-1). The second term is the prior beliefs of the analyst. The left-hand side is the posterior density of the parameters, given the current body of data, or our revised beliefs about the distribution of the parameters after “seeing” the data. The posterior is a mixture of the prior information and the “current information,” that is, the data. Once obtained, this posterior density is available to be the prior density function when the next body of data or other usable information becomes available. The principle involved, which appears nowhere in the classical analysis, is one of continual accretion of knowledge about the parameters.

Traditional Bayesian estimation is heavily parameterized. The prior density and the likelihood function are crucial elements of the analysis, and both must be fully specified for estimation to proceed. The Bayesian “estimator” is the mean of the posterior density of the parameters, a quantity that is usually obtained by integration (when closed forms exist), by approximation of integrals by numerical techniques, or by Monte Carlo methods, which are discussed in Section 17.3.

Example 18.1 Bayesian Estimation of a Probability

Consider estimation of the probability that a production process will produce a defective product. In case 1, suppose the sampling design is to choose N = 25 items from the production line and count the number of defectives. If the probability that any item is defective is a constant θ between zero and one, then the likelihood for the sample of data is

$$L(\theta\,|\,\text{data}) = \theta^{D}(1-\theta)^{25-D},$$

where D is the number of defectives, say, 8. The maximum likelihood estimator of θ will be p = D/25 = 0.32, and the asymptotic variance of the maximum likelihood estimator is estimated by p(1 − p)/25 = 0.008704.

Now, consider a Bayesian approach to the same analysis. The posterior density is obtained by the following reasoning:

$$p(\theta\,|\,\text{data}) = \frac{p(\theta,\text{data})}{p(\text{data})} = \frac{p(\theta,\text{data})}{\int_\theta p(\theta,\text{data})\,d\theta} = \frac{p(\text{data}\,|\,\theta)\,p(\theta)}{p(\text{data})} = \frac{\text{Likelihood}(\text{data}\,|\,\theta)\times p(\theta)}{p(\text{data})},$$

where p(θ) is the prior density assumed for θ. [We have taken some license with the terminology, since the likelihood function is conventionally defined as L(θ | data).] Inserting the results of the sample first drawn, we have the posterior density:

$$p(\theta\,|\,\text{data}) = \frac{\theta^{D}(1-\theta)^{N-D}\,p(\theta)}{\int_\theta \theta^{D}(1-\theta)^{N-D}\,p(\theta)\,d\theta}.$$

What follows depends on the assumed prior for θ. Suppose we begin with a “noninformative” prior that treats all allowable values of θ as equally likely. This would imply a uniform distribution over (0, 1). Thus, p(θ) = 1, 0 ≤ θ ≤ 1. The denominator with this assumption is a beta integral (see Section E2.3) with parameters a = D + 1 and b = N − D + 1, so the posterior density is

$$p(\theta\,|\,\text{data}) = \frac{\theta^{D}(1-\theta)^{N-D}}{\Gamma(D+1)\Gamma(N-D+1)/\Gamma(D+1+N-D+1)} = \frac{\Gamma(N+2)\,\theta^{D}(1-\theta)^{N-D}}{\Gamma(D+1)\Gamma(N-D+1)}.$$

This is the density of a random variable with a beta distribution with parameters (α, β) = (D + 1, N − D + 1). (See Section B.4.6.) The mean of this random variable is (D + 1)/(N + 2) = 9/27 = 0.3333 (as opposed to 0.32, the MLE). The posterior variance is [(D + 1)(N − D + 1)]/[(N + 3)(N + 2)²] = 0.007936.


There is a loose end in this example. If the uniform prior were noninformative, that would mean that the only information we had was in the likelihood function. Why didn’t the Bayesian estimator and the MLE coincide? The reason is that the uniform prior over [0,1] is not really noninformative. It did introduce the information that θ must fall in the unit interval. The prior mean is 0.5 and the prior variance is 1/12. The posterior mean is an average of the MLE and the prior mean. Another less than obvious aspect of this result is the smaller variance of the Bayesian estimator. The principle that lies behind this (aside from the fact that the prior did in fact introduce some certainty in the estimator) is that the Bayesian estimator is conditioned on the specific sample data. The theory behind the classical MLE implies that it averages over the entire population that generates the data. This will always introduce a greater degree of “uncertainty” in the classical estimator compared to its Bayesian counterpart.
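The calculations in Example 18.1 are easy to replicate. The following is a minimal sketch in Python (assuming SciPy is available); D = 8 and N = 25 are the values from the example.

```python
# Reproduce Example 18.1: MLE versus the beta posterior under a uniform prior.
from scipy import stats

N, D = 25, 8                      # items sampled, number of defectives

p_mle = D / N                     # maximum likelihood estimate: 0.32
v_mle = p_mle * (1 - p_mle) / N   # estimated asymptotic variance: 0.008704

# Uniform prior on (0,1) implies the posterior is Beta(D + 1, N - D + 1)
posterior = stats.beta(D + 1, N - D + 1)
print(p_mle, v_mle)
print(posterior.mean(), posterior.var())  # 0.3333... and 0.007936...
```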

18.3 BAYESIAN ANALYSIS OF THE CLASSICAL REGRESSION MODEL

The complexity of the algebra involved in Bayesian analysis is often extremely burdensome. For the linear regression model, however, many fairly straightforward results have been obtained. To provide some of the flavor of the techniques, we present the full derivation only for some simple cases. In the interest of brevity, and to avoid the burden of excessive algebra, we refer the reader to one of the several sources that present the full derivation of the more complex cases.3

The classical normal regression model we have analyzed thus far is constructed around the conditional multivariate normal distribution N[Xβ, σ²I]. The interpretation is different here. In the sampling theory setting, this distribution embodies the information about the observed sample data given the assumed distribution and the fixed, albeit unknown, parameters of the model. In the Bayesian setting, this function summarizes the information that a particular realization of the data provides about the assumed distribution of the model parameters. To underscore that idea, we rename this joint density the likelihood for β and σ² given the data, so

$$L(\beta,\sigma^2\,|\,\mathbf{y},\mathbf{X}) = [2\pi\sigma^2]^{-n/2}\,e^{-(1/(2\sigma^2))(\mathbf{y}-\mathbf{X}\beta)'(\mathbf{y}-\mathbf{X}\beta)}. \tag{18-3}$$

For purposes of the results below, some reformulation is useful. Let d = n − K (the degrees of freedom parameter), and substitute y − Xβ = y − Xb − X(β − b) = e − X(β − b) in the exponent. Expanding this produces

$$-\frac{1}{2\sigma^2}(\mathbf{y}-\mathbf{X}\beta)'(\mathbf{y}-\mathbf{X}\beta) = -\frac{1}{2}\frac{ds^2}{\sigma^2} - \frac{1}{2}\frac{(\beta-\mathbf{b})'\mathbf{X}'\mathbf{X}(\beta-\mathbf{b})}{\sigma^2}.$$

After a bit of manipulation (note that n/2 = d/2 + K/2), the likelihood may be written

$$L(\beta,\sigma^2\,|\,\mathbf{y},\mathbf{X}) = [2\pi]^{-d/2}[\sigma^2]^{-d/2}e^{-(d/2)(s^2/\sigma^2)}\,[2\pi]^{-K/2}[\sigma^2]^{-K/2}e^{-(1/2)(\beta-\mathbf{b})'[\sigma^2(\mathbf{X}'\mathbf{X})^{-1}]^{-1}(\beta-\mathbf{b})}.$$

3These sources include Judge et al. (1982, 1985), Maddala (1977a), Mittelhammer et al. (2000), and the canonical reference for econometricians, Zellner (1971). A remarkable feature of the current literature is the degree to which the analytical components have become ever simpler while the applications have become progressively more complex. This will become evident in Sections 18.5–18.7.


This density embodies all that we have to learn about the parameters from the observed data. Because the data are taken to be constants in the joint density, we may multiply this joint density by the (very carefully chosen) inessential (because it does not involve β or σ²) constant function of the observations,

$$A = \frac{\left(\frac{ds^2}{2}\right)^{(d/2)+1}}{\Gamma\left(\frac{d}{2}+1\right)}\,[2\pi]^{d/2}\,|(\mathbf{X}'\mathbf{X})^{-1}|^{-1/2}.$$

For convenience, let v = d/2. Then, multiplying L(β, σ² | y, X) by A gives

$$L(\beta,\sigma^2\,|\,\mathbf{y},\mathbf{X}) \propto \frac{[vs^2]^{v+1}}{\Gamma(v+1)}\left(\frac{1}{\sigma^2}\right)^{v}e^{-vs^2(1/\sigma^2)}\,[2\pi]^{-K/2}\,|\sigma^2(\mathbf{X}'\mathbf{X})^{-1}|^{-1/2}\,e^{-(1/2)(\beta-\mathbf{b})'[\sigma^2(\mathbf{X}'\mathbf{X})^{-1}]^{-1}(\beta-\mathbf{b})}. \tag{18-4}$$

The likelihood function is proportional to the product of a gamma density for z = 1/σ² with parameters λ = vs² and P = v + 1 [see (B-39); this is an inverted gamma distribution] and a K-variate normal density for β | σ² with mean vector b and covariance matrix σ²(X'X)⁻¹. The reason will be clear shortly.

18.3.1 ANALYSIS WITH A NONINFORMATIVE PRIOR

The departure point for the Bayesian analysis of the model is the specification of a prior distribution. This distribution gives the analyst’s prior beliefs about the parameters of the model. One of two approaches is generally taken. If no prior information is known about the parameters, then we can specify a noninformative prior that reflects that. We do this by specifying a “flat” prior for the parameter in question:4

g(parameter) ∝ constant.

There are different ways that one might characterize the lack of prior information. The implication of a flat prior is that within the range of valid values for the parameter, all intervals of equal length—hence, in principle, all values—are equally likely. The second possibility, an informative prior, is treated in the next section. The posterior density is the result of combining the likelihood function with the prior density. Because it pools the full set of information available to the analyst, once the data have been drawn, the posterior density would be interpreted the same way the prior density was before the data were obtained.

To begin, we analyze the case in which σ² is assumed to be known. This assumption is obviously unrealistic, and we do so only to establish a point of departure. Using Bayes theorem, we construct the posterior density,

$$f(\beta\,|\,\mathbf{y},\mathbf{X},\sigma^2) = \frac{L(\beta\,|\,\sigma^2,\mathbf{y},\mathbf{X})\,g(\beta\,|\,\sigma^2)}{f(\mathbf{y})} \propto L(\beta\,|\,\sigma^2,\mathbf{y},\mathbf{X})\,g(\beta\,|\,\sigma^2),$$

4That this “improper” density might not integrate to one is only a minor difficulty. Any constant of integration would ultimately drop out of the final result. See Zellner (1971, pp. 41–53) for a discussion of noninformative priors.


assuming that the distribution of X does not depend on β or σ². Because g(β | σ²) ∝ a constant, this density is the one in (18-4). For now, write

$$f(\beta\,|\,\sigma^2,\mathbf{y},\mathbf{X}) \propto h(\sigma^2)\,[2\pi]^{-K/2}\,|\sigma^2(\mathbf{X}'\mathbf{X})^{-1}|^{-1/2}\,e^{-(1/2)(\beta-\mathbf{b})'[\sigma^2(\mathbf{X}'\mathbf{X})^{-1}]^{-1}(\beta-\mathbf{b})}, \tag{18-5}$$

where

$$h(\sigma^2) = \frac{[vs^2]^{v+1}}{\Gamma(v+1)}\left(\frac{1}{\sigma^2}\right)^{v}e^{-vs^2(1/\sigma^2)}. \tag{18-6}$$

For the present, we treat h(σ²) simply as a constant that involves σ², not as a probability density; (18-5) is conditional on σ². Thus, the posterior density f(β | σ², y, X) is proportional to a multivariate normal distribution with mean b and covariance matrix σ²(X'X)⁻¹.

This result is familiar, but it is interpreted differently in this setting. First, we have combined our prior information about β (in this case, no information) and the sample information to obtain a posterior distribution. Thus, on the basis of the sample data in hand, we obtain a distribution for β with mean b and covariance matrix σ²(X'X)⁻¹. The result is dominated by the sample information, as it should be if there is no prior information. In the absence of any prior information, the mean of the posterior distribution, which is a type of Bayesian point estimate, is the sampling theory estimator.

To generalize the preceding to an unknown σ², we specify a noninformative prior distribution for ln σ over the entire real line.5 By the change of variable formula, if g(ln σ) is constant, then g(σ²) is proportional to 1/σ².6 Assuming that β and σ² are independent, we now have the noninformative joint prior distribution:

$$g(\beta,\sigma^2) = g_\beta(\beta)\,g_{\sigma^2}(\sigma^2) \propto \frac{1}{\sigma^2}.$$

We can obtain the joint posterior distribution for β and σ² by using

$$f(\beta,\sigma^2\,|\,\mathbf{y},\mathbf{X}) = L(\beta\,|\,\sigma^2,\mathbf{y},\mathbf{X})\,g_{\sigma^2}(\sigma^2) \propto L(\beta\,|\,\sigma^2,\mathbf{y},\mathbf{X})\times\frac{1}{\sigma^2}. \tag{18-7}$$

For the same reason as before, we multiply g_{σ²}(σ²) by a well-chosen constant, this time vs²Γ(v + 1)/Γ(v + 2) = vs²/(v + 1). Multiplying (18-5) by this constant times g_{σ²}(σ²) and inserting h(σ²) gives the joint posterior for β and σ², given y and X:

$$f(\beta,\sigma^2\,|\,\mathbf{y},\mathbf{X}) \propto \frac{[vs^2]^{v+2}}{\Gamma(v+2)}\left(\frac{1}{\sigma^2}\right)^{v+1}e^{-vs^2(1/\sigma^2)}\,[2\pi]^{-K/2}\,|\sigma^2(\mathbf{X}'\mathbf{X})^{-1}|^{-1/2}\,e^{-(1/2)(\beta-\mathbf{b})'[\sigma^2(\mathbf{X}'\mathbf{X})^{-1}]^{-1}(\beta-\mathbf{b})}.$$

To obtain the marginal posterior distribution for β, it is now necessary to integrate σ² out of the joint distribution (and vice versa to obtain the marginal distribution for σ²). By collecting the terms, f(β, σ² | y, X) can be written as

$$f(\beta,\sigma^2\,|\,\mathbf{y},\mathbf{X}) \propto A\times\left(\frac{1}{\sigma^2}\right)^{P-1}e^{-\lambda(1/\sigma^2)},$$

5See Zellner (1971) for justification of this prior distribution.
6Many treatments of this model use σ rather than σ² as the parameter of interest. The end results are identical. We have chosen this parameterization because it makes manipulation of the likelihood function with a gamma prior distribution especially convenient. See Zellner (1971, pp. 44–45) for discussion.


where

$$A = \frac{[vs^2]^{v+2}}{\Gamma(v+2)}\,[2\pi]^{-K/2}\,|(\mathbf{X}'\mathbf{X})^{-1}|^{-1/2},$$
$$P = v + 2 + K/2 = (n-K)/2 + 2 + K/2 = (n+4)/2,$$

and

$$\lambda = vs^2 + \tfrac{1}{2}(\beta-\mathbf{b})'\mathbf{X}'\mathbf{X}(\beta-\mathbf{b}),$$

so the marginal posterior distribution for β is

$$\int_0^\infty f(\beta,\sigma^2\,|\,\mathbf{y},\mathbf{X})\,d\sigma^2 \propto A\int_0^\infty \left(\frac{1}{\sigma^2}\right)^{P-1}e^{-\lambda(1/\sigma^2)}\,d\sigma^2.$$

To do the integration, we have to make a change of variable; d(1/σ²) = −(1/σ²)² dσ², so dσ² = −(1/σ²)⁻² d(1/σ²). Making the substitution—the sign of the integral changes twice, once for the Jacobian and back again because the integral from σ² = 0 to ∞ is the negative of the integral from (1/σ²) = 0 to ∞—we obtain

$$\int_0^\infty f(\beta,\sigma^2\,|\,\mathbf{y},\mathbf{X})\,d\sigma^2 \propto A\int_0^\infty \left(\frac{1}{\sigma^2}\right)^{P-3}e^{-\lambda(1/\sigma^2)}\,d\left(\frac{1}{\sigma^2}\right) = A\times\frac{\Gamma(P-2)}{\lambda^{P-2}}.$$

Reinserting the expressions for A, P, and λ produces

$$f(\beta\,|\,\mathbf{y},\mathbf{X}) \propto \frac{\dfrac{[vs^2]^{v+2}\,\Gamma(v+K/2)}{\Gamma(v+2)}\,[2\pi]^{-K/2}\,|(\mathbf{X}'\mathbf{X})^{-1}|^{-1/2}}{\left[vs^2+\tfrac{1}{2}(\beta-\mathbf{b})'\mathbf{X}'\mathbf{X}(\beta-\mathbf{b})\right]^{v+K/2}}. \tag{18-8}$$

This density is proportional to a multivariate t distribution7 and is a generalization of the familiar univariate distribution we have used at various points. This distribution has a degrees of freedom parameter, d = n − K, mean b, and covariance matrix (d/(d − 2)) × [s²(X'X)⁻¹]. Each element of the K-element vector β has a marginal distribution that is the univariate t distribution with degrees of freedom n − K, mean b_k, and variance equal to the kth diagonal element of the covariance matrix given earlier. Once again, this is the same as our sampling theory result. The difference is a matter of interpretation. In the current context, the estimated distribution is for β and is centered at b.
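To make these results concrete, the following sketch computes the posterior mean b and the multivariate t covariance matrix (d/(d − 2))s²(X'X)⁻¹ described above. The simulated data set and the coefficient values are hypothetical, chosen only for illustration.

```python
# Posterior for beta under the noninformative prior: multivariate t with
# mean b (the OLS estimate) and covariance (d/(d-2)) s^2 (X'X)^{-1}, d = n - K.
import numpy as np

rng = np.random.default_rng(0)
n, K = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
y = X @ np.array([1.0, 0.5, -0.25]) + rng.normal(scale=2.0, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                 # posterior mean = OLS estimate
d = n - K                             # degrees of freedom
s2 = (y - X @ b) @ (y - X @ b) / d
post_cov = (d / (d - 2)) * s2 * XtX_inv

print("posterior mean:", b)
print("posterior std. devs.:", np.sqrt(np.diag(post_cov)))
```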

18.3.2 ESTIMATION WITH AN INFORMATIVE PRIOR DENSITY

Once we leave the simple case of noninformative priors, matters become quite complicated, both at a practical level and, methodologically, in terms of just where the prior comes from. The integration of σ² out of the posterior in (18-7) is complicated by itself. It is made much more so if the prior distributions of β and σ² are at all involved. Partly to offset these difficulties, researchers usually use what is called a conjugate prior, which

7See, for example, Judge et al. (1985) for details. The expression appears in Zellner (1971, p. 67). Note that the exponent in the denominator is v + K/2 = n/2.


is one that has the same form as the conditional density and is therefore amenable to the integration needed to obtain the marginal distributions.8

Example 18.2 Estimation with a Conjugate Prior

We continue Example 18.1, but we now assume a conjugate prior. For likelihood functions involving proportions, the beta prior is a common device, for reasons that will emerge shortly. The beta prior is

$$p(\theta) = \frac{\Gamma(\alpha+\beta)\,\theta^{\alpha-1}(1-\theta)^{\beta-1}}{\Gamma(\alpha)\Gamma(\beta)}.$$

Then, the posterior density becomes

$$p(\theta\,|\,\text{data}) = \frac{\dfrac{\Gamma(\alpha+\beta)\theta^{\alpha-1}(1-\theta)^{\beta-1}}{\Gamma(\alpha)\Gamma(\beta)}\,\theta^{D}(1-\theta)^{N-D}}{\displaystyle\int_0^1 \dfrac{\Gamma(\alpha+\beta)\theta^{\alpha-1}(1-\theta)^{\beta-1}}{\Gamma(\alpha)\Gamma(\beta)}\,\theta^{D}(1-\theta)^{N-D}\,d\theta} = \frac{\theta^{D+\alpha-1}(1-\theta)^{N-D+\beta-1}}{\displaystyle\int_0^1 \theta^{D+\alpha-1}(1-\theta)^{N-D+\beta-1}\,d\theta}.$$

The posterior density is, once again, a beta distribution, with parameters (D + α, N − D + β). The posterior mean is

$$E[\theta\,|\,\text{data}] = \frac{D+\alpha}{N+\alpha+\beta}.$$

(Our previous choice of the uniform density was equivalent to α = β = 1.) Suppose we choose a prior that conforms to a prior mean of 0.5, but with less mass near zero and one than in the center, such as α = β = 2. Then, the posterior mean would be (8 + 2)/(25 + 4) = 0.34483. (This is yet larger than the previous estimator. The reason is that the prior variance is now smaller than 1/12, so the prior mean, still 0.5, receives yet greater weight than it did in the previous example.)
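Because the beta prior is conjugate, the updating in Example 18.2 reduces to adding the sample counts to the prior parameters. A minimal sketch:

```python
# Conjugate beta-binomial updating (Example 18.2): the posterior is
# Beta(D + alpha, N - D + beta), so the update is pure arithmetic.
def beta_update(D, N, alpha, beta):
    # returns posterior parameters and posterior mean (D+alpha)/(N+alpha+beta)
    return D + alpha, N - D + beta, (D + alpha) / (N + alpha + beta)

print(beta_update(8, 25, 1, 1))  # uniform prior: mean 9/27  = 0.3333
print(beta_update(8, 25, 2, 2))  # Beta(2,2) prior: mean 10/29 = 0.3448
```

As the text notes, the posterior from one sample can serve as the prior for the next, so the same function can be applied sequentially as new data arrive.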

Suppose that we assume that the prior beliefs about β may be summarized in a K-variate normal distribution with mean β₀ and variance matrix Σ₀. Once again, it is illuminating to begin with the case in which σ² is assumed to be known. Proceeding in exactly the same fashion as before, we would obtain the following result: The posterior density of β conditioned on σ² and the data will be normal with

$$E[\beta\,|\,\sigma^2,\mathbf{y},\mathbf{X}] = \left\{\Sigma_0^{-1} + [\sigma^2(\mathbf{X}'\mathbf{X})^{-1}]^{-1}\right\}^{-1}\left\{\Sigma_0^{-1}\beta_0 + [\sigma^2(\mathbf{X}'\mathbf{X})^{-1}]^{-1}\mathbf{b}\right\} = \mathbf{F}\beta_0 + (\mathbf{I}-\mathbf{F})\mathbf{b}, \tag{18-9}$$

where

$$\mathbf{F} = \left\{\Sigma_0^{-1} + [\sigma^2(\mathbf{X}'\mathbf{X})^{-1}]^{-1}\right\}^{-1}\Sigma_0^{-1} = \left\{[\text{prior variance}]^{-1} + [\text{conditional variance}]^{-1}\right\}^{-1}[\text{prior variance}]^{-1}. \tag{18-10}$$

This vector is a matrix weighted average of the prior and the least squares (sample) coefficient estimates, where the weights are the inverses of the prior and the conditional

8Our choice of noninformative prior for ln σ led to a convenient prior for σ² in our derivation of the posterior for β. The idea that the prior can be specified arbitrarily in whatever form is mathematically convenient is very troubling; it is supposed to represent the accumulated prior belief about the parameter. On the other hand, it could be argued that the conjugate prior is the posterior of a previous analysis, which could justify its form. The issue of how priors should be specified is one of the focal points of the methodological debate. “Non-Bayesians” argue that it is disingenuous to claim the methodological high ground and then base the crucial prior density in a model purely on the basis of mathematical convenience. In a small sample, this assumed prior is going to dominate the results, whereas in a large one, the sampling theory estimates will dominate anyway.


covariance matrices.9 The smaller the variance of the estimator, the larger its weight, which makes sense. Also, still taking σ² as known, we can write the variance of the posterior normal distribution as

$$\text{Var}[\beta\,|\,\mathbf{y},\mathbf{X},\sigma^2] = \left\{\Sigma_0^{-1} + [\sigma^2(\mathbf{X}'\mathbf{X})^{-1}]^{-1}\right\}^{-1}. \tag{18-11}$$

Notice that the posterior variance combines the prior and conditional variances on the basis of their inverses.10 We may interpret the noninformative prior as having infinite elements in Σ₀. This assumption would reduce this case to the earlier one.

Once again, it is necessary to account for the unknown σ². If our prior over σ² is to be informative as well, then the resulting distribution can be extremely cumbersome. A conjugate prior for β and σ² that can be used is

$$g(\beta,\sigma^2) = g_{\beta|\sigma^2}(\beta\,|\,\sigma^2)\,g_{\sigma^2}(\sigma^2), \tag{18-12}$$

where g_{β|σ²}(β | σ²) is normal, with mean β₀ and variance σ²A, and

$$g_{\sigma^2}(\sigma^2) = \frac{[m\sigma_0^2]^{m+1}}{\Gamma(m+1)}\left(\frac{1}{\sigma^2}\right)^{m}e^{-m\sigma_0^2(1/\sigma^2)}. \tag{18-13}$$

This distribution is an inverted gamma distribution. It implies that 1/σ² has a gamma distribution. The prior mean for σ² is σ₀² and the prior variance is σ₀⁴/(m − 1).11 The product in (18-12) produces what is called a normal-gamma prior, which is the natural conjugate prior for this form of the model. By integrating out σ², we would obtain the prior marginal for β alone, which would be a multivariate t distribution.12

Combining the prior in (18-12)–(18-13) with the likelihood function produces the joint posterior distribution for β and σ². Finally, the marginal posterior distribution for β is obtained by integrating out σ². It has been shown that this posterior distribution is multivariate t with

$$E[\beta\,|\,\mathbf{y},\mathbf{X}] = \left\{[\bar\sigma^2\mathbf{A}]^{-1} + [\bar\sigma^2(\mathbf{X}'\mathbf{X})^{-1}]^{-1}\right\}^{-1}\left\{[\bar\sigma^2\mathbf{A}]^{-1}\beta_0 + [\bar\sigma^2(\mathbf{X}'\mathbf{X})^{-1}]^{-1}\mathbf{b}\right\} \tag{18-14}$$

and

$$\text{Var}[\beta\,|\,\mathbf{y},\mathbf{X}] = \left(\frac{j}{j-2}\right)\left\{[\bar\sigma^2\mathbf{A}]^{-1} + [\bar\sigma^2(\mathbf{X}'\mathbf{X})^{-1}]^{-1}\right\}^{-1}, \tag{18-15}$$

where j is a degrees of freedom parameter and σ̄² is the Bayesian estimate of σ². The prior degrees of freedom m is a parameter of the prior distribution for σ² that would have been determined at the outset. (See the following example.) Once again, it is clear that as the amount of data increases, the posterior density, and the estimates thereof, converge to the sampling theory results.

9Note that it will not follow that individual elements of the posterior mean vector lie between those of β₀ and b. See Judge et al. (1985, pp. 109–110) and Chamberlain and Leamer (1976).
10Precisely this estimator was proposed by Theil and Goldberger (1961) as a way of combining a previously obtained estimate of a parameter and a current body of new data. They called their result a “mixed estimator.” The term “mixed estimation” takes an entirely different meaning in the current literature, as we saw in Chapter 17.
11You can show this result by using gamma integrals. Note that the density is a function of 1/σ² = 1/x in the formula of (B-39), so to obtain E[σ²], we use the analog of E[1/x] = λ/(P − 1) and E[(1/x)²] = λ²/[(P − 1)(P − 2)]. In the density for (1/σ²), the counterparts to λ and P are mσ₀² and m + 1.
12Full details of this (lengthy) derivation appear in Judge et al. (1985, pp. 106–110) and Zellner (1971).


TABLE 18.1 Estimates of the MPC

Years        Estimated MPC   Variance of b   Degrees of Freedom   Estimated σ
1940–1950    0.6848014       0.061878        9                    24.954
1950–2000    0.92481         0.000065865     49                   92.244

Example 18.3 Bayesian Estimate of the Marginal Propensity to Consume

In Example 3.2, an estimate of the marginal propensity to consume is obtained using 11 observations from 1940 to 1950, with the results shown in the top row of Table 18.1. A classical 95 percent confidence interval for β based on these estimates is (0.1221, 1.2475). (The very wide interval probably results from the obviously poor specification of the model.) Based on noninformative priors for β and σ², we would estimate the posterior density for β to be univariate t with 9 degrees of freedom, with mean 0.6848014 and variance (11/9)0.061878 = 0.075628. An HPD interval for β would coincide with the confidence interval. Using the fourth quarter (yearly) values of the 1950–2000 data used in Example 5.3, we obtain the new estimates that appear in the second row of the table.

We take the first estimate and its estimated distribution as our prior for β and obtain a posterior density for β based on an informative prior instead. We assume for this exercise that σ² may be taken as known at the sample value of 24.954. Then,

$$\bar{b} = \left[\frac{1}{0.000065865}+\frac{1}{0.061878}\right]^{-1}\left[\frac{0.92481}{0.000065865}+\frac{0.6848014}{0.061878}\right] = 0.92455.$$

The weighted average is overwhelmingly dominated by the far more precise sample estimate from the larger sample. The posterior variance is the inverse in brackets, which is 0.000065795. This is close to the variance of the latter estimate. An HPD interval can be formed in the familiar fashion. It will be slightly narrower than the confidence interval, because the variance of the posterior distribution is slightly smaller than the variance of the sampling estimator. This reduction is the value of the prior information. (As we see here, the prior is not particularly informative.)
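The precision-weighted average in Example 18.3 can be verified directly; the following sketch reproduces the posterior mean and variance from the two rows of Table 18.1 (up to rounding):

```python
# Example 18.3: precision-weighted combination of the prior (1940-1950)
# and sample (1950-2000) estimates of the MPC from Table 18.1.
b_prior, v_prior = 0.6848014, 0.061878
b_sample, v_sample = 0.92481, 0.000065865

post_var = 1.0 / (1.0 / v_prior + 1.0 / v_sample)
post_mean = post_var * (b_prior / v_prior + b_sample / v_sample)

print(post_mean, post_var)  # about 0.92455 and 0.000065795
```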

18.4 BAYESIAN INFERENCE

The posterior density is the Bayesian counterpart to the likelihood function. It embodies the information that is available to make inference about the econometric model. As we have seen, the mean and variance of the posterior distribution correspond to the classical (sampling theory) point estimator and asymptotic variance, although they are interpreted differently. Before we examine more intricate applications of Bayesian inference, it is useful to formalize some other components of the method, point and interval estimation and the Bayesian equivalent of testing a hypothesis.13

18.4.1 POINT ESTIMATION

The posterior density function embodies the prior and the likelihood and therefore contains all the researcher’s information about the parameters. But for purposes of presenting results, the density is somewhat imprecise, and one normally prefers a point

13We do not include prediction in this list. The Bayesian approach would treat the prediction problem as one of estimation in the same fashion as “parameter” estimation. The value to be forecasted is among the unknown elements of the model that would be characterized by a prior and would enter the posterior density in a symmetric fashion along with the other parameters.


or interval estimate. The natural approach would be to use the mean of the posterior distribution as the estimator. For the noninformative prior, we use b, the sampling theory estimator.

One might ask at this point, why bother? These Bayesian point estimates are identical to the sampling theory estimates. All that has changed is our interpretation of the results. This situation is, however, exactly the way it should be. Remember that we entered the analysis with noninformative priors for β and σ². Therefore, the only information brought to bear on estimation is the sample data, and it would be peculiar if anything other than the sampling theory estimates emerged at the end. The results do change when our prior brings out-of-sample information into the estimates, as we shall see below.

The results will also change if we change our motivation for estimating β. The parameter estimates have been treated thus far as if they were an end in themselves. But in some settings, parameter estimates are obtained so as to enable the analyst to make a decision. Consider, then, a loss function, H(β̂, β), which quantifies the cost of basing a decision on an estimate β̂ when the parameter is β. The expected, or average loss is

$$E_\beta[H(\hat\beta,\beta)] = \int_\beta H(\hat\beta,\beta)\,f(\beta\,|\,\mathbf{y},\mathbf{X})\,d\beta, \tag{18-16}$$

where the weighting function is the marginal posterior density. (The joint density for β and σ² would be used if the loss were defined over both.) The Bayesian point estimate is the parameter vector that minimizes the expected loss. If the loss function is a quadratic form in (β̂ − β), then the mean of the posterior distribution is the “minimum expected loss” (MELO) estimator. The proof is simple. For this case,

$$E[H(\hat\beta,\beta)\,|\,\mathbf{y},\mathbf{X}] = E\left[\tfrac{1}{2}(\hat\beta-\beta)'\mathbf{W}(\hat\beta-\beta)\,\big|\,\mathbf{y},\mathbf{X}\right].$$

To minimize this, we can use the result that

$$\partial E[H(\hat\beta,\beta)\,|\,\mathbf{y},\mathbf{X}]/\partial\hat\beta = E[\partial H(\hat\beta,\beta)/\partial\hat\beta\,|\,\mathbf{y},\mathbf{X}] = E[\mathbf{W}(\hat\beta-\beta)\,|\,\mathbf{y},\mathbf{X}].$$

The minimum is found by equating this derivative to 0, whence, because W is irrelevant, β̂ = E[β | y, X]. This kind of loss function would state that errors in the positive and negative direction are equally bad, and large errors are much worse than small errors. If the loss function were a linear function instead, then the MELO estimator would be the median of the posterior distribution. These results are the same in the case of the noninformative prior that we have just examined.
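The MELO results can be illustrated by simulation. In the sketch below, the “posterior” is a deliberately skewed stand-in (a gamma distribution, an assumption made only so that the mean and median differ); a grid search confirms that quadratic loss is minimized near the posterior mean and absolute loss near the posterior median.

```python
# MELO estimates from simulated posterior draws: quadratic loss is
# minimized near the posterior mean, absolute loss near the median.
import numpy as np

rng = np.random.default_rng(1)
draws = rng.gamma(shape=2.0, scale=1.0, size=50_000)  # skewed stand-in posterior

grid = np.linspace(0.5, 4.0, 701)                     # candidate point estimates
quad = [np.mean((c - draws) ** 2) for c in grid]
absl = [np.mean(np.abs(c - draws)) for c in grid]

print(grid[np.argmin(quad)], draws.mean())      # both near 2.0 (the mean)
print(grid[np.argmin(absl)], np.median(draws))  # both near 1.68 (the median)
```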

18.4.2 INTERVAL ESTIMATION

The counterpart to a confidence interval in this setting is an interval of the posterior distribution that contains a specified probability. Clearly, it is desirable to have this interval be as narrow as possible. For a unimodal density, this corresponds to an interval within which the density function is higher than any points outside it, which justifies the term highest posterior density (HPD) interval. For the case we have analyzed, which involves a symmetric distribution, we would form the HPD interval for β around the least squares estimate b, with terminal values taken from the standard t tables.


18.4.3 HYPOTHESIS TESTING

The Bayesian methodology treats the classical approach to hypothesis testing with a large amount of skepticism. Two issues are especially problematic. First, a close examination of only the work we have done in Chapter 5 will show that because we are using consistent estimators, with a large enough sample, we will ultimately reject any (nested) hypothesis unless we adjust the significance level of the test downward as the sample size increases. Second, the all-or-nothing approach of either rejecting or not rejecting a hypothesis provides no method of simply sharpening our beliefs. Even the most committed of analysts might be reluctant to discard a strongly held prior based on a single sample of data, yet this is what the sampling methodology mandates. (Note, for example, the uncomfortable dilemma this creates in footnote 20 in Chapter 10.) The Bayesian approach to hypothesis testing is much more appealing in this regard. Indeed, the approach might be more appropriately called “comparing hypotheses,” because it essentially involves only making an assessment of which of two hypotheses has a higher probability of being correct.

The Bayesian approach to hypothesis testing bears large similarity to Bayesian estimation.14 We have formulated two hypotheses, a “null,” denoted H0, and an alternative, denoted H1. These need not be complementary, as in H0: “statement A is true” versus H1: “statement A is not true,” since the intent of the procedure is not to reject one hypothesis in favor of the other. For simplicity, however, we will confine our attention to hypotheses about the parameters in the regression model, which often are complementary. Assume that before we begin our experimentation (data gathering, statistical analysis) we are able to assign prior probabilities P(H0) and P(H1) to the two hypotheses. The prior odds ratio is simply the ratio

$$\text{Odds}_{\text{prior}} = \frac{P(H_0)}{P(H_1)}. \tag{18-17}$$

For example, one’s uncertainty about the sign of a parameter might be summarized in a prior odds over H0: β ≥ 0 versus H1: β < 0 of 0.5/0.5 = 1. After the sample evidence is gathered, the prior will be modified, so the posterior is, in general,

$$\text{Odds}_{\text{posterior}} = B_{01} \times \text{Odds}_{\text{prior}}.$$

The value B01 is called the Bayes factor for comparing the two hypotheses. It summarizes the effect of the sample data on the prior odds. The end result, Oddsposterior, is a new odds ratio that can be carried forward as the prior in a subsequent analysis. The Bayes factor is computed by assessing the likelihoods of the data observed under the two hypotheses. We return to our first departure point, the likelihood of the data, given the parameters:

$$f(\mathbf{y}\,|\,\beta,\sigma^2,\mathbf{X}) = [2\pi\sigma^2]^{-n/2}e^{-(1/(2\sigma^2))(\mathbf{y}-\mathbf{X}\beta)'(\mathbf{y}-\mathbf{X}\beta)}. \tag{18-18}$$

Based on our priors for the parameters, the expected, or average likelihood, assuming that hypothesis j is true (j = 0, 1), is

$$f(\mathbf{y}\,|\,\mathbf{X},H_j) = E_{\beta,\sigma^2}[f(\mathbf{y}\,|\,\beta,\sigma^2,\mathbf{X},H_j)] = \int_{\sigma^2}\int_\beta f(\mathbf{y}\,|\,\beta,\sigma^2,\mathbf{X},H_j)\,g(\beta,\sigma^2)\,d\beta\,d\sigma^2.$$

14For extensive discussion, see Zellner and Siow (1980) and Zellner (1985, pp. 275–305).


(This conditional density is also the predictive density for y.) Therefore, based on the observed data, we use Bayes theorem to reassess the probability of H_j; the posterior probability is

$$P(H_j\,|\,\mathbf{y},\mathbf{X}) = \frac{f(\mathbf{y}\,|\,\mathbf{X},H_j)P(H_j)}{f(\mathbf{y})}.$$

The posterior odds ratio is P(H0 | y, X)/P(H1 | y, X), so the Bayes factor is

$$B_{01} = \frac{f(\mathbf{y}\,|\,\mathbf{X},H_0)}{f(\mathbf{y}\,|\,\mathbf{X},H_1)}.$$

Example 18.4 Posterior Odds for the Classical Regression Model

Zellner (1971) analyzes the setting in which there are two possible explanations for the variation in a dependent variable y:

$$\text{Model 0: } y = \mathbf{x}_0'\beta_0 + \varepsilon_0$$

and

$$\text{Model 1: } y = \mathbf{x}_1'\beta_1 + \varepsilon_1.$$

We will briefly sketch his results. We form informative priors for [β, σ²]_j, j = 0, 1, as specified in (18-12) and (18-13), that is, multivariate normal and inverted gamma, respectively. Zellner then derives the Bayes factor for the posterior odds ratio. The derivation is lengthy and complicated, but for large n, with some simplifying assumptions, a useful formulation emerges. First, assume that the priors for σ₀² and σ₁² are the same. Second, assume that

$$\left[|\mathbf{A}_0^{-1}|/|\mathbf{A}_0^{-1}+\mathbf{X}_0'\mathbf{X}_0|\right]\Big/\left[|\mathbf{A}_1^{-1}|/|\mathbf{A}_1^{-1}+\mathbf{X}_1'\mathbf{X}_1|\right] \rightarrow 1.$$

The first of these would be the usual situation, in which the uncertainty concerns the covariation between y_i and x_i, not the amount of residual variation (lack of fit). The second concerns the relative amounts of information in the prior (A) versus the likelihood (X'X). These matrices are the inverses of the covariance matrices, or the precision matrices. [Note how these two matrices form the matrix weights in the computation of the posterior mean in (18-9).] Zellner (p. 310) discusses this assumption at some length. With these two assumptions, he shows that as n grows large,15

$$B_{01} \approx \left[\frac{s_0^2}{s_1^2}\right]^{-(n+m)/2} = \left[\frac{1-R_0^2}{1-R_1^2}\right]^{-(n+m)/2}.$$

Therefore, the result favors the model that provides the better fit using R² as the fit measure. If we stretch Zellner’s analysis a bit by interpreting model 1 as “the model” and model 0 as “no model” (that is, the relevant part of β₀ = 0, so R₀² = 0), then the ratio simplifies to

$$B_{01} = \left[1-R_1^2\right]^{(n+m)/2}.$$

Thus, the better the fit of the regression, the lower the Bayes factor in favor of model 0 (no model), which makes intuitive sense.

Zellner and Siow (1980) have continued this analysis with noninformative priors for β and σ_j². Specifically, they use the flat prior for ln σ [see (18-7)] and a multivariate Cauchy prior (which has infinite variances) for β. Their main result (3.10) is

$$B_{01} = \frac{\sqrt{\pi}/2}{\Gamma[(k+1)/2]}\left(\frac{n-K}{2}\right)^{k/2}(1-R^2)^{(n-K-1)/2}.$$

This result is very much like the previous one, with some slight differences due to degrees of freedom corrections and the several approximations used to reach the first one.
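Zellner’s large-n approximation is a one-line computation. In the sketch below, the values of R², n, and m are hypothetical, chosen only for illustration:

```python
# Large-n Bayes factor approximation from Example 18.4:
# B01 = [(1 - R0^2)/(1 - R1^2)]^{-(n + m)/2}.
def bayes_factor_01(r2_0, r2_1, n, m):
    return ((1.0 - r2_0) / (1.0 - r2_1)) ** (-(n + m) / 2.0)

# "no model" (R0^2 = 0) against a regression with R^2 = 0.3:
print(bayes_factor_01(0.0, 0.3, n=100, m=4))  # (0.7)**52: tiny, favors model 1
```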

15A ratio of exponentials that appears in Zellner’s result (his equation 10.50) is omitted. To the order of approximation in the result, this ratio vanishes from the final result. (Personal correspondence from A. Zellner to the author.)


18.4.4 LARGE SAMPLE RESULTS

Although all statistical results for Bayesian estimators are necessarily “finite sample” (they are conditioned on the sample data), it remains of interest to consider how the estimators behave in large samples.16 Do Bayesian estimators “converge” to something? To do this exercise, it is useful to envision having a sample that is the entire population. Then, the posterior distribution would characterize this entire population, not a sample from it. It stands to reason in this case, at least intuitively, that the posterior distribution should coincide with the likelihood function. It will (as usual), save for the influence of the prior. But as the sample size grows, one should expect the likelihood function to overwhelm the prior. It will, unless the strength of the prior grows with the sample size (that is, for example, if the prior variance is of order 1/n). An informative prior will still fade in its influence on the posterior unless it becomes more informative as the sample size grows.

The preceding suggests that the posterior mean will converge to the maximum likelihood estimator. The MLE is the parameter vector that is at the mode of the likelihood function. The Bayesian estimator is the posterior mean, not the mode, so a remaining question concerns the relationship between these two features. The Bernstein–von Mises “theorem” [see Cameron and Trivedi (2005, p. 433) and Train (2003, Chapter 12)] states that the posterior mean and the maximum likelihood estimator will converge to the same probability limit and have the same limiting normal distribution. A form of central limit theorem is at work. But for remaining philosophical questions, the results suggest that for large samples, the choice between Bayesian and frequentist methods can be one of computational efficiency. (This is the thrust of the application in Section 18.8. Note, as well, footnote 1 at the beginning of this chapter. In an infinite sample, the maintained “uncertainty” of the Bayesian estimation framework would have to arise from deeper questions about the model. For example, the mean of the entire population is its mean; there is no uncertainty about the “parameter.”)

18.5 POSTERIOR DISTRIBUTIONS AND THE GIBBS SAMPLER

The preceding analysis has proceeded along a set of steps that includes formulating the likelihood function (the model), the prior density over the objects of estimation, and the posterior density. To complete the inference step, we then analytically derived the characteristics of the posterior density of interest, such as the mean or mode, and the variance. The complicated element of any of this analysis is determining the moments of the posterior density, for example, the mean:

$$\hat\theta = E[\theta\,|\,\text{data}] = \int_\theta \theta\,p(\theta\,|\,\text{data})\,d\theta. \tag{18-19}$$

16The standard preamble in econometric studies, that the analysis to follow is “exact” as opposed to approximate or “large sample,” refers to this aspect—the analysis is conditioned on and, by implication, applies only to the sample data in hand. Any inference outside the sample, for example, to hypothesized random samples is, like the sampling theory counterpart, approximate.


There are relatively few applications for which integrals such as this can be derived in closed form. (This is one motivation for conjugate priors.) The modern approach to Bayesian inference takes a different strategy. The result in (18-19) is an expectation. Suppose it were possible to obtain a random sample, as large as desired, from the population defined by p(θ | data). Then, using the same strategy we used throughout Chapter 17 for simulation-based estimation, we could use that sample’s characteristics, such as mean, variance, quantiles, and so on, to infer the characteristics of the posterior distribution. Indeed, with an (essentially) infinite sample, we would be freed from having to limit our attention to a few simple features such as the mean and variance; we could view any features of the posterior distribution that we like. The (much less) complicated part of the analysis is the formulation of the posterior density. It remains to determine how the sample is to be drawn from the posterior density. This element of the strategy is provided by a remarkable (and remarkably useful) result known as the Gibbs sampler. [See Casella and George (1992).]

The central result of the Gibbs sampler is as follows: We wish to draw a random sample from the joint population (x, y). The joint distribution of x and y is either unknown or intractable and it is not possible to sample from the joint distribution. However, assume that the conditional distributions f(x | y) and f(y | x) are known and simple enough that it is possible to draw univariate random samples from both of them. The following iteration will produce a bivariate random sample from the joint distribution:

Gibbs sampler:

1. Begin the cycle with a value of x0 that is in the right range of x | y,

2. Draw an observation y0 | x0,

3. Draw an observation xt | yt−1,

4. Draw an observation yt | xt .

Iteration of steps 3 and 4 for several thousand cycles will eventually produce a random sample from the joint distribution. (The first several thousand draws are discarded to avoid the influence of the initial conditions—this is called the burn in.) [Some technical details on the procedure appear in Cameron and Trivedi (2005, Section 13.5).]

Example 18.5 Gibbs Sampling from the Normal Distribution

To illustrate the mechanical aspects of the Gibbs sampler, consider random sampling from the joint normal distribution. We consider the bivariate normal distribution first. Suppose we wished to draw a random sample from the population

$$\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \sim N\left[\begin{pmatrix} 0 \\ 0 \end{pmatrix},\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right].$$

As we have seen in Chapter 17, a direct approach is to use the fact that linear functions of normally distributed variables are normally distributed. [See (B-80).] Thus, we might

transform a series of independent normal draws (u1, u2) by the Cholesky decomposition of the covariance matrix,

$$\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}_i = \begin{pmatrix} 1 & 0 \\ \theta_1 & \theta_2 \end{pmatrix}\begin{pmatrix} u_1 \\ u_2 \end{pmatrix}_i = \mathbf{L}\mathbf{u}_i,$$

where θ₁ = ρ and θ₂ = √(1 − ρ²). The Gibbs sampler would take advantage of the result

$$x_1\,|\,x_2 \sim N[\rho x_2,\,(1-\rho^2)],$$
and
$$x_2\,|\,x_1 \sim N[\rho x_1,\,(1-\rho^2)].$$

To sample from a trivariate, or multivariate population, we can expand the Gibbs sequence in the natural fashion. For example, to sample from a trivariate population, we would use the Gibbs sequence

$$x_1\,|\,x_2,x_3 \sim N[\beta_{1,2}x_2 + \beta_{1,3}x_3,\,\Sigma_{1|2,3}],$$
$$x_2\,|\,x_1,x_3 \sim N[\beta_{2,1}x_1 + \beta_{2,3}x_3,\,\Sigma_{2|1,3}],$$
$$x_3\,|\,x_1,x_2 \sim N[\beta_{3,1}x_1 + \beta_{3,2}x_2,\,\Sigma_{3|1,2}],$$

where the conditional means and variances are given in Theorem B.7. This defines a three-step cycle.
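A minimal implementation of Example 18.5 for the bivariate case follows; with ρ = 0.7 (an arbitrary illustrative value), the retained draws reproduce the population correlation.

```python
# Gibbs sampler for the bivariate standard normal with correlation rho,
# cycling between x1 | x2 ~ N[rho*x2, 1-rho^2] and x2 | x1 ~ N[rho*x1, 1-rho^2].
import numpy as np

rng = np.random.default_rng(0)
rho, n_draws, burn_in = 0.7, 12_000, 2_000
sd = np.sqrt(1.0 - rho ** 2)            # conditional standard deviation

x1 = x2 = 0.0                           # step 1: starting value
sample = np.empty((n_draws, 2))
for r in range(n_draws):
    x1 = rho * x2 + sd * rng.normal()   # step 3: draw x1 | x2
    x2 = rho * x1 + sd * rng.normal()   # step 4: draw x2 | x1
    sample[r] = x1, x2

kept = sample[burn_in:]                 # discard the burn-in
print(np.corrcoef(kept.T)[0, 1])        # close to rho = 0.7
```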

The availability of the Gibbs sampler frees the researcher from the necessity of deriving the analytical properties of the full, joint posterior distribution. Because the formulation of conditional priors is straightforward, and the derivation of the conditional posteriors is only slightly less so, this tool has facilitated a vast range of applications that previously were intractable. For an example, consider, once again, the classical normal regression model. From (18-7), the joint posterior for (β, σ²) is

$$p(\beta,\sigma^2\,|\,\mathbf{y},\mathbf{X}) \propto \frac{[vs^2]^{v+2}}{\Gamma(v+2)}\left(\frac{1}{\sigma^2}\right)^{v+1}\exp(-vs^2/\sigma^2)\,[2\pi]^{-K/2}\,|\sigma^2(\mathbf{X}'\mathbf{X})^{-1}|^{-1/2}\,\exp\left(-(1/2)(\beta-\mathbf{b})'[\sigma^2(\mathbf{X}'\mathbf{X})^{-1}]^{-1}(\beta-\mathbf{b})\right).$$

If we wished to use a simulation approach to characterizing the posterior distribution, we would need to draw a K + 1 variate sample of observations from this intractable distribution. However, with the assumed priors, we found the conditional posterior for β in (18-5):

$$p(\beta\,|\,\sigma^2,\mathbf{y},\mathbf{X}) = N[\mathbf{b},\,\sigma^2(\mathbf{X}'\mathbf{X})^{-1}].$$

From (18-6), we can deduce that the conditional posterior for σ² | β, y, X is an inverted gamma distribution with parameters mσ₀² = vσ̂² and m = v in (18-13):

$$p(\sigma^2\,|\,\beta,\mathbf{y},\mathbf{X}) = \frac{[v\hat\sigma^2]^{v+1}}{\Gamma(v+1)}\left(\frac{1}{\sigma^2}\right)^{v}\exp(-v\hat\sigma^2/\sigma^2), \qquad \hat\sigma^2 = \frac{\sum_{i=1}^n (y_i-\mathbf{x}_i'\beta)^2}{n-K}.$$

This sets up a Gibbs sampler for sampling from the joint posterior of β and σ². We would cycle between random draws from the multivariate normal for β and the inverted gamma distribution for σ² to obtain a K + 1 variate sample on (β, σ²). [Of course, for this application, we do know the marginal posterior distribution for β—see (18-8).]

The Gibbs sampler is not truly a random sampler; it is a Markov chain—each “draw” from the distribution is a function of the draw that precedes it. The random input at each cycle provides the randomness, which leads to the popular name for this strategy, Markov chain Monte Carlo or MCMC or MC² (pick one) estimation. In its simplest


form, it provides a remarkably efficient tool for studying the posterior distributions in very complicated models. The example in the next section shows a striking example of how to locate the MLE for a probit model without computing the likelihood function or its derivatives. In Section 18.8, we will examine an extension and refinement of the strategy, the Metropolis–Hastings algorithm. In the next several sections, we will present some applications of Bayesian inference. In Section 18.9, we will return to some general issues in classical and Bayesian estimation and inference.
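The two conditional posteriors above translate directly into code. The following sketch follows the parameterization given in the text, drawing 1/σ² from a gamma distribution with shape v + 1 and rate vσ̂²(β); the simulated data set is illustrative only.

```python
# Gibbs sampler for the normal linear regression posterior under the
# noninformative prior, cycling between
#   beta | sigma^2   ~ N[b, sigma^2 (X'X)^{-1}]                    (18-5)
#   1/sigma^2 | beta ~ Gamma(shape v + 1, rate v*sighat^2(beta))   (text's (18-13) form)
import numpy as np

rng = np.random.default_rng(0)
n, K = 200, 3                                   # illustrative simulated data
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
y = X @ np.array([1.0, 0.5, -0.25]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                           # OLS coefficients
L = np.linalg.cholesky(XtX_inv)                 # LL' = (X'X)^{-1}
v = (n - K) / 2.0

n_draws, burn_in, sig2 = 5_000, 1_000, 1.0
betas = np.empty((n_draws, K))
for r in range(n_draws):
    beta = b + np.sqrt(sig2) * (L @ rng.normal(size=K))   # beta | sigma^2
    e = y - X @ beta
    sighat2 = e @ e / (n - K)
    sig2 = 1.0 / rng.gamma(v + 1.0, 1.0 / (v * sighat2))  # sigma^2 | beta
    betas[r] = beta

print(betas[burn_in:].mean(axis=0))             # close to the OLS estimate b
```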

18.6 APPLICATION: BINOMIAL PROBIT MODEL

Consider inference about the binomial probit model for a dependent variable that is generated as follows (see Sections 23.2–23.4):

$$y_i^* = \mathbf{x}_i'\beta + \varepsilon_i,\qquad \varepsilon_i \sim N[0,1], \tag{18-20}$$
$$y_i = 1 \text{ if } y_i^* > 0,\ \text{otherwise } y_i = 0. \tag{18-21}$$

(Theoretical motivation for the model appears in Section 23.3.) The data consist of (y, X) = (y_i, x_i), i = 1, ..., n. The random variable y_i has a Bernoulli distribution with probabilities

$$\text{Prob}[y_i=1\,|\,\mathbf{x}_i] = \Phi(\mathbf{x}_i'\beta),$$
$$\text{Prob}[y_i=0\,|\,\mathbf{x}_i] = 1-\Phi(\mathbf{x}_i'\beta).$$

The likelihood function for the observed data is

$$L(\mathbf{y}\,|\,\mathbf{X},\beta) = \prod_{i=1}^n [\Phi(\mathbf{x}_i'\beta)]^{y_i}[1-\Phi(\mathbf{x}_i'\beta)]^{1-y_i}.$$

(Once again, we cheat a bit on the notation—the likelihood function is actually the joint density for the data, given X and β.) Classical maximum likelihood estimation of β is developed in Section 23.4. To obtain the posterior mean (Bayesian estimator), we assume a noninformative, flat (improper) prior for β,

$$p(\beta) \propto 1.$$

The posterior density would be

$$p(\beta\,|\,\mathbf{y},\mathbf{X}) = \frac{\prod_{i=1}^n [\Phi(\mathbf{x}_i'\beta)]^{y_i}[1-\Phi(\mathbf{x}_i'\beta)]^{1-y_i}\,(1)}{\int_\beta \prod_{i=1}^n [\Phi(\mathbf{x}_i'\beta)]^{y_i}[1-\Phi(\mathbf{x}_i'\beta)]^{1-y_i}\,(1)\,d\beta},$$

and the estimator would be the posterior mean,

$$\hat\beta = E[\beta\,|\,\mathbf{y},\mathbf{X}] = \frac{\int_\beta \beta\,\prod_{i=1}^n [\Phi(\mathbf{x}_i'\beta)]^{y_i}[1-\Phi(\mathbf{x}_i'\beta)]^{1-y_i}\,d\beta}{\int_\beta \prod_{i=1}^n [\Phi(\mathbf{x}_i'\beta)]^{y_i}[1-\Phi(\mathbf{x}_i'\beta)]^{1-y_i}\,d\beta}. \tag{18-22}$$

Evaluation of the integrals in (18-22) is hopelessly complicated, but a solution using the Gibbs sampler and a technique known as data augmentation, pioneered by Albert


and Chib (1993a), is surprisingly simple. We begin by treating the unobserved y_i* as unknowns to be estimated, along with β. Thus, the (K + n) × 1 parameter vector is θ = (β, y*). We now construct a Gibbs sampler. Consider, first, p(β | y*, y, X). If y_i* is known, then y_i is known [see (18-21)]. It follows that

$$p(\beta\,|\,\mathbf{y}^*,\mathbf{y},\mathbf{X}) = p(\beta\,|\,\mathbf{y}^*,\mathbf{X}).$$

This posterior defines a linear regression model with normally distributed disturbances and known σ² = 1. It is precisely the model we saw in Section 18.3.1, and the posterior we need is in (18-5), with σ² = 1. So, based on our earlier results, it follows that

$$p(\beta\,|\,\mathbf{y}^*,\mathbf{y},\mathbf{X}) = N[\mathbf{b}^*,\,(\mathbf{X}'\mathbf{X})^{-1}], \tag{18-23}$$

where

$$\mathbf{b}^* = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}^*.$$

For y_i*, ignoring y_i for the moment, it would follow immediately from (18-20) that

$$p(y_i^*\,|\,\beta,\mathbf{X}) = N[\mathbf{x}_i'\beta,\,1].$$

However, y_i is informative about y_i*. If y_i equals one, we know that y_i* > 0, and if y_i equals zero, then y_i* ≤ 0. The implication is that conditioned on β, X, and y, y_i* has the truncated (above or below zero) normal distribution that is developed in Sections 24.2.1 and 24.2.2. The standard notation for this is

$$p(y_i^*\,|\,y_i=1,\beta,\mathbf{x}_i) = N^+[\mathbf{x}_i'\beta,\,1], \tag{18-24}$$
$$p(y_i^*\,|\,y_i=0,\beta,\mathbf{x}_i) = N^-[\mathbf{x}_i'\beta,\,1].$$

Results (18-23) and (18-24) set up the components for a Gibbs sampler that we can use to estimate the posterior means E[β | y, X] and E[y* | y, X]. The following is our algorithm:

Gibbs Sampler for the Binomial Probit Model

1. Compute X'X once at the outset and obtain L such that LL' = (X'X)⁻¹.
2. Start β at any value such as 0.
3. Result (17-1) shows how to transform a draw from U[0, 1] to a draw from the truncated normal with underlying mean μ and standard deviation σ. For this application, the draw is

$$y_{i,r}^* = \mathbf{x}_i'\beta_{r-1} + \Phi^{-1}[1-(1-U)\Phi(\mathbf{x}_i'\beta_{r-1})] \quad\text{if } y_i = 1,$$
$$y_{i,r}^* = \mathbf{x}_i'\beta_{r-1} + \Phi^{-1}[U\Phi(-\mathbf{x}_i'\beta_{r-1})] \quad\text{if } y_i = 0.$$

This step is used to draw the n observations on y_{i,r}*.
4. Section 17.2.4 shows how to draw an observation from the multivariate normal population. For this application, we use the results at step 3 to compute b* = (X'X)⁻¹X'y*(r). We obtain a vector, v, of K draws from the N[0, 1] population; then β(r) = b* + Lv.

The iteration cycles between steps 3 and 4. This should be repeated several thousand times, discarding the burn-in draws; then the estimator of β is the sample mean of the retained draws. The posterior variance is computed with the variance of the retained draws. Posterior estimates of y_i* would typically not be useful.
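The algorithm above is only a few lines of code. The following sketch implements steps 1–4 with the inverse-CDF truncated normal draws from step 3; the simulated data set is illustrative (the text’s data sets are not reproduced here).

```python
# Gibbs sampler with data augmentation for the binomial probit model
# (Albert and Chib), following the four steps in the text.
import numpy as np
from scipy.stats import norm

def probit_gibbs(y, X, n_draws=5_000, burn_in=1_000, seed=0):
    rng = np.random.default_rng(seed)
    n, K = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)           # step 1
    L = np.linalg.cholesky(XtX_inv)            # LL' = (X'X)^{-1}
    beta = np.zeros(K)                         # step 2
    draws = np.empty((n_draws, K))
    for r in range(n_draws):
        mu = X @ beta
        U = rng.uniform(size=n)
        # step 3: truncated normal draws for y* by inverse-CDF
        ystar = np.where(
            y == 1,
            mu + norm.ppf(1.0 - (1.0 - U) * norm.cdf(mu)),   # y* > 0
            mu + norm.ppf(U * norm.cdf(-mu)),                # y* <= 0
        )
        # step 4: beta | y* ~ N[b*, (X'X)^{-1}], b* = (X'X)^{-1} X'y*
        b_star = XtX_inv @ X.T @ ystar
        beta = b_star + L @ rng.normal(size=K)
        draws[r] = beta
    kept = draws[burn_in:]
    return kept.mean(axis=0), kept.std(axis=0)

# illustrative use with simulated data
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])
y = (X @ np.array([-0.5, 1.0, 0.5]) + rng.normal(size=500) > 0).astype(float)
print(probit_gibbs(y, X))
```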

618 PART IV ✦ Estimation Methodology

TABLE 18.2 Probit Estimates for Grade Equation

                 Maximum Likelihood                Posterior Means and Std. Devs.
Variable      Estimate     Standard Error      Posterior Mean     Posterior S.D.
Constant      −7.4523      2.5425              −8.6286            2.7995
GPA            1.6258      0.6939               1.8754            0.7668
TUCE           0.05173     0.08389              0.06277           0.08695
PSI            1.4263      0.5950               1.6072            0.6257

Example 18.6 Gibbs Sampler for a Probit Model

In Examples 16.14 and 16.15, we examined Spector and Mazzeo’s (1980) widely traveled data on a binary choice outcome. (The example used the data for a different model.) The binary probit model studied in the paper was

$$\text{Prob}(GRADE_i = 1\,|\,\beta,\mathbf{x}_i) = \Phi(\beta_1 + \beta_2 GPA_i + \beta_3 TUCE_i + \beta_4 PSI_i).$$

The variables are defined in Example 16.14. Their probit model is studied in Example 23.3. The sample contains 32 observations. Table 18.2 presents the maximum likelihood estimates and the posterior means and standard deviations for the probit model. For the Gibbs sampler, we used 5,000 draws, and discarded the first 1,000. The results in Table 18.2 suggest the similarity of the posterior mean estimated with the Gibbs sampler to the maximum likelihood estimate. However, the sample is quite small, and the differences between the coefficients are still fairly substantial.

For a striking example of the behavior of this procedure, we now revisit the German health care data examined in Examples 11.10, 11.11, and 16.16, and several other examples throughout the book. The probit model to be estimated is

$$\text{Prob}(\text{Doctor visits}_{it} > 0) = \Phi(\beta_1 + \beta_2 Age_{it} + \beta_3 Education_{it} + \beta_4 Income_{it} + \beta_5 Kids_{it} + \beta_6 Married_{it} + \beta_7 Female_{it}).$$

The sample contains data on 7,293 families and a total of 27,326 observations. We are pooling the data for this application. Table 18.3 presents the probit results for this model using the same procedure as before. (We used only 500 draws, and discarded the first 100.) The similarity is what one would expect given the large sample size. We note before proceeding to other applications, notwithstanding the striking similarity of the Gibbs sampler to the MLE, that this is not an efficient method of estimating the parameters of a probit model. The estimator requires generation of thousands of samples of potentially thousands of observations. We used only 500 replications to produce Table 18.3. The computations took about five minutes. Using Newton’s method to maximize the log-likelihood directly took less than five seconds. Unless one is wedded to the Bayesian paradigm, on strictly practical grounds, the MLE would be the preferred estimator.

TABLE 18.3 Probit Estimates for Doctor Visits Equation

                 Maximum Likelihood                Posterior Means and Std. Devs.
Variable      Estimate      Standard Error     Posterior Mean     Posterior S.D.
Constant      −0.12433      0.058146           −0.12628           0.054759
Age            0.011892     0.00079568          0.011979          0.00080073
Education     −0.014966     0.0035747          −0.015142          0.0036246
Income        −0.13242      0.046552           −0.12669           0.047979
Kids          −0.15212      0.018327           −0.15149           0.018400
Married        0.073522     0.020644            0.071977          0.020852
Female         0.35591      0.016017            0.35582           0.015913


This application of the Gibbs sampler demonstrates in an uncomplicated case how the algorithm can provide an alternative to actually maximizing the log-likelihood. We do note that the similarity of the method to the EM algorithm in Section E.3.7 is not coincidental. Both procedures use an estimate of the unobserved, censored data, and both estimate β by OLS applied to the predicted data.

18.7 PANEL DATA APPLICATION: INDIVIDUAL EFFECTS MODELS

We consider a panel data model with common individual effects,

$$y_{it} = \alpha_i + \mathbf{x}_{it}'\beta + \varepsilon_{it},\qquad \varepsilon_{it} \sim N[0,\sigma_\varepsilon^2].$$

In the Bayesian framework, there is no need to distinguish between fixed and random effects. The classical distinction results from an asymmetric treatment of the data and the parameters. So, we will leave that unspecified for the moment. The implications will emerge later when we specify the prior densities over the model parameters.

The likelihood function for the sample under normality of ε_it is

$$p(\mathbf{y}\,|\,\alpha_1,\ldots,\alpha_n,\beta,\sigma_\varepsilon^2,\mathbf{X}) = \prod_{i=1}^n\prod_{t=1}^{T_i}\frac{1}{\sigma_\varepsilon\sqrt{2\pi}}\exp\left[-\frac{(y_{it}-\alpha_i-\mathbf{x}_{it}'\beta)^2}{2\sigma_\varepsilon^2}\right].$$

The remaining analysis hinges on the specification of the prior distributions. We will consider three cases. Each illustrates an aspect of the methodology.

First, group the full set of location (regression) parameters in one (n + K) × 1 slope vector, γ. Then, with the disturbance variance, θ = (α, β, σ_ε²) = (γ, σ_ε²). Define a conformable data matrix, Z = (D, X), where D contains the n dummy variables so that we may write the model

$$\mathbf{y} = \mathbf{Z}\gamma + \varepsilon$$

in the familiar fashion for our common effects linear regression. (See Chapter 9.) We now assume the uniform-inverse gamma prior that we used in our earlier treatment of the linear model,

$$p(\gamma,\sigma_\varepsilon^2) \propto 1/\sigma_\varepsilon^2.$$

The resulting (marginal) posterior density for γ is precisely that in (18-8) (where now the slope vector includes the elements of α). The density is an (n + K) variate t with mean equal to the OLS estimator and covariance matrix

$$\left[\frac{\sum_i T_i - n - K}{\sum_i T_i - n - K - 2}\right]s^2(\mathbf{Z}'\mathbf{Z})^{-1}.$$

Because OLS in this model as stated means the within estimator, the implication is that with this noninformative prior over (α, β), the model is equivalent to the fixed effects model. Note, again, this is not a consequence of any assumption about correlation between effects and included variables. That has remained unstated; though, by implication, we would allow correlation between D and X.

Some observers are uncomfortable with the idea of a uniform prior over the entire real line. [See, e.g., Koop (2003, pp. 22–23).] Others, e.g., Zellner (1971, p. 20), are less concerned. [Cameron and Trivedi (2005, pp. 425–427) suggest a middle ground.] Formally, our assumption of a uniform prior over the entire real line is an improper


As such, the posterior appears to be ill defined. However, note that the "improper" uniform prior will, in fact, fall out of the posterior, because it appears in both numerator and denominator. [Zellner (1971, p. 20) offers some more methodological commentary.] The practical solution for location parameters, such as a vector of regression slopes, is to assume a nearly flat, "almost uninformative" prior. The usual choice is a conjugate normal prior with an arbitrarily large variance. (It should be noted, of course, that as long as that variance is finite, even if it is large, the prior is informative. We return to this point in Section 18.9.)

Consider, then, the conventional normal-gamma prior over $(\boldsymbol\gamma, \sigma_\varepsilon^2)$ where the conditional (on $\sigma_\varepsilon^2$) prior normal density for the slope parameters has mean $\boldsymbol\gamma_0$ and covariance matrix $\sigma_\varepsilon^2\mathbf{A}$, where the (n + K) × (n + K) matrix, A, is yet to be specified. [See the discussion after (18-13).] The marginal posterior mean and variance for γ for this set of assumptions are given in (18-14) and (18-15). We reach a point that presents two rather serious dilemmas for the researcher. The posterior was simple with our uniform, noninformative prior. Now, it is necessary actually to specify A, which is potentially large. (In one of our main applications in this text, we are analyzing models with n = 7,293 constant terms and about K = 7 regressors.) It is hopelessly optimistic to expect to be able to specify all the variances and covariances in a matrix this large, unless we actually have the results of an earlier study (in which case we would also have a prior estimate of γ). A practical solution that is frequently chosen is to specify A to be a diagonal matrix with extremely large diagonal elements, thus emulating a uniform prior without having to commit to one. The second practical issue then becomes dealing with the actual computation of the order (n + K) inverse matrix in (18-14) and (18-15). Under the strategy chosen, to make A a multiple of the identity matrix, however, there are forms of partitioned inverse matrices that will allow solution to the actual computation.

Thus far, we have assumed that each $\alpha_i$ is generated by a different normal distribution; $\boldsymbol\gamma_0$ and A, however specified, have (potentially) different means and variances for the elements of α. The third specification we consider is one in which all $\alpha_i$'s in the model are assumed to be draws from the same population. To produce this specification, we use a hierarchical prior for the individual effects. The full model will be

$$y_{it} = \alpha_i + \mathbf{x}_{it}'\boldsymbol\beta + \varepsilon_{it}, \quad \varepsilon_{it} \sim N[0, \sigma_\varepsilon^2],$$
$$p(\boldsymbol\beta \mid \sigma_\varepsilon^2) = N[\boldsymbol\beta_0, \sigma_\varepsilon^2\mathbf{A}],$$
$$p(\sigma_\varepsilon^2) = \text{Gamma}(\sigma_0^2, m),$$
$$p(\alpha_i) = N[\mu_\alpha, \tau_\alpha^2],$$

$$p(\mu_\alpha) = N[a, Q],$$
$$p(\tau_\alpha^2) = \text{Gamma}(\tau_0^2, v).$$

We will not be able to derive the posterior density (joint or marginal) for the parameters of this model. However, it is possible to set up a Gibbs sampler that can be used to infer the characteristics of the posterior densities statistically. The sampler will be driven by conditional normal posteriors for the location parameters, $[\boldsymbol\beta \mid \boldsymbol\alpha, \sigma_\varepsilon^2, \mu_\alpha, \tau_\alpha^2]$, $[\alpha_i \mid \boldsymbol\beta, \sigma_\varepsilon^2, \mu_\alpha, \tau_\alpha^2]$, and $[\mu_\alpha \mid \boldsymbol\beta, \boldsymbol\alpha, \sigma_\varepsilon^2, \tau_\alpha^2]$, and conditional gamma densities for the scale (variance) parameters, $[\sigma_\varepsilon^2 \mid \boldsymbol\alpha, \boldsymbol\beta, \mu_\alpha, \tau_\alpha^2]$ and $[\tau_\alpha^2 \mid \boldsymbol\alpha, \boldsymbol\beta, \sigma_\varepsilon^2, \mu_\alpha]$. [The procedure is developed at length by Koop (2003, pp. 152–153).]


The assumption of a common distribution for the individual effects and an independent prior for β produces a Bayesian counterpart to the random effects model.
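As a concrete illustration of the sampler just outlined, the sketch below cycles through the five conditional distributions for a simplified version of the model, using flat (improper) priors for β, μ_α, σ_ε², and τ_α² in place of the informative normal-gamma priors above, so that the conditionals take simple conjugate forms. All names and the simulated panel are hypothetical; this follows standard conjugate results rather than Koop's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical balanced panel: n groups, T periods, K regressors
n, T, K = 100, 5, 2
X = rng.normal(size=(n * T, K))
group = np.repeat(np.arange(n), T)
y = (rng.normal(1.0, 0.7, size=n)[group] + X @ np.array([0.5, -0.3])
     + rng.normal(0.0, 0.5, size=n * T))

XtX_inv = np.linalg.inv(X.T @ X)
L = np.linalg.cholesky(XtX_inv)

beta, alpha = np.zeros(K), np.zeros(n)   # starting values
mu_a, sig2, tau2 = 0.0, 1.0, 1.0

for _ in range(1000):
    # [beta | alpha, sig2]: OLS of (y - alpha_i) on X under a flat prior
    b = XtX_inv @ (X.T @ (y - alpha[group]))
    beta = b + np.sqrt(sig2) * (L @ rng.normal(size=K))
    # [alpha_i | beta, sig2, mu_a, tau2]: conjugate normal update
    rbar = np.bincount(group, weights=y - X @ beta) / T
    prec = T / sig2 + 1.0 / tau2
    mean = (T * rbar / sig2 + mu_a / tau2) / prec
    alpha = mean + rng.normal(size=n) / np.sqrt(prec)
    # [mu_a | alpha, tau2] under a flat prior
    mu_a = rng.normal(alpha.mean(), np.sqrt(tau2 / n))
    # [sig2 | ...] and [tau2 | ...]: inverted gamma draws
    e = y - alpha[group] - X @ beta
    sig2 = 1.0 / rng.gamma(0.5 * n * T, 2.0 / (e @ e))
    tau2 = 1.0 / rng.gamma(0.5 * n, 2.0 / ((alpha - mu_a) @ (alpha - mu_a)))
```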

18.8 HIERARCHICAL BAYES ESTIMATION OF A RANDOM PARAMETERS MODEL

We now consider a Bayesian approach to estimation of the random parameters model.17 For an individual i, the conditional density for the dependent variable in period t is $f(y_{it} \mid \mathbf{x}_{it}, \boldsymbol\beta_i)$, where $\boldsymbol\beta_i$ is the individual specific K × 1 parameter vector and $\mathbf{x}_{it}$ is individual specific data that enter the probability density.18 For the sequence of T observations, assuming conditional (on $\boldsymbol\beta_i$) independence, person i's contribution to the likelihood for the sample is

$$f(\mathbf{y}_i \mid \mathbf{X}_i, \boldsymbol\beta_i) = \prod_{t=1}^T f(y_{it} \mid \mathbf{x}_{it}, \boldsymbol\beta_i), \tag{18-25}$$

where $\mathbf{y}_i = (y_{i1}, \ldots, y_{iT})$ and $\mathbf{X}_i = [\mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}]$. We will suppose that $\boldsymbol\beta_i$ is distributed normally with mean $\boldsymbol\beta$ and covariance matrix $\boldsymbol\Gamma$. (This is the "hierarchical" aspect of the model.) The unconditional density would be the expected value over the possible values of $\boldsymbol\beta_i$;

$$f(\mathbf{y}_i \mid \mathbf{X}_i, \boldsymbol\beta, \boldsymbol\Gamma) = \int_{\boldsymbol\beta_i} \prod_{t=1}^T f(y_{it} \mid \mathbf{x}_{it}, \boldsymbol\beta_i)\,\phi_K[\boldsymbol\beta_i \mid \boldsymbol\beta, \boldsymbol\Gamma]\,d\boldsymbol\beta_i, \tag{18-26}$$

where $\phi_K[\boldsymbol\beta_i \mid \boldsymbol\beta, \boldsymbol\Gamma]$ denotes the K variate normal prior density for $\boldsymbol\beta_i$ given $\boldsymbol\beta$ and $\boldsymbol\Gamma$. Maximum likelihood estimation of this model, which entails estimation of the "deep" parameters, $\boldsymbol\beta, \boldsymbol\Gamma$, then estimation of the individual specific parameters, $\boldsymbol\beta_i$, is considered in Section 17.5. We now consider the Bayesian approach to estimation of the parameters of this model. To approach this from a Bayesian viewpoint, we will assign noninformative prior densities to $\boldsymbol\beta$ and $\boldsymbol\Gamma$. As is conventional, we assign a flat (noninformative) prior to $\boldsymbol\beta$. The variance parameters are more involved. If it is assumed that the elements of $\boldsymbol\beta_i$ are conditionally independent, then each element of the (now) diagonal matrix $\boldsymbol\Gamma$ may be assigned the inverted gamma prior that we used in (18-13). A full matrix $\boldsymbol\Gamma$ is handled by assigning to $\boldsymbol\Gamma$ an inverted Wishart prior density with parameters scalar K and matrix K × I. [The Wishart density is a multivariate counterpart to the chi-squared distribution. Discussion may be found in Zellner (1971, pp. 389–394).]

17Note that there is occasional confusion as to what is meant by "random parameters" in a random parameters (RP) model. In the Bayesian framework we discuss in this chapter, the "randomness" of the random parameters in the model arises from the "uncertainty" of the analyst. As developed at several points in this book (and in the literature), the randomness of the parameters in the RP model is a characterization of the heterogeneity of parameters across individuals. Consider, for example, in the Bayesian framework of this section, in the RP model, each vector $\boldsymbol\beta_i$ is a random vector with a distribution (defined hierarchically). In the classical framework, each $\boldsymbol\beta_i$ represents a single draw from a parent population.

18To avoid a layer of complication, we will embed the time-invariant effect $\mathbf{z}_i$ in $\mathbf{x}_{it}'\boldsymbol\beta$. A full treatment in the same fashion as the latent class model would be substantially more complicated in this setting (although it is quite straightforward in the maximum simulated likelihood approach discussed in Section 17.5.1).


This produces the joint posterior density,

$$p(\boldsymbol\beta_1, \ldots, \boldsymbol\beta_n, \boldsymbol\beta, \boldsymbol\Gamma \mid \text{all data}) = \prod_{i=1}^n \prod_{t=1}^T f(y_{it} \mid \mathbf{x}_{it}, \boldsymbol\beta_i)\,\phi_K[\boldsymbol\beta_i \mid \boldsymbol\beta, \boldsymbol\Gamma] \times p(\boldsymbol\beta, \boldsymbol\Gamma). \tag{18-27}$$

This gives the joint density of all the unknown parameters conditioned on the observed data. Our Bayesian estimators of the parameters will be the posterior means for these (n + 1)K + K(K + 1)/2 parameters. In principle, this requires integration of (18-27) with respect to the components. As one might guess at this point, that integration is hopelessly complex and not remotely feasible.

However, the techniques of Markov Chain Monte Carlo (MCMC) simulation estimation (the Gibbs sampler) and the Metropolis–Hastings algorithm enable us to sample from the (hopelessly complex) joint density $p(\boldsymbol\beta_1, \ldots, \boldsymbol\beta_n, \boldsymbol\beta, \boldsymbol\Gamma \mid \text{all data})$ in a remarkably simple fashion. Train (2001 and 2002, Chapter 12) describes how to use these results for this random parameters model.19 The usefulness of this result for our current problem is that it is, indeed, possible to partition the joint distribution, and we can easily sample from the conditional distributions. We begin by partitioning the parameters into $\boldsymbol\gamma = (\boldsymbol\beta, \boldsymbol\Gamma)$ and $\boldsymbol\delta = (\boldsymbol\beta_1, \ldots, \boldsymbol\beta_n)$. Train proposes the following strategy: To obtain a draw from $\boldsymbol\gamma \mid \boldsymbol\delta$, we will use the Gibbs sampler to obtain a draw from the distribution of $(\boldsymbol\beta \mid \boldsymbol\Gamma, \boldsymbol\delta)$ then one from the distribution of $(\boldsymbol\Gamma \mid \boldsymbol\beta, \boldsymbol\delta)$. We will lay out this first, then turn to sampling from $\boldsymbol\delta \mid \boldsymbol\beta, \boldsymbol\Gamma$.

Conditioned on $\boldsymbol\delta$ and $\boldsymbol\Gamma$, $\boldsymbol\beta$ has a K-variate normal distribution with mean $\bar{\boldsymbol\beta} = (1/n)\sum_{i=1}^n \boldsymbol\beta_i$ and covariance matrix $(1/n)\boldsymbol\Gamma$. To sample from this distribution we will first obtain the Cholesky factorization of $\boldsymbol\Gamma = \mathbf{LL}'$, where L is a lower triangular matrix. [See Section A.6.11.] Let v be a vector of K draws from the standard normal distribution. Then, $\bar{\boldsymbol\beta} + \mathbf{Lv}$ has mean vector $\bar{\boldsymbol\beta} + \mathbf{L}\times\mathbf{0} = \bar{\boldsymbol\beta}$ and covariance matrix $\mathbf{LIL}' = \boldsymbol\Gamma$, which is exactly what we need. So, this shows how to sample a draw from the conditional distribution of $\boldsymbol\beta$.

To obtain a random draw from the distribution of $\boldsymbol\Gamma \mid \boldsymbol\beta, \boldsymbol\delta$, we will require a random draw from the inverted Wishart distribution. The marginal posterior distribution of $\boldsymbol\Gamma \mid \boldsymbol\beta, \boldsymbol\delta$ is inverted Wishart with parameters scalar K + n and matrix $\mathbf{W} = (K\mathbf{I} + n\bar{\mathbf{V}})$, where $\bar{\mathbf{V}} = (1/n)\sum_{i=1}^n (\boldsymbol\beta_i - \bar{\boldsymbol\beta})(\boldsymbol\beta_i - \bar{\boldsymbol\beta})'$. Train (2001) suggests the following strategy for sampling a matrix from this distribution: Let M be the lower triangular Cholesky factor of $\mathbf{W}^{-1}$, so $\mathbf{MM}' = \mathbf{W}^{-1}$. Obtain K + n draws of $\mathbf{v}_k$ = K standard normal variates. Then, obtain $\mathbf{S} = \mathbf{M}\left(\sum_{k=1}^{K+n}\mathbf{v}_k\mathbf{v}_k'\right)\mathbf{M}'$. Then, $\mathbf{S}^{-1}$ is a draw from the inverted Wishart distribution. [This is fairly straightforward, as it involves only random sampling from the standard normal distribution. For a diagonal $\boldsymbol\Gamma$ matrix, that is, uncorrelated parameters in $\boldsymbol\beta_i$, it simplifies a bit further. A draw for the nonzero kth diagonal element can be obtained using $(1 + n\bar{V}_{kk})\big/\sum_{r=1}^{K+n} v_{rk}^2$.]
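A minimal sketch of the inverted Wishart device just described, with degrees of freedom K + n and scale matrix W = KI + nV̄; the function name and the illustrative values of K, n, and V̄ are hypothetical.

```python
import numpy as np

def draw_inverted_wishart(W, dof, rng):
    """Draw from the inverted Wishart by the device in the text: with
    MM' = W^{-1}, S = M (sum of dof outer products of N(0, I) draws) M'
    is a Wishart(dof, W^{-1}) draw, and S^{-1} is the desired draw."""
    M = np.linalg.cholesky(np.linalg.inv(W))
    V = rng.normal(size=(dof, W.shape[0]))   # dof draws of K standard normals
    S = M @ (V.T @ V) @ M.T
    return np.linalg.inv(S)

rng = np.random.default_rng(2)
K, n = 3, 500
V_bar = np.eye(K)                # hypothetical (1/n) sum of outer products
W = K * np.eye(K) + n * V_bar
Gamma_draw = draw_inverted_wishart(W, K + n, rng)
```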

19Train describes use of this method for "mixed (random parameters) multinomial logit" models. By writing the densities in generic form, we have extended his result to any general setting that involves a parameter vector in the fashion described above. The classical version of this appears in Section 17.5.1 for the binomial probit model and in Section 23.11.6 for the mixed logit model.


The difficult step is sampling $\boldsymbol\beta_i$. For this step, we use the Metropolis–Hastings (M-H) algorithm suggested by Chib and Greenberg (1995, 1996) and Gelman et al. (2004). The procedure involves the following steps:

1. Given $\boldsymbol\beta$ and $\boldsymbol\Gamma$ and "tuning constant" τ (to be described below), compute $\mathbf{d} = \tau\mathbf{Lv}$, where L is the Cholesky factorization of $\boldsymbol\Gamma$ and v is a vector of K independent standard normal draws.
2. Create a trial value $\boldsymbol\beta_{i1} = \boldsymbol\beta_{i0} + \mathbf{d}$, where $\boldsymbol\beta_{i0}$ is the previous value.
3. The posterior distribution for $\boldsymbol\beta_i$ is the likelihood that appears in (18-26) times the joint normal prior density, $\phi_K[\boldsymbol\beta_i \mid \boldsymbol\beta, \boldsymbol\Gamma]$. Evaluate this posterior density at the trial value $\boldsymbol\beta_{i1}$ and the previous value $\boldsymbol\beta_{i0}$. Let
$$R_{10} = \frac{f(\mathbf{y}_i \mid \mathbf{X}_i, \boldsymbol\beta_{i1})\,\phi_K(\boldsymbol\beta_{i1} \mid \boldsymbol\beta, \boldsymbol\Gamma)}{f(\mathbf{y}_i \mid \mathbf{X}_i, \boldsymbol\beta_{i0})\,\phi_K(\boldsymbol\beta_{i0} \mid \boldsymbol\beta, \boldsymbol\Gamma)}.$$
4. Draw one observation, u, from the standard uniform distribution, U[0, 1].
5. If $u < R_{10}$, then accept the trial (new) draw. Otherwise, reuse the old one.

This M-H iteration converges to a sequence of draws from the desired density. Overall, then, the algorithm uses the Gibbs sampler and the Metropolis–Hastings algorithm to produce the sequence of draws for all the parameters in the model. The sequence is repeated a large number of times to produce each draw from the joint posterior distribution. The entire sequence must then be repeated N times to produce the sample of N draws, which can then be analyzed, for example, by computing the posterior mean.

Some practical details remain. The tuning constant, τ, is used to control the iteration. A smaller τ increases the acceptance rate. But at the same time, a smaller τ makes new draws look more like old draws, so this slows down the process. Gelman et al. (2004) suggest τ = 0.4 for K = 1 and smaller values down to about 0.23 for higher dimensions, as will be typical. Each multivariate draw takes many runs of the MCMC sampler. The process must be started somewhere, though it does not matter much where. Nonetheless, a "burn-in" period is required to eliminate the influence of the starting value. Typical applications use several draws for this burn-in period for each run of the sampler. How many sample observations are needed for accurate estimation is not certain, though several hundred would be a minimum. This means that there is a huge amount of computation done by this estimator. However, the computations are fairly simple. The only complicated step is computation of the acceptance criterion at step 3 of the M-H iteration. Depending on the model, this may, like the rest of the calculations, be quite simple.
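The following sketch implements steps 1–5 for one individual, working with logs of the densities for numerical stability (the ratio $R_{10}$ then becomes the exponential of a difference). The function `log_lik`, standing for $\sum_t \ln f(y_{it} \mid \mathbf{x}_{it}, \boldsymbol\beta_i)$, and the other names are hypothetical.

```python
import numpy as np

def mh_step(beta_i0, beta_mean, L, log_lik, tau, rng):
    """One Metropolis-Hastings update of an individual beta_i (steps 1-5).
    log_lik(b) stands for sum_t ln f(y_it | x_it, b); L is the Cholesky
    factor of Gamma; beta_mean is the current draw of beta."""
    K = beta_i0.shape[0]
    d = tau * (L @ rng.normal(size=K))       # step 1
    beta_i1 = beta_i0 + d                    # step 2: trial value

    def log_prior(b):                        # log normal prior; constants cancel
        z = np.linalg.solve(L, b - beta_mean)
        return -0.5 * (z @ z)

    # step 3: log of R10 (likelihood times prior, trial over previous)
    log_R10 = (log_lik(beta_i1) + log_prior(beta_i1)
               - log_lik(beta_i0) - log_prior(beta_i0))
    # steps 4 and 5: accept the trial draw with probability min(1, R10)
    if np.log(rng.uniform()) < log_R10:
        return beta_i1
    return beta_i0
```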

18.9 SUMMARY AND CONCLUSIONS

This chapter has introduced the major elements of the Bayesian approach to estimation and inference. The contrast between Bayesian and classical, or frequentist, approaches to the analysis has been the subject of a decades-long dialogue among practitioners and philosophers. As the frequency of applications of Bayesian methods has grown dramatically in the modern literature, however, the approach to the body of techniques has typically become more pragmatic. The Gibbs sampler and related techniques including the Metropolis–Hastings algorithm have enabled some remarkable simplifications of heretofore intractable problems. For example, recent developments in commercial


software have produced a wide choice of "mixed" estimators which are various implementations of the maximum likelihood procedures and hierarchical Bayes procedures (such as the Sawtooth and MLWin programs). Unless one is dealing with a small sample, the choice between these can be based on convenience. There is little methodological difference. This returns us to the practical point noted earlier. The choice between the Bayesian approach and the sampling theory method in this application would not be based on a fundamental methodological criterion, but on purely practical considerations—the end result is the same.

This chapter concludes our survey of estimation and inference methods in econometrics. We will now turn to two major areas of applications: time series and (broadly) macroeconometrics, and microeconometrics, which is primarily oriented to cross section and panel data applications.

Key Terms and Concepts

• Bayes factor
• Bayes theorem
• Bernstein–von Mises theorem
• Burn in
• Central limit theorem
• Conjugate prior
• Data augmentation
• Gibbs sampler
• Hierarchical Bayes
• Hierarchical prior
• Highest posterior density (HPD)
• Improper prior
• Informative prior
• Inverted gamma distribution
• Inverted Wishart
• Joint posterior
• Likelihood
• Loss function
• Markov–Chain Monte Carlo (MCMC)
• Metropolis–Hastings algorithm
• Multivariate t distribution
• Noninformative prior
• Normal-gamma prior
• Posterior density
• Posterior mean
• Posterior odds
• Precision matrix
• Predictive density
• Prior beliefs
• Prior density
• Prior distribution
• Prior odds ratio
• Prior probabilities
• Sampling theory
• Uniform prior
• Uniform-inverse gamma prior

Exercise

1. Suppose the distribution of $y_i \mid \lambda$ is Poisson,

$$f(y_i \mid \lambda) = \frac{\exp(-\lambda)\lambda^{y_i}}{y_i!} = \frac{\exp(-\lambda)\lambda^{y_i}}{\Gamma(y_i + 1)}, \quad y_i = 0, 1, \ldots,\ \lambda > 0.$$

We will obtain a sample of n observations, $y_1, \ldots, y_n$. Suppose our prior for λ is the inverted gamma, which will imply

$$p(\lambda) \propto \frac{1}{\lambda}.$$

a. Construct the likelihood function, $p(y_1, \ldots, y_n \mid \lambda)$.
b. Construct the posterior density

$$p(\lambda \mid y_1, \ldots, y_n) = \frac{p(y_1, \ldots, y_n \mid \lambda)\,p(\lambda)}{\int_0^\infty p(y_1, \ldots, y_n \mid \lambda)\,p(\lambda)\,d\lambda}.$$

c. Prove that the Bayesian estimator of λ is the posterior mean, $E[\lambda \mid y_1, \ldots, y_n] = \bar{y}$.
d. Prove that the posterior variance is $\text{Var}[\lambda \mid y_1, \ldots, y_n] = \bar{y}/n$.


(Hint: You will make heavy use of gamma integrals in solving this problem. Also, you will find it convenient to use $\sum_i y_i = n\bar{y}$.)

Application

1. Consider a model for the mix of male and female children in families. Let $K_i$ denote the family size (number of children), $K_i = 1, \ldots$. Let $F_i$ denote the number of female children, $F_i = 0, \ldots, K_i$. Suppose the density for the number of female children in a family with $K_i$ children is binomial with constant success probability θ:

$$p(F_i \mid K_i, \theta) = \binom{K_i}{F_i}\theta^{F_i}(1 - \theta)^{K_i - F_i}.$$

We are interested in analyzing the "probability," θ. Suppose the (conjugate) prior over θ is a beta distribution with parameters a and b:

$$p(\theta) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\theta^{a-1}(1 - \theta)^{b-1}.$$

Your sample of 25 observations is given here:

$K_i$: 2 1 1 5 5 4 4 5 1 2 4 4 2 4 3 2 3 2 3 5 3 2 5 4 1

$F_i$: 1 1 1 3 2 3 2 4 0 2 3 1 1 3 2 1 3 1 2 4 2 1 1 4 1

a. Compute the classical maximum likelihood estimate of θ.
b. Form the posterior density for θ given $(K_i, F_i), i = 1, \ldots, 25$, conditioned on a and b.
c. Using your sample of data, compute the posterior mean assuming a = b = 1.
d. Using your sample of data, compute the posterior mean assuming a = b = 2.
e. Using your sample of data, compute the posterior mean assuming a = 1 and b = 2.

19 SERIAL CORRELATION

19.1 INTRODUCTION

Time-series data often display autocorrelation, or serial correlation of the disturbances across periods. Consider, for example, the plot of the least squares residuals in the following example.

Example 19.1 Money Demand Equation
Appendix Table F5.1 contains quarterly data from 1950.1 to 2000.4 on the U.S. money stock (M1), output (real GDP), and the price level (CPI U). Consider an (extremely) simple model of money demand,1

ln M1t = β1 + β2 ln GDPt + β3 ln CPIt + εt .

A plot of the least squares residuals is shown in Figure 19.1. The pattern in the residuals suggests that knowledge of the sign of a residual in one period is a good indicator of the sign of the residual in the next period. This knowledge suggests that the effect of a given disturbance is carried, at least in part, across periods. This sort of "memory" in the disturbances creates the long, slow swings from positive values to negative ones that are evident in Figure 19.1. One might argue that this pattern is the result of an obviously naive model, but that is one of the important points in this discussion. Patterns such as this usually do not arise spontaneously; to a large extent, they are, indeed, a result of an incomplete or flawed model specification.
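A quick numerical companion to the figure is the first-order autocorrelation of the least squares residuals. The sketch below is hypothetical: the three series are simulated stand-ins for the Appendix Table F5.1 data, constructed with an AR(1) disturbance so the residual behavior mimics Figure 19.1.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical stand-ins for the quarterly series in Appendix Table F5.1,
# simulated with an AR(1) disturbance so the sketch runs on its own.
T = 204
ln_gdp = np.cumsum(0.008 + 0.010 * rng.normal(size=T))
ln_cpi = np.cumsum(0.009 + 0.005 * rng.normal(size=T))
u = np.zeros(T)
for t in range(1, T):
    u[t] = 0.95 * u[t - 1] + 0.02 * rng.normal()
ln_m1 = 0.5 + 0.9 * ln_gdp + 0.8 * ln_cpi + u

X = np.column_stack([np.ones(T), ln_gdp, ln_cpi])
b, *_ = np.linalg.lstsq(X, ln_m1, rcond=None)    # least squares
e = ln_m1 - X @ b                                 # residuals

# First-order autocorrelation of the residuals; a value near +1 reflects
# the long, slow swings evident in Figure 19.1.
print((e[1:] @ e[:-1]) / (e @ e))
```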

One explanation for autocorrelation is that relevant factors omitted from the time-series regression, like those included, are correlated across periods. This fact may be due to serial correlation in factors that should be in the regression model. It is easy to see why this situation would arise. Example 19.2 shows an obvious case.

Example 19.2 Autocorrelation Induced by Misspecification of the Model
In Examples 2.3 and 6.7, we examined yearly time-series data on the U.S. gasoline market from 1953 to 2004. The evidence in the examples was convincing that a regression model of variation in ln G/Pop should include, at a minimum, a constant, ln PG and ln income/Pop. Other price variables and a time trend also provide significant explanatory power, but these two are a bare minimum. Moreover, we also found on the basis of a Chow test of structural change that apparently this market changed structurally after 1974. Figure 19.2 displays plots of four sets of least squares residuals. Parts (a) through (c) show clearly that as the specification of the regression is expanded, the autocorrelation in the "residuals" diminishes. Part (c) shows the effect of forcing the coefficients in the equation to be the same both before and after the structural shift. In part (d), the residuals in the two subperiods 1953 to 1974 and 1975 to 2004 are produced by separate unrestricted regressions. This latter set of residuals is almost nonautocorrelated.

1Because this chapter deals exclusively with time-series data, we shall use the index t for observations and T for the sample size throughout.


[FIGURE 19.1 Autocorrelated Least Squares Residuals. Residuals plotted by quarter, 1950–2002.]

(Note also that the range of variation of the residuals falls as the model is improved, i.e., as its fit improves.) The full equation is

$$\ln\frac{G_t}{\text{Pop}_t} = \beta_1 + \beta_2\ln PG_t + \beta_3\ln\frac{I_t}{\text{Pop}_t} + \beta_4\ln PNC_t + \beta_5\ln PUC_t + \beta_6\ln PPT_t + \beta_7\ln PN_t + \beta_8\ln PD_t + \beta_9\ln PS_t + \beta_{10}t + \varepsilon_t.$$

Finally, we consider an example in which serial correlation is an anticipated part of the model.

Example 19.3 Negative Autocorrelation in the Phillips Curve
The Phillips curve [Phillips (1957)] has been one of the most intensively studied relationships in the macroeconomics literature. As originally proposed, the model specifies a negative relationship between wage inflation and unemployment in the United Kingdom over a period of 100 years. Recent research has documented a similar relationship between unemployment and price inflation. It is difficult to justify the model when cast in simple levels; labor market theories of the relationship rely on an uncomfortable proposition that markets persistently fall victim to money illusion, even when the inflation can be anticipated. Current research [e.g., Staiger et al. (1996)] has reformulated a short run (disequilibrium) "expectations augmented Phillips curve" in terms of unexpected inflation and unemployment that deviates from a long run equilibrium or "natural rate." The expectations-augmented Phillips curve can be written as

$$p_t - E[p_t \mid \Psi_{t-1}] = \beta[u_t - u^*] + \varepsilon_t,$$

where $p_t$ is the rate of inflation in year t, $E[p_t \mid \Psi_{t-1}]$ is the forecast of $p_t$ made in period t − 1 based on information available at time t − 1, $\Psi_{t-1}$, $u_t$ is the unemployment rate and $u^*$ is the natural, or equilibrium rate. (Whether $u^*$ can be treated as an unchanging parameter, as we are about to do, is controversial.) By construction, $[u_t - u^*]$ is disequilibrium, or cyclical unemployment. In this formulation, $\varepsilon_t$ would be the supply shock (i.e., the stimulus that produces the disequilibrium situation). To complete the model, we require a model for the expected inflation. We will revisit this in some detail in Example 20.2. For the present, we'll


[FIGURE 19.2 Unstandardized Residuals (Bars mark mean res. and +/− 2s(e)). Four panels, (a)–(d), plotting residuals by year, 1952–2000.]

assume that economic agents are rank empiricists. The forecast of next year’s inflation is simply this year’s value. This produces the estimating equation

$$p_t - p_{t-1} = \beta_1 + \beta_2 u_t + \varepsilon_t,$$

where $\beta_2 = \beta$ and $\beta_1 = -\beta u^*$. Note that there is an implied estimate of the natural rate of unemployment embedded in the equation. After estimation, $u^*$ can be estimated by $-b_1/b_2$. The equation was estimated with the 1950.1–2000.4 data in Appendix Table F5.1 that were used in Example 19.1 (minus two quarters for the change in the rate of inflation). Least squares estimates (with standard errors in parentheses) are as follows:

$$p_t - p_{t-1} = \underset{(0.7405)}{0.49189} - \underset{(0.1257)}{0.090136}\,u_t + e_t, \qquad R^2 = 0.002561, \quad T = 201.$$

The implied estimate of the natural rate of unemployment is 5.46 percent, which is in line with other recent estimates. The estimated asymptotic covariance of $b_1$ and $b_2$ is −0.08973. Using the delta method, we obtain a standard error of 2.2062 for this estimate, so a confidence interval for the natural rate is 5.46 percent ± 1.96 (2.21 percent) = (1.13 percent, 9.79 percent) (which seems fairly wide, but, again, whether it is reasonable to treat this as a parameter is at least questionable). The regression of the least squares residuals on their past values gives a slope of −0.4263 with a highly significant t ratio of −6.725. We thus conclude that the residuals (and, apparently, the disturbances) in this model are highly negatively autocorrelated. This is consistent with the striking pattern in Figure 19.3.


[FIGURE 19.3 Negatively Autocorrelated Residuals. Phillips curve deviations from expected inflation, plotted by quarter, 1950–2002.]
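The delta-method calculation reported above is easy to reproduce. The sketch below uses only numbers taken from the text (the coefficient estimates, their standard errors, and their estimated covariance); the gradient of $\hat{u}^* = -b_1/b_2$ is $(-1/b_2,\ b_1/b_2^2)$.

```python
import numpy as np

b1, b2 = 0.49189, -0.090136             # OLS estimates from the text
V = np.array([[0.7405**2, -0.08973],    # asymptotic covariance of (b1, b2),
              [-0.08973, 0.1257**2]])   # built from the reported values

u_star = -b1 / b2                       # implied natural rate, about 5.46
g = np.array([-1.0 / b2, b1 / b2**2])   # gradient of -b1/b2
se = np.sqrt(g @ V @ g)                 # delta-method std. error, about 2.21
print(u_star, se, (u_star - 1.96 * se, u_star + 1.96 * se))
```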

The problems for estimation and inference caused by autocorrelation are similar to (although, unfortunately, more involved than) those caused by heteroscedasticity. As before, least squares is inefficient, and inference based on the least squares estimates is adversely affected. Depending on the underlying process, however, GLS and FGLS estimators can be devised that circumvent these problems. There is one qualitative difference to be noted. In Chapter 8, we examined models in which the generalized regression model can be viewed as an extension of the regression model to the con- ditional second moment of the dependent variable. In the case of autocorrelation, the phenomenon arises in almost all cases from a misspecification of the model. Views differ on how one should react to this failure of the classical assumptions, from a pragmatic one that treats it as another “problem” in the data to an orthodox methodological view that it represents a major specification issue—see, for example, “A Simple Message to Autocorrelation Correctors: Don’t” [Mizon (1995).] We should emphasize that the models we shall examine here are quite far removed from the classical regression. The exact or small-sample properties of the estimators are rarely known, and only their asymptotic properties have been derived.

19.2 THE ANALYSIS OF TIME-SERIES DATA

The treatment in this chapter will be the first structured analysis of time-series data in the text. (We had a brief encounter in Section 4.9.6 where we established some conditions under which moments of time-series data would converge.) Time-series analysis requires some revision of the interpretation of both data generation and sampling that we have maintained thus far.


A time-series model will typically describe the path of a variable $y_t$ in terms of contemporaneous (and perhaps lagged) factors $x_t$, disturbances (innovations), $\varepsilon_t$, and its own past, $y_{t-1}, \ldots$. For example,

$$y_t = \beta_1 + \beta_2 x_t + \beta_3 y_{t-1} + \varepsilon_t.$$

The time series is a single occurrence of a random event. For example, the quarterly series on real output in the United States from 1950 to 2000 that we examined in Example 19.1 is a single realization of a process, $\text{GDP}_t$. The entire history over this period constitutes a realization of the process. At least in economics, the process could not be repeated. There is no counterpart to repeated sampling in a cross section or replication of an experiment involving a time-series process in physics or engineering. Nonetheless, were circumstances different at the end of World War II, the observed history could have been different. In principle, a completely different realization of the entire series might have occurred. The sequence of observations, $\{y_t\}_{t=-\infty}^{\infty}$, is a time-series process, which is characterized by its time ordering and its systematic correlation between observations in the sequence. The signature characteristic of a time-series process is that empirically, the data generating mechanism produces exactly one realization of the sequence. Statistical results based on sampling characteristics concern not random sampling from a population, but from distributions of statistics constructed from sets of observations taken from this realization in a time window, t = 1, …, T. Asymptotic distribution theory in this context concerns behavior of statistics constructed from an increasingly long window in this sequence.

The properties of $y_t$ as a random variable in a cross section are straightforward and are conveniently summarized in a statement about its mean and variance or the probability distribution generating $y_t$. The statement is less obvious here. It is common to assume that innovations are generated independently from one period to the next, with the familiar assumptions

$$E[\varepsilon_t] = 0,$$
$$\text{Var}[\varepsilon_t] = \sigma_\varepsilon^2,$$
and
$$\text{Cov}[\varepsilon_t, \varepsilon_s] = 0 \quad \text{for } t \neq s.$$

In the current context, this distribution of εt is said to be covariance stationary or weakly stationary. Thus, although the substantive notion of “random sampling” must be extended for the time series εt , the mathematical results based on that notion apply here. It can be said, for example, that εt is generated by a time-series process whose mean and variance are not changing over time. As such, by the method we will discuss in this chapter, we could, at least in principle, obtain sample information and use it to characterize the distribution of εt . Could the same be said of yt ? There is an obvious difference between the series εt and yt ; observations on yt at different points in time are necessarily correlated. Suppose that the yt series is weakly stationary and that, for the moment, β2 = 0. Then we could say that

$$E[y_t] = \beta_1 + \beta_3 E[y_{t-1}] + E[\varepsilon_t] = \beta_1/(1 - \beta_3)$$
and
$$\text{Var}[y_t] = \beta_3^2\,\text{Var}[y_{t-1}] + \text{Var}[\varepsilon_t],$$


or
$$\gamma_0 = \beta_3^2\gamma_0 + \sigma_\varepsilon^2,$$
so that
$$\gamma_0 = \frac{\sigma_\varepsilon^2}{1 - \beta_3^2}.$$

Thus, $\gamma_0$, the variance of $y_t$, is a fixed characteristic of the process generating $y_t$. Note how the stationarity assumption, which apparently includes $|\beta_3| < 1$, has been used. The assumption that $|\beta_3| < 1$ is needed to ensure a finite and positive variance.2 Finally, the same results can be obtained for nonzero $\beta_2$ if it is further assumed that $x_t$ is a weakly stationary series.3 Alternatively, consider simply repeated substitution of lagged values into the expression for $y_t$:

$$y_t = \beta_1 + \beta_3(\beta_1 + \beta_3 y_{t-2} + \varepsilon_{t-1}) + \varepsilon_t \tag{19-1}$$

and so on. We see that, in fact, the current $y_t$ is an accumulation of the entire history of the innovations, $\varepsilon_t$. So if we wish to characterize the distribution of $y_t$, then we might do so in terms of sums of random variables. By continuing to substitute for $y_{t-2}$, then $y_{t-3}, \ldots$ in (19-1), we obtain an explicit representation of this idea,

$$y_t = \sum_{i=0}^{\infty}\beta_3^i(\beta_1 + \varepsilon_{t-i}).$$

Do sums that reach back into the infinite past make any sense? We might view the process as having begun generating data at some remote, effectively "infinite" past. As long as distant observations become progressively less important, the extension to an infinite past is merely a mathematical convenience. The diminishing importance of past observations is implied by $|\beta_3| < 1$. Notice that, not coincidentally, this requirement is the same as that needed to solve for $\gamma_0$ in the preceding paragraphs. A second possibility is to assume that the observation of this time series begins at some time 0 [with $(x_0, \varepsilon_0)$ called the initial conditions], by which time the underlying process has reached a state such that the mean and variance of $y_t$ are not (or are no longer) changing over time. The mathematics are slightly different, but we are led to the same characterization of the random process generating $y_t$. In fact, the same weak stationarity assumption ensures both of them.

Except in very special cases, we would expect all the elements in the T component random vector $(y_1, \ldots, y_T)$ to be correlated. In this instance, said correlation is called "autocorrelation." As such, the results pertaining to estimation with independent or uncorrelated observations that we used in the previous chapters are no longer usable. In point of fact, we have a sample of but one observation on the multivariate random variable $[y_t, t = 1, \ldots, T]$. There is a counterpart to the cross-sectional notion of parameter estimation, but only under assumptions (e.g., weak stationarity) that establish that parameters in the familiar sense even exist. Even with stationarity, it will emerge that for estimation and inference, none of our earlier finite sample results are usable. Consistency and asymptotic normality of estimators are somewhat more difficult to establish in time-series settings because results that require independent observations, such as the central limit theorems, are no longer usable. Nonetheless, counterparts to our earlier results have been established for most of the estimation problems we consider here and in Chapters 20 and 21.

2The current literature in macroeconometrics and time series analysis is dominated by analysis of cases in which $\beta_3 = 1$ (or counterparts in different models). We will return to this subject in Chapter 22.
3See Section 19.4.1 on the stationarity assumption.



19.3 DISTURBANCE PROCESSES

The preceding section has introduced a bit of the vocabulary and aspects of time-series specification. To obtain the theoretical results, we need to draw some conclusions about autocorrelation and add some details to that discussion.

19.3.1 CHARACTERISTICS OF DISTURBANCE PROCESSES

In the usual time-series setting, the disturbances are assumed to be homoscedastic but correlated across observations, so that

$$E[\boldsymbol\varepsilon\boldsymbol\varepsilon' \mid \mathbf{X}] = \sigma^2\boldsymbol\Omega,$$

where $\sigma^2\boldsymbol\Omega$ is a full, positive definite matrix with a constant $\sigma^2 = \text{Var}[\varepsilon_t \mid \mathbf{X}]$ on the diagonal. As will be clear in the following discussion, we shall also assume that $\Omega_{ts}$ is a function of |t − s|, but not of t or s alone, which is a stationarity assumption. (See the preceding section.) It implies that the covariance between observations t and s is a function only of |t − s|, the distance apart in time of the observations. Because $\sigma^2$ is not restricted, we normalize $\Omega_{tt} = 1$. We define the autocovariances:

$$\text{Cov}[\varepsilon_t, \varepsilon_{t-s} \mid \mathbf{X}] = \text{Cov}[\varepsilon_{t+s}, \varepsilon_t \mid \mathbf{X}] = \sigma^2\Omega_{t,t-s} = \gamma_s = \gamma_{-s}.$$

Note that $\sigma^2\Omega_{tt} = \gamma_0$. The correlation between $\varepsilon_t$ and $\varepsilon_{t-s}$ is their autocorrelation,

$$\text{Corr}[\varepsilon_t, \varepsilon_{t-s} \mid \mathbf{X}] = \frac{\text{Cov}[\varepsilon_t, \varepsilon_{t-s} \mid \mathbf{X}]}{\sqrt{\text{Var}[\varepsilon_t \mid \mathbf{X}]\,\text{Var}[\varepsilon_{t-s} \mid \mathbf{X}]}} = \frac{\gamma_s}{\gamma_0} = \rho_s = \rho_{-s}.$$

We can then write

$$E[\boldsymbol\varepsilon\boldsymbol\varepsilon' \mid \mathbf{X}] = \boldsymbol\Gamma = \gamma_0\mathbf{R},$$

where $\boldsymbol\Gamma$ is an autocovariance matrix and R is an autocorrelation matrix—the ts element is an autocorrelation coefficient

$$\rho_{ts} = \frac{\gamma_{|t-s|}}{\gamma_0}.$$

(Note that the matrix $\boldsymbol\Gamma = \gamma_0\mathbf{R}$ is the same as $\sigma^2\boldsymbol\Omega$. The name change conforms to standard usage in the literature.) We will usually use the abbreviation $\rho_s$ to denote the autocorrelation between observations s periods apart.

Different types of processes imply different patterns in R. For example, the most frequently analyzed process is a first-order autoregression or AR(1) process,

$$\varepsilon_t = \rho\varepsilon_{t-1} + u_t,$$

where $u_t$ is a stationary, nonautocorrelated (white noise) process and ρ is a parameter. We will verify later that for this process, $\rho_s = \rho^s$. Higher-order autoregressive processes of the form

$$\varepsilon_t = \theta_1\varepsilon_{t-1} + \theta_2\varepsilon_{t-2} + \cdots + \theta_p\varepsilon_{t-p} + u_t$$


imply more involved patterns, including, for some values of the parameters, cyclical behavior of the autocorrelations.4 Stationary autoregressions are structured so that the influence of a given disturbance fades as it recedes into the more distant past but vanishes only asymptotically. For example, for the AR(1), Cov[εt ,εt−s ] is never zero, but it does become negligible if |ρ| is less than 1. Moving-average processes, conversely, have a short memory. For the MA(1) process,

$$\varepsilon_t = u_t - \lambda u_{t-1},$$

the memory in the process is only one period: $\gamma_0 = \sigma_u^2(1 + \lambda^2)$, $\gamma_1 = -\lambda\sigma_u^2$, but $\gamma_s = 0$ if s > 1.
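The one-period memory of the MA(1) is easy to verify by simulation; a minimal sketch with an arbitrarily chosen λ (and $\sigma_u^2 = 1$):

```python
import numpy as np

rng = np.random.default_rng(4)
T, lam = 200_000, 0.6                   # hypothetical lambda; sigma_u = 1
u = rng.normal(size=T + 1)
eps = u[1:] - lam * u[:-1]              # MA(1): eps_t = u_t - lam * u_{t-1}

def acov(x, s):                         # sample autocovariance at lag s
    x = x - x.mean()
    return (x @ x if s == 0 else x[s:] @ x[:-s]) / len(x)

# Expect roughly 1 + lam**2, -lam, 0, 0
print([round(acov(eps, s), 3) for s in range(4)])
```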

19.3.2 AR(1) DISTURBANCES Time-series processes such as the ones listed here can be characterized by their order, the values of their parameters, and the behavior of their autocorrelations.5 We shall consider various forms at different points. The received empirical literature is overwhelmingly dominated by the AR(1) model, which is partly a matter of convenience. Processes more involved than this model are usually extremely difficult to analyze. There is, however, a more practical reason. It is very optimistic to expect to know precisely the correct form of the appropriate model for the disturbance in any given situation. The first-order autoregression has withstood the test of time and experimentation as a reasonable model for underlying processes that probably,in truth, are impenetrably complex. AR(1) works as a first pass—higher order models are often constructed as a refinement—as in the following example. The first-order autoregressive disturbance, or AR(1) process, is represented in the autoregressive form as

$$\varepsilon_t = \rho\varepsilon_{t-1} + u_t, \tag{19-2}$$
where

$$E[u_t \mid \mathbf{X}] = 0,$$
$$E[u_t^2 \mid \mathbf{X}] = \sigma_u^2,$$
and

$$\text{Cov}[u_t, u_s \mid \mathbf{X}] = 0 \quad \text{if } t \neq s.$$

Because ut is white noise, the conditional moments equal the unconditional moments. Thus E[εt | X] = E[εt ] and so on. By repeated substitution, we have

$$\varepsilon_t = u_t + \rho u_{t-1} + \rho^2 u_{t-2} + \cdots. \tag{19-3}$$

From the preceding moving-average form, it is evident that each disturbance εt embodies the entire past history of the u’s, with the most recent observations receiving greater weight than those in the distant past. Depending on the sign of ρ, the series will exhibit

4This model is considered in more detail in Chapter 21.
5See Box and Jenkins (1984) for an authoritative study.


clusters of positive and then negative observations or, if ρ is negative, regular oscillations of sign (as in Example 19.3).

Because the successive values of $u_t$ are uncorrelated, the variance of $\varepsilon_t$ is the variance of the right-hand side of (19-3):

$$\text{Var}[\varepsilon_t] = \sigma_u^2 + \rho^2\sigma_u^2 + \rho^4\sigma_u^2 + \cdots. \tag{19-4}$$

To proceed, a restriction must be placed on ρ,

$$|\rho| < 1, \tag{19-5}$$

because otherwise, the right-hand side of (19-4) will become infinite. This result is the stationarity assumption discussed earlier. With (19-5), which implies that $\lim_{s\to\infty}\rho^s = 0$, $E[\varepsilon_t] = 0$ and

$$\text{Var}[\varepsilon_t] = \frac{\sigma_u^2}{1 - \rho^2} = \sigma_\varepsilon^2. \tag{19-6}$$

With the stationarity assumption, there is an easier way to obtain the variance:

$$\text{Var}[\varepsilon_t] = \rho^2\,\text{Var}[\varepsilon_{t-1}] + \sigma_u^2$$

because $\text{Cov}[u_t, \varepsilon_s] = 0$ if t > s. With stationarity, $\text{Var}[\varepsilon_{t-1}] = \text{Var}[\varepsilon_t]$, which implies (19-6). Proceeding in the same fashion,

$$\text{Cov}[\varepsilon_t, \varepsilon_{t-1}] = E[\varepsilon_t\varepsilon_{t-1}] = E[\varepsilon_{t-1}(\rho\varepsilon_{t-1} + u_t)] = \rho\,\text{Var}[\varepsilon_{t-1}] = \frac{\rho\sigma_u^2}{1 - \rho^2}. \tag{19-7}$$

By repeated substitution in (19-2), we see that for any s,

$$\varepsilon_t = \rho^s\varepsilon_{t-s} + \sum_{i=0}^{s-1}\rho^i u_{t-i}$$

(e.g., $\varepsilon_t = \rho^3\varepsilon_{t-3} + \rho^2 u_{t-2} + \rho u_{t-1} + u_t$). Therefore, because $\varepsilon_s$ is not correlated with any $u_t$ for which t > s (i.e., any subsequent $u_t$), it follows that

$$\text{Cov}[\varepsilon_t, \varepsilon_{t-s}] = E[\varepsilon_t\varepsilon_{t-s}] = \frac{\rho^s\sigma_u^2}{1 - \rho^2}. \tag{19-8}$$

Dividing by $\gamma_0 = \sigma_u^2/(1 - \rho^2)$ provides the autocorrelations:

$$\text{Corr}[\varepsilon_t, \varepsilon_{t-s}] = \rho_s = \rho^s. \tag{19-9}$$

With the stationarity assumption, the autocorrelations fade over time. Depending on the sign of ρ, they will either be declining in geometric progression or alternating in sign if ρ is negative. Collecting terms, we have

$$\sigma^2\boldsymbol\Omega = \frac{\sigma_u^2}{1 - \rho^2}\begin{bmatrix} 1 & \rho & \rho^2 & \rho^3 & \cdots & \rho^{T-1} \\ \rho & 1 & \rho & \rho^2 & \cdots & \rho^{T-2} \\ \rho^2 & \rho & 1 & \rho & \cdots & \rho^{T-3} \\ \vdots & \vdots & \vdots & \vdots & \cdots & \rho \\ \rho^{T-1} & \rho^{T-2} & \rho^{T-3} & \cdots & \rho & 1 \end{bmatrix}. \tag{19-10}$$
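The pattern in (19-9) can be checked by simulation. A minimal sketch with an arbitrarily chosen ρ:

```python
import numpy as np

rng = np.random.default_rng(5)
T, rho = 200_000, 0.7                   # hypothetical rho
eps = np.zeros(T)
for t in range(1, T):                   # AR(1): eps_t = rho*eps_{t-1} + u_t
    eps[t] = rho * eps[t - 1] + rng.normal()

e = eps - eps.mean()
for s in range(1, 5):
    r_s = (e[s:] @ e[:-s]) / (e @ e)    # sample autocorrelation at lag s
    print(s, round(r_s, 3), round(rho**s, 3))   # compare with rho**s
```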


19.4 SOME ASYMPTOTIC RESULTS FOR ANALYZING TIME-SERIES DATA

Because $\boldsymbol\Omega$ is not equal to I, the now-familiar complications will arise in establishing the properties of estimators of β, in particular of the least squares estimator. The finite sample properties of the OLS and GLS estimators remain intact. Least squares will continue to be unbiased; the earlier general proof allows for autocorrelated disturbances. The Aitken theorem (Theorem 8.4) and the distributional results for normally distributed disturbances can still be established conditionally on X. (However, even these will be complicated when X contains lagged values of the dependent variable.) But, finite sample properties are of very limited usefulness in time-series contexts. Nearly all that can be said about estimators involving time-series data is based on their asymptotic properties. As we saw in our analysis of heteroscedasticity, whether least squares is consistent or not depends on the matrices

$$\mathbf{Q}_T = (1/T)\mathbf{X}'\mathbf{X}$$
and
$$\mathbf{Q}_T^* = (1/T)\mathbf{X}'\boldsymbol\Omega\mathbf{X}.$$

In our earlier analyses, we were able to argue for convergence of $\mathbf{Q}_T$ to a positive definite matrix of constants, Q, by invoking laws of large numbers. But, these theorems assume that the observations in the sums are independent, which as suggested in Section 19.2, is surely not the case here. Thus, we require a different tool for this result. We can expand the matrix $\mathbf{Q}_T^*$ as

$$\mathbf{Q}_T^* = \frac{1}{T}\sum_{t=1}^T\sum_{s=1}^T \rho_{ts}\mathbf{x}_t\mathbf{x}_s', \tag{19-11}$$

where $\mathbf{x}_t'$ and $\mathbf{x}_s'$ are rows of X and $\rho_{ts}$ is the autocorrelation between $\varepsilon_t$ and $\varepsilon_s$. Sufficient conditions for this matrix to converge are that $\mathbf{Q}_T$ converge and that the correlations between disturbances die off reasonably rapidly as the observations become further apart in time. For example, if the disturbances follow the AR(1) process described earlier, then $\rho_{ts} = \rho^{|t-s|}$ and if $\mathbf{x}_t$ is sufficiently well behaved, $\mathbf{Q}_T^*$ will converge to a positive definite matrix $\mathbf{Q}^*$ as T → ∞.

Asymptotic normality of the least squares and GLS estimators will depend on the behavior of sums such as

$$\sqrt{T}\,\bar{\mathbf{w}}_T = \sqrt{T}\left(\frac{1}{T}\sum_{t=1}^T \mathbf{x}_t\varepsilon_t\right) = \sqrt{T}\left(\frac{\mathbf{X}'\boldsymbol\varepsilon}{T}\right).$$

Asymptotic normality of least squares is difficult to establish for this general model. The central limit theorems we have relied on thus far do not extend to sums of dependent observations. The results of Amemiya (1985), Mann and Wald (1943), and Anderson (1971) do carry over to most of the familiar types of autocorrelated disturbances, including those that interest us here, so we shall ultimately conclude that ordinary least squares, GLS, and instrumental variables continue to be consistent and asymptotically normally distributed, and, in the case of OLS, inefficient. This section will provide a brief introduction to some of the underlying principles that are used to reach these conclusions.


19.4.1 CONVERGENCE OF MOMENTS—THE ERGODIC THEOREM The discussion thus far has suggested (appropriately) that stationarity (or its absence) is an important characteristic of a process. The points at which we have encountered this notion concerned requirements that certain sums converge to finite values. In particular, for the AR(1) model, εt = ρεt−1 + ut , for the variance of the process to be finite, we require |ρ| < 1, which is a sufficient condition. However, this result is only a byproduct. Stationarity (at least, the weak stationarity we have examined) is only a characteristic of the sequence of moments of a distribution.

DEFINITION 19.1 Strong Stationarity
A time-series process, $\{z_t\}_{t=-\infty}^{\infty}$, is strongly stationary, or "stationary," if the joint probability distribution of any set of k observations in the sequence $[z_t, z_{t+1}, \ldots, z_{t+k}]$ is the same regardless of the origin, t, in the time scale.

For example, in (19-2), if we add $u_t \sim N[0, \sigma_u^2]$, then the resulting process $\{\varepsilon_t\}_{t=-\infty}^{\infty}$ can easily be shown to be strongly stationary.

DEFINITION 19.2 Weak Stationarity
A time-series process, $\{z_t\}_{t=-\infty}^{\infty}$, is weakly stationary (or covariance stationary) if $E[z_t]$ is finite and is the same for all t and if the covariance between any two observations (labeled their autocovariance), $\text{Cov}[z_t, z_{t-k}]$, is a finite function only of model parameters and their distance apart in time, k, but not of the absolute location of either observation on the time scale.

Weak stationarity is obviously implied by strong stationarity, although it requires less because the distribution can, at least in principle, be changing on the time axis. The distinction is rarely necessary in applied work. In general, save for narrow theoretical examples, it will be difficult to come up with a process that is weakly but not strongly stationary. The reason for the distinction is that in much of our work, only weak stationarity is required, and, as always, when possible, econometricians will dispense with unnecessary assumptions. As we will discover shortly, stationarity is a crucial characteristic at this point in the analysis. If we are going to proceed to parameter estimation in this context, we will also require another characteristic of a time series, ergodicity. There are various ways to delineate this characteristic, none of them particularly intuitive. We borrow one definition from Davidson and MacKinnon (1993, p. 132) which comes close:

DEFINITION 19.3 Ergodicity
A strongly stationary time-series process, $\{z_t\}_{t=-\infty}^{\infty}$, is ergodic if for any two bounded functions that map vectors in the a and b dimensional real vector spaces to real scalars, $f: \mathbb{R}^a \to \mathbb{R}^1$ and $g: \mathbb{R}^b \to \mathbb{R}^1$,

$$\lim_{k\to\infty}\left|E[f(z_t, z_{t+1}, \ldots, z_{t+a})\,g(z_{t+k}, z_{t+k+1}, \ldots, z_{t+k+b})]\right| = \left|E[f(z_t, z_{t+1}, \ldots, z_{t+a})]\right|\left|E[g(z_{t+k}, z_{t+k+1}, \ldots, z_{t+k+b})]\right|.$$


The definition states essentially that if events are separated far enough in time, then they are “asymptotically independent.” An implication is that in a time series, every obser- vation will contain at least some unique information. Ergodicity is a crucial element of our theory of estimation. When a time series has this property (with stationarity), then we can consider estimation of parameters in a meaningful sense.6 The analysis relies heavily on the following theorem:

THEOREM 19.1 The Ergodic Theorem
If $\{z_t\}_{t=-\infty}^{\infty}$ is a time-series process that is strongly stationary and ergodic and $E[|z_t|]$ is a finite constant, and if $\bar{z}_T = (1/T)\sum_{t=1}^T z_t$, then $\bar{z}_T \xrightarrow{a.s.} \mu$, where $\mu = E[z_t]$. Note that the convergence is almost surely, not in probability (which is implied) or in mean square (which is also implied). [See White (2001, p. 44) and Davidson and MacKinnon (1993, p. 133).]

What we have in the Ergodic theorem is, for sums of dependent observations, a counterpart to the laws of large numbers that we have used at many points in the preceding chapters. Note, once again, the need for this extension is that to this point, our laws of large numbers have required sums of independent observations. But, in this context, by design, observations are distinctly not independent. For this result to be useful, we will require an extension.

THEOREM 19.2 Ergodicity of Functions
If $\{z_t\}_{t=-\infty}^{\infty}$ is a time-series process that is strongly stationary and ergodic and if $y_t = f\{z_t\}$ is a measurable function in the probability space that defines $z_t$, then $y_t$ is also stationary and ergodic. Let $\{z_t\}_{t=-\infty}^{\infty}$ define a K × 1 vector valued stochastic process—each element of the vector is an ergodic and stationary series, and the characteristics of ergodicity and stationarity apply to the joint distribution of the elements of $\{z_t\}_{t=-\infty}^{\infty}$. Then, the Ergodic theorem applies to functions of $\{z_t\}_{t=-\infty}^{\infty}$. [See White (2001, pp. 44–45) for discussion.]

Theorem 19.2 produces the results we need to characterize the least squares (and other) estimators. In particular, our minimal assumptions about the data are

ASSUMPTION 19.1 Ergodic Data Series: In the regression model, $y_t = \mathbf{x}_t'\boldsymbol\beta + \varepsilon_t$, $[\mathbf{x}_t, \varepsilon_t]_{t=-\infty}^{\infty}$ is a jointly stationary and ergodic process.

6Much of the analysis in later chapters will encounter nonstationary series, which are the focus of most of the current literature—tests for nonstationarity largely dominate the recent study in time-series analysis. Ergodicity is a much more subtle and difficult concept. For any process that we will consider, ergodicity will have to be a given, at least at this level. A classic reference on the subject is Doob (1953). Another authoritative treatise is Billingsley (1995). White (2001) provides a concise analysis of many of these concepts as used in econometrics, and some useful commentary.


By analyzing terms element by element we can use these results directly to assert that averages of $\mathbf{w}_t = \mathbf{x}_t\varepsilon_t$, $\mathbf{Q}_{tt} = \mathbf{x}_t\mathbf{x}_t'$, and $\mathbf{Q}_{tt}^* = \varepsilon_t^2\mathbf{x}_t\mathbf{x}_t'$ will converge to their population counterparts, 0, Q, and Q*.

19.4.2 CONVERGENCE TO NORMALITY—A CENTRAL LIMIT THEOREM

To form a distribution theory for least squares, GLS, ML, and GMM, we will need a counterpart to the central limit theorem. In particular, we need to establish a large sample distribution theory for quantities of the form

$$\sqrt{T}\left(\frac{1}{T}\sum_{t=1}^T \mathbf{x}_t\varepsilon_t\right) = \sqrt{T}\,\bar{\mathbf{w}}.$$

As noted earlier, we cannot invoke the familiar central limit theorems (Lindeberg–Levy, Lindeberg–Feller, Liapounov) because the observations in the sum are not independent. But, with the assumptions already made, we do have an alternative result. Some needed preliminaries are as follows:

DEFINITION 19.4 Martingale Sequence A vector sequence zt is a martingale sequence if E [zt | zt−1, zt−2,...] = zt−1.

An important example of a martingale sequence is the random walk,

$$z_t = z_{t-1} + u_t,$$

where $\text{Cov}[u_t, u_s] = 0$ for all $t \neq s$. Then

E [zt | zt−1, zt−2,...] = E [zt−1 | zt−1, zt−2,...] + E [ut | zt−1, zt−2,...] = zt−1 + 0 = zt−1.

DEFINITION 19.5 Martingale Difference Sequence A vector sequence zt is a martingale difference sequence if E [zt | zt−1, zt−2,...] = 0.

With Definition 19.5, we have the following broadly encompassing result:

THEOREM 19.3 Martingale Difference Central Limit Theorem
If $\mathbf{z}_t$ is a vector valued stationary and ergodic martingale difference sequence, with $E[\mathbf{z}_t\mathbf{z}_t'] = \boldsymbol\Sigma$, where $\boldsymbol\Sigma$ is a finite positive definite matrix, and if $\bar{\mathbf{z}}_T = (1/T)\sum_{t=1}^T \mathbf{z}_t$, then $\sqrt{T}\,\bar{\mathbf{z}}_T \xrightarrow{d} N[\mathbf{0}, \boldsymbol\Sigma]$. [For discussion, see Davidson and MacKinnon (1993, Sections 4.7 and 4.8).]7

7For convenience, we are bypassing a step in this discussion—establishing multivariate normality requires that the result first be established for the marginal normal distribution of each component, then that every linear combination of the variables also be normally distributed (see Theorems D.17 and D.18A). Our interest at this point is merely to collect the useful end results. Interested users may find the detailed discussions of the many subtleties and narrower points in White (2001) and Davidson and MacKinnon (1993, Chapter 4).


Theorem 19.3 is a generalization of the Lindeberg–Levy central limit theorem. It is not yet broad enough to cover cases of autocorrelation, but it does go beyond Lindeberg–Levy, for example, in extending to the GARCH model of Section 19.13.3. [Forms of the theorem that surpass Lindeberg–Feller (D.19) and Liapounov (Theorem D.20) by allowing for different variances at each time, t, appear in Ruud (2000, p. 479) and White (2001, p. 133). These variants extend beyond our requirements in this treatment.] But, looking ahead, this result encompasses what will be a very important application. Suppose in the classical linear regression model, $\{\mathbf{x}_t\}_{t=-\infty}^{\infty}$ is a stationary and ergodic multivariate stochastic process and $\{\varepsilon_t\}_{t=-\infty}^{\infty}$ is an i.i.d. process—that is, not autocorrelated and not heteroscedastic. Then, this is the most general case of the classical model that still maintains the assumptions about $\varepsilon_t$ that we made in Chapter 2. In this case, the process $\{\mathbf{w}_t\}_{t=-\infty}^{\infty} = \{\mathbf{x}_t\varepsilon_t\}_{t=-\infty}^{\infty}$ is a martingale difference sequence, so that with sufficient assumptions on the moments of $\mathbf{x}_t$ we could use this result to establish consistency and asymptotic normality of the least squares estimator. [See, e.g., Hamilton (1994, pp. 208–212).]

We now consider a central limit theorem that is broad enough to include the case that interested us at the outset, stochastically dependent observations on $\mathbf{x}_t$ and autocorrelation in $\varepsilon_t$.8 Suppose as before that $\{\mathbf{z}_t\}_{t=-\infty}^{\infty}$ is a stationary and ergodic stochastic process.9 We consider $\sqrt{T}\,\bar{\mathbf{z}}_T$. The following conditions are assumed:

1. Asymptotic uncorrelatedness: $E[\mathbf{z}_t \mid \mathbf{z}_{t-k}, \mathbf{z}_{t-k-1}, \ldots]$ converges in mean square to zero as k → ∞. Note that this is similar to the condition for ergodicity. White (2001) demonstrates that a (nonobvious) implication of this assumption is $E[\mathbf{z}_t] = \mathbf{0}$.
2. Summability of autocovariances: With dependent observations,

$$\lim_{T\to\infty}\text{Var}[\sqrt{T}\,\bar{\mathbf{z}}_T] = \sum_{t=0}^{\infty}\sum_{s=0}^{\infty}\text{Cov}[\mathbf{z}_t, \mathbf{z}_s'] = \sum_{k=-\infty}^{\infty}\boldsymbol\Gamma_k = \boldsymbol\Gamma^*.$$

To begin, we will need to assume that this matrix is finite, a condition called summability. Note this is the condition needed for convergence of $\mathbf{Q}_T^*$ in (19-11). If the sum is to be finite, then the k = 0 term must be finite, which gives us a necessary condition

$$E[\mathbf{z}_t\mathbf{z}_t'] = \boldsymbol\Gamma_0, \text{ a finite matrix}.$$

3. Asymptotic negligibility of innovations: Let

rtk = E [zt | zt−k, zt−k−1,...] − E [zt | zt−k−1, zt−k−2,...].

An observation $\mathbf{z}_t$ may be viewed as the accumulated information that has entered the process since it began up to time t. Thus, it can be shown that

$$\mathbf{z}_t = \sum_{s=0}^{\infty}\mathbf{r}_{ts}.$$

The vector $\mathbf{r}_{tk}$ can be viewed as the information in this accumulated sum that entered the process at time t − k. The condition imposed on the process is that $\sum_{s=0}^{\infty}E[\mathbf{r}_{ts}'\mathbf{r}_{ts}]$ be finite. In words, condition (3) states that information eventually becomes negligible as it fades far back in time from the current observation. The AR(1) model (as usual) helps to illustrate this point. If $z_t = \rho z_{t-1} + u_t$, then

8Detailed analysis of this case is quite intricate and well beyond the scope of this book. Some fairly terse analysis may be found in White (2001, pp. 122–133) and Hayashi (2000).
9See Hayashi (2000, p. 405), who attributes the results to Gordin (1969).



rt0 = E [zt | zt , zt−1,...] − E [zt | zt−1, zt−2,...] = zt − ρzt−1 = ut

rt1 = E [zt | zt−1, zt−2 ...] − E [zt | zt−2, zt−3 ...]

= E [ρzt−1 + ut | zt−1, zt−2 ...] − E [ρ(ρzt−2 + ut−1) + ut | zt−2, zt−3,...]

= ρ(zt−1 − ρzt−2)

= ρu_{t−1}.

By a similar construction, $\mathbf{r}_{tk} = \rho^k u_{t-k}$, from which it follows that $z_t = \sum_{s=0}^{\infty}\rho^s u_{t-s}$, which we saw earlier in (19-3). You can verify that if |ρ| < 1, the negligibility condition will be met. With all this machinery in place, we now have the theorem we will need:

THEOREM 19.4 Gordin’s Central Limit Theorem √If zt is strongly stationary and ergodic and if conditions (1)–(3) are met, then d ∗ T zT −→ N[0,  ].

We will be able to employ these tools when we consider the least squares, IV, and GLS estimators in the discussion to follow.

19.5 LEAST SQUARES ESTIMATION

The least squares estimator is

$$\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = \boldsymbol\beta + \left(\frac{\mathbf{X}'\mathbf{X}}{T}\right)^{-1}\left(\frac{\mathbf{X}'\boldsymbol\varepsilon}{T}\right).$$

Unbiasedness follows from the results in Chapter 4—no modification is needed. We know from Chapter 8 that the Gauss–Markov theorem has been lost—assuming it exists (that remains to be established), the GLS estimator is efficient and OLS is not. How much information is lost by using least squares instead of GLS depends on the data. Broadly, least squares fares better in data that have long periods and little cyclical variation, such as aggregate output series. As might be expected, the greater is the autocorrelation in ε, the greater will be the benefit to using generalized least squares (when this is possible). Even if the disturbances are normally distributed, the usual F and t statistics do not have those distributions. So, not much remains of the finite sample properties we obtained in Chapter 4. The asymptotic properties remain to be established.

19.5.1 ASYMPTOTIC PROPERTIES OF LEAST SQUARES

The asymptotic properties of b are straightforward to establish given our earlier results. If we assume that the process generating $\mathbf{x}_t$ is stationary and ergodic, then by Theorems 19.1 and 19.2, $(1/T)(\mathbf{X}'\mathbf{X})$ converges to Q and we can apply the Slutsky theorem to the inverse. If $\varepsilon_t$ is not serially correlated, then $\mathbf{w}_t = \mathbf{x}_t\varepsilon_t$ is a martingale difference sequence, so $(1/T)(\mathbf{X}'\boldsymbol\varepsilon)$ converges to zero. This establishes consistency for the simple case.


On the other hand, if $[\mathbf{x}_t, \varepsilon_t]$ are jointly stationary and ergodic, then we can invoke the Ergodic Theorems 19.1 and 19.2 for both moment matrices and establish consistency. Asymptotic normality is a bit more subtle. For the case without serial correlation in $\varepsilon_t$, we can employ Theorem 19.3 for $\sqrt{T}\,\bar{\mathbf{w}}$. The involved case is the one that interested us at the outset of this discussion, that is, where there is autocorrelation in $\varepsilon_t$ and dependence in $\mathbf{x}_t$. Theorem 19.4 is in place for this case. Once again, the conditions described in the preceding section must apply and, moreover, the assumptions needed will have to be established both for $\mathbf{x}_t$ and $\varepsilon_t$. Commentary on these cases may be found in Davidson and MacKinnon (1993), Hamilton (1994), White (2001), and Hayashi (2000). Formal presentation extends beyond the scope of this text, so at this point, we will proceed, and assume that the conditions underlying Theorem 19.4 are met. The results suggested here are quite general, albeit only sketched for the general case. For the remainder of our examination, at least in this chapter, we will confine attention to fairly simple processes in which the necessary conditions for the asymptotic distribution theory will be fairly evident.

There is an important exception to the results in the preceding paragraph. If the regression contains any lagged values of the dependent variable, then least squares will no longer be unbiased or consistent. To take the simplest case, suppose that

    y_t = βy_{t−1} + ε_t,
    ε_t = ρε_{t−1} + u_t,    (19-12)

and assume |β| < 1, |ρ| < 1. In this model, the regressor and the disturbance are correlated. There are various ways to approach the analysis. One useful way is to rearrange (19-12) by subtracting ρy_{t−1} from y_t. Then,

    y_t = (β + ρ)y_{t−1} − βρy_{t−2} + u_t,    (19-13)

which is a classical regression with stochastic regressors. Because u_t is an innovation in period t, it is uncorrelated with both regressors, and least squares regression of y_t on (y_{t−1}, y_{t−2}) estimates ρ₁ = (β + ρ) and ρ₂ = −βρ. What is estimated by regression of y_t on y_{t−1} alone? Let γ_k = Cov[y_t, y_{t−k}] = Cov[y_t, y_{t+k}]. By stationarity, Var[y_t] = Var[y_{t−1}], and Cov[y_t, y_{t−1}] = Cov[y_{t−1}, y_{t−2}], and so on. These and (19-13) imply the following relationships:

    γ₀ = ρ₁γ₁ + ρ₂γ₂ + σ_u²,

    γ₁ = ρ₁γ₀ + ρ₂γ₁,    (19-14)

    γ₂ = ρ₁γ₁ + ρ₂γ₀.

(These are the Yule–Walker equations for this model. See Section 21.2.3.) The slope in the simple regression estimates γ₁/γ₀, which can be found in the solutions to these three equations. (An alternative approach is to use the left-out variable formula, which is a useful way to interpret this estimator.) In this case, we see that the slope in the short regression is an estimator of (β + ρ) − βρ(γ₁/γ₀). In either case, solving the three equations in (19-14) for γ₀, γ₁, and γ₂ in terms of ρ₁, ρ₂, and σ_u² produces

    plim b = (β + ρ)/(1 + βρ).    (19-15)


This result is between β (when ρ = 0) and 1 (when both β and ρ = 1). Therefore, least squares is inconsistent unless ρ equals zero. The more general case that includes regres- sors, xt , involves more complicated algebra, but gives essentially the same result. This is a general result; when the equation contains a lagged dependent variable in the pres- ence of autocorrelation, OLS and GLS are inconsistent. The problem can be viewed as one of an omitted variable.
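A minimal simulation sketch illustrates (19-15). With the illustrative values β = 0.5 and ρ = 0.4, the probability limit (β + ρ)/(1 + βρ) is 0.75, and the OLS slope settles near that value rather than near β:

```python
# Monte Carlo sketch of the inconsistency in (19-15); beta and rho are
# illustrative values chosen so (beta + rho) / (1 + beta * rho) = 0.75.
import numpy as np

rng = np.random.default_rng(0)
beta, rho, T = 0.5, 0.4, 100_000

u = rng.standard_normal(T)
eps = np.zeros(T)
y = np.zeros(T)
for t in range(1, T):
    eps[t] = rho * eps[t - 1] + u[t]      # AR(1) disturbance (19-12)
    y[t] = beta * y[t - 1] + eps[t]       # regression with lagged dependent variable

b = (y[1:] @ y[:-1]) / (y[:-1] @ y[:-1])  # OLS slope of y_t on y_{t-1}
print(b, (beta + rho) / (1 + beta * rho)) # b is near 0.75, not 0.5
```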

19.5.2 ESTIMATING THE VARIANCE OF THE LEAST SQUARES ESTIMATOR

As usual, s²(X′X)⁻¹ is an inappropriate estimator of σ²(X′X)⁻¹(X′ΩX)(X′X)⁻¹, both because s² is a biased estimator of σ² and because the matrix is incorrect. Generalities are scarce, but in general, for economic time series that are positively related to their past values, the standard errors conventionally estimated by least squares are likely to be too small. For slowly changing, trending aggregates such as output and consumption, this is probably the norm. For highly variable data such as inflation, exchange rates, and market returns, the situation is less clear. Nonetheless, as a general proposition, one would normally not want to rely on s²(X′X)⁻¹ as an estimator of the asymptotic covariance matrix of the least squares estimator.

In view of this situation, if one is going to use least squares, then it is desirable to have an appropriate estimator of the covariance matrix of the least squares estimator. There are two approaches. If the form of the autocorrelation is known, then one can estimate the parameters of Ω directly and compute a consistent estimator. Of course, if so, then it would be more sensible to use feasible generalized least squares instead and not waste the sample information on an inefficient estimator. The second approach parallels the use of the White estimator for heteroscedasticity. The extension of White's result to the more general case of autocorrelation is much more difficult than in the heteroscedasticity case. The natural counterpart for estimating

    Q* = (1/n) Σ_{i=1}^n Σ_{j=1}^n σ_ij x_i x′_j    (19-16)

in Section 8.2.3 would be

    Q̂* = (1/T) Σ_{t=1}^T Σ_{s=1}^T e_t e_s x_t x′_s.

But there are two problems with this estimator, one theoretical, which applies to Q* as well, and one practical, which is specific to the latter. Unlike the heteroscedasticity case, the matrix in (19-16) is 1/T times a sum of T² terms, so it is difficult to conclude yet that it will converge to anything at all. This application is most likely to arise in a time-series setting. To obtain convergence, it is necessary to assume that the terms involving unequal subscripts in (19-16) diminish in importance as T grows. A sufficient condition is that terms with subscript pairs |t − s| grow smaller as the distance between them grows larger. In practical terms, observation pairs are progressively less correlated as their separation in time grows. Intuitively, if one can think of weights with the diagonal elements getting a weight of 1.0, then in the sum, the weights grow smaller as we move away from the diagonal. If we think of the sum of the weights rather than just the number of terms, then this sum falls off sufficiently rapidly that as T grows large, the sum is of order T rather than T². Thus,


TABLE 19.1  Robust Covariance Estimation

Variable     OLS Estimate   OLS SE     Corrected SE
Constant     −1.6331        0.2286     0.3335
ln Output     0.2871        0.04738    0.07806
ln CPI        0.9718        0.03377    0.06585

R² = 0.98952, d = 0.02477, r = 0.98762.

we achieve convergence of Q* by assuming that the rows of X are well behaved and that the correlations diminish with increasing separation in time. (See Sections 4.9.6 and 21.2.5 for a more formal statement of this condition.)

The practical problem is that Q̂* need not be positive definite. Newey and West (1987a) have devised an estimator that overcomes this difficulty:

    Q̂* = S₀ + (1/T) Σ_{l=1}^L Σ_{t=l+1}^T w_l e_t e_{t−l} (x_t x′_{t−l} + x_{t−l} x′_t),
    w_l = 1 − l/(L + 1).    (19-17)

[See (8-26).] The Newey–West autocorrelation consistent covariance estimator is surprisingly simple and relatively easy to implement.10 There is a final problem to be solved. It must be determined in advance how large L is to be. In general, there is little theoretical guidance. Current practice specifies L ≈ T^{1/4}. Unfortunately, the result is not quite as crisp as that for the heteroscedasticity consistent estimator.

We have the result that b and b_IV are asymptotically normally distributed, and we have an appropriate estimator for the asymptotic covariance matrix. We have not specified the distribution of the disturbances, however. Thus, for inference purposes, the F statistic is approximate at best. Moreover, for more involved hypotheses, the likelihood ratio and Lagrange multiplier tests are unavailable. That leaves the Wald statistic, including asymptotic "t ratios," as the main tool for statistical inference. We will examine a number of applications in the chapters to follow. The White and Newey–West estimators are standard in the econometrics literature. We will encounter them at many points in the discussion to follow.
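A direct transcription of (19-17) into code is a useful check on the formula. In this sketch, X and e are assumed to be the T × K regressor matrix and the OLS residual vector, and L is the chosen lag window:

```python
# A sketch of the Newey-West estimator (19-17); not a canned routine.
import numpy as np

def newey_west(X, e, L):
    T, K = X.shape
    Xe = X * e[:, None]                  # rows are x_t e_t
    Q = (Xe.T @ Xe) / T                  # S0: the l = 0 (White) term
    for l in range(1, L + 1):
        w = 1.0 - l / (L + 1.0)          # Bartlett weight w_l
        G = (Xe[l:].T @ Xe[:-l]) / T     # (1/T) sum of e_t e_{t-l} x_t x_{t-l}'
        Q += w * (G + G.T)               # add both cross-product terms
    XX_inv = np.linalg.inv(X.T @ X / T)
    return XX_inv @ Q @ XX_inv / T       # Est. Asy. Var[b]
```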

Example 19.4  Autocorrelation Consistent Covariance Estimation
For the model shown in Example 19.1, the regression results with the uncorrected standard errors and the Newey–West autocorrelation robust covariance matrix for lags of five quarters are shown in Table 19.1. The effect of the very high degree of autocorrelation is evident.

19.6 GMM ESTIMATION

The GMM estimator in the regression model with autocorrelated disturbances is produced by the empirical moment equations

    (1/T) Σ_{t=1}^T x_t (y_t − x′_t β̂_GMM) = (1/T) X′ε̂(β̂_GMM) = m̄(β̂_GMM) = 0.    (19-18)

10 Both estimators are now standard features in modern econometrics computer programs. Further results on different weighting schemes may be found in Hayashi (2000, pp. 406–410).


The estimator is obtained by minimizing

    q = m̄(β̂_GMM)′ W m̄(β̂_GMM),

where W is a positive definite weighting matrix. The optimal weighting matrix would be

    W = {Asy. Var[√T m̄(β)]}⁻¹,

which is the inverse of

    Asy. Var[√T m̄(β)] = Asy. Var[(1/√T) Σ_{t=1}^T x_t ε_t] = plim_{T→∞} (1/T) Σ_{t=1}^T Σ_{s=1}^T σ²ρ_ts x_t x′_s = σ²Q*.

The optimal weighting matrix would be [σ²Q*]⁻¹. As in the heteroscedasticity case, this minimization problem is an exactly identified case, so the weighting matrix is actually irrelevant to the solution. The GMM estimator for the regression model with autocorrelated disturbances is ordinary least squares. We can use the results in Section 19.5.2 to construct the asymptotic covariance matrix. We will require the assumptions in Section 19.4 to obtain convergence of the moments and asymptotic normality. We will wish to extend this simple result in one instance. In the common case in which x_t contains lagged values of y_t, we will want to use an instrumental variable estimator. We will return to that estimation problem in Section 19.9.3.

19.7 TESTING FOR AUTOCORRELATION

The available tests for autocorrelation are based on the principle that if the true distur- bances are autocorrelated, then this fact can be detected through the autocorrelations of the least squares residuals. The simplest indicator is the slope in the artificial regression

    e_t = r e_{t−1} + v_t,
    e_t = y_t − x′_t b,    (19-19)
    r = Σ_{t=2}^T e_t e_{t−1} / Σ_{t=1}^{T−1} e_t².

If there is autocorrelation, then the slope in this regression will be an estimator of ρ = Corr[ε_t, ε_{t−1}]. The complication in the analysis lies in determining a formal means of evaluating when the estimator is "large," that is, on what statistical basis to reject the null hypothesis that ρ equals zero. As a first approximation, treating (19-19) as a classical linear model and using a t or F (squared t) test to test the hypothesis is a valid way to proceed based on the Lagrange multiplier principle. We used this device in Example 19.3. The tests we consider here are refinements of this approach.

19.7.1 LAGRANGE MULTIPLIER TEST

The Breusch (1978)–Godfrey (1978) test is a Lagrange multiplier test of H₀: no autocorrelation versus H₁: ε_t = AR(P) or ε_t = MA(P). The same test is used for either


structure. The test statistic is

    LM = T [e′X₀(X₀′X₀)⁻¹X₀′e / e′e] = TR₀²,    (19-20)

where X₀ is the original X matrix augmented by P additional columns containing the lagged OLS residuals, e_{t−1}, …, e_{t−P}. The test can be carried out simply by regressing the ordinary least squares residuals e_t on x_{t0} (filling in missing values for lagged residuals with zeros) and referring TR₀² to the tabled critical value for the chi-squared distribution with P degrees of freedom.11 Because X′e = 0, the test is equivalent to regressing e_t on the part of the lagged residuals that is unexplained by X. There is therefore a compelling logic to it; if any fit is found, then it is due to correlation between the current and lagged residuals. The test is a joint test of the first P autocorrelations of ε_t, not just the first.
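A minimal sketch of the computation in (19-20) follows. X and e are assumed to be the original regressor matrix and the OLS residuals, and missing lagged residuals are filled with zeros, per the warning in footnote 11:

```python
# A sketch of the Breusch-Godfrey LM statistic (19-20).
import numpy as np

def breusch_godfrey(X, e, P):
    T = len(e)
    lags = np.column_stack([np.r_[np.zeros(p), e[:-p]] for p in range(1, P + 1)])
    X0 = np.column_stack([X, lags])          # X augmented with P lagged residuals
    b = np.linalg.lstsq(X0, e, rcond=None)[0]
    resid = e - X0 @ b
    # With a constant in X, e has zero mean, so e'e is the (centered) total SS.
    R2 = 1.0 - (resid @ resid) / (e @ e)
    return T * R2                            # refer to chi-squared[P]
```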

19.7.2 BOX AND PIERCE'S TEST AND LJUNG'S REFINEMENT

An alternative test that is asymptotically equivalent to the LM test when the null hypothesis, ρ = 0, is true and when X does not contain lagged values of y is due to Box and Pierce (1970). The Q test is carried out by referring

    Q = T Σ_{j=1}^P r_j²,    (19-21)

where r_j = (Σ_{t=j+1}^T e_t e_{t−j}) / (Σ_{t=1}^T e_t²), to the critical values of the chi-squared table with P degrees of freedom. A refinement suggested by Ljung and Box (1979) is

    Q′ = T(T + 2) Σ_{j=1}^P r_j²/(T − j).    (19-22)

The essential difference between the Godfrey–Breusch and the Box–Pierce tests is the use of partial correlations (controlling for X and the other variables) in the former and simple correlations in the latter. Under the null hypothesis, there is no autocorrelation in ε_t, and no correlation between x_t and ε_s in any event, so the two tests are asymptotically equivalent. On the other hand, because it does not condition on x_t, the Box–Pierce test is less powerful than the LM test when the null hypothesis is false, as intuition might suggest.
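Both statistics are simple functions of the residual autocorrelations; a sketch of (19-21) and (19-22):

```python
# A sketch of the Box-Pierce and Ljung-Box statistics from residuals e.
import numpy as np

def q_tests(e, P):
    T = len(e)
    denom = e @ e
    r = np.array([(e[j:] @ e[:-j]) / denom for j in range(1, P + 1)])
    Q_bp = T * np.sum(r**2)                                        # (19-21)
    Q_lb = T * (T + 2) * np.sum(r**2 / (T - np.arange(1, P + 1)))  # (19-22)
    return Q_bp, Q_lb                       # both referred to chi-squared[P]
```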

19.7.3 THE DURBIN–WATSON TEST

The Durbin–Watson statistic12 was the first formal procedure developed for testing for autocorrelation using the least squares residuals. The test statistic is

    d = Σ_{t=2}^T (e_t − e_{t−1})² / Σ_{t=1}^T e_t² = 2(1 − r) − (e₁² + e_T²)/Σ_{t=1}^T e_t²,    (19-23)

where r is the same first-order autocorrelation that underlies the preceding two statistics. If the sample is reasonably large, then the last term will be negligible, leaving

11 A warning to practitioners: Current software varies on whether the lagged residuals are filled with zeros or the first P observations are simply dropped when computing this statistic. In the interest of replicability, users should determine which is the case before reporting results.
12 Durbin and Watson (1950, 1951, 1971).


d ≈ 2(1 − r). The statistic takes this form because the authors were able to determine the exact distribution of this transformation of the autocorrelation and could provide tables of critical values. Usable critical values that depend only on T and K are presented in tables such as those at the end of this book. The one-sided test for H₀: ρ = 0 against H₁: ρ > 0 is carried out by comparing d to values d_L(T, K) and d_U(T, K). If d < d_L, the null hypothesis is rejected; if d > d_U, the hypothesis is not rejected. If d lies between d_L and d_U, then no conclusion is drawn.
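The statistic itself is a one-line computation; a sketch of (19-23) and its approximation:

```python
# A sketch of the Durbin-Watson statistic from OLS residuals e.
import numpy as np

def durbin_watson(e):
    return np.sum(np.diff(e)**2) / (e @ e)

# With r = (e[1:] @ e[:-1]) / (e @ e), d is approximately 2 * (1 - r)
# in large samples.
```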

19.7.4 TESTING IN THE PRESENCE OF A LAGGED DEPENDENT VARIABLE

The Durbin–Watson test is not likely to be valid when there is a lagged dependent variable in the equation.13 The statistic will usually be biased toward a finding of no autocorrelation. Three alternatives have been devised. The LM and Q tests can be used whether or not the regression contains a lagged dependent variable. (In the absence of a lagged dependent variable, they are asymptotically equivalent.) As an alternative to the standard test, Durbin (1970) derived a Lagrange multiplier test that is appropriate in the presence of a lagged dependent variable. The test may be carried out by referring

    h = r √[T/(1 − T s_c²)],    (19-24)

where s_c² is the estimated variance of the least squares regression coefficient on y_{t−1}, to the standard normal tables. Large values of h lead to rejection of H₀. The test has the virtues that it can be used even if the regression contains additional lags of y_t, and it can be computed using the standard results from the initial regression without any further regressions. If s_c² > 1/T, however, then it cannot be computed. An alternative is to regress e_t on x_t, y_{t−1}, …, e_{t−1}, and any additional lags that are appropriate for e_t and then to test the joint significance of the coefficient(s) on the lagged residual(s) with the standard F test. This method is a minor modification of the Breusch–Godfrey test. Under H₀, the coefficients on the remaining variables will be zero, so the tests are the same asymptotically.
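A sketch of (19-24), including the case in which the statistic is undefined; r is the first-order residual autocorrelation and s_c2 the estimated variance of the coefficient on y_{t−1}:

```python
# A sketch of Durbin's h statistic (19-24).
import numpy as np

def durbin_h(r, s_c2, T):
    if T * s_c2 >= 1.0:
        return np.nan                       # h cannot be computed in this case
    return r * np.sqrt(T / (1.0 - T * s_c2))  # refer to standard normal tables
```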

19.7.5 SUMMARY OF TESTING PROCEDURES

The preceding has examined several testing procedures for locating autocorrelation in the disturbances. In all cases, the procedure examines the least squares residuals. We can summarize the procedures as follows:

LM test. LM = TR² in a regression of the least squares residuals on [x_t, e_{t−1}, …, e_{t−P}]. Reject H₀ if LM > χ²*[P]. This test examines the covariance of the residuals with lagged values, controlling for the intervening effect of the independent variables.

Q test. Q = T(T + 2) Σ_{j=1}^P r_j²/(T − j). Reject H₀ if Q > χ²*[P]. This test examines the raw correlations between the residuals and P lagged values of the residuals.

Durbin–Watson test. d = 2(1 − r). Reject H₀: ρ = 0 if d < d_L*. This test looks directly at the first-order autocorrelation of the residuals.

13 This issue has been studied by Nerlove and Wallis (1966), Durbin (1970), and Dezhbaksh (1990).


Durbin's test. F_D = the F statistic for the joint significance of P lags of the residuals in the regression of the least squares residuals on [x_t, y_{t−1}, …, y_{t−R}, e_{t−1}, …, e_{t−P}]. Reject H₀ if F_D > F*[P, T − K − P]. This test examines the partial correlations between the residuals and the lagged residuals, controlling for the intervening effect of the independent variables and the lagged dependent variable.

The Durbin–Watson test has some major shortcomings. The inconclusive region is large if T is small or moderate. The bounding distributions, while free of the parameters β and σ, do depend on the data (and assume that X is nonstochastic). An exact version based on an algorithm developed by Imhof (1980) avoids the inconclusive region, but is rarely used. The LM and Box–Pierce statistics do not share these shortcomings—their limiting distributions are chi-squared independently of the data and the parameters. For this reason, the LM test has become the standard method in applied research.

19.8 EFFICIENT ESTIMATION WHEN Ω IS KNOWN

As a prelude to deriving feasible estimators for β in this model, we consider full generalized least squares estimation assuming that Ω is known. In the next section, we will turn to the more realistic case in which Ω must be estimated as well. If the parameters of Ω are known, then the GLS estimator,

    β̂ = (X′Ω⁻¹X)⁻¹(X′Ω⁻¹y),    (19-25)

and the estimate of its sampling variance,

    Est. Var[β̂] = σ̂_ε² [X′Ω⁻¹X]⁻¹,    (19-26)

where

    σ̂_ε² = (y − Xβ̂)′Ω⁻¹(y − Xβ̂)/T    (19-27)

can be computed in one step. For the AR(1) case, data for the transformed model are

    y* = [√(1 − ρ²) y₁,  y₂ − ρy₁,  y₃ − ρy₂,  …,  y_T − ρy_{T−1}]′,
    X* = [√(1 − ρ²) x₁,  x₂ − ρx₁,  x₃ − ρx₂,  …,  x_T − ρx_{T−1}]′.    (19-28)

These transformations are variously labeled partial differences, quasi differences, or pseudo-differences. Note that in the transformed model, every observation except the first contains a constant term. What was the column of 1s in X is transformed to [(1 − ρ²)^{1/2}, (1 − ρ), (1 − ρ), …]. Therefore, if the sample is relatively small, then the problems with measures of fit noted in Section 3.5 will reappear. The variance of the transformed disturbance is

    Var[ε_t − ρε_{t−1}] = Var[u_t] = σ_u².

The variance of the first disturbance is also σ_u²; [see (19-6)]. This can be estimated using (1 − ρ²)σ̂_ε².
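A sketch of the transformation in (19-28) and the resulting regression, for a given value of ρ; y and X are assumed to hold the original data:

```python
# A sketch of the Prais-Winsten quasi-differencing step (19-28).
import numpy as np

def prais_winsten(y, X, rho):
    c = np.sqrt(1.0 - rho**2)
    y_star = np.r_[c * y[0], y[1:] - rho * y[:-1]]      # rescaled first observation
    X_star = np.vstack([c * X[0], X[1:] - rho * X[:-1]])
    beta = np.linalg.lstsq(X_star, y_star, rcond=None)[0]
    return beta, y_star, X_star

# Dropping the first row of (y_star, X_star) instead gives the
# Cochrane-Orcutt estimator.
```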


Corresponding results have been derived for higher-order autoregressive processes. For the AR(2) model,

    ε_t = θ₁ε_{t−1} + θ₂ε_{t−2} + u_t,    (19-29)

the transformed data for generalized least squares are obtained by

    z*₁ = [(1 + θ₂)[(1 − θ₂)² − θ₁²]/(1 − θ₂)]^{1/2} z₁,

    z*₂ = (1 − θ₂²)^{1/2} [z₂ − (θ₁/(1 − θ₂)) z₁],    (19-30)

    z*_t = z_t − θ₁z_{t−1} − θ₂z_{t−2},  t > 2,

where zt is used for yt or xt . The transformation becomes progressively more complex for higher-order processes.14 Note that in both the AR(1) and AR(2) models, the transformation to y∗ and X∗ involves “starting values” for the processes that depend only on the first one or two observations. We can view the process as having begun in the infinite past. Because the sample contains only T observations, however, it is convenient to treat the first one or two (or P) observations as shown and consider them as “initial values.” Whether we view the process as having begun at time t = 1 or in the infinite past is ultimately immaterial in regard to the asymptotic properties of the estimators. The asymptotic properties for the GLS estimator are quite straightforward given the apparatus we assembled in Section 19.4. We begin by assuming that {xt ,εt } are jointly an ergodic, stationary process. Then, after the GLS transformation, {x∗t ,ε∗t } is also stationary and ergodic. Moreover, ε∗t is nonautocorrelated by construction. In the transformed model, then, {w∗t }={x∗t ε∗t } is a stationary and ergodic martingale difference series. We can use the ergodic theorem to establish consistency and the central limit theorem for martingale difference sequences to establish asymptotic normality for GLS in this model. Formal arrangement of the relevant results is left as an exercise.

19.9 ESTIMATION WHEN Ω IS UNKNOWN

For an unknown Ω, there are a variety of approaches. Any consistent estimator of Ω(ρ) will suffice—recall from Theorem 8.5 in Section 8.3.2, all that is needed for efficient estimation of β is a consistent estimator of Ω(ρ). The complication arises, as might be expected, in estimating the autocorrelation parameter(s).

19.9.1 AR(1) DISTURBANCES

The AR(1) model is the one most widely used and studied. The most common procedure is to begin FGLS with a natural estimator of ρ, the autocorrelation of the residuals. Because b is consistent, we can use r. Others that have been suggested include Theil's (1971) estimator, r(T − K)/(T − 1), and Durbin's (1970), the slope on y_{t−1} in a regression

14 See Box and Jenkins (1984) and Fuller (1976).


of y_t on y_{t−1}, x_t and x_{t−1}. The second step is FGLS based on (19-25)–(19-28). This is the Prais and Winsten (1954) estimator. The Cochrane and Orcutt (1949) estimator (based on computational ease) omits the first observation.

It is possible to iterate any of these estimators to convergence. Because the estimator is asymptotically efficient at every iteration, nothing is gained by doing so. Unlike the heteroscedastic model, iterating when there is autocorrelation does not produce the maximum likelihood estimator. The iterated FGLS estimator, regardless of the estimator of ρ, does not account for the term (1/2) ln(1 − ρ²) in the log-likelihood function [see the following (19-31)].

Maximum likelihood estimators can be obtained by maximizing the log-likelihood with respect to β, σ_u², and ρ. The log-likelihood function may be written

    ln L = −Σ_{t=1}^T u_t²/(2σ_u²) + (1/2) ln(1 − ρ²) − (T/2)[ln 2π + ln σ_u²],    (19-31)

where, as before, the first observation is computed differently from the others using (19-28). The MLE for this model is developed in Section 16.9.2.d. Based on the MLE, the standard approximations to the asymptotic variances of the estimators are:

    Est. Asy. Var[β̂_ML] = σ̂²_{ε,ML} [X′Ω̂⁻¹_ML X]⁻¹,
    Est. Asy. Var[σ̂²_{u,ML}] = 2σ̂⁴_{u,ML}/T,    (19-32)
    Est. Asy. Var[ρ̂_ML] = (1 − ρ̂²_ML)/T.

All the foregoing estimators have the same asymptotic properties. The available evidence on their small-sample properties comes from Monte Carlo studies and is, unfortunately, only suggestive. Griliches and Rao (1969) find evidence that if the sample is relatively small and ρ is not particularly large, say, less than 0.3, then least squares is as good as or better than FGLS. The problem is the additional variation introduced into the sampling variance by the variance of r. Beyond these, the results are rather mixed. Maximum likelihood seems to perform well in general, but the Prais–Winsten estimator is evidently nearly as efficient. Both estimators have been incorporated in all contemporary software. In practice, the Prais and Winsten (1954) and Beach and MacKinnon (1978a) maximum likelihood estimators are probably the most common choices.
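A sketch of (19-31), evaluated by concentrating β and σ_u² out for each trial value of ρ; maximizing this profile over a grid of ρ values is a serviceable, if crude, substitute for the Beach and MacKinnon algorithm:

```python
# A sketch of the exact AR(1) log-likelihood (19-31), profiled over rho.
import numpy as np

def ar1_loglike(y, X, rho):
    c = np.sqrt(1.0 - rho**2)
    y_s = np.r_[c * y[0], y[1:] - rho * y[:-1]]     # Prais-Winsten transform
    X_s = np.vstack([c * X[0], X[1:] - rho * X[:-1]])
    beta = np.linalg.lstsq(X_s, y_s, rcond=None)[0] # GLS beta given rho
    u = y_s - X_s @ beta
    T = len(y)
    s2_u = (u @ u) / T                              # MLE of sigma_u^2 given rho
    return (-(u @ u) / (2 * s2_u) + 0.5 * np.log(1 - rho**2)
            - (T / 2) * (np.log(2 * np.pi) + np.log(s2_u)))

# rho_hat = max(np.linspace(-0.99, 0.99, 199), key=lambda r: ar1_loglike(y, X, r))
```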

19.9.2 APPLICATION: ESTIMATION OF A MODEL WITH AUTOCORRELATION The model of the U.S. gasoline market that appears in Example 6.7 is

The model of the U.S. gasoline market that appears in Example 6.7 is

    ln(G_t/pop_t) = β₁ + β₂ ln(I_t/pop_t) + β₃ ln PG,t + β₄ ln PNC,t + β₅ ln PUC,t + β₆t + ε_t.

The results in Figure 19.2 suggest that the specification may be incomplete, and, if so, there may be autocorrelation in the disturbances in this specification. Least squares estimates of the parameters using the data in Appendix Table F2.2 appear in the first row of Table 19.2. [The dependent variable is ln(Gas expenditure/(price × population)). These are the OLS results reported in Example 6.7.] The first five autocorrelations of the least squares residuals are 0.667, 0.438, 0.142, −0.018, and −0.198. This produces Box–Pierce


TABLE 19.2  Parameter Estimates (standard errors in parentheses)

                β1        β2        β3         β4        β5          β6           ρ
OLS           −26.43     1.6017   −0.06166   −0.1408   −0.001293   −0.01518      0.0000
              (1.835)   (0.1790)  (0.03872)  (0.1628)  (0.09664)   (0.004393)   (0.0000)
              R² = 0.96780
Prais–        −18.58     0.7447   −0.1138    −0.1364   −0.008956    0.006689     0.9567
Winsten       (1.768)   (0.1761)  (0.03689)  (0.1528)  (0.07213)   (0.004974)   (0.04078)
Cochrane–     −18.76     0.7300   −0.1080    −0.06675   0.04190    −0.0001653    0.9695
Orcutt        (1.382)   (0.1377)  (0.02885)  (0.1201)  (0.05713)   (0.004082)   (0.03434)
Maximum       −16.25     0.4690   −0.1387    −0.09682  −0.001485    0.01280      0.9792
Likelihood    (1.391)   (0.1350)  (0.02794)  (0.1270)  (0.05198)   (0.004427)   (0.02844)
AR(2)         −19.45     0.8116   −0.09538   −0.09099   0.04091    −0.001374     0.8610
              (1.495)   (0.1502)  (0.03117)  (0.1297)  (0.06558)   (0.004227)   (0.07053)

and Box–Ljung statistics of 36.217 and 38.789, respectively, both of which are larger than the critical value from the chi-squared table of 11.01. We regressed the least squares residuals on the independent variables and five lags of the residuals. (The missing values in the first five years were filled with zeros.) The coefficients on the lagged residuals and the associated t statistics are 0.741 (4.635), 0.153 (0.789), −0.246 (−1.262), 0.0942 (0.472), and −0.125 (−0.658). The R² in this regression is 0.549086, which produces a chi-squared value of 28.55. This is larger than the critical value of 11.01, so once again, the null hypothesis of zero autocorrelation is rejected. Finally, the Durbin–Watson statistic is 0.425007. For 5 regressors and 52 observations, the critical value of d_L is 1.36, so on this basis as well, the null hypothesis ρ = 0 would be rejected. The plot of the residuals shown in Figure 19.4 seems consistent with this conclusion.

FIGURE 19.4 Least Squares Residuals.



The Prais and Winsten FGLS estimates appear in the second row of Table 19.2, followed by the Cochrane and Orcutt results, then the maximum likelihood estimates. [The autocorrelation coefficient computed using (19-19) is 0.78750. The MLE is computed using the Beach and MacKinnon algorithm. See Section 16.9.2.b.] Finally, we fit the AR(2) model by first regressing the least squares residuals, e_t, on e_{t−1} and e_{t−2} (without a constant term and filling the first two observations with zeros). The two estimates are 0.751941 and −0.022464, respectively. With the estimates of θ₁ and θ₂, we transformed the data using y*_t = y_t − θ₁y_{t−1} − θ₂y_{t−2} and likewise for each regressor. Two observations are then discarded, so the AR(2) regression uses 50 observations while the Prais–Winsten estimator uses 52 and the Cochrane–Orcutt regression uses 51. In each case, the autocorrelation of the FGLS residuals is computed and reported in the last column of the table.

One might want to examine the residuals after estimation to ascertain whether the AR(1) model is appropriate. In the results just presented, there are two large autocorrelation coefficients listed with the residual based tests, and in computing the LM statistic, we found that the first two coefficients were statistically significant. If the AR(1) model is appropriate, then one should find that only the coefficient on the first lagged residual is statistically significant in this auxiliary, second-step regression. Another indicator is provided by the FGLS residuals, themselves. After computing the FGLS regression, the estimated residuals,

    ε̂_t = y_t − x′_t β̂,

will still be autocorrelated. In our results using the Prais–Winsten estimates, the autocorrelation of the FGLS residuals is 0.957. The associated Durbin–Watson statistic is 0.0867. This is to be expected. However, if the model is correct, then the transformed residuals

    û_t = ε̂_t − ρ̂ ε̂_{t−1}

should be at least close to nonautocorrelated. But, for our data, the autocorrelation of these adjusted residuals is only 0.292 with a Durbin–Watson statistic of 1.416. The value of d_L for one regressor (u_{t−1}) and 50 observations is 1.50. It appears on this basis that, in fact, the AR(1) model has largely completed the specification.

19.9.3 ESTIMATION WITH A LAGGED DEPENDENT VARIABLE

In Section 19.5.1, we considered the problem of estimation by least squares when the model contains both autocorrelation and lagged dependent variable(s). Because the OLS estimator is inconsistent, the residuals on which an estimator of ρ would be based are likewise inconsistent. Therefore, ρ̂ will be inconsistent as well. The consequence is that the FGLS estimators described earlier are not usable in this case. There is, however, an alternative way to proceed, based on the method of instrumental variables. The method of instrumental variables was introduced in Section 12.3. To review, the general problem is that in the regression model, if plim(1/T)X′ε ≠ 0, then the least squares estimator is not consistent. A consistent estimator is

    b_IV = (Z′X)⁻¹(Z′y),


where Z is a set of K variables chosen such that plim(1/T)Z′ε = 0 but plim(1/T)Z′X ≠ 0. For the purpose of consistency only, any such set of instrumental variables will suffice. The relevance of that here is that the obstacle to consistent FGLS, at least for the present, is the lack of a consistent estimator of ρ. By using the technique of instrumental variables, we may estimate β consistently, then estimate ρ and proceed.

Hatanaka (1974, 1976) has devised an efficient two-step estimator based on this principle. To put the estimator in the current context, we consider estimation of the model

    y_t = x′_t β + γy_{t−1} + ε_t,
    ε_t = ρε_{t−1} + u_t.

To get to the second step of FGLS, we require a consistent estimator of the slope parameters. These estimates can be obtained using an IV estimator, where the column of Z corresponding to y_{t−1} is the only one that need be different from that of X. An appropriate instrument can be obtained by using the fitted values in the regression of y_t on x_t and x_{t−1}. The residuals from the IV regression are then used to construct

    ρ̂ = Σ_{t=3}^T ε̂_t ε̂_{t−1} / Σ_{t=3}^T ε̂_t²,    (19-33)

where

    ε̂_t = y_t − b′_IV x_t − c_IV y_{t−1}.

FGLS estimates may now be computed by regressing y*_t = y_t − ρ̂y_{t−1} on

    x*_t = x_t − ρ̂x_{t−1},
    y*_{t−1} = y_{t−1} − ρ̂y_{t−2},
    ε̂_{t−1} = y_{t−1} − b′_IV x_{t−1} − c_IV y_{t−2}.

Let d be the coefficient on ε̂_{t−1} in this regression. The efficient estimator of ρ is

    ρ̂̂ = ρ̂ + d.

Appropriate asymptotic standard errors for the estimators, including ρ̂̂, are obtained from the s²[X*′X*]⁻¹ computed at the second step. Hatanaka shows that these estimators are asymptotically equivalent to maximum likelihood estimators.
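A sketch of the two steps just described, for a model with one lagged dependent variable. The array slicing does the alignment of y_{t−1}, x_{t−1}, and so on; the instrument is a fitted value of y_{t−1} constructed from current and lagged x, which is valid because it is a function of the exogenous regressors only:

```python
# A sketch of Hatanaka's two-step estimator; y is the dependent variable,
# X the exogenous regressors (including a constant).
import numpy as np

def hatanaka(y, X):
    # Step 1: IV estimates of (beta, gamma).
    y1, Xt, Xl, yl = y[1:], X[1:], X[:-1], y[:-1]       # observations t = 2..T
    W = np.column_stack([Xt, Xl])
    yhat_l = W @ np.linalg.lstsq(W, yl, rcond=None)[0]  # instrument for y_{t-1}
    Z = np.column_stack([Xt, yhat_l])                   # instruments
    A = np.column_stack([Xt, yl])                       # regressors
    biv = np.linalg.solve(Z.T @ A, Z.T @ y1)
    eps = y1 - A @ biv                                  # IV residuals, t = 2..T
    rho = (eps[1:] @ eps[:-1]) / (eps[1:] @ eps[1:])    # (19-33)
    # Step 2: OLS on quasi-differenced data plus eps_{t-1} as a regressor.
    ys = y1[1:] - rho * y1[:-1]
    R = np.column_stack([Xt[1:] - rho * Xt[:-1],
                         yl[1:] - rho * yl[:-1],
                         eps[:-1]])
    coef = np.linalg.lstsq(R, ys, rcond=None)[0]
    rho_eff = rho + coef[-1]                            # efficient estimator of rho
    return coef[:-1], rho_eff
```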

19.10 AUTOCORRELATION IN PANEL DATA

The extension of the AR(1) model to stationary panel data would mirror the procedures for a single time series. The standard model is

    y_it = x′_it β + c_i + ε_it,   ε_it = ρε_{i,t−1} + v_it,   i = 1, …, n,  t = 1, …, T_i.

[See, e.g., Baltagi (2005, Section 5.2).] The same considerations would apply to the fixed and random effects cases, so we have left the model in generic form. The practical issues are how to obtain an estimate of ρ and how to carry out an FGLS procedure.


Assuming for the moment that ρ is known, the Prais and Winsten transformation,

    y*_{i1} = (1 − ρ²)^{1/2} y_{i1},   x*_{i1} = (1 − ρ²)^{1/2} x_{i1},    (19-34)
    y*_{it} = y_{it} − ρy_{i,t−1},   x*_{it} = x_{it} − ρx_{i,t−1},

produces the transformed model

    y*_{it} = (x*_{it})′β + c*_{it} + v_it,   v_it | X_i ∼ (0, σ_v²).    (19-35)

[See (19-28).] This would seem to restore the original panel data model, save for a potentially complicated loose end. The common effect in the transformed model, c*_{it}, is treated differently for the first observation from the remaining T_i − 1, hence the necessity for the double subscript in (19-35). The resulting model is no longer a "common effect" model. Baltagi and Li (1991) have devised a full FGLS treatment for the balanced panel random effects case, including the asymmetric treatment of c_i. The method is shown in detail in Baltagi (2005, pp. 84–85). The procedure as documented can be generalized to the unbalanced case fairly easily, but overall is quite complicated, again owing to the special treatment of the first observation. FGLS estimation of the fixed effects model is more complicated yet, because there is no simple transformation comparable to differences from group means that will remove the common effect in (19-35). For least squares estimation of β in (19-35), we would have to use brute force, with c*_{it} = α_i d_{it}, where d_{it} = (1 − ρ²)^{1/2} for individual i and t = 1, and d_{it} = 1 − ρ for individual i and t > 1. (In principle, the Frisch–Waugh result could be applied group by group to transform the observations for estimation of β. However, the application would involve a different transformation for the first observation and the mean of observations 2 through T_i rather than the full group mean.) For better or worse, dropping the first observation is a practical compromise that produces a large payoff. The different approaches based on T_i − 1 observations remain consistent in n, just as in the single time-series case. The question of efficiency might be raised here as it was in an earlier literature in the time-series case [see, e.g., Maeshiro (1979)]. In a given panel, T_i may well be fairly small. However, with the assumption of common ρ, this case is likely to be much more favorable than the single time-series case, because the way the model is structured, estimation of β based on the Cochrane–Orcutt transformation becomes analogous to the random effects case with Σ_{i=1}^n (T_i − 1) observations. If n is even moderately sized, the efficiency question is likely to be a moot point.

There remains the problem of estimation of ρ. For either fixed or random effects case, the within (dummy variables) estimator produces a consistent estimator of β and a usable set of residuals, e_it, that can be used to estimate ρ with

    ρ̂ = Σ_{i=1}^n Σ_{t=2}^{T_i} e_it e_{i,t−1} / Σ_{i=1}^n Σ_{t=2}^{T_i} e²_{i,t}.

[Baltagi and Li (1991a) suggest some alternative estimators that may have better small sample properties.]
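For a balanced panel with the LSDV residuals stored as an n × T array, the estimator of ρ just given is essentially one line; a sketch:

```python
# A sketch of the panel AR(1) rho estimator from within (LSDV) residuals e,
# an array of shape (n, T) for a balanced panel.
import numpy as np

def panel_rho(e):
    num = np.sum(e[:, 1:] * e[:, :-1])   # sum over i and t = 2..T of e_it e_i,t-1
    den = np.sum(e[:, 1:]**2)
    return num / den
```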

Example 19.5  Panel Data Models with Autocorrelation
Munnell (1990) analyzed the productivity of public capital at the state level using a Cobb–Douglas production function. We will use the data from that study to estimate a log-linear regression model

    ln gsp_it = α + β₁ ln p_cap_it + β₂ ln hwy_it + β₃ ln water_it + β₄ ln util_it + β₅ ln emp_it + β₆ unemp_it + ε_it + u_i,


TABLE 19.3  Estimated Statewide Production Function

          OLS                     Fixed Effects             Random Effects FGLS
          ρ = 0       AR(1)       ρ = 0        AR(1)        ρ = 0        AR(1)
          Estimate    Estimate    Estimate     Estimate     Estimate     Estimate
          (Std.Err.ᵃ) (Std.Err.)  (Std.Err.)   (Std.Err.)   (Std.Err.)   (Std.Err.)
α          1.9260      2.1463                                2.1608       2.8226
          (0.2143)    (0.01806)                             (0.1380)     (0.1537)
β1         0.3120      0.2615      0.2350       0.04041      0.2755       0.1342
          (0.04678)   (0.01338)   (0.02621)    (0.02383)    (0.01972)    (0.01943)
β2         0.05888     0.06788     0.07675     −0.05831      0.06167      0.04585
          (0.05078)   (0.01802)   (0.03124)    (0.06101)    (0.02168)    (0.03044)
β3         0.1186      0.09225     0.0786       0.04934      0.07572      0.04879
          (0.0345)    (0.01464)   (0.0150)     (0.02098)    (0.01381)    (0.01911)
β4         0.00856    −0.006299   −0.11478     −0.07459     −0.09672     −0.07977
          (0.0406)    (0.01432)   (0.01814)    (0.02759)    (0.01683)    (0.02273)
β5         0.5497      0.6337      0.8011       1.0534       0.7450       0.08931
          (0.0677)    (0.01916)   (0.02976)    (0.03677)    (0.02482)    (0.03011)
β6        −0.00727    −0.009327   −0.005179    −0.002924    −0.005963    −0.005374
          (0.002946)  (0.00083)   (0.000980)   (0.000817)   (0.0008814)  (0.0007513)
σε         0.0854228ᵇ  0.105407ᶜ   0.0367649ᵇ   0.074687ᶜ    0.0366974ᵇ   0.074687ᶜ
σu                                                           0.0875682ᵇ   0.074380ᶜ

ᵃ Robust (cluster) standard errors in parentheses
ᵇ Based on OLS and LSDV residuals
ᶜ Based on OLS/AR(1) and LSDV/AR(1) residuals

where the variables in the model are

    gsp = gross state product,
    p_cap = public capital,
    hwy = highway capital,
    water = water utility capital,
    util = utility capital,
    pc = private capital,
    emp = employment (labor),
    unemp = unemployment rate.

In Example 9.9, we estimated the parameters of the model under the fixed and random effects assumptions. The results are repeated in Table 19.3. Using the fixed effects residuals, the estimate of ρ is 0.717897, which is quite large. We reestimated the two models using the Cochrane–Orcutt transformation. The results are shown at the right in Table 19.3. The estimates of σ_ε and σ_u in each case are obtained as 1/(1 − r²) times the estimated variances in the transformed model.

There are several strategies available for testing for serial correlation in a panel data set. [See Baltagi (2005, pp. 93–103).] In general, an obstacle to a simple test against the null hypothesis of no autocorrelation is the possible presence of a time-invariant common effect in the model,

    y_it = c_i + x′_it β + ε_it.


Under the alternative hypothesis, Corr(ε_it, ε_{i,t−1}) = ρ. Many variants of the model, based on AR(1), MA(1), and other specifications have been analyzed. We consider the first, as it is the standard framework in the absence of a specific model that suggests another process. The LM statistic is based on the within-groups (LSDV) residuals from the fixed effects specification,

    LM = [nT²/(T − 1)] [Σ_{i=1}^n Σ_{t=2}^T e_it e_{i,t−1} / Σ_{i=1}^n Σ_{t=1}^T e²_it]².

Under the null hypothesis that ρ = 0, the limiting distribution of LM is chi-squared with one degree of freedom. The Durbin–Watson statistic is obtained by omitting [nT²/(T − 1)] and replacing e_it e_{i,t−1} with (e_it − e_{i,t−1})². Bhargava et al. (1982) showed that the Durbin–Watson version of the test is locally most powerful in the neighborhood of ρ = 0. In the typical panel, the data set is large enough that the advantage over the simpler LM statistic is unlikely to be substantial. Both tests will suffer from a loss of power if the model is a random effects model. Baltagi and Li (1991a,b) have devised several test statistics that are appropriate for the random effects specification. Wooldridge (2002a) proposes an alternative approach based on first differences that will be invariant to the presence of fixed or random effects.
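A sketch of the LM statistic, again for a balanced panel of LSDV residuals stored as an n × T array:

```python
# A sketch of the panel LM test for rho = 0; compare with chi-squared[1].
import numpy as np

def panel_lm(e):
    n, T = e.shape
    ratio = np.sum(e[:, 1:] * e[:, :-1]) / np.sum(e**2)
    return (n * T**2 / (T - 1)) * ratio**2
```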

19.11 COMMON FACTORS

We saw in Example 19.2 that misspecification of an equation could create the appearance of serially correlated disturbances when, in fact, there are none. An orthodox (perhaps somewhat optimistic) purist might argue that autocorrelation is always an artifact of misspecification. Although this view might be extreme [see, e.g., Hendry (1980) for a more moderate, but still strident statement], it does suggest a useful point. It might be useful if we could examine the specification of a model statistically with this consideration in mind. The test for common factors is such a test. [See, as well, the aforementioned paper by Mizon (1995).] The assumption that the correctly specified model is

    y_t = x′_t β + ε_t,   ε_t = ρε_{t−1} + u_t,   t = 1, …, T,

implies the "reduced form,"

    M0: y_t = ρy_{t−1} + (x_t − ρx_{t−1})′β + u_t,   t = 2, …, T,

where u_t is free from serial correlation. The second of these is actually a restriction on the model

    M1: y_t = ρy_{t−1} + x′_t β + x′_{t−1} α + u_t,   t = 2, …, T,

in which, once again, u_t is a classical disturbance. The second model contains 2K + 1 parameters, but if the model is correct, then α = −ρβ and there are only K + 1 parameters and K restrictions. Both M0 and M1 can be estimated by least squares, although M0 is a nonlinear model. One might then test the restrictions of M0 using an F test. This test will be valid asymptotically, although its exact distribution in finite samples will not be precisely F. In large samples, KF will converge to a chi-squared statistic, so we use


the F distribution as usual to be conservative. There is a minor practical complication in implementing this test. Some elements of α may not be estimable. For example, if xt contains a constant term, then the one in α is unidentified. If xt contains both current and lagged values of a variable, then the one period lagged value will appear twice in M1, once in xt as the lagged value and once in xt−1 as the current value. There are other combinations that will be problematic, so the actual number of restrictions that appear in the test is reduced to the number of identified parameters in α.

Example 19.6 Test for Common Factors We will reexamine the model estimated in Section 19.9.2. The base model is

We will reexamine the model estimated in Section 19.9.2. The base model is

    ln(G_t/Pop_t) = β₁ + β₂ ln(I_t/Pop_t) + β₃ ln PG,t + β₄ ln PNC,t + β₅ ln PUC,t + β₆t + ε_t.

If the AR(1) model is appropriate for ε_t, that is, ε_t = ρε_{t−1} + u_t, then the restricted model,

    ln(G_t/Pop_t) = ρ ln(G_{t−1}/Pop_{t−1}) + β₁ + β₂[ln(I_t/Pop_t) − ρ ln(I_{t−1}/Pop_{t−1})] + β₃(ln PG,t − ρ ln PG,t−1)

    + β₄(ln PNC,t − ρ ln PNC,t−1) + β₅(ln PUC,t − ρ ln PUC,t−1) + β₆[t − ρ(t − 1)] + u_t,

with 7 free coefficients, will not significantly degrade the fit of the unrestricted model,

    ln(G_t/Pop_t) = ρ ln(G_{t−1}/Pop_{t−1}) + β₁ + α₁ + β₂ ln(I_t/Pop_t) + α₂ ln(I_{t−1}/Pop_{t−1}) + β₃ ln PG,t + α₃ ln PG,t−1

+ β4 ln PNC,t + α4 ln PNC,t−1 + β5 ln PUC,t + α5 ln PUC,t−1 + β6t + α6(t − 1) + ut ,

which has 13 coefficients. Note, however, that α₁ and α₆ are not identified [because t = (t − 1) + 1]. Thus, the common factor restriction imposes four restrictions on the model. We fit the unrestricted model [minus one constant and (t − 1)] by ordinary least squares and obtained a sum of squared residuals of 0.00737717. We fit the restricted model by nonlinear least squares, using the OLS coefficients from the base model as starting values for β and zero for ρ. (See Section 11.2.5.) The sum of squared residuals is 0.01084939. This produces an F statistic of

    F = [(0.01084939 − 0.00737717)/4] / [0.00737717/(51 − 11)] = 4.707,

which is larger than the critical value with 4 and 40 degrees of freedom of 2.606. Thus, we would conclude that the AR(1) model would not be appropriate for this specification and these data.

19.12 FORECASTING IN THE PRESENCE OF AUTOCORRELATION

For purposes of forecasting, we refer first to the transformed model,

    y*_t = x′_{*t} β + ε*_t.

Suppose that the process generating ε_t is an AR(1) and that ρ is known. This model is a classical regression model, so the results of Section 5.6 may be used. The optimal forecast of y⁰_{*T+1}, given x⁰_{T+1} and x_T (i.e., x⁰_{*T+1} = x⁰_{T+1} − ρx_T), is

    ŷ⁰_{*T+1} = x⁰′_{*T+1} β̂.


Disassembling ŷ⁰_{*T+1}, we find that

    ŷ⁰_{T+1} − ρy_T = x⁰′_{T+1} β̂ − ρx′_T β̂,

or

    ŷ⁰_{T+1} = x⁰′_{T+1} β̂ + ρ(y_T − x′_T β̂)
             = x⁰′_{T+1} β̂ + ρe_T.    (19-36)

Thus, we carry forward a proportion ρ of the estimated disturbance in the preceding period. This step can be justified by reference to

    E[ε_{T+1} | ε_T] = ρε_T.

It can also be shown that to forecast n periods ahead, we would use

    ŷ⁰_{T+n} = x⁰′_{T+n} β̂ + ρⁿ e_T.

The extension to higher-order autoregressions is direct. For a second-order model, for example,

    ŷ⁰_{T+n} = x⁰′_{T+n} β̂ + θ₁ e_{T+n−1} + θ₂ e_{T+n−2}.    (19-37)

For residuals that are outside the sample period, we use the recursion

    e_s = θ₁ e_{s−1} + θ₂ e_{s−2},    (19-38)

beginning with the last two residuals within the sample.

Moving average models are somewhat simpler, as the autocorrelation lasts for only Q periods. For an MA(1) model, for the first postsample period,

    ŷ⁰_{T+1} = x⁰′_{T+1} β̂ + ε̂_{T+1},

where

    ε̂_{T+1} = û_{T+1} − λû_T.

Therefore, a forecast of ε_{T+1} will use all previous residuals. One way to proceed is to accumulate ε̂_{T+1} from the recursion

    û_t = ε̂_t + λû_{t−1},

with û_{T+1} = û₀ = 0 and ε̂_t = y_t − x′_t β̂. After the first postsample period,

    ε̂_{T+n} = û_{T+n} − λû_{T+n−1} = 0.

If the parameters of the disturbance process are known, then the variances for the forecast errors can be computed using the results of Section 5.6. For an AR(1) disturbance, the estimated variance would be

    s_f² = σ̂_ε² + (x_t − ρx_{t−1})′ Est. Var[β̂] (x_t − ρx_{t−1}).    (19-39)

For a higher-order process, it is only necessary to modify the calculation of x*_t accordingly. The forecast variances for an MA(1) process are somewhat more involved. Details may be found in Judge et al. (1985) and Hamilton (1994). If the parameters


of the disturbance process, ρ, λ, θ_j, and so on, are estimated as well, then the forecast variance will be greater. For an AR(1) model, the necessary correction to the forecast variance of the n-period-ahead forecast error is σ̂_ε² n²ρ^{2(n−1)}/T. [For a one-period-ahead forecast, this merely adds a term, σ̂_ε²/T, in (19-39).] Higher-order AR and MA processes are analyzed in Baillie (1979). Finally, if the regressors are stochastic, the expressions become more complex by another order of magnitude.

If ρ is known, then (19-36) provides the best linear unbiased forecast of y_{t+1}.15 If, however, ρ must be estimated, then this assessment must be modified. There is information about ε_{t+1} embodied in e_t. Having to estimate ρ, however, implies that some or all the value of this information is offset by the variation introduced into the forecast by including the stochastic component ρ̂e_T.16 Whether (19-36) is preferable to the obvious expedient ŷ⁰_{T+n} = x⁰′_{T+n} β̂ in a small sample when ρ is estimated remains to be settled.
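A sketch of the n-period-ahead AR(1) forecasts in (19-36); X_future is assumed to hold the future values of the regressors, and e_T is the last within-sample residual:

```python
# A sketch of AR(1) forecasting: carry forward rho^n times the last residual.
import numpy as np

def forecast_ar1(X_future, beta, rho, e_T):
    n = np.arange(1, X_future.shape[0] + 1)   # horizons 1, 2, ...
    return X_future @ beta + rho**n * e_T
```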

19.13 AUTOREGRESSIVE CONDITIONAL HETEROSCEDASTICITY

Heteroscedasticity is often associated with cross-sectional data, whereas time series are usually studied in the context of homoscedastic processes. In analyses of macroeconomic data, Engle (1982, 1983) and Cragg (1982) found evidence that for some kinds of data, the disturbance variances in time-series models were less stable than usually assumed. Engle's results suggested that in models of inflation, large and small forecast errors appeared to occur in clusters, suggesting a form of heteroscedasticity in which the variance of the forecast error depends on the size of the previous disturbance. He suggested the autoregressive, conditionally heteroscedastic, or ARCH, model as an alternative to the usual time-series process. More recent studies of financial markets suggest that the phenomenon is quite common. The ARCH model has proven to be useful in studying the volatility of inflation [Coulson and Robins (1985)], the term structure of interest rates [Engle, Hendry, and Trumble (1985)], the volatility of stock market returns [Engle, Lilien, and Robins (1987)], and the behavior of foreign exchange markets [Domowitz and Hakkio (1985) and Bollerslev and Ghysels (1996)], to name but a few. This section will describe specification, estimation, and testing in the basic formulations of the ARCH model and some extensions.17

Example 19.7  Stochastic Volatility
Figure 19.5 shows Bollerslev and Ghysels's 1,974 daily observations on the percentage nominal return for the Deutschmark/Pound exchange rate. (These data are given in Appendix Table F19.1.) The variation in the series appears to be fluctuating, with several clusters of large and small movements.

15 See Goldberger (1962).
16 See Baillie (1979).
17 Engle and Rothschild (1992) give a survey of this literature which describes many extensions. Mills (1993) also presents several applications. See, as well, Bollerslev (1986) and Li, Ling, and McAleer (2001). See McCullough and Renfro (1999) for discussion of estimation of this model.


FIGURE 19.5  Nominal Exchange Rate Returns.

19.13.1 THE ARCH(1) MODEL

The simplest form of this model is the ARCH(1) model,

    y_t = x′_t β + ε_t,
    ε_t = u_t √(α₀ + α₁ε²_{t−1}),    (19-40)

where u_t is distributed as standard normal.18 It follows that E[ε_t | x_t, ε_{t−1}] = 0, so that E[ε_t | x_t] = 0 and E[y_t | x_t] = x′_t β. Therefore, this model is a classical regression model. But

    Var[ε_t | ε_{t−1}] = E[ε_t² | ε_{t−1}] = E[u_t²](α₀ + α₁ε²_{t−1}) = α₀ + α₁ε²_{t−1},

so ε_t is conditionally heteroscedastic, not with respect to x_t as we considered in Chapter 8, but with respect to ε_{t−1}. The unconditional variance of ε_t is

    Var[ε_t] = Var[E[ε_t | ε_{t−1}]] + E[Var[ε_t | ε_{t−1}]] = α₀ + α₁E[ε²_{t−1}] = α₀ + α₁Var[ε_{t−1}].

If the process generating the disturbances is weakly (covariance) stationary (see Definition 19.2),19 then the unconditional variance is not changing over time, so

    Var[ε_t] = Var[ε_{t−1}] = α₀ + α₁Var[ε_{t−1}] = α₀/(1 − α₁).

For this ratio to be finite and positive, |α₁| must be less than 1. Then, unconditionally, ε_t is distributed with mean zero and variance σ² = α₀/(1 − α₁). Therefore, the model

18 The assumption that u_t has unit variance is not a restriction. The scaling implied by any other variance would be absorbed by the other parameters.
19 This discussion will draw on the results and terminology of time-series analysis in Section 19.3 and Chapter 21. The reader may wish to peruse this material at this point.


obeys the classical assumptions, and ordinary least squares is the most efficient linear unbiased estimator of β. But there is a more efficient nonlinear estimator. The log-likelihood function for this model is given by Engle (1982). Conditioned on starting values y0 and x0 (and ε0), the conditional log-likelihood for observations t = 1,...,T is the one we examined in Section 16.9.2.a for the general heteroscedastic regression model [see (16-52)],

    ln L = −(T/2) ln(2π) − (1/2) Σ_{t=1}^T ln(α₀ + α₁ε²_{t−1}) − (1/2) Σ_{t=1}^T [ε_t²/(α₀ + α₁ε²_{t−1})],   ε_t = y_t − β′x_t.    (19-41)

Maximization of log L can be done with the conventional methods, as discussed in Appendix E.20
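A sketch that simulates (19-40) with the illustrative values α₀ = 0.5 and α₁ = 0.4 (so the unconditional variance is 0.5/0.6) and evaluates the conditional log-likelihood (19-41) for the zero-mean case:

```python
# A sketch of ARCH(1) simulation and the conditional log-likelihood (19-41).
import numpy as np

rng = np.random.default_rng(1)
a0, a1, T = 0.5, 0.4, 20_000
eps = np.zeros(T)
for t in range(1, T):
    # eps_t = u_t * sqrt(a0 + a1 * eps_{t-1}^2), u_t ~ N[0, 1]   -- (19-40)
    eps[t] = rng.standard_normal() * np.sqrt(a0 + a1 * eps[t - 1]**2)
print(eps.var(), a0 / (1 - a1))   # sample variance is near alpha0/(1 - alpha1)

def arch1_loglike(eps, a0, a1):
    # (19-41) with beta = 0, conditioned on the first observation.
    s2 = a0 + a1 * eps[:-1]**2
    return -0.5 * np.sum(np.log(2 * np.pi) + np.log(s2) + eps[1:]**2 / s2)
```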

19.13.2 ARCH(q), ARCH-IN-MEAN, AND GENERALIZED ARCH MODELS

The natural extension of the ARCH(1) model presented before is a more general model with longer lags. The ARCH(q) process,

    σ_t² = α₀ + α₁ε²_{t−1} + α₂ε²_{t−2} + ··· + α_q ε²_{t−q},

is a qth order moving average [MA(q)] process. (Much of the analysis of the model parallels the results in Chapter 21 for more general time-series models.) [Once again, see Engle (1982).] This section will generalize the ARCH(q) model, as suggested by Bollerslev (1986), in the direction of the autoregressive-moving average (ARMA) models of Section 21.2.1. The discussion will parallel his development, although many details are omitted for brevity. The reader is referred to that paper for background and for some of the less critical details.

Among the many variants of the capital asset pricing model (CAPM) is an intertemporal formulation by Merton (1980) that suggests an approximate linear relationship between the return and variance of the market portfolio. One of the possible flaws in this model is its assumption of a constant variance of the market portfolio. In this connection, then, the ARCH-in-Mean, or ARCH-M, model suggested by Engle, Lilien, and Robins (1987) is a natural extension. The model states that

    y_t = x′_t β + δσ_t² + ε_t,

    Var[ε_t | Ψ_t] = ARCH(q).

Among the interesting implications of this modification of the standard model is that under certain assumptions, δ is the coefficient of relative risk aversion. The ARCH-M model has been applied in a wide variety of studies of volatility in asset returns, including the daily Standard and Poor's Index [French, Schwert, and Stambaugh (1987)] and

20 Engle (1982) and Judge et al. (1985, pp. 441–444) suggest a four-step procedure based on the method of scoring that resembles the two-step method we used for the multiplicative heteroscedasticity model in Section 8.8.1. However, the full MLE is now incorporated in most modern software, so the simple regression based methods, which are difficult to generalize, are less attractive in the current literature. But, see McCullough and Renfro (1999) and Fiorentini, Calzolari, and Panattoni (1996) for commentary and some cautions related to maximum likelihood estimation.


weekly New York Stock Exchange returns [Chou (1988)]. A lengthy list of applications is given in Bollerslev, Chou, and Kroner (1992).

The ARCH-M model has several noteworthy statistical characteristics. Unlike the standard regression model, misspecification of the variance function does affect the consistency of estimators of the parameters of the mean. [See Pagan and Ullah (1988) for formal analysis of this point.] Recall that in the classical regression setting, weighted least squares is consistent even if the weights are misspecified as long as the weights are uncorrelated with the disturbances. That is not true here. If the ARCH part of the model is misspecified, then conventional estimators of β and δ will not be consistent. Bollerslev, Chou, and Kroner (1992) list a large number of studies that called into question the specification of the ARCH-M model, and they subsequently obtained quite different results after respecifying the model. A closely related practical problem is that the mean and variance parameters in this model are no longer uncorrelated. In analysis up to this point, we made quite profitable use of the block diagonality of the Hessian of the log-likelihood function for the model of heteroscedasticity. But the Hessian for the ARCH-M model is not block diagonal. In practical terms, the estimation problem cannot be segmented as we have done previously with the heteroscedastic regression model. All the parameters must be estimated simultaneously.

The model of generalized autoregressive conditional heteroscedasticity (GARCH) is defined as follows.21 The underlying regression is the usual one in (19-40). Conditioned on an information set at time t, denoted Ψ_t, the distribution of the disturbance is assumed to be

    ε_t | Ψ_t ∼ N[0, σ_t²],

where the conditional variance is

    σ_t² = α₀ + δ₁σ²_{t−1} + δ₂σ²_{t−2} + ··· + δ_p σ²_{t−p} + α₁ε²_{t−1} + α₂ε²_{t−2} + ··· + α_q ε²_{t−q}.    (19-42)

Define

    z_t = [1, σ²_{t−1}, σ²_{t−2}, …, σ²_{t−p}, ε²_{t−1}, ε²_{t−2}, …, ε²_{t−q}]′

and

    γ = [α₀, δ₁, δ₂, …, δ_p, α₁, …, α_q]′ = [α₀, δ′, α′]′.

Then

    σ_t² = γ′z_t.

Notice that the conditional variance is defined by an autoregressive-moving average [ARMA(p, q)] process in the innovations ε_t², exactly as in Section 21.2.1. The difference here is that the mean of the random variable of interest y_t is described completely by a heteroscedastic, but otherwise ordinary, regression model. The conditional variance, however, evolves over time in what might be a very complicated manner, depending on the parameter values and on p and q. The model in (19-42) is a GARCH(p, q) model,

21 As have most areas in time-series econometrics, the line of literature on GARCH models has progressed rapidly in recent years and will surely continue to do so. We have presented Bollerslev's model in some detail, despite many recent extensions, not only to introduce the topic as a bridge to the literature, but also because it provides a convenient and interesting setting in which to discuss several related topics such as double-length regression and pseudo–maximum likelihood estimation.


where p refers, as before, to the order of the autoregressive part.22 As Bollerslev (1986) demonstrates with an example, the virtue of this approach is that a GARCH model with a small number of terms appears to perform as well as or better than an ARCH model with many.

The stationarity conditions discussed in Section 21.2.2 are important in this context to ensure that the moments of the normal distribution are finite. The reason is that higher moments of the normal distribution are finite powers of the variance. A normal distribution with variance σ_t² has fourth moment 3σ_t⁴, sixth moment 15σ_t⁶, and so on. [The precise relationship of the even moments of the normal distribution to the variance is μ_{2k} = (σ²)^k (2k)!/(k! 2^k).] Simply ensuring that σ_t² is stable does not ensure that higher powers are as well.23 Bollerslev presents a useful figure that shows the conditions needed to ensure stability for moments up to order 12 for a GARCH(1, 1) model and gives some additional discussion. For example, for a GARCH(1, 1) process, for the fourth moment to exist, 3α₁² + 2α₁δ₁ + δ₁² must be less than 1.

It is convenient to write (19-42) in terms of polynomials in the lag operator (see Section 20.2.2):

    σ_t² = α₀ + D(L)σ_t² + A(L)ε_t².

As discussed in Section 20.2.2, the stationarity condition for such an equation is that the roots of the characteristic equation, 1 − D(z) = 0, must lie outside the unit circle. For the present, we will assume that this case is true for the model we are considering and that A(1) + D(1) < 1. [This assumption is stronger than that needed to ensure stationarity in a higher-order autoregressive model, which would depend only on D(L).] The implication is that the GARCH process is covariance stationary with E[ε_t] = 0 (unconditionally), Var[ε_t] = α₀/[1 − A(1) − D(1)], and Cov[ε_t, ε_s] = 0 for all t ≠ s. Thus, unconditionally the model is the classical regression model that we examined in Chapters 2–7.

The usefulness of the GARCH specification is that it allows the variance to evolve over time in a way that is much more general than the simple specification of the ARCH model. The comparison between simple finite-distributed lag models and the dynamic regression model discussed in Chapter 20 is analogous. For the example discussed in his paper, Bollerslev reports that although Engle and Kraft's (1983) ARCH(8) model for the rate of inflation in the GNP deflator appears to remove all ARCH effects, a closer look reveals GARCH effects at several lags. By fitting a GARCH(1, 1) model to the same data, Bollerslev finds that the ARCH effects out to the same eight-period lag as fit by Engle and Kraft and his observed GARCH effects are all satisfactorily accounted for.
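For the GARCH(1, 1) case, the recursion in (19-42) is easy to trace directly; this sketch initializes the sequence at the unconditional variance, which requires α₁ + δ₁ < 1:

```python
# A sketch of the GARCH(1, 1) variance recursion in (19-42):
# sigma_t^2 = a0 + d1 * sigma_{t-1}^2 + a1 * eps_{t-1}^2.
import numpy as np

def garch11_variances(eps, a0, a1, d1):
    s2 = np.empty(len(eps))
    s2[0] = a0 / (1.0 - a1 - d1)     # unconditional variance; needs a1 + d1 < 1
    for t in range(1, len(eps)):
        s2[t] = a0 + d1 * s2[t - 1] + a1 * eps[t - 1]**2
    return s2
```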

19.13.3 MAXIMUM LIKELIHOOD ESTIMATION OF THE GARCH MODEL

Bollerslev describes a method of estimation based on the BHHH algorithm. As he shows, the method is relatively simple, although with the line search and first derivative method that he suggests, it probably involves more computation and more iterations

22We have changed Bollerslev's notation slightly so as not to conflict with our previous presentation. He used β instead of our δ in (19-42) and b instead of our β in (19-40).

23The conditions cannot be imposed a priori. In fact, there is no nonzero set of parameters that guarantees stability of all moments, even though the normal distribution has finite moments of all orders. As such, the normality assumption must be viewed as an approximation.


than necessary. Following the suggestions of Harvey (1976), it turns out that there is a simpler way to estimate the GARCH model that is also very illuminating. This model is actually very similar to the more conventional model of multiplicative heteroscedasticity that we examined in Section 16.9.2.b.

For normally distributed disturbances, the log-likelihood for a sample of T observations is24

$$\ln L = \sum_{t=1}^{T} -\frac{1}{2}\left[\ln(2\pi) + \ln\sigma_t^2 + \frac{\varepsilon_t^2}{\sigma_t^2}\right] = \sum_{t=1}^{T}\ln f_t(\boldsymbol{\theta}) = \sum_{t=1}^{T} l_t(\boldsymbol{\theta}),$$

where $\varepsilon_t = y_t - \mathbf{x}_t'\boldsymbol{\beta}$ and $\boldsymbol{\theta} = (\boldsymbol{\beta}', \alpha_0, \boldsymbol{\alpha}', \boldsymbol{\delta}')' = (\boldsymbol{\beta}', \boldsymbol{\gamma}')'$. Derivatives of ln L are obtained by summation. Let $l_t$ denote $\ln f_t(\boldsymbol{\theta})$. The first derivatives with respect to the variance parameters are

$$\frac{\partial l_t}{\partial\boldsymbol{\gamma}} = -\frac{1}{2}\left[\frac{1}{\sigma_t^2} - \frac{\varepsilon_t^2}{(\sigma_t^2)^2}\right]\frac{\partial\sigma_t^2}{\partial\boldsymbol{\gamma}} = \frac{1}{2}\left(\frac{1}{\sigma_t^2}\right)\frac{\partial\sigma_t^2}{\partial\boldsymbol{\gamma}}\left[\frac{\varepsilon_t^2}{\sigma_t^2} - 1\right] = \frac{1}{2}\left(\frac{1}{\sigma_t^2}\right)\mathbf{g}_t v_t = \mathbf{b}_t v_t, \quad (19\text{-}43)$$

where $\mathbf{g}_t = \partial\sigma_t^2/\partial\boldsymbol{\gamma}$, $v_t = \varepsilon_t^2/\sigma_t^2 - 1$, and $\mathbf{b}_t = \mathbf{g}_t/(2\sigma_t^2)$.

Note that E [vt ] = 0. Suppose, for now, that there are no regression parameters. Newton’s method for estimating the variance parameters would be

$$\hat{\boldsymbol{\gamma}}^{\,i+1} = \hat{\boldsymbol{\gamma}}^{\,i} - \mathbf{H}^{-1}\mathbf{g}, \quad (19\text{-}44)$$

where H indicates the Hessian and g is the first derivatives vector. Following Harvey's suggestion (see Section 16.9.2.a), we will use the method of scoring instead. To do this, we make use of $E[v_t] = 0$ and $E[\varepsilon_t^2/\sigma_t^2] = 1$. After taking expectations in (19-43), the iteration reduces to a linear regression of $v_{*t} = (1/\sqrt{2})v_t$ on regressors $\mathbf{w}_{*t} = (1/\sqrt{2})\mathbf{g}_t/\sigma_t^2$. That is,

$$\hat{\boldsymbol{\gamma}}^{\,i+1} = \hat{\boldsymbol{\gamma}}^{\,i} + [\mathbf{W}_*'\mathbf{W}_*]^{-1}\mathbf{W}_*'\mathbf{v}_* = \hat{\boldsymbol{\gamma}}^{\,i} + [\mathbf{W}_*'\mathbf{W}_*]^{-1}\left[\frac{\partial\ln L}{\partial\boldsymbol{\gamma}}\right], \quad (19\text{-}45)$$

where row t of $\mathbf{W}_*$ is $\mathbf{w}_{*t}'$. The iteration has converged when the slope vector is zero, which happens when the first derivative vector is zero. When the iterations are complete, the estimated asymptotic covariance matrix is simply

$$\text{Est. Asy. Var}[\hat{\boldsymbol{\gamma}}] = [\hat{\mathbf{W}}_*'\hat{\mathbf{W}}_*]^{-1}$$

based on the estimated parameters. The usefulness of the result just given is that $E[\partial^2\ln L/\partial\boldsymbol{\gamma}\,\partial\boldsymbol{\beta}']$ is, in fact, zero. Because the expected Hessian is block diagonal, applying the method of scoring to the full parameter vector can proceed in two parts, exactly as it did in Section 16.9.2.a for the multiplicative heteroscedasticity model. That is, the updates for the mean and variance parameter vectors can be computed separately. Consider then the slope parameters, β.

24There are three minor errors in Bollerslev's derivation that we note here to avoid the apparent inconsistencies. In his (22), $\frac{1}{2}h_t^{-1}$ should be $\frac{1}{2}h_t^{-2}$. In (23), $2h_t^{-2}$ should be $h_t^{-2}$. In (28), $h\,\partial h/\partial\omega$ should, in each case, be $(1/h)\partial h/\partial\omega$. [In his (8), $\alpha_0\alpha_1$ should be $\alpha_0 + \alpha_1$, but this has no implications for our derivation.]


The same type of modified scoring method as used earlier produces the iteration

$$\hat{\boldsymbol{\beta}}^{\,i+1} = \hat{\boldsymbol{\beta}}^{\,i} + \left[\sum_{t=1}^{T}\left(\frac{\mathbf{x}_t\mathbf{x}_t'}{\sigma_t^2} + \frac{1}{2}\frac{\mathbf{d}_t\mathbf{d}_t'}{(\sigma_t^2)^2}\right)\right]^{-1}\left[\sum_{t=1}^{T}\left(\frac{\mathbf{x}_t\varepsilon_t}{\sigma_t^2} + \frac{1}{2}\frac{\mathbf{d}_t v_t}{\sigma_t^2}\right)\right]$$
$$= \hat{\boldsymbol{\beta}}^{\,i} + \left[\sum_{t=1}^{T}\left(\frac{\mathbf{x}_t\mathbf{x}_t'}{\sigma_t^2} + \frac{1}{2}\frac{\mathbf{d}_t\mathbf{d}_t'}{(\sigma_t^2)^2}\right)\right]^{-1}\frac{\partial\ln L}{\partial\boldsymbol{\beta}} \quad (19\text{-}46)$$
$$= \hat{\boldsymbol{\beta}}^{\,i} + \mathbf{h}^i,$$

which has been referred to as a double-length regression. [See Orme (1990) and Davidson and MacKinnon (1993, Chapter 14).] The update vector $\mathbf{h}^i$ is the vector of slopes in an augmented or double-length generalized regression,

$$\mathbf{h}^i = [\mathbf{C}'\boldsymbol{\Lambda}^{-1}\mathbf{C}]^{-1}[\mathbf{C}'\boldsymbol{\Lambda}^{-1}\mathbf{a}], \quad (19\text{-}47)$$

where C is a 2T × K matrix whose first T rows are the X from the original regression model and whose next T rows are $(1/\sqrt{2})\,\mathbf{d}_t'/\sigma_t^2$, t = 1, ..., T; a is a 2T × 1 vector whose first T elements are $\varepsilon_t$ and whose next T elements are $(1/\sqrt{2})\,v_t$, t = 1, ..., T; and $\boldsymbol{\Lambda}$ is a diagonal matrix with $\sigma_t^2$ in positions 1, ..., T and ones below observation T. At convergence, $[\mathbf{C}'\boldsymbol{\Lambda}^{-1}\mathbf{C}]^{-1}$ provides the asymptotic covariance matrix for the MLE. The resemblance to the familiar result for the generalized regression model is striking, but note that this result is based on the double-length regression.

The iteration is done simply by computing the update vectors to the current parameters as defined earlier.25 An important consideration is that to apply the scoring method, the estimates of β and γ are updated simultaneously. That is, one does not use the updated estimate of γ in (19-45) to update the weights for the GLS regression to compute the new β in (19-46). The same estimates (the results of the prior iteration) are used on the right-hand sides of both (19-45) and (19-46). The remaining problem is to obtain starting values for the iterations. One obvious choice is b, the OLS estimator, for β, $\mathbf{e}'\mathbf{e}/T = s^2$ for $\alpha_0$, and zero for all the remaining parameters. The OLS slope vector will be consistent under all specifications. A useful alternative in this context would be to start α at the vector of slopes in the least squares regression of $e_t^2$, the squared OLS residual, on a constant and q lagged values.26 As discussed later, an LM test for the presence of GARCH effects is then a by-product of the first iteration. In principle, the updated result of the first iteration is an efficient two-step estimator of all the parameters. But having gone to the full effort to set up the iterations, nothing is gained by not iterating to convergence. One virtue of allowing the procedure to iterate to convergence is that the resulting log-likelihood function can be used in likelihood ratio tests.
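As a concrete illustration of the variance-parameter iteration in (19-45), the following minimal sketch codes one method-of-scoring step for a GARCH(1, 1) model. It assumes a given residual series eps; the startup value for $\sigma_0^2$ and all names are illustrative choices, not part of the text.

```python
import numpy as np

def scoring_step(gamma, eps):
    """One step of (19-45): regress v*_t = v_t/sqrt(2) on w*_t = g_t/(sqrt(2)*sigma2_t),
    where g_t = d sigma2_t / d gamma and gamma = (alpha0, alpha1, delta)."""
    a0, a1, d = gamma
    T = len(eps)
    s2 = np.empty(T)            # sigma2_t
    g = np.zeros((T, 3))        # g_t, built by the recursions below
    s2[0] = eps.var()           # crude startup value (an assumption)
    g[0, 0] = 1.0
    for t in range(1, T):
        s2[t] = a0 + a1 * eps[t - 1]**2 + d * s2[t - 1]
        g[t, 0] = 1.0 + d * g[t - 1, 0]               # d s2_t / d alpha0
        g[t, 1] = eps[t - 1]**2 + d * g[t - 1, 1]     # d s2_t / d alpha1
        g[t, 2] = s2[t - 1] + d * g[t - 1, 2]         # d s2_t / d delta
    v = eps**2 / s2 - 1.0
    W = g / (np.sqrt(2.0) * s2[:, None])              # rows w*_t
    vstar = v / np.sqrt(2.0)
    step, *_ = np.linalg.lstsq(W, vstar, rcond=None)  # (W'W)^{-1} W'v*
    return gamma + step
```

Iterating this update (together with the companion β update in (19-46), using the prior iteration's estimates on both right-hand sides) continues until the slope vector is zero.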

19.13.4 TESTING FOR GARCH EFFECTS

The preceding development appears fairly complicated. In fact, it is not, because at each step, nothing more than a linear least squares regression is required. The intricate part

25See Fiorentini et al. (1996) on computation of derivatives in GARCH models.

26A test for the presence of q ARCH effects against none can be carried out by carrying TR² from this regression into a table of critical values for the chi-squared distribution. But in the presence of GARCH effects, this procedure loses its validity.


of the computation is setting up the derivatives. On the other hand, it does take a fair amount of programming to get this far.27 As Bollerslev suggests, it might be useful to test for GARCH effects first. The simplest approach is to examine the squares of the least squares residuals. The autocorrelations (correlations with lagged values) of the squares of the residuals provide evidence about ARCH effects. An LM test of ARCH(q) against the hypothesis of no ARCH effects [ARCH(0), the classical model] can be carried out by computing $\chi^2 = TR^2$ in the regression of $e_t^2$ on a constant and q lagged values. Under the null hypothesis of no ARCH effects, the statistic has a limiting chi-squared distribution with q degrees of freedom. Values larger than the critical table value give evidence of the presence of ARCH (or GARCH) effects.

Bollerslev suggests a Lagrange multiplier statistic that is, in fact, surprisingly simple to compute. The LM test for GARCH(p, 0) against GARCH(p, q) can be carried out by referring T times the R² in the linear regression defined in (19-45) to the chi-squared critical value with q degrees of freedom. There is, unfortunately, an indeterminacy in this test procedure. The test for ARCH(q) against GARCH(p, q) is exactly the same as that for ARCH(p) against ARCH(p + q). For carrying out the test, one can use as starting values a set of estimates that includes δ = 0 and any consistent estimators for β and α. Then TR² for the regression at the initial iteration provides the test statistic.28 A number of recent papers have questioned the use of test statistics based solely on normality. Wooldridge (1991) is a useful summary with several examples.
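The TR² test described above is a one-line regression computation. A minimal sketch, assuming e holds the OLS residuals (the function name and data names are illustrative):

```python
import numpy as np
from scipy import stats

def arch_lm_test(e, q):
    """LM test of ARCH(q): TR^2 from regressing e_t^2 on a constant and q lags."""
    e2 = np.asarray(e)**2
    y = e2[q:]
    X = np.column_stack([np.ones(len(y))] +
                        [e2[q - j:-j] for j in range(1, q + 1)])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    r2 = 1.0 - resid.var() / y.var()        # R^2 of the auxiliary regression
    TR2 = len(y) * r2
    return TR2, stats.chi2.sf(TR2, df=q)    # statistic and p-value
```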

Example 19.8 GARCH Model for Exchange Rate Volatility Bollerslev and Ghysels analyzed the exchange rate data in Example 19.7 using a GARCH(1, 1) model,

yt = μ + εt ,

$$E[\varepsilon_t \mid \varepsilon_{t-1}] = 0,$$
$$\mathrm{Var}[\varepsilon_t \mid \varepsilon_{t-1}] = \sigma_t^2 = \alpha_0 + \alpha_1\varepsilon_{t-1}^2 + \delta\sigma_{t-1}^2.$$

The least squares residuals for this model are simply $e_t = y_t - \bar{y}$. Regression of the squares of these residuals on a constant and 10 lagged squared values using observations 11–1974 produces an $R^2 = 0.09795$. With T = 1964, the chi-squared statistic is 192.37, which is larger than the critical value from the table of 18.31. We conclude that there is evidence of GARCH effects in these residuals. The maximum likelihood estimates of the GARCH model are given in Table 19.4. Note the resemblance between the OLS unconditional variance (0.221128) and the estimated equilibrium variance from the GARCH model, 0.2631.

27Because this procedure is available as a preprogrammed procedure in many computer programs, including TSP, E-Views, Stata, RATS, LIMDEP, and Shazam, this warning might itself be overstated.

28Bollerslev argues that in view of the complexity of the computations involved in estimating the GARCH model, it is useful to have a test for GARCH effects. This case is one (as are many other maximum likelihood problems) in which the apparatus for carrying out the test is the same as that for estimating the model. Having computed the LM statistic for GARCH effects, one can proceed to estimate the model just by allowing the program to iterate to convergence. There is no additional cost beyond waiting for the answer.


TABLE 19.4 Maximum Likelihood Estimates of a GARCH(1, 1) Model29

                 μ          α0         α1        δ       α0/(1 − α1 − δ)
Estimate     −0.006190    0.01076    0.1531    0.8060         0.2631
Std. Error    0.00873     0.00312    0.0273    0.0302         0.594
t ratio      −0.709       3.445      5.605    26.731          0.443

ln L = −1106.61,  ln L_OLS = −1311.09,  ȳ = −0.01642,  s² = 0.221128

19.13.5 PSEUDO–MAXIMUM LIKELIHOOD ESTIMATION

We now consider an implication of nonnormality of the disturbances. Suppose that the assumption of normality is weakened to only

$$E[\varepsilon_t \mid \Psi_t] = 0, \qquad E\left[\frac{\varepsilon_t^2}{\sigma_t^2}\,\Big|\,\Psi_t\right] = 1, \qquad E\left[\frac{\varepsilon_t^4}{\sigma_t^4}\,\Big|\,\Psi_t\right] = \kappa < \infty,$$

where $\sigma_t^2$ is as defined earlier. Now the normal log-likelihood function is inappropriate. In this case, the nonlinear (ordinary or weighted) least squares estimator would have the properties discussed in Chapter 11. It would be more difficult to compute than the MLE discussed earlier, however. It has been shown [see White (1982a) and Weiss (1982)] that the pseudo-MLE obtained by maximizing the same log-likelihood as if it were correct produces a consistent estimator despite the misspecification.30 The asymptotic covariance matrices for the parameter estimators must be adjusted, however.

The general result for cases such as this one [see Gourieroux, Monfort, and Trognon (1984)] is that the appropriate asymptotic covariance matrix for the pseudo-MLE of a parameter vector θ would be

$$\text{Asy. Var}[\hat{\boldsymbol{\theta}}] = \mathbf{H}^{-1}\mathbf{F}\mathbf{H}^{-1}, \quad (19\text{-}48)$$

where

$$\mathbf{H} = -E\left[\frac{\partial^2\ln L}{\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}'}\right]$$

and

$$\mathbf{F} = E\left[\left(\frac{\partial\ln L}{\partial\boldsymbol{\theta}}\right)\left(\frac{\partial\ln L}{\partial\boldsymbol{\theta}}\right)'\right]$$

(i.e., the BHHH estimator), and ln L is the used but inappropriate log-likelihood function. For current purposes, H and F are still block diagonal, so we can treat the mean and variance parameters separately. In addition, $E[v_t]$ is still zero, so the second derivative terms in both blocks are quite simple. (The parts involving $\partial^2\sigma_t^2/\partial\boldsymbol{\gamma}\,\partial\boldsymbol{\gamma}'$ and $\partial^2\sigma_t^2/\partial\boldsymbol{\beta}\,\partial\boldsymbol{\beta}'$ fall out of the expectation.) Taking expectations and inserting the parts produces the

29These data have become a standard data set for the evaluation of software for estimating GARCH models. The values given are the benchmark estimates. Standard errors differ substantially from one method to the next. Those given are the Bollerslev and Wooldridge (1992) results. See McCullough and Renfro (1999).

30White (1982a) gives some additional requirements for the true underlying density of εt. Gourieroux, Monfort, and Trognon (1984) also consider the issue. Under the assumptions given, the expectations of the matrices in (19-42) and (19-47) remain the same as under normality. The consistency and asymptotic normality of the pseudo-MLE can be argued under the logic of GMM estimators.


corrected asymptotic covariance matrix for the variance parameters:

$$\text{Asy. Var}[\hat{\boldsymbol{\gamma}}_{PMLE}] = [\mathbf{W}_*'\mathbf{W}_*]^{-1}\mathbf{B}'\mathbf{B}[\mathbf{W}_*'\mathbf{W}_*]^{-1},$$

where the rows of $\mathbf{W}_*$ are defined in (19-45) and those of $\mathbf{B}$ are in (19-43). For the slope parameters, the adjusted asymptotic covariance matrix would be

$$\text{Asy. Var}[\hat{\boldsymbol{\beta}}_{PMLE}] = [\mathbf{C}'\boldsymbol{\Lambda}^{-1}\mathbf{C}]^{-1}\left[\sum_{t=1}^{T}\mathbf{b}_t\mathbf{b}_t'\right][\mathbf{C}'\boldsymbol{\Lambda}^{-1}\mathbf{C}]^{-1},$$

where the outer matrix is defined in (19-47) and, from the first derivatives given in (19-43) and (19-46),31

$$\mathbf{b}_t = \frac{\mathbf{x}_t\varepsilon_t}{\sigma_t^2} + \frac{1}{2}\left(\frac{v_t}{\sigma_t^2}\right)\mathbf{d}_t.$$
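In matrix terms, the correction in (19-48) is mechanical once the per-observation scores and the Hessian are available. A minimal sketch (the inputs are assumed to be supplied by the estimation routine; names are illustrative):

```python
import numpy as np

def sandwich_cov(scores, hessian):
    """Asy. Var = H^{-1} F H^{-1}, with F the BHHH sum of outer products of scores.
    scores: T x K matrix of per-observation gradients; hessian: K x K matrix."""
    H_inv = np.linalg.inv(hessian)
    F = scores.T @ scores
    return H_inv @ F @ H_inv
```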

19.14 SUMMARY AND CONCLUSIONS

This chapter has examined the generalized regression model with serial correlation in the disturbances. We began with some general results on analysis of time-series data. When we consider dependent observations and serial correlation, the laws of large numbers and central limit theorems used to analyze independent observations no longer suffice. We presented some useful tools that extend these results to time-series settings. We then considered estimation and testing in the presence of autocorrelation. As usual, OLS is consistent but inefficient. The Newey–West estimator is a robust estimator for the asymptotic covariance matrix of the OLS estimator. This pair of estimators also constitutes the GMM estimator for the regression model with autocorrelation. We then considered two-step feasible generalized least squares and maximum likelihood estimation for the special case usually analyzed by practitioners, the AR(1) model. The model with a correction for autocorrelation is a restriction on a more general model with lagged values of both dependent and independent variables. We considered a means of testing this specification as an alternative to "fixing" the problem of autocorrelation. The final section, on ARCH and GARCH effects, described an extension of the models of autoregression to the conditional variance of ε as opposed to the conditional mean. This model embodies elements of both autocorrelation and heteroscedasticity. The set of methods plays a fundamental role in the modern analysis of volatility in financial data.

Key Terms and Concepts

• AR(1)
• ARCH
• ARCH-in-mean
• Asymptotic negligibility
• Asymptotic normality
• Autocorrelation
• Autocorrelation matrix
• Autocovariance
• Autocovariance matrix
• Autoregressive form
• Autoregressive processes
• Cochrane–Orcutt estimator
• Common factor model
• Covariance stationarity
• Double-length regression

31McCullough and Renfro (1999) examined several approaches to computing an appropriate asymptotic covariance matrix for the GARCH model, including the conventional Hessian and BHHH estimators and three sandwich-style estimators, including the one suggested earlier and two based on the method of scoring suggested by Bollerslev and Wooldridge (1992). None stand out as obviously better, but the Bollerslev and Wooldridge QMLE estimator based on an actual Hessian appears to perform well in Monte Carlo studies.


• Durbin–Watson test
• Efficient two-step estimator
• Ergodicity
• Ergodic theorem
• Expectations-augmented Phillips curve
• First-order autoregression
• GARCH
• GMM estimator
• Initial conditions
• Innovation
• Lagrange multiplier test
• Martingale sequence
• Martingale difference sequence
• Moving average form
• Moving-average process
• Newey–West robust covariance matrix estimator
• Partial difference
• Prais–Winsten estimator
• Pseudo-differences
• Pseudo-MLE
• Q test
• Quasi differences
• Random walk
• Stationarity
• Stationarity conditions
• Summability
• Time-series process
• Time window
• Weakly stationary
• White noise
• Yule–Walker equations

Exercises

1. Does first differencing reduce autocorrelation? Consider the models $y_t = \boldsymbol{\beta}'\mathbf{x}_t + \varepsilon_t$, where $\varepsilon_t = \rho\varepsilon_{t-1} + u_t$ and $\varepsilon_t = u_t - \lambda u_{t-1}$. Compare the autocorrelation of $\varepsilon_t$ in the original model with that of $v_t$ in $y_t - y_{t-1} = \boldsymbol{\beta}'(\mathbf{x}_t - \mathbf{x}_{t-1}) + v_t$, where $v_t = \varepsilon_t - \varepsilon_{t-1}$.
2. Derive the disturbance covariance matrix for the model
$$y_t = \boldsymbol{\beta}'\mathbf{x}_t + \varepsilon_t,$$

$$\varepsilon_t = \rho\varepsilon_{t-1} + u_t - \lambda u_{t-1}.$$
What parameter is estimated by the regression of the OLS residuals on their lagged values?
3. The following regression is obtained by ordinary least squares, using 21 observations. (Estimated asymptotic standard errors are shown in parentheses.)

$$y_t = 1.3 + 0.97y_{t-1} + 2.31x_t, \qquad D\text{–}W = 1.21,$$
with estimated standard errors (0.3), (0.18), and (1.04), respectively. Test for the presence of autocorrelation in the disturbances.
4. It is commonly asserted that the Durbin–Watson statistic is only appropriate for testing for first-order autoregressive disturbances. What combination of the coefficients of the model is estimated by the Durbin–Watson statistic in each of the following cases: AR(1), AR(2), MA(1)? In each case, assume that the regression model does not contain a lagged dependent variable. Comment on the impact on your results of relaxing this assumption.

Applications

1. The data used to fit the expectations augmented Phillips curve in Example 19.3 are given in Appendix Table F5.1. Using these data, reestimate the model given in the example. Carry out a formal test for first-order autocorrelation using the LM statistic. Then, reestimate the model using an AR(1) model for the disturbance process. Because the sample is large, the Prais–Winsten and Cochrane–Orcutt estimators should give essentially the same answer. Do they? After fitting the model, obtain the transformed residuals and examine them for first-order autocorrelation. Does the AR(1) model appear to have adequately "fixed" the problem?


2. Data for fitting an improved Phillips curve model can be obtained from many sources, including the Bureau of Economic Analysis’s (BEA) own Web site, www.economagic.com, and so on. Obtain the necessary data and expand the model of Example 19.3. Does adding additional explanatory variables to the model reduce the extreme pattern of the OLS residuals that appears in Figure 19.3? 3. (This exercise requires appropriate computer software. The computations required can be done with RATS, EViews, Stata, TSP, LIMDEP, and a variety of other software using only preprogrammed procedures.) Quarterly data on the consumer price index for 1950.1 to 2000.4 are given in Appendix Table F5.1. Use these data to fit the model proposed by Engle and Kraft (1983). The model is

πt = β0 + β1πt−1 + β2πt−2 + β3πt−3 + β4πt−4 + εt ,

where $\pi_t = 100\ln[p_t/p_{t-1}]$ and $p_t$ is the price index.
a. Fit the model by ordinary least squares, then use the tests suggested in the text to see if ARCH effects appear to be present.
b. The authors fit an ARCH(8) model with declining weights,
$$\sigma_t^2 = \alpha_0 + \sum_{i=1}^{8}\left(\frac{9-i}{36}\right)\varepsilon_{t-i}^2.$$
Fit this model. If the software does not allow constraints on the coefficients, you can still do this with a two-step least squares procedure, using the least squares residuals from the first step; a sketch of this two-step approach appears after this list. What do you find?
c. Bollerslev (1986) recomputed this model as a GARCH(1, 1). Use the GARCH(1, 1) form and refit your model.
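A minimal sketch of the two-step procedure mentioned in part b, assuming e holds the first-step OLS residuals (the variable names and the unconstrained slope on the weighted term are illustrative choices):

```python
import numpy as np

def declining_weight_arch(e):
    """Regress e_t^2 on a constant and z_t = sum_i ((9-i)/36) e_{t-i}^2."""
    e2 = np.asarray(e)**2
    w = np.array([(9 - i) / 36 for i in range(1, 9)])   # fixed weights, sum to 1
    T = len(e2)
    z = np.array([w @ e2[t - 8:t][::-1] for t in range(8, T)])
    y = e2[8:]
    X = np.column_stack([np.ones(len(y)), z])
    (a0, a1), *_ = np.linalg.lstsq(X, y, rcond=None)
    return a0, a1    # a slope near 1 would be consistent with the restriction
```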

20 MODELS WITH LAGGED VARIABLES

20.1 INTRODUCTION

This chapter begins our introduction to the analysis of economic time series. By most views, this field has become synonymous with empirical macroeconomics and the analysis of financial markets.1 In this and the next chapter, we will consider a number of models and topics in which time and relationships through time play an explicit part in the formulation. Consider the dynamic regression model

yt = β1 + β2xt + β3xt−1 + γ yt−1 + εt . (20-1)

Models of this form specifically include as right-hand-side variables previous as well as contemporaneous values of the regressors. It is also in this context that lagged values of the dependent variable appear as a consequence of the theoretical basis of the model rather than as a computational means of removing autocorrelation. There are several reasons lagged effects might appear in an empirical model:
• In modeling the response of economic variables to policy stimuli, it is expected that there will be possibly long lags between policy changes and their impacts. The length of lag between changes in monetary policy and its impact on important economic variables such as output and investment has been a subject of analysis for several decades.
• Either the dependent variable or one of the independent variables is based on expectations. Expectations about economic events are usually formed by aggregating new information and past experience. Thus, we might write the expectation of a future value of variable x, formed this period, as

$$x_t^* = E_t[x_{t+1} \mid z_t, x_{t-1}, x_{t-2}, \ldots] = g(z_t, x_{t-1}, x_{t-2}, \ldots).$$

1The literature in this area has grown at an impressive rate, and, more so than in any other area, it has become impossible to provide comprehensive surveys in general textbooks such as this one. Fortunately, specialized volumes have been produced that can fill this need at any level. Harvey (1990) has been in wide use for some time. Among the many other books, three very useful works are Enders (2003), which presents the basics of time-series analysis at an introductory level with several very detailed applications; Hamilton (1994), which gives a relatively technical but quite comprehensive survey of the field; and Lutkepohl (2005), which provides an extremely detailed treatment of the topics presented at the end of this chapter. Hamilton also surveys a number of the applications in the contemporary literature. Two references that are focused on financial econometrics are Mills (1993) and Tsay (2005). There are also a number of important references that are primarily limited to forecasting, including Diebold (1998a, 2003) and Granger and Newbold (1996). A survey of research in many areas of time-series analysis is Engle and McFadden (1994). An extensive, fairly advanced treatise that analyzes in great depth all the issues we touch on in this chapter is Hendry (1995). Finally, Patterson (2000) surveys most of the practical issues in time series and presents a large variety of useful and very detailed applications.


For example, forecasts of prices and income enter demand equations and consumption equations. (See Example 15.1 for an influential application.)
• Certain economic decisions are explicitly driven by a history of related activities. For example, energy demand by individuals is clearly a function not only of current prices and income, but also the accumulated stocks of energy using capital. Even energy demand in the macroeconomy behaves in this fashion—the stock of automobiles and its attendant demand for gasoline is clearly driven by past prices of gasoline and automobiles. Other classic examples are the dynamic relationship between investment decisions and past appropriation decisions and the consumption of addictive goods such as cigarettes and theater performances.

We begin with a general discussion of models containing lagged variables. In Section 20.2, we consider some methodological issues in the specification of dynamic regressions. In Sections 20.3 and 20.4, we describe a general dynamic model that encompasses some of the extensions and more formal models for time-series data that are presented in Chapter 21. Section 20.5 takes a closer look at some of the issues in model specification. Finally, Section 20.6 considers systems of dynamic equations. These are largely extensions of the models that we examined at the end of Chapter 13. But the interpretation is rather different here. This chapter is generally not about methods of estimation. OLS and GMM estimation are usually routine in this context. Because we are examining time-series data, conventional assumptions including ergodicity and stationarity will be made at the outset. In particular, in the general framework, we will assume that the multivariate stochastic process $(y_t, \mathbf{x}_t, \varepsilon_t)$ is a stationary and ergodic process. As such, without further analysis, we will invoke the theorems discussed in Chapters 4, 15, 16, and 19 that support least squares and GMM as appropriate estimation techniques in this context. In most of what follows, in fact, in practical terms, the dynamic regression model can be treated as a linear regression model and estimated by conventional methods (e.g., ordinary least squares or instrumental variables if $\varepsilon_t$ is autocorrelated). As noted, we will generally not return to the issue of estimation and inference theory except where new results are needed, such as in the discussion of nonstationary processes.

20.2 DYNAMIC REGRESSION MODELS

In some settings, economic agents respond not only to current values of independent variables but to past values as well. When effects persist over time, an appropriate model will include lagged variables. Example 20.1 illustrates a familiar case.

Example 20.1 A Structural Model of the Demand for Gasoline
Drivers demand gasoline not for direct consumption but as fuel for cars to provide a source of energy for transportation. Per capita demand for gasoline in any period, G/Pop, is determined partly by the current price, Pg, and per capita income, Y/Pop, which influence how intensively the existing stock of gasoline-using "capital," K, is used and partly by the size and composition of the stock of cars and other vehicles. The capital stock is determined, in turn, by income, Y/Pop; prices of the equipment such as new and used cars, Pnc and Puc; the price of alternative modes of transportation such as public transportation, Ppt; and past prices of gasoline as they influence forecasts of future gasoline prices. A structural model of


these effects might appear as follows:

per capita demand: $G_t/Pop_t = \alpha + \beta Pg_t + \delta Y_t/Pop_t + \gamma K_t + u_t$,

stock of vehicles: $K_t = (1 - \Phi)K_{t-1} + I_t$, $\Phi$ = depreciation rate,
investment in new vehicles: $I_t = \theta Y_t/Pop_t + \phi E_t[Pg_{t+1}] + \lambda_1 Pnc_t + \lambda_2 Puc_t + \lambda_3 Ppt_t$,

expected price of gasoline: $E_t[Pg_{t+1}] = w_0 Pg_t + w_1 Pg_{t-1} + w_2 Pg_{t-2}$.

The capital stock is the sum of all past investments, so it is evident that not only current income and prices, but all past values, play a role in determining K. When income or the price of gasoline changes, the immediate effect will be to cause drivers to use their vehicles more or less intensively. But, over time, vehicles are added to the capital stock, and some cars are replaced with more or less efficient ones. These changes take some time, so the full impact of income and price changes will not be felt for several periods. Two episodes in the recent history have shown this effect clearly. For well over a decade following the 1973 oil shock, drivers gradually replaced their large, fuel-inefficient cars with smaller, less-fuel-intensive models. In the late 1990s in the United States, this process worked visibly in reverse. As American drivers became accustomed to steadily rising incomes and steadily falling real gasoline prices, the downsized, efficient coupes and sedans of the 1980s yielded the highways to a tide of ever-larger, six- and eight-cylinder sport utility vehicles, whose size and power can reasonably be characterized as astonishing.

20.2.1 LAGGED EFFECTS IN A DYNAMIC MODEL

The general form of a dynamic regression model is

$$y_t = \alpha + \sum_{i=0}^{\infty}\beta_i x_{t-i} + \varepsilon_t. \quad (20\text{-}2)$$

In this model, a one-time change in x at any point in time will affect $E[y_s \mid x_t, x_{t-1}, \ldots]$ in every period thereafter. When it is believed that the duration of the lagged effects is extremely long—for example, in the analysis of monetary policy—infinite lag models that have effects that gradually fade over time are quite common. But models are often constructed in which changes in x cease to have any influence after a fairly small number of periods. We shall consider these finite lag models first.

Marginal effects in the static classical regression model are one-time events. The response of y to a change in x is assumed to be immediate and to be complete at the end of the period of measurement. In a dynamic model, the counterpart to a marginal effect is the effect of a one-time change in $x_t$ on the equilibrium of $y_t$. If the level of $x_t$ has been unchanged from, say, $\bar{x}$ for many periods prior to time t, then the equilibrium value of $E[y_t \mid x_t, x_{t-1}, \ldots]$ (assuming that it exists) will be

$$\bar{y} = \alpha + \sum_{i=0}^{\infty}\beta_i\bar{x} = \alpha + \bar{x}\sum_{i=0}^{\infty}\beta_i, \quad (20\text{-}3)$$

where $\bar{x}$ is the permanent value of $x_t$. For this value to be finite, we require that

$$\sum_{i=0}^{\infty}\beta_i < \infty. \quad (20\text{-}4)$$

Consider the effect of a unit change in $\bar{x}$ occurring in period s. To focus ideas, consider the earlier example of demand for gasoline and suppose that $x_t$ is the unit price. Prior to the oil shock, demand had reached an equilibrium consistent with accumulated habits,


[FIGURE 20.1 Lagged Adjustment. Demand plotted against time: demand moves from the old equilibrium D0 to the new equilibrium D1, with per-period effects β0, β1, β2, β3.]

experience with stable real prices, and the accumulated stocks of vehicles. Now suppose that the price of gasoline, Pg, rises permanently from $\overline{Pg}$ to $\overline{Pg} + 1$ in period s. The path to the new equilibrium might appear as shown in Figure 20.1. The short-run effect is the one that occurs in the same period as the change in x. This effect is $\beta_0$ in the figure.

DEFINITION 20.1 Impact Multiplier
$\beta_0$ = impact multiplier = short-run multiplier.

DEFINITION 20.2 Cumulated Effect
The accumulated effect τ periods later of an impulse at time t is $\beta_\tau^* = \sum_{i=0}^{\tau}\beta_i$.

In Figure 20.1, we see that the total effect of a price change in period t after three periods have elapsed will be β0 + β1 + β2 + β3. The difference between the old equilibrium D0 and the new one D1 is the sum of the individual period effects. The long-run multiplier is this total effect.


DEFINITION 20.3 Equilibrium Multiplier
$\beta = \sum_{i=0}^{\infty}\beta_i$ = equilibrium multiplier = long-run multiplier.

Because the lag coefficients are regression coefficients, their scale is determined by the scales of the variables in the model. As such, it is often useful to define the lag weights,

$$w_i = \frac{\beta_i}{\sum_{j=0}^{\infty}\beta_j}, \quad (20\text{-}5)$$

so that $\sum_{i=0}^{\infty}w_i = 1$, and to rewrite the model as

$$y_t = \alpha + \beta\sum_{i=0}^{\infty}w_i x_{t-i} + \varepsilon_t. \quad (20\text{-}6)$$

(Note the equation for the expected price in Example 20.1.) Two useful statistics, based on the lag weights, that characterize the period of adjustment to a new equilibrium are the median lag = smallest $q^*$ such that $\sum_{i=0}^{q^*}w_i \geq 0.5$, and the mean lag = $\sum_{i=0}^{\infty}i\,w_i$.
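A minimal sketch of these calculations for a (truncated) set of hypothetical lag coefficients — the β values below are illustrative, not estimates from the text:

```python
import numpy as np

beta = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])   # hypothetical beta_i
w = beta / beta.sum()                                    # lag weights in (20-5)
median_lag = int(np.argmax(np.cumsum(w) >= 0.5))         # smallest q* with cumulative weight >= 0.5
mean_lag = float(np.arange(len(w)) @ w)                  # sum of i * w_i
print(w, median_lag, mean_lag)
```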

20.2.2 THE LAG AND DIFFERENCE OPERATORS

A convenient device for manipulating lagged variables is the lag operator,

Lxt = xt−1.

Some basic results are $La = a$ if a is a constant and $L(Lx_t) = L^2x_t = x_{t-2}$. Thus, $L^p x_t = x_{t-p}$, $L^q(L^p x_t) = L^{p+q}x_t = x_{t-p-q}$, and $(L^p + L^q)x_t = x_{t-p} + x_{t-q}$. By convention, $L^0 x_t = 1 \cdot x_t = x_t$. A related operation is the first difference,

$$\Delta x_t = x_t - x_{t-1}.$$

Obviously, $\Delta x_t = (1 - L)x_t$ and $x_t = x_{t-1} + \Delta x_t$. These two operations can be usefully combined, for example, as in

$$\Delta^2 x_t = (1 - L)^2 x_t = (1 - 2L + L^2)x_t = x_t - 2x_{t-1} + x_{t-2}.$$

Note that

$$(1 - L)^2 x_t = (1 - L)(1 - L)x_t = (1 - L)(x_t - x_{t-1}) = (x_t - x_{t-1}) - (x_{t-1} - x_{t-2}).$$
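A minimal sketch of these operators using pandas shifts and differences (the series x is illustrative):

```python
import pandas as pd

x = pd.Series([5.0, 7.0, 6.0, 9.0, 8.0])
Lx = x.shift(1)            # L x_t = x_{t-1}
dx = x.diff(1)             # (1 - L) x_t = x_t - x_{t-1}
d2x = x.diff().diff()      # (1 - L)^2 x_t = x_t - 2 x_{t-1} + x_{t-2}
print(pd.DataFrame({"x": x, "Lx": Lx, "dx": dx, "d2x": d2x}))
```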

The dynamic regression model can be written

$$y_t = \alpha + \sum_{i=0}^{\infty}\beta_i L^i x_t + \varepsilon_t = \alpha + B(L)x_t + \varepsilon_t,$$

2If the lag coefficients do not all have the same sign, then these results may not be meaningful. In some contexts, lag coefficients with different signs may be taken as an indication that there is a flaw in the specification of the model.


where B(L) is a polynomial in L, $B(L) = \beta_0 + \beta_1 L + \beta_2 L^2 + \cdots$. A polynomial in the lag operator that reappears in many contexts is

$$A(L) = 1 + aL + (aL)^2 + (aL)^3 + \cdots = \sum_{i=0}^{\infty}(aL)^i.$$

If $|a| < 1$, then

$$A(L) = \frac{1}{1 - aL}.$$

A distributed lag model in the form

$$y_t = \alpha + \beta\sum_{i=0}^{\infty}\gamma^i L^i x_t + \varepsilon_t$$

can be written

$$y_t = \alpha + \beta(1 - \gamma L)^{-1}x_t + \varepsilon_t,$$

if $|\gamma| < 1$. This form is called the moving-average form or distributed lag form. If we multiply through by $(1 - \gamma L)$ and collect terms, then we obtain the autoregressive form,

yt = α(1 − γ)+ βxt + γ yt−1 + (1 − γ L)εt . In more general terms, consider the pth order autoregressive model,

yt = α + βxt + γ1 yt−1 + γ2 yt−2 +···+γp yt−p + εt , which may be written

C(L)yt = α + βxt + εt , where

$$C(L) = 1 - \gamma_1 L - \gamma_2 L^2 - \cdots - \gamma_p L^p.$$

Can this equation be “inverted” so that yt is written as a function only of current and past values of xt and εt ? By successively substituting the corresponding autoregressive equation for yt−1 in that for yt , then likewise for yt−2 and so on, it would appear so. However, it is also clear that the resulting distributed lag form will have an infinite number of coefficients. Formally, the operation just described amounts to writing

$$y_t = [C(L)]^{-1}(\alpha + \beta x_t + \varepsilon_t) = A(L)(\alpha + \beta x_t + \varepsilon_t).$$

It will be of interest to be able to solve for the elements of A(L) (see, for example, Section 20.6.6). By this arrangement, it follows that C(L)A(L) = 1 where

$$A(L) = (\alpha_0 L^0 + \alpha_1 L^1 + \alpha_2 L^2 + \cdots).$$

By collecting like powers of L in

$$(1 - \gamma_1 L - \gamma_2 L^2 - \cdots - \gamma_p L^p)(\alpha_0 L^0 + \alpha_1 L^1 + \alpha_2 L^2 + \cdots) = 1,$$


we find that a recursive solution for the α coefficients is

$$\begin{aligned}
L^0&: \quad \alpha_0 = 1\\
L^1&: \quad \alpha_1 - \gamma_1\alpha_0 = 0\\
L^2&: \quad \alpha_2 - \gamma_1\alpha_1 - \gamma_2\alpha_0 = 0\\
L^3&: \quad \alpha_3 - \gamma_1\alpha_2 - \gamma_2\alpha_1 - \gamma_3\alpha_0 = 0\\
L^4&: \quad \alpha_4 - \gamma_1\alpha_3 - \gamma_2\alpha_2 - \gamma_3\alpha_1 - \gamma_4\alpha_0 = 0\\
&\ \ \vdots
\end{aligned} \quad (20\text{-}7)$$

$$L^p: \quad \alpha_p - \gamma_1\alpha_{p-1} - \gamma_2\alpha_{p-2} - \cdots - \gamma_p\alpha_0 = 0$$

and, thereafter,

$$L^q: \quad \alpha_q - \gamma_1\alpha_{q-1} - \gamma_2\alpha_{q-2} - \cdots - \gamma_p\alpha_{q-p} = 0.$$

After a set of p − 1 starting values, the α coefficients obey the same difference equation as $y_t$ does in the dynamic equation. One problem remains. For the given set of values, the preceding gives no assurance that the solution for $\alpha_q$ does not ultimately explode. The preceding equation system is not necessarily stable for all values of $\gamma_j$ (although it certainly is for some). If the system is stable in this sense, then the polynomial C(L) is said to be invertible. The necessary conditions are precisely those discussed in Section 20.4.3, so we will defer completion of this discussion until then.

Finally, two useful results are

$$B(1) = \beta_0 1^0 + \beta_1 1^1 + \beta_2 1^2 + \cdots = \beta = \text{long-run multiplier},$$

and

$$B'(1) = [dB(L)/dL]\big|_{L=1} = \sum_{i=0}^{\infty}i\beta_i.$$

It follows that $B'(1)/B(1)$ = mean lag.

20.2.3 SPECIFICATION SEARCH FOR THE LAG LENGTH

Various procedures have been suggested for determining the appropriate lag length in a dynamic model such as

$$y_t = \alpha + \sum_{i=0}^{p}\beta_i x_{t-i} + \varepsilon_t. \quad (20\text{-}8)$$

One must be careful about a purely significance based specification search. Let us suppose that there is an appropriate, "true" value of p > 0 that we seek. A simple-to-general approach to finding the right lag length would depart from a model with only the current value of the independent variable in the regression and add deeper lags until a simple t test suggested that the last one added is statistically insignificant. The problem with such an approach is that at any level at which the number of included lagged variables is less than p, the estimator of the coefficient vector is biased and inconsistent. [See the omitted variable formula (7-4).] The asymptotic covariance matrix is biased as well, so statistical inference on this basis is unlikely to be successful. A general-to-simple approach would begin from a model that contains more than p lagged values—it


is assumed that although the precise value of p is unknown, the analyst can posit a maintained value that should be larger than p. Least squares or instrumental variables regression of y on a constant and (p + d) lagged values of x consistently estimates $\boldsymbol{\theta} = [\alpha, \beta_0, \beta_1, \ldots, \beta_p, 0, 0, \ldots]$.

Because models with lagged values are often used for forecasting, researchers have tended to look for measures that have produced better results for assessing "out of sample" prediction properties. The adjusted R² [see Section 3.5.1] is one possibility. Others include the Akaike (1973) information criterion, AIC(p),

$$\text{AIC}(p) = \ln\left(\frac{\mathbf{e}'\mathbf{e}}{T}\right) + \frac{2p}{T}, \quad (20\text{-}9)$$

and Schwarz's criterion, SC(p):

$$\text{SC}(p) = \text{AIC}(p) + \frac{p}{T}(\ln T - 2). \quad (20\text{-}10)$$

(See Section 7.4.) If some maximum P is known, then p < P can be chosen to minimize AIC(p) or SC(p).3 An alternative approach, also based on a known P, is to do sequential F tests on the last P > p coefficients, stopping when the test rejects the hypothesis that the coefficients are jointly zero. Each of these approaches has its flaws and virtues. The Akaike information criterion retains a positive probability of leading to overfitting even as T → ∞. In contrast, SC(p) has been seen to lead to underfitting in some finite sample cases. They do avoid, however, the inference problems of sequential estimators. The sequential F tests require successive revision of the significance level to be appropriate, but they do have a statistical underpinning.4
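A minimal sketch of the criterion-based search, assuming y and x are numpy arrays and P is the maintained maximum lag; fitting every candidate on the common sample t = P, ..., T−1 keeps the criteria comparable across p (the function name and this common-sample convention are illustrative choices):

```python
import numpy as np

def aic_sc(y, x, p, P):
    """OLS of y_t on a constant and x_t, ..., x_{t-p}, using observations t = P..T-1;
    returns AIC(p) from (20-9) and SC(p) from (20-10)."""
    T = len(y) - P
    yy = y[P:]
    X = np.column_stack([np.ones(T)] +
                        [x[P - i:len(x) - i] for i in range(p + 1)])
    b, *_ = np.linalg.lstsq(X, yy, rcond=None)
    e = yy - X @ b
    ee = float(e @ e)
    aic = np.log(ee / T) + 2 * p / T
    sc = aic + (p / T) * (np.log(T) - 2)
    return aic, sc

# choose the lag length minimizing AIC (or SC) over p = 0, ..., P:
# best_p = min(range(P + 1), key=lambda p: aic_sc(y, x, p, P)[0])
```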

20.3 SIMPLE DISTRIBUTED LAG MODELS

Before examining some very general specifications of the dynamic regression, we briefly consider an infinite lag model, which emerges from a simple model of expectations. There are cases in which the distributed lag models the accumulation of information. The formation of expectations is an example. In these instances, intuition suggests that the most recent past will receive the greatest weight and that the influence of past observations will fade uniformly with the passage of time. The geometric lag model is often used for these settings. The general form of the model is

$$y_t = \alpha + \beta\sum_{i=0}^{\infty}(1 - \lambda)\lambda^i x_{t-i} + \varepsilon_t, \qquad 0 < \lambda < 1, \quad (20\text{-}11)$$

$$= \alpha + \beta B(L)x_t + \varepsilon_t,$$

where

$$B(L) = (1 - \lambda)(1 + \lambda L + \lambda^2 L^2 + \lambda^3 L^3 + \cdots) = \frac{1 - \lambda}{1 - \lambda L}.$$

3For further discussion and some alternative measures, see Geweke and Meese (1981), Amemiya (1985, pp. 146–147), Diebold (1998, pp. 85–91), and Judge et al. (1985, pp. 353–355).

4See Pagano and Hartley (1981) and Trivedi and Pagan (1979).


The lag coefficients are $\beta_i = \beta(1 - \lambda)\lambda^i$. The model incorporates infinite lags, but it assigns arbitrarily small weights to the distant past. The lag weights decline geometrically;

$$w_i = (1 - \lambda)\lambda^i, \qquad 0 \leq w_i < 1.$$

The mean lag is

$$\bar{w} = \frac{B'(1)}{B(1)} = \frac{\lambda}{1 - \lambda}.$$

The median lag is $p^*$ such that $\sum_{i=0}^{p^*-1}w_i = 0.5$. We can solve for $p^*$ by using the result

$$\sum_{i=0}^{p}\lambda^i = \frac{1 - \lambda^{p+1}}{1 - \lambda}.$$

Thus,

$$p^* = \frac{\ln 0.5}{\ln\lambda} - 1.$$

The impact multiplier is $\beta(1 - \lambda)$. The long-run multiplier is $\beta\sum_{i=0}^{\infty}(1 - \lambda)\lambda^i = \beta$. The equilibrium value of $y_t$ would be found by fixing $x_t$ at $\bar{x}$ and $\varepsilon_t$ at zero in (20-11), which produces $\bar{y} = \alpha + \beta\bar{x}$.

The geometric lag model can be motivated with an economic model of expectations. We begin with a regression in an expectations variable such as an expected future price based on information available at time t, $x_{t+1|t}^*$, and perhaps a second regressor, $w_t$,

$$y_t = \alpha + \beta x_{t+1|t}^* + \delta w_t + \varepsilon_t,$$

and a mechanism for the formation of the expectation,

$$x_{t+1|t}^* = \lambda x_{t|t-1}^* + (1 - \lambda)x_t = \lambda L x_{t+1|t}^* + (1 - \lambda)x_t. \quad (20\text{-}12)$$

The currently formed expectation is a weighted average of the expectation in the previous period and the most recent observation. The parameter λ is the adjustment coefficient. If λ equals 1, then the current datum is ignored and expectations are never revised. A value of zero characterizes a strict pragmatist who forgets the past immediately. The expectation variable can be written as

$$x_{t+1|t}^* = \frac{1 - \lambda}{1 - \lambda L}x_t = (1 - \lambda)[x_t + \lambda x_{t-1} + \lambda^2 x_{t-2} + \cdots]. \quad (20\text{-}13)$$

Inserting (20-13) into (20-12) produces the geometric distributed lag model,

$$y_t = \alpha + \beta(1 - \lambda)[x_t + \lambda x_{t-1} + \lambda^2 x_{t-2} + \cdots] + \delta w_t + \varepsilon_t.$$

The geometric lag model can be estimated by nonlinear least squares. Rewrite it as

$$y_t = \alpha + \gamma z_t(\lambda) + \delta w_t + \varepsilon_t, \qquad \gamma = \beta(1 - \lambda). \quad (20\text{-}14)$$

The constructed variable $z_t(\lambda)$ obeys the recursion $z_t(\lambda) = x_t + \lambda z_{t-1}(\lambda)$. For the first observation, we use $z_1(\lambda) = x_{1|0}^* = x_1/(1 - \lambda)$. If the sample is moderately long, then assuming that $x_t$ was in long-run equilibrium, although it is an approximation, will not unduly affect the results. One can then scan over the range of λ from zero to one to locate the value that minimizes the sum of squares. Once the minimum is located, an estimate of the asymptotic covariance matrix of the estimators of $(\alpha, \gamma, \delta, \lambda)$ can be


found using (11-12) and Theorem 11.2. For the regression function $h_t(\text{data} \mid \alpha, \gamma, \delta, \lambda)$, $x_{t1}^0 = 1$, $x_{t2}^0 = z_t(\lambda)$, and $x_{t3}^0 = w_t$. The derivative with respect to λ can be computed by using the recursion $d_t(\lambda) = \partial z_t(\lambda)/\partial\lambda = z_{t-1}(\lambda) + \lambda\,\partial z_{t-1}(\lambda)/\partial\lambda$. If $z_1 = x_1/(1 - \lambda)$, then $d_1(\lambda) = z_1/(1 - \lambda)$. Then, $x_{t4}^0 = d_t(\lambda)$. Finally, we estimate β from the relationship $\beta = \gamma/(1 - \lambda)$ and use the delta method to estimate the asymptotic standard error.

For purposes of estimating long- and short-run elasticities, researchers often use a different form of the geometric lag model. The partial adjustment model describes the desired level of $y_t$,

$$y_t^* = \alpha + \beta x_t + \delta w_t + \varepsilon_t,$$

and an adjustment equation,

$$y_t - y_{t-1} = (1 - \lambda)(y_t^* - y_{t-1}).$$

If we solve the second equation for $y_t$ and insert the first expression for $y_t^*$, then we obtain

$$y_t = \alpha(1 - \lambda) + \beta(1 - \lambda)x_t + \delta(1 - \lambda)w_t + \lambda y_{t-1} + (1 - \lambda)\varepsilon_t$$
$$= \alpha' + \beta' x_t + \delta' w_t + \lambda y_{t-1} + \varepsilon_t'.$$

This formulation offers a number of significant practical advantages. It is intrinsically linear in the parameters (unrestricted), and its disturbance is nonautocorrelated if $\varepsilon_t$ was to begin with. As such, the parameters of this model can be estimated consistently and efficiently by ordinary least squares. In this revised formulation, the short-run multipliers for $x_t$ and $w_t$ are $\beta'$ and $\delta'$. The long-run effects are $\beta = \beta'/(1 - \lambda)$ and $\delta = \delta'/(1 - \lambda)$. With the variables in logs, these effects are the short- and long-run elasticities.

Example 20.2 Expectations-Augmented Phillips Curve
In Example 19.3, we estimated an expectations-augmented Phillips curve of the form

$$p_t - E[p_t \mid \Psi_{t-1}] = \beta[u_t - u^*] + \varepsilon_t.$$

Our model assumed a particularly simple model of expectations, $E[p_t \mid \Psi_{t-1}] = p_{t-1}$. The least squares results for this equation were

$$p_t - p_{t-1} = 0.49189 - 0.090136\,u_t + e_t,$$

with estimated standard errors (0.7405) and (0.1257), $R^2 = 0.002561$, T = 201. The implied estimate of the natural rate of unemployment is −(0.49189/−0.090136), or about 5.46 percent. Suppose we allow expectations to be formulated less pragmatically with the expectations model in (20-12). For this setting, this would be

$$E[p_t \mid \Psi_{t-1}] = \lambda E[p_{t-1} \mid \Psi_{t-2}] + (1 - \lambda)p_{t-1}.$$

The strict pragmatist has λ = 0.0. Using the method set out earlier, we would compute this for different values of λ, recompute the dependent variable in the regression, and locate the value of λ which produces the lowest sum of squares. Figure 20.2 shows the sum of squares for values of λ ranging from 0.0 to 1.0. The minimum value of the sum of squares occurs at λ = 0.66. The least squares regression results are

$$p_t - p_{t-1} = 1.69453 - 0.30427\,u_t + e_t,$$

with estimated standard errors (0.6617) and (0.11125), T = 201.


[FIGURE 20.2 Residual Sums of Squares for Phillips Curve Estimates. S(λ) plotted for λ from 0.0 to 1.0; the minimum occurs near λ = 0.66.]

The estimated standard errors are computed using the method described earlier for the nonlinear regression. The extra variable described in the paragraph after (20-14) accounts for the estimated λ. The estimated asymptotic covariance matrix is then computed using $(\mathbf{e}'\mathbf{e}/201)[\mathbf{W}'\mathbf{W}]^{-1}$, where $w_1 = 1$, $w_2 = u_t$, and $w_3 = \partial E[p_t \mid \Psi_{t-1}]/\partial\lambda$. The estimated standard error for λ is 0.04610. Because this is highly statistically significantly different from zero (t = 14.315), we would reject the simple model. Finally, the implied estimate of the natural rate of unemployment is −(−1.69453/0.30427), or about 5.57 percent. The estimated asymptotic covariance of the slope and constant term is −0.0720293, so, using this value and the estimated standard errors given earlier and the delta method, we obtain an estimated standard error for this estimate of 0.5467. Thus, a confidence interval for the natural rate of unemployment based on these results would be (4.49 percent, 6.64 percent), which is in line with our prior expectations.

There are two things to note about these results. First, because the dependent variables are different, we cannot compare the R²s of the models with λ = 0.00 and λ = 0.66. But, the sum of squares for the two models can be compared (they are 1592.32 and 1112.89), so the second model fits far better. One of the payoffs is the much narrower confidence interval for the natural rate. The counterpart to the one given earlier when λ = 0.00 is (1.13%, 9.79%). No doubt the model could be improved still further by expanding the equation. (This is considered in the exercises.)
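A minimal sketch of the grid search used in this example, assuming p holds the inflation series and u the unemployment rate (the startup value for the expectation and all names are illustrative choices, not the text's code):

```python
import numpy as np

def sse_for_lambda(lam, p, u):
    """SSE of the Phillips curve with E[p_t | Psi_{t-1}] built by (20-12)."""
    T = len(p)
    E = np.empty(T)
    E[1] = p[0]                           # startup: expectation = last observation
    for t in range(2, T):
        E[t] = lam * E[t - 1] + (1.0 - lam) * p[t - 1]
    y = p[2:] - E[2:]                     # recomputed dependent variable
    X = np.column_stack([np.ones(len(y)), u[2:]])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    return float(e @ e)

# locate the minimizing lambda on a grid over (0, 1):
# grid = np.linspace(0.01, 0.99, 99)
# lam_hat = min(grid, key=lambda lam: sse_for_lambda(lam, p, u))
```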

Example 20.3 Price and Income Elasticities of Demand for Gasoline
We have extended the gasoline demand equation estimated in Examples 19.2 and 19.6 to allow for dynamic effects. Table 20.1 presents estimates of three distributed lag models for gasoline consumption. The unrestricted model allows five years of adjustment in the price and income effects. The expectations model includes the same distributed lag (λ) on price and income but different long-run multipliers ($\beta_{Pg}$ and $\beta_I$). [Note, for this formulation, that the extra regressor used in computing the asymptotic covariance matrix is $d_t(\lambda) = \beta_{Pg}d_{price,t}(\lambda) + \beta_I d_{income,t}(\lambda)$.] Finally, the partial adjustment model implies lagged effects for all the variables in the model. To facilitate comparison, the constant and the first four slope coefficients in the partial adjustment model have been divided by the estimate of (1 − λ). The implied long- and short-run price and income elasticities are shown in Table 20.2.


TABLE 20.1 Estimated Distributed Lag Models

                              Expectations               Partial Adjustment
Coefficient     Unrestricted  Estimated    Derived       Estimated    Derived
Constant         −28.5512     −16.1867                    −4.9489
ln Pnc             0.01738     −0.1050                    −0.1429
ln Puc             0.07602      0.02815                    0.09435
ln Ppt             0.04770      0.2550                     0.03243
Trend             −0.02297      0.02064                   −0.004029
ln Pg             −0.08282     −0.06702*   −0.06702*      −0.07627    −0.07627
ln Pg[−1]         −0.07152                 −0.06233                   −0.06116
ln Pg[−2]          0.03669                 −0.05797                   −0.04904
ln Pg[−3]         −0.04814                 −0.05391                   −0.03933
ln Pg[−4]          0.02958                 −0.05013                   −0.03153
ln Pg[−5]         −0.1481                  −0.04663                   −0.02529
ln Income          1.1074       0.04372*    0.04372*       0.3135      0.3135
ln Income[−1]      0.3776                   0.04066                    0.2514
ln Income[−2]     −0.01255                  0.03781                    0.2016
ln Income[−3]     −0.03919                  0.03517                    0.1616
ln Income[−4]      0.2737                   0.03270                    0.1296
ln Income[−5]      0.09350                  0.03042                    0.1039
Zt(Price)            —         −0.06702
Zt(Income)           —          0.04372
ln(G/Pop)[−1]        —                                     0.80188
β                    —         −0.9574
γ                    —          0.6245
λ                    —          0.9300                     0.80188
e′e               0.01565356    0.03911383                 0.01151860
T                 47            52                         51

*Estimated directly

TABLE 20.2 Estimated Elasticities

                               Short-Run               Long-Run
                            Price      Income       Price      Income
Unrestricted model        −0.08282     1.1074     −0.2843      1.8004
Expectations model        −0.06702     0.04372    −0.9574      0.6246
Partial adjustment model  −0.07628     0.3135     −0.3850      1.5823

20.4 AUTOREGRESSIVE DISTRIBUTED LAG MODELS

Both the finite lag models and the geometric lag model impose strong, possibly incorrect restrictions on the lagged response of the dependent variable to changes in an independent variable. A very general compromise that also provides a useful platform for studying a number of interesting methodological issues is the autoregressive distributed lag (ARDL) model,

$$y_t = \mu + \sum_{i=1}^{p}\gamma_i y_{t-i} + \sum_{j=0}^{r}\beta_j x_{t-j} + \delta w_t + \varepsilon_t, \quad (20\text{-}15)$$


in which εt is assumed to be serially uncorrelated and homoscedastic (we will relax both these assumptions in Chapter 21). We can write this more compactly as

C(L)yt = μ + B(L)xt + δwt + εt by defining polynomials in the lag operator,

$$C(L) = 1 - \gamma_1 L - \gamma_2 L^2 - \cdots - \gamma_p L^p,$$

and

$$B(L) = \beta_0 + \beta_1 L + \beta_2 L^2 + \cdots + \beta_r L^r.$$

The model in this form is denoted ARDL(p, r) to indicate the orders of the two polynomials in L. The partial adjustment model estimated in the previous section is the special case in which p equals 1 and r equals 0. A number of other special cases are also interesting, including the familiar model of autocorrelation ($p = 1$, $r = 1$, $\beta_1 = -\gamma_1\beta_0$), the classical regression model ($p = 0$, $r = 0$), and so on.

20.4.1 ESTIMATION OF THE ARDL MODEL

Save for the presence of the stochastic right-hand-side variables, the ARDL is a linear model with a classical disturbance. As such, ordinary least squares is the efficient estimator. The lagged dependent variable does present a complication, but we considered this in Chapter 19. Absent any obvious violations of the assumptions there, least squares continues to be the estimator of choice. Conventional testing procedures are, as before, asymptotically valid as well. Thus, for testing linear restrictions, the Wald statistic can be used, although the F statistic is generally preferable in finite samples because of its more conservative critical values.

One subtle complication in the model has attracted a large amount of attention in the recent literature. If C(1) = 0, then the model is actually inestimable. This fact is evident in the distributed lag form, which includes a term μ/C(1). If the equivalent condition $\sum_i\gamma_i = 1$ holds, then the stochastic difference equation is unstable and a host of other problems arise as well. This implication suggests that one might be interested in testing this specification as a hypothesis in the context of the model. This restriction might seem to be a simple linear constraint on the alternative (unrestricted) model in (20-15). Under the null hypothesis, however, the conventional test statistics do not have the familiar distributions. The formal derivation is complicated [in the extreme, see Dickey and Fuller (1979) for an example], but intuition should suggest the reason. Under the null hypothesis, the difference equation is explosive, so our assumptions about well behaved data cannot be met. Consider a simple ARDL(1, 0) example and simplify it even further with B(L) = 0. Then,

yt = μ + γ yt−1 + εt . If γ equals 1, then

yt = μ + yt−1 + εt . Assuming we start the time series at time t = 1,

$$y_t = t\mu + \sum_{s=1}^{t}\varepsilon_s = t\mu + v_t.$$


The conditional mean in this random walk with drift model is increasing without limit, so the unconditional mean does not exist. The conditional mean of the disturbance, $v_t$, is zero, but its conditional variance is $t\sigma^2$, which shows a peculiar type of heteroscedasticity. Consider least squares estimation of μ with $m = (\mathbf{t}'\mathbf{y})/(\mathbf{t}'\mathbf{t})$, where $\mathbf{t} = [1, 2, 3, \ldots, T]'$. Then

$$E[m] = \mu + E[(\mathbf{t}'\mathbf{t})^{-1}(\mathbf{t}'\mathbf{v})] = \mu,$$

but

$$\mathrm{Var}[m] = \frac{\sigma^2\sum_{t=1}^{T}t^3}{\left[\sum_{t=1}^{T}t^2\right]^2} = \frac{O(T^4)}{[O(T^3)]^2} = O\left(\frac{1}{T^2}\right).$$

So, the variance of this estimator is an order of magnitude smaller than we are used to seeing in regression models. Not only is m mean square consistent, it is superconsistent. As such, without doing a formal derivation, we conclude that there is something "unusual" about this estimator and that the "usual" testing procedures whose distributions build on the distribution of $\sqrt{T}(m - \mu)$ will not be appropriate; the variance of this normalized statistic converges to zero. This result does not mean that the hypothesis γ = 1 is not testable in this model. In fact, the appropriate test statistic is the conventional one that we have computed for comparable tests before. But the appropriate critical values against which to measure those statistics are quite different. We will return to this issue in our discussion of the Dickey–Fuller test in Section 22.2.4.
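A small Monte Carlo (illustrative, not from the text) makes the $O(1/T^2)$ rate visible: $T^2 \cdot \mathrm{Var}[m]$ is roughly constant as T grows.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = 0.5
for T in (100, 400, 1600):
    t = np.arange(1, T + 1, dtype=float)
    ests = []
    for _ in range(2000):
        y = mu * t + np.cumsum(rng.standard_normal(T))  # y_t = t*mu + v_t
        ests.append((t @ y) / (t @ t))                  # m = (t'y)/(t't)
    print(T, T**2 * np.var(ests))   # approximately constant across T
```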

20.4.2 COMPUTATION OF THE LAG WEIGHTS IN THE ARDL MODEL

The distributed lag form of the ARDL model is

$$y_t = \frac{\mu}{C(L)} + \frac{B(L)}{C(L)}x_t + \frac{1}{C(L)}\delta w_t + \frac{1}{C(L)}\varepsilon_t$$
$$= \frac{\mu}{1 - \gamma_1 - \cdots - \gamma_p} + \sum_{j=0}^{\infty}\alpha_j x_{t-j} + \delta\sum_{l=0}^{\infty}\theta_l w_{t-l} + \sum_{l=0}^{\infty}\theta_l\varepsilon_{t-l}.$$

This model provides a method of approximating a very general lag structure. In Jorgenson's (1966) study, in which he labeled this model a rational lag model, he demonstrated that essentially any desired shape for the lag distribution could be produced with relatively few parameters.5 The lag coefficients on $x_t, x_{t-1}, \ldots$ in the ARDL model are the individual terms in the ratio of polynomials that appear in the distributed lag form. We denote these coefficients

$$\alpha_0, \alpha_1, \alpha_2, \ldots = \text{the coefficients on } 1, L, L^2, \ldots \text{ in } \frac{B(L)}{C(L)}. \quad (20\text{-}16)$$

A convenient way to compute these coefficients is to write (20-16) as A(L)C(L) = B(L). Then we can just equate coefficients on the powers of L. Example 20.4 demonstrates the procedure.

5A long literature, highlighted by Griliches (1967), Dhrymes (1971), Nerlove (1972), Maddala (1977a), and Harvey (1990), describes estimation of models of this sort.

The long-run effect in a rational lag model is $\sum_{i=0}^{\infty}\alpha_i$. This result is easy to compute because it is simply

$$\sum_{i=0}^{\infty}\alpha_i = \frac{B(1)}{C(1)}.$$

A standard error for the long-run effect can be computed using the delta method.

20.4.3 STABILITY OF A DYNAMIC EQUATION

In the geometric lag model, we found that a stability condition |λ| < 1 was necessary for the model to be well behaved. Similarly, in the AR(1) model, the autocorrelation parameter ρ must be restricted to |ρ| < 1 for the same reason. The dynamic model in (20-15) must also be restricted, but in ways that are less obvious. Consider once again the question of whether there exists an equilibrium value of $y_t$. In (20-15), suppose that $x_t$ is fixed at some value $\bar{x}$, $w_t$ is fixed at zero, and the disturbances $\varepsilon_t$ are fixed at their expectation of zero. Would $y_t$ converge to an equilibrium? The relevant dynamic equation is

yt = α + γ1 yt−1 + γ2 yt−2 +···+γp yt−p,

where $\alpha = \mu + B(1)\bar{x}$. If $y_t$ converges to an equilibrium, then that equilibrium is

$$\bar{y} = \frac{\mu + B(1)\bar{x}}{C(1)} = \frac{\alpha}{C(1)}.$$

Stability of a dynamic equation hinges on the characteristic equation for the autoregressive part of the model. The roots of the characteristic equation,

$$C(z) = 1 - \gamma_1 z - \gamma_2 z^2 - \cdots - \gamma_p z^p = 0, \quad (20\text{-}17)$$

must be greater than one in absolute value for the model to be stable. To take a simple example, the characteristic equation for the first-order models we have examined thus far is

$$C(z) = 1 - \lambda z = 0.$$

The single root of this equation is z = 1/λ, which is greater than one in absolute value if |λ| is less than one. The roots of a more general characteristic equation are the reciprocals of the characteristic roots of the matrix

$$\mathbf{C} = \begin{bmatrix} \gamma_1 & \gamma_2 & \gamma_3 & \cdots & \gamma_{p-1} & \gamma_p \\ 1 & 0 & 0 & \cdots & 0 & 0 \\ 0 & 1 & 0 & \cdots & 0 & 0 \\ 0 & 0 & 1 & \cdots & 0 & 0 \\ & & & \cdots & & \\ 0 & 0 & 0 & \cdots & 1 & 0 \end{bmatrix}. \quad (20\text{-}18)$$

Because the matrix is asymmetric, its roots may include complex pairs. The reciprocal of the complex number $a + bi$ is $a/M - (b/M)i$, where $M = a^2 + b^2$ and $i^2 = -1$. We thus require that M be less than 1.

The case of z = 1, the unit root case, is often of special interest. If one of the roots of C(z) = 0 is 1, then it follows that $\sum_{i=1}^{p}\gamma_i = 1$. This assumption would appear
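A minimal sketch of the stability check: build the matrix C in (20-18) and examine the moduli of its characteristic roots. The coefficients used below are the ARDL(3, 3) estimates from Example 20.4, so the roots should match the values reported there (0.8631, −0.5949, 0.4551); the function name is an illustrative choice.

```python
import numpy as np

def companion_roots(gammas):
    """Characteristic roots of the companion matrix C in (20-18)."""
    p = len(gammas)
    C = np.zeros((p, p))
    C[0, :] = gammas                    # first row holds gamma_1, ..., gamma_p
    if p > 1:
        C[1:, :p - 1] = np.eye(p - 1)   # identity block shifts y_{t-i} down
    return np.linalg.eigvals(C)

roots = companion_roots([0.7233, 0.3914, -0.2337])
print(roots, "stable:", np.all(np.abs(roots) < 1))
```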


to be a simple hypothesis to test in the framework of the ARDL model. Instead, we find the explosive case that we examined in Section 20.4.1, so the hypothesis is more complicated than it first appears. To reiterate, under the null hypothesis that C(1) = 0, it is not possible for the standard F statistic to have a central F distribution because of the behavior of the variables in the model. We will return to this case shortly. The univariate autoregression,

yt = μ + γ1 yt−1 + γ2 yt−2 +···+γp yt−p + εt , can be augmented with the p − 1 equations

yt−1 = yt−1,

$y_{t-2} = y_{t-2}$, and so on to give a vector autoregression, VAR (to be considered in the next section):

$$\mathbf{y}_t = \boldsymbol{\mu} + \mathbf{C}\mathbf{y}_{t-1} + \boldsymbol{\varepsilon}_t,$$

where $\mathbf{y}_t$ has p elements, $\boldsymbol{\varepsilon}_t = (\varepsilon_t, 0, \ldots)'$, and $\boldsymbol{\mu} = (\mu, 0, 0, \ldots)'$. It will ultimately not be relevant to the solution, so we will let $\boldsymbol{\varepsilon}_t$ equal its expected value of zero. Now, by successive substitution, we obtain

$$\mathbf{y}_t = \boldsymbol{\mu} + \mathbf{C}\boldsymbol{\mu} + \mathbf{C}^2\boldsymbol{\mu} + \cdots,$$

which may or may not converge. Write $\mathbf{C}$ in the spectral form $\mathbf{C} = \mathbf{P}\boldsymbol{\Lambda}\mathbf{Q}$, where $\mathbf{Q}\mathbf{P} = \mathbf{I}$ and $\boldsymbol{\Lambda}$ is a diagonal matrix of the characteristic roots. (Note that the characteristic roots in $\boldsymbol{\Lambda}$ and the vectors in $\mathbf{P}$ and $\mathbf{Q}$ may be complex.) We then obtain

$$\mathbf{y}_t = \mathbf{P}\left[\sum_{i=0}^{\infty}\boldsymbol{\Lambda}^i\right]\mathbf{Q}\boldsymbol{\mu}. \tag{20-19}$$

If all the roots of $\mathbf{C}$ are less than one in absolute value, then this vector will converge to the equilibrium

$$\mathbf{y}_\infty = (\mathbf{I} - \mathbf{C})^{-1}\boldsymbol{\mu}.$$

Nonexplosion of the powers of the roots of $\mathbf{C}$ is equivalent to $|\lambda_p| < 1$, or $|1/\lambda_p| > 1$, which was our original requirement. Note finally that because $\boldsymbol{\mu}$ is a multiple of the first column of $\mathbf{I}_p$, it must be the case that each element in the first column of $(\mathbf{I} - \mathbf{C})^{-1}$ is the same. At equilibrium, therefore, we must have $y_t = y_{t-1} = \cdots = y_\infty$.
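The stability check is mechanical. A small sketch (hypothetical helper; it assumes only the estimated autoregressive coefficients) builds the companion matrix $\mathbf{C}$ of (20-18) and examines its characteristic roots:

```python
import numpy as np

def is_stable(gamma):
    """Stability of y_t = alpha + gamma_1 y_{t-1} + ... + gamma_p y_{t-p}:
    form the companion matrix C of (20-18) and test whether all of its
    characteristic roots have modulus strictly less than one (equivalently,
    whether all roots of C(z) = 0 lie outside the unit circle)."""
    p = len(gamma)
    C = np.zeros((p, p))
    C[0, :] = gamma                # first row: gamma_1, ..., gamma_p
    C[1:, :-1] = np.eye(p - 1)     # identity block below the first row
    roots = np.linalg.eigvals(C)   # may include complex conjugate pairs
    return np.all(np.abs(roots) < 1.0), roots
```

Applied to the $\gamma$ estimates in Example 20.4 below, `is_stable([0.7233, 0.3914, -0.2337])` returns the roots 0.8631, $-0.5949$, and 0.4551 reported there.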

Example 20.4  A Rational Lag Model

Appendix Table F5.1 lists quarterly data on a number of macroeconomic variables including consumption and real GDP for the U.S. economy for the years 1950 to 2000, a total of 204 quarters. The model

$$c_t = \delta + \beta_0 y_t + \beta_1 y_{t-1} + \beta_2 y_{t-2} + \beta_3 y_{t-3} + \gamma_1 c_{t-1} + \gamma_2 c_{t-2} + \gamma_3 c_{t-3} + \varepsilon_t$$

is estimated using the logarithms of real consumption and real GDP, denoted $c_t$ and $y_t$. Ordinary least squares estimates of the parameters of the ARDL(3, 3) model are

$$c_t = 0.7233\,c_{t-1} + 0.3914\,c_{t-2} - 0.2337\,c_{t-3} + 0.5651\,y_t - 0.3909\,y_{t-1} - 0.2379\,y_{t-2} + 0.1902\,y_{t-3} + e_t.$$


TABLE 20.3  Lag Coefficients in a Rational Lag Model

Lag             0       1        2        3       4        5       6        7
ARDL            0.565   0.018   −0.004   0.062   0.039    0.054   0.039    0.041
Unrestricted    0.954   −0.090  −0.063   0.100   −0.024   0.057   −0.112   0.236

(A full set of quarterly dummy variables is omitted.) The Durbin–Watson statistic is 1.78597, so remaining autocorrelation seems unlikely to be a consideration. The lag coefficients are given by the equality

$$(\alpha_0 + \alpha_1 L + \alpha_2 L^2 + \cdots)(1 - \gamma_1 L - \gamma_2 L^2 - \gamma_3 L^3) = \beta_0 + \beta_1 L + \beta_2 L^2 + \beta_3 L^3.$$

Note that $A(L)$ is an infinite polynomial. The lag coefficients are

$1$: $\alpha_0 = \beta_0$ (which will always be the case),
$L$: $-\alpha_0\gamma_1 + \alpha_1 = \beta_1$, or $\alpha_1 = \beta_1 + \alpha_0\gamma_1$,
$L^2$: $-\alpha_0\gamma_2 - \alpha_1\gamma_1 + \alpha_2 = \beta_2$, or $\alpha_2 = \beta_2 + \alpha_0\gamma_2 + \alpha_1\gamma_1$,
$L^3$: $-\alpha_0\gamma_3 - \alpha_1\gamma_2 - \alpha_2\gamma_1 + \alpha_3 = \beta_3$, or $\alpha_3 = \beta_3 + \alpha_0\gamma_3 + \alpha_1\gamma_2 + \alpha_2\gamma_1$,
$L^4$: $-\alpha_1\gamma_3 - \alpha_2\gamma_2 - \alpha_3\gamma_1 + \alpha_4 = 0$, or $\alpha_4 = \gamma_1\alpha_3 + \gamma_2\alpha_2 + \gamma_3\alpha_1$,
$L^j$: $-\alpha_{j-3}\gamma_3 - \alpha_{j-2}\gamma_2 - \alpha_{j-1}\gamma_1 + \alpha_j = 0$, or $\alpha_j = \gamma_1\alpha_{j-1} + \gamma_2\alpha_{j-2} + \gamma_3\alpha_{j-3}$, $j = 5, 6, \ldots$,

and so on. From the fourth term onward, the series of lag coefficients follows the recursion $\alpha_j = \gamma_1\alpha_{j-1} + \gamma_2\alpha_{j-2} + \gamma_3\alpha_{j-3}$, which is the same as the autoregressive part of the ARDL model. The series of lag weights follows the same difference equation as the current and lagged values of $y_t$ after $r$ initial values, where $r$ is the order of the DL part of the ARDL model. The three characteristic roots of the $\mathbf{C}$ matrix are 0.8631, $-0.5949$, and 0.4551. Because all are less than one in absolute value, we conclude that the stochastic difference equation is stable. The first eight lag coefficients of the estimated ARDL model are listed in Table 20.3 with the first eight coefficients in an unrestricted lag model. The coefficients from the ARDL model only vaguely resemble those from the unrestricted model, but the erratic swings of the latter are prevented by the smooth equation from the distributed lag model. The estimated long-term effects (with standard errors in parentheses) from the two models are 1.0634 (0.00791) from the ARDL model and 1.0570 (0.002135) from the unrestricted model. Surprisingly, in view of the large and highly significant estimated coefficients, the lagged effects fall off essentially to zero after the initial impact.
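The recursion is easy to automate. The following sketch (hypothetical name) generates the ARDL row of Table 20.3 from the estimates above, up to rounding of the published coefficients:

```python
import numpy as np

def lag_weights(beta, gamma, n):
    """First n lag coefficients alpha_0, alpha_1, ... of B(L)/C(L),
    obtained by equating coefficients in A(L)C(L) = B(L):
        alpha_j = beta_j + gamma_1 alpha_{j-1} + ... + gamma_p alpha_{j-p},
    with beta_j = 0 for j > r."""
    p = len(gamma)
    alpha = np.zeros(n)
    for j in range(n):
        bj = beta[j] if j < len(beta) else 0.0
        alpha[j] = bj + sum(gamma[i] * alpha[j - 1 - i]
                            for i in range(min(p, j)))
    return alpha

beta = np.array([0.5651, -0.3909, -0.2379, 0.1902])
gamma = np.array([0.7233, 0.3914, -0.2337])
print(lag_weights(beta, gamma, 8))   # 0.565, 0.018, -0.004, 0.062, ...
```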

20.4.4 FORECASTING

Consider, first, a one-period-ahead forecast of $y_t$ in the ARDL($p, r$) model. It will be convenient to collect the terms in $\mu$, $x_t$, $w_t$, and so on in a single term,

$$\mu_t = \mu + \sum_{j=0}^{r}\beta_j x_{t-j} + \delta w_t.$$

Now, the ARDL model is just

$$y_t = \mu_t + \gamma_1 y_{t-1} + \cdots + \gamma_p y_{t-p} + \varepsilon_t.$$

Conditioned on the full set of information available up to time $T$ and on forecasts of the exogenous variables, the one-period-ahead forecast of $y_t$ would be

$$\hat{y}_{T+1|T} = \hat{\mu}_{T+1|T} + \gamma_1 y_T + \cdots + \gamma_p y_{T-p+1} + \hat{\varepsilon}_{T+1|T}.$$


To form a prediction interval, we will be interested in the variance of the forecast error,

$$e_{T+1|T} = \hat{y}_{T+1|T} - y_{T+1}.$$

This error will arise from three sources. First, in forecasting $\mu_t$, there will be two sources of error. The parameters $\mu$, $\delta$, and $\beta_0, \ldots, \beta_r$ will have been estimated, so $\hat{\mu}_{T+1|T}$ will differ from $\mu_{T+1}$ because of the sampling variation in these estimators. Second, if the exogenous variables, $x_{T+1}$ and $w_{T+1}$, have been forecasted, then to the extent that these forecasts are themselves imperfect, yet another source of error to the forecast will result. Finally, although we will forecast $\varepsilon_{T+1}$ with its expectation of zero, we would not assume that the actual realization will be zero, so this step will be a third source of error. In principle, an estimate of the forecast variance, $\text{Var}[e_{T+1|T}]$, would account for all three sources of error. In practice, handling the second of these errors is largely intractable, while the first is merely extremely difficult. [See Harvey (1990) and Hamilton (1994, especially Section 11.7) for useful discussion. McCullough (1996) presents results that suggest that "intractable" may be too pessimistic.] For the moment, we will concentrate on the third source and return to the other issues briefly at the end of the section.

Ignoring for the moment the variation in $\hat{\mu}_{T+1|T}$—that is, assuming that the parameters are known and the exogenous variables are forecasted perfectly—the variance of the forecast error will be simply

$$\text{Var}[e_{T+1|T} \mid x_{T+1}, w_{T+1}, \mu, \beta, \delta, y_T, \ldots] = \text{Var}[\varepsilon_{T+1}] = \sigma^2,$$

so at least within these assumptions, forming the forecast and computing the forecast variance are straightforward. Also, at this first step, given the data used for the forecast, the first part of the variance is also tractable. Let $\mathbf{z}_{T+1} = [1, x_{T+1}, x_T, \ldots, x_{T-r+1}, w_{T+1}, y_T, y_{T-1}, \ldots, y_{T-p+1}]$, and let $\hat{\boldsymbol{\theta}}$ denote the full estimated parameter vector. Then we would use

$$\text{Est. Var}[e_{T+1|T} \mid \mathbf{z}_{T+1}] = s^2 + \mathbf{z}_{T+1}'\left\{\text{Est. Asy. Var}[\hat{\boldsymbol{\theta}}]\right\}\mathbf{z}_{T+1}.$$

Now, consider forecasting further out beyond the sample period:

$$\hat{y}_{T+2|T} = \hat{\mu}_{T+2|T} + \gamma_1\hat{y}_{T+1|T} + \cdots + \gamma_p y_{T-p+2} + \hat{\varepsilon}_{T+2|T}.$$

Note that for period $T + 1$, the forecasted $\hat{y}_{T+1|T}$ is used. Making the substitution for $\hat{y}_{T+1|T}$, we have

$$\hat{y}_{T+2|T} = \hat{\mu}_{T+2|T} + \gamma_1(\hat{\mu}_{T+1|T} + \gamma_1 y_T + \cdots + \gamma_p y_{T-p+1} + \hat{\varepsilon}_{T+1|T}) + \cdots + \gamma_p y_{T-p+2} + \hat{\varepsilon}_{T+2|T},$$

and, likewise, for subsequent periods. Our method will be simplified considerably if we use the device we constructed in the previous section. For the first forecast period, write the forecast with the previous $p$ lagged values as

$$\begin{bmatrix} \hat{y}_{T+1|T} \\ y_T \\ y_{T-1} \\ \vdots \end{bmatrix} = \begin{bmatrix} \hat{\mu}_{T+1|T} \\ 0 \\ 0 \\ \vdots \end{bmatrix} + \begin{bmatrix} \gamma_1 & \gamma_2 & \cdots & \gamma_p \\ 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ 0 & \cdots & 1 & 0 \end{bmatrix}\begin{bmatrix} y_T \\ y_{T-1} \\ y_{T-2} \\ \vdots \end{bmatrix} + \begin{bmatrix} \hat{\varepsilon}_{T+1|T} \\ 0 \\ 0 \\ \vdots \end{bmatrix}.$$

The coefficient matrix on the right-hand side is $\mathbf{C}$, which we defined in (20-18). To maintain the thread of the discussion, we will continue to use the notation $\hat{\mu}_{T+1|T}$ for the forecast of the deterministic part of the model, although for the present, we are assuming that this value, as well as $\mathbf{C}$, is known with certainty.


With this modification, then, our forecast is the top element of the vector of forecasts,

$$\hat{\mathbf{y}}_{T+1|T} = \hat{\boldsymbol{\mu}}_{T+1|T} + \mathbf{C}\mathbf{y}_T + \hat{\boldsymbol{\varepsilon}}_{T+1|T}.$$

We are assuming that everything on the right-hand side is known except the period $T + 1$ disturbance, so the covariance matrix for this $p \times 1$ vector is

$$E[(\hat{\mathbf{y}}_{T+1|T} - \mathbf{y}_{T+1})(\hat{\mathbf{y}}_{T+1|T} - \mathbf{y}_{T+1})'] = \begin{bmatrix} \sigma^2 & 0 & \cdots \\ 0 & 0 & \\ \vdots & & \ddots \end{bmatrix},$$

and the forecast variance for $\hat{y}_{T+1|T}$ is just the upper left element, $\sigma^2$. Now, extend this notation to forecasting out to periods $T + 2$, $T + 3$, and so on:

$$\hat{\mathbf{y}}_{T+2|T} = \hat{\boldsymbol{\mu}}_{T+2|T} + \mathbf{C}\hat{\mathbf{y}}_{T+1|T} + \hat{\boldsymbol{\varepsilon}}_{T+2|T} = \hat{\boldsymbol{\mu}}_{T+2|T} + \mathbf{C}\hat{\boldsymbol{\mu}}_{T+1|T} + \mathbf{C}^2\mathbf{y}_T + \hat{\boldsymbol{\varepsilon}}_{T+2|T} + \mathbf{C}\hat{\boldsymbol{\varepsilon}}_{T+1|T}.$$

Once again, the only unknowns are the disturbances, so the forecast variance for this two-period-ahead forecasted vector is

$$\text{Var}[\hat{\boldsymbol{\varepsilon}}_{T+2|T} + \mathbf{C}\hat{\boldsymbol{\varepsilon}}_{T+1|T}] = \begin{bmatrix} \sigma^2 & 0 & \cdots \\ 0 & 0 & \\ \vdots & & \ddots \end{bmatrix} + \mathbf{C}\begin{bmatrix} \sigma^2 & 0 & \cdots \\ 0 & 0 & \\ \vdots & & \ddots \end{bmatrix}\mathbf{C}'.$$

Thus, the forecast variance for the two-step-ahead forecast is $\sigma^2[1 + \Phi(1)_{11}]$, where $\Phi(1)_{11}$ is the $(1, 1)$ element of $\Phi(1) = \mathbf{C}\mathbf{j}\mathbf{j}'\mathbf{C}'$, where $\mathbf{j}' = [\sigma, 0, \ldots, 0]$. By extending this device to a forecast $F$ periods beyond the sample period, we obtain

$$\hat{\mathbf{y}}_{T+F|T} = \sum_{f=1}^{F}\mathbf{C}^{f-1}\hat{\boldsymbol{\mu}}_{T+F-(f-1)|T} + \mathbf{C}^F\mathbf{y}_T + \sum_{f=1}^{F}\mathbf{C}^{f-1}\hat{\boldsymbol{\varepsilon}}_{T+F-(f-1)|T}. \tag{20-20}$$

This equation shows how to compute the forecasts, which is reasonably simple. We also obtain our expression for the conditional forecast variance,

$$\text{Conditional Var}[\hat{y}_{T+F|T}] = \sigma^2[1 + \Phi(1)_{11} + \Phi(2)_{11} + \cdots + \Phi(F-1)_{11}], \tag{20-21}$$

where $\Phi(i) = \mathbf{C}^i\mathbf{j}\mathbf{j}'\mathbf{C}^{i\prime}$.

The general form of the $F$-period-ahead forecast shows how the forecasts will behave as the forecast period extends further out beyond the sample period. If the equation is stable—that is, if all roots of the matrix $\mathbf{C}$ are less than one in absolute value—then $\mathbf{C}^F$ will converge to zero, and because the forecasted disturbances are zero, the forecast will be dominated by the sum in the first term. If we suppose, in addition, that the forecasts of the exogenous variables are just the period $T + 1$ forecasted values and not revised, then, as we found at the end of the previous section, the forecast will ultimately converge to

$$\lim_{F\to\infty}\hat{\mathbf{y}}_{T+F|T} \mid \hat{\boldsymbol{\mu}}_{T+1|T} = [\mathbf{I} - \mathbf{C}]^{-1}\hat{\boldsymbol{\mu}}_{T+1|T}.$$
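A compact implementation of (20-20) and (20-21) follows (hypothetical names; as in the text, the parameters, the $\hat{\mu}$ forecasts, and $\sigma^2$ are treated as known, so only the future disturbances contribute to the variance):

```python
import numpy as np

def ardl_forecast(mu_hat, gamma, y_hist, sigma2):
    """Multi-step ARDL forecasts via the companion form of (20-20), with
    the conditional forecast variances of (20-21).

    mu_hat : deterministic-part forecasts mu_hat_{T+1|T}, ..., mu_hat_{T+F|T}
    gamma  : autoregressive coefficients gamma_1, ..., gamma_p
    y_hist : most recent observations [y_T, y_{T-1}, ..., y_{T-p+1}]
    """
    p, F = len(gamma), len(mu_hat)
    C = np.zeros((p, p))
    C[0, :] = gamma
    C[1:, :-1] = np.eye(p - 1)
    state = np.array(y_hist, dtype=float)
    Ci = np.eye(p)                     # C^i, starting at i = 0
    forecasts, variances, cum = [], [], 0.0
    for f in range(F):
        mu_vec = np.zeros(p)
        mu_vec[0] = mu_hat[f]
        state = mu_vec + C @ state     # forecasted disturbances are zero
        forecasts.append(state[0])
        cum += Ci[0, 0] ** 2           # accumulate [(C^i)_11]^2, i = 0, ..., f
        variances.append(sigma2 * cum)
        Ci = C @ Ci
    return np.array(forecasts), np.array(variances)
```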


To account fully for all sources of variation in the forecasts, we would have to revise the forecast variance to include the variation in the forecasts of the exogenous variables and the variation in the parameter estimates. As noted, the first of these is likely to be intractable. For the second, this revision will be extremely difficult, the more so when we also account for the matrix $\mathbf{C}$, as well as the vector $\boldsymbol{\mu}$, being built up from the estimated parameters. The level of difficulty in this case falls from impossible to merely extremely difficult. In principle, what is required is

$$\text{Est. Conditional Var}[\hat{y}_{T+F|T}] = \sigma^2[1 + \Phi(1)_{11} + \Phi(2)_{11} + \cdots + \Phi(F-1)_{11}] + \mathbf{g}'\left\{\text{Est. Asy. Var}[\hat{\mu}, \hat{\beta}, \hat{\gamma}]\right\}\mathbf{g},$$

where

$$\mathbf{g} = \frac{\partial\hat{y}_{T+F|T}}{\partial[\hat{\mu}, \hat{\beta}, \hat{\gamma}]}.$$

[See Hamilton (1994, Appendix to Chapter 11) for formal derivation.]

One possibility is to use the bootstrap method. For this application, bootstrapping would involve sampling new sets of disturbances from the estimated distribution of $\varepsilon_t$, and then repeatedly rebuilding the within-sample time series of observations on $y_t$ by using

$$\hat{y}_t = \hat{\mu}_t + \gamma_1 y_{t-1} + \cdots + \gamma_p y_{t-p} + e_{bt}(m),$$

where $e_{bt}(m)$ is the estimated "bootstrapped" disturbance in period $t$ during replication $m$. The process is repeated $M$ times, with new parameter estimates and a new forecast generated in each replication. The variance of these forecasts produces the estimated forecast variance.6
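A bare-bones version of this bootstrap for an AR($p$) with a constant (a deliberate simplification in which the deterministic part is just $\mu$; all names are hypothetical) might look as follows:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_forecast_var(y, p, F, M=999):
    """Bootstrap estimate of the F-step-ahead forecast variance for an
    AR(p) with a constant: resample residuals, rebuild the in-sample
    series, re-estimate, and re-forecast in each of M replications."""
    T = len(y)
    Z = np.column_stack([np.ones(T - p)] +
                        [y[p - j:T - j] for j in range(1, p + 1)])
    b = np.linalg.lstsq(Z, y[p:], rcond=None)[0]
    e = y[p:] - Z @ b
    fc = np.empty(M)
    for m in range(M):
        yb = y.astype(float).copy()
        eb = rng.choice(e, size=T - p, replace=True)
        for t in range(p, T):          # rebuild the series period by period
            yb[t] = b[0] + b[1:] @ yb[t - p:t][::-1] + eb[t - p]
        Zb = np.column_stack([np.ones(T - p)] +
                             [yb[p - j:T - j] for j in range(1, p + 1)])
        bb = np.linalg.lstsq(Zb, yb[p:], rcond=None)[0]
        hist = list(yb[-p:][::-1])     # [y_T, y_{T-1}, ..., y_{T-p+1}]
        for _ in range(F):             # iterate the forecast out F periods
            hist = [bb[0] + bb[1:] @ np.array(hist[:p])] + hist
        fc[m] = hist[0]
    return fc.var()
```

The variance across the $M$ replicated forecasts is the bootstrap estimate of the forecast variance.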

20.5 METHODOLOGICAL ISSUES IN THE ANALYSIS OF DYNAMIC MODELS

20.5.1 AN ERROR CORRECTION MODEL

Consider the ARDL(1, 1) model, which has become a workhorse of the modern literature on time-series analysis. By defining the first differences $\Delta y_t = y_t - y_{t-1}$ and $\Delta x_t = x_t - x_{t-1}$, we can rearrange

$$y_t = \mu + \gamma_1 y_{t-1} + \beta_0 x_t + \beta_1 x_{t-1} + \varepsilon_t$$

to obtain

$$\Delta y_t = \mu + \beta_0\Delta x_t + (\gamma_1 - 1)(y_{t-1} - \theta x_{t-1}) + \varepsilon_t, \tag{20-22}$$

where $\theta = -(\beta_0 + \beta_1)/(\gamma_1 - 1)$. This form of the model is in the error correction form. In this form, we have an equilibrium relationship, $\Delta y_t = \mu + \beta_0\Delta x_t + \varepsilon_t$, and the equilibrium error, $(\gamma_1 - 1)(y_{t-1} - \theta x_{t-1})$, which accounts for the deviation of the pair of variables from that equilibrium. The model states that the change in $y_t$ from the previous period consists of the change associated with movement with $x_t$ along the long-run equilibrium path plus a part $(\gamma_1 - 1)$ of the deviation $(y_{t-1} - \theta x_{t-1})$ from the equilibrium. With a model in logs, this relationship would be in proportional terms.

6Bernard and Veall (1987) give an application of this technique. See, also, McCullough (1996).


[FIGURE 20.3  Consumption and Income Data. Quarterly values of CT and YT (logs of real consumption and real income), 1950–2000, plotted together.]

It is useful at this juncture to jump ahead a bit—we will return to this topic in some detail in Chapter 21—and explore why the error correction form might be such a useful formulation of this simple model. Consider the logged consumption and income data plotted in Figure 20.3. It is obvious on inspection of the figure that a simple regression of the log of consumption on the log of income would suggest a highly significant relationship; in fact, the simple linear regression produces a slope of 1.0567 with a $t$ ratio of 440.5 (!) and an $R^2$ of 0.99896. The disturbing result of a line of literature in econometrics that begins with Granger and Newbold (1974) and continues to the present is that this seemingly obvious and powerful relationship might be entirely spurious. Equally obvious from the figure is that both $c_t$ and $y_t$ are trending variables. If, in fact, both variables unconditionally were random walks with drift of the sort that we met at the end of Section 20.4.1—that is, $c_t = t\mu_c + v_t$ and likewise for $y_t$—then we would almost certainly observe a figure such as 20.3 and compelling regression results such as those, even if there were no relationship at all. In addition, there is ample evidence in the recent literature that low-frequency (infrequently observed, aggregated over long periods) flow variables such as consumption and output are, indeed, often well described as random walks. In such data, the ARDL(1, 1) model might appear to be entirely appropriate even if it is not. So, how is one to distinguish between the spurious regression and a genuine relationship as shown in the ARDL(1, 1)? The first difference of consumption produces $\Delta c_t = \mu_c + v_t - v_{t-1}$. If the random walk proposition is indeed correct, then the spurious appearance of regression will not survive the first differencing, whereas if there is a relationship between $c_t$ and $y_t$, then it will be preserved in the error correction model. We will return to this issue in Chapter 21, when we examine the issue of integration and cointegration of economic variables.
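The Granger and Newbold phenomenon is easy to reproduce by simulation. In the sketch below (illustrative drift and scale values; `ols_fit` is a hypothetical helper), two independent random walks with drift yield an apparently strong levels regression that evaporates in first differences:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two *independent* random walks with drift, mimicking trending log
# consumption and log income over 204 quarters.
T = 204
c = np.cumsum(0.008 + 0.01 * rng.standard_normal(T))
y = np.cumsum(0.008 + 0.01 * rng.standard_normal(T))

def ols_fit(x, z):
    """Slope and R^2 from a simple regression of z on a constant and x."""
    X = np.column_stack([np.ones(len(x)), x])
    b, res, *_ = np.linalg.lstsq(X, z, rcond=None)
    r2 = 1 - res[0] / ((z - z.mean()) ** 2).sum()
    return b[1], r2

print("levels:      slope=%.3f  R2=%.3f" % ols_fit(y, c))
print("differences: slope=%.3f  R2=%.3f" % ols_fit(np.diff(y), np.diff(c)))
```

The levels regression typically reports a slope near one with a very high $R^2$ even though the series are unrelated by construction; the differenced regression shows essentially no fit.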


[FIGURE 20.4  Consumption–Income Equilibrium Errors (EQERROR), quarterly, 1950–2002.]

Example 20.5  An Error Correction Model for Consumption

The error correction model is a nonlinear regression model, although in fact it is intrinsically linear and can be deduced simply from the unrestricted form directly above it. Because the parameter $\theta$ is actually of some interest, it might be more convenient to use nonlinear least squares and fit the second form directly. (The model is intrinsically linear, so the nonlinear least squares estimates will be identical to the derived linear least squares estimates.) The logs of consumption and income data in Appendix Table F5.1 are plotted in Figure 20.3. Not surprisingly, the two variables are drifting upward together. The estimated error correction model, with estimated standard errors in parentheses, is

$$c_t - c_{t-1} = \underset{(0.02899)}{-0.08533} + (\underset{(0.03029)}{0.90458} - 1)[c_{t-1} - \underset{(0.01052)}{1.06034}\,y_{t-1}] + \underset{(0.05090)}{0.58421}\,(y_t - y_{t-1}).$$

The estimated equilibrium errors are shown in Figure 20.4. Note that they are all positive, but that in each period, the adjustment is in the opposite direction. Thus (according to this model), when consumption is below its equilibrium value, the adjustment is upward, as might be expected.
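Since the model is intrinsically linear, $\theta$ and its standard error can be recovered directly from the OLS estimates of the unrestricted ARDL(1, 1); a minimal sketch (hypothetical name) using the delta method:

```python
import numpy as np

def ecm_from_ardl(gamma1, beta0, beta1, V):
    """Long-run coefficient theta = -(beta0 + beta1)/(gamma1 - 1) of the
    error correction form (20-22), recovered from ARDL(1, 1) least squares
    estimates, with a delta-method standard error.  V is the estimated
    covariance matrix of (gamma1, beta0, beta1), in that order.  Because
    the model is intrinsically linear, this equals the NLS estimate."""
    d = gamma1 - 1.0
    theta = -(beta0 + beta1) / d
    # Gradient: d theta/d gamma1 = (beta0+beta1)/d^2; d theta/d beta_j = -1/d.
    g = np.array([(beta0 + beta1) / d**2, -1.0 / d, -1.0 / d])
    se = np.sqrt(g @ V @ g)
    return theta, se
```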

20.5.2 AUTOCORRELATION

The disturbance in the error correction model is assumed to be nonautocorrelated. As we saw in Chapter 19, autocorrelation in a model can be induced by misspecification. An orthodox view of the modeling process might state, in fact, that this misspecification is the only source of autocorrelation. Although admittedly a bit optimistic in its implication, this view does raise an interesting methodological question. Consider once again the simplest model of autocorrelation from Chapter 19 (with a small change in notation to make it consistent with the present discussion),

$$y_t = \beta x_t + v_t, \qquad v_t = \rho v_{t-1} + \varepsilon_t, \tag{20-23}$$


where $\varepsilon_t$ is nonautocorrelated. As we found earlier, this model can be written as

$$y_t - \rho y_{t-1} = \beta(x_t - \rho x_{t-1}) + \varepsilon_t, \tag{20-24}$$

or

$$y_t = \rho y_{t-1} + \beta x_t - \beta\rho x_{t-1} + \varepsilon_t. \tag{20-25}$$

This model is an ARDL(1, 1) model in which $\beta_1 = -\gamma_1\beta_0$. Thus, we can view (20-25) as a restricted version of

$$y_t = \gamma_1 y_{t-1} + \beta_0 x_t + \beta_1 x_{t-1} + \varepsilon_t. \tag{20-26}$$

The crucial point here is that the (nonlinear) restriction on (20-26) is testable, so there is no compelling reason to begin with (20-23) without first establishing that the restriction is in fact consistent with the data. The upshot is that the AR(1) disturbance model, as a general proposition, is a testable restriction on a simpler, linear model, not necessarily a structure unto itself. Now, let us take this argument to its logical conclusion. The AR($p$) disturbance model,

$$v_t = \rho_1 v_{t-1} + \cdots + \rho_p v_{t-p} + \varepsilon_t,$$

or R(L)vt = εt , can be written in its moving average form as ε v = t . t R(L) 2 [Recall, in the AR(1) model, that εt = ut + ρut−1 + ρ ut−2 +···.] The regression model with this AR(p) disturbance is, therefore, ε y = βx + t . t t R(L) But consider instead the ARDL(p, p) model

C(L)yt = βB(L)xt + εt . These coefficients are the same model if B(L) = C(L). The implication is that any model with an AR(p) disturbance can be interpreted as a nonlinearly restricted version of an ARDL(p, p) model. The preceding discussion is a rather orthodox view of autocorrelation. It is pred- icated on the AR(p) model. Researchers have found that a more involved model for the process generating εt is sometimes called for. If the time-series structure of εt is not autoregressive, much of the preceding analysis will become intractable. As such, there remains room for disagreement with the strong conclusions. We will turn to models whose disturbances are mixtures of autoregressive and moving-average terms, which would be beyond the reach of this apparatus, in Chapter 21.

20.5.3 SPECIFICATION ANALYSIS

The usual explanation of autocorrelation is serial correlation in omitted variables. The preceding discussion and our results in Chapter 19 suggest another candidate: misspecification of what would otherwise be an unrestricted ARDL model. Thus, upon finding evidence of autocorrelation on the basis of a Durbin–Watson statistic or an LM statistic,


we might find that relaxing the nonlinear restrictions on the ARDL model is a preferable next step to "correcting" for the autocorrelation by imposing the restrictions and refitting the model by FGLS. Because an ARDL($p, r$) model with AR disturbances, even with $p = 0$, is implicitly an ARDL($p + d, r + d$) model, where $d$ is usually one, the approach suggested is just to add additional lags of the dependent variable to the model. Thus, one might even ask why we would ever use the familiar FGLS procedures. [See, e.g., Mizon (1995).] The payoff is that the restrictions imposed by the FGLS procedure produce a more efficient estimator than other methods. If the restrictions are in fact appropriate, then not imposing them amounts to not using information.

A related question now arises, apart from the issue of autocorrelation. In the context of the ARDL model, how should one do the specification search? (This question is not specific to the ARDL or even to the time-series setting.) Is it better to start with a small model and expand it until conventional fit measures indicate that additional variables are no longer improving the model, or is it better to start with a large model and pare away variables that conventional statistics suggest are superfluous? The first strategy, going from a simple model to a general model, is likely to be problematic, because the statistics computed for the narrower model are biased and inconsistent if the hypothesis is incorrect. Consider, for example, an LM test for autocorrelation in a model from which important variables have been omitted. The results are biased in favor of a finding of autocorrelation. The alternative approach is to proceed from a general model to a simple one. Thus, one might overfit the model and then subject it to whatever battery of tests are appropriate to produce the correct specification at the end of the procedure. In this instance, the estimates and test statistics computed from the overfit model, although inefficient, are not generally systematically biased. (We have encountered this issue at several points.)

The latter approach is common in modern analysis, but some words of caution are needed. The procedure routinely leads to overfitting the model. A typical time-series analysis might involve specifying a model with deep lags on all the variables and then paring away the model as conventional statistics indicate. The danger is that the resulting model might have an autoregressive structure with peculiar holes in it that would be hard to justify with any theory. Thus, a model for quarterly data that includes lags of 2, 3, 6, and 9 on the dependent variable would look suspiciously like the end result of a computer-driven fishing trip and, moreover, might not survive even moderate changes in the estimation sample. [As Hendry (1995) notes, a model in which the largest and most significant lag coefficient occurs at the last lag is surely misspecified.]

20.6 VECTOR AUTOREGRESSIONS

The preceding discussions can be extended to sets of variables. The resulting autoregressive model is

$$\mathbf{y}_t = \boldsymbol{\mu} + \boldsymbol{\Gamma}_1\mathbf{y}_{t-1} + \cdots + \boldsymbol{\Gamma}_p\mathbf{y}_{t-p} + \boldsymbol{\varepsilon}_t, \tag{20-27}$$

where $\boldsymbol{\varepsilon}_t$ is a vector of nonautocorrelated disturbances (innovations) with zero means and contemporaneous covariance matrix $E[\boldsymbol{\varepsilon}_t\boldsymbol{\varepsilon}_t'] = \boldsymbol{\Omega}$. This equation system is a vector autoregression, or VAR.


Equation (20-27) may also be written as

$$\boldsymbol{\Gamma}(L)\mathbf{y}_t = \boldsymbol{\mu} + \boldsymbol{\varepsilon}_t,$$

where $\boldsymbol{\Gamma}(L)$ is a matrix of polynomials in the lag operator. The individual equations are

$$y_{mt} = \mu_m + \sum_{j=1}^{p}(\boldsymbol{\Gamma}_j)_{m1}\,y_{1,t-j} + \sum_{j=1}^{p}(\boldsymbol{\Gamma}_j)_{m2}\,y_{2,t-j} + \cdots + \sum_{j=1}^{p}(\boldsymbol{\Gamma}_j)_{mM}\,y_{M,t-j} + \varepsilon_{mt},$$

where $(\boldsymbol{\Gamma}_j)_{ml}$ indicates the $(m, l)$ element of $\boldsymbol{\Gamma}_j$.

VARs have been used primarily in macroeconomics. Early in their development, it was argued by some authors [e.g., Sims (1980), Litterman (1979, 1986)] that VARs would forecast better than the sort of structural equation models discussed in Chapter 13. One could argue that as long as $\boldsymbol{\mu}$ includes the current observations on the (truly) relevant exogenous variables, the VAR is simply an overfit reduced form of some simultaneous equations model. [See Hamilton (1994, pp. 326–327).] The overfitting results from the possible inclusion of more lags than would be appropriate in the original model. (See Example 20.7 for a detailed discussion of one such model.) On the other hand, one of the virtues of the VAR is that it obviates a decision as to what contemporaneous variables are exogenous; it has only lagged (predetermined) variables on the right-hand side, and all variables are endogenous.

The motivation behind VARs in macroeconomics runs deeper than the statistical issues.7 The large structural equations models of the 1950s and 1960s were built on a theoretical foundation that has not proved satisfactory. That the forecasting performance of VARs surpassed that of large structural models—some of the later counterparts to Klein's Model I ran to hundreds of equations—signaled to researchers a more fundamental problem with the underlying methodology. The Keynesian style systems of equations describe a structural model of decisions (consumption, investment) that seem loosely to mimic individual behavior; see Keynes's formulation of the consumption function in Example 1.1, which is, perhaps, the canonical example. In the end, however, these decision rules are fundamentally ad hoc, and there is little basis on which to assume that they would aggregate to the macroeconomic level anyway. On a more practical level, the high inflation and high unemployment experienced in the 1970s were very badly predicted by the Keynesian paradigm.

From the point of view of the underlying paradigm, the most troubling criticism of the structural modeling approach comes in the form of "the Lucas critique" (1976), in which the author argued that the parameters of the "decision rules" embodied in the systems of structural equations would not remain stable when economic policies changed, even if the rules themselves were appropriate. Thus, the paradigm underlying the systems-of-equations approach to macroeconomic modeling is arguably fundamentally flawed. More recent research has reformulated the basic equations of macroeconomic models in terms of a microeconomic optimization foundation and has, at the same time, been much less ambitious in specifying the interrelationships among economic variables.

The preceding arguments have drawn researchers to less structured equation systems for forecasting. Thus, it is not just the form of the equations that has changed; the variables in the equations have changed as well.

7An extremely readable, nontechnical discussion of the paradigm shift in macroeconomic forecasting is given in Diebold (2003). See also Stock and Watson (2001).


The VAR is not just the reduced form of some structural model. For purposes of analyzing and forecasting macroeconomic activity and tracing the effects of policy changes and external stimuli on the economy, researchers have found that simple, small-scale VARs without a possibly flawed theoretical foundation have proved as good as or better than large-scale structural equation systems. In addition to forecasting, VARs have been used for two primary functions: testing Granger causality and studying the effects of policy through impulse response characteristics.

20.6.1 MODEL FORMS

To simplify things for the present, we note that the $p$th order VAR can be written as a first-order VAR as follows:

$$\begin{pmatrix} \mathbf{y}_t \\ \mathbf{y}_{t-1} \\ \cdots \\ \mathbf{y}_{t-p+1} \end{pmatrix} = \begin{pmatrix} \boldsymbol{\mu} \\ \mathbf{0} \\ \cdots \\ \mathbf{0} \end{pmatrix} + \begin{bmatrix} \boldsymbol{\Gamma}_1 & \boldsymbol{\Gamma}_2 & \cdots & \boldsymbol{\Gamma}_p \\ \mathbf{I} & \mathbf{0} & \cdots & \mathbf{0} \\ \cdots & \cdots & \cdots & \mathbf{0} \\ \mathbf{0} & \cdots & \mathbf{I} & \mathbf{0} \end{bmatrix}\begin{pmatrix} \mathbf{y}_{t-1} \\ \mathbf{y}_{t-2} \\ \cdots \\ \mathbf{y}_{t-p} \end{pmatrix} + \begin{pmatrix} \boldsymbol{\varepsilon}_t \\ \mathbf{0} \\ \cdots \\ \mathbf{0} \end{pmatrix}.$$

[See, e.g., (20-18).] This means that we do not lose any generality in casting the treatment in terms of a first-order model

$$\mathbf{y}_t = \boldsymbol{\mu} + \boldsymbol{\Gamma}\mathbf{y}_{t-1} + \boldsymbol{\varepsilon}_t.$$

In Example 15.10, we examined Dahlberg and Johansson's model for municipal finances in Sweden, in which $\mathbf{y}_t = [S_t, R_t, G_t]'$, where $S_t$ is spending, $R_t$ is receipts, $G_t$ is grants from the central government, and $p = 3$. We will continue that application in Example 20.7.

In principle, the VAR model is a seemingly unrelated regressions model—indeed, a particularly simple one because each equation has the same set of regressors. This is the traditional form of the model as originally proposed, for example, by Sims (1980). The VAR may also be viewed as the reduced form of a simultaneous equations model; the corresponding structure would then be

$$\boldsymbol{\Theta}\mathbf{y}_t = \boldsymbol{\alpha} + \boldsymbol{\Phi}\mathbf{y}_{t-1} + \boldsymbol{\omega}_t,$$

where $\boldsymbol{\Theta}$ is a nonsingular matrix and $\text{Var}[\boldsymbol{\omega}_t] = \boldsymbol{\Sigma}$. In one of Cecchetti and Rich's (2001) formulations, for example, $\mathbf{y}_t = [\Delta y_t, \pi_t]'$, where $y_t$ is the log of aggregate real output, $\pi_t$ is the inflation rate from time $t - 1$ to time $t$,

$$\boldsymbol{\Theta} = \begin{bmatrix} 1 & -\theta_{12} \\ -\theta_{21} & 1 \end{bmatrix},$$

and $p = 8$. (We will examine their model in Section 20.6.8.) In this form, we have a conventional simultaneous equations model, which we analyzed in detail in Chapter 13. As we saw, for such a model to be identified—that is, estimable—certain restrictions must be placed on the structural coefficients. The reason for this is that ultimately, only the original VAR form, now the reduced form, is estimated from the data; the structural parameters must be deduced from these coefficients. In this model, to deduce these structural parameters, they must be extracted from the reduced-form parameters, $\boldsymbol{\Gamma} = \boldsymbol{\Theta}^{-1}\boldsymbol{\Phi}$, $\boldsymbol{\mu} = \boldsymbol{\Theta}^{-1}\boldsymbol{\alpha}$, and $\boldsymbol{\Omega} = \boldsymbol{\Theta}^{-1}\boldsymbol{\Sigma}(\boldsymbol{\Theta}^{-1})'$. We analyzed this issue in detail in Section 13.3. The results would be the same here. In Cecchetti and Rich's application, certain restrictions were placed on the lag coefficients in order to secure identification.


20.6.2 ESTIMATION

In the form of (20-27)—that is, without autocorrelation of the disturbances—VARs are particularly simple to estimate. Although the equation system can be exceedingly large, it is, in fact, a seemingly unrelated regressions model with identical regressors. As such, the equations should be estimated separately by ordinary least squares. (See Section 10.2.2 for discussion of SUR systems with identical regressors.) The disturbance covariance matrix can then be estimated with average sums of squares or cross-products of the least squares residuals. If the disturbances are normally distributed, then these least squares estimators are also maximum likelihood. If not, then OLS remains an efficient GMM estimator. The extension to instrumental variables and GMM is a bit more complicated, as the model now contains multiple equations (see Section 15.6.3), but since the equations are all linear, the necessary extensions are at least relatively straightforward. GMM estimation of the VAR system is a special case of the model discussed in Section 15.6.3. (We will examine an application in Example 20.7.)

The proliferation of parameters in VARs has been cited as a major disadvantage of their use. Consider, for example, a VAR involving five variables and three lags. Each $\boldsymbol{\Gamma}$ has 25 unconstrained elements, and there are three of them, for a total of 75 free parameters, plus any others in $\boldsymbol{\mu}$, plus $5(6)/2 = 15$ free parameters in $\boldsymbol{\Omega}$. On the other hand, each single equation has only 25 parameters, and at least given sufficient degrees of freedom—there's the rub—a linear regression with 25 parameters is simple work. Moreover, applications rarely involve even as many as four variables, so the model-size issue may well be exaggerated.
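Because every equation has the same regressors, the entire system can be estimated with a single multivariate least squares solve; a minimal sketch (hypothetical name), assuming the data are held in a $T \times M$ array:

```python
import numpy as np

def var_ols(Y, p):
    """Equation-by-equation OLS estimation of a VAR(p).  Y is T x M.
    With identical regressors in every equation, single-equation OLS is
    efficient, and one multivariate solve handles all M equations at once.
    Returns the constant vector, the list [Gamma_1, ..., Gamma_p], and the
    estimated innovation covariance matrix."""
    T, M = Y.shape
    X = np.column_stack([np.ones(T - p)] +
                        [Y[p - j:T - j] for j in range(1, p + 1)])
    B = np.linalg.lstsq(X, Y[p:], rcond=None)[0]   # (1 + Mp) x M
    E = Y[p:] - X @ B
    Omega = (E.T @ E) / (T - p)                    # average cross-products
    mu = B[0]
    Gammas = [B[1 + (j - 1) * M: 1 + j * M].T for j in range(1, p + 1)]
    return mu, Gammas, Omega
```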

20.6.3 TESTING PROCEDURES

Formal testing in the VAR setting usually centers either on determining the appropriate lag length (a specification search) or on whether certain blocks of the coefficient matrices are zero (a simple linear restriction on the collection of slope parameters). Both types of hypotheses may be treated as sets of linear restrictions on the elements in $\boldsymbol{\gamma} = \text{vec}[\boldsymbol{\mu}, \boldsymbol{\Gamma}_1, \boldsymbol{\Gamma}_2, \ldots, \boldsymbol{\Gamma}_p]$. We begin by assuming that the disturbances have a joint normal distribution. Let $\mathbf{W}$ be the $M \times M$ residual covariance matrix based on a restricted model, and let $\mathbf{W}^*$ be its counterpart when the model is unrestricted. Then the likelihood ratio statistic,

$$\lambda = T(\ln|\mathbf{W}| - \ln|\mathbf{W}^*|),$$

can be used to test the hypothesis. The statistic would have a limiting chi-squared distribution with degrees of freedom equal to the number of restrictions. In principle, one might base a specification search for the right lag length on this calculation. The procedure would be to test down from, say, lag $q$ to lag $p$. The general-to-simple principle discussed in Section 20.5.3 would be to set the maximum lag length and test down from it until deletion of the last set of lags leads to a significant loss of fit. At each step at which the alternative lag model has excess terms, the estimators of the superfluous coefficient matrices would have probability limits of zero and the likelihood function would (again, asymptotically) resemble that of the model with the correct number of lags. Formally, suppose the appropriate lag length is $p$ but the model is fit with $q \geq p + 1$ lagged terms.


Then, under the null hypothesis,

$$\lambda_q = T\left[\ln|\mathbf{W}(\boldsymbol{\mu}, \boldsymbol{\Gamma}_1, \ldots, \boldsymbol{\Gamma}_{q-1})| - \ln|\mathbf{W}^*(\boldsymbol{\mu}, \boldsymbol{\Gamma}_1, \ldots, \boldsymbol{\Gamma}_q)|\right] \xrightarrow{d} \chi^2[M^2].$$

The same approach would be used to test other restrictions. Thus, the Granger causality test noted in Section 20.6.5 would fit the model with and without certain blocks of zeros in the coefficient matrices, then refer the value of $\lambda$ once again to the chi-squared distribution.

For specification searches for the right lag, the suggested procedure may be less effective than one based on the information criteria suggested for other linear models (see Section 7.4). Lutkepohl (2005, pp. 128–135) suggests an alternative approach based on minimizing functions of the information criteria we have considered earlier,

$$\lambda^* = \ln(|\mathbf{W}|) + (pM^2 + M)\,\frac{IC(T)}{T},$$

where $T$ is the sample size, $p$ is the number of lags, $M$ is the number of equations, and $IC(T) = 2$ for the Akaike information criterion and $\ln T$ for the Schwarz (Bayesian) information criterion. We should note that this is not a test statistic; it is a diagnostic tool that we are using to conduct a specification search. Also, as in all such cases, the testing procedure should be from a larger model to a smaller one to avoid the misspecification problems induced by a lag length that is smaller than the appropriate one.

The preceding has relied heavily on the normality assumption. Because most recent applications of these techniques have either treated the least squares estimators as robust (distribution-free) estimators, or used GMM (as we did in Chapter 15), it is necessary to consider a different approach that does not depend on normality. An alternative approach that should be robust to variations in the underlying distributions is the Wald statistic. [See Lutkepohl (2005, pp. 93–95).] The full set of coefficients in the model may be arrayed in a single coefficient vector, $\boldsymbol{\gamma}$. Let $\mathbf{c}$ be the sample estimator of $\boldsymbol{\gamma}$ and let $\mathbf{V}$ denote the estimated asymptotic covariance matrix. Then, the hypothesis in question (lag length, or other linear restriction) can be cast in the form $\mathbf{R}\boldsymbol{\gamma} - \mathbf{q} = \mathbf{0}$. The Wald statistic for testing the null hypothesis is

$$W = (\mathbf{R}\mathbf{c} - \mathbf{q})'[\mathbf{R}\mathbf{V}\mathbf{R}']^{-1}(\mathbf{R}\mathbf{c} - \mathbf{q}).$$

Under the null hypothesis, this statistic has a limiting chi-squared distribution with degrees of freedom equal to $J$, the number of restrictions (rows in $\mathbf{R}$). For the specification search for the appropriate lag length (or the Granger causality test discussed in the next section), the null hypothesis will be that a certain subvector of $\boldsymbol{\gamma}$, say $\boldsymbol{\gamma}_0$, equals zero. In this case, the statistic will be

$$W_0 = \mathbf{c}_0'\mathbf{V}_{00}^{-1}\mathbf{c}_0,$$

where $\mathbf{V}_{00}$ denotes the corresponding submatrix of $\mathbf{V}$.

Because time-series data sets are often only moderately long, use of the limiting distribution for the test statistic may be a bit optimistic. Also, the Wald statistic does not account for the fact that the asymptotic covariance matrix is estimated using a finite sample. In our analysis of the classical linear regression model, we accommodated these considerations by using the $F$ distribution instead of the limiting chi-squared. (See Section 5.4.) The adjustment made was to refer $W/J$ to the $F[J, T - K]$ distribution. This produces a more conservative test—the corresponding critical values of $JF$ converge to those of the chi-squared from above.


A remaining complication is to decide what degrees of freedom to use for the denominator. It might seem natural to use $MT$ minus the number of parameters, which would be correct if the restrictions are imposed on all equations simultaneously, because there are that many "observations." In testing for causality, as in Section 20.6.5 below, Lutkepohl (2005, p. 95) argues that $MT$ is excessive, because the restrictions are not imposed on all equations. When the causality test involves testing for zero restrictions within a single equation, the appropriate degrees of freedom would be $T - Mp - 1$ for that one equation.
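A sketch of the general-to-simple lag search based on the likelihood ratio statistic follows (hypothetical names; every model is fit on the common sample that drops the first $q$ observations so the residual covariance determinants are comparable):

```python
import numpy as np

def resid_cov(Y, k, q):
    """Residual covariance matrix of a VAR(k), fit by OLS on the sample
    a VAR(q) would use (the first q observations are dropped)."""
    T, M = Y.shape
    X = np.column_stack([np.ones(T - q)] +
                        [Y[q - j:T - j] for j in range(1, k + 1)])
    B = np.linalg.lstsq(X, Y[q:], rcond=None)[0]
    E = Y[q:] - X @ B
    return (E.T @ E) / (T - q)

def test_down(Y, q):
    """LR statistics for lag q vs. q-1, then q-1 vs. q-2, and so on; each
    is asymptotically chi-squared with M^2 degrees of freedom under the
    null that the longest lag is superfluous."""
    T, M = Y.shape
    stats = []
    for k in range(q, 0, -1):
        W_u = resid_cov(Y, k, q)       # unrestricted: k lags
        W_r = resid_cov(Y, k - 1, q)   # restricted: k - 1 lags
        lam = (T - q) * (np.log(np.linalg.det(W_r)) -
                         np.log(np.linalg.det(W_u)))
        stats.append((k, lam))
    return stats   # compare each lam with the chi-squared[M*M] critical value
```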

20.6.4 EXOGENEITY

In the classical regression model with nonstochastic regressors, there is no ambiguity about which is the independent or conditioning or "exogenous" variable in the model

$$y_t = \beta_1 + \beta_2 x_t + \varepsilon_t. \tag{20-28}$$

This is the kind of characterization that might apply in an experimental situation in which the analyst is choosing the values of $x_t$. But the case of nonstochastic regressors has little to do with the sort of modeling that will be of interest in this and the next chapter. There is no basis for the narrow assumption of nonstochastic regressors, and, in fact, in most of the analysis that we have done to this point, we have left this assumption far behind. With stochastic regressor(s), the regression relationship such as the preceding one becomes a conditional mean in a bivariate distribution. In this more realistic setting, what constitutes an "exogenous" variable becomes ambiguous. Assuming that the regression relationship is linear, (20-28) can be written (trivially) as

$$y_t = E[y_t \mid x_t] + \big(y_t - E[y_t \mid x_t]\big),$$

where the familiar moment condition $E[x_t\varepsilon_t] = 0$ follows by construction. But this form of the model is no more the "correct" equation than would be

$$x_t = \delta_1 + \delta_2 y_t + \omega_t,$$

which is (we assume)

$$x_t = E[x_t \mid y_t] + \big(x_t - E[x_t \mid y_t]\big),$$

and now, $E[y_t\omega_t] = 0$. Both equations are correctly specified in the context of the bivariate distribution, so there is nothing to define one variable or the other as "exogenous." This might seem puzzling, but it is, in fact, at the heart of the matter when one considers modeling in a world in which variables are jointly determined. The definition of exogeneity depends on the analyst's understanding of the world they are modeling, and, in the final analysis, on the purpose to which the model is to be put.

The methodological platform on which this discussion rests is the classic paper by Engle, Hendry, and Richard (1983), where they point out that exogeneity is not an absolute concept at all; it is defined in the context of the model. The central idea, which will be very useful to us here, is that we define a variable (set of variables) as exogenous in the context of our model if the joint density may be written

$$f(y_t, x_t) = f(y_t \mid \boldsymbol{\beta}, x_t) \times f(x_t \mid \boldsymbol{\theta}),$$

where the parameters in the conditional distribution do not appear in and are functionally unrelated to those in the marginal distribution of $x_t$. By this arrangement, we can think of "autonomous variation" of the parameters of interest, $\boldsymbol{\beta}$.


The parameters in the conditional model for $y_t \mid x_t$ can be analyzed as if they could vary independently of those in the marginal distribution of $x_t$. If this condition does not hold, then we cannot think of variation of those parameters without linking that variation to some effect in the marginal distribution of $x_t$. In this case, it makes little sense to think of $x_t$ as somehow being determined "outside" the (conditional) model. (We considered this issue in Section 13.8 in the context of a simultaneous equations model.)

A second form of exogeneity we will consider is strong exogeneity, which is sometimes called Granger noncausality. Granger noncausality can be superficially defined by the assumption

$$E[y_t \mid y_{t-1}, x_{t-1}, x_{t-2}, \ldots] = E[y_t \mid y_{t-1}].$$

That is, lagged values of $x_t$ do not provide information about the conditional mean of $y_t$ once lagged values of $y_t$ itself are accounted for. We will consider this issue at the end of this chapter. For the present, we note that most of the models we will examine will explicitly fail this assumption. To put this back in the context of our model, we will be assuming that in the model

$$y_t = \beta_1 + \beta_2 x_t + \beta_3 x_{t-1} + \gamma y_{t-1} + \varepsilon_t,$$

and the extensions that we will consider, $x_t$ is weakly exogenous—we can meaningfully estimate the parameters of the regression equation independently of the marginal distribution of $x_t$—but we will allow for Granger causality between $x_t$ and $y_t$, thus generally not assuming strong exogeneity.

20.6.5 TESTING FOR GRANGER CAUSALITY

Causality in the sense defined by Granger (1969) and Sims (1972) is inferred when lagged values of a variable, say, $x_t$, have explanatory power in a regression of a variable $y_t$ on lagged values of $y_t$ and $x_t$. (See Section 13.2.2.) The VAR can be used to test the hypothesis.8 Tests of the restrictions can be based on simple $F$ tests in the single equations of the VAR model. That the unrestricted equations have identical regressors means that these tests can be based on the results of simple OLS estimates. The notion can be extended in a system of equations to attempt to ascertain if a given variable is weakly exogenous to the system. If lagged values of a variable $x_t$ have no explanatory power for any of the variables in a system, then we would view $x_t$ as weakly exogenous to the system. Once again, this specification can be tested with a likelihood ratio test as described later—the restriction will be to put "holes" in one or more $\boldsymbol{\Gamma}$ matrices—or with a form of $F$ test constructed by stacking the equations.

Example 20.6  Granger Causality9

All but one of the major recessions in the U.S. economy since World War II have been preceded by large increases in the price of crude oil. Does movement of the price of oil cause movements in U.S. GDP in the Granger sense? Let $\mathbf{y}_t = [\text{GDP}, \text{crude oil price}]_t'$. Then,

8See Geweke, Meese, and Dent (1983), Sims (1980), and Stock and Watson (2001).
9This example is adapted from Hamilton (1994, pp. 307–308).


a simple VAR would be

$$\mathbf{y}_t = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} + \begin{bmatrix} \alpha_1 & \alpha_2 \\ \beta_1 & \beta_2 \end{bmatrix}\mathbf{y}_{t-1} + \begin{pmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{pmatrix}.$$

To assert a causal relationship between oil prices and GDP, we must find that α2 is not zero; previous movements in oil prices do help explain movements in GDP even in the presence of the lagged value of GDP. Consistent with our earlier discussion, this fact, in itself, is not sufficient to assert a causal relationship. We would also have to demonstrate that there were no other intervening explanations that would explain movements in oil prices and GDP. (We will examine a more extensive application in Example 20.7.)
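In the single-equation form, the test is just an $F$ test of the joint significance of the lags of $x_t$; a minimal sketch (hypothetical name), assuming `y` and `x` are one-dimensional arrays and $p$ lags of each are used:

```python
import numpy as np

def granger_f(y, x, p):
    """F test of Granger noncausality of x for y: regress y_t on a
    constant and p lags of y, with and without p lags of x, and compare
    sums of squared residuals.  Under the null that the x lags have zero
    coefficients, the statistic is approximately F[p, df2]."""
    T = len(y)
    lags = lambda z: [z[p - j:T - j] for j in range(1, p + 1)]
    X_r = np.column_stack([np.ones(T - p)] + lags(y))       # restricted
    X_u = np.column_stack([X_r] + lags(x))                  # unrestricted
    def ssr(X):
        b = np.linalg.lstsq(X, y[p:], rcond=None)[0]
        return ((y[p:] - X @ b) ** 2).sum()
    e_r, e_u = ssr(X_r), ssr(X_u)
    df2 = (T - p) - X_u.shape[1]
    return ((e_r - e_u) / p) / (e_u / df2), p, df2
```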

To establish the general result, it will prove useful to write the VAR in the multivariate regression format we used in Section 16.9.3.b. Partition the two data vectors $\mathbf{y}_t$ and $\mathbf{x}_t$ into $[\mathbf{y}_{1t}, \mathbf{y}_{2t}]$ and $[\mathbf{x}_{1t}, \mathbf{x}_{2t}]$. Consistent with our earlier discussion, $\mathbf{x}_1$ is lagged values of $\mathbf{y}_1$ and $\mathbf{x}_2$ is lagged values of $\mathbf{y}_2$. The VAR with this partitioning would be

$$\begin{bmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \end{bmatrix} = \begin{bmatrix} \boldsymbol{\Gamma}_{11} & \boldsymbol{\Gamma}_{12} \\ \boldsymbol{\Gamma}_{21} & \boldsymbol{\Gamma}_{22} \end{bmatrix}\begin{bmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \end{bmatrix} + \begin{bmatrix} \boldsymbol{\varepsilon}_1 \\ \boldsymbol{\varepsilon}_2 \end{bmatrix}, \qquad \text{Var}\begin{bmatrix} \boldsymbol{\varepsilon}_{1t} \\ \boldsymbol{\varepsilon}_{2t} \end{bmatrix} = \begin{bmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{bmatrix}.$$

We would still obtain the unrestricted maximum likelihood estimates by least squares regressions. For testing Granger causality, the hypothesis $\boldsymbol{\Gamma}_{12} = \mathbf{0}$ is of interest. (See Example 20.6.) For testing this hypothesis, the second set of equations is irrelevant; only the restricted and unrestricted versions of the first set are needed. The hypothesis can be tested using the likelihood ratio statistic. For the present application, testing means computing

$\mathbf{S}_{11} =$ residual covariance matrix when current values of $\mathbf{y}_1$ are regressed on values of both $\mathbf{x}_1$ and $\mathbf{x}_2$,

$\mathbf{S}_{11}(0) =$ residual covariance matrix when current values of $\mathbf{y}_1$ are regressed only on values of $\mathbf{x}_1$.

$$\lambda = T(\ln|\mathbf{S}_{11}(0)| - \ln|\mathbf{S}_{11}|).$$

The number of degrees of freedom is the number of zero restrictions.

The fact that this test is wedded to the normal distribution limits its generality. The Wald test or its transformation to an approximate $F$ statistic as described in Section 20.6.3 is an alternative that should be more generally applicable. When the equation system is fit by GMM, as in Example 20.7, the simplicity of the likelihood ratio test is lost. The Wald statistic remains usable, however. Another possibility is to use the GMM counterpart to the likelihood ratio statistic (see Section 15.5.2) based on the GMM criterion functions. This is just the difference in the GMM criteria. Fitting both restricted and unrestricted models in this framework may be burdensome, but having set up the GMM estimator for the (larger) unrestricted model, imposing the zero restrictions of the smaller model should require only a minor modification.

There is a complication in these causality tests. The VAR can be motivated by the Wold representation theorem (see Section 21.2.5, Theorem 21.1), although with assumed nonautocorrelated disturbances, the motivation is incomplete. On the other hand, there is no formal theory behind the formulation. As such, the causality tests are predicated on a model that may, in fact, be missing either intervening variables or additional lagged effects that should be present but are not.


For the first of these, the problem is that a finding of causal effects might equally well result from the omission of a variable that is correlated with both (or all) of the left-hand-side variables.

20.6.6 IMPULSE RESPONSE FUNCTIONS

Any VAR can be written as a first-order model by augmenting it, if necessary, with additional identity equations. For example, the model

$$\mathbf{y}_t = \boldsymbol{\mu} + \boldsymbol{\Gamma}_1\mathbf{y}_{t-1} + \boldsymbol{\Gamma}_2\mathbf{y}_{t-2} + \mathbf{v}_t$$

can be written

$$\begin{bmatrix} \mathbf{y}_t \\ \mathbf{y}_{t-1} \end{bmatrix} = \begin{bmatrix} \boldsymbol{\mu} \\ \mathbf{0} \end{bmatrix} + \begin{bmatrix} \boldsymbol{\Gamma}_1 & \boldsymbol{\Gamma}_2 \\ \mathbf{I} & \mathbf{0} \end{bmatrix}\begin{bmatrix} \mathbf{y}_{t-1} \\ \mathbf{y}_{t-2} \end{bmatrix} + \begin{bmatrix} \mathbf{v}_t \\ \mathbf{0} \end{bmatrix},$$

which is a first-order model. We can study the dynamic characteristics of the model in either form, but the second is more convenient, as will soon be apparent. As we analyzed earlier, in the model

$$\mathbf{y}_t = \boldsymbol{\mu} + \boldsymbol{\Gamma}\mathbf{y}_{t-1} + \mathbf{v}_t,$$

dynamic stability is achieved if the characteristic roots of $\boldsymbol{\Gamma}$ have modulus less than one. (The roots may be complex, because $\boldsymbol{\Gamma}$ need not be symmetric. See Section 20.4.3 for the case of a single equation and Section 13.9 for analysis of essentially this model in a simultaneous equations context.) Assuming that the equation system is stable, the equilibrium is found by obtaining the final form of the system. We can do this step by repeated substitution, or more simply by using the lag operator to write

$$\mathbf{y}_t = \boldsymbol{\mu} + \boldsymbol{\Gamma}(L)\mathbf{y}_t + \mathbf{v}_t,$$

or

$$[\mathbf{I} - \boldsymbol{\Gamma}(L)]\mathbf{y}_t = \boldsymbol{\mu} + \mathbf{v}_t.$$

With the stability condition, we have

$$\begin{aligned} \mathbf{y}_t &= [\mathbf{I} - \boldsymbol{\Gamma}(L)]^{-1}(\boldsymbol{\mu} + \mathbf{v}_t) \\ &= (\mathbf{I} - \boldsymbol{\Gamma})^{-1}\boldsymbol{\mu} + \sum_{i=0}^{\infty}\boldsymbol{\Gamma}^i\mathbf{v}_{t-i} \\ &= \bar{\mathbf{y}} + \sum_{i=0}^{\infty}\boldsymbol{\Gamma}^i\mathbf{v}_{t-i} \\ &= \bar{\mathbf{y}} + \mathbf{v}_t + \boldsymbol{\Gamma}\mathbf{v}_{t-1} + \boldsymbol{\Gamma}^2\mathbf{v}_{t-2} + \cdots. \end{aligned} \tag{20-29}$$

The coefficients in the powers of $\boldsymbol{\Gamma}$ are the multipliers in the system. In fact, by renaming things slightly, this set of results is precisely the one we examined in Section 13.9 in our discussion of dynamic simultaneous equations models. We will change the interpretation slightly here, however. As we did in Section 13.9, we consider the conceptual experiment of disturbing a system in equilibrium. Suppose that $\mathbf{v}$ has equaled $\mathbf{0}$ for long enough that $\mathbf{y}$ has reached equilibrium, $\bar{\mathbf{y}}$. Now we consider injecting a shock to the system by changing one of the $v$'s, for one period, and then returning it to zero thereafter.


As we saw earlier, $y_{mt}$ will move away from, then return to, its equilibrium. The path whereby the variables return to the equilibrium is called the impulse response of the VAR.10 In the autoregressive form of the model, we can identify each innovation, $v_{mt}$, with a particular variable in $\mathbf{y}_t$, say, $y_{mt}$. Consider then the effect of a one-time shock to the system, $dv_{mt}$. As compared with the equilibrium, we will have, in the current period,

$$y_{mt} - \bar{y}_m = dv_{mt} = \phi_{mm}(0)\,dv_{mt}.$$

One period later, we will have

$$y_{m,t+1} - \bar{y}_m = (\boldsymbol{\Gamma})_{mm}\,dv_{mt} = \phi_{mm}(1)\,dv_{mt}.$$

Two periods later,

$$y_{m,t+2} - \bar{y}_m = (\boldsymbol{\Gamma}^2)_{mm}\,dv_{mt} = \phi_{mm}(2)\,dv_{mt},$$

and so on. The function $\phi_{mm}(i)$ gives the impulse response characteristics of variable $y_m$ to innovations in $v_m$. A useful way to characterize the system is to plot the impulse response functions. The preceding traces through the effect on variable $m$ of a one-time innovation in $v_m$. We could also examine the effect of a one-time innovation of $v_l$ on variable $m$. The impulse response function would be

$$\phi_{ml}(i) = \text{element } (m, l) \text{ in } \boldsymbol{\Gamma}^i.$$

Point estimation of $\phi_{ml}(i)$ using the estimated model parameters is straightforward. Confidence intervals present a more difficult problem because the estimated functions $\hat{\phi}_{ml}(i, \hat{\boldsymbol{\beta}})$ are so highly nonlinear in the original parameter estimates. The delta method has thus proved unsatisfactory. Kilian (1998) presents results that suggest that bootstrapping may be the more productive approach to statistical inference regarding impulse response functions.
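Point estimation, at least, is a few lines of code: stack the coefficient matrices into the companion form and read $\phi_{ml}(i)$ off the powers of $\boldsymbol{\Gamma}$. A minimal sketch (hypothetical name):

```python
import numpy as np

def impulse_responses(Gammas, horizon):
    """Impulse response functions of a VAR: phi_ml(i) is element (m, l)
    of Gamma^i in the first-order (companion) form.  Returns an array of
    shape (horizon + 1, M, M) whose [i] slice maps a one-time innovation
    dv_l at time t into the response of variable m at time t + i."""
    p = len(Gammas)
    M = Gammas[0].shape[0]
    G = np.zeros((M * p, M * p))
    G[:M, :] = np.hstack(Gammas)           # [Gamma_1, ..., Gamma_p]
    G[M:, :-M] = np.eye(M * (p - 1))       # identity blocks below
    phi = np.empty((horizon + 1, M, M))
    Gi = np.eye(M * p)
    for i in range(horizon + 1):
        phi[i] = Gi[:M, :M]                # upper-left block of Gamma^i
        Gi = G @ Gi
    return phi
```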

20.6.7 STRUCTURAL VARs

The VAR approach to modeling dynamic behavior of economic variables has provided some interesting insights and appears [see Litterman (1986)] to bring some real benefits for forecasting. The method has received some strident criticism for its atheoretical approach, however. The "unrestricted" nature of the lag structure in (20-27) could be synonymous with "unstructured." With no theoretical input to the model, it is difficult to claim that its output provides much of a theoretically justified result. For example, how are we to interpret the impulse response functions derived in the previous section? What lies behind much of this discussion is the idea that there is, in fact, a structure underlying the model, and the VAR that we have specified is a mere hodgepodge of all its components. Of course, that is exactly what reduced forms are. As such, to respond to this sort of criticism, analysts have begun to cast VARs formally as reduced forms and thereby attempt to deduce the structure that they had in mind all along. A VAR model $\mathbf{y}_t = \boldsymbol{\mu} + \boldsymbol{\Gamma}\mathbf{y}_{t-1} + \mathbf{v}_t$ could, in principle, be viewed as the reduced form of the dynamic structural model

$$\boldsymbol{\Theta}\mathbf{y}_t = \boldsymbol{\alpha} + \boldsymbol{\Phi}\mathbf{y}_{t-1} + \boldsymbol{\varepsilon}_t,$$

10See Hamilton (1994, pp. 318–323 and 336–350) for discussion and a number of related results.


where we have embedded any exogenous variables $\mathbf{x}_t$ in the vector of constants $\boldsymbol{\alpha}$. Thus, $\boldsymbol{\Gamma} = \boldsymbol{\Theta}^{-1}\boldsymbol{\Phi}$, $\boldsymbol{\mu} = \boldsymbol{\Theta}^{-1}\boldsymbol{\alpha}$, $\mathbf{v}_t = \boldsymbol{\Theta}^{-1}\boldsymbol{\varepsilon}_t$, and $\boldsymbol{\Omega} = \boldsymbol{\Theta}^{-1}\boldsymbol{\Sigma}(\boldsymbol{\Theta}^{-1})'$. Perhaps it is the structure, specified by an underlying theory, that is of interest. For example, we can discuss the impulse response characteristics of this system. For particular configurations of $\boldsymbol{\Theta}$, such as a triangular matrix, we can meaningfully interpret the innovations, $\boldsymbol{\varepsilon}$. As we explored at great length in the previous chapter, however, as this model stands, there is not sufficient information contained in the reduced form as just stated to deduce the structural parameters. A possibly large number of restrictions must be imposed on $\boldsymbol{\Theta}$, $\boldsymbol{\Phi}$, and $\boldsymbol{\Sigma}$ to enable us to deduce structural forms from reduced-form estimates, which are always obtainable. The recent work on structural VARs centers on the types of restrictions and forms of the theory that can be brought to bear to allow this analysis to proceed. See, for example, the survey in Hamilton (1994, Chapter 11).

At this point, the literature on this subject has come full circle because the contemporary development of "unstructured VARs" becomes very much the analysis of quite conventional dynamic structural simultaneous equations models. Indeed, current research [e.g., Diebold (1998)] brings the literature back into line with the structural modeling tradition by demonstrating how VARs can be derived formally as the reduced forms of dynamic structural models. That is, the most recent applications have begun with structures and derived the reduced forms as VARs, rather than departing from the VAR as a reduced form and attempting to deduce a structure from it by layering on restrictions.

20.6.8 APPLICATION: POLICY ANALYSIS WITH A VAR

Cecchetti and Rich (2001) used a structural VAR to analyze the effect of recent disinflationary policies of the Fed on aggregate output in the U.S. economy. The Fed's policy of the last two decades has leaned more toward controlling inflation and less toward stimulation of the economy. The authors argue that the long-run benefits of this policy include economic stability and increased long-term trend output growth. But there is a short-term cost in lost output. Their study seeks to estimate the "sacrifice ratio," which is a measure of the cumulative cost of this policy. The specific indicator they study measures the cumulative output loss after $\tau$ periods of a policy shock at time $t$, where the (persistent) shock is measured as the change in the level of inflation.

20.6.8.a A VAR Model for the Macroeconomic Variables

The model proposed for estimating the ratio is a structural VAR,

$$\Delta y_t = \sum_{i=1}^{p} b_{11}^i\,\Delta y_{t-i} + b_{12}^0\,\Delta\pi_t + \sum_{i=1}^{p} b_{12}^i\,\Delta\pi_{t-i} + \varepsilon_t^y,$$

$$\Delta\pi_t = b_{21}^0\,\Delta y_t + \sum_{i=1}^{p} b_{21}^i\,\Delta y_{t-i} + \sum_{i=1}^{p} b_{22}^i\,\Delta\pi_{t-i} + \varepsilon_t^\pi,$$

where $y_t$ is aggregate real output in period $t$ and $\pi_t$ is the rate of inflation from period $t - 1$ to $t$, and the model is cast in terms of the rates of change of these two variables. (Note, therefore, that sums of $\Delta\pi_t$ measure accumulated changes in the rate of inflation, not changes in the CPI.) The vector of innovations, $\boldsymbol{\varepsilon}_t = (\varepsilon_t^y, \varepsilon_t^\pi)'$, is assumed to have mean $\mathbf{0}$, contemporaneous covariance matrix $E[\boldsymbol{\varepsilon}_t\boldsymbol{\varepsilon}_t'] = \boldsymbol{\Sigma}$, and to be strictly nonautocorrelated. (We have retained Cecchetti and Rich's notation for most of this discussion, save for the number of lags, which is denoted $n$ in their paper and $p$ here, and some other minor changes which will be noted in passing where necessary.)11


The equation system may also be written

$$\mathbf{B}(L)\begin{bmatrix} \Delta y_t \\ \Delta\pi_t \end{bmatrix} = \begin{bmatrix} \varepsilon_t^y \\ \varepsilon_t^\pi \end{bmatrix},$$

where $\mathbf{B}(L)$ is a $2 \times 2$ matrix of polynomials in the lag operator. The components of the disturbance (innovation) vector $\boldsymbol{\varepsilon}_t$ are identified as shocks to aggregate supply and aggregate demand, respectively.

20.6.8.b The Sacrifice Ratio

Interest in the study centers on the impact over time of structural shocks to output and the rate of inflation. To calculate these, the authors use the vector moving average (VMA) form of the model,

$$\begin{bmatrix} \Delta y_t \\ \Delta\pi_t \end{bmatrix} = [\mathbf{B}(L)]^{-1}\begin{bmatrix} \varepsilon_t^y \\ \varepsilon_t^\pi \end{bmatrix} = \mathbf{A}(L)\begin{bmatrix} \varepsilon_t^y \\ \varepsilon_t^\pi \end{bmatrix} = \begin{bmatrix} A_{11}(L) & A_{12}(L) \\ A_{21}(L) & A_{22}(L) \end{bmatrix}\begin{bmatrix} \varepsilon_t^y \\ \varepsilon_t^\pi \end{bmatrix} = \begin{bmatrix} \sum_{i=0}^{\infty} a_{11}^i\varepsilon_{t-i}^y + \sum_{i=0}^{\infty} a_{12}^i\varepsilon_{t-i}^\pi \\ \sum_{i=0}^{\infty} a_{21}^i\varepsilon_{t-i}^y + \sum_{i=0}^{\infty} a_{22}^i\varepsilon_{t-i}^\pi \end{bmatrix}.$$

(Note that the superscript "$i$" in the preceding is not an exponent; it is the index of the sequence of coefficients.) The impulse response functions for the model corresponding to (20-27) are precisely the coefficients in $\mathbf{A}(L)$. In particular, the effect on the change in inflation $\tau$ periods later of a change in $\varepsilon_t^\pi$ in period $t$ is $a_{22}^\tau$. The total effect from time $t + 0$ to time $t + \tau$ would be the sum of these, $\sum_{i=0}^{\tau} a_{22}^i$. The counterparts for the rate of output would be $\sum_{i=0}^{\tau} a_{12}^i$. However, what is needed is not the effect only on period $\tau$'s output, but the cumulative effect on output from the time of the shock up to period $\tau$. That would be obtained by summing these period-specific effects, to obtain $\sum_{i=0}^{\tau}\sum_{j=0}^{i} a_{12}^j$. Combining terms, the sacrifice ratio is

$$S_{\varepsilon^\pi}(\tau) = \frac{\displaystyle\sum_{j=0}^{\tau}\frac{\partial y_{t+j}}{\partial\varepsilon_t^\pi}}{\dfrac{\partial\pi_{t+\tau}}{\partial\varepsilon_t^\pi}} = \frac{\sum_{i=0}^{0} a_{12}^i + \sum_{i=0}^{1} a_{12}^i + \cdots + \sum_{i=0}^{\tau} a_{12}^i}{\sum_{i=0}^{\tau} a_{22}^i} = \frac{\sum_{i=0}^{\tau}\sum_{j=0}^{i} a_{12}^j}{\sum_{i=0}^{\tau} a_{22}^i}.$$

The function $S(\tau)$ is then examined over long periods to study the long-term effects of monetary policy.
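Given estimated VMA coefficient sequences, the ratio itself is a pair of cumulations; a minimal sketch (hypothetical name), where `a12` and `a22` hold $a_{12}^0, a_{12}^1, \ldots$ and $a_{22}^0, a_{22}^1, \ldots$:

```python
import numpy as np

def sacrifice_ratio(a12, a22, tau):
    """Sacrifice ratio at horizon tau from VMA coefficients: the
    cumulative output loss, sum_{i<=tau} sum_{j<=i} a12^j, divided by
    the cumulative change in inflation, sum_{i<=tau} a22^i."""
    num = np.cumsum(np.cumsum(a12[:tau + 1]))[-1]   # double cumulation
    den = np.sum(a22[:tau + 1])
    return num / den
```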

20.6.8.c Identification and Estimation of a Structural VAR Model

Estimation of this model requires some manipulation. The structural model is a conventional linear simultaneous equations model of the form

$$\mathbf{B}_0\mathbf{y}_t = \mathbf{B}\mathbf{x}_t + \boldsymbol{\varepsilon}_t,$$

11The authors examine two other VAR models, a three-equation model of Shapiro and Watson (1988), which adds an equation in real interest rates $(i_t - \pi_t)$, and a four-equation model by Gali (1992), which models $y_t$, $i_t$, $(i_t - \pi_t)$, and the real money stock, $(m_t - \pi_t)$. Among the foci of Cecchetti and Rich's paper was the surprisingly large variation in estimates of the sacrifice ratio produced by the three models. In the interest of brevity, we will restrict our analysis to Cecchetti's (1994) two-equation model.


where $\mathbf{y}_t$ is $(\Delta y_t, \Delta\pi_t)'$ and $\mathbf{x}_t$ is the lagged values on the right-hand side. As we saw in Section 13.3, without further restrictions, a model such as this is not identified (estimable). A total of $M^2$ restrictions—$M$ is the number of equations, here two—are needed to identify the model. In the familiar cases of simultaneous equations models that we examined in Chapter 13, identification is usually secured through exclusion restrictions (i.e., zero restrictions), either in $\mathbf{B}_0$ or $\mathbf{B}$. This type of exclusion restriction would be unnatural in a model such as this one—there would be no basis for poking specific holes in the coefficient matrices. The authors take a different approach, which requires us to look more closely at the different forms the time-series model can take. Write the structural form as

$$\mathbf{B}_0\mathbf{y}_t = \mathbf{B}_1\mathbf{y}_{t-1} + \mathbf{B}_2\mathbf{y}_{t-2} + \cdots + \mathbf{B}_p\mathbf{y}_{t-p} + \boldsymbol{\varepsilon}_t,$$

where

$$\mathbf{B}_0 = \begin{bmatrix} 1 & -b_{12}^0 \\ -b_{21}^0 & 1 \end{bmatrix}.$$

As noted, this is in the form of a conventional simultaneous equations model. Assuming that $\mathbf{B}_0$ is nonsingular, which for this two-equation system requires only that $1 - b_{12}^0 b_{21}^0$ not equal zero, we can obtain the reduced form of the model as

$$\mathbf{y}_t = \mathbf{B}_0^{-1}\mathbf{B}_1\mathbf{y}_{t-1} + \mathbf{B}_0^{-1}\mathbf{B}_2\mathbf{y}_{t-2} + \cdots + \mathbf{B}_0^{-1}\mathbf{B}_p\mathbf{y}_{t-p} + \mathbf{B}_0^{-1}\boldsymbol{\varepsilon}_t = \mathbf{D}_1\mathbf{y}_{t-1} + \mathbf{D}_2\mathbf{y}_{t-2} + \cdots + \mathbf{D}_p\mathbf{y}_{t-p} + \boldsymbol{\mu}_t, \tag{20-30}$$

where $\boldsymbol{\mu}_t$ is the vector of reduced form innovations. Now, collect the terms in the equivalent form

$$[\mathbf{I} - \mathbf{D}_1 L - \mathbf{D}_2 L^2 - \cdots]\mathbf{y}_t = \boldsymbol{\mu}_t.$$

The moving-average form that we obtained earlier is

$$\mathbf{y}_t = [\mathbf{I} - \mathbf{D}_1 L - \mathbf{D}_2 L^2 - \cdots]^{-1}\boldsymbol{\mu}_t.$$

Assuming stability of the system, we can also write this as

$$\begin{aligned} \mathbf{y}_t &= [\mathbf{I} - \mathbf{D}_1 L - \mathbf{D}_2 L^2 - \cdots]^{-1}\boldsymbol{\mu}_t \\ &= [\mathbf{I} - \mathbf{D}_1 L - \mathbf{D}_2 L^2 - \cdots]^{-1}\mathbf{B}_0^{-1}\boldsymbol{\varepsilon}_t \\ &= [\mathbf{I} + \mathbf{C}_1 L + \mathbf{C}_2 L^2 + \cdots]\boldsymbol{\mu}_t \\ &= \boldsymbol{\mu}_t + \mathbf{C}_1\boldsymbol{\mu}_{t-1} + \mathbf{C}_2\boldsymbol{\mu}_{t-2} + \cdots \\ &= \mathbf{B}_0^{-1}\boldsymbol{\varepsilon}_t + \mathbf{C}_1\boldsymbol{\mu}_{t-1} + \mathbf{C}_2\boldsymbol{\mu}_{t-2} + \cdots. \end{aligned}$$

So, the Cj matrices correspond to our Aj matrices in the original formulation. But this manipulation has added something. We can see that A0 = B0^{−1}. Looking ahead, the reduced form equations can be estimated by least squares. Whether the structural parameters, and thereafter, the VMA parameters can as well depends entirely on whether B0 can be estimated. From (20-30) we can see that if B0 can be estimated, then B1, ..., Bp can also, just by premultiplying the reduced form coefficient matrices by this estimated


B0. So, we must now consider this issue. (This is precisely the conclusion we drew at the beginning of Section 13.3.)

Recall the initial assumption that E[εt εt′] = Λ. In the reduced form, we assume E[μt μt′] = Ω. As we know, reduced forms are always estimable (indeed, by least squares if the assumptions of the model are correct). That means that Ω is estimable by the least squares residual variances and covariance. From the earlier derivation, we have that Ω = B0^{−1}Λ(B0^{−1})′ = A0ΛA0′. (Again, see the beginning of Section 13.3.) The authors have secured identification of the model through this relationship. In particular, they assume first that Λ = I. Assuming that Λ = I, we now have that A0A0′ = Ω, where Ω is an estimable matrix with three free parameters. Because A0 is 2 × 2, one more restriction is needed to secure identification. At this point, the authors, invoking Blanchard and Quah (1989), assume that "demand shocks have no permanent effect on the level of output. This is equivalent to A12(1) = Σ_{i=0}^∞ a12^i = 0." This might seem like a cumbersome restriction to impose. But, the matrix A(1) is [I − D1 − D2 − ··· − Dp]^{−1}A0 = FA0 and the components, Dj, have been estimated as the reduced form coefficient matrices, so A12(1) = 0 assumes only that the upper right element of this matrix is zero. We now obtain the equations needed to solve for A0. First,

\[
A_0 A_0' = \Omega \;\Longrightarrow\;
\begin{bmatrix} \left(a_{11}^0\right)^2 + \left(a_{12}^0\right)^2 & a_{11}^0 a_{21}^0 + a_{12}^0 a_{22}^0 \\ a_{11}^0 a_{21}^0 + a_{12}^0 a_{22}^0 & \left(a_{21}^0\right)^2 + \left(a_{22}^0\right)^2 \end{bmatrix}
= \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{bmatrix},
\tag{20-31}
\]

which provides three equations. Second, the theoretical restriction is

\[
F A_0 = \begin{bmatrix} * & f_{11} a_{12}^0 + f_{12} a_{22}^0 \\ * & * \end{bmatrix}
= \begin{bmatrix} * & 0 \\ * & * \end{bmatrix}.
\]
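Before collecting results, note that this system is easy to solve numerically. One standard device (our own sketch, not necessarily the routine the authors used) exploits the fact that A(1)A(1)′ = FΩF′, so that the lower triangular Cholesky factor of FΩF′ delivers an A(1) whose upper right element is zero, which is exactly the restriction A12(1) = 0; A0 then follows from A0 = F^{−1}A(1):

```python
import numpy as np

def blanchard_quah_A0(D_list, Omega):
    """Solve for A0 under Lambda = I and the long-run restriction A12(1) = 0.

    D_list : reduced-form coefficient matrices D1..Dp (each M x M)
    Omega  : reduced-form innovation covariance matrix (M x M)
    """
    M = Omega.shape[0]
    # F = [I - D1 - ... - Dp]^{-1}, so A(1) = F A0 and A(1)A(1)' = F Omega F'.
    F = np.linalg.inv(np.eye(M) - sum(D_list))
    # Lower-triangular Cholesky factor imposes the zero upper-right element.
    A1 = np.linalg.cholesky(F @ Omega @ F.T)
    # Recover the impact matrix: A0 = F^{-1} A(1).
    A0 = np.linalg.solve(F, A1)
    return A0
```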

Together with (20-31), the theoretical restriction provides the four equations needed to identify the four elements in A0.12 Collecting results, the estimation strategy is first to estimate D1, ..., Dp and Ω in the reduced form, by least squares. (They set p = 8.) Then use the restrictions and (20-31) to obtain the elements of A0 = B0^{−1} and, finally, Bj = A0^{−1}Dj. The last step is estimation of the matrices of impulse responses, which can be done as follows: We return to the reduced form which, using our augmentation trick, we

12At this point, an intriguing loose end arises. We have carried this discussion in the form of the original papers by Blanchard and Quah (1989) and Cecchetti and Rich (2001). Returning to the original structure, we see that because A0 = B0^{−1}, if B0 has ones on the diagonal, then A0 actually does not have four unrestricted and unknown elements; it has two. The model is thus overidentified. We could have predicted this at the outset. In our conventional simultaneous equations model, the normalizations in B0 (ones on the diagonal) provide two restrictions of the M² = 4 required for identification. Assuming that Λ = I provides three more, and the theoretical restriction provides a sixth. Therefore, the four unknown elements in an unrestricted B0 are overidentified. It might seem convenient at this point to forgo the theoretical restriction on long-term impacts, but it seems more natural to omit the restrictions on the scaling of Λ. With the two normalizations already in place, assuming that the innovations are uncorrelated (Λ is diagonal) and that "demand shocks have no permanent effect on the level of output" together suffice to identify the model. Blanchard and Quah appear to reach the same conclusion (page 656), but then they also assume the unit variances [page 657, equation (1)]. They argue that the assumption of unit variances is just a convenient normalization, which for their model is actually the case, because they do not assume that B0 is diagonal. Cecchetti and Rich, however, do appear to normalize B0 in their equation (1). They then (evidently) drop the assumption after (10), however: "[B]ecause A0 has (n × n) unique elements ...." This would imply that the normalization they impose on their (1) has not, in fact, been carried through the later manipulations, so, once again, the model is exactly identified.


write as

\[
\begin{bmatrix} y_t \\ y_{t-1} \\ \vdots \\ y_{t-p+1} \end{bmatrix}
= \begin{bmatrix} D_1 & D_2 & \cdots & D_{p-1} & D_p \\ I & 0 & \cdots & 0 & 0 \\ & & \ddots & & \vdots \\ 0 & \cdots & & I & 0 \end{bmatrix}
\begin{bmatrix} y_{t-1} \\ y_{t-2} \\ \vdots \\ y_{t-p} \end{bmatrix}
+ \begin{bmatrix} A_0 \varepsilon_t \\ 0 \\ \vdots \\ 0 \end{bmatrix}.
\tag{20-32}
\]

For convenience, arrange this result as

Yt = (DL)Yt + wt .

Now, solve this for Yt to obtain the final form

Yt = [I − DL]^{−1}wt.

Write this in the spectral form and expand as we did earlier, to obtain

\[
Y_t = \sum_{i=0}^{\infty} P \Lambda^i Q\, w_{t-i}.
\tag{20-33}
\]

We will be interested in the uppermost subvector of Yt, so we expand (20-33) to yield

\[
\begin{bmatrix} y_t \\ y_{t-1} \\ \vdots \\ y_{t-p+1} \end{bmatrix}
= \sum_{i=0}^{\infty} P \Lambda^i Q
\begin{bmatrix} A_0 \varepsilon_{t-i} \\ 0 \\ \vdots \\ 0 \end{bmatrix}.
\]

The matrix in the summation is Mp × Mp. The impact matrices we seek are the M × M matrices in the upper left corner of the spectral form, multiplied by A0.
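In practice, the expansion can also be computed by powering the companion matrix in (20-32) directly; since D^i = PΛ^iQ, this is numerically equivalent to the spectral form. A minimal NumPy sketch (ours), assuming the estimated Dj matrices and A0 from the previous steps:

```python
import numpy as np

def impulse_responses(D_list, A0, horizon):
    """Structural impulse response matrices for i = 0..horizon.

    Each response is the upper-left M x M block of the i-th power of the
    companion matrix in (20-32), postmultiplied by A0.
    """
    M = A0.shape[0]
    p = len(D_list)
    # Build the Mp x Mp companion matrix from (20-32).
    companion = np.zeros((M * p, M * p))
    companion[:M, :] = np.hstack(D_list)
    companion[M:, :-M] = np.eye(M * (p - 1))
    responses = []
    power = np.eye(M * p)
    for i in range(horizon + 1):
        responses.append(power[:M, :M] @ A0)
        power = power @ companion
    return responses
```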

20.6.8.d Inference

As noted at the end of Section 20.6.6, obtaining usable standard errors for estimates of impulse responses is a difficult (as yet unresolved) problem. Kilian (1998) has suggested that bootstrapping is a preferable approach to using the delta method. Cecchetti and Rich reach the same conclusion and likewise resort to a bootstrapping procedure. Their bootstrap procedure is carried out as follows: Let δ̂ and Ω̂ denote the full set of estimated coefficients and the estimated reduced form covariance matrix based on direct estimation. As suggested by Doan (2007), they construct a sequence of N draws for the reduced form parameters, then recompute the entire set of impulse responses. The narrowest interval that contains 90 percent of these draws is taken to be a confidence interval for an estimated impulse response function.
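The following sketch conveys the logic of such a procedure. It is our own stylized rendering, not the authors' code: the helper irf_from_params, which maps one draw of the reduced form parameters into the impulse responses at each horizon, is hypothetical, and the draws here come from a normal approximation to the sampling distribution of δ̂.

```python
import numpy as np

def bootstrap_irf_bands(delta_hat, V_delta, irf_from_params, N=1000, coverage=0.90):
    """Confidence bands for impulse responses from N parameter draws.

    irf_from_params(d) -> 1-D array of impulse responses over the horizons
    (hypothetical helper supplied by the user).
    """
    rng = np.random.default_rng(seed=12345)
    draws = rng.multivariate_normal(delta_hat, V_delta, size=N)
    irfs = np.array([irf_from_params(d) for d in draws])  # N x horizons
    k = int(np.ceil(coverage * N))  # number of draws the interval must contain
    lower, upper = [], []
    for h in range(irfs.shape[1]):
        s = np.sort(irfs[:, h])
        # Narrowest interval containing k of the N draws at this horizon.
        j = int(np.argmin(s[k - 1:] - s[: N - k + 1]))
        lower.append(s[j])
        upper.append(s[j + k - 1])
    return np.array(lower), np.array(upper)
```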

20.6.8.e Empirical Results

Cecchetti and Rich used quarterly observations on real aggregate output and the consumer price index. Their data set spanned 1959.1 to 1997.4. This is a subset of the data described in Appendix Table F5.1. Before beginning their analysis, they subjected the data to the standard tests for stationarity. Figures 20.5–20.7 show the log of real output, the rate of inflation, and the changes in these two variables. The


FIGURE 20.5 Log GDP.

FIGURE 20.6 The Quarterly Rate of Inflation.

first two figures do suggest that neither variable is stationary. On the basis of the Dickey–Fuller (1981) test (see Section 22.2.4), they found (as might be expected) that the yt and πt series both contain unit roots. They conclude that because output has a unit root, the identification restriction that the long-run effect of aggregate demand shocks on output is zero is well defined and meaningful. The unit root in inflation allows for


FIGURE 20.7 Rates of Change, Log Real GDP (DLOGY) and the Rate of Inflation (DPI).

permanent shifts in its level. The lag length for the model is set at p = 8. Long-run impulse response functions are truncated at 20 years (80 quarters). Analysis is based on the rate of change data shown in Figure 20.7.

As a final check on the model, the authors examined the data for the possibility of a structural shift using the tests described in Andrews (1993) and Andrews and Ploberger (1994). None of the Andrews/Quandt supremum LM test, the Andrews/Ploberger exponential LM test, or the Andrews/Ploberger average LM test suggested that the underlying structure had changed (in spite of what seems likely to have been a major shift in Fed policy in the 1970s). On this basis, they concluded that the VAR is stable over the sample period.

Figure 20.8 (Figures 3A and 3B taken from the article) shows their two separate estimated impulse response functions. The dotted lines in the figures show the bootstrap-generated confidence bounds. Estimates of the sacrifice ratio for Cecchetti's model are 1.3219 for τ = 4, 1.3204 for τ = 8, 1.5700 for τ = 12, 1.5219 for τ = 16, and 1.3763 for τ = 20.

The authors also examined the forecasting performance of their model compared to Shapiro and Watson's and Gali's. The device used was to produce one-step-ahead, period T + 1 | T forecasts for the model estimated using periods 1, ..., T. The first reduced form of the model is fit using 1959.1 to 1975.1 and used to forecast 1975.2. Then, it is reestimated using 1959.1 to 1975.2 and used to forecast 1975.3, and so on. Finally, the root mean squared error of these out-of-sample forecasts is compared for the three models. In each case, the level, rather than the rate of change, of the inflation rate is forecasted. Overall, the results suggest that the smaller model does a better job of estimating the impulse responses (has smaller confidence bounds and conforms more


FIGURE 20.8 Estimated Impulse Response Functions. A: Dynamic Response to a Monetary Policy Shock, Real GDP (Cecchetti). B: Dynamic Response to a Monetary Policy Shock, Inflation (Cecchetti).

nearly with theoretical predictions) but performs worst of the three (slightly) in terms of the mean squared error of the out-of-sample forecasts. Because the unrestricted reduced form model is being used for the latter, this comes as no surprise. The end result follows essentially from the result that adding variables to a regression model improves its fit.


20.6.9 VARs IN MICROECONOMICS

VARs have appeared in the microeconometrics literature as well. Chamberlain (1980) suggested that a useful approach to the analysis of panel data would be to treat each period's observation as a separate equation. For the case of T = 2, we would have

yi1 = αi + xi1′β + εi1,
yi2 = αi + xi2′β + εi2,

where i indexes individuals and αi are unobserved individual effects. This specification produces a multivariate regression, to which Chamberlain added restrictions related to the individual effects. Holtz-Eakin, Newey, and Rosen's (1988) approach is to specify the equation as

\[
y_{it} = \alpha_{0t} + \sum_{l=1}^{m} \alpha_{lt}\, y_{i,t-l} + \sum_{l=1}^{m} \delta_{lt}\, x_{i,t-l} + \psi_t f_i + \mu_{it}.
\]

In their study, yit is hours worked by individual i in period t and xit is the individual's wage in that period. A second equation for earnings is specified with lagged values of hours and earnings on the right-hand side. The individual, unobserved effects are fi. This model is similar to the VAR in (20-27), but it differs in several ways as well. The number of periods is quite small (14 yearly observations for each individual), but there are nearly 1,000 individuals. The dynamic equation is specified for a specific period, however, so the relevant sample size in each case is n, not T. Also, the number of lags in the model used is relatively small; the authors fixed it at three. They thus have a two-equation VAR containing 12 unknown parameters, six in each equation. The authors used the model to analyze causality, measurement error, and parameter stability, that is, constancy of αlt and δlt across time.

Example 20.7 VAR for Municipal Expenditures

In Example 15.10, we examined a model of municipal expenditures proposed by Dahlberg and Johansson (2000). Their equation of interest is

\[
\Delta S_{i,t} = \mu_t^S + \sum_{j=1}^{m} \beta_j \Delta S_{i,t-j} + \sum_{j=1}^{m} \gamma_j \Delta R_{i,t-j} + \sum_{j=1}^{m} \delta_j \Delta G_{i,t-j} + u_{i,t}^S
\]

for i = 1, ..., N = 265 and t = m + 1, ..., 9. Si,t, Ri,t, and Gi,t are municipal spending, receipts (taxes and fees), and central government grants, respectively. Analogous equations are specified for the current values of ΔRi,t and ΔGi,t. This produces a vector autoregression for each municipality,

\[
\begin{bmatrix} \Delta S_{i,t} \\ \Delta R_{i,t} \\ \Delta G_{i,t} \end{bmatrix}
= \begin{bmatrix} \mu_{S,t} \\ \mu_{R,t} \\ \mu_{G,t} \end{bmatrix}
+ \begin{bmatrix} \beta_{S,1} & \gamma_{S,1} & \delta_{S,1} \\ \beta_{R,1} & \gamma_{R,1} & \delta_{R,1} \\ \beta_{G,1} & \gamma_{G,1} & \delta_{G,1} \end{bmatrix}
\begin{bmatrix} \Delta S_{i,t-1} \\ \Delta R_{i,t-1} \\ \Delta G_{i,t-1} \end{bmatrix} + \cdots
+ \begin{bmatrix} \beta_{S,m} & \gamma_{S,m} & \delta_{S,m} \\ \beta_{R,m} & \gamma_{R,m} & \delta_{R,m} \\ \beta_{G,m} & \gamma_{G,m} & \delta_{G,m} \end{bmatrix}
\begin{bmatrix} \Delta S_{i,t-m} \\ \Delta R_{i,t-m} \\ \Delta G_{i,t-m} \end{bmatrix}
+ \begin{bmatrix} u_{i,t}^S \\ u_{i,t}^R \\ u_{i,t}^G \end{bmatrix}.
\]

The model was estimated by GMM, so the discussion at the end of the preceding section applies here. We will be interested in testing whether changes in municipal spending, ΔSi,t, are Granger-caused by changes in revenues, ΔRi,t, and grants, ΔGi,t. The hypothesis to be tested


is γS,j = δS,j = 0 for all j. This hypothesis can be tested in the context of only the first equation. Parameter estimates and diagnostic statistics are given in Example 15.10. We can carry out the test in two ways. In the unrestricted equation with all three lagged values of all three variables, the minimized GMM criterion is q = 22.8287. If the lagged values of ΔR and ΔG are omitted from the ΔS equation, the criterion rises to 42.9182.13 There are six restrictions. The difference is 20.090, so the F statistic is 20.09/6 = 3.348. We have more than 1,000 degrees of freedom for the denominator, with 265 municipalities and 5 years, so we can use the limiting value for the critical value. This is 2.10, so we may reject the hypothesis of noncausality and conclude that changes in revenues and grants do Granger cause changes in spending. (This hardly seems surprising.) The alternative approach is to use a Wald statistic to test the six restrictions. Using the full GMM results for the ΔS equation with 14 coefficients, we obtain a Wald statistic of 15.3030. The critical chi-squared would be 6 × 2.1 = 12.6, so once again, the hypothesis is rejected. Dahlberg and Johansson approach the causality test somewhat differently by using a sequential testing procedure. (See their page 413 for discussion.) They suggest that the intervening variables be dropped in turn. By dropping first G, then R and G, and then first R, then G and R, they conclude that grants do not Granger-cause changes in spending (q = only 0.07) but in the absence of grants, revenues do (q with grants excluded = 24.6). The reverse order produces test statistics of 12.2 and 12.4, respectively. Our own calculations of the four values of q yield 22.829 for the full model, 23.1302 with only grants excluded, 23.0894 with only ΔR excluded, and 42.9182 with both excluded, which disagrees with their results but is consistent with our earlier ones.

Instability of a VAR Model

The coefficients for the three-variable VAR model in Example 20.7 appear in Table 15.5. The characteristic roots of the 9 × 9 coefficient matrix are −0.6025, 0.2529, 0.0840, (1.4586 ± 0.6584i), (−0.6992 ± 0.2019i), and (0.0611 ± 0.6291i). The first pair of complex roots has modulus greater than one, so the estimated VAR is unstable. The data do not appear to be consistent with this result, though with only five usable years of data, that conclusion is a bit fragile. One might suspect that the model is overfit. Because the disturbances are assumed to be uncorrelated across equations, the three equations have been estimated separately. The GMM criterion for the system is then the sum of those for the three equations. For p = 3, 2, and 1, respectively, these are (22.8287 + 30.5398 + 17.5810) = 70.9495, (30.4526 + 34.2590 + 20.5416) = 85.2532, and (34.4986 + 53.2506 + 27.5927) = 115.6119. The difference statistic for testing down from three lags to two is 14.3037. The critical chi-squared for nine degrees of freedom is 19.62, so it would appear that m = 3 may be too large. The results clearly reject the hypothesis that m = 1, however. The coefficients for a model with two lags instead of three appear in Table 15.5. If we construct the coefficient matrix from these results instead, we obtain a 6 × 6 matrix whose characteristic roots are 1.5817, −0.2196, −0.3509 ± 0.4362i, and 0.0968 ± 0.2791i. The system remains unstable.
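The stability calculation itself is mechanical: stack the estimated lag matrices into the companion form and inspect the moduli of its characteristic roots. A brief sketch (ours, using NumPy; the coefficient matrices would be taken from Table 15.5):

```python
import numpy as np

def var_stability(coef_matrices):
    """Characteristic roots of a VAR's companion matrix.

    coef_matrices: list of M x M matrices [Phi_1, ..., Phi_p] for p lags.
    The VAR is stable only if every root has modulus less than one.
    """
    M = coef_matrices[0].shape[0]
    p = len(coef_matrices)
    companion = np.zeros((M * p, M * p))
    companion[:M, :] = np.hstack(coef_matrices)
    companion[M:, :-M] = np.eye(M * (p - 1))
    roots = np.linalg.eigvals(companion)
    return roots, np.all(np.abs(roots) < 1.0)
```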

20.7 SUMMARY AND CONCLUSIONS

This chapter has surveyed a particular type of regression model, the dynamic regression. The signature feature of the dynamic model is effects that are delayed or that persist through time. In a static regression setting, effects embodied in coefficients are assumed to take place all at once. In the dynamic model, the response to an innovation is distributed through several periods. The first three sections of this chapter examined several different forms of single-equation models that contained lagged effects. The

13Once again, these results differ from those given by Dahlberg and Johansson. As before, the difference results from our use of the same weighting matrix for all GMM computations in contrast to their recomputation of the matrix for each new coefficient vector estimated.


progression, which mirrors the current literature, is from tightly structured lag "models" (which were sometimes formulated to respond to a shortage of data rather than to correspond to an underlying theory) to unrestricted models with multiple period lag structures. We also examined several hybrids of these two forms, models that allow long lags but build some regular structure into the lag weights. Thus, our model of the formation of expectations of inflation is reasonably flexible, but does assume a specific behavioral mechanism. We then examined several methodological issues. In this context as elsewhere, there is a preference in the methods toward forming broad unrestricted models and using familiar inference tools to reduce them to the final appropriate specification. The second half of the chapter was devoted to a type of seemingly unrelated regressions model. The vector autoregression, or VAR, has been a major tool in recent research. After developing the econometric framework, we examined two applications, one in macroeconomics centered on monetary policy and one from microeconomics.

Key Terms and Concepts

• Autocorrelation • Autoregression • Autoregressive distributed lag • Autoregressive form • Autoregressive model • Characteristic equation • Distributed lag • Dynamic regression model • Elasticity • Equilibrium • Equilibrium error • Equilibrium multiplier • Equilibrium relationship • Error correction • Exogeneity • Expectation • Finite lags • General-to-simple method • Granger causality • Impact multiplier • Impulse response • Infinite lag model • Infinite lags • Innovation • Invertible • Lagged variables • Lag operator • Lag weight • Long-run multiplier • Mean lag • Median lag • Moving-average form • One-period-ahead forecast • Partial adjustment • Phillips curve • Polynomial in lag operator • Random walk with drift • Rational lag • Simple-to-general approach • Specification • Stability • Stationary • Strong exogeneity • Structural model • Structural VAR • Superconsistent • Univariate autoregression • Vector autoregression (VAR) • Vector moving average (VMA)

Exercises

1. Obtain the mean lag and the long- and short-run multipliers for the following distributed lag models:
   a. yt = 0.55(0.02xt + 0.15xt−1 + 0.43xt−2 + 0.23xt−3 + 0.17xt−4) + et.
   b. The model in Exercise 3.
   c. The model in Exercise 4. (Do for either x or z.)
2. Expand the rational lag model yt = [(0.6 + 2L)/(1 − 0.6L + 0.5L²)]xt + et. What are the coefficients on xt, xt−1, xt−2, xt−3, and xt−4?
3. Suppose that the model of Exercise 2 were specified as

   yt = α + [(β + γL)/(1 − δ1L − δ2L²)]xt + et.


   Describe a method of estimating the parameters. Is ordinary least squares consistent?
4. Describe how to estimate the parameters of the model

   yt = α + [β/(1 − γL)]xt + [δ/(1 − φL)]zt + εt,

where εt is a serially uncorrelated, homoscedastic, classical disturbance.

Applications

1. We are interested in the long-run multiplier in the model

   yt = β0 + Σ_{j=0}^6 βj xt−j + εt.

   Assume that xt is an autoregressive series, xt = rxt−1 + vt where |r| < 1.
   a. What is the long-run multiplier in this model?
   b. How would you estimate the long-run multiplier in this model?
   c. Suppose you knew that the preceding is the true model but you linearly regress yt only on a constant and the first five lags of xt. How does this affect your estimate of the long-run multiplier?
   d. Same as c. for four lags instead of five.
   e. Using the macroeconomic data in Appendix Table F5.1, let yt be the log of real investment and xt be the log of real output. Carry out the computations suggested and report your findings. Specifically, how does the omission of a lagged value affect estimates of the short-run and long-run multipliers in the unrestricted lag model?

yt = α + βxt + γ yt−1 + δyt−2 + et ,

et = ρet−1 + ut .

   Is there any problem with ordinary least squares? Let yt be consumption and let xt be disposable income. Using the method you have described, fit the previous model to the data in Appendix Table F5.1. Report your results.

21 TIME-SERIES MODELS

21.1 INTRODUCTION

For forecasting purposes, a simple model that describes the behavior of a variable (or a set of variables) in terms of past values, without the benefit of a well-developed theory, may well prove quite satisfactory. Researchers have observed that the large simultaneous equations macroeconomic models constructed in the 1960s frequently have poorer forecasting performance than fairly simple, univariate time-series models based on just a few parameters and compact specifications. It is just this observation that has raised to prominence the univariate time-series forecasting models pioneered by Box and Jenkins (1984).

In this chapter, we introduce some of the tools employed in the analysis of time-series data.1 Section 21.2 describes stationary stochastic processes. We encountered this body of theory in Chapters 19 and 20, where we discovered that certain assumptions were required to ascribe familiar properties to a time series of data. We continue that discussion by defining several characteristics of a stationary time series. The recent literature in macroeconometrics has seen an explosion of studies of nonstationary time series. Nonstationarity mandates a revision of the standard inference tools we have used thus far. Chapter 22 introduces some extensions of the results of this chapter to nonstationary time series.

Some of the concepts to be discussed here were introduced in Section 19.2. Section 19.2 also contains a cursory introduction to the nature of time-series processes. It will be useful to review that material before proceeding with the rest of this chapter. Finally, Sections 13.6 on estimation and 13.9.2 and 20.4.3 on stability of dynamic models will be especially useful for the latter sections of this chapter.

1Each topic discussed here is the subject of a vast literature with articles and book-length treatments at all levels. For example, two survey papers on the subject of unit roots in economic time-series data, Diebold and Nerlove (1990) and Campbell and Perron (1991), cite between them more than 200 basic sources on the subject. The literature on unit roots and cointegration is almost surely the most rapidly moving target in econometrics. Stock's (1994) survey adds hundreds of references to those in the aforementioned surveys and brings the literature up to date as of then. Useful basic references on the subjects of this chapter are Box and Jenkins (1984); Judge et al. (1985); Mills (1990); Granger and Newbold (1996); Granger and Watson (1984); Hendry, Pagan, and Sargan (1984); Geweke (1984); and especially Harvey (1989, 1990); Enders (2004); Tsay (2005); Hamilton (1994); and Patterson (2000). There are also many survey-style and pedagogical articles on these subjects. The aforementioned paper by Diebold and Nerlove is a useful tour guide through some of the literature. We recommend Dickey, Bell, and Miller (1986) and Dickey, Jansen, and Thornton (1991) as well. The latter is an especially clear introduction at a very basic level of the fundamental tools for empirical researchers.



21.2 STATIONARY STOCHASTIC PROCESSES

The essential building block for the models to be discussed in this chapter is the white noise time-series process,

{εt}, t = −∞, ..., +∞,

where each element in the sequence has E[εt] = 0, E[εt²] = σε², and Cov[εt, εs] = 0 for all s ≠ t. Each element in the series is a random draw from a population with zero mean and constant variance. It is occasionally assumed that the draws are independent or normally distributed, although for most of our analysis, neither assumption will be essential.

A univariate time-series model describes the behavior of a variable in terms of its own past values. Consider, for example, the autoregressive disturbance models introduced in Chapter 19,

ut = ρut−1 + εt.   (21-1)

Autoregressive disturbances are generally the residual variation in a regression model built up from what may be an elaborate underlying theory, yt = xt′β + ut. The theory usually stops short of stating what enters the disturbance. But the presumption that some time-series process generates xt should extend equally to ut. There are two ways to interpret this simple series. As stated, ut equals the previous value of ut plus an "innovation," εt. Alternatively, by manipulating the series, we showed that ut could be interpreted as an aggregation of the entire history of the εt's. Occasionally, statistical evidence is convincing that a more intricate process is at work in the disturbance. Perhaps a second-order autoregression,

ut = ρ1ut−1 + ρ2ut−2 + εt , (21-2) better explains the movement of the disturbances in the regression. The model may not arise naturally from an underlying behavioral theory. But in the face of certain kinds of statistical evidence, one might conclude that the more elaborate model would be preferable.2 This section will describe several alternatives to the AR(1) model that we have relied on in most of the preceding applications.

21.2.1 AUTOREGRESSIVE MOVING-AVERAGE PROCESSES

The variable yt in the model

yt = μ + γ yt−1 + εt (21-3)

is said to be autoregressive (or self-regressive) because under certain assumptions,

E [yt | yt−1] = μ + γ yt−1. A more general pth-order autoregression or AR(p) process would be written

yt = μ + γ1 yt−1 + γ2 yt−2 +···+γp yt−p + εt . (21-4)

2For example, the estimates of εt computed after a correction for first-order autocorrelation may fail tests of randomness such as the LM (Section 19.7.1) test.


The analogy to the classical regression is clear. Now consider the first-order moving average, or MA(1) specification3

yt = μ + εt − θεt−1. (21-5) By writing

yt = μ + (1 − θL)εt,

or

yt/(1 − θL) = μ/(1 − θ) + εt,

we find that

yt = μ/(1 − θ) − θyt−1 − θ²yt−2 − ··· + εt.

Once again, the effect is to represent yt as a function of its own past values. An extremely general model that encompasses (21-4) and (21-5) is the autoregressive moving average, or ARMA(p, q), model:

yt = μ + γ1 yt−1 + γ2 yt−2 +···+γp yt−p + εt − θ1εt−1 −···−θqεt−q. (21-6) Note the convention that the ARMA(p, q) process has p autoregressive (lagged dependent-variable) terms and q lagged moving average terms. Researchers have found that models of this sort with relatively small values of p and q have proved quite effective as forecasting models. The disturbances εt are labeled the innovations in the model. The term is fitting because the only new information that enters the processes in period t is this innovation. Consider, then, the AR(1) process

yt = μ + γ yt−1 + εt . (21-7) Either by successive substitution or by using the lag operator, we obtain

(1 − γL)yt = μ + εt,

or

yt = μ/(1 − γ) + Σ_{i=0}^∞ γ^i εt−i.4   (21-8)
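The equivalence of (21-7) and (21-8) is easy to verify numerically. The following check is our own illustration (with hypothetical parameter values), comparing the recursion against a truncated version of the infinite sum:

```python
import numpy as np

# A numerical check (not from the text) that the AR(1) recursion in (21-7)
# and the MA(infinity) representation in (21-8) produce the same value of y_t.
rng = np.random.default_rng(seed=0)
mu, gamma, N = 1.0, 0.75, 700
eps = rng.normal(size=N)

# (21-7): iterate y_t = mu + gamma*y_{t-1} + eps_t from a zero initial condition;
# with N large the initial condition is washed out.
y = 0.0
for t in range(N):
    y = mu + gamma * y + eps[t]

# (21-8): y_t = mu/(1-gamma) + sum_i gamma^i * eps_{t-i}, truncated at N terms.
y_ma = mu / (1 - gamma) + np.sum(gamma ** np.arange(N) * eps[::-1])

print(y, y_ma)  # agree up to the (negligible) truncation error gamma^N
```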

The observed series is a particular type of aggregation of the history of the innovations. The moving average, MA(q) model,

yt = μ + εt − θ1εt−1 −···−θqεt−q = μ + D(L)εt , (21-9) is yet another, particularly simple form of aggregation in that only information from the q most recent periods is retained. The general result is that many time-series processes can be viewed either as regressions on lagged values with additive disturbances or as aggregations of a history of innovations. They differ from one to the next in the form of that aggregation.

3The lag operator is discussed in Section 20.2.2. Because μ is a constant, (1 − θL)^{−1}μ = μ + θμ + θ²μ + ··· = μ/(1 − θ). The lag operator may be set equal to one when it operates on a constant.
4See Section 20.3 for discussion of models with infinite lag structures.


More involved processes can be similarly represented in either an autoregressive or moving-average form. (We will turn to the mathematical requirements later.) Consider, for example, the ARMA(2, 1) process,

yt = μ + γ1 yt−1 + γ2 yt−2 + εt − θεt−1, which we can write as

(1 − θL)εt = yt − μ − γ1 yt−1 − γ2 yt−2.

If |θ| < 1, then we can divide both sides of the equation by (1 − θL) and obtain

εt = Σ_{i=0}^∞ θ^i (yt−i − μ − γ1 yt−i−1 − γ2 yt−i−2).

After some tedious manipulation, this equation produces the autoregressive form,

yt = μ/(1 − θ) + Σ_{i=1}^∞ πi yt−i + εt,

where

π1 = γ1 − θ and πj = −(θ^j − γ1θ^{j−1} − γ2θ^{j−2}), j = 2, 3, ....   (21-10)

Alternatively, by similar (yet more tedious) manipulation, we can write

yt = μ/(1 − γ1 − γ2) + [(1 − θL)/(1 − γ1L − γ2L²)]εt = μ/(1 − γ1 − γ2) + Σ_{i=0}^∞ δi εt−i.   (21-11)

In each case, the weights, πi in the autoregressive form and δi in the moving-average form, are complicated functions of the original parameters. But nonetheless, each is just an alternative representation of the same time-series process that produces the current value of yt . This result is a fundamental property of certain time series. We will return to the issue after we formally define the assumption that we have used at the preceding several steps that allows these transformations.
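For a concrete look at these weights, the recursions can be evaluated directly. The sketch below is our own, with hypothetical parameter values: it computes the πj from (21-10), and the δi in (21-11) from the recursion implied by (1 − γ1L − γ2L²)δ(L) = 1 − θL.

```python
import numpy as np

gamma1, gamma2, theta = 0.5, 0.2, 0.4  # illustrative values with |theta| < 1
J = 10

# AR-form weights pi_j from (21-10).
pi = np.zeros(J + 1)
pi[1] = gamma1 - theta
for j in range(2, J + 1):
    pi[j] = -(theta**j - gamma1 * theta**(j - 1) - gamma2 * theta**(j - 2))

# MA-form weights delta_i: impulse response of
# (1 - gamma1*L - gamma2*L^2) delta(L) = (1 - theta*L), computed recursively.
delta = np.zeros(J + 1)
delta[0] = 1.0
delta[1] = gamma1 * delta[0] - theta
for i in range(2, J + 1):
    delta[i] = gamma1 * delta[i - 1] + gamma2 * delta[i - 2]

print(pi[1:])
print(delta)
```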

21.2.2 STATIONARITY AND INVERTIBILITY

At several points in the preceding, we have alluded to the notion of stationarity, either directly or indirectly by making certain assumptions about the parameters in the model. In Section 19.3.2, we characterized an AR(1) disturbance process

ut = ρut−1 + εt ,

as stationary if |ρ| < 1 and εt is white noise. Then

E[ut] = 0 for all t,
Var[ut] = σε²/(1 − ρ²),   (21-12)
Cov[ut, us] = ρ^{|t−s|}σε²/(1 − ρ²).

If |ρ| ≥ 1, then the variance and covariances are undefined. In the following, we use εt to denote the white noise innovations in the process. The ARMA(p, q) process will be denoted as in (21-6).


DEFINITION 21.1 Covariance Stationarity
A stochastic process yt is weakly stationary or covariance stationary if it satisfies the following requirements:5

1. E[yt] is independent of t.
2. Var[yt] is a finite, positive constant, independent of t.
3. Cov[yt, ys] is a finite function of |t − s|, but not of t or s.

The third requirement is that the covariance between observations in the series is a function only of how far apart they are in time, not the time at which they occur. These properties clearly hold for the AR(1) process shown earlier. Whether they apply for the other models we have examined remains to be seen. We define the autocovariance at lag k as

λk = Cov[yt , yt−k]. Note that

λ−k = Cov[yt, yt+k] = λk.

Stationarity implies that autocovariances are a function of k, but not of t. For example, in (21-12), we see that the autocovariances of the AR(1) process yt = μ + γ yt−1 + εt are

Cov[yt, yt−k] = γ^k σε²/(1 − γ²), k = 0, 1, ....   (21-13)

If |γ| < 1, then this process is stationary. For any MA(q) series,

yt = μ + εt − θ1εt−1 −···−θqεt−q,

E[yt] = μ + E[εt] − θ1E[εt−1] − ··· − θq E[εt−q] = μ,
Var[yt] = (1 + θ1² + ··· + θq²)σε²,   (21-14)
Cov[yt, yt−1] = (−θ1 + θ1θ2 + θ2θ3 + ··· + θq−1θq)σε²,

and so on until

Cov[yt, yt−(q−1)] = [−θq−1 + θ1θq]σε²,
Cov[yt, yt−q] = −θqσε²,

and, for lags greater than q, the autocovariances are zero. It follows, therefore, that finite moving-average processes are stationary regardless of the values of the parameters. The MA(1) process yt = εt − θεt−1 is an important special case that has Var[yt] = (1 + θ²)σε², λ1 = −θσε², and λk = 0 for |k| > 1. For the AR(1) process, the stationarity requirement is that |γ| < 1, which, in turn, implies that the variance of the moving-average representation in (21-8) is finite.

5Strong stationarity requires that the joint distribution of all sets of observations (yt, yt−1, ...) be invariant to when the observations are made. For practical purposes in econometrics, this statement is a theoretical fine point. Although weak stationarity suffices for our applications, we would not normally analyze weakly stationary time series that were not strongly stationary as well. Indeed, we often go even beyond this step and assume joint normality.


Consider the AR(2) process

yt = μ + γ1 yt−1 + γ2 yt−2 + εt . Write this equation as

C(L)yt = μ + εt,

where

C(L) = 1 − γ1L − γ2L².

Then, if it is possible, we invert this result to produce

yt = [C(L)]^{−1}(μ + εt).

Whether the inversion of the polynomial in the lag operator leads to a convergent series depends on the values of γ1 and γ2. If so, then the moving-average representation will be

yt = Σ_{i=0}^∞ δi(μ + εt−i),

so that

Var[yt] = Σ_{i=0}^∞ δi²σε².

Whether this result is finite or not depends on whether the series of δi's is exploding or converging. For the AR(2) case, the series converges if |γ2| < 1, γ1 + γ2 < 1, and γ2 − γ1 < 1.6 For the more general case, the autoregressive process is stationary if the roots of the characteristic equation,

C(z) = 1 − γ1z − γ2z² − ··· − γpz^p = 0,

have modulus greater than one, or "lie outside the unit circle."7 It follows that if a stochastic process is stationary, it has an infinite moving-average representation (and, if not, it does not). The AR(1) process is the simplest case. The characteristic equation is C(z) = 1 − γz = 0, and its single root is 1/γ. This root lies outside the unit circle if |γ| < 1, which we saw earlier. Finally, consider the inversion of the moving-average process in (21-9). Whether this inversion is possible depends on the coefficients in D(L) in the same fashion that stationarity hinges on the coefficients in C(L). This counterpart to stationarity of an autoregressive process is called invertibility. For it to be possible to invert a moving-average process to produce an autoregressive representation, the roots of D(L) = 0 must be outside the unit circle. Notice, for example, that in (21-5), the inversion of the

6This requirement restricts (γ1, γ2) to within a triangle with points at (2, −1), (−2, −1), and (0, 1).
7The roots may be complex. (See Sections 13.9.2 and 20.4.3.) They are of the form a ± bi, where i = √−1. The unit circle refers to the two-dimensional set of values of a and b defined by a² + b² = 1, which defines a circle centered at the origin with radius 1.


moving-average process is possible only if |θ| < 1. Because the characteristic equation for the MA(1) process is 1 − θ L = 0, the root is 1/θ, which must be larger than one. If the roots of the characteristic equation of a moving-average process all lie outside the unit circle, then the series is said to be invertible. Note that invertibility has no bearing on the stationarity of a process. All moving-average processes with finite coefficients are stationary. Whether an ARMA process is stationary or not depends only on the AR part of the model.
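Checking the stationarity condition amounts to finding the roots of C(z). A short sketch (ours, with illustrative parameter values), using NumPy's polynomial root finder:

```python
import numpy as np

def ar_is_stationary(gammas):
    """Stationarity check for an AR(p): the roots of
    C(z) = 1 - g1*z - ... - gp*z^p must all lie outside the unit circle."""
    # np.roots() wants coefficients ordered from the highest power down.
    coefs = np.r_[-np.asarray(gammas, dtype=float)[::-1], 1.0]
    roots = np.roots(coefs)
    return np.all(np.abs(roots) > 1.0)

# The AR(2) with gamma1 = 0.5, gamma2 = 0.2 satisfies the triangle
# conditions in the text, so its roots lie outside the unit circle.
print(ar_is_stationary([0.5, 0.2]))  # True
print(ar_is_stationary([1.2, 0.3]))  # False: gamma1 + gamma2 > 1
```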

21.2.3 AUTOCORRELATIONS OF A STATIONARY STOCHASTIC PROCESS

The function

λk = Cov[yt , yt−k]

is called the autocovariance function of the process yt. The autocorrelation function, or ACF, is obtained by dividing by the variance, λ0, to obtain

ρk = λk/λ0, −1 ≤ ρk ≤ 1.

For a stationary process, the ACF will be a function of k and the parameters of the process. The ACF is a useful device for describing a time-series process in much the same way that the moments are used to describe the distribution of a random variable. One of the characteristics of a stationary stochastic process is an autocorrelation function that either abruptly drops to zero at some finite lag or eventually tapers off to zero. The AR(1) process provides the simplest example, because

ρk = γ^k,

which is a geometric series that either declines monotonically from ρ0 = 1 if γ is positive or with a damped sawtooth pattern if γ is negative. Note as well that for the process yt = γ yt−1 + εt,

ρk = γρk−1, k ≥ 1,

which bears a noteworthy resemblance to the process itself. For higher-order autoregressive series, the autocorrelations may decline monotonically or may progress in the fashion of a damped sine wave.8 Consider, for example, the second-order autoregression, where we assume without loss of generality that μ = 0 (because we are examining second moments in deviations from the mean):

yt = γ1 yt−1 + γ2 yt−2 + εt .

If the process is stationary, then Var[yt] = Var[yt−s] for all s. Also, Var[yt] = Cov[yt, yt], and Cov[εt, yt−s] = 0 if s > 0. These relationships imply that

λ0 = γ1λ1 + γ2λ2 + σε².

8The behavior is a function of the roots of the characteristic equation. This aspect is discussed in Section 13.9 and especially 13.9.3.


Now, using additional lags, we find that

λ1 = γ1λ0 + γ2λ1, and   (21-15)
λ2 = γ1λ1 + γ2λ0.

These three equations provide the solution:

λ0 = σε²[(1 − γ2)/(1 + γ2)] / [(1 − γ2)² − γ1²].

The variance is unchanging, so we can divide throughout by λ0 to obtain the relationships for the autocorrelations,

ρ1 = γ1ρ0 + γ2ρ1.

Because ρ0 = 1, ρ1 = γ1/(1 − γ2). Using the same procedure for additional lags, we find that

ρ2 = γ1ρ1 + γ2,

so ρ2 = γ1²/(1 − γ2) + γ2. Generally, then, for lags of two or more,

ρk = γ1ρk−1 + γ2ρk−2.

Once again, the autocorrelations follow the same difference equation as the series itself. The behavior of this function depends on γ1, γ2, and k, although not in an obvious way. The inherent behavior of the autocorrelation function can be deduced from the characteristic equation.9 For the second-order process we are examining, the autocorrelations are of the form

ρk = φ1(1/z1)^k + φ2(1/z2)^k,

where the two roots are10

1/z = ½[γ1 ± √(γ1² + 4γ2)].

If the two roots are real, then we know that their reciprocals will be less than one in absolute value, so that ρk will be the sum of two terms that are decaying to zero. If the two roots are complex, then ρk will be the sum of two terms that are oscillating in the form of a damped sine wave. Applications that involve autoregressions of order greater than two are relatively unusual. Nonetheless, higher-order models can be handled in the same fashion. For the AR(p) process

yt = γ1 yt−1 + γ2 yt−2 + ··· + γp yt−p + εt,

the autocovariances will obey the Yule–Walker equations

λ0 = γ1λ1 + γ2λ2 + ··· + γpλp + σε²,

λ1 = γ1λ0 + γ2λ1 +···+γpλp−1,

9The set of results that we would use to derive this result are exactly those we used in Section 20.4.3 to analyze the stability of a dynamic equation, which makes sense, of course, because the equation linking the autocorrelations is a simple difference equation.
10We used the device in Section 20.4.3 to find the characteristic roots. For a second-order equation, the quadratic is easy to manipulate.


and so on. The autocorrelations will once again follow the same difference equation as the original series,

ρk = γ1ρk−1 + γ2ρk−2 + ··· + γpρk−p.

The ACF for a moving-average process is very simple to obtain. For the first-order process,

yt = εt − θεt−1,
λ0 = (1 + θ²)σε²,
λ1 = −θσε²,

then λk = 0 for k > 1. Higher-order processes appear similarly. For the MA(2) process, by multiplying out the terms and taking expectations, we find that

λ0 = (1 + θ1² + θ2²)σε²,
λ1 = (−θ1 + θ1θ2)σε²,
λ2 = −θ2σε²,

λk = 0, k > 2.

The pattern for the general MA(q) process yt = εt − θ1εt−1 − θ2εt−2 −···−θqεt−q is analogous. The signature of a moving-average process is an autocorrelation function that abruptly drops to zero at one lag past the order of the process. As we will explore later, this sharp distinction provides a statistical tool that will help us distinguish between these two types of processes empirically. The mixed process, ARMA(p, q), is more complicated because it is a mixture of the two forms. For the ARMA(1, 1) process

yt = γ yt−1 + εt − θεt−1,

the Yule–Walker equations are

λ0 = E[yt(γ yt−1 + εt − θεt−1)] = γλ1 + σε² − σε²(θγ − θ²),
λ1 = γλ0 − θσε², and

λk = γλk−1, k > 1.

The general characteristic of ARMA processes is that when the moving-average component is of order q, then in the series of autocorrelations there will be an initial q terms that are complicated functions of both the AR and MA parameters, but after q periods,

ρk = γ1ρk−1 + γ2ρk−2 +···+γpρk−p, k > q.
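The ACF recursions are simple to evaluate. As an illustration (our own code, with hypothetical parameter values), the following computes the theoretical ACF of an AR(2) using ρ1 = γ1/(1 − γ2) and the difference equation above; with complex characteristic roots the computed values trace out the damped sine wave described earlier:

```python
import numpy as np

def ar2_acf(gamma1, gamma2, K):
    """Theoretical ACF of a stationary AR(2) via the text's recursions:
    rho_0 = 1, rho_1 = gamma1/(1 - gamma2), and
    rho_k = gamma1*rho_{k-1} + gamma2*rho_{k-2} for k >= 2."""
    rho = np.zeros(K + 1)
    rho[0] = 1.0
    rho[1] = gamma1 / (1.0 - gamma2)
    for k in range(2, K + 1):
        rho[k] = gamma1 * rho[k - 1] + gamma2 * rho[k - 2]
    return rho

print(ar2_acf(0.5, 0.2, 8))   # real roots: smooth decay
print(ar2_acf(1.0, -0.5, 8))  # complex roots: damped oscillation
```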

21.2.4 PARTIAL AUTOCORRELATIONS OF A STATIONARY STOCHASTIC PROCESS

The autocorrelation function ACF(k) gives the gross correlation between yt and yt−k. But as we saw in our analysis of the classical regression model in Section 3.4, a gross correlation such as this one can mask a completely different underlying relationship. In Greene-50558 book June 26, 2007 11:52


this setting, we observe, for example, that a correlation between yt and yt−2 could arise primarily because both variables are correlated with yt−1. Consider the AR(1) process yt = γ yt−1 + εt, where E[εt] = 0, so E[yt] = E[εt]/(1 − γ) = 0. The second gross autocorrelation is ρ2 = γ². But in the same spirit, we might ask what is the correlation between yt and yt−2 net of the intervening effect of yt−1? In this model, if we remove the effect of yt−1 from yt, then only εt remains, and this disturbance is uncorrelated with yt−2. We would conclude that the partial autocorrelation between yt and yt−2 in this model is zero.

DEFINITION 21.2 Partial Autocorrelation Coefficient
The partial correlation between yt and yt−k is the simple correlation between yt−k and yt minus that part explained linearly by the intervening lags. That is,

ρ*k = Corr[yt − E*(yt | yt−1, ..., yt−k+1), yt−k],

where E*(yt | yt−1, ..., yt−k+1) is the minimum mean-squared error predictor of yt by yt−1, ..., yt−k+1.

The function E∗(.) might be the linear regression if the conditional mean happened to be linear, but it might not. The optimal linear predictor is the linear regression, however, so what we have is

ρ*k = Corr[yt − β1 yt−1 − β2 yt−2 − ··· − βk−1 yt−k+1, yt−k],

where β = [β1, β2, ..., βk−1]′ = {Var[yt−1, yt−2, ..., yt−k+1]}^{−1} × Cov[yt, (yt−1, yt−2, ..., yt−k+1)]′. This equation will be recognized as a vector of regression coefficients. As such, what we are computing here (of course) is the correlation between a vector of residuals and yt−k. There are various ways to formalize this computation [see, e.g., Enders (2004)]. One intuitively appealing approach is suggested by the equivalent definition (which is also a prescription for computing it), as follows.

DEFINITION 21.3 Partial Autocorrelation Coefficient
The partial correlation between yt and yt−k is the last coefficient in the linear projection of yt on [yt−1, yt−2, ..., yt−k],

\[
\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_{k-1} \\ \rho_k^* \end{bmatrix}
= \begin{bmatrix} \lambda_0 & \lambda_1 & \cdots & \lambda_{k-2} & \lambda_{k-1} \\ \lambda_1 & \lambda_0 & \cdots & \lambda_{k-3} & \lambda_{k-2} \\ & & \ddots & & \\ \lambda_{k-1} & \lambda_{k-2} & \cdots & \lambda_1 & \lambda_0 \end{bmatrix}^{-1}
\begin{bmatrix} \lambda_1 \\ \lambda_2 \\ \vdots \\ \lambda_k \end{bmatrix}.
\]
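Definition 21.3 is, as noted, a prescription for computation. A direct implementation (our own sketch) solves the projection system at each k from the autocovariances λ0, ..., λK and keeps the last coefficient:

```python
import numpy as np

def pacf_from_autocov(lam, K):
    """PACF at lags 1..K from autocovariances lam[0..K], by solving the
    linear projection system in Definition 21.3 for each k."""
    lam = np.asarray(lam, dtype=float)
    pacf = np.zeros(K + 1)
    pacf[1] = lam[1] / lam[0]
    for k in range(2, K + 1):
        # k x k Toeplitz autocovariance matrix, Var[y_{t-1}, ..., y_{t-k}].
        idx = np.abs(np.subtract.outer(np.arange(k), np.arange(k)))
        R = lam[idx]
        beta = np.linalg.solve(R, lam[1 : k + 1])
        pacf[k] = beta[-1]  # last projection coefficient = rho*_k
    return pacf[1:]
```

For example, feeding in the AR(1) autocovariances λk = γ^k λ0 returns γ at lag 1 and (numerically) zero thereafter, the signature pattern described below.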


As before, there are some distinctive patterns for particular time-series processes. Consider first the autoregressive processes,

yt = γ1 yt−1 + γ2 yt−2 +···+γp yt−p + εt .

We are interested in the last coefficient in the projection of yt on yt−1, then on [yt−1, yt−2], and so on. The first of these is the simple regression coefficient of yt on yt−1, so

ρ*1 = Cov[yt, yt−1]/Var[yt−1] = λ1/λ0 = ρ1.

The first partial autocorrelation coefficient for any process equals the first autocorrelation coefficient. Without doing the messy algebra, we also observe that for the AR(p) process, ρ*1 is a mixture of all the γ coefficients. Of course, if p equals 1, then ρ*1 = ρ1 = γ. For the higher-order processes, the autocorrelations are likewise mixtures of the autoregressive coefficients until we reach ρ*p. In view of the form of the AR(p) model, the last coefficient in the linear projection on p lagged values is γp. Also, we can see the signature pattern of the AR(p) process: any additional partial autocorrelations must be zero, because they will be simply ρ*k = Corr[εt, yt−k] = 0 if k > p.

Combining results thus far, we have the characteristic pattern for an autoregressive process. The ACF, ρk, will gradually decay to zero, either monotonically if the characteristic roots are real or in a sinusoidal pattern if they are complex. The PACF, ρ*k, will be irregular out to lag p, when they abruptly drop to zero and remain there.

The moving-average process has the mirror image of this pattern. We have already examined the ACF for the MA(q) process; it has q irregular spikes, then it falls to zero and stays there. For the PACF, write the model as

yt = (1 − θ1L − θ2L² − ··· − θqL^q)εt.

If the series is invertible, which we will assume throughout, then we have

yt/(1 − θ1L − ··· − θqL^q) = εt,

or

yt = π1 yt−1 + π2 yt−2 + ··· + εt = Σ_{i=1}^∞ πi yt−i + εt.

The autoregressive form of the MA(q) process has an infinite number of terms, which means that the PACF will not fall off to zero the way that the PACF of the AR process does. Rather, the PACF of an MA process will resemble the ACF of an AR process. For example, for the MA(1) process yt = εt − θεt−1, the AR representation is yt = θyt−1 + θ²yt−2 + ··· + εt, which is the familiar form of an AR(1) process. Thus, the PACF of an MA(1) process is identical to the ACF of an AR(1) process, ρ*k = θ^k.

The ARMA(p, q) is a mixture of the two types of processes, so its ACF and PACF are likewise mixtures of the two forms discussed above. Generalities are difficult to draw, but normally, the ACF of an ARMA process will have a few distinctive spikes in the early lags corresponding to the number of MA terms, followed by the characteristic


smooth pattern of the AR part of the model. High-order MA processes are relatively uncommon in general, and high-order AR processes (greater than two) seem primarily to arise in the form of the nonstationary processes described in the next section. For a stationary process, the workhorses of the applied literature are the (2, 0) and (1, 1) processes. For the ARMA(1, 1) process, both the ACF and the PACF will display a distinctive spike at lag 1 followed by an exponentially decaying pattern thereafter.

21.2.5 MODELING UNIVARIATE TIME SERIES

The preceding discussion is largely descriptive. There is no underlying economic theory that states why a compact ARMA(p, q) representation should adequately describe the movement of a given economic time series. Nonetheless, as a methodology for building forecasting models, this set of tools and its empirical counterpart have proved as good as and even superior to much more elaborate specifications (perhaps to the consternation of the builders of large macroeconomic models).11 Box and Jenkins (1984) pioneered a forecasting framework based on the preceding that has been used in a great many fields and that has, certainly in terms of numbers of applications, largely supplanted the use of large integrated econometric models. Box and Jenkins's approach to modeling a stochastic process can be motivated by the following.

THEOREM 21.1 Wold’s Decomposition Theorem Every zero-mean covariance stationary stochastic process can be represented in the form ∞ ∗ yt = E [yt | yt−1, yt−2,...,yt−p] + πi εt−i , i=0 where εt is white noise, π0 = 1, and the weights are square summable—that is, ∞ π 2 < ∞ i i=1 ∗ —E [yt | yt−1, yt−2,...,yt−p] is the optimal linear predictor of yt based on its ∗ ε lagged values, and the predictor Et is uncorrelated with t−i .

Thus, the theorem decomposes the process generating yt into

E*t = E*[yt | yt−1, yt−2, ..., yt−p] = the linearly deterministic component

and

Σ_{i=0}^∞ πi εt−i = the linearly indeterministic component.

11This observation can be overstated. Even the most committed advocate of the Box–Jenkins methods would concede that an ARMA model of, for example, housing starts will do little to reveal the link between the interest rate policies of the Federal Reserve and their variable of interest. That is, the covariation of economic variables remains as interesting as ever. Greene-50558 book June 26, 2007 11:52


The theorem states that for any stationary stochastic process, for a given choice of p, there is a Wold representation of the stationary series

yt = Σ_{i=1}^p γi yt−i + Σ_{i=0}^∞ πi εt−i.

Note that for a specific ARMA(P, Q) process, if p ≥ P, then πi = 0 for i > Q. For practical purposes, the problem with the Wold representation is that we cannot estimate the infinite number of parameters needed to produce the full right-hand side, and, of course, P and Q are unknown. The compromise, then, is to base an estimate of the representation on a model with a finite number of moving-average terms. We can seek the one that best fits the data in hand. It is important to note that neither the ARMA representation of a process nor the Wold representation is unique. In general terms, suppose that the process generating yt is

C(L)yt = D(L)εt.

We assume that D(L) is finite but C(L) need not be. Let G(L) be some other polynomial in the lag operator with roots that are outside the unit circle. Then

[G(L)/D(L)]C(L)yt = [G(L)/D(L)]D(L)εt,

or

C*(L)yt = G(L)εt, where C*(L) = [G(L)/D(L)]C(L).

The new representation is fully equivalent to the old one, but it might have a different number of autoregressive parameters, which is exactly the point of the Wold decomposition. The implication is that part of the model-building process will be to determine the lag structures. Further discussion on the methodology is given by Box and Jenkins (1984).

The Box–Jenkins approach to modeling stochastic processes consists of the following steps:

1. Satisfactorily transform the data so as to obtain a stationary series. This step will usually mean taking first differences, logs, or both to obtain a series whose autocorrelation function eventually displays the characteristic exponential decay of a stationary series.
2. Estimate the parameters of the resulting ARMA model, generally by nonlinear least squares.
3. Generate the set of residuals from the estimated model and verify that they satisfactorily resemble a white noise series. If not, respecify the model and return to step 2.
4. The model can now be used for forecasting purposes.

Space limitations prevent us from giving a full presentation of the set of techniques. Because this methodology has spawned a mini-industry of its own, however, there is no shortage of book-length analyses and prescriptions to which the reader may refer. Five to consider are the canonical source, Box and Jenkins (1984), Granger and Newbold


(1996), Mills (1993), Enders (2004), and Patterson (2000). Some of the aspects of the estimation and analysis steps do have broader relevance for our work here, so we will continue to examine them in some detail.

21.2.6 ESTIMATION OF THE PARAMETERS OF A UNIVARIATE TIME SERIES

The broad problem of regression estimation with time-series data, which carries through to all the discussions of this chapter, is that the consistency and asymptotic normality results that we derived based on random sampling will no longer apply. For example, for a stationary series, we have assumed that Var[yt] = λ0 regardless of t. But we have yet to establish that an estimated variance,

c0 = [1/(T − 1)] Σ_{t=1}^T (yt − ȳ)²,

will converge to λ0, or anything else for that matter. It is necessary to assume that the process is ergodic. (We first encountered this assumption in Section 19.4.1; see Definition 19.3.) Ergodicity is a crucial element of our theory of estimation. When a time series has this property (with stationarity), then we can consider estimation of parameters in a meaningful sense. If the process is stationary and ergodic, then, by the Ergodic theorem (Theorems 19.1 and 19.2), moments such as ȳ and c0 converge to their population counterparts μ and λ0.12 The essential component of the condition is one that we have met at many points in this discussion, that autocovariances must decline sufficiently rapidly as the separation in time increases. It is possible to construct theoretical examples of processes that are stationary but not ergodic, but for practical purposes, a stationarity assumption will be sufficient for us to proceed with estimation. For example, in our models of stationary processes, if we assume that εt ∼ N[0, σ²], which is common, then the stationary processes are ergodic as well.

Estimation of the parameters of a time-series process must begin with a determination of the type of process that we have in hand. (Box and Jenkins label this the identification step. But identification is a term of art in econometrics, so we will steer around that admittedly standard name.) For this purpose, the empirical estimates of the autocorrelation and partial autocorrelation functions are useful tools. The sample counterpart to the ACF is the correlogram,

rk = [Σ_{t=k+1}^T (yt − ȳ)(yt−k − ȳ)] / [Σ_{t=1}^T (yt − ȳ)²].

A plot of rk against k provides a description of a process and can be used to help discern what type of process is generating the data. The sample PACF is the counterpart to the ACF, but net of the intervening lags; that is,

r*k = [Σ_{t=k+1}^T y*t y*t−k] / [Σ_{t=k+1}^T (y*t−k)²],

12The formal conditions for ergodicity are quite involved; see Davidson and MacKinnon (1993) or Hamilton (1994, Chapter 7).


where y*t and y*t−k are residuals from the regressions of yt and yt−k on [1, yt−1, yt−2, ..., yt−k+1]. We have seen this at many points before; r*k is simply the last linear least squares regression coefficient in the regression of yt on [1, yt−1, yt−2, ..., yt−k+1, yt−k]. Plots of the ACF and PACF of a series are usually presented together. Because the sample estimates of the autocorrelations and partial autocorrelations are not likely to be identically zero even when the population values are, we use diagnostic tests to discern whether a time series appears to be nonautocorrelated.13 Individual sample autocorrelations will be approximately distributed with mean zero and variance 1/T under the hypothesis that the series is white noise. The Box–Pierce (1970) statistic

Q = T Σ_{k=1}^p rk²

is commonly used to test whether a series is white noise. Under the null hypothesis that the series is white noise, Q has a limiting chi-squared distribution with p degrees of freedom. A refinement that appears to have better finite-sample properties is the Ljung–Box (1979) statistic,

Q′ = T(T + 2) Σ_{k=1}^p rk²/(T − k).

The limiting distribution of Q′ is the same as that of Q. The process of finding the appropriate specification is essentially trial and error. An initial specification based on the sample ACF and PACF can be found. The parameters of the model can then be estimated by least squares. For pure AR(p) processes, the estimation step is simple. The parameters can be estimated by linear least squares. If there are moving-average terms, then linear least squares is inconsistent, but the parameters of the model can be fit by nonlinear least squares. Once the model has been estimated, a set of residuals is computed to assess the adequacy of the specification. In an AR model, the residuals are just the deviations from the regression line.

The adequacy of the specification can be examined by applying the foregoing techniques to the estimated residuals. If they appear satisfactorily to mimic a white noise process, then analysis can proceed to the forecasting step. If not, a new specification should be considered.
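All of these sample quantities are easily computed. The following sketch (ours) produces the correlogram and both white noise statistics for a series y; under the null hypothesis each statistic would be compared with the chi-squared table value for p degrees of freedom.

```python
import numpy as np

def correlogram(y, p):
    """Sample ACF r_1..r_p plus the Box-Pierce and Ljung-Box statistics."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    d = y - y.mean()
    denom = np.sum(d**2)
    r = np.array([np.sum(d[k:] * d[:-k]) / denom for k in range(1, p + 1)])
    Q_bp = T * np.sum(r**2)                                        # Box-Pierce
    Q_lb = T * (T + 2) * np.sum(r**2 / (T - np.arange(1, p + 1)))  # Ljung-Box
    return r, Q_bp, Q_lb

# Under white noise each r_k has approximate variance 1/T, and both Q
# statistics are approximately chi-squared with p degrees of freedom.
rng = np.random.default_rng(seed=1)
r, Q_bp, Q_lb = correlogram(rng.normal(size=200), p=14)
```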

Example 21.1 ACF and PACF for a Series of Bond Yields
Appendix Table F21.1 lists five years of monthly averages of the yield on a Moody's Aaa-rated corporate bond. (Note: In previous editions of this text, the second observation in the data file was incorrectly recorded as 9.72. The correct value is 9.22. Computations to follow are based on the corrected value.) The series is plotted in Figure 21.1. From the figure, it would appear that stationarity may not be a reasonable assumption. We will return to this question below. The ACF and PACF for the original series are shown in Table 21.1, with the diagnostic statistics discussed earlier. Based on the two spikes in the PACF, the results appear to be consistent with an AR(2) process, although the ACF at longer lags seems a bit more persistent than might have been expected. Once again, this condition may indicate that the series is not stationary. Maintaining that assumption for the present, we computed the residuals from an AR(2) model and

13The LM test discussed in Section 19.7.1 is one of these.


FIGURE 21.1 Monthly Data on Bond Yields. [Plot of the yield (×10) by month, 1990.1 to 1994.12.]

subjected them to the same test as the original series. To compute the regression, we first subtracted the overall mean from all 60 observations. We then fit the AR(2) without the first two observations. The coefficients of the AR(2) process are 1.4970 and −0.4986, which also satisfy the restrictions for stationarity given in Section 21.2.2. Despite the earlier suggestions, the residuals do appear to resemble a stationary process. (See Table 21.2.)

TABLE 21.1 ACF and PACF for Bond Yields
Time-series identification for YIELD
Box–Pierce statistic = 326.0507; degrees of freedom = 14; significance level = 0.0000
Box–Ljung statistic = 364.6475; degrees of freedom = 14; significance level = 0.0000
* ⇒ |coefficient| > 2/sqrt(N), or significant at the 95 percent level. PACF is computed using the Yule–Walker equations.

Lag    Autocorrelation    Box–Pierce Q    Partial Autocorrelation
 1        0.973*              56.81*            0.973*
 2        0.922*             107.76*           −0.477*
 3        0.863*             152.47*            0.057
 4        0.806*             191.43*            0.021
 5        0.745*             224.71*           −0.186
 6        0.679*             252.39*           −0.046
 7        0.606*             274.44*           −0.174
 8        0.529*             291.22*           −0.039
 9        0.450*             303.37*           −0.049
10        0.379*             311.98*            0.146
11        0.316*             317.95*           −0.023
12        0.259*             321.97*           −0.001
13        0.205              324.49*           −0.018
14        0.161              326.05*            0.185
Note: *s in the first column and bars in the right-hand panel have changed from the earlier edition. [The printed table also displays the ACF and PACF as bar plots, omitted here.]


TABLE 21.2 ACF and PACF for Residuals
Time-series identification for U
Box–Pierce statistic = 10.7650; degrees of freedom = 14; significance level = 0.7044
Box–Ljung statistic = 12.4641; degrees of freedom = 14; significance level = 0.5691
* ⇒ |coefficient| > 2/sqrt(N), or significant at the 95 percent level. PACF is computed using the Yule–Walker equations.

Lag    Autocorrelation    Box–Pierce Q    Partial Autocorrelation
 1        0.084               0.41             0.084
 2       −0.120               1.25            −0.138
 3       −0.241               4.61            −0.242
 4        0.095               5.12             0.137
 5        0.137               6.22             0.104
 6        0.121               7.06             0.102
 7       −0.084               7.46            −0.048
 8        0.049               7.60             0.184
 9       −0.169               9.26            −0.327*
10        0.025               9.30             0.025
11       −0.005               9.30             0.037
12        0.003               9.30            −0.100
13       −0.137              10.39            −0.203
14       −0.081              10.77            −0.167
Note: The * in the third column and bars in both panels have changed from the earlier edition. [Bar plots omitted here.]

21.3 THE FREQUENCY DOMAIN

For the analysis of macroeconomic flow data such as output and consumption, and aggregate economic index series such as the price level and the rate of unemployment, the tools described in the previous sections have proved quite satisfactory. The low frequency of observation (yearly, quarterly, or, occasionally, monthly) and very significant aggregation (both across time and of individuals) make these data relatively smooth and straightforward to analyze. Much contemporary economic analysis, especially financial econometrics, has dealt with more disaggregated, microlevel data, observed at far greater frequency. Some important examples are stock market data, for which daily returns data are routinely available, and exchange rate movements, which have been tabulated on an almost continuous basis. In these settings, analysts have found that the tools of spectral analysis, and the frequency domain, have provided many useful results and have been applied to great advantage. This section introduces a small amount of the terminology of spectral analysis to acquaint the reader with a few basic features of the technique. For those who desire further detail, Fuller (1976), Granger and Newbold (1996), Hamilton (1994), Chatfield (1996), Shumway (1988), and Hatanaka (1996) (among many others with direct application in economics) are excellent introductions. Most of the following is based on Chapter 6 of Hamilton (1994). In this framework, we view an observed time series as a weighted sum of underlying series that have different cyclical patterns. For example, aggregate retail sales and construction data display several different kinds of cyclical variation, including a regular seasonal pattern and longer frequency variation associated with variation in the economy as a whole over the business cycle. The total variance of an observed time series


may thus be viewed as a sum of the contributions of these underlying series, which vary at different frequencies. The standard application we consider is how spectral analysis is used to decompose the variance of a time series.

21.3.1 Theoretical Results

Let $\{y_t\}_{t=-\infty}^{\infty}$ define a zero-mean, stationary time-series process. The autocovariance at lag k was defined in Section 21.2.2 as

$$\lambda_k = \lambda_{-k} = \mathrm{Cov}[y_t, y_{t-k}].$$
We assume that the series of autocovariances $\lambda_k$ is absolutely summable; $\sum_{k=0}^{\infty}|\lambda_k|$ is finite. The autocovariance generating function for this time-series process is
$$g_Y(z) = \sum_{k=-\infty}^{\infty} \lambda_k z^k.$$
We evaluate this function at the complex value $z = \exp(i\omega)$, where $i = \sqrt{-1}$ and $\omega$ is a real number, and divide by $2\pi$ to obtain the spectrum, or spectral density function, of the time-series process,
$$h_Y(\omega) = \frac{1}{2\pi}\sum_{k=-\infty}^{\infty} \lambda_k e^{-i\omega k}. \tag{21-16}$$
The spectral density function is a characteristic of the time-series process very much like the sequence of autocovariances (or the sequence of moments for a probability distribution). For a time-series process that has the set of autocovariances $\lambda_k$, the spectral density can be computed at any particular value of $\omega$. Several results can be combined to simplify $h_Y(\omega)$:

1. Symmetry of the autocovariances, $\lambda_k = \lambda_{-k}$;
2. DeMoivre's theorem, $\exp(\pm i\omega k) = \cos(\omega k) \pm i\sin(\omega k)$;
3. Polar values, $\cos(0) = 1$, $\cos(\pi) = -1$, $\sin(0) = 0$, $\sin(\pi) = 0$;
4. Symmetries of the sine and cosine functions, $\sin(-\omega) = -\sin(\omega)$ and $\cos(-\omega) = \cos(\omega)$.

One of the convenient consequences of result 2 is $\exp(i\omega k) + \exp(-i\omega k) = 2\cos(\omega k)$, which is always real. These equations can be combined to simplify the spectrum,
$$h_Y(\omega) = \frac{1}{2\pi}\left[\lambda_0 + 2\sum_{k=1}^{\infty} \lambda_k \cos(\omega k)\right], \quad \omega \in [0, \pi]. \tag{21-17}$$
This is a strictly real-valued, continuous function of $\omega$. Because the cosine function is cyclic with period $2\pi$, $h_Y(\omega) = h_Y(\omega + M2\pi)$ for any integer $M$, which implies that the entire spectrum is known if its values for $\omega$ from 0 to $\pi$ are known. [Because $\cos(-\omega) = \cos(\omega)$, $h_Y(\omega) = h_Y(-\omega)$, so the values of the spectrum for $\omega$ from 0 to $-\pi$ are the same as those from 0 to $+\pi$.] There is also a correspondence between the spectrum and the autocovariances,
$$\lambda_k = \int_{-\pi}^{\pi} h_Y(\omega)\cos(k\omega)\, d\omega,$$


which we can interpret as indicating that the sequence of autocovariances and the spectral density function are just two different ways of looking at the same time-series process (in the first case, in the "time domain," and in the second case, in the "frequency domain," hence the name for this analysis). The spectral density function is a function of the infinite sequence of autocovariances. For ARMA processes, however, the autocovariances are functions of a usually small number of parameters, so $h_Y(\omega)$ will generally simplify considerably. For the ARMA(p, q) process defined in (21-6),

$$(y_t - \mu) = \gamma_1(y_{t-1} - \mu) + \cdots + \gamma_p(y_{t-p} - \mu) + \varepsilon_t - \theta_1\varepsilon_{t-1} - \cdots - \theta_q\varepsilon_{t-q},$$
or

$$C(L)(y_t - \mu) = D(L)\varepsilon_t,$$
where $C(L)$ and $D(L)$ are the autoregressive and moving-average lag polynomials, the autocovariance generating function is
$$g_Y(z) = \sigma^2\,\frac{D(z)D(1/z)}{C(z)C(1/z)} = \sigma^2 \Pi(z)\Pi(1/z),$$
where $\Pi(z) = D(z)/C(z)$ gives the sequence of coefficients in the infinite moving-average representation of the series. In some cases, this result can be used explicitly to derive the spectral density function. The spectral density function can be obtained from this relationship through
$$h_Y(\omega) = \frac{\sigma^2}{2\pi}\,\Pi(e^{-i\omega})\Pi(e^{i\omega}).$$

Example 21.2 Spectral Density Function for an AR(1) Process
For an AR(1) process with autoregressive parameter $\rho$, $y_t = \rho y_{t-1} + \varepsilon_t$, $\varepsilon_t \sim N[0, 1]$, the lag polynomials are $D(z) = 1$ and $C(z) = 1 - \rho z$. The autocovariance generating function is
$$g_Y(z) = \frac{\sigma^2}{(1 - \rho z)(1 - \rho/z)} = \frac{\sigma^2}{1 + \rho^2 - \rho(z + 1/z)} = \frac{\sigma^2}{1 + \rho^2}\sum_{i=0}^{\infty}\left(\frac{\rho}{1 + \rho^2}\right)^i\left(\frac{1 + z^2}{z}\right)^i.$$
The spectral density function is
$$h_Y(\omega) = \frac{\sigma^2}{2\pi}\,\frac{1}{[1 - \rho\exp(-i\omega)][1 - \rho\exp(i\omega)]} = \frac{\sigma^2}{2\pi}\,\frac{1}{1 + \rho^2 - 2\rho\cos(\omega)}.$$
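As a quick numerical check of Example 21.2, the following Python sketch evaluates the closed-form AR(1) spectrum and compares it with a direct, truncated evaluation of (21-17) using the AR(1) autocovariances $\lambda_k = \sigma^2\rho^k/(1-\rho^2)$. The value ρ = 0.5 is an arbitrary illustration.

```python
import numpy as np

def ar1_spectrum(omega, rho, sigma2=1.0):
    # Closed form from Example 21.2.
    return sigma2 / (2 * np.pi * (1 + rho ** 2 - 2 * rho * np.cos(omega)))

def ar1_spectrum_from_acov(omega, rho, sigma2=1.0, K=500):
    # Truncated version of (21-17) with lambda_k = sigma2 * rho^k / (1 - rho^2).
    lam0 = sigma2 / (1 - rho ** 2)
    k = np.arange(1, K + 1)
    lam = lam0 * rho ** k
    terms = 2 * np.sum(lam * np.cos(np.outer(omega, k)), axis=1)
    return (lam0 + terms) / (2 * np.pi)

w = np.linspace(0.1, np.pi, 4)
print(ar1_spectrum(w, 0.5))
print(ar1_spectrum_from_acov(w, 0.5))   # agrees up to truncation error
```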

For the general case suggested at the outset, $C(L)(y_t - \mu) = D(L)\varepsilon_t$, there is a template we can use, which, if not simple, is at least transparent. Let $\alpha_i$ be the reciprocal of a root of the characteristic polynomial for the autoregressive part of the model, $C(1/\alpha_i) = 0$, $i = 1, \ldots, p$, and let $\delta_j$, $j = 1, \ldots, q$, be the same for the moving-average part of the model. Then
$$h_Y(\omega) = \frac{\sigma^2}{2\pi}\,\frac{\prod_{j=1}^{q}\left[1 + \delta_j^2 - 2\delta_j\cos(\omega)\right]}{\prod_{i=1}^{p}\left[1 + \alpha_i^2 - 2\alpha_i\cos(\omega)\right]}.$$


Some of the roots of either polynomial may be complex pairs, but in this case, the product for adjacent pairs $(a \pm bi)$ is real, so the function is always real valued. [Note also that $(a \pm bi)^{-1} = (a \mp bi)/(a^2 + b^2)$.] For purposes of our initial objective, decomposing the variance of the time series, our final useful theoretical result is
$$\int_{-\pi}^{\pi} h_Y(\omega)\, d\omega = \lambda_0.$$
Thus, the total variance can be viewed as the sum of the spectral densities over all possible frequencies. (More precisely, it is the area under the spectral density.) Once again exploiting the symmetry of the cosine function, we can rewrite this equation in the form
$$2\int_{0}^{\pi} h_Y(\omega)\, d\omega = \lambda_0.$$
Consider, then, integration over only some of the frequencies;
$$\frac{2}{\lambda_0}\int_{0}^{\omega_j} h_Y(\omega)\, d\omega = \tau(\omega_j), \quad 0 < \omega_j \le \pi, \quad 0 < \tau(\omega_j) \le 1.$$
Thus, $\tau(\omega_j)$ can be interpreted as the proportion of the total variance of the time series that is associated with frequencies less than or equal to $\omega_j$.

21.3.2 Empirical Counterparts

We have in hand a sample of observations, $y_t$, $t = 1, \ldots, T$. The first task is to establish a correspondence between the frequencies $0 < \omega \le \pi$ and something of interest in the sample. The lowest frequency we could observe would be once in the entire sample period, so we map $\omega_1$ to $2\pi/T$. The highest would then be $\omega_T = 2\pi$, and the intervening values will be $2\pi j/T$, $j = 2, \ldots, T - 1$. It may be more convenient to think in terms of period rather than frequency. The number of periods per cycle corresponds to $T/j = 2\pi/\omega_j$. Thus, the lowest frequency, $\omega_1$, corresponds to the highest period, $T$ "dates" (months, quarters, years, etc.).

There are a number of ways to estimate the population spectral density function. The obvious way is the sample counterpart to the population spectrum. The sample of $T$ observations provides the variance and $T - 1$ distinct sample autocovariances

$$c_k = c_{-k} = \frac{1}{T}\sum_{t=k+1}^{T}(y_t - \bar{y})(y_{t-k} - \bar{y}), \quad \bar{y} = \frac{1}{T}\sum_{t=1}^{T} y_t, \quad k = 0, 1, \ldots, T - 1,$$
so we can compute the sample periodogram, which is
$$\hat{h}_Y(\omega) = \frac{1}{2\pi}\left[c_0 + 2\sum_{k=1}^{T-1} c_k \cos(\omega k)\right].$$
The sample periodogram is a natural estimator of the spectrum, but it has a statistical flaw. With the sample variance and the $T - 1$ autocovariances, we are estimating $T$ parameters with $T$ observations. The periodogram is, in the end, $T$ transformations of these $T$ estimates. As such, there are no "degrees of freedom"; the estimator does not improve as the sample size increases. A number of methods have been suggested for improving the behavior of the estimator. Two common ways are truncation and


windowing [see Chatfield (1996, pp. 139–143)]. The truncated estimator of the periodogram is based on a subset of the first $L < T$ autocovariances. The choice of $L$ is a problem because there is no theoretical guidance. Chatfield (1996) suggests that $L$ approximately equal to $2\sqrt{T}$ is large enough to provide resolution while removing some of the sampling variation induced by the long lags in the untruncated estimator. The second mechanism for improving the properties of the estimator is a set of weights called a lag window. The revised estimator is
$$\hat{h}_Y(\omega) = \frac{1}{2\pi}\left[w_0 c_0 + 2\sum_{k=1}^{L} w_k c_k \cos(\omega k)\right],$$

where the set of weights, $\{w_k, k = 0, \ldots, L\}$, is the lag window. One choice for the weights is the Bartlett window, which produces
$$\hat{h}_{Y,\mathrm{Bartlett}}(\omega) = \frac{1}{2\pi}\left[c_0 + 2\sum_{k=1}^{L} w(k, L)\, c_k \cos(\omega k)\right], \quad w(k, L) = 1 - \frac{k}{L + 1}.$$
Note that this is the same set of weights used in the Newey–West robust covariance matrix estimator in Chapter 19, with essentially the same motivation. Two others that are commonly used are the Tukey window, which has $w_k = \frac{1}{2}[1 + \cos(\pi k/L)]$, and the Parzen window, $w_k = 1 - 6[(k/L)^2 - (k/L)^3]$ if $k \le L/2$, and $w_k = 2(1 - k/L)^3$ otherwise. If the series has been modeled as an ARMA process, we can instead compute the fully parametric estimator based on our sample estimates of the roots of the autoregressive and moving-average polynomials. This second estimator would be
$$\hat{h}_{Y,\mathrm{ARMA}}(\omega_k) = \frac{\hat{\sigma}^2}{2\pi}\,\frac{\prod_{j=1}^{q}\left[1 + d_j^2 - 2d_j\cos(\omega_k)\right]}{\prod_{i=1}^{p}\left[1 + a_i^2 - 2a_i\cos(\omega_k)\right]}.$$
Others have been suggested. [See Chatfield (1996, Chapter 7).] Finally, with the empirical estimate of the spectrum, the variance decomposition can be approximated by summing the values around the frequencies of interest.
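A minimal Python sketch of the Bartlett-windowed estimator just described, with Chatfield's $L \approx 2\sqrt{T}$ truncation. The simulated series stands in for a real one such as the GNP growth rate of Example 21.3.

```python
import numpy as np

def autocovariances(y, L):
    # c_k = (1/T) sum_{t=k+1}^T (y_t - ybar)(y_{t-k} - ybar), k = 0, ..., L.
    d = np.asarray(y, dtype=float) - np.mean(y)
    T = len(d)
    return np.array([np.sum(d[k:] * d[:T - k]) / T for k in range(L + 1)])

def bartlett_spectrum(y, omegas, L=None):
    # Lag-window estimator with Bartlett weights w(k, L) = 1 - k/(L + 1).
    T = len(y)
    if L is None:
        L = int(2 * np.sqrt(T))        # Chatfield's suggestion
    c = autocovariances(y, L)
    k = np.arange(1, L + 1)
    w = 1 - k / (L + 1)
    return np.array([(c[0] + 2 * np.sum(w * c[1:] * np.cos(om * k))) / (2 * np.pi)
                     for om in omegas])

rng = np.random.default_rng(1)
y = rng.standard_normal(135)                      # placeholder series
omegas = 2 * np.pi * np.arange(1, 68) / 135       # w_j = 2 pi j / T, j = 1, ..., 67
h_hat = bartlett_spectrum(y, omegas)
```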

Example 21.3 Spectral Analysis of the Growth Rate of Real GNP
Appendix Table F21.2 lists quarterly observations on U.S. GNP and the implicit price deflator for GNP for 1950 through 1983. The GNP series, with its upward trend, is obviously nonstationary. We will analyze instead the quarterly growth rate, 100[log(GNP_t/price_t) − log(GNP_{t−1}/price_{t−1})]. Figure 21.2 shows the resulting data. The differenced series has 135 observations. Figure 21.3 plots the sample periodogram, with frequencies scaled so that ω_j = (j/T)2π. The figure shows the sample periodogram for j = 1, ..., 67; because the values of the spectrum for j = 68, ..., 134 are a mirror image of the first half, we have omitted them. Figure 21.3 shows peaks at several frequencies. The effect is more easily visualized in terms of the periods of these cyclical components. The second row of labels shows the periods in years, computed as T/(2j), where T = 67. There are distinct masses around 2 to 3 years that correspond roughly to the "business cycle" of this era. One might also expect seasonal effects in these quarterly data, and there are discernible spikes in the periodogram at about 0.3 year (one quarter). These spikes, however, are minor compared with the other effects in the figure. This is to be expected, because the data are seasonally adjusted already. Finally, there is a pronounced spike at about 6 years in the periodogram. The original data in Figure 21.2 do seem consistent with this result, with substantial recessions coming at intervals of 5 to 7 years from 1953 to 1980. To underscore these results, consider what we would obtain if we analyzed the original (log) real GNP series instead of the growth rates. Figure 21.4 shows the raw data. Although there does appear to be some short-run (high-frequency) variation (around a long-run trend,


for example), the cyclical variation of this series is obviously dominated by the upward trend. If this series were viewed as a single periodic series, then we would surmise that the period of this cycle would be the entire sample interval. The frequency of the dominant part of this time series seems to be quite close to zero. The periodogram for this series, shown in Figure 21.5, is consistent with that suspicion. By far, the largest component of the spectrum is provided by frequencies close to zero.

FIGURE 21.2 Growth Rate of U.S. Real GNP, Quarterly, 1953 to 1984. [Plot of GNP growth by quarter, 1950 to 1985.]

FIGURE 21.3 Sample Periodogram. [Spectrum plotted against j = 0 to 70; a second row of axis labels gives the corresponding periods, 34, 2.4, 1.2, 0.8, 0.6, 0.5.]


FIGURE 21.4 Quarterly Data on Real GNP. [Plot of real GNP by quarter, 1950 to 1985.]

FIGURE 21.5 Spectrum for Real GNP. [Spectrum plotted against k = 0 to 70.]

A Computational Note The computation in (21-16) or (21-17) is the discrete Fourier transform of the series of autocovariances. In principle, it involves an enormous amount of computation, on the order of $T^2$ sets of computations. For ordinary time series involving up to a few hundred observations, this work is not particularly onerous. (The preceding computations involving 135 observations took a total of perhaps 20 seconds of


computing time.) For series involving multiple thousands of observations, such as daily market returns, or far more, such as in recorded exchange rates and forward premiums, the amount of computation could become prohibitive. However, the computation can be done using an important tool, the fast Fourier transform (FFT), which reduces the computational burden to $O(T \log_2 T)$, many orders of magnitude less than $T^2$. The FFT is programmed in some econometric software packages, such as RATS and Matlab. [See Press et al. (1986) for further discussion.]
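To illustrate the point, the identity $\hat{h}_Y(\omega) = (1/2\pi T)\left|\sum_t (y_t - \bar{y})e^{-i\omega t}\right|^2$ lets the entire periodogram be computed with a single FFT. A minimal numpy sketch (the choice of numpy's FFT routine is a convenience; any FFT implementation would serve):

```python
import numpy as np

def periodogram_fft(y):
    # Sample periodogram at the Fourier frequencies w_j = 2 pi j / T,
    # computed in O(T log T).  Algebraically identical to evaluating
    # (1/2pi)[c_0 + 2 sum_k c_k cos(w_j k)] with all T - 1 autocovariances.
    d = np.asarray(y, dtype=float)
    d = d - d.mean()
    T = len(d)
    dft = np.fft.rfft(d)
    I = np.abs(dft) ** 2 / (2 * np.pi * T)
    freqs = 2 * np.pi * np.arange(len(dft)) / T
    return freqs, I

rng = np.random.default_rng(2)
freqs, I = periodogram_fft(rng.standard_normal(4096))
```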

21.4 SUMMARY AND CONCLUSIONS

This chapter has developed the standard tools for analyzing a stationary time series. The analysis takes place in one of two frameworks, the time domain or the frequency domain. The analysis in the time domain focuses on the different representations of the series in terms of autoregressive and moving-average components. This interpretation of the time series is closely related to the concept of regression—though in this case it is "auto" or self-regression, that is, regression on past values of the random variable itself. The autocorrelations and partial autocorrelations are the central tools for characterizing a time series in this framework. Constructing a time series in this fashion allows the analyst to construct forecasting equations that exploit the internal structure of the time series. (We have left for additional courses and the many references on the subject the embellishments of these models in terms of seasonal patterns, differences, and so on, that are the staples of Box–Jenkins model building.) The other approach to time-series analysis is in the frequency domain. In this representation, the time series is decomposed into a sum of components that vary cyclically with different periods (and frequencies), providing a view of how each component contributes to the overall variation of the series. The analysis of this chapter is a prelude to the analysis of nonstationary series in the next chapter. Nonstationarity is, in large measure, the norm in recent time-series modeling.

Key Terms and Concepts

• Autocorrelation function • Autocovariance • Autocovariance function • Autocovariance generating function • Autoregression • Autoregressive component • Autoregressive form • Autoregressive moving average • Characteristic equation • Correlogram • Covariance stationarity • Discrete Fourier transform • Ergodic • Frequency domain • Identification • Innovations • Invertibility • Lag window • Linearly deterministic component • Linearly indeterministic component • Moving average • Moving-average form • Nonstationarity • Partial autocorrelation • Periodogram • Sample periodogram • Self-regressive • Spectral density function • Square summable • Stationarity • Strong stationarity • Unit circle • Univariate time-series model • Weak stationarity • White noise • Wold's decomposition theorem • Yule–Walker equations

22 NONSTATIONARY DATA

22.1 INTRODUCTION

Most economic variables that exhibit strong trends, such as GDP, consumption, or the price level, are not stationary and are thus not amenable to the analysis of the previous three chapters. In many cases, stationarity can be achieved by simple differencing or some other simple transformation. But new statistical issues arise in analyzing nonstationary series that this superficial observation understates. This chapter will survey a few of the major issues in the analysis of nonstationary data.1 We begin in Section 22.2 with results on the analysis of a single nonstationary time series. Section 22.3 examines the implications of nonstationarity for analyzing regression relationships. Finally, Section 22.4 turns to the extension of the time-series results to panel data.

22.2 NONSTATIONARY PROCESSES AND UNIT ROOTS

This section will begin the analysis of nonstationary time series with some basic results for univariate time series. The fundamental results concern the characteristics of non- stationary series and statistical tests for identification of nonstationarity in observed data.

22.2.1 INTEGRATED PROCESSES AND DIFFERENCING

A process that figures prominently in recent work is the random walk with drift,

yt = μ + yt−1 + εt .

By direct substitution,
$$y_t = \sum_{i=0}^{\infty}(\mu + \varepsilon_{t-i}).$$

That is, yt is the simple sum of what will eventually be an infinite number of random variables, possibly with nonzero mean. If the innovations are being generated by the same zero-mean, constant-variance distribution, then the variance of yt would obviously

1With panel data, this is one of the rapidly growing areas in econometrics, and the literature advances rapidly. We can only scratch the surface. Several recent surveys and books provide useful extensions. Two that will be very helpful are Enders (2004) and Tsay (2005).


be infinite. As such, the random walk is clearly a nonstationary process, even if μ equals zero. On the other hand, the first difference of yt ,

zt = yt − yt−1 = μ + εt ,

is simply the innovation plus the mean of $z_t$, which we have already assumed is stationary. The series $y_t$ is said to be integrated of order one, denoted I(1), because taking a first difference produces a stationary process. A nonstationary series is integrated of order d, denoted I(d), if it becomes stationary after being first differenced d times. A further generalization of the ARMA model discussed in Section 21.2.1 would be the series
$$z_t = (1 - L)^d y_t = \Delta^d y_t.$$
The resulting model is denoted an autoregressive integrated moving-average model, or ARIMA(p, d, q).2 In full, the model would be
$$\Delta^d y_t = \mu + \gamma_1 \Delta^d y_{t-1} + \gamma_2 \Delta^d y_{t-2} + \cdots + \gamma_p \Delta^d y_{t-p} + \varepsilon_t - \theta_1\varepsilon_{t-1} - \cdots - \theta_q\varepsilon_{t-q},$$
where

$$\Delta y_t = y_t - y_{t-1} = (1 - L)y_t.$$
This result may be written compactly as
$$C(L)\left[(1 - L)^d y_t\right] = \mu + D(L)\varepsilon_t,$$
where $C(L)$ and $D(L)$ are the polynomials in the lag operator and $(1 - L)^d y_t = \Delta^d y_t$ is the dth difference of $y_t$. An I(1) series in its raw (undifferenced) form will typically be constantly growing, or wandering about with no tendency to revert to a fixed mean. Most macroeconomic flows and stocks that relate to population size, such as output or employment, are I(1). An I(2) series is growing at an ever-increasing rate. The price-level data in Appendix Table F21.2 and shown later appear to be I(2). Series that are I(3) or greater are extremely unusual, but they do exist. Among the few manifestly I(3) series that could be listed, one would find, for example, the money stocks or price levels in hyperinflationary economies such as interwar Germany or Hungary after World War II.
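A small simulation makes the I(1) behavior concrete: the level of a random walk with drift has a variance that grows with t, while its first difference is just white noise around μ. A minimal sketch; the drift, sample size, and replication count are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, T, R = 0.1, 500, 2000

# R simulated paths of y_t = mu + y_{t-1} + e_t, started at zero.
paths = np.cumsum(mu + rng.standard_normal((R, T)), axis=1)

# Variance across paths grows roughly linearly in t for the level ...
print(paths[:, 49].var(), paths[:, -1].var())

# ... while the first difference z_t = mu + e_t is stationary.
z = np.diff(paths, axis=1)
print(z.mean(), z.var())    # approximately mu and 1
```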

Example 22.1 A Nonstationary Series
The nominal GNP and price deflator variables in Appendix Table F21.2 are strongly trended, so the mean is changing over time. Figures 22.1 through 22.3 plot the log of the GNP deflator series in Table F21.2 and its first and second differences. The original series and first differences are obviously nonstationary, but the second differencing appears to have rendered the series stationary. The first 10 autocorrelations of the log of the GNP deflator series are shown in Table 22.1. The autocorrelations of the original series show the signature of a strongly trended, nonstationary series. The first difference also exhibits nonstationarity, because the autocorrelations are still very large after a lag of 10 periods. The second difference appears to be stationary, with mild negative autocorrelation at the first lag, but essentially none after that. Intuition might suggest that further differencing would reduce the autocorrelation further, but that would be incorrect. We leave as an exercise to show that, in fact, for values of γ less than about 0.5, first differencing of an AR(1) process actually increases autocorrelation.

2There are yet further refinements one might consider, such as removing seasonal effects from $z_t$ by differencing by quarter or month. See Harvey (1990) and Davidson and MacKinnon (1993). Some recent work has relaxed the assumption that d is an integer. The fractionally integrated series, or ARFIMA, has been used to model series in which the very long-run multipliers decay more slowly than would be predicted otherwise.


FIGURE 22.1 Quarterly Data on log GNP Deflator. [Plot of log price by quarter, 1950 to 1985.]

FIGURE 22.2 First Difference of log GNP Deflator. [Plot of Δ log p by quarter, 1950 to 1985.]

22.2.2 RANDOM WALKS, TRENDS, AND SPURIOUS REGRESSIONS

In a seminal paper, Granger and Newbold (1974) argued that researchers had not paid sufficient attention to the warning of very high autocorrelation in the residuals from conventional regression models. Among their conclusions were that macroeconomic data, as a rule, were integrated and that in regressions involving the levels of such


FIGURE 22.3 Second Difference of log GNP Deflator. [Plot of Δ² log p by quarter, 1950 to 1985.]

TABLE 22.1 Autocorrelations for ln GNP Deflator
Lag    Original Series (log Price)    First Difference    Second Difference
 1           1.000                        0.812                −0.395
 2           1.000                        0.765                −0.112
 3           0.999                        0.776                 0.258
 4           0.999                        0.682                −0.101
 5           0.999                        0.631                −0.022
 6           0.998                        0.592                 0.076
 7           0.998                        0.523                −0.163
 8           0.997                        0.513                 0.052
 9           0.997                        0.488                −0.054
10           0.997                        0.491                 0.062

data, the standard significance tests were usually misleading. The conventional t and F tests would tend to reject the hypothesis of no relationship when, in fact, there might be none. The general result at the center of these findings is that conventional linear regression, ignoring serial correlation, of one random walk on another is virtually certain to suggest a significant relationship, even if the two are, in fact, independent. Among their extreme conclusions, Granger and Newbold suggested that researchers use a critical t value of 11.2 rather than the standard normal value of 1.96 to assess the significance of a coefficient estimate. Phillips (1986) took strong issue with this conclusion. Based on a more general model and on an analytical rather than a Monte Carlo approach, he suggested that the normalized statistic $t_\beta/\sqrt{T}$ be used for testing purposes rather than $t_\beta$ itself. For the 50 observations used by Granger and Newbold,

CHAPTER 22 ✦ Nonstationary Data 743

the appropriate critical value would be close to 15! If anything, Granger and Newbold were too optimistic. The random walk with drift,

zt = μ + zt−1 + εt , (22-1) and the trend stationary process,

zt = μ + βt + εt , (22-2)

where, in both cases, $\varepsilon_t$ is a white noise process, appear to be reasonable characterizations of many macroeconomic time series.3 Clearly both of these will produce strongly trended, nonstationary series,4 so it is not surprising that regressions involving such variables almost always produce significant relationships. The strong correlation would seem to be a consequence of the underlying trend, whether or not there really is any regression at work. But Granger and Newbold went a step further. The intuition is less clear if there is a pure random walk at work,

$$z_t = z_{t-1} + \varepsilon_t, \tag{22-3}$$
but even here, they found that regression "relationships" appear to persist even in unrelated series. Each of these three series is characterized by a unit root. In each case, the data-generating process (DGP) can be written

(1 − L)zt = α + vt , (22-4)

where $\alpha = \mu$, $\beta$, and 0, respectively, and $v_t$ is a stationary process. Thus, the characteristic equation has a single root equal to one, hence the name. The upshot of Granger and Newbold's and Phillips's findings is that the use of data characterized by unit roots has the potential to lead to serious errors in inferences. In all three settings, differencing or detrending would seem to be a natural first step. On the other hand, it is not going to be immediately obvious which is the correct way to proceed—the data are strongly trended in all three cases—and taking the incorrect approach will not necessarily improve matters. For example, first differencing in (22-1) or (22-3) produces a white noise series, but first differencing in (22-2) trades the trend for autocorrelation in the form of an MA(1) process. On the other hand, detrending—that is, computing the residuals from a regression on time—is obviously counterproductive in (22-1) and (22-3), even though the regression of $z_t$ on a trend will appear to be significant for the reasons we have been discussing, whereas detrending in (22-2) appears to be the right approach.5 Because none of these approaches is likely to

3The analysis to follow has been extended to more general disturbance processes, but that complicates matters substantially. In this case, in fact, our assumption does cost considerable generality, but the extension is beyond the scope of our work. Some references on the subject are Phillips and Perron (1988) and Davidson and MacKinnon (1993).
4The constant term $\mu$ produces the deterministic trend in the random walk with drift. For convenience, suppose that the process starts at time zero. Then $z_t = \sum_{s=0}^{t}(\mu + \varepsilon_s) = \mu t + \sum_{s=0}^{t}\varepsilon_s$. Thus, $z_t$ consists of a deterministic trend plus a stochastic trend consisting of the sum of the innovations. The result is a variable with increasing variance around a linear trend.
5See Nelson and Kang (1984).


be obviously preferable at the outset, some means of choosing is necessary. Consider nesting all three models in a single equation,

zt = μ + βt + zt−1 + εt .

Now subtract zt−1 from both sides of the equation and introduce the artificial parameter γ .

$$z_t - z_{t-1} = \mu\gamma + \beta\gamma t + (\gamma - 1)z_{t-1} + \varepsilon_t = \alpha_0 + \alpha_1 t + (\gamma - 1)z_{t-1} + \varepsilon_t, \tag{22-5}$$
where, by hypothesis, $\gamma = 1$. Equation (22-5) provides the basis for a variety of tests for unit roots in economic data. In principle, a test of the hypothesis that $\gamma - 1$ equals zero gives confirmation of the random walk with drift, because if $\gamma$ equals 1 (and $\alpha_1$ equals zero), then (22-1) results. If $\gamma - 1$ is less than zero, then the evidence favors the trend stationary (or some other) model, and detrending (or some alternative) is the preferable approach. The practical difficulty is that standard inference procedures based on least squares and the familiar test statistics are not valid in this setting. The issue is discussed in the next section.
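The Granger–Newbold phenomenon is easy to reproduce. The following sketch regresses one pure random walk on another, independent one, and records how often the conventional t test (wrongly) rejects no relationship at the 5 percent level. The sample size and replication count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(4)
T, R = 50, 2000
rejections = 0
for _ in range(R):
    y = np.cumsum(rng.standard_normal(T))     # independent random walks, as in (22-3)
    x = np.cumsum(rng.standard_normal(T))
    X = np.column_stack([np.ones(T), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    s2 = e @ e / (T - 2)
    v_b = s2 * np.linalg.inv(X.T @ X)[1, 1]
    rejections += abs(b[1] / np.sqrt(v_b)) > 1.96
print(rejections / R)   # far above 0.05; cf. Granger and Newbold's critical value of 11.2
```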

22.2.3 TESTS FOR UNIT ROOTS IN ECONOMIC DATA

The implications of unit roots in macroeconomic data are, at least potentially, profound. If a structural variable, such as real output, is truly I(1), then shocks to it will have permanent effects. If confirmed, then this observation would mandate some rather serious reconsideration of the analysis of macroeconomic policy. For example, the argument that a change in monetary policy could have a transitory effect on real output would vanish.6 The literature is not without its skeptics, however. This result rests on a razor's edge. Although the literature is thick with tests that have failed to reject the hypothesis that $\gamma = 1$, many have also not rejected the hypothesis that $\gamma \ge 0.95$, and at 0.95 (or even at 0.99), the entire issue becomes moot.7 Consider the simple AR(1) model with zero-mean, white noise innovations,

$$y_t = \gamma y_{t-1} + \varepsilon_t.$$
The downward bias of the least squares estimator when $\gamma$ approaches one has been widely documented.8 For $|\gamma| < 1$, however, the least squares estimator
$$c = \frac{\sum_{t=2}^{T} y_t y_{t-1}}{\sum_{t=2}^{T} y_{t-1}^2}$$
does have
$$\mathrm{plim}\ c = \gamma$$

6The 1980s saw the appearance of literally hundreds of studies, both theoretical and applied, of unit roots in economic data. An important example is the seminal paper by Nelson and Plosser (1982). There is little question but that this observation is an early part of the radical paradigm shift that has occurred in empirical macroeconomics.
7A large number of issues are raised in Maddala (1992, pp. 582–588).
8See, for example, Evans and Savin (1981, 1984).


and
$$\sqrt{T}(c - \gamma) \xrightarrow{d} N[0, 1 - \gamma^2].$$

Does the result hold up if γ = 1? The case is called the unit root case, because in the ARMA representation C(L)yt = εt , the characteristic equation 1−γ z = 0 has one root equal to one. That the limiting variance appears to go to zero should raise suspicions. The literature on the questions dates back to Mann and Wald (1943) and Rubin (1950). But for econometric purposes, the literature has a focal point at the celebrated papers of Dickey and Fuller (1979, 1981). They showed that if γ equals one, then

$$T(c - \gamma) \xrightarrow{d} v,$$

where $v$ is a random variable with finite, positive variance, and in finite samples, $E[c] < 1$.9 There are two important implications in the Dickey–Fuller results. First, the estimator of $\gamma$ is biased downward if $\gamma$ equals one. Second, the OLS estimator of $\gamma$ converges to its probability limit more rapidly than the estimators to which we are accustomed. That is, the variance of $c$ under the null hypothesis is $O(1/T^2)$, not $O(1/T)$. (In a mean squared error sense, the OLS estimator is superconsistent.) It turns out that the implications of this finding for the regressions with trended data are considerable. We have already observed that in some cases, differencing or detrending is required to achieve stationarity of a series. Suppose, though, that the preceding AR(1) model is fit to an I(1) series, despite that fact. The upshot of the preceding discussion is that the conventional measures will tend to hide the true value of $\gamma$; the sample estimate is biased downward, and by dint of the very small true sampling variance, the conventional t test will tend, incorrectly, to reject the hypothesis that $\gamma = 1$. The practical solution to this problem devised by Dickey and Fuller was to derive, through Monte Carlo methods, an appropriate set of critical values for testing the hypothesis that $\gamma$ equals one in an AR(1) regression when there truly is a unit root. One of their general results is that the test may be carried out using a conventional t statistic, but the critical values for the test must be revised; the standard t table is inappropriate. A number of variants of this form of testing procedure have been developed. We will consider several of them.
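Both Dickey–Fuller results are easy to see by simulation. The sketch below estimates the AR(1) coefficient by least squares on data generated with γ = 1 and reports the mean of c and of T(c − 1) across replications; E[c] stays below one at every sample size, and T(c − 1) settles down to a nondegenerate distribution. The sample sizes and replication count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)

def unit_root_ols(T, R=5000):
    # y_t = y_{t-1} + e_t (gamma = 1); c = sum y_t y_{t-1} / sum y_{t-1}^2.
    cs = np.empty(R)
    for r in range(R):
        y = np.cumsum(rng.standard_normal(T))
        cs[r] = np.sum(y[1:] * y[:-1]) / np.sum(y[:-1] ** 2)
    return cs

for T in (25, 100, 400):
    c = unit_root_ols(T)
    print(T, c.mean(), (T * (c - 1.0)).mean())
    # the mean of c stays below 1; T(c - 1) does not collapse to zero
```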

22.2.4 THE DICKEY–FULLER TESTS

The simplest version of the model to be analyzed is the random walk,

$$y_t = \gamma y_{t-1} + \varepsilon_t, \quad \varepsilon_t \sim N[0, \sigma^2], \quad \mathrm{Cov}[\varepsilon_t, \varepsilon_s] = 0 \ \forall\ t \ne s.$$

Under the null hypothesis that $\gamma = 1$, there are two approaches to carrying out the test. The conventional t ratio
$$DF_\tau = \frac{\hat{\gamma} - 1}{\text{Est. Std. Error}(\hat{\gamma})}$$

9A full derivation of this result is beyond the scope of this book. For the interested reader, a fairly comprehensive treatment at an accessible level is given in Chapter 17 of Hamilton (1994, pp. 475–542).


TABLE 22.2 Critical Values for the Dickey–Fuller DFτ Test
                                       Sample Size
                              25       50       100      ∞
F ratio (D–F)a               7.24     6.73     6.49     6.25
F ratio (standard)           3.42     3.20     3.10     3.00
AR modelb (random walk)
  0.01                      −2.66    −2.62    −2.60    −2.58
  0.025                     −2.26    −2.25    −2.24    −2.23
  0.05                      −1.95    −1.95    −1.95    −1.95
  0.10                      −1.60    −1.61    −1.61    −1.62
  0.975                      1.70     1.66     1.64     1.62
AR model with constant (random walk with drift)
  0.01                      −3.75    −3.59    −3.50    −3.42
  0.025                     −3.33    −3.23    −3.17    −3.12
  0.05                      −2.99    −2.93    −2.90    −2.86
  0.10                      −2.64    −2.60    −2.58    −2.57
  0.975                      0.34     0.29     0.26     0.23
AR model with constant and time trend (trend stationary)
  0.01                      −4.38    −4.15    −4.04    −3.96
  0.025                     −3.95    −3.80    −3.69    −3.66
  0.05                      −3.60    −3.50    −3.45    −3.41
  0.10                      −3.24    −3.18    −3.15    −3.13
  0.975                     −0.50    −0.58    −0.62    −0.66
aFrom Dickey and Fuller (1981, p. 1063). Degrees of freedom are 2 and T − p − 3.
bFrom Fuller (1976, p. 373 and 1996, Table 10.A.2).

with the revised set of critical values may be used for a one-sided test. Critical values for this test are shown in the top panel of Table 22.2. Note that in general, the critical value is considerably larger in absolute value than its counterpart from the t distribution. The second approach is based on the statistic

$$DF_\gamma = T(\hat{\gamma} - 1).$$
Critical values for this test are shown in the top panel of Table 22.3. The simple random walk model is inadequate for many series. Consider the rate of inflation from 1950.2 to 2000.4 (plotted in Figure 22.4) and the log of GDP over the same period (plotted in Figure 22.5). The first of these may be a random walk, but it is clearly drifting. The log GDP series, in contrast, has a strong trend. For the first of these, a random walk with drift may be specified,

yt = μ + zt ,

zt = γ zt−1 + εt , or

yt = μ(1 − γ)+ γ yt−1 + εt . For the second type of series, we may specify the trend stationary form,

yt = μ + βt + zt ,

zt = γ zt−1 + εt Greene-50558 book June 23, 2007 12:28


FIGURE 22.4 Rate of Inflation in the Consumer Price Index. [Plot of the change in the CPI by quarter, 1950.2 to 2000.4.]

FIGURE 22.5 Log of Gross Domestic Product. [Plot of log GDP by quarter, 1950.1 to 2000.4.]

or

$$y_t = [\mu(1 - \gamma) + \gamma\beta] + \beta(1 - \gamma)t + \gamma y_{t-1} + \varepsilon_t.$$
The tests for these forms may be carried out in the same fashion. For the model with drift only, the center panels of Tables 22.2 and 22.3 are used. When the trend is included, the lower panel of each table is used.


TABLE 22.3 Critical Values for the Dickey–Fuller DFγ Test
                                       Sample Size
                              25       50       100      ∞
AR modela (random walk)
  0.01                     −11.8    −12.8    −13.3    −13.8
  0.025                     −9.3     −9.9    −10.2    −10.5
  0.05                      −7.3     −7.7     −7.9     −8.1
  0.10                      −5.3     −5.5     −5.6     −5.7
  0.975                      1.78     1.69     1.65     1.60
AR model with constant (random walk with drift)
  0.01                     −17.2    −18.9    −19.8    −20.7
  0.025                    −14.6    −15.7    −16.3    −16.9
  0.05                     −12.5    −13.3    −13.7    −14.1
  0.10                     −10.2    −10.7    −11.0    −11.3
  0.975                      0.65     0.53     0.47     0.41
AR model with constant and time trend (trend stationary)
  0.01                     −22.5    −25.8    −27.4    −29.4
  0.025                    −20.0    −22.4    −23.7    −24.4
  0.05                     −17.9    −19.7    −20.6    −21.7
  0.10                     −15.6    −16.8    −17.5    −18.3
  0.975                     −1.53    −1.667   −1.74    −1.81
aFrom Fuller (1976, p. 373 and 1996, Table 10.A.1).

Example 22.2 Tests for Unit Roots
In Section 20.6.8, we examined Cecchetti and Rich's study of the effect of recent monetary policy on the U.S. economy. The data used in their study were the following variables:
π = one-period rate of inflation = the rate of change in the CPI,
y = log of real GDP,
i = nominal interest rate = the quarterly average yield on a 90-day T-bill,
Δm = change in the log of the money stock, M1,
i − π = ex post real interest rate,
Δm − π = real growth in the money stock.

Data used in their analysis were from the period 1959.1 to 1997.4. As part of their analysis, they checked each of these series for a unit root and suggested that the hypothesis of a unit root could only be rejected for the last two variables. We will reexamine these data for the longer interval, 1950.2 to 2000.4. The data are in Appendix Table F5.1. Figures 22.6 through 22.9 show the behavior of the last four variables. The first two are shown in Figures 22.4 and 22.5. Only the real output figure shows a strong trend, so we will use the random walk with drift for all the variables except this one. The Dickey–Fuller tests are reported in Table 22.4. There are 202 observations used in each one. The first observation is lost when computing the rate of inflation and the change in the money stock, and one more is lost for the difference term in the regression. The critical values from interpolating to the second row, last column in each panel for 95 percent significance and a one-tailed test are −3.68 and −24.2, respectively, for DFτ and DFγ for the output equation, which contains the time trend, and −3.14 and −16.8 for the other equations, which contain a constant but no trend. For the output equation (y), the test statistics are

$$DF_\tau = \frac{0.9584940384 - 1}{0.017880922} = -2.32 > -3.44,$$


FIGURE 22.6 T-Bill Rate. [Plot of the 90-day T-bill rate by quarter, 1950 to 2002.]

FIGURE 22.7 Change in the Money Stock. [Plot of ΔM1 by quarter, 1950 to 2002.]

and

$$DF_\gamma = 202(0.9584940384 - 1) = -8.38 > -21.2.$$
Neither is less than the critical value, so we conclude (as have others) that there is a unit root in the log GDP process. The results of the other tests are shown in Table 22.4. Surprisingly,


FIGURE 22.8 Ex Post Real T-Bill Rate. [Plot of the real interest rate by quarter, 1950 to 2002.]

FIGURE 22.9 Change in the Real Money Stock. [Plot of real ΔM1 by quarter, 1950 to 2002.]

these results do differ sharply from those obtained by Cecchetti and Rich (2001) for π and Δm. The sample period appears to matter; if we repeat the computation using Cecchetti and Rich's interval, 1959.4 to 1997.4, then DFτ equals −3.51. This is borderline, but less contradictory. For Δm we obtain a value of −4.204 for DFτ when the sample is restricted to the shorter interval.


TABLE 22.4 Unit Root Tests (standard errors of estimates in parentheses)
              μ           β           γ          DFτ       DFγ       Conclusion
π           0.332                   0.659       −6.40     −68.88    Reject H0
           (0.0696)                (0.0532)                         R² = 0.432, s = 0.643
y           0.320       0.00033     0.958       −2.35      −8.48    Do not reject H0
           (0.134)     (0.00015)   (0.0179)                         R² = 0.999, s = 0.001
i           0.228                   0.961       −2.14      −7.88    Do not reject H0
           (0.109)                 (0.0182)                         R² = 0.933, s = 0.743
Δm          0.448                   0.596       −7.05     −81.61    Reject H0
           (0.0923)                (0.0573)                         R² = 0.351, s = 0.929
i − π       0.615                   0.557       −7.57     −89.49    Reject H0
           (0.185)                 (0.0585)                         R² = 0.311, s = 2.395
Δm − π      0.0700                  0.490       −8.25    −103.02    Reject H0
           (0.0833)                (0.0618)                         R² = 0.239, s = 1.176

The Dickey–Fuller tests described in this section assume that the disturbances in the model as stated are white noise. An extension which will accommodate some forms of serial correlation is the augmented Dickey–Fuller test. The augmented Dickey–Fuller test is the same one as described earlier, carried out in the context of the model

$$y_t = \mu + \beta t + \gamma y_{t-1} + \gamma_1 \Delta y_{t-1} + \cdots + \gamma_p \Delta y_{t-p} + \varepsilon_t.$$

The random walk form is obtained by imposing $\mu = 0$ and $\beta = 0$; the random walk with drift has $\beta = 0$; and the trend stationary model leaves both parameters free. The two test statistics are
$$DF_\tau = \frac{\hat{\gamma} - 1}{\text{Est. Std. Error}(\hat{\gamma})},$$
exactly as constructed before, and
$$DF_\gamma = \frac{T(\hat{\gamma} - 1)}{1 - \hat{\gamma}_1 - \cdots - \hat{\gamma}_p}.$$
The advantage of this formulation is that it can accommodate higher-order autoregressive processes in $\varepsilon_t$. An alternative formulation may prove convenient. By subtracting $y_{t-1}$ from both sides of the equation, we obtain

$$\Delta y_t = \mu + \gamma^* y_{t-1} + \sum_{j=1}^{p-1} \phi_j \Delta y_{t-j} + \varepsilon_t,$$

where
$$\phi_j = -\sum_{k=j+1}^{p} \gamma_k \quad \text{and} \quad \gamma^* = \sum_{i=1}^{p} \gamma_i - 1.$$


The unit root test is carried out as before by testing the null hypothesis $\gamma^* = 0$ against $\gamma^* < 0$.10 The t test, $DF_\tau$, may be used. If the failure to reject the unit root is taken as evidence that a unit root is present, that is, $\gamma^* = 0$, then the model specializes to the AR(p − 1) model in the first differences, which is an ARIMA(p − 1, 1, 0) model for $y_t$. For a model with a time trend,
$$\Delta y_t = \mu + \beta t + \gamma^* y_{t-1} + \sum_{j=1}^{p-1} \phi_j \Delta y_{t-j} + \varepsilon_t,$$
the test is carried out by testing the joint hypothesis that $\beta = \gamma^* = 0$. Dickey and Fuller (1981) present counterparts to the critical F statistics for testing the hypothesis. Some of their values are reproduced in the first row of Table 22.2. (Authors frequently focus on $\gamma^*$ and ignore the time trend, maintaining it only as part of the appropriate formulation. In this case, one may use the simple test of $\gamma^* = 0$ as before, with the $DF_\tau$ critical values.) The lag length, p, remains to be determined. As usual, we are well advised to test down to the right value instead of up. One can take the familiar approach and sequentially examine the t statistic on the last coefficient—the usual t test is appropriate. An alternative is to combine a measure of model fit, such as the regression $s^2$, with one of the information criteria. The Akaike and Schwarz (Bayesian) information criteria would produce the two information measures
$$IC(p) = \ln\left(\frac{\mathbf{e}'\mathbf{e}}{T - p_{\max} - K^*}\right) + (p + K^*)\frac{A^*}{T - p_{\max} - K^*},$$
$$K^* = 1 \text{ for random walk, } 2 \text{ for random walk with drift, } 3 \text{ for trend stationary,}$$
$$A^* = 2 \text{ for the Akaike criterion, } \ln(T - p_{\max} - K^*) \text{ for the Bayesian criterion,}$$

pmax = the largest lag length being considered.
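The following is a minimal Python sketch of this machinery, taking $p_{\max}$ as given (its choice is the subject of the next paragraph): OLS on the Δy form, $DF_\tau$ as the t ratio on $\gamma^*$, and IC-based selection of p over a common estimation sample. The function names are ours, not from any particular package, and the simulated series is an arbitrary stand-in.

```python
import numpy as np

def adf_regression(y, p, trend=True):
    # OLS for  Dy_t = mu (+ beta t) + gamma* y_{t-1} + sum_j phi_j Dy_{t-j} + e_t.
    # Returns (gamma*-hat, std. error, e'e); DF_tau is the ratio of the first two.
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    n = len(dy) - p                                  # usable observations
    cols = [np.ones(n)]
    if trend:
        cols.append(np.arange(1, n + 1, dtype=float))
    cols.append(y[p:-1])                             # y_{t-1}
    cols += [dy[p - j:-j] for j in range(1, p + 1)]  # Dy_{t-1}, ..., Dy_{t-p}
    X = np.column_stack(cols)
    b, *_ = np.linalg.lstsq(X, dy[p:], rcond=None)
    e = dy[p:] - X @ b
    s2 = e @ e / (n - X.shape[1])
    V = s2 * np.linalg.inv(X.T @ X)
    i = 2 if trend else 1                            # position of gamma*
    return b[i], np.sqrt(V[i, i]), e @ e

def choose_p(y, pmax, trend=True, bayes=True):
    # IC(p) as defined above, computed over a common estimation sample.
    T = len(y)
    Kstar = 3 if trend else 2
    df = T - pmax - Kstar
    A = np.log(df) if bayes else 2.0
    ic = [np.log(adf_regression(y[pmax - p:], p, trend)[2] / df)
          + (p + Kstar) * A / df for p in range(pmax + 1)]
    return int(np.argmin(ic))

rng = np.random.default_rng(6)
y = np.cumsum(0.05 + rng.standard_normal(204))   # simulated I(1) series
p = choose_p(y, pmax=14)
g, se, _ = adf_regression(y, p)
print(p, g / se)    # DF_tau; compare with the critical values in Table 22.2
```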

The remaining detail is to decide upon $p_{\max}$. The theory provides little guidance here. On the basis of a large number of simulations, Schwert (1989) found that
$$p_{\max} = \text{integer part of } [12 \times (T/100)^{0.25}]$$
gave good results. Many alternatives to the Dickey–Fuller tests have been suggested, in some cases to improve on the finite sample properties and in others to accommodate more general modeling frameworks. The Phillips (1987) and Phillips and Perron (1988) statistic may be computed for the same three functional forms,

$$y_t = \delta_t + \gamma y_{t-1} + \gamma_1 \Delta y_{t-1} + \cdots + \gamma_p \Delta y_{t-p} + \varepsilon_t, \tag{22-6}$$

where $\delta_t$ may be 0, $\mu$, or $\mu + \beta t$. The procedure modifies the two Dickey–Fuller statistics we previously examined:
$$Z_\tau = \sqrt{\frac{c_0}{a}}\,\frac{\hat{\gamma} - 1}{v} - \frac{1}{2}(a - c_0)\frac{Tv}{\sqrt{a s^2}},$$
$$Z_\gamma = \frac{T(\hat{\gamma} - 1)}{1 - \hat{\gamma}_1 - \cdots - \hat{\gamma}_p} - \frac{1}{2}\,\frac{T^2 v^2}{s^2}(a - c_0),$$

10It is easily verified that one of the roots of the characteristic polynomial is $1/(\gamma_1 + \gamma_2 + \cdots + \gamma_p)$.


where
$$s^2 = \frac{\sum_{t=1}^{T} e_t^2}{T - K},$$
$$v^2 = \text{estimated asymptotic variance of } \hat{\gamma},$$
$$c_j = \frac{1}{T}\sum_{s=j+1}^{T} e_s e_{s-j}, \quad j = 0, \ldots, p = \text{jth autocovariance of the residuals},$$
$$c_0 = [(T - K)/T]s^2,$$
$$a = c_0 + 2\sum_{j=1}^{L}\left(1 - \frac{j}{L + 1}\right)c_j.$$

[Note the Newey–West (Bartlett) weights in the computation of $a$. As before, the analyst must choose $L$.] The test statistics are referred to the same Dickey–Fuller tables we have used before.

Elliot, Rothenberg, and Stock (1996) have proposed a method they denote the ADF-GLS procedure, which is designed to accommodate more general formulations of $\varepsilon_t$; the process generating $\varepsilon_t$ is assumed to be an I(0) stationary process, possibly an ARMA(r, s). The null hypothesis, as before, is $\gamma = 1$ in (22-6) where $\delta_t = \mu$ or $\mu + \beta t$. The method proceeds as follows:

Step 1. Linearly regress
$$\mathbf{y}^* = \begin{bmatrix} y_1 \\ y_2 - r y_1 \\ \cdots \\ y_T - r y_{T-1} \end{bmatrix} \text{ on } \mathbf{X}^* = \begin{bmatrix} 1 \\ 1 - r \\ \cdots \\ 1 - r \end{bmatrix} \text{ or } \mathbf{X}^* = \begin{bmatrix} 1 & 1 \\ 1 - r & 2 - r \\ \cdots & \cdots \\ 1 - r & T - r(T - 1) \end{bmatrix}$$
for the random walk with drift and trend stationary cases, respectively. (Note that the second column of the matrix is simply $r + (1 - r)t$.) Compute the residuals from this regression, $\tilde{y}_t = y_t - \hat{\delta}_t$. Here $r = 1 - 7/T$ for the random walk model and $1 - 13.5/T$ for the model with a trend.

Step 2. The Dickey–Fuller $DF_\tau$ test can now be carried out using the model

$$\tilde{y}_t = \gamma \tilde{y}_{t-1} + \gamma_1 \Delta\tilde{y}_{t-1} + \cdots + \gamma_p \Delta\tilde{y}_{t-p} + \eta_t.$$

If the model does not contain the time trend, then the t statistic for $(\gamma - 1)$ may be referred to the critical values in the center panel of Table 22.2. For the trend stationary model, the critical values are given in a table presented in Elliot et al. The 97.5 percent critical value for a one-tailed test from their table is −3.15. As in many such cases of a new technique, as researchers develop large and small modifications of these tests, the practitioner is likely to have some difficulty deciding how to proceed. The Dickey–Fuller procedures have stood the test of time as robust tools that appear to give good results over a wide range of applications. The Phillips–Perron tests are very general but appear to have less than optimal small sample properties. Researchers continue to examine them, along with alternatives such as the Elliot et al. method. Other tests are catalogued in Maddala and Kim (1998).


Example 22.3 Augmented Dickey–Fuller Test for a Unit Root in GDP
The Dickey–Fuller (1981) paper is a classic in the econometrics literature—it is probably the single most frequently cited paper in the field. It seems appropriate, therefore, to revisit at least some of their work. Dickey and Fuller apply their methodology to a model for the log of a quarterly series on output, the Federal Reserve Board Production Index. The model used is

$$y_t = \mu + \beta t + \gamma y_{t-1} + \phi(y_{t-1} - y_{t-2}) + \varepsilon_t. \tag{22-7}$$

The test is carried out by testing the joint hypothesis that both $\beta$ and $\gamma^*$ are zero in the model

$$y_t - y_{t-1} = \mu^* + \beta t + \gamma^* y_{t-1} + \phi(y_{t-1} - y_{t-2}) + \varepsilon_t.$$

(If $\gamma^* = 0$, then $\mu^*$ will also be zero, by construction.) We will repeat the study with our data on real GDP from Appendix Table F5.1, using observations 1950.1 to 2000.4. We will use the augmented Dickey–Fuller test first. Thus, the first step is to determine the appropriate lag length for the augmented regression. Using Schwert's suggestion, we find that the maximum lag length should be allowed to reach $p_{\max} = \{\text{the integer part of } 12[204/100]^{0.25}\} = 14$. The specification search uses observations 18 to 204, because as many as 17 coefficients will be estimated in the equation

$$y_t = \mu + \beta t + \gamma y_{t-1} + \sum_{j=1}^{p} \gamma_j \Delta y_{t-j} + \varepsilon_t.$$

In the sequence of 14 regressions with j = 14, 13, ..., the only statistically significant lagged difference is the first one, in the last regression, so it would appear that the model used by Dickey and Fuller would be chosen on this basis. The two information criteria produce a similar conclusion. Both of them decline monotonically from j = 14 all the way down to j = 1, so on this basis, we end the search with j = 1, and proceed to analyze Dickey and Fuller’s model. The linear regression results for the equation in (22-7) are

$$y_t = 0.368 + 0.000391t + 0.952 y_{t-1} + 0.36025\,\Delta y_{t-1} + e_t, \quad s = 0.00912, \; R^2 = 0.999647,$$
with estimated standard errors (0.125), (0.000138), (0.0167), and (0.0647), respectively.

The two test statistics are
$$DF_\tau = \frac{0.95166 - 1}{0.016716} = -2.892$$
and
$$DF_\gamma = \frac{201(0.95166 - 1)}{1 - 0.36025} = -15.263.$$
Neither statistic is less than the respective critical values, which are −3.70 and −24.5. On this basis, we conclude, as have many others, that there is a unit root in log GDP. For the Phillips and Perron statistic, we need several additional intermediate statistics. Following Hamilton (1994, p. 512), we choose L = 4 for the long-run variance calculation. Other values we need are $T = 201$, $\hat{\gamma} = 0.9516613$, $s^2 = 0.00008311488$, $v^2 = 0.00027942647$, and the first five autocovariances, $c_0 = 0.000081469$, $c_1 = -0.00000351162$, $c_2 = 0.00000688053$, $c_3 = 0.000000597305$, and $c_4 = -0.00000128163$. Applying these to the weighted sum produces $a = 0.0000840722$, which is only a minor correction to $c_0$. Collecting the results, we obtain the Phillips–Perron statistics, $Z_\tau = -2.89921$ and $Z_\gamma = -15.44133$. Because these are applied to the same critical values in the Dickey–Fuller tables, we reach the same conclusion as before—we do not reject the hypothesis of a unit root in log GDP.
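The Phillips–Perron corrections in this example can be reproduced directly from the intermediate statistics quoted above. The sketch below simply assembles $Z_\tau$ and $Z_\gamma$ from those numbers; small discrepancies from the reported values reflect rounding of the inputs.

```python
import numpy as np

# Intermediate statistics reported in Example 22.3.
T = 201
gamma_hat = 0.9516613
phi1 = 0.36025                 # coefficient on the lagged difference
s2 = 0.00008311488             # regression variance
v2 = 0.00027942647             # estimated asymptotic variance of gamma-hat
c = np.array([0.000081469, -0.00000351162, 0.00000688053,
              0.000000597305, -0.00000128163])      # c_0, ..., c_4
L = 4

w = 1 - np.arange(1, L + 1) / (L + 1)               # Bartlett weights
a = c[0] + 2 * np.sum(w * c[1:])                    # long-run variance estimate
v = np.sqrt(v2)

Z_tau = np.sqrt(c[0] / a) * (gamma_hat - 1) / v \
        - 0.5 * (a - c[0]) * T * v / np.sqrt(a * s2)
Z_gamma = T * (gamma_hat - 1) / (1 - phi1) \
          - 0.5 * T ** 2 * v2 / s2 * (a - c[0])
print(a, Z_tau, Z_gamma)    # approximately 0.0000840722, -2.899, -15.4
```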


22.2.5 THE KPSS TEST OF STATIONARITY

Kwiatkowski et al. (1992) (KPSS) have devised an alternative to the Dickey–Fuller test for stationarity of a time series. The procedure is a test of nonstationarity against the null hypothesis of stationarity in the model
$$y_t = \alpha + \beta t + \gamma \sum_{i=1}^{t} z_i + \varepsilon_t = \alpha + \beta t + \gamma Z_t + \varepsilon_t, \quad t = 1, \ldots, T,$$

where $\varepsilon_t$ is a stationary series and $z_t$ is an i.i.d. stationary series with mean zero and variance one. (These are merely convenient normalizations because a nonzero mean would move to $\alpha$ and a nonunit variance is absorbed in $\gamma$.) If $\gamma$ equals zero, then the process is stationary if $\beta = 0$ and trend stationary if $\beta \ne 0$. Because $Z_t$ is I(1), $y_t$ is nonstationary if $\gamma$ is nonzero. The KPSS test of the null hypothesis, $H_0: \gamma = 0$, against the alternative that $\gamma$ is nonzero reverses the strategy of the Dickey–Fuller statistic (which tests the null hypothesis $\gamma = 1$ against the alternative $\gamma < 1$). Under the null hypothesis, $\alpha$ and $\beta$ can be estimated by OLS. Let $e_t$ denote the tth OLS residual,

$$e_t = y_t - a - bt,$$
and let the sequence of partial sums be
$$E_t = \sum_{i=1}^{t} e_i, \quad t = 1, \ldots, T.$$

(Note $E_T = 0$.) The KPSS statistic is
$$\mathrm{KPSS} = \frac{\sum_{t=1}^{T} E_t^2}{T^2 \hat{\sigma}^2},$$
where
$$\hat{\sigma}^2 = \frac{\sum_{t=1}^{T} e_t^2}{T} + 2\sum_{j=1}^{L}\left(1 - \frac{j}{L + 1}\right)r_j \quad \text{and} \quad r_j = \frac{\sum_{s=j+1}^{T} e_s e_{s-j}}{T},$$

and L is chosen by the analyst. [See (19-17).] Under normality of the disturbances, εt , the KPSS statistic is an LM statistic. The authors derive the statistic under more general conditions. Critical values for the test statistic are estimated by simulation. Table 22.5 gives the values reported by the authors (in their Table 1, p. 166).
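A minimal sketch of the KPSS computation just described, with the trend included and an analyst-chosen L. The names are ours, and the simulated inputs are placeholders for a real series.

```python
import numpy as np

def kpss_statistic(y, L, trend=True):
    # Regress y on (1, t), cumulate the residuals, and scale by the
    # Bartlett-weighted long-run variance estimate defined above.
    y = np.asarray(y, dtype=float)
    T = len(y)
    if trend:
        X = np.column_stack([np.ones(T), np.arange(1, T + 1, dtype=float)])
    else:
        X = np.ones((T, 1))
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    E = np.cumsum(e)
    r = np.array([np.sum(e[j:] * e[:-j]) / T for j in range(1, L + 1)])
    w = 1 - np.arange(1, L + 1) / (L + 1)
    sigma2 = np.sum(e ** 2) / T + 2 * np.sum(w * r)
    return np.sum(E ** 2) / (T ** 2 * sigma2)

rng = np.random.default_rng(7)
stationary = rng.standard_normal(200)
integrated = np.cumsum(rng.standard_normal(200))
print(kpss_statistic(stationary, L=10))   # small: do not reject stationarity
print(kpss_statistic(integrated, L=10))   # large: compare with Table 22.5
```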

Example 22.4 Is There a Unit Root in GDP?
Using the data used for the Dickey–Fuller tests in Example 22.2, we repeated the procedure using the KPSS test with L = 10. The two statistics are 1.953 without the trend and 0.312

TABLE 22.5 Critical Values for the KPSS Test
                    Upper Tail Percentiles
            0.100     0.050     0.025     0.010
β = 0       0.347     0.463     0.573     0.739
β ≠ 0       0.119     0.146     0.176     0.216


with it. Comparing these results to the critical values in Table 22.5, we conclude (again) that there is, indeed, a unit root in ln GDP. Or, more precisely, we conclude that ln GDP is not a stationary series, nor even a trend stationary series.

22.3 COINTEGRATION

Studies in empirical macroeconomics almost always involve nonstationary and trending variables, such as income, consumption, money demand, the price level, trade flows, and exchange rates. Accumulated wisdom and the results of the previous sections suggest that the appropriate way to manipulate such series is to use differencing and other transformations (such as seasonal adjustment) to reduce them to stationarity and then to analyze the resulting series as VARs or with the methods of Box and Jenkins. But recent research and a growing literature have shown that there are more interesting, appropriate ways to analyze trending variables. In the fully specified regression model

yt = βxt + εt ,

there is a presumption that the disturbances $\varepsilon_t$ are a stationary, white noise series.11 But this presumption is unlikely to be true if $y_t$ and $x_t$ are integrated series. Generally, if two series are integrated to different orders, then linear combinations of them will be integrated to the higher of the two orders. Thus, if $y_t$ and $x_t$ are I(1)—that is, if both are trending variables—then we would normally expect $y_t - \beta x_t$ to be I(1) regardless of the value of $\beta$, not I(0) (i.e., not stationary). If $y_t$ and $x_t$ are each drifting upward with their own trend, then unless there is some relationship between those trends, the difference between them should also be growing, with yet another trend. There must be some kind of inconsistency in the model. On the other hand, if the two series are both I(1), then there may be a $\beta$ such that

εt = yt − βxt

is I(0). Intuitively, if the two series are both I(1), then this partial difference between them might be stable around a fixed mean. The implication would be that the series are drifting together at roughly the same rate. Two series that satisfy this requirement are said to be cointegrated, and the vector [1, −β] (or any multiple of it) is a cointegrating vector. In such a case, we can distinguish between a long-run relationship between yt and xt , that is, the manner in which the two variables drift upward together, and the short-run dynamics, that is, the relationship between deviations of yt from its long-run trend and deviations of xt from its long-run trend. If this is the case, then differencing of the data would be counterproductive, since it would obscure the long-run relationship between yt and xt . Studies of cointegration and a related technique, error correction, are concerned with methods of estimation that preserve the information about both forms of covariation.12
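The idea is easy to see in a small simulation. The following sketch (our own construction, with arbitrary parameter values) builds two I(1) series from a single common random walk; each series wanders without bound, but the equilibrium error yt − βxt is stationary:

```python
import numpy as np

rng = np.random.default_rng(12345)
T, beta = 2000, 1.5
w = np.cumsum(rng.standard_normal(T))      # common I(1) trend (random walk)
x = w + rng.standard_normal(T)             # I(1)
y = beta * w + rng.standard_normal(T)      # I(1), drifts with x at rate beta

u = y - beta * x                           # equilibrium error
# The levels wander without bound, but the equilibrium error does not:
print(np.var(y[: T // 2]), np.var(y))      # variance keeps growing with T
print(np.var(u[: T // 2]), np.var(u))      # variance is stable (~1 + beta^2)
```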

11If there is autocorrelation in the model, then it has been removed through an appropriate transformation.
12See, for example, Engle and Granger (1987) and the lengthy literature cited in Hamilton (1994). A survey paper on VARs and cointegration is Watson (1994).


Example 22.5 Cointegration in Consumption and Output
Consumption and income provide one of the more familiar examples of the phenomenon described above. The logs of GDP and consumption for 1950.1 to 2000.4 are plotted in Figure 22.10. Both variables are obviously nonstationary. We have already verified that there is a unit root in the income data. We leave as an exercise for the reader to verify that the consumption variable is likewise I(1). Nonetheless, there is a clear relationship between consumption and output. To see where this discussion of relationships among variables is going, consider a simple regression of the log of consumption on the log of income, where both variables are in mean deviation form (so the regression includes a constant). The slope in that regression is 1.056765. The residuals from the regression,

ut = [ln Cons*, ln GDP*][1, −1.056765]′

(where the “*” indicates mean deviations), are plotted in Figure 22.11. The trend is clearly absent from the residuals. But it remains to verify whether the series of residuals is stationary. In the ADF regression of the least squares residuals on a constant (random walk with drift), the lagged value, and the lagged first difference, the coefficient on ut−1 is 0.838488 (0.0370205) and that on ut−1 − ut−2 is −0.098522. (The constant differs trivially from zero because two observations are lost in computing the ADF regression.) With 202 observations, we find DFτ = −4.63 and DFγ = −29.55. Both are well below the critical values, which suggests that the residual series does not contain a unit root. We conclude (at least it appears so) that even after accounting for the trend, although neither of the original variables is stationary, there is a linear combination of them that is. If this conclusion holds up after a more formal treatment of the testing procedure, we will state that log GDP and log consumption are cointegrated.

Example 22.6 Several Cointegrated Series
The theory of purchasing power parity specifies that in long-run equilibrium, exchange rates will adjust to erase differences in purchasing power across different economies. Thus, if p1 and p0 are the price levels in two countries and E is the exchange rate between the two currencies, then in equilibrium,

vt = Et p1t/p0t = μ, a constant.

FIGURE 22.10  Cointegrated Variables: Logs of Consumption and GDP.

FIGURE 22.11  Residuals from Consumption–Income Regression.

The price levels in any two countries are likely to be strongly trended. But allowing for short-term deviations from equilibrium, the theory suggests that for a particular β = (ln μ, −1, 1), in the model

ln Et = β1 + β2 ln p1t + β3 ln p0t + εt ,

εt = ln vt would be a stationary series, which would imply that the logs of the three variables in the model are cointegrated.

We suppose that the model involves M variables, yt = [y1t, . . . , yMt]′, which individually may be I(0) or I(1), and a long-run equilibrium relationship,

γ′yt − β′xt = 0.

The “regressors” may include a constant, exogenous variables assumed to be I(0), and/or a time trend. The vector of parameters γ is the cointegrating vector. In the short run, the system may deviate from its equilibrium, so the relationship is rewritten as

γ′yt − β′xt = εt,

where the equilibrium error εt must be a stationary series. In fact, because there are M variables in the system, at least in principle, there could be more than one cointegrating vector. In a system of M variables, there can only be up to M − 1 linearly independent cointegrating vectors. A proof of this proposition is very simple, but useful at this point.

Proof: Suppose that γi is a cointegrating vector and that there are M linearly independent cointegrating vectors. Then, neglecting β′xt for the moment, for every γi, γi′yt is a stationary series νti. Any linear combination of a set of stationary series is stationary, so it follows that every linear combination of the


cointegrating vectors is also a cointegrating vector. If there are M such M × 1 linearly independent vectors, then they form a basis for the M-dimensional space, so any M × 1 vector can be formed from these cointegrating vectors, including the columns of an M × M identity matrix. Thus, the first column of an identity matrix would be a cointegrating vector, or y1t is I(0). This result is a contradiction, because we are allowing y1t to be I(1). It follows that there can be at most M − 1 cointegrating vectors. The number of linearly independent cointegrating vectors that exist in the equilibrium system is called its cointegrating rank. The cointegrating rank may range from 1 to M − 1. If it exceeds one, then we will encounter an interesting identification problem. As a consequence of the observation in the preceding proof, we have the unfortunate result that, in general, if the cointegrating rank of a system exceeds one, then without out-of-sample, exact information, it is not possible to estimate behavioral relationships as cointegrating vectors. Enders (1995) provides a useful example.

Example 22.7 Multiple Cointegrating Vectors
We consider the logs of four variables, money demand m, the price level p, real income y, and an interest rate r. The basic relationship is

m = γ0 + γ1p + γ2y + γ3r + ε.

The price level and real income are assumed to be I(1). The existence of long-run equilibrium in the money market implies a cointegrating vector α1. If the Fed follows a certain feedback rule, increasing the money stock when nominal income (y + p) is low and decreasing it when nominal income is high—which might make more sense in terms of rates of growth—then there is a second cointegrating vector in which γ1 = γ2 and γ3 = 0. Suppose that we label this vector α2. The parameters in the money demand equation, notably the interest elasticity, are interesting quantities, and we might seek to estimate α1 to learn the value of this quantity. But since every linear combination of α1 and α2 is a cointegrating vector, to this point we are only able to estimate a hash of the two cointegrating vectors. In fact, the parameters of this model are identifiable from sample information (in principle). We have specified two cointegrating vectors,

α1 = [1, −γ10, −γ11, −γ12, −γ13] and

α2 = [1, −γ20, −γ21, −γ21, 0].

Although it is true that every linear combination of α1 and α2 is a cointegrating vector, only the original two vectors, as specified, have a 1 in the first position of both and a 0 in the last position of the second. (The equality restriction actually overidentifies the parameter matrix.) This result is, of course, exactly the sort of analysis that we used in establishing the identifiability of a simultaneous equations system.

22.3.1 COMMON TRENDS

If two I(1) variables are cointegrated, then some linear combination of them is I(0). Intuition should suggest that the linear combination does not mysteriously create a well-behaved new variable; rather, something present in the original variables must be missing from the aggregated one. Consider an example. Suppose that two I(1) variables have a linear trend,

y1t = α + βt + ut ,

y2t = γ + δt + vt,


where ut and vt are white noise. A linear combination of y1t and y2t with vector (1,θ) produces the new variable,

zt = (α + θγ)+ (β + θδ)t + ut + θvt ,

which, in general, is still I(1). In fact, the only way the zt series can be made stationary is if θ = −β/δ. If so, then the effect of combining the two variables linearly is to remove the common linear trend, which is the basis of Stock and Watson’s (1988) analysis of the problem. But their observation goes an important step beyond this one. The only way that y1t and y2t can be cointegrated to begin with is if they have a common trend of some sort. To continue, suppose that instead of the linear trend t, the terms on the right-hand side, y1 and y2, are functions of a random walk, wt = wt−1 + ηt, where ηt is white noise. The analysis is identical. But now suppose that each variable yit has its own random walk component wit, i = 1, 2. Any linear combination of y1t and y2t must involve both random walks. It is clear that they cannot be cointegrated unless, in fact, w1t = w2t. That is, once again, they must have a common trend. Finally, suppose that y1t and y2t share two common trends,

y1t = α + βt + λwt + ut ,

y2t = γ + δt + πwt + vt.

We place no restriction on λ and π. Then, a bit of manipulation will show that it is not possible to find a linear combination of y1t and y2t that is cointegrated, even though they share common trends. The end result for this example is that if y1t and y2t are cointegrated, then they must share exactly one common trend. As Stock and Watson determined, the preceding is the crux of the cointegration of economic variables. A set of M variables that are cointegrated can be written as a stationary component plus linear combinations of a smaller set of common trends. If the cointegrating rank of the system is r, then there can be up to M − r linear trends and M − r common random walks. [See Hamilton (1994, p. 578).] (The two-variable case is special. In a two-variable system, there can be only one common trend in total.) The effect of the cointegration is to purge these common trends from the resultant variables.

22.3.2 ERROR CORRECTION AND VAR REPRESENTATIONS

Suppose that the two I(1) variables yt and zt are cointegrated and that the cointegrating vector is [1, −θ]. Then all three variables, Δyt = yt − yt−1, Δzt, and (yt − θzt) are I(0). The error correction model

Δyt = xt′β + γ(Δzt) + λ(yt−1 − θzt−1) + εt

describes the variation in yt around its long-run trend in terms of a set of I(0) exogenous factors xt, the variation of zt around its long-run trend, and the error correction (yt − θzt), which is the equilibrium error in the model of cointegration. There is a tight connection between models of cointegration and models of error correction. The model in this form is reasonable as it stands, but in fact, it is only internally consistent if the two variables are cointegrated. If not, then the third term, and hence the right-hand side, cannot be I(0), even though the left-hand side must be. The upshot is that the same assumption that we make to produce the cointegration implies (and is implied by) the existence


of an error correction model.13 As we will examine in the next section, the utility of this representation is that it suggests a way to build an elaborate model of the long-run variation in yt as well as a test for cointegration. Looking ahead, the preceding suggests that residuals from an estimated cointegration model—that is, estimated equilibrium errors—can be included in an elaborate model of the long-run covariation of yt and zt. Once again, we have the foundation of Engle and Granger’s approach to analyzing cointegration. Consider the VAR representation of the model

yt = Γyt−1 + εt,

where the vector yt is [yt, zt]′. Now take first differences to obtain

yt − yt−1 = (Γ − I)yt−1 + εt, or

Δyt = Πyt−1 + εt, where Π = Γ − I.

If all variables are I(1), then all M variables on the left-hand side are I(0). Whether those on the right-hand side are I(0) remains to be seen. The matrix Π produces linear combinations of the variables in yt. But as we have seen, not all linear combinations can be cointegrated. The number of such independent linear combinations is r < M. Therefore, although there must be a VAR representation of the model, cointegration implies a restriction on the rank of Π. It cannot have full rank; its rank is r. From another viewpoint, a different approach to discerning cointegration is suggested. Suppose that we estimate this model as an unrestricted VAR. The resultant coefficient matrix should be short-ranked. The implication is that if we fit the VAR model and impose short rank on the coefficient matrix as a restriction—how we could do that remains to be seen—then if the variables really are cointegrated, this restriction should not lead to a loss of fit. This implication is the basis of Johansen’s (1988) and Stock and Watson’s (1988) analysis of cointegration.
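The rank restriction can be checked numerically. In this sketch (our own, hypothetical construction), a cointegrated pair is simulated and Δyt is regressed on yt−1; the estimated coefficient matrix has one singular value near zero, consistent with a cointegrating rank of one:

```python
import numpy as np

rng = np.random.default_rng(7)
T = 5000
w = np.cumsum(rng.standard_normal(T))             # one common trend
y = np.column_stack([w + rng.standard_normal(T),  # y_t
                     w + rng.standard_normal(T)]) # z_t (M = 2, rank r = 1)

dy = np.diff(y, axis=0)                           # first differences
Pi, *_ = np.linalg.lstsq(y[:-1], dy, rcond=None)  # unrestricted least squares
# Pi (up to a transposition, which does not affect rank) estimates the
# coefficient matrix; its second singular value is near zero:
print(np.linalg.svd(Pi, compute_uv=False))
```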

22.3.3 TESTING FOR COINTEGRATION

A natural first step in the analysis of cointegration is to establish that it is indeed a characteristic of the data. Two broad approaches for testing for cointegration have been developed. The Engle and Granger (1987) method is based on assessing whether single-equation estimates of the equilibrium errors appear to be stationary. The second approach, due to Johansen (1988, 1991) and Stock and Watson (1988), is based on the VAR approach. As noted earlier, if a set of variables is truly cointegrated, then we should be able to detect the implied restrictions in an otherwise unrestricted VAR. We will examine these two methods in turn. Let yt denote the set of M variables that are believed to be cointegrated. Step one of either analysis is to establish that the variables are indeed integrated to the same order. The Dickey–Fuller tests discussed in Section 22.2.4 can be used for this purpose. If the evidence suggests that the variables are integrated to different orders or not at all, then the specification of the model should be reconsidered.

13The result in its general form is known as the Granger representation theorem. See Hamilton (1994, p. 582).


If the cointegration rank of the system is r, then there are r independent vectors, γi = [1, −θi], where each vector is distinguished by being normalized on a different variable. If we suppose that there is also a set of I(0) exogenous variables, including a constant, in the model, then each cointegrating vector produces the equilibrium relationship

γi′yt = xt′β + εit,

which we may rewrite as

yit = Yit′θi + xt′β + εit.

We can obtain estimates of θi by least squares regression. If the theory is correct and if this OLS estimator is consistent, then residuals from this regression should estimate the equilibrium errors. There are two obstacles to consistency. First, because both sides of the equation contain I(1) variables, the problem of spurious regressions appears. Second, a moment’s thought should suggest that what we have done is extract an equation from an otherwise ordinary simultaneous equations model and propose to estimate its parameters by ordinary least squares. As we examined in Chapter 13, consistency is unlikely in that case. It is one of the extraordinary results of this body of theory that in this setting, neither of these considerations is a problem. In fact, as shown by a number of authors [see, e.g., Davidson and MacKinnon (1993)], not only is ci, the OLS estimator of θi, consistent, it is superconsistent in that its asymptotic variance is O(1/T²) rather than O(1/T) as in the usual case. Consequently, the problem of spurious regressions disappears as well. Therefore, the next step is to estimate the cointegrating vector(s), by OLS. Under all the assumptions thus far, the residuals from these regressions, eit, are estimates of the equilibrium errors, εit. As such, they should be I(0). The natural approach would be to apply the familiar Dickey–Fuller tests to these residuals. The logic is sound, but the Dickey–Fuller tables are inappropriate for these estimated errors. Estimates of the appropriate critical values for the tests are given by Engle and Granger (1987), Engle and Yoo (1987), Phillips and Ouliaris (1990), and Davidson and MacKinnon (1993). If autocorrelation in the equilibrium errors is suspected, then an augmented Engle and Granger test can be based on the template

Δeit = δei,t−1 + φ1(Δei,t−1) + · · · + ut.

If the null hypothesis that δ = 0 cannot be rejected (against the alternative δ < 0), then we conclude that the variables are not cointegrated. (Cointegration can be rejected by this method. Failing to reject does not confirm it, of course. But having failed to reject the presence of cointegration, we will proceed as if our finding had been affirmative.)
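Operationally, the augmented test is just an ADF regression applied to the first-step residuals. A minimal sketch follows (our own; the function name and interface are hypothetical, and the returned t ratio must be compared with the Engle and Granger critical values, not the standard Dickey–Fuller tables):

```python
import numpy as np

def engle_granger_adf(y, x, lags=1):
    """Static OLS of y on (1, x), then the ADF regression on the residuals.
    Returns the t ratio on e_{t-1}; compare with the Engle-Granger tables."""
    T = len(y)
    X = np.column_stack([np.ones(T), x])
    e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]   # equilibrium errors
    de = np.diff(e)
    cols = [e[lags:-1]]                                # e_{t-1}
    for j in range(1, lags + 1):                       # lagged differences
        cols.append(de[lags - j:-j])
    Z, dep = np.column_stack(cols), de[lags:]
    b, *_ = np.linalg.lstsq(Z, dep, rcond=None)
    u = dep - Z @ b
    s2 = u @ u / (len(dep) - Z.shape[1])
    se = np.sqrt(s2 * np.linalg.inv(Z.T @ Z)[0, 0])
    return b[0] / se
```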

Example 22.8 (Continued) Cointegration in Consumption and Output
In the example presented at the beginning of this discussion, we proposed precisely the sort of test suggested by Phillips and Ouliaris (1990) to determine if (log) consumption and (log) GDP are cointegrated. As noted, the logic of our approach is sound, but a few considerations remain. The Dickey–Fuller critical values suggested for the test are appropriate only in a few cases, and not when several trending variables appear in the equation. For the case of only a pair of trended variables, as we have here, one may use infinite sample values in the Dickey–Fuller tables for the trend stationary form of the equation. (The drift and trend would have been removed from the residuals by the original regression, which would have these terms either embedded in the variables or explicitly in the equation.) Finally, there remains an issue of how many lagged differences to include in the ADF regression. We have specified one, although further analysis might be called for. [A lengthy discussion of this set of issues appears in Hayashi (2000, pp. 645–648).] Thus, but for the possibility of this specification issue, the ADF approach suggested in the introduction does pass muster. The sample value found earlier was −4.63. The critical values from the table are −3.45 for 5 percent and −3.67 for 2.5 percent. Thus, we conclude (as have many other analysts) that log consumption and log GDP are cointegrated.

The Johansen (1988, 1992) and Stock and Watson (1988) methods are similar, so we will describe only the first one. The theory is beyond the scope of this text, although the operational details are suggestive. To carry out the Johansen test, we first formulate the VAR:

yt = Γ1yt−1 + Γ2yt−2 + · · · + Γpyt−p + εt.

The order of the model, p, must be determined in advance. Now, let zt denote the vector of M(p − 1) variables,

zt = [Δyt−1, Δyt−2, . . . , Δyt−p+1].

That is, zt contains the lags 1 to p−1 of the first differences of all M variables. Now, using the T available observations, we obtain two T × M matrices of least squares residuals:

D = the residuals in the regressions of Δyt on zt,

E = the residuals in the regressions of yt−p on zt . We now require the M2 canonical correlations between the columns in D and those ∗ in E. To continue, we will digress briefly to define the canonical correlations. Let d1 ∗ denote a linear combination of the columns of D, and let e1 denote the same from E. We wish to choose these two linear combinations so as to maximize the correlation between them. This pair of variables are the first canonical variates, and their correlation ∗ r1 is the first canonical correlation. In the setting of cointegration, this computation has ∗ ∗ ∗ some intuitive appeal. Now, with d1 and e1 in hand, we seek a second pair of variables d2 ∗ and e2 to maximize their correlation, subject to the constraint that this second variable in each pair be orthogonal to the first. This procedure continues for all M pairs of variables. It turns out that the computation of all these is quite simple. We will not need to compute the coefficient vectors for the linear combinations. The squared canonical correlations are simply the ordered characteristic roots of the matrix ∗ = −1/2 −1 −1/2, R RDD RDEREEREDRDD

where Rij is the (cross-) correlation matrix between variables in set i and set j, for i, j = D, E. Finally, the null hypothesis that there are r or fewer cointegrating vectors is tested using the test statistic M =− − ( ∗)2 . TRACE TEST T ln[1 ri ] i=r+1 If the correlations based on actual disturbances had been observed instead of estimated, then we would refer this statistic to the chi-squared distribution with M − r degrees of freedom. Alternative sets of appropriate tables are given by Johansen and Juselius (1990) and Osterwald-Lenum (1992). Large values give evidence against the hypothesis of r or fewer cointegrating vectors. Greene-50558 book June 23, 2007 12:28


22.3.4 ESTIMATING COINTEGRATION RELATIONSHIPS

Both of the testing procedures discussed earlier involve actually estimating the cointegrating vectors, so this additional section is actually superfluous. In the Engle and Granger framework, at a second step after the cointegration test, we can use the residuals from the static regression as an error correction term in a dynamic, first-difference regression, as shown in Section 22.3.2. One can then “test down” to find a satisfactory structure. In the Johansen test shown earlier, the characteristic vectors corresponding to the canonical correlations are the sample estimates of the cointegrating vectors. Once again, computation of an error correction model based on these first step results is a natural next step. We will explore these in an application.

22.3.5 APPLICATION: GERMAN MONEY DEMAND

The demand for money has provided a convenient and well-targeted illustration of methods of cointegration analysis. The central equation of the model is

mt − pt = μ + βyt + γ it + εt , (22-8)

where mt, pt, and yt are the logs of nominal money demand, the price level, and output, and it is the nominal interest rate (not its log). The equation involves trending variables (mt, pt, yt), and one that we found earlier appears to be a random walk with drift (it). As such, the usual form of statistical inference for estimation of the income elasticity and interest semielasticity based on stationary data is likely to be misleading. Beyer (1998) analyzed the demand for money in Germany over the period 1975 to 1994. A central focus of the study was whether the 1990 reunification produced a structural break in the long-run demand function. (The analysis extended an earlier study by the same author that was based on data that predated the reunification.) One of the interesting questions pursued in this literature concerns the stability of the long-term demand equation,

(m − p)t − yt = μ + γit + εt. (22-9)

The left-hand side is the log of the inverse of the velocity of money, as suggested by Lucas (1988). An issue to be confronted in this specification is the exogeneity of the interest variable—exogeneity [in the Engle, Hendry, and Richard (1993) sense] of income is moot in the long-run equation as its coefficient is assumed (per Lucas) to equal one. Beyer explored this latter issue in the framework developed by Engle et al. (see Section 22.3.5) and through the Granger causality testing methods discussed in Section 20.6.5. The analytical platform of Beyer’s study is a long-run function for the real money stock M3 (we adopt the author’s notation)

(m − p)* = δ0 + δ1y + δ2RS + δ3RL + δ4Δ4p, (22-10)

where RS is a short-term interest rate, RL is a long-term interest rate, and Δ4p is the annual inflation rate—the data are quarterly. The first step is an examination of the data. Augmented Dickey–Fuller tests suggest that for these German data in this period, mt and pt are I(2), while (mt − pt), yt, Δ4pt, RSt, and RLt are all I(1). Some of Beyer’s results which produced these conclusions are shown in Table 22.6. Note that although both mt and pt appear to be I(2), their simple difference (linear combination) is I(1),


TABLE 22.6 Augmented Dickey–Fuller Tests for Variables in the Beyer Model

Variable       m       Δm      Δ²m     p       Δp      Δ²p     Δ4p     ΔΔ4p
Spec.          TS      RW      RW      TS      RW/D    RW      RW/D    RW
Lag            0       4       3       4       3       2       2       2
DFτ            −1.82   −1.61   −6.87   −2.09   −2.14   −10.6   −2.66   −5.48
Crit. Value    −3.47   −1.95   −1.95   −3.47   −2.90   −1.95   −2.90   −1.95

Variable       y       Δy      RS      ΔRS     RL      ΔRL     (m − p) Δ(m − p)
Spec.          TS      RW/D    TS      RW      TS      RW      RW/D    RW/D
Lag            4       3       1       0       1       0       0       0
DFτ            −1.83   −2.91   −2.33   −5.26   −2.40   −6.01   −1.65   −8.50
Crit. Value    −3.47   −2.90   −2.90   −1.95   −2.90   −1.95   −3.47   −2.90

that is, integrated to a lower order. That produces the long-run specification given by (22-10). The Lucas specification is layered onto this to produce the model for the long-run velocity

(m − p − y)* = δ0* + δ2*RS + δ3*RL + δ4*Δ4p. (22-11)

22.3.5.a Cointegration Analysis and a Long-Run Theoretical Model

For (22-10) to be a valid model, there must be at least one cointegrating vector that transforms zt = [(mt − pt), yt, RSt, RLt, Δ4pt]′ to stationarity. The Johansen trace test described in Section 22.3.3 was applied to the VAR consisting of these five I(1) variables. A lag length of two was chosen for the analysis. The results of the trace test are a bit ambiguous; the hypothesis that r = 0 is rejected, albeit not strongly (sample value = 90.17 against a 95 percent critical value = 87.31), while the hypothesis that r ≤ 1 is not rejected (sample value = 60.15 against a 95 percent critical value of 62.99). (These borderline results follow from the result that Beyer’s first three eigenvalues—canonical correlations in the trace test statistic—are nearly equal. Variation in the test statistic results from variation in the correlations.) On this basis, it is concluded that the cointegrating rank equals one. The unrestricted cointegrating vector for the equation, with a time trend added, is found to be

(m − p) = 0.936y − 1.780Δ4p + 1.601RS − 3.279RL + 0.002t. (22-12)

(These are the coefficients from the first characteristic vector of the canonical correlation analysis in the Johansen computations detailed in Section 22.3.3.) An exogeneity test—we have not developed this in detail; see Beyer (1998, p. 59), Hendry and Ericsson (1991), and Engle and Hendry (1993)—confirms weak exogeneity of all four right-hand-side variables in this specification. The final specification test is for the Lucas formulation and elimination of the time trend, both of which are found to pass, producing the cointegration vector

(m − p − y) = −1.832Δ4p + 4.352RS − 10.89RL.

The conclusion drawn from the cointegration analysis is that a single-equation model for the long-run money demand is appropriate and a valid way to proceed. A last step before this analysis is a series of Granger causality tests for feedback between


changes in the money stock and the four right-hand-side variables in (22-12) (not including the trend). (See Section 20.6.5.) The test results are generally favorable, with some mixed results for exogeneity of GDP.

22.3.5.b Testing for Model Instability

Let zt = [Δ(m − p)t, Δyt, ΔΔ4pt, ΔRSt, ΔRLt]′ and let z⁰t−1 denote the entire history of zt up to the previous period. The joint distribution for zt, conditioned on z⁰t−1 and a set of parameters θ, factors one level further into

f(zt | z⁰t−1, θ) = f[Δ(m − p)t | Δyt, ΔΔ4pt, ΔRSt, ΔRLt, z⁰t−1, θ1] × g(Δyt, ΔΔ4pt, ΔRSt, ΔRLt | z⁰t−1, θ2).

The result of the exogeneity tests carried out earlier implies that the conditional distribution may be analyzed apart from the marginal distribution—that is, the implication of the Engle, Hendry, and Richard results noted earlier. Note the partitioning of the parameter vector. Thus, the conditional model is represented by an error correction form that explains Δ(m − p)t in terms of its own lags, the error correction term and contemporaneous and lagged changes in the (now established) weakly exogenous variables as well as other terms such as a constant term, trend, and certain dummy variables which pick up particular events. The error correction model specified is

Δ(m − p)t = Σ_{i=1}^{4} ci Δ(m − p)t−i + Σ_{i=0}^{4} d1,i ΔΔ4pt−i + Σ_{i=0}^{4} d2,i Δyt−i
          + Σ_{i=0}^{4} d3,i ΔRSt−i + Σ_{i=0}^{4} d4,i ΔRLt−i + λ(m − p − y)t−1   (22-13)
          + γ1RSt−1 + γ2RLt−1 + dt′φ + ωt,

where dt is the set of additional variables, including the constant and five one-period dummy variables that single out specific events such as a currency crisis in September, 1992 [Beyer (1998, p. 62, fn. 4)]. The model is estimated by least squares, “stepwise simplified and reparameterized.” (The number of parameters in the equation is reduced from 32 to 15.14) The estimated form of (22-13) is an autoregressive distributed lag model. We proceed to use the model to solve for the long-run, steady-state growth path of the real money stock, (22-10). The annual growth rates Δ4m = gm, Δ4p = gp, Δ4y = gy and (assumed) Δ4RS = gRS = Δ4RL = gRL = 0 are used for the solution15

(1/4)(gm − gp) = (c4/4)(gm − gp) − d1,1gp + d2,2gy + γ1RS + γ2RL + λ(m − p − y).

This equation is solved for (m − p)* under the assumption that gm = (gy + gp),

(m − p)* = δ̂0 + δ̂1gy + y + δ̂2Δ4p + δ̂3RS + δ̂4RL.

Analysis then proceeds based on this estimated long-run relationship.

14The equation ultimately used is Δ(mt − pt) = h[Δ(m − p)t−4, ΔΔ4pt, Δyt−2, ΔRSt−1 + ΔRSt−3, ΔRLt, RS²t−1, RL²t−1, Δ4pt−1, (m − p − y)t−1, dt].
15The division of the coefficients is done because the intervening lags do not appear in the estimated equation.


The primary interest of the study is the stability of the demand equation pre- and postunification. A comparison of the parameter estimates from the same set of procedures using the period 1976–1989 shows them to be surprisingly similar, [(1.22 − 3.67gy), 1, −3.67, 3.67, −6.44] for the earlier period and [(1.25 − 2.09gy), 1, −3.625, 3.5, −7.25] for the later one. This suggests, albeit informally, that the function has not changed (at least by much). A variety of testing procedures for structural break led to the conclusion that in spite of the dramatic changes of 1990, the long-run money demand function had not materially changed in the sample period.

22.4 NONSTATIONARY PANEL DATA

In Section 9.8.5, we began to examine panel data settings in which T, the number of observations in each group (e.g., country), became large as well as n. Applications include cross-country studies of growth using the Penn World Tables [Im, Pesaran, and Shin (2003) and Sala-i-Martin (1996)], studies of purchasing power parity [Pedroni (2001)], and analyses of health care expenditures [McCoskey and Selden (1998)]. In the small T cases of longitudinal, microeconomic data sets, the time-series properties of the data are a side issue that is usually of little interest. But when T is growing at essentially the same rate as n, for example, in the cross-country studies, these properties become a central focus of the analysis. The large T, large n case presents several complications for the analyst. In the longitudinal analysis, pooling of the data is usually a given, although we developed several extensions of the models to accommodate parameter heterogeneity (see Section 9.8.5). In a long-term cross-country model, any type of pooling would be especially suspect. The time series are long, so this would seem to suggest that the appropriate modeling strategy would be simply to analyze each country separately. But this would neglect the hypothesized commonalities across countries such as a (proposed) common growth rate. Thus, the recent “time-series panel data” literature seeks to reconcile these opposing features of the data. As in the single time-series cases examined earlier in this chapter, long-term aggregate series are usually nonstationary, which calls conventional methods (such as those in Section 9.8.5) into question. A focus of the recent literature, for example, is on testing for unit roots in an analog to the platform for the augmented Dickey–Fuller tests (Section 22.2),

Δyit = ρi yi,t−1 + Σ_{m=1}^{Li} γim Δyi,t−m + αi + βi t + εit.

Different formulations of this model have been analyzed, for example, by Levin, Lin, and Chu (2002), who assume ρi = ρ; Im, Pesaran, and Shin (2003), who relax that restriction; and Breitung (2000), who considers various mixtures of the cases. An extension of the KPSS test in Section 22.2.5 that is particularly simple to compute is Hadri’s (2000) LM statistic,

LM = (1/n) Σ_{i=1}^{n} [Σ_{t=1}^{T} E²it / (T² σ̂²ε)] = (1/n) Σ_{i=1}^{n} KPSSi.


This is the sample average of the KPSS statistics for the n countries. Note that it includes two assumptions: that the countries are independent and that there is a common σ²ε for all countries. An alternative is suggested that allows σ²ε to vary across countries. As it stands, the preceding model would suggest that separate analyses for each country would be appropriate. An issue to consider, then, would be how to combine, if possible, the separate results in some optimal fashion. Maddala and Wu (1999), for example, suggested a “Fisher-type” chi-squared test based on P = −2 Σi ln pi, where pi is the p-value from the individual tests. Under the null hypothesis that ρi equals zero, the limiting distribution is chi-squared with 2n degrees of freedom. Analysis of cointegration, and models of cointegrated series in the panel data setting, parallel the single time-series case, but also differ in a crucial respect. [See, e.g., Kao (1999), McCoskey and Kao (1999), and Pedroni (2000, 2004).] Whereas in the single time-series case, the analysis of cointegration focuses on the long-run relationships between, say, xt and zt for two variables for the same country, in the panel data setting, say, in the analysis of exchange rates, inflation, purchasing power parity or international R & D spillovers, interest may focus on a long-run relationship between xit and xmt for two different countries (or n countries). This substantially complicates the analyses. It is also well beyond the scope of this text. Extensive surveys of these issues may be found in Baltagi (2005, Chapter 12) and Smith (2000).
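Both statistics are simple to assemble from their single-country ingredients. The sketch below (our own; it reuses the kpss_stat function from the earlier sketch and assumes the country-level ADF p-values have been computed separately) forms Hadri's LM statistic and the Maddala and Wu Fisher statistic:

```python
import numpy as np
from scipy.stats import chi2

def hadri_lm(panel, L):
    """Cross-country average of the KPSS statistics (independent countries
    and a common variance, as in the text); kpss_stat is the earlier sketch."""
    return np.mean([kpss_stat(y, L) for y in panel])

def maddala_wu(p_values):
    """Fisher-type combination: P = -2 sum(ln p_i), chi-squared(2n) under H0."""
    P = -2.0 * np.sum(np.log(p_values))
    return P, chi2.sf(P, 2 * len(p_values))
```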

22.5 SUMMARY AND CONCLUSIONS

This chapter has completed our survey of techniques for the analysis of time-series data. While Chapters 20 and 21 were about extensions of regression modeling to the time-series setting, most of the results in this chapter focus on the internal structure of the individual time series themselves. Chapter 21 presented the standard models for time-series processes. While the empirical distinction between, say, AR(p) and MA(q) series may seem ad hoc, the Wold decomposition assures that with enough care, a variety of models can be used to analyze a time series. This chapter described what is arguably the fundamental tool of modern macroeconometrics: the tests for nonstationarity. Contemporary econometric analysis of macroeconomic data has added considerable structure and formality to trending variables, which are more common than not in that setting. The variants of the Dickey–Fuller and KPSS tests for unit roots are an indispensable tool for the analyst of time-series data. Section 22.3 then considered the subject of cointegration. This modeling framework is a distinct extension of the regression modeling where this discussion began. Cointegrated relationships and equilibrium relationships form the basis of the time-series counterpart to regression relationships. But, in this case, it is not the conditional mean as such that is of interest. Here, both the long-run equilibrium and short-run relationships around trends are of interest and are studied in the data.

Key Terms and Concepts

• Augmented Dickey–Fuller test
• Autoregressive integrated moving-average (ARIMA) process
• Canonical correlation
• Cointegrating vector
• Cointegration
• Cointegration rank
• Cointegration relationship
• Common trend
• Data generating process (DGP)
• Dickey–Fuller test
• Equilibrium error
• Error correction model
• Fractional integration
• Integrated of order one
• KPSS test
• Phillips–Perron test
• Random walk
• Random walk with drift
• Spurious regression
• Superconsistent
• Trend stationary process
• Unit root

Exercise

1. Find the autocorrelations and partial autocorrelations for the MA(2) process

εt = vt − θ1vt−1 − θ2vt−2.

Applications

1. Carry out the ADF test for a unit root in the bond yield data of Example 21.1.
2. Using the macroeconomic data in Appendix Table F5.1, estimate by least squares the parameters of the model

ct = β0 + β1 yt + β2ct−1 + β3ct−2 + εt ,

where ct is the log of real consumption and yt is the log of real disposable income.
a. Use the Breusch–Godfrey test to examine the residuals for autocorrelation.
b. Is the estimated equation stable? What is the characteristic equation for the autoregressive part of this model? What are the roots of the characteristic equation, using your estimated parameters?
c. What is your implied estimate of the short-run (impact) multiplier for change in yt on ct? Compute the estimated long-run multiplier.
3. Carry out an ADF test for a unit root in the rate of inflation using the subset of the data in Appendix Table F5.1 since 1974.1. (This is the first quarter after the oil shock of 1973.)
4. Estimate the parameters of the model in Example 13.1 using two-stage least squares. Obtain the residuals from the two equations. Do these residuals appear to be white noise series? Based on your findings, what do you conclude about the specification of the model?

23 MODELS FOR DISCRETE CHOICE

23.1 INTRODUCTION

There are many settings in which the economic outcome we seek to model is a discrete choice among a set of alternatives, rather than a continuous measure of some activity. Consider, for example, modeling labor force participation, the decision of whether or not to make a major purchase, or the decision of which candidate to vote for in an election. For the first of these examples, intuition would suggest that factors such as age, education, marital status, number of children, and some economic data would be relevant in explaining whether an individual chooses to seek work or not in a given period. But something is obviously lacking if this example is treated as the same sort of regression model we used to analyze consumption or the costs of production or the movements of exchange rates. In this chapter, we shall examine a variety of what have come to be known as qualitative response (QR) models. There are numerous different types that apply in different situations. What they have in common is that they are models in which the dependent variable is an indicator of a discrete choice, such as a “yes or no” decision. In general, conventional regression methods are inappropriate in these cases. This chapter is a lengthy but far from complete survey of topics in estimating QR models. Almost none of these models can be consistently estimated with linear regression methods. Therefore, readers interested in the mechanics of estimation may want to review the material in Appendices D and E before continuing. In most cases, the method of estimation is maximum likelihood. The various properties of maximum likelihood estimators are discussed in Chapter 16. We shall assume throughout this chapter that the necessary conditions behind the optimality properties of maximum likelihood estimators are met and, therefore, we will not derive or establish these properties specifically for the QR models. Detailed proofs for most of these models can be found in surveys by Amemiya (1981), McFadden (1984), Maddala (1983), and Dhrymes (1984). Additional commentary on some of the issues of interest in the contemporary literature is given by Maddala and Flores-Lagunes (2001). Cameron and Trivedi (2005) contains numerous applications. Greene (2008) provides a general survey.

23.2 DISCRETE CHOICE MODELS

The general class of models we shall consider are those for which the dependent variable takes values 0, 1, 2,.... In a few cases, the values will themselves be meaningful, as in the following:

1. Number of patents: y = 0, 1, 2, . . . . These are count data.


In most of the cases we shall study, the values taken by the dependent variables are merely a coding for some qualitative outcome. Some examples are as follows:

2. Labor force participation: We equate “no” with 0 and “yes” with 1. These decisions are qualitative choices. The 0/1 coding is a mere convenience.
3. Opinions of a certain type of legislation: Let 0 represent “strongly opposed,” 1 “opposed,” 2 “neutral,” 3 “support,” and 4 “strongly support.” These numbers are rankings, and the values chosen are not quantitative but merely an ordering. The difference between the outcomes represented by 1 and 0 is not necessarily the same as that between 2 and 1.
4. The occupational field chosen by an individual: Let 0 be clerk, 1 engineer, 2 lawyer, 3 politician, and so on. These data are merely categories, giving neither a ranking nor a count.
5. Consumer choice among alternative shopping areas: This case has the same characteristics as number 4, but the appropriate model is a bit different. These two examples will differ in the extent to which the choice is based on characteristics of the individual, which are probably dominant in the occupational choice, as opposed to attributes of the choices, which is likely the more important consideration in the choice of shopping venue.

None of these situations lends itself readily to our familiar type of regression analysis. Nonetheless, in each case, we can construct models that link the decision or outcome to a set of factors, at least in the spirit of regression. Our approach will be to analyze each of them in the general framework of probability models:

Prob(event j occurs) = Prob(Y = j) = F[relevant effects, parameters]. (23-1)

The study of qualitative choice focuses on appropriate specification, estimation, and use of models for the probabilities of events, where in most cases, the “event” is an individual’s choice among a set of alternatives.

Example 23.1 Labor Force Participation Model
In Example 4.3 we estimated an earnings equation for the subsample of 428 married women who participated in the formal labor market taken from a full sample of 753 observations. The semilog earnings equation is of the form

ln earnings = β1 + β2 age + β3 age² + β4 education + β5 kids + ε

where earnings is hourly wage times hours worked, education is measured in years of schooling, and kids is a binary variable which equals one if there are children under 18 in the household. What of the other 325 individuals? The underlying labor supply model described a market in which labor force participation was the outcome of a market process whereby the demanders of labor services were willing to offer a wage based on expected marginal product and individuals themselves made a decision whether or not to accept the offer depending on whether it exceeded their own reservation wage. The first of these depends on, among other things, education, while the second (we assume) depends on such variables as age, the presence of children in the household, other sources of income (husband’s), and marginal tax rates on labor income. The sample we used to fit the earnings equation contains data on all these other variables. The models considered in this chapter would be appropriate for modeling the outcome yi = 1 if in the labor force, and 0 if not.


23.3 MODELS FOR BINARY CHOICE

Models for explaining a binary (0/1) dependent variable typically arise in two contexts. In many cases, the analyst is essentially interested in a regressionlike model of the sort considered in Chapters 2 through 7. With data on the variable of interest and a set of covariates, the analyst is interested in specifying a relationship between the former and the latter, more or less along the lines of the models we have already studied. The relationship between voting behavior and income is typical. In other cases, the binary choice model arises in the context of a model in which the nature of the observed data dictates the special treatment of a binary choice model. For example, in a model of the demand for tickets for sporting events, in which the variable of interest is the number of tickets, it could happen that the observation consists only of whether the sports facility was filled to capacity (demand greater than or equal to capacity so Y = 1) or not (Y = 0). It will generally turn out that the models and techniques used in both cases are the same. Nonetheless, it is useful to examine both of them.

23.3.1 THE REGRESSION APPROACH

To focus ideas, consider the model of labor force participation suggested in Example 23.1.1 The respondent either works or seeks work (Y = 1) or does not (Y = 0) in the period in which our survey is taken. We believe that a set of factors, such as age, marital status, education, and work history, gathered in a vector x explain the decision, so that

Prob(Y = 1 | x) = F(x, β)
Prob(Y = 0 | x) = 1 − F(x, β). (23-2)

The set of parameters β reflects the impact of changes in x on the probability. For example, among the factors that might interest us is the marginal effect of marital status on the probability of labor force participation. The problem at this point is to devise a suitable model for the right-hand side of the equation. One possibility is to retain the familiar linear regression, F(x, β) = x′β. Because E[y | x] = F(x, β), we can construct the regression model,

y = E[y | x] + (y − E[y | x]) = x′β + ε. (23-3)

The linear probability model has a number of shortcomings. A minor complication arises because ε is heteroscedastic in a way that depends on β. Because x′β + ε must equal 0 or 1, ε equals either −x′β or 1 − x′β, with probabilities 1 − F and F, respectively. Thus, you can easily show that

Var[ε | x] = x′β(1 − x′β). (23-4)

We could manage this complication with an FGLS estimator in the fashion of Chapter 8. A more serious flaw is that without some ad hoc tinkering with the disturbances, we cannot be assured that the predictions from this model will truly look like probabilities.

1Models for qualitative dependent variables can now be found in most disciplines in economics. A frequent use is in labor economics in the analysis of microlevel data sets.


FIGURE 23.1  Model for a Probability.

We cannot constrain x′β to the 0–1 interval. Such a model produces both nonsense probabilities and negative variances. For these reasons, the linear model is becoming less frequently used except as a basis for comparison to some other more appropriate models.2 Our requirement, then, is a model that will produce predictions consistent with the underlying theory in (23-1). For a given regressor vector, we would expect

lim_{x′β→+∞} Prob(Y = 1 | x) = 1,
lim_{x′β→−∞} Prob(Y = 1 | x) = 0. (23-5)

See Figure 23.1. In principle, any proper, continuous probability distribution defined over the real line will suffice. The normal distribution has been used in many analyses, giving rise to the probit model,

Prob(Y = 1 | x) = ∫_{−∞}^{x′β} φ(t) dt = Φ(x′β). (23-6)

The function Φ(·) is a commonly used notation for the standard normal distribution function. Partly because of its mathematical convenience, the logistic distribution,

Prob(Y = 1 | x) = e^{x′β} / (1 + e^{x′β}) = Λ(x′β), (23-7)

2The linear model is not beyond redemption. Aldrich and Nelson (1984) analyze the properties of the model at length. Judge et al. (1985) and Fomby, Hill, and Johnson (1984) give interesting discussions of the ways we may modify the model to force internal consistency. But the fixes are sample dependent, and the resulting estimator, such as it is, may have no known sampling properties. Additional discussion of weighted least squares appears in Amemiya (1977) and Mullahy (1990). Finally, its shortcomings notwithstanding, the linear probability model is applied by Caudill (1988), Heckman and MaCurdy (1985), and Heckman and Snyder (1997). An exchange on the usefulness of the approach is Angrist (2001) and Moffitt (2001).


has also been used in many applications. We shall use the notation Λ(·) to indicate the logistic cumulative distribution function. This model is called the logit model for reasons we shall discuss in the next section. Both of these distributions have the familiar bell shape of symmetric distributions. Other models which do not assume symmetry, such as the Gumbel model,

Prob(Y = 1 | x) = exp[−exp(−x′β)],

and complementary log log model,

Prob(Y = 1 | x) = 1 − exp[−exp(x′β)],

have also been employed. Still other distributions have been suggested,3 but the probit and logit models are still the most common frameworks used in econometric applications. The question of which distribution to use is a natural one. The logistic distribution is similar to the normal except in the tails, which are considerably heavier. (It more closely resembles a t distribution with seven degrees of freedom.) Therefore, for intermediate values of x′β (say, between −1.2 and +1.2), the two distributions tend to give similar probabilities. The logistic distribution tends to give larger probabilities to Y = 1 when x′β is extremely small (and smaller probabilities to Y = 1 when x′β is very large) than the normal distribution. It is difficult to provide practical generalities on this basis, however, as they would require knowledge of β. We should expect different predictions from the two models, however, if the sample contains (1) very few responses (Y’s equal to 1) or very few nonresponses (Y’s equal to 0) and (2) very wide variation in an important independent variable, particularly if (1) is also true. There are practical reasons for favoring one or the other in some cases for mathematical convenience, but it is difficult to justify the choice of one distribution or another on theoretical grounds. Amemiya (1981) discusses a number of related issues, but as a general proposition, the question is unresolved. In most applications, the choice between these two seems not to make much difference. However, as seen in the example below, the symmetric and asymmetric distributions can give substantively different results, and here, the guidance on how to choose is unfortunately sparse. The probability model is a regression:

E[y | x] = 0 · [1 − F(x′β)] + 1 · [F(x′β)] = F(x′β). (23-8)

Whatever distribution is used, it is important to note that the parameters of the model, like those of any nonlinear regression model, are not necessarily the marginal effects we are accustomed to analyzing. In general,

∂E[y | x]/∂x = [dF(x′β)/d(x′β)] β = f(x′β)β, (23-9)

3See, for example, Maddala (1983, pp. 27–32), Aldrich and Nelson (1984), and Greene (2001).


where f(·) is the density function that corresponds to the cumulative distribution, F(·). For the normal distribution, this result is

∂E[y | x]/∂x = φ(x′β)β, (23-10)

where φ(t) is the standard normal density. For the logistic distribution,

dΛ(x′β)/d(x′β) = e^{x′β} / (1 + e^{x′β})² = Λ(x′β)[1 − Λ(x′β)]. (23-11)

Thus, in the logit model,

∂E[y | x]/∂x = Λ(x′β)[1 − Λ(x′β)]β. (23-12)

It is obvious that these values will vary with the values of x. In interpreting the estimated model, it will be useful to calculate this value at, say, the means of the regressors and, where necessary, other pertinent values. For convenience, it is worth noting that the same scale factor applies to all the slopes in the model. For computing marginal effects, one can evaluate the expressions at the sample means of the data or evaluate the marginal effects at every observation and use the sample average of the individual marginal effects. The functions are continuous with continuous first derivatives, so Theorem D.12 (the Slutsky theorem) and, assuming that the data are “well behaved,” a law of large numbers (Theorems D.4 and D.5) apply; in large samples these will give the same answer. But that is not so in small or moderate-sized samples. Current practice favors averaging the individual marginal effects when it is possible to do so. Another complication for computing marginal effects in a binary choice model arises because x will often include dummy variables—for example, a labor force participation equation will often contain a dummy variable for marital status. Because the derivative is with respect to a small change, it is not appropriate to apply (23-10) for the effect of a change in a dummy variable, or change of state. The appropriate marginal effect for a binary independent variable, say, d, would be

Marginal effect = Prob[Y = 1 | x̄(d), d = 1] − Prob[Y = 1 | x̄(d), d = 0],

where x̄(d) denotes the means of all the other variables in the model. Simply taking the derivative with respect to the binary variable as if it were continuous provides an approximation that is often surprisingly accurate. In Example 23.3, for the binary variable PSI, the difference in the two probabilities for the probit model is (0.5702 − 0.1057) = 0.4645, whereas the derivative approximation reported in Table 23.1 is 0.468. Nonetheless, it might be optimistic to rely on this outcome. We will revisit this computation in the examples and discussion to follow.
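The following sketch (our own illustration; the coefficient vector is taken as given, say from a maximum likelihood fit) computes average marginal effects for the probit and logit cases and the discrete change for a dummy variable, averaging over the sample rather than evaluating at the means:

```python
import numpy as np
from scipy.stats import norm

def average_marginal_effects(beta, X, model="probit"):
    """Sample average of f(x'b) * b; see (23-9)-(23-12)."""
    xb = X @ beta
    if model == "probit":
        scale = norm.pdf(xb)                        # phi(x'b)
    else:
        lam = 1.0 / (1.0 + np.exp(-xb))             # Lambda(x'b)
        scale = lam * (1.0 - lam)
    return (scale[:, None] * beta).mean(axis=0)

def dummy_effect(beta, X, j, model="probit"):
    """Discrete change Prob(Y=1 | d=1) - Prob(Y=1 | d=0) for column j."""
    cdf = norm.cdf if model == "probit" else (lambda z: 1.0 / (1.0 + np.exp(-z)))
    X1, X0 = X.copy(), X.copy()
    X1[:, j], X0[:, j] = 1.0, 0.0
    return np.mean(cdf(X1 @ beta) - cdf(X0 @ beta))
```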

23.3.2 LATENT REGRESSION—INDEX FUNCTION MODELS

Discrete dependent-variable models are often cast in the form of index function models. We view the outcome of a discrete choice as a reflection of an underlying regression. As an often-cited example, consider the decision to make a large purchase. The theory states that the consumer makes a marginal benefit/marginal cost calculation based on the utilities achieved by making the purchase and by not making the purchase and by


using the money for something else. We model the difference between benefit and cost as an unobserved variable y* such that

y* = x′β + ε.

We assume that ε has mean zero and has either a standardized logistic distribution with (known) variance π²/3 [see (23-7)] or a standard normal distribution with variance one [see (23-6)]. We do not observe the net benefit of the purchase, only whether it is made or not. Therefore, our observation is

y = 1 if y* > 0,
y = 0 if y* ≤ 0.

In this formulation, x′β is called the index function. Two aspects of this construction merit our attention. First, the assumption of known variance of ε is an innocent normalization. Suppose the variance of ε is scaled by an unrestricted parameter σ². The latent regression will be y* = x′β + σε. But (y*/σ) = x′(β/σ) + ε is the same model with the same data. The observed data will be unchanged; y is still 0 or 1, depending only on the sign of y*, not on its scale. This means that there is no information about σ in the data, so it cannot be estimated. Second, the assumption of zero for the threshold is likewise innocent if the model contains a constant term (and not if it does not).4 Let a be the supposed nonzero threshold and α be an unknown constant term and, for the present, let x and β contain the rest of the index not including the constant term. Then, the probability that y equals one is

Prob(y* > a | x) = Prob(α + x′β + ε > a | x) = Prob[(α − a) + x′β + ε > 0 | x].

Because α is unknown, the difference (α − a) remains an unknown parameter. With the two normalizations,

Prob(y* > 0 | x) = Prob(ε > −x′β | x).

If the distribution is symmetric, as are the normal and logistic, then

Prob(y* > 0 | x) = Prob(ε < x′β | x) = F(x′β),

which provides an underlying structural model for the probability.
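The normalization argument is easy to verify by simulation. In this sketch (our own construction, with arbitrary values), scaling both β and ε by the same σ leaves the observed zero/one data unchanged:

```python
import numpy as np

rng = np.random.default_rng(42)
n, sigma = 1000, 3.0
x = np.column_stack([np.ones(n), rng.standard_normal(n)])
beta = np.array([0.5, 1.0])
eps = rng.standard_normal(n)

y = (x @ beta + eps > 0).astype(int)                   # observed choice
y_scaled = (x @ (sigma * beta) + sigma * eps > 0).astype(int)
print(np.array_equal(y, y_scaled))                     # True: sigma not identified
```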

Example 23.2 Structural Equations for a Probit Model
Nakosteen and Zimmer (1980) analyze a model of migration based on the following structure:5 For individual i, the market wage that can be earned at the present location is

y*_p = x'_p β + ε_p.

Variables in the equation include age, sex, race, growth in employment, and growth in per capita income. If the individual migrates to a new location, then his or her market wage would be

y*_m = x'_m γ + ε_m.

4 Unless there is some compelling reason, binomial probability models should not be estimated without constant terms.
5 A number of other studies have also used variants of this basic formulation. Some important examples are Willis and Rosen (1979) and Robinson and Tomes (1982). The study by Tunali (1986) examined in Example 23.6 is another example. The now standard approach, in which "participation" equals one if wage offer (x'_w β_w + ε_w) minus reservation wage (x'_r β_r + ε_r) is positive, is also used in Fernandez and Rodriguez-Poo (1997). Brock and Durlauf (2000) describe a number of models and situations involving individual behavior that give rise to binary choice models.


Migration, however, entails costs that are related both to the individual and to the labor market:

C* = z'α + u.

Costs of moving are related to whether the individual is self-employed and whether that person recently changed his or her industry of employment. They migrate if the benefit y*_m − y*_p is greater than the cost C*. The net benefit of moving is

M* = y*_m − y*_p − C*
   = x'_m γ − x'_p β − z'α + (ε_m − ε_p − u)
   = w'δ + ε.

Because M* is unobservable, we cannot treat this equation as an ordinary regression. The individual either moves or does not. After the fact, we observe only y*_m if the individual has moved or y*_p if he or she has not. But we do observe that M = 1 for a move and M = 0 for no move. If the disturbances are normally distributed, then the probit model we analyzed earlier is produced. Logistic disturbances produce the logit model instead.

23.3.3 RANDOM UTILITY MODELS

An alternative interpretation of data on individual choices is provided by the random utility model. Suppose that in the Nakosteen–Zimmer framework, y_m and y_p represent the individual's utility of two choices, which we might denote U_a and U_b. For another example, U_a might be the utility of rental housing and U_b that of home ownership. The observed choice between the two reveals which one provides the greater utility, but not the unobservable utilities. Hence, the observed indicator equals 1 if U_a > U_b and 0 if U_a ≤ U_b. A common formulation is the linear random utility model,

U_a = x'β_a + ε_a  and  U_b = x'β_b + ε_b.  (23-13)

Then, if we denote by Y = 1 the consumer's choice of alternative a, we have

Prob[Y = 1 | x] = Prob[U_a > U_b]
                = Prob[x'β_a + ε_a − x'β_b − ε_b > 0 | x]  (23-14)
                = Prob[x'(β_a − β_b) + ε_a − ε_b > 0 | x]
                = Prob[x'β + ε > 0 | x],

once again.

23.4 ESTIMATION AND INFERENCE IN BINARY CHOICE MODELS

With the exception of the linear probability model, estimation of binary choice models is usually based on the method of maximum likelihood. Each observation is treated as a single draw from a Bernoulli distribution (binomial with one draw). The model with success probability F(x'β) and independent observations leads to the joint probability, or likelihood function,

Prob(Y₁ = y₁, Y₂ = y₂, ..., Y_n = y_n | X) = ∏_{y_i=0} [1 − F(x'_i β)] ∏_{y_i=1} F(x'_i β),  (23-15)


where X denotes [x_i], i = 1, ..., n. The likelihood function for a sample of n observations can be conveniently written as

L(β | data) = ∏_{i=1}^n [F(x'_i β)]^{y_i} [1 − F(x'_i β)]^{1−y_i}.  (23-16)

Taking logs, we obtain

ln L = Σ_{i=1}^n { y_i ln F(x'_i β) + (1 − y_i) ln[1 − F(x'_i β)] }.6  (23-17)

The likelihood equations are

∂ ln L/∂β = Σ_{i=1}^n [ y_i f_i/F_i + (1 − y_i)(−f_i)/(1 − F_i) ] x_i = 0,  (23-18)

where f_i is the density, dF_i/d(x'_i β). [In (23-18) and later, we will use the subscript i to indicate that the function has an argument x'_i β.] The choice of a particular form for F_i leads to the empirical model. Unless we are using the linear probability model, the likelihood equations in (23-18) will be nonlinear and require an iterative solution. All of the models we have seen thus far are relatively straightforward to analyze. For the logit model, by inserting (23-7) and (23-11) in (23-18), we get, after a bit of manipulation, the likelihood equations

∂ ln L/∂β = Σ_{i=1}^n (y_i − Λ_i) x_i = 0.  (23-19)

Note that if x_i contains a constant term, the first-order conditions imply that the average of the predicted probabilities must equal the proportion of ones in the sample.7 This implication also bears some similarity to the least squares normal equations if we view the term y_i − Λ_i as a residual.8 For the normal distribution, the log-likelihood is

ln L = Σ_{y_i=0} ln[1 − Φ(x'_i β)] + Σ_{y_i=1} ln Φ(x'_i β).  (23-20)

The first-order conditions for maximizing ln L are

∂ ln L/∂β = Σ_{y_i=0} [−φ_i/(1 − Φ_i)] x_i + Σ_{y_i=1} [φ_i/Φ_i] x_i = Σ_{y_i=0} λ_{0i} x_i + Σ_{y_i=1} λ_{1i} x_i.

Using the device suggested in footnote 6, we can reduce this to

∂ ln L/∂β = Σ_{i=1}^n [ q_i φ(q_i x'_i β)/Φ(q_i x'_i β) ] x_i = Σ_{i=1}^n λ_i x_i = 0,  (23-21)

where q_i = 2y_i − 1.
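As a sketch of how (23-17) and (23-21) might be coded using the q_i = 2y_i − 1 device (hypothetical array names X and y; not code from the text):

```python
# Sketch: probit log-likelihood and score using q_i = 2y_i - 1; see (23-21).
import numpy as np
from scipy.stats import norm

def probit_loglike_and_score(beta, X, y):
    q = 2 * y - 1
    z = q * (X @ beta)
    loglike = norm.logcdf(z).sum()            # ln L = sum_i ln Phi(q_i x_i'b)
    lam = q * norm.pdf(z) / norm.cdf(z)       # lambda_i of (23-21)
    score = X.T @ lam                         # sum_i lambda_i * x_i
    return loglike, score
```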

6 If the distribution is symmetric, as the normal and logistic are, then 1 − F(x'β) = F(−x'β). There is a further simplification. Let q = 2y − 1. Then ln L = Σ_i ln F(q_i x'_i β). See (23-21).
7 The same result holds for the linear probability model. Although regularly observed in practice, the result has not been verified for the probit model.
8 This sort of construction arises in many models. The first derivative of the log-likelihood with respect to the constant term produces the generalized residual in many settings. See, for example, Chesher, Lancaster, and Irish (1985) and the equivalent result for the tobit model in Section 24.3.4.d.


The actual second derivatives for the logit model are quite simple:

H = ∂² ln L/∂β∂β' = −Σ_i Λ_i(1 − Λ_i) x_i x'_i.  (23-22)

The second derivatives do not involve the random variable y_i, so Newton's method is also the method of scoring for the logit model. Note that the Hessian is always negative definite, so the log-likelihood is globally concave. Newton's method will usually converge to the maximum of the log-likelihood in just a few iterations unless the data are especially badly conditioned. The computation is slightly more involved for the probit model. A useful simplification is obtained by using the variable λ(y_i, x'_i β) = λ_i that is defined in (23-21). The second derivatives can be obtained using the result that for any z, dφ(z)/dz = −zφ(z). Then, for the probit model,

H = ∂² ln L/∂β∂β' = Σ_{i=1}^n −λ_i(λ_i + x'_i β) x_i x'_i.  (23-23)

This matrix is also negative definite for all values of β. The proof is less obvious than for the logit model.9 It suffices to note that the scalar part in the summation is Var[ε | ε ≤ β'x] − 1 when y = 1 and Var[ε | ε ≥ −β'x] − 1 when y = 0. The unconditional variance is one. Because truncation always reduces variance (see Theorem 24.2), in both cases, the variance is between zero and one, so the value is negative.10
The asymptotic covariance matrix for the maximum likelihood estimator can be estimated by using the inverse of the Hessian evaluated at the maximum likelihood estimates. There are also two other estimators available. The Berndt, Hall, Hall, and Hausman estimator [see (16-18) and Example 16.4] would be

B = Σ_{i=1}^n g_i² x_i x'_i,

where g_i = (y_i − Λ_i) for the logit model [see (23-19)] and g_i = λ_i for the probit model [see (23-21)]. The third estimator would be based on the expected value of the Hessian. As we saw earlier, the Hessian for the logit model does not involve y_i, so H = E[H]. But because λ_i is a function of y_i [see (23-21)], this result is not true for the probit model. Amemiya (1981) showed that for the probit model,

E[∂² ln L/∂β∂β']_probit = Σ_{i=1}^n λ_{0i} λ_{1i} x_i x'_i.  (23-24)

Once again, the scalar part of the expression is always negative [see (23-21) and note that λ_{0i} is always negative and λ_{1i} is always positive]. The estimator of the asymptotic covariance matrix for the maximum likelihood estimator is then the negative inverse of whatever matrix is used to estimate the expected Hessian. Since the actual Hessian is generally used for the iterations, this option is the usual choice. As we shall see later, though, for certain hypothesis tests, the BHHH estimator is a more convenient choice.
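The following sketch (hypothetical data arrays; not the text's code) ties these pieces together for the logit model: a Newton iteration with the analytic Hessian of (23-22), followed by the Hessian-based and BHHH covariance estimators.

```python
# Sketch: Newton's method for the logit MLE with the Hessian of (23-22),
# returning both covariance matrix estimators discussed above.
import numpy as np

def logit_newton(X, y, tol=1e-10, max_iter=50):
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))      # Lambda(x_i'beta)
        g = X.T @ (y - p)                           # score, (23-19)
        H = -(X * (p * (1.0 - p))[:, None]).T @ X   # Hessian, (23-22)
        step = np.linalg.solve(H, g)
        beta = beta - step                          # Newton update
        if np.abs(step).max() < tol:
            break
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    H = -(X * (p * (1.0 - p))[:, None]).T @ X
    cov_hessian = np.linalg.inv(-H)                 # negative inverse of Hessian
    gi = (y - p)[:, None] * X                       # rows are g_i * x_i'
    cov_bhhh = np.linalg.inv(gi.T @ gi)             # BHHH estimator
    return beta, cov_hessian, cov_bhhh
```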

9 See, for example, Amemiya (1985, pp. 273–274) and Maddala (1983, p. 63).
10 See Johnson and Kotz (1993) and Heckman (1979). We will make repeated use of this result in Chapter 24.


23.4.1 ROBUST COVARIANCE MATRIX ESTIMATION

The probit maximum likelihood estimator is often labeled a quasi-maximum likelihood estimator (QMLE) in view of the possibility that the normal probability model might be misspecified. White's (1982a) robust "sandwich" estimator for the asymptotic covariance matrix of the QMLE (see Section 16.8 for discussion),

Est. Asy. Var[β̂] = [Ĥ]⁻¹ B̂ [Ĥ]⁻¹,

has been used in a number of recent studies based on the probit model [e.g., Fernandez and Rodriguez-Poo (1997), Horowitz (1993), and Blundell, Laisney, and Lechner (1993)]. If the probit model is correctly specified, then plim(1/n)B̂ = plim(1/n)(−Ĥ) and either matrix alone will suffice, so the robustness issue is moot (of course). On the other hand, the probit (quasi-) maximum likelihood estimator is not consistent in the presence of any form of heteroscedasticity, unmeasured heterogeneity, omitted variables (even if they are orthogonal to the included ones), nonlinearity of the functional form of the index, or an error in the distributional assumption [with some narrow exceptions as described by Ruud (1986)]. Thus, in almost any case, the sandwich estimator provides an appropriate asymptotic covariance matrix for an estimator that is biased in an unknown direction. [See Section 16.8 and Freedman (2006).] White raises this issue explicitly, although it seems to receive little attention in the literature: "It is the consistency of the QMLE for the parameters of interest in a wide range of situations which insures its usefulness as the basis for robust estimation techniques" (1982a, p. 4). His very useful result is that if the quasi-maximum likelihood estimator converges to a probability limit, then the sandwich estimator can, under certain circumstances, be used to estimate the asymptotic covariance matrix of that estimator. But there is no guarantee that the QMLE will converge to anything interesting or useful. Simply computing a robust covariance matrix for an otherwise inconsistent estimator does not give it redemption. Consequently, the virtue of a robust covariance matrix in this setting is unclear.
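For completeness, a minimal sketch of the sandwich computation itself, assuming the Hessian H_hat and the n × K matrix of score contributions gi have already been evaluated at the MLE (names are illustrative, not from the text):

```python
# Sketch: White's robust "sandwich" covariance, [H]^-1 B [H]^-1.
import numpy as np

def sandwich_covariance(H_hat, gi):
    B_hat = gi.T @ gi                  # outer product of gradients
    H_inv = np.linalg.inv(H_hat)
    return H_inv @ B_hat @ H_inv
```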

23.4.2 MARGINAL EFFECTS AND AVERAGE PARTIAL EFFECTS

The predicted probabilities, F(x'β̂) = F̂, and the estimated marginal effects, f(x'β̂) × β̂ = f̂ β̂, are nonlinear functions of the parameter estimates. To compute standard errors, we can use the linear approximation approach (delta method) discussed in Section 4.9.4. For the predicted probabilities,

Asy. Var[F̂] = [∂F̂/∂β̂]' V [∂F̂/∂β̂],

where V = Asy. Var[β̂]. The estimated asymptotic covariance matrix of β̂ can be any of the three described earlier. Let z = x'β̂. Then the derivative vector is

[∂F̂/∂β̂] = [dF̂/dz][∂z/∂β̂] = f̂ x.

Combining terms gives

Asy. Var[F̂] = f̂² x'Vx,


which depends, of course, on the particular x vector used. This result is useful when a marginal effect is computed for a dummy variable. In that case, the estimated effect is

ΔF̂ = F̂ | (d = 1) − F̂ | (d = 0).  (23-25)

The asymptotic variance would be

Asy. Var[ΔF̂] = [∂ΔF̂/∂β̂]' V [∂ΔF̂/∂β̂],  (23-26)

where

[∂ΔF̂/∂β̂] = f̂₁ [x̄(d), 1] − f̂₀ [x̄(d), 0],

with f̂₁ and f̂₀ the densities evaluated at the two regressor vectors [x̄(d), 1] and [x̄(d), 0]. For the other marginal effects, let γ̂ = f̂ β̂. Then

Asy. Var[γ̂] = [∂γ̂/∂β̂'] V [∂γ̂/∂β̂']'.

The matrix of derivatives is

∂γ̂/∂β̂' = f̂ (∂β̂/∂β̂') + β̂ (df̂/dz)(∂z/∂β̂') = f̂ I + (df̂/dz) β̂ x'.

For the probit model, df̂/dz = −zφ̂, so

Asy. Var[γ̂] = φ̂² [I − (x'β̂)β̂x'] V [I − (x'β̂)xβ̂'].

For the logit model, f̂ = Λ̂(1 − Λ̂), so

df̂/dz = (1 − 2Λ̂) dΛ̂/dz = (1 − 2Λ̂)Λ̂(1 − Λ̂).

Collecting terms, we obtain

Asy. Var[γ̂] = [Λ̂(1 − Λ̂)]² [I + (1 − 2Λ̂)β̂x'] V [I + (1 − 2Λ̂)xβ̂'].

As before, the value obtained will depend on the x vector used.
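A sketch of the probit case of this delta method calculation (hypothetical inputs beta, V, and evaluation point x; not the text's code):

```python
# Sketch: delta-method covariance for probit marginal effects at a point x.
import numpy as np
from scipy.stats import norm

def probit_me_delta(beta, V, x):
    z = x @ beta
    gamma = norm.pdf(z) * beta                            # effects, phi(z)*beta
    # d(phi(z)*beta)/dbeta' = phi(z) * [I - z * beta x']
    J = norm.pdf(z) * (np.eye(len(beta)) - z * np.outer(beta, x))
    return gamma, J @ V @ J.T
```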

Example 23.3 Probability Models
The data listed in Appendix Table F16.1 were taken from a study by Spector and Mazzeo (1980), which examined whether a new method of teaching economics, the Personalized System of Instruction (PSI), significantly influenced performance in later economics courses. The "dependent variable" used in our application is GRADE, which indicates whether a student's grade in an intermediate macroeconomics course was higher than that in the principles course. The other variables are GPA, their grade point average; TUCE, the score on a pretest that indicates entering knowledge of the material; and PSI, the binary variable indicator of whether the student was exposed to the new teaching method. (Spector and Mazzeo's specific equation was somewhat different from the one estimated here.)
Table 23.1 presents four sets of parameter estimates. The slope parameters and derivatives were computed for four probability models: linear, probit, logit, and Gumbel. The last three sets of estimates are computed by maximizing the appropriate log-likelihood function. Estimation is discussed in the next section, so standard errors are not presented here. The scale factor given in the last row is the density function evaluated at the means of the variables. Also, note that the slope given for PSI is the derivative, not the change in the function with PSI changed from zero to one with other variables held constant.


TABLE 23.1 Estimated Probability Models

             Linear              Logistic            Probit              Gumbel
Variable   Coefficient  Slope  Coefficient  Slope  Coefficient  Slope  Coefficient  Slope
Constant     −1.498      —      −13.021      —      −7.452       —      −10.631      —
GPA           0.464     0.464     2.826     0.534     1.626     0.533     2.293     0.477
TUCE          0.010     0.010     0.095     0.018     0.052     0.017     0.041     0.009
PSI           0.379     0.379     2.379     0.450     1.426     0.468     1.562     0.325
f(x̄'β̂)       1.000               0.189               0.328               0.208

If one looked only at the coefficient estimates, then it would be natural to conclude that the four models had produced radically different estimates. But a comparison of the columns of slopes shows that this conclusion is clearly wrong. The models are very similar; in fact, the logit and probit results are nearly identical.
The data used in this example are only moderately unbalanced between 0s and 1s for the dependent variable (21 and 11). As such, we might expect similar results for the probit and logit models.11 One indicator is a comparison of the coefficients. In view of the different variances of the distributions, one for the normal and π²/3 for the logistic, we might expect to obtain comparable estimates by multiplying the probit coefficients by π/√3 ≈ 1.8. Amemiya (1981) found, through trial and error, that scaling by 1.6 instead produced better results. This proportionality result is frequently cited. The result in (23-9) may help to explain the finding. The index x'β is not the random variable. (See Section 23.3.2.) The marginal effect in the probit model for, say, x_k is φ(x'β_p)β_pk, whereas that for the logit is Λ(1 − Λ)β_lk. (The subscripts p and l are for probit and logit.) Amemiya suggests that his approximation works best at the center of the distribution, where F = 0.5, or x'β = 0 for either distribution. Suppose it is. Then φ(0) = 0.3989 and Λ(0)[1 − Λ(0)] = 0.25. If the marginal effects are to be the same, then 0.3989 β_pk = 0.25 β_lk, or β_lk = 1.6 β_pk, which is the regularity observed by Amemiya. Note, though, that as we depart from the center of the distribution, the relationship will move away from 1.6. Because the logistic density descends more slowly than the normal, for unbalanced samples such as ours, the ratio of the logit coefficients to the probit coefficients will tend to be larger than 1.6. The ratios for the ones in Table 23.1 are closer to 1.7 than 1.6.
The computation of the derivatives of the conditional mean function is useful when the variable in question is continuous and often produces a reasonable approximation for a dummy variable. Another way to analyze the effect of a dummy variable on the whole distribution is to compute Prob(Y = 1) over the range of x'β (using the sample estimates) and with the two values of the binary variable. Using the coefficients from the probit model in Table 23.1, we have the following probabilities as a function of GPA, at the mean of TUCE:

PSI = 0: Prob(GRADE = 1) = Φ[−7.452 + 1.626 GPA + 0.052(21.938)],

PSI = 1: Prob(GRADE = 1) = Φ[−7.452 + 1.626 GPA + 0.052(21.938) + 1.426].

Figure 23.2 shows these two functions plotted over the range of GPA observed in the sample, 2.0 to 4.0. The marginal effect of PSI is the difference between the two functions, which ranges from only about 0.06 at GPA = 2 to about 0.50 at GPA of 3.5. This effect shows that the probability that a student's grade will increase after exposure to PSI is far greater for students with high GPAs than for those with low GPAs. At the sample mean of GPA of 3.117, the effect of PSI on the probability is 0.465. The simple derivative calculation of (23-9) is given in Table 23.1; the estimate is 0.468. But, of course, this calculation does not show the wide range of differences displayed in Figure 23.2.

11 One might be tempted in this case to suggest an asymmetric distribution for the model, such as the Gumbel distribution. However, the asymmetry in the model, to the extent that it is present at all, refers to the values of ε, not to the observed sample of values of the dependent variable.


FIGURE 23.2 Effect of PSI on Predicted Probabilities. [Plot of Prob(GRADE = 1) against GPA from 2.0 to 4.0, with and without PSI; at the mean GPA of 3.117 the two probabilities are 0.571 and 0.106.]

Table 23.2 presents the estimated coefficients and marginal effects for the probit and logit models in Table 23.1. In both cases, the asymptotic covariance matrix is computed from the negative inverse of the actual Hessian of the log-likelihood. The standard errors for the estimated marginal effect of PSI are computed using (23-25) and (23-26) since PSI is a binary variable. In comparison, the simple derivatives produce estimates and standard errors of (0.449, 0.181) for the logit model and (0.464, 0.188) for the probit model. These differ only slightly from the results given in the table.

The preceding has emphasized computing the partial effects for the average individual in the sample. Current practice has many applications based, instead, on "average partial effects." [See, e.g., Wooldridge (2002a).] The underlying logic is that the quantity

TABLE 23.2 Estimated Coefficients and Standard Errors (standard errors in parentheses)

                        Logistic                              Probit
Variable    Coefficient  t Ratio   Slope    t Ratio  Coefficient  t Ratio   Slope    t Ratio
Constant    −13.021      −2.641     —         —       −7.452      −2.931     —         —
             (4.931)                                   (2.542)
GPA           2.826       2.238    0.534     2.252     1.626       2.343    0.533     2.294
             (1.263)              (0.237)             (0.694)              (0.232)
TUCE          0.095       0.672    0.018     0.685     0.052       0.617    0.017     0.626
             (0.142)              (0.026)             (0.084)              (0.027)
PSI           2.379       2.234    0.456     2.521     1.426       2.397    0.464     2.727
             (1.065)              (0.181)             (0.595)              (0.170)
log-likelihood         −12.890                                  −12.819


of interest is

APE = E_x [ ∂E[y | x]/∂x ].

In practical terms, this suggests the computation

APE = γ̂ = (1/n) Σ_{i=1}^n f(x'_i β̂) β̂.

This does raise two questions. Because the computation is (marginally) more burdensome than the simple marginal effects at the means, one might wonder whether this produces a noticeably different answer. That will depend on the data. Save for small sample variation, the difference in these two results is likely to be small. Let

γ_k = APE_k = (1/n) Σ_{i=1}^n ∂Prob(y_i = 1 | x_i)/∂x_ik = (1/n) Σ_{i=1}^n F'(x'_i β) β_k = (1/n) Σ_{i=1}^n γ_k(x_i)

denote the computation of the average partial effect. We compute this at the MLE, β̂. Now, expand this function in a second-order Taylor series around the point of sample means, x̄, to obtain

γ_k = (1/n) Σ_{i=1}^n [ γ_k(x̄) + Σ_{m=1}^K (∂γ_k(x̄)/∂x̄_m)(x_im − x̄_m)
      + (1/2) Σ_{l=1}^K Σ_{m=1}^K (∂²γ_k(x̄)/∂x̄_l ∂x̄_m)(x_il − x̄_l)(x_im − x̄_m) ] + Δ,

where Δ is the remaining higher-order terms. The first of the three terms is the marginal effect computed at the sample means. The second term is zero by construction. That leaves the remainder plus an average of a term that is a function of the variances and covariances of the data and the curvature of the probability function at the means. Little can be said to characterize these two terms in any particular sample, but one might guess they are likely to be small. We will examine an application in Example 23.4.
Computing the individual effects, then using the natural estimator to estimate the variance of the mean,

Est. Var[γ̂_k] = (1/n) { [1/(n − 1)] Σ_{i=1}^n [γ̂_k(x_i) − γ̂_k]² },

may badly estimate the asymptotic variance of the average partial effect. [See, e.g., Contoyannis et al. (2004, p. 498).] The reason is that the observations in the APE are highly correlated (they all use the same estimate of β), but this variance computation treats them as a random sample. The following example shows the difference, which is substantial. To use the delta method to estimate asymptotic standard errors for the average partial effects, we would use

Est. Asy. Var[γ̂] = (1/n²) Est. Asy. Var[ Σ_{i=1}^n γ̂_i ]
                 = (1/n²) Σ_{i=1}^n Σ_{j=1}^n Est. Asy. Cov[γ̂_i, γ̂_j]
                 = (1/n²) Σ_{i=1}^n Σ_{j=1}^n G_i(β̂) V̂ G'_j(β̂),


where G_i(β̂) = ∂[F(x'_i β̂)β̂]/∂β̂' and V̂ is the estimated asymptotic covariance matrix for β̂. (The terms with equal subscripts are the same computation we did earlier with the sample means.) This looks like a formidable amount of computation; Example 23.4 uses a sample of 27,326 observations, so at first blush, it appears we need a double sum of roughly 750 million terms. The computation is actually linear in n, not quadratic, however, because the same matrix is used in the center of each product. Moving the derivative matrices outside the inner summation and using the 1/n twice, we find that the estimator of the asymptotic covariance matrix for the APE is simply

Est. Asy. Var[γ̂] = Ḡ(β̂) V̂ Ḡ'(β̂),

where Ḡ(β̂) = (1/n) Σ_{i=1}^n G_i(β̂). The appropriate covariance matrix is computed by making the same adjustment as in the partial effects: the derivatives are averaged over the observations rather than being computed at the means of the data.
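A sketch of the whole APE calculation for a logit model, using the averaged gradient Ḡ just described (hypothetical arrays; not the text's code):

```python
# Sketch: logit average partial effects with the delta-method variance
# based on the averaged gradient G-bar.
import numpy as np

def logit_ape(beta, V, X):
    n, K = X.shape
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    f = p * (1.0 - p)                         # logistic density at each x_i
    ape = f.mean() * beta                     # (1/n) sum_i f_i * beta
    G_bar = np.zeros((K, K))
    for i in range(n):                        # G_i = f_i [I + (1-2p_i) beta x_i']
        G_bar += f[i] * (np.eye(K) + (1.0 - 2.0 * p[i]) * np.outer(beta, X[i]))
    G_bar /= n
    return ape, G_bar @ V @ G_bar.T           # G-bar V G-bar'
```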

Example 23.4 Average Partial Effects
We estimated a binary logit model for y = 1(DocVis > 0) using the German health care utilization data examined in Example 11.10 (and in Examples 11.11, 16.10, 16.12, 16.13, 16.16, and 18.6). The model is

Prob(DocVis_it > 0) = Λ(β₁ + β₂ Age_it + β₃ Income_it + β₄ Kids_it + β₅ Education_it + β₆ Married_it).

No account of the panel nature of the data set was taken for this exercise. The sample contains 27,326 observations, which should be large enough to reveal the large sample behavior of the computations. Table 23.3 presents the parameter estimates for the logit probability model and both the marginal effects and the average partial effects, each with standard errors computed using the results given earlier. The results do suggest the similarity of the computations. The values in parentheses are based on the naive estimator that ignores the covariances.

23.4.3 HYPOTHESIS TESTS

For testing hypotheses about the coefficients, the full menu of procedures is available. The simplest method for a single restriction would be based on the usual t tests, using the standard errors from the information matrix. Using the normal distribution of the estimator, we would use the standard normal table rather than the t table for critical points. For more involved restrictions, it is possible to use the Wald test. For a set of

TABLE 23.3 Estimated Parameters and Partial Effects

            Parameter Estimates       Marginal Effects           Average Partial Effects
Variable    Estimate     Std. Error   Estimate     Std. Error    Estimate     Std. Error
Constant     0.25111     0.091135
Age          0.020709    0.0012852    0.0048133    0.00029819    0.0047109    0.00028727
                                                                             (0.00042971)
Income      −0.18592     0.075064    −0.043213     0.017446     −0.042294     0.017069
                                                                             (0.0038579)
Kids        −0.22947     0.029537    −0.053335     0.0068626    −0.052201     0.0066921
                                                                             (0.0047615)
Education   −0.045588    0.0056465   −0.010596     0.0013122    −0.010370     0.0012787
                                                                             (0.00094595)
Married      0.085293    0.033286     0.019824     0.0077362     0.019403     0.0075686
                                                                             (0.0017698)


restrictions Rβ = q, the statistic is

W = (Rβ̂ − q)' {R(Est. Asy. Var[β̂])R'}⁻¹ (Rβ̂ − q).

For example, for testing the hypothesis that a subset of the coefficients, say, the last M, are zero, the Wald statistic uses R = [0 | I_M] and q = 0. Collecting terms, we find that the test statistic for this hypothesis is

W = β̂'_M V_M⁻¹ β̂_M,  (23-27)

where the subscript M indicates the subvector or submatrix corresponding to the M variables and V is the estimated asymptotic covariance matrix of β̂.
Likelihood ratio and Lagrange multiplier statistics can also be computed. The likelihood ratio statistic is

LR = −2[ln L̂_R − ln L̂_U],

where ln L̂_R and ln L̂_U are the log-likelihoods evaluated at the restricted and unrestricted estimates, respectively. A common test, which is similar to the F test that all the slopes in a regression are zero, is the likelihood ratio test that all the slope coefficients in the probit or logit model are zero. For this test, the constant term remains unrestricted. In this case, the restricted log-likelihood is the same for both probit and logit models,

ln L₀ = n[P ln P + (1 − P) ln(1 − P)],  (23-28)

where P is the proportion of the observations that have dependent variable equal to 1.
It might be tempting to use the likelihood ratio test to choose between the probit and logit models. But there is no restriction involved, and the test is not valid for this purpose. To underscore the point, there is nothing in its construction to prevent the chi-squared statistic for this "test" from being negative.
The Lagrange multiplier test statistic is LM = g'Vg, where g is the first derivatives of the unrestricted model evaluated at the restricted parameter vector and V is any of the three estimators of the asymptotic covariance matrix of the maximum likelihood estimator, once again computed using the restricted estimates. Davidson and MacKinnon (1984) find evidence that E[H] is the best of the three estimators to use, which gives

LM = [ Σ_{i=1}^n g_i x_i ]' [ Σ_{i=1}^n E[−h_i] x_i x'_i ]⁻¹ [ Σ_{i=1}^n g_i x_i ],  (23-29)

where E[−h_i] is defined in (23-22) for the logit model and in (23-24) for the probit model.
For the logit model, when the hypothesis is that all the slopes are zero,

LM = nR²,

where R² is the uncentered coefficient of determination in the regression of (y_i − ȳ) on x_i and ȳ is the proportion of 1s in the sample. An alternative formulation based on the BHHH estimator, which we developed in Section 16.6.3, is also convenient. For any of the models (probit, logit, Gumbel, etc.), the first derivative vector can be written as

∂ ln L/∂β = Σ_{i=1}^n g_i x_i = X'G i,


where G(n × n) = diag[g₁, g₂, ..., g_n] and i is an n × 1 column of 1s. The BHHH estimator of the Hessian is (X'G'GX), so the LM statistic based on this estimator is

LM = n [ (1/n) i'(GX)(X'G'GX)⁻¹(X'G')i ] = nR_i²,  (23-30)

where R_i² is the uncentered coefficient of determination in a regression of a column of ones on the first derivatives of the logs of the individual probabilities.
All the statistics listed here are asymptotically equivalent and under the null hypothesis of the restricted model have limiting chi-squared distributions with degrees of freedom equal to the number of restrictions being tested. We consider some examples in the next section.
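A sketch of the BHHH form (23-30) of the LM statistic, computed as n times the uncentered R² from regressing a column of ones on the rows g_i x'_i (hypothetical input; not the text's code):

```python
# Sketch: LM = n * uncentered R^2 from regressing ones on score contributions.
import numpy as np

def lm_statistic(gi_xi):
    # gi_xi: n x K matrix with i-th row g_i * x_i', at the restricted MLE
    n = gi_xi.shape[0]
    ones = np.ones(n)
    b = np.linalg.lstsq(gi_xi, ones, rcond=None)[0]
    fitted = gi_xi @ b
    return fitted @ fitted      # = n * (fitted'fitted / n) = n * R_i^2
```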

23.4.4 SPECIFICATION TESTS FOR BINARY CHOICE MODELS

In the linear regression model, we considered two important specification problems: the effect of omitted variables and the effect of heteroscedasticity. In the classical model, y = X₁β₁ + X₂β₂ + ε, when least squares estimates b₁ are computed omitting X₂,

E[b₁] = β₁ + [X'₁X₁]⁻¹X'₁X₂β₂.

Unless X₁ and X₂ are orthogonal or β₂ = 0, b₁ is biased. If we ignore heteroscedasticity, then although the least squares estimator is still unbiased and consistent, it is inefficient and the usual estimate of its sampling covariance matrix is inappropriate. Yatchew and Griliches (1984) have examined these same issues in the setting of the probit and logit models. Their general results are far more pessimistic. In the context of a binary choice model, they find the following:
1. If x₂ is omitted from a model containing x₁ and x₂ (i.e., β₂ ≠ 0), then

plim β̂₁ = c₁β₁ + c₂β₂,

where c₁ and c₂ are complicated functions of the unknown parameters. The implication is that even if the omitted variable is uncorrelated with the included one, the coefficient on the included variable will be inconsistent.
2. If the disturbances in the underlying regression are heteroscedastic, then the maximum likelihood estimators are inconsistent and the covariance matrix is inappropriate.
The second result is particularly troubling because the probit model is most often used with microeconomic data, which are frequently heteroscedastic.
Any of the three methods of hypothesis testing discussed here can be used to analyze these specification problems. The Lagrange multiplier test has the advantage that it can be carried out using the estimates from the restricted model, which sometimes brings a large saving in computational effort. This situation is especially true for the test for heteroscedasticity.12 To reiterate, the Lagrange multiplier statistic is computed as follows. Let the null hypothesis, H₀, be a specification of the model, and let H₁ be the alternative. For example,

12 The results in this section are based on Davidson and MacKinnon (1984) and Engle (1984). A symposium on the subject of specification tests in discrete choice models is Blundell (1987).


H₀ might specify that only variables x₁ appear in the model, whereas H₁ might specify that x₂ appears in the model as well. The statistic is

LM = g'₀ V₀⁻¹ g₀,

where g₀ is the vector of derivatives of the log-likelihood as specified by H₁ but evaluated at the maximum likelihood estimator of the parameters assuming that H₀ is true, and V₀ is any of the three consistent estimators of the asymptotic variance matrix of the maximum likelihood estimator under H₁, also computed using the maximum likelihood estimators based on H₀. The statistic is asymptotically distributed as chi-squared with degrees of freedom equal to the number of restrictions.

23.4.4.a Omitted Variables

The hypothesis to be tested is

H₀: y* = x'₁β₁ + ε,
H₁: y* = x'₁β₁ + x'₂β₂ + ε,  (23-31)

so the test is of the null hypothesis that β₂ = 0. The Lagrange multiplier test would be carried out as follows:

1. Estimate the model in H₀ by maximum likelihood. The restricted coefficient vector is [β̂₁, 0].
2. Let x be the compound vector, [x₁, x₂]. The statistic is then computed according to (23-29) or (23-30), as shown in the sketch below.
It is noteworthy that in this case as in many others, the Lagrange multiplier is the coefficient of determination in a regression. The likelihood ratio test is equally straightforward. Using the estimates of the two models, the statistic is simply 2(ln L₁ − ln L₀).
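A sketch of the two steps for a logit model (it assumes the statsmodels package for the restricted fit; all names are illustrative, and this is not the text's code):

```python
# Sketch: omitted-variables LM test for a logit model via n * R^2, (23-30).
import numpy as np
import statsmodels.api as sm

def lm_omitted_variables(y, X1, X2):
    beta1 = sm.Logit(y, X1).fit(disp=0).params        # step 1: restricted MLE
    X = np.column_stack([X1, X2])                     # step 2: compound vector
    p = 1.0 / (1.0 + np.exp(-(X1 @ beta1)))           # restricted probabilities
    scores = (y - p)[:, None] * X                     # g_i x_i' at [beta1_hat, 0]
    ones = np.ones(len(y))
    fitted = scores @ np.linalg.lstsq(scores, ones, rcond=None)[0]
    return fitted @ fitted                            # LM = n * uncentered R^2
```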

23.4.4.b Heteroscedasticity

We use the general formulation analyzed by Harvey (1976) (see Section 16.9.2.a),13

Var[ε] = [exp(z'γ)]².

This model can be applied equally to the probit and logit models. We will derive the results specifically for the probit model; the logit model is essentially the same. Thus,

y* = x'β + ε,
Var[ε | x, z] = [exp(z'γ)]².  (23-32)

The presence of heteroscedasticity makes some care necessary in interpreting the coefficients for a variable w_k that could be in x or z or both,

∂Prob(Y = 1 | x, z)/∂w_k = φ[ x'β/exp(z'γ) ] × [ β_k − (x'β)γ_k ] / exp(z'γ).

Only the first (second) term applies if w_k appears only in x (z). This implies that the simple coefficient may differ radically from the effect that is of interest in the estimated model. This effect is clearly visible in the next example.

13 See Knapp and Seaks (1992) for an application. Other formulations are suggested by Fisher and Nagin (1981), Hausman and Wise (1978), and Horowitz (1993).


The log-likelihood is

ln L = Σ_{i=1}^n { y_i ln F[ x'_i β/exp(z'_i γ) ] + (1 − y_i) ln { 1 − F[ x'_i β/exp(z'_i γ) ] } }.  (23-33)

To be able to estimate all the parameters, z cannot have a constant term. The derivatives are

∂ ln L/∂β = Σ_{i=1}^n [ f_i(y_i − F_i)/(F_i(1 − F_i)) ] exp(−z'_i γ) x_i,
∂ ln L/∂γ = Σ_{i=1}^n [ f_i(y_i − F_i)/(F_i(1 − F_i)) ] exp(−z'_i γ) z_i(−x'_i β),  (23-34)

which implies a difficult log-likelihood to maximize. But if the model is estimated assuming that γ = 0, then we can easily test for homoscedasticity. Let

w_i = [ x_i, (−x'_i β̂) z_i ],  (23-35)

computed at the maximum likelihood estimator, assuming that γ = 0. Then (23-29) or (23-30) can be used as usual for the Lagrange multiplier statistic.
Davidson and MacKinnon carried out a Monte Carlo study to examine the true sizes and power functions of these tests. As might be expected, the test for omitted variables is relatively powerful. The test for heteroscedasticity may well pick up some other form of misspecification, however, including perhaps the simple omission of z from the index function, so its power may be problematic. It is perhaps not surprising that the same problem arose earlier in our test for heteroscedasticity in the linear regression model.
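A sketch of the homoscedasticity LM test for the probit model, forming w_i of (23-35) at the ordinary probit MLE and applying the nR² device of (23-30) (hypothetical inputs; not the text's code):

```python
# Sketch: probit LM test for heteroscedasticity using (23-35) and (23-30).
import numpy as np
from scipy.stats import norm

def lm_heteroscedasticity(y, X, Z, beta_hat):
    xb = X @ beta_hat                             # beta_hat from plain probit
    F, f = norm.cdf(xb), norm.pdf(xb)
    g = f * (y - F) / (F * (1.0 - F))             # generalized residuals
    W = np.column_stack([X, -xb[:, None] * Z])    # w_i of (23-35)
    scores = g[:, None] * W
    ones = np.ones(len(y))
    fitted = scores @ np.linalg.lstsq(scores, ones, rcond=None)[0]
    return fitted @ fitted                        # LM statistic
```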

Example 23.5 Specification Tests in a Labor Force Participation Model
Using the data described in Example 23.1, we fit a probit model for labor force participation based on the specification

Prob[LFP = 1] = F(constant, age, age², family income, education, kids).

For these data, P = 428/753 = 0.568393. The restricted (all slopes equal zero, free constant term) log-likelihood is 325 × ln(325/753) + 428 × ln(428/753) = −514.8732. The unrestricted log-likelihood for the probit model is −490.8478. The chi-squared statistic is, therefore, 48.05072. The critical value from the chi-squared distribution with 5 degrees of freedom is 11.07, so the joint hypothesis that the coefficients on age, age², family income, education, and kids are all zero is rejected.
Consider the alternative hypothesis, that the constant term and the coefficients on age, age², family income, and education are the same whether kids equals one or zero, against the alternative that an altogether different equation applies for the two groups of women, those with kids = 1 and those with kids = 0. To test this hypothesis, we would use a counterpart to the Chow test of Section 6.4 and Example 6.7. The restricted model in this instance would be based on the pooled data set of all 753 observations. The log-likelihood for the pooled model, which has a constant term, age, age², family income, and education, is −496.8663. The log-likelihoods for this model based on the 524 observations with kids = 1 and the 229 observations with kids = 0 are −347.87441 and −141.60501, respectively. The log-likelihood for the unrestricted model with separate coefficient vectors is thus the sum, −489.47942. The chi-squared statistic for testing the five restrictions of the pooled model is twice the difference, LR = 2[−489.47942 − (−496.8663)] = 14.7738. The 95 percent critical value from the chi-squared distribution with 5 degrees of freedom is 11.07, so at this significance level, the hypothesis that the constant terms and the coefficients on age, age², family income, and education are the same is rejected. (The 99 percent critical value is 15.09.)


TABLE 23.4 Estimated Coefficients

                     Homoscedastic Probit                  Heteroscedastic Probit
               Estimate (Std. Err.)   Marg. Effect*   Estimate (Std. Err.)   Marg. Effect*
Constant  β₁   −4.157 (1.402)            —            −6.030 (2.498)            —
Age       β₂    0.185 (0.0660)        −0.0079 (0.0027)  0.264 (0.118)        −0.0088 (0.00251)
Age²      β₃   −0.0024 (0.00077)         —            −0.0036 (0.0014)          —
Income    β₄    0.0458 (0.0421)        0.0180 (0.0165)  0.424 (0.222)         0.0552 (0.0240)
Education β₅    0.0982 (0.0230)        0.0385 (0.0090)  0.140 (0.0519)        0.0289 (0.00869)
Kids      β₆   −0.449 (0.131)         −0.171 (0.0480)  −0.879 (0.303)        −0.167 (0.0779)
Kids      γ₁    0.000                    —            −0.141 (0.324)            —
Income    γ₂    0.000                    —             0.313 (0.123)            —
ln L           −490.8478                              −487.6356
Correct Preds. 0s: 106, 1s: 357                        0s: 115, 1s: 358

*Marginal effect and estimated standard error include both mean (β) and variance (γ) effects.

Table 23.4 presents estimates of the probit model with a correction for heteroscedasticity of the form

Var[ε_i] = [exp(γ₁ kids_i + γ₂ family income_i)]².

The three tests for homoscedasticity give

LR = 2[−487.6356 − (−490.8478)] = 6.424,
LM = 2.236 based on the BHHH estimator,
Wald = 6.533 (2 restrictions).

The 95 percent critical value for two restrictions is 5.99, so the LM statistic conflicts with the other two.

23.4.5 MEASURING GOODNESS OF FIT

There have been many fit measures suggested for QR models.14 At a minimum, one should report the maximized value of the log-likelihood function, ln L. Because the hypothesis that all the slopes in the model are zero is often interesting, the log-likelihood computed with only a constant term, ln L₀ [see (23-28)], should also be reported. An analog to the R² in a conventional regression is McFadden's (1974) likelihood ratio index,

LRI = 1 − ln L/ln L₀.

This measure has an intuitive appeal in that it is bounded by zero and one. (See Section 16.6.5.) If all the slope coefficients are zero, then it equals zero. There is no way to make LRI equal 1, although one can come close. If F_i is always one when y equals one and zero when y equals zero, then ln L equals zero (the log of one) and LRI equals one. It has been suggested that this finding is indicative of a "perfect fit" and that LRI increases as the fit of the model improves. To a degree, this point is true (see the analysis in Section 23.8.4). Unfortunately, the values between zero and one have no natural interpretation. If F(x'_i β) is a proper cdf, then even with many regressors the model cannot fit perfectly unless x'_i β goes to +∞ or −∞. As a practical matter, it does happen. But when it does, it

14 See, for example, Cragg and Uhler (1970), Amemiya (1981), Maddala (1983), McFadden (1974), Ben-Akiva and Lerman (1985), Kay and Little (1986), Veall and Zimmermann (1992), Zavoina and McKelvey (1975), Efron (1978), and Cramer (1999). A survey of techniques appears in Windmeijer (1995).


indicates a flaw in the model, not a good fit. If the range of one of the independent variables contains a value, say, x*, such that the sign of (x − x*) predicts y perfectly and vice versa, then the model will become a perfect predictor. This result also holds in general if the sign of x'β gives a perfect predictor for some vector β.15 For example, one might mistakenly include as a regressor a dummy variable that is identical, or nearly so, to the dependent variable. In this case, the maximization procedure will break down precisely because x'β is diverging during the iterations. [See McKenzie (1998) for an application and discussion.] Of course, this situation is not at all what we had in mind for a good fit.
Other fit measures have been suggested. Ben-Akiva and Lerman (1985) and Kay and Little (1986) suggested a fit measure that is keyed to the prediction rule,

R²_BL = (1/n) Σ_{i=1}^n [ y_i F̂_i + (1 − y_i)(1 − F̂_i) ],

which is the average probability of correct prediction by the prediction rule. The difficulty in this computation is that in unbalanced samples, the less frequent outcome will usually be predicted very badly by the standard procedure, and this measure does not pick up that point. Cramer (1999) has suggested an alternative measure that directly measures this failure,

λ = (average F̂ | y_i = 1) − (average F̂ | y_i = 0)
  = (average (1 − F̂) | y_i = 0) − (average (1 − F̂) | y_i = 1).

Cramer's measure heavily penalizes the incorrect predictions, and because each proportion is taken within the subsample, it is not unduly influenced by the large proportionate size of the group of more frequent outcomes. Some of the other proposed fit measures are Efron's (1978)

R²_Ef = 1 − [ Σ_{i=1}^n (y_i − p̂_i)² ] / [ Σ_{i=1}^n (y_i − ȳ)² ],

Veall and Zimmermann's (1992)

R²_VZ = [ (δ − 1)/(δ − LRI) ] LRI,  δ = n/(2 ln L₀),

and Zavoina and McKelvey's (1975)

R²_MZ = [ Σ_{i=1}^n (x'_i β̂ − x̄'β̂)² ] / [ n + Σ_{i=1}^n (x'_i β̂ − x̄'β̂)² ].

The last of these measures corresponds to the regression variation divided by the total variation in the latent index function model, where the disturbance variance is σ² = 1. The values of several of these statistics are given with the model results in Table 23.15 with the application in Section 23.8.4 for illustration.
A useful summary of the predictive ability of the model is a 2 × 2 table of the hits and misses of a prediction rule such as

ŷ = 1 if F̂ > F* and 0 otherwise.  (23-36)
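Several of these measures are easy to compute from the fitted probabilities. A sketch (hypothetical inputs y, p_hat, and index values xb; the Veall–Zimmermann measure is omitted; not the text's code):

```python
# Sketch: common fit measures from fitted probabilities and index values.
import numpy as np

def fit_measures(y, p_hat, xb):
    n = len(y)
    lnL = np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
    P = y.mean()
    lnL0 = n * (P * np.log(P) + (1 - P) * np.log(1 - P))          # (23-28)
    lri = 1.0 - lnL / lnL0                                        # McFadden's LRI
    r2_bl = np.mean(y * p_hat + (1 - y) * (1 - p_hat))            # Ben-Akiva/Lerman
    lam = p_hat[y == 1].mean() - p_hat[y == 0].mean()             # Cramer's lambda
    r2_ef = 1.0 - ((y - p_hat) ** 2).sum() / ((y - P) ** 2).sum() # Efron
    ess = ((xb - xb.mean()) ** 2).sum()
    r2_mz = ess / (n + ess)                                       # McKelvey-Zavoina
    return {"LRI": lri, "BL": r2_bl, "Cramer": lam, "Efron": r2_ef, "MZ": r2_mz}
```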

15 See McFadden (1984) and Amemiya (1985). If this condition holds, then gradient methods will find that β.


The usual threshold value is 0.5, on the basis that we should predict a one if the model says a one is more likely than a zero. It is important not to place too much emphasis on this measure of goodness of fit, however. Consider, for example, the naive predictor

ŷ = 1 if P > 0.5 and 0 otherwise,  (23-37)

where P is the simple proportion of ones in the sample. This rule will always predict correctly 100P percent of the observations, which means that the naive model does not have zero fit. In fact, if the proportion of ones in the sample is very high, it is possible to construct examples in which the second model will generate more correct predictions than the first! Once again, this flaw is not in the model; it is a flaw in the fit measure.16 The important element to bear in mind is that the coefficients of the estimated model are not chosen so as to maximize this (or any other) fit measure, as they are in the linear regression model where b maximizes R². (The maximum score estimator discussed in Example 23.12 addresses this issue directly.)
Another consideration is that 0.5, although the usual choice, may not be a very good value to use for the threshold. If the sample is unbalanced (that is, has many more ones than zeros, or vice versa), then by this prediction rule it might never predict a one (or zero). To consider an example, suppose that in a sample of 10,000 observations, only 1,000 have Y = 1. We know that the average predicted probability in the sample will be 0.10. As such, it may require an extreme configuration of regressors even to produce an F̂ of 0.2, to say nothing of 0.5. In such a setting, the prediction rule may fail every time to predict when Y = 1. The obvious adjustment is to reduce F*. Of course, this adjustment comes at a cost. If we reduce the threshold F* so as to predict y = 1 more often, then we will increase the number of correct classifications of observations that do have y = 1, but we will also increase the number of times that we incorrectly classify as ones observations that have y = 0.17 In general, any prediction rule of the form in (23-36) will make two types of errors: It will incorrectly classify zeros as ones and ones as zeros. In practice, these errors need not be symmetric in the costs that result. For example, in a credit scoring model [see Boyes, Hoffman, and Low (1989)], incorrectly classifying an applicant as a bad risk is not the same as incorrectly classifying a bad risk as a good one. Changing F* will always reduce the probability of one type of error while increasing the probability of the other. There is no correct answer as to the best value to choose. It depends on the setting and on the criterion function upon which the prediction rule depends.
The likelihood ratio index and Veall and Zimmermann's modification of it are obviously related to the likelihood ratio statistic for testing the hypothesis that the coefficient vector is zero. Efron's and Cramer's measures listed previously are oriented more toward the relationship between the fitted probabilities and the actual values. Efron's and Cramer's statistics are usefully tied to the standard prediction rule ŷ = 1[F̂ > 0.5]. The McKelvey and Zavoina measure is an analog to the regression coefficient of determination, based on the underlying regression y* = x'β + ε. Whether these have a close relationship to any type of fit in the familiar sense is a question that needs to be studied.

16 See Amemiya (1981).
17 The technique of discriminant analysis is used to build a procedure around this consideration. In this setting, we consider not only the number of correct and incorrect classifications, but the cost of each type of misclassification.


In some cases, it appears so. But the maximum likelihood estimator, on which all the fit measures are based, is not chosen so as to maximize a fitting criterion based on prediction of y as it is in the classical regression (which maximizes R²). It is chosen to maximize the joint density of the observed dependent variables. It remains an interesting question for research whether fitting y well or obtaining good parameter estimates is a preferable estimation criterion. Evidently, they need not be the same thing.
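As a concrete illustration of the hit-and-miss table based on (23-36) (a sketch with hypothetical inputs; not the text's code), the 2 × 2 summary can be tabulated for any threshold F*:

```python
# Sketch: 2x2 table of actual versus predicted outcomes for rule (23-36).
import numpy as np

def prediction_table(y, p_hat, threshold=0.5):
    y_hat = (p_hat > threshold).astype(int)
    table = np.zeros((2, 2), dtype=int)
    for actual in (0, 1):
        for pred in (0, 1):
            table[actual, pred] = int(np.sum((y == actual) & (y_hat == pred)))
    return table      # rows: actual 0/1; columns: predicted 0/1
```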

Example 23.6 Prediction with a Probit Model
Tunali (1986) estimated a probit model in a study of migration, subsequent remigration, and earnings for a large sample of observations of male members of households in Turkey. Among his results, he reports the summary shown here for a probit model: The estimated model is highly significant, with a likelihood ratio test of the hypothesis that the coefficients (16 of them) are zero based on a chi-squared value of 69 with 16 degrees of freedom.18 The model predicts 491 of 690, or 71.2 percent, of the observations correctly, although the likelihood ratio index is only 0.083. A naive model, which always predicts that y = 0 because P < 0.5, predicts 487 of 690, or 70.6 percent, of the observations correctly. This result is hardly suggestive of no fit. The maximum likelihood estimator produces several significant influences on the probability but makes only four more correct predictions than the naive predictor.19

              Predicted
          D = 0   D = 1   Total
Actual
D = 0      471      16     487
D = 1      183      20     203
Total      654      36     690

23.4.6 CHOICE-BASED SAMPLING

In some studies [e.g., Boyes, Hoffman, and Low (1989), Greene (1992)], the mix of ones and zeros in the observed sample of the dependent variable is deliberately skewed in favor of one outcome or the other to achieve a more balanced sample than random sampling would produce. The sampling is said to be choice based. In the studies noted, the dependent variable measured the occurrence of loan default, which is a relatively uncommon occurrence. To enrich the sample, observations with y = 1 (default) were oversampled. Intuition should suggest (correctly) that the bias in the sample should be transmitted to the parameter estimates, which will be estimated so as to mimic the sample, not the population, which is known to be different. Manski and Lerman (1977) derived the weighted endogenous sampling maximum likelihood (WESML) estimator for this situation. The estimator requires that the true population proportions, ω₁ and ω₀, be known. Let p₁ and p₀ be the sample proportions of ones and zeros. Then the estimator is obtained by maximizing a weighted log-likelihood,

ln L = Σ_{i=1}^n w_i ln F(q_i x'_i β),

18 This view actually understates slightly the significance of his model, because the preceding predictions are based on a bivariate model. The likelihood ratio test fails to reject the hypothesis that a univariate model applies, however.
19 It is also noteworthy that nearly all the correct predictions of the maximum likelihood estimator are the zeros. It hits only 10 percent of the ones in the sample.


where w_i = y_i(ω₁/p₁) + (1 − y_i)(ω₀/p₀). Note that w_i takes only two different values. The derivatives and the Hessian are likewise weighted. A final correction is needed after estimation; the appropriate estimator of the asymptotic covariance matrix is the sandwich estimator discussed in Section 23.4.1, H⁻¹BH⁻¹ (with weighted B and H), instead of B or H alone. (The weights are not squared in computing B.)20
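A sketch of the WESML weights and the weighted log-likelihood for a probit specification (hypothetical inputs; omega1 and omega0 are the known population proportions; not the text's code):

```python
# Sketch: WESML weighted log-likelihood for a choice-based probit sample.
import numpy as np
from scipy.stats import norm

def wesml_loglike(beta, X, y, omega1, omega0):
    p1 = y.mean()                                         # sample proportion of ones
    w = y * (omega1 / p1) + (1 - y) * (omega0 / (1 - p1)) # the two weight values
    q = 2 * y - 1
    return np.sum(w * norm.logcdf(q * (X @ beta)))        # sum_i w_i ln F(q_i x_i'b)
```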

23.4.7 DYNAMIC BINARY CHOICE MODELS

A random or fixed effects model that explicitly allows for lagged effects would be

y_it = 1(x'_it β + α_i + γ y_{i,t−1} + ε_it > 0).

Lagged effects, or persistence, in a binary choice setting can arise from three sources: serial correlation in ε_it, the heterogeneity, α_i, or true state dependence through the term γ y_{i,t−1}. Chiappori (1998) [and see Arellano (2001)] suggests an application to the French automobile insurance market in which the incentives built into the pricing system are such that having an accident in one period should lower the probability of having one in the next (state dependence), but some drivers remain more likely to have accidents than others in every period, which would reflect the heterogeneity instead. State dependence is likely to be particularly important in the typical panel which has only a few observations for each individual. Heckman (1981a) examined this issue at length. Among his findings were that the somewhat muted small sample bias in fixed effects models with T = 8 was made much worse when there was state dependence. A related problem is that with a relatively short panel, the initial conditions, y_i0, have a crucial impact on the entire path of outcomes. Modeling dynamic effects and initial conditions in binary choice models is more complex than in the linear model, and by comparison there are relatively fewer firm results in the applied literature.21
Much of the contemporary literature has focused on methods of avoiding the strong parametric assumptions of the probit and logit models. Manski (1987) and Honore and Kyriazidou (2000) show that Manski's (1986) maximum score estimator can be applied to the differences of unequal pairs of observations in a two period panel with fixed effects. However, the limitations of the maximum score estimator have motivated research on other approaches. An extension of lagged effects to a parametric model is Chamberlain (1985), Jones and Landwehr (1988), and Magnac (1997), who added state dependence to Chamberlain's fixed effects logit estimator. Unfortunately, once the identification issues are settled, the model is only operational if there are no other exogenous variables in it, which limits its usefulness for practical application. Lewbel (2000) has extended his fixed effects estimator to dynamic models as well. In this framework, the narrow assumptions about the independent variables somewhat

20 WESML and the choice-based sampling estimator are not the free lunch they may appear to be. That which the biased sampling does, the weighting undoes. It is common for the end result to be very large standard errors, which might be viewed as unfortunate, insofar as the purpose of the biased sampling was to balance the data precisely to avoid this problem.
21 A survey of some of these results is given by Hsiao (2003). Most of Hsiao (2003) is devoted to the linear regression model. A number of studies specifically focused on discrete choice models and panel data have appeared recently, including Beck, Epstein, Jackman and O'Halloran (2001), Arellano (2001) and Greene (2001). Vella and Verbeek (1998) provide an application to the joint determination of wages and union membership.


limit its practical applicability. Honore and Kyriazidou (2000) have combined the logic of the conditional logit model and Manski’s maximum score estimator. They specify

Prob(y_i0 = 1 | x_i, α_i) = p₀(x_i, α_i), where x_i = (x_i1, x_i2, ..., x_iT),
Prob(y_it = 1 | x_i, α_i, y_i0, y_i1, ..., y_{i,t−1}) = F(x'_it β + α_i + γ y_{i,t−1}), t = 1, ..., T.

The analysis assumes a single regressor and focuses on the case of T = 3. The resulting estimator resembles Chamberlain's but relies on observations for which x_it = x_{i,t−1}, which rules out direct time effects as well as, for practical purposes, any continuous variable. The restriction to a single regressor limits the generality of the technique as well. The need for observations with equal values of x_it is a considerable restriction, and the authors propose a kernel density estimator for the difference, x_it − x_{i,t−1}, instead, which does relax that restriction a bit. The end result is an estimator that converges (they conjecture), but to a nonnormal distribution and at a rate slower than n^{−1/3}. Semiparametric estimators for dynamic models at this point in the development are still primarily of theoretical interest. Models that extend the parametric formulations to include state dependence have a much longer history, including Heckman (1978, 1981a, 1981b), Heckman and MaCurdy (1980), Jakubson (1988), Keane (1993), and Beck et al. (2001), to name a few.22 In general, even without heterogeneity, dynamic models ultimately involve modeling the joint outcome (y_i0, ..., y_iT), which necessitates some treatment involving multivariate integration. Example 23.7 describes a recent application. Stewart (2006) provides another.

Example 23.7 An Intertemporal Labor Force Participation Equation
Hyslop (1999) presents a model of the labor force participation of married women. The focus of the study is the high degree of persistence in the participation decision. Data used in the study were the years 1979–1985 of the Panel Study of Income Dynamics. A sample of 1,812 continuously married couples were studied. Exogenous variables that appeared in the model were measures of permanent and transitory income and fertility captured in yearly counts of the number of children from 0–2, 3–5, and 6–17 years old. Hyslop's formulation, in general terms, is

(initial condition)  y_i0 = 1(x'_i0 β₀ + v_i0 > 0),
(dynamic model)      y_it = 1(x'_it β + γ y_{i,t−1} + α_i + v_it > 0),
(heterogeneity correlated with participation)  α_i = z'_i δ + η_i,
(stochastic specification)
    η_i | X_i ~ N[0, σ²_η],
    v_i0 | X_i ~ N[0, σ²_0],
    w_it | X_i ~ N[0, σ²_w],
    v_it = ρ v_{i,t−1} + w_it, σ²_η + σ²_w = 1,
    Corr[v_i0, v_it] = ρᵗ, t = 1, ..., T − 1.

22 Beck et al. (2001) is a bit different from the others mentioned in that in their study of "state failure," they observe a large sample of countries (147) observed over a fairly large number of years, 40. As such, they are able to formulate their models in a way that makes the asymptotics with respect to T appropriate. They can analyze the data essentially in a time-series framework. Sepanski (2000) is another application that combines state dependence and the random coefficient specification of Akin, Guilkey, and Sickles (1979).


The presence of the autocorrelation and state dependence in the model invalidates the simple maximum likelihood procedures we examined earlier. The appropriate likelihood function is constructed by formulating the probabilities as

Prob(y_i0, y_i1, ...) = Prob(y_i0) × Prob(y_i1 | y_i0) × ··· × Prob(y_iT | y_{i,T−1}).

This still involves a T = 7 order normal integration, which is approximated in the study using a simulator similar to the GHK simulator discussed in Section 17.3.3. Among Hyslop's results is a comparison of the model fit by the simulator for the multivariate normal probabilities with the same model fit using the maximum simulated likelihood technique described in Section 17.5.1.

23.5 BINARY CHOICE MODELS FOR PANEL DATA

Qualitative response models have been a growth industry in econometrics. The recent literature, particularly in the area of panel data analysis, has produced a number of new techniques. The availability of high-quality panel data sets on microeconomic behavior has maintained an interest in extending the models of Chapter 9 to binary (and other discrete choice) models. In this section, we will survey a few results from this rapidly growing literature. The structural model for a possibly unbalanced panel of data would be written

y*_it = x'_it β + ε_it, i = 1, ..., n, t = 1, ..., T_i,
y_it = 1 if y*_it > 0, and 0 otherwise.

The second line of this definition is often written

y_it = 1(x'_it β + ε_it > 0)

to indicate a variable that equals one when the condition in parentheses is true and zero when it is not. Ideally, we would like to specify that ε_it and ε_is are freely correlated within a group, but uncorrelated across groups. But doing so will involve computing joint probabilities from a T_i variate distribution, which is generally problematic.23 (We will return to this issue later.) A more promising approach is an effects model,

y*_it = x'_it β + v_it + u_i, i = 1, ..., n, t = 1, ..., T_i,
y_it = 1 if y*_it > 0, and 0 otherwise,

where, as before (see Section 9.5), ui is the unobserved, individual specific hetero- geneity. Once again, we distinguish between “random” and “fixed” effects models by the relationship between ui and xit. The assumption that ui is unrelated to xit, so that the conditional distribution f (ui | xit) is not dependent on xit, produces the random effects model. Note that this places a restriction on the distribution of the heterogeneity.

23A "limited information" approach based on the GMM estimation method has been suggested by Avery, Hansen, and Hotz (1983). With recent advances in simulation-based computation of multinormal integrals (see Section 17.5.1), some work on such a panel data estimator has appeared in the literature. See, for example, Geweke, Keane, and Runkle (1994, 1997). The GEE estimator of Diggle, Liang, and Zeger (1994) [see also, Liang and Zeger (1986) and Stata (2006)] seems to be another possibility. However, in all these cases, it must be remembered that the procedure specifies estimation of a correlation matrix for a Ti vector of unobserved variables based on a dependent variable that takes only two values. We should not be too optimistic about this if Ti is even moderately large.


If that distribution is unrestricted, so that ui and xit may be correlated, then we have what is called the fixed effects model. The distinction does not relate to any intrinsic characteristic of the effect itself. As we shall see shortly, this is a modeling framework that is fraught with difficulties and unconventional estimation problems. Among them are the following: estimation of the random effects model requires very strong assumptions about the heterogeneity; the fixed effects model encounters an incidental parameters problem that renders the maximum likelihood estimator inconsistent.

23.5.1 Random Effects Models

A specification that has the same structure as the random effects model of Section 9.5 has been implemented by Butler and Moffitt (1982). We will sketch the derivation to suggest how random effects can be handled in discrete and limited dependent variable models such as this one. Full details on estimation and inference may be found in Butler and Moffitt (1982) and Greene (1995a). We will then examine some extensions of the Butler and Moffitt model. The random effects model specifies

εit = vit + ui ,

where vit and ui are independent random variables with

E[v_it | X] = 0;  Cov[v_it, v_js | X] = Var[v_it | X] = 1 if i = j and t = s, and 0 otherwise,
E[u_i | X] = 0;  Cov[u_i, u_j | X] = Var[u_i | X] = σ_u² if i = j, and 0 otherwise,

Cov[v_it, u_j | X] = 0 for all i, t, j,24 and X indicates all the exogenous data in the sample, x_it for all i and t. Then,

E[ε_it | X] = 0,
Var[ε_it | X] = σ_v² + σ_u² = 1 + σ_u²,

and

Corr[ε_it, ε_is | X] = ρ = σ_u² / (1 + σ_u²).

The new free parameter is σ_u² = ρ/(1 − ρ). Recall that in the cross-section case, the probability associated with an observation is

P(y_i | x_i) = ∫_{L_i}^{U_i} f(ε_i) dε_i,  (L_i, U_i) = (−∞, −x_i'β) if y_i = 0 and (−x_i'β, +∞) if y_i = 1.

This simplifies to Φ[(2y_i − 1)x_i'β] for the normal distribution and Λ[(2y_i − 1)x_i'β] for the logit model. In the fully general case with an unrestricted covariance matrix, the contribution of group i to the likelihood would be the joint probability for all T_i observations;

L_i = P(y_i1, ..., y_iT_i | X) = ∫_{L_iT_i}^{U_iT_i} ··· ∫_{L_i1}^{U_i1} f(ε_i1, ε_i2, ..., ε_iT_i) dε_i1 dε_i2 ... dε_iT_i.   (23-38)

24See Wooldridge (1999) for discussion of this assumption.


The integration of the joint density, as it stands, is impractical in most cases. The special nature of the random effects model allows a simplification, however. We can obtain the joint density of the v_it's by integrating u_i out of the joint density of (ε_i1, ..., ε_iT_i, u_i), which is

f(ε_i1, ..., ε_iT_i, u_i) = f(ε_i1, ..., ε_iT_i | u_i) f(u_i).

So,

f(ε_i1, ε_i2, ..., ε_iT_i) = ∫_{−∞}^{+∞} f(ε_i1, ε_i2, ..., ε_iT_i | u_i) f(u_i) du_i.

The advantage of this form is that conditioned on u_i, the ε_it's are independent, so

f(ε_i1, ε_i2, ..., ε_iT_i) = ∫_{−∞}^{+∞} [ ∏_{t=1}^{T_i} f(ε_it | u_i) ] f(u_i) du_i.

Inserting this result in (23-38) produces

L_i = P[y_i1, ..., y_iT_i | X] = ∫_{L_iT_i}^{U_iT_i} ··· ∫_{L_i1}^{U_i1} ∫_{−∞}^{+∞} ∏_{t=1}^{T_i} f(ε_it | u_i) f(u_i) du_i dε_i1 dε_i2 ... dε_iT_i.

This may not look like much simplification, but in fact, it is. Because the ranges of integration are independent, we may change the order of integration;

L_i = P[y_i1, ..., y_iT_i | X] = ∫_{−∞}^{+∞} { ∫_{L_iT_i}^{U_iT_i} ··· ∫_{L_i1}^{U_i1} [ ∏_{t=1}^{T_i} f(ε_it | u_i) ] dε_i1 dε_i2 ... dε_iT_i } f(u_i) du_i.

Conditioned on the common u_i, the ε's are independent, so the term in square brackets is just the product of the individual probabilities. We can write this as

L_i = P[y_i1, ..., y_iT_i | X] = ∫_{−∞}^{+∞} [ ∏_{t=1}^{T_i} ∫_{L_it}^{U_it} f(ε_it | u_i) dε_it ] f(u_i) du_i.   (23-39)

Now, consider the individual densities in the product. Conditioned on u_i, these are the now-familiar probabilities for the individual observations, computed now at x_it'β + u_i. This produces a general model for random effects for the binary choice model. Collecting all the terms, we have reduced it to

L_i = P[y_i1, ..., y_iT_i | X] = ∫_{−∞}^{+∞} [ ∏_{t=1}^{T_i} Prob(Y_it = y_it | x_it'β + u_i) ] f(u_i) du_i.   (23-40)

It remains to specify the distributions, but the important result thus far is that the entire computation requires only one-dimensional integration. The inner probabilities may be any of the models we have considered so far, such as probit, logit, Gumbel, and so on. The intricate part that remains is to determine how to do the outer integration. Butler and Moffitt's method assuming that u_i is normally distributed is detailed in Section 16.9.6.b.

A number of authors have found the Butler and Moffitt formulation to be a satisfactory compromise between a fully unrestricted model and the cross-sectional variant that ignores the correlation altogether. An application that includes both group and time effects is Tauchen, Witte, and Griesinger's (1994) study of arrests and criminal


behavior. The Butler and Moffitt approach has been criticized for the restriction of equal correlation across periods. But it does have a compelling virtue: the model can be efficiently estimated even with fairly large T_i using conventional computational methods. [See Greene (2007b).]

A remaining problem with the Butler and Moffitt specification is its assumption of normality. In general, other distributions are problematic because of the difficulty of finding either a closed form for the integral or a satisfactory method of approximating it. An alternative approach that allows some flexibility is the method of maximum simulated likelihood (MSL), which was discussed in Section 9.8.2 and Chapter 17. The transformed likelihood we derived in (23-40) is an expectation;

L_i = ∫_{−∞}^{+∞} [ ∏_{t=1}^{T_i} Prob(Y_it = y_it | x_it'β + u_i) ] f(u_i) du_i
    = E_{u_i} [ ∏_{t=1}^{T_i} Prob(Y_it = y_it | x_it'β + u_i) ].

This expectation can be approximated by simulation rather than quadrature. First, let θ now denote the scale parameter in the distribution of u_i. This would be σ_u for a normal distribution, for example, or some other scaling for the logistic or uniform distribution. Then, write the term in the likelihood function as

L_i = E_{u_i} [ ∏_{t=1}^{T_i} F(y_it, x_it'β + θu_i) ] = E_{u_i}[h(u_i)].

The function is smooth, continuous, and continuously differentiable. If this expectation is finite, then the conditions of the law of large numbers should apply, which would mean that for a sample of observations u_i1, ..., u_iR,

plim (1/R) Σ_{r=1}^{R} h(u_ir) = E_u[h(u_i)].

This suggests, based on the results in Chapter 17, an alternative method of maximizing the log-likelihood for the random effects model. A sample of person-specific draws from the population u_i can be generated with a random number generator. For the Butler and Moffitt model with normally distributed u_i, the simulated log-likelihood function is

ln L_Simulated = Σ_{i=1}^{n} ln { (1/R) Σ_{r=1}^{R} ∏_{t=1}^{T_i} F[ q_it (x_it'β + σ_u u_ir) ] }.   (23-41)

This function is maximized with respect to β and σu. Note that in the preceding, as in the quadrature approximated log-likelihood, the model can be based on a probit, logit, or any other functional form desired. There is an additional degree of flexibility in this approach. The Hermite quadrature approach is essentially limited by its functional form to the normal distribution. But, in the simulation approach, u_ir can come from some other distribution. For example, it might be believed that the dispersion of the heterogeneity is greater than implied by a normal distribution. The logistic distribution might be preferable. A random sample from the logistic distribution can be created by sampling (w_i1, ..., w_iR) from the standard uniform [0, 1] distribution; then u_ir = ln[w_ir/(1 − w_ir)]. Other distributions, such as the uniform itself, are also possible.
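A minimal sketch of the simulated log-likelihood in (23-41) for the probit case may make the computation concrete. The data layout (stacked y and X, with a list of row-index arrays, one per individual) and the function name are our own conventions:

```python
import numpy as np
from scipy.stats import norm

def simulated_loglik(beta, sigma_u, y, X, groups, n_draws=100, seed=123):
    """Simulated log-likelihood (23-41) for a random effects probit.
    y and X are stacked over all (i, t); groups[i] holds the row indices of group i."""
    rng = np.random.default_rng(seed)
    ll = 0.0
    for rows in groups:
        q = 2 * y[rows] - 1                       # the +1/-1 sign variable q_it
        xb = X[rows] @ beta
        u = rng.standard_normal(n_draws)          # person-specific draws u_ir
        # T_i x R matrix of Phi[q_it (x_it'b + sigma_u u_ir)]
        P = norm.cdf(q[:, None] * (xb[:, None] + sigma_u * u[None, :]))
        ll += np.log(P.prod(axis=0).mean())       # average over draws, log, accumulate
    return ll
```

For logistic rather than normal heterogeneity, one would replace the draws u with ln[w/(1 − w)], w uniform on (0, 1), exactly as described above.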


We have examined two approaches to estimation of a probit model with random effects. GMM estimation is another possibility. Avery, Hansen, and Hotz (1983), Bertschek and Lechner (1998), and Inkmann (2000) examine this approach; the latter two offer some comparison with the quadrature and simulation-based estimators considered here. (Our application in Example 23.16 will use the Bertschek and Lechner data.)

The preceding opens another possibility. The random effects model can be cast as a model with a random constant term;

y*_it = α_i + x_it'β + ε_it,  i = 1, ..., n, t = 1, ..., T_i,
y_it = 1 if y*_it > 0, and 0 otherwise,

where α_i = α + σ_u u_i. This is simply a reinterpretation of the model we just analyzed. We might, however, now extend this formulation to the full parameter vector. The resulting structure is

y*_it = x_it'β_i + ε_it,  i = 1, ..., n, t = 1, ..., T_i,
y_it = 1 if y*_it > 0, and 0 otherwise,

where β_i = β + Γu_i and Γ is a nonnegative definite diagonal matrix—some of its diagonal elements could be zero for nonrandom parameters. The method of estimation is essentially the same as before. The simulated log-likelihood is now

ln L_Simulated = Σ_{i=1}^{n} ln { (1/R) Σ_{r=1}^{R} ∏_{t=1}^{T_i} F[ q_it (x_it'(β + Γu_ir)) ] }.

The simulation now involves R draws from the multivariate distribution of u. Because the draws are uncorrelated—Γ is diagonal—this is essentially the same estimation problem as the random effects model considered previously. This model is estimated in Example 23.11. Example 23.11 also presents a similar model that assumes that the distribution of β_i is discrete rather than continuous.
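Extending the sketch above to the full random parameters case only changes how the index is simulated. Under the same hypothetical data layout, with gamma_diag holding the diagonal of Γ:

```python
import numpy as np
from scipy.stats import norm

def simulated_loglik_rp(beta, gamma_diag, y, X, groups, n_draws=100, seed=42):
    """Simulated log-likelihood for a random parameters probit,
    beta_i = beta + Gamma u_i with Gamma diagonal (zeros => nonrandom parameters)."""
    rng = np.random.default_rng(seed)
    K = X.shape[1]
    ll = 0.0
    for rows in groups:
        q = 2 * y[rows] - 1
        u = rng.standard_normal((n_draws, K))           # one draw per parameter
        B = beta[None, :] + u * gamma_diag[None, :]     # R x K simulated beta_i
        P = norm.cdf(q[:, None] * (X[rows] @ B.T))      # T_i x R probabilities
        ll += np.log(P.prod(axis=0).mean())
    return ll
```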

23.5.2 Fixed Effects Models

The fixed effects model is

y*_it = α_i d_it + x_it'β + ε_it,  i = 1, ..., n, t = 1, ..., T_i,
y_it = 1 if y*_it > 0, and 0 otherwise,

where d_it is a dummy variable that takes the value one for individual i and zero otherwise. For convenience, we have redefined x_it to be the nonconstant variables in the model. The parameters to be estimated are the K elements of β and the n individual constant terms. Before we consider the several virtues and shortcomings of this model, we consider the practical aspects of estimation of what are possibly a huge number of parameters, n + K; n is not limited here, and could be in the thousands in a typical application. The log-likelihood function for the fixed effects model is

ln L = Σ_{i=1}^{n} Σ_{t=1}^{T_i} ln P(y_it | α_i + x_it'β),

where P(·) is the probability of the observed outcome, for example, Φ[q_it(α_i + x_it'β)] for the probit model or Λ[q_it(α_i + x_it'β)] for the logit model. What follows can be extended to any index function model, but for the present, we'll confine our attention


to symmetric distributions such as the normal and logistic, so that the probability can be conveniently written as Prob(Y_it = y_it | x_it) = P[q_it(α_i + x_it'β)]. It will be convenient to let z_it = α_i + x_it'β so Prob(Y_it = y_it | x_it) = P(q_it z_it).

In our previous application of this model, in the linear regression case, we found that estimation of the parameters was made possible by a transformation of the data to deviations from group means which eliminated the person specific constants from the estimator. (See Section 9.4.1.) Save for the special case discussed later, that will not be possible here, so that if one desires to estimate the parameters of this model, it will be necessary actually to compute the possibly huge number of constant terms at the same time. This has been widely viewed as a practical obstacle to estimation of this model because of the need to invert a potentially large second derivatives matrix, but this is a misconception. [See, e.g., Maddala (1987), p. 317.] The method for estimation of nonlinear fixed effects models such as the probit and logit models is detailed in Section 16.9.6.c.

The problems with the fixed effects estimator are statistical, not practical. The estimator relies on T_i increasing for the constant terms to be consistent—in essence, each α_i is estimated with T_i observations. But, in this setting, not only is T_i fixed, it is likely to be quite small. As such, the estimators of the constant terms are not consistent (not because they converge to something other than what they are trying to estimate, but because they do not converge at all). The estimator of β is a function of the estimators of α, which means that the MLE of β is not consistent either. This is the incidental parameters problem. [See Neyman and Scott (1948) and Lancaster (2000).] There is, as well, a small sample (small T_i) bias in the estimators. How serious this bias is remains a question in the literature. Two pieces of received wisdom are Hsiao's (1986) results for a binary logit model [with additional results in Abrevaya (1997)] and Heckman and MaCurdy's (1980) results for the probit model. Hsiao found that for T_i = 2, the bias in the MLE of β is 100 percent, which is extremely pessimistic. Heckman and MaCurdy found in a Monte Carlo study that in samples of n = 100 and T = 8, the bias appeared to be on the order of 10 percent, which is substantive, but certainly less severe than Hsiao's results suggest. No other theoretical results have been shown for other models, although in very few cases, it can be shown that there is no incidental parameters problem. (The Poisson model mentioned in Chapter 16 is one of these special cases.)

The fixed effects approach does have some appeal in that it does not require an assumption of orthogonality of the independent variables and the heterogeneity. An ongoing pursuit in the literature is concerned with the severity of the tradeoff of this virtue against the incidental parameters problem. Some commentary on this issue appears in Arellano (2001). Results of our own investigation appear in Chapter 17 and Greene (2004).
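Hsiao's T = 2 result is easy to verify by simulation. The following sketch is entirely our own construction, using statsmodels' Logit with a full set of group dummies; the sample size, parameter values, and seed are arbitrary. It typically yields an unconditional fixed effects estimate near 2β, that is, a bias of roughly 100 percent:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, T, beta = 500, 2, 1.0
a = rng.standard_normal(n)                         # true fixed effects
x = rng.standard_normal((n, T))
y = (a[:, None] + beta * x + rng.logistic(size=(n, T)) > 0).astype(float)

# groups with all zeros or all ones do not identify their alpha_i; drop them
keep = (y.sum(axis=1) > 0) & (y.sum(axis=1) < T)
yk, xk = y[keep], x[keep]
m = keep.sum()
D = np.kron(np.eye(m), np.ones((T, 1)))            # group dummy variables
X = np.column_stack([xk.ravel(), D])               # slope column first
fit = sm.Logit(yk.ravel(), X).fit(disp=0, maxiter=200)
print("unconditional FE estimate of beta:", fit.params[0])   # roughly 2 x beta
```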

Example 23.8 Binary Choice Models for Panel Data
In Example 23.4, we fit a pooled binary logit model y = 1(DocVis > 0) using the German health care utilization data examined in Example 11.11. The model is

Prob(DocVis_it > 0) = Λ(β_1 + β_2 Age_it + β_3 Income_it + β_4 Kids_it

+ β_5 Education_it + β_6 Married_it).

No account of the panel nature of the data set was taken in that exercise. The sample contains a total of 27,326 observations on 7,293 families with T_i dispersed from one to seven. (See Example 11.11 for details.) Table 23.5 lists parameter estimates and estimated standard errors for probit and logit random and fixed effects models. There is a surprising amount of variation across the estimators. It is generally difficult to compare across the estimators.

[Table 23.5, Estimated Parameters for Panel Data Binary Choice Models, appeared here. It reported coefficients, standard errors (robust, "cluster" corrected for the pooled estimators), estimates of ρ, and log-likelihoods for pooled, random effects (Butler and Moffitt quadrature and maximum simulated likelihood), and fixed effects (unconditional and conditional) logit and probit models. The column layout could not be recovered from the extraction.]



TABLE 23.6 Estimated Partial Effects for Panel Data Binary Choice Models

Model            Age         Income       Kids         Education    Married
Logit, P^a       0.0048133   −0.043213    −0.053598    −0.010596     0.019936
Logit: RE,Q^b    0.0064213    0.0035835   −0.035448    −0.010397     0.0041049
Logit: F,U^c     0.024871    −0.014477    −0.020991    −0.027711    −0.013609
Logit: F,C^d     0.0072991   −0.0043387   −0.0066967   −0.0078206   −0.0044842
Probit, P^a      0.0048374   −0.043883    −0.053414    −0.010597     0.019783
Probit: RE,Q^b   0.0056049   −0.0008836   −0.042792    −0.0093756    0.0045426
Probit: RE,S^e   0.0071455   −0.0010582   −0.054655    −0.011917     0.0059878
Probit: F,U^c    0.023958    −0.013152    −0.018495    −0.027659    −0.012557

aPooled estimator
bButler and Moffitt estimator
cUnconditional fixed effects estimator
dConditional fixed effects estimator
eMaximum simulated likelihood estimator

The three estimators would be expected to produce very different estimates in any of the three specifications—recall, for example, the pooled estimator is inconsistent in either the fixed or random effects cases. The logit results include two fixed effects estimators. The line marked "U" is the unconditional (inconsistent) estimator. The one marked "C" is Chamberlain's consistent estimator. Note that for all three fixed effects estimators, it is necessary to drop from the sample any groups that have DocVis_it equal to zero or one for every period. There were 3,046 such groups, which is about 42 percent of the sample. We also computed the probit random effects model in two ways, first by using the Butler and Moffitt method, then by using maximum simulated likelihood estimation. In this case, the estimators are very similar, as might be expected. The estimated correlation coefficient, ρ, is computed as σ_u²/(σ_ε² + σ_u²). For the probit model, σ_ε² = 1. The MSL estimator computes s_u = 0.9088376, from which we obtained ρ. The estimated partial effects for the models are shown in Table 23.6. The average of the fixed effects constant terms is used to obtain a constant term for the fixed effects case. Once again there is a considerable amount of variation across the different estimators. On average, the fixed effects models tend to produce much larger values than the pooled or random effects models.
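The partial effects in Table 23.6 can be reproduced from the coefficient estimates once a convention is chosen. A minimal sketch for the pooled logit case, averaging the logistic scale factor over the sample (one common convention; the function name is ours):

```python
import numpy as np

def logit_ape(beta, X):
    """Average partial effects for a pooled logit: mean over observations
    of beta_k * Lambda(x'b) * (1 - Lambda(x'b)) for each regressor k."""
    lam = 1.0 / (1.0 + np.exp(-(X @ beta)))
    scale = (lam * (1.0 - lam)).mean()       # average density of the logistic index
    return beta * scale
```

For the fixed effects rows, the text's convention of averaging the estimated constants would be applied before forming the index x'b.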

Why did the incidental parameters problem arise here and not in the linear regression model? Recall that estimation in the regression model was based on the deviations from group means, not the original data as it is here. The result we exploited there was that although f(y_it | X_i) is a function of α_i, f(y_it | X_i, ȳ_i) is not a function of α_i, and we used the latter in estimation of β. In that setting, ȳ_i is a minimal sufficient statistic for α_i. Sufficient statistics are available for a few distributions that we will examine, but not for the probit model. They are available for the logit model, as we now examine.

A fixed effects binary logit model is

Prob(y_it = 1 | x_it) = e^{α_i + x_it'β} / (1 + e^{α_i + x_it'β}).

The unconditional likelihood for the nT independent observations is

L = ∏_i ∏_t (F_it)^{y_it} (1 − F_it)^{1−y_it}.

Chamberlain (1980) [following Rasch (1960) and Andersen (1970)] observed that the conditional likelihood function,

L^c = ∏_{i=1}^{n} Prob(Y_i1 = y_i1, Y_i2 = y_i2, ..., Y_iT_i = y_iT_i | Σ_{t=1}^{T_i} y_it),


is free of the incidental parameters, α_i. The joint likelihood for each set of T_i observations conditioned on the number of ones in the set is

Prob(Y_i1 = y_i1, Y_i2 = y_i2, ..., Y_iT_i = y_iT_i | Σ_{t=1}^{T_i} y_it, data)
    = exp(Σ_{t=1}^{T_i} y_it x_it'β) / Σ_{Σ_t d_it = S_i} exp(Σ_{t=1}^{T_i} d_it x_it'β).

The function in the denominator is summed over the set of all T_i-choose-S_i different sequences of T_i zeros and ones that have the same sum as S_i = Σ_{t=1}^{T_i} y_it.25

Consider the example of T_i = 2. The unconditional likelihood is

L = ∏_i Prob(Y_i1 = y_i1) Prob(Y_i2 = y_i2).

For each pair of observations, we have these possibilities:

1. y_i1 = 0 and y_i2 = 0. Prob(0, 0 | sum = 0) = 1.
2. y_i1 = 1 and y_i2 = 1. Prob(1, 1 | sum = 2) = 1.

The ith term in L^c for either of these is just one, so they contribute nothing to the conditional likelihood function.26 When we take logs, these terms (and these observations) will drop out. But suppose that y_i1 = 0 and y_i2 = 1. Then

3. Prob(0, 1 | sum = 1) = Prob(0, 1 and sum = 1) / Prob(sum = 1) = Prob(0, 1) / [Prob(0, 1) + Prob(1, 0)].

Therefore, for this pair of observations, the conditional probability is

Prob(0, 1 | sum = 1)
  = [ (1/(1 + e^{α_i + x_i1'β})) (e^{α_i + x_i2'β}/(1 + e^{α_i + x_i2'β})) ]
    / [ (1/(1 + e^{α_i + x_i1'β})) (e^{α_i + x_i2'β}/(1 + e^{α_i + x_i2'β}))
      + (e^{α_i + x_i1'β}/(1 + e^{α_i + x_i1'β})) (1/(1 + e^{α_i + x_i2'β})) ]
  = e^{x_i2'β} / (e^{x_i1'β} + e^{x_i2'β}).

By conditioning on the sum of the two observations, we have removed the heterogeneity. Therefore, we can construct the conditional likelihood function as the product of these terms for the pairs of observations for which the two observations are (0, 1). Pairs of observations with one and zero are included analogously. The product of the terms such as the preceding, for those observation sets for which the sum is not zero or T_i, constitutes the conditional likelihood. Maximization of the resulting function is straightforward and may be done by conventional methods.

As in the linear regression model, it is of some interest to test whether there is indeed heterogeneity. With homogeneity (α_i = α), there is no unusual problem, and the

25The enumeration of all these computations stands to be quite a burden—see Arellano (2000, p. 47) or Baltagi (2005, p. 235). In fact, using a recursion suggested by Krailo and Pike (1984), the computation even with Ti up to 100 is routine.
26Recall that in the probit model when we encountered this situation, the individual constant term could not be estimated and the group was removed from the sample. The same effect is at work here.


model can be estimated, as usual, as a logit model. It is not possible to test the hypothesis using the likelihood ratio test, however, because the two likelihoods are not compara- ble. (The conditional likelihood is based on a restricted data set.) None of the usual tests of restrictions can be used because the individual effects are never actually estimated.27 Hausman’s (1978) specification test is a natural one to use here, however. Under the null hypothesis of homogeneity, both Chamberlain’s conditional maximum likelihood estimator (CMLE) and the usual maximum likelihood estimator are consistent, but Chamberlain’s is inefficient. (It fails to use the information that αi = α, and it may not use all the data.) Under the alternative hypothesis, the unconditional maximum like- lihood estimator is inconsistent,28 whereas Chamberlain’s estimator is consistent and efficient. The Hausman test can be based on the chi-squared statistic

χ² = (β̂_CML − β̂_ML)' [Var(CML) − Var(ML)]^{-1} (β̂_CML − β̂_ML).   (23-42)

The estimated covariance matrices are those computed for the two maximum likelihood estimators. For the unconditional maximum likelihood estimator, the row and column corresponding to the constant term are dropped. A large value will cast doubt on the hypothesis of homogeneity. (There are K degrees of freedom for the test.) It is possible that the covariance matrix for the maximum likelihood estimator will be larger than that for the conditional maximum likelihood estimator. If so, then the difference matrix in brackets is assumed to be a zero matrix, and the chi-squared statistic is therefore zero.
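Both ingredients of the test are simple to compute. For T = 2, Chamberlain's conditional MLE reduces to a logit fit to within-pair differences of the regressors, as the derivation above shows. A minimal sketch of that estimator and of the statistic in (23-42), with hypothetical array shapes (y is n × 2, x is n × 2 × K):

```python
import numpy as np
from scipy.optimize import minimize

def conditional_logit_T2(y, x):
    """Chamberlain's conditional MLE for a fixed effects logit with T = 2.
    Only groups with y_i1 + y_i2 = 1 contribute; the conditional probability
    of (0,1) is Lambda[(x_i2 - x_i1)'beta], i.e., a logit on differences."""
    keep = y.sum(axis=1) == 1
    dx = x[keep, 1, :] - x[keep, 0, :]          # x_i2 - x_i1
    d = y[keep, 1]                              # 1 if the sequence is (0, 1)
    def negll(b):
        z = dx @ b
        return -(d * z - np.log1p(np.exp(z))).sum()
    return minimize(negll, np.zeros(x.shape[2]), method="BFGS").x

def hausman(b_cml, b_ml, V_cml, V_ml):
    """Hausman statistic (23-42); the constant is already dropped from b_ml."""
    diff = b_cml - b_ml
    return diff @ np.linalg.inv(V_cml - V_ml) @ diff
```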

Example 23.9 Fixed Effects Logit Models: Magazine Prices Revisited
The fixed effects model does have some appeal, but the incidental parameters problem is a significant shortcoming of the unconditional probit and logit estimators. The conditional MLE for the fixed effects logit model is a fairly common approach. A widely cited application of the model is Cecchetti's (1986) analysis of changes in newsstand prices of magazines. Cecchetti's model was

Prob(Price change in year t of magazine i) = Λ(α_j + x_it'β),

where the variables in x_it are (1) time since last price change, (2) inflation since last change, (3) previous fixed price change, (4) current inflation, (5) industry sales growth, and (6) sales volatility. The fixed effect in the model is indexed "j" rather than "i" as it is defined as a three-year interval for magazine i. Thus, a magazine that had been on the newsstands for nine years would have three constants, not just one. In addition to estimating several specifications of the price change model, Cecchetti used the Hausman test in (23-42) to test for the existence of the common effects. Some of Cecchetti's results appear in Table 23.7.
Willis (2006) argued that Cecchetti's estimates were inconsistent and the Hausman test is invalid because right-hand-side variables (1), (2), and (6) are all functions of lagged dependent variables. This state dependence invalidates the use of the sum of the observations for the group as a sufficient statistic in the Chamberlain estimator and the Hausman tests. He proposes, instead, a method suggested by Heckman and Singer (1984b) to incorporate the unobserved heterogeneity in the unconditional likelihood function. The Heckman and Singer model can be formulated as a latent class model (see Sections 16.9.7 and 23.5.3) in which the classes are defined by different constant terms—the remaining parameters in the model are constrained to be equal across classes. Willis fit the Heckman and Singer model with

27This produces a difficulty for this estimator that is shared by the semiparametric estimators discussed in the next section. Because the fixed effects are not estimated, it is not possible to compute probabilities or marginal effects with these estimated coefficients, and it is a bit ambiguous what one can do with the results of the computations. The brute force estimator that actually computes the individual effects might be preferable.
28Hsiao (2003) derives the result explicitly for some particular cases.


TABLE 23.7 Models for Magazine Price Changes (standard errors in parentheses)

              Pooled          Unconditional    Conditional FE   Conditional FE   Heckman
                              FE               Cecchetti        Willis           and Singer
β1           −1.10 (0.03)    −0.07 (0.03)      1.12 (3.66)      1.02 (0.28)     −0.09 (0.04)
β2            6.93 (1.12)     8.83 (1.25)     11.57 (1.68)     19.20 (7.51)      8.23 (1.53)
β5           −0.36 (0.98)    −1.14 (1.06)      5.85 (1.76)      7.60 (3.46)     −0.13 (1.14)
Constant 1   −1.90 (0.14)                                                       −1.94 (0.20)
Constant 2                                                                     −29.15 (1.1e11)
ln L        −500.45         −473.18           −82.91           −83.72          −499.65
Sample size  1026            1026              543                              1026

two classes to a restricted version of Cecchetti's model using variables (1), (2), and (5). The results in Table 23.7 show some of the results from Willis's Table I. (Willis reports that he could not reproduce Cecchetti's results—the ones in Cecchetti's second column would be the counterparts—because of some missing values. In fact, Willis's estimates are quite far from Cecchetti's results, so it will be difficult to compare them. Both are reported here.) The two "mass points" reported by Willis are shown in Table 23.7. He reports that these two values (−1.94 and −29.15) correspond to class probabilities of 0.88 and 0.12, though it is difficult to make the translation based on the reported values. He does note that the change in the log-likelihood in going from one mass point (pooled logit model) to two is marginal, only from −500.45 to −499.65. There is another anomaly in the results that is consistent with this finding. The reported standard error for the second "mass point" is 1.1 × 10^11, or essentially +∞. The finding is consistent with overfitting the latent class model. The results suggest that the better model is a one-class (pooled) model.

23.5.3 MODELING HETEROGENEITY

The panel data analysis considered thus far has focused on modeling heterogeneity with the fixed and random effects specifications. Both assume that the heterogeneity is continuously distributed among individuals. The random effects model is fully parametric, requiring a full specification of the likelihood for estimation. The fixed effects model is essentially semiparametric. It requires no specific distributional assumption; however, it does require that the realizations of the latent heterogeneity be treated as parameters, either estimated in the unconditional fixed effects estimator or conditioned out of the likelihood function when possible. As noted in the preceding example, Heckman and Singer's (1984b) model provides a less stringent model specification based on a discrete distribution of the latent heterogeneity. A straightforward method of implementing their model is to cast it as a latent class model in which the classes are distinguished by different constant terms and the associated probabilities. The class probabilities are treated as parameters to be estimated with the model parameters.
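A minimal sketch of the resulting log-likelihood for a logit with J discrete mass points for the constant may help fix ideas. The data layout and names are our own conventions, and in practice the class probabilities would be parameterized (for example, as multinomial logits) to keep them in the unit simplex:

```python
import numpy as np

def heckman_singer_loglik(beta, alphas, pis, y, X, groups):
    """Log-likelihood for a logit with a discrete (latent class) distribution
    of the constant: classes differ only in alpha_j, with probabilities pi_j."""
    ll = 0.0
    for rows in groups:
        xb = X[rows] @ beta
        Li = 0.0
        for a_j, p_j in zip(alphas, pis):
            lam = 1.0 / (1.0 + np.exp(-(a_j + xb)))
            # probability of the observed sequence for group i, given class j
            p_seq = np.where(y[rows] == 1, lam, 1.0 - lam).prod()
            Li += p_j * p_seq                 # mix over classes
        ll += np.log(max(Li, 1e-300))         # numerical guard for the log
    return ll
```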

Example 23.10 Semiparametric Models of Heterogeneity
We have extended the random effects and fixed effects logit models in Example 23.8 by fitting the Heckman and Singer (1984b) model. Table 23.8 shows the specification search and the results under different specifications. The first column of results shows the estimated fixed effects model from Example 23.8. The conditional estimates are shown in parentheses. Of the 7,293 groups in the sample, 3,056 are not used in estimation of the fixed effects models because the sum of Doctorit is either 0 or Ti for the group. The mean and standard deviation of the estimated underlying heterogeneity distribution are computed using the estimates of


TABLE 23.8 Estimated Heterogeneity Models

                           Number of Classes
            Fixed Effect   1           2           3            4            5
β1           0.10475       0.020708    0.030325    0.033684     0.034083     0.034159
            (0.084760)
β2          −0.060973     −0.18592     0.025550   −0.0058013   −0.0063516   −0.013627
           (−0.050383)
β3          −0.088407     −0.22947    −0.24708    −0.26388     −0.26590     −0.26626
           (−0.077764)
β4          −0.11671      −0.045588   −0.050924   −0.058022    −0.059751    −0.059176
           (−0.090816)
β5          −0.057318      0.085293    0.042974    0.037944     0.029227     0.030699
           (−0.052072)
α1          −2.62334       0.25111     0.91764     1.71669      1.94536      2.76670
                          (1.00000)   (0.62681)   (0.34838)    (0.29309)    (0.11633)
α2                                    −1.47800    −2.23491     −1.76371      1.18323
                                      (0.37319)   (0.18412)    (0.21714)    (0.26468)
α3                                                −0.28133     −0.036739    −1.96750
                                                  (0.46749)    (0.46341)    (0.19573)
α4                                                             −4.03970     −0.25588
                                                               (0.026360)   (0.40930)
α5                                                                          −6.48191
                                                                            (0.013960)
Mean        −2.62334       0.00000     0.023613    0.055059     0.063685     0.054705
Std. Dev.    3.13415       0.00000     1.158655    1.40723      1.48707      1.62143
ln L        −9458.638     −17673.10   −16353.14   −16278.56    −16276.07    −16275.85
           (−6299.02)
AIC          1.00349       1.29394     1.19748     1.19217      1.19213      1.19226

αi for the remaining 4,237 groups. The remaining five columns in the table show the results for different numbers of latent classes in the Heckman and Singer model. The listed constant terms are the “mass points” of the underlying distributions. The associated class probabilities are shown in parentheses under them. The mean and standard deviation are derived from the 2- to 5-point discrete distributions shown. It is noteworthy that the mean of the distribution is relatively stable, but the standard deviation rises monotonically. The search for the best model would be based on the AIC. As noted in Section 16.9.6, using a likelihood ratio test in this context is dubious, as the number of degrees of freedom is ambiguous. Based on the AIC, the four-class model is the preferred specification.
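The AIC column in Table 23.8 appears to be reported per observation. Under that assumption, AIC = (−2 ln L + 2K)/n with n = 27,326 and K counting the five slopes, the J mass points, and the J − 1 free class probabilities reproduces the table to the printed digits:

```python
n = 27326

def aic_per_obs(lnL, K):
    """Per-observation Akaike information criterion (assumed convention)."""
    return (-2.0 * lnL + 2.0 * K) / n

# J classes: 5 slopes + J mass points + (J - 1) free class probabilities
for J, lnL in [(1, -17673.10), (2, -16353.14), (3, -16278.56),
               (4, -16276.07), (5, -16275.85)]:
    print(J, f"{aic_per_obs(lnL, 5 + J + (J - 1)):.5f}")
# prints 1.29394, 1.19748, 1.19217, 1.19213, 1.19226, matching Table 23.8
```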

23.5.4 PARAMETER HETEROGENEITY In Chapter 9, we examined specifications that extend the underlying heterogeneity to all the parameters of the model. We have considered two approaches. The random parameters, or mixed models discussed in Chapter 17 allow parameters to be distributed continuously across individuals. The latent class model in Section 16.9.6 specifies a discrete distribution instead. (The Heckman and Singer model in the previous section applies this method to the constant term.) Most of the focus to this point, save for Example 16.16, has been on linear models. However, as the next example demonstrates, the same methods can be applied to nonlinear models, such as the discrete choice models. Greene-50558 book June 25, 2007 11:6


TABLE 23.9 Estimated Heterogeneous Parameter Models (standard errors in parentheses)

            Pooled          Random Parameters            Latent Class
Variable    Estimate: β     Estimate: β    Estimate: σ   Estimate: β   Estimate: β   Estimate: β
Constant     0.25111        −0.034964      0.81651        0.96605      −0.18579      −1.52595
            (0.091135)      (0.075533)    (0.016542)     (0.43757)     (0.23907)     (0.43498)
Age          0.020709        0.026306      0.025330       0.049058      0.032248      0.019981
            (0.0012852)     (0.0011038)   (0.0004226)    (0.0069455)   (0.0031462)   (0.0062550)
Income      −0.18592        −0.0043649     0.10737       −0.27917      −0.068633      0.45487
            (0.075064)      (0.062445)    (0.038276)     (0.37149)     (0.16748)     (0.31153)
Kids        −0.22947        −0.17461       0.55520       −0.28385      −0.28336      −0.11708
            (0.029537)      (0.024522)    (0.023866)     (0.14279)     (0.066404)    (0.12363)
Education   −0.045588       −0.040510      0.037915      −0.025301     −0.057335     −0.09385
            (0.0056465)     (0.0047520)   (0.0013416)    (0.027768)    (0.012465)    (0.027965)
Married      0.085293        0.014618      0.070696      −0.10875       0.025331      0.23571
            (0.033286)      (0.027417)    (0.017362)     (0.17228)     (0.075929)    (0.14369)
Class Prob.  1.00000         1.00000                      0.34833       0.46181       0.18986
            (0.00000)       (0.00000)                    (0.038495)    (0.028062)    (0.022335)
ln L       −17673.10       −16271.72                    −16265.59

Example 23.11 Parameter Heterogeneity in a Binary Choice Model
We have extended the logit model for doctor visits from Example 23.10 to allow the parameters to vary randomly across individuals. The random parameters logit model is

Prob(Doctor_it = 1) = Λ(β_1i + β_2i Age_it + β_3i Income_it + β_4i Kids_it + β_5i Educ_it + β_6i Married_it),

where the two models for the parameter variation we have employed are:

Continuous:  β_ki = β_k + σ_k u_ki,  u_ki ~ N[0, 1],  k = 1, ..., 6,  Cov[u_ki, u_mi] = 0,
Discrete:    β_ki = β_k^1 with probability π_1,
                    β_k^2 with probability π_2,
                    β_k^3 with probability π_3.

We have chosen a three-class latent class model for the illustration. In an application, one might undertake a systematic search, such as in Example 23.10, to find a preferred specification. Table 23.9 presents the fixed parameter (pooled) logit model and the two random parameters versions. (There are infinite variations on these specifications that one might explore—see Chapter 17 for discussion—we have shown only the simplest to illustrate the models.29 A more elaborate specification appears in Section 23.11.7.) Figure 23.3 shows the implied distribution for the coefficient on age. For the continuous distribution, we have simply plotted the normal density. For the discrete distribution, we first obtained the mean (0.0358) and standard deviation (0.0107). Notice that the distribution is tighter than the estimated continuous normal (mean 0.026, standard deviation 0.0253). To suggest the variation of the parameter (purely for purposes of the display, because the distribution is discrete), we placed the mass of the center interval, 0.462, between the midpoints of the intervals between the center mass point and the two extremes. With a width of 0.0145 the density is 0.461/0.0145 = 31.8. We used the same interval widths for the outer segments. This range of variation covers about five standard deviations of the distribution.
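The quoted mean and standard deviation of the discrete distribution, and the density height used for the display, follow directly from the Table 23.9 estimates for the Age coefficient:

```python
import numpy as np

# Latent class estimates of the Age coefficient and class probabilities (Table 23.9)
b = np.array([0.049058, 0.032248, 0.019981])
pi = np.array([0.34833, 0.46181, 0.18986])

mean = (pi * b).sum()                          # about 0.0358
std = np.sqrt((pi * b**2).sum() - mean**2)     # about 0.0107
print(mean, std)

# height of the center bar in the display: class mass / interval width
print(0.461 / 0.0145)                          # about 31.8
```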

29We have arrived (once again) at a point where the question of replicability arises. Nonreplicability is an ongoing challenge in empirical work in economics. (See, e.g., Example 23.9.) The problem is particularly acute in analyses that involve simulation such as Monte Carlo studies and random parameter models. In the interest of replicability, we note that the random parameter estimates in Table 23.9 were computed with NLOGIT [Econometric Software (2007)] and are based on 50 Halton draws. We used the first six sequences (prime numbers 2, 3, 5, 7, 11, 13) and discarded the first 10 draws in each sequence.


[FIGURE 23.3 Distributions of Income Coefficient. Density plot of the estimated coefficient distributions (horizontal axis labeled B_INCOME); the image itself could not be recovered from the extraction.]

23.6 SEMIPARAMETRIC ANALYSIS

In his survey of qualitative response models, Amemiya (1981) reports the following widely cited approximations for the linear probability (LP) model: Over the range of probabilities of 30 to 70 percent,

β̂_LP ≈ 0.4 β_probit for the slopes,
β̂_LP ≈ 0.25 β_logit for the slopes.

Aside from confirming our intuition that least squares approximates the nonlinear model and providing a quick comparison for the three models involved, the practical usefulness of the formula is somewhat limited. Still, it is a striking result.30 A series of studies has focused on reasons why the least squares estimates should be proportional to the probit and logit estimates. A related question concerns the problems associated with assuming that a probit model applies when, in fact, a logit model is appropriate or vice versa.31 The approximation would seem to suggest that with this type of misspecification, we would once again obtain a scaled version of the correct coefficient vector. (Amemiya also reports the widely observed relationship β̂_logit ≈ 1.6 β̂_probit, which follows from the results for the linear probability model.)
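The approximations amount to simple rescalings, and they are mutually consistent (0.25 × 1.6 = 0.4). As a trivial sketch:

```python
def amemiya_approximations(beta_probit):
    """Rough conversions among binary choice slope estimates (Amemiya, 1981)."""
    beta_logit = 1.6 * beta_probit   # logit is roughly 1.6 x probit
    beta_lp = 0.4 * beta_probit      # LP slopes roughly 0.4 x probit (= 0.25 x logit)
    return beta_logit, beta_lp
```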

30This result does not imply that it is useful to report 2.5 times the linear probability estimates with the probit estimates for comparability. The linear probability estimates are already in the form of marginal effects, whereas the probit coefficients must be scaled downward. If the sample proportion happens to be close to 0.5, then the right scale factor will be roughly φ[Φ^{-1}(0.5)] = 0.3989. But the density falls rapidly as P moves away from 0.5.
31See Ruud (1986) and Gourieroux et al. (1987).


Greene (1983), building on Goldberger (1981), finds that if the probit model is correctly specified and if the regressors are themselves joint normally distributed, then the probability limit of the least squares estimator is a multiple of the true coefficient vector.32 Greene’s result is useful only for the same purpose as Amemiya’s quick correction of OLS. Multivariate normality is obviously inconsistent with most appli- cations. For example, nearly all applications include at least one dummy variable. Ruud (1982) and Chung and Goldberger (1984), however, have shown that much weaker conditions than joint normality will produce the same proportionality result. For a pro- bit model, Chung and Goldberger require only that E [x | y∗] be linear in y∗. Several authors have built on these observations to pursue the issue of what circumstances will lead to proportionality results such as these. Ruud (1986) and Stoker (1986) have ex- tended them to a very wide class of models that goes well beyond those of Chung and Goldberger. Curiously enough, Stoker’s results rule out dummy variables, but it is those for which the proportionality result seems to be most robust.33

23.6.1 SEMIPARAMETRIC ESTIMATION

The fully parametric probit and logit models remain by far the mainstays of empirical research on binary choice. Fully nonparametric discrete choice models are fairly exotic and have made only limited inroads in the literature, and much of that literature is theoretical [e.g., Matzkin (1993)]. The primary obstacle to application is their paucity of interpretable results. (See Example 23.12.) Of course, one could argue on this basis that the firm results produced by the fully parametric models are merely fragile artifacts of the detailed specification, not genuine reflections of some underlying truth. [In this connection, see Manski (1995).] But that orthodox view raises the question of what motivates the study to begin with and what one hopes to learn by embarking upon it. The intent of model building to approximate reality so as to draw useful conclusions is hardly limited to the analysis of binary choices. Semiparametric estimators represent a middle ground between these extreme views.34 The single index model of Klein and Spady (1993) has been used in several applications, including Gerfin (1996), Horowitz (1993), and Fernandez and Rodriguez-Poo (1997).35 The single index formulation departs from a linear "regression" formulation,

E[y_i | x_i] = E[y_i | x_i'β].

Then

Prob(y_i = 1 | x_i) = F(x_i'β | x_i) = G(x_i'β),

where G is an unknown continuous distribution function whose range is [0, 1]. The function G is not specified a priori; it is estimated along with the parameters. (Since G as well as β is to be estimated, a constant term is not identified; essentially, G provides

32The scale factor is estimable with the sample data, so under these assumptions, a method of moments estimator is available.
33See Greene (1983).
34Recent proposals for semiparametric estimators in addition to the one developed here include Lewbel (1997, 2000), Lewbel and Honoré (2001), and Altonji and Matzkin (2001). In spite of nearly 10 years of development, this is a nascent literature. The theoretical development tends to focus on root-n consistent coefficient estimation in models which provide no means of computation of probabilities or marginal effects.
35A symposium on the subject is Hardle and Manski (1993).


the location for the index that would otherwise be provided by a constant.) The criterion function for estimation, in which subscripts n denote estimators of their unsubscripted counterparts, is

ln L_n = (1/n) Σ_{i=1}^{n} { y_i ln G_n(x_i'β_n) + (1 − y_i) ln[1 − G_n(x_i'β_n)] }.

The estimator of the probability function, G_n, is computed at each iteration using a nonparametric kernel estimator of the density of x_i'β_n; we did this calculation in Section 14.4. For the Klein and Spady estimator, the nonparametric regression estimator is

G_n(z_i) = ȳ g_n(z_i | y_i = 1) / [ ȳ g_n(z_i | y_i = 1) + (1 − ȳ) g_n(z_i | y_i = 0) ],

where g_n(z_i | y_i) is the kernel density estimator of the density of z_i = x_i'β_n. This result is

g_n(z_i | y_i = 1) = (1/(n ȳ h_n)) Σ_{j=1}^{n} y_j K[(z_i − x_j'β_n)/h_n];

g_n(z_i | y_i = 0) is obtained by replacing ȳ with 1 − ȳ in the leading scalar and y_j with 1 − y_j in the summation. As before, h_n is the bandwidth. There is no firm theory for choosing the kernel function or the bandwidth. Both Horowitz and Gerfin used the standard normal density. Two different methods for choosing the bandwidth are suggested by them.36 Klein and Spady provide theoretical background for computing asymptotic standard errors.
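A minimal sketch of the Klein and Spady criterion may clarify the mechanics. This version uses a standard normal kernel whose normalizing constant cancels in the ratio defining G_n, keeps the own observation in the sums (a leave-one-out version is common in practice), and treats the bandwidth as given; all names are our own:

```python
import numpy as np

def klein_spady_loglik(beta, y, X, h):
    """Semiparametric log-likelihood for the single index model.
    G is the kernel estimate of Prob(y = 1 | z) at z = x'beta."""
    z = X @ beta
    ybar = y.mean()
    n = len(y)
    K = np.exp(-0.5 * ((z[:, None] - z[None, :]) / h) ** 2)   # n x n kernel weights
    g1 = (K * y[None, :]).sum(axis=1) / (n * ybar * h)        # g_n(z_i | y = 1)
    g0 = (K * (1.0 - y)[None, :]).sum(axis=1) / (n * (1.0 - ybar) * h)
    G = ybar * g1 / (ybar * g1 + (1.0 - ybar) * g0)
    G = np.clip(G, 1e-12, 1.0 - 1e-12)                        # guard the logs
    return (y * np.log(G) + (1.0 - y) * np.log(1.0 - G)).mean()
```

Maximizing this over β (with some normalization, since the scale and location of the index are not identified) gives the Klein and Spady estimates.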

Example 23.12 A Comparison of Binary Choice Estimators
Gerfin (1996) did an extensive analysis of several binary choice estimators, the probit model, Klein and Spady's single index model, and Horowitz's smoothed maximum score (MSCORE) estimator. [The MSCORE estimator is the (nonunique) vector b_M computed such that the sign of x_i'b_M is the same as that of (2y_i − 1) the maximum number of times. See Manski and Thompson (1986).] (A fourth "seminonparametric" estimator was also examined, but in the interest of brevity, we confine our attention to the three more widely used procedures.) The several models were all fit to two data sets on labor force participation of married women, one from Switzerland and one from Germany. Variables included in the equation were (our notation), x1 = a constant, x2 = age, x3 = age², x4 = education, x5 = number of young children, x6 = number of older children, x7 = log of yearly nonlabor income, and x8 = a dummy variable for permanent foreign resident (Swiss data only). Coefficient estimates for the models are not directly comparable. We suggested in Example 23.3 that they could be made comparable by transforming them to marginal effects. Neither MSCORE nor the single index model, however, produces a marginal effect (which does suggest a question of interpretation). The author obtained comparability by dividing all coefficients by the absolute value of the coefficient on x7. The set of normalized coefficients estimated for the Swiss data appears in Table 23.10, with estimated standard errors (from Gerfin's Table III) shown in parentheses. Given the very large differences in the models, the agreement of the estimates is impressive. [A similar comparison of the same estimators with comparable concordance may be found in Horowitz (1993, p. 56).] In every case, the standard error of the probit estimator is smaller than that of the others. It is tempting to conclude that it is a more efficient estimator, but that is true only if the normal distribution assumed for the model is correct. In any event,

36The function G_n(z) involves an enormous amount of computation, on the order of n², in principle. As Gerfin (1996) observes, however, computation of the kernel estimator can be cast as a Fourier transform, for which the fast Fourier transform reduces the amount of computation to the order of n log₂ n. This value is only slightly larger than linear in n. See Press et al. (1986) and Gerfin (1996).


TABLE 23.10 Estimated Parameters for Semiparametric Models

          x1       x2       x3       x4       x5       x6       x7       x8       h
Probit    5.62     3.11    −0.44     0.03    −1.07    −0.22    −1.00     1.07     —
         (1.35)   (0.77)   (0.10)   (0.03)   (0.26)   (0.09)    —       (0.29)
Single    —        2.98    −0.44     0.02    −1.32    −0.25    −1.00     1.06     0.40
index     —       (0.90)   (0.12)   (0.03)   (0.33)   (0.11)    —       (0.32)
MSCORE    5.83     2.84    −0.40     0.03    −0.80    −0.16    −1.00     0.91     0.70
         (1.78)   (0.98)   (0.13)   (0.05)   (0.43)   (0.20)    —       (0.57)

the smaller standard error is the payoff to the sharper specification of the distribution. This payoff could be viewed in much the same way that parametric restrictions in the classical regression make the asymptotic covariance matrix of the restricted least squares estimator smaller than its unrestricted counterpart, even if the restrictions are incorrect. Gerfin then produced plots of F(z) for z in the range of the sample values of x'b. Once again, the functions are surprisingly close. In the German data, however, the Klein–Spady estimator is nonmonotonic over a sizable range, which would cause some difficult problems of interpretation. The maximum score estimator does not produce an estimate of the probability, so it is excluded from this comparison. Another comparison is based on the predictions of the observed response. Two approaches are tried, first counting the number of cases in which the predicted probability exceeds 0.5 (x'b > 0 for MSCORE) and second by summing the sample values of F(x'b). (Once again, MSCORE is excluded.) By the second approach, the estimators are almost indistinguishable, but the results for the first differ widely. Of 401 ones (out of 873 observations), the counts of predicted ones are 389 for probit, 382 for Klein/Spady, and 355 for MSCORE. (The results do not indicate how many of these counts are correct predictions.)

23.6.2 A KERNEL ESTIMATOR FOR A NONPARAMETRIC REGRESSION FUNCTION

As noted, one unsatisfactory aspect of semiparametric formulations such as MSCORE is that the amount of information that the procedure provides about the population is limited; this aspect is, after all, the purpose of dispensing with the firm (parametric) assumptions of the probit and logit models. Thus, in the preceding example, there is little that one can say about the population that generated the data based on the MSCORE "estimates" in the table. The estimates do allow predictions of the response variable. But there is little information about any relationship between the response and the independent variables based on the "estimation" results. Even the mean-squared deviation matrix is suspect as an estimator of the asymptotic covariance matrix of the MSCORE coefficients. The authors of the technique have proposed a secondary analysis of the results. Let

F_β(z_i) = E[y_i | x_i'β = z_i]

denote a smooth regression function for the response variable. Based on a parameter vector β, the authors propose to estimate the regression by the method of kernels as follows. For the n observations in the sample and for the given β (e.g., b_n from MSCORE), let

z_i = x_i'β,
s = [ (1/n) Σ_{i=1}^{n} (z_i − z̄)² ]^{1/2}.


For a particular value z∗, we compute a set of n weights using the kernel function, ∗ ∗ wi (z ) = K[(z − zi )/(λs)], where

K(ri ) = P(ri )[1 − P(ri )], and

P(r_i) = [1 + exp(−c r_i)]^{-1}.

The constant c = (π/√3)^{-1} ≈ 0.55133 is used to standardize the logistic distribution that is used for the kernel function. (See Section 14.4.1.) The parameter λ is the smoothing (bandwidth) parameter. Large values will flatten the estimated function through ȳ, whereas values close to zero will allow greater variation in the function but might cause it to be unstable. There is no good theory for the choice, but some suggestions have been made based on descriptive statistics. [See Wong (1983) and Manski (1986).] Finally, the function value is estimated with

F(z*) ≈ Σ_{i=1}^{n} w_i(z*) y_i / Σ_{i=1}^{n} w_i(z*).

The nonparametric estimator displays a relationship between x_i'β and E[y_i]. At first blush, this relationship might suggest that we could deduce the marginal effects, but unfortunately, that is not the case. The coefficients in this setting are not meaningful, so all we can deduce is an estimate of the density, f(z), by using first differences of the estimated regression function. It might seem, therefore, that the analysis has produced relatively little payoff for the effort. But that should come as no surprise if we reconsider the assumptions we have made to reach this point. The only assumptions made thus far are that for a given vector of covariates x_i and coefficient vector β (i.e., any β), there exists a smooth function F(x'β) = E[y_i | z_i]. We have also assumed, at least implicitly, that the coefficients carry some information about the covariation of x'β and the response variable. The technique will approximate any such function [see Manski (1986)].

There is a large and burgeoning literature on kernel estimation and nonparametric estimation in econometrics. [A recent application is Melenberg and van Soest (1996).] As this simple example suggests, with the radically different forms of the specified model, the information that is culled from the data changes radically as well. The general principle now made evident is that the fewer assumptions one makes about the population, the less precise the information that can be deduced by statistical techniques. That tradeoff is inherent in the methodology.
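A minimal sketch of the estimator just described, with the logistic kernel and standardization defined above (the function name and argument layout are ours):

```python
import numpy as np

C = np.pi / np.sqrt(3.0)        # the text's c = 1/C = 0.55133...

def kernel_regression(z_star, z, y, lam):
    """Weighted-average estimate of F(z*) = E[y | z] using the logistic-density
    kernel K(r) = P(r)[1 - P(r)] with P(r) = 1/(1 + exp(-c r))."""
    s = z.std()                            # (1/n) sum (z - zbar)^2, square-rooted
    r = (z_star - z) / (lam * s)
    P = 1.0 / (1.0 + np.exp(-r / C))       # exp(-c r) with c = (pi/sqrt(3))^{-1}
    w = P * (1.0 - P)                      # kernel weights w_i(z*)
    return (w * y).sum() / w.sum()
```

Evaluating the function on a grid of z* values traces out the estimated regression; first differences of that curve give the density estimate mentioned above.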

23.7 ENDOGENOUS RIGHT-HAND-SIDE VARIABLES IN BINARY CHOICE MODELS

The analysis in Example 23.10 (Magazine Prices Revisited) suggests that the presence of endogenous right-hand-side variables in a binary choice model presents familiar problems for estimation. The problem is made worse in nonlinear models because even


if one has an instrumental variable readily at hand, it may not be immediately clear what is to be done with it. The instrumental variable estimator described in Chapter 12 is based on moments of the data, variances, and covariances. In this binary choice setting, we are not using any form of least squares to estimate the parameters, so the IV method would appear not to apply. Generalized method of moments is a possibility. (This will figure prominently in the analysis in Section 23.9.) Consider the model

y*_i = x_i'β + γw_i + ε_i,
y_i = 1(y*_i > 0),
E[ε_i | w_i] = g(w_i) ≠ 0.

Thus, wi is endogenous in this model. The maximum likelihood estimators considered earlier will not consistently estimate (β,γ). [Without an additional specification that allows us to formalize Prob(yi = 1 | xi , wi ), we cannot state what the MLE will, in fact, estimate.] Suppose that we have a “relevant” (see Section 12.2) instrumental variable, zi such that

E[εi | zi , xi ] = 0,

E[w_i z_i] ≠ 0.

A natural instrumental variable estimator would be based on the "moment" condition

E[ (y*_i − x_i'β − γw_i) (x_i', z_i)' ] = 0.

However, y*_i is not observed; y_i is. But the "residual," y_i − x_i'β − γw_i, would have no meaning even if the true parameters were known.37 One approach that was used in Avery et al. (1983), Butler and Chatterjee (1997), and Bertschek and Lechner (1998) is to assume that the instrumental variable is orthogonal to the residual [y_i − Φ(x_i'β + γw_i)]; that is,

E{ [y_i − Φ(x_i'β + γw_i)] (x_i', z_i)' } = 0.

This form of the moment equation, based on observables, can form the basis of a straightforward two-step GMM estimator. (See Chapter 15 for details.)

The GMM estimator is not less parametric than the full information maximum likelihood estimator described following because the probit model based on the normal distribution is still invoked to specify the moment equation.38 Nothing is gained in simplicity or robustness of this approach to full information maximum likelihood estimation, which we now consider. (As Bertschek and Lechner argue, however, the gains might come in terms of practical implementation and computation time. The same considerations motivated Avery et al.)

This maximum likelihood estimator requires a full specification of the model, including the assumption that underlies the endogeneity of w_i. This becomes essentially

37One would proceed in precisely this fashion if the central specification were a linear probability model (LPM) to begin with. See, for example, Eisenberg and Rowe (2006) or Angrist (2001) for an application and some analysis of this case.
38This is precisely the platform that underlies the GLIM/GEE treatment of binary choice models in, for example, the widely used programs SAS and Stata.


a simultaneous equations model. The model equations are

y*_i = x_i'β + γw_i + ε_i,  y_i = 1[y*_i > 0],
w_i = z_i'α + u_i,
(ε_i, u_i) ~ N[ (0, 0), (1, ρσ_u; ρσ_u, σ_u²) ].

(We are assuming that there is a vector of instrumental variables, z_i.) Probit estimation based on y_i and (x_i, w_i) will not consistently estimate (β, γ) because of the correlation between w_i and ε_i induced by the correlation between u_i and ε_i. Several methods have been proposed for estimation of this model. One possibility is to use the partial reduced form obtained by inserting the second equation in the first. This becomes a probit model with probability Prob(y_i = 1 | x_i, z_i) = Φ(x_i'β* + z_i'α*). This will produce consistent estimates of β* = β/(1 + γ²σ_u² + 2γσ_uρ)^{1/2} and α* = γα/(1 + γ²σ_u² + 2γσ_uρ)^{1/2} as the coefficients on x_i and z_i, respectively. (The procedure will estimate a mixture of β* and α* for any variable that appears in both x_i and z_i.) In addition, linear regression of w_i on z_i produces estimates of α and σ_u², but there is no method of moments estimator of ρ or γ produced by this procedure, so this estimator is incomplete. Newey (1987) suggested a "minimum chi-squared" estimator that does estimate all parameters. A more direct, and actually simpler approach is full information maximum likelihood. The log-likelihood is built up from the joint density of y_i and w_i, which we write as the product of the conditional and the marginal densities,

$$f(y_i, w_i) = f(y_i \mid w_i)\, f(w_i).$$
To derive the conditional distribution, we use results for the bivariate normal, and write
$$\varepsilon_i \mid u_i = [(\rho\sigma_u)/\sigma_u^2]\, u_i + v_i,$$
where $v_i$ is normally distributed with $\mathrm{Var}[v_i] = (1 - \rho^2)$. Inserting this in the first equation, we have
$$y_i^* \mid w_i = \mathbf{x}_i'\boldsymbol\beta + \gamma w_i + (\rho/\sigma_u) u_i + v_i.$$
Therefore,
$$\mathrm{Prob}[y_i = 1 \mid \mathbf{x}_i, w_i] = \Phi\left[\frac{\mathbf{x}_i'\boldsymbol\beta + \gamma w_i + (\rho/\sigma_u) u_i}{\sqrt{1 - \rho^2}}\right]. \tag{23-43}$$
Inserting the expression for $u_i = (w_i - \mathbf{z}_i'\boldsymbol\alpha)$, and using the normal density for the marginal distribution of $w_i$ in the second equation, we obtain the log-likelihood function for the sample,
$$\ln L = \sum_{i=1}^n \ln\Phi\left[(2y_i - 1)\,\frac{\mathbf{x}_i'\boldsymbol\beta + \gamma w_i + (\rho/\sigma_u)(w_i - \mathbf{z}_i'\boldsymbol\alpha)}{\sqrt{1 - \rho^2}}\right] + \ln\left[\frac{1}{\sigma_u}\phi\left(\frac{w_i - \mathbf{z}_i'\boldsymbol\alpha}{\sigma_u}\right)\right].$$

Example 23.13 Labor Supply Model
In Examples 4.3 and 23.1, we examined a labor supply model for married women using Mroz's (1987) data on labor supply. The wife's labor force participation equation suggested in Example 23.1 is
$$\mathrm{Prob}(\mathit{LFP}_i = 1) = \Phi(\beta_1 + \beta_2 \mathit{Age}_i + \beta_3 \mathit{Age}_i^2 + \beta_4 \mathit{Education}_i + \beta_5 \mathit{Kids}_i).$$
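A direct way to see the structure of this log-likelihood is to code it. The sketch below maximizes exactly the expression above on simulated data, with $\sigma_u$ and $\rho$ reparameterized to enforce $\sigma_u > 0$ and $|\rho| < 1$; the data-generating values and optimizer choice are illustrative assumptions.

```python
# FIML for the probit model with a continuous endogenous regressor (23-43).
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 1500
zmat = np.column_stack([np.ones(n), rng.normal(size=n)])
x = np.column_stack([np.ones(n), rng.normal(size=n)])
rho_true, sig_true = -0.4, 1.0
u = sig_true * rng.normal(size=n)
eps = rho_true * u / sig_true + np.sqrt(1 - rho_true**2) * rng.normal(size=n)
w = zmat @ np.array([1.0, 0.5]) + u
y = ((x @ np.array([-0.5, 0.8]) + 0.5 * w + eps) > 0).astype(float)

def negloglik(theta):
    b, g, a = theta[0:2], theta[2], theta[3:5]
    su = np.exp(theta[5])                  # sigma_u > 0
    r = np.tanh(theta[6])                  # |rho| < 1
    ui = w - zmat @ a
    arg = (x @ b + g * w + (r / su) * ui) / np.sqrt(1 - r**2)
    ll = norm.logcdf((2 * y - 1) * arg) + norm.logpdf(ui / su) - np.log(su)
    return -ll.sum()

res = minimize(negloglik, np.zeros(7), method="BFGS")
print("beta, gamma, alpha:", res.x[:5],
      "sigma_u:", np.exp(res.x[5]), "rho:", np.tanh(res.x[6]))
```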


TABLE 23.11 Estimated Labor Supply Model (standard errors in parentheses)

Wife's participation equation
Variable             Probit                    Maximum Likelihood
Constant            −3.86704 (1.41153)        −5.08405 (1.43134)
Age                  0.18681 (0.065901)        0.17108 (0.063321)
Age²                −0.00243 (0.000774)       −0.00219 (0.0007629)
Education            0.11098 (0.021663)        0.09037 (0.029041)
Kids                −0.42652 (0.13074)        −0.40202 (0.12967)
Husband hours       −0.000173 (0.0000797)      0.00055 (0.000482)

Husband's hours equation
Variable             Regression                Maximum Likelihood
Constant          2325.38 (167.515)          2424.90 (158.152)
Husband age         −6.71056 (2.73573)        −7.3343 (2.57979)
Husband education    9.29051 (7.87278)         2.1465 (7.28048)
Family income       55.72534 (19.14917)       63.4669 (18.61712)
σ_u                588.2355                   586.994
ρ                    0.0000                   −0.4221 (0.26931)
ln L              −489.0766 (probit), −5868.432 (regression)    −6357.093 (FIML)

A natural extension of this model would be to include the husband's hours in the equation,
$$\mathrm{Prob}(\mathit{LFP}_i = 1) = \Phi(\beta_1 + \beta_2 \mathit{Age}_i + \beta_3 \mathit{Age}_i^2 + \beta_4 \mathit{Education}_i + \beta_5 \mathit{Kids}_i + \gamma\, \mathit{HHrs}_i).$$
It would also be natural to assume that the husband's hours would be correlated with the determinants (observed and unobserved) of the wife's labor force participation. The auxiliary equation might be
$$\mathit{HHrs}_i = \alpha_1 + \alpha_2 \mathit{HAge}_i + \alpha_3 \mathit{HEducation}_i + \alpha_4 \mathit{FamilyIncome}_i + u_i.$$
As before, we use the Mroz (1987) labor supply data described in Example 4.3. Table 23.11 reports the single-equation and maximum likelihood estimates of the parameters of the two equations. Comparing the two sets of probit estimates, it appears that the (assumed) endogeneity of the husband's hours is not substantially affecting the estimates. There are two simple ways to test the hypothesis that ρ equals zero. The FIML estimator produces an estimated asymptotic standard error with the estimate of ρ, so a Wald test can be carried out. For the preceding results, the Wald statistic would be (−0.4221/0.26931)² = 2.457. The critical value from the chi-squared table for one degree of freedom would be 3.84, so we would not reject the hypothesis. The second approach would use the likelihood ratio test. Under the null hypothesis of exogeneity, the probit model and the regression equation can be estimated independently. The log-likelihood for the full model would be the sum of the two log-likelihoods, which would be −6357.508 based on the results in Table 23.11. Without the restriction ρ = 0, the combined log-likelihood is −6357.093. Twice the difference is 0.831, which is also well under the 3.84 critical value, so on this basis as well, we would not reject the null hypothesis that ρ = 0.
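Both test statistics can be verified directly from the reported values (we use the standard error 0.26931 from Table 23.11):

```python
# Wald and LR tests of rho = 0 in the labor supply example.
rho_hat, se_rho = -0.4221, 0.26931
wald = (rho_hat / se_rho) ** 2                 # approx. 2.46, below 3.84
lnL_probit, lnL_reg, lnL_fiml = -489.0766, -5868.432, -6357.093
lr = 2 * (lnL_fiml - (lnL_probit + lnL_reg))   # approx. 0.83, also below 3.84
print(wald, lr)
```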

Blundell and Powell (2004) label the foregoing the control function approach to accommodating the endogeneity. As noted, the estimator is fully parametric. They propose an alternative semiparametric approach that retains much of the functional form specification, but works around the specific distributional assumptions. Adapting their model to our earlier notation, their departure point is a general specification that produces, once again, a control function,
$$E[y_i \mid \mathbf{x}_i, w_i, u_i] = F(\mathbf{x}_i'\boldsymbol\beta + \gamma w_i, u_i).$$
Note that (23-43) satisfies the assumption; however, they reach this point without assuming either joint or marginal normality. The authors propose a three-step, semiparametric


approach to estimating the structural parameters. In an application somewhat similar to Example 23.13, they apply the technique to a labor force participation model for British men in which a variable of interest is a dummy variable for education greater than 16 years, the endogenous variable in the participation equation, also of interest, is earned income of the spouse, and an instrumental variable is a welfare benefit enti- tlement. Their findings are rather more substantial than ours; they find that when the endogeneity of other family income is accommodated in the equation, the education coefficient increases by 40 percent and remains significant, but the coefficient on other income increases by more than tenfold. The case in which the endogenous variable in the main equation is, itself, a binary variable occupies a large segment of the recent literature. Consider the model

$$y_i^* = \mathbf{x}_i'\boldsymbol\beta + \gamma T_i + \varepsilon_i, \qquad y_i = \mathbf{1}(y_i^* > 0), \qquad E[\varepsilon_i \mid T_i] \neq 0,$$

where $T_i$ is a binary variable indicating some kind of program participation (e.g., graduating from high school or college, receiving some kind of job training, etc.). The model in this form (and several similar ones) is a "treatment effects" model. The main object of estimation is $\gamma$ (at least superficially). In these settings, the observed outcome may be $y_i^*$ (e.g., income or hours) or $y_i$ (e.g., labor force participation). The preceding analysis has suggested that problems of endogeneity will intervene in either case. The subject of treatment effects models is surveyed in many studies, including Angrist (2001). We will examine this model in some detail in Chapter 24.

23.8 BIVARIATE PROBIT MODELS

In Chapter 10, we analyzed a number of different multiple-equation extensions of the classical and generalized regression model. A natural extension of the probit model would be to allow more than one equation, with correlated disturbances, in the same spirit as the seemingly unrelated regressions model. The general specification for a two-equation model would be
$$y_1^* = \mathbf{x}_1'\boldsymbol\beta_1 + \varepsilon_1, \qquad y_1 = 1 \text{ if } y_1^* > 0,\ 0 \text{ otherwise},$$
$$y_2^* = \mathbf{x}_2'\boldsymbol\beta_2 + \varepsilon_2, \qquad y_2 = 1 \text{ if } y_2^* > 0,\ 0 \text{ otherwise},$$
$$E[\varepsilon_1 \mid \mathbf{x}_1, \mathbf{x}_2] = E[\varepsilon_2 \mid \mathbf{x}_1, \mathbf{x}_2] = 0, \tag{23-44}$$
$$\mathrm{Var}[\varepsilon_1 \mid \mathbf{x}_1, \mathbf{x}_2] = \mathrm{Var}[\varepsilon_2 \mid \mathbf{x}_1, \mathbf{x}_2] = 1,$$
$$\mathrm{Cov}[\varepsilon_1, \varepsilon_2 \mid \mathbf{x}_1, \mathbf{x}_2] = \rho.$$

23.8.1 MAXIMUM LIKELIHOOD ESTIMATION

The bivariate normal cdf is
$$\mathrm{Prob}(X_1 < x_1, X_2 < x_2) = \int_{-\infty}^{x_2}\int_{-\infty}^{x_1} \phi_2(z_1, z_2, \rho)\, dz_1\, dz_2,$$


which we denote $\Phi_2(x_1, x_2, \rho)$.39 The density is
$$\phi_2(x_1, x_2, \rho) = \frac{e^{-(1/2)(x_1^2 + x_2^2 - 2\rho x_1 x_2)/(1 - \rho^2)}}{2\pi(1 - \rho^2)^{1/2}}.$$
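Both functions are easy to evaluate numerically; the sketch below assumes SciPy's multivariate normal cdf (a numerical integrator) for $\Phi_2$ and codes $\phi_2$ directly from the formula.

```python
# Bivariate standard normal density and cdf.
import numpy as np
from scipy.stats import multivariate_normal, norm

def phi2(x1, x2, rho):
    """Density, written directly from the formula above."""
    return np.exp(-0.5 * (x1**2 + x2**2 - 2 * rho * x1 * x2) / (1 - rho**2)) \
           / (2 * np.pi * np.sqrt(1 - rho**2))

def Phi2(x1, x2, rho):
    """Cdf via scipy's numerical integrator."""
    cov = [[1.0, rho], [rho, 1.0]]
    return multivariate_normal(mean=[0.0, 0.0], cov=cov).cdf([x1, x2])

# Sanity check: with rho = 0 the cdf factors into the univariate marginals.
print(Phi2(0.5, -0.2, 0.0), norm.cdf(0.5) * norm.cdf(-0.2))
print(phi2(0.5, -0.2, 0.6))
```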

To construct the log-likelihood, let $q_{i1} = 2y_{i1} - 1$ and $q_{i2} = 2y_{i2} - 1$. Thus, $q_{ij} = 1$ if $y_{ij} = 1$ and $-1$ if $y_{ij} = 0$ for $j = 1$ and 2. Now let $z_{ij} = \mathbf{x}_{ij}'\boldsymbol\beta_j$ and $w_{ij} = q_{ij} z_{ij}$, $j = 1, 2$, and

$$\rho_{i*} = q_{i1} q_{i2} \rho.$$
Note the notational convention. The subscript 2 is used to indicate the bivariate normal distribution in the density $\phi_2$ and cdf $\Phi_2$. In all other cases, the subscript 2 indicates the variables in the second equation. As before, $\phi(\cdot)$ and $\Phi(\cdot)$ without subscripts denote the univariate standard normal density and cdf. The probabilities that enter the likelihood function are

$$\mathrm{Prob}(Y_1 = y_{i1}, Y_2 = y_{i2} \mid \mathbf{x}_1, \mathbf{x}_2) = \Phi_2(w_{i1}, w_{i2}, \rho_{i*}),$$
which accounts for all the necessary sign changes needed to compute probabilities for $y$'s equal to zero and one. Thus,40
$$\ln L = \sum_{i=1}^n \ln \Phi_2(w_{i1}, w_{i2}, \rho_{i*}).$$
The derivatives of the log-likelihood then reduce to
$$\frac{\partial \ln L}{\partial \boldsymbol\beta_j} = \sum_{i=1}^n \frac{q_{ij}\, g_{ij}}{\Phi_2}\,\mathbf{x}_{ij},\quad j = 1, 2, \qquad \frac{\partial \ln L}{\partial \rho} = \sum_{i=1}^n \frac{q_{i1} q_{i2}\, \phi_2}{\Phi_2}, \tag{23-45}$$
where
$$g_{i1} = \phi(w_{i1})\,\Phi\left[\frac{w_{i2} - \rho_{i*} w_{i1}}{\sqrt{1 - \rho_{i*}^2}}\right] \tag{23-46}$$
and the subscripts 1 and 2 in $g_{i1}$ are reversed to obtain $g_{i2}$. Before considering the Hessian, it is useful to note what becomes of the preceding if $\rho = 0$. For $\partial \ln L/\partial \boldsymbol\beta_1$, if $\rho = \rho_{i*} = 0$, then $g_{i1}$ reduces to $\phi(w_{i1})\Phi(w_{i2})$, $\phi_2$ is $\phi(w_{i1})\phi(w_{i2})$, and $\Phi_2$ is $\Phi(w_{i1})\Phi(w_{i2})$. Inserting these results in (23-45) with $q_{i1}$ and $q_{i2}$ produces (23-21). Because both functions in $\partial \ln L/\partial \rho$ factor into the product of the univariate functions, $\partial \ln L/\partial \rho$ reduces to $\sum_{i=1}^n \lambda_{i1}\lambda_{i2}$, where $\lambda_{ij}$, $j = 1, 2$, is defined in (23-21). (This result will reappear in the LM statistic shown later.)
The maximum likelihood estimates are obtained by simultaneously setting the three derivatives to zero. The second derivatives are relatively straightforward but tedious.
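A minimal sketch of this log-likelihood, fit to simulated data, is below. It exploits the fact that $\rho_{i*}$ takes only the two values $\pm\rho$, so $\Phi_2$ can be evaluated in two vectorized batches; the data, starting values, and optimizer choice are illustrative assumptions.

```python
# Bivariate probit MLE on simulated data.
import numpy as np
from scipy.stats import multivariate_normal
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n = 400
x1 = np.column_stack([np.ones(n), rng.normal(size=n)])
x2 = np.column_stack([np.ones(n), rng.normal(size=n)])
C = np.linalg.cholesky(np.array([[1.0, 0.5], [0.5, 1.0]]))
e = rng.normal(size=(n, 2)) @ C.T
y1 = ((x1 @ np.array([0.5, 1.0]) + e[:, 0]) > 0).astype(float)
y2 = ((x2 @ np.array([-0.2, 0.8]) + e[:, 1]) > 0).astype(float)

def negloglik(theta):
    b1, b2, r = theta[:2], theta[2:4], np.tanh(theta[4])   # |rho| < 1 via tanh
    q1, q2 = 2 * y1 - 1, 2 * y2 - 1
    w = np.column_stack([q1 * (x1 @ b1), q2 * (x2 @ b2)])
    s = q1 * q2
    ll = 0.0
    for sign in (1.0, -1.0):                               # the two values of rho_i*
        m = s == sign
        if m.any():
            mvn = multivariate_normal(mean=[0.0, 0.0],
                                      cov=[[1.0, sign * r], [sign * r, 1.0]])
            ll += np.log(mvn.cdf(w[m])).sum()
    return -ll

res = minimize(negloglik, np.zeros(5), method="Nelder-Mead",
               options={"maxiter": 3000, "fatol": 1e-6})
print("beta1:", res.x[:2], "beta2:", res.x[2:4], "rho:", np.tanh(res.x[4]))
```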

39 See Section B.9.
40 To avoid further ambiguity, and for convenience, the observation subscript will be omitted from $\Phi_2 = \Phi_2(w_{i1}, w_{i2}, \rho_{i*})$ and from $\phi_2 = \phi_2(w_{i1}, w_{i2}, \rho_{i*})$.


Some simplifications are useful. Let
$$\delta_i = \frac{1}{\sqrt{1 - \rho_{i*}^2}},$$
$$v_{i1} = \delta_i(w_{i2} - \rho_{i*} w_{i1}), \quad\text{so } g_{i1} = \phi(w_{i1})\Phi(v_{i1}),$$
$$v_{i2} = \delta_i(w_{i1} - \rho_{i*} w_{i2}), \quad\text{so } g_{i2} = \phi(w_{i2})\Phi(v_{i2}).$$
By multiplying it out, you can show that
$$\delta_i\, \phi(w_{i1})\phi(v_{i1}) = \delta_i\, \phi(w_{i2})\phi(v_{i2}) = \phi_2.$$

Then
$$\frac{\partial^2 \ln L}{\partial\boldsymbol\beta_1\,\partial\boldsymbol\beta_1'} = \sum_{i=1}^n \mathbf{x}_{i1}\mathbf{x}_{i1}'\left[\frac{-w_{i1} g_{i1} - \rho_{i*}\phi_2}{\Phi_2} - \left(\frac{g_{i1}}{\Phi_2}\right)^2\right],$$
$$\frac{\partial^2 \ln L}{\partial\boldsymbol\beta_1\,\partial\boldsymbol\beta_2'} = \sum_{i=1}^n q_{i1} q_{i2}\,\mathbf{x}_{i1}\mathbf{x}_{i2}'\left[\frac{\phi_2}{\Phi_2} - \frac{g_{i1} g_{i2}}{\Phi_2^2}\right], \tag{23-47}$$
$$\frac{\partial^2 \ln L}{\partial\boldsymbol\beta_1\,\partial\rho} = \sum_{i=1}^n q_{i2}\,\mathbf{x}_{i1}\,\frac{\phi_2}{\Phi_2}\left[\rho_{i*}\delta_i v_{i1} - w_{i1} - \frac{g_{i1}}{\Phi_2}\right],$$
$$\frac{\partial^2 \ln L}{\partial\rho^2} = \sum_{i=1}^n \frac{\phi_2}{\Phi_2}\left[\delta_i^2\rho_{i*}\left(1 - \mathbf{w}_i'\mathbf{R}_i^{-1}\mathbf{w}_i\right) + \delta_i^2 w_{i1} w_{i2} - \frac{\phi_2}{\Phi_2}\right],$$
where $\mathbf{w}_i'\mathbf{R}_i^{-1}\mathbf{w}_i = \delta_i^2\left(w_{i1}^2 + w_{i2}^2 - 2\rho_{i*} w_{i1} w_{i2}\right)$. (For $\boldsymbol\beta_2$, change the subscripts in $\partial^2\ln L/\partial\boldsymbol\beta_1\,\partial\boldsymbol\beta_1'$ and $\partial^2\ln L/\partial\boldsymbol\beta_1\,\partial\rho$ accordingly.) The complexity of the second derivatives for this model makes it an excellent candidate for the Berndt et al. estimator of the variance matrix of the maximum likelihood estimator.
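A sketch of the Berndt et al. (BHHH) calculation follows: the per-observation scores are assembled from (23-45) and (23-46), and the estimated covariance matrix is the inverse of the sum of their outer products, evaluated at the MLE. The function signature is hypothetical.

```python
# BHHH (outer product of gradients) covariance for the bivariate probit MLE.
import numpy as np
from scipy.stats import norm, multivariate_normal

def bhhh_covariance(b1, b2, rho, y1, y2, x1, x2):
    q1, q2 = 2 * y1 - 1, 2 * y2 - 1
    w1, w2 = q1 * (x1 @ b1), q2 * (x2 @ b2)
    ri = q1 * q2 * rho                                      # rho_i*
    d = np.sqrt(1 - ri**2)
    g1 = norm.pdf(w1) * norm.cdf((w2 - ri * w1) / d)        # (23-46)
    g2 = norm.pdf(w2) * norm.cdf((w1 - ri * w2) / d)
    phi2 = norm.pdf(w1) * norm.pdf((w2 - ri * w1) / d) / d  # identity in the text
    Phi2 = np.array([multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, r], [r, 1.0]])
                     .cdf([a, b]) for a, b, r in zip(w1, w2, ri)])
    scores = np.hstack([x1 * (q1 * g1 / Phi2)[:, None],     # rows: per-observation scores
                        x2 * (q2 * g2 / Phi2)[:, None],
                        (q1 * q2 * phi2 / Phi2)[:, None]])
    return np.linalg.inv(scores.T @ scores)                 # sum of outer products, inverted
```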

Example 23.14 Tetrachoric Correlation
Returning once again to the health care application of Examples 11.11, 16.16, 23.4, and 23.8, we now consider a second binary variable,
$$\mathit{Hospital}_{it} = 1 \text{ if } \mathit{HospVis}_{it} > 0 \text{ and } 0 \text{ otherwise}.$$
Our previous analyses have focused on

$$\mathit{Doctor}_{it} = 1 \text{ if } \mathit{DocVis}_{it} > 0 \text{ and } 0 \text{ otherwise}.$$
A simple bivariate frequency count for these two variables is

                    Hospital
Doctor          0           1        Total
0           9,715         420       10,135
1          15,216       1,975       17,191
Total      24,931       2,395       27,326

Looking at the very large value in the lower-left cell, one might surmise that these two binary variables (and the underlying phenomena that they represent) are negatively correlated. The usual Pearson product moment correlation would be inappropriate as a measure of this correlation, since it is used for continuous variables. Consider, instead, a bivariate probit "model,"
$$H_{it}^* = \mu_1 + \varepsilon_{1,it}, \qquad \mathit{Hospital}_{it} = 1(H_{it}^* > 0),$$
$$D_{it}^* = \mu_2 + \varepsilon_{2,it}, \qquad \mathit{Doctor}_{it} = 1(D_{it}^* > 0),$$


where (ε1, ε2) have a bivariate normal distribution with means (0, 0), variances (1, 1) and cor- relation ρ. This is the model in (23-44) without independent variables. In this representation, the tetrachoric correlation, which is a correlation measure for a pair of binary variables, is precisely the ρ in this model—it is the correlation that would be measured between the underlying continuous variables if they could be observed. This suggests an interpretation of the correlation coefficient in a bivariate probit model—as the conditional tetrachoric cor- relation. It also suggests a method of easily estimating the tetrachoric correlation coefficient using a program that is built into nearly all packages. Applied to the hospital/doctor data defined earlier, we obtained an estimate of ρ of 0.31106, with an estimated asymptotic standard error of 0.01357. Apparently, our earlier intuition was incorrect.
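A minimal sketch of this calculation from the frequency table above: the thresholds are pinned down by the marginal proportions, and ρ maximizes the four-cell multinomial likelihood. (This is one way to do it; a packaged bivariate probit routine, as the text suggests, gives the same answer.)

```python
# Tetrachoric correlation by maximum likelihood from the 2x2 table.
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import minimize_scalar

# Cell counts n[d][h] for Doctor = d, Hospital = h.
n00, n01, n10, n11 = 9715, 420, 15216, 1975
n = n00 + n01 + n10 + n11

mu1 = norm.ppf((n01 + n11) / n)   # P(Hospital = 1) = Phi(mu1)
mu2 = norm.ppf((n10 + n11) / n)   # P(Doctor = 1)   = Phi(mu2)

def negloglik(rho):
    p11 = multivariate_normal(mean=[0, 0],
                              cov=[[1, rho], [rho, 1]]).cdf([mu1, mu2])
    p01 = norm.cdf(mu1) - p11     # Hospital = 1, Doctor = 0
    p10 = norm.cdf(mu2) - p11     # Doctor = 1, Hospital = 0
    p00 = 1.0 - p01 - p10 - p11
    return -(n00 * np.log(p00) + n01 * np.log(p01)
             + n10 * np.log(p10) + n11 * np.log(p11))

res = minimize_scalar(negloglik, bounds=(-0.95, 0.95), method="bounded")
print("estimated tetrachoric rho:", res.x)   # close to the 0.31106 in the example
```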

23.8.2 TESTING FOR ZERO CORRELATION

The Lagrange multiplier statistic is a convenient device for testing for the absence of correlation in this model. Under the null hypothesis that ρ equals zero, the model consists of independent probit equations, which can be estimated separately. Moreover, in the multivariate model, all the bivariate (or multivariate) densities and probabilities factor into the products of the marginals if the correlations are zero, which makes construction of the test statistic a simple matter of manipulating the results of the independent probits. The Lagrange multiplier statistic for testing $H_0{:}\ \rho = 0$ in a bivariate probit model is41
$$\mathrm{LM} = \frac{\left[\displaystyle\sum_{i=1}^n q_{i1} q_{i2}\,\frac{\phi(w_{i1})\phi(w_{i2})}{\Phi(w_{i1})\Phi(w_{i2})}\right]^2}{\displaystyle\sum_{i=1}^n \frac{[\phi(w_{i1})\phi(w_{i2})]^2}{\Phi(w_{i1})\Phi(-w_{i1})\Phi(w_{i2})\Phi(-w_{i2})}}.$$
As usual, the advantage of the LM statistic is that it obviates computing the bivariate probit model. But the full unrestricted model is now fairly common in commercial software, so that advantage is minor. The likelihood ratio or Wald test can often be used with equal ease. To carry out the likelihood ratio test, we note first that if ρ equals zero, then the bivariate probit model becomes two independent univariate probit models. The log-likelihood in that case would simply be the sum of the two separate log-likelihoods. The test statistic would be

$$\lambda_{\mathrm{LR}} = 2\left[\ln L_{\mathrm{BIVARIATE}} - (\ln L_1 + \ln L_2)\right].$$

This would converge to a chi-squared variable with one degree of freedom. The Wald test is carried out by referring
$$\lambda_{\mathrm{WALD}} = \hat\rho_{\mathrm{MLE}}^2 \big/ \mathrm{Est.\ Asy.\ Var}[\hat\rho_{\mathrm{MLE}}]$$

to the chi-squared distribution with one degree of freedom. For 95 percent significance, the critical value is 3.84 (or one can refer the positive square root to the standard normal critical value of 1.96). Example 23.14 demonstrates.
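The LM statistic is a one-line computation once the two univariate probits have been fit separately; a sketch, with hypothetical argument names, is:

```python
# Kiefer's LM statistic for H0: rho = 0, computed from two univariate probit fits.
import numpy as np
from scipy.stats import norm

def lm_rho_zero(beta1_hat, beta2_hat, y1, y2, x1, x2):
    q1, q2 = 2 * y1 - 1, 2 * y2 - 1
    w1, w2 = q1 * (x1 @ beta1_hat), q2 * (x2 @ beta2_hat)
    r = norm.pdf(w1) * norm.pdf(w2)                     # phi(w1) phi(w2)
    num = (q1 * q2 * r / (norm.cdf(w1) * norm.cdf(w2))).sum() ** 2
    den = (r**2 / (norm.cdf(w1) * norm.cdf(-w1)
                   * norm.cdf(w2) * norm.cdf(-w2))).sum()
    return num / den                                    # chi-squared(1) under H0
```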

41 This is derived in Kiefer (1982).


23.8.3 MARGINAL EFFECTS

There are several "marginal effects" one might want to evaluate in a bivariate probit model.42 A natural first step would be the derivatives of $\mathrm{Prob}[y_1 = 1, y_2 = 1 \mid \mathbf{x}_1, \mathbf{x}_2]$. These can be deduced from (23-45) by multiplying by $\Phi_2$, removing the sign carrier $q_{ij}$, and differentiating with respect to $\mathbf{x}_j$ rather than $\boldsymbol\beta_j$. The result is
$$\frac{\partial \Phi_2(\mathbf{x}_1'\boldsymbol\beta_1, \mathbf{x}_2'\boldsymbol\beta_2, \rho)}{\partial \mathbf{x}_1} = \phi(\mathbf{x}_1'\boldsymbol\beta_1)\,\Phi\left[\frac{\mathbf{x}_2'\boldsymbol\beta_2 - \rho\,\mathbf{x}_1'\boldsymbol\beta_1}{\sqrt{1 - \rho^2}}\right]\boldsymbol\beta_1.$$
Note, however, the bivariate probability, albeit possibly of interest in its own right, is not a conditional mean function. As such, the preceding does not correspond to a regression coefficient or a slope of a conditional expectation.
For convenience in evaluating the conditional mean and its partial effects, we will define a vector $\mathbf{x} = \mathbf{x}_1 \cup \mathbf{x}_2$ and let $\mathbf{x}_1'\boldsymbol\beta_1 = \mathbf{x}'\boldsymbol\gamma_1$. Thus, $\boldsymbol\gamma_1$ contains all the nonzero elements of $\boldsymbol\beta_1$ and possibly some zeros in the positions of variables in $\mathbf{x}$ that appear only in the other equation; $\boldsymbol\gamma_2$ is defined likewise. The bivariate probability is
$$\mathrm{Prob}[y_1 = 1, y_2 = 1 \mid \mathbf{x}] = \Phi_2[\mathbf{x}'\boldsymbol\gamma_1, \mathbf{x}'\boldsymbol\gamma_2, \rho].$$
Signs are changed appropriately if the probability of the zero outcome is desired in either case. [See (23-44).] The marginal effects of changes in $\mathbf{x}$ on this probability are given by
$$\frac{\partial \Phi_2}{\partial \mathbf{x}} = g_1\boldsymbol\gamma_1 + g_2\boldsymbol\gamma_2,$$

where $g_1$ and $g_2$ are defined in (23-46). The familiar univariate cases will arise if $\rho = 0$, and effects specific to one equation or the other will be produced by zeros in the corresponding position in one or the other parameter vector. There are also some conditional mean functions to consider. The unconditional mean functions are given by the univariate probabilities:
$$E[y_j \mid \mathbf{x}] = \Phi(\mathbf{x}'\boldsymbol\gamma_j), \quad j = 1, 2,$$
so the analysis of (23-9) and (23-10) applies. One pair of conditional mean functions that might be of interest are

$$E[y_1 \mid y_2 = 1, \mathbf{x}] = \mathrm{Prob}[y_1 = 1 \mid y_2 = 1, \mathbf{x}] = \frac{\mathrm{Prob}[y_1 = 1, y_2 = 1 \mid \mathbf{x}]}{\mathrm{Prob}[y_2 = 1 \mid \mathbf{x}]} = \frac{\Phi_2(\mathbf{x}'\boldsymbol\gamma_1, \mathbf{x}'\boldsymbol\gamma_2, \rho)}{\Phi(\mathbf{x}'\boldsymbol\gamma_2)}$$

and similarly for $E[y_2 \mid y_1 = 1, \mathbf{x}]$. The marginal effects for this function are given by
$$\frac{\partial E[y_1 \mid y_2 = 1, \mathbf{x}]}{\partial \mathbf{x}} = \frac{1}{\Phi(\mathbf{x}'\boldsymbol\gamma_2)}\left[g_1\boldsymbol\gamma_1 + \left(g_2 - \Phi_2\,\frac{\phi(\mathbf{x}'\boldsymbol\gamma_2)}{\Phi(\mathbf{x}'\boldsymbol\gamma_2)}\right)\boldsymbol\gamma_2\right].$$
Finally, one might construct the nonlinear conditional mean function
$$E[y_1 \mid y_2, \mathbf{x}] = \frac{\Phi_2\left[\mathbf{x}'\boldsymbol\gamma_1,\ (2y_2 - 1)\,\mathbf{x}'\boldsymbol\gamma_2,\ (2y_2 - 1)\rho\right]}{\Phi\left[(2y_2 - 1)\,\mathbf{x}'\boldsymbol\gamma_2\right]}.$$
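Both sets of partial effects translate directly into code; a sketch with assumed inputs (the parameter vectors would in practice be estimates, with zeros in $\boldsymbol\gamma_j$ for variables excluded from equation $j$):

```python
# Partial effects for the joint probability and for E[y1 | y2 = 1, x].
import numpy as np
from scipy.stats import norm, multivariate_normal

def bvp_partial_effects(x, gamma1, gamma2, rho):
    a1, a2 = x @ gamma1, x @ gamma2
    d = np.sqrt(1 - rho**2)
    g1 = norm.pdf(a1) * norm.cdf((a2 - rho * a1) / d)      # (23-46) with q's = 1
    g2 = norm.pdf(a2) * norm.cdf((a1 - rho * a2) / d)
    Phi2 = multivariate_normal(mean=[0.0, 0.0],
                               cov=[[1.0, rho], [rho, 1.0]]).cdf([a1, a2])
    d_joint = g1 * gamma1 + g2 * gamma2                    # d Phi_2 / dx
    F2 = norm.cdf(a2)
    d_cond = (g1 * gamma1 + (g2 - Phi2 * norm.pdf(a2) / F2) * gamma2) / F2
    return d_joint, d_cond
```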

42 See Greene (1996b) and Christofides et al. (1997, 2000).


The derivatives of this function are the same as those presented earlier, with sign changes in several places if y2 = 0 is the argument.

Example 23.15 Bivariate Probit Model for Health Care Utilization
We have extended the bivariate probit model of the previous example by specifying a set of independent variables,
$$\mathbf{x}_i = [\text{Constant}, \mathit{Female}_i, \mathit{Age}_{it}, \mathit{Income}_{it}, \mathit{Kids}_{it}, \mathit{Education}_{it}, \mathit{Married}_{it}].$$
We have specified that the same exogenous variables appear in both equations. (There is no requirement that different variables appear in the equations, nor that a variable be excluded from each equation.) The correct analogy here is to the seemingly unrelated regressions model, not to the linear simultaneous-equations model. Unlike the SUR model of Chapter 10, it is not the case here that having the same variables in the two equations implies that the model can be fit equation by equation, one equation at a time. That result only applies to the estimation of sets of linear regression equations.
Table 23.12 contains the estimates of the parameters of the univariate and bivariate probit models. The tests of the null hypothesis of zero correlation strongly reject the hypothesis that ρ equals zero. The t statistic for ρ based on the full model is 0.2981/0.0139 = 21.446, which is much larger than the critical value of 1.96. For the likelihood ratio test, we compute

$$\lambda_{\mathrm{LR}} = 2\{-25285.07 - [(-17422.72) + (-8073.604)]\} = 422.508.$$
Once again, the hypothesis is rejected. (The Wald statistic is 21.446² = 459.957.) The LM statistic is 383.953. The coefficient estimates agree with expectations. The income coefficient is statistically significant in the doctor equation but not in the hospital equation, suggesting, perhaps, that physician visits are at least to some extent discretionary while hospital visits occur on an emergency basis that would be much less tied to income. The table also contains the decomposition of the partial effects for $E[y_1 \mid y_2 = 1]$. The direct effect is $[g_1/\Phi(\mathbf{x}'\boldsymbol\gamma_2)]\boldsymbol\gamma_1$ in the definition given earlier. The mean estimate of $E[y_1 \mid y_2 = 1]$ is 0.821285. In the frequency table in Example 23.14, this would correspond to the raw proportion $P(D = 1, H = 1)/P(H = 1) = (1975/27326)/(2395/27326) = 0.8246$.

TABLE 23.12 Estimated Bivariate Probit Model (estimated standard errors in parentheses)

Doctor equation
                 Model Estimates                           Partial Effects for E[y1 | y2 = 1]
Variable     Univariate           Bivariate             Direct               Indirect              Total
Constant    −0.1243 (0.05815)    −0.1243 (0.05814)
Female       0.3559 (0.01602)     0.3551 (0.01604)      0.09650 (0.004957)  −0.00724 (0.001515)    0.08926 (0.005127)
Age          0.01189 (0.0007957)  0.01188 (0.000802)    0.003227 (0.000231) −0.00032 (0.000073)    0.002909 (0.000238)
Income      −0.1324 (0.04655)    −0.1337 (0.04628)     −0.03632 (0.01260)   −0.003064 (0.004105)  −0.03939 (0.01254)
Kids        −0.1521 (0.01833)    −0.1523 (0.01825)     −0.04140 (0.005053)   0.001047 (0.001773)  −0.04036 (0.005168)
Education   −0.01497 (0.003575)  −0.01484 (0.003575)   −0.004033 (0.000977)  0.001512 (0.00035)   −0.002521 (0.0010)
Married      0.07352 (0.02064)    0.07351 (0.02063)     0.01998 (0.005626)   0.003303 (0.001917)   0.02328 (0.005735)

Hospital equation
Variable     Univariate            Bivariate
Constant    −1.3328 (0.08320)     −1.3385 (0.07957)
Female       0.1023 (0.02195)      0.1050 (0.02174)
Age          0.004605 (0.001082)   0.00461 (0.001058)
Income       0.03739 (0.06329)     0.04441 (0.05946)
Kids        −0.01714 (0.02562)    −0.01517 (0.02570)
Education   −0.02196 (0.005215)   −0.02191 (0.005110)
Married     −0.04824 (0.02788)    −0.04789 (0.02777)


23.8.4 RECURSIVE BIVARIATE PROBIT MODELS

Burnett (1997) proposed the following bivariate probit model for the presence of a gender economics course in the curriculum of a liberal arts college:
$$\mathrm{Prob}[y_1 = 1, y_2 = 1 \mid \mathbf{x}_1, \mathbf{x}_2] = \Phi_2(\mathbf{x}_1'\boldsymbol\beta_1 + \gamma y_2,\ \mathbf{x}_2'\boldsymbol\beta_2,\ \rho).$$
The dependent variables in the model are

y1 = presence of a gender economics course,

y2 = presence of a women’s studies program on the campus. The independent variables in the model are

z1 = constant term;

z2 = academic reputation of the college, coded 1 (best), 2, ..., to 141;

z3 = size of the full-time economics faculty, a count;

z4 = percentage of the economics faculty that are women, proportion (0 to 1);

z5 = religious affiliation of the college, 0 = no, 1 = yes;

z6 = percentage of the college faculty that are women, proportion (0 to 1);

z7–z10 = regional dummy variables, South, Midwest, Northeast, West. The regressor vectors are

$$\mathbf{x}_1 = [z_1, z_2, z_3, z_4, z_5], \qquad \mathbf{x}_2 = [z_2, z_6, z_5, z_7\text{--}z_{10}].$$
Burnett's model illustrates a number of interesting aspects of the bivariate probit model. Note that this model is qualitatively different from the bivariate probit model in (23-44); the second dependent variable, $y_2$, appears on the right-hand side of the first equation.43 This model is a recursive, simultaneous-equations model. Surprisingly, the endogenous nature of one of the variables on the right-hand side of the first equation can be ignored in formulating the log-likelihood. [The model appears in Maddala (1983, p. 123).] We can establish this fact with the following (admittedly trivial) argument: The term that enters the log-likelihood is $P(y_1 = 1, y_2 = 1) = P(y_1 = 1 \mid y_2 = 1)P(y_2 = 1)$. Given the model as stated, the marginal probability for $y_2$ is just $\Phi(\mathbf{x}_2'\boldsymbol\beta_2)$, whereas the conditional probability is $\Phi_2(\ldots)/\Phi(\mathbf{x}_2'\boldsymbol\beta_2)$. The product returns the probability we had earlier. The other three terms in the log-likelihood are derived similarly, which produces (Maddala's results with some sign changes):
$$P_{11} = \Phi_2(\mathbf{x}_1'\boldsymbol\beta_1 + \gamma y_2,\ \mathbf{x}_2'\boldsymbol\beta_2,\ \rho), \qquad P_{10} = \Phi_2(\mathbf{x}_1'\boldsymbol\beta_1,\ -\mathbf{x}_2'\boldsymbol\beta_2,\ -\rho),$$
$$P_{01} = \Phi_2\left[-(\mathbf{x}_1'\boldsymbol\beta_1 + \gamma y_2),\ \mathbf{x}_2'\boldsymbol\beta_2,\ -\rho\right], \qquad P_{00} = \Phi_2(-\mathbf{x}_1'\boldsymbol\beta_1,\ -\mathbf{x}_2'\boldsymbol\beta_2,\ \rho).$$

These terms are exactly those of (23-44) that we obtain just by carrying $y_2$ in the first equation with no special attention to its endogenous nature. We can ignore the simultaneity in this model, and we cannot in the linear regression model, because in this instance we are maximizing the log-likelihood; in the linear regression case, we are manipulating certain sample moments that do not converge to the necessary population parameters in the presence of simultaneity.
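A sketch of the log-likelihood assembled from these four probabilities follows (the arguments are hypothetical; note that carrying $\gamma y_2$ in the first index handles all four cells automatically):

```python
# Log-likelihood for the recursive bivariate probit model.
import numpy as np
from scipy.stats import multivariate_normal

def loglik_recursive(b1, gamma, b2, rho, y1, y2, x1, x2):
    def Phi2(a, b, r):
        mvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, r], [r, 1.0]])
        return np.array([mvn.cdf([ai, bi]) for ai, bi in zip(a, b)])
    a1 = x1 @ b1 + gamma * y2        # y2 carried in the first equation's index
    a2 = x2 @ b2
    p11 = Phi2(a1, a2, rho)
    p10 = Phi2(a1, -a2, -rho)
    p01 = Phi2(-a1, a2, -rho)
    p00 = Phi2(-a1, -a2, rho)
    p = np.where(y1 == 1, np.where(y2 == 1, p11, p10),
                          np.where(y2 == 1, p01, p00))
    return np.log(p).sum()
```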

43 Eisenberg and Rowe (2006) is another application of this model. In their study, they analyzed the joint (recursive) effect of y2 = veteran status on y1, smoking behavior. The estimators they used were two-stage least squares and GMM. (The GMM estimator is the H2SLS estimator described in Section 15.6.4.)


The marginal effects in this model are fairly involved, and as before, we can consider several different types. Consider, for example, z2, academic reputation. There is a direct effect produced by its presence in the first equation, but there is also an indirect effect. Academic reputation enters the women’s studies equation and, therefore, influences the probability that y2 equals one. Because y2 appears in the first equation, this effect is transmitted back to y1. The total effect of academic reputation and, likewise, religious affiliation is the sum of these two parts. Consider first the gender economics variable, y1. The conditional mean is

$$E[y_1 \mid \mathbf{x}_1, \mathbf{x}_2] = \mathrm{Prob}[y_2 = 1]\,E[y_1 \mid y_2 = 1, \mathbf{x}_1, \mathbf{x}_2] + \mathrm{Prob}[y_2 = 0]\,E[y_1 \mid y_2 = 0, \mathbf{x}_1, \mathbf{x}_2]$$
$$= \Phi_2(\mathbf{x}_1'\boldsymbol\beta_1 + \gamma,\ \mathbf{x}_2'\boldsymbol\beta_2,\ \rho) + \Phi_2(\mathbf{x}_1'\boldsymbol\beta_1,\ -\mathbf{x}_2'\boldsymbol\beta_2,\ -\rho).$$
Derivatives can be computed using our earlier results. We are also interested in the effect of religious affiliation. Because this variable is binary, simply differentiating the conditional mean function may not produce an accurate result. Instead, we would compute the conditional mean function with this variable set to one and then zero, and take the difference. Finally, what is the effect of the presence of a women's studies program on the probability that the college will offer a gender economics course? To compute this effect, we would compute $\mathrm{Prob}[y_1 = 1 \mid y_2 = 1, \mathbf{x}_1, \mathbf{x}_2] - \mathrm{Prob}[y_1 = 1 \mid y_2 = 0, \mathbf{x}_1, \mathbf{x}_2]$. In all cases, standard errors for the estimated marginal effects can be computed using the delta method.
Maximum likelihood estimates of the parameters of Burnett's model were computed by Greene (1998) using her sample of 132 liberal arts colleges; 31 of the schools offer gender economics, 58 have women's studies, and 29 have both. (See Appendix Table F23.1.) The estimated parameters are given in Table 23.13. Both bivariate probit

TABLE 23.13 Estimates of a Recursive Simultaneous Bivariate Probit Model (estimated standard errors in parentheses)

                        Single Equation                  Bivariate Probit
Variable           Coefficient  Standard Error      Coefficient  Standard Error
Gender Economics Equation
Constant           −1.4176      (0.8768)            −1.1911      (2.2155)
AcRep              −0.01143     (0.003610)          −0.01233     (0.007937)
WomStud             1.1095      (0.4699)             0.8835      (2.2603)
EconFac             0.06730     (0.05687)            0.06769     (0.06952)
PctWecon            2.5391      (0.8997)             2.5636      (1.0144)
Relig              −0.3482      (0.4212)            −0.3741      (0.5264)
Women's Studies Equation
AcRep              −0.01957     (0.004117)          −0.01939     (0.005704)
PctWfac             1.9429      (0.9001)             1.8914      (0.8714)
Relig              −0.4494      (0.3072)            −0.4584      (0.3403)
South               1.3597      (0.5948)             1.3471      (0.6897)
West                2.3386      (0.6449)             2.3376      (0.8611)
North               1.8867      (0.5927)             1.9009      (0.8495)
Midwest             1.8248      (0.6595)             1.8070      (0.8952)
ρ                   0.0000      (0.0000)             0.1359      (1.2539)
ln L              −85.6458                         −85.6317


and the single-equation estimates are given. The estimate of ρ is only 0.1359, with a standard error of 1.2539. The Wald statistic for the test of the hypothesis that ρ equals zero is (0.1359/1.2539)² = 0.011753. For a single restriction, the critical value from the chi-squared table is 3.84, so the hypothesis cannot be rejected. The likelihood ratio statistic for the same hypothesis is 2[−85.6317 − (−85.6458)] = 0.0282, which leads to the same conclusion. The Lagrange multiplier statistic is 0.003807, which is consistent. This result might seem counterintuitive, given the setting. Surely "gender economics" and "women's studies" are highly correlated, but this finding does not contradict that proposition. The correlation coefficient measures the correlation between the disturbances in the equations, the omitted factors. That is, ρ measures (roughly) the correlation between the outcomes after the influence of the included factors is accounted for. Thus, the value 0.1359 measures the effect after the influence of women's studies is already accounted for. As discussed in the next paragraph, the proposition turns out to be right. The single most important determinant (at least within this model) of whether a gender economics course will be offered is indeed whether the college offers a women's studies program.
Table 23.14 presents the estimates of the marginal effects and some descriptive statistics for the data. The calculations were simplified slightly by using the restricted model with ρ = 0. Computations of the marginal effects still require the preceding decomposition, but they are simplified by the result that if ρ equals zero, then the bivariate probabilities factor into the products of the marginals. Numerically, the strongest effect appears to be exerted by the representation of women on the faculty; its coefficient of +0.4491 is by far the largest. This variable, however, cannot change by a full unit because it is a proportion. An increase of 1 percent in the presence of women on the faculty raises the probability by only +0.004, which is comparable in scale to the effect of academic reputation. The effect of women on the faculty is likewise fairly small, only 0.0013 per 1 percent change. As might have been expected, the single most important influence is the presence of a women's studies program, which increases the likelihood of a gender economics course by a full 0.1863. Of course, the raw data would have anticipated this result; of the 31 schools that offer a gender economics course, 29 also have a women's studies program and only two do not. Note finally that the effect of religious affiliation (whatever it is) is mostly direct.

TABLE 23.14 Marginal Effects in Gender Economics Model

                 Direct       Indirect     Total        (Std. Error)   (Type of Variable, Mean)
Gender Economics Equation
AcRep           −0.002022    −0.001453    −0.003476    (0.001126)     (Continuous, 119.242)
PctWecon        +0.4491                   +0.4491      (0.1568)       (Continuous, 0.24787)
EconFac         +0.01190                  +0.01190     (0.01292)      (Continuous, 6.74242)
Relig           −0.06327     −0.02306     −0.08632     (0.08220)      (Binary, 0.57576)
WomStud         +0.1863                   +0.1863      (0.0868)       (Endogenous, 0.43939)
PctWfac                      +0.14434     +0.14434     (0.09051)      (Continuous, 0.35772)
Women's Studies Equation
AcRep           −0.00780                  −0.00780     (0.001654)     (Continuous, 119.242)
PctWfac         +0.77489                  +0.77489     (0.3591)       (Continuous, 0.35772)
Relig           −0.17777                  −0.17777     (0.11946)      (Binary, 0.57576)


TABLE 23.15 Binary Choice Fit Measures

Measure                          (1)     (2)     (3)     (4)     (5)     (6)     (7)
LRI                              0.573   0.535   0.495   0.407   0.279   0.206   0.000
R²_BL                            0.844   0.844   0.823   0.797   0.754   0.718   0.641
λ                                0.565   0.560   0.526   0.444   0.319   0.216   0.000
R²_EF                            0.561   0.558   0.530   0.475   0.343   0.216   0.000
R²_VZ                            0.708   0.707   0.672   0.589   0.447   0.352   0.000
R²_MZ                            0.687   0.679   0.628   0.567   0.545   0.329   0.000
Predictions:
  correct y = 0 (of 101)         92      93      92      94      98      101     101
  correct y = 1 (of 31)          26      26      23      23      15      0       0

Before closing this application, we can use this opportunity to examine the fit mea- sures listed in Section 23.4.5. We computed the various fit measures using seven different specifications of the gender economics equation:

1. Single-equation probit estimates, z1, z2, z3, z4, z5, y2
2. Bivariate probit model estimates, z1, z2, z3, z4, z5, y2
3. Single-equation probit estimates, z1, z2, z3, z4, z5
4. Single-equation probit estimates, z1, z3, z5, y2
5. Single-equation probit estimates, z1, z3, z5
6. Single-equation probit estimates, z1, z5
7. Single-equation probit estimates, z1 (constant only).

The specifications are in descending "quality" because we removed the most statistically significant variables from the model at each step. The values are listed in Table 23.15. The prediction counts below each column in the table summarize the "hits" and "misses" of the prediction rule $\hat{y} = 1$ if $\hat{P} > 0.5$, 0 otherwise. [Note that by construction, model (7) must predict all ones or all zeros.] Thus, for model (1), 92 of 101 zeros were predicted correctly, whereas 5 of 31 ones were predicted incorrectly. As one would hope, the fit measures decline as the more significant variables are removed from the model. The Ben-Akiva measure has an obvious flaw in that with only a constant term, the model still obtains a "fit" of 0.641. From the prediction counts, it is clear that the explanatory power of the model, such as it is, comes from its ability to predict the ones correctly. The poorer the model, the greater the number of correct predictions of y = 0. But as this number rises, the number of incorrect predictions rises and the number of correct predictions of y = 1 declines. All the fit measures appear to react to this feature to some degree. The Efron and Cramer measures, which are nearly identical, and McFadden's LRI appear to be most sensitive to this, with the remaining two only slightly less consistent.

23.9 A MULTIVARIATE PROBIT MODEL

In principle, a multivariate probit model would simply extend (23-44) to more than two outcome variables just by adding equations. The resulting equation system, again


analogous to the seemingly unrelated regressions model, would be
$$y_m^* = \mathbf{x}_m'\boldsymbol\beta_m + \varepsilon_m, \qquad y_m = 1 \text{ if } y_m^* > 0,\ 0 \text{ otherwise}, \qquad m = 1, \ldots, M,$$
$$E[\varepsilon_m \mid \mathbf{x}_1, \ldots, \mathbf{x}_M] = 0,$$
$$\mathrm{Var}[\varepsilon_m \mid \mathbf{x}_1, \ldots, \mathbf{x}_M] = 1,$$
$$\mathrm{Cov}[\varepsilon_j, \varepsilon_m \mid \mathbf{x}_1, \ldots, \mathbf{x}_M] = \rho_{jm},$$
$$(\varepsilon_1, \ldots, \varepsilon_M) \sim N_M[\mathbf{0}, \mathbf{R}].$$

The joint probabilities of the observed events, $[y_{i1}, y_{i2}, \ldots, y_{iM} \mid \mathbf{x}_{i1}, \mathbf{x}_{i2}, \ldots, \mathbf{x}_{iM}]$, $i = 1, \ldots, n$, that form the basis for the log-likelihood function are the $M$-variate normal probabilities,
$$L_i = \Phi_M(q_{i1}\mathbf{x}_{i1}'\boldsymbol\beta_1, \ldots, q_{iM}\mathbf{x}_{iM}'\boldsymbol\beta_M, \mathbf{R}_i^*),$$
where

$$q_{im} = 2y_{im} - 1, \qquad R^*_{i,jm} = q_{ij}\, q_{im}\, \rho_{jm}.$$
The practical obstacle to this extension is the evaluation of the $M$-variate normal integrals and their derivatives. Some progress has been made on using quadrature for trivariate integration (see Section 16.9.6.b), but existing results are not sufficient to allow accurate and efficient evaluation for more than two variables in a sample of even moderate size. However, given the speed of modern computers, simulation-based integration using the GHK simulator or simulated likelihood methods (see Chapter 17) do allow for estimation of relatively large models. We consider an application in Example 23.16.44
The multivariate probit model in another form presents a useful extension of the random effects probit model for panel data (Section 23.5.1). If the parameter vectors in all equations are constrained to be equal, we obtain what Bertschek and Lechner (1998) call the "panel probit model,"
$$y_{it}^* = \mathbf{x}_{it}'\boldsymbol\beta + \varepsilon_{it}, \qquad y_{it} = 1 \text{ if } y_{it}^* > 0,\ 0 \text{ otherwise}, \qquad i = 1, \ldots, n,\ t = 1, \ldots, T,$$
$$(\varepsilon_{i1}, \ldots, \varepsilon_{iT}) \sim N[\mathbf{0}, \mathbf{R}].$$
The Butler and Moffitt (1982) approach for this model (see Section 23.5.1) has proved useful in many applications. But their underlying assumption that $\mathrm{Cov}[\varepsilon_{it}, \varepsilon_{is}] = \rho$ is a substantive restriction. By treating this structure as a multivariate probit model with the restriction that the coefficient vector be the same in every period, one can obtain a model with free correlations across periods.45 Hyslop (1999), Bertschek and Lechner (1998), Greene (2004 and Example 23.16), and Cappellari and Jenkins (2006) are applications.
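A minimal sketch of the GHK simulator for a rectangle probability $\mathrm{Prob}[\boldsymbol\varepsilon < \mathbf{b}]$ under covariance $\mathbf{R}$: draw each component from its truncated conditional distribution via the Cholesky factor and accumulate the truncation probabilities.

```python
# GHK simulation of an M-variate normal orthant probability.
import numpy as np
from scipy.stats import norm

def ghk(b, R, n_draws=1000, seed=0):
    rng = np.random.default_rng(seed)
    M = len(b)
    C = np.linalg.cholesky(R)
    u = rng.uniform(size=(n_draws, M))
    e = np.zeros((n_draws, M))
    w = np.ones(n_draws)
    for m in range(M):
        mu = e[:, :m] @ C[m, :m]                 # conditional mean term
        t = norm.cdf((b[m] - mu) / C[m, m])      # prob. of staying below the bound
        w *= t                                   # accumulate the weight
        e[:, m] = norm.ppf(u[:, m] * t)          # truncated draw for this coordinate
    return w.mean()

# Check against the bivariate cdf Phi_2(0.5, 0.5, 0.6).
R = np.array([[1.0, 0.6], [0.6, 1.0]])
print(ghk(np.array([0.5, 0.5]), R, n_draws=20000))
```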

44 Studies that propose improved methods of simulating probabilities include Pakes and Pollard (1989) and especially Börsch-Supan and Hajivassiliou (1990), Geweke (1989), and Keane (1994). A symposium in the November 1994 issue of the Review of Economics and Statistics presents discussion of numerous issues in specification and estimation of models based on simulation of probabilities. Applications that employ simulation techniques for evaluation of multivariate normal integrals are now fairly numerous. See, for example, Hyslop (1999) (Example 23.7), which applies the technique to a panel data application with T = 7. Example 23.16 develops a five-variate application.
45 By assuming the coefficient vectors are the same in all periods, we actually obviate the normalization that the diagonal elements of R are all equal to one as well. The restriction identifies $T - 1$ relative variances $\rho_{tt} = \sigma_t^2/\sigma_T^2$. This aspect is examined in Greene (2004).


Example 23.16 A Multivariate Probit Model for Product Innovations
Bertschek and Lechner applied the panel probit model to an analysis of the product innovation activity of 1,270 German firms observed in five years, 1984–1988, in response to imports and foreign direct investment. [See Bertschek (1995).] The probit model to be estimated is based on the latent regression
$$y_{it}^* = \beta_1 + \sum_{k=2}^{8} \beta_k x_{k,it} + \varepsilon_{it}, \qquad y_{it} = 1(y_{it}^* > 0), \quad i = 1, \ldots, 1270,\ t = 1984, \ldots, 1988,$$
where

yit = 1 if a product innovation was realized by firm i in year t, 0 otherwise,

x2,it = Log of industry sales in DM,

x3,it = Import share = ratio of industry imports to (industry sales plus imports),

x4,it = Relative firm size = ratio of employment in business unit to employment in the industry (times 30),

x5,it = FDI share = Ratio of industry foreign direct investment to (industry sales plus imports),

x6,it = Productivity = Ratio of industry value added to industry employment,

x7,it = Raw materials sector = 1 if the firm is in this sector,

x8,it = Investment goods sector = 1 if the firm is in this sector.

The coefficients on import share (β3) and FDI share (β5) were of particular interest. The objectives of the study were the empirical investigation of innovation and the methodological development of an estimator that could obviate computing the five-variate normal probabil- ities necessary for a full maximum likelihood estimation of the model. Table 23.16 presents the single-equation, pooled probit model estimates.46 Given the structure of the model, the parameter vector could be estimated consistently with any single

TABLE 23.16 Estimated Pooled Probit Model

                         Estimated Standard Errors                          Marginal Effects
Variable    Estimate^a   SE(1)^b   SE(2)^c   SE(3)^d   SE(4)^e      Partial      Std. Err.   t ratio
Constant    −1.960       0.239     0.377     0.230     0.373        —            —           —
log Sales    0.177       0.0250    0.0375    0.0222    0.0358       0.0683^f     0.0138      4.96
Rel Size     1.072       0.206     0.306     0.142     0.269        0.413^f      0.103       4.01
Imports      1.134       0.153     0.246     0.151     0.243        0.437^f      0.0938      4.66
FDI          2.853       0.467     0.679     0.402     0.642        1.099^f      0.247       4.44
Prod.       −2.341       1.114     1.300     0.715     1.115       −0.902^f      0.429      −2.10
Raw Mtl     −0.279       0.0966    0.133     0.0807    0.126       −0.110^g      0.0503     −2.18
Inv Good     0.188       0.0404    0.0630    0.0392    0.0628       0.0723^g     0.0241      3.00

^a Recomputed. Only two digits were reported in the earlier paper.
^b Obtained from results in Bertschek and Lechner, Table 9.
^c Based on the Avery et al. (1983) GMM estimator.
^d Square roots of the diagonals of the negative inverse of the Hessian.
^e Based on the cluster estimator.
^f Coefficient scaled by the density evaluated at the sample means.
^g Computed as the difference in the fitted probability with the dummy variable equal to one, then zero.

46 We are grateful to the authors of this study, who have generously loaned us their data for our continued analysis. The data are proprietary and cannot be made publicly available, unlike the other data sets used in our examples.


period’s data, Hence, pooling the observations, which produces a mixture of the estimators, will also be consistent. Given the panel data nature of the data set, however, the conventional standard errors from the pooled estimator are dubious. Because the marginal distribution will produce a consistent estimator of the parameter vector, this is a case in which the cluster estimator (see Section 16.8.4) provides an appropriate asymptotic covariance matrix. Note that the standard errors in column SE(4) of the table are considerably higher than the uncorrected ones in columns 1–3. The pooled estimator is consistent, so the further development of the estimator is a matter of (1) obtaining a more efficient estimator of β and (2) computing estimates of the cross- period correlation coefficients. The authors proposed a set of GMM estimators based on the orthogonality conditions implied by the single-equation conditional mean functions: { −  β | }= . E [ yit (xit )] Xi 0 The orthogonality conditions are ⎡ ⎛ ⎞⎤ [ y − (x β)] ⎢ ⎜ i 1 i 1 ⎟⎥ ⎢ ⎜ [ y − (x β)] ⎟⎥ ⎢ i 2 i 2 ⎥ = E⎢A(Xi ) ⎜ . ⎟⎥ 0, ⎣ ⎝ . ⎠⎦ −  β [ yiT (xiT )]

where $\mathbf{A}(\mathbf{X}_i)$ is a $P \times T$ matrix of instrumental variables constructed from the exogenous data for individual $i$. Using only the raw data as $\mathbf{A}(\mathbf{X}_i)$, strong exogeneity of the regressors in every period would provide $TK$ moment equations of the form $E[\mathbf{x}_{it}(y_{is} - \Phi(\mathbf{x}_{is}'\boldsymbol\beta))] = \mathbf{0}$ for each period $s$, or a total of $T^2K$ moment equations altogether for estimation of $K$ parameters in $\boldsymbol\beta$. [See Wooldridge (1995).] The full set of such orthogonality conditions would be $E[(\mathbf{I}_T \otimes \mathbf{x}_i)\mathbf{u}_i] = \mathbf{0}$, where $\mathbf{x}_i = [\mathbf{x}_{i1}', \ldots, \mathbf{x}_{iT}']'$, $\mathbf{u}_i = (u_{i1}, \ldots, u_{iT})'$, and $u_{it} = y_{it} - \Phi(\mathbf{x}_{it}'\boldsymbol\beta)$. This produces 200 orthogonality conditions for the estimation of the 8 parameters. The empirical counterpart to the left-hand side of (6) is
$$\bar{\mathbf{g}}_N(\boldsymbol\beta) = \frac{1}{N}\sum_{i=1}^N \mathbf{A}(\mathbf{X}_i)\begin{pmatrix} y_{i1} - \Phi(\mathbf{x}_{i1}'\boldsymbol\beta)\\ y_{i2} - \Phi(\mathbf{x}_{i2}'\boldsymbol\beta)\\ \vdots\\ y_{iT} - \Phi(\mathbf{x}_{iT}'\boldsymbol\beta)\end{pmatrix}.$$
The various GMM estimators are the solutions to
$$\hat{\boldsymbol\beta}_{\mathrm{GMM},\mathbf{A},\mathbf{W}} = \arg\min_{\boldsymbol\beta}\ [\bar{\mathbf{g}}_N(\boldsymbol\beta)]'\,\mathbf{W}\,[\bar{\mathbf{g}}_N(\boldsymbol\beta)].$$

The specific estimator is defined by the choice of instrument matrix $\mathbf{A}(\mathbf{X}_i)$ and weighting matrix $\mathbf{W}$; the authors suggest several. In their application (see p. 337), only data from period t are used in the tth moment condition. This reduces the number of moment conditions from 200 to 40. As noted, the FIML estimates of the model can be computed using the GHK simulator.47 The FIML estimates, Bertschek and Lechner's GMM estimates, and the random effects model using the Butler and Moffitt (1982) quadrature method are reported in Table 23.17. The FIML and GMM estimates are strikingly similar. This would be expected because both are consistent estimators and the sample is fairly large. The correlations reported are based on the FIML estimates. [They are not estimated with the GMM estimator. As the authors note, an inefficient estimator of the correlation matrix is available by fitting pairs of equations (years) as bivariate probit models.]

47 The full computation required about one hour of computing time. Computation of the single-equation (pooled) estimators required only about 1/100 of the time reported by the authors for the same models, which suggests that the evolution of computing technology may play a significant role in advancing the FIML estimators.


TABLE 23.17 Estimated Constrained Multivariate Probit Model (estimated standard errors in parentheses)

Coefficients          Full Maximum Likelihood,     BL GMM^a          Random Effects
                      Using GHK Simulator                            (ρ = 0.578 (0.0189))
Constant              −1.797** (0.341)             −1.74** (0.37)    −2.839 (0.533)
log Sales              0.154** (0.0334)             0.15** (0.034)    0.244 (0.0522)
Relative size          0.953** (0.160)              0.95** (0.20)     1.522 (0.257)
Imports                1.155** (0.228)              1.14** (0.24)     1.779 (0.360)
FDI                    2.426** (0.573)              2.59** (0.59)     3.652 (0.870)
Productivity          −1.578 (1.216)               −1.91* (0.82)     −2.307 (1.911)
Raw material          −0.292** (0.130)             −0.28* (0.12)     −0.477 (0.202)
Investment goods       0.224** (0.0605)             0.21** (0.063)    0.578 (0.0189)
log-likelihood        −3522.85                                       −3535.55

Estimated correlations (FIML)
1984, 1985   0.460** (0.0301)        1984, 1987   0.540** (0.0308)       1985, 1988   0.446** (0.0380)
1984, 1986   0.599** (0.0323)        1985, 1987   0.546** (0.0348)       1986, 1988   0.524** (0.0355)
1985, 1986   0.643** (0.0308)        1986, 1987   0.610** (0.0322)       1987, 1988   0.605** (0.0325)
                                     1984, 1988   0.483** (0.0364)

Estimated correlation matrix (lower triangle: constrained model; in parentheses: unrestricted model)
        1984     1985     1986     1987     1988
1984    1.000   (0.658)  (0.599)  (0.540)  (0.483)
1985    0.460    1.000   (0.644)  (0.558)  (0.441)
1986    0.599    0.643    1.000   (0.602)  (0.537)
1987    0.540    0.546    0.610    1.000   (0.621)
1988    0.483    0.446    0.524    0.605    1.000

^a Estimates are BL's WNP-joint uniform estimates with k = 880. Estimates are from their Table 9, standard errors from their Table 10.
* Indicates significant at 95 percent level; ** indicates significant at 99 percent level, based on a two-tailed test.

TABLE 23.18 Unrestricted Five-Period Multivariate Probit Model (estimated standard errors in parentheses)

Coefficients        1984        1985        1986        1987        1988        Constrained
Constant           −1.802**    −2.080**    −2.630**    −1.721**    −1.729**    −1.797**
                   (0.532)     (0.519)     (0.542)     (0.534)     (0.523)     (0.341)
log Sales           0.167**     0.178**     0.274**     0.163**     0.130**     0.154**
                   (0.0538)    (0.0565)    (0.0584)    (0.0560)    (0.0519)    (0.0334)
Relative size       0.658**     1.280**     1.739**     1.085**     0.826**     0.953**
                   (0.323)     (0.330)     (0.431)     (0.351)     (0.263)     (0.160)
Imports             1.118**     0.923**     0.936**     1.091**     1.301**     1.155**
                   (0.377)     (0.361)     (0.370)     (0.338)     (0.342)     (0.228)
FDI                 2.070**     1.509*      3.759**     3.718**     3.834**     2.426**
                   (0.835)     (0.769)     (0.990)     (1.214)     (1.106)     (0.573)
Productivity       −2.615      −0.252      −5.565      −3.905      −0.981      −1.578
                   (4.110)     (3.802)     (3.537)     (3.188)     (2.057)     (1.216)
Raw material       −0.346      −0.357      −0.260       0.0261     −0.294      −0.292
                   (0.283)     (0.247)     (0.299)     (0.288)     (0.218)     (0.130)
Investment goods    0.239**     0.177*      0.0467      0.218*      0.280**     0.224**
                   (0.0864)    (0.0875)    (0.0891)    (0.0955)    (0.0923)    (0.0605)


Also noteworthy in Table 23.17 is the divergence of the random effects estimates from the other two sets. The log-likelihood function is −3535.55 for the random effects model and −3522.85 for the unrestricted model. The chi-squared statistic for the 9 restrictions of the equicorrelation model is 25.4. The critical value from the chi-squared table for 9 degrees of freedom is 16.9 for 95 percent and 21.7 for 99 percent significance, so the hypothesis of the random effects model would be rejected. Table 23.18 reports the coefficients of a fully unrestricted model that allows the coefficients to vary across the periods. (The correlations in parentheses in Table 23.17 are computed using this model.) There is a surprising amount of variation in the parameter vector. The log-likelihood for the full unrestricted model is −3494.57. The chi-squared statistic for testing the 32 restrictions of the homogeneity hypothesis is twice the difference, or 56.56. The critical value from the chi-squared table with 32 degrees of freedom is 46.19, so the hypothesis of homogeneity would be rejected statistically. It does seem questionable whether, in theoretical terms, the relationship estimated in this application should be this volatile, however.

23.10 ANALYSIS OF ORDERED CHOICES

Some multinomial-choice variables are inherently ordered. Examples that have appeared in the literature include the following:

1. Bond ratings 2. Results of taste tests 3. Opinion surveys 4. The assignment of military personnel to job classifications by skill level 5. Voting outcomes on certain programs 6. The level of insurance coverage taken by a consumer: none, part, or full 7. Employment: unemployed, part time, or full time

In each of these cases, although the outcome is discrete, the multinomial logit or probit model would fail to account for the ordinal nature of the dependent variable.48 Ordinary regression analysis would err in the opposite direction, however. Take the outcome of an opinion survey. If the responses are coded 0, 1, 2, 3, or 4, then linear regression would treat the difference between a 4 and a 3 the same as that between a 3 and a 2, whereas in fact they are only a ranking.

23.10.1 THE ORDERED PROBIT MODEL

The ordered probit and logit models have come into fairly wide use as a framework for analyzing such responses [McKelvey and Zavoina (1975)]. The model is built around a latent regression in the same manner as the binomial probit model. We begin with

$$y^* = \mathbf{x}'\boldsymbol\beta + \varepsilon.$$

48 In two papers, Beggs, Cardell, and Hausman (1981) and Hausman and Ruud (1986), the authors analyze a richer specification of the logit model when respondents provide their rankings of the full set of alternatives in addition to the identity of the most preferred choice. This application falls somewhere between the conditional logit model and the ones we shall discuss here in that, rather than provide a single choice among J either unordered or ordered alternatives, the consumer chooses one of the J! possible orderings of the set of unordered alternatives.


As usual, y∗ is unobserved. What we do observe is

$$\begin{aligned} y &= 0 \quad\text{if } y^* \le 0,\\ &= 1 \quad\text{if } 0 < y^* \le \mu_1,\\ &= 2 \quad\text{if } \mu_1 < y^* \le \mu_2,\\ &\;\;\vdots\\ &= J \quad\text{if } \mu_{J-1} \le y^*, \end{aligned}$$
which is a form of censoring. The $\mu$'s are unknown parameters to be estimated with $\boldsymbol\beta$. Consider, for example, an opinion survey. The respondents have their own intensity of feelings, which depends on certain measurable factors $\mathbf{x}$ and certain unobservable factors $\varepsilon$. In principle, they could respond to the questionnaire with their own $y^*$ if asked to do so. Given only, say, five possible answers, they choose the cell that most closely represents their own feelings on the question. As before, we assume that $\varepsilon$ is normally distributed across observations.49 For the same reasons as in the binomial probit model (which is the special case of $J = 1$), we normalize the mean and variance of $\varepsilon$ to zero and one. We then have the following probabilities:

$$\begin{aligned} \mathrm{Prob}(y = 0 \mid \mathbf{x}) &= \Phi(-\mathbf{x}'\boldsymbol\beta),\\ \mathrm{Prob}(y = 1 \mid \mathbf{x}) &= \Phi(\mu_1 - \mathbf{x}'\boldsymbol\beta) - \Phi(-\mathbf{x}'\boldsymbol\beta),\\ \mathrm{Prob}(y = 2 \mid \mathbf{x}) &= \Phi(\mu_2 - \mathbf{x}'\boldsymbol\beta) - \Phi(\mu_1 - \mathbf{x}'\boldsymbol\beta),\\ &\;\;\vdots\\ \mathrm{Prob}(y = J \mid \mathbf{x}) &= 1 - \Phi(\mu_{J-1} - \mathbf{x}'\boldsymbol\beta). \end{aligned}$$

For all the probabilities to be positive, we must have

$$0 < \mu_1 < \mu_2 < \cdots < \mu_{J-1}.$$

Figure 23.4 shows the implications of the structure. This is an extension of the univariate probit model we examined earlier. The log-likelihood function and its derivatives can be obtained readily, and optimization can be done by the usual means. As usual, the marginal effects of the regressors x on the probabilities are not equal to the coefficients. It is helpful to consider a simple example. Suppose there are three categories. The model thus has only one unknown threshold parameter. The three probabilities are

$$\begin{aligned} \mathrm{Prob}(y = 0 \mid \mathbf{x}) &= 1 - \Phi(\mathbf{x}'\boldsymbol\beta),\\ \mathrm{Prob}(y = 1 \mid \mathbf{x}) &= \Phi(\mu - \mathbf{x}'\boldsymbol\beta) - \Phi(-\mathbf{x}'\boldsymbol\beta),\\ \mathrm{Prob}(y = 2 \mid \mathbf{x}) &= 1 - \Phi(\mu - \mathbf{x}'\boldsymbol\beta). \end{aligned}$$

49 Other distributions, particularly the logistic, could be used just as easily. We assume the normal purely for convenience. The logistic and normal distributions generally give similar results in practice.


[FIGURE 23.4 Probabilities in the Ordered Probit Model. The figure plots the density $f(\varepsilon)$ and marks the cut points $-\boldsymbol\beta'\mathbf{x}$, $\mu_1 - \boldsymbol\beta'\mathbf{x}$, $\mu_2 - \boldsymbol\beta'\mathbf{x}$, and $\mu_3 - \boldsymbol\beta'\mathbf{x}$ that partition the $\varepsilon$ axis into the regions corresponding to $y = 0, 1, 2, 3,$ and $4$.]

For the three probabilities, the marginal effects of changes in the regressors are
$$\frac{\partial\, \mathrm{Prob}(y = 0 \mid \mathbf{x})}{\partial \mathbf{x}} = -\phi(\mathbf{x}'\boldsymbol\beta)\boldsymbol\beta,$$
$$\frac{\partial\, \mathrm{Prob}(y = 1 \mid \mathbf{x})}{\partial \mathbf{x}} = \left[\phi(-\mathbf{x}'\boldsymbol\beta) - \phi(\mu - \mathbf{x}'\boldsymbol\beta)\right]\boldsymbol\beta,$$
$$\frac{\partial\, \mathrm{Prob}(y = 2 \mid \mathbf{x})}{\partial \mathbf{x}} = \phi(\mu - \mathbf{x}'\boldsymbol\beta)\boldsymbol\beta.$$
Figure 23.5 illustrates the effect. The probability distributions of $y$ and $y^*$ are shown in the solid curve. Increasing one of the x's while holding $\boldsymbol\beta$ and $\mu$ constant is equivalent to shifting the distribution slightly to the right, which is shown as the dashed curve. The effect of the shift is unambiguously to shift some mass out of the leftmost cell. Assuming that $\beta$ is positive (for this x), $\mathrm{Prob}(y = 0 \mid \mathbf{x})$ must decline. Alternatively, from the previous expression, it is obvious that the derivative of $\mathrm{Prob}(y = 0 \mid \mathbf{x})$ has the opposite sign from $\beta$. By a similar logic, the change in $\mathrm{Prob}(y = 2 \mid \mathbf{x})$ [or $\mathrm{Prob}(y = J \mid \mathbf{x})$ in the general case] must have the same sign as $\beta$. Assuming that the particular $\beta$ is positive, we are shifting some probability into the rightmost cell. But what happens to the middle cell is ambiguous. It depends on the two densities. In the general case, relative to the signs of the coefficients, only the signs of the changes in $\mathrm{Prob}(y = 0 \mid \mathbf{x})$ and $\mathrm{Prob}(y = J \mid \mathbf{x})$ are unambiguous! The upshot is that we must be very careful in interpreting the coefficients in this model. Indeed, without a fair amount of extra calculation, it is quite unclear how the coefficients in the ordered probit model should be interpreted.50

50 This point seems uniformly to be overlooked in the received literature. Authors often report coefficients and t ratios, occasionally with some commentary about significant effects, but rarely suggest upon what or in what direction those effects are exerted.


[FIGURE 23.5 Effects of Change in x on Predicted Probabilities.]
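The three-outcome probabilities and marginal effects above translate directly into code; a sketch with illustrative parameter values (not estimates from the text):

```python
# Three-category ordered probit: probabilities and marginal effects.
import numpy as np
from scipy.stats import norm

def ordered_probit_3(x, beta, mu):
    xb = x @ beta
    probs = np.array([1 - norm.cdf(xb),
                      norm.cdf(mu - xb) - norm.cdf(-xb),
                      1 - norm.cdf(mu - xb)])
    dp0 = -norm.pdf(xb) * beta
    dp1 = (norm.pdf(-xb) - norm.pdf(mu - xb)) * beta
    dp2 = norm.pdf(mu - xb) * beta
    return probs, np.vstack([dp0, dp1, dp2])

probs, effects = ordered_probit_3(np.array([1.0, 0.5]), np.array([0.2, 0.6]), 1.0)
print(probs.sum())            # probabilities sum to one
print(effects.sum(axis=0))    # marginal effects sum to zero, as noted in the text
```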

Example 23.17 Rating Assignments
Marcus and Greene (1985) estimated an ordered probit model for the job assignments of new Navy recruits. The Navy attempts to direct recruits into job classifications in which they will be most productive. The broad classifications the authors analyzed were technical jobs with three clearly ranked skill ratings: "medium skilled," "highly skilled," and "nuclear qualified/highly skilled." Because the assignment is partly based on the Navy's own assessment and needs and partly on factors specific to the individual, an ordered probit model was used with the following determinants: (1) ENSPA = a dummy variable indicating that the individual entered the Navy with an "A school" (technical training) guarantee, (2) EDMA = educational level of the entrant's mother, (3) AFQT = score on the Armed Forces Qualifying Test, (4) EDYRS = years of education completed by the trainee, (5) MARR = a dummy variable indicating that the individual was married at the time of enlistment, and (6) AGEAT = trainee's age at the time of enlistment. (The data used in this study are not available for distribution.) The sample size was 5,641. The results are reported in Table 23.19. The extremely large t ratio on the AFQT score is to be expected, as it is a primary sorting device used to assign job classifications.

TABLE 23.19 Estimated Rating Assignment Equation

Variable     Estimate    t Ratio    Mean of Variable
Constant     −4.34       —          —
ENSPA         0.057      1.7        0.66
EDMA          0.007      0.8        12.1
AFQT          0.039      39.9       71.2
EDYRS         0.190      8.7        12.1
MARR         −0.48      −9.0        0.08
AGEAT         0.0015     0.1        18.8
μ             1.79       80.8       —


TABLE 23.20 Marginal Effect of a Binary Variable

             −β̂'x̄       μ̂ − β̂'x̄    Prob[y = 0]   Prob[y = 1]   Prob[y = 2]
MARR = 0    −0.8863     0.9037      0.187         0.629         0.184
MARR = 1    −0.4063     1.3837      0.342         0.574         0.084
Change                              0.155        −0.055        −0.100

To obtain the marginal effects of the continuous variables, we require the standard normal density evaluated at $-\bar{\mathbf{x}}'\hat{\boldsymbol\beta} = -0.8479$ and $\hat\mu - \bar{\mathbf{x}}'\hat{\boldsymbol\beta} = 0.9421$. The predicted probabilities are $\Phi(-0.8479) = 0.198$, $\Phi(0.9421) - \Phi(-0.8479) = 0.628$, and $1 - \Phi(0.9421) = 0.174$. (The actual frequencies were 0.25, 0.52, and 0.23.) The two densities are $\phi(-0.8479) = 0.278$ and $\phi(0.9421) = 0.255$. Therefore, the derivatives of the three probabilities with respect to AFQT, for example, are
$$\frac{\partial P_0}{\partial \mathit{AFQT}} = (-0.278)(0.039) = -0.01084,$$
$$\frac{\partial P_1}{\partial \mathit{AFQT}} = (0.278 - 0.255)(0.039) = 0.0009,$$
$$\frac{\partial P_2}{\partial \mathit{AFQT}} = (0.255)(0.039) = 0.00995.$$
Note that the marginal effects sum to zero, which follows from the requirement that the probabilities add to one. This approach is not appropriate for evaluating the effect of a dummy variable. We can analyze a dummy variable by comparing the probabilities that result when the variable takes its two different values with those that occur with the other variables held at their sample means. For example, for the MARR variable, we have the results given in Table 23.20.
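The AFQT computation can be reproduced directly from the reported index values:

```python
# Reproducing the AFQT marginal effects from Example 23.17.
from scipy.stats import norm
xb, mu_xb = -0.8479, 0.9421                  # -x_bar'beta_hat and mu_hat - x_bar'beta_hat
b_afqt = 0.039
d0, d1 = norm.pdf(xb), norm.pdf(mu_xb)       # approx. 0.278 and 0.255
print(-d0 * b_afqt, (d0 - d1) * b_afqt, d1 * b_afqt)   # -0.0108, 0.0009, 0.0099
```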

23.10.2 BIVARIATE ORDERED PROBIT MODELS

There are several extensions of the ordered probit model that follow the logic of the bivariate probit model we examined in Section 23.8. A direct analog to the base case two-equation model is used in the study in Example 23.18.

Example 23.18 Calculus and Intermediate Economics Courses
Butler et al. (1994) analyzed the relationship between the level of calculus attained and grades in intermediate economics courses for a sample of Vanderbilt students. The two-step estimation approach involved the following strategy. (We are stylizing the precise formulation a bit to compress the description.) Step 1 involved a direct application of the ordered probit model of Section 23.10.1 to the level of calculus achievement, which is coded 0, 1, ..., 6:
$$m_i^* = \mathbf{x}_i'\boldsymbol\beta + \varepsilon_i, \qquad \varepsilon_i \mid \mathbf{x}_i \sim N[0, 1],$$
$$m_i = 0 \ \text{if } -\infty < m_i^* \le 0,\quad m_i = 1 \ \text{if } 0 < m_i^* \le \mu_1,\ \ldots,\ m_i = 6 \ \text{if } \mu_5 < m_i^* < +\infty.$$
The authors argued that although the various calculus courses can be ordered discretely by the material covered, the differences between the levels cannot be measured directly. Thus, this is an application of the ordered probit model. The independent variables in this first step model included SAT scores, foreign language proficiency, indicators of intended major, and several other variables related to areas of study.


The second step of the estimator involves regression analysis of the grade in the intermediate microeconomics or macroeconomics course. Grades in these courses were translated to a granular continuous scale (A = 4.0, A− = 3.7, etc.). A linear regression is specified,
$$\mathit{Grade}_i = \mathbf{z}_i'\boldsymbol\delta + u_i, \qquad \text{where } u_i \mid \mathbf{z}_i \sim N\left[0, \sigma_u^2\right].$$
Independent variables in this regression include, among others, (1) dummy variables for which outcome in the ordered probit model applies to the student (with the zero reference case omitted), (2) grade in the last calculus course, (3) several other variables related to prior courses, (4) class size, (5) freshman GPA, etc. The unobservables in the Grade equation and the math attainment are clearly correlated, a feature captured by the additional assumption that $(\varepsilon_i, u_i \mid \mathbf{x}_i, \mathbf{z}_i) \sim N_2[(0, 0), (1, \sigma_u^2), \rho\sigma_u]$. A nonzero ρ captures this "selection" effect. With this in place, the dummy variables in (1) have now become endogenous. The solution is a "selection" correction that we will examine in detail in Chapter 24. The modified equation becomes
$$\mathit{Grade}_i \mid m_i = \mathbf{z}_i'\boldsymbol\delta + E[u_i \mid m_i] + v_i = \mathbf{z}_i'\boldsymbol\delta + (\rho\sigma_u)\left[\lambda(\mathbf{x}_i'\boldsymbol\beta, \mu_1, \ldots, \mu_5)\right] + v_i.$$
They thus adopt a "control function" approach to accommodate the endogeneity of the math attainment dummy variables. [See Section 23.7 and (23-43) for another application of this method.] The term $\lambda(\mathbf{x}_i'\boldsymbol\beta, \mu_1, \ldots, \mu_5)$ is a generalized residual that is constructed using the estimates from the first-stage ordered probit model. [A precise statement of the form of this variable is given in Li and Tobias (2006).] Linear regression of the course grade on $\mathbf{z}_i$ and this constructed regressor is computed at the second step. The standard errors at the second step must be corrected for the use of the estimated regressor using what amounts to a Murphy and Topel (2002) correction. (See Section 16.7.)
Li and Tobias (2006), in a replication of and comment on Butler et al. (1994), after roughly replicating the classical estimation results with a Bayesian estimator, observe that the Grade equation above could also be treated as an ordered probit model. The resulting bivariate ordered probit model would be
$$m_i^* = \mathbf{x}_i'\boldsymbol\beta + \varepsilon_i \quad\text{and}\quad g_i^* = \mathbf{z}_i'\boldsymbol\delta + u_i,$$
with
$$m_i = 0 \ \text{if } -\infty < m_i^* \le 0, \qquad\quad g_i = 0 \ \text{if } -\infty < g_i^* \le 0,$$
$$m_i = 1 \ \text{if } 0 < m_i^* \le \mu_1, \qquad\quad\ g_i = 1 \ \text{if } 0 < g_i^* \le \alpha_1,$$
$$\cdots \qquad\qquad\qquad\qquad\qquad\quad \cdots$$
$$m_i = 6 \ \text{if } \mu_5 < m_i^* < +\infty, \qquad g_i = 11 \ \text{if } \alpha_9 < g_i^* < +\infty,$$
where $(\varepsilon_i, u_i \mid \mathbf{x}_i, \mathbf{z}_i) \sim N_2[(0, 0), (1, \sigma_u^2), \rho\sigma_u]$. Li and Tobias extended their analysis to this case simply by "transforming" the dependent variable in Butler et al.'s second equation. Computing the log-likelihood using sets of bivariate normal probabilities is fairly straightforward for the bivariate ordered probit model. [See Greene (2007).] However, the classical study of these data using the bivariate ordered approach remains to be done, so a side-by-side comparison to Li and Tobias's Bayesian alternative estimator is not possible. The endogeneity of the calculus dummy variables in (1) remains a feature of the model, so both the MLE and the Bayesian posterior are less straightforward than they might appear. Whether the results in Section 23.8.4 on the recursive bivariate probit model extend to this case also remains to be determined.
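For the bivariate ordered probit model, each cell probability is a double difference of $\Phi_2$ values. A sketch follows; all inputs are hypothetical, and the threshold vectors include the $-\infty$, 0, and $+\infty$ endpoints.

```python
# Cell probability Prob[m = j, g = k] for a bivariate ordered probit model.
import numpy as np
from scipy.stats import multivariate_normal

def cell_prob(j, k, xb, zd, mu, alpha, rho):
    """mu and alpha are full threshold vectors, e.g. mu = [-inf, 0, mu_1, ..., +inf]."""
    mvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
    def F(a, b):
        if (np.isinf(a) and a < 0) or (np.isinf(b) and b < 0):
            return 0.0                                   # cdf is zero at -infinity
        return mvn.cdf([min(a, 37.0), min(b, 37.0)])     # cap +inf for the integrator
    return (F(mu[j + 1] - xb, alpha[k + 1] - zd) - F(mu[j] - xb, alpha[k + 1] - zd)
            - F(mu[j + 1] - xb, alpha[k] - zd) + F(mu[j] - xb, alpha[k] - zd))
```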

The bivariate ordered probit model has been applied in a number of settings in the recent empirical literature, including husband and wife's education levels [Magee et al. (2000)], family size [Calhoun (1991)], and many others. In two early contributions to the field of pet econometrics, Butler and Chatterjee analyze ownership of cats and dogs (1995) and dogs and televisions (1997).


23.10.3 PANEL DATA APPLICATIONS The ordered probit model is used to model discrete scales that represent indicators of a continuous underlying variable such as strength of preference, performance, or level of attainment. Many of the recently assembled national panel data sets contain survey questions that ask about subjective assessments of health, satisfaction, or well-being, all of which are applications of this interpretation. Examples include: • The European Community Household Panel (ECHP) includes questions about job satisfaction [see D’Addio (2004)]. • The British Household Panel Survey (BHPS) includes questions about health status [see Contoyannis et al. (2004)]. • The German Socioeconomic Household Panel (GSOEP) includes questions about subjective well being [see Winkelmann (2004)] and subjective assessment of health satisfaction [see Riphahn et al. (2003) and Example 23.19.] Ostensibly, the applications would fit well into the ordered probit frameworks already described. However, given the panel nature of the data, it will be desirable to augment the model with some accommodation of the individual heterogeneity that is likely to be present. The two standard models, fixed and random effects, have both been applied to the analyses of these survey data.

23.10.3.a Ordered Probit Models with Fixed Effects
D'Addio et al. (2003), using methodology developed by Frijters et al. (2004) and Ferrer-i-Carbonel et al. (2004), analyzed survey data on job satisfaction using the Danish component of the European Community Household Panel. Their estimator for an ordered logit model is built around the logic of Chamberlain's estimator for the binary logit model. [See Section 23.5.2.] Because the approach is robust to individual specific threshold parameters and allows time-invariant variables, it differs sharply from the fixed effects models we have considered thus far as well as from the ordered probit model of Section 23.10.1.51 Unlike Chamberlain's estimator for the binary logit model, however, their conditional estimator is not a function of minimal sufficient statistics. As such, the incidental parameters problem remains an issue.
Das and van Soest (2000) proposed a somewhat simpler approach. [See, as well, Long's (1997) discussion of the "parallel regressions assumption," which employs this device in a cross-section framework.] Consider the base case ordered logit model with fixed effects,

y_it* = α_i + x_it'β + ε_it,   ε_it | X_i ~ N[0, 1],
y_it = j  if μ_{j−1} < y_it* < μ_j,  j = 0, 1, ..., J,  and  μ_{−1} = −∞, μ_0 = 0, μ_J = +∞.

The model assumptions imply that

Prob(y_it = j | X_i) = Λ(μ_j − α_i − x_it'β) − Λ(μ_{j−1} − α_i − x_it'β),

where Λ(t) is the cdf of the logistic distribution. Now, define a binary variable

w_it,j = 1  if y_it > j,  j = 0, ..., J − 1.

51 Cross-section versions of the ordered probit model with individual specific thresholds appear in Terza (1985a), Pudney and Shields (2000), and Greene (2007).


It follows that

Prob[w_it,j = 1 | X_i] = Λ(α_i − μ_j + x_it'β)
                       = Λ(θ_i + x_it'β).

The "j" specific constant, which is the same for all individuals, is absorbed in θ_i. Thus, a fixed effects binary logit model applies to each of the J − 1 binary random variables, w_it,j. The method in Section 23.5.2 can now be applied to each of the J − 1 random samples. This provides J − 1 estimators of the parameter vector β (but no estimator of the threshold parameters). The authors propose to reconcile these different estimators by using a minimum distance estimator of the common true β. (See Section 15.3.) The minimum distance estimator at the second step is chosen to minimize

q = Σ_{j=0}^{J−1} Σ_{m=0}^{J−1} (β̂_j − β)' [V^{−1}]_{jm} (β̂_m − β),

where [V^{−1}]_{jm} is the (j, m) block of the inverse of the (J − 1)K × (J − 1)K partitioned matrix V that contains Asy. Cov[β̂_j, β̂_m]. The appropriate form of this matrix for a set of cross-section estimators is given in Brant (1990). Das and van Soest (2000) used the counterpart for Chamberlain's fixed effects estimator but do not provide the specifics for computing the off-diagonal blocks in V.
The full ordered probit model with fixed effects, including the individual specific constants, can be estimated by unconditional maximum likelihood using the results in Section 16.9.6.c. The likelihood function is concave [see Pratt (1981)], so despite its superficial complexity, the estimation is straightforward. (In the following application, with more than 27,000 observations and 7,293 individual effects, estimation of the full model required roughly five seconds of computation.) No theoretical counterpart to the Hsiao (1986, 2003) and Abrevaya (1997) results on the small T bias (incidental parameters problem) of the MLE in the presence of fixed effects has been derived for the ordered probit model. The Monte Carlo results in Greene (2004) (see, as well, Chapter 17) suggest that biases comparable to those in the binary choice models persist in the ordered probit model as well. As in the binary choice case, the complication of the fixed effects model is the small sample bias, not the computation. The Das and van Soest approach finesses this problem (their estimator is consistent) but at the cost of losing the information needed to compute partial effects or predicted probabilities.
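A compact sketch of the Das and van Soest device: split the ordered outcome into binary variables, fit each binary logit, then combine the estimates with the minimum distance criterion above. The helper names are ours, and the stacked covariance matrix V described in the text is treated as given.

```python
import numpy as np

def binarize(y, n_cat):
    """w_j = 1[y > j], one binary variable per interior threshold; each w_j
    follows a fixed effects binary logit (Section 23.5.2)."""
    return {j: (y > j).astype(int) for j in range(n_cat - 1)}

def minimum_distance(beta_hats, V):
    """Minimize q = sum_j sum_m (b_j - beta)'[V^{-1}]_{jm}(b_m - beta).
    beta_hats: list of K-vectors; V: their stacked joint covariance matrix.
    The solution is the GLS-type matrix-weighted average computed below."""
    K = beta_hats[0].size
    b = np.concatenate(beta_hats)                 # stack the estimates
    G = np.tile(np.eye(K), (len(beta_hats), 1))   # each block estimates beta
    W = np.linalg.inv(V)
    A = G.T @ W @ G
    return np.linalg.solve(A, G.T @ W @ b), np.linalg.inv(A)
```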

23.10.3.b Ordered Probit Models with Random Effects
The random effects ordered probit model has been much more widely used than the fixed effects model. Applications include Groot and van den Brink (2003), who studied training levels of employees, with firm effects; Winkelmann (2003b), who examined subjective measures of well being with individual and family effects; Contoyannis et al. (2004), who analyzed self-reported measures of health status; and numerous others. In the simplest case, the Butler and Moffitt (1982) quadrature method (Section 16.9.6.b) can be extended to this model.

Example 23.19 Health Satisfaction
The GSOEP German Health Care data that we have used in Examples 11.11, 16.16, and others includes a self-reported measure of health satisfaction, HSAT, that takes values 0, 1, ..., 10.52 This is a typical application of a scale variable that reflects an underlying continuous variable, "health." The frequencies and sample proportions for the reported values are as follows:

HSAT    Frequency    Proportion
 0         447          1.6%
 1         255          0.9%
 2         642          2.3%
 3        1173          4.2%
 4        1390          5.0%
 5        4233         15.4%
 6        2530          9.2%
 7        4231         15.4%
 8        6172         22.5%
 9        3061         11.2%
10        3192         11.6%

We have fit pooled and panel data versions of the ordered probit model to these data. The model used is

y_it* = β_1 + β_2 Age_it + β_3 Income_it + β_4 Kids_it + β_5 Education_it + β_6 Married_it + β_7 Working_it + ε_it + c_i,

where c_i will be the common fixed or random effect. (We are interested in comparing the fixed and random effects estimators, so we have not included any time-invariant variables such as gender in the equation.) Table 23.21 lists five estimated models. (Standard errors for the estimated threshold parameters are omitted.) The first is the pooled ordered probit model. The second and third are fixed effects. Column 2 shows the unconditional fixed effects estimates using the results of Section 16.9.6.c. Column 3 shows the Das and van Soest estimator. For the minimum distance estimator, we used an inefficient weighting matrix, the block-diagonal matrix in which the jth block is the inverse of the jth asymptotic covariance matrix for the individual logit estimators. With this weighting matrix, the estimator is

β̂_MDE = [Σ_{j=0}^{9} V_j^{−1}]^{−1} [Σ_{j=0}^{9} V_j^{−1} β̂_j],

and the estimator of the asymptotic covariance matrix is approximately equal to the bracketed inverse matrix. The fourth set of results is the random effects estimator computed using the maximum simulated likelihood method. This model can be estimated using Butler and Moffitt's quadrature method; however, we found that even with a large number of nodes, the quadrature estimator converged to a point where the log-likelihood was far lower than the MSL estimator, and at parameter values that were implausibly different from the other estimates. Using different starting values and different numbers of quadrature points did not change this outcome. The MSL estimator for a random constant term (see Section 17.5) is considerably slower, but produces more reasonable results. The fifth set of results is the Mundlak form of the random effects model, which includes the group means in the models as controls to accommodate possible correlation between the latent heterogeneity and the included variables. As noted in Example 23.17, the components of the ordered choice model must be interpreted with some care. By construction, the partial effects of the variables on the probabilities of the outcomes must change sign, so the simple coefficients do not show the complete picture implied by the estimated model. Table 23.22 shows the partial effects for the pooled model to illustrate the computations.
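With the block-diagonal weighting matrix just described, the estimator collapses to a matrix-weighted average. A minimal numpy sketch, with illustrative names (the V_j are the per-threshold covariance matrices):

```python
import numpy as np

def mde_block_diagonal(beta_hats, V_blocks):
    """beta_MDE = [sum_j V_j^{-1}]^{-1} [sum_j V_j^{-1} b_j]; the bracketed
    inverse approximates the asymptotic covariance matrix."""
    W = [np.linalg.inv(Vj) for Vj in V_blocks]
    A = np.linalg.inv(sum(W))
    beta = A @ sum(Wj @ bj for Wj, bj in zip(W, beta_hats))
    return beta, A
```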

52 In the original data set, 40 (of 27,326) observations on this variable were coded with noninteger values between 6 and 7. For purposes of our example, we have recoded all 40 observations to 7.


TABLE 23.21 Estimated Ordered Probit Models for Health Satisfaction
                (1)           (2) Fixed       (3) Fixed      (4)            (5) Random Effects,
                              Effects         Effects        Random         Mundlak Controls
Variable        Pooled        Unconditional   Conditional    Effects        Variables     Means
Constant      2.4739                                         3.8577         3.2603
             (0.04669)                                      (0.05072)      (0.05323)
Age          −0.01913      −0.07162         −0.1011        −0.03319       −0.06282       0.03940
             (0.00064)     (0.002743)       (0.002878)     (0.00065)      (0.00234)     (0.002442)
Income        0.1811        0.2992           0.4353         0.09436        0.2618        0.1461
             (0.03774)     (0.07058)        (0.07462)      (0.03632)      (0.06156)     (0.07695)
Kids          0.06081      −0.06385         −0.1170         0.01410       −0.05458       0.1854
             (0.01459)     (0.02837)        (0.03041)      (0.01421)      (0.02566)     (0.03129)
Education     0.03421       0.02590          0.06013        0.04728        0.02296       0.02257
             (0.002828)    (0.02677)        (0.02819)      (0.002863)     (0.02793)     (0.02807)
Married       0.02574       0.05157          0.08505        0.07327        0.04605      −0.04829
             (0.01623)     (0.04030)        (0.04181)      (0.01575)      (0.03506)     (0.03963)
Working       0.1292       −0.02659         −0.007969       0.07108       −0.02383       0.2702
             (0.01403)     (0.02758)        (0.02830)      (0.01338)      (0.02311)     (0.02856)
μ1            0.1949        0.3249                          0.2726         0.2752
μ2            0.5029        0.8449                          0.7060         0.7119
μ3            0.8411        1.3940                          1.1778         1.1867
μ4            1.111         1.8230                          1.5512         1.5623
μ5            1.6700        2.6992                          2.3244         2.3379
μ6            1.9350        3.1272                          2.6957         2.7097
μ7            2.3468        3.7923                          3.2757         3.2911
μ8            3.0023        4.8436                          4.1967         4.2168
μ9            3.4615        5.5727                          4.8308         4.8569
σu            0.0000        0.0000                          1.0078         0.9936
ln L       −56813.52     −41875.63                       −53215.54      −53070.43

Winkelmann (2003b) used the random effects approach to analyze the subjective well being (SWB) question (also coded 0 to 10) in the German Socioeconomic Panel (GSOEP) data set. The ordered probit model in this study is based on the latent regression

y_imt* = x_imt'β + ε_imt + u_im + v_i.

TABLE 23.22 Estimated Marginal Effects: Pooled Model
HSAT     Age       Income     Kids      Education   Married    Working
 0      0.0006    −0.0061   −0.0020    −0.0012     −0.0009    −0.0046
 1      0.0003    −0.0031   −0.0010    −0.0006     −0.0004    −0.0023
 2      0.0008    −0.0072   −0.0024    −0.0014     −0.0010    −0.0053
 3      0.0012    −0.0113   −0.0038    −0.0021     −0.0016    −0.0083
 4      0.0012    −0.0111   −0.0037    −0.0021     −0.0016    −0.0080
 5      0.0024    −0.0231   −0.0078    −0.0044     −0.0033    −0.0163
 6      0.0008    −0.0073   −0.0025    −0.0014     −0.0010    −0.0050
 7      0.0003    −0.0024   −0.0009    −0.0005     −0.0003    −0.0012
 8     −0.0019     0.0184    0.0061     0.0035      0.0026     0.0136
 9     −0.0021     0.0198    0.0066     0.0037      0.0028     0.0141
10     −0.0035     0.0336    0.0114     0.0063      0.0047     0.0233


The independent variables include age, gender, employment status, income, family size, and an indicator for good health. An unusual feature of the model is the nested random effects (see Section 9.7.1), which include a family effect, v_i, as well as the individual family member (i in family m) effect, u_im. The GLS/MLE approach we applied to the linear regression model in Section 9.7.1 is unavailable in this nonlinear setting. Winkelmann instead employed a Hermite quadrature procedure to maximize the log-likelihood function.
Contoyannis, Jones, and Rice (2004) analyzed a self-assessed health scale that ranged from 1 (very poor) to 5 (excellent) in the British Household Panel Survey. Their model accommodated a variety of complications in survey data. The latent regression underlying their ordered probit model is

h_it* = x_it'β + H_{i,t−1}'γ + α_i + ε_it,

where x_it includes marital status, race, education, household size, age, income, and number of children in the household. The lagged value, H_{i,t−1}, is a set of binary variables for the observed health status in the previous period. (This is the same device that was used by Butler et al. in Example 23.18.) In this case, the lagged values capture state dependence; the assumption that the health outcome is redrawn randomly in each period is inconsistent with evident runs in the data. The initial formulation of the regression is a fixed effects model. To control for the possible correlation between the effects, α_i, and the regressors, and the initial conditions problem that helps to explain the state dependence, they use a hybrid of Mundlak's (1978) correction and a suggestion by Wooldridge (2002a) for modeling the initial conditions,

α_i = α_0 + x̄_i'α_1 + H_{i,1}'δ + u_i,

where u_i is exogenous. Inserting the second equation into the first produces a random effects model that can be fit using the quadrature method we considered earlier.

23.11 MODELS FOR UNORDERED MULTIPLE CHOICES

Some studies of multiple-choice settings include the following:

1. Hensher (1986, 1991), McFadden (1974), and many others have analyzed the travel mode of urban commuters.
2. Schmidt and Strauss (1975a,b) and Boskin (1974) have analyzed occupational choice among multiple alternatives.
3. Terza (1985a) has studied the assignment of bond ratings to corporate bonds as a choice among multiple alternatives.
4. Rossi and Allenby (1999, 2003) studied consumer brand choices in a repeated choice (panel data) model.
5. Train (2003) studied the choice of electricity supplier by a sample of California electricity customers.
6. Hensher, Rose, and Greene (2006) analyzed choices of automobile models by a sample of consumers offered a hypothetical menu of features.


These are all distinct from the multivariate probit model we examined earlier. In that setting, there were several decisions, each between two alternatives. Here there is a single decision among two or more alternatives. We will encounter two broad types of choice sets, ordered choice models and unordered choice models. The choice among means of getting to work (by car, bus, train, or bicycle) is clearly unordered. A bond rating is, by design, a ranking; that is its purpose. Quite different techniques are used for the two types of models. We examined models for ordered choices in Section 23.10. This section will examine models for unordered choice sets.53 General references on the topics discussed here include Hensher, Louviere, and Swait (2000); Train (2003); and Hensher, Rose, and Greene (2006).
Unordered choice models can be motivated by a random utility model. For the ith consumer faced with J choices, suppose that the utility of choice j is

U_ij = z_ij'θ + ε_ij.

If the consumer makes choice j in particular, then we assume that U_ij is the maximum among the J utilities. Hence, the statistical model is driven by the probability that choice j is made, which is

Prob(U_ij > U_ik)  for all other k ≠ j.

The model is made operational by a particular choice of distribution for the disturbances. As in the binary choice case, two models are usually considered, logit and probit. Because of the need to evaluate multiple integrals of the normal distribution, the probit model has found rather limited use in this setting. The logit model, in contrast, has been widely used in many fields, including economics, market research, politics, finance, and transportation engineering. Let Y_i be a random variable that indicates the choice made. McFadden (1974a) has shown that if (and only if) the J disturbances are independent and identically distributed with Gumbel (type 1 extreme value) distribution,

F(ε_ij) = exp(−exp(−ε_ij)),                                          (23-48)

then

Prob(Y_i = j) = exp(z_ij'θ) / Σ_{j=1}^{J} exp(z_ij'θ),               (23-49)

which leads to what is called the conditional logit model. (It is often labeled the multinomial logit model, but this wording conflicts with the usual name for the model discussed in the next section, which differs slightly. Although the distinction turns out to be purely artificial, we will maintain it for the present.)
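McFadden's result is easy to verify numerically: simulate iid type 1 extreme value disturbances, let each simulated "consumer" pick the utility-maximizing alternative, and compare the empirical choice shares with (23-49). The utilities below are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
J, R = 4, 200_000
v = np.array([0.5, 1.0, 0.0, -0.3])            # z'theta for the J choices (made up)

eps = rng.gumbel(size=(R, J))                   # F(e) = exp(-exp(-e)), as in (23-48)
shares = np.bincount(np.argmax(v + eps, axis=1), minlength=J) / R
logit = np.exp(v) / np.exp(v).sum()             # (23-49)
print(np.round(shares, 3), np.round(logit, 3))  # the two vectors agree closely
```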

53 A hybrid case occurs in which consumers reveal their own specific ordering for the choices in an unordered choice set. Beggs, Cardell, and Hausman (1981) studied consumers' rankings of different automobile types, for example.


Utility depends on z_ij, which includes aspects specific to the individual as well as to the choices. It is useful to distinguish them. Let z_ij = [x_ij, w_i] and partition θ conformably into [β', α']'. Then x_ij varies across the choices and possibly across the individuals as well. The components of x_ij are typically called the attributes of the choices. But w_i contains the characteristics of the individual and is, therefore, the same for all choices. If we incorporate this fact in the model, then (23-49) becomes

Prob(Y_i = j) = exp(x_ij'β + w_i'α) / Σ_{j=1}^{J} exp(x_ij'β + w_i'α)
             = [exp(x_ij'β)] exp(w_i'α) / ([Σ_{j=1}^{J} exp(x_ij'β)] exp(w_i'α)).    (23-50)

Terms that do not vary across alternatives (that is, those specific to the individual) fall out of the probability. Evidently, if the model is to allow individual specific effects, then it must be modified. One method is to create a set of dummy variables, A_j, for the choices and multiply each of them by the common w. We then allow the coefficient to vary across the choices instead of the characteristics. Analogously to the linear model, a complete set of interaction terms creates a singularity, so one of them must be dropped. For example, a model of a shopping center choice by individuals in various cities might specify that the choice depends on attributes of the shopping centers such as number of stores, S_ij, and distance from the central business district, D_ij, and income, which varies across individuals but not across the choices. Suppose that there were three choices in each city. The three attribute/characteristic vectors would be as follows:

Choice 1:  Stores   Distance   Income   0
Choice 2:  Stores   Distance   0        Income
Choice 3:  Stores   Distance   0        0

The probabilities for this model would be

Prob(Y_i = j) = exp(β_1 S_ij + β_2 D_ij + α_1 A_1 Income_i + α_2 A_2 Income_i + α_3 A_3 Income_i)
              / Σ_{j=1}^{3} exp(β_1 S_ij + β_2 D_ij + α_1 A_1 Income_i + α_2 A_2 Income_i + α_3 A_3 Income_i),   α_3 = 0.

The nonexperimental data sets typically analyzed by economists do not contain mixtures of individual- and choice-specific attributes. Such data would be far too costly to gather for most purposes. When they do, the preceding framework can be used. For the present, it is useful to examine the two types of data separately and consider aspects of the model that are specific to the two types of applications.

23.11.1 THE MULTINOMIAL LOGIT MODEL
To set up the model that applies when data are individual specific, it will help to consider an example. Schmidt and Strauss (1975a, b) estimated a model of occupational choice based on a sample of 1,000 observations drawn from the Public Use Sample for three years: 1960, 1967, and 1970. For each sample, the data for each individual in the sample consist of the following:
1. Occupation: 0 = menial, 1 = blue collar, 2 = craft, 3 = white collar, 4 = professional. (Note the slightly different numbering convention, starting at zero, which is standard.)
2. Characteristics: constant, education, experience, race, sex.
The model for occupational choice is

Prob(Y_i = j | w_i) = exp(w_i'α_j) / Σ_{j=0}^{4} exp(w_i'α_j),  j = 0, 1, ..., 4.    (23-51)


(The binomial logit model in Sections 23.3 and 23.4 is conveniently produced as the special case of J = 1.)
The model in (23-51) is a multinomial logit model.54 The estimated equations provide a set of probabilities for the J + 1 choices for a decision maker with characteristics w_i. Before proceeding, we must remove an indeterminacy in the model. If we define α_j* = α_j + q for any vector q, then recomputing the probabilities using α_j* instead of α_j produces the identical set of probabilities because all the terms involving q drop out. A convenient normalization that solves the problem is α_0 = 0. (This arises because the probabilities sum to one, so only J parameter vectors are needed to determine the J + 1 probabilities.) Therefore, the probabilities are

Prob(Y_i = j | w_i) = P_ij = exp(w_i'α_j) / [1 + Σ_{k=1}^{J} exp(w_i'α_k)],  j = 0, 1, ..., J,  α_0 = 0.    (23-52)

The form of the binomial model examined in Section 23.4 results if J = 1. The model implies that we can compute J log-odds ratios

ln[P_ij / P_ik] = w_i'(α_j − α_k) = w_i'α_j  if k = 0.

From the point of view of estimation, it is useful that the odds ratio, P_ij/P_ik, does not depend on the other choices, which follows from the independence of the disturbances in the original model. From a behavioral viewpoint, this fact is not very attractive. We shall return to this problem in Section 23.11.3.
The log-likelihood can be derived by defining, for each individual, d_ij = 1 if alternative j is chosen by individual i, and 0 if not, for the J + 1 possible outcomes. Then, for each i, one and only one of the d_ij's is 1. The log-likelihood is a generalization of that for the binomial probit or logit model:

ln L = Σ_{i=1}^{n} Σ_{j=0}^{J} d_ij ln Prob(Y_i = j | w_i).

The derivatives have the characteristically simple form

∂ln L/∂α_j = Σ_{i=1}^{n} (d_ij − P_ij) w_i  for j = 1, ..., J.

The exact second derivatives matrix has J² K × K blocks,55

∂²ln L/∂α_j ∂α_l' = −Σ_{i=1}^{n} P_ij [1(j = l) − P_il] w_i w_i',

where 1(j = l) equals 1 if j equals l and 0 if not. Because the Hessian does not involve d_ij, these are the expected values, and Newton's method is equivalent to the method of scoring.
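The pieces above translate directly into code. A minimal sketch of the probabilities in (23-52) and the score, under illustrative shape conventions of our choosing (alpha is J x K for outcomes 1, ..., J, with alpha_0 = 0 imposed):

```python
import numpy as np

def mnl_loglike_and_score(alpha, W, y):
    """W: (n, K) characteristics; y: (n,) chosen outcome in {0, ..., J}."""
    n = W.shape[0]
    util = np.column_stack([np.zeros(n), W @ alpha.T])   # outcome 0 normalized
    util -= util.max(axis=1, keepdims=True)              # guard against overflow
    P = np.exp(util)
    P /= P.sum(axis=1, keepdims=True)                    # P[i, j] = Prob(Y_i = j)
    lnL = np.log(P[np.arange(n), y]).sum()
    D = np.zeros_like(P)
    D[np.arange(n), y] = 1.0                             # d_ij indicators
    score = (D - P)[:, 1:].T @ W      # rows: dlnL/dalpha_j, j = 1, ..., J
    return lnL, P, score
```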

54 Nerlove and Press (1973).
55 If the data were in the form of proportions, such as market shares, then the appropriate log-likelihood and derivatives are Σ_i Σ_j n_i p_ij ln P_ij and Σ_i Σ_j n_i (p_ij − P_ij) w_i, respectively. The terms in the Hessian are multiplied by n_i.


It is worth noting that the number of parameters in this model proliferates with the number of choices, which is inconvenient because the typical cross section sometimes involves a fairly large number of regressors.
The coefficients in this model are difficult to interpret. It is tempting to associate α_j with the jth outcome, but that would be misleading. By differentiating (23-52), we find that the marginal effects of the characteristics on the probabilities are

δ_ij = ∂P_ij/∂w_i = P_ij [α_j − Σ_{k=0}^{J} P_ik α_k] = P_ij [α_j − ᾱ].    (23-53)
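A sketch of (23-53) for one observation, with the zero vector for outcome 0 included so that δ_i0 is produced as well (names are ours):

```python
import numpy as np

def mnl_marginal_effects(alpha_full, P_i):
    """alpha_full: (J+1, K) with a zero first row (alpha_0 = 0);
    P_i: (J+1,) probabilities at w_i. Returns the (J+1, K) matrix of delta_ij."""
    alpha_bar = P_i @ alpha_full            # probability-weighted average of alphas
    return P_i[:, None] * (alpha_full - alpha_bar)
```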

Therefore, every subvector of α enters every marginal effect, both through the probabilities and through the weighted average that appears in δ_ij. These values can be computed from the parameter estimates. Although the usual focus is on the coefficient estimates, equation (23-53) suggests that there is at least some potential for confusion. Note, for example, that for any particular w_ik, ∂P_ij/∂w_ik need not have the same sign as α_jk. Standard errors can be estimated using the delta method. (See Section 4.9.4.) For purposes of the computation, let α = [0', α_1', α_2', ..., α_J']'. We include the fixed 0 vector for outcome 0 because although α_0 = 0, δ_i0 = −P_i0 ᾱ, which is not 0. Note as well that Asy. Cov[α̂_0, α̂_j] = 0 for j = 0, ..., J. Then

Asy. Var[δ̂_ij] = Σ_{l=0}^{J} Σ_{m=0}^{J} (∂δ_ij/∂α_l') Asy. Cov[α̂_l, α̂_m] (∂δ_ij/∂α_m')',
∂δ_ij/∂α_l' = [1(j = l) − P_il][P_ij I + δ_ij w_i'] + P_ij [δ_il w_i'].

Finding adequate fit measures in this setting presents the same difficulties as in the binomial models. As before, it is useful to report the log-likelihood. If the model contains no covariates and no constant term, then the log-likelihood will be

ln L_c = Σ_{j=0}^{J} n_j ln[1/(J + 1)],

where n_j is the number of individuals who choose outcome j. If the characteristic vector includes only a constant term, then the restricted log-likelihood is

ln L_0 = Σ_{j=0}^{J} n_j ln(n_j/n) = Σ_{j=0}^{J} n_j ln p_j,

where p_j is the sample proportion of observations that make choice j. A useful table will give a listing of hits and misses of the prediction rule "predict Y_i = j if P̂_ij is the maximum of the predicted probabilities."56

56 It is common for this rule to predict all observations with the same value in an unbalanced sample or a model with little explanatory power. This is not a contradiction of an estimated model with many "significant" coefficients, because the coefficients are not estimated so as to maximize the number of correct predictions.


23.11.2 THE CONDITIONAL LOGIT MODEL
When the data consist of choice-specific attributes instead of individual-specific characteristics, the appropriate model is

Prob(Y_i = j | x_i1, x_i2, ..., x_iJ) = Prob(Y_i = j | X_i) = P_ij = exp(x_ij'β) / Σ_{j=1}^{J} exp(x_ij'β).    (23-54)

Here, in accordance with the convention in the literature, we let j = 1, 2, ..., J for a total of J alternatives. The model is otherwise essentially the same as the multinomial logit. Even more care will be required in interpreting the parameters, however. Once again, an example will help to focus ideas.
In this model, the coefficients are not directly tied to the marginal effects. The marginal effects for continuous variables can be obtained by differentiating (23-54) with respect to a particular x_m to obtain

∂P_ij/∂x_im = [P_ij (1(j = m) − P_im)] β,  m = 1, ..., J.

It is clear that through its presence in P_ij and P_im, every attribute set x_m affects all the probabilities. Hensher (1991) suggests that one might prefer to report elasticities of the probabilities. The effect of attribute k of choice m on P_ij would be

∂ln P_j/∂ln x_mk = x_mk [1(j = m) − P_im] β_k.

Because there is no ambiguity about the scale of the probability itself, whether one should report the derivatives or the elasticities is largely a matter of taste. Some of Hensher's elasticity estimates are given in Table 23.29 later on in this chapter.
Estimation of the conditional logit model is simplest by Newton's method or the method of scoring. The log-likelihood is the same as for the multinomial logit model. Once again, we define d_ij = 1 if Y_i = j and 0 otherwise. Then

ln L = Σ_{i=1}^{n} Σ_{j=1}^{J} d_ij ln Prob(Y_i = j).

Market share and frequency data are common in this setting. If the data are in this form, then the only change needed is, once again, to define d_ij as the proportion or frequency. Because of the simple form of ln L, the gradient and Hessian have particularly convenient forms: Let x̄_i = Σ_{j=1}^{J} P_ij x_ij. Then,

∂ln L/∂β = Σ_{i=1}^{n} Σ_{j=1}^{J} d_ij (x_ij − x̄_i),
∂²ln L/∂β ∂β' = −Σ_{i=1}^{n} Σ_{j=1}^{J} P_ij (x_ij − x̄_i)(x_ij − x̄_i)'.    (23-55)
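Because the Hessian in (23-55) is exact, Newton's method is a natural estimator for the conditional logit. A compact sketch under array conventions of our choosing (X holds the attributes, d the one-hot choices or frequencies):

```python
import numpy as np

def clogit_newton_step(beta, X, d):
    """One Newton iteration using (23-55). X: (n, J, K); d: (n, J)."""
    util = np.einsum('njk,k->nj', X, beta)
    util -= util.max(axis=1, keepdims=True)
    P = np.exp(util)
    P /= P.sum(axis=1, keepdims=True)
    xbar = np.einsum('nj,njk->nk', P, X)            # x_bar_i = sum_j P_ij x_ij
    dev = X - xbar[:, None, :]                      # x_ij - x_bar_i
    g = np.einsum('nj,njk->k', d, dev)              # gradient
    H = -np.einsum('nj,njk,njl->kl', P, dev, dev)   # Hessian
    lnL = (d * np.log(P)).sum()
    return beta - np.linalg.solve(H, g), lnL        # iterate until g is near zero
```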


The usual problems of fit measures appear here. The log-likelihood ratio and tabulation of actual versus predicted choices will be useful. There are two possible constrained log-likelihoods. The model cannot contain a constant term, so the constraint β = 0 renders all probabilities equal to 1/J. The constrained log-likelihood for this constraint is then ln L_c = −n ln J. Of course, it is unlikely that this hypothesis would fail to be rejected. Alternatively, we could fit the model with only the J − 1 choice-specific constants, which makes the constrained log-likelihood the same as in the multinomial logit model, ln L_0* = Σ_j n_j ln p_j, where, as before, n_j is the number of individuals who choose alternative j.

23.11.3 THE INDEPENDENCE FROM IRRELEVANT ALTERNATIVES ASSUMPTION
We noted earlier that the odds ratios in the multinomial logit or conditional logit models are independent of the other alternatives. This property is convenient as regards estimation, but it is not a particularly appealing restriction to place on consumer behavior. The property of the logit model whereby P_ij/P_im is independent of the remaining probabilities is called the independence from irrelevant alternatives (IIA). The independence assumption follows from the initial assumption that the disturbances are independent and homoscedastic. Later we will discuss several models that have been developed to relax this assumption. Before doing so, we consider a test that has been developed for testing the validity of the assumption. Hausman and McFadden (1984) suggest that if a subset of the choice set truly is irrelevant, omitting it from the model altogether will not change parameter estimates systematically. Exclusion of these choices will be inefficient but will not lead to inconsistency. But if the remaining odds ratios are not truly independent from these alternatives, then the parameter estimates obtained when these choices are excluded will be inconsistent. This observation is the usual basis for Hausman's specification test. The statistic is

χ² = (β̂_s − β̂_f)' [V̂_s − V̂_f]^{−1} (β̂_s − β̂_f),

where s indicates the estimators based on the restricted subset, f indicates the estimator based on the full set of choices, and V̂_s and V̂_f are the respective estimates of the asymptotic covariance matrices. The statistic has a limiting chi-squared distribution with K degrees of freedom.57
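The statistic is one line of linear algebra once the two sets of estimates are in hand. A sketch (note that the difference V̂_s − V̂_f need not be positive definite in finite samples, a practical wrinkle this code ignores):

```python
import numpy as np
from scipy.stats import chi2

def hausman_iia(b_s, V_s, b_f, V_f):
    """Hausman-McFadden IIA test on the K coefficients common to the
    restricted (s) and full (f) choice sets."""
    diff = b_s - b_f
    stat = float(diff @ np.linalg.solve(V_s - V_f, diff))
    return stat, chi2.sf(stat, diff.size)   # statistic and p-value, K dof
```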

23.11.4 NESTED LOGIT MODELS
If the independence from irrelevant alternatives test fails, then an alternative to the multinomial logit model will be needed. A natural alternative is a multivariate probit model:

U_ij = x_ij'β + ε_ij,  j = 1, ..., J,  [ε_i1, ε_i2, ..., ε_iJ] ~ N[0, Σ].

We had considered this model earlier but found that as a general model of consumer choice, its failings were the practical difficulty of computing the multinormal integral and estimation of an unrestricted correlation matrix. Hausman and Wise (1978) point out that for a model of consumer choice, the probit model may not be as impractical as it might seem. First, for J choices, the comparisons implicit in U_ij > U_im for m ≠ j involve the J − 1 differences, ε_j − ε_m. Thus, starting with a J-dimensional problem, we need only consider derivatives of (J − 1)-order probabilities.

57 McFadden (1987) shows how this hypothesis can also be tested using a Lagrange multiplier test.


Therefore, to come to a concrete example, a model with four choices requires only the evaluation of bivariate normal integrals, which, albeit still complicated to estimate, is well within the received technology. (We will examine the multivariate probit model in Section 23.11.5.) For larger models, however, other specifications have proved more useful.
One way to relax the homoscedasticity assumption in the conditional logit model that also provides an intuitively appealing structure is to group the alternatives into subgroups that allow the variance to differ across the groups while maintaining the IIA assumption within the groups. This specification defines a nested logit model. To fix ideas, it is useful to think of this specification as a two- (or more) level choice problem (although, once again, the model arises as a modification of the stochastic specification in the original conditional logit model, not necessarily as a model of behavior). Suppose, then, that the J alternatives can be divided into B subgroups (branches) such that the choice set can be written

[c_1, ..., c_J] = [(c_{1|1}, ..., c_{J1|1}), (c_{1|2}, ..., c_{J2|2}), ..., (c_{1|B}, ..., c_{JB|B})].

Logically, we may think of the choice process as that of choosing among the B choice sets and then making the specific choice within the chosen set. This method produces a tree structure, which for two branches and, say, five choices (twigs) might look as follows:

Choice
├── Branch 1: c1|1, c2|1
└── Branch 2: c1|2, c2|2, c3|2

Suppose as well that the data consist of observations on the attributes of the choices x_ij|b and attributes of the choice sets z_ib. To derive the mathematical form of the model, we begin with the unconditional probability

Prob[twig j, branch b] = P_ijb = exp(x_ij|b'β + z_ib'γ) / Σ_{b=1}^{B} Σ_{j=1}^{Jb} exp(x_ij|b'β + z_ib'γ).

Now write this probability as

P_ijb = P_ij|b P_b
      = [exp(x_ij|b'β) / Σ_{j=1}^{Jb} exp(x_ij|b'β)] × [exp(z_ib'γ) Σ_{j=1}^{Jb} exp(x_ij|b'β) / Σ_{l=1}^{B} (exp(z_il'γ) Σ_{j=1}^{Jl} exp(x_ij|l'β))].

Define the inclusive value for the bth branch as

IV_ib = ln( Σ_{j=1}^{Jb} exp(x_ij|b'β) ).

Then, after canceling terms and using this result, we find

P_ij|b = exp(x_ij|b'β) / Σ_{j=1}^{Jb} exp(x_ij|b'β)   and   P_b = exp[τ_b(z_ib'γ + IV_ib)] / Σ_{b=1}^{B} exp[τ_b(z_ib'γ + IV_ib)],


where the new parameters τ_b must equal 1 to produce the original model. Therefore, we use the restriction τ_b = 1 to recover the conditional logit model, and the preceding equation just writes this model in another form. The nested logit model arises if this restriction is relaxed. The inclusive value coefficients, unrestricted in this fashion, allow the model to incorporate some degree of heteroscedasticity. Within each branch, the IIA restriction continues to hold. The equal variances of the disturbances within the bth branch are now58

σ_b² = π² / (6τ_b²).

With τ_b = 1, this reverts to the basic result for the multinomial logit model.
As usual, the coefficients in the model are not directly interpretable. The derivatives that describe covariation of the attributes and probabilities are

∂ln Prob[choice = m, branch = b] / ∂x(k) in choice M and branch B
    = {1(b = B)[1(m = M) − P_M|B] + τ_B [1(b = B) − P_B] P_M|B} β_k.

The nested logit model has been extended to three and higher levels. The complexity of the model increases rapidly with the number of levels. But the model has been found to be extremely flexible and is widely used for modeling consumer choice in the marketing and transportation literatures, to name a few.
There are two ways to estimate the parameters of the nested logit model. A limited information, two-step maximum likelihood approach can be done as follows:
1. Estimate β by treating the choice within branches as a simple conditional logit model.
2. Compute the inclusive values for all the branches in the model. Estimate γ and the τ parameters by treating the choice among branches as a conditional logit model with attributes z_ib and IV_ib.
Because this approach is a two-step estimator, the estimate of the asymptotic covariance matrix of the estimates at the second step must be corrected. [See Section 16.7 and McFadden (1984).] For full information maximum likelihood (FIML) estimation of the model, the log-likelihood is

ln L = Σ_{i=1}^{n} ln[Prob(twig | branch)_i × Prob(branch)_i].

[See Hensher (1986, 1991) and Greene (2007a).] The information matrix is not block diagonal in β and (γ, τ), so FIML estimation will be more efficient than two-step estimation. The FIML estimator is now available in several commercial computer packages. The two-step estimator is rarely used in current research.
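A sketch of the nested logit probabilities, computed exactly as in the decomposition above, for one observation (array names and shapes are ours):

```python
import numpy as np

def nested_logit_probs(beta, gamma, tau, X, Z, branch_of):
    """X: (J, K) twig attributes; Z: (B, L) branch attributes; tau: (B,)
    inclusive value parameters; branch_of: (J,) branch index of each twig.
    Returns the J joint probabilities P(twig, branch)."""
    vx = X @ beta
    B = Z.shape[0]
    IV = np.array([np.log(np.exp(vx[branch_of == b]).sum()) for b in range(B)])
    P_cond = np.exp(vx) / np.exp(IV)[branch_of]       # P(twig | branch)
    a = tau * (Z @ gamma + IV)
    P_branch = np.exp(a) / np.exp(a).sum()            # P(branch)
    return P_cond * P_branch[branch_of]               # tau = 1 recovers the MNL
```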

58 See Hensher, Louviere, and Swait (2000). See Greene and Hensher (2002) for alternative formulations of the nested logit model.


To specify the nested logit model, it is necessary to partition the choice set into branches. Sometimes there will be a natural partition, such as in the example given by Maddala (1983) when the choice of residence is made first by community, then by dwelling type within the community. In other instances, however, the partitioning of the choice set is ad hoc and leads to the troubling possibility that the results might be dependent on the branches so defined. (Many studies in this literature present several sets of results based on different specifications of the tree structure.) There is no well-defined testing procedure for discriminating among tree structures, which is a problematic aspect of the model.

23.11.5 THE MULTINOMIAL PROBIT MODEL
A natural alternative model that relaxes the independence restrictions built into the multinomial logit (MNL) model is the multinomial probit model (MNP). The structural equations of the MNP model are

U_ij = x_ij'β + ε_ij,  j = 1, ..., J,  [ε_i1, ε_i2, ..., ε_iJ] ~ N[0, Σ].

The term in the log-likelihood that corresponds to the choice of alternative q is

Prob[choice_iq] = Prob[U_iq > U_ij, j = 1, ..., J, j ≠ q].

The probability for this occurrence is

Prob[choice_iq] = Prob[ε_i1 − ε_iq < (x_iq − x_i1)'β, ..., ε_iJ − ε_iq < (x_iq − x_iJ)'β]

for the J − 1 other choices, which is a cumulative probability from a (J − 1)-variate normal distribution. Because we are only making comparisons, one of the variances in this (J − 1)-variate structure, that is, one of the diagonal elements in the reduced Σ, must be normalized to 1.0. Because only comparisons are ever observable in this model, for identification, J − 1 of the covariances must also be normalized, to zero. The MNP model allows an unrestricted (J − 1) × (J − 1) correlation structure and J − 2 free standard deviations for the disturbances in the model. (Thus, a two-choice model returns to the univariate probit model of Section 23.2.) For more than two choices, this specification is far more general than the MNL model, which assumes that Σ = I. (The scaling is absorbed in the coefficient vector in the MNL model.) It adds the unrestricted correlations to the heteroscedastic model of the previous section.
The main obstacle to implementation of the MNP model has been the difficulty in computing the multivariate normal probabilities for any dimensionality higher than 2. Recent results on accurate simulation of multinormal integrals, however, have made estimation of the MNP model feasible. (See Section 17.3.3 and a symposium in the November 1994 issue of the Review of Economics and Statistics.) Yet some practical problems remain. Computation is exceedingly time consuming. It is also necessary to ensure that Σ remain a positive definite matrix. One way often suggested is to construct the Cholesky decomposition of Σ, LL', where L is a lower triangular matrix, and estimate the elements of L. The normalizations and zero restrictions can be imposed by making the last row of the J × J matrix Σ equal (0, 0, ..., 1) and using LL' to create the upper (J − 1) × (J − 1) matrix. The additional normalization restriction is obtained by imposing L_11 = 1. This is straightforward to implement for an otherwise unrestricted Σ. A remaining problem, however, is that it is now difficult by this method to impose any other restrictions, such as a zero in a specific location in Σ, which is common. An alternative approach is to estimate the correlations, R, and a diagonal matrix of standard deviations, S = diag(σ_1, ..., σ_{J−2}, 1, 1), separately.

CHAPTER 23 ✦ Models for Discrete Choice 851

The normalizations, R_jj = 1, and exclusions, R_Jl = 0, are then simple to impose, and Σ is just SRS. The resulting matrix must still be symmetric and positive definite. The restriction −1 < R_jl < +1 is necessary but still not sufficient to ensure definiteness. The full set of restrictions is difficult to enumerate explicitly and involves sets of inequalities that would be difficult to impose in estimation. (Typically when this method is employed, the matrix is estimated without the explicit restrictions.) Identification appears to be a serious problem with the MNP model. Although the unrestricted MNP model is fully identified in principle, convergence to satisfactory results in applications with more than three choices appears to require many additional restrictions on the standard deviations and correlations, such as zero restrictions or equality restrictions in the case of the standard deviations.
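For moderate J, the choice probability above can be computed directly as a (J − 1)-variate normal cdf over the utility differences. A sketch using scipy, assuming the normalizations just discussed are already built into Sigma:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mnp_choice_prob(q, v, Sigma):
    """Prob[choice q] = Prob[eps_j - eps_q < v_q - v_j for all j != q].
    v: (J,) systematic utilities x_j'beta; Sigma: (J, J) covariance of eps."""
    v = np.asarray(v, dtype=float)
    J = len(v)
    others = [j for j in range(J) if j != q]
    M = np.zeros((J - 1, J))
    for r, j in enumerate(others):
        M[r, j], M[r, q] = 1.0, -1.0            # row r picks off eps_j - eps_q
    upper = v[q] - v[others]                    # upper limits (x_q - x_j)'beta
    return multivariate_normal(mean=np.zeros(J - 1), cov=M @ Sigma @ M.T).cdf(upper)
```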

23.11.6 THE MIXED LOGIT MODEL
Another variant of the multinomial logit model is the random parameters logit model (RPL) (also called the mixed logit model). [See Revelt and Train (1996); Bhat (1996); Berry, Levinsohn, and Pakes (1995); Jain, Vilcassim, and Chintagunta (1994); and Hensher and Greene (2004).] Train's (2003) formulation of the RPL model (which encompasses the others) is a modification of the MNL model. The model is a random coefficients formulation. The change to the basic MNL model is the parameter specification in the distribution of the parameters across individuals, i;

β_ik = β_k + θ_k'z_i + σ_k u_ik,    (23-56)

where u_ik, k = 1, ..., K, is multivariate normally distributed with correlation matrix R, σ_k is the standard deviation of the kth distribution, β_k + θ_k'z_i is the mean of the distribution, and z_i is a vector of person specific characteristics (such as age and income) that do not vary across choices. This formulation contains all the earlier models. For example, if θ_k = 0 for all the coefficients and σ_k = 0 for all the coefficients except for choice-specific constants, then the original MNL model with a normal-logistic mixture for the random part of the MNL model arises (hence the name).
The model is estimated by simulating the log-likelihood function rather than direct integration to compute the probabilities, which would be infeasible because the mixture distribution composed of the original ε_ij and the random part of the coefficient is unknown. For any individual,

Prob[choice q | u_i] = MNL probability | β_i(u_i),

with all restrictions imposed on the coefficients. The appropriate probability is

E_u[Prob(choice q | u)] = ∫_{u_1,...,u_K} Prob[choice q | u] f(u) du,

which can be estimated by simulation, using

Est. E_u[Prob(choice q | u)] = (1/R) Σ_{r=1}^{R} Prob[choice q | β_i(u_ir)],

where u_ir is the rth of R draws for observation i. (There are nKR draws in total. The draws for observation i must be the same from one computation to the next, which can be accomplished by assigning to each individual their own seed for the random number generator and restarting it each time the probability is to be computed.)
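A sketch of the simulated probability for one individual, drawing β_i from (23-56) with uncorrelated normal draws for simplicity (with correlation, u would be premultiplied by a Cholesky factor of R). The observation-specific seed implements the fixed-draws device just described; the names are illustrative.

```python
import numpy as np

def simulated_prob(q, X, z_i, beta, Theta, sigma, R, seed_i):
    """X: (J, K) attributes; beta_ik = beta_k + theta_k'z_i + sigma_k u_ik."""
    rng = np.random.default_rng(seed_i)   # same draws every time it is called
    p = 0.0
    for _ in range(R):
        u = rng.standard_normal(X.shape[1])
        b = beta + Theta @ z_i + sigma * u        # one draw of beta_i, as in (23-56)
        ev = np.exp(X @ b - (X @ b).max())
        p += ev[q] / ev.sum()                     # MNL probability at this draw
    return p / R                                  # average of R MNL probabilities
```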


By this method, the log-likelihood and its derivatives with respect to (β_k, θ_k, σ_k), k = 1, ..., K, and R are simulated to find the values that maximize the simulated log-likelihood. (See Section 17.5 and Example 23.8.)
The mixed model enjoys two considerable advantages not available in any of the other forms suggested. In a panel data or repeated-choices setting (see Section 23.11.8), one can formulate a random effects model simply by making the variation in the coefficients time invariant. Thus, the model is changed to

U_ijt = x_ijt'β_it + ε_ijt,  i = 1, ..., n,  j = 1, ..., J,  t = 1, ..., T,
β_it,k = β_k + θ_k'z_it + σ_k u_ik.

The time variation in the coefficients is provided by the choice-invariant variables, which may change through time. Habit persistence is carried by the time-invariant random effect, u_ik. If only the constant terms vary and they are assumed to be uncorrelated, then this is logically equivalent to the familiar random effects model. But, much greater generality can be achieved by allowing the other coefficients to vary randomly across individuals and by allowing correlation of these effects.59 A second degree of flexibility is in (23-56). The random components, u_i, are not restricted to normality. Other distributions that can be simulated will be appropriate when the range of parameter variation consistent with consumer behavior must be restricted, for example to narrow ranges or to positive values.

23.11.7 APPLICATION: CONDITIONAL LOGIT MODEL FOR TRAVEL MODE CHOICE
Hensher and Greene [Greene (2007a)] report estimates of a model of travel mode choice for travel between Sydney and Melbourne, Australia. The data set contains 210 observations on choice among four travel modes, air, train, bus, and car. (See Appendix Table F23.2.) The attributes used for their example were: choice-specific constants; two choice-specific continuous measures; GC, a measure of the generalized cost of the travel that is equal to the sum of in-vehicle cost, INVC, and a wagelike measure times INVT, the amount of time spent traveling; and TTME, the terminal time (zero for car); and for the choice between air and the other modes, HINC, the household income. A summary of the sample data is given in Table 23.23. The sample is choice based so as to balance it among the four choices; the true population allocation, as shown in the last column of Table 23.23, is dominated by drivers. The model specified is

U_ij = α_air d_i,air + α_train d_i,train + α_bus d_i,bus + β_G GC_ij + β_T TTME_ij + γ_H d_i,air HINC_i + ε_ij,

where for each j, ε_ij has the same independent, type 1 extreme value distribution,

F_ε(ε_ij) = exp(−exp(−ε_ij)),

which has variance π²/6. The mean is absorbed in the constants. Estimates of the conditional logit model are shown in Table 23.24.

59 See Hensher (2001) for an application to transportation mode choice in which each individual is observed in several choice situations. A stated choice experiment in which consumers make several choices in sequence about automobile features appears in Hensher, Rose, and Greene (2006).


TABLE 23.23 Summary Statistics for Travel Mode Choice Data
                                                          Number              True
         GC        TTME     INVC     INVT      HINC      Choosing     p      Prop.
Air     102.648    61.010   85.522   133.710   34.548       58       0.28    0.14
        113.522    46.534   97.569   124.828   41.274
Train   130.200    35.690   51.338   608.286   34.548       63       0.30    0.13
        106.619    28.524   37.460   532.667   23.063
Bus     115.257    41.650   33.457   629.462   34.548       30       0.14    0.09
        108.133    25.200   33.733   618.833   29.700
Car      94.414     0       20.995   573.205   34.548       59       0.28    0.64
         89.095     0       15.694   527.373   42.220
Note: The upper figure is the average for all 210 observations. The lower figure is the mean for the observations that made that choice.

TABLE 23.24 Parameter Estimates
                                 Unweighted Sample        Choice-Based Weighting
                                 Estimate    t Ratio      Estimate    t Ratio
βG                               −0.15501    −3.517       −0.01333    −2.724
βT                               −0.09612    −9.207       −0.13405    −7.164
γH                                0.01329     1.295       −0.00108    −0.087
αair                              5.2074      6.684        6.5940      5.906
αtrain                            3.8690      8.731        3.6190      7.447
αbus                              3.1632      7.025        3.3218      5.698
Log-likelihood at β = 0        −291.1218                −291.1218
Log-likelihood (sample shares) −283.7588                −223.0578
Log-likelihood at convergence  −199.1284                −147.5896

The model was fit with and without the corrections for choice-based sampling. Because the sample shares do not differ radically from the population proportions, the effect on the estimated parameters is fairly modest. Nonetheless, it is apparent that the choice-based sampling is not completely innocent. A cross tabulation of the predicted versus actual outcomes is given in Table 23.25. The predictions are generated by tabulating the integer parts of

m_jk = Σ_{i=1}^{210} p̂_ij d_ik,  j, k = air, train, bus, car,

where p̂_ij is the predicted probability of outcome j for observation i and d_ik is the binary variable which indicates if individual i made choice k.
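The cross tabulation is a single matrix product. A sketch with illustrative names:

```python
import numpy as np

def prediction_table(P, choice):
    """m[j, k] = sum_i p_hat[i, j] * d[i, k]: fitted probability mass for
    alternative j among those who actually chose k. P: (n, J); choice: (n,)."""
    n, J = P.shape
    D = np.zeros((n, J))
    D[np.arange(n), choice] = 1.0
    return P.T @ D     # take integer parts to reproduce Table 23.25
```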

TABLE 23.25 Predicted Choices Based on Model Probabilities (predictions based on choice-based sampling in parentheses)
                   Air       Train     Bus       Car        Total (Actual)
Air                32 (30)    8 (3)     5 (3)    13 (23)        58
Train               7 (3)    37 (30)    5 (3)    14 (27)        63
Bus                 3 (1)     5 (2)    15 (4)     6 (12)        30
Car                16 (5)    13 (5)     6 (3)    25 (45)        59
Total (Predicted)  58 (39)   63 (40)   30 (23)   59 (108)      210


TABLE 23.26 Results for IIA Test
                 Full-Choice Set                         Restricted-Choice Set
           βG        βT        αtrain   αbus       βG          βT         αtrain   αbus
Estimate  −0.0155   −0.0961    3.869    3.163     −0.0639     −0.0699     4.464    3.105

          Estimated Asymptotic Covariance Matrix   Estimated Asymptotic Covariance Matrix
βG         0.194e-5                                0.000101
βT        −0.46e-7   0.000110                     −0.0000013   0.000221
αtrain    −0.00060  −0.0038    0.196              −0.000244   −0.00759    0.410
αbus      −0.00026  −0.0037    0.161    0.203     −0.000113   −0.00753    0.336    0.371
Note: 0.nnne-p indicates times 10 to the negative p power. H = 33.3363. Critical chi-squared[4] = 9.488.

Are the odds ratios train/bus and car/bus really independent from the presence of the air alternative? To use the Hausman test, we would eliminate choice air from the choice set and estimate a three-choice model. Because 58 respondents chose this mode, we would lose 58 observations. In addition, for every data vector left in the sample, the air-specific constant and the interaction, d_i,air × HINC_i, would be zero for every remaining individual. Thus, these parameters could not be estimated in the restricted model. We would drop these variables. The test would be based on the two estimators of the remaining four coefficients in the model, [β_G, β_T, α_train, α_bus]. The results for the test are as shown in Table 23.26. The hypothesis that the odds ratios for the other three choices are independent from air would be rejected based on these results, as the chi-squared statistic exceeds the critical value. Because IIA was rejected, they estimated a nested logit model of the following type:

Travel
├── FLY: AIR
└── GROUND: TRAIN, BUS, CAR
Determinants: branch choice, Income; mode choice, G cost and T time.

Note that one of the branches has only a single choice, so the conditional probability, P_j|fly = P_air|fly = 1. The estimates marked "unconditional" in Table 23.27 are the simple conditional (multinomial) logit (MNL) model for choice among the four alternatives that was reported earlier. Both inclusive value parameters are constrained (by construction) to equal 1.0000. The FIML estimates are obtained by maximizing the full log-likelihood for the nested logit model. In this model,

Prob(choice | branch) = P(α_air d_air + α_train d_train + α_bus d_bus + β_G GC + β_T TTME),
Prob(branch) = P(γ d_air HINC + τ_fly IV_fly + τ_ground IV_ground),
Prob(choice, branch) = Prob(choice | branch) × Prob(branch).

The likelihood ratio statistic for the nesting (heteroscedasticity) against the null hypothesis of homoscedasticity is −2[−199.1284 − (−193.6561)] = 10.945. The 95 percent critical value from the chi-squared distribution with two degrees of freedom is 5.99, so the hypothesis is rejected. We can also carry out a Wald test. The estimated asymptotic covariance matrix for the two inclusive value parameters has variances 0.01977 and 0.01529 and covariance 0.009621.


TABLE 23.27 Estimates of a Mode Choice Model (standard errors in parentheses)
Parameter     FIML Estimate           Unconditional
αair            6.042   (1.199)        5.207   (0.779)
αbus            4.096   (0.615)        3.163   (0.450)
αtrain          5.065   (0.662)        3.869   (0.443)
βGC            −0.03159 (0.00816)     −0.1550  (0.00441)
βTTME          −0.1126  (0.0141)      −0.09612 (0.0104)
γH              0.01533 (0.00938)      0.01329 (0.0103)
τfly            0.5860  (0.141)        1.0000  (0.000)
τground         0.3890  (0.124)        1.0000  (0.000)
σfly            2.1886  (0.525)        1.2825  (0.000)
σground         3.2974  (1.048)        1.2825  (0.000)
ln L         −193.6561              −199.1284

The Wald statistic for the joint test of the hypothesis that τ_fly = τ_ground = 1 is

W = (0.586 − 1.0, 0.389 − 1.0) [0.01977  0.009621; 0.009621  0.01529]^{−1} (0.586 − 1.0, 0.389 − 1.0)' = 24.475.

The hypothesis is rejected, once again.
The choice model was reestimated under the assumptions of the heteroscedastic extreme value (HEV) model. [See Greene (2007b).] This model allows a separate variance, σ_j² = π²/(6θ_j²), for each ε_ij in (23-48). The results are shown in Table 23.28.

TABLE 23.28 Estimates of a Heteroscedastic Extreme Value Model (standard errors in parentheses)
                              Heteroscedastic       Restricted
Parameter    HEV Model        HEV Model             HEV Model           Nested Logit Model
αair          7.8326 (10.951)  5.1815 (6.042)        2.973 (0.995)       6.062 (1.199)
αbus          7.1718  (9.135)  5.1302 (5.132)        4.050 (0.494)       4.096 (0.615)
αtrain        6.8655  (8.829)  4.8654 (5.071)        3.042 (0.429)       5.065 (0.662)
βGC          −0.05156 (0.0694) −0.03326 (0.0378)    −0.0289 (0.00580)   −0.03159 (0.00816)
βTTME        −0.1968  (0.288)  −0.1372 (0.164)      −0.0828 (0.00576)   −0.1126 (0.0141)
γ             0.04024 (0.0607)  0.03557 (0.0451)     0.0238 (0.0186)     0.01533 (0.00938)
τfly                                                                     0.5860 (0.141)
τground                                                                  0.3890 (0.124)
θair          0.2485 (0.369)    0.2890 (0.321)       0.4959 (0.124)
θtrain        0.2595 (0.418)    0.3629 (0.482)       1.0000 (0.000)
θbus          0.6065 (1.040)    0.6895 (0.945)       1.0000 (0.000)
θcar          1.0000 (0.000)    1.0000 (0.000)       1.0000 (0.000)
φ             0.0000 (0.000)    0.00552 (0.00573)    0.0000 (0.000)

σair 5.161 (7.667) σtrain 4.942 (7.978) σbus 2.115 (3.623) σcar 1.283 (0.000) ln L −195.6605 −194.5107 −200.3791 −193.6561 Greene-50558 book June 25, 2007 11:6


This model is less restrictive than the nested logit model. To make them comparable, we note that we found that σ_air = π/(τ_fly √6) = 2.1886 and σ_train = σ_bus = σ_car = π/(τ_ground √6) = 3.2974. The heteroscedastic extreme value (HEV) model thus relaxes one variance restriction, because it has three free variance parameters instead of two. On the other hand, the important degree of freedom here is that the HEV model does not impose the IIA assumption anywhere in the choice set, whereas the nested logit does, within each branch.
A primary virtue of the HEV model, the nested logit model, and other alternative models is that they relax the IIA assumption. This assumption has implications for the cross elasticities between attributes in the different probabilities. Table 23.29 lists the estimated elasticities of the estimated probabilities with respect to changes in the generalized cost variable. Elasticities are computed by averaging the individual sample values rather than computing them once at the sample means. The implication of the IIA assumption can be seen in the table entries. Thus, in the estimates for the multinomial logit (MNL) model, the cross elasticities for each attribute are all equal. In the nested logit model, the IIA property only holds within the branch. Thus, in the first column, the effect of GC of air affects all ground modes equally, whereas the effect of GC for train is the same for bus and car but different from these two for air. All these elasticities vary freely in the HEV model.
Table 23.30 lists the estimates of the parameters of the multinomial probit and random parameters logit models. For the multinomial probit model, we fit three specifications: (1) free correlations among the choices, which implies an unrestricted 3 × 3 correlation matrix and two free standard deviations; (2) uncorrelated disturbances, but free standard deviations, a model that parallels the heteroscedastic extreme value model; and (3) uncorrelated disturbances and equal standard deviations, a model that is the same as the original conditional logit model save for the normal distribution of the disturbances instead of the extreme value assumed in the logit model.

TABLE 23.29 Estimated Elasticities with Respect to Generalized Cost
                       Cost Is That of Alternative
Effect on      Air       Train      Bus       Car
Multinomial Logit
Air           −1.136     0.498      0.238     0.418
Train          0.456    −1.520      0.238     0.418
Bus            0.456     0.498     −1.549     0.418
Car            0.456     0.498      0.238    −1.061
Nested Logit
Air           −0.858     0.332      0.179     0.308
Train          0.314    −4.075      0.887     1.657
Bus            0.314     1.595     −4.132     1.657
Car            0.314     1.595      0.887    −2.498
Heteroscedastic Extreme Value
Air           −1.040     0.367      0.221     0.441
Train          0.272    −1.495      0.250     0.553
Bus            0.688     0.858     −6.562     3.384
Car            0.690     0.930      1.254    −2.717


In this case, the scaling of the utility functions is different by a factor of (π²/6)^{1/2} = 1.283, as the probit model assumes ε_j has a standard deviation of 1.0.
We also fit three variants of the random parameters logit. In these cases, the choice-specific variance for each utility function is σ_j² + θ_j², where σ_j² is the contribution of the logit model, which is π²/6 = 1.645, and θ_j² is the estimated constant specific variance estimated in the random parameters model. The combined estimated standard deviations are given in the table. The estimates of the specific parameters, θ_j, are given in the footnotes. The estimated models are (1) unrestricted variation and correlation among the three intercept parameters (this parallels the general specification of the multinomial probit model); (2) only the constant terms randomly distributed but uncorrelated, a model that is parallel to the multinomial probit model with no cross-equation correlation and to the heteroscedastic extreme value model shown in Table 23.28; and (3) random but uncorrelated parameters. This model is more general than the others, but is somewhat restricted as the parameters are assumed to be uncorrelated. Identification of the correlation matrix is weak in this model: after all, we are attempting to estimate a 6 × 6 correlation matrix for all unobserved variables. Only the estimated parameters are shown in Table 23.30. Estimated standard errors are similar to (although generally somewhat larger than) those for the basic multinomial logit model.
The standard deviations and correlations shown for the multinomial probit model are parameters of the distribution of ε_ij, the overall randomness in the model. The counterparts in the random parameters model apply to the distributions of the parameters. Thus, the full disturbance in the model in which only the constants are random is ε_i,air + u_air for air, and likewise for train and bus.

TABLE 23.30  Parameter Estimates for Normal-Based Multinomial Choice Models

                            Multinomial Probit                     Random Parameters Logit
Parameter      Unrestricted  Homoscedastic  Uncorrelated   Unrestricted   Constants   Uncorrelated
αair               1.358         3.005         3.171          5.519        4.807        12.603
σair               4.940         1.000a        3.629          4.009d       3.225b        2.803c
αtrain             4.298         2.409         4.277          5.776        5.035        13.504
σtrain             1.899         1.000a        1.581          1.904        1.290b        1.373
αbus               3.609         1.834         3.533          4.813        4.062        11.962
σbus               1.000a        1.000a        1.000a         1.424        3.147b        1.287
αcar               0.000a        0.000a        0.000a         0.000a       0.000a        0.000a
σcar               1.000a        1.000a        1.000a         1.283a       1.283a        1.283a
βG                −0.0351       −0.0113       −0.0325        −0.0326      −0.0317       −0.0544
σβG                  —             —             —            0.000a       0.000a        0.00561
βT                −0.0769       −0.0563       −0.0918        −0.126       −0.112        −0.2822
σβT                  —             —             —            0.000a       0.000a        0.182
γH                 0.0593        0.0126        0.0370         0.0334       0.0319        0.0846
σγ                   —             —             —            0.000a       0.000a        0.0768
ρAT                0.581         0.000a        0.000a         0.543        0.000a        0.000a
ρAB                0.576         0.000a        0.000a         0.532        0.000a        0.000a
ρBT                0.718         0.000a        0.000a         0.993        0.000a        0.000a
log L           −196.9244     −208.9181     −199.7623      −193.7160    −199.0073     −175.5333

a Restricted to this fixed value.
b Computed as the square root of (π²/6 + θj²); θair = 2.959, θtrain = 0.136, θbus = 0.183, θcar = 0.000.
c θair = 2.492, θtrain = 0.489, θbus = 0.108, θcar = 0.000.
d Derived standard deviations for the random constants are θair = 3.798, θtrain = 1.182, θbus = 0.0712, θcar = 0.000.


for the first two models are directly comparable, although it should be noted that in the random parameters model, the disturbances have the distribution of the sum of an extreme value variable and a normal variable, while in the probit model, the disturbances are normally distributed. With these considerations, the “unrestricted” models in each case are comparable and are, in fact, fairly similar.
None of this discussion suggests a preference for one model or the other. The likelihood values are not comparable, so a direct test is precluded. Both relax the IIA assumption, which is a crucial consideration. The random parameters model enjoys a significant practical advantage, as discussed earlier, and also allows a much richer specification of the utility function itself. But, the question still warrants additional study. Both models are making their way into the applied literature.
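As a simple check on the variance arithmetic underlying Table 23.30, the combined standard deviations can be recomputed from the θj values quoted in footnote b. The following sketch (plain Python; the only inputs are the footnote values themselves) evaluates (π²/6 + θj²)^1/2 and reproduces the air, train, and car entries in the Constants column:

```python
import math

# theta_j values from footnote b of Table 23.30 (random constants model)
theta = {"air": 2.959, "train": 0.136, "bus": 0.183, "car": 0.000}

logit_var = math.pi ** 2 / 6  # contribution of the logit model, 1.645

for mode, th in theta.items():
    # combined standard deviation of the utility function for this mode
    print(f"{mode:6s} sqrt(pi^2/6 + theta^2) = {math.sqrt(logit_var + th**2):.3f}")
```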

23.11.8 PANEL DATA AND STATED CHOICE EXPERIMENTS

Panel data in the unordered discrete choice setting typically come in the form of sequential choices. Train (2003, Chapter 6) reports an analysis of the site choices of 258 anglers who chose among 59 possible fishing sites for a total of 962 visits. Allenby and Rossi (1999) modeled brand choice for a sample of shoppers who made multiple store trips. The mixed logit model is a framework that allows the counterpart to a random effects model. The random utility model would appear

Uij,t = xij,t′βi + εij,t,

where, conditioned on βi, a multinomial logit model applies. The random coefficients carry the common effects across choice situations. For example, if the random coefficients include choice-specific constant terms, then the random utility model becomes essentially a random effects model. A modification of the model that resembles Mundlak's correction for the random effects model is

βi = β0 + Δzi + ui,

where, typically, zi would contain demographic and socioeconomic information.
The stated choice experiment is similar to the repeated choice situation, with a crucial difference. In a stated choice survey, the respondent is asked about his or her preferences over a series of hypothetical choices, often including one or more that are actually available and others that might not be available (yet). Hensher, Rose, and Greene (2006) describe a survey of Australian commuters who were asked about hypothetical commutation modes in a choice set that included the one they currently took and a variety of alternatives. Revelt and Train (2000) analyzed a stated choice experiment in which California electricity consumers were asked to choose among alternative hypothetical energy suppliers. The advantage of the stated choice experiment is that it allows the analyst to study choice situations over a range of variation of the attributes or a range of choices that might not exist within the observed, actual outcomes. Thus, the original work on the MNL by McFadden et al. concerned survey data on whether commuters would ride a (then-hypothetical) underground train system to work in the San Francisco Bay area. The disadvantage of stated choice data is that they are hypothetical. Particularly when they are mixed with revealed preference data, the researcher must assume that the same preference patterns govern both types of outcomes. This is likely to be a dubious assumption. One method of accommodating the mixture of underlying


preferences is to build different scaling parameters into the stated and revealed preference components of the model. Greene and Hensher (2007) suggest a nested logit model that groups the hypothetical choices in one branch of a tree and the observed choices in another.
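The estimation strategy for the panel mixed logit model described above is maximum simulated likelihood: the random coefficients are held fixed across an individual's choice situations, the conditional logit probabilities are multiplied over the situations, and the product is averaged over draws from the distribution of βi. The following is a minimal sketch, not any particular author's production code; it assumes independent normal coefficients and illustrative array layouts (N individuals, T situations, J alternatives, K attributes):

```python
import numpy as np

def panel_mixed_logit_loglik(params, X, y, draws):
    """Simulated log-likelihood for a panel mixed logit.

    X: attributes, shape (N, T, J, K); y: chosen alternative, shape (N, T);
    draws: standard normal draws, shape (R, K). params stacks the K means
    with the K log standard deviations of the (independent) coefficients.
    """
    K = X.shape[3]
    b, s = params[:K], np.exp(params[K:])
    beta = b + draws * s                             # (R, K) coefficient draws
    loglik = 0.0
    for i in range(X.shape[0]):
        v = np.einsum("tjk,rk->rtj", X[i], beta)     # utilities, (R, T, J)
        p = np.exp(v - v.max(axis=2, keepdims=True))
        p /= p.sum(axis=2, keepdims=True)            # logit probabilities
        # probability of the observed choice sequence for each draw, holding
        # beta fixed across the T situations, then average over the draws
        chosen = p[:, np.arange(X.shape[1]), y[i]]   # (R, T)
        loglik += np.log(chosen.prod(axis=1).mean())
    return loglik
```

The function would be maximized with a general-purpose optimizer such as scipy.optimize.minimize applied to its negative; the essential feature is that the same draw of βi multiplies through all T of individual i's choice situations.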

23.12 SUMMARY AND CONCLUSIONS

This chapter has surveyed techniques for modeling discrete choice. We examined three classes of models: binary choice, ordered choice, and multinomial choice. These are quite far removed from the regression models (linear and nonlinear) that have been the focus of the preceding 22 chapters. The most important difference concerns the modeling approach. Up to this point, we have been primarily interested in modeling the conditional mean function for outcomes that vary continuously. In this chapter, we have shifted our approach to one of modeling the conditional probabilities of events.
Modeling binary choice—the decision between two alternatives—is a growth area in the applied econometrics literature. Maximum likelihood estimation of fully parameterized models remains the mainstay of the literature. But, we also considered semiparametric and nonparametric forms of the model and examined models for time series and panel data. The ordered choice model is a natural extension of the binary choice setting and also a convenient bridge between models of choice between two alternatives and more complex models of choice among multiple alternatives. Multinomial choice modeling is likewise a large field, both within economics and, especially, in many other fields, such as marketing, transportation, political science, and so on. The multinomial logit model and many variations of it provide an especially rich framework within which modelers have carefully matched behavioral modeling to empirical specification and estimation.

Key Terms and Concepts

• Attributes
• Binary choice model
• Bivariate ordered probit
• Bivariate probit
• Bootstrapping
• Butler and Moffitt method
• Characteristics
• Choice-based sampling
• Chow test
• Conditional likelihood function
• Conditional logit model
• Control function
• Discriminant analysis
• Fixed effects model
• Full information maximum likelihood (FIML)
• Generalized residual
• Goodness of fit measure
• Gumbel model
• Heterogeneity
• Heteroscedasticity
• Incidental parameters problem
• Inclusive value
• Independence from irrelevant alternatives
• Index function model
• Initial conditions
• Kernel density estimator
• Kernel function
• Lagrange multiplier test
• Latent regression
• Likelihood equations
• Likelihood ratio test
• Limited information ML
• Linear probability model
• Logit
• Log-odds ratios
• Marginal effects
• Maximum likelihood
• Maximum score estimator
• Maximum simulated likelihood
• Mean-squared deviation
• Method of kernels
• Method of scoring
• Minimal sufficient statistic
• Mixed logit model
• Multinomial logit model
• Multinomial probit model
• Multivariate probit
• Negative binomial model


• Nested logit model
• Nonnested models
• Normit
• Ordered choice model
• Persistence
• Probit
• Quadrature
• Qualitative choice
• Qualitative response
• Quasi-MLE
• Random coefficients
• Random effects model
• Random parameters logit model
• Random utility model
• Ranking
• Recursive model
• Revealed preference data
• Robust covariance estimation
• Semiparametric estimation
• State dependence
• Stated choice data
• Stated choice experiment
• Tetrachoric correlation
• Unbalanced sample
• Unordered choice

Exercises

1. A binomial probability model is to be based on the following index function model:

   y* = α + βd + ε,
   y = 1 if y* > 0, y = 0 otherwise.

   The only regressor, d, is a dummy variable. The data consist of 100 observations, distributed as follows:

               y
             0    1
   d   0    24   28
       1    32   16

   Obtain the maximum likelihood estimators of α and β, and estimate the asymptotic standard errors of your estimates. Test the hypothesis that β equals zero by using a Wald test (asymptotic t test) and a likelihood ratio test. Use the probit model and then repeat, using the logit model. Do your results change? (Hint: Formulate the log-likelihood in terms of α and δ = α + β.)
2. Suppose that a linear probability model is to be fit to a set of observations on a dependent variable y that takes values zero and one, and a single regressor x that varies continuously across observations. Obtain the exact expressions for the least squares slope in the regression in terms of the mean(s) and variance of x, and interpret the result.
3. Given the data set

   y   1  0  0  1  1  0  0  1  1  1
   x   9  2  5  4  6  7  3  5  2  6

   estimate a probit model and test the hypothesis that x is not influential in determining the probability that y equals one.
4. Construct the Lagrange multiplier statistic for testing the hypothesis that all the slopes (but not the constant term) equal zero in the binomial logit model. Prove that the Lagrange multiplier statistic is nR² in the regression of (yi − p) on the x's, where p is the sample proportion of 1's.


5. We are interested in the ordered probit model. Our data consist of 250 observations, for which the responses are

   y    0    1    2    3    4
   n   50   40   45   80   35

   Using the preceding data, obtain maximum likelihood estimates of the unknown parameters of the model. (Hint: Consider the probabilities as the unknown parameters.)
6. The following hypothetical data give the participation rates in a particular type of recycling program and the number of trucks purchased for collection by 10 towns in a small mid-Atlantic state:

   Town             1    2    3    4    5    6    7    8    9   10
   Trucks         160  250  170  365  210  206  203  305  270  340
   Participation%  11   74    8   87   62   83   48   84   71   79

   The town of Eleven is contemplating initiating a recycling program but wishes to achieve a 95 percent rate of participation. Using a probit model for your analysis,
   a. How many trucks would the town expect to have to purchase to achieve its goal? (Hint: You can form the log-likelihood by replacing yi with the participation rate (e.g., 0.11 for observation 1) and (1 − yi) with 1 minus the rate in (23-33).)
   b. If trucks cost $20,000 each, then is a goal of 90 percent reachable within a budget of $6.5 million? (That is, should they expect to reach the goal?)
   c. According to your model, what is the marginal value of the 301st truck in terms of the increase in the percentage participation?
7. A data set consists of n = n1 + n2 + n3 observations on y and x. For the first n1 observations, y = 1 and x = 1. For the next n2 observations, y = 0 and x = 1. For the last n3 observations, y = 0 and x = 0. Prove that neither (23-19) nor (23-21) has a solution.
8. Prove (23-28).
9. In the panel data models estimated in Section 23.5, neither the logit nor the probit model provides a framework for applying a Hausman test to determine whether fixed or random effects is preferred. Explain. (Hint: Unlike our application in the linear model, the incidental parameters problem persists here.)

Applications

1. Appendix Table F24.1 provides Fair’s (1978) Redbook survey on extramarital affairs. The data are described in Application 1 at the end of Chapter 24 and in Appendix F. The variables in the data set are as follows:

id = an identification number,
C = constant, value = 1,
yrb = a constructed measure of time spent in extramarital affairs,


v1 = a rating of the marriage, coded 1 to 4,
v2 = age, in years, aggregated,
v3 = number of years married,
v4 = number of children, top coded at 5,
v5 = religiosity, 1 to 4, 1 = not, 4 = very,
v6 = education, coded 9, 12, 14, 16, 17, 20,
v7 = occupation,
v8 = husband's occupation,

and three other variables that are not used. The sample contains a survey of 6,366 married women, conducted by Redbook magazine. For this exercise, we will analyze, first, the binary variable

A = 1 if yrb > 0, 0 otherwise.

The regressors of interest are v1 to v8; however, not necessarily all of them belong in your model. Use these data to build a binary choice model for A. Report all computed results for the model. Compute the marginal effects for the variables you choose. Compare the results you obtain for a probit model to those for a logit model. Are there any substantial differences in the results for the two models?
2. Continuing the analysis of the first application, we now consider the self-reported rating, v1. This is a natural candidate for an ordered choice model, because the simple four-item coding is a censored version of what would be a continuous scale on some subjective satisfaction variable. Analyze this variable using an ordered probit model. What variables appear to explain the response to this survey question? (Note, the variable is coded 1, 2, 3, 4. Some programs accept data for ordered choice modeling in this form, e.g., Stata, while others require the variable to be coded 0, 1, 2, 3, e.g., LIMDEP. Be sure to determine which is appropriate for the program you are using and transform the data if necessary.) Can you obtain the partial effects for your model? Report them as well. What do they suggest about the impact of the different independent variables on the reported ratings?
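For readers working through the first application in a modern environment, the probit/logit comparison can be set up in a few lines. The sketch below assumes the data have already been read into a pandas DataFrame named df with columns yrb and v1 through v8; the choice to include all eight regressors is illustrative only:

```python
import statsmodels.api as sm

# df is assumed to hold the Redbook data with columns yrb, v1, ..., v8
df["A"] = (df["yrb"] > 0).astype(int)        # binary dependent variable
X = sm.add_constant(df[["v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8"]])

probit_res = sm.Probit(df["A"], X).fit()
logit_res = sm.Logit(df["A"], X).fit()

print(probit_res.summary())
print(probit_res.get_margeff().summary())    # marginal effects
print(logit_res.get_margeff().summary())
```

The coefficient vectors will differ by roughly the familiar scale factor, but the marginal effects from the two models are typically very close, which is the comparison the application asks for.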

24 TRUNCATION, CENSORING, AND SAMPLE SELECTION

24.1 INTRODUCTION

This chapter is concerned with truncation and censoring.1 The effect of truncation occurs when sample data are drawn from a subset of a larger population of interest. For example, studies of income based on incomes above or below some poverty line may be of limited usefulness for inference about the whole population. Truncation is essentially a characteristic of the distribution from which the sample data are drawn. Censoring is a more common feature of recent studies. To continue the example, suppose that instead of being unobserved, all incomes below the poverty line are reported as if they were at the poverty line. The censoring of a range of values of the variable of interest introduces a distortion into conventional statistical results that is similar to that of truncation. Unlike truncation, however, censoring is essentially a defect in the sample data. Presumably, if they were not censored, the data would be a representative sample from the population of interest.
This chapter will discuss three broad topics: truncation, censoring, and a form of truncation called the sample selection problem. Although most empirical work in this area involves censoring rather than truncation, we will study the simpler model of truncation first. It provides most of the theoretical tools we need to analyze models of censoring and sample selection.2

24.2 TRUNCATION

In this section, we are concerned with inferring the characteristics of a full population from a sample drawn from a restricted part of that population.

24.2.1 TRUNCATED DISTRIBUTIONS

A truncated distribution is the part of an untruncated distribution that is above or below some specified value. For instance, in Example 24.2, we are given a characteristic of the distribution of incomes above $100,000. This subset is a part of the full distribution of incomes which range from zero to (essentially) infinity.

1Five of the many surveys of these topics are Dhrymes (1984), Maddala (1977b, 1983, 1984), and Amemiya (1984). The last is part of a symposium on censored and truncated regression models. Surveys that are oriented toward applications and techniques are Long (1997), deMaris (2004), and Greene (2006). Some recent results on non- and semiparametric estimation appear in Lee (1996).
2Although sample selection can be viewed formally as merely a type of truncation, it is useful to treat the methodological aspects separately.


THEOREM 24.1  Density of a Truncated Random Variable
If a continuous random variable x has pdf f(x) and a is a constant, then3

f(x | x > a) = f(x) / Prob(x > a).

The proof follows from the definition of conditional probability and amounts merely to scaling the density so that it integrates to one over the range above a. Note that the truncated distribution is a conditional distribution.

Most recent applications based on continuous random variables use the truncated normal distribution. If x has a normal distribution with mean μ and standard deviation σ, then

Prob(x > a) = 1 − Φ((a − μ)/σ) = 1 − Φ(α),

where α = (a − μ)/σ and Φ(.) is the standard normal cdf. The density of the truncated normal distribution is then

f(x | x > a) = f(x)/(1 − Φ(α)) = (2πσ²)^(−1/2) e^(−(x−μ)²/(2σ²)) / (1 − Φ(α)) = (1/σ)φ((x − μ)/σ) / (1 − Φ(α)),

where φ(.) is the standard normal pdf. The truncated standard normal distribution, with μ = 0 and σ = 1, is illustrated for a = −0.5, 0, and 0.5 in Figure 24.1.
Another truncated distribution that has appeared in the recent literature, this one for a discrete random variable, is the truncated at zero Poisson distribution,

Prob[Y = y | y > 0] = (e^(−λ)λ^y/y!) / Prob[Y > 0] = (e^(−λ)λ^y/y!) / (1 − Prob[Y = 0]) = (e^(−λ)λ^y/y!) / (1 − e^(−λ)),  λ > 0, y = 1, . . .

This distribution is used in models of uses of recreation and other kinds of facilities where observations of zero uses are discarded.4 For convenience in what follows, we shall call a random variable whose distribution is truncated a truncated random variable.
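Both truncated densities are straightforward to evaluate numerically. A short check using scipy (note that scipy's truncnorm takes its truncation limits in standardized units) confirms the scaling in Theorem 24.1 and the truncated-at-zero Poisson probability:

```python
import numpy as np
from scipy import stats

# truncated standard normal, truncation from below at a
a, x = 0.5, 1.2
direct = stats.norm.pdf(x) / (1 - stats.norm.cdf(a))   # f(x)/Prob(x > a)
tn = stats.truncnorm(a, np.inf)                        # limits in std. units
print(direct, tn.pdf(x))                               # identical

# truncated-at-zero Poisson: Prob[Y = y | y > 0]
lam, y = 2.0, 3
print(stats.poisson.pmf(y, lam) / (1 - np.exp(-lam)))
```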

24.2.2 MOMENTS OF TRUNCATED DISTRIBUTIONS

We are usually interested in the mean and variance of the truncated random variable. They would be obtained by the general formula

E[x | x > a] = ∫_a^∞ x f(x | x > a) dx

for the mean and likewise for the variance.

3The case of truncation from above instead of below is handled in an analogous fashion and does not require any new results.
4See Shaw (1988).


FIGURE 24.1  Truncated Normal Distributions. (Densities of the truncated standard normal for truncation points a = −0.5, 0, and 0.5, with each truncation point and the mean of the distribution marked.)

Example 24.1  Truncated Uniform Distribution
If x has a standard uniform distribution, denoted U(0, 1), then f(x) = 1, 0 ≤ x ≤ 1. The distribution truncated at x = 1/3 is also uniform:

f(x | x > 1/3) = f(x)/Prob(x > 1/3) = 1/(2/3) = 3/2,  1/3 ≤ x ≤ 1.

The expected value is

E[x | x > 1/3] = ∫_{1/3}^1 x(3/2) dx = 2/3.

For a variable distributed uniformly between L and U, the variance is (U − L)²/12. Thus,

Var[x | x > 1/3] = 1/27.

The mean and variance of the untruncated distribution are 1/2 and 1/12, respectively.

Example 24.1 illustrates two results.
1. If the truncation is from below, then the mean of the truncated variable is greater than the mean of the original one. If the truncation is from above, then the mean of the truncated variable is smaller than the mean of the original one. This is clearly visible in Figure 24.1.
2. Truncation reduces the variance compared with the variance in the untruncated distribution.
Henceforth, we shall use the terms truncated mean and truncated variance to refer to the mean and variance of the random variable with a truncated distribution.
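The two results are easy to verify numerically for Example 24.1; a quick check by quadrature (any numerical integrator will do):

```python
from scipy import integrate

# density of U(0, 1) truncated at x = 1/3 is 3/2 on [1/3, 1]
mean, _ = integrate.quad(lambda x: x * 1.5, 1/3, 1)
ex2, _ = integrate.quad(lambda x: x**2 * 1.5, 1/3, 1)
print(mean)              # 2/3, above the untruncated mean of 1/2
print(ex2 - mean**2)     # 1/27, below the untruncated variance of 1/12
```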


For the truncated normal distribution, we have the following theorem:5

THEOREM 24.2  Moments of the Truncated Normal Distribution
If x ∼ N[μ, σ²] and a is a constant, then

E[x | truncation] = μ + σλ(α),   (24-1)
Var[x | truncation] = σ²[1 − δ(α)],   (24-2)

where α = (a − μ)/σ, φ(α) is the standard normal density, and

λ(α) = φ(α)/[1 − Φ(α)]  if truncation is x > a,   (24-3a)
λ(α) = −φ(α)/Φ(α)  if truncation is x < a,   (24-3b)

and

δ(α) = λ(α)[λ(α) − α].   (24-4)

An important result is

0 < δ(α) < 1 for all values of α,

which implies point 2 after Example 24.1. A result that we will use at several points below is dφ(α)/dα = −αφ(α). The function λ(α) is called the inverse Mills ratio. The function in (24-3a) is also called the hazard function for the standard normal distribution.
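The moment formulas in Theorem 24.2 can be checked against an independent implementation of the truncated normal; a brief sketch with scipy:

```python
import numpy as np
from scipy import stats

def lam(alpha):          # inverse Mills ratio for truncation x > a, (24-3a)
    return stats.norm.pdf(alpha) / (1 - stats.norm.cdf(alpha))

def delta(alpha):        # (24-4)
    return lam(alpha) * (lam(alpha) - alpha)

mu, sigma, a = 0.0, 1.0, 0.5
alpha = (a - mu) / sigma
print(mu + sigma * lam(alpha), sigma**2 * (1 - delta(alpha)))  # (24-1), (24-2)

# the same moments from scipy's truncated normal
tn = stats.truncnorm((a - mu) / sigma, np.inf, loc=mu, scale=sigma)
print(tn.mean(), tn.var())
```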

Example 24.2  A Truncated Lognormal Income Distribution
“The typical ‘upper affluent American’ . . . makes $142,000 per year. . . . The people surveyed had household income of at least $100,000.”6 Would this statistic tell us anything about the “typical American”? As it stands, it probably does not (popular impressions notwithstanding). The 1987 article where this appeared went on to state, “If you’re in that category, pat yourself on the back—only 2 percent of American households make the grade, according to the survey.” Because the degree of truncation in the sample is 98 percent, the $142,000 was probably quite far from the mean in the full population. Suppose that incomes, x, in the population were lognormally distributed—see Section B.4.4. Then, the log of income, y, had a normal distribution with, say, mean μ and standard deviation σ. Suppose that the survey was large enough for us to treat the sample average as the true mean. Assuming so, we'll deduce μ and σ and then determine the population mean income. Two useful numbers for this example are ln 100 = 4.605 and ln 142 = 4.956. The article states that

Prob[x ≥ 100] = Prob[exp(y) ≥ 100] = 0.02,

or

Prob(y < 4.605) = 0.98.

5Details may be found in Johnson, Kotz, and Balakrishnan (1994, pp. 156–158).
6See New York Post (1987).


This implies that

Prob[(y − μ)/σ < (4.605 − μ)/σ] = 0.98.

Because Φ[(4.605 − μ)/σ] = 0.98, we know that

Φ⁻¹(0.98) = 2.054 = (4.605 − μ)/σ,

or

4.605 = μ + 2.054σ.

The article also states that

E[x | x > 100] = E[exp(y) | exp(y) > 100] = 142,

or

E[exp(y) | y > 4.605] = 142.

To proceed, we need another result for the lognormal distribution:

If y ∼ N[μ, σ²], then E[exp(y) | y > a] = exp(μ + σ²/2) × Φ(σ − (a − μ)/σ) / [1 − Φ((a − μ)/σ)].

[See Johnson, Kotz, and Balakrishnan (1995, p. 241).] For our application, we would equate this expression to 142, and a to ln 100 = 4.605. This provides a second equation. To estimate the two parameters, we used the method of moments. We solved the minimization problem

Minimize_{μ,σ} [4.605 − (μ + 2.054σ)]² + [142Φ((μ − 4.605)/σ) − exp(μ + σ²/2)Φ(σ − (4.605 − μ)/σ)]².

The two solutions are 2.89372 and 0.83314 for μ and σ, respectively. To obtain the mean income, we now use the result that if y ∼ N[μ, σ²] and x = exp(y), then E[x] = exp(μ + σ²/2). Inserting our values for μ and σ gives E[x] = $25,554. The 1987 Statistical Abstract of the United States gives the mean of household incomes across all groups for the United States as about $25,000. So, the estimate based on surprisingly little information would have been relatively good. These meager data did, indeed, tell us something about the average American.
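The method of moments calculation in Example 24.2 is easily reproduced. A sketch of the minimization with scipy (the starting values are arbitrary; the objective is the one displayed above):

```python
import numpy as np
from scipy import optimize, stats

Phi = stats.norm.cdf

def objective(p):
    mu, sigma = p
    e1 = 4.605 - (mu + 2.054 * sigma)
    e2 = (142 * Phi((mu - 4.605) / sigma)
          - np.exp(mu + sigma**2 / 2) * Phi(sigma - (4.605 - mu) / sigma))
    return e1**2 + e2**2

res = optimize.minimize(objective, x0=[3.0, 1.0], method="Nelder-Mead")
mu, sigma = res.x
print(mu, sigma)                    # approximately 2.894 and 0.833
print(np.exp(mu + sigma**2 / 2))    # about 25.55, i.e., $25,554
```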

24.2.3 THE TRUNCATED REGRESSION MODEL

In the model of the earlier examples, we now assume that μi = xi′β is the deterministic part of the classical regression model. Then

yi = xi′β + εi,

where

εi | xi ∼ N[0, σ²],

so that

yi | xi ∼ N[xi′β, σ²].   (24-5)


We are interested in the distribution of yi given that yi is greater than the truncation point a. This is the result described in Theorem 24.2. It follows that

E[yi | yi > a] = xi′β + σ φ[(a − xi′β)/σ] / {1 − Φ[(a − xi′β)/σ]}.   (24-6)

The conditional mean is therefore a nonlinear function of a, σ, x, and β.
The marginal effects in this model in the subpopulation can be obtained by writing

E[yi | yi > a] = xi′β + σλ(αi),   (24-7)

where now αi = (a − xi′β)/σ. For convenience, let λi = λ(αi) and δi = δ(αi). Then

∂E[yi | yi > a]/∂xi = β + σ(dλi/dαi) ∂αi/∂xi
                    = β + σ(λi² − αiλi)(−β/σ)
                    = β(1 − λi² + αiλi)   (24-8)
                    = β(1 − δi).

Note the appearance of the scale factor 1 − δi from the truncated variance. Because (1 − δi) is between zero and one, we conclude that for every element of xi, the marginal effect is less than the corresponding coefficient. There is a similar attenuation of the variance. In the subpopulation yi > a, the regression variance is not σ² but

Var[yi | yi > a] = σ²(1 − δi).   (24-9)

Whether the marginal effect in (24-7) or the coefficient β itself is of interest depends on the intended inferences of the study. If the analysis is to be confined to the subpopulation, then (24-7) is of interest. If the study is intended to extend to the entire population, however, then it is the coefficients β that are actually of interest.
One's first inclination might be to use ordinary least squares to estimate the parameters of this regression model. For the subpopulation from which the data are drawn, we could write (24-6) in the form

yi | yi > a = E[yi | yi > a] + ui = xi′β + σλi + ui,   (24-10)

where ui is yi minus its conditional expectation. By construction, ui has a zero mean, but it is heteroscedastic:

Var[ui] = σ²(1 − λi² + λiαi) = σ²(1 − δi),

which is a function of xi. If we estimate (24-10) by ordinary least squares regression of y on X, then we have omitted a variable, the nonlinear term λi. All the biases that arise because of an omitted variable can be expected.7 Without some knowledge of the distribution of x, it is not possible to determine how serious the bias is likely to be. A result obtained by Chung and Goldberger (1984) is broadly suggestive. If E[x | y] in the full population is a linear function of y, then plim b = βτ for some proportionality constant τ. This result is consistent with the widely observed (albeit rather rough) proportionality relationship between least squares estimates

7See Heckman (1979) who formulates this as a “specification error.”


of this model and maximum likelihood estimates.8 The proportionality result appears to be quite general. In applications, it is usually found that, compared with consistent maximum likelihood estimates, the OLS estimates are biased toward zero. (See Example 24.4.)
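The attenuation of least squares in a truncated sample is easy to see in a small simulation. The sketch below, with arbitrary parameter values, regresses y on x using only the observations with y > 0; the estimated slope falls well below the population value:

```python
import numpy as np

rng = np.random.default_rng(12345)
n, beta0, beta1, sigma = 100_000, -1.0, 1.0, 1.5

x = rng.normal(size=n)
y = beta0 + beta1 * x + sigma * rng.normal(size=n)

keep = y > 0                              # truncation from below at zero
cv = np.cov(x[keep], y[keep])
print(keep.mean())                        # fraction of the sample retained
print(cv[0, 1] / cv[0, 0])                # OLS slope, well below beta1 = 1.0
```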

24.3 CENSORED DATA

A very common problem in microeconomic data is censoring of the dependent variable. When the dependent variable is censored, values in a certain range are all transformed to (or reported as) a single value. Some examples that have appeared in the empirical literature are as follows:9
1. Household purchases of durable goods [Tobin (1958)],
2. The number of extramarital affairs [Fair (1977, 1978)],
3. The number of hours worked by a woman in the labor force [Quester and Greene (1982)],
4. The number of arrests after release from prison [Witte (1980)],
5. Household expenditure on various commodity groups [Jarque (1987)],
6. Vacation expenditures [Melenberg and van Soest (1996)].
Each of these studies analyzes a dependent variable that is zero for a significant fraction of the observations. Conventional regression methods fail to account for the qualitative difference between limit (zero) observations and nonlimit (continuous) observations.

24.3.1 THE CENSORED NORMAL DISTRIBUTION

The relevant distribution theory for a censored variable is similar to that for a truncated one. Once again, we begin with the normal distribution, as much of the received work has been based on an assumption of normality. We also assume that the censoring point is zero, although this is only a convenient normalization. In a truncated distribution, only the part of the distribution above y = 0 is relevant to our computations. To make the distribution integrate to one, we scale it up by the probability that an observation in the untruncated population falls in the range that interests us. When data are censored, the distribution that applies to the sample data is a mixture of discrete and continuous distributions. Figure 24.2 illustrates the effects.
To analyze this distribution, we define a new random variable y transformed from the original one, y*, by

y = 0 if y* ≤ 0,
y = y* if y* > 0.

The distribution that applies if y* ∼ N[μ, σ²] is

Prob(y = 0) = Prob(y* ≤ 0) = Φ(−μ/σ) = 1 − Φ(μ/σ),

and if y* > 0, then y has the density of y*. This distribution is a mixture of discrete and continuous parts. The total probability is one, as required, but instead of scaling the second part, we simply assign the full probability in the censored region to the censoring point, in this case, zero.

8See the appendix in Hausman and Wise (1977) and Greene (1983) as well.
9More extensive listings may be found in Amemiya (1984) and Maddala (1983).


FIGURE 24.2  Partially Censored Distribution. (Two panels: the distribution of seats demanded and the distribution of tickets sold, each with the capacity marked.)

THEOREM 24.3  Moments of the Censored Normal Variable
If y* ∼ N[μ, σ²] and y = a if y* ≤ a or else y = y*, then

E[y] = aΦ + (1 − Φ)(μ + σλ),

and

Var[y] = σ²(1 − Φ)[(1 − δ) + (α − λ)²Φ],

where Φ[(a − μ)/σ] = Φ(α) = Prob(y* ≤ a) = Φ, λ = φ/(1 − Φ), and δ = λ² − λα.

Proof: For the mean,

E[y] = Prob(y = a) × E[y | y = a] + Prob(y > a) × E[y | y > a]
     = Prob(y* ≤ a) × a + Prob(y* > a) × E[y* | y* > a]
     = aΦ + (1 − Φ)(μ + σλ)

using Theorem 24.2. For the variance, we use a counterpart to the decomposition in (B-69), that is, Var[y] = E[conditional variance] + Var[conditional mean], and Theorem 24.2.


For the special case of a = 0, the mean simplifies to

E[y | a = 0] = Φ(μ/σ)(μ + σλ),  where λ = φ(μ/σ)/Φ(μ/σ).

For censoring of the upper part of the distribution instead of the lower, it is only necessary to reverse the role of Φ and 1 − Φ and redefine λ as in Theorem 24.2.

Example 24.3  Censored Random Variable
We are interested in the number of tickets demanded for events at a certain arena. Our only measure is the number actually sold. Whenever an event sells out, however, we know that the actual number demanded is larger than the number sold. The number of tickets demanded is censored when it is transformed to obtain the number sold. Suppose that the arena in question has 20,000 seats and, in a recent season, sold out 25 percent of the time. If the average attendance, including sellouts, was 18,000, then what are the mean and standard deviation of the demand for seats? According to Theorem 24.3, the 18,000 is an estimate of

E[sales] = 20,000(1 − Φ) + Φ(μ + σλ).

Because this is censoring from above, rather than below, λ = −φ(α)/Φ(α). The argument of Φ, φ, and λ is α = (20,000 − μ)/σ. If 25 percent of the events are sellouts, then Φ = 0.75. Inverting the standard normal at 0.75 gives α = 0.675. In addition, if α = 0.675, then λ = −φ(0.675)/0.75 = −0.424. This result provides two equations in μ and σ: (a) 18,000 = 0.25(20,000) + 0.75(μ − 0.424σ) and (b) 0.675σ = 20,000 − μ. The solutions are σ = 2426 and μ = 18,362.
For comparison, suppose that we were told that the mean of 18,000 applies only to the events that were not sold out and that, on average, the arena sells out 25 percent of the time. Now our estimates would be obtained from the equations (a) 18,000 = μ − 0.424σ and (b) 0.675σ = 20,000 − μ. The solutions are σ = 1820 and μ = 18,772.
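The two-equation solutions in Example 24.3 can also be obtained with a numerical root finder. A sketch for the first case (censoring from above at the 20,000-seat capacity):

```python
from scipy import optimize, stats

def equations(p):
    mu, sigma = p
    alpha = (20_000 - mu) / sigma
    lam = -stats.norm.pdf(alpha) / stats.norm.cdf(alpha)  # truncation from above
    eq1 = 18_000 - (0.25 * 20_000 + 0.75 * (mu + sigma * lam))
    eq2 = stats.norm.cdf(alpha) - 0.75    # 25 percent of events sell out
    return [eq1, eq2]

mu, sigma = optimize.fsolve(equations, x0=[18_000.0, 2_000.0])
print(mu, sigma)   # approximately 18,362 and 2,426
```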

24.3.2 THE CENSORED REGRESSION (TOBIT) MODEL

The regression model based on the preceding discussion is referred to as the censored regression model or the tobit model. [In reference to Tobin (1958), where the model was first proposed.] The regression is obtained by making the mean in the preceding correspond to a classical regression model. The general formulation is usually given in terms of an index function,

yi* = xi′β + εi,
yi = 0 if yi* ≤ 0,   (24-11)
yi = yi* if yi* > 0.

There are potentially three conditional mean functions to consider, depending on the purpose of the study. For the index variable, sometimes called the latent variable, E[yi* | xi] is xi′β. If the data are always censored, however, then this result will usually not be useful. Consistent with Theorem 24.3, for an observation randomly drawn from the population, which may or may not be censored,

E[yi | xi] = Φ(xi′β/σ)(xi′β + σλi),

where

λi = φ[(0 − xi′β)/σ] / {1 − Φ[(0 − xi′β)/σ]} = φ(xi′β/σ)/Φ(xi′β/σ).   (24-12)


Finally, if we intend to confine our attention to uncensored observations, then the results for the truncated regression model apply. The limit observations should not be discarded, however, because the truncated regression model is no more amenable to least squares than the censored data model. It is an unresolved question which of these functions should be used for computing predicted values from this model. Intuition suggests that E[yi | xi] is correct, but authors differ on this point. For the setting in Example 24.3, for predicting the number of tickets sold, say, to plan for an upcoming event, the censored mean is obviously the relevant quantity. On the other hand, if the objective is to study the need for a new facility, then the mean of the latent variable yi* would be more interesting.
There are differences in the marginal effects as well. For the index variable,

∂E[yi* | xi]/∂xi = β.

But this result is not what will usually be of interest, because yi* is unobserved. For the observed data, yi, the following general result will be useful:10

THEOREM 24.4  Marginal Effects in the Censored Regression Model
In the censored regression model with latent regression y* = x′β + ε and observed dependent variable y = a if y* ≤ a, y = b if y* ≥ b, and y = y* otherwise, where a and b are constants, let f(ε) and F(ε) denote the density and cdf of ε. Assume that ε is a continuous random variable with mean 0 and variance σ², and f(ε | x) = f(ε). Then

∂E[y | x]/∂x = β × Prob[a < y* < b].

Proof: By definition,

E[y | x] = a Prob[y* ≤ a | x] + b Prob[y* ≥ b | x] + Prob[a < y* < b | x] E[y* | a < y* < b, x].

Let αj = (j − x′β)/σ, Fj = F(αj), fj = f(αj), j = a, b. Then

E[y | x] = aFa + b(1 − Fb) + (Fb − Fa) E[y* | a < y* < b, x].

Because y* = x′β + σ[(y* − x′β)/σ], the conditional mean may be written

E[y* | a < y* < b, x] = x′β + σ E[(y* − x′β)/σ | αa < (y* − x′β)/σ < αb]
                      = x′β + σ ∫_{αa}^{αb} (ε/σ) f(ε/σ) / (Fb − Fa) d(ε/σ).

10See Greene (1999) for the general result and Rosett and Nelson (1975) and Nakamura and Nakamura (1983) for applications based on the normal distribution.


THEOREM 24.4  (Continued)
Collecting terms, we have

E[y | x] = aFa + b(1 − Fb) + (Fb − Fa)x′β + σ ∫_{αa}^{αb} (ε/σ) f(ε/σ) d(ε/σ).

Now, differentiate with respect to x. The only complication is the last term, for which the differentiation is with respect to the limits of integration. We use Leibnitz's theorem and use the assumption that f(ε) does not involve x. Thus,

∂E[y | x]/∂x = afa(−β/σ) − bfb(−β/σ) + (Fb − Fa)β + (x′β)(fb − fa)(−β/σ) + σ[αb fb − αa fa](−β/σ).

After inserting the definitions of αa and αb, and collecting terms, we find all terms sum to zero save for the desired result,

∂E[y | x]/∂x = (Fb − Fa)β = β × Prob[a < y* < b].

Note that this general result includes censoring in either or both tails of the distribution, and it does not assume that ε is normally distributed. For the standard case with censoring at zero and normally distributed disturbances, the result specializes to

∂E[yi | xi]/∂xi = βΦ(xi′β/σ).

Although not a formal result, this does suggest a reason why, in general, least squares estimates of the coefficients in a tobit model usually resemble the MLEs times the proportion of nonlimit observations in the sample. McDonald and Moffitt (1980) suggested a useful decomposition of ∂E[yi | xi]/∂xi,

∂E[yi | xi]/∂xi = β × {Φi[1 − λi(αi + λi)] + φi(αi + λi)},

where αi = xi′β/σ, Φi = Φ(αi), and λi = φi/Φi. Taking the two parts separately, this result decomposes the slope vector into

∂E[yi | xi]/∂xi = Prob[yi > 0] ∂E[yi | xi, yi > 0]/∂xi + E[yi | xi, yi > 0] ∂Prob[yi > 0]/∂xi.

Thus, a change in xi has two effects: It affects the conditional mean of yi* in the positive part of the distribution, and it affects the probability that the observation will fall in that part of the distribution.
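A useful arithmetic fact, and a check on the decomposition, is that the two McDonald and Moffitt terms sum exactly to the scale factor Φ(xi′β/σ) given above. A few lines verify this at an arbitrary value of the index:

```python
from scipy import stats

def mcdonald_moffitt(xb, sigma=1.0):
    alpha = xb / sigma
    Phi, phi = stats.norm.cdf(alpha), stats.norm.pdf(alpha)
    lam = phi / Phi
    part1 = Phi * (1 - lam * (alpha + lam))   # conditional-mean component
    part2 = phi * (alpha + lam)               # probability component
    return part1, part2, Phi

p1, p2, scale = mcdonald_moffitt(0.8)
print(p1 + p2, scale)                         # identical
```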

Example 24.4  Estimated Tobit Equations for Hours Worked
In their study of the number of hours worked in a survey year by a large sample of wives, Quester and Greene (1982) were interested in whether wives whose marriages were statistically more likely to dissolve hedged against that possibility by spending, on average, more time working. They reported the tobit estimates given in Table 24.1. The last figure in the


TABLE 24.1  Tobit Estimates of an Hours Worked Equation

                           White Wives               Black Wives          Least      Scaled
                      Coefficient     Slope     Coefficient     Slope    Squares      OLS
Constant              −1803.13                  −2753.87
                       (−8.64)                   (−9.68)
Small kids            −1324.84     −385.89       −824.19     −376.53    −352.63   −766.56
                      (−19.78)                  (−10.14)
Education difference    −48.08      −14.00         22.59       10.32      11.47     24.93
                       (−4.77)                    (1.96)
Relative wage           312.07       90.90        286.39      130.93     123.95    269.46
                        (5.71)                    (3.32)
Second marriage         175.85       51.51         25.33       11.57      13.14     28.57
                        (3.47)                    (0.41)
Mean divorce            417.39      121.58        481.02      219.75     219.22    476.57
  probability           (6.52)                    (5.28)
High divorce            670.22      195.22        578.66      264.36     244.17    530.80
  probability           (8.40)                    (5.33)
σ                      1559                      1511                    618        826
Sample size            7459                      2798
Proportion working        0.29                      0.46

table implies that a very large proportion of the women reported zero hours, so least squares regression would be inappropriate.
The figures in parentheses are the ratio of the coefficient estimate to the estimated asymptotic standard error. The dependent variable is hours worked in the survey year. “Small kids” is a dummy variable indicating whether there were children in the household. The “education difference” and “relative wage” variables compare husband and wife on these two dimensions. The wage rate used for wives was predicted using a previously estimated regression model and is thus available for all individuals, whether working or not. “Second marriage” is a dummy variable. Divorce probabilities were produced by a large microsimulation model presented in another study [Orcutt, Caldwell, and Wertheimer (1976)]. The variables used here were dummy variables indicating “mean” if the predicted probability was between 0.01 and 0.03 and “high” if it was greater than 0.03.
The “slopes” are the marginal effects described earlier. Note the marginal effects compared with the tobit coefficients. Likewise, the estimate of σ is quite misleading as an estimate of the standard deviation of hours worked. The effects of the divorce probability variables were as expected and were quite large. One of the questions raised in connection with this study was whether the divorce probabilities could reasonably be treated as independent variables. It might be that for these individuals, the number of hours worked was a significant determinant of the probability.

24.3.3 ESTIMATION

Estimation of this model is very similar to that of truncated regression. The tobit model has become so routine and been incorporated in so many computer packages that despite formidable obstacles in years past, estimation is now essentially on the level of ordinary linear regression.11 The log-likelihood for the censored regression model is

ln L = Σ_{yi>0} −(1/2)[ln(2π) + ln σ² + (yi − xi′β)²/σ²] + Σ_{yi=0} ln[1 − Φ(xi′β/σ)].   (24-13)

11See Hall (1984).


The two parts correspond to the classical regression for the nonlimit observations and the relevant probabilities for the limit observations, respectively. This likelihood is a nonstandard type, because it is a mixture of discrete and continuous distributions. In a seminal paper, Amemiya (1973) showed that despite the complications, proceeding in the usual fashion to maximize ln L would produce an estimator with all the familiar desirable properties attained by MLEs.
The log-likelihood function is fairly involved, but Olsen's (1978) reparameterization simplifies things considerably. With γ = β/σ and θ = 1/σ, the log-likelihood is

ln L = Σ_{yi>0} −(1/2)[ln(2π) − ln θ² + (θyi − xi′γ)²] + Σ_{yi=0} ln[1 − Φ(xi′γ)].   (24-14)

The results in this setting are now very similar to those for the truncated regression. The Hessian is always negative definite, so Newton's method is simple to use and usually converges quickly. After convergence, the original parameters can be recovered using σ = 1/θ and β = γ/θ. The asymptotic covariance matrix for these estimates can be obtained from that for the estimates of [γ, θ] using the delta method: Est. Asy. Var[β̂, σ̂] = Ĵ Asy. Var[γ̂, θ̂] Ĵ′, where

J = [∂β/∂γ′  ∂β/∂θ; ∂σ/∂γ′  ∂σ/∂θ] = [(1/θ)I  (−1/θ²)γ; 0′  (−1/θ²)].
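The simplicity of the reparameterized log-likelihood makes direct maximization easy with generic tools. The following is a self-contained sketch (simulated data, and scipy's BFGS rather than Newton's method) of tobit estimation via (24-14), recovering β = γ/θ and σ = 1/θ afterward:

```python
import numpy as np
from scipy import optimize, stats

def neg_loglik(params, X, y):
    """Negative of the Olsen-reparameterized tobit log-likelihood (24-14)."""
    gamma, theta = params[:-1], params[-1]
    xg = X @ gamma
    pos = y > 0
    ll_pos = -0.5 * (np.log(2 * np.pi) - np.log(theta**2)
                     + (theta * y[pos] - xg[pos]) ** 2)
    ll_lim = stats.norm.logcdf(-xg[~pos])     # ln[1 - Phi(x'gamma)]
    return -(ll_pos.sum() + ll_lim.sum())

# simulated data for illustration: true beta = (0.5, 1.0), sigma = 1
rng = np.random.default_rng(7)
n = 2_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = np.maximum(X @ np.array([0.5, 1.0]) + rng.normal(size=n), 0.0)

res = optimize.minimize(neg_loglik, x0=np.array([0.0, 0.0, 1.0]),
                        args=(X, y), method="BFGS")
gamma, theta = res.x[:-1], res.x[-1]
print(gamma / theta, 1 / theta)               # beta and sigma estimates
```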

Researchers often compute ordinary least squares estimates despite their inconsistency. Almost without exception, it is found that the OLS estimates are smaller in absolute value than the MLEs. A striking empirical regularity is that the maximum likelihood estimates can often be approximated by dividing the OLS estimates by the proportion of nonlimit observations in the sample.12 The effect is illustrated in the last two columns of Table 24.1. Another strategy is to discard the limit observations, but we now see that this just trades the censoring problem for the truncation problem.

24.3.4 SOME ISSUES IN SPECIFICATION

Two issues that commonly arise in microeconomic data, heteroscedasticity and nonnormality, have been analyzed at length in the tobit setting.13

24.3.4.a Heteroscedasticity
Maddala and Nelson (1975), Hurd (1979), Arabmazar and Schmidt (1982a,b), and Brown and Moffitt (1982) all have varying degrees of pessimism regarding how inconsistent the maximum likelihood estimator will be when heteroscedasticity occurs. Not surprisingly, the degree of censoring is the primary determinant. Unfortunately, all the analyses have been carried out in the setting of very specific models—for example, involving only a single dummy variable or one with groupwise heteroscedasticity—so the primary lesson is the very general conclusion that heteroscedasticity emerges as an obviously serious problem.

12This concept is explored further in Greene (1980b), Goldberger (1981), and Chung and Goldberger (1984).
13Two symposia that contain numerous results on these subjects are Blundell (1987) and Duncan (1986b). An application that explores these two issues in detail is Melenberg and van Soest (1996).


TABLE 24.2  Estimates of a Tobit Model (standard errors in parentheses)

                 Homoscedastic            Heteroscedastic
                       β                  β              α
Constant        −18.28 (5.10)        −4.11 (3.28)    −0.47 (0.60)
Beta             10.97 (3.61)         2.22 (2.00)     1.20 (1.81)
Nonmarket         0.65 (7.41)         0.12 (1.90)     0.08 (7.55)
Number            0.75 (5.74)         0.33 (4.50)     0.15 (4.58)
Merger            0.50 (5.90)         0.24 (3.00)     0.06 (4.17)
Option            2.56 (1.51)         2.96 (2.99)     0.83 (1.70)
ln L           −547.30             −466.27
Sample size     200                  200

One can approach the heteroscedasticity problem directly. Petersen and Waldman (1981) present the computations needed to estimate a tobit model with heteroscedasticity of several types. Replacing σ with σi in the log-likelihood function and including σi² in the summations produces the needed generality. Specification of a particular model for σi provides the empirical model for estimation.

Example 24.5  Multiplicative Heteroscedasticity in the Tobit Model
Petersen and Waldman (1981) analyzed the volume of short interest in a cross section of common stocks. The regressors included a measure of the market component of heterogeneous expectations as measured by the firm's BETA coefficient; a company-specific measure of heterogeneous expectations, NONMARKET; the NUMBER of analysts making earnings forecasts for the company; the number of common shares to be issued for the acquisition of another firm, MERGER; and a dummy variable for the existence of OPTIONs. They report the results listed in Table 24.2 for a model in which the variance is assumed to be of the form σi² = exp(xi′α). The values in parentheses are the ratio of the coefficient to the estimated asymptotic standard error.
The effect of heteroscedasticity on the estimates is extremely large. We do note, however, a common misconception in the literature. The change in the coefficients is often misleading. The marginal effects in the heteroscedasticity model will generally be very similar to those computed from the model which assumes homoscedasticity. (The calculation is pursued in the exercises.) A test of the hypothesis that α = 0 (except for the constant term) can be based on the likelihood ratio statistic. For these results, the statistic is −2[−547.3 − (−466.27)] = 162.06. This statistic has a limiting chi-squared distribution with five degrees of freedom. The sample value exceeds the critical value in the table of 11.07, so the hypothesis can be rejected.
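The likelihood ratio computation in the example is a one-liner; for completeness, with the critical value taken from scipy rather than a table:

```python
from scipy import stats

lnL_homo, lnL_hetero = -547.30, -466.27
lr = -2 * (lnL_homo - lnL_hetero)
print(lr, stats.chi2.ppf(0.95, df=5))   # 162.06 versus 11.07: reject
```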

In the preceding example, we carried out a likelihood ratio test against the hypothesis of homoscedasticity. It would be desirable to be able to carry out the test without having to estimate the unrestricted model. A Lagrange multiplier test can be used for that purpose. Consider the heteroscedastic tobit model in which we specify that

σi² = σ²[exp(wi′α)]².   (24-15)

This model is a fairly general specification that includes many familiar ones as special cases. The null hypothesis of homoscedasticity is α = 0. (We used this specification in the probit model in Section 23.4.4.b and in the linear regression model in Section 16.9.2.a.) Using the BHHH estimator of the Hessian as usual, we can produce


a Lagrange multiplier statistic as follows: Let zi = 1 if yi is positive and 0 otherwise,

ai = zi(εi/σ²) + (1 − zi)(−1)λi/σ,
bi = zi(εi²/σ² − 1)/(2σ²) + (1 − zi)(xi′β)λi/(2σ³),   (24-16)
λi = φ(xi′β/σ)/[1 − Φ(xi′β/σ)].

The data vector is gi′ = [ai xi′, bi, bi wi′]. The sums are taken over all observations, and all functions involving unknown parameters (εi, φi, Φi, xi′β, σ, λi) are evaluated at the restricted (homoscedastic) maximum likelihood estimates. Then,

LM = i′G[G′G]⁻¹G′i = nR²   (24-17)

in the regression of a column of ones on the K + 1 + P derivatives of the log-likelihood function for the model with multiplicative heteroscedasticity, evaluated at the estimates from the restricted model. (If there were no limit observations, then it would reduce to the Breusch–Pagan statistic discussed in Section 8.5.2.) Given the maximum likelihood estimates of the tobit model coefficients, it is quite simple to compute. The statistic has a limiting chi-squared distribution with degrees of freedom equal to the number of variables in wi.

24.3.4.b Misspecification of Prob[y* < 0]
In an early study in this literature, Cragg (1971) proposed a somewhat more general model in which the probability of a limit observation is independent of the regression model for the nonlimit data. One can imagine, for instance, the decision of whether or not to purchase a car as being different from the decision of how much to spend on the car, having decided to buy one. A related problem raised by Fin and Schmidt (1984) is that in the tobit model, a variable that increases the probability of an observation being a nonlimit observation also increases the mean of the variable. They cite as an example loss due to fire in buildings. Older buildings might be more likely to have fires, so that ∂Prob[yi > 0]/∂agei > 0, but, because of the greater value of newer buildings, older ones incur smaller losses when they do have fires, so that ∂E[yi | yi > 0]/∂agei < 0. This fact would require the coefficient on age to have different signs in the two functions, which is impossible in the tobit model because they are the same coefficient.
A more general model that accommodates these objections is as follows:
1. Decision equation:
   Prob[yi* > 0] = Φ(xi′γ),  zi = 1 if yi* > 0,
   Prob[yi* ≤ 0] = 1 − Φ(xi′γ),  zi = 0 if yi* ≤ 0.   (24-18)
2. Regression equation for nonlimit observations:
   E[yi | zi = 1] = xi′β + σλi,
according to Theorem 24.2. This model is a combination of the truncated regression model of Section 24.2 and the univariate probit model of Section 23.3, which suggests a method of analyzing it. The tobit model of this section arises if γ = β/σ. The parameters of the regression


equation can be estimated independently using the truncated regression model of Section 24.2. A recent application is Melenberg and van Soest (1996).
Fin and Schmidt (1984) considered testing the restriction of the tobit model. Based only on the tobit model, they devised a Lagrange multiplier statistic that, although a bit cumbersome algebraically, can be computed without great difficulty. If one is able to estimate the truncated regression model, the tobit model, and the probit model separately, then there is a simpler way to test the hypothesis. The tobit log-likelihood is the sum of the log-likelihoods for the truncated regression and probit models. [To show this result, add and subtract Σ_{yi=1} ln Φ(xi′β/σ) in (24-13). This produces the log-likelihood for the truncated regression model plus (23-20) for the probit model.]14 Therefore, a likelihood ratio statistic can be computed using

λ = −2[ln LT − (ln LP + ln LTR)],

where

LT = likelihood for the tobit model in (24-13), with the same coefficients,
LP = likelihood for the probit model in (23-20), fit separately,
LTR = likelihood for the truncated regression model, fit separately.

24.3.4.c Corner Solutions
Many of the applications of the tobit model in the received literature are constructed not to accommodate censoring of the underlying data, but, rather, to model the appearance of a large cluster of zeros. The Cragg model of the previous section is clearly related to this phenomenon. Consider, for example, survey data on purchases of consumer durables, firm expenditure on research and development, or consumer savings. In each case, the observed data will consist of zero or some positive amount. Arguably, there are two decisions at work in these scenarios, first whether to engage in the activity or not, and second, given that the answer to the first question is yes, how intensively to engage in it—how much to spend, for example. This is precisely the motivation behind the model in the previous section. [This specification has been labeled a “corner solution model”; see Wooldridge (2002a, pp. 518–519).] An application appears in Section 25.5.2.
In practical terms, the difference between this and the tobit model should be evident in the data. Often overlooked in tobit analyses is that the model predicts not only a cluster of zeros (or limit observations), but also a grouping of observations near zero (or the limit point). For example, the tobit model is surely misspecified for the sort of (hypothetical) spending data shown in Figure 24.3 for a sample of 1,000 observations. Neglecting for the moment the earlier point about the underlying decision process, Figure 24.4 shows the characteristic appearance of a (substantively) censored variable. The implication for the model builder is that, as in Section 24.3.4.b, an appropriate specification would consist of two equations, one for the “participation decision,” and one for the distribution of the positive dependent variable. Formally, we might, continuing the development of Cragg's specification, model the first decision with a binary choice (e.g., probit or logit model) as in Sections 23.2–23.4. The second equation is a model for y | y > 0, for which the truncated regression model of Section 24.2.3 is a natural candidate. As we will see, this is essentially the model behind the sample selection treatment developed in Section 24.5.

14The likelihood function for the truncated regression model is considered in the exercises.

FIGURE 24.3  Hypothetical Spending Data. (Frequency histogram of spending.)

Two practical issues frequently intervene at this point. First, one might well have a model in mind for the intensity (regression) equation, but none for the participation equation. This is the usual backdrop for the uses of the tobit model, which produces the considerations in the previous section. The second issue concerns the appropriateness of the truncation or censoring model to data such as those in Figure 24.3. If we consider only

FIGURE 24.4  Hypothetical Censored Data. (Frequency histogram of spending.)


the nonlimit observations in Figure 24.3, the underlying distribution does not appear to be truncated at all. The truncated regression model in Section 24.2.3 fit to these data will not depart significantly from ordinary least squares [because the underlying probability in the denominator of (24-6) will equal one and the numerator will equal zero]. But this is not the case for a tobit model forced on these same data. Forcing the model in (24-13) on data such as these will significantly distort the estimator—all else equal, it will significantly attenuate the coefficients, the more so the larger is the proportion of limit observations in the sample. Once again, this stands as a caveat for the model builder. The tobit model is manifestly misspecified for data such as those in Figure 24.3.

24.3.4.d Nonnormality
Nonnormality is an especially difficult problem in this setting. It has been shown that if the underlying disturbances are not normally distributed, then the estimator based on (24-13) is inconsistent. Research is ongoing both on alternative estimators and on methods for testing for this type of misspecification.15 One approach to the estimation is to use an alternative distribution. Kalbfleisch and Prentice (2002) present a unifying treatment that includes several distributions such as the exponential, lognormal, and Weibull. (Their primary focus is on survival analysis in a medical statistics setting, which is an interesting convergence of the techniques in very different disciplines.) Of course, assuming some other specific distribution does not necessarily solve the problem and may make it worse. A preferable alternative would be to devise an estimator that is robust to changes in the distribution. Powell's (1981, 1984) least absolute deviations (LAD) estimator appears to offer some promise.16 The main drawback to its use is its computational complexity. An extensive application of the LAD estimator is Melenberg and van Soest (1996). Although estimation in the nonnormal case is relatively difficult, testing for this failure of the model is worthwhile to assess the estimates obtained by the conventional methods. Among the tests that have been developed are Hausman tests, Lagrange multiplier tests [Bera and Jarque (1981, 1982), Bera, Jarque, and Lee (1982)], and conditional moment tests [Nelson (1981)]. The conditional moment tests are described in the next section.
To employ a Hausman test, we require an estimator that is consistent and efficient under the null hypothesis but inconsistent under the alternative—the tobit estimator with normality—and an estimator that is consistent under both hypotheses but inefficient under the null hypothesis. Thus, we will require a robust estimator of β, which restores the difficulties of the previous paragraph. Recent applications [e.g., Melenberg and van Soest (1996)] have used the Hausman test to compare the tobit/normal estimator with Powell's consistent, but inefficient (robust), LAD estimator. Another approach to testing is to embed the normal distribution in some other distribution and then use an LM test for the normal specification. Chesher and Irish (1987) have devised an LM test of normality in the tobit model based on generalized residuals. In many models, including the tobit model, the generalized residuals can be computed as the derivatives

15See Duncan (1983, 1986b), Goldberger (1983), Pagan and Vella (1989), Lee (1996), and Fernandez (1986). We will examine one of the tests more closely in the following section.
16See Duncan (1986a, b) for a symposium on the subject and Amemiya (1984). Additional references are Newey, Powell, and Walker (1990); Lee (1996); and Robinson (1988).


of the log-densities with respect to the constant term, so

ei = (1/σ²)[zi(yi − xi′β) − (1 − zi)σλi],

where zi is defined in (24-18) and λi is defined in (24-16). This residual is an estimate of εi that accounts for the censoring in the distribution. By construction, E[ei | xi] = 0, and if the model actually does contain a constant term, then Σ(i=1 to n) ei = 0; this is the first of the necessary conditions for the MLE. The test is then carried out by regressing a column of ones on di, where the elements of di are computed as follows:

ai = (yi − xi′β̂)/σ̂,
α̂i = −xi′β̂/σ̂, λ̂i = φ(α̂i)/Φ(α̂i),
ei1 = −(1 − zi)λ̂i + zi ai,
ei2 = −(1 − zi)α̂i λ̂i + zi(ai² − 1),
ei3 = −(1 − zi)(2 + α̂i²)λ̂i + zi ai³,
ei4 = −(1 − zi)(3α̂i + α̂i³)λ̂i + zi(ai⁴ − 3),
di′ = [ei1 xi′, ei2, ei3, ei4].

Note that the first K + 1 variables in di are the derivatives of the tobit log-likelihood. Let D be the n × (K + 3) matrix with ith row equal to di′. Then D = [G, M], where the K + 1 columns of G are the derivatives of the tobit log-likelihood and the two columns in M are the last two variables in di. Then the chi-squared statistic is nR²; that is,

LM = i′D(D′D)⁻¹D′i.

The necessary conditions that define the MLE are i′G = 0, so the first K + 1 elements of i′D are zero. Using (A-74), then, the LM statistic becomes

LM = i′M[M′M − M′G(G′G)⁻¹G′M]⁻¹M′i,

which is a chi-squared statistic with two degrees of freedom. Note the similarity to (24-17), where a test for homoscedasticity is carried out by the same method. As emerges so often in this framework, the test of the distribution actually focuses on the skewness and kurtosis of the residuals. Pagan and Vella (1989) and Ruud (1984) have developed similar tests for several specification errors.17
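The statistic is simple to assemble in practice. The following is a minimal sketch in Python, assuming the tobit MLEs are already in hand (beta_hat, sigma_hat, and the function name are this sketch's own inventions):

import numpy as np
from scipy.stats import norm

def tobit_lm_normality(y, X, beta_hat, sigma_hat):
    # y: (n,) dependent variable censored at zero; X: (n, K) with a constant
    # beta_hat, sigma_hat: tobit maximum likelihood estimates
    z = (y > 0).astype(float)                 # 1 for nonlimit observations
    a = (y - X @ beta_hat) / sigma_hat        # standardized residuals
    alpha = -(X @ beta_hat) / sigma_hat
    lam = norm.pdf(alpha) / norm.cdf(alpha)   # lambda_i = phi/Phi
    e1 = -(1 - z) * lam + z * a
    e2 = -(1 - z) * alpha * lam + z * (a**2 - 1)
    e3 = -(1 - z) * (2 + alpha**2) * lam + z * a**3
    e4 = -(1 - z) * (3 * alpha + alpha**3) * lam + z * (a**4 - 3)
    D = np.column_stack([e1[:, None] * X, e2, e3, e4])
    Di = D.sum(axis=0)                        # D'i, with i a column of ones
    return Di @ np.linalg.solve(D.T @ D, Di)  # LM = i'D(D'D)^-1 D'i, chi-squared(2)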

24.4 PANEL DATA APPLICATIONS

Extensions of the familiar panel data results to the tobit model parallel those for the probit model, with the attendant problems. The random effects or random parameters models discussed in Chapter 23 can be adapted to the censored regression model using simulation or quadrature. The same reservations with respect to the orthogonality of the effects and the regressors will apply here, as will the applicability of the Mundlak (1978) correction to accommodate it.

17Developing specification tests for the tobit model has been a popular enterprise. A sampling of the received literature includes Nelson (1981); Bera, Jarque, and Lee (1982); Chesher and Irish (1987); Chesher, Lancaster, and Irish (1985); Gourieroux et al. (1984, 1987); Newey (1986); Rivers and Vuong (1988); Horowitz and Neumann (1989); and Pagan and Vella (1989). Newey (1985a, b) are useful references on the general subject of conditional moment testing. More general treatments of specification testing are Godfrey (1988) and Ruud (1984).


Most of the attention in the theoretical literature on panel data methods for the tobit model has been focused on fixed effects. The departure point would be the maximum likelihood estimator for the static fixed effects model,

yit* = αi + xit′β + εit, εit ∼ N[0, σ²],

yit = Max(0, yit*).

However, there are no firm theoretical results on the behavior of the MLE in this model. Intuition might suggest, based on the findings for the binary probit model, that the MLE would be biased in the same fashion, away from zero. Perhaps surprisingly, the results in Greene (2004) persistently found that not to be the case in a variety of model specifications. Rather, the incidental parameters problem, such as it is, manifests itself in a downward bias in the estimator of σ, not an upward (or downward) bias in the MLE of β. However, this is less surprising when the tobit estimator is juxtaposed with the MLE in the linear regression model with fixed effects. In that model, the MLE of β is the within-groups (LSDV) estimator, which is unbiased and consistent. But the ML estimator of the disturbance variance in the linear regression model is eLSDV′eLSDV/(nT), which is biased downward by a factor of (T − 1)/T. [This is the result found in the original source on the incidental parameters problem, Neyman and Scott (1948).] So, what evidence there is suggests that unconditional estimation of the tobit model behaves essentially like that for the linear regression model. That does not settle the problem, however; if the evidence is correct, then it implies that although consistent estimation of β is possible, appropriate statistical inference is not. The bias in the estimation of σ shows up in any estimator of the asymptotic covariance of the MLE of β.
Unfortunately, there is no conditional estimator of β for the tobit (or truncated regression) model. First differencing or taking group mean deviations does not preserve the model. Because the latent variable is censored before observation, these transformations are not meaningful. Some progress has been made on theoretical, semiparametric estimators for this model. See, for example, Honoré and Kyriazidou (2000) for a survey. Much of the theoretical development has also been directed at dynamic models, where the benign result of the previous paragraph (such as it is) is lost once again. Arellano (2001) contains some general results. Hahn and Kuersteiner (2004) have characterized the bias of the MLE and suggested methods of reducing the bias of the estimators in dynamic binary choice and censored regression models.
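The Neyman and Scott result for the linear fixed effects model is easy to verify by simulation. A minimal sketch, with invented data-generating values, showing the ML variance estimator converging to σ²(T − 1)/T rather than σ²:

import numpy as np

rng = np.random.default_rng(1)
n, T, reps, sigma2 = 500, 5, 200, 1.0
est = []
for _ in range(reps):
    alpha = rng.normal(size=(n, 1))                  # fixed effects
    x = rng.normal(size=(n, T))
    y = alpha + 1.0 * x + np.sqrt(sigma2) * rng.normal(size=(n, T))
    xd = x - x.mean(axis=1, keepdims=True)           # within transformation
    yd = y - y.mean(axis=1, keepdims=True)
    b = (xd * yd).sum() / (xd**2).sum()              # LSDV slope estimator
    e = yd - b * xd
    est.append((e**2).sum() / (n * T))               # ML estimator e'e/(nT)
print("mean of ML variance estimates:", np.mean(est))
print("sigma^2 (T - 1)/T =", sigma2 * (T - 1) / T)   # the Neyman-Scott bias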

24.5 SAMPLE SELECTION

The topic of sample selection, or incidental truncation, has been the subject of an enormous recent literature, both theoretical and applied.18 This analysis combines both of the previous topics.

18A large proportion of the analysis in this framework has been in the area of labor economics. See, for example, Vella (1998), which is an extensive survey for practitioners. The results, however, have been applied in many other fields, including, for example, long series of stock market returns by financial economists (“survivorship bias”) and medical treatment and response in long-term studies by clinical researchers (“attrition bias”). The four surveys noted in the introduction to this chapter provide fairly extensive, although far from exhaustive, lists of the studies. Some studies that comment on methodological issues are Heckman (1990), Manski (1989, 1990, 1992), and Newey, Powell, and Walker (1990).


Example 24.6 Incidental Truncation

In the high-income survey discussed in Example 24.2, respondents were also included in the survey if their net worth, not including their homes, was at least $500,000. Suppose that the survey of incomes was based only on people whose net worth was at least $500,000. This selection is a form of truncation, but not quite the same as in Section 24.2. This selection criterion does not necessarily exclude individuals whose incomes at the time might be quite low. Still, one would expect that, on average, individuals with a high net worth would have a high income as well. Thus, the average income in this subpopulation would in all likelihood also be misleading as an indication of the income of the typical American. The data in such a survey would be nonrandomly selected, or incidentally truncated.

Econometric studies of nonrandom sampling have analyzed the deleterious effects of sample selection on the properties of conventional estimators such as least squares; have produced a variety of alternative estimation techniques; and, in the process, have yielded a rich crop of empirical models. In some cases, the analysis has led to a reinterpretation of earlier results.

24.5.1 INCIDENTAL TRUNCATION IN A BIVARIATE DISTRIBUTION

Suppose that y and z have a bivariate distribution with correlation ρ. We are interested in the distribution of y given that z exceeds a particular value. Intuition suggests that if y and z are positively correlated, then the truncation of z should push the distribution of y to the right. As before, we are interested in (1) the form of the incidentally truncated distribution and (2) the mean and variance of the incidentally truncated random variable. Because it has dominated the empirical literature, we will focus first on the bivariate normal distribution.19 The truncated joint density of y and z is

f(y, z | z > a) = f(y, z)/Prob(z > a).

To obtain the incidentally truncated marginal density for y, we would then integrate z out of this expression. The moments of the incidentally truncated normal distribution are given in Theorem 24.5.20

THEOREM 24.5 Moments of the Incidentally Truncated Bivariate Normal Distribution

If y and z have a bivariate normal distribution with means μy and μz, standard deviations σy and σz, and correlation ρ, then

E[y | z > a] = μy + ρσyλ(αz),  (24-19)
Var[y | z > a] = σy²[1 − ρ²δ(αz)],

where

αz = (a − μz)/σz, λ(αz) = φ(αz)/[1 − Φ(αz)], and δ(αz) = λ(αz)[λ(αz) − αz].

19We will reconsider the issue of the normality assumption in Section 24.5.5.
20Much more general forms of the result that apply to multivariate distributions are given in Johnson and Kotz (1974). See also Maddala (1983, pp. 266–267).


Note that the expressions involving z are analogous to the moments of the truncated distribution of x given in Theorem 24.2. If the truncation is z < a, then we make the replacement λ(αz) = −φ(αz)/Φ(αz). As expected, the truncated mean is pushed in the direction of the correlation if the truncation is from below and in the opposite direction if it is from above. In addition, the incidental truncation reduces the variance, because both δ(α) and ρ² are between zero and one.
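The moments in Theorem 24.5 are straightforward to verify by simulation. A minimal sketch, with invented parameter values, comparing simulated and theoretical moments:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu_y, mu_z, s_y, s_z, rho, a = 1.0, 0.0, 2.0, 1.0, 0.5, 0.5
u = rng.normal(size=1_000_000)                # standard normal underlying z
y = mu_y + s_y * (rho * u + np.sqrt(1 - rho**2) * rng.normal(size=u.size))
z = mu_z + s_z * u                            # corr(y, z) = rho by construction
keep = z > a                                  # incidental truncation of y
alpha_z = (a - mu_z) / s_z
lam = norm.pdf(alpha_z) / (1 - norm.cdf(alpha_z))
delta = lam * (lam - alpha_z)
print(y[keep].mean(), "vs", mu_y + rho * s_y * lam)         # mean in (24-19)
print(y[keep].var(), "vs", s_y**2 * (1 - rho**2 * delta))   # variance in (24-19)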

24.5.2 REGRESSION IN A MODEL OF SELECTION

To motivate a regression model that corresponds to the results in Theorem 24.5, we consider the following example.

Example 24.7 A Model of Labor Supply

A simple model of female labor supply that has been examined in many studies consists of two equations:21

1. Wage equation. The difference between a person's market wage, what she could command in the labor market, and her reservation wage, the wage rate necessary to make her choose to participate in the labor market, is a function of characteristics such as age and education as well as, for example, number of children and where a person lives.
2. Hours equation. The desired number of labor hours supplied depends on the wage, home characteristics such as whether there are small children present, marital status, and so on.

The problem of truncation surfaces when we consider that the second equation describes desired hours, but an actual figure is observed only if the individual is working. (In most such studies, only a participation equation, that is, whether hours are positive or zero, is observable.) We infer from this that the market wage exceeds the reservation wage. Thus, the hours variable in the second equation is incidentally truncated.

To put the preceding examples in a general framework, let the equation that determines the sample selection be

zi* = wi′γ + ui,

and let the equation of primary interest be

yi = xi′β + εi.

The sample rule is that yi is observed only when zi* is greater than zero. Suppose as well that εi and ui have a bivariate normal distribution with zero means and correlation ρ. Then we may insert these in Theorem 24.5 to obtain the model that applies to the observations in our sample:

E[yi | yi is observed] = E[yi | zi* > 0]
  = E[yi | ui > −wi′γ]
  = xi′β + E[εi | ui > −wi′γ]
  = xi′β + ρσε λi(αu)
  = xi′β + βλ λi(αu),

21See, for example, Heckman (1976). This strand of literature begins with an exchange by Gronau (1974) and Lewis (1974).


where αu = −wi′γ/σu and λ(αu) = φ(wi′γ/σu)/Φ(wi′γ/σu). So,

yi | zi* > 0 = E[yi | zi* > 0] + vi
  = xi′β + βλ λi(αu) + vi.

Least squares regression using the observed data (for instance, OLS regression of hours on its determinants, using only data for women who are working) produces inconsistent estimates of β. Once again, we can view the problem as an omitted variable. Least squares regression of y on x and λ would be a consistent estimator, but if λ is omitted, then the specification error of an omitted variable is committed. Finally, note that the second part of Theorem 24.5 implies that even if λi were observed, then least squares would be inefficient. The disturbance vi is heteroscedastic.
The marginal effect of the regressors on yi in the observed sample consists of two components. There is the direct effect on the mean of yi, which is β. In addition, for a particular independent variable, if it appears in the probability that zi* is positive, then it will influence yi through its presence in λi. The full effect of changes in a regressor that appears in both xi and wi on y is

∂E[yi | zi* > 0]/∂xik = βk − γk(ρσε/σu)δi(αu),

where22

δi = λi² − αiλi.

Suppose that ρ is positive and E[yi] is greater when zi* is positive than when it is negative. Because 0 < δi < 1, the additional term serves to reduce the marginal effect. The change in the probability affects the mean of yi in that the mean in the group zi* > 0 is higher. The second term in the derivative compensates for this effect, leaving only the marginal effect of a change given that zi* > 0 to begin with. Consider Example 24.9, and suppose that education affects both the probability of migration and the income in either state. If we suppose that the income of migrants is higher than that of otherwise identical people who do not migrate, then the marginal effect of education has two parts, one due to its influence in increasing the probability of the individual's entering a higher-income group and one due to its influence on income within the group. As such, the coefficient on education in the regression overstates the marginal effect of the education of migrants and understates it for nonmigrants. The sizes of the various parts depend on the setting. It is quite possible that the magnitude, sign, and statistical significance of the effect might all be different from those of the estimate of β, a point that appears frequently to be overlooked in empirical studies.
In most cases, the selection variable z* is not observed. Rather, we observe only its sign. To consider our two examples, we typically observe only whether a woman is working or not working or whether an individual migrated or not. We can infer the sign of z*, but not its magnitude, from such information. Because there is no information on the scale of z*, the disturbance variance in the selection equation cannot be estimated. (We encountered this problem in Chapter 23 in connection with the probit model.)

22We have reversed the sign of αu in (24-19) because a = 0, and αu = wi′γ/σu is somewhat more convenient. Also, as such, ∂λ/∂α = −δ.


Thus, we reformulate the model as follows:

selection mechanism: zi* = wi′γ + ui, zi = 1 if zi* > 0 and 0 otherwise;
  Prob(zi = 1 | wi) = Φ(wi′γ) and Prob(zi = 0 | wi) = 1 − Φ(wi′γ).  (24-20)
regression model: yi = xi′β + εi observed only if zi = 1,

(ui, εi) ∼ bivariate normal [0, 0, 1, σε, ρ].

Suppose that, as in many of these studies, zi and wi are observed for a random sample of individuals but yi is observed only when zi = 1. This model is precisely the one we examined earlier, with

E[yi | zi = 1, xi, wi] = xi′β + ρσε λ(wi′γ).

24.5.3 ESTIMATION

The parameters of the sample selection model can be estimated by maximum likelihood.23 However, Heckman's (1979) two-step estimation procedure is usually used instead. Heckman's method is as follows:24

1. Estimate the probit equation by maximum likelihood to obtain estimates of γ. For each observation in the selected sample, compute λ̂i = φ(wi′γ̂)/Φ(wi′γ̂) and δ̂i = λ̂i(λ̂i + wi′γ̂).
2. Estimate β and βλ = ρσε by least squares regression of y on x and λ̂.

It is possible also to construct consistent estimators of the individual parameters ρ and σε. At each observation, the true conditional variance of the disturbance would be

σi² = σε²(1 − ρ²δi).

The average conditional variance for the sample would converge to

plim (1/n) Σ(i=1 to n) σi² = σε²(1 − ρ²δ̄),

which is what is estimated by the least squares residual variance e′e/n. For the square of the coefficient on λ, we have

plim bλ² = ρ²σε²,

whereas based on the probit results we have

plim (1/n) Σ(i=1 to n) δ̂i = δ̄.

We can then obtain a consistent estimator of σε² using

σ̂ε² = e′e/n + δ̄̂ bλ²,

where δ̄̂ = (1/n) Σ(i=1 to n) δ̂i.

23See Greene (1995a).
24Perhaps in a mimicry of the “tobit” estimator described earlier, this procedure has come to be known as the “Heckit” estimator.


Finally, an estimator of ρ² is

ρ̂² = bλ²/σ̂ε²,

which provides a complete set of estimators of the model's parameters.25
To test hypotheses, an estimate of the asymptotic covariance matrix of [b′, bλ] is needed. We have two problems to contend with. First, we can see in Theorem 24.5 that the disturbance term in

(yi | zi = 1, xi, wi) = xi′β + ρσελi + vi  (24-21)

is heteroscedastic;

Var[vi | zi = 1, xi, wi] = σε²(1 − ρ²δi).

Second, there are unknown parameters in λi. Suppose that we assume for the moment that λi and δi are known (i.e., we do not have to estimate γ). For convenience, let xi* = [xi′, λi]′, and let b* be the least squares coefficient vector in the regression of y on x* in the selected data. Then, using the appropriate form of the variance of ordinary least squares in a heteroscedastic model from Chapter 8, we would have to estimate

Var[b*] = σε²[X*′X*]⁻¹ [Σ(i=1 to n) (1 − ρ²δi)xi*xi*′] [X*′X*]⁻¹
  = σε²[X*′X*]⁻¹ [X*′(I − ρ²Δ)X*] [X*′X*]⁻¹,

where Δ is a diagonal matrix with δi on the diagonal, so I − ρ²Δ has (1 − ρ²δi) on the diagonal. Without any other complications, this result could be computed fairly easily using X, the sample estimates of σε² and ρ², and the assumed known values of λi and δi. The parameters in γ do have to be estimated using the probit equation. Rewrite (24-21) as

(yi | zi = 1, xi, wi) = xi′β + βλλ̂i + vi − βλ(λ̂i − λi).

In this form, we see that in the preceding expression we have ignored both an additional source of variation in the compound disturbance and correlation across observations; the same estimate of γ is used to compute λ̂i for every observation. Heckman has shown that the earlier covariance matrix can be appropriately corrected by adding a term inside the brackets,

Q = ρ̂²(X*′Δ̂W) Est. Asy. Var[γ̂] (W′Δ̂X*) = ρ̂²F̂V̂F̂′,

where V̂ = Est. Asy. Var[γ̂], the estimator of the asymptotic covariance of the probit coefficients. Any of the estimators in (23-22) to (23-24) may be used to compute V̂. The complete expression is26

Est. Asy. Var[b, bλ] = σ̂ε²[X*′X*]⁻¹ [X*′(I − ρ̂²Δ̂)X* + Q] [X*′X*]⁻¹.
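As a concrete illustration, the following is a minimal sketch of the two-step procedure and the moment-based recovery of σε and ρ; it omits the corrected covariance matrix above, statsmodels supplies the probit step, and the function and variable names are this sketch's own:

import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

def heckman_two_step(y, X, z, W):
    # z: selection indicator; y is used only where z == 1
    # X, W: regressor matrices, each including a constant column
    gamma = sm.Probit(z, W).fit(disp=0).params     # step 1: probit MLE
    wg = W @ gamma
    lam = norm.pdf(wg) / norm.cdf(wg)              # inverse Mills ratio
    delta = lam * (lam + wg)
    sel = z == 1
    Xs = np.column_stack([X[sel], lam[sel]])       # step 2: OLS of y on [x, lambda]
    b, *_ = np.linalg.lstsq(Xs, y[sel], rcond=None)
    e = y[sel] - Xs @ b
    b_lam = b[-1]                                  # estimates rho * sigma_e
    sigma2 = e @ e / sel.sum() + delta[sel].mean() * b_lam**2
    rho = b_lam / np.sqrt(sigma2)                  # need not lie in [-1, 1] (fn. 25)
    return b, np.sqrt(sigma2), rho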

25Note that ρ̂² is not a sample correlation and, as such, is not limited to [0, 1]. See Greene (1981) for discussion.
26This matrix formulation is derived in Greene (1981). Note that the Murphy and Topel (1985) results for two-step estimators given in Theorem 11.3 would apply here as well. Asymptotically, this method would give the same answer. The Heckman formulation has become standard in the literature.


TABLE 24.3 Estimated Selection Corrected Wage Equation

          Two-Step               Maximum Likelihood     Least Squares
          Estimate   Std. Err.   Estimate   Std. Err.   Estimate    Std. Err.
β1        −0.971     (2.06)      −1.963     (1.684)     −2.56       (0.929)
β2         0.021     (0.0625)     0.0279    (0.0756)     0.0325     (0.0616)
β3         0.000137  (0.00188)   −0.0001    (0.00234)   −0.000260   (0.00184)
β4         0.417     (0.100)      0.457     (0.0964)     0.481      (0.0669)
β5         0.444     (0.316)      0.447     (0.427)      0.318      (0.449)
(ρσ)      −1.098     (1.266)
ρ         −0.343                 −0.132     (0.224)      0.000
σ          3.200                  3.108     (0.0837)     3.111

Example 24.8 Female Labor Supply

Examples 23.1 and 23.5 proposed a labor force participation model for a sample of 753 married women in a sample analyzed by Mroz (1987). The data set contains wage and hours information for the 428 women who participated in the formal market (LFP = 1). Following Mroz, we suppose that for these 428 individuals, the offered wage exceeded the reservation wage and, moreover, the unobserved effects in the two wage equations are correlated. As such, a wage equation based on the market data should account for the sample selection problem. We specify a simple wage model:

Wage = β1 + β2 Exper + β3 Exper² + β4 Education + β5 City + ε,

where Exper is labor market experience and City is a dummy variable indicating that the individual lived in a large urban area. Maximum likelihood, Heckman two-step, and ordinary least squares estimates of the wage equation are shown in Table 24.3. The maximum likelihood estimates are FIML estimates; the labor force participation equation is reestimated at the same time. Only the parameters of the wage equation are shown below. Note as well that the two-step estimator estimates the single coefficient on λi, and the structural parameters σ and ρ are deduced by the method of moments. The maximum likelihood estimator computes estimates of these parameters directly. [Details on maximum likelihood estimation may be found in Maddala (1983).]
The differences between the two-step and maximum likelihood estimates in Table 24.3 are surprisingly large. The difference is even more striking in the marginal effects. The effect for education is estimated as 0.417 + 0.0641 for the two-step estimator and 0.149 in total for the maximum likelihood estimates. For the kids variable, the marginal effect is −0.293 for the two-step estimates and only −0.0113 for the MLEs. Surprisingly, the direct test for a selection effect in the maximum likelihood estimates, a nonzero ρ, fails to reject the hypothesis that ρ equals zero.

In some settings, the selection process is a nonrandom sorting of individuals into two or more groups. The mover-stayer model in the next example is a familiar case.

Example 24.9 A Mover-Stayer Model for Migration

The model of migration analyzed by Nakosteen and Zimmer (1980) fits into the framework described in this section. The equations of the model are

net benefit of moving: Mi* = wi′γ + ui,
income if moves: Ii1 = xi1′β1 + εi1,
income if stays: Ii0 = xi0′β0 + εi0.

One component of the net benefit is the market wage individuals could achieve if they move, compared with what they could obtain if they stay. Therefore, among the determinants of


TABLE 24.4 Estimated Earnings Equations

                              Migrant            Nonmigrant
            Migration         Earnings           Earnings
Constant    −1.509             9.041              8.593
SE          −0.708 (−5.72)    −4.104 (−9.54)     −4.161 (−57.71)
EMP         −1.488 (−2.60)    —                  —
PCI          1.455 (3.14)     —                  —
Age         −0.008 (−5.29)    —                  —
Race        −0.065 (−1.17)    —                  —
Sex         −0.082 (−2.14)    —                  —
SIC          0.948 (24.15)    −0.790 (−2.24)     −0.927 (−9.35)
λ           —                  0.212 (0.50)       0.863 (2.84)

the net benefit are factors that also affect the income received in either place. An analysis of income in a sample of migrants must account for the incidental truncation of the mover's income on a positive net benefit. Likewise, the income of the stayer is incidentally truncated on a nonpositive net benefit. The model implies an income after moving for all observations, but we observe it only for those who actually do move. Nakosteen and Zimmer (1980) applied the selectivity model to a sample of 9,223 individuals with data for 2 years (1971 and 1973) sampled from the Social Security Administration's Continuous Work History Sample. Over the period, 1,078 individuals migrated and the remaining 8,145 did not. The independent variables in the migration equation were as follows:

SE = self-employment dummy variable; 1 if yes,
EMP = rate of growth of state employment,
PCI = growth of state per capita income,
x = age, race (nonwhite = 1), sex (female = 1),
SIC = 1 if individual changes industry.

The earnings equations included SIC and SE. The authors reported the results given in Table 24.4. The figures in parentheses are asymptotic t ratios.

24.5.4 REGRESSION ANALYSIS OF TREATMENT EFFECTS

The basic model of selectivity outlined earlier has been extended in an impressive variety of directions.27 An interesting application that has found wide use is the measurement of treatment effects and program effectiveness.28 An earnings equation that accounts for the value of a college education is

earningsi = xi′β + δCi + εi,

where Ci is a dummy variable indicating whether or not the individual attended college. The same format has been used in any number of other analyses of programs, experiments, and treatments. The question is: Does δ measure the value of a college education (assuming that the rest of the regression model is correctly specified)? The answer is

27For a survey, see Maddala (1983).
28This is one of the fundamental applications of this body of techniques, and is also the setting for the most longstanding and contentious debate on the subject. A Journal of Business and Economic Statistics symposium [Angrist (2001)] raised many of the important questions on whether and how it is possible to measure treatment effects.


no if the typical individual who chooses to go to college would have relatively high earnings whether or not he or she went to college. The problem is one of self-selection. If our observation is correct, then least squares estimates of δ will actually overestimate the treatment effect. The same observation applies to estimates of the treatment effects in other settings in which the individuals themselves decide whether or not they will receive the treatment. To put this in a more familiar context, suppose that we model program participation (e.g., whether or not the individual goes to college) as

Ci* = wi′γ + ui,
Ci = 1 if Ci* > 0, 0 otherwise.

We also suppose that, consistent with our previous conjecture, ui and εi are correlated. Coupled with our earnings equation, we find that

E[yi | Ci = 1, xi, zi] = xi′β + δ + E[εi | Ci = 1, xi, zi]  (24-22)
  = xi′β + δ + ρσελ(−wi′γ)

once again. [See (24-21).] Evidently, a viable strategy for estimating this model is to use the two-step estimator discussed earlier. The net result will be a different estimate of δ that will account for the self-selected nature of program participation. For nonparticipants, the counterpart to (24-22) is

E[yi | Ci = 0, xi, zi] = xi′β + ρσε [−φ(wi′γ)/(1 − Φ(wi′γ))].  (24-23)

The difference in expected earnings between participants and nonparticipants is, then,

E[yi | Ci = 1, xi, zi] − E[yi | Ci = 0, xi, zi] = δ + ρσε {φi/[Φi(1 − Φi)]}.  (24-24)

If the selectivity correction λi is omitted from the least squares regression, then this difference is what is estimated by the least squares coefficient on the treatment dummy variable. But because (by assumption) all terms are positive, we see that least squares overestimates the treatment effect. Note, finally, that simply estimating separate equations for participants and nonparticipants does not solve the problem. In fact, doing so would be equivalent to estimating the two regressions of Example 24.9 by least squares, which, as we have seen, would lead to inconsistent estimates of both sets of parameters. There are many variations of this model in the empirical literature. They have been applied to the analysis of education,29 the Head Start program,30 and a host of other settings.31 This strand of literature is particularly important because the use of dummy variable models to analyze treatment effects and program participation has a long history in empirical economics. This analysis has called into question the interpretation of a number of received studies.
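The size of the overstatement in (24-24) is easy to compute directly. A minimal sketch, with invented values of δ, ρ, σε, and one value of wi′γ:

from scipy.stats import norm

delta, rho, sigma_e = 0.5, 0.4, 1.0        # invented illustration values
wg = 0.3                                   # a value of w'gamma in the probit
Phi = norm.cdf(wg)
gap = delta + rho * sigma_e * norm.pdf(wg) / (Phi * (1 - Phi))   # (24-24)
print("true delta:", delta, " OLS coefficient estimates roughly:", gap)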

29Willis and Rosen (1979).
30Goldberger (1972).
31A useful summary of the issues is Barnow, Cain, and Goldberger (1981). See also Maddala (1983) for a long list of applications. A related application is the switching regression model. See, for example, Quandt (1982, 1988).


24.5.5 THE NORMALITY ASSUMPTION

Some research has cast skepticism on the selection model based on the normal distribution. [See Goldberger (1983) for an early salvo in this literature.] Among the findings are that the parameter estimates are surprisingly sensitive to the distributional assumption that underlies the model. Of course, this fact in itself does not invalidate the normality assumption, but it does call its generality into question. On the other hand, the received evidence is convincing that sample selection, in the abstract, raises serious problems, distributional questions aside. The literature [for example, Duncan (1986b), Manski (1989, 1990), and Heckman (1990)] has suggested some promising approaches based on robust and nonparametric estimators. These approaches obviously have the virtue of greater generality. Unfortunately, the cost is that they generally are quite limited in the breadth of the models they can accommodate. That is, one might gain the robustness of a nonparametric estimator at the cost of being unable to make use of the rich set of accompanying variables usually present in the panels to which selectivity models are often applied. For example, the nonparametric bounds approach of Manski (1990) is defined for two regressors. Other methods [e.g., Duncan (1986b)] allow more elaborate specifications. Recent research includes specific attempts to move away from the normality assumption.32 An example is Martins (2001), building on Newey (1991), which takes the core specification given in (24-20) as the platform but constructs an alternative to the assumption of bivariate normality. Martins's specification modifies the Heckman model by employing an equation of the form

E[yi | zi = 1, xi, wi] = xi′β + μ(wi′γ),

where the latter “selectivity correction” is not the inverse Mills ratio, but some other result from a different model. The correction term is estimated using the Klein and Spady model discussed in Section 23.6.1. This is labeled a “semiparametric” approach. Whether the conditional mean in the selected sample should even remain a linear index function remains to be settled. Not surprisingly, Martins's results, based on two-step least squares, differ only slightly from the conventional results based on normality. This approach is arguably only a fairly small step away from the tight parameterization of the Heckman model. Other non- and semiparametric specifications, e.g., Honoré and Kyriazidou (1997, 2000), represent more substantial departures from the normal model but are much less operational.33 The upshot is that the issue remains unsettled. For better or worse, the empirical literature on the subject continues to be dominated by Heckman's original model built around the joint normal distribution.

24.5.6 ESTIMATING THE EFFECT OF TREATMENT ON THE TREATED

Consider a regression approach to analyzing treatment effects in a two-period setting,

yit = θt + xit′β + γCi + ui + εit, t = 0, 1,

32Again, Angrist (2001) is an important contribution to this literature.
33This particular work considers selection in a “panel” (mainly two periods). But the panel data setting for sample selection models is more involved than a cross-section analysis. In a panel data set, the “selection” is likely to be a decision at the beginning of Period 1 to be in the data set for all subsequent periods. As such, something more intricate than the model we have considered here is called for.


where Ci is the treatment dummy variable and ui is the unobserved individual effect. The setting is the pre- and post-treatment analysis of the sort considered in this section, where we examine the impact of a job training program on post-training earnings. Because there are two periods, a natural approach to the analysis is to examine the changes,

Δyi = (θ1 − θ0) + γCi + (Δxi)′β + Δεi,

where Ci = 1 for the treated and 0 for the nontreated individuals, and the first differences eliminate the unobserved individual effects. In the absence of controls (regressors, xit), or assuming that the controls are unchanged, the estimator of the effect of the treatment will be

γ̂ = Δȳ | (Ci = 1) − Δȳ | (Ci = 0),

which is the difference in differences estimator. This simplifies the problem considerably, but has several shortcomings. Most important, by using the simple differences, we have lost our ability to discern what induced the change, whether it was the program or something else, presumably in xit.
Even without the normality assumption, the preceding regression approach is more tightly structured than many are comfortable with. A considerable amount of research has focused on what assumptions are needed to reach that model and whether they are likely to be appropriate in a given setting.34 The overall objective of the analysis of the preceding two sections is to evaluate the effect of a treatment, Ci, on the individual treated. The implicit counterfactual is an observation on what the “response” (dependent variable) of the treated individual would have been had they not been treated. But, of course, an individual will be in one state or the other, not both. Denote by y0 the random variable that is the outcome variable in the absence of the treatment and by y1 the outcome when the treatment has taken place. The average treatment effect, averaged over the entire population, is

ATE = E[y1 − y0].

This is the impact of the treatment on an individual drawn at random from the entire population. However, the desired quantity is not necessarily the ATE, but the average treatment effect on the treated, which would be

ATE |T = E[y1 − y0 | C = 1].

The difficulty of measuring this is, once again, the counterfactual, E[y0 | C = 1]. Whether these two measures will be the same is at the center of much of the discussion on this subject. If treatment is completely randomly assigned, then E[yj | C = 1] = E[yj | C = 0] = E[yj], j = 0, 1. This means that with completely random treatment assignment

ATE = E[y1 | C = 1] − E[y0 | C = 0].

To put this in our example, if college attendance were completely randomly distributed throughout the population, then the impact of college attendance on income (neglecting other covariates at this point) could be measured simply by averaging the incomes of

34A sampling of the more important parts of the literature on this issue includes Heckman (1992, 1997), Imbens and Angrist (1994), Manski (1996), and Wooldridge (2002a, Chapter 18).


college attendees and subtracting the average income of nonattendees. The preceding theory might work for the treatment “having brown eyes,” but it is unlikely to work for college attendance. Not only is the college attendance treatment not randomly distributed, but the treatment “assignment” is surely related to expectations about y1 versus y0 and, at a minimum, y0 itself. (College is expensive.) More generally, the researcher faces the difficulty in calculating treatment effects that assignment to the treatment might not be exogenous.
The control function approach that we used in (24-22)–(24-24) is used to account for the endogeneity of the treatment assignment in the regression context. The very specific assumptions of the bivariate normal distribution of the unobservables somewhat simplify the estimation, because they make explicit what control function (λi) is appropriate to use in the regression. As Wooldridge (2002a, p. 622) points out, however, the binary variable in the treatment effects regression represents simply an endogenous variable in a linear equation, amenable to instrumental variable estimation (assuming suitable instruments are available). Barnow, Cain, and Goldberger (1981) proposed a two-stage least squares estimator, with instrumental variable equal to the predicted probability from the probit treatment assignment model. This is slightly less parametric than (24-22)–(24-24) because, in principle, its validity does not rely on joint normality of the disturbances. [Wooldridge (2002a, pp. 621–633) discusses the underlying assumptions.]
If the treatment assignment is “completely ignorable,” then, as noted, estimation of the treatment effects is greatly simplified. Suppose, as well, that there are observable variables that influence both the outcome and the treatment assignment. Suppose it is possible to obtain pairs of individuals matched by a common xi, one with Ci = 0, the other with Ci = 1. If done with a sufficient number of pairs so as to average over the population of xi's, then a matching estimator, the average value of (yi | Ci = 1) − (yi | Ci = 0), would estimate E[y1 − y0], which is what we seek. Of course, it is optimistic to hope to find a large sample of such matched pairs, both because the sample overall is finite and because there may be many regressors, and the “cells” in the distribution of xi are likely to be thinly populated. This will be worse when the regressors are continuous, for example, with a “family income” variable. Rosenbaum and Rubin (1983)35 and others suggested, instead, matching on the propensity score, F(xi) = Prob(Ci = 1 | xi). Individuals with similar propensity scores are paired and the average treatment effect is then estimated by the differences in outcomes. Various strategies are suggested by the authors for obtaining the necessary subsamples and for verifying the conditions under which the procedures will be valid. [See, e.g., Becker and Ichino (2002) and Greene (2007c).]
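A minimal sketch of a nearest-neighbor propensity score matching estimator follows. Statsmodels supplies the logit step; the function name and interface are this sketch's own, and a production implementation would add common-support checks and balancing tests:

import numpy as np
import statsmodels.api as sm

def psm_att(y, d, X):
    # Propensity scores from a logit of the treatment dummy on the covariates
    p = sm.Logit(d, sm.add_constant(X)).fit(disp=0).predict()
    treated = np.flatnonzero(d == 1)
    controls = np.flatnonzero(d == 0)
    gaps = []
    for i in treated:                      # nearest neighbor, with replacement
        j = controls[np.argmin(np.abs(p[controls] - p[i]))]
        gaps.append(y[i] - y[j])
    return float(np.mean(gaps))            # estimated effect on the treated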

Example 24.10 Treatment Effects on Earnings

LaLonde (1986) analyzed the results of a labor market experiment, the National Supported Work Demonstration, in which a group of disadvantaged workers lacking basic job skills were given work experience and counseling in a sheltered environment. Qualified applicants were assigned to training positions randomly. The treatment group received the benefits of the program. Those in the control group “were left to fend for themselves.” [The demonstration was run in numerous cities in the mid-1970s. See LaLonde (1986, pp. 605–609) for institutional

35Other important references in this literature are Becker and Ichino (1999), Dehejia and Wahba (1999), LaLonde (1986), Heckman, Ichimura, and Todd (1997, 1998), Heckman, Ichimura, Smith, and Todd (1998), Heckman, LaLonde, and Smith (1999), Heckman, Tobias, and Vytlacil (2003), and Heckman and Vytlacil (2000).


details on the NSW experiments.] The training period was 1976–1977; the outcome of interest for the sample examined here was post-training 1978 earnings. LaLonde reports a large variety of estimates of the treatment effect, for different subgroups and using different estimation methods. Nonparametric estimates for the group in our sample are roughly $900 for the income increment in the post-training year. (See LaLonde, p. 609.) Similar results are reported from a two-step regression-based estimator similar to (24-22) to (24-24). (See LaLonde's footnote to Table 6, p. 616.)
LaLonde's data are fairly well traveled, having been used in replications and extensions in, e.g., Dehejia and Wahba (1999), Becker and Ichino (2002), and Greene (2007b, c). We have reestimated the matching estimates reported in Becker and Ichino. The data in the file used there (and here) contain 2,490 control observations and 185 treatment observations on the following variables:

t = treatment dummy variable,
age = age in years,
educ = education in years,
marr = dummy variable for married,
black = dummy variable for black,
hisp = dummy variable for Hispanic,
nodegree = dummy for no degree (not used),
re74 = real earnings in 1974,
re75 = real earnings in 1975,
re78 = real earnings in 1978.

Transformed variables added to the equation are

age2 = age squared,
educ2 = educ squared,
re742 = re74 squared,
re752 = re75 squared,
blacku74 = black times 1(re74 = 0).

We also scaled all earnings variables by 10,000 before beginning the analysis. (See Appendix Table F24.2. The data are downloaded from the website http://www.nber.org/%7Erdehejia/nswdata.html. The two specific subsamples are in http://www.nber.org/%7Erdehejia/psid_controls.txt and http://www.nber.org/%7Erdehejia/nswre74_treated.txt.) (We note that Becker and Ichino report they were unable to replicate Dehejia and Wahba's results, although they could come reasonably close. We, in turn, were not able to replicate either set of results, though we, likewise, obtained quite similar results.)
The analysis proceeded as follows: A logit model in which the included variables were a constant, age, age2, educ, educ2, marr, black, hisp, re74, re75, re742, re752, and blacku74 was computed for the treatment assignment. The fitted probabilities are used for the propensity scores. By means of an iterative search, the range of propensity scores was partitioned into 8 regions within which, by a simple F test, the mean scores of the treatments and controls were not statistically different. The partitioning is shown in Table 24.5. The 1,347 observations used are the 185 treated observations and the 1,162 control observations whose propensity scores fell within the range of the scores for the treated observations. Within each interval, each treated observation is paired with a small number of the nearest control observations. We found the average difference between treated observation and control to equal $1,574.35. Becker and Ichino reported $1,537.94.
As an experiment, we refit the propensity score equation using a probit model, retaining the fitted probabilities. We then used the two-step estimator described earlier to fit (24-22)


TABLE 24.5 Empirical Distribution of Propensity Scores

Sample size = 1,347; average score = 0.137238; std. dev. of scores = 0.274079.

Percentiles of the estimated scores:
Percent    Lower      Upper
0–5        0.000591   0.000783
5–10       0.000787   0.001061
10–15      0.001065   0.001377
15–20      0.001378   0.001748
20–25      0.001760   0.002321
25–30      0.002340   0.002956
30–35      0.002974   0.004057
35–40      0.004059   0.005272
40–45      0.005278   0.007486
45–50      0.007557   0.010451
50–55      0.010563   0.014643
55–60      0.014686   0.022462
60–65      0.022621   0.035060
65–70      0.035075   0.051415
70–75      0.051415   0.076188
75–80      0.076376   0.134189
80–85      0.134238   0.320638
85–90      0.321233   0.616002
90–95      0.624407   0.949418
95–100     0.949418   0.974835

Matching intervals:
Interval   Lower      Upper      # obs
1          0.000591   0.098016   1041
2          0.098016   0.195440   63
3          0.195440   0.390289   65
4          0.390289   0.585138   36
5          0.585138   0.779986   32
6          0.779986   0.877411   17
7          0.877411   0.926123   7
8          0.926123   0.974835   86

and (24-23) using the entire sample. The estimates of δ, ρ, and σ were (−1.01437, 0.35519, 1.38426). Using the results from the probit model, we averaged the result in (24-24) for the entire sample, obtaining an estimated treatment effect of $1,476.30.

24.5.7 SAMPLE SELECTION IN NONLINEAR MODELS

The preceding analysis has focused on an extension of the linear regression (or the estimation of simple averages of the data). The method of analysis changes in nonlinear models. To begin, it is not necessarily obvious what the impact of the sample selection is on the response variable, or how it can be accommodated in a model. Consider the model analyzed by Boyes, Hoffman, and Lowe (1989):

yi1 = 1 if individual i defaults on a loan, 0 otherwise,

yi2 = 1 if the individual is granted a loan, 0 otherwise.

Wynand and van Praag (1981) also used this framework to analyze consumer insurance purchases in the first application of the selection methodology in a nonlinear model. Greene (1992) applied the same model to y1 = default on credit card loans, in which yi2 denotes whether an application for the card was accepted or not. [Mohanty (2002) also used this model to analyze teen employment in California.] For a given individual, y1 is not observed unless yi2 = 1. Following the lead of the linear regression case in Section 24.5.3, a natural approach might seem to be to fit the second (selection) equation using a univariate probit model, compute the inverse Mills ratio, λi, and add it to the first equation as an additional “control” variable to accommodate the selection effect. [This is the approach used by Wynand and van Praag (1981).] The problems with this control function approach are, first, it is unclear what in the model is being “controlled” and, second, assuming the first model is correct, the appropriate model conditioned


on the sample selection is unlikely to contain an inverse Mills ratio anywhere in it. That result is specific to the linear model, where it arises as E[εi | selection]. What would seem to be the apparent counterpart for this probit model,

Prob(yi1 = 1 | selection on yi2 = 1) = Φ(xi1′β1 + θλi),

is not, in fact, the appropriate conditional mean, or probability. For this particular application, the appropriate conditional probability (extending the bivariate probit model of Section 23.8) would be

Prob[yi1 = 1 | yi2 = 1] = Φ2(xi1′β1, xi2′β2, ρ)/Φ(xi2′β2).

We would use this result to build up the likelihood function for the three observed outcomes, as follows. The three types of observations in the sample, with their unconditional probabilities, are

yi2 = 0: Prob(yi2 = 0 | xi1, xi2) = 1 − Φ(xi2′β2),
yi1 = 0, yi2 = 1: Prob(yi1 = 0, yi2 = 1 | xi1, xi2) = Φ2(−xi1′β1, xi2′β2, −ρ),  (24-25)
yi1 = 1, yi2 = 1: Prob(yi1 = 1, yi2 = 1 | xi1, xi2) = Φ2(xi1′β1, xi2′β2, ρ).

The log-likelihood function is based on these probabilities.36
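A minimal sketch of the log-likelihood built from (24-25) follows; it uses SciPy's bivariate normal CDF, keeps ρ in (−1, 1) through a tanh reparameterization, and all function and variable names are this sketch's own:

import numpy as np
from scipy.stats import multivariate_normal, norm

def bvn_cdf(a, b, rho):
    # Phi_2(a, b, rho), evaluated pointwise
    cov = [[1.0, rho], [rho, 1.0]]
    return np.array([multivariate_normal.cdf([ai, bi], mean=[0, 0], cov=cov)
                     for ai, bi in zip(a, b)])

def neg_loglik(theta, y1, y2, X1, X2):
    # Negative log-likelihood from the three probabilities in (24-25)
    k1, k2 = X1.shape[1], X2.shape[1]
    b1, b2 = theta[:k1], theta[k1:k1 + k2]
    rho = np.tanh(theta[-1])
    q1, q2 = X1 @ b1, X2 @ b2
    nonsel = y2 == 0
    ll0 = np.log(norm.cdf(-q2[nonsel])).sum()       # 1 - Phi(q2) = Phi(-q2)
    sel = y2 == 1
    p11 = bvn_cdf(q1[sel], q2[sel], rho)
    p01 = bvn_cdf(-q1[sel], q2[sel], -rho)
    ll1 = np.where(y1[sel] == 1, np.log(p11), np.log(p01)).sum()
    return -(ll0 + ll1)
# e.g., scipy.optimize.minimize(neg_loglik, theta0, args=(y1, y2, X1, X2))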

Example 24.11 Doctor Visits and Insurance

Continuing our analysis of the utilization of the German health care system, we observe that the data set (see Example 11.11) contains an indicator of whether the individual subscribes to the “Public” health insurance or not. Roughly 87 percent of the observations in the sample do. We might ask whether the selection on public insurance reveals any substantive difference in visits to the physician. We estimated a logit specification for this model in Example 23.4. Using (24-25) as the framework, we define yi2 to be presence of insurance and yi1 to be the binary variable defined to equal 1 if the individual makes at least one visit to the doctor in the survey year. The estimation results are given in Table 24.6. Based on these results, there does appear to be a very strong relationship. The coefficients do change somewhat in the conditional model. A Wald test for the presence of the selection effect against the null hypothesis that ρ equals zero produces a test statistic of (−7.188)² = 51.667, which is larger than the critical value of 3.84. Thus, the hypothesis is rejected. A likelihood ratio statistic is computed as the difference between the log-likelihood for the full model and the sum of the two separate log-likelihoods for the independent probit models when ρ equals zero. The result is

λLR = 2[−23969.58 − (−15536.39 + (−8471.508))] = 77.796.

The hypothesis is rejected once again. Partial effects were computed using the results in Section 23.8.3. The large correlation coefficient can be misleading. The estimated −0.9299 does not imply that the presence of insurance makes it much less likely to go to the doctor. This is the correlation among the unobserved factors in each equation. The factors that make it more likely to purchase insurance make it less likely to use a physician. To obtain a simple correlation between the two variables, we might use the tetrachoric correlation defined in Example 23.13. This would be computed by fitting a bivariate probit model for the two binary variables without any other variables. The estimated value is 0.120.

36Extensions of the bivariate probit model to other types of censoring are discussed in Poirier (1980) and Abowd and Farber (1982).


TABLE 24.6 Estimated Probit Equations for Doctor Visits

                 Independent: No Selection           Sample Selection Model
                           Standard   Partial                   Standard   Partial
Variable        Estimate   Error      Effect       Estimate     Error      Effect
Doctor visits equation
Constant         0.05588   0.06564                 −9.4366      0.06760
Age              0.01331   0.0008399  0.004971      0.01284     0.0008131  0.005042
Income          −0.1034    0.05089   −0.03860      −0.1030      0.04582   −0.04060
Kids            −0.1349    0.01947   −0.05059      −0.1264      0.01790   −0.04979
Education       −0.01920   0.004254  −0.007170      0.03660     0.004744   0.002703
Married          0.03586   0.02172    0.01343       0.03564     0.02016    0.01404
ln L           −15536.39
Insurance equation
Constant         3.3585    0.06959                  3.2699      0.06916
Age              0.0001868 0.0009744               −0.0002679   0.001036
Education       −0.1854    0.003941                −0.1807      0.003936
Female           0.1150    0.02186    0.0000a       0.2230      0.02101    0.01446a
ln L            −8471.508
ρ                0.0000    0.0000                  −0.9299      0.1294
ln L           −24007.90                           −23969.58

aIndirect effect from second equation.

More general cases are typically much less straightforward. Greene (1992, 2006) and Terza (1995, 1998) have developed sample selection models for nonlinear specifications based on the underlying logic of the Heckman model in Section 24.5.3, that the influence of the incidental truncation acts on the unobservable variables in the model. (That is the source of the “selection bias” in conventional estimators.) The modeling extension introduces the unobservables into the model in a natural fashion that parallels the regression model. The generic model will take the form

1. Probit selection equation:

zi* = wi′α + ui, in which ui ∼ N[0, 1],  (24-26)
zi = 1 if zi* > 0, 0 otherwise.

2. Nonlinear index function model with unobserved heterogeneity and sample selection:

μi | εi = xi′β + σεi, εi ∼ N[0, 1],
yi | xi, εi ∼ density g(yi | xi, εi) = f(yi | xi′β + σεi),  (24-27)

yi, xi are observed only when zi = 1,

[ui, εi] ∼ N2[(0, 0), (1, ρ, 1)].

For example, in a Poisson regression model, the conditional mean function becomes

E(yi | xi, εi) = λi = exp(xi′β + σεi) = exp(μi).

(We will use this specification of the model in Chapter 25 to introduce random effects in the Poisson regression model for panel data.)
The log-likelihood function for the full model is the joint density for the observed data. When zi equals one, (yi, xi, zi, wi) are all observed. To obtain the joint density


p(yi, zi = 1 | xi, wi), we proceed as follows:

p(yi, zi = 1 | xi, wi) = ∫(−∞ to ∞) p(yi, zi = 1 | xi, wi, εi) f(εi) dεi.

Conditioned on εi, zi and yi are independent. Therefore, the joint density is the product,

p(yi, zi = 1 | xi, wi, εi) = f(yi | xi′β + σεi) Prob(zi = 1 | wi, εi).

The first part, f(yi | xi′β + σεi), is the conditional index function model in (24-27). By joint normality, f(ui | εi) = N[ρεi, (1 − ρ²)], so ui | εi = ρεi + (ui − ρεi) = ρεi + vi, where E[vi] = 0 and Var[vi] = (1 − ρ²). Therefore,

Prob(zi = 1 | wi, εi) = Φ[(wi′α + ρεi)/√(1 − ρ²)].

Combining terms and using the earlier approach, the unconditional joint density is

p(yi, zi = 1 | xi, wi) = ∫(−∞ to ∞) f(yi | xi′β + σεi) Φ[(wi′α + ρεi)/√(1 − ρ²)] [exp(−εi²/2)/√(2π)] dεi.  (24-28)

The other part of the likelihood function for the observations with zi = 0 will be

Prob(zi = 0 | wi) = ∫(−∞ to ∞) Prob(zi = 0 | wi, εi) f(εi) dεi
  = ∫(−∞ to ∞) {1 − Φ[(wi′α + ρεi)/√(1 − ρ²)]} f(εi) dεi  (24-29)
  = ∫(−∞ to ∞) Φ[−(wi′α + ρεi)/√(1 − ρ²)] [exp(−εi²/2)/√(2π)] dεi.

For convenience, we can use the invariance principle to reparameterize the likelihood function in terms of γ = α/√(1 − ρ²) and τ = ρ/√(1 − ρ²). Combining all the preceding terms, the log-likelihood function to be maximized is

ln L = Σ(i=1 to n) ln ∫(−∞ to ∞) [(1 − zi) + zi f(yi | xi′β + σεi)] Φ[(2zi − 1)(wi′γ + τεi)] φ(εi) dεi.  (24-30)

This can be maximized with respect to (β, σ, γ, τ) using quadrature or simulation. When done, ρ can be recovered from ρ = τ/(1 + τ²)^(1/2) and α = (1 − ρ²)^(1/2)γ. All that differs from one model to another is the specification of f(yi | xi′β + σεi). This is the specification used in Greene (1992), Terza (1995), and Terza and Kenkel (2001). (In the latter two papers, the authors analyzed E[yi | zi = 1] rather than the conditional density. Their estimator was based on nonlinear least squares, but as earlier, it is necessary to integrate the unobserved heterogeneity out of the conditional mean function.)

24.5.8 PANEL DATA APPLICATIONS OF SAMPLE SELECTION MODELS

The development of methods for extending sample selection models to panel data settings parallels the literature on cross-section methods. It begins with Hausman and Wise (1979), who devised a maximum likelihood estimator for a two-period model with attrition; the “selection equation” was a formal model for attrition from the sample.


Subsequent research has drawn the analogy between attrition and sample selection in a variety of applications, such as Keane et al. (1988) and Verbeek and Nijman (1992), and produced theoretical developments including Wooldridge (2002a, b). The direct extension of panel data methods to sample selection brings several new issues for the modeler. An immediate question arises concerning the nature of the selection itself. Although much of the theoretical literature [e.g., Kyriazidou (1997, 2001)] treats the panel as if the selection mechanism is run anew in every period, in practice, the selection process often comes in two very different forms. First, selection may take the form of selection of the entire group of observations into the panel data set. Thus, the selection mechanism operates once, perhaps even before the observation window opens. Consider the entry (or not) of eligible candidates for a job training program. In this case, it is not appropriate to build the model to allow entry, exit, then reentry. Second, for most applications, selection comes in the form of attrition or retention. Once an observation is “deselected,” it does not return. Leading examples would include “survivorship” in time-series–cross-section models of firm performance and attrition in medical trials and in panel data applications involving large national survey data bases, such as Contoyannis et al. (2004). Each of these cases suggests the utility of a more structured approach to the selection mechanism.

24.5.8.a Common Effects in Sample Selection Models

A formal “effects” treatment for sample selection was first suggested in complete form by Verbeek (1990), who formulated a random effects model for the probit equation and a fixed effects approach for the main regression. Zabel (1992) criticized the specification for its asymmetry in the treatment of the effects in the two equations. He also argued that the likelihood function that neglected correlation between the effects and regressors in the probit model would render the FIML estimator inconsistent. His proposal involved fixed effects in both equations. Recognizing the difficulty of fitting such a model, he then proposed using the Mundlak correction. The full model is

yit* = ηi + xit′β + εit, ηi = x̄i′π + τwi, wi ∼ N[0, 1],
dit* = θi + zit′α + uit, θi = z̄i′δ + ωvi, vi ∼ N[0, 1],  (24-31)
(εit, uit) ∼ N2[(0, 0), (σ², 1, ρσ)].

The “selectivity” in the model is carried through the correlation between εit and uit. The resulting log-likelihood is built up from the contribution of individual i,

Li = ∫(−∞ to ∞) ∏(dit=0) Φ[−zit′α − z̄i′δ − ωvi] φ(vi) dvi
  × ∫∫ ∏(dit=1) Φ{[zit′α + z̄i′δ + ωvi + (ρ/σ)εit]/√(1 − ρ²)} (1/σ)φ(εit/σ) φ2(vi, wi) dvi dwi,  (24-32)

where εit = yit − xit′β − x̄i′π − τwi.

The log-likelihood is then ln L = Σ_i ln L_i. The log-likelihood requires integration in two dimensions for any selected observations. Vella (1998) suggested two-step procedures to avoid the integration. However,


the bivariate normal integration is actually the product of two univariate normals, because in the preceding specification, v_i and w_i are assumed to be uncorrelated. As such, the likelihood function in (24-32) can be readily evaluated using familiar simulation or quadrature techniques. [See Sections 16.9.6.b and 17.5. Vella and Verbeek (1999) suggest this in a footnote, but do not pursue it.] To show this, note that the first line in the log-likelihood is of the form E_v[∏_{d=0} Φ(·)] and the second line is of the form E_w[E_v[Φ(·)φ(·)/σ]]. Either of these expectations can be satisfactorily approximated with the average of a sufficient number of draws from the standard normal populations that generate w_i and v_i. The term in the simulated likelihood that follows this prescription is

$$
L_i^S = \frac{1}{R} \sum_{r=1}^{R} \prod_{d_{it}=0} \Phi\!\left[ -z_{it}'\alpha - \bar{z}_i'\delta - \omega v_{i,r} \right]
$$
$$
\times \frac{1}{R} \sum_{r=1}^{R} \prod_{d_{it}=1} \Phi\!\left[ \frac{z_{it}'\alpha + \bar{z}_i'\delta + \omega v_{i,r} + (\rho/\sigma)\varepsilon_{it,r}}{\sqrt{1-\rho^2}} \right] \frac{1}{\sigma} \phi\!\left( \frac{\varepsilon_{it,r}}{\sigma} \right), \tag{24-33}
$$
$$
\varepsilon_{it,r} = y_{it} - x_{it}'\beta - \bar{x}_i'\pi - \tau w_{i,r}.
$$

Maximization of this simulated log-likelihood with respect to (β, σ, ρ, α, δ, π, τ, ω) by conventional gradient methods is quite feasible.
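The following minimal sketch evaluates (24-33) for one individual, assuming NumPy/SciPy; the function and argument names are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def sim_Li(yi, Xi, di, Zi, xbar, zbar, beta, sigma, rho, alpha, delta,
           pi, tau, omega, R=500, seed=0):
    """Simulated likelihood contribution (24-33) for one individual i."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(R)                      # draws behind theta_i
    w = rng.standard_normal(R)                      # draws behind eta_i
    base = Zi @ alpha + zbar @ delta                # (T_i,) probit index parts
    # First average: nonselected periods, Phi[-z'a - zbar'd - omega * v_r].
    A = norm.cdf(-base[di == 0][:, None] - omega * v[None, :])
    # Second average: selected periods, with eps_itr built from the w draws.
    eps = (yi[di == 1] - Xi[di == 1] @ beta - xbar @ pi)[:, None] - tau * w[None, :]
    B = norm.cdf((base[di == 1][:, None] + omega * v[None, :]
                  + (rho / sigma) * eps) / np.sqrt(1.0 - rho ** 2))
    B = B * norm.pdf(eps / sigma) / sigma
    return A.prod(axis=0).mean() * B.prod(axis=0).mean()
```

Summing ln sim_Li over individuals gives the simulated log-likelihood to be maximized.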

Indeed, this formulation provides a means by which the likely correlation between v_i and w_i can be accommodated in the model. Suppose that w_i and v_i are bivariate standard normal with correlation ρ_vw. We can project w_i on v_i and write

$$
w_i = \rho_{vw} v_i + \left(1 - \rho_{vw}^2\right)^{1/2} h_i,
$$

where h_i has a standard normal distribution. To allow the correlation, we now simply substitute this expression for w_i in the simulated (or original) log-likelihood and add ρ_vw to the list of parameters to be estimated. The simulation is still over independent normal variates, v_i and h_i.

Notwithstanding the preceding derivation, much of the recent attention has focused on simpler two-step estimators. Building on Ridder and Wansbeek (1990) and Verbeek and Nijman (1992) [see Vella (1998) for numerous additional references], Vella and Verbeek (1999) propose a two-step methodology that involves a random effects framework similar to the one in (24-31). As they note, there is some loss in efficiency by not using the FIML estimator. But, with the sample sizes typical in contemporary panel data sets, that efficiency loss may not be large. As they note, their two-step template encompasses a variety of models, including the tobit model examined in the preceding sections and the mover-stayer model noted earlier. The Vella and Verbeek model requires some fairly intricate maximum likelihood procedures. Wooldridge (1995) proposes an estimator that, with a few probably (but not necessarily) innocent assumptions, can be based on straightforward applications of conventional, everyday methods. The point of departure is a fixed effects specification,

$$
y_{it}^* = \eta_i + x_{it}'\beta + \varepsilon_{it},
$$
$$
d_{it}^* = \theta_i + z_{it}'\alpha + u_{it},
$$
$$
(\varepsilon_{it}, u_{it}) \sim N_2\left[(0, 0), (\sigma^2, 1, \rho\sigma)\right].
$$


Under the mean independence assumption

$$
E[\varepsilon_{it} \mid \eta_i, \theta_i, z_{i1}, \ldots, z_{iT}, v_{i1}, \ldots, v_{iT}, d_{i1}, \ldots, d_{iT}] = \rho u_{it},
$$

it will follow that

$$
E[y_{it} \mid x_{i1}, \ldots, x_{iT}, \eta_i, \theta_i, z_{i1}, \ldots, z_{iT}, v_{i1}, \ldots, v_{iT}, d_{i1}, \ldots, d_{iT}] = \eta_i + x_{it}'\beta + \rho u_{it}.
$$

This suggests an approach to estimating the model parameters; however, it requires computation of u_it. That would require estimation of θ_i, which cannot be done, at least not consistently, and that precludes simple estimation of u_it. To escape the dilemma, Wooldridge (2002c) suggests Chamberlain's approach to the fixed effects model,

$$
\theta_i = f_0 + z_{i1}'f_1 + z_{i2}'f_2 + \cdots + z_{iT}'f_T + h_i.
$$

With this substitution,

$$
d_{it}^* = z_{it}'\alpha + f_0 + z_{i1}'f_1 + z_{i2}'f_2 + \cdots + z_{iT}'f_T + h_i + u_{it}
= z_{it}'\alpha + f_0 + z_{i1}'f_1 + z_{i2}'f_2 + \cdots + z_{iT}'f_T + w_{it},
$$

where w_it is independent of z_it, t = 1, …, T. This now implies that

$$
E[y_{it} \mid x_{i1}, \ldots, x_{iT}, \eta_i, \theta_i, z_{i1}, \ldots, z_{iT}, v_{i1}, \ldots, v_{iT}, d_{i1}, \ldots, d_{iT}] = \eta_i + x_{it}'\beta + \rho(w_{it} - h_i)
$$
$$
= (\eta_i - \rho h_i) + x_{it}'\beta + \rho w_{it}.
$$

To complete the estimation procedure, we now compute T cross-sectional probit models (reestimating f_0, f_1, … each time) and compute λ̂_it from each one. The resulting equation,

$$
y_{it} = a_i + x_{it}'\beta + \rho\hat{\lambda}_{it} + v_{it},
$$

now forms the basis for estimation of β and ρ by using a conventional fixed effects linear regression with the observed data.
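A minimal sketch of the two-step computation, assuming pandas/statsmodels and a balanced panel; all names, and the simplification that every period's probit uses the full set z_i1, …, z_iT via the Chamberlain device, are assumptions of the sketch, not a definitive implementation.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.stats import norm

def fe_selection_two_step(df, yvar, dvar, xvars, zvars, ivar, tvar):
    """Step 1: period-by-period probits; step 2: within regression with IMR."""
    # Chamberlain device: stack z_i1, ..., z_iT as regressors in every probit.
    wide = df.pivot(index=ivar, columns=tvar, values=zvars)
    wide.columns = [f"{v}_{t}" for v, t in wide.columns]
    imr = pd.Series(index=df.index, dtype=float)
    for t in sorted(df[tvar].unique()):
        rows = df.index[df[tvar] == t]
        Zt = sm.add_constant(wide.loc[df.loc[rows, ivar]].to_numpy())
        pr = sm.Probit(df.loc[rows, dvar].to_numpy(), Zt).fit(disp=0)
        xb = Zt @ pr.params
        imr.loc[rows] = norm.pdf(xb) / norm.cdf(xb)   # inverse Mills ratio
    # Step 2: fixed effects (within) regression on the selected observations.
    sel = df.assign(imr=imr).loc[df[dvar] == 1]
    cols = list(xvars) + ["imr", yvar]
    dm = sel[cols] - sel.groupby(ivar)[cols].transform("mean")
    return sm.OLS(dm[yvar], dm[list(xvars) + ["imr"]]).fit()
    # The coefficient on "imr" estimates rho.
```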

24.5.8.b Attrition

The recent literature on sample selection contains numerous analyses of two-period models, such as Kyriazidou (1997, 2001). They generally focus on non- and semiparametric analyses. An early parametric contribution of Hausman and Wise (1979) is also a two-period model of attrition, which would seem to characterize many of the studies suggested in the current literature. The model formulation is a two-period random effects specification:

$$
y_{i1} = x_{i1}'\beta + \varepsilon_{i1} + u_i \quad \text{(first period regression)},
$$
$$
y_{i2} = x_{i2}'\beta + \varepsilon_{i2} + u_i \quad \text{(second period regression)}.
$$

Attrition is likely in the second period (to begin the study, the individual must have been observed in the first period). The authors suggest that the probability that an observation is made in the second period varies with the value of y_i2 as well as some other variables,

$$
z_{i2}^* = \delta y_{i2} + x_{i2}'\theta + w_{i2}'\alpha + v_{i2}.
$$

Attrition occurs if z_i2* ≤ 0, which produces a probit model,

$$
z_{i2} = \mathbf{1}\left[ z_{i2}^* > 0 \right] \quad \text{(attrition indicator observed in period 2)}.
$$


An observation is made in the second period if z_i2 = 1, which makes this an early version of the familiar sample selection model. The reduced form of the observation equation is

$$
z_{i2}^* = x_{i2}'(\delta\beta + \theta) + w_{i2}'\alpha + \delta\varepsilon_{i2} + v_{i2}
= x_{i2}'\pi + w_{i2}'\alpha + h_{i2}
= r_{i2}'\gamma + h_{i2}.
$$

The variables in the probit equation are all those in the second period regression plus any additional ones dictated by the application. The estimable parameters in this model are β, γ, σ² = Var[ε_it + u_i], and two correlation coefficients,

$$
\rho_{12} = \mathrm{Corr}[\varepsilon_{i1} + u_i, \varepsilon_{i2} + u_i] = \mathrm{Var}[u_i]/\sigma^2,
$$

and

$$
\rho_{23} = \mathrm{Corr}[h_{i2}, \varepsilon_{i2} + u_i].
$$

All disturbances are assumed to be normally distributed. (Readers are referred to the paper for motivation and details on this specification.) The authors propose a full information maximum likelihood estimator. Estimation can be simplified somewhat by using two steps. The parameters of the probit model can be estimated first by maximum likelihood. Then the remaining parameters are estimated by maximum likelihood, conditionally on these first-step estimates. The Murphy and Topel adjustment is made after the second step. [See Greene (2007a).]

The Hausman and Wise model covers the case of two periods in which there is a formal mechanism in the model for retention in the second period. It is unclear how the procedure could be extended to a multiple-period application such as that in Contoyannis et al. (2004), which involved a panel data set with eight waves. In addition, in that study, the variables in the main equations were counts of hospital visits and physician visits, which complicates the use of linear regression. A workable solution to the problem of attrition in a multiperiod panel is the inverse probability weighted estimator. [Wooldridge (2002a, 2006b) and Rotnitzky and Robins (2005).] In the Contoyannis application, there are eight waves in the panel. Attrition is taken to be "ignorable" so that the unobservables in the attrition equation and in the main equation(s) of interest are uncorrelated. (Note that Hausman and Wise do not make this assumption.) This enables Contoyannis et al. to fit a "retention" probit equation for each observation present at wave 1, for waves 2–8, using characteristics observed at the entry to the panel. (This defines, then, "selection (retention) on observables.") Defining d_it to be the indicator for presence (d_it = 1) or absence (d_it = 0) of observation i in wave t, it will follow that the sequence of observations will begin at 1 and either stay at 1 or change to 0 for the remaining waves. Let p̂_it denote the predicted probability from the probit estimator at wave t. Then, their full log-likelihood is constructed as

$$
\ln L = \sum_{i=1}^{n} \sum_{t=1}^{T} \frac{d_{it}}{\hat{p}_{it}} \ln L_{it}.
$$

Wooldridge (2002b) presents the underlying theory for the properties of this weighted maximum likelihood estimator. [Further details on the use of the inverse probability weighted estimator in the Contoyannis et al. (2004) study appear in Jones, Koolman, and Rice (2006).]
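A minimal sketch of the weight construction, assuming NumPy and statsmodels; the function name and the data layout (a retention probit per wave on baseline characteristics, with absorbing attrition) are hypothetical simplifications of the procedure described above.

```python
import numpy as np
import statsmodels.api as sm

def ipw_weights(d, X1):
    """Inverse probability weights d_it / p_hat_it for the weighted
    log-likelihood. d: (n, T) retention indicators, column 0 all ones;
    X1: (n, k) characteristics observed at entry to the panel."""
    n, T = d.shape
    p = np.ones((n, T))                        # cumulative retention probability
    X = sm.add_constant(X1)
    for t in range(1, T):
        alive = d[:, t - 1] == 1               # still present at wave t-1
        pr = sm.Probit(d[alive, t], X[alive]).fit(disp=0)
        step = np.ones(n)
        step[alive] = pr.predict(X[alive])     # retention probability at wave t
        p[:, t] = p[:, t - 1] * step
    return d / p                               # weights; zero where d_it = 0
```

The weighted log-likelihood is then Σ_it weight_it × lnL_it, with lnL_it evaluated only for the observed cells.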


24.6 SUMMARY AND CONCLUSIONS

This chapter has examined three settings in which, in principle, the linear regression model of Chapter 2 would apply, but the data generating mechanism produces a nonlinear form. In the truncated regression model, the range of the dependent variable is restricted substantively. Certainly all economic data are restricted in this way; aggregate income data cannot be negative, for example. But when data are truncated so that plausible values of the dependent variable are precluded, for example, when zero values for expenditure are discarded, the data that remain are analyzed with models that explicitly account for the truncation. When data are censored, values of the dependent variable that could in principle be observed are masked. Ranges of values of the true variable being studied are observed as a single value. The basic problem this presents for model building is that in such a case, we observe variation of the independent variables without the corresponding variation in the dependent variable that might be expected. Finally, the issue of sample selection arises when the observed data are not drawn randomly from the population of interest. Failure to account for this nonrandom sampling produces a model that describes only the nonrandom subsample, not the larger population. In each case, we examined the model specification and estimation techniques that are appropriate for these variations of the regression model. Maximum likelihood is usually the method of choice, but for the third case, a two-step estimator has become more common.

Key Terms and Concepts

• Attrition
• Attenuation
• Average treatment effect
• Average treatment effect on the treated
• Censored regression model
• Censored variable
• Censoring
• Conditional moment test
• Control function
• Corner solution
• Degree of truncation
• Delta method
• Difference in differences
• Generalized residual
• Hazard function
• Heterogeneity
• Heteroscedasticity
• Incidental truncation
• Instrumental variable estimation
• Inverse Mills ratio
• Inverse probability weighted estimator
• Lagrange multiplier test
• Marginal effects
• Matching estimator
• Mean independence assumption
• Olsen's reparameterization
• Parametric model
• Propensity score
• Sample selection
• Semiparametric estimator
• Specification error
• Tobit model
• Treatment effect
• Truncated bivariate normal distribution
• Truncated distribution
• Truncated mean
• Truncated normal distribution
• Truncated random variable
• Truncated standard normal distribution
• Truncated variance
• Two-step estimation

Exercises

1. The following 20 observations are drawn from a censored normal distribution:

3.8396  7.2040  0.00000  0.00000  4.4132  8.0230  5.7971  7.0828  0.00000  0.80260
13.0670  4.3211  0.00000  8.6801  5.4571  0.00000  8.1021  0.00000  1.2526  5.6016


The applicable model is

$$
y_i^* = \mu + \varepsilon_i,
$$
$$
y_i = y_i^* \ \text{if}\ \mu + \varepsilon_i > 0,\ \ 0\ \text{otherwise},
$$
$$
\varepsilon_i \sim N[0, \sigma^2].
$$

Exercises 1 through 4 in this section are based on the preceding information. The OLS estimator of μ in the context of this tobit model is simply the sample mean. Compute the mean of all 20 observations. Would you expect this estimator to over- or underestimate μ? If we consider only the nonzero observations, then the truncated regression model applies. The sample mean of the nonlimit observations is the least squares estimator in this context. Compute it and then comment on whether this sample mean should be an overestimate or an underestimate of the true mean.

2. We now consider the tobit model that applies to the full data set.
   a. Formulate the log-likelihood for this very simple tobit model.
   b. Reformulate the log-likelihood in terms of θ = 1/σ and γ = μ/σ. Then derive the necessary conditions for maximizing the log-likelihood with respect to θ and γ.
   c. Discuss how you would obtain the values of θ and γ to solve the problem in part b.
   d. Compute the maximum likelihood estimates of μ and σ.
3. Using only the nonlimit observations, repeat Exercise 2 in the context of the truncated regression model. Estimate μ and σ by using the method of moments estimator outlined in Example 24.2. Compare your results with those in the previous exercises.
4. Continuing to use the data in Exercise 1, consider once again only the nonzero observations. Suppose that the sampling mechanism is as follows: y* and another normally distributed random variable z have population correlation 0.7. The two variables, y* and z, are sampled jointly. When z is greater than zero, y is reported. When z is less than zero, both z and y are discarded. Exactly 35 draws were required to obtain the preceding sample. Estimate μ and σ. (Hint: Use Theorem 24.5.)
5. Derive the marginal effects for the tobit model with heteroscedasticity that is described in Section 24.3.4.a.
6. Prove that the Hessian for the tobit model in (24-14) is negative definite after Olsen's transformation is applied to the parameters.

Applications

1. In Section 25.5.2, we will examine Ray Fair's famous analysis of a Psychology Today survey on extramarital affairs. Although the dependent variable used in that study was a count, Fair used the tobit model as the platform for his study. Our analysis in Section 25.5.2 will examine the study, using a Poisson model for counts instead. Fair's original study also included but did not analyze a second data set that was a similar survey conducted by Redbook magazine. The data are reproduced in Appendix Table F24.1.


(Our thanks to Ray Fair for providing these data.) This sample contains observations on 6,366 women and the following variables:

id = an identification number,
C = constant, value = 1,
yrb = a constructed measure of time spent in extramarital affairs,
v1 = a rating of the marriage, coded 1 to 4,
v2 = age, in years, aggregated,
v3 = number of years married,
v4 = number of children, top coded at 5,
v5 = religiosity, 1 to 4, 1 = not, 4 = very,
v6 = education, coded 9, 12, 14, 16, 17, 20,
v7 = occupation,
v8 = husband's occupation.

Three other variables were not used. Details on the variables in the model are given in Fair's (1978) Journal of Political Economy paper. Using these data, conduct a parallel study to the Psychology Today study that was done in Fair (1978) and replicated in Section 25.5.2. Are the results consistent? Report all results, including partial effects and relevant diagnostic statistics.

2. Continuing the analysis of the first application, note that these data conform precisely to the description of "corner solutions" in Section 24.3.4.c. The dependent variable is not censored in the fashion usually assumed for a tobit model. To investigate whether the dependent variable is determined by a two-part decision process (yes/no and, if yes, how much), specify and estimate a two-equation model in which the first equation analyzes the binary decision A = 1 if y > 0 and 0 otherwise, and the second equation analyzes y | y > 0. What is the appropriate model? What do you find? Report all results. (Note: if you analyze the second dependent variable using the truncated regression, you should remove some extreme observations from your sample. The truncated regression estimator refuses to converge with the full data set, but works nicely for the example if you omit observations with y > 5.)

25 MODELS FOR EVENT COUNTS AND DURATION

25.1 INTRODUCTION

This chapter is concerned with models of events. In some aspects, the regression-like models we have studied, such as the discrete choice models, are the appropriate tools. As in the previous two chapters, however, the models are nonlinear, and the familiar regression methods are not appropriate. Most of this analysis focuses on maximum likelihood estimators. In modeling duration, although an underlying regression model is in fact at work, it is not the conditional mean function that is of interest. More likely, as we will explore below, the objects of estimation are certain probabilities of events. Consider the measurement of the interarrival time t (i.e., the amount of time that passes between arrivals) of, say, customers at an ATM machine, messages in a telephone system, customers at a store, or patients at a hospital. A model that is often used for phenomena such as these is the exponential model,

$$
f(t) = (1/\theta)\exp(-t/\theta), \qquad t \ge 0,\ \theta > 0.
$$

The expected interarrival time in this distribution is E[t] = 1/θ. With this in hand, then, consider the number of arrivals, y, that occur per unit of time. It can be shown that this discrete random variable has probability distribution

$$
g(y) = \frac{\exp(-\lambda)\lambda^{y}}{y!}, \qquad \lambda = 1/\theta > 0,\ y = 0, 1, \ldots,
$$

which is the Poisson distribution. The expected value of this random variable is E[y] = λ = 1/θ, which makes intuitive sense. If the average time between arrivals is 30 seconds (0.5 minutes), then we expect two arrivals per minute.

The modeling of events is concerned with these two clearly related phenomena. Different aspects will be of interest depending on the objectives of the study. In health economics, researchers are often interested in the number of occurrences of events. In several of our earlier applications we have examined models that describe utilization of the health system in terms of numbers of visits to the physician or to the hospital. In contrast, in a business application, one might be interested in the duration of events such as strikes or spells of unemployment, or in the amount of time that passes between events such as business failures. In medical statistics, much of the research is focused on duration between events, such as the length of survival of patients after treatment.
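The exponential–Poisson duality is easy to verify numerically; here is a quick simulation, assuming NumPy (variable names are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(7)
theta = 0.5                     # mean interarrival time: 30 seconds
minutes = 100_000

# Exponential interarrival times accumulated into arrival times,
# then counted per one-minute interval.
arrivals = np.cumsum(rng.exponential(theta, size=int(3 * minutes / theta)))
arrivals = arrivals[arrivals < minutes]
counts = np.bincount(arrivals.astype(int), minlength=minutes)

print(counts.mean())            # close to lambda = 1/theta = 2
print(counts.var())             # Poisson: the variance equals the mean
```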



This chapter will describe modeling approaches for events.1 As suggested, the two measures, counts of events and duration between events, are usually studied with different techniques, and for different purposes. Sections 25.2 to 25.5 will describe some of the econometric approaches to modeling event counts. The application of duration models to economic data is somewhat less frequent, but conversely, far more frequent in the other fields mentioned. The chapter, and this text, end in Section 25.6 with a discussion of models for duration.

25.2 MODELS FOR COUNTS OF EVENTS

Data on patents suggested in Section 23.2 are typical of count data. In principle, we could analyze these data using multiple linear regression. But the preponderance of zeros and the small values and clearly discrete nature of the dependent variable suggest that we can improve on least squares and the linear model with a specification that accounts for these characteristics. The Poisson regression model has been widely used to study such data.

The Poisson regression model specifies that each y_i is drawn from a Poisson distribution with parameter λ_i, which is related to the regressors x_i. The primary equation of the model is

$$
\mathrm{Prob}(Y = y_i \mid x_i) = \frac{e^{-\lambda_i}\lambda_i^{y_i}}{y_i!}, \qquad y_i = 0, 1, 2, \ldots. \tag{25-1}
$$

The most common formulation for λ_i is the loglinear model,

$$
\ln \lambda_i = x_i'\beta.
$$

It is easily shown that the expected number of events per period is given by

$$
E[y_i \mid x_i] = \mathrm{Var}[y_i \mid x_i] = \lambda_i = e^{x_i'\beta},
$$

so

$$
\frac{\partial E[y_i \mid x_i]}{\partial x_i} = \lambda_i \beta.
$$

With the parameter estimates in hand, this vector can be computed using any data vector desired. In principle, the Poisson model is simply a nonlinear regression.2 But it is far easier to estimate the parameters with maximum likelihood techniques. The log-likelihood function is

$$
\ln L = \sum_{i=1}^{n} \left[ -\lambda_i + y_i x_i'\beta - \ln y_i! \right].
$$

1Some particularly rich surveys of these topics (among dozens available) are Cameron and Trivedi (1986, 1998, 2005), Winkelmann (2003), and Kalbfleisch and Prentice (2002). 2We have estimated a Poisson regression model using two-step nonlinear least squares in Example 16.6. Greene-50558 book June 23, 2007 15:44


The likelihood equations are

$$
\frac{\partial \ln L}{\partial \beta} = \sum_{i=1}^{n} (y_i - \lambda_i) x_i = 0.
$$

The Hessian is

$$
\frac{\partial^2 \ln L}{\partial \beta\, \partial \beta'} = -\sum_{i=1}^{n} \lambda_i x_i x_i'.
$$

The Hessian is negative definite for all x and β. Newton's method is a simple algorithm for this model and will usually converge rapidly. At convergence, [Σ_{i=1}^{n} λ̂_i x_i x_i']⁻¹ provides an estimator of the asymptotic covariance matrix for the parameter estimates. Given the estimates, the prediction for observation i is λ̂_i = exp(x_i'β̂). A standard error for the prediction interval can be formed by using a linear Taylor series approximation. The estimated variance of the prediction will be λ̂_i² x_i'V x_i, where V is the estimated asymptotic covariance matrix for β̂. For testing hypotheses, the three standard tests are very convenient in this model. The Wald statistic is computed as usual. As in any discrete choice model, the likelihood ratio test has the intuitive form

$$
LR = 2 \sum_{i=1}^{n} \ln\!\left( \frac{\hat{P}_i}{\hat{P}_{\mathrm{restricted},i}} \right),
$$

where the probabilities in the denominator are computed using the restricted model. Using the BHHH estimator for the asymptotic covariance matrix, the LM statistic is simply

$$
LM = \left[ \sum_{i=1}^{n} x_i (y_i - \hat{\lambda}_i) \right]' \left[ \sum_{i=1}^{n} x_i x_i' (y_i - \hat{\lambda}_i)^2 \right]^{-1} \left[ \sum_{i=1}^{n} x_i (y_i - \hat{\lambda}_i) \right] = \mathbf{i}' G (G'G)^{-1} G' \mathbf{i}, \tag{25-2}
$$

where each row of G is simply the corresponding row of X multiplied by e_i = (y_i − λ̂_i), λ̂_i is computed using the restricted coefficient vector, and i is a column of ones.
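Newton's method for this model, mentioned above, takes only a few lines; a minimal sketch, assuming NumPy (function and variable names are hypothetical):

```python
import numpy as np

def poisson_mle(y, X, tol=1e-10, max_iter=50):
    """Newton's method for the Poisson regression MLE."""
    beta = np.zeros(X.shape[1])          # works if X contains a constant
    for _ in range(max_iter):
        lam = np.exp(X @ beta)
        score = X.T @ (y - lam)              # dlnL/dbeta
        hess = -(X * lam[:, None]).T @ X     # d2lnL/dbeta dbeta'
        step = np.linalg.solve(hess, score)
        beta = beta - step
        if np.abs(step).max() < tol:
            break
    lam = np.exp(X @ beta)
    V = np.linalg.inv((X * lam[:, None]).T @ X)  # [sum lam_i x_i x_i']^{-1}
    return beta, V
```

Because the Hessian is everywhere negative definite, the iteration is globally well behaved in typical applications.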

25.2.1 MEASURING GOODNESS OF FIT

The Poisson model produces no natural counterpart to the R² in a linear regression model, as usual, because the conditional mean function is nonlinear and, moreover, because the regression is heteroscedastic. But many alternatives have been suggested.3 A measure based on the standardized residuals is

$$
R_p^2 = 1 - \frac{\sum_{i=1}^{n} \left[ \frac{y_i - \hat{\lambda}_i}{\sqrt{\hat{\lambda}_i}} \right]^2}{\sum_{i=1}^{n} \left[ \frac{y_i - \bar{y}}{\sqrt{\bar{y}}} \right]^2}.
$$

3See the surveys by Cameron and Windmeijer (1993), Gurmu and Trivedi (1994), and Greene (1995b). Greene-50558 book June 23, 2007 15:44


This measure has the virtue that it compares the fit of the model with that provided by a model with only a constant term. But it can be negative, and it can rise when a variable is dropped from the model. For an individual observation, the deviance is

$$
d_i = 2\left[ y_i \ln(y_i/\hat{\lambda}_i) - (y_i - \hat{\lambda}_i) \right] = 2\left[ y_i \ln(y_i/\hat{\lambda}_i) - e_i \right],
$$

where, by convention, 0 ln(0) = 0. If the model contains a constant term, then Σ_{i=1}^{n} e_i = 0. The sum of the deviances,

$$
G^2 = \sum_{i=1}^{n} d_i = 2 \sum_{i=1}^{n} y_i \ln(y_i/\hat{\lambda}_i),
$$

is reported as an alternative fit measure by some computer programs. This statistic will equal 0.0 for a model that produces a perfect fit. (Note that because y_i is an integer while the prediction is continuous, a perfect fit could not actually occur.) Cameron and Windmeijer (1993) suggest that the fit measure based on the deviances,

$$
R_d^2 = 1 - \frac{\sum_{i=1}^{n} \left[ y_i \ln\!\left(\frac{y_i}{\hat{\lambda}_i}\right) - (y_i - \hat{\lambda}_i) \right]}{\sum_{i=1}^{n} y_i \ln\!\left(\frac{y_i}{\bar{y}}\right)},
$$

has a number of desirable properties. First, denote the log-likelihood function for the model in which ψ_i is used as the prediction (e.g., the mean) of y_i as ℓ(ψ_i, y_i). The Poisson model fit by MLE is, then, ℓ(λ̂_i, y_i), the model with only a constant term is ℓ(ȳ, y_i), and a model that achieves a perfect fit (by predicting y_i with itself) is ℓ(y_i, y_i). Then

$$
R_d^2 = \frac{\ell(\hat{\lambda}_i, y_i) - \ell(\bar{y}, y_i)}{\ell(y_i, y_i) - \ell(\bar{y}, y_i)}.
$$

Both numerator and denominator measure the improvement of the model over one with only a constant term. The denominator measures the maximum improvement, since one cannot improve on a perfect fit. Hence, the measure is bounded by zero and one and increases as regressors are added to the model.4 We note, finally, the passing resemblance of R_d² to the "pseudo-R²," or "likelihood ratio index," reported by some statistical packages (e.g., Stata),

$$
R_{\mathrm{LRI}}^2 = 1 - \frac{\ell(\hat{\lambda}_i, y_i)}{\ell(\bar{y}, y_i)}.
$$
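These fit measures are straightforward to compute from the fitted means; a minimal sketch, assuming NumPy (function name hypothetical, constant term assumed present):

```python
import numpy as np

def poisson_fit_measures(y, lam_hat):
    """R_p^2 and R_d^2 for a fitted Poisson regression."""
    ybar = y.mean()
    r2_p = 1.0 - (((y - lam_hat) ** 2 / lam_hat).sum()
                  / ((y - ybar) ** 2 / ybar).sum())
    # Deviance terms, with the convention 0 * ln 0 = 0.
    t_fit = np.where(y > 0, y * np.log(np.maximum(y, 1) / lam_hat), 0.0)
    t_nul = np.where(y > 0, y * np.log(np.maximum(y, 1) / ybar), 0.0)
    r2_d = 1.0 - (t_fit - (y - lam_hat)).sum() / t_nul.sum()
    return r2_p, r2_d
```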

Many modifications of the Poisson model have been analyzed by economists. In this and the next few sections, we briefly examine a few of them.

25.2.2 TESTING FOR OVERDISPERSION

The Poisson model has been criticized because of its implicit assumption that the variance of y_i equals its mean. Many extensions of the Poisson model that relax this assumption have been proposed by Hausman, Hall, and Griliches (1984), McCullagh and Nelder (1983), and Cameron and Trivedi (1986), to name but a few.

4Note that multiplying both numerator and denominator by 2 produces the ratio of two likelihood ratio statistics, each of which is distributed as chi-squared. Greene-50558 book June 25, 2007 17:35


The first step in this extended analysis is usually a test for overdispersion in the context of the simple model. A number of authors have devised tests for “overdis- persion” within the context of the Poisson model. [See Cameron and Trivedi (1990), Gurmu (1991), and Lee (1986).] We will consider three of the common tests, one based on a regression approach, one a conditional moment test, and a third, a Lagrange multiplier test, based on an alternative model. Cameron and Trivedi (1990) offer several different tests for overdispersion. A simple regression-based procedure used for testing the hypothesis

$$
H_0:\ \mathrm{Var}[y_i] = E[y_i],
$$

$$
H_1:\ \mathrm{Var}[y_i] = E[y_i] + \alpha g(E[y_i]),
$$

is carried out by regressing

$$
z_i = \frac{(y_i - \hat{\lambda}_i)^2 - y_i}{\hat{\lambda}_i \sqrt{2}},
$$

where λ̂_i is the predicted value from the regression, on either a constant term or λ̂_i without a constant term. A simple t test of whether the coefficient is significantly different from zero tests H₀ versus H₁.

Cameron and Trivedi's regression-based test for overdispersion is formulated around the alternative Var[y_i] = E[y_i] + g(E[y_i]). This is a very specific type of overdispersion. Consider the more general hypothesis that Var[y_i] is completely given by E[y_i]. The alternative is that the variance is systematically related to the regressors in a way that is not completely accounted for by E[y_i]. Formally, we have E[y_i] = exp(β'x_i) = λ_i. The null hypothesis is that Var[y_i] = λ_i as well. We can test the hypothesis using a conditional moment test. The expected first derivatives and the moment restriction are

$$
E\left[ x_i (y_i - \lambda_i) \right] = 0 \quad \text{and} \quad E\left\{ z_i \left[ (y_i - \lambda_i)^2 - \lambda_i \right] \right\} = 0.
$$

To carry out the test, we do the following; a coded sketch appears after the list of steps. Let e_i = y_i − λ̂_i and z_i = x_i without the constant term.

1. Compute the Poisson regression by maximum likelihood.
2. Compute r = Σ_{i=1}^{n} z_i[e_i² − λ̂_i] = Σ_{i=1}^{n} z_i v_i based on the maximum likelihood estimates.
3. Compute M'M = Σ_{i=1}^{n} z_i z_i' v_i², D'D = Σ_{i=1}^{n} x_i x_i' e_i², and M'D = Σ_{i=1}^{n} z_i x_i' v_i e_i.
4. Compute S = M'M − M'D(D'D)⁻¹D'M.
5. C = r'S⁻¹r is the chi-squared statistic. It has degrees of freedom equal to the number of variables in z_i.
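A minimal sketch of these five steps, assuming NumPy (function and variable names are hypothetical):

```python
import numpy as np

def cm_overdispersion_test(y, X, lam_hat):
    """Conditional moment test for overdispersion following the steps above.
    X includes a constant in its first column; lam_hat are the Poisson MLE means."""
    e = y - lam_hat                         # first-moment residuals
    v = e ** 2 - lam_hat                    # second-moment residuals
    Z = X[:, 1:]                            # z_i = x_i without the constant
    r = Z.T @ v                                          # step 2
    MM = (Z * (v ** 2)[:, None]).T @ Z                   # step 3
    DD = (X * (e ** 2)[:, None]).T @ X
    MD = (Z * (v * e)[:, None]).T @ X
    S = MM - MD @ np.linalg.inv(DD) @ MD.T               # step 4
    C = r @ np.linalg.solve(S, r)                        # step 5
    return C, Z.shape[1]                    # statistic, degrees of freedom
```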

The next section presents the negative binomial model. This model relaxes the Poisson assumption that the mean equals the variance. The Poisson model is obtained as a parametric restriction on the negative binomial model, so a Lagrange multiplier test can be computed. In general, if an alternative distribution for which the Poisson model is obtained as a parametric restriction, such as the negative binomial model, can be specified, then a Lagrange multiplier statistic can be computed. [See Cameron and Trivedi (1986, p. 41).]


The LM statistic is

$$
LM = \frac{\left\{ \sum_{i=1}^{n} \hat{w}_i \left[ (y_i - \hat{\lambda}_i)^2 - y_i \right] \right\}^2}{2 \sum_{i=1}^{n} \hat{w}_i^2 \hat{\lambda}_i^2}. \tag{25-3}
$$

The weight, ŵ_i, depends on the assumed alternative distribution. For the negative binomial model discussed later, ŵ_i equals 1.0. Thus, under this alternative, the statistic is particularly simple to compute:

$$
LM = \frac{\left( e'e - n\bar{y} \right)^2}{2\, \hat{\lambda}'\hat{\lambda}}. \tag{25-4}
$$

The main advantage of this test statistic is that one need only estimate the Poisson model to compute it. Under the hypothesis of the Poisson model, the limiting distribution of the LM statistic is chi-squared with one degree of freedom.
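The statistic in (25-4) is a one-liner; a minimal sketch, assuming NumPy (function name hypothetical):

```python
import numpy as np

def lm_overdispersion(y, lam_hat):
    """LM statistic (25-4) against the negative binomial alternative.
    Chi-squared with one degree of freedom under the Poisson null."""
    e = y - lam_hat
    return (e @ e - y.size * y.mean()) ** 2 / (2.0 * lam_hat @ lam_hat)
```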

25.2.3 HETEROGENEITY AND THE NEGATIVE BINOMIAL REGRESSION MODEL

The assumed equality of the conditional mean and variance functions is typically taken to be the major shortcoming of the Poisson regression model. Many alternatives have been suggested [see Hausman, Hall, and Griliches (1984), Cameron and Trivedi (1986, 1998), Gurmu and Trivedi (1994), Johnson and Kotz (1993), and Winkelmann (2003) for discussion]. The most common is the negative binomial model, which arises from a natural formulation of cross-section heterogeneity. [See Hilbe (2007).] We generalize the Poisson model by introducing an individual, unobserved effect into the conditional mean,

$$
\ln \mu_i = x_i'\beta + \varepsilon_i = \ln \lambda_i + \ln u_i,
$$

where the disturbance ε_i reflects either specification error, as in the classical regression model, or the kind of cross-sectional heterogeneity that normally characterizes microeconomic data. Then, the distribution of y_i conditioned on x_i and u_i (i.e., ε_i) remains Poisson with conditional mean and variance μ_i:

$$
f(y_i \mid x_i, u_i) = \frac{e^{-\lambda_i u_i} (\lambda_i u_i)^{y_i}}{y_i!}.
$$

The unconditional distribution f(y_i | x_i) is the expected value (over u_i) of f(y_i | x_i, u_i),

$$
f(y_i \mid x_i) = \int_0^{\infty} \frac{e^{-\lambda_i u_i} (\lambda_i u_i)^{y_i}}{y_i!}\, g(u_i)\, du_i.
$$

The choice of a density for u_i defines the unconditional distribution. For mathematical convenience, a gamma distribution is usually assumed for u_i = exp(ε_i).5 As in other models of heterogeneity, the mean of the distribution is unidentified if the model contains a constant term (because the disturbance enters multiplicatively), so E[exp(ε_i)] is assumed to be 1.0.

5An alternative approach based on the normal distribution is suggested in Terza (1998), Greene (1995a, 1997a, 2007d), and Winkelmann (1997). The normal-Poisson mixture is also easily extended to the random effects model discussed in the next section. There is no closed form for the normal-Poisson mixture model, but it can be easily approximated by using Hermite quadrature or simulation. See Sections 16.9.6.b and 23.5.1.


With this normalization,

$$
g(u_i) = \frac{\theta^{\theta}}{\Gamma(\theta)}\, e^{-\theta u_i} u_i^{\theta - 1}.
$$

The density for yi is then

$$
f(y_i \mid x_i) = \int_0^{\infty} \frac{e^{-\lambda_i u_i} (\lambda_i u_i)^{y_i}}{y_i!} \cdot \frac{\theta^{\theta} u_i^{\theta - 1} e^{-\theta u_i}}{\Gamma(\theta)}\, du_i
= \frac{\theta^{\theta} \lambda_i^{y_i}}{\Gamma(y_i + 1)\, \Gamma(\theta)} \int_0^{\infty} e^{-(\lambda_i + \theta) u_i} u_i^{\theta + y_i - 1}\, du_i
$$

$$
= \frac{\theta^{\theta} \lambda_i^{y_i}\, \Gamma(\theta + y_i)}{\Gamma(y_i + 1)\, \Gamma(\theta)\, (\lambda_i + \theta)^{\theta + y_i}}
$$

$$
= \frac{\Gamma(\theta + y_i)}{\Gamma(y_i + 1)\, \Gamma(\theta)}\, r_i^{y_i} (1 - r_i)^{\theta}, \qquad \text{where } r_i = \frac{\lambda_i}{\lambda_i + \theta},
$$

which is one form of the negative binomial distribution. The distribution has conditional mean λ_i and conditional variance λ_i(1 + (1/θ)λ_i). [This model is Negbin 2 in Cameron and Trivedi's (1986) presentation.] The negative binomial model can be estimated by maximum likelihood without much difficulty. A test of the Poisson distribution is often carried out by testing the hypothesis α = 1/θ = 0 using the Wald or likelihood ratio test.
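The NB2 log-likelihood is simple to code; a minimal sketch assuming NumPy/SciPy (the parameterization, with ln θ left unconstrained, is one common choice rather than the only one, and the names are hypothetical):

```python
import numpy as np
from scipy.special import gammaln

def nb2_loglike(params, y, X):
    """Log-likelihood of the Negbin 2 model; params = (beta, ln theta)."""
    beta, theta = params[:-1], np.exp(params[-1])
    lam = np.exp(X @ beta)
    r = lam / (lam + theta)
    return (gammaln(theta + y) - gammaln(y + 1.0) - gammaln(theta)
            + y * np.log(r) + theta * np.log(1.0 - r)).sum()
```

The negative of this function can be passed to scipy.optimize.minimize; the Poisson restriction corresponds to α = 1/θ → 0.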

25.2.4 FUNCTIONAL FORMS FOR COUNT DATA MODELS

The equidispersion assumption of the Poisson regression model, E[y_i | x_i] = Var[y_i | x_i], is a major shortcoming. Observed data rarely, if ever, display this feature. The very large amount of research activity on functional forms for count models is often focused on testing for equidispersion and building functional forms that relax this assumption. In practice, the Poisson model is typically only the departure point for an extended specification search.

One easily remedied minor issue concerns the units of measurement of the data. In the Poisson and negative binomial models, the parameter λ_i is the expected number of events per unit of time. Thus, there is a presumption in the model formulation, for example, the Poisson, that the same amount of time is observed for each i. In a spatial context, such as measurements of the incidence of a disease per group of N_i persons, or the number of bomb craters per square mile (London, 1940), the assumption would be that the same physical area or the same size of population applies to each observation. Where this differs by individual, it will introduce a type of heteroscedasticity in the model. The simple remedy is to modify the model to account for the exposure, T_i, of the observation as follows:

$$
\mathrm{Prob}(y_i = j \mid x_i, T_i) = \frac{\exp(-T_i\phi_i)(T_i\phi_i)^{j}}{j!}, \qquad \phi_i = \exp(x_i'\beta),\ j = 0, 1, \ldots.
$$

The original model is returned if we write λ_i = exp(x_i'β + ln T_i). Thus, when the exposure differs by observation, the appropriate accommodation is to include the log of exposure in the regression part of the model with a coefficient of 1.0. (For less than obvious reasons, the term "offset variable" is commonly associated with the exposure variable T_i.) Note that if T_i is the same for all i, ln T_i will simply vanish into the constant term of the model (assuming one is included in x_i).
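In estimation, the offset amounts to adding ln T_i to the index with a unit coefficient; a minimal sketch of the resulting log-likelihood, assuming NumPy/SciPy (function name hypothetical):

```python
import numpy as np
from scipy.special import gammaln

def poisson_loglike_with_exposure(beta, y, X, T):
    """Poisson log-likelihood with exposure T_i entering as an offset,
    i.e., lambda_i = T_i * exp(x_i'beta)."""
    lam = np.exp(X @ beta + np.log(T))
    return (-lam + y * np.log(lam) - gammaln(y + 1.0)).sum()
```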


The recent literature, mostly associating the result with Cameron and Trivedi's (1986, 1998) work, defines two familiar forms of the negative binomial model. The Negbin 2 (NB2) form of the probability is

$$
\mathrm{Prob}(Y = y_i \mid x_i) = \frac{\Gamma(\theta + y_i)}{\Gamma(y_i + 1)\, \Gamma(\theta)}\, r_i^{y_i} (1 - r_i)^{\theta},
$$
$$
\lambda_i = \exp(x_i'\beta), \qquad r_i = \lambda_i / (\theta + \lambda_i). \tag{25-5}
$$

This is the default form of the model in the received econometrics packages that provide an estimator for this model. The Negbin 1 (NB1) form of the model results if θ in the preceding is replaced with θ_i = θλ_i. Then, r_i reduces to r = 1/(1 + θ), and the density becomes

$$
\mathrm{Prob}(Y = y_i \mid x_i) = \frac{\Gamma(\theta\lambda_i + y_i)}{\Gamma(y_i + 1)\, \Gamma(\theta\lambda_i)}\, r^{y_i} (1 - r)^{\theta\lambda_i}. \tag{25-6}
$$

This is not a simple reparameterization of the model. The results in Example 25.1 demonstrate that the log-likelihood functions are not equal at the maxima, and the parameters are not simple transformations in one model versus the other. We are not aware of a theory that justifies using one form or the other for the negative binomial model. Neither is a restricted version of the other, so we cannot carry out a likelihood ratio test of one versus the other. The more general Negbin P (NBP) family does nest both of them, so this may provide a more general, encompassing approach to finding the right specification. The Negbin P model is obtained by replacing θ in the Negbin 2 form with θλ_i^{2−P}. We have examined the cases of P = 1 and P = 2 in (25-5) and (25-6). The full model is

$$
\mathrm{Prob}(Y = y_i \mid x_i) = \frac{\Gamma\!\left(\theta\lambda_i^{Q} + y_i\right)}{\Gamma(y_i + 1)\, \Gamma\!\left(\theta\lambda_i^{Q}\right)} \left( \frac{\lambda_i}{\theta\lambda_i^{Q} + \lambda_i} \right)^{y_i} \left( \frac{\theta\lambda_i^{Q}}{\theta\lambda_i^{Q} + \lambda_i} \right)^{\theta\lambda_i^{Q}}, \qquad Q = 2 - P.
$$

The conditional mean function for the three cases considered is

$$
E[y_i \mid x_i] = \exp(x_i'\beta) = \lambda_i.
$$

The parameter P is picking up the scaling. A general result is that for all three variants of the model,

$$
\mathrm{Var}[y_i \mid x_i] = \lambda_i \left( 1 + \alpha\lambda_i^{P-1} \right), \qquad \text{where } \alpha = 1/\theta.
$$

Thus, the NB2 form has a variance function that is quadratic in the mean, while the NB1 form's variance is a simple multiple of the mean. There have been many other functional forms proposed for count data models, including the generalized Poisson, gamma, and Polya-Aeppli forms described in Winkelmann (2003) and Greene (2007a, Chapter 24).

The heteroscedasticity in the count models is induced by the relationship between the variance and the mean. The single parameter θ picks up an implicit overall scaling, so it does not contribute to this aspect of the model. As in the linear model, microeconomic data are likely to induce heterogeneity in both the mean and variance of the response


variable. A specification that allows independent variation of both will be of some virtue. The result

$$
\mathrm{Var}[y_i \mid x_i] = \lambda_i \left( 1 + (1/\theta)\lambda_i^{P-1} \right)
$$

suggests that a natural platform for separately modeling heteroscedasticity will be the dispersion parameter, θ, which we now parameterize as

$$
\theta_i = \theta \exp(z_i'\delta).
$$

Operationally, this is a relatively minor extension of the model. But it is likely to introduce quite a substantial increase in the flexibility of the specification. Indeed, a heterogeneous Negbin P model is likely to be sufficiently parameterized to accommodate the behavior of most data sets. (Of course, the specialized models discussed in Section 25.4, for example, the zero inflation models, may yet be more appropriate for a given situation.)

Example 25.1 Count Data Models for Doctor Visits

The study by Riphahn et al. (2003) that provided the data we have used in numerous earlier examples analyzed the two count variables DocVis (visits to the doctor) and HospVis (visits to the hospital). The authors were interested in the joint determination of these two count variables. One of the issues considered in the study was whether the data contained evidence of moral hazard, that is, whether health care utilization as measured by these two outcomes was influenced by the subscription to health insurance. The data contain indicators of two levels of insurance coverage: PUBLIC, which is the main source of insurance, and ADDON, which is a secondary optional insurance. In the sample of 27,326 observations (family/years), 24,203 individuals held the public insurance. (There is quite a lot of within-group variation in this. Individuals did not routinely obtain the insurance for all periods.) Of these 24,203, 23,689 had only public insurance and 514 had both types. (One could not have only the ADDON insurance.) To explore the issue, we have analyzed the DocVis variable with the count data models described in this section. The exogenous variables in our model are

$$
x_{it} = (1, \text{Age}, \text{Education}, \text{Income}, \text{Kids}, \text{Public}).
$$

(Variables are described in Example 11.11.) Table 25.1 presents the estimates of the several count models. In all specifications, the coefficient on PUBLIC is positive, large, and highly statistically significant, which is consistent with the results in the authors' study. The various test statistics strongly reject the hypothesis of equidispersion. Cameron and Trivedi's (1990) semiparametric tests from the Poisson model (see Section 25.2.2) have t statistics of 22.147 for g_i = μ_i and 22.504 for g_i = μ_i². Both of these are far larger than the critical value of 1.96. The LM statistic is 972,714.48, which is also larger than any critical value. On these bases, we would reject the hypothesis of equidispersion. The Wald and likelihood ratio tests based on the negative binomial models produce the same conclusion. For comparing the different negative binomial models, note that Negbin 2 is the worst of the three by the likelihood function, although NB1 and NB2 are not directly comparable. On the other hand, note that in the NBP model, the estimate of P is more than 10 standard errors from 1.0000 or 2.0000, so both NB1 and NB2 are rejected in favor of the unrestricted NBP form of the model. The NBP and the heterogeneous NB2 model are not nested either, but comparing the log-likelihoods, it does appear that the heterogeneous model is substantially superior. We computed the Vuong statistic based on the individual contributions to the log-likelihoods, with v_i = ln L_i(NBP) − ln L_i(NB2-H). (See Section 7.3.4.) The value of the statistic is −3.27. On this basis, we would reject NBP in favor of NB2-H. Finally, with regard to the original question, the coefficient on PUBLIC is larger than 10 times the estimated standard error in every specification. We would conclude that the results are consistent with the proposition that there is evidence of moral hazard.


TABLE 25.1 Estimated Models for DOCVIS (standard errors in parentheses)

Variable      Poisson        Negbin 2       Negbin 2 Het.  Negbin 1       Negbin P
Constant      0.7162         0.7628         0.7928         0.6848         0.6517
              (0.03287)      (0.07247)      (0.07459)      (0.06807)      (0.07759)
Age           0.01844        0.01803        0.01704        0.01585        0.01907
              (0.0003316)    (0.0007915)    (0.0008146)    (0.0007042)    (0.0008078)
Education     −0.03429       −0.03839       −0.03581       −0.02381       −0.03388
              (0.001797)     (0.003965)     (0.004036)     (0.003702)     (0.004308)
Income        −0.4751        −0.4206        −0.4108        −0.1892        −0.3337
              (0.02198)      (0.04700)      (0.04752)      (0.04452)      (0.05161)
Kids          −0.1582        −0.1513        −0.1568        −0.1342        −0.1622
              (0.007956)     (0.01738)      (0.01773)      (0.01647)      (0.01856)
Public        0.2364         0.2324         0.2411         0.1616         0.2195
              (0.01328)      (0.02900)      (0.03006)      (0.02678)      (0.03155)
P             0.0000         2.0000         2.0000         1.0000         1.5473
              (0.0000)       (0.0000)       (0.0000)       (0.0000)       (0.03444)
θ             0.0000         1.9242         2.6060         6.1865         3.2470
              (0.0000)       (0.02008)      (0.05954)      (0.06861)      (0.1346)
δ (Female)    0.0000         0.0000         −0.3838        0.0000         0.0000
              (0.0000)       (0.0000)       (0.02046)      (0.0000)       (0.0000)
δ (Married)   0.0000         0.0000         −0.1359        0.0000         0.0000
              (0.0000)       (0.0000)       (0.02307)      (0.0000)       (0.0000)
ln L          −104440.3      −60265.49      −60121.77      −60260.68      −60197.15

25.3 PANEL DATA MODELS

The familiar approaches to accommodating heterogeneity in panel data have fairly straightforward extensions in the count data setting. [Hausman, Hall, and Griliches (1984) give full details for these models.] We will examine them for the Poisson model. The authors [and Allison (2000)] also give results for the negative binomial model.

25.3.1 ROBUST COVARIANCE MATRICES

The standard asymptotic covariance matrix estimator for the Poisson model is

$$
\text{Est. Asy. Var}[\hat{\beta}] = \left[ -\frac{\partial^2 \ln L}{\partial \hat{\beta}\, \partial \hat{\beta}'} \right]^{-1} = \left[ \sum_{i=1}^{n} \hat{\lambda}_i x_i x_i' \right]^{-1} = \left[ X' \hat{\Lambda} X \right]^{-1},
$$

where Λ̂ is a diagonal matrix of predicted values. The BHHH estimator is

$$
\text{Est. Asy. Var}[\hat{\beta}] = \left[ \sum_{i=1}^{n} \left( \frac{\partial \ln P_i}{\partial \hat{\beta}} \right) \left( \frac{\partial \ln P_i}{\partial \hat{\beta}} \right)' \right]^{-1} = \left[ \sum_{i=1}^{n} (y_i - \hat{\lambda}_i)^2 x_i x_i' \right]^{-1} = \left[ X' \hat{E}^2 X \right]^{-1},
$$


where Ê is a diagonal matrix of residuals. The Poisson model is one in which the MLE is robust to certain misspecifications of the model, such as the failure to incorporate latent heterogeneity in the mean (i.e., one fits the Poisson model when the negative binomial is appropriate). In this case, a robust covariance matrix is the "sandwich" estimator,

$$
\text{Robust Est. Asy. Var}[\hat{\beta}] = \left[ X'\hat{\Lambda}X \right]^{-1} \left[ X'\hat{E}^2 X \right] \left[ X'\hat{\Lambda}X \right]^{-1},
$$

which is appropriate to accommodate this failure of the model. It has become common to employ this estimator with all specifications, including the negative binomial. One might question the virtue of this. Because the negative binomial model already accounts for the latent heterogeneity, it is unclear what additional failure of the assumptions of the model this estimator would be robust to. The questions raised in Sections 16.8.3 and 16.8.4 about robust covariance matrices would be relevant here.

A related calculation is used when observations occur in groups that may be correlated. This would include a random effects setting in a panel in which observations have a common latent heterogeneity as well as more general, stratified, and clustered data sets. The parameter estimator is unchanged in this case (and an assumption is made that the estimator is still consistent), but an adjustment is made to the estimated asymptotic covariance matrix. The calculation is done as follows. Suppose the n observations are assembled in G clusters of observations, in which the number of observations in the ith cluster is n_i. Thus, Σ_{i=1}^{G} n_i = n. Denote by β the full set of model parameters in whatever variant of the model is being estimated. Let the observation-specific gradients and Hessians be g_ij = ∂ln L_ij/∂β = (y_ij − λ_ij)x_ij and H_ij = ∂²ln L_ij/∂β∂β' = −λ_ij x_ij x_ij'. The uncorrected estimator of the asymptotic covariance matrix based on the Hessian is

$$
V_H = -H^{-1} = \left[ -\sum_{i=1}^{G} \sum_{j=1}^{n_i} H_{ij} \right]^{-1}.
$$

The corrected asymptotic covariance matrix is

$$
\text{Est. Asy. Var}[\hat{\beta}] = V_H \left[ \frac{G}{G-1} \sum_{i=1}^{G} \left( \sum_{j=1}^{n_i} g_{ij} \right) \left( \sum_{j=1}^{n_i} g_{ij} \right)' \right] V_H.
$$

Note that if there is exactly one observation per cluster, then this is G/(G − 1) times the sandwich (robust) estimator.
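A minimal sketch of the cluster correction for the Poisson MLE, assuming NumPy (names hypothetical):

```python
import numpy as np

def cluster_robust_cov(y, X, lam_hat, cluster_ids):
    """Cluster-corrected covariance matrix for a Poisson MLE."""
    H = (X * lam_hat[:, None]).T @ X            # minus the summed Hessian
    V_H = np.linalg.inv(H)
    g = X * (y - lam_hat)[:, None]              # observation-specific gradients
    clusters = np.unique(cluster_ids)
    G = len(clusters)
    meat = np.zeros((X.shape[1], X.shape[1]))
    for c in clusters:
        gc = g[cluster_ids == c].sum(axis=0)    # within-cluster gradient sum
        meat += np.outer(gc, gc)
    return V_H @ (G / (G - 1) * meat) @ V_H
```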

25.3.2 FIXED EFFECTS

Consider first a fixed effects approach. The Poisson distribution is assumed to have conditional mean

$$
\ln \lambda_{it} = x_{it}'\beta + \alpha_i, \tag{25-7}
$$

where now, x_it has been redefined to exclude the constant term. The approach used in the linear model of transforming y_it to group mean deviations does not remove the heterogeneity, nor does it leave a Poisson distribution for the transformed variable. However, the Poisson model with fixed effects can be fit using the methods described for the probit model in Section 23.5.2. The extension to the Poisson model requires only the minor modifications, g_it = (y_it − λ_it) and h_it = −λ_it. Everything else in that derivation applies with only a simple change in the notation. The first-order conditions


for maximizing the log-likelihood function for the Poisson model will include

$$
\frac{\partial \ln L}{\partial \alpha_i} = \sum_{t=1}^{T_i} \left( y_{it} - e^{\alpha_i}\mu_{it} \right) = 0, \qquad \text{where } \mu_{it} = e^{x_{it}'\beta}.
$$

This implies an explicit solution for α_i in terms of β in this model,

$$
\hat{\alpha}_i = \ln\!\left[ \frac{(1/T_i)\sum_{t=1}^{T_i} y_{it}}{(1/T_i)\sum_{t=1}^{T_i} \hat{\mu}_{it}} \right] = \ln\!\left( \frac{\bar{y}_i}{\bar{\hat{\mu}}_i} \right). \tag{25-8}
$$

Unlike the regression or the probit model, this does not require that there be within-group variation in y_it; all the values can be the same. It does require that at least one observation for individual i be nonzero, however. The rest of the solution for the fixed effects estimator follows the same lines as that for the probit model. An alternative approach, albeit with little practical gain, would be to concentrate the log-likelihood function by inserting this solution for α_i back into the original log-likelihood, then maximizing the resulting function of β. While logically this makes sense, the approach suggested earlier for the probit model is simpler to implement.

An estimator that is not a function of the fixed effects is found by obtaining the joint distribution of (y_i1, …, y_iT_i) conditional on their sum. For the Poisson model, a close cousin to the multinomial logit model discussed earlier is produced:

$$
p\!\left( y_{i1}, y_{i2}, \ldots, y_{iT_i} \,\Big|\, \sum_{t=1}^{T_i} y_{it} \right) = \frac{\left( \sum_{t=1}^{T_i} y_{it} \right)!}{\prod_{t=1}^{T_i} y_{it}!} \prod_{t=1}^{T_i} p_{it}^{y_{it}}, \tag{25-9}
$$

where

$$
p_{it} = \frac{e^{x_{it}'\beta + \alpha_i}}{\sum_{t=1}^{T_i} e^{x_{it}'\beta + \alpha_i}} = \frac{e^{x_{it}'\beta}}{\sum_{t=1}^{T_i} e^{x_{it}'\beta}}. \tag{25-10}
$$

The contribution of group i to the conditional log-likelihood is

$$
\ln L_i = \sum_{t=1}^{T_i} y_{it} \ln p_{it}.
$$
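Before continuing, a minimal sketch of this conditional log-likelihood, (25-9) and (25-10), assuming NumPy; the multiplicative multinomial constant is omitted because it does not involve β, and the names are hypothetical.

```python
import numpy as np

def fe_poisson_condloglike(beta, y, X, ids):
    """Conditional fixed effects Poisson log-likelihood, up to a constant."""
    xb = X @ beta
    ll = 0.0
    for i in np.unique(ids):
        m = ids == i
        if y[m].sum() == 0:
            continue                       # all-zero groups contribute nothing
        # log p_it = x_it'b - log sum_s exp(x_is'b), computed stably
        z = xb[m] - xb[m].max()
        logp = z - np.log(np.exp(z).sum())
        ll += (y[m] * logp).sum()
    return ll

# Usage (hypothetical arrays y, X, ids):
# from scipy.optimize import minimize
# res = minimize(lambda b: -fe_poisson_condloglike(b, y, X, ids),
#                np.zeros(X.shape[1]), method="BFGS")
```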

Note, once again, that the contribution to ln L of a group in which y_it = 0 in every period is zero. Cameron and Trivedi (1998) have shown that these two approaches give identical results.

Hausman, Hall, and Griliches (1984) (HHG) report the following conditional density for the fixed effects negative binomial (FENB) model:

$$
p\!\left( y_{i1}, y_{i2}, \ldots, y_{iT_i} \,\Big|\, \sum_{t=1}^{T_i} y_{it} \right) = \frac{\Gamma\!\left( 1 + \sum_{t=1}^{T_i} y_{it} \right) \Gamma\!\left( \sum_{t=1}^{T_i} \lambda_{it} \right)}{\Gamma\!\left( \sum_{t=1}^{T_i} y_{it} + \sum_{t=1}^{T_i} \lambda_{it} \right)} \prod_{t=1}^{T_i} \frac{\Gamma(y_{it} + \lambda_{it})}{\Gamma(1 + y_{it})\, \Gamma(\lambda_{it})},
$$

which is free of the fixed effects. This is the default FENB formulation used in popular software packages such as SAS, Stata, and LIMDEP. Researchers accustomed to the admonishments that fixed effects models cannot contain overall constants or time-invariant covariates are sometimes surprised to find (perhaps accidentally) that this fixed effects model allows both. [This issue is explored at length in Allison (2000) and Allison and Waterman (2002).]


The resolution of this apparent contradiction is that the HHG FENB model is not obtained by shifting the conditional mean function by the fixed effect, ln λ_it = x_it'β + α_i, as it is in the Poisson model. Rather, the HHG model is obtained by building the fixed effect into the model as an individual-specific θ_i in the Negbin 1 form in (25-6). The conditional mean functions in the models are as follows (we have changed the notation slightly to conform to our earlier formulation):

$$
\text{NB1 (HHG):} \quad E[y_{it} \mid x_{it}] = \theta_i\phi_{it} = \theta_i\exp(x_{it}'\beta),
$$
$$
\text{NB2:} \quad E[y_{it} \mid x_{it}] = \exp(\alpha_i)\phi_{it} = \lambda_{it} = \exp(x_{it}'\beta + \alpha_i).
$$

The conditional variances are

$$
\text{NB1 (HHG):} \quad \mathrm{Var}[y_{it} \mid x_{it}] = \theta_i\phi_{it}\left[ 1 + \theta_i \right],
$$
$$
\text{NB2:} \quad \mathrm{Var}[y_{it} \mid x_{it}] = \lambda_{it}\left[ 1 + \theta\lambda_{it} \right].
$$

Letting μ_i = ln θ_i, it appears that the HHG formulation does provide a fixed effect in the mean, as now, E[y_it | x_it] = exp(x_it'β + μ_i). Indeed, by this construction, it appears (as the authors suggest) that there are separate effects in both the mean and the variance. They make this explicit by writing θ_i = exp(μ_i)γ_i so that in their model,

$$
E[y_{it} \mid x_{it}] = \gamma_i \exp(x_{it}'\beta + \mu_i),
$$
$$
\mathrm{Var}[y_{it} \mid x_{it}] = \gamma_i \exp(x_{it}'\beta + \mu_i)\left[ 1 + \gamma_i \exp(\mu_i) \right].
$$

The contradiction arises because the authors assert that μ_i and γ_i are separate parameters. In fact, they cannot vary separately; only θ_i can vary autonomously. The firm-specific effect in the HHG model is still isolated in the scaling parameter, which falls out of the conditional density. The mean is homogeneous, which explains why a separate constant, or a time-invariant regressor (or another set of firm-specific effects), can reside there. [See Greene (2007d) and Allison and Waterman (2002) for further discussion.]

25.3.3 RANDOM EFFECTS

The fixed effects approach has the same flaws and virtues in this setting as in the probit case. It is not necessary to assume that the heterogeneity is uncorrelated with the included, exogenous variables. If the uncorrelatedness of the regressors and the heterogeneity can be maintained, then the random effects model is an attractive alternative model. Once again, the approach used in the linear regression model, partial deviations from the group means followed by generalized least squares (see Chapter 9), is not usable here. The approach used is to formulate the joint probability conditioned upon the heterogeneity, then integrate it out of the joint distribution. Thus, we form

$$
p(y_{i1}, \ldots, y_{iT_i} \mid u_i) = \prod_{t=1}^{T_i} p(y_{it} \mid u_i).
$$

Then the random effect is swept out by obtaining

$$
p(y_{i1}, \ldots, y_{iT_i}) = \int_{u_i} p(y_{i1}, \ldots, y_{iT_i}, u_i)\, du_i
= \int_{u_i} p(y_{i1}, \ldots, y_{iT_i} \mid u_i)\, g(u_i)\, du_i
= E_{u_i}\!\left[ p(y_{i1}, \ldots, y_{iT_i} \mid u_i) \right].
$$


This is exactly the approach used earlier to condition the heterogeneity out of the Poisson model to produce the negative binomial model. If, as before, we take p(y_it | u_i) to be Poisson with mean λ_it = exp(x_it'β + u_i), in which exp(u_i) is distributed as gamma with mean 1.0 and variance 1/α, then the preceding steps produce a negative binomial distribution,

$$
p(y_{i1}, \ldots, y_{iT_i}) = \frac{\left[ \prod_{t=1}^{T_i} \lambda_{it}^{y_{it}} \right] \Gamma\!\left( \theta + \sum_{t=1}^{T_i} y_{it} \right)}{\left[ \prod_{t=1}^{T_i} y_{it}! \right] \left[ \sum_{t=1}^{T_i} \lambda_{it} \right]^{\sum_{t=1}^{T_i} y_{it}} \Gamma(\theta)}\, Q_i^{\theta} (1 - Q_i)^{\sum_{t=1}^{T_i} y_{it}}, \tag{25-11}
$$

where

$$
Q_i = \frac{\theta}{\theta + \sum_{t=1}^{T_i} \lambda_{it}}.
$$

For estimation purposes, we have a negative binomial distribution for Y_i = Σ_t y_it with mean Λ_i = Σ_t λ_it.

Like the fixed effects model, introducing random effects into the negative binomial model adds some additional complexity. We do note, because the negative binomial model derives from the Poisson model by adding latent heterogeneity to the conditional mean, adding a random effect to the negative binomial model might well amount to introducing the heterogeneity a second time. However, one might prefer to interpret the negative binomial as the density for y_it in its own right, and treat the common effects in the familiar fashion. Hausman et al.'s (1984) random effects negative binomial (RENB) model is a hierarchical model that is constructed as follows. The heterogeneity is assumed to enter λ_it additively with a gamma distribution with mean 1, Γ(θ_i, θ_i). Then, θ_i/(1 + θ_i) is assumed to have a beta distribution with parameters a and b [see Appendix B.4.6]. The resulting unconditional density after the heterogeneity is integrated out is

$$
p(y_{i1}, y_{i2}, \ldots, y_{iT_i}) = \frac{\Gamma(a + b)\, \Gamma\!\left( a + \sum_{t=1}^{T_i} \lambda_{it} \right) \Gamma\!\left( b + \sum_{t=1}^{T_i} y_{it} \right)}{\Gamma(a)\, \Gamma(b)\, \Gamma\!\left( a + \sum_{t=1}^{T_i} \lambda_{it} + b + \sum_{t=1}^{T_i} y_{it} \right)}.
$$

As before, the relationship between the heterogeneity and the conditional mean function is unclear, because the random effect impacts the parameter of the scedastic function. An alternative approach that maintains the essential flavor of the Poisson model (and other random effects models) is to augment the NB2 form with the random effect,

$$
\mathrm{Prob}(Y = y_{it} \mid x_{it}, \varepsilon_i) = \frac{\Gamma(\theta + y_{it})}{\Gamma(y_{it} + 1)\, \Gamma(\theta)}\, r_{it}^{y_{it}} (1 - r_{it})^{\theta},
$$
$$
\lambda_{it} = \exp(x_{it}'\beta + \varepsilon_i),
$$

$$
r_{it} = \lambda_{it}/(\theta + \lambda_{it}).
$$

We then estimate the parameters by forming the conditional (on ε_i) log-likelihood and integrating ε_i out either by quadrature or simulation. The parameters are simpler to interpret by this construction. Estimates of the two forms of the random effects model are presented in Example 25.2 for comparison.
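A minimal sketch of the simulated log-likelihood for the NB2 model with a normal random effect, assuming NumPy/SciPy; the names are hypothetical, and the R common draws are reused across individuals purely for simplicity.

```python
import numpy as np
from scipy.special import gammaln, logsumexp

def re_nb2_simulated_loglike(params, y, X, ids, R=200, seed=1):
    """Simulated log-likelihood; params = (beta, ln theta, ln sigma_eps)."""
    k = X.shape[1]
    beta, theta, s = params[:k], np.exp(params[k]), np.exp(params[k + 1])
    eps = s * np.random.default_rng(seed).standard_normal(R)
    ll = 0.0
    for i in np.unique(ids):
        m = ids == i
        lam = np.exp(X[m] @ beta)[:, None] * np.exp(eps)[None, :]  # (T_i, R)
        r = lam / (theta + lam)
        lnf = (gammaln(theta + y[m])[:, None] - gammaln(y[m] + 1.0)[:, None]
               - gammaln(theta) + y[m][:, None] * np.log(r)
               + theta * np.log(1.0 - r))
        ll += logsumexp(lnf.sum(axis=0)) - np.log(R)   # ln of the draw average
    return ll
```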


There is a mild preference in the received literature for the fixed effects estimators over the random effects estimators. The virtue of dispensing with the assumption of uncorrelatedness of the regressors and the group-specific effects is substantial. On the other hand, the assumption does come at a cost. To compute the probabilities or the marginal effects, it is necessary to estimate the constants, α_i. The unscaled coefficients in these models are of limited usefulness because of the nonlinearity of the conditional mean functions.

Other approaches to the random effects model have been proposed. Greene (1994, 1995a), Riphahn et al. (2003), and Terza (1995) specify a normally distributed heterogeneity, on the assumption that this is a more natural distribution for the aggregate of small independent effects. Brannas and Johanssen (1994) have suggested a semiparametric approach based on the GMM estimator by superimposing a very general form of heterogeneity on the Poisson model. They assume that conditioned on a random effect ε_it, y_it is distributed as Poisson with mean ε_itλ_it. The covariance structure of ε_it is allowed to be fully general. For t, s = 1, …, T, Var[ε_it] = σ_i², Cov[ε_it, ε_js] = γ_ij(|t − s|). For long time series, this model is likely to have far too many parameters to be identified without some restrictions, such as first-order homogeneity (β_i = β ∀ i), uncorrelatedness across groups [γ_ij(·) = 0 for i ≠ j], groupwise homoscedasticity (σ_i² = σ² ∀ i), and nonautocorrelatedness [γ(r) = 0 ∀ r ≠ 0]. With these assumptions, the estimation procedure they propose is similar to the procedures suggested earlier. If the model imposes enough restrictions, then the parameters can be estimated by the method of moments. The authors discuss estimation of the model in its full generality. Finally, the latent class model discussed in Section 16.9.7 and the random parameters model in Section 17.5 extend naturally to the Poisson model. Indeed, most of the received applications of the latent class structure have been in the Poisson regression framework. [See Greene (2001) for a survey.]

Example 25.2 Panel Data Models for Doctor Visits

The German health care panel data set contains 7,293 individuals with group sizes ranging from 1 to 7. Table 25.2 presents the fixed and random effects estimates of the equation for DocVis. The pooled estimates are also shown for comparison. Overall, the panel data treatments bring large changes in the estimates compared to the pooled estimates. There is also a considerable amount of variation across the specifications. With respect to the parameter of interest, Public, we find that the size of the coefficient falls substantially with all panel data treatments. Whether using the pooled, fixed, or random effects specifications, the test statistics (Wald, LR) all reject the Poisson model in favor of the negative binomial. Similarly, either common effects specification is preferred to the pooled estimator. There is no simple basis for choosing between the fixed and random effects models, and we have further blurred the distinction by suggesting two formulations of each of them. We do note that the two random effects estimators are producing similar results, which one might hope for. But, the two fixed effects estimators are producing very different estimates. The NB1 estimates include two coefficients, Income and Education, that are positive, but negative in every other case. Moreover, the coefficient on Public, which is large and significant throughout the table, has become small and less significant with the fixed effects estimators.

We also fit a three-class latent class model for these data. (See Section 16.9.7.) The three class probabilities were modeled as functions of Married and Female, which appear from the results to be significant determinants of the class sorting. The average prior probabilities for the three classes are 0.09212, 0.49361, and 0.41427. The coefficients on Public in the three classes, with associated t ratios, are 0.3388 (11.541), 0.1907 (3.987), and 0.1084 (4.282). The qualitative result concerning evidence of moral hazard suggested at the outset of Example 25.1 appears to be supported in a variety of specifications (with FE-NB1 the sole exception).

TABLE 25.2 Estimated Panel Data Models for Doctor Visits (standard errors in parentheses). [Columns: Poisson — Pooled (robust standard errors), Fixed Effects, Random Effects; Negative Binomial — Pooled NB2, FE-NB1, FE-NB2, HHG-Gamma, Normal. Rows: Constant, Age, Educ, Income, Kids, Public, θ, a, b, σ, ln L.]

921 Greene-50558 book June 25, 2007 17:35


25.4 HURDLE AND ZERO-ALTERED POISSON MODELS

In some settings, the zero outcome of the data-generating process is qualitatively different from the positive ones. Mullahy (1986) argues that this fact constitutes a shortcoming of the Poisson (or negative binomial) model and suggests a hurdle model as an alternative.6 In his formulation, a binary probability model determines whether a zero or a nonzero outcome occurs, then, in the latter case, a (truncated) Poisson distribution describes the positive outcomes. The model is

Prob(yi = 0 | xi) = exp(−θ),
Prob(yi = j | xi) = [1 − exp(−θ)] exp(−λi) λi^j / [j! (1 − exp(−λi))],  j = 1, 2, . . . .   (25-12)

This formulation changes the probability of the zero outcome and scales the remaining probabilities so that they sum to one. It adds a new restriction that Prob(yi = 0 | xi) no longer depends on the covariates, however. Therefore, a natural next step is to parameterize this probability. Mullahy suggests some formulations and applies the model to a sample of observations on daily beverage consumption.

Mullahy (1986), Heilbron (1989), Lambert (1992), Johnson and Kotz (1993), and Greene (1994) have analyzed an extension of the hurdle model in which the zero outcome can arise from one of two regimes.7 In one regime, the outcome is always zero. In the other, the usual Poisson process is at work, which can produce the zero outcome or some other. In Lambert's application, she analyzes the number of defective items produced by a manufacturing process in a given time interval. If the process is under control, then the outcome is always zero (by definition). If it is not under control, then the number of defective items is distributed as Poisson and may be zero or positive in any period. The model at work is therefore

Prob(yi = 0 | xi ) = Prob(regime 1) + Prob(yi = 0 | xi , regime 2)Prob(regime 2),

Prob(yi = j | xi ) = Prob(yi = j | xi , regime 2)Prob(regime 2), j = 1, 2,....

Let z denote a binary indicator of regime 1 (z = 0) or regime 2 (z = 1), and let y∗ denote the outcome of the Poisson process in regime 2. Then the observed y is z × y∗. A natural extension of the splitting model is to allow z to be determined by a set of covariates. These covariates need not be the same as those that determine the conditional probabilities in the Poisson process. Thus, the model is

Prob(zi = 1 | wi ) = F(wi ,γ),

Prob(yi = j | xi, zi = 1) = exp(−λi) λi^j / j!.
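The two-part structure is easy to see in a small simulation. The following is a minimal sketch in Python; the design matrices X and W, the parameter vectors, and the logistic choice of F are illustrative assumptions, not part of the original application.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(12345)

def simulate_zip(X, W, beta, gamma):
    """Draw y = z * y* from the zero-inflated Poisson model:
    z_i ~ Bernoulli[F(w_i'gamma)] with logistic F, y* ~ Poisson[exp(x_i'beta)]."""
    lam = np.exp(X @ beta)                      # Poisson mean in regime 2
    F = 1.0 / (1.0 + np.exp(-(W @ gamma)))      # Prob(z_i = 1 | w_i)
    z = rng.binomial(1, F)                      # regime indicator
    return z * rng.poisson(lam)                 # observed count

def zip_pmf(j, x, w, beta, gamma):
    """Prob(y_i = j | x_i, w_i) implied by the two-regime structure."""
    lam = np.exp(x @ beta)
    F = 1.0 / (1.0 + np.exp(-(w @ gamma)))
    return (j == 0) * (1.0 - F) + F * poisson.pmf(j, lam)
```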

6For a similar treatment in a continuous data application, see Cragg (1971).
7The model is variously labeled the “With Zeros,” or WZ, model [Mullahy (1986)], the Zero Inflated Poisson, or ZIP, model [Lambert (1992)], and “Zero-Altered Poisson,” or ZAP, model [Greene (1994)].


The mean in this distribution is

E[yi | wi] = (1 − F) × 0 + F × E[yi∗ | xi, yi∗ > 0] = F × λi / (1 − exp(−λi)).

Lambert (1992) and Greene (1994) consider a number of alternative formulations, including logit and probit models discussed in Sections 23.3 and 23.4, for the probability of the two regimes.

Both of these modifications substantially alter the Poisson formulation. First, note that the equality of the mean and variance of the distribution no longer follows; both modifications induce overdispersion. On the other hand, the overdispersion does not arise from heterogeneity; it arises from the nature of the process generating the zeros. As such, an interesting identification problem arises in this model. If the data do appear to be characterized by overdispersion, then it seems less than obvious whether it should be attributed to heterogeneity or to the regime splitting mechanism. Mullahy (1986) argues the point more strongly. He demonstrates that overdispersion will always induce excess zeros. As such, in a splitting model, we are likely to misinterpret the excess zeros as due to the splitting process instead of the heterogeneity.

It might be of interest to test simply whether there is a regime splitting mechanism at work or not. Unfortunately, the basic model and the zero-inflated model are not nested. Setting the parameters of the splitting model to zero, for example, does not produce Prob[z = 0] = 0. In the probit case, this probability becomes 0.5, which maintains the regime split. The preceding tests for over- or underdispersion would be rather indirect. What is desired is a test of non-Poissonness. An alternative distribution may (but need not) produce a systematically different proportion of zeros than the Poisson. Testing for a different distribution, as opposed to a different set of parameters, is a difficult procedure. Because the hypotheses are necessarily nonnested, the power of any test is a function of the alternative hypothesis and may, under some alternatives, be small. Vuong (1989) has proposed a test statistic for nonnested models that is well suited for this setting when the alternative distribution can be specified. Let fj(yi | xi) denote the predicted probability that the random variable Y equals yi under the assumption that the distribution is fj(yi | xi), for j = 1, 2, and let

mi = ln[f1(yi | xi) / f2(yi | xi)].

Then Vuong's statistic for testing the nonnested hypothesis of model 1 versus model 2 is

v = [√n (1/n) Σi mi] / √[(1/n) Σi (mi − m̄)²],

where m̄ is the sample mean of the mi.

This is the standard statistic for testing the hypothesis that E [mi ] equals zero. Vuong shows that v has a limiting standard normal distribution. As he notes, the statistic is bidirectional. If |v| is less than two, then the test does not favor one model or the other. Otherwise, large values favor model 1 whereas small (negative) values favor model 2. Carrying out the test requires estimation of both models and computation of both sets of predicted probabilities. In Greene (1994), it is shown that the Vuong test has some power to discern this phenomenon. The logic of the testing procedure is to allow for overdispersion by


TABLE 25.3 Estimates of a Split Population Model

                 Poisson and Logit Models                        Split Population Model
Variable     Poisson for y          Logit for y > 0          Poisson for y          Logit for y > 0
Constant     −0.8196 (0.1453)       −2.2442 (0.2515)         1.0010 (0.1267)        2.1540 (0.2900)
Age          0.007181 (0.003978)    0.02245 (0.007313)       −0.005073 (0.003218)   −0.02469 (0.008451)
Income       0.07790 (0.02394)      0.06931 (0.04198)        0.01332 (0.02249)      −0.1167 (0.04941)
Expend       −0.004102 (0.0003740)                           −0.002359 (0.0001948)
Own/Rent                            −0.3766 (0.1578)                                0.3865 (0.1709)
ln L         −1396.719              −645.5649                −1093.0280
nP̂(0 | x̂)    938.6                                           1061.5

specifying a negative binomial count data process, then examine whether, even allowing for the overdispersion, there still appear to be excess zeros. In his application, that appears to be the case.
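Computationally the statistic is simple. Here is a minimal sketch, assuming each fitted model can deliver the vector of predicted probabilities of the observed outcomes:

```python
import numpy as np

def vuong_statistic(p1, p2):
    """Vuong test of nonnested model 1 vs. model 2.
    p1, p2: predicted probabilities f_j(y_i | x_i) of the observed y_i
    under each fitted model. Limiting distribution is standard normal;
    |v| < 2 is treated as inconclusive."""
    m = np.log(p1) - np.log(p2)                 # log ratio of predicted probabilities
    n = len(m)
    return np.sqrt(n) * m.mean() / m.std()      # .std() uses 1/n, matching the formula
```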

Example 25.3 A Split Population Model for Major Derogatory Reports
Greene (1995c) estimated a model of consumer behavior in which the dependent variable of interest was the number of major derogatory reports recorded in the credit history for a sample of applicants for a type of credit card. The basic model predicts yi, the number of major derogatory credit reports, as a function of xi = [1, age, income, average expenditure]. The data for the model appear in Appendix Table F25.1. There are 1,319 observations in the sample (10 percent of the original data set). Inspection of the data reveals a preponderance of zeros. Indeed, of 1,319 observations, 1,060 have yi = 0, whereas of the remaining 259, 137 have 1, 50 have 2, 24 have 3, 17 have 4, and 11 have 5; the remaining 20 range from 6 to 14. Thus, for a Poisson distribution, these data are actually a bit extreme. We propose to use Lambert's zero inflated Poisson model instead, with the Poisson distribution built around

ln λi = β1 + β2 age + β3 income + β4 expenditure.

For the splitting model, we use a logit model, with covariates z = [1, age, income, own/rent]. The estimates are shown in Table 25.3. Vuong's diagnostic statistic appears to confirm intuition that the Poisson model does not adequately describe the data; the value is 6.9788. Using the model parameters to compute a prediction of the number of zeros, it is clear that the splitting model does perform better than the basic Poisson regression.

25.5 CENSORING AND TRUNCATION IN MODELS FOR COUNTS

Truncation and censoring are relatively common in applications of models for counts. Truncation often arises as a consequence of discarding what appear to be unusable data, such as the zero values in survey data on the number of uses of recreation facilities [Shaw (1988) and Bockstael et al. (1990)]. The zero values in this setting might represent a discrete decision not to visit the site, which is a qualitatively different decision from the positive number for someone who had decided to make at least one visit. In such


a case, it might make sense to confine attention to the nonzero observations, thereby truncating the distribution. Censoring, in contrast, is often employed to make survey data more convenient to gather and analyze. For example, survey data on access to medical facilities might ask, “How many trips to the doctor did you make in the last year?” The responses might be 0, 1, 2, 3, or more.

25.5.1 CENSORING AND TRUNCATION IN THE POISSON MODEL

Models with these characteristics can be handled within the Poisson and negative binomial regression frameworks by using the laws of probability to modify the likelihood. For example, in the censored data case,

Pi(j) = Prob[yi = j] = exp(−λi) λi^j / j!,  j = 0, 1, 2,

Pi(3) = Prob[yi ≥ 3] = 1 − [Prob(yi = 0) + Prob(yi = 1) + Prob(yi = 2)].

The probabilities in the model with truncation at zero would be

Pi(j) = Prob[yi = j] = exp(−λi) λi^j / {[1 − Pi(0)] j!} = exp(−λi) λi^j / {[1 − exp(−λi)] j!},  j = 1, 2, . . . .   (25-13)

These models are not appreciably more complicated to analyze than the basic Poisson or negative binomial models. [See Terza (1985b), Mullahy (1986), Shaw (1988), Grogger and Carson (1991), Greene (1998), Lambert (1992), and Winkelmann (1997).] They do, however, bring substantive changes to the familiar characteristics of the models. For example, the conditional means are no longer λi; in the censoring case,

E[yi | xi] = λi − Σj=3..∞ (j − 3) Pi(j) < λi.

Marginal effects are changed as well. Recall that our earlier result for the count data models was ∂E[yi | xi]/∂xi = λi β. With censoring or truncation, it is straightforward in general to show that ∂E[yi | xi]/∂xi = δi β, but the new scale factor need not be smaller than λi.
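Both adjustments translate directly into a maximum likelihood routine. The following sketch (Python with scipy; the censoring point c = 3 and the data arrays X and y are illustrative assumptions) writes the two log-likelihoods straight from the probabilities above:

```python
import numpy as np
from scipy.stats import poisson
from scipy.optimize import minimize

def negloglik_censored_poisson(beta, X, y, c=3):
    """Poisson log-likelihood with right censoring: y = c records 'c or more'."""
    lam = np.exp(X @ beta)
    ll = np.where(y < c,
                  poisson.logpmf(y, lam),            # exact counts 0, ..., c-1
                  np.log(poisson.sf(c - 1, lam)))    # Prob[y >= c], as in Pi(3)
    return -ll.sum()

def negloglik_truncated_poisson(beta, X, y):
    """Poisson log-likelihood truncated at zero, equation (25-13)."""
    lam = np.exp(X @ beta)
    ll = poisson.logpmf(y, lam) - np.log(1.0 - np.exp(-lam))
    return -ll.sum()

# e.g.: minimize(negloglik_censored_poisson, np.zeros(X.shape[1]),
#                args=(X, y), method="BFGS")
```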

25.5.2 APPLICATION: CENSORING IN THE TOBIT AND POISSON REGRESSION MODELS

In 1969, the popular magazine Psychology Today published a 101-question survey on sex and asked its readers to mail in their answers. The results of the survey were discussed in the July 1970 issue. From the approximately 2,000 replies that were collected in electronic form (of about 20,000 received), Professor Ray Fair (1978) extracted a sample of 601 observations on men and women then currently married for the first time and analyzed their responses to a question about extramarital affairs. He used the tobit model as a platform. Fair's analysis in this frequently cited study suggests several interesting econometric questions. [In addition, his 1977 companion paper in Econometrica on estimation of the tobit model contributed to the development of the EM algorithm, which was published by and is usually associated with Dempster, Laird, and Rubin (1977).]


As noted, Fair used the tobit model as his estimation framework for this study. The nonexperimental nature of the data (which can be downloaded from the Internet at http://fairmodel.econ.yale.edu/rayfair/work.ss.htm) provides a fine laboratory case that we can use to examine the relationships among the tobit, truncated regression, and probit models. In addition, as we will explore later, although the tobit model seems to be a natural choice for the model for these data, a closer look suggests that the models for counts we have examined at several points earlier might yet be a better choice. Finally, the preponderance of zeros in the data that initially motivated the tobit model suggests that even the standard Poisson model, although an improvement, might still be inadequate. In this example, we will reestimate Fair's original model and then apply some of the specification tests and modified models for count data as alternatives. The study was based on 601 observations on the following variables (full details on data coding are given in the data file and Appendix Table F25.2): y = number of affairs in the past year, 0, 1, 2, 3, 4–10 coded as 7, “monthly, weekly, or daily,” coded as 12. Sample mean = 1.46. Frequencies = (451, 34, 17, 19, 42, 38).

z1 = sex = 0 for female, 1 for male. Sample mean = 0.476.

z2 = age. Sample mean = 32.5.

z3 = number of years married. Sample mean = 8.18.

z4 = children, 0 = no, 1 = yes. Sample mean = 0.715.

z5 = religiousness, 1 = anti, . . . , 5 = very. Sample mean = 3.12.

z6 = education, years, 9 = grade school, 12 = high school, . . . , 20 = Ph.D. or other. Sample mean = 16.2.

z7 = occupation, “Hollingshead scale,” 1–7. Sample mean = 4.19.

z8 = self-rating of marriage, 1 = very unhappy, . . . , 5 = very happy. Sample mean = 3.93.

The tobit model was fit to y using a constant term and all eight variables. A restricted model was fit by excluding z1, z4, and z6, none of which was individually statistically significant in the model. We are able to match exactly Fair's results for both equations. The log-likelihood functions for the full and restricted models are −704.7311 and −705.5762. The chi-squared statistic for testing the hypothesis that the three coefficients are zero is twice the difference, 1.6902. The critical value from the chi-squared distribution with three degrees of freedom is 7.81, so the hypothesis that the coefficients on these three variables are all zero is not rejected. The Wald and Lagrange multiplier statistics are likewise small, 6.59 and 1.681. Based on these results, we will continue the analysis using the restricted set of variables, Z = (1, z2, z3, z5, z7, z8). Our interest is solely in the numerical results of different modeling approaches. Readers may draw their own conclusions and interpretations from the estimates.

Table 25.4 presents parameter estimates based on Fair's specification of the normal distribution. The inconsistent least squares estimates appear at the left as a basis for comparison. The maximum likelihood tobit estimates appear next. The sample is heavily dominated by observations with y = 0 (451 of 601, or 75 percent), so the marginal effects are very different from the coefficients, smaller by a multiple of roughly 0.234. The scale factor


TABLE 25.4 Model Estimates Based on the Normal Distribution (standard errors in parentheses)

                         Tobit                                        Truncated Regression
           Least                    Marginal   Scaled      Probit                  Marginal
           Squares       Estimate   Effect     by 1/σ      Estimate    Estimate    Effect
Variable   (1)           (2)        (3)        (4)         (5)         (6)         (7)
Constant   5.61          8.17       —          0.991       0.977       8.32        —
           (0.797)       (2.74)                (0.336)     (0.361)     (3.96)
z2         −0.0504       −0.179     −0.042     −0.022      −0.022      −0.0841     −0.0407
           (0.0221)      (0.079)    (0.184)    (0.010)     (0.102)     (0.119)     (0.0578)
z3         0.162         0.554      0.130      0.0672      0.0599      0.560       0.271
           (0.0369)      (0.135)    (0.0312)   (0.0161)    (0.0171)    (0.219)     (0.106)
z5         −0.476        −1.69      −0.394     −0.2004     −0.184      −1.502      −0.728
           (0.111)       (0.404)    (0.093)    (0.484)     (0.0515)    (0.617)     (0.299)
z7         0.106         0.326      0.0762     0.0395      0.0375      0.189       0.0916
           (0.0711)      (0.254)    (0.0595)   (0.0308)    (0.0328)    (0.377)     (0.182)
z8         −0.712        −2.29      −0.534     −0.277      −0.273      −1.35       −0.653
           (0.118)       (0.408)    (0.0949)   (0.0483)    (0.0525)    (0.565)     (0.273)
σ          3.09          8.25                                          5.53
ln L                     −705.5762                         −307.2955   −392.7103

is computed using the results of Theorem 24.4 for left censoring at zero and the upper limit of +∞, with all variables evaluated at the sample means and the parameters equal to the maximum likelihood estimates:

scale = Φ[(+∞ − x̄′β̂ML)/σ̂ML] − Φ[(0 − x̄′β̂ML)/σ̂ML] = 1 − Φ(−x̄′β̂ML/σ̂ML) = Φ(x̄′β̂ML/σ̂ML) = 0.234.

These estimates are shown in the third column. As expected, they resemble the least squares estimates, although not enough that one would be content to use OLS for estimation. The fifth column in Table 25.4 gives estimates of the probit model estimated for the dependent variable qi = 0 if yi = 0, qi = 1 if yi > 0. If the specification of the tobit model is correct, then the probit estimators should be consistent for (1/σ)β from the tobit model. These estimates, with standard errors computed using the delta method, are shown in column 4. The results are surprisingly close, especially given the results of the specification test considered later. Finally, columns 6 and 7 give the estimates for the truncated regression model that applies to the 150 nonlimit observations if the specification of the model is correct. Here the results seem a bit less consistent.

Several specification tests were suggested for this model. The Cragg/Greene test for appropriate specification of Prob[yi = 0] is given in Section 24.3.4.b. This test is easily carried out using the log-likelihood values in the table. The chi-squared statistic, which has seven degrees of freedom, is

−2{−705.5762 − [−307.2955 + (−392.7103)]} = 11.141,

which is smaller than the critical value of 14.067. We conclude that the tobit model is correctly specified (the decision of whether or not is not different from the decision of how many, given “whether”). We now turn to the normality tests. We emphasize that these tests are nonconstructive tests of the skewness and kurtosis of the distribution of ε. A fortiori, if we do reject the hypothesis that these values are 0.0 and 3.0, respectively, then we can reject normality. But that does not suggest what to do next. We turn to


that issue later. The Chesher–Irish and Pagan–Vella chi-squared statistics are 562.218 and 22.314, respectively. The critical value is 5.99, so on the basis of both of these values, the hypothesis of normality is rejected. Thus, both the probability model and the distributional framework are rejected by these tests.

Before leaving the tobit model, we consider one additional aspect of the original specification. The values above 4 in the observed data are not true observations on the response; 7 is an estimate of the mean of observations that fall in the range 4 to 10, whereas 12 was chosen more or less arbitrarily for observations that were greater than 10. These observations represent 80 of the 601 observations, or about 13 percent of the sample. To some extent, this coding scheme might be driving the results. [This point was not overlooked in the original study; “[a] linear specification was used for the estimated equation, and it did not seem reasonable in this case, given the range of explanatory variables, to have a dependent variable that ranged from, say, 0 to 365” [Fair (1978, p. 55)].] The tobit model allows for censoring in both tails of the distribution. Ignoring the results of the specification tests for the moment, we will examine a doubly censored regression by recoding all observations that take the values 7 or 12 as 4. The model is thus

y∗ = x′β + ε,
y = 0 if y∗ ≤ 0,
y = y∗ if 0 < y∗ < 4,
y = 4 if y∗ ≥ 4.

The log-likelihood is built up from three sets of terms:

ln L = Σ(y=0) ln Φ[(0 − x′β)/σ] + Σ(0<y<4) ln{(1/σ)φ[(y − x′β)/σ]} + Σ(y=4) ln{1 − Φ[(4 − x′β)/σ]}.
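A direct coding of this three-part log-likelihood, as a sketch in Python (scipy's norm supplies the normal cdf and pdf; the data arrays, limits, and starting values are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def negloglik_two_limit_tobit(theta, X, y, L=0.0, U=4.0):
    """Two-limit tobit: y* = x'b + e, observed y censored at L below and U above."""
    beta, sigma = theta[:-1], np.exp(theta[-1])   # log parameterization keeps sigma > 0
    xb = X @ beta
    ll = np.where(y <= L, norm.logcdf((L - xb) / sigma),
         np.where(y >= U, norm.logsf((U - xb) / sigma),
                  norm.logpdf((y - xb) / sigma) - np.log(sigma)))
    return -ll.sum()

# e.g.: minimize(negloglik_two_limit_tobit, np.zeros(X.shape[1] + 1),
#                args=(X, y), method="BFGS")
```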

TABLE 25.5 Estimates of a Doubly Censored Tobit Model

                  Left Censored at 0 Only                 Censored at Both 0 and 4
                          Standard     Marginal                     Standard     Marginal
Variable     Estimate     Error        Effect       Estimate        Error        Effect
Constant     8.17         2.74         —            7.90            2.804        —
z2           −0.179       0.079        −0.0420      −0.178          0.080        −0.0218
z3           0.554        0.135        0.130        0.532           0.141        0.0654
z5           −1.69        0.404        −0.394       −1.62           0.424        −0.199
z7           0.326        0.254        0.0762       0.324           0.254        0.0399
z8           −2.29        0.408        −0.534       −2.21           0.459        −0.271
σ            8.25     Prob(nonlimit) = 0.2338       7.94       Prob(nonlimit) = 0.1229
E[y | x = E[x]]       1.126                                    0.226


for the singly censored data was 0.2338. For the doubly censored variable, this factor is

Φ[(4 − x̄′β)/σ] − Φ[(0 − x̄′β)/σ] = 0.8930 − 0.7701 = 0.1229.

The regression model for y∗ has not changed much, but the effect now is to assign the upper tail area to the censored region, whereas before it was in the uncensored region. The effect, then, is to reduce the scale roughly by this 0.107, from 0.234 to about 0.123.

By construction, the tobit model should only be viewed as an approximation for these data. The dependent variable is a count, not a continuous measurement. (Thus, the testing results obtained earlier are not surprising.) The Poisson regression model, or perhaps one of the many variants of it, should be a preferable modeling framework. Table 25.6 presents estimates of the Poisson and negative binomial regression models. There is ample evidence of overdispersion in these data; the t ratio on the estimated overdispersion parameter is 7.014/0.945 = 7.42, which is strongly suggestive. The large absolute value of the coefficient is likewise suggestive.

Before proceeding to a model that specifically accounts for overdispersion, we can find a candidate for its source, at least to some degree. As discussed earlier, responses of 7 and 12 do not represent the actual counts. It is unclear what the effect of the first recoding would be, because it might well be the mean of the observations in this group. But the second is clearly a censored observation. To remove both of these effects, we have recoded both the values 7 and 12 as 4 and treated this observation (appropriately) as a censored observation, with 4 denoting “4 or more.” As shown in the third and fourth sets of results in Table 25.6, the effect of this treatment of the data is greatly to reduce the measured effects, which is the same effect we observed for the tobit model. Although this step does remove a deficiency in the data, it does not remove the overdispersion; at this point, the negative binomial model is still the preferred specification.

TABLE 25.6 Model Estimates Based on the Poisson Distribution

                  Poisson Regression                      Negative Binomial Regression
                          Standard     Marginal                     Standard     Marginal
Variable     Estimate     Error        Effect       Estimate        Error        Effect

Based on Uncensored Poisson Distribution
Constant     2.53         0.197        —            2.19            0.664        —
z2           −0.0322      0.00585      −0.0470      −0.0262         0.0192       −0.00393
z3           0.116        0.00991      0.168        0.0848          0.0350       0.127
z5           −0.354       0.0309       −0.515       −0.422          0.111        −0.632
z7           0.0798       0.0194       0.116        0.0604          0.0702       0.0906
z8           −0.409       0.0274       −0.0596      −0.431          0.111        −0.646
α                                                   7.01            0.786
ln L         −1427.037                              −728.2441

Based on Poisson Distribution Right Censored at y = 4
Constant     1.90         0.283        —            4.79            1.16         —
z2           −0.0328      0.00838      −0.0235      −0.0166         0.0250       −0.00428
z3           0.105        0.0140       0.0754       0.174           0.0568       0.045
z5           −0.323       0.0437       −0.232       −0.723          0.198        −0.186
z7           0.0798       0.0275       0.0521       0.0900          0.116        0.0232
z8           −0.390       0.0391       −0.279       −0.854          0.216        −0.220
α                                                   9.40            1.35
ln L         −747.7541                              −482.0505


FIGURE 25.1 Histogram for Model Predictions (frequency by value of the censored count, YC = 0, . . . , 4).

The tobit model remains the standard approach to modeling a dependent variable that displays a large cluster of limit values, usually zeros. But in these data, it is clear that the zero value represents something other than a censoring; it is the outcome of a discrete decision. Thus, for this reason and based on the preceding results, it seems appropriate to turn to a different model for this dependent variable. The Poisson and negative binomial models look like an improvement, but there remains a noteworthy problem. Figure 25.1 shows a histogram of the actual values (solid dark bars) and predicted values from the negative binomial model estimated with the censored data (lighter bars). Predictions from the latter model are the integer values of E[y | x] = exp(x′β). As in the actual data, values larger than 4 are censored to 4. Evidently, the negative binomial model predicts the data fairly poorly. In fact, it is not hard to see why. The source of the overdispersion in the data is not the extreme values on the right of the distribution; it is the very large number of zeros on the left. There are a large variety of models and permutations that one might turn to at this point. We will conclude with just one of these, Lambert's (1992) zero-inflated Poisson (ZIP) model with a logit “splitting” model discussed in Section 25.4 and Example 25.3. The doubly censored count is the dependent variable in this model. [Mullahy's (1986) hurdle model is an alternative that might be considered. The difference between these two is in the interpretation of the zero observations. In the ZIP formulation, the zero observations would be a mixture of “never” and “not in the last year,” whereas the hurdle model assumes two distinct decisions, “whether or not” and “how many, given yes.”] The estimates of the parameters of the ZIP model are shown in Table 25.7. The Vuong statistic of 21.64 strongly supports the ZIP model over the Poisson model. (An attempt to combine the ZIP model with the negative binomial was unsuccessful.


TABLE 25.7 Estimates of a Zero-Inflated Poisson Model

              Poisson Regression      Logit Splitting Model            Marginal Effects
                        Standard                Standard
Variable      Estimate  Error         Estimate  Error          ZIP        Tobit (0)   Tobit (0, 4)
Constant      1.27      0.439         −1.85     0.664          —          —           —
Age           −0.00422  0.0122        0.0397    0.0190         −0.0216    −0.0420     −0.0218
Years         0.0331    0.0231        −0.0981   0.0318         0.0702     0.130       0.0654
Religion      −0.0909   0.0721        0.306     0.0951         −0.210     −0.394      −0.199
Occupation    0.0205    0.0441        0.0677    0.0607         0.0467     0.0762      0.0399
Happiness     −0.817    0.0666        0.458     0.0949         −0.273     −0.534      −0.271

Because, as expected, the ancillary model for the zeros accounted for the overdispersion in the data, the negative binomial model degenerated to the Poisson form.) Finally, the marginal effects, δ = ∂E[y | x]/∂x, are shown in Table 25.7 for three models: the ZIP model, Fair's original tobit model, and the tobit model estimated with the doubly censored count. The estimates for the ZIP model are considerably lower than those for Fair's tobit model. When the tobit model is reestimated with the censoring on the right, however, the resulting marginal effects are reasonably close to those from the ZIP model, although uniformly smaller. (This result may stem from not building the censoring into the ZIP model, a refinement that would be relatively straightforward.) We conclude that the original tobit model provided only a fair approximation to the marginal effects produced by (we contend) the more appropriate specification of the Poisson model. But the approximation became much better when the data were recoded and treated as censored. Figure 25.1 also shows the predictions from the ZIP model (narrow bars). As might be expected, it provides a much better prediction of the dependent variable. (The integer values of the conditional mean function for the tobit model were roughly evenly split between zeros and ones, whereas the doubly censored model always predicted y = 0.) Surprisingly, the treatment of the highest observations does greatly affect the outcome. If the ZIP model is fit to the original uncensored data, then the vector of marginal effects is δ = [−0.0586, 0.2446, −0.692, 0.115, −0.787], which is extremely large. Thus, perhaps more analysis is called for—the ZIP model can be further improved, and one might reconsider the hurdle model—but we have tortured Fair's data enough. Further exploration is left for the reader.

25.6 MODELS FOR DURATION DATA8

Intuition might suggest that the longer a strike persists, the more likely it is that it will end within, say, the next week. Or is it? It seems equally plausible to suggest that the longer a strike has lasted, the more difficult must be the problems that led to it in the first place, and hence the less likely it is that it will end in the next short time interval.

8There are a large number of highly technical articles on this topic but relatively few accessible sources for the uninitiated. A particularly useful introductory survey is Kiefer (1988), upon which we have drawn heavily for this section. Other useful sources are Kalbfleisch and Prentice (2002), Heckman and Singer (1984a), Lancaster (1990), and Florens, Fougere, and Mouchart (1996).


A similar kind of reasoning could be applied to spells of unemployment or the interval between conceptions. In each of these cases, it is not only the duration of the event, per se, that is interesting, but also the likelihood that the event will end in “the next period” given that it has lasted as long as it has.

Analysis of the length of time until failure has interested engineers for decades. For example, the models discussed in this section were applied to the durability of electric and electronic components long before economists discovered their usefulness. Likewise, the analysis of survival times—for example, the length of survival after the onset of a disease or after an operation such as a heart transplant—has long been a staple of biomedical research. Social scientists have recently applied the same body of techniques to strike duration, length of unemployment spells, intervals between conception, time until business failure, length of time between arrests, length of time from purchase until a warranty claim is made, intervals between purchases, and so on.

This section will give a brief introduction to the econometric analysis of duration data. As usual, we will restrict our attention to a few straightforward, relatively uncomplicated techniques and applications, primarily to introduce terms and concepts. The reader can then wade into the literature to find the extensions and variations. We will concentrate primarily on what are known as parametric models. These apply familiar inference techniques and provide a convenient departure point. Alternative approaches are considered at the end of the discussion.

25.6.1 DURATION DATA

The variable of interest in the analysis of duration is the length of time that elapses from the beginning of some event either until its end or until the measurement is taken, which may precede termination. Observations will typically consist of a cross section of durations, t1, t2, . . . , tn. The process being observed may have begun at different points in calendar time for the different individuals in the sample. For example, the strike duration data examined in Example 25.4 are drawn from nine different years.

Censoring is a pervasive and usually unavoidable problem in the analysis of duration data. The common cause is that the measurement is made while the process is ongoing. An obvious example can be drawn from medical research. Consider analyzing the survival times of heart transplant patients. Although the beginning times may be known with precision, at the time of the measurement, observations on any individuals who are still alive are necessarily censored. Likewise, samples of spells of unemployment drawn from surveys will probably include some individuals who are still unemployed at the time the survey is taken. For these individuals, duration, or survival, is at least the observed ti, but not equal to it. Estimation must account for the censored nature of the data for the same reasons as considered in Section 24.3. The consequences of ignoring censoring in duration data are similar to those that arise in regression analysis.

In a conventional regression model that characterizes the conditional mean and variance of a distribution, the regressors can be taken as fixed characteristics at the point in time or for the individual for which the measurement is taken. When measuring duration, the observation is implicitly on a process that has been under way for an interval of time from zero to t. If the analysis is conditioned on a set of covariates (the counterparts to regressors) xt, then the duration is implicitly a function of the entire


time path of the variable x(t), t = (0, t), which may have changed during the interval. For example, the observed duration of employment in a job may be a function of the individual’s rank in the firm. But their rank may have changed several times between the time they were hired and when the observation was made. As such, observed rank at the end of the job tenure is not necessarily a complete description of the individual’s rank while they were employed. Likewise, marital status, family size, and amount of education are all variables that can change during the duration of unemployment and that one would like to account for in the duration model. The treatment of time-varying covariates is a considerable complication.9

25.6.2 A REGRESSION-LIKE APPROACH: PARAMETRIC MODELS OF DURATION

We will use the term spell as a catchall for the different duration variables we might measure. Spell length is represented by the random variable T. A simple approach to duration analysis would be to apply regression analysis to the sample of observed spells. By this device, we could characterize the expected duration, perhaps conditioned on a set of covariates whose values were measured at the end of the period. We could also assume that conditioned on an x that has remained fixed from T = 0 to T = t, t has a normal distribution, as we commonly do in regression. We could then characterize the probability distribution of observed duration times. But normality turns out not to be particularly attractive in this setting for a number of reasons, not least of which is that duration is positive by construction, while a normally distributed variable can take negative values. (Lognormality turns out to be a palatable alternative, but it is only one among a long list of candidates.)

25.6.2.a Theoretical Background

Suppose that the random variable T has a continuous probability distribution f(t), where t is a realization of T. The cumulative probability is

F(t) = ∫0..t f(s) ds = Prob(T ≤ t).

We will usually be more interested in the probability that the spell is of length at least t, which is given by the survival function,

S(t) = 1 − F(t) = Prob(T ≥ t).

Consider the question raised in the introduction: Given that the spell has lasted until time t, what is the probability that it will end in the next short interval of time, say, Δt? It is

l(t, Δt) = Prob(t ≤ T ≤ t + Δt | T ≥ t).

A useful function for characterizing this aspect of the distribution is the hazard rate,

λ(t) = lim(Δt→0) Prob(t ≤ T ≤ t + Δt | T ≥ t)/Δt = lim(Δt→0) [F(t + Δt) − F(t)]/[Δt S(t)] = f(t)/S(t).

9See Petersen (1986) for one approach to this problem.


Roughly, the hazard rate is the rate at which spells are completed after duration t, given that they last at least until t. As such, the hazard function gives an answer to our original question. The hazard function, the density, the CDF, and the survival function are all related. The hazard function is

λ(t) = −d ln S(t)/dt,

so f(t) = S(t)λ(t). Another useful function is the integrated hazard function,

Λ(t) = ∫0..t λ(s) ds,

for which S(t) = e^−Λ(t), so Λ(t) = −ln S(t). The integrated hazard function is a generalized residual in this setting. [See Chesher and Irish (1987) and Example 25.4.]
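These identities are easy to verify numerically. A small sketch (Python; the Weibull hazard and the parameter values are chosen only for illustration) integrates a hazard function on a grid and recovers the survival function:

```python
import numpy as np

lam, p = 0.05, 1.5                               # illustrative Weibull parameters
t = np.linspace(0.0, 50.0, 100_001)
dt = t[1] - t[0]

hz = lam * p * (lam * t[1:]) ** (p - 1)          # lambda(s) evaluated on the grid
Lambda = np.cumsum(hz) * dt                      # integrated hazard, Lambda(t)
S_numeric = np.exp(-Lambda)                      # S(t) = exp[-Lambda(t)]
S_exact = np.exp(-(lam * t[1:]) ** p)            # closed-form Weibull survival function

print(np.max(np.abs(S_numeric - S_exact)))       # small discretization error (~1e-3 or less)
```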

25.6.2.b Models of the Hazard Function

For present purposes, the hazard function is more interesting than the survival rate or the density. Based on the previous results, one might consider modeling the hazard function itself, rather than, say, modeling the survival function then obtaining the density and the hazard. For example, the base case for many analyses is a hazard rate that does not vary over time. That is, λ(t) is a constant λ. This is characteristic of a process that has no memory; the conditional probability of “failure” in a given short interval is the same regardless of when the observation is made. Thus, λ(t) = λ. From the earlier definition, we obtain the simple differential equation,

−d ln S(t)/dt = λ.

The solution is

ln S(t) = k − λt,

or

S(t) = K e^−λt,

where K is the constant of integration. The terminal condition that S(0) = 1 implies that K = 1, and the solution is

S(t) = e^−λt.


This solution is the exponential distribution, which has been used to model the time until failure of electronic components. Estimation of λ is simple, because with an exponential distribution, E[t] = 1/λ. The maximum likelihood estimator of λ would be the reciprocal of the sample mean.

A natural extension might be to model the hazard rate as a linear function, λ(t) = α + βt. Then Λ(t) = αt + (1/2)βt² and f(t) = λ(t)S(t) = λ(t) exp[−Λ(t)]. To avoid a negative hazard function, one might depart from λ(t) = exp[g(t, θ)], where θ is a vector of parameters to be estimated. With an observed sample of durations, estimation of α and β is, at least in principle, a straightforward problem in maximum likelihood. [Kennan (1985) used a similar approach.]

A distribution whose hazard function slopes upward is said to have positive duration dependence. For such distributions, the likelihood of failure at time t, conditional upon duration up to time t, is increasing in t. The opposite case is that of decreasing hazard or negative duration dependence. Our question in the introduction about whether the strike is more or less likely to end at time t given that it has lasted until time t can be framed in terms of positive or negative duration dependence. The assumed distribution has a considerable bearing on the answer. If one is unsure at the outset of the analysis whether the data can be characterized by positive or negative duration dependence, then it is counterproductive to assume a distribution that displays one characteristic or the other over the entire range of t. Thus, the exponential distribution and our suggested extension could be problematic. The literature contains a cornucopia of choices for duration models: normal, inverse normal [inverse Gaussian; see Lancaster (1990)], lognormal, F, gamma, Weibull (which is a popular choice), and many others.10 To illustrate the differences, we will examine a few of the simpler ones.

Table 25.8 lists the hazard functions and survival functions for four commonly used distributions. Each involves two parameters, a location parameter, λ, and a scale parameter, p. [Note that in the benchmark case of the exponential distribution, λ is the hazard function. In all other cases, the hazard function is a function of λ, p, and, where there is duration dependence, t as well. Different authors, e.g., Kiefer (1988), use different parameterizations of these models. We follow the convention of Kalbfleisch and Prentice (2002).] All these are distributions for a nonnegative random variable. Their hazard functions display very different behaviors, as can be seen in Figure 25.2. The hazard function for the exponential distribution is constant, that for the Weibull is monotonically increasing or decreasing depending on p, and the hazards for lognormal and loglogistic

TABLE 25.8 Survival Distributions

Distribution   Hazard Function, λ(t)                       Survival Function, S(t)
Exponential    λ(t) = λ                                    S(t) = e^−λt
Weibull        λ(t) = λp(λt)^(p−1)                         S(t) = e^−(λt)^p
Lognormal      f(t) = (p/t)φ[p ln(λt)]                     S(t) = Φ[−p ln(λt)]
               [ln t is normally distributed with mean −ln λ and standard deviation 1/p.]
Loglogistic    λ(t) = λp(λt)^(p−1)/[1 + (λt)^p]            S(t) = 1/[1 + (λt)^p]
               [ln t has a logistic distribution with mean −ln λ and variance π²/(3p²).]
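The four hazard functions in Table 25.8 are simple to evaluate directly. The sketch below (Python; the parameter values in the example call are close to the strike-duration estimates reported later in Table 25.9, but are used here only for illustration) reproduces the shapes plotted in Figure 25.2:

```python
import numpy as np
from scipy.stats import norm

def hazard(dist, t, lam, p):
    """Hazard functions lambda(t) for the distributions in Table 25.8."""
    if dist == "exponential":
        return np.full_like(t, lam, dtype=float)
    if dist == "weibull":
        return lam * p * (lam * t) ** (p - 1)
    if dist == "loglogistic":
        return lam * p * (lam * t) ** (p - 1) / (1.0 + (lam * t) ** p)
    if dist == "lognormal":                      # lambda(t) = f(t)/S(t)
        return (p / t) * norm.pdf(p * np.log(lam * t)) / norm.cdf(-p * np.log(lam * t))
    raise ValueError(dist)

days = np.linspace(1.0, 100.0, 100)
weibull_hz = hazard("weibull", days, lam=0.024, p=0.92)   # decreasing, since p < 1
```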

10Three sources that contain numerous specifications are Kalbfleisch and Prentice (2002), Cox and Oakes (1985), and Lancaster (1990).


FIGURE 25.2 Parametric Hazard Functions (exponential, Weibull, lognormal, and loglogistic hazards plotted over 0 to 100 days).

distributions first increase and then decrease. Which among these or the many alternatives is likely to be best in any application is uncertain.

25.6.2.c Maximum Likelihood Estimation

The parameters λ and p of these models can be estimated by maximum likelihood. For observed duration data, t1, t2, . . . , tn, the log-likelihood function can be formulated and maximized in the ways we have become familiar with in earlier chapters. Censored observations can be incorporated as in Section 24.3 for the tobit model. [See (24-13).] As such,

ln L(θ) = Σ(uncensored observations) ln f(t | θ) + Σ(censored observations) ln S(t | θ),

where θ = (λ, p). For some distributions, it is convenient to formulate the log-likelihood function in terms of f(t) = λ(t)S(t) so that

ln L(θ) = Σ(uncensored observations) ln λ(t | θ) + Σ(all observations) ln S(t | θ).

Inference about the parameters can be done in the usual way. Either the BHHH estimator or actual second derivatives can be used to estimate asymptotic standard errors for the estimates. The transformation w = p(ln t + ln λ) for these distributions greatly facilitates maximum likelihood estimation. For example, for the Weibull model, by defining w = p(ln t + ln λ), we obtain the very simple density f(w) = exp[w − exp(w)] and


survival function S(w) = exp(−exp(w)).11 Therefore, by using ln t instead of t, we greatly simplify the log-likelihood function. Details for these and several other distributions may be found in Kalbfleisch and Prentice (2002, pp. 68–70). The Weibull distribution is examined in detail in the next section.

25.6.2.d Exogenous Variables

One limitation of the models given earlier is that external factors are not given a role in the survival distribution. The addition of “covariates” to duration models is fairly straightforward, although the interpretation of the coefficients in the model is less so. Consider, for example, the Weibull model. (The extension to other distributions will be similar.) Let

λi = e^(−xi′β),

where xi is a constant term and a set of variables that are assumed not to change from time T = 0 until the “failure time,” T = ti. Making λi a function of a set of regressors is equivalent to changing the units of measurement on the time axis. For this reason, these models are sometimes called accelerated failure time models. Note as well that in all the models listed (and generally), the regressors do not bear on the question of duration dependence, which is a function of p.

Let σ = 1/p and let δi = 1 if the spell is completed and δi = 0 if it is censored. As before, let

wi = p ln(λi ti) = (ln ti − xi′β)/σ,

and denote the density and survival functions f(wi) and S(wi). The observed random variable is

ln ti = σ wi + xi′β.

The Jacobian of the transformation from wi to ln ti is dwi/d ln ti = 1/σ, so the density and survival functions for ln ti are

f(ln ti | xi, β, σ) = (1/σ) f[(ln ti − xi′β)/σ]  and  S(ln ti | xi, β, σ) = S[(ln ti − xi′β)/σ].

The log-likelihood for the observed data is

ln L(β, σ | data) = Σi=1..n { δi ln f(ln ti | xi, β, σ) + (1 − δi) ln S(ln ti | xi, β, σ) }.

For the Weibull model, for example (see footnote 11),

f(wi) = exp(wi − e^wi),

and

S(wi) = exp(−e^wi).

11The transformation is exp(w) = (λt)^p so t = (1/λ)[exp(w)]^(1/p). The Jacobian of the transformation is dt/dw = [exp(w)]^(1/p)/(λp). The density in Table 25.8 is λp[exp(w)]^(1−(1/p))[exp(−exp(w))]. Multiplying by the Jacobian produces the result, f(w) = exp[w − exp(w)]. The survival function is the antiderivative, [exp(−exp(w))].


Making the transformation to ln ti and collecting terms reduces the log-likelihood to

ln L(β, σ | data) = Σi { δi [(ln ti − xi′β)/σ − ln σ] − exp[(ln ti − xi′β)/σ] }.

(Many other distributions, including the others in Table 25.8, simplify in the same way. The exponential model is obtained by setting σ to one.) The derivatives can be equated to zero using the methods described in Section E.3. The individual terms can also be used to form the BHHH estimator of the asymptotic covariance matrix for the estimator.12 The Hessian is also simple to derive, so Newton's method could be used instead.13

Note that the hazard function generally depends on t, p, and x. The sign of an estimated coefficient suggests the direction of the effect of the variable on the hazard function when the hazard is monotonic. But in those cases, such as the loglogistic, in which the hazard is nonmonotonic, even this may be ambiguous. The magnitudes of the effects may also be difficult to interpret in terms of the hazard function. In a few cases, we do get a regression-like interpretation. In the Weibull and exponential models, E[t | xi] = exp(xi′β)Γ[(1/p) + 1], whereas for the lognormal and loglogistic models, E[ln t | xi] = xi′β. In these cases, βk is the derivative (or a multiple of the derivative) of this conditional mean. For some other distributions, the conditional median of t is easily obtained. Numerous cases are discussed by Kiefer (1988), Kalbfleisch and Prentice (2002), and Lancaster (1990).
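That log-likelihood translates directly into code. A minimal sketch in Python (scipy's optimizer; the data arrays t, X, and the censoring indicator delta are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize

def negloglik_weibull_aft(theta, X, ln_t, delta):
    """Negative Weibull accelerated failure time log-likelihood with censoring.
    theta = (beta, log sigma); delta_i = 1 if the spell is complete, 0 if censored."""
    beta, sigma = theta[:-1], np.exp(theta[-1])  # log parameterization keeps sigma > 0
    w = (ln_t - X @ beta) / sigma
    return -np.sum(delta * (w - np.log(sigma)) - np.exp(w))

# e.g.:
# theta0 = np.zeros(X.shape[1] + 1)
# res = minimize(negloglik_weibull_aft, theta0, args=(X, np.log(t), delta),
#                method="BFGS")   # inverse Hessian or BHHH gives standard errors
```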

25.6.2.e Heterogeneity

The problem of heterogeneity in duration models can be viewed essentially as the result of an incomplete specification. Individual specific covariates are intended to incorporate observation specific effects. But if the model specification is incomplete and if systematic individual differences in the distribution remain after the observed effects are accounted for, then inference based on the improperly specified model is likely to be problematic. We have already encountered several settings in which the possibility of heterogeneity mandated a change in the model specification; the fixed and random effects regression, logit, and probit models all incorporate observation-specific effects. Indeed, all the failures of the linear regression model discussed in the preceding chapters can be interpreted as a consequence of heterogeneity arising from an incomplete specification.

There are a number of ways of extending duration models to account for heterogeneity. The strictly nonparametric approach of the Kaplan–Meier estimator (see Section 25.6.3) is largely immune to the problem, but it is also rather limited in how much information can be culled from it. One direct approach is to model heterogeneity in the parametric model. Suppose that we posit a survival function conditioned on the individual specific effect vi. We treat the survival function as S(ti | vi). Then add to that a model for the unobserved heterogeneity f(vi). (Note that this is a counterpart to the incorporation of a disturbance in a regression model and follows the same procedures

12Note that the log-likelihood function has the same form as that for the tobit model in Section 24.3. By just reinterpreting the nonlimit observations in a tobit setting, we can, therefore, use this framework to apply a wide range of distributions to the tobit model. [See Greene (1995a) and references given therein.]
13See Kalbfleisch and Prentice (2002) for numerous other examples.


that we used in the Poisson model with random effects.) Then

S(t) = Ev[S(t | v)] = ∫v S(t | v) f(v) dv.

The gamma distribution is frequently used for this purpose.14 Consider, for example, using this device to incorporate heterogeneity into the Weibull model we used earlier. As is typical, we assume that v has a gamma distribution with mean 1 and variance θ = 1/k. Then

f(v) = [k^k/Γ(k)] e^−kv v^(k−1),

and

S(t | v) = e^−v(λt)^p.

(The heterogeneity enters the integrated hazard multiplicatively.) After a bit of manipulation, we obtain the unconditional distribution,

S(t) = ∫0..∞ S(t | v) f(v) dv = [1 + θ(λt)^p]^(−1/θ).

The limiting value, with θ = 0, is the Weibull survival model, so θ = 0 corresponds to Var[v] = 0, or no heterogeneity.15 The hazard function for this model is

λ(t) = λp(λt)^(p−1)[S(t)]^θ,

which shows the relationship to the Weibull model. This approach is common in parametric modeling of heterogeneity. In an important paper on this subject, Heckman and Singer (1984b) argued that this approach tends to overparameterize the survival distribution and can lead to rather serious errors in inference. They gave some dramatic examples to make the point. They also expressed some concern that researchers tend to choose the distribution of heterogeneity more on the basis of mathematical convenience than on any sensible economic basis.
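The “bit of manipulation” is the Laplace transform of the gamma density, which is easy to check by simulation. A sketch (Python; the parameter values are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(7)
lam, p, theta = 0.04, 1.2, 0.5                  # illustrative values
k = 1.0 / theta
v = rng.gamma(shape=k, scale=1.0 / k, size=500_000)   # E[v] = 1, Var[v] = theta

t = 30.0
S_mc = np.mean(np.exp(-v * (lam * t) ** p))     # E_v[S(t | v)] by simulation
S_closed = (1.0 + theta * (lam * t) ** p) ** (-1.0 / theta)
print(S_mc, S_closed)                           # agree up to Monte Carlo error
```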

25.6.3 NONPARAMETRIC AND SEMIPARAMETRIC APPROACHES

The parametric models are attractive for their simplicity. But by imposing as much structure on the data as they do, the models may distort the estimated hazard rates. It may be that a more accurate representation can be obtained by imposing fewer restrictions.

The Kaplan–Meier (1958) product limit estimator is a strictly empirical, nonparametric approach to survival and hazard function estimation. Assume that the observations on duration are sorted in ascending order so that t1 ≤ t2 and so on and, for now, that no observations are censored. Suppose as well that there are K distinct survival times in the data, denoted Tk; K will equal n unless there are ties. Let nk denote

14See, for example, Hausman, Hall, and Griliches (1984), who use it to incorporate heterogeneity in the Poisson regression model. The application is developed in Section 25.3.
15For the strike data analyzed in Figure 25.2, the maximum likelihood estimate of θ is 0.0004, which suggests that at least in the context of the Weibull model, latent heterogeneity does not appear to be a feature of these data.


the number of individuals whose observed duration is at least Tk. The set of individuals whose duration is at least Tk is called the risk set at this duration. (We borrow, once again, from biostatistics, where the risk set is those individuals still “at risk” at time Tk.) Thus, nk is the size of the risk set at time Tk. Let hk denote the number of observed spells completed at time Tk. A strictly empirical estimate of the survivor function would be

Ŝ(Tk) = Πi=1..k [(ni − hi)/ni] = (nk − hk)/n1.

The estimator of the hazard rate is

λ̂(Tk) = hk/nk.   (25-14)

Corrections are necessary for observations that are censored. Lawless (1982), Kalbfleisch and Prentice (2002), Kiefer (1988), and Greene (1995a) give details. Susin (2001) points out a fundamental ambiguity in this calculation (one which he argues appears in the 1958 source). The estimator in (25-14) is not a “rate” as such, as the width of the time window is undefined, and could be very different at different points in the chain of calculations. Because many intervals, particularly those late in the observation period, might have zeros, the failure to acknowledge these intervals should impart an upward bias to the estimator. His proposed alternative computes the counterpart to (25-14) over a mesh of defined intervals as follows:

λ̂(Iab) = [Σj=a..b hj] / [Σj=a..b nj bj],

where the interval is from t = a to t = b, hj is the number of failures in each period in this interval, nj is the number of individuals at risk in that period, and bj is the width of the period. Thus, an interval (a, b) is likely to include several “periods.”

Cox's (1972) approach to the proportional hazard model is another popular, semiparametric method of analyzing the effect of covariates on the hazard rate. The model specifies that

λ(ti) = exp(xi′β) λ0(ti).

The function λ0 is the “baseline” hazard, which is the individual heterogeneity. In principle, this hazard is a parameter for each observation that must be estimated. Cox's partial likelihood estimator provides a method of estimating β without requiring estimation of λ0. The estimator is somewhat similar to Chamberlain's estimator for the logit model with panel data in that a conditioning operation is used to remove the heterogeneity. (See Section 23.5.2.) Suppose that the sample contains K distinct exit times, T1, . . . , TK. For any time Tk, the risk set, denoted Rk, is all individuals whose exit time is at least Tk. The risk set is defined with respect to any moment in time T as the set of individuals who have not yet exited just prior to that time. For every individual i in risk set Rk, ti ≥ Tk. The probability that an individual exits at time Tk given that exactly one


individual exits at this time (which is the counterpart to the conditioning in the binary logit model in Chapter 23) is

Prob[ti = Tk | risk setk] = exp(xi′β) / Σj∈Rk exp(xj′β).

Thus, the conditioning sweeps out the baseline hazard functions. For the simplest case in which exactly one individual exits at each distinct exit time and there are no censored observations, the partial log-likelihood is

ln L = Σk=1..K [ xk′β − ln Σj∈Rk exp(xj′β) ].

If mk individuals exit at time Tk, then the contribution to the log-likelihood is the sum of the terms for each of these individuals. The proportional hazard model is a common choice for modeling durations because it is a reasonable compromise between the Kaplan–Meier estimator and the possibly excessively structured parametric models. Hausman and Han (1990) and Meyer (1988), among others, have devised other, “semiparametric” specifications for hazard models.
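For the no-ties case, the partial log-likelihood is only a few lines of code. A sketch (Python; the data arrays are illustrative assumptions, and ties would require one of the standard corrections):

```python
import numpy as np

def cox_partial_loglik(beta, X, t, d):
    """Cox partial log-likelihood with distinct exit times (no ties).
    d_i = 1 for an observed exit, 0 for a censored spell."""
    eta = X @ beta
    order = np.argsort(t)                       # sort by exit time, ascending
    eta, d = eta[order], d[order]
    # log of sum of exp(x_j'beta) over the risk set {j : t_j >= t_k}
    log_risk = np.logaddexp.accumulate(eta[::-1])[::-1]
    return np.sum(d * (eta - log_risk))         # only exits contribute terms

# maximize with, e.g.:
# scipy.optimize.minimize(lambda b: -cox_partial_loglik(b, X, t, d), b0)
```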

Example 25.4 Survival Models for Strike Duration
The strike duration data given in Kennan (1985, pp. 14–16) have become a familiar standard for the demonstration of hazard models. Appendix Table F25.3 lists the durations, in days, of 62 strikes that commenced in June of the years 1968 to 1976. Each involved at least 1,000 workers and began at the expiration or reopening of a contract. Kennan reported the actual duration. In his survey, Kiefer (1985), using the same observations, censored the data at 80 days to demonstrate the effects of censoring. We have kept the data in their original form; the interested reader is referred to Kiefer for further analysis of the censoring problem.16

Parameter estimates for the four duration models are given in Table 25.9. The estimate of the median of the survival distribution is obtained by solving the equation S(t) = 0.5. For example, for the Weibull model, S(M) = 0.5 = exp[−(λM)^p], or M = (ln 2)^(1/p)/λ. For the exponential model, p = 1. For the lognormal and loglogistic models, M = 1/λ. The delta method is then used to estimate the standard error of this function of the parameter estimates. (See Section 4.9.4.) All these distributions are skewed to the right. As such, E[t] is

TABLE 25.9 Estimated Duration Models (estimated standard errors in parentheses)

              λ                    p                    Median Duration
Exponential   0.02344 (0.00298)    1.00000 (0.00000)    29.571 (3.522)
Weibull       0.02439 (0.00354)    0.92083 (0.11086)    27.543 (3.997)
Loglogistic   0.04153 (0.00707)    1.33148 (0.17201)    24.079 (4.102)
Lognormal     0.04514 (0.00806)    0.77206 (0.08865)    22.152 (3.954)

16Our statistical results are nearly the same as Kiefer's despite the censoring.


greater than the median. For the exponential and Weibull models, E[t] = (1/λ)Γ[(1/p) + 1]; for the lognormal, E[t] = (1/λ)[exp(1/p²)]^(1/2). The implied hazard functions are shown in Figure 25.2.

The variable x reported with the strike duration data is a measure of unanticipated aggregate industrial production net of seasonal and trend components. It is computed as the residual in a regression of the log of industrial production in manufacturing on time, time squared, and monthly dummy variables. With the industrial production variable included as a covariate, the estimated Weibull model is

−ln λ = 3.7772 − 9.3515 x,   p = 1.00288,
         (0.1394)  (2.973)        (0.1217),

median strike length = 27.35 (3.667) days, E[t] = 39.83 days.

Note that the Weibull model is now almost identical to the exponential model (p = 1). Because the hazard conditioned on x is approximately equal to λi, it follows that the hazard function is increasing in “unexpected” industrial production. A 1 percent increase in x leads to a 9.35 percent increase in λ, which because p ≈ 1 translates into a 9.35 percent decrease in the median strike length, or about 2.6 days. (Note that M = ln 2/λ.) The proportional hazard model does not have a constant term. (The baseline hazard is an individual specific constant.) The estimate of β is −9.0726, with an estimated standard error of 3.225. This is very similar to the estimate obtained for the Weibull model.
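As a worked check on Table 25.9, the median and mean implied by the Weibull point estimates can be computed directly (Python; the values are taken from the table):

```python
from math import gamma, log

lam, p = 0.02439, 0.92083                       # Weibull estimates, Table 25.9
median = log(2) ** (1.0 / p) / lam              # M = (ln 2)^(1/p)/lambda, about 27.5 days
mean = gamma(1.0 / p + 1.0) / lam               # E[t] = (1/lambda)Gamma(1/p + 1), about 42.6 days
print(median, mean)                             # mean exceeds median: skewed to the right
```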

25.7 SUMMARY AND CONCLUSIONS

This chapter has surveyed models for events. We analyze these data in two forms. Models for “count data” are appropriate for describing numbers of events, such as hospital visits, arrivals of messages at a network node, and so on. The Poisson model is the central pillar of this area, although its limitations usually motivate researchers to consider alternative specifications. The count data models are essentially nonlinear regressions, but it is more fruitful to do the modeling in terms of the probabilities of events rather than as a model of a conditional mean. We considered basic cases, as well as extensions that accommodate features of data such as censoring, truncation, zero inflation, and unobserved heterogeneity. The second type of event history data that we considered is durations. It is useful to think of duration, or survival, data as the measurement of time between transitions or changes of state. We examined three modeling approaches that correspond to the description in Chapter 14: nonparametric (survival tables), semiparametric (the proportional hazard models), and parametric (various forms such as the Weibull model).

Key Terms and Concepts

• Accelerated failure time model
• Censored variable
• Censoring
• Conditional moment test
• Count data
• Delta method
• Deviance
• Duration dependence
• Duration model
• Exponential model
• Exposure
• Generalized residual
• Hazard function
• Hazard rate
• Heterogeneity
• Hurdle model
• Integrated hazard function
• Lagrange multiplier test
• Loglinear model
• Marginal effects
• Negative binomial model
• Negative duration dependence
• Negbin 1 form
• Negbin 2 form
• Negbin P model
• Nonnested models
• Overdispersion
• Parametric model
• Partial likelihood
• Poisson regression model
• Positive duration dependence
• Product limit estimator
• Proportional hazard
• Risk set
• Semiparametric model
• Specification error
• Survival function
• Time-varying covariate
• Weibull survival model
• Zero inflated Poisson model

Exercises

1. For the zero-inflated Poisson (ZIP) model in Section 25.4, we derived the conditional mean function,

E[yi | xi, wi] = F(γ′wi) λi / [1 − exp(−λi)],    λi = exp(xi′β).

a. For the same model, now obtain Var[yi | xi, wi]. Then, obtain

τi = Var[yi | xi, wi] / E[yi | xi, wi].

Does the zero inflation produce overdispersion? (That is, is the ratio greater than one?)
b. Obtain the partial effect for a variable zi that appears in both wi and xi.
2. Consider estimation of a Poisson regression model for yi | xi. The data are truncated on the left; these are on-site observations at a recreation site, so zeros do not appear in the data set. The data are censored on the right; any response greater than 5 is recorded as a 5. Construct the log-likelihood for a data set drawn under this sampling scheme.
3. The density for the Weibull survival model is f(t) = λp(λt)^(p−1) exp[−(λt)^p]. The survival function is S(t) = exp[−(λt)^p]. Obtain the hazard function. What is the median survival time (as a function of λ and p)? Does this model exhibit duration dependence? If so, is it positive or negative?

Applications

1. Several applications in the preceding chapters using the German health care data have examined the variable DocVis, the reported number of visits to the doctor. The data are described in Appendix Table F11.1. A second count variable in that data set that we have not examined is HospVis, the number of visits to the hospital. For this application, we will examine this variable. To begin, we treat the full sample (27,326 observations) as a cross section.
a. Begin by fitting a Poisson regression model to this variable. The exogenous variables are listed in Example 11.11 and in Appendix Table F11.1. Determine an appropriate specification for the right-hand side of your model. Report the regression results and the partial effects.
b. Estimate the model using ordinary least squares and compare your least squares results to the partial effects you computed in part a. What do you find?
c. Is there evidence of overdispersion in the data? Test for overdispersion. Now, reestimate the model using a negative binomial specification. What is the result? Do your results change? Use a likelihood ratio test to test the hypothesis of the negative binomial model against the Poisson.


2. The data are an unbalanced panel, with 7,293 groups. Continue your analysis in Application 1 by fitting the Poisson model with fixed and with random effects and compare your results. (Recall, like the linear model, the Poisson fixed effects model may not contain any time invariant variables.) How do the panel data results compare to the pooled results?
3. Appendix Table F25.4 contains data on ship accidents reported in McCullagh and Nelder (1983). The data set contains 40 observations on the number of incidents of wave damage for oceangoing ships. Regressors include “aggregate months of service,” and three sets of dummy variables: type (1, . . . , 5), operation period (1960–1974 or 1975–1979), and construction period (1960–1964, 1965–1969, or 1970–1974). There are six missing values on the dependent variable, leaving 34 usable observations.
a. Fit a Poisson model for these data, using the log of service months, four type dummy variables, two construction period variables, and one operation period dummy variable. Report your results.
b. The authors note that the rate of accidents is supposed to be per period, but the exposure (aggregate months) differs by ship. Reestimate your model constraining the coefficient on the log of service months to equal one.
c. The authors take overdispersion as a given in these data. Do you find evidence of overdispersion? Show your results.
4. Data on t = strike duration and x = unanticipated industrial production for a number of strikes in each of 9 years are given in Appendix Table F25.3. Use the Poisson regression model discussed in this chapter to determine whether x is a significant determinant of the number of strikes in a given year.

APPENDIX A  MATRIX ALGEBRA

A.1 TERMINOLOGY

A matrix is a rectangular array of numbers, denoted

A = [aik] = [A]ik = ⎡a11  a12  ···  a1K⎤
                    ⎢a21  a22  ···  a2K⎥
                    ⎢        ···       ⎥
                    ⎣an1  an2  ···  anK⎦ .    (A-1)

The typical element is used to denote the matrix. A subscripted element of a matrix is always read as a row, column. An example is given in Table A.1. In these data, the rows are identified with years and the columns with particular variables.
A vector is an ordered set of numbers arranged either in a row or a column. In view of the preceding, a row vector is also a matrix with one row, whereas a column vector is a matrix with one column. Thus, in Table A.1, the five variables observed for 1972 (including the date) constitute a row vector, whereas the time series of nine values for consumption is a column vector.
A matrix can also be viewed as a set of column vectors or as a set of row vectors.1 The dimensions of a matrix are the numbers of rows and columns it contains. “A is an n × K matrix” (read “n by K”) will always mean that A has n rows and K columns. If n equals K, then A is a square matrix. Several particular types of square matrices occur frequently in econometrics.
• A symmetric matrix is one in which aik = aki for all i and k.
• A diagonal matrix is a square matrix whose only nonzero elements appear on the main diagonal, that is, moving from upper left to lower right.
• A scalar matrix is a diagonal matrix with the same value in all diagonal elements.
• An identity matrix is a scalar matrix with ones on the diagonal. This matrix is always denoted I. A subscript is sometimes included to indicate its size, or order. For example, I4 indicates a 4 × 4 identity matrix.
• A triangular matrix is one that has only zeros either above or below the main diagonal. If the zeros are above the diagonal, the matrix is lower triangular.
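These definitions are easy to experiment with numerically. The following is a minimal Python (numpy) sketch; the matrices are illustrative and not taken from the text.

```python
import numpy as np

A = np.array([[4.0, 2.0, 1.0],
              [2.0, 6.0, 1.0],
              [1.0, 1.0, 3.0]])

print(A.shape)                   # the dimensions (rows, columns): (3, 3)
print(np.array_equal(A, A.T))    # True: A is symmetric, a_ik = a_ki

D = np.diag([1.0, 2.0, 3.0])     # a diagonal matrix
S = 5.0 * np.eye(3)              # a scalar matrix: 5 in every diagonal element
I4 = np.eye(4)                   # the identity matrix I_4
L = np.tril(A)                   # lower triangular: only zeros above the diagonal
```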

A.2 ALGEBRAIC MANIPULATION OF MATRICES

A.2.1 EQUALITY OF MATRICES

Matrices (or vectors) A and B are equal if and only if they have the same dimensions and each element of A equals the corresponding element of B. That is,

A = B if and only if aik = bik for all i and k. (A-2)

1Henceforth, we shall denote a matrix by a boldfaced capital letter, as is A in (A-1), and a vector as a boldfaced lowercase letter, as in a. Unless otherwise noted, a vector will always be assumed to be a column vector.


TABLE A.1  Matrix of Macroeconomic Data

              Column 1   Column 2         Column 3         Column 4       Column 5
Row           Year       Consumption      GNP              GNP Deflator   Discount Rate
                         (billions of     (billions of                    (N.Y. Fed., avg.)
                         dollars)         dollars)
1             1972        737.1           1185.9           1.0000          4.50
2             1973        812.0           1326.4           1.0575          6.44
3             1974        808.1           1434.2           1.1508          7.83
4             1975        976.4           1549.2           1.2579          6.25
5             1976       1084.3           1718.0           1.3234          5.50
6             1977       1204.4           1918.3           1.4005          5.46
7             1978       1346.5           2163.9           1.5042          7.46
8             1979       1507.2           2417.8           1.6342         10.28
9             1980       1667.2           2633.1           1.7864         11.77

Source: Data from the Economic Report of the President (Washington, D.C.: U.S. Government Printing Office, 1983).

A.2.2 TRANSPOSITION

The transpose of a matrix A, denoted A′, is obtained by creating the matrix whose kth row is the kth column of the original matrix. Thus, if B = A′, then each column of A will appear as the corresponding row of B. If A is n × K, then A′ is K × n. An equivalent definition of the transpose of a matrix is

B = A′ ⇔ bik = aki for all i and k.    (A-3)

The definition of a symmetric matrix implies that

if (and only if) A is symmetric, then A = A′.    (A-4)

It also follows from the definition that for any A,

(A′)′ = A.    (A-5)

Finally, the transpose of a column vector, a, is a row vector:

a′ = [a1  a2  ···  an].

A.2.3 MATRIX ADDITION

The operations of addition and subtraction are extended to matrices by defining

C = A + B = [aik + bik]. (A-6)

A − B = [aik − bik]. (A-7)

Matrices cannot be added unless they have the same dimensions, in which case they are said to be conformable for addition. A zero matrix or null matrix is one whose elements are all zero. In the addition of matrices, the zero matrix plays the same role as the scalar 0 in scalar addition; that is,

A + 0 = A. (A-8)

It follows from (A-6) that matrix addition is commutative,

A + B = B + A,    (A-9)


and associative,

(A + B) + C = A + (B + C), (A-10)

and that

(A + B)′ = A′ + B′.    (A-11)

A.2.4 VECTOR MULTIPLICATION

Matrices are multiplied by using the inner product. The inner product, or dot product, of two vectors, a and b, is a scalar and is written

a′b = a1b1 + a2b2 + ··· + anbn.    (A-12)

Note that the inner product is written as the transpose of vector a times vector b, a row vector times a column vector. In (A-12), each term ajbj equals bjaj; hence

a′b = b′a.    (A-13)

A.2.5 A NOTATION FOR ROWS AND COLUMNS OF A MATRIX

We need a notation for the ith row of a matrix. Throughout this book, an untransposed vector will always be a column vector. However, we will often require a notation for the column vector that is the transpose of a row of a matrix. This has the potential to create some ambiguity, but the following convention based on the subscripts will suffice for our work throughout this text:

• ak, or al or am, will denote column k, l, or m of the matrix A,
• ai, or aj or at or as, will denote the column vector formed by the transpose of row i, j, t, or s of matrix A. Thus, ai′ is row i of A.    (A-14)

For example, from the data in Table A.1 it might be convenient to speak of xi , where i = 1972 as the 5 × 1 vector containing the five variables measured for the year 1972, that is, the transpose of the 1972 row of the matrix. In our applications, the common association of subscripts “i” and “ j” with individual i or j, and “t” and “s” with time periods t and s will be natural.

A.2.6 MATRIX MULTIPLICATION AND SCALAR MULTIPLICATION

For an n × K matrix A and a K × M matrix B, the product matrix, C = AB, is an n × M matrix whose ikth element is the inner product of row i of A and column k of B. Thus, the product matrix C is

C = AB ⇒ cik = ai′bk.    (A-15)

[Note our use of (A-14) in (A-15).] To multiply two matrices, the number of columns in the first must be the same as the number of rows in the second, in which case they are conformable for multiplication.2 Multiplication of matrices is generally not commutative. In some cases, AB may exist, but BA may be undefined or, if it does exist, may have different dimensions. In general, however, even if AB and BA do have the same dimensions, they will not be equal. In view of this, we define premultiplication and postmultiplication of matrices. In the product AB, B is premultiplied by A, whereas A is postmultiplied by B.

2A simple way to check the conformability of two matrices for multiplication is to write down the dimensions of the operation, for example, (n × K) times (K × M). The inner dimensions must be equal; the result has dimensions equal to the outer values.


Scalar multiplication of a matrix is the operation of multiplying every element of the matrix by a given scalar. For scalar c and matrix A,

cA = [caik]. (A-16) The product of a matrix and a vector is written

c = Ab.

The number of elements in b must equal the number of columns in A; the result is a vector with number of elements equal to the number of rows in A. For example,

⎡5⎤   ⎡4  2  1⎤ ⎡a⎤
⎢4⎥ = ⎢2  6  1⎥ ⎢b⎥ .
⎣1⎦   ⎣1  1  0⎦ ⎣c⎦

We can interpret this in two ways. First, it is a compact way of writing the three equations

5 = 4a + 2b + 1c,
4 = 2a + 6b + 1c,
1 = 1a + 1b + 0c.

Second, by writing the set of equations as

⎡5⎤    ⎡4⎤    ⎡2⎤    ⎡1⎤
⎢4⎥ = a⎢2⎥ + b⎢6⎥ + c⎢1⎥ ,
⎣1⎦    ⎣1⎦    ⎣1⎦    ⎣0⎦

we see that the right-hand side is a linear combination of the columns of the matrix where the coefficients are the elements of the vector. For the general case,

c = Ab = b1a1 + b2a2 +···+bKaK. (A-17) In the calculation of a matrix product C = AB, each column of C is a linear combination of the columns of A, where the coefficients are the elements in the corresponding column of B. That is,

C = AB ⇔ ck = Abk. (A-18)

Let ek be a column vector that has zeros everywhere except for a one in the kth position. Then Aek is a linear combination of the columns of A in which the coefficient on every column but the kth is zero, whereas that on the kth is one. The result is

ak = Aek. (A-19) Combining this result with (A-17) produces

(a1  a2  ···  an) = A(e1  e2  ···  en) = AI = A.    (A-20)

In matrix multiplication, the identity matrix is analogous to the scalar 1. For any matrix or vector A, AI = A. In addition, IA = A, although if A is not a square matrix, the two identity matrices are of different orders. A conformable matrix of zeros produces the expected result: A0 = 0.
Some general rules for matrix multiplication are as follows:
• Associative law: (AB)C = A(BC).    (A-21)
• Distributive law: A(B + C) = AB + AC.    (A-22)
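Results (A-18) and (A-19) can be checked directly in a few lines; a sketch with illustrative matrices:

```python
import numpy as np

A = np.array([[4.0, 2.0, 1.0],
              [2.0, 6.0, 1.0],
              [1.0, 1.0, 0.0]])
B = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [3.0, 1.0]])

C = A @ B                                   # (3 x 3)(3 x 2) gives 3 x 2

# (A-18): column k of C is A times column k of B
print(np.allclose(C[:, 0], A @ B[:, 0]))    # True

# (A-19): A times the unit vector e_k extracts column k of A
e2 = np.array([0.0, 1.0, 0.0])
print(np.allclose(A @ e2, A[:, 1]))         # True
```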


• Transpose of a product: (AB)′ = B′A′.    (A-23)
• Transpose of an extended product: (ABC)′ = C′B′A′.    (A-24)

A.2.7 SUMS OF VALUES

Denote by i a vector that contains a column of ones. Then,

Σ_{i=1}^{n} xi = x1 + x2 + ··· + xn = i′x.    (A-25)

If all elements in x are equal to the same constant a, then x = ai and

Σ_{i=1}^{n} xi = i′(ai) = a(i′i) = na.    (A-26)

For any constant a and vector x,

Σ_{i=1}^{n} axi = a Σ_{i=1}^{n} xi = ai′x.    (A-27)

If a = 1/n, then we obtain the arithmetic mean,

x̄ = (1/n) Σ_{i=1}^{n} xi = (1/n) i′x,    (A-28)

from which it follows that

Σ_{i=1}^{n} xi = i′x = nx̄.

The sum of squares of the elements in a vector x is

Σ_{i=1}^{n} xi² = x′x;    (A-29)

while the sum of the products of the n elements in vectors x and y is

Σ_{i=1}^{n} xi yi = x′y.    (A-30)

By the definition of matrix multiplication,

[X′X]kl = [xk′xl]    (A-31)

is the inner product of the kth and lth columns of X. For example, for the data set given in Table A.1, if we define X as the 9 × 3 matrix containing (year, consumption, GNP), then

[X′X]23 = Σ_{t=1972}^{1980} consumptiont GNPt = 737.1(1185.9) + ··· + 1667.2(2633.1)
        = 19,743,711.34.

If X is n × K, then [again using (A-14)]

X′X = Σ_{i=1}^{n} xi xi′.


This form shows that the K × K matrix X′X is the sum of n K × K matrices, each formed from a single row (year) of X. For the example given earlier, this sum is of nine 3 × 3 matrices, each formed from one row (year) of the original data matrix.
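The identities in this section can be reproduced with the Table A.1 data; a sketch (the consumption and GNP columns are copied from the table):

```python
import numpy as np

year = np.arange(1972, 1981, dtype=float)
cons = np.array([737.1, 812.0, 808.1, 976.4, 1084.3,
                 1204.4, 1346.5, 1507.2, 1667.2])
gnp  = np.array([1185.9, 1326.4, 1434.2, 1549.2, 1718.0,
                 1918.3, 2163.9, 2417.8, 2633.1])

i = np.ones(9)                 # the column of ones
print(i @ cons)                # i'x: the sum of the elements (A-25)
print((i @ cons) / 9)          # the arithmetic mean (A-28)
print(cons @ cons)             # x'x: the sum of squares (A-29)

X = np.column_stack([year, cons, gnp])   # the 9 x 3 data matrix
XtX = X.T @ X
print(XtX[1, 2])               # [X'X]_23 = 19,743,711.34, as in the text
```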

A.2.8 A USEFUL IDEMPOTENT MATRIX

A fundamental matrix in statistics is the one that is used to transform data to deviations from their mean. First,

⎡x̄⎤
⎢⋮⎥ = i x̄ = i[(1/n) i′x] = (1/n) ii′x.    (A-32)
⎣x̄⎦

The matrix (1/n)ii′ is an n × n matrix with every element equal to 1/n. The set of values in deviations form is

⎡x1 − x̄⎤
⎢x2 − x̄⎥ = [x − i x̄] = x − (1/n) ii′x.    (A-33)
⎢   ⋮   ⎥
⎣xn − x̄⎦

Because x = Ix,

x − (1/n) ii′x = Ix − (1/n) ii′x = [I − (1/n) ii′]x = M0x.    (A-34)

Henceforth, the symbol M0 will be used only for this matrix. Its diagonal elements are all (1 − 1/n), and its off-diagonal elements are −1/n. The matrix M0 is primarily useful in computing sums of squared deviations. Some computations are simplified by the result

M0i = [I − (1/n) ii′]i = i − (1/n) i(i′i) = 0,

which implies that i′M0 = 0′. The sum of deviations about the mean is then

Σ_{i=1}^{n} (xi − x̄) = i′[M0x] = 0′x = 0.    (A-35)

For a single variable x, the sum of squared deviations about the mean is

Σ_{i=1}^{n} (xi − x̄)² = Σ_{i=1}^{n} xi² − nx̄².    (A-36)

In matrix terms,

Σ_{i=1}^{n} (xi − x̄)² = (x − x̄ i)′(x − x̄ i) = (M0x)′(M0x) = x′M0′M0x.

Two properties of M0 are useful at this point. First, because all off-diagonal elements of M0 equal −1/n, M0 is symmetric. Second, as can easily be verified by multiplication, M0 is equal to its square; M0M0 = M0.


DEFINITION A.1 Idempotent Matrix
An idempotent matrix, M, is one that is equal to its square, that is, M² = MM = M. If M is a symmetric idempotent matrix (all of the idempotent matrices we shall encounter are symmetric), then M′M = M.

Thus, M0 is a symmetric idempotent matrix. Combining results, we obtain

Σ_{i=1}^{n} (xi − x̄)² = x′M0x.    (A-37)

Consider constructing a matrix of sums of squares and cross products in deviations from the column means. For two vectors x and y,

Σ_{i=1}^{n} (xi − x̄)(yi − ȳ) = (M0x)′(M0y),    (A-38)

so

⎡Σ_{i=1}^{n} (xi − x̄)²          Σ_{i=1}^{n} (xi − x̄)(yi − ȳ)⎤   ⎡x′M0x   x′M0y⎤
⎢                                                              ⎥ = ⎣y′M0x   y′M0y⎦ .    (A-39)
⎣Σ_{i=1}^{n} (yi − ȳ)(xi − x̄)   Σ_{i=1}^{n} (yi − ȳ)²         ⎦

If we put the two column vectors x and y in an n × 2 matrix Z = [x, y], then M0Z is the n × 2 matrix in which the two columns of data are in mean deviation form. Then

(M0Z)′(M0Z) = Z′M0′M0Z = Z′M0Z.
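The construction of M0 and its properties can be verified in a few lines; a numpy sketch using the consumption column of Table A.1:

```python
import numpy as np

n = 9
i = np.ones((n, 1))
M0 = np.eye(n) - (1.0 / n) * (i @ i.T)    # M0 = I - (1/n)ii'

x = np.array([737.1, 812.0, 808.1, 976.4, 1084.3,
              1204.4, 1346.5, 1507.2, 1667.2])

dev = M0 @ x                               # data in mean deviation form
print(np.allclose(dev, x - x.mean()))      # True
print(np.allclose(M0, M0.T))               # M0 is symmetric
print(np.allclose(M0 @ M0, M0))            # M0 is idempotent
# (A-37): the sum of squared deviations, computed two ways
print(x @ M0 @ x, ((x - x.mean())**2).sum())
```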

A.3 GEOMETRY OF MATRICES

A.3.1 VECTOR SPACES

The K elements of a column vector

a = ⎡a1⎤
    ⎢a2⎥
    ⎢⋮ ⎥
    ⎣aK⎦

can be viewed as the coordinates of a point in a K-dimensional space, as shown in Figure A.1 for two dimensions, or as the definition of the line segment connecting the origin and the point defined by a.
Two basic arithmetic operations are defined for vectors, scalar multiplication and addition. A scalar multiple of a vector, a, is another vector, say a*, whose coordinates are the scalar multiple of a’s coordinates. Thus, in Figure A.1,

a = ⎡1⎤ ,   a* = 2a = ⎡2⎤ ,   a** = −(1/2)a = ⎡−1/2⎤ .
    ⎣2⎦               ⎣4⎦                     ⎣−1  ⎦


FIGURE A.1  Vector Space. [Figure: the vectors a, a*, a**, b, and c plotted against the first and second coordinates.]

The set of all possible scalar multiples of a is the line through the origin, 0 and a. Any scalar multiple of a is a segment of this line. The sum of two vectors a and b is a third vector whose coordinates are the sums of the corresponding coordinates of a and b. For example,

c = a + b = ⎡1⎤ + ⎡2⎤ = ⎡3⎤ .
            ⎣2⎦   ⎣1⎦   ⎣3⎦

Geometrically, c is obtained by moving in the distance and direction defined by b from the tip of a or, because addition is commutative, from the tip of b in the distance and direction of a.
The two-dimensional plane is the set of all vectors with two real-valued coordinates. We label this set R2 (“R two,” not “R squared”). It has two important properties.
• R2 is closed under scalar multiplication; every scalar multiple of a vector in R2 is also in R2.
• R2 is closed under addition; the sum of any two vectors in the plane is always a vector in R2.

DEFINITION A.2 Vector Space A vector space is any set of vectors that is closed under scalar multiplication and addition.

Another example is the set of all real numbers, that is, R1, that is, the set of vectors with one real element. In general, the set of K-element vectors all of whose elements are real numbers is a K-dimensional vector space, denoted RK. The preceding examples are drawn in R2.


FIGURE A.2  Linear Combinations of Vectors. [Figure: the vectors a, a*, and b, and the combinations c = a + b, d = a* + b, e = a + 2b, and f = b − a, plotted against the first and second coordinates.]

A.3.2 LINEAR COMBINATIONS OF VECTORS AND BASIS VECTORS

In Figure A.2, c = a + b and d = a∗ + b. But since a∗ = 2a, d = 2a + b. Also, e = a + 2b and f = b + (−a) = b − a. As this exercise suggests, any vector in R2 could be obtained as a linear combination of a and b.

DEFINITION A.3 Basis Vectors A set of vectors in a vector space is a basis for that vector space if any vector in the vector space can be written as a linear combination of that set of vectors.

As is suggested by Figure A.2, any pair of two-element vectors, including a and b, that point in different directions will form a basis for R2. Consider an arbitrary set of vectors in R2, a, b, and c. If a and b are a basis, then we can find numbers α1 and α2 such that c = α1a + α2b. Let

a = ⎡a1⎤ ,   b = ⎡b1⎤ ,   c = ⎡c1⎤ .
    ⎣a2⎦        ⎣b2⎦        ⎣c2⎦

Then

c1 = α1a1 + α2b1,
c2 = α1a2 + α2b2.    (A-40)


The solutions to this pair of equations are

α1 = (b2c1 − b1c2) / (a1b2 − b1a2),    α2 = (a1c2 − a2c1) / (a1b2 − b1a2).    (A-41)

This result gives a unique solution unless (a1b2 − b1a2) = 0. If (a1b2 − b1a2) = 0, then a1/a2 = b1/b2, which means that b is just a multiple of a. This returns us to our original condition, that a and b must point in different directions. The implication is that if a and b are any pair of vectors for which the denominator in (A-41) is not zero, then any other vector c can be formed as a unique linear combination of a and b. The basis of a vector space is not unique, since any set of vectors that satisfies the definition will do. But for any particular basis, only one linear combination of them will produce another particular vector in the vector space.
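Finding the coefficients in (A-40) and (A-41) is an ordinary 2 × 2 linear solve; a sketch using a = (1, 2)′ and b = (2, 1)′ from the earlier example with c = a + b = (3, 3)′:

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([2.0, 1.0])
c = np.array([3.0, 3.0])

# Stack a and b as columns and solve [a b] alpha = c
alpha = np.linalg.solve(np.column_stack([a, b]), c)
print(alpha)                               # [1. 1.]: c = 1*a + 1*b

# The closed form (A-41); the denominator is nonzero exactly when
# a and b point in different directions.
den = a[0] * b[1] - b[0] * a[1]
print((b[1] * c[0] - b[0] * c[1]) / den,   # alpha_1 = 1.0
      (a[0] * c[1] - a[1] * c[0]) / den)   # alpha_2 = 1.0
```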

A.3.3 LINEAR DEPENDENCE

As the preceding should suggest, K vectors are required to form a basis for RK. Although the basis for a vector space is not unique, not every set of K vectors will suffice. In Figure A.2, a and b form a basis for R2, but a and a∗ do not. The difference between these two pairs is that a and b are linearly independent, whereas a and a∗ are linearly dependent.

DEFINITION A.4 Linear Dependence A set of vectors is linearly dependent if any one of the vectors in the set can be written as a linear combination of the others.

Because a∗ is a multiple of a, a and a∗ are linearly dependent. For another example, if

a = ⎡1⎤ ,   b = ⎡3⎤ ,   and   c = ⎡10⎤ ,
    ⎣2⎦        ⎣3⎦               ⎣14⎦

then

2a + b − (1/2)c = 0,

so a, b, and c are linearly dependent. Any of the three possible pairs of them, however, are linearly independent.

DEFINITION A.5 Linear Independence A set of vectors is linearly independent if and only if the only solution to

α1a1 + α2a2 +···+αKaK = 0 is

α1 = α2 =···=αK = 0.

The preceding implies the following equivalent definition of a basis.


DEFINITION A.6 Basis for a Vector Space A basis for a vector space of K dimensions is any set of K linearly independent vectors in that vector space.

Because any (K + 1)st vector can be written as a linear combination of the K basis vectors, it follows that any set of more than K vectors in RK must be linearly dependent.

A.3.4 SUBSPACES

DEFINITION A.7 Spanning Vectors The set of all linear combinations of a set of vectors is the vector space that is spanned by those vectors.

For example, by definition, the space spanned by a basis for RK is RK. An implication of this is that if a and b are a basis for R2 and c is another vector in R2, the space spanned by [a, b, c] is, again, R2. Of course, c is superfluous. Nonetheless, any vector in R2 can be expressed as a linear combination of a, b, and c. (The linear combination will not be unique. Suppose, for example, that a and c are also a basis for R2.) Consider the set of three coordinate vectors whose third element is zero. In particular,

a′ = [a1  a2  0]   and   b′ = [b1  b2  0].

Vectors a and b do not span the three-dimensional space R3. Every linear combination of a and b has a third coordinate equal to zero; thus, for instance, c′ = [1  2  3] could not be written as a linear combination of a and b. If (a1b2 − a2b1) is not equal to zero [see (A-41)], however, then any vector whose third element is zero can be expressed as a linear combination of a and b. So, although a and b do not span R3, they do span something, the set of vectors in R3 whose third element is zero. This area is a plane (the “floor” of the box in a three-dimensional figure). This plane in R3 is a subspace, in this instance, a two-dimensional subspace. Note that it is not R2; it is the set of vectors in R3 whose third coordinate is 0. Any plane in R3, regardless of how it is oriented, forms a two-dimensional subspace. Any two independent vectors that lie in that subspace will span it. But without a third vector that points in some other direction, we cannot span any more of R3 than this two-dimensional part of it. By the same logic, any line in R3 is a one-dimensional subspace, in this case, the set of all vectors in R3 whose coordinates are multiples of those of the vector that defines the line.
A subspace is a vector space in all the respects in which we have defined it. We emphasize that it is not a vector space of lower dimension. For example, R2 is not a subspace of R3. The essential difference is the number of dimensions in the vectors. The vectors in R3 that form a two-dimensional subspace are still three-element vectors; they all just happen to lie in the same plane.
The space spanned by a set of vectors in RK has at most K dimensions. If this space has fewer than K dimensions, it is a subspace, or hyperplane. But the important point in the preceding discussion is that every set of vectors spans some space; it may be the entire space in which the vectors reside, or it may be some subspace of it.


A.3.5 RANK OF A MATRIX

We view a matrix as a set of column vectors. The number of columns in the matrix equals the number of vectors in the set, and the number of rows equals the number of coordinates in each column vector.

DEFINITION A.8 Column Space The column space of a matrix is the vector space that is spanned by its column vectors.

If the matrix contains K rows, its column space might have K dimensions. But, as we have seen, it might have fewer dimensions; the column vectors might be linearly dependent, or there might be fewer than K of them. Consider the matrix

A = ⎡1  5  6⎤
    ⎢2  6  8⎥ .
    ⎣7  1  8⎦

It contains three vectors from R3, but the third is the sum of the first two, so the column space of this matrix cannot have three dimensions. Nor does it have only one, because the three columns are not all scalar multiples of one another. Hence, it has two, and the column space of this matrix is a two-dimensional subspace of R3.
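numpy confirms this directly; a sketch:

```python
import numpy as np

A = np.array([[1.0, 5.0, 6.0],
              [2.0, 6.0, 8.0],
              [7.0, 1.0, 8.0]])

# The third column is the sum of the first two...
print(np.allclose(A[:, 2], A[:, 0] + A[:, 1]))   # True
# ...so the column space has only two dimensions.
print(np.linalg.matrix_rank(A))                  # 2
```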

DEFINITION A.9 Column Rank The column rank of a matrix is the dimension of the vector space that is spanned by its column vectors.

It follows that the column rank of a matrix is equal to the largest number of linearly independent column vectors it contains. The column rank of A is 2. For another specific example, consider

B = ⎡1  2  3⎤
    ⎢5  1  5⎥
    ⎢6  4  5⎥ .
    ⎣3  1  4⎦

It can be shown (we shall see how later) that this matrix has a column rank equal to 3. Each column of B is a vector in R4, so the column space of B is a three-dimensional subspace of R4. Consider, instead, the set of vectors obtained by using the rows of B instead of the columns. The new matrix would be

C = ⎡1  5  6  3⎤
    ⎢2  1  4  1⎥ .
    ⎣3  5  5  4⎦

This matrix is composed of four column vectors from R3. (Note that C is B′.) The column space of C is at most R3, since four vectors in R3 must be linearly dependent. In fact, the column space of


C is R3. Although this is not the same as the column space of B, it does have the same dimension. Thus, the column rank of C and the column rank of B are the same. But the columns of C are the rows of B. Thus, the column rank of C equals the row rank of B. That the column and row ranks of B are the same is not a coincidence. The general results (which are equivalent) are as follows.

THEOREM A.1 Equality of Row and Column Rank The column rank and row rank of a matrix are equal. By the definition of row rank and its counterpart for column rank, we obtain the corollary,

the row space and column space of a matrix have the same dimension. (A-42)

Theorem A.1 holds regardless of the actual row and column rank. If the column rank of a matrix happens to equal the number of columns it contains, then the matrix is said to have full column rank. Full row rank is defined likewise. Because the row and column ranks of a matrix are always equal, we can speak unambiguously of the rank of a matrix. For either the row rank or the column rank (and, at this point, we shall drop the distinction),

rank(A) = rank(A′) ≤ min(number of rows, number of columns).    (A-43)

In most contexts, we shall be interested in the columns of the matrices we manipulate. We shall use the term full rank to describe a matrix whose rank is equal to the number of columns it contains. Of particular interest will be the distinction between full rank and short rank matrices. The distinction turns on the solutions to Ax = 0. If a nonzero x for which Ax = 0 exists, then A does not have full rank. Equivalently, if the nonzero x exists, then the columns of A are linearly dependent and at least one of them can be expressed as a linear combination of the others. For example, a nonzero set of solutions to

⎡1  3  10⎤ ⎡x1⎤   ⎡0⎤
⎣2  3  14⎦ ⎢x2⎥ = ⎣0⎦
           ⎣x3⎦

is any multiple of x′ = (2, 1, −1/2).
In a product matrix C = AB, every column of C is a linear combination of the columns of A, so each column of C is in the column space of A. It is possible that the set of columns in C could span this space, but it is not possible for them to span a higher-dimensional space. At best, they could be a full set of linearly independent vectors in A’s column space. We conclude that the column rank of C could not be greater than that of A. Now, apply the same logic to the rows of C, which are all linear combinations of the rows of B. For the same reason that the column rank of C cannot exceed the column rank of A, the row rank of C cannot exceed the row rank of B. Row and column ranks are always equal, so we can conclude that

rank(AB) ≤ min(rank(A), rank(B)). (A-44)

A useful corollary to (A-44) is:

If A is M × n and B is a square matrix of rank n, then rank(AB) = rank(A).    (A-45)


Another application that plays a central role in the development of regression analysis is, for any matrix A,

rank(A) = rank(A′A) = rank(AA′).    (A-46)

A.3.6 DETERMINANT OF A MATRIX

The determinant of a square matrix—determinants are not defined for nonsquare matrices—is a function of the elements of the matrix. There are various definitions, most of which are not useful for our work. Determinants figure into our results in several ways, however, that we can enumerate before we need formally to define the computations.

PROPOSITION The determinant of a matrix is nonzero if and only if it has full rank.

Full rank and short rank matrices can be distinguished by whether or not their determinants are nonzero. There are some settings in which the value of the determinant is also of interest, so we now consider some algebraic results. It is most convenient to begin with a diagonal matrix

D = ⎡d1  0   0  ···  0 ⎤
    ⎢0   d2  0  ···  0 ⎥
    ⎢        ···       ⎥
    ⎣0   0   0  ···  dK⎦ .

The column vectors of D define a “box” in RK whose sides are all at right angles to one another.3 Its “volume,” or determinant, is simply the product of the lengths of the sides, which we denote

|D| = d1d2 ··· dK = Π_{k=1}^{K} dk.    (A-47)

A special case is the identity matrix, which has, regardless of K, |IK| = 1. Multiplying D by a scalar c is equivalent to multiplying the length of each side of the box by c, which would multiply its volume by c^K. Thus,

|cD| = c^K |D|.    (A-48)

Continuing with this admittedly special case, we suppose that only one column of D is multiplied by c. In two dimensions, this would make the box wider but not higher, or vice versa. Hence, the “volume” (area) would also be multiplied by c. Now, suppose that each side of the box were multiplied by a different c, the first by c1, the second by c2, and so on. The volume would, by an obvious extension, now be c1c2 ··· cK|D|. The matrix with columns defined by [c1d1  c2d2  ···] is just DC, where C is a diagonal matrix with ci as its ith diagonal element. The computation just described is, therefore,

|DC|=|D|·|C|. (A-49)

(The determinant of C is the product of the ci’s since C, like D, is a diagonal matrix.) In particular, note what happens to the whole thing if one of the ci’s is zero.

3Each column vector defines a segment on one of the axes.


For 2 × 2 matrices, the computation of the determinant is

⎪a  c⎪
⎪b  d⎪ = ad − bc.    (A-50)

Notice that it is a function of all the elements of the matrix. This statement will be true, in general. For more than two dimensions, the determinant can be obtained by using an expansion by cofactors. Using any row, say, i, we obtain

|A| = Σ_{k=1}^{K} aik(−1)^{i+k} |Aik|,    (A-51)

where Aik is the matrix obtained from A by deleting row i and column k. The determinant of Aik is called a minor of A.4 When the correct sign, (−1)^(i+k), is added, it becomes a cofactor. This operation can be done using any column as well. For example, a 4 × 4 determinant becomes a sum of four 3 × 3s, whereas a 5 × 5 is a sum of five 4 × 4s, each of which is a sum of four 3 × 3s, and so on. Obviously, it is a good idea to base (A-51) on a row or column with many zeros in it, if possible. In practice, this rapidly becomes a heavy burden. It is unlikely, though, that you will ever calculate any determinants over 3 × 3 without a computer. A 3 × 3, however, might be computed on occasion; if so, the following shortcut will prove useful:

⎪a11  a12  a13⎪
⎪a21  a22  a23⎪ = a11a22a33 + a12a23a31 + a13a32a21 − a31a22a13 − a21a12a33 − a11a23a32.
⎪a31  a32  a33⎪

Although (A-48) and (A-49) were given for diagonal matrices, they hold for general matrices C and D. One special case of (A-48) to note is that of c = −1. Multiplying a matrix by −1 does not necessarily change the sign of its determinant. It does so only if the order of the matrix is odd. By using the expansion by cofactors formula, an additional result can be shown:

|A| = |A′|.    (A-52)
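The cofactor expansion (A-51) translates directly into a recursive routine; a minimal sketch, expanding along the first row and checked against numpy's determinant (for illustration only; the recursion is exponential in K):

```python
import numpy as np

def det_cofactor(A):
    # Determinant by cofactor expansion along row 0; see (A-51).
    K = A.shape[0]
    if K == 1:
        return A[0, 0]
    total = 0.0
    for k in range(K):
        minor = np.delete(np.delete(A, 0, axis=0), k, axis=1)
        total += A[0, k] * (-1.0) ** k * det_cofactor(minor)
    return total

A = np.array([[5.0, 1.0, 2.0],
              [2.0, 4.0, 0.0],
              [1.0, 1.0, 3.0]])
print(det_cofactor(A), np.linalg.det(A))   # both equal 50.0
```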

A.3.7 A LEAST SQUARES PROBLEM

Given a vector y and a matrix X, we are interested in expressing y as a linear combination of the columns of X. There are two possibilities. If y lies in the column space of X, then we shall be able to find a vector b such that y = Xb. (A-53)

Figure A.3 illustrates such a case for three dimensions in which the two columns of X both have a third coordinate equal to zero. Only ys whose third coordinate is zero, such as y0 in the figure, can be expressed as Xb for some b. For the general case, assuming that y is, indeed, in the column space of X, we can find the coefficients b by solving the set of equations in (A-53). The solution is discussed in the next section. Suppose, however, that y is not in the column space of X. In the context of this example, suppose that y’s third component is not zero. Then there is no b such that (A-53) holds. We can, however, write y = Xb + e, (A-54)

where e is the difference between y and Xb. By this construction, we find an Xb that is in the column space of X, and e is the difference, or “residual.” Figure A.3 shows two examples, y and

4If i equals k, then the determinant is a principal minor.


FIGURE A.3  Least Squares Projections. [Figure: the vectors y and y*, their projections Xb and (Xb)* onto the plane spanned by x1 and x2, the residuals e and e*, and the angles θ and θ*.]

y∗. For the present, we consider only y. We are interested in finding the b such that y is as close as possible to Xb in the sense that e is as short as possible.

DEFINITION A.10 Length of a Vector
The length, or norm, of a vector e is

‖e‖ = √(e′e).    (A-55)

The problem is to find the b for which ‖e‖ = ‖y − Xb‖ is as small as possible. The solution is that b that makes e perpendicular, or orthogonal, to Xb.

DEFINITION A.11 Orthogonal Vectors
Two nonzero vectors a and b are orthogonal, written a ⊥ b, if and only if a′b = b′a = 0.

Returning once again to our fitting problem, we find that the b we seek is that for which e ⊥ Xb. Expanding this set of equations gives the requirement

(Xb)′e = 0 = b′X′y − b′X′Xb = b′[X′y − X′Xb],


or, assuming b is not 0, the set of equations

X′y = X′Xb.

The means of solving such a set of equations is the subject of Section A.5. In Figure A.3, the linear combination Xb is called the projection of y into the column space of X. The figure is drawn so that, although y and y* are different, they are similar in that the projection of y lies on top of that of y*. The question we wish to pursue here is, Which vector, y or y*, is closer to its projection in the column space of X? Superficially, it would appear that y is closer, because e is shorter than e*. Yet y* is much more nearly parallel to its projection than y, so the only reason that its residual vector is longer is that y* is longer compared with y. A measure of comparison that would be unaffected by the length of the vectors is the angle between the vector and its projection (assuming that angle is not zero). By this measure, θ* is smaller than θ, which would reverse the earlier conclusion.

THEOREM A.2 The Cosine Law The angle θ between two vectors a and b satisfies

cos θ = a′b / (‖a‖ · ‖b‖).

The two vectors in the calculation would be y or y* and Xb or (Xb)*. A zero cosine implies that the vectors are orthogonal. If the cosine is one, then the angle is zero, which means that the vectors are the same. (They would be if y were in the column space of X.) By dividing by the lengths, we automatically compensate for the length of y. By this measure, we find in Figure A.3 that y* is closer to its projection, (Xb)*, than y is to its projection, Xb.

A.4 SOLUTION OF A SYSTEM OF LINEAR EQUATIONS

Consider the set of n linear equations

Ax = b, (A-56)

in which the K elements of x constitute the unknowns. A is a known matrix of coefficients, and b is a specified vector of values. We are interested in knowing whether a solution exists; if so, then how to obtain it; and finally, if it does exist, then whether it is unique.

A.4.1 SYSTEMS OF LINEAR EQUATIONS

For most of our applications, we shall consider only square systems of equations, that is, those in which A is a square matrix. In what follows, therefore, we take n to equal K. Because the number of rows in A is the number of equations, whereas the number of columns in A is the number of variables, this case is the familiar one of “n equations in n unknowns.” There are two types of systems of equations. Greene-50558 book June 25, 2007 12:52


DEFINITION A.12 Homogeneous Equation System A homogeneous system is of the form Ax = 0.

By definition, a nonzero solution to such a system will exist if and only if A does not have full rank. If so, then for at least one column of A, we can write the preceding as

ak = − Σ_{m≠k} (xm / xk) am.

This means, as we know, that the columns of A are linearly dependent and that |A| = 0.

DEFINITION A.13 Nonhomogeneous Equation System A nonhomogeneous system of equations is of the form Ax = b, where b is a nonzero vector.

The vector b is chosen arbitrarily and is to be expressed as a linear combination of the columns of A. Because b has K elements, this solution will exist only if the columns of A span the entire K-dimensional space, RK.5 Equivalently, we shall require that the columns of A be linearly independent or that |A| not be equal to zero.

A.4.2 INVERSE MATRICES

To solve the system Ax = b for x, something akin to division by a matrix is needed. Suppose that we could find a square matrix B such that BA = I. If the equation system is premultiplied by this B, then the following would be obtained:

BAx = Ix = x = Bb. (A-57)

If the matrix B exists, then it is the inverse of A, denoted

B = A−1.

From the definition,

A−1A = I.

In addition, by premultiplying by A, postmultiplying by A−1, and then canceling terms, we find

AA−1 = I

as well. If the inverse exists, then it must be unique. Suppose that it is not and that C is a different inverse of A. Then CAB can be grouped as (CA)B = IB = B or as C(AB) = CI = C, which would be a

5If A does not have full rank, then the nonhomogeneous system will have solutions for some vectors b, namely, any b in the column space of A. But we are interested in the case in which there are solutions for all nonzero vectors b, which requires A to have full rank.


contradiction if C did not equal B. Because, by (A-57), the solution is x = A−1b, the solution to the equation system is unique as well.
We now consider the calculation of the inverse matrix. For a 2 × 2 matrix, AB = I implies that

⎡a11  a12⎤ ⎡b11  b12⎤   ⎡1  0⎤
⎣a21  a22⎦ ⎣b21  b22⎦ = ⎣0  1⎦

or

a11b11 + a12b21 = 1,
a11b12 + a12b22 = 0,
a21b11 + a22b21 = 0,
a21b12 + a22b22 = 1.

The solutions are

⎡b11  b12⎤         1         ⎡ a22  −a12⎤    1   ⎡ a22  −a12⎤
⎣b21  b22⎦ = ─────────────── ⎣−a21   a11⎦ = ──── ⎣−a21   a11⎦ .    (A-58)
             a11a22 − a12a21                 |A|

Notice the presence of the reciprocal of |A| in A−1. This result is not specific to the 2 × 2 case. We infer from it that if the determinant is zero, then the inverse does not exist.

DEFINITION A.14 Nonsingular Matrix A matrix is nonsingular if and only if its inverse exists.

The simplest inverse matrix to compute is that of a diagonal matrix. If

D = ⎡d1  0   ···  0 ⎤                 ⎡1/d1   0    ···   0  ⎤
    ⎢0   d2  ···  0 ⎥ ,  then  D−1 =  ⎢0     1/d2  ···   0  ⎥ ,
    ⎣0   0   ···  dK⎦                 ⎣0      0    ···  1/dK⎦

which shows, incidentally, that I−1 = I.
We shall use a^(ik) to indicate the ikth element of A−1. The general formula for computing an inverse matrix is

a^(ik) = |Cki| / |A|,    (A-59)

where |Cki| is the kith cofactor of A. [See (A-51).] It follows, therefore, that for A to be nonsingular, |A| must be nonzero. Notice the reversal of the subscripts.
Some computational results involving inverses are

|A−1| = 1/|A|,    (A-60)

(A−1)−1 = A,    (A-61)

(A−1)′ = (A′)−1.    (A-62)

If A is symmetric, then A−1 is symmetric. (A-63)

When both inverse matrices exist,

(AB)−1 = B−1A−1. (A-64) Greene-50558 book June 25, 2007 12:52


Note the condition preceding (A-64). It may be that AB is a square, nonsingular matrix when neither A nor B is even square. (Consider, e.g., A′A.) Extending (A-64), we have

(ABC)−1 = C−1(AB)−1 = C−1B−1A−1. (A-65)

Recall that for a data matrix X, X′X is the sum of the outer products of the rows of X. Suppose that we have already computed S = (X′X)−1 for a number of years of data, such as those given in Table A.1. The following result, which is called an updating formula, shows how to compute the new S that would result when a new row is added to X: For symmetric, nonsingular matrix A,

[A ± bb′]−1 = A−1 ∓ [1/(1 ± b′A−1b)] A−1bb′A−1.    (A-66)

Note the reversal of the sign in the inverse. Two more general forms of (A-66) that are occasionally useful are

[A ± bc′]−1 = A−1 ∓ [1/(1 ± c′A−1b)] A−1bc′A−1,    (A-66a)

[A ± BCB′]−1 = A−1 ∓ A−1B[C−1 ± B′A−1B]−1B′A−1.    (A-66b)
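The updating formula (A-66) is easy to verify numerically; a sketch with an illustrative symmetric, nonsingular A and update vector b:

```python
import numpy as np

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([[1.0], [2.0], [0.5]])

Ainv = np.linalg.inv(A)

# (A + bb')^{-1} from (A-66): subtract a rank-one correction from A^{-1}
correction = (Ainv @ b @ b.T @ Ainv) / (1.0 + (b.T @ Ainv @ b).item())
print(np.allclose(np.linalg.inv(A + b @ b.T), Ainv - correction))   # True
```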

A.4.3 NONHOMOGENEOUS SYSTEMS OF EQUATIONS

For the nonhomogeneous system

Ax = b,

if A is nonsingular, then the unique solution is

x = A−1b.

A.4.4 SOLVING THE LEAST SQUARES PROBLEM

We now have the tool needed to solve the least squares problem posed in Section A.3.7. We found the solution vector, b, to be the solution to the nonhomogeneous system X′y = X′Xb. Let a equal the vector X′y and let A equal the square matrix X′X. The equation system is then

Ab = a.

By the preceding results, if A is nonsingular, then

b = A−1a = (X′X)−1(X′y),

assuming that the matrix to be inverted is nonsingular. We have reached the irreducible minimum. If the columns of X are linearly independent, that is, if X has full rank, then this is the solution to the least squares problem. If the columns of X are linearly dependent, then this system has no unique solution.
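In code, the computation looks like the following sketch. In practice, one solves the system X′Xb = X′y rather than forming the inverse explicitly; numpy's least squares routine gives the same answer.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))      # an illustrative full column rank matrix
y = rng.normal(size=20)

# b = (X'X)^{-1}(X'y), computed by solving the normal equations
b = np.linalg.solve(X.T @ X, X.T @ y)

b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(b, b_lstsq))    # True

# The residual e = y - Xb is orthogonal to the columns of X
e = y - X @ b
print(np.allclose(X.T @ e, 0.0))  # True
```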

A.5 PARTITIONED MATRICES

In formulating the elements of a matrix, it is sometimes useful to group some of the elements in submatrices. Let

A = ⎡1  4 │ 5⎤
    ⎢2  9 │ 3⎥ = ⎡A11  A12⎤
    ⎣8  9 │ 6⎦   ⎣A21  A22⎦ .


A is a partitioned matrix. The subscripts of the submatrices are defined in the same fashion as those for the elements of a matrix. A common special case is the block-diagonal matrix:

A = ⎡A11   0 ⎤
    ⎣ 0   A22⎦ ,

where A11 and A22 are square matrices.

A.5.1 ADDITION AND MULTIPLICATION OF PARTITIONED MATRICES

For conformably partitioned matrices A and B,

A + B = ⎡A11 + B11   A12 + B12⎤
        ⎣A21 + B21   A22 + B22⎦ ,    (A-67)

and

AB = ⎡A11  A12⎤ ⎡B11  B12⎤   ⎡A11B11 + A12B21   A11B12 + A12B22⎤
     ⎣A21  A22⎦ ⎣B21  B22⎦ = ⎣A21B11 + A22B21   A21B12 + A22B22⎦ .    (A-68)

In all these, the matrices must be conformable for the operations involved. For addition, the dimensions of Aik and Bik must be the same. For multiplication, the number of columns in Aij must equal the number of rows in Bjl for all pairs i and j. That is, all the necessary matrix products of the submatrices must be defined. Two cases frequently encountered are of the form

⎡A1⎤′⎡A1⎤              ⎡A1⎤
⎢  ⎥ ⎢  ⎥ = [A1′  A2′] ⎢  ⎥ = A1′A1 + A2′A2,    (A-69)
⎣A2⎦ ⎣A2⎦              ⎣A2⎦

and

⎡A11   0 ⎤′⎡A11   0 ⎤   ⎡A11′A11     0    ⎤
⎣ 0   A22⎦ ⎣ 0   A22⎦ = ⎣   0     A22′A22⎦ .    (A-70)

A.5.2 DETERMINANTS OF PARTITIONED MATRICES

The determinant of a block-diagonal matrix is obtained analogously to that of a diagonal matrix:

⎪A11   0 ⎪
⎪ 0   A22⎪ = |A11| · |A22|.    (A-71)

The determinant of a general 2 × 2 partitioned matrix is

⎪A11  A12⎪
⎪A21  A22⎪ = |A22| · |A11 − A12A22−1A21| = |A11| · |A22 − A21A11−1A12|.    (A-72)

A.5.3 INVERSES OF PARTITIONED MATRICES

The inverse of a block-diagonal matrix is

⎡A11   0 ⎤−1   ⎡A11−1    0   ⎤
⎣ 0   A22⎦   = ⎣  0    A22−1 ⎦ ,    (A-73)

which can be verified by direct multiplication.


For the general 2 × 2 partitioned matrix, one form of the partitioned inverse is

⎡A11  A12⎤−1   ⎡A11−1(I + A12F2A21A11−1)   −A11−1A12F2⎤
⎣A21  A22⎦   = ⎣−F2A21A11−1                     F2    ⎦ ,    (A-74)

where

F2 = (A22 − A21A11−1A12)−1.

The upper left block could also be written as

F1 = (A11 − A12A22−1A21)−1.

A.5.4 DEVIATIONS FROM MEANS

Suppose that we begin with a column vector of n values x and let

A = ⎡      n           Σ_{i=1}^{n} xi ⎤   ⎡i′i   i′x⎤
    ⎣Σ_{i=1}^{n} xi    Σ_{i=1}^{n} xi²⎦ = ⎣x′i   x′x⎦ .

We are interested in the lower-right-hand element of A−1. Upon using the definition of F2 in (A-74), this is

F2 = [x′x − (x′i)(i′i)−1(i′x)]−1 = {x′[I − (1/n)ii′]x}−1 = (x′M0x)−1.

Therefore, the lower-right-hand value in the inverse matrix is

(x′M0x)−1 = 1 / Σ_{i=1}^{n} (xi − x̄)² = a^(22).

Now, suppose that we replace x with X, a matrix with several columns. We seek the lower-right block of (Z′Z)−1, where Z = [i, X]. The analogous result is

(Z′Z)^(22) = [X′X − X′i(i′i)−1i′X]−1 = (X′M0X)−1,

which implies that the K × K matrix in the lower-right corner of (Z′Z)−1 is the inverse of the K × K matrix whose jkth element is Σ_{i=1}^{n} (xij − x̄j)(xik − x̄k). Thus, when a data matrix contains a column of ones, the elements of the inverse of the matrix of sums of squares and cross products will be computed from the original data in the form of deviations from the respective column means.
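The claim about the lower-right block of (Z′Z)−1 can be checked numerically; a sketch with illustrative data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 50, 2
X = rng.normal(size=(n, K))
Z = np.hstack([np.ones((n, 1)), X])        # Z = [i, X]

M0 = np.eye(n) - np.ones((n, n)) / n       # M0 = I - (1/n)ii'

lower_right = np.linalg.inv(Z.T @ Z)[1:, 1:]
print(np.allclose(lower_right,
                  np.linalg.inv(X.T @ M0 @ X)))   # True: (X'M0X)^{-1}
```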

A.5.5 KRONECKER PRODUCTS

A calculation that helps to condense the notation when dealing with sets of regression models (see Chapters 10, 11, and 13) is the Kronecker product. For general matrices A and B,

A ⊗ B = ⎡a11B  a12B  ···  a1KB⎤
        ⎢a21B  a22B  ···  a2KB⎥
        ⎢          ···        ⎥
        ⎣an1B  an2B  ···  anKB⎦ .    (A-75)


Notice that there is no requirement for conformability in this operation. The Kronecker product can be computed for any pair of matrices. If A is K × L and B is m × n, then A ⊗ B is (Km) × (Ln). For the Kronecker product,

(A ⊗ B)−1 = (A−1 ⊗ B−1), (A-76)

If A is M × M and B is n × n, then

|A ⊗ B| = |A|^n |B|^M,
(A ⊗ B)′ = A′ ⊗ B′,
tr(A ⊗ B) = tr(A)tr(B).

For A, B, C, and D such that the products are defined,

(A ⊗ B)(C ⊗ D) = AC ⊗ BD.
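numpy computes the Kronecker product directly, which makes these rules easy to confirm; a sketch with illustrative nonsingular matrices:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])
B = np.array([[1.0, 4.0],
              [2.0, 1.0]])

K = np.kron(A, B)                                   # A ⊗ B is 4 x 4

# (A-76): the inverse of a Kronecker product
print(np.allclose(np.linalg.inv(K),
                  np.kron(np.linalg.inv(A), np.linalg.inv(B))))   # True
print(np.allclose(K.T, np.kron(A.T, B.T)))          # transpose rule
print(np.isclose(np.trace(K),
                 np.trace(A) * np.trace(B)))        # trace rule
```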

A.6 CHARACTERISTIC ROOTS AND VECTORS

A useful set of results for analyzing a square matrix A arises from the solutions to the set of equations

Ac = λc. (A-77)

The pairs of solutions are the characteristic vectors c and characteristic roots λ. If c is any solution vector, then kc is also a solution for any value of k. To remove the indeterminacy, c is normalized so that c′c = 1. The solution then consists of λ and the n − 1 unknown elements in c.

A.6.1 THE CHARACTERISTIC EQUATION

Solving (A-77) can, in principle, proceed as follows. First, (A-77) implies that

Ac = λIc,

or that

(A − λI)c = 0.

This equation is a homogeneous system that has a nonzero solution only if the matrix (A − λI) is singular or has a zero determinant. Therefore, if λ is a solution, then

|A − λI |=0. (A-78)

This polynomial in λ is the characteristic equation of A. For example, if

A = ⎡5  1⎤
    ⎣2  4⎦ ,

then

|A − λI| = ⎪5 − λ    1  ⎪ = (5 − λ)(4 − λ) − 2(1) = λ² − 9λ + 18.
           ⎪  2    4 − λ⎪

The two solutions are λ = 6 and λ = 3. Greene-50558 book June 25, 2007 12:52
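The same roots, and the associated vectors, come from a numerical eigensolver; a sketch:

```python
import numpy as np

A = np.array([[5.0, 1.0],
              [2.0, 4.0]])

roots, vectors = np.linalg.eig(A)
print(sorted(roots))                       # the roots 3.0 and 6.0

# Each column of `vectors` satisfies Ac = lambda c
for lam, c in zip(roots, vectors.T):
    print(np.allclose(A @ c, lam * c))     # True, True
```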


In solving the characteristic equation, there is no guarantee that the characteristic roots will be real. In the preceding example, if the 2 in the lower-left-hand corner of the matrix were −2 instead, then the solution would be a pair of complex values. The same result can emerge in the general n × n case. The characteristic roots of a symmetric matrix are real, however.6 This result will be convenient because most of our applications will involve the characteristic roots and vectors of symmetric matrices. For an n × n matrix, the characteristic equation is an nth-order polynomial in λ. Its solutions may be n distinct values, as in the preceding example, or may contain repeated values of λ, and may contain some zeros as well.

A.6.2 CHARACTERISTIC VECTORS

With λ in hand, the characteristic vectors are derived from the original problem,

Ac = λc,

or

(A − λI)c = 0. (A-79)

Neither pair determines the values of c1 and c2. But this result was to be expected; it was the reason c′c = 1 was specified at the outset. The additional equation c′c = 1, however, produces complete solutions for the vectors.

A.6.3 GENERAL RESULTS FOR CHARACTERISTIC ROOTS AND VECTORS

A K × K symmetric matrix has K distinct characteristic vectors, c1, c2, ..., cK. The corresponding characteristic roots, λ1, λ2, ..., λK, although real, need not be distinct.7 The characteristic vectors of a symmetric matrix are orthogonal,8 which implies that for every i ≠ j, ci′cj = 0. It is convenient to collect the K characteristic vectors in a K × K matrix whose ith column is the ci corresponding to λi,

C = [c1 c2 ··· cK],

and the K characteristic roots in the same order, in a diagonal matrix,

Λ = ⎡λ1  0   ···  0 ⎤
    ⎢0   λ2  ···  0 ⎥
    ⎣0   0   ···  λK⎦ .

Then, the full set of equations

Ack = λkck

is contained in

AC = CΛ.    (A-80)

6A proof may be found in Theil (1971).
7For proofs of these propositions, see Strang (1988).
8This statement is not true if the matrix is not symmetric. For instance, it does not hold for the characteristic vectors computed in the first example. For nonsymmetric matrices, there is also a distinction between “right” characteristic vectors, Ac = λc, and “left” characteristic vectors, d′A = λd′, which may not be equal.


Because the vectors are orthogonal and ci′ci = 1, we have

C′C = ⎡c1′c1  c1′c2  ···  c1′cK⎤
      ⎢c2′c1  c2′c2  ···  c2′cK⎥
      ⎢           ⋮           ⎥ = I.    (A-81)
      ⎣cK′c1  cK′c2  ···  cK′cK⎦

Result (A-81) implies that

C′ = C−1.    (A-82)

Consequently,

CC′ = CC−1 = I    (A-83)

as well, so the rows as well as the columns of C are orthogonal.

A.6.4 DIAGONALIZATION AND SPECTRAL DECOMPOSITION OF A MATRIX

By premultiplying (A-80) by C′ and using (A-81), we can extract the characteristic roots of A.

DEFINITION A.15 Diagonalization of a Matrix
The diagonalization of a matrix A is

C′AC = C′CΛ = IΛ = Λ.    (A-84)

Alternatively, by postmultiplying (A-80) by C′ and using (A-83), we obtain a useful representation of A.

DEFINITION A.16 Spectral Decomposition of a Matrix
The spectral decomposition of A is

A = CΛC′ = Σ_{k=1}^{K} λk ck ck′.    (A-85)

In this representation, the K × K matrix A is written as a sum of K rank one matrices. This sum is also called the eigenvalue (or, “own” value) decomposition of A. In this connection, the term signature of the matrix is sometimes used to describe the characteristic roots and vectors. Yet another pair of terms for the parts of this decomposition are the latent roots and latent vectors of A.
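For a symmetric matrix, numpy's eigh routine returns the roots and an orthonormal C, from which (A-81) and (A-85) can be verified; a sketch with an illustrative symmetric matrix:

```python
import numpy as np

A = np.array([[4.0, 1.0, 0.5],
              [1.0, 3.0, 1.0],
              [0.5, 1.0, 2.0]])                 # symmetric

lam, C = np.linalg.eigh(A)                      # roots and characteristic vectors

print(np.allclose(C.T @ C, np.eye(3)))          # (A-81): C'C = I
print(np.allclose(C @ np.diag(lam) @ C.T, A))   # (A-85): A = C Lambda C'

# The same thing as a sum of K rank one matrices lambda_k c_k c_k'
recon = sum(lam[k] * np.outer(C[:, k], C[:, k]) for k in range(3))
print(np.allclose(recon, A))                    # True
```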

A.6.5 RANK OF A MATRIX

The diagonalization result enables us to obtain the rank of a matrix very easily. To do so, we can use the following result. Greene-50558 book June 25, 2007 12:52


THEOREM A.3 Rank of a Product
For any matrix A and nonsingular matrices B and C, the rank of BAC is equal to the rank of A. The proof is simple. By (A-45), rank(BAC) = rank[(BA)C] = rank(BA). By (A-43), rank(BA) = rank(A′B′), and applying (A-45) again, rank(A′B′) = rank(A′) because B′ is nonsingular if B is nonsingular (once again, by A-43). Finally, applying (A-43) again to obtain rank(A′) = rank(A) gives the result.

Because C and C′ are nonsingular, we can use them to apply this result to (A-84). By an obvious substitution,

rank(A) = rank(Λ).    (A-86)

Finding the rank of Λ is trivial. Because Λ is a diagonal matrix, its rank is just the number of nonzero values on its diagonal. By extending this result, we can prove the following theorems. (Proofs are brief and are left for the reader.)

THEOREM A.4 Rank of a Symmetric Matrix The rank of a symmetric matrix is the number of nonzero characteristic roots it contains.

Note how this result enters the spectral decomposition given earlier. If any of the characteristic roots are zero, then the number of rank one matrices in the sum is reduced correspondingly. It would appear that this simple rule will not be useful if A is not square. But recall that

rank(A) = rank(A′A).    (A-87)

Because A′A is always square, we can use it instead of A. Indeed, we can use it even if A is square, which leads to a fully general result.

THEOREM A.5 Rank of a Matrix
The rank of any matrix A equals the number of nonzero characteristic roots in A′A.

The row rank and column rank of a matrix are equal, so we should be able to apply Theorem A.5 to AA′ as well. This process, however, requires an additional result.

THEOREM A.6 Roots of an Outer Product Matrix
The nonzero characteristic roots of AA′ are the same as those of A′A.


The proof is left as an exercise. A useful special case the reader can examine is the characteristic roots of a′a and aa′, where a is an n × 1 vector.
If a characteristic root of a matrix is zero, then we have Ac = 0. Thus, if the matrix has a zero root, it must be singular. Otherwise, no nonzero c would exist. In general, therefore, a matrix is singular (that is, it does not have full rank) if and only if it has at least one zero root.

A.6.6 CONDITION NUMBER OF A MATRIX

As the preceding might suggest, there is a discrete difference between full rank and short rank matrices. In analyzing data matrices such as the one in Section A.2, however, we shall often encounter cases in which a matrix is not quite short ranked, because it has all nonzero roots, but it is close. That is, by some measure, we can come very close to being able to write one column as a linear combination of the others. This case is important; we shall examine it at length in our discussion of multicollinearity in Section 4.8.1. Our definitions of rank and determinant will fail to indicate this possibility, but an alternative measure, the condition number, is designed for that purpose. Formally, the condition number for a square matrix A is

γ = [maximum root / minimum root]^{1/2}. (A-88)

For nonsquare matrices X, such as the data matrix in the example, we use A = X′X. As a further refinement, because the characteristic roots are affected by the scaling of the columns of X, we scale the columns to have length 1 by dividing each column by its norm [see (A-55)]. For the X in Section A.2, the largest characteristic root of A is 4.9255 and the smallest is 0.0001543. Therefore, the condition number is 178.67, which is extremely large. (Values greater than 20 are large.) That the smallest root is close to zero compared with the largest means that this matrix is nearly singular. Matrices with large condition numbers are difficult to invert accurately.
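A sketch of the calculation in (A-88), again assuming Python with numpy; the matrix X below is an arbitrary example with nearly collinear columns, not the data matrix of Section A.2:

    import numpy as np

    # Condition number as in (A-88): scale each column of X to length 1,
    # form A = X'X, and take the square root of the ratio of the largest
    # to the smallest characteristic root.
    X = np.array([[1., 1.001],
                  [1., 0.999],
                  [1., 1.000]])
    Xs = X / np.linalg.norm(X, axis=0)     # columns scaled to unit length
    roots = np.linalg.eigvalsh(Xs.T @ Xs)  # ascending order
    gamma = np.sqrt(roots[-1] / roots[0])
    print(gamma)                           # enormous; X is nearly singular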

A.6.7 TRACE OF A MATRIX

The trace of a square K × K matrix is the sum of its diagonal elements:

tr(A) = Σ_{k=1}^{K} a_kk.

Some easily proven results are

tr(cA) = c(tr(A)), (A-89)
tr(A′) = tr(A), (A-90)
tr(A + B) = tr(A) + tr(B), (A-91)
tr(I_K) = K, (A-92)
tr(AB) = tr(BA). (A-93)

In addition,

a′a = tr(a′a) = tr(aa′)

and

tr(A′A) = Σ_{k=1}^{K} a_k′a_k = Σ_{i=1}^{K} Σ_{k=1}^{K} a²_ik.

The permutation rule can be extended to any cyclic permutation in a product:

tr(ABCD) = tr(BCDA) = tr(CDAB) = tr(DABC). (A-94)


By using (A-84), we obtain

tr(C′AC) = tr(ACC′) = tr(AI) = tr(A) = tr(Λ). (A-95)

Because Λ is diagonal with the roots of A on its diagonal, the general result is the following.

THEOREM A.7 Trace of a Matrix The trace of a matrix equals the sum of its characteristic roots. (A-96)

A.6.8 DETERMINANT OF A MATRIX

Recalling how tedious the calculation of a determinant promised to be, we find that the following is particularly useful. Because

C′AC = Λ, (A-97)

|C′AC| = |Λ|.

Using a number of earlier results, we have, for orthogonal matrix C,

|C′AC| = |C′| · |A| · |C| = |C′| · |C| · |A| = |C′C| · |A| = |I| · |A| = 1 · |A| = |A| = |Λ|. (A-98)

Because |Λ| is just the product of its diagonal elements, the following is implied.

THEOREM A.8 Determinant of a Matrix The determinant of a matrix equals the product of its characteristic roots. (A-99)

Notice that we get the expected result if any of these roots is zero. The determinant is the product of the roots, so it follows that a matrix is singular if and only if its determinant is zero and, in turn, if and only if it has at least one zero characteristic root.
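Theorems A.7 and A.8 can be checked in a few lines. A minimal sketch, assuming Python with numpy and an arbitrary symmetric example matrix:

    import numpy as np

    # The trace equals the sum of the characteristic roots (Theorem A.7);
    # the determinant equals their product (Theorem A.8).
    A = np.array([[4., 1., 0.],
                  [1., 3., 1.],
                  [0., 1., 2.]])
    roots = np.linalg.eigvalsh(A)
    print(np.trace(A), roots.sum())        # both 9.0
    print(np.linalg.det(A), roots.prod())  # equal up to rounding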

A.6.9 POWERS OF A MATRIX

We often use expressions involving powers of matrices, such as AA = A². For positive integer powers, these expressions can be computed by repeated multiplication. But this does not show how to handle a problem such as finding a B such that B² = A, that is, the square root of a matrix. The characteristic roots and vectors provide a solution. Consider first

AA = A² = (CΛC′)(CΛC′) = CΛC′CΛC′ = CΛIΛC′ = CΛΛC′ = CΛ²C′. (A-100)

Two results follow. Because Λ² is a diagonal matrix whose nonzero elements are the squares of those in Λ, the following is implied. For any symmetric matrix, the characteristic roots of A² are the squares of those of A, and the characteristic vectors are the same. (A-101)


The proof is obtained by observing that the second line in (A-100) is the spectral decomposition of the matrix B = AA. Because A³ = AA² and so on, (A-101) extends to any positive integer. By convention, for any A, A⁰ = I. Thus, for any symmetric matrix A, A^K = CΛ^K C′, K = 0, 1, . . . . Hence, the characteristic roots of A^K are λ^K_i, whereas the characteristic vectors are the same as those of A. If A is nonsingular, so that all its roots λᵢ are nonzero, then this proof can be extended to negative powers as well. If A⁻¹ exists, then

A⁻¹ = (CΛC′)⁻¹ = (C′)⁻¹Λ⁻¹C⁻¹ = CΛ⁻¹C′, (A-102)

where we have used the earlier result, C′ = C⁻¹. This gives an important result that is useful for analyzing inverse matrices.

THEOREM A.9 Characteristic Roots of an Inverse Matrix If A−1 exists, then the characteristic roots of A−1 are the reciprocals of those of A, and the characteristic vectors are the same.

By extending the notion of repeated multiplication, we now have a more general result.

THEOREM A.10 Characteristic Roots of a Matrix Power
For any nonsingular symmetric matrix A = CΛC′, A^K = CΛ^K C′, K = . . . , −2, −1, 0, 1, 2, . . . .

We now turn to the general problem of how to compute the square root of a matrix. In the scalar case, the value would have to be nonnegative. The matrix analog to this requirement is that all the characteristic roots are nonnegative. Consider, then, the candidate

A^{1/2} = CΛ^{1/2}C′ = C diag(√λ₁, √λ₂, . . . , √λₙ) C′. (A-103)

This equation satisfies the requirement for a square root, because

A^{1/2}A^{1/2} = CΛ^{1/2}C′CΛ^{1/2}C′ = CΛC′ = A. (A-104)

If we continue in this fashion, we can define the powers of a matrix more generally, still assuming that all the characteristic roots are nonnegative. For example, A^{1/3} = CΛ^{1/3}C′. If all the roots are strictly positive, we can go one step further and extend the result to any real power. For reasons that will be made clear in the next section, we say that a matrix with positive characteristic roots is positive definite. It is the matrix analog to a positive number.

DEFINITION A.17 Real Powers of a Positive Definite Matrix
For a positive definite matrix A, A^r = CΛ^r C′, for any real number, r. (A-105)


The characteristic roots of A^r are the rth power of those of A, and the characteristic vectors are the same. If A is only nonnegative definite—that is, has roots that are either zero or positive—then (A-105) holds only for nonnegative r.
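A short numerical sketch of (A-103) through (A-105), assuming Python with numpy; the example matrix is arbitrary but positive definite:

    import numpy as np

    # Real powers of a positive definite matrix via A = C L C', where the
    # diagonal matrix L holds the (positive) characteristic roots.
    A = np.array([[2., 1.],
                  [1., 2.]])                  # roots 3 and 1
    lam, C = np.linalg.eigh(A)                # columns of C are the vectors
    A_half = C @ np.diag(np.sqrt(lam)) @ C.T  # A^(1/2) = C L^(1/2) C'
    print(A_half @ A_half)                    # recovers A, as in (A-104)
    A_inv = C @ np.diag(1.0 / lam) @ C.T      # r = -1 gives the inverse
    print(A_inv @ A)                          # identity, up to rounding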

A.6.10 IDEMPOTENT MATRICES

Idempotent matrices are equal to their squares [see (A-37) to (A-39)]. In view of their importance in econometrics, we collect a few results related to idempotent matrices at this point. First, (A-101) implies that if λ is a characteristic root of an idempotent matrix, then λ = λ^K for all nonnegative integers K. As such, if A is a symmetric idempotent matrix, then all its roots are one or zero. Assume that all the roots of A are one. Then Λ = I, and A = CΛC′ = CIC′ = CC′ = I. If the roots are not all one, then one or more are zero. Consequently, we have the following results for symmetric idempotent matrices:9

• The only full rank, symmetric idempotent matrix is the identity matrix I. (A-106) • All symmetric idempotent matrices except the identity matrix are singular. (A-107)

The final result on idempotent matrices is obtained by observing that the count of the nonzero roots of A is also equal to their sum. By combining Theorems A.5 and A.7 with the result that for an idempotent matrix, the roots are all zero or one, we obtain this result: • The rank of a symmetric idempotent matrix is equal to its trace. (A-108)
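The rank-equals-trace result (A-108) is worth seeing in action. A minimal sketch, assuming Python with numpy, using the idempotent matrix M = I − X(X′X)⁻¹X′ built from an arbitrary full-column-rank X:

    import numpy as np

    n, K = 6, 2
    rng = np.random.default_rng(0)
    X = rng.normal(size=(n, K))
    M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T
    print(np.allclose(M @ M, M))       # True: M is idempotent
    print(np.trace(M))                 # 4.0 = n - K
    print(np.linalg.matrix_rank(M))    # 4, equal to the trace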

A.6.11 FACTORING A MATRIX

In some applications, we shall require a matrix P such that

P′P = A⁻¹.

One choice is

P = Λ^{-1/2}C′,

so that

P′P = (CΛ^{-1/2})(Λ^{-1/2}C′) = CΛ⁻¹C′,

as desired.10 Thus, the spectral decomposition of A, A = CΛC′, is a useful result for this kind of computation.
The Cholesky factorization of a symmetric positive definite matrix is an alternative representation that is useful in regression analysis. Any symmetric positive definite matrix A may be written as the product of a lower triangular matrix L and its transpose (which is an upper triangular matrix) L′ = U. Thus, A = LU. This result is the Cholesky decomposition of A. The square roots of the diagonal elements of L, dᵢ, are the Cholesky values of A. By arraying these in a diagonal matrix D, we may also write A = LD⁻¹D²D⁻¹U = L*D²U*, which is similar to the spectral decomposition in (A-85). The usefulness of this formulation arises when the inverse of A is required. Once L is

9Not all idempotent matrices are symmetric. We shall not encounter any asymmetric ones in our work, however.
10We say that this is "one" choice because if A is symmetric, as it will be in all our applications, there are other candidates. The reader can easily verify that CΛ^{-1/2}C′ = A^{-1/2} works as well.


computed, finding A⁻¹ = U⁻¹L⁻¹ is also straightforward as well as extremely fast and accurate. Most recently developed econometric software packages use this technique for inverting positive definite matrices.
A third type of decomposition of a matrix is useful for numerical analysis when the inverse is difficult to obtain because the columns of A are "nearly" collinear. Any n × K matrix A for which n ≥ K can be written in the form A = UWV′, where U is an orthogonal n × K matrix—that is, U′U = I_K—W is a K × K diagonal matrix such that wᵢ ≥ 0, and V is a K × K matrix such that V′V = I_K. This result is called the singular value decomposition (SVD) of A, and the wᵢ are the singular values of A.11 (Note that if A is square, then the spectral decomposition is a singular value decomposition.) As with the Cholesky decomposition, the usefulness of the SVD arises in inversion, in this case, of A′A. By multiplying it out, we obtain that (A′A)⁻¹ is simply VW⁻²V′. Once the SVD of A is computed, the inversion is trivial. The other advantage of this format is its numerical stability, which is discussed at length in Press et al. (1986).
Press et al. (1986) recommend the SVD approach as the method of choice for solving least squares problems because of its accuracy and numerical stability. A commonly used alternative method similar to the SVD approach is the QR decomposition. Any n × K matrix, X, with n ≥ K can be written in the form X = QR in which the columns of Q are orthonormal (Q′Q = I) and R is an upper triangular matrix. Decomposing X in this fashion allows an extremely accurate solution to the least squares problem that does not involve inversion or direct solution of the normal equations. Press et al. suggest that this method may have problems with rounding errors in problems when X is nearly of short rank, but based on other published results, this concern seems relatively minor.12
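The three factorizations can be compared directly on a least squares problem. A minimal sketch, assuming Python with numpy and arbitrary simulated data; all three routes produce the same coefficient vector:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(20, 3))
    y = rng.normal(size=20)

    L = np.linalg.cholesky(X.T @ X)       # X'X = LL', L lower triangular
    b_chol = np.linalg.solve(L.T, np.linalg.solve(L, X.T @ y))

    U, w, Vt = np.linalg.svd(X, full_matrices=False)   # X = U W V'
    b_svd = Vt.T @ ((U.T @ y) / w)        # uses V W^(-1) U' y

    Q, R = np.linalg.qr(X)                # X = QR with Q'Q = I
    b_qr = np.linalg.solve(R, Q.T @ y)    # no normal equations needed

    print(np.allclose(b_chol, b_svd), np.allclose(b_svd, b_qr))  # True True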

A.6.12 THE GENERALIZED INVERSE OF A MATRIX

Inverse matrices are fundamental in econometrics. Although we shall not require them much in our treatment in this book, there are more general forms of inverse matrices than we have considered thus far. A generalized inverse of a matrix A is another matrix A⁺ that satisfies the following requirements:
1. AA⁺A = A.
2. A⁺AA⁺ = A⁺.
3. A⁺A is symmetric.
4. AA⁺ is symmetric.
A unique A⁺ can be found for any matrix, whether A is singular or not, or even if A is not square.13 The unique matrix that satisfies all four requirements is called the Moore–Penrose inverse or pseudoinverse of A. If A happens to be square and nonsingular, then the generalized inverse will be the familiar ordinary inverse. But if A⁻¹ does not exist, then A⁺ can still be computed. An important special case is the overdetermined system of equations

Ab = y,

11Discussion of the singular value decomposition (and listings of computer programs for the computations) may be found in Press et al. (1986).
12The National Institute of Standards and Technology (NIST) has published a suite of benchmark problems that test the accuracy of least squares computations (http://www.nist.gov/itl/div898/strd). Using these problems, which include some extremely difficult, ill-conditioned data sets, we found that the QR method would reproduce all the NIST certified solutions to 15 digits of accuracy, which suggests that the QR method should be satisfactory for all but the worst problems.
13A proof of uniqueness, with several other results, may be found in Theil (1983).


where A has n rows, K < n columns, and column rank equal to R ≤ K. Suppose that R equals K, so that (A′A)⁻¹ exists. Then the Moore–Penrose inverse of A is

A⁺ = (A′A)⁻¹A′,

which can be verified by multiplication. A “solution” to the system of equations can be written

b = A⁺y.

This is the vector that minimizes the length of Ab − y. Recall this was the solution to the least squares problem obtained in Section A.4.4. If y lies in the column space of A, this vector will be zero, but otherwise, it will not.
Now suppose that A does not have full rank. The previous solution cannot be computed. An alternative solution can be obtained, however. We continue to use the matrix A′A. In the spectral decomposition of Section A.6.4, if A has rank R, then there are R terms in the summation in (A-85). In (A-102), the spectral decomposition using the reciprocals of the characteristic roots is used to compute the inverse. To compute the Moore–Penrose inverse, we apply this calculation to A′A, using only the nonzero roots, then postmultiply the result by A′. Let C₁ be the R characteristic vectors corresponding to the nonzero roots, which we array in the diagonal matrix Λ₁. Then the Moore–Penrose inverse is

A⁺ = C₁Λ₁⁻¹C₁′A′,

which is very similar to the previous result. If A is a symmetric matrix with rank R ≤ K, the Moore–Penrose inverse is computed precisely as in the preceding equation without postmultiplying by A′. Thus, for a symmetric matrix A,

A⁺ = C₁Λ₁⁻¹C₁′,

where Λ₁⁻¹ is a diagonal matrix containing the reciprocals of the nonzero roots of A.
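A small numerical sketch of the four defining conditions, assuming Python with numpy and an arbitrary full-column-rank A:

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.normal(size=(5, 3))
    A_plus = np.linalg.pinv(A)            # Moore-Penrose inverse
    print(np.allclose(A_plus, np.linalg.inv(A.T @ A) @ A.T))  # (A'A)^(-1)A'
    print(np.allclose(A @ A_plus @ A, A))            # requirement 1
    print(np.allclose(A_plus @ A @ A_plus, A_plus))  # requirement 2
    print(np.allclose((A_plus @ A).T, A_plus @ A))   # requirement 3
    print(np.allclose((A @ A_plus).T, A @ A_plus))   # requirement 4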

A.7 QUADRATIC FORMS AND DEFINITE MATRICES

Many optimization problems involve double sums of the form

q = Σ_{i=1}^n Σ_{j=1}^n xᵢxⱼ a_ij. (A-109)

This quadratic form can be written

q = x′Ax,

where A is a symmetric matrix. In general, q may be positive, negative, or zero; it depends on A and x. There are some matrices, however, for which q will be positive regardless of x, and others for which q will always be negative (or nonnegative or nonpositive). For a given matrix A,
1. If x′Ax > (<) 0 for all nonzero x, then A is positive (negative) definite.
2. If x′Ax ≥ (≤) 0 for all nonzero x, then A is nonnegative definite or positive semidefinite (nonpositive definite).
It might seem that it would be impossible to check a matrix for definiteness, since x can be chosen arbitrarily. But we have already used the set of results necessary to do so. Recall that a


symmetric matrix can be decomposed into

A = CΛC′.

Therefore, the quadratic form can be written as

x′Ax = x′CΛC′x.

Let y = C′x. Then

x′Ax = y′Λy = Σ_{i=1}^n λᵢyᵢ². (A-110)

If λᵢ is positive for all i, then regardless of y—that is, regardless of x—q will be positive. This case was identified earlier as a positive definite matrix. Continuing this line of reasoning, we obtain the following theorem.

THEOREM A.11 Definite Matrices Let A be a symmetric matrix. If all the characteristic roots of A are positive (negative), then A is positive definite (negative definite). If some of the roots are zero, then A is nonnegative (nonpositive) definite if the remainder are positive (negative). If A has both negative and positive roots, then A is indefinite.

The preceding statements give, in each case, the "if" parts of the theorem. To establish the "only if" parts, assume that the condition on the roots does not hold. This must lead to a contradiction. For example, if some λᵢ can be negative, then y′Λy could be negative for some y, so A cannot be positive definite.
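Theorem A.11 amounts to a computational recipe: classify a symmetric matrix by the signs of its characteristic roots. A minimal sketch, assuming Python with numpy; the tolerance and the example matrices are arbitrary:

    import numpy as np

    def definiteness(A, tol=1e-10):
        # Classify a symmetric matrix by the signs of its roots
        # (Theorem A.11).
        lam = np.linalg.eigvalsh(A)
        if np.all(lam > tol):
            return "positive definite"
        if np.all(lam < -tol):
            return "negative definite"
        if np.all(lam > -tol):
            return "nonnegative definite"
        if np.all(lam < tol):
            return "nonpositive definite"
        return "indefinite"

    print(definiteness(np.array([[2., 1.], [1., 2.]])))  # positive definite
    print(definiteness(np.array([[1., 2.], [2., 1.]])))  # indefinite (roots 3, -1)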

A.7.1 NONNEGATIVE DEFINITE MATRICES

A case of particular interest is that of nonnegative definite matrices. Theorem A.11 implies a number of related results.
• If A is nonnegative definite, then |A| ≥ 0. (A-111)

Proof: The determinant is the product of the roots, which are nonnegative.
The converse, however, is not true. For example, a 2 × 2 matrix with two negative roots is clearly not positive definite, but it does have a positive determinant.
• If A is positive definite, so is A⁻¹. (A-112)

Proof: The roots are the reciprocals of those of A, which are, therefore, positive.
• The identity matrix I is positive definite. (A-113)

Proof: x′Ix = x′x > 0 if x ≠ 0.
A very important result for regression analysis is
• If A is n × K with full column rank and n > K, then A′A is positive definite and AA′ is nonnegative definite. (A-114)
Proof: By assumption, Ax ≠ 0. So x′A′Ax = (Ax)′(Ax) = y′y = Σⱼ yⱼ² > 0.


A similar proof establishes the nonnegative definiteness of AA′. The difference in the latter case is that because A has more rows than columns, there is an x such that A′x = 0. Thus, in the proof, we only have y′y ≥ 0. The case in which A does not have full column rank is the same as that of AA′.
• If A is positive definite and B is a nonsingular matrix, then B′AB is positive definite. (A-115)

Proof: x′B′ABx = y′Ay > 0, where y = Bx. But y cannot be 0 because B is nonsingular.
Finally, note that for A to be negative definite, all A's characteristic roots must be negative. But, in this case, |A| is positive if A is of even order and negative if A is of odd order.

A.7.2 IDEMPOTENT QUADRATIC FORMS

Quadratic forms in idempotent matrices play an important role in the distributions of many test statistics. As such, we shall encounter them fairly often. Two central results are of interest. • Every symmetric idempotent matrix is nonnegative definite. (A-116)

Proof: All roots are one or zero; hence, the matrix is nonnegative definite by definition.
Combining this with some earlier results yields a result used in determining the sampling distribution of most of the standard test statistics.
• If A is symmetric and idempotent, n × n with rank J, then every quadratic form in A can be written x′Ax = Σ_{j=1}^J yⱼ². (A-117)
Proof: This result is (A-110) with λ = one or zero.

A.7.3 COMPARING MATRICES

Derivations in econometrics often focus on whether one matrix is “larger” than another. We now consider how to make such a comparison. As a starting point, the two matrices must have the same dimensions. A useful comparison is based on

d = x′Ax − x′Bx = x′(A − B)x.

If d is always positive for any nonzero vector, x, then by this criterion, we can say that A is larger than B. The reverse would apply if d is always negative. It follows from the definition that

if d > 0 for all nonzero x, then A − B is positive definite. (A-118)

If d is only greater than or equal to zero, then A − B is nonnegative definite. The ordering is not complete. For some pairs of matrices, d could have either sign, depending on x. In this case, there is no simple comparison. A particular case of the general result which we will encounter frequently is:

If A is positive definite and B is nonnegative definite, then A + B ≥ A. (A-119)

Consider, for example, the “updating formula” introduced in (A-66). This uses a matrix

A = B′B + bb′ ≥ B′B.

Finally, in comparing matrices, it may be more convenient to compare their inverses. The result analogous to a familiar result for scalars is:

If A > B, then B⁻¹ > A⁻¹. (A-120)


To establish this intuitive result, we would make use of the following, which is proved in Goldberger (1964, Chapter 2):

THEOREM A.12 Ordering for Positive Definite Matrices If A and B are two positive definite matrices with the same dimensions and if every char- acteristic root of A is larger than (at least as large as) the corresponding characteristic root of B when both sets of roots are ordered from largest to smallest, then A − B is positive (nonnegative) definite.

The roots of the inverse are the reciprocals of the roots of the original matrix, so the theorem can be applied to the inverse matrices.

A.8 CALCULUS AND MATRIX ALGEBRA14

A.8.1 DIFFERENTIATION AND THE TAYLOR SERIES

A variable y is a function of another variable x written

y = f (x), y = g(x), y = y(x),

and so on, if each value of x is associated with a single value of y. In this relationship, y and x are sometimes labeled the dependent variable and the independent variable, respectively. Assuming that the function f(x) is continuous and differentiable, we obtain the following derivatives:

f′(x) = dy/dx, f″(x) = d²y/dx²,

and so on. A frequent use of the derivatives of f(x) is in the Taylor series approximation. A Taylor series is a polynomial approximation to f(x). Letting x⁰ be an arbitrarily chosen expansion point,

f(x) ≈ f(x⁰) + Σ_{i=1}^P (1/i!) [dⁱf(x⁰)/d(x⁰)ⁱ] (x − x⁰)ⁱ. (A-121)

The choice of the number of terms is arbitrary; the more that are used, the more accurate the approximation will be. The approximation used most frequently in econometrics is the linear approximation,

f (x) ≈ α + βx, (A-122)

where, by collecting terms in (A-121), α = [f(x⁰) − f′(x⁰)x⁰] and β = f′(x⁰). The superscript "0" indicates that the function is evaluated at x⁰. The quadratic approximation is

f(x) ≈ α + βx + γx², (A-123)

where α = [f⁰ − f′⁰x⁰ + ½ f″⁰(x⁰)²], β = [f′⁰ − f″⁰x⁰], and γ = ½ f″⁰.

14For a complete exposition, see Magnus and Neudecker (1988).


We can regard a function y = f(x₁, x₂, . . . , xₙ) as a scalar-valued function of a vector; that is, y = f(x). The vector of partial derivatives, or gradient vector, or simply gradient, is

∂f(x)/∂x = [∂y/∂x₁, ∂y/∂x₂, . . . , ∂y/∂xₙ]′ = [f₁, f₂, . . . , fₙ]′. (A-124)

The vector g(x) or g is used to represent the gradient. Notice that it is a column vector. The shape of the derivative is determined by the denominator of the derivative. A second derivatives matrix or Hessian is computed as

H = [∂²y/∂xᵢ∂xⱼ] = [f_ij], i, j = 1, . . . , n. (A-125)

In general, H is a square, symmetric matrix. (The symmetry is obtained for continuous and continuously differentiable functions from Young’s theorem.) Each column of H is the derivative of g with respect to the corresponding variable in x. Therefore,

H = [∂(∂y/∂x)/∂x₁ ∂(∂y/∂x)/∂x₂ · · · ∂(∂y/∂x)/∂xₙ] = ∂(∂y/∂x)/∂x′ = ∂²y/∂x∂x′.

The first-order, or linear Taylor series approximation is

y ≈ f(x⁰) + Σ_{i=1}^n fᵢ(x⁰)(xᵢ − xᵢ⁰). (A-126)

The right-hand side is

f(x⁰) + [∂f(x⁰)/∂x⁰]′(x − x⁰) = [f(x⁰) − g(x⁰)′x⁰] + g(x⁰)′x = [f⁰ − g⁰′x⁰] + g⁰′x.

This produces the linear approximation,

y ≈ α + β′x.

The second-order, or quadratic, approximation adds the second-order terms in the expansion,

½ Σ_{i=1}^n Σ_{j=1}^n f_ij⁰ (xᵢ − xᵢ⁰)(xⱼ − xⱼ⁰) = ½ (x − x⁰)′H⁰(x − x⁰),

to the preceding one. Collecting terms in the same manner as in (A-126), we have

y ≈ α + β′x + ½ x′Γx, (A-127)

where

α = f⁰ − g⁰′x⁰ + ½ x⁰′H⁰x⁰, β = g⁰ − H⁰x⁰, and Γ = H⁰.

A linear function can be written

y = a′x = x′a = Σ_{i=1}^n aᵢxᵢ,


so

∂(a′x)/∂x = a. (A-128)

Note, in particular, that ∂(a′x)/∂x = a, not a′. In a set of linear functions

y = Ax,

each element yᵢ of y is

yᵢ = aᵢ′x,

where aᵢ′ is the ith row of A [see (A-14)]. Therefore,

∂yᵢ/∂x = aᵢ = transpose of the ith row of A,

and stacking these results, the matrix whose ith row is ∂yᵢ/∂x′ = aᵢ′ is A itself. Collecting all terms, we find that ∂Ax/∂x = A′, whereas the more familiar form will be

∂Ax/∂x′ = A. (A-129)

A quadratic form is written

x′Ax = Σ_{i=1}^n Σ_{j=1}^n xᵢxⱼ a_ij. (A-130)

For example,

A = [1 3; 3 4],

so that

x′Ax = x₁² + 4x₂² + 6x₁x₂.

Then

∂x′Ax/∂x = [2x₁ + 6x₂; 6x₁ + 8x₂] = [2 6; 6 8] [x₁; x₂] = 2Ax, (A-131)

which is the general result when A is a symmetric matrix. If A is not symmetric, then

∂(x′Ax)/∂x = (A + A′)x. (A-132)

Referring to the preceding double summation, we find that for each term, the coefficient on a_ij is xᵢxⱼ. Therefore,

∂(x′Ax)/∂a_ij = xᵢxⱼ.


The square matrix whose ijth element is xᵢxⱼ is xx′, so

∂(x′Ax)/∂A = xx′. (A-133)

Derivatives involving determinants appear in maximum likelihood estimation. From the cofactor expansion in (A-51),

∂|A|/∂a_ij = (−1)^{i+j}|A_ij| = c_ij,

where |C_ji| is the jith cofactor in A. The inverse of A can be computed using

(A⁻¹)_ij = |C_ji| / |A|

(note the reversal of the subscripts), which implies that

∂ ln|A|/∂a_ij = (−1)^{i+j}|C_ji| / |A|,

or, collecting terms,

∂ ln|A|/∂A = (A⁻¹)′.

Because the matrices for which we shall make use of this calculation will be symmetric in our applications, the transposition will be unnecessary.
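The derivative results in (A-132) and for ln|A| can be verified by finite differences. A minimal sketch, assuming Python with numpy and an arbitrary nonsymmetric A:

    import numpy as np

    A = np.array([[3., 1.],
                  [0., 2.]])               # deliberately nonsymmetric
    x = np.array([1., 2.])
    h = 1e-6

    # gradient of x'Ax: compare finite differences with (A + A')x
    grad = np.array([((x + h*e) @ A @ (x + h*e) - x @ A @ x) / h
                     for e in np.eye(2)])
    print(grad, (A + A.T) @ x)             # both close to [8, 9]

    # derivative of ln|A|: compare finite differences with (A^(-1))'
    dlogdet = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            Ah = A.copy()
            Ah[i, j] += h
            dlogdet[i, j] = (np.log(np.linalg.det(Ah))
                             - np.log(np.linalg.det(A))) / h
    print(dlogdet, np.linalg.inv(A).T)     # agree to about five decimals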

A.8.2 OPTIMIZATION

Consider finding the x where f(x) is maximized or minimized. Because f′(x) is the slope of f(x), either optimum must occur where f′(x) = 0. Otherwise, the function will be increasing or decreasing at x. This result implies the first-order or necessary condition for an optimum (maximum or minimum):

dy/dx = 0. (A-134)

For a maximum, the function must be concave; for a minimum, it must be convex. The sufficient condition for an optimum is:

for a maximum, d²y/dx² < 0; for a minimum, d²y/dx² > 0. (A-135)

Some functions, such as the sine and cosine functions, have many local optima, that is, many minima and maxima. A function such as (cos x)/(1 + x²), which is a damped cosine wave, does as well but differs in that although it has many local maxima, it has one, at x = 0, at which f(x) is greater than it is at any other point. Thus, x = 0 is the global maximum, whereas the other maxima are only local maxima. Certain functions, such as a quadratic, have only a single optimum. These functions are globally concave if the optimum is a maximum and globally convex if it is a minimum.
For maximizing or minimizing a function of several variables, the first-order conditions are

∂f(x)/∂x = 0. (A-136)


This result is interpreted in the same manner as the necessary condition in the univariate case. At the optimum, it must be true that no small change in any variable leads to an improvement in the function value. In the single-variable case, d²y/dx² must be positive for a minimum and negative for a maximum. The second-order condition for an optimum in the multivariate case is that, at the optimizing value,

H = ∂²f(x)/∂x∂x′ (A-137)

must be positive definite for a minimum and negative definite for a maximum.
In a single-variable problem, the second-order condition can usually be verified by inspection. This situation will not generally be true in the multivariate case. As discussed earlier, checking the definiteness of a matrix is, in general, a difficult problem. For most of the problems encountered in econometrics, however, the second-order condition will be implied by the structure of the problem. That is, the matrix H will usually be of such a form that it is always definite.
For an example of the preceding, consider the problem

maximize_x R = a′x − x′Ax, where

a′ = (5 4 2),

and  213 A = 132. 325 Using some now familiar results, we obtain    5 42 6 x ∂ R 1 = a − 2Ax = 4 − 26 4 x = 0. (A-138) ∂x 2 2 6410 x3 The solutions are  −1   x1 42 6 5 11.25 x2 = 26 4 4 = 1.75 . x3 6410 2 −7.25 The sufficient condition is that  −4 −2 −6 ∂2 R(x) =−2A = −2 −6 −4 (A-139) ∂x ∂x −6 −4 −10 must be negative definite. The three characteristic roots of this matrix are −15.746, −4, and −0.25403. Because all three roots are negative, the matrix is negative definite, as required. In the preceding, it was necessary to compute the characteristic roots of the Hessian to verify the sufficient condition. For a general matrix of order larger than 2, this will normally require a computer. Suppose, however, that A is of the form

A = B′B,

where B is some known matrix. Then, as shown earlier, we know that A will always be positive definite (assuming that B has full rank). In this case, it is not necessary to calculate the characteristic roots of A to verify the sufficient conditions.
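The example in (A-138) and (A-139) can be reproduced in a few lines. A minimal sketch, assuming Python with numpy:

    import numpy as np

    a = np.array([5., 4., 2.])
    A = np.array([[2., 1., 3.],
                  [1., 3., 2.],
                  [3., 2., 5.]])
    x = np.linalg.solve(2 * A, a)       # first-order condition a - 2Ax = 0
    print(x)                            # [11.25  1.75 -7.25]
    print(np.linalg.eigvalsh(-2 * A))   # all negative: a maximum
    print(a @ x - x @ A @ x)            # R = 24.375 at the optimum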


A.8.3 CONSTRAINED OPTIMIZATION

It is often necessary to solve an optimization problem subject to some constraints on the solution. One method is merely to "solve out" the constraints. For example, in the maximization problem considered earlier, suppose that the constraint x₁ = x₂ − x₃ is imposed on the solution. For a single constraint such as this one, it is possible merely to substitute the right-hand side of this equation for x₁ in the objective function and solve the resulting problem as a function of the remaining two variables. For more general constraints, however, or when there is more than one constraint, the method of Lagrange multipliers provides a more straightforward method of solving the problem. We

maximize_x f(x) subject to

c₁(x) = 0,
c₂(x) = 0,
· · · (A-140)

c_J(x) = 0.

The Lagrangean approach to this problem is to find the stationary points—that is, the points at which the derivatives are zero—of

L*(x, λ) = f(x) + Σ_{j=1}^J λⱼ cⱼ(x) = f(x) + λ′c(x). (A-141)

The solutions satisfy the equations

∂L*/∂x = ∂f(x)/∂x + ∂λ′c(x)/∂x = 0 (n × 1),
∂L*/∂λ = c(x) = 0 (J × 1). (A-142)

The second term in ∂L*/∂x is

∂λ′c(x)/∂x = [∂c(x)′/∂x] λ = C′λ, (A-143)

where C is the matrix of derivatives of the constraints with respect to x. The jth row of the J × n matrix C is the vector of derivatives of the jth constraint, cⱼ(x), with respect to x′. Upon collecting terms, the first-order conditions are

∂L*/∂x = ∂f(x)/∂x + C′λ = 0,
∂L*/∂λ = c(x) = 0. (A-144)

There is one very important aspect of the constrained solution to consider. In the unconstrained solution, we have ∂f(x)/∂x = 0. From (A-144), we obtain, for a constrained solution,

∂f(x)/∂x = −C′λ, (A-145)

which will not equal 0 unless λ = 0. This result has two important implications:
• The constrained solution cannot be superior to the unconstrained solution. This is implied by the nonzero gradient at the constrained solution. (That is, unless C = 0, which could happen if the constraints were nonlinear. But, even if so, the solution is still no better than the unconstrained optimum.)
• If the Lagrange multipliers are zero, then the constrained solution will equal the unconstrained solution.


To continue the example begun earlier, suppose that we add the following conditions:

x₁ − x₂ + x₃ = 0,

x₁ + x₂ + x₃ = 0.

To put this in the format of the general problem, write the constraints as c(x) = Cx = 0, where

C = [1 −1 1; 1 1 1].

The Lagrangean function is

R*(x, λ) = a′x − x′Ax + λ′Cx.

Note the dimensions and arrangement of the various parts. In particular, C is a 2 × 3 matrix, with one row for each constraint and one column for each variable in the objective function. The vector of Lagrange multipliers thus has two elements, one for each constraint. The necessary conditions are

a − 2Ax + C′λ = 0 (three equations), (A-146)

and

Cx = 0 (two equations).

These may be combined in the single equation

[−2A C′; C 0] [x; λ] = [−a; 0].

Using the partitioned inverse of (A-74) produces the solutions

λ = −[CA⁻¹C′]⁻¹CA⁻¹a (A-147)

and

x = ½ A⁻¹[I − C′(CA⁻¹C′)⁻¹CA⁻¹]a. (A-148)

The two results, (A-147) and (A-148), yield analytic solutions for λ and x. For the specific matrices and vectors of the example, these are λ′ = [−0.5, −7.5] and the constrained solution vector, x*′ = [1.5, 0, −1.5]. Note that in computing the solution to this sort of problem, it is not necessary to use the rather cumbersome form of (A-148). Once λ is obtained from (A-147), the solution can be inserted in (A-146) for a much simpler computation. The solution

x = ½ A⁻¹a + ½ A⁻¹C′λ

suggests a useful result for the constrained optimum:

constrained solution = unconstrained solution + [2A]⁻¹C′λ. (A-149)

Finally, by inserting the two solutions in the original function, we find that R = 24.375 and R* = 2.25, which illustrates again that the constrained solution (in this maximization problem) is inferior to the unconstrained solution.
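The constrained example can be reproduced directly from (A-146) and (A-147). A minimal sketch, assuming Python with numpy; the printed values match those given above:

    import numpy as np

    a = np.array([5., 4., 2.])
    A = np.array([[2., 1., 3.],
                  [1., 3., 2.],
                  [3., 2., 5.]])
    C = np.array([[1., -1., 1.],
                  [1.,  1., 1.]])
    Ai = np.linalg.inv(A)
    lam = -np.linalg.solve(C @ Ai @ C.T, C @ Ai @ a)   # (A-147)
    x = 0.5 * Ai @ (a + C.T @ lam)                     # insert lam into (A-146)
    print(lam)                 # [-0.5 -7.5]
    print(x)                   # [ 1.5  0.  -1.5]
    print(a @ x - x @ A @ x)   # R* = 2.25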


A.8.4 TRANSFORMATIONS

If a function is strictly monotonic, then it is a one-to-one function. Each y is associated with exactly one value of x, and vice versa. In this case, an inverse function exists, which expresses x as a function of y, written y = f(x) and x = f⁻¹(y). An example is the inverse relationship between the log and the exponential functions. The slope of the inverse function,

J = dx/dy = df⁻¹(y)/dy = f⁻¹′(y),

is the Jacobian of the transformation from y to x. For example, if

y = a + bx,

then

x = −a/b + (1/b)y

is the inverse transformation and

J = dx/dy = 1/b.

Looking ahead to the statistical application of this concept, we observe that if y = f(x) were vertical, then this would no longer be a functional relationship. The same x would be associated with more than one value of y. In this case, at this value of x, we would find that J = 0, indicating a singularity in the function.
If y is a column vector of functions, y = f(x), then J = ∂x/∂y is the n × n matrix whose ijth element is ∂xᵢ/∂yⱼ. Consider the set of linear functions y = Ax = f(x). The inverse transformation is x = f⁻¹(y), which will be

x = A⁻¹y,

if A is nonsingular. If A is singular, then there is no inverse transformation. Let J be the matrix of partial derivatives of the inverse functions:

J = [∂xᵢ/∂yⱼ].

The absolute value of the determinant of J,

abs(|J|) = abs(det(∂x/∂y)),

is the Jacobian determinant of the transformation from y to x. In the nonsingular case,

abs(|J|) = abs(|A⁻¹|) = 1/abs(|A|).


In the singular case, the matrix of partial derivatives will be singular and the determinant of the Jacobian will be zero. In this instance, the singular Jacobian implies that A is singular or, equivalently, that the transformations from x to y are functionally dependent. The singular case is analogous to the single-variable case. Clearly, if the vector x is given, then y = Ax can be computed from x. Whether x can be deduced from y is another question. Evidently, it depends on the Jacobian. If the Jacobian is not zero, then the inverse transformations exist, and we can obtain x. If not, then we cannot obtain x.

APPENDIX B
PROBABILITY AND DISTRIBUTION THEORY

B.1 INTRODUCTION

This appendix reviews the distribution theory used later in the book. A previous course in statistics is assumed, so most of the results will be stated without proof. The more advanced results in the later sections will be developed in greater detail.

B.2 RANDOM VARIABLES

We view our observation on some aspect of the economy as the outcome of a random process that is almost never under our (the analyst's) control. In the current literature, the descriptive (and perspective laden) term data-generating process, or DGP, is often used for this underlying mechanism. The observed (measured) outcomes of the process are assigned unique numeric values. The assignment is one to one; each outcome gets one value, and no two distinct outcomes receive the same value. This outcome variable, X, is a random variable because, until the data are actually observed, it is uncertain what value X will take. Probabilities are associated with outcomes to quantify this uncertainty. We usually use capital letters for the "name" of a random variable and lowercase letters for the values it takes. Thus, the probability that X takes a particular value x might be denoted Prob(X = x).
A random variable is discrete if the set of outcomes is either finite in number or countably infinite. The random variable is continuous if the set of outcomes is infinitely divisible and, hence, not countable. These definitions will correspond to the types of data we observe in practice. Counts of occurrences will provide observations on discrete random variables, whereas measurements such as time or income will give observations on continuous random variables.

B.2.1 PROBABILITY DISTRIBUTIONS

A listing of the values x taken by a random variable X and their associated probabilities is a probability distribution, f (x). For a discrete random variable,

f(x) = Prob(X = x). (B-1)


The axioms of probability require that

1. 0 ≤ Prob(X = x) ≤ 1. (B-2)
2. Σₓ f(x) = 1. (B-3)

For the continuous case, the probability associated with any particular point is zero, and we can only assign positive probabilities to intervals in the range of x. The probability density function (pdf) is defined so that f(x) ≥ 0 and

1. Prob(a ≤ x ≤ b) = ∫ₐᵇ f(x) dx ≥ 0. (B-4)

This result is the area under f (x) in the range from a to b. For a continuous variable,

2. ∫_{−∞}^{+∞} f(x) dx = 1. (B-5)

If the range of x is not infinite, then it is understood that f (x) = 0 anywhere outside the appropriate range. Because the probability associated with any individual point is 0,

Prob(a ≤ x ≤ b) = Prob(a ≤ x < b) = Prob(a < x ≤ b) = Prob(a < x < b).

B.2.2 CUMULATIVE DISTRIBUTION FUNCTION

For any random variable X, the probability that X is less than or equal to a is denoted F(a). F(x) is the cumulative distribution function (cdf). For a discrete random variable,

F(x) = Σ_{X≤x} f(X) = Prob(X ≤ x). (B-6)

In view of the definition of f(x),

f(xᵢ) = F(xᵢ) − F(xᵢ₋₁). (B-7)

For a continuous random variable,

F(x) = ∫_{−∞}^x f(t) dt, (B-8)

and

f(x) = dF(x)/dx. (B-9)

In both the continuous and discrete cases, F(x) must satisfy the following properties:

1. 0 ≤ F(x) ≤ 1. 2. If x > y, then F(x) ≥ F(y). 3. F(+∞) = 1. 4. F(−∞) = 0. From the definition of the cdf, Prob(a < x ≤ b) = F(b) − F(a). (B-10)

Any valid pdf will imply a valid cdf, so there is no need to verify these conditions separately.


B.3 EXPECTATIONS OF A RANDOM VARIABLE

DEFINITION B.1 Mean of a Random Variable
The mean, or expected value, of a random variable is

E[x] = Σₓ x f(x) if x is discrete, or E[x] = ∫ₓ x f(x) dx if x is continuous. (B-11)

The notation Σₓ or ∫ₓ, used henceforth, means the sum or integral over the entire range of values of x. The mean is usually denoted μ. It is a weighted average of the values taken by x, where the weights are the respective probabilities. It is not necessarily a value actually taken by the random variable. For example, the expected number of heads in one toss of a fair coin is ½.
Other measures of central tendency are the median, which is the value m such that Prob(X ≤ m) ≥ ½ and Prob(X ≥ m) ≥ ½, and the mode, which is the value of x at which f(x) takes its maximum. The first of these measures is more frequently used than the second. Loosely speaking, the median corresponds more closely than the mean to the middle of a distribution. It is unaffected by extreme values. In the discrete case, the modal value of x has the highest probability of occurring.
Let g(x) be a function of x. The function that gives the expected value of g(x) is denoted

E[g(x)] = Σₓ g(x) Prob(X = x) if X is discrete, or E[g(x)] = ∫ₓ g(x) f(x) dx if X is continuous. (B-12)

If g(x) = a + bx for constants a and b, then

E[a + bx] = a + bE[x].

An important case is the expected value of a constant a, which is just a.

DEFINITION B.2 Variance of a Random Variable The variance of a random variable is

Var[x] = E[(x − μ)²] = Σₓ (x − μ)² f(x) if x is discrete, or ∫ₓ (x − μ)² f(x) dx if x is continuous. (B-13)

Var[x], which must be positive, is usually denoted σ². This function is a measure of the dispersion of a distribution. Computation of the variance is simplified by using the following


important result:

Var[x] = E[x²] − μ². (B-14)

A convenient corollary to (B-14) is

E[x²] = σ² + μ². (B-15)

By inserting y = a + bx in (B-13) and expanding, we find that

Var[a + bx] = b² Var[x], (B-16)

which implies, for any constant a, that

Var[a] = 0. (B-17)

To describe a distribution, we usually use σ, the positive square root, which is the standard deviation of x. The standard deviation can be interpreted as having the same units of measurement as x and μ. For any random variable x and any positive constant k, the Chebychev inequality states that

Prob(μ − kσ ≤ x ≤ μ + kσ) ≥ 1 − 1/k². (B-18)

Two other measures often used to describe a probability distribution are

skewness = E[(x − μ)³] and kurtosis = E[(x − μ)⁴].

Skewness is a measure of the asymmetry of a distribution. For symmetric distributions,

f(μ − x) = f(μ + x),

and skewness = 0.

For asymmetric distributions, the skewness will be positive if the "long tail" is in the positive direction. Kurtosis is a measure of the thickness of the tails of the distribution. A shorthand expression for other central moments is

μ_r = E[(x − μ)^r].

Because μ_r tends to explode as r grows, the normalized measure, μ_r/σ^r, is often used for description. Two common measures are

skewness coefficient = μ₃/σ³,

and

degree of excess = μ₄/σ⁴ − 3.

The second is based on the normal distribution, which has excess of zero. For any two functions g₁(x) and g₂(x),

E[g₁(x) + g₂(x)] = E[g₁(x)] + E[g₂(x)]. (B-19)

For the general case of a possibly nonlinear g(x),

E[g(x)] = ∫ₓ g(x) f(x) dx, (B-20)


and

Var[g(x)] = ∫ₓ (g(x) − E[g(x)])² f(x) dx. (B-21)

(For convenience, we shall omit the equivalent definitions for discrete variables in the following discussion and use the integral to mean either integration or summation, whichever is appropriate.)
A device used to approximate E[g(x)] and Var[g(x)] is the linear Taylor series approximation:

g(x) ≈ [g(x⁰) − g′(x⁰)x⁰] + g′(x⁰)x = β₁ + β₂x = g*(x). (B-22)

If the approximation is reasonably accurate, then the mean and variance of g*(x) will be approximately equal to the mean and variance of g(x). A natural choice for the expansion point is x⁰ = μ = E[x]. Inserting this value in (B-22) gives

g(x) ≈ [g(μ) − g′(μ)μ] + g′(μ)x, (B-23)

so that E[g(x)] ≈ g(μ), (B-24)

and

Var[g(x)] ≈ [g′(μ)]² Var[x]. (B-25)

A point to note in view of (B-22) to (B-24) is that E[g(x)] will generally not equal g(E[x]). For the special case in which g(x) is concave—that is, where g″(x) < 0—we know from Jensen's inequality that E[g(x)] ≤ g(E[x]). For example, E[log(x)] ≤ log(E[x]).
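The quality of the approximations (B-24) and (B-25) is easy to gauge by simulation. A minimal sketch, assuming Python with numpy, for g(x) = ln x; it also illustrates Jensen's inequality, since the simulated mean falls slightly below ln μ:

    import numpy as np

    rng = np.random.default_rng(3)
    mu, sigma = 10.0, 0.5
    x = rng.normal(mu, sigma, size=1_000_000)
    g = np.log(x)
    print(g.mean(), np.log(mu))        # ~2.3013 versus 2.3026
    print(g.var(), (sigma / mu)**2)    # both approximately 0.0025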

B.4 SOME SPECIFIC PROBABILITY DISTRIBUTIONS

Certain experimental situations naturally give rise to specific probability distributions. In the majority of cases in economics, however, the distributions used are merely models of the observed phenomena. Although the normal distribution, which we shall discuss at length, is the mainstay of econometric research, economists have used a wide variety of other distributions. A few are discussed here.1

B.4.1 THE NORMAL DISTRIBUTION

The general form of the normal distribution with mean μ and standard deviation σ is

f(x | μ, σ²) = [1/(σ√(2π))] e^{−(1/2)[(x−μ)²/σ²]}. (B-26)

This result is usually denoted x ∼ N[μ, σ²]. The standard notation x ∼ f(x) is used to state that "x has probability distribution f(x)." Among the most useful properties of the normal distribution

1A much more complete listing appears in Maddala (1977a, Chapters 3 and 18) and in most mathematical statistics textbooks. See also Poirier (1995) and Stuart and Ord (1989). Another useful reference is Evans, Hastings, and Peacock (1993). Johnson et al. (1974, 1993, 1994, 1995, 1997) is an encyclopedic reference on the subject of statistical distributions.


is its preservation under linear transformation.

If x ∼ N[μ, σ²], then (a + bx) ∼ N[a + bμ, b²σ²]. (B-27)

One particularly convenient transformation is a = −μ/σ and b = 1/σ. The resulting variable z = (x − μ)/σ has the standard normal distribution, denoted N[0, 1], with density

φ(z) = (1/√(2π)) e^{−z²/2}. (B-28)

The specific notation φ(z) is often used for this distribution and Φ(z) for its cdf. It follows from the definitions above that if x ∼ N[μ, σ²], then

f(x) = (1/σ) φ[(x − μ)/σ].

Figure B.1 shows the densities of the standard normal distribution and the normal distribution with mean 0.5, which shifts the distribution to the right, and standard deviation 1.3, which, it can be seen, scales the density so that it is shorter but wider. (The graph is a bit deceiving unless you look closely; both densities are symmetric.) Tables of the standard normal cdf appear in most statistics and econometrics textbooks. Because the form of the distribution does not change under a linear transformation, it is not necessary to tabulate the distribution for other values of μ and σ. For any normally distributed variable,

Prob(a ≤ x ≤ b) = Prob[(a − μ)/σ ≤ (x − μ)/σ ≤ (b − μ)/σ], (B-29)

which can always be read from a table of the standard normal distribution. In addition, because the distribution is symmetric, Φ(−z) = 1 − Φ(z). Hence, it is not necessary to tabulate both the negative and positive halves of the distribution.

FIGURE B.1 The Normal Distribution: f1 = Normal[0, 1] and f2 = Normal[0.5, 1.3] densities.


B.4.2 THE CHI-SQUARED, t, AND F DISTRIBUTIONS

The chi-squared, t, and F distributions are derived from the normal distribution. They arise in econometrics as sums of n or n₁ and n₂ other variables. These three distributions have associated with them one or two "degrees of freedom" parameters, which for our purposes will be the number of variables in the relevant sum. The first of the essential results is
• If z ∼ N[0, 1], then x = z² ∼ chi-squared[1]—that is, chi-squared with one degree of freedom—denoted

z² ∼ χ²[1]. (B-30)

This distribution is a skewed distribution with mean 1 and variance 2. The second result is
• If x₁, . . . , xₙ are n independent chi-squared[1] variables, then

Σ_{i=1}^n xᵢ ∼ chi-squared[n]. (B-31)

The mean and variance of a chi-squared variable with n degrees of freedom are n and 2n, respectively. A number of useful corollaries can be derived using (B-30) and (B-31).
• If zᵢ, i = 1, . . . , n, are independent N[0, 1] variables, then

Σ_{i=1}^n zᵢ² ∼ χ²[n]. (B-32)

• If zᵢ, i = 1, . . . , n, are independent N[0, σ²] variables, then

Σ_{i=1}^n (zᵢ/σ)² ∼ χ²[n]. (B-33)

• If x₁ and x₂ are independent chi-squared variables with n₁ and n₂ degrees of freedom, respectively, then

x₁ + x₂ ∼ χ²[n₁ + n₂]. (B-34)

This result can be generalized to the sum of an arbitrary number of independent chi-squared variables. Figure B.2 shows the chi-squared density for three degrees of freedom. The amount of skewness declines as the number of degrees of freedom rises. Unlike the normal distribution, a separate table is required for the chi-squared distribution for each value of n. Typically, only a few percentage points of the distribution are tabulated for each n. Table G.3 in Appendix G of this book gives lower (left) tail areas for a number of values.
• If x₁ and x₂ are two independent chi-squared variables with degrees of freedom parameters n₁ and n₂, respectively, then the ratio

F[n₁, n₂] = (x₁/n₁)/(x₂/n₂) (B-35)

has the F distribution with n₁ and n₂ degrees of freedom.

The two degrees of freedom parameters n₁ and n₂ are the numerator and denominator degrees of freedom, respectively. Tables of the F distribution must be computed for each pair of values of (n₁, n₂). As such, only one or two specific values, such as the 95 percent and 99 percent upper tail values, are tabulated in most cases.


FIGURE B.2 The Chi-Squared[3] Distribution.

• If z is an N[0, 1] variable and x is χ²[n] and is independent of z, then the ratio

t[n] = z/√(x/n) (B-36)

has the t distribution with n degrees of freedom.
The t distribution has the same shape as the normal distribution but has thicker tails. Figure B.3 illustrates the t distributions with three and 10 degrees of freedom with the standard normal distribution. Two effects that can be seen in the figure are how the distribution changes as the degrees of freedom increases, and, overall, the similarity of the t distribution to the standard normal. This distribution is tabulated in the same manner as the chi-squared distribution, with several specific cutoff points corresponding to specified tail areas for various values of the degrees of freedom parameter.
Comparing (B-35) with n₁ = 1 and (B-36), we see the useful relationship between the t and F distributions:
• If t ∼ t[n], then t² ∼ F[1, n].
If the numerator in (B-36) has a nonzero mean, then the random variable in (B-36) has a noncentral t distribution and its square has a noncentral F distribution. These distributions arise in the F tests of linear restrictions [see (5-6)] when the restrictions do not hold as follows:
1. Noncentral chi-squared distribution. If z has a normal distribution with mean μ and standard deviation 1, then the distribution of z² is noncentral chi-squared with parameters 1 and μ²/2.
a. If z ∼ N[μ, Σ] with J elements, then z′Σ⁻¹z has a noncentral chi-squared distribution with J degrees of freedom and noncentrality parameter μ′Σ⁻¹μ/2, which we denote χ*²[J, μ′Σ⁻¹μ/2].
b. If z ∼ N[μ, I] and M is an idempotent matrix with rank J, then z′Mz ∼ χ*²[J, μ′Mμ/2].


FIGURE B.3 The Standard Normal, t[3], and t[10] Distributions.

2. Noncentral F distribution. If X₁ has a noncentral chi-squared distribution with noncentrality parameter λ and degrees of freedom n₁ and X₂ has a central chi-squared distribution with degrees of freedom n₂ and is independent of X₁, then

F* = (X₁/n₁)/(X₂/n₂)

has a noncentral F distribution with parameters n₁, n₂, and λ.² Note that in each of these cases, the statistic and the distribution are the familiar ones, except that the effect of the nonzero mean, which induces the noncentrality, is to push the distribution to the right.

B.4.3 DISTRIBUTIONS WITH LARGE DEGREES OF FREEDOM

The chi-squared, t, and F distributions usually arise in connection with sums of sample observations. The degrees of freedom parameter in each case grows with the number of observations. We often deal with larger degrees of freedom than are shown in the tables. Thus, the standard tables are often inadequate. In all cases, however, there are limiting distributions that we can use when the degrees of freedom parameter grows large. The simplest case is the t distribution. The t distribution with infinite degrees of freedom is equivalent to the standard normal distribution. Beyond about 100 degrees of freedom, they are almost indistinguishable. For degrees of freedom greater than 30, a reasonably good approximation for the distribution of the chi-squared variable x is

z = (2x)^{1/2} − (2n − 1)^{1/2}, (B-37)

which is approximately standard normally distributed. Thus,

Prob(χ²[n] ≤ a) ≈ Φ[(2a)^{1/2} − (2n − 1)^{1/2}].

2The denominator chi-squared could also be noncentral, but we shall not use any statistics with doubly noncentral distributions.


As used in econometrics, the F distribution with a large-denominator degrees of freedom is common. As n₂ becomes infinite, the denominator of F converges identically to one, so we can treat the variable

x = n₁F (B-38)

as a chi-squared variable with n₁ degrees of freedom. The numerator degree of freedom will typically be small, so this approximation will suffice for the types of applications we are likely to encounter.3 If not, then the approximation given earlier for the chi-squared distribution can be applied to n₁F.
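Both large-sample devices can be checked against exact distributions. A minimal sketch, assuming Python with scipy, comparing (B-37) with the exact chi-squared cdf at n = 50:

    import numpy as np
    from scipy import stats

    n, a = 50, 60.0
    exact = stats.chi2.cdf(a, df=n)
    approx = stats.norm.cdf(np.sqrt(2 * a) - np.sqrt(2 * n - 1))
    print(exact, approx)    # both approximately 0.84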

B.4.4 SIZE DISTRIBUTIONS: THE LOGNORMAL DISTRIBUTION

In modeling size distributions, such as the distribution of firm sizes in an industry or the distribution of income in a country, the lognormal distribution, denoted LN[μ, σ²], has been particularly useful.4

f(x) = [1/(xσ√(2π))] e^{−(1/2)[(ln x − μ)/σ]²}, x > 0.

A lognormal variable x has

E[x] = e^{μ + σ²/2},

and

Var[x] = e^{2μ + σ²}(e^{σ²} − 1).

The relation between the normal and lognormal distributions is

If y ∼ LN[μ, σ²], then ln y ∼ N[μ, σ²].

A useful result for transformations is given as follows: If x has a lognormal distribution with mean θ and variance λ², then

ln x ∼ N(μ, σ²), where μ = ln θ² − ½ ln(θ² + λ²) and σ² = ln(1 + λ²/θ²).

Because the normal distribution is preserved under linear transformation,

if y ∼ LN[μ, σ²], then ln y^r ∼ N[rμ, r²σ²].

If y₁ and y₂ are independent lognormal variables with y₁ ∼ LN[μ₁, σ₁²] and y₂ ∼ LN[μ₂, σ₂²], then

y₁y₂ ∼ LN[μ₁ + μ₂, σ₁² + σ₂²].

B.4.5 THE GAMMA AND EXPONENTIAL DISTRIBUTIONS

The gamma distribution has been used in a variety of settings, including the study of income distribution5 and production functions.6 The general form of the distribution is

f(x) = [λ^P/Γ(P)] e^{−λx} x^{P−1}, x ≥ 0, λ > 0, P > 0. (B-39)

Many familiar distributions are special cases, including the exponential distribution (P = 1) and chi-squared (λ = ½, P = n/2). The Erlang distribution results if P is a positive integer. The mean is P/λ, and the variance is P/λ². The inverse gamma distribution is the distribution of 1/x, where x

3See Johnson, Kotz, and Balakrishnan (1994) for other approximations.
4A study of applications of the lognormal distribution appears in Aitchison and Brown (1969).
5Salem and Mount (1974).
6Greene (1980a).


has the gamma distribution. Using the change of variable, y = 1/x, the Jacobian is |dx/dy| = 1/y². Making the substitution and the change of variable, we find

f(y) = [λ^P/Γ(P)] e^{−λ/y} y^{−(P+1)}, y ≥ 0, λ > 0, P > 0.

The density is defined for positive P. However, the mean is λ/(P − 1), which is defined only if P > 1, and the variance is λ²/[(P − 1)²(P − 2)], which is defined only for P > 2.

B.4.6 THE BETA DISTRIBUTION

Distributions for models are often chosen on the basis of the range within which the random variable is constrained to vary. The lognormal distribution, for example, is sometimes used to model a variable that is always nonnegative. For a variable constrained between 0 and c > 0, the beta distribution has proved useful. Its density is

f(x) = [Γ(α + β)/(Γ(α)Γ(β))] (x/c)^{α−1} (1 − x/c)^{β−1} (1/c). (B-40)

This functional form is extremely flexible in the shapes it will accommodate. It is symmetric if α = β, asymmetric otherwise, and can be hump-shaped or U-shaped. The mean is cα/(α + β), and the variance is c²αβ/[(α + β + 1)(α + β)²]. The beta distribution has been applied in the study of labor force participation rates.7

B.4.7 THE LOGISTIC DISTRIBUTION

The normal distribution is ubiquitous in econometrics. But researchers have found that for some microeconomic applications, there does not appear to be enough mass in the tails of the normal distribution; observations that a model based on normality would classify as "unusual" seem not to be very unusual at all. One approach has been to use thicker-tailed symmetric distributions. The logistic distribution is one candidate; the cdf for a logistic random variable is denoted

F(x) = Λ(x) = 1/(1 + e^{−x}).

The density is f(x) = Λ(x)[1 − Λ(x)]. The mean and variance of this random variable are zero and π²/3.

B.4.8 THE WISHART DISTRIBUTION

The Wishart distribution describes the distribution of a random matrix obtained as

W = Σ_{i=1}^n (xᵢ − μ)(xᵢ − μ)′,

where xᵢ is the ith of n K-element random vectors from the multivariate normal distribution with mean vector, μ, and covariance matrix, Σ. This is a multivariate counterpart to the chi-squared distribution. The density of the Wishart random matrix is

f(W) = [exp(−½ trace(Σ⁻¹W)) |W|^{(n−K−1)/2}] / [2^{nK/2} |Σ|^{n/2} π^{K(K−1)/4} Π_{j=1}^K Γ((n + 1 − j)/2)].

The mean matrix is nΣ. For the individual pairs of elements in W,

Cov[w_ij, w_rs] = n(σ_ir σ_js + σ_is σ_jr).

7Heckman and Willis (1976).


FIGURE B.4 The Poisson[3] Distribution.

B.4.9 DISCRETE RANDOM VARIABLES

Modeling in economics frequently involves random variables that take integer values. In these cases, the distributions listed thus far only provide approximations that are sometimes quite inappropriate. We can build up a class of models for discrete random variables from the Bernoulli distribution for a single binomial outcome (trial),

Prob(x = 1) = α,
Prob(x = 0) = 1 − α,

where 0 ≤ α ≤ 1. The modeling aspect of this specification would be the assumptions that the success probability α is constant from one trial to the next and that successive trials are independent. If so, then the distribution for x successes in n trials is the binomial distribution,

Prob(X = x) = (n choose x) α^x (1 − α)^{n−x}, x = 0, 1, . . . , n.

The mean and variance of x are nα and nα(1 − α), respectively. If the number of trials becomes large at the same time that the success probability becomes small so that the mean nα is stable, then the limiting form of the binomial distribution is the Poisson distribution,

Prob(X = x) = e^{−λ}λ^x / x!.

The Poisson distribution has seen wide use in econometrics in, for example, modeling patents, crime, recreation demand, and demand for health services. (See Chapter 25.) An example is shown in Figure B.4.
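The Poisson limit of the binomial is easy to see numerically. A minimal sketch, assuming Python with scipy, with n = 1000 and α = 0.003, so that nα = 3 as in Figure B.4:

    from scipy import stats

    n, alpha = 1000, 0.003
    for x in range(6):
        print(x, stats.binom.pmf(x, n, alpha),
              stats.poisson.pmf(x, n * alpha))
    # the two probability columns agree to about three decimal places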

B.5 THE DISTRIBUTION OF A FUNCTION OF A RANDOM VARIABLE

We considered finding the expected value of a function of a random variable. It is fairly common to analyze the random variable itself, which results when we compute a function of some random variable. There are three types of transformation to consider. One discrete random variable may


be transformed into another, a continuous variable may be transformed into a discrete one, and one continuous variable may be transformed into another. The simplest case is the first one. The probabilities associated with the new variable are computed according to the laws of probability. If y is derived from x and the function is one to one, then the probability that Y = y(x) equals the probability that X = x. If several values of x yield the same value of y, then Prob(Y = y) is the sum of the corresponding probabilities for x. The second type of transformation is illustrated by the way individual data on income are typically obtained in a survey. Income in the population can be expected to be distributed according to some skewed, continuous distribution such as the one shown in Figure B.5. Data are often reported categorically, as shown in the lower part of the figure. Thus, the random variable corresponding to observed income is a discrete transformation of the actual underlying continuous random variable. Suppose, for example, that the transformed variable y is the mean income in the respective interval. Then

Prob(Y = μ1) = P(−∞ < X ≤ a),

Prob(Y = μ2) = P(a < X ≤ b),

Prob(Y = μ3) = P(b < X ≤ c),

and so on, which illustrates the general procedure. If x is a continuous random variable with pdf fx(x) and if y = g(x) is a continuous monotonic function of x, then the density of y is obtained by using the change of variable technique to find

FIGURE B.5  Censored Distribution.


the cdf of y:
$$\operatorname{Prob}(y \le b) = \int_{-\infty}^{b} f_x\big(g^{-1}(y)\big) \left|\frac{dg^{-1}(y)}{dy}\right| dy.$$
This equation can now be written as
$$\operatorname{Prob}(y \le b) = \int_{-\infty}^{b} f_y(y)\, dy.$$
Hence,
$$f_y(y) = f_x\big(g^{-1}(y)\big) \left|\frac{dg^{-1}(y)}{dy}\right|. \tag{B-41}$$

To avoid the possibility of a negative pdf if g(x) is decreasing, we use the absolute value of the derivative in the previous expression. The term |dg⁻¹(y)/dy| must be nonzero for the density of y to be nonzero. In words, the probabilities associated with intervals in the range of y must be associated with intervals in the range of x. If the derivative is zero, the correspondence y = g(x) is vertical, and hence all values of y in the given range are associated with the same value of x. This single point must have probability zero. One of the most useful applications of the preceding result is the linear transformation of a normally distributed variable. If x ∼ N[μ, σ²], then the distribution of
$$y = \frac{x - \mu}{\sigma}$$

is found using the preceding result. First, the derivative is obtained from the inverse transformation
$$y = \frac{x}{\sigma} - \frac{\mu}{\sigma} \;\Rightarrow\; x = \sigma y + \mu \;\Rightarrow\; \frac{dx}{dy} = \sigma.$$

Therefore,

$$f_y(y) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-[(\sigma y + \mu) - \mu]^2/(2\sigma^2)}\, |\sigma| = \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}.$$

This is the density of a normally distributed variable with mean zero and standard deviation one. This is the result that makes it unnecessary to have separate tables for the different normal distributions that result from different means and variances.

B.6 REPRESENTATIONS OF A PROBABILITY DISTRIBUTION

The probability density function (pdf) is a natural and familiar way to formulate the distribution of a random variable. But, there are many other functions that are used to identify or characterize a random variable, depending on the setting. In each of these cases, we can identify some other function of the random variable that has a one-to-one relationship with the density. We have already used one of these quite heavily in the preceding discussion. For a random variable which has density function f(x), the distribution function, F(x), is an equally informative function that identifies the distribution; the relationship between f(x) and F(x) is defined in (B-6) for a discrete random variable and (B-8) for a continuous one. We now consider several other related functions. For a continuous random variable, the survival function is S(x) = 1 − F(x) = Prob[X ≥ x]. This function is widely used in epidemiology, where x is time until some transition, such as recovery


from a disease. The hazard function for a random variable is
$$h(x) = \frac{f(x)}{S(x)} = \frac{f(x)}{1 - F(x)}.$$
The hazard function is a conditional probability per unit of time;
$$h(x) = \lim_{t \downarrow 0} \frac{\operatorname{Prob}(x \le X \le x + t \mid X \ge x)}{t}.$$

Hazard functions have been used in econometrics in studying the duration of spells, or conditions, such as unemployment, strikes, time until business failures, and so on. The connection between the hazard and the other functions is h(x) = −d ln S(x)/dx. As an exercise, you might want to verify the interesting special case of h(x) = 1/λ, a constant; the only distribution which has this characteristic is the exponential distribution noted in Section B.4.5. For the random variable X, with probability density function f(x), if the function

M(t) = E[etx]

exists, then it is the moment generating function. Assuming the function exists, it can be shown that

$$\frac{d^r M(t)}{dt^r}\bigg|_{t=0} = E[x^r].$$

The moment generating function, like the survival and the hazard functions, is a unique characterization of a probability distribution. When it exists, the moment generating function (MGF) has a one-to-one correspondence with the distribution. Thus, for example, if we begin with some random variable and find that a transformation of it has a particular MGF, then we may infer that the function of the random variable has the distribution associated with that MGF. A convenient application of this result is the MGF for the normal distribution. The MGF for the standard normal distribution is M_z(t) = e^{t²/2}. A useful feature of MGFs is the following:

if x and y are independent, then the MGF of x + y is M_x(t)M_y(t). This result has been used to establish the contagion property of some distributions, that is, the property that sums of random variables with a given distribution have that same distribution. The normal distribution is a familiar example; the Poisson and chi-squared distributions are others. Most distributions do not have this property. One qualification of all of the preceding is that in order for these results to hold, the MGF must exist. It will for the distributions that we will encounter in our work, but in at least one important case, we cannot be sure of this. When computing sums of random variables which may have different distributions and whose specific distributions need not be so well behaved, it is likely that the MGF of the sum does not exist. However, the characteristic function,

$$\phi(t) = E[e^{itx}], \qquad i^2 = -1,$$

will always exist, at least for relatively small t. The characteristic function is the device used to prove that certain sums of random variables converge to a normally distributed variable; that is, the characteristic function is a fundamental tool in proofs of the central limit theorem.
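As a quick Monte Carlo check of the normal MGF given above (sample size and seed are arbitrary), the sample average of e^{tx} over standard normal draws should approximate e^{t²/2}:

```python
import numpy as np

# Estimate E[exp(t*x)] for standard normal x and compare to exp(t^2/2).
rng = np.random.default_rng(1)
x = rng.standard_normal(1_000_000)
for t in (0.5, 1.0, 1.5):
    print(t, np.mean(np.exp(t * x)), np.exp(t**2 / 2))
```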


B.7 JOINT DISTRIBUTIONS

The joint density function for two random variables X and Y, denoted f(x, y), is defined so that
$$\operatorname{Prob}(a \le x \le b,\ c \le y \le d) = \begin{cases} \sum_{a \le x \le b} \sum_{c \le y \le d} f(x, y) & \text{if } x \text{ and } y \text{ are discrete,} \\[4pt] \int_a^b \int_c^d f(x, y)\, dy\, dx & \text{if } x \text{ and } y \text{ are continuous.} \end{cases} \tag{B-42}$$
The counterparts of the requirements for a univariate probability density are

$$f(x, y) \ge 0, \qquad \sum_x \sum_y f(x, y) = 1 \ \text{if } x \text{ and } y \text{ are discrete,} \qquad \int_x \int_y f(x, y)\, dy\, dx = 1 \ \text{if } x \text{ and } y \text{ are continuous.} \tag{B-43}$$
The cumulative probability is likewise the probability of a joint event:

$$F(x, y) = \operatorname{Prob}(X \le x, Y \le y) = \begin{cases} \sum_{X \le x} \sum_{Y \le y} f(x, y) & \text{in the discrete case,} \\[4pt] \int_{-\infty}^{x} \int_{-\infty}^{y} f(t, s)\, ds\, dt & \text{in the continuous case.} \end{cases} \tag{B-44}$$

B.7.1 MARGINAL DISTRIBUTIONS

A marginal probability density or marginal probability distribution is defined with respect to an individual variable. To obtain the marginal distributions from the joint density, it is necessary to sum or integrate out the other variable:
$$f_x(x) = \begin{cases} \sum_y f(x, y) & \text{in the discrete case,} \\[4pt] \int_y f(x, s)\, ds & \text{in the continuous case,} \end{cases} \tag{B-45}$$

and similarly for fy(y). Two random variables are statistically independent if and only if their joint density is the product of the marginal densities:

f (x, y) = fx(x) fy(y) ⇔ x and y are independent. (B-46)

If (and only if) x and y are independent, then the cdf factors as well as the pdf:

F(x, y) = Fx(x)Fy(y), (B-47)

or

Prob(X ≤ x, Y ≤ y) = Prob(X ≤ x)Prob(Y ≤ y).


B.7.2 EXPECTATIONS IN A JOINT DISTRIBUTION

The means, variances, and higher moments of the variables in a joint distribution are defined with respect to the marginal distributions. For the mean of x in a discrete distribution, 

$$E[x] = \sum_x x f_x(x) = \sum_x x \left[\sum_y f(x, y)\right] = \sum_x \sum_y x f(x, y). \tag{B-48}$$
The means of the variables in a continuous distribution are defined likewise, using integration instead of summation:

$$E[x] = \int_x x f_x(x)\, dx = \int_x \int_y x f(x, y)\, dy\, dx. \tag{B-49}$$
Variances are computed in the same manner:
$$\operatorname{Var}[x] = \sum_x \big(x - E[x]\big)^2 f_x(x) = \sum_x \sum_y \big(x - E[x]\big)^2 f(x, y). \tag{B-50}$$

B.7.3 COVARIANCE AND CORRELATION

For any function g(x, y),
$$E[g(x, y)] = \begin{cases} \sum_x \sum_y g(x, y) f(x, y) & \text{in the discrete case,} \\[4pt] \int_x \int_y g(x, y) f(x, y)\, dy\, dx & \text{in the continuous case.} \end{cases} \tag{B-51}$$
The covariance of x and y is a special case:

Cov[x, y] = E[(x − μx)(y − μy)]

= E[xy] − μxμy (B-52)

= σxy.

If x and y are independent, then f (x, y) = fx(x) fy(y) and  

$$\sigma_{xy} = \sum_x \sum_y f_x(x) f_y(y)(x - \mu_x)(y - \mu_y) = \left[\sum_x (x - \mu_x) f_x(x)\right]\left[\sum_y (y - \mu_y) f_y(y)\right] = E[x - \mu_x]\, E[y - \mu_y] = 0.$$


The sign of the covariance will indicate the direction of covariation of X and Y. Its magnitude depends on the scales of measurement, however. In view of this fact, a preferable measure is the correlation coefficient:

$$r[x, y] = \rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}, \tag{B-53}$$

where σ_x and σ_y are the standard deviations of x and y, respectively. The correlation coefficient has the same sign as the covariance but is always between −1 and 1 and is thus unaffected by any scaling of the variables. Variables that are uncorrelated are not necessarily independent. For example, in the discrete distribution f(−1, 1) = f(0, 0) = f(1, 1) = 1/3, the correlation is zero, but f(1, 1) does not equal f_x(1) f_y(1) = (1/3)(2/3) (a small numerical check appears at the end of this section). An important exception is the joint normal distribution discussed subsequently, in which lack of correlation does imply independence. Some general results regarding expectations in a joint distribution, which can be verified by applying the appropriate definitions, are

E[ax + by + c] = aE[x] + bE[y] + c, (B-54)

Var[ax + by + c] = a²Var[x] + b²Var[y] + 2ab Cov[x, y] = Var[ax + by], (B-55)

and

Cov[ax + by, cx + dy] = ac Var[x] + bd Var[y] + (ad + bc)Cov[x, y]. (B-56)

If X and Y are uncorrelated, then

Var[x + y] = Var[x − y] = Var[x] + Var[y]. (B-57)

For any two functions g₁(x) and g₂(y), if x and y are independent, then

E[g1(x)g2(y)] = E[g1(x)]E[g2(y)]. (B-58)
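The numerical check promised above for the uncorrelated-but-dependent example is sketched below; it simply evaluates the moments of the three-point distribution directly.

```python
import numpy as np

# The three-point distribution from the text: (x, y) takes the values
# (-1, 1), (0, 0), (1, 1), each with probability 1/3.
pts = np.array([(-1, 1), (0, 0), (1, 1)])
p = np.full(3, 1/3)
x, y = pts[:, 0], pts[:, 1]
mx, my = p @ x, p @ y
cov = p @ ((x - mx) * (y - my))
print('covariance:', cov)                                 # 0 -> uncorrelated
print('f(1,1) =', 1/3, ' fx(1)*fy(1) =', (1/3) * (2/3))   # unequal -> dependent
```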

B.7.4 DISTRIBUTION OF A FUNCTION OF BIVARIATE RANDOM VARIABLES

The result for a function of a random variable in (B-41) must be modified for a joint distribution. Suppose that x1 and x2 have a joint distribution fx(x1, x2) and that y1 and y2 are two monotonic functions of x1 and x2:

y1 = y1(x1, x2),

y2 = y2(x1, x2).

Because the functions are monotonic, the inverse transformations,

x1 = x1(y1, y2),

x2 = x2(y1, y2),


exist. The Jacobian of the transformations is the matrix of partial derivatives,

$$\mathbf{J} = \begin{bmatrix} \partial x_1/\partial y_1 & \partial x_1/\partial y_2 \\ \partial x_2/\partial y_1 & \partial x_2/\partial y_2 \end{bmatrix} = \frac{\partial \mathbf{x}}{\partial \mathbf{y}'}.$$

The joint distribution of y1 and y2 is

fy(y1, y2) = fx[x1(y1, y2), x2(y1, y2)]abs(|J|).

The determinant of the Jacobian must be nonzero for the transformation to exist. A zero deter- minant implies that the two transformations are functionally dependent. Certainly the most common application of the preceding in econometrics is the linear trans- formation of a set of random variables. Suppose that x1 and x2 are independently distributed N[0, 1], and the transformations are

y1 = α1 + β11x1 + β12x2,

y2 = α2 + β21x1 + β22x2.

To obtain the joint distribution of y1 and y2, we first write the transformations as

y = a + Bx.

The inverse transformation is

x = B⁻¹(y − a),

so the absolute value of the determinant of the Jacobian is
$$\operatorname{abs}|\mathbf{J}| = \operatorname{abs}|\mathbf{B}^{-1}| = \frac{1}{\operatorname{abs}|\mathbf{B}|}.$$

The joint distribution of x is the product of the marginal distributions since they are independent. Thus,

$$f_x(\mathbf{x}) = (2\pi)^{-1} e^{-(x_1^2 + x_2^2)/2} = (2\pi)^{-1} e^{-\mathbf{x}'\mathbf{x}/2}.$$

Inserting the results for x(y) and J into fy(y1, y2) gives

$$f_y(\mathbf{y}) = \frac{1}{\operatorname{abs}|\mathbf{B}|}\, (2\pi)^{-1} e^{-(\mathbf{y}-\mathbf{a})'(\mathbf{B}\mathbf{B}')^{-1}(\mathbf{y}-\mathbf{a})/2}.$$

This bivariate normal distribution is the subject of Section B.9. Note that by formulating it as we did earlier, we can generalize easily to the multivariate case, that is, with an arbitrary number of variables. (A small simulation check of this transformation appears at the end of this section.) Perhaps the more common situation is that in which it is necessary to find the distribution of one function of two (or more) random variables. A strategy that often works in this case is to form the joint distribution of the transformed variable and one of the original variables, then integrate (or sum) the latter out of the joint distribution to obtain the marginal distribution. Thus, to find the distribution of y1(x1, x2), we might formulate

y1 = y1(x1, x2)

y2 = x2.


The absolute value of the determinant of the Jacobian would then be
$$\operatorname{abs}|\mathbf{J}| = \operatorname{abs}\begin{vmatrix} \partial x_1/\partial y_1 & \partial x_1/\partial y_2 \\ 0 & 1 \end{vmatrix} = \operatorname{abs}\left(\frac{\partial x_1}{\partial y_1}\right).$$

The density of y₁ would then be
$$f_{y_1}(y_1) = \int_{y_2} f_x\big[x_1(y_1, y_2), y_2\big]\, \operatorname{abs}|\mathbf{J}|\, dy_2.$$
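Returning to the linear transformation y = a + Bx of independent standard normals, the promised simulation check is sketched below (a and B are illustrative values); the sample covariance matrix of y should approximate BB′.

```python
import numpy as np

# Draw independent N[0,1] pairs x, transform to y = a + Bx, and compare the
# sample covariance of y with the theoretical covariance BB'.
rng = np.random.default_rng(2)
a = np.array([1.0, -2.0])
B = np.array([[2.0, 0.5],
              [0.3, 1.5]])
x = rng.standard_normal((100_000, 2))
y = a + x @ B.T
print(np.round(np.cov(y, rowvar=False), 3))   # sample covariance
print(B @ B.T)                                # theoretical covariance BB'
```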

B.8 CONDITIONING IN A BIVARIATE DISTRIBUTION

Conditioning and the use of conditional distributions play a pivotal role in econometric modeling. We consider some general results for a bivariate distribution. (All these results can be extended directly to the multivariate case.) In a bivariate distribution, there is a conditional distribution over y for each value of x. The conditional densities are
$$f(y \mid x) = \frac{f(x, y)}{f_x(x)}, \tag{B-59}$$
and
$$f(x \mid y) = \frac{f(x, y)}{f_y(y)}.$$
It follows from (B-46) that:

If x and y are independent, then f (y | x) = fy(y) and f (x | y) = fx(x). (B-60)

The interpretation is that if the variables are independent, the probabilities of events relating to one variable are unrelated to the other. The definition of conditional densities implies the important result

f(x, y) = f(y | x) f_x(x) = f(x | y) f_y(y). (B-61)

B.8.1 REGRESSION: THE CONDITIONAL MEAN

A conditional mean is the mean of the conditional distribution and is defined by
$$E[y \mid x] = \begin{cases} \int_y y f(y \mid x)\, dy & \text{if } y \text{ is continuous,} \\[4pt] \sum_y y f(y \mid x) & \text{if } y \text{ is discrete.} \end{cases} \tag{B-62}$$
The conditional mean function E[y | x] is called the regression of y on x. A random variable may always be written as
$$y = E[y \mid x] + \big(y - E[y \mid x]\big)$$

= E[y | x] + ε.


B.8.2 CONDITIONAL VARIANCE

A conditional variance is the variance of the conditional distribution:
$$\operatorname{Var}[y \mid x] = E\big[\big(y - E[y \mid x]\big)^2 \,\big|\, x\big] = \int_y \big(y - E[y \mid x]\big)^2 f(y \mid x)\, dy, \quad \text{if } y \text{ is continuous,} \tag{B-63}$$

or
$$\operatorname{Var}[y \mid x] = \sum_y \big(y - E[y \mid x]\big)^2 f(y \mid x), \quad \text{if } y \text{ is discrete.} \tag{B-64}$$

The computation can be simplified by using
$$\operatorname{Var}[y \mid x] = E[y^2 \mid x] - \big(E[y \mid x]\big)^2. \tag{B-65}$$

The conditional variance is called the scedastic function and, like the regression, is generally a function of x. Unlike the conditional mean function, however, it is common for the conditional variance not to vary with x. We shall examine a particular case. This case does not imply, however, that Var[y | x] equals Var[y], which will usually not be true. It implies only that the conditional variance is a constant. The case in which the conditional variance does not vary with x is called homoscedasticity (same variance).

B.8.3 RELATIONSHIPS AMONG MARGINAL AND CONDITIONAL MOMENTS

Some useful results for the moments of a conditional distribution are given in the following theorems.

THEOREM B.1 Law of Iterated Expectations

E[y] = Ex[E[y | x]]. (B-66)

The notation Ex[.] indicates the expectation over the values of x. Note that E[y | x] is a function of x.

THEOREM B.2  Covariance
In any bivariate distribution,
$$\operatorname{Cov}[x, y] = \operatorname{Cov}_x\big[x, E[y \mid x]\big] = \int_x \big(x - E[x]\big)\, E[y \mid x]\, f_x(x)\, dx. \tag{B-67}$$
(Note that this is the covariance of x and a function of x.)


The preceding results provide an additional, extremely useful result for the special case in which the conditional mean function is linear in x.

THEOREM B.3  Moments in a Linear Regression
If E[y | x] = α + βx, then
$$\alpha = E[y] - \beta E[x]$$
and
$$\beta = \frac{\operatorname{Cov}[x, y]}{\operatorname{Var}[x]}. \tag{B-68}$$
The proof follows from (B-66) and (B-67).

The preceding theorems relate to the conditional mean in a bivariate distribution. The follow- ing theorems, which also appear in various forms in regression analysis, describe the conditional variance.

THEOREM B.4 Decomposition of Variance In a joint distribution,

Var[y] = Var_x[E[y | x]] + E_x[Var[y | x]]. (B-69)

The notation Var_x[.] indicates the variance over the distribution of x. This equation states that in a bivariate distribution, the variance of y decomposes into the variance of the conditional mean function plus the expected variance around the conditional mean.

THEOREM B.5 Residual Variance in a Regression In any bivariate distribution,

E_x[Var[y | x]] = Var[y] − Var_x[E[y | x]]. (B-70)

On average, conditioning reduces the variance of the variable subject to the conditioning. For example, if y is homoscedastic, then we have the unambiguous result that the variance of the conditional distribution(s) is less than or equal to the unconditional variance of y. Going a step further, we have the result that appears prominently in the bivariate normal distribution (Section B.9).


THEOREM B.6  Linear Regression and Homoscedasticity
In a bivariate distribution, if E[y | x] = α + βx and if Var[y | x] is a constant, then
$$\operatorname{Var}[y \mid x] = \operatorname{Var}[y]\big(1 - \operatorname{Corr}^2[y, x]\big) = \sigma_y^2\big(1 - \rho_{xy}^2\big). \tag{B-71}$$
The proof is straightforward using Theorems B.2 to B.4.

B.8.4 THE ANALYSIS OF VARIANCE

The variance decomposition result implies that in a bivariate distribution, variation in y arises from two sources: 1. Variation because E[y | x] varies with x:

regression variance = Var x[E[y | x]]. (B-72)

2. Variation because, in each conditional distribution, y varies around the conditional mean:

residual variance = Ex[Var[y | x]]. (B-73)

Thus,

Var[y] = regression variance + residual variance. (B-74)

In analyzing a regression, we shall usually be interested in which of the two parts of the total variance, Var[y], is the larger one. A natural measure is the ratio
$$\text{coefficient of determination} = \frac{\text{regression variance}}{\text{total variance}}. \tag{B-75}$$
In the setting of a linear regression, (B-75) arises from another relationship that emphasizes the interpretation of the correlation coefficient.

If E[y | x] = α + βx, then the coefficient of determination = COD = ρ², (B-76)

where ρ² is the squared correlation between x and y. We conclude that the correlation coefficient (squared) is a measure of the proportion of the variance of y accounted for by variation in the mean of y given x. It is in this sense that correlation can be interpreted as a measure of linear association between two variables.
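A Monte Carlo sketch of the variance decomposition and (B-76) follows; the linear conditional mean, the homoscedastic noise, and all parameter values are assumptions made for the illustration.

```python
import numpy as np

# y = alpha + beta*x + eps with homoscedastic eps, so
# Var_x[E[y|x]] = beta^2 Var[x] and E_x[Var[y|x]] = sigma_eps^2.
rng = np.random.default_rng(3)
alpha, beta, sig_eps = 1.0, 0.8, 1.5
x = rng.normal(0.0, 2.0, size=500_000)
y = alpha + beta * x + rng.normal(0.0, sig_eps, size=x.size)

reg_var = np.var(alpha + beta * x)     # variance of the conditional mean
res_var = sig_eps**2                   # expected conditional variance
print(np.var(y), reg_var + res_var)    # (B-74): the two should agree
print(reg_var / np.var(y), np.corrcoef(x, y)[0, 1]**2)   # (B-76): COD = rho^2
```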

B.9 THE BIVARIATE NORMAL DISTRIBUTION

A bivariate distribution that embodies many of the features described earlier is the bivariate normal, which is the joint distribution of two normally distributed variables. The density is

$$f(x, y) = \frac{1}{2\pi \sigma_x \sigma_y \sqrt{1 - \rho^2}}\, e^{-\frac{1}{2}\big(\varepsilon_x^2 + \varepsilon_y^2 - 2\rho \varepsilon_x \varepsilon_y\big)/\big(1 - \rho^2\big)}, \qquad \varepsilon_x = \frac{x - \mu_x}{\sigma_x},\ \ \varepsilon_y = \frac{y - \mu_y}{\sigma_y}. \tag{B-77}$$


The parameters μx,σx,μy, and σy are the means and standard deviations of the marginal distri- butions of x and y, respectively. The additional parameter ρ is the correlation between x and y. The covariance is

$$\sigma_{xy} = \rho \sigma_x \sigma_y. \tag{B-78}$$
The density is defined only if ρ is not 1 or −1, which in turn requires that the two variables not be linearly related. If x and y have a bivariate normal distribution, denoted
$$(x, y) \sim N_2\big[\mu_x, \mu_y, \sigma_x^2, \sigma_y^2, \rho\big],$$
then
•  The marginal distributions are normal:
$$f_x(x) = N\big[\mu_x, \sigma_x^2\big], \qquad f_y(y) = N\big[\mu_y, \sigma_y^2\big]. \tag{B-79}$$
•  The conditional distributions are normal:
$$f(y \mid x) = N\big[\alpha + \beta x,\ \sigma_y^2(1 - \rho^2)\big], \qquad \alpha = \mu_y - \beta \mu_x, \quad \beta = \frac{\sigma_{xy}}{\sigma_x^2}, \tag{B-80}$$
and likewise for f(x | y).
•  x and y are independent if and only if ρ = 0. The density factors into the product of the two marginal normal distributions if ρ = 0.
Two things to note about the conditional distributions beyond their normality are their linear regression functions and their constant conditional variances. The conditional variance is less than the unconditional variance, which is consistent with the results of the previous section.
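The conditional moments in (B-80) are easy to verify numerically. The sketch below (with arbitrary parameter values) regresses simulated y on x and compares the slope and residual variance with β = σxy/σx² and σy²(1 − ρ²).

```python
import numpy as np

# Draw from a bivariate normal and check the conditional moments in (B-80).
rng = np.random.default_rng(4)
mx, my, sx, sy, rho = 1.0, 2.0, 1.5, 0.5, 0.6
cov = np.array([[sx**2, rho * sx * sy],
                [rho * sx * sy, sy**2]])
x, y = rng.multivariate_normal([mx, my], cov, size=500_000).T

beta_hat = np.cov(x, y)[0, 1] / np.var(x)
resid = y - (np.mean(y) + beta_hat * (x - np.mean(x)))
print(beta_hat, rho * sy / sx)                # slope: sigma_xy / sigma_x^2
print(np.var(resid), sy**2 * (1 - rho**2))    # conditional variance
```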

B.10 MULTIVARIATE DISTRIBUTIONS

The extension of the results for bivariate distributions to more than two variables is direct. It is made much more convenient by using matrices and vectors. The term random vector applies to a vector whose elements are random variables. The joint density is f(x), whereas the cdf is
$$F(\mathbf{x}) = \int_{-\infty}^{x_n} \int_{-\infty}^{x_{n-1}} \cdots \int_{-\infty}^{x_1} f(\mathbf{t})\, dt_1 \cdots dt_{n-1}\, dt_n. \tag{B-81}$$

Note that the cdf is an n-fold integral. The marginal distribution of any one (or more) of the n variables is obtained by integrating or summing over the other variables.

B.10.1 MOMENTS

The expected value of a vector or matrix is the vector or matrix of expected values. A mean vector is defined as
$$\boldsymbol{\mu} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_n \end{bmatrix} = \begin{bmatrix} E[x_1] \\ E[x_2] \\ \vdots \\ E[x_n] \end{bmatrix} = E[\mathbf{x}]. \tag{B-82}$$


Define the matrix
$$(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})' = \begin{bmatrix} (x_1 - \mu_1)(x_1 - \mu_1) & (x_1 - \mu_1)(x_2 - \mu_2) & \cdots & (x_1 - \mu_1)(x_n - \mu_n) \\ (x_2 - \mu_2)(x_1 - \mu_1) & (x_2 - \mu_2)(x_2 - \mu_2) & \cdots & (x_2 - \mu_2)(x_n - \mu_n) \\ \vdots & \vdots & & \vdots \\ (x_n - \mu_n)(x_1 - \mu_1) & (x_n - \mu_n)(x_2 - \mu_2) & \cdots & (x_n - \mu_n)(x_n - \mu_n) \end{bmatrix}.$$
The expected value of each element in the matrix is the covariance of the two variables in the product. (The covariance of a variable with itself is its variance.) Thus,
$$E[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})'] = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2n} \\ \vdots & \vdots & & \vdots \\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_{nn} \end{bmatrix} = E[\mathbf{x}\mathbf{x}'] - \boldsymbol{\mu}\boldsymbol{\mu}', \tag{B-83}$$

which is the covariance matrix of the random vector x. Henceforth, we shall denote the covariance matrix of a random vector in boldface, as in

Var[x] = Σ.

By dividing σᵢⱼ by σᵢσⱼ, we obtain the correlation matrix:
$$\mathbf{R} = \begin{bmatrix} 1 & \rho_{12} & \rho_{13} & \cdots & \rho_{1n} \\ \rho_{21} & 1 & \rho_{23} & \cdots & \rho_{2n} \\ \vdots & \vdots & \vdots & & \vdots \\ \rho_{n1} & \rho_{n2} & \rho_{n3} & \cdots & 1 \end{bmatrix}.$$

B.10.2 SETS OF LINEAR FUNCTIONS

Our earlier results for the mean and variance of a linear function can be extended to the multivariate case. For the mean,

$$E[a_1 x_1 + a_2 x_2 + \cdots + a_n x_n] = E[\mathbf{a}'\mathbf{x}] = a_1 E[x_1] + a_2 E[x_2] + \cdots + a_n E[x_n] = a_1 \mu_1 + a_2 \mu_2 + \cdots + a_n \mu_n = \mathbf{a}'\boldsymbol{\mu}. \tag{B-84}$$

For the variance,
$$\operatorname{Var}[\mathbf{a}'\mathbf{x}] = E\big[\big(\mathbf{a}'\mathbf{x} - E[\mathbf{a}'\mathbf{x}]\big)^2\big] = E\big[\big(\mathbf{a}'(\mathbf{x} - E[\mathbf{x}])\big)^2\big] = E[\mathbf{a}'(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})'\mathbf{a}]$$
as E[x] = μ and a′(x − μ) = (x − μ)′a. Because a is a vector of constants,
$$\operatorname{Var}[\mathbf{a}'\mathbf{x}] = \mathbf{a}'\, E[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})']\, \mathbf{a} = \mathbf{a}'\boldsymbol{\Sigma}\mathbf{a} = \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j \sigma_{ij}. \tag{B-85}$$


It is the expected value of a square, so we know that a variance cannot be negative. As such, the preceding quadratic form is nonnegative, and the symmetric matrix Σ must be nonnegative definite. In the set of linear functions y = Ax, the ith element of y is yᵢ = aᵢx, where aᵢ is the ith row of A [see result (A-14)]. Therefore,

E[yᵢ] = aᵢμ.

Collecting the results in a vector, we have

E[Ax] = Aμ. (B-86)

For two row vectors aᵢ and aⱼ,
$$\operatorname{Cov}[\mathbf{a}_i \mathbf{x}, \mathbf{a}_j \mathbf{x}] = \mathbf{a}_i \boldsymbol{\Sigma} \mathbf{a}_j'.$$
Because aᵢΣaⱼ′ is the ijth element of AΣA′,

Var[Ax] = AΣA′. (B-87)

This matrix will be either nonnegative definite or positive definite, depending on the column rank of A.

B.10.3 NONLINEAR FUNCTIONS

Consider a set of possibly nonlinear functions of x, y = g(x). Each element of y can be approximated with a linear Taylor series. Let jᵢ be the row vector of partial derivatives of the ith function with respect to the n elements of x:
$$\mathbf{j}_i(\mathbf{x}) = \frac{\partial g_i(\mathbf{x})}{\partial \mathbf{x}'} = \frac{\partial y_i}{\partial \mathbf{x}'}. \tag{B-88}$$
Then, proceeding in the now familiar way, we use μ, the mean vector of x, as the expansion point, so that jᵢ(μ) is the row vector of partial derivatives evaluated at μ. Then

gᵢ(x) ≈ gᵢ(μ) + jᵢ(μ)(x − μ). (B-89)

From this we obtain

E[gᵢ(x)] ≈ gᵢ(μ), (B-90)

Var[gᵢ(x)] ≈ jᵢ(μ) Σ jᵢ(μ)′, (B-91)

and

Cov[gᵢ(x), gⱼ(x)] ≈ jᵢ(μ) Σ jⱼ(μ)′. (B-92)

These results can be collected in a convenient form by arranging the row vectors jᵢ(μ) in a matrix J(μ). Then, corresponding to the preceding equations, we have

E[g(x)] ≃ g(μ), (B-93)

Var[g(x)] ≃ J(μ) Σ J(μ)′. (B-94)

The matrix J(μ) in the last preceding line is ∂y/∂x′ evaluated at x = μ.
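The approximations (B-90)–(B-94) are the basis of the delta method. The sketch below applies them to a hypothetical scalar function g(x) = e^{x₁}/x₂ (not an example from the text) and compares the approximate variance with a Monte Carlo estimate.

```python
import numpy as np

# Delta-method variance for g(x) = exp(x1)/x2 with x ~ N[mu, Sigma];
# mu and Sigma are arbitrary illustrative values.
rng = np.random.default_rng(5)
mu = np.array([0.5, 2.0])
Sigma = np.array([[0.04, 0.01],
                  [0.01, 0.09]])

def g(x1, x2):
    return np.exp(x1) / x2

# Row vector of partial derivatives evaluated at mu, as in (B-88)
J = np.array([np.exp(mu[0]) / mu[1], -np.exp(mu[0]) / mu[1]**2])
approx_var = J @ Sigma @ J            # (B-91)

x = rng.multivariate_normal(mu, Sigma, size=500_000)
print(approx_var, np.var(g(x[:, 0], x[:, 1])))   # approximation vs. simulation
```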


B.11 THE MULTIVARIATE NORMAL DISTRIBUTION

The foundation of most multivariate analysis in econometrics is the multivariate normal distri-

bution. Let the vector (x₁, x₂, …, xₙ)′ = x be the set of n random variables, μ their mean vector, and Σ their covariance matrix. The general form of the joint density is
$$f(\mathbf{x}) = (2\pi)^{-n/2} |\boldsymbol{\Sigma}|^{-1/2} e^{-(1/2)(\mathbf{x}-\boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})}. \tag{B-95}$$

If R is the correlation matrix of the variables and Rᵢⱼ = σᵢⱼ/(σᵢσⱼ), then
$$f(\mathbf{x}) = (2\pi)^{-n/2} (\sigma_1 \sigma_2 \cdots \sigma_n)^{-1} |\mathbf{R}|^{-1/2} e^{-(1/2)\boldsymbol{\varepsilon}'\mathbf{R}^{-1}\boldsymbol{\varepsilon}}, \tag{B-96}$$

where εᵢ = (xᵢ − μᵢ)/σᵢ.⁸ Two special cases are of interest. If all the variables are uncorrelated, then ρᵢⱼ = 0 for i ≠ j. Thus, R = I, and the density becomes
$$f(\mathbf{x}) = (2\pi)^{-n/2} (\sigma_1 \sigma_2 \cdots \sigma_n)^{-1} e^{-\boldsymbol{\varepsilon}'\boldsymbol{\varepsilon}/2} = f(x_1) f(x_2) \cdots f(x_n) = \prod_{i=1}^{n} f(x_i). \tag{B-97}$$

As in the bivariate case, if normally distributed variables are uncorrelated, then they are independent. If σᵢ = σ and μ = 0, then xᵢ ∼ N[0, σ²] and εᵢ = xᵢ/σ, and the density becomes
$$f(\mathbf{x}) = (2\pi)^{-n/2} (\sigma^2)^{-n/2} e^{-\mathbf{x}'\mathbf{x}/(2\sigma^2)}. \tag{B-98}$$

Finally, if σ = 1,

$$f(\mathbf{x}) = (2\pi)^{-n/2} e^{-\mathbf{x}'\mathbf{x}/2}. \tag{B-99}$$

This distribution is the multivariate standard normal, or spherical normal distribution.

B.11.1 MARGINAL AND CONDITIONAL NORMAL DISTRIBUTIONS

Let x₁ be any subset of the variables, including a single variable, and let x₂ be the remaining variables. Partition μ and Σ likewise so that

$$\boldsymbol{\mu} = \begin{bmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{bmatrix} \quad \text{and} \quad \boldsymbol{\Sigma} = \begin{bmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{bmatrix}.$$

Then the marginal distributions are also normal. In particular, we have the following theorem.

THEOREM B.7  Marginal and Conditional Normal Distributions
If [x₁, x₂] have a joint multivariate normal distribution, then the marginal distributions are
$$\mathbf{x}_1 \sim N(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11}), \tag{B-100}$$

8 This result is obtained by constructing Δ, the diagonal matrix with σᵢ as its ith diagonal element. Then, R = Δ⁻¹ΣΔ⁻¹, which implies that Σ⁻¹ = Δ⁻¹R⁻¹Δ⁻¹. Inserting this in (B-95) yields (B-96). Note that the ith element of Δ⁻¹(x − μ) is (xᵢ − μᵢ)/σᵢ.


THEOREM B.7  (Continued)
and
$$\mathbf{x}_2 \sim N(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_{22}). \tag{B-101}$$
The conditional distribution of x₁ given x₂ is normal as well:
$$\mathbf{x}_1 \mid \mathbf{x}_2 \sim N(\boldsymbol{\mu}_{1.2}, \boldsymbol{\Sigma}_{11.2}), \tag{B-102}$$
where
$$\boldsymbol{\mu}_{1.2} = \boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2), \tag{B-102a}$$
$$\boldsymbol{\Sigma}_{11.2} = \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}. \tag{B-102b}$$
Proof: We partition μ and Σ as shown earlier and insert the parts in (B-95). To construct the density, we use (A-72) to partition the determinant,
$$|\boldsymbol{\Sigma}| = |\boldsymbol{\Sigma}_{22}|\, \big|\boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}\big|,$$
and (A-74) to partition the inverse,
$$\begin{bmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{bmatrix}^{-1} = \begin{bmatrix} \boldsymbol{\Sigma}_{11.2}^{-1} & -\boldsymbol{\Sigma}_{11.2}^{-1}\mathbf{B} \\ -\mathbf{B}'\boldsymbol{\Sigma}_{11.2}^{-1} & \boldsymbol{\Sigma}_{22}^{-1} + \mathbf{B}'\boldsymbol{\Sigma}_{11.2}^{-1}\mathbf{B} \end{bmatrix}.$$
For simplicity, we let
$$\mathbf{B} = \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}.$$
Inserting these in (B-95) and collecting terms produces the joint density as a product of two terms:
$$f(\mathbf{x}_1, \mathbf{x}_2) = f_{1.2}(\mathbf{x}_1 \mid \mathbf{x}_2)\, f_2(\mathbf{x}_2).$$
The first of these is a normal distribution with mean μ₁.₂ and variance Σ₁₁.₂, whereas the second is the marginal distribution of x₂.

The conditional mean vector in the multivariate normal distribution is a linear function of the unconditional mean and the conditioning variables, and the conditional covariance matrix is constant and is smaller (in the sense discussed in Section A.7.3) than the unconditional covariance matrix. Notice that the conditional covariance matrix is the inverse of the upper left block of Σ⁻¹; that is, this matrix is of the form shown in (A-74) for the partitioned inverse of a matrix.

B.11.2 THE CLASSICAL NORMAL LINEAR REGRESSION MODEL

An important special case of the preceding is that in which x₁ is a single variable, y, and x₂ is K variables, x. Then the conditional distribution is a multivariate version of that in (B-80) with β = Σₓₓ⁻¹σₓᵧ, where σₓᵧ is the vector of covariances of y with x₂. Recall that any random variable, y, can be written as its mean plus the deviation from the mean. If we apply this tautology to the multivariate normal, we obtain
$$y = E[y \mid \mathbf{x}] + \big(y - E[y \mid \mathbf{x}]\big) = \alpha + \boldsymbol{\beta}'\mathbf{x} + \varepsilon,$$


where β is given earlier, α = μᵧ − β′μₓ, and ε has a normal distribution. We thus have, in this multivariate normal distribution, the classical normal linear regression model.

B.11.3 LINEAR FUNCTIONS OF A NORMAL VECTOR

Any linear function of a vector of joint normally distributed variables is also normally distributed. The mean vector and covariance matrix of Ax, where x is normally distributed, follow the general pattern given earlier. Thus,

If x ∼ N[μ, Σ], then Ax + b ∼ N[Aμ + b, AΣA′]. (B-103)

If A does not have full rank, then AΣA′ is singular and the density does not exist in the full dimensional space of x, although it does exist in the subspace of dimension equal to the rank of AΣA′. Nonetheless, the individual elements of Ax + b will still be normally distributed, and the joint distribution of the full vector is still a multivariate normal.

B.11.4 QUADRATIC FORMS IN A STANDARD NORMAL VECTOR

The earlier discussion of the chi-squared distribution gives the distribution of x′x if x has a standard normal distribution. It follows from (A-36) that
$$\mathbf{x}'\mathbf{x} = \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 + n\bar{x}^2. \tag{B-104}$$

We know from (B-32) that x′x has a chi-squared distribution. It seems natural, therefore, to invoke (B-34) for the two parts on the right-hand side of (B-104). It is not yet obvious, however, that either of the two terms has a chi-squared distribution or that the two terms are independent, as required. To show these conditions, it is necessary to derive the distributions of idempotent quadratic forms and to show when they are independent. To begin, the second term is the square of √n x̄, which can easily be shown to have a standard normal distribution. Thus, the second term is the square of a standard normal variable and has a chi-squared distribution with one degree of freedom. But the first term is the sum of n nonindependent variables, and it remains to be shown that the two terms are independent.

DEFINITION B.3 Orthonormal Quadratic Form A particular case of (B-103) is the following:

If x ∼ N[0, I] and C is a square matrix such that C′C = I, then C′x ∼ N[0, I].

Consider, then, a quadratic form in a standard normal vector x with symmetric matrix A:

q = x′Ax. (B-105)

Let the characteristic roots and vectors of A be arranged in a diagonal matrix Λ and an orthogonal matrix C, as in Section A.6.3. Then

q = x′CΛC′x. (B-106)


By definition, C satisfies the requirement that C′C = I. Thus, the vector y = C′x has a standard normal distribution. Consequently,
$$q = \mathbf{y}'\boldsymbol{\Lambda}\mathbf{y} = \sum_{i=1}^{n} \lambda_i y_i^2. \tag{B-107}$$

If λᵢ is always one or zero, then
$$q = \sum_{j=1}^{J} y_j^2, \tag{B-108}$$
which has a chi-squared distribution. The sum is taken over the j = 1, …, J elements associated with the roots that are equal to one. A matrix whose characteristic roots are all zero or one is idempotent. Therefore, we have proved the next theorem.

THEOREM B.8  Distribution of an Idempotent Quadratic Form in a Standard Normal Vector
If x ∼ N[0, I] and A is idempotent, then x′Ax has a chi-squared distribution with degrees of freedom equal to the number of unit roots of A, which is equal to the rank of A.

The rank of a matrix is equal to the number of nonzero characteristic roots it has. Therefore, the degrees of freedom in the preceding chi-squared distribution equals J, the rank of A. We can apply this result to the earlier sum of squares. The first term is
$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \mathbf{x}'\mathbf{M}^0\mathbf{x},$$
where M⁰ was defined in (A-34) as the matrix that transforms data to mean deviation form:
$$\mathbf{M}^0 = \mathbf{I} - \frac{1}{n}\mathbf{i}\mathbf{i}'.$$
Because M⁰ is idempotent, the sum of squared deviations from the mean has a chi-squared distribution. The degrees of freedom equals the rank of M⁰, which is not obvious except for the useful result in (A-108), that
•  The rank of an idempotent matrix is equal to its trace. (B-109)

Each diagonal element of M⁰ is 1 − (1/n); hence, the trace is n[1 − (1/n)] = n − 1. Therefore, we have an application of Theorem B.8.
•  If x ∼ N(0, I), then $\sum_{i=1}^{n}(x_i - \bar{x})^2 \sim \chi^2[n-1]$. (B-110)
We have already shown that the second term in (B-104) has a chi-squared distribution with one degree of freedom. It is instructive to set this up as a quadratic form as well:
$$n\bar{x}^2 = \mathbf{x}'\left(\frac{1}{n}\mathbf{i}\mathbf{i}'\right)\mathbf{x} = \mathbf{x}'[\mathbf{j}\mathbf{j}']\mathbf{x}, \qquad \text{where } \mathbf{j} = \frac{1}{\sqrt{n}}\mathbf{i}. \tag{B-111}$$
The matrix in brackets is the outer product of a nonzero vector, which always has rank one. You can verify that it is idempotent by multiplication. Thus, x′x is the sum of two chi-squared variables,


one with n − 1 degrees of freedom and the other with one. It is now necessary to show that the two terms are independent. To do so, we will use the next theorem.

THEOREM B.9  Independence of Idempotent Quadratic Forms
If x ∼ N[0, I] and x′Ax and x′Bx are two idempotent quadratic forms in x, then x′Ax and x′Bx are independent if AB = 0. (B-112)

As before, we show the result for the general case and then specialize it for the example. Because both A and B are symmetric and idempotent, A = AA and B = BB. The quadratic forms are therefore

$$\mathbf{x}'\mathbf{A}\mathbf{x} = \mathbf{x}'\mathbf{A}'\mathbf{A}\mathbf{x} = \mathbf{x}_1'\mathbf{x}_1, \ \text{where } \mathbf{x}_1 = \mathbf{A}\mathbf{x}, \quad \text{and} \quad \mathbf{x}'\mathbf{B}\mathbf{x} = \mathbf{x}_2'\mathbf{x}_2, \ \text{where } \mathbf{x}_2 = \mathbf{B}\mathbf{x}. \tag{B-113}$$

Both vectors have zero mean vectors, so the covariance matrix of x₁ and x₂ is
$$E(\mathbf{x}_1\mathbf{x}_2') = \mathbf{A}\mathbf{I}\mathbf{B}' = \mathbf{A}\mathbf{B} = \mathbf{0}.$$

Because Ax and Bx are linear functions of a normally distributed random vector, they are, in turn, normally distributed. Their zero covariance matrix implies that they are statistically independent,⁹ which establishes the independence of the two quadratic forms. For the case of x′x, the two matrices are M⁰ and [I − M⁰]. You can show that M⁰[I − M⁰] = 0 just by multiplying it out.
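Theorem B.8 applied to M⁰ is easy to check by simulation; the sketch below (sample sizes arbitrary) verifies that x′M⁰x has the mean n − 1 and variance 2(n − 1) of a chi-squared[n − 1] variable.

```python
import numpy as np

# Simulation check of Theorem B.8 for A = M0 = I - (1/n) i i' (idempotent,
# rank n-1): x'M0x should have mean n-1 and variance 2(n-1).
rng = np.random.default_rng(6)
n, reps = 8, 200_000
M0 = np.eye(n) - np.ones((n, n)) / n
x = rng.standard_normal((reps, n))
q = np.sum((x @ M0) * x, axis=1)      # x'M0x for each replication
print(q.mean(), n - 1)
print(q.var(), 2 * (n - 1))
```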

B.11.5 THE F DISTRIBUTION

The normal family of distributions (chi-squared, F, and t) can all be derived as functions of idempotent quadratic forms in a standard normal vector. The F distribution is the ratio of two independent chi-squared variables, each divided by its respective degrees of freedom. Let A and B be two idempotent matrices with ranks rₐ and r_b, and let AB = 0. Then
$$\frac{\mathbf{x}'\mathbf{A}\mathbf{x}/r_a}{\mathbf{x}'\mathbf{B}\mathbf{x}/r_b} \sim F[r_a, r_b]. \tag{B-114}$$

If Var[x] = σ²I instead, then this is modified to

$$\frac{(\mathbf{x}'\mathbf{A}\mathbf{x}/\sigma^2)/r_a}{(\mathbf{x}'\mathbf{B}\mathbf{x}/\sigma^2)/r_b} \sim F[r_a, r_b]. \tag{B-115}$$

B.11.6 A FULL RANK QUADRATIC FORM

Finally, consider the general case,

x ∼ N[μ, Σ].

We are interested in the distribution of

q = (x − μ)′Σ⁻¹(x − μ). (B-116)

9 Note that both x₁ = Ax and x₂ = Bx have singular covariance matrices. Nonetheless, every element of x₁ is independent of every element of x₂, so the vectors are independent.


First, the vector can be written as z = x − μ, and Σ is the covariance matrix of z as well as of x. Therefore, we seek the distribution of
$$q = \mathbf{z}'\boldsymbol{\Sigma}^{-1}\mathbf{z} = \mathbf{z}'\big(\operatorname{Var}[\mathbf{z}]\big)^{-1}\mathbf{z}, \tag{B-117}$$

where z is normally distributed with mean 0. This equation is a quadratic form, but not necessarily in an idempotent matrix.¹⁰ Because Σ is positive definite, it has a square root. Define the symmetric matrix Σ^(1/2) so that Σ^(1/2)Σ^(1/2) = Σ. Then

$$\boldsymbol{\Sigma}^{-1} = \boldsymbol{\Sigma}^{-1/2}\boldsymbol{\Sigma}^{-1/2},$$

and

$$\mathbf{z}'\boldsymbol{\Sigma}^{-1}\mathbf{z} = \mathbf{z}'\boldsymbol{\Sigma}^{-1/2}\boldsymbol{\Sigma}^{-1/2}\mathbf{z} = (\boldsymbol{\Sigma}^{-1/2}\mathbf{z})'(\boldsymbol{\Sigma}^{-1/2}\mathbf{z}) = \mathbf{w}'\mathbf{w}.$$

Now w = Az, where A = Σ^(−1/2), so

E(w) = AE[z] = 0,

and

Var[w] = AΣA′ = Σ^(−1/2) Σ Σ^(−1/2) = Σ⁰ = I.

This provides the following important result:

THEOREM B.10  Distribution of a Standardized Normal Vector
If x ∼ N[μ, Σ], then Σ^(−1/2)(x − μ) ∼ N[0, I].

The simplest special case is that in which x has only one variable, so that the transformation is just (x − μ)/σ. Combining this case with (B-32) concerning the sum of squares of standard normals, we have the following theorem.

THEOREM B.11  Distribution of x′Σ⁻¹x When x Is Normal
If x ∼ N[μ, Σ], then (x − μ)′Σ⁻¹(x − μ) ∼ χ²[n].
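Theorems B.10 and B.11 can be checked together. The sketch below (Σ and μ are illustrative values) builds the symmetric square root Σ^(−1/2) from the spectral decomposition, whitens the draws, and confirms that the quadratic form has the mean n and variance 2n of a chi-squared[n] variable.

```python
import numpy as np

# Whiten x ~ N[mu, Sigma] with Sigma^(-1/2) and check the chi-squared result.
rng = np.random.default_rng(7)
mu = np.array([1.0, -1.0, 0.5])
Sigma = np.array([[2.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.5]])
vals, vecs = np.linalg.eigh(Sigma)
S_inv_half = vecs @ np.diag(vals**-0.5) @ vecs.T   # symmetric Sigma^(-1/2)

x = rng.multivariate_normal(mu, Sigma, size=200_000)
w = (x - mu) @ S_inv_half             # Theorem B.10: should be N[0, I]
q = np.sum(w * w, axis=1)             # (x-mu)' Sigma^(-1) (x-mu)
print(np.round(np.cov(w, rowvar=False), 3))   # approximately I
print(q.mean(), q.var())                      # approximately n=3 and 2n=6
```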

B.11.7 INDEPENDENCE OF A LINEAR AND A QUADRATIC FORM

The t distribution is used in many forms of hypothesis tests. In some situations, it arises as the ratio of a linear to a quadratic form in a normal vector. To establish the distribution of these statistics, we use the following result.

10 It will be idempotent only in the special case of Σ = I.


THEOREM B.12  Independence of a Linear and a Quadratic Form
A linear function Lx and a symmetric idempotent quadratic form x′Ax in a standard normal vector are statistically independent if LA = 0.

The proof follows the same logic as that for two quadratic forms. Write x′Ax as x′A′Ax = (Ax)′(Ax). The covariance matrix of the variables Lx and Ax is LA = 0, which establishes the independence of these two random vectors. The independence of the linear function and the quadratic form follows because functions of independent random vectors are also independent. The t distribution is defined as the ratio of a standard normal variable to the square root of a chi-squared variable divided by its degrees of freedom:
$$t[J] = \frac{N[0, 1]}{\big(\chi^2[J]/J\big)^{1/2}}.$$

A particular case is
$$t[n-1] = \frac{\sqrt{n}\,\bar{x}}{s} = \frac{\sqrt{n}\,\bar{x}}{\left[\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2\right]^{1/2}},$$
where s is the standard deviation of the values of x. The distribution of the two variables in t[n − 1] was shown earlier; we need only show that they are independent. But
$$\sqrt{n}\,\bar{x} = \frac{1}{\sqrt{n}}\mathbf{i}'\mathbf{x} = \mathbf{j}'\mathbf{x},$$
and
$$s^2 = \frac{\mathbf{x}'\mathbf{M}^0\mathbf{x}}{n-1}.$$
It suffices to show that M⁰j = 0, which follows from

$$\mathbf{M}^0\mathbf{i} = [\mathbf{I} - \mathbf{i}(\mathbf{i}'\mathbf{i})^{-1}\mathbf{i}']\mathbf{i} = \mathbf{i} - \mathbf{i}(\mathbf{i}'\mathbf{i})^{-1}(\mathbf{i}'\mathbf{i}) = \mathbf{0}.$$

APPENDIX C  ESTIMATION AND INFERENCE

C.1 INTRODUCTION

The probability distributions discussed in Appendix B serve as models for the underlying data generating processes that produce our observed data. The goal of statistical inference in econometrics is to use the principles of mathematical statistics to combine these theoretical distributions and the observed data into an empirical model of the economy. This analysis takes place in one of two frameworks, classical or Bayesian. The overwhelming majority of empirical study in


econometrics has been done in the classical framework. Our focus, therefore, will be on classical methods of inference. Bayesian methods are discussed in Chapter 18.1

C.2 SAMPLES AND RANDOM SAMPLING

The classical theory of statistical inference centers on rules for using the sampled data effectively. These rules, in turn, are based on the properties of samples and sampling distributions. A sample of n observations on one or more variables, denoted x₁, x₂, …, xₙ, is a random sample if the n observations are drawn independently from the same population, or probability distribution, f(xᵢ, θ). The sample may be univariate if xᵢ is a single random variable or multivariate if each observation contains several variables. A random sample of observations, denoted [x₁, x₂, …, xₙ] or {xᵢ}, i = 1, …, n, is said to be independent, identically distributed, which we denote i.i.d. The vector θ contains one or more unknown parameters. Data are generally drawn in one of two settings. A cross section is a sample of a number of observational units all drawn at the same point in time. A time series is a set of observations drawn on the same observational unit at a number of (usually evenly spaced) points in time. Many recent studies have been based on time-series cross sections, which generally consist of the same cross-sectional units observed at several points in time. Because the typical data set of this sort consists of a large number of cross-sectional units observed at a few points in time, the common term panel data set is usually more fitting for this sort of study.

C.3 DESCRIPTIVE STATISTICS

Before attempting to estimate parameters of a population or fit models to data, we normally examine the data themselves. In raw form, the sample data are a disorganized mass of information, so we will need some organizing principles to distill the information into something meaningful. Consider, first, examining the data on a single variable. In most cases, and particularly if the number of observations in the sample is large, we shall use some summary statistics to describe the sample data. Of most interest are measures of location, that is, the center of the data, and scale, or the dispersion of the data. A few measures of central tendency are as follows:
$$\text{mean: } \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \text{median: } M = \text{middle ranked observation,} \qquad \text{sample midrange: } \text{midrange} = \frac{\text{maximum} + \text{minimum}}{2}. \tag{C-1}$$
The dispersion of the sample observations is usually measured by the
$$\text{standard deviation: } s_x = \left[\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}\right]^{1/2}. \tag{C-2}$$
Other measures, such as the average absolute deviation from the sample mean, are also used, although less frequently than the standard deviation. The shape of the distribution of values is often of interest as well. Samples of income or expenditure data, for example, tend to be highly

1 An excellent reference is Leamer (1978). A summary of the results as they apply to econometrics is contained in Zellner (1971) and in Judge et al. (1985). See, as well, Poirier (1991, 1995). Recent textbooks on Bayesian econometrics include Koop (2003), Lancaster (2004), and Geweke (2005).


skewed, while financial data such as asset returns and exchange rate movements are relatively more symmetrically distributed but are also more widely dispersed than other variables that might be observed. Two measures used to quantify these effects are the
$$\text{skewness} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^3}{s_x^3 (n-1)}, \qquad \text{and} \qquad \text{kurtosis} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^4}{s_x^4 (n-1)}.$$

(Benchmark values for these two measures are zero for a symmetric distribution, and three for one which is “normally” dispersed.) The skewness coefficient has a bit less of the intuitive appeal of the mean and standard deviation, and the kurtosis measure has very little at all. The box and whisker plot is a graphical device which is often used to capture a large amount of information about the sample in a simple visual display. This plot shows in a figure the median, the range of values contained in the 25th and 75th percentile, some limits that show the normal range of values expected, such as the median plus and minus two standard deviations, and in isolation values that could be viewed as outliers. A box and whisker plot is shown in Figure C.1 for the income variable in Example C.1. If the sample contains data on more than one variable, we will also be interested in measures of association among the variables. A scatter diagram is useful in a bivariate sample if the sample contains a reasonable number of observations. Figure C.1 shows an example for a small data set. If the sample is a multivariate one, then the degree of linear association among the variables can be measured by the pairwise measures
$$\text{covariance: } s_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n-1}, \tag{C-3}$$

$$\text{correlation: } r_{xy} = \frac{s_{xy}}{s_x s_y}.$$

If the sample contains data on several variables, then it is sometimes convenient to arrange the covariances or correlations in a

covariance matrix: S = [sij], (C-4)

or

correlation matrix: R = [rij].

Some useful algebraic results for any two variables (xᵢ, yᵢ), i = 1, …, n, and constants a and b are
$$s_x^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}, \tag{C-5}$$
$$s_{xy} = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{n-1}, \tag{C-6}$$
$$-1 \le r_{xy} \le 1, \qquad r_{ax,by} = \frac{ab}{|ab|}\, r_{xy}, \quad a, b \ne 0, \tag{C-7}$$
$$s_{ax} = |a| s_x, \qquad s_{ax,by} = (ab) s_{xy}. \tag{C-8}$$


Note that these algebraic results parallel the theoretical results for bivariate probability distributions. [We note in passing that, while the formulas in (C-2) and (C-5) are algebraically the same, (C-2) will generally be more accurate in practice, especially when the values in the sample are very widely dispersed.]

Example C.1  Descriptive Statistics for a Random Sample
Appendix Table FC1 contains a (hypothetical) sample of observations on income and education. (The observations all appear in the calculations of the means below.) A scatter diagram appears in Figure C.1. It suggests a weak positive association between income and education in these data. The box and whisker plot for income at the left of the scatter plot shows the distribution of the income data as well.
Means:
I̅ = (1/20)[20.5 + 31.5 + 47.7 + 26.2 + 44.0 + 8.28 + 30.8 + 17.2 + 19.9 + 9.96 + 55.8 + 25.2 + 29.0 + 85.5 + 15.1 + 28.5 + 21.4 + 17.7 + 6.42 + 84.9] = 31.278,
E̅ = (1/20)[12 + 16 + 18 + 16 + 12 + 12 + 16 + 12 + 10 + 12 + 16 + 20 + 12 + 16 + 10 + 18 + 16 + 20 + 12 + 16] = 14.600.
Standard deviations:
s_I = {(1/19)[(20.5 − 31.278)² + ··· + (84.9 − 31.278)²]}^(1/2) = 22.376,
s_E = {(1/19)[(12 − 14.6)² + ··· + (16 − 14.6)²]}^(1/2) = 3.119.

FIGURE C.1  Box and Whisker Plot for Income and Scatter Diagram for Income and Education.


Covariance: s_IE = (1/19)[20.5(12) + ··· + 84.9(16) − 20(31.28)(14.6)] = 23.597,
Correlation: r_IE = 23.597/[(22.376)(3.119)] = 0.3382.
The positive correlation is consistent with our observation in the scatter diagram.

The statistics just described will provide the analyst with a more concise description of the data than a raw tabulation. However, we have not, as yet, suggested that these measures correspond to some underlying characteristic of the process that generated the data. We do assume that there is an underlying mechanism, the data-generating process, that produces the data in hand. Thus, these serve to do more than describe the data; they characterize that process, or population. Because we have assumed that there is an underlying probability distribution, it might be useful to produce a statistic that gives a broader view of the DGP. The histogram is a simple graphical device that produces this result—see Examples C.3 and C.4 for applications. For small samples or widely dispersed data, however, histograms tend to be rough and difficult to make informative. A burgeoning literature [see, e.g., Pagan and Ullah (1999) and Li and Racine (2007)] has demonstrated the usefulness of the kernel density estimator as a substitute for the histogram as a descriptive tool for the underlying distribution that produced a sample of data. The underlying theory of the kernel density estimator is fairly complicated, but the computations are surprisingly simple. The estimator is computed using

$$\hat{f}(x^*) = \frac{1}{nh} \sum_{i=1}^{n} K\left[\frac{x^* - x_i}{h}\right],$$

where x₁, …, xₙ are the n observations in the sample, f̂(x*) denotes the estimated density function, x* is the value at which we wish to evaluate the density, and h and K[·] are the “bandwidth” and “kernel function” that we now consider. The density estimator is rather like a histogram, in which the bandwidth is the width of the intervals. The kernel function is a weight function which is generally chosen so that it takes large values when x* is close to xᵢ and tapers off to zero as they diverge in either direction. The weighting function used in the example below is the logistic density discussed in Section B.4.7. The bandwidth is chosen to be a function of 1/n so that the intervals can become narrower as the sample becomes larger (and richer). The one used for Figure C.2 is h = 0.9 min(s, range/3)/n^0.2. (We will revisit this method of estimation in Chapter 14.) Example C.2 illustrates the computation for the income data used in Example C.1.
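The estimator is a few lines of code. The sketch below transcribes the formula with the logistic kernel and the bandwidth rule just described; since the income data are not reproduced here, it is applied to an arbitrary skewed sample.

```python
import numpy as np

def logistic_kernel(u):
    lam = 1.0 / (1.0 + np.exp(-u))
    return lam * (1.0 - lam)          # logistic density, Section B.4.7

def kernel_density(x_star, x, h):
    # f_hat(x*) = (1/(n*h)) * sum_i K[(x* - x_i)/h], vectorized over x_star
    return np.mean(logistic_kernel((x_star - x[:, None]) / h), axis=0) / h

x = np.random.default_rng(8).lognormal(3.0, 0.8, size=20)   # a skewed sample
h = 0.9 * min(x.std(ddof=1), (x.max() - x.min()) / 3) / x.size**0.2
grid = np.linspace(0, x.max() * 1.2, 200)
fhat = kernel_density(grid, x, h)
print(np.trapz(fhat, grid))           # should integrate to roughly 1
```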

Example C.2 Kernel Density Estimator for the Income Data Figure C.2 suggests the large skew in the income data that is also suggested by the box and whisker plot (and the scatter plot) in Example C.1.

C.4 STATISTICS AS ESTIMATORS—SAMPLING DISTRIBUTIONS

The measures described in the preceding section summarize the data in a random sample. Each measure has a counterpart in the population, that is, the distribution from which the data were drawn. Sample quantities such as the means and the correlation coefficient correspond to population expectations, whereas the kernel density estimator and the values in Table C.1 parallel the population pdf and cdf. In the setting of a random sample, we expect these quantities to mimic


FIGURE C.2  Kernel Density Estimate for Income.

TABLE C.1  Income Distribution
Range              Relative Frequency    Cumulative Frequency
<$10,000           0.15                  0.15
10,000–25,000      0.30                  0.45
25,000–50,000      0.40                  0.85
>50,000            0.15                  1.00

the population, although not perfectly. The precise manner in which these quantities reflect the population values defines the sampling distribution of a sample statistic.

DEFINITION C.1 Statistic A statistic is any function computed from the data in a sample.

If another sample were drawn under identical conditions, different values would be obtained for the observations, as each one is a random variable. Any statistic is a function of these random values, so it is also a random variable with a probability distribution called a sampling distribution. For example, the following shows an exact result for the sampling behavior of a widely used statistic.


THEOREM C.1  Sampling Distribution of the Sample Mean
If x₁, …, xₙ are a random sample from a population with mean μ and variance σ², then x̄ is a random variable with mean μ and variance σ²/n.
Proof: x̄ = (1/n)Σᵢ xᵢ. E[x̄] = (1/n)Σᵢ μ = μ. The observations are independent, so Var[x̄] = (1/n)²Var[Σᵢ xᵢ] = (1/n²)Σᵢ σ² = σ²/n.

Example C.3 illustrates the behavior of the sample mean in samples of four observations drawn from a chi-squared population with one degree of freedom. The crucial concepts illus- trated in this example are, first, the mean and variance results in Theorem C.1 and, second, the phenomenon of sampling variability. Notice that the fundamental result in Theorem C.1 does not assume a distribution for xi . Indeed, looking back at Section C.3, nothing we have done so far has required any assumption about a particular distribution.

Example C.3 Sampling Distribution of a Sample Mean Figure C.3 shows a frequency plot of the means of 1,000 random samples of four observations drawn from a chi-squared distribution with one degree of freedom, which has mean 1 and variance 2.
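This experiment is easy to replicate; a minimal sketch (seed arbitrary) is

```python
import numpy as np

# Means of 1,000 samples of four chi-squared[1] draws: they should scatter
# around the population mean 1 with variance near 2/4 = 0.5 (Theorem C.1).
rng = np.random.default_rng(9)
means = rng.chisquare(df=1, size=(1000, 4)).mean(axis=1)
print(means.mean(), means.var(ddof=1))    # roughly 1 and 0.5
```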

We are often interested in how a statistic behaves as the sample size increases. Example C.4 illustrates one such case. Figure C.4 shows two sampling distributions, one based on samples of three and a second, of the same statistic, but based on samples of six. The effect of increasing sample size in this figure is unmistakable. It is easy to visualize the behavior of this statistic if we extrapolate the experiment in Example C.4 to samples of, say, 100.

Example C.4  Sampling Distribution of the Sample Minimum
If x₁, …, xₙ are a random sample from an exponential distribution with f(x) = θe^(−θx), then the sampling distribution of the sample minimum in a sample of n observations, denoted x₍₁₎, is
$$f\big(x_{(1)}\big) = (n\theta)\, e^{-(n\theta) x_{(1)}}.$$

Because E[x] = 1/θ and Var[x] = 1/θ², by analogy E[x₍₁₎] = 1/(nθ) and Var[x₍₁₎] = 1/(nθ)². Thus, in increasingly larger samples, the minimum will be arbitrarily close to 0. [The Chebychev inequality in Theorem D.2 can be used to prove this intuitively appealing result.] Figure C.4 shows the results of a simple sampling experiment you can do to demonstrate this effect. It requires software that will allow you to produce pseudorandom numbers uniformly distributed in the range zero to one and that will let you plot a histogram and control the axes. (We used NLOGIT. This can be done with Stata, Excel, or several other packages.) The experiment consists of drawing 1,000 sets of nine random values, Uᵢⱼ, i = 1, …, 1,000, j = 1, …, 9. To transform these uniform draws to exponential with parameter θ (we used θ = 1.5), use the inverse probability transform; see Section E.2.3. For an exponentially distributed variable, the transformation is zᵢⱼ = −(1/θ) log(1 − Uᵢⱼ). We then created z₍₁₎|3 from the first three draws and z₍₁₎|6 from the other six. The two histograms show clearly the effect on the sampling distribution of increasing sample size from just 3 to 6.
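A sketch of this experiment in code (the plotting step is omitted; any histogram routine will reproduce Figure C.4):

```python
import numpy as np

# Inverse probability transform z = -(1/theta) * log(1 - U) for exponential
# draws, with theta = 1.5 as in the text.
rng = np.random.default_rng(10)
theta = 1.5
U = rng.uniform(size=(1000, 9))
z = -np.log(1.0 - U) / theta
z_min3 = z[:, :3].min(axis=1)          # sample minimum of 3 observations
z_min6 = z[:, 3:].min(axis=1)          # sample minimum of 6 observations
print(z_min3.mean(), 1 / (3 * theta))  # E[x(1)] = 1/(n*theta)
print(z_min6.mean(), 1 / (6 * theta))
```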

Sampling distributions are used to make inferences about the population. To consider a perhaps obvious example, because the sampling distribution of the mean of a set of normally distributed observations has mean μ, the sample mean is a natural candidate for an estimate of μ. The observation that the sample “mimics” the population is a statement about the sampling


FIGURE C.3  Sampling Distribution of Means of 1,000 Samples of Size 4 from Chi-Squared [1]. (Mean = 0.9038, Variance = 0.5637.)


FIGURE C.4  Histograms of the Sample Minimum of 3 and 6 Observations.

distributions of the sample statistics. Consider, for example, the sample data collected in Figure C.3. The sample mean of four observations clearly has a sampling distribution, which appears to have a mean roughly equal to the population mean. Our theory of parameter estimation departs from this point.

C.5 POINT ESTIMATION OF PARAMETERS

Our objective is to use the sample data to infer the value of a parameter or set of parameters, which we denote θ. A point estimate is a statistic computed from a sample that gives a single value for θ. The standard error of the estimate is the standard deviation of the sampling distribution of the statistic; the square of this quantity is the sampling variance. An interval estimate is a range of values that will contain the true parameter with a preassigned probability. There will be a connection between the two types of estimates; generally, if θ̂ is the point estimate, then the interval estimate will be θ̂ ± a measure of sampling error. An estimator is a rule or strategy for using the data to estimate the parameter. It is defined before the data are drawn. Obviously, some estimators are better than others. To take a simple example, your intuition should convince you that the sample mean would be a better estimator of the population mean than the sample minimum; the minimum is almost certain to underestimate the mean. Nonetheless, the minimum is not entirely without virtue; it is easy to compute, which is occasionally a relevant criterion. The search for good estimators constitutes much of econometrics. Estimators are compared on the basis of a variety of attributes. Finite sample properties of estimators are those attributes that can be compared regardless of the sample size. Some estimation problems involve characteristics that are not known in finite samples. In these instances, estimators are compared on the basis of their large sample, or asymptotic, properties. We consider these in turn.

C.5.1 ESTIMATION IN A FINITE SAMPLE

The following are some finite sample estimation criteria for estimating a single parameter. The extensions to the multiparameter case are direct. We shall consider them in passing where necessary.


DEFINITION C.2 Unbiased Estimator An estimator of a parameter θ is unbiased if the mean of its sampling distribution is θ. Formally, E [θˆ] = θ or E [θˆ − θ] = Bias[θˆ | θ] = 0 implies that θˆ is unbiased. Note that this implies that the expected sampling error is zero. If θ is a vector of parameters, then the estimator is unbiased if the expected value of every element of θˆ equals the corresponding element of θ.

If samples of size n are drawn repeatedly and θˆ is computed for each one, then the average value of these estimates will tend to equal θ. For example, the average of the 1,000 sample means underlying Figure C.2 is 0.9038, which is reasonably close to the population mean of one. The sample minimum is clearly a biased estimator of the mean; it will almost always underestimate the mean, so it will do so on average as well. Unbiasedness is a desirable attribute, but it is rarely used by itself as an estimation criterion. One reason is that there are many unbiased estimators that are poor uses of the data. For example, in a sample of size n, the first observation drawn is an unbiased estimator of the mean that clearly wastes a great deal of information. A second criterion used to choose among unbiased estimators is efficiency.
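To make the contrast concrete, here is a minimal simulation sketch, added for illustration and not part of the original text, that repeats the experiment described above: 1,000 samples of size 4 from chi-squared[1], whose population mean is one. The seed and replication count are arbitrary choices.

```python
import numpy as np

# Hedged sketch: 1,000 samples of size 4 from chi-squared[1] (population mean 1).
rng = np.random.default_rng(12345)
samples = rng.chisquare(df=1, size=(1000, 4))

print("average of sample means: ", samples.mean(axis=1).mean())  # near 1 (unbiased)
print("average of sample minima:", samples.min(axis=1).mean())   # well below 1 (biased)
```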

DEFINITION C.3 Efficient Unbiased Estimator
An unbiased estimator θ̂₁ is more efficient than another unbiased estimator θ̂₂ if the sampling variance of θ̂₁ is less than that of θ̂₂. That is,

Var[θ̂₁] < Var[θ̂₂].

In the multiparameter case, the comparison is based on the covariance matrices of the two estimators; θ̂₁ is more efficient than θ̂₂ if Var[θ̂₂] − Var[θ̂₁] is a positive definite matrix.

By this criterion, the sample mean is obviously to be preferred to the first observation as an estimator of the population mean. If σ² is the population variance, then

$$\mathrm{Var}[x_1] = \sigma^2 > \mathrm{Var}[\bar{x}] = \frac{\sigma^2}{n}.$$

In discussing efficiency, we have restricted the discussion to unbiased estimators. Clearly, there are biased estimators that have smaller variances than the unbiased ones we have considered. Any constant has a variance of zero. Of course, using a constant as an estimator is not likely to be an effective use of the sample data. Focusing on unbiasedness may still preclude a tolerably biased estimator with a much smaller variance, however. A criterion that recognizes this possible tradeoff is the mean squared error.


DEFINITION C.4 Mean Squared Error
The mean squared error of an estimator is

$$\mathrm{MSE}[\hat\theta \mid \theta] = E[(\hat\theta - \theta)^2] = \mathrm{Var}[\hat\theta] + \left(\mathrm{Bias}[\hat\theta \mid \theta]\right)^2 \quad \text{if } \theta \text{ is a scalar}, \tag{C-9}$$
$$\mathrm{MSE}[\hat\theta \mid \theta] = \mathrm{Var}[\hat\theta] + \mathrm{Bias}[\hat\theta \mid \theta]\,\mathrm{Bias}[\hat\theta \mid \theta]' \quad \text{if } \theta \text{ is a vector}.$$

Figure C.5 illustrates the effect. In this example, on average, the biased estimator will be closer to the true parameter than will the unbiased estimator. Which of these criteria should be used in a given situation depends on the particulars of that setting and our objectives in the study. Unfortunately, the MSE criterion is rarely operational; minimum mean squared error estimators, when they exist at all, usually depend on unknown parameters. Thus, we are usually less demanding. A commonly used criterion is minimum variance unbiasedness.

Example C.5 Mean Squared Error of the Sample Variance
In sampling from a normal distribution, the most frequently used estimator for σ² is

$$s^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}.$$

It is straightforward to show that s² is unbiased, so

$$\mathrm{Var}[s^2] = \frac{2\sigma^4}{n-1} = \mathrm{MSE}[s^2 \mid \sigma^2].$$

[Figure C.5: Sampling Distributions. Densities of a biased and an unbiased estimator of θ.]


[A proof is based on the distribution of the idempotent quadratic form (x − iμ)′M⁰(x − iμ), which we discussed in Section B.11.4.] A less frequently used estimator is

$$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2 = [(n-1)/n]s^2.$$

This estimator is slightly biased downward:

$$E[\hat\sigma^2] = \frac{(n-1)E[s^2]}{n} = \frac{(n-1)\sigma^2}{n},$$

so its bias is

$$E[\hat\sigma^2 - \sigma^2] = \mathrm{Bias}[\hat\sigma^2 \mid \sigma^2] = \frac{-\sigma^2}{n}.$$

But it has a smaller variance than s²:

$$\mathrm{Var}[\hat\sigma^2] = \left(\frac{n-1}{n}\right)^2 \frac{2\sigma^4}{n-1} < \mathrm{Var}[s^2].$$

To compare the two estimators, we can use the difference in their mean squared errors:

$$\mathrm{MSE}[\hat\sigma^2 \mid \sigma^2] - \mathrm{MSE}[s^2 \mid \sigma^2] = \sigma^4\left(\frac{2n-1}{n^2} - \frac{2}{n-1}\right) < 0.$$

The biased estimator is a bit more precise. The difference will be negligible in a large sample, but, for example, it is about 1.2 percent in a sample of 16.
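As a check on these expressions, the following Monte Carlo sketch, added here and not part of the text, compares the two estimators at n = 16; the simulated mean squared errors should be close to the theoretical values 2σ⁴/(n − 1) and (2n − 1)σ⁴/n². The N(0, 1) population, seed, and replication count are assumed, arbitrary settings.

```python
import numpy as np

# Sketch: MSE of s^2 (unbiased) versus [(n-1)/n]s^2 (biased) at n = 16, sigma = 1.
rng = np.random.default_rng(1)
n, reps = 16, 200_000
x = rng.normal(0.0, 1.0, size=(reps, n))

s2 = x.var(axis=1, ddof=1)        # unbiased estimator s^2
sig2_hat = x.var(axis=1, ddof=0)  # biased estimator [(n-1)/n]s^2

print(np.mean((s2 - 1.0) ** 2), 2 / (n - 1))               # approx 0.1333
print(np.mean((sig2_hat - 1.0) ** 2), (2 * n - 1) / n**2)  # approx 0.1211
```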

C.5.2 EFFICIENT UNBIASED ESTIMATION

In a random sample of n observations, the density of each observation is f (xi ,θ). Because the n observations are independent, their joint density is

$$f(x_1, x_2, \ldots, x_n, \theta) = f(x_1, \theta)f(x_2, \theta)\cdots f(x_n, \theta) = \prod_{i=1}^n f(x_i, \theta) = L(\theta \mid x_1, x_2, \ldots, x_n). \tag{C-10}$$

This function, denoted L(θ | X), is called the likelihood function for θ given the data X. It is frequently abbreviated to L(θ). Where no ambiguity can arise, we shall abbreviate it further to L.

Example C.6 Likelihood Functions for Exponential and Normal Distributions
If x₁, ..., xₙ are a sample of n observations from an exponential distribution with parameter θ, then

$$L(\theta) = \prod_{i=1}^n \theta e^{-\theta x_i} = \theta^n e^{-\theta \sum_{i=1}^n x_i}.$$

If x₁, ..., xₙ are a sample of n observations from a normal distribution with mean μ and standard deviation σ, then

$$L(\mu, \sigma) = \prod_{i=1}^n (2\pi\sigma^2)^{-1/2} e^{-(x_i-\mu)^2/(2\sigma^2)} = (2\pi\sigma^2)^{-n/2} e^{-\sum_i (x_i-\mu)^2/(2\sigma^2)}. \tag{C-11}$$


The likelihood function is the cornerstone for most of our theory of parameter estimation. An important result for efficient estimation is the following.

THEOREM C.2 Cramér–Rao Lower Bound
Assuming that the density of x satisfies certain regularity conditions, the variance of an unbiased estimator of a parameter θ will always be at least as large as

$$[I(\theta)]^{-1} = \left(-E\left[\frac{\partial^2 \ln L(\theta)}{\partial \theta^2}\right]\right)^{-1} = \left(E\left[\left(\frac{\partial \ln L(\theta)}{\partial \theta}\right)^2\right]\right)^{-1}. \tag{C-12}$$

The quantity I(θ) is the information number for the sample. We will prove the result that the negative of the expected second derivative equals the expected square of the first derivative in Chapter 16. Proof of the main result of the theorem is quite involved. See, for example, Stuart and Ord (1989).

The regularity conditions are technical in nature. (See Section 16.4.1.) Loosely, they are conditions imposed on the density of the random variable that appears in the likelihood function; these conditions will ensure that the Lindeberg–Levy central limit theorem will apply to moments of the sample of observations on the random vector y = ∂ ln f(xᵢ | θ)/∂θ, i = 1, ..., n. Among the conditions are finite moments of x up to order 3. An additional condition normally included in the set is that the range of the random variable be independent of the parameters.
In some cases, the second derivative of the log likelihood is a constant, so the Cramér–Rao bound is simple to obtain. For instance, in sampling from an exponential distribution, from Example C.6,

$$\ln L = n \ln \theta - \theta \sum_{i=1}^n x_i,$$
$$\frac{\partial \ln L}{\partial \theta} = \frac{n}{\theta} - \sum_{i=1}^n x_i,$$

so ∂² ln L/∂θ² = −n/θ² and the variance bound is [I(θ)]⁻¹ = θ²/n. In many situations, the second derivative is a random variable with a distribution of its own. The following examples show two such cases.
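The information matrix equality used in (C-12) can be checked numerically. The sketch below, an added illustration with arbitrary parameter values, draws exponential samples and compares the sampling variance of the score, ∂ ln L/∂θ = n/θ − Σxᵢ, evaluated at the true θ, with the information n/θ².

```python
import numpy as np

# Sketch: exponential sampling with assumed theta = 2, n = 50.
rng = np.random.default_rng(2)
theta, n, reps = 2.0, 50, 100_000
x = rng.exponential(scale=1 / theta, size=(reps, n))  # rate theta => scale 1/theta

score = n / theta - x.sum(axis=1)   # d ln L / d theta at the true theta
print(score.mean())                 # approximately 0
print(score.var(), n / theta**2)    # both approximately 12.5
```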

Example C.7 Variance Bound for the Poisson Distribution For the Poisson distribution,

$$f(x) = \frac{e^{-\theta}\theta^x}{x!},$$
$$\ln L = -n\theta + \sum_{i=1}^n x_i \ln\theta - \sum_{i=1}^n \ln(x_i!),$$
$$\frac{\partial \ln L}{\partial \theta} = -n + \frac{\sum_{i=1}^n x_i}{\theta},$$
$$\frac{\partial^2 \ln L}{\partial \theta^2} = \frac{-\sum_{i=1}^n x_i}{\theta^2}.$$


The sum of n identical Poisson variables has a Poisson distribution with parameter equal to n times the parameter of the individual variables. Therefore, the actual distribution of the first derivative will be that of a linear function of a Poisson distributed variable. Because E[Σᵢ₌₁ⁿ xᵢ] = nE[xᵢ] = nθ, the variance bound for the Poisson distribution is [I(θ)]⁻¹ = θ/n. (Note also that the same result implies that E[∂ ln L/∂θ] = 0, which is a result we will use in Chapter 16. The same result holds for the exponential distribution.)
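Because the sample mean of Poisson data is unbiased with variance θ/n, it attains this bound exactly; a brief simulation sketch, added here with arbitrary parameter values, makes the point.

```python
import numpy as np

# Sketch: Poisson sampling with assumed theta = 4, n = 25.
rng = np.random.default_rng(3)
theta, n, reps = 4.0, 25, 100_000
xbar = rng.poisson(theta, size=(reps, n)).mean(axis=1)

print(xbar.mean(), theta)     # unbiased: approximately 4
print(xbar.var(), theta / n)  # attains the bound: approximately 0.16
```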

Consider, finally, a multivariate case. If θ is a vector of parameters, then I(θ) is the information matrix. The Cramér–Rao theorem states that the difference between the covariance matrix of any unbiased estimator and the inverse of the information matrix,

$$[I(\theta)]^{-1} = \left(-E\left[\frac{\partial^2 \ln L(\theta)}{\partial \theta\, \partial \theta'}\right]\right)^{-1} = \left(E\left[\frac{\partial \ln L(\theta)}{\partial \theta}\, \frac{\partial \ln L(\theta)}{\partial \theta'}\right]\right)^{-1}, \tag{C-13}$$

will be a nonnegative definite matrix. In many settings, numerous estimators are available for the parameters of a distribution. The usefulness of the Cramér–Rao bound is that if one of these is known to attain the variance bound, then there is no need to consider any other to seek a more efficient estimator. Regarding the use of the variance bound, we emphasize that if an unbiased estimator attains it, then that estimator is efficient. If a given estimator does not attain the variance bound, however, then we do not know, except in a few special cases, whether this estimator is efficient or not. It may be that no unbiased estimator can attain the Cramér–Rao bound, which can leave the question of whether a given unbiased estimator is efficient or not unanswered. We note, finally, that in some cases we further restrict the set of estimators to linear functions of the data.

DEFINITION C.5 Minimum Variance Linear Unbiased Estimator (MVLUE) An estimator is the minimum variance linear unbiased estimator or best linear unbiased estimator (BLUE) if it is a linear function of the data and has minimum variance among linear unbiased estimators.

In a few instances, such as the normal mean, there will be an efficient linear unbiased estimator; x̄ is efficient among all unbiased estimators, both linear and nonlinear. In other cases, such as the normal variance, there is no linear unbiased estimator. This criterion is useful because we can sometimes find an MVLUE without having to specify the distribution at all. Thus, by limiting ourselves to a somewhat restricted class of estimators, we free ourselves from having to assume a particular distribution.

C.6 INTERVAL ESTIMATION

Regardless of the properties of an estimator, the estimate obtained will vary from sample to sample, and there is some probability that it will be quite erroneous. A point estimate will not provide any information on the likely range of error. The logic behind an interval estimate is that we use the sample data to construct an interval, [lower(X), upper(X)], such that we can expect this interval to contain the true parameter in some specified proportion of samples, or


equivalently, with some desired level of confidence. Clearly, the wider the interval, the more confident we can be that it will, in any given sample, contain the parameter being estimated. The theory of interval estimation is based on a pivotal quantity, which is a function of both the parameter and a point estimate that has a known distribution. Consider the following examples.

Example C.8 Confidence Intervals for the Normal Mean
In sampling from a normal distribution with mean μ and standard deviation σ,

$$z = \frac{\sqrt{n}(\bar{x} - \mu)}{s} \sim t[n-1],$$

and

$$c = \frac{(n-1)s^2}{\sigma^2} \sim \chi^2[n-1].$$

Given the pivotal quantity, we can make probability statements about events involving the parameter and the estimate. Let p(g, θ) be the constructed random variable, for example, z or c. Given a prespecified confidence level, 1 − α, we can state that

Prob(lower ≤ p(g, θ) ≤ upper) = 1 − α, (C-14)

where lower and upper are obtained from the appropriate table. This statement is then manipulated to make equivalent statements about the endpoints of the intervals. For example, the following statements are equivalent:

$$\mathrm{Prob}\left(-z \le \frac{\sqrt{n}(\bar{x}-\mu)}{s} \le z\right) = 1 - \alpha,$$
$$\mathrm{Prob}\left(\bar{x} - \frac{zs}{\sqrt{n}} \le \mu \le \bar{x} + \frac{zs}{\sqrt{n}}\right) = 1 - \alpha.$$

The second of these is a statement about the interval, not the parameter; that is, it is the interval that is random, not the parameter. We attach a probability, or 100(1 − α) percent confidence level, to the interval itself; in repeated sampling, an interval constructed in this fashion will contain the true parameter 100(1 − α) percent of the time.

In general, the interval constructed by this method will be of the form

lower(X) = θ̂ − e₁,

upper(X) = θ̂ + e₂,

where X is the sample data, e₁ and e₂ are sampling errors, and θ̂ is a point estimate of θ. It is clear from the preceding example that if the sampling distribution of the pivotal quantity is either t or standard normal, which will be true in the vast majority of cases we encounter in practice, then the confidence interval will be

$$\hat\theta \pm C_{1-\alpha/2}\,[\mathrm{se}(\hat\theta)], \tag{C-15}$$

where se(.) is the (known or estimated) standard error of the parameter estimate and C₁₋α/₂ is the value from the t or standard normal distribution that is exceeded with probability α/2. The usual values for α are 0.10, 0.05, or 0.01. The theory does not prescribe exactly how to choose the endpoints for the confidence interval. An obvious criterion is to minimize the width of the interval. If the sampling distribution is symmetric, then the symmetric interval is the best one. If the sampling distribution is not symmetric, however, then this procedure will not be optimal.


Example C.9 Estimated Confidence Intervals for a Normal Mean and Variance
In a sample of 25, x̄ = 1.63 and s = 0.51. Construct a 95 percent confidence interval for μ. Assuming that the sample of 25 is from a normal distribution,

$$\mathrm{Prob}\left(-2.064 \le \frac{5(\bar{x} - \mu)}{s} \le 2.064\right) = 0.95,$$

where 2.064 is the critical value from a t distribution with 24 degrees of freedom. Thus, the confidence interval is 1.63 ± [2.064(0.51)/5], or [1.4195, 1.8405].
Remark: Had the parent distribution not been specified, it would have been natural to use the standard normal distribution instead, perhaps relying on the central limit theorem. But a sample size of 25 is small enough that the more conservative t distribution might still be preferable.
The chi-squared distribution is used to construct a confidence interval for the variance of a normal distribution. Using the data from Example C.9, we find that the usual procedure would use

$$\mathrm{Prob}\left(12.4 \le \frac{24 s^2}{\sigma^2} \le 39.4\right) = 0.95,$$

where 12.4 and 39.4 are the 0.025 and 0.975 cutoff points from the chi-squared(24) distribution. This procedure leads to the 95 percent confidence interval [0.1581, 0.5032]. By making use of the asymmetry of the distribution, a narrower interval can be constructed. Allocating 4 percent to the left-hand tail and 1 percent to the right instead of 2.5 percent to each, the two cutoff points are 13.4 and 42.9, and the resulting 95 percent confidence interval is [0.1455, 0.4659]. Finally, the confidence interval can be manipulated to obtain a confidence interval for a function of a parameter. For example, based on the preceding, a 95 percent confidence interval for σ would be [√0.1581, √0.5032] = [0.3976, 0.7094].
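The arithmetic of Example C.9 is easy to reproduce; the following sketch, an added illustration using SciPy's quantile functions, recovers the critical values and both intervals.

```python
import numpy as np
from scipy import stats

# Reproducing Example C.9: n = 25, xbar = 1.63, s = 0.51.
n, xbar, s = 25, 1.63, 0.51

t_crit = stats.t.ppf(0.975, df=n - 1)             # approx 2.064
half = t_crit * s / np.sqrt(n)
print("CI for mu:", xbar - half, xbar + half)     # approx [1.4195, 1.8405]

lo, hi = stats.chi2.ppf([0.025, 0.975], df=n - 1) # approx 12.4 and 39.4
print("CI for sigma^2:", (n - 1) * s**2 / hi, (n - 1) * s**2 / lo)  # approx [0.1581, 0.5032]
```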

C.7 HYPOTHESIS TESTING

The second major group of statistical inference procedures is hypothesis tests. The classical testing procedures are based on constructing a statistic from a random sample that will enable the analyst to decide, with reasonable confidence, whether or not the data in the sample would have been generated by a hypothesized population. The formal procedure involves a statement of the hypothesis, usually in terms of a “null” or maintained hypothesis and an “alternative,” conventionally denoted H0 and H1, respectively. The procedure itself is a rule, stated in terms of the data, that dictates whether the null hypothesis should be rejected or not. For example, the hypothesis might state a parameter is equal to a specified value. The decision rule might state that the hypothesis should be rejected if a sample estimate of that parameter is too far away from that value (where “far” remains to be defined). The classical, or Neyman–Pearson, methodology involves partitioning the sample space into two regions. If the observed data (i.e., the test statistic) fall in the rejection region (sometimes called the critical region), then the null hypothesis is rejected; if they fall in the acceptance region, then it is not.

C.7.1 CLASSICAL TESTING PROCEDURES

Since the sample is random, the test statistic, however defined, is also random. The same test procedure can lead to different conclusions in different samples. As such, there are two ways such a procedure can be in error:

1. Type I error. The procedure may lead to rejection of the null hypothesis when it is true.
2. Type II error. The procedure may fail to reject the null hypothesis when it is false.


To continue the previous example, there is some probability that the estimate of the parameter will be quite far from the hypothesized value, even if the hypothesis is true. This outcome might cause a type I error.

DEFINITION C.6 Size of a Test The probability of a type I error is the size of the test. This is conventionally denoted α and is also called the significance level.

The size of the test is under the control of the analyst. It can be changed just by changing the decision rule. Indeed, the type I error could be eliminated altogether just by making the rejection region very small, but this would come at a cost. By eliminating the probability of a type I error—that is, by making it unlikely that the hypothesis is rejected—we must increase the probability of a type II error. Ideally, we would like both probabilities to be as small as possible. It is clear, however, that there is a tradeoff between the two. The best we can hope for is that for a given probability of type I error, the procedure we choose will have as small a probability of type II error as possible.

DEFINITION C.7 Power of a Test
The power of a test is the probability that it will correctly lead to rejection of a false null hypothesis:

power = 1 − β = 1 − Prob(type II error). (C-16)

For a given significance level α, we would like β to be as small as possible. Because β is defined in terms of the alternative hypothesis, it depends on the value of the parameter.

Example C.10 Testing a Hypothesis About a Mean
For testing H₀: μ = μ₀ in a normal distribution with known variance σ², the decision rule is to reject the hypothesis if the absolute value of the z statistic, √n(x̄ − μ₀)/σ, exceeds the predetermined critical value. For a test at the 5 percent significance level, we set the critical value at 1.96. The power of the test, therefore, is the probability that the absolute value of the test statistic will exceed 1.96 given that the true value of μ is, in fact, not μ₀. This value depends on the alternative value of μ, as shown in Figure C.6. Notice that for this test the power is equal to the size at the point where μ equals μ₀. As might be expected, the test becomes more powerful the farther the true mean is from the hypothesized value.
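The power function in Figure C.6 can be computed directly. Under the alternative, the z statistic is distributed N(√n(μ − μ₀)/σ, 1), so the power is the probability that such a variable falls outside ±1.96. The sketch below is an added illustration; σ = 1 and n = 25 are assumed, arbitrary choices.

```python
import numpy as np
from scipy import stats

# Sketch of the power function for the two-sided 5 percent z test of H0: mu = mu0.
def power(mu, mu0=0.0, sigma=1.0, n=25, z=1.96):
    shift = np.sqrt(n) * (mu - mu0) / sigma   # mean of the z statistic under mu
    return stats.norm.sf(z - shift) + stats.norm.cdf(-z - shift)

for mu in (0.0, 0.2, 0.4, 0.6):
    print(mu, power(mu))   # equals the size 0.05 at mu = mu0, rises with |mu - mu0|
```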

Testing procedures, like estimators, can be compared using a number of criteria.

DEFINITION C.8 Most Powerful Test
A test is most powerful if it has greater power than any other test of the same size.


[Figure C.6: Power Function for a Test. The power 1 − β plotted against μ, with its minimum, equal to the size, at μ = μ₀.]

This requirement is very strong. Because the power depends on the alternative hypothesis, we might require that the test be uniformly most powerful (UMP), that is, have greater power than any other test of the same size for all admissible values of the parameter. There are few situations in which a UMP test is available. We usually must be less stringent in our requirements. Nonetheless, the criteria for comparing hypothesis testing procedures are generally based on their respective power functions. A common and very modest requirement is that the test be unbiased.

DEFINITION C.9 Unbiased Test A test is unbiased if its power (1 − β) is greater than or equal to its size α for all values of the parameter.

If a test is biased, then, for some values of the parameter, we are more likely to accept the null hypothesis when it is false than when it is true.
The use of the term unbiased here is unrelated to the concept of an unbiased estimator. Fortunately, there is little chance of confusion. Tests and estimators are clearly connected, however. The following criterion derives, in general, from the corresponding attribute of a parameter estimate.

DEFINITION C.10 Consistent Test
A test is consistent if its power goes to one as the sample size grows to infinity.


Example C.11 Consistent Test About a Mean
A confidence interval for the mean of a normal distribution is x̄ ± t₁₋α/₂(s/√n), where x̄ and s are the usual consistent estimators for μ and σ (see Section D.2.1), n is the sample size, and t₁₋α/₂ is the correct critical value from the t distribution with n − 1 degrees of freedom. For testing H₀: μ = μ₀ versus H₁: μ ≠ μ₀, let the procedure be to reject H₀ if the confidence interval does not contain μ₀. Because x̄ is consistent for μ, one can discern if H₀ is false as n → ∞, with probability 1, because x̄ will be arbitrarily close to the true μ. Therefore, this test is consistent.

As a general rule, a test will be consistent if it is based on a consistent estimator of the parameter.

C.7.2 TESTS BASED ON CONFIDENCE INTERVALS

There is an obvious link between interval estimation and the sorts of hypothesis tests we have been discussing here. The confidence interval gives a range of plausible values for the parameter. Therefore, it stands to reason that if a hypothesized value of the parameter does not fall in this range of plausible values, then the data are not consistent with the hypothesis, and it should be rejected. Consider, then, testing

H0: θ = θ0,

H₁: θ ≠ θ₀.

We form a confidence interval based on θ̂ as described earlier:

$$\hat\theta - C_{1-\alpha/2}[\mathrm{se}(\hat\theta)] < \theta < \hat\theta + C_{1-\alpha/2}[\mathrm{se}(\hat\theta)].$$

H₀ is rejected if θ₀ exceeds the upper limit or is less than the lower limit. Equivalently, H₀ is rejected if

$$\left|\frac{\hat\theta - \theta_0}{\mathrm{se}(\hat\theta)}\right| > C_{1-\alpha/2}.$$

In words, the hypothesis is rejected if the estimate is too far from θ0, where the distance is measured in standard error units. The critical value is taken from the t or standard normal distribution, whichever is appropriate.

Example C.12 Testing a Hypothesis About a Mean with a Confidence Interval
For the results in Example C.9, test H₀: μ = 1.98 versus H₁: μ ≠ 1.98, assuming sampling from a normal distribution:

$$|t| = \left|\frac{\bar{x} - 1.98}{s/\sqrt{n}}\right| = \left|\frac{1.63 - 1.98}{0.102}\right| = 3.43.$$

The 95 percent critical value for t(24) is 2.064. Therefore, reject H0. If the critical value for the standard normal table of 1.96 is used instead, then the same result is obtained.

If the test is one-sided, as in

H0: θ ≥ θ0,

H₁: θ < θ₀,

then the critical region must be adjusted. Thus, for this test, H₀ will be rejected if a point estimate of θ falls sufficiently below θ₀. (Tests can usually be set up by departing from the decision criterion, “What sample results are inconsistent with the hypothesis?”)


Example C.13 One-Sided Test About a Mean
A sample of 25 from a normal distribution yields x̄ = 1.63 and s = 0.51. Test

H₀: μ ≤ 1.5,

H₁: μ > 1.5.

Clearly, no observed x̄ less than or equal to 1.5 will lead to rejection of H₀. Using the borderline value of 1.5 for μ, we obtain

$$\mathrm{Prob}\left(\frac{\sqrt{n}(\bar{x} - 1.5)}{s} > \frac{5(1.63 - 1.5)}{0.51}\right) = \mathrm{Prob}(t_{24} > 1.27).$$

This is approximately 0.11. This value is not unlikely by the usual standards. Hence, at a significance level of 0.11, we would not reject the hypothesis.
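The tail probability in Example C.13 is a one-line computation; for instance, in this added sketch using SciPy:

```python
from scipy import stats

# Prob(t[24] > 1.2745), the tail probability in Example C.13.
print(stats.t.sf(5 * (1.63 - 1.5) / 0.51, df=24))   # approximately 0.11
```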

C.7.3 SPECIFICATION TESTS

The hypothesis testing procedures just described are known as “classical” testing procedures. In each case, the null hypothesis tested came in the form of a restriction on the alternative. You can verify that in each application we examined, the parameter space assumed under the null hypothesis is a subspace of that described by the alternative. For that reason, the models implied are said to be “nested.” The null hypothesis is contained within the alternative. This approach suffices for most of the testing situations encountered in practice, but there are common situations in which two competing models cannot be viewed in these terms. For example, consider a case in which there are two completely different, competing theories to explain the same observed data. Many models for censoring and truncation discussed in Chapter 24 rest upon a fragile assumption of normality, for example. Testing of this nature requires a different approach from the classical procedures discussed here. These are discussed at various points throughout the book, for example, in Chapter 24, where we study the difference between fixed and random effects models.

APPENDIX D LARGE-SAMPLE DISTRIBUTION THEORY

D.1 INTRODUCTION

Most of this book is about parameter estimation. In studying that subject, we will usually be interested in determining how best to use the observed data when choosing among competing estimators. That, in turn, requires us to examine the sampling behavior of estimators. In a few cases, such as those presented in Appendix C and the least squares estimator considered in Chapter 4, we can make broad statements about sampling distributions that will apply regardless of the size of the sample. But, in most situations, it will only be possible to make approximate statements about estimators, such as whether they improve as the sample size increases and what can be said about their sampling distributions in large samples as an approximation to the finite samples we actually observe. This appendix will collect most of the formal, fundamental theorems


and results needed for this analysis. A few additional results will be developed in the discussion of time-series analysis later in the book.

D.2 LARGE-SAMPLE DISTRIBUTION THEORY¹

In most cases, whether an estimator is exactly unbiased or what its exact sampling variance is in samples of a given size will be unknown. But we may be able to obtain approximate results about the behavior of the distribution of an estimator as the sample becomes large. For example, it is well known that the distribution of the mean of a sample tends to approximate normality as the sample size grows, regardless of the distribution of the individual observations. Knowledge about the limiting behavior of the distribution of an estimator can be used to infer an approximate distribution for the estimator in a finite sample. To describe how this is done, it is necessary, first, to present some results on convergence of random variables.

D.2.1 CONVERGENCE IN PROBABILITY

Limiting arguments in this discussion will be with respect to the sample size n. Let xₙ be a sequence of random variables indexed by the sample size.

DEFINITION D.1 Convergence in Probability
The random variable xₙ converges in probability to a constant c if limₙ→∞ Prob(|xₙ − c| > ε) = 0 for any positive ε.

Convergence in probability implies that the values that the variable may take that are not close to c become increasingly unlikely as n increases. To consider one example, suppose that the random variable xₙ takes two values, zero and n, with probabilities 1 − (1/n) and (1/n), respectively. As n increases, the second point will become ever more remote from any constant but, at the same time, will become increasingly less probable. In this example, xₙ converges in probability to zero. The crux of this form of convergence is that all the mass of the probability distribution becomes concentrated at points close to c. If xₙ converges in probability to c, then we write

plim xn = c. (D-1)

We will make frequent use of a special case of convergence in probability, convergence in mean square or convergence in quadratic mean.

THEOREM D.1 Convergence in Quadratic Mean
If xₙ has mean μₙ and variance σₙ² such that the ordinary limits of μₙ and σₙ² are c and 0, respectively, then xₙ converges in mean square to c, and

plim xn = c.

1A comprehensive summary of many results in large-sample theory appears in White (2001). The results discussed here will apply to samples of independent observations. Time series cases in which observations are correlated are analyzed in Chapters 19 through 22. Greene-50558 book June 25, 2007 12:52


A proof of Theorem D.1 can be based on another useful theorem.

THEOREM D.2 Chebychev’s Inequality
If xₙ is a random variable and c and ε are constants, then Prob(|xₙ − c| > ε) ≤ E[(xₙ − c)²]/ε².

To establish the Chebychev inequality, we use another result [see Goldberger (1991, p. 31)].

THEOREM D.3 Markov’s Inequality
If yₙ is a nonnegative random variable and δ is a positive constant, then Prob[yₙ ≥ δ] ≤ E[yₙ]/δ.
Proof: E[yₙ] = Prob[yₙ < δ]E[yₙ | yₙ < δ] + Prob[yₙ ≥ δ]E[yₙ | yₙ ≥ δ]. Because yₙ is nonnegative, both terms must be nonnegative, so E[yₙ] ≥ Prob[yₙ ≥ δ]E[yₙ | yₙ ≥ δ]. Because E[yₙ | yₙ ≥ δ] must be greater than or equal to δ, E[yₙ] ≥ Prob[yₙ ≥ δ]δ, which is the result.

Now, to prove Theorem D.1, let yₙ be (xₙ − c)² and δ be ε² in Theorem D.3. Then, (xₙ − c)² > δ implies that |xₙ − c| > ε. Finally, we will use a special case of the Chebychev inequality, where c = μₙ, so that we have

$$\mathrm{Prob}(|x_n - \mu_n| > \varepsilon) \le \sigma_n^2/\varepsilon^2. \tag{D-2}$$

Taking the limits of μₙ and σₙ² in (D-2), we see that if

$$\lim_{n\to\infty} E[x_n] = c, \quad\text{and}\quad \lim_{n\to\infty} \mathrm{Var}[x_n] = 0, \tag{D-3}$$

then

plim xₙ = c.

We have shown that convergence in mean square implies convergence in probability. Mean-square convergence implies that the distribution of xₙ collapses to a spike at plim xₙ, as shown in Figure D.1.

Example D.1 Mean Square Convergence of the Sample Minimum in Exponential Sampling
As noted in Example C.4, in sampling of n observations from an exponential distribution, for the sample minimum x₍₁₎,

$$\lim_{n\to\infty} E[x_{(1)}] = \lim_{n\to\infty} \frac{1}{n\theta} = 0$$

and

$$\lim_{n\to\infty} \mathrm{Var}[x_{(1)}] = \lim_{n\to\infty} \frac{1}{(n\theta)^2} = 0.$$

Therefore,

plim x₍₁₎ = 0.

Note, in particular, that the variance is divided by n². Thus, this estimator converges very rapidly to 0.
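A short simulation sketch, added here with arbitrary choices of θ = 1.5 and the replication count, shows how quickly the sample minimum collapses toward zero.

```python
import numpy as np

# Sketch for Example D.1: minimum of n exponential draws with rate theta.
rng = np.random.default_rng(4)
theta = 1.5
for n in (10, 100, 1000):
    xmin = rng.exponential(scale=1 / theta, size=(20_000, n)).min(axis=1)
    print(n, xmin.mean(), 1 / (n * theta), xmin.var(), 1 / (n * theta) ** 2)
```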


[Figure D.1: Quadratic Convergence to a Constant, θ. Sampling densities for n = 10, 100, and 1,000 collapsing to a spike at θ.]

Convergence in probability does not imply convergence in mean square. Consider the simple example given earlier in which xₙ equals either zero or n with probabilities 1 − (1/n) and (1/n). The exact expected value of xₙ is 1 for all n, which is not the probability limit. Indeed, if we let Prob(xₙ = n²) = (1/n) instead, the mean of the distribution explodes, but the probability limit is still zero. Again, the point xₙ = n² becomes ever more extreme but, at the same time, becomes ever less likely. The conditions for convergence in mean square are usually easier to verify than those for the more general form. Fortunately, we shall rarely encounter circumstances in which it will be necessary to show convergence in probability in which we cannot rely upon convergence in mean square. Our most frequent use of this concept will be in formulating consistent estimators.

DEFINITION D.2 Consistent Estimator
An estimator θ̂ₙ of a parameter θ is a consistent estimator of θ if and only if

plim θ̂ₙ = θ. (D-4)

THEOREM D.4 Consistency of the Sample Mean
The mean of a random sample from any population with finite mean μ and finite variance σ² is a consistent estimator of μ.
Proof: E[x̄ₙ] = μ and Var[x̄ₙ] = σ²/n. Therefore, x̄ₙ converges in mean square to μ, or plim x̄ₙ = μ.


Theorem D.4 is broader than it might appear at first.

COROLLARY TO THEOREM D.4 Consistency of a Mean of Functions
In random sampling, for any function g(x), if E[g(x)] and Var[g(x)] are finite constants, then

$$\mathrm{plim}\; \frac{1}{n}\sum_{i=1}^n g(x_i) = E[g(x)]. \tag{D-5}$$

Proof: Define yi = g(xi ) and use Theorem D.4.

Example D.2 Estimating a Function of the Mean
In sampling from a normal distribution with mean μ and variance 1, E[eˣ] = e^(μ+1/2) and Var[eˣ] = e^(2μ+2) − e^(2μ+1). (See Section B.4.4 on the lognormal distribution.) Hence,

$$\mathrm{plim}\; \frac{1}{n}\sum_{i=1}^n e^{x_i} = e^{\mu + 1/2}.$$
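A quick numerical check of Example D.2, an added sketch in which μ = 0 is an arbitrary choice:

```python
import numpy as np

# Sketch: the mean of exp(x_i), x_i ~ N(mu, 1), converges to exp(mu + 1/2).
rng = np.random.default_rng(5)
mu = 0.0
for n in (100, 10_000, 1_000_000):
    x = rng.normal(mu, 1.0, size=n)
    print(n, np.exp(x).mean(), np.exp(mu + 0.5))   # converges to about 1.6487
```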

D.2.2 OTHER FORMS OF CONVERGENCE AND LAWS OF LARGE NUMBERS

Theorem D.4 and the corollary just given are particularly narrow forms of a set of results known as laws of large numbers that are fundamental to the theory of parameter estimation. Laws of large numbers come in two forms depending on the type of convergence considered. The simpler of these are “weak laws of large numbers” which rely on convergence in probability as we defined it above. “Strong laws” rely on a broader type of convergence called almost sure convergence. Overall, the law of large numbers is a statement about the behavior of an average of a large number of random variables.

THEOREM D.5 Khinchine’s Weak Law of Large Numbers
If xᵢ, i = 1, ..., n is a random (i.i.d.) sample from a distribution with finite mean E[xᵢ] = μ, then

plim x̄ₙ = μ.

Proofs of this and the theorem below are fairly intricate. Rao (1973) provides one.

Notice that this is already broader than Theorem D.4, as it does not require that the variance of the distribution be finite. On the other hand, it is not broad enough, because most of the situations we encounter where we will need a result such as this will not involve i.i.d. random sampling. A broader result is


THEOREM D.6 Chebychev’s Weak Law of Large Numbers
If xᵢ, i = 1, ..., n is a sample of observations such that E[xᵢ] = μᵢ < ∞ and Var[xᵢ] = σᵢ² < ∞ such that σ̄ₙ²/n = (1/n²)Σᵢ σᵢ² → 0 as n → ∞, then plim(x̄ₙ − μ̄ₙ) = 0.

There is a subtle distinction between these two theorems that you should notice. The Chebychev theorem does not state that x̄ₙ converges to μ̄ₙ, or even that it converges to a constant at all. That would require a precise statement about the behavior of μ̄ₙ. The theorem states that as n increases without bound, these two quantities will be arbitrarily close to each other—that is, the difference between them converges to a constant, zero. This is an important notion that enters the derivation when we consider statistics that converge to random variables, instead of to constants. What we do have with these two theorems is extremely broad conditions under which a sample mean will converge in probability to its population counterpart. The more important difference between the Khinchine and Chebychev theorems is that the second allows for heterogeneity in the distributions of the random variables that enter the mean.
In analyzing time series data, the sequence of outcomes is itself viewed as a random event. Consider, then, the sample mean, x̄ₙ. The preceding results concern the behavior of this statistic as n → ∞ for a particular realization of the sequence x₁, ..., xₙ. But, if the sequence, itself, is viewed as a random event, then the limit to which x̄ₙ converges may be random as well. The stronger notion of almost sure convergence relates to this possibility.

DEFINITION D.3 Almost Sure Convergence
The random variable xₙ converges almost surely to the constant c if and only if

$$\mathrm{Prob}\left(\lim_{n\to\infty} x_n = c\right) = 1.$$

This is denoted xₙ →a.s. c. It states that the probability of observing a sequence that does not converge to c ultimately vanishes. Intuitively, it states that once the sequence xₙ becomes close to c, it stays close to c.
Almost sure convergence is used in a stronger form of the law of large numbers:

THEOREM D.7 Kolmogorov’s Strong Law of Large Numbers
If xᵢ, i = 1, ..., n is a sequence of independently distributed random variables such that E[xᵢ] = μᵢ < ∞ and Var[xᵢ] = σᵢ² < ∞ such that Σᵢ₌₁^∞ σᵢ²/i² < ∞, then x̄ₙ − μ̄ₙ →a.s. 0.


THEOREM D.8 Markov’s Strong Law of Large Numbers
If {zᵢ} is a sequence of independent random variables with E[zᵢ] = μᵢ < ∞ and if for some δ > 0, Σᵢ₌₁^∞ E[|zᵢ − μᵢ|^(1+δ)]/i^(1+δ) < ∞, then z̄ₙ − μ̄ₙ converges almost surely to 0, which we denote z̄ₙ − μ̄ₙ →a.s. 0.²

The variance condition is satisfied if every variance in the sequence is finite, but this is not strictly required; it only requires that the variances in the sequence increase at a slow enough rate that the sequence of variances as defined is bounded. The theorem allows for heterogeneity in the means and variances. If we return to the conditions of the Khinchine theorem, i.i.d. sampling, we have a corollary:

COROLLARY TO THEOREM D.8 (Kolmogorov)
If xᵢ, i = 1, ..., n is a sequence of independent and identically distributed random variables such that E[xᵢ] = μ < ∞ and E[|xᵢ|] < ∞, then x̄ₙ − μ →a.s. 0.

Note that the corollary requires identically distributed observations while the theorem only requires independence. Finally, another form of convergence encountered in the analysis of time series data is convergence in rth mean:

DEFINITION D.4 Convergence in rth Mean
If xₙ is a sequence of random variables such that E[|xₙ|^r] < ∞ and limₙ→∞ E[|xₙ − c|^r] = 0, then xₙ converges in rth mean to c. This is denoted xₙ →r.m. c.

Surely the most common application is the one we met earlier, convergence in mean square, which is convergence in the second mean. Some useful results follow from this definition:

THEOREM D.9 Convergence in Lower Powers
If xₙ converges in rth mean to c, then xₙ converges in sth mean to c for any s < r. The proof uses Jensen’s Inequality, Theorem D.13. Write E[|xₙ − c|^s] = E[(|xₙ − c|^r)^(s/r)] ≤ (E[|xₙ − c|^r])^(s/r), and the inner term converges to zero so the full function must also.

²The use of the expected absolute deviation differs a bit from the expected squared deviation that we have used heretofore to characterize the spread of a distribution. Consider two examples. If z ∼ N[0, σ²], then E[|z|] = Prob[z < 0]E[−z | z < 0] + Prob[z ≥ 0]E[z | z ≥ 0] = 0.7979σ. (See Theorem 24.2.) So, finite expected absolute value is the same as finite second moment for the normal distribution. But if z takes values [0, n] with probabilities [1 − 1/n, 1/n], then the variance of z is (n − 1), but E[|z − μ_z|] is 2 − 2/n. For this case, finite expected absolute value occurs without finite expected second moment. These are different characterizations of the spread of the distribution.


THEOREM D.10 Generalized Chebychev’s Inequality
If xₙ is a random variable and c is a constant such that E[|xₙ − c|^r] < ∞ and ε is a positive constant, then Prob(|xₙ − c| > ε) ≤ E[|xₙ − c|^r]/ε^r.

We have considered two cases of this result already: when r = 1, which is the Markov inequality, Theorem D.3, and when r = 2, which is the Chebychev inequality we looked at first in Theorem D.2.

THEOREM D.11 Convergence in rth Mean and Convergence in Probability
If xₙ →r.m. c for some r > 0, then xₙ →p c. The proof relies on Theorem D.10. By assumption, limₙ→∞ E[|xₙ − c|^r] = 0, so for some n sufficiently large, E[|xₙ − c|^r] < ∞. By Theorem D.10, then, Prob(|xₙ − c| > ε) ≤ E[|xₙ − c|^r]/ε^r for any ε > 0. The denominator of the fraction is a fixed constant and the numerator converges to zero by our initial assumption, so limₙ→∞ Prob(|xₙ − c| > ε) = 0, which completes the proof.

One implication of Theorem D.11 is that although convergence in mean square is a convenient way to prove convergence in probability, it is actually stronger than necessary, as we get the same result for any positive r. Finally, we note that we have now shown that both almost sure convergence and convergence in rth mean are stronger than convergence in probability; each implies the latter. But they, themselves, are different notions of convergence, and neither implies the other.

DEFINITION D.5 Convergence of a Random Vector or Matrix Let xn denote a random vector and Xn a random matrix, and c and C denote a vector and matrix of constants with the same dimensions as xn and Xn, respectively. All of the preceding notions of convergence can be extended to (xn, c) and (Xn, C) by applying the results to the respective corresponding elements.

D.2.3 CONVERGENCE OF FUNCTIONS

A particularly convenient result is the following.

THEOREM D.12 Slutsky Theorem For a continuous function g(xn) that is not a function of n,

plim g(xn) = g(plim xn). (D-6)
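A Monte Carlo sketch of the theorem, added here with an exponential population and g(x) = 1/x as illustrative choices: since plim x̄ₙ = μ, the theorem gives plim 1/x̄ₙ = 1/μ, even though E[1/x̄ₙ] ≠ 1/μ in any finite sample.

```python
import numpy as np

# Sketch of the Slutsky theorem with g(x) = 1/x and exponential data, mean mu = 2.
rng = np.random.default_rng(6)
mu = 2.0
for n in (10, 1_000, 100_000):
    xbar = rng.exponential(scale=mu, size=n).mean()
    print(n, 1 / xbar, 1 / mu)   # 1/xbar converges to 0.5
```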

The generalization of Theorem D.12 to a function of several random variables is direct, as illustrated in the next example. Greene-50558 book June 25, 2007 12:52


Example D.3 Probability Limit of a Function of x̄ and s²
In random sampling from a population with mean μ and variance σ², the exact expected value of x̄ₙ²/sₙ² will be difficult, if not impossible, to derive. But, by the Slutsky theorem,

$$\mathrm{plim}\; \frac{\bar{x}_n^2}{s_n^2} = \frac{\mu^2}{\sigma^2}.$$

An application that highlights the difference between expectation and probability limit is suggested by the following useful relationships.

THEOREM D.13 Inequalities for Expectations
Jensen’s Inequality. If g(xₙ) is a concave function of xₙ, then g(E[xₙ]) ≥ E[g(xₙ)].
Cauchy–Schwarz Inequality. For two random variables,

$$E[|xy|] \le \left(E[x^2]\right)^{1/2} \left(E[y^2]\right)^{1/2}.$$

Although the expected value of a function of xn may not equal the function of the expected value—it exceeds it if the function is concave—the probability limit of the function is equal to the function of the probability limit. The Slutsky theorem highlights a comparison between the expectation of a random variable and its probability limit. Theorem D.12 extends directly in two important directions. First, though stated in terms of convergence in probability, the same set of results applies to convergence in rth mean and almost sure convergence. Second, so long as the functions are continuous, the Slutsky theorem can be extended to vector or matrix valued functions of random scalars, vectors, or matrices. The following describe some specific applications. Some implications of the Slutsky theorem are now summarized.

THEOREM D.14 Rules for Probability Limits
If xₙ and yₙ are random variables with plim xₙ = c and plim yₙ = d, then

plim(xₙ + yₙ) = c + d, (sum rule) (D-7)

plim xₙyₙ = cd, (product rule) (D-8)

plim xₙ/yₙ = c/d if d ≠ 0. (ratio rule) (D-9)

If Wₙ is a matrix whose elements are random variables and if plim Wₙ = Ω, then

plim Wₙ⁻¹ = Ω⁻¹. (matrix inverse rule) (D-10)

If Xₙ and Yₙ are random matrices with plim Xₙ = A and plim Yₙ = B, then

plim XₙYₙ = AB. (matrix product rule) (D-11)

D.2.4 CONVERGENCE TO A RANDOM VARIABLE

The preceding has dealt with conditions under which a random variable converges to a constant, for example, the way that a sample mean converges to the population mean. To develop a theory


for the behavior of estimators, as a prelude to the discussion of limiting distributions, we now consider cases in which a random variable converges not to a constant, but to another random variable. These results will actually subsume those in the preceding section, as a constant may always be viewed as a degenerate random variable, that is, one with zero variance.

DEFINITION D.6 Convergence in Probability to a Random Variable
The random variable xₙ converges in probability to the random variable x if limₙ→∞ Prob(|xₙ − x| > ε) = 0 for any positive ε.

As before, we write plim xn = x to denote this case. The interpretation (at least the intuition) of this type of convergence is different when x is a random variable. The notion of closeness defined here relates not to the concentration of the mass of the probability mechanism generating xn at a point c, but to the closeness of that probability mechanism to that of x. One can think of this as a convergence of the CDF of xn to that of x.

DEFINITION D.7 Almost Sure Convergence to a Random Variable
The random variable xₙ converges almost surely to the random variable x if and only if limₙ→∞ Prob(|xᵢ − x| > ε for all i ≥ n) = 0 for all ε > 0.

DEFINITION D.8 Convergence in rth Mean to a Random Variable
The random variable xₙ converges in rth mean to the random variable x if and only if limₙ→∞ E[|xₙ − x|^r] = 0. This is labeled xₙ →r.m. x. As before, the case r = 2 is labeled convergence in mean square.

Once again, we have to revise our understanding of convergence when convergence is to a random variable.

THEOREM D.15 Convergence of Moments
Suppose xₙ →r.m. x and E[|x|^r] is finite. Then, limₙ→∞ E[|xₙ|^r] = E[|x|^r].

Theorem D.15 raises an interesting question. Suppose we let r grow, and suppose that xₙ →r.m. x and, in addition, all moments are finite. If this holds for any r, do we conclude that these random variables have the same distribution? The answer to this longstanding problem in probability theory—the problem of the sequence of moments—is no. The sequence of moments does not uniquely determine the distribution. Although convergence in rth mean and almost surely still both imply convergence in probability, it remains true, even with convergence to a random variable instead of a constant, that these are different forms of convergence.


D.2.5 CONVERGENCE IN DISTRIBUTION: LIMITING DISTRIBUTIONS

A second form of convergence is convergence in distribution. Let xn be a sequence of random variables indexed by the sample size, and assume that xn has cdf Fn(xn).

DEFINITION D.9 Convergence in Distribution
xₙ converges in distribution to a random variable x with cdf F(x) if limₙ→∞ |Fₙ(xₙ) − F(x)| = 0 at all continuity points of F(x).

This statement is about the probability distribution associated with xₙ; it does not imply that xₙ converges at all. To take a trivial example, suppose that the exact distribution of the random variable xₙ is

$$\mathrm{Prob}(x_n = 1) = \frac{1}{2} + \frac{1}{n+1}, \qquad \mathrm{Prob}(x_n = 2) = \frac{1}{2} - \frac{1}{n+1}.$$

As n increases without bound, the two probabilities converge to 1/2, but xₙ does not converge to a constant.

DEFINITION D.10 Limiting Distribution If xn converges in distribution to x, where Fn(xn) is the cdf of xn, then F(x) is the limiting distribution of xn. This is written

xₙ →d x.

The limiting distribution is often given in terms of the pdf, or simply the parametric family. For example, “the limiting distribution of xₙ is standard normal.” Convergence in distribution can be extended to random vectors and matrices, although not in the element-by-element manner that we extended the earlier convergence forms. The reason is that convergence in distribution is a property of the CDF of the random variable, not the variable itself. Thus, we can obtain a convergence result analogous to that in Definition D.9 for vectors or matrices by applying the definition to the joint CDF for the elements of the vector or matrices. Thus, xₙ →d x if limₙ→∞ |Fₙ(xₙ) − F(x)| = 0, and likewise for a random matrix.

Example D.4 Limiting Distribution of tn−1 Consider a sample of size n from a standard normal distribution. A familiar inference problem is the test of the hypothesis that the population mean is zero. The test statistic usually used is the t statistic:

$$t_{n-1} = \frac{\bar{x}_n}{s_n/\sqrt{n}},$$

where

$$s_n^2 = \frac{\sum_{i=1}^n (x_i - \bar{x}_n)^2}{n-1}.$$


The exact distribution of the random variable tn−1 is t with n − 1 degrees of freedom. The density is different for every n:

$$f(t_{n-1}) = \frac{\Gamma(n/2)}{\Gamma((n-1)/2)}\, [(n-1)\pi]^{-1/2} \left[1 + \frac{t_{n-1}^2}{n-1}\right]^{-n/2}, \tag{D-12}$$

as is the cdf, Fₙ₋₁(t) = ∫₋∞ᵗ fₙ₋₁(x) dx. This distribution has mean zero and variance (n − 1)/(n − 3). As n grows to infinity, tₙ₋₁ converges to the standard normal, which is written

tₙ₋₁ →d N[0, 1].

DEFINITION D.11 Limiting Mean and Variance The limiting mean and variance of a random variable are the mean and variance of the limiting distribution, assuming that the limiting distribution and its moments exist.

For the random variable with t[n] distribution, the exact mean and variance are zero and n/(n − 2), whereas the limiting mean and variance are zero and one. The example might suggest that the limiting mean and variance are zero and one; that is, that the moments of the limiting distribution are the ordinary limits of the moments of the finite sample distributions. This situation is almost always true, but it need not be. It is possible to construct examples in which the exact moments do not even exist, even though the moments of the limiting distribution are well defined.3 Even in such cases, we can usually derive the mean and variance of the limiting distribution. Limiting distributions, like probability limits, can greatly simplify the analysis of a problem. Some results that combine the two concepts are as follows.4

THEOREM D.16 Rules for Limiting Distributions
1. If xₙ →d x and plim yₙ = c, then

xₙyₙ →d cx, (D-13)

which means that the limiting distribution of xₙyₙ is the distribution of cx. Also,

xₙ + yₙ →d x + c, (D-14)

xₙ/yₙ →d x/c, if c ≠ 0. (D-15)

2. If xₙ →d x and g(xₙ) is a continuous function, then

g(xₙ) →d g(x). (D-16)

This result is analogous to the Slutsky theorem for probability limits. For an example, consider the tₙ random variable discussed earlier. The exact distribution of tₙ² is F[1, n]. But as n → ∞, tₙ converges to a standard normal variable. According to this result, the limiting distribution of tₙ² will be that of the square of a standard normal, which is chi-squared with one

³See, for example, Maddala (1977a, p. 150).
⁴For proofs and further discussion, see, for example, Greenberg and Webster (1983).


THEOREM D.16 (Continued)
degree of freedom. We conclude, therefore, that

F[1, n] →d chi-squared[1]. (D-17)

We encountered this result in our earlier discussion of limiting forms of the standard normal family of distributions.
3. If yₙ has a limiting distribution and plim(xₙ − yₙ) = 0, then xₙ has the same limiting distribution as yₙ.

The third result in Theorem D.16 combines convergence in distribution and in probability. The second result can be extended to vectors and matrices.

Example D.5 The F Distribution
Suppose that t₁,ₙ and t₂,ₙ are a K × 1 and an M × 1 random vector of variables whose components are independent, with each distributed as t with n degrees of freedom. Then, as we saw in the preceding, for any component in either random vector, the limiting distribution is standard normal, so for the entire vector, tⱼ,ₙ →d zⱼ, a vector of independent standard normally distributed variables. The results so far show that

$$\frac{(t_{1,n}'t_{1,n})/K}{(t_{2,n}'t_{2,n})/M} \xrightarrow{d} F[K, M].$$

Finally, a specific case of result 2 in Theorem D.16 produces a tool known as the Cramér–Wold device.

THEOREM D.17 Cramér–Wold Device
If xₙ →d x, then c′xₙ →d c′x for all conformable vectors c with real valued elements.

By allowing c to be a vector with just a one in a particular position and zeros elsewhere, we see that convergence in distribution of a random vector xn to x does imply that each component does likewise.

D.2.6 CENTRAL LIMIT THEOREMS

We are ultimately interested in finding a way to describe the statistical properties of estimators when their exact distributions are unknown. The concepts of consistency and convergence in probability are important. But the theory of limiting distributions given earlier is not yet adequate. We rarely deal with estimators that are not consistent for something, though perhaps not always the parameter we are trying to estimate. As such,

if plim θ̂ₙ = θ, then θ̂ₙ →d θ.

That is, the limiting distribution of θ̂ₙ is a spike. This is not very informative, nor is it at all what we have in mind when we speak of the statistical properties of an estimator. (To endow our finite sample estimator θ̂ₙ with the zero sampling variance of the spike at θ would be optimistic in the extreme.) As an intermediate step, then, to a more reasonable description of the statistical properties of an estimator, we use a stabilizing transformation of the random variable to one that does have


a well-defined limiting distribution. To jump to the most common application, whereas

plim θ̂ₙ = θ,

we often find that

$$z_n = \sqrt{n}(\hat\theta_n - \theta) \xrightarrow{d} f(z),$$

where f (z) is a well-defined distribution with a mean and a positive variance. An estimator which has this property is said to be root-n consistent. The single most important theorem in econometrics provides an application of this proposition. A basic form of the theorem is as follows.

THEOREM D.18 Lindeberg–Levy Central Limit Theorem (Univariate)
If x₁, ..., xₙ are a random sample from a probability distribution with finite mean μ and finite variance σ², and x̄ₙ = (1/n)Σᵢ₌₁ⁿ xᵢ, then

$$\sqrt{n}(\bar{x}_n - \mu) \xrightarrow{d} N[0, \sigma^2].$$

A proof appears in Rao (1973, p. 127).

The result is quite remarkable as it holds regardless of the form of the parent distribution. For a striking example, return to Figure C.2. The distribution from which the data were drawn in that figure does not even remotely resemble a normal distribution. In samples of only four observations the force of the central limit theorem is clearly visible in the sampling distribution of the means. The sampling experiment in Example D.6 shows the effect in a systematic demonstration of the result.
The Lindeberg–Levy theorem is one of several forms of this extremely powerful result. For our purposes, an important extension allows us to relax the assumption of equal variances. The Lindeberg–Feller form of the central limit theorem is the centerpiece of most of our analysis in econometrics.

THEOREM D.19 Lindeberg–Feller Central Limit Theorem (with Unequal Variances)
Suppose that {xᵢ}, i = 1, ..., n, is a sequence of independent random variables with finite means μᵢ and finite positive variances σᵢ². Let

$$\bar\mu_n = \frac{1}{n}(\mu_1 + \mu_2 + \cdots + \mu_n), \quad\text{and}\quad \bar\sigma_n^2 = \frac{1}{n}\left(\sigma_1^2 + \sigma_2^2 + \cdots + \sigma_n^2\right).$$

If no single term dominates this average variance, which we could state as limₙ→∞ max(σᵢ)/(n σ̄ₙ) = 0, and if the average variance converges to a finite constant, limₙ→∞ σ̄ₙ² = σ̄², then

$$\sqrt{n}(\bar{x}_n - \bar\mu_n) \xrightarrow{d} N[0, \bar\sigma^2].$$


[Figure D.2: The Exponential Distribution. Density of the exponential distribution with mean 1.5.]

In practical terms, the theorem states that sums of random variables, regardless of their form, will tend to be normally distributed. The result is yet more remarkable in that it does not require the variables in the sum to come from the same underlying distribution. It requires, essentially, only that the mean be a mixture of many random variables, none of which is large compared with their sum. Because nearly all the estimators we construct in econometrics fall under the purview of the central limit theorem, it is obviously an important result.

Example D.6 The Lindeberg–Levy Central Limit Theorem
We’ll use a sampling experiment to demonstrate the operation of the central limit theorem. Consider random sampling from the exponential distribution with mean 1.5—this is the setting used in Example C.4. The density is shown in Figure D.2. We’ve drawn 1,000 samples of 3, 6, and 20 observations from this population and computed the sample means for each. For each mean, we then computed zᵢₙ = √n(x̄ᵢₙ − μ), where i = 1, ..., 1,000 and n is 3, 6, or 20. The three rows of figures in Figure D.3 show histograms of the observed samples of sample means and kernel density estimates of the underlying distributions for the three samples of transformed means.
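The experiment is simple to replicate; the sketch below, an added illustration in which only the seed is an arbitrary choice, draws the same 1,000 samples of 3, 6, and 20 observations and forms the transformed means. The variance of zᵢₙ should be near σ² = 1.5² = 2.25 for every n, while the skewness visible in the histograms fades as n grows.

```python
import numpy as np

# Sketch of the sampling experiment in Example D.6.
rng = np.random.default_rng(7)
mu = 1.5                       # exponential mean; the variance is mu^2 = 2.25
for n in (3, 6, 20):
    xbar = rng.exponential(scale=mu, size=(1000, n)).mean(axis=1)
    z = np.sqrt(n) * (xbar - mu)
    print(n, z.mean(), z.var())   # mean near 0, variance near 2.25
```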

Proof of the Lindeberg–Feller theorem requires some quite intricate mathematics [see, e.g., Loeve (1977)] that are well beyond the scope of our work here. We do note an important consideration in this theorem. The result rests on a condition known as the Lindeberg condition. The sample mean computed in the theorem is a mixture of random variables from possibly different distributions. The Lindeberg condition, in words, states that the contribution of the tail areas of these underlying distributions to the variance of the sum must be negligible in the limit. The condition formalizes the assumption in Theorem D.19 that the average variance be positive and not be dominated by any single term. [For an intuitively crafted mathematical discussion of this condition, see White (2001, pp. 117–118).] The condition is essentially impossible to verify in practice, so it is useful to have a simpler version of the theorem that encompasses it.

[Figure D.3: The Central Limit Theorem. Histograms and kernel density estimates for the transformed means z3, z6, and z20.]


THEOREM D.20 Liapounov Central Limit Theorem
Suppose that $\{x_i\}$ is a sequence of independent random variables with finite means $\mu_i$ and finite positive variances $\sigma_i^2$ such that $E[|x_i - \mu_i|^{2+\delta}]$ is finite for some $\delta > 0$. If $\bar{\sigma}_n$ is positive and finite for all $n$ sufficiently large, then
$$\sqrt{n}(\bar{x}_n - \bar{\mu}_n)/\bar{\sigma}_n \xrightarrow{d} N[0, 1].$$

This version of the central limit theorem requires only that moments slightly larger than two be finite. Note the distinction between the laws of large numbers in Theorems D.5 and D.6 and the central limit theorems. Neither law of large numbers asserts that sample means tend to normality. Sample means (i.e., their distributions) converge to spikes at the true mean. It is the transformation of the mean, $\sqrt{n}(\bar{x}_n - \mu)/\sigma$, that converges to standard normality. To see this at work, if you have access to the necessary software, you might try reproducing Example D.6 using the raw means, $\bar{x}_{in}$. What do you expect to observe? For later purposes, we will require multivariate versions of these theorems. Proofs of the following may be found, for example, in Greenberg and Webster (1983) or Rao (1973) and references cited there.

THEOREM D.18A Multivariate Lindeberg–Levy Central Limit Theorem
If $x_1, \ldots, x_n$ are a random sample from a multivariate distribution with finite mean vector $\mu$ and finite positive definite covariance matrix $Q$, then
$$\sqrt{n}(\bar{x}_n - \mu) \xrightarrow{d} N[0, Q], \quad \text{where} \quad \bar{x}_n = \frac{1}{n}\sum_{i=1}^{n} x_i.$$
To get from D.18 to D.18A (and D.19 to D.19A) we need to add a step. Theorem D.18 applies to the individual elements of the vector. A vector has a multivariate normal distribution if the individual elements are normally distributed and if every linear combination is normally distributed. We can use Theorem D.18 (D.19) for the individual terms and Theorem D.17 to establish that linear combinations behave likewise. This establishes the extensions.

The extension of the Lindeberg–Feller theorem to unequal covariance matrices requires some intricate mathematics. The following is an informal statement of the relevant conditions. Further discussion and references appear in Fomby, Hill, and Johnson (1984) and Greenberg and Webster (1983).


THEOREM D.19A Multivariate Lindeberg–Feller Central Limit Theorem
Suppose that $x_1, \ldots, x_n$ are a sample of random vectors such that $E[x_i] = \mu_i$, $\mathrm{Var}[x_i] = Q_i$, and all mixed third moments of the multivariate distribution are finite. Let
$$\bar{\mu}_n = \frac{1}{n}\sum_{i=1}^{n} \mu_i, \qquad \bar{Q}_n = \frac{1}{n}\sum_{i=1}^{n} Q_i.$$
We assume that
$$\lim_{n\to\infty} \bar{Q}_n = Q,$$
where $Q$ is a finite, positive definite matrix, and that for every $i$,
$$\lim_{n\to\infty} (n\bar{Q}_n)^{-1} Q_i = \lim_{n\to\infty} \Big(\sum_{i=1}^{n} Q_i\Big)^{-1} Q_i = 0.$$
We allow the means of the random vectors to differ, although in the cases that we will analyze, they will generally be identical. The second assumption states that individual components of the sum must be finite and diminish in significance. There is also an implicit assumption that the sum of matrices is nonsingular. Because the limiting matrix is nonsingular, the assumption must hold for large enough $n$, which is all that concerns us here. With these in place, the result is
$$\sqrt{n}(\bar{x}_n - \bar{\mu}_n) \xrightarrow{d} N[0, Q].$$

D.2.7 THE DELTA METHOD

At several points in Appendix C, we used a linear Taylor series approximation to analyze the distribution and moments of a random variable. We are now able to justify this usage. We complete the development of Theorem D.12 (probability limit of a function of a random variable), Theorem D.16 (2) (limiting distribution of a function of a random variable), and the central limit theorems, with a useful result that is known as the delta method. For a single random variable (sample mean or otherwise), we have the following theorem.

THEOREM D.21 Limiting Normal Distribution of a Function
If $\sqrt{n}(z_n - \mu) \xrightarrow{d} N[0, \sigma^2]$ and if $g(z_n)$ is a continuous function not involving $n$, then
$$\sqrt{n}[g(z_n) - g(\mu)] \xrightarrow{d} N[0, \{g'(\mu)\}^2 \sigma^2]. \tag{D-18}$$


Notice that the mean and variance of the limiting distribution are the mean and variance of the linear Taylor series approximation:

$$g(z_n) \approx g(\mu) + g'(\mu)(z_n - \mu).$$

The multivariate version of this theorem will be used at many points in the text.

THEOREM D.21A Limiting Normal Distribution of a Set of Functions
If $z_n$ is a $K \times 1$ sequence of vector-valued random variables such that $\sqrt{n}(z_n - \mu) \xrightarrow{d} N[0, \Sigma]$ and if $c(z_n)$ is a set of $J$ continuous functions of $z_n$ not involving $n$, then
$$\sqrt{n}[c(z_n) - c(\mu)] \xrightarrow{d} N[0, C(\mu)\,\Sigma\,C(\mu)'], \tag{D-19}$$
where $C(\mu)$ is the $J \times K$ matrix $\partial c(\mu)/\partial \mu'$. The $j$th row of $C(\mu)$ is the vector of partial derivatives of the $j$th function with respect to $\mu'$.

D.3 ASYMPTOTIC DISTRIBUTIONS

The theory of limiting distributions is only a means to an end. We are interested in the behavior of the estimators themselves. The limiting distributions obtained through the central limit theorem all involve unknown parameters, generally the ones we are trying to estimate. Moreover, our samples are always finite. Thus, we depart from the limiting distributions to derive the asymptotic distributions of the estimators.

DEFINITION D.12 Asymptotic Distribution An asymptotic distribution is a distribution that is used to approximate the true finite sample distribution of a random variable.5

By far the most common means of formulating an asymptotic distribution (at least by econometricians) is to construct it from the known limiting distribution of a function of the random variable. If

$$\sqrt{n}[(\bar{x}_n - \mu)/\sigma] \xrightarrow{d} N[0, 1],$$

5We depart somewhat from some other treatments [e.g., White (2001), Hayashi (2000, p. 90)] at this point, because they make no distinction between an asymptotic distribution and the limiting distribution, although the treatments are largely along the lines discussed here. In the interest of maintaining consistency of the discussion, we prefer to retain the sharp distinction and derive the asymptotic distribution of an estimator, $t$, by first obtaining the limiting distribution of $\sqrt{n}(t - \theta)$. By our construction, the limiting distribution of $t$ is degenerate, whereas the asymptotic distribution of $\sqrt{n}(t - \theta)$ is not useful.


FIGURE D.4 True Versus Asymptotic Distribution (exact and asymptotic densities of $\bar{x}_n$).

then approximately, or asymptotically, $\bar{x}_n \sim N[\mu, \sigma^2/n]$, which we write as
$$\bar{x}_n \overset{a}{\sim} N[\mu, \sigma^2/n].$$
The statement "$\bar{x}_n$ is asymptotically normally distributed with mean $\mu$ and variance $\sigma^2/n$" says only that this normal distribution provides an approximation to the true distribution, not that the true distribution is exactly normal.

Example D.7 Asymptotic Distribution of the Mean of an Exponential Sample
In sampling from an exponential distribution with parameter $\theta$, the exact distribution of $\bar{x}_n$ is that of $\theta/(2n)$ times a chi-squared variable with $2n$ degrees of freedom. The asymptotic distribution is $N[\theta, \theta^2/n]$. The exact and asymptotic distributions are shown in Figure D.4 for the case of $\theta = 1$ and $n = 16$.
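The two curves in Figure D.4 are easy to recompute. A minimal sketch using scipy.stats (the grid of x values is an arbitrary choice, not from the text):

```python
import numpy as np
from scipy import stats

theta, n = 1.0, 16
x = np.linspace(0.3, 2.0, 200)       # arbitrary grid over the support

# Exact density: xbar = (theta/(2n)) * chi-squared(2n), a scaled chi-squared
exact = stats.chi2.pdf(x * 2 * n / theta, df=2 * n) * (2 * n / theta)
# Asymptotic approximation: N[theta, theta^2/n]
asym = stats.norm.pdf(x, loc=theta, scale=theta / np.sqrt(n))

print(np.max(np.abs(exact - asym)))  # largest gap between the two curves
```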

Extending the definition, suppose that $\hat{\theta}_n$ is an estimator of the parameter vector $\theta$. The asymptotic distribution of the vector $\hat{\theta}_n$ is obtained from the limiting distribution:

$$\sqrt{n}(\hat{\theta}_n - \theta) \xrightarrow{d} N[0, V] \tag{D-20}$$
implies that
$$\hat{\theta}_n \overset{a}{\sim} N\Big[\theta, \frac{1}{n}V\Big]. \tag{D-21}$$

This notation is read "$\hat{\theta}_n$ is asymptotically normally distributed, with mean vector $\theta$ and covariance matrix $(1/n)V$." The covariance matrix of the asymptotic distribution is the asymptotic covariance matrix and is denoted
$$\text{Asy. Var}[\hat{\theta}_n] = \frac{1}{n}V.$$


Note, once again, the logic used to reach the result; (D-20) holds exactly as n →∞. We assume that it holds approximately for finite n, which leads to (D-21).

DEFINITION D.13 Asymptotic Normality and Asymptotic Efficiency
An estimator $\hat{\theta}_n$ is asymptotically normal if (D-20) holds. The estimator is asymptotically efficient if the covariance matrix of any other consistent, asymptotically normally distributed estimator exceeds $(1/n)V$ by a nonnegative definite matrix.

For most estimation problems, these are the criteria used to choose an estimator.

Example D.8 Asymptotic Inefficiency of the Median in Normal Sampling
In sampling from a normal distribution with mean $\mu$ and variance $\sigma^2$, both the mean $\bar{x}_n$ and the median $M_n$ of the sample are consistent estimators of $\mu$. The limiting distributions of both estimators are spikes at $\mu$, so they can only be compared on the basis of their asymptotic properties. The necessary results are

$$\bar{x}_n \overset{a}{\sim} N[\mu, \sigma^2/n], \quad \text{and} \quad M_n \overset{a}{\sim} N[\mu, (\pi/2)\sigma^2/n]. \tag{D-22}$$

Therefore, the mean is more efficient by a factor of π/2. (But, see Example 17.4 for a finite sample result.)
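The $\pi/2$ factor in (D-22) is easy to verify by simulation. A minimal sketch, with arbitrary sample size and replication count:

```python
import numpy as np

rng = np.random.default_rng(42)      # arbitrary seed
n, reps = 100, 10_000                # arbitrary design
x = rng.standard_normal((reps, n))

var_mean = x.mean(axis=1).var()
var_med = np.median(x, axis=1).var()
print(n * var_mean, n * var_med)     # roughly 1 and pi/2 = 1.5708
print(var_med / var_mean)            # ratio near pi/2
```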

D.3.1 ASYMPTOTIC DISTRIBUTION OF A NONLINEAR FUNCTION

Theorems D.12 and D.14 for functions of a random variable have counterparts in asymptotic distributions.

THEOREM D.22 Asymptotic Distribution of a Nonlinear Function
If $\sqrt{n}(\hat{\theta}_n - \theta) \xrightarrow{d} N[0, \sigma^2]$ and if $g(\theta)$ is a continuous function not involving $n$, then $g(\hat{\theta}_n) \overset{a}{\sim} N[g(\theta), (1/n)\{g'(\theta)\}^2\sigma^2]$. If $\hat{\theta}_n$ is a vector of parameter estimators such that $\hat{\theta}_n \overset{a}{\sim} N[\theta, (1/n)V]$ and if $c(\theta)$ is a set of $J$ continuous functions not involving $n$, then $c(\hat{\theta}_n) \overset{a}{\sim} N[c(\theta), (1/n)C(\theta)VC(\theta)']$, where $C(\theta) = \partial c(\theta)/\partial \theta'$.

Example D.9 Asymptotic Distribution of a Function of Two Estimators
Suppose that $b_n$ and $t_n$ are estimators of parameters $\beta$ and $\theta$ such that
$$\begin{pmatrix} b_n \\ t_n \end{pmatrix} \overset{a}{\sim} N\left[\begin{pmatrix} \beta \\ \theta \end{pmatrix}, \begin{pmatrix} \sigma_{\beta\beta} & \sigma_{\beta\theta} \\ \sigma_{\theta\beta} & \sigma_{\theta\theta} \end{pmatrix}\right].$$

Find the asymptotic distribution of $c_n = b_n/(1 - t_n)$. Let $\gamma = \beta/(1 - \theta)$. By the Slutsky theorem, $c_n$ is consistent for $\gamma$. We shall require

$$\frac{\partial \gamma}{\partial \beta} = \frac{1}{1 - \theta} = \gamma_\beta, \qquad \frac{\partial \gamma}{\partial \theta} = \frac{\beta}{(1 - \theta)^2} = \gamma_\theta.$$


Let $\Sigma$ be the $2 \times 2$ asymptotic covariance matrix given previously. Then the asymptotic variance of $c_n$ is
$$\text{Asy. Var}[c_n] = (\gamma_\beta \ \ \gamma_\theta)\,\Sigma \begin{pmatrix} \gamma_\beta \\ \gamma_\theta \end{pmatrix} = \gamma_\beta^2 \sigma_{\beta\beta} + \gamma_\theta^2 \sigma_{\theta\theta} + 2\gamma_\beta\gamma_\theta\sigma_{\beta\theta},$$

which is the variance of the linear Taylor series approximation:

$$\hat{\gamma}_n \approx \gamma + \gamma_\beta(b_n - \beta) + \gamma_\theta(t_n - \theta).$$
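The variance formula is straightforward to evaluate numerically. A sketch with hypothetical values for $\beta$, $\theta$, and $\Sigma$ (none of these numbers come from the example):

```python
import numpy as np

# Hypothetical values, not from the text
beta, theta = 1.0, 0.5
Sigma = np.array([[0.04, 0.01],        # [[s_bb, s_bt],
                  [0.01, 0.02]])       #  [s_tb, s_tt]]

gamma_beta = 1.0 / (1.0 - theta)       # dgamma/dbeta
gamma_theta = beta / (1.0 - theta)**2  # dgamma/dtheta
grad = np.array([gamma_beta, gamma_theta])

print(grad @ Sigma @ grad)             # Asy. Var[c_n] by the delta method
```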

D.3.2 ASYMPTOTIC EXPECTATIONS

The asymptotic mean and variance of a random variable are usually the mean and variance of the asymptotic distribution. Thus, for an estimator with the limiting distribution defined in

$$\sqrt{n}(\hat{\theta}_n - \theta) \xrightarrow{d} N[0, V],$$

the asymptotic expectation is $\theta$ and the asymptotic variance is $(1/n)V$. This statement implies, among other things, that the estimator is "asymptotically unbiased."
At the risk of clouding the issue a bit, it is necessary to reconsider one aspect of the previous description. We have deliberately avoided the use of consistency even though, in most instances, that is what we have in mind. The description thus far might suggest that consistency and asymptotic unbiasedness are the same. Unfortunately (because it is a source of some confusion), they are not. They are if the estimator is consistent and asymptotically normally distributed, or CAN. They may differ in other settings, however. There are at least three possible definitions of asymptotic unbiasedness:
1. The mean of the limiting distribution of $\sqrt{n}(\hat{\theta}_n - \theta)$ is 0.
2. $\lim_{n\to\infty} E[\hat{\theta}_n] = \theta$. (D-23)
3. $\text{plim}\,\hat{\theta}_n = \theta$.

In most cases encountered in practice, the estimator in hand will have all three properties, so there is no ambiguity. It is not difficult to construct cases in which the left-hand sides of all three definitions are different, however.6 There is no general agreement among authors as to the precise meaning of asymptotic unbiasedness, perhaps because the term is misleading at the outset; asymptotic refers to an approximation, whereas unbiasedness is an exact result.7 Nonetheless, the majority view seems to be that (2) is the proper definition of asymptotic unbiasedness.8 Note, though, that this definition relies on quantities that are generally unknown and that may not exist. A similar problem arises in the definition of the asymptotic variance of an estimator. One common definition is9
$$\text{Asy. Var}[\hat{\theta}_n] = \frac{1}{n}\lim_{n\to\infty} E\Big[\Big(\sqrt{n}\Big(\hat{\theta}_n - \lim_{n\to\infty} E[\hat{\theta}_n]\Big)\Big)^2\Big]. \tag{D-24}$$

6See, for example, Maddala (1977a, p. 150).
7See, for example, Theil (1971, p. 377).
8Many studies of estimators analyze the "asymptotic bias" of, say, $\hat{\theta}_n$ as an estimator of a parameter $\theta$. In most cases, the quantity of interest is actually $\text{plim}[\hat{\theta}_n - \theta]$. See, for example, Greene (1980b) and another example in Johnston (1984, p. 312).
9Kmenta (1986, p. 165).


This result is a leading term approximation, and it will be sufficient for nearly all applications. Note, however, that like definition 2 of asymptotic unbiasedness, it relies on unknown and possibly nonexistent quantities.

Example D.10 Asymptotic Moments of the Sample Variance The exact expected value and variance of the variance estimator

$$m_2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2 \tag{D-25}$$
are
$$E[m_2] = \frac{(n-1)\sigma^2}{n}, \tag{D-26}$$
and
$$\text{Var}[m_2] = \frac{\mu_4 - \sigma^4}{n} - \frac{2(\mu_4 - 2\sigma^4)}{n^2} + \frac{\mu_4 - 3\sigma^4}{n^3}, \tag{D-27}$$
where $\mu_4 = E[(x - \mu)^4]$. [See Goldberger (1964, pp. 97–99).] The leading term approximation would be
$$\text{Asy. Var}[m_2] = \frac{1}{n}(\mu_4 - \sigma^4).$$
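A simulation check of these moments for normal data, where $\mu_4 = 3\sigma^4$ so that the leading term is $2\sigma^4/n$ (the sample size and replication count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)       # arbitrary seed
n, reps, sigma = 50, 20_000, 1.0
x = rng.normal(0.0, sigma, size=(reps, n))

m2 = x.var(axis=1)                   # ddof=0: (1/n) * sum (x_i - xbar)^2
print(m2.mean(), (n - 1) / n * sigma**2)   # exact mean, (D-26)
# For normal data mu4 = 3*sigma^4, so (mu4 - sigma^4)/n = 2*sigma^4/n
print(m2.var(), 2 * sigma**4 / n)          # leading term approximation
```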

D.4 SEQUENCES AND THE ORDER OF A SEQUENCE

This section has been concerned with sequences of constants, denoted, for example, $c_n$, and random variables, such as $x_n$, that are indexed by a sample size, $n$. An important characteristic of a sequence is the rate at which it converges (or diverges). For example, as we have seen, the mean of a random sample of $n$ observations from a distribution with finite mean, $\mu$, and finite variance, $\sigma^2$, is itself a random variable with variance $\gamma_n^2 = \sigma^2/n$. We see that as long as $\sigma^2$ is a finite constant, $\gamma_n^2$ is a sequence of constants that converges to zero. Another example is the random variable $x_{(1),n}$, the minimum value in a random sample of $n$ observations from the exponential distribution with mean $1/\theta$ defined in Example C.4. It turns out that $x_{(1),n}$ has variance $1/(n\theta)^2$. Clearly, this variance also converges to zero, but, intuition suggests, faster than $\sigma^2/n$ does. On the other hand, the sum of the integers from one to $n$, $S_n = n(n+1)/2$, obviously diverges as $n \to \infty$, albeit faster (one might expect) than the log of the likelihood function for the exponential distribution in Example C.6, which is $\ln L(\theta) = n(\ln \theta - \theta \bar{x}_n)$. As a final example, consider the downward bias of the maximum likelihood estimator of the variance of the normal distribution, $c_n = (n-1)/n$, which is a constant that converges to one. (See Example C.5.) We will define the rate at which a sequence converges or diverges in terms of the order of the sequence.

DEFINITION D.14 Order $n^\delta$
A sequence $c_n$ is of order $n^\delta$, denoted $O(n^\delta)$, if and only if $\text{plim}(1/n^\delta)c_n$ is a finite nonzero constant.


DEFINITION D.15 Order less than $n^\delta$
A sequence $c_n$ is of order less than $n^\delta$, denoted $o(n^\delta)$, if and only if $\text{plim}(1/n^\delta)c_n$ equals zero.

Thus, in our examples, $\gamma_n^2$ is $O(n^{-1})$, $\text{Var}[x_{(1),n}]$ is $O(n^{-2})$ and $o(n^{-1})$, $S_n$ is $O(n^2)$ ($\delta$ equals $+2$ in this case), $\ln L(\theta)$ is $O(n)$ ($\delta$ equals $+1$), and $c_n$ is $O(1)$ ($\delta = 0$). Important particular cases that we will encounter repeatedly in our work are sequences for which $\delta = 1$ or $-1$.
The notion of order of a sequence is often of interest in econometrics in the context of the variance of an estimator. Thus, we see in Section D.3 that an important element of our strategy for forming an asymptotic distribution is that the variance of the limiting distribution of $\sqrt{n}(\bar{x}_n - \mu)/\sigma$ is $O(1)$. In Example D.10 the variance of $m_2$ is the sum of three terms that are $O(n^{-1})$, $O(n^{-2})$, and $O(n^{-3})$. The sum is $O(n^{-1})$, because $n\,\text{Var}[m_2]$ converges to $\mu_4 - \sigma^4$, the numerator of the first, or leading term, whereas the second and third terms converge to zero. This term is also the dominant term of the sequence. Finally, consider the two divergent examples in the preceding list. $S_n$ is simply a deterministic function of $n$ that explodes. However, $\ln L(\theta) = n \ln \theta - \theta \sum_i x_i$ is the sum of a constant that is $O(n)$ and a random variable with variance equal to $n/\theta^2$. The random variable "diverges" in the sense that its variance grows without bound as $n$ increases.

APPENDIX E
COMPUTATION AND OPTIMIZATION

E.1 INTRODUCTION

The computation of empirical estimates by econometricians involves using digital computers and software written either by the researchers themselves or by others.1 It is also a surprisingly balanced mix of art and science. It is important for software users to be aware of how results are obtained, not only to understand routine computations, but also to be able to explain the occasional strange and contradictory results that do arise. This appendix will describe some of the basic elements of computing and a number of tools that are used by econometricians.2

1It is one of the interesting aspects of the development of econometric methodology that the adoption of certain classes of techniques has proceeded in discrete jumps with the development of software. Noteworthy examples include the appearance, both around 1970, of G. K. Joreskog's LISREL [Joreskog and Sorbom (1981)] program, which spawned a still-growing industry in linear structural modeling, and TSP [Hall (1982)], which was among the first computer programs to accept symbolic representations of econometric models and which provided a significant advance in econometric practice with its LSQ procedure for systems of equations. An extensive survey of the evolution of econometric software is given in Renfro (2007).
2This discussion is not intended to teach the reader how to write computer programs. For those who expect to do so, there are whole libraries of useful sources. Three very useful works are Kennedy and Gentle (1980), Abramovitz and Stegun (1971), and especially Press et al. (1986). The third of these provides a wealth of expertly written programs and a large amount of information about how to do computation efficiently and accurately. A recent survey of many areas of computation is Judd (1998).


Section E.2 then describes some techniques for computing certain integrals and derivatives that are recurrent in econometric applications. Section E.3 presents methods of optimization of functions. Some examples are given in Section E.4.

E.2 COMPUTATION IN ECONOMETRICS

This section will discuss some methods of computing integrals that appear frequently in econometrics.

E.2.1 COMPUTING INTEGRALS

One advantage of computers is their ability rapidly to compute approximations to complex functions such as logs and exponents. The basic functions, such as these, trigonometric functions, and so forth, are standard parts of the libraries of programs that accompany all scientific computing installations.3 But one of the very common applications that often requires some high-level creativity by econometricians is the evaluation of integrals that do not have simple closed forms and that do not typically exist in "system libraries." We will consider several of these in this section. We will not go into detail on the nuts and bolts of how to compute integrals with a computer; rather, we will turn directly to the most common applications in econometrics.

E.2.2 THE STANDARD NORMAL CUMULATIVE DISTRIBUTION FUNCTION

The standard normal cumulative distribution function (cdf) is ubiquitous in econometric models. Yet this most homely of applications must be computed by approximation. There are a number of ways to do so.4 Recall that what we desire is
$$\Phi(x) = \int_{-\infty}^{x} \phi(t)\,dt, \quad \text{where} \quad \phi(t) = \frac{1}{\sqrt{2\pi}} e^{-t^2/2}.$$
One way to proceed is to use a Taylor series:

$$\Phi(x) \approx \sum_{i=0}^{M} \frac{1}{i!} \frac{d^i \Phi(x_0)}{dx_0^i}(x - x_0)^i.$$
The normal cdf has some advantages for this approach. First, the derivatives are simple and not integrals. Second, the function is analytic; as $M \to \infty$, the approximation converges to the true value. Third, the derivatives have a simple form; they are the Hermite polynomials and they can be computed by a simple recursion. The 0th term in the preceding expansion is $\Phi(x)$ evaluated at the expansion point. The first derivative of the cdf is the pdf, so the terms from 2 onward are the derivatives of $\phi(x)$, once again evaluated at $x_0$. The derivatives of the standard normal pdf obey the recursion

$$\phi^i/\phi(x) = -x\phi^{i-1}/\phi(x) - (i-1)\phi^{i-2}/\phi(x),$$

where $\phi^i$ is $d^i\phi(x)/dx^i$. The zero and one terms in the sequence are one and $-x$. The next term is $x^2 - 1$, followed by $3x - x^3$ and $x^4 - 6x^2 + 3$, and so on. The approximation can be made

3Of course, at some level, these must have been programmed as approximations by someone.
4Many system libraries provide a related function, the error function, $\text{erf}(x) = (2/\sqrt{\pi})\int_0^x e^{-t^2}\,dt$. If this is available, then the normal cdf can be obtained from $\Phi(x) = \frac{1}{2} + \frac{1}{2}\text{erf}(x/\sqrt{2})$, $x \ge 0$ and $\Phi(x) = 1 - \Phi(-x)$, $x \le 0$.


FIGURE E.1 Approximation to Normal cdf (actual values F and approximate values FA over the range $-2$ to $2$).

more accurate by adding terms. Consider using a fifth-order Taylor series approximation around the point $x = 0$, where $\Phi(0) = 0.5$ and $\phi(0) = 0.3989423$. Evaluating the derivatives at zero and assembling the terms produces the approximation

$$\Phi(x) \approx \frac{1}{2} + 0.3989423\,[x - x^3/6 + x^5/40].$$

[Some of the terms (every other one, in fact) will conveniently drop out.] Figure E.1 shows the actual values (F) and approximate values (FA) over the range −2 to 2. The figure shows two important points. First, the approximation is remarkably good over most of the range. Second, as is usually true for Taylor series approximations, the quality of the approximation deteriorates as one gets far from the expansion point. Unfortunately, it is the tail areas of the standard normal distribution that are usually of interest, so the preceding is likely to be problematic. An alternative approach that is used much more often is a polynomial approximation reported by Abramovitz and Stegun (1971, p. 932):

$$\Phi(-|x|) = \phi(x)\sum_{i=1}^{5} a_i t^i + \varepsilon(x), \quad \text{where} \quad t = 1/[1 + a_0|x|].$$

(The complement is taken if $x$ is positive.) The error of approximation is less than $\pm 7.5 \times 10^{-8}$ for all $x$. (Note that the error exceeds the function value at $|x| > 5.7$, so this is the operational limit of this approximation.)
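A sketch of this polynomial approximation in Python, using the coefficient values commonly tabulated for this formula (if in doubt, check them against Abramovitz and Stegun); math.erf provides the comparison cdf, as in footnote 4:

```python
import math

def norm_cdf(x):
    """Polynomial approximation to the standard normal cdf."""
    a0 = 0.2316419                   # coefficients as commonly tabulated
    a = (0.319381530, -0.356563782, 1.781477937,
         -1.821255978, 1.330274429)
    t = 1.0 / (1.0 + a0 * abs(x))
    phi = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    p = phi * sum(ai * t**i for i, ai in enumerate(a, start=1))
    return p if x < 0 else 1.0 - p   # p approximates Phi(-|x|)

# Compare with the cdf obtained from the error function (footnote 4)
for x in (-3.0, -1.0, 0.0, 0.5, 2.0):
    exact = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    print(x, abs(norm_cdf(x) - exact))   # errors well below 7.5e-8
```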

E.2.3 THE GAMMA AND RELATED FUNCTIONS

The standard normal cdf is probably the most common application of numerical integration of a function in econometrics. Another very common application is the class of gamma functions. For Greene-50558 book June 25, 2007 12:52


positive constant P, the gamma function is

$$\Gamma(P) = \int_0^\infty t^{P-1} e^{-t}\,dt.$$
The gamma function obeys the recursion $\Gamma(P) = (P-1)\Gamma(P-1)$, so for integer values of $P$, $\Gamma(P) = (P-1)!$. This result suggests that the gamma function can be viewed as a generalization of the factorial function for noninteger values. Another convenient value is $\Gamma\big(\tfrac{1}{2}\big) = \sqrt{\pi}$. By making a change of variable, it can be shown that for positive constants $a$, $c$, and $P$,

$$\int_0^\infty t^{P-1} e^{-at^c}\,dt = \int_0^\infty t^{-(P+1)} e^{-at^{-c}}\,dt = \frac{1}{c}\, a^{-P/c}\,\Gamma\Big(\frac{P}{c}\Big). \tag{E-1}$$
As a generalization of the factorial function, the gamma function will usually overflow for the sorts of values of $P$ that normally appear in applications. The log of the function should normally be used instead. The function $\ln \Gamma(P)$ can be approximated remarkably accurately with only a handful of terms and is very easy to program. A number of approximations appear in the literature; they are generally modifications of Stirling's approximation to the factorial function $P! \approx (2\pi P)^{1/2} P^P e^{-P}$, so

$$\ln \Gamma(P) \approx (P - 0.5)\ln P - P + 0.5\ln(2\pi) + C + \varepsilon(P),$$

where C is the correction term [see, e.g., Abramovitz and Stegun (1971, p. 257), Press et al. (1986, p. 157), or Rao (1973, p. 59)] and ε(P) is the approximation error.5 The derivatives of the gamma function are

$$\frac{d^r \Gamma(P)}{dP^r} = \int_0^\infty (\ln t)^r\, t^{P-1} e^{-t}\,dt.$$
The first two derivatives of $\ln \Gamma(P)$ are denoted $\Psi(P) = \Gamma'/\Gamma$ and $\Psi'(P) = (\Gamma\Gamma'' - \Gamma'^2)/\Gamma^2$ and are known as the digamma and trigamma functions.6 The beta function, denoted $\beta(a, b)$,
$$\beta(a, b) = \int_0^1 t^{a-1}(1-t)^{b-1}\,dt = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)},$$
is related.
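In modern libraries these functions are built in or easily approximated. A minimal illustration using Python's math.lgamma, with an ad hoc central-difference approximation to the digamma function:

```python
import math

# Built-in log-gamma avoids the overflow in Gamma(P) for large P
for P in (0.5, 5.0, 30.5, 200.0):
    print(P, math.lgamma(P))

def digamma(P, eps=1e-6):
    """Psi(P) = d ln Gamma(P)/dP by a central difference (ad hoc step)."""
    return (math.lgamma(P + eps) - math.lgamma(P - eps)) / (2.0 * eps)

print(digamma(1.0))   # near -0.5772157, minus Euler's constant
```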

E.2.4 APPROXIMATING INTEGRALS BY QUADRATURE

The digamma and trigamma functions, and the gamma function for noninteger values of $P$ and values that are not integers plus $\frac{1}{2}$, do not exist in closed form and must be approximated. Most other applications will also involve integrals for which no simple computing function exists. The simplest approach to approximating
$$F(x) = \int_{L(x)}^{U(x)} f(t)\,dt$$

5For example, one widely used formula is $C = z^{-1}/12 - z^{-3}/360 + z^{-5}/1260 - z^{-7}/1680 - q$, where $z = P$ and $q = 0$ if $P > 18$, or $z = P + J$ and $q = \ln[P(P+1)(P+2)\cdots(P+J-1)]$, where $J = 18 - \text{INT}(P)$, if not. Note, in the approximation, we write $\Gamma(P) = (P!)/P\,+$ a correction.
6Tables of specific values for the gamma, digamma, and trigamma functions appear in Abramovitz and Stegun (1971). Most contemporary econometric programs have built-in functions for these common integrals, so the tables are not generally needed.


is likely to be a variant of Simpson's rule, or the trapezoid rule. For example, one approximation [see Press et al. (1986, p. 108)] is
$$F(x) \approx \Delta\Big[\tfrac{1}{3} f_1 + \tfrac{4}{3} f_2 + \tfrac{2}{3} f_3 + \tfrac{4}{3} f_4 + \cdots + \tfrac{2}{3} f_{N-2} + \tfrac{4}{3} f_{N-1} + \tfrac{1}{3} f_N\Big],$$

where $f_j$ is the function evaluated at $N$ equally spaced points in $[L, U]$ including the endpoints and $\Delta = (U - L)/(N - 1)$. There are a number of problems with this method, most notably that it is difficult to obtain satisfactory accuracy with a moderate number of points. Gaussian quadrature is a popular method of computing integrals. The general approach is to use an approximation of the form
$$\int_L^U W(x) f(x)\,dx \approx \sum_{j=1}^{M} w_j f(a_j),$$

where $W(x)$ is viewed as a "weighting" function for integrating $f(x)$, $w_j$ is the quadrature weight, and $a_j$ is the quadrature abscissa. Different weights and abscissas have been derived for several weighting functions. Two weighting functions common in econometrics are

$$W(x) = x^c e^{-x}, \quad x \in [0, \infty),$$

for which the computation is called Gauss–Laguerre quadrature, and

$$W(x) = e^{-x^2}, \quad x \in (-\infty, \infty),$$

for which the computation is called Gauss–Hermite quadrature. The theory for deriving weights and abscissas is given in Press et al. (1986, pp. 121–125). Tables of weights and abscissas for many values of M are given by Abramovitz and Stegun (1971). Applications of the technique appear in Section 16.9.6.b and Chapter 23.
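A minimal Gauss–Hermite sketch using the nodes and weights built into numpy; the change of variables for a standard normal expectation and the test function are our own choices, not from the text:

```python
import numpy as np

# Gauss-Hermite: integral of exp(-x^2) f(x) dx ~ sum_j w_j f(a_j).
# For v ~ N(0,1), substituting v = sqrt(2)*x gives
# E[g(v)] = (1/sqrt(pi)) * sum_j w_j g(sqrt(2)*a_j).
nodes, weights = np.polynomial.hermite.hermgauss(20)

g = lambda v: v**4                    # E[v^4] = 3 for a standard normal
approx = weights @ g(np.sqrt(2.0) * nodes) / np.sqrt(np.pi)
print(approx)                         # prints 3.0 (to rounding)
```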

E.3 OPTIMIZATION

Nonlinear optimization (e.g., maximizing log-likelihood functions) is an intriguing practical prob- lem. Theory provides few hard and fast rules, and there are relatively few cases in which it is obvious how to proceed. This section introduces some of the terminology and underlying theory of nonlinear optimization.7 We begin with a general discussion on how to search for a solution to a nonlinear optimization problem and describe some specific commonly used methods. We then consider some practical problems that arise in optimization. An example is given in the final section. Consider maximizing the quadratic function

$$F(\theta) = a + b'\theta - \tfrac{1}{2}\theta' C\theta,$$
where $C$ is a positive definite matrix. The first-order condition for a maximum is
$$\frac{\partial F(\theta)}{\partial \theta} = b - C\theta = 0. \tag{E-2}$$
This set of linear equations has the unique solution

$$\theta = C^{-1}b. \tag{E-3}$$

7There are numerous excellent references that offer a more complete exposition. Among these are Quandt (1983), Bazaraa and Shetty (1979), Fletcher (1980), and Judd (1998).


This is a linear optimization problem. Note that it has a closed-form solution; for any $a$, $b$, and $C$, the solution can be computed directly.8 In the more typical situation,
$$\frac{\partial F(\theta)}{\partial \theta} = 0 \tag{E-4}$$
is a set of nonlinear equations that cannot be solved explicitly for $\theta$.9 The techniques considered in this section provide systematic means of searching for a solution. We now consider the general problem of maximizing a function of several variables:

$$\text{maximize}_\theta\ F(\theta), \tag{E-5}$$
where $F(\theta)$ may be a log-likelihood or some other function. Minimization of $F(\theta)$ is handled by maximizing $-F(\theta)$. Two special cases are
$$F(\theta) = \sum_{i=1}^{n} f_i(\theta), \tag{E-6}$$
which is typical for maximum likelihood problems, and the least squares problem,10
$$f_i(\theta) = -(y_i - f(x_i, \theta))^2. \tag{E-7}$$
We treated the nonlinear least squares problem in detail in Chapter 11. An obvious way to search for the $\theta$ that maximizes $F(\theta)$ is by trial and error. If $\theta$ has only a single element and it is known approximately where the optimum will be found, then a grid search will be a feasible strategy, as sketched below. An example is a common time-series problem in which a one-dimensional search for a correlation coefficient is made in the interval $(-1, 1)$. The grid search can proceed in the obvious fashion—that is, $\ldots, -0.1, 0, 0.1, 0.2, \ldots$, then $\hat{\theta}_{\max} - 0.1$ to $\hat{\theta}_{\max} + 0.1$ in increments of 0.01, and so on—until the desired precision is achieved.11 If $\theta$ contains more than one parameter, then a grid search is likely to be extremely costly, particularly if little is known about the parameter vector at the outset. Nonetheless, relatively efficient methods have been devised. Quandt (1983) and Fletcher (1980) contain further details. There are also systematic, derivative-free methods of searching for a function optimum that resemble in some respects the algorithms that we will examine in the next section. The downhill simplex (and other simplex) methods12 have been found to be very fast and effective for some problems. A recent entry in the econometrics literature is the method of simulated annealing.13 These derivative-free methods, particularly the latter, are often very effective in problems with many variables in the objective function, but they usually require far more function evaluations than the methods based on derivatives that are considered below. Because the problems typically analyzed in econometrics involve relatively few parameters but often quite complex functions involving large numbers of terms in a summation, on balance, the gradient methods are usually going to be preferable.14
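A sketch of the refining grid search just described; the criterion function, bounds, and refinement schedule are purely illustrative:

```python
import numpy as np

def grid_search(F, lo, hi, passes=4, points=21):
    """Successively refined one-dimensional grid search for argmax F."""
    for _ in range(passes):
        grid = np.linspace(lo, hi, points)
        best = grid[np.argmax([F(r) for r in grid])]
        step = (hi - lo) / (points - 1)
        lo, hi = best - step, best + step   # zoom in around the best point
    return best

# Toy criterion with a maximum at rho = 0.4 (purely illustrative)
print(grid_search(lambda r: -(r - 0.4)**2, -0.99, 0.99))
```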

8Notice that the constant $a$ is irrelevant to the solution. Many maximum likelihood problems are presented with the preface "neglecting an irrelevant constant." For example, the log-likelihood for the normal linear regression model contains a term—$(n/2)\ln(2\pi)$—that can be discarded.
9See, for example, the normal equations for the nonlinear least squares estimators of Chapter 11.
10Least squares is, of course, a minimization problem. The negative of the criterion is used to maintain consistency with the general formulation.
11There are more efficient methods of carrying out a one-dimensional search, for example, the golden section method. See Press et al. (1986, Chap. 10).
12See Nelder and Mead (1965) and Press et al. (1986).
13See Goffe, Ferrier, and Rodgers (1994) and Press et al. (1986, pp. 326–334).
14Goffe, Ferrier, and Rodgers (1994) did find that the method of simulated annealing was quite adept at finding the best among multiple solutions. This problem is common for derivative-based methods, because they usually have no method of distinguishing between a local optimum and a global one.


E.3.1 ALGORITHMS

A more effective means of solving most nonlinear maximization problems is by an iterative algorithm:

Beginning from initial value $\theta^0$, at entry to iteration $t$, if $\theta_t$ is not the optimal value for $\theta$, compute direction vector $\Delta_t$ and step size $\lambda_t$, then
$$\theta_{t+1} = \theta_t + \lambda_t \Delta_t. \tag{E-8}$$

Figure E.2 illustrates the structure of an iteration for a hypothetical function of two variables. The direction vector $\Delta_t$ is shown in the figure with $\theta_t$. The dashed line is the set of points $\theta_t + \lambda_t \Delta_t$. Different values of $\lambda_t$ lead to different contours; for this $\theta_t$ and $\Delta_t$, the best value of $\lambda_t$ is about 0.5. Notice in Figure E.2 that for a given direction vector $\Delta_t$ and current parameter vector $\theta_t$, a secondary optimization is required to find the best $\lambda_t$. Translating from Figure E.2, we obtain the form of this problem as shown in Figure E.3. This subsidiary search is called a line search, as we search along the line $\theta_t + \lambda_t \Delta_t$ for the optimal value of $F(\cdot)$. The formal solution to the line search problem would be the $\lambda_t$ that satisfies

$$\frac{\partial F(\theta_t + \lambda_t \Delta_t)}{\partial \lambda_t} = g(\theta_t + \lambda_t \Delta_t)'\,\Delta_t = 0, \tag{E-9}$$

FIGURE E.2 Iteration.


FIGURE E.3 Line Search.

where $g$ is the vector of partial derivatives of $F(\cdot)$ evaluated at $\theta_t + \lambda_t \Delta_t$. In general, this problem will also be a nonlinear one. In most cases, adding a formal search for $\lambda_t$ will be too expensive, as well as unnecessary. Some approximate or ad hoc method will usually be chosen. It is worth emphasizing that finding the $\lambda_t$ that maximizes $F(\theta_t + \lambda_t \Delta_t)$ at a given iteration does not generally lead to the overall solution in that iteration. This situation is clear in Figure E.3, where the optimal value of $\lambda_t$ leads to $F(\cdot) = 2.0$, at which point we reenter the iteration.

E.3.2 COMPUTING DERIVATIVES

For certain functions, the programming of derivatives may be quite difficult. Numeric approximations can be used, although it should be borne in mind that analytic derivatives obtained by formally differentiating the functions involved are to be preferred. First derivatives can be approximated by using
$$\frac{\partial F(\theta)}{\partial \theta_i} \approx \frac{F(\cdots \theta_i + \varepsilon \cdots) - F(\cdots \theta_i - \varepsilon \cdots)}{2\varepsilon}.$$

The choice of ε is a remaining problem. Extensive discussion may be found in Quandt (1983). There are three drawbacks to this means of computing derivatives compared with using the analytic derivatives. A possible major consideration is that it may substantially increase the amount of computation needed to obtain a function and its gradient. In particular, K +1 function evaluations (the criterion and K derivatives) are replaced with 2K + 1 functions. The latter may be more burdensome than the former, depending on the complexity of the partial derivatives compared with the function itself. The comparison will depend on the application. But in most settings, careful programming that avoids superfluous or redundant calculation can make the advantage of the analytic derivatives substantial. Second, the choice of ε can be problematic. If it is chosen too large, then the approximation will be inaccurate. If it is chosen too small, then there may be insufficient variation in the function to produce a good estimate of the derivative. Greene-50558 book June 25, 2007 12:52

APPENDIX E ✦ Computation and Optimization 1069

A compromise that is likely to be effective is to compute εi separately for each parameter, as in

$$\varepsilon_i = \max[\alpha|\theta_i|, \gamma]$$
[see Goldfeld and Quandt (1971)]. The values $\alpha$ and $\gamma$ should be relatively small, such as $10^{-5}$. Third, although numeric derivatives computed in this fashion are likely to be reasonably accurate, in a sum of a large number of terms, say, several thousand, enough approximation error can accumulate to cause the numerical derivatives to differ significantly from their analytic counterparts. Second derivatives can also be computed numerically. In addition to the preceding problems, however, it is generally not possible to ensure negative definiteness of a Hessian computed in this manner. Unless the choice of $\varepsilon$ is made extremely carefully, an indefinite matrix is a possibility. In general, the use of numeric derivatives should be avoided if the analytic derivatives are available.
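A sketch of the central-difference gradient with the Goldfeld–Quandt choice of $\varepsilon_i$ (the test function is illustrative):

```python
import numpy as np

def numeric_gradient(F, theta, alpha=1e-5, gamma=1e-5):
    """Central differences with eps_i = max(alpha*|theta_i|, gamma)."""
    theta = np.asarray(theta, dtype=float)
    g = np.zeros_like(theta)
    for i in range(theta.size):
        eps = max(alpha * abs(theta[i]), gamma)
        up, dn = theta.copy(), theta.copy()
        up[i] += eps
        dn[i] -= eps
        g[i] = (F(up) - F(dn)) / (2.0 * eps)
    return g

# Check on F(theta) = -theta'theta, whose gradient is -2*theta
print(numeric_gradient(lambda t: -t @ t, [1.0, -2.0]))   # ~ [-2., 4.]
```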

E.3.3 GRADIENT METHODS

The most commonly used algorithms are gradient methods, in which

$$\Delta_t = W_t g_t, \tag{E-10}$$

where $W_t$ is a positive definite matrix and $g_t$ is the gradient of $F(\theta_t)$:

$$g_t = g(\theta_t) = \frac{\partial F(\theta_t)}{\partial \theta_t}. \tag{E-11}$$
These methods are motivated partly by the following. Consider a linear Taylor series approximation to $F(\theta_t + \lambda_t \Delta_t)$ around $\lambda_t = 0$:

$$F(\theta_t + \lambda_t \Delta_t) \approx F(\theta_t) + \lambda_t g(\theta_t)'\,\Delta_t. \tag{E-12}$$

Let $F(\theta_t + \lambda_t \Delta_t)$ equal $F_{t+1}$. Then,
$$F_{t+1} - F_t \approx \lambda_t g_t' \Delta_t.$$

If $\Delta_t = W_t g_t$, then
$$F_{t+1} - F_t \approx \lambda_t g_t' W_t g_t.$$

If $g_t$ is not 0 and $\lambda_t$ is small enough, then $F_{t+1} - F_t$ must be positive. Thus, if $F(\theta)$ is not already at its maximum, then we can always find a step size such that a gradient-type iteration will lead to an increase in the function. (Recall that $W_t$ is assumed to be positive definite.) In the following, we will omit the iteration index $t$, except where it is necessary to distinguish one vector from another. The following are some commonly used algorithms.15
Steepest Ascent The simplest algorithm to employ is the steepest ascent method, which uses

$$W = I \quad \text{so that} \quad \Delta = g. \tag{E-13}$$

As its name implies, the direction is the one of greatest increase of $F(\cdot)$. Another virtue is that the line search has a straightforward solution; at least near the maximum, the optimal $\lambda$ is
$$\lambda = \frac{-g'g}{g'Hg}, \tag{E-14}$$

15A more extensive catalog may be found in Judge et al. (1985, Appendix B). Those mentioned here are some of the more commonly used ones and are chosen primarily because they illustrate many of the important aspects of nonlinear optimization.


where
$$H = \frac{\partial^2 F(\theta)}{\partial \theta\,\partial \theta'}.$$
Therefore, the steepest ascent iteration is
$$\theta_{t+1} = \theta_t - \frac{g_t' g_t}{g_t' H_t g_t}\, g_t. \tag{E-15}$$

Computation of the second derivatives matrix may be extremely burdensome. Also, if Ht is not negative definite, which is likely if θ t is far from the maximum, the iteration may diverge. A systematic line search can bypass this problem. This algorithm usually converges very slowly, however, so other techniques are usually used.

Newton’s Method The template for most gradient methods in common use is Newton’s method. The basis for Newton’s method is a linear Taylor series approximation. Expanding the first-order conditions, ∂ (θ) F = , ∂θ 0

equation by equation, in a linear Taylor series around an arbitrary $\theta^0$ yields

$$\frac{\partial F(\theta)}{\partial \theta} \approx g^0 + H^0(\theta - \theta^0) = 0, \tag{E-16}$$

where the superscript indicates that the term is evaluated at $\theta^0$. Solving for $\theta$ and then equating $\theta$ to $\theta_{t+1}$ and $\theta^0$ to $\theta_t$, we obtain the iteration

$$\theta_{t+1} = \theta_t - H_t^{-1} g_t. \tag{E-17}$$

Thus, for Newton’s method,

$$W = -H^{-1}, \quad \Delta = -H^{-1}g, \quad \lambda = 1. \tag{E-18}$$

Newton’s method will converge very rapidly in many problems. If the function is quadratic, then this method will reach the optimum in one iteration from any starting point. If the criterion function is globally concave, as it is in a number of problems that we shall examine in this text, then it is probably the best algorithm available. This method is very well suited to maximum likelihood estimation.

Alternatives to Newton’s Method Newton’s method is very effective in some settings, but it can perform very poorly in others. If the function is not approximately quadratic or if the current estimate is very far from the maximum, then it can cause wide swings in the estimates and even fail to converge at all. A number of algorithms have been devised to improve upon Newton’s method. An obvious one is to include a line search at each iteration rather than use λ = 1. Two problems remain, however. At points distant from the optimum, the second derivatives matrix may not be negative definite, and, in any event, the computational burden of computing H may be excessive. The quadratic hill-climbing method proposed by Goldfeld, Quandt, and Trotter (1966) deals directly with the first of these problems. In any iteration, if H is not negative definite, then it is replaced with

$$H_\alpha = H - \alpha I, \tag{E-19}$$


where $\alpha$ is a positive number chosen large enough to ensure the negative definiteness of $H_\alpha$. Another suggestion is that of Greenstadt (1967), which uses, at every iteration,
$$H_\pi = -\sum_{i=1}^{n} |\pi_i|\, c_i c_i', \tag{E-20}$$

where $\pi_i$ is the $i$th characteristic root of $H$ and $c_i$ is its associated characteristic vector. Other proposals have been made to ensure the negative definiteness of the required matrix at each iteration.16
Quasi-Newton Methods: Davidon–Fletcher–Powell A very effective class of algorithms has been developed that eliminates second derivatives altogether and has excellent convergence properties, even for ill-behaved problems. These are the quasi-Newton methods, which form

$$W_{t+1} = W_t + E_t,$$

where $E_t$ is a positive definite matrix.17 As long as $W_0$ is positive definite—$I$ is commonly used—$W_t$ will be positive definite at every iteration. In the Davidon–Fletcher–Powell (DFP) method, after a sufficient number of iterations, $W_{t+1}$ will be an approximation to $-H^{-1}$. Let

$$\delta_t = \lambda_t \Delta_t, \quad \text{and} \quad \gamma_t = g(\theta_{t+1}) - g(\theta_t). \tag{E-21}$$

The DFP variable metric algorithm uses
$$W_{t+1} = W_t + \frac{\delta_t \delta_t'}{\delta_t' \gamma_t} + \frac{W_t \gamma_t \gamma_t' W_t}{\gamma_t' W_t \gamma_t}. \tag{E-22}$$
Notice that in the DFP algorithm, the change in the first derivative vector is used in $W$; an estimate of the inverse of the second derivatives matrix is being accumulated. The variable metric algorithms are those that update $W$ at each iteration while preserving its definiteness. For the DFP method, the accumulation of $W_{t+1}$ is of the form

$$W_{t+1} = W_t + aa' + bb' = W_t + [a\ b][a\ b]'.$$

The two-column matrix $[a\ b]$ will have rank two; hence, DFP is called a rank two update or rank two correction. The Broyden–Fletcher–Goldfarb–Shanno (BFGS) method is a rank three correction that subtracts $v_t d_t d_t'$ from the DFP update, where $v_t = (\gamma_t' W_t \gamma_t)$ and
$$d_t = \Big(\frac{1}{\delta_t' \gamma_t}\Big)\delta_t - \Big(\frac{1}{\gamma_t' W_t \gamma_t}\Big) W_t \gamma_t.$$
There is some evidence that this method is more efficient than DFP. Other methods, such as Broyden's method, involve a rank one correction instead. Any method that is of the form

$$W_{t+1} = W_t + QQ'$$

will preserve the definiteness of W, regardless of the number of columns in Q. The DFP and BFGS algorithms are extremely effective and are among the most widely used of the gradient methods. An important practical consideration to keep in mind is that although Wt accumulates an estimate of the negative inverse of the second derivatives matrix for both algorithms, in maximum likelihood problems it rarely converges to a very good estimate of the covariance matrix of the estimator and should generally not be used as one.
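In practice one rarely codes DFP or BFGS by hand; library optimizers provide them. A hedged sketch using scipy.optimize (the exponential log-likelihood and the data are invented for illustration):

```python
import numpy as np
from scipy.optimize import minimize

# Invented data; exponential density f(y) = theta*exp(-theta*y),
# log-likelihood n*ln(theta) - theta*sum(y), with MLE theta = 1/ybar.
y = np.array([0.5, 1.2, 0.3, 2.4, 0.9])

def negloglik(a):                       # theta = exp(a) keeps theta > 0
    theta = np.exp(a[0])
    return -(y.size * np.log(theta) - theta * y.sum())

res = minimize(negloglik, x0=[0.0], method="BFGS")
print(np.exp(res.x[0]), 1.0 / y.mean())  # agree at the MLE
```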

16See, for example, Goldfeld and Quandt (1971).
17See Fletcher (1980).


E.3.4 ASPECTS OF MAXIMUM LIKELIHOOD ESTIMATION

Newton’s method is often used for maximum likelihood problems. For solving a maximum like- lihood problem, the method of scoring replaces H with

$$\bar{H} = E[H(\theta)], \tag{E-23}$$

which will be recognized as the asymptotic covariance of the maximum likelihood estimator. There is some evidence that where it can be used, this method performs better than Newton's method. The exact form of the expectation of the Hessian of the log likelihood is rarely known, however.18 Newton's method, which uses actual instead of expected second derivatives, is generally used instead.
One-Step Estimation A convenient variant of Newton's method is the one-step maximum likelihood estimator. It has been shown that if $\theta^0$ is any consistent initial estimator of $\theta$ and $H^*$ is $H$, $\bar{H}$, or any other asymptotically equivalent estimator of $\text{Var}[g(\hat{\theta}_{MLE})]$, then
$$\theta^1 = \theta^0 - (H^*)^{-1} g^0 \tag{E-24}$$
is an estimator of $\theta$ that has the same asymptotic properties as the maximum likelihood estimator.19 (Note that it is not the maximum likelihood estimator. As such, for example, it should not be used as the basis for likelihood ratio tests.)
Covariance Matrix Estimation In computing maximum likelihood estimators, a commonly used method of estimating $H$ simultaneously simplifies the calculation of $W$ and solves the occasional problem of indefiniteness of the Hessian. The method of Berndt et al. (1974) replaces $W$ with
$$\hat{W} = \Big[\sum_{i=1}^{n} g_i g_i'\Big]^{-1} = (G'G)^{-1}, \tag{E-25}$$
where
$$g_i = \frac{\partial \ln f(y_i \mid x_i, \theta)}{\partial \theta}. \tag{E-26}$$
Then, $G$ is the $n \times K$ matrix with $i$th row equal to $g_i'$. Although $\hat{W}$ and other suggested estimators of $(-H)^{-1}$ are asymptotically equivalent, $\hat{W}$ has the additional virtues that it is always nonnegative definite, and it is only necessary to differentiate the log-likelihood once to compute it.
The Lagrange Multiplier Statistic The use of $\hat{W}$ as an estimator of $(-H)^{-1}$ brings another intriguing convenience in maximum likelihood estimation. When testing restrictions on parameters estimated by maximum likelihood, one approach is to use the Lagrange multiplier statistic. We will examine this test at length at various points in this book, so we need only sketch it briefly here. The logic of the LM test is as follows. The gradient $g(\theta)$ of the log-likelihood function equals 0 at the unrestricted maximum likelihood estimators (that is, at least to within the precision of the computer program in use). If $\hat{\theta}_r$ is an MLE that is computed subject to some restrictions on $\theta$, then we know that $g(\hat{\theta}_r)$ will generally not equal 0. The LM test is used to test whether, at $\hat{\theta}_r$, $g_r$ is significantly different from 0 or whether the deviation of $g_r$ from 0 can be viewed as sampling variation. The covariance matrix of the gradient of the log-likelihood is $-H$, so the Wald statistic for testing this hypothesis is $W = g'(-H)^{-1}g$. Now, suppose that we use $\hat{W}$ to estimate $-H^{-1}$. Let $G$ be the $n \times K$ matrix with $i$th row equal to $g_i'$, and let $i$ denote an $n \times 1$ column of ones. Then the LM statistic can be

18Amemiya (1981) provides a number of examples. 19See, for example, Rao (1973). Greene-50558 book June 25, 2007 12:52


computed as
$$LM = i'G(G'G)^{-1}G'i.$$

Because $i'i = n$,
$$LM = n[i'G(G'G)^{-1}G'i/n] = nR_i^2,$$
where $R_i^2$ is the uncentered $R^2$ in a regression of a column of ones on the derivatives of the log-likelihood function.
The Concentrated Log-Likelihood Many problems in maximum likelihood estimation can be formulated in terms of a partitioning of the parameter vector $\theta = [\theta_1, \theta_2]$ such that at the solution to the optimization problem, $\theta_{2,ML}$, can be written as an explicit function of $\theta_{1,ML}$. When the solution to the likelihood equation for $\theta_2$ produces

$$\theta_{2,ML} = t(\theta_{1,ML}),$$

then, if it is convenient, we may “concentrate” the log-likelihood function by writing

$$F^*(\theta_1, \theta_2) = F[\theta_1, t(\theta_1)] = F_c(\theta_1).$$

The unrestricted solution to the problem $\text{Max}_{\theta_1} F_c(\theta_1)$ provides the full solution to the optimization problem. Once the optimizing value of $\theta_1$ is obtained, the optimizing value of $\theta_2$ is simply $t(\hat{\theta}_{1,ML})$. Note that $F^*(\theta_1, \theta_2)$ is a subset of the set of values of the log-likelihood function, namely those values at which the second parameter vector satisfies the first-order conditions.20
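The standard illustration is the normal regression model, in which $\sigma^2$ can be concentrated out; a sketch under that assumption (this specific example is ours, not the text's):

```python
import numpy as np

def concentrated_loglik(beta, y, X):
    """Fc(beta) after sigma^2 = e'e/n is substituted out."""
    e = y - X @ beta
    n = y.size
    return -0.5 * n * (1.0 + np.log(2.0 * np.pi) + np.log(e @ e / n))

rng = np.random.default_rng(1)            # invented data
X = np.column_stack([np.ones(200), rng.standard_normal(200)])
y = X @ np.array([1.0, 0.5]) + rng.standard_normal(200)

# Maximizing Fc over beta alone reproduces least squares
bhat = np.linalg.lstsq(X, y, rcond=None)[0]
print(bhat, concentrated_loglik(bhat, y, X))
```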

E.3.5 OPTIMIZATION WITH CONSTRAINTS

Occasionally, some of or all the parameters of a model are constrained, for example, to be positive in the case of a variance or to be in a certain range, such as a correlation coefficient. Optimization subject to constraints is often yet another art form. The elaborate literature on the general problem provides some guidance—see, for example, Appendix B in Judge et al. (1985)—but applications still, as often as not, require some creativity on the part of the analyst. In this section, we will examine a few of the most common forms of constrained optimization as they arise in econometrics.
Parametric constraints typically come in two forms, which may occur simultaneously in a problem. Equality constraints can be written $c(\theta) = 0$, where $c_j(\theta)$ is a continuous and differentiable function. Typical applications include linear constraints on slope vectors, such as a requirement that a set of elasticities in a log-linear model add to one; exclusion restrictions, which are often cast in the form of interesting hypotheses about whether or not a variable should appear in a model (i.e., whether a coefficient is zero or not); and equality restrictions, such as the symmetry restrictions in a translog model, which require that parameters in two different equations be equal to each other. Inequality constraints, in general, will be of the form $a_j \le c_j(\theta) \le b_j$, where $a_j$ and $b_j$ are known constants (either of which may be infinite). Once again, the typical application in econometrics involves a restriction on a single parameter, such as $\sigma > 0$ for a variance parameter, $-1 \le \rho \le 1$ for a correlation coefficient, or $\beta_j \ge 0$ for a particular slope coefficient in a model. We will consider the two cases separately.
In the case of equality constraints, for practical purposes of optimization, there are usually two strategies available. One can use a Lagrangean multiplier approach. The new optimization problem is

$$\text{Max}_{\theta,\lambda}\ L^*(\theta, \lambda) = F(\theta) + \lambda' c(\theta).$$

20A formal proof that this is a valid way to proceed is given by Amemiya (1985, pp. 125–127).


The necessary conditions for an optimum are
$$\frac{\partial L^*(\theta, \lambda)}{\partial \theta} = g(\theta) + C(\theta)'\lambda = 0,$$
$$\frac{\partial L^*(\theta, \lambda)}{\partial \lambda} = c(\theta) = 0,$$

where $g(\theta)$ is the familiar gradient of $F(\theta)$ and $C(\theta)$ is a $J \times K$ matrix of derivatives with $j$th row equal to $\partial c_j/\partial \theta'$. The joint solution will provide the constrained optimizer, as well as the Lagrange multipliers, which are often interesting in their own right. The disadvantage of this approach is that it increases the dimensionality of the optimization problem. An alternative strategy is to eliminate some of the parameters by either imposing the constraints directly on the function or by solving out the constraints. For exclusion restrictions, which are usually of the form $\theta_j = 0$, this step usually means dropping a variable from a model. Other restrictions can often be imposed just by building them into the model. For example, in a function of $\theta_1$, $\theta_2$, and $\theta_3$, if the restriction is of the form $\theta_3 = \theta_1\theta_2$, then $\theta_3$ can be eliminated from the model by a direct substitution.
Inequality constraints are more difficult. For the general case, one suggestion is to transform the constrained problem into an unconstrained one by imposing some sort of penalty function into the optimization criterion that will cause a parameter vector that violates the constraints, or nearly does so, to be an unattractive choice. For example, to force a parameter $\theta_j$ to be nonzero, one might maximize the augmented function $F(\theta) - |1/\theta_j|$. This approach is feasible, but it has the disadvantage that because the penalty is a function of the parameters, different penalty functions will lead to different solutions of the optimization problem. For the most common problems in econometrics, a simpler approach will usually suffice. One can often reparameterize a function so that the new parameter is unconstrained. For example, the "method of squaring" is sometimes used to force a parameter to be positive. If we require $\theta_j$ to be positive, then we can define $\theta_j = \alpha^2$ and substitute $\alpha^2$ for $\theta_j$ wherever it appears in the model. Then an unconstrained solution for $\alpha$ is obtained. An alternative reparameterization for a parameter that must be positive that is often used is $\theta_j = \exp(\alpha)$. To force a parameter to be between zero and one, we can use the function $\theta_j = 1/[1 + \exp(\alpha)]$. The range of $\alpha$ is now unrestricted. (Both devices are sketched below.) Experience suggests that a third, less orthodox approach works very well for many problems. When the constrained optimization is begun, there is a starting value $\theta^0$ that begins the iterations. Presumably, $\theta^0$ obeys the restrictions. (If not, and none can be found, then the optimization process must be terminated immediately.) The next iterate, $\theta^1$, is a step away from $\theta^0$, by $\theta^1 = \theta^0 + \lambda_0\delta^0$. Suppose that $\theta^1$ violates the constraints. By construction, we know that there is some value $\theta_*^1$ between $\theta^0$ and $\theta^1$ that does not violate the constraint, where "between" means only that a shorter step is taken. Therefore, the next value for the iteration can be $\theta_*^1$. The logic is true at every iteration, so a way to proceed is to alter the iteration so that the step length is shortened when necessary when a parameter violates the constraints.
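A minimal sketch of the two reparameterizations mentioned above (the sample values of $\alpha$ are arbitrary):

```python
import numpy as np

to_positive = lambda a: a * a                  # "method of squaring"
to_unit = lambda a: 1.0 / (1.0 + np.exp(a))    # keeps theta_j in (0, 1)

for a in (-2.0, 0.0, 3.0):                     # arbitrary unconstrained values
    print(a, to_positive(a), to_unit(a))       # theta_j stays in range
```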

E.3.6 SOME PRACTICAL CONSIDERATIONS

The reasons for the good performance of many algorithms, including DFP, are unknown. More- over, different algorithms may perform differently in given settings. Indeed, for some problems, one algorithm may fail to converge whereas another will succeed in finding a solution without great difficulty. In view of this, computer programs such as GQOPT,21 Gauss, and MatLab that offer a menu of different preprogrammed algorithms can be particularly useful. It is sometimes worth the effort to try more than one algorithm on a given problem.

21Goldfeld and Quandt (1972).


Step Sizes Except for the steepest ascent case, an optimal line search is likely to be infeasible or to require more effort than it is worth in view of the potentially large number of function evaluations required. In most cases, the choice of a step size is likely to be rather ad hoc. But within limits, the most widely used algorithms appear to be robust to inaccurate line searches. For example, one method employed by the widely used TSP computer program22 is the method of squeezing (sketched below), which tries $\lambda = 1, \frac{1}{2}, \frac{1}{4}$, and so on until an improvement in the function results. Although this approach is obviously a bit unorthodox, it appears to be quite effective when used with the Gauss–Newton method for nonlinear least squares problems. (See Chapter 11.) A somewhat more elaborate rule is suggested by Berndt et al. (1974). Choose an $\varepsilon$ between 0 and $\frac{1}{2}$, and then find a $\lambda$ such that
$$\varepsilon < \frac{F(\theta + \lambda\Delta) - F(\theta)}{\lambda g'\Delta} < 1 - \varepsilon. \tag{E-27}$$
Of course, which value of $\varepsilon$ to choose is still open, so the choice of $\lambda$ remains ad hoc. Moreover, in neither of these cases is there any optimality to the choice; we merely find a $\lambda$ that leads to a function improvement. Other authors have devised relatively efficient means of searching for a step size without doing the full optimization at each iteration.23
Assessing Convergence Ideally, the iterative procedure should terminate when the gradient is zero. In practice, this step will not be possible, primarily because of accumulated rounding error in the computation of the function and its derivatives. Therefore, a number of alternative convergence criteria are used. Most of them are based on the relative changes in the function or the parameters. There is considerable variation in those used in different computer programs, and there are some pitfalls that should be avoided. A critical absolute value for the elements of the gradient or its norm will be affected by any scaling of the function, such as normalizing it by the sample size. Similarly, stopping on the basis of small absolute changes in the parameters can lead to premature convergence when the parameter vector approaches the maximizer. It is probably best to use several criteria simultaneously, such as the proportional change in both the function and the parameters. Belsley (1980) discusses a number of possible stopping rules. One that has proved useful and is immune to the scaling problem is to base convergence on $g'H^{-1}g$.
Multiple Solutions It is possible for a function to have several local extrema. It is difficult to know a priori whether this is true of the one at hand. But if the function is not globally concave, then it may be a good idea to attempt to maximize it from several starting points to ensure that the maximum obtained is the global one. Ideally, a starting value near the optimum can facilitate matters; in some settings, this can be obtained by using a consistent estimate of the parameter for the starting point. The method of moments, if available, is sometimes a convenient device for doing so.
No Solution Finally, it should be noted that in a nonlinear setting the iterative algorithm can break down, even in the absence of constraints, for at least two reasons. The first possibility is that the problem being solved may be so numerically complex as to defy solution. The second possibility, which is often neglected, is that the proposed model may simply be inappropriate for the data.
In a linear setting, a low $R^2$ or some other diagnostic test may suggest that the model and data are mismatched, but as long as the full rank condition is met by the regressor matrix, a linear regression can always be computed. Nonlinear models are not so forgiving. The failure of an iterative algorithm to find a maximum of the criterion function may be a warning that the model is not appropriate for this body of data.
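A sketch of the method of squeezing described under Step Sizes above (the quadratic test function is illustrative):

```python
import numpy as np

def squeeze_step(F, theta, delta, max_halvings=20):
    """Try lambda = 1, 1/2, 1/4, ... until F improves."""
    f0 = F(theta)
    lam = 1.0
    for _ in range(max_halvings):
        if F(theta + lam * delta) > f0:
            return lam
        lam *= 0.5
    return 0.0    # no improving step found; signals trouble to the caller

F = lambda t: -t @ t                  # illustrative criterion
theta = np.array([1.0, 1.0])
g = -2.0 * theta                      # gradient, used as the direction
lam = squeeze_step(F, theta, g)
print(lam, theta + lam * g)           # lambda = 0.5 jumps to the maximum
```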

²²Hall (1982, p. 147).
²³See, for example, Joreskog and Gruvaeus (1970), Powell (1964), Quandt (1983), and Hall (1982).


E.3.7 THE EM ALGORITHM

The latent class model can be characterized as a missing data model. Consider the mixture model we used for DocVis in Example 16.21, which we will now generalize to allow more than two classes:

$$f(y_{it} \mid x_{it}, \text{class}_i = j) = \theta_{it,j}(1 - \theta_{it,j})^{y_{it}}, \quad \theta_{it,j} = \frac{1}{1 + \lambda_{it,j}}, \quad \lambda_{it,j} = \exp(x_{it}'\beta_j), \quad y_{it} = 0, 1, \ldots,$$

$$\text{Prob}(\text{class}_i = j \mid z_i) = \frac{\exp(z_i'\alpha_j)}{\sum_{j=1}^{J} \exp(z_i'\alpha_j)}, \quad j = 1, 2, \ldots, J.$$

With all parts incorporated, the log-likelihood for this latent class model is

$$\ln L_M = \sum_{i=1}^{n} \ln L_{i,M} = \sum_{i=1}^{n} \ln \left[ \sum_{j=1}^{J} \frac{\exp(z_i'\alpha_j)}{\sum_{m=1}^{J} \exp(z_i'\alpha_m)} \prod_{t=1}^{T_i} \left( \frac{1}{1 + \exp(x_{it}'\beta_j)} \right) \left( \frac{\exp(x_{it}'\beta_j)}{1 + \exp(x_{it}'\beta_j)} \right)^{y_{it}} \right]. \tag{E-28}$$

Suppose the actual class memberships were known (i.e., observed). Then, the class probabilities in ln L_M would be unnecessary. The appropriate complete data log-likelihood for this case would be

$$\ln L_C = \sum_{i=1}^{n} \ln L_{i,C} = \sum_{i=1}^{n} \sum_{j=1}^{J} D_{ij} \ln \left[ \prod_{t=1}^{T_i} \left( \frac{1}{1 + \exp(x_{it}'\beta_j)} \right) \left( \frac{\exp(x_{it}'\beta_j)}{1 + \exp(x_{it}'\beta_j)} \right)^{y_{it}} \right], \tag{E-29}$$

where D_ij is an observed dummy variable that equals one if individual i is from class j, and zero otherwise. With this specification, the log-likelihood breaks into J separate log-likelihoods, one for each (now known) class. The maximum likelihood estimates of β_1, ..., β_J would be obtained simply by separating the sample into the respective subgroups and estimating the appropriate model for each group using maximum likelihood. The method we have used to estimate the parameters of the full model is to replace the D_ij variables with their unconditional expectations, Prob(class_i = j | z_i), then maximize the resulting log-likelihood function. This is the essential logic of the EM (expectation–maximization) algorithm [Dempster et al. (1977)]; however, the method uses the conditional (posterior) class probabilities instead of the unconditional probabilities. The iterative steps of the EM algorithm are

(E step)  Form the expectation of the missing data log-likelihood, conditional on the previous parameter estimates and the data in the sample;
(M step)  Maximize the expected log-likelihood function. Then either return to the E step or exit if the estimates have converged.

The EM algorithm can be used in a variety of settings. [See McLachlan and Krishnan (1997).] It has a particularly appealing form for estimating latent class models. The iterative steps for the latent class model are as follows:

(E step)  Form the conditional (posterior) class probabilities, π_ij | z_i, based on the current estimates. These are based on the likelihood function.


(M step) For each class, estimate the class-specific parameters by maximizing a weighted log-likelihood,

$$\ln L_{M\text{step},j} = \sum_{i=1}^{n} \pi_{ij} \ln L_i \mid \text{class} = j.$$

The parameters of the class probability model are also reestimated, as shown later, when there are variables in z_i other than a constant term. This amounts to a simple weighted estimation. For example, in the latent class linear regression model, the M step would amount to nothing more than weighted least squares. For nonlinear models such as the geometric model above, the M step involves maximizing a weighted log-likelihood function. For the preceding geometric model, the precise steps are as follows: First, obtain starting values for β_1, ..., β_J, α_1, ..., α_J. Recall, α_J = 0. Then:

1. Form the contributions to the likelihood function using (E-28),

$$L_i = \sum_{j=1}^{J} \pi_{ij} \prod_{t=1}^{T_i} f(y_{it} \mid x_{it}, \beta_j, \text{class}_i = j) = \sum_{j=1}^{J} L_i \mid \text{class} = j. \tag{E-30}$$

2. Form the conditional probabilities,

$$w_{ij} = \frac{L_i \mid \text{class} = j}{\sum_{m=1}^{J} L_i \mid \text{class} = m}. \tag{E-31}$$

3. For each j, now maximize the weighted log-likelihood functions (one at a time),

$$\ln L_{j,M}(\beta_j) = \sum_{i=1}^{n} w_{ij} \ln \prod_{t=1}^{T_i} \left( \frac{1}{1 + \exp(x_{it}'\beta_j)} \right) \left( \frac{\exp(x_{it}'\beta_j)}{1 + \exp(x_{it}'\beta_j)} \right)^{y_{it}}. \tag{E-32}$$

4. To update the α_j parameters, maximize the following log-likelihood function:

$$\ln L(\alpha_1, \ldots, \alpha_J) = \sum_{i=1}^{n} \sum_{j=1}^{J} w_{ij} \ln \frac{\exp(z_i'\alpha_j)}{\sum_{j=1}^{J} \exp(z_i'\alpha_j)}, \qquad \alpha_J = 0. \tag{E-33}$$

Step 4 defines a multinomial logit model (with “grouped” data). If the class probability model does not contain any variables in z_i other than a constant, then the solutions to this optimization will be

$$\hat{\pi}_j = \frac{\sum_{i=1}^{n} w_{ij}}{\sum_{i=1}^{n} \sum_{j=1}^{J} w_{ij}}, \qquad \text{then} \qquad \hat{\alpha}_j = \ln \frac{\hat{\pi}_j}{\hat{\pi}_J}. \tag{E-34}$$

(Note that this preserves the restriction α̂_J = 0.) With these in hand, we return to steps 1 and 2 to rebuild the weights, then perform steps 3 and 4. The process is iterated until the estimates of β_1, ..., β_J converge. Step 1 is constructed in a generic form. For a different model, it is necessary only to change the density that appears at the end of the expression in (E-32). For a cross section instead of a panel, the product term in step 1 becomes simply the log of the single term. The EM algorithm has an intuitive appeal in this (and other) settings. In practical terms, it is often found to be a very slow algorithm. It can take many iterations to converge. (The estimates in Example 16.16 were computed using a gradient method, not the EM algorithm.) In its favor,


the EM method is very stable. It has been shown [Dempster, Laird, and Rubin (1977)] that the algorithm always climbs uphill. The log-likelihood improves with each iteration. Applications differ widely in the methods used to estimate latent class models. Adding to the variety are the very many Bayesian applications, none of which use either of the methods discussed here.
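To make the mechanics of steps 1–4 concrete, here is one possible implementation in Python. This is a sketch under stated assumptions, not code from the text: the function name em_latent_class_geometric is invented, the class probability model is taken to contain only a constant (so step 4 reduces to (E-34)), and scipy's BFGS routine stands in as the inner optimizer for the weighted log-likelihood (E-32).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def em_latent_class_geometric(y, X, J, max_iter=200, tol=1e-7, seed=0):
    """EM iterations for a J-class geometric latent class model.
    y: list of n arrays of counts, y[i] holding the T_i observations for i
    X: list of n arrays, X[i] of shape (T_i, K)"""
    rng = np.random.default_rng(seed)
    n, K = len(y), X[0].shape[1]
    beta = rng.normal(scale=0.1, size=(J, K))   # starting values
    pi = np.full(J, 1.0 / J)

    def loglik_i(i, b):
        # ln prod_t theta(1-theta)^y with theta = 1/(1+exp(x'b))
        xb = X[i] @ b
        log1pe = np.log1p(np.exp(xb))
        return np.sum(-log1pe + y[i] * (xb - log1pe))

    last = -np.inf
    for _ in range(max_iter):
        # Steps 1-2 (E step): conditional class probabilities w_ij, (E-31)
        logL = np.array([[loglik_i(i, beta[j]) for j in range(J)]
                         for i in range(n)])
        log_post = np.log(pi) + logL
        w = np.exp(log_post - logsumexp(log_post, axis=1, keepdims=True))
        ll = logsumexp(log_post, axis=1).sum()  # mixture log-likelihood (E-28)
        if ll - last < tol:
            break
        last = ll
        # Step 3 (M step): weighted MLE of beta_j, maximizing (E-32)
        for j in range(J):
            neg = lambda b, j=j: -sum(w[i, j] * loglik_i(i, b) for i in range(n))
            beta[j] = minimize(neg, beta[j], method="BFGS").x
        # Step 4: with constant-only z_i, (E-34) gives the pi_j directly
        pi = w.mean(axis=0)
    return beta, pi, last
```

Monitoring the mixture log-likelihood (E-28) at each pass makes the Dempster, Laird, and Rubin monotonicity property easy to verify in practice: the recorded value should never decrease from one iteration to the next.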

E.4 EXAMPLES

To illustrate the use of gradient methods, we consider some simple problems.

E.4.1 FUNCTION OF ONE PARAMETER

First, consider maximizing a function of a single variable, f(θ) = ln(θ) − 0.1θ². The function is shown in Figure E.4. The first and second derivatives are

$$f'(\theta) = \frac{1}{\theta} - 0.2\,\theta, \qquad f''(\theta) = \frac{-1}{\theta^2} - 0.2.$$

Equating f′ to zero yields the solution θ = √5 = 2.236. At the solution, f″ = −0.4, so this solution is indeed a maximum. To demonstrate the use of an iterative method, we solve this problem using Newton's method. Observe, first, that the second derivative is always negative for any admissible (positive) θ.²⁴ Therefore, it should not matter where we start the iterations; we shall eventually find the maximum. For a single parameter, Newton's method is

$$\theta_{t+1} = \theta_t - [\,f'_t / f''_t\,].$$

FIGURE E.4  Function of One Parameter. [Figure: plot of f(θ) = ln(θ) − 0.1θ² over 0.50 ≤ θ ≤ 5.00; the vertical axis runs from −1.00 to 0.40.]

²⁴In this problem, an inequality restriction, θ > 0, is required. As is common, however, for our first attempt we shall neglect the constraint.


TABLE E.1  Iterations for Newton's Method

Iteration      θ            f            f′           f″
0          5.00000     −0.890562    −0.800000    −0.240000
1          1.66667      0.233048     0.266667    −0.560000
2          2.14286      0.302956     0.030952    −0.417778
3          2.23404      0.304718     0.000811    −0.400363
4          2.23607      0.304719     0.0000004   −0.400000

The sequence of values that results when 5 is used as the starting value is given in Table E.1. The path of the iterations is also shown in the table.
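The iterations are simple enough to verify directly. A few lines of Python (illustrative code, not from the text) reproduce the sequence in Table E.1:

```python
import numpy as np

f   = lambda t: np.log(t) - 0.1 * t**2     # function being maximized
fp  = lambda t: 1.0 / t - 0.2 * t          # first derivative
fpp = lambda t: -1.0 / t**2 - 0.2          # second derivative

theta = 5.0                                # starting value used in Table E.1
for i in range(10):
    print(i, theta, f(theta), fp(theta), fpp(theta))
    step = fp(theta) / fpp(theta)          # Newton update: theta - f'/f''
    if abs(step) < 1e-8:
        break
    theta -= step
```

Starting from θ = 5, the successive iterates 1.66667, 2.14286, 2.23404, and 2.23607 match the table.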

E.4.2 FUNCTION OF TWO PARAMETERS: THE GAMMA DISTRIBUTION

For random sampling from the gamma distribution,

$$f(y_i, \beta, \rho) = \frac{\beta^{\rho}}{\Gamma(\rho)} e^{-\beta y_i} y_i^{\rho - 1}.$$

The log-likelihood is ln L(β, ρ) = nρ ln β − n ln Γ(ρ) − β Σᵢ yᵢ + (ρ − 1) Σᵢ ln yᵢ. (See Section 16.6.4 and Examples 15.5 and 15.7.) It is often convenient to scale the log-likelihood by the sample size. Suppose, as well, that we have a sample in which the mean of yᵢ is 3 and the mean of ln yᵢ is 1. Then the function to be maximized is F(β, ρ) = ρ ln β − ln Γ(ρ) − 3β + ρ − 1. The derivatives are

$$\frac{\partial F}{\partial \beta} = \frac{\rho}{\beta} - 3, \qquad \frac{\partial F}{\partial \rho} = \ln \beta - \frac{\Gamma'(\rho)}{\Gamma(\rho)} + 1 = \ln \beta - \Psi(\rho) + 1,$$

$$\frac{\partial^2 F}{\partial \beta^2} = \frac{-\rho}{\beta^2}, \qquad \frac{\partial^2 F}{\partial \rho^2} = -\frac{\Gamma(\rho)\Gamma''(\rho) - \Gamma'(\rho)^2}{\Gamma(\rho)^2} = -\Psi'(\rho), \qquad \frac{\partial^2 F}{\partial \beta\,\partial \rho} = \frac{1}{\beta}.$$

Finding a good set of starting values is often a difficult problem. Here we choose three starting points somewhat arbitrarily: (ρ₀, β₀) = (4, 1), (8, 3), and (2, 7). The solution to the problem is (ρ, β) = (5.233, 1.7438). We used Newton's method and DFP with a line search to maximize this function.²⁵ For Newton's method, λ = 1. The results are shown in Table E.2. The two methods were essentially the same when starting from a good starting point (trial 1), but they differed substantially when starting from a poorer one (trial 2). Note that DFP and Newton approached the solution from different directions in trial 2. The third starting point shows the value of a line search. At this

TABLE E.2  Iterative Solutions to max F(β, ρ) = ρ ln β − ln Γ(ρ) − 3β + ρ − 1

                Trial 1                    Trial 2                    Trial 3
           DFP         Newton         DFP         Newton         DFP          Newton
Iter.    ρ     β     ρ     β       ρ     β     ρ     β       ρ     β       ρ      β
0      4.000 1.000 4.000 1.000   8.000 3.000 8.000 3.000   2.000 7.000   2.000  7.000
1      3.981 1.345 3.812 1.203   7.117 2.518 2.640 0.615   6.663 2.027  −47.7  −233.
2      4.005 1.324 4.795 1.577   7.144 2.372 3.203 0.931   6.195 2.075    —      —
3      5.217 1.743 5.190 1.728   7.045 2.389 4.257 1.357   5.239 1.731    —      —
4      5.233 1.744 5.231 1.744   5.114 1.710 5.011 1.656   5.251 1.754    —      —
5        —     —     —     —     5.239 1.747 5.219 1.740   5.233 1.744    —      —
6        —     —     —     —     5.233 1.744 5.233 1.744     —     —       —      —

²⁵The one used is described in Joreskog and Gruvaeus (1970).


starting value, the Hessian is extremely large, and the second value for the parameter vector with Newton's method is (−47.671, −233.35), at which point F cannot be computed and this method must be abandoned. Beginning with H = I and using a line search, DFP reaches the point (6.63, 2.03) at the first iteration, after which convergence occurs routinely in three more iterations. At the solution, the Hessian is

$$H = \begin{bmatrix} -1.72038 & \phantom{-}0.191153 \\ \phantom{-}0.191153 & -0.210579 \end{bmatrix}.$$

The diagonal elements of the Hessian are negative and its determinant is 0.32574, so it is negative definite. (The two characteristic roots are −1.7442 and −0.18675.) Therefore, this result is indeed the maximum of the function.
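The Newton iterations for this problem are easy to replicate. The sketch below is illustrative rather than a transcript of the computations in the text: it uses scipy's gammaln, digamma, and polygamma for ln Γ(ρ), Ψ(ρ), and Ψ′(ρ), and it shows only the Newton variant, not the DFP update with line search used for the other columns of Table E.2.

```python
import numpy as np
from scipy.special import gammaln, digamma, polygamma

# F(beta, rho) = rho ln(beta) - ln Gamma(rho) - 3 beta + rho - 1
F = lambda b, r: r * np.log(b) - gammaln(r) - 3.0 * b + r - 1.0

def gradient(b, r):
    return np.array([r / b - 3.0, np.log(b) - digamma(r) + 1.0])

def hessian(b, r):
    return np.array([[-r / b**2,           1.0 / b],
                     [  1.0 / b, -polygamma(1, r)]])

b, r = 1.0, 4.0                    # trial 1 starting values (beta, rho)
for _ in range(25):
    g = gradient(b, r)
    if np.max(np.abs(g)) < 1e-10:
        break
    step = np.linalg.solve(hessian(b, r), g)   # Newton: theta - H^{-1} g
    b, r = b - step[0], r - step[1]
print(r, b, F(b, r))               # approaches (rho, beta) = (5.233, 1.7438)
```

From the good starting point the first iterate is (ρ, β) = (3.812, 1.203), as in the Newton column of trial 1. Starting instead from (ρ, β) = (2, 7), the first Newton step is enormous and lands at negative parameter values where ln β and ln Γ(ρ) are undefined, consistent with the breakdown reported for trial 3.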

E.4.3 A CONCENTRATED LOG-LIKELIHOOD FUNCTION

There is another way that the preceding problem might have been solved. The first of the necessary conditions implies that at the joint solution for (β, ρ), β will equal ρ/3. Suppose that we impose this requirement on the function we are maximizing. The concentrated (over β) log-likelihood function is then produced:

$$F_c(\rho) = \rho \ln(\rho/3) - \ln \Gamma(\rho) - 3(\rho/3) + \rho - 1 = \rho \ln(\rho/3) - \ln \Gamma(\rho) - 1.$$

This function could be maximized by an iterative search or by a simple one-dimensional grid search. Figure E.5 shows the behavior of the function. As expected, the maximum occurs at ρ = 5.233. The value of β is then found as 5.233/3 = 1.744. The concentrated log-likelihood is a useful device in many problems. (See Section 16.9.6.c for an application.) Note the interpretation of the function plotted in Figure E.5. The original function of ρ and β is a surface in three dimensions. The curve in Figure E.5 is a projection of that function; it is a plot of the function values above the line β = ρ/3. By virtue of the first-order condition, we know that one of these points will be the maximizer of the function. Therefore, we may restrict our search for the overall maximum of F(β, ρ) to the points on this line.
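Because F_c is a function of one variable, the grid search suggested above takes only a few lines (illustrative code, not from the text):

```python
import numpy as np
from scipy.special import gammaln

# Concentrated log-likelihood: Fc(rho) = rho ln(rho/3) - ln Gamma(rho) - 1
Fc = lambda rho: rho * np.log(rho / 3.0) - gammaln(rho) - 1.0

grid = np.linspace(0.1, 10.0, 9901)     # one-dimensional grid, step 0.001
rho_hat = grid[np.argmax(Fc(grid))]
beta_hat = rho_hat / 3.0                # recovered from the FOC beta = rho/3
print(rho_hat, beta_hat)                # approximately 5.233 and 1.744
```

A finer grid, or a golden-section search around the grid maximum, recovers the solution to any desired accuracy.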

FIGURE E.5  Concentrated Log-Likelihood. [Figure: plot of F_c(ρ) over 0 ≤ ρ ≤ 10; the vertical axis runs from −3.75 to −1.50.]

APPENDIX F
DATA SETS USED IN APPLICATIONS

The following tables list the variables in the data sets used in the examples and applications in the text. With the exception of the Bertschek and Lechner file, the data sets themselves can be downloaded either from the Web site for this text or from the URLs to the publicly accessible archives indicated below as “Location.” The points in the text where the data are used for examples or suggested exercises are noted as “Uses.”

TABLE F1.1 Consumption and Income, 10 Yearly Observations, 1970–1979 C = Consumption, Y = Disposable income. Source: Economic Report of the President, 1987, Council of Economic Advisors Location: http://pages.stern.nyu.edu/∼wgreene/Text/econometricanalysis.htm Use: Example 1.2.

TABLE F2.1 Consumption and Income, 11 Yearly Observations, 1940–1950 Year = Date, X = Disposable income, C = Consumption, W = War years dummy variable, 1 in 1942–1945, 0 other years. Source: Economic Report of the President, U.S. Government Printing Office, Washington, D.C., 1983. Location: http://pages.stern.nyu.edu/∼wgreene/Text/econometricanalysis.htm Uses: Examples 2.1, 3.2.

TABLE F2.2 The U.S. Gasoline Market, 52 Yearly Observations, 1953–2004 Variables used in the examples are as follows: G = Per capita U.S. gasoline consumption = expenditure/(population times price index), Pg = Price index for gasoline, Y = Per capita disposable income, Pnc = Price index for new cars, Puc = Price index for used cars, Ppt = Price index for public transportation, Pd = Aggregate price index for consumer durables, Pn = Aggregate price index for consumer nondurables, Ps = Aggregate price index for consumer services, Pop = U.S. total population in millions. Source: The data were compiled by Professor Chris Bell, Department of Economics, University of North Carolina, Asheville. Sources: www.bea.gov and www.bls.gov. Location: http://pages.stern.nyu.edu/∼wgreene/Text/econometricanalysis.htm Uses: Examples 2.3, 4.4, 4.7, 4.8, 6.7, 7.1, 19.2, 20.3, Sections 6.4, 19.9.3, Application 4.1.


TABLE F3.1 Investment, 15 Yearly Observations, 1968–1982 Year = Date, GNP = Nominal GNP, Invest = Nominal investment, CPI = Consumer price index, Interest = Interest rate. Note: CPI 1967 is 79.06. The interest rate is the average yearly discount rate at the New York Federal Reserve Bank. Source: Economic Report of the President, U.S. Government Printing Office, Washington, D.C., 1983. Location: http://pages.stern.nyu.edu/∼wgreene/Text/econometricanalysis.htm Uses: Examples 3.1 and 3.3, Section 3.2.2, Exercise 3.12

TABLE F3.2 Koop and Tobias Labor Market Experience, 17,919 Observations The data file is in two parts. The first file contains the panel of 17,919 observations on 4 time-varying variables: Column 1 = Person id (ranging from 1–2,178), Column 2 = Education, Column 3 = Log of hourly wage, Column 4 = Potential experience, Column 5 = Time trend. The second file contains time invariant variables for the 2,178 households. Column 1 = Ability, Column 2 = Mother’s education, Column 3 = Father’s education, Column 4 = Dummy variable for residence in a broken home, Column 5 = Number of siblings.

Source: Koop and Tobias (2004). Location: Journal of Applied Econometrics data archive. http://www.econ.queensu.ca/jae/2004-v19.7/koop-tobias/. Uses: Example 9.17, Applications 3.1, 5.1, 6.1, 6.2. TABLE F4.1 Labor Supply Data from Mroz (1987), 753 Observations LFP = 1 if woman worked in 1975, else 0, WHRS = Wife’s hours of work in 1975, KL6 = Number of children less than 6 years old in household, K618 = Number of children between ages 6 and 18 in household, WA = Wife’s age, WE = Wife’s educational attainment, in years, WW = Wife’s average hourly earnings, in 1975 dollars, RPWG = Wife’s wage reported at the time of the 1976 interview (not = 1975 estimated wage), HHRS = Husband’s hours worked in 1975, HA = Husband’s age, HE = Husband’s educational attainment, in years, HW = Husband’s wage, in 1975 dollars, FAMINC = Family income, in 1975 dollars, WMED = Wife’s mother’s educational attainment, in years, WFED = Wife’s father’s educational attainment, in years, UN = Unemployment rate in county of residence, in percentage points, CIT = Dummy variable = 1 if live in large city (SMSA), else 0, AX = Actual years of wife’s previous labor market experience.

Source: 1976 Panel Study of Income Dynamics, Mroz (1987) Location: http://pages.stern.nyu.edu/∼wgreene/Text/econometricanalysis.htm Uses: Examples 4.3, 4.5, 6.1, 23.5, 23.13, 24.8. Greene-50558 book June 25, 2007 12:52


TABLE F4.2 The Longley Data, 15 Yearly Observations, 1947–1962 Employ = Employment (1000s), Price = GNP deflator, GNP = Nominal GNP (millions), Armed = Armed forces, Year = Date.

Source: Longley (1967). Location: http://pages.stern.nyu.edu/∼wgreene/Text/econometricanalysis.htm Use: Example 4.6.

TABLE F4.3 Cost Function, 158 1970 Cross-Section Firm Level Observations Id = Observation, Year = 1970 for all observations, Cost = Total cost, Q = Total output, Pl = Wage rate, Sl = Cost share for labor, Pk = Capital price index, Sk = Cost share for capital, Pf = Fuel price, Sf = Cost share for fuel. Note: The file contains 158 observations. Christensen and Greene used the first 123. The extras are the holding companies. Use only the first 123 observations to replicate Christensen and Greene. Source: Christensen and Greene (1976). Location: http://pages.stern.nyu.edu/∼wgreene/Text/econometricanalysis.htm Uses: Examples 14.5, 14.8, Applications 4.2, 16.2.

TABLE F5.1 Macroeconomics Data Set, Quarterly, 1950I to 2000IV Year = Date, Qtr = Quarter, Realgdp = Real GDP (billions), Realcons = Real consumption expenditures, Realinvs = Real investment by private sector, Realgovt = Real government expenditures, Realdpi = Real disposable personal income, CPI_U = Consumer price index, M1 = Nominal money stock, Tbilrate = Quarterly average of month end 90-day T-bill rate, Unemp = Unemployment rate, Pop = Population in millions, interpolated from year-end figures using a constant growth rate per quarter, Infl = Rate of inflation (first observation is missing), Realint = Ex post real interest rate = Tbilrate − Infl. (First observation missing). Source: Department of Commerce, BEA website and www.economagic.com Location: http://pages.stern.nyu.edu/∼wgreene/Text/econometricanalysis.htm Uses: Examples 5.1, 5.3, 5.4, 7.2, 7.3, 7.4, 11.5, 11.6, 11.8, 12.4, 12.6, 16.7, 19.1, 19.3, 19.4, 20.2, 20.4, 20.5, 20.6, 22.2, 22.3, 22.4, 22.7, Applications 7.1, 13.1, 19.3, Section 20.6.8.e.


TABLE F5.2 Production for SIC 33: Primary Metals, 27 Statewide Observations Obs = Observation number, Valueadd = Value added, Labor = Labor input, Capital = Capital stock. Note: Data are per establishment, labor is a measure of labor input, and capital is the gross value of plant and equipment. A scale factor used to normalize the capital figure in the original study has been omitted. Further details on construction of the data are given in Aigner et al. (1977). Source: Hildebrand and Liu (1957). Location: http://pages.stern.nyu.edu/∼wgreene/Text/econometricanalysis.htm Use: Example 5.2, Application 11.1.

TABLE F6.1 Costs for U.S. Airlines, 90 Total Observations on 6 Firms for 1970–1984 I = Airline, T = Year, Q = Output, in revenue passenger miles, index number, C = Total cost, in $1000, PF = Fuel price, LF = Load factor, the average capacity utilization of the fleet. Note: These data are a subset of a larger data set provided to the author by Professor Moshe Kim. Source: Christensen Associates of Madison, Wisconsin. Location: http://pages.stern.nyu.edu/∼wgreene/Text/econometricanalysis.htm Uses: Examples 6.2, 8.4, Applications 8.3, 9.2.

TABLE F6.2 World Health Organization Panel Data, 901 Total Observations Primary variables: COMP = Composite measure of health care attainment, DALE = Disability adjusted life expectancy (other measure), YEAR = 1993,...,1997, TIME = 1, 2, 3, 4, 5, T93, T94, T95, T96, T97 = year dummy variables, HEXP = Per capita health expenditure, HC3 = Educational attainment, SMALL = Indicator for states, provinces, etc. (These are additional observations.) SMALL > 0 implies internal political unit, = 0 implies country observation, COUNTRY = Number assigned to country, GROUPTI = Number of observations when SMALL = 0. Usually 5, some = 1, one country = 4, OECD = Dummy variable for OECD country (30 countries), GINI = Gini coefficient for income inequality, GEFF = World bank measure of government effectiveness,* VOICE = World bank measure of democratization of the political process,* TROPICS = Dummy variable for tropical location, POPDEN = Population density,* PUBTHE = Proportion of health expenditure paid by public authorities, GDPC = Normalized per capita GDP.


TABLE F6.2 Continued Constructed variables: LCOMP = ln COMP, LDALE = ln DALE, LGDPC = ln GDPC, LGDPC2 = ln² GDPC, LHC = ln HC3, LHC2 = ln² HC3, LHEXP = ln HEXP, LHEXP2 = ln² HEXP, LHEXPHC = ln HEXP × ln HC3.

Note: Variables marked * were updated with more recent sources in Greene (2004a). Missing values for some of the variables in this data set are filled by using fitted values from a linear regression. Sources: The World Health Organization [Evans et al. (2000) and www.who.int] Location: http://pages.stern.nyu.edu/∼wgreene/Text/econometricanalysis.htm Uses: Examples 6.6, 9.3.

TABLE F6.3 Solow’s Technological Change Data, 41 Yearly Observations, 1909–1949 Year = Date, Q = Output, K = Capital/labor ratio, A = Index of technology.

Source: Solow (1957, p. 314). Several variables are omitted. Location: http://pages.stern.nyu.edu/∼wgreene/Text/econometricanalysis.htm Uses: Application 6.3.

TABLE F8.1 Income and Expenditure Data. 100 Cross-Section Observations MDR = Number of derogatory reports, Acc = Credit card application accepted (1 = yes), Age = Age in years + 12ths of a year, Income = Income, divided by 10,000, Avgexp = Average monthly credit card expenditure, Ownrent = Individual owns (1) or rents (0) home, Selfempl = Self-employed (1 = yes, 0 = no).

Source: Greene (1992) Location: http://pages.stern.nyu.edu/∼wgreene/Text/econometricanalysis.htm Uses: Examples 8.1, 8.2, 8.3.

TABLE F8.2 Baltagi and Griffin Gasoline Data, 18 OECD Countries, 19 Years COUNTRY = Name of country, YEAR = Year, 1960–1978, LGASPCAR = Log of consumption per car, LINCOMEP = Log of per capita income, LRPMG = Log of real price of gasoline, LCARPCAP = Log of per capita number of cars.

Source: See Baltagi and Griffin (1983) and Baltagi (2005). Location: Website for Baltagi (2005) http://www.wiley.com/legacy/wileychi/baltagi/supp/Gasoline.dat Uses: Example 8.5, Application 8.2.


TABLE F9.1 Cornwell and Rupert, Labor Market Data, 595 Individuals, 7 years EXP = Work experience, WKS = Weeks worked, OCC = Occupation, 1 if blue collar, IND = 1 if manufacturing industry, SOUTH = 1 if resides in south, SMSA = 1 if resides in a city (SMSA), MS = 1 if married, FEM = 1 if female, UNION = 1 if wage set by union contract, ED = Years of education, BLK = 1 if individual is black, LWAGE = Log of wage.

Source: See Cornwell and Rupert (1988). Location: Website for Baltagi (2005) http://www.wiley.com/legacy/wileychi/baltagi/supp/WAGES.xls Location (ASCII form): http://pages.stern.nyu.edu/∼wgreene/Text/econometricanalysis.htm Uses: Examples 9.1, 9.2, 9.4, 9.5, 9.6, 9.7, 9.8, 9.14, 12.3.

TABLE F9.2 Munnell Productivity Data, 48 Continental U.S. States, 17 Years, 1970–1986 STATE = State name, ST_ABB = State abbreviation, YR = Year, 1970,...,1986, P_CAP = Public capital, HWY = Highway capital, WATER = Water utility capital, UTIL = Utility capital, PC = Private capital, GSP = Gross state product, EMP = Employment, UNEMP = Unemployment rate.

Source: Munnell (1990), Baltagi (2005). Location: Website for Baltagi (2005) http://www.wiley.com/legacy/wileychi/baltagi/supp/PRODUC.prn Uses: Examples 9.9, 9.12, 9.13, 10.1, 10.2, 10.3, 10.4, 19.5.

TABLE F9.3 Grunfeld Investment Data, 100 Yearly Observations on 5 Firms for 1935–1954 I = Gross investment, from Moody’s Industrial Manual and annual reports of corporations, F = Value of the firm from Bank and Quotation Record and Moody’s Industrial Manual, C = Stock of plant and equipment, from Survey of Current Business.

Source: Grunfeld (1958), Boot and deWitt (1960). Location: http://pages.stern.nyu.edu/∼wgreene/Text/econometricanalysis.htm Uses: Applications 9.1, 10.2.


TABLE F10.1 Cost Function, 145 U.S. Electricity Producers, Nerlove’s 1955 Data Firm = Observation, Year = 1955 for all observations, Cost = Total cost, Output = Total output, Pl = Wage rate, Sl = Cost share for labor, Pk = Capital price index, Sk = Cost share for capital, Pf = Fuel price, Sf = Cost share for fuel. Note: The data file contains several extra observations that are aggregates of commonly owned firms. Use only the first 145 for analysis. Source: Nerlove (1960) and Christensen and Greene (1976). Location: http://pages.stern.nyu.edu/∼wgreene/Text/econometricanalysis.htm Use: Section 10.4.1.

TABLE F10.2 Manufacturing Costs, U.S. Economy, 25 Yearly Observations, 1947–1971 Year = Date, Cost = Cost index, K = Capital cost share, L = Labor cost share, E = Energy cost share, M = Materials cost share, Pk = Capital price, Pl = Labor price, Pe = Energy price, Pm = Materials price.

Source: Berndt and Wood (1975). Location: http://pages.stern.nyu.edu/∼wgreene/Text/econometricanalysis.htm Use: Example 10.8, Section 10.4.2.

TABLE F11.1 German Health Care Data, Unbalanced Panel, 7,293 Individuals, 27,326 Observations ID = Person identification number, FEMALE = Female = 1; male = 0, YEAR = Calendar year of the observation, AGE = Age in years, HSAT = Health satisfaction, coded 0 (low)–10 (high), NEWHSAT = Health satisfaction, 0,...,10; see note below, HANDDUM = Handicapped = 1; otherwise = 0, HANDPER = Degree of handicap in percent (0–100), HHNINC = Household nominal monthly net income in German marks/10,000, HHKIDS = Children under age 16 in the household = 1; otherwise = 0, EDUC = Years of schooling, MARRIED = Married = 1; otherwise = 0, HAUPTS = Highest schooling degree is Hauptschul degree = 1; otherwise = 0, REALS = Highest schooling degree is Realschul degree = 1; otherwise = 0,


TABLE F11.1 Continued FACHHS = Highest schooling degree is Polytechnical degree = 1; otherwise = 0, ABITUR = Highest schooling degree is Abitur = 1; otherwise = 0, UNIV = Highest schooling degree is university degree = 1; otherwise = 0, WORKING = Employed = 1; otherwise = 0, BLUEC = Blue-collar employee = 1; otherwise = 0, WHITEC = White-collar employee = 1; otherwise = 0, SELF = Self-employed = 1; otherwise = 0, BEAMT = Civil servant = 1; otherwise = 0, DOCVIS = Number of doctor visits in last three months, HOSPVIS = Number of hospital visits in last calendar year, PUBLIC = Insured in public health insurance = 1; otherwise = 0, ADDON = Insured by add-on insurance = 1; otherwise = 0, Notes: In the applications in the text, the following additional variables are used: NUMOBS = Number of observations for this person. Repeated in each row of data. NEWHSAT = HSAT; 40 observations on HSAT recorded between 6 and 7 were changed to 7. Frequencies are: 1 = 1525, 2 = 2158, 3 = 825, 4 = 926, 5 = 1051, 6 = 1000, 7 = 987. Source: Riphahn et al. (2003). Location: Journal of Applied Econometrics Data Archive http://qed.econ.queensu.ca/jae/2003-v18.4/riphahn-wambach-million/ Uses: Examples 11.10, 11.11, 16.16, 17.5, 18.6, 23.4, 23.8, 23.10, 23.11, 23.14, 23.15, 23.19, 24.11, 25.1, 25.2, Applications 16.1, 25.1, 25.2.

TABLE F13.1 Klein’s Model I, 22 Yearly Observations, 1920–1941 Year = Date, C = Consumption, P = Corporate profits, Wp = Private wage bill, I = Investment, K1 = previous year’s capital stock, X = GNP, Wg = Government wage bill, G = Government spending, T = Taxes.

Source: Klein (1950). Location: http://pages.stern.nyu.edu/∼wgreene/Text/econometricanalysis.htm Uses: Examples 13.6, 13.7, 13.8, Section 13.7.

TABLE F14.1 Statewide Data on Transportation Equipment Manufacturing, 25 Observations State = Observation, ValueAdd = Output, Capita = Capital input, Labor = Labor input, Nfirm = Number of firms. Note: “Value added,” “Capital,” and “Labor” in millions of 1957 dollars. Data used in regression examples are per establishment. Totals are used for the stochastic frontier application in Chapter 16. Source: A. Zellner and N. Revankar (1970, p. 249). Location: http://pages.stern.nyu.edu/∼wgreene/Text/econometricanalysis.htm Uses: Examples 14.4, 16.9, Application 11.2.


TABLE F15.1 Dahlberg and Johanssen Expenditure Data, 265 Municipalities, 9 Years ID = Identification, Year = Date, Expend = Expenditure, Revenue = Revenue from taxes and fees, Grants = Grants from central government.

Source: Dahlberg and Johanssen (2000). Location: Journal of Applied Econometrics data archive. http://qed.econ.queensu.ca/jae/2000-v15.4/dahlberg-johansson/dj-data.zip Uses: Examples 15.10, 20.7.

TABLE F16.1 Program Effectiveness, 32 Cross-Section Observations Obs = Observation, TUCE = Test score on economics test, PSI = Participation in program, GRADE = Grade increase (1) or decrease (0) indicator.

Source: Spector and Mazzeo (1980). Location: http://pages.stern.nyu.edu/∼wgreene/Text/econometricanalysis.htm Uses: Examples 16.14, 16.15, 18.6, 23.3.

TABLE F17.1 Bertschek and Lechner Binary Choice Data

yit = 1 if firm i realized a product innovation in year t and 0 if not, xit2 = Log of sales, xit3 = Relative size = ratio of employment in business unit to employment in the industry, xit4 = Ratio of industry imports to (industry sales + imports), xit5 = Ratio of industry foreign direct investment to (industry sales + imports), xit6 = Productivity = ratio of industry value added to industry employment, xit7 = Dummy variable indicating firm is in the raw materials sector, xit8 = Dummy variable indicating firm is in the investment goods sector. Source: Bertschek and Lechner (1998). Location: These data are proprietary and may not be redistributed. Uses: Examples 17.6, 23.16.

TABLE F19.1 Bollerslev and Ghysels Exchange Rate Data, 1,974 Daily Observations Y = Nominal return on mark/pound exchange rate, daily.

Source: Bollerslev (1986). Location: http://pages.stern.nyu.edu/∼wgreene/Text/econometricanalysis.htm Uses: Examples 19.7, 19.8.

TABLE F21.1 Bond Yield, Moody’s Aaa Rated, Monthly, 60 Observations, 1990–1994 Date = Year.Month, Y = Corporate bond rate in percent/year. Source: National Income and Product Accounts, U.S. Department of Commerce, Bureau of Economic Analysis, Survey of Current Business: Business Statistics. Location: http://pages.stern.nyu.edu/∼wgreene/Text/econometricanalysis.htm Use: Example 21.1.


TABLE F21.2 Money, Output, Price Deflator Data, 136 Quarterly Observations, 1950–1983 Y = Nominal GNP, M1 = M1 measure of money stock, P = Implicit price deflator for GNP. Source: National Income and Product Accounts, U.S. Department of Commerce, Bureau of Economic Analysis, Survey of Current Business: Business Statistics. Location: http://pages.stern.nyu.edu/∼wgreene/Text/econometricanalysis.htm Uses: Examples 21.3, 22.1, 22.5.

TABLE F23.1 Burnett Analysis of Liberal Arts College Gender Economics Courses, 132 Observations

y1 = Presence of a gender economics course (0/1), y2 = Presence of a women’s studies program on the campus (0/1), z2 = Academic reputation of the college, coded 1 (best), 2, ...to 141, z3 = Size of the full time economics faculty, a count, z4 = Percentage of the economics faculty that are women, proportion (0 to 1), z5 = Religious affiliation of the college, 0 = no, 1 = yes, z6 = Percentage of the college faculty that are women, proportion (0 to 1), z7 − z10 = Regional dummy variables, South, Midwest, Northeast, West. Source: Burnett (1997). Data provided by the author. Location: http://pages.stern.nyu.edu/∼wgreene/Text/econometricanalysis.htm Use: Section 23.8.4.

TABLE F23.2 Data Used to Study Travel Mode Choice, 840 Observations, on 4 Modes for 210 Individuals Mode = Choice: air, train, bus, or car, Ttme = Terminal waiting time, 0 for car, Invc = In-vehicle cost component, Invt = Travel time, in vehicle, GC = Generalized cost measure, Hinc = Household income, Psize = Party size in mode chosen.

Source: Greene and Hensher (1997). Location: http://pages.stern.nyu.edu/∼wgreene/Text/econometricanalysis.htm Uses: Section 23.11.7.

TABLE F24.1 Fair, Redbook Survey on Extramarital Affairs, 6,366 Observations id = Identification number, C = Constant, value = 1, yrb = Constructed measure of time spent in extramarital affairs, v1 = Rating of the marriage, coded 1 to 4, v2 = Age, in years, aggregated, v3 = Number of years married,


TABLE F24.1 Continued v4 = Number of children, top coded at 5, v5 = Religiosity, 1 to 4, 1 = not, 4 = very, v6 = Education, coded 9, 12, 14, 16, 17, 20, v7 = Occupation, v8 = Husband’s occupation.

Source: Fair (1978), data provided by the author. Location: http://pages.stern.nyu.edu/∼wgreene/Text/econometricanalysis.htm Uses: Applications 23.1, 23.2, 24.1.

TABLE F24.2 LaLonde (1986) Earnings Data, 2,490 Control Observations and 185 Treatment Observations t = Treatment dummy variable, age = Age in years, educ = Education in years, marr = Dummy variable for married, black = Dummy variable for black, hisp = Dummy variable for Hispanic, nodegree = Dummy for no degree (not used), re74 = Real earnings in 1974, re75 = Real earnings in 1975, re78 = Real earnings in 1978. Transformed variables added to the equation are age2 = Age², educ2 = Educ², re742 = Re74², re752 = Re75², blacku74 = Black times 1(re74 = 0). Note: We also scaled all earnings variables by 10,000 before beginning the analysis. Source: LaLonde (1986). Location: http://www.nber.org/%7Erdehejia/nswdata.htm. The two specific subsamples are in http://www.nber.org/%7Erdehejia/psid_controls.txt and http://www.nber.org/%7Erdehejia/nswre74_treated.txt Use: Example 24.10.

TABLE F25.1 Expenditure and Default Data, 1,319 Observations Cardhldr = Dummy variable, 1 if application for credit card accepted, 0 if not, Majordrg = Number of major derogatory reports, Age = Age in years plus twelfths of a year, Income = Yearly income (divided by 10,000), Exp_Inc = Ratio of monthly credit card expenditure to yearly income, Avgexp = Average monthly credit card expenditure, Ownrent = 1 if owns their home, 0 if rent, Selfempl = 1 if self employed, 0 if not, Depndt = Number of dependents, Inc_per = Income divided by (number of dependents + 1), Cur_add = Months living at current address, Major = Number of major credit cards held, Active = Number of active credit accounts.

Source: Greene (1992). Location: http://pages.stern.nyu.edu/∼wgreene/Text/econometricanalysis.htm Use: Example 25.3.


TABLE F25.2 Fair’s (1977) Extramarital Affairs Data, 601 Cross-Section Observations y = Number of affairs in the past year, z1 = Sex, z2 = Age, z3 = Number of years married, z4 = Children, z5 = Religiousness, z6 = Education, z7 = Occupation, z8 = Self-rating of marriage.

Note: Several variables not used are denoted X1,...,X5. Source: Fair (1977). Location: http://fairmodel.econ.yale.edu/rayfair/pdf/1978ADAT.ZIP http://pages.stern.nyu.edu/∼wgreene/Text/econometricanalysis.htm Uses: Section 25.5.2.

TABLE F25.3 Strike Duration Data, 63 Observations in 9 Years, 1968–1976 Year = Date, T = Strike duration in days, PROD = Unanticipated output.

Source: Kennan (1985). Location: http://pages.stern.nyu.edu/∼wgreene/Text/econometricanalysis.htm Uses: Example 25.4, Application 25.4.

TABLE F25.4 Ship Accidents, 40 Observations, 5 Types, 4 Vintages, and 2 Service Periods Type = Ship type, TA,TB,TC,TD,TE= Type indicators, YearBuilt = Year, indicated by Y6064, Y6569, Y7074, Y7579 = Year constructed indicators, O6064, O7579 = Years operated indicators, Months = Measure of service amount, Acc = Accidents.

Source: McCullagh and Nelder (1983). Location: http://pages.stern.nyu.edu/∼wgreene/Text/econometricanalysis.htm Use: Applications 6.4, 25.3.

TABLE FC.1 Observations on Income and Education, 20 Observations I = Observation, Y = Income. Source: Data are artificial. Location: http://pages.stern.nyu.edu/∼wgreene/Text/econometricanalysis.htm Uses: Examples 15.5, 15.7, 16.4, Section 16.6.4.

APPENDIX G
STATISTICAL TABLES

TABLE G.1  Cumulative Normal Distribution. Table Entry Is Φ(z) = Prob[Z ≤ z]

z     .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
.0   .5000  .5040  .5080  .5120  .5160  .5199  .5239  .5279  .5319  .5359
.1   .5398  .5438  .5478  .5517  .5557  .5596  .5636  .5675  .5714  .5753
.2   .5793  .5832  .5871  .5910  .5948  .5987  .6026  .6064  .6103  .6141
.3   .6179  .6217  .6255  .6293  .6331  .6368  .6406  .6443  .6480  .6517
.4   .6554  .6591  .6628  .6664  .6700  .6736  .6772  .6808  .6844  .6879
.5   .6915  .6950  .6985  .7019  .7054  .7088  .7123  .7157  .7190  .7224
.6   .7257  .7291  .7324  .7357  .7389  .7422  .7454  .7486  .7517  .7549
.7   .7580  .7611  .7642  .7673  .7704  .7734  .7764  .7794  .7823  .7852
.8   .7881  .7910  .7939  .7967  .7995  .8023  .8051  .8078  .8106  .8133
.9   .8159  .8186  .8212  .8238  .8264  .8289  .8315  .8340  .8365  .8389
1.0  .8413  .8438  .8461  .8485  .8508  .8531  .8554  .8577  .8599  .8621
1.1  .8643  .8665  .8686  .8708  .8729  .8749  .8770  .8790  .8810  .8830
1.2  .8849  .8869  .8888  .8907  .8925  .8944  .8962  .8980  .8997  .9015
1.3  .9032  .9049  .9066  .9082  .9099  .9115  .9131  .9147  .9162  .9177
1.4  .9192  .9207  .9222  .9236  .9251  .9265  .9279  .9292  .9306  .9319
1.5  .9332  .9345  .9357  .9370  .9382  .9394  .9406  .9418  .9429  .9441
1.6  .9452  .9463  .9474  .9484  .9495  .9505  .9515  .9525  .9535  .9545
1.7  .9554  .9564  .9573  .9582  .9591  .9599  .9608  .9616  .9625  .9633
1.8  .9641  .9649  .9656  .9664  .9671  .9678  .9686  .9693  .9699  .9706
1.9  .9713  .9719  .9726  .9732  .9738  .9744  .9750  .9756  .9761  .9767
2.0  .9772  .9778  .9783  .9788  .9793  .9798  .9803  .9808  .9812  .9817
2.1  .9821  .9826  .9830  .9834  .9838  .9842  .9846  .9850  .9854  .9857
2.2  .9861  .9864  .9868  .9871  .9875  .9878  .9881  .9884  .9887  .9890
2.3  .9893  .9896  .9898  .9901  .9904  .9906  .9909  .9911  .9913  .9916
2.4  .9918  .9920  .9922  .9925  .9927  .9929  .9931  .9932  .9934  .9936
2.5  .9938  .9940  .9941  .9943  .9945  .9946  .9948  .9949  .9951  .9952
2.6  .9953  .9955  .9956  .9957  .9959  .9960  .9961  .9962  .9963  .9964
2.7  .9965  .9966  .9967  .9968  .9969  .9970  .9971  .9972  .9973  .9974
2.8  .9974  .9975  .9976  .9977  .9977  .9978  .9979  .9979  .9980  .9981
2.9  .9981  .9982  .9982  .9983  .9984  .9984  .9985  .9985  .9986  .9986
3.0  .9987  .9987  .9987  .9988  .9988  .9989  .9989  .9989  .9990  .9990
3.1  .9990  .9991  .9991  .9991  .9992  .9992  .9992  .9992  .9993  .9993
3.2  .9993  .9993  .9994  .9994  .9994  .9994  .9994  .9995  .9995  .9995
3.3  .9995  .9995  .9995  .9996  .9996  .9996  .9996  .9996  .9996  .9997
3.4  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9998


TABLE G.2  Percentiles of the Student’s t Distribution. Table Entry Is x Such that Prob[t_n ≤ x] = P

n     .750   .900   .950    .975    .990    .995
1    1.000  3.078  6.314  12.706  31.821  63.657
2     .816  1.886  2.920   4.303   6.965   9.925
3     .765  1.638  2.353   3.182   4.541   5.841
4     .741  1.533  2.132   2.776   3.747   4.604
5     .727  1.476  2.015   2.571   3.365   4.032
6     .718  1.440  1.943   2.447   3.143   3.707
7     .711  1.415  1.895   2.365   2.998   3.499
8     .706  1.397  1.860   2.306   2.896   3.355
9     .703  1.383  1.833   2.262   2.821   3.250
10    .700  1.372  1.812   2.228   2.764   3.169
11    .697  1.363  1.796   2.201   2.718   3.106
12    .695  1.356  1.782   2.179   2.681   3.055
13    .694  1.350  1.771   2.160   2.650   3.012
14    .692  1.345  1.761   2.145   2.624   2.977
15    .691  1.341  1.753   2.131   2.602   2.947
16    .690  1.337  1.746   2.120   2.583   2.921
17    .689  1.333  1.740   2.110   2.567   2.898
18    .688  1.330  1.734   2.101   2.552   2.878
19    .688  1.328  1.729   2.093   2.539   2.861
20    .687  1.325  1.725   2.086   2.528   2.845
21    .686  1.323  1.721   2.080   2.518   2.831
22    .686  1.321  1.717   2.074   2.508   2.819
23    .685  1.319  1.714   2.069   2.500   2.807
24    .685  1.318  1.711   2.064   2.492   2.797
25    .684  1.316  1.708   2.060   2.485   2.787
26    .684  1.315  1.706   2.056   2.479   2.779
27    .684  1.314  1.703   2.052   2.473   2.771
28    .683  1.313  1.701   2.048   2.467   2.763
29    .683  1.311  1.699   2.045   2.462   2.756
30    .683  1.310  1.697   2.042   2.457   2.750
35    .682  1.306  1.690   2.030   2.438   2.724
40    .681  1.303  1.684   2.021   2.423   2.704
45    .680  1.301  1.679   2.014   2.412   2.690
50    .679  1.299  1.676   2.009   2.403   2.678
60    .679  1.296  1.671   2.000   2.390   2.660
70    .678  1.294  1.667   1.994   2.381   2.648
80    .678  1.292  1.664   1.990   2.374   2.639
90    .677  1.291  1.662   1.987   2.368   2.632
100   .677  1.290  1.660   1.984   2.364   2.626
∞     .674  1.282  1.645   1.960   2.326   2.576


TABLE G.3  Percentiles of the Chi-Squared Distribution. Table Entry Is c Such that Prob[χ²_n ≤ c] = P

n    .005    .010   .025   .050   .100   .250   .500   .750   .900   .950   .975   .990   .995
1   .00004  .0002  .001   .004   .02    .10    .45    1.32   2.71   3.84   5.02   6.63   7.88
2   .01     .02    .05    .10    .21    .58    1.39   2.77   4.61   5.99   7.38   9.21   10.60
3   .07     .11    .22    .35    .58    1.21   2.37   4.11   6.25   7.81   9.35   11.34  12.84
4   .21     .30    .48    .71    1.06   1.92   3.36   5.39   7.78   9.49   11.14  13.28  14.86
5   .41     .55    .83    1.15   1.61   2.67   4.35   6.63   9.24   11.07  12.83  15.09  16.75
6   .68     .87    1.24   1.64   2.20   3.45   5.35   7.84   10.64  12.59  14.45  16.81  18.55
7   .99     1.24   1.69   2.17   2.83   4.25   6.35   9.04   12.02  14.07  16.01  18.48  20.28
8   1.34    1.65   2.18   2.73   3.49   5.07   7.34   10.22  13.36  15.51  17.53  20.09  21.95
9   1.73    2.09   2.70   3.33   4.17   5.90   8.34   11.39  14.68  16.92  19.02  21.67  23.59
10  2.16    2.56   3.25   3.94   4.87   6.74   9.34   12.55  15.99  18.31  20.48  23.21  25.19
11  2.60    3.05   3.82   4.57   5.58   7.58   10.34  13.70  17.28  19.68  21.92  24.72  26.76
12  3.07    3.57   4.40   5.23   6.30   8.44   11.34  14.85  18.55  21.03  23.34  26.22  28.30
13  3.57    4.11   5.01   5.89   7.04   9.30   12.34  15.98  19.81  22.36  24.74  27.69  29.82
14  4.07    4.66   5.63   6.57   7.79   10.17  13.34  17.12  21.06  23.68  26.12  29.14  31.32
15  4.60    5.23   6.26   7.26   8.55   11.04  14.34  18.25  22.31  25.00  27.49  30.58  32.80
16  5.14    5.81   6.91   7.96   9.31   11.91  15.34  19.37  23.54  26.30  28.85  32.00  34.27
17  5.70    6.41   7.56   8.67   10.09  12.79  16.34  20.49  24.77  27.59  30.19  33.41  35.72
18  6.26    7.01   8.23   9.39   10.86  13.68  17.34  21.60  25.99  28.87  31.53  34.81  37.16
19  6.84    7.63   8.91   10.12  11.65  14.56  18.34  22.72  27.20  30.14  32.85  36.19  38.58
20  7.43    8.26   9.59   10.85  12.44  15.45  19.34  23.83  28.41  31.41  34.17  37.57  40.00
21  8.03    8.90   10.28  11.59  13.24  16.34  20.34  24.93  29.62  32.67  35.48  38.93  41.40
22  8.64    9.54   10.98  12.34  14.04  17.24  21.34  26.04  30.81  33.92  36.78  40.29  42.80
23  9.26    10.20  11.69  13.09  14.85  18.14  22.34  27.14  32.01  35.17  38.08  41.64  44.18
24  9.89    10.86  12.40  13.85  15.66  19.04  23.34  28.24  33.20  36.42  39.36  42.98  45.56
25  10.52   11.52  13.12  14.61  16.47  19.94  24.34  29.34  34.38  37.65  40.65  44.31  46.93
30  13.79   14.95  16.79  18.49  20.60  24.48  29.34  34.80  40.26  43.77  46.98  50.89  53.67
35  17.19   18.51  20.57  22.47  24.80  29.05  34.34  40.22  46.06  49.80  53.20  57.34  60.27
40  20.71   22.16  24.43  26.51  29.05  33.66  39.34  45.62  51.81  55.76  59.34  63.69  66.77
45  24.31   25.90  28.37  30.61  33.35  38.29  44.34  50.98  57.51  61.66  65.41  69.96  73.17
50  27.99   29.71  32.36  34.76  37.69  42.94  49.33  56.33  63.17  67.50  71.42  76.15  79.49


TABLE G.4  95th Percentiles of the F Distribution. Table Entry Is f Such that Prob[F(n1, n2) ≤ f] = .95

n1 = Degrees of Freedom for the Numerator

n2       1       2       3       4       5       6       7       8       9
1     161.45  199.50  215.71  224.58  230.16  233.99  236.77  238.88  240.54
2      18.51   19.00   19.16   19.25   19.30   19.33   19.35   19.37   19.38
3      10.13    9.55    9.28    9.12    9.01    8.94    8.89    8.85    8.81
4       7.71    6.94    6.59    6.39    6.26    6.16    6.09    6.04    6.00
5       6.61    5.79    5.41    5.19    5.05    4.95    4.88    4.82    4.77
6       5.99    5.14    4.76    4.53    4.39    4.28    4.21    4.15    4.10
7       5.59    4.74    4.35    4.12    3.97    3.87    3.79    3.73    3.68
8       5.32    4.46    4.07    3.84    3.69    3.58    3.50    3.44    3.39
9       5.12    4.26    3.86    3.63    3.48    3.37    3.29    3.23    3.18
10      4.96    4.10    3.71    3.48    3.33    3.22    3.14    3.07    3.02
15      4.54    3.68    3.29    3.06    2.90    2.79    2.71    2.64    2.59
20      4.35    3.49    3.10    2.87    2.71    2.60    2.51    2.45    2.39
25      4.24    3.39    2.99    2.76    2.60    2.49    2.40    2.34    2.28
30      4.17    3.32    2.92    2.69    2.53    2.42    2.33    2.27    2.21
40      4.08    3.23    2.84    2.61    2.45    2.34    2.25    2.18    2.12
50      4.03    3.18    2.79    2.56    2.40    2.29    2.20    2.13    2.07
70      3.98    3.13    2.74    2.50    2.35    2.23    2.14    2.07    2.02
100     3.94    3.09    2.70    2.46    2.31    2.19    2.10    2.03    1.97
∞       3.84    3.00    2.60    2.37    2.21    2.10    2.01    1.94    1.88

n2      10      12      15      20      30      40      50      60       ∞
1     241.88  243.91  245.95  248.01  250.10  251.14  252.20  252.20  254.19
2      19.40   19.41   19.43   19.45   19.46   19.47   19.48   19.48   19.49
3       8.79    8.74    8.70    8.66    8.62    8.59    8.57    8.57    8.53
4       5.96    5.91    5.86    5.80    5.75    5.72    5.69    5.69    5.63
5       4.74    4.68    4.62    4.56    4.50    4.46    4.43    4.43    4.37
6       4.06    4.00    3.94    3.87    3.81    3.77    3.74    3.74    3.67
7       3.64    3.57    3.51    3.44    3.38    3.34    3.30    3.30    3.23
8       3.35    3.28    3.22    3.15    3.08    3.04    3.01    3.01    2.93
9       3.14    3.07    3.01    2.94    2.86    2.83    2.79    2.79    2.71
10      2.98    2.91    2.85    2.77    2.70    2.66    2.62    2.62    2.54
15      2.54    2.48    2.40    2.33    2.25    2.20    2.16    2.16    2.07
20      2.35    2.28    2.20    2.12    2.04    1.99    1.95    1.95    1.85
25      2.24    2.16    2.09    2.01    1.92    1.87    1.82    1.82    1.72
30      2.16    2.09    2.01    1.93    1.84    1.79    1.74    1.74    1.63
40      2.08    2.00    1.92    1.84    1.74    1.69    1.64    1.64    1.52
50      2.03    1.95    1.87    1.78    1.69    1.63    1.58    1.58    1.45
70      1.97    1.89    1.81    1.72    1.62    1.57    1.50    1.50    1.36
100     1.93    1.85    1.77    1.68    1.57    1.52    1.45    1.45    1.30
∞       1.83    1.75    1.67    1.57    1.46    1.39    1.34    1.31    1.30


TABLE G.5  99th Percentiles of the F Distribution. Table Entry Is f Such that Prob[F(n1, n2) ≤ f] = .99

n1 = Degrees of Freedom for the Numerator

n2        1        2        3        4        5        6        7        8        9
1     4052.18  4999.50  5403.35  5624.58  5763.65  5858.99  5928.36  5981.07  6022.47
2       98.50    99.00    99.17    99.25    99.30    99.33    99.36    99.37    99.39
3       34.12    30.82    29.46    28.71    28.24    27.91    27.67    27.49    27.35
4       21.20    18.00    16.69    15.98    15.52    15.21    14.98    14.80    14.66
5       16.26    13.27    12.06    11.39    10.97    10.67    10.46    10.29    10.16
6       13.75    10.92     9.78     9.15     8.75     8.47     8.26     8.10     7.98
7       12.25     9.55     8.45     7.85     7.46     7.19     6.99     6.84     6.72
8       11.26     8.65     7.59     7.01     6.63     6.37     6.18     6.03     5.91
9       10.56     8.02     6.99     6.42     6.06     5.80     5.61     5.47     5.35
10      10.04     7.56     6.55     5.99     5.64     5.39     5.20     5.06     4.94
15       8.68     6.36     5.42     4.89     4.56     4.32     4.14     4.00     3.89
20       8.10     5.85     4.94     4.43     4.10     3.87     3.70     3.56     3.46
25       7.77     5.57     4.68     4.18     3.85     3.63     3.46     3.32     3.22
30       7.56     5.39     4.51     4.02     3.70     3.47     3.30     3.17     3.07
40       7.31     5.18     4.31     3.83     3.51     3.29     3.12     2.99     2.89
50       7.17     5.06     4.20     3.72     3.41     3.19     3.02     2.89     2.78
70       7.01     4.92     4.07     3.60     3.29     3.07     2.91     2.78     2.67
100      6.90     4.82     3.98     3.51     3.21     2.99     2.82     2.69     2.59
∞        6.66     4.63     3.80     3.34     3.04     2.82     2.66     2.53     2.43

n2       10       12       15       20       30       40       50       60        ∞
1     6055.85  6106.32  6157.28  6208.73  6260.65  6286.78  6313.03  6313.03  6362.68
2       99.40    99.42    99.43    99.45    99.47    99.47    99.48    99.48    99.50
3       27.23    27.05    26.87    26.69    26.50    26.41    26.32    26.32    26.14
4       14.55    14.37    14.20    14.02    13.84    13.75    13.65    13.65    13.47
5       10.05     9.89     9.72     9.55     9.38     9.29     9.20     9.20     9.03
6        7.87     7.72     7.56     7.40     7.23     7.14     7.06     7.06     6.89
7        6.62     6.47     6.31     6.16     5.99     5.91     5.82     5.82     5.66
8        5.81     5.67     5.52     5.36     5.20     5.12     5.03     5.03     4.87
9        5.26     5.11     4.96     4.81     4.65     4.57     4.48     4.48     4.32
10       4.85     4.71     4.56     4.41     4.25     4.17     4.08     4.08     3.92
15       3.80     3.67     3.52     3.37     3.21     3.13     3.05     3.05     2.88
20       3.37     3.23     3.09     2.94     2.78     2.69     2.61     2.61     2.43
25       3.13     2.99     2.85     2.70     2.54     2.45     2.36     2.36     2.18
30       2.98     2.84     2.70     2.55     2.39     2.30     2.21     2.21     2.02
40       2.80     2.66     2.52     2.37     2.20     2.11     2.02     2.02     1.82
50       2.70     2.56     2.42     2.27     2.10     2.01     1.91     1.91     1.70
70       2.59     2.45     2.31     2.15     1.98     1.89     1.78     1.78     1.56
100      2.50     2.37     2.22     2.07     1.89     1.80     1.69     1.69     1.45
∞        2.34     2.20     2.06     1.90     1.72     1.61     1.50     1.50     1.16


TABLE G.6  Durbin–Watson Statistic: 5 Percent Significance Points of dL and dU

        k = 1       k = 2       k = 3       k = 4       k = 5       k = 10      k = 15
n      dL   dU     dL   dU     dL   dU     dL   dU     dL   dU     dL   dU     dL   dU
15    1.08 1.36   .95  1.54   .82  1.75   .69  1.97   .56  2.21    —    —      —    —
16    1.10 1.37   .98  1.54   .86  1.73   .74  1.93   .62  2.15   .16  3.30    —    —
17    1.13 1.38  1.02  1.54   .90  1.71   .78  1.90   .67  2.10   .20  3.18    —    —
18    1.16 1.39  1.05  1.53   .93  1.69   .82  1.87   .71  2.06   .24  3.07    —    —
19    1.18 1.40  1.08  1.53   .97  1.68   .86  1.85   .75  2.02   .29  2.97    —    —
20    1.20 1.41  1.10  1.54  1.00  1.68   .90  1.83   .79  1.99   .34  2.89   .06  3.68
21    1.22 1.42  1.13  1.54  1.03  1.67   .93  1.81   .83  1.96   .38  2.81   .09  3.58
22    1.24 1.43  1.15  1.54  1.05  1.66   .96  1.80   .86  1.94   .42  2.73   .12  3.55
23    1.26 1.44  1.17  1.54  1.08  1.66   .99  1.79   .90  1.92   .47  2.67   .15  3.41
24    1.27 1.45  1.19  1.55  1.10  1.66  1.01  1.78   .93  1.90   .51  2.61   .19  3.33
25    1.29 1.45  1.21  1.55  1.12  1.66  1.04  1.77   .95  1.89   .54  2.57   .22  3.25
26    1.30 1.46  1.22  1.55  1.14  1.65  1.06  1.76   .98  1.88   .58  2.51   .26  3.18
27    1.32 1.47  1.24  1.56  1.16  1.65  1.08  1.76  1.01  1.86   .62  2.47   .29  3.11
28    1.33 1.48  1.26  1.56  1.18  1.65  1.10  1.75  1.03  1.85   .65  2.43   .33  3.05
29    1.34 1.48  1.27  1.56  1.20  1.65  1.12  1.74  1.05  1.84   .68  2.40   .36  2.99
30    1.35 1.49  1.28  1.57  1.21  1.65  1.14  1.74  1.07  1.83   .71  2.36   .39  2.94
31    1.36 1.50  1.30  1.57  1.23  1.65  1.16  1.74  1.09  1.83   .74  2.33   .43  2.99
32    1.37 1.50  1.31  1.57  1.24  1.65  1.18  1.73  1.11  1.82   .77  2.31   .46  2.84
33    1.38 1.51  1.32  1.58  1.26  1.65  1.19  1.73  1.13  1.81   .80  2.28   .49  2.80
34    1.39 1.51  1.33  1.58  1.27  1.65  1.21  1.73  1.15  1.81   .82  2.26   .52  2.75
35    1.40 1.52  1.34  1.53  1.28  1.65  1.22  1.73  1.16  1.80   .85  2.24   .55  2.72
36    1.41 1.52  1.35  1.59  1.29  1.65  1.24  1.73  1.18  1.80   .87  2.22   .58  2.68
37    1.42 1.53  1.36  1.59  1.31  1.66  1.25  1.72  1.19  1.80   .89  2.20   .60  2.65
38    1.43 1.54  1.37  1.59  1.32  1.66  1.26  1.72  1.21  1.79   .91  2.18   .63  2.61
39    1.43 1.54  1.38  1.60  1.33  1.66  1.27  1.72  1.22  1.79   .93  2.16   .65  2.59
40    1.44 1.54  1.39  1.60  1.34  1.66  1.29  1.72  1.23  1.79   .95  2.15   .68  2.56
45    1.48 1.57  1.43  1.62  1.38  1.67  1.34  1.72  1.29  1.78  1.04  2.09   .79  2.44
50    1.50 1.59  1.46  1.63  1.42  1.67  1.38  1.72  1.34  1.77  1.11  2.04   .88  2.35
55    1.53 1.60  1.49  1.64  1.45  1.68  1.41  1.72  1.38  1.77  1.17  2.01   .96  2.28
60    1.55 1.62  1.51  1.65  1.48  1.69  1.44  1.73  1.41  1.77  1.22  1.98  1.03  2.23
65    1.57 1.63  1.54  1.66  1.50  1.70  1.47  1.73  1.44  1.77  1.27  1.96  1.09  2.18
70    1.58 1.64  1.55  1.67  1.52  1.70  1.49  1.74  1.46  1.77  1.30  1.95  1.14  2.15
75    1.60 1.65  1.57  1.68  1.54  1.71  1.51  1.74  1.49  1.77  1.34  1.94  1.18  2.12
80    1.61 1.66  1.59  1.69  1.56  1.72  1.53  1.74  1.51  1.77  1.37  1.93  1.22  2.09
85    1.62 1.67  1.60  1.70  1.57  1.72  1.55  1.75  1.52  1.77  1.40  1.92  1.26  2.07
90    1.63 1.68  1.61  1.70  1.59  1.73  1.57  1.75  1.54  1.78  1.42  1.91  1.29  2.06
95    1.64 1.69  1.62  1.71  1.60  1.73  1.58  1.75  1.56  1.78  1.44  1.90  1.32  2.04
100   1.65 1.69  1.63  1.72  1.61  1.74  1.59  1.76  1.57  1.78  1.46  1.90  1.35  2.03

Source: Extracted from N. E. Savin and K. J. White, “The Durbin–Watson Test for Serial Correlation with Extreme Sample Sizes and Many Regressors,” Econometrica, 45 (8), Nov. 1977, pp. 1992–1995.
Note: k is the number of regressors excluding the intercept.

REFERENCES

Abowd, J., and H. Farber. “Job Queues and Union Status of Workers.” Industrial and Labor Relations Review, 35, 1982, pp. 354–367.
Abramovitz, M., and I. Stegun. Handbook of Mathematical Functions. New York: Dover Press, 1971.
Abrevaya, J. “The Equivalence of Two Estimators of the Fixed Effects Logit Model.” Economics Letters, 55, 1997, pp. 41–43.
Abrevaya, J., and J. Huang. “On the Bootstrap of the Maximum Score Estimator.” Econometrica, 73, 4, 2005, pp. 1175–1204.
Achen, C. “Two-Step Hierarchical Estimation: Beyond Regression Analysis.” Political Analysis, 13, 4, 2005, pp. 447–456.
Affleck-Graves, J., and B. McDonald. “Nonnormalities and Tests of Asset Pricing Theories.” Journal of Finance, 44, 1989, pp. 889–908.
Afifi, T., and R. Elashoff. “Missing Observations in Multivariate Statistics.” Journal of the American Statistical Association, 61, 1966, pp. 595–604.
Afifi, T., and R. Elashoff. “Missing Observations in Multivariate Statistics.” Journal of the American Statistical Association, 62, 1967, pp. 10–29.
Ahn, S., and P. Schmidt. “Efficient Estimation of Models for Dynamic Panel Data.” Journal of Econometrics, 68, 1, 1995, pp. 5–28.
Aigner, D. “MSE Dominance of Least Squares with Errors of Observation.” Journal of Econometrics, 2, 1974, pp. 365–372.
Aigner, D., K. Lovell, and P. Schmidt. “Formulation and Estimation of Stochastic Frontier Production Models.” Journal of Econometrics, 6, 1977, pp. 21–37.
Aitchison, J., and J. Brown. The Lognormal Distribution with Special Reference to Its Uses in Economics. New York: Cambridge University Press, 1969.
Aitken, A. C. “On Least Squares and Linear Combinations of Observations.” Proceedings of the Royal Statistical Society, 55, 1935, pp. 42–48.
Akaike, H. “Information Theory and an Extension of the Maximum Likelihood Principle.” In B. Petrov and F. Csake, eds., Second International Symposium on Information Theory. Budapest: Akademiai Kiado, 1973.
Akin, J., D. Guilkey, and R. Sickles. “A Random Coefficient Probit Model with an Application to a Study of Migration.” Journal of Econometrics, 11, 1979, pp. 233–246.
Albert, J., and S. Chib. “Bayesian Analysis of Binary and Polytomous Response Data.” Journal of the American Statistical Association, 88, 1993a, pp. 669–679.
Albert, J., and S. Chib. “Bayes Inference via Gibbs Sampling of Autoregressive Time Series Subject to Markov Mean and Variance Shifts.” Journal of Business and Economic Statistics, 11, 1993b, pp. 1–15.
Aldrich, J., and F. Nelson. Linear Probability, Logit, and Probit Models. Beverly Hills: Sage Publications, 1984.
Ali, M., and C. Giaccotto. “A Study of Several New and Existing Tests for Heteroscedasticity in the General Linear Model.” Journal of Econometrics, 26, 1984, pp. 355–374.
Allenby, G., and J. Ginter. “The Effects of In-Store Displays and Feature Advertising on Consideration Sets.” International Journal of Research in Marketing, 12, 1995, pp. 67–80.



Allison, P. “Problems with Fixed-Effects Negative Binomial Models.” Manuscript, Department of Sociology, University of Pennsylvania, 2000.
Allison, P., and R. Waterman. “Fixed-Effects Negative Binomial Regression Models.” Sociological Methodology, 32, 2002, pp. 247–256.
Allison, P. Missing Data. Beverly Hills: Sage Publications, 2002.
Almon, S. “The Distributed Lag Between Capital Appropriations and Expenditures.” Econometrica, 33, 1965, pp. 178–196.
Altonji, J., and R. Matzkin. “Panel Data Estimators for Nonseparable Models with Endogenous Regressors.” NBER Working Paper t0267, Cambridge, 2001.
Alvarez, R., G. Garrett, and P. Lange. “Government Partisanship, Labor Organization, and Macroeconomic Performance.” American Political Science Review, 85, 1991, pp. 539–556.
Amemiya, T. “The Estimation of Variances in a Variance-Components Model.” International Economic Review, 12, 1971, pp. 1–13.
Amemiya, T. “Regression Analysis When the Dependent Variable Is Truncated Normal.” Econometrica, 41, 1973, pp. 997–1016.
Amemiya, T. “Some Theorems in the Linear Probability Model.” International Economic Review, 18, 1977, pp. 645–650.
Amemiya, T. “Qualitative Response Models: A Survey.” Journal of Economic Literature, 19, 4, 1981, pp. 481–536.
Amemiya, T. “Tobit Models: A Survey.” Journal of Econometrics, 24, 1984, pp. 3–63.
Amemiya, T. Advanced Econometrics. Cambridge: Harvard University Press, 1985.
Amemiya, T., and T. MaCurdy. “Instrumental Variable Estimation of an Error Components Model.” Econometrica, 54, 1986, pp. 869–881.
Andersen, D. “Asymptotic Properties of Conditional Maximum Likelihood Estimators.” Journal of the Royal Statistical Society, Series B, 32, 1970, pp. 283–301.
Anderson, G., and R. Blundell. “Estimation and Hypothesis Testing in Dynamic Singular Equation Systems.” Econometrica, 50, 1982, pp. 1559–1572.
Anderson, R., and J. Thursby. “Confidence Intervals for Elasticity Estimators in Translog Models.” Review of Economics and Statistics, 68, 1986, pp. 647–657.
Anderson, T. The Statistical Analysis of Time Series. New York: John Wiley and Sons, 1971.
Anderson, T., and C. Hsiao. “Estimation of Dynamic Models with Error Components.” Journal of the American Statistical Association, 76, 1981, pp. 598–606.
Anderson, T., and C. Hsiao. “Formulation and Estimation of Dynamic Models Using Panel Data.” Journal of Econometrics, 18, 1982, pp. 67–82.
Anderson, T., and H. Rubin. “Estimation of the Parameters of a Single Equation in a Complete System of Stochastic Equations.” Annals of Mathematical Statistics, 20, 1949, pp. 46–63.
Anderson, T., and H. Rubin. “The Asymptotic Properties of Estimators of the Parameters of a Single Equation in a Complete System of Stochastic Equations.” Annals of Mathematical Statistics, 21, 1950, pp. 570–582.
Andrews, D. “A Robust Method for Multiple Linear Regression.” Technometrics, 16, 1974, pp. 523–531.
Andrews, D. “Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimation.” Econometrica, 59, 1991, pp. 817–858.
Andrews, D. “Tests for Parameter Instability and Structural Change with Unknown Change Point.” Econometrica, 61, 1993, pp. 821–856.
Andrews, D. “Hypothesis Tests with a Restricted Parameter Space.” Journal of Econometrics, 84, 1998, pp. 155–199.
Andrews, D. “Estimation When a Parameter Is on a Boundary.” Econometrica, 67, 1999, pp. 1341–1382.

References 1101

Andrews, D. “Inconsistency of the Bootstrap Arabmazar, A., and P. Schmidt. “An In- When a Parameter Is on the Boundary vestigation into the Robustness of the of the Parameter Space.” Econometrica, Tobit Estimator to Nonnormality.” 68, 2000, pp. 399–405. Econometrica, 50, 1982a, pp. 1055–1063. Andrews, D. “Testing When a Parameter is on Arabmazar, A., and P. Schmidt. “Further a Boundary of the Maintained Hypothe- Evidence on the Robustness of the Tobit sis,” Econometrica, 69, 2001, pp. 683–734. Estimator to Heteroscedasticity.” Journal Andrews, D. “GMM Estimation When a of Econometrics, 17, 1982b, pp. 253–258. Parameter is on a Boundary.” Journal Arellano, M. “Computing Robust Standard of Business and Economic Statistics, 20, Errors for Within-Groups Estimators.” 2002, pp. 530–544. Oxford Bulletin of Economics and Andrews, D., and R. Fair, “Inference in Statistics, 49, 1987, pp. 431–434. Nonlinear Econometric Models with Arellano, M. “A Note on the Anderson-Hsiao Structural Change.” Review of Economic Estimator for Panel Data.” Economics Studies, 55, 1988, pp. 615–640. Letters, 31, 1989, pp. 337– 341. Andrews, D., and W. Ploberger. “Optimal Arellano, M. “Discrete Choices with Panel Tests When a Nuisance Parameter Is Data.” Investigaciones Economica, Present Only Under the Alternative.” Lecture 25, 2000. Econometrica, 62, 1994, pp. 1383–1414. Arellano, M. “Panel Data: Some Recent Andrews, D., and W. Ploberger. “Admiss- Developments.” In J. Heckman and ability of the LR Test When a Nuisance E. Leamer, eds., Handbook of Economet- Parameter is Present Only Under the rics, Vol. 5, North Holland, Amsterdam, Alternative.” Annals of Statistics, 23, 2001. 1995, pp. 1609–1629. Arellano, M. Panel Data Econometrics. Angrist, J. “Estimation of Limited Dependent Oxford: Oxford University Press, 2003. Variable Models with Dummy Endoge- Arellano, M., and S. Bond. “Some Tests of nous Regressors: Simple Strategies for Specification for Panel Data: Monte Empirical Practice.” Journal of Business Carlo Evidence and an Application to and Economic Statistics, 29, 1, 2001, Employment Equations.” Review of pp. 2–15. Economics Studies, 58, 1991, pp. 277–297. Aneuryn-Evans, G., and A. Deaton. “Testing Arellano, M., and C. Borrego. “Symmetri- Linear versus Logarithmic Regression cally Normalized Instrumental Variable Models.” Review of Economic Studies, Estimation Using Panel Data.” Journal 47, 1980, pp. 275–291. of Business and Economic Statistics, 17, Anselin, L. Spatial Econometrics:Methods and 1999, pp. 36–49. Models, Dordrecht: Kluwer Academic Arellano, M., and O. Bover. “Another Look Publishers, 1988. at the Instrumental Variables Estimation Anselin, L. “Spatial Econometrics,” in B. of Error Components Models.” Journal Baltagi, ed., A Companion to Theoret- of Econometrics, 68, 1, 1995, pp. 29–52. ical Econometrics, Oxford: Blackwell Arrow, K., H. Chenery, B. Minhas, and R. Puplishers, 2001, pp. 310–330. Solow. “Capital-Labor Substitution Anselin, L., and S. Hudak. “Spatial Econo- and Economic Efficiency.” Review of metrics in Practice: A Review of Software Economics and Statistics, 45, 1961, Options.” Regional Science and Urban pp. 225–247. Economics, 22, 3, 1992, pp. 509–536. Ashenfelter, O., and J. Heckman. “The Antweiler, W. “Nested Random Effects Estimation of Income and Substitu- Estimation in Unbalanced Panel Data.” tion Effects in a Model of Family Journal of Econometrics, 101, 2001, Labor Supply.” Econometrica, 42, 1974, pp. 295–312. pp. 73–85. Greene-50558 book June 25, 2007 10:16

1102 References

Ashenfelter, O., and A. Kreuger. “Estimates Kmenta and Error Components Tech- of the Economic Return to School- niques.” Econometric Theory, 2, 1986, ing from a New Sample of Twins.” pp. 429–441. American Economic Review, 84, 1994, Baltagi, B., “Applications of a Necessary pp. 1157–1173. and Sufficient Condition for OLS to be Attfield, C. “Bartlett Adjustments for Sys- BLUE,” Statistics and Probability Letters, tems of Linear Equations with Linear 8, 1989, pp. 457–461. Restrictions.” Economics Letters, 60, Baltagi, B. Econometric Analysis of Panel 1998, pp. 277–283. Data, 2nd ed. New York: John Wiley and Avery, R. “Error Components and Seemingly Sons, 2001. Unrelated Regressions.” Econometrica, Baltagi, B. Econometric Analysis of Panel 45, 1977, pp. 199–209. Data, 3rd ed. New York: John Wiley and Avery, R., L. Hansen, and J. Hotz. “Multi- Sons, 2005. period Probit Models and Orthogonality Baltagi, G., S. Garvin, and S. Kerman. “Fur- Condition Estimation.” International ther Evidence on Seemingly Unrelated Economic Review, 24, 1983, pp. 21–35. Regressions with Unequal Number of Bai, J. “Estimation of a Change Point in Observations.” Annales D’Economie et Multiple Regression Models.” Review de Statistique, 14, 1989, pp. 103–115. of Economics and Statistics, 79, 1997, Baltagi, B., and Griffin, J. “Gasoline Demand pp. 551–563. in the OECD: An Application of Pooling Bai, J. “Likelihood Ratio Tests for Mul- and Testing Procedures.” European tiple Structural Changes.” Journal of Economic Review, 22, 1983, pp. 117–137. Econometrics, 91, 1999, pp. 299–323. Baltagi, B., and W. Griffin. “A Generalized Bai, J., R. Lumsdaine, and J. Stock. “Testing Error Component Model with Het- for and Dating Breaks in Integrated eroscedastic Disturbances.” International and Cointegrated Time Series.” Mimeo, Economic Review, 29, 1988, pp. 745–753. Department of Economics, MIT, 1991. Baltagi, B., and C. Kao. “Nonstationary Bai, J., and P. Perron. “Estimating and Testing Panels, Cointegration in Panels and Linear Models with Multiple Structural Dynamic Panels: A Survey.” Advances in Changes.” Econometrica, 66, 1998a, Econometrics, 15, pp. 7–51. pp. 47–78. Baltagi, B., and Q. Li. “A Transformation Bai, J., and P. Perron. “Testing for and Esti- That Will Circumvent the Problem of mation of Multiple Structural Changes.” Autocorrelation in an Error Component Econometrica, 66, 1998b, pp. 817–858. Model.” Journal of Econometrics, 48, Baillie, R. “The Asymptotic Mean Squared 1991a, pp. 385–393. Error of Multistep Prediction from the Baltagi, B., and Q. Li. “A Joint Test for Serial Regression Model with Autoregressive Correlation and Random Individual Ef- Errors.” Journal of the American Statisti- fects.” Statistics and Probability Letters, cal Association, 74, 1979, pp. 175–184. 11, 1991b, pp. 277–280. Baillie, R. “Long Memory Processes and Frac- Baltagi. B., and Q. Li. “Double Length tional Integration in Econometrics.” Jour- Artificial Regressions for Testing Spatial nal of Econometrics, 73, 1, 1996, pp. 5–59. Dependence.” Econometric Reviews, 20, Balestra, P., and M. Nerlove. “Pooling Cross 2001, pp. 31–40. Section and Time Series Data in the Baltagi, B., S. Song, and B. Jung. “The Estimation of a Dynamic Model: The Unbalanced Nested Error Compo- Demand for Natural Gas.” Econometrica, nent Regression Model.” Journal of 34, 1966, pp. 585–612. Econometrics, 101, 2001, pp. 357–381. Baltagi, B. “Pooling Under Misspecification: Baltagi, B., S. Song, and W. Koh. 
“Testing Some Monte Carlo Evidence on the Panel Data Regression Models with Greene-50558 book June 25, 2007 10:16

References 1103

Spatial Error Correlation.” Journal of of Dynamics in Binary Time-Series Econometrics, 117, 2003, pp. 123–150. Cross-Section Models: The Example Bannerjee, A. “Panel Data Unit Roots and of State Failure,” Manuscript, Depart- Cointegration: An Overview.” Oxford ment of Political Science, University of Bulletin of Economics and Statistics, 61, California, San Diego, 2001. 1999, pp. 607–629. Beck, N., J. Katz, R. Alvarez, G. Garrett, and Barnow, B., G. Cain, and A. Goldberger. P. Lange. “Government Partisanship, “Issues in the Analysis of Selectivity Labor Organization, and Macroeco- Bias.” In E. Stromsdorfer and G. Farkas, nomic Performance: A Corrigendum.” eds. Evaluation Studies Review Annual, American Political Science Review, 87, 4, Vol. 5, Beverly Hills: Sage Publications, 1993, pp. 945–948. 1981. Becker, S. and A. Ichino, “Estimation of Bartels, R., and D. Fiebig. “A Simple Char- Average Treatment Effects Based on acterization of Seemingly Unrelated Propensity Scores,” The Stata Journal, 2, Regressions Models in Which OLS is 2002, pp. 358–377. BLUE.” American Statistician, 45, 1992, Beggs, S., S. Cardell, and J. Hausman. “Assess- pp. 137–140. ing the Potential Demand for Electric Basmann, R. “A General Classical Method Cars.” Journal of Econometrics, 17, 1981, of Linear Estimation of Coefficients in a pp. 19–20. Structural Equation.” Econometrica, 25, Behrman, J, and P. Taubman. “Is Schooling 1957, pp. 77–83. ‘Mostly in the Genes’? Nature-Nurture Barten, A. “Maximum Likelihood Estimation Decomposition Using Data on Rela- of A Complete System of Demand tives.” Journal of Political Economy, 97, Equations.” European Economic Review, 6, 1989, pp. 1425–1446 Fall, 1, 1969, pp. 7–73. Bekker, P., and T. Wansbeek. “Identifica- Bazaraa, M., and C. Shetty. Nonlinear Pro- tion in Parametric Models.” In Baltagi, gramming: Theory and Algorithms. New B. ed., A Companion to Theoretical York: John Wiley and Sons, 1979. Econometrics, Oxford: Blackwell, 2001. Beach, C., and J. MacKinnon. “A Maximum Bell, K., and N. Bockstael. “Applying the Likelihood Procedure for Regression Generalized Method of Moments Ap- with Autocorrelated Errors.” Economet- proach to Spatial Problems Involving rica, 46, 1978a, pp. 51–58. Micro-Level Data.” Review of Economic Beach, C., and J. MacKinnon. “Full Maximum and Statistics 82, 1, 2000, pp. 72–82 Likelihood Estimation of Second Order Belsley, D. “On the Efficient Computation Autoregressive Error Models.” Journal of the Nonlinear Full-Information Max- of Econometrics, 7, 1978b, pp. 187–198. imum Likelihood Estimator.” Technical Beck, N., and J. Katz. “What to Do (and Not Report no. 5, Center for Computational to Do) with Time-Series-Cross-Section Research in Economics and Management Data in Comparative Politics.” Ameri- Science, Vol. II, Cambridge, MA, 1980. can Political Science Review, 89, 1995, Belsley, D., E. Kuh, and R. Welsh. Regression pp. 634–647. Diagnostics: Identifying Influential Data Beck, N., D. Epstein, and S. Jackman. “Es- and Sources of Collinearity. New York: timating Dynamic Time Series Cross John Wiley and Sons, 1980. Section Models with a Binary Dependent Ben-Akiva, M., and S. Lerman. Discrete Variable.” Manuscript, Department of Choice Analysis. London: MIT Press, Political Science, University of Califor- 1985. nia, San Diego, 2001. Ben-Porath, Y. “Labor Force Participation Beck, N., D. Epstein, S. Jackman, and Rates and Labor Supply.” Journal of S. O’ Halloran, “Alternative Models Political Economy, 81, 1973, pp. 697–704. 
Greene-50558 book June 25, 2007 10:16

1104 References

Bera, A., and C. Jarque. “Efficient Tests for of Industrial Economics. 43, 4, 1995, Normality, Heteroscedasticity, and Serial pp. 341–357. Independence of Regression Residuals: Bertschek, I., and M. Lechner. “Convenient Monte Carlo Evidence.” Economics Estimators for the Panel Probit Model.” Letters, 7, 1981, pp. 313–318. Journal of Econometrics, 87, 2, 1998, Bera, A., and C. Jarque. “Model Specification pp. 329–372. Tests: A Simultaneous Approach.” Jour- Berzeg, K. “The Error Components Model: nal of Econometrics, 20, 1982, pp. 59–82. Conditions for the Existence of Maxi- Bera, A., C. Jarque, and L. Lee. “Testing for mum Likelihood Estimates.” Journal of the Normality Assumption in Limited Econometrics, 10, 1979, pp. 99–102. Dependent Variable Models.” Mimeo, Beyer, A. “Modelling Money Demand in Ger- Department of Economics, University of many.” Journal of Applied Econometrics, Minnesota, 1982. 13, 1, 1998, pp. 57–76. Bernard, J., and M. Veall. “The Probability Bhargava, A., L. Franzini, and W. Naren- Distribution of Future Demand.” Journal dranathan. “Serial Correlation and of Business and Economic Statistics, 5, the Fixed Effects Model.” Review of 1987, pp. 417–424. Economic Studies, 49, 1982, pp. 533–549. Berndt, E. The Practice of Econometrics. Bhargava, A., and J. Sargan. “Estimating Reading, MA: Addison-Wesley, 1990. Dynamic Random Effects Models from Berndt, E., and L. Christensen. “The Translog Panel Data Covering Short Periods.” Function and the Substitution of Equip- Econometrica, 51, 1983, pp. 221–236. ment, Structures, and Labor in U.S. Bhat, C. “A Heteroscedastic Extreme Value Manufacturing, 1929–1968.” Journal of Model of Intercity Mode Choice.” Trans- Econometrics, 1, 1973, pp. 81–114. portation Research, 30, 1, 1995, pp. 16–29. Berndt, E., B. Hall, R. Hall, and J. Hausman. Bhat, C. “Accommodating Variations in “Estimation and Inference in Nonlinear Responsiveness to Level-of-Service Structural Models.” Annals of Economic Measures in Travel Mode Choice Mod- and Social Measurement, 3/4, 1974, eling.” Department of Civil Engineering, pp. 653–665. University of Massachusetts, Amherst, Berndt, E., and E. Savin. “Conflict Among 1996. Criteria for Testing Hypotheses in the Bhat, C. “Quasi-Random Maximum Simu- Multivariate Linear Regression Model.” lated Likelihood Estimation of the Mixed Econometrica, 45, 1977, pp. 1263– Multinomial Logit Model.” Manuscript, 1277. Department of Civil Engineering, Berndt, E., and D. Wood. “Technology, Prices, University of Texas, Austin, 1999. and the Derived Demand for Energy.” Bickel, P., and K. Doksum. Mathematical Review of Economics and Statistics, 57, Statistics. San Francisco: Holden Day, 1975, pp. 376–384. 2000. Beron, K., J. Murdoch, and M. Thayer. “Hier- Billingsley, P. Probability and Measure. 3rd ed. archical Linear Models with Application New York: John Wiley and Sons, 1995. to Air Pollution in the South Coast Air Binkley, J. “The Effect of VariableCorrelation Basin.” American Journal of Agricultural on the Efficiency of Seemingly Unrelated Economics, 81, 5, 1999, pp. 1123–1127. Regression in a Two Equation Model.” Berry, S., J. Levinsohn, and A. Pakes. “Auto- Journal of the American Statistical mobile Prices in Market Equilibrium.” Association, 77, 1982, pp. 890–895. Econometrica, 63, 4, 1995, pp. 841–890. Binkley, J., and C. Nelson. “A Note on the Bertschek, I. 
“Product and Process Innova- Efficiency of Seemingly Unrelated Re- tion as a Response to Increasing Imports gression.”American Statistician, 42, 1988, and Foreign Direct Investment, Journal pp. 137–139. Greene-50558 book June 25, 2007 10:16

References 1105

Birkes, D., and Y. Dodge. Alternative Methods Bollerslev, T., and J. Wooldridge. “Quasi- of Regression. New York: John Wiley and Maximum Likelihood Estimation and Sons, 1993. Inference in Dynamic Models with Time- Black, F. “Capital Market Equilibrium Varying Covariances.” Econometric with Restricted Borrowing.” Journal of Reviews, 11, 1992, pp. 143–172. Business, 44, 1972, pp. 444–454. Boot, J., and G. deWitt. “Investment De- Blanchard, O., and D. Quah. “The Dynamic mand: An Empirical Contribution to Effects of Aggregate Demand and Sup- the Aggregation Problem.” International ply Disturbances.” American Economic Economic Review, 1, 1960, pp. 3–30. Review, 79, 1989, pp. 655–673. Borjas, G., and G. Sueyoshi. “A Two-Stage Es- Blinder, A. “Wage Discrimination: Reduced timator for Probit Models with Structural Form and Structural Estimates.” Journal Group Effects.” Journal of Econometrics, of Human Resources, 8, 1973, pp. 436–455. 64, 1/2, 1994, pp. 165–182. Blundell, R., ed. “Specification Testing in B¨orsch-Supan, A., and V. Hajivassiliou. Limited and Discrete Dependent Vari- “Smooth Unbiased Multivariate Prob- able Models.” Journal of Econometrics, ability Simulators for Maximum Likeli- 34, 1/2, 1987, pp. 1–274. hood Estimation of Limited Dependent Blundell, R., M. Browning, and I. Crawford. Variable Models.” Journal of Economet- “Nonparametric Engel Curves and rics, 58, 3, 1990, pp. 347–368. Revealed Preference.” Econometrica, 71, Boskin, M. “A Conditional Logit Model 1, 2003, pp. 205–240. of Occupational Choice.” Journal of Blundell, R., and S. Bond. “Initial Con- Political Economy, 82, 1974, pp. 389–398. ditions and Moment Restrictions in Bound, J., D. Jaeger, and R. Baker. “Problems Dynamic Panel Data Models.” Journal of with Instrumental Variables Estimation Econometrics, 87, 1998, pp. 115–143. When the Correlation Between the Blundell, R., F. Laisney, and M. Lechner. Instruments and the Endogenous Ex- “Alternative Interpretations of Hours planatory Variables Is Weak.” Journal of Information in an Econometric Model of the American Statistical Association, 90, Labour Supply.” Empirical Economics, 1995, pp. 443–450. 18, 1993, pp. 393–415. Bourguignon, F., Ferriera, F., and P. Leite. Bockstael, N., I. Strand, K. McConnell, and F. “Beyond Oaxaca-Blinder: Accounting Arsanjani. “Sample Selection Bias in the for Differences in Household In- Estimation of Recreation Demand Func- come Distributions Across Countries.” tions: An Application to Sport Fishing.” Discussion Paper 452, Department of Land Economics, 66, 1990, pp. 40–49. Economics, Pontifica University, Catolica Boes, S., and R. Winkelmann. “Ordered do Rio de Janiero, 2002. http://www.econ Response Models.” Working Paper 0507, .pucrio.rb/pdf/td452.pdf. Socioeconomic Institute, University of Bover, O., and M. Arellano. “Estimating Zurich, 2005. Dynamic Limited Dependent Variable Bollerslev, T. “Generalized Autoregressive Models from Panel Data.” Investi- Conditional Heteroscedasticity.” Journal gaciones Economicas, Econometrics of Econometrics, 31, 1986, pp. 307–327. Special Issue, 21, 1997, pp. 141–165. Bollerslev, T., R. Chou, and K. Kroner. Box, G., and D. Cox. “An Analysis of Transfor- “ARCH Modeling in Finance.” Journal mations.” Journal of the Royal Statistical of Econometrics, 52, 1992, pp. 5–59. Society, 1964, Series B, 1964, pp. 211– Bollerslev, T., and E. Ghysels. “Periodic Au- 264. toregressive Conditional Heteroscedas- Box, G., and G. Jenkins. 
Time Series Analysis: ticity.” Journal of Business and Economic Forecasting and Control. 2nd ed. San Statistics, 14, 1996, pp. 139–151. Francisco: Holden Day, 1984. Greene-50558 book June 25, 2007 10:16

1106 References

Box, G., and M. Muller. “A Note on the Gen- Regression Relationships over Time.” eration of Random Normal Deviates.” Journal of the Royal Statistical Society, Annals of Mathematical Statistics, 29, Series B, 37, 1975, pp. 149–172. 1958, pp. 610–611. Brown, P., A. Kleidon, and T. Marsh. “New Box, G., and D. Pierce. “Distribution of Resid- Evidence on the Nature of Size Related ual Autocorrelations in Autoregressive Anomalies in Stock Prices.” Journal of Moving Average Time Series Mod- Financial Economics, 12, 1983, pp. 33–56. els.” Journal of the American Statistical Brown, B., and M. Walker. “Stochastic Spec- Association, 65, 1970, pp. 1509–1526. ification in Random Production Models Boyes, W., D. Hoffman, and S. Low. “An of Cost-Minimizing Firms.” Journal of Econometric Analysis of the Bank Econometrics, 66, 1995, pp. 175–205. Credit Scoring Problem.” Journal of Brown, C., and R. Moffitt. “The Effect of Ig- Econometrics, 40, 1989, pp. 3–14. noring Heteroscedasticity on Estimates Brannas, K. “Explanatory Variables in the of the Tobit Model.” Mimeo, University AR(1) Count Data Model.” Working Pa- of Maryland, Department of Economics, per No. 381, Department of Economics, June 1982. University of Umea, Sweden, 1995. Brundy, J., and D. Jorgenson. “Consistent and Brant, R. “Assessing Proportionality in the Efficient Estimation of Systems of Simul- Proportional Odds Model for Ordered taneous Equations by Means of Instru- Logistic Regression.” Biometrics, 46, mental Variables.” Review of Economics 1990, pp. 1171–1178. and Statistics, 53, 1971, pp. 207–224. Brannas, K., and P. Johanssen. “Panel Data Burnett, N. “Gender Economics Courses in Regressions for Counts.” Manuscript, Liberal Arts Colleges.” Journal of Eco- Department of Economics, University of nomic Education, 28, 4, 1997, pp. 369–377. Umea, Sweden, 1994. Burnside, C. and M. Eichenbaum. “Small- Breitung, J. “The Local Power of Some Unit Sample Properties of GMM-Based Wald Root Tests for Panel Data.” Advances in Tests.” Journal of Business and Economic Econometrics, 15, 2000, pp. 161–177. Statistics, 14, 3, 1996, pp. 294–308. Breslaw, J. “Evaluation of Multivariate Nor- Buse, A. “Goodness of Fit in Generalized mal Probabilities Using a Low Variance Least Squares Estimation.” American Simulator.” Review of Economics and Statistician, 27, 1973, pp. 106–108. Statistics, 76, 1994, pp. 673–682. Buse, A. “The Likelihood Ratio, Wald, and Breusch, T. “Testing for Autocorrelation in Lagrange Multiplier Tests: An Expos- Dynamic Linear Models.” Australian itory Note.” American Statistician, 36, Economic Papers, 17, 1978, pp. 334–355. 1982, pp. 153–157. Breusch, T., and A. Pagan. “A Simple Test Butler, J., and R. Moffitt. “A Computationally for Heteroscedasticity and Random Efficient Quadrature Procedure for the Coefficient Variation.” Econometrica, 47, One Factor Multinomial Probit Model.” 1979, pp. 1287–1294. Econometrica, 50, 1982, pp. 761–764. Breusch, T., and A. Pagan. “The LM Test and Butler, J., and P. Chatterjee. “Pet Econo- Its Applications to Model Specification metrics: Ownershop of Cats and Dogs.” in Econometrics.” Review of Economic Working Paper 95-WP1, Department of Studies, 47, 1980, pp. 239–254. Economics, Vanderbilt University, 1995. Brock, W., and S. Durlauf. “Discrete Choice Butler, J., and P. Chatterjee. “Tests of the with Social Interactions.” Working Pa- Specification of Univariate and Bivariate per #2007, Department of Economics, Ordered Probit.” Review of Economics University of Wisconsin, Madison, 2000. 
and Statistics, 79, 1997, pp. 343–347. Brown, B., J. Durbin, and J. Evans. “Tech- Butler, J., T. Finegan, and J. Siegfried. niques for Testing the Constancy of “Does More Calculus Improve Student Greene-50558 book June 25, 2007 10:16

References 1107

Learning in Intermediate Micro and Paper 2924, NBER, Cambridge, MA, Macro Economic Theory?” American 1989. Economic Review, 84, 1994, pp. 206–210. Campbell, J., and P. Perron. “Pitfalls and Butler, R., J. McDonald, R. Nelson, and Opportunities: What Macroeconomists S. White. “Robust and Partially Adaptive Should Know About Unit Roots.” Estimation of Regression Models.” National Bureau of Economic Re- Review of Economics and Statistics, 72, search, Macroeconomics Conference, 1990, pp. 321–327. Cambridge, MA, February 1991. Calhoun, C. “BIVOPROB: Computer Cappellari, L., and S. Jenkins, “Calculation Program for Maximum-Likelihood of Multivariate Normal Probabilities Estimation of Bivariate Ordered-Probit by Simulation, with Applications to Models for Censored Data, Version Maximum Simulated Likelihood Esti- 11.92.” Economic Journal, 105, 1995, mation,” Discussion Paper 2112, IZA, pp. 786–787. 2006. Cameron, A., and P. Trivedi. “Econometric Carey, K., “A Panel Data Design for Estima- Models Based on Count Data: Com- tion of Hospital Cost Functions,” Review parisons and Applications of Some of Economics and Statistics, 79, 3, 1997, Estimators and Tests.” Journal of Applied pp. 443–453. Econometrics, 1, 1986, pp. 29–54. Carlin, B., and S. Chib. “Bayesian Model Cameron, A., and P. Trivedi. “Regression- Choice via Markov Chain Monte Carlo.” Based Tests for Overdispersion in the Journal of the Royal Statistical Society, Poisson Model.” Journal of Econometrics, Series B, 57, 1995, pp. 408–417. 46, 1990, pp. 347–364. Case, A. “Spatial Patterns in Household Cameron, C., and P. Trivedi. Regression Demand.” Econometrica, 59, No. 4, 1991, Analysis of Count Data. New York: pp. 953–965. Cambridge University Press, 1998. Casella, G., and E. George. “Explaining the Cameron, C., and P. Trivedi. Microecono- Gibbs Sampler.” American Statistician, metrics: Methods and Applications. 46, 3, 1992, pp. 167–174. Cambridge: Cambridge University Press, Caudill, S. “An Advantage of the Linear 2005. Probability Model Over Probit or Logit.” Cameron, C., T. Li, P.Trivedi, and D. Zimmer. Oxford Bulletin of Economics and “Modeling the Differences in Counted Statistics, 50, 1988, pp. 425–427. Outcomes Using Bivariate Copula Mod- Caves, D., L. Christensen, and M. Trethaway. els: With Applications to Mismeasured “Flexible Cost Functions for Multiprod- Counts.” Econometrics Journal, 7, 2004, uct Firms.” Review of Economics and pp. 566–584. Statistics, 62, 1980, pp. 477–481. Cameron, C., and F. Windmeijer. “R-Squared Cecchetti, S. “The Frequency of Price Adjust- Measures for Count Data Regression ment: A Study of the Newsstand Prices Models with Applications to Health Care of Magazines.” Journal of Econometrics, Utilization.” Working Paper No. 93–24, 31, 3, pp. 255–274. Department of Economics, University of Cecchetti, S.. “Comment.” In G. Mankiw, ed., California, Davis, 1993. Monetary Policy. Chicago: University of Campbell, J., A. Lo, and A. MacKinlay. Chicago Press, 1994. The Econometrics of Financial Markets. Cecchetti, S., and R. Rich. “Structural Esti- Princeton: Princeton University Press, mates of the U.S. Sacrifice Ratio. Journal 1997. of Business and Economic Statistics, 19, Campbell, J., and G. Mankiw. “Consumption, 4, 2001, pp. 416–427. Income, and Interest Rates: Reinterpret- Chamberlain, G. “Omitted Variable Bias in ing the Time Series Evidence.” Working Panel Data: Estimating the Returns to Greene-50558 book June 25, 2007 10:16

1108 References

Schooling.” Annales de L’Insee, 30/31, Normal Linear Model.” Journal of 1978, pp. 49–82. Econometrics, 34, 1987, pp. 33–62. Chamberlain, G. “Analysis of Covariance with Chesher, A., T. Lancaster, and M. Irish. “On Qualitative Data.” Review of Economic Detecting the Failure of Distributional Studies, 47, 1980, pp. 225–238. Assumptions.” Annales de L’Insee, 59/60, Chamberlain, G. “Multivariate Regression 1985, pp. 7–44. Models For Panel Data.” Journal of Chung, C., and A. Goldberger. “Proportional Econometrics, 18, 1, 1982, pp. 5–46. Projections in Limited Dependent Vari- Chamberlain, G. “Panel Data.” In Z. Griliches able Models.” Econometrica, 52, 1984, and M. Intriligator, eds., Handbook of pp. 531–534. Econometrics. Amsterdam: North Cheung, Y. “Long Memory in Foreign- Holland, 1984. Exchange Rates.” Journal of Business Chamberlain, G. “Heterogeneity, Omitted and Economic Statistics, 11, 1, 1993, Variable Bias and Duration Depen- pp. 93–102. dence.” in J. Heckman and B. Singer, Chiappori, R. “Econometric Models of Insur- eds., Longitudinal Analysis of Labor ance Under Asymmetric Information.” Market Data. Cambridge: Cambridge Manuscript, Department of Economics, University Press, 1985. University of Chicago, 1998. Chamberlain, G. “Asymptotic Efficiency in Chib, S. “Bayes Regression for the Tobit Estimation with Conditional Moment Censored Regression Model.” Journal of Restrictions.” Journal of Econometrics, Econometrics, 51, 1992, pp. 79–99. 34, 1987, pp. 305–334. Chou, R. “Volatility Persistence and Stock Chamberlain, G., and E. Leamer. “Ma- Valuations: Some Empirical Evidence trix Weighted Averages and Posterior Using GARCH.” Journal of Applied Bounds.” Journal of the Royal Statistical Econometrics, 3, 1988, pp. 279–294. Society, Series B, 1976, pp. 73–84. Chib, S., and E. Greenberg. “Understanding Chambers, R. Applied Production Analysis: A the Metropolis-Hastings Alagorithm.” Dual Approach. New York: Cambridge The American Statistician, 49, 4, 1995, University Press, 1988. pp. 327–335. Charlier, E., B. Melenberg, and A. Van Soest. Chib, S., and E. Greenberg. “Markov Chain “A Smoothed Maximum Score Estimator Monte Carlo Simulation Methods in for the Binary Choice Panel Data Model Econometrics.” Econometric Theory, 12, with an Application to Labor Force 1996, pp. 409–431. Participation.” Neerlander, 49, Chow, G. “Tests of Equality Between Sets of 1995, pp. 324–343. Coefficients in Two Linear Regressions.” Chatfield, C. The Analysis of Time Series: An Econometrica, 28, 1960, pp. 591–605. Introduction, 5th ed. London: Chapman Chow, G. “Random and Changing Coefficient and Hall, 1996. Models.” In Griliches, Z. and M. Intrili- Chavez, J., and K. Segerson. “Stochastic gator, eds., Handbook of Econometrics, Specification and Estimation of Share Vol. 2, Amsterdam: North Holland, 1984. Equation Systems.” Journal of Econo- Christensen, L., and W. Greene. “Economies metrics, 35, 1987, pp. 337–358. of Scale in U.S. Electric Power Genera- Chen, T. “Root N Consistent Estimation of tion.” Journal of Political Economy, 84, a Panel Data Sample Selection Model.” 1976, pp. 655–676. Manuscript, Hong Kong University of Christensen, L., D. Jorgenson, and L. Lau. Science and Technology, 1998. “Transcendental Logarithmic Utility Chesher, A., and M. Irish. “Residual Analysis Functions.” American Economic Review, in the Grouped Data and Censored 65, 1975, pp. 367–383. Greene-50558 book June 25, 2007 10:16

References 1109

Christofides, L., T. Stengos, and R Swidinsky. of Monetary Economics, 16, 1985, “On the Calculation of Marginal Effects pp. 283–308. in the Bivariate Probit Model.” Eco- Cornwell, C., and P. Rupert. “Efficient nomics Letters, 54, 3, 1997, pp. 203–208. Estimation with Panel Data: An Em- Christofides, L., T. Hardin, and R. Stengos. pirical Comparison of Instrumental “On the Calculation of Marginal Effects Variable Estimators.” Journal of Applied in the Bivariate Probit Model: Corri- Econometrics, 3, 1988, pp. 149–155. gendum.” Economics Letters, 68, 2000, Cornwell, C., and P. Schmidt. “Panel Data pp. 339–340. with Cross-Sectional Variation in Slopes Cleveland, W. “Robust Locally Weighted as Well as in Intercept,” Econometrics Regression and Smoothing Scatter Workshop Paper No. 8404, Michi- Plots.” Journal of the American Statistical gan State University, Department of Association, 74, 1979, pp. 829–836. Economics, 1984. Coakley, J., F. Kulasi, and R. Smith. “Current Coulson, N., and R. Robins. “Aggregate Account Solvency and the Feldstein- Economic Activity and the Variance of Horioka Puzzle.” Economic Journal, 106, Inflation: Another Look.” Economics 1996, pp. 620–627. Letters, 17, 1985, pp. 71–75. Cochrane, D., and G. Orcutt. “Application Council of Economic Advisors. Economic Re- of Least Squares Regression to Rela- port of the President. Washington, D.C.: tionships Containing Autocorrelated United States Government Printing Error Terms.” Journal of the American Office, 1994. Statistical Association, 44, 1949, pp. 32–61. Cox, D. “Tests of Separate Families of Hy- Congdon, P. Bayesian Models for Categorical potheses.” Proceedings of the Fourth Data. New York: John Wiley and Sons, Berkeley Symposium on Mathematical 2005. Statistics and Probability, Vol. 1. Berke- Conniffe, D. “Estimating Regression Equa- ley: University of California Press, 1961. tions with Common Explanatory Cox, D. “Further Results on Tests of Separate Variables but Unequal Numbers of Families of Hypotheses.” Journal of the Observations” Journal of Econometrics, Royal Statistical Society, Series B, 24, 27, 1985, pp. 179–196. 1962, pp. 406–424. Conniffe, D. “Covariance Analysis and Cox, D. Analysis of Binary Data. London: Seemingly Unrelated Regression Equa- Methuen, 1970. tions.” American Statistician, 36, 1982a, Cox, D. “Regression Models and Life Tables.” pp. 169–171. Journal of the Royal Statistical Society, Conniffe, D. “A Note on Seemingly Unrelated Series B, 34, 1972, pp. 187–220. Regressions.” Econometrica, 50, 1982b, Cox, D., and D. Oakes. Analysis of Survival pp. 229–233. Data. New York: Chapman and Hall, Conway, D., and H. Roberts. “Reverse 1985. Regression, Fairness and Employment Cragg, J. “On the Relative Small-Sample Discrimination.” Journal of Business Properties of Several Structural- and Economic Statistics, 1, 1, 1983, Equation Estimators.” Econometrica, 35, pp. 75–85. 1967, pp. 89–110. Contoyannis, C., A. Jones, and N. Rice. “The Cragg, J. “Some Statistical Models for Limited Dynamics of Health in the British House- Dependent Variables with Application hold Panel Survey.” Journal of Applied to the Demand for Durable Goods.” Econometrics, 19, 4, 2004, pp. 473–503. Econometrica, 39, 1971, pp. 829–844. Cooley, T., and S. LeRoy. “Atheoretical Cragg, J. “Estimation and Testing in Testing Macroeconomics: A Critique.” Journal in Time Series Regression Models with Greene-50558 book June 25, 2007 10:16

1110 References

Heteroscedastic Disturbances.” Journal Dastoor, N. “Some Aspects of Testing of Econometrics, 20, 1982, pp. 135–157. Nonnested Hypotheses.” Journal of Cragg, J. “More Efficient Estimation in the Econometrics, 21, 1983, pp. 213–228. Presence of Heteroscedasticity of Un- Davidson, A., and D. Hinkley. Bootstrap Meth- known Form.” Econometrica, 51, 1983, ods and Their Application. Cambridge: pp. 751–763. Cambridge University Press, 1997. Cragg., J., “Using Higher Moments to Es- Davidson, J. Econometric Theory. Oxford: timate the Simple Errors in Variables Blackwell, 2000. Model,” Rand Journal of Economics, 28, Davidson, R., and J. MacKinnon. “Several 0, 1997, pp. S71–S91. Tests for Model Specification in the Cragg, J., and R. Uhler. “The Demand for Presence of Alternative Hypotheses.” Automobiles.” Canadian Journal of Econometrica, 49, 1981, pp. 781–793. Economics, 3, 1970, pp. 386–406. Davidson, R., and J. MacKinnon. “Conve- Cram`er, H. Mathematical Methods of Statis- nient Specification Tests for Logit and tics. Princeton: Princeton University Probit Models.” Journal of Econometrics, Press, 1948. 25, 1984, pp. 241–262. Cramer, J. “Predictive Performance of the Davidson, R., and J.MacKinnon. “TestingLin- Binary Logit Model in Unbalanced ear and Loglinear Regressions Against Samples.” Journal of the Royal Statistical Box–Cox Alternatives.” Canadian Jour- Society, Series D (The Statistician), 48, nal of Economics, 18, 1985, pp. 499–517. 1999, pp. 85–94. Davidson, R. and J. MacKinnon. Estimation Culver, S. and D. Pappell. “Is There a Unit and Inference in Econometrics. New Root in the Inflation Rate? Evidence York: Oxford University Press, 1993. from Sequential Break and Panel Data Davidson, R., and J. MacKinnon. Econometric Model.” Journal of Applied Economet- Theory and Methods. New York: Oxford rics, 12, 1997, pp. 435–444. University Press, 2004. Cumby, R., J. Huizinga, and M. Obstfeld. Davidson, R. and J. MacKinnon. “Boot- “Two-Step, Two-Stage Least Squares strap Methods in Econometrics.” In Estimation in Models with Rational T. Mills and K. Patterson, eds., Palgrave Expectations.” Journal of Econometrics, Handbook of Econometrics, Volume 1: 21, 1983, pp. 333–355. Econometric Theory, Hampshire: Pal- D’Addio, A., Eriksson, T., and P. Frijters. grave Macmillan, 2006. “An Analysis of the Determinants of Deaton, A. “Demand Analysis.” In Job Satisfaction when Individuals’ Base- Z. Griliches and M. Intriligator, eds., line Satisfaction Levels May Differ.” Handbook of Econometrics, Vol. 2, Working Paper 2003-16, Center for pp. 1767–1839, Amsterdam: North Applied Microeconometrics, University Holland, 1986. of Copenhagen, 2003. Deaton, A., and J. Muellbauer. Economics Dahlberg, M., and E. Johansson. “An Ex- and Consumer Behavior. New York: amination of the Dynamic Behaviour Cambridge University Press, 1980. of Local Governments Using GMM Deaton, A. “Model Selection Procedures, or, Bootstrapping Methods.”Journal of Does the Consumption Function Exist?” Applied Econometrics, 15, 2000, pp. 401– In G. Chow and P. Corsi, eds., Evaluating 416. the Reliability of Macroceonomic Models. Das, M., and A. van Soest. “A Panel Data New York: John Wiley and Sons, 1982. Model for Subjective Information on Deb, P., and P. K. Trivedi. “The Structure of Household Income Growth.” Journal of Demand for Health Care: Latent Class Economic Behavior and Organization versus Two-part Models.” Journal of 40, 2000, 409–426. Health Economics, 21, 2002, pp. 601–625. Greene-50558 book June 25, 2007 10:16

References 1111

Debreu, G. “The Coefficient of Resource Implications.” American Statistician, 40, Utilization.” Econometrica, 19, 3, 1951, 1, 1986, pp. 12–26. pp. 273–292. Dickey, D., and W. Fuller. “Distribution of DeMaris, A. Regression with Social Data: the Estimators for Autoregressive Time Modeling Continuous and Limited Re- Series with a Unit Root.” Journal of sponse Variables. Hoboken, NJ: Wiley, the American Statistical Association, 74, 2004. 1979, pp. 427–431. Dehejia, R., and S. Wahba. “Causal Ef- Dickey, D., and W. Fuller. “Likelihood Ratio fects in Non-Experimental Studies: Tests for Autoregressive Time Series Evaluating the Valuation of Training with a Unit Root.” Econometrica, 49, Programs.” Journal of the American 1981, pp. 1057–1072. Statistical Association, 94, 1999, pp. 1053– Dickey, D., D. Jansen, and D. Thornton. “A 1062. Primer on Cointegration with an Appli- Dempster, A., N. Laird, and D. Rubin. cation to Money and Income.” Federal “Maximum Likelihood Estimation from Reserve Bank of St. Louis, Review, 73, 2, Incomplete Data via the EM Algorithm.” 1991, pp. 58–78. Journal of the Royal Statistical Society, Diebold, F. “The Past, Present, and Future Series B, 39, 1977, pp. 1–38. of Macroeconomic Forecasting.” Journal Denton, F. “Single Equation Estimators of Economic Perspectives, 12, 2, 1998, and Aggregation Restrictions When pp. 175–192. Equations Have the Same Set of Regres- Diebold, F. Elements of Forecasting, 2nd ed. sors.” Journal of Econometrics, 8, 1978, Cincinnati: South-Western Publishing, pp. 173–179. 2003. DesChamps, P. “Full Maximum Likelihood Diebold, F., and M. Nerlove. “Unit Roots Estimation of Dynamic Demand Mod- in Economic Time Series: A Selective els.” Journal of Econometrics, 82, 1998, Survey.” In T. Bewley, ed., Advances in pp. 335–359. Econometrics, Vol. 8. New York: JAI Dezhbaksh, H. “The Inappropriate Use of Press, 1990. Serial Correlation Tests in Dynamic Dielman, T. Pooled Cross-Sectional and Linear Models.” Review of Economics Time Series Data Analysis. New York: and Statistics, 72, 1990, pp. 126–132. Marcel-Dekker, 1989. Dhrymes, P. “Restricted and Unrestricted Diewert, E. “Applications of Duality Theory.” Reduced Forms.” Econometrica, 41, In M. Intriligator and D. Kendrick, 1973, pp. 119–134. Frontiers in Quantitative Economics. Dhrymes, P. Distributed Lags: Problems Amsterdam: North Holland, 1974. of Estimation and Formulation. San Diggle, R., P. Liang, and S. Zeger. Analysis Francisco: Holden Day, 1971. of Longitudinal Data. Oxford: Oxford Dhrymes, P. “Limited Dependent Variables.” University Press, 1994. In Z. Griliches and M. Intriligator, Ding, Z., C. Granger, and R. Engle. “A Long eds., Handbook of Econometrics,Vol. 2, Memory Property of Stock Returns and Amsterdam: North Holland, 1984. a New Model.” Journal of Empirical Dhrymes, P. “Specification Tests in Simul- Finance, 1, 1993, pp. 83–106. taneous Equation Systems.” Journal of Domowitz, I., and C. Hakkio. “Conditional Econometrics, 64, 1994, pp. 45–76. Variance and the Risk Premium in the Dhrymes, P. Time Series: Unit Roots and Foreign Exchange Market.” Journal Cointegration. New York: Academic of International Economics, 19, 1985, Press, 1998. pp. 47–66. Dickey, D., W. Bell, and R. Miller. “Unit Doan, T. RATS 6.3, User’s Manual. Evanston, Roots in Time Series Models: Tests and IL: Estima, 2007. Greene-50558 book June 25, 2007 10:16

1112 References

Doob, J., Stochastic Process. New York: John Regression—II.” Biometrika, 38, 1951, Wiley and Sons, 1953. pp. 159–178. Doppelhofer, G., R. Miller, and S. Sala-i- Durbin, J., and G. Watson. “Testing for Martin. “Determinants of Long-Term Serial Correlation in Least Squares Growth: A Bayesian Averaging of Clas- Regression—III.” Biometrika, 58, 1971, sical Estimates (BACE) Approach.” pp. 1–42. NBER Working Paper Number 7750, Dwivedi, T., and K. Srivastava. “Optimality June, 2000. of Least Squares in the Seemingly Un- Dufour, J. “Some Impossibility Theorems related Regressions Model.” Journal of in Econometrics with Applications Econometrics, 7, 1978, pp. 391–395. to Structural and Dynamic Models.” Efron, B. “Regression and ANOVA with Econometrica, 65, 1997, pp. 1365–1389. Zero-One Data: Measures of Residual Dufour, J. “Identification, Weak Instruments Variation.” Journal of the American Sta- and Statistical Inference in Economet- tistical Association, 73, 1978, pp. 113–212. rics.” Scientific Series, Paper Number Efron, B. “Bootstrapping Methods: An- 2003s-49, CIRANO, University of other Look at the Jackknife.” Annals of Montreal, 2003. Statistics, 7, 1979, pp. 1–26. Dufour, J. and J. Jasiak. “Finite Sample Efron, B. and R. Tibshirani. An Introduction Limited Information Inference Methods to the Bootstrap. New York: Chapman for Structural Equations and Models and Hall, 1994. with Generated Regressors.” Inter- Eicker, F. “Limit Theorems for Regression national Economic Review, 42, 2001, with Unequal and Dependent Errors.” In pp. 815–843. L. LeCam and J. Neyman, eds., Proceed- Duncan, G. “Sample Selectivity as a Proxy ings of the Fifth Berkeley Symposium on Variable Problem: On the Use and Mathematical Statistics and Probability. Misuse of Gaussian Selectivity Correc- Berkeley: University of California Press, tions.” Research in Labor Economics, 1967, pp. 59–82. Supplement 2, 1983, pp. 333–345. Eisenberg, D., and B. Rowe. “The Effect of Duncan, G. “A Semiparametric Censored Serving in the Vietnam War on Smoking Regression Estimator.” Journal of Behavior Later in Life.” Manuscript, Econometrics, 31, 1986a, pp. 5–34. School of Public Health, University of Duncan, G., ed. “Continuous/Discrete Econo- Michigan, 2006. metric Models with Unspecified Error Elliot, G., T. Rothenberg, and J. Stock. Distribution.” Journal of Econometrics, “Efficient Tests for an Autoregressive 32, 1, 1986b, pp. 1–187. Unit Root.” Econometrica, 64, 1996, Durbin, J. “Errors in Variables.” Review of pp. 813–836. the International Statistical Institute, 22, Enders, W. Applied Econometric Time Series, 1954, pp. 23–32. 2nd ed. New York: John Wiley and Sons, Durbin, J. “Testing for Serial Correlation in 2004. Least Squares Regression When Some of Engle, R. “Autoregressive Conditional Het- the Regressors Are Lagged Dependent eroscedasticity with Estimates of the Variables.” Econometrica, 38, 1970, Variance of United Kingdom Inflations.” pp. 410–421. Econometrica, 50, 1982, pp. 987–1008. Durbin, J., and G. Watson. “Testing for Engle, R. “Estimates of the Variance of U.S. Serial Correlation in Least Squares Inflation Based on the ARCH Model.” Regression—I.” Biometrika, 37, 1950, Journal of Money, Credit, and Banking, pp. 409–428. 15, 1983, pp. 286–301. Durbin, J., and G. Watson. “Testing for Engle, R. “Wald, Likelihood Ratio, and La- Serial Correlation in Least Squares grange Multiplier Testsin Econometrics.” Greene-50558 book June 25, 2007 17:11

References 1113

In Z. Griliches and M. Intriligator, eds., Health Organization GPE Discussion Handbook of Econometrics, Vol. 2. Paper, No. 30, EIP/GPE/EQC, 2000b. Amsterdam: North Holland, 1984. Evans, G., and N. Savin. “Testing for Unit Engle, R., and C. Granger. “Co-integration Roots: I.” Econometrica, 49, 1981, and Error Correction: Representation, pp. 753–779. Estimation, and Testing.” Econometrica, Evans, G., and N. Savin. “Testing for Unit 35, 1987, pp. 251–276. Roots: II.” Econometrica, 52, 1984, Engle, R., and D. Hendry. “Testing Super pp. 1241–1269. Exogeneity and Invariance.” Journal of Evans, M., N. Hastings, and B. Peacock. Econometrics, 56, 1993, pp. 119–139. Statistical Distributions, 2nd ed. New Engle, R., D. Hendry, and J. Richard. “Ex- York: John Wiley and Sons, 1993. ogeneity.” Econometrica, 51, 1983, Fair, R. “A Note on Computation of the Tobit pp. 277–304. Estimator.” Econometrica, 45, 1977, Engle, R., D. Hendry, and D. Trumble. “Small pp. 1723–1727. Sample Properties of ARCH Estima- Fair, R. “A Theory of Extramarital Affairs.” tors and Tests.” Canadian Journal of Journal of Political Economy, 86, 1978, Economics, 18, 1985, pp. 66–93. pp. 45–61. Engle, R., and D. Kraft. “Multiperiod Forecast Fair, R. Specification and Analysis of Macroe- Error Variances of Inflation Estimated conomic Models. Cambridge: Harvard from ARCH Models.” In A. Zellner, ed., University Press, 1984. Applied Time Series Analysis of Eco- Farrell, M. “The Measurement of Productive nomic Data. Washington D.C.: Bureau of Efficiency.” Journal of the Royal Statisti- the Census, 1983. cal Society, Series A, General, 120, part 3, Engle, R., and D. McFadden, eds. Handbook 1957, pp. 253–291. of Econometrics, Vol. 4. Amsterdam: Fiebig, D. “Seemingly Unrelated Regression.” North Holland, 1994. In B. Baltagi, ed., A Companion to Theo- Engle, R., and M. Rothschild. “ARCHModels retical Econometrics, Oxford: Blackwell, in Finance.” Journal of Econometrics, 52, 2001. 1992, pp. 1–311. Fiebig, D., R. Bartels, and D. Aigner. “A Ran- Engle, R., D. Lilen, and R. Robins. “Esti- dom Coefficient Approach to the Estima- mating Time Varying Risk Premia in the tion of End Use Load Profiles.” Journal Term Structure: The ARCH-M Model.” of Econometrics, 50, 1991, pp. 297–328. Econometrica, 55, 1987, pp. 391–407. Feldstein, M. “The Error of Forecast in Econo- Engel, R., and B. Yoo. “Forecasting and Test- metric Models When the Forecast-Period ing in Cointegrated Systems.” Journal of Exogenous Variables Are Stochastic.” Econometrics, 35, 1987, pp. 143–159. Econometrica, 39, 1971, pp. 55–60. Estes, E., and B. Honor`e. “Partially Lin- Fernandez, A., and J. Rodriguez-Poo. “Es- ear Regression Using One Nearest timation and Testing in Female Labor Neighbor.” Manuscript, Department of Participation Models: Parametric and Economics, Princeton University, 1995. Semiparametric Models.” Econometric Evans, D., A. Tandon, C. Murray, and J. Lauer. Reviews, 16, 1997, pp. 229–248. “The Comparative Efficiency of National Fernandez, L. “Nonparametric Maximum Health Systems in Producing Health: An Likelihood Estimation of Censored Analysis of 191 Countries.” World Health Regression Models.” Journal of Econo- Organization, GPE Discussion Paper, metrics, 32, 1, 1986, pp. 35–38. No. 29, EIP/GPE/EQC, 2000a. Ferrer-i-Carbonel, A., and P. Frijters.“The Evans D., A. Tandon, C. Murray, and J. Lauer. Effect of Metholodogy on the Determi- “Measuring Overall Health System nants of Happiness.” Economic Journal, Performance for 191 Countries.” World 114, 2004, pp. 641–659. 
Greene-50558 book June 25, 2007 10:16

1114 References

Fin, T. and P. Schmidt. “A Test for the Tobit French, K., W. Schwert, and R. Stambaugh. Specification versus an Alternative “Expected Stock Returns and Volatility.” Suggested by Cragg.” Review of Eco- Journal of Financial Economics, 19, 1987, nomics and Statistics, 66, 1984, pp. 174– pp. 3–30. 177. Fried, H., K. Lovell, and S. Schmidt. The Finney, D. Probit Analysis. Cambridge: Measurement of Productive Efficiency Cambridge University Press, 1971. and Productivity, 2nd ed. Oxford, UK: Fiorentini, G., G. Calzolari, and L. Panat- Oxford University Press, 2007. toni. “Analytic Derivatives and the Friedman, M. A Theory of the Consump- Computation of GARCH Estimates.” tion Function. Princeton: Princeton Journal of Applied Econometrics, 11, University Press, 1957. 1996, pp. 399–417. Frijters P., J. Haisken-DeNew, and M. Shields. Fisher, F. “Tests of Equality Between Sets of “The Value of Reunification in Germany: Coefficients in Two Linear Regressions: An Analysis of Changes in Life Satisfac- An Expository Note.” Econometrica, 28, tion.” Journal of Human Resources, 39, 1970, pp. 361–366. 3, 2004, pp. 649–674. Fisher, G., and D. Nagin. “Random Versus Frisch, R. “Editorial.” Econometrica, 1, 1933, Fixed Coefficients Coefficient Quantal pp. 1–4. Choice Models.” In C. Manski and Frisch, R., and F. Waugh. “Partial Time D. McFadden, eds., Structural Analysis of Regressions as Compared with Indi- Discrete Data with Econometric Applica- vidual Trends.” Econometrica, 1, 1933, tions. Cambridge: MIT Press, 1981. pp. 387–401. Fisher, R. “The Theory of Statistical Esti- Fry, J., T. Fry, and K. McLaren. “The Stochastic mation” Proceedings of the Cambridge Specification of Demand Share Equa- Philosophical Society, 22, 1925, pp. 700– tions: Restricting Budget Shares to the 725. Unit Simplex.” Journal of Econometrics, Fleissig, A., and J. Strauss. “Unit Root Tests 73, 1996, pp. 377–386. on Real Wage Panel Data for the G7.” Fuller, W. Introduction to Statistical Time Economics Letters, 54, 1997, pp. 149–155. Series. New York: John Wiley and Sons, Fletcher, R. Practical Methods of Optimiza- 1976. tion. New York: John Wiley and Sons, Fuller, W., and G. Battese. “Estimation 1980. of Linear Models with Crossed-Error Florens, J., D. Fougere, and M. Mouchart. Structure.” Journal of Econometrics, 2, “Duration Models.” In L. Matyas and 1974, pp. 67–78. P. Sevestre, The Econometrics of Panel Gabrielsen, A. “Consistency and Identifia- Data, 2nd ed., Norwell, MA: Kluwer, bility.” Journal of Econometrics, 8, 1978, 1996. pp. 261–263. Fomby, T., C. Hill, and S. Johnson. Advanced Gali, J. “How Well Does the IS-LM Model Fit Econometric Methods. Needham, MA: Postwar U.S. Data?” Quarterly Journal Springer-Verlag, 1984. of Economics, 107, 1992, pp. 709–738. Frankel, J. and A. Rose. “A Panel Project Gallant, A. Nonlinear Statistical Models. New on Purchasing Power Parity: Mean Re- York: John Wiley and Sons, 1987. version Within and Between Countries.” Gallant, A., and A. Holly. “Statistical Journal of International Economics, 40, Inference in an Implicit Nonlinear 1996, pp. 209–224. Simultaneous Equation in the Context Freedman, D. “On the So-Called ‘Huber of Maximum Likelihood Estimation.” Sandwich Estimator’ and Robust Stan- Econometrica, 48, 1980, pp. 697–720. dard Errors.” The American Statistician, Gallant, R., and H. White. A Unified The- 60, 4, 2006, pp. 299–302. ory of Estimation and Inference for Greene-50558 book June 25, 2007 10:16

References 1115

Nonlinear Dynamic Models. Oxford: Geweke, J. Contemporary Bayesian Econo- Basil Blackwell, 1988. metrics and Statistics. New York: John Garber, S., and S. Klepper. “Extending Wiley and Sons, 2005. the Classical Normal Errors in Vari- Geweke, J., M. Keane, and D. Runkle. “Al- ables Model.” Econometrica, 48, 1980, ternative Computational Approaches pp. 1541–1546. to Inference in the Multinomial Probit Garber, S., and D. Poirier. “The Determinants Model.” Review of Economics and of Aerospace Profit Rates.” Southern Statistics, 76, 1994, pp. 609–632. Economic Journal, 41, 1974, pp. 228– Geweke, J., and R. Meese. “Estimating Re- 238. gression Models of Finite but Unknown Gaver, K., and M. Geisel. “Discriminating Order.” International Economic Review, Among Alternative Models: Bayesian 22, 1981, pp. 55–70. and Non-Bayesian Methods.” In P. Geweke, J., M. Keane, and D. Runkle. “Sta- Zarembka, ed., Frontiers in Economet- tistical Inference in the Multinomial rics. New York: Academic Press, 1974. Multiperiod Probit Model.” Journal of Gelfand, A., and A. Smith. “Sampling Based Econometrics, 81, 1, 1997, pp. 125–166. Approaches to Calculating Marginal Geweke, J., R. Meese, and W. Dent. “Com- Densities.” Journal of the American paring Alternative Tests of Causality Statistical Association, 85, 1990, pp. 398– in Temporal Systems: Analytic Results 409. and Experimental Evidence.” Journal of Gerfin, M. “Parametric and Semi-Parametric Econometrics, 21, 1983, pp. 161–194. Estimation of the Binary Response Geweke, J., and S. Porter-Hudak. “The Esti- Model.” Journal of Applied Economet- mation and Application of Long Memory rics, 11, 1996, pp. 321–340. Time Series Models.” Journal of Time Gelman, A., J. Carlin, H. Stern, and D. Rubin. Series Analysis, 4, 1983, pp. 221–238. Bayesian Data Analysis, 2nd ed. Suffolk: Gill, J. Bayesian Methods: A Social and Chapman and Hall, 2004. Behavioral Sciences Approach. Suffolk: Gentle, J. Elements of Computational Statis- Chapman and Hall, 2002. tics. New York: Springer-Verlag, 2002. Godfrey, L. “Testing Against General Gentle, J. Random Number Generation and Autoregressive and Moving Average Monte Carlo Methods, 2nd ed. New York: Error Models When the Regressors Springer-Verlag, 2003. Include Lagged Dependent Variables.” Geweke, J. “Inference and Causality in Econometrica, 46, 1978, pp. 1293–1302. Econometric Time Series Models.” In Godfrey, L. Misspecification Tests in Econo- Z. Griliches and M. Intriligator, eds., metrics. Cambridge: Cambridge Univer- Handbook of Econometrics, Vol. 2. sity Press, 1988. Amsterdam: North Holland, 1984. Godfrey, L. “Instrument Relevance in Geweke, J. “Exact Inference in the In- Multivariate Linear Models.” Review equality Constrained Normal Linear of Economics and Statistics, 81, 1999, Regression Model.” Journal of Applied pp. 550–552. Econometrics, 2, 1986, pp. 127–142. Godfrey, L. and H. Pesaran. “Tests of Geweke, J. “Antithetic Acceleration of Monte Nonnested Regression Models After Carlo Integration in Bayesian Infer- Estimation by Instrumental Variables or ence.” Journal of Econometrics, 38, 1988, Least Squares.” Journal of Econometrics, pp. 73–90. 21, 1983, pp. 133–154. Geweke, J. “Bayesian Inference in Econo- Godfrey, L., and M. Wickens. “Tests of Mis- metric Models Using Monte Carlo specification Using Locally Equivalent Integration.” Econometrica, 57, 1989, Alternative Models.” In G. Chow and pp. 1317–1340. P. Corsi, eds., Evaluating the Reliability Greene-50558 book June 25, 2007 10:16

1116 References

of Econometric Models. New York: John and Prediction.” International Economic Wiley and Sons, 1982, pp. 71–99. Review, 9, 1968, pp. 113–136. Goffe, W., G. Ferrier, and J. Rodgers. “Global Goldfeld, S., and R. Quandt. Nonlinear Optimization of Statistical Functions Methods in Econometrics. Amsterdam: with Simulated Annealing.” Journal of North Holland, 1971. Econometrics, 60, 1/2, 1994, pp. 65–100. Goldfeld, S., and R. Quandt. “GQOPT: A Goldberger, A. “Best Linear Unbiased Pre- Package for Numerical Optimization of diction in the Generalized Regression Functions.” Manuscript, Department of Model.” Journal of the American Statisti- Economics, Princeton University, 1972. cal Association, 57, 1962, pp. 369–375. Goldfeld, S., R. Quandt, and H. Trotter. “Max- Goldberger, A. Econometric Theory. New imization by Quadratic Hill Climbing.” York: John Wiley and Sons, 1964. Econometrica, 1966, pp. 541–551. Goldberger, A. “Estimation of a Regression Gonzal´az, P., and W.Maloney. “Logit Analysis Coefficient Matrix Containing a Block of in a Rotating Panel Context and an Ap- Zeroes.” University of Wisconsin, SSRI, plication to Self Employment Decisions.” EME, Number 7002, 1970. Policy Research Working Paper Number Goldberger, A. “Selection Bias in Evaluating 2069, Washington, D.C: World Bank, Treatment Effects: Some Formal Illustra- 1999. tions.” Discussion Paper 123-72, Institute Gordin, M. “The Central Limit Theorem for for Research on Poverty, University of Stationary Processes.” Soviet Mathemat- Wisconsin, Madison, 1972. ical Dokl., 10, 1969, pp. 1174–1176. Goldberger, A. “Dependency Rates and Gourieroux, C., and A. Monfort. “Testing Savings Rates: Further Comment.” Non-Nested Hypotheses.” In Z. Griliches American Economic Review, 63, 1, 1973, and M. Intriligator, eds., Handbook of pp. 232–233. Econometrics, Vol. 4. Amsterdam: North Goldberger, A. “Linear Regression After Holland, 1994. Selection.” Journal of Econometrics, 15, Gourieroux, C., and A. Monfort, “Testing, 1981, pp. 357–366. Encompassing, and Simulating Dynamic Goldberger, A. “Abnormal Selection Bias.” In Econometric Models,” Econometric S. Karlin, T. Amemiya, and L. Goodman, Theory, 11, 1995, pp. 195–228. eds., Studies in Econometrics, Time Series, Gourieroux, C., and A. Monfort. Simulation- and Multivariate Statistics. New York: Based Methods Econometric Methods. Academic Press, 1983. Oxford: Oxford University Press, 1996. Goldberger, A. Functional Form and Utility: Gourieroux, C., A. Monfort, and A. Trognon. A Review of Consumer Demand Theory. “Testing Nested or Nonnested Hypothe- Boulder: Westview Press, 1987. ses.” Journal of Econometrics, 21, 1983, Goldberger, A. A Course in Econometrics. pp. 83–115. Cambridge: Harvard University Press, Gourieroux, C., A. Monfort, and A. Trognon. 1991. “Pseudo Maximum Likelihood Meth- Goldfeld, S. “The Demand for Money Re- ods: Applications to Poisson Models.” visited.” Brookings Papers on Economic Econometrica, 52, 1984, pp. 701–720. Activity, 3. Washington, D.C.: Brookings Gourieroux, C., A. Monfort, E. Renault, Institution, 1973. and A. Trognon. “Generalized Residu- Goldfeld, S., and R. Quandt. “Some Tests als.” Journal of Econometrics, 34, 1987, for Homoscedasticity.” Journal of the pp. 5–32. American Statistical Association, 60, Granger, C. “Investigating Causal Relations 1965, pp. 539–547. by Econometric Models and Cross- Goldfeld, S., and R. Quandt. “Nonlinear Spectral Methods.” Econometrica, 37, Simultaneous Equations: Estimation 1969, pp. 424–438. Greene-50558 book June 25, 2007 10:16


Granger, C. “Some Properties of Time Series Data and Their Use in Econometric Model Specification.” Journal of Econometrics, 16, 1981, pp. 121–130.
Granger, C., and Z. Ding. “Varieties of Long Memory Models.” Journal of Econometrics, 73, 1996, pp. 61–78.
Granger, C., and P. Newbold. “Spurious Regressions in Econometrics.” Journal of Econometrics, 2, 1974, pp. 111–120.
Granger, C., and P. Newbold. Forecasting Economic Time Series, 2nd ed. New York: Academic Press, 1996.
Granger, C., and M. Watson. “Time Series and Spectral Methods in Econometrics.” In Z. Griliches and M. Intriligator, eds., Handbook of Econometrics, Vol. 2. Amsterdam: North Holland, 1984.
Granger, C., and R. Joyeux. “An Introduction to Long Memory Time Series Models and Fractional Differencing.” Journal of Time Series Analysis, 1, 1980, pp. 15–39.
Granger, C., and M. Pesaran. “A Decision Theoretic Approach to Forecast Evaluation.” In W. S. Chan, W. Li, and H. Tong, eds., Statistics and Finance: An Interface. London: Imperial College Press, 2000.
Gravelle, H., R. Jacobs, A. Jones, and A. Street. “Comparing the Efficiency of National Health Systems: Econometric Analysis Should Be Handled with Care.” Manuscript, University of York, Health Economics, UK, 2002a.
Gravelle, H., R. Jacobs, A. Jones, and A. Street. “Comparing the Efficiency of National Health Systems: A Sensitivity Approach.” Manuscript, University of York, Health Economics, UK, 2002b.
Greenberg, E., and C. Webster. Advanced Econometrics: A Bridge to the Literature. New York: John Wiley and Sons, 1983.
Greene, W. “Maximum Likelihood Estimation of Econometric Frontier Functions.” Journal of Econometrics, 13, 1980a, pp. 27–56.
Greene, W. “On the Asymptotic Bias of the Ordinary Least Squares Estimator of the Tobit Model.” Econometrica, 48, 1980b, pp. 505–514.
Greene, W. “Sample Selection Bias as a Specification Error: Comment.” Econometrica, 49, 1981, pp. 795–798.
Greene, W. “Estimation of Limited Dependent Variable Models by Ordinary Least Squares and the Method of Moments.” Journal of Econometrics, 21, 1983, pp. 195–212.
Greene, W. “A Gamma Distributed Stochastic Frontier Model.” Journal of Econometrics, 46, 1990, pp. 141–163.
Greene, W. “A Statistical Model for Credit Scoring.” Working Paper No. EC-92-29, Department of Economics, Stern School of Business, New York University, 1992.
Greene, W. Econometric Analysis, 2nd ed. Englewood Cliffs, NJ: Prentice Hall, 1993.
Greene, W. “Accounting for Excess Zeros and Sample Selection in Poisson and Negative Binomial Regression Models.” Working Paper No. EC-94-10, Department of Economics, Stern School of Business, New York University, 1994.
Greene, W. “Count Data.” Manuscript, Department of Economics, Stern School of Business, New York University, 1995a.
Greene, W. “Sample Selection in the Poisson Regression Model.” Working Paper No. EC-95-6, Department of Economics, Stern School of Business, New York University, 1995b.
Greene, W. “Models for Count Data.” Manuscript, Department of Economics, Stern School of Business, New York University, 1996a.
Greene, W. “Marginal Effects in the Bivariate Probit Model.” Working Paper No. 96-11, Department of Economics, Stern School of Business, New York University, 1996b.
Greene, W. “FIML Estimation of Sample Selection Models for Count Data.” Working Paper No. 97-02, Department of Economics, Stern School of Business, New York University, 1997a.
Greene, W. “Frontier Production Functions.” In M. Pesaran and P. Schmidt, eds., Handbook of Applied Econometrics, Volume II: Microeconomics. London: Blackwell Publishers, 1997b.


Greene, W. “Gender Economics Courses in Liberal Arts Colleges: Further Results.” Journal of Economic Education, 29, 4, 1998, pp. 291–300.
Greene, W. “Marginal Effects in the Censored Regression Model.” Economics Letters, 64, 1, 1999, pp. 43–50.
Greene, W. “Fixed and Random Effects in Nonlinear Models.” Working Paper EC-01-01, Department of Economics, Stern School of Business, New York University, 2001.
Greene, W. “Simulated Maximum Likelihood Estimation of the Normal-Gamma Stochastic Frontier Model.” Journal of Productivity Analysis, 19, 2003, pp. 179–190.
Greene, W. “Convenient Estimators for the Panel Probit Model.” Empirical Economics, 29, 1, 2004a, pp. 21–47.
Greene, W. “Fixed Effects and Bias Due to the Incidental Parameters Problem in the Tobit Model.” Econometric Reviews, 23, 2, 2004b, pp. 125–147.
Greene, W. “Distinguishing Between Heterogeneity and Inefficiency: Stochastic Frontier Analysis of the World Health Organization’s Panel Data on National Health Care Systems.” Health Economics, 13, 2004c, pp. 959–980.
Greene, W. “Censored Data and Truncated Distributions.” In T. Mills and K. Patterson, eds., Palgrave Handbook of Econometrics, Volume 1: Econometric Theory. Hampshire: Palgrave, 2006.
Greene, W. “The Econometric Approach to Efficiency Analysis.” In H. Fried, K. Lovell, and S. Schmidt, eds., The Measurement of Productive Efficiency, 2nd ed. Oxford: Oxford University Press, 2007a.
Greene, W. LIMDEP 9.0 Reference Guide. Plainview, NY: Econometric Software, Inc., 2007b.
Greene, W. “Matching the Matching Estimators.” Manuscript, Department of Economics, Stern School of Business, New York University, 2007c.
Greene, W. “Functional Form and Heterogeneity and Models for Count Data.” Working Paper EC-07-10, Department of Economics, Stern School of Business, New York University, 2007d.
Greene, W. “Discrete Choice Models.” In T. Mills and K. Patterson, eds., Palgrave Handbook of Econometrics, Volume 2: Applied Econometrics. Hampshire: Palgrave, 2008, forthcoming.
Greene, W., and D. Hensher. “Specification and Estimation of Nested Logit Models.” Transportation Research, B, 36, 1, 2002, pp. 1–18.
Greene, W., and D. Hensher. “Multinomial Logit and Discrete Choice Models.” In W. Greene, NLOGIT Version 4.0 User’s Manual, Revised. Plainview, NY: Econometric Software, Inc., 2007.
Greene, W., and D. Hensher. “Mixed Logit and Heteroscedastic Control for Random Coefficients and Error Components.” Transportation Research Part E: Logistics and Transportation, 2007, forthcoming.
Greene, W., D. Hensher, and J. Rose. “Accounting for Heterogeneity in the Variance of the Unobserved Effects in Mixed Logit Models (NW Transport Study Data).” Transportation Research Part B: Methodological, 40, 1, 2006a, pp. 75–92.
Greene, W., and T. Seaks. “The Restricted Least Squares Estimator: A Pedagogical Note.” Review of Economics and Statistics, 73, 1991, pp. 563–567.
Greenstadt, J. “On the Relative Efficiencies of Gradient Methods.” Mathematics of Computation, 1967, pp. 360–367.
Griffiths, W., C. Hill, and G. Judge. Learning and Practicing Econometrics. New York: John Wiley and Sons, 1993.
Griliches, Z. “Hedonic Price Indexes for Automobiles: An Econometric Analysis of Quality Change.” In Price Statistics of the Federal Government, prepared by the Price Statistics Review Committee of the National Bureau of Economic Research. New York: National Bureau of Economic Research, 1961.
Griliches, Z. “Distributed Lags: A Survey.” Econometrica, 35, 1967, pp. 16–49.


Griliches, Z. “Economic Data Issues.” In Z. Griliches and M. Intriligator, eds., Handbook of Econometrics, Vol. 3. Amsterdam: North Holland, 1986.
Griliches, Z., and P. Rao. “Small Sample Properties of Several Two Stage Regression Methods in the Context of Autocorrelated Errors.” Journal of the American Statistical Association, 64, 1969, pp. 253–272.
Grogger, J., and R. Carson. “Models for Truncated Counts.” Journal of Applied Econometrics, 6, 1991, pp. 225–238.
Gronau, R. “Wage Comparisons: A Selectivity Bias.” Journal of Political Economy, 82, 1974, pp. 1119–1149.
Groot, W., and H. Maassen van den Brink. “Match Specific Gains to Marriages: A Random Effects Ordered Response Model.” Quality and Quantity, 37, 3, 2003, pp. 317–325.
Grunfeld, Y. “The Determinants of Corporate Investment.” Unpublished Ph.D. thesis, Department of Economics, University of Chicago, 1958.
Grunfeld, Y., and Z. Griliches. “Is Aggregation Necessarily Bad?” Review of Economics and Statistics, 42, 1960, pp. 1–13.
Guilkey, D. “Alternative Tests for a First-Order Vector Autoregressive Error Specification.” Journal of Econometrics, 2, 1974, pp. 95–104.
Guilkey, D., and P. Schmidt. “Estimation of Seemingly Unrelated Regressions with Vector Autoregressive Errors.” Journal of the American Statistical Association, 1973, pp. 642–647.
Guilkey, D., K. Lovell, and R. Sickles. “A Comparison of the Performance of Three Flexible Functional Forms.” International Economic Review, 24, 1983, pp. 591–616.
Gujarati, D. Basic Econometrics, 4th ed. New York: McGraw-Hill, 2002.
Gurmu, S. “Tests for Detecting Overdispersion in the Positive Poisson Regression Model.” Journal of Business and Economic Statistics, 9, 1991, pp. 215–222.
Gurmu, S., P. Rilstone, and S. Stern. “Semiparametric Estimation of Count Regression Models.” Journal of Econometrics, 88, 1, 1999, pp. 123–150.
Gurmu, S., and P. Trivedi. “Recent Developments in Models of Event Counts: A Survey.” Manuscript, Department of Economics, Indiana University, 1994.
Haavelmo, T. “The Statistical Implications of a System of Simultaneous Equations.” Econometrica, 11, 1943, pp. 1–12.
Hadri, K. “Testing for Stationarity in Heterogeneous Panel Data.” Econometrics Journal, 3, 2000, pp. 148–161.
Hahn, J., and J. Hausman. “A New Specification Test for the Validity of Instrumental Variables.” Econometrica, 70, 2002, pp. 163–189.
Hahn, J., and J. Hausman. “Weak Instruments: Diagnosis and Cures in Empirical Economics.” American Economic Review, 93, 2003, pp. 118–125.
Hahn, J., and G. Kuersteiner. “Bias Reduction for Dynamic Nonlinear Panel Models with Fixed Effects.” Manuscript, Department of Economics, University of California, Los Angeles, 2004.
Haitovsky, Y. “Missing Data in Regression Analysis.” Journal of the Royal Statistical Society, Series B, 1968, pp. 67–82.
Hajivassiliou, V. “Smooth Simulation Estimation of Panel Data LDV Models.” Department of Economics, Yale University, 1990.
Hajivassiliou, V. “Some Practical Issues in Maximum Simulated Likelihood.” In R. Mariano, T. Schuermann, and M. Weeks, eds., Simulation Based Inference in Econometrics. Cambridge: Cambridge University Press, 2000.
Hall, A., and A. Sen. “Structural Stability Testing in Models Estimated by Generalized Method of Moments.” Journal of Business and Economic Statistics, 17, 3, 1999, pp. 335–348.
Hall, B. TSP Version 4.0 Reference Manual. Palo Alto: TSP International, 1982.
Hall, B. “Software for the Computation of Tobit Model Estimates.” Journal of Econometrics, 24, 1984, pp. 215–222.


Hall, R. “Stochastic Implications of the Life Cycle–Permanent Income Hypothesis: Theory and Evidence.” Journal of Political Economy, 86, 6, 1978, pp. 971–987.
Hamilton, J. Time Series Analysis. Princeton: Princeton University Press, 1994.
Hansen, B. “Testing for Parameter Instability in Linear Models.” Journal of Policy Modeling, 14, 1992, pp. 517–533.
Hansen, B. “Approximate Asymptotic P Values for Structural-Change Tests.” Journal of Business and Economic Statistics, 15, 1, 1997, pp. 60–67.
Hansen, B. “Testing for Structural Change in Conditional Models.” Journal of Econometrics, 97, 2000, pp. 93–115.
Hansen, B. “Challenges for Econometric Model Selection.” Econometric Theory, 21, 2005, pp. 60–68.
Hansen, L. “Large Sample Properties of Generalized Method of Moments Estimators.” Econometrica, 50, 1982, pp. 1029–1054.
Hansen, L., J. Heaton, and A. Yaron. “Finite Sample Properties of Some Alternative GMM Estimators.” Journal of Business and Economic Statistics, 14, 3, 1996, pp. 262–280.
Hansen, L., and K. Singleton. “Generalized Instrumental Variable Estimation of Nonlinear Rational Expectations Models.” Econometrica, 50, 1982, pp. 1269–1286.
Hansen, L., and K. Singleton. “Efficient Estimation of Asset Pricing Models with Moving Average Errors.” Manuscript, Department of Economics, Carnegie Mellon University, 1988.
Hanushek, E. “Efficient Estimators for Regressing Regression Coefficients.” The American Statistician, 28, 2, 1974, pp. 21–27.
Hanushek, E. “The Evidence on Class Size.” In S. Mayer and P. Peterson, eds., Earning and Learning: How Schools Matter. Washington, D.C.: Brookings Institute Press, 1999.
Hardle, W. Applied Nonparametric Regression. New York: Cambridge University Press, 1990.
Hardle, W., and C. Manski, eds. “Nonparametric and Semiparametric Approaches to Discrete Response Analysis.” Journal of Econometrics, 58, 1993, pp. 1–274.
Harris, M., and X. Zhao. “A Zero-Inflated Ordered Probit Model, with an Application to Modelling Tobacco Consumption.” Journal of Econometrics, 2007, forthcoming.
Harvey, A. “Estimating Regression Models with Multiplicative Heteroscedasticity.” Econometrica, 44, 1976, pp. 461–465.
Harvey, A. Forecasting, Structural Time Series Models and the Kalman Filter. New York: Cambridge University Press, 1989.
Harvey, A. The Econometric Analysis of Time Series, 2nd ed. Cambridge: MIT Press, 1990.
Harvey, A., and G. Phillips. “A Comparison of the Power of Some Tests for Heteroscedasticity in the General Linear Model.” Journal of Econometrics, 2, 1974, pp. 307–316.
Hashimoto, N., and K. Ohtani. “An Exact Test for Linear Restrictions in Seemingly Unrelated Regressions with the Same Regressors.” Economics Letters, 32, 1990, pp. 243–246.
Hatanaka, M. “An Efficient Estimator for the Dynamic Adjustment Model with Autocorrelated Errors.” Journal of Econometrics, 2, 1974, pp. 199–220.
Hatanaka, M. “Several Efficient Two-Step Estimators for the Dynamic Simultaneous Equations Model with Autoregressive Disturbances.” Journal of Econometrics, 4, 1976, pp. 189–204.
Hatanaka, M. Time-Series-Based Econometrics. New York: Oxford University Press, 1996.
Hausman, J. “An Instrumental Variable Approach to Full-Information Estimators for Linear and Certain Nonlinear Models.” Econometrica, 43, 1975, pp. 727–738.
Hausman, J. “Specification Tests in Econometrics.” Econometrica, 46, 1978, pp. 1251–1271.
Hausman, J. “Specification and Estimation of Simultaneous Equations Models.” In Z. Griliches and M. Intriligator, eds., Handbook of Econometrics, Vol. 1. Amsterdam: North Holland, 1983.


Hausman, J., B. Hall, and Z. Griliches. “Economic Models for Count Data with an Application to the Patents–R&D Relationship.” Econometrica, 52, 1984, pp. 909–938.
Hausman, J., and A. Han. “Flexible Parametric Estimation of Duration and Competing Risk Models.” Journal of Applied Econometrics, 5, 1990, pp. 1–28.
Hausman, J., and D. McFadden. “A Specification Test for the Multinomial Logit Model.” Econometrica, 52, 1984, pp. 1219–1240.
Hausman, J., and P. Ruud. “Specifying and Testing Econometric Models for Rank Ordered Data with an Application to the Demand for Mobile and Portable Telephones.” Working Paper No. 8605, Department of Economics, University of California, Berkeley, 1986.
Hausman, J., J. Stock, and M. Yogo. “Asymptotic Properties of the Hahn–Hausman Test for Weak Instruments.” Economics Letters, 89, 2005, pp. 333–342.
Hausman, J., and W. Taylor. “Panel Data and Unobservable Individual Effects.” Econometrica, 49, 1981, pp. 1377–1398.
Hausman, J., and D. Wise. “Social Experimentation, Truncated Distributions, and Efficient Estimation.” Econometrica, 45, 1977, pp. 919–938.
Hausman, J., and D. Wise. “A Conditional Probit Model for Qualitative Choice: Discrete Decisions Recognizing Interdependence and Heterogeneous Preferences.” Econometrica, 46, 1978, pp. 403–426.
Hausman, J., and D. Wise. “Attrition Bias in Experimental and Panel Data: The Gary Income Maintenance Experiment.” Econometrica, 47, 2, 1979, pp. 455–473.
Hayashi, F. Econometrics. Princeton: Princeton University Press, 2000.
Heckman, J. “The Common Structure of Statistical Models of Truncation, Sample Selection, and Limited Dependent Variables and a Simple Estimator for Such Models.” Annals of Economic and Social Measurement, 5, 1976, pp. 475–492.
Heckman, J. “Simple Statistical Models for Discrete Panel Data Developed and Applied to the Hypothesis of True State Dependence Against the Hypothesis of Spurious State Dependence.” Annales de l’INSEE, 30, 1978, pp. 227–269.
Heckman, J. “Sample Selection Bias as a Specification Error.” Econometrica, 47, 1979, pp. 153–161.
Heckman, J. “Statistical Models for Discrete Panel Data.” In C. Manski and D. McFadden, eds., Structural Analysis of Discrete Data with Econometric Applications. Cambridge: MIT Press, 1981a.
Heckman, J. “Heterogeneity and State Dependence.” In S. Rosen, ed., Studies in Labor Markets. NBER, Chicago: University of Chicago Press, 1981b.
Heckman, J. “Heterogeneity and State Dependence.” In S. Rosen, ed., Studies in Labor Markets. Chicago: University of Chicago Press, 1981c.
Heckman, J. “Varieties of Selection Bias.” American Economic Review, 80, 1990, pp. 313–318.
Heckman, J. “Randomization and Social Program Evaluation.” In C. Manski and I. Garfinkel, eds., Evaluating Welfare and Training Programs. Cambridge: Harvard University Press, 1992.
Heckman, J. “Instrumental Variables: A Study in Implicit Behavioral Assumptions Used in Making Program Evaluations.” Journal of Human Resources, 32, 1997, pp. 441–462.
Heckman, J., and T. MaCurdy. “A Life Cycle Model of Female Labor Supply.” Review of Economic Studies, 47, 1980, pp. 247–283.
Heckman, J., and T. MaCurdy. “A Simultaneous Equations Linear Probability Model.” Canadian Journal of Economics, 18, 1985, pp. 28–37.
Heckman, J., H. Ichimura, J. Smith, and P. Todd. “Characterizing Selection Bias Using Experimental Data.” Econometrica, 66, 5, 1998, pp. 1017–1098.


Heckman, J., H. Ichimura, and P. Todd. “Matching as an Econometric Evaluation Estimator: Evidence from Evaluating a Job Training Program.” Review of Economic Studies, 64, 4, 1997, pp. 605–654.
Heckman, J., H. Ichimura, and P. Todd. “Matching as an Econometric Evaluation Estimator.” Review of Economic Studies, 65, 2, 1998, pp. 261–294.
Heckman, J., R. LaLonde, and J. Smith. “The Economics and Econometrics of Active Labour Market Programmes.” In O. Ashenfelter and D. Card, eds., The Handbook of Labor Economics, Vol. 3. Amsterdam: North Holland, 1999.
Heckman, J., J. Tobias, and E. Vytlacil. “Simple Estimators for Treatment Parameters in a Latent Variable Framework.” Review of Economics and Statistics, 85, 3, 2003, pp. 748–755.
Heckman, J., and E. Vytlacil. “Instrumental Variables, Selection Models and Tight Bounds on the Average Treatment Effect.” NBER Technical Working Paper 0259, 2000.
Heckman, J., and B. Singer. “Econometric Duration Analysis.” Journal of Econometrics, 24, 1984a, pp. 63–132.
Heckman, J., and B. Singer. “A Method for Minimizing the Impact of Distributional Assumptions in Econometric Models for Duration Data.” Econometrica, 52, 1984b, pp. 271–320.
Heckman, J., and R. Willis. “Estimation of a Stochastic Model of Reproduction: An Econometric Approach.” In N. Terleckyj, ed., Household Production and Consumption. New York: National Bureau of Economic Research, 1976.
Heckman, J., and J. Snyder. “Linear Probability Models of the Demand for Attributes with an Empirical Application to Estimating the Preferences of Legislators.” Rand Journal of Economics, 28, 0, 1997.
Heilbron, D. “Generalized Linear Models for Altered Zero Probabilities and Overdispersion in Count Data.” Technical Report, Department of Epidemiology and Biostatistics, University of California, San Francisco, 1989.
Hendry, D. “Econometrics: Alchemy or Science?” Economica, 47, 1980, pp. 387–406.
Hendry, D. “Monte Carlo Experimentation in Econometrics.” In Z. Griliches and M. Intriligator, eds., Handbook of Econometrics, Vol. 2. Amsterdam: North Holland, 1984.
Hendry, D. Econometrics: Alchemy or Science? Oxford: Blackwell Publishers, 1993.
Hendry, D. Dynamic Econometrics. Oxford: Oxford University Press, 1995.
Hendry, D., and N. Ericsson. “An Econometric Analysis of U.K. Money Demand in Monetary Trends in the United States and the United Kingdom by M. Friedman and A. Schwartz.” American Economic Review, 81, 1991, pp. 8–38.
Hendry, D., and J. Doornik. PC Give-8. London: International Thomson Publishers, 1986.
Hendry, D., A. Pagan, and D. Sargan. “Dynamic Specification.” In Z. Griliches and M. Intriligator, eds., Handbook of Econometrics, Vol. 2. Amsterdam: North Holland, 1984.
Hendry, D., and H. Krolzig. Automatic Econometric Model Selection Using PcGets. London: Timberlake Consultants Press, 2001.
Hensher, D. “Sequential and Full Information Maximum Likelihood Estimation of a Nested Logit Model.” Review of Economics and Statistics, 68, 4, 1986, pp. 657–667.
Hensher, D. “Efficient Estimation of Hierarchical Logit Mode Choice Models.” Journal of the Japanese Society of Civil Engineers, 425/IV-14, 1991, pp. 117–128.
Hensher, D., ed. Travel Behavior Research: The Leading Edge. Amsterdam: Pergamon Press, 2001.
Hensher, D., and W. Greene. “The Mixed Logit Model: The State of Practice.” Transportation Research, B, 30, 2003, pp. 133–176.


Hensher, D., and W. Greene. “Combining RP and SP Data: Biases in Using the Nested Logit ‘Trick’—Contrasts with Flexible Mixed Logit Incorporating Panel and Scale Effects.” Manuscript, Institute of Transport and Logistics Studies, Sydney University, 2006.
Hensher, D., J. Louviere, and J. Swait. Stated Choice Methods: Analysis and Applications. Cambridge: Cambridge University Press, 2000.
Hensher, D., J. Rose, and W. Greene. Applied Choice Analysis. Cambridge: Cambridge University Press, 2006.
Hildebrand, G., and T. Liu. Manufacturing Production Functions in the United States. Ithaca, NY: Cornell University Press, 1957.
Hildreth, C., and C. Houck. “Some Estimators for a Linear Model with Random Coefficients.” Journal of the American Statistical Association, 63, 1968, pp. 584–595.
Hildreth, C., and J. Lu. “Demand Relations with Autocorrelated Disturbances.” Technical Bulletin No. 276, Michigan State University Agricultural Experiment Station, 1960.
Hill, C., and L. Adkins. “Collinearity.” In B. Baltagi, ed., A Companion to Theoretical Econometrics. Oxford: Blackwell, 2001.
Hilts, J. “Europeans Perform Highest in Ranking of World Health.” New York Times, June 21, 2000.
Hite, S. Women and Love. New York: Alfred A. Knopf, 1987.
Hoeting, J., D. Madigan, A. Raftery, and C. Volinsky. “Bayesian Model Averaging: A Tutorial.” Statistical Science, 14, 1999, pp. 382–417.
Hole, A. “A Comparison of Approaches to Estimating Confidence Intervals for Willingness to Pay Measures.” Paper CHE 8, Center for Health Economics, University of York, 2006.
Hollingsworth, J., and B. Wildman. “The Efficiency of Health Production: Re-estimating the WHO Panel Data Using Parametric and Nonparametric Approaches to Provide Additional Information.” Health Economics, 11, 2002, pp. 1–11.
Holt, M. “Autocorrelation Specification in Singular Equation Systems: A Further Look.” Economics Letters, 58, 1998, pp. 135–141.
Holtz-Eakin, D. “Testing for Individual Effects in Autoregressive Models.” Journal of Econometrics, 39, 1988, pp. 297–307.
Holtz-Eakin, D., W. Newey, and H. Rosen. “Estimating Vector Autoregressions with Panel Data.” Econometrica, 56, 6, 1988, pp. 1371–1395.
Hong, H., B. Preston, and M. Shum. “Generalized Empirical Likelihood Based Model Selection Criteria for Moment Condition Models.” Econometric Theory, 19, 2003, pp. 923–943.
Honoré, B., and E. Kyriazidou. “Estimation of a Panel Data Sample Selection Model.” Econometrica, 65, 6, 1997, pp. 1335–1364.
Honoré, B., and E. Kyriazidou. “Panel Data Discrete Choice Models with Lagged Dependent Variables.” Econometrica, 68, 4, 2000, pp. 839–874.
Honoré, B., and E. Kyriazidou. “Estimation of Tobit-Type Models with Individual Specific Effects.” Econometric Reviews, 19, 3, 2000, pp. 341–367.
Horn, D., A. Horn, and G. Duncan. “Estimating Heteroscedastic Variances in Linear Models.” Journal of the American Statistical Association, 70, 1975, pp. 380–385.
Horowitz, J. “A Smoothed Maximum Score Estimator for the Binary Response Model.” Econometrica, 60, 1992, pp. 505–531.
Horowitz, J. “Semiparametric Estimation of a Work-Trip Mode Choice Model.” Journal of Econometrics, 58, 1993, pp. 49–70.
Horowitz, J. “The Bootstrap.” In J. Heckman and E. Leamer, eds., Handbook of Econometrics, Vol. 5. Amsterdam: North Holland, 2001, pp. 3159–3228.
Horowitz, J., and G. Neumann. “Specification Testing in Censored Regression Models.” Journal of Applied Econometrics, 4(S), 1989, pp. S35–S60.


Hosking, J. “Fractional Differencing.” Biometrika, 68, 1981, pp. 165–176.
Howrey, E., and H. Varian. “Estimating the Distributional Impact of Time-of-Day Pricing of Electricity.” Journal of Econometrics, 26, 1984, pp. 65–82.
Hoxby, C. “Does Competition Among Public Schools Benefit Students and Taxpayers?” American Economic Review, 90, 5, 2000, pp. 1209–1238.
Hsiao, C. “Some Estimation Methods for a Random Coefficient Model.” Econometrica, 43, 1975, pp. 305–325.
Hsiao, C. Analysis of Panel Data. Cambridge: Cambridge University Press, 1986.
Hsiao, C. Analysis of Panel Data, 2nd ed. New York: Cambridge University Press, 2003.
Hsiao, C. “Identification.” In Z. Griliches and M. Intriligator, eds., Handbook of Econometrics, Vol. 1. Amsterdam: North Holland, 1983.
Hsiao, C. “Modeling Ontario Regional Electricity System Demand Using a Mixed Fixed and Random Coefficients Approach.” Regional Science and Urban Economics, 19, 1989.
Hsiao, C. “Logit and Probit Models.” In L. Matyas and P. Sevestre, eds., The Econometrics of Panel Data: Handbook of Theory and Applications. Dordrecht: Kluwer-Nijhoff, 1992.
Hsiao, C., K. Lahiri, L. Lee, and M. Pesaran, eds. Analysis of Panels and Limited Dependent Variable Models. New York: Cambridge University Press, 1999.
Hsiao, C., M. Pesaran, and A. Tahmiscioglu. “A Panel Analysis of Liquidity Constraints and Firm Investment.” In C. Hsiao, K. Lahiri, L. Lee, and M. Pesaran, eds., Analysis of Panels and Limited Dependent Variable Models. Cambridge: Cambridge University Press, 2002, pp. 268–296.
Huber, P. “The Behavior of Maximum Likelihood Estimates Under Nonstandard Conditions.” In Proceedings of the Fifth Berkeley Symposium in Mathematical Statistics, Vol. 1. Berkeley: University of California Press, 1967.
Hurd, M. “Estimation in Truncated Samples When There Is Heteroscedasticity.” Journal of Econometrics, 11, 1979, pp. 247–258.
Hurst, H. “Long Term Storage Capacity of Reservoirs.” Transactions of the American Society of Civil Engineers, 116, 1951, pp. 519–543.
Hyslop, D. “State Dependence, Serial Correlation, and Heterogeneity in Labor Force Participation of Married Women.” Econometrica, 67, 6, 1999, pp. 1255–1294.
Hwang, H. “Estimation of a Linear SUR Model with Unequal Numbers of Observations.” Review of Economics and Statistics, 72, 1990, pp. 510–515.
Im, E. “Unequal Numbers of Observations and Partial Efficiency Gain.” Economics Letters, 46, 1994, pp. 291–294.
Im, K., S. Ahn, P. Schmidt, and J. Wooldridge. “Efficient Estimation of Panel Data Models with Strictly Exogenous Explanatory Variables.” Journal of Econometrics, 93, 1999, pp. 177–201.
Im, K., M. Pesaran, and Y. Shin. “Testing for Unit Roots in Heterogeneous Panels.” Journal of Econometrics, 115, 2003, pp. 53–74.
Imbens, G., and J. Angrist. “Identification and Estimation of Local Average Treatment Effects.” Econometrica, 62, 1994, pp. 467–476.
Imbens, G., and D. Hyslop. “Bias from Classical and Other Forms of Measurement Error.” Journal of Business and Economic Statistics, 19, 2001, pp. 141–149.
Imhof, J. “Computing the Distribution of Quadratic Forms in Normal Variables.” Biometrika, 48, 1961, pp. 419–426.
Inkmann, J. “Misspecified Heteroscedasticity in the Panel Probit Model: A Small Sample Comparison of GMM and SML Estimators.” Journal of Econometrics, 97, 2, 2000, pp. 227–259.
Jain, D., N. Vilcassim, and P. Chintagunta. “A Random-Coefficients Logit Brand Choice Model Applied to Panel Data.” Journal of Business and Economic Statistics, 12, 3, 1994, pp. 317–328.


Jarque, C. “An Application of LDV Models to Household Expenditure Analysis in Mexico.” Journal of Econometrics, 36, 1987, pp. 31–54.
Jakubson, G. “The Sensitivity of Labor Supply Parameters to Unobserved Individual Effects: Fixed and Random Effects Estimates in a Nonlinear Model Using Panel Data.” Journal of Labor Economics, 6, 1988, pp. 302–329.
Jayatissa, W. “Tests of Equality Between Sets of Coefficients in Two Linear Regressions When Disturbance Variances Are Unequal.” Econometrica, 45, 1977, pp. 1291–1292.
Jennrich, R. I. “The Asymptotic Properties of Nonlinear Least Squares Estimators.” Annals of Mathematical Statistics, 40, 1969, pp. 633–643.
Jensen, M. “A Monte Carlo Study on Two Methods of Calculating the MLE’s Covariance Matrix in a Seemingly Unrelated Nonlinear Regression.” Econometric Reviews, 14, 1995, pp. 315–330.
Jobson, J., and W. Fuller. “Least Squares Estimation When the Covariance Matrix and Parameter Vector Are Functionally Related.” Journal of the American Statistical Association, 75, 1980, pp. 176–181.
Johansen, S. “Estimation and Hypothesis Testing of Cointegrated Vectors in Gaussian VAR Models.” Econometrica, 59, 6, 1991, pp. 1551–1580.
Johansen, S. “A Representation of Vector Autoregressive Processes of Order 2.” Econometric Theory, 8, 1992, pp. 188–202.
Johansen, S. “Statistical Analysis of Cointegration Vectors.” Journal of Economic Dynamics and Control, 12, 1988, pp. 231–254.
Johansen, S., and K. Juselius. “Maximum Likelihood Estimation and Inference on Cointegration, with Applications for the Demand for Money.” Oxford Bulletin of Economics and Statistics, 52, 1990, pp. 169–210.
Johnson, N., and S. Kotz. Distributions in Statistics—Continuous Multivariate Distributions. New York: John Wiley and Sons, 1974.
Johnson, N., S. Kotz, and A. Kemp. Distributions in Statistics—Univariate Discrete Distributions, 2nd ed. New York: John Wiley and Sons, 1993.
Johnson, N., S. Kotz, and N. Balakrishnan. Distributions in Statistics—Continuous Univariate Distributions, Vol. 1, 2nd ed. New York: John Wiley and Sons, 1994.
Johnson, N., S. Kotz, and N. Balakrishnan. Distributions in Statistics—Continuous Univariate Distributions, Vol. 2, 2nd ed. New York: John Wiley and Sons, 1995.
Johnson, N., S. Kotz, and N. Balakrishnan. Distributions in Statistics—Discrete Multivariate Distributions. New York: John Wiley and Sons, 1997.
Johnson, R., and D. Wichern. Applied Multivariate Statistical Analysis, 5th ed. Englewood Cliffs, NJ: Prentice Hall, 2005.
Johnston, J. Econometric Methods. New York: McGraw-Hill, 1984.
Johnston, J., and J. DiNardo. Econometric Methods, 4th ed. New York: McGraw-Hill, 1997.
Jondrow, J., K. Lovell, I. Materov, and P. Schmidt. “On the Estimation of Technical Inefficiency in the Stochastic Frontier Production Function Model.” Journal of Econometrics, 19, 1982, pp. 233–238.
Jones, A., X. Koolman, and N. Rice. “Health-Related Attrition in the BHPS and ECHP: Using Inverse Probability Weighted Estimators in Nonlinear Models.” Journal of the Royal Statistical Society, Series A (Statistics in Society), 169, 2006, pp. 543–569.
Jones, J., and J. Landwehr. “Removing Heterogeneity Bias from Logit Model Estimation.” Marketing Science, 7, 1, 1988, pp. 41–59.
Joreskog, K. “A General Method for Estimating a Linear Structural Equation System.” In A. Goldberger and O. Duncan, eds., Structural Equation Models in the Social Sciences. New York: Academic Press, 1973.
Joreskog, K., and G. Gruvaeus. “A Computer Program for Minimizing a Function of Several Variables.” Educational Testing Service, Research Bulletin No. 70-14, 1970.


Joreskog, K., and D. Sorbom. LISREL V User’s Guide. Chicago: National Educational Resources, 1981.
Jorgenson, D. “Rational Distributed Lag Functions.” Econometrica, 34, 1966, pp. 135–149.
Jorgenson, D. “Econometric Methods for Modeling Producer Behavior.” In Z. Griliches and M. Intriligator, eds., Handbook of Econometrics, Vol. 3. Amsterdam: North Holland, 1983.
Judd, K. Numerical Methods in Economics. Cambridge: MIT Press, 1998.
Judge, G., W. Griffiths, C. Hill, and T. Lee. The Theory and Practice of Econometrics. New York: John Wiley and Sons, 1985.
Judge, G., C. Hill, W. Griffiths, T. Lee, and H. Lutkepol. An Introduction to the Theory and Practice of Econometrics. New York: John Wiley and Sons, 1982.
Jusko, K. “A Two-Step Binary Response Model for Cross-National Public Opinion Data: A Research Note.” Presented at the Midwest Political Science Association National Conference, Chicago, April 7–10, 2005.
Kakwani, N. “The Unbiasedness of Zellner’s Seemingly Unrelated Regression Equations Estimators.” Journal of the American Statistical Association, 62, 1967, pp. 141–142.
Kalbfleisch, J., and R. Prentice. The Statistical Analysis of Failure Time Data, 2nd ed. New York: John Wiley and Sons, 2002.
Kamlich, R., and S. Polachek. “Discrimination: Fact or Fiction? An Examination Using an Alternative Approach.” Southern Economic Journal, October 1982, pp. 450–461.
Kao, C. “Spurious Regression and Residual Based Tests for Cointegration in Panel Data.” Journal of Econometrics, 90, 1999, pp. 1–44.
Kaplan, E., and P. Meier. “Nonparametric Estimation from Incomplete Observations.” Journal of the American Statistical Association, 53, 1958, pp. 457–481.
Kapteyn, A., and D. Fiebig. “When Are Two Stage and Three Stage Least Squares Identical?” Economics Letters, 8, 1981, pp. 53–57.
Kay, R., and S. Little. “Assessing the Fit of the Logistic Model: A Case Study of Children with Haemolytic Uraemic Syndrome.” Applied Statistics, 35, 1986, pp. 16–30.
Keane, M. “A Computationally Practical Simulation Estimator for Panel Data.” Econometrica, 62, 1, 1994, pp. 95–116.
Keane, M. “Simulation Estimators for Panel Data Models with Limited Dependent Variables.” In G. Maddala and C. Rao, eds., Handbook of Statistics, Vol. 11, Chapter 20. Amsterdam: North Holland, 1993.
Keane, M., R. Moffitt, and D. Runkle. “Real Wages over the Business Cycle: Estimating the Impact of Heterogeneity with Micro-Data.” Journal of Political Economy, 96, 6, 1988, pp. 1232–1265.
Kelejian, H. “Two-Stage Least Squares and Econometric Systems Linear in Parameters but Nonlinear in the Endogenous Variables.” Journal of the American Statistical Association, 66, 1971, pp. 373–374.
Kelejian, H., and I. Prucha. “A Generalized Moments Estimator for the Autoregressive Parameter in a Spatial Model.” International Economic Review, 40, 1999, pp. 509–533.
Kelly, J. “Linear Cross Equation Constraints and the Identification Problem.” Econometrica, 43, 1975, pp. 125–140.
Kennan, J. “The Duration of Contract Strikes in U.S. Manufacturing.” Journal of Econometrics, 28, 1985, pp. 5–28.
Kennedy, W., and J. Gentle. Statistical Computing. New York: Marcel Dekker, 1980.
Keuzenkamp, H., and J. Magnus. “The Significance of Testing in Econometrics.” Journal of Econometrics, 67, 1, 1995, pp. 1–257.
Keynes, J. The General Theory of Employment, Interest, and Money. New York: Harcourt, Brace, and Jovanovich, 1936.
Kiefer, N. “Testing for Independence in Multivariate Probit Models.” Biometrika, 69, 1982, pp. 161–166.


Kiefer, N., ed. “Econometric Analysis of Duration Data.” Journal of Econometrics, 28, 1, 1985, pp. 1–169.
Kiefer, N. “Economic Duration Data and Hazard Functions.” Journal of Economic Literature, 26, 1988, pp. 646–679.
Kiefer, N., and M. Salmon. “Testing Normality in Econometric Models.” Economics Letters, 11, 1983, pp. 123–127.
Kilian, L. “Small-Sample Confidence Intervals for Impulse Response Functions.” Review of Economics and Statistics, 80, 2, 1998, pp. 218–230.
Kim, H., and J. Pollard. “Cube Root Asymptotics.” Annals of Statistics, March 1990, pp. 191–219.
Kingdon, G., and R. Cassen. “Explaining Low Achievement at Age 16 in England.” Mimeo, Department of Economics, University of Oxford, 2007.
Kiviet, J. “On Bias, Inconsistency, and Efficiency of Some Estimators in Dynamic Panel Data Models.” Journal of Econometrics, 68, 1, 1995, pp. 63–78.
Kiviet, J., G. Phillips, and B. Schipp. “The Bias of OLS, GLS and ZEF Estimators in Dynamic SUR Models.” Journal of Econometrics, 69, 1995, pp. 241–266.
Kleijn, R., and H. K. van Dijk. “Bayes Model Averaging of Cyclical Decompositions in Economic Time Series.” Journal of Applied Econometrics, 21, 2, 2006, pp. 191–213.
Kleibergen, F. “Pivotal Statistics for Testing Structural Parameters in Instrumental Variables Regression.” Econometrica, 70, 2002, pp. 1781–1803.
Klein, L. Economic Fluctuations in the United States 1921–1941. New York: John Wiley and Sons, 1950.
Klein, R., and R. Spady. “An Efficient Semiparametric Estimator for Discrete Choice Models.” Econometrica, 61, 1993, pp. 387–421.
Klepper, S., and E. Leamer. “Consistent Sets of Estimates for Regressions with Errors in All Variables.” Econometrica, 52, 1983, pp. 163–184.
Klugman, S., and R. Parsa. “Fitting Bivariate Loss Distributions with Copulas.” Insurance: Mathematics and Economics, 24, 2000, pp. 139–148.
Kmenta, J. “On Estimation of the CES Production Function.” International Economic Review, 8, 1967, pp. 180–189.
Kmenta, J. Elements of Econometrics. New York: Macmillan, 1986.
Kmenta, J., and R. Gilbert. “Small Sample Properties of Alternative Estimators of Seemingly Unrelated Regressions.” Journal of the American Statistical Association, 63, 1968, pp. 1180–1200.
Knapp, L., and T. Seaks. “An Analysis of the Probability of Default on Federally Guaranteed Student Loans.” Review of Economics and Statistics, 74, 1992, pp. 404–411.
Knight, F. The Economic Organization. New York: Harper and Row, 1933.
Kobayashi, M. “A Bounds Test of Equality Between Sets of Coefficients in Two Linear Regressions When Disturbance Variances Are Unequal.” Journal of the American Statistical Association, 81, 1986, pp. 510–514.
Koenker, R. “A Note on Studentizing a Test for Heteroscedasticity.” Journal of Econometrics, 17, 1981, pp. 107–112.
Koenker, R., and G. Bassett. “Regression Quantiles.” Econometrica, 46, 1978, pp. 33–50.
Koenker, R., and G. Bassett. “Robust Tests for Heteroscedasticity Based on Regression Quantiles.” Econometrica, 50, 1982, pp. 43–61.
Koop, G. Bayesian Econometrics. New York: John Wiley and Sons, 2003.
Koop, G., and S. Potter. “Forecasting in Large Macroeconomic Panels Using Bayesian Model Averaging.” Econometrics Journal, 7, 2, 2004, pp. 161–185.
Koop, G., and J. Tobias. “Learning About Heterogeneity in Returns to Schooling.” Journal of Applied Econometrics, 19, 7, 2004, pp. 827–849.
Krailo, M., and M. Pike. “Conditional Multivariate Logistic Analysis of Stratified Case-Control Studies.” Applied Statistics, 44, 1, 1984, pp. 95–103.


Krueger, A. “Economic Scene.” New York Times, April 27, 2000, p. C2.
Krueger, A., and S. Dale. “Estimating the Payoff to Attending a More Selective College.” Working Paper 7322, NBER, Cambridge, 1999.
Krinsky, I., and L. Robb. “On Approximating the Statistical Properties of Elasticities.” Review of Economics and Statistics, 68, 4, 1986, pp. 715–719.
Krinsky, I., and L. Robb. “On Approximating the Statistical Properties of Elasticities: Correction.” Review of Economics and Statistics, 72, 1, 1990, pp. 189–190.
Krinsky, I., and L. Robb. “Three Methods for Calculating Statistical Properties for Elasticities.” Empirical Economics, 16, 1991, pp. 1–11.
Kumbhakar, S., and A. Heshmati. “Technical Change and Total Factor Productivity Growth in Swedish Manufacturing Industries.” Econometric Reviews, 15, 1996, pp. 275–298.
Kumbhakar, S., and K. Lovell. Stochastic Frontier Analysis. New York: Cambridge University Press, 2000.
Kwiatkowski, D., P. Phillips, P. Schmidt, and Y. Shin. “Testing the Null Hypothesis of Stationarity Against the Alternative of a Unit Root.” Journal of Econometrics, 54, 1992, pp. 159–178.
Kyriazidou, E. “Estimation of a Panel Data Sample Selection Model.” Econometrica, 65, 1997, pp. 1335–1364.
Kyriazidou, E. “Estimation of Dynamic Panel Data Sample Selection Models.” Review of Economic Studies, 68, 2001, pp. 543–572.
L’Ecuyer, P. “Good Parameters and Implementations for Combined Multiple Recursive Random Number Generators.” Working Paper, Department of Information Science, University of Montreal, 1998.
LaLonde, R. “Evaluating the Econometric Evaluations of Training Programs with Experimental Data.” American Economic Review, 76, 4, 1986, pp. 604–620.
Lambert, D. “Zero-Inflated Poisson Regression, with an Application to Defects in Manufacturing.” Technometrics, 34, 1, 1992, pp. 1–14.
Lancaster, T. The Econometric Analysis of Transition Data. New York: Cambridge University Press, 1990.
Lancaster, T. “The Incidental Parameters Problem since 1948.” Journal of Econometrics, 95, 2, 2000, pp. 391–414.
Lancaster, T. An Introduction to Modern Bayesian Econometrics. Oxford: Blackwell Publishers, 2004.
Landers, A. “Survey.” Chicago Tribune, 1984, passim.
Lawless, J. Statistical Models and Methods for Lifetime Data. New York: John Wiley and Sons, 1982.
Leamer, E. Specification Searches: Ad Hoc Inferences with Nonexperimental Data. New York: John Wiley and Sons, 1978.
LeCam, L. “On Some Asymptotic Properties of Maximum Likelihood Estimators and Related Bayes Estimators.” University of California Publications in Statistics, 1, 1953, pp. 277–330.
Lee, K., M. Pesaran, and R. Smith. “Growth and Convergence in a Multi-Country Empirical Stochastic Solow Model.” Journal of Applied Econometrics, 12, 1997, pp. 357–392.
Lee, L. “Estimation of Error Components Models with ARMA(p, q) Time Component—An Exact GLS Approach.” Number 78-104, Center for Economic Research, University of Minnesota, 1978.
Lee, L. “Generalized Econometric Models with Selectivity.” Econometrica, 51, 1983, pp. 507–512.
Lee, L. “Specification Tests for Poisson Regression Models.” International Economic Review, 27, 1986, pp. 689–706.
Lee, M. Method of Moments and Semiparametric Econometrics for Limited Dependent Variable Models. New York: Springer-Verlag, 1996.


Lee, M. Limited Dependent Variable Models. New York: Cambridge University Press, 1998.
Leff, N. “Dependency Rates and Savings Rates.” American Economic Review, 59, 5, 1969, pp. 886–896.
Leff, N. “Dependency Rates and Savings Rates: Reply.” American Economic Review, 63, 1, 1973, p. 234.
Lerman, R., and C. Manski. “On the Use of Simulated Frequencies to Approximate Choice Probabilities.” In C. Manski and D. McFadden, eds., Structural Analysis of Discrete Data with Econometric Applications. Cambridge: MIT Press, 1981.
Levi, M. “Errors in the Variables in the Presence of Correctly Measured Variables.” Econometrica, 41, 1973, pp. 985–986.
Levin, A., and C. Lin. “Unit Root Tests in Panel Data: Asymptotic and Finite Sample Properties.” Discussion Paper 92-93, Department of Economics, University of California, San Diego, 1992.
Levin, A., C. Lin, and C. Chu. “Unit Root Tests in Panel Data: Asymptotic and Finite Sample Properties.” Journal of Econometrics, 108, 2002, pp. 1–25.
Lewbel, A. “Semiparametric Qualitative Response Model Estimation with Unknown Heteroscedasticity or Instrumental Variables.” Journal of Econometrics, 97, 1, 2000, pp. 145–177.
Lewbel, A. “Semiparametric Estimation of Location and Other Discrete Choice Moments.” Econometric Theory, 14, 1997, pp. 32–51.
Lewbel, A., and B. Honoré. “Semiparametric Binary Choice Panel Data Models without Strictly Exogenous Regressors.” Econometrica, 70, 2002, pp. 2053–2063.
Lewis, H. “Comments on Selectivity Biases in Wage Comparisons.” Journal of Political Economy, 82, 1974, pp. 1149–1155.
Li, M., and J. Tobias. “Calculus Attainment and Grades Received in Intermediate Economic Theory.” Journal of Applied Econometrics, 21, 2006, pp. 893–896.
Li, Q., and J. Racine. Nonparametric Econometrics. Princeton: Princeton University Press, 2007.
Li, W., S. Ling, and M. McAleer. “A Survey of Recent Theoretical Results for Time Series Models with GARCH Errors.” Manuscript, Institute for Social and Economic Research, Osaka University, Osaka, 2001.
Liang, K., and S. Zeger. “Longitudinal Data Analysis Using Generalized Linear Models.” Biometrika, 73, 1986, pp. 13–22.
Lillard, L., and R. Willis. “Dynamic Aspects of Earning Mobility.” Econometrica, 46, 1978, pp. 985–1012.
Lintner, J. “Security Prices, Risk, and Maximal Gains from Diversification.” Journal of Finance, 20, 1965, pp. 587–615.
Litterman, R. “Techniques of Forecasting Using Vector Autoregressions.” Working Paper No. 15, Federal Reserve Bank of Minneapolis, 1979.
Litterman, R. “Forecasting with Bayesian Vector Autoregressions—Five Years of Experience.” Journal of Business and Economic Statistics, 4, 1986, pp. 25–38.
Little, R., and D. Rubin. Statistical Analysis with Missing Data. New York: John Wiley and Sons, 1987.
Liu, T. “Underidentification, Structural Estimation, and Forecasting.” Econometrica, 28, 1960, pp. 855–865.
Ljung, G., and G. Box. “On a Measure of Lack of Fit in Time Series Models.” Biometrika, 66, 1979, pp. 265–270.
Lo, A. “Long Term Memory in Stock Market Prices.” Econometrica, 59, 1991, pp. 1297–1313.
Loeve, M. Probability Theory. New York: Springer-Verlag, 1977.
Long, S. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications, 1997.
Longley, J. “An Appraisal of Least Squares Programs from the Point of the User.” Journal of the American Statistical Association, 62, 1967, pp. 819–841.


Louviere, J., D. Hensher, and J. Swait. Stated Choice Methods: Analysis and Applications. Cambridge: Cambridge University Press, 2000.
Lovell, M. “Seasonal Adjustment of Economic Time Series and Multiple Regression Analysis.” Journal of the American Statistical Association, 58, 1963, pp. 993–1010.
Lucas, R. “Econometric Policy Evaluation: A Critique.” In K. Brunner and A. Meltzer, eds., The Phillips Curve and the Labor Market. Amsterdam: North Holland, 1976.
Lucas, R. “Money Demand in the United States: A Quantitative Review.” Carnegie-Rochester Conference Series on Public Policy, 29, 1988, pp. 137–168.
Lutkepohl, H. Introduction to Multiple Time Series Analysis, 2nd ed. New York: Marcel Dekker, 2005.
MacDonald, G., and H. White. “Some Large Sample Tests for Nonnormality in the Linear Regression Model.” Journal of the American Statistical Association, 75, 1980, pp. 16–27.
MacKinnon, J., and H. White. “Some Heteroscedasticity Consistent Covariance Matrix Estimators with Improved Finite Sample Properties.” Journal of Econometrics, 29, 1985, pp. 305–325.
MacKinnon, J., H. White, and R. Davidson. “Tests for Model Specification in the Presence of Alternative Hypotheses: Some Further Results.” Journal of Econometrics, 21, 1983, pp. 53–70.
Maddala, G. “The Use of Variance Components Models in Pooling Cross Section and Time Series Data.” Econometrica, 39, 1971, pp. 341–358.
Maddala, G. “Some Small Sample Evidence on Tests of Significance in Simultaneous Equations Models.” Econometrica, 42, 1974, pp. 841–851.
Maddala, G. Econometrics. New York: McGraw-Hill, 1977a.
Maddala, G. “Limited Dependent Variable Models Using Panel Data.” Journal of Human Resources, 22, 1977b, pp. 307–338.
Maddala, G. Limited Dependent and Qualitative Variables in Econometrics. New York: Cambridge University Press, 1983.
Maddala, G. “Disequilibrium, Self-Selection, and Switching Models.” In Z. Griliches and M. Intriligator, eds., Handbook of Econometrics, Vol. 3. Amsterdam: North Holland, 1984.
Maddala, G. “Limited Dependent Variable Models Using Panel Data.” Journal of Human Resources, 22, 1987, pp. 307–338.
Maddala, G. Introduction to Econometrics, 2nd ed. New York: Macmillan, 1992.
Maddala, G. The Econometrics of Panel Data, Vols. I and II. Brookfield, VT: E. E. Elgar, 1993.
Maddala, G., and A. Flores-Lagunes. “Qualitative Response Models.” In B. Baltagi, ed., A Companion to Theoretical Econometrics. Oxford: Blackwell, 2001.
Maddala, G., and I. Kim. Unit Roots, Cointegration and Structural Change. Cambridge: Cambridge University Press, 1998.
Maddala, G., and T. Mount. “A Comparative Study of Alternative Estimators for Variance Components Models.” Journal of the American Statistical Association, 68, 1973, pp. 324–328.
Maddala, G., and F. Nelson. “Specification Errors in Limited Dependent Variable Models.” Working Paper 96, National Bureau of Economic Research, Cambridge, MA, 1975.
Maddala, G., and A. Rao. “Tests for Serial Correlation in Regression Models with Lagged Dependent Variables and Serially Correlated Errors.” Econometrica, 41, 1973, pp. 761–774.
Maddala, G., and S. Wu. “A Comparative Study of Unit Root Tests with Panel Data and a New Simple Test.” Oxford Bulletin of Economics and Statistics, 61, 1999, pp. 631–652.


Maeshiro, A. “On the Retention of the First Observation in Serial Correlation Adjustment of Regression Models.” International Economic Review, 20, 1979, pp. 259–265.
Magee, L., J. Burbidge, and L. Robb. “The Correlation Between Husband’s and Wife’s Education: Canada, 1971–1996.” Social and Economic Dimensions of an Aging Population Research Papers, 24, McMaster University, 2000.
Magnac, T. “State Dependence and Heterogeneity in Youth Unemployment Histories.” Working Paper, INRA and CREST, Paris, 1997.
Magnus, J. “Multivariate Error Components Analysis of Linear and Nonlinear Regression Models by Maximum Likelihood.” Journal of Econometrics, 19, 1982, pp. 239–285.
Magnus, J., and H. Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics. New York: John Wiley and Sons, 1988.
Malinvaud, E. Statistical Methods of Econometrics. Amsterdam: North Holland, 1970.
Mandy, D., and C. Martins-Filho. “Seemingly Unrelated Regressions Under Additive Heteroscedasticity: Theory and Share Equation Applications.” Journal of Econometrics, 58, 1993, pp. 315–346.
Mann, H., and A. Wald. “On the Statistical Treatment of Linear Stochastic Difference Equations.” Econometrica, 11, 1943, pp. 173–220.
Manski, C. “The Maximum Score Estimator of the Stochastic Utility Model of Choice.” Journal of Econometrics, 3, 1975, pp. 205–228.
Manski, C. “Semiparametric Analysis of Discrete Response: Asymptotic Properties of the Maximum Score Estimator.” Journal of Econometrics, 27, 1985, pp. 313–333.
Manski, C. “Operational Characteristics of the Maximum Score Estimator.” Journal of Econometrics, 32, 1986, pp. 85–100.
Manski, C. “Semiparametric Analysis of the Random Effects Linear Model from Binary Response Data.” Econometrica, 55, 1987, pp. 357–362.
Manski, C. “Learning About Treatment Effects from Experiments with Random Assignment of Treatments.” Journal of Human Resources, 31, 1996, pp. 709–733.
Manski, C. “Anatomy of the Selection Problem.” Journal of Human Resources, 24, 1989, pp. 343–360.
Manski, C. “Nonparametric Bounds on Treatment Effects.” American Economic Review, 80, 1990, pp. 319–323.
Manski, C. Analog Estimation Methods in Econometrics. London: Chapman and Hall, 1992.
Manski, C. Identification Problems in the Social Sciences. Cambridge: Harvard University Press, 1995.
Manski, C., and S. Lerman. “The Estimation of Choice Probabilities from Choice Based Samples.” Econometrica, 45, 1977, pp. 1977–1988.
Manski, C., and S. Thompson. “MSCORE: A Program for Maximum Score Estimation of Linear Quantile Regressions from Binary Response Data.” Mimeo, Department of Economics, University of Wisconsin, Madison, 1986.
Mariano, R. “Analytical Small-Sample Distribution Theory in Econometrics: The Simultaneous Equations Case.” International Economic Review, 23, 1982, pp. 503–534.
Marcus, A., and W. Greene. “The Determinants of Rating Assignment and Performance.” Working Paper CRC528, Center for Naval Analyses, 1985.
Markowitz, H. Portfolio Selection: Efficient Diversification of Investments. New York: John Wiley and Sons, 1959.
Mariano, R. “Simultaneous Equation Model Estimators: Statistical Properties.” In B. Baltagi, ed., A Companion to Theoretical Econometrics. Oxford: Blackwell, 2001.
Marsaglia, G., and T. Bray. “A Convenient Method of Generating Normal Variables.” SIAM Review, 6, 1964, pp. 260–264.


Martins, M. “Parametric and Semiparametric Estimation of Sample Selection Models: An Empirical Application to the Female Labour Force in Portugal.” Journal of Applied Econometrics, 16, 1, 2001, pp. 23–40.
Matyas, L. Generalized Method of Moments Estimation. Cambridge: Cambridge University Press, 1999.
Matyas, L., and P. Sevestre, eds. The Econometrics of Panel Data: Handbook of Theory and Applications, 2nd ed. Dordrecht: Kluwer-Nijhoff, 1996.
Matzkin, R. “Nonparametric Identification and Estimation of Polytomous Choice Models.” Journal of Econometrics, 58, 1993, pp. 137–168.
Mazodier, P., and A. Trognon. “Heteroscedasticity and Stratification in Error Components Models.” Annales de l’Insee, 30, 1978, pp. 451–482.
McAleer, M. “The Significance of Testing Empirical Non-Nested Models.” Journal of Econometrics, 67, 1995, pp. 149–171.
McAleer, M., G. Fisher, and P. Volker. “Separate Misspecified Regressions and the U.S. Long-Run Demand for Money Function.” Review of Economics and Statistics, 64, 1982, pp. 572–583.
McCallum, B. “Relative Asymptotic Bias from Errors of Omission and Measurement.” Econometrica, 40, 1972, pp. 757–758.
McCallum, B. “A Note Concerning Covariance Expressions.” Econometrica, 42, 1973, pp. 581–583.
McCoskey, S., and C. Kao. “Testing the Stability of a Production Function with Urbanization as a Shift Factor: An Application of Nonstationary Panel Data Techniques.” Oxford Bulletin of Economics and Statistics, 61, 1999, pp. 671–690.
McCoskey, S., and T. Selden. “Health Care Expenditures and GDP: Panel Data Unit Root Test Results.” Journal of Health Economics, 17, 1998, pp. 369–376.
McCullagh, P., and J. Nelder. Generalized Linear Models. New York: Chapman and Hall, 1983.
McCullough, B. “Consistent Forecast Intervals When the Forecast-Period Exogenous Variables Are Stochastic.” Journal of Forecasting, 15, 1996, pp. 293–304.
McCullough, B. “Econometric Software Reliability: E-Views, LIMDEP, SHAZAM, and TSP.” Journal of Applied Econometrics, 14, 2, 1999, pp. 191–202.
McCullough, B., and C. Renfro. “Benchmarks and Software Standards: A Case Study of GARCH Procedures.” Journal of Economic and Social Measurement, 25, 2, 1999, pp. 27–37.
McCullough, B., and H. Vinod. “The Numerical Reliability of Econometric Software.” Journal of Economic Literature, 37, 2, 1999, pp. 633–665.
McDonald, J., and R. Moffitt. “The Uses of Tobit Analysis.” Review of Economics and Statistics, 62, 1980, pp. 318–321.
McElroy, M. “Goodness of Fit for Seemingly Unrelated Regressions: Glahn’s R²y,x and Hooper’s r̄².” Journal of Econometrics, 6, 1977, pp. 381–387.
McFadden, D. “Conditional Logit Analysis of Qualitative Choice Behavior.” In P. Zarembka, ed., Frontiers in Econometrics. New York: Academic Press, 1974.
McFadden, D. “The Measurement of Urban Travel Demand.” Journal of Public Economics, 3, 1974, pp. 303–328.
McFadden, D. “Econometric Analysis of Qualitative Response Models.” In Z. Griliches and M. Intriligator, eds., Handbook of Econometrics, Vol. 2. Amsterdam: North Holland, 1984.
McFadden, D. “Regression Based Specification Tests for the Multinomial Logit Model.” Journal of Econometrics, 34, 1987, pp. 63–82.
McFadden, D. “A Method of Simulated Moments for Estimation of Discrete Response Models without Numerical Integration.” Econometrica, 57, 1989, pp. 995–1026.
McFadden, D., and K. Train. “Mixed Multinomial Logit Models for Discrete Response.” Journal of Applied Econometrics, 15, 2000, pp. 447–470.


McFadden, D., and P. Ruud. “Estimation by Simulation.” Review of Economics and Statistics, 76, 1994, pp. 591–608.
McGuire, T., J. Farley, R. Lucas, and R. Winston. “Estimation and Inference for Models in Which Subsets of the Dependent Variable Are Constrained.” Journal of the American Statistical Association, 63, 1968, pp. 1201–1213.
McKenzie, C. “Microfit 4.0.” Journal of Applied Econometrics, 13, 1998, pp. 77–90.
McLachlan, G., and T. Krishnan. The EM Algorithm and Extensions. New York: John Wiley and Sons, 1997.
McLachlan, G., and D. Peel. Finite Mixture Models. New York: John Wiley and Sons, 2000.
McLaren, K. “Parsimonious Autocorrelation Corrections for Singular Demand Systems.” Economics Letters, 53, 1996, pp. 115–121.
Melenberg, B., and A. van Soest. “Parametric and Semi-Parametric Modelling of Vacation Expenditures.” Journal of Applied Econometrics, 11, 1, 1996, pp. 59–76.
Merton, R. “On Estimating the Expected Return on the Market.” Journal of Financial Economics, 8, 1980, pp. 323–361.
Messer, K., and H. White. “A Note on Computing the Heteroscedasticity Consistent Covariance Matrix Using Instrumental Variable Techniques.” Oxford Bulletin of Economics and Statistics, 46, 1984, pp. 181–184.
Meyer, B. “Semiparametric Estimation of Hazard Models.” Manuscript, Department of Economics, Northwestern University, 1988.
Mills, T. Time Series Techniques for Economists. New York: Cambridge University Press, 1990.
Mills, T. The Econometric Modelling of Financial Time Series. New York: Cambridge University Press, 1993.
Min, C., and A. Zellner. “Bayesian and Non-Bayesian Methods for Combining Models and Forecasts with Applications to Forecasting International Growth Rates.” Journal of Econometrics, 56, 1993, pp. 89–118.
Mittelhammer, R., G. Judge, and D. Miller. Econometric Foundations. Cambridge: Cambridge University Press, 2000.
Mizon, G. “A Note to Autocorrelation Correctors: Don’t.” Journal of Econometrics, 69, 1, 1995, pp. 267–288.
Mizon, G., and J. Richard. “The Encompassing Principle and Its Application to Testing Nonnested Models.” Econometrica, 54, 1986, pp. 657–678.
Moffitt, R. “Comment on Estimation of Limited Dependent Variable Models with Dummy Endogenous Regressors: Simple Strategies for Empirical Practice.” Journal of Business and Economic Statistics, 19, 1, 2001, pp. 26–27.
Moscone, F., M. Knapp, and E. Tosetti. “Mental Health Expenditures in England: A Spatial Panel Approach.” Journal of Health Economics, 2007, forthcoming.
Mohanty, M. “A Bivariate Probit Approach to the Determination of Employment: A Study of Teen Employment Differentials in Los Angeles County.” Applied Economics, 34, 2, 2002, pp. 143–156.
Moran, P. “Notes on Continuous Stochastic Phenomena.” Biometrika, 37, 1950, pp. 17–23.
Moschini, G., and D. Moro. “Autocorrelation Specification in Singular Equation Systems.” Economics Letters, 46, 1994, pp. 303–309.
Mroz, T. “The Sensitivity of an Empirical Model of Married Women’s Hours of Work to Economic and Statistical Assumptions.” Econometrica, 55, 1987, pp. 765–799.

1134 References

Mullahy, J. “Specification and Testing of Some Nelder, J., and R. Mead. “A Simplex Method Modified Count Data Models.” Journal for Function Minimization.” Computer of Econometrics, 33, 1986, pp. 341–365. Journal, 7, 1965, pp. 308–313. Mullahy, J. “Weighted Least Squares Esti- Nelson, C., and H. Kang. “Pitfalls in the Use mation of the Linear Probability Model, of Time as an Explanatory Variable in Revisited.” Economics Letters, 32, 1990, Regression.” Journal of Business and pp. 35–41. Economic Statistics, 2, 1984, pp. 73–82. Mundlak, Y. “On the Pooling of Time Series Nelson, C., and C. Plosser. “Trends and and Cross Sectional Data.” Economet- Random Walks in Macroeconomic Time rica, 56, 1978, pp. 69–86. Series: Some Evidence and Implica- Munkin, M., and P. Trivedi. “Simulated tions.” Journal of Monetary Economics, Maximum Likelihood Estimation of 10, 1982, pp. 139–162. Multivariate Mixed Poisson Regression Nelson, F. “A Test for Misspecification in the Models with Application,” Econometric Censored Normal Model.” Economet- Journal, 1, 1, 1999, pp. 1–21. rica, 49, 1981, pp. 1317–1329. Munnell, A. “Why Has Productivity De- Nerlove, M. “Returns to Scale in Electricity clined? Productivity and Public Invest- Supply.” In C. Christ, ed., Measurement ment.” New England Economic Review, in Economics: Studies in Mathematical 1990, pp. 3–22. Economics and Econometrics in Memory Murphy, K., and R. Topel. “Estimation and of Yehuda Grunfeld. Palo Alto: Stanford Inference in Two Step Econometric University Press, 1963. Models.” Journal of Business and Eco- Nerlove, M. “Further Evidence on the Estima- nomic Statistics, 3, 1985, pp. 370–379. tion of Dynamic Relations from a Time Reprinted, 20, 2002, pp. 88–97. Series of Cross Sections.” Econometrica, Nagin, D., and K. Land. “Age, Criminal 39, 1971a, pp. 359–382. Careers, and Population Heterogene- Nerlove, M. “A Note on Error Components ity: Specification and Estimation of a Models.” Econometrica, 39, 1971b, Nonparametric Mixed Poisson Model.” pp. 383–396. Criminology, 31, 3, 1993, pp. 327–362. Nerlove, M. “Lags in Economic Behavior.” Nair-Reichert, U., and D. Weinhold. “Causal- Econometrica, 40, 1972, pp. 221–251. ity Tests for Cross Country Panels: A Nerlove, M. Essays in Panel Data Economet- Look at FDI and Economic Growth rics. Cambridge: Cambridge University in Less Developed Countries.” Oxford Press, 2002. Bulletin of Economics and Statistics, 63, Nerlove, M., and S. Press. “Univariate and 2, 2001, pp. 153–171. Multivariate Log-Linear and Logistic Nakamura, A., and M. Nakamura. “On the Models.” RAND—R1306-EDA/NIH, Relationships Among Several Specifi- Santa Monica, 1973. cation Error Tests Presented by Durbin, Nerlove, M., and K. Wallis. “Use of the Wu, and Hausman.” Econometrica, 49, Durbin–Watson Statistic in Inappropri- 1981, pp. 1583–1588. ate Situations.” Econometrica, 34, 1966, Nakamura, A., and M. Nakamura. “Part-Time pp. 235–238. and Full-Time Work Behavior of Married Newbold, P. “Testing Causality Using Effi- Women: A Model with a Doubly Trun- ciently Parameterized Vector ARMA cated Dependent Variable.” Canadian Models.”Applied Mathematics and Com- Journal of Economics, 1983, pp. 229–257. putation, 20, 1986, pp. 184–199. Nakosteen, R., and M. Zimmer. “Migration Newbold, P. “Significance Levels of the Box– and Income: The Question of Self- Pierce Portmanteau Statistic in Finite Selection.” Southern Economic Journal, Samples.” Biometrika, 64, 1977, pp. 67– 46, 1980, pp. 840–851. 71. Greene-50558 book June 25, 2007 10:16

References 1135

Newey, W. “A Method of Moments Inter- Nickell, S. “Biases in Dynamic Models with pretation of Sequential Estimators.” Fixed Effects.” Econometrica, 49, 1981, Economics Letters, 14, 1984, pp. 201–206. pp. 1417–1426. Newey, W. “Maximum Likelihood Specifica- Oaxaca, R. “Male–Female Wage Differentials tion Testing and Conditional Moment in Urban Labor Markets.” International Tests.” Econometrica, 53, 1985a, pp. 1047– Economic Review, 14, 1973, pp. 693–708. 1070. Oberhofer, W., and J. Kmenta. “A General Newey, W. “Generalized Method of Mo- Procedure for Obtaining Maximum ments Specification Testing.” Journal of Likelihood Estimates in Generalized Econometrics, 29, 1985b, pp. 229–256. Regression Models.” Econometrica, 42, Newey, W. “Specification Tests for Dis- 1974, pp. 579–590. tributional Assumptions in the Tobit Ohtani, K., and M. Kobayashi. “A Bounds Model.” Journal of Econometrics, 34, Test for Equality Between Sets of Coeffi- 1986, pp. 125–146. cients in Two Linear Regression Models Newey, W. “Efficient Estimation of Lim- Under Heteroscedasticity.” Econometric ited Dependent Variable Models with Theory, 2, 1986, pp. 220–231. Endogenous Explanatory Variables.” Ohtani, K., and T. Toyoda. “Estimation of Journal of Econometrics, 36, 1987, Regression Coefficients After a Prelimi- pp. 231–250. nary Test for Homoscedasticity.” Journal Newey, W. “The Asymptotic Variance of of Econometrics, 12, 1980, pp. 151–159. Semiparametric Estimators,” Economet- Ohtani, K., and T. Toyoda. “Small Sample rica, 62, 1994, pp. 1349–1382. Properties of Tests of Equality Between Newey, W., and D. McFadden. “Large Sample Sets of Coefficients in Two Linear Re- Estimation and Hypothesis Testing.” gressions Under Heteroscedasticity.” In R. Engle and D. McFadden, eds., International Economic Review, 26, 1985, Handbook of Econometrics, Vol. IV, pp. 37–44. Chapter 36, 1994. Olsen, R. “A Note on the Uniqueness of the Newey, W., J. Powell, and J. Walker. “Semi- Maximum Likelihood Estimator in the parametric Estimation of Selection Tobit Model.” Econometrica, 46, 1978, Models.” American Economic Review, pp. 1211–1215. 80, 1990, pp. 324–328. Orea, C., and S. Kumbhakar. “Efficiency Newey, W. “Two Step Series Estimation of Measurement Using a Latent Class Sample Selection Models.” Manuscript, Stochastic Frontier Model.” Empirical Department of Economics, MIT, 1991. Economics, 29, 2004, pp. 169–184. Newey, W., and K. West. “A Simple Positive Orcutt, G., S. Caldwell, and R. Wertheimer. Semi-Definite, Heteroscedasticity and Policy Exploration through Microanalytic Autocorrelation Consistent Covari- Simulation. Washington, D.C.: Urban ance Matrix.” Econometrica, 55, 1987a, Institute, 1976. pp. 703–708. Orme, C. “Nonnested Tests for Discrete Newey, W., and K. West. “Hypothesis Testing Choice Models.” Working Paper, De- with Efficient Method of Moments partment of Economics, University of Estimation.” International Economic York, 1994. Review, 28, 1987b, pp. 777–787. Orme, C. “Double and Triple Length Re- New York Post. “America’s New Big Wheels gressions for the Information Matrix of Fortune.” May 22, 1987, p. 3. Test and Other Conditional Moment Neyman, J., and E. Scott. “Consistent Es- Tests.” Mimeo, University of York, U.K., timates Based on Partially Consistent Department of Economics, 1990. Observations.” Econometrica, 16, 1948, Osterwald-Lenum, M. “A Note on Quantiles pp. 1–32. of the Asymptotic Distribution of the Greene-50558 book June 25, 2007 10:16

1136 References

Maximum Likelihood Cointegration Patterson, K. An Introduction to Applied Rank Test Statistics.” Oxford Bulletin Econometrics. New York: St. Martin’s of Economics and Statistics, 54, 1992, Press, 2000. pp. 461–472. Pedroni, P., “Fully Modified OLS for Hetero- Pagan, A., and A. Ullah. “The Econometric geneous Cointegrated Panels,” Advances Analysis of Models with Risk Terms.” in Econometrics, 15, 2000, pp. 93–130. Journal of Applied Econometrics, 3, 1988, Pedroni, P. “Purchasing Power Parity Tests pp. 87–105. in Cointegrated Panels.” Review of Pagan, A., and A. Ullah. Nonparametric Economics and Statistics, 83, 2001, Econometrics. Cambridge: Cambridge pp. 727–731. University Press, 1999. Pedroni, P. “Panel Cointegration: Asymptotic Pagan, A., and F. Vella. “Diagnostic Tests for and Finite Sample Properties of Pooled Models Based on Individual Data: A Sur- Time Series Tests with an Application vey.” Journal of Applied Econometrics, 4, to the PPP Hypothesis.” Econometric Supplement, 1989, pp. S29–S59. Theory, 20, 2004, pp. 597–625. Pagan, A., and M. Wickens. “A Survey of Pesaran, H. “On the General Problem of Some Recent Econometric Methods.” Model Selection.” Review of Economic Economic Journal, 99, 1989, pp. 962–1025. Studies, 41, 1974, pp. 153–171. Pagano, M., and M. Hartley. “On Fitting Pesaran, M. The Limits to Rational Expecta- Distributed Lag Models Subject to Poly- tions. Oxford: Blackwell, 1987. nomial Restrictions.” Journal of Econo- Pesaran, M., and A. Hall. “Tests of Non- metrics, 16, 1981, pp. 171–198. Nested Linear Regression Models Pakes, A., and D. Pollard. “Simulation Subject to Linear Restrictions.” Eco- and the Asymptotics of Optimization nomics Letters, 27, 1988, pp. 341–348. Estimators.” Econometrica, 57, 1989, Pesaran, H., and A. Deaton. “Testing Non- pp. 1027–1058. Nested Nonlinear Regression Models.” Park, R., R. Sickles, and L. Simar. “Semi- Econometrica, 46, 1978, pp. 677–694. parametric Efficient Estimation of Pesaran, M., and B. Pesaran. “A Simulation Panel Data Models with AR(1) Errors.” Approach to the Problem of Computing Manuscript, Demartment of Economics, Cox’s Statistic for Testing Nonnested Rice University, 2000. Models.” Journal of Econometrics, 57, Parks, R. “Efficient Estimation of a Sys- 1993, pp. 377–392. tem of Regression Equations When Pesaran, M., and R. Smith. “Estimating Disturbances Are Both Serially and Con- Long Run Relationships from Dynamic temporaneously Correlated.” Journal of Heterogeneous Panels.” Journal of the American Statistical Association, 62, Econometrics, 68, 1995, pp. 79–113. 1967, pp. 500–509. Pesaran, H., R. Smith, and K. Im. “Dy- Passmore, W. “The GSE Implicit Subsity and namic Linear Models for Heterogeneous the Value of Government Ambiguity.” Panels.” In L. Matyas and P. Sevestre, FEDS Working Paper No. 2005-05, Board eds., The Econometrics of Panel Data: of Governors of the Federal Reserve— A Handbook of the Theory with Ap- Household and Real Estate Finance plications, 2nd ed. Dordrecht: Kluwer Section, 2005. Academic Publishers, 1996. Passmore, W., S. Sherlund, and G. Burgess. Pesaran, M., Y. Shin, and R. Smith. “Pooled “The Effect of Housing Government Mean Group Estimation of Dynamic Sponsored Enterprises on Mortgage Heterogeneous Panels.” Journal of the Rates,” Real Estate Economics, 33, 3, American Statistical Association, 94, 2005, pp. 427–463. 1999, pp. 621–634. Greene-50558 book June 25, 2007 10:16

References 1137

Pesaran, M., and P. Schmidt. Handbook Poirier, D. The Econometrics of Structural of Applied Econometrics: Volume II: Change. Amsterdam: North Holland, Microeconomics. London: Blackwell 1974. Publishers, 1997. Poirier, D. “The Use of the Box–Cox Transfor- Pesaran, H., and M. Weeks. “Nonnested mation in Limited Dependent Variable Hypothesis Testing: An Overview.” In Models.” Journal of the American Statis- B. Baltagi ed., A Companion to Theo- tical Association, 73, 1978, pp. 284–287. retical Econometrics, Blackwell, Oxford, Poirier, D. “Partial Observability in Bivariate 2001. Probit Models.” Journal of Econometrics, Petersen, D., and D. Waldman. “The Treat- 12, 1980, pp. 209–217. ment of Heteroscedasticity in the Limited Poirier, D. “Frequentist and Subjectivist Dependent Variable Model.” Mimeo, Perspectives on the Problems of Model University of North Carolina, Chapel Building in Economics.” Journal of Eco- Hill, November 1981. nomic Perspectives, 2, 1988, pp. 121–170. Petersen, T. “Fitting Parametric Survival Poirier, D., ed. “Bayesian Empirical Studies Models with Time Dependent Covari- in Economics and Finance.” Journal of ates.” Journal of the Royal Statistical Econometrics, 49, 1991, pp. 1–304. Society, Series C (Applied Statistics), 35, Poirier, D. Intermediate Statistics and Econo- 1986, pp. 281–288. metrics. Cambridge: MIT Press, 1995, Phillips, A. “Stabilization Policies and the pp. 1–217. Time Form of Lagged Responses.” Poirier, D., and A. Melino. “A Note on Economic Journal, 67, 1957, pp. 265–277. the Interpretation of Regression Co- Phillips, P. “Exact Small Sample Theory in efficients Within a Class of Truncated the Simultaneous Equations Model.” In Distributions.” Econometrica, 46, 1978, Z. Griliches and M. Intriligator, eds., pp. 1207–1209. Handbook of Econometrics, Vol. 1. Poirier, D., and J. Tobias. “Bayesian Econo- Amsterdam: North Holland, 1983. metrics.” In T. Mills and K. Patterson, Phillips, P. “Understanding Spurious Regres- eds., Palgrave Handbook of Economet- sions.” Journal of Econometrics, 33, 1986, rics, Volume 1: Theoretical Econometrics. pp. 311–340. London: Palgrave-Macmillan, 2006. Phillips, P. “Time Series Regressions with Powell, A. “Aitken Estimators as a Tool a Unit Root.” Econometrica, 55, 1987, in Allocating Predetermined Aggre- pp. 277–301. gates.” Journal of the American Statistical Phillips, P. and H. Moon, “Linear Regression Association, 44, 1969, pp. 913–922. Limit Theory for Nonstationary Panel Powell, J. “Least Absolute Deviations Es- Data,” Econometrica, 67, 1999, pp. 1057– timation for Censored and Truncated 1111. Regression Models.” Technical Report Phillips, P., and H. Moon. “Nonstationary 356, Stanford University, IMSSS, 1981. Panel Data Analysis: An Overview of Powell, J. “Least Absolute Deviations Es- Some Recent Developments.” Econo- timation for the Censored Regression metric Reviews, 19, 2000, pp. 263–286. Model.” Journal of Econometrics, 25, Phillips, P., and S. Ouliaris. “Asymptotic 1984, pp. 303–325. Properties of Residual Based Tests for Powell, J. “Censored Regression Quantiles.” Cointegration.” Econometrica, 58, 1990, Journal of Econometrics, 32, 1986a, pp. 165–193. pp. 143–155. Phillips, P., and P. Perron. “Testing for a Powell, J. “Symmetrically Trimmed Least Unit Root in Time Series Regression.” Squares Estimation for Tobit Models.” Biometrika, 75, 1988, pp. 335–346. Econometrica, 54, 1986b, pp. 1435–1460. Greene-50558 book June 25, 2007 10:16

1138 References

Powell, M. “An Efficient Method for Finding on Bayesian Statistics. New York: Oxford the Minimum of a Function of Several University Press, 1992, pp. 763–774. Variables without Calculating Deriva- Raj, B., and B. Baltagi, eds. Panel Data Anal- tives.” Computer Journal, 1964, pp. 165– ysis. Heidelberg: Physica-Verlag, 1992. 172. Rao, C. Linear Statistical Inference and Its Prais, S., and H. Houthakker. The Analysis of Applications. New York: John Wiley and Family Budgets. New York: Cambridge Sons, 1973. University Press, 1955. Rao, C. “Information and Accuracy At- Prais, S., and C. Winsten. “Trend Estima- tainable in Estimation of Statistical tion and Serial Correlation.” Cowles Parameters.” Bulletin of the Calcutta Commission Discussion Paper No. 383, Mathematical Society, 37, 1945, pp. 81–91. Chicago, 1954. Rasch, G. “Probabilistic Models for Some Pratt, J. “Concavity of the Log Likelihood.” Intelligence and Attainment Tests.” Den- Journal of the American Statistical mark Paedogiska, Copenhagen, 1960. Association, 76, pp. 103–106. Reichert, U., and D. Weinhold. “Causality Prentice, R., and L. Gloeckler. “Regression Tests for Cross Country Panels: A New Analysis of Grouped Survival Data with Look at FDI and Economic Growth Application to Breast Cancer Data.” in Developing Countries.” Manuscript, Biometrics, 34, 1978, pp. 57–67. Department of Economics, Georgia Press, W., B. Flannery, S. Teukolsky, and W. Institute of Technology, 2000. Vetterling. Numerical Recipes: The Art Renfro, C. “Econometric Software,” Journal of Scientific Computing. Cambridge: of Economic and Social Measurement, Cambridge University Press, 1986. forthcoming, 2007. Pudney,S., and M. Shields. “Gender, Race, Pay Revankar, N. “Some Finite Sample Results in and Promotion in the British Nursing Pro- the Context of Two Seemingly Unrelated fession: Estimation of a Generalized Or- Regression Equations.” Journal of the dered Probit Model.” Journal of Applied American Statistical Association, 69, Econometrics, 15, 4, 2000, pp. 367–399. 1974, pp. 187–190. Quandt, R. “Econometric Disequilibrium Revankar, N. “Use of Restricted Residuals in Models.” Econometric Reviews, 1, 1982, SUR Systems: Some Finite Sample Re- pp. 1–63. sults.” Journal of the American Statistical Quandt, R. “Computational Problems and Association, 71, 1976, pp. 183–188. Methods.” In Z. Griliches and M. Intrili- Revelt, D., and K. Train. “Incentives for Ap- gator, eds., Handbook of Econometrics, pliance Efficiency: Random-Parameters Vol. 1. Amsterdam: North Holland, 1983. Logit Models of Households’ Choices.” Quandt, R. The Econometrics of Disequilib- Manuscript, Department of Economics, rium. New York: Basil Blackwell, 1988. University of California, Berkeley, Quandt, R., and J. Ramsey. “Estimating 1996. Mixtures of Normal Distributions and Revelt, D., and K. Train. “Customer Specific Switching Regressions.” Journal of the Taste Parameters and Mixed Logit: American Statistical Association, 73, Households’ Choice of Electricity December 1978, pp. 730–738. Supplier.” Economics Working Paper, Quester, A., and W. Greene. “Divorce Risk E00-274, Department of Economics, Uni- and Wives’ Labor Supply Behavior.” So- versity of California at Berkeley, 2000. cial Science Quarterly, 63, 1982, pp. 16–27. Ridder, G., and T. Wansbeek. “Dynamic Mod- Raftery, A., and S. Lewis. “How Many els for Panel Data.” In R. van der Ploeg, Iterations in the Gibbs Sampler?” In ed., Advanced Lectures in Quantitative J. Bernardo et al., eds., Proceedings of the Economics. 
New York: Academic Press, Fourth Valencia International Conference 1990, pp. 557–582. Greene-50558 book June 25, 2007 10:16

References 1139

Riersøl, O. “Identifiability of a Linear Rossi, P., and G. Allenby. “Marketing Models Relation Between Variables Which Are of Consumer Heterogeneity.” Journal of Subject to Error.” Econometrica, 18, Econometrics, 89, 1999, pp. 57–78. 1950, pp. 375–389. Rossi, P., G. Allenby, and R. McCulloch. Riphahn, R., A. Wambach, and A. Million. Bayesian Statistics and Marketing. New “Incentive Effects in the Demand for York: John Wiley and Sons, 2005. Health Care: A Bivariate Panel Count Rothstein, J. “Does Competition Among Data Estimation.” Journal of Applied Public Schools Benefit Students and Tax- Econometrics, 18, 4, 2003, pp. 387–405. payers? A Comment on Hoxby (2000).” Rivers, D., and Q. Vuong. “Limited Informa- Working Paper Number 10, Princeton tion Estimators and Exogeneity Tests for University, Education Research Section, Simultaneous Probit Models.” Journal 2004. of Econometrics, 39, 1988, pp. 347– Rotnitzky, A., and J. Robins. “Inverse Prob- 366. ability Weighted Estimation in Survival Roberts, G. “Convergence Diagnostics of the Analysis.” In P.Armitage and T. Coulton, Gibbs Sampler.” In J. Bernardo et al., eds., Encyclopedia of Biostatistics. New eds., Proceedings of the Fourth Valencia York: Wiley, 2005. International Conference on Bayesian Rubin, D. “Inference and Missing Data.” Statistics. New York: Oxford University Biometrika, 63, 1976, pp. 581–592. Press, 1992, pp. 775–782. Rubin, D. Multiple Imputation for Non- Robertson, D., and J. Symons. “Some Strange response in Surveys. New York: John Properties of Panel Data Estimators.” Wiley and Sons, 1987. Journal of Applied Econometrics, 7, 1992, Rubin, H. “Consistency of Maximum Likeli- pp. 175–189. hood Estimators in the Explosive Case.” Robinson, C., and N. Tomes. “Self Selec- In T. Koopmans, ed., Statistical Inference tion and Interprovincial Migration in in Dynamic Economic Models. New Canada.” Canadian Journal of Eco- York: John Wiley and Sons, 1950. nomics, 15, 1982, pp. 474–502. Ruud, P. An Introduction to Classical Robinson, P. “Semiparametric Economet- Econometric Theory. Oxford: Oxford rics: A Survey.” Journal of Applied University Press, 2000. Econometrics, 3, 1988, pp. 35–51. Ruud, P. “A Score Test of Consistency.” Rogers, W. “Calculation of Quantile Regres- Manuscript, Department of Economics, sion Standard Errors.” Stata Technical University of California, Berkeley, 1982. Bulletin No. 13, Stata Corporation, Ruud, P. “Tests of Specification in Econo- College Station, TX, 1993. metrics.” Econometric Reviews, 3, 1984, Rosenbaum, P., and D. Rubin. “The Central pp. 211–242. Role of the Propensity Score in Ob- Ruud, P. “Consistent Estimation of Limited servational Studies for Causal Effects.” Dependent Variable Models Despite Biometrika, 70, 1983, pp. 41–55. Misspecification of the Distribution.” Rosenblatt, D. “Remarks on Some Nonpara- Journal of Econometrics, 32, 1986, metric Estimates of a Density Function.” pp. 157–187. Annals of Mathematical Statistics, 27, Sala-i-Martin, X. “The Classical Approach 1956, pp. 832–841. to Convergence Analysis.” Economic Rosett, R., and F. Nelson. “Estimation of the Journal, 106, 1996, pp. 1019–1036. Two-Limit Probit Regression Model.” Sala-i-Martin, X. “I Just Ran Two Million Re- Econometrica, 43, 1975, pp. 141–146. gressions.” American Economic Review, Rossi, P., and G. Allenby. “Bayesian Statistics 87, 1997, pp. 178–183. and Marketing.” Marketing Science, 22, Salem, D., and T. Mount. “A Convenient 2003, 304–328. Descriptive Model of the Income Greene-50558 book June 25, 2007 10:16

1140 References

Distribution.” Econometrica, 42, 6, 1974, Sepanski, J. “On a Random Coefficients pp. 1115–1128. Probit Model.” Communications in Saxonhouse, G. “Estimated Parameters as Statistics—Theory and Methods, 29, 2000, Dependent Variables.” American Eco- pp. 2493–2505. nomic Review, 46, 1, 1976, pp. 178–183. Shapiro, M., and M. Watson. “Sources of Savin, E., and K. White. “The Durbin–Watson Business Cycle Fluctuations.” In O. Test for Serial Correlation with Extreme Blanchard and S. Fischer, eds., NBER Sample Sizes or Many Regressors.” Macroeconomics Annual, Cambridge: Econometrica, 45(8), 1977, pp. 1989–1996. MIT Press, 1988, pp. 111–148. Sawtooth Software. The CBC/HB Module for Sharpe, W. “Capital Asset Prices: A Theory of Hierarchical Bayes Estimation. 2006. Market Equilibrium Under Conditions Saxonhouse, G. “Estimated Parameters as of Risk.” Journal of Finance, 19, 1964, Dependent Variables.” American Eco- pp. 425–442. nomic Review, 66, 1, 1976, pp. 178–183. Shaw, D. “On-Site Samples’ Regression Schimek, M. (ed.). Smoothing and Regression: Problems of Nonnegative Integers, Approaches, Computation, and Applica- Truncation, and Endogenous Stratifica- tions, New York: John Wiley and Sons, tion.” Journal of Econometrics, 37, 1988, 2000. pp. 211–223. Schmidt, P. Econometrics. New York: Marcel Shea, J. “Instrument Relevance in Multivari- Dekker, 1976. ate Linear Models: A Simple Measure.” Review of Economics and Statistics, 79, Schmidt, P. “Estimation of Seemingly Un- 1997, pp. 348–352. related Regressions with Unequal Numbers of Observations.” Journal of Shephard, R. The Theory of Cost and Pro- Econometrics, 5, 1977, pp. 365–377. duction. Princeton: Princeton University Schmidt, P., and R. Sickles. “Some Further Press, 1970. Evidence on the Use of the Chow Test Shumway, R. Applied Statistical Time Series. Under Heteroscedasticity.” Economet- Englewood Cliffs, NJ: Prentice Hall, rica, 45, 1977, pp. 1293–1298. 1988. Schmidt, P., and R. Sickles. “Production Sickles, R. “A Nonlinear Multivariate Error Frontiers and Panel Data.” Journal of Components Analysis of Technology and Business and Economic Statistics, 2, 1984, Specific Factor Productivity Growth with pp. 367–374. an Application to U.S. Airlines.” Journal Schmidt, P., and R. Strauss. “The Prediction of Econometrics, 27, 1985, pp. 61–78. of Occupation Using Multinomial Logit Sickles, R., B. Park, and L. Simar. “Semi- Models.” International Economic Review, parametric Efficient Estimation of 16, 1975a, pp. 471–486. Panel Models with AR(1) Errors.” Schmidt, P., and R. Strauss. “Estimation of Manuscript, Department of Economics, Models with Jointly Dependent Quali- Rice University, 2000. tative Variables: A Simultaneous Logit Sickles, R., D. Good, and R. Johnson. “Alloca- Approach.” Econometrica, 43, 1975b, tive Distortions and the Regulatory Tran- pp. 745–755. sition of the Airline Industry.” Journal of Schwert, W. “Tests for Unit Roots: A Monte Econometrics, 33, 1986, pp. 143–163. Carlo Investigation.” Journal of Busi- Silk, J. “Systems Estimation: A Comparison ness and Economic Statistics, 7, 1989, of SAS, Shazam, and TSP.” Journal of Ap- pp. 147–159. plied Econometrics, 11, 1996, pp. 437–450. Seaks, T., and K. Layson. “Box–Cox Es- Silva, J. “A Score Test for Non-Nested Hy- timation with Standard Econometric potheses with Applications to Discrete Problems.” Review of Economics and Response Models.” Journal of Applied Statistics, 65, 1983, pp. 160–164. Econometrics, 16, 5, 2001, pp. 577–598. Greene-50558 book June 25, 2007 10:16

References 1141

Silver, J. and M. Ali. “Testing Slutsky Sym- Srivistava, V., and D. Giles. Seemingly Unre- metry in Systems of Linear Demand lated Regression Models: Estimation and Equations.” Journal of Econometrics, 41, Inference. New York: Marcel Dekker, 1989, pp. 251–266. 1987. Sims, C. “Money, Income, and Causality.” Staiger, D., J. Stock, and M. Watson. “How American Economic Review, 62, 1972, Precise Are Estimates of the Natural pp. 540–552. Rate of Unemployment?” NBER Work- Sims, C. “Exogeneity and Causal Ordering in ing Paper Number 5477, Cambridge, Macroeconomic Models.” In New Meth- 1996. ods in Business Cycle Research: Proceed- Staiger, D., and J. Stock. “Instrumental ings from a Conference. Federal Reserve Variables Regression with Weak Instru- Bank of Minneapolis, 1977, pp. 23–43. ments.” Econometrica, 65, 1997, pp. 557– Sims, C. “Macroeconomics and Reality.” 586. Econometrica, 48, 1, 1980, pp. 1–48. Stata. Stata User’s Guide, Version 9. College Sklar, A. “Random Variables, Joint Distribu- Station, TX: Stata Press, 2006. tions and Copulas.” Kybernetica, 9, 1973, Stern, S. “Two Dynamic Discrete Choice pp. 449–460. Estimation Problems and Simulation Smith, M. “Modeling Selectivity Using Method Solutions.” Review of Eco- Archimedean Copulas.” Econometrics nomics and Statistics, 76, 1994, pp. 695– Journal, 6, 2003, pp. 99–123. 702. Smith, M. “Using Copulas to Model Switching Stewart, M. “Maximum Simulated Likelihood Regimes with an Application to Child Estimation of Random Effects Dynamic Labour,” Economic Record, 81, 2005, Probit Models with Autocorrelated pp. S47-S57. Errors.” Manuscript, Department of Smith, R. “Estimation and Inference with Economics, University of Warwick, Nonstationary Panel Time Series Data.” 2006, reprinted Stata Journal, 6, 2, 2006, Manuscript, Department of Economics, pp. 256–272. Birkbeck College, 2000. Stigler, S. Statistics on the Table. Cambridge, Smith, V. “Selection and Recreation De- MA: Harvard University Press, 1999. mand.” American Journal of Agricultural Stock, J. “Unit Roots, Structural Breaks, and Economics, 70, 1988, pp. 29–36. Trends.” In R. Engle and D. McFadden, Solow, R. “Technical Change and the Ag- eds., Handbook of Econometrics, Vol. 4. gregate Production Function.” Review Amsterdam: North Holland, 1994. of Economics and Statistics, 39, 1957, Stock, J., and F. Trebbi. “Who Invented pp. 312–320. Instrumental Variable Regression?” Sowell, F. “Optimal Tests of Parameter Journal of Economic Perspectives, 17, 32, Variation in the Generalized Method of 2003, pp. 177–194. Moments Framework.” Econometrica, Stock, J., and M. Watson. “Testing for Com- 64, 1996, pp. 1085–1108. mon Trends.” Journal of the American Spector, L., and M. Mazzeo. “Probit Analysis Statistical Association, 83, 1988, pp. 1097– and Economic Education.” Journal of 1107. Economic Education, 11, 1980, pp. 37–44. Stock, J., and M. Watson. “Forecasting Out- Spencer, D., and K. Berk. “ALimited Informa- put and Inflation: The Role of Asset tion Specification Test.” Econometrica, Prices.” NBER, Working Paper #8180, 49, 1981, pp. 1079–1085. Cambridge, MA, 2001. Srivistava, V., and T. Dwivedi. “Estimation Stock, J., and M. Watson. “Combination of Seemingly Unrelated Regression Forecases of Output Growth in a Equations: A Brief Survey.” Journal of Seven-Country Data Set.” Journal of Econometrics, 10, 1979, pp. 15–32. Forecasting, 23, 6, 2004, pp. 405–430. Greene-50558 book June 25, 2007 10:16

1142 References

Stock, J., and M. Watson. Introduction to Swamy, P. Statistical Inference in Random Econometrics, 2nd ed., 2007. Coefficient Regression Models. New Stock, J., J. Wright, and M. Yogo. “A Sur- York: Springer-Verlag, 1971. vey of Weak Instruments and Weak Swamy, P. “Linear Models with Random Co- Identification in Generalized Method efficients.” In P. Zarembka, ed., Frontiers of Moments.” Journal of Business and in Econometrics. New York: Academic Economic Statistics, 20, 2002, pp. 518– Press, 1974. 529. Swamy, P., and G. Tavlas. “Random Coeffi- Stock, J., and M. Yogo. “Testing for Weak cients Models: Theory and Applications.” Instruments in Linear IV Regression.” Journal of Economic Surveys, 9, 1995, In J. Stock and D. Andrews, eds. Identifi- pp. 165–182. cation and Inference in Econometrics: A Swamy, P., and G. Tavlas. “Random Coef- Festschrift in Honor of Thomas Rothen- ficient Models.” In B. Baltagi, ed., A berg. Cambridge: Cambridge University Companion to Theoretical Econometrics. Press, 2005, pp. 80–108. Oxford: Blackwell, 2001. Stoker, T. “Consistent Estimation of Scaled Tanner, M. Tools for Statistical Inference, 2nd Coefficients.” Econometrica, 54, 1986, ed. New York: Springer-Verlag, 1993. pp. 1461–1482. Taqqu, M. “Weak Convergence to Fractional Stone, R. The Measurement of Consumers’ Brownian Motion and the Rosenblatt Expenditure and Behaviour in the Process.” Zeitschrift fur Wahrschein- United Kingdom, 1920–1938. Cambridge: lichkeitstheorie und Verwandte Gebiete, Cambridge University Press, 1954a. 31, 1975, pp. 287–302. Stone, R. “Linear Expenditure Systems and Tauchen, H. “Diagnostic Testing and Evalu- Demand Analysis: An Application to the ation of Maximum Likelihood Models.” Pattern of British Demand.” Economic Journal of Econometrics, 30, 1985, Journal, 64, 1954b, pp. 511–527. pp. 415–443. Strang, G. Linear Algebra and Its Applications. Tauchen, H., A. Witte, and H. Griesinger. New York: Academic Press, 1988. “Criminal Deterrence: Revisiting the Is- Strickland, A., and L. Weiss. “Advertising, sue with a Birth Cohort.” Review of Eco- Concentration, and Price Cost Margins.” nomics and Statistics, 3, 1994, pp. 399–412. Journal of Political Economy, 84, 1976, Taylor, L. “Estimation by Minimizing the Sum pp. 1109–1121. of Absolute Errors.” In P.Zarembka, ed., Stuart, A., and S. Ord. Kendall’s Advanced Frontiers in Econometrics. New York: Theory of Statistics. New York: Oxford Academic Press, 1974. University Press, 1989. Taylor, W. “Small Sample Properties of a Srivastava, T., and Tiwari, R. “Efficiency Class of Two Stage Aitken Estimators.” of Two Stage and Three Stage Least Econometrica, 45, 1977, pp. 497–508. Squares Estimators.” Econometrica, 46, Telser, L. “Iterative Estimation of a Set of 1978, pp. 1495–1498. Linear Regression Equations.” Journal Suits, D. “Dummy Variables: Mechanics vs. of the American Statistical Association, Interpretation.” Review of Economics 59, 1964, pp. 845–862. and Statistics, 66, 1984, pp. 177–180. Terza, J. “Ordinal Probit: A Generalization.” Susin, S. “Hazard Hazards: The Inconsis- Communications in Statistics, 14, 1985a, tency of the “Kaplan-Meier Empirical pp. 1–12. Hazard,” and Some Alternatives,” Terza, J. “A Tobit Type Estimator for the Cen- Manuscript, U.S. Census Bureau, 2001. sored Poisson Regression Model.” Eco- Swamy, P. “Efficient Inference in a Ran- nomics Letters, 18, 1985b, pp. 361–365. dom Coefficient Regression Model.” Terza, J. “Estimating Count Data Models Econometrica, 38, 1970, pp. 311–323. 
with Endogenous Switching and Sample Greene-50558 book June 25, 2007 10:16

References 1143

Selection.” Working Paper IPRE-95-14, Train, K. “A Comparison of Hierarchical Department of Economics, Pennsylvania Bayes and Maximum Simulated Like- State University, 1995. lihood for Mixed Logit.” Manuscript, Terza, J. “Estimating Count Data Models Department of Economics, University of with Endogenous Switching: Sample California, Berkeley, 2001. Selection and Endogenous Treatment Train, K. Discrete Choice Methods with Effects.” Journal of Econometrics, 84, 1, Simulation. Cambridge: Cambridge 1998, pp. 129–154. University Press, 2003. Terza, J., and D. Kenkel. “The Effect of Physi- Trivedi, P., and A. Pagan. “Polynomial Dis- cian Advice on Alcohol Consumption: tributed Lags: A Unified Treatment.” Count Regression with an Endogenous Economic Studies Quarterly, 30, 1979, Treatment Effect,” Journal of Applied pp. 37–49. Econometrics, 16, 2, 2001, pp. 165–184. Trivedi, P., and D. Zimmer. “Copula Model- Theil, H. “Repeated Least Squares Applied ing: An Introduction for Practitioners.” to Complete Equation Systems.” Dis- Foundations and Trends in Econometrics, cussion Paper, Central Planning Bureau, forthcoming, 2007. The Hague, 1953. Tsay, R. Analysis of Financial Time Series, 2nd Theil, H. Economic Forecasts and Policy. ed. New York: John Wiley and Sons, 2005. Amsterdam: North Holland, 1961. Tunali, I. “A General Structure for Models Theil, H. Principles of Econometrics. New of Double Selection and an Application York: John Wiley and Sons, 1971. to a Joint Migration/Earnings Process Theil, H. “Linear Algebra and Matrix Meth- with Remigration.” Research in Labor ods in Econometrics.” In Z. Griliches Economics, 8, 1986, pp. 235–282. and M. Intriligator, eds., Handbook of United States Department of Commerce. Econometrics, Vol. 1. New York: North Statistical Abstract of the United States. Holland, 1983. Washington, D.C.: U.S. Government Theil, H., and A. Goldberger. “On Pure Printing Office, 1979. and Mixed Estimation in Economics.” United States Department of Commerce, International Economic Review, 2, 1961, Bureau of Economic Analysis. Na- pp. 65–78. tional Income and Product Accounts, Thursby, J. “Misspecification, Heteroscedas- Survey of Current Business: Business ticity, and the Chow and Goldfeld– Statistics, 1984. Washington, D.C.: U.S. Quandt Tests.” Review of Economics and Government Printing Office, 1984. Statistics, 64, 1982, pp. 314–321. Veall, M. “Bootstrapping the Probability Dis- Tobin, J. “Estimation of Relationships for tribution of Peak Electricity Demand.” Limited Dependent Variables.” Econo- International Economic Review, 28, 1987, metrica, 26, 1958, pp. 24–36. pp. 203–212. Toyoda, T., and K. Ohtani. “Testing Equality Veall, M. “Bootstrapping the Process of Between Sets of Coefficients After Model Selection: An Econometric Ex- a Preliminary Test for Equality of ample.” Journal of Applied Econometrics, Disturbance Variances in Two Linear 7, 1992, pp. 93–99. Regressions.” Journal of Econometrics, Veall, M., and K. Zimmermann. “Pseudo-R2 31, 1986, pp. 67–80. in the Ordinal Probit Model.” Journal Train, K. Discrete Choice Methods with of Mathematical Sociology, 16, 1992, Simulation. Cambridge: Cambridge pp. 333–342. University Press, 2002. Vella, F. “Estimating Models with Sample Train,K. “Halton Sequences for Mixed Logit.” Selection Bias: A Survey.” Journal of Manuscript, Department of Economics, Human Resources, 33, 1998, pp. 439– University of California, Berkeley, 1999. 454. Greene-50558 book June 25, 2007 10:16

1144 References

Vella, F., and M. Verbeek. “Whose Wages Wallace, T., and A. Hussain. “The Use Do Unions Raise? A Dynamic Model of of Error Components in Combining Unionism and Wage Rate Determination Cross Section with Time Series Data.” for Young Men.” Journal of Applied Econometrica, 37, 1969, pp. 55–72. Econometrics, 13, 2, 1998, pp. 163–184. Wan, G., W. Griffiths, and J. Anderson. “Using Vella, F., and M. Verbeek. “Two-Step Es- Panel Data to Estimate Risk Effects in timation of Panel Data Models with Seemingly Unrelated Production Func- Censored Endogenous Variables and tions.” Empirical Economics, 17, 1992, Selection Bias.” Journal of Econometrics, pp. 35–49. 90, 1999, pp. 239–263. Wang, P., I. Cockburn, and M. Puterman. Verbeek, M. “On the Estimation of a Fixed “Analysis of Panel Data: A Mixed Effects Model with Selectivity Bias.” Poisson Regression Model Approach.” Economics Letters, 34, 1990, pp. 267–270. Journal of Business and Economic Verbeek, M., and T. Nijman. “Testing for Statistics, 16, 1, 1998, pp. 27–41. Selectivity Bias in Panel Data Models.” Watson, M. “Vector Autoregressions and International Economic Review, 33, 3, Cointegration.” In R. Engle and D. Mc- 1992, pp. 681–703. Fadden, eds., Handbook of Econometrics, Verbon, H. “Testing for Heteroscedasticity Vol. 4. Amsterdam: North Holland, 1994. in a Model of Seemingly Unrelated Weeks, M. “Testing the Binomial and Multi- Regression Equations with Variance nomial Choice Models Using Cox’s Components (SUREVC).” Economics Nonnested Test.” Journal of the Amer- Letters, 5, 1980, pp. 149–153. ican Statistical Association, Papers and Vinod, H. “Bootstrap, Jackknife, Resam- Proceedings, 1996, pp. 312–328. pling and Simulation: Applications in Weiss, A. “Asymptotic Theory for ARCH Econometrics.” In G. Maddala, C. Rao, Models: Stability, Estimation, and and H. Vinod, eds., Handbook of Statis- Testing.” Discussion Paper 82-36, De- tics: Econometrics, Vol II., Chapter 11. partment of Economics, University of Amsterdam: North Holland, 1993. California, San Diego, 1982. Vinod, H., and B. Raj. “Economic Issues in Weiss, A. “Simultaneity and Sample Selec- Bell System Divestiture: A Bootstrap tion in Poisson Regression Models.” Application.” Applied Statistics (Journal Manuscript, Department of Economics, of the Royal Statistical Society, Series C), University of Southern California, 1995. 37, 2, 1994, pp. 251–261. Wedel, M., W. DeSarbo, J. Bult, and V. Vuong, Q. “Likelihood Ratio Tests for Model Ramaswamy. “A Latent Class Poisson Selection and Non-Nested Hypotheses.” Regression Model for Heterogeneous Econometrica, 57, 1989, pp. 307–334. Count Data.” Journal of Applied Econo- Vytlacil, E., A. Aakvik, and J. Heckman. metrics, 8, 1993, pp. 397–411. “Treatment Effects for Discrete Out- Weinhold, D. “A Dynamic “Fixed Effects” comes When Responses to Treatments Model for Heterogeneous Panel Data.” Vary Among Observationally Identical Manuscript, Department of Economics, Persons: An Applicaton to Norwegian London School of Economics, 1999. Vocational Rehabilitation Programs.” Weinhold, D. “Investment, Growth and Journal of Econometrics, 125, 1/2, 2005, Causality Testing in Panels” (in French). pp. 15–51. Economie et Prevision, 126-5, 1996, Waldman, D. “A Note on the Algebraic pp. 163–175. Equivalence of White’s Test and a West, K. “On Optimal Instrumental Variables Variant of the Godfrey/Breusch-Pagan Estimation of Stationary Time Series Test for Heteroscedasticity.” Economics Models.” International Economic Review, Letters, 13, 1983, pp. 197–200. 
42, 4, 2001, pp. 1043–1050. Greene-50558 book June 25, 2007 10:16

References 1145

White, H. “A Heteroscedasticity-Consistent Woolridge, J. “Specification Testing and Covariance Matrix Estimator and a Quasi-Maximum Likelihood Estima- Direct Test for Heteroscedasticity.” tion.” Journal of Econometrics, 48, 1/2, Econometrica, 48, 1980, pp. 817–838. 1991, pp. 29–57. White, H. “Maximum Likelihood Estimation Woolridge, J. “Selection Corrections for Panel of Misspecified Models.” Econometrica, Data Models Under Conditional Mean 53, 1982a, pp. 1–16. Assumptions.” Journal of Econometrics, White, H. “Instrumental Variables Regres- 68, 1995, pp. 115–132. sion with Independent Observations.” Wooldridge, J. “Asymptotic Properties of Econometrica, 50, 2, 1982b, pp. 483–500. Weighted M Estimators for Variable White, H., ed. “Non-Nested Models.” Journal Probability Samples.” Econometrica, 67, of Econometrics, 21, 1, 1983, pp. 1–160. 1999, pp. 1385–1406. White, H. Asymptotic Theory for Econome- Wooldridge, J. Econometric Analysis of Cross tricians, Revised. New York: Academic Section and Panel Data. Cambridge: MIT Press, 2001. Press, 2002a. Wickens, M. “A Note on the Use of Proxy Wooldridge, J. “Inverse Probability Weighted Variables.” Econometrica, 40, 1972, M Estimators for Sample Selection, pp. 759–760. Attrition and Stratification,” Portuguese Economic Journal, 1, 2002b, pp. 117–139. Willis, J. “Magazine Prices Revisited.” Journal of Applied Econometrics, 21, 3, 2006, Wooldridge, J., “Simple Solutions to the Initial pp. 337–344. Conditions Problem in Dynamic Nonlin- ear Panel Data Models with Unobserved Willis, R., and S. Rosen. “Education and Self- Heterogeneity,” CEMMAP Working Selection.” Journal of Political Economy, Paper CWP18/02, Centre for Microdata 87, 1979, pp. S7–S36. and Practice, IFS and University College, Windmeijer, F. “Goodness of Fit Measures London, 2002c. in Binary Choice Models.” Econometric Woolridge, J. Introductory Econometrics: A Reviews, 14, 1995, pp. 101–116. Modern Approach, 3rd ed. New York: Winkelmann, R. Econometric Analysis Southwestern Publishers, 2006a. of Count Data. 2nd ed., Heidelberg, Wooldridge, J. “Inverse Pobability Weighted Germany: Springer-Verlag, 1997. Estimation for General Missing Data Winkelmann, R. Econometric Analysis of Problems.” Manuscript, Department of Count Data, Springer-Verlag, Heidel- Economics, Michigan State University, berg, 2000. East Lansing, 2006b. Winkelmann, R. Econometric Analysis of Working, E. “What Do Statistical Demand Count Data, Springer-Verlag, Heidel- Curves Show?” Quarterly Journal of berg, 4th ed. 2003. Economics, 41, 1926, pp. 212–235. Winkelmann, R. “Subjective Well-Being and World Health Organization. The World the Family: Results from an Ordered Health Report, 2000, Health Systems: Probit Model with Multiple Random Ef- Improving Performance. Geneva. 2000. fects.” Discussion Paper 1016, IZA/Bonn Wright, J., “Forecasting U.S. Inflation by and University of Zurich. Bayesian Model Averaging,” Board of Witte, A. “Estimating an Economic Model of Governors, Federal Reserve System, Crime with Individual Data.” Quarterly International Finance Discussion Papers Journal of Economics, 94, 980, pp. 57–84. Number 780, 2003. Wong, W. “On the Consistency of Cross Wu, D. “Alternative Tests of Independence Validation in Kernel Nonparametric Between Stochastic Regressors and Regression.” Annals of Statistics, 11, Disturbances.” Econometrica, 41, 1973, 1983, pp. 1136–1141. pp. 733–750. Greene-50558 book June 25, 2007 10:16

1146 References

Wu, J., “Mean Reversion of the Current Zellner, A. “Estimators for Seemingly Un- Account: Evidence from the Panel Data related Regression Equations: Some Unit Root Test,” Economics Letters, 66, Exact Finite Sample Results.” Journal of 2000, pp. 215–222. the American Statistical Association, 58, Wynand, P., and B. van Praag. “The Demand 1963, pp. 977–992. for Deductibles in Private Health In- Zellner, A. Introduction to Bayesian Inference surance: A Probit Model with Sample in Econometrics. New York: John Wiley Selection.” Journal of Econometrics, 17, and Sons, 1971. 1981, pp. 229–252. Zellner, A. “Statistical Analysis of Econo- Yatchew, A., “Nonparametric Regression metric Models.” Journal of the Amer- Techniques in Econometrics,” Journal ican Statistical Association, 74, 1979, of Econometric Literature, 36, 1998, pp. 628–643. pp. 669–721. Zellner, A. “Bayesian Econometrics.” Econo- Yatchew, A., “An Elementary Estimator of metrica, 53, 1985, pp. 253–269. the Partial Linear Model,” Economics Zellner, A., and D. Huang. “Further Proper- Letters, 57, 1997, pp. 135–143. ties of Efficient Estimators for Seemingly Yatchew, A., “Scale Economies in Electric- Unrelated Regression Equations.” In- ity Distribution,” Journal of Applied ternational Economic Review, 3, 1962, Econometrics, 15, 2, 2000, pp. 187–210. pp. 300–313. Yatchew, A., and Z. Griliches. “Specifica- Zellner, A., J. Kmenta, and J. Dreze. “Specifi- tion Error in Probit Models.” Review cation and Estimation of Cobb-Douglas of Economics and Statistics, 66, 1984, Production Functions.” Econometrica, pp. 134–139. 34, 1966, pp. 784–795. Zabel, J., “Estimating Fixed and Random Ef- Zellner, A., and C. K. Min. “Gibbs Sampler fects Models with Selectivity,” Economics Convergence Criteria.” Journal of the Letters, 40, 1992, pp. 269–272. American Statistical Association, 90, Zarembka, P. “Transformations of Variables 1995, pp. 921–927. in Econometrics.” In P. Zarembka, Zellner, A., and N. Revankar. “General- ed., Frontiers in Econometrics. Boston: ized Production Functions.” Review of Academic Press, 1974. Economic Studies, 37, 1970, pp. 241–250. Zavoina, R., and W. McKelvey. “A Statisti- Zellner, A., and A. Siow. “Posterior Odds cal Model for the Analysis of Ordinal Ratios for Selected Regression Hypothe- Level Dependent Variables.” Journal of ses (with Discussion).” In J. Bernardo, Mathematical Sociology, Summer, 1975, M. DeGroot, D. Lindley, and A. Smith, pp. 103–120. eds., Bayesian Statistics, Valencia, Spain: Zellner, A. “An Efficient Method of Estimat- University Press, 1980. ing Seemingly Unrelated Regressions Zellner, A., and H. Theil. “Three Stage Least and Tests of Aggregation Bias.” Journal Squares: Simultaneous Estimation of of the American Statistical Association, Simultaneous Equations.” Econometrica, 57, 1962, pp. 500–509. 30, 1962, pp. 63–68. Greene-50558 Z03˙0135132452˙Aind June 27, 2007 14:44

AUTHORQ INDEX

A 472, 478, 557n29, 794, 801, Beyer, A., 764, 765 Aakyik, A, 557n29 804, 882 Bhargava, A., 207n18, 336, 470n14, Abowd, J., 896n36 Arrow, K., 275 476n19 Abramovitz, M., 436n2, 1061n2, Ashenfelter, O., 5, 11, 330 Bhat, C., 578, 591 1063, 1064, 1065 Attfield, C., 534n23 Billingsley, P., 75n16, 637n6 Abrevaya, J., 587, 588, 801, 838 Avery, R., 267, 796n23, 800, 814, 828 Binkley, J., 257n9 Achen, C., 231n30, 232n31 Birkes, D., 406n2 Blanchard, O., 706n12 Adkins, L., 60n6 B Afifi, T., 62n9 Blinder, A., 55 Ahn, S., 269n17, 342, 470n14, Baillie, R., 658, 658n16 Blundell, R., 264n15, 348, 420, 472, 475, 476, 478 Baker, R., 350 474n17, 780, 787n12, 816 Aigner, D., 90n5, 263, 538 Balakrishman, N., 866n5, 996n3 Bockstael, N., 218, 219, 924 Aitchison, J., 996n4 Balestra, P., 205n16, 208, 208n19, Bollerslev, T., 658, 658n17, 660, 661, Aitken, A.C., 154, 155 238, 470n14 661n21, 662, 662n22, 663n24, Akaike, H., 37, 143, 506, 677, 697 Baltagi, B., 181n2, 184, 198, 207, 665, 665n28, 666n29, Akin, J., 795n22 208n19, 208n21, 209, 212, 667n31, 1089 Aldrich, J., 773n2, 774n3 214, 215, 216, 216n25, Bond, S., 342, 343, 346, 348, 352, Ali, M., 165n17, 273n20 218, 219, 243n32, 244, 455, 470, 470n14, 474n17 Allenby, G., 841 255n3, 257n6, 342, 343, Boot, J., 252, 283n34, 1086 Allison, P., 62, 915, 917 399n1, 547, 585, 653, 654, Borjas, G., 231n30 Altonji, J., 810n34 768, 804 Börsch-Supan, A., 583n7, 827n44 Amemiya, T., 37n5, 151n2, 152, Bannerjee, A., 244 Bound, J., 350 170n23, 170n25, 204n15, Barnow, B., 890n31, 893 Bourguignon, F., 55n4 285n1, 290, 333, 365n12, Bartels, R., 257n6, 263 Bover, O., 210n23, 211, 342, 348, 380, 380n17, 406n2, 421, Barten, A., 274n23 455, 470, 470n14, 422n8, 423, 425, 438, 445n8, Basmann, R., 378 478, 557n29 472, 490, 494n3, 512n12, 513, Bassett, G., 166, 406n2, 407, 425 Box, G., 296n3, 575, 633n5, 645, 635, 677n3, 770, 773n2, Battese, G., 204n15 648n14, 650, 715n1, 726, 727, 779n9, 790n14, 791n15, Beach, C., 529, 649, 651 729, 730, 731, 738, 768 792n16, 809, 863n1, 869n9, Beck, N., 794n21, 795n22 Boyes, W., 792, 793, 895 875, 880n16, 1072n18, Becker, S., 893, 894 Brannas, K., 920 1073n20 Beggs, S., 831n48, 842 Bray, T., 575 American Economic Review, 41 Behrman, J., 331 Breitung, J., 767 Andersen, D., 587, 803 Bekker, P., 365n12 Breslaw, J., 583n6 Anderson, G., 264n15 Bell, C., 1081 Breusch, T., 166, 166n19, 178, 184, Anderson, R., 278n31 Bell, W., 218, 219, 715n1 205, 223n26, 236, 526, 534, Anderson, T., 152, 208n19, 244, 246, Belsley, D., 60, 1075 549, 644 333, 341, 376n15, 378, 379, Ben-Akiva, M., 790n14, 791 Brock, W., 776n5 387, 470n14, 537, 635 Ben-Porath, Y., 181 Brown, B., 272n19 Andrews, D., 123n12, 236, Bera, A., 880, 881n17 Brown, J., 996n4 406b2, 709 Bernard, J., 689n6 Brown, P., 267 Angrist. J., 5, 398, 773n2, 814n37, Berndt, E., 14, 91n6, 264n15, Browning, M., 420 817, 889n28, 891n32, 892n34 276n25, 276n26, 277n29, 278, Brundy, J., 319 Anselin, L., 218, 219 279n32, 279n33, 495, 506n10, Burgess, G., 232 Antwiler, 214, 215 1072, 1087 Burnett, N., 823, 824, 1090 Arabmazar, A., 875 Beron, K., 233 Burnside, C., 454n10 Arellano, M., 181n2, 210n23, 211, Berry, S., 589, 851 Buse, A., 156, 498n5 341, 342, 343, 346, 348, 455, Bertschek, I., 593, 594, 800, 814, Butler, J., 407n3, 592, 797, 798, 470, 470n14, 470n15, 471, 828, 829, 1089 799, 802, 814, 827, 829, 835, Berzeg, K., 207 836, 838

1147 Greene-50558 Z03˙0135132452˙Aind June 27, 2007 14:44

1148 Author Index

C Coulson, N., 658 Duncan, G., 163n11, 880n15, Cain, G., 890n31 Cox, D., 138, 142, 144, 296n3, 880n16, 891 Cain, G, 893 935n10, 940 Durbin, J., 116n6, 322, 387n26, Caldwell, S., 874 Cragg, J., 163n13, 327n5, 386, 645n12, 646, 646n13, 647, Calhoun, C., 836 387n25, 467, 658, 790n14, 648, 650, 651, 655, 668, Calzolari, G., 660n20 877, 920, 927 692, 1098 Cameron, C., 63, 405, 517, 587, Cramér, H., 490, 493, 519, 1031 Durlauf, 776n5 590n9, 597, 598, 601, 619, Cramer, J., 790n14, 791, 792, 826 Dwivedi, T., 253n2, 256 770, 907n1, 908n3, 909, 910, Crawford, I., 420 Culver, S., 243 911, 912, 917 E Campbell, J., 6, 715n1 Cumby, R., 441n3, 467 Cappellari, L., 827 Econometrica, 925 Efron, B., 596n12, 597, 790n14, 791, Cardell, S., 831n48, 842 D Carey, K., 343, 437, 439 792, 826 Carson, R., 925 D’Addio, A, 837 Eichenbaum, M., 454n10 Case, A., 218 Dahlberg, M., 478, 695, Eicker, F., 163n11 Cassen, R., 557 712, 712n13 Eisenberg, D., 814n37, 823n43 Caudill, S., 773n2 Dale, S., 5 Elashoff, R., 62n9 Caves, D., 296 Das, M., 837 Elliot, G., 753 Cecchetti, S., 5, 695, 703, 704n11, Dastoor, N., 138n5 Enders, W., 670n1, 715n1, 706n12, 707, 748, 750, Davidson, A., 596n12 728, 739n1 805, 806 Davidson, J., 399n1, 447 Engle, R., 1, 357n5, 358, 658, Chamberlain, G., 207n18, 208n20, Davidson, R., 139, 164, 257n7, 658n17, 660, 660n20, 669, 210n23, 230, 269, 269n17, 285n1, 287, 290, 293, 312, 670n1, 698, 756n12, 761, 762, 270, 271, 439, 441n3, 556, 324, 335, 358n6, 375, 377, 764, 765, 766, 787n12 557n28, 557n29, 608n9, 711, 421, 422, 425, 441n3, 467, Epstein, D., 794n21 794, 795, 803, 805, 900 488, 494, 522, 537, 598, Ericsson, N., 765 Chambers, R., 272n19 636, 637, 638, 638n7, 641, Estes, E., 409n5 Chatfield, C., 731, 735 664, 728n12, 740n2, 743n3, Evans, D., 124 Chatterjee, P., 814, 836 762, 787n12, 789 Evans, G., 744n8 Chavez, J., 272n19 Deaton, A., 138n5, 272n19, 274n23, Evans, M., 991n1 Chen, T., 557n29 276n26 Debreu, G., 538 Chesher, A., 778n8, 880, 881n17, F 928, 934 Dehejia, R., 893n35, 894 Fair, R., 101n10, 123n12, 861, 869, Chiappori, R., 794 deMaris, A., 5, 863n1 904, 905, 925, 926, 1090, 1092 Chib, S., 399, 617, 623 Dempster, A., 925, 1076, 1078 Farber, H., 896n36 Chintagunta, P., 851 Dent, W., 699n8 Farrell, M., 538 Chou, R., 661 Denton, F., 273n20 Feldstein, M., 101 Chow, G., 120, 122, 123, 127, DesChamps, P., 264n15 Fernandez, A., 780, 810, 880n15 223n26, 284, 789 de Witt, G., 252, 283n34, 1086 Ferrer-i-Carbonel, A., 837 Christensen, L., 14, 79, 91n6, 104, Dezhbaksh, H., 646n13 Fiebig, D., 253n2, 257n6, 116, 117, 273n20, 274, Dhrymes, P., 75n16, 382n19, 384, 262n14, 263 276n25, 276n26, 276n27, 387n26, 683n5, 770, 863n1 Fin, T., 878 277n29, 278n30, 283, 286, Dickey, D., 682, 708, 715n1, 745, Fiorentini, G., 660n20, 664n25 296, 410, 411, 572, 1083, 1087 748, 751, 752, 753, 754, 761, Fisher, F., 122 Christofides, L., 821n42 762, 764, 768 Fisher, G., 140n7, 788n13 Chu, C., 767 Diebold, F., 143, 670n1, 677n3, Fisher, R., 401, 428 Chung, C., 810, 868, 875n12 694n7, 703, 715n1 Fletcher, R., 1071n17 Cleveland, W., 418 Dielman, T., 181n2 Florens, J., 931n8 Coakley, J., 243 Diewert, E., 276n25, 276n27 Flores-Lagunes, A., 770 Cochrane, D., 651 Diggle, R., 796n23 Fomby, T., 170n25, 208n19, 297n6, Congdon, P., 601n2 DiNardo, J., 407 773n2, 1054 Conniffe, D., 255n3, 257n8 Doan, T., 707 Fougere, D., 931n8 Contoyannis, C., 784, 841, Dodge, Y., 406n2 Frankel, J., 243 899, 902 Domowitz, I., 658 Freedman, D., 551, 780 Conway, D., 129 Doob, J., 637n6 Doppelhofer, G., 145, 146 French, K., 660 Cooley, T., 355n2 Dreze, J., 115 
Fried, H., 538n24 Cornwell, C., 186, 194, 199, 231, Dufour, J., 378, 379 Friedman, M., 9 549, 1086 Greene-50558 Z03˙0135132452˙Aind June 27, 2007 14:44

Author Index 1149

Frijters, P., 837 794n21, 799, 801, 810, Hausman, J., 208, 208n20, 209, 247, Frisch, R., 1, 28, 39, 40, 134 810n33, 821n42, 824, 827, 322, 323, 324, 336, 339, 347, Fry, J., 272n19 827n45, 834, 836, 837n51, 350, 350n11, 351, 352, Fry, T., 272n19 838, 841, 842, 849, 851, 852, 363n10, 376n15, 377, 378, Fuller, W., 170n24, 204n15, 648n14, 852n59, 858, 859, 863n1, 869, 384, 387, 387n26, 471, 472, 682, 708, 731, 745, 748, 751, 869n8, 872n10, 873, 875n12, 473, 788n13, 805, 831n48, 752, 753, 754, 761, 762, 882, 887n26, 893, 894, 897, 842, 853, 869n8, 880, 909, 764, 768 898, 902, 908n3, 913, 918, 911, 915, 916, 919, 920, 922, 922n7, 923, 924, 939n14, 941 925, 927, 938n12, 940, 996n6, Hayashi, F., 399n1, 425, 441n3, G 1059n8, 1083, 1085, 1087, 639n8, 639n9, 641, 643n10, Gabrielsen, A., 365n12 1090, 1091 763, 1056 Gali, J., 704n11, 709 Griesinger, H., 798 Heaton, J., 444n7 Gallant, A., 358n6 Griffin, J., 212 Heckman, J., 1, 2, 5, 557n29, 588, Garber, S., 111n2, 327n5 Griffiths, W., 422n8 589, 773n2, 779n10, 794, 795, Garvin, S., 255n3 Griliches, Z., 61, 62n9, 158, 240, 801, 806, 807, 868n7, 882n18, Gelman, A., 601n2, 623 283n34, 327n5, 585n8, 649, 884n21, 886, 887, 887n26, Gentle, J., 574, 1061n2 683n5, 787, 909, 915, 917, 891, 892n34, 893n35, 897, Gerfin, M., 810, 811, 811n36, 812 939n14 931n8, 939, 997n7 Geweke, J., 577n3, 581, 583n6, Grogger, J., 925 Heilbron, D., 922 583n7, 601, 601n2, 677, Gronau, R., 884n21 Hendry, D., 137, 357n5, 358, 585, 699n8, 715n1, 796n23, Groot, W., 838 658, 670n1, 693, 698, 764, 827n44 Grunfeld, Y., 252, 255, 283n34, 765, 766 Ghysels, E., 658 534, 1086 Hensher, D., 591, 841, 842, 846, 849, Giaccotto, C., 165n17 Gruvaeus, G., 1075n23, 1079n25 849n58, 851, 852, 852n59, Gilbert, R., 585n8 Guilkey, D., 264n15, 277n29, 858, 859, 1090 Giles, D., 253n2, 258n13 795n22 Heshmati, A., 263 Gill, J., 601n2 Gurmu, S., 61, 903n3, 910, 911 Hildebrand, G., 90n5, 1084 Gloeckler, L., 557n28 Hildreth, C., 223n26, 242, 243 Godfrey, L., 103, 178, 236, 351, 521, Hill, C., 60n6, 170n25, 208n19, 526, 586, 644 H 297n6, 422n8, 773n2, 1054 Goldberger, A., 41, 257n8, 273n20, Haavelmo, T., 1, 354n1 Hilts, J., 124 363n10, 386, 393, 608n10, Hadri, K., 767 Hinkley, D., 596n12 658n15, 810, 868, 875n12, Hahn, J., 350, 350n11, 351, 376n15, Hite, S., 62n7 880n15, 890n30, 890n31, 377, 882 Hoeting, J., 145 891, 893 Haitovsky, Y., 62n9 Hoffman, D., 792, 793, 895 Goldfeld, S., 285n1, 380, 558, 1069, Hajivassiliou, V., 583n7, 827n44 Hole, A., 71 1070, 1071n16, 1074n21 Hakkio, C., 658 Hollingsworth, J., 124 Gonzaláz, P., 184 Hall, R., 428, 429, 452, 909, 911, Holly, A., 358n6 Good, D., 108 915, 916, 939n14, 1061n1, Holt, M., 264n15 Gordin, M., 639n9, 640 1075n22, 1075n23 Holtz-Eakin, D., 475n18, 478, 711 Gourieroux, C., 137n3, 138n5, Hamilton, J., 639, 641, 657, 670n1, Hong, H., 144 512n12, 515, 583n6, 590n9, 687, 689, 694, 699n9, 702n10, Honore, B., 409n5, 794, 795, 591, 592, 666, 666n30, 703, 715n1, 728n12, 731, 810n34, 882, 891 809n31, 881n17 745n9, 756n12, 760, 761n13 Horn, A., 163n11 Granger, C., 1, 138n4, 357n5, 358, Han, A., 941 Horn, D., 163n11 670n1, 690, 699, 700, 712, Hansen, L., 144, 145, 320n1, 429, Horowitz, J., 597, 780, 788n13, 810, 715n1, 727, 731, 741, 742, 441n3, 444n7, 445, 445n8, 811, 881n17 743, 756n12, 761, 761n13, 448, 452, 796n23, 800 Hotz, J., 796n23, 800 762, 764, 765 Hanushek, E., 5, 231n30 Houck, C., 223n26, 242, 243 Gravelle, H., 124, 193, 197 Härdle, W., 406n2, 413n7, 420, Houthakker, S., 158n8 Greenberg, E., 399, 623, 1054 810n35 Howrey, E., 267 Greene, W., 72n14, 79, 88, 88n4, 
Hartley, M., 577n4 Hoxby, C., 5, 320 104, 105, 108, 108n1, 116, Harvey, A., 170, 523, 525, 526, 663, Hsiao, C., 181n2, 207, 208n19, 212, 117, 193, 226, 274, 276n27, 670n1, 683n5, 687, 715n1, 223n26, 224n28, 239, 242, 278n30, 283, 304, 410, 411, 740n2, 788 244, 246, 341, 352, 361n8, 414, 538n24, 556, 557, 562, Hasimoto, N., 258n10 470n14, 587, 588, 794n21, 565, 572, 574, 588, 589, 591, Hastings, N., 991n1 801, 805, 838 594n11, 770, 774n3, 793, Hatanaka, M., 652, 731 Huang, D., 250, 258n11, 283n35 Greene-50558 Z03˙0135132452˙Aind June 27, 2007 14:44

1150 Author Index

Huber, P., 512n12, 513 Judd, K., 1061n2 Kotz, S., 863n20, 866n5, 911, Hudak, S., 219 Judge, G., 65n10, 135, 156, 202, 922, 996n3 Huizinga, J., 441n3, 467 204n14, 208n19, 258n13, Kraft, D., 662, 669 Hussain, A., 204n15 262n14, 299n9, 399, 422n8, Krailo, M., 804 Hwang, H., 255n3 603n3, 606n7, 608n9, 608n12, Kreuger, A., 5, 11, 330 Hyslop, D., 795, 827n44 657, 660n20, 677n3, 715n1, Krinsky, I., 70, 71, 278n31 Hyslop, J., 325n2 773n2, 1020n1 Kroner, K., 661 Jung, B., 214 Kuersteiner, G., 882 I Juselius, K., 763 Kuh, E., 60 Jusko, K., 231n30 Kumbhakar, S., 263, 538n24, 562 Ichimura, H., 893n35 Kyriazidou, E., 794, 795, 882, Ichino, A., 893, 894 891, 899 Im, E., 255n3 K Im, K., 243, 269n17 Kakwani, N., 258n12 Imbens, G., 325n2 Kalbfleisch, J., 587, 880, 907n1, L Inkmann, J., 800 931n8, 935, 935n10, 937, 938, Lahiri, K., 181n2 Irish, M., 778n8, 880, 881n17, 938n13, 940 Laird, N., 925, 1078 928, 934 Kamlich, R., 129 Laisney, F., 780 Kang, H., 743n5 LaLonde, R., 893n35, 894, 1091 J Kao, C., 243, 244, 768 Lambert, D., 922, 922n7, 923, Jackman, S., 794n21 Katz, J., 588 925, 930 Jaeger, D., 350 Kay, R., 790n14, 791 Lancaster, T., 555, 557, 587, 601, Jain, D., 851 Keane, M., 583n6, 583n7, 795, 601n2, 778n8, 801, 881n17, Jakubson, G., 795 796n23, 827n44, 899 931n8, 935, 935n10, 938 Jansen, D., 715n1 Kelejian, H., 380 Land, K., 594n11 Jarque, C., 869, 880, 881n17 Kelly, J., 370n13 Landers, A., 62n7 Jayatissa, W., 124n13 Kenkel, D., 898 Landwehr, J., 794 Jenkins, G., 633n5, 648n14, 715n1, Kennan, J., 935 Lau, L., 104, 273n20, 276n26, 726, 727, 738, 768 Kennedy, W., 1061n2 277n29, 286 Jenkins, S., 827 Kerman, S., 255n3 Lawless, J., 940 Jennrich, R.I., 285n1 Keuzenkamp, H., 4n2 Layson, K., 296n5 Jensen, M., 531n20 Keynes, J., 2, 3, 9–11, 694 Leamer, E., 145, 327n5, 601n2, Jobson, J., 170n24 Kiefer, N., 820n41, 931n8, 935, 938, 608n9, 1020n1 Johansen, S., 763, 765 940, 941, 941n16 LeCam, L., 494n3 Johanssen, P., 920 Killian, L., 702, 707 Lechner, M., 593, 594, 780, 814, 827, Johansson, E., 478, 695, 712, Kim, I., 753 828, 829, 1089 712n13, 761, 763 Kingdon, G., 557 L’Ecuyer, P., 574 Johnson, N., 108, 779n10, 866n5, Kiviet, J., 246, 264n15, 341, Lee, K., 239 883n20, 911, 922, 991n1, 470n13, 470n15 Lee, L., 181n2, 403, 880, 880n15, 996n3, 1054 Kleibergen, F., 350n11, 351, 378 881n17, 910 Johnson, R., 61, 265 Klein, L., 1, 369, 385, 388, 391, 392, Lee, M., 863n1 Johnson, S., 170n25, 208n19, 297n6, 393, 396, 397, 537, 694, 810, Leff, N., 41 773n2 811, 891, 1088 Lerman, R., 582n5, 790n14, 791, 793 Johnston, J., 407, 1059n8 Klein, R., 413 LeRoy, S., 355n2 Jondrow, J., 541 Klepper, S., 327n5 Levi, M., 327n5 Jones, A., 841, 902 Klugman, S., 403 Levin, A., 243, 342, 767 Jones, J., 794 Kmenta, J., 115, 119, 173, 264n16, Levinsohn, J., 589, 851 Joreskog, K., 533n21, 1061n1, 277n29, 531, 533n22, 585n8, Lewbel, A., 411, 412, 794, 810n34 1075n23, 1079n25 1059n9 Lewis, H., 884n21 Jorgenson, D., 104, 105, 273n20, Knapp, M., 221, 788n13 Li, M., 836 276n25, 276n26, 277n29, Knight, F., 538 Li, Q., 413n7, 420, 653 286, 319 Kobayashi, M., 124, 124n13 Li, W., 218, 658n17 Journal of Applied Econometrics, Koenker, R., 166, 406n2, 407, Liang, P., 796n23 103, 307, 399, 412, 478, 1088 407n4, 425 Lin, C., 243, 767 Journal of Business and Economic Koh, W., 219 Ling, S., 658n17 Statistics, The, 129, 303n11, Koolman, X., 902 Litterman, R., 694, 702 399, 889n28 Koop, G., 41, 103, 130, 145, 235, 601, Little, R., 61 Journal of Econometrics, 399 601n2, 619, 621, 1082 Little, S., 790n14, 791 Journal of Political Economy, 905 Kotz, A., 779n10 
Liu, T., 90n5, 387, 1084


Ljung, G., 645, 650, 730, 731 McFadden, D., 1, 2, 226, 421, 425, Nelson, F., 773n2, 774n3, 872n10, Lo, A., 5 441n3, 447, 454n10, 497, 875, 881n17 Loeve, M., 1052 506n10, 583, 583n6, 590, Nerlove, M., 114, 115, 117, 181n2, Long, S., 5, 863n1 670n1, 770, 790n14, 791n15, 202, 205n16, 207, 208, Longley, J., 60, 1083 826, 842, 847n57, 849, 858 208n19, 238, 239, 274, Louviere, J., 842, 849n58 McGuire, T., 274n23 274n24, 276n27, 410, 470n14, Lovell, K., 28, 39, 90n5, 277n29, McKelvey, W., 790n14, 791, 792, 831 572, 646n13, 683n5, 715n1, 538, 538n24 McKenzie, C., 791 844n54, 1087 Lowe, S., 792, 793, 895 McLachlan, G., 562, 564, 565 Neudecker, H., 979n14 Lucas, R., 764, 765 McLaren, K., 264n15, 272n19 Newbold, P., 670n1, 690, 715n1, 727, Lutkepohl, H., 670n1, 697, 698 Meese, R., 677n3, 699m8 731, 741, 742, 743 Melenberg, B., 407n3, 412, 869, Newey, W., 190, 200, 421, 425, 875n13, 878, 880 441n3, 447, 453, 454, 454n10, M Merton, R., 660 454n11, 465, 475n18, 478, Mackinlay, A., 5 Messer, K., 163n14 497, 557n29, 643, 667, 753, MacKinnon, J., 139, 163n11, Miller, D., 399 880n16, 881n17, 882n18 163n14, 164, 257n7, 285n1, Miller, R., 146, 715n1 New York Post, 866n6 287, 290, 293, 312, 324, 335, Million, A., 307 Neyman, J., 138, 557, 587, 801, 882 358n6, 375, 377, 421, 422, Mills, T., 658n17, 670n1, 715n1, 728, Nickell, S., 207n17, 470n13 425, 441n3, 467, 488, 494, 891, 895 Nijman, T., 899, 900 522, 528, 529, 537, 598, 636, Min, C., 145 Nuemann, G., 881n17 637, 638, 638n7, 641, 649, Mittelhammer, R., 299n9, 399, 651, 664, 728n12, 740n2, 399n1, 447, 512n12, 603n3 743n3, 762, 787n12, 789 Mizon, G., 138, 138n5, 629, 655, 693 O MaCurdy, T., 472, 773n2, 795, 801 Moffitt, R., 592, 773n2, 797, 798, Oakes, D., 935n10 Maddala, G., 62n8, 181n2, 204n14, 799, 802, 827, 829, 838, 873 Oaxaca, R., 55 204n15, 205n16, 207, 240, Mohanty, M., 895 Oberhofer, W., 173, 264n16, 243, 555, 603n3, 683n5, Monfort, A., 137n3, 512n12, 515, 531, 533n22 744n7, 753, 768, 770, 774n3, 583n6, 590n9, 591, 592, 666, Obstfeld, M., 441n3, 467 779n9, 790n14, 801, 823, 849, 666n30 O’Halloran, S., 794n21 863n1, 869n9, 875, 883n20, Moon, H., 244 Ohtani, K., 123, 124, 165n16, 258n10 888, 889n27, 991n1, 1059n6 Moro, D., 264n15 Orcutt, G., 651, 874 Magee, L., 836 Moscone, F., 221 Ord, S., 137n2, 433n1, 487, 488, Magnac, T., 794 Moshino, G., 264n15 490, 991n1 Magnus, J., 4n2, 267, 979n14 Mouchart, M., 931n8 Orea, C., 562 Malinvaud, E., 285n1, 438, 445n8 Moulton, 216 Osterwald-Lenum, M., 763 Maloney, W., 184 Mount, T., 207, 996n5 Ouliaris, S., 762 Mandy, D., 263 Mroz, T., 53, 815, 816, 888 Mann, H., 75n16, 635, 745 Muellbauer, J., 272n19, 276n26 Manski, C., 443, 582n5, 793, 794, Mullahy, J., 412, 773n2, 922, 923, P 810, 810n35, 811, 882n18, 925, 930 Pagan, A., 166, 166n19, 178, 184, 891, 892n34 Muller, M., 575 205, 223n26, 236, 413n7, 416, Marcus, A., 834 Mundlak, Y., 189, 200, 209, 210, 213, 420, 441n3, 526, 534, 549, Mariano, R., 377, 382 230, 269, 881 661, 677n4, 715n1, 880n15, Marsagila, G., 575 Munkin, M., 403 881, 881n17, 928, 1023 Martins, M., 891 Munnell, A., 216, 224, 252, 254, Pagano, M., 677n4 Martins-Filho, C., 263 653, 1086 Pakes, A., 589, 827n44, 851 Matyas, L., 181n2, 441n3 Murdoch, J., 233 Panattoni, L., 660n20 Matzkin, R., 810, 810n34 Murphy, K., 303n11, 887n26 Papell, D., 243 Mazodier, P., 212 Parsa, R., 403 Mazzeo, M., 781, 1089 Passmore, W., 232, 437, 439 McAleer, M., 137n3, 140n7, 658n17 N Patterson, K., 399n1, 670n1, McCallum, B., 67n13, 329 Nagin, D., 594n11, 788n13 715n1, 728 McCoskey, S., 243, 767, 768 Nair-Reichert, U., 242, 
253 Pedroni, P., 243, 767, 768 McCullough, B., 101, 293, 313, Nakamura, A., 387n26, 872n10 Peel, D., 562, 564, 565 658n17, 660n20, 666n29, Nakamura, M., 387n26, 872n10 Perron, P., 715n1, 743n3, 752, 754 667n31, 687, 689n6, 909, Nakosteen, R., 776, 777 Pesaran, H., 137n3, 139, 181n2 944, 1092 Nelder, J., 909, 944, 1092 Pesaran, M., 138n4, 239, 240, 241, McDonald, J., 407n3, 873 Nelson, C., 257n9, 743n5, 744n6 243, 253, 767


Petersen, D., 876, 933n9 Ridder, G., 470n13, 900 Shaw, D., 864n4, 924 Phillips, A., 627 Rilstone, P., 61 Shea, J., 351 Phillips, G., 264n15 Riphahn, R., 5, 181, 307, 837, Shephard, R., 276 Phillips, P., 244, 375, 668, 742, 914, 1088 Sherlund, S., 232 743n3, 752, 754, 762 Rivers, D., 881n17 Shields, M., 837n51 Pierce, D., 645, 729, 730, 731 Robb, L., 70, 71, 278n31 Shin, Y., 239, 243, 767 Pike, M., 804 Roberts, H., 129 Shum, M., 144 Ploberger, W., 236, 709 Robertson, D., 239 Shumway, R., 731 Plosser, C., 744n6 Robins, J., 902 Sickles, R., 108, 123, 268, Poirier, D., 111n2, 399, 896n36, 897 Robinson, C., 776n5, 880n16 277n29, 795n22 Polachek, S., 129 Robons, R., 658 Silk, J., 384n21 Pollard, D., 827n44 Rodriguez-Poo, J., 776n5, 780, 810 Silva, J., 139n6 Potter, S., 145 Rogers, W., 407 Silver. J., 273n20 Powell, J., 274n23, 407n3, 412, Rose, A., 243 Sims, C., 357n5, 694, 699, 699n8 816, 880, 880n16, 882n18, Rose, J., 841, 842, 858 Singer, B., 805, 806, 807, 1075n23 Rosen, H., 475n18, 478, 711 931n8, 939 Prais, S., 158n8, 527, 651, 668 Rosen, S., 776n5, 890n29 Singleton, K., 429, 452 Pratt, J., 838 Rosenblatt, D., 415 Siow, A., 611n14 Prentice, R., 557n28, 880, 907n1, Rosett, R., 872n10 Sklar, A., 403 931n8, 935, 935n10, 937, 938, Rossi, P., 601n2, 841 Smith, J., 893n35 938n13, 940 Rothenberg, T., 753 Smith, M., 244, 403 Press, W., 576, 738, 811n36, 844n54, Rothschild, M., 658n17 Smith, R., 239, 240, 241, 242, 975n12, 1061n2 Rothstein, J., 320 253, 768 Psychology Today, 904, 905, 925 Rotnitzky, A., 902 Snyder, J., 773n2 Pudney, S., 837n51 Rowe, B., 814n37, 823n43 Solow, R., 131, 181n2, 1085 Rubin, D., 61, 376n15, 378, 379, 387, Song, S., 214, 219 537, 745, 925, 1078 Sorbom, D., 1061n1 Q Runkle, D., 583n6, 796n23 Spady, R., 413, 810, 811, 891 Quah, D., 706n12 Rupert, P., 186, 199, 231, 549, 1086 Spector, L., 781, 1089 Quandt, R., 285n1, 380, 432, 444n6, Ruud, P., 288, 399n1, 441n3, 583, Srivastava, T., 253n2, 256, 258n13 558, 890n31, 1068, 1069, 583n6, 809n31, 838n48, 881 Staiger, D., 320n1, 350, 350n11, 1070, 1071n16, 1074n21, 377, 627 1075n23 Stambuagh, R., 660 Quester, A., 869, 873 S Stegun, I., 436n2, 1061n2, 1063, Sala-i-Martin, S., 143, 146, 767 1064, 1065 Salem, D., 996n5 Stern, S., 61, 583n6 R Sargan, J., 207n18, 336, 470n14, Stewart, M., 795 Racine, J., 413n7, 420 476n19, 715n1 Stock, J., 145, 320n1, 350, 350n11, Raj, B., 181n2, 597 Savin, E., 264n15, 506n10 377, 378, 694n7, 699n8, Ramsey, J., 432, 444n6 Savin, N., 744n8 715n1, 753, 760, 761, 763 Rao, P., 158, 240, 493, 519, 557n28, Saxonhouse, G., 231n30, 232n31 Stoker, T., 810 585n8, 649, 1031, 1054, Schimek, M., 418 Stone, R., 272 1072n19 Schipp, B., 264n15 Strauss, R., 841, 843 Rasch, G., 803 Schmidt, P., 90n5, 108, 123, 194, Stuart, A., 137n2, 433n1, 487, 488, Redbook, 861, 862, 904 255n3, 264n15, 269n17, 342, 490, 991n1 Renfro, C., 658n17, 660n20, 666n29, 374n14, 377n16, 383n20, Sueyoshi, G., 231n30 667n31, 1061n1 470n14, 472, 475, 476, 478, Suits, D., 108n1, 197 Revankar, N., 257n8, 274n23, 541, 841, 843, 875, 878 Susin, S., 940 541n25, 1088 Schmidt, S., 538, 538n24 Swait, J., 842, 849n58 Revelt, D., 851, 858 Schwartz, A., 143, 697 Swamy, P., 223n26, 223n27, Review of Economics and Statistics, Schwert, W., 660, 752 242, 243 827n44 Scott, E., 557, 587, 801, 882 Symons, J., 239 Review of Economic Statistics, Seaks, T., 88, 88n4, 108n1, 296n5, 583n6 788n13 Rice, N., 841, 902 Segerson, K., 272n19 T Rich, R., 5, 695, 703, 704n11, Seldon, T., 767 Tahmiscioglu, A., 239 706n12, 707, 748, 750 Sepanski, J., 795n22 Taubman, P., 331 
Richard, J., 138, 138n5, 357n5, 358, Sevestre, P., 181n2 Tauchen, H., 200, 798 698, 764, 766 Shapiro, M., 704n11, 709 Tavlas, G., 223n26, 223n27


Taylor, W., 158, 162, 208n20, 323, Varian, H., 267 358n6, 447, 463, 465, 512n12, 336, 337, 339, 347, 351, 380, Veall, M., 597, 689n6, 790n14, 791 637, 637n6, 638n7, 639, 641, 406n2, 435, 471, 472, 473, 979 Vella, F., 441n3, 794n21, 880n15, 666, 666n30, 1056 Telser, L., 256n5 881, 881n17, 882n18, 899, White, S., 407n3 Terza, J., 841, 897, 898, 920, 925 900, 928 Wichern, D., 61, 265 Thayer, M., 233 Verbeek, M., 794n21, 899, 900 Wickens, M., 329, 441n3 Theil, H., 101n10, 101n11, 258n13, Verbon, H., 267 Wildman, B., 124 327n5, 375, 377n16, 378, 382, Vilcassim, N., 851 Willis, R., 776n5, 805, 806, 608n10, 648, 968, 975n13, Vinod, H., 293, 597 890n29, 997n7 1059n7 Volker, P., 140n7 Windmeijer, F., 790n14, 908n3, 909 Thompson, S., 811 Vuong, Q., 140, 140n8, 141, 142, Winkelmann, R., 837, 840, 907n1, Thornton, D., 715n1 144, 546, 881n17, 923, 911, 913, 925 Thursby, J., 124n13, 166n18, 278n31 924, 930 Winsten, C., 527, 651, 668 Tibshirani, R., 596n12 Vytlacil, E., 557n29, 893n35 Wise, D., 788n13, 869n8 Tobias, J., 41, 103, 130, 235, 601n2, Witte, A., 798, 869 836, 893n35, 1082 Wood, D., 278, 279n32, Tobin, J., 869, 871 W 279n33, 1087 Todd, P., 893n35 Wahba, S., 893n35, 894 Wooldridge, J., 15, 100, 100n9, Tomes, N., 776n5 Wald, A., 75n16, 249, 262, 284, 210n23, 269n17, 319, 358n6, Topel, R., 303n11, 887n26 453, 454, 500, 501, 502, 380n17, 507, 665, 666n29, Tosetti, E., 221 503, 505, 520, 536, 635, 667n31, 783, 797n24, 878, Toyoda, T., 123, 165n16 697, 700 892n34, 893, 900, 902 Train, K., 226, 578, 590, 590n9, Waldman, D., 876 Working, E., 354n1, 362n9 600n1, 622, 622n19, 841, 842, Walker, M., 272n19, 880n16, 882n18 Wright, J., 145, 350n11 851, 858 Wallace, T., 204n15 Wu, S., 243, 322, 325, 387n26, 768 Trebbi, F., 378 Wallis, K., 646n13 Wynand, P., 895 Trethaway, M., 296 Wambach, A., 307 Trivedi, P., 63, 402, 403, 404, 507, Wan, G., 268 517, 587, 590n9, 597, 598, Wang, P., 594n11 Y 601, 619, 677n4, 770, 907n1, Wansbeek, T., 365n12, 470n13, 900 Yaron, A., 444n7 908n3, 909, 910, 911, 912, 913 Waterman, R., 917, 918 Yatchew, A., 409, 410, 410n6, 787 Trognon, A., 212, 512n12, 515, Watson, G., 116n6, 645n12, 646, Yogo, M., 350n11, 378 666, 666n30 647, 650, 651, 655, 668, 692 Trotter, H., 1070 Watson, M., 145, 350, 694n7, 699n8, Z Trumble, D., 658 704n11, 709, 715n1, 756n12, Zabel, J., 899 Tsay, R., 670n1, 715n1, 739n1 760, 761, 1098 Zarembka, P., 296n3 Tunali, I., 776n5, 793 Waugh, F., 28, 39, 40, 134 Zavoina, R., 790n14, 792, 831 Webster, C., 1054 Zeger, S., 796n23 Wedel, M., 562, 594n11 Zellner, A., 115, 145, 250, 252, U Weeks, M., 137n3, 139 255n4, 256, 256n5, 258n11, Uhler, R., 790n14 Weinhold, D., 242, 253 283n35, 357n5, 382, 541, Ullah, A., 413n7, 416, 420, 661, 1023 Weiss, A., 666 541n25, 600, 601n2, 603n3, Welsh, R., 60 604n4, 605n5, 605n6, 606n7, Wertheimer, R., 874 608n12, 611n14, 612, 612n15, V West, K., 190, 453, 454, 454n11, 465, 619, 620, 1020n1, 1088 van den Brink, H., 838 472n16, 643, 667, 753 Zimmer, D., 402, 403, 404 van Praag, B., 895 White, H., 65n11, 94, 137n3, Zimmer, M., 776, 777, 889 van Soest, A., 407n3, 412, 837, 869, 163n11, 163n12, 163n13, Zimmerman, K., 790n14, 791 875n13, 878, 880 163n14, 164, 312, 320n1,

SUBJECT INDEX

A of ordered choices, 831–841 assessment, convergence, 1075 accelerated failure time models, 937 panel data, 180, 182–183. See also assignments, rating, 834 acceptance region, 1034 panel data associative law, 948 addition parametric and semiparametric, assumptions conformable for, 946 412 of asymptotic properties, 421–424 matrices, 946–947 policy, 703–710 of the classical linear regression partitioned matrices, 965 of restrictions across equations, model, 44 variables, testing, 210 370n13 instrumental variables vector spaces, 951 semiparametric, 809–813 estimation, 315–316 adjusted R-squared, 35–38, 142 specification, 692–693 irrelevant alternatives, 847–850 adjustment spectral, 735 least squares estimator, 43 equation, 239 time-series, 6, 243, 629–632 mean independence, 189, 901 to equilibrium, 391–394 of treatment effects, 11n2, of the nonlinear regression admissible theories, 365 889–890 model, 286–287 age-earnings profile, 114 of variances, 32–39 normality, 52–58, 100n9 Aitken’s theorem, 155 Vuong’s test for nonnested normality, sample selection, 891 Akaike Information Criterion models, 141 stationary, 631 (AIC), 143 analysis of variance (ANOVA), 192 asymptotically normally distributed algebra analytic functions, 1062 (CAN), 150 calculus and matrix, 979–987 Anderson/Hsiao estimator, 340–348 asymptotic covariance characteristics of roots and application. See also panel data matrices, 159, 317, 385n22, 446 vectors, 967–976 data sets in, 1081–1092 matrices, estimation, 163 least squares regression, 25 of heteroscedasticity regression model, 531n20 matrices, 945 models, 170–175 samples, 1057 matrices, geometry of, of likelihood-based tests, 504–506 asymptotic distribution, 44n1, 68, 951–961 of maximum likelihood 303, 317, 1056–1060 matrices, manipulation of, estimation (MLE), 517–567 of empirical moments, 449 945–951 nonlinear regression models, in generalized regression matrices, partitioned, 964–967 294–297 models, 153 quadratic forms, 976–979 panel data, 267–272, 307–311 of GMM estimators, 450 systems of linear equations, applied econometric methods, 6 instrumental variables 961–964 approximation estimation, 333 terminology, 945 errors, 334 nonlinear instrumental variables algorithms integrals by quadrature, estimation, 335 EM (expectation-maximization), 1064–1065 of two-step MLE, 509 1076–1078 leading term, 1060 asymptotic efficiency, 71–75, iterative, 1067 linear, 979 487, 1058 Metropolis-Hastings, 622 to normal cdf, 1063 maximum likelihood estimation optimization, 1067–1068 quadratic, 979 (MLE), 493–494 variable metric, 1071 Sterling’s, 1064 properties, 420–421 almost sure convergence, 1043 Taylor series, 979 asymptotic expectations, 1059–1060 alternative hypothesis, 83 to unknown functions, 13 asymptotic inefficiency, 1058 analog estimation, 443 AR(1), 632, 633–634, 648–649 asymptotic moments, 1060 analysis ARCH-M model, 660–662 asymptotic negligibility of ceteris paribus, 29 ARCH model, 658 innovations, 639 of classical regression model, ARCH(1) model, 659–660 asymptotic normality, 635, 1058 603–609 ARCH(q) model, 660–662 of least squares estimator, cointegration, 765–766 Arellano/Bond estimator, 65–67 of covariance, 108–110 340–348 maximum likelihood estimation of dynamic models, 689–693 artificial nesting, 138 (MLE), 492–493



of nonlinear least squares processes, 632 estimation, 777–796 estimator, 292 univariate, 685 estimators, 811 proof, 152 vector, 479 fit measures, 826 properties, 420 autoregressive distributed lag goodness of fit, measuring, asymptotic properties, 44, 64 (ARDL) models, 681–689 790–793 assumptions of, 421–424 autoregressive integrated heteroscedasticity, 787, 788–790 of estimators, 424–425 moving-average (ARIMA) inference, 777–796 of least squares, 151 model, 740 maximum simulated likelihood maximum likelihood estimate average partial effects, 780–785 estimation, 593 (MLE), 486, 490 averages models, 139, 411–412, 772–777 of method of moments, 434–436 Bayesian model, 144–146 panel data, 796–809 asymptotic results, 635–640 matrix weighted, 192 testing, 785–787 asymptotics, small T, 196–197 moving average (MA), 190 binary variables, 106–112, 835 asymptotic standard errors, 295 treatment effect on the binomial distributions, 998 asymptotic uncorrelatedness, 639 treated, 892 binomial probit model, 616–619 asymptotic variance, 494–496 Gibbs sampler, 617 AT&T Corporation, 181n3 bivariate distribution, 578 attenuation, 325–327, 868 B conditioning in, 1006–1009 attributes, 842 balanced panel data, 184, 269 incidental truncation, 883–884 attrition, 61, 901–902 bandwidth, 407, 416 normal, 1005, 1009–1010 augmentation basic statistical inference, 52–58 bivariate ordered probit models, Dickey-Fuller test, 751 basis vectors, 953–954 835–836 unit roots, 754 Bayes factor, 611 bivariate probit models, 817–826, augmentation, data, 616 Bayesian estimation, 192, 399, 896n36 autocorrelation, 17, 116n6, 600–601 bivariate random variable 256, 756n11 analysis of classical regression distribution, 1004–1006 coefficient, 632 model, 603–609 bivariate regression, 23 consistent covariance Bayes theorem, 601–603 block-diagonal matrix, 965 estimator, 643 binomial probit model, 616–619 bootstrapping, 407, 596–598 estimation, 649–651 individual effects model, 619–621 Box and Pierce’s test, 645 forecasting, 656–658 inference, 609–613 Box-Cox transformation, 296–297 and heteroscedasticity, 148–149 posterior distributions, 613–616 Box-Jenkins methods, 726n11 lagged variables, 682, 691–692 random parameters model, Breusch-Pagan multiplier test, 166 matrix, 632 621–623 Breusch-Pagan test, 167 maximum likelihood estimation Bayesian Information Criterion British Household Panel Survey (MLE), 527–529 (BIC), 143 (BHSP), 180, 841 misspecification of the model, 626 Bayesian model averaging, 144–146 Broyden-Fletcher-Goldfarb- negative, 627 Bayes theorem, 565, 601–603 Shanno (BFGS) negative, residuals, 629 behavior method, 1071 Newey and West’s estimator, 463 of data, 423–424 Broyden’s method, 1071 panel data, 185, 213, 652–655 of test statistics, 585 burn in, 614 partial, 723–726 behavioral equations, 356 Butler and Moffitt’s method, regression equations, 263–264 Behrens-Fisher problem, 123n11 552, 798 spatial, 218–222 Bernoulli distribution, 998 stationary stochastic process, best linear unbiased (BLU), 150 721–723 beta distribution, 997 C testing for, 644–647 BHHH estimator, 495, 662 calculus, 835, 979–987 autocovariance, 719 bias canonical correlation, 763 generating function, 732 caused by omission of variables, case matrix, 632 133–134 exactly identified, 443, 459 summability of, 639 omitted variables, 868 overidentified, 439, 459 autoregression, 716 self-selection, 62n7 categorical variables, 110–111 autoregressive form, 633, simultaneous equations, 355n3 categories of binary variables, 675, 718 binary choice, 571 107–108 
conditional heteroscedasticity, choice-based sampling, 793–794 causality 658–667 dynamic binary choice models, Granger, 358 model, 675 794–796 simultaneous equations model, moving-average processes, endogenous right-hand-side 357–358 716–718 variables, 813–817 censored data, 869–881


censored distributions, 999 estimated year dummy conditional latent class model, censored normal distribution, variables, 109 561–563 869–871 estimation, 440 conditional likelihood function, censored random variables, 871 hypothesis, testing, 52 496–497, 803 censored regression (tobit) model, linear combination of, 55–56 conditional logit model, 842, 871–874 matrices, 393 846–847, 852–858 censoring, 896n36 partial autocorrelation, 724 conditional mean function, count data, 924–931 partial correlation, 29–31 417, 1006 Poisson model, 925–931 partial regression, 28 conditional variance, 1007 tobit model, 925–931 pretest estimators, 135 conditions variables, 869 random model, 223–225, 851 bivariate distributions, 1006–1009 central limit theorem, 638–640, regression, 28 first-order, 982 1050–1055 scale factor, 100n9 Grenander, 65 central moments, 432, 990 structural, 369 identification, 119, 449, 457 central tendency, measure of, 1020 vector, least squares regression, initial, 389, 631, 794 CES production function, 119, 21–22 numbers, 60 285–286 cofactors, expansion by, 959 numbers of matrices, 971 ceteris paribus analysis, 29 cohort effect, 11n2 Oberhofer-Kmenta, 533n22 Chamberlain’s model, 230 cointegration, 181n2 order, 449, 457 characteristics analysis, 765–766 orthogonality, 456 equations, 684, 720 in consumption, 762 orthogonality, estimation based of roots and vectors, 967–976 nonstationary data, 756–767 on, 442–443 Chebychev’s inequality, 990, 1040 relationships, 764 rank, 449 Chebychev’s Weak Law of Large testing for, 761–763 regularity, 487–488 Numbers, 1043 cointegration rank, 759 confidence intervals, 295 chi-squared distribution, 84, collinearity, diagnosing, 60n6 for parameters, 54–55 993–995 columns testing, 505, 1037–1038 choice based samples, notation, 947 confidence level, 1033 793–794, 852 rank, 956 conformable Cholesky decomposition, 576n2 spaces, 956 for addition, 946 Cholesky factorization, 583, 974 vector, 945 for multiplication, 947 Cholesky values, 974 common (base 10) logs, 115n3 conjugate prior, 607 Chow test, 120n8 common effects, 899–902 consistency classes, 564 common factors, 655–656 of estimation, 244–246, 429–436 classical estimates, Bayesian common trends, 759–760 of estimators, 265, 318, 372, 1041 averaging and, 146 comparison of least squares estimator, 64–65 classical likelihood-based of binary choice estimators, 811 of maximum likelihood estimation, 400–402 of matrices, 978–979 estimation (MLE), 491–492 classical model selection, 144 of methods, 385 of nonlinear least squares classical normal linear regression of models, 38–39, 506–507 estimator, 291 model, 1014–1015 of parametric and of ordinary least squares classical regression model, 458, semiparametric analyses, 412 (OLS), 151 603–609 complete data log-likelihood, 1076 of properties, 420 classical testing procedures, completeness condition, 360 of root-n consistent 1034–1037 complete system of equations, 355 properties, 1051 class membership, predicting, 561 components of sample mean, 1041 closed-form solution, 1066 of equations, 367 testing, 1036 cluster estimator, 188, 515–517 error model, 201 constants, 16n3 clustering, 188 heteroscedasticity, 212–213 elasticity, 13 Cobb-Douglas cost function, 91, principal, 61 random, 857 116, 273, 276 comprehensive model, 138 constant term Cobb-Douglas production computation, 1061–1065 regression with a, 29 function, 408 derivatives, 1068–1069 R-squared and the, 37–38 Cochrane and Orcutt estimator, 649 nonlinear least squares estimator, constrained optimization, 
984–985 coefficients 292–293 constraints autocorrelation, 632 concentrated log-likelihood matrix, convergence, 151n1 change in a subset of, 122–123 function, 520, 1073, 1080 nonsingular matrix of, 435 of determination, 34, 1009 conditional density, 400, 433n1 optimization with, 1073–1074 economies of scale, 116n7 conditional distribution, 1013–1014 consumer price index (CPI), 22


consumption function, 314 Hurdle model, 920–924 Current Population Survey cointegration in, 757, 762 panel data, 915–920 (CPS), 180 fit of a, 34 random effects, 918–920 Hausman test for, 324–325 robust covariance matrices, instrumental variables 915–916 D estimation, 336 truncation, 924–931 data analysis, 629–632 J test for, 140 zero-altered Poisson model, data augmentation, 616 life cycle, 428–429 920–924 data generating process (DGP), linear regression models, 9–11 covariance 287, 436, 518n13, 743, 987 nonlinear function, 294–296 analysis of, 108–110 data generations, 11, 17–18 relationship to income, 2–3 asymptotic covariance matrix, data imputation, 62 Vuong’s test for, 142 385n22 data sets in applications, 1081–1092 contagion property, 1001 asymptotic matrices, 159, 317, 446 Davidon-Fletcher-Powell (DFP) contiguity, 218 asymptotic samples, 1057 method, 1071 contiguity matrix, W, 218 autocovariance. See decomposition, 33 continuous distributions, 575 autocovariance Oaxaca and Blinder’s, 55 continuous random variables, 987 instrumental variables singular value, 975 contrasts, effects, 198 estimation, 333 of variance, 1008 control function, 816, 836 joint distributions, 1003–1004 definite matrices, 976–979 convergence linear regression models, 17 degrees of freedom almost sure, 1043 matrices, 1011 correction, 204 assessment, 1075 matrices, ordinary least squares maximum likelihood estimation in distribution, 1048 (OLS), 162–164 (MLE), 519n14 of empirical moments, 448 restrictions, 355 delta method, 69, 297, 927, in lower powers, 1044 robust asymptotic matrices, 1055–1056 matrices, 151 511–517 demand mean square, 68 robust estimation, 210 equation, 69, 354 of moments, 1047 robust matrix, 307, 780 relationship, 9n1 nonsingular matrix of stationary, 630 system, 274 constraints, 435 structures model, 266 systems of demand equations, in probability, 1039–1042 covariates, 8, 933 272–280 to random variables, 1046–1047 Cox test, 142 translog demand system, 286 of random vectors or Cramér-Rao lower bound, 1031 density matrices, 1045 Cramer-Wold device, 1050 conditional, 433n1 in rth mean, 1044, 1045 credit scoring model, 304 individual, 482 copula functions, 402–405 criterion informative prior, 606–609 corner solutions, 878–880 Akaike Information Criterion integration, 604n4 corrections (AIC), 143 inverted Wishart prior, 621 degrees of freedom, 519n14 Bayesian Information Criterion kernel density estimator, errors, 689, 756 (BIC), 143 811, 1023 errors, nonstationary data, focused information criterion kernel estimator, 407, 414–416 760–761 (FIC), 144 kernel methods, 411–413 errors, spatial, 221 function, 421, 422, 437 likelihood function, 482 rank two, 1071 Kullback-Leibler Information marginal probability, 1002 correlation Criterion (KLIC), 141 model, 400 autocorrelation. 
See mean-squared error, 135 posterior, 601–603 autocorrelation model selection, 142–143 probability density function canonical, 763 prediction criteria, 143 (pdf), 1000 cross-function, testing, 265 critical region, 1034 regular, properties of, 488–490 joint distributions, 1003–1004 critical values, test statistics and, 388 truncated random variables, 864 matrices, 1011 cross-country growth dependent observations, 635 tetrachoric, 819 regressions, 145 dependent variables, 3, 8, 979 zero, testing for, 820 cross-function correlation, descriptive statistics, 1020–1023 correlogram, 728 testing, 265 determinants Cosine Law, 961 cross section of samples, 1020 of matrices, 958–961 count data, 542–547, 907 cumulated effect, 673 of a matrix, 972 censoring, 924–931 cumulative distribution function of partitioned matrices, 965 fixed effects, 916–918 (cdf), 988 determination, coefficients function form, 912–915 cumulative multipliers, 390 of, 1009 Greene-50558 Z04˙0135132452˙Sind June 27, 2007 14:58


deterministic relationships, 3, 9 function of random variables, survival, 935 deviance, 909 998–1000 t, 993–995 diagonal elements of the inverse, 30 gamma, 445, 996–997 truncated normal, 576 diagonalization of matrices, 969 gamma, example of, 1079–1080 truncated standard normal, 864 diagonal matrix, 945 gamma, parameters, 447 truncated uniform, 865 diagrams, scatter, 1021 gamma prior, 605n6 truncation, 863–864 Dickey-Fuller tests, 745–754 instrumental variables of two-step MLE, 509 differences, 647 estimation, 315, 333 Wald test statistic, 501 differencing, 739–741 inverse gamma, 996 Wishart, 997–998 digamma functions, 1064 inverse Gaussian, 431 distributive law, 948 dimensions of matrices, 945 inverted gamma, 604 disturbance, 9 discrepancies joint, 402–405, 1002–1006 AR(1), 633–634, 648–649 F statistic, 83 joint posterior, 605 autocorrelation, 148 vectors, 84 Lagrange multiplier, 503 feasible generalized least squares discrete choice models, 770 with large degrees of freedom, (FGLS), 203–205 discrete Fourier transform, 737 995–996 first order autoregressive, 527 discrete random variables, 998 large-sample distribution theory, instrumental variables distance, minimum distance 1038–1056 estimation, 333 estimation (MDE), 233, 270, likelihood functions, 1030 least squares regression, 20 335, 428–429, 436–441 likelihood ratio test, 500 linearized regression model, 288 distributed lag limiting, 66, 95, 317, 995, 1048 linear regression models, 14 form, 675 logistic, 997 maximum likelihood estimation model, 97 lognormal, 996 (MLE), 538–542 distribution marginal, 1013–1014 moving average (MA), 190 assumptions of linear regression marginal distributions of Newey and West’s estimator, 463 models, 44 statistics, 57–58 nonnormal, 92–96 asymptotic, 44n1, 68, 303, 317, moments of truncated, 864–867 nonspherical, 210–213 1056–1060 multivariate, 1010–1012 normally distributed, 18 asymptotic, in generalized multivariate normal, 1013–1019 processes, 632–634 regression models, 153 multivariate t, 606 reduced-form, 360 asymptotic, of empirical noncentral chi-squared, 501n8 regression, 158–164 moments, 449 nonlinear instrumental variables restrictions, 355 asymptotic, of GMM estimation, 335 spherical, 16–17, 17n4, 149 estimators, 450 normal, 11, 486, 864, 872n10, structural, 358 Bernoulli, 998 991–992, 1013–1014 SUR models, 264n16 beta, 997 normal, for sample moments, 458 zero mean of the, 286 binomial, 998 normal, Gibbs sampler, 614 dominant root, 391 bivariate, 578 normal, limiting, 1055 dot product, 947 bivariate, conditioning, normal, mixtures of, 432 double-length regression, 664 1006–1009 normally distributed doubly censored tobit model, 928 bivariate, incidental truncation, disturbances, 18 downhill simplex, 1066 883–884 ordered probit models, 832n49 draws bivariate normal, 1005, parameters, estimating, 430–434 Halton, 578 1009–1010 Poisson, 485, 906, 998 random, 577–580 bivariate random variables, Poisson, variance bound for, 1031 dummy variables, 106, 122, 195 1004–1006 posterior, 613–616 dummy variable trap, 108 censored, 999 prior, 604 duration, 906–907 censored normal, 869–871 probability, 987–988, 991–998 estimated duration models, 941 chi-squared, 84, 993–995 probability, representations of, exogenous variables, 937–938 conditional, 1013–1014 1000–1001 heterogeneity, 938–939 continuous, 575 sample minimum, 1025–1027 log-likelihood function, convergence in, 1048 sampling, 44, 1023–1027 938n12 Erlang, 996 sampling, of least squares maximum likelihood estimation exponential, 996–997, 1052 estimator, 
47 (MLE), 936–937 exponential, families of, size, 996 models, 931–942 433n1, 481 spherical normal, 1013 nonparametric, 939–942 F, 993–995, 1017–1018, 1050 standard normal, 992 parametric duration models, feasible generalized least squares standard normal cumulative 933–939 (FGLS), 157 function, 1062–1063 semiparametric, 939–942 Greene-50558 Z04˙0135132452˙Sind June 27, 2007 14:58


Durbin-Watson test, 116n6, income, confidence intervals for, 55 semilog, 113 645–646 long-run, 239 share, 278 dynamic binary choice models, elements, L, 444 simultaneous equations bias, 794–796 EM (expectation-maximization) 355n3 dynamic equation, 684–686 algorithm, 1076–1078 simultaneous equations models, dynamic labor study equation, 348 empirical estimators, 596 354, 466–469 dynamic models, 356 empirical moments single-equation linear models, analysis of, 689–693 asymptotic distribution of, 449 455–461 properties of, 389–394 convergence of, 448 single-equation nonlinear dynamic multipliers, 390 equation, 442, 447, 456 models, 461–464 dynamic panel data models, empirical results, 707–710 single-equation (pooled) 238–243, 314, 340–348, empirical studies, 574n1 estimators, 829n47 469–479 encompassing principle, 138 specification tests, 387 consistent estimation of, 244–246 endogeneity, 357–358 stability of a dynamic, 684–686 dynamic regression models, 314, endogenous right-hand-side stacking, 301n10 670, 671–677 variables, 813–817 structural, 355 dynamic SUR models, 264n15 endogenous variables, 355 supply, 354 environmental economics, 6 systems of demand, 272–280 equality systems of linear, 961–964 E of column and row rank, 957 wage, 186 earnings equations, 10–11, 53–54 information matrix, 488, 490 Yule-Walker, 641, 722 dummy variable in, 106 of matrices, 945–946 equilibrium econometric models, 355 equations adjustment to, 391–394 estimation of, 455–479 adjustment, 239 conditions, 354, 356 maximum likelihood estimation behavioral, 356 errors, 689, 758 (MLE), 496–497 characteristic, 684, 720, 967–968 market, 5 econometrics complete system of, 355 multiplier, 674 computation in, 1062–1065 components of, 367 multipliers, 239, 390 modeling, 2–5 constant elasticity, 13 relationships, 689 practice of, 6 demand, 354 ergodicity, 636, 728n12 Econometric Society, 1 demand, regression for, 69 Ergodic theorem, 448, 636–638 economic theory, 5 dynamic labor study, 348 Erlang distribution, 996 economies of scale, 116n7 earnings, 53–54, 106 errors education, 10–11 empirical moment, 442, 447 approximation, 334 effects empirical moment equation, 456 asymptotic standard, 295 contrasts, 198 estimated labor supply, 321 components model, 201 fixed effect model, 193–200 estimated log wage, 340 correction, 689, 756 group, 197 estimated rating assignment, 834 correction, nonstationary data, random, 200–210 estimated tobit, 873 760–761 unobserved model, 209–210 Euler, 428–429 cost functions, 116 efficiency fixed effects, 199–200 equilibrium, 689, 758 asymptotic, 71–75, 487, 1058 Gibbs sampler, 618 functions, 1062n4 estimation by generalized least illustrative systems of, 354–357 instrumental variables squares, 154–158 intrinsic linearity, 117 estimation, 327 feasible generalized least squares investment, 22–25 mean squared, 1029 (FGLS), 158 K moment, 431 mean-squared error criterion, 135 maximum likelihood estimation lags, 479 measurement, 63, 189, 314–315, (MLE), 493–494 least squares normal, 22 325–331 efficient estimation, 484–486, likelihood, 490 root mean squared error, 101 647–648 likelihood, binary choice spatial error correction, 221 efficient scale, 79 models, 778 specification, 885 efficient score, 502 linearized regression model, 288 standard, 1027 efficient two-step estimator, 664 nonlinear systems of, 300–302 standard error of the efficient unbiased estimators, 1028 normal, 27 estimator, 52 eigenvalue, 969 population moment, 456 standard error of the elasticities, 680 population regression, 8 
regression, 51 elasticity income, confidence intervals for, 55 semilog, 113 type I, 1034 constant, 13 long-run, 239 share, 278 type II, 1034 estimation of, 279 elements, L, 444 simultaneous equations bias, 355n3 weighted least squares, 168


estimated bivariate probit by instrumental variables, simultaneous equations model, model, 822 372–373 370–371 estimated consumption intervals, 610, 1027, 1032–1034 specification tests, 387–388 functions, 295 with a lagged dependent variable, system methods of, 380–384 estimated duration models, 941 651–652 treatment effects, 891–895 estimated earnings equation, least absolute deviations (LAD), truncation, 874–875 107, 889 406–409 two-stage least squares, 356 estimated labor supply least squares, 149–154, 154–158, two-step nonlinear least squares, equation, 321 194–196 302–307 estimated labor supply model, 816 least squares, regression, 21n1 unbiased, 46–47 estimated latent class model, 595 least squares, serial correlation, variances, 51–52, 642–643 estimated log wage equations, 340 640–643 vector autoregression estimated pooled probit model, 828 likelihood, random effects, (VAR), 696 estimated production functions, 91 214n14 when is unknown, 648–652 estimated random parameters limited information estimation estimators, 1027. See also model, 594 methods 371-380 estimation estimated rating assignment limited information maximum Anderson/Hsiao, 340–348 equation, 834 likelihood (LIML), 508 Arellano/Bond, 340–348 estimated selection corrected wage linear regression model, 401–402 asymptotic covariance equation, 888 lognormal mean, 578 matrices, 152 estimated SUR model, 259 maximum likelihood, bivariate asymptotic properties of, 424–425 estimated tobit equations, 873 probit models, 817–820 Bayesian, 192 estimated year-specific effects, 109 maximum likelihood estimation BHHH, 495 estimation, 1019–1020 (MLE), 118, 482, 485 binary choice, 811 analog, 443 maximum simulated likelihood, for binary choice models, 411–412 of the ARDL model, 682–683 228, 593 cluster, 188, 515–517 asymptotic covariance minimum distance, 428–429, Cochrane and Orcutt, 649 matrix, 163 436–441 consistency of, 265, 318, 372, 1041 autocorrelation, 649–651 minimum variance linear efficient two-step, 664 based on orthogonality unbiased, 46–47 efficient unbiased, 1028 conditions, 442–443 nonparametric, 398, 413–420 empirical, 596 Bayesian, 399, 600–601 one-step, 1072 extremum, 421 binary choice, 777–796 ordinary least squares (OLS), extremum, assumptions for classical, Bayesian 159–160 asymptotic properties, averaging, 146 panel data, 229–233 421–424 classical likelihood-based, parameters, 430–434, 471 feasible GLS, 258 400–402 parameters of univariate time full information maximum coefficients, 393, 440 series, 728–731 likelihood (FIML), 383–384 with a conjugate prior, 607 parametric, 398, 400–405 generalized method of moments consistent, 244–246, 429–436 point, 609–610 (GMM), 441–451, 450 of covariance matrix of b, point, of parameters, 1027–1032 generalized regression 160–162 Poisson model, 929 model, 156 criterion, 400 pretest, 134–135 GMM, 288, 406 doubly censored tobit model, 928 pseudo-maximum likelihood, Hausman and Taylor, 336, 337 of econometric models, 455–479 511–517, 666–667 instrumental variables, 245, efficient, 484–486, 647–648 random coefficients model, 225 316–318, 334 of elasticity, 279 random effects, 206, 207 inverse probability weighted, 902 feasible generalized least squares robust, 153–154 kernel density, 407, 414–416, (FGLS), 203–205 robust, heteroscedasticity, 159 811, 1023 in finite samples, 1027–1030 robust, using group means, least squares. 
See least squares limited information maximum estimator likelihood (LIML), 508 linear regression model, 401–402 limited information estimation methods, 371–380 lognormal mean, 578 maximum likelihood, bivariate probit models, 817–820 maximum likelihood estimation (MLE), 118, 482, 485 maximum simulated likelihood, 228, 593 minimum distance, 428–429, 436–441 minimum variance linear unbiased, 46–47 nonparametric, 398, 413–420 one-step, 1072 ordinary least squares (OLS), 159–160 panel data, 229–233 parameters, 430–434, 471 parameters of univariate time series, 728–731 parametric, 398, 400–405 point, 609–610 point, of parameters, 1027–1032 Poisson model, 929 pretest, 134–135 pseudo-maximum likelihood, 511–517, 666–667 random coefficients model, 225 random effects, 206, 207 robust, 153–154 robust, heteroscedasticity, 159 robust, using group means, 188–190 robust covariance, 210 robust covariance matrix, 185–188, 780 sample selection, 886–889 semiparametric, 398, 405–413, 810–812 simulation-based, 226–229, 399, 573, 589–596 simultaneous equations model, 370–371 specification tests, 387–388 system methods of, 380–384 treatment effects, 891–895 truncation, 874–875 two-stage least squares, 356 two-step nonlinear least squares, 302–307 unbiased, 46–47 variances, 51–52, 642–643 vector autoregression (VAR), 696 when is unknown, 648–652


maximum score, 792 examples, 1078–1080 FGM (Farlie, Gumbel, maximum simulated exclusions, 355, 367 Morgenstern) copula, 404 likelihood, 227 maximum likelihood estimation final form, 390 method of moments, 431, (MLE), 532–536 finite mixture models, 558, 559–560 434–436, 457 restrictions, 90, 370 finite moments, 435 minimum distance, 233, 335 exogeneity finite sample properties, 44, 1027 minimum distance estimator assumptions of linear regression of b in generalized regression (MDE), 270, 437 models, 44 models, 150 minimum expected loss asymptotic properties, 72 estimation in, 1027–1030 (MELO), 610 of independent variables, 11 exact, 63 minimum variance linear instrumental variables least squares estimator, 58–63 unbiased estimator estimation, 315, 316 of ordinary least squares (MVLUE), 1032 specification tests, 387 (OLS), 150 Newey and West’s, 463 vector autoregression (VAR), first differences, 190–193 Newey-West autocorrelation 698–699 first-order autoregression. See consistent covariance, 643 exogenous variables duration, AR(1) Newey-West robust 937–938 first-order conditions, 288, 982 covariance, 190 expansion by cofactors, 959 fit, competing model, 506–507 nonlinear instrumental expectations, 670 fit measures, 498–507, 826 variable, 462 asymptotic, 1059–1060 fit of a consumption function, 34 nonlinear least squares, 290–292 inequalities for, 1046 fitted logs, 100n9 nonparametric regression in joint distributions, 1003 fitting criterion, 20, 32 function, 812–813 of random variables, 989–991 fixed effects, 183 ordinary least squares (OLS), 150 expectations-augmented Phillips count data, 916–918 outer product of gradients curve, 627, 679 Hausman’s specification test, 209 (OPG), 495 explained variables, 8 linear regression model, 554n27 partial likelihood, 940 explanatory variables, 8 maximum likelihood estimation Prais and Winsten, 649 exponential (MLE), 554–558 pretest, 135 distributions, 996–997, models, 193–200, 211–212, product limit, 939 1030, 1052 268–272, 797, 800–806 properties of, 420–425 families, 432, 433, 481 panel data applications, 308–310 properties of generalized least model, 906 probit models, 837–838 squares, 155 model with fixed effects, 309 fixed panel, 184 quantile regression, 407 ratios, 612n15 fixed time and group effects, restricted least squares, 87–89 regression model, 554 197–200 robust, 166 samples, 1057 flexible cost function, 296 sampling theory, 610 ex post forecasts, 101 flexible functional form, 13, sandwich, 514–515 exposure, 912 275–280 single-equation (pooled), extensions flexible functions, 276 829n47 bivariate probit model, 896n36 focused information criterion standard error of the, 52 panel data, 184 (FIC), 144 statistics as, 1023–1027 random effects, 213–222 forecasting, 99n8 straight-forward two-step extremum estimators, 421–424 Bayesian model averaging, 145 GMM, 814 ex post forecasts, 101 two-step, 169 lagged variables, 686–689 unbiased, 1028 F models, 358 variables, comparing, 187 factoring matrices, 974–975 one-period-ahead forecast, 686 variance, for MLE, 496 factorization, Cholesky, 583 serial correlation, 656–658 White heteroscedasticity families, exponential, 432, 433 formulas, omitted variable, 134 consistent, 162–164 F distributions, 993–995, fractionally integrated series, 740n2 within-groups, 191–193 1017–1018, 1050 fractional moments, 576 Zellner’s efficient, 258n11 feasible generalized least squares frameworks, estimation, 398–400 Euler equations, 428–429 (FGLS), 156–158, 337 freedom correction, degrees of, 204 event counts, 906–915 
distribution, 157 frameworks, 398–400 efficiency, 158 full information maximum estimators, 258 likelihood (FIML), 508 GMM estimators, 464, 465 groupwise heteroscedasticity, 173 random effects, 203–205, 207 F statistic, 83, 298 F table, 199–200 F tests for firm and year effects, 109 full information, 371


full information maximum generalized linear regression Gross National Product (GNP), 22 likelihood (FIML), 383–384, model, 148 group effects 508, 849 generalized method of moments fixed time and, 197–200 full MLE, 554–558 (GMM), 399 testing, 197 full rank, 11, 14, 15, 957 asymptotic distribution of, 450 groupings of dummy variables, assumptions of linear regression estimation of econometric 108–110 models, 44 models, 455–479 groups asymptotic properties, 72 estimator, 288, 406, 441–451 means, robust estimation using, instrumental variables identification condition, 457 188–190 estimation, 315 maximum likelihood estimation within-groups estimator, quadratic forms, 1017–1018 (MLE), 496–497 191–193 fully recursive model, 372 properties, 447–451 groupwise heteroscedasticity, 172 functional form, 4, 106, 112–114 serial correlation, 643–644 growth flexible, 275–280 straight-forward two-step, 814 cross-country growth for nonlinear cost function, testing hypothesis in, 451–455 regressions, 145 114–117 generalized regression model, mixed fixed models, 242 nonlinear regression model, 286 148–149, 256, 459 surveys, 181n2 function form, 912–915 in asymptotic distribution, 153 Grunfeld model, 255 function of random variables, consistency of ordinary least Gumbel model, 774 998–1000 squares (OLS), 151 fundamental probability transform, feasible generalized least squares 403, 575 (FGLS), 156–158 H finite-sample properties Halton draws, 578 of b in, 150 Halton sequences, 577–580 G least squares estimation, 149–154 Hausman and Taylor estimator, gamma distribution, 996–997 maximum likelihood estimation 336, 337 example of, 1079–1080 (MLE), 522–523 Hausman and Taylor (HT) generalized method of moments generalized residual, 778n8, 934 formulation, 470 (GMM) estimation, 445 generalized sum of squares, 156, 522 Hausman test, 339 parameters, 447 general model building for consumption function, gamma functions, 1063–1064 strategies, 136 324–325 gamma prior distribution, 605n6 general modeling framework, instrumental variables gamma regression model, 72 182–183 estimation, 321–325 GARCH model, 661–662 General Theory of Employment, specification test, 208–209 effects, 403 Interest, and Money, hazard function, 866, 1001 maximum likelihood estimation 2–3, 9–11 hazard rate, 933 (MLE), 662–664 geometric regression model health care utilization, 307 testing, 664–666 maximum likelihood estimation health economics, 6 volatility, 665 (MLE), 542–547 Heckman study, 588 Gauss-Hermite quadrature, panel data, 593 Hermite polynomials, 1062 552, 1065 random effects, 592–593 Hessian matrix, 981 Gaussian quadrature, 1065 geometry of matrices, 951–961 heterogeneity, 182, 794 Gauss-Laguerre quadrature, 1065 German Socioeconomic Panel across units, 180 Gauss-Markov theorem, 50 (GSOEP), 180 duration, 938–939 asymptotic efficiency, 71 GHK simulator, 582–583 estimation with first least squares estimator, 48–49 Gibbs sampler differences, 190 multicollinearity, 59 binomial probit model, 617 measured and unmeasured, general data generating equations, 618 560–561 processes, 72 normal distribution, 614 modeling, 806–807 generalized ARCH models, globally concave, 982 negative binomial regression 660–662 globally convex, 982 model, 911–912 generalized Chebychev’s global maximum, 982 parameters, 222–244, 238–243, inequality, 1045 goodness of fit, 32–39, 790–793, 807–809 generalized inverse, 323, 975–976 908–909 heteroscedasticity, 16, 148–149, generalized least squares, 154–156, Gordin’s central limit theorem, 640 158–164, 256 256–257 gradient 
methods, 1069–1071 applications of, 170–175 estimated covariance matrix of b, Granger causality, 358, 699–701 autoregressive conditional, 160–162 Grenander conditions, 65 658–667 random effects, 202 grid search, 1066 binary choice, 787, 788–790


groupwise, 172 I asymptotic properties, 72 least squares estimation, 149–154 idempotent matrices, 950–951, 974 exogeneity of, 11 microeconomic data, 875–877 idempotent quadratic forms, instrumental variables multiplicative, 170–172, 523–527 978, 1016 estimation, 315 multiplicative, tobit model, 876 identical explanatory variables, 257 indicators, 329 ordinary least squares (OLS) identical regressors, 257 indices estimation, 159–160 identifiability index function models, 550, 775 random effects, 212–213 maximum likelihood estimation likelihood ratio index, 507 regression, 459 (MLE), 497 qualification, 129 regression equations, 263 of model parameters, 286 indirect least squares, 371 robust estimation, 159 of parameters, 422 individual density, 482 robustness to unknown, 169 identification, 327 individual effects model, 182, and serial correlation, 149 of conditions, 14, 119, 292, 619–621 simultaneous equations models 449, 457 inefficiency with, 466–469 instrumental variables asymptotic, 1058 testing, 165–167 estimation, 337 of least squares, 160 weighted least squares, 168 of intrinsic linearity, 117–120 inequality White consistent estimator, of Klein’s model, 369 Chebychev’s, 990, 1040 162–164 of nonlinearity, 114–117 for expectations, 1046 heteroscedasticity extreme value of parameters, 482–484 generalized Chebychev’s, 1045 (HEV) model, 856 problem of, 354, 356 Jensen’s, 991 heteroscedastic regression, 159 of rank and order conditions, likelihood, 491 hierarchical linear model, 234 354–370 Markov’s, 1040 hierarchical model, 184, 230 simultaneous equations model, inertia, 17 highest posterior density (HPD) 361–370 inference, 81, 400–405, 1019–1020 interval, 610 through nonsample Bayesian estimation, 609–613 high-frequency time-series data, information, 370 binary choice, 777–796 volatile, 148 of time series, 712 hypothesis testing, 1034–1038 histograms, 47, 414, 1023 identify matrix, 945 vector autoregression homogeneity restriction, 262 identities, 355 (VAR), 707 homogeneous equation system, ignorable case, 61 infinite lag model, 677 962, 964 illustrative systems of equations, infinite lag structures, 717n4 homoscedasticity, 11, 16, 1007 354–357 information matrix equality, assumptions of linear regression impact multiplier, 389, 673 488, 490 models, 44 importance informative prior density, 606–609 asymptotic properties, 72 function, 581 initial conditions, 389, 631, 794 instrumental variables sampling, 579–582 inner product, 947 estimation, 315 improper prior, 619–620 innovations, 630 nonlinear regression impulse response function, 394, asymptotic negligibility of, 639 models, 286 701–702 time-series models, 717 hospital cost function, 439 incidental parameters problem, vector autoregression H2SLS, 467 309, 557, 586–589, 797 (VAR), 702 Hurdle model, 920–924 incidental truncation, 882, instability hybrid case, 842n53 883–884 models, testing, 766–767 hyperplane, 955 inclusion of irrelevant variables, 136 of vector autoregression hypothesis income, 9n1, 11n2 (VAR), 712 binary choice, 785–787 earnings and education, 10–11 instrumental variables estimation, LM statistic, 454 relationship to consumption, 245, 314–315, 327–331, 356, maximum likelihood estimation 2–3 461, 893 (MLE), 498–507 inconsistency assumptions, 315–316 nulls, restrictions of, 140 estimates, 201 dynamic panel data models, restrictions, 115n4 least squares, 314–315 340–348 testing, 83–92, 259–263, 295, independence empirical moment equation, 447 298–300 of b and s-squared, 53 generalized regression model, testing, 
coefficients, 52 irrelevant alternatives 332–333 testing, estimation, 425 assumptions, 847–850 Hausman test, 321–325 testing, inference, 1034–1038 of linear forms, 1018–1019 measurement errors, 325–331 testing, nonnested, 138 of quadratic forms, 1018–1019 models, 372–373 true size of the test, 585 independent variables, 3, 8, 979 nonlinear, 333–336 Greene-50558 Z04˙0135132452˙Sind June 27, 2007 14:58


instrumental variables J lag operator, 674, 717n3 estimation (continued) jackknife technique, 164 Lagrange multiplier, 264 nonlinear estimator, 462 Jacobian, 298, 518 autocorrelation, testing for, ordinary least squares (OLS), 316 Jensen’s inequality, 991 644–645 panel data, 336–349 joint distributions, 402–405, binary choice, 786 random effects, 336–339 1002–1006 heteroscedasticity, 876 two-stage least squares, 318–321 jointly dependent variables, 355 maximum likelihood estimation weak instruments, 350–352 joint posterior distribution, 605 (MLE), 502–504 Wu test, 321–325 J test, 139–140, 140 normal linear regression instruments model, 520 streams as, 320 statistic, 454, 534 weak, 350–352 K testing, 96, 166, 205, 299–300, 505 weak, testing, 377–380 k class, 377 lags insufficient observations, 121–122 kernels dynamic models, 389 integrals by quadrature, density estimator, 407, 414–416, equations, 479 approximation, 1064–1065 811, 1023 spatial, 221 integrated hazard function, 934 density methods, 411–413 window, 735 integrated order of one, 740 method of, 812 large numbers, laws of, 1042–1045 integrated processes, 739–741 Keyne’s consumption function, large-sample distribution theory, integration 2–3, 9–11 1038–1056 density, 604n4 Khinchine’s Weak Law of Large large sample properties, 63–75, Monte Carlo, 576–583 Numbers, 1042 290–292 interaction terms, 113 Khinchine theorem, 67, 430 maximum likelihood estimate interdependent systems, 355 Klein’s model, 355, 357, 369, 393 (MLE), 486 intervals K moment equations, 431 results, 613 confidence, testing, 1037–1038 Kolmogorov’s Strong Law of Large tests, 92–96 estimation, 610, 1027, 1032–1034 Numbers, 1043 latent class models, 558 prediction, 99, 100 K parameters, 443n4 latent class regression model, 562 intrinsic linearity, 117–120 KPSS test of stationarity, 755–756 latent regression, 775–777 invariance K regressors, 182 latent roots, 969 lack of, 98 Krinsky and Robb method, 70 latent vectors, 969 maximum likelihood estimation Kronecker products, 966–967 Law of Iterated Expectations, 1007 (MLE), 494 Kruskal’s theorem, 176 laws of large numbers, 1042–1045 invariants, 274, 278 Kullback-Leibler Information leading term approximation, 1060 inverse function, 986 Criterion (KLIC), 141 least absolute deviations (LAD), inverse gamma distribution, 996 kurtosis, 1021 406–409, 880 inverse Gaussian distribution, 431 least squares, 43–44. 
See also inverse matrices, 465, 962–964 nonlinear least squares inverses L estimator Mills ratio, 866 labor economics, 882n18 asymptotic normality of, 65–67 of partitioned matrices, 965–966 labor supply model, 320, 324, 815 asymptotic properties of, 151 probability weighted lack of invariance, 98 attenuation, 325–327 estimator, 902 lagged dependent variable basic statistical inference, 52–58 inverted gamma distribution, 604 estimation with a, 651–652 coefficient vector, 21–22 inverted Wishart prior density, 621 testing, 646 consistency of, 64–65 invertibility, time-series models, lagged variables dummy variable (LSDV), 195 718–721 analysis of dynamic models, estimation, 149–154, 154–158, invertible polynomials, 676 689–693 194–196 investments autocorrelation, 682, 691–692 estimation, serial correlation, equations, 22–25 autoregressive distributed lag 640–643 prediction for, 99–101 (ARDL) models, 681–689 feasible generalized least squares restricted equations, 86 dynamic regression models, (FGLS), 156–158 irrelevant alternatives assumptions, 671–677 finite-sample properties, 58–63 847–850 forecasting, 686–689 F statistic, 83 irrelevant variables, inclusion models with, 670–671 Gauss-Markov Theorem, 48–49 of, 136 simple distributed lag models, generalized, 154–156, 256–257 I statistic, 219 677–681 inconsistency, 314–315 iterations, 294 vector autoregression (VAR), indirect, 371 iterative algorithms, 1067 693–712 inefficiency of, 160 Greene-50558 Z04˙0135132452˙Sind June 27, 2007 14:58


large sample properties of, 63–75 probability, 67, 326, 430, 1046 loglinear model, 13, 112, 305, 907 Lindeberg-Feller central limit variances, 1049 lognormal distributions, 996 theorem, 152 Lindeberg-Feller central limit lognormal mean estimation, 578 loss of fit from restricted, 89–92 theorem, 66, 152 log-odds ratio, 844 maximum likelihood Lindeberg-Levy central limit log-quadratic cost function, 117 estimates, 118 theorem, 435, 1051 logs motivating, 44–46 linear approximation, 979 common (base 10), 115n3 normal equations, 22 linear combination, 948 natural, 115n3 normality assumption, 52–58 of coefficients, 55–56 longitudinal data sets, 180 ordinary least squares (OLS), of vectors, 953–954 Longley data, multicollinearity 150, 159–160 linear dependence, 954–955 in, 60 problem, 1066 linear equations, 961–964 long-run elasticity, 239 projections, 960 linear form, 1018–1019 long run multiplier, 239, 673 random effects, 202 linear functions long run theoretical model, 765–766 regression, 20–26, 164 of normal vectors, 1015 loss function, 610 residuals, 541n25, 650 sets of, 1011–1012 loss of fit, 89–92 stochastic repressors, 49–50 linear independence, 954 lower powers, convergence in, 1044 three-stage (3SLS), 381–383, 537 linearity, 11 lower triangular matrix, 945, 974 two-stage (2SLS), 318–321, 356, assumptions of linear regression 373–375 models, 44 two-step nonlinear estimation, asymptotic properties, 72 M 302–307 instrumental variables macroeconomic models, 5, 356 unbiased estimation, 46–47 estimation, 315 maginal propensity to consume, 609 variances, 48–49 of regression models, 12–19 marginal distributions, 404, variances, estimating, 51–52 linearized regression, 288–290 1013–1014 variances, estimation, 642–643 linear least squares regression, 4 marginal effects, 114, 775 weighted, 167–169, 444 linearly deterministic binary choice, 780–785 least variance ratio, 376 component, 726 binary variables, 835 L elements, 444 linearly indeterministic bivariate probit models, 821–822 length of vectors, 960 component, 726 censored regression model, 872 Liapounov central limit linear probability model, 772 marginal probability density, 1002 theorem, 1054 linear random effects model, market equilibrium, 5 life cycle consumption, 428–429 547–550 Markov Chain Monte Carlo life expectancy, 125 linear regression model, 8–18, 304, (MCMC) methods, 573, 574, likelihood-based tests, application 401–402 600n1, 615 of, 504–506 assumptions of, 44 Markov’s inequality, 1040 likelihood equation, 490, 778 generalized model, 148 Markov’s Strong Law of Large likelihood estimation, 214n14 homoscedasticity, 1009 Numbers, 1044 likelihood function, 445n9, 482–484 moments, 1008 Markov theorem, 67 exponential and normal linear restrictions, 82, 85, 355 Martingale difference central limit distributions, 1030 linear time trends, 16n3 theorem, 638 prior density, 601 linear unbiased estimator, 46 Martingale difference likelihood inequality, 491 line search, 1067 sequence, 638 likelihood ratio, 264 Ljung-Box statistic, 729 Martingale difference series, 449 index, 507 Ljung’s refinement, 645 Martingale sequence, 638 statistic, 453, 534, 854 LM. See Lagrange multiplier matching estimator, 893 testing, 498–500, 505, 520, 786 local maxima, 982 mathematical statistics, 5 Vuong’s test of nonnested local optima, 982 matrices models, 140 location, measures of, 1020 addition, 946–947 limited information, 849 logistic distributions, 997 algebra, 945. 
See also algebra estimation methods 371-380 logit model, 774 asymptotic covariance, 153–154, estimator, 371 for binary choice, 139 159, 317, 385n22, 446 limited information maximum fixed effects, 805 autocorrelation, 632 likelihood (LIML) log likelihood function, 485, 543 autocovariance, 632 estimator, 375, 508, 570 duration, 938n12 block-diagonal, 965 limiting Kullback-Leibler Information calculus, 979–987 distribution, 95, 317, 995, 1048 Criterion (KLIC), 141 coefficients, 393 mean, 1049 loglinear conditional mean comparing, 978–979 normal distributions, 1055 function, 542 condition numbers of, 971
matrices (continued) bivariate probit models, 817–820 lognormal, estimating, 578 convergence, 151 class membership, predicting, 561 normal, confidence intervals correlation, 1011 cluster estimators, 515–517 for, 1033 covariance, 1011 conditional latent class model, one-sided tests, 1038 covariance, ordinary least 561–563 quadratic, convergence, 1–39 squares (OLS), 162–164 consistency, 491–492 of random variables, 989 definite, 976–979 degrees of freedom, 519n14 rth, convergence in, 1044, 1045 determinants of, 958–961, 972 duration, 936–937 sampling, 1025 diagonalization of, 969 exclusion restrictions, 532–536 sampling, consistency of, 1041 equality of, 945–946 finite mixture models, 558, square convergence, 68 estimation of covariance of b, 559–560 squared error, 135, 1029 160–162 fixed effects, 554–558 truncated, 865 factoring, 974–975 GARCH model, 662–664 value theorem, 450 generalized inverse of, 975–976 generalized regression model, measured heterogeneity, 560–561 geometry of, 951–961 522–523 measurement Hessian, 981 geometric regression model, economies of scale, 116n7 idempotent, 950–951, 974 542–547 errors, 63, 189, 314–315, 325–331 information matrix equality, heterogeneity, 560–561 of a fit, 35–38 488, 490 invariance, 494 goodness of fit, 790–793, 908–909 inverse, 465, 962–964 latent class models, 558 of variables, 4 manipulation of, 945–951 likelihood function, 482–484 measure of central tendency, 1020 moments, 30 linear random effects model, measure of linear association, 1009 multiplication, 947–950 547–550 measures of central tendency, 989 nonsingular, 963 nonlinear regression models, measures of location, 1020 null, 946 537–538 median, 989 optimal weighting matrix, 438 nonnormal disturbances, 538–542 asymptotic inefficiency, 1058 partitioned, 964–967 optimization, 1072–1073 lag, 674 powers of, 972–974 panel data, 547, 564–567 variance of, bootstrapping, 597 precision, 612 pooled model, 530–531 Medical Expenditure Panel Survey projection, 25 principle of, 484–486 (MEPS), 180 rank, 956–958, 969–971 properties of, 486–496 medical research, 6 residual maker, 25 quadrature, 550–554 M estimators, 421 robust asymptotic covariance, residuals, 529 method of moments, 402, 406 511–517 sandwich estimators, 514–515 asymptotic properties of, robust covariance, 307 seemingly unrelated regression 434–436 robust covariance, count data, (SUR) model, 529–530, 531 consistent estimation, 429–436 915–916 simultaneous equations model, estimator, 457 robust covariance estimation, 780 536–537 estimators, 431 second derivatives, 981 stochastic frontier model, generating functions, 432 short rank, 957 538–542 methodology, 5, 6 singularity of the disturbance two-step, 507–511 methods covariance, 278 maximum likelihood method, 399 bootstrap, 407 spectral decomposition of, 969 maximum score estimator, 792 Box-Jenkins, 726n11 symmetric, 974n9 maximum simulated likelihood Broyden-Fletcher-Goldfarb- trace of, 971–972 (MSL), 589, 595, 799 Shanno (BFGS), 1071 transformation, 473 estimates, 228 Broyden’s, 1071 transposition, 946 estimation, 593 Butler and Moffitt’s, 552, 798 weighting, 439, 444n7, 459, 478 estimator, 227 Davidon-Fletcher-Powell weighting matrix, 438 mean (DFP), 1071 zeros, 946 assumptions, 15 delta, 69, 927, 1055–1056 matrix weighted average, 192 asymptotic distributions, 1057 gradient, 1069–1071 maximum likelihood estimation conditional, 1006 of kernels, 812 (MLE), 72, 118, 274, 401, conditional mean function, 417 of kernels, density, 411–413 428, 482, 485, 770 deviations from, 966 Krinsky and 
Robb, 70 applications of, 517–567 Markov Chain Monte Carlo asymptotic efficiency, 493–494 (MCMC), 573, 574 asymptotic normality, 492–493 Newton's, 524, 544, 1070, 1079 asymptotic variance, 494–496 QR, 975n12 autocorrelation, 527–529 quadratic hill-climbing, 1070
quasi-Newton, 1071 bivariate ordered probit, 835–836 infinite lag, 677 rejection, 576 bivariate probit, 817–826, 896n36 instability, testing, 766–767 of scoring, 524, 779 building, 136–137 joint distributions, 402–405 of simulated moments, 590, censored regression (tobit), J test, 139–140 595–596 871–874 Klein's, 357, 369 steepest ascent, 1069 Chamberlain's, 230 with lagged variables, 670–671 Metropolis-Hastings algorithm, 622 classical normal linear latent class regression, 562 Michigan Panel Study of Income regression, 1014–1015 linear probability, 772 Dynamics (PSID), 180 Cobb-Douglas, 91 linear regression, 8–18, 288–290, microeconomics comparing, 38–39 304, 401–402, 554n27 models, 5 comprehensive, 138 linear regression, assumptions vector autoregression (VAR) in, confirmation of, 4 of, 44 711–712 constants, 16n3 logit, 774 minimal sufficient statistic, 803 covariance structures, 266 logit, fixed effects, 805 minimization, 334 credit scoring, 304 loglinear, 13, 112, 305, 907 minimum distance estimation density, 400 long-run theoretical, 765–766 (MDE), 233, 270, 335, discrete choice, 770 macroeconomic, 356 428–429, 436–441, 437 distributed lag, 97 mean-squared error criterion, 135 consistent estimation, 429–436 duration, 931–942 methodology, 5 generalized method of moments dynamic, 356 missing data, 1076 (GMM), 441–451 dynamic, binary choice, 794–796 mixed, 233 minimum expected loss (MELO) dynamic, panel data, 238–243, mixed fixed growth, 242 estimator, 610 244–246, 340–348, 469–479 mixed linear, 235 minimum mean error predictor, dynamic, properties of, 389–394 mth, 145 45–46 dynamic, regression, 670 multinomial probit, 850–851 minimum simulated sum of squares econometric, 2–5, 355 multiple linear regression, 8 function, 228 econometric, estimation of, multivariate probit, 826–831 minimum variance linear unbiased 455–479 multivariate regression, 255 estimation (MVLUE), error components, 201 negative binomial, 910 46–47, 1032 estimated duration, 941 negative binomial regression, missing at random (MAR), 62 estimated latent class, 595 heterogeneity, 911–912 missing completely at random estimated random nested, 81–83 (MCAR), 61 parameters, 594 nonlinear panel data regression, missing data model, 1076 estimated SUR, 259 308–310 mixed fixed growth models, 242 event counts, 907–915 nonlinear regression, 285–293 mixed linear model, 235 exponential, 906 nonlinear regression, mixed logit model, 851–852 exponential model with fixed applications, 294–297 mixed models, 233 effects, 309 nonnested, 923 mixtures of normal exponential regression, 554 nonnested, selection of, 137–142 distributions, 432 fit, computing, 506–507 nonstationary data, 243–244 mode, 989 fixed effects, 193–200, 211–212, nonstructural, 355n2 models 268–272, 797, 800–806 nonnested, 82 accelerated failure time, 937 forecasting, 358 normal-Poisson mixture, 911n5 ARCH, 658 forms, 695 ordered choice, 842 ARCH(1), 659–660 fully recursive, 372 ordered probit, 831–835 ARCH-M, 660–662 GARCH, 661–662, 662–664 panel data, 180, 183–184 ARCH(q), 660–662 generalized ARCH, 660–662 parametric duration, 933–939 autoregressive, 675 generalized linear regression, 148 partial adjustment, 679 autoregressive distributed lag generalized regression, partially linear, 409 (ARDL), 681–689 148–149, 256 piecewise loglinear, 116 autoregressive integrated Grunfeld, 255 pooled, 266–267 moving-average Gumbel, 774 pooled regression, 185–193 (ARIMA), 740 heterogeneity, 806–807 population averaged, 185 Bayesian estimation, 600–601 heteroscedastic extreme value
probability, 400, 781 Bayesian model averaging, (HEV), 856 qualitative response (QR), 770 144–146 hierarchical, 184, 230 random coefficient, 223–225, 225 binary choice, 139, 411–412, hierarchical linear, 234 random effects, 206, 268–272, 772–777 index function, 550, 775 796, 797–800 binomial probit, 616–619 individual effects, 619–621
models (continued) of censored normal variables, 870 multiple linear regression model, 8 random effects geometric central, 432, 990 multiple regression, 24 regression, 553–554 convergence of, 1047 computation of, 22 random parameters, 621–623 diagonal elements of the frameworks, 328 random parameters logit inverse, 30 multiple solutions, 1075 (RPL), 851 empirical, asymptotic multiplication random utility, 777 distribution of, 449 conformable for, 947 recursive, 359 empirical, convergence of, 448 matrices, 947–950 recursive bivariate probit, empirical equation, 442, 447, 456 partitioned matrices, 965 823–826 empirical moment equation, 456 scalars, 947–950, 951 reduced form of, 355, 360 finite, 435 vectors, 947 regression, analysis of classical, fractional, 576 multiplicative heteroscedasticity, 603–609 generalized method of, 399 170–172, 523–527, 876 regression, selection, 884–886 of incidentally truncated multipliers relational lag, 683 bivariate normal cumulative, 390 seemingly unrelated regression distribution, 883 dynamic, 390 (SUR), 254–263, 464–466 K moment equations, 431 equilibrium, 239, 390, 674 selection, 143–146 linear regression, 1008 impact, 389, 673 selection criterion, 142–143 method of, 402, 406 Lagrange, 264 self-regressive, 716 method of, asymptotic properties Lagrange, binary choice, 786 semilog, 13 of, 434–436 long run, 239, 673 semiparametric regression, 52 method of, consistent estimation, maximum likelihood estimation simple distributed lag, 677–681 429–436 (MLE), 502–504 simultaneous equations, 354, method of, estimators, 431 modes and, 389–390 466–469 method of, generating multivariate distributions, single-equation linear, 455–461 functions, 432 1010–1012 single-equation nonlinear, method of simulated, 590, multivariate Lindeberg-Feller 461–464 595–596 central limit theorem, 1055 spherical disturbances, 149 multivariate distributions, multivariate Lindeberg-Levy stochastic frontier, 402, 572 1010–1011 central limit theorem, 1054 strongly exogenous, 358 population equation, 456 multivariate normal distribution, structural, 328, 355n2 relationships, 1007–1009 1013–1019 structural change, 120–128 sample, normal distribution multivariate normal population, 576 time-series, 715 for, 458 multivariate normal probabilities, time-series, panel data, 181n1 of truncated distributions, 582–583 time-space dynamic, 220 864–867 multivariate probit model, 826–831 time-space simultaneous, 219 uncentered, 430 multivariate regression model, 255 translog, 13–14 validity of moment restrictions, multivariate t distribution, 606 two-step estimation of panel 452–455 Mundlak’s approach, 209–210, 230 data, 229–233 money demand equation, 626 Two-Variable Regression, 48 Monte Carlo integration, 576–583 uncorrelated linear Monte Carlo studies, 574n1, N regression, 304 584–589 National Institute of Standards and underlying probability, 287 Moore-Penrose inverse, 975 Technology (NIST), 975n12 univariate time-series, 716, most powerful test, 1035 natural logs, 115n3 726–728 moving average (MA), 190 nearest neighbor concept, 417 unobserved effects, 209–210 form, 633, 675, 718 necessary condition for an unordered multiple choice, processes, 633 optimum, 982 841–859 MSCORE, 812 negative autocorrelation, 627, 629 vector moving average mth model, 145 negative binomial model, 910 (VMA), 704 multicollinearity, 59–61 negative binomial regression Weibull model, 937 dummy variables, 108 model, 911–912 Weibull survival, 939 in nonlinear regression negative duration dependence, 935 zero-inflated Poisson 
(ZIP), 930 models, 296 Negbin 1 form, 913 multinomial logit model, 842, With Zeros (WZ), 922n7 Negbin 2 form, 913 843–845, 854 modified zero-order regression, 62 Negbin P, 913 multinomial probit model, 850–851 moment-generating function, 1001 nested logit models, 847–850 multiple cointegration vectors, 759 moments nested models, 81–83 multiple correlation, 37 asymptotic, 1060 nested random effects, 214–217
net effect, 9 nonautocorrelation, 11, 16 normal mean, 1033 netting out, 28 assumptions of linear regression normal-Poisson mixture model, Newey and West's estimator, 463 models, 44 911n5 Newey-West autocorrelation asymptotic properties, 72 normal sampling, 1058 consistent covariance instrumental variables normal vectors, 1015 estimator, 643 estimation, 315 no solution, 1075 Newey-West robust covariance nonlinear regression models, 286 notation, 12, 947 estimator, 190 nonparametric average cost not missing at random (NMAR), 62 Newton's method, 524, 544, function, 418–419 null hypothesis, 83, 140 1070, 1079 nonparametric duration, 939–942 null matrix, 946 Nobel prize in Economic Sciences, 1 nonparametric estimation, 398, numbers noncausality, Granger, 699 413–420 large, laws of, 1042–1045 noncentral chi-squared nonparametric regression, 416–420, pseudo-random numbers, distribution, 501n8 812–813 generating, 574 nonconstructive, White test, 166 nonrandom regressors, 18 nonhomogeneous equation nonsample information, 364, 370 system, 962 nonsingular matrix, 435, 963 O noninformative prior, 604 nonspherical disturbances, 210–213 Oaxaca and Blinder's nonlinear consumption function, nonstationary data, 243–244, 739 decomposition, 55 294–296 cointegration, 756–767 Oberhofer-Kmenta conditions, nonlinear cost function, 114–117 common trends, 759–760 533n22 nonlinear functions, 1012 Dickey-Fuller tests, 745–754 observationally equivalent theories, asymptotic distribution, 1058 error correction, 760–761 361, 362–364 of parameters, 69, 70 KPSS test of stationarity, 755–756 observations nonlinear instrumental variable panel data, 767–768 dependent, 635 estimator, 462 processes, 739–756 insufficient, 121–122 nonlinear instrumental variables nonstationary process, 740 linear regression models, 15–16 estimation, 333–336 nonstochastic data missing, 61–63 nonlinearity assumptions of linear regression regression, 195 identification of, 114–117 models, 44 observed variables, 337 in variables, 112–120 instrumental variables occupational choice, 843 nonlinear least squares estimation, 315 odds ratio, 853 estimator, computing, 292–293 nonstochastic regressors, 17 Olsen's reparameterization, 875 estimators, 290–292 nonstructural models, 355n2 omission of relevant variables, 133 robust covariance matrix, 307 nonnested models, 82 omitted variables nonlinear models normal distribution, 11, 486, 864, bias, 868 fixed effects, 554–558 872n10, 991–992, 1013–1014 binary choice, 787 sample selection, 895–898 assumptions of linear regression formula, 134 nonlinear models, sample selection, models, 44 one parameter, function of, 895–898 censored, 869–871 1078–1079 nonlinear optimization, 228, Gibbs sampler, 614 one-period-ahead forecast, 686 446, 462 instrumental variables one-sided tests, mean, 1038 nonlinear panel data regression estimation, 315 one-step estimation, 1072 model, 308–310 likelihood functions, 1030 one-to-one function, 986 nonlinear regression model, limiting, 1055 optimal linear predictor, 45 285–293 mixtures of, 432 optimal weighting matrix, 438 applications, 294–297 for sample moments, 458 optimization, 982–983, 1065–1078 hypothesis testing, 300 normal equations, 27 algorithms, 1067–1068 maximum likelihood estimation normal-gamma prior, 608, 620 computation derivatives, (MLE), 537–538 normality, 18–19 1068–1069 multicollinearity in, 296 assumptions of, 100n9 constrained, 984–985 nonlinear restriction, 96–98, 120 asymptotic, 1058 with constraints, 1073–1074 nonlinear systems, 300–302, 358n6 least squares
estimator, 52–58 gradient methods, 1069–1071 nonnegative definite, 974, 976 sample selection, 891 maximum likelihood estimation normalization, 354, 359, 483 nonnested hypothesis, testing, 138 (MLE), 1072–1073 normal linear regression model, nonnested models, 137–142, 923 nonlinear, 446, 462 518–522 nonnormal disturbances, 92–96, normally distributed 538–542 disturbances, 18 order, 945 nonnormality, 880–881 of sequences, 1060–1061 order condition, 449, 457
ordered choice models, maximum likelihood estimation partially linear regression, 409–411 831–841, 842 (MLE), 547, 564–567 partially linear translog cost ordered probit model, 831–835 models, 180, 183–184 function, 410 ordinary least squares (OLS) nonstationary data, 243–244, partial regression, 27–29, 28, 29–31 asymptotic covariance 739n1, 767–768 partitioned inverse, 966 matrix, 516 ordinary least squares (OLS), 190 partitioned matrices, 964–967 consistency of, 151 pooled regression model, partitioned regression, 27–29 covariance matrices, 162–164 185–193 periodogram sample, 734 estimation, 159–160 probit models, 837–841 persistence, 794 estimator, 150 sample selection, 898–899 Phillips curve, 627, 679 groupwise heteroscedasticity, 173 sets, 1020 Phillips-Perron tests, 753 instrumental variables stated choice experiments, piecewise continuous, 112 estimation, 316 858–859 piecewise loglinear model, 116 panel data, 190 tobit model, 881–882 pivotal quantity, 1033 random effects, 207 within-groups estimator, 191–193 point estimation, 609–610, residuals, 174 parameters 1027–1032 simultaneous equations model, confidence intervals for, 54–55 Poisson distribution, 485, 906, 371–372 estimation of, 430–434, 471 998, 1031 orthogonality condition, 287–288, estimation of univariate time Poisson model 335, 442–443, 456 series, 728–731 censoring, 925–931 orthogonality population, 44–45 function of one, example, censoring and truncation, 925 orthogonal partitioned 1078–1079 estimation, 929 regression, 27 gamma distribution, 447 regression, 542, 907 orthogonal regression, 24 heterogeneity, 222–244, 238–243, policy analysis, 703–710 orthogonal vectors, 960 807–809 political methodology, 6 orthonormal quadratic forms, 1015 identifiability of, 286, 422 polynomials outcome of random processes, 987 identification of, 482–484 Hermite, 1062 outer product of gradients (OPG) incidental parameters problem, invertible, 676 estimator, 495 309, 557, 586–589, 797 lag operators, 675 output, cointegration in, 757, 762 instrumental variables pooled model, 266–267, 530–531 overdispersion, testing, 909–911 estimation, 327 pooled regression model, 183, overidentified case, 439, 459 K, 443n4 185–193 overidentified parameters, 120 nested logit models, 849 population overidentifying restrictions, 298, nonlinear functions of, 69, 70 averaged model, 185 388, 452 overidentified, 120 least squares regression, 21 point estimation of, 1027–1032 moment equation, 456 precision, 494 multivariate normal, 576 P random, 183, 226–229 quantity, 20 pair of event counts, joint modeling random parameters logit (RPL) regression, equation, 8 of, 405 model, 851 regression, models, 20, 44–45 panel data, 73 random parameters model, standard uniform, 575 analysis, 182–183 621–623 positive duration dependence, 935 applications, 267–272, 307–311 space, 82, 400, 421–422, 497 positive (negative) definite, 976 autocorrelation, 213, 652–655 unknown, 297 positive semidefinite, 976 balanced and unbalanced, 184 vectors, 121, 141n9 posterior density, 601–603 Bayesian estimation, 619–621 parametric duration models, posterior distributions, 613–616 binary choice, 796–809 933–939 postmultiplication, 947 consistent estimation of, 244–246 parametric estimation, 398, 400–405 power of tests, 585, 1035 count data, 915–920 parametric restrictions, 88n3, powers dynamic model, 238–243, 298–300 lower, convergence in, 1044 340–348, 469–479 partial adjustment model, 679 of a matrix, 972–974 estimation, 229–233 partial autocorrelation, 723–726 Prais and Winsten estimator, 
649 extensions, 184 partial correlation generalized regression coefficients, 29–31 model, 149 partial differences, 647 geometric regression, 593 partial effects, 182 group sizes, 235 partialing out, 28 instrumental variables partial likelihood estimator, 940 estimation, 336–349 partially linear model, 409
intervals, 99, 100 bivariate, 817–826 of estimators, 420–425 for investments, 99–101 bivariate, extensions, 896n36 exact, finite-sample, 63 probability, 834 bivariate, ordered, 835–836 finite sample, 44, 1027 probit models, 793 fixed effects, 837–838 finite sample, least squares residuals, 117 multinomial, 850–851 estimator, 58–63 variances, 99 multivariate, 826–831 finite sample, of b in generalized predictive tests, 122, 127–128 ordered, 831–835 regression models, 150 predictors panel data, 837–841 finite sample, of ordinary least minimum mean error, 45–46 prediction, 793 squares (OLS), 150 optimal linear, 45 random effects, 838–841 of generalized least squares of variations, 32 recursive bivariate, 823–826 estimator, 155 preliminary tests for problem of identification, 354, 356, of GMM estimators, 447–451 heteroscedasticity, 165n16 361–370 large sample, 290–292 premultiplication, 947 processes large sample properties, 63–75 preponderance of evidence, 4 autoregressive, 632 of maximum likelihood estimate pretest estimation, 134–135 autoregressive moving-average, (MLE), 486–496 pretest estimator, 135 716–718 of regular densities, 488–490 principal components, 61 data generating process (DGP), root-n consistent, 1051 principal minor, 959n4 436, 518n13, 743, 987 sampling, 596 principle of maximum likelihood, disturbances, 632–634 statistical, 43 484–486 general data generating, 72 proportional hazard, 940 prior beliefs, 602 integrated, 739–741 proportions, 844n55 prior distribution, 604 moving-average, 633 proxy variables, 328–331 prior odds ratio, 611 nonstationary, 739–756, 740 pseudo-differences, 647 prior probabilities, 611 outcome of random, 987 pseudoinverse, 975 probability stationary, 671 pseudo-maximum likelihood Bayesian estimation, 602 stationary stochastic, 716–731 estimation, 511–517, 666–667 convergence in, 1039–1042 stationary stochastic, pseudo-random numbers, distributions, 987–988, 991–998 autocorrelation, 721–723 generating, 574 distributions, representations of, time-series, 630 pseudoregressors, 290 1000–1001 trend stationary, 743 pseudotrue parameter vectors, finite moments, 435 production function, 90, 228 141n9 fundamental transform, 403 CES, 119, 285–286 Pythagorean theorem, 26 inverse probability weighted Cobb-Douglas, 273 estimator, 902 LAD estimation of a, 408 limits, 326, 430, 1046 product limit estimator, 939 Q linear probability model, 772 products QR method, 975n12 marginal density, 1002 Kronecker, 966–967 quadratic approximation, 979 maximum likelihood estimation transposition, 949 quadratic earnings equation, 113 (MLE), 565 profiles quadratic form, 323 models, 781 age-earnings, 114 algebra, 976–979 multivariate normal probabilities, time, 111 full rank, 1017–1018 582–583 profit maximization, 5 idempotent, 978, 1016 order probit models, 833 projection, 25, 269 independence of, 1018–1019 power of the test, 585 projects, 269n17 orthonormal, 1015 prediction, 834 proof, asymptotic normality, 152 standard normal vectors, prior probabilities, 611 propensity score, 893 1015–1017 simulating, 827n44 properties quadratic hill-climbing underlying probability asymptotic, 44, 64 method, 1070 model, 287 asymptotic, assumptions of, quadratic mean, 1–39 of variances, 161 421–424 quadratic term, 116 probability density function asymptotic, of estimators, quadrature (pdf), 1000 424–425 Gauss-Hermite, 552, 1065 probability distribution function asymptotic, of least squares, 151, Gaussian, 1065 (pdf), 988 640–642 Gauss-Laguerre, 1065 probability limits, 67 asymptotic, of method 
of maximum likelihood estimation probability model, 400 moments, 434–436 contagion, 1001 (MLE), 550–554 probit model, 773, 776 of dynamic models, 389–394 weight, 1065 for binary choice, 139 qualification indices, 129
qualitative choices, 771 discrete, 998 earnings equation, 54 qualitative response (QR) distributions, 998–1000 equations, 252–254 models, 770 expectations of, 989–991 generalized, 459 quantile regression estimator, 407 random vector, 1010 generalized linear regression quantitative approach to random walks, 741–744 model, 148 economics, 1 Dickey-Fuller tests, 745 generalized model, 148–149, 256 quasi differences, 647 with drift, 683, 739 heteroscedasticity, 459 quasi-Newton methods, 1071 rank intrinsic linear, 118 cointegration, 759 latent, 775–777 columns, 956 least squares, 164 R condition, 449 least squares estimator, 47 random coefficients model, full, 957 linearized, 288–290 223–225, 851 of matrices, 956–958, 969–971 linear least squares, 4 random constants, 857 and order conditions, 354–370 in a model of selection, 884–886 random disturbance, 9 rankings, 771 models. See regression models random draws, 577–580, 578 rank two modified zero-order, 62 random effects, 183, 200–210 correction, 1071 multiple, 24 count data, 918–920 update, 1071 multivariate regression estimation, 207 rating assignments, 834 model, 255 extensions, 213–222 ratios nonparametric, 416–420 feasible generalized least squares exponential, 612n15 nonparametric regression (FGLS), 203–205, 207 inverse Mills, 866 function, 812–813 generalized least squares, 202 J test, 139 observations, 195 geometric regression model, least variance, 376 orthogonal, 24 553–554, 592–593 likelihood, 264 partial, 29–31 Hausman and Taylor (HT) likelihood ratio (LR), statistic, partially linear, 409–411 formulation, 470 534, 854 Poisson model, censoring, Hausman’s specification test, likelihood ratio (LR), test, 925–931 208–209 498–500, 505, 786 pooled, 183 heteroscedasticity, 212–213 likelihood ratio (LR) test, 520 population, 20, 44–45 instrumental variables likelihood statistics, 453 quantile estimator, 407 estimation, 336–339 log-odds, 844 results for life expectancy, 125 likelihood estimation, 214n14 odds, 853 seemingly unrelated regressions maximum likelihood estimation prior odds, 611 (SUR), 254–263 (MLE), 547–550 sacrifice, 704 spline, 111–112 models, 268–272, 796, 797–800 t, 53 spurious, 741–744 nested, 214–217 real data, 573 standard error of the, 51 panel data applications, recursive bivariate probit models, testing, 56–57 310–311 823–826 Two-Variable Regression probit models, 838–841 recursive models, 359 model, 48 SUR models, 267–268 reduced form variables, adding, 31 testing, 205–208 disturbances, 360 vector autoregression, 479 randomness of human equation, 329 regression models behavior, 4 of models, 355, 360 analysis of, 603–609 random number generation, regression, 15–16 censored, 871–874 573–576 analysis of treatment effects, dynamic, 670 random parameters logit (RPL) 889–890 dynamic, lagged variables, model, 851 approach, binary choice models, 671–677 random parameters model, 183, 772–775 gammas, 72 226–229, 621–623 binary variables, 106–112 heteroscedasticity, applications random sampling, 429, 430, 1020 bivariate, 23 of, 170–175 random terms, 234 classical, 458 least squares regression, 20–26 random utility models, 777 coefficients, 28 linear, 8–18 random variables, 987–988 conditional mean, 1006 linearity of, 12–19 bivariate, distribution, 1004–1006 with a constant term, 29 loglinear, 112 censored, 871 cross-country growth, 145 multiple linear, 8 continuous, 987 demand equations, 55, 69 nonlinear, 285–293 convergence to, 1046–1047 disturbances, 158–164 Poisson, 907 density of truncated, 864 double-length, 664 pooled, 
185–193
seemingly unrelated, 464–466 linear, 82, 355 sample midrange, 1020 semiparametric, 52 maximum likelihood estimation sample minimum, distributions, transformed, 168 (MLE), 532–536 1025–1027 truncation, 867–869 and nested models, 81–83 sample moments, 458 regressors, 8 nonlinear, 120 sample periodogram, 734 data generating processes for, nonlinear, cost function, 115n4 sample properties, 290–292 17–18 nonlinear, testing, 96–98 sample selection, 62, 863, fixed time effects, 198n10 of null hypothesis, 140 882–902 identical, 257 overidentifying, 298, 452 common effects, 899–902 nonrandom, 18 overidentifying, testing, 388 estimation, 886–889 nonstochastic, 17 parametric, 298–300 nonlinear models, 895–898 subset of, 304 testing, 85 normality assumption, 891 transformation, 296n4 validity of moment, 452–455 panel data, 898–899 regular densities, 488–490 results sample variance, 1060 regularity conditions, 487–488 characteristic roots and vectors, sampling, 613n16 rejection 968–969 asymptotic covariance method, 576 frequency domain, 732–734 samples, 1057 region, 1034 large sample, 613 choice based, 793–794, 852 relational lag model, 683 vector autoregression (VAR), distributions, 44, 1023–1027 relationships 707–710 Gibbs sampler, 613–616 cointegration, 764 revealed preference data, 858 importance, 579–582 consumption to income, 2–3 revised beliefs, 602 large sample properties, 63–75 demand, 9n1 risk set, 940 large sample results, 613 deterministic, 3, 9 robust asymptotic covariance large sample tests, 92–96 between earnings and education, matrices, 511–517 least squares estimator, 47 10–11 robust covariance mean, 1025 equilibrium, 689 estimation, 210 method of moments, 429 moments, 1007–1009 matrice, 915–916 from multivariate normal random disturbance, 9 matrices, 185–188, 307, 780 population, 576 relevance, 316 robust estimation normal, 1058 relevant variables, omission of, 133 of asymptotic covariance properties, 596 residual, 20 matrices, 153–154 random, 430, 1020 residuals, 116n5 heteroscedasticity, 159 from standard uniform consistent estimators, 163 using group means, 188–190 population, 575 constrained generalized robust estimators, 166 theory estimator, 610 regression model, 156 robustness to unknown unbalanced samples, 792 generalized, 778n8, 934 heteroscedasticity, 169 variances, 48, 1027 least squares, 541n25, 650 root mean squared error, 101 sandwich estimators, 514–515 against load factor, 171 root-n consistent properties, 1051 SAS, 814n38 maximum likelihood estimation roots scalars (MLE), 529 characteristics of, 967–976 matrix, 945 negative autocorrelation, 629 dominant, 391 multiplication, 947–950, 951 ordinary least squares (OLS), 174 latent, 969 scalar-valued function, 980 panel data, 186 rotating panel, 180, 184 scale factor coefficients, 100n9 of predicted cost, 117 rows scaling, 88n3 for production function, 408 notation, 947 scatter diagram, 1021 sum of squares, 680 vectors, 945 scedastic function, 1007 unstandardized, 628 R-squared, 35–38, 156 score, 502 variance in a regression, 1008 rth mean, convergence in, method of, 524, 779 vectors, 25 1044, 1045 propensity, 893 response, treatment of, 107 rules testing, 503 restricted investment equation, 86 for limiting distribution, 1049 seasonal factors, correcting restricted least squares for probability limits, 1046 for, 107 estimator, 87–89 second derivatives matrix, 981 loss of fit from, 89–92 second-order effects, 13 restrictions S seed, 574 exclusion, 90 sacrifice ratio, 704 seemingly unrelated regression exclusions, 370 sample 
information, 400 (SUR) model, 254–263, homogeneity, 262 sample mean, 1041 464–466, 529–530, 531
selection simulated annealing, 1066 spatial autoregression models, 142–143, 143–146 simulated data, 573 coefficient, 218 regression in a model of, 884–886 simulated log likelihood, 227 spatial error correction, 221 sample, 863, 882–902. See also simulation-based estimation, spatial lags, 221 sample selection 226–229, 399, 573, 589–596 special decomposition, 974 sample, estimation, 886–889 bootstrapping, 596–598 specification self-regressive models, 716 continuous distributions, 575 analysis, 133–137, 692–693 self-selection bias, 62n7 Monte Carlo integration, 576–583 binary choice, 787–790 semilog Monte Carlo studies, 584–589 errors, 885 equation, 113 pseudo-random numbers, Hausman’s specification test, models, 13 generating, 574 208–209 semiparametric, 287 random number generation, Hausman test, 339 analysis, 809–813 573–576 instrumental variables duration, 939–942 simulation results, 71 estimation, 321–325 estimation, 398, 405–413, simultaneous equations model maximum likelihood estimation 411–412, 810–812 bias, 355n3 (MLE), 498–507 models of heterogeneity, 806 causality, 357–358 search for lag length, 676–677 regression models, 52 endogeneity, 357–358 tests, 167, 387–388. See also sequences, 1060–1061 fundamental issues in, 354–261 testing Halton, 577–580 with heteroscedasticity, 466–469 tests, for SUR models, 264–266 Martingale, 638 limited information estimation validity of moment restrictions, Martingale difference, 638 methods, 371–380 testing, 452–455 serial correlation, 626–629 maximum likelihood estimation specificity, 585 asymptotic results, 635–640 (MLE), 536–537 spectral analysis, 735 autocorrelation, testing for methods of estimation, 370–371 spectral decomposition of 644-647 ordinary least squares (OLS), matrices, 969 autoregressive conditional 371–372 spectral density function, 732 heteroscedasticity, 658–667 problem of identification, spectrum, 732 central limit theorem, 638–640 361–370 spherical disturbances, 16–17, common factors, 655–656 specification tests, 387 17n4, 149 disturbance processes, 632–634 system methods of estimation, spherical normal distribution, 1013 efficient estimation, 647–648 380–384 spline regression, 111–112 Ergodic theorem, 636–638 two-stage least squares, 373–375 spurious regressions, 741–744 forecasting, 656–658 single-equation squares generalized method of moments linear models, 455–461 matrices, 945 (GMM), 643–644 nonlinear models, 461–464 summable, 726 heteroscedasticity and, 149 pooled estimators, 829n47 sum of, 287–288, 292 least squares estimation, 640–643 singularity of the disturbance stability, 389, 390–391, 684–686 panel data, 652–655 covariance matrix, 278 stabilizing transformations, 1050 time-series data analysis, 629–632 singular systems, 272–280 stacking equations, 301n10 series singular value decomposition, 975 standard definition of x, 990 fractionally integrated, 740n2 size standard deviation, 1020 Martingale difference, 449 distributions, 996 standard errors, 1027 time. 
See time series sets of linear functions, 1011–1012 several categories of binary simple model building variables, 107–108 strategies, 136 several groupings of dummy simple-to-general approach, variables, 108–110 136, 676
stationarity structural equations, 355 classical testing procedures, assumption, 632 structural models, 328, 355n2 1034–1037 KPSS test of, 755–756 structural VARs, 702–703 cointegration, 761–763 time-series models, 718–721 structure and reduced form, 360 common factors, 655–656 stationary assumption, 631 subequations, 367 confidence intervals, 1037–1038 stationary process, 671 submatrices, 964. See also matrices consistency, 1036 stationary stochastic processes, subsets Cox test, 142 716–731, 721–723 of coefficients, changes in, cross-function correlation, 265 statistical properties, 43 122–123 Dickey-Fuller tests, 745–754 statistics of regressors, 304 Durbin-Watson test, 116n6, behavior of test, 585 subspace, 955 645–646 descriptive, 1020–1023 sufficient condition for an F tests for firm and year as estimators, 1023–1027 optimum, 982 effects, 109 F, 83 sufficient statistics, 433 for GARCH effects, 664–666 I, 219 summability of group effects, 197 Lagrange multiplier, autocovariances, 639 Hausman's specification test, 299–300, 534 summary statistics, 1020 208–209 likelihood ratio, 453, 534, 854 sum of squares, 287–288, 444n6 Hausman test, 339 Ljung-Box, 729 changes in, 31 heteroscedasticity, 165–167 LM, 454 generalized, 522 hypothesis, 83–92, 259–263, 295, marginal distributions of test, minimizing, 292 298–300 57–58 residuals, 680 hypothesis, about a coefficient, 52 minimal sufficient, 803 superconsistent, 683, 762 hypothesis, estimation, 425 properties of estimators, 420–421 supply equation, 354 hypothesis, inference, 1034–1038 spatial autocorrelation, 221 surveys J test, 139–140 sufficient, 433 British Household Panel Survey KPSS test of stationarity, summary, 1020 (BHPS), 841 755–756 tables, 1093–1098 estimation of commodity lagged dependent variable, 646 demands, 272n19 Lagrange multiplier, 96, 166, 205, growth, 181n2 299–300, 505 nonnested models, 137 Voung, 930 survival distributions, 935 large sample tests, 92–96 Wald, 94, 262, 453, 520 likelihood-based tests, steepest ascent method, 1069 application of, 504–506 step sizes, 1075 likelihood ratio, 498–500, 505, 786 stepwise model building, 137 linear restrictions, 85 Stirling's approximation, 1064 LM statistic, 454 stochastic data marginal distributions of assumptions of linear regression statistics, 57–58 models, 44 maximum likelihood estimation instrumental variables (MLE), 498–507 estimation, 315 for model instability, 766–767 stochastic frontier model, 402, most powerful test, 1035 538–542, 572 nonlinear restrictions, 96–98 stochastic regressors, 49–50 nonnested hypothesis, 138 stochastic volatility, 658 one-sided tests, 1038 Stone's expenditure system, 272 overdispersion, 909–911 Stone's loglinear expenditure overidentifying restrictions, 388 system, 273n20 Phillips-Perron tests, 753 straightforward two-step GMM power of tests, 1035 estimator, 814 predictive, 122, 127–128 stratification, 188 random effects, 205–208 streams as instruments, 320 regression, 56–57 strong exogeneity, 699 significance, 298 strongly exogenous models, 358 size of tests, 1035
strong stationarity, 636 144–146 spatial autocorrelation, 221 structural breaks, 123–127 behavior of test statistics, 585 specification, 167, 321–325, structural change, 106, 120–128 binary choice, 785–787 387–388 structural coefficients, 369 Breusch-Pagan, 166, 167 structural breaks, 123–127 structural disturbances, 358 Chow test, 120n8 structural change, 120–128
testing (continued) data, 6 truncated variances, 865 SUR models, 264–266 data analysis, 629–632 truncated random variable, 864 unbiased tests, 1036 frequency domain, 731–738 truncation, 863–869 unit roots, 744–745 identification, 712 count data, 924–931 validity of moment restrictions, invertibility, 718–721 distributions, 863–864 452–455 panel data, 181n1 estimation, 874–875 variable addition, 210 partial autocorrelation, 723–726 incidental, 882 vector autoregression (VAR), processes, 630 incidental, bivariate distribution, 696–698 stationarity, 718–721 883–884 Vuong's test, 140 stationary stochastic processes, moments of truncated Wald test, 124, 500–502, 505 716–731 distributions, 864–867 weak instruments, 377–380 univariate time series modeling, Poisson model, 925 White test, 165, 167 726–728 regression models, 867–869 for zero correlation, 820 time-space truth, 143 test of economic literacy dynamic model, 220 two-stage least squares estimation, (TUCE), 560 simultaneous model, 219 318–321, 356, 373–375 tetrachoric correlation, 819 time-varying covariates, 933 two-step estimation, 229–233, 886 Theil U statistic, 101 time window, 630 two-step estimators, 169 theorems tobit model, 871–874 two-step maximum likelihood Aitken's, 155 censoring, 925–931 approach, 849 Bayes, 565, 601–603 multiplicative estimation, 507–511 central limit, 638–640, 1050–1055 heteroscedasticity, 876 two-step nonlinear least squares Ergodic, 448, 636–638 panel data, 881–882 estimation, 302–307 Gordin's central limit, 640 tools for diagnosing collinearity, Two-Variable Regression model, 48 Khinchine, 430 60n6 type I error, 1034 Kruskal's, 176 total variation, 32 type II error, 1034 Liapounov central limit, 1054 trace of a matrix, 971–972 Lindeberg-Feller central transformations, 986–987 limit, 152 Box-Cox, 296–297 U Lindeberg-Levy Central matrix, 473 unbalanced panel data, 184 Limit, 435 regressors, 296n4 unbalanced samples, 792 Lindeberg-Levy central stabilizing, 1050 unbiased estimation, 46–47 limit, 1051 transformed regression model, 168 unbiased estimators, 1028 Martingale difference central transformed regressors, 157 unbiasedness properties, 420 limit, 638 transforms, discrete Fourier, 737 unbiased tests, 1036 mean value, 450 translog cost function, 104, 275–280 uncentered moment, 430 multivariate Lindeberg-Feller translog demand system, 286 uncorrelated linear regression central limit, 1055 translog models, 13–14 models, 304 multivariate Lindeberg-Levy transportation engineering, 6 underlying probability model, 287 central limit, 1054 transposition unequal variances, 123–127 Sklar's, 403 matrices, 946 unification, 1 Slutsky, 445, 1045 products, 949 uniform-inverse gamma prior, 619 three-level random parameters, 234 t ratio, 53 uniformly most powerful (UMP), three-stage least squares (3SLS), treatment effects, 889 1036 381–383, 537 analysis of, 11n2, 889–890 uniform prior, 619 threshold effects, 110–111 estimation, 891–895 uniqueness, 449 time treatment of response, 107 unit roots, 181n2, 739–756, 743 accelerated failure time trends, 741–744 augmented Dickey-Fuller models, 937 linear time, 16n3 test, 754 events per unit of, 912 stationary process, 743 KPSS test of stationarity, 755 fixed and group effects, 197–200 triangular matrix, 945 testing, 744–745, 748–751 linear trends, 16n3 triangular systems, 359 univariate autoregression, 685 until failure, 932 trigamma function, 436n2, 1064 univariate time series, 716 time-invariant, 187, 472 true size of the test, 585 estimation of parameters, time profile, 111
truncated mean, 865 728–731 time series, 715, 1020 truncated normal distribution, 576 modeling, 726–728 analysis, 6, 243 truncated standard normal unknown heteroscedasticity, autoregressive moving-average distribution, 864 robustness to, 169 processes, 716–718 truncated uniform distribution, 865 unknown parameter, 297
unmeasured heterogeneity, 560–561 omission of relevant, 133 vector moving average (VMA) unobserved effects model, 209–210 omitted, 787 model, 704 unordered choice models, 842 omitted, bias, 868 vectors, 945 unordered multiple choice models, predetermined, 356, 358 autoregression, 479 841–859 pretest estimators, 135 basis, 953–954 unstandardized residuals, 628 proxy, 328–331 characteristics of, 967–976 upper triangular matrix, 974 random, 987–988 coefficients, least squares utility maximization, 5 random, censored, 871 regression, 21–22 random, continuous, 987 discrepancy, 84 random, convergence to, latent, 969 V 1046–1047 least square residuals, 25 validity of moment restrictions, random, density of truncated, 864 length of, 960 452–455 random, discrete, 998 linear combination of, 953–954 values random, distributions, 998–1000 linear functions of normal, 1015 Cholesky, 974 random, expectations of, 989–991 multiple cointegration, 759 heteroscedastic extreme value regression, adding, 31 multiplication, 947 (HEV) model, 856 time-invariant, 472 orthogonal, 960 mean value theorem, 450 Two-Variable Regression parameters, 121, 141n9 singular value model, 48 random, 1010 decomposition, 975 variances spaces, 951–954 starting, 294 analysis of variance (ANOVA), spanning, 954 trigamma function, 436n2 32–39, 192 standard normal, quadratic variables, 3 bound for Poisson forms, 1015–1017 addition, testing, 210 distribution, 1031 volatility asymptotic properties, 72 conditional, 1007 GARCH model, 665 binary, 106–112 decomposition of, 1008 high-frequency time-series binary, marginal effects, 835 estimation, 51–52, 642–643 data, 148 bivariate regression, 23 feasible generalized least squares stochastic, 658 categorical, 110–111 (FGLS), 203–205 Vuong statistic, 930 censoring, 869 inflation factor, 60 Vuong's test, 140 constants, 16n3 least ratio, 376 dependent, 8, 979 least squares estimator, 48–49 dummy, 106, 122 limiting, 1049 W endogenous right-hand-side, linear regression models, 17 , estimation when unknown, 813–817 maximum likelihood estimation 648–652 endogenous, 355 (MLE), 494–496 wage equation, 186 estimators, comparing, 187 mean squared error of fixed effects, 199–200 exogenous, duration, 937–938 sample, 1029 random effects, 206 explained, 8 of the median, bootstrapping, 597 Wald statistic, 94, 262, 453, 520 explanatory, 8 minimum linear unbiased instrumental variables Grunfeld model, 255 estimation, 46–47 estimation, 322 identical explanatory, 257 prediction, 99 significance tests for, 298 inclusion of irrelevant, 136 probability of, 161 Wald test, 124, 500–502, 505 independent, 8, 979 of random variables, 989 weak instruments independent, exogeneity of, 11 sampling, 48, 1027 instrumental variables instrumental, 245, 461 sampling, asymptotic estimation, 350–352 instrumental, empirical moment moments, 1060 testing, 377–380 equation, 447 slopes, 194 weakly exogenous, 358 instrumental, estimation, truncated, 865 weakly stationary, 630 314–315, 893 unequal, 123–127 weak stationarity, 636 jointly dependent, 355 vector autoregression (VAR), 685, Weibull model, 937 lagged.
See lagged variables 693–712 Weibull survival model, 939 least squares dummy variable cointegration, 756n12 weighted endogenous sampling (LSDV), 195 empirical results, 707–710 maximum likelihood measurement of, 4 error correction and, 760–761 (WESML), 793 metric algorithm, 1071 instability of, 712 weighted least squares, nonlinear instrumental variable in microeconomics, 711–712 167–169, 444 estimator, 462 policy analysis, 703–710 weighting matrix, 438, 439, 444n7, nonlinearity in, 112–120 structural, 702–703 459, 478
weights With Zeros (WZ) model, 922n7 Z lags, computation of, 683–684 World Health Organization Zellner’s efficient estimator, 258n11 quadrature, 1065 (WHO), 124 zero-altered Poisson model, recomputing, 173 Wu test, 321–325 920–924 well-behaved data, 497 zero-inflated Poisson (ZIP) White estimator panel data, 186 X model, 930 White heteroscedasticity consistent zero-order method, 62 estimator, 162–164 x, standard definition of, 990 zeros white noise, 632, 718 attenuation, 326 White test, 165, 167 Y correlation, testing for, 820 Wishart distribution, 997–998 matrices, 946 within-groups estimator, 192 Yule-Walker equation, 641, 722 mean of the disturbance, 286