Greene-50558 book June 21, 2007 15:50
18 BAYESIAN ESTIMATION AND INFERENCE
18.1 INTRODUCTION
The preceding chapters (and those that follow this one) are focused primarily on parametric specifications and classical estimation methods. These elements of the econometric method present a bit of a methodological dilemma for the researcher. They appear to straightjacket the analyst into a fixed and immutable specification of the model. But in any analysis, there is uncertainty as to the magnitudes, sometimes the signs, and, at the extreme, even the meaning of parameters. It is rare that the presentation of a set of empirical results has not been preceded by at least some exploratory analysis. Proponents of the Bayesian methodology argue that the process of “estimation” is not one of deducing the values of fixed parameters, but rather, in accordance with the scientific method, one of continually updating and sharpening our subjective beliefs about the state of the world. Of course, this adherence to a subjective approach to model building is not necessarily a virtue. If one holds that “models” and “parameters” represent objective truths that the analyst seeks to discover, then the subjectivity of Bayesian methods may be less than perfectly comfortable.

Contemporary applications of Bayesian methods typically advance little of this theological debate. The modern practice of Bayesian econometrics is much more pragmatic. As we will see below in several examples, Bayesian methods have produced some remarkably efficient solutions to difficult estimation problems. Researchers often choose the techniques on practical grounds, rather than in adherence to their philosophical basis; indeed, for some, the Bayesian estimator is merely an algorithm.1 Bayesian methods have been employed by econometricians since well before Zellner’s classic (1971) presentation of the methodology to economists, but until fairly recently, were more or less at the margin of the field.
With recent advances in technique (notably the Gibbs sampler) and the advance of computer software and hardware that has made simulation-based estimation routine, Bayesian methods that rely heavily on both have become widespread throughout the social sciences. There are libraries of work on Bayesian econometrics and a rapidly expanding applied
1. For example, from the home website of MLWin, a widely used program for multilevel (random parameters) modeling, http://www.cmm.bris.ac.uk/MLwiN/features/mcmc.shtml, we find “Markov Chain Monte Carlo (MCMC) methods allow Bayesian models to be fitted, where prior distributions for the model parameters are specified. By default MLwiN sets diffuse priors which can be used to approximate maximum likelihood estimation.” Train (2001) is an interesting application that compares Bayesian and classical estimators of a random parameters model.
CHAPTER 18 ✦ Bayesian Estimation and Inference 601
literature.2 This chapter will introduce the vocabulary and techniques of Bayesian econometrics. Section 18.2 lays out the essential foundation for the method. The canonical application, the linear regression model, is developed in Section 18.3. Section 18.4 continues the methodological development. The fundamental tool of contemporary Bayesian econometrics, the Gibbs sampler, is presented in Section 18.5. Three applications and several more limited examples are presented in Sections 18.6, 18.7, and 18.8. Section 18.6 shows how to use the Gibbs sampler to estimate the parameters of a probit model without maximizing the likelihood function. This application also introduces the technique of data augmentation. Bayesian counterparts to the panel data random and fixed effects models are presented in Section 18.7. A hierarchical Bayesian treatment of the random parameters model is presented in Section 18.8 with a comparison to the classical treatment of the same model. Some conclusions are drawn in Section 18.9. The presentation here is nontechnical. A much more extensive entry-level presentation is given by Lancaster (2004). Intermediate-level presentations appear in Cameron and Trivedi (2005, Chapter 13) and Koop (2003). A more challenging treatment is offered in Geweke (2005). The other sources listed in footnote 2 are oriented to applications.
18.2 BAYES THEOREM AND THE POSTERIOR DENSITY
The centerpiece of the Bayesian methodology is the Bayes theorem: for events A and B, the conditional probability of event A given that B has occurred is

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}. \tag{18-1}$$

Paraphrased for our applications here, we would write

$$P(\text{parameters} \mid \text{data}) = \frac{P(\text{data} \mid \text{parameters})\,P(\text{parameters})}{P(\text{data})}.$$

In this setting, the data are viewed as constants whose distributions do not involve the parameters of interest. For the purpose of the study, we treat the data as only a fixed set of additional information to be used in updating our beliefs about the parameters. Note the similarity to (14-1). Thus, we write

$$P(\text{parameters} \mid \text{data}) \propto P(\text{data} \mid \text{parameters})\,P(\text{parameters}) \tag{18-2}$$
$$= \text{Likelihood function} \times \text{Prior density}.$$

The symbol ∝ means “is proportional to.” In the preceding equation, we have dropped the marginal density of the data, so what remains is not a proper density until it is scaled by what will be an inessential proportionality constant. The first term on the right is the joint distribution of the observed random variables y, given the parameters. As we
2. Recent additions to the dozens of books on the subject include Gelman et al. (2004), Geweke (2005), Gill (2002), Koop (2003), Lancaster (2004), Congdon (2005), and Rossi et al. (2005). Readers with an historical bent will find Zellner (1971) and Leamer (1978) worthwhile reading. There are also many methodological surveys. Poirier and Tobias (2006) as well as Poirier (1988, 1995) sharply focus the nature of the methodological distinctions between the classical (frequentist) and Bayesian approaches.
602 PART IV ✦ Estimation Methodology
shall analyze it here, this distribution is the normal distribution we have used in our previous analysis—see (14-1). The second term is the prior beliefs of the analyst. The left-hand side is the posterior density of the parameters, given the current body of data, or our revised beliefs about the distribution of the parameters after “seeing” the data. The posterior is a mixture of the prior information and the “current information,” that is, the data. Once obtained, this posterior density is available to be the prior density function when the next body of data or other usable information becomes available. The principle involved, which appears nowhere in the classical analysis, is one of continual accretion of knowledge about the parameters.

Traditional Bayesian estimation is heavily parameterized. The prior density and the likelihood function are crucial elements of the analysis, and both must be fully specified for estimation to proceed. The Bayesian “estimator” is the mean of the posterior density of the parameters, a quantity that is usually obtained either by integration (when closed forms exist), approximation of integrals by numerical techniques, or by Monte Carlo methods, which are discussed in Section 17.3.
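The proportionality in (18-2) is easy to verify numerically. The following sketch updates a discrete two-point prior by the likelihood and renormalizes; the parameter values and probabilities are hypothetical, chosen only to illustrate the mechanics:

```python
# Bayes theorem with a discrete parameter: posterior ∝ likelihood × prior.
# Hypothetical two-point prior over theta; one "success" is observed.
priors = {"theta=0.3": 0.5, "theta=0.7": 0.5}       # prior beliefs
likelihoods = {"theta=0.3": 0.3, "theta=0.7": 0.7}  # P(data | theta)

# posterior ∝ likelihood × prior; dividing by P(data) restores a proper density
unnormalized = {k: likelihoods[k] * priors[k] for k in priors}
p_data = sum(unnormalized.values())                 # marginal likelihood P(data)
posterior = {k: v / p_data for k, v in unnormalized.items()}

print(posterior)
```

With an equal-probability prior, the posterior is simply the normalized likelihood, which is the sense in which a flat prior is "noninformative" here.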
Example 18.1 Bayesian Estimation of a Probability

Consider estimation of the probability that a production process will produce a defective product. In case 1, suppose the sampling design is to choose N = 25 items from the production line and count the number of defectives. If the probability that any item is defective is a constant θ between zero and one, then the likelihood for the sample of data is

$$L(\theta \mid \text{data}) = \theta^{D}(1-\theta)^{25-D},$$

where D is the number of defectives, say, 8. The maximum likelihood estimator of θ will be p = D/25 = 0.32, and the asymptotic variance of the maximum likelihood estimator is estimated by p(1 − p)/25 = 0.008704.

Now, consider a Bayesian approach to the same analysis. The posterior density is obtained by the following reasoning:

$$p(\theta \mid \text{data}) = \frac{p(\theta, \text{data})}{p(\text{data})} = \frac{p(\theta, \text{data})}{\int_{\theta} p(\theta, \text{data})\,d\theta} = \frac{p(\text{data} \mid \theta)\,p(\theta)}{p(\text{data})} = \frac{\text{Likelihood}(\text{data} \mid \theta) \times p(\theta)}{p(\text{data})},$$

where p(θ) is the prior density assumed for θ. [We have taken some license with the terminology, since the likelihood function is conventionally defined as L(θ | data).] Inserting the results of the sample first drawn, we have the posterior density:

$$p(\theta \mid \text{data}) = \frac{\theta^{D}(1-\theta)^{N-D}\,p(\theta)}{\int_{\theta} \theta^{D}(1-\theta)^{N-D}\,p(\theta)\,d\theta}.$$

What follows depends on the assumed prior for θ. Suppose we begin with a “noninformative” prior that treats all allowable values of θ as equally likely. This would imply a uniform distribution over (0, 1). Thus, p(θ) = 1, 0 ≤ θ ≤ 1. The denominator with this assumption is a beta integral (see Section E2.3) with parameters a = D + 1 and b = N − D + 1, so the posterior density is

$$p(\theta \mid \text{data}) = \frac{\theta^{D}(1-\theta)^{N-D}}{\dfrac{\Gamma(D+1)\,\Gamma(N-D+1)}{\Gamma(D+1+N-D+1)}} = \frac{\Gamma(N+2)\,\theta^{D}(1-\theta)^{N-D}}{\Gamma(D+1)\,\Gamma(N-D+1)}.$$

This is the density of a random variable with a beta distribution with parameters (α, β) = (D + 1, N − D + 1). (See Section B.4.6.) The mean of this random variable is (D + 1)/(N + 2) = 9/27 = 0.3333 (as opposed to 0.32, the MLE).
The posterior variance is $[(D+1)(N-D+1)]/[(N+3)(N+2)^2] = 0.007936$.
There is a loose end in this example. If the uniform prior were noninformative, that would mean that the only information we had was in the likelihood function. Why didn’t the Bayesian estimator and the MLE coincide? The reason is that the uniform prior over [0,1] is not really noninformative. It did introduce the information that θ must fall in the unit interval. The prior mean is 0.5 and the prior variance is 1/12. The posterior mean is an average of the MLE and the prior mean. Another less than obvious aspect of this result is the smaller variance of the Bayesian estimator. The principle that lies behind this (aside from the fact that the prior did in fact introduce some certainty in the estimator) is that the Bayesian estimator is conditioned on the specific sample data. The theory behind the classical MLE implies that it averages over the entire population that generates the data. This will always introduce a greater degree of “uncertainty” in the classical estimator compared to its Bayesian counterpart.
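The computations in Example 18.1 can be reproduced directly; this sketch uses only the values given in the example:

```python
# Beta-binomial posterior from Example 18.1: uniform prior over theta,
# D = 8 defectives observed in N = 25 trials.
N, D = 25, 8

# Classical results
p_mle = D / N                      # maximum likelihood estimate of theta
var_mle = p_mle * (1 - p_mle) / N  # estimated asymptotic variance

# Under the uniform prior the posterior is Beta(a, b), a = D + 1, b = N - D + 1
a, b = D + 1, N - D + 1
post_mean = a / (a + b)                          # (D+1)/(N+2)
post_var = a * b / ((a + b) ** 2 * (a + b + 1))  # beta distribution variance

print(p_mle, var_mle, post_mean, post_var)
```

The output reproduces the figures in the text: 0.32 and 0.008704 for the MLE, 0.3333 and 0.007936 for the posterior mean and variance.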
18.3 BAYESIAN ANALYSIS OF THE CLASSICAL REGRESSION MODEL
The complexity of the algebra involved in Bayesian analysis is often extremely burdensome. For the linear regression model, however, many fairly straightforward results have been obtained. To provide some of the flavor of the techniques, we present the full derivation only for some simple cases. In the interest of brevity, and to avoid the burden of excessive algebra, we refer the reader to one of the several sources that present the full derivation of the more complex cases.3

The classical normal regression model we have analyzed thus far is constructed around the conditional multivariate normal distribution N[Xβ, σ²I]. The interpretation is different here. In the sampling theory setting, this distribution embodies the information about the observed sample data given the assumed distribution and the fixed, albeit unknown, parameters of the model. In the Bayesian setting, this function summarizes the information that a particular realization of the data provides about the assumed distribution of the model parameters. To underscore that idea, we rename this joint density the likelihood for β and σ² given the data, so
$$L(\beta, \sigma^2 \mid y, X) = [2\pi\sigma^2]^{-n/2}\, e^{-(1/(2\sigma^2))(y - X\beta)'(y - X\beta)}. \tag{18-3}$$

For purposes of the results below, some reformulation is useful. Let d = n − K (the degrees of freedom parameter), and substitute

$$y - X\beta = y - Xb - X(\beta - b) = e - X(\beta - b)$$

in the exponent. Expanding this produces

$$-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta) = -\frac{1}{2}\left(\frac{ds^2}{\sigma^2}\right) - \frac{1}{2}(\beta - b)'\left[\frac{1}{\sigma^2}X'X\right](\beta - b).$$

After a bit of manipulation (note that n/2 = d/2 + K/2), the likelihood may be written

$$L(\beta, \sigma^2 \mid y, X) = [2\pi]^{-d/2}[\sigma^2]^{-d/2} e^{-(d/2)(s^2/\sigma^2)}\,[2\pi]^{-K/2}[\sigma^2]^{-K/2} e^{-(1/2)(\beta - b)'[\sigma^2(X'X)^{-1}]^{-1}(\beta - b)}.$$
3. These sources include Judge et al. (1982, 1985), Maddala (1977a), Mittelhammer et al. (2000), and the canonical reference for econometricians, Zellner (1971). A remarkable feature of the current literature is the degree to which the analytical components have become ever simpler while the applications have become progressively more complex. This will become evident in Sections 18.5–18.7.
This density embodies all that we have to learn about the parameters from the observed data. Because the data are taken to be constants in the joint density, we may multiply this joint density by the (very carefully chosen), inessential (because it does not involve β or σ²) constant function of the observations,

$$A = \frac{(ds^2/2)^{(d/2)+1}}{\Gamma((d/2)+1)}\,[2\pi]^{d/2}\,|X'X|^{-1/2}.$$

For convenience, let v = d/2. Then, multiplying L(β, σ² | y, X) by A gives

$$L(\beta, \sigma^2 \mid y, X) \propto \frac{[vs^2]^{v+1}}{\Gamma(v+1)}\left(\frac{1}{\sigma^2}\right)^{v} e^{-vs^2(1/\sigma^2)}\,[2\pi]^{-K/2}\,|\sigma^2(X'X)^{-1}|^{-1/2}\, e^{-(1/2)(\beta - b)'[\sigma^2(X'X)^{-1}]^{-1}(\beta - b)}. \tag{18-4}$$
The likelihood function is proportional to the product of a gamma density for z = 1/σ² with parameters λ = vs² and P = v + 1 [see (B-39); this is an inverted gamma distribution] and a K-variate normal density for β | σ² with mean vector b and covariance matrix σ²(X'X)⁻¹. The reason will be clear shortly.
18.3.1 ANALYSIS WITH A NONINFORMATIVE PRIOR

The departure point for the Bayesian analysis of the model is the specification of a prior distribution. This distribution gives the analyst’s prior beliefs about the parameters of the model. One of two approaches is generally taken. If no prior information is known about the parameters, then we can specify a noninformative prior that reflects that. We do this by specifying a “flat” prior for the parameter in question:4
g(parameter) ∝ constant.
There are different ways that one might characterize the lack of prior information. The implication of a flat prior is that within the range of valid values for the parameter, all intervals of equal length—hence, in principle, all values—are equally likely. The second possibility, an informative prior, is treated in the next section.

The posterior density is the result of combining the likelihood function with the prior density. Because it pools the full set of information available to the analyst, once the data have been drawn, the posterior density would be interpreted the same way the prior density was before the data were obtained. To begin, we analyze the case in which σ² is assumed to be known. This assumption is obviously unrealistic, and we do so only to establish a point of departure. Using Bayes Theorem, we construct the posterior density,

$$f(\beta \mid y, X, \sigma^2) = \frac{L(\beta \mid \sigma^2, y, X)\,g(\beta \mid \sigma^2)}{f(y)} \propto L(\beta \mid \sigma^2, y, X)\,g(\beta \mid \sigma^2),$$
4. That this “improper” density might not integrate to one is only a minor difficulty. Any constant of integration would ultimately drop out of the final result. See Zellner (1971, pp. 41–53) for a discussion of noninformative priors.
assuming that the distribution of X does not depend on β or σ². Because g(β | σ²) ∝ a constant, this density is the one in (18-4). For now, write

$$f(\beta \mid \sigma^2, y, X) \propto h(\sigma^2)\,[2\pi]^{-K/2}\,|\sigma^2(X'X)^{-1}|^{-1/2}\, e^{-(1/2)(\beta - b)'[\sigma^2(X'X)^{-1}]^{-1}(\beta - b)}, \tag{18-5}$$

where

$$h(\sigma^2) = \frac{[vs^2]^{v+1}}{\Gamma(v+1)}\left(\frac{1}{\sigma^2}\right)^{v} e^{-vs^2(1/\sigma^2)}. \tag{18-6}$$

For the present, we treat h(σ²) simply as a constant that involves σ², not as a probability density; (18-5) is conditional on σ². Thus, the posterior density f(β | σ², y, X) is proportional to a multivariate normal distribution with mean b and covariance matrix σ²(X'X)⁻¹.

This result is familiar, but it is interpreted differently in this setting. First, we have combined our prior information about β (in this case, no information) and the sample information to obtain a posterior distribution. Thus, on the basis of the sample data in hand, we obtain a distribution for β with mean b and covariance matrix σ²(X'X)⁻¹. The result is dominated by the sample information, as it should be if there is no prior information. In the absence of any prior information, the mean of the posterior distribution, which is a type of Bayesian point estimate, is the sampling theory estimator.

To generalize the preceding to an unknown σ², we specify a noninformative prior distribution for ln σ over the entire real line.5 By the change of variable formula, if g(ln σ) is constant, then g(σ²) is proportional to 1/σ².6 Assuming that β and σ² are independent, we now have the noninformative joint prior distribution:
$$g(\beta, \sigma^2) = g_{\beta}(\beta)\,g_{\sigma^2}(\sigma^2) \propto \frac{1}{\sigma^2}.$$

We can obtain the joint posterior distribution for β and σ² by using

$$f(\beta, \sigma^2 \mid y, X) = L(\beta \mid \sigma^2, y, X)\,g_{\sigma^2}(\sigma^2) \propto L(\beta \mid \sigma^2, y, X) \times \frac{1}{\sigma^2}. \tag{18-7}$$

For the same reason as before, we multiply g(σ²) by a well-chosen constant, this time vs²Γ(v + 1)/Γ(v + 2) = vs²/(v + 1). Multiplying (18-5) by this constant times g(σ²) and inserting h(σ²) gives the joint posterior for β and σ², given y and X:

$$f(\beta, \sigma^2 \mid y, X) \propto \frac{[vs^2]^{v+2}}{\Gamma(v+2)}\left(\frac{1}{\sigma^2}\right)^{v+1} e^{-vs^2(1/\sigma^2)}\,[2\pi]^{-K/2}\,|\sigma^2(X'X)^{-1}|^{-1/2}\, e^{-(1/2)(\beta - b)'[\sigma^2(X'X)^{-1}]^{-1}(\beta - b)}.$$

To obtain the marginal posterior distribution for β, it is now necessary to integrate σ² out of the joint distribution (and vice versa to obtain the marginal distribution for σ²). By collecting the terms, f(β, σ² | y, X) can be written as

$$f(\beta, \sigma^2 \mid y, X) \propto A \times \left(\frac{1}{\sigma^2}\right)^{P-1} e^{-\lambda(1/\sigma^2)},$$
5. See Zellner (1971) for justification of this prior distribution.
6. Many treatments of this model use σ rather than σ² as the parameter of interest. The end results are identical. We have chosen this parameterization because it makes manipulation of the likelihood function with a gamma prior distribution especially convenient. See Zellner (1971, pp. 44–45) for discussion.
where

$$A = \frac{[vs^2]^{v+2}}{\Gamma(v+2)}\,[2\pi]^{-K/2}\,|(X'X)^{-1}|^{-1/2},$$
$$P = v + 2 + K/2 = (n - K)/2 + 2 + K/2 = (n + 4)/2,$$

and

$$\lambda = vs^2 + \tfrac{1}{2}(\beta - b)'X'X(\beta - b),$$

so the marginal posterior distribution for β is

$$\int_0^{\infty} f(\beta, \sigma^2 \mid y, X)\,d\sigma^2 \propto A \int_0^{\infty} \left(\frac{1}{\sigma^2}\right)^{P-1} e^{-\lambda(1/\sigma^2)}\,d\sigma^2.$$

To do the integration, we have to make a change of variable; d(1/σ²) = −(1/σ²)² dσ², so dσ² = −(1/σ²)⁻² d(1/σ²). Making the substitution—the sign of the integral changes twice, once for the Jacobian and back again because the integral from σ² = 0 to ∞ is the negative of the integral from (1/σ²) = 0 to ∞—we obtain

$$\int_0^{\infty} f(\beta, \sigma^2 \mid y, X)\,d\sigma^2 \propto A \int_0^{\infty} \left(\frac{1}{\sigma^2}\right)^{P-3} e^{-\lambda(1/\sigma^2)}\,d\left(\frac{1}{\sigma^2}\right) = A \times \frac{\Gamma(P-2)}{\lambda^{P-2}}.$$

Reinserting the expressions for A, P, and λ produces

$$f(\beta \mid y, X) \propto \frac{\dfrac{\Gamma(v + K/2)}{\Gamma(v + 2)}\,[vs^2]^{v+2}\,[2\pi]^{-K/2}\,|(X'X)^{-1}|^{-1/2}}{\left[vs^2 + \tfrac{1}{2}(\beta - b)'X'X(\beta - b)\right]^{v + K/2}}. \tag{18-8}$$

This density is proportional to a multivariate t distribution7 and is a generalization of the familiar univariate distribution we have used at various points. This distribution has a degrees of freedom parameter, d = n − K, mean b, and covariance matrix (d/(d − 2)) × [s²(X'X)⁻¹]. Each element of the K-element vector β has a marginal distribution that is the univariate t distribution with degrees of freedom n − K, mean bₖ, and variance equal to the kth diagonal element of the covariance matrix given earlier. Once again, this is the same as our sampling theory result. The difference is a matter of interpretation. In the current context, the estimated distribution is for β and is centered at b.
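The noninformative-prior results are easy to compute for a one-regressor model, where X'X is a scalar. The sketch below (pure Python; the data are purely illustrative) computes b, s², and the posterior t variance (d/(d − 2))s²(X'X)⁻¹:

```python
# Posterior for beta under the noninformative prior, one regressor, no constant.
# The marginal posterior is univariate t with d = n - K degrees of freedom,
# mean b, and variance (d/(d-2)) * s^2 / (x'x). Data are purely illustrative.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 2.3, 2.8, 4.2, 5.1, 5.8]
n, K = len(x), 1

xx = sum(xi * xi for xi in x)                  # X'X (a scalar here)
xy = sum(xi * yi for xi, yi in zip(x, y))
b = xy / xx                                    # least squares = posterior mean
e = [yi - b * xi for xi, yi in zip(x, y)]      # residuals
d = n - K                                      # degrees of freedom
s2 = sum(ei * ei for ei in e) / d              # classical s^2
post_var = (d / (d - 2)) * s2 / xx             # posterior variance of beta

print(b, post_var)
```

Note that the posterior variance exceeds the classical estimate s²(X'X)⁻¹ by the factor d/(d − 2), the usual variance inflation of the t distribution.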
18.3.2 ESTIMATION WITH AN INFORMATIVE PRIOR DENSITY

Once we leave the simple case of noninformative priors, matters become quite complicated, both at a practical level and, methodologically, in terms of just where the prior comes from. The integration of σ² out of the posterior in (18-7) is complicated by itself. It is made much more so if the prior distributions of β and σ² are at all involved. Partly to offset these difficulties, researchers usually use what is called a conjugate prior, which
7. See, for example, Judge et al. (1985) for details. The expression appears in Zellner (1971, p. 67). Note that the exponent in the denominator is v + K/2 = n/2.
is one that has the same form as the conditional density and is therefore amenable to the integration needed to obtain the marginal distributions.8
Example 18.2 Estimation with a Conjugate Prior

We continue Example 18.1, but we now assume a conjugate prior. For likelihood functions involving proportions, the beta prior is a common device, for reasons that will emerge shortly. The beta prior is

$$p(\theta) = \frac{\Gamma(\alpha + \beta)\,\theta^{\alpha-1}(1-\theta)^{\beta-1}}{\Gamma(\alpha)\,\Gamma(\beta)}.$$

Then, the posterior density becomes

$$\frac{\dfrac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1}\,\theta^{D}(1-\theta)^{N-D}}{\displaystyle\int_0^1 \dfrac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1}\,\theta^{D}(1-\theta)^{N-D}\,d\theta} = \frac{\theta^{D+\alpha-1}(1-\theta)^{N-D+\beta-1}}{\displaystyle\int_0^1 \theta^{D+\alpha-1}(1-\theta)^{N-D+\beta-1}\,d\theta}.$$

The posterior density is, once again, a beta distribution, with parameters (D + α, N − D + β). The posterior mean is

$$E[\theta \mid \text{data}] = \frac{D + \alpha}{N + \alpha + \beta}.$$

(Our previous choice of the uniform density was equivalent to α = β = 1.) Suppose we choose a prior that conforms to a prior mean of 0.5, but with less mass near zero and one than in the center, such as α = β = 2. Then, the posterior mean would be (8 + 2)/(25 + 4) = 0.3448. (This is yet larger than the previous estimator. The reason is that the prior variance is now smaller than 1/12, so the prior mean, still 0.5, receives yet greater weight than it did in the previous example.)
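The conjugate updating in Example 18.2 reduces to simple arithmetic, which the following sketch carries out for the two priors discussed in the text:

```python
# Conjugate beta prior for the binomial likelihood (Example 18.2):
# prior Beta(alpha, beta) gives posterior Beta(D + alpha, N - D + beta),
# with posterior mean (D + alpha) / (N + alpha + beta).
N, D = 25, 8

def posterior_mean(alpha, beta):
    return (D + alpha) / (N + alpha + beta)

print(posterior_mean(1, 1))  # uniform prior: (D+1)/(N+2) = 9/27
print(posterior_mean(2, 2))  # Beta(2,2) prior: tighter around 0.5
```

The posterior mean is a weighted average of the MLE D/N and the prior mean α/(α + β), with weights N/(N + α + β) and (α + β)/(N + α + β), so a tighter prior pulls the estimate further toward 0.5.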
Suppose that we assume that the prior beliefs about β may be summarized in a K-variate normal distribution with mean β⁰ and variance matrix Σ₀. Once again, it is illuminating to begin with the case in which σ² is assumed to be known. Proceeding in exactly the same fashion as before, we would obtain the following result: The posterior density of β conditioned on σ² and the data will be normal with

$$E[\beta \mid \sigma^2, y, X] = \left\{\Sigma_0^{-1} + [\sigma^2(X'X)^{-1}]^{-1}\right\}^{-1}\left\{\Sigma_0^{-1}\beta^0 + [\sigma^2(X'X)^{-1}]^{-1}b\right\} \tag{18-9}$$
$$= F\beta^0 + (I - F)b,$$

where

$$F = \left\{\Sigma_0^{-1} + [\sigma^2(X'X)^{-1}]^{-1}\right\}^{-1}\Sigma_0^{-1}$$
$$= \left\{[\text{prior variance}]^{-1} + [\text{conditional variance}]^{-1}\right\}^{-1}[\text{prior variance}]^{-1}. \tag{18-10}$$

This vector is a matrix weighted average of the prior and the least squares (sample) coefficient estimates, where the weights are the inverses of the prior and the conditional
8. Our choice of noninformative prior for ln σ led to a convenient prior for σ² in our derivation of the posterior for β. The idea that the prior can be specified arbitrarily in whatever form is mathematically convenient is very troubling; it is supposed to represent the accumulated prior belief about the parameter. On the other hand, it could be argued that the conjugate prior is the posterior of a previous analysis, which could justify its form. The issue of how priors should be specified is one of the focal points of the methodological debate. “Non-Bayesians” argue that it is disingenuous to claim the methodological high ground and then base the crucial prior density in a model purely on the basis of mathematical convenience. In a small sample, this assumed prior is going to dominate the results, whereas in a large one, the sampling theory estimates will dominate anyway.
covariance matrices.9 The smaller the variance of the estimator, the larger its weight, which makes sense. Also, still taking σ² as known, we can write the variance of the posterior normal distribution as

$$\text{Var}[\beta \mid y, X, \sigma^2] = \left\{\Sigma_0^{-1} + [\sigma^2(X'X)^{-1}]^{-1}\right\}^{-1}. \tag{18-11}$$

Notice that the posterior variance combines the prior and conditional variances on the basis of their inverses.10 We may interpret the noninformative prior as having infinite elements in Σ₀. This assumption would reduce this case to the earlier one.

Once again, it is necessary to account for the unknown σ². If our prior over σ² is to be informative as well, then the resulting distribution can be extremely cumbersome. A conjugate prior for β and σ² that can be used is
$$g(\beta, \sigma^2) = g_{\beta|\sigma^2}(\beta \mid \sigma^2)\,g_{\sigma^2}(\sigma^2), \tag{18-12}$$

where g(β | σ²) is normal, with mean β⁰ and variance σ²A, and

$$g_{\sigma^2}(\sigma^2) = \frac{[m\sigma_0^2]^{m+1}}{\Gamma(m+1)}\left(\frac{1}{\sigma^2}\right)^{m} e^{-m\sigma_0^2(1/\sigma^2)}. \tag{18-13}$$

This distribution is an inverted gamma distribution. It implies that 1/σ² has a gamma distribution. The prior mean for σ² is σ₀² and the prior variance is σ₀⁴/(m − 1).11 The product in (18-12) produces what is called a normal-gamma prior, which is the natural conjugate prior for this form of the model. By integrating out σ², we would obtain the prior marginal for β alone, which would be a multivariate t distribution.12 Combining this prior with the likelihood produces the joint posterior distribution for β and σ². Finally, the marginal posterior distribution for β is obtained by integrating out σ². It has been shown that this posterior distribution is multivariate t with

$$E[\beta \mid y, X] = \left\{[\sigma^2 A]^{-1} + [\sigma^2(X'X)^{-1}]^{-1}\right\}^{-1}\left\{[\sigma^2 A]^{-1}\beta^0 + [\sigma^2(X'X)^{-1}]^{-1}b\right\} \tag{18-14}$$

and

$$\text{Var}[\beta \mid y, X] = \left(\frac{j}{j - 2}\right)\left\{[\sigma^2 A]^{-1} + [\sigma^2(X'X)^{-1}]^{-1}\right\}^{-1}, \tag{18-15}$$

where j is a degrees of freedom parameter and σ² is the Bayesian estimate of σ². The prior degrees of freedom m is a parameter of the prior distribution for σ² that would have been determined at the outset. (See the following example.) Once again, it is clear that as the amount of data increases, the posterior density, and the estimates thereof, converge to the sampling theory results.
9. Note that it will not follow that individual elements of the posterior mean vector lie between those of β⁰ and b. See Judge et al. (1985, pp. 109–110) and Chamberlain and Leamer (1976).
10. Precisely this estimator was proposed by Theil and Goldberger (1961) as a way of combining a previously obtained estimate of a parameter and a current body of new data. They called their result a “mixed estimator.” The term “mixed estimation” takes an entirely different meaning in the current literature, as we saw in Chapter 17.
11. You can show this result by using gamma integrals. Note that the density is a function of 1/σ² = 1/x in the formula of (B-39), so to obtain E[σ²], we use the analog of E[1/x] = λ/(P − 1) and E[(1/x)²] = λ²/[(P − 1)(P − 2)]. In the density for 1/σ², the counterparts to λ and P are mσ₀² and m + 1.
12. Full details of this (lengthy) derivation appear in Judge et al. (1985, pp. 106–110) and Zellner (1971).
TABLE 18.1 Estimates of the MPC

Years       Estimated MPC   Variance of b   Degrees of Freedom   Estimated σ
1940–1950   0.6848014       0.061878        9                    24.954
1950–2000   0.92481         0.000065865     49                   92.244
Example 18.3 Bayesian Estimate of the Marginal Propensity to Consume

In Example 3.2, an estimate of the marginal propensity to consume is obtained using 11 observations from 1940 to 1950, with the results shown in the top row of Table 18.1. A classical 95 percent confidence interval for β based on these estimates is (0.1221, 1.2475). (The very wide interval probably results from the obviously poor specification of the model.) Based on noninformative priors for β and σ², we would estimate the posterior density for β to be univariate t with 9 degrees of freedom, with mean 0.6848014 and variance (11/9)0.061878 = 0.075628. An HPD interval for β would coincide with the confidence interval. Using the fourth quarter (yearly) values of the 1950–2000 data used in Example 5.3, we obtain the new estimates that appear in the second row of the table.

We take the first estimate and its estimated distribution as our prior for β and obtain a posterior density for β based on an informative prior instead. We assume for this exercise that σ² may be taken as known at the sample value of 24.954. Then,

$$\bar{b} = \left[\frac{1}{0.000065865} + \frac{1}{0.061878}\right]^{-1}\left[\frac{0.92481}{0.000065865} + \frac{0.6848014}{0.061878}\right] = 0.92455.$$

The weighted average is overwhelmingly dominated by the far more precise sample estimate from the larger sample. The posterior variance is the inverse in brackets, which is 0.000065795. This is close to the variance of the latter estimate. An HPD interval can be formed in the familiar fashion. It will be slightly narrower than the confidence interval, because the variance of the posterior distribution is slightly smaller than the variance of the sampling estimator. This reduction is the value of the prior information. (As we see here, the prior is not particularly informative.)
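The precision-weighted average in Example 18.3 can be verified directly from the figures in Table 18.1:

```python
# Posterior mean and variance for the MPC with an informative prior,
# using the two rows of Table 18.1 (sigma^2 treated as known).
prior_mean, prior_var = 0.6848014, 0.061878    # 1940-1950 estimate (the prior)
data_mean, data_var = 0.92481, 0.000065865     # 1950-2000 estimate (the data)

# Precisions (inverse variances) are the weights in the posterior mean
prior_prec = 1 / prior_var
data_prec = 1 / data_var
post_var = 1 / (prior_prec + data_prec)
post_mean = post_var * (prior_prec * prior_mean + data_prec * data_mean)

print(post_mean, post_var)
```

The output matches the text: a posterior mean of about 0.92455 and a posterior variance of about 0.000065795, dominated by the far more precise second-sample estimate.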
18.4 BAYESIAN INFERENCE
The posterior density is the Bayesian counterpart to the likelihood function. It embod- ies the information that is available to make inference about the econometric model. As we have seen, the mean and variance of the posterior distribution correspond to the classical (sampling theory) point estimator and asymptotic variance, although they are interpreted differently. Before we examine more intricate applications of Bayesian inference, it is useful to formalize some other components of the method, point and interval estimation and the Bayesian equivalent of testing a hypothesis.13
18.4.1 POINT ESTIMATION

The posterior density function embodies the prior and the likelihood and therefore contains all the researcher’s information about the parameters. But for purposes of presenting results, the density is somewhat imprecise, and one normally prefers a point
13. We do not include prediction in this list. The Bayesian approach would treat the prediction problem as one of estimation in the same fashion as “parameter” estimation. The value to be forecasted is among the unknown elements of the model that would be characterized by a prior and would enter the posterior density in a symmetric fashion along with the other parameters.
or interval estimate. The natural approach would be to use the mean of the posterior distribution as the estimator. For the noninformative prior, we use b, the sampling theory estimator.

One might ask at this point, why bother? These Bayesian point estimates are identical to the sampling theory estimates. All that has changed is our interpretation of the results. This situation is, however, exactly the way it should be. Remember that we entered the analysis with noninformative priors for β and σ². Therefore, the only information brought to bear on estimation is the sample data, and it would be peculiar if anything other than the sampling theory estimates emerged at the end. The results do change when our prior brings out-of-sample information into the estimates, as we shall see below.

The results will also change if we change our motivation for estimating β. The parameter estimates have been treated thus far as if they were an end in themselves. But in some settings, parameter estimates are obtained so as to enable the analyst to make a decision. Consider, then, a loss function, H(β̂, β), which quantifies the cost of basing a decision on an estimate β̂ when the parameter is β. The expected, or average, loss is
$$E_{\beta}[H(\hat\beta, \beta)] = \int_{\beta} H(\hat\beta, \beta)\,f(\beta \mid y, X)\,d\beta, \tag{18-16}$$

where the weighting function is the marginal posterior density. (The joint density for β and σ² would be used if the loss were defined over both.) The Bayesian point estimate is the parameter vector that minimizes the expected loss. If the loss function is a quadratic form in (β̂ − β), then the mean of the posterior distribution is the “minimum expected loss” (MELO) estimator. The proof is simple. For this case,

$$E[H(\hat\beta, \beta) \mid y, X] = E\left[\tfrac{1}{2}(\hat\beta - \beta)'W(\hat\beta - \beta)\,\middle|\,y, X\right].$$

To minimize this, we can use the result that

$$\partial E[H(\hat\beta, \beta) \mid y, X]/\partial\hat\beta = E[\partial H(\hat\beta, \beta)/\partial\hat\beta \mid y, X] = E[W(\hat\beta - \beta) \mid y, X].$$

The minimum is found by equating this derivative to 0, whence, because W is irrelevant,

$$\hat\beta = E[\beta \mid y, X].$$

This kind of loss function would state that errors in the positive and negative direction are equally bad, and large errors are much worse than small errors. If the loss function were a linear function instead, then the MELO estimator would be the median of the posterior distribution. These results are the same in the case of the noninformative prior that we have just examined.
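The MELO result can be checked by simulation: over draws from any posterior, the sample mean attains the smallest average quadratic loss among candidate estimates. The normal "posterior" below is arbitrary, chosen only for illustration:

```python
import random

# Monte Carlo check of the MELO result: under quadratic loss, the posterior
# mean minimizes the expected loss. The normal "posterior" here is arbitrary.
random.seed(42)
draws = [random.gauss(1.5, 0.4) for _ in range(100_000)]
post_mean = sum(draws) / len(draws)

def expected_loss(estimate):
    # average quadratic loss over the posterior draws (W = 1)
    return sum((estimate - beta) ** 2 for beta in draws) / len(draws)

loss_at_mean = expected_loss(post_mean)
print(loss_at_mean < expected_loss(post_mean + 0.1))   # True
print(loss_at_mean < expected_loss(post_mean - 0.1))   # True
```

The check is exact over the empirical distribution, since E[(c − β)²] = Var[β] + (c − E[β])², which is minimized at c = E[β].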
18.4.2 INTERVAL ESTIMATION

The counterpart to a confidence interval in this setting is an interval of the posterior distribution that contains a specified probability. Clearly, it is desirable to have this interval be as narrow as possible. For a unimodal density, this corresponds to an interval within which the density function is higher than any points outside it, which justifies the term highest posterior density (HPD) interval. For the case we have analyzed, which involves a symmetric distribution, we would form the HPD interval for β around the least squares estimate b, with terminal values taken from the standard t tables.
CHAPTER 18 ✦ Bayesian Estimation and Inference 611
18.4.3 HYPOTHESIS TESTING

The Bayesian methodology treats the classical approach to hypothesis testing with a large amount of skepticism. Two issues are especially problematic. First, a close examination of the work we have done in Chapter 5 will show that because we are using consistent estimators, with a large enough sample, we will ultimately reject any (nested) hypothesis unless we adjust the significance level of the test downward as the sample size increases. Second, the all-or-nothing approach of either rejecting or not rejecting a hypothesis provides no method of simply sharpening our beliefs. Even the most committed of analysts might be reluctant to discard a strongly held prior based on a single sample of data, yet this is what the sampling methodology mandates. (Note, for example, the uncomfortable dilemma this creates in footnote 20 in Chapter 10.) The Bayesian approach to hypothesis testing is much more appealing in this regard. Indeed, the approach might be more appropriately called "comparing hypotheses," because it essentially involves only making an assessment of which of two hypotheses has a higher probability of being correct.

The Bayesian approach to hypothesis testing bears large similarity to Bayesian estimation.14 We have formulated two hypotheses, a "null," denoted H0, and an alternative, denoted H1. These need not be complementary, as in H0: "statement A is true" versus H1: "statement A is not true," since the intent of the procedure is not to reject one hypothesis in favor of the other. For simplicity, however, we will confine our attention to hypotheses about the parameters in the regression model, which often are complementary. Assume that before we begin our experimentation (data gathering, statistical analysis) we are able to assign prior probabilities P(H0) and P(H1) to the two hypotheses. The prior odds ratio is simply the ratio
Odds_prior = P(H0)/P(H1).   (18-17)

For example, one's uncertainty about the sign of a parameter might be summarized in a prior odds over H0: β ≥ 0 versus H1: β < 0 of 0.5/0.5 = 1. After the sample evidence is gathered, the prior will be modified, so the posterior is, in general,
Odds_posterior = B01 × Odds_prior.
The value B01 is called the Bayes factor for comparing the two hypotheses. It summarizes the effect of the sample data on the prior odds. The end result, Oddsposterior, is a new odds ratio that can be carried forward as the prior in a subsequent analysis. The Bayes factor is computed by assessing the likelihoods of the data observed under the two hypotheses. We return to our first departure point, the likelihood of the data, given the parameters:
f(y | β, σ², X) = [2πσ²]^{−n/2} e^{−(1/(2σ²))(y − Xβ)′(y − Xβ)}.   (18-18)

Based on our priors for the parameters, the expected, or average, likelihood, assuming that hypothesis j is true (j = 0, 1), is

f(y | X, Hj) = E_{β,σ²}[f(y | β, σ², X, Hj)] = ∫_{σ²} ∫_β f(y | β, σ², X, Hj) g(β, σ²) dβ dσ².
14 For extensive discussion, see Zellner and Siow (1980) and Zellner (1985, pp. 275–305).
612 PART IV ✦ Estimation Methodology
(This conditional density is also the predictive density for y.) Therefore, based on the observed data, we use Bayes theorem to reassess the probability of Hj; the posterior probability is

P(Hj | y, X) = f(y | X, Hj)P(Hj)/f(y).
The posterior odds ratio is P(H0 | y, X)/P(H1 | y, X), so the Bayes factor is
B01 = f(y | X, H0)/f(y | X, H1).
Example 18.4 Posterior Odds for the Classical Regression Model
Zellner (1971) analyzes the setting in which there are two possible explanations for the variation in a dependent variable y:

Model 0: y = X0β0 + ε0

and

Model 1: y = X1β1 + ε1.
We will briefly sketch his results. We form informative priors for [β, σ²]_j, j = 0, 1, as specified in (18-12) and (18-13), that is, multivariate normal and inverted gamma, respectively. Zellner then derives the Bayes factor for the posterior odds ratio. The derivation is lengthy and complicated, but for large n, with some simplifying assumptions, a useful formulation emerges. First, assume that the priors for σ0² and σ1² are the same. Second, assume that [|A0^{−1}|/|A0^{−1} + X0′X0|]/[|A1^{−1}|/|A1^{−1} + X1′X1|] → 1. The first of these would be the usual situation, in which the uncertainty concerns the covariation between yi and xi, not the amount of residual variation (lack of fit). The second concerns the relative amounts of information in the prior (A) versus the likelihood (X′X). These matrices are the inverses of the covariance matrices, or the precision matrices. [Note how these two matrices form the matrix weights in the computation of the posterior mean in (18-9).] Zellner (p. 310) discusses this assumption at some length. With these two assumptions, he shows that as n grows large,15

B01 ≈ (s0²/s1²)^{−(n+m)/2} = [(1 − R0²)/(1 − R1²)]^{−(n+m)/2}.

Therefore, the result favors the model that provides the better fit using R² as the fit measure. If we stretch Zellner's analysis a bit by interpreting model 1 as "the model" and model 0 as "no model" (that is, the relevant part of β0 = 0, so R0² = 0), then the ratio simplifies to

B01 = (1 − R1²)^{(n+m)/2}.

Thus, the better the fit of the regression, the lower the Bayes factor in favor of model 0 (no model), which makes intuitive sense.

Zellner and Siow (1980) have continued this analysis with noninformative priors for β and σ_j². Specifically, they use the flat prior for ln σ [see (18-7)] and a multivariate Cauchy prior (which has infinite variances) for β. Their main result (3.10) is

B01 = (√π / Γ[(k + 1)/2]) [(n − K)/2]^{k/2} (1 − R²)^{(n−K−1)/2}.

This result is very much like the previous one, with some slight differences due to degrees of freedom corrections and the several approximations used to reach the first one.
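As a quick numerical sketch of the large-sample approximation, the following (the function name and the R², n, m values are hypothetical choices of ours) evaluates B01 ≈ [(1 − R0²)/(1 − R1²)]^{−(n+m)/2} and its "no model" special case:

```python
# Hypothetical numerical check of Zellner's large-sample Bayes factor
# B01 ≈ [(1 - R0^2)/(1 - R1^2)]^(-(n + m)/2); names and values are ours.
def bayes_factor_01(r2_0, r2_1, n, m):
    """Approximate Bayes factor favoring model 0 over model 1."""
    return ((1.0 - r2_0) / (1.0 - r2_1)) ** (-(n + m) / 2.0)

# Model 1 fits better, so the sample evidence moves the odds against model 0.
b01 = bayes_factor_01(r2_0=0.30, r2_1=0.40, n=100, m=2)
assert b01 < 1.0

# "No model" benchmark: with R0^2 = 0, B01 collapses to (1 - R1^2)^((n+m)/2).
b01_null = bayes_factor_01(r2_0=0.0, r2_1=0.40, n=100, m=2)
assert abs(b01_null - (1.0 - 0.40) ** 51) < 1e-12
```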
15 A ratio of exponentials that appears in Zellner's result (his equation 10.50) is omitted. To the order of approximation in the result, this ratio vanishes from the final result. (Personal correspondence from A. Zellner to the author.)
18.4.4 LARGE SAMPLE RESULTS

Although all statistical results for Bayesian estimators are necessarily "finite sample" (they are conditioned on the sample data), it remains of interest to consider how the estimators behave in large samples.16 Do Bayesian estimators "converge" to something? To do this exercise, it is useful to envision having a sample that is the entire population. Then, the posterior distribution would characterize this entire population, not a sample from it. It stands to reason in this case, at least intuitively, that the posterior distribution should coincide with the likelihood function. It will (as usual), save for the influence of the prior. But as the sample size grows, one should expect the likelihood function to overwhelm the prior. It will, unless the strength of the prior grows with the sample size (that is, for example, if the prior variance is of order 1/n). An informative prior will still fade in its influence on the posterior unless it becomes more informative as the sample size grows.

The preceding suggests that the posterior mean will converge to the maximum likelihood estimator. The MLE is the parameter vector that is at the mode of the likelihood function. The Bayesian estimator is the posterior mean, not the mode, so a remaining question concerns the relationship between these two features. The Bernstein–von Mises "theorem" [see Cameron and Trivedi (2005, p. 433) and Train (2003, Chapter 12)] states that the posterior mean and the maximum likelihood estimator will converge to the same probability limit and have the same limiting normal distribution. A form of central limit theorem is at work. But for the remaining philosophical questions, the results suggest that for large samples, the choice between Bayesian and frequentist methods can be one of computational efficiency. (This is the thrust of the application in Section 18.8. Note, as well, footnote 1 at the beginning of this chapter.
In an infinite sample, the maintained “uncertainty” of the Bayesian estimation framework would have to arise from deeper questions about the model. For example, the mean of the entire population is its mean; there is no uncertainty about the “parameter.”)
18.5 POSTERIOR DISTRIBUTIONS AND THE GIBBS SAMPLER
The preceding analysis has proceeded along a set of steps that includes formulating the likelihood function (the model), the prior density over the objects of estimation, and the posterior density. To complete the inference step, we then analytically derived the characteristics of the posterior density of interest, such as the mean or mode, and the variance. The complicated element of any of this analysis is determining the moments of the posterior density, for example, the mean:

θ̂ = E[θ | data] = ∫_θ θ p(θ | data) dθ.   (18-19)
16 The standard preamble in econometric studies, that the analysis to follow is "exact" as opposed to approximate or "large sample," refers to this aspect—the analysis is conditioned on and, by implication, applies only to the sample data in hand. Any inference outside the sample, for example, to hypothesized random samples is, like the sampling theory counterpart, approximate.
There are relatively few applications for which integrals such as this can be derived in closed form. (This is one motivation for conjugate priors.) The modern approach to Bayesian inference takes a different strategy. The result in (18-19) is an expectation. Suppose it were possible to obtain a random sample, as large as desired, from the population defined by p(θ | data). Then, using the same strategy we used throughout Chapter 17 for simulation-based estimation, we could use that sample's characteristics, such as mean, variance, quantiles, and so on, to infer the characteristics of the posterior distribution. Indeed, with an (essentially) infinite sample, we would be freed from having to limit our attention to a few simple features such as the mean and variance; we could view any features of the posterior distribution that we like. The (much less) complicated part of the analysis is the formulation of the posterior density. It remains to determine how the sample is to be drawn from the posterior density. This element of the strategy is provided by a remarkable (and remarkably useful) result known as the Gibbs sampler. [See Casella and George (1992).] The central result of the Gibbs sampler is as follows: We wish to draw a random sample from the joint population (x, y). The joint distribution of x and y is either unknown or intractable, and it is not possible to sample from the joint distribution. However, assume that the conditional distributions f(x | y) and f(y | x) are known and simple enough that it is possible to draw univariate random samples from both of them. The following iteration will produce a bivariate random sample from the joint distribution:
Gibbs sampler:
1. Begin the cycle with a value of x0 that is in the right range of x | y,
2. Draw an observation y0 | x0,
3. Draw an observation xt | yt−1,
4. Draw an observation yt | xt .
Iteration of steps 3 and 4 for several thousand cycles will eventually produce a random sample from the joint distribution. (The first several thousand draws are discarded to avoid the influence of the initial conditions—this is called the burn-in.) [Some technical details on the procedure appear in Cameron and Trivedi (2005, Section 13.5).]
Example 18.5 Gibbs Sampling from the Normal Distribution
To illustrate the mechanical aspects of the Gibbs sampler, consider random sampling from the joint normal distribution. We consider the bivariate normal distribution first. Suppose we wished to draw a random sample from the population

(x1, x2)′ ∼ N[(0, 0)′, Σ],  Σ = [[1, ρ], [ρ, 1]].
As we have seen in Chapter 17, a direct approach is to use the fact that linear functions of normally distributed variables are normally distributed. [See (B-80).] Thus, we might transform a series of independent normal draws (u1, u2) by the Cholesky decomposition of the covariance matrix,

(x1, x2)′_i = [[1, 0], [θ1, θ2]] (u1, u2)′_i = Lu_i,
where θ1 = ρ and θ2 = √(1 − ρ²). The Gibbs sampler would take advantage of the result
x1 | x2 ∼ N[ρx2, (1 − ρ²)],

and

x2 | x1 ∼ N[ρx1, (1 − ρ²)].
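A minimal sketch of this two-step Gibbs cycle, using the two conditional normal distributions above (the choice of ρ, burn-in, and sample size here are our own illustrative settings):

```python
import numpy as np

rng = np.random.default_rng(123)
rho = 0.7                      # hypothetical correlation
sd = np.sqrt(1.0 - rho**2)     # conditional standard deviation
burn_in, n_keep = 1_000, 20_000

x1, x2 = 0.0, 0.0              # step 1: start anywhere in the support
draws = np.empty((n_keep, 2))
for r in range(burn_in + n_keep):
    # steps 3 and 4: cycle between the two conditional normal distributions
    x1 = rng.normal(rho * x2, sd)   # x1 | x2 ~ N[rho*x2, 1 - rho^2]
    x2 = rng.normal(rho * x1, sd)   # x2 | x1 ~ N[rho*x1, 1 - rho^2]
    if r >= burn_in:
        draws[r - burn_in] = (x1, x2)

# Retained draws mimic the joint N(0, [[1, rho], [rho, 1]]) population.
print(draws.mean(axis=0), np.corrcoef(draws.T)[0, 1])
```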
To sample from a trivariate, or multivariate population, we can expand the Gibbs sequence in the natural fashion. For example, to sample from a trivariate population, we would use the Gibbs sequence
x1 | x2, x3 ∼ N[β1,2 x2 + β1,3 x3, Σ1|2,3],
x2 | x1, x3 ∼ N[β2,1 x1 + β2,3 x3, Σ2|1,3],
x3 | x1, x2 ∼ N[β3,1 x1 + β3,2 x2, Σ3|1,2],
where the conditional means and variances are given in Theorem B.7. This defines a three- step cycle.
The availability of the Gibbs sampler frees the researcher from the necessity of deriving the analytical properties of the full, joint posterior distribution. Because the formulation of conditional priors is straightforward, and the derivation of the conditional posteriors is only slightly less so, this tool has facilitated a vast range of applications that previously were intractable. For an example, consider, once again, the classical normal regression model. From (18-7), the joint posterior for (β, σ²) is

p(β, σ² | y, X) ∝ ([vs²]^{v+1}/Γ(v + 1)) (1/σ²)^{v+2} exp(−vs²/σ²) × [2π]^{−K/2} |σ²(X′X)^{−1}|^{−1/2} exp(−(1/2)(β − b)′[σ²(X′X)^{−1}]^{−1}(β − b)).

If we wished to use a simulation approach to characterizing the posterior distribution, we would need to draw a K + 1 variate sample of observations from this intractable distribution. However, with the assumed priors, we found the conditional posterior for β in (18-5):

p(β | σ², y, X) = N[b, σ²(X′X)^{−1}].

From (18-6), we can deduce that the conditional posterior for σ² | β, y, X is an inverted gamma distribution with parameters mσ₀² = vσ̂² and m = v in (18-13):

p(σ² | β, y, X) = ([vσ̂²]^{v+1}/Γ(v + 1)) (1/σ²)^{v+2} exp(−vσ̂²/σ²),  σ̂² = Σ_{i=1}^n (y_i − x_i′β)²/(n − K).

This sets up a Gibbs sampler for sampling from the joint posterior of β and σ². We would cycle between random draws from the multivariate normal for β and the inverted gamma distribution for σ² to obtain a K + 1 variate sample on (β, σ²). [Of course, for this application, we do know the marginal posterior distribution for β—see (18-8).]

The Gibbs sampler is not truly a random sampler; it is a Markov chain—each "draw" from the distribution is a function of the draw that precedes it. The random input at each cycle provides the randomness, which leads to the popular name for this strategy, Markov Chain Monte Carlo or MCMC or MC² (pick one) estimation. In its simplest
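This two-block cycle can be sketched as follows. The data, sample sizes, and starting values are hypothetical, and we draw σ² in the equivalent scaled inverse chi-square form of the inverted gamma conditional (a common implementation device, not Greene's notation):

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated regression data -- hypothetical values for illustration.
n, K = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=1.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ (X.T @ y)                 # OLS estimate b

burn_in, n_keep = 500, 5_000
beta_draws = np.empty((n_keep, K))
sig2 = 1.0                              # starting value for sigma^2
for r in range(burn_in + n_keep):
    # beta | sigma^2, y, X ~ N[b, sigma^2 (X'X)^{-1}]
    beta = rng.multivariate_normal(b, sig2 * XtX_inv)
    # sigma^2 | beta, y, X is inverted gamma; drawn here in the equivalent
    # scaled inverse chi-square form, sigma^2 = e'e / chi2(n).
    e = y - X @ beta
    sig2 = (e @ e) / rng.chisquare(n)
    if r >= burn_in:
        beta_draws[r - burn_in] = beta

print(beta_draws.mean(axis=0), b)       # posterior mean is centered at OLS
```

With the noninformative prior, the retained draws for β average to the OLS estimate b, as the text leads us to expect.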
form, it provides a remarkably efficient tool for studying the posterior distributions in very complicated models. The example in the next section is a striking illustration: it shows how to locate the MLE for a probit model without computing the likelihood function or its derivatives. In Section 18.8, we will examine an extension and refinement of the strategy, the Metropolis–Hastings algorithm. In the next several sections, we will present some applications of Bayesian inference. In Section 18.9, we will return to some general issues in classical and Bayesian estimation and inference.
18.6 APPLICATION: BINOMIAL PROBIT MODEL
Consider inference about the binomial probit model for a dependent variable that is generated as follows (see Sections 23.2–23.4):

y*_i = x_i′β + ε_i,  ε_i ∼ N[0, 1],   (18-20)

y_i = 1 if y*_i > 0, otherwise y_i = 0.   (18-21)

(Theoretical motivation for the model appears in Section 23.3.) The data consist of (y, X) = (y_i, x_i), i = 1, …, n. The random variable y_i has a Bernoulli distribution with probabilities

Prob[y_i = 1 | x_i] = Φ(x_i′β),
Prob[y_i = 0 | x_i] = 1 − Φ(x_i′β).

The likelihood function for the observed data is

L(y | X, β) = ∏_{i=1}^n [Φ(x_i′β)]^{y_i}[1 − Φ(x_i′β)]^{1−y_i}.

(Once again, we cheat a bit on the notation—the likelihood function is actually the joint density for the data, given X and β.) Classical maximum likelihood estimation of β is developed in Section 23.4. To obtain the posterior mean (Bayesian estimator), we assume a noninformative, flat (improper) prior for β,

p(β) ∝ 1.

The posterior density would be

p(β | y, X) = ∏_{i=1}^n [Φ(x_i′β)]^{y_i}[1 − Φ(x_i′β)]^{1−y_i}(1) / ∫_β ∏_{i=1}^n [Φ(x_i′β)]^{y_i}[1 − Φ(x_i′β)]^{1−y_i}(1) dβ,

and the estimator would be the posterior mean,

β̂ = E[β | y, X] = ∫_β β ∏_{i=1}^n [Φ(x_i′β)]^{y_i}[1 − Φ(x_i′β)]^{1−y_i} dβ / ∫_β ∏_{i=1}^n [Φ(x_i′β)]^{y_i}[1 − Φ(x_i′β)]^{1−y_i} dβ.   (18-22)

Evaluation of the integrals in (18-22) is hopelessly complicated, but a solution using the Gibbs sampler and a technique known as data augmentation, pioneered by Albert
and Chib (1993a), is surprisingly simple. We begin by treating the unobserved y*_i s as unknowns to be estimated, along with β. Thus, the (K + n) × 1 parameter vector is θ = (β, y*). We now construct a Gibbs sampler. Consider, first, p(β | y*, y, X). If y*_i is known, then y_i is known [see (18-21)]. It follows that

p(β | y*, y, X) = p(β | y*, X).

This posterior defines a linear regression model with normally distributed disturbances and known σ² = 1. It is precisely the model we saw in Section 18.3.1, and the posterior we need is in (18-5), with σ² = 1. So, based on our earlier results, it follows that

p(β | y*, y, X) = N[b*, (X′X)^{−1}],   (18-23)

where

b* = (X′X)^{−1}X′y*.

For y*_i, ignoring y_i for the moment, it would follow immediately from (18-20) that

p(y*_i | β, X) = N[x_i′β, 1].

However, y_i is informative about y*_i. If y_i equals one, we know that y*_i > 0, and if y_i equals zero, then y*_i ≤ 0. The implication is that conditioned on β, X, and y, y*_i has the truncated (above or below zero) normal distribution that is developed in Sections 24.2.1 and 24.2.2. The standard notation for this is

p(y*_i | y_i = 1, x_i, β) = N⁺[x_i′β, 1],   (18-24)
p(y*_i | y_i = 0, x_i, β) = N⁻[x_i′β, 1].

Results (18-23) and (18-24) set up the components for a Gibbs sampler that we can use to estimate the posterior means E[β | y, X] and E[y* | y, X]. The following is our algorithm:
Gibbs Sampler for the Binomial Probit Model

1. Compute X′X once at the outset, and obtain L such that LL′ = (X′X)^{−1}.
2. Start β at any value, such as 0.
3. Result (17-1) shows how to transform a draw from U[0, 1] to a draw from the truncated normal with underlying mean μ and standard deviation σ. For this application, the draws are

y*_i(r) = x_i′β_{r−1} + Φ^{−1}[1 − (1 − U)Φ(x_i′β_{r−1})] if y_i = 1,
y*_i(r) = x_i′β_{r−1} + Φ^{−1}[UΦ(−x_i′β_{r−1})] if y_i = 0.

This step is used to draw the n observations on y*_i(r).
4. Section 17.2.4 shows how to draw an observation from the multivariate normal population. For this application, we use the results at step 3 to compute b* = (X′X)^{−1}X′y*(r). We obtain a vector, v, of K draws from the N[0, 1] population; then β(r) = b* + Lv.

The iteration cycles between steps 3 and 4. This should be repeated several thousand times, discarding the burn-in draws; then the estimator of β is the sample mean of the retained draws. The posterior variance is computed with the variance of the retained draws. Posterior estimates of y*_i would typically not be useful.
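The algorithm above can be sketched in a few lines. The data here are simulated and all settings (sample size, true coefficients, numbers of draws) are our own hypothetical choices; the truncated normal draws in step 3 use the inverse-CDF formulas given above:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Simulated probit data -- purely hypothetical values for illustration.
n, K = 500, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 1.0])
y = (X @ beta_true + rng.normal(size=n) > 0).astype(float)

# Step 1: compute X'X once, and L with L L' = (X'X)^{-1}.
XtX_inv = np.linalg.inv(X.T @ X)
L = np.linalg.cholesky(XtX_inv)
beta = np.zeros(K)                       # step 2: start beta at 0

burn_in, n_keep = 200, 1_000
keep = np.empty((n_keep, K))
for r in range(burn_in + n_keep):
    # Step 3: draw y* | beta, y from the normal truncated at zero (inverse CDF).
    mu = X @ beta
    u = rng.uniform(size=n)
    ystar = np.where(
        y == 1.0,
        mu + norm.ppf(1.0 - (1.0 - u) * norm.cdf(mu)),  # truncated to y* > 0
        mu + norm.ppf(u * norm.cdf(-mu)),               # truncated to y* <= 0
    )
    # Step 4: draw beta | y* from N[b*, (X'X)^{-1}], b* = (X'X)^{-1} X'y*.
    b_star = XtX_inv @ (X.T @ ystar)
    beta = b_star + L @ rng.normal(size=K)
    if r >= burn_in:
        keep[r - burn_in] = beta

print(keep.mean(axis=0))                 # posterior mean of beta
```

With the flat prior, the mean of the retained draws should land near the probit MLE, mirroring the comparisons in Tables 18.2 and 18.3.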
TABLE 18.2 Probit Estimates for Grade Equation

                 Maximum Likelihood               Posterior Means and Std. Devs.
Variable     Estimate     Standard Error     Posterior Mean     Posterior S.D.
Constant     −7.4523      2.5425             −8.6286            2.7995
GPA           1.6258      0.6939              1.8754            0.7668
TUCE          0.05173     0.08389             0.06277           0.08695
PSI           1.4263      0.5950              1.6072            0.6257
Example 18.6 Gibbs Sampler for a Probit Model
In Examples 16.14 and 16.15, we examined Spector and Mazzeo's (1980) widely traveled data on a binary choice outcome. (The example used the data for a different model.) The binary probit model studied in the paper was

Prob(GRADE_i = 1 | β, x_i) = Φ(β1 + β2 GPA_i + β3 TUCE_i + β4 PSI_i).

The variables are defined in Example 16.14. Their probit model is studied in Example 23.3. The sample contains 32 observations. Table 18.2 presents the maximum likelihood estimates and the posterior means and standard deviations for the probit model. For the Gibbs sampler, we used 5,000 draws, and discarded the first 1,000. The results in Table 18.2 suggest the similarity of the posterior mean estimated with the Gibbs sampler to the maximum likelihood estimate. However, the sample is quite small, and the differences between the coefficients are still fairly substantial.

For a striking example of the behavior of this procedure, we now revisit the German health care data examined in Examples 11.10, 11.11, and 16.16, and several other examples throughout the book. The probit model to be estimated is

Prob(Doctor visits_it > 0) = Φ(β1 + β2 Age_it + β3 Education_it + β4 Income_it + β5 Kids_it + β6 Married_it + β7 Female_it).

The sample contains data on 7,293 families and a total of 27,326 observations. We are pooling the data for this application. Table 18.3 presents the probit results for this model using the same procedure as before. (We used only 500 draws, and discarded the first 100.) The similarity is what one would expect given the large sample size.

We note before proceeding to other applications, notwithstanding the striking similarity of the Gibbs sampler to the MLE, that this is not an efficient method of estimating the parameters of a probit model. The estimator requires generation of thousands of samples of potentially thousands of observations. We used only 500 replications to produce Table 18.3. The computations took about five minutes. Using Newton's method to maximize the log-likelihood directly took less than five seconds. Unless one is wedded to the Bayesian paradigm, on strictly practical grounds, the MLE would be the preferred estimator.
TABLE 18.3 Probit Estimates for Doctor Visits Equation

                 Maximum Likelihood               Posterior Means and Std. Devs.
Variable     Estimate     Standard Error     Posterior Mean     Posterior S.D.
Constant     −0.12433     0.058146           −0.12628           0.054759
Age           0.011892    0.00079568          0.011979          0.00080073
Education    −0.014966    0.0035747          −0.015142          0.0036246
Income       −0.13242     0.046552           −0.12669           0.047979
Kids         −0.15212     0.018327           −0.15149           0.018400
Married       0.073522    0.020644            0.071977          0.020852
Female        0.35591     0.016017            0.35582           0.015913
This application of the Gibbs sampler demonstrates in an uncomplicated case how the algorithm can provide an alternative to actually maximizing the log-likelihood. We do note that the similarity of the method to the EM algorithm in Section E.3.7 is not coincidental. Both procedures use an estimate of the unobserved, censored data, and both estimate β by OLS using the predicted data.
18.7 PANEL DATA APPLICATION: INDIVIDUAL EFFECTS MODELS
We consider a panel data model with common individual effects,

y_it = α_i + x_it′β + ε_it,  ε_it ∼ N[0, σ_ε²].

In the Bayesian framework, there is no need to distinguish between fixed and random effects. The classical distinction results from an asymmetric treatment of the data and the parameters. So, we will leave that unspecified for the moment. The implications will emerge later when we specify the prior densities over the model parameters.

The likelihood function for the sample under normality of ε_it is

p(y | α_1, …, α_n, β, σ_ε², X) = ∏_{i=1}^n ∏_{t=1}^{T_i} (1/(σ_ε√(2π))) exp(−(y_it − α_i − x_it′β)²/(2σ_ε²)).

The remaining analysis hinges on the specification of the prior distributions. We will consider three cases. Each illustrates an aspect of the methodology.

First, group the full set of location (regression) parameters in one (n + K) × 1 slope vector, γ. Then, with the disturbance variance, θ = (α′, β′, σ_ε²)′ = (γ′, σ_ε²)′. Define a conformable data matrix, Z = (D, X), where D contains the n dummy variables so that we may write the model,

y = Zγ + ε,

in the familiar fashion for our common effects linear regression. (See Chapter 9.) We now assume the uniform-inverse gamma prior that we used in our earlier treatment of the linear model,

p(γ, σ_ε²) ∝ 1/σ_ε².

The resulting (marginal) posterior density for γ is precisely that in (18-8) (where now the slope vector includes the elements of α). The density is an (n + K) variate t with mean equal to the OLS estimator and covariance matrix [(Σ_i T_i − n − K)/(Σ_i T_i − n − K − 2)] s²(Z′Z)^{−1}. Because OLS in this model as stated means the within estimator, the implication is that with this noninformative prior over (α, β), the model is equivalent to the fixed effects model. Note, again, this is not a consequence of any assumption about correlation between effects and included variables. That has remained unstated; though, by implication, we would allow correlation between D and X.
Some observers are uncomfortable with the idea of a uniform prior over the entire real line. [See, e.g., Koop (2003, pp. 22–23).] Others, e.g., Zellner (1971, p. 20), are less concerned. [Cameron and Trivedi (2005, pp. 425–427) suggest a middle ground.] Formally, our assumption of a uniform prior over the entire real line is an improper
prior, because it cannot have a positive density and integrate to one over the entire real line. As such, the posterior appears to be ill defined. However, note that the "improper" uniform prior will, in fact, fall out of the posterior, because it appears in both numerator and denominator. [Zellner (1971, p. 20) offers some more methodological commentary.] The practical solution for location parameters, such as a vector of regression slopes, is to assume a nearly flat, "almost uninformative" prior. The usual choice is a conjugate normal prior with an arbitrarily large variance. (It should be noted, of course, that as long as that variance is finite, even if it is large, the prior is informative. We return to this point in Section 18.9.)

Consider, then, the conventional normal-gamma prior over (γ, σ_ε²), where the conditional (on σ_ε²) prior normal density for the slope parameters has mean γ⁰ and covariance matrix σ_ε²A, where the (n + K) × (n + K) matrix, A, is yet to be specified. [See the discussion after (18-13).] The marginal posterior mean and variance for γ for this set of assumptions are given in (18-14) and (18-15). We reach a point that presents two rather serious dilemmas for the researcher. The posterior was simple with our uniform, noninformative prior. Now, it is necessary actually to specify A, which is potentially large. (In one of our main applications in this text, we are analyzing models with n = 7,293 constant terms and about K = 7 regressors.) It is hopelessly optimistic to expect to be able to specify all the variances and covariances in a matrix this large, unless we actually have the results of an earlier study (in which case we would also have a prior estimate of γ). A practical solution that is frequently chosen is to specify A to be a diagonal matrix with extremely large diagonal elements, thus emulating a uniform prior without having to commit to one.
The second practical issue then becomes dealing with the actual computation of the order (n + K) inverse matrix in (18-14) and (18-15). Under the chosen strategy of making A a multiple of the identity matrix, however, there are forms of partitioned inverse matrices that will allow solution of the actual computation.

Thus far, we have assumed that each α_i is generated by a different normal distribution: γ⁰ and A, however specified, have (potentially) different means and variances for the elements of α. The third specification we consider is one in which all α_i s in the model are assumed to be draws from the same population. To produce this specification, we use a hierarchical prior for the individual effects. The full model will be

y_it = α_i + x_it′β + ε_it,  ε_it ∼ N[0, σ_ε²],
p(β | σ_ε²) = N[β⁰, σ_ε²A],
p(σ_ε²) = Gamma(σ₀², m),
p(α_i) = N[μ_α, τ_α²],
p(μ_α) = N[a, Q],
p(τ_α²) = Gamma(τ₀², v).

We will not be able to derive the posterior density (joint or marginal) for the parameters of this model. However, it is possible to set up a Gibbs sampler that can be used to infer the characteristics of the posterior densities statistically. The sampler will be driven by conditional normal posteriors for the location parameters, [β | α, σ_ε², μ_α, τ_α²], [α_i | β, σ_ε², μ_α, τ_α²], and [μ_α | β, α, σ_ε², τ_α²], and conditional gamma densities for the scale (variance) parameters, [σ_ε² | α, β, μ_α, τ_α²] and [τ_α² | α, β, σ_ε², μ_α]. [The procedure is
developed at length by Koop (2003, pp. 152–153).] The assumption of a common distri- bution for the individual effects and an independent prior for β produces a Bayesian counterpart to the random effects model.
18.8 HIERARCHICAL BAYES ESTIMATION OF A RANDOM PARAMETERS MODEL
We now consider a Bayesian approach to estimation of the random parameters model.17 For an individual i, the conditional density for the dependent variable in period t is f(y_it | x_it, β_i), where β_i is the individual specific K × 1 parameter vector and x_it is individual specific data that enter the probability density.18 For the sequence of T observations, assuming conditional (on β_i) independence, person i's contribution to the likelihood for the sample is

f(y_i | X_i, β_i) = ∏_{t=1}^T f(y_it | x_it, β_i),   (18-25)

where y_i = (y_i1, …, y_iT) and X_i = [x_i1, …, x_iT]. We will suppose that β_i is distributed normally with mean β and covariance matrix Σ. (This is the "hierarchical" aspect of the model.) The unconditional density would be the expected value over the possible values of β_i;

f(y_i | X_i, β, Σ) = ∫_{β_i} ∏_{t=1}^T f(y_it | x_it, β_i) φ_K[β_i | β, Σ] dβ_i,   (18-26)

where φ_K[β_i | β, Σ] denotes the K variate normal prior density for β_i given β and Σ. Maximum likelihood estimation of this model, which entails estimation of the "deep" parameters, β, Σ, then estimation of the individual specific parameters, β_i, is considered in Section 17.5. We now consider the Bayesian approach to estimation of the parameters of this model.

To approach this from a Bayesian viewpoint, we will assign noninformative prior densities to β and Σ. As is conventional, we assign a flat (noninformative) prior to β. The variance parameters are more involved. If it is assumed that the elements of β_i are conditionally independent, then each element of the (now) diagonal matrix Σ may be assigned the inverted gamma prior that we used in (18-13). A full matrix Σ is handled by assigning to Σ an inverted Wishart prior density with parameters scalar K and matrix K × I. [The Wishart density is a multivariate counterpart to the chi-squared
17 Note that there is occasional confusion as to what is meant by "random parameters" in a random parameters (RP) model. In the Bayesian framework we discuss in this chapter, the "randomness" of the random parameters in the model arises from the "uncertainty" of the analyst. As developed at several points in this book (and in the literature), the randomness of the parameters in the RP model is a characterization of the heterogeneity of parameters across individuals. Consider, for example, in the Bayesian framework of this section, in the RP model, each vector β_i is a random vector with a distribution (defined hierarchically). In the classical framework, each β_i represents a single draw from a parent population.

18 To avoid a layer of complication, we will embed the time-invariant effect z_i in x_it. A full treatment in the same fashion as the latent class model would be substantially more complicated in this setting (although it is quite straightforward in the maximum simulated likelihood approach discussed in Section 17.5.1).
distribution. Discussion may be found in Zellner (1971, pp. 389–394).] This produces the joint posterior density,

Λ(β_1, ..., β_n, β, Σ | all data) = ∏_{i=1}^{n} ∏_{t=1}^{T} f(y_it | x_it, β_i) φ_K[β_i | β, Σ] × p(β, Σ).   (18-27)

This gives the joint density of all the unknown parameters conditioned on the observed data. Our Bayesian estimators of the parameters will be the posterior means for these (n + 1)K + K(K + 1)/2 parameters. In principle, this requires integration of (18-27) with respect to the components. As one might guess at this point, that integration is hopelessly complex and not remotely feasible.

However, the techniques of Markov Chain Monte Carlo (MCMC) simulation estimation (the Gibbs sampler) and the Metropolis–Hastings algorithm enable us to sample from the (hopelessly complex) joint density Λ(β_1, ..., β_n, β, Σ | all data) in a remarkably simple fashion. Train (2001 and 2002, Chapter 12) describes how to use these results for this random parameters model.19 The usefulness of this result for our current problem is that it is, indeed, possible to partition the joint distribution, and we can easily sample from the conditional distributions. We begin by partitioning the parameters into γ = (β, Σ) and δ = (β_1, ..., β_n). Train proposes the following strategy: To obtain a draw from γ | δ, we will use the Gibbs sampler to obtain a draw from the distribution of (β | Σ, δ), then one from the distribution of (Σ | β, δ). We will lay out this first, then turn to sampling from δ | β, Σ.

Conditioned on δ and Σ, β has a K-variate normal distribution with mean β̄ = (1/n) ∑_{i=1}^{n} β_i and covariance matrix (1/n)Σ. To sample from this distribution we will first obtain the Cholesky factorization of (1/n)Σ = LL′, where L is a lower triangular matrix. [See Section A.6.11.] Let v be a vector of K draws from the standard normal distribution. Then β̄ + Lv has mean vector β̄ + L × 0 = β̄ and covariance matrix LIL′ = (1/n)Σ, which is exactly what we need.
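As a concrete illustration, the Cholesky-based draw for β | Σ, δ can be sketched in a few lines of NumPy. This is a hedged sketch rather than Greene's or Train's own code; the function name `draw_beta_bar` and the array layout (one row per individual) are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_beta_bar(beta_i, Sigma):
    """One Gibbs draw of beta | Sigma, delta: K-variate normal with mean
    (1/n) * sum_i beta_i and covariance (1/n) * Sigma, built from a
    Cholesky factor as described in the text."""
    n, K = beta_i.shape                 # one row per individual
    beta_bar = beta_i.mean(axis=0)      # conditional mean
    L = np.linalg.cholesky(Sigma / n)   # lower triangular, L @ L.T = Sigma/n
    v = rng.standard_normal(K)          # K standard normal draws
    return beta_bar + L @ v             # mean beta_bar, covariance Sigma/n
```

Because LL′ equals the required covariance matrix, the draw β̄ + Lv has exactly the conditional mean and covariance derived above.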
So, this shows how to sample a draw from the conditional distribution of β. To obtain a random draw from the distribution of Σ | β, δ, we will require a random draw from the inverted Wishart distribution. The marginal posterior distribution of Σ | β, δ is inverted Wishart with parameters scalar K + n and matrix W = (KI + nV̄), where V̄ = (1/n) ∑_{i=1}^{n} (β_i − β)(β_i − β)′. Train (2001) suggests the following strategy for sampling a matrix from this distribution: Let M be the lower triangular Cholesky factor of W⁻¹, so MM′ = W⁻¹. Obtain K + n draws of v_k = K standard normal variates. Then obtain

S = [M (∑_{k=1}^{K+n} v_k v_k′) M′]⁻¹.

Then S is a draw from the inverted Wishart distribution. [This is fairly straightforward, as it involves only random sampling from the standard normal distribution. For a diagonal Σ matrix, that is, uncorrelated parameters in β_i, it simplifies a bit further. A draw for the nonzero kth diagonal element can be obtained using (1 + nV̄_kk) / ∑_{r=1}^{K+n} v²_rk.]
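Train's inverted Wishart recipe translates directly into NumPy. Again, this is an illustrative sketch under our own naming assumptions (`draw_sigma`, one row of `beta_i` per individual), not code from the text.

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_sigma(beta_i, beta):
    """One draw of Sigma | beta, delta from the inverted Wishart with
    degrees of freedom K + n and scale W = K*I + n*Vbar, using the
    Cholesky-based recipe described in the text."""
    n, K = beta_i.shape
    d = beta_i - beta
    Vbar = d.T @ d / n                        # (1/n) sum (b_i - beta)(b_i - beta)'
    W = K * np.eye(K) + n * Vbar
    M = np.linalg.cholesky(np.linalg.inv(W))  # M @ M.T = W^{-1}
    v = rng.standard_normal((K + n, K))       # K + n vectors of K normals
    A = M @ (v.T @ v) @ M.T                   # a Wishart(K + n, W^{-1}) draw
    return np.linalg.inv(A)                   # its inverse: inverted Wishart
```

As the text notes, the only random inputs are standard normal draws; the rest is matrix algebra.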
19Train describes use of this method for "mixed (random parameters) multinomial logit" models. By writing the densities in generic form, we have extended his result to any general setting that involves a parameter vector in the fashion described above. The classical version of this appears in Section 17.5.1 for the binomial probit model and in Section 23.11.6 for the mixed logit model.
The difficult step is sampling β_i. For this step, we use the Metropolis–Hastings (M–H) algorithm suggested by Chib and Greenberg (1995, 1996) and Gelman et al. (2004). The procedure involves the following steps:

1. Given β and Σ and "tuning constant" τ (to be described below), compute d = τLv, where L is the Cholesky factorization of Σ and v is a vector of K independent standard normal draws.
2. Create a trial value β_i1 = β_i0 + d, where β_i0 is the previous value.
3. The posterior distribution for β_i is the likelihood that appears in (18-26) times the joint normal prior density, φ_K[β_i | β, Σ]. Evaluate this posterior density at the trial value β_i1 and the previous value β_i0. Let

R_10 = f(y_i | X_i, β_i1) φ_K(β_i1 | β, Σ) / [f(y_i | X_i, β_i0) φ_K(β_i0 | β, Σ)].

4. Draw one observation, u, from the standard uniform distribution, U[0, 1].
5. If u < R_10, then accept the trial (new) draw. Otherwise, reuse the old one.

This M–H iteration converges to a sequence of draws from the desired density. Overall, then, the algorithm uses the Gibbs sampler and the Metropolis–Hastings algorithm to produce the sequence of draws for all the parameters in the model. The sequence is repeated a large number of times to produce each draw from the joint posterior distribution. The entire sequence must then be repeated N times to produce the sample of N draws, which can then be analyzed, for example, by computing the posterior mean.

Some practical details remain. The tuning constant, τ, is used to control the iteration. A smaller τ increases the acceptance rate. But at the same time, a smaller τ makes new draws look more like old draws, so this slows down the process. Gelman et al. (2004) suggest τ = 0.4 for K = 1 and smaller values down to about 0.23 for higher dimensions, as will be typical. Each multivariate draw takes many runs of the MCMC sampler. The process must be started somewhere, though it does not matter much where.
Nonetheless, a "burn-in" period is required to eliminate the influence of the starting value. Typical applications use several draws for this burn-in period for each run of the sampler. How many sample observations are needed for accurate estimation is not certain, though several hundred would be a minimum. This means that there is a huge amount of computation done by this estimator. However, the computations are fairly simple. The only complicated step is computation of the acceptance criterion at step 3 of the M–H iteration. Depending on the model, this may, like the rest of the calculations, be quite simple.
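The five M–H steps above can be sketched as a single update function. We work on the log scale to avoid numerical underflow (a standard implementation detail, not part of the text's description); the names `mh_step` and `log_lik_i` are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def mh_step(beta_i0, log_lik_i, beta, Sigma, tau=0.3):
    """One Metropolis-Hastings update of an individual's beta_i following
    steps 1-5: random-walk proposal d = tau*L*v, acceptance ratio R10 equal
    to the posterior (likelihood times normal prior) at the trial value
    over the posterior at the previous value."""
    K = len(beta_i0)
    L = np.linalg.cholesky(Sigma)
    trial = beta_i0 + tau * (L @ rng.standard_normal(K))   # steps 1-2

    Sinv = np.linalg.inv(Sigma)
    def log_post(b):                     # log likelihood + log normal prior;
        d = b - beta                     # normalizing constants cancel in R10
        return log_lik_i(b) - 0.5 * d @ Sinv @ d

    log_R10 = log_post(trial) - log_post(beta_i0)          # step 3
    if np.log(rng.uniform()) < log_R10:                    # steps 4-5
        return trial                     # accept the trial draw
    return beta_i0                       # otherwise reuse the old one
```

Each individual's β_i would be updated this way once per pass of the overall Gibbs cycle, with the acceptance rate monitored to tune τ.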
18.9 SUMMARY AND CONCLUSIONS
This chapter has introduced the major elements of the Bayesian approach to estimation and inference. The contrast between Bayesian and classical, or frequentist, approaches to the analysis has been the subject of a decades-long dialogue among practitioners and philosophers. As the frequency of applications of Bayesian methods has grown dramatically in the modern literature, however, the approach to the body of techniques has typically become more pragmatic. The Gibbs sampler and related techniques including the Metropolis–Hastings algorithm have enabled some remarkable simplifications of heretofore intractable problems. For example, recent developments in commercial
software have produced a wide choice of "mixed" estimators which are various implementations of the maximum likelihood procedures and hierarchical Bayes procedures (such as the Sawtooth and MLWin programs). Unless one is dealing with a small sample, the choice between these can be based on convenience. There is little methodological difference. This returns us to the practical point noted earlier. The choice between the Bayesian approach and the sampling theory method in this application would not be based on a fundamental methodological criterion, but on purely practical considerations—the end result is the same.

This chapter concludes our survey of estimation and inference methods in econometrics. We will now turn to two major areas of applications: time series and (broadly) macroeconometrics, and microeconometrics, which is primarily oriented to cross section and panel data applications.
Key Terms and Concepts
• Bayes factor
• Bayes theorem
• Bernstein–von Mises theorem
• Burn in
• Central limit theorem
• Conjugate prior
• Data augmentation
• Gibbs sampler
• Hierarchical Bayes
• Hierarchical prior
• Highest posterior density (HPD)
• Improper prior
• Informative prior
• Inverted gamma distribution
• Inverted Wishart
• Joint posterior
• Likelihood
• Loss function
• Markov Chain Monte Carlo (MCMC)
• Metropolis–Hastings algorithm
• Multivariate t distribution
• Noninformative prior
• Normal-gamma prior
• Posterior density
• Posterior mean
• Posterior odds
• Precision matrix
• Predictive density
• Prior beliefs
• Prior density
• Prior distribution
• Prior odds ratio
• Prior probabilities
• Sampling theory
• Uniform prior
• Uniform-inverse gamma prior
Exercise
1. Suppose the distribution of y_i | λ is Poisson,

f(y_i | λ) = exp(−λ)λ^{y_i} / y_i! = exp(−λ)λ^{y_i} / Γ(y_i + 1),   y_i = 0, 1, ..., λ > 0.
We will obtain a sample of n observations, y_1, ..., y_n. Suppose our prior for λ is the inverted gamma, which will imply

p(λ) ∝ 1/λ.
a. Construct the likelihood function, p(y_1, ..., y_n | λ).
b. Construct the posterior density

p(λ | y_1, ..., y_n) = p(y_1, ..., y_n | λ) p(λ) / ∫_0^∞ p(y_1, ..., y_n | λ) p(λ) dλ.
c. Prove that the Bayesian estimator of λ is the posterior mean, E[λ | y_1, ..., y_n] = ȳ.
d. Prove that the posterior variance is Var[λ | y_1, ..., y_n] = ȳ/n.
(Hint: You will make heavy use of gamma integrals in solving this problem. Also, you will find it convenient to use ∑_i y_i = nȳ.)
Application
1. Consider a model for the mix of male and female children in families. Let K_i denote the family size (number of children), K_i = 1, .... Let F_i denote the number of female children, F_i = 0, ..., K_i. Suppose the density for the number of female children in a family with K_i children is binomial with constant success probability θ:

p(F_i | K_i, θ) = \binom{K_i}{F_i} θ^{F_i} (1 − θ)^{K_i − F_i}.

We are interested in analyzing the "probability," θ. Suppose the (conjugate) prior over θ is a beta distribution with parameters a and b:

p(θ) = [Γ(a + b) / (Γ(a)Γ(b))] θ^{a−1} (1 − θ)^{b−1}.

Your sample of 25 observations is given here:
Ki 2 1 1 5 5 4 4 5 1 2 4 4 2 4 3 2 3 2 3 5 3 2 5 4 1
Fi 1 1 1 3 2 3 2 4 0 2 3 1 1 3 2 1 3 1 2 4 2 1 1 4 1
a. Compute the classical maximum likelihood estimate of θ.
b. Form the posterior density for θ given (K_i, F_i), i = 1, ..., 25, conditioned on a and b.
c. Using your sample of data, compute the posterior mean assuming a = b = 1.
d. Using your sample of data, compute the posterior mean assuming a = b = 2.
e. Using your sample of data, compute the posterior mean assuming a = 1 and b = 2.
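Because the beta prior is conjugate to the binomial, the posterior is beta(a + ∑F_i, b + ∑(K_i − F_i)), so parts a and c–e reduce to simple ratios. A minimal sketch (helper names are our own; exact arithmetic via `fractions`):

```python
from fractions import Fraction

# The 25 observations from the application
K = [2, 1, 1, 5, 5, 4, 4, 5, 1, 2, 4, 4, 2, 4, 3, 2, 3, 2, 3, 5, 3, 2, 5, 4, 1]
F = [1, 1, 1, 3, 2, 3, 2, 4, 0, 2, 3, 1, 1, 3, 2, 1, 3, 1, 2, 4, 2, 1, 1, 4, 1]

def theta_mle(K, F):
    """Part a: the ML estimate is total females over total children."""
    return Fraction(sum(F), sum(K))

def posterior_mean(K, F, a, b):
    """Parts c-e: with the conjugate beta(a, b) prior, the posterior is
    beta(a + sum F, b + sum (K - F)), whose mean is
    (a + sum F) / (a + b + sum K)."""
    return Fraction(a + sum(F), a + b + sum(K))
```

With ∑F_i = 49 and ∑K_i = 77, the MLE is 49/77 ≈ 0.636, and the posterior mean under a = b = 1 is 50/79 ≈ 0.633.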
19 SERIAL CORRELATION
19.1 INTRODUCTION
Time-series data often display autocorrelation, or serial correlation of the disturbances across periods. Consider, for example, the plot of the least squares residuals in the following example.
Example 19.1 Money Demand Equation Appendix Table F5.1 contains quarterly data from 1950.1 to 2000.4 on the U.S. money stock (M1), output (real GDP), and the price level (CPI U). Consider an (extremely) simple model of money demand,1
ln M1t = β1 + β2 ln GDPt + β3 ln CPIt + εt .
A plot of the least squares residuals is shown in Figure 19.1. The pattern in the residuals suggests that knowledge of the sign of a residual in one period is a good indicator of the sign of the residual in the next period. This knowledge suggests that the effect of a given disturbance is carried, at least in part, across periods. This sort of "memory" in the disturbances creates the long, slow swings from positive values to negative ones that are evident in Figure 19.1. One might argue that this pattern is the result of an obviously naive model, but that is one of the important points in this discussion. Patterns such as this usually do not arise spontaneously; to a large extent, they are, indeed, a result of an incomplete or flawed model specification.
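A minimal simulation makes the point: omit a serially correlated regressor and the OLS residuals inherit its autocorrelation. Everything here (the AR(1) coefficient 0.9, the variable names) is an illustrative assumption, not the money demand data.

```python
import numpy as np

rng = np.random.default_rng(3)

def lag1_corr(e):
    """First-order sample autocorrelation of a residual series."""
    return np.corrcoef(e[1:], e[:-1])[0, 1]

T = 2000
x = rng.standard_normal(T)
w = np.zeros(T)                  # relevant factor, itself AR(1) with rho = 0.9
for t in range(1, T):
    w[t] = 0.9 * w[t - 1] + rng.standard_normal()
y = 1.0 + 0.5 * x + w + 0.5 * rng.standard_normal(T)

X = np.column_stack([np.ones(T), x])        # regression that omits w
b, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS of y on a constant and x
e = y - X @ b                               # residuals absorb the omitted w
```

The residuals `e` show strong positive lag-one autocorrelation (around 0.85 with these parameter values), while residuals from a regression that also includes `w` are essentially nonautocorrelated.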
One explanation for autocorrelation is that relevant factors omitted from the time- series regression, like those included, are correlated across periods. This fact may be due to serial correlation in factors that should be in the regression model. It is easy to see why this situation would arise. Example 19.2 shows an obvious case.
Example 19.2 Autocorrelation Induced by Misspecification of the Model In Examples 2.3 and 6.7, we examined yearly time-series data on the U.S. gasoline market from 1953 to 2004. The evidence in the examples was convincing that a regression model of variation in ln G/Pop should include, at a minimum, a constant, ln PG and ln income/Pop. Other price variables and a time trend also provide significant explanatory power, but these two are a bare minimum. Moreover, we also found on the basis of a Chow test of structural change that apparently this market changed structurally after 1974. Figure 19.2 displays plots of four sets of least squares residuals. Parts (a) through (c) show clearly that as the specification of the regression is expanded, the autocorrelation in the “residuals” diminishes. Part (c) shows the effect of forcing the coefficients in the equation to be the same both before and after the structural shift. In part (d), the residuals in the two subperiods 1953 to 1974 and 1975 to 2004 are produced by separate unrestricted regressions. This latter set of residuals is almost nonautocorrelated. (Note also that the range of variation of the residuals falls as
1Because this chapter deals exclusively with time-series data, we shall use the index t for observations and T for the sample size throughout.
FIGURE 19.1 Autocorrelated Least Squares Residuals.
the model is improved, i.e., as its fit improves.) The full equation is
ln(G_t / Pop_t) = β_1 + β_2 ln PG_t + β_3 ln(I_t / Pop_t) + β_4 ln PNC_t + β_5 ln PUC_t + β_6 ln PPT_t + β_7 ln PN_t + β_8 ln PD_t + β_9 ln PS_t + β_10 t + ε_t.

Finally, we consider an example in which serial correlation is an anticipated part of the model.
Example 19.3 Negative Autocorrelation in the Phillips Curve The Phillips curve [Phillips (1957)] has been one of the most intensively studied relationships in the macroeconomics literature. As originally proposed, the model specifies a negative relationship between wage inflation and unemployment in the United Kingdom over a period of 100 years. Recent research has documented a similar relationship between unemployment and price inflation. It is difficult to justify the model when cast in simple levels; labor market theories of the relationship rely on an uncomfortable proposition that markets persistently fall victim to money illusion, even when the inflation can be anticipated. Current research [e.g., Staiger et al. (1996)] has reformulated a short run (disequilibrium) "expectations augmented Phillips curve" in terms of unexpected inflation and unemployment that deviates from a long run equilibrium or "natural rate." The expectations-augmented Phillips curve can be written as

p_t − E[p_t | Ψ_{t−1}] = β[u_t − u*] + ε_t,

where p_t is the rate of inflation in year t, E[p_t | Ψ_{t−1}] is the forecast of p_t made in period t − 1 based on information available at time t − 1, Ψ_{t−1}, u_t is the unemployment rate, and u* is the natural, or equilibrium, rate. (Whether u* can be treated as an unchanging parameter, as we are about to do, is controversial.) By construction, [u_t − u*] is disequilibrium, or cyclical, unemployment. In this formulation, ε_t would be the supply shock (i.e., the stimulus that produces the disequilibrium situation). To complete the model, we require a model for the expected inflation. We will revisit this in some detail in Example 20.2. For the present, we'll
FIGURE 19.2 Unstandardized Residuals, Panels (a)–(d) (Bars mark mean res. and +/− 2s(e)).
assume that economic agents are rank empiricists. The forecast of next year’s inflation is simply this year’s value. This produces the estimating equation
pt − pt−1 = β1 + β2ut + εt
where β_2 = β and β_1 = −βu*. Note that there is an implied estimate of the natural rate of unemployment embedded in the equation. After estimation, u* can be estimated by −b_1/b_2. The equation was estimated with the 1950.1–2000.4 data in Appendix Table F5.1 that were used in Example 19.1 (minus two quarters for the change in the rate of inflation). Least squares estimates (with standard errors in parentheses) are as follows:
p_t − p_{t−1} = 0.49189 − 0.090136 u_t + e_t
               (0.7405)   (0.1257)
R² = 0.002561, T = 201.

The implied estimate of the natural rate of unemployment is 5.46 percent, which is in line with other recent estimates. The estimated asymptotic covariance of b_1 and b_2 is −0.08973. Using the delta method, we obtain a standard error of 2.2062 for this estimate, so a confidence interval for the natural rate is 5.46 percent ± 1.96 (2.21 percent) = (1.13 percent, 9.79 percent) (which seems fairly wide, but, again, whether it is reasonable to treat this as a parameter is at least questionable). The regression of the least squares residuals on their past values gives a slope of −0.4263 with a highly significant t ratio of −6.725. We thus conclude that the residuals (and, apparently, the disturbances) in this model are highly negatively autocorrelated. This is consistent with the striking pattern in Figure 19.3.
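The delta method calculation for u* = −b_1/b_2 can be reproduced directly from the reported estimates. The gradient of g(b) = −b_1/b_2 is (−1/b_2, b_1/b_2²), so Var[g(b)] ≈ g′Vg; the helper name below is our own.

```python
import math

def delta_method_se(b1, b2, v11, v22, v12):
    """Delta-method standard error of g(b) = -b1/b2, the implied natural
    rate: the gradient is g = (-1/b2, b1/b2**2) and Var[g(b)] ~ g'Vg."""
    g1 = -1.0 / b2
    g2 = b1 / b2 ** 2
    var = g1 * g1 * v11 + 2.0 * g1 * g2 * v12 + g2 * g2 * v22
    return math.sqrt(var)

# Reported estimates: coefficients, standard errors, and their covariance
b1, b2 = 0.49189, -0.090136
u_star = -b1 / b2                                             # about 5.46
se = delta_method_se(b1, b2, 0.7405**2, 0.1257**2, -0.08973)  # about 2.206
```

Running this reproduces the values in the example: u* ≈ 5.46 percent with a delta-method standard error of about 2.206.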
FIGURE 19.3 Negatively Autocorrelated Residuals (Phillips Curve Deviations from Expected Inflation).
The problems for estimation and inference caused by autocorrelation are similar to (although, unfortunately, more involved than) those caused by heteroscedasticity. As before, least squares is inefficient, and inference based on the least squares estimates is adversely affected. Depending on the underlying process, however, GLS and FGLS estimators can be devised that circumvent these problems. There is one qualitative difference to be noted. In Chapter 8, we examined models in which the generalized regression model can be viewed as an extension of the regression model to the conditional second moment of the dependent variable. In the case of autocorrelation, the phenomenon arises in almost all cases from a misspecification of the model. Views differ on how one should react to this failure of the classical assumptions, from a pragmatic one that treats it as another "problem" in the data to an orthodox methodological view that it represents a major specification issue—see, for example, "A Simple Message to Autocorrelation Correctors: Don't" [Mizon (1995)]. We should emphasize that the models we shall examine here are quite far removed from the classical regression. The exact or small-sample properties of the estimators are rarely known, and only their asymptotic properties have been derived.
19.2 THE ANALYSIS OF TIME-SERIES DATA
The treatment in this chapter will be the first structured analysis of time-series data in the text. (We had a brief encounter in Section 4.9.6 where we established some conditions under which moments of time-series data would converge.) Time-series analysis requires some revision of the interpretation of both data generation and sampling that we have maintained thus far. Greene-50558 book June 23, 2007 11:49
A time-series model will typically describe the path of a variable y_t in terms of contemporaneous (and perhaps lagged) factors x_t, disturbances (innovations), ε_t, and its own past, y_{t−1}, .... For example,

y_t = β_1 + β_2 x_t + β_3 y_{t−1} + ε_t.

The time series is a single occurrence of a random event. For example, the quarterly series on real output in the United States from 1950 to 2000 that we examined in Example 19.1 is a single realization of a process, GDP_t. The entire history over this period constitutes a realization of the process. At least in economics, the process could not be repeated. There is no counterpart to repeated sampling in a cross section or replication of an experiment involving a time-series process in physics or engineering. Nonetheless, were circumstances different at the end of World War II, the observed history could have been different. In principle, a completely different realization of the entire series might have occurred. The sequence of observations, {y_t}_{t=−∞}^{t=∞}, is a time-series process, which is characterized by its time ordering and its systematic correlation between observations in the sequence. The signature characteristic of a time-series process is that empirically, the data generating mechanism produces exactly one realization of the sequence. Statistical results based on sampling characteristics concern not random sampling from a population, but from distributions of statistics constructed from sets of observations taken from this realization in a time window, t = 1, ..., T. Asymptotic distribution theory in this context concerns behavior of statistics constructed from an increasingly long window in this sequence.

The properties of y_t as a random variable in a cross section are straightforward and are conveniently summarized in a statement about its mean and variance or the probability distribution generating y_t. The statement is less obvious here. It is common to assume that innovations are generated independently from one period to the next, with the familiar assumptions
E[ε_t] = 0,
Var[ε_t] = σ_ε², and
Cov[ε_t, ε_s] = 0 for t ≠ s.
In the current context, this distribution of εt is said to be covariance stationary or weakly stationary. Thus, although the substantive notion of “random sampling” must be extended for the time series εt , the mathematical results based on that notion apply here. It can be said, for example, that εt is generated by a time-series process whose mean and variance are not changing over time. As such, by the method we will discuss in this chapter, we could, at least in principle, obtain sample information and use it to characterize the distribution of εt . Could the same be said of yt ? There is an obvious difference between the series εt and yt ; observations on yt at different points in time are necessarily correlated. Suppose that the yt series is weakly stationary and that, for the moment, β2 = 0. Then we could say that
E[y_t] = β_1 + β_3 E[y_{t−1}] + E[ε_t] = β_1/(1 − β_3)

and

Var[y_t] = β_3² Var[y_{t−1}] + Var[ε_t],
or

γ_0 = β_3² γ_0 + σ_ε²,

so that

γ_0 = σ_ε² / (1 − β_3²).
Thus, γ_0, the variance of y_t, is a fixed characteristic of the process generating y_t. Note how the stationarity assumption, which apparently includes |β_3| < 1, has been used. The assumption that |β_3| < 1 is needed to ensure a finite and positive variance.2 Finally, the same results can be obtained for nonzero β_2 if it is further assumed that x_t is a weakly stationary series.3 Alternatively, consider simply repeated substitution of lagged values into the expression for y_t:
yt = β1 + β3(β1 + β3 yt−2 + εt−1) + εt (19-1)
and so on. We see that, in fact, the current y_t is an accumulation of the entire history of the innovations, ε_t. So if we wish to characterize the distribution of y_t, then we might do so in terms of sums of random variables. By continuing to substitute for y_{t−2}, then y_{t−3}, ... in (19-1), we obtain an explicit representation of this idea,

y_t = ∑_{i=0}^{∞} β_3^i (β_1 + ε_{t−i}).

Do sums that reach back into the infinite past make any sense? We might view the process as having begun generating data at some remote, effectively "infinite" past. As long as distant observations become progressively less important, the extension to an infinite past is merely a mathematical convenience. The diminishing importance of past observations is implied by |β_3| < 1. Notice that, not coincidentally, this requirement is the same as that needed to solve for γ_0 in the preceding paragraphs. A second possibility is to assume that the observation of this time series begins at some time 0 [with (x_0, ε_0) called the initial conditions], by which time the underlying process has reached a state such that the mean and variance of y_t are not (or are no longer) changing over time. The mathematics are slightly different, but we are led to the same characterization of the random process generating y_t. In fact, the same weak stationarity assumption ensures both of them.

Except in very special cases, we would expect all the elements in the T component random vector (y_1, ..., y_T) to be correlated. In this instance, said correlation is called "autocorrelation." As such, the results pertaining to estimation with independent or uncorrelated observations that we used in the previous chapters are no longer usable. In point of fact, we have a sample of but one observation on the multivariate random variable [y_t, t = 1, ..., T].
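The stationary moments derived above, E[y_t] = β_1/(1 − β_3) and γ_0 = σ_ε²/(1 − β_3²), are easy to check by simulation. A hedged sketch with illustrative parameter values of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(4)

def simulate_y(beta1, beta3, sigma, T, burn=500):
    """Simulate y_t = beta1 + beta3*y_{t-1} + eps_t with |beta3| < 1,
    discarding a burn-in so the retained series is (approximately)
    a draw from the stationary process."""
    y, out = 0.0, np.empty(T)
    for t in range(burn + T):
        y = beta1 + beta3 * y + sigma * rng.standard_normal()
        if t >= burn:
            out[t - burn] = y
    return out

# Stationary moments implied by the derivation in the text
beta1, beta3, sigma = 1.0, 0.5, 1.0
mean_theory = beta1 / (1 - beta3)          # beta1/(1 - beta3) = 2
var_theory = sigma**2 / (1 - beta3**2)     # gamma_0 = sigma^2/(1 - beta3^2)
```

With a long simulated window, the sample mean and variance settle near these theoretical values, illustrating the role of |β_3| < 1.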
There is a counterpart to the cross-sectional notion of parameter estimation, but only under assumptions (e.g., weak stationarity) that establish that parameters in the familiar sense even exist. Even with stationarity, it will emerge that for estimation and inference, none of our earlier finite sample results are usable. Consistency and asymptotic normality of estimators are somewhat more difficult to
2The current literature in macroeconometrics and time series analysis is dominated by analysis of cases in which β_3 = 1 (or counterparts in different models). We will return to this subject in Chapter 22.

3See Section 19.4.1 on the stationarity assumption.
establish in time-series settings because results that require independent observations, such as the central limit theorems, are no longer usable. Nonetheless, counterparts to our earlier results have been established for most of the estimation problems we consider here and in Chapters 20 and 21.
19.3 DISTURBANCE PROCESSES
The preceding section has introduced a bit of the vocabulary and aspects of time-series specification. To obtain the theoretical results, we need to draw some conclusions about autocorrelation and add some details to that discussion.
19.3.1 CHARACTERISTICS OF DISTURBANCE PROCESSES

In the usual time-series setting, the disturbances are assumed to be homoscedastic but correlated across observations, so that

E[εε′ | X] = σ²Ω,

where σ²Ω is a full, positive definite matrix with a constant σ² = Var[ε_t | X] on the diagonal. As will be clear in the following discussion, we shall also assume that Ω_ts is a function of |t − s|, but not of t or s alone, which is a stationarity assumption. (See the preceding section.) It implies that the covariance between observations t and s is a function only of |t − s|, the distance apart in time of the observations. Because σ² is not restricted, we normalize Ω_tt = 1. We define the autocovariances:

Cov[ε_t, ε_{t−s} | X] = Cov[ε_{t+s}, ε_t | X] = σ²Ω_{t,t−s} = γ_s = γ_{−s}.

Note that σ²Ω_tt = γ_0. The correlation between ε_t and ε_{t−s} is their autocorrelation,

Corr[ε_t, ε_{t−s} | X] = Cov[ε_t, ε_{t−s} | X] / √(Var[ε_t | X] Var[ε_{t−s} | X]) = γ_s/γ_0 = ρ_s = ρ_{−s}.

We can then write

E[εε′ | X] = Γ = γ_0 R,

where Γ is an autocovariance matrix and R is an autocorrelation matrix—the ts element is an autocorrelation coefficient

ρ_ts = γ_{|t−s|}/γ_0.

(Note that the matrix Γ = γ_0 R is the same as σ²Ω. The name change conforms to standard usage in the literature.) We will usually use the abbreviation ρ_s to denote the autocorrelation between observations s periods apart.

Different types of processes imply different patterns in R. For example, the most frequently analyzed process is a first-order autoregression or AR(1) process,
εt = ρεt−1 + ut ,
where u_t is a stationary, nonautocorrelated (white noise) process and ρ is a parameter. We will verify later that for this process, ρ_s = ρ^s. Higher-order autoregressive processes of the form
εt = θ1εt−1 + θ2εt−2 +···+θpεt−p + ut Greene-50558 book June 23, 2007 11:49
imply more involved patterns, including, for some values of the parameters, cyclical behavior of the autocorrelations.4 Stationary autoregressions are structured so that the influence of a given disturbance fades as it recedes into the more distant past but vanishes only asymptotically. For example, for the AR(1), Cov[εt ,εt−s ] is never zero, but it does become negligible if |ρ| is less than 1. Moving-average processes, conversely, have a short memory. For the MA(1) process,
ε_t = u_t − λu_{t−1},

the memory in the process is only one period: γ_0 = σ_u²(1 + λ²), γ_1 = −λσ_u², but γ_s = 0 if s > 1.
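The contrast between the long memory of the AR(1) and the one-period memory of the MA(1) shows up immediately in sample autocorrelations. An illustrative simulation (the parameter values 0.7 and 0.6 are our own choices):

```python
import numpy as np

rng = np.random.default_rng(5)

def autocorr(e, s):
    """Sample autocorrelation of the series e at lag s."""
    e = e - e.mean()
    return (e[s:] @ e[:-s]) / (e @ e)

T = 400_000
u = rng.standard_normal(T)                 # white noise

lam = 0.6                                  # MA(1): one-period memory
ma = u[1:] - lam * u[:-1]                  # rho_1 = -lam/(1 + lam^2); rho_s = 0, s > 1

rho = 0.7                                  # AR(1): rho_s = rho^s fades geometrically
ar = np.empty(T)
ar[0] = u[0]
for t in range(1, T):
    ar[t] = rho * ar[t - 1] + u[t]
```

The MA(1) autocorrelations vanish beyond lag one, while the AR(1) autocorrelations decline geometrically but never reach zero at any finite lag.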
19.3.2 AR(1) DISTURBANCES

Time-series processes such as the ones listed here can be characterized by their order, the values of their parameters, and the behavior of their autocorrelations.5 We shall consider various forms at different points. The received empirical literature is overwhelmingly dominated by the AR(1) model, which is partly a matter of convenience. Processes more involved than this model are usually extremely difficult to analyze. There is, however, a more practical reason. It is very optimistic to expect to know precisely the correct form of the appropriate model for the disturbance in any given situation. The first-order autoregression has withstood the test of time and experimentation as a reasonable model for underlying processes that probably, in truth, are impenetrably complex. AR(1) works as a first pass—higher order models are often constructed as a refinement—as in the following example. The first-order autoregressive disturbance, or AR(1) process, is represented in the autoregressive form as
εt = ρεt−1 + ut , (19-2) where
E[u_t | X] = 0,
E[u_t² | X] = σ_u², and
Cov[u_t, u_s | X] = 0 if t ≠ s.
Because ut is white noise, the conditional moments equal the unconditional moments. Thus E[εt | X] = E[εt ] and so on. By repeated substitution, we have
2 εt = ut + ρut−1 + ρ ut−2 +···. (19-3)
From the preceding moving-average form, it is evident that each disturbance εt embodies the entire past history of the u’s, with the most recent observations receiving greater weight than those in the distant past. Depending on the sign of ρ, the series will exhibit
4This model is considered in more detail in Chapter 21.

5See Box and Jenkins (1984) for an authoritative study.
clusters of positive and then negative observations or, if ρ is negative, regular oscillations of sign (as in Example 19.3).

Because the successive values of u_t are uncorrelated, the variance of ε_t is the variance of the right-hand side of (19-3):

Var[ε_t] = σ_u² + ρ²σ_u² + ρ⁴σ_u² + ···.   (19-4)

To proceed, a restriction must be placed on ρ,

|ρ| < 1,   (19-5)

because otherwise, the right-hand side of (19-4) will become infinite. This result is the stationarity assumption discussed earlier. With (19-5), which implies that lim_{s→∞} ρ^s = 0, E[ε_t] = 0 and

Var[ε_t] = σ_u² / (1 − ρ²) = σ_ε².   (19-6)

With the stationarity assumption, there is an easier way to obtain the variance:

Var[ε_t] = ρ² Var[ε_{t−1}] + σ_u²,

because Cov[u_t, ε_s] = 0 if t > s. With stationarity, Var[ε_{t−1}] = Var[ε_t], which implies (19-6). Proceeding in the same fashion,

Cov[ε_t, ε_{t−1}] = E[ε_t ε_{t−1}] = E[ε_{t−1}(ρε_{t−1} + u_t)] = ρ Var[ε_{t−1}] = ρσ_u² / (1 − ρ²).   (19-7)

By repeated substitution in (19-2), we see that for any s,

ε_t = ρ^s ε_{t−s} + ∑_{i=0}^{s−1} ρ^i u_{t−i}

(e.g., ε_t = ρ³ε_{t−3} + ρ²u_{t−2} + ρu_{t−1} + u_t). Therefore, because ε_s is not correlated with any u_t for which t > s (i.e., any subsequent u_t), it follows that

Cov[ε_t, ε_{t−s}] = E[ε_t ε_{t−s}] = ρ^s σ_u² / (1 − ρ²).   (19-8)

Dividing by γ_0 = σ_u²/(1 − ρ²) provides the autocorrelations:

Corr[ε_t, ε_{t−s}] = ρ_s = ρ^s.   (19-9)

With the stationarity assumption, the autocorrelations fade over time. Depending on the sign of ρ, they will either be declining in geometric progression or alternating in sign if ρ is negative. Collecting terms, we have

σ²Ω = [σ_u² / (1 − ρ²)] ×
⎡ 1         ρ         ρ²        ρ³   ···  ρ^{T−1} ⎤
⎢ ρ         1         ρ         ρ²   ···  ρ^{T−2} ⎥
⎢ ρ²        ρ         1         ρ    ···  ρ^{T−3} ⎥
⎢ ⋮         ⋮         ⋮         ⋮    ···  ρ       ⎥
⎣ ρ^{T−1}   ρ^{T−2}   ρ^{T−3}   ···  ρ    1       ⎦ .   (19-10)
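Equation (19-10) can be generated mechanically: the (t, s) element of σ²Ω is σ_u² ρ^{|t−s|}/(1 − ρ²). A short sketch (the helper name is our own):

```python
import numpy as np

def ar1_covariance(rho, sigma_u, T):
    """Build sigma^2 * Omega of (19-10): element (t, s) equals
    sigma_u^2 * rho**|t - s| / (1 - rho**2)."""
    idx = np.arange(T)
    R = rho ** np.abs(idx[:, None] - idx[None, :])   # autocorrelation matrix
    return sigma_u ** 2 / (1.0 - rho ** 2) * R
```

The broadcast over |t − s| produces the Toeplitz pattern of (19-10) directly; a matrix like this is the building block for the GLS estimators discussed in the next section.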
CHAPTER 19 ✦ Serial Correlation 635
19.4 SOME ASYMPTOTIC RESULTS FOR ANALYZING TIME-SERIES DATA
Because Ω is not equal to I, the now-familiar complications will arise in establishing the properties of estimators of β, in particular of the least squares estimator. The finite-sample properties of the OLS and GLS estimators remain intact. Least squares will continue to be unbiased; the earlier general proof allows for autocorrelated disturbances. The Aitken theorem (Theorem 8.4) and the distributional results for normally distributed disturbances can still be established conditionally on X. (However, even these will be complicated when X contains lagged values of the dependent variable.) But finite-sample properties are of very limited usefulness in time-series contexts. Nearly all that can be said about estimators involving time-series data is based on their asymptotic properties.
As we saw in our analysis of heteroscedasticity, whether least squares is consistent or not depends on the matrices

Q_T = (1/T)(X′X)

and

Q*_T = (1/T)(X′ΩX).
In our earlier analyses, we were able to argue for convergence of Q_T to a positive definite matrix of constants, Q, by invoking laws of large numbers. But these theorems assume that the observations in the sums are independent, which, as suggested in Section 19.2, is surely not the case here. Thus, we require a different tool for this result. We can expand the matrix Q*_T as

Q*_T = (1/T) Σ_{t=1}^{T} Σ_{s=1}^{T} ρ_{ts} x_t x_s′,  (19-11)

where x_t′ and x_s′ are rows of X and ρ_{ts} is the autocorrelation between ε_t and ε_s. Sufficient conditions for this matrix to converge are that Q_T converge and that the correlations between disturbances die off reasonably rapidly as the observations become further apart in time. For example, if the disturbances follow the AR(1) process described earlier, then ρ_{ts} = ρ^{|t−s|} and if x_t is sufficiently well behaved, Q*_T will converge to a positive definite matrix Q* as T → ∞.
Asymptotic normality of the least squares and GLS estimators will depend on the behavior of sums such as

√T w̄_T = √T ((1/T) Σ_{t=1}^{T} x_t ε_t) = √T ((1/T) X′ε).

Asymptotic normality of least squares is difficult to establish for this general model. The central limit theorems we have relied on thus far do not extend to sums of dependent observations. The results of Amemiya (1985), Mann and Wald (1943), and Anderson (1971) do carry over to most of the familiar types of autocorrelated disturbances, including those that interest us here, so we shall ultimately conclude that ordinary least squares, GLS, and instrumental variables continue to be consistent and asymptotically normally distributed, and, in the case of OLS, inefficient. This section will provide a brief introduction to some of the underlying principles that are used to reach these conclusions.
19.4.1 CONVERGENCE OF MOMENTS—THE ERGODIC THEOREM

The discussion thus far has suggested (appropriately) that stationarity (or its absence) is an important characteristic of a process. The points at which we have encountered this notion concerned requirements that certain sums converge to finite values. In particular, for the AR(1) model, ε_t = ρε_{t−1} + u_t, for the variance of the process to be finite, we require |ρ| < 1, which is a sufficient condition. However, this result is only a byproduct. Stationarity (at least, the weak stationarity we have examined) is only a characteristic of the sequence of moments of a distribution.
DEFINITION 19.1 Strong Stationarity
A time-series process, {z_t}_{t=−∞}^{∞}, is strongly stationary, or “stationary,” if the joint probability distribution of any set of k observations in the sequence [z_t, z_{t+1}, ..., z_{t+k}] is the same regardless of the origin, t, in the time scale.
For example, in (19-2), if we add u_t ∼ N[0, σ_u²], then the resulting process {ε_t}_{t=−∞}^{∞} can easily be shown to be strongly stationary.
DEFINITION 19.2 Weak Stationarity
A time-series process, {z_t}_{t=−∞}^{∞}, is weakly stationary (or covariance stationary) if E[z_t] is finite and is the same for all t and if the covariance between any two observations (labeled their autocovariance), Cov[z_t, z_{t−k}], is a finite function only of model parameters and their distance apart in time, k, but not of the absolute location of either observation on the time scale.
Weak stationarity is obviously implied by strong stationarity, although it requires less because the distribution can, at least in principle, be changing on the time axis. The distinction is rarely necessary in applied work. In general, save for narrow theoretical examples, it will be difficult to come up with a process that is weakly but not strongly stationary. The reason for the distinction is that in much of our work, only weak stationarity is required, and, as always, when possible, econometricians will dispense with unnecessary assumptions.
As we will discover shortly, stationarity is a crucial characteristic at this point in the analysis. If we are going to proceed to parameter estimation in this context, we will also require another characteristic of a time series, ergodicity. There are various ways to delineate this characteristic, none of them particularly intuitive. We borrow one definition from Davidson and MacKinnon (1993, p. 132) which comes close:
DEFINITION 19.3 Ergodicity
A strongly stationary time-series process, {z_t}_{t=−∞}^{∞}, is ergodic if for any two bounded functions that map vectors in the a- and b-dimensional real vector spaces to real scalars, f: R^a → R¹ and g: R^b → R¹,

lim_{k→∞} |E[f(z_t, z_{t+1}, ..., z_{t+a}) g(z_{t+k}, z_{t+k+1}, ..., z_{t+k+b})]|
  = |E[f(z_t, z_{t+1}, ..., z_{t+a})]| |E[g(z_{t+k}, z_{t+k+1}, ..., z_{t+k+b})]|.
The definition states essentially that if events are separated far enough in time, then they are “asymptotically independent.” An implication is that in a time series, every observation will contain at least some unique information. Ergodicity is a crucial element of our theory of estimation. When a time series has this property (with stationarity), then we can consider estimation of parameters in a meaningful sense.6 The analysis relies heavily on the following theorem:
THEOREM 19.1 The Ergodic Theorem
If {z_t}_{t=−∞}^{∞} is a time-series process that is strongly stationary and ergodic and E[|z_t|] is a finite constant, and if z̄_T = (1/T) Σ_{t=1}^{T} z_t, then z̄_T → μ almost surely, where μ = E[z_t]. Note that the convergence is almost surely, not in probability (which is implied) or in mean square (which is also implied). [See White (2001, p. 44) and Davidson and MacKinnon (1993, p. 133).]
What we have in the Ergodic theorem is, for sums of dependent observations, a counterpart to the laws of large numbers that we have used at many points in the preceding chapters. Note, once again, the need for this extension is that to this point, our laws of large numbers have required sums of independent observations. But, in this context, by design, observations are distinctly not independent. For this result to be useful, we will require an extension.
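A small simulation illustrates the theorem: the time average of a dependent but stationary and ergodic AR(1) series settles on E[z_t] as T grows. A sketch with illustrative parameter values (not from the text):

```python
import numpy as np

# The Ergodic theorem at work: z_t is serially dependent, yet (1/T) sum z_t
# still converges to mu = E[z_t]. Names and values are illustrative.
rng = np.random.default_rng(1)
rho, mu = 0.9, 2.0

def ar1_time_average(T):
    z = np.empty(T)
    z[0] = mu
    u = rng.normal(0.0, 1.0, T)
    for t in range(1, T):
        z[t] = mu + rho * (z[t - 1] - mu) + u[t]
    return z.mean()

zbar_small = ar1_time_average(1_000)
zbar_large = ar1_time_average(100_000)   # should be much closer to mu
```

The dependence slows the convergence relative to an i.i.d. series (the long-run variance is inflated by the factor (1 + ρ)/(1 − ρ)), but it does not prevent it.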
THEOREM 19.2 Ergodicity of Functions
If {z_t}_{t=−∞}^{∞} is a time-series process that is strongly stationary and ergodic and if y_t = f(z_t) is a measurable function in the probability space that defines z_t, then y_t is also stationary and ergodic. Let {z_t}_{t=−∞}^{∞} define a K × 1 vector-valued stochastic process; each element of the vector is an ergodic and stationary series, and the characteristics of ergodicity and stationarity apply to the joint distribution of the elements of {z_t}_{t=−∞}^{∞}. Then, the Ergodic theorem applies to functions of {z_t}_{t=−∞}^{∞}. [See White (2001, pp. 44–45) for discussion.]
Theorem 19.2 produces the results we need to characterize the least squares (and other) estimators. In particular, our minimal assumptions about the data are
ASSUMPTION 19.1 Ergodic Data Series: In the regression model, y_t = x_t′β + ε_t, [x_t, ε_t]_{t=−∞}^{∞} is a jointly stationary and ergodic process.
6 Much of the analysis in later chapters will encounter nonstationary series, which are the focus of most of the current literature—tests for nonstationarity largely dominate the recent study in time-series analysis. Ergodicity is a much more subtle and difficult concept. For any process that we will consider, ergodicity will have to be a given, at least at this level. A classic reference on the subject is Doob (1953). Another authoritative treatise is Billingsley (1995). White (2001) provides a concise analysis of many of these concepts as used in econometrics, and some useful commentary.
By analyzing terms element by element we can use these results directly to assert that averages of w_t = x_t ε_t, Q_{tt} = x_t x_t′, and Q*_{tt} = ε_t² x_t x_t′ will converge to their population counterparts, 0, Q, and Q*.
19.4.2 CONVERGENCE TO NORMALITY—A CENTRAL LIMIT THEOREM

To form a distribution theory for least squares, GLS, ML, and GMM, we will need a counterpart to the central limit theorem. In particular, we need to establish a large-sample distribution theory for quantities of the form

√T ((1/T) Σ_{t=1}^{T} x_t ε_t) = √T w̄.

As noted earlier, we cannot invoke the familiar central limit theorems (Lindeberg–Levy, Lindeberg–Feller, Liapounov) because the observations in the sum are not independent. But, with the assumptions already made, we do have an alternative result. Some needed preliminaries are as follows:
DEFINITION 19.4 Martingale Sequence A vector sequence zt is a martingale sequence if E [zt | zt−1, zt−2,...] = zt−1.
An important example of a martingale sequence is the random walk,
zt = zt−1 + ut ,
where Cov[u_t, u_s] = 0 for all t ≠ s. Then
E [zt | zt−1, zt−2,...] = E [zt−1 | zt−1, zt−2,...] + E [ut | zt−1, zt−2,...] = zt−1 + 0 = zt−1.
DEFINITION 19.5 Martingale Difference Sequence A vector sequence zt is a martingale difference sequence if E [zt | zt−1, zt−2,...] = 0.
With Definition 19.5, we have the following broadly encompassing result:
THEOREM 19.3 Martingale Difference Central Limit Theorem
If z_t is a vector-valued stationary and ergodic martingale difference sequence, with E[z_t z_t′] = Σ, where Σ is a finite positive definite matrix, and if z̄_T = (1/T) Σ_{t=1}^{T} z_t, then √T z̄_T →d N[0, Σ]. [For discussion, see Davidson and MacKinnon (1993, Sections 4.7 and 4.8).]7
7 For convenience, we are bypassing a step in this discussion—establishing multivariate normality requires that the result first be established for the marginal normal distribution of each component, then that every linear combination of the variables also be normally distributed (see Theorems D.17 and D.18A). Our interest at this point is merely to collect the useful end results. Interested users may find the detailed discussions of the many subtleties and narrower points in White (2001) and Davidson and MacKinnon (1993, Chapter 4).
Theorem 19.3 is a generalization of the Lindeberg–Levy central limit theorem. It is not yet broad enough to cover cases of autocorrelation, but it does go beyond Lindeberg–Levy, for example, in extending to the GARCH model of Section 19.13.3. [Forms of the theorem that surpass Lindeberg–Feller (D.19) and Liapounov (Theorem D.20) by allowing for different variances at each time, t, appear in Ruud (2000, p. 479) and White (2001, p. 133). These variants extend beyond our requirements in this treatment.] But, looking ahead, this result encompasses what will be a very important application. Suppose in the classical linear regression model, {x_t}_{t=−∞}^{∞} is a stationary and ergodic multivariate stochastic process and {ε_t}_{t=−∞}^{∞} is an i.i.d. process—that is, not autocorrelated and not heteroscedastic. Then, this is the most general case of the classical model that still maintains the assumptions about ε_t that we made in Chapter 2. In this case, the process {w_t}_{t=−∞}^{∞} = {x_t ε_t}_{t=−∞}^{∞} is a martingale difference sequence, so that with sufficient assumptions on the moments of x_t we could use this result to establish consistency and asymptotic normality of the least squares estimator. [See, e.g., Hamilton (1994, pp. 208–212).]
We now consider a central limit theorem that is broad enough to include the case that interested us at the outset, stochastically dependent observations on x_t and autocorrelation in ε_t.8 Suppose as before that {z_t}_{t=−∞}^{∞} is a stationary and ergodic stochastic process.9 We consider √T z̄_T. The following conditions are assumed:
1. Asymptotic uncorrelatedness: E[z_t | z_{t−k}, z_{t−k−1}, ...] converges in mean square to zero as k → ∞. Note that this is similar to the condition for ergodicity. White (2001) demonstrates that a (nonobvious) implication of this assumption is E[z_t] = 0.

2. Summability of autocovariances: With dependent observations,

lim_{T→∞} Var[√T z̄_T] = Σ_{t=0}^{∞} Σ_{s=0}^{∞} Cov[z_t, z_s′] = Σ_{k=−∞}^{∞} Γ_k = Γ*.

To begin, we will need to assume that this matrix is finite, a condition called summability. Note this is the condition needed for convergence of Q*_T in (19-11). If the sum is to be finite, then the k = 0 term must be finite, which gives us a necessary condition

E[z_t z_t′] = Γ_0, a finite matrix.

3. Asymptotic negligibility of innovations: Let
rtk = E [zt | zt−k, zt−k−1,...] − E [zt | zt−k−1, zt−k−2,...].
An observation z_t may be viewed as the accumulated information that has entered the process since it began up to time t. Thus, it can be shown that

z_t = Σ_{s=0}^{∞} r_{ts}.

The vector r_{tk} can be viewed as the information in this accumulated sum that entered the process at time t − k. The condition imposed on the process is that Σ_{s=0}^{∞} E[r_{ts}′ r_{ts}] be finite. In words, condition (3) states that information eventually becomes negligible as it fades far back in time from the current observation. The AR(1) model (as usual)
8 Detailed analysis of this case is quite intricate and well beyond the scope of this book. Some fairly terse analysis may be found in White (2001, pp. 122–133) and Hayashi (2000).
9 See Hayashi (2000, p. 405), who attributes the results to Gordin (1969).
helps to illustrate this point. If zt = ρzt−1 + ut , then
rt0 = E [zt | zt , zt−1,...] − E [zt | zt−1, zt−2,...] = zt − ρzt−1 = ut
rt1 = E [zt | zt−1, zt−2 ...] − E [zt | zt−2, zt−3 ...]
= E [ρzt−1 + ut | zt−1, zt−2 ...] − E [ρ(ρzt−2 + ut−1) + ut | zt−2, zt−3,...]
= ρ(zt−1 − ρzt−2)
= ρu_{t−1}.

By a similar construction, r_{tk} = ρ^k u_{t−k}, from which it follows that

z_t = Σ_{s=0}^{∞} ρ^s u_{t−s},

which we saw earlier in (19-3). You can verify that if |ρ| < 1, the negligibility condition will be met. With all this machinery in place, we now have the theorem we will need:
THEOREM 19.4 Gordin’s Central Limit Theorem
If z_t is strongly stationary and ergodic and if conditions (1)–(3) are met, then √T z̄_T →d N[0, Γ*].
We will be able to employ these tools when we consider the least squares, IV, and GLS estimators in the discussion to follow.
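The innovation decomposition behind condition (3) can be verified numerically: for the AR(1) process, r_{tk} = ρ^k u_{t−k}, so the truncated accumulation Σ_{s=0}^{S} ρ^s u_{t−s} reproduces z_t with a geometrically vanishing tail. A sketch (our own names and illustrative values):

```python
import numpy as np

# For z_t = rho*z_{t-1} + u_t (started at z_0 = u_0), the accumulated
# innovations sum_{s=0}^{S} rho**s * u_{t-s} reproduce z_t; the omitted
# tail is of order rho**(S+1), i.e., asymptotically negligible.
rng = np.random.default_rng(2)
rho, T = 0.8, 500

u = rng.normal(size=T)
z = np.empty(T)
z[0] = u[0]
for t in range(1, T):
    z[t] = rho * z[t - 1] + u[t]

t0, S = T - 1, 100
accum = sum(rho**s * u[t0 - s] for s in range(S + 1))
tail = abs(z[t0] - accum)            # bounded by rho**(S+1) * |z[t0 - S - 1]|
```

With S = 100 and ρ = 0.8 the omitted tail is on the order of 0.8¹⁰¹, numerically indistinguishable from zero.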
19.5 LEAST SQUARES ESTIMATION
The least squares estimator is

b = (X′X)⁻¹X′y = β + (X′X/T)⁻¹(X′ε/T).

Unbiasedness follows from the results in Chapter 4—no modification is needed. We know from Chapter 8 that the Gauss–Markov theorem has been lost—assuming it exists (that remains to be established), the GLS estimator is efficient and OLS is not. How much information is lost by using least squares instead of GLS depends on the data. Broadly, least squares fares better in data that have long periods and little cyclical variation, such as aggregate output series. As might be expected, the greater is the autocorrelation in ε, the greater will be the benefit to using generalized least squares (when this is possible). Even if the disturbances are normally distributed, the usual F and t statistics do not have those distributions. So, not much remains of the finite-sample properties we obtained in Chapter 4. The asymptotic properties remain to be established.
19.5.1 ASYMPTOTIC PROPERTIES OF LEAST SQUARES

The asymptotic properties of b are straightforward to establish given our earlier results. If we assume that the process generating x_t is stationary and ergodic, then by Theorems 19.1 and 19.2, (1/T)(X′X) converges to Q and we can apply the Slutsky theorem to the inverse. If ε_t is not serially correlated, then w_t = x_t ε_t is a martingale difference
sequence, so (1/T)(X′ε) converges to zero. This establishes consistency for the simple case. On the other hand, if [x_t, ε_t] are jointly stationary and ergodic, then we can invoke the Ergodic theorems 19.1 and 19.2 for both moment matrices and establish consistency. Asymptotic normality is a bit more subtle. For the case without serial correlation in ε_t, we can employ Theorem 19.3 for √T w̄. The involved case is the one that interested us at the outset of this discussion, that is, where there is autocorrelation in ε_t and dependence in x_t. Theorem 19.4 is in place for this case. Once again, the conditions described in the preceding section must apply and, moreover, the assumptions needed will have to be established both for x_t and ε_t. Commentary on these cases may be found in Davidson and MacKinnon (1993), Hamilton (1994), White (2001), and Hayashi (2000). Formal presentation extends beyond the scope of this text, so at this point, we will proceed, and assume that the conditions underlying Theorem 19.4 are met.
The results suggested here are quite general, albeit only sketched for the general case. For the remainder of our examination, at least in this chapter, we will confine attention to fairly simple processes in which the necessary conditions for the asymptotic distribution theory will be fairly evident.
There is an important exception to the results in the preceding paragraph. If the regression contains any lagged values of the dependent variable, then least squares will no longer be unbiased or consistent. To take the simplest case, suppose that
y_t = βy_{t−1} + ε_t,
ε_t = ρε_{t−1} + u_t,  (19-12)

and assume |β| < 1, |ρ| < 1. In this model, the regressor and the disturbance are correlated. There are various ways to approach the analysis. One useful way is to rearrange (19-12) by subtracting ρy_{t−1} from y_t. Then,
yt = (β + ρ)yt−1 − βρyt−2 + ut , (19-13)
which is a classical regression with stochastic regressors. Because u_t is an innovation in period t, it is uncorrelated with both regressors, and least squares regression of y_t on (y_{t−1}, y_{t−2}) estimates ρ_1 = (β + ρ) and ρ_2 = −βρ. What is estimated by regression of y_t on y_{t−1} alone? Let γ_k = Cov[y_t, y_{t−k}] = Cov[y_t, y_{t+k}]. By stationarity, Var[y_t] = Var[y_{t−1}], and Cov[y_t, y_{t−1}] = Cov[y_{t−1}, y_{t−2}], and so on. These and (19-13) imply the following relationships:

γ_0 = ρ_1γ_1 + ρ_2γ_2 + σ_u²,
γ_1 = ρ_1γ_0 + ρ_2γ_1,  (19-14)
γ_2 = ρ_1γ_1 + ρ_2γ_0.

(These are the Yule–Walker equations for this model. See Section 21.2.3.) The slope in the simple regression estimates γ_1/γ_0, which can be found in the solutions to these three equations. (An alternative approach is to use the left-out variable formula, which is a useful way to interpret this estimator.) In this case, we see that the slope in the short regression is an estimator of (β + ρ) − βρ(γ_1/γ_0). In either case, solving the three equations in (19-14) for γ_0, γ_1, and γ_2 in terms of ρ_1, ρ_2, and σ_u² produces

plim b = (β + ρ)/(1 + βρ).  (19-15)
This result is between β (when ρ = 0) and 1 (when both β and ρ = 1). Therefore, least squares is inconsistent unless ρ equals zero. The more general case that includes regressors, x_t, involves more complicated algebra, but gives essentially the same result. This is a general result; when the equation contains a lagged dependent variable in the presence of autocorrelation, OLS and GLS are inconsistent. The problem can be viewed as one of an omitted variable.
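The plim in (19-15) is easy to reproduce by simulation: with the illustrative values β = ρ = 0.5, the slope of y_t on y_{t−1} settles near (β + ρ)/(1 + βρ) = 0.8, not near β. A sketch:

```python
import numpy as np

# OLS slope of y_t on y_{t-1} when y_t = beta*y_{t-1} + eps_t and
# eps_t = rho*eps_{t-1} + u_t. The slope converges to (19-15), not to beta.
rng = np.random.default_rng(3)
beta, rho, T = 0.5, 0.5, 200_000

u = rng.normal(size=T)
y = np.zeros(T)
eps_prev = 0.0
for t in range(1, T):
    eps = rho * eps_prev + u[t]
    y[t] = beta * y[t - 1] + eps
    eps_prev = eps

b = (y[1:] @ y[:-1]) / (y[:-1] @ y[:-1])      # short regression, no constant
plim_b = (beta + rho) / (1.0 + beta * rho)    # = 0.8 here
```

The gap between b and β is the omitted-variable bias described in the text.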
19.5.2 ESTIMATING THE VARIANCE OF THE LEAST SQUARES ESTIMATOR

As usual, s²(X′X)⁻¹ is an inappropriate estimator of σ²(X′X)⁻¹(X′ΩX)(X′X)⁻¹, both because s² is a biased estimator of σ² and because the matrix in the middle is incorrect. Generalities are scarce, but in general, for economic time series that are positively related to their past values, the standard errors conventionally estimated by least squares are likely to be too small. For slowly changing, trending aggregates such as output and consumption, this is probably the norm. For highly variable data such as inflation, exchange rates, and market returns, the situation is less clear. Nonetheless, as a general proposition, one would normally not want to rely on s²(X′X)⁻¹ as an estimator of the asymptotic covariance matrix of the least squares estimator.
In view of this situation, if one is going to use least squares, then it is desirable to have an appropriate estimator of the covariance matrix of the least squares estimator. There are two approaches. If the form of the autocorrelation is known, then one can estimate the parameters of Ω directly and compute a consistent estimator. Of course, if so, then it would be more sensible to use feasible generalized least squares instead and not waste the sample information on an inefficient estimator.
The second approach parallels the use of the White estimator for heteroscedasticity. The extension of White’s result to the more general case of autocorrelation is much more difficult than in the heteroscedasticity case. The natural counterpart for estimating

Q* = (1/n) Σ_{i=1}^{n} Σ_{j=1}^{n} σ_ij x_i x_j′  (19-16)

in Section 8.2.3 would be

Q̂* = (1/T) Σ_{t=1}^{T} Σ_{s=1}^{T} e_t e_s x_t x_s′.

But there are two problems with this estimator, one theoretical, which applies to Q* as well, and one practical, which is specific to the latter. Unlike the heteroscedasticity case, the matrix in (19-16) is 1/T times a sum of T² terms, so it is difficult to conclude yet that it will converge to anything at all.
This application is most likely to arise in a time-series setting. To obtain convergence, it is necessary to assume that the terms involving unequal subscripts in (19-16) diminish in importance as T grows. A sufficient condition is that terms with subscript pairs |t − s| grow smaller as the distance between them grows larger. In practical terms, observation pairs are progressively less correlated as their separation in time grows. Intuitively, if one can think of weights, with the diagonal elements getting a weight of 1.0, then the weights grow smaller as we move away from the diagonal. If we think of the sum of the weights rather than just the number of terms, then this sum falls off sufficiently rapidly that as T grows large, the sum is of order T rather than T². Thus,
TABLE 19.1 Robust Covariance Estimation

Variable     OLS Estimate    OLS SE      Corrected SE
Constant     −1.6331         0.2286      0.3335
ln Output     0.2871         0.04738     0.07806
ln CPI        0.9718         0.03377     0.06585

R² = 0.98952, d = 0.02477, r = 0.98762.
we achieve convergence of Q* by assuming that the rows of X are well behaved and that the correlations diminish with increasing separation in time. (See Sections 4.9.6 and 21.2.5 for a more formal statement of this condition.)
The practical problem is that Q̂* need not be positive definite. Newey and West (1987a) have devised an estimator that overcomes this difficulty:

Q̂* = (1/T) [S_0 + Σ_{l=1}^{L} Σ_{t=l+1}^{T} w_l e_t e_{t−l} (x_t x_{t−l}′ + x_{t−l} x_t′)],
w_l = 1 − l/(L + 1).  (19-17)

[See (8-26).] The Newey–West autocorrelation consistent covariance estimator is surprisingly simple and relatively easy to implement.10 There is a final problem to be solved. It must be determined in advance how large L is to be. In general, there is little theoretical guidance. Current practice specifies L ≈ T^{1/4}. Unfortunately, the result is not quite as crisp as that for the heteroscedasticity consistent estimator.
We have the result that b and b_IV are asymptotically normally distributed, and we have an appropriate estimator for the asymptotic covariance matrix. We have not specified the distribution of the disturbances, however. Thus, for inference purposes, the F statistic is approximate at best. Moreover, for more involved hypotheses, the likelihood ratio and Lagrange multiplier tests are unavailable. That leaves the Wald statistic, including asymptotic “t ratios,” as the main tool for statistical inference. We will examine a number of applications in the chapters to follow. The White and Newey–West estimators are standard in the econometrics literature. We will encounter them at many points in the discussion to follow.
Example 19.4 Autocorrelation Consistent Covariance Estimation For the model shown in Example 19.1, the regression results with the uncorrected standard errors and the Newey–West autocorrelation robust covariance matrix for lags of five quarters are shown in Table 19.1. The effect of the very high degree of autocorrelation is evident.
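A compact implementation of (19-17) takes only a few lines. The sketch below uses our own function and variable names on illustrative simulated data; it forms the Bartlett-weighted sum and the resulting sandwich covariance:

```python
import numpy as np

def newey_west_cov(X, e, L):
    """Sandwich covariance for OLS using the Newey-West estimator (19-17)."""
    T, K = X.shape
    Xe = X * e[:, None]                  # rows e_t * x_t'
    S = Xe.T @ Xe / T                    # l = 0 term (the White estimator part)
    for l in range(1, L + 1):
        w = 1.0 - l / (L + 1.0)          # Bartlett weight w_l
        G = Xe[l:].T @ Xe[:-l] / T       # (1/T) sum_t e_t e_{t-l} x_t x_{t-l}'
        S += w * (G + G.T)
    XXi = np.linalg.inv(X.T @ X)
    return T * XXi @ S @ XXi             # estimated covariance of b

# usage on data with AR(1) disturbances (illustrative values)
rng = np.random.default_rng(4)
T = 2_000
X = np.column_stack([np.ones(T), rng.normal(size=T)])
eps = np.empty(T)
eps[0] = rng.normal()
for t in range(1, T):
    eps[t] = 0.8 * eps[t - 1] + rng.normal()
y = X @ np.array([1.0, 2.0]) + eps
b = np.linalg.lstsq(X, y, rcond=None)[0]
V = newey_west_cov(X, y - X @ b, L=int(T ** 0.25))   # L ~ T^(1/4)
se = np.sqrt(np.diag(V))
```

The Bartlett weights guarantee that the resulting matrix is positive semidefinite, which is precisely the practical problem the Newey–West construction solves.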
19.6 GMM ESTIMATION
The GMM estimator in the regression model with autocorrelated disturbances is produced by the empirical moment equations

(1/T) Σ_{t=1}^{T} x_t (y_t − x_t′β̂_GMM) = (1/T) X′ε̂(β̂_GMM) = m̄(β̂_GMM) = 0.  (19-18)
10 Both estimators are now standard features in modern econometrics computer programs. Further results on different weighting schemes may be found in Hayashi (2000, pp. 406–410).
The estimator is obtained by minimizing

q = m̄(β̂_GMM)′ W m̄(β̂_GMM),

where W is a positive definite weighting matrix. The optimal weighting matrix would be

W = {Asy. Var[√T m̄(β)]}⁻¹,
which is the inverse of

Asy. Var[√T m̄(β)] = Asy. Var[(1/√T) Σ_{t=1}^{T} x_t ε_t] = plim_{T→∞} (1/T) Σ_{t=1}^{T} Σ_{s=1}^{T} σ²ρ_{ts} x_t x_s′ = σ²Q*.
The optimal weighting matrix would be [σ²Q*]⁻¹. As in the heteroscedasticity case, this minimization problem is an exactly identified case, so the weighting matrix is actually irrelevant to the solution. The GMM estimator for the regression model with autocorrelated disturbances is ordinary least squares. We can use the results in Section 19.5.2 to construct the asymptotic covariance matrix. We will require the assumptions in Section 19.4 to obtain convergence of the moments and asymptotic normality. We will wish to extend this simple result in one instance. In the common case in which x_t contains lagged values of y_t, we will want to use an instrumental variable estimator. We will return to that estimation problem in Section 19.9.3.
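Because the problem is exactly identified, the solution of (19-18) is just OLS, and the fitted empirical moments are exactly zero. This is easy to confirm (a sketch with illustrative simulated data and our own names):

```python
import numpy as np

# The empirical moments (1/T) X'(y - Xb) are zero at the OLS/GMM solution.
rng = np.random.default_rng(5)
T = 500
X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=T)

b = np.linalg.lstsq(X, y, rcond=None)[0]
m = X.T @ (y - X @ b) / T     # empirical moment vector, (19-18)
```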
19.7 TESTING FOR AUTOCORRELATION
The available tests for autocorrelation are based on the principle that if the true disturbances are autocorrelated, then this fact can be detected through the autocorrelations of the least squares residuals. The simplest indicator is the slope in the artificial regression
e_t = re_{t−1} + v_t,  e_t = y_t − x_t′b,  (19-19)

r = Σ_{t=2}^{T} e_t e_{t−1} / Σ_{t=1}^{T−1} e_t².

If there is autocorrelation, then the slope in this regression will be an estimator of ρ = Corr[ε_t, ε_{t−1}]. The complication in the analysis lies in determining a formal means of evaluating when the estimator is “large,” that is, on what statistical basis to reject the null hypothesis that ρ equals zero. As a first approximation, treating (19-19) as a classical linear model and using a t or F (squared t) test to test the hypothesis is a valid way to proceed based on the Lagrange multiplier principle. We used this device in Example 19.3. The tests we consider here are refinements of this approach.
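The artificial regression in (19-19) is one line of algebra on the residuals. A sketch on simulated data with AR(1) disturbances (names and parameter values are illustrative):

```python
import numpy as np

# OLS residuals from a regression with rho = 0.6 AR(1) errors; the slope of
# e_t on e_{t-1} estimates rho, per (19-19).
rng = np.random.default_rng(6)
T, rho = 1_000, 0.6
X = np.column_stack([np.ones(T), rng.normal(size=T)])
eps = np.empty(T)
eps[0] = rng.normal()
for t in range(1, T):
    eps[t] = rho * eps[t - 1] + rng.normal()
y = X @ np.array([0.5, 1.0]) + eps

b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b
r = (e[1:] @ e[:-1]) / (e[:-1] @ e[:-1])   # slope of e_t on e_{t-1}
```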
19.7.1 LAGRANGE MULTIPLIER TEST
The Breusch (1978)–Godfrey (1978) test is a Lagrange multiplier test of H0: no autocorrelation versus H1: ε_t = AR(P) or ε_t = MA(P). The same test is used for either
structure. The test statistic is

LM = T [e′X_0(X_0′X_0)⁻¹X_0′e / e′e] = TR_0²,  (19-20)
where X_0 is the original X matrix augmented by P additional columns containing the lagged OLS residuals, e_{t−1}, ..., e_{t−P}. The test can be carried out simply by regressing the ordinary least squares residuals e_t on x_{t0} (filling in missing values for lagged residuals with zeros) and referring TR_0² to the tabled critical value for the chi-squared distribution with P degrees of freedom.11 Because X′e = 0, the test is equivalent to regressing e_t on the part of the lagged residuals that is unexplained by X. There is therefore a compelling logic to it; if any fit is found, then it is due to correlation between the current and lagged residuals. The test is a joint test of the first P autocorrelations of ε_t, not just the first.
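A minimal sketch of the statistic in (19-20), with lagged residuals zero-filled as described. Function and variable names are ours; on simulated data with autocorrelated errors the statistic should reject easily:

```python
import numpy as np

def breusch_godfrey_lm(X, e, P):
    """T*R^2 from regressing OLS residuals on [X, e_{t-1}, ..., e_{t-P}]."""
    T = len(e)
    lags = np.column_stack(
        [np.concatenate([np.zeros(p), e[:-p]]) for p in range(1, P + 1)]
    )
    X0 = np.column_stack([X, lags])
    c = np.linalg.lstsq(X0, e, rcond=None)[0]
    v = e - X0 @ c
    R2 = 1.0 - (v @ v) / (e @ e)   # e has zero mean when X contains a constant
    return T * R2                  # chi-squared(P) under H0

rng = np.random.default_rng(7)
T = 1_000
X = np.column_stack([np.ones(T), rng.normal(size=T)])
eps = np.empty(T)
eps[0] = rng.normal()
for t in range(1, T):
    eps[t] = 0.6 * eps[t - 1] + rng.normal()
y = X @ np.array([1.0, -0.5]) + eps
b = np.linalg.lstsq(X, y, rcond=None)[0]
LM = breusch_godfrey_lm(X, y - X @ b, P=2)   # compare with chi2(2) critical value
```

With ρ = 0.6 and T = 1,000, LM far exceeds the 1 percent χ²(2) critical value of 9.21.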
19.7.2 BOX AND PIERCE’S TEST AND LJUNG’S REFINEMENT

An alternative test that is asymptotically equivalent to the LM test when the null hypothesis, ρ = 0, is true and when X does not contain lagged values of y is due to Box and Pierce (1970). The Q test is carried out by referring

Q = T Σ_{j=1}^{P} r_j²,  (19-21)

where r_j = (Σ_{t=j+1}^{T} e_t e_{t−j}) / (Σ_{t=1}^{T} e_t²), to the critical values of the chi-squared table with P degrees of freedom. A refinement suggested by Ljung and Box (1979) is

Q′ = T(T + 2) Σ_{j=1}^{P} r_j²/(T − j).  (19-22)

The essential difference between the Godfrey–Breusch and the Box–Pierce tests is the use of partial correlations (controlling for X and the other variables) in the former and simple correlations in the latter. Under the null hypothesis, there is no autocorrelation in ε_t, and no correlation between x_t and ε_s in any event, so the two tests are asymptotically equivalent. On the other hand, because it does not condition on x_t, the Box–Pierce test is less powerful than the LM test when the null hypothesis is false, as intuition might suggest.
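Both statistics are simple functions of the residual autocorrelations. A sketch (our own names; white-noise input, so neither statistic should be large):

```python
import numpy as np

def box_pierce_ljung_box(e, P):
    """Q of (19-21) and the Ljung-Box refinement (19-22)."""
    T = len(e)
    denom = e @ e
    r = np.array([(e[j:] @ e[:-j]) / denom for j in range(1, P + 1)])
    Q_bp = T * np.sum(r**2)
    Q_lb = T * (T + 2) * np.sum(r**2 / (T - np.arange(1, P + 1)))
    return Q_bp, Q_lb

rng = np.random.default_rng(8)
e = rng.normal(size=2_000)                 # no autocorrelation by construction
Q_bp, Q_lb = box_pierce_ljung_box(e, P=5)  # both approx chi-squared(5) under H0
```

The Ljung–Box version reweights each squared correlation by (T + 2)/(T − j) > 1, so it is always at least as large as the Box–Pierce statistic.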
19.7.3 THE DURBIN–WATSON TEST

The Durbin–Watson statistic12 was the first formal procedure developed for testing for autocorrelation using the least squares residuals. The test statistic is

d = Σ_{t=2}^{T} (e_t − e_{t−1})² / Σ_{t=1}^{T} e_t² = 2(1 − r) − (e_1² + e_T²)/Σ_{t=1}^{T} e_t²,  (19-23)

where r is the same first-order autocorrelation that underlies the preceding two statistics. If the sample is reasonably large, then the last term will be negligible, leaving
11 A warning to practitioners: Current software varies on whether the lagged residuals are filled with zeros or the first P observations are simply dropped when computing this statistic. In the interest of replicability, users should determine which is the case before reporting results.
12 Durbin and Watson (1950, 1951, 1971).
d ≈ 2(1 − r). The statistic takes this form because the authors were able to determine the exact distribution of this transformation of the autocorrelation and could provide tables of critical values. Usable critical values that depend only on T and K are presented in tables such as those at the end of this book. The one-sided test for H0: ρ = 0 against H1: ρ > 0 is carried out by comparing d to values d_L(T, K) and d_U(T, K). If d < d_L, the null hypothesis is rejected; if d > d_U, the hypothesis is not rejected. If d lies between d_L and d_U, then no conclusion is drawn.
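Computing d and its approximation 2(1 − r) side by side is a one-liner each. A sketch, with the residuals simulated directly with positive autocorrelation (illustrative values):

```python
import numpy as np

# Durbin-Watson statistic (19-23) and the approximation d ~ 2(1 - r).
rng = np.random.default_rng(9)
T, rho = 2_000, 0.5
e = np.empty(T)
e[0] = rng.normal()
for t in range(1, T):
    e[t] = rho * e[t - 1] + rng.normal()

d = np.sum(np.diff(e)**2) / (e @ e)
r = (e[1:] @ e[:-1]) / (e[:-1] @ e[:-1])
d_approx = 2.0 * (1.0 - r)             # the last term of (19-23) is negligible
```

With ρ = 0.5 the statistic falls near 1.0, well below 2, signaling positive autocorrelation.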
19.7.4 TESTING IN THE PRESENCE OF A LAGGED DEPENDENT VARIABLE

The Durbin–Watson test is not likely to be valid when there is a lagged dependent variable in the equation.13 The statistic will usually be biased toward a finding of no autocorrelation. Three alternatives have been devised. The LM and Q tests can be used whether or not the regression contains a lagged dependent variable. (In the absence of a lagged dependent variable, they are asymptotically equivalent.) As an alternative to the standard test, Durbin (1970) derived a Lagrange multiplier test that is appropriate in the presence of a lagged dependent variable. The test may be carried out by referring

h = r √(T/(1 − Ts_c²)),  (19-24)
where s_c² is the estimated variance of the least squares regression coefficient on y_{t−1}, to the standard normal tables. Large values of h lead to rejection of H0. The test has the virtues that it can be used even if the regression contains additional lags of y_t, and it can be computed using the standard results from the initial regression without any further regressions. If s_c² > 1/T, however, then it cannot be computed. An alternative is to regress e_t on x_t, y_{t−1}, ..., e_{t−1}, and any additional lags that are appropriate for e_t and then to test the joint significance of the coefficient(s) on the lagged residual(s) with the standard F test. This method is a minor modification of the Breusch–Godfrey test. Under H0, the coefficients on the remaining variables will be zero, so the tests are the same asymptotically.
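A sketch of Durbin's h computed from a regression that includes y_{t−1}. All parameter values and names are illustrative; the guard at the end reflects the fact that h is undefined when Ts_c² ≥ 1:

```python
import numpy as np

# Durbin's h (19-24) for y_t = 0.5 y_{t-1} + x_t + eps_t with AR(1) eps.
rng = np.random.default_rng(10)
T, beta, rho = 2_000, 0.5, 0.4
x = rng.normal(size=T)
y = np.zeros(T)
eps_prev = 0.0
for t in range(1, T):
    eps = rho * eps_prev + rng.normal()
    y[t] = beta * y[t - 1] + x[t] + eps
    eps_prev = eps

Z = np.column_stack([np.ones(T - 1), y[:-1], x[1:]])   # regressors incl. y_{t-1}
yy = y[1:]
b = np.linalg.lstsq(Z, yy, rcond=None)[0]
e = yy - Z @ b
n = len(yy)
s2 = (e @ e) / (n - Z.shape[1])
sc2 = s2 * np.linalg.inv(Z.T @ Z)[1, 1]    # est. variance of coef. on y_{t-1}
r = (e[1:] @ e[:-1]) / (e[:-1] @ e[:-1])
h = r * np.sqrt(n / (1.0 - n * sc2)) if n * sc2 < 1.0 else float("nan")
```

Everything in h comes from the single initial regression, which is the computational appeal Durbin emphasized.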
19.7.5 SUMMARY OF TESTING PROCEDURES

The preceding has examined several testing procedures for locating autocorrelation in the disturbances. In all cases, the procedure examines the least squares residuals. We can summarize the procedures as follows:

LM test. LM = TR² in a regression of the least squares residuals on [xt, et−1, ..., et−P]. Reject H0 if LM > χ²∗[P]. This test examines the covariance of the residuals with lagged values, controlling for the intervening effect of the independent variables.

Q test. Q = T(T + 2) Σ_{j=1}^{P} rj²/(T − j). Reject H0 if Q > χ²∗[P]. This test examines the raw correlations between the residuals and P lagged values of the residuals.

Durbin–Watson test. d = 2(1 − r). Reject H0: ρ = 0 if d < d∗L. This test looks directly at the first-order autocorrelation of the residuals.
13 This issue has been studied by Nerlove and Wallis (1966), Durbin (1970), and Dezhbaksh (1990).
CHAPTER 19 ✦ Serial Correlation 647
Durbin’s test. FD = the F statistic for the joint significance of P lags of the residuals in the regression of the least squares residuals on [xt, yt−1, ..., yt−R, et−1, ..., et−P]. Reject H0 if FD > F∗[P, T − K − P]. This test examines the partial correlations between the residuals and the lagged residuals, controlling for the intervening effect of the independent variables and the lagged dependent variable.

The Durbin–Watson test has some major shortcomings. The inconclusive region is large if T is small or moderate. The bounding distributions, while free of the parameters β and σ, do depend on the data (and assume that X is nonstochastic). An exact version based on an algorithm developed by Imhof (1980) avoids the inconclusive region, but is rarely used. The LM and Box–Pierce statistics do not share these shortcomings: their limiting distributions are chi-squared independently of the data and the parameters. For this reason, the LM test has become the standard method in applied research.
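The Q statistic in the summary above, Q = T(T + 2) Σ rj²/(T − j), can be computed directly from the residual autocorrelations. The sketch below uses simulated residuals; the function names and data are illustrative:

```python
import numpy as np

def autocorr(e, j):
    """Sample autocorrelation r_j = sum e_t e_{t-j} / sum e_t^2."""
    return float(e[j:] @ e[:-j] / (e @ e))

def q_stat(e, P):
    """Q = T(T+2) * sum_{j=1}^{P} r_j^2 / (T - j)."""
    T = len(e)
    return T * (T + 2) * sum(autocorr(e, j) ** 2 / (T - j)
                             for j in range(1, P + 1))

rng = np.random.default_rng(1)
u = rng.standard_normal(400)
e_ar = np.empty(400)                 # residuals following an AR(1) with rho = 0.8
e_ar[0] = u[0]
for t in range(1, 400):
    e_ar[t] = 0.8 * e_ar[t - 1] + u[t]

q_white = q_stat(u, P=5)             # roughly chi-squared(5) under H0
q_ar = q_stat(e_ar, P=5)             # far in the tail under rho = 0.8
```

Each statistic is compared to the chi-squared critical value with P degrees of freedom.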
19.8 EFFICIENT ESTIMATION WHEN Ω IS KNOWN
As a prelude to deriving feasible estimators for β in this model, we consider full generalized least squares estimation assuming that Ω is known. In the next section, we will turn to the more realistic case in which Ω must be estimated as well. If the parameters of Ω are known, then the GLS estimator,

    β̂ = (X′Ω⁻¹X)⁻¹(X′Ω⁻¹y),   (19-25)

and the estimate of its sampling variance,

    Est. Var[β̂] = σ̂ε²[X′Ω⁻¹X]⁻¹,   (19-26)

where

    σ̂ε² = (y − Xβ̂)′Ω⁻¹(y − Xβ̂)/T,   (19-27)

can be computed in one step. For the AR(1) case, data for the transformed model are

    y∗ = [√(1 − ρ²) y1, y2 − ρy1, y3 − ρy2, ..., yT − ρyT−1]′,
    X∗ = [√(1 − ρ²) x1, x2 − ρx1, x3 − ρx2, ..., xT − ρxT−1]′.   (19-28)

These transformations are variously labeled partial differences, quasi differences, or pseudo-differences. Note that in the transformed model, every observation except the first contains a constant term. What was the column of 1s in X is transformed to [√(1 − ρ²), (1 − ρ), (1 − ρ), ...]. Therefore, if the sample is relatively small, then the problems with measures of fit noted in Section 3.5 will reappear. The variance of the transformed disturbance is

    Var[εt − ρεt−1] = Var[ut] = σu².

The variance of the first disturbance is also σu² [see (19-6)]; it can be estimated using (1 − ρ²)σ̂ε².
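The AR(1) transformation is straightforward to apply in code; GLS is then simply OLS on the transformed data. A minimal sketch, with illustrative names and toy data:

```python
import numpy as np

def ar1_transform(y, X, rho):
    """Quasi-difference transform of eq. (19-28) for known AR(1) rho.

    First row is scaled by sqrt(1 - rho^2); the remaining rows are
    y_t - rho*y_{t-1} and x_t - rho*x_{t-1}.
    """
    y = np.asarray(y, float)
    X = np.asarray(X, float)
    w = np.sqrt(1.0 - rho ** 2)
    y_star = np.concatenate(([w * y[0]], y[1:] - rho * y[:-1]))
    X_star = np.vstack((w * X[:1], X[1:] - rho * X[:-1]))
    return y_star, X_star

y = np.arange(5.0)                                  # toy data
X = np.column_stack((np.ones(5), np.arange(5.0)))   # constant + trend
y_star, X_star = ar1_transform(y, X, rho=0.5)
# GLS in one step: beta_gls = np.linalg.lstsq(X_star, y_star, rcond=None)[0]
```

Note that the constant column becomes [√(1 − ρ²), 1 − ρ, 1 − ρ, ...], as the text observes.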
648 PART V ✦ Time Series and Macroeconometrics
Corresponding results have been derived for higher-order autoregressive processes. For the AR(2) model,

    εt = θ1εt−1 + θ2εt−2 + ut,   (19-29)

the transformed data for generalized least squares are obtained by

    z∗1 = √[(1 + θ2)[(1 − θ2)² − θ1²]/(1 − θ2)] z1,

    z∗2 = √(1 − θ2²) z2 − [θ1√(1 − θ2²)/(1 − θ2)] z1,   (19-30)

    z∗t = zt − θ1zt−1 − θ2zt−2, t > 2,
where zt is used for yt or xt. The transformation becomes progressively more complex for higher-order processes.14

Note that in both the AR(1) and AR(2) models, the transformation to y∗ and X∗ involves “starting values” for the processes that depend only on the first one or two observations. We can view the process as having begun in the infinite past. Because the sample contains only T observations, however, it is convenient to treat the first one or two (or P) observations as shown and consider them as “initial values.” Whether we view the process as having begun at time t = 1 or in the infinite past is ultimately immaterial in regard to the asymptotic properties of the estimators.

The asymptotic properties for the GLS estimator are quite straightforward given the apparatus we assembled in Section 19.4. We begin by assuming that {xt, εt} are jointly an ergodic, stationary process. Then, after the GLS transformation, {x∗t, ε∗t} is also stationary and ergodic. Moreover, ε∗t is nonautocorrelated by construction. In the transformed model, then, {w∗t} = {x∗t ε∗t} is a stationary and ergodic martingale difference series. We can use the ergodic theorem to establish consistency and the central limit theorem for martingale difference sequences to establish asymptotic normality for GLS in this model. Formal arrangement of the relevant results is left as an exercise.
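The AR(2) transformation of (19-29)–(19-30) can be sketched as a function applied to y and to each column of X in turn. This is an illustrative reconstruction assuming stationarity of the AR(2) process:

```python
import numpy as np

def ar2_transform(z, t1, t2):
    """GLS transform for known AR(2) parameters (theta1, theta2), eq. (19-30).

    z_star_1 and z_star_2 use the special starting-value formulas; the
    remaining observations are z_t - theta1*z_{t-1} - theta2*z_{t-2}.
    """
    z = np.asarray(z, float)
    z_star = np.empty_like(z)
    z_star[0] = np.sqrt((1 + t2) * ((1 - t2) ** 2 - t1 ** 2) / (1 - t2)) * z[0]
    z_star[1] = np.sqrt(1 - t2 ** 2) * z[1] \
        - t1 * np.sqrt(1 - t2 ** 2) / (1 - t2) * z[0]
    z_star[2:] = z[2:] - t1 * z[1:-1] - t2 * z[:-2]
    return z_star

z = np.array([1.0, 2.0, 3.0, 4.0])
z_ar1 = ar2_transform(z, 0.5, 0.0)   # with theta2 = 0 this collapses to AR(1)
```

A useful check on the formulas: setting θ2 = 0 reproduces the AR(1) transformation with ρ = θ1, including the √(1 − ρ²) scaling of the first observation.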
19.9 ESTIMATION WHEN Ω IS UNKNOWN
For an unknown Ω, there are a variety of approaches. Any consistent estimator of Ω(ρ) will suffice; recall from Theorem 8.5 in Section 8.3.2 that all that is needed for efficient estimation of β is a consistent estimator of Ω(ρ). The complication arises, as might be expected, in estimating the autocorrelation parameter(s).
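The FGLS idea just described, a consistent estimate of ρ followed by GLS on the transformed data, can be sketched for the common AR(1) case. The simulated data and names below are illustrative, not from the text:

```python
import numpy as np

def fgls_ar1(y, X):
    """Two-step FGLS for AR(1) disturbances (a sketch).

    Step 1: OLS, then estimate rho by the residual autocorrelation r.
    Step 2: GLS on the quasi-differenced data, retaining the first
    observation scaled by sqrt(1 - r^2).
    """
    y, X = np.asarray(y, float), np.asarray(X, float)
    b = np.linalg.lstsq(X, y, rcond=None)[0]         # step 1: OLS
    e = y - X @ b
    r = float(e[1:] @ e[:-1] / (e @ e))              # rho-hat = r
    w = np.sqrt(1.0 - r ** 2)
    y_s = np.concatenate(([w * y[0]], y[1:] - r * y[:-1]))
    X_s = np.vstack((w * X[:1], X[1:] - r * X[:-1]))
    beta = np.linalg.lstsq(X_s, y_s, rcond=None)[0]  # step 2: GLS
    return beta, r

# simulate a small regression with AR(1) errors to exercise the sketch
rng = np.random.default_rng(2)
T, rho = 2000, 0.7
x = rng.standard_normal(T)
eps = np.empty(T)
eps[0] = rng.standard_normal() / np.sqrt(1 - rho ** 2)
for t in range(1, T):
    eps[t] = rho * eps[t - 1] + rng.standard_normal()
y = 1.0 + 2.0 * x + eps
beta_hat, rho_hat = fgls_ar1(y, np.column_stack((np.ones(T), x)))
```

With the first observation retained, this is the Prais–Winsten form of the two-step estimator; dropping the first row instead gives the Cochrane–Orcutt form.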
19.9.1 AR(1) DISTURBANCES

The AR(1) model is the one most widely used and studied. The most common procedure is to begin FGLS with a natural estimator of ρ, the autocorrelation of the residuals. Because b is consistent, we can use r. Others that have been suggested include Theil’s (1971) estimator, r[(T − K)/(T − 1)], and Durbin’s (1970), the slope on yt−1 in a regression
14 See Box and Jenkins (1984) and Fuller (1976).
of yt on yt−1, xt and xt−1. The second step is FGLS based on (19-25)–(19-28). This is the Prais and Winsten (1954) estimator. The Cochrane and Orcutt (1949) estimator (based on computational ease) omits the first observation.

It is possible to iterate any of these estimators to convergence. Because the estimator is asymptotically efficient at every iteration, nothing is gained by doing so. Unlike the heteroscedastic model, iterating when there is autocorrelation does not produce the maximum likelihood estimator. The iterated FGLS estimator, regardless of the estimator of ρ, does not account for the term (1/2) ln(1 − ρ²) in the log-likelihood function [see the following (19-31)].

Maximum likelihood estimators can be obtained by maximizing the log-likelihood with respect to β, σu², and ρ. The log-likelihood function may be written

    ln L = −(Σ_{t=1}^{T} ut²)/(2σu²) + (1/2) ln(1 − ρ²) − (T/2)[ln 2π + ln σu²],   (19-31)

where, as before, the first observation is computed differently from the others using (19-28). The MLE for this model is developed in Section 16.9.2.d. Based on the MLE, the standard approximations to the asymptotic variances of the estimators are: