Chapter 10

Multicollinearity

10.1 The Nature of Multicollinearity

10.1.1 Extreme Collinearity

The standard OLS assumption that the explanatory variables ( x_{i1}, x_{i2}, . . . , x_{ik} ) not be linearly related means that for any ( c_1, c_2, . . . , c_{k-1} )

x_{ik} ≠ c_1 x_{i1} + c_2 x_{i2} + ··· + c_{k-1} x_{i,k-1}   (10.1)

for some i. If the assumption is violated, then we can find ( c_1, c_2, . . . , c_{k-1} ) such that

x_{ik} = c_1 x_{i1} + c_2 x_{i2} + ··· + c_{k-1} x_{i,k-1}   (10.2)

for all i. Define

X_1 = [ x_{11} ··· x_{1,k-1}
        x_{21} ··· x_{2,k-1}
          .           .
        x_{n1} ··· x_{n,k-1} ]  (an n × (k−1) matrix),
x_k = ( x_{1k}, x_{2k}, . . . , x_{nk} )',  and  c = ( c_1, c_2, . . . , c_{k-1} )'.

Then extreme collinearity can be represented as

x_k = X_1 c.   (10.3)

We have represented extreme collinearity in terms of the last explanatory variable. Since we can always re-order the variables, this choice is without loss of generality and the analysis could be applied to any non-constant variable by moving it to the last column.

10.1.2 Near Extreme Collinearity

Of course, it is rare, in practice, that an exact linear relationship holds. Instead, we have

x_{ik} = c_1 x_{i1} + c_2 x_{i2} + ··· + c_{k-1} x_{i,k-1} + v_i   (10.4)

or, more compactly,

x_k = X_1 c + v,   (10.5)

where the v's are small relative to the x's. If we think of the v's as random variables, they will have small variances (and zero mean if X includes a column of ones). A convenient way to algebraically express the degree of collinearity is the sample correlation between x_{ik} and w_i = c_1 x_{i1} + c_2 x_{i2} + ··· + c_{k-1} x_{i,k-1}, namely

r_{x,w} = cov( x_{ik}, w_i ) / √( var(x_{ik}) var(w_i) ) = cov( w_i + v_i, w_i ) / √( var(w_i + v_i) var(w_i) ),   (10.6)

where cov(·,·) and var(·) denote sample moments. Clearly, as the variance of v_i grows small, this value will go to unity. For near extreme collinearity, we are talking about a high correlation between at least one variable and some linear combination of the others.

We are interested not only in the possibility of high correlation between x_{ik} and the linear combination w_i = c_1 x_{i1} + c_2 x_{i2} + ··· + c_{k-1} x_{i,k-1} for a particular choice of c, but for any choice of the coefficients. The choice which maximizes the correlation is the choice which minimizes the sum of squared residuals Σ_{i=1}^n v_i^2, that is, least squares. Thus ĉ = (X_1'X_1)^{-1} X_1' x_k and ŵ = X_1 ĉ, and

( r_{x_k, ŵ} )^2 = R_k•^2   (10.7)

is the R^2 of this regression and hence the maximal sample correlation between x_{ik} and the other x's.
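To make the construction concrete, here is a minimal numpy sketch (the simulated data and all variable names are illustrative, not from the text): it regresses the last column on the remaining regressors and verifies that the R^2 of that auxiliary regression equals the squared sample correlation between x_k and the fitted combination ŵ.

import numpy as np

# Sketch: R_k (squared) equals the squared correlation between x_k and its
# least-squares fit from the other regressors.  Simulated, illustrative data.
rng = np.random.default_rng(0)
n = 200
X1 = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])     # other regressors
xk = 0.8 * X1[:, 1] - 0.5 * X1[:, 2] + rng.normal(scale=0.3, size=n)

c_hat, *_ = np.linalg.lstsq(X1, xk, rcond=None)   # c-hat = (X1'X1)^{-1} X1'xk
w_hat = X1 @ c_hat                                # fitted linear combination

r = np.corrcoef(xk, w_hat)[0, 1]                  # sample correlation of xk and w-hat
R2_k = 1 - np.sum((xk - w_hat) ** 2) / np.sum((xk - xk.mean()) ** 2)
print(r ** 2, R2_k)                               # the two agree up to rounding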

10.1.3 Absence of Collinearity

At the other extreme, suppose

R_k•^2 = r_{x_k, ŵ} = cov( x_{ik}, ŵ_i ) = 0.   (10.8)

That is, x_{ik} has zero sample correlation with all linear combinations of the other variables for any ordering of the variables. In terms of the matrices, this requires ĉ = 0 or

X_1' x_k = 0   (10.9)

regardless of which variable is used as x_k. This is called the case of orthogonal regressors, since the various x's are all orthogonal. This extreme is also very rare, in practice. We usually find some degree of collinearity, though not perfect, in any data set.

10.2 Consequences of Multicollinearity

10.2.1 For OLS Estimation

We will first examine the effect of x_k being highly collinear with the other regressors upon the estimator β̂_k. Now let

x_k = X_1 c + v.   (10.10)

The OLS estimates are given by the solution of

X'y = X'X β̂
    = X'( X_1 : x_k ) β̂
    = ( X'X_1 : X'x_k ) β̂.   (10.11)

Applying Cramer's rule to obtain β̂_k yields

β̂_k = |X'X_1 : X'y| / |X'X|.   (10.12)

However, as the collinearity becomes more extreme, the columns of X (the rows of X') become more linearly dependent and

lim_{v→0} β̂_k = 0/0,   (10.13)

which is indeterminate.

Now, the variance-covariance matrix is

σ^2 ( X'X )^{-1} = σ^2 (1/|X'X|) adj( X'X )
                 = σ^2 (1/|X'X|) adj( [ X_1' ; x_k' ] ( X_1 : x_k ) )
                 = σ^2 (1/|X'X|) adj( [ X_1'X_1   X_1'x_k
                                        x_k'X_1   x_k'x_k ] ).   (10.14)

The variance of βk is given by the (k, k) element, so

var( β̂_k ) = σ^2 (1/|X'X|) cof(k, k) = σ^2 |X_1'X_1| / |X'X|.   (10.15)

Thus, for |X_1'X_1| ≠ 0, we have

lim_{v→0} var( β̂_k ) = σ^2 |X_1'X_1| / 0 = ∞,   (10.16)

and the variance of the collinear terms becomes unbounded.

It is instructive to give more structure to the variance of the last coefficient estimate in terms of the sample correlation R_k•^2 given above. First we obtain the covariance matrix of the OLS estimators other than the intercept. Denote X = ( ℓ : X* ), where ℓ is an n × 1 vector of ones and X* contains the nonconstant columns of X; then

X'X = [ ℓ'ℓ     ℓ'X*
        X*'ℓ    X*'X* ].   (10.17)

Using the results for the inverse of a partitioned matrix, we find that the lower right-hand (k − 1) × (k − 1) submatrix of the inverse is given by

( X*'X* − X*'ℓ(ℓ'ℓ)^{-1}ℓ'X* )^{-1} = ( X*'X* − n x̄* x̄*' )^{-1}
                                    = [ ( X* − ℓx̄*' )'( X* − ℓx̄*' ) ]^{-1}
                                    = ( X̃'X̃ )^{-1},   (10.18)

where x̄* = X*'ℓ/n is the mean vector for the nonconstant variables and X̃ = X* − ℓx̄*' is the demeaned or deviation form of the data matrix for the nonconstant variables.

We now write X̃ = ( X̃_1 : x̃_k ), where x̃_k is the last (that is, the (k − 1)-th) column of X̃; then

X̃'X̃ = [ X̃_1'X̃_1   X̃_1'x̃_k
         x̃_k'X̃_1   x̃_k'x̃_k ].   (10.19)

Using the results for partitioned inverses again, the (k − 1, k − 1) element of the inverse ( X̃'X̃ )^{-1} is given by

( x̃_k'x̃_k − x̃_k'X̃_1(X̃_1'X̃_1)^{-1}X̃_1'x̃_k )^{-1} = 1 / ( e_k'e_k )
    = 1 / ( x̃_k'x̃_k · e_k'e_k / x̃_k'x̃_k )
    = 1 / ( x̃_k'x̃_k ( SSE_k / SST_k ) )
    = 1 / ( x̃_k'x̃_k ( 1 − R_k•^2 ) ),   (10.20)

where e_k = ( I_n − X̃_1(X̃_1'X̃_1)^{-1}X̃_1' ) x̃_k are the OLS residuals from regressing the demeaned x_k's on the other demeaned variables, and SSE_k, SST_k, and R_k•^2 are the corresponding statistics for this regression. Thus we find

var( β̂_k ) = σ^2 [ ( X'X )^{-1} ]_{kk} = σ^2 / ( x̃_k'x̃_k ( 1 − R_k•^2 ) )   (10.21)
           = σ^2 / ( Σ_{i=1}^n ( x_{ik} − x̄_k )^2 ( 1 − R_k•^2 ) )
           = σ^2 / ( n · (1/n) Σ_{i=1}^n ( x_{ik} − x̄_k )^2 ( 1 − R_k•^2 ) ),

and the variance of β̂_k increases with the noise σ^2 and the correlation R_k•^2 of x_k with the other variables, and decreases with the sample size n and the signal (1/n) Σ_{i=1}^n ( x_{ik} − x̄_k )^2. Since the order of the variables is arbitrary and any could be placed in the k-th position, var( β̂_j ) will be given by the same expression with j replacing k.

Thus, as the collinearity becomes more and more extreme:

• The OLS estimates of the coefficients on the collinear terms become indeterminate. This is just a manifestation of the difficulties in obtaining (X'X)^{-1}.

• The OLS coefficients on the collinear terms become infinitely variable. Their variances become very large as R_k•^2 → 1.

• The OLS estimates are still BLUE and, with normal disturbances, BUE. Thus, any unbiased estimator will be afflicted with the same problems.

Collinearity does not affect our estimate s^2 of σ^2. This is easy to see, since we have shown that

( n − k ) s^2/σ^2 ∼ χ^2_{n−k}   (10.22)

regardless of the values of X, provided X'X is still nonsingular. This is to be contrasted with the distribution of β̂, where

β̂ ∼ N( β, σ^2 ( X'X )^{-1} )   (10.23)

clearly depends on X and, more particularly, on the near non-invertibility of X'X.
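As a numerical check on (10.21), the following sketch (simulated, deliberately collinear data; all names and values are illustrative) compares the k-th diagonal element of σ^2(X'X)^{-1} with σ^2 / ( Σ_i (x_{ik} − x̄_k)^2 (1 − R_k•^2) ); the two agree.

import numpy as np

# Check: [sigma^2 (X'X)^{-1}]_{kk} = sigma^2 / (sum (x_ik - xbar_k)^2 (1 - R_k^2)).
rng = np.random.default_rng(1)
n, sigma2 = 100, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
X = np.column_stack([X, 0.9 * X[:, 1] + 0.1 * rng.normal(size=n)])   # nearly collinear last column

direct = sigma2 * np.linalg.inv(X.T @ X)[-1, -1]      # k-th diagonal element of sigma^2 (X'X)^{-1}

xk = X[:, -1]
X_other = X[:, :-1]                                   # the other regressors, incl. the constant
fit, *_ = np.linalg.lstsq(X_other, xk, rcond=None)
resid = xk - X_other @ fit
SST = np.sum((xk - xk.mean()) ** 2)
R2k = 1 - resid @ resid / SST                         # auxiliary-regression R^2
via_R2 = sigma2 / (SST * (1 - R2k))
print(direct, via_R2)                                 # identical up to rounding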

10.2.2 For Inferences

Provided collinearity does not become extreme, we still have the ratios (β̂_j − β_j)/√(s^2 d^{jj}) ∼ t_{n−k}, where d^{jj} = [ ( X'X )^{-1} ]_{jj}. Although β̂_j becomes highly variable as collinearity increases, d^{jj} grows correspondingly larger, thereby compensating. Thus under H_0: β_j = β_j^0, we find (β̂_j − β_j^0)/√(s^2 d^{jj}) ∼ t_{n−k}, as is the case in the absence of collinearity. This result, that the null distribution of the ratios is not impacted as collinearity becomes more extreme, seems not to be fully advertised in most texts.

The inferential price extracted by collinearity is loss of power. Under H_1: β_j = β_j^1 ≠ β_j^0, we can write

(β̂_j − β_j^0)/√(s^2 d^{jj}) = (β̂_j − β_j^1)/√(s^2 d^{jj}) + (β_j^1 − β_j^0)/√(s^2 d^{jj}).   (10.24)

The first term will continue to follow a t_{n−k} distribution, as argued in the previous paragraph, as collinearity becomes more extreme. However, the second term, which represents a "shift" term, will grow smaller as collinearity becomes more extreme and d^{jj} becomes larger. Thus we are less likely to shift the statistic into the tail of the ostensible null distribution and hence less likely to reject the null hypothesis. Formally,

(β̂_j − β_j^0)/√(s^2 d^{jj}) ∼ t_{n−k}( (β_j^1 − β_j^0)/√(σ^2 d^{jj}) )   (10.25)

and the noncentrality parameter becomes smaller and smaller as collinearity becomes more extreme.

Alternatively, the inferential impact can be seen through the impact on the confidence intervals. Using the standard approach discussed in the previous chapter, we have [ β̂_j − a√(s^2 d^{jj}), β̂_j + a√(s^2 d^{jj}) ] as the 95% confidence interval, where a is the critical value for a .025 tail. Note that as collinearity becomes more extreme and d^{jj} becomes larger, the width of the interval becomes larger as well. Thus the estimates are consistent with a larger and larger set of null hypotheses as the collinearity strengthens. In the limit the interval is consistent with any null hypothesis and we have zero power.

We should emphasize that collinearity does not always cause problems with power. The noncentrality parameter in (10.25) can be written

(β_j^1 − β_j^0)/√(σ^2 d^{jj}) = √n (β_j^1 − β_j^0) / √( σ^2 / ( (1/n) Σ_{i=1}^n ( x_{ij} − x̄_j )^2 ( 1 − R_j•^2 ) ) ),

which clearly depends on other factors than the degree of collinearity. The size of the noncentrality increases with the sample size √n, the difference between the null and alternative hypotheses (β_j^1 − β_j^0), and the signal-noise ratio

(1/n) Σ_{i=1}^n ( x_{ij} − x̄_j )^2 / σ^2. It is entirely possible for R_j•^2 to be close to one and the noncentrality to still be large. The important question is not whether collinearity is present or extreme, but whether it is extreme enough to eliminate the power of our test. This is also a phenomenon that does not seem to be fully appreciated or well-enough advertised in most texts.

We can easily tell when collinearity is not a problem if the coefficients are significant or we reject the null hypothesis under consideration. Only if apparently important variables are insignificantly different from zero or have the wrong sign should we consider the possibility that collinearity is causing problems.

10.2.3 For Prediction

If all we are interested in is prediction of yp given xp1, xp2, . . . , xpk, then we are not particularly interested in whether or not we have isolated the individual effects of each xij. We are interested in predicting the total effect or variation in y. A good measure of how well the linear relationship captures the total effect or variation is the overall R2 statistic. But the R2 value is related to s2 by

R^2 = 1 − e'e / ( (y − ȳ)'(y − ȳ) ) = 1 − (n − k) s^2 / ( (y − ȳ)'(y − ȳ) ),   (10.26)

which does not depend upon the collinearity of X. Thus, we can expect our regressions to predict well, despite collinearity and insignificant coefficients, provided the R^2 value is high. This depends, of course, upon the collinearity continuing to persist in the future. If the collinearity does not continue, then prediction will become increasingly uncertain. Such uncertainty will be reflected, however, by the estimated standard errors of the forecast and hence wider forecast intervals.

10.2.4 An Illustrative Example

As an illustration of the problems introduced by collinearity, consider the consumption equation

C_t = β_0 + β_1 Y_t + β_2 W_t + u_t,   (10.27)

where C_t is consumption expenditures at time t, Y_t is income at time t, and W_t is wealth at time t. Economic theory suggests that the coefficient on income should be slightly less than one and the coefficient on wealth should be positive. The time-series data for this relationship are given in the following table:

  C_t     Y_t     W_t
   70      80     810
   65     100    1009
   90     120    1273
   95     140    1425
  110     160    1633
  115     180    1876
  120     200    2052
  140     220    2201
  155     240    2435
  150     260    2686

Table 10.1: Consumption Data

Applying least squares to this equation and data yields

C_t = 24.775 + 0.942 Y_t − 0.042 W_t + e_t
     (6.752)   (0.823)     (0.081)

where estimated standard errors are given in parentheses. Summary statistics for the regression are: SSE = 324.446, s^2 = 46.35, and R^2 = 0.9635. The coefficient estimate for the marginal propensity to consume seems to be a reasonable value; however, it is not significantly different from either zero or one. And the coefficient on wealth is negative, which is not consistent with economic theory. Wrong signs and insignificant coefficient estimates on a priori important variables are the classic symptoms of collinearity. As an indicator of the possible collinearity, the squared correlation between Y_t and W_t is .9979, which suggests near extreme collinearity among the explanatory variables.
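A short numpy sketch that reproduces these figures from the data in Table 10.1 (up to rounding); nothing beyond the table is assumed.

import numpy as np

# Reproduce the consumption regression from Table 10.1.
C = np.array([70, 65, 90, 95, 110, 115, 120, 140, 155, 150], dtype=float)
Y = np.array([80, 100, 120, 140, 160, 180, 200, 220, 240, 260], dtype=float)
W = np.array([810, 1009, 1273, 1425, 1633, 1876, 2052, 2201, 2435, 2686], dtype=float)

X = np.column_stack([np.ones_like(Y), Y, W])
b = np.linalg.solve(X.T @ X, X.T @ C)             # OLS coefficients
e = C - X @ b
n, k = X.shape
s2 = e @ e / (n - k)                              # = SSE / (n - k)
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
R2 = 1 - e @ e / np.sum((C - C.mean()) ** 2)
r2_YW = np.corrcoef(Y, W)[0, 1] ** 2              # squared correlation of Y and W
print(b, se, e @ e, s2, R2, r2_YW)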

10.3 Detecting Multicollinearity

10.3.1 When Is Multicollinearity a Problem?

Suppose the regression yields significant coefficients; then collinearity is not a problem, even if present. On the other hand, if a regression has insignificant coefficients, then this may be due to collinearity or to the fact that the variables, in fact, do not enter the relationship.

10.3.2 Zero-Order Correlations

If we have a trivariate relationship, say

y_t = β_1 + β_2 x_{t2} + β_3 x_{t3} + u_t,   (10.28)

we can look at the zero-order correlation between x_2 and x_3. As a rule of thumb, if this (squared) value exceeds the R^2 of the original regression, then we have a problem of collinearity. If r_{23} is low, then the regression is likely insignificant.

In the previous example, r_{WY}^2 = 0.9979 > 0.9635 = R^2, which indicates that Y_t is more highly related to W_t than to C_t and we have a problem. In effect, the variables are so closely related that the regression has difficulty untangling the separate effects of Y_t and W_t.

In general (k > 3), when one of the zero-order correlations between the x's is large relative to R^2, we have a problem.

10.3.3 Partial Regressions

In the general case (k > 3), even if all the zero-order correlations are small, we may still have a problem. For while x_1 may not be strongly linearly related to any single x_i (i ≠ 1), it may be very highly correlated with some linear combination of the x's. To test for this possibility, we should run regressions of each x_i on all the other x's. If collinearity is present, then one of these regressions will have a high R^2 (relative to the R^2 of the complete regression). For example, when k = 4 and

yt = β1 + β2xt2 + β3xt3 + β4xt4 + ut (10.29) is the regression, then collinearity is indicated when one of the partial regressions

xt2 = α1 + α3xt3 + α4xt4

xt3 = γ1 + γ2xt2 + γ4xt4

xt4 = δ1 + δ2xt2 + δ3xt3 (10.30) yields a large R2 relative to the complete regression.

10.3.4 Variance Inflation Factor

A more informative use of the partial regressions is to directly calculate the impact of collinearity on the variance of the coefficient in question. We know that the variance of coefficient j will be increased by

v_j = 1/(1 − R_j•^2)   (10.31)

as a result of the correlation structure among the explanatory variables. This is sometimes called the variance inflation factor (VIF). If you have the usual unbiased variance estimator for x_{ij}, denoted say s_{x_j}^2 = (1/(n−1)) Σ_{i=1}^n ( x_{ij} − x̄_j )^2, then we do not need to perform a partial regression but can alternatively calculate the VIF as

v_j = ( s_{β̂_j} )^2 (n − 1) s_{x_j}^2 / s^2,   (10.32)

where s_{β̂_j} denotes the usual reported estimated standard error of β̂_j.


The square root of this factor gives us a direct measure of how much the estimated standard errors have been inflated as a result of collinearity. In cases of near extreme collinearity these factors will sometimes be in the hundreds or thousands. If the factor is less than, say, 4, then the standard errors are less than doubled. For the consumption example given above it is 476, which indicates that collinearity between income and wealth is a substantial problem for the precision of the estimates of both the income and wealth coefficients.
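The two routes to the VIF can be checked directly on the consumption data; the following sketch computes 1/(1 − R_j•^2) for the income coefficient and the standard-error-based version (10.32), which coincide (and are roughly 476, as reported above).

import numpy as np

# Two equivalent VIF computations for the income coefficient.
Y = np.array([80, 100, 120, 140, 160, 180, 200, 220, 240, 260], dtype=float)
W = np.array([810, 1009, 1273, 1425, 1633, 1876, 2052, 2201, 2435, 2686], dtype=float)
C = np.array([70, 65, 90, 95, 110, 115, 120, 140, 155, 150], dtype=float)

# (a) auxiliary-regression form: 1/(1 - R_j^2); here R_j^2 = r^2 between Y and W
R2_aux = np.corrcoef(Y, W)[0, 1] ** 2
vif_a = 1.0 / (1.0 - R2_aux)

# (b) reported-standard-error form (10.32): s_bj^2 (n-1) s_xj^2 / s^2
X = np.column_stack([np.ones_like(Y), Y, W])
b = np.linalg.solve(X.T @ X, X.T @ C)
e = C - X @ b
n, k = X.shape
s2 = e @ e / (n - k)
se_Y2 = s2 * np.linalg.inv(X.T @ X)[1, 1]      # squared standard error of the income coefficient
vif_b = se_Y2 * (n - 1) * Y.var(ddof=1) / s2
print(vif_a, vif_b)                            # both roughly 476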

10.3.5 The F Test

The manifestation of collinearity is that estimators become insignificantly different from zero, due to the inability to untangle the separate effects of the collinear variables. If the insignificance is due to collinearity, the total effect is not confused, as evidenced by the fact that s^2 is unaffected. A formal test, accordingly, is to examine whether the total effect of the insignificant (possibly collinear) variables is significant. Thus, we perform an F test of the joint hypothesis that the individually insignificant variables are jointly insignificant. For example, if the regression

y_t = β_1 + β_2 x_{t2} + β_3 x_{t3} + β_4 x_{t4} + u_t   (10.33)

yields insignificant (from zero) estimates of β_2, β_3, and β_4, we use an F test of the joint hypothesis β_2 = β_3 = β_4 = 0. If we reject this joint hypothesis, then the total effect is strong, but the individual effects are confused. This is evidence of collinearity. If we accept the null, then we are forced to conclude that the variables are, in fact, insignificant.

For the consumption example considered above, a test of the null hypothesis that the collinear terms (income and wealth) are jointly zero yields an F-statistic value of 92.40, which is very extreme under the null, when the statistic has an F_{2,7} distribution. Thus the variables are individually insignificant but jointly significant, which indicates that collinearity is, in fact, a problem.
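A sketch of the restricted/unrestricted F computation for this example, using only the data of Table 10.1; it should reproduce the reported value of roughly 92.4.

import numpy as np

# F test that income and wealth are jointly zero in the consumption example.
C = np.array([70, 65, 90, 95, 110, 115, 120, 140, 155, 150], dtype=float)
Y = np.array([80, 100, 120, 140, 160, 180, 200, 220, 240, 260], dtype=float)
W = np.array([810, 1009, 1273, 1425, 1633, 1876, 2052, 2201, 2435, 2686], dtype=float)

X = np.column_stack([np.ones_like(Y), Y, W])
b = np.linalg.solve(X.T @ X, X.T @ C)
SSE_u = np.sum((C - X @ b) ** 2)          # unrestricted: constant, Y, W
SSE_r = np.sum((C - C.mean()) ** 2)       # restricted: constant only
n, k, q = len(C), 3, 2                    # q = number of restrictions
F = ((SSE_r - SSE_u) / q) / (SSE_u / (n - k))
print(F)                                  # approximately 92.4; compare with F(2,7)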

10.3.6 The Condition Number

Belsley, Kuh, and Welsch (1980) suggest an approach that considers the invertibility of X'X directly. First, we transform each column of X so that the columns are of similar scale in terms of variability by scaling each column to unit length:

x_j* = x_j / √( x_j'x_j )   (10.34)

for j = 1, 2, ..., k. Next we find the eigenvalues of the moment matrix of the so-transformed data matrix by finding the k roots of

det( X*'X* − λ I_k ) = 0.   (10.35)

Note that since X*'X* is positive semi-definite, the eigenvalues will be nonnegative, with values of zero in the event of singularity and values close to zero in the event of near singularity. The condition number of the matrix is taken as the square root of the ratio of the largest to the smallest eigenvalue,

c = √( λ_max / λ_min ),   (10.36)

which is also the ratio of the largest to smallest singular values of X*. Using an analysis of a number of problems, BKW suggest that collinearity is a possible issue when c ≥ 20. For the example the condition number is 166.245, which indicates a very poorly conditioned matrix. Although this approach tells a great deal about the invertibility of X'X and hence the signal, it tells us nothing about the noise level relative to the signal.
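A sketch of the condition-number computation for the consumption data, following the unit-length scaling and square-root convention above; it should give a value of roughly 166.

import numpy as np

# Condition number: scale each column of X to unit length, then take the
# square root of the ratio of the largest to smallest eigenvalue of X*'X*.
Y = np.array([80, 100, 120, 140, 160, 180, 200, 220, 240, 260], dtype=float)
W = np.array([810, 1009, 1273, 1425, 1633, 1876, 2052, 2201, 2435, 2686], dtype=float)
X = np.column_stack([np.ones_like(Y), Y, W])

X_star = X / np.sqrt((X ** 2).sum(axis=0))        # unit-length columns
lam = np.linalg.eigvalsh(X_star.T @ X_star)       # eigenvalues, in ascending order
c = np.sqrt(lam[-1] / lam[0])
print(c)                                          # roughly 166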

10.4 Correcting For Collinearity

10.4.1 Additional Observations

Professor Goldberger has quite aptly described multicollinearity as "micronumerosity", or not enough observations. Recall that the shift term depends on the difference between the null and alternative, the signal-noise ratio, and the sample size. For a given signal-noise ratio, unless collinearity is extreme, it can always be overcome by increasing the sample size sufficiently.

Moreover, we can sometimes gather more data that, hopefully, will not suffer the collinearity problem. With designed experiments, and cross-sections, this is particularly the case. With time series data this is not feasible, and in any event gathering more data is time-consuming and expensive.

10.4.2 Independent Estimation

Sometimes we can obtain outside estimates. For example, in the Ando-Modigliani consumption equation

C_t = β_0 + β_1 Y_t + β_2 W_t + u_t,   (10.37)

we might have a cross-sectional estimate of β_1, say β̂_1. Then

( C_t − β̂_1 Y_t ) = β_0 + β_2 W_t + u_t   (10.38)

becomes the new problem. Treating β̂_1 as known allows estimation of β_2 with an increase in precision. It would not reduce the precision of the estimate of β_1, which would simply be the cross-sectional estimate. The implied error term, moreover, is more complicated, since β̂_1 may be correlated with W_t. Mixed estimation approaches should be used to handle this approach carefully. Note that this is another way to gather more data.

10.4.3 Prior Restrictions

Consider the consumption equation from Klein's Model I:

C_t = β_0 + β_1 P_t + β_2 P_{t−1} + β_3 W_t + β_4 W_t′ + u_t,   (10.39)

where C_t is the consumption expenditure, P_t is profits, W_t is the private wage bill, and W_t′ is the government wage bill.

Due to market forces, W_t and W_t′ will probably move together and collinearity will be a problem for β_3 and β_4. However, there is no prior reason to discriminate between W_t and W_t′ in their effect on C_t. Thus it is reasonable to suppose that W_t and W_t′ impact C_t in the same way, that is, β_3 = β_4 = β. The model is now

C_t = β_0 + β_1 P_t + β_2 P_{t−1} + β( W_t + W_t′ ) + u_t,   (10.40)

which should avoid the collinearity problem.

10.4.4 Ridge Regression

One manifestation of collinearity is that the affected estimates, say β̂_1, will be extreme with a high probability. Thus,

Σ_{i=1}^k β̂_i^2 = β̂_1^2 + β̂_2^2 + ··· + β̂_k^2 = β̂'β̂   (10.41)

will be large with a high probability. By way of treating the disease by treating its symptoms, we might restrict β'β to be small. Thus, we might reasonably solve

min_β ( y − Xβ )'( y − Xβ )   subject to   β'β ≤ m.   (10.42)

Form the Lagrangian (since β̂'β̂ is large, we must impose the restriction with equality):

L = ( y − Xβ̃ )'( y − Xβ̃ ) + λ( m − β̃'β̃ )
  = Σ_{t=1}^n ( y_t − Σ_{i=1}^k β̃_i x_{ti} )^2 + λ( m − Σ_{i=1}^k β̃_i^2 ).   (10.43)

The first-order conditions yield

∂L/∂β̃_j = −2 Σ_t ( y_t − Σ_i β̃_i x_{ti} ) x_{tj} + 2λβ̃_j = 0,   (10.44)

or

Σ_t y_t x_{tj} = Σ_t Σ_i x_{ti} x_{tj} β̃_i + λβ̃_j
              = Σ_i β̃_i Σ_t x_{ti} x_{tj} + λβ̃_j,   (10.45)

for j = 1, 2, . . . , k. In matrix form, we have

X'y = ( X'X + λI_k ) β̃.   (10.46)


So, we have

β̃ = ( X'X + λI_k )^{-1} X'y.   (10.47)

This is called ridge regression. Substitution yields

β̃ = ( X'X + λI_k )^{-1} X'y
  = ( X'X + λI_k )^{-1} X'( Xβ + u )
  = ( X'X + λI_k )^{-1} X'X β + ( X'X + λI_k )^{-1} X'u   (10.48)

and

E[ β̃ ] = ( X'X + λI_k )^{-1} X'X β = Pβ,   (10.49)

so ridge regression is biased. Rather obviously, as λ grows large, the expectation "shrinks" towards zero, so the bias is towards zero. Next, we find that

var( β̃ ) = σ^2 ( X'X + λI_k )^{-1} X'X ( X'X + λI_k )^{-1} = σ^2 Q < σ^2 ( X'X )^{-1}.   (10.50)

If u ∼ N(0, σ^2 I_n), then

β̃ ∼ N( Pβ, σ^2 Q )   (10.51)

and inferences are possible only for Pβ and hence only for the complete vector. The rather obvious question in using ridge regression is: what is the best choice for λ? We seek to trade off the increased bias against the reduction in the variance. This may be done by considering the mean square error (MSE), which is given by

MSE( β̃ ) = σ^2 Q + ( P − I_k ) ββ' ( P − I_k )'
         = ( X'X + λI_k )^{-1} { σ^2 X'X + λ^2 ββ' } ( X'X + λI_k )^{-1}.

We might choose to minimize the determinant or trace of this matrix. Note that either is a decreasing function of λ through the inverses and an increasing function through the term in brackets. Note also that the minimand depends on the true unknown β, which makes it infeasible.

In practice, it is useful to obtain what is called a ridge trace, which plots out the estimates, estimated standard errors, and estimated square root of the mean squared error (SMSE) as a function of λ. Problematic terms will frequently display a change of sign and a dramatic reduction in the SMSE. If this phenomenon occurs at a sufficiently small value of λ, then the bias will be small, the inflation in SMSE relative to the standard error will be small, and we can conduct inference in something like the usual fashion. In particular, if the estimate of a particular coefficient seems to be significantly different from zero despite the bias toward zero, we can reject the null that it is zero.
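A minimal sketch of the ridge estimator (10.47) and a crude ridge trace, using simulated collinear data (the data-generating values and the grid of λ's are illustrative assumptions, not from the text).

import numpy as np

# Ridge estimator (10.47) on simulated, nearly collinear data.
rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)            # nearly collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
beta = np.array([1.0, 1.0, -0.5])
y = X @ beta + rng.normal(size=n)

def ridge(X, y, lam):
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

for lam in [0.0, 0.1, 1.0, 10.0, 100.0]:       # a crude ridge trace
    print(lam, ridge(X, y, lam))               # coefficients shrink toward zero as lam grows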

Chapter 11

Stochastic Explanatory Variables

11.1 Nature of Stochastic X

In previous chapters, we made the assumption that the x’s are nonstochastic, which means they are not random variables. This assumption was motivated by the control variables in controlled experiments, where we can choose the values of the independent variables. Such a restriction allows us to focus on the role of the disturbances in the process and was most useful in working out the stochastic properties of the estimators and other statistics. Unfortunately, economic data do not usually come to us in this form. In fact, the independent variables are typically random variables much like the dependent variable whose values are beyond the control of the researcher. Consequently we will restate our model and assumptions with an eye toward stochastic x. The model is

y_i = x_i'β + u_i for i = 1, 2, ..., n.   (11.1)

The assumptions with respect to unconditional moments of the disturbances are the same as before:

(i) E[ui] = 0

(ii) E[u_i^2] = σ^2

(iii) E[u_i u_j] = 0, j ≠ i

The assumptions with respect to x must be modified. We replace the assumption of x nonstochastic with an assumption regarding the joint stochastic behavior of u_i and x_i, which are taken to be jointly i.i.d. Several alternative cases will be introduced regarding the degree of dependence between x_i and u_i. For stochastic x, the assumption of linearly independent x's implies that the second moment matrix


of the x’s has full column rank and is hence positive definite. Stated formally, we have:

(iv) (ui,xi) jointly i.i.d. with {dependence assumption}

(v) E[x_i x_i'] = Q p.d.

Notice that the assumption of normality, which was introduced in previous chapters to facilitate inference, was not generally reintroduced. It will be introduced later, but only for one of the dependence cases. Thus we are effectively relaxing both the nonstochastic regressor and normality assumptions at the same time, except for one case. The motivation for dispensing with the normality assumption will become apparent presently.

We will now examine the various alternative assumptions that will be entertained with respect to the degree of dependence between x_i and u_i. Beyond giving a complete treatment of alternatives, there are good reasons to consider each of these alternatives at some length.

11.1.1 Independent X

The strongest assumption we can make relative to this relationship is that ui is stochastically independent of xi, so assumption (iv) becomes

(iv,a) (ui,xi) jointly i.i.d. with ui independent of xi.

This means that the distribution of u_i depends in no way on the value of x_i and vice versa. Note that

cov( g(x_i), h(u_i) ) = 0,   (11.2)

for any functions g(·) and h(·), in this case. When combined with the normality assumption below, this alternative is the only dependence assumption where the inferential results obtained in Chapter 8 continue to apply in finite samples.

11.1.2 Conditional Zero Mean

The next strongest assumption is that ui is mean independent of xi or E[ui|xi] = 0. We also impose variance independence so Assumption (iv) becomes

(iv,b) (u_i, x_i) jointly i.i.d. with E[u_i|x_i] = 0, E[u_i^2|x_i] = σ^2.

Note that this assumption implies

cov( g(x_i), u_i ) = 0,   (11.3)

for any function g(·). The independence assumption (iv,a), along with the unconditional statements E[u_i] = 0 and E[u_i^2] = σ^2, implies conditional zero mean and constant conditional variance, but not the reverse.

We could add conditional normality to this assumption, so u_i|x_i ∼ N(0, σ^2), but this implies that u_i is independent of x_i, so we are really in the previous case. The usual inferences developed in Chapter 8 will be problematic in finite samples for the current case without normality, but turn out to be appropriate in large samples whether or not the disturbances are normal. Moreover, the least squares estimator will be unbiased and will enjoy optimality properties within the class of unbiased estimators.

This assumption is motivated by supposing that our model is simply a statement of conditional expectation, E[y_i|x_i] = x_i'β, and sometimes it will not be accompanied by the conditional second moment assumption E[u_i^2|x_i] = σ^2. In this case of nonconstant variance, which will be considered at length in Chapter 13, the large sample behavior of least squares is the same as for the uncorrelated case considered next.

11.1.3 Uncorrelated X

A weaker assumption is that xi and ui are only uncorrelated, so Assumption (iv) becomes

(iv,c) (u_i, x_i) jointly i.i.d. with E[x_i u_i] = 0.

This assumption only implies zero covariance in the levels of x_i and u_i, or, for one element of x_i,

cov( x_{ij}, u_i ) = 0.   (11.4)

The properties of β̂ are less accessible in this case. Note that conditional zero mean always implies uncorrelatedness, but not the reverse. It is possible to have random variables that are uncorrelated where neither has constant conditional mean given the other. This assumption is the weakest of the alternatives under which least squares will continue to be consistent. In general, the conditional second moment will also be nonconstant in this case. Such nonconstant conditional variance is called heteroskedasticity, and will be studied at length in Chapter 13.

11.1.4 Correlated X

A priori information sometimes suggests the possibility that u_i is correlated with x_i, so Assumption (iv) becomes

(iv,d) (u_i, x_i) jointly i.i.d. with E[x_i u_i] = d, d ≠ 0.

Stated another way

E[ x_{ij} u_i ] ≠ 0   (11.5)

for some j. As we shall see below, this can have quite serious implications for the OLS estimates. Consequently we will spend considerable time developing possible solutions and tests for this condition, which is a violation of (iv,c).

Examples of models which are afflicted with this difficulty abound in the applied econometric literature. A leading example is the case of simultaneous equations models, which we will examine later in this chapter. A second leading example occurs when our right-hand side variables are measured with error. Suppose

y_i = α + βx_i + u_i   (11.6)

is the true model but

x_i* = x_i + v_i   (11.7)

is the only available measurement of x_i. If we use x_i* in our regression, then we are estimating the model

y_i = α + β( x_i* − v_i ) + u_i

    = α + βx_i* + ( u_i − βv_i ).   (11.8)

Now, even if the measurement error v_i were uncorrelated with the disturbance u_i, the right-hand side variable x_i* will be correlated with the effective disturbance ( u_i − βv_i ).
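A small simulation sketch of this point (all parameter values are illustrative): even with a very large sample, regressing y on the mismeasured x* recovers roughly β·var(x)/(var(x) + var(v)) rather than β.

import numpy as np

# Errors-in-variables bias: OLS on the mismeasured regressor is attenuated.
rng = np.random.default_rng(3)
n = 100_000
alpha, beta = 1.0, 2.0
x = rng.normal(size=n)                     # true regressor
u = rng.normal(size=n)                     # disturbance
v = rng.normal(size=n)                     # measurement error
y = alpha + beta * x + u
x_star = x + v                             # observed, mismeasured regressor

Z = np.column_stack([np.ones(n), x_star])
b = np.linalg.solve(Z.T @ Z, Z.T @ y)
print(b[1])                                # near beta*var(x)/(var(x)+var(v)) = 1, not 2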

11.2 Consequences of Stochastic X

11.2.1 Consequences for OLS Estimation

Recall that

β̂ = ( X'X )^{-1} X'y   (11.9)
  = ( X'X )^{-1} X'( Xβ + u )
  = β + ( X'X )^{-1} X'u
  = β + ( (1/n) X'X )^{-1} (1/n) X'u
  = β + ( (1/n) Σ_{j=1}^n x_j x_j' )^{-1} ( (1/n) Σ_{i=1}^n x_i u_i ).

We will now examine the bias and consistency properties of the estimators under the alternative dependence assumptions.

11.2.1.1 Uncorrelated X

Suppose, under Assumption (iv,c), that x_i is only assured of being uncorrelated with u_i. Rewrite the second term in (11.9) as

( (1/n) Σ_{j=1}^n x_j x_j' )^{-1} ( (1/n) Σ_{i=1}^n x_i u_i ) = (1/n) Σ_{i=1}^n [ ( (1/n) Σ_{j=1}^n x_j x_j' )^{-1} x_i ] u_i
                                                             = (1/n) Σ_{i=1}^n w_i u_i.

Note that w_i is a function of x_i and of the other x_j, and is nonlinear in x_i. Now, u_i is uncorrelated with the x_j for j ≠ i by the i.i.d. assumption across observations, and with the level of x_i by Assumption (iv,c), but it is not necessarily uncorrelated with this nonlinear function of x_i. Thus, if the expectation exists,

E[ ( X'X )^{-1} X'u ] ≠ 0,   (11.10)

in general, whereupon

E[β̂] = β + E[ ( X'X )^{-1} X'u ] ≠ β.   (11.11)

Similarly, we find E[s^2] ≠ σ^2. Thus both β̂ and s^2 will be biased, although the bias may disappear asymptotically, as we will see below. Note that sometimes these expectations are not well defined.

Now, the elements of x_i x_i' and x_i u_i are i.i.d. random variables with expectations Q and 0, respectively. Thus, the law of large numbers guarantees that

(1/n) Σ_{i=1}^n x_i x_i' →_p E[x_i x_i'] = Q,   (11.12)

and

(1/n) Σ_{i=1}^n x_i u_i →_p E[x_i u_i] = 0.   (11.13)

It follows that

plim_{n→∞} β̂ = β + plim_{n→∞} [ ( (1/n) X'X )^{-1} (1/n) X'u ]
             = β + plim_{n→∞} ( (1/n) X'X )^{-1} · plim_{n→∞} (1/n) X'u
             = β + Q^{-1} · 0 = β.   (11.14)

Similarly, we can show that

plim_{n→∞} s^2 = σ^2.   (11.15)

Thus both β̂ and s^2 will be consistent.

11.2.1.2 Conditional Zero Mean

Suppose Assumption (iv,b) is satisfied, then E[ui|xi] = 0 and, by independence across i, we have E(u|X) = 0. It follows that

E[β̂] = β + E[ ( X'X )^{-1} X'u ]
     = β + E[ ( X'X )^{-1} X'E(u|X) ]
     = β + E[ ( X'X )^{-1} X' · 0 ] = β,

and OLS is unbiased. Since conditional zero mean implies uncorrelatedness, we have the same consistency results as before, namely

plim_{n→∞} β̂ = β and plim_{n→∞} s^2 = σ^2.   (11.16)

Now E[u_i^2|x_j] = σ^2 for i ≠ j by the i.i.d. assumption, and E[u_i^2|x_i] = σ^2 by (iv,b), hence E(uu'|X) = σ^2 I_n. It follows that

E[ (β̂ − β)(β̂ − β)' | X ] = σ^2 ( X'X )^{-1}

and the Gauss-Markov theorem continues to hold in the sense that least squares is BLUE, given X, within the class of estimators that are linear in y with the linear transformation matrix a function only of X. Under this conditional covariance assumption, we can show unbiasedness, in a similar fashion, for the variance estimator:

E[s^2] = E[ e'e/(n − k) ]
       = (1/(n − k)) E[ u'( I_n − X(X'X)^{-1}X' )u ]
       = σ^2.   (11.17)

This estimator, being quadratic in u, and hence y, can be shown to be the best quadratic unbiased estimator BQUE of σ2.

11.2.1.3 Independent X

Suppose Assumption (iv,a) holds; then x_i is independent of u_i. Since (i) and (ii) assure that E[u_i] = 0 and E[u_i^2] = σ^2, we have conditional zero mean and constant conditional variance, and the corresponding unbiasedness results

E[β̂] = β and E[s^2] = σ^2,   (11.18)

together with the BLUE and BQUE properties and the consistency results

plim_{n→∞} β̂ = β and plim_{n→∞} s^2 = σ^2.   (11.19)

11.2.1.4 Correlated X

Finally, suppose Assumption (iv,d) applies, so x_i is correlated with u_i, and

E[x_i u_i] = d ≠ 0.   (11.20)

Obviously, since x_i is correlated with u_i, there is no reason to believe E[ ( X'X )^{-1} X'u ] = 0, so

E[β̂] ≠ β

and the OLS estimator will, in general, be biased. Moreover, this bias will not disappear in large samples, as we will see below.

Turning to the possibility of consistency, we see, by the law of large numbers, that

(1/n) Σ_{i=1}^n x_i u_i →_p E[x_i u_i] = d,   (11.21)

whereupon

plim_{n→∞} β̂ = β + Q^{-1} · d ≠ β   (11.22)

since Q^{-1} is nonsingular and d is nonzero. Thus OLS is also inconsistent. It will follow that s^2 will also be biased and inconsistent.

11.2.2 Consequences for Inferences

In previous chapters, the assumption of unconditionally normal disturbances, specifically

u ∼ N( 0, σ^2 I_n ),

was introduced to facilitate inferences. Together with the nonstochastic regressor assumption, it implied that the distribution of the least squares estimator,

β̂ = β + ( X'X )^{-1} X'u,

which is linear in the disturbances, has an unconditional normal distribution, specifically

β̂ ∼ N( β, σ^2 ( X'X )^{-1} ).

All the inferential results from Chapter 8 based on the t and F distributions followed directly.

If the x's are random variables, however, the unconditional normality of the disturbances is not sufficient to obtain these results. Accordingly, we either strengthen the assumption to conditional normality or abandon normality completely. In this section, we consider these two alternatives for the independence (iv,a) and conditional zero mean (iv,b) cases, respectively. The other two cases will be treated in the next section in a more general context, and normality will not be assumed.

11.2.2.1 Conditional Normality - Finite Samples

We first consider the normality assumption for the independence case. Specifically, we add the assumption

(vi) u_i ∼ N(0, σ^2)

This assumption, together with Assumption (iv,a) for the independent case, implies

u_i|x_i ∼ N(0, σ^2).

Moreover,

u_i|x_j ∼ N(0, σ^2)

for i ≠ j, due to the joint i.i.d. assumption. Thus, under (iv,a) and (vi), we have

u|X ∼ N( 0, σ^2 I_n ),

which does not depend on X, and

β̂|X ∼ N( β, σ^2 ( X'X )^{-1} ),

which does depend on the conditioning variables X. Fortunately, the distributions of the statistics we utilize for inference do not depend on the conditioning values. Specifically, it is easy to see that

( β̂_j − β_j ) / √( σ^2 d^{jj} ) | X ∼ N(0, 1),

while we can show that

( n − k ) s^2/σ^2 | X ∼ χ^2_{n−k}

and is conditionally independent of β̂|X. Whereupon, following the arguments in Chapter 8, we find

( β̂_j − β_j ) / √( s^2 d^{jj} ) | X ∼ t_{n−k},

which does not depend on the conditioning values. Since the resulting distribution does not depend on X, the unconditional distribution of the usual ratio is the same as before. A similar result applies for all the ratios that were found to follow a t distribution in Chapter 8. Likewise, the statistics that were found to follow an unconditional F distribution in that chapter have the same unconditional distribution here. The non-central distributions that applied under the alternative hypotheses there, and depended on X, will also apply here given X. Thus we see that treating the X matrix as given here is essentially the same as treating it as nonstochastic in Chapter 8, which is hardly surprising.

11.2.2.2 Conditional Non-Normality - Large Samples

In the event that u is not conditionally normal, the estimator will not be conditionally normal, the standardized ratios will not be standard normal, and the feasible standardized ratio will not follow the t_{n−k} distribution in finite samples. Fortunately, we can appeal to the central limit theorem for help in large samples.

We shall first develop the large-sample asymptotic distribution of β̂ under Assumption (iv,b), the case of conditional zero mean and constant conditional variance. The limiting behavior is identical for Assumption (iv,a), the independence assumption, since together with Assumptions (i) and (ii) it implies conditional zero mean with constant conditional variance. Recall that

plim_{n→∞} ( β̂ − β ) = 0   (11.23)

in this case, so in order to have a nondegenerate distribution we consider

√n ( β̂ − β ) = ( (1/n) X'X )^{-1} (1/√n) X'u.   (11.24)

The j-th element of

(1/√n) X'u = (1/√n) Σ_{i=1}^n x_i u_i   (11.25)

is

(1/√n) Σ_{i=1}^n x_{ij} u_i.   (11.26)

Note that the x_{ij} u_i are i.i.d. random variables with

E[x_{ij} u_i] = E[ E[x_{ij} u_i | x_i] ] = E[ x_{ij} E[u_i | x_i] ] = E[ x_{ij} · 0 ] = 0   (11.27)

and

E[ ( x_{ij} u_i )^2 ] = E[ x_{ij}^2 u_i^2 ] = E[ E[x_{ij}^2 u_i^2 | x_i] ] = E[ x_{ij}^2 E[u_i^2 | x_i] ] = E[ x_{ij}^2 σ^2 ] = σ^2 q_{jj},   (11.28)

where q_{jj} = E[x_{ij}^2] is the jj-th element of Q. Thus, according to the central limit theorem,

(1/√n) Σ_{i=1}^n x_{ij} u_i →_d N( 0, σ^2 q_{jj} ).   (11.29)

And, more generally,

(1/√n) Σ_{i=1}^n x_i u_i = (1/√n) X'u →_d N( 0, σ^2 Q ).   (11.30)

Since (1/n) X'X converges in probability to the fixed matrix Q, we have

√n ( β̂ − β ) = ( (1/n) X'X )^{-1} (1/√n) X'u →_d N( 0, σ^2 Q^{-1} ).   (11.31)

For inferences,

√n ( β̂_j − β_j ) →_d N( 0, σ^2 q^{jj} )   (11.32)

and

√n ( β̂_j − β_j ) / √( σ^2 q^{jj} ) →_d N(0, 1),   (11.33)

where q^{jj} denotes the jj-th element of Q^{-1}.

Unfortunately, neither σ^2 nor Q^{-1} is available, so we will have to substitute estimators. We can use s^2 as a consistent estimator of σ^2 and

Q̂ = (1/n) X'X   (11.34)

as a consistent estimator of Q. Substituting, we have

√n ( β̂_j − β_j ) / √( s^2 q̂^{jj} ) = √n ( β̂_j − β_j ) / √( s^2 [ ( (1/n) X'X )^{-1} ]_{jj} ) = ( β̂_j − β_j ) / √( s^2 d^{jj} ) →_d N(0, 1).   (11.35)

Thus, the usual statistics we use in conducting t-tests are asymptotically standard normal. This is particularly convenient since the t distribution converges to the standard normal. The small-sample inferential procedures we learned for the nonstochastic regressor case are appropriate in large samples for the stochastic regressor case, under conditional zero mean and constant conditional variance. And this will be true whether or not conditional normality applies.

In a similar fashion, we can show that the approaches introduced in previous chapters for inference on complex hypotheses, which had an F distribution under normality with nonstochastic regressors, continue to be appropriate in large samples with non-normality and stochastic regressors. For example, consider again the model

y = X_1 β_1 + X_2 β_2 + u   (11.36)

with H_0: β_2 = 0 and H_1: β_2 ≠ 0. Regression on this unrestricted model yields SSE_u, while regression on the restricted model y = X_1 β_1 + u yields SSE_r. We form the statistic [ (SSE_r − SSE_u)/k_2 ] / [ SSE_u/(n − k) ], where k is the unrestricted number of regressors and k_2 is the number of restrictions. Under conditional normality this statistic will have an F_{k_2, n−k} distribution. Note that asymptotically, as n becomes large, the denominator converges to σ^2 and the F_{k_2, n−k} distribution converges to a χ^2_{k_2} distribution (divided by k_2). But, following the arguments of the previous paragraph, this would also be the limiting distribution of this statistic under the conditional zero mean with constant conditional variance assumption, even if the disturbances are non-normal.
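A Monte Carlo sketch of this large-sample claim (the sample size, replication count, and skewed disturbance distribution are illustrative choices): with stochastic regressors and non-normal disturbances, the usual t ratio is approximately standard normal.

import numpy as np

# With stochastic regressors and skewed, mean-zero disturbances, the usual
# t ratio for a single coefficient is close to N(0,1) in large samples.
rng = np.random.default_rng(4)
n, reps = 500, 2000
beta = np.array([1.0, 0.5])
tstats = []
for _ in range(reps):
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    u = rng.exponential(size=n) - 1.0            # skewed, mean-zero disturbances
    y = X @ beta + u
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    s2 = e @ e / (n - 2)
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    tstats.append((b[1] - beta[1]) / se)
tstats = np.array(tstats)
print(tstats.mean(), tstats.std())               # roughly 0 and 1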

11.3 Correcting for Correlated X

11.3.1 Instruments

Consider

y = Xβ + u   (11.37)

and premultiply by X' to obtain

X'y = X'Xβ + X'u
(1/n) X'y = (1/n) X'Xβ + (1/n) X'u.   (11.38)

If X is uncorrelated with u, as in Assumption (iv,c), then the last term disappears in large samples:

plim_{n→∞} (1/n) X'y = plim_{n→∞} (1/n) (X'X) β,   (11.39)

which may be solved to obtain

β = plim_{n→∞} [ ( (1/n) X'X )^{-1} (1/n) X'y ] = plim_{n→∞} ( X'X )^{-1} X'y = plim_{n→∞} β̂.   (11.40)

Of course, if X is correlated with u, as in Assumption (iv,d), then

plim_{n→∞} (1/n) X'u = d and plim_{n→∞} β̂ ≠ β.   (11.41)

Suppose we can find similarly dimensioned i.i.d. variables z_i that are uncorrelated with u_i, so that

E[z_i u_i] = 0.   (11.42)

Also, suppose that the zi are correlated with xi so

E[z_i x_i'] = P   (11.43)

and P is nonsingular. Such variables are known as instruments for the variables x_i. Note that some of the elements of x_i may be uncorrelated with u_i, in which case the analogous elements of z_i will be the same and only the elements corresponding to correlated variables are replaced. Examples will be presented shortly.

We can now summarize these conditions, plus some second moment conditions that will be needed below, by expanding Assumptions (iv,d) and (v) as

(iv,d) (u_i, x_i, z_i) jointly i.i.d. with E[x_i u_i] = d, E[z_i u_i] = 0.

(v) E[x_i x_i'] = Q, E[z_i x_i'] = P, E[z_i z_i'] = N, E[u_i^2 x_i x_i'] = M, E[u_i^2 z_i z_i'] = G.

Note that for OLS estimation, z_i = x_i and P = Q, while for OLS under case (iv,c), d = 0 and G = M.

11.3.2 Instrumental Variable (IV) Estimation

Suppose that, analogous to OLS, we premultiply (11.37) by

Z' = ( z_1, z_2, . . . , z_n )   (11.44)

to obtain

Z'y = Z'Xβ + Z'u
(1/n) Z'y = (1/n) Z'Xβ + (1/n) Z'u.   (11.45)

But since Eziui = 0, then

plim_{n→∞} (1/n) Z'u = plim_{n→∞} (1/n) Σ_{i=1}^n z_i u_i = 0,   (11.46)

so

plim_{n→∞} (1/n) Z'y = plim_{n→∞} (1/n) Z'Xβ   (11.47)

or

β = plim_{n→∞} [ ( (1/n) Z'X )^{-1} (1/n) Z'y ] = plim_{n→∞} ( Z'X )^{-1} Z'y = plim_{n→∞} β̃,   (11.48)

where

β̃ = ( Z'X )^{-1} Z'y   (11.49)

is defined as the instrumental variable (IV) estimator. Note that OLS is an IV estimator with X chosen as the instruments.

It is instructive to look at a simple example to get a better idea of what is meant by instrumental variables. Consider again the measurement error model (11.6)-(11.8). Suppose that y_i is the wage received by an individual and x_i is the unobservable variable ability. We have a measurement (with error) of x_i, say x_i*, which is the score on an IQ test for the individual, so x_i' = (1, x_i*). We have a second, perhaps rougher, measurement (with error) of x_i, say z_i, which is the score on a knowledge-of-the-work-world test, whereupon z_i' = (1, z_i). Hopefully, the measurement errors on the regressor x_i* and the instrument z_i are uncorrelated, so the conditions for IV to be consistent will be met.
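A sketch of the IV estimator (11.49) in the spirit of this example, using simulated data in which the instrument is a second, independently mismeasured reading of the latent regressor (all names and parameter values are illustrative).

import numpy as np

# IV estimator (Z'X)^{-1} Z'y on simulated errors-in-variables data.
rng = np.random.default_rng(5)
n = 50_000
alpha, beta = 1.0, 2.0
x = rng.normal(size=n)                         # latent regressor ("ability")
u = rng.normal(size=n)
y = alpha + beta * x + u
x_star = x + rng.normal(size=n)                # observed regressor (with error)
z = x + rng.normal(size=n)                     # instrument: independent second measurement

X = np.column_stack([np.ones(n), x_star])
Z = np.column_stack([np.ones(n), z])
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)       # (Z'X)^{-1} Z'y
print(b_ols[1], b_iv[1])                       # OLS attenuated (near 1), IV near 2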

11.3.3 Properties of the IV Estimator

We have just shown that

plim_{n→∞} β̃ = β,   (11.50)

so the IV estimator is consistent. In small samples, since

β̃ = ( Z'X )^{-1} Z'y
  = ( Z'X )^{-1} Z'( Xβ + u )
  = β + ( Z'X )^{-1} Z'u,   (11.51)

we generally have bias, since we are only assured that z_i is uncorrelated with u_i, but not that ( Z'X )^{-1} Z' is uncorrelated with u. The bias, however, will disappear, as demonstrated in the limiting distribution.

Since the estimator is consistent, we need to rescale to obtain the limiting distribution. After some simple rearranging we have

√n ( β̃ − β ) = ( (1/n) Z'X )^{-1} (1/√n) Z'u   (11.52)
             = ( (1/n) Σ_{i=1}^n z_i x_i' )^{-1} (1/√n) Σ_{i=1}^n z_i u_i.   (11.53)

The expression being averaged inside the inverse is i.i.d. and has mean P, so by the law of large numbers (1/n) Σ_{i=1}^n z_i x_i' →_p E[z_i x_i'] = P. The expression inside the final summation is i.i.d., has mean 0 and covariance G, so, by the central limit theorem, (1/√n) Σ_{i=1}^n z_i u_i →_d N(0, G). Combining these two results, we find

√n ( β̃ − β ) →_d N( 0, P^{-1} G P'^{-1} ).   (11.54)

Note that any bias has disappeared in the limiting distribution. So if the mean of the estimator exists in finite samples, the limit of the expectation must be zero and the estimator is asymptotically unbiased.

This limiting normality result for instrumental variables is very general and yields the limiting distribution when Z = X, and hence β̃ = β̂, in each of the cases where OLS was found to be consistent. For independence (Assumption (iv,a)) or conditional zero mean with constant conditional covariance (Assumption (iv,b)), we have P = Q and G = σ^2 Q, whereupon

√n ( β̂ − β ) →_d N( 0, σ^2 Q^{-1} ),   (11.55)

which is the same as (11.31). For the uncorrelated case (Assumption (iv,c)), or conditional zero mean with nonconstant conditional covariance, we have P = Q and G = M, whereupon

√n ( β̂ − β ) →_d N( 0, Q^{-1} M Q^{-1} ).   (11.56)

√ 1 1 n(β − β) −→d N(0, Q− MQ− ). (11.56)

This last result will play anb important role in Chapter 13 when we study het- eroskedasticity. In order to make these limiting distribution results operational for inference purposes, we must estimate the covariance matrices. Let

u = y − Xβ (11.57) be the IV residuals, then, under our assumptions, we can show. e e n 1 1 2 P= Z0X −→ P, G = u z x0 −→ G n p n i i i p i=1 X b b e 1 1 So a consistent covariance matrix is provided by P− GP0− . For the case where least squares is valid, we have Q from above and the rather obvious consistent 1 n 2 estimator of M is M = n i=1 ui xixi0 . b b b b P c b 158 CHAPTER 11. STOCHASTIC EXPLANATORY VARIABLES

In most research that has involved the use of instrumental variables, the disturbances have been assumed to be uncorrelated with the instruments, and the conditional variances (given the values of the instruments) are assumed to be constant. Thus E[u_i^2 z_i z_i'] = G = σ^2 N and we have

√n ( β̃ − β ) →_d N( 0, σ^2 P^{-1} N P'^{-1} ).

Again, the result for OLS under independence or conditional zero mean and constant conditional variance is obtained as a special case when Z = X. Estimators of the covariance components are provided by

N̂ = (1/n) Z'Z →_p N,
σ̃^2 = ũ'ũ/n = (1/n) Σ_{i=1}^n ũ_i^2 →_p σ^2.   (11.58)

Thus

σ̃^2 P̂^{-1} N̂ P̂'^{-1} = n · σ̃^2 [ ( Z'X )^{-1} Z'Z ( X'Z )^{-1} ]

is a consistent estimator of the limiting covariance matrix and, with the omission of n, is the standard output of most IV packages. This covariance estimator (taking account of the scaling by n) can be used in ratios and quadratic forms to obtain asymptotically appropriate statistics. For example,

( β̃_j − β_j ) / √( σ̃^2 [ ( Z'X )^{-1} Z'Z ( X'Z )^{-1} ]_{jj} ) →_d N(0, 1),   (11.59)

with β_j = 0, is the standard ratio printed by IV packages and is asymptotically appropriate for testing the null that the coefficient in question is zero. Similarly,

( Rβ̃ − r )' [ σ̃^2 R ( Z'X )^{-1} Z'Z ( X'Z )^{-1} R' ]^{-1} ( Rβ̃ − r ) →_d χ^2_q

is asymptotically appropriate for testing the null Rβ = r, where q is the number of restrictions. Note that the scaling by n has cancelled in each case.
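A sketch of the covariance estimator and t ratio just described, continuing the simulated errors-in-variables setup (and assuming constant conditional variance given the instruments, as above).

import numpy as np

# Standard IV covariance estimator sigma^2 (Z'X)^{-1} Z'Z (X'Z)^{-1} and a t ratio.
rng = np.random.default_rng(6)
n = 10_000
x = rng.normal(size=n)
u = rng.normal(size=n)
y = 1.0 + 2.0 * x + u
x_star = x + rng.normal(size=n)
z = x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x_star])
Z = np.column_stack([np.ones(n), z])
b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)
u_tilde = y - X @ b_iv                          # IV residuals
sigma2 = u_tilde @ u_tilde / n
ZX_inv = np.linalg.inv(Z.T @ X)
cov = sigma2 * ZX_inv @ (Z.T @ Z) @ ZX_inv.T    # sigma^2 (Z'X)^{-1} Z'Z (X'Z)^{-1}
t = b_iv[1] / np.sqrt(cov[1, 1])                # test of H0: slope = 0
print(b_iv[1], np.sqrt(cov[1, 1]), t)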

11.3.4 Optimal Instruments

The instruments z_i cannot be just any variables that are independent of and uncorrelated with u_i. They should be as closely related to x_i as possible while at the same time remaining uncorrelated with u_i.

Looking at the asymptotic covariance matrices P^{-1} G P'^{-1} or σ^2 P^{-1} N P'^{-1}, we can see that, as z_i and x_i become unrelated and hence uncorrelated,

plim_{n→∞} (1/n) Z'X = P   (11.60)

goes to zero. The inverse of P consequently grows large and P^{-1} G P'^{-1} will become large. Thus the consequence of using z_i that are not close to x_i is

imprecise estimates. It is easy to find variables that are uncorrelated with u_i, but sometimes more difficult to find variables that are also sufficiently closely related to x_i. Much of the applied econometric literature is dominated by the search for such instruments.

In fact, we can speak of optimal instruments as being all of x_i except the part that is correlated with u_i. For models where there is an explicit relationship between u_i and x_i, the optimal instruments can be found and utilized, at least asymptotically. For example, suppose the data generating process for x_i is given by

x_i = Π w_i + v_i,   (11.61)

where v_i is the part of x_i that is linearly related to u_i and Π w_i is the remainder. If w_i is observable, then we can show that a lower bound for the limiting covariance matrix for IV is obtained when we use z_i = Π w_i as the instruments. If Π is unknown, we can estimate it from the relationship above and use the feasible instruments ẑ_i = Π̂ w_i, which will yield the same asymptotically optimal behavior.

11.4 Detecting Correlated X

The motivation for using instrumental variables is compelling when the regressors are correlated with the disturbances (d ≠ 0). When the regressors are uncorrelated with the disturbances (d = 0), the motivation for using OLS instead is equally compelling. In fact, for independence (Assumption (iv,a)) and conditional zero mean with constant conditional variance (Assumption (iv,b)), OLS has well demonstrated optimality properties. Accordingly, it behooves us to determine which is the relevant state. We will set this up as a test of the null hypothesis that d = 0 against the alternative that d ≠ 0.

11.4.1 An Incorrect Procedure

With previously encountered problems of OLS, we have examined the OLS residuals for signs of the problem. In the present case, where the problem is that u_i is correlated with x_i, we might naturally see if our proxy for u_i, the OLS residuals e_i, are correlated with x_i. Thus, the estimated covariance

(1/n) Σ_{i=1}^n x_i e_i = (1/n) X'e   (11.62)

might be taken as an indication of any correlation between x_i and u_i. Unfortunately, one of the properties of OLS guarantees that

X'e = 0   (11.63)

whether or not u_i is correlated with x_i. Thus, this procedure will not be informative.

11.4.2 A Priori Information

Typically, we know that x_i is correlated with u_i as a result of the structure of the model, as in the errors-in-variables model considered above. In such cases, the candidates for instruments are often evident.

Another leading case occurs with simultaneous equation models. For example, consider the consumption equation

Ct = α + βYt + ut, (11.64) where income, Yt, is defined by the identity

Yt = Ct + Gt. (11.65)

Substituting (11.64) into (11.65), we obtain

Y_t = α + βY_t + u_t + G_t,   (11.66)

and solving for Y_t,

Y_t = α/(1 − β) + (1/(1 − β)) G_t + (1/(1 − β)) u_t.   (11.67)

Rather obviously, Y_t is linearly related to, and hence correlated with, u_t. A candidate as an instrument for Y_t is the exogenous variable G_t.

11.4.3 An IV Approach

In both the simultaneous equation model and the measurement error model, the possibility of the regressors being correlated with the disturbances is only suggested. It is entirely possible that the measured-with-error variables or the right-hand side endogenous variables are not sufficiently correlated with the disturbances to justify the use of IV. The real question is what is the appropriate metric for "sufficiently correlated". The answer will be based on a careful comparison of the OLS and IV estimators.

Under the null hypothesis d = 0, we know that both OLS and IV will be consistent, so

plim_{n→∞} β̂ = β = plim_{n→∞} β̃.

1 plim β = β + Q− d =6 β = plim β. n n →∞ →∞ Thus a test can be formulated on the difference betweene the two estimators, 1 which converges to Q− d. Under the null this will be zero but under the √ d 1 1 alternative it will be nonzero. Recall that n(β −β) −→ N(0, P− GP− ) and √ d 1 1 under the null n(β−β) −→ N(0, Q− MQ− ) so we will look at the normalized e b 11.4. DETECTING CORRELATED X 161

difference √n( β̂ − β̃ ). In order to determine the limiting distribution of the difference, we need to add an additional fourth moment condition to Assumption (v), specifically E[u_i^2 z_i x_i'] = H. With this assumption, it is straightforward to show, under the null, that

√n( β̂ − β̃ ) →_d N( 0, P^{-1} G P'^{-1} − P^{-1} H Q^{-1} − Q^{-1} H' P'^{-1} + Q^{-1} M Q^{-1} ).   (11.68)

Under the alternative hypothesis this scaled difference will diverge at the rate √n. The most powerful use of this result can be based on a quadratic form that jointly tests the entire vector d = 0. Specifically, we find

n · ( β̂ − β̃ )' ( P^{-1} G P'^{-1} − P^{-1} H Q^{-1} − Q^{-1} H' P'^{-1} + Q^{-1} M Q^{-1} )^{-1} ( β̂ − β̃ ) →_d χ^2_k

if the limiting covariance matrix in (11.68) is nonsingular. It is possible that this matrix is not full rank, in which case we use the Moore-Penrose generalized inverse to obtain

n · ( β̂ − β̃ )' ( P^{-1} G P'^{-1} − P^{-1} H Q^{-1} − Q^{-1} H' P'^{-1} + Q^{-1} M Q^{-1} )^+ ( β̂ − β̃ ) →_d χ^2_q,

where the superscript + indicates a generalized inverse and q is the rank of the covariance matrix. To make these statistics feasible, we need to use the estimators of the covariance matrices Q, M, P, and G introduced above and an obvious analogous estimator of H. The limiting distributions will be unchanged.

The approach discussed above is very general and quite feasible, but the covariance matrix used is somewhat more complicated than those in the tests one sees in the literature. In order to simplify the covariance matrix, a little more structure is required on the relationship between the regressors and instruments, and an additional assumption is added. Specifically, we can decompose z_i into the component linearly related to x_i and a residual, whereupon we can write

z_i = B x_i + ε_i,

where B = P Q^{-1} and ε_i is orthogonal to x_i by construction. Since z_i is orthogonal to u_i by the properties of instruments and, under the null hypothesis, x_i is orthogonal to u_i, then ε_i is also orthogonal to u_i. Beyond mutual orthogonality of ε_i, x_i, and u_i, we need to assume E[u_i^2 ε_i x_i'] = 0. Under the null and this additional assumption, we find Q^{-1} H' P'^{-1} = Q^{-1} M Q^{-1}, so

√n( β̂ − β̃ ) →_d N( 0, P^{-1} G P'^{-1} − Q^{-1} M Q^{-1} ),   (11.69)

and the covariance of the difference (suitably scaled) is the difference of the covariances. The quadratic form test will be similarly simplified and made feasible by consistently estimating the covariance components.

One case where this assumption is satisfied, and which has received widespread attention in the literature, occurs under the constant conditional variance condition E[u_i^2 | z_i, x_i] = σ^2, whereupon E[u_i^2 ε_i x_i'] = E[ E[u_i^2 ε_i x_i' | z_i, x_i] ] = E[ E[u_i^2 ε_i x_i' | ε_i, x_i] ] = E[ σ^2 ε_i x_i' ] = 0 by definition. Moreover, the covariance matrix simplifies further, since G = σ^2 N and M = σ^2 Q, so

√n( β̂ − β̃ ) →_d N( 0, σ^2 P^{-1} N P'^{-1} − σ^2 Q^{-1} ),

and the quadratic form test becomes

n · ( β̂ − β̃ )' ( σ^2 P^{-1} N P'^{-1} − σ^2 Q^{-1} )^{-1} ( β̂ − β̃ ) →_d χ^2_k.   (11.70)

Again, it is possible that the weight matrix in the quadratic form is rank deficient, in which case we use the generalized inverse instead and the degrees of freedom equal the rank. And a feasible version of the statistic requires the consistent estimators of the components of the covariance that were introduced above.

This class of tests, based on the difference between an estimator that is consistent under both the null and the alternative and another estimator that is consistent only under the null, are called Hausman-type tests. Strictly speaking, the Hausman test compares an estimator that is efficient under the null and inconsistent under the alternative with one that is consistent under both. The earliest example of a test of this type was the Wu test, which was applied to the simultaneous equation model and has the feasible version of the form given in (11.70).
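A sketch of a Hausman-type contrast in this spirit, applied to the slope coefficient alone and using the constant-conditional-variance simplification, on simulated errors-in-variables data (all values are illustrative); because the regressor is endogenous by construction, the statistic lands far out in the χ^2(1) tail.

import numpy as np

# Hausman-type contrast for one coefficient: (b_iv - b_ols)^2 / (var_iv - var_ols).
rng = np.random.default_rng(7)
n = 10_000
x = rng.normal(size=n)
u = rng.normal(size=n)
y = 1.0 + 2.0 * x + u
x_star = x + rng.normal(size=n)                 # mismeasured regressor (correlated with the error)
z = x + rng.normal(size=n)                      # instrument

X = np.column_stack([np.ones(n), x_star])
Z = np.column_stack([np.ones(n), z])
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)

e = y - X @ b_ols
s2 = e @ e / (n - 2)
var_ols = s2 * np.linalg.inv(X.T @ X)[1, 1]     # estimated OLS slope variance

u_tilde = y - X @ b_iv
sigma2 = u_tilde @ u_tilde / n
ZX_inv = np.linalg.inv(Z.T @ X)
var_iv = sigma2 * (ZX_inv @ (Z.T @ Z) @ ZX_inv.T)[1, 1]   # estimated IV slope variance

H = (b_iv[1] - b_ols[1]) ** 2 / (var_iv - var_ols)        # compare with chi-square(1)
print(H)                                                   # huge here: endogeneity is detected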

Chapter 12

Nonscalar Covariance

12.1 Nature of the Problem

12.1.1 Model and Ideal Conditions

Consider the model

y = Xβ + u,   (12.1)

where y is the n × 1 vector of observations on the dependent variable, X is the n × k matrix of observations on the explanatory variables, and u is the vector of unobservable disturbances. The ideal conditions are

(i) E[u] = 0

(ii & iii) E[uu'] = σ^2 I_n

(iv) X full column rank

(v) X nonstochastic

(vi) [u ∼ N( 0, σ^2 I_n )]

12.1.2 Nonscalar Covariance

Nonscalar covariance means that

E[uu'] = σ^2 Ω, tr(Ω) = n,   (12.2)

with Ω an n-by-n positive definite matrix such that Ω ≠ I_n. That is,

E[uu'] = σ^2 [ ω_11  ω_12  ···  ω_1n
               ω_21  ω_22  ···  ω_2n
                 .     .           .
               ω_n1  ω_n2  ···  ω_nn ].   (12.3)

A covariance matrix can be nonscalar either by having non-constant diagonal elements or non-zero off diagonal elements or both.

12.1.3 Some Examples

12.1.3.1 Serial Correlation

Consider the model

y_t = α + βx_t + u_t,   (12.4)

where

u_t = ρu_{t−1} + ε_t,   (12.5)

and E[ε_t] = 0, E[ε_t^2] = σ^2, and E[ε_t ε_s] = 0 for all t ≠ s. Here, u_t and u_{t−1} are correlated, so Ω is not diagonal. This is a problem that afflicts a large fraction of time series regressions.
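A sketch of the pattern implied by (12.5): under that AR(1) process the disturbance variance is constant over t and the correlation between u_t and u_s is ρ^{|t−s|}, so Ω is proportional to the matrix with ρ^{|t−s|} in position (t, s). The values of ρ and n below are illustrative.

import numpy as np

# AR(1) correlation pattern: Omega proportional to rho^|t-s| in position (t, s).
rho, n = 0.7, 5
t = np.arange(n)
Omega = rho ** np.abs(t[:, None] - t[None, :])
print(Omega)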

12.1.3.2 Heteroscedasticity

Consider the model
$$C_i = \alpha + \beta Y_i + u_i, \qquad i = 1, 2, \ldots, n, \tag{12.6}$$
where $C_i$ is consumption and $Y_i$ is income for individual $i$. For a cross-section, we might expect more variation in consumption by high-income individuals. Thus, $\mathrm{E}[u_i^2]$ is not constant. This is a problem that afflicts many cross-sectional regressions.

12.1.3.3 Systems of Equations

Consider the joint model
$$y_{t1} = x_{t1}'\beta_1 + u_{t1}$$
$$y_{t2} = x_{t2}'\beta_2 + u_{t2}.$$

If $u_{t1}$ and $u_{t2}$ are correlated, then the joint model has a nonscalar covariance. If the error terms $u_{t1}$ and $u_{t2}$ are viewed as omitted variables, it is natural to ask whether common factors have been omitted from both equations, in which case the terms will be correlated.

12.2 Consequences of Nonscalar Covariance

12.2.1 For Estimation

The OLS estimates are
$$\hat\beta = (X'X)^{-1}X'y = \beta + (X'X)^{-1}X'u. \tag{12.7}$$
Thus,
$$\mathrm{E}[\hat\beta] = \beta + (X'X)^{-1}X'\mathrm{E}[u] = \beta, \tag{12.8}$$
so OLS is still unbiased (but not BLUE, since (ii & iii) is not satisfied). Now
$$\hat\beta - \beta = (X'X)^{-1}X'u, \tag{12.9}$$
so
$$\mathrm{E}[(\hat\beta-\beta)(\hat\beta-\beta)'] = (X'X)^{-1}X'\mathrm{E}[uu']X(X'X)^{-1} = \sigma^2(X'X)^{-1}X'\Omega X(X'X)^{-1} \neq \sigma^2(X'X)^{-1}.$$

The diagonal elements of $(X'X)^{-1}X'\Omega X(X'X)^{-1}$ can be either larger or smaller than the corresponding elements of $(X'X)^{-1}$. In certain cases we will be able to establish the direction of the inequality. Suppose
$$\frac{1}{n}X'X \xrightarrow{p} Q \text{ p.d.}, \qquad \frac{1}{n}X'\Omega X \xrightarrow{p} M; \tag{12.10}$$
then $(X'X)^{-1}X'\Omega X(X'X)^{-1} = \frac{1}{n}(\frac{1}{n}X'X)^{-1}(\frac{1}{n}X'\Omega X)(\frac{1}{n}X'X)^{-1} \xrightarrow{p} \frac{1}{n}Q^{-1}MQ^{-1} \to 0$, so
$$\hat\beta \xrightarrow{p} \beta \tag{12.11}$$
since $\hat\beta$ is unbiased and its variances go to zero.

12.2.2 For Inference

Suppose
$$u \sim N(0, \sigma^2\Omega); \tag{12.12}$$
then
$$\hat\beta \sim N(\beta,\ \sigma^2(X'X)^{-1}X'\Omega X(X'X)^{-1}). \tag{12.13}$$
Thus
$$\frac{\hat\beta_j - \beta_j}{\sqrt{\sigma^2[(X'X)^{-1}]_{jj}}} \nsim N(0, 1) \tag{12.14}$$
since the denominator may be either larger or smaller than $\sqrt{\sigma^2[(X'X)^{-1}X'\Omega X(X'X)^{-1}]_{jj}}$. And
$$\frac{\hat\beta_j - \beta_j}{\sqrt{s^2[(X'X)^{-1}]_{jj}}} \nsim t_{n-k}. \tag{12.15}$$
We might say that OLS yields biased and inconsistent estimates of the variance-covariance matrix. This means that our statistics will have incorrect size, so we will over- or under-reject a correct null hypothesis.
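As a quick numerical check of the sandwich form above, the following numpy sketch (an illustration only; the function name and inputs are assumptions, not part of the text) computes both the naive and the correct OLS covariance for a given $\Omega$.

```python
import numpy as np

def ols_covariances(X, Omega, sigma2=1.0):
    """Naive OLS covariance sigma^2 (X'X)^{-1} versus the correct
    sandwich covariance sigma^2 (X'X)^{-1} X' Omega X (X'X)^{-1}."""
    XtX_inv = np.linalg.inv(X.T @ X)
    naive = sigma2 * XtX_inv
    correct = sigma2 * XtX_inv @ X.T @ Omega @ X @ XtX_inv
    return naive, correct
```

Comparing the diagonals of the two matrices for a particular $X$ and $\Omega$ shows directly whether the usual standard errors under- or overstate the truth in that case.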

12.2.3 For Prediction

We seek to predict
$$y_* = x_*'\beta + u_*, \tag{12.16}$$
where $*$ indicates an observation outside the sample. The OLS (point) predictor is
$$\hat y_* = x_*'\hat\beta, \tag{12.17}$$
which will be unbiased (but not BLUP). Prediction intervals based on $\sigma^2(X'X)^{-1}$ will be either too wide or too narrow, so the probability content will not be the ostensible value.

12.3 Correcting For Nonscalar Covariance

12.3.1 Generalized Least Squares

Since $\Omega$ is positive definite we can write
$$\Omega = PP' \tag{12.18}$$
for some $n \times n$ nonsingular matrix $P$ (typically upper or lower triangular). Multiplying (12.1) by $P^{-1}$ yields
$$P^{-1}y = P^{-1}X\beta + P^{-1}u \tag{12.19}$$
or
$$y^* = X^*\beta + u^*, \tag{12.20}$$
where $y^* = P^{-1}y$, $X^* = P^{-1}X$, and $u^* = P^{-1}u$. Performing OLS on the transformed model yields the generalized least squares or GLS estimator
$$\tilde\beta = (X^{*\prime}X^*)^{-1}X^{*\prime}y^* = ((P^{-1}X)'P^{-1}X)^{-1}(P^{-1}X)'P^{-1}y = (X'P^{-1\prime}P^{-1}X)^{-1}X'P^{-1\prime}P^{-1}y. \tag{12.21}$$

But $P^{-1\prime}P^{-1} = (PP')^{-1} = \Omega^{-1}$, whereupon we have the alternative representation
$$\tilde\beta = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y. \tag{12.22}$$
This estimator is also known as the Aitken estimator. Note that GLS reduces to OLS when $\Omega = I_n$.
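Computationally, (12.21) and (12.22) are two routes to the same estimator. Below is a minimal numpy sketch (function names are illustrative assumptions) that computes the Aitken estimator directly and via pre-whitening with a Cholesky factor of $\Omega$.

```python
import numpy as np

def gls(y, X, Omega):
    """GLS / Aitken estimator: beta = (X' Omega^{-1} X)^{-1} X' Omega^{-1} y."""
    Oi = np.linalg.inv(Omega)
    return np.linalg.solve(X.T @ Oi @ X, X.T @ Oi @ y)

def gls_whitened(y, X, Omega):
    """Equivalent route: transform by P^{-1} where Omega = P P', then run OLS."""
    P = np.linalg.cholesky(Omega)          # lower triangular, Omega = P P'
    y_star = np.linalg.solve(P, y)         # P^{-1} y
    X_star = np.linalg.solve(P, X)         # P^{-1} X
    beta, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)
    return beta
```

Both functions assume $\Omega$ is symmetric positive definite; for a diagonal $\Omega$ the whitened route reduces to weighted least squares.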

12.3.2 Properties with Known $\Omega$

Suppose that $\Omega$ is a known, fixed matrix. Then

• $\mathrm{E}[u^*] = 0$

• $\mathrm{E}[u^*u^{*\prime}] = P^{-1}\mathrm{E}[uu']P^{-1\prime} = \sigma^2P^{-1}\Omega P^{-1\prime} = \sigma^2P^{-1}PP'P^{-1\prime} = \sigma^2 I_n$

• $X^* = P^{-1}X$ nonstochastic

• $X^*$ has full column rank

so the transformed model satisfies the ideal model assumptions (i)-(v). Applying previous results for the ideal case to the transformed model, we have
$$\mathrm{E}[\tilde\beta] = \beta \tag{12.23}$$
$$\mathrm{E}[(\tilde\beta-\beta)(\tilde\beta-\beta)'] = \sigma^2(X^{*\prime}X^*)^{-1} = \sigma^2(X'\Omega^{-1}X)^{-1} \tag{12.24}$$
and the GLS estimator is unbiased and BLUE.

We assume the transformed model satisfies the asymptotic properties studied in the previous chapter. First, suppose
$$\frac{1}{n}X'\Omega^{-1}X = \frac{1}{n}X^{*\prime}X^* \xrightarrow{p} Q^* \text{ p.d.}; \tag{a}$$
then $\tilde\beta \xrightarrow{p} \beta$. Secondly, suppose
$$\frac{1}{\sqrt{n}}X'\Omega^{-1}u = \frac{1}{\sqrt{n}}X^{*\prime}u^* \xrightarrow{d} N(0, \sigma^2Q^*); \tag{b}$$
then $\sqrt{n}(\tilde\beta-\beta) \xrightarrow{d} N(0, \sigma^2Q^{*-1})$. Inference and prediction can proceed as before for the ideal case.

12.3.3 Properties with Unknown $\Omega$

If $\Omega$ is unknown then the obvious approach is to estimate it. Bear in mind, however, that there are up to $n(n+1)/2$ different parameters if we place no restrictions on the matrix. Such a matrix cannot be estimated consistently since we only have $n$ observations and the number of parameters is increasing faster than the sample size. Accordingly, we look at cases where $\Omega = \Omega(\lambda)$ for $\lambda$ a $p \times 1$ finite-length vector of unknown parameters. The three examples above fall into this category. Suppose we have an estimator $\hat\lambda$ (possibly consistent); then we obtain $\hat\Omega = \Omega(\hat\lambda)$ and the feasible GLS estimator
$$\tilde\beta = (X'\hat\Omega^{-1}X)^{-1}X'\hat\Omega^{-1}y = \beta + (X'\hat\Omega^{-1}X)^{-1}X'\hat\Omega^{-1}u.$$

The small-sample properties of this estimator are problematic, since $\hat\Omega = \hat P\hat P'$ will generally be a function of $u$, so the regressors of the feasible transformed model $\hat X^* = \hat P^{-1}X$ become stochastic. The feasible GLS estimator will be biased and non-normal in small samples even if the original disturbances were normal.

It might be supposed that if $\hat\lambda$ is consistent then everything will work out in large samples. Such happiness is not assured, since there are possibly $n(n+1)/2$ nonzero elements in $\hat\Omega$ which can interact with the $x$'s in a pathological fashion. Suppose that (a) and (b) are satisfied and furthermore
$$\frac{1}{n}\left[X'\Omega(\hat\lambda)^{-1}X - X'\Omega(\lambda)^{-1}X\right] \xrightarrow{p} 0 \tag{c}$$
and
$$\frac{1}{\sqrt{n}}\left[X'\Omega(\hat\lambda)^{-1}u - X'\Omega(\lambda)^{-1}u\right] \xrightarrow{p} 0; \tag{d}$$
then
$$\sqrt{n}(\tilde\beta - \beta) \xrightarrow{d} N(0, \sigma^2Q^{*-1}). \tag{12.25}$$

Thus in large samples, under (a)-(d), the feasible GLS estimator has the same asymptotic distribution as the true GLS. As such it shares the optimality properties of the latter.

12.3.4 Maximum Likelihood Estimation

Suppose
$$u \sim N(0, \sigma^2\Omega); \tag{12.26}$$
then
$$y \sim N(X\beta, \sigma^2\Omega) \tag{12.27}$$
and
$$L(\beta, \sigma^2, \Omega; y, X) = f(y|X; \beta, \sigma^2, \Omega) = \frac{1}{(2\pi\sigma^2)^{n/2}|\Omega|^{1/2}}\exp\!\left\{-\frac{1}{2\sigma^2}(y-X\beta)'\Omega^{-1}(y-X\beta)\right\}. \tag{12.28}$$

Taking $\Omega$ as given, we can maximize $L(\cdot)$ w.r.t. $\beta$ by minimizing
$$(y-X\beta)'\Omega^{-1}(y-X\beta) = (y-X\beta)'P^{-1\prime}P^{-1}(y-X\beta) = (y^*-X^*\beta)'(y^*-X^*\beta). \tag{12.29}$$

Thus OLS on the transformed model, or the GLS estimator
$$\tilde\beta = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y, \tag{12.30}$$
is MLE and BUE since it is unbiased.

12.4 Seemingly Unrelated Regressions

12.4.1 Sets of Regression Equations

We consider a model with $G$ agents and a behavioral equation with $n$ observations for each agent. The equation for agent $j$ can be written
$$y_j = X_j\beta_j + u_j, \tag{12.31}$$
where $y_j$ is the $n \times 1$ vector of observations on the dependent variable for agent $j$, $X_j$ is the $n \times k$ matrix of observations on the explanatory variables, and $u_j$ is the vector of unobservable disturbances. Writing the $G$ sets of equations as one system yields
$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_G \end{pmatrix} =
\begin{pmatrix} X_1 & 0 & \cdots & 0 \\ 0 & X_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & X_G \end{pmatrix}
\begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_G \end{pmatrix} +
\begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_G \end{pmatrix} \tag{12.32}$$
or more compactly
$$y = X\beta + u, \tag{12.33}$$
where the definitions are obvious. The individual equations satisfy the usual OLS assumptions
$$\mathrm{E}[u_j] = 0 \tag{12.34}$$
and
$$\mathrm{E}[u_ju_j'] = \sigma_j^2 I_n, \tag{12.35}$$
but due to common omitted factors we must allow for the possibility that
$$\mathrm{E}[u_ju_\ell'] = \sigma_{j\ell}I_n, \qquad j \neq \ell. \tag{12.36}$$
In matrix notation we have
$$\mathrm{E}[u] = 0 \tag{12.37}$$
and
$$\mathrm{E}[uu'] = \Sigma \otimes I_n = \sigma^2\Omega, \tag{12.38}$$
where
$$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1G} \\ \sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2G} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{G1} & \sigma_{G2} & \cdots & \sigma_G^2 \end{pmatrix}. \tag{12.39}$$

12.4.2 SUR Estimation

We can estimate each equation by OLS,
$$\hat\beta_j = (X_j'X_j)^{-1}X_j'y_j, \tag{12.40}$$
and as usual the estimators will be unbiased, BLUE for linearity w.r.t. $y_j$, and under normality
$$\hat\beta_j \sim N(\beta_j, \sigma_j^2(X_j'X_j)^{-1}). \tag{12.41}$$
This procedure, however, ignores the covariances between equations. Treating all equations as a combined system yields
$$y = X\beta + u, \tag{12.42}$$
where
$$u \sim (0, \Sigma \otimes I_n) \tag{12.43}$$
is non-scalar. Applying GLS to this model yields
$$\tilde\beta = (X'(\Sigma \otimes I_n)^{-1}X)^{-1}X'(\Sigma \otimes I_n)^{-1}y = (X'(\Sigma^{-1} \otimes I_n)X)^{-1}X'(\Sigma^{-1} \otimes I_n)y. \tag{12.44}$$
This estimator will be unbiased and BLUE for linearity in $y$ and will, in general, be efficient relative to OLS. If $u$ is multivariate normal then
$$\tilde\beta \sim N(\beta, (X'(\Sigma \otimes I_n)^{-1}X)^{-1}). \tag{12.45}$$
Even if $u$ is not normal then, with reasonable assumptions about the joint behavior of $X$ and $u$, we have
$$\sqrt{n}(\tilde\beta - \beta) \xrightarrow{d} N\!\left(0, \left[\operatorname*{plim}_{n\to\infty}\tfrac{1}{n}X'(\Sigma \otimes I_n)^{-1}X\right]^{-1}\right). \tag{12.46}$$

12.4.3 Diagonal $\Sigma$

There are two special cases in which the SUR estimator simplifies to OLS on each equation. The first case is when $\Sigma$ is diagonal. In this case
$$\Sigma = \begin{pmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_G^2 \end{pmatrix} \tag{12.47}$$
and
$$X'(\Sigma \otimes I_n)^{-1}X = \begin{pmatrix} \frac{1}{\sigma_1^2}X_1'X_1 & 0 & \cdots & 0 \\ 0 & \frac{1}{\sigma_2^2}X_2'X_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{\sigma_G^2}X_G'X_G \end{pmatrix}. \tag{12.48}$$
Similarly,
$$X'(\Sigma \otimes I_n)^{-1}y = \begin{pmatrix} \frac{1}{\sigma_1^2}X_1'y_1 \\ \frac{1}{\sigma_2^2}X_2'y_2 \\ \vdots \\ \frac{1}{\sigma_G^2}X_G'y_G \end{pmatrix}, \tag{12.49}$$
whereupon
$$\tilde\beta = \begin{pmatrix} (X_1'X_1)^{-1}X_1'y_1 \\ (X_2'X_2)^{-1}X_2'y_2 \\ \vdots \\ (X_G'X_G)^{-1}X_G'y_G \end{pmatrix}. \tag{12.50}$$
So the estimator for each equation is just the OLS estimator for that equation alone.

12.4.4 Identical Regressors

The second case is when each equation has the same set of regressors, i.e. $X_j = X$ for all $j$, so the stacked regressor matrix is
$$\mathbf{X} = I_G \otimes X. \tag{12.51}$$
And
$$\begin{aligned}
\tilde\beta &= [(I_G \otimes X')(\Sigma^{-1} \otimes I_n)(I_G \otimes X)]^{-1}(I_G \otimes X')(\Sigma^{-1} \otimes I_n)y \\
&= (\Sigma^{-1} \otimes X'X)^{-1}(\Sigma^{-1} \otimes X')y \\
&= [\Sigma \otimes (X'X)^{-1}](\Sigma^{-1} \otimes X')y \\
&= [I_G \otimes (X'X)^{-1}X']y \\
&= \begin{pmatrix} (X'X)^{-1}X'y_1 \\ (X'X)^{-1}X'y_2 \\ \vdots \\ (X'X)^{-1}X'y_G \end{pmatrix}.
\end{aligned} \tag{12.52}$$

In both these cases the other equations have nothing to add to the estimation of the equation of interest, because either the omitted factors are unrelated across equations or the other equations contain no additional regressors to help reduce the sum-of-squared errors for the equation of interest.

12.4.5 Unknown $\Sigma$

Note that for this case $\Sigma$ comprises $\lambda$ in the general form $\Omega = \Omega(\lambda)$. It is finite-length with $G(G+1)/2$ unique elements. It can be estimated consistently using OLS residuals. Let
$$e_j = y_j - X_j\hat\beta_j$$
denote the OLS residuals for agent $j$. Then by the usual arguments
$$\hat\sigma_{j\ell} = \frac{1}{n}\sum_{i=1}^n e_{ij}e_{i\ell}$$
and
$$\hat\Sigma = (\hat\sigma_{j\ell})$$
will be consistent. Form the feasible GLS estimator
$$\tilde\beta = (X'(\hat\Sigma^{-1} \otimes I_n)X)^{-1}X'(\hat\Sigma^{-1} \otimes I_n)y,$$
which can be shown to satisfy (a)-(d) and will have the same asymptotic distribution as the true GLS estimator. This estimator is obtained in two steps: in the first step we estimate all equations by OLS and thereby obtain the estimator $\hat\Sigma$; in the second step we obtain the feasible GLS estimator.
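The two-step procedure just described is easy to sketch in code. The following numpy fragment (a hedged illustration; the names and the stacking scheme are assumptions for the example) estimates each equation by OLS, builds $\hat\Sigma$ from the residuals, and then forms the feasible SUR/GLS estimator.

```python
import numpy as np

def sur_fgls(ys, Xs):
    """Two-step SUR: ys, Xs are length-G lists of (n,) vectors and (n, k_j) matrices."""
    G, n = len(ys), ys[0].shape[0]
    # Step 1: equation-by-equation OLS and residuals.
    resid = []
    for y_j, X_j in zip(ys, Xs):
        b_j, *_ = np.linalg.lstsq(X_j, y_j, rcond=None)
        resid.append(y_j - X_j @ b_j)
    E = np.column_stack(resid)                 # n x G residual matrix
    Sigma_hat = (E.T @ E) / n                  # G x G estimate of Sigma
    # Step 2: feasible GLS on the stacked system with covariance Sigma (x) I_n.
    X_big = np.zeros((G * n, sum(X.shape[1] for X in Xs)))
    col = 0
    for j, X_j in enumerate(Xs):
        X_big[j*n:(j+1)*n, col:col + X_j.shape[1]] = X_j
        col += X_j.shape[1]
    y_big = np.concatenate(ys)
    Omega_inv = np.kron(np.linalg.inv(Sigma_hat), np.eye(n))
    A = X_big.T @ Omega_inv @ X_big
    b = X_big.T @ Omega_inv @ y_big
    return np.linalg.solve(A, b)
```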

Chapter 13

Heteroskedasticity

13.1 The Nature of Heteroskedasticity

13.1.1 Model and Ideal Conditions

Consider, from Chapter 11, the model with stochastic regressors, written one observation at a time,
$$y_i = x_i'\beta + u_i$$
for $i = 1, 2, \ldots, n$. The assumptions for the conditional zero mean case (iv,b) were

(i) $\mathrm{E}[u_i] = 0$

(ii) $\mathrm{E}[u_i^2] = \sigma^2$

(iii) $\mathrm{E}[u_iu_l] = 0$, $i \neq l$

(iv,b) $(u_i, x_i)$ jointly i.i.d. with $\mathrm{E}[u_i|x_i] = 0$, $\mathrm{E}[u_i^2|x_i] = \sigma^2$

(v) $\mathrm{E}[x_ix_i'] = Q$ p.d.

As we pointed out there, if we add a conditional normality assumption,

(vi) $u_i|x_i \sim N(0, \sigma^2)$,

then we are really back in (iv,a), the case of $u_i$ independent of $x_i$. And conditional on $X$, the distributional and inferential results obtained in Chapters 7-10 for nonstochastic regressors could be applied.

13.1.2 Heteroskedasticity

We can think about the systematic part of the model, $x_i'\beta$, as being a model of the conditional expectation of $y_i$ given $x_i$. So, if the systematic portion is correct, the assumption of conditional zero mean disturbances is quite compelling. It is not equally compelling that the conditional variance, which is, in general, a function of the conditioning variables, is constant as assumed in (iv,b). In fact, there is some restrictiveness in arguing that the conditional expectation of $y_i$ is a function of the conditioning variables but its conditional variance is not. Particularly in cross-sectional models, it is sometimes crucial to entertain the notion that variances are not constant.

The constant conditional variance in (iv,b) is the assumption of conditionally homoskedastic disturbances. If the conditional variance is not constant, then we say the disturbances are conditionally heteroskedastic. Formally, we have

(iv,h) $(u_i, x_i)$ jointly i.i.d. with $\mathrm{E}[u_i|x_i] = 0$, $\mathrm{E}[u_i^2|x_i] = \sigma^2\lambda^2(x_i)$

where $\lambda^2(x_i)$ is not constant. The terminology arises from the Greek root "skedos" or spread, paired with either "homo" for single or "hetero" for varied. So heteroskedasticity literally means varied spreads. Note that (iv,h) does not contradict (ii), since one is a conditional variance and the other is unconditional, but the two together imply $\mathrm{E}[\lambda^2(x_i)] = 1$. Under this relaxed assumption, we need additional higher-order moment assumptions:

(v,h) $\mathrm{E}[x_ix_i'] = Q$ p.d., $\mathrm{E}[u_i^2x_ix_i'] = \mathrm{E}[\sigma^2\lambda^2(x_i)x_ix_i'] = M$, $\mathrm{E}[(1/\lambda^2(x_i))x_ix_i'] = Q^*$ p.d.

The second condition is borrowed from Assumption (iv,c) but the third is introduced specifically for this case. For inference in small samples we will add conditional normality:

(vi,h) $u_i|x_i \sim N(0, \sigma^2\lambda^2(x_i))$.

The normality assumption was not added with Assumption (iv,b) since it implied we were in Case (iv,a). In terms of the matrix notation, the conditional covariance can now be written
$$\mathrm{E}[uu'|X] = \sigma^2\Lambda,$$
where
$$\Lambda = \begin{pmatrix} \lambda^2(x_1) & 0 & \cdots & 0 \\ 0 & \lambda^2(x_2) & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & \lambda^2(x_n) \end{pmatrix}$$
and $\mathrm{E}[\mathrm{tr}(\Lambda)] = n$. This is an example of a non-scalar covariance that is diagonal but non-constant along the diagonal.

13.1.3 Some Examples

For example, suppose
$$C_i = \alpha + \beta Y_i + \gamma W_i + u_i$$
is estimated from cross-sectional data, where $i = 1, 2, \ldots, n$. It is likely that the variance of $C_i$, and hence $u_i$, will be larger for individuals with large $Y_i$. Thus $u_i$ will be uncorrelated with $Y_i$ but its square will be related. Specifically, we would have $\mathrm{E}[u_i^2|Y_i] = \sigma^2\lambda^2(Y_i)$. We would not usually know the specific form of $\lambda^2(\cdot)$, only that it is a monotonic increasing function.

Sometimes, however, it is possible to ascertain the form of the heteroskedasticity function. Consider an even simpler model of consumption, but with a more complicated panel-data structure:
$$C_{si} = \alpha + \beta Y_{si} + u_{si},$$
where the subscript $si$ denotes individual $i$ from state $s$, for $s = 1, 2, \ldots, S$ and $i = 1, 2, \ldots, n_s$. The disturbances $u_{si}$ are assumed to have ideal properties with $\sigma^2$ denoting the constant variance. Unfortunately, individual-specific data are not available. Instead we have state-level averages
$$\bar C_s = \alpha + \beta\bar Y_s + \bar u_s,$$
where $\bar C_s = \sum_{i=1}^{n_s}C_{si}/n_s$, $\bar Y_s = \sum_{i=1}^{n_s}Y_{si}/n_s$, and $\bar u_s = \sum_{i=1}^{n_s}u_{si}/n_s$. But now we see that although $\mathrm{E}[\bar u_s] = 0$ and $\mathrm{E}[\bar u_s\bar u_r] = 0$ for $s \neq r$, we have $\mathrm{E}[\bar u_s^2] = \sigma^2/n_s$. Thus the error term will be heteroskedastic with known form, since $\lambda_s^2 = 1/n_s$. Moreover, the non-constant variances are not stochastic.

13.2 Consequences of Heteroskedasticity

13.2.1 For OLS Estimation

Consider the OLS estimator
$$\hat\beta = (X'X)^{-1}X'y.$$
Substituting $y = X\beta + u$ yields
$$\hat\beta = (X'X)^{-1}X'X\beta + (X'X)^{-1}X'u = \beta + (X'X)^{-1}X'u,$$
and $\hat\beta$ is unbiased (but not BLUE) since $\mathrm{E}[\hat\beta|X] = \beta$. Recall
$$\hat\beta - \beta = (X'X)^{-1}X'u.$$
So
$$\mathrm{E}[(\hat\beta-\beta)(\hat\beta-\beta)'|X] = (X'X)^{-1}X'\mathrm{E}[uu'|X]X(X'X)^{-1} = \sigma^2(X'X)^{-1}X'\Lambda X(X'X)^{-1}$$
is the conditional covariance matrix of $\hat\beta$. Using Assumption (v,h) and applying the law of large numbers, we have $\frac{1}{n}X'X \xrightarrow{p} Q = \mathrm{E}[x_ix_i']$. It follows that
$$X'X = \sum_{i=1}^n x_ix_i' = O_p(n),$$
so $(X'X)^{-1} = O_p(1/n)$. Note that $\lambda^2(x_i)x_ix_i'$ is i.i.d. and that $\mathrm{E}[\sigma^2\lambda^2(x_i)x_ix_i'] = M$, so $\sigma^2\frac{1}{n}\sum_{i=1}^n\lambda^2(x_i)x_ix_i' \xrightarrow{p} M$. Thus, it also follows that
$$X'\Lambda X = \sum_{i=1}^n\lambda^2(x_i)x_ix_i' = O_p(n),$$
whereupon
$$(X'X)^{-1}X'\Lambda X(X'X)^{-1} = O_p(1/n)$$
goes to zero. Thus, by convergence in quadratic mean, $\hat\beta$ collapses in distribution to its mean $\beta$, which means it is consistent.

Now, in terms of the estimated variance scalar, we find
$$\begin{aligned}
\mathrm{E}[s^2|X] &= \mathrm{E}\!\left[\frac{\sum_{i=1}^n e_i^2}{n-k}\,\Big|\,X\right] = \frac{1}{n-k}\mathrm{E}[e'e|X] = \frac{1}{n-k}\mathrm{E}[u'(I - X(X'X)^{-1}X')u|X] \\
&= \sigma^2\frac{n}{n-k} - \frac{1}{n-k}\mathrm{E}[\mathrm{tr}((X'X)^{-1}X'uu'X)|X] = \sigma^2 + \frac{\sigma^2}{n-k}\left\{k - \mathrm{tr}((X'X)^{-1}X'\Lambda X)\right\}.
\end{aligned}$$

Depending on the interaction of $\lambda^2(x_i)$ and $x_i$ in $(X'X)^{-1}X'\Lambda X$, we find $s^2$ is, in general, now biased. However, since the second term is $O_p(1/n)$,
$$\lim_{n\to\infty}\mathrm{E}[s^2] = \sigma^2$$
and $s^2$ is asymptotically unbiased. Moreover, under Assumption (iv,h), we can also show
$$\operatorname*{plim}_{n\to\infty}s^2 = \sigma^2$$
and $s^2$ is consistent.

13.2.2 For Inferences

Suppose the $u_i$ are conditionally normal, i.e.
$$u|X \sim N(0, \sigma^2\Lambda);$$
then
$$\hat\beta|X \sim N(\beta, \sigma^2(X'X)^{-1}X'\Lambda X(X'X)^{-1}).$$
Clearly, then, $s^2(X'X)^{-1}$ is not an appropriate estimate of $\sigma^2(X'X)^{-1}X'\Lambda X(X'X)^{-1}$, since $(X'X)^{-1} \neq (X'X)^{-1}X'\Lambda X(X'X)^{-1}$ in general. Depending on the interaction of $\Lambda$ and $X$, the diagonal elements of $\sigma^2(X'X)^{-1}X'\Lambda X(X'X)^{-1}$ can be either larger or smaller than the diagonal elements of $s^2(X'X)^{-1}$.

Thus, the estimated variances (diagonal elements of $s^2(X'X)^{-1}$) and standard errors printed by OLS packages can either understate or overstate the true variances. Consequently, the distribution of
$$\frac{\hat\beta_j - \beta_j}{\sqrt{s^2 d_{jj}}}, \qquad d_{jj} = [(X'X)^{-1}]_{jj},$$
can be either fatter or thinner than a $t_{n-k}$ distribution. Thus, our inferences under the null hypothesis will, in general, be incorrectly sized and we will either over-reject or under-reject. This can increase the probability of either Type I or Type II errors.

These problems persist in larger samples. From Case (iv,c) in Chapter 11, we have
$$\sqrt{n}(\hat\beta - \beta) \xrightarrow{d} N(0, Q^{-1}MQ^{-1})$$
and
$$\sqrt{n}(\hat\beta_j - \beta_j) \xrightarrow{d} N(0, [Q^{-1}MQ^{-1}]_{jj}).$$
Thus
$$\frac{\hat\beta_j - \beta_j}{\sqrt{s^2 d_{jj}}} = \frac{\sqrt{n}(\hat\beta_j - \beta_j)}{\sqrt{s^2[(\frac{1}{n}X'X)^{-1}]_{jj}}} = \frac{\sqrt{n}(\hat\beta_j - \beta_j)}{\sqrt{[Q^{-1}MQ^{-1}]_{jj}}}\cdot\frac{\sqrt{[Q^{-1}MQ^{-1}]_{jj}}}{\sqrt{s^2[(\frac{1}{n}X'X)^{-1}]_{jj}}} \xrightarrow{d} N(0,1)\cdot\sqrt{\frac{[Q^{-1}MQ^{-1}]_{jj}}{\sigma^2[Q^{-1}]_{jj}}}.$$
The numerator and denominator in the ratio are, in general, not equal, depending on the interaction of $\lambda^2(x_i)$ and $x_i$ and hence the form of $M$. Thus we can either over- or under-reject, even in large samples.

13.2.3 For Prediction

Suppose we use
$$\hat y_p = x_p'\hat\beta$$
as a predictor of $y_p = x_p'\beta + u_p$. Then since $\mathrm{E}[\hat\beta] = \beta$,
$$\mathrm{E}[\hat y_p] = \mathrm{E}[x_p'\hat\beta] = \mathrm{E}[y_p],$$
so $\hat y_p$ is still unbiased. However, since $\hat\beta$ is not BLUE, $\hat y_p$ is not BLUP. That is, there are other predictors linear in $y$ that have smaller variance. Moreover, the variance of the predictor will not have the usual form, and the width of prediction intervals based on the usual variance-covariance estimators will be biased.

13.3 Detecting Heteroskedasticity

13.3.1 Graphical Method

In order to detect heteroskedasticity, we usually need some idea of the possible sources of the non-constant variance. Usually, the notion is that $\lambda_i^2$ is a function of $x_i$, if not constant.

In time series, where the $x_t$ usually move smoothly as $t$ increases, a general approach is to obtain the plot of $e_t$ against $t$. Under the null hypothesis of homoskedasticity we expect to observe a plot such as the following.

[Figure: OLS residuals $e_t$ plotted against time $t$, with roughly even dispersion throughout the sample.]

That is, the residuals seem equally dispersed over time. If heteroskedasticity occurs, we might observe some pattern in the dispersion as a function of time, e.g.

[Figure: OLS residuals $e_t$ plotted against time $t$, with the dispersion changing systematically over the sample.]

A general technique for either time series or cross-sections is to plot $e_i$ against $\hat y_i = x_i'\hat\beta$. Under the null hypothesis of homoskedasticity, the dispersion of $e_i$ is unrelated to $x_i$ and hence (asymptotically) to $x_i'\hat\beta$. Thus any pattern such as

[Figure: OLS residuals $e_i$ plotted against the fitted values $\hat y_i$, with the dispersion of the residuals varying systematically with $\hat y_i$.]

indicates the presence of heteroskedasticity.

If we have a general idea which $x_{ij}$ may influence the variance of $u_i$, we simply plot $e_i$ against $x_{ij}$. Suppose, for example, we think $\mathrm{E}[u_i^2]$ may be an increasing function of $x_{ij}$; then we would find a pattern such as

[Figure: OLS residuals $e_i$ plotted against $x_{ij}$, with the dispersion of the residuals increasing as $x_{ij}$ increases.]

Even dispersion for all $x_{ij}$, on the other hand, would argue against heteroskedasticity.

13.3.2 Goldfeld-Quandt Test

In order to perform this test, we must have some idea of when the heteroskedasticity, if present, will lead to larger variances. That is, we need to be able to determine which observations are likely to have higher variance and which smaller. When this is possible, we reorder the observations and split the sample so that $y = X\beta + u$ can be written as
$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}\beta + \begin{pmatrix} u_1 \\ u_2 \end{pmatrix},$$
where $u_1$ corresponds to the $n_1$ observations with larger variances if heteroskedasticity is present. That is,
$$\mathrm{E}[u_1u_1'|X_1] = \sigma_1^2 I_{n_1} \qquad\text{and}\qquad \mathrm{E}[u_2u_2'|X_2] = \sigma_2^2 I_{n_2},$$
where $H_0: \sigma_1^2 = \sigma_2^2$, while $H_1: \sigma_1^2 > \sigma_2^2$. Perform OLS on the two subsamples and obtain the usual estimators
$$s_1^2 = \frac{e_1'e_1}{n_1 - k} \qquad\text{and}\qquad s_2^2 = \frac{e_2'e_2}{n_2 - k},$$
where $e_1$ and $e_2$ are the OLS residuals from each of the subsamples. Now, by previous results,
$$(n_1 - k)\frac{s_1^2}{\sigma_1^2} \sim \chi^2_{n_1-k} \qquad\text{and}\qquad (n_2 - k)\frac{s_2^2}{\sigma_2^2} \sim \chi^2_{n_2-k},$$
and the two are independent. Thus, under $H_0: \sigma_1^2 = \sigma_2^2$, the ratio of these divided by their respective degrees of freedom yields
$$\frac{s_1^2}{s_2^2} \sim F_{n_1-k,\,n_2-k}.$$
Under the alternative $H_1: \sigma_1^2 > \sigma_2^2$ we find $s_1^2/s_2^2 \sim F_{n_1-k,\,n_2-k}\cdot\sigma_1^2/\sigma_2^2$ and expect to obtain large values for this statistic.
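As a concrete illustration, here is a minimal numpy/scipy sketch of the test just described (the variable names and interface are assumptions; in practice some central observations are often dropped to sharpen the contrast, which this sketch omits).

```python
import numpy as np
from scipy import stats

def goldfeld_quandt(y, X, n1):
    """Goldfeld-Quandt test. Rows are assumed pre-ordered so that the first n1
    observations are the ones suspected of having the larger variance."""
    n, k = X.shape
    y1, X1 = y[:n1], X[:n1]
    y2, X2 = y[n1:], X[n1:]
    # OLS on each subsample and the usual variance estimators s1^2, s2^2.
    e1 = y1 - X1 @ np.linalg.lstsq(X1, y1, rcond=None)[0]
    e2 = y2 - X2 @ np.linalg.lstsq(X2, y2, rcond=None)[0]
    s1sq = e1 @ e1 / (n1 - k)
    s2sq = e2 @ e2 / ((n - n1) - k)
    F = s1sq / s2sq                              # ~ F_{n1-k, n2-k} under H0
    pval = 1.0 - stats.f.cdf(F, n1 - k, (n - n1) - k)
    return F, pval
```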

13.3.3 Breusch-Pagan Test

Suppose
$$\lambda_i^2 = \lambda^2(z_i'\alpha),$$
where
$$z_i' = (1, z_i^{*\prime}), \qquad \alpha' = (\alpha_1, \alpha^{*\prime});$$
then
$$u_i^2 = \lambda^2(z_i'\alpha) + v_i,$$
where $\mathrm{E}[v_i] = 0$. This is an example of a single-index model for heteroskedasticity. We assume that $\lambda^2(\cdot)$ is a strictly monotonic function of its argument. Under $H_0: \alpha^* = 0$ we see that the model is homoskedastic. Moreover, $u_i^2$ will be uncorrelated with, and hence have zero covariance with, $z_i^*$, which could be tested by regressing $u_i^2$ on $z_i$ and testing whether all the slopes are zero.

Along these lines, Breusch and Pagan propose the asymptotically equivalent approach of regressing $q_i = e_i^2 - \hat\sigma^2$ on $z_i$ and testing the slopes zero. Using the $n\cdot R^2$ test for all slopes zero, we find
$$\eta^* = n\cdot R^2 = \frac{q'Z(Z'Z)^{-1}Z'q}{\frac{1}{n}\sum_t(e_t^2 - \hat\sigma^2)^2} = \frac{q'Z(Z'Z)^{-1}Z'q}{q'q/n} \xrightarrow{d} \chi^2_{s-1}$$
under $H_0$, where $R^2$ is for the auxiliary regression and $s$ is the number of elements of $z_i$. The single-index structure and monotonicity assure that $q_i$ is likely to be correlated with $z_i$, since
$$\frac{\partial\lambda^2(z_i'\alpha)}{\partial\alpha} = h'(z_i'\alpha)z_i,$$
where the scalar function $h'(\cdot)$ is the first derivative of $\lambda^2(\cdot)$ with respect to its single argument. Thus $\eta^*$ is likely to become increasingly positive under the alternative and the test will have power. This test is asymptotically appropriate whether the $u_i$ are normal or not.

Under normality, the denominator of this test, which is an estimator of fourth moments, has a specific structure that can be used to simplify the test. Specifically, since the fourth moment of a normal about its mean is $3\sigma^4$ (so that $\mathrm{E}[(u_i^2-\sigma^2)^2] = 2\sigma^4$), we have
$$\eta = \frac{q'Z(Z'Z)^{-1}Z'q}{2(\hat\sigma^2)^2} \xrightarrow{d} \chi^2_{s-1}$$
under $H_0$. The discussion of the power of the previous, more general version of the test applies to the normal-specific version as well. Under conditional normality, this test can be shown to be a Lagrange Multiplier test.
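The $nR^2$ form of the statistic is easy to compute from the OLS residuals; the sketch below is a minimal illustration with assumed inputs (residuals `e` and a matrix `Z` whose first column is a constant).

```python
import numpy as np
from scipy import stats

def breusch_pagan(e, Z):
    """Breusch-Pagan (studentized) test from OLS residuals e and candidate
    variance regressors Z (first column a constant)."""
    n = e.shape[0]
    q = e**2 - np.mean(e**2)                     # q_i = e_i^2 - sigma_hat^2
    # Auxiliary regression of q on Z; eta* = n * R^2.
    qhat = Z @ np.linalg.lstsq(Z, q, rcond=None)[0]
    R2 = 1.0 - np.sum((q - qhat)**2) / np.sum(q**2)   # q has mean zero
    eta_star = n * R2
    df = Z.shape[1] - 1
    return eta_star, 1.0 - stats.chi2.cdf(eta_star, df)
```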

13.3.4 White Test

By the i.i.d. assumption and the law of large numbers, we have
$$\frac{1}{n}\sum_{i=1}^n u_i^2 \xrightarrow{p} \sigma^2 = \mathrm{E}[u_i^2],$$
$$\frac{1}{n}\sum_{i=1}^n u_i^2x_ix_i' \xrightarrow{p} \mathrm{E}[u_i^2x_ix_i'] = \mathrm{E}[\mathrm{E}[u_i^2x_ix_i'|x_i]] = \mathrm{E}[\sigma^2\lambda^2(x_i)x_ix_i'] = M,$$
and
$$\hat Q = \frac{1}{n}\sum_{i=1}^n x_ix_i' \xrightarrow{p} Q.$$
For $e_i = y_i - x_i'\hat\beta$, due to consistency of $\hat\beta$, we can show
$$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n e_i^2 = \frac{n-k}{n}s^2 \xrightarrow{p} \sigma^2$$
and also
$$\hat M = \frac{1}{n}\sum_{i=1}^n e_i^2x_ix_i' \xrightarrow{p} M.$$
Now, under $H_0: \lambda_i^2 = 1$,
$$M = \sigma^2Q,$$
while under $H_1: \lambda_i^2$ non-constant,
$$M \neq \sigma^2Q$$
is possible. Thus, we compare $\hat M$ with $\hat\sigma^2\hat Q$:
$$\hat M - \hat\sigma^2\hat Q = \frac{1}{n}\sum_i e_i^2x_ix_i' - \left(\frac{1}{n}\sum_i e_i^2\right)\left(\frac{1}{n}\sum_i x_ix_i'\right) \xrightarrow{p} 0$$
under $H_0$. Let $w_i = e_i^2$ and $z_i = \langle\text{unique elements of } x_i \otimes x_i\rangle$; then under $H_0$:
$$\frac{1}{n}\sum_i w_iz_i - \left(\frac{1}{n}\sum_i w_i\right)\left(\frac{1}{n}\sum_i z_i\right) \xrightarrow{p} 0,$$
or
$$\frac{1}{n}\sum_i (w_i - \bar w)(z_i - \bar z) \xrightarrow{p} 0.$$
Under the alternative this sample covariance will converge in probability to a nonzero value.

Accordingly, White proposes we test whether the slope coefficients in the regression of $w_i$ on $z_i$ are zero. Let $\hat\epsilon_i$ denote the residuals in this auxiliary regression. Then, under general conditions, White shows that
$$R^2 = 1 - \frac{\sum_i\hat\epsilon_i^2}{\sum_i(w_i - \bar w)^2} = \frac{\sum_i(w_i - \bar w)^2 - \sum_i\hat\epsilon_i^2}{\sum_i(w_i - \bar w)^2}$$
and
$$nR^2 = \frac{\sum_i(w_i - \bar w)^2 - \sum_i\hat\epsilon_i^2}{\sum_i(w_i - \bar w)^2/n} \xrightarrow{d} \chi^2_{r-1},$$
where $r = k(k+1)/2$ is the number of unique elements of $x_i \otimes x_i$.

Note that the White test is, strictly speaking, not necessarily a test for heteroskedasticity but a test for whether the OLS standard errors are appropriate for inference purposes. It is possible to have heteroskedasticity, i.e. $\Lambda \neq I_n$, and still have $X'\Lambda X = X'X$ or $M = \sigma^2Q$, whereupon standard OLS inference is fine. It is also worth noting that the White test is just the Breusch-Pagan test with $z_i = \langle\text{unique elements of } x_i \otimes x_i\rangle$.
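A corresponding sketch for the White test forms $z_i$ from the unique cross-products of the regressors and reuses the $nR^2$ device (illustrative only; if some cross-products are collinear, the stated degrees of freedom will overstate the effective rank).

```python
import numpy as np
from scipy import stats

def white_test(e, X):
    """White's nR^2 test: regress squared OLS residuals on the unique
    cross-products of the regressors (X is assumed to include a constant)."""
    n, k = X.shape
    # Unique elements of x_i (x) x_i: all products x_ij * x_il with j <= l.
    cols = [X[:, j] * X[:, l] for j in range(k) for l in range(j, k)]
    Z = np.column_stack(cols)
    w = e**2
    what = Z @ np.linalg.lstsq(Z, w, rcond=None)[0]
    R2 = 1.0 - np.sum((w - what)**2) / np.sum((w - np.mean(w))**2)
    stat = n * R2
    df = Z.shape[1] - 1                          # r - 1
    return stat, 1.0 - stats.chi2.cdf(stat, df)
```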

13.4 Correcting for Heteroskedasticity

13.4.1 Weighted Least Squares (known weights)

The only manner in which the model
$$y_i = x_i'\beta + u_i, \qquad i = 1, 2, \ldots, n,$$
violates the ideal conditions is that
$$\mathrm{E}(u_i^2) = \sigma^2\lambda^2(x_i) = \sigma^2\lambda_i^2$$
is not constant, for $\lambda_i^2$ known. A logical possibility is to transform the model to eliminate this problem. Specifically, we propose
$$\frac{1}{\lambda_i}y_i = \left(\frac{1}{\lambda_i}x_i\right)'\beta + \frac{1}{\lambda_i}u_i$$
or
$$y_i^* = x_i^{*\prime}\beta + u_i^*,$$
where $y_i^* = y_i/\lambda_i$, $x_i^* = x_i/\lambda_i$, and $u_i^* = u_i/\lambda_i$. Then clearly
$$\mathrm{E}(u_i^*) = 0, \qquad \mathrm{E}(u_i^{*2}) = \sigma^2,$$
$$\mathrm{E}(u_i^*u_l^*) = 0, \quad i \neq l,$$
$$(u_i^*, x_i^*) \text{ jointly i.i.d. with } \mathrm{E}[u_i^*|x_i^*] = 0,\ \mathrm{E}[u_i^{*2}|x_i^*] = \sigma^2,$$
$$\mathrm{E}[x_i^*x_i^{*\prime}] = Q^* \text{ p.d.}$$
for $i = 1, 2, \ldots, n$, and the assumptions for the conditional zero mean case are satisfied. If we add conditional normality, then the transformed disturbances are independent of the transformed regressors and we have the same finite sample distributions and inferences as the stochastic regressor case.

Thus, we perform OLS on the transformed system
$$y^* = X^*\beta + u^*$$
to obtain
$$\tilde\beta = (X^{*\prime}X^*)^{-1}X^{*\prime}y^* = (X'\Lambda^{-1}X)^{-1}X'\Lambda^{-1}y.$$

This estimator is called weighted least squares (WLS), since we have performed least squares after having weighted the observations by $w_i = 1/\lambda_i$. This estimator is just generalized least squares (GLS) for the specific problem of heteroskedasticity of known form.
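In code, WLS is simply OLS on the reweighted data. A minimal numpy sketch, with `lam2` an assumed input holding the known $\lambda_i^2$, is:

```python
import numpy as np

def wls(y, X, lam2):
    """Weighted least squares with known variance weights:
    Var(u_i) = sigma^2 * lam2[i]; equivalent to OLS on data divided by sqrt(lam2)."""
    w = 1.0 / np.sqrt(lam2)            # weights 1 / lambda_i
    y_star = w * y
    X_star = X * w[:, None]
    beta, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)
    return beta
```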

13.4.2 Properties of WLS (known weights)

Substituting for $y^*$ in the usual fashion yields
$$\tilde\beta = (X^{*\prime}X^*)^{-1}X^{*\prime}(X^*\beta + u^*) = \beta + (X^{*\prime}X^*)^{-1}X^{*\prime}u^*.$$
So $\mathrm{E}[\tilde\beta|X^*] = \beta$, and $\tilde\beta$ is unbiased and also BLUE. Similarly, we find, as might be expected,
$$\mathrm{E}[(\tilde\beta - \beta)(\tilde\beta - \beta)'|X^*] = \sigma^2(X^{*\prime}X^*)^{-1},$$
which will go to zero for large $n$. Thus, $\tilde\beta$ will be consistent.

If the $u_i$ are conditionally normal, i.e.
$$u|X \sim N(0, \sigma^2\Lambda),$$
then
$$u^*|X^* \sim N(0, \sigma^2 I_n)$$
and
$$\tilde\beta|X^* \sim N(\beta, \sigma^2(X^{*\prime}X^*)^{-1}).$$
And
$$(n-k)\frac{s^{*2}}{\sigma^2} \sim \chi^2_{n-k}$$
independent of $\tilde\beta$, so the usual finite sample inferences based on the transformed model are correct. Moreover, $\tilde\beta$ is MLE and hence BUE.

If the $u_i$ are non-normal, we must rely on the asymptotic properties. Fortunately, Assumptions (iv,h) and (v,h) allow us to use the law of large numbers and central limit theorem. Specifically, since $\frac{1}{\lambda^2(x_i)}x_ix_i'$ is i.i.d. and $\mathrm{E}[\frac{1}{\lambda^2(x_i)}x_ix_i'] = Q^*$, we find
$$\frac{1}{n}X^{*\prime}X^* = \frac{1}{n}\sum_{i=1}^n\frac{1}{\lambda_i^2}x_ix_i' = \frac{1}{n}X'\Lambda^{-1}X \xrightarrow{p} Q^*, \tag{a}$$
and, since $\frac{1}{\lambda_i^2}x_iu_i$ is i.i.d. with $\mathrm{E}[\frac{1}{\lambda^2(x_i)}x_iu_i] = 0$ and $\mathrm{E}[u_i^2\frac{1}{\lambda^4(x_i)}x_ix_i'] = \sigma^2Q^*$, then
$$\frac{1}{\sqrt{n}}X^{*\prime}u^* = \frac{1}{\sqrt{n}}\sum_{i=1}^n\frac{1}{\lambda_i^2}x_iu_i = \frac{1}{\sqrt{n}}X'\Lambda^{-1}u \xrightarrow{d} N(0, \sigma^2Q^*), \tag{b}$$
whereupon
$$\sqrt{n}(\tilde\beta - \beta) \xrightarrow{d} N(0, \sigma^2Q^{*-1}).$$
This is just an application of the results from Chapter 11. The conditions (a) and (b) are not always assured and should be verified if possible.

The usual $t$-ratios and quadratic forms would have complicated properties in small samples but are appropriate in large samples. Specifically, we have
$$\frac{\tilde\beta_j - \beta_j}{\sqrt{s^2[(X^{*\prime}X^*)^{-1}]_{jj}}} \xrightarrow{d} N(0, 1)$$
and
$$\frac{(SSE_r^* - SSE_u^*)/q}{SSE_u^*/(n-k)} \xrightarrow{d} \chi^2_q/q,$$
where $SSE^*$ denotes the sum-of-squared errors for the transformed model. Conveniently, these are the limiting distributions of the $t_{n-k}$ and $F_{q,n-k}$ as $n$ grows large.

13.4.3 Estimation with Unknown Weights

The problem with WLS is that $\lambda_i$ must be known, which is typically not the case. It frequently occurs, however, that we can estimate $\lambda_i$ (as in the Breusch-Pagan formulation), whereupon we simply use $\hat\lambda_i$ rather than $\lambda_i$ as above. This is known as feasible WLS.

Once we use estimates $\hat\lambda_i$, then $\hat x_i^* = x_i/\hat\lambda_i$ is no longer non-stochastic and the properties of $\tilde\beta$ are intractable in small samples. Fortunately, if our estimates $\hat\lambda_i$ improve in large samples, we usually find the difference between using $\hat\lambda_i$ and $\lambda_i$ disappears. Strictly speaking, in addition to (a) and (b) above we need to verify
$$\frac{1}{n}\left[X'\hat\Lambda^{-1}X - X'\Lambda^{-1}X\right] \xrightarrow{p} 0 \tag{c}$$
and
$$\frac{1}{\sqrt{n}}\left[X'\hat\Lambda^{-1}u - X'\Lambda^{-1}u\right] \xrightarrow{p} 0 \tag{d}$$
for the case at hand before being assured that feasible WLS is asymptotically equivalent to WLS based on known weights. Verification of these conditions will depend on the specific form of $\lambda_i^2 = \lambda^2(x_i)$.

In order to perform inference with the feasible WLS estimator we need a consistent covariance estimator. Under Conditions (c) and (d),
$$\hat Q^* = \frac{1}{n}X'\hat\Lambda^{-1}X \xrightarrow{p} Q^*$$
and
$$\frac{\tilde\beta_j - \beta_j}{\sqrt{s^2[(X'\hat\Lambda^{-1}X)^{-1}]_{jj}}} \xrightarrow{d} N(0, 1),$$
which means the usual ratios reported for the OLS estimates of the feasible transformed model are asymptotically appropriate. This consistent covariance matrix may also be used in quadratic forms to perform inference on more complicated hypotheses.

13.4.4 Heteroskedastic-Consistent Covariances

If we don't have a good model for the heteroskedasticity, then we can't do WLS. An alternative, in this case, is to use an appropriate covariance estimate for OLS. Recall that
$$\sqrt{n}(\hat\beta - \beta) \xrightarrow{d} N(0, Q^{-1}MQ^{-1}).$$
Now, under the conditions of (iv,h) and (v,h), $s^2 \xrightarrow{p} \sigma^2$, and
$$\hat Q = \frac{1}{n}\sum_t x_tx_t' \xrightarrow{p} Q, \qquad \hat M = \frac{1}{n}\sum_t e_t^2x_tx_t' \xrightarrow{p} M,$$
so
$$\hat C = \hat Q^{-1}\hat M\hat Q^{-1} \xrightarrow{p} Q^{-1}MQ^{-1}.$$
And
$$\frac{\sqrt{n}(\hat\beta_i - \beta_i)}{\sqrt{\hat C_{ii}}} = \frac{\hat\beta_i - \beta_i}{\sqrt{[(X'X)^{-1}(\sum_t e_t^2x_tx_t')(X'X)^{-1}]_{ii}}} \xrightarrow{d} N(0, 1).$$
The denominator in the middle expression above would be the reported robust or heteroskedastic-consistent standard errors. They are also known as Eicker-White standard errors.

The beauty of this approach is that it does not require a model of heteroskedasticity. If we instead use the feasible weighted least squares approach, we always stand the chance of misspecification, particularly since our models seldom suggest a form for the heteroskedasticity. And if we misspecify the heteroskedasticity, the resulting transformed model from WLS will still be heteroskedastic and suffer inferential difficulties. This is just another manifestation of the bias-variance trade-off.
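A minimal numpy sketch of OLS with Eicker-White (HC0) standard errors follows; it implements the sandwich formula above and is illustrative rather than a full-featured routine.

```python
import numpy as np

def hc0_standard_errors(y, X):
    """OLS with Eicker-White (HC0) heteroskedasticity-consistent standard errors."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    e = y - X @ beta
    meat = (X * (e**2)[:, None]).T @ X           # sum_t e_t^2 x_t x_t'
    cov = XtX_inv @ meat @ XtX_inv               # sandwich covariance
    return beta, np.sqrt(np.diag(cov))
```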

Chapter 14

Serial Correlation

14.1 Review and Introduction

14.1.1 Model and Ideal Conditions

For relevance, we rewrite the model in terms of time-series data. Written one period at a time, the model is
$$y_t = \beta_1x_{t1} + \beta_2x_{t2} + \cdots + \beta_kx_{tk} + u_t = \sum_{j=1}^k\beta_jx_{tj} + u_t,$$
where $t = 1, 2, \ldots, n$. The ideal assumptions are

(i) E[ut] = 0

2 2 (ii) E[ut ] = σ

(iii) E[utus] = 0, s =6 t

(iv) xti non-stochastic

(v) (xt1, xt2, . . . , xtk) not linearly related

(vi) ut normally distributed. This model is the same as before except for the use of the subscript t rather than i, since we are using time-series data. In matrix notation, since we are not identifying individual observations, the model is the same as before, namely

y = Xβ + u.

And the assumptions are still


(i) E[u] = 0

2 (ii & iii) E[uu0] = σ In

(iv & v) X non-stochastic, full column rank

2 (vi) u ∼ N(0, σ In).

14.1.2 Serial Correlation

Condition (iii) is the assumption at issue if the disturbances are serially (between periods) correlated. The disturbances are said to be serially correlated if (iii) is violated, i.e.,
$$\mathrm{E}[u_su_t] = \omega \neq 0 \qquad\text{for some } s \neq t.$$
Serial correlation, as a problem, is most frequently encountered in time-series analysis. This may be chiefly a result of the fact that we have a stronger notion of proximity of observations, e.g. adjacent periods, than in cross-sectional analysis.

14.1.3 Autoregressive Model

For expositional purposes, we will entertain one of the simplest possible, but still very useful, models of serial correlation. Suppose that the $u_t$ are related between periods by
$$u_t = \rho u_{t-1} + \epsilon_t, \qquad t = 1, 2, \ldots, n,$$
where $-1 < \rho < 1$ and the $\epsilon_t$ satisfy
$$\mathrm{E}[\epsilon_t] = 0, \qquad \mathrm{E}[\epsilon_t^2] = \sigma_\epsilon^2, \qquad \mathrm{E}[\epsilon_t\epsilon_s] = 0, \quad s \neq t,$$
for $t = 1, 2, \ldots, n$. The disturbances $u_t$ are said to have been generated by a first-order autoregressive or AR(1) process.

Rewrite (by substituting recursively)
$$u_t = \epsilon_t + \rho u_{t-1} = \epsilon_t + \rho(\epsilon_{t-1} + \rho u_{t-2}) = \epsilon_t + \rho\epsilon_{t-1} + \rho^2(\epsilon_{t-2} + \rho u_{t-3}) = \epsilon_t + \rho\epsilon_{t-1} + \rho^2\epsilon_{t-2} + \rho^3\epsilon_{t-3} + \cdots.$$
Since $|\rho| < 1$,
$$\mathrm{E}[u_t] = 0.$$
While
$$\mathrm{E}[u_t^2] = \mathrm{E}[(\epsilon_t + \rho\epsilon_{t-1} + \rho^2\epsilon_{t-2} + \cdots)^2] = \sigma_\epsilon^2 + \rho^2\sigma_\epsilon^2 + \rho^4\sigma_\epsilon^2 + \cdots = \sigma_\epsilon^2(1 + \rho^2 + \rho^4 + \cdots) = \frac{\sigma_\epsilon^2}{1-\rho^2} = \sigma_u^2,$$
since $\rho^2 < 1$ and the cross-product terms have zero expectation. Similarly, we have
$$\mathrm{E}[u_tu_{t-1}] = \mathrm{E}[(\epsilon_t + \rho\epsilon_{t-1} + \rho^2\epsilon_{t-2} + \cdots)(\epsilon_{t-1} + \rho\epsilon_{t-2} + \rho^2\epsilon_{t-3} + \cdots)] = \rho\sigma_\epsilon^2 + \rho^3\sigma_\epsilon^2 + \rho^5\sigma_\epsilon^2 + \cdots = \rho\,\sigma_u^2 \neq 0,$$
and, generally,
$$\mathrm{E}[u_tu_{t-s}] = \rho^s\sigma_u^2.$$
Forming the variances and covariances into a matrix yields
$$\mathrm{E}[uu'] = \sigma_u^2\begin{pmatrix}
1 & \rho & \rho^2 & \cdots & \rho^{n-1} \\
\rho & 1 & \rho & \cdots & \rho^{n-2} \\
\rho^2 & \rho & 1 & \cdots & \rho^{n-3} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\rho^{n-1} & \rho^{n-2} & \rho^{n-3} & \cdots & 1
\end{pmatrix} = \sigma_u^2\,\Omega(\rho).$$

Note the structure of this covariance matrix. The elements along the main diagonal are constant, as are the elements all along each off-diagonal. However, the values in the off-diagonals become smaller exponentially as we move away from the main diagonal toward the upper right-hand (and lower left-hand) corner(s). This latter feature is an example of the infinite but decreasing memory of the process. An innovation $\epsilon_t$ always matters to the process but matters less and less as the process evolves. Such properties are typical of time series.

It is possible to entertain other models of serial correlation. For example, we could posit a second-order autoregressive AR(2) process
$$u_t = \rho_1u_{t-1} + \rho_2u_{t-2} + \epsilon_t, \qquad t = 1, 2, \ldots, n,$$
where $\rho_1$ and $\rho_2$ are parameters subject to appropriate stability restrictions. This process would have a much more complicated covariance structure, which we will not pursue here. Or we might have a first-order moving-average MA(1) process
$$u_t = \epsilon_t + \lambda\epsilon_{t-1}, \qquad t = 1, 2, \ldots, n,$$
where $\lambda$ is a parameter governing the covariance between adjacent periods. This process has a simpler but still interesting structure and is sometimes encountered in practice but will not be pursued here.
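To make the structure of $\Omega(\rho)$ concrete, here is a small numpy sketch (illustrative only) that builds the AR(1) correlation matrix; multiplied by $\sigma_u^2$ it gives the covariance above and could be fed to a GLS routine such as the one sketched in Chapter 12.

```python
import numpy as np

def ar1_omega(n, rho):
    """AR(1) correlation matrix Omega(rho) with (i, j) element rho**|i-j|."""
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])

# Example: the full disturbance covariance is sigma_u**2 * ar1_omega(n, rho).
```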

14.1.4 Some causes of Serial Correlation Most economic time series are subject to inertia. That is, they tend to move smoothly or sluggishly over time. There is a sense in which values of the sub- script t that are similar indicate a certain proximity economically so the values of the variables are expected to be similar. When this occurs, we find the se- ries are serially correlated since adjacent values of the series tend to be large and/or small together. It is not unreasonable to consider the possibility that the disturbances might have similar properties. Serial correlation can arise if we misspecify the model. For example if we omit variables that are themselves serially correlated. Suppose that the true relationship is yt = β1xt1 + β2xt2 + β3xt3 + β4xt4 + ut But we estimate yt = β1xt1 + β2xt2 + β3xt3 + vt

Then $v_t = \beta_4x_{t4} + u_t$. Aside from other difficulties, if the $x_{t4}$ are serially correlated, then so will $v_t$ be serially correlated, even if the original $u_t$ were not! For this reason many econometricians take evidence of serial correlation in a model as an index of misspecification. Or, if the true relationship is nonlinear and we mistakenly specify a linear model, serial correlation can result. Suppose the true relationship is

$$y_t = f(x_t; \theta) + u_t,$$
where $f(x_t; \theta)$ is a nonlinear relationship, yielding a scatterplot such as the following.

[Figure: scatterplot of $y_t$ against $x_t$ lying along the nonlinear curve $f(x_t; \theta)$, with a fitted straight line $\alpha + \beta x_t$ cutting through it.]

If we mistakenly estimate

yt = α + βxt + vt then even the “best” fitting line will exhibit correlated residuals since the resid- uals will tend to be positive and negative together and the xt move sluggishly. It is possible to have serial correlation in non-time-series cases. For example if we have spatially located observations with a careful measure of distance we can define the concept of spatial serial correlation. Here, we can index the observations by the location measure. There has been substantial work on such models but the basic approaches used are much the same as for the time- series cases.

14.2 Consequences of Serial Correlation

14.2.1 For OLS Estimation

Consider the OLS estimator

$$\hat\beta = (X'X)^{-1}X'y = (X'X)^{-1}X'(X\beta + u) = \beta + (X'X)^{-1}X'u$$
or
$$\hat\beta - \beta = (X'X)^{-1}X'u.$$

Thus
$$\mathrm{E}[\hat\beta] = \beta + (X'X)^{-1}X'\mathrm{E}[u] = \beta$$
and OLS is unbiased (but not BLUE).

While
$$\mathrm{cov}(\hat\beta) = \mathrm{E}[(\hat\beta-\beta)(\hat\beta-\beta)'] = \mathrm{E}[(X'X)^{-1}X'uu'X(X'X)^{-1}] = (X'X)^{-1}X'\sigma_u^2\Omega X(X'X)^{-1} = \sigma_u^2(X'X)^{-1}X'\Omega X(X'X)^{-1}.$$

With the reasonable assumptions $\frac{1}{n}X'X \xrightarrow{p} Q$ and $\frac{1}{n}X'\Omega X \xrightarrow{p} M$, then
$$\lim_{n\to\infty}(X'X)^{-1}X'\Omega X(X'X)^{-1} = 0$$
and the variances of $\hat\beta$ collapse about their expectation $\beta$, whereupon $\hat\beta$ is consistent.

Consider
$$s^2 = \frac{e'e}{n-k},$$
where $e = y - X\hat\beta$ are the OLS residuals. We can show
$$\mathrm{E}[s^2] = \sigma_u^2\left[\frac{n}{n-k} - \frac{1}{n-k}\mathrm{tr}((X'X)^{-1}X'\Omega X)\right].$$
Under the assumptions used above, for large $n$, the second term goes to zero while $n/(n-k) \to 1$; thus
$$\lim_{n\to\infty}\mathrm{E}[s^2] = \sigma_u^2.$$
Moreover, under these same assumptions, consistency of $s^2$ follows from consistency of $\hat\beta$.

Suppose ut are normal or 2 ut ∼ N(0, σuQ) then 2 1 1 β ∼ N β, σu(X0X)− X0ΩX(X0X)− Clearly  b 2 1 s (X0X)− is not an appropriate estimate of the covariance of β, namely,

2 1 1 σu(X0X)− X0ΩX(X0X)−b since X0QX =6 X0X in general. 194 CHAPTER 14. SERIAL CORRELATION

f(t ) n-k

f(z )

j

c t(n-k,0.025)

Suppose ρ > 0 and the xt move smoothly, which is typically the case, then we can show

X0ΩX > X0X

(in the sense of exceeding by a positive semi-definite matrix) and

1 1 1 1 (X0X)− X0ΩX(X0X)− > (X0X)− X0X(X0X)− so typically, 2 1 1 2 1 σu(X0X)− X0ΩX(X0X)− > s (X0X)− and the usual estimated variances (and standard errors) calculated by OLS packages understate the true variances. Thus, the ratios,

βi − βi = zi 2 s djj b have a distribution that this fatterp than tn k, since we are dividing by too small a value. Graphically − Thus, we end up (apparently) in the tails too often and commit Type I errors with increased probability.

14.2.3 For Forecasting

Suppose we use
$$\hat y_p = x_p'\hat\beta$$
as a predictor of
$$y_p = x_p'\beta + u_p.$$
Then since $\mathrm{E}[\hat\beta] = \beta$,
$$\mathrm{E}[\hat y_p] = x_p'\beta = \mathrm{E}[y_p],$$
so $\hat y_p$ is still unbiased. Since $\hat\beta$ is no longer BLUE, $\hat y_p$ is not BLUP.

14.3 Detection of Serial Correlation

14.3.1 Graphical Techniques

Serial correlation implies some sort of nonzero covariances between adjacent disturbances. This suggests that there will be some type of linear relationship (either positive or negative) between disturbances of adjacent periods. Such patterns of serial relationship are perhaps best detected (at least initially) by plotting the disturbances against time. Since we don't have the $u_t = y_t - x_t'\beta$, we use the OLS residuals $e_t = y_t - x_t'\hat\beta$, which will be similar.

Graphical techniques are most useful in detecting first-order serial correlation, i.e.
$$\mathrm{E}[u_tu_{t-1}] = \mu \neq 0.$$
Let
$$\rho = \frac{\mathrm{cov}(u_t, u_{t-1})}{\sqrt{\mathrm{var}(u_t)\cdot\mathrm{var}(u_{t-1})}} = \frac{\mu}{\sqrt{\sigma_u^2\sigma_u^2}} = \frac{\mu}{\sigma_u^2}.$$
Then
$$\mathrm{E}[u_tu_{t-1}] = \rho\sigma_u^2.$$
Thus, graphical techniques can be effective in detecting whether or not $\rho = 0$.

Suppose $\rho > 0$; then we have positive serial correlation between $u_t$ and $u_{t-1}$. When this is true, $u_t$ will tend to be highly positive (negative) when $u_{t-1}$ was highly positive (negative). Thus, $u_t$, and hence $e_t$, might be expected to move relatively smoothly, with extreme values occurring together in adjacent periods. For example:

[Figure: OLS residuals $e_t$ plotted against $t$ under positive serial correlation, drifting smoothly with runs of positive and negative values.]

Suppose $\rho < 0$; then we have negative serial correlation between $u_t$ and $u_{t-1}$. Here $u_t$ will typically be extremely negative (positive) when $u_{t-1}$ is extremely positive (negative). Thus, we have a jagged pattern with extreme values clustered together, but alternating in sign. For example:

[Figure: OLS residuals $e_t$ plotted against $t$ under negative serial correlation, alternating in sign from period to period in a jagged pattern.]

14.3.2 Von Neumann Ratio

Suppose we observe $u_t$; then Von Neumann suggests tests may be based on the statistic
$$VNR = \frac{\sum_{t=2}^n(u_t - u_{t-1})^2/(n-1)}{\sum_{t=1}^n u_t^2/n}.$$
Expanding and invoking a LLN for non-independent data, we have
$$VNR = \frac{\frac{1}{n-1}\sum_t u_t^2 - \frac{2}{n-1}\sum_t u_tu_{t-1} + \frac{1}{n-1}\sum_t u_{t-1}^2}{\frac{1}{n}\sum_t u_t^2} \xrightarrow{p} \frac{\sigma_u^2 - 2\rho\sigma_u^2 + \sigma_u^2}{\sigma_u^2} = 2(1-\rho).$$
Thus, under the null hypothesis of no serial correlation ($\rho = 0$), $VNR \xrightarrow{p} 2$.

In fact, under the null hypothesis and general conditions on $u_t$, Von Neumann shows
$$VNR \sim N(2, 4/n)$$
for large samples. So
$$z = \frac{VNR - 2}{\sqrt{4/n}} = \sqrt{n}\,\frac{VNR - 2}{2} \xrightarrow{d} N(0, 1).$$
Under the alternative ($\rho \neq 0$), we expect $VNR$ to converge to values other than 2 and hence $z$ to be extreme. Values in the left-hand tail indicate positive serial correlation ($\rho > 0$) and right-hand tail values indicate negative serial correlation. Since $u_t$ is unobservable, we use $e_t$, since the difference disappears in large samples.


14.3.3 Durbin-Watson Test

The VNR is, strictly speaking, appropriate only in large samples. In small samples, Durbin and Watson propose the related statistic
$$d = \frac{\sum_{t=2}^n(e_t - e_{t-1})^2}{\sum_{t=1}^n e_t^2} = VNR\cdot\frac{n-1}{n}.$$
Thus, in large samples, the difference between $d$ and $VNR$ will become negligible. Now
$$d = \frac{e'Ae}{e'e},$$
where
$$A = \begin{pmatrix}
1 & -1 & 0 & \cdots & 0 & 0 & 0 \\
-1 & 2 & -1 & \cdots & 0 & 0 & 0 \\
0 & -1 & 2 & \cdots & 0 & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & -1 & 2 & -1 \\
0 & 0 & 0 & \cdots & 0 & -1 & 1
\end{pmatrix}.$$
Under the null hypothesis of no serial correlation, D-W proceed to show
$$\mathrm{E}[d] = \frac{\mathrm{tr}(A) - \mathrm{tr}(X'AX(X'X)^{-1})}{n-k} \longrightarrow 2;$$
thus the distribution (in small samples) of $d$ depends in a complicated way upon $X$, and we cannot calculate the exact distribution of $d$ without taking explicit account of the values of $X$. This is, needless to say, quite complicated. Accordingly, D-W consider two bounding distributions for each value of $n - k$.

[Figure: two densities for $d$ under the null; curve (a) is the most compact (best-case) distribution and curve (b) the fattest (worst-case) distribution, with lower-tail critical values $d_U$ and $d_L$ respectively.]

Here (a) represents the best-case (most compact) distribution of $d$ while (b) is the worst-case (fattest) distribution that $d$ can have for any $X$. For $\alpha = 0.05$, say, $d_U$ is the correct critical value if (a) is correct while $d_L$ is the critical value if (b) is correct. Note that we are only looking at the left-hand tail, so we are thinking about a one-sided alternative. Moreover, we have in mind a positive value of $\rho$ under the alternative. Thus, while we don't precisely know what the appropriate distribution is, we know that if

$d > d_U$: we are not in the left 0.05 tail, regardless of the values in $X$;

$d < d_L$: we are in the left 0.05 tail, regardless of the values in $X$;

$d_L < d < d_U$: we are in the inconclusive region and may or may not be in the tail, depending on $X$.

In practice, however, we can often reduce or eliminate the inconclusive region. If $\rho > 0$ and the $x_t$ move smoothly over time, then the true critical value will be close to $d_U$ and we can use $d_U$ as a critical value in the usual way.

One caution on the Durbin-Watson statistic is warranted when the regressors include lagged endogenous variables. Obviously, this violates the assumption of non-stochastic regressors, so we could at best expect the test to be appropriate only in large samples. There are additional complications, however, since in this case the $d$ statistic will be biased upward under the alternative of $\rho > 0$, so the test will lose power. The null distribution will still be correct in large samples. Thus evidence of serial correlation revealed by the test still applies regardless of the presence of lagged endogenous variables.
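Computing the statistic itself from OLS residuals is a one-liner; a minimal sketch (standard packages report the same quantity) is:

```python
import numpy as np

def durbin_watson(e):
    """Durbin-Watson statistic d = sum (e_t - e_{t-1})^2 / sum e_t^2.
    Values near 2 suggest no first-order serial correlation; d << 2 suggests rho > 0."""
    e = np.asarray(e)
    return np.sum(np.diff(e)**2) / np.sum(e**2)
```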

14.3.4 A Regression Approach

Suppose that the $u_t$ are generated by the first-order autoregressive process
$$u_t = \rho u_{t-1} + \epsilon_t.$$
Substitute for $u_t$ and $u_{t-1}$ from
$$u_t = y_t - x_t'\beta, \qquad u_{t-1} = y_{t-1} - x_{t-1}'\beta$$
to obtain
$$y_t - x_t'\beta = \rho(y_{t-1} - x_{t-1}'\beta) + \epsilon_t$$
or
$$y_t = x_t'\beta + \rho(y_{t-1} - x_{t-1}'\beta) + \epsilon_t.$$
Under $H_0: \rho = 0$, we also have $\rho\beta = 0$; thus, a direct test is to use
$$y_t = (x_t'\ \ y_{t-1}\ \ x_{t-1}')\begin{pmatrix}\beta \\ \rho \\ \gamma\end{pmatrix} + \epsilon_t$$
as an unrestricted regression and
$$y_t = x_t'\beta + u_t$$
as the restricted regression, since $\rho = 0$ and $\gamma = 0$ under $H_0$. Having performed these two regressions, we form the usual statistic
$$\frac{(SSR_r - SSR_u)/\#\text{rest}}{SSR_u/(n-k)},$$
which will have approximately an $F_{\#\text{rest},\,n-k}$ distribution. Under the alternative, $\rho \neq 0$ and the statistic would become large. Note that this procedure is only appropriate asymptotically, since the regressors in the unrestricted case include $y_{t-1}$ and are no longer non-stochastic.

Rather obviously, due to multicollinearity, this approach breaks down if the regressors include one-period lags of the endogenous variable. In this case, we eliminate the second instance of $y_{t-1}$ from the list of regressors and consider the unrestricted regression
$$y_t = (x_t'\ \ x_{t-1}')\begin{pmatrix}\beta^* \\ \gamma\end{pmatrix} + \epsilon_t,$$
where $y_{t-1}$ is included in $x_t$ and its coefficient is now $\beta_j^* = \beta_j + \rho$. Under the null hypothesis, we still have $\gamma = 0$ but $\beta_j^* \neq 0$ is possible. So now we proceed as before but only test $\gamma = 0$ using the $F$-test.

14.4 Correcting Serial Correlation

14.4.1 With Known $\rho$

The basic regression equation $y_t = x_t'\beta + u_t$ can be rewritten $u_t = y_t - x_t'\beta$, and the autoregressive equation $u_t = \rho u_{t-1} + \epsilon_t$ can be rewritten $\epsilon_t = u_t - \rho u_{t-1}$. Substitution of the first into the second now yields
$$\epsilon_t = (y_t - x_t'\beta) - \rho(y_{t-1} - x_{t-1}'\beta),$$
which can be rewritten as
$$(y_t - \rho y_{t-1}) = (x_t - \rho x_{t-1})'\beta + \epsilon_t$$
$$y_t^* = x_t^{*\prime}\beta + \epsilon_t$$
for $t = 2, \ldots, n$. This procedure is known as the Cochrane-Orcutt transformation. Since $x_t^*$ is non-stochastic and $\epsilon_t$ has ideal properties, this transformed model satisfies the classical assumptions and least squares will have the usual nice properties.

Note, however, that we have lost one observation (the first) in obtaining this transformed model. It is possible to use a different transformation to recover the first observation. Let $\epsilon_1^* = \sqrt{1-\rho^2}\,u_1 = \sqrt{1-\rho^2}\,(\rho u_0 + \epsilon_1)$; then $\mathrm{E}[\epsilon_1^{*2}] = \sigma_\epsilon^2$ and $\mathrm{E}[\epsilon_1^*\epsilon_s^*] = 0$ for $s \neq 1$. Accordingly, we obtain the following transformations for the first observation:
$$y_1^* = \sqrt{1-\rho^2}\,y_1, \qquad x_1^* = \sqrt{1-\rho^2}\,x_1.$$
This procedure for recovering the first observation is known as the Prais-Winsten transformation.

Combining the two transformations and using matrix notation, we obtain the GLS transformation
$$y^* = X^*\beta + \epsilon,$$
where $\epsilon_1 = \epsilon_1^*$. Since $\rho$ is known and $X$ is non-stochastic, the transformed model satisfies all the ideal properties, namely $\mathrm{E}[\epsilon] = 0$, $\mathrm{E}[\epsilon\epsilon'] = \sigma_\epsilon^2 I_n$, and $X^*$ non-stochastic with full column rank. Thus, the OLS estimator of the transformed model,
$$\tilde\beta = (X^{*\prime}X^*)^{-1}X^{*\prime}y^*,$$
is unbiased and efficient. Specifically,
$$\mathrm{E}[\tilde\beta] = \beta, \qquad \mathrm{cov}(\tilde\beta) = \sigma_\epsilon^2(X^{*\prime}X^*)^{-1}.$$
And for unknown variance,
$$\widehat{\mathrm{cov}}(\tilde\beta) = s^2(X^{*\prime}X^*)^{-1}$$
is unbiased, where $s^2$ is the usual variance estimator for the transformed model.

14.4.2 With Estimated $\rho$

In general, we will not know $\rho$, so the GLS transformation is not feasible. Suppose we have an estimate $\hat\rho$ that is consistent; then we obtain the feasible transformations
$$(y_t - \hat\rho y_{t-1}) = (x_t - \hat\rho x_{t-1})'\beta + \epsilon_t$$
$$\hat y_t^* = \hat x_t^{*\prime}\beta + \epsilon_t$$
for $t = 2, \ldots, n$, and for the first observation
$$\sqrt{1-\hat\rho^2}\,y_1 = \sqrt{1-\hat\rho^2}\,x_1'\beta + \epsilon_1$$
$$\hat y_1^* = \hat x_1^{*\prime}\beta + \epsilon_1.$$
Applying OLS to this transformed model yields the feasible GLS estimator
$$\tilde\beta = (\hat X^{*\prime}\hat X^*)^{-1}\hat X^{*\prime}\hat y^*.$$
Under fairly general conditions, the assumptions (a)-(d) from Chapter 11 can be verified and this estimator has the same large-sample limiting distribution as the GLS estimator based on known $\rho$. Note that since the results only apply in large samples, one can bypass using the Prais-Winsten transformation to recover the first observation and still enjoy the same asymptotic results.

This leaves us with the problem of estimating $\rho$ consistently. If $u_t$ were observable, we might estimate $\rho$ in
$$u_t = \rho u_{t-1} + \epsilon_t$$
directly by OLS:
$$\hat\rho = \frac{\sum_{t=2}^n u_tu_{t-1}}{\sum_{t=2}^n u_t^2}.$$
Since $u_t$ is unobservable, we instead form
$$\hat\rho = \frac{\sum_{t=2}^n e_te_{t-1}}{\sum_{t=2}^n e_t^2},$$
where the $e_t$ are OLS residuals. This estimator is, in fact, consistent under general conditions.

An alternative approach is to use the DW statistic to estimate $\rho$. Recall that the DW and VNR statistics converge in probability to the same value and $VNR \xrightarrow{p} 2(1-\rho)$. Rewriting this we have $(1 - VNR/2) \xrightarrow{p} \rho$. Thus we have
$$\hat\rho = 1 - DW/2$$
as a consistent estimator. We can also take the coefficient of $y_{t-1}$ in the regression
$$y_t = x_t'\beta + \rho y_{t-1} + x_{t-1}'\gamma + \epsilon_t$$
as our estimate of $\rho$. This estimate will also be consistent. Both of these alternative estimators are not only consistent but have the same limiting distribution as the regression approach above.
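A minimal two-step Cochrane-Orcutt/Prais-Winsten sketch in numpy follows (illustrative assumptions only; the iterated version of the next subsection would simply loop the two steps until $\hat\rho$ stabilizes).

```python
import numpy as np

def cochrane_orcutt(y, X, prais_winsten=True):
    """Feasible GLS for AR(1) errors: estimate rho from OLS residuals,
    then run OLS on the quasi-differenced data."""
    # Step 1: OLS and residual-based estimate of rho.
    b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b_ols
    rho = (e[1:] @ e[:-1]) / (e[1:] @ e[1:])
    # Step 2: quasi-difference the data (Cochrane-Orcutt), t = 2, ..., n.
    y_star = y[1:] - rho * y[:-1]
    X_star = X[1:] - rho * X[:-1]
    if prais_winsten:
        # Recover the first observation with the Prais-Winsten transformation.
        scale = np.sqrt(1.0 - rho**2)
        y_star = np.concatenate(([scale * y[0]], y_star))
        X_star = np.vstack((scale * X[0:1], X_star))
    beta, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)
    return beta, rho
```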

14.4.3 Maximum Likelihood Estimation

The above treatment leads naturally to estimation of the model by maximum likelihood techniques. In the following, we will disregard the first observation. Since it will turn out to be a nonlinear problem, we can only establish the properties of our estimator asymptotically, in which case the use of one observation is irrelevant. In any event, we can easily modify the following to incorporate the first observation using the Prais-Winsten transformation.

Combining the basic equation $y_t = x_t'\beta + u_t$ with the autoregressive equation $u_t = \rho u_{t-1} + \epsilon_t$ and rewriting yields
$$y_t = \rho y_{t-1} + (x_t - \rho x_{t-1})'\beta + \epsilon_t, \qquad t = 1, 2, \ldots, n.$$
Suppose that $\epsilon_t \sim \text{i.i.d. }N(0, \sigma_\epsilon^2)$; then
$$y_t|y_{t-1}, x_t, x_{t-1} \sim N(\rho y_{t-1} + (x_t - \rho x_{t-1})'\beta,\ \sigma_\epsilon^2)$$
and the different observations of $y_t$ are conditionally independent but not identical. Independence is seen by ignoring the $x_t$ and writing the joint distribution as the product of conditional distributions:
$$f(y_t, y_{t-1}, y_{t-2}, \ldots) = f(y_t|y_{t-1})f(y_{t-1}, y_{t-2}, y_{t-3}, \ldots) = f(y_t|y_{t-1})f(y_{t-1}|y_{t-2})f(y_{t-2}, y_{t-3}, y_{t-4}, \ldots) = f(y_t|y_{t-1})f(y_{t-1}|y_{t-2})f(y_{t-2}|y_{t-3})\cdots.$$
The conditional distributions depend only on the previous realization due to the form of the generating equation.

Thus the conditional density of $y_t$ given $x_t$, $x_{t-1}$ is
$$f(y_t|y_{t-1}, x_t, x_{t-1}; \beta, \sigma_\epsilon^2, \rho) = \frac{1}{\sqrt{2\pi\sigma_\epsilon^2}}\exp\!\left\{-\frac{1}{2\sigma_\epsilon^2}\left((y_t - \rho y_{t-1}) - (x_t - \rho x_{t-1})'\beta\right)^2\right\} = \frac{1}{\sqrt{2\pi\sigma_\epsilon^2}}\exp\!\left\{-\frac{1}{2\sigma_\epsilon^2}(y_t^* - x_t^{*\prime}\beta)^2\right\}.$$
Due to conditional independence, the joint density and likelihood function, ignoring the first observation, is therefore
$$L = \Pr(y_1, y_2, \ldots, y_n) = \Pr(y_2|y_1)\cdot\Pr(y_3|y_2)\cdots\Pr(y_n|y_{n-1}) = \frac{1}{(2\pi\sigma_\epsilon^2)^{(n-1)/2}}\exp\!\left\{-\frac{1}{2\sigma_\epsilon^2}\sum_{t=2}^n(y_t^* - x_t^{*\prime}\beta)^2\right\}$$
and the log-likelihood is
$$\mathcal{L} = \ln L = -\frac{n-1}{2}\ln 2\pi - \frac{n-1}{2}\ln\sigma_\epsilon^2 - \frac{1}{2\sigma_\epsilon^2}\sum_{t=2}^n(y_t^* - x_t^{*\prime}\beta)^2.$$
Note that the log-likelihood depends on $\beta$ and $\rho$ only through the summation. Thus, for any $\sigma_\epsilon^2$,
$$\max_{\rho,\beta}\mathcal{L} \iff \min_{\rho,\beta}\sum_{t=2}^n(y_t^* - x_t^{*\prime}\beta)^2,$$
and, making $\rho$ explicit again, the problem is
$$\min_{\rho,\beta}\sum_{t=2}^n\left((y_t - \rho y_{t-1}) - (x_t - \rho x_{t-1})'\beta\right)^2.$$
Given a value of $\rho$, say $\hat\rho$, the problem is to minimize $\sum_{t=2}^n(\hat y_t^* - \hat x_t^{*\prime}\beta)^2$ w.r.t. $\beta$, which yields the familiar feasible GLS result
$$\tilde\beta = (\hat X^{*\prime}\hat X^*)^{-1}\hat X^{*\prime}\hat y^*.$$
And given a value of $\beta$, say $\hat\beta$, the problem is to minimize $\sum_{t=2}^n(e_t - \rho e_{t-1})^2$ with respect to $\rho$, where $e_t = y_t - x_t'\hat\beta$, which also yields a familiar result:
$$\hat\rho = \frac{\sum_{t=2}^n e_te_{t-1}}{\sum_{t=2}^n e_t^2}.$$
In order to find the maximum likelihood solution, we have to satisfy both conditions at once, so we iterate back and forth between them until the estimates stabilize. Convergence is guaranteed since we have a quadratic problem at each stage. The asymptotic limiting distribution is not changed by this further iteration. And we see that the feasible GLS estimator proposed in the previous section is asymptotically fully efficient.

Chapter 15

Misspecification

15.1 Introduction

15.1.1 The Model and Ideal Conditions

In the prequel, we have considered the linear relationship
$$y_i = \beta_1x_{i1} + \beta_2x_{i2} + \cdots + \beta_kx_{ik} + u_i,$$
where $i = 1, 2, \ldots, n$. The elements of this model are assumed to satisfy the ideal conditions

(i) E[ui] = 0

2 2 (ii) E[ui ] = σ

(iii) E[uiul] = 0, l =6 i

(iv) xij non-stochastic

(v) (xi1, xi2, . . . , xik) not linearly related

(vi) ui normally distributed

15.1.2 Misspecification

The model together with the ideal conditions comprises a complete specification of the joint stochastic behavior of the independent and dependent variables. We might say the model is misspecified if any of the assumptions are violated. Usually, however, we take misspecification to mean that the wrong functional form has been chosen for the model. That is, $y_i$ is related to the $x_{ij}$ and $u_i$ by some function $y_i = g(x_{i1}, \ldots, x_{ik}, u_i)$ that differs from the specified linear relationship given above. When any of the ideal conditions are violated, we use the terminology specific to that case.


15.1.3 Types of Misspecification

15.1.3.1 Omitted Variables

Suppose the correct model is linear,
$$y_i = \beta_1x_{i1} + \beta_2x_{i2} + \cdots + \beta_kx_{ik} + \beta_{k+1}x_{i,k+1} + \beta_{k+2}x_{i,k+2} + \cdots + \beta_{k+\ell}x_{i,k+\ell} + u_i,$$
but we estimate
$$y_i = \beta_1x_{i1} + \beta_2x_{i2} + \cdots + \beta_kx_{ik} + u_i.$$
In matrix notation, the true model is
$$y = X_1\beta_1 + X_2\beta_2 + u$$
but we estimate
$$y = X_1\beta_1 + u.$$
Thus, we obtain the estimates
$$\tilde\beta_1 = (X_1'X_1)^{-1}X_1'y,$$
where we use the tilde to denote estimation of $\beta_1$ based only on the $X_1$ variables.

15.1.3.2 Extraneous Variables

Suppose the correct model is
$$y_i = \beta_1x_{i1} + \beta_2x_{i2} + \cdots + \beta_kx_{ik} + u_i,$$
but we estimate
$$y_i = \beta_1x_{i1} + \beta_2x_{i2} + \cdots + \beta_kx_{ik} + \beta_{k+1}x_{i,k+1} + \beta_{k+2}x_{i,k+2} + \cdots + \beta_{k+\ell}x_{i,k+\ell} + u_i.$$
In matrix notation, the true model is
$$y = X_1\beta_1 + u$$
but we estimate
$$y = X_1\beta_1 + X_2\beta_2 + u^* = X\beta + u^*,$$
where $X = (X_1 : X_2)$ and $\beta' = (\beta_1', \beta_2')$. Thus, we obtain the estimates
$$\hat\beta = (X'X)^{-1}X'y.$$


15.1.3.3 Nonlinearity

We specify the model

yi = β1xi1 + β2xi2 + ... + βkxik + ui but the true model is intrinsically nonlinear:

yi = g(xi, ui)

That is, it cannot be made linear in the parameters and disturbances through any transformation.

15.1.3.4 Disturbance Misspecification

There are a number of possible ways to misspecify the way the disturbance term enters the relationship. However, for simplicity, we will only look at the case of specifying an additive error when it should be multiplicative, and the reverse. Specifically, the correct model is
$$y_i = f(x_i)u_i$$
but we specify
$$y_i = f(x_i) + u_i,$$
or vice versa.

15.2 Consequences of Misspecification

15.2.1 Omitted variables For estimation we have

1 β1 = (X10 X1)− X10 y 1 = (X10 X1)− X10 (X1β1 + X2β2 + u) e 1 1 = β1 + (X10 X1)− X10 X2β2 + (X10 X1)− X10 u.

In terms of first moments, then

1 E[β1] = β1 + (X10 X1)− X10 X2β2 =6 β1 and the estimates aree biased unless X10 X2 = 0. And, in terms of second moments, 2 1 E[(β1 − E[β1])(β1 − E[β1])0 = σ (X10 X1)− . Note that E[s2] > σ2 since X β has been thrown in with the disturbances. 1 e e2 2 e e It is instructive to compare this covariance of the estimator of β1 based on X1 only with the estimator when we also include X2. For this case, we have

1 β = (X0X)− X0y

b 206 CHAPTER 15. MISSPECIFICATION

0 0 where β = (β1, β1)0. Moreover,

2 1 b b b E[(β − E[β])(β − E[β])0 = σ (X0X)− so b b b b 2 1 E[(β1 − E[β1])(β1 − E[β1])0 = σ [(X0X)− ]11 where σ2[(X X) 1] denotes the upper left-hand submatrix of σ2(X X) 1. Us- 0 − 11b b b b 0 − ing results for partitioned inverses, we find that

2 1 2 1 1 σ [(X0X)− ]11 = σ (X10 X1 − X10 X2(X20 X2)− X20 X1)− 2 1 > σ (X10 X1)− where greater than means exceeding by a positive definite matrix. Thus the estimates while biased are less variable than the OLS estimates based on the complete set of x’s. The findings in the previous paragraph are worthy of elaboration. We find that incorrectly omitting variables, which amounts to incorrectly setting their coefficients to zero results in possibly biased estimates with smaller variance. This is an example of the classic bias-variance trade-off in estimation of multiple- parameter models. Imposing incorrect restrictions, or under-parameterizing the model in the case of zero restriction, results in biased estimates but smaller variance. For inference, the statistics

\[ q = \frac{\tilde{\beta}_j - \beta_j}{s\sqrt{m_{jj}}}, \]

where $m_{jj}$ denotes the j-th diagonal element of $(X_1'X_1)^{-1}$, will no longer be distributed as $t_{n-k}$ under the null hypothesis. Specifically, q will not be centered around 0 (due to bias) and it will be less spread out (since s² is larger). Thus, graphically we have,

[Figure: the density f(q) of the statistic compared with the $t_{n-k}$ density; the distribution of q is shifted away from zero, centered at a point determined by $\mathrm{E}(\tilde{\beta}_j)$, and is less spread out.]

Our confidence intervals constructed from q may well not bracket 0, depending on the size of the bias. This means we will commit a Type I error, rejecting a true null hypothesis, with probability greater than the ostensible size.

For forecasting, $\hat{y}_p$ will be responsive to $x_{p1}$ in the wrong way (due to the bias of $\tilde{\beta}_1$) and unresponsive to $x_{p2}$. Thus the predictions will be biased and can be systematically off. Moreover, s² will be larger and hence R² will be smaller, so we expect worsened prediction performance.
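The bias-variance trade-off described above is easy to see numerically. The following sketch (the variable names, parameter values, and data-generating process are illustrative assumptions, not taken from the text; Python with numpy) compares the "short" regression that omits x2 with the "long" regression that includes it:

    import numpy as np

    rng = np.random.default_rng(0)
    n, reps = 100, 2000
    beta1, beta2 = 1.0, 0.5

    b_short, b_long = [], []
    for _ in range(reps):
        x1 = rng.normal(size=n)
        x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)   # x2 is correlated with x1
        y = beta1 * x1 + beta2 * x2 + rng.normal(size=n)

        # short regression: x2 wrongly omitted
        b_short.append(x1 @ y / (x1 @ x1))

        # long regression: both regressors included
        X = np.column_stack([x1, x2])
        b_long.append(np.linalg.solve(X.T @ X, X.T @ y)[0])

    b_short, b_long = np.array(b_short), np.array(b_long)
    print("short (x2 omitted):  mean %.3f, var %.5f" % (b_short.mean(), b_short.var()))
    print("long  (x2 included): mean %.3f, var %.5f" % (b_long.mean(), b_long.var()))

With these settings the short-regression estimates should center near 1.4 rather than the true β1 = 1.0, but should show a visibly smaller sampling variance than the long-regression estimates, which is exactly the trade-off discussed above.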

15.2.2 Extraneous Variables

In a very real sense, the model that includes both X1 and X2 is still correct, except that β2 = 0. Thus, for estimation

\[ \mathrm{E}\begin{pmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{pmatrix} = \begin{pmatrix} \beta_1 \\ 0 \end{pmatrix}, \]

and both vectors of estimators are unbiased. Moreover, we still have

\[ \mathrm{E}[(\hat{\beta} - \mathrm{E}[\hat{\beta}])(\hat{\beta} - \mathrm{E}[\hat{\beta}])'] = \sigma^2 (X'X)^{-1} \]

and

\[ \mathrm{E}(s^2) = \sigma^2. \]

Thus $s^2 (X'X)^{-1}$ provides an unbiased estimator of the variances and covariances. However, $\hat{\beta}_1$ obtained while also estimating β2 is not efficient, since estimating using only X1 would yield smaller variances. Specifically, from the previous section, we have

\[ \mathrm{E}[(\hat{\beta}_1 - \mathrm{E}[\hat{\beta}_1])(\hat{\beta}_1 - \mathrm{E}[\hat{\beta}_1])'] = \sigma^2 [(X'X)^{-1}]_{11} = \sigma^2 \left( X_1'X_1 - X_1'X_2(X_2'X_2)^{-1}X_2'X_1 \right)^{-1} > \sigma^2 (X_1'X_1)^{-1}, \]

which is the variance of the estimator when β2 = 0 is imposed. Thus we see that, in the linear model, the price of overparameterizing the model (including extraneous variables with accompanying parameters) is a loss of efficiency. This result continues to hold for estimators in a much more general context. If we know that parameters should be zero or otherwise restricted, we should impose the restriction.

For inference, the ratios

\[ q = \frac{\hat{\beta}_j - \beta_j}{s\sqrt{[(X'X)^{-1}]_{jj}}} \]

have the t distribution with n − (k1 + k2) degrees of freedom. If we estimate using only X1, however,

\[ q = \frac{\hat{\beta}_j - \beta_j}{s\sqrt{[(X_1'X_1)^{-1}]_{jj}}} \]

will have a t distribution with n − k1 degrees of freedom.

[Figure: the $t_{n-k_1}$ and $t_{n-(k_1+k_2)}$ densities compared; the latter, with fewer degrees of freedom, is more spread out.]
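The efficiency cost of carrying extraneous regressors can also be checked numerically. A minimal sketch (hypothetical design and parameter values; numpy and scipy) evaluates the two variance expressions above for a fixed design and the critical values implied by the two degrees-of-freedom counts:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n, sigma2 = 50, 1.0
    x1 = rng.normal(size=n)
    x2 = 0.9 * x1 + rng.normal(scale=0.4, size=n)   # irrelevant but collinear with x1
    X = np.column_stack([x1, x2])

    var_excl = sigma2 / (x1 @ x1)                      # sigma^2 (X1'X1)^{-1}
    var_incl = sigma2 * np.linalg.inv(X.T @ X)[0, 0]   # sigma^2 [(X'X)^{-1}]_{11}
    print("Var(beta1 hat), x2 excluded: %.4f" % var_excl)
    print("Var(beta1 hat), x2 included: %.4f" % var_incl)   # larger

    # two-sided 5 percent critical values widen as degrees of freedom fall
    print("t critical value, n - k1 d.f.:      %.3f" % stats.t.ppf(0.975, n - 1))
    print("t critical value, n - (k1+k2) d.f.: %.3f" % stats.t.ppf(0.975, n - 2))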

Thus, there will be a loss of power and the probability of a Type II error (not rejecting a false hypothesis) will be higher. Note, however, that under the alternative the power, though smaller, will still approach 1 asymptotically, so the test remains consistent.

For forecasting, the predictions will be unbiased but more variable. However, the increase in the variance is O(1/n), so it disappears asymptotically. In terms of these second-order impacts, the estimated coefficients will not be estimated as efficiently, so the prediction intervals will be wider. Specifically, the critical values that establish the width of the intervals come from the $t_{n-(k_1+k_2)}$ distribution rather than the $t_{n-k_1}$. Both converge to the N(0, 1) in large samples.

15.2.3 Nonlinearity

For simplicity we look at a bivariate case. Suppose the underlying model is

yi = f(xi, θ) + ui for i = 1, 2, ..., n, but we mistakenly estimate the linear relationship

\[ y_i = \beta_0 + \beta_1 x_i + u_i \]

by least squares. The consequences for estimation are similar to the case of omitted variables. Expanding the nonlinear relationship about x0 in a Taylor series yields

\[ f(x_i, \theta) = f(x_0, \theta) + \frac{\partial f(x_0, \theta)}{\partial x}(x_i - x_0) + R_i = \left[ f(x_0, \theta) - \frac{\partial f(x_0, \theta)}{\partial x} x_0 \right] + \left[ \frac{\partial f(x_0, \theta)}{\partial x} \right] x_i + R_i = \alpha + \beta x_i + R_i, \]

where $\alpha = f(x_0, \theta) - \left[\partial f(x_0,\theta)/\partial x\right] x_0$, $\beta = \partial f(x_0,\theta)/\partial x$, and $R_i = R(x_i)$ is the remainder term. Thus, estimating the linear model is equivalent to omitting the remainder term, which will be a function of $x_i$. Note that the values of β and α depend on the point of expansion, so $\hat{\alpha}$ and $\hat{\beta}$ cannot be unbiased or consistent for all values.

For example, with a scatter of points generated by a concave nonlinear relationship and then fitted by linear least squares, we might have the following geometric relationship:

[Figure: a scatter of points generated by a concave function $f(x_i)$, with the fitted least-squares line $\hat{\alpha} + \hat{\beta} x_i$ drawn through it.]

It is clear that: (a) the intercept is no longer correct, (b) the fitted function yields the proper value at only two points, and (c) the slope of the estimated function is correct at only one point.

The consequences for inference are similar to the case of omitted variables. Even though a variable may be highly nonlinearly related to yi, we may accept $\hat{\beta}_j = 0$ on a linear basis. Thus, we may commit Type II errors in choosing the correct variables.

The consequences for prediction are evident. As a consequence of (b), depending on the point of evaluation, we will either systematically over- or under-estimate the regression function when used for conditional prediction. Depending on the nonlinearity, the farther we get from the range used in estimation, the worse the linear relationship will approximate f(xi).
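A small numerical sketch (the concave function, sample design, and all names here are illustrative assumptions; numpy only) makes points (b) and (c) concrete: the single fitted slope matches the derivative of f only near the middle of the data, and the fitted line matches f at only a couple of crossing points:

    import numpy as np

    rng = np.random.default_rng(10)
    n = 200
    x = rng.uniform(0.5, 4.0, size=n)
    y = 3.0 * np.log(x) + rng.normal(scale=0.2, size=n)   # concave truth f(x) = 3 log x

    # linear least-squares fit
    X = np.column_stack([np.ones(n), x])
    a_hat, b_hat = np.linalg.lstsq(X, y, rcond=None)[0]

    # compare the single fitted slope with the true derivative f'(x) = 3/x,
    # and the fitted line with f(x), at several points of evaluation
    for x0 in (0.75, 1.5, 3.5):
        print("x = %.2f: slope %.2f vs f'(x) %.2f, line %.2f vs f(x) %.2f"
              % (x0, b_hat, 3.0 / x0, a_hat + b_hat * x0, 3.0 * np.log(x0)))

With a concave f the fitted line tends to sit above the function near the ends of the sample range and below it in the middle, the systematic over- and under-prediction noted in (b).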

15.2.4 Misspecified Disturbances

Suppose yi = f(xi) + ui is correct but we mistakenly estimate

\[ y_i = f(x_i) u_i^*. \]

(For example, suppose the true model is $y_i = \alpha x_i^\beta + u_i$ but we estimate $y_i = \alpha x_i^\beta u_i^*$, or $\ln y_i = \ln\alpha + \beta \ln x_i + \ln u_i^*$.) Solving for $u_i^*$ in the general case, we obtain

\[ u_i^* = \frac{y_i}{f(x_i)} = 1 + \frac{u_i}{f(x_i)}. \]

So

\[ \mathrm{E}(u_i^*) = 1, \qquad \mathrm{var}(u_i^*) = \frac{\sigma^2}{f(x_i)^2}. \]

Thus, the disturbances are heteroskedastic. If we transform into logs to obtain a model that is linear in the disturbance, the transformed errors $\ln(u_i^*) = \ln(1 + u_i/f(x_i))$ will still be heteroskedastic.

Conversely, suppose the true model is

yi = f(xi)ui with E[ui] = 1, but we mistakenly estimate

\[ y_i = f(x_i) + u_i^*; \]

then $u_i^* = f(x_i)(u_i - 1)$, which has conditional mean zero but is heteroskedastic.

The estimation, inferential, and prediction complications arising from heteroskedasticity are well covered in Chapter 12. Likewise, the detection and correction of this problem are covered quite well there and will not be discussed below.
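The induced heteroskedasticity is easy to verify by simulation. A minimal sketch (the linear f(x) and all parameter values are illustrative assumptions; numpy only) generates data from the additive model and inspects the implied multiplicative disturbance $u_i^* = y_i/f(x_i)$:

    import numpy as np

    rng = np.random.default_rng(2)
    n, sigma = 5000, 1.0
    x = rng.uniform(1.0, 5.0, size=n)
    f = 2.0 + 3.0 * x                        # hypothetical regression function f(x)
    y = f + rng.normal(scale=sigma, size=n)  # the true model is additive

    u_star = y / f                           # implied multiplicative disturbance
    # the variance of u* should fall as f(x) grows, matching sigma^2 / f(x)^2
    for lo, hi in [(1.0, 2.0), (2.0, 3.0), (4.0, 5.0)]:
        m = (x >= lo) & (x < hi)
        print("x in [%.0f,%.0f): mean %.3f, var %.5f (theory %.5f)"
              % (lo, hi, u_star[m].mean(), u_star[m].var(),
                 (sigma**2 / f[m]**2).mean()))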

15.3 Detection of Misspecification

15.3.1 Testing for irrelevant variables

Suppose the model is

\[ y = X_1\beta_1 + X_2\beta_2 + u \]

and we suspect that β2 = 0, i.e., X2 does not enter into the regression. We may test each coefficient of β2 directly using the usual t-test. Specifically, perform

\[ \hat{\beta} = (X'X)^{-1} X' y, \]

the regression of y on X = (X1 : X2). Now,

\[ \frac{\hat{\beta}_j}{s\sqrt{d_{jj}}} \sim t_{n-k} \]

under the null hypothesis that βj = 0, where $d_{jj}$ denotes the j-th diagonal element of $(X'X)^{-1}$. Thus, we can use the printed t-values to test whether the coefficients βj = 0.

A more powerful test may be based on the F-distribution. Perform

\[ \tilde{\beta}_1 = (X_1'X_1)^{-1} X_1' y \]


as the restricted regression of y on X1 under H0 : β2 = 0. Using $\hat{\beta}$ from the regression of y on X = (X1 : X2) as the unrestricted regression, we form

\[ F = \frac{(\mathrm{SSE}_r - \mathrm{SSE}_u)/k_2}{\mathrm{SSE}_u/(n-k)} \sim F_{k_2,\,n-k} \]

as a test of the joint H0 : β2 = 0.
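A sketch of this restricted/unrestricted comparison on simulated data (the variable names, dimensions, and coefficient values are illustrative assumptions; numpy and scipy):

    import numpy as np
    from scipy import stats

    def ols_sse(X, y):
        """Residual sum of squares from an OLS fit of y on X."""
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        e = y - X @ beta
        return e @ e

    rng = np.random.default_rng(3)
    n, k1, k2 = 100, 3, 2
    X1 = rng.normal(size=(n, k1))
    X2 = rng.normal(size=(n, k2))                   # candidate irrelevant regressors
    y = X1 @ np.array([1.0, -0.5, 0.3]) + rng.normal(size=n)

    sse_r = ols_sse(X1, y)                          # restricted: beta2 = 0 imposed
    sse_u = ols_sse(np.column_stack([X1, X2]), y)   # unrestricted
    k = k1 + k2
    F = ((sse_r - sse_u) / k2) / (sse_u / (n - k))
    print("F = %.3f, p-value = %.3f" % (F, stats.f.sf(F, k2, n - k)))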

15.3.2 Testing for omitted variables

Suppose the estimated model is

y = X1β1 + u but we suspect that X2 may also enter into the relationship. That is, we suspect that the true relationship is

\[ y = X_1\beta_1 + X_2\beta_2 + u, \]

where β2 ≠ 0. Obviously, we may proceed as in the case of testing for extraneous variables to see whether X2 enters the relationship.

The above procedure is feasible only if we have some idea of what variables have been omitted, that is, if we know X2. If X2 is unknown, then we must utilize the OLS residuals

\[ e_1 = y - \hat{y} = y - X_1\tilde{\beta}_1 = y - X_1(X_1'X_1)^{-1}X_1'y = (I_n - X_1(X_1'X_1)^{-1}X_1')y \]

to detect whether variables have been omitted.

Suppose X2 has been omitted, then

\[ e_1 = (I_n - X_1(X_1'X_1)^{-1}X_1')(X_1\beta_1 + X_2\beta_2 + u) = (I_n - X_1(X_1'X_1)^{-1}X_1')(X_2\beta_2 + u) = M_1(X_2\beta_2 + u). \]

Since multiplication by M1 orthogonalizes w.r.t. X1, e1 is linearly related to that part of X2 that is orthogonal to X1.

In time series analysis, we expect $x_{t2}'\beta_2$ to move smoothly over time. Consequently, we might expect $e_{t1}$ to exhibit some smooth movement over time due to the relationship. That is,

[Figure: plot of the residuals $e_{t1}$ against time t; the residuals display smooth, systematic swings about zero rather than random scatter.]

Suppose $x_{i1}$ and $x_{i2}$ move together, i.e., are related in some fashion. Then $e_i$ will react (possibly nonlinearly) to changes in $x_{i1}$, and hence to $x_{i1}'\tilde{\beta} = \hat{y}_i$. Thus, a general technique is to plot $e_i$ against $\hat{y}_i = x_{i1}'\tilde{\beta}$. Under H0 : β2 = 0, we expect to find no relationship. If β2 ≠ 0, however, $e_i$ may have some (nonlinear) relationship to $\hat{y}_i$.

[Figure: plot of the residuals $e_t$ against the fitted values $\hat{y}_t$; when variables have been omitted the residuals trace out a systematic (nonlinear) pattern about zero.]

This procedure may be formalized. Form the regression

\[ e_t = \gamma_1 \hat{y}_t^2 + \gamma_2 \hat{y}_t^3 + v_t. \]

Under H0 : β2 = 0, $e_t$ should be unrelated to $x_{t1}$ and to $x_{t1}'\tilde{\beta} = \hat{y}_t$, i.e., γ1 = γ2 = 0. If this hypothesis is rejected by the regression results, we have evidence of omitted variables.
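The following sketch carries out this auxiliary regression on simulated data in which a relevant variable has been omitted (the data-generating process and all names are illustrative assumptions; numpy and scipy; the closing F comparison is only a rough large-sample guide, since the auxiliary regressors are themselves fitted values):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    n = 200
    x1 = rng.uniform(0.0, 2.0, size=n)
    x2 = x1 ** 2 + rng.normal(scale=0.1, size=n)       # omitted variable, related to x1
    y = 1.0 + 1.5 * x1 + 2.0 * x2 + rng.normal(size=n)

    # fit the short model y on (1, x1); keep residuals and fitted values
    X1 = np.column_stack([np.ones(n), x1])
    b1 = np.linalg.lstsq(X1, y, rcond=None)[0]
    yhat = X1 @ b1
    e = y - yhat

    # auxiliary regression of e on yhat^2 and yhat^3; test gamma1 = gamma2 = 0
    Z = np.column_stack([yhat ** 2, yhat ** 3])
    g = np.linalg.lstsq(Z, e, rcond=None)[0]
    sse_u = (e - Z @ g) @ (e - Z @ g)
    sse_r = e @ e                  # under H0 the auxiliary regressors explain nothing
    F = ((sse_r - sse_u) / 2) / (sse_u / (n - 2))
    print("gamma =", np.round(g, 3), " F = %.2f, p = %.4f" % (F, stats.f.sf(F, 2, n - 2)))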

15.3.3 Testing for Nonlinearity

Suppose the true regression is nonlinear:

yi = f(xi, θ) + ui

Expanding about x0 in a Taylor series, as above, yields

\[ y_i = \alpha + \beta x_i + R_i + u_i, \]

where $R_i = R(x_i)$ is the remainder term. In this case, since $e_i \approx R(x_i) + u_i$, $e_i$ will be a (nonlinear) function of $x_i$. Thus, plots of $e_i$ against $x_i$ will reveal a pattern:

[Figure: plot of the residuals $e_i$ against $x_i$ from the linear fit; the residuals follow a systematic nonlinear pattern about zero.]

Of course the pattern must be nonlinear in $x_i$, since the OLS residuals $e_i$ are by construction linearly unrelated to the regressors. An analogous formal test is to examine the regression

\[ e_i = \gamma_0 + \gamma_1 x_i^2 + \gamma_2 x_i^3 + v_i. \]

Under the null hypothesis of linearity, $e_i$ is mean-zero and unrelated to $x_i$, whereupon H0 : γ0 = γ1 = γ2 = 0. If nonlinearity is present we expect to find significant terms. Higher-order polynomials can be entertained, but third order usually suffices. In the multivariate case, setting up all the various second- and third-order terms in the polynomial becomes more involved. Note that the linear term is always omitted and the intercept is included in the null for power purposes.
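A sketch of this check on simulated data with a genuinely nonlinear regression function (the choice of f, the sample design, and all names are illustrative assumptions; numpy and scipy; as with the previous sketch, the F comparison is only an approximate guide because the residuals come from a first-stage fit):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    n = 200
    x = rng.uniform(0.5, 3.0, size=n)
    y = 4.0 * np.log(x) + rng.normal(scale=0.3, size=n)   # true relation is nonlinear

    # first fit the linear model and keep the residuals
    X = np.column_stack([np.ones(n), x])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b

    # auxiliary regression: e on an intercept, x^2 and x^3 (linear term omitted)
    Z = np.column_stack([np.ones(n), x ** 2, x ** 3])
    g = np.linalg.lstsq(Z, e, rcond=None)[0]
    sse_u = (e - Z @ g) @ (e - Z @ g)
    sse_r = e @ e
    F = ((sse_r - sse_u) / 3) / (sse_u / (n - 3))
    print("gamma =", np.round(g, 3), " F = %.2f, p = %.4f" % (F, stats.f.sf(F, 3, n - 3)))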

15.4 Correcting Misspecification

15.4.1 Omitted or Extraneous Variables

If the two competing models are explicitly available, then these two cases are easily handled since one model nests the other. In the case of omitted variables that should be included, the solution is to include the variables if they are available. In the case of extraneous variables, the solution is to eliminate the extraneous variables. The tests proposed in the previous section can be used to guide these decisions. If we have evidence of omitted variables but are uncertain of the nature of the omitted variables, then we need to undertake a model selection process to determine the correct variables. This process is touched on below.

15.4.2 Nonlinearities

If we have evidence of nonlinearities in the relationship, then the solution involves entertaining a nonlinear regression. Usually, if we are proceeding from an initial linear model, we have no idea what the nature of the nonlinearity is but do know which variables to use. In the bivariate case, we can simply introduce higher-order polynomial terms, as in the testing approach, except that we continue to include a linear term:

\[ y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + u_i. \]

We can start with a safe choice of the order of the polynomial and eliminate insignificant higher-order terms (a sketch of this follows below). This process can be very involved when we have a number of variables that are possibly nonlinear. This approach naturally leads one to entertain nonparametric regression, where we allow the regression function to take on a very flexible form. These approaches have slower than normal rates of convergence to the target function and are beyond the scope of our analysis here.

Sometimes, we may have a specific nonlinear function in mind as the alternative. For example, if the model were a Cobb-Douglas function and we decided that the log-linear form was inadequate due to heteroskedasticity, we might be inclined to estimate the nonlinear function directly with additive errors. This is the subject of nonlinear regression, which is given extended treatment in the next chapter.
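As mentioned above, a minimal sketch of the polynomial strategy (the cubic starting point, data-generating process, and all names are illustrative assumptions; numpy only): fit a generous polynomial, inspect the t-ratios on the higher-order terms, and drop those that are clearly insignificant.

    import numpy as np

    def ols_t_ratios(X, y):
        """OLS coefficients and their conventional t-ratios."""
        n, k = X.shape
        XtX_inv = np.linalg.inv(X.T @ X)
        b = XtX_inv @ X.T @ y
        e = y - X @ b
        s2 = e @ e / (n - k)
        return b, b / np.sqrt(s2 * np.diag(XtX_inv))

    rng = np.random.default_rng(11)
    n = 300
    x = rng.uniform(0.0, 2.0, size=n)
    y = 1.0 + 0.5 * x + 1.5 * x ** 2 + rng.normal(scale=0.5, size=n)   # truth is quadratic

    # start with a cubic; the t-ratio on the cubic term should usually look insignificant
    X3 = np.column_stack([np.ones(n), x, x ** 2, x ** 3])
    b3, t3 = ols_t_ratios(X3, y)
    print("cubic fit t-ratios:", np.round(t3, 2))

    # drop the cubic term and refit
    b2, t2 = ols_t_ratios(X3[:, :3], y)
    print("quadratic fit coefficients:", np.round(b2, 2), "t-ratios:", np.round(t2, 2))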

15.5 Model Selection Approaches

15.5.1 J-Statistic Approach

We begin with the simplified problem of choosing between two completely specified models. Suppose the first model under consideration is

\[ y = X_1\beta_1 + u_1, \]

but we suspect the alternative

\[ y = X_2\beta_2 + u_2 \]

may be correct. How can we determine whether the first (or second) model is misspecified? If the matrix X1 includes all the variables in X2, or vice versa, then the models are nested and we are dealing with possibly extraneous variables or possibly known omitted variables. These cases are treated above and the decision can be made on the basis of an F-test. So we are left with the case where the two models are non-nested: each model includes explanatory variables that are not in the other.

Davidson and MacKinnon (D-M) propose we consider a linear combination regression model

\[ y = (1-\lambda)X_1\beta_1 + \lambda X_2\beta_2 + u = X_1\left((1-\lambda)\beta_1\right) + X_2\left(\lambda\beta_2\right) + u = X_1\beta_1^* + X_2\beta_2^* + u. \]

This formulation allows us to state the problem of which is the correct model in parametric form. Specifically, we have H0 : Model 1 is correct, or λ = 0, and H1 : Model 2 is correct, or λ = 1. A complication arises, however, since neither λ nor (1 − λ) is identifiable; each is subsumed into the scale of the parameter vectors β1* and β2*.

Under the null hypothesis, however, the combined model simplifies to Model 1, which suggests we just treat the combined regression in the usual way and jointly test for β2* = 0. The problem is that X1 and X2 may, and likely will, share variables (columns), so extreme collinearity will be a problem and the unrestricted regression is infeasible. However, any linear combination of the columns of X2 will also have a zero coefficient under the null. This suggests the following two-step approach: in the first step, regress y on X2 and calculate the fitted values, say $\hat{y}_2 = X_2(X_2'X_2)^{-1}X_2'y$; in the second, we regress using the model

\[ y = X_1\beta_1^* + \gamma \hat{y}_2 + u. \]

Under the null hypothesis, at least asymptotically, γ = 0, so we can just use the usual t-ratio, which is called the "J-statistic" by D-M, to test whether γ = 0. If we fail to reject, we conclude that Model 1 is correct. If we reject, then we conclude that X2 adds important explanatory power to Model 1. We should now entertain the possibility that Model 2 is the correct model and run the test the other way, reversing the roles of Models 1 and 2. If we don't reject the reversed hypothesis, then we conclude that Model 2 is correct.

A very real possibility, and one that frequently occurs in practice, is that we reject going both ways, so neither Model 1 nor Model 2 is correct. This is a particular problem in large samples, since we know all models are misspecified at some level and this will be revealed asymptotically with consistent tests. Unfortunately, if we are still interested in choosing between the two models, this approach gives us no guidance. Thus the sequel of this section will deal with how to choose between the models when they are both misspecified.
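A sketch of the two-step J-test on simulated data (the two non-nested models, which share one regressor, and all names and values are illustrative assumptions; numpy and scipy):

    import numpy as np
    from scipy import stats

    def ols(X, y):
        """OLS coefficients, residuals, and conventional covariance matrix."""
        XtX_inv = np.linalg.inv(X.T @ X)
        b = XtX_inv @ X.T @ y
        e = y - X @ b
        s2 = e @ e / (len(y) - X.shape[1])
        return b, e, s2 * XtX_inv

    rng = np.random.default_rng(6)
    n = 150
    z1, z2, w = rng.normal(size=(3, n))
    X1 = np.column_stack([np.ones(n), z1, w])      # Model 1 regressors
    X2 = np.column_stack([np.ones(n), z2, w])      # Model 2 regressors (shares w)
    y = X1 @ np.array([1.0, 2.0, 0.5]) + rng.normal(size=n)   # Model 1 is true here

    # step 1: fitted values from Model 2
    yhat2 = X2 @ np.linalg.lstsq(X2, y, rcond=None)[0]
    # step 2: add them to Model 1; the t-ratio on gamma is the J-statistic
    Xa = np.column_stack([X1, yhat2])
    b, e, V = ols(Xa, y)
    J = b[-1] / np.sqrt(V[-1, -1])
    print("J = %.2f, p = %.3f" % (J, 2 * stats.t.sf(abs(J), n - Xa.shape[1])))

Since Model 1 generated the data in this sketch, the J-statistic should usually be insignificant in this direction, while reversing the roles of the two models should typically produce a clear rejection.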

15.5.2 Residual Variance Approach

First, we will develop a metric that can be interpreted as a measure of misspecification. Now, for the first model we obtain

\[ \hat{\beta}_1 = (X_1'X_1)^{-1}X_1'y, \]

\[ e_1 = y - \hat{y}_1 = y - X_1\hat{\beta}_1 = y - X_1(X_1'X_1)^{-1}X_1'y = (I_n - X_1(X_1'X_1)^{-1}X_1')y = M_1 y, \]

and

\[ s_1^2 = \frac{e_1'e_1}{n-k_1} = \frac{y'M_1'M_1y}{n-k_1} = \frac{y'M_1y}{n-k_1}. \]

Similarly, for the second model, we have

\[ \hat{\beta}_2 = (X_2'X_2)^{-1}X_2'y \]

and

\[ s_2^2 = \frac{y'M_2y}{n-k_2}, \]

where $M_2 = I_n - X_2(X_2'X_2)^{-1}X_2'$. Now, if the first model is correct,

\[ \mathrm{E}[s_1^2] = \sigma^2, \]

while

\[ \mathrm{E}[s_2^2] = \mathrm{E}\left[\frac{y'M_2y}{n-k_2}\right] = \mathrm{E}\left[\frac{(X_1\beta_1+u_1)'M_2(X_1\beta_1+u_1)}{n-k_2}\right] = \frac{1}{n-k_2}\mathrm{E}\left[\beta_1'X_1'M_2X_1\beta_1 + 2u_1'M_2X_1\beta_1 + u_1'M_2u_1\right] = \frac{\beta_1'X_1'M_2X_1\beta_1}{n-k_2} + \sigma^2 \;\geq\; \mathrm{E}[s_1^2]. \]

Thus, $s^2$ is smallest, on average, for the correct specification. This suggests a criterion where we select the model with the smallest estimated variance. This is the residual-variance criterion. And the expression $\beta_1'X_1'M_2X_1\beta_1/(n-k_2)$ can be taken as a measure of misspecification in terms of mean squared error.

Provided that both models have the same dependent variable, which was the case above, it is tempting to select between them on the basis of "goodness-of-fit" or R², which tells us the percentage of the total variation in the dependent variable that is explained by the model. The problem with this approach is that R² always rises as we add explanatory variables, so we would end up with a model using all possible explanatory variables whether or not they really enter the relationship. Such a model would not work well out-of-sample if the extra variables move independently of the correctly included variables. A possible solution is to penalize the statistic for the number of included regressors. Specifically, we propose the adjusted R² statistic

\[ \bar{R}^2 = 1 - (1 - R^2)\frac{n-1}{n-k} = 1 - \frac{(e'e)/(n-k)}{\sum_{i=1}^n (y_i - \bar{y})^2/(n-1)} = 1 - \frac{s^2}{\sum_{i=1}^n (y_i - \bar{y})^2/(n-1)}. \]

Note that ranking models in terms of increasing $\bar{R}^2$ is equivalent to ranking them in terms of decreasing $s^2$, since the other elements in the expression do not change between models. One advantage of the $\bar{R}^2$ approach is that it is unit- and scale-free.
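A sketch comparing two candidate specifications by residual variance and adjusted R² (the data-generating process, in which Model A is the correct one, and all names are illustrative assumptions; numpy only):

    import numpy as np

    def s2_and_rbar2(X, y):
        """Residual variance and adjusted R-squared from an OLS fit of y on X."""
        n, k = X.shape
        b = np.linalg.lstsq(X, y, rcond=None)[0]
        e = y - X @ b
        s2 = e @ e / (n - k)
        total_var = ((y - y.mean()) ** 2).sum() / (n - 1)
        return s2, 1.0 - s2 / total_var

    rng = np.random.default_rng(7)
    n = 120
    x1, x2, x3 = rng.normal(size=(3, n))
    y = 1.0 + 2.0 * x1 + 1.0 * x2 + rng.normal(size=n)

    XA = np.column_stack([np.ones(n), x1, x2])   # correct specification
    XB = np.column_stack([np.ones(n), x1, x3])   # x2 wrongly replaced by x3
    for name, X in [("Model A", XA), ("Model B", XB)]:
        s2, rbar2 = s2_and_rbar2(X, y)
        print("%s: s^2 = %.3f, adjusted R^2 = %.3f" % (name, s2, rbar2))

The two criteria necessarily agree: whichever model has the smaller s² has the larger adjusted R².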

15.5.3 A Model Selection Statistic

For the residual variance criterion, the rather obvious question is whether one estimated variance is "enough" larger than another to justify selection of one model over the other. We will formulate a test statistic as a guide in selecting between any two models. Specifically, let

\[ m_i = \frac{n}{n-k_2}\,(y_i - x_{2i}'\hat{\beta}_2)^2 - \frac{n}{n-k_1}\,(y_i - x_{1i}'\hat{\beta}_1)^2, \]

\[ \bar{m} = \frac{1}{n}\sum_{i=1}^n m_i = s_2^2 - s_1^2, \qquad v_m = \frac{1}{n}\sum_{i=1}^n (m_i - \bar{m})^2, \]

and define

\[ D_n = \frac{\sqrt{n}\,\bar{m}}{\sqrt{v_m}} \]

as our statistic of interest. Suppose $\bar{m} \xrightarrow{p} 0$; then under appropriate conditions $D_n \xrightarrow{d} N(0,1)$ and tail occurrences ($|D_n| > 1.96$, say) would be rare events. Suppose that the true model is denoted by subscript 0; then $\bar{m} \xrightarrow{p} 0$ implies $s_2^2 - s_1^2 \xrightarrow{p} 0$ and hence $\beta_1'X_1'M_0X_1\beta_1/(n-k_0) - \beta_2'X_2'M_0X_2\beta_2/(n-k_0) \xrightarrow{p} 0$. If, from above,

we take $\beta_l'X_l'M_0X_l\beta_l/(n-k_0)$ as an index of misspecification of model l relative to the true model 0, then we see that $\bar{m} \xrightarrow{p} 0$ indicates that the two models are equally misspecified. This leaves us with three possibilities:

(1) Model 1 is "equivalent" to Model 2: $D_n \xrightarrow{d} N(0,1)$.
(2) Model 1 is "better" than Model 2: $D_n \rightarrow +\infty$ a.s.
(3) Model 2 is "better" than Model 1: $D_n \rightarrow -\infty$ a.s.

Large positive values favor Model 1. Large negative values favor Model 2. The intermediate values (e.g., $|D_n| < 1.96$ for a 95 percent level) comprise an inconclusive region.
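A sketch of the computation of D_n for two non-nested candidate models (the data-generating process and all names are illustrative assumptions; numpy only):

    import numpy as np

    def scaled_sq_resid(X, y):
        """n/(n-k) times the squared OLS residuals; their mean is s^2 for that model."""
        n, k = X.shape
        b = np.linalg.lstsq(X, y, rcond=None)[0]
        e = y - X @ b
        return n / (n - k) * e ** 2

    rng = np.random.default_rng(8)
    n = 500
    x1, x2 = rng.normal(size=(2, n))
    y = 1.0 + 2.0 * x1 + rng.normal(size=n)               # Model 1 is correct here

    X1 = np.column_stack([np.ones(n), x1])
    X2 = np.column_stack([np.ones(n), x2])
    m = scaled_sq_resid(X2, y) - scaled_sq_resid(X1, y)   # m_i, so mean(m) = s2^2 - s1^2
    Dn = np.sqrt(n) * m.mean() / np.sqrt(m.var())
    print("D_n = %.2f  (large positive favors Model 1, large negative Model 2)" % Dn)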

15.5.4 Stepwise Regression

In the above model selection approaches, we have reduced the number of competing models to a small number before undertaking statistical inference. Such reductions are usually guided by competing economic theories and experience from previous empirical exercises. They are really at the heart of econometrics, which can be seen as the intersection of data, economic theory, and statistical analysis: the model we end up with depends on all three contributors to winnow the competing models down to a select few. Such an approach is at the heart of structural and even informed reduced-form modeling.

It is of interest to think about the model selection process from a more data-driven standpoint. This might be the case if economic theory does not give firm guidance as to which explanatory variables enter the relationship for which dependent variable. Such a situation might be characterized by the true model having the form

\[ y = X_1\beta_1 + u_1, \]

where the variables comprising X1 are not known but are a subset of the available economic variables X. Somehow, we would like to use the data to determine the variables entering the relationship. Unfortunately, some problems arise in any purely data-driven approach, as we will see below.

At this point, it seems appropriate to compare the consequences of incorrectly omitting a variable against incorrectly including an extraneous variable. They are really two sides of the same coin, but the consequences are very different. Omitting the variable incorrectly leads to bias and inconsistency and incorrect size due to an increased probability of a Type I error. Including an extraneous variable increases the probability of a Type II error and hence decreases power, but the test will still be consistent. Thus we are faced with the oft-encountered trade-off between size and power in deciding how to conduct inference. In the usual inferential exercise we carefully calibrate the test so that the size takes on a prespecified value, such as .05, and take the power that results. Implicitly, this approach places greater value on getting the size correct at the expense of power, and the consequences of omitted variables should be viewed as more serious than those of including extraneous variables. This suggests the "kitchen-sink" approach when deciding which variables to include in the model under construction. When in doubt, leave it in, would seem to be good advice based on this reasoning.

An approach suggested by this reasoning is backward elimination stepwise regression. In this approach, we start with a regression including all candidate regressors and eliminate them one-by-one on the basis of statistical insignificance. The resulting regression model will have all the regressors statistically significant, at some level, by the measure used. An example of a decision criterion would be to eliminate, at each step, the variable with the smallest absolute t-ratio, continuing until the remaining t-ratios are no smaller than one. At this point the elimination of any of the remaining variables would reduce the $\bar{R}^2$ and not be preferred.

There are a couple of problems with this approach. The first is the fact that we do not necessarily end up with the best set of regressors that are significant. Some variables that were eliminated in early rounds may end up being very important once others have been eliminated. A modification is to entertain all possible competing models that have a given number of explanatory variables, but this becomes very complicated when the number of variables in X is large. A second problem is that the regression model that results from this process will not have standard statistical properties. This is easily seen. The usual ratios on the estimated coefficients could not have a t distribution because small values of the ratio have been eliminated in the selection process. For example, all the ratios would be larger in absolute value than, say, 1. Nor could they even be normal in large samples. This is an example of what is called pre-test bias: if parts of a model have been selected on the basis of a previous testing step, the tests in the current step will not have the usual properties.

The serious problems with the stepwise regression approach indicated by the non-standard distribution can be seen as resulting from data-mining.
Suppose we start out with a variable yi, which we label as the dependent variable, and a large set of other variables, which we label as candidate regressors. In addition, suppose that the candidate regressors have been randomly generated and bear no true relationship to the dependent variable. If the sample size is not large and the number of candidate regressors is large, then just by chance a subset of the regressors will be correlated with the dependent variable. Specifically, a subset will have a large t-ratio just by chance. This is also easy to see: the true regression coefficients are all zero, so the t-ratios will have a t-distribution, and a certain fraction of them, say 0.16 with 30 degrees of freedom, will exceed 1. But this is purely a result of spurious correlation.

Thus we see that the stepwise regression approach is fraught with difficulties. It is really a pure data-mining approach and cannot be expected to yield important economic insights. Instead we should let our choice of variables and models be guided to the extent possible by economic theory and previous econometric results.
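This spurious-significance phenomenon is easy to demonstrate. A minimal sketch (purely artificial data; all names and dimensions are illustrative assumptions; numpy only) generates a pure-noise dependent variable and thirty unrelated candidate regressors and counts how many look "significant":

    import numpy as np

    rng = np.random.default_rng(9)
    n, p = 40, 30
    y = rng.normal(size=n)                      # dependent variable: pure noise
    X = rng.normal(size=(n, p))                 # candidate regressors, unrelated to y

    # bivariate t-ratio for each candidate regressor in turn
    tvals = []
    for j in range(p):
        Xj = np.column_stack([np.ones(n), X[:, j]])
        b = np.linalg.lstsq(Xj, y, rcond=None)[0]
        e = y - Xj @ b
        s2 = e @ e / (n - 2)
        se = np.sqrt(s2 * np.linalg.inv(Xj.T @ Xj)[1, 1])
        tvals.append(abs(b[1] / se))
    tvals = np.array(tvals)
    print("candidates with |t| > 1:    %d of %d" % ((tvals > 1).sum(), p))
    print("candidates with |t| > 1.96: %d of %d" % ((tvals > 1.96).sum(), p))

Even though none of the candidates is truly related to y, several will typically show |t| above 1, and occasionally one or two will exceed 1.96, purely by chance.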