
Asymptotic Results for the Linear Regression Model

C. Flinn

February 5, 1999

1. Asymptotic Results under Classical Assumptions

The following results apply to the linear regression model $y = X\beta + \varepsilon$, where $X$ is of dimension $(n \times k)$, $\varepsilon$ is an (unknown) $(n \times 1)$ vector of disturbances, and $\beta$ is an (unknown) $(k \times 1)$ parameter vector. We assume that $n > k$ and that $\rho(X) = k$. This implies that $\rho(X'X) = k$ as well. Throughout we assume that the "classical" conditional moment assumptions apply, namely

• $E(\varepsilon_i \mid X) = 0 \;\; \forall i$.

• $V(\varepsilon_i \mid X) = \sigma^2 \;\; \forall i$.

We first show that the probability limit of the OLS estimator is $\beta$, i.e., that it is consistent. In particular, we know that

$$\hat\beta = \beta + (X'X)^{-1}X'\varepsilon \;\Rightarrow\; E(\hat\beta \mid X) = \beta + (X'X)^{-1}X'E(\varepsilon \mid X) = \beta.$$

In terms of the (conditional) variance of the estimator $\hat\beta$, $V(\hat\beta \mid X) = \sigma^2(X'X)^{-1}$. Now we will rely heavily on the following assumption:
$$\lim_{n\to\infty} \frac{X_n'X_n}{n} = Q,$$
where $Q$ is a finite, nonsingular $k \times k$ matrix. Then we can write the covariance of $\hat\beta_n$ in a sample of size $n$ explicitly as
$$V(\hat\beta_n \mid X_n) = \frac{\sigma^2}{n}\left(\frac{X_n'X_n}{n}\right)^{-1},$$
so that
$$\lim_{n\to\infty} V(\hat\beta_n \mid X_n) = \lim_{n\to\infty}\frac{\sigma^2}{n} \times \lim_{n\to\infty}\left(\frac{X_n'X_n}{n}\right)^{-1} = 0 \times Q^{-1} = 0.$$

Since the asymptotic variance of the estimator is 0 and the distribution is centered on $\beta$ for all $n$, we have shown that $\hat\beta$ is consistent. Alternatively, we can prove consistency as follows. We need the following result.

Lemma 1.1.
$$\operatorname{plim}\left(\frac{X'\varepsilon}{n}\right) = 0.$$

Proof. First, note that $E\left(\frac{X'\varepsilon}{n}\right) = 0$ for any $n$. Then the variance of the expression $\frac{X'\varepsilon}{n}$ is given by
$$V\left(\frac{X'\varepsilon}{n}\right) = E\left[\left(\frac{X'\varepsilon}{n}\right)\left(\frac{X'\varepsilon}{n}\right)'\right] = n^{-2}E(X'\varepsilon\varepsilon'X) = \frac{\sigma^2}{n}\,\frac{X'X}{n},$$
so that $\lim_{n\to\infty} V\left(\frac{X'\varepsilon}{n}\right) = 0 \times Q = 0$. Since the asymptotic mean of the random variable is 0 and the asymptotic variance is 0, the probability limit of the expression is 0.

Now we can state a slightly more direct proof of consistency of the OLS estimator, which is
$$\operatorname{plim}(\hat\beta) = \operatorname{plim}\left(\beta + (X'X)^{-1}X'\varepsilon\right) = \beta + \left[\lim_{n\to\infty}\left(\frac{X_n'X_n}{n}\right)^{-1}\right]\operatorname{plim}\left(\frac{X_n'\varepsilon}{n}\right) = \beta + Q^{-1}\times 0 = \beta.$$
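To make the consistency result concrete, the following small Monte Carlo sketch (not part of the original notes; the design, parameter values, and variable names are our own illustrative choices) computes $\hat\beta = (X'X)^{-1}X'y$ at increasing sample sizes and shows it settling down near the true $\beta$.

```python
# Illustrative sketch only: plim(beta_hat) = beta under the classical assumptions.
# The regressor design, parameter values, and seed below are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
beta = np.array([1.0, -2.0])     # true (k x 1) parameter vector
sigma = 1.5                      # common standard deviation of the disturbances

for n in (50, 500, 5000, 50000):
    X = np.column_stack([np.ones(n), rng.uniform(-1.0, 1.0, n)])  # (n x k), rank k
    eps = rng.normal(0.0, sigma, n)                               # E(eps|X) = 0, V(eps|X) = sigma^2
    y = X @ beta + eps
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)                  # (X'X)^{-1} X'y
    print(n, beta_hat)                                            # approaches beta as n grows
```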

Next, consider whether or not $s^2$ is a consistent estimator of $\sigma^2$. Now
$$s^2 = \frac{SSE}{n-k},$$
where $SSE = (y - X\hat\beta)'(y - X\hat\beta)$. We showed that $E(s^2) = \sigma^2$ for all $n$, that is, that $s^2$ is an unbiased estimator of $\sigma^2$ for all sample sizes. Since $SSE = \varepsilon'M\varepsilon$, with $M = I - X(X'X)^{-1}X'$, then
$$\begin{aligned}
\operatorname{plim} s^2 &= \operatorname{plim}\frac{\varepsilon'M\varepsilon}{n-k}\\
&= \operatorname{plim}\frac{\varepsilon'M\varepsilon}{n}\\
&= \operatorname{plim}\frac{\varepsilon'\varepsilon}{n} - \operatorname{plim}\left[\frac{\varepsilon'X}{n}\left(\frac{X'X}{n}\right)^{-1}\frac{X'\varepsilon}{n}\right]\\
&= \operatorname{plim}\frac{\varepsilon'\varepsilon}{n} - 0\times Q^{-1}\times 0.
\end{aligned}$$
Now
$$\frac{\varepsilon'\varepsilon}{n} = n^{-1}\sum_{i=1}^{n}\varepsilon_i^2,$$
so that
$$E\left(\frac{\varepsilon'\varepsilon}{n}\right) = n^{-1}E\left(\sum_{i=1}^n \varepsilon_i^2\right) = n^{-1}\sum_{i=1}^n E\varepsilon_i^2 = n^{-1}(n\sigma^2) = \sigma^2.$$

Similarly, under the assumption that $\varepsilon_i$ is i.i.d., the variance of the expression being considered is given by
$$V\left(\frac{\varepsilon'\varepsilon}{n}\right) = n^{-2}V\left(\sum_{i=1}^n \varepsilon_i^2\right) = n^{-2}\sum_{i=1}^n V(\varepsilon_i^2) = n^{-2}\left(n\left[E(\varepsilon_i^4) - V(\varepsilon_i)^2\right]\right) = n^{-1}\left[E(\varepsilon_i^4) - V(\varepsilon_i)^2\right],$$
so that the limit of the variance of $\frac{\varepsilon'\varepsilon}{n}$ is 0 as long as $E(\varepsilon_i^4)$ is finite [we have already assumed that the first two moments of the distribution of $\varepsilon_i$ exist]. Thus the asymptotic distribution of $\frac{\varepsilon'\varepsilon}{n}$ is centered at $\sigma^2$ and is degenerate, thus proving consistency of $s^2$.
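A companion check along the same lines (again our own illustration, not from the notes): with i.i.d. disturbances, $s^2 = SSE/(n-k)$ settles down to $\sigma^2$ as $n$ grows.

```python
# Illustrative sketch only: s^2 = SSE/(n-k) converges to sigma^2.
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 2.25                                     # true disturbance variance
beta = np.array([1.0, -2.0])

for n in (50, 500, 5000, 50000):
    X = np.column_stack([np.ones(n), rng.uniform(-1.0, 1.0, n)])
    y = X @ beta + rng.normal(0.0, np.sqrt(sigma2), n)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - X.shape[1])         # SSE/(n-k)
    print(n, round(s2, 4))                        # approaches sigma2 = 2.25
```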

2. Testing without Normally Distributed Disturbances

In this section we look at the distribution of test statistics associated with linear restrictions on the $\beta$ vector when $\varepsilon_i$ is not assumed to be distributed as $N(0,\sigma^2)$ for all $i$. Instead, we will proceed with the weaker condition that $\varepsilon_i$ is independently and identically distributed with the common cumulative distribution function (c.d.f.) $F$. Furthermore, $E(\varepsilon_i) = 0$ and $V(\varepsilon_i) = \sigma^2$ for all $i$. Since we retain the mean independence and variance homogeneity assumptions, and since unbiasedness, consistency, and the Gauss-Markov theorem for that matter, all rely only on these first two conditional moment assumptions, all these results continue to hold when we drop normality. However, the small sample distributions of our test statistics will no longer be accurate, since these were all derived under the assumption of normality. If we made other explicit assumptions regarding $F$, it is possible in principle to derive the small sample distributions of test statistics, though these distributions are not simple to characterize analytically or even to compute. Instead of making explicit assumptions regarding the form of $F$, we can derive distributions of test statistics which are valid for large $n$ no matter what the exact form of $F$ [except that it must be a member of the class of distributions for which the asymptotic results are valid, of course].

We begin with the following useful lemma, which is associated with Lindeberg-Lévy.

Lemma 2.1. If $\varepsilon$ is i.i.d. with $E(\varepsilon_i) = 0$ and $E(\varepsilon_i^2) = \sigma^2$ for all $i$; if the elements of the matrix $X$ are uniformly bounded so that $|X_{ij}| < C < \infty$ for all $i$ and $j$; and if $\lim_{n\to\infty} X'X/n = Q$, a finite and nonsingular matrix; then
$$\frac{1}{\sqrt n}X'\varepsilon \to N(0,\sigma^2 Q).$$

Proof. Consider the case of only one regressor for simplicity. Then
$$Z_n \equiv \frac{1}{\sqrt n}\sum_{i=1}^n X_i\varepsilon_i$$
is a scalar. Let $G_i$ be the c.d.f. of $X_i\varepsilon_i$. Let
$$S_n^2 \equiv \sum_{i=1}^n V(X_i\varepsilon_i) = \sigma^2\sum_{i=1}^n X_i^2.$$
In this scalar case, $Q = \lim n^{-1}\sum_i X_i^2$. By the Lindeberg-Feller Theorem, the necessary and sufficient condition for $Z_n \to N(0,\sigma^2 Q)$ is
$$\lim \frac{1}{S_n^2}\sum_{i=1}^n \int_{|\omega| > \nu S_n} \omega^2\, dG_i(\omega) = 0 \qquad (2.1)$$

for all $\nu > 0$. Now $G_i(\omega) = F\!\left(\frac{\omega}{|X_i|}\right)$. Then rewrite [2.1] as
$$\lim \frac{1}{S_n^2}\sum_{i=1}^n X_i^2 \int_{|\omega/X_i| > \nu S_n/|X_i|} \left(\frac{\omega}{X_i}\right)^2 dF\!\left(\frac{\omega}{|X_i|}\right) = 0.$$

Since $S_n^2 = n\sigma^2\left(n^{-1}\sum_{i=1}^n X_i^2\right)$ and $\lim n^{-1}\sum_{i=1}^n X_i^2 = Q$, we have $\lim \frac{n}{S_n^2} = (\sigma^2 Q)^{-1}$, which is a finite and nonzero scalar. Then we need to show
$$\lim n^{-1}\sum_{i=1}^n X_i^2\,\delta_{i,n} = 0,$$

where
$$\delta_{i,n} \equiv \int_{|\omega/X_i| > \nu S_n/|X_i|} \left(\frac{\omega}{X_i}\right)^2 dF\!\left(\frac{\omega}{|X_i|}\right).$$
Now $\lim \delta_{i,n} = 0$ for all $i$ and any fixed $\nu$, since $|X_i|$ is bounded while $\lim S_n = \infty$ [thus the measure of the set $|\omega/X_i| > \nu S_n/|X_i|$ goes to 0 asymptotically]. Since $\lim n^{-1}\sum_i X_i^2$ is finite and $\lim \delta_{i,n} = 0$ for all $i$, $\lim n^{-1}\sum_i X_i^2\,\delta_{i,n} = 0$.

For vector-valued $X_i$, the result is identical of course, with $Q$ being $k \times k$ instead of a scalar. The proof is only slightly more involved. Now we can prove the following important result.

Theorem 2.2. Under the conditions of the lemma,
$$\sqrt n\,(\hat\beta - \beta) \to N(0, \sigma^2 Q^{-1}).$$

Proof. $\sqrt n\,(\hat\beta - \beta) = \left(\frac{X'X}{n}\right)^{-1}\frac{1}{\sqrt n}X'\varepsilon$. Since $\lim\left(\frac{X'X}{n}\right)^{-1} = Q^{-1}$ and $\frac{1}{\sqrt n}X'\varepsilon \to N(0,\sigma^2 Q)$, then $\sqrt n\,(\hat\beta - \beta) \to N(0, \sigma^2 Q^{-1}QQ^{-1}) = N(0, \sigma^2 Q^{-1})$.
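The content of Theorem 2.2 can be illustrated by simulation. The sketch below (our own construction; the centered exponential errors, the design, and the replication count are arbitrary choices) draws many samples with decidedly non-normal disturbances and compares the sampling covariance of $\sqrt n\,(\hat\beta - \beta)$ with $\sigma^2 Q^{-1}$.

```python
# Illustrative sketch only: sqrt(n)*(beta_hat - beta) is approximately N(0, sigma^2 Q^{-1})
# even when the disturbances are non-normal (here: centered exponential, variance 1).
import numpy as np

rng = np.random.default_rng(2)
beta = np.array([1.0, -2.0])
n, reps = 2000, 5000
X = np.column_stack([np.ones(n), rng.uniform(-1.0, 1.0, n)])   # held fixed across replications
Q = X.T @ X / n
sigma2 = 1.0                                                   # variance of Exponential(1) - 1

draws = np.empty((reps, beta.size))
for r in range(reps):
    eps = rng.exponential(1.0, n) - 1.0                        # skewed errors, mean 0, variance 1
    y = X @ beta + eps
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    draws[r] = np.sqrt(n) * (beta_hat - beta)

print(np.cov(draws, rowvar=False))     # close to sigma^2 * Q^{-1}
print(sigma2 * np.linalg.inv(Q))
```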

The results of this proof have the following practical implications. For small $n$, the distribution of $\sqrt n\,(\hat\beta - \beta)$ is not normal, though asymptotically the distribution of this random variable converges to a normal. The variance of this random variable converges to $\sigma^2 Q^{-1}$, which is arbitrarily well-approximated by $s^2\left(\frac{X_n'X_n}{n}\right)^{-1} = s^2\, n\,(X_n'X_n)^{-1}$. But the variance of $(\hat\beta - \beta)$ is equal to the variance of $\sqrt n\,(\hat\beta - \beta)$ divided by $n$, so that in large samples the variance of the OLS estimator is approximately equal to $s^2\, n\,(X_n'X_n)^{-1}/n = s^2(X_n'X_n)^{-1}$, even when $F$ is non-normal. Usual $t$ tests of one linear restriction on $\beta$ no longer have exact small-sample distributions. However, an analogous large-sample test is readily available.

Proposition 2.3. Let $\varepsilon_i$ be i.i.d. $(0,\sigma^2)$, $\sigma^2 < \infty$, and let $Q$ be finite and nonsingular. Consider the test $H_0: R\beta = r$, where $R$ is $(1\times k)$ and $r$ is a scalar, both known. Then
$$\frac{R\hat\beta - r}{\sqrt{s^2 R(X'X)^{-1}R'}} \to N(0,1).$$

Proof. Under the null, $R\hat\beta - r = R\hat\beta - R\beta = R(\hat\beta - \beta)$, so that the test statistic is
$$\frac{\sqrt n\,R(\hat\beta - \beta)}{\sqrt{s^2 R(X'X/n)^{-1}R'}}.$$
Since
$$\sqrt n\,(\hat\beta - \beta) \to N(0,\sigma^2 Q^{-1}) \;\Rightarrow\; \sqrt n\,R(\hat\beta - \beta) \to N(0, \sigma^2 R Q^{-1}R').$$


The denominator of the test statistic has a probability limit equal to $\sqrt{\sigma^2 R Q^{-1}R'}$, which is the standard deviation of the random variable in the numerator. A mean-zero normal random variable divided by its standard deviation has the distribution $N(0,1)$.
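For concreteness, here is a sketch of how the statistic in Proposition 2.3 can be computed, assuming numpy and scipy are available. The data-generating process, the particular restriction, and all names are our own illustrative choices; the notes do not prescribe an implementation.

```python
# Illustrative sketch only: the large-sample z statistic of Proposition 2.3,
# z = (R beta_hat - r) / sqrt(s^2 R (X'X)^{-1} R'), referred to the N(0,1) distribution.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 5000
X = np.column_stack([np.ones(n), rng.uniform(-1.0, 1.0, n), rng.normal(size=n)])
beta = np.array([1.0, 0.5, 0.0])
y = X @ beta + (rng.uniform(size=n) - 0.5) * np.sqrt(12.0)   # uniform errors, mean 0, variance 1

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
s2 = resid @ resid / (n - X.shape[1])                        # s^2 = SSE/(n-k)
XtX_inv = np.linalg.inv(X.T @ X)

R = np.array([0.0, 1.0, 0.0])                                # H0: R beta = r, i.e. beta_2 = 0.5
r = 0.5
z = (R @ beta_hat - r) / np.sqrt(s2 * (R @ XtX_inv @ R))
p_value = 2.0 * stats.norm.sf(abs(z))                        # two-sided p-value from N(0,1)
print(z, p_value)
```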

A similar result holds for the situation in which multiple (nonredundant) linear restrictions on $\beta$ are tested simultaneously.

Proposition 2.4. Let $\varepsilon_i$ be i.i.d. $(0,\sigma^2)$, $\sigma^2 < \infty$, and let $Q$ be finite and nonsingular. Consider the test $H_0: R\beta = r$, where $R$ is $(m\times k)$ and $r$ is an $(m\times 1)$ vector, both known. Then

$$\frac{(r - R\hat\beta)'\left[R(X'X)^{-1}R'\right]^{-1}(r - R\hat\beta)/m}{SSE/(n-k)} \to \frac{\chi^2_m}{m}.$$

Proof. The denominator is a consistent estimator of $\sigma^2$ [as would be $SSE/n$], and has a degenerate limiting distribution. Under the null hypothesis, $r - R\hat\beta = -R(X'X)^{-1}X'\varepsilon$, so that the numerator of the test statistic can be written $\varepsilon' D\varepsilon/m$, where
$$D \equiv X(X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}X'.$$
Now $D$ is symmetric and idempotent with $\rho(D) = m$. Then write
$$\frac{\varepsilon' D\varepsilon}{m\sigma^2} = \frac{\varepsilon' P P' D P P'\varepsilon}{m\sigma^2} = \frac{1}{m}\, V'\begin{pmatrix} I_m & 0\\ 0 & 0\end{pmatrix} V = \frac{1}{m}\sum_{i=1}^m V_i^2,$$

where $P$ is the orthogonal matrix such that $P'DP = \begin{pmatrix} I_m & 0\\ 0 & 0\end{pmatrix}$ and where $V = P'\varepsilon/\sigma$. Thus the $V_i$ are i.i.d. with mean 0 and standard deviation 1. Because $V = P'\varepsilon/\sigma$,
$$V_i = \sum_{j=1}^n \frac{P_{ji}\,\varepsilon_j}{\sigma},\qquad i = 1,\dots,m.$$

The terms in the summand are independent random variables with mean 0 and variance $\sigma_j^2 = P_{ji}^2$. Since the $\varepsilon_j$ are i.i.d., the Lindeberg-Feller central limit theorem applies, so that

$$\frac{\sum_{j=1}^n P_{ji}\,\varepsilon_j/\sigma}{W_n} \to N(0,1),$$
where $W_n^2 = \sum_{j=1}^n \sigma_j^2 = \sum_{j=1}^n P_{ji}^2 = 1$ because $P$ is orthogonal. Then since each $V_i$ is standard normal, $\frac{1}{m}\sum_{i=1}^m V_i^2 \to \frac{\chi^2_m}{m}$.

The practical use of this theorem is as follows. For large samples, the distribution of the statistic satisfies
$$\frac{(r - R\hat\beta)'\left[R(X'X)^{-1}R'\right]^{-1}(r - R\hat\beta)/m}{SSE/(n-k)} \to \frac{\chi^2_m}{m}, \qquad (2.2)$$
which means that for large enough $n$

$$\frac{(r - R\hat\beta)'\left[R(X'X)^{-1}R'\right]^{-1}(r - R\hat\beta)}{SSE/(n-k)} \to \chi^2_m. \qquad (2.3)$$

Now when disturbances were normally distributed, in a sample of size $n$ the same test statistic given by the left-hand side of [2.2] was distributed as $F(m, n-k)$. Note that $\lim_{n\to\infty} F(x; m, n-k) = \chi^2_m(mx)$. For example, say that the test statistic associated with a null with $m = 3$ restrictions assumed the value 4. In a sample size of $n = 8000$, we have (approximately) $1 - F(4; 3, 8000) = .00741$. The asymptotic approximation given in [2.3] in this example yields $1 - \chi^2_3(3\times 4) = .00738$. In small samples, differences are much greater of course. For example, for the same value of the test statistic, when $n = 20$ we have $1 - F(4; 3, 20-3) = .02523$, which is certainly different from $1 - \chi^2_3(3\times 4) = .00738$ [these figures can be reproduced with the short check at the end of this section].

In summary, when the sample size is very large, the normality assumption is pretty much inconsequential in the testing of linear restrictions on the parameter vector $\beta$. In small samples, some given assumption as to the form of $F(\varepsilon)$ is generally required to compute the distribution of the estimator $\hat\beta$. Under normality, the small sample distributions of test statistics follow the $t$ or $F$ distributions, depending on the number of restrictions being tested. Testing in this environment depends critically on the normality assumption, and if the disturbances are not normally distributed, tests will be biased in general.
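The numerical comparison above can be checked directly with standard distribution functions; a quick check, assuming scipy is available:

```python
# Quick numerical check of the F versus chi-square comparison in the text.
from scipy import stats

m, stat = 3, 4.0                               # m restrictions, test statistic equal to 4
print(stats.f.sf(stat, m, 8000))               # 1 - F(4; 3, 8000)   ~ .0074
print(stats.chi2.sf(m * stat, m))              # 1 - chi2_3(3 x 4)   ~ .0074
print(stats.f.sf(stat, m, 20 - 3))             # 1 - F(4; 3, 17)     ~ .025
```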

3. Heteroskedasticity

We will now consider the behavior of the OLS estimator when the assumption of i.i.d. $(0,\sigma^2)$ error terms is relaxed. Let the vector of errors in a sample of size $n$ be given by $\varepsilon = (\varepsilon_1 \dots \varepsilon_n)'$. We continue to assume mean independence, so $E(\varepsilon \mid X) = 0$, which is an $(n \times 1)$ vector in this case. As opposed to homoskedasticity, let $E(\varepsilon\varepsilon' \mid X) = \Omega_n$, which is an $(n \times n)$ covariance matrix.

Proposition 3.1. $E(\hat\beta \mid X) = \beta$.

Proof. $E(\hat\beta \mid X) = E(\beta + (X'X)^{-1}X'\varepsilon \mid X) = \beta + (X'X)^{-1}X'E(\varepsilon \mid X) = \beta$.

This result just restates the point we have made several times before: unbiasedness is a result of the mean independence assumption and requires no restrictions on the conditional variance. Heteroskedasticity only relates to the conditional variance. What about the conditional variance of the OLS estimator? This does change as a result of heteroskedasticity, because the covariance matrix of $\varepsilon$ is involved.

Proposition 3.2. $V(\hat\beta \mid X) = (X'X)^{-1}X'\Omega X(X'X)^{-1}$.

Proof.

$$\begin{aligned}
V(\hat\beta \mid X) &= E[(\hat\beta - \beta)(\hat\beta - \beta)' \mid X]\\
&= E[(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1} \mid X]\\
&= (X'X)^{-1}X'E(\varepsilon\varepsilon' \mid X)X(X'X)^{-1}\\
&= (X'X)^{-1}X'\Omega X(X'X)^{-1}. \qquad (3.1)
\end{aligned}$$

The OLS estimator then does not have a covariance matrix proportional to $(X'X)^{-1}$ when $\Omega \neq \sigma^2 I_n$. In particular, assuming that $V(\hat\beta \mid X) = \left[\sum_{i=1}^n (y_i - X_i\hat\beta)^2/(n-k)\right](X'X)^{-1}$ will lead to incorrect inferences and biased tests. Of course, if $\Omega$ is known, then we can easily obtain the covariance matrix of the OLS estimator from [3.1]. But we first may want to investigate whether or not OLS maintains its optimality properties in this case, in the sense of being BLUE (Best Linear Unbiased Estimator). This turns out not to be the case.
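To fix ideas, here is a small sketch of the covariance calculation in [3.1] for a known, diagonal $\Omega$ (the design and the particular form of the heteroskedasticity are our own illustrative choices): the sandwich matrix $(X'X)^{-1}X'\Omega X(X'X)^{-1}$ generally differs from anything proportional to $(X'X)^{-1}$.

```python
# Illustrative sketch only: the correct OLS covariance under heteroskedasticity,
# equation [3.1], versus the homoskedastic formula sigma^2 (X'X)^{-1}.
import numpy as np

rng = np.random.default_rng(4)
n = 1000
X = np.column_stack([np.ones(n), rng.uniform(0.5, 2.0, n)])
omega_diag = 0.5 * X[:, 1] ** 2                 # conditional variances that rise with the regressor
Omega = np.diag(omega_diag)

XtX_inv = np.linalg.inv(X.T @ X)
V_sandwich = XtX_inv @ X.T @ Omega @ X @ XtX_inv        # (X'X)^{-1} X' Omega X (X'X)^{-1}
V_homoskedastic = omega_diag.mean() * XtX_inv           # sigma^2 (X'X)^{-1} using the average variance

print(np.diag(V_sandwich))                      # the correct conditional variances of beta_hat
print(np.diag(V_homoskedastic))                 # what the classical formula would report
```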
