
Asymptotic Concepts

L. Magee

January 2010

——————————–

1 Definitions of Terms Used in Asymptotic Theory

Let $a_n$ refer to a random variable that is a function of $n$ random variables. An example is a sample mean

$$a_n = \bar{x} = n^{-1}\sum_{i=1}^{n} x_i$$

Convergence in Probability

The scalar $a_n$ converges in probability to a constant $\alpha$ if, for any positive values of $\epsilon$ and $\delta$, there is a sufficiently large $n^*$ such that

$$\text{Prob}(|a_n - \alpha| > \epsilon) < \delta \quad \text{for all } n > n^*$$

$\alpha$ is called the probability limit, or plim, of $a_n$.

Consistency

If $a_n$ is an estimator of $\alpha$, and $\text{plim}\, a_n = \alpha$, then $a_n$ is a (weakly) consistent estimator of $\alpha$.

Convergence in Distribution

$a_n$ converges in distribution to a random variable $y$ ($a_n \xrightarrow{d} y$) if, as $n \to \infty$, $\text{Prob}(a_n \le b) \to \text{Prob}(y \le b)$ for all $b$. In other words, the distribution of $a_n$ becomes the same as the distribution of $y$.

Examples

Let $x_i \sim N[\mu, \sigma^2]$, $i = 1, \ldots, n$, where the $x_i$'s are mutually independently distributed. Define the three statistics:

1. $\bar{x} = n^{-1}\sum_{i=1}^{n} x_i$

2. $s^2 = (n-1)^{-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$

3. $t = \dfrac{\bar{x} - \mu}{(s^2/n)^{1/2}}$

Considering each one by one:

1. As $n \to \infty$, $\text{Var}(\bar{x}) \to 0$. This implies that $\text{plim}(\bar{x}) = \mu$, and $\bar{x}$ is a consistent estimator of $\mu$. (Uses the facts that $\text{Var}(\bar{x}) = \sigma^2/n$ and $E(\bar{x}) = \mu$.)

2. As $n \to \infty$, $\text{Var}(s^2) \to 0$, so $\text{plim}(s^2) = \sigma^2$ and $s^2$ is a consistent estimator of $\sigma^2$. (Uses the distributional result $(n-1)s^2/\sigma^2 \sim \chi^2_{n-1}$. Since $E(\chi^2_{n-1}) = n-1$ and $\text{Var}(\chi^2_{n-1}) = 2(n-1)$, then $E(s^2) = \sigma^2$ and $\text{Var}(s^2) = 2\sigma^4/(n-1)$.)

3. $t$ converges in distribution to $z$, where $z \sim N[0,1]$. (Uses the distributional result $t \sim t_{n-1}$. Since $\text{plim}(s^2) = \sigma^2$, then as $n \to \infty$, $t \xrightarrow{d} (\bar{x} - \mu)/(\sigma^2/n)^{1/2}$, which is a standardized normal random variable.)
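To illustrate, here is a minimal simulation sketch in Python (using numpy; the values of $\mu$, $\sigma$, the seed, the number of replications, and the sample sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, reps = 2.0, 3.0, 5000

for n in (10, 100, 1000):
    x = rng.normal(mu, sigma, size=(reps, n))
    xbar = x.mean(axis=1)              # reps draws of the sample mean
    s2 = x.var(axis=1, ddof=1)         # reps draws of s^2
    t = (xbar - mu) / np.sqrt(s2 / n)  # reps draws of the t statistic
    print(f"n={n:5d}  Var(xbar)={xbar.var():.4f}  "
          f"Var(s2)={s2.var():.3f}  Var(t)={t.var():.3f}")
# Var(xbar) and Var(s2) shrink toward 0 (consistency of xbar and s^2),
# while Var(t) settles near 1, matching t converging to N[0, 1].
```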

Properties

(i) if $\text{plim}(x_n) = \theta_x$, then $\text{plim}(g(x_n)) = g(\theta_x)$ for any function $g(\cdot)$ that is continuous at $\theta_x$. This is sometimes called Slutsky's theorem.

(ii) if $x_n$ converges in distribution to some random variable $x$, i.e. $x_n \xrightarrow{d} x$, then for any continuous function $g(\cdot)$, $g(x_n) \xrightarrow{d} g(x)$. That is, the distribution of $g(x_n)$ converges to the distribution of $g(x)$. (This is like property (i), but for convergence in distribution instead of convergence in probability.)

(iii) if $\text{plim}(x_n) = \theta_x$ and $\text{plim}(y_n) = \theta_y$, then $\text{plim}(x_n y_n) = \theta_x \theta_y$.

(iv) if $\text{plim}(x_n) = \theta_x$ and $y_n \xrightarrow{d} y$, then $x_n y_n \xrightarrow{d} \theta_x y$. Often $x_n$ is a matrix and $y$ is a normally distributed vector. Similarly, $y_n'(x_n)^{-1}y_n \xrightarrow{d} y'(\theta_x)^{-1}y$, which relates to the asymptotic chi-square distribution often encountered in hypothesis testing.
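Property (i) can be checked numerically. Here is a sketch with $g(x) = \exp(x)$ and $\mu = 1$ (both arbitrary choices); because the underlying data are normal, $\bar{x} \sim N[\mu, 1/n]$ exactly, so $\bar{x}$ is drawn directly:

```python
import numpy as np

rng = np.random.default_rng(5)
mu, reps = 1.0, 100000

for n in (10, 100, 1000, 10000):
    # xbar ~ N[mu, 1/n] when the underlying data are N[mu, 1]
    xbar = rng.normal(mu, 1 / np.sqrt(n), size=reps)
    gap = np.abs(np.exp(xbar) - np.exp(mu))
    print(f"n={n:6d}  P(|g(xbar) - g(mu)| > 0.1) = {np.mean(gap > 0.1):.4f}")
# This probability falls toward zero as n grows, consistent with
# plim g(xbar) = g(mu) for g continuous at mu (Slutsky's theorem).
```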

2 Order Notation

It is useful to have notation that describes the rate at which a statistic converges to zero or grows to infinity as the sample size $n$ grows. First, consider $f(n)$, some non-random function of $n$.

Definitions

(i) $f(n)$ is $O(n^d)$ ("is order $n^d$") if, as $n \to \infty$, $f(n)/n^d$ remains bounded. (If $d > 0$, then $f(n)$ grows to infinity at the same rate as $n^d$, and if $d < 0$, $f(n)$ shrinks to zero at the same rate as $n^d$.)

(ii) $f(n)$ is $o(n^d)$ ("is order smaller than $n^d$") if, as $n \to \infty$, $f(n)/n^d \to 0$.

2.1 Examples

(i) $-3$ is $O(1)$, or $O(n^0)$

(ii) $5n^3$ is $O(n^3)$, and it is $o(n^4)$ and $o(n^5)$

(iii) $3/n$ is $O(n^{-1})$ and is $o(1)$

(iv) $3/n - 2/n^{3/2}$ is $O(n^{-1})$

Example (iv) illustrates that the order of a sum depends only on the order of the highest-order term (meaning the term with the largest $d$ in $O(n^d)$), as long as the number of terms in the sum does not depend on $n$.

If the number of terms in a sum itself depends on $n$, then the order of the sum can be affected. Let $x_i$ and $x_{ij}$ be $O(n^0)$ constants. For example, $x_i$ and $x_{ij}$ do not display a trend as $i$ or $j$ increases. Then, in general, $\sum_{i=1}^{n} x_i$ is $O(n)$ and $\sum_{i=1}^{n}\sum_{j=1}^{n} x_{ij}$ is $O(n^2)$. (For an example of what happens when $x_i$ is not an $O(n^0)$ constant, consider $\sum_{i=1}^{n} x_i$ when $x_i = a + bi$. Then $\sum_{i=1}^{n} x_i = na + bn(n+1)/2$, which is $O(n^2)$ when $b \neq 0$.)

Here are some rules for operations involving order notation:

$O(n^p) + O(n^q)$ is $O(n^{\max(p,q)})$
$o(n^p) + o(n^q)$ is $o(n^{\max(p,q)})$
$O(n^p) + o(n^q)$ is $O(n^p)$ if $p \ge q$ (already mentioned in example (iv)) and is $o(n^q)$ if $p < q$
$O(n^p) \times O(n^q)$ is $O(n^{p+q})$
$O(n^p) \times o(n^q)$ is $o(n^{p+q})$
$o(n^p) \times o(n^q)$ is $o(n^{p+q})$
$(O(n^p))^{-1}$ is $O(n^{-p})$
$(o(n^p))^{-1}$ is of unknown order without more information

Combining these gives other results, such as: $O(n^p)/O(n^q)$ is $O(n^{p-q})$.
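As a numerical check of example (iv) and these rules, a short Python sketch (the grid of $n$ values is arbitrary):

```python
# f(n) = 3/n - 2/n^(3/2) is O(n^-1), so f(n)/n^-1 = f(n)*n should
# settle at a finite constant (here 3) as n grows.
for n in (10, 100, 1000, 10000, 100000):
    f = 3 / n - 2 / n ** 1.5
    print(f"n={n:6d}  f(n)*n = {f * n:.4f}")
# f(n)*n -> 3: the 3/n term fixes the O(n^-1) order, while the
# -2/n^(3/2) term is o(n^-1) and vanishes from the scaled value.
```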

2.2 Order in Probability

Order notation can be applied to random variables, using a "p" subscript, so that $O_p$ denotes "order in probability". Let $a_n$ be a random variable that is a function of $n$ random variables, as in section 1.

Definitions

(i) $a_n$ is $O_p(n^d)$ if, for every $\epsilon > 0$, there is some $K > 0$ for which, as $n \to \infty$, $\text{Prob}(|a_n/n^d| > K) < \epsilon$.

(ii) $a_n$ is $o_p(n^d)$ if $\text{plim}(a_n/n^d) = 0$.

Except for special cases that usually do not apply to econometric models, (i) is equivalent to the condition that the mean and the variance of $a_n/n^d$ stay bounded as $n \to \infty$. Therefore, if $a_n$ is $O_p(n^d)$, then: (1) $E(a_n)$ is $O(n^d)$ and (2) $\text{Var}(a_n)$ is $O(n^{2d})$. Another way to think about it is: if $E(a_n)$ is $O(n^d)$ and $\text{Var}(a_n)$ is $O(n^f)$, then $a_n$ is $O_p(n^g)$, where $g = \max(d, f/2)$.

2.3 Examples of Order in Probability

Let $x_i$, $i = 1, \ldots, n$, be independent random variables with mean $\mu$ and variance $\sigma^2$.

(i) $x_i$ is $O_p(1)$

(ii) $\sum_{i=1}^{n} x_i$ is $O_p(n^{1/2})$ if $\mu = 0$, and it is $O_p(n)$ if $\mu \neq 0$

Pn 2 2 (iii) i=1 xi is Op(n) unless µ = σ = 0

(ii) arises often in asymptotic theory. If $\mu = 0$, then $\sum_{i=1}^{n} x_i$ has a mean of 0 and variance $n\sigma^2$, implying $\sum_{i=1}^{n} x_i / n^{1/2}$ has a mean of 0 and a variance of $\sigma^2$. So if $\mu = 0$ then $\sum_{i=1}^{n} x_i$ is $O_p(n^{1/2})$. However, if $\mu \neq 0$, then the mean of $\sum_{i=1}^{n} x_i$ is $n\mu$, implying that $\sum_{i=1}^{n} x_i$ is $O_p(n)$, not $O_p(n^{1/2})$.

(iii) follows from the fact that $x_i^2$ has a finite non-trending mean, $m_1 = \mu^2 + \sigma^2$, and finite variance, $m_2$. Then $\sum_{i=1}^{n} x_i^2 / n$ has a finite mean $m_1$ and variance $m_2/n$.
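Example (ii) can be seen in a short numpy sketch (unit variance, the seed, and the sample sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
reps = 5000

for mu in (0.0, 0.5):
    print(f"mu = {mu}:")
    for n in (100, 400, 1600):
        s = rng.normal(mu, 1.0, size=(reps, n)).sum(axis=1)
        print(f"  n={n:5d}  mean(sum/n^0.5)={np.mean(s / np.sqrt(n)):7.2f}"
              f"  sd(sum/n^0.5)={np.std(s / np.sqrt(n)):.3f}")
# With mu = 0, sum/n^{1/2} keeps mean 0 and a stable spread, so the
# sum is O_p(n^{1/2}). With mu = 0.5, mean(sum/n^{1/2}) = 0.5*n^{1/2}
# drifts upward; only sum/n settles down, so the sum is O_p(n).
```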

2.4 Asymptotic Expansions

Many consistent estimators are "root-$n$ consistent", meaning that the sampling error is $O_p(n^{-1/2})$, as in $\hat{\theta} - \theta = O_p(n^{-1/2})$. Asymptotic expansions simplify the analysis of the distribution of $\hat{\theta}$ by ignoring the part of $\hat{\theta} - \theta$ that is $o_p(n^{-1/2})$. This involves decomposing $\hat{\theta}$ as

$$\hat{\theta} = \theta + \xi_{-1/2} + o_p(n^{-1/2}) \tag{1}$$

The right-hand side of (1) contains three terms of declining importance as $n \to \infty$. The first term is $O(1)$, and must equal the true parameter value if $\hat{\theta}$ is consistent. The second term, $\xi_{-1/2}$, is $O_p(n^{-1/2})$.

It usually has mean zero, and often is simple enough to enable the derivation of $E(\xi_{-1/2}^2)$, which is then used for estimating $\text{Var}(\hat{\theta})$. The third term is the remainder. It is left out in most asymptotic approximations. We hope it is not very big compared to the first two terms. Its importance often can be examined most easily by simulations.
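As a concrete illustration of (1), consider the made-up estimator $\hat{\theta} = \bar{x}^2$ of $\theta = \mu^2$ (an assumption for this sketch, not an example from the notes). Expanding the square gives the exact decomposition $\bar{x}^2 = \mu^2 + 2\mu(\bar{x}-\mu) + (\bar{x}-\mu)^2$, so $\xi_{-1/2} = 2\mu(\bar{x}-\mu)$ and the remainder is $(\bar{x}-\mu)^2$, whose size can be checked by simulation:

```python
import numpy as np

# xbar^2 = mu^2 + 2*mu*(xbar - mu) + (xbar - mu)^2, so the middle
# term is O_p(n^{-1/2}) and the remainder is O_p(n^{-1}) = o_p(n^{-1/2}).
rng = np.random.default_rng(2)
mu, sigma, reps = 1.0, 1.0, 5000

for n in (25, 100, 400, 1600):
    xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
    xi = 2 * mu * (xbar - mu)   # the O_p(n^{-1/2}) term
    rem = (xbar - mu) ** 2      # the remainder
    print(f"n={n:5d}  sd(xi)*n^0.5={np.std(xi) * np.sqrt(n):.3f}"
          f"  sd(rem)*n^0.5={np.std(rem) * np.sqrt(n):.4f}")
# sd(xi)*n^{1/2} stays near 2*mu*sigma = 2, while sd(rem)*n^{1/2}
# shrinks toward 0: the remainder is negligible at the O_p(n^{-1/2})
# scale, which is what the asymptotic approximation relies on.
```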

3 Application to OLS with Heteroskedastic Errors

Assume the true model is $y_i = x_i'\beta + u_i$, where $E x_i u_i = 0$, $\text{Var}(u_i \mid x_i) = \sigma_i^2$, and the $u_i$'s are independent. Consider the OLS estimator of $\beta$, $b = (\sum x_i x_i')^{-1}\sum x_i y_i$. Unless indicated otherwise, the summations run over $i$ from 1 to $n$, where $n$ is the number of observations. Assume that the $x_i$'s are random, as in survey data, where the randomness in both $x_i$ and $y_i$ derives from the random survey sampling.

Aside on notation: $x_i$ is a $k \times 1$ vector of observations on the RHS variables. $x_i'$ is the $i$th row of the usual $n \times k$ matrix $X$, and $y_i$ is the $i$th element of the usual $n \times 1$ vector $y$. So in this vector notation, a matrix product such as $X'X$ is written as $\sum x_i x_i'$.

Substituting out $y_i$ gives

$$b = \beta + \left(\sum x_i x_i'\right)^{-1}\sum x_i u_i \tag{2}$$

Relating this to (1), $\beta$ is $O(1)$, and there is no remainder term. To find the order of the second RHS term, consider its two parts separately.

P 0 −1 P 0 xixi is a k × k matrix. Assume that n xixi converges to some finite positive definite matrix −1 P 0 Σxx. Then n xixi = Σxx + op(1). P xiui, might appear to be Op(n) since it is the sum of n terms that are Op(1). But since we P P  P P 0 have assumed that Exiui = 0, then E( xiui) = 0 and Var( xiui) = E ( i xiui)( j xjuj) = P 0 2 0 E( xixiui ) = O(n). (This result uses independence of the xiui’s to get E(xiui)(xjuj) = 0 for all P −1/2 P i 6= j.) Therefore we only need to multiply xiui by n to give it an O(1) variance, so xiui 1/2 is Op(n ). This is like example (ii) in section 2.3 when µ = 0. xi in that example is replaced here by xiui.

The second term on the RHS of (2), then, is the product of an $O_p(n^{-1})$ term and an $O_p(n^{1/2})$ term. It follows that this second term is $O_p(n^{-1/2})$ if $E x_i u_i = 0$. In this case, $b$ is consistent. If it had been the case that $E x_i u_i \neq 0$, then this term would have been $O_p(1)$, and $b$ would not have been consistent.

3.1 Variance of b

Since $E x_i u_i = 0$, then $E b = \beta$, and from (2) the variance of $b$ is the variance of $(\sum x_i x_i')^{-1}\sum x_i u_i$. It is convenient to multiply each term by the appropriate power of $n$ so that we can work with $O(1)$ and $O_p(1)$ terms.

$$b - \beta = n^{-1/2}\left(n^{-1}\sum x_i x_i'\right)^{-1}\left(n^{-1/2}\sum x_i u_i\right) \tag{3}$$

and

$$\text{Var}(b) = E(b-\beta)(b-\beta)' = n^{-1} E\left[\left(n^{-1}\sum x_i x_i'\right)^{-1}\left(n^{-1/2}\sum x_i u_i\right)\left(n^{-1/2}\sum x_i u_i\right)'\left(n^{-1}\sum x_i x_i'\right)^{-1}\right] \tag{4}$$

Substituting $n^{-1}\sum x_i x_i' = \Sigma_{xx} + o_p(1)$, then

$$\text{Var}(b) = n^{-1} E\left[(\Sigma_{xx} + o_p(1))^{-1}\left(n^{-1/2}\sum x_i u_i\right)\left(n^{-1/2}\sum x_i u_i\right)'(\Sigma_{xx} + o_p(1))^{-1}\right]$$
$$= n^{-1} E\left[\Sigma_{xx}^{-1}\left(n^{-1/2}\sum x_i u_i\right)\left(n^{-1/2}\sum x_i u_i\right)'\Sigma_{xx}^{-1}\right] + o(n^{-1})$$
$$= n^{-1}\Sigma_{xx}^{-1}\left(n^{-1}\sum_{i=1}^{n}\sum_{j=1}^{n} E\, x_i u_i (x_j u_j)'\right)\Sigma_{xx}^{-1} + o(n^{-1}) \tag{5}$$

The assumption that the $x_i u_i$'s are not correlated allows us to set the expected values of the cross-product terms $(x_i u_i)(x_j u_j)'$ in (5) equal to zero. Then (5) can be written as

$$\text{Var}(b) = n^{-1}\Sigma_{xx}^{-1}\left(n^{-1}\sum E\, x_i x_i' u_i^2\right)\Sigma_{xx}^{-1} + o(n^{-1}) = n^{-1}A^{-1}BA^{-1} + o(n^{-1}) \tag{6}$$

where $A = \Sigma_{xx} = E\, x_i x_i'$ and $B = n^{-1}\sum E\, x_i x_i' u_i^2 = E\, x_i x_i' u_i^2$.

A consistent estimator of $\text{Var}(b)$ is formed by replacing $A$ and $B$ by consistent estimators $\hat{A}$ and $\hat{B}$. Common choices are the sample means $\hat{A} = n^{-1}\sum x_i x_i'$ and $\hat{B} = n^{-1}\sum x_i x_i' e_i^2$, where $e_i = y_i - x_i' b = y_i - x_i'\beta + o_p(1) = u_i + o_p(1)$. Since $\hat{A} = A + o_p(1)$ and $\hat{B} = B + o_p(1)$, then

$$\begin{aligned}
\text{Var}(b) &= n^{-1}\hat{A}^{-1}\hat{B}\hat{A}^{-1} + o_p(n^{-1}) \\
&= n^{-1}\left(n^{-1}\sum x_i x_i'\right)^{-1}\left(n^{-1}\sum x_i x_i' e_i^2\right)\left(n^{-1}\sum x_i x_i'\right)^{-1} + o_p(n^{-1}) \\
&= \left(\sum x_i x_i'\right)^{-1}\left(\sum x_i x_i' e_i^2\right)\left(\sum x_i x_i'\right)^{-1} + o_p(n^{-1}) \\
&= \hat{V}(b) + o_p(n^{-1})
\end{aligned}$$

This $\hat{V}(b) = (\sum x_i x_i')^{-1}(\sum x_i x_i' e_i^2)(\sum x_i x_i')^{-1}$ is White's (1980) heteroskedasticity-consistent variance-covariance matrix. It is used in the robust option in Stata, and is referred to as the HCCME in Davidson and MacKinnon (1993, p. 552) and elsewhere.
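In code, $\hat{V}(b)$ is a few lines of numpy. This is a sketch, not Stata's implementation; the simulated design, with $\sigma_i$ proportional to the regressor, is an arbitrary way to generate heteroskedastic data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 500, 2

# Simulated heteroskedastic data: Var(u_i | x_i) grows with x_i.
X = np.column_stack([np.ones(n), rng.uniform(1, 5, n)])
beta = np.array([1.0, 0.5])
u = rng.normal(0, 0.3 * X[:, 1])       # sigma_i proportional to x_i
y = X @ beta + u

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                  # OLS estimate
e = y - X @ b                          # OLS residuals

# White's HCCME: (sum x x')^{-1} (sum x x' e^2) (sum x x')^{-1}
meat = X.T @ (X * e[:, None] ** 2)     # sum of x_i x_i' e_i^2
V_white = XtX_inv @ meat @ XtX_inv

# Usual OLS estimator s^2 (X'X)^{-1}, for comparison (section 3.2)
s2 = e @ e / (n - k)
V_ols = s2 * XtX_inv

print("robust s.e.:", np.sqrt(np.diag(V_white)))
print("OLS s.e.   :", np.sqrt(np.diag(V_ols)))
```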

3.2 Variance of b when errors are homoskedastic

If the errors are homoskedastic ($E u_i^2 = \sigma^2$ for all $i$), then $u_i^2$ is unrelated to the elements of $x_i x_i'$, which allows $E(x_i x_i' u_i^2)$ to be split into two separate multiplicative expectation terms: $E(x_i x_i' u_i^2) = E(u_i^2)E(x_i x_i') = \sigma^2\Sigma_{xx}$. (The general property is that if random variables $a$ and $b$ are independently distributed, then $E(ab) = E(a)E(b)$.) The asymptotic variance (6) simplifies as

$$\begin{aligned}
\text{Var}(b) &= n^{-1}\Sigma_{xx}^{-1}\left(n^{-1}\sum E\, x_i x_i' u_i^2\right)\Sigma_{xx}^{-1} + o(n^{-1}) \\
&= n^{-1}\Sigma_{xx}^{-1}(\sigma^2\Sigma_{xx})\Sigma_{xx}^{-1} + o(n^{-1}) \\
&= n^{-1}\sigma^2\Sigma_{xx}^{-1} + o(n^{-1})
\end{aligned}$$

This can be consistently estimated by the usual OLS variance-covariance estimator, since

$$s^2 \equiv \frac{e'e}{n-k} = \sigma^2 + o_p(1) \qquad \text{and} \qquad n^{-1}\sum_i x_i x_i' = \Sigma_{xx} + o_p(1)$$

so that

$$\left(n^{-1}\sum_i x_i x_i'\right)^{-1} = \Sigma_{xx}^{-1} + o_p(1) \qquad \text{or} \qquad \left(\sum_i x_i x_i'\right)^{-1} = n^{-1}\Sigma_{xx}^{-1} + o_p(n^{-1})$$

and therefore

$$\begin{aligned}
s^2\left(\sum_i x_i x_i'\right)^{-1} &= \left(\sigma^2 + o_p(1)\right)\left(n^{-1}\Sigma_{xx}^{-1} + o_p(n^{-1})\right) \\
&= n^{-1}\sigma^2\Sigma_{xx}^{-1} + o_p(n^{-1}) = \text{Var}(b) + o_p(n^{-1})
\end{aligned}$$

4 Central Limit Theorem (CLT)

To know the asymptotic distribution of test statistics and compute asymptotic confidence intervals, we require the asymptotic distribution of b, not just a variance estimator. Fortunately, under a broad range of assumptions, including the ones here, the asymptotic distribution is normal. The main result that leads to this is the central limit theorem, originating with Laplace in 1810. It has many versions depending on the assumptions made. One version is:

If $n$ random vectors $a_i$ are independently and identically distributed with mean $\mu$ and variance $\Sigma$, then the distribution of $n^{1/2}(n^{-1}\sum_i a_i - \mu)$ converges to $N[0, \Sigma]$.

In practice, we can approximate the distribution of the sample mean of the $a_i$ vectors, $n^{-1}\sum_i a_i$, as $N[\mu, n^{-1}\Sigma]$. Since this result is asymptotic, it also is valid to approximate the distribution as $N[\mu, \hat{V}]$, where $n\hat{V}$ is a consistent estimator of $\Sigma$.
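A simulation sketch of the CLT (exponential draws are an arbitrary non-normal choice; with scale 1 they have mean 1 and variance 1):

```python
import numpy as np

rng = np.random.default_rng(4)
mu, reps = 1.0, 50000   # exponential(1): mean 1, variance 1

for n in (2, 10, 100):
    a = rng.exponential(mu, size=(reps, n))
    z = np.sqrt(n) * (a.mean(axis=1) - mu)   # n^{1/2}(abar - mu)
    print(f"n={n:4d}  skew={np.mean(z**3):+.3f}  P(z>2)={np.mean(z > 2):.4f}")
# As n grows, the skewness of z shrinks toward 0 and P(z > 2)
# approaches the N[0, 1] tail probability of about 0.0228.
```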

4.1 OLS with heteroskedastic errors

In the above regression example,

$$n^{1/2}(b - \beta) = \left(n^{-1}\sum_i x_i x_i'\right)^{-1}\left(n^{-1/2}\sum_i x_i u_i\right)$$

The first RHS term has $\text{plim}(n^{-1}\sum_i x_i x_i') = \Sigma_{xx}$. The CLT applies to the second RHS term. To match it up with the way the CLT was presented above, express the second term as $n^{1/2}(n^{-1}\sum_i x_i u_i - 0)$. Let $\text{Var}(x_i u_i) = \Sigma_{xu}$. Then, also using $E x_i u_i = 0$, the CLT implies that $n^{-1/2}\sum_i x_i u_i$ converges in distribution to $N[0, \Sigma_{xu}]$. Then $n^{1/2}(b-\beta)$ converges in distribution to the distribution of $\Sigma_{xx}^{-1}$ times $(n^{-1/2}\sum_i x_i u_i)$, which, using standard results for linear functions of normal random vectors, is $N[0, \Sigma_{xx}^{-1}\Sigma_{xu}\Sigma_{xx}^{-1}]$. The variance, $\Sigma_{xx}^{-1}\Sigma_{xu}\Sigma_{xx}^{-1}$, is what is being estimated by $n\hat{V}(b)$, where $\hat{V}(b)$ is White's HCCME from section 3.1.
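A final sketch (reusing the arbitrary heteroskedastic design from the section 3.1 code) checks that the simulated variance of $n^{1/2}(b-\beta)$ matches what $n\hat{V}(b)$ estimates:

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps, beta = 200, 5000, np.array([1.0, 0.5])
z = np.empty(reps)    # draws of n^{1/2}(b_1 - beta_1), the slope
nv = np.empty(reps)   # draws of n * V_hat(b)[1, 1]

for r in range(reps):
    X = np.column_stack([np.ones(n), rng.uniform(1, 5, n)])
    u = rng.normal(0, 0.3 * X[:, 1])
    y = X @ beta + u
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    V = XtX_inv @ (X.T @ (X * e[:, None] ** 2)) @ XtX_inv
    z[r] = np.sqrt(n) * (b[1] - beta[1])
    nv[r] = n * V[1, 1]

print("simulated Var of n^{1/2}(b1 - beta1):", round(z.var(), 4))
print("mean of n * V_hat(b)[1, 1]          :", round(nv.mean(), 4))
# The two values should be close, and a histogram of z is
# approximately normal, as the CLT argument above implies.
```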

5 References

Davidson, R. and J.G. MacKinnon (1993), Estimation and Inference in Econometrics, Oxford University Press, Oxford.

White, H. (1980), “A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity,” Econometrica, 48, 817-838.
