Statistics 910, #11

Asymptotic Distributions in Time Series

Overview

Standard proofs that establish the asymptotic normality of estimators constructed from random samples (i.e., independent observations) no longer apply in time series analysis. The usual version of the central limit theorem (CLT) presumes independence of the summed components, and that's not the case with time series. This lecture shows that normality still rules for asymptotic distributions, but the arguments have to be modified to allow for correlated data.

1. Types of convergence

2. Distributions in regression (Th A.2, section B.1)

3. Central limit theorems (Th A.3, Th A.4)

4. Normality of covariances, correlations (Th A.6, Th A.7)

5. Normality of parameter estimates in ARMA models (Th B.4)

Types of convergence

Convergence  The relevant types of convergence most often seen in statistics and probability are

• Almost surely, almost everywhere, with probability one, w.p. 1:

$X_n \xrightarrow{a.s.} X$ :  $P\{\omega : \lim_{n\to\infty} X_n = X\} = 1$.

• In probability, in measure:

$X_n \xrightarrow{p} X$ :  $\lim_{n\to\infty} P\{\omega : |X_n - X| > \epsilon\} = 0$ for every $\epsilon > 0$.

• In distribution, weak convergence, convergence in law:

$X_n \xrightarrow{d} X$ :  $\lim_{n\to\infty} P(X_n \le x) = P(X \le x)$ at continuity points $x$ of the distribution of $X$.

• In mean square or $\ell_2$, the mean squared difference goes to zero:

$X_n \xrightarrow{m.s.} X$ :  $\lim_{n\to\infty} E\,(X_n - X)^2 = 0$.

Connections  Convergence almost surely (which is much like good old-fashioned convergence of a sequence) implies convergence in probability, which in turn implies convergence in distribution:

$\xrightarrow{a.s.} \;\Rightarrow\; \xrightarrow{p} \;\Rightarrow\; \xrightarrow{d}$

Convergence in distribution only implies convergence in probability if the limit distribution is a point mass (i.e., the r.v. converges to a constant). The various types of convergence "commute" with sums, products, and smooth functions. Mean square convergence is a bit different from the others; it implies convergence in probability,

$\xrightarrow{m.s.} \;\Rightarrow\; \xrightarrow{p}$,

which holds by Chebyshev's inequality.
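In detail, Chebyshev's inequality applied to the difference gives, for any $\epsilon > 0$,
$$P(|X_n - X| > \epsilon) \;\le\; \frac{E\,(X_n - X)^2}{\epsilon^2} \;\to\; 0 .$$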

Asymptotically normal (Defn A.5)

$$\frac{X_n - \mu_n}{\sigma_n} \xrightarrow{d} N(0, 1) \quad\Longleftrightarrow\quad X_n \sim AN(\mu_n, \sigma_n^2) .$$
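For the familiar i.i.d. case with mean $\mu$ and finite variance $\sigma^2$, the ordinary CLT says exactly that $\bar X_n \sim AN(\mu, \sigma^2/n)$.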

Distributions in regression

Not i.i.d.  The asymptotic normality of the slope estimates in regression is not so obvious if the errors are not normal. Normality requires that we can handle sums of independent, but not identically distributed r.v.s.

Scalar  This example and those that follow treat only scalar estimators to avoid matrix manipulations that conceal the underlying ideas. The text illustrates the use of the "Cramér-Wold device" for handling vector-valued estimators.

Model In scalar form, we observe a sample of independent observations

that follow $Y_i = \alpha + \beta x_i + \epsilon_i$. Here $Y$ and $\epsilon$ denote random variables, the $x_i$ are fixed, and the deviations $\epsilon_i$ have mean 0 and variance $\sigma^2$.

Least squares estimators are $\hat\alpha = \bar Y - \hat\beta\,\bar x$ and
$$\hat\beta = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sum_i (x_i - \bar x)^2} = \frac{\mathrm{cov}_n(x,y)}{\mathrm{var}_n(x)} .$$
For future reference, write $SSx_n = \sum_{i=1}^n (x_i - \bar x)^2$.

Distribution of $\hat\beta$  We can avoid centering $y$ in the numerator and write
$$\hat\beta = \frac{\sum_i (x_i - \bar x)\, y_i}{SSx_n} = \frac{\sum_i (x_i - \bar x)(x_i \beta + \epsilon_i)}{SSx_n} = \beta + \sum_i w_{ni}\,\epsilon_i \qquad (1)$$

where the key weights depend on n and have the form

$$w_{ni} = \frac{x_i - \bar x}{SSx_n} . \qquad (2)$$

Hence, we have written $\hat\beta - \beta$ as a weighted sum of random variables, and it follows that
$$E\,\hat\beta = \beta, \qquad Var(\hat\beta) = \frac{\sigma^2}{\sum_i (x_i - \bar x)^2} .$$
To get the limiting distribution, we need a version of the CLT that allows for unequal variances (Lindeberg CLT) and weights that change with n.
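To spell out the calculation (a step left implicit above), use (1) and (2) together with the fact that the $\epsilon_i$ are uncorrelated with mean 0 and variance $\sigma^2$:
$$E\,\hat\beta = \beta + \sum_i w_{ni}\, E\,\epsilon_i = \beta, \qquad
Var(\hat\beta) = \sigma^2 \sum_i w_{ni}^2 = \sigma^2\, \frac{\sum_i (x_i - \bar x)^2}{SSx_n^2} = \frac{\sigma^2}{SSx_n} .$$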

Bounded leverage  Asymptotic normality of $\hat\beta$ requires, in essence, that all of the observations have roughly equal impact on the response. In order that one point not dominate $\hat\beta$ we have to require that
$$\max_i \frac{w_{ni}}{SSx_n} \to 0 ,$$
a requirement made precise in (5) below.

Lindeberg CLT  This theorem allows a triangular array of random variables. Think of the sequences as a lower triangular array, $\{X_{11}\}$, $\{X_{21}, X_{22}\}$, $\{X_{31}, X_{32}, X_{33}\}, \ldots$, with mean zero and variances $Var(X_{ni}) = \sigma_{ni}^2$. Now let $T_n = \sum_{i=1}^n X_{ni}$ with variance $Var(T_n) = s_n^2 = \sum_{i=1}^n \sigma_{ni}^2$. Hence,
$$E\,\frac{T_n}{s_n} = 0, \qquad Var\,\frac{T_n}{s_n} = 1 .$$

If the so-called Lindeberg condition holds ($I(\text{set})$ is the indicator function of the set),
$$\forall\, \delta > 0, \quad \frac{1}{s_n^2} \sum_{i=1}^n E\, X_{ni}^2\, I(\delta s_n < |X_{ni}|) \to 0 \quad \text{as } n \to \infty, \qquad (3)$$
then
$$\frac{T_n}{s_n} \xrightarrow{d} N(0,1), \qquad T_n \sim AN(0, s_n^2) .$$

Regression application  In this case, define the random variables $X_{ni} = w_{ni}\epsilon_i$ so that $\hat\beta = \beta + \sum_i X_{ni}$, where the model error $\epsilon_i$ has mean 0 and variance $\sigma^2$ and (as in (2))
$$w_{ni} = \frac{x_i - \bar x_n}{\sum_i (x_i - \bar x_n)^2}, \qquad s_n^2 = \sigma^2 \sum_i w_{ni}^2 . \qquad (4)$$
Asymptotic normality requires that no one observation have too much effect on the slope, which we specify by the condition

$$\frac{\max_i w_{ni}^2}{\sum_{i=1}^n w_{ni}^2} \to 0 \quad \text{as } n \to \infty . \qquad (5)$$
To see that the condition (5) implies the Lindeberg condition, let's keep things a bit simpler by assuming that the $\epsilon_i$ share a common distribution, differing only in scale.

Now define the largest weight $W_n = \max_i |w_{ni}|$ and observe that
$$\frac{1}{s_n^2} \sum_i w_{ni}^2\, E\,\epsilon_i^2\, I(\delta s_n < |w_{ni}\epsilon_i|) \;\le\; \frac{1}{s_n^2} \sum_i w_{ni}^2\, E\,\epsilon_i^2\, I\Big(\frac{\delta s_n}{W_n} < |\epsilon_i|\Big) . \qquad (6)$$
The condition (5) implies that $s_n/W_n \to \infty$. Since $Var\,\epsilon_i = \sigma^2$ is finite, the summands share a common bound (here's where we use the common distribution of the $\epsilon_i$),
$$E\,\epsilon_i^2\, I\Big(\frac{\delta s_n}{W_n} < |\epsilon_i|\Big) = E\,\epsilon_1^2\, I\Big(\frac{\delta s_n}{W_n} < |\epsilon_1|\Big) := \eta \qquad \forall i,$$
with $\eta \to 0$ as $n \to \infty$. Then, since $\sum_i w_{ni}^2 / s_n^2 = 1/\sigma^2$, the sum (6) is bounded by $\eta/\sigma^2 \to 0$.
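A quick numerical check of this result (an added illustration, not part of the notes): the minimal numpy sketch below simulates the regression with heavily skewed, centered exponential errors and a fixed, well-spread design, then compares the standardized slope estimates with the $N(0,1)$ limit.

```python
import numpy as np

rng = np.random.default_rng(0)

def standardized_slopes(n, n_rep=5000, beta=2.0, sigma=1.0):
    """Simulate beta_hat with fixed x's and skewed errors; return standardized values."""
    x = np.linspace(0.0, 1.0, n)          # fixed design with bounded leverage
    xc = x - x.mean()
    ssx = np.sum(xc ** 2)
    w = xc / ssx                          # the weights w_{ni} of (2)
    # centered exponential errors: mean 0, variance sigma^2, far from normal
    eps = sigma * (rng.exponential(size=(n_rep, n)) - 1.0)
    beta_hat = beta + eps @ w             # eq. (1): beta + sum_i w_{ni} eps_i
    sd = sigma * np.sqrt(np.sum(w ** 2))  # sqrt of Var(beta_hat)
    return (beta_hat - beta) / sd

z = standardized_slopes(n=200)
print(z.mean(), z.var())                  # approximately 0 and 1
print(np.mean(np.abs(z) < 1.96))          # approximately 0.95 if N(0,1) fits
```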

Central limit theorems for dependent processes

M-dependent process A sequence of random variables {Xt} (stochastic process) is said to be M-dependent if |t − s| > M implies that Xt is independent of Xs,

$\{X_t : t \le T\}$ is independent of $\{X_s : s > T + M\}$ . \qquad (7)

Variables with time indices t and s that are more than M apart, |t − s| > M, are independent.
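A concrete instance (an added example): a moving average of order $M$ is M-dependent, because observations more than $M$ apart share no white-noise terms. A minimal numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

def ma_process(n, theta, sigma=1.0):
    """MA(M) process X_t = w_t + theta_1 w_{t-1} + ... + theta_M w_{t-M};
    it is M-dependent since X_t and X_s share no w's when |t - s| > M."""
    M = len(theta)
    w = sigma * rng.standard_normal(n + M)
    coeffs = np.r_[1.0, np.asarray(theta, dtype=float)]
    return np.convolve(w, coeffs, mode="full")[M:M + n]

x = ma_process(50_000, theta=[0.6, -0.3])      # M = 2
for h in range(1, 5):                          # sample correlations die out after lag 2
    print(h, round(np.corrcoef(x[:-h], x[h:])[0, 1], 3))
```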

Basic theorem (A.2) This is the favorite method of proof in S&S and other common textbooks. An up-over-down argument. The idea is to

construct an approximating r.v. $Y_{mn}$ that is similar to $X_n$ (enforced by condition 3 below), but simpler to analyze in the sense that its asymptotic distribution, which is controlled by the "tuning parameter" $m$, is relatively easy to obtain.

Theorem A.2  If

(1) $Y_{mn} \xrightarrow{d} Y_m$ as $n \to \infty$, for every fixed $m$;
(2) $Y_m \xrightarrow{d} Y$ as $m \to \infty$;
(3) $E\,(X_n - Y_{mn})^2 \to 0$ as $m, n \to \infty$;

then $X_n \xrightarrow{d} Y$.

CLT for M-dependence (A.4)  Suppose $\{X_t\}$ is M-dependent with covariances $\gamma_j$. The variance of the mean of n observations is then
$$Var\big(\sqrt{n}\,\bar X_n\big) = n\, Var\,\bar X_n = \sum_{h=-M}^{M} \frac{n - |h|}{n}\,\gamma_h \;\to\; \sum_{h=-M}^{M} \gamma_h := V_M . \qquad (8)$$
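For example (a worked case added for concreteness), for the MA(1) process $X_t = w_t + \theta w_{t-1}$ we have $\gamma_0 = \sigma^2(1 + \theta^2)$ and $\gamma_{\pm 1} = \sigma^2\theta$, so
$$V_1 = \gamma_{-1} + \gamma_0 + \gamma_1 = \sigma^2 (1 + \theta)^2 .$$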

Theorem A.4 If {Xt} is an M-dependent stationary process with mean µ and VM > 0 then

$$\sqrt{n}\,(\bar X_n - \mu) \xrightarrow{d} N(0, V_M) \qquad \big(\bar X_n \sim AN(\mu, V_M/n)\big)$$

Proof  Assume w.l.o.g. that $\mu = 0$ (recall our prior discussion in Lecture 4 that we can replace the sample mean by $\mu$) so that the statistic of interest is $\sqrt{n}\,\bar X_n$. Define the similar statistic $Y_{mn}$ by forming the data into blocks of length $m$ and tossing out enough observations that the blocks are independent of each other. Choose $m > 2M$ and let $r = \lfloor n/m \rfloor$ (greatest integer not exceeding $n/m$) count the blocks. Now arrange the data to fill a table with $r$ rows and $m$ columns (pretend for the moment that $n = r\,m$). The approximation $Y_{mn}$ sums the first $m - M$ columns of this table,

$$Y_{mn} = \frac{1}{\sqrt{n}} \big[ (X_1 + X_2 + \cdots + X_{m-M})
 + (X_{m+1} + X_{m+2} + \cdots + X_{2m-M})
 + (X_{2m+1} + X_{2m+2} + \cdots + X_{3m-M})
 + \cdots
 + (X_{(r-1)m+1} + X_{(r-1)m+2} + \cdots + X_{rm-M}) \big] \qquad (9)$$

and omits the intervening blocks,

$$U_{mn} = \frac{1}{\sqrt{n}} \big[ (X_{m-M+1} + \cdots + X_m)
 + (X_{2m-M+1} + \cdots + X_{2m})
 + (X_{3m-M+1} + \cdots + X_{3m})
 + \cdots
 + (X_{rm-M+1} + \cdots + X_n) \big] \qquad (10)$$

The idea is that we will get the shape of the distribution from the variation among the blocks.

Label the row sums that define Ymn in (9) as Zi so that we can write

$$Y_{mn} = \frac{1}{\sqrt{n}}\,(Z_1 + Z_2 + \cdots + Z_r) .$$

Clearly, because the $X_t$ are centered, $E\,Z_i = 0$ and
$$Var\,Z_i = \sum_h (m - M - |h|)\,\gamma_h := S_{m-M} .$$
Now let's fill in the 3 requirements of Theorem A.2:

(1)
$$Y_{mn} = \frac{1}{\sqrt{n}} \sum_{i=1}^{r} Z_i = \underbrace{(n/r)^{-1/2}}_{\to\, m^{-1/2}} \;\; \underbrace{r^{-1/2} \sum_i Z_i}_{\xrightarrow{d}\, N}$$

Now observe that n/r → m, and since the Zi are independent by construction, the usual CLT shows that

$$r^{-1/2} \sum_i Z_i \xrightarrow{d} N(0, S_{m-M}) .$$

Hence, we have $Y_{mn} \xrightarrow{d} Y_m := N(0, S_{m-M}/m)$ as $n \to \infty$.

(2) Now let $m \to \infty$. As the number of columns in the table (9) increases, the variance expression $S_{m-M}/m$ approaches the limiting variance $V_M$, so $Y_m \xrightarrow{d} N(0, V_M)$.

(3) Last, we have to check that we did not make too rough an approximation when forming $Y_{mn}$. The approximation omits the last $M$ columns (10), so the error is
$$\sqrt{n}\,\bar X_n - Y_{mn} = \frac{1}{\sqrt{n}}\,(U_1 + \cdots + U_r)$$

where $U_i = X_{im-M+1} + \cdots + X_{im}$ and $Var(U_i) = S_M$. Thus,

$$E\big(\sqrt{n}\,\bar X_n - Y_{mn}\big)^2 = \frac{r}{n}\,S_M \to \frac{1}{m}\,S_M$$

as $n \to \infty$, and $S_M/m \to 0$ as $m \to \infty$.
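A small simulation (an added check, not from the text) makes the theorem concrete: an MA(2) process is 2-dependent, so the variance of $\sqrt{n}\,\bar X_n$ should settle near $V_M = \sum_{|h| \le M} \gamma_h$ and the standardized mean should look Gaussian.

```python
import numpy as np

rng = np.random.default_rng(2)

def sqrtn_means(n, theta, sigma=1.0, n_rep=4000):
    """Return sqrt(n) * sample means of n_rep simulated MA(M) series (M = len(theta))."""
    M = len(theta)
    coeffs = np.r_[1.0, np.asarray(theta, dtype=float)]
    w = sigma * rng.standard_normal((n_rep, n + M))
    # X_t = sum_j psi_j w_{t-j}; build by summing shifted slices of the noise
    x = sum(c * w[:, M - j: M - j + n] for j, c in enumerate(coeffs))
    return np.sqrt(n) * x.mean(axis=1)

theta, sigma = [0.6, -0.3], 1.0
V_M = sigma ** 2 * (1.0 + sum(theta)) ** 2       # sum of the 2M+1 nonzero gamma_h
z = sqrtn_means(n=1000, theta=theta, sigma=sigma)
print(z.var(), V_M)                              # empirical variance near V_M = 1.69
print(np.mean(np.abs(z) < 1.96 * np.sqrt(V_M)))  # approximately 0.95
```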

CLT for stationary processes, Th A.5  This result covers linear processes of the form $X_t - \mu = \sum_j \psi_j w_{t-j}$, where the white noise terms are independent with mean 0 and variance $\sigma^2$. The coefficients, as usual, are absolutely summable, $\sum_j |\psi_j| < \infty$. Before stating the theorem, note that the variance of the mean of such a linear process has a particularly convenient form. Following our familiar manipulations of these sums (and Fatou's lemma!), we define $V$ as the limiting variance of the mean as $n \to \infty$,
$$n\,Var(\bar X_n) = \sum_{h=-n+1}^{n-1} \frac{n - |h|}{n}\,\gamma_h \;\to\; \sum_{h=-\infty}^{\infty} \gamma_h = \sigma^2 \sum_{h=-\infty}^{\infty} \sum_{j=0}^{\infty} \psi_j\,\psi_{j+h} \qquad (\psi_j = 0,\ j < 0)$$
$$= \sigma^2 \Big(\sum_j \psi_j\Big)^2 := V \qquad (11)$$
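For example (an added worked case), the AR(1) process with $X_t - \mu = \sum_{j \ge 0} \phi^j w_{t-j}$, $|\phi| < 1$, has
$$V = \sigma^2 \Big(\sum_{j=0}^{\infty} \phi^j\Big)^2 = \frac{\sigma^2}{(1 - \phi)^2} .$$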

Theorem A.5  If $\{X_t\}$ is a linear process of this form with $\sum_j \psi_j \neq 0$, then
$$\bar X_n \sim AN(\mu,\; V/n) .$$

Proof.  Again use the basic three-part Theorem A.2. This time, form the key approximation to $\sqrt{n}\,(\bar X_n - \mu)$ from the M-dependent process matched up to resemble $X_t$,
$$X_t^M = \mu + \sum_{j=0}^{M} \psi_j\, w_{t-j} .$$

(1) Form the approximation based on averaging the M-dependent process rather than the original, infinite-order stationary process $X_t$,
$$Y_{Mn} = \sqrt{n}\,(\bar X_{n,M} - \mu), \qquad \bar X_{n,M} = \frac{1}{n} \sum_{t=1}^{n} X_t^M .$$

Since $X_t^M$ is M-dependent, Theorem A.4 implies that as $n \to \infty$
$$Y_{Mn} \xrightarrow{d} Y_M \sim N(0, V_M), \qquad V_M = \sum_h \gamma_h^M = \sigma^2 \Big(\sum_{j=0}^{M} \psi_j\Big)^2,$$
where the $\gamma_h^M$ are the autocovariances of $X_t^M$.

(2) As the order $M$ increases, $V_M \to V$ as defined in (11).

(3) The error of the approximation is the tail sum of the moving average representation, which is easily bounded since the weights $\psi_j$ are absolutely summable:
$$E\Big[\sqrt{n}\,\big(\bar X_n - \bar X_{n,M}\big)\Big]^2 = E\Big[\frac{1}{\sqrt{n}} \sum_{t=1}^{n} \sum_{j=M+1}^{\infty} \psi_j\, w_{t-j}\Big]^2 \;\le\; \sigma^2 \Big(\sum_{j=M+1}^{\infty} |\psi_j|\Big)^2 \to 0$$
as $M \to \infty$.
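A numerical sanity check of Theorem A.5 (an added sketch, assuming numpy and scipy): for the AR(1) linear process above with $\phi = 0.5$ and $\sigma^2 = 1$, $V = \sigma^2/(1-\phi)^2 = 4$, and the variance of $\sqrt{n}\,(\bar X_n - \mu)$ should be close to $V$.

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(3)
phi, sigma, n, n_rep, burn = 0.5, 1.0, 2000, 2000, 500
V = sigma ** 2 / (1.0 - phi) ** 2                      # limiting variance from (11)

w = sigma * rng.standard_normal((n_rep, n + burn))
# lfilter runs the recursion X_t = phi X_{t-1} + w_t, i.e. X_t = sum_j phi^j w_{t-j}
x = lfilter([1.0], [1.0, -phi], w, axis=1)[:, burn:]   # drop the start-up transient
z = np.sqrt(n) * x.mean(axis=1)                        # sqrt(n)(Xbar_n - mu) with mu = 0

print(z.var(), V)                                      # empirical variance close to V = 4
print(np.mean(np.abs(z) < 1.96 * np.sqrt(V)))          # approximately 0.95
```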

Normality of covariances, correlations

Covariances Th A.6 The text shows this for linear processes. We did the messy part previously.

Correlations Th A.7  This is an exercise in the use of the delta method. For scalar r.v.s, we have for differentiable functions $g(\cdot)$ that

$$X \sim AN(\mu, \sigma_n^2) \;\Rightarrow\; g(X) \sim AN\big(g(\mu),\; g'(\mu)^2\, \sigma_n^2\big) . \qquad (12)$$

The argument is based on the observation that if $g(x) = a + bx$, then $g(X) \sim AN(g(\mu), b^2 \sigma_n^2)$. The trick is to show that the remainder in the linear Taylor series expansion of $g(X)$ around $\mu$ is small,

$$g(X) = g(\mu) + g'(\mu)(X - \mu) + \underbrace{\tfrac{1}{2}\, g''(m)\,(X - \mu)^2}_{\text{remainder}}, \qquad (13)$$
where $m$ lies between $X$ and $\mu$.
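For a concrete instance (an added example): if $\bar X_n \sim AN(\mu, \sigma^2/n)$ with $\mu > 0$ and $g(x) = \log x$, then $g'(\mu) = 1/\mu$ and (12) gives
$$\log \bar X_n \sim AN\Big( \log \mu,\; \frac{\sigma^2}{\mu^2\, n} \Big) .$$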

Normality of estimates in ARMA models

Conditional least squares in the AR(1) model (with mean zero) gives the estimator
$$\hat\phi = \frac{\sum_{t=1}^{n-1} X_{t+1} X_t}{\sum_{t=1}^{n-1} X_t^2} = \frac{\sum_t (\phi X_t + w_{t+1}) X_t}{\sum_{t=1}^{n-1} X_t^2} = \phi + \frac{\sum_t w_{t+1} X_t}{\sum_t X_t^2} . \qquad (14)$$
Hence,
$$\sqrt{n}\,(\hat\phi - \phi) = \frac{\frac{1}{\sqrt{n}} \sum_t X_t w_{t+1}}{\frac{1}{n} \sum_t X_t^2} .$$

Asymptotic distribution  Results for the covariances of stationary processes (Theorem A.6) show that the denominator converges to the

variance $\gamma_0$. For the numerator, the expected value is zero (since $X_t$ and $w_{t+1}$ are independent) and the variances of the components are

$$Var(X_t w_{t+1}) = E\,X_t^2\; E\,w_{t+1}^2 = \gamma_0\,\sigma^2 .$$

These have equal variance, unlike the counterparts in the analysis of a regression model with fixed xs. The covariances are zero (owing again

to the independence of the wt),

$$Cov(X_t w_{t+1},\; X_s w_{s+1}) = 0, \qquad s \neq t .$$

The summands in $\frac{1}{\sqrt{n}} \sum_t X_t w_{t+1}$ are thus uncorrelated (though not independent) and form a "trivial" stationary process. Though this is not a linear process as in A.6, comparable results hold owing to the lack of lingering dependence. (One has to build an approximating M-dependent version of the products, say $w_{t+1} X_t^M$; the approximation works with A.2 again since $X_t$ is a linear process; see S&S.) This version of the prior theorem gives (Theorem B.4)
$$\sqrt{n}\,(\hat\phi - \phi) \xrightarrow{d} N(0,\; 1 - \phi^2)$$

where the variance expression comes from

$$\frac{\sigma^2 \gamma_0}{\gamma_0^2} = \frac{\sigma^2}{\gamma_0} = 1 - \phi^2 .$$

(In the general case, the asymptotic variance is $\sigma^2 \Gamma_p^{-1}$, but this is a nice way to see that the expression does not depend on $\sigma^2$.)
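A final simulation (an added check in the same spirit as the earlier sketches, using numpy and scipy) fits the conditional least squares estimator (14) to simulated AR(1) data and compares the spread of $\sqrt{n}(\hat\phi - \phi)$ with the limiting variance $1 - \phi^2$.

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(4)
phi, n, n_rep, burn = 0.7, 1000, 4000, 200

w = rng.standard_normal((n_rep, n + burn + 1))
x = lfilter([1.0], [1.0, -phi], w, axis=1)[:, burn:]     # AR(1): X_t = phi X_{t-1} + w_t

# conditional least squares, eq. (14): phi_hat = sum X_{t+1} X_t / sum X_t^2
num = np.sum(x[:, 1:] * x[:, :-1], axis=1)
den = np.sum(x[:, :-1] ** 2, axis=1)
phi_hat = num / den

z = np.sqrt(n) * (phi_hat - phi)
print(z.var(), 1 - phi ** 2)                             # empirical variance near 0.51
print(np.mean(np.abs(z) < 1.96 * np.sqrt(1 - phi ** 2))) # approximately 0.95
```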