Asymptotic Distributions in Time Series
Statistics 910, #11

Overview

Standard proofs that establish the asymptotic normality of estimators constructed from random samples (i.e., independent observations) no longer apply in time series analysis. The usual version of the central limit theorem (CLT) presumes independence of the summed components, and that's not the case with time series. This lecture shows that normality still rules for asymptotic distributions, but the arguments have to be modified to allow for correlated data.

1. Types of convergence
2. Distributions in regression (Th A.2, section B.1)
3. Central limit theorems (Th A.3, Th A.4)
4. Normality of covariances, correlations (Th A.6, Th A.7)
5. Normality of parameter estimates in ARMA models (Th B.4)

Types of convergence

Convergence. The relevant types of convergence most often seen in statistics and probability are

• Almost surely, almost everywhere, with probability one, w.p. 1:
$$X_n \xrightarrow{a.s.} X : \quad P\{\omega : \lim_n X_n = X\} = 1\,.$$

• In probability, in measure:
$$X_n \xrightarrow{p} X : \quad \lim_n P\{\omega : |X_n - X| > \epsilon\} = 0 \quad \text{for every } \epsilon > 0\,.$$

• In distribution, weak convergence, convergence in law:
$$X_n \xrightarrow{d} X : \quad \lim_n P(X_n \le x) = P(X \le x)$$
at every continuity point $x$ of the distribution of $X$.

• In mean square or $\ell^2$, the mean squared difference goes to zero:
$$X_n \xrightarrow{m.s.} X : \quad \lim_n E(X_n - X)^2 = 0\,.$$

Connections. Convergence almost surely (which is much like good old-fashioned convergence of a sequence) implies convergence in probability, which implies convergence in distribution:
$$\xrightarrow{a.s.} \;\Rightarrow\; \xrightarrow{p} \;\Rightarrow\; \xrightarrow{d}\,.$$
Convergence in distribution only implies convergence in probability if the limiting distribution is a point mass (i.e., the r.v. converges to a constant). The various types of convergence "commute" with sums, products, and smooth functions. Mean square convergence is a bit different from the others; it implies convergence in probability,
$$\xrightarrow{m.s.} \;\Rightarrow\; \xrightarrow{p}\,,$$
which holds by Chebyshev's inequality.

Asymptotically normal (Defn A.5).
$$\frac{X_n - \mu_n}{\sigma_n} \xrightarrow{d} N(0,1) \;\Leftrightarrow\; X_n \sim AN(\mu_n, \sigma_n^2)\,.$$

Distributions in regression

Not i.i.d. The asymptotic normality of the slope estimates in regression is not so obvious if the errors are not normal. Normality requires that we can handle sums of independent, but not identically distributed, r.v.s.

Scalar. This example and those that follow only do scalar estimators to avoid matrix manipulations that conceal the underlying ideas. The text illustrates the use of the "Cramér-Wold device" for handling vector-valued estimators.

Model. In scalar form, we observe a sample of independent observations that follow $Y_i = \alpha + \beta x_i + \epsilon_i$. Assume $Y$ and $\epsilon$ denote random variables, the $x_i$ are fixed, and the deviations $\epsilon_i$ have mean 0 and variance $\sigma^2$.

Least squares estimators are $\hat\alpha = \bar{Y} - \hat\beta\,\bar{x}$ and
$$\hat\beta = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} = \frac{\mathrm{cov}_n(x,y)}{\mathrm{var}_n(x)}\,.$$
For future reference, write $SS_{xn} = \sum_{i=1}^n (x_i - \bar{x})^2$.

Distribution of $\hat\beta$. We can avoid centering $y$ in the numerator and write
$$\hat\beta = \frac{\sum_i (x_i - \bar{x})\, y_i}{SS_{xn}} = \frac{\sum_i (x_i - \bar{x})(x_i \beta + \epsilon_i)}{SS_{xn}} = \beta + \sum_i w_{ni}\,\epsilon_i \qquad (1)$$
where the key weights depend on $n$ and have the form
$$w_{ni} = \frac{x_i - \bar{x}}{SS_{xn}}\,. \qquad (2)$$
Hence, we have written $\hat\beta - \beta$ as a weighted sum of random variables, and it follows that
$$E\,\hat\beta = \beta, \qquad \mathrm{Var}(\hat\beta) = \frac{\sigma^2}{\sum_i (x_i - \bar{x})^2}\,.$$
To get the limiting distribution, we need a version of the CLT that allows for unequal variances (the Lindeberg CLT) and weights that change with $n$.

Bounded leverage. Asymptotic normality of $\hat\beta$ requires, in essence, that all of the observations have roughly equal impact on the response. In order that one point not dominate $\hat\beta$, we have to require that the largest weight be negligible,
$$\max_i \frac{(x_i - \bar{x})^2}{SS_{xn}} \to 0\,.$$
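To make the weighted-sum representation in (1)-(2) concrete, here is a small simulation sketch (my own illustration, not part of the notes; the design, sample size, skewed error distribution, and the name `slope_via_weights` are choices made up for this example). It computes $\hat\beta$ from the weights $w_{ni}$, reports the bounded-leverage ratio displayed above, and compares the standardized slope estimates to the $N(0,1)$ limit.

```python
import numpy as np

rng = np.random.default_rng(0)

def slope_via_weights(x, y):
    """Least squares slope written as the weighted sum in (1)-(2)."""
    w = (x - x.mean()) / np.sum((x - x.mean()) ** 2)   # w_ni = (x_i - xbar) / SS_xn
    return np.sum(w * y)                               # beta-hat = sum_i w_ni y_i

alpha, beta, sigma = 1.0, 2.0, 1.0                     # illustrative parameter values
n, reps = 200, 5000
x = rng.uniform(0, 10, size=n)                         # fixed design, reused in each replication
ssx = np.sum((x - x.mean()) ** 2)
se = sigma / np.sqrt(ssx)                              # exact sd of beta-hat: sigma / sqrt(SS_xn)

print("max_i (x_i - xbar)^2 / SS_xn =", np.max((x - x.mean()) ** 2) / ssx)

z = np.empty(reps)
for r in range(reps):
    eps = sigma * (rng.exponential(1.0, size=n) - 1.0) # centered, skewed, non-normal errors
    y = alpha + beta * x + eps
    z[r] = (slope_via_weights(x, y) - beta) / se       # standardized slope estimate

# If the CLT applies, z should look standard normal
print("mean(z) =", round(z.mean(), 3), " sd(z) =", round(z.std(), 3))
```

With a spread-out design like this the leverage ratio is small and the printed mean and standard deviation should be near 0 and 1; piling most of the $x_i$ onto a single value would inflate the ratio and visibly degrade the normal approximation.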
Lindeberg CLT. This theorem allows a triangular array of random variables. Think of the sequences as a lower triangular array, $\{X_{11}\}$, $\{X_{21}, X_{22}\}$, $\{X_{31}, X_{32}, X_{33}\}, \ldots$, with mean zero and variances $\mathrm{Var}(X_{ni}) = \sigma_{ni}^2$. Now let $T_n = \sum_{i=1}^n X_{ni}$ with variance $\mathrm{Var}(T_n) = s_n^2 = \sum_{i=1}^n \sigma_{ni}^2$. Hence,
$$E\,\frac{T_n}{s_n} = 0, \qquad \mathrm{Var}\,\frac{T_n}{s_n} = 1\,.$$
If the so-called Lindeberg condition holds ($I(\text{set})$ is the indicator function of the set),
$$\forall \delta > 0, \quad \frac{1}{s_n^2} \sum_{i=1}^n E\big[X_{ni}^2\, I(\delta s_n < |X_{ni}|)\big] \to 0 \;\text{ as } n \to \infty, \qquad (3)$$
then
$$\frac{T_n}{s_n} \xrightarrow{d} N(0,1), \qquad T_n \sim AN(0, s_n^2)\,.$$

Regression application. In this case, define the random variables $X_{ni} = w_{ni}\,\epsilon_i$ so that $\hat\beta = \beta + \sum_i X_{ni}$, where the model error $\epsilon_i$ has mean 0 with variance $\sigma^2$ and (as in (2))
$$w_{ni} = \frac{x_i - \bar{x}_n}{\sum_i (x_i - \bar{x}_n)^2}, \qquad s_n^2 = \sigma^2 \sum_i w_{ni}^2\,. \qquad (4)$$
Asymptotic normality requires that no one observation have too much effect on the slope, which we specify by the condition
$$\frac{\max_i w_{ni}^2}{\sum_{i=1}^n w_{ni}^2} \to 0 \;\text{ as } n \to \infty\,. \qquad (5)$$
To see that the condition (5) implies the Lindeberg condition, let's keep things a bit simpler by assuming that the $\epsilon_i$ share a common distribution, differing only in scale. Now define the largest weight $W_n = \max_i |w_{ni}|$ and observe that
$$\frac{1}{s_n^2} \sum_i w_{ni}^2\, E\big[\epsilon_i^2\, I(\delta s_n < |w_{ni}\epsilon_i|)\big] \;\le\; \frac{1}{s_n^2} \sum_i w_{ni}^2\, E\Big[\epsilon_i^2\, I\Big(\frac{\delta s_n}{W_n} < |\epsilon_i|\Big)\Big]\,. \qquad (6)$$
The condition (5) implies that $s_n / W_n \to \infty$. Since $\mathrm{Var}\,\epsilon_i = \sigma^2$ is finite, the summands have a common bound (here's where we use the common distribution of the $\epsilon_i$),
$$E\Big[\epsilon_1^2\, I\Big(\frac{\delta s_n}{W_n} < |\epsilon_1|\Big)\Big] < \eta_n \quad \forall i\,,$$
with $\eta_n \to 0$ as $n \to \infty$. Then the sum in (6) is bounded by $\eta_n/\sigma^2 \to 0$.

Central limit theorems for dependent processes

M-dependent process. A sequence of random variables $\{X_t\}$ (a stochastic process) is said to be M-dependent if $|t - s| > M$ implies that $X_t$ is independent of $X_s$,
$$\{X_t,\; t \le T\} \;\text{ independent of }\; \{X_s :\; s > T + M\}\,. \qquad (7)$$
Variables with time indices $t$ and $s$ that are more than $M$ apart, $|t - s| > M$, are independent.

Basic theorem (A.2). This is the favorite method of proof in S&S and other common textbooks, an "up-over-down" argument. The idea is to construct an approximating r.v. $Y_{mn}$ that is similar to $X_n$ (enforced by condition 3 below), but simpler to analyze in the sense that its asymptotic distribution, which is controlled by the "tuning parameter" $m$, is relatively easy to obtain.

Theorem A.2. If
(1) for every $m$, $Y_{mn} \xrightarrow{d} Y_m$ as $n \to \infty$;
(2) $Y_m \xrightarrow{d} Y$ as $m \to \infty$;
(3) $E(X_n - Y_{mn})^2 \to 0$ as $m, n \to \infty$;
then $X_n \xrightarrow{d} Y$.

CLT for M-dependence (A.4). Suppose $\{X_t\}$ is M-dependent with covariances $\gamma_j$. The variance of the mean of $n$ observations is then
$$\mathrm{Var}\big(\sqrt{n}\,\bar{X}_n\big) = n\,\mathrm{Var}\,\bar{X}_n = \sum_{h=-M}^{M} \frac{n - |h|}{n}\, \gamma_h \;\to\; \sum_{h=-M}^{M} \gamma_h := V_M\,. \qquad (8)$$

Theorem A.4. If $\{X_t\}$ is an M-dependent stationary process with mean $\mu$ and $V_M > 0$, then
$$\sqrt{n}\,(\bar{X}_n - \mu) \xrightarrow{d} N(0, V_M) \qquad (\bar{X}_n \sim AN(\mu,\, V_M/n))\,.$$

Proof. Assume w.l.o.g. that $\mu = 0$ (recall our prior discussion in Lecture 4 that we can replace the sample mean by $\mu$) so that the statistic of interest is $\sqrt{n}\,\bar{X}_n$. Define the similar statistic $Y_{mn}$ by forming the data into blocks of length $m$ and tossing out enough observations so that the blocks are independent of each other. Choose $m > 2M$ and let $r = \lfloor n/m \rfloor$ (the greatest integer $\le n/m$) count the blocks. Now arrange the data to fill a table with $r$ rows and $m$ columns (pretend for the moment that $n = r\,m$).
The approximation $Y_{mn}$ sums the first $m - M$ columns of this table,
$$Y_{mn} = \frac{1}{\sqrt{n}}\left(\begin{array}{l}
\phantom{+\,} X_1 + X_2 + \cdots + X_{m-M} \\
+\, X_{m+1} + X_{m+2} + \cdots + X_{2m-M} \\
+\, X_{2m+1} + X_{2m+2} + \cdots + X_{3m-M} \\
\qquad \cdots \\
+\, X_{(r-1)m+1} + X_{(r-1)m+2} + \cdots + X_{rm-M}
\end{array}\right) \qquad (9)$$
and omits the intervening blocks,
$$U_{mn} = \frac{1}{\sqrt{n}}\left(\begin{array}{l}
\phantom{+\,} X_{m-M+1} + \cdots + X_m \\
+\, X_{2m-M+1} + \cdots + X_{2m} \\
+\, X_{3m-M+1} + \cdots + X_{3m} \\
\qquad \cdots \\
+\, X_{rm-M+1} + \cdots + X_n
\end{array}\right) \qquad (10)$$
The idea is that we will get the shape of the distribution from the variation among the blocks. Label the row sums that define $Y_{mn}$ in (9) as $Z_i$ so that we can write
$$Y_{mn} = \frac{1}{\sqrt{n}}\,(Z_1 + Z_2 + \cdots + Z_r)\,.$$
Clearly, because the $X_t$ are centered, $E\,Z_i = 0$ and
$$\mathrm{Var}\,Z_i = \sum_h (m - M - |h|)\,\gamma_h := S_{m-M}\,.$$
Now let's fill in the 3 requirements of Theorem A.2:

(1) Write
$$Y_{mn} = \frac{1}{\sqrt{n}} \sum_{i=1}^r Z_i = (n/r)^{-1/2}\; r^{-1/2} \sum_i Z_i\,.$$
Now observe that $n/r \to m$, and since the $Z_i$ are independent by construction, the usual CLT shows that
$$r^{-1/2} \sum_i Z_i \xrightarrow{d} N(0, S_{m-M})\,.$$
Hence, we have $Y_{mn} \xrightarrow{d} Y_m := N(0,\, S_{m-M}/m)$ as $n \to \infty$.

(2) Now let $m \to \infty$. As the number of columns in the table (9) increases, the variance $S_{m-M}/m$ approaches the limiting variance $V_M$, so $Y_m \xrightarrow{d} N(0, V_M)$.

(3) Last, we have to check that we did not make too rough an approximation when forming $Y_{mn}$.
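As a numerical companion to the blocking argument, here is a short sketch (my own illustration, not from the notes; the MA(2) coefficients and the choices of $n$, $m$, and the number of replications are arbitrary). It builds a 2-dependent series, splits $\sqrt{n}\,\bar{X}_n$ into the big-block sum $Y_{mn}$ of (9) and the remainder $U_{mn}$ of (10), and estimates $E\,U_{mn}^2$ to illustrate requirement (3): the omitted piece is negligible once $m$ is large relative to $M$.

```python
import numpy as np

rng = np.random.default_rng(1)
M = 2                                       # the series below is 2-dependent

def m_dependent_series(n):
    """X_t = e_t + 0.6 e_{t-1} + 0.3 e_{t-2}: an MA(2), hence 2-dependent, mean 0."""
    e = rng.normal(size=n + M)
    return e[M:] + 0.6 * e[M - 1:-1] + 0.3 * e[:-M]

def blocks(x, m, M):
    """Split sqrt(n) * xbar into the big-block sum Y_mn of (9) and the remainder U_mn of (10)."""
    n = len(x)
    r = n // m                              # r = floor(n/m) blocks
    Y = sum(x[(k - 1) * m:(k - 1) * m + (m - M)].sum() for k in range(1, r + 1))
    U = x.sum() - Y                         # small blocks plus the tail through X_n
    return Y / np.sqrt(n), U / np.sqrt(n)

n, m = 10_000, 100                          # need m > 2M; here m is much larger than M
x = m_dependent_series(n)
Y, U = blocks(x, m, M)
print("sqrt(n) * xbar =", np.sqrt(n) * x.mean())
print("Y_mn + U_mn    =", Y + U)            # identical by construction

# Requirement (3): the omitted piece is small in mean square when m >> M
reps = 1000
U2 = [blocks(m_dependent_series(n), m, M)[1] ** 2 for _ in range(reps)]
print("estimated E U_mn^2 =", np.mean(U2))  # O(M/m), so small here
```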
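Theorem A.4 itself can be checked the same way. The sketch below (again only an illustration; the MA(1) coefficient, the mean, and the Monte Carlo settings are assumptions of the example) simulates a 1-dependent series, compares the Monte Carlo variance of $\sqrt{n}(\bar{X}_n - \mu)$ with the limit $V_M = \sum_{h=-M}^{M} \gamma_h$ from (8), and standardizes by $\sqrt{V_M}$ to see the $N(0,1)$ limit.

```python
import numpy as np

rng = np.random.default_rng(2)

# 1-dependent example: X_t = mu + e_t + theta * e_{t-1}, with e_t iid N(0, 1)
mu, theta = 5.0, 0.8
gamma0 = 1 + theta ** 2                     # autocovariances of the MA(1) part
gamma1 = theta
V_M = gamma0 + 2 * gamma1                   # limit in (8) with M = 1

def sample_mean(n):
    e = rng.normal(size=n + 1)
    x = mu + e[1:] + theta * e[:-1]
    return x.mean()

n, reps = 2_000, 4_000
stats = np.sqrt(n) * (np.array([sample_mean(n) for _ in range(reps)]) - mu)

print("Monte Carlo Var of sqrt(n)(xbar - mu):", round(stats.var(), 3), " vs V_M =", V_M)
z = stats / np.sqrt(V_M)                    # should be close to N(0, 1)
print("mean(z) =", round(z.mean(), 3), " sd(z) =", round(z.std(), 3))
```

The finite-sample variance in (8) differs from $V_M$ only by terms of order $1/n$, so the two printed variances should agree to within Monte Carlo error.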