
Lecture 8: Univariate Processes with Unit Roots∗

1 Introduction

1.1 Stationary AR(1) vs. Random Walk

In this lecture, we will discuss a very important type of process: unit root processes. For an AR(1) process
$$x_t = \phi x_{t-1} + u_t \qquad (1)$$
to be stationary, we require that $|\phi| < 1$. In an AR(p) process, we require that all the roots of
$$1 - \phi_1 z - \dots - \phi_p z^p = 0$$
lie outside the unit circle. If one of the roots turns out to be one, then the process is called a unit root process. For an AR(1) process this means $\phi = 1$,
$$x_t = x_{t-1} + u_t. \qquad (2)$$
It turns out that the two processes ($|\phi| < 1$ and $\phi = 1$) behave in very different manners. For simplicity, we assume that the innovations $u_t$ follow an i.i.d. Gaussian distribution with mean zero and variance $\sigma^2$.

First, I plot the two graphs in Figure 1. In the left graph I set $\phi = 0.9$, and in the right graph I draw the random walk process, $\phi = 1$. From the left graph, we see that $x_t$ moves around zero and never gets out of the $[-6, 6]$ region. There seems to be some force that pulls the process back toward its mean (zero). In the right graph, by contrast, we do not see a fixed mean; $x_t$ moves 'freely' and in this case goes as high as about 72. If we repeat the simulation, the $\phi = 0.9$ paths look pretty much the same, but the random walk paths are very different from each other. For instance, in a second simulation the random walk may go down to $-80$, say. This is the graphical comparison.

Second, consider some moments of the process $x_t$ for $|\phi| < 1$ and $\phi = 1$. When $|\phi| < 1$, we have
$$E(x_t) = 0 \quad \text{and} \quad E(x_t^2) = \sigma^2/(1-\phi^2).$$
When $\phi = 1$, we no longer have constant unconditional moments; the first two conditional moments are
$$E(x_t \mid \mathcal{F}_{t-1}) = x_{t-1} \quad \text{and} \quad \mathrm{Var}(x_t \mid \mathcal{F}_{t-1}) = \sigma^2.$$

*Copyright 2002-2006 by Ling Hu.


Figure 1: Simulated Autoregressive Processes with Coefficient φ = 0.9 and φ = 1
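The two panels of Figure 1 can be reproduced with a short simulation. Below is a minimal sketch (not the author's original code), assuming only numpy; the seed and the sample size of 500 are arbitrary choices.

    # Simulate a stationary AR(1) with phi = 0.9 and a random walk (phi = 1)
    # driven by i.i.d. N(0,1) innovations.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 500
    u = rng.standard_normal(n)

    x_stat = np.zeros(n)                 # AR(1): x_t = 0.9 x_{t-1} + u_t
    for t in range(1, n):
        x_stat[t] = 0.9 * x_stat[t - 1] + u[t]

    x_rw = np.cumsum(u)                  # random walk: x_t = x_{t-1} + u_t

    print("AR(0.9) range:", x_stat.min(), x_stat.max())   # stays in a narrow band
    print("random walk range:", x_rw.min(), x_rw.max())   # wanders far from zero

Repeating the simulation with different seeds makes the contrast clear: the AR(0.9) paths all look alike, while the random walk paths differ wildly from one draw to the next.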

When we do k-period-ahead forecasting with $|\phi| < 1$,
$$E(x_{t+k} \mid \mathcal{F}_t) = \phi^k x_t.$$
Since $|\phi| < 1$, $\phi^k \to 0 = E(x_t)$ as $k \to \infty$. So as the forecasting horizon increases, the current value of $x_t$ matters less and less, since the conditional expectation converges to the unconditional expectation. The forecast variance is
$$\mathrm{Var}(x_{t+k} \mid \mathcal{F}_t) = (1 + \phi^2 + \dots + \phi^{2(k-1)})\sigma^2 = \frac{1-\phi^{2k}}{1-\phi^2}\,\sigma^2,$$
which converges to $\sigma^2/(1-\phi^2)$ as $k \to \infty$.

Next, consider the case $\phi = 1$. Here
$$E(x_{t+k} \mid \mathcal{F}_t) = x_t,$$
which means that the current value does matter (actually it is the only thing that matters) even as $k \to \infty$! The forecast variance is
$$\mathrm{Var}(x_{t+k} \mid \mathcal{F}_t) = k\sigma^2 \to \infty \quad \text{as } k \to \infty.$$

If we let $x_0 = 1$ and $\sigma^2 = 1$, we can draw the forecasts of $x_{t+k}$ for $\phi = 0.9$ and $\phi = 1$ as in Figure 2. The upper graph in Figure 2 plots the forecast of $x_k$ when $\phi = 0.9$: the expectation of $x_k$ conditional on $x_0 = 1$ drops to zero as $k$ increases, and the forecast standard error converges quickly to the unconditional standard error. The lower graph in Figure 2 plots the forecast of $x_k$ when $\phi = 1$: the forecast interval diverges as $k$ increases.
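The forecast means and variances just derived can be tabulated directly. Below is a minimal sketch (assuming numpy; a horizon of 20 periods, in the spirit of Figure 2), not the author's original code:

    # k-step-ahead forecast mean and variance from x_0 = 1, sigma^2 = 1.
    import numpy as np

    x0, sigma2 = 1.0, 1.0
    k = np.arange(1, 21)

    phi = 0.9                                                     # stationary case
    mean_stat = phi ** k * x0                                     # decays to 0
    var_stat = (1 - phi ** (2 * k)) / (1 - phi ** 2) * sigma2     # -> 1/(1 - 0.81)

    mean_rw = np.full(len(k), x0)                                 # random walk: stays at x_0
    var_rw = k * sigma2                                           # diverges linearly in k

    print(mean_stat[-1], var_stat[-1])
    print(mean_rw[-1], var_rw[-1])

A forecast interval can then be formed as the mean plus or minus a multiple (e.g. 1.96) of the square root of these variances.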

Figure 2: Forecasts of $x_k$ at time zero when $\phi = 0.9$ and $\phi = 1$

A third way to compare a stationary and a nonstationary autoregressive process is to compare their impulse-response functions. We can invert (1) to
$$x_t = \sum_{h=0}^{\infty} \phi^h u_{t-h},$$
so the effect of $u_t$ on $x_{t+k}$ is $\phi^k$, which dies out as $k$ increases. In the unit root case,
$$x_t = x_0 + \sum_{h=1}^{t} u_h,$$
so the effect of $u_t$ on $x_{t+k}$ is one, which is independent of $k$; the impulse-response function is flat at one. So if a process is a random walk, then the effects of all shocks on the level of $x$ are permanent.

Finally, we can compare the asymptotic distribution of the coefficient estimator in a stationary and a nonstationary autoregressive process. For an AR(1) process $x_t = \phi x_{t-1} + u_t$ with $u_t \sim \text{i.i.d.}\,N(0, \sigma^2)$ and $|\phi| < 1$, we have shown in lecture note 6 that the MLE estimator of $\phi$,
$$\hat{\phi}_n = \frac{\sum_{t=1}^{n} x_{t-1} x_t}{\sum_{t=1}^{n} x_{t-1}^2},$$
is asymptotically normal: $\sqrt{n}(\hat{\phi}_n - \phi) \to_d N(0, 1 - \phi^2)$. However, if $\phi = 1$, then $\hat{\phi}_n$ converges at an order higher than $n^{1/2}$, as we will see below.

Above we have only considered the AR(1) process. In a general AR(p) process, if there is one unit root, then the process is a nonstationary unit root process. Consider an AR(2) example: let $\lambda_1 = 1$ and $\lambda_2 = 0.5$, so that
$$(1 - L)(1 - 0.5L)x_t = \epsilon_t, \qquad \epsilon_t \sim \text{i.i.d.}(0, \sigma^2).$$
Then
$$(1 - L)x_t = (1 - 0.5L)^{-1}\epsilon_t = \theta(L)\epsilon_t \equiv u_t.$$

So the difference of $x_t$, $\Delta x_t = (1 - L)x_t$, is a stationary process, and
$$x_t = x_{t-1} + u_t$$
is a unit root process with serially correlated errors.

1.2 Stochastic Trend vs. Deterministic Trend

In a unit root process $x_t = x_{t-1} + u_t$, where $u_t$ is a stationary process, $x_t$ is said to be integrated of order one, denoted I(1). An I(1) process is also said to be difference stationary, in contrast to the trend stationary processes discussed in the previous lecture. If we need to take differences twice to get a stationary process, then the process is said to be integrated of order two, denoted I(2), and so forth. A stationary process can be denoted I(0). If a process is a stationary ARMA(p, q) after taking kth differences, then the original process is called ARIMA(p, k, q); the 'I' here denotes integrated.

Recall that we learned about the spectrum in lecture 3. The implication of these 'integration' and 'difference' operators for spectral analysis is that when you integrate (sum), you filter out the high-frequency components and what remains are the low frequencies, which is a feature of a unit root process. Recall that the spectrum of a stationary AR(1) process is
$$S(\omega) = \frac{1}{2\pi}(1 + \phi^2 - 2\phi\cos\omega)^{-1}.$$
When $\phi \to 1$, we have $S(\omega) = 1/[4\pi(1 - \cos\omega)]$, so as $\omega \to 0$, $S(\omega) \to \infty$. Processes with a stochastic trend therefore have an infinite spectrum at the origin. Recall that $S(\omega)$ decomposes the variance of a process into components contributed by each frequency, so the variance of a unit root process is largely contributed by the low frequencies. On the other hand, when we difference, we filter out the low frequencies and what remains are the high frequencies.

In the previous lecture, we discussed processes with a deterministic trend. We can compare a process with a deterministic trend (DT) and a process with a stochastic trend (ST) from two perspectives. First, when we do k-period-ahead forecasting, as $k \to \infty$ the forecast error for a DT process converges to the variance of its stationary component, which is bounded; but as we saw in the previous section, the forecast error for an ST process diverges as $k \to \infty$. Second, the impulse-response function for a DT process is the same as in the stationary case: the effect of a shock dies out quickly. The impulse-response function for an ST process is flat at one: the effects of all shocks on the level are permanent. However, note that in Figure 1 we plot a simulated random walk, and part of its path looks as if it has an upward time trend. This turns out to be a quite general problem: over a short time period, it is very hard to judge whether a process has a stochastic trend or a deterministic trend.
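The 'infinite spectrum at the origin' claim is easy to check numerically. Below is a minimal sketch (assuming numpy; the grid of frequencies and the values of $\phi$ are arbitrary choices):

    # AR(1) spectrum S(w) = (1/2pi)(1 + phi^2 - 2 phi cos w)^{-1}.
    # As phi -> 1 the spectrum explodes at low frequencies but not elsewhere.
    import numpy as np

    def ar1_spectrum(omega, phi):
        return 1.0 / (2 * np.pi * (1 + phi ** 2 - 2 * phi * np.cos(omega)))

    omega = np.array([0.001, 0.01, 0.1, 1.0])
    for phi in (0.5, 0.9, 0.99):
        print(phi, ar1_spectrum(omega, phi))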

2 Brownian Motion and Functional Central Limit Theorem

2.1 Brownian Motion

To derive statistical inference for a unit root process, we need to make use of a very important stochastic process: Brownian motion (also called the Wiener process). To understand a Brownian motion, consider a random walk

yt = yt−1 + ut, y0 = 0, ut ∼ i.i.d.N(0, 1). (3)

We can then write
$$y_t = \sum_{s=1}^{t} u_s \sim N(0, t),$$
and the change in the value of $y$ between dates $s$ and $t$,
$$y_t - y_s = u_{s+1} + u_{s+2} + \dots + u_t = \sum_{i=s+1}^{t} u_i \sim N(0, t - s),$$
is independent of the change between dates $r$ and $q$ for $s < t < r < q$. Next, consider the change $y_t - y_{t-1} = u_t \sim \text{i.i.d.}\,N(0, 1)$. We can view $u_t$ as the sum of two independent Gaussian variables,
$$u_t = \epsilon_{1t} + \epsilon_{2t}, \qquad \epsilon_{it} \sim \text{i.i.d.}\,N\!\left(0, \tfrac{1}{2}\right).$$

Then we can associate $\epsilon_{1t}$ with the change between $y_{t-1}$ and the value of $y$ at some interim point (say, $y_{t-(1/2)}$), and $\epsilon_{2t}$ with the change between $y_{t-(1/2)}$ and $y_t$:
$$y_{t-(1/2)} - y_{t-1} = \epsilon_{1t}, \qquad y_t - y_{t-(1/2)} = \epsilon_{2t}. \qquad (4)$$

Sampled at integer dates $t = 1, 2, \dots$, the process of (4) has the same properties as (3), since
$$y_t - y_{t-1} = \epsilon_{1t} + \epsilon_{2t} \sim \text{i.i.d.}\,N(0, 1).$$

In addition, the process of (4) is also defined at the non-integer dates and retains the property, for both integer and non-integer dates, that $y_t - y_s \sim N(0, t - s)$ with $y_t - y_s$ independent of the change over any other nonoverlapping interval. Using the same reasoning, we could partition the change between $t - 1$ and $t$ into $n$ separate subperiods:
$$y_t - y_{t-1} = \sum_{i=1}^{n} \epsilon_{it}, \qquad \epsilon_{it} \sim \text{i.i.d.}\,N\!\left(0, \tfrac{1}{n}\right).$$
When $n \to \infty$, the limit process is known as Brownian motion. The value of this process at date $t$ is denoted $W(t)$. A realization of this continuous time process can be viewed as a stochastic function $W(\cdot)$. In particular, we will be interested in Brownian motion over the interval $t \in [0, 1]$.
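This construction suggests a simple way to simulate an approximate Brownian motion path: cumulate many small independent Gaussian increments. A minimal sketch (assuming numpy; the grid size is an arbitrary choice):

    # Approximate a standard Brownian motion on [0, 1] by cumulating
    # n i.i.d. N(0, 1/n) increments, with W(0) = 0.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 10_000
    increments = rng.normal(0.0, np.sqrt(1.0 / n), size=n)
    W = np.concatenate(([0.0], np.cumsum(increments)))   # W at t = 0, 1/n, ..., 1

    print(W[-1], W[n // 2])   # draws of W(1) ~ N(0, 1) and W(1/2) ~ N(0, 1/2)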

Definition 1 (Brownian Motion) A standard Brownian motion W (t), t ∈ [0, 1], is a continuous time stochastic process such that

(a) W(0) = 0

(b) For any time 0 < s < t < 1, W (t) − W (s) ∼ N(0, t − s). And the differences W (t2) − W (t1) and W (t4) − W (t3), for any 0 ≤ t1 < t2 < t3 < t4 ≤ 1, are independent. (c) W(t) is continuous in time t with probability 1.

So given a standard Brownian motion W(t), we have E(W(t)) = 0, Var(W(t)) = t, and Cov(W(t), W(s)) = min(t, s). Other Brownian motions can be generated from a standard Brownian motion. For example, the process Z(t) = σ · W(t) has independent increments and Z(t) is distributed N(0, σ²t). Such a process is described as Brownian motion with variance σ². An important feature of Brownian motion is that although it is continuous in t, it is not differentiable in the standard sense. The direction of change at t is likely to be completely different from that at t + ∆, no matter how small ∆ is. Even if some parts of the realization of a Brownian motion look 'smooth', under a 'microscope' we would see many zig-zags. There are several concepts of smoothness of a function, and continuity is the weakest one. Differentiability is another concept of smoothness. When the domain of the function is an interval, we have yet another smoothness condition. A function f : [a, b] → R is of bounded variation if there exists M < ∞ such that for every partition of [a, b] by a finite collection of points a = x_0 < x_1 < x_2 < ... < x_n = b,

$$\sum_{i=1}^{n} |f(x_i) - f(x_{i-1})| \le M.$$
Brownian motion is not of bounded variation.

Later in this lecture, we will also see the integral of Brownian motion ($\int_0^1 W(r)dr$) and a stochastic integral ($\int_0^1 W(r)dW(r)$), for $r \in [0, 1]$. First, note that $W(r)$ is a Gaussian process, hence $\int_0^1 W(r)dr$ is also Gaussian. It is easy to see that $E[\int_0^1 W(r)dr] = 0$. To compute its variance, let $s \le r$ and write

$$E\left[\left(\int_0^1 W(r)dr\right)^2\right] = 2\int_0^1\!\!\int_0^r E[W(r)W(s)]\,ds\,dr = 2\int_0^1\!\!\int_0^r s\,ds\,dr = \int_0^1 r^2\,dr = \frac{1}{3}.$$
Therefore, $\int_0^1 W(r)dr \sim N(0, 1/3)$. As another exercise, consider the distribution of $W(1) - \int_0^1 W(r)dr$. Again, it is Gaussian with zero mean. To compute its variance,

$$E\left[\left(W(1) - \int_0^1 W(r)dr\right)^2\right] = 1 + \frac{1}{3} - 2E\left[W(1)\int_0^1 W(r)dr\right] = \frac{4}{3} - 2\int_0^1 r\,dr = \frac{1}{3}.$$
To study the stochastic integral, we need a fundamental result from stochastic calculus:

Definition 2 (Ito's Lemma) Let X(r) be a process given by

dX(r) = udr + vdW (r).

Let g(r, x) be a twice continuously differentiable real function. Let

$$Y(r) = g(r, X(r)).$$
Then
$$dY(r) = \frac{\partial g}{\partial r}(r, X(r))\,dr + \frac{\partial g}{\partial x}(r, X(r))\,dX(r) + \frac{1}{2}\frac{\partial^2 g}{\partial x^2}(r, X(r))\cdot(dX(r))^2,$$
where $(dX(r))^2 = (dX(r))\cdot(dX(r))$ is computed according to the rules

dr · dr = dr · dW (r) = dW (r) · dr = 0, dW (r) · dW (r) = dr.

Now, choose $X(r) = \sigma W(r)$ and $g(r, x) = \frac{1}{2}x^2$. Then
$$Y(r) = g(r, X(r)) = \frac{1}{2}X(r)^2 = \frac{1}{2}\sigma^2 W(r)^2.$$
By Ito's lemma,

$$dY(r) = \frac{\partial g}{\partial r}dr + \frac{\partial g}{\partial x}dX(r) + \frac{1}{2}\frac{\partial^2 g}{\partial x^2}(dX(r))^2 = \sigma^2 W(r)dW(r) + \frac{1}{2}(\sigma dW(r))^2 = \sigma^2 W(r)dW(r) + \frac{1}{2}\sigma^2 dr,$$
where the first derivative of $g$ with respect to $x$ gives $X(r) = \sigma W(r)$, and $dX(r) = \sigma dW(r)$. Hence,
$$d\left(\frac{1}{2}\sigma^2 W(r)^2\right) = \sigma^2 W(r)dW(r) + \frac{1}{2}\sigma^2 dr.$$

Integrating both sides from 0 to 1,

$$\frac{1}{2}\sigma^2 W(1)^2 = \sigma^2\int_0^1 W(r)dW(r) + \frac{1}{2}\sigma^2,$$
therefore
$$\sigma^2\int_0^1 W(r)dW(r) = \frac{1}{2}\sigma^2\left(W(1)^2 - 1\right).$$
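Both results can be checked by simulation. Below is a minimal sketch (assuming numpy; grid size and number of replications are arbitrary choices): it verifies that $\int_0^1 W(r)dr$ has variance close to 1/3 and that the discrete analogue of $\int_0^1 W(r)dW(r)$ is close to $(W(1)^2 - 1)/2$.

    # Monte Carlo check of the two Brownian functionals derived above.
    import numpy as np

    rng = np.random.default_rng(2)
    n, reps = 2_000, 5_000
    int_W = np.empty(reps)        # Riemann sums of int_0^1 W(r) dr
    ito_err = np.empty(reps)      # discrete Ito sum minus (W(1)^2 - 1)/2
    for i in range(reps):
        dW = rng.normal(0.0, np.sqrt(1.0 / n), size=n)
        W = np.cumsum(dW)                                    # W(t/n), t = 1..n
        int_W[i] = W.mean()
        ito = np.sum(np.concatenate(([0.0], W[:-1])) * dW)   # sum W((t-1)/n) dW
        ito_err[i] = ito - (W[-1] ** 2 - 1.0) / 2.0

    print("variance of int W dr:", int_W.var())                 # close to 1/3
    print("largest |Ito discrepancy|:", np.abs(ito_err).max())  # shrinks like n^{-1/2}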

2.2 Functional Central Limit Theorem (FCLT)

2.2.1 Introduction

Recall that the central limit theorem (CLT) tells us that the sample mean of a stationary process is asymptotically normal and centered at the population mean. However, if $x_t$ is a random walk, there is no such thing as a 'population mean'. Therefore, to draw inference for processes with a unit root, we need a new tool called the functional central limit theorem (FCLT), the central limit theorem defined on function spaces. The FCLT is as important to unit root limit theory as the CLT is to stationary time series limit theory. As usual, let $n$ denote the sample size, and let $r = t/n$, so $r \in [0, 1]$. We use the symbol $[nr]$ to denote the largest integer that is less than or equal to $nr$.

Consider a process $u_t \sim \text{i.i.d.}(0, \sigma^2)$ with sample mean $\bar{u}_n$; the CLT tells us that $n^{1/2}\bar{u}_n \to_d N(0, \sigma^2)$. Now consider that, given a sample of size $n$, we calculate the mean of the first half of the sample and throw out the rest of the observations:
$$\bar{u}_{[n/2]} = \frac{1}{[n/2]}\sum_{t=1}^{[n/2]} u_t.$$
This estimator also satisfies the CLT:
$$\sqrt{[n/2]}\;\bar{u}_{[n/2]} \to_d N(0, \sigma^2).$$

Moreover, this estimator would be independent of an estimator that uses only the second half of the sample. More generally, let's construct a new random variable $X_n(r)$ for $r \in [0, 1]$,

$$X_n(r) = \frac{1}{n}\sum_{t=1}^{[nr]} u_t,$$
or
$$X_n(r) = \begin{cases} 0 & \text{for } r \in [0, 1/n) \\ u_1/n & \text{for } r \in [1/n, 2/n) \\ (u_1 + u_2)/n & \text{for } r \in [2/n, 3/n) \\ \vdots & \\ (u_1 + u_2 + \dots + u_n)/n & \text{for } r = 1. \end{cases}$$
It is easy to see that $n^{1/2}X_n(1) = n^{1/2}\bar{u}_n$, and the CLT tells us that it converges to $N(0, \sigma^2)$. But what does $X_n(r)$ converge to as $n \to \infty$? Write

$$n^{1/2}X_n(r) = \frac{1}{\sqrt{n}}\sum_{t=1}^{[nr]} u_t = \frac{\sqrt{[nr]}}{\sqrt{n}}\cdot\frac{1}{\sqrt{[nr]}}\sum_{t=1}^{[nr]} u_t.$$

By the CLT we have $[nr]^{-1/2}\sum_{t=1}^{[nr]} u_t \to_d N(0, \sigma^2)$, while $([nr]/n)^{1/2} \to r^{1/2}$; therefore we have

$$n^{1/2}X_n(r)/\sigma \to_d N(0, r). \qquad (5)$$

Next, if we consider the behavior of a sample mean based on observations $[nr_1]$ through $[nr_2]$ for $r_2 > r_1$, it is also asymptotically normal by a similar argument,
$$n^{1/2}\big(X_n(r_2) - X_n(r_1)\big)/\sigma \to_d N(0, r_2 - r_1),$$
and it is independent of the estimator in (5) for $r < r_1$. Therefore, the sequence of stochastic functions $\{\sqrt{n}X_n(\cdot)/\sigma\}_{n=1}^{\infty}$ has an asymptotic probability law:
$$n^{1/2}X_n(\cdot)/\sigma \to_d W(\cdot). \qquad (6)$$

Note that here $X_n(\cdot)$ is a function, while in (5) $X_n(r)$ is a random variable. The asymptotic result (6) is known as the functional central limit theorem (FCLT). Later on, we may also write $n^{1/2}X_n(r) \to_d W(r)$, but note that this does not mean the variable $X_n(r)$ converges to a variable with an $N(0, r)$ distribution; rather, the function converges to a stochastic function, the standard Brownian motion. Evaluated at $r = 1$, the function $X_n(r)$ is just the sample mean. Thus when the function in (6) is evaluated at $r = 1$, we get the conventional CLT:
$$\sqrt{n}X_n(1)/\sigma = \frac{1}{\sigma\sqrt{n}}\sum_{t=1}^{n} u_t \to_d W(1) \sim N(0, 1). \qquad (7)$$
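The finite-dimensional convergence in (5)-(6) is easy to illustrate numerically, even with non-Gaussian shocks. A minimal sketch (assuming numpy; the uniform shocks and the sample sizes are arbitrary choices):

    # Scaled partial-sum process sqrt(n) X_n(r) / sigma for i.i.d. uniform shocks
    # (mean 0, variance 1).  Its value at r should be approximately N(0, r).
    import numpy as np

    rng = np.random.default_rng(3)
    n, reps, sigma = 1_000, 20_000, 1.0
    u = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(reps, n))
    S = np.cumsum(u, axis=1) / (np.sqrt(n) * sigma)     # sqrt(n) X_n(t/n) / sigma

    for r in (0.25, 0.5, 1.0):
        col = S[:, int(n * r) - 1]
        print(r, col.mean(), col.var())                 # variance close to r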

2.2.2 Convergence of a random function

In lecture 4, we discussed various modes of convergence and the continuous mapping theorem for random variables. Now, let's define convergence for a random function, such as the $X_n(r)$ defined earlier. We first define convergence in distribution for a random function. Let $S(\cdot)$ represent a continuous-time stochastic process with $S(r)$ representing its value at some date $r$ for $r \in [0, 1]$. Also suppose that, for any given realization, $S(\cdot)$ is a continuous function of $r$ with probability 1. For $\{S_n(\cdot)\}_{n=1}^{\infty}$ a sequence of such continuous functions, we say that $S_n(\cdot) \to_d S(\cdot)$ if all of the following hold:

(a) For any finite collection of $k$ particular dates, $0 \le r_1 < r_2 < \dots < r_k \le 1$, the sequence of $k$-dimensional random vectors $\{y_n\}_{n=1}^{\infty}$ converges in distribution to the vector $y$, where
$$y_n \equiv \begin{pmatrix} S_n(r_1) \\ S_n(r_2) \\ \vdots \\ S_n(r_k) \end{pmatrix}, \qquad y \equiv \begin{pmatrix} S(r_1) \\ S(r_2) \\ \vdots \\ S(r_k) \end{pmatrix};$$

(b) For each $\epsilon > 0$, the probability that $S_n(r_1)$ differs from $S_n(r_2)$ by more than $\epsilon$, for any dates $r_1$ and $r_2$ within $\delta$ of each other, goes to zero uniformly in $n$ as $\delta \to 0$.

(c) $P(|S_n(0)| > \lambda) \to 0$ uniformly in $n$ as $\lambda \to \infty$.

Next, we extend convergence in probability to random functions. Let $\{S_n(\cdot)\}_{n=1}^{\infty}$ and $\{V_n(\cdot)\}_{n=1}^{\infty}$ denote sequences of random continuous functions with $S_n : [0, 1] \mapsto \mathbb{R}$ and $V_n : [0, 1] \mapsto \mathbb{R}$. Define
$$Y_n = \sup_{r \in [0, 1]} |S_n(r) - V_n(r)|.$$

Then $Y_n$ is a sequence of random variables. If $Y_n \to_p 0$ (the usual convergence in probability for a random variable), then we say that
$$S_n(\cdot) \to_p V_n(\cdot).$$
In other words, we define convergence in probability of a random function in terms of convergence of the upper bound of its distance from the limit function. Further, if $V_n(\cdot) \to_p S_n(\cdot)$ and $S_n(\cdot) \to_d S(\cdot)$, where $S(\cdot)$ is a continuous function, then $V_n(\cdot) \to_d S(\cdot)$.

Example 1 Let $u_t$ be a strictly stationary time series with finite fourth moment, and let $S_n(r) = n^{-1/2}u_{[nr]}$. Then $S_n(\cdot) \to_p 0$.

Proof:
$$P\left(\sup_{r \in [0,1]} |S_n(r)| > \delta\right) = P\left\{[|n^{-1/2}u_1| > \delta] \text{ or } [|n^{-1/2}u_2| > \delta], \dots, \text{ or } [|n^{-1/2}u_n| > \delta]\right\} \le nP(|n^{-1/2}u_t| > \delta) \le n\,\frac{E(n^{-1/2}u_t)^4}{\delta^4} = \frac{E(u_t^4)}{n\delta^4} \to 0.$$

So we have $S_n(\cdot) \to_p 0$.

In Lecture 4, we also reviewed the continuous mapping theorem (CMT): if $x_n \to x$ and $g(\cdot)$ is a continuous function, then $g(x_n) \to g(x)$. We have a similar result for the FCLT: if $S_n(\cdot) \to_d S(\cdot)$ and $g(\cdot)$ is a continuous functional, then $g(S_n(\cdot)) \to_d g(S(\cdot))$. For example, $\sqrt{n}X_n(\cdot)/\sigma \to_d W(\cdot)$ implies that
$$\sqrt{n}X_n(\cdot) \to_d \sigma W(\cdot), \qquad (8)$$
whose value at $r$ is distributed $N(0, \sigma^2 r)$. As another example, let
$$S_n(r) \equiv [\sqrt{n}X_n(r)]^2. \qquad (9)$$
Since $\sqrt{n}X_n(\cdot) \to_d \sigma W(\cdot)$, it follows that
$$S_n(\cdot) \to_d \sigma^2[W(\cdot)]^2. \qquad (10)$$

2.3 Applications to Unit Root Processes

The simplest case to illustrate how to use the FCLT to compute asymptotics is a random walk $y_t$ with $y_0 = 0$,
$$y_t = y_{t-1} + u_t = \sum_{i=1}^{t} u_i, \qquad u_t \sim \text{i.i.d.}\,N(0, \sigma^2).$$

Define $X_n(\cdot)$ as
$$X_n(r) = \begin{cases} 0 & \text{for } r \in [0, 1/n) \\ y_1/n & \text{for } r \in [1/n, 2/n) \\ y_2/n & \text{for } r \in [2/n, 3/n) \\ \vdots & \\ y_n/n & \text{for } r = 1. \end{cases} \qquad (11)$$

If we integrate $X_n(r)$ over $r \in [0, 1]$, we have
$$\int_0^1 X_n(r)dr = y_1/n^2 + y_2/n^2 + \dots + y_{n-1}/n^2 = n^{-2}\sum_{t=1}^{n} y_{t-1}.$$
Multiplying both sides by $\sqrt{n}$:
$$\sqrt{n}\int_0^1 X_n(r)dr = n^{-3/2}\sum_{t=1}^{n} y_{t-1}.$$

From (8) we know that $\sqrt{n}X_n(\cdot) \to_d \sigma W(\cdot)$, so by the CMT,
$$\int_0^1 \sqrt{n}X_n(r)dr \to_d \sigma\int_0^1 W(r)dr.$$
Therefore, we obtain the limit of $n^{-3/2}\sum_{t=1}^{n} y_{t-1}$:
$$n^{-3/2}\sum_{t=1}^{n} y_{t-1} \to_d \sigma\int_0^1 W(r)dr. \qquad (12)$$

Thus, when $y_t$ is a driftless random walk, its sample mean $n^{-1}\sum_{t=1}^{n} y_t$ diverges, but $n^{-3/2}\sum_{t=1}^{n} y_t$ converges. An alternative way to find the limit distribution of $n^{-3/2}\sum_{t=1}^{n} y_t$ is as follows:

$$n^{-3/2}\sum_{t=1}^{n} y_{t-1} = n^{-3/2}[u_1 + (u_1 + u_2) + \dots + (u_1 + u_2 + \dots + u_{n-1})] = n^{-3/2}[(n-1)u_1 + (n-2)u_2 + \dots + u_{n-1}] = n^{-3/2}\sum_{t=1}^{n}(n-t)u_t = n^{-1/2}\sum_{t=1}^{n} u_t - n^{-3/2}\sum_{t=1}^{n} t u_t,$$
while from the previous lecture we know that
$$\begin{pmatrix} n^{-1/2}\sum_{t=1}^{n} u_t \\ n^{-3/2}\sum_{t=1}^{n} t u_t \end{pmatrix} \to_d N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \sigma^2\begin{pmatrix} 1 & \tfrac{1}{2} \\ \tfrac{1}{2} & \tfrac{1}{3} \end{pmatrix}\right). \qquad (13)$$

Therefore $n^{-3/2}\sum_{t=1}^{n} y_{t-1}$ is asymptotically Gaussian with mean zero and variance $\sigma^2[1 - 2(1/2) + 1/3] = \sigma^2/3$. From this expression we also have
$$n^{-3/2}\sum_{t=1}^{n} t u_t = n^{-1/2}\sum_{t=1}^{n} u_t - n^{-3/2}\sum_{t=1}^{n} y_{t-1} \to_d \sigma W(1) - \sigma\int_0^1 W(r)dr. \qquad (14)$$

Using similar methods we can compute the asymptotic distribution of the sum of squares of a random walk. Define
$$S_n(r) = n[X_n(r)]^2,$$
which can be written as
$$S_n(r) = \begin{cases} 0 & \text{for } r \in [0, 1/n) \\ y_1^2/n & \text{for } r \in [1/n, 2/n) \\ y_2^2/n & \text{for } r \in [2/n, 3/n) \\ \vdots & \\ y_n^2/n & \text{for } r = 1. \end{cases} \qquad (15)$$

Again we compute the integral,
$$\int_0^1 S_n(r)dr = y_1^2/n^2 + y_2^2/n^2 + \dots + y_{n-1}^2/n^2.$$
Since $S_n(r) \to_d \sigma^2 W(r)^2$, by the CMT,
$$n^{-2}\sum_{t=1}^{n} y_{t-1}^2 \to_d \sigma^2\int_0^1 [W(r)]^2 dr. \qquad (16)$$

If we make use of $n^{-3/2}\sum_{t=1}^{n} y_{t-1} \to_d \sigma\int_0^1 W(r)dr$ and the correspondence $r = t/n$, we also have
$$n^{-5/2}\sum_{t=1}^{n} t y_{t-1} = n^{-3/2}\sum_{t=1}^{n} (t/n) y_{t-1} \to_d \sigma\int_0^1 rW(r)dr. \qquad (17)$$
Similarly, with $r = t/n$ and using (16), we get
$$n^{-3}\sum_{t=1}^{n} t y_{t-1}^2 = n^{-2}\sum_{t=1}^{n} (t/n) y_{t-1}^2 \to_d \sigma^2\int_0^1 r[W(r)]^2 dr. \qquad (18)$$

Another useful result is
$$n^{-1}\sum_{t=1}^{n} y_{t-1}u_t \to_d \frac{1}{2}\sigma^2[W(1)^2 - 1].$$
Proof: first,
$$y_t^2 = (y_{t-1} + u_t)^2 = y_{t-1}^2 + 2y_{t-1}u_t + u_t^2,$$
so
$$n^{-1}\sum_{t=1}^{n} y_{t-1}u_t = \frac{1}{2}n^{-1}\sum_{t=1}^{n}(y_t^2 - y_{t-1}^2) - \frac{1}{2}n^{-1}\sum_{t=1}^{n} u_t^2 = \frac{1}{2}n^{-1}y_n^2 - \frac{1}{2}n^{-1}\sum_{t=1}^{n} u_t^2.$$

By (6), we have $n^{-1/2}y_n \to_d \sigma W(1)$, so by the CMT,
$$\frac{1}{2}n^{-1}y_n^2 \to_d \frac{1}{2}\sigma^2 W(1)^2.$$
By the LLN,
$$\frac{1}{2}n^{-1}\sum_{t=1}^{n} u_t^2 \to_p \frac{1}{2}\sigma^2.$$
Therefore,
$$n^{-1}\sum_{t=1}^{n} y_{t-1}u_t \to_d \frac{1}{2}\sigma^2[W(1)^2 - 1]. \qquad (19)$$
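Two of these limits are easy to verify by Monte Carlo. A minimal sketch (assuming numpy; sample size and replications are arbitrary): the mean of $n^{-1}\sum y_{t-1}u_t$ should be near 0, since $E[\frac{1}{2}\sigma^2(\chi^2(1)-1)] = 0$, and the mean of $n^{-2}\sum y_{t-1}^2$ should be near $\sigma^2/2$, since $E[\int_0^1 W(r)^2 dr] = \int_0^1 r\,dr = 1/2$.

    # Monte Carlo check of results (19) and (16) for a driftless Gaussian random walk.
    import numpy as np

    rng = np.random.default_rng(4)
    n, reps, sigma = 1_000, 10_000, 1.0
    a = np.empty(reps)   # n^{-1} sum y_{t-1} u_t
    b = np.empty(reps)   # n^{-2} sum y_{t-1}^2
    for i in range(reps):
        u = sigma * rng.standard_normal(n)
        y = np.cumsum(u)
        y_lag = np.concatenate(([0.0], y[:-1]))
        a[i] = np.sum(y_lag * u) / n
        b[i] = np.sum(y_lag ** 2) / n ** 2

    print(a.mean())   # near 0
    print(b.mean())   # near sigma^2 / 2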

3 Unit Root Tests

3.1 Unit Root Tests with i.i.d. Errors

The asymptotics of a random walk with i.i.d. shocks are summarized in the following proposition. The number in brackets shows where each result is first introduced and proved.

Proposition 1 Suppose that ξt follows a random walk without drift,

$$\xi_t = \xi_{t-1} + u_t, \qquad \xi_0 = 0, \qquad u_t \sim \text{i.i.d.}(0, \sigma^2).$$

Then

(a) $n^{-1/2}\sum_{t=1}^{n} u_t \to_d \sigma W(1)$ [7];

(b) $n^{-1}\sum_{t=1}^{n} \xi_{t-1}u_t \to_d \frac{1}{2}\sigma^2[W(1)^2 - 1]$ [19];

(c) $n^{-3/2}\sum_{t=1}^{n} t u_t \to_d \sigma W(1) - \sigma\int_0^1 W(r)dr$ [14];

(d) $n^{-3/2}\sum_{t=1}^{n} \xi_{t-1} \to_d \sigma\int_0^1 W(r)dr$ [12];

(e) $n^{-2}\sum_{t=1}^{n} \xi_{t-1}^2 \to_d \sigma^2\int_0^1 W(r)^2 dr$ [16];

(f) $n^{-5/2}\sum_{t=1}^{n} t\xi_{t-1} \to_d \sigma\int_0^1 rW(r)dr$ [17];

(g) $n^{-3}\sum_{t=1}^{n} t\xi_{t-1}^2 \to_d \sigma^2\int_0^1 rW(r)^2 dr$ [18];

(h) $n^{-(v+1)}\sum_{t=1}^{n} t^v \to 1/(v+1)$ for $v = 0, 1, 2, \dots$ [lecture 7].

Note that all of these $W(\cdot)$ refer to the same Brownian motion, so the results are correlated. If we are not interested in their correlations, we can find simpler expressions for them. For example, (a) is just $N(0, \sigma^2)$, (b) is $\frac{1}{2}\sigma^2[\chi^2(1) - 1]$, and (c) and (d) are $N(0, \sigma^2/3)$.

In general, the correspondence between the finite-sample sums and their limits is $\sum_{t=1}^{n} \to \int_0^1$, $(t/n) \to r$, $(1/n) \to dr$, $n^{-1/2}u_t \to dW$, etc. Take (h) as an example with $v = 2$. From the previous lecture we know that $n^{-3}\sum_{t=1}^{n} t^2 \to 1/3$. Using the correspondence here, we have
$$n^{-3}\sum_{t=1}^{n} t^2 = n^{-1}\sum_{t=1}^{n} (t/n)^2 \to \int_0^1 r^2 dr = 1/3.$$
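This Riemann-sum correspondence is immediate to check numerically. A minimal sketch (assuming numpy):

    # n^{-3} sum_{t=1}^n t^2 = n^{-1} sum (t/n)^2 approaches int_0^1 r^2 dr = 1/3.
    import numpy as np

    for n in (10, 100, 10_000):
        t = np.arange(1, n + 1)
        print(n, np.sum(t ** 2) / n ** 3)   # 0.385, 0.33835, ... -> 1/3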

3.1.1 Case 1

Suppose that the data generating process (DGP) is a random walk, and we estimate the parameter ρ by OLS in the regression
$$y_t = \rho y_{t-1} + u_t, \qquad u_t \sim \text{i.i.d.}(0, \sigma^2), \qquad (20)$$
where $\rho = 1$, and we are interested in the asymptotic distribution of the OLS estimate $\hat{\rho}_n$:
$$\hat{\rho}_n = \frac{\sum_{t=1}^{n} y_{t-1}y_t}{\sum_{t=1}^{n} y_{t-1}^2} = \frac{\sum_{t=1}^{n} y_{t-1}(y_{t-1} + u_t)}{\sum_{t=1}^{n} y_{t-1}^2} = 1 + \frac{\sum_{t=1}^{n} y_{t-1}u_t}{\sum_{t=1}^{n} y_{t-1}^2}.$$
Then
$$n(\hat{\rho}_n - 1) = \frac{n^{-1}\sum_{t=1}^{n} y_{t-1}u_t}{n^{-2}\sum_{t=1}^{n} y_{t-1}^2}.$$
By (19) (result b), (16) (result e), and the CMT, we have

$$n(\hat{\rho}_n - 1) \to_d \frac{W(1)^2 - 1}{2\int_0^1 W(r)^2 dr}. \qquad (21)$$

First, we note that $(\hat{\rho}_n - 1)$ converges at order $n$, instead of $n^{1/2}$ as in the case $|\rho| < 1$. Therefore, when the true coefficient is unity, $\hat{\rho}_n$ is superconsistent. Second, since $W(1) \sim N(0, 1)$, $W(1)^2 \sim \chi^2(1)$. The probability that a $\chi^2(1)$ variable is less than one is 0.68, so with probability 0.68 $n(\hat{\rho}_n - 1)$ will be negative, which implies that its limit distribution is skewed to the left. Recall that in the AR(1) regression with $|\rho| < 1$, the estimate $\hat{\rho}_n$ is downward biased; however, its limit distribution $\sqrt{n}(\hat{\rho}_n - \rho)$ is still symmetric around zero. When the true value of ρ is unity, even the limit distribution of $n(\hat{\rho}_n - 1)$ is asymmetric, with negative values twice as likely as positive values. In practice, critical values for the random variable in (21) are found by computing the exact finite sample distribution of $n(\hat{\rho}_n - 1)$ assuming $u_t$ is Gaussian; the critical values can be tabulated by Monte Carlo or by numerical approximation.

There are two commonly used approaches to test the hypothesis $\rho_0 = 1$: the Dickey-Fuller ρ-test and the Dickey-Fuller t-test. The DF ρ-test computes the statistic $n(\hat{\rho}_n - 1)$ and compares it with critical values from the distribution in (21). The advantage of this approach is that we do not need to compute a standard error. Alternatively, we can use the DF t-test, which is based on the usual t statistic,
$$t_n = \frac{\hat{\rho}_n - 1}{\hat{\sigma}_{\hat{\rho}}}, \qquad (22)$$
where $\hat{\sigma}_{\hat{\rho}}$ is the standard error of the OLS coefficient estimate,
$$\hat{\sigma}_{\hat{\rho}}^2 = \frac{s_n^2}{\sum_{t=1}^{n} y_{t-1}^2}, \qquad (23)$$
and
$$s_n^2 = \frac{1}{n}\sum_{t=1}^{n}(y_t - \hat{\rho}_n y_{t-1})^2.$$

Plugging (23) into (22), we have
$$t_n = \frac{n^{-1}\sum_{t=1}^{n} y_{t-1}u_t}{\left(n^{-2}\sum_{t=1}^{n} y_{t-1}^2\right)^{1/2}(s_n^2)^{1/2}}.$$

If $\hat{\rho} \to_p \rho = 1$, which is true for the OLS estimator in the present problem, then $s_n^2 \to_p \sigma^2$ by the LLN. By (19) and (16), we have the limit of $t_n$,
$$t_n \to_d \frac{\frac{1}{2}\sigma^2[W(1)^2 - 1]}{\left[\sigma^2\int_0^1 W(r)^2 dr\right]^{1/2}[\sigma^2]^{1/2}} = \frac{W(1)^2 - 1}{2\left[\int_0^1 W(r)^2 dr\right]^{1/2}}. \qquad (24)$$

For the same reason as in (21), this t-statistic is asymmetric and skewed to the left.
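The tabulation by Monte Carlo mentioned above is straightforward. Below is a minimal sketch (assuming numpy; the sample size, seed, and number of replications are arbitrary choices), which simulates the case-1 DF ρ and t statistics under the null ρ = 1:

    # Monte Carlo tabulation of the case-1 Dickey-Fuller distributions (21) and (24).
    import numpy as np

    rng = np.random.default_rng(5)
    n, reps = 500, 20_000
    rho_stat = np.empty(reps)
    t_stat = np.empty(reps)
    for i in range(reps):
        u = rng.standard_normal(n)
        y = np.cumsum(u)
        y_lag = np.concatenate(([0.0], y[:-1]))
        rho_hat = np.sum(y_lag * y) / np.sum(y_lag ** 2)
        resid = y - rho_hat * y_lag
        se = np.sqrt(np.mean(resid ** 2) / np.sum(y_lag ** 2))
        rho_stat[i] = n * (rho_hat - 1.0)
        t_stat[i] = (rho_hat - 1.0) / se

    print(np.quantile(rho_stat, [0.01, 0.05, 0.10]))  # roughly -13.7, -8.1, -5.7
    print(np.quantile(t_stat, [0.01, 0.05, 0.10]))    # roughly -2.58, -1.95, -1.62

Both lower-tail quantiles are far below the usual normal critical values, reflecting the left skewness discussed above.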

3.1.2 Case 2

The DGP is still a random walk as in case 1 (20),
$$y_t = y_{t-1} + u_t, \qquad u_t \sim \text{i.i.d.}(0, \sigma^2),$$
but we include a constant term in the regression
$$y_t = \hat{\alpha} + \hat{\rho}y_{t-1} + \hat{u}_t.$$

The OLS estimates of the coefficients are
$$\begin{pmatrix} \hat{\alpha}_n \\ \hat{\rho}_n \end{pmatrix} = \begin{pmatrix} n & \sum y_{t-1} \\ \sum y_{t-1} & \sum y_{t-1}^2 \end{pmatrix}^{-1}\begin{pmatrix} \sum y_t \\ \sum y_{t-1}y_t \end{pmatrix}.$$

Under the null hypothesis $H_0: \alpha = 0, \rho = 1$, the deviations of the estimate vector from the hypothesized values are
$$\begin{pmatrix} \hat{\alpha}_n \\ \hat{\rho}_n - 1 \end{pmatrix} = \begin{pmatrix} n & \sum y_{t-1} \\ \sum y_{t-1} & \sum y_{t-1}^2 \end{pmatrix}^{-1}\begin{pmatrix} \sum u_t \\ \sum y_{t-1}u_t \end{pmatrix}. \qquad (25)$$
Recall that in a regression with a constant and a time trend, the estimates have different convergence rates. The situation is similar in this case. The orders in probability of the terms are
$$\begin{pmatrix} \hat{\alpha}_n \\ \hat{\rho}_n - 1 \end{pmatrix} = \begin{pmatrix} O_p(n) & O_p(n^{3/2}) \\ O_p(n^{3/2}) & O_p(n^2) \end{pmatrix}^{-1}\begin{pmatrix} O_p(n^{1/2}) \\ O_p(n) \end{pmatrix}. \qquad (26)$$

As we did before, we need a rescaling matrix
$$H_n = \begin{pmatrix} n^{1/2} & 0 \\ 0 & n \end{pmatrix}.$$

Premultiplying (25) by $H_n$, we have
$$\begin{pmatrix} n^{1/2}\hat{\alpha}_n \\ n(\hat{\rho}_n - 1) \end{pmatrix} = \begin{pmatrix} 1 & n^{-3/2}\sum y_{t-1} \\ n^{-3/2}\sum y_{t-1} & n^{-2}\sum y_{t-1}^2 \end{pmatrix}^{-1}\begin{pmatrix} n^{-1/2}\sum u_t \\ n^{-1}\sum y_{t-1}u_t \end{pmatrix}. \qquad (27)$$

By results (d) and (e) we have
$$\begin{pmatrix} 1 & n^{-3/2}\sum y_{t-1} \\ n^{-3/2}\sum y_{t-1} & n^{-2}\sum y_{t-1}^2 \end{pmatrix} \to_d \begin{pmatrix} 1 & \sigma\int W(r)dr \\ \sigma\int W(r)dr & \sigma^2\int W(r)^2dr \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & \sigma \end{pmatrix}\begin{pmatrix} 1 & \int W(r)dr \\ \int W(r)dr & \int W(r)^2dr \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 0 & \sigma \end{pmatrix},$$
and by results (a) and (b) we have
$$\begin{pmatrix} n^{-1/2}\sum u_t \\ n^{-1}\sum y_{t-1}u_t \end{pmatrix} \to_d \begin{pmatrix} \sigma W(1) \\ \frac{1}{2}\sigma^2[W(1)^2 - 1] \end{pmatrix} = \sigma\begin{pmatrix} 1 & 0 \\ 0 & \sigma \end{pmatrix}\begin{pmatrix} W(1) \\ \frac{1}{2}[W(1)^2 - 1] \end{pmatrix}.$$
Therefore,
$$\begin{pmatrix} n^{1/2}\hat{\alpha}_n \\ n(\hat{\rho}_n - 1) \end{pmatrix} \to_d \begin{pmatrix} \sigma & 0 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} 1 & \int W(r)dr \\ \int W(r)dr & \int W(r)^2dr \end{pmatrix}^{-1}\begin{pmatrix} W(1) \\ \frac{1}{2}[W(1)^2 - 1] \end{pmatrix} = \begin{pmatrix} \sigma & 0 \\ 0 & 1 \end{pmatrix}\Delta^{-1}\begin{pmatrix} \int W(r)^2dr & -\int W(r)dr \\ -\int W(r)dr & 1 \end{pmatrix}\begin{pmatrix} W(1) \\ \frac{1}{2}[W(1)^2 - 1] \end{pmatrix} = \begin{pmatrix} \sigma & 0 \\ 0 & 1 \end{pmatrix}\Delta^{-1}\begin{pmatrix} W(1)\int W(r)^2dr - \frac{1}{2}[W(1)^2 - 1]\int W(r)dr \\ \frac{1}{2}[W(1)^2 - 1] - W(1)\int W(r)dr \end{pmatrix},$$
where
$$\Delta \equiv \int W(r)^2dr - \left(\int W(r)dr\right)^2.$$

So the DF ρ statistic for testing the null hypothesis ρ = 1 has the following limit distribution:
$$n(\hat{\rho}_n - 1) \to_d \frac{\frac{1}{2}[W(1)^2 - 1] - W(1)\int W(r)dr}{\int W(r)^2dr - \left(\int W(r)dr\right)^2}. \qquad (28)$$
As in case 1, we can also use a t test,
$$t_n = \frac{\hat{\rho}_n - 1}{\hat{\sigma}_{\hat{\rho}_n}},$$
which converges to
$$\frac{\frac{1}{2}[W(1)^2 - 1] - W(1)\int W(r)dr}{\left\{\int W(r)^2dr - \left(\int W(r)dr\right)^2\right\}^{1/2}}.$$
The details can be found on pages 493-494 of Hamilton.

3.1.3 Case 3

Now suppose that the true process is a random walk with drift:
$$y_t = \alpha + y_{t-1} + u_t, \qquad u_t \sim \text{i.i.d.}(0, \sigma^2).$$

Without loss of generality, we can set $y_0 = 0$, and we again estimate a linear regression with a constant,
$$y_t = \hat{\alpha} + \hat{\rho}y_{t-1} + \hat{u}_t.$$

Define $\xi_t \equiv u_1 + u_2 + \dots + u_t$, so that $y_t = \alpha t + \xi_t$ and
$$\sum_{t=1}^{n} y_{t-1} = \alpha\sum_{t=1}^{n}(t-1) + \sum_{t=1}^{n}\xi_{t-1}.$$
Notice that these two terms have different divergence rates. We know that $\sum_{t=1}^{n} t = n(n+1)/2 = O(n^2)$, while $\sum_{t=1}^{n}\xi_{t-1} = O_p(n^{3/2})$, since $n^{-3/2}\sum_{t=1}^{n}\xi_{t-1}$ converges to a normal distribution with finite variance (result (d)). Therefore, normalizing by the fastest divergence rate,
$$n^{-2}\sum_{t=1}^{n} y_{t-1} = \alpha n^{-2}\sum_{t=1}^{n}(t-1) + n^{-1/2}\cdot n^{-3/2}\sum_{t=1}^{n}\xi_{t-1} \to_p \alpha/2. \qquad (29)$$

Similarly, $\sum_{t=1}^{n} y_{t-1}^2$ and $\sum_{t=1}^{n} y_{t-1}u_t$ also contain terms with different divergence rates:
$$\sum_{t=1}^{n} y_{t-1}^2 = \sum_{t=1}^{n}[\alpha(t-1) + \xi_{t-1}]^2 = \alpha^2\sum_{t=1}^{n}(t-1)^2 + \sum_{t=1}^{n}\xi_{t-1}^2 + 2\alpha\sum_{t=1}^{n}(t-1)\xi_{t-1},$$
where $\sum_{t=1}^{n}(t-1)^2 = O(n^3)$ (result (h)), $\sum_{t=1}^{n}\xi_{t-1}^2 = O_p(n^2)$ (result (e)), and $\sum_{t=1}^{n}(t-1)\xi_{t-1} = O_p(n^{5/2})$ (result (f)). Normalizing the sequence by the inverse of the fastest divergence rate, $n^3$,
$$n^{-3}\sum_{t=1}^{n} y_{t-1}^2 \to_p \alpha^2/3. \qquad (30)$$
Finally,
$$\sum_{t=1}^{n} y_{t-1}u_t = \sum_{t=1}^{n}[\alpha(t-1) + \xi_{t-1}]u_t = \alpha\sum_{t=1}^{n}(t-1)u_t + \sum_{t=1}^{n}\xi_{t-1}u_t,$$
where $\sum_{t=1}^{n}(t-1)u_t = O_p(n^{3/2})$ (result (c)) and $\sum_{t=1}^{n}\xi_{t-1}u_t = O_p(n)$ (result (b)). Again normalizing by the fastest divergence rate,
$$n^{-3/2}\sum_{t=1}^{n} y_{t-1}u_t = \alpha n^{-3/2}\sum_{t=1}^{n}(t-1)u_t + o_p(1). \qquad (31)$$
Corresponding to the different rates, to derive a nondegenerate limit distribution for the estimates we again need a scaling matrix. In this case, we need

$$H_n = \begin{pmatrix} n^{1/2} & 0 \\ 0 & n^{3/2} \end{pmatrix}.$$

Premultiplying the OLS estimator vector (in deviations from the true values) by $H_n$, we get
$$\begin{pmatrix} n^{1/2}(\hat{\alpha}_n - \alpha) \\ n^{3/2}(\hat{\rho}_n - 1) \end{pmatrix} = \begin{pmatrix} 1 & n^{-2}\sum y_{t-1} \\ n^{-2}\sum y_{t-1} & n^{-3}\sum y_{t-1}^2 \end{pmatrix}^{-1}\begin{pmatrix} n^{-1/2}\sum u_t \\ n^{-3/2}\sum y_{t-1}u_t \end{pmatrix}.$$

From (29) and (30), the first term satisfies
$$\begin{pmatrix} 1 & n^{-2}\sum y_{t-1} \\ n^{-2}\sum y_{t-1} & n^{-3}\sum y_{t-1}^2 \end{pmatrix} \to_p \begin{pmatrix} 1 & \alpha/2 \\ \alpha/2 & \alpha^2/3 \end{pmatrix} \equiv Q.$$

From (13) and (31), we have
$$\begin{pmatrix} n^{-1/2}\sum u_t \\ n^{-3/2}\sum y_{t-1}u_t \end{pmatrix} = \begin{pmatrix} n^{-1/2}\sum u_t \\ \alpha n^{-3/2}\sum_{t=1}^{n}(t-1)u_t \end{pmatrix} + o_p(1) \to_d N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \sigma^2\begin{pmatrix} 1 & \alpha/2 \\ \alpha/2 & \alpha^2/3 \end{pmatrix}\right) = N(0, \sigma^2 Q).$$

Therefore we have the following limit distribution for the OLS estimates:
$$\begin{pmatrix} n^{1/2}(\hat{\alpha}_n - \alpha) \\ n^{3/2}(\hat{\rho}_n - 1) \end{pmatrix} \to_d N(0, Q^{-1}\cdot\sigma^2 Q\cdot Q^{-1}) = N(0, \sigma^2 Q^{-1}). \qquad (32)$$

So in case 3, both estimated coefficients are asymptotically Gaussian, and the asymptotic distribution is the same as that of $\hat{\alpha}$ and $\hat{\delta}$ in the regression with a deterministic trend. This is because here $y_t$ has two components, a deterministic time trend and a random walk, and the time trend dominates the random walk.

3.1.4 Case 4

Finally, we consider the case where the true process is a random walk with or without drift,
$$y_t = \alpha + y_{t-1} + u_t, \qquad u_t \sim \text{i.i.d.}(0, \sigma^2),$$
where α may or may not be zero, and we run the regression

yt = α + ρyt−1 + δt + ut. (33)

Without loss of generality, we assume that $y_0 = 0$. Note that when $\alpha \neq 0$, $y_t$ also contains a time trend, so there will be an asymptotic collinearity problem between $y_{t-1}$ and $t$. Hence we rewrite the regression as

$$y_t = (1-\rho)\alpha + \rho[y_{t-1} - \alpha(t-1)] + (\delta + \rho\alpha)t + u_t = \alpha^* + \rho^*\xi_{t-1} + \delta^* t + u_t,$$

where $\alpha^* = (1-\rho)\alpha$, $\rho^* = \rho$, $\delta^* = \delta + \rho\alpha$, and $\xi_t = y_t - \alpha t$. With this transformation, under the null hypothesis $\rho = 1$, $\delta = 0$, $\xi_t$ is a random walk:

ξt = u1 + u2 + ... + ut.

Therefore, with this transformation, we regress $y_t$ on a constant, a driftless random walk, and a deterministic time trend. The OLS estimates in this regression are
$$\begin{pmatrix} \hat{\alpha}_n^* \\ \hat{\rho}_n^* \\ \hat{\delta}_n^* \end{pmatrix} = \begin{pmatrix} n & \sum\xi_{t-1} & \sum t \\ \sum\xi_{t-1} & \sum\xi_{t-1}^2 & \sum\xi_{t-1}t \\ \sum t & \sum\xi_{t-1}t & \sum t^2 \end{pmatrix}^{-1}\begin{pmatrix} \sum y_t \\ \sum\xi_{t-1}y_t \\ \sum t y_t \end{pmatrix}.$$
The hypothesis is that $\alpha = c$ (any constant), $\rho = 1$, and $\delta = 0$. Correspondingly, in the transformed system, $\alpha^* = 0$, $\rho^* = 1$, and $\delta^* = c$. The deviations of the estimates from these true values are given by
$$\begin{pmatrix} \hat{\alpha}_n^* \\ \hat{\rho}_n^* - 1 \\ \hat{\delta}_n^* - c \end{pmatrix} = \begin{pmatrix} n & \sum\xi_{t-1} & \sum t \\ \sum\xi_{t-1} & \sum\xi_{t-1}^2 & \sum\xi_{t-1}t \\ \sum t & \sum\xi_{t-1}t & \sum t^2 \end{pmatrix}^{-1}\begin{pmatrix} \sum u_t \\ \sum\xi_{t-1}u_t \\ \sum t u_t \end{pmatrix}. \qquad (34)$$
Note that these three estimates have different convergence rates (we are already familiar with them!): $\hat{\alpha}_n^*$ is $n^{1/2}$-convergent, $\hat{\rho}_n^*$ is $n$-convergent, and $\hat{\delta}_n^*$ is $n^{3/2}$-convergent. Therefore we need a rescaling matrix
$$H_n = \begin{pmatrix} n^{1/2} & 0 & 0 \\ 0 & n & 0 \\ 0 & 0 & n^{3/2} \end{pmatrix}.$$

Premultiplying (34) by $H_n$, we have
$$\begin{pmatrix} n^{1/2}\hat{\alpha}_n^* \\ n(\hat{\rho}_n^* - 1) \\ n^{3/2}(\hat{\delta}_n^* - c) \end{pmatrix} = \begin{pmatrix} 1 & n^{-3/2}\sum\xi_{t-1} & n^{-2}\sum t \\ n^{-3/2}\sum\xi_{t-1} & n^{-2}\sum\xi_{t-1}^2 & n^{-5/2}\sum\xi_{t-1}t \\ n^{-2}\sum t & n^{-5/2}\sum\xi_{t-1}t & n^{-3}\sum t^2 \end{pmatrix}^{-1}\begin{pmatrix} n^{-1/2}\sum u_t \\ n^{-1}\sum\xi_{t-1}u_t \\ n^{-3/2}\sum t u_t \end{pmatrix}.$$
The limit of each term in the above equation can be found in the proposition. Plugging them in, we get
$$\begin{pmatrix} n^{1/2}\hat{\alpha}_n^* \\ n(\hat{\rho}_n^* - 1) \\ n^{3/2}(\hat{\delta}_n^* - c) \end{pmatrix} \to_d \begin{pmatrix} \sigma & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & \sigma \end{pmatrix}\begin{pmatrix} 1 & \int W(r)dr & \frac{1}{2} \\ \int W(r)dr & \int W(r)^2dr & \int rW(r)dr \\ \frac{1}{2} & \int rW(r)dr & \frac{1}{3} \end{pmatrix}^{-1}\begin{pmatrix} W(1) \\ \frac{1}{2}[W(1)^2 - 1] \\ W(1) - \int W(r)dr \end{pmatrix}. \qquad (35)$$
The DF unit root ρ test in this case is given by the middle row of (35). Note that it does not depend on either σ or α. The DF t test can be derived in a similar way (see page 500 in Hamilton).

3.2 Unit Root Tests with Serially Correlated Errors

3.2.1 BN Decomposition and Phillips-Solo Device

Beveridge and Nelson (1981) proposed that any time series that displays some degree of nonstationarity can be decomposed into two additive parts: a stationary (also called cyclical or transitory) part and a nonstationary (also called long-run or permanent) part. Let
$$u_t = C(L)\epsilon_t = \sum_{j=0}^{\infty} c_j\epsilon_{t-j}, \qquad (36)$$
where (a) $\epsilon_t \sim WN(0, \sigma^2)$ and (b) $\sum_{j=0}^{\infty} j\cdot|c_j| < \infty$. The BN decomposition tells us that we can rewrite the lag polynomial as
$$C(L) = C(1) + (L - 1)\tilde{C}(L),$$
where $C(1) = \sum_{j=0}^{\infty} c_j$, $\tilde{C}(L) = \sum_{j=0}^{\infty} \tilde{c}_j L^j$, and $\tilde{c}_j = \sum_{k=j+1}^{\infty} c_k$. Since we assume that $\sum_{j=0}^{\infty} j\cdot|c_j| < \infty$, we have $\sum_{j=0}^{\infty} |\tilde{c}_j| < \infty$. Phillips and Solo (1992) verified that with conditions (a), (b), and $C(1) \neq 0$, $u_t$ can be represented in the form

$$u_t = \big(C(1) + (L - 1)\tilde{C}(L)\big)\epsilon_t = C(1)\epsilon_t - \tilde{C}(L)(\epsilon_t - \epsilon_{t-1}).$$

Then, for a random walk process with innovations $u_t$, we can write
$$y_t = y_{t-1} + u_t = y_0 + \sum_{j=1}^{t} u_j = y_0 + C(1)\sum_{j=1}^{t}\epsilon_j - \tilde{C}(L)\sum_{j=1}^{t}(\epsilon_j - \epsilon_{j-1}) = y_0 + C(1)\sum_{j=1}^{t}\epsilon_j - \tilde{C}(L)\epsilon_t + \tilde{C}(L)\epsilon_0 = y_0 + \eta_0 - \eta_t + C(1)\sum_{j=1}^{t}\epsilon_j,$$
where $\eta_0 = \tilde{C}(L)\epsilon_0$ is the initial condition, $\eta_t = \tilde{C}(L)\epsilon_t = \sum_{j=0}^{\infty}\tilde{c}_j\epsilon_{t-j}$ is a stationary process (note that $\tilde{c}_j$ is absolutely summable), and $C(1)\sum_{j=1}^{t}\epsilon_j$ is a nonstationary random walk process. With $y_0 = 0$, rewrite $y_t$ as
$$y_t = \sum_{s=1}^{t} u_s = C(1)\sum_{s=1}^{t}\epsilon_s + \eta_0 - \eta_t.$$
Note that $\sum_{s=1}^{t}\epsilon_s$ is a random walk with serially uncorrelated errors, and we have $n^{-1/2}\sum_{s=1}^{[nr]}\epsilon_s \to_d \sigma W(r)$, while $\eta_0 - \eta_t$ is bounded in probability. Hence we would expect that $n^{-1/2}y_{[nr]} = C(1)n^{-1/2}\sum_{s=1}^{[nr]}\epsilon_s + o_p(1) \to_d \lambda W(r)$. The following proposition summarizes some important limit theory for unit root processes with serially correlated errors.

Proposition 2 Let $u_t = C(L)\epsilon_t = \sum_{j=0}^{\infty} c_j\epsilon_{t-j}$, where $\sum_{j=0}^{\infty} j\cdot|c_j| < \infty$ and $\epsilon_t \sim \text{i.i.d.}(0, \sigma^2)$ with finite fourth moment $\mu_4$. Define
$$\gamma_h = E(u_t u_{t-h}) = \sigma^2\sum_{j=0}^{\infty} c_j c_{j+h}, \qquad \lambda = \sigma\sum_{j=0}^{\infty} c_j = \sigma C(1),$$
$$\xi_t = u_1 + u_2 + \dots + u_t, \qquad \xi_0 = 0.$$
In this notation, $\lambda^2$ is known as the long-run variance of $u_t$, which is in general different from the variance of $u_t$, which is $\gamma_0$.

(a) $n^{-1/2}\sum_{t=1}^{n} u_t \to_d \lambda W(1)$;

(b) $n^{-1/2}\sum_{t=1}^{n} u_{t-j}\epsilon_t \to_d N(0, \sigma^2\gamma_0)$ for $j = 1, 2, \dots$;

(c) $n^{-1}\sum_{t=1}^{n} u_t u_{t-j} \to_p \gamma_j$ for $j = 0, 1, 2, \dots$;

(d) $n^{-1}\sum_{t=1}^{n} \xi_{t-1}\epsilon_t \to_d \frac{1}{2}\sigma\lambda[W(1)^2 - 1]$;

(e) $n^{-1}\sum_{t=1}^{n} \xi_{t-1}u_{t-h} \to_d \begin{cases} \frac{1}{2}[\lambda^2 W(1)^2 - \gamma_0] & \text{for } h = 0 \\ \frac{1}{2}[\lambda^2 W(1)^2 - \gamma_0] + \sum_{j=0}^{h-1}\gamma_j & \text{for } h = 1, 2, \dots; \end{cases}$

(f) $n^{-3/2}\sum_{t=1}^{n} \xi_{t-1} \to_d \lambda\int_0^1 W(r)dr$;

(g) $n^{-3/2}\sum_{t=1}^{n} t u_{t-j} \to_d \lambda\left[W(1) - \int_0^1 W(r)dr\right]$;

(h) $n^{-2}\sum_{t=1}^{n} \xi_{t-1}^2 \to_d \lambda^2\int_0^1 W(r)^2 dr$;

(i) $n^{-5/2}\sum_{t=1}^{n} t\xi_{t-1} \to_d \lambda\int_0^1 rW(r)dr$;

(j) $n^{-3}\sum_{t=1}^{n} t\xi_{t-1}^2 \to_d \lambda^2\int_0^1 rW(r)^2 dr$;

(k) $n^{-(v+1)}\sum_{t=1}^{n} t^v \to 1/(v+1)$ for $v = 0, 1, 2, \dots$.

The proofs of all these results can be found in the appendix of Chapter 17 in Hamilton. In class, we will discuss (a), (e), and (f) as examples. First, to prove (a), write
$$n^{-1/2}\sum_{t=1}^{[nr]} u_t = n^{-1/2}C(1)\sum_{t=1}^{[nr]}\epsilon_t + n^{-1/2}(\eta_0 - \eta_{[nr]}).$$
By (6) and the CMT we have that
$$n^{-1/2}C(1)\sum_{t=1}^{[nr]}\epsilon_t \to_d \sigma C(1)W(r).$$

Since $\eta_0$ is the initial condition and $\eta_{[nr]}$ is a zero-mean stationary process, both are bounded in probability, so
$$n^{-1/2}(\eta_0 - \eta_{[nr]}) \to_p 0.$$

Therefore, we obtain the limit
$$n^{-1/2}\sum_{t=1}^{[nr]} u_t \to_d \sigma C(1)W(r),$$
and when $r = 1$,
$$n^{-1/2}\sum_{t=1}^{n} u_t \to_d \sigma C(1)W(1) = \lambda W(1). \qquad (37)$$

Second, to prove
$$n^{-1}\sum_{t=1}^{n}\xi_{t-1}u_{t-h} \to_d \begin{cases} \frac{1}{2}[\lambda^2 W(1)^2 - \gamma_0] & \text{for } h = 0 \\ \frac{1}{2}[\lambda^2 W(1)^2 - \gamma_0] + \sum_{j=0}^{h-1}\gamma_j & \text{for } h > 0, \end{cases} \qquad (38)$$
first let $h = 0$, and write
$$n^{-1}\sum_{t=1}^{n}\xi_{t-1}u_t = \frac{1}{2}n^{-1}\xi_n^2 - \frac{1}{2}n^{-1}\sum_{t=1}^{n} u_t^2.$$

We know that $n^{-1/2}\xi_n \to_d \lambda W(1)$, so by the CMT,
$$\frac{1}{2}n^{-1}\xi_n^2 \to_d \frac{1}{2}\lambda^2 W(1)^2.$$

By result (c), $n^{-1}\sum_{t=1}^{n} u_t^2 \to_p \gamma_0$. Therefore,
$$n^{-1}\sum_{t=1}^{n}\xi_{t-1}u_t \to_d \frac{1}{2}[\lambda^2 W(1)^2 - \gamma_0].$$
Next, let $h = 1$. Note that
$$\xi_{t-1}u_{t-1} = (\xi_{t-2} + u_{t-1})u_{t-1} = \xi_{t-2}u_{t-1} + u_{t-1}^2.$$

We have already obtained the limit of $n^{-1}\sum_{t=1}^{n}\xi_{t-1}u_t$, and the same limit applies to $n^{-1}\sum_{t=1}^{n}\xi_{t-2}u_{t-1}$; therefore,
$$n^{-1}\sum_{t=1}^{n}\xi_{t-1}u_{t-1} \to_d \frac{1}{2}[\lambda^2 W(1)^2 - \gamma_0] + \gamma_0.$$
The cases $h = 2, 3, \dots$ are similar.

Third, consider result (f),
$$n^{-3/2}\sum_{t=1}^{n}\xi_{t-1} \to_d \lambda\int_0^1 W(r)dr. \qquad (39)$$
Define
$$S_n(r) = \begin{cases} 0 & \text{for } r \in [0, 1/n) \\ n^{-1/2}\xi_t & \text{for } r \in [t/n, (t+1)/n) \\ n^{-1/2}\xi_n & \text{for } r = 1; \end{cases} \qquad (40)$$
then we have $S_n(r) \to_d \lambda W(r)$. By the CMT,
$$\int_0^1 S_n(r)dr \to_d \lambda\int_0^1 W(r)dr,$$
and
$$\int_0^1 S_n(r)dr = n^{-3/2}\sum_{t=1}^{n}\xi_{t-1}.$$
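Result (a) highlights the role of the long-run variance λ² rather than the variance of $u_t$. A minimal Monte Carlo sketch (assuming numpy; the AR(1) error and the sample sizes are arbitrary choices): with $u_t = 0.5u_{t-1} + \epsilon_t$ and $\sigma^2 = 1$, we have $C(1) = 1/(1-0.5) = 2$, so $\lambda^2 = 4$, while $\mathrm{Var}(u_t) = \gamma_0 = 4/3$.

    # Check Proposition 2(a): n^{-1/2} sum u_t has variance close to lambda^2 = 4,
    # not to the variance of u_t (= 4/3), when u_t is an AR(1) with coefficient 0.5.
    import numpy as np

    rng = np.random.default_rng(6)
    n, reps = 2_000, 10_000
    e = rng.standard_normal((reps, n))
    u = np.empty((reps, n))
    u[:, 0] = e[:, 0]
    for t in range(1, n):
        u[:, t] = 0.5 * u[:, t - 1] + e[:, t]

    stats = u.sum(axis=1) / np.sqrt(n)
    print(stats.var())    # close to 4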

3.2.2 Phillips-Perron Tests for Unit Roots

We will discuss case 2 only; the other cases can be derived similarly. Let the true DGP be a random walk with serially correlated errors,
$$y_t = \alpha + \rho y_{t-1} + u_t, \qquad u_t = C(L)\epsilon_t,$$
where $C(L)$ and $\epsilon_t$ satisfy the conditions in Proposition 2. When $|\rho| < 1$, the OLS estimate of ρ is not consistent when the errors are serially correlated. However, when $\rho = 1$, the OLS estimate satisfies $\hat{\rho}_n \to_p 1$. Therefore, Phillips and Perron (1988) proposed estimating the regression by OLS and then correcting the estimates for serial correlation. Under the null hypothesis $H_0: \alpha = 0, \rho = 1$, the deviations of the OLS estimate vector from the hypothesized values satisfy
$$\begin{pmatrix} n^{1/2}\hat{\alpha}_n \\ n(\hat{\rho}_n - 1) \end{pmatrix} = \begin{pmatrix} 1 & n^{-3/2}\sum y_{t-1} \\ n^{-3/2}\sum y_{t-1} & n^{-2}\sum y_{t-1}^2 \end{pmatrix}^{-1}\begin{pmatrix} n^{-1/2}\sum u_t \\ n^{-1}\sum y_{t-1}u_t \end{pmatrix}. \qquad (41)$$
Using results (f) and (h) in Proposition 2,

$$\begin{pmatrix} 1 & n^{-3/2}\sum y_{t-1} \\ n^{-3/2}\sum y_{t-1} & n^{-2}\sum y_{t-1}^2 \end{pmatrix}^{-1} \to_d \begin{pmatrix} 1 & \lambda\int W(r)dr \\ \lambda\int W(r)dr & \lambda^2\int W(r)^2dr \end{pmatrix}^{-1} = \begin{pmatrix} 1 & 0 \\ 0 & \lambda \end{pmatrix}^{-1}\begin{pmatrix} 1 & \int W(r)dr \\ \int W(r)dr & \int W(r)^2dr \end{pmatrix}^{-1}\begin{pmatrix} 1 & 0 \\ 0 & \lambda \end{pmatrix}^{-1},$$
and using results (a) and (e) in Proposition 2,
$$\begin{pmatrix} n^{-1/2}\sum u_t \\ n^{-1}\sum y_{t-1}u_t \end{pmatrix} \to_d \begin{pmatrix} \lambda W(1) \\ \frac{1}{2}[\lambda^2 W(1)^2 - \gamma_0] \end{pmatrix} = \begin{pmatrix} \lambda W(1) \\ \frac{1}{2}\lambda^2[W(1)^2 - 1] \end{pmatrix} + \begin{pmatrix} 0 \\ \frac{1}{2}(\lambda^2 - \gamma_0) \end{pmatrix} = \lambda\begin{pmatrix} 1 & 0 \\ 0 & \lambda \end{pmatrix}\begin{pmatrix} W(1) \\ \frac{1}{2}[W(1)^2 - 1] \end{pmatrix} + \begin{pmatrix} 0 \\ \frac{1}{2}(\lambda^2 - \gamma_0) \end{pmatrix}.$$
Substituting these two results into (41),
$$\begin{pmatrix} n^{1/2}\hat{\alpha}_n \\ n(\hat{\rho}_n - 1) \end{pmatrix} \to_d \begin{pmatrix} \lambda & 0 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} 1 & \int W(r)dr \\ \int W(r)dr & \int W(r)^2dr \end{pmatrix}^{-1}\begin{pmatrix} W(1) \\ \frac{1}{2}[W(1)^2 - 1] \end{pmatrix} + \begin{pmatrix} 1 & 0 \\ 0 & \lambda^{-1} \end{pmatrix}\begin{pmatrix} 1 & \int W(r)dr \\ \int W(r)dr & \int W(r)^2dr \end{pmatrix}^{-1}\begin{pmatrix} 0 \\ \frac{1}{2}(\lambda^2 - \gamma_0)/\lambda \end{pmatrix}.$$
To test ρ = 1, take the second row:
$$n(\hat{\rho}_n - 1) \to_d \begin{pmatrix} 0 & 1 \end{pmatrix}\begin{pmatrix} 1 & \int W(r)dr \\ \int W(r)dr & \int W(r)^2dr \end{pmatrix}^{-1}\begin{pmatrix} W(1) \\ \frac{1}{2}[W(1)^2 - 1] \end{pmatrix} + \frac{\lambda^2 - \gamma_0}{2\lambda^2}\begin{pmatrix} 0 & 1 \end{pmatrix}\begin{pmatrix} 1 & \int W(r)dr \\ \int W(r)dr & \int W(r)^2dr \end{pmatrix}^{-1}\begin{pmatrix} 0 \\ 1 \end{pmatrix} = \frac{\frac{1}{2}[W(1)^2 - 1] - W(1)\int W(r)dr}{\int W(r)^2dr - \left(\int W(r)dr\right)^2} + \frac{\frac{1}{2}(\lambda^2 - \gamma_0)}{\lambda^2\left[\int W(r)^2dr - \left(\int W(r)dr\right)^2\right]}.$$

The first term describes the asymptotic distribution of $n(\hat{\rho}_n - 1)$ as if $u_t$ were i.i.d., as in the previous subsection (28). The second term is a correction for serial correlation. When $u_t$ is serially uncorrelated, $C(1) = 1$ and $\lambda^2 = \gamma_0 = \sigma^2$, so this term disappears. The asymptotics for the t-statistic can be derived in a similar way.
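In practice the correction requires estimates of γ₀ and of the long-run variance λ². Below is a minimal sketch of one common way to implement the case-2 Phillips-Perron ρ statistic (assuming numpy; the Bartlett kernel and the bandwidth are illustrative choices, not prescribed by the text):

    # Case-2 PP rho statistic: run OLS of y_t on (1, y_{t-1}), estimate gamma_0
    # and lambda^2 from the residuals, and subtract the serial-correlation
    # correction suggested by the limit above.
    import numpy as np

    def pp_rho_stat(y, bandwidth=4):
        y_lag, y_cur = y[:-1], y[1:]
        n = len(y_cur)
        X = np.column_stack([np.ones(n), y_lag])
        alpha_hat, rho_hat = np.linalg.lstsq(X, y_cur, rcond=None)[0]
        u_hat = y_cur - alpha_hat - rho_hat * y_lag

        gamma0 = np.mean(u_hat ** 2)
        lam2 = gamma0
        for h in range(1, bandwidth + 1):        # Bartlett-weighted autocovariances
            gamma_h = np.mean(u_hat[h:] * u_hat[:-h])
            lam2 += 2 * (1 - h / (bandwidth + 1)) * gamma_h

        denom = np.sum((y_lag - y_lag.mean()) ** 2) / n ** 2
        return n * (rho_hat - 1.0) - 0.5 * (lam2 - gamma0) / denom

    # Example: a random walk driven by AR(1) errors u_t = 0.5 u_{t-1} + e_t.
    rng = np.random.default_rng(7)
    e = rng.standard_normal(600)
    u = np.empty(600)
    u[0] = e[0]
    for t in range(1, 600):
        u[t] = 0.5 * u[t - 1] + e[t]
    y = np.cumsum(u)
    print(pp_rho_stat(y))    # compare with the case-2 DF rho critical values

The corrected statistic should then have approximately the same limit distribution as the first term above, i.e. the case-2 distribution in (28).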

3.2.3 Augmented Dickey-Fuller Tests for Unit Roots

An alternative unit root test with serially correlated errors is the augmented Dickey-Fuller test. Recall the AR(2) example used earlier in this lecture,
$$(1 - \phi_1 L - \phi_2 L^2)y_t = \epsilon_t,$$
with one unit root and the other root satisfying $|\lambda_2| < 1$. We can rewrite it as
$$y_t = y_{t-1} + u_t, \qquad u_t = (1 - \lambda_2 L)^{-1}\epsilon_t = \theta(L)\epsilon_t.$$

So this is a unit root process with serially correlated errors. To correct for the serial correlation, define ρ = φ1 + φ2, κ = −φ2. Then we have the following equivalent polynomial,

$$(1 - \rho L) - \kappa L(1 - L) = 1 - \rho L - \kappa L + \kappa L^2 = 1 - (\phi_1 + \phi_2 - \phi_2)L - \phi_2 L^2 = 1 - \phi_1 L - \phi_2 L^2.$$

Therefore, the original AR(2) process can be written as
$$[(1 - \rho L) - \kappa L(1 - L)]y_t = \epsilon_t,$$
or
$$y_t = \rho y_{t-1} + \kappa\Delta y_{t-1} + \epsilon_t. \qquad (42)$$
This approach can be generalized to an AR(p) process, where we define

$$\rho = \phi_1 + \phi_2 + \dots + \phi_p, \qquad \kappa_j = -[\phi_{j+1} + \phi_{j+2} + \dots + \phi_p] \quad \text{for } j = 1, 2, \dots, p-1.$$
Note that when the process contains a unit root, meaning that one root of
$$1 - \phi_1 z - \phi_2 z^2 - \dots - \phi_p z^p = 0$$
is unity, we have $1 - \phi_1 - \phi_2 - \dots - \phi_p = 0$, which implies that $\rho = 1$. Therefore, testing whether a process contains a unit root is equivalent to testing whether $\rho = 1$ in (42). Furthermore, (42) is a regression with serially uncorrelated errors. For simplicity,

in the following discussion we work with an AR(2) process, and again we only consider case 2. Our regression is
$$y_t = \kappa\Delta y_{t-1} + \alpha + \rho y_{t-1} + \epsilon_t \equiv x_t'\beta + \epsilon_t,$$
where $x_t = (\Delta y_{t-1}, 1, y_{t-1})'$ and $\beta = (\kappa, \alpha, \rho)'$. The deviation of the OLS estimate from the true β is
$$\hat{\beta}_n - \beta = \left[\sum_{t=1}^{n} x_t x_t'\right]^{-1}\left[\sum_{t=1}^{n} x_t\epsilon_t\right].$$

Let $u_t = y_t - y_{t-1}$. Then
$$\sum_{t=1}^{n} x_t x_t' = \begin{pmatrix} \sum u_{t-1}^2 & \sum u_{t-1} & \sum u_{t-1}y_{t-1} \\ \sum u_{t-1} & n & \sum y_{t-1} \\ \sum y_{t-1}u_{t-1} & \sum y_{t-1} & \sum y_{t-1}^2 \end{pmatrix}, \qquad \sum_{t=1}^{n} x_t\epsilon_t = \begin{pmatrix} \sum u_{t-1}\epsilon_t \\ \sum \epsilon_t \\ \sum y_{t-1}\epsilon_t \end{pmatrix}.$$
Since $u_t$ is stationary, its coefficient is $n^{1/2}$-convergent, so the scaling matrix is
$$H_n = \begin{pmatrix} \sqrt{n} & 0 & 0 \\ 0 & \sqrt{n} & 0 \\ 0 & 0 & n \end{pmatrix}.$$

Premultiplying the coefficient vector by $H_n$,
$$H_n(\hat{\beta}_n - \beta) = \left\{H_n^{-1}\left[\sum_{t=1}^{n} x_t x_t'\right]H_n^{-1}\right\}^{-1}\left[H_n^{-1}\sum_{t=1}^{n} x_t\epsilon_t\right]. \qquad (43)$$

Define $\gamma_j = E(u_t u_{t-j})$ and $\lambda = \sigma C(1) = \sigma/(1-\kappa)$, where $\sigma^2 = E(\epsilon_t^2)$. Then
$$H_n^{-1}\left[\sum_{t=1}^{n} x_t x_t'\right]H_n^{-1} \to_d \begin{pmatrix} \gamma_0 & 0 & 0 \\ 0 & 1 & \lambda\int W(r)dr \\ 0 & \lambda\int W(r)dr & \lambda^2\int W(r)^2dr \end{pmatrix} \equiv \begin{pmatrix} V & 0 \\ 0 & Q \end{pmatrix}.$$

Here $V = \gamma_0$ (it would be a matrix with elements $\gamma_j$ in a general AR(p) model), and
$$Q = \begin{pmatrix} 1 & \lambda\int W(r)dr \\ \lambda\int W(r)dr & \lambda^2\int W(r)^2dr \end{pmatrix}.$$

Next, consider the second term in (43),

 −1/2 P  " n # n ut−1t −1 X −1/2 P Hn xtt =  n t  −1 P t=1 n yt−1t

Apply the usual CLT to the first element,

−1/2 X 2 n ut−1t →p h1 ∼ N(0, σ V ).

25 Apply result (a) and (d) of proposition 2 for the other two terms,

 −1/2 P    n t σW (1) −1 P →d h2 ∼ 2 . n yt−1t (1/2)σλ[W (1) − 1]

Substituting the above results into (43), we get
$$H_n(\hat{\beta}_n - \beta) \to_d \begin{pmatrix} V & 0 \\ 0 & Q \end{pmatrix}^{-1}\begin{pmatrix} h_1 \\ h_2 \end{pmatrix} = \begin{pmatrix} V^{-1}h_1 \\ Q^{-1}h_2 \end{pmatrix}. \qquad (44)$$

Since the limit distribution is block diagonal, we can discuss the coefficients on the stationary components and the nonstationary components separately. For the stationary component,

$$\sqrt{n}(\hat{\kappa}_n - \kappa) \to_d V^{-1}h_1 \sim N(0, \sigma^2 V^{-1}).$$

In this AR(2) problem, the variance is simply $\sigma^2/\gamma_0$. The limit distribution of the constant and the I(1) component is
$$\begin{pmatrix} n^{1/2}\hat{\alpha}_n \\ n(\hat{\rho}_n - 1) \end{pmatrix} \to_d Q^{-1}h_2 = \begin{pmatrix} \sigma & 0 \\ 0 & \sigma/\lambda \end{pmatrix}\begin{pmatrix} 1 & \int W(r)dr \\ \int W(r)dr & \int W(r)^2dr \end{pmatrix}^{-1}\begin{pmatrix} W(1) \\ \frac{1}{2}[W(1)^2 - 1] \end{pmatrix}.$$

This implies that n · (λ/σ) · (ˆρn − 1) has the same distribution as in (28). Since λ = σC(1), λ/σ = C(1) = 1/(1 − κ). Therefore, the ADF ρ-test is

$$\frac{n(\hat{\rho}_n - 1)}{1 - \hat{\kappa}_n} \to_d \frac{\frac{1}{2}[W(1)^2 - 1] - W(1)\int W(r)dr}{\int W(r)^2dr - \left(\int W(r)dr\right)^2}. \qquad (45)$$

For a general AR(p) process, simply replace $(1 - \hat{\kappa}_n)$ with $(1 - \hat{\kappa}_{1,n} - \dots - \hat{\kappa}_{p-1,n})$. The ADF t-test can be found in Hamilton's book.
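The ADF ρ statistic in (45) is easy to compute from the regression (42). A minimal sketch (assuming numpy; the simulated AR(2) with roots 1 and 0.5 matches the earlier example):

    # Case-2 ADF rho statistic for an AR(2): regress y_t on (dy_{t-1}, 1, y_{t-1})
    # and form n(rho_hat - 1)/(1 - kappa_hat).
    import numpy as np

    def adf_rho_stat(y):
        dy = np.diff(y)
        y_cur = y[2:]            # y_t
        y_lag = y[1:-1]          # y_{t-1}
        dy_lag = dy[:-1]         # delta y_{t-1}
        n = len(y_cur)
        X = np.column_stack([dy_lag, np.ones(n), y_lag])
        kappa_hat, alpha_hat, rho_hat = np.linalg.lstsq(X, y_cur, rcond=None)[0]
        return n * (rho_hat - 1.0) / (1.0 - kappa_hat)

    # Simulate (1 - L)(1 - 0.5L) y_t = e_t, i.e. y_t = 1.5 y_{t-1} - 0.5 y_{t-2} + e_t.
    rng = np.random.default_rng(8)
    e = rng.standard_normal(600)
    y = np.zeros(600)
    for t in range(2, 600):
        y[t] = 1.5 * y[t - 1] - 0.5 * y[t - 2] + e[t]

    print(adf_rho_stat(y))   # compare with the case-2 DF rho critical values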
