Lecture 12 – Estimating Autoregressions and Vector Autoregressions (Reference - Section 6.4, Hayashi)

Assume that the observed time series data, y1,…,yT have been generated by the AR(p) process:

yt = c + φ1yt-1 + … + φpyt-p + εt

where the roots of (1 - φ1z - … - φpz^p) lie outside the unit circle and εt ~ i.i.d. (0, σ²). (The i.i.d. assumption can be replaced with white noise and additional conditions.)
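As an aside, the stationarity condition is easy to check numerically. The sketch below is my own illustration (not part of the text): it forms the AR polynomial for a hypothetical set of AR(2) coefficients and verifies that all of its roots lie outside the unit circle.

```python
import numpy as np

# Hypothetical AR(2) coefficients (phi1, phi2) -- illustrative values only.
phi = np.array([1.1, -0.3])

# The AR polynomial is 1 - phi1*z - ... - phip*z^p.
# np.roots wants coefficients ordered from the highest power of z down to the constant.
coefs = np.concatenate(([1.0], -phi))[::-1]
roots = np.roots(coefs)

print(np.abs(roots))                               # moduli of the roots
print("stationary:", bool(np.all(np.abs(roots) > 1.0)))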

Then the OLS estimator of the AR(p) parameter vector [c φ1 … φp]’ is

1) (strongly) consistent

2) asymptotically normal

3) asymptotically efficient

Further, for sufficiently large samples the model can be treated as if it is a classical linear regression model with strictly exogenous regressors and normally distributed, serially uncorrelated, and conditionally homoskedastic errors. So, for example, the t-statistic

(φ̂i - φi) / se(φ̂i) →d N(0,1), where

1 se(ˆi )  ˆ 2 (X ' X )ii

ˆ 2  SSR /(T  p)

To prove these claims (except for asymptotic efficiency), we can show that the assumptions underlying Hayashi’s Proposition 2.5 apply to this model. Consider the proof for the case where p = 1:

A.1 – Linearity; Yes

A.2 – [yt xt] is strictly stationary and ergodic

For this model, xt = [1 yt-1]’, so A.2 requires that yt is strictly stationary and ergodic. This follows from the fact that yt has an absolutely summable MA(∞) form in terms of the i.i.d. process, εt.

A.4 - E(xtxt’) is nonsingular

E(xtxt’) = E{[1 yt-1]’[1 yt-1]} =

   [ 1          E(yt-1)
     E(yt-1)    E(yt-1²) ]

Recall that E(yt) = μ = c/(1-φ) and Var(yt) = E(yt²) - [E(yt)]² = γ0. So,

E(xtxt’) = [ 1    μ
             μ    γ0 + μ² ]

which is nonsingular. [det(E(xtxt’)) = γ0 + μ² - μ² = γ0, and 0 < γ0 < ∞.]
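As a sanity check (my own illustration, not part of the text), one can simulate a long AR(1) with hypothetical parameters and compare the sample average of xtxt’ with the matrix above, using γ0 = σ²/(1 - φ²) for an AR(1).

```python
import numpy as np

rng = np.random.default_rng(0)
c, phi, sigma, T = 1.0, 0.5, 1.0, 200_000      # hypothetical AR(1) parameters

y = np.empty(T)
y[0] = c / (1 - phi)                           # start the recursion at the mean
for t in range(1, T):
    y[t] = c + phi * y[t - 1] + sigma * rng.standard_normal()

x = np.column_stack([np.ones(T - 1), y[:-1]])  # x_t = [1, y_{t-1}]'
sample_moment = x.T @ x / (T - 1)

mu = c / (1 - phi)                             # E(y_t)
gamma0 = sigma**2 / (1 - phi**2)               # Var(y_t) for an AR(1)
theory = np.array([[1.0, mu], [mu, gamma0 + mu**2]])
print(sample_moment)
print(theory)                                  # the two matrices should be close
```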

0 < 0 < ∞ ] A.5 (and, therefore, A.3) - {xtt} is an m.d.s. with finite second moment

xtεt = [εt  yt-1εt]’

Show that {xtεt} is an m.d.s. by proving the sufficient condition

E(xtεt │ xt-1εt-1, xt-2εt-2, …) = 0

E(εt │ εt-1, εt-2, …, yt-2εt-1, yt-3εt-2, …) = E(εt │ εt-1, εt-2, …) = 0

E(yt-1εt │ εt-1, εt-2, …, yt-2εt-1, yt-3εt-2, …)

= E{ E(yt-1εt │ yt-1, εt-1, εt-2, …, yt-2εt-1, yt-3εt-2, …) │ εt-1, εt-2, …, yt-2εt-1, yt-3εt-2, …}

   by the Law of Iterated Expectations

= E{ yt-1 E(εt │ yt-1, εt-1, εt-2, …, yt-2εt-1, yt-3εt-2, …) │ εt-1, εt-2, …, yt-2εt-1, yt-3εt-2, …}

= 0, since E(εt │ yt-1, εt-1, εt-2, …, yt-2εt-1, yt-3εt-2, …) = 0.

So, [εt  yt-1εt]’ is an m.d.s. To complete A.5, we need to show that Var(xtεt) is finite. Since E(xtεt) = 0 (by the m.d.s. property),

Var(xtεt) = E{[εt  yt-1εt]’[εt  yt-1εt]} =

   [ E(εt²)        E(yt-1εt²)
     E(yt-1εt²)    E(yt-1²εt²) ]

Applying the Law of Iterated Expectations:

E(yt-1εt²) = E{E(yt-1εt² │ yt-1)} = E(yt-1σ²) = σ²E(yt-1) = σ²μ

and

E(yt-1²εt²) = E{E(yt-1²εt² │ yt-1)} = E(yt-1²σ²) = σ²(γ0 + μ²)

So, E{[εt  yt-1εt]’[εt  yt-1εt]} =

   [ σ²      σ²μ
     σ²μ     σ²(γ0 + μ²) ]

= σ²E(xtxt’), which is finite and nonsingular.

A.7 – E(εt² │ xt) = σ² > 0

This follows from the facts that 1) xt is a linear combination of past ε’s, 2) εt is independent of past ε’s, and 3) E(εt²) = σ² > 0.

Therefore, the assumptions of Proposition (2.5) are satisfied.

The proof for the general AR(p) follows along the same lines.

A couple of practical issues –

1. What transformation(s), if any, should we make to a time series before we fit it to an AR model?

We want the series to look like a realization of a stationary process.

• use the logged form of the series (especially with trending series, since the changes in levels will typically also be growing over time while the changes in the logs, which are approximately percentage changes, will typically be relatively stable over time)

• remove the trend – should we use the linear trend model or should we use first differences? More on this in the next section of the course.
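For concreteness, a tiny sketch of both transformations (the levels series y is made up purely for illustration):

```python
import numpy as np

y = np.array([100.0, 103.0, 107.1, 110.3, 115.8])  # made-up levels of a trending series

log_y = np.log(y)          # logged form of the series
dlog_y = np.diff(log_y)    # first differences of the logs ~ period-to-period growth rates
print(dlog_y)
```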

2. How do we select the appropriate value for p?

This gets kind of messy. There is no “best way” to select the appropriate lag length, although there are a number of sensible ways to go about this process.

Selecting the lag length for the AR(p) model –

The idea is that we want to choose p large enough to remove any serial correlation in the error term but we want to choose p small enough so that we are not including irrelevant regressors. (Why is including irrelevant regressors a problem?)

There are two approaches to lag length selection –

• Hypothesis testing – sequential t or F tests

• Minimize the AIC or SIC statistic

Sequential testing –

Consider sequential t tests.

First, select the largest “plausible” p, say pmax. (For quarterly real GDP this might be, say, 6 – 8.)

Second, fit the AR model using p = pmax and test H0: φpmax = 0. If H0 is rejected, set p = pmax. If H0 is not rejected, redo with p = pmax - 1. Continue until H0 is rejected.
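A sketch of this general-to-specific procedure (my own code, with hypothetical names; it uses the maximum number of observations for each p, which is the second approach described in the notes below, and a 5% two-sided asymptotic critical value of 1.96):

```python
import numpy as np

def last_lag_tstat(y, p):
    """t-statistic on phi_p from an OLS fit of an AR(p) over t = p+1, ..., T."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    X = np.column_stack([np.ones(T - p)] +
                        [y[p - j:T - j] for j in range(1, p + 1)])
    yy = y[p:]
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ yy
    resid = yy - X @ beta
    sigma2 = resid @ resid / (T - p)
    return beta[-1] / np.sqrt(sigma2 * XtX_inv[-1, -1])

def select_p_sequential(y, pmax, crit=1.96):
    """Test H0: phi_p = 0 starting at p = pmax and step down until H0 is rejected."""
    for p in range(pmax, 0, -1):
        if abs(last_lag_tstat(y, p)) > crit:
            return p                      # H0 rejected: keep this lag length
    return 0                              # no lag coefficient is significant

# Usage (illustrative): p_star = select_p_sequential(y, pmax=8)
```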

Notes:

1) After you have selected your p, you should check for serial correlation in the error term (using the sample autocorrelogram of the residuals) to make sure that your p is large enough to have removed the serial correlation in the error process.

2) For any given p, you can only fit the model for t = p+1, …, T. One way to perform the lag length tests we have described is to fit all the models using t = pmax+1, …, T; then, having selected the preferred lag length p*, fit the model using t = p*+1, …, T. The advantage of this approach is that the lag length test results depend only on varying p rather than varying both p and the sample. The alternative approach is to use the maximum number of observations for each test.

3) As T → ∞, Prob(p* < p) → 0 but Prob(p* > p) → c > 0. (The probability of underfitting goes to zero but there is a positive probability of overfitting even in large samples.) (Why? If φp ≠ 0, the t-statistic on φp satisfies │tφp│ → ∞ with probability 1.)

AIC and SIC –

The AIC (Akaike Information Criterion) and the SIC (Schwarz Information Criterion) are based on the following statistics -

AIC(p, pmax) = log[SSRp/(T - pmax)] + 2(p+1)/(T - pmax)

SIC(p, pmax) = log[SSRp/(T - pmax)] + (p+1)log(T - pmax)/(T - pmax)

The AR(p) is fit to the data for t = pmax+1, …, T for p = 0, 1, …, pmax. The optimal lag length p* is chosen to minimize the AIC (or the SIC).
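A sketch of this selection rule under the same conventions (a common sample t = pmax+1, …, T for every p); the function and variable names are mine:

```python
import numpy as np

def ar_ssr_common_sample(y, p, pmax):
    """SSR from an AR(p) fit by OLS over t = pmax+1, ..., T (p = 0 means constant only)."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    X = np.column_stack([np.ones(T - pmax)] +
                        [y[pmax - j:T - j] for j in range(1, p + 1)])
    yy = y[pmax:]
    beta = np.linalg.lstsq(X, yy, rcond=None)[0]
    resid = yy - X @ beta
    return resid @ resid

def select_p_ic(y, pmax):
    """Return the lag lengths that minimize the AIC and the SIC defined above."""
    n_eff = len(y) - pmax                       # T - pmax
    aic, sic = [], []
    for p in range(pmax + 1):
        ssr = ar_ssr_common_sample(y, p, pmax)
        aic.append(np.log(ssr / n_eff) + 2 * (p + 1) / n_eff)
        sic.append(np.log(ssr / n_eff) + (p + 1) * np.log(n_eff) / n_eff)
    return int(np.argmin(aic)), int(np.argmin(sic))   # (p*_AIC, p*_SIC)
```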

The idea – The AIC and SIC are like adjusted R²’s. The first term will decrease as p increases since the SSR will fall as p increases. The second term is a “penalty term” that increases as p increases.

Notes –

1) So long as T - pmax > 8 (i.e., for all practical purposes), p*(AIC) ≥ p*(SIC).

2) If εt is independent white noise with finite fourth moment, the following can be shown:

• p*(SIC) → p as T → ∞

• prob(p*(AIC) > p) → c > 0 as T → ∞

(So why ever use AIC?)

3) The SIC is also called the SBC (Schwarz Bayesian Criterion) and the BIC (Bayesian Information Criterion).

Estimating Vector Autoregressions –

Let yt = [y1t … ynt]’ evolve according to the VAR(p) model -

yt = A0 + A1yt-1 + … + Apyt-p + εt

where εt is n-dimensional white noise and A1, …, Ap satisfy the stationarity condition.

Then, OLS applied equation by equation is

1) (strongly) consistent

2) asymptotically normal

3) asymptotically efficient

For sufficiently large samples, each of the n equations can be treated as if it is a classical linear regression model with strictly exogenous regressors and normally distributed, serially uncorrelated, and conditionally homoskedastic errors.
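A minimal sketch of equation-by-equation OLS for a VAR(p), assuming the data are stored in a T x n array Y (one column per variable); the names are mine. Because every equation shares the same regressors, all n equations can be estimated in a single least-squares call (this is the algebraic equivalence with SUR discussed in the notes further below).

```python
import numpy as np

def fit_var_ols(Y, p):
    """Equation-by-equation OLS for a VAR(p) with a constant, fit over t = p+1, ..., T.

    Returns A = [A0 A1 ... Ap] (n x (np+1), one row per equation) and the residuals."""
    Y = np.asarray(Y, dtype=float)
    T, n = Y.shape
    # Common regressor matrix: a constant and p lags of every variable.
    X = np.column_stack([np.ones(T - p)] +
                        [Y[p - s:T - s, :] for s in range(1, p + 1)])
    Yp = Y[p:, :]
    B = np.linalg.lstsq(X, Yp, rcond=None)[0]   # (np+1) x n, one column per equation
    A = B.T                                     # n x (np+1)
    resid = Yp - X @ B                          # (T - p) x n
    return A, resid
```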

Linear cross equation restrictions can be tested as follows -

Let A = [A0 A1 … Ap]; A is n×(np+1), so vec(A) is (n²p+n)×1. Consider H0: R vec(A) = r, where R is a known q×(n²p+n) matrix and r is a known q×1 vector.

Then, under H0, the (quasi-) likelihood ratio statistic

T[log det Σ̂R - log det Σ̂U]

converges in distribution to a χ²(q), where

Σ̂R = (1/T) Σt=p+1,…,T ε̂t,R ε̂t,R’,  ε̂t,R = residual from the “restricted regression”

and

Σ̂U = (1/T) Σt=p+1,…,T ε̂t,U ε̂t,U’,  ε̂t,U = residual from the “unrestricted regression”
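A sketch of the statistic (my own code; it uses the number of residual observations in place of T in the scaling, a common finite-sample choice, and q must be supplied as the number of restrictions):

```python
import numpy as np
from scipy import stats

def var_lr_test(resid_R, resid_U, q):
    """(Quasi-)LR test of q linear restrictions on a VAR.

    resid_R, resid_U: (T - p) x n residual matrices from the restricted and
    unrestricted fits.  Returns the statistic and its chi-square(q) p-value."""
    T_eff = resid_U.shape[0]
    Sigma_R = resid_R.T @ resid_R / T_eff
    Sigma_U = resid_U.T @ resid_U / T_eff
    lr = T_eff * (np.linalg.slogdet(Sigma_R)[1] - np.linalg.slogdet(Sigma_U)[1])
    return lr, stats.chi2.sf(lr, q)
```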

Notes:

1) OLS is algebraically equivalent to SUR because the same regressors appear in each equation. (Once you start allowing for different lag lengths in different equations or different variables in different equations, SUR will be preferred to OLS, unless the innovations are uncorrelated across equations.)

2) We can choose the lag length by applying the vector versions of the AIC and SIC:

AIC(p, pmax) = log det Σ̂p + 2(pn² + n)/(T - pmax)

SIC(p, pmax) = log det Σ̂p + (pn² + n)log(T - pmax)/(T - pmax)

Σ̂p = (1/T) Σt=pmax+1,…,T ε̂t,p ε̂t,p’

ε̂t,p = residual from the VAR(p) model

We choose p* from p = 0, 1, …, pmax to minimize the AIC (or SIC). The large sample properties of the AIC and SIC in the VAR(p) case are the same as for the AR(p) case.

3) How to select the variables that appear in yt?

• (Wiener-Granger-Sims) causality tests

4) Bayesian VARs (BVAR)