
Lecture 6: Vector Autoregression

In this section, we extend our discussion to vector-valued processes. We will be mostly interested in vector autoregression (VAR), which is much easier to estimate in applications. We first introduce the properties of and basic tools for analyzing stationary VAR processes, and then move on to estimation and inference for the VAR model.

1 Covariance-stationary VAR(p) process

1.1 Introduction to stationary vector ARMA processes

1.1.1 VAR processes

A VAR model applies when each variable in the system depends not only on its own lags, but also on the lags of other variables. A simple VAR example is:

$$x_{1t} = \phi_{11}x_{1,t-1} + \phi_{12}x_{2,t-1} + \epsilon_{1t}$$
$$x_{2t} = \phi_{21}x_{2,t-1} + \phi_{22}x_{2,t-2} + \epsilon_{2t}$$
where $E(\epsilon_{1t}\epsilon_{2s}) = \sigma_{12}$ for $t = s$ and zero for $t \neq s$. We could rewrite it as
$$\begin{pmatrix} x_{1t} \\ x_{2t} \end{pmatrix} = \begin{pmatrix} \phi_{11} & \phi_{12} \\ 0 & \phi_{21} \end{pmatrix}\begin{pmatrix} x_{1,t-1} \\ x_{2,t-1} \end{pmatrix} + \begin{pmatrix} 0 & 0 \\ 0 & \phi_{22} \end{pmatrix}\begin{pmatrix} x_{1,t-2} \\ x_{2,t-2} \end{pmatrix} + \begin{pmatrix} \epsilon_{1t} \\ \epsilon_{2t} \end{pmatrix},$$
or just
$$x_t = \Phi_1 x_{t-1} + \Phi_2 x_{t-2} + \epsilon_t, \qquad (1)$$
with $E(\epsilon_t) = 0$, $E(\epsilon_t\epsilon_s') = 0$ for $s \neq t$, and
$$E(\epsilon_t\epsilon_t') = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{pmatrix}.$$

As you can see, in this example the vector-valued $x_t$ follows a VAR(2) process. A general VAR(p) process can be written as

$$x_t = \Phi_1 x_{t-1} + \Phi_2 x_{t-2} + \dots + \Phi_p x_{t-p} + \epsilon_t = \sum_{j=1}^{p}\Phi_j x_{t-j} + \epsilon_t,$$
or, if we make use of the lag operator,
$$\Phi(L)x_t = \epsilon_t,$$

∗Copyright 2002-2006 by Ling Hu.

where
$$\Phi(L) = I_k - \Phi_1 L - \dots - \Phi_p L^p.$$

The error terms follow a vector white noise process, i.e., $E(\epsilon_t) = 0$ and
$$E(\epsilon_t\epsilon_s') = \begin{cases} \Omega & \text{for } t = s \\ 0 & \text{otherwise} \end{cases}$$
with Ω a (k × k) symmetric positive definite matrix. Recall that in studying the scalar AR(p) process,

$$\phi(L)x_t = \epsilon_t,$$
we have the result that the process $\{x_t\}$ is covariance-stationary as long as all the roots of (2),
$$1 - \phi_1 z - \phi_2 z^2 - \dots - \phi_p z^p = 0, \qquad (2)$$
lie outside the unit circle. Similarly, for the VAR(p) process to be stationary, we must have that the roots of the equation
$$|I_k - \Phi_1 z - \dots - \Phi_p z^p| = 0$$
all lie outside the unit circle.
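As a quick numerical check (a sketch in Python using numpy; the coefficient matrices below are hypothetical illustrative values, not taken from the text), we can verify this condition by stacking the VAR into its companion form, introduced in Section 1.2 below, and checking that all eigenvalues of the companion matrix lie strictly inside the unit circle:

import numpy as np

# Hypothetical VAR(2) coefficient matrices (k = 2, p = 2), for illustration only.
Phi1 = np.array([[0.5, 0.1],
                 [0.2, 0.3]])
Phi2 = np.array([[0.1, 0.0],
                 [0.0, 0.2]])
k, p = 2, 2

# Companion (state-space) matrix F of the stacked VAR(1) representation.
F = np.zeros((k * p, k * p))
F[:k, :k] = Phi1
F[:k, k:2*k] = Phi2
F[k:, :-k] = np.eye(k * (p - 1))

# The VAR(p) is covariance-stationary iff all eigenvalues of F lie strictly inside
# the unit circle (equivalently, all roots of |I - Phi1 z - ... - Phip z^p| = 0
# lie outside the unit circle).
eigvals = np.linalg.eigvals(F)
print(np.abs(eigvals))
print("stationary:", np.all(np.abs(eigvals) < 1))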

1.1.2 Vector MA(∞) processes

Recall that we could invert a scalar stationary AR(p) process, $\phi(L)x_t = \epsilon_t$, to an MA(∞) process, $x_t = \theta(L)\epsilon_t$, where $\theta(L) = \phi(L)^{-1}$. The same is true for a covariance-stationary VAR(p) process, $\Phi(L)x_t = \epsilon_t$. We could invert it to
$$x_t = \Psi(L)\epsilon_t, \qquad \text{where } \Psi(L) = \Phi(L)^{-1}.$$
The coefficients of Ψ can be solved for in the same way as in the scalar case, i.e., if $\Phi^{-1}(L) = \Psi(L)$, then $\Phi(L)\Psi(L) = I_k$:

$$(I_k - \Phi_1 L - \Phi_2 L^2 - \dots - \Phi_p L^p)(I_k + \Psi_1 L + \Psi_2 L^2 + \dots) = I_k.$$

Equating the coefficients of $L^j$, we have $\Psi_0 = I_k$, $\Psi_1 = \Phi_1$, $\Psi_2 = \Phi_1\Psi_1 + \Phi_2$, and in general,
$$\Psi_s = \Phi_1\Psi_{s-1} + \Phi_2\Psi_{s-2} + \dots + \Phi_p\Psi_{s-p}.$$
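The recursion above is easy to implement. The following sketch (Python/numpy; the function name and example values are our own, purely for illustration) computes the first few MA(∞) coefficient matrices from given VAR coefficients:

import numpy as np

def ma_coefficients(Phis, n_terms):
    """Recursively compute Psi_0, ..., Psi_{n_terms-1} for x_t = Psi(L) eps_t
    from the VAR coefficients Phis = [Phi_1, ..., Phi_p], using
    Psi_s = Phi_1 Psi_{s-1} + ... + Phi_p Psi_{s-p} (with Psi_s = 0 for s < 0)."""
    k = Phis[0].shape[0]
    p = len(Phis)
    Psis = [np.eye(k)]                 # Psi_0 = I_k
    for s in range(1, n_terms):
        Psi_s = np.zeros((k, k))
        for j in range(1, p + 1):
            if s - j >= 0:
                Psi_s += Phis[j - 1] @ Psis[s - j]
        Psis.append(Psi_s)
    return Psis

# Hypothetical example:
Phi1 = np.array([[0.5, 0.1], [0.2, 0.3]])
Phi2 = np.array([[0.1, 0.0], [0.0, 0.2]])
print(ma_coefficients([Phi1, Phi2], 4)[1])   # Psi_1 = Phi_1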

1.2 Transforming to a state space representation

Sometimes it is more convenient to write a scalar-valued time series, say an AR(p) process, in vector form. For example,
$$x_t = \sum_{j=1}^{p}\phi_j x_{t-j} + \epsilon_t,$$

where $\epsilon_t \sim N(0, \sigma^2)$. We could equivalently write it as
$$\begin{pmatrix} x_t \\ x_{t-1} \\ \vdots \\ x_{t-p+1} \end{pmatrix} = \begin{pmatrix} \phi_1 & \phi_2 & \dots & \phi_{p-1} & \phi_p \\ 1 & 0 & \dots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \dots & 1 & 0 \end{pmatrix}\begin{pmatrix} x_{t-1} \\ x_{t-2} \\ \vdots \\ x_{t-p} \end{pmatrix} + \begin{pmatrix} \epsilon_t \\ 0 \\ \vdots \\ 0 \end{pmatrix}.$$

If we let $\xi_t = (x_t, x_{t-1}, \dots, x_{t-p+1})'$, $\xi_{t-1} = (x_{t-1}, x_{t-2}, \dots, x_{t-p})'$, the error vector be $(\epsilon_t, 0, \dots, 0)'$, and let F denote the parameter matrix, then we can write the process as:

ξt = F ξt−1 + t

where $\epsilon_t \sim N(0, \sigma^2)$ and the covariance matrix of the stacked error vector has $\sigma^2$ in its (1,1) position and zeros elsewhere. So we have rewritten a scalar AR(p) process as a vector autoregression of order one, denoted VAR(1). Similarly, we could also transform a VAR(p) process into a VAR(1) process. For the process

$$x_t = \Phi_1 x_{t-1} + \Phi_2 x_{t-2} + \dots + \Phi_p x_{t-p} + \epsilon_t,$$
let
$$\xi_t = \begin{pmatrix} x_t \\ x_{t-1} \\ \vdots \\ x_{t-p+1} \end{pmatrix}, \qquad F = \begin{pmatrix} \Phi_1 & \Phi_2 & \dots & \Phi_{p-1} & \Phi_p \\ I_k & 0 & \dots & 0 & 0 \\ 0 & I_k & \dots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \dots & I_k & 0 \end{pmatrix}, \qquad v_t = \begin{pmatrix} \epsilon_t \\ 0 \\ \vdots \\ 0 \end{pmatrix}.$$
Then we could rewrite the VAR(p) process in state space notation,

ξt = F ξt−1 + vt. (3)

where $E(v_t v_s')$ equals Q for t = s and equals zero otherwise, and

$$Q = \begin{pmatrix} \Omega & 0 & \dots & 0 \\ 0 & 0 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & 0 \end{pmatrix}.$$

1.3 The autocovariance matrix

1.3.1 VAR process

For a covariance-stationary k-dimensional vector process $\{x_t\}$, let $E(x_t) = \mu$; then the autocovariance is defined to be the following k × k matrix

$$\Gamma(h) = E[(x_t - \mu)(x_{t-h} - \mu)'].$$

For simplicity, assume that µ = 0. Then we have $\Gamma(h) = E(x_t x_{t-h}')$. Because of the lead-lag effect, we may not have $\Gamma(h) = \Gamma(-h)$, but we do have $\Gamma(h)' = \Gamma(-h)$. To show this,

$$\Gamma(h) = E(x_{t+h}x_{t+h-h}') = E(x_{t+h}x_t'),$$
and taking the transpose,
$$\Gamma(h)' = E(x_t x_{t+h}') = \Gamma(-h).$$
As in the scalar case, we define the autocovariance generating function of the process x as
$$G_x(z) = \sum_{h=-\infty}^{\infty}\Gamma(h)z^h,$$
where z is again a complex scalar. Let $\xi_t$ be as defined in (3). Assume that ξ and x are stationary, and let Σ denote the variance of ξ,

$$\Sigma = E(\xi_t\xi_t') = E\left[\begin{pmatrix} x_t \\ x_{t-1} \\ \vdots \\ x_{t-p+1} \end{pmatrix}\begin{pmatrix} x_t' & x_{t-1}' & \dots & x_{t-p+1}' \end{pmatrix}\right] = \begin{pmatrix} \Gamma(0) & \Gamma(1) & \dots & \Gamma(p-1) \\ \Gamma(1)' & \Gamma(0) & \dots & \Gamma(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ \Gamma(p-1)' & \Gamma(p-2)' & \dots & \Gamma(0) \end{pmatrix}.$$

Postmultiplying (3) by its transpose and taking expectations gives

$$E(\xi_t\xi_t') = E[(F\xi_{t-1} + v_t)(F\xi_{t-1} + v_t)'] = F E(\xi_{t-1}\xi_{t-1}')F' + E(v_t v_t'),$$
or
$$\Sigma = F\Sigma F' + Q. \qquad (4)$$
To solve for Σ, we need to use the vec operator and the following result: let A, B, C be matrices whose dimensions are such that the product ABC exists. Then

$$\text{vec}(ABC) = (C' \otimes A)\cdot\text{vec}(B),$$

where vec is the operator that stacks the columns of a (k × k) matrix into a k²-dimensional vector; for example,
$$A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}, \qquad \text{vec}(A) = \begin{pmatrix} a_{11} \\ a_{21} \\ a_{12} \\ a_{22} \end{pmatrix}.$$
Applying the vec operator to both sides of (4), we get

$$\text{vec}(\Sigma) = (F \otimes F)\cdot\text{vec}(\Sigma) + \text{vec}(Q),$$
which gives
$$\text{vec}(\Sigma) = (I_m - F \otimes F)^{-1}\text{vec}(Q),$$
where m = k²p². We can use this equation to solve for the first p autocovariances of x, Γ(0), ..., Γ(p−1). To derive the hth autocovariance of ξ, denoted by Σ(h), we can postmultiply (3) by $\xi_{t-h}'$ and take expectations,

$$E(\xi_t\xi_{t-h}') = F E(\xi_{t-1}\xi_{t-h}') + E(v_t\xi_{t-h}'),$$
so $\Sigma(h) = F\Sigma(h-1)$, or $\Sigma(h) = F^h\Sigma$. Therefore we have the following relationship for Γ(h):

$$\Gamma(h) = \Phi_1\Gamma(h-1) + \Phi_2\Gamma(h-2) + \dots + \Phi_p\Gamma(h-p).$$
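A small sketch of this computation (Python/numpy; the function name and numerical values are hypothetical) solves the vec equation above for Σ and then uses the recursion for Γ(h) to extend to higher lags:

import numpy as np

def var_autocovariances(Phis, Omega, n_lags):
    """Sketch: autocovariances of a stationary VAR(p).
    Solves vec(Sigma) = (I - F kron F)^{-1} vec(Q) for the companion form,
    reads Gamma(0), ..., Gamma(p-1) off the first block row of Sigma, and
    extends with Gamma(h) = Phi_1 Gamma(h-1) + ... + Phi_p Gamma(h-p)."""
    k = Phis[0].shape[0]
    p = len(Phis)
    m = k * p
    # Companion matrices F and Q.
    F = np.zeros((m, m))
    F[:k, :] = np.hstack(Phis)
    F[k:, :m - k] = np.eye(m - k)
    Q = np.zeros((m, m))
    Q[:k, :k] = Omega
    # Solve Sigma = F Sigma F' + Q via the vec operator (column stacking).
    vecQ = Q.flatten(order="F")
    vecSigma = np.linalg.solve(np.eye(m * m) - np.kron(F, F), vecQ)
    Sigma = vecSigma.reshape((m, m), order="F")
    # Gamma(h) for h = 0, ..., p-1 sit in the first block row of Sigma.
    Gammas = [Sigma[:k, j*k:(j+1)*k] for j in range(p)]
    # Extend recursively to higher lags.
    for h in range(p, n_lags + 1):
        Gammas.append(sum(Phis[j] @ Gammas[h - j - 1] for j in range(p)))
    return Gammas[:n_lags + 1]

# Hypothetical example:
Phi1 = np.array([[0.5, 0.1], [0.2, 0.3]])
Phi2 = np.array([[0.1, 0.0], [0.0, 0.2]])
Omega = np.array([[1.0, 0.3], [0.3, 1.0]])
print(var_autocovariances([Phi1, Phi2], Omega, 3)[0])   # Gamma(0)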

1.3.2 Vector MA processes

We first consider the MA(q) process,

$$x_t = \epsilon_t + \Psi_1\epsilon_{t-1} + \Psi_2\epsilon_{t-2} + \dots + \Psi_q\epsilon_{t-q}.$$

Then the variance of xt is

$$\Gamma(0) = E(x_t x_t') = E(\epsilon_t\epsilon_t') + \Psi_1 E(\epsilon_{t-1}\epsilon_{t-1}')\Psi_1' + \dots + \Psi_q E(\epsilon_{t-q}\epsilon_{t-q}')\Psi_q' = \Omega + \Psi_1\Omega\Psi_1' + \Psi_2\Omega\Psi_2' + \dots + \Psi_q\Omega\Psi_q',$$

and the autocovariances are
$$\Gamma(h) = \begin{cases} \Psi_h\Omega + \Psi_{h+1}\Omega\Psi_1' + \Psi_{h+2}\Omega\Psi_2' + \dots + \Psi_q\Omega\Psi_{q-h}' & \text{for } h = 1,\dots,q, \\ \Omega\Psi_{-h}' + \Psi_1\Omega\Psi_{-h+1}' + \Psi_2\Omega\Psi_{-h+2}' + \dots + \Psi_{q+h}\Omega\Psi_q' & \text{for } h = -1,\dots,-q, \\ 0 & \text{for } |h| > q. \end{cases}$$

As in the scalar case, any vector MA(q) process is stationary. Next consider the MA(∞) process

$$x_t = \epsilon_t + \Psi_1\epsilon_{t-1} + \Psi_2\epsilon_{t-2} + \dots = \Psi(L)\epsilon_t.$$

A sequence of matrices $\{\Psi_s\}_{s=-\infty}^{\infty}$ is absolutely summable if each of its elements forms an absolutely summable scalar sequence, i.e.,
$$\sum_{s=0}^{\infty}|\psi_{ij}^{(s)}| < \infty \quad \text{for } i, j = 1, 2, \dots, k,$$

where $\psi_{ij}^{(s)}$ is the row i, column j element (ijth element for short) of $\Psi_s$. Some important results about the MA(∞) process are summarized as follows:

Proposition 1 Let xt be a k × 1 vector satisfying

$$x_t = \sum_{j=0}^{\infty}\Psi_j\epsilon_{t-j},$$
where $\epsilon_t$ is vector white noise and $\{\Psi_j\}$ is absolutely summable. Then
(a) The autocovariance between the ith variable at time t and the jth variable s periods earlier, $E(x_{it}x_{j,t-s})$, exists and is given by the ijth element of

$$\Gamma(s) = \sum_{v=0}^{\infty}\Psi_{s+v}\Omega\Psi_v' \quad \text{for } s = 0, 1, 2, \dots;$$

(b) $\{\Gamma(h)\}_{h=0}^{\infty}$ is absolutely summable.
If $\{\epsilon_t\}_{t=-\infty}^{\infty}$ is i.i.d. with $E|\epsilon_{i_1,t}\epsilon_{i_2,t}\epsilon_{i_3,t}\epsilon_{i_4,t}| < \infty$ for $i_1, i_2, i_3, i_4 = 1, 2, \dots, k$, then we also have

(c) $E|x_{i_1,t_1}x_{i_2,t_2}x_{i_3,t_3}x_{i_4,t_4}| < \infty$ for $i_1, i_2, i_3, i_4 = 1, 2, \dots, k$ and all $t_1, t_2, t_3, t_4$.

(d) $n^{-1}\sum_{t=1}^{n}x_{it}x_{j,t-s} \to_p E(x_{it}x_{j,t-s})$ for $i, j = 1, 2, \dots, k$ and for all s.

All of these results can be viewed as extensions from the scalar case to the vector case, and their proofs can be found on pages 286-288 of Hamilton's book.

1.4 The Sample Mean of a Vector Process

Let $x_t$ be a stationary process with $E(x_t) = 0$ and $E(x_t x_{t-h}') = \Gamma(h)$, where Γ(h) is absolutely summable. We consider the properties of the sample mean

$$\bar{x}_n = \frac{1}{n}\sum_{t=1}^{n}x_t.$$

We have
$$E(\bar{x}_n\bar{x}_n') = \frac{1}{n^2}E[(x_1 + \dots + x_n)(x_1 + \dots + x_n)'] = \frac{1}{n^2}\sum_{i,j}E(x_i x_j') = \frac{1}{n^2}\sum_{h=-n+1}^{n-1}(n - |h|)\Gamma(h) = \frac{1}{n}\sum_{h=-n+1}^{n-1}\left(1 - \frac{|h|}{n}\right)\Gamma(h).$$

Then

$$nE(\bar{x}_n\bar{x}_n') = \sum_{h=-n+1}^{n-1}\left(1 - \frac{|h|}{n}\right)\Gamma(h) = \Gamma(0) + \left(1 - \frac{1}{n}\right)(\Gamma(1) + \Gamma(-1)) + \left(1 - \frac{2}{n}\right)(\Gamma(2) + \Gamma(-2)) + \dots \to \sum_{h=-\infty}^{\infty}\Gamma(h).$$

This is very similar to what we did in the scalar case. We then have the following proposition:

Proposition 2 Let $x_t$ be a zero-mean stationary process with $E(x_t) = 0$ and $E(x_t x_{t-h}') = \Gamma(h)$, where Γ(h) is absolutely summable. Then the sample mean satisfies

(a) x¯n →p 0

(b) $\lim_{n\to\infty}[nE(\bar{x}_n\bar{x}_n')] = \sum_{h=-\infty}^{\infty}\Gamma(h)$.

Let S denote the limit of $nE(\bar{x}_n\bar{x}_n')$. If the data are generated by an MA(q) process, then result (b) implies that
$$S = \sum_{h=-q}^{q}\Gamma(h).$$
A natural estimate for S is then
$$\hat{S} = \hat{\Gamma}(0) + \sum_{h=1}^{q}(\hat{\Gamma}(h) + \hat{\Gamma}(h)'), \qquad (5)$$
where
$$\hat{\Gamma}(h) = \frac{1}{n}\sum_{t=h+1}^{n}(x_t - \bar{x}_n)(x_{t-h} - \bar{x}_n)'.$$
$\hat{S}$ defined in (5) provides a consistent estimate of S for a large class of stationary processes. Even when the process has time-varying second moments, as long as

$$\frac{1}{n}\sum_{t=h+1}^{n}(x_t - \bar{x}_n)(x_{t-h} - \bar{x}_n)'$$

converges in probability to
$$\frac{1}{n}\sum_{t=h+1}^{n}E(x_t x_{t-h}'),$$
$\hat{S}$ is a consistent estimate of the limit of $nE(\bar{x}_n\bar{x}_n')$. So it is used not only for MA(q) processes. Writing the autocovariance as $E(x_t x_s')$, even if it is nonzero for all t and s, if this matrix goes to zero sufficiently fast as $|t - s| \to \infty$ and q grows with the sample size n, then we still have $\hat{S} \to S$.
However, a problem with $\hat{S}$ is that it may not be positive semidefinite in small samples. Therefore, we can use the Newey and West estimate
$$\tilde{S} = \hat{\Gamma}(0) + \sum_{h=1}^{q}\left(1 - \frac{h}{q+1}\right)(\hat{\Gamma}(h) + \hat{\Gamma}(h)'),$$
which is positive semidefinite and has the same consistency properties as $\hat{S}$ when $q, n \to \infty$ with $q/n^{1/4} \to 0$.
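A minimal sketch of these estimators (Python/numpy; the function name and the bandwidth choice are ours, for illustration only) computes Γ̂(h) and the Newey-West weighted sum S̃:

import numpy as np

def newey_west(x, q):
    """Sketch of the Newey-West long-run variance estimator
    S_tilde = Gamma_hat(0) + sum_{h=1}^{q} (1 - h/(q+1)) (Gamma_hat(h) + Gamma_hat(h)'),
    with Gamma_hat(h) = (1/n) sum_{t=h+1}^{n} (x_t - xbar)(x_{t-h} - xbar)'.
    x is an (n x k) array of observations."""
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    xc = x - x.mean(axis=0)                 # demean
    def gamma_hat(h):
        return xc[h:].T @ xc[:n - h] / n    # (1/n) sum_t (x_t - xbar)(x_{t-h} - xbar)'
    S = gamma_hat(0)
    for h in range(1, q + 1):
        w = 1.0 - h / (q + 1.0)             # Bartlett weight keeps S_tilde psd
        G = gamma_hat(h)
        S += w * (G + G.T)
    return S

# Hypothetical usage on simulated white noise (the estimate should be close to I_2):
rng = np.random.default_rng(0)
print(newey_west(rng.standard_normal((500, 2)), q=4))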

1.5 Impulse-response Function and Orthogonalization

1.5.1 Impulse-response function

The impulse-response function gives how a time series variable is affected by a shock at time t. Recall that for a scalar time series process, say an AR(1) process $x_t = \phi x_{t-1} + \epsilon_t$ with $|\phi| < 1$, we can invert it to an MA process $x_t = (1 + \phi L + \phi^2 L^2 + \dots)\epsilon_t$, and the effects of ε on x are:

ε: 0 1 0 0 ...
x: 0 1 φ φ² ...

In other words, after we invert $\phi(L)x_t = \epsilon_t$ to $x_t = \theta(L)\epsilon_t$, the θ(L) function gives us how x responds to a unit shock from $\epsilon_t$. We can do a similar thing for a VAR process. In our earlier example, we have a VAR(2) system,

$$x_t = \Phi_1 x_{t-1} + \Phi_2 x_{t-2} + \epsilon_t, \qquad \epsilon_t \sim WN(0, \Omega), \qquad \text{where } \Omega = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{pmatrix}.$$
After we invert it to an MA(∞) representation,

$$x_t = \Psi(L)\epsilon_t, \qquad (6)$$
where $\Psi(L) = (I_k - \Phi_1 L - \Phi_2 L^2)^{-1}$, we see that in this representation the observation $x_t$ is a linear combination of the shocks $\epsilon_t$. However, suppose we are interested in another form of shocks, say

$$u_t = Q\epsilon_t,$$
where Q is an arbitrary nonsingular square matrix (in this example, 2 by 2); then we have

$$x_t = \Psi(L)Q^{-1}Q\epsilon_t = A(L)u_t, \qquad (7)$$
where we let $A(L) = \Psi(L)Q^{-1}$. Since Q is arbitrary, you see that we can form many different linear combinations of shocks, and correspondingly many response functions. Which combination shall we use?

1.5.2 Orthogonalization and model specification

In economic modeling, we calculate the impulse-response dynamics because we are interested in how economic variables respond to a certain source of shocks. If the shocks are correlated, then it is hard to identify the response to a particular shock. From that point of view, we may want to choose Q to make $u_t = Q\epsilon_t$ orthonormal, i.e., uncorrelated with each other and with unit variance, $E(u_t u_t') = I$. To do so, we need a Q such that

$$Q^{-1}(Q^{-1})' = \Omega,$$

so that $E(u_t u_t') = E(Q\epsilon_t\epsilon_t'Q') = Q\Omega Q' = I_k$. So we can use the Choleski decomposition to find Q. However, Q is still not unique, as you can form other Qs by multiplying by an orthogonal matrix. Sims (1980) proposes that we could specify the model by choosing a particular leading term in the coefficients, $A_0$. In (6), we see that $\Psi_0 = I_k$. However, in (7), $A_0 = Q^{-1}$ cannot be an identity matrix unless Ω is diagonal. In our example, we would choose the Q which produces $A_0 = Q^{-1}$ as a lower triangular matrix, so that after this transformation the shock $u_{2t}$ has no instantaneous effect on $x_{1t}$. The nice thing is that the Choleski decomposition itself produces a triangular matrix.

Example 1 Consider an AR(1) process for a 2-dimensional vector,

$$\begin{pmatrix} x_{1t} \\ x_{2t} \end{pmatrix} = \begin{pmatrix} 0.5 & 0.2 \\ 0.3 & 0.4 \end{pmatrix}\begin{pmatrix} x_{1,t-1} \\ x_{2,t-1} \end{pmatrix} + \begin{pmatrix} \epsilon_{1t} \\ \epsilon_{2t} \end{pmatrix},$$
where
$$\Omega = E(\epsilon_t\epsilon_t') = \begin{pmatrix} 2 & 1 \\ 1 & 4 \end{pmatrix}.$$
First we verify that this process is stationary:
$$\left|\begin{pmatrix} \lambda & 0 \\ 0 & \lambda \end{pmatrix} - \begin{pmatrix} 0.5 & 0.2 \\ 0.3 & 0.4 \end{pmatrix}\right| = 0$$
gives $\lambda_1 = 0.7$ and $\lambda_2 = 0.2$, both lying inside the unit circle. Invert it to a moving average process, $x_t = \Psi(L)\epsilon_t$.

We know that $\Psi_0 = I_2$, $\Psi_1 = \Phi_1$, etc. Then we find Q by the Choleski decomposition of Ω, which gives
$$Q = \begin{pmatrix} 0.70 & 0 \\ -0.27 & 0.53 \end{pmatrix} \quad \text{and} \quad Q^{-1} = \begin{pmatrix} 1.41 & 0 \\ 0.70 & 1.87 \end{pmatrix}.$$
Then we can write
$$x_t = \Psi(L)Q^{-1}Q\epsilon_t = \Psi(L)Q^{-1}u_t,$$
where we define $u_t = Q\epsilon_t$. Then we have

$$x_t = \Psi_0 Q^{-1}u_t + \Psi_1 Q^{-1}u_{t-1} + \dots,$$
or
$$\begin{pmatrix} x_{1t} \\ x_{2t} \end{pmatrix} = \begin{pmatrix} 1.41 & 0 \\ 0.70 & 1.87 \end{pmatrix}\begin{pmatrix} u_{1t} \\ u_{2t} \end{pmatrix} + \begin{pmatrix} 0.85 & 0.37 \\ 0.70 & 0.75 \end{pmatrix}\begin{pmatrix} u_{1,t-1} \\ u_{2,t-1} \end{pmatrix} + \dots$$

In this example you see that we have found a unique MA representation which is a linear combination of uncorrelated errors ($E(u_t u_t') = I_2$), and the second source of shocks does not have an instantaneous effect on $x_{1t}$. We can then use this representation to compute the impulse-responses. There are also other ways to specify the representation, depending on the problem of interest. For example, Quah (1988) suggests finding a Q so that the long-run response of one variable to another shock is zero.
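The numbers in Example 1 can be reproduced with a few lines of code. The following sketch (Python/numpy; variable names are ours) computes the Choleski factor and the first orthogonalized impulse-response matrices:

import numpy as np

# Example 1: Phi1 and Omega as given in the text.
Phi1 = np.array([[0.5, 0.2],
                 [0.3, 0.4]])
Omega = np.array([[2.0, 1.0],
                  [1.0, 4.0]])

# Choleski: Q^{-1} is the lower-triangular factor P with P P' = Omega,
# so that u_t = Q eps_t has E(u_t u_t') = Q Omega Q' = I.
Qinv = np.linalg.cholesky(Omega)     # roughly [[1.41, 0], [0.70, 1.87]]
Q = np.linalg.inv(Qinv)              # roughly [[0.70, 0], [-0.27, 0.53]], as in the text

# Orthogonalized impulse responses A_s = Psi_s Q^{-1}, with Psi_0 = I, Psi_1 = Phi1, ...
Psi = [np.eye(2), Phi1, Phi1 @ Phi1]           # first few MA coefficients of this VAR(1)
A = [P @ Qinv for P in Psi]
print(A[0])   # instantaneous response, lower triangular: u_2 has no impact effect on x_1
print(A[1])   # roughly [[0.85, 0.37], [0.70, 0.75]], matching the example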

1.5.3 Variance decomposition

Now let's consider how we can decompose the variance of the forecast errors. Write $x_t = \Psi(L)\epsilon_t = A(L)u_t$, where $A(L) = \Psi(L)Q^{-1}$, $u_t = Q\epsilon_t$, and $E(u_t u_t') = I$. For simplicity, let $x_t = (x_{1t}, x_{2t})'$. Suppose we do one-period-ahead forecasting, and let $y_{t+1}$ denote the forecast error,

$$y_{t+1} = x_{t+1} - E_t(x_{t+1}) = A_0 u_{t+1} = \begin{pmatrix} A_{11}^0 & A_{12}^0 \\ A_{21}^0 & A_{22}^0 \end{pmatrix}\begin{pmatrix} u_{1,t+1} \\ u_{2,t+1} \end{pmatrix},$$
where $A_{ij}^0$ denotes the ijth element of $A_0$.

Since $E(u_{1t}u_{2t}) = 0$ and $E(u_{it}^2) = 1$, the variance of the forecast error is given by $E(y_{t+1}y_{t+1}') = A_0 A_0'$. So the variance of the forecast error for $x_{1t}$ is given by $(A_{11}^0)^2 + (A_{12}^0)^2$. We can interpret $(A_{11}^0)^2$ as the amount of the one-step-ahead forecast error variance due to shock $u_1$, and $(A_{12}^0)^2$ as the amount due to shock $u_2$. Similarly, the variance of the forecast error of $x_{2t}$ is given by $(A_{21}^0)^2 + (A_{22}^0)^2$, and we can interpret the two terms as the amounts due to shocks $u_1$ and $u_2$ respectively. The variance of the k-period-ahead forecast error can be computed in a similar way.
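Continuing Example 1, a short sketch (Python/numpy; hypothetical variable names, with A_0 = Q^{-1} rounded as in the text) computes the one-step-ahead forecast error variances and the shares attributed to each orthogonalized shock:

import numpy as np

# One-step-ahead forecast error variance decomposition, continuing Example 1.
A0 = np.array([[1.41, 0.00],       # A_0 = Q^{-1} from the Choleski factorization above
               [0.70, 1.87]])

total_var = (A0 ** 2).sum(axis=1)          # row i: variance of the forecast error of x_i
shares = (A0 ** 2) / total_var[:, None]    # column j: fraction attributed to shock u_j
print(total_var)   # approx [2.0, 4.0], i.e. the diagonal of Omega
print(shares)      # x1: 100% due to u1; x2: roughly 12% due to u1 and 88% due to u2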

2 Estimation of VAR(p) process

2.1 Maximum Likelihood Estimation

Usually we use the conditional likelihood in VAR estimation (recall that conditional likelihood functions are much easier to work with than unconditional likelihood functions). Given a k-vector VAR(p) process,

$$y_t = c + \Phi_1 y_{t-1} + \Phi_2 y_{t-2} + \dots + \Phi_p y_{t-p} + \epsilon_t,$$
we could rewrite it more concisely as
$$y_t = \Pi'x_t + \epsilon_t,$$
where
$$\Pi = \begin{pmatrix} c' \\ \Phi_1' \\ \Phi_2' \\ \vdots \\ \Phi_p' \end{pmatrix} \quad \text{and} \quad x_t = \begin{pmatrix} 1 \\ y_{t-1} \\ y_{t-2} \\ \vdots \\ y_{t-p} \end{pmatrix}.$$
If we assume that $\epsilon_t \sim \text{i.i.d. } N(0, \Omega)$, then we can use MLE to estimate the parameters in $\theta = (c, \Pi, \Omega)$. Proceeding in the same way as in the scalar case, assume that we have observed $(y_{-p+1}, \dots, y_0)$; then the conditional density of $y_t$ given $x_t$ is

$$L(y_t, x_t; \theta) = (2\pi)^{-k/2}|\Omega^{-1}|^{1/2}\exp[(-1/2)(y_t - \Pi'x_t)'\Omega^{-1}(y_t - \Pi'x_t)].$$

10 The log likelihood function of observations (y1,..., yn) is (constant omitted)

$$l(y, x; \theta) = (n/2)\log|\Omega^{-1}| - (1/2)\sum_{t=1}^{n}(y_t - \Pi'x_t)'\Omega^{-1}(y_t - \Pi'x_t). \qquad (8)$$
Taking the first derivatives with respect to Π and Ω, we have that
$$\hat{\Pi}_n' = \left[\sum_{t=1}^{n}y_t x_t'\right]\left[\sum_{t=1}^{n}x_t x_t'\right]^{-1}.$$
The jth row of $\hat{\Pi}_n'$ is
$$\hat{\pi}_j' = \left[\sum_{t=1}^{n}y_{jt}x_t'\right]\left[\sum_{t=1}^{n}x_t x_t'\right]^{-1},$$

which is the estimated coefficient vector from an OLS regression of $y_{jt}$ on $x_t$. So the MLE estimates of the coefficients for the jth equation of a VAR are found by an OLS regression of $y_{jt}$ on a constant term and p lags of all of the variables in the system. The MLE estimate of Ω is
$$\hat{\Omega}_n = (1/n)\sum_{t=1}^{n}\hat{\epsilon}_t\hat{\epsilon}_t',$$
where
$$\hat{\epsilon}_t = y_t - \hat{\Pi}_n'x_t.$$
The details of the derivations can be found on pages 292-296 of Hamilton's book. The MLE estimates $\hat{\Pi}$ and $\hat{\Omega}$ are consistent even if the true innovations are non-Gaussian. In the next subsection, we will consider regression with non-Gaussian errors, and we will use the LS approach to derive the asymptotics.
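A compact sketch of this equation-by-equation OLS estimator (Python/numpy; the function name, the simulated data, and the use of n − p rather than n in Ω̂ are our own choices for illustration):

import numpy as np

def estimate_var(y, p):
    """Sketch of the (conditional) MLE / OLS estimate of a VAR(p) with intercept:
    regress y_t on x_t = (1, y_{t-1}', ..., y_{t-p}')'.  y is an (n x k) array.
    Returns Pi_hat ((kp+1) x k) and Omega_hat (k x k)."""
    y = np.asarray(y, dtype=float)
    n, k = y.shape
    # Regressor matrix X with rows x_t' for t = p+1, ..., n.
    X = np.hstack([np.ones((n - p, 1))] +
                  [y[p - j:n - j] for j in range(1, p + 1)])
    Y = y[p:]
    # Pi_hat solves the normal equations; one OLS regression per equation.
    Pi_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ Pi_hat
    Omega_hat = resid.T @ resid / (n - p)
    return Pi_hat, Omega_hat

# Hypothetical usage on data simulated from the Example 1 VAR(1):
rng = np.random.default_rng(1)
Phi1 = np.array([[0.5, 0.2], [0.3, 0.4]])
y = np.zeros((1000, 2))
for t in range(1, 1000):
    y[t] = Phi1 @ y[t - 1] + rng.multivariate_normal([0, 0], [[2, 1], [1, 4]])
Pi_hat, Omega_hat = estimate_var(y, p=1)
print(Pi_hat[1:].T)    # should be close to Phi1
print(Omega_hat)       # should be close to Omega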

2.2 LS estimation and asymptotics

The asymptotic distribution of $\hat{\Pi}$ is summarized in the following proposition.

Proposition 3 Let

$$y_t = c + \Phi_1 y_{t-1} + \Phi_2 y_{t-2} + \dots + \Phi_p y_{t-p} + \epsilon_t,$$

where $\epsilon_t$ is i.i.d. $(0, \Omega)$ with $E(\epsilon_{it}\epsilon_{jt}\epsilon_{lt}\epsilon_{mt}) < \infty$ for all i, j, l, and m, and where the roots of

$$|I_k - \Phi_1 z - \dots - \Phi_p z^p| = 0$$
lie outside the unit circle. Let m = kp + 1 and let

$$x_t = (1, y_{t-1}', y_{t-2}', \dots, y_{t-p}')',$$

so $x_t$ is an m-dimensional vector. Let $\hat{\pi}_n = \text{vec}(\hat{\Pi}_n)$ denote the km × 1 vector of coefficients resulting from OLS regressions of each of the elements of $y_t$ on $x_t$ for a sample of size n:

$$\hat{\pi}_n = (\hat{\pi}_{1,n}', \hat{\pi}_{2,n}', \dots, \hat{\pi}_{k,n}')',$$

where
$$\hat{\pi}_{i,n} = \left[\sum_{t=1}^{n}x_t x_t'\right]^{-1}\left[\sum_{t=1}^{n}x_t y_{it}\right],$$
and let π denote the km × 1 vector of true parameters. Finally, let

$$\hat{\Omega}_n = n^{-1}\sum_{t=1}^{n}\hat{\epsilon}_t\hat{\epsilon}_t',$$
where

$$\hat{\epsilon}_t = (\hat{\epsilon}_{1t}, \hat{\epsilon}_{2t}, \dots, \hat{\epsilon}_{kt})', \qquad \hat{\epsilon}_{it} = y_{it} - x_t'\hat{\pi}_{i,n}.$$

Then

(a) $n^{-1}\sum_{t=1}^{n}x_t x_t' \to_p Q$, where $Q = E(x_t x_t')$;

(b) πˆ n →p π;

(c) Ωˆ n →p Ω;

(d) $\sqrt{n}(\hat{\pi}_n - \pi) \to_d N(0, \Omega \otimes Q^{-1})$.

Result (a) is a vector version of the result that sample second moments converge to population moments, and it follows since the MA coefficients are absolutely summable and the process has finite fourth moments. Results (b) and (c) are similar to the derivations for a single OLS regression in case 3 of Lecture 5. To show result (d), let
$$Q_n = n^{-1}\sum_{t=1}^{n}x_t x_t';$$
then we can write
$$\sqrt{n}(\hat{\pi}_{i,n} - \pi_i) = Q_n^{-1}\left[n^{-1/2}\sum_{t=1}^{n}x_t\epsilon_{it}\right]$$
and
$$\sqrt{n}(\hat{\pi}_n - \pi) = \begin{pmatrix} Q_n^{-1}n^{-1/2}\sum_{t=1}^{n}x_t\epsilon_{1t} \\ Q_n^{-1}n^{-1/2}\sum_{t=1}^{n}x_t\epsilon_{2t} \\ \vdots \\ Q_n^{-1}n^{-1/2}\sum_{t=1}^{n}x_t\epsilon_{kt} \end{pmatrix}. \qquad (9)$$
Define $\xi_t$ to be the km × 1 vector
$$\xi_t = \begin{pmatrix} x_t\epsilon_{1t} \\ x_t\epsilon_{2t} \\ \vdots \\ x_t\epsilon_{kt} \end{pmatrix}.$$

Note that $\xi_t$ is an mds with finite fourth moments and variance
$$E(\xi_t\xi_t') = \begin{pmatrix} E(\epsilon_{1t}^2) & E(\epsilon_{1t}\epsilon_{2t}) & \dots & E(\epsilon_{1t}\epsilon_{kt}) \\ E(\epsilon_{2t}\epsilon_{1t}) & E(\epsilon_{2t}^2) & \dots & E(\epsilon_{2t}\epsilon_{kt}) \\ \vdots & \vdots & \ddots & \vdots \\ E(\epsilon_{kt}\epsilon_{1t}) & E(\epsilon_{kt}\epsilon_{2t}) & \dots & E(\epsilon_{kt}^2) \end{pmatrix} \otimes E(x_t x_t') = \Omega \otimes Q.$$

We can also show that
$$n^{-1}\sum_{t=1}^{n}\xi_t\xi_t' \to_p \Omega \otimes Q.$$
Applying the CLT for vector mds, we have
$$n^{-1/2}\sum_{t=1}^{n}\xi_t \to_d N(0, \Omega \otimes Q). \qquad (10)$$
Now rewrite (9) as

 −1   −1/2 Pn  Qn 0 ... 0 n t=1 xt1t −1 −1/2 Pn √  0 Qn ... 0   n xt2t  n(πˆ − π) =    t=1  n  . . .   .   ......   .  −1 −1/2 Pn 0 0 ...Qn n t=1 xtkt n −1 −1/2 X = (Ik ⊗ Qn )n ξt t=1 −1 −1 By result (a) we have Qn →p Q . Thus n 1/2 −1 −1/2 X n (πˆ n − π) →p (Ik ⊗ Q )n ξt. t=1 From (10) we know that this has a distribution that is Gaussian with mean 0 and variance

$$(I_k \otimes Q^{-1})(\Omega \otimes Q)(I_k \otimes Q^{-1}) = (I_k\Omega I_k) \otimes (Q^{-1}QQ^{-1}) = \Omega \otimes Q^{-1}.$$

Hence we get result (d). Each $\hat{\pi}_i$ has the distribution
$$\sqrt{n}(\hat{\pi}_{i,n} - \pi_i) \to_d N(0, \sigma_i^2 Q^{-1}).$$
Given that the estimates are asymptotically normal, we can use them to test linear or nonlinear restrictions on the coefficients with the Wald test. We know that vec is an operator that stacks the columns of a k × k matrix into one k² × 1 vector. A similar operator, vech, stacks the elements on and below the principal diagonal (so it transforms a k × k matrix into one k(k+1)/2 × 1 vector). For example,
$$A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}, \qquad \text{vech}(A) = \begin{pmatrix} a_{11} \\ a_{21} \\ a_{22} \end{pmatrix}.$$

We will apply this operator to the variance matrix, which is symmetric. The joint distribution of $\hat{\pi}_n$ and $\hat{\Omega}_n$ is given in the following proposition.

Proposition 4 Let

$$y_t = c + \Phi_1 y_{t-1} + \Phi_2 y_{t-2} + \dots + \Phi_p y_{t-p} + \epsilon_t,$$

where $\epsilon_t$ is i.i.d. $N(0, \Omega)$ and where the roots of

$$|I_k - \Phi_1 z - \dots - \Phi_p z^p| = 0$$
lie outside the unit circle. Let $\hat{\pi}_n$, $\hat{\Omega}_n$, and Q be as defined in Proposition 3. Then

$$\begin{pmatrix} n^{1/2}[\hat{\pi}_n - \pi] \\ n^{1/2}[\text{vech}(\hat{\Omega}_n) - \text{vech}(\Omega)] \end{pmatrix} \to_d N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \Omega \otimes Q^{-1} & 0 \\ 0 & \Sigma_{22} \end{pmatrix}\right).$$

Let $\sigma_{ij}$ denote the ijth element of Ω. Then the element of $\Sigma_{22}$ corresponding to the covariance between $\hat{\sigma}_{ij}$ and $\hat{\sigma}_{lm}$ is given by $(\sigma_{il}\sigma_{jm} + \sigma_{im}\sigma_{jl})$ for all $i, j, l, m = 1, \dots, k$.

The detailed proof can be found on pages 341-342 of Hamilton's book. Basically there are three steps: first, we show that $\hat{\Omega}_n = n^{-1}\sum_{t=1}^{n}\hat{\epsilon}_t\hat{\epsilon}_t'$ has the same asymptotic distribution as $\Omega_n^* = n^{-1}\sum_{t=1}^{n}\epsilon_t\epsilon_t'$. In the second step, note that

$$\begin{pmatrix} n^{1/2}[\hat{\pi}_n - \pi] \\ n^{1/2}[\text{vech}(\hat{\Omega}_n) - \text{vech}(\Omega)] \end{pmatrix} \quad \text{has the same limiting distribution as} \quad \begin{pmatrix} (I_k \otimes Q^{-1})\,n^{-1/2}\sum_{t=1}^{n}\xi_t \\ n^{-1/2}\sum_{t=1}^{n}\lambda_t \end{pmatrix},$$
where
$$\lambda_t = \text{vech}\begin{pmatrix} \epsilon_{1t}^2 - \sigma_{11} & \dots & \epsilon_{1t}\epsilon_{kt} - \sigma_{1k} \\ \vdots & \ddots & \vdots \\ \epsilon_{kt}\epsilon_{1t} - \sigma_{k1} & \dots & \epsilon_{kt}^2 - \sigma_{kk} \end{pmatrix}.$$
Now $(\xi_t', \lambda_t')'$ is an mds, and we apply the CLT for mds to get (with a few more computations)

$$\begin{pmatrix} n^{-1/2}\sum_{t=1}^{n}\xi_t \\ n^{-1/2}\sum_{t=1}^{n}\lambda_t \end{pmatrix} \to_d N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \Omega \otimes Q & 0 \\ 0 & \Sigma_{22} \end{pmatrix}\right).$$

The final step in the proof is to show that $E(\lambda_t\lambda_t')$ is given by the matrix $\Sigma_{22}$ described in the proposition, which can be proved using a constructed error sequence that is uncorrelated Gaussian with zero mean and unit variance (see Hamilton's book for details). With the asymptotic variance of $\hat{\Omega}_n$, we can then test whether two errors are correlated. For example, for k = 2,

     2 2  σˆ11,n − σ11 0 2σ11 2σ11σ12 σ12 √ 2 n  σˆ12,n − σ12  →d N  0  ,  2σ11σ12 σ11σ22 + σ12 2σ12σ22  . 2 2 σˆ22,n − σ22 0 2σ12 2σ12σ22 2σ22

Then a test of the null hypothesis that there is no covariance between $\epsilon_{1t}$ and $\epsilon_{2t}$ is given by
$$\frac{\sqrt{n}\,\hat{\sigma}_{12}}{(\hat{\sigma}_{11}\hat{\sigma}_{22} + \hat{\sigma}_{12}^2)^{1/2}} \approx N(0, 1).$$

The matrix $\Sigma_{22}$ can be expressed more compactly using the duplication matrix. The duplication matrix $D_k$ is a matrix of size k² × k(k+1)/2 that transforms vech(Ω) into vec(Ω), i.e.,

$$D_k\,\text{vech}(\Omega) = \text{vec}(\Omega).$$
For example, for k = 2,
$$\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}\begin{pmatrix} \sigma_{11} \\ \sigma_{21} \\ \sigma_{22} \end{pmatrix} = \begin{pmatrix} \sigma_{11} \\ \sigma_{21} \\ \sigma_{12} \\ \sigma_{22} \end{pmatrix}.$$
Define
$$D_k^+ \equiv (D_k'D_k)^{-1}D_k'.$$
Note that $D_k^+D_k = I_{k(k+1)/2}$. $D_k^+$ is like the 'reverse' of $D_k$, as it transforms vec(Ω) into vech(Ω),

$$\text{vech}(\Omega) = D_k^+\text{vec}(\Omega).$$
For example, when k = 2, we have
$$\begin{pmatrix} \sigma_{11} \\ \sigma_{21} \\ \sigma_{22} \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1/2 & 1/2 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}\begin{pmatrix} \sigma_{11} \\ \sigma_{21} \\ \sigma_{12} \\ \sigma_{22} \end{pmatrix}.$$

With $D_k$ and $D_k^+$ we can write
$$\Sigma_{22} = 2D_k^+(\Omega \otimes \Omega)(D_k^+)'.$$
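This compact expression is easy to verify numerically. The sketch below (Python/numpy; function names are ours) builds D_k and D_k^+ and evaluates 2D_k^+(Ω ⊗ Ω)(D_k^+)' for a k = 2 example, which can be compared with the explicit 3 × 3 matrix displayed above:

import numpy as np

def duplication_matrix(k):
    """Sketch: D_k of size k^2 x k(k+1)/2 with D_k vech(A) = vec(A)
    for symmetric A (vec stacks columns, vech the lower triangle)."""
    D = np.zeros((k * k, k * (k + 1) // 2))
    col = 0
    for j in range(k):             # column index of A
        for i in range(j, k):      # row index, lower triangle (i >= j)
            D[j * k + i, col] = 1  # position of a_ij in vec(A)
            D[i * k + j, col] = 1  # position of a_ji in vec(A)
            col += 1
    return D

def sigma22(Omega):
    """Asymptotic variance of n^{1/2} vech(Omega_hat): 2 D_k^+ (Omega kron Omega) D_k^+'."""
    k = Omega.shape[0]
    D = duplication_matrix(k)
    Dplus = np.linalg.solve(D.T @ D, D.T)       # D_k^+ = (D_k' D_k)^{-1} D_k'
    return 2 * Dplus @ np.kron(Omega, Omega) @ Dplus.T

# Check against the explicit k = 2 matrix given above, with sigma11 = 2, sigma12 = 1, sigma22 = 4:
Omega = np.array([[2.0, 1.0],
                  [1.0, 4.0]])
print(sigma22(Omega))
# expected: [[8, 4, 2], [4, 9, 8], [2, 8, 32]], i.e.
# [[2*s11^2, 2*s11*s12, 2*s12^2], [., s11*s22 + s12^2, 2*s12*s22], [2*s12^2, 2*s12*s22, 2*s22^2]]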

3 Granger Causality

In most regressions in econometrics, it is very hard to discuss causality. For instance, the significance of the coefficient β in the regression $y_i = \beta x_i + \epsilon_i$ only tells us about the 'co-occurrence' of x and y, not that x causes y. In other words, the regression usually only tells us there is some 'relationship' between x and y, and does not tell us the nature of the relationship, such as whether x causes y or y causes x.
One good thing about time series vector autoregression is that we can test 'causality' in some sense. This test was first proposed by Granger (1969), and therefore we refer to it as Granger causality. We will restrict our discussion to a system of two variables, x and y. y is said to Granger-cause x if current or lagged values of y help to predict future values of x. On the other hand, y fails to Granger-cause x if for all s > 0, the mean squared error of a forecast of $x_{t+s}$ based on $(x_t, x_{t-1}, \dots)$ is the same as that based on $(y_t, y_{t-1}, \dots)$ and $(x_t, x_{t-1}, \dots)$. If we restrict ourselves to linear functions, y fails to Granger-cause x if

$$MSE[\hat{E}(x_{t+s}|x_t, x_{t-1}, \dots)] = MSE[\hat{E}(x_{t+s}|x_t, x_{t-1}, \dots, y_t, y_{t-1}, \dots)].$$

Equivalently, we can say that x is exogenous in the time series sense with respect to y, or y is not linearly informative about future x.

In the VAR equation, the example we proposed above implies a lower triangular coefficient matrix:

$$\begin{pmatrix} x_t \\ y_t \end{pmatrix} = \begin{pmatrix} c_1 \\ c_2 \end{pmatrix} + \begin{pmatrix} \phi_{11}^1 & 0 \\ \phi_{21}^1 & \phi_{22}^1 \end{pmatrix}\begin{pmatrix} x_{t-1} \\ y_{t-1} \end{pmatrix} + \dots + \begin{pmatrix} \phi_{11}^p & 0 \\ \phi_{21}^p & \phi_{22}^p \end{pmatrix}\begin{pmatrix} x_{t-p} \\ y_{t-p} \end{pmatrix} + \begin{pmatrix} \epsilon_{1t} \\ \epsilon_{2t} \end{pmatrix}. \qquad (11)$$
Or, if we use the MA representation,

$$\begin{pmatrix} x_t \\ y_t \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} + \begin{pmatrix} \phi_{11}(L) & 0 \\ \phi_{21}(L) & \phi_{22}(L) \end{pmatrix}\begin{pmatrix} \epsilon_{1t} \\ \epsilon_{2t} \end{pmatrix}, \qquad (12)$$
where
$$\phi_{ij}(L) = \phi_{ij}^0 + \phi_{ij}^1 L + \phi_{ij}^2 L^2 + \dots$$
with $\phi_{11}^0 = \phi_{22}^0 = 1$ and $\phi_{21}^0 = 0$. Another implication of Granger causality is stressed by Sims (1972).

Proposition 5 Consider a linear projection of yt on past, present and future x’s,

$$y_t = c + \sum_{j=0}^{\infty}b_j x_{t-j} + \sum_{j=1}^{\infty}d_j x_{t+j} + \eta_t, \qquad (13)$$
where $E(\eta_t x_\tau) = 0$ for all t and τ. Then y fails to Granger-cause x iff $d_j = 0$ for $j = 1, 2, \dots$.
Econometric tests of whether the series y Granger-causes x can be based on any of the three implications (11), (12), or (13). The simplest test is to estimate the regression based on (11),
$$x_t = c_1 + \sum_{i=1}^{p}\alpha_i x_{t-i} + \sum_{j=1}^{p}\beta_j y_{t-j} + u_t,$$
using OLS and then conduct an F-test of the null hypothesis

H0 : β1 = β2 = ... = βp = 0.
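A minimal sketch of this test (Python/numpy; the function name, lag length, and simulated data are hypothetical) estimates the restricted and unrestricted regressions by OLS and forms the F statistic:

import numpy as np

def granger_f_stat(x, y, p):
    """Sketch of the Granger-causality F test of H0: beta_1 = ... = beta_p = 0 in
    x_t = c + sum_i alpha_i x_{t-i} + sum_j beta_j y_{t-j} + u_t.
    Returns the F statistic; compare with an F(p, T - 2p - 1) critical value,
    where T = n - p is the number of usable observations."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    T = n - p
    lags = lambda z: np.column_stack([z[p - j:n - j] for j in range(1, p + 1)])
    target = x[p:]
    X_r = np.column_stack([np.ones(T), lags(x)])            # restricted: own lags only
    X_u = np.column_stack([X_r, lags(y)])                   # unrestricted: add lags of y
    rss = lambda X: np.sum((target - X @ np.linalg.lstsq(X, target, rcond=None)[0]) ** 2)
    rss_r, rss_u = rss(X_r), rss(X_u)
    return ((rss_r - rss_u) / p) / (rss_u / (T - 2 * p - 1))

# Hypothetical usage: y leads x, so the statistic should be large.
rng = np.random.default_rng(2)
y = rng.standard_normal(500)
x = np.zeros(500)
for t in range(1, 500):
    x[t] = 0.3 * x[t - 1] + 0.8 * y[t - 1] + 0.5 * rng.standard_normal()
print(granger_f_stat(x, y, p=2))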

Note: we have to be aware that Granger causality is not the same as what we usually mean by causality. For instance, even if $x_1$ does not cause $x_2$, it may still help to predict $x_2$, and thus Granger-cause $x_2$, if changes in $x_1$ precede those of $x_2$ for some reason. A naive example: we observe that dragonflies fly much lower before a rain storm, due to the lower air pressure. We know that dragonflies do not cause a rain storm, but they do help to predict a rain storm, and thus Granger-cause a rain storm.

Reading: Hamilton, Ch. 10, 11, 14.
