Time Series Class Notes. Manuel Arellano. Revised: February 12, 2018
1 Time series as stochastic outcomes
A time series is a sequence of data points {w_t}_{t=1}^T observed over time, typically at equally spaced intervals; for example, quarterly GDP per capita or the daily number of tweets that mention a specific product. We wish to discuss probabilistic models that regard the observed time series as a realization of a probability distribution function f(w_1, ..., w_T). In a random sample, observations are independent and identically distributed, so that f(w_1, ..., w_T) = ∏_{t=1}^T f(w_t). In a time series, observations near in time tend to be more similar, in which case the independence assumption is not appropriate. Moreover, the level or other features of the series often change over time, and in that case the assumption of identically distributed observations is not appropriate either. Thus, the joint distribution of the data may differ from the product of the marginal distributions of each data point
f(w_1, ..., w_T) ≠ f_1(w_1) × f_2(w_2) × ... × f_T(w_T)

and the form of those marginal distributions may change over time. Due to the natural temporal ordering of the data, a factorization that will often be useful is
f(w_1, ..., w_T) = f_1(w_1) ∏_{t=2}^T f_t(w_t | w_{t−1}, ..., w_1).

If the distributions f_1(·), f_2(·), ... changed arbitrarily, there would be no regularities on which to base their statistical analysis, since we only have one observation on each distribution. Similarly, if the joint distributions of consecutive pairs of observations changed arbitrarily, there would be no regularities on which to base the analysis of their dependence. Thus, modeling the dependence among observations and their evolving pattern is central to time series analysis. The basic building block that facilitates statistical analysis of time series is some stationary form of dependence that preserves the assumption of identically distributed observations. First we introduce the concept of stationary dependence, and later on we discuss ways of introducing nonstationarities.
Stochastic process A stochastic process is a collection of random variables indexed by the elements of a set of indices. The set may be finite or infinite and contain integer or real numbers. Integer indices may be equidistant or irregularly spaced. In the case of our time series the set is {1, 2, 3, ..., T}, but it is usually convenient to consider a set of indices t covering all integers from −∞ to +∞. In that case we are dealing with a doubly infinite sequence of the form
{w_t}_{t=−∞}^∞ = {..., w_{−1}, w_0, w_1, w_2, ..., w_T, w_{T+1}, ...},

which is regarded as a single realization of the stochastic process; the observed time series is a portion of this realization. Hypothetical repeated realizations of the process could be indexed as:
{w_t^{(1)}, w_t^{(2)}, ..., w_t^{(n)}}_{t=−∞}^∞.

2 Stationarity
A process {w_t} is (strictly) stationary if the joint distribution of {w_{t_1}, w_{t_2}, ..., w_{t_k}} for a given subset of indices {t_1, t_2, ..., t_k} is equal to the joint distribution of {w_{t_1+j}, w_{t_2+j}, ..., w_{t_k+j}} for any j > 0. That is, the distribution of a collection of time points depends only on how far apart they are, not on where they start:
f(w_{t_1}, w_{t_2}, ..., w_{t_k}) = f(w_{t_1+j}, w_{t_2+j}, ..., w_{t_k+j}).
This implies that marginal distributions are all equal, that the joint distribution of any pair of variables only depends on the time interval between them, and so on:
f_{w_t}(·) = f_{w_s}(·) for all t, s
f_{w_t,w_s}(·,·) = f_{w_{t+j},w_{s+j}}(·,·) for all t, s, j.
In terms of moments, the implication is that the unconditional mean and variance (µ_t, σ_t²) of the distribution f_{w_t}(·) are constant:
E(w_t) ≡ µ_t = µ,  Var(w_t) ≡ σ_t² = σ².
Moreover, the covariance γ_{t,s} between w_t and w_s depends only on |t − s|:
Cov(w_t, w_s) ≡ γ_{t,s} = γ_{|t−s|}.

Thus, using the notation γ_0 = σ², the covariance matrix of w = (w_1, ..., w_T) takes the form
Var(w) = [ γ_0      γ_1      γ_2      ...  γ_{T−1} ]
         [ γ_1      γ_0      γ_1      ...  γ_{T−2} ]
         [ γ_2      γ_1      γ_0      ...  ...     ]
         [ ...      ...      ...      ...  γ_1     ]
         [ γ_{T−1}  γ_{T−2}  ...      γ_1  γ_0     ]

Similarly, the correlation ρ_{t,s} between w_t and w_s depends only on |t − s|:

ρ_{t,s} ≡ γ_{t,s}/(σ_t σ_s) = ρ_{|t−s|}.
The quantity ρ_j is called the autocorrelation of order j, and viewed as a function of j it is called the autocorrelation function.
A stationary process is also called strictly stationary, in contrast with weaker forms of stationarity.
For example, we speak of stationarity in mean if µ_t = µ, or of covariance stationarity (or weak stationarity) if the process is stationary in mean, variance, and covariances. In a normal process, covariance stationarity is equivalent to strict stationarity.
Processes that are uncorrelated or independent A sequence of serially uncorrelated random variables with zero mean and constant finite variance is called a “white noise” process; that is, white noise is a covariance stationary process w_t such that
E(w_t) = 0
Var(w_t) = γ_0 < ∞
Cov(w_t, w_{t−j}) = 0 for all j ≠ 0.

In this process observations are uncorrelated but not necessarily independent. In an independent white noise process, w_t is also statistically independent of past observations:
f(w_t | w_{t−1}, w_{t−2}, ..., w_1) = f(w_t).

Another possibility is a mean-independent white noise process that satisfies
E(w_t | w_{t−1}, w_{t−2}, ..., w_1) = 0.
In this case w_t is called a martingale difference sequence. A martingale difference is a stronger form of independence than uncorrelatedness but weaker than statistical independence. For example, a martingale difference does not rule out the possibility that E(w_t² | w_{t−1}, ..., w_1) depends on past observations.

Prediction Consider the problem of selecting a predictor of w_t given a set of past values
{w_{t−1}, ..., w_{t−j}}. The conditional mean E(w_t | w_{t−1}, ..., w_{t−j}) is the best predictor when the loss function is quadratic. Similarly, the linear projection E*(w_t | w_{t−1}, ..., w_{t−j}) is the best linear predictor under quadratic loss. For example, for a stationary process w_t
E*(w_t | w_{t−1}) = α + βw_{t−1}

with β = γ_1/γ_0 and α = (1 − β)µ. We can also write
w_t = α + βw_{t−1} + ν_t

where ν_t is the prediction error, which by construction is orthogonal to w_{t−1}. For convenience, predictors based on the entire past history are often considered:
E_{t−1}(w_t) = E(w_t | w_{t−1}, w_{t−2}, ...)
or
E*_{t−1}(w_t) = E*(w_t | w_{t−1}, w_{t−2}, ...),

which are defined as the corresponding quadratic-mean limits of the predictors given {w_{t−1}, ..., w_{t−j}} as j → ∞.
Linear predictor k periods ahead Let w_t be a stationary time series with zero mean and let u_t denote the innovation in w_t, so that
w_t = E*_{t−1}(w_t) + u_t.

Here u_t is a one-step-ahead forecast error that is orthogonal to all past values of the series. Similarly,
w_{t+1} = E*_t(w_{t+1}) + u_{t+1}.
Moreover, since the spaces spanned by (w_t, w_{t−1}, w_{t−2}, ...) and (u_t, w_{t−1}, w_{t−2}, ...) are the same, and u_t is orthogonal to (w_{t−1}, w_{t−2}, ...), we have:
E*_t(w_{t+1}) = E*(w_{t+1} | u_t, w_{t−1}, w_{t−2}, ...) = E*_{t−1}(w_{t+1}) + E*(w_{t+1} | u_t).
Thus, E*(w_{t+1} | u_t) + u_{t+1} is the two-step-ahead forecast error in w_{t+1}. In a similar way we can obtain incremental errors for E*_{t−1}(w_{t+2}), ..., E*_{t−1}(w_{t+k}).
Wold decomposition Letting E*(w_{t+1} | u_t) = ψ_1 u_t, we can write
w_{t+1} = u_{t+1} + ψ_1 u_t + E*_{t−1}(w_{t+1})

and repeating the argument we obtain the following representation of the process:
w_t = (u_t + ψ_1 u_{t−1} + ψ_2 u_{t−2} + ...) + κ_t

where u_t ≡ w_t − E*_{t−1}(w_t), u_{t−1} ≡ w_{t−1} − E*_{t−2}(w_{t−1}), etc., and κ_t denotes the linear prediction of w_t at the beginning of the process. This representation is called the Wold decomposition, after the work of Herman Wold. It exists for any covariance stationary process with zero mean. The one-step-ahead forecast errors u_t are white noise, and it can be shown that Σ_{j=0}^∞ ψ_j² < ∞ (with ψ_0 = 1).¹

The term κ_t is called the linearly deterministic part of w_t because it is perfectly predictable from past observations of w_t. The other part, Σ_{j=0}^∞ ψ_j u_{t−j}, is the linearly indeterministic part of the process. The indeterministic part is the linear projection of w_t onto current and past linear forecast errors, and the deterministic part is the corresponding projection error. If κ_t = 0, w_t is a purely non-deterministic process, also called a linearly regular process.
¹ See T. Sargent, Macroeconomic Theory, 1979.
Ergodicity A stochastic process is ergodic if it has the same behavior averaged over time as averaged over the sample space. Specifically, a covariance stationary process is ergodic in mean if the time-series mean converges in probability to the same limit as a (hypothetical) cross-sectional mean
(known as the ensemble average), that is, to E(w_t) = µ:
w̄_T = (1/T) Σ_{t=1}^T w_t →_p µ.

Ergodicity requires that the autocovariances γ_j tend to zero sufficiently fast as j increases. In the next section we check that {w_t} is ergodic in mean if the following absolute summability condition is satisfied:

Σ_{j=0}^∞ |γ_j| < ∞.   (1)
Similarly, a covariance stationary process is ergodic in covariance if
(1/(T−j)) Σ_{t=j+1}^T (w_t − µ)(w_{t−j} − µ) →_p γ_j.

In the special case in which {w_t} is a normal stationary process, condition (1) guarantees ergodicity for all moments.²
Example of a stationary non-ergodic process Suppose that
w_t = η + ε_t
where η ~ iid(0, σ_η²) and ε_t ~ iid(0, σ_ε²), independent of η. We have

µ = E(w_t) = E(η) + E(ε_t) = 0
γ_0 = Var(w_t) = Var(η) + Var(ε_t) = σ_η² + σ_ε²
γ_j = Cov(w_t, w_{t−j}) = σ_η² for j ≠ 0.

Note that condition (1) is not satisfied in this example. Let the index i denote a realization of the process in the probability space. The process is stationary and yet
w̄_T^{(i)} = (1/T) Σ_{t=1}^T w_t^{(i)} →_p η^{(i)}

instead of converging to µ = 0. Moreover,
(1/(T−j)) Σ_{t=j+1}^T (w_t^{(i)} − η^{(i)})(w_{t−j}^{(i)} − η^{(i)}) →_p 0

(or (T−j)^{−1} Σ_{t=j+1}^T w_t^{(i)} w_{t−j}^{(i)} →_p (η^{(i)})²) instead of converging to γ_j = σ_η².

² In general, a stationary process is ergodic in distribution if T^{−1} Σ_{t=1}^T 1(w_t ≤ r) →_p Pr(w_t ≤ r) for any r, where 1(w_t ≤ r) = 1 if w_t ≤ r and 1(w_t ≤ r) = 0 if w_t > r.
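The failure of ergodicity is easy to see by simulation. In the sketch below (my parameter choices: σ_η = σ_ε = 1), each realization's time average settles at that realization's own η^{(i)} rather than at the ensemble mean µ = 0.

```python
import random
import statistics

random.seed(2)

# Stationary but non-ergodic process: w_t = eta + eps_t, where eta is
# drawn once per realization and eps_t is fresh noise each period.
sigma_eta, sigma_eps, T = 1.0, 1.0, 50_000
time_means, etas = [], []
for _ in range(5):                      # five independent realizations
    eta = random.gauss(0.0, sigma_eta)  # drawn once per realization
    w = [eta + random.gauss(0.0, sigma_eps) for _ in range(T)]
    etas.append(eta)
    time_means.append(statistics.fmean(w))

# Each time average is close to its own eta, so the time averages
# disagree across realizations even though E(w_t) = 0 for all of them.
gaps = [abs(tm - e) for tm, e in zip(time_means, etas)]
print(max(gaps) < 0.05, max(abs(tm) for tm in time_means) > 0.1)
```

Averaging over time recovers η^{(i)}, while averaging across realizations (the ensemble average) recovers µ = 0; ergodicity is precisely the requirement that these two averages agree.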
3 Asymptotic theory with dependent observations
Here we consider a law of large numbers and a central limit theorem for covariance stationary processes.
Law of Large Numbers Let {w_t} be a covariance stationary stochastic process with E(w_t) = µ and Cov(w_t, w_{t−j}) = γ_j such that Σ_{j=0}^∞ |γ_j| < ∞. Let the sample mean be w̄_T = (1/T) Σ_{t=1}^T w_t. Then (i) w̄_T →_p µ, and (ii) Var(√T w̄_T) → Σ_{j=−∞}^∞ γ_j.

A sufficient condition for w̄_T →_p µ is that E(w̄_T) → µ and Var(w̄_T) → 0. For any T we have E(w̄_T) = µ. Next,
Var(w̄_T) = E[(w̄_T − µ)²] = (1/T²) Σ_{t=1}^T Σ_{s=1}^T E[(w_t − µ)(w_s − µ)]
          = (1/T²) [T γ_0 + 2(T−1) γ_1 + 2(T−2) γ_2 + ... + 2γ_{T−1}].

To show that Var(w̄_T) → 0, we show that T Var(w̄_T) is bounded under the assumption Σ_{j=0}^∞ |γ_j| < ∞:

T Var(w̄_T) = γ_0 + ((T−1)/T) 2γ_1 + ((T−2)/T) 2γ_2 + ... + (1/T) 2γ_{T−1}   (2)
≤ γ_0 + ((T−1)/T) 2|γ_1| + ((T−2)/T) 2|γ_2| + ... + (1/T) 2|γ_{T−1}|
≤ γ_0 + 2|γ_1| + 2|γ_2| + ... + 2|γ_{T−1}| + ... .

To check that Var(√T w̄_T) → Σ_{j=−∞}^∞ γ_j, see J. Hamilton, Time Series Analysis, 1994, pp. 187–188.

Consistent estimation of second-order moments Let us consider the sample autocovariance
γ̂_j = (1/(T−j)) Σ_{t=j+1}^T (w_t − w̄_0)(w_{t−j} − w̄_{−j}) = (1/(T−j)) Σ_{t=j+1}^T w_t w_{t−j} − w̄_0 w̄_{−j}

where w̄_0 = (T−j)^{−1} Σ_{t=j+1}^T w_t and w̄_{−j} = (T−j)^{−1} Σ_{t=j+1}^T w_{t−j}. Let us define z_t = w_t w_{t−j}. The previous LLN can be applied to the process z_t to state conditions under which
(1/(T−j)) Σ_{t=j+1}^T w_t w_{t−j} →_p E(w_t w_{t−j}).

Note that if w_t is strictly stationary, so is z_t.
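As an illustration, the sample autocovariance γ̂_j can be computed directly from its definition. The sketch below applies it to a simulated MA(1)-type process w_t = v_t + 0.5 v_{t−1} (my choice of test case, not from the notes), for which γ_0 = 1.25, γ_1 = 0.5 and γ_j = 0 for j ≥ 2.

```python
import random
import statistics

random.seed(3)

def sample_autocov(w, j):
    """gamma_hat_j = (T-j)^{-1} * sum_{t=j+1}^{T} (w_t - wbar_0)(w_{t-j} - wbar_{-j}),
    with the two subsample means wbar_0 and wbar_{-j} as in the text."""
    T = len(w)
    lead, lag = w[j:], w[: T - j]          # w_t and w_{t-j}, t = j+1, ..., T
    m_lead, m_lag = statistics.fmean(lead), statistics.fmean(lag)
    return statistics.fmean([(a - m_lead) * (b - m_lag) for a, b in zip(lead, lag)])

# Simulated MA(1): w_t = v_t + 0.5 * v_{t-1}, v_t iid N(0,1),
# so gamma_0 = 1 + 0.25 = 1.25, gamma_1 = 0.5, gamma_j = 0 for j >= 2.
T = 200_000
v = [random.gauss(0.0, 1.0) for _ in range(T + 1)]
w = [v[t] + 0.5 * v[t - 1] for t in range(1, T + 1)]

print([round(sample_autocov(w, j), 2) for j in (0, 1, 2)])
```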
Central Limit Theorem A central limit theorem provides conditions under which
(w̄_T − µ)/√Var(w̄_T) →_d N(0, 1).

Since in our context T Var(w̄_T) → Σ_{j=−∞}^∞ γ_j, an asymptotically equivalent statement is

√T (w̄_T − µ)/√(Σ_{j=−∞}^∞ γ_j) →_d N(0, 1),

or
√T (w̄_T − µ) →_d N(0, Σ_{j=−∞}^∞ γ_j).   (3)

A condition under which this result holds is:
w_t = µ + Σ_{j=0}^∞ ψ_j v_{t−j}   (4)

where {v_t} is an i.i.d. sequence with E(v_t²) < ∞ and Σ_{j=0}^∞ |ψ_j| < ∞ (T. W. Anderson, The Statistical Analysis of Time Series, 1971, p. 429). Result (3) also holds when the innovation process {v_t} in (4) is a martingale difference sequence satisfying certain conditions.³
A multivariate version of (3) for the case in which {w_t} is a vector-valued process is as follows:

√T (w̄_T − µ) →_d N(0, Σ_{j=−∞}^∞ Γ_j)   (5)

where
Γ_0 = E[(w_t − µ)(w_t − µ)′]

and for j ≠ 0:
Γ_j = E[(w_t − µ)(w_{t−j} − µ)′].

Note that the autocovariance matrix Γ_j is not symmetric. We have Γ_{−j} = Γ_j′.

As in the scalar case, a condition under which result (5) holds is:
w_t = µ + Σ_{j=0}^∞ Ψ_j v_{t−j}

where {v_t} is an i.i.d. vector sequence with E(v_t) = 0, E(v_t v_t′) = Ω a symmetric positive definite matrix, and the sequence of matrices {Ψ_j}_{j=0}^∞ is absolutely summable.⁴
Consistent estimation of the asymptotic variance To use the previous central limit theory for the construction of interval estimates and test statistics, we need consistent estimators of V = Σ_{j=−∞}^∞ γ_j. One possibility is to parameterize the γ_j; for example, assuming that the γ_j satisfy the restrictions imposed by an ARMA model of the type discussed in the next section. Another possibility is to obtain a flexible estimator of V of the type considered by Hansen (1982) and Newey and West (1987), among others.⁵ The Newey-West estimator is a sample counterpart of expression (2) truncated after m lags:
V̂ = γ̂_0 + Σ_{j=1}^m (1 − j/(m+1)) 2γ̂_j.

³ Theorem 3.15 in P. C. B. Phillips and V. Solo (1992), “Asymptotics for Linear Processes,” The Annals of Statistics 20.
⁴ The matrix sequence {Ψ_j}_{j=0}^∞ is absolutely summable if each of its elements forms an absolutely summable sequence.
⁵ Hansen (1982), “Large Sample Properties of GMM Estimators,” Econometrica 50. Newey and West (1987), “A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix,” Econometrica 55.
This estimator can be shown to be consistent for V if the truncation parameter m grows to infinity with T more slowly than T^{1/4} or T^{1/2}, depending on the type of process. Nevertheless, the specification of an appropriate growth rate for m gives little practical guidance on the choice of m.
Similarly, in the vector case the Newey-West estimator of V = Σ_{j=−∞}^∞ Γ_j is given by

V̂ = Γ̂_0 + Σ_{j=1}^m (1 − j/(m+1)) (Γ̂_j + Γ̂_j′)   (6)

where Γ̂_j = (T−j)^{−1} Σ_{t=j+1}^T (w_t − w̄_0)(w_{t−j} − w̄_{−j})′. A nice property of the Newey-West estimator (6) is that it is guaranteed to be a positive semi-definite matrix by construction.⁶
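A minimal scalar implementation of the Newey-West formula is sketched below. For simplicity it uses the full-sample mean in every γ̂_j rather than the two subsample means, a common simplification. The AR(1) test case and its long-run variance V = σ²/(1 − ρ)² = 4 are my own choices of illustration.

```python
import random
import statistics

def newey_west(w, m):
    """Scalar Newey-West estimate of V = sum_j gamma_j: gamma_hat_0 plus
    twice the first m sample autocovariances, weighted by 1 - j/(m+1)."""
    T = len(w)
    mean = statistics.fmean(w)
    def gamma(j):
        return sum((w[t] - mean) * (w[t - j] - mean) for t in range(j, T)) / (T - j)
    v = gamma(0)
    for j in range(1, m + 1):
        v += (1 - j / (m + 1)) * 2 * gamma(j)
    return v

# Check on a simulated AR(1) with rho = 0.5 and unit innovation variance
# (illustrative parameters), whose long-run variance is 1/(1-0.5)^2 = 4.
random.seed(4)
T = 200_000
w = [0.0]
for _ in range(T):
    w.append(0.5 * w[-1] + random.gauss(0.0, 1.0))
w = w[1:]

vhat = newey_west(w, m=50)
print(round(vhat, 1))
```

Larger m reduces the truncation bias at the cost of more sampling noise, which is the trade-off behind the growth-rate conditions on m mentioned above.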
4 Autoregressive and moving average models
4.1 Autoregressive models
A first-order autoregressive process (or Markov process) assumes that w_t is independent of {w_{t−2}, w_{t−3}, ...} conditionally on w_{t−1}:
f_t(w_t | w_{t−1}, ..., w_1) = f_t(w_t | w_{t−1})

and, therefore, also
E(w_t | w_{t−1}, ..., w_1) = E(w_t | w_{t−1}).

Moreover, the standard linear AR(1) model also assumes
E(w_t | w_{t−1}) = α + ρw_{t−1}
Var(w_t | w_{t−1}) = σ².
Moment properties These assumptions have the following implications for marginal moments:
E(w_t) = E[E(w_t | w_{t−1})] = α + ρE(w_{t−1})   (7)
Var(w_t) = Var[E(w_t | w_{t−1})] + E[Var(w_t | w_{t−1})] = ρ² Var(w_{t−1}) + σ²
Cov(w_t, w_{t−1}) = Cov(E(w_t | w_{t−1}), w_{t−1}) = Cov(α + ρw_{t−1}, w_{t−1}) = ρ Var(w_{t−1}).

Moreover,
E(w_t | w_{t−2}) = α + ρE(w_{t−1} | w_{t−2}) = α(1 + ρ) + ρ² w_{t−2}.

⁶ The estimator Ṽ = Γ̂_0 + Σ_{j=1}^m (Γ̂_j + Γ̂_j′) has the same large-sample justification as (6) but is not necessarily positive semi-definite.
In general
E(w_t | w_{t−j}) = α(1 + ρ + ... + ρ^{j−1}) + ρ^j w_{t−j}

and
Cov(w_t, w_{t−j}) = Cov(E(w_t | w_{t−j}), w_{t−j}) = ρ^j Var(w_{t−j}).

In view of the recursion (7) we have
E(w_t) = α(1 + ρ + ... + ρ^{t−1}) + ρ^t E(w_0).

Stability and stationarity For the process to be stationary it is required that |ρ| < 1. In itself, |ρ| < 1 is a stability condition under which, as t → ∞, we obtain

E(w_t) → µ = α/(1 − ρ)
Var(w_t) → γ_0 = σ²/(1 − ρ²)
Cov(w_t, w_{t−j}) → γ_j = ρ^j σ²/(1 − ρ²).

These quantities are known as the steady-state mean, variance, and autocovariances. Thus, regardless of the starting point, under the stability condition the AR(1) process is asymptotically covariance stationary.

If the AR(1) process is stationary (because it is stable and either started in the distant past or started at t = 1 with the steady-state distribution), then E(w_t) = µ = α/(1 − ρ), Var(w_t) = γ_0 = σ²/(1 − ρ²) and Cov(w_t, w_{t−j}) = γ_j = ρ^j σ²/(1 − ρ²).

The autocorrelation function of a stationary AR(1) process decays exponentially and is given by ρ^j.
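The steady-state formulas can be verified on a simulated stable AR(1). The parameter values below (α = 2, ρ = 0.6, σ = 1, so µ = 5 and γ_0 = 1.5625) are my own illustration.

```python
import random
import statistics

random.seed(5)

# Simulate w_t = alpha + rho*w_{t-1} + u_t and compare sample moments
# with mu = alpha/(1-rho), gamma_0 = sigma^2/(1-rho^2), gamma_j = rho^j*gamma_0.
alpha, rho, sigma, T = 2.0, 0.6, 1.0, 300_000
mu = alpha / (1 - rho)            # steady-state mean: 5.0
gamma0 = sigma**2 / (1 - rho**2)  # steady-state variance: 1.5625
w = [mu]                          # start at the steady-state mean
for _ in range(T):
    w.append(alpha + rho * w[-1] + random.gauss(0.0, sigma))
w = w[1:]

wbar = statistics.fmean(w)
g0 = statistics.fmean([(x - wbar) ** 2 for x in w])
g1 = statistics.fmean([(w[t] - wbar) * (w[t - 1] - wbar) for t in range(1, T)])
g2 = statistics.fmean([(w[t] - wbar) * (w[t - 2] - wbar) for t in range(2, T)])

# The sample autocorrelations g1/g0 and g2/g0 should be near rho and rho^2.
print(round(wbar, 1), round(g0, 1), round(g1 / g0, 2), round(g2 / g0, 2))
```

The geometric decay of g_j/g_0 toward ρ^j is the exponential autocorrelation function described above.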
Letting u_t = w_t − α − ρw_{t−1}, the Wold representation of a stationary AR(1) process can be obtained by repeated substitution and is given by:
w_t = µ + u_t + ρu_{t−1} + ρ²u_{t−2} + ρ³u_{t−3} + ...

The parameter ρ measures the persistence of the process. The closer ρ is to one, the more persistent are the deviations of the process from its mean.
Normality assumptions The standard additional assumption used to fully specify the distribution of w_t | w_{t−1} is conditional normality:

w_t | w_{t−1}, ..., w_1 ~ N(α + ρw_{t−1}, σ²).   (8)

In itself this assumption does not imply unconditional normality. However, if we assume that the initial observation is normally distributed with the steady-state mean and variance:

w_1 ~ N(α/(1 − ρ), σ²/(1 − ρ²)),   (9)

then the process is fully stationary and (w_1, ..., w_T) is jointly normally distributed as follows:
(w_1, w_2, ..., w_T)′ ~ N( (α/(1 − ρ))·(1, 1, ..., 1)′, (σ²/(1 − ρ²))·R ),

where R is the T×T matrix

R = [ 1        ρ        ...  ρ^{T−1} ]
    [ ρ        1        ...  ρ^{T−2} ]
    [ ...      ...      ...  ...     ]
    [ ρ^{T−1}  ρ^{T−2}  ...  1       ]

Normal likelihood functions Under assumption (8), the log likelihood function of the time
series {w_1, ..., w_T} conditioned on the first observation is given by (up to an additive constant):

L(α, ρ, σ²) = −((T−1)/2) ln σ² − (1/(2σ²)) Σ_{t=2}^T (w_t − α − ρw_{t−1})².

The corresponding maximum likelihood estimates are:

ρ̂ = Σ_{t=2}^T (w_t − w̄_0)(w_{t−1} − w̄_{−1}) / Σ_{t=2}^T (w_{t−1} − w̄_{−1})²   (10)
α̂ = w̄_0 − ρ̂ w̄_{−1}
σ̂² = (1/(T−1)) Σ_{t=2}^T (w_t − α̂ − ρ̂w_{t−1})²

where w̄_0 = (T−1)^{−1} Σ_{t=2}^T w_t and w̄_{−1} = (T−1)^{−1} Σ_{t=2}^T w_{t−1}.

Under the steady-state assumption (9), the log likelihood of the first observation is given by:

ℓ_1(α, ρ, σ²) = −(1/2) ln σ² + (1/2) ln(1 − ρ²) − ((1 − ρ²)/(2σ²)) (w_1 − α/(1 − ρ))².

Thus, the full log likelihood function under assumptions (8) and (9) becomes:
L*(α, ρ, σ²) = L(α, ρ, σ²) + ℓ_1(α, ρ, σ²).
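The conditional maximum likelihood estimates in (10) are simple to compute directly: ρ̂ is the OLS slope of w_t on w_{t−1}, α̂ follows from the subsample means, and σ̂² averages the squared residuals over the T − 1 usable observations. The sketch below recovers assumed true values (α = 1, ρ = 0.7, σ² = 1, my choices) from simulated data.

```python
import random
import statistics

random.seed(6)

# Simulate a Gaussian AR(1) started at its steady-state mean.
alpha, rho, sigma2, T = 1.0, 0.7, 1.0, 100_000
w = [alpha / (1 - rho)]
for _ in range(T - 1):
    w.append(alpha + rho * w[-1] + random.gauss(0.0, sigma2**0.5))

lead, lag = w[1:], w[:-1]                     # w_t and w_{t-1}, t = 2, ..., T
wbar0, wbar1 = statistics.fmean(lead), statistics.fmean(lag)

# Closed-form conditional MLE, following (10).
rho_hat = sum((a - wbar0) * (b - wbar1) for a, b in zip(lead, lag)) / sum(
    (b - wbar1) ** 2 for b in lag
)
alpha_hat = wbar0 - rho_hat * wbar1
sigma2_hat = statistics.fmean(
    [(a - alpha_hat - rho_hat * b) ** 2 for a, b in zip(lead, lag)]
)
print(round(rho_hat, 2), round(alpha_hat, 2), round(sigma2_hat, 2))
```

Adding the ℓ_1 term to form L* changes the estimates only slightly in large samples, since a single observation carries vanishing weight as T grows.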