Analysis (I)

MEI-YUAN CHEN

Department of Finance National Chung Hsing University

Feb. 26, 2013

© Mei-Yuan Chen. The LaTeX source file is tim-lin.tex.

Contents

1 Introduction
  1.1 Linear Time Series Models
  1.2 Stochastic Processes
  1.3 Stationary Processes
  1.4 The Autocorrelation Function and Partial Autocorrelation Functions
  1.5 Some Useful Stochastic Processes
    1.5.1 Random Walk
    1.5.2 Moving Average Processes
  1.6 Autocovariance Functions of a MA(q)
  1.7 Autoregressive Processes
    1.7.1 First-order Process
    1.7.2 Autoregressive Moving Average Models
  1.8 Autoregressive Integrated Moving Average Models

2 Model Identification
  2.1 Estimations for the Time Dependence

3 Model Estimation
  3.1 MLE of an AR(1) Model
  3.2 MLE for a MA(q) Model

4 Check for Filtered Residuals
  4.1 Portmanteau Test
  4.2 Testing That a Dependent Process Is Uncorrelated
    4.2.1 Testing H0 : µ = 0
    4.2.2 Testing H0 : γ(1) = 0
  4.3 Selection for a Parsimonious Model
  4.4 Impulse Response Function

5 Forecasting
  5.1 Combined Forecasts
  5.2 Forecast Evaluation
  5.3 Test for Martingale Difference

1 Introduction

A univariate random variable Y is defined as

    Y : Ω → {y_1, y_2, ...} = D_Y(µ_Y, var(Y), α_3(Y), α_4(Y)).

The distribution of realizations is characterized by the measures E(Y) (unconditional mean), var(Y) (unconditional variance), α_3(Y) (unconditional skewness), and α_4(Y) (unconditional kurtosis). Of course, the distribution of all realizations can be captured by the cumulative distribution function F_Y and the probability density function f_Y if F_Y is globally differentiable. Bivariate random variables X and Y are defined as

    (X, Y) : Ω → {(x_1, y_1), (x_2, y_2), (x_3, y_3), ...}
             = D_{X,Y}(µ_X, σ_X^2, α_3(X), α_4(X), µ_Y, σ_Y^2, α_3(Y), α_4(Y), cov(X, Y)).

That is, the plot (distribution) of the realizations (x, y) can be characterized by the measures µ_X, σ_X^2, α_3(X), α_4(X) for X, the measures µ_Y, σ_Y^2, α_3(Y), α_4(Y) for Y, and cov(X, Y) for the covariance between X and Y. The conditional random variable Y given X = x is defined as

    (Y | X = x) : Ω → {(x, y_1), (x, y_2), ...}
                = D_{Y|X=x}(E(Y | X = x), var(Y | X = x), α_3(Y | X = x), α_4(Y | X = x)).

The distribution of realizations is characterized by the measures E(Y | X = x) (conditional mean), var(Y | X = x) (conditional variance), α_3(Y | X = x) (conditional skewness), and α_4(Y | X = x) (conditional kurtosis). Similarly, the distribution of all realizations can be captured by the cumulative distribution function F_{Y|X=x} and the probability density function f_{Y|X=x} if F_{Y|X=x} is globally differentiable. The purpose of exploring the distribution of the conditional random variable (Y | X = x) is to check whether it differs from the distribution of the unconditional random variable Y, i.e., whether D_{Y|X=x} = D_Y. For example, let Y denote today's stock return, and let X = 1 if yesterday's return was positive and X = 0 if it was negative. Then D_{Y|X=x} = D_Y implies that knowing whether the previous day's return was positive or negative is of no help in understanding the distribution of today's stock return. In contrast, D_{Y|X=x} ≠ D_Y indicates that knowing whether yesterday's return was positive or negative is useful for understanding the distribution of today's return: the information X = 1 or X = 0 is valuable. In other words, checking whether D_{Y|X=x} = D_Y is equivalent to studying whether the information in X is valuable. Without doubt, the one-dimensional conditioning variable can be extended to be multi-dimensional and denoted

as X = [X_1, X_2, ..., X_k]′. When X is multi-dimensional, the joint distributions D_{Y,X_1}, D_{Y,X_2}, ..., D_{Y,X_k} are explored by studying the conditional distribution D_{Y|X}. Usually (Y | X = x) is decomposed as

    (Y | X = x) = E(Y | X = x) + √var(Y | X = x) × e.

It is clear that under the assumption of normality and homogeneity of D_{Y|X=x}, the exploration of D_{Y|X=x} reduces to the study of E(Y | X = x). Therefore, the null hypothesis for checking whether X is valuable for knowing the distribution D_Y, H_0 : D_{Y|X=x} = D_Y, reduces to H_0 : E(Y | X = x) = E(Y). In statistical analysis, a random sample {(y_i, x_i′)′, i = 1, ..., n} is used to study E(Y | X = x) under the assumption that E(Y | X = x) is "fixed" for all realizations (and hence for the sample observations). In econometric analysis, the contents of X are suggested by the corresponding economic theory and are selected using model selection criteria in regression analysis. In time series analysis, the information contained in X is the σ-algebra generated by the historical values of Y. That is, the conditional random variable of Y_t on F_{t-1} is of interest, where

    F_{t-1} := {Y_{t-1}, Y_{t-2}, ...}.

That is, the conditional random variable (Y_t | F_{t-1}) is under investigation. In other words, the joint distributions D_{Y_t,Y_{t-1}}, D_{Y_t,Y_{t-2}}, ... are to be studied. To make this investigable, D_{Y_t,Y_{t-1}}, ..., D_{Y_t,Y_{t-p}} is considered for a finite number p. These joint distributions can be characterized by the measures E(Y_t), σ_{Y_t}^2, α_3(Y_t), α_4(Y_t), and cov(Y_t, Y_{t-j}) for all t and j = 1, ..., p. Therefore, there would be far too many measures to estimate if they were time-varying, i.e., different for different t. In reality, only a single time series sample with T observations, {y_1, y_2, ..., y_T}, is available for the estimations. It is thus necessary to assume that these measures are all time-invariant; this is why "strict stationarity" is needed for time series analysis. Given the condition of stationarity, the joint distribution D_{Y_t,Y_{t-1},...,Y_{t-p}} becomes time-invariant, and so does the distribution of the conditional random variable (Y_t | Y_{t-1}, ..., Y_{t-p}). This also implies that the conditional mean E(Y_t | Y_{t-1}, ..., Y_{t-p}) and conditional variance var(Y_t | Y_{t-1}, ..., Y_{t-p}) are constant over time. Under the assumption of normality, E(Y_t | Y_{t-1}, ..., Y_{t-p}) and var(Y_t | Y_{t-1}, ..., Y_{t-p}) are the two measures to be studied for exploring D_{Y_t|Y_{t-1},...,Y_{t-p}}.

1.1 Linear Time Series Models

In conventional time series analysis, the assumptions of normality and homogeneity (var(Y_t | Y_{t-1}, ..., Y_{t-p}) is constant for all times t and all values of Y_{t-1}, ..., Y_{t-p}) are imposed on D_{Y_t|Y_{t-1},...,Y_{t-p}}. As a result, the conditional mean E(Y_t | Y_{t-1}, ..., Y_{t-p}) becomes the only object to be studied for exploring D_{Y_t|Y_{t-1},...,Y_{t-p}}. For simplicity, as in linear regression analysis, linearity is usually assumed in modeling E(Y_t | Y_{t-1}, ..., Y_{t-p}). That is,

    E(Y_t | Y_{t-1}, ..., Y_{t-p}) = φ_0 + φ_1 Y_{t-1} + ··· + φ_p Y_{t-p},

which is called a linear autoregressive model. To be more general, an autoregressive and moving average model (ARMA model, hereafter) is suggested as

    E(Y_t | F_{t-1}) = φ_0 + φ_1 Y_{t-1} + ··· + φ_p Y_{t-p} + π_1 e_{t-1} + ··· + π_q e_{t-q}.

It is crucial to determine the orders p and q before estimating the model. For financial time series, it is commonly recognized that the distribution of (Y_t | F_{t-1}) is neither normal nor homogeneous. Therefore, D_{Y_t|Y_{t-1},...,Y_{t-p}} cannot be explored completely by studying E(Y_t | Y_{t-1}, ..., Y_{t-p}) alone.

1.2 Stochastic Processes

Mathematically, a stochastic process may be described by the following terminology:

1 Stochastic process: A sequence of random variables {Y_t} indexed by time.

2 Stationary: Means and variances do not depend on time subscripts; covariances depend only on the difference between the two subscripts.

3 Uncorrelated: The correlation between variables having different time subscripts is always zero.

4 Autocorrelated: It is not uncorrelated.

5 White noise: The variables are uncorrelated, stationary and have mean equal to 0.

6 Strict white noise: The variables are independent and have identical distributions whose mean is equal to 0.

7 Martingale: The of variable at time t, conditional on the informa- tion provided by all previous values, equals variables at time t 1. − 8 Martingale difference: The expected value of a variable at period t, conditional on the information provided by all previous values, always equals 0.

9 Gaussian: All multivariate distributions are multivariate normal.

10 Linear: It is a linear combination of the present and past terms from a strict white noise process.

In time series analysis, a stationary time series {y_t, t = 1, ..., T} is used to explore the joint distribution of Y_1, Y_2, ..., Y_T, for example through E(Y_t), var(Y_t), and cov(Y_t, Y_s), t, s = 1, ..., T, t ≠ s. Usually, however, the linear conditional mean E(Y_t | Y_{t-1} = y_{t-1}, ..., Y_{t-p} = y_{t-p}) is the focus. It is clear that

    E(Y_t | Y_{t-1} = y_{t-1}, ..., Y_{t-p} = y_{t-p}) = Σ_{j=1}^p α_j y_{t-j} = c

implies that α_j = 0 for all j. That is, the values at previous times have no effect on the mean behavior of Y_t; we say that y_{t-j}, j = 1, ..., p, have no explanation (prediction) power for the mean of Y_t.

Most statistical problems are concerned with estimating the measures (or parameters) of a population from a sample. In time series analysis there is a rather different situation: although it may be possible to vary the "length" of the observed time series (the sample), it is impossible to make more than one observation at any given time. Thus we have only a single outcome of the process and a single observation on the random variable at time t. Nevertheless, we may regard the observed time series as just one example of the "infinite" set of time series which might have been observed. This infinite set of time series is sometimes called the "ensemble", and every member of the ensemble is a possible "realization" of the stochastic process.

One way of describing a stochastic process is to specify the joint probability distribution of Y_{t_1}, ..., Y_{t_n} for any set of times t_1, ..., t_n and any value of n. However, a simpler, more useful way of describing a stochastic process is to give the moments of the process, particularly the first and second moments, which are called the mean, variance, and autocovariance functions.

1. Mean: The mean function µ_t is defined by µ_t = E(Y_t).

2. Variance: The variance function σ_t^2 is defined by σ_t^2 = var(Y_t).

3. Autocovariance: The autocovariance function γ(Y_{t_1}, Y_{t_2}) is defined by

    γ(Y_{t_1}, Y_{t_2}) = E{[Y_{t_1} − E(Y_{t_1})][Y_{t_2} − E(Y_{t_2})]} = γ(τ), τ = t_2 − t_1.

4. Autocorrelation: The autocorrelation function ρ(Yt1 ,Yt2 ) is defined by

    ρ(Y_{t_1}, Y_{t_2}) = E{[Y_{t_1} − E(Y_{t_1})][Y_{t_2} − E(Y_{t_2})]}/(σ_{t_1} σ_{t_2}) = γ(τ)/γ(0) = ρ(τ), τ = t_2 − t_1.

5. Partial Autocorrelation: The partial autocorrelation measures the excess correlation between Y_t and Y_{t-p} which is not accounted for by the correlations among Y_{t-1}, ..., Y_{t-p+1}.

1.3 Stationary Processes

A time series is said to be "strictly stationary" if the joint distribution of Y_{t_1}, ..., Y_{t_n} is the same as the joint distribution of Y_{t_1+τ}, ..., Y_{t_n+τ} for all t_1, ..., t_n and τ. In other words, shifting the time origin by an amount τ has no effect on the joint distribution.

In particular, if n = 1, strict stationarity implies that the distribution of Y_t is the same for all t, so that, provided the first two moments are finite, we have µ_t = µ and σ_t^2 = σ^2 for all t. Furthermore, if n = 2, the joint distribution of Y_{t_1} and Y_{t_2} depends only on (t_2 − t_1), which is called the "lag"; ρ(τ) is called the "autocorrelation function" at lag τ. In practice it is often useful to define stationarity in a less restrictive way than strict stationarity. A process is called "second-order (weakly) stationary" if its mean is constant and its autocovariance function depends only on the lag, so that E(Y_t) = µ and cov(Y_t, Y_{t+τ}) = γ(τ). No assumptions are made about moments higher than the second order. By letting τ = 0, the variance is implied to be constant.

1.4 The Autocorrelation Function and Partial Autocorrelation Functions

Suppose a stationary stochastic process Y_t has mean µ, variance σ^2, autocovariance function γ(τ), and autocorrelation function ρ(τ). Then

    ρ(τ) = γ(τ)/γ(0) = γ(τ)/σ^2.

Note that ρ(0) = 1. Some properties of autocorrelation function:

1. The autocorrelation function is an even function of the lag τ, in that ρ(τ) = ρ(−τ).

2. |ρ(τ)| ≤ 1.

3. Lack of uniqueness: although a given stochastic process has a unique autocorrelation structure, the converse is not in general true; two different processes can share the same autocorrelation function.

Consider the regressions of Yt on Yt−1,Yt−2,...,Yt−p,

Yt = α01 + α11Yt−1 + et

Yt = α02 + α12Yt−1 + α22Yt−2 + et

    Y_t = α_{03} + α_{13}Y_{t-1} + α_{23}Y_{t-2} + α_{33}Y_{t-3} + e_t
        ⋮
    Y_t = α_{0p} + α_{1p}Y_{t-1} + α_{2p}Y_{t-2} + α_{3p}Y_{t-3} + ··· + α_{pp}Y_{t-p} + e_t.

The coefficients α_{11}, α_{22}, ..., α_{pp} are called the partial autocorrelation coefficients of {Y_t}. The PACF (partial autocorrelation function) maps the lag k to the corresponding partial autocorrelation coefficient, i.e., PACF(k) = α_{kk}.
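The first two partial autocorrelations implied by the regressions above have closed forms in terms of the autocorrelations: α_11 = ρ(1) and α_22 = (ρ(2) − ρ(1)^2)/(1 − ρ(1)^2). A minimal sketch in Python (the function names are illustrative, not from the text):

```python
def pacf_1(rho1):
    # PACF(1) equals the lag-1 autocorrelation: alpha_11 = rho(1).
    return rho1

def pacf_2(rho1, rho2):
    # PACF(2) from the two-lag regression of Y_t on Y_{t-1}, Y_{t-2}:
    # alpha_22 = (rho(2) - rho(1)^2) / (1 - rho(1)^2).
    return (rho2 - rho1 ** 2) / (1.0 - rho1 ** 2)

# For an AR(1) process rho(k) = phi^k, so PACF(2) is exactly zero.
phi = 0.6
print(pacf_2(phi, phi ** 2))  # 0.0
```

This is consistent with the PACF cut-off at the AR order discussed in Section 1.7.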

1.5 Some Useful Stochastic Processes

A discrete-time process is called a purely random process, or white noise, if it consists of a sequence of random variables {Z_t} which are mutually independent and identically distributed. That is, the process has constant mean and variance, and

    γ(k) = cov(Z_t, Z_{t+k}) = 0 for k = ±1, ±2, ....

As the mean and ACF do not depend on time, the process is second-order stationary. In fact it is clear that the process is also strictly stationary. The ACF is given by

    ρ(k) = 1 for k = 0, and ρ(k) = 0 for k = ±1, ±2, ....

As {Zt} is mutually independent, the partial autocorrelation coefficients PACF (k)= αkk = 0, ∀k.
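These properties can be illustrated by simulation. A hedged sketch in pure Python (standard library only; `sample_acf` uses the usual full-sample-mean version of the sample autocorrelation, a simplification of the estimator given later in Section 2.1):

```python
import random

def sample_acf(y, k):
    # Sample autocorrelation at lag k, using the full-sample mean.
    T = len(y)
    ybar = sum(y) / T
    num = sum((y[t] - ybar) * (y[t - k] - ybar) for t in range(k, T))
    den = sum((v - ybar) ** 2 for v in y)
    return num / den

random.seed(12345)
z = [random.gauss(0.0, 1.0) for _ in range(2000)]  # simulated strict white noise

# For white noise every rho(k), k >= 1, is zero; the sample versions
# should lie within roughly +/- 2/sqrt(T) of zero.
for k in (1, 2, 5):
    print(k, round(sample_acf(z, k), 3))
```

The printed values are small and fluctuate around zero, in line with the flat ACF and PACF of a purely random process.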

1.5.1 Random Walk

Suppose that {Z_t} is a discrete, purely random process with mean µ and variance σ_Z^2. A process {Y_t} is said to be a random walk if

Yt = Yt−1 + Zt.

The process is customarily started at zero when t = 0, so that Y_1 = Z_1 and Y_t = Σ_{i=1}^t Z_i. Then we find that E(Y_t) = tµ and var(Y_t) = tσ_Z^2. As the mean and variance change with t, the process is non-stationary. However, it is interesting to note that the first difference of a random walk,

    △Y_t = Y_t − Y_{t-1} = Z_t,

forms a purely random process, which is therefore stationary.
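The growth of var(Y_t) = tσ_Z^2 can be seen by simulating many independent realizations of the walk. A sketch under the assumption Z_t ~ N(0, 1), so that var(Y_100) should be close to 100 (`random_walk` is an illustrative name):

```python
import random

random.seed(7)

def random_walk(T):
    # Y_t = Y_{t-1} + Z_t with Y_0 = 0 and Z_t ~ N(0, 1), so sigma_Z^2 = 1.
    level, path = 0.0, []
    for _ in range(T):
        level += random.gauss(0.0, 1.0)
        path.append(level)
    return path

# Sample variance of Y_100 across many independent realizations
# should be close to t * sigma_Z^2 = 100.
n_paths, t = 4000, 100
finals = [random_walk(t)[-1] for _ in range(n_paths)]
mean = sum(finals) / n_paths
var = sum((v - mean) ** 2 for v in finals) / (n_paths - 1)
print(round(var, 1))
```

Repeating the exercise with a larger t shows the variance growing proportionally, which is the non-stationarity in question.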

1.5.2 Moving Average Processes

A process {Y_t} is said to be a moving average process of order q (denoted as MA(q)) if

    Y_t = Z_t + π_1 Z_{t-1} + ··· + π_q Z_{t-q},

where the {π_i} are constants. It is clear that

    E(Y_t) = 0 and var(Y_t) = σ_Z^2 Σ_{i=0}^q π_i^2, with π_0 = 1,

since the {Z_t} are independent.

1.6 Autocovariance Functions of a MA(q)

Besides, we have

    γ(k) = cov(Y_t, Y_{t+k})
         = cov(Z_t + π_1 Z_{t-1} + ··· + π_q Z_{t-q}, Z_{t+k} + π_1 Z_{t+k-1} + ··· + π_q Z_{t+k-q})
         = 0,                                  k > q,
         = σ_Z^2 Σ_{i=0}^{q-k} π_i π_{i+k},    k = 0, 1, ..., q (with π_0 = 1),
         = γ(−k),                              k < 0,

since

    cov(Z_s, Z_t) = σ_Z^2 if s = t, and 0 if s ≠ t.

As γ(k) does not depend on t, and the mean is constant, the process is second-order stationary for all values of the {π_i}. The ACF of the MA(q) process is given by

    ρ(k) = 1,                                              k = 0,
         = Σ_{i=0}^{q-k} π_i π_{i+k} / (Σ_{i=0}^q π_i^2),  k = 1, ..., q,
         = 0,                                              k > q,
         = ρ(−k),                                          k < 0.

Note that the ACF "cuts off" at lag q, which is a special feature of MA processes. In particular, the MA(1) process has an ACF given by

    ρ(k) = 1 for k = 0; ρ(k) = π_1/(1 + π_1^2) for k = ±1; ρ(k) = 0 otherwise.
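The MA(q) formulas above are easy to evaluate directly. The sketch below (pure Python, with π_0 = 1 as in the text; `ma_acf` is an illustrative name) shows both the cut-off of the ACF at lag q and, anticipating the invertibility discussion that follows, that π_1 = θ and π_1 = 1/θ produce exactly the same MA(1) ACF:

```python
def ma_acf(pi, k):
    # ACF of Y_t = Z_t + pi_1 Z_{t-1} + ... + pi_q Z_{t-q}:
    # rho(k) = sum_{i=0}^{q-k} pi_i pi_{i+k} / sum_{i=0}^{q} pi_i^2, pi_0 = 1.
    coef = [1.0] + list(pi)   # prepend pi_0 = 1
    q = len(pi)
    k = abs(k)                # the ACF is even: rho(-k) = rho(k)
    if k > q:
        return 0.0            # "cut off" beyond lag q
    num = sum(coef[i] * coef[i + k] for i in range(q - k + 1))
    den = sum(c ** 2 for c in coef)
    return num / den

theta = 0.5
print(ma_acf([theta], 1))        # 0.4, i.e. theta / (1 + theta^2)
print(ma_acf([1.0 / theta], 1))  # 0.4 again: theta and 1/theta share the ACF
print(ma_acf([theta], 2))        # 0.0: cut-off at lag q = 1
```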

No other restrictions on the {π_i} are required for a (finite-order) MA process to be stationary, but it is generally desirable to impose restrictions on the {π_i} to ensure that the process satisfies a condition called "invertibility". Consider the following first-order MA processes:

    A : Y_t = Z_t + θZ_{t-1},
    B : Y_t = Z_t + (1/θ)Z_{t-1}.

It can easily be shown that these two different processes have exactly the same ACF. Thus we cannot identify an MA process from a given ACF. Now, if we express processes

A and B by putting Zt in terms of Yt,Yt−1,..., we find by successive substitution that

    A : Z_t = Y_t − θY_{t-1} + θ^2 Y_{t-2} − ···,
    B : Z_t = Y_t − (1/θ)Y_{t-1} + (1/θ^2)Y_{t-2} − ···.

If |θ| < 1, the series for process A converges whereas that for process B does not. Thus if |θ| < 1, process A is said to be invertible whereas process B is not. The imposition of the invertibility condition ensures that there is a unique MA process for a given ACF. The invertibility condition for the general-order MA process is best expressed using the backward shift operator, denoted by B, which is defined by

    B^j Y_t = Y_{t-j}, ∀j.

9 Then the general-order MA process may be written as

    Y_t = (1 + π_1 B + ··· + π_q B^q)Z_t = π(B)Z_t,

where π(B) is a polynomial of order q in B. An MA process of order q is invertible if the roots of the equation

    π(z) = 1 + π_1 z + π_2 z^2 + ··· + π_q z^q = 0

all lie outside the unit circle. For example, in the first-order case we have π(z) = 1 + π_1 z = 0, which has root z = −1/π_1. Thus the root is outside the unit circle, |z| > 1, provided that |π_1| < 1.

For an MA(1) process, Y_t = Z_t + π_1 Z_{t-1} may be written as Y_t = (1 + π_1 B)Z_t, and then

    Z_t = Y_t/(1 + π_1 B)
        = (1 − π_1 B + π_1^2 B^2 − π_1^3 B^3 + ···)Y_t
        = Y_t − π_1 Y_{t-1} + π_1^2 Y_{t-2} − π_1^3 Y_{t-3} + ···

or

    Y_t = π_1 Y_{t-1} − π_1^2 Y_{t-2} + π_1^3 Y_{t-3} − ··· + Z_t,

which is an AR(∞) process. The partial autocorrelation function PACF(k) = α_{kk} will never "cut off" at any lag k but will decay exponentially to zero as k gets large. This result can also be extended to the MA(q) process, since

    Y_t = (1 + π_1 B + π_2 B^2 + ··· + π_q B^q)Z_t

can be expressed as

    Z_t = Y_t/(1 + π_1 B + π_2 B^2 + ··· + π_q B^q)
        = (1 − ψ_1 B − ψ_2 B^2 − ψ_3 B^3 − ···)Y_t
        = Y_t − ψ_1 Y_{t-1} − ψ_2 Y_{t-2} − ψ_3 Y_{t-3} − ···

or

    Y_t = ψ_1 Y_{t-1} + ψ_2 Y_{t-2} + ψ_3 Y_{t-3} + ··· + Z_t,

which is an AR(∞) process. The PACF(k) will likewise never "cut off" at any lag k but will decay exponentially to zero as k gets large. To summarize, an MA(q) process has an ACF cutting off at lag q and a PACF decaying to zero exponentially.
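The AR(∞) representation of an invertible MA(1) can be checked numerically: expanding 1/(1 + π_1 B) gives coefficients c_j = (−π_1)^j on Y_{t−j} in the expression for Z_t. A sketch (illustrative function name):

```python
def invert_ma1(pi1, n):
    # Coefficients c_j in Z_t = sum_{j>=0} c_j Y_{t-j}, from requiring
    # (1 + pi1*B)(c_0 + c_1 B + c_2 B^2 + ...) = 1, i.e. c_j = -pi1 * c_{j-1}.
    c = [1.0]
    for _ in range(n):
        c.append(-pi1 * c[-1])
    return c

c = invert_ma1(0.5, 5)
print(c)  # [1.0, -0.5, 0.25, -0.125, 0.0625, -0.03125]
# Rearranging gives Y_t = pi1 Y_{t-1} - pi1^2 Y_{t-2} + pi1^3 Y_{t-3} - ... + Z_t:
# the AR weights decay geometrically and never cut off.
```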

1.7 Autoregressive Processes

A process {Y_t} is said to be an autoregressive process of order p if

    Y_t = φ_1 Y_{t-1} + φ_2 Y_{t-2} + ··· + φ_p Y_{t-p} + Z_t,

which is abbreviated to an AR(p) process.

1.7.1 First-order Process

When p = 1, Y_t = φ_1 Y_{t-1} + Z_t. The AR(1) process is sometimes called the Markov process, after the Russian A. A. Markov. By successive substitution, we have

    Y_t = Σ_{j=0}^∞ φ_1^j Z_{t-j},

which is an infinite-order MA process, provided −1 < φ_1 < 1. As

    (1 − φ_1 B)Y_t = Z_t,

so that

    Y_t = Z_t/(1 − φ_1 B)
        = (1 + φ_1 B + φ_1^2 B^2 + ···)Z_t
        = Σ_{j=0}^∞ φ_1^j Z_{t-j}.

It is clear that

    E(Y_t) = 0 and var(Y_t) = σ_Z^2 Σ_{j=0}^∞ φ_1^{2j} = σ_Z^2/(1 − φ_1^2),

given |φ_1| < 1. The autocovariance function is given by

    γ(k) = E(Y_t Y_{t+k})
         = E[(Σ_{j=0}^∞ φ_1^j Z_{t-j})(Σ_{i=0}^∞ φ_1^i Z_{t+k-i})]
         = σ_Z^2 Σ_{j=0}^∞ φ_1^j φ_1^{k+j}, for k ≥ 0,

which converges for |φ_1| < 1 to

    γ(k) = φ_1^k σ_Z^2/(1 − φ_1^2) = φ_1^k σ_Y^2.

For k < 0, we find γ(k) = γ(−k). Since γ(k) does not depend on t, an AR(1) process is second-order stationary provided that |φ_1| < 1. The ACF is given by

    ρ(k) = φ_1^k, k = 0, 1, 2, ....

That is, the ACF of an AR(1) process will not "cut off" at any lag k but decays to zero exponentially, since |φ_1| < 1. However, the PACF at k = 1, i.e., α_{11}, equals φ_1, and α_{kk} = 0 for all k ≥ 2. More generally, the ACF of an AR(p) process will not "cut off" at any lag k but will decay to zero exponentially, whereas the PACF at k = p, i.e., α_{pp}, equals φ_p, and α_{kk} = 0 for all k ≥ p + 1.
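The geometric-sum step in the derivation above can be verified numerically by comparing a truncated version of σ_Z^2 Σ_j φ_1^j φ_1^{k+j} with the closed form φ_1^k σ_Z^2/(1 − φ_1^2). A sketch (function names are illustrative):

```python
def ar1_gamma_closed(phi, k, sigma2=1.0):
    # gamma(k) = phi^k * sigma_Z^2 / (1 - phi^2) for a stationary AR(1).
    return phi ** abs(k) * sigma2 / (1.0 - phi ** 2)

def ar1_gamma_series(phi, k, sigma2=1.0, n_terms=500):
    # Truncation of gamma(k) = sigma_Z^2 * sum_{j>=0} phi^j phi^{k+j}.
    return sigma2 * sum(phi ** j * phi ** (k + j) for j in range(n_terms))

phi = 0.8
for k in (0, 1, 3):
    print(k, abs(ar1_gamma_series(phi, k) - ar1_gamma_closed(phi, k)))
# The differences are negligible, and rho(k) = gamma(k)/gamma(0) = phi^k.
```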

1.7.2 Autoregressive Moving Average Models

More generally, a useful class of processes is formed by combining both AR and MA processes. A mixed autoregressive/moving-average process containing p AR terms and q MA terms is said to be an ARMA process of order (p,q). That is

    φ(B)Y_t = π(B)Z_t,

where φ(B) and π(B) are the p-th and q-th order polynomials in B, respectively. A process generated from an ARMA(p,q) model is called an ARMA(p,q) process; its stationarity and invertibility are determined by the roots of φ(z) and π(z), respectively. Since an ARMA(p,q) process may be rewritten as

    Y_t = [π(B)/φ(B)]Z_t = ψ(B)Z_t,

an ARMA process can be viewed as a pure MA(∞) process. Similarly, since an ARMA process may also be rewritten as

    Z_t = [π(B)/φ(B)]^{−1} Y_t = ϕ(B)Y_t,

an ARMA process can also be viewed as a pure AR(∞) process. Consequently, neither the ACF(k) nor the PACF(k) of an ARMA process "cuts off" at any lag k; both decay gradually as k → ∞.

1.8 Autoregressive Integrated Moving Average Models

In many applications it has been found that many economic time series are non-stationary but that their (first) differences are stationary. This leads to the class of autoregressive integrated moving average (ARIMA) processes. An ARIMA(p,d,q) process Y_t can be expressed as an ARMA(p,q) process of △Y_t = (1 − B)^d Y_t:

    φ(B)(1 − B)^d Y_t = π(B)Z_t.

The process {Y_t} is also called an integrated process of order d, denoted I(d), which must be differenced d times to achieve stationarity. In practice, the integrated time series we usually encounter are I(1) processes. Consider an ARIMA(p,1,q) process Y_t with (1 − B)Y_t = △Y_t. By recursive substitution,

    Y_t = Σ_{j=0}^{t-1} △Y_{t-j}

has mean tµ_{△Y} and variance tσ_{△Y}^2. Hence Y_t is non-stationary, because its mean tµ_{△Y} and variance tσ_{△Y}^2 grow with t. More importantly, the autocorrelations of Y_t never die out. Also note that the AR polynomial of Y_t, φ(z)(1 − z), has a root on the unit circle, and {Y_t} is also called a unit root process with drift µ_{△Y}. In particular, if the △Y_t are i.i.d., Y_t is a random walk with drift µ_{△Y}.
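In practice an I(1) series is transformed to stationarity by first differencing, and the level series is recovered by partial summation. A minimal sketch of the two operations (illustrative names):

```python
def difference(y):
    # First difference: (1 - B) y_t = y_t - y_{t-1}.
    return [y[t] - y[t - 1] for t in range(1, len(y))]

def integrate(dy, y0):
    # Inverse operation: partial sums started at the initial level y0.
    y = [y0]
    for d in dy:
        y.append(y[-1] + d)
    return y

y = [2.0, 3.0, 2.0, 5.0, 4.0]
dy = difference(y)
print(dy)                        # [1.0, -1.0, 3.0, -1.0]
print(integrate(dy, y[0]) == y)  # True
```

Differencing then integrating is lossless given the initial level, which is why modeling △Y_t by an ARMA(p,q) loses nothing but the starting value.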

2 Model Identification

In econometric analyses, the regression model for regressing Y on X is set based on "some" economic-theoretic result, e.g., E(Y | X = x) = m(x) = x′β_0. For model selection, criteria such as R^2 (or R̄^2, the (adjusted) coefficient of determination), AIC (Akaike Information Criterion), and BIC (Bayesian Information Criterion, or Schwarz Criterion) can be used. However, in time series analysis there is no theoretical reference result for selecting a stochastic process and then claiming that an observed time series is a realization of the selected process. Instead, a general ARMA(p,q) model is considered first, and the appropriate orders p and q are identified from the time structures implied by the observed time series. A time series model is set as

    y_t = Σ_{i=1}^p φ_i y_{t-i} + u_t + Σ_{j=1}^q π_j u_{t-j}

for an observed time series {y_t, t = 1, ..., T}. An appropriate time series model, the ARMA(p,q) with appropriate p and q, is identified by requiring that

1. the time structures (or time dependence), i.e., the autocovariances (autocorrelations) and partial autocorrelations described by the time series model, are similar to those of the stationary stochastic process generating the data;

2. the filtered errors from the appropriate model are white noise;

3. the model is “parsimonious”.

2.1 Estimations for the Time Dependence

The autocovariance (autocorrelation) and partial autocorrelation functions are used to describe the time dependence of a stochastic process. However, the autocorrelation and partial autocorrelation functions underlying an observed time series are unobservable. Fortunately, they are estimable from the observed data under the condition of stationarity. Suppose an observed time series {y_t, t = 1, ..., T} is a realization of a stationary stochastic process. The sample autocorrelations can be estimated by

    ρ̂(k) = Σ_{t=k+1}^T (y_t − ȳ_{T−k})(y_{t−k} − ȳ_{T−k}) / Σ_{t=k+1}^T (y_t − ȳ_{T−k})^2, k = 1, 2, ...,

where ȳ_{T−k} is the mean computed over the corresponding T − k observations, and the partial autocorrelations α̂_{kk} can be estimated by running the regression

    y_t = α_{k1} y_{t-1} + α_{k2} y_{t-2} + ··· + α_{kk} y_{t-k} + u_t, k = 1, 2, ....

For a stationary AR(p) process, ρ̂(k) should decay toward zero as k increases, and α̂_{kk} ought to have an abrupt cutoff, with all values "close" to zero for k > p. For an invertible MA(q) process, ρ̂(k) has an abrupt cutoff, with values "close" to zero for all k > q, and the sample partial autocorrelations gradually decay toward zero. Thus, by examining the plots of sample autocorrelations and partial autocorrelations, we can determine the order of a pure AR or MA model. Unfortunately, the AR and MA orders of an ARMA process cannot be easily identified from such plots.

It has been shown that if y_t is generated from an AR(p) process, the sample partial autocorrelations α̂_{kk} are approximately normally distributed with mean zero and variance T^{−1} for k > p, where T is the sample size. A simple diagnostic check is then based on comparing the estimates with ±2T^{−1/2}; estimates falling within these bounds are considered "close" enough to zero. If y_t is generated from an MA(q) process, the sample autocorrelations ρ̂(k) are approximately normally distributed with mean zero and variance

    var(ρ̂(k)) = (1/T)(1 + 2 Σ_{i=1}^q ρ(i)^2), k = q + 1, q + 2, ....

It is thus not easy to implement a simple diagnostic test of autocorrelations, except in the case where the data are generated from a white noise series.
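The diagnostic described above (comparing each sample autocorrelation with the bounds ±2T^{−1/2}) can be sketched as follows. Pure Python; for simplicity the full-sample mean replaces the ȳ_{T−k} convention of the formula above, and the function names are illustrative:

```python
import random

def sample_acf(y, k):
    # Sample autocorrelation at lag k (full-sample-mean version).
    T = len(y)
    ybar = sum(y) / T
    num = sum((y[t] - ybar) * (y[t - k] - ybar) for t in range(k, T))
    den = sum((v - ybar) ** 2 for v in y)
    return num / den

def inside_bounds(y, k):
    # Diagnostic check: does rho_hat(k) fall within +/- 2 T^{-1/2}?
    return abs(sample_acf(y, k)) <= 2.0 / len(y) ** 0.5

# Simulated AR(1) data, y_t = 0.7 y_{t-1} + z_t: low lags should
# breach the bounds, flagging genuine autocorrelation.
random.seed(99)
y, level = [], 0.0
for _ in range(3000):
    level = 0.7 * level + random.gauss(0.0, 1.0)
    y.append(level)

print(inside_bounds(y, 1))  # False
```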

3 Model Estimation

After the diagnostic check of the sample autocorrelation and partial autocorrelation functions, the orders p and q are identified. For the next step in establishing an appropriate time series model, the filtered error, obtained by subtracting the value implied by the appropriate model from y_t, should be white noise. However, the filtered error is not observable, since the appropriate model's coefficients are unknown. A proxy for the filtered error can be obtained by subtracting the "fitted" value of the estimated model from y_t; this proxy is called the filtered residual. Therefore, estimating the appropriate model is the next step. An ARMA model is typically estimated using the method of maximum likelihood. Under the assumption of normality, it can be shown that the likelihood function of an ARMA(p,q) model is non-linear in the unknown parameters and more complicated than that of a pure AR or MA model. Hence, the maximum likelihood estimator (MLE) must be computed using non-linear optimization techniques.

3.1 MLE of an AR(1) Model

For an AR(1) model

    y_t = φ_0 + φ_1 y_{t-1} + u_t, t = 1, ..., T,

where {u_t} is assumed to be a Gaussian white noise series. The joint likelihood function can be written as a product of conditional likelihood functions:

    f(y_1, ..., y_T; θ) = f(y_1; θ) Π_{t=2}^T f(y_t | y_{t-1}, ..., y_1; θ)
                        = f(y_1; θ) Π_{t=2}^T f(y_t | y_{t-1}; θ),

where θ = [φ_0 φ_1 σ_u^2]′.

    f(y_1; θ) = (2πσ_u^2/(1 − φ_1^2))^{−1/2} exp( −(y_1 − φ_0/(1 − φ_1))^2 / (2σ_u^2/(1 − φ_1^2)) ),

    f(y_t | y_{t-1}; θ) = (2πσ_u^2)^{−1/2} exp( −(y_t − φ_0 − φ_1 y_{t-1})^2 / (2σ_u^2) ).

The MLE of θ is obtained by maximizing

    L_T(y_1, ..., y_T; θ) ∝ −(T/2) log(σ_u^2) + (1/2) log(1 − φ_1^2)
                            − (y_1 − φ_0/(1 − φ_1))^2 / (2σ_u^2/(1 − φ_1^2))
                            − Σ_{t=2}^T (y_t − φ_0 − φ_1 y_{t-1})^2 / (2σ_u^2).

Obviously, this is a non-linear optimization problem and can be solved using numerical methods. Alternatively, we may maximize the likelihood function conditional on the initial observation y_1:

    L_T(y_2, ..., y_T | y_1; θ) ∝ −((T − 1)/2) log(σ_u^2) − Σ_{t=2}^T (y_t − φ_0 − φ_1 y_{t-1})^2 / (2σ_u^2).

It is clear that the resulting MLEs φ̃_0 and φ̃_1 are nothing but the OLS estimators based on all but the first observation. The MLE of σ_u^2 is

    (1/(T − 1)) Σ_{t=2}^T (y_t − φ̃_0 − φ̃_1 y_{t-1})^2.

Note that the difference between the exact MLE and the conditional MLE results from the treatment of the first observation; this difference is negligible when the sample size T is sufficiently large. Similarly, the exact MLEs of an AR(p) model can be obtained by maximizing the (non-linear) log-likelihood function. The conditional MLEs of the AR coefficients are just the OLS estimators computed from the last T − p observations, and the MLE of σ_u^2 is

    (1/(T − p)) Σ_{t=p+1}^T (y_t − φ̃_0 − φ̃_1 y_{t-1} − ··· − φ̃_p y_{t-p})^2.

3.2 MLE for a MA(q) Model

Consider now an MA(1) model: y_t = π_0 − π_1 u_{t-1} + u_t, where {u_t} is also assumed to be a Gaussian white noise series. Let Y = [y_1 y_2 ... y_T]′ and θ = [π_0 π_1 σ_u^2]′. Y has the covariance matrix

    Σ = σ_u^2 [ 1+π_1^2   −π_1       0       ···    0        0
                 −π_1     1+π_1^2   −π_1     ···    0        0
                   ⋮         ⋮         ⋮               ⋮        ⋮
                   0         0         0     ···   −π_1    1+π_1^2 ],

which can be decomposed as ADA′, with D = σ_u^2 diag[d_1, d_2, ..., d_T],

    d_t = Σ_{j=0}^t π_1^{2j} / Σ_{j=0}^{t-1} π_1^{2j}, t = 1, ..., T,

and A a lower triangular matrix with ones on the main diagonal and −π_1/d_t, t = 1, ..., T − 1, on the first sub-diagonal.

The joint likelihood function can then be written as

    f(y_1, ..., y_T; θ) ∝ det(Σ)^{−1/2} exp( −(1/2)(Y − π_0 ℓ)′ Σ^{−1} (Y − π_0 ℓ) )
                        = det(D)^{−1/2} exp( −(1/2) Ỹ′ D^{−1} Ỹ )
                        = (σ_u^2)^{−T/2} (Π_{t=1}^T d_t)^{−1/2} exp( −(1/(2σ_u^2)) Σ_{t=1}^T ỹ_t^2/d_t ),

where Ỹ = A^{−1}(Y − π_0 ℓ) with elements ỹ_t. The log-likelihood function is then

    L(Y; θ) ∝ −(T/2) log(σ_u^2) − (1/2) Σ_{t=1}^T log(d_t) − (1/(2σ_u^2)) Σ_{t=1}^T ỹ_t^2/d_t.

This is a complex non-linear function in θ and must be maximized by non-linear optimization techniques.

Conditional on u_0 = 0 (say), we can compute u_t = y_t − π_0 + π_1 u_{t-1} iteratively. Hence, each u_t depends on the unknown parameters π_0 and π_1 directly, and indirectly through the presence of u_{t-1}. This parameter dependence structure makes the conditional likelihood function a very complex non-linear function in θ, in sharp contrast with the conditional likelihood function of an AR(p) model. Similarly, the exact and conditional MLEs for an MA(q) model must also be computed from non-linear likelihood functions.
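The iterative computation behind the conditional likelihood is easy to sketch: given (π_0, π_1) and u_0 = 0, the recursion u_t = y_t − π_0 + π_1 u_{t−1} yields the residuals whose sum of squares enters the conditional log-likelihood. A hedged sketch (function names are illustrative):

```python
import math

def ma1_residuals(y, pi0, pi1):
    # Conditional on u_0 = 0, iterate u_t = y_t - pi0 + pi1 * u_{t-1}
    # for the MA(1) model y_t = pi0 - pi1 * u_{t-1} + u_t.
    u, prev = [], 0.0
    for yt in y:
        prev = yt - pi0 + pi1 * prev
        u.append(prev)
    return u

def ma1_cond_negloglik(y, pi0, pi1, sigma2):
    # Gaussian conditional negative log-likelihood, up to constants.
    u = ma1_residuals(y, pi0, pi1)
    return 0.5 * len(y) * math.log(sigma2) + sum(ut ** 2 for ut in u) / (2.0 * sigma2)

# Sanity check: with pi1 = 0 the model is white noise around pi0,
# so the residuals are simply y_t - pi0.
y = [1.0, 2.0, 0.5, 1.5]
print(ma1_residuals(y, 1.0, 0.0))  # [0.0, 1.0, -0.5, 0.5]
```

Because each u_t feeds into u_{t+1}, the objective is non-linear in (π_0, π_1) and must be minimized numerically, exactly as the text notes.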

4 Check White Noise for Filtered Residuals

To check whether an observed time series comes from a white-noise stochastic process, a test of the null of zero lag-1 autocorrelation is studied first. If {y_t} is an independent and identically distributed (iid) sequence with E(y_t^2) < ∞, then

    ρ̂_T(1) = Σ_{t=2}^T (y_t − ȳ_{T−1})(y_{t-1} − ȳ_{T−1}) / Σ_{t=2}^T (y_{t-1} − ȳ_{T−1})^2

satisfies √T ρ̂_T(1) → N(0, 1) for T large enough; see Brockwell and Davis (1991, Theorem 7.2.2). Thus, for testing H_0 : ρ(1) = 0 against H_a : ρ(1) ≠ 0, the test statistic under the null is

    √T ρ̂_T(1) → N(0, 1).

In general, for testing H_0 : ρ(l) = 0 against H_a : ρ(l) ≠ 0, the test statistic under the null is

    √T ρ̂_T(l) → N(0, 1), where

    ρ̂_T(l) = Σ_{t=l+1}^T (y_t − ȳ_{T−l})(y_{t−l} − ȳ_{T−l}) / Σ_{t=l+1}^T (y_{t−l} − ȳ_{T−l})^2.

4.1 Portmanteau Test

Box and Pierce (1970) propose the portmanteau statistic

    Q*(m) = T Σ_{l=1}^m ρ̂_T(l)^2

to test the null hypothesis H_0 : ρ(1) = ··· = ρ(m) = 0 against the alternative hypothesis H_a : ρ(i) ≠ 0 for some i. Under the null, Q*(m) → χ^2(m) as T → ∞. Ljung and Box (1978) modify the Q*(m) statistic to increase the power of the test in finite samples:

    Q(m) = T(T + 2) Σ_{l=1}^m ρ̂_T(l)^2 / (T − l).

Given the MLE estimation of an ARMA(p,q) model, the filtered residuals are defined as

    û_t = y_t − Σ_{i=1}^p φ̂_i y_{t-i} − Σ_{j=1}^q π̂_j û_{t-j}, t = p + q + 1, ..., T.

Replacing {y_t, t = 1, ..., T} with {û_t, t = p + q + 1, ..., T}, the above portmanteau test is applicable for checking whether û_t is white noise. If the null is not rejected, u_t is inferred to be white noise and the time series model is deemed appropriate. Note that if the portmanteau test statistic is insignificant for the filtered residuals from an estimated ARMA(p,q) model, it will also be insignificant for the filtered residuals from a model with higher AR and/or MA orders than p and q. To identify the model parsimoniously (that is, with fewer parameters to estimate and hence more degrees of freedom), model selection criteria have been applied.
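Both portmanteau statistics are simple functions of the sample autocorrelations. A sketch (illustrative names; the comparison with χ^2(m) critical values is left to a table or a statistics library):

```python
def box_pierce(rhos, T):
    # Q*(m) = T * sum_{l=1}^{m} rho_hat(l)^2.
    return T * sum(r ** 2 for r in rhos)

def ljung_box(rhos, T):
    # Q(m) = T (T + 2) * sum_{l=1}^{m} rho_hat(l)^2 / (T - l).
    return T * (T + 2) * sum(r ** 2 / (T - l) for l, r in enumerate(rhos, start=1))

# Hypothetical sample autocorrelations at lags 1..3 from T = 100 residuals:
rhos, T = [0.10, -0.05, 0.02], 100
print(round(box_pierce(rhos, T), 2))  # 1.29
print(round(ljung_box(rhos, T), 2))   # 1.33
# Both are far below the chi-square(3) critical value of about 7.81 at the
# 5% level, so the null of no residual autocorrelation is not rejected.
```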

4.2 Testing That a Dependent Process Is Uncorrelated

Lobato (2001b) provides a testing procedure for the null hypothesis that a stochastic process is uncorrelated when the process is possibly dependent. Let y_t be a covariance stationary stochastic process with mean µ and autocovariance at lag j, γ(j) = E[(y_t − µ)(y_{t-j} − µ)]. A sample y_t (t = 1, ..., T) is observed. Denote by ȳ_T = Σ_{t=1}^T y_t/T the sample mean. The null hypothesis of interest is H_0 : µ = 0 and the alternative is H_a : µ ≠ 0.

As √T(ȳ_T − µ) ⇒ N(0, 2πf_y(0)), where ⇒ refers to weak convergence and f_y(0) denotes the spectral density of y_t at zero frequency, which satisfies 2πf_y(0) = Σ_{j=−∞}^∞ γ(j), the test statistic T ȳ_T^2/(2πf_y(0)) converges to χ^2(1) and is appropriate for testing the null when f_y(0) is known. It is usually unknown, however; it can be estimated by the kernel smoothing method, but the bandwidth parameter for the kernel smoothing estimation of f_y(0) is difficult to determine. Under weak dependence conditions, and under the null hypothesis, for any r ∈ (0, 1) an invariance principle or functional central limit theorem (FCLT) holds, namely,

[Tr] 1 y [2πf (0)]1/2B(r), √ t ⇒ y T t=1 X where B(r) is a , and [T r] denotes the integer part of T r. And,

[Tr] 1 (y y¯ ) [2πf (0)]1/2[B(r) rB(1)], √ t − T ⇒ y − T t=1 X where B(r) rB(1) is called a Brownian bridge − In addition, the statistic SM = T −2 T [ t (y y¯ )]2 converges weakly to t=1 s=1 s − T 2πf (0) 1[B(r) rB(1)]2dr. Consider an alternative test statistic for H : µ =0 as y 0 − P P 0 R 2 2 M T y¯T B(1) T = U1 = SM ⇒ 1[B(r) rB(1)]2dr 0 − which “does not depend” onR 2πfy(0). However the limiting distribution is not standard. Approximate upper critical values of the distribution are tabulated in the first row of Table 1 in Lobato (2001b) by means of simulations. The critical values for 90 %, 95 %, and 99 % are 28.31, 45.4, and 99.76, respectively.
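The self-normalized statistic $M_T$ is simple to compute, since it requires only the sample mean and the partial sums of the demeaned observations. A minimal numpy sketch (the function name is illustrative):

```python
import numpy as np

def lobato_m(y):
    """Lobato (2001b) M_T = T * ybar^2 / S_M, where S_M is built from
    partial sums of (y_t - ybar); no estimate of 2*pi*f_y(0) is needed.
    The limit U_1 is nonstandard; upper critical values are 28.31 (90%),
    45.4 (95%), and 99.76 (99%) from Table 1 of Lobato (2001b)."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    ybar = y.mean()
    partial = np.cumsum(y - ybar)          # sum_{s<=t} (y_s - ybar)
    S_M = np.sum(partial**2) / T**2
    return T * ybar**2 / S_M

rng = np.random.default_rng(1)
print(lobato_m(rng.standard_normal(400)))        # mean zero: typically small
print(lobato_m(5.0 + rng.standard_normal(400)))  # mean five: far beyond 99.76
```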

4.2.1 Testing H0 : µ = 0

Let $y_t$ be a covariance stationary stochastic vector process of dimension $K$ with mean $\mu$ and autocovariance matrix at lag $j$, $\Gamma(j) = E[(y_t - \mu)(y_{t-j} - \mu)']$. A sample $y_t$, $t = 1, \dots, T$, is observed, and the sample mean is denoted by $\bar y_T = \sum_{t=1}^{T} y_t / T$. The null hypothesis is $H_0: \mu = 0$. Under the null (see Davidson, 1994, chap. 24; Dhrymes, 1998, chap. 9),

\[ \sqrt{T}(\bar y_T - \mu) \Rightarrow N(0, 2\pi f_y(0)), \]

where $f_y(0)$ is the spectral density matrix of $y_t$ at zero frequency, which satisfies $2\pi f_y(0) = \sum_{j=-\infty}^{\infty} \Gamma(j)$. With $2\pi f_y(0) = \Psi\Psi'$ and under the null,

\[ \sqrt{T}\, \Psi^{-1} \bar y_T \Rightarrow N(0, I_K). \]

Denote $SM_{[Tr]} = \sum_{t=1}^{[Tr]} (y_t - \bar y_T)$. Under the null and some regularity conditions,

\[ \frac{1}{\sqrt{T}} SM_{[Tr]} \Rightarrow \Psi [B_K(r) - r B_K(1)], \]

where $B_K(\cdot)$ denotes a $K$-dimensional Brownian motion. Moreover,

\[ C_M = T^{-2} \sum_{t=1}^{T} SM_t\, SM_t' \Rightarrow \Psi V_K \Psi', \]

where $V_K = \int_0^1 [B_K(r) - rB_K(1)][B_K(r) - rB_K(1)]' \, dr$. The test statistic for $H_0: \mu = 0$ suggested by Lobato (2001b) is

\[ M_T = T \bar y_T' C_M^{-1} \bar y_T \Rightarrow U_K = B_K(1)' V_K^{-1} B_K(1). \]

Upper critical values of the distribution of $U_K$ are tabulated in Table 1 of Lobato (2001b) by means of simulations.

4.2.2 Testing H0 : γ(1) = 0

Denote by $c_1 = T^{-1} \sum_{t=2}^{T} (y_t - \bar y_T)(y_{t-1} - \bar y_T)$ the lag-one sample autocovariance. $c_1$ can be seen as the sample mean of $z_{1t} = (y_t - \bar y_T)(y_{t-1} - \bar y_T)$. That is, under weak dependence conditions, the following CLT for $c_1$ holds:

\[ \sqrt{T}(c_1 - \gamma(1)) \Rightarrow N(0, 2\pi f_{\tilde z_1}(0)), \]

where $f_{\tilde z_1}(0)$ is the spectral density at zero frequency of $\tilde z_{1t} = (y_t - \mu)(y_{t-1} - \mu)$. Under the null $H_0: \gamma(1) = 0$, the test statistic $T c_1^2 / [2\pi \hat f_{\tilde z_1}(0)]$ converges weakly to $\chi^2(1)$, where $\hat f_{\tilde z_1}(0)$ is a consistent estimator for $f_{\tilde z_1}(0)$. Under the null $H_0: \gamma(1) = 0$, the FCLT for the partial sum $\sum_{t=1}^{[Tr]} z_{1t}$ holds, i.e.,

\[ \frac{1}{\sqrt{T}} \sum_{t=1}^{[Tr]} z_{1t} \Rightarrow [2\pi f_{\tilde z_1}(0)]^{1/2} B(r), \]

and

\[ C_1 = T^{-2} \sum_{t=1}^{T} \left[\sum_{s=1}^{t} (z_{1s} - c_1)\right]^2 \Rightarrow 2\pi f_{\tilde z_1}(0) \int_0^1 [B(r) - rB(1)]^2 \, dr. \]

Hence, for one-sided hypothesis testing, Lobato (2001b) considers the statistic $T_1^* = T^{1/2} c_1 / C_1^{1/2}$, and for two-sided hypothesis testing the statistic $T_1 = T c_1^2 / C_1$. Under the null hypothesis, $T_1^* \Rightarrow U_1^{1/2}$ and $T_1 \Rightarrow U_1$.

In general, the null hypothesis of interest is that the process is uncorrelated up to lag $K$, $H_0: \gamma(1) = \gamma(2) = \cdots = \gamma(K) = 0$, and the alternative is $H_a: \gamma(j) \neq 0$ for some $j = 1, \dots, K$. Define the vector of sample autocovariances $c = (c_1, \dots, c_K)'$, the vector of population autocovariances $\gamma = (\gamma(1), \dots, \gamma(K))'$, the vector $z_t = (z_{1t}, \dots, z_{Kt})'$ with $z_{kt} = (y_t - \bar y_T)(y_{t-k} - \bar y_T)$, and the vector $\tilde z_t = (\tilde z_{1t}, \dots, \tilde z_{Kt})'$ with $\tilde z_{kt} = (y_t - \mu)(y_{t-k} - \mu)$. Under a variety of weak dependence conditions [see Romano and Thombs (1996, pp. 592, 599)], the following CLT can be derived:

\[ \sqrt{T}(c - \gamma) \Rightarrow N(0, 2\pi f_{\tilde z}(0)), \]

where $f_{\tilde z}(0)$ is the spectral density matrix at zero frequency of the vector $\tilde z_t$. Therefore,

\[ \sqrt{T}\, \Phi^{-1} c \Rightarrow N(0, I_K), \]

where the lower triangular $K \times K$ matrix $\Phi$ satisfies $2\pi f_{\tilde z}(0) = \Phi\Phi'$. Given a consistent estimator $\hat f_{\tilde z}(0)$ for $f_{\tilde z}(0)$, Lobato (2001a) introduces the test statistic

\[ \tilde Q_K = T c' [2\pi \hat f_{\tilde z}(0)]^{-1} c, \]

which converges weakly to $\chi^2(K)$. Denote the vector

\[ S_{[Tr]} = \sum_{t=1}^{[Tr]} \begin{pmatrix} z_{1t} - c_1 \\ \vdots \\ z_{Kt} - c_K \end{pmatrix}. \]

Under the null hypothesis,

\[ \frac{1}{\sqrt{T}} \sum_{t=1}^{[Tr]} z_t \Rightarrow \Phi B_K(r), \]

where $B_K(\cdot)$ denotes a $K$-dimensional Brownian motion. Therefore,

\[ \frac{1}{\sqrt{T}} S_{[Tr]} \Rightarrow \Phi [B_K(r) - r B_K(1)], \]

and

\[ C_K = \frac{1}{T^2} \sum_{t=1}^{T} S_t S_t' \Rightarrow \Phi V_K \Phi', \]

where $V_K = \int_0^1 [B_K(r) - rB_K(1)][B_K(r) - rB_K(1)]' \, dr$. The test statistic proposed by Lobato (2001b) is

\[ T_K = T c' C_K^{-1} c \Rightarrow U_K. \]
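For the scalar case, the two-sided statistic $T_1 = T c_1^2 / C_1$ can likewise be computed without any spectral density estimate. A minimal numpy sketch (function name illustrative; the partial sums run over $t = 2, \dots, T$, a minor edge convention):

```python
import numpy as np

def lobato_t1(y):
    """Two-sided Lobato (2001b) statistic for H0: gamma(1) = 0.
    c_1 is the lag-one sample autocovariance (divisor T, as in the text);
    C_1 self-normalizes via partial sums of z_1t - c_1."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    ybar = y.mean()
    z1 = (y[1:] - ybar) * (y[:-1] - ybar)    # z_{1t}, t = 2, ..., T
    c1 = np.sum(z1) / T
    partial = np.cumsum(z1 - c1)
    C1 = np.sum(partial**2) / T**2
    return T * c1**2 / C1

rng = np.random.default_rng(2)
print(lobato_t1(rng.standard_normal(400)))   # uncorrelated series: typically small
```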

4.3 Selection for a Parsimonious Model

There are different ways to determine whether an estimated ARMA model is "correct". One approach is to estimate a family of ARMA models and select the "best" (and "most parsimonious") model among this family according to a model selection criterion such as AIC or SIC. Let $\hat u_t$ denote the residual from a fitted ARMA($p,q$) model and $\tilde\sigma^2 = \sum_{t=1}^{T} \hat u_t^2 / T$ the variance estimate. We have

\[ AIC = \log \tilde\sigma^2 + 2(p+q+1)/T, \qquad SIC = \log \tilde\sigma^2 + [(p+q+1)\log T]/T. \]

Note that SIC is “dimensionally consistent” for ARMA models.
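As an illustration of order selection by these criteria, one can fit candidate models over a grid of orders and pick the minimizer. The sketch below uses OLS-fitted AR($p$) models in place of full ARMA MLE (an assumption made to keep the example self-contained); the simulated AR(2) series is illustrative:

```python
import numpy as np

def ar_resid_var(y, p):
    """Residual variance sigma_tilde^2 from an OLS-fitted AR(p) with intercept."""
    if p == 0:
        return np.mean((y - y.mean())**2)
    Y = y[p:]
    X = np.column_stack([np.ones(len(Y))] +
                        [y[p - i:len(y) - i] for i in range(1, p + 1)])
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return np.mean((Y - X @ beta)**2)

def aic_sic(y, p, q=0):
    """AIC and SIC exactly as defined in the text."""
    T, s2 = len(y), ar_resid_var(y, p)
    return (np.log(s2) + 2 * (p + q + 1) / T,
            np.log(s2) + (p + q + 1) * np.log(T) / T)

rng = np.random.default_rng(42)
e = rng.standard_normal(600)
y = np.zeros(600)
for t in range(2, 600):                     # true model: AR(2)
    y[t] = 0.5 * y[t - 1] - 0.3 * y[t - 2] + e[t]
scores = {p: aic_sic(y, p) for p in range(5)}
best = min(scores, key=lambda p: scores[p][1])   # minimize SIC
print(best)
```

Because the SIC penalty grows with $\log T$, it tends to pick a lower order than AIC in large samples.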

4.4 Impulse Response Function

Given a stationary and invertible ARMA process $\phi(B)Y_t = \pi(B)Z_t$,

\[ Y_t = [\pi(B)/\phi(B)] Z_t = \psi(B) Z_t. \qquad (1) \]

The weights $\psi_0, \psi_1, \psi_2, \dots$ in the function

\[ \psi(B) = \psi_0 + \psi_1 B + \psi_2 B^2 + \cdots \]

are called the impulse response function of $Y_t$. The weight $\psi_j$ can be interpreted as the "response" of $Y_t$ to a unit (one standard deviation) change of $Z_{t-j}$. Given (1),

\[ Y_{t+j} = \psi_0 Z_{t+j} + \psi_1 Z_{t+j-1} + \psi_2 Z_{t+j-2} + \cdots + \psi_j Z_t + \cdots. \]

Define the impulse response (dynamic multiplier) of the future observation $Y_{t+j}$ with respect to a one-unit change of $Z_t$ as $\partial Y_{t+j} / \partial Z_t = \psi_j$, which depends only on $j$ but not on $t$. A dynamic system is said to be stable if its impulse response eventually "vanishes" as $j$ tends to infinity; a dynamic system is "explosive" if its impulse response "diverges" as $j$ increases. For a stable dynamic equation, summing the impulse responses yields an accumulated response (interim multiplier):

\[ \sum_{i=0}^{j} \frac{\partial Y_{t+j}}{\partial Z_{t+i}} = \sum_{i=0}^{j} \psi_i. \]

Letting $j$ tend to infinity, we have

\[ \lim_{j\to\infty} \sum_{i=0}^{j} \frac{\partial Y_{t+j}}{\partial Z_{t+i}} = \lim_{j\to\infty} \sum_{i=0}^{j} \psi_i, \]

which represents the long-run effect (total multiplier) of a permanent change in $Z_t$. This is the total effect resulting from the changes of current and all subsequent innovations.
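The $\psi$ weights can be computed recursively from $\phi(B)\psi(B) = \pi(B)$: matching coefficients of $B^j$ gives $\psi_0 = 1$ and $\psi_j = \sum_{i=1}^{\min(j,p)} \phi_i \psi_{j-i} - \pi_j$ (with $\pi_j = 0$ for $j > q$), under the sign convention $\phi(B) = 1 - \phi_1 B - \cdots$ and $\pi(B) = 1 - \pi_1 B - \cdots$ used in these notes. A sketch:

```python
import numpy as np

def impulse_response(phi, pi, n):
    """First n+1 weights psi_0, ..., psi_n of psi(B) = pi(B)/phi(B)."""
    psi = np.zeros(n + 1)
    psi[0] = 1.0
    for j in range(1, n + 1):
        acc = -pi[j - 1] if j <= len(pi) else 0.0   # MA part: -pi_j
        for i in range(1, min(j, len(phi)) + 1):    # AR part: phi_i * psi_{j-i}
            acc += phi[i - 1] * psi[j - i]
        psi[j] = acc
    return psi

# stable ARMA(1,1) with phi_1 = 0.6 and pi_1 = 0.4:
psi = impulse_response([0.6], [0.4], 10)
print(psi[:3])   # psi_0 = 1, psi_1 = 0.6 - 0.4 = 0.2, psi_2 = 0.6 * 0.2 = 0.12
```

For this stable example the weights decay geometrically, so the dynamic system's responses vanish as $j$ grows.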

5 Forecasting

Once an ARMA model has been determined as the "best" model, it is believed to be capable of representing the behavior of the variable of interest. Forecasts should then be derived from the fitted model. Suppose that the selected model is ARMA($p,q$); then the relationship

\[ y_t = \phi_0 + \phi_1 y_{t-1} + \cdots + \phi_p y_{t-p} - \pi_1 \epsilon_{t-1} - \cdots - \pi_q \epsilon_{t-q} + \epsilon_t \]

should hold for observations in and out of sample. Let $\hat y_{t+h|t}$ denote the forecast of $y_{t+h}$ made at time $t$. Then, $h$-step-ahead forecasts are:

\[ \hat y_{t+h|t} = \tilde\phi_0 + \tilde\phi_1 \hat y_{t+h-1|t} + \cdots + \tilde\phi_p \hat y_{t+h-p|t} - \tilde\pi_1 \hat\epsilon_{t+h-1|t} - \cdots - \tilde\pi_q \hat\epsilon_{t+h-q|t} + \hat\epsilon_{t+h|t}, \]

where the $\tilde\phi$'s and $\tilde\pi$'s are the MLEs of the model parameters. Here, for $j \leq 0$, $\hat y_{t+j|t} = y_{t+j}$ are known observations and $\hat\epsilon_{t+j|t} = \hat u_{t+j}$ are residuals from the fitted model; for $j > 0$, $\hat\epsilon_{t+j|t} = 0$ is the best forecast of white noise. As an example, consider the forecasts from a stationary AR(1) model:

\[ \hat y_{t+h|t} = \tilde\phi_0 + \tilde\phi_1 \hat y_{t+h-1|t} + \hat\epsilon_{t+h|t} = \begin{cases} \tilde\phi_0 + \tilde\phi_1 y_t, & h = 1, \\ \tilde\phi_0 + \tilde\phi_1 \hat y_{t+h-1|t}, & h = 2, 3, \dots \end{cases} \]

It follows that

\[ \hat y_{t+h|t} = (1 + \tilde\phi_1 + \cdots + \tilde\phi_1^{h-1}) \tilde\phi_0 + \tilde\phi_1^h y_t, \]

and consequently, the $h$-step-ahead forecasts approach the constant $\tilde\phi_0/(1 - \tilde\phi_1)$, the unconditional mean, when $h$ is large. Consider also an MA(1) model:

\[ \hat y_{t+h|t} = \tilde\pi_0 - \tilde\pi_1 \hat\epsilon_{t+h-1|t} + \hat\epsilon_{t+h|t} = \begin{cases} \tilde\pi_0 - \tilde\pi_1 \hat u_t, & h = 1, \\ \tilde\pi_0, & h = 2, 3, \dots \end{cases} \]

Hence, the forecasts of two or more periods ahead are simply a constant.
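The AR(1) forecast recursion is straightforward to implement. A minimal sketch (parameter values illustrative), which also confirms that the forecasts converge to the unconditional mean $\tilde\phi_0/(1 - \tilde\phi_1)$:

```python
import numpy as np

def ar1_forecasts(phi0, phi1, y_t, H):
    """h-step-ahead AR(1) forecasts, h = 1, ..., H:
    yhat_{t+h|t} = phi0 + phi1 * yhat_{t+h-1|t}, starting from y_t."""
    fc = np.empty(H)
    prev = y_t
    for h in range(H):
        prev = phi0 + phi1 * prev
        fc[h] = prev
    return fc

fc = ar1_forecasts(phi0=1.0, phi1=0.5, y_t=4.0, H=20)
print(fc[0])    # h = 1: 1.0 + 0.5 * 4.0 = 3.0
print(fc[-1])   # h = 20: close to the unconditional mean 1.0 / (1 - 0.5) = 2.0
```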

5.1 Combined Forecasts

Forecasts may be obtained from econometric models, exponential smoothing algorithms, ARIMA models, and other forecasting schemes. As each method utilizes the information in the data differently, it may be desirable to combine two or more alternative forecasts to form a composite forecast. Let $f_{i,t+h|t}$, $i = 1, \dots, N$, denote the forecast of $y_{t+h}$ resulting from the $i$-th method. The forecast combining $N$ individual forecasts can be expressed as

\[ f_{t+h|t} = \sum_{j=1}^{N} w_j f_{j,t+h|t}, \qquad \text{with } \sum_{j=1}^{N} w_j = 1, \quad 0 \leq w_j \leq 1. \]

Let $e_{i,t+h|t} = y_{t+h} - f_{i,t+h|t}$ and $e_{t+h|t} = y_{t+h} - f_{t+h|t}$ be the errors resulting from individual and combined forecasts. It is readily seen that, if each individual forecast is unbiased (i.e., $E(e_{i,t+h|t}) = 0$), then so is the combined forecast (i.e., $E(e_{t+h|t}) = 0$).

The simplest way to combine forecasts is to give an equal weight $w_j = 1/N$ to each forecast, so that

\[ f_{t+h|t} = \frac{1}{N} \sum_{j=1}^{N} f_{j,t+h|t}. \]

To see why this combined forecast may be preferred, consider the case $N = 2$, so that

\[ E(e_{t+h|t}^2) = E\left[\left(\frac{e_{1,t+h|t} + e_{2,t+h|t}}{2}\right)^2\right] = \frac{1}{4}(\sigma_1^2 + \sigma_2^2 + 2\rho\sigma_1\sigma_2), \]

where $\sigma_{i,t+h|t}^2 = E(e_{i,t+h|t}^2)$, $i = 1, 2$, are the error variances and $\rho$ is the correlation coefficient of $e_{1,t+h|t}$ and $e_{2,t+h|t}$. Suppose further that these two forecasts are of the same quality, so that they have the same error variance $\sigma_{t+h|t}^2$. Then,

\[ E(e_{t+h|t}^2) = \frac{1}{2}\sigma_{t+h|t}^2 (1 + \rho) \leq \sigma_{t+h|t}^2, \]

and the equality holds when $\rho = 1$. It should be clear that the error variance of the combined forecast is less than that of each individual forecast as long as $\rho < 1$. Hence, it is relatively less likely for the combined forecast to commit huge forecasting errors. On the other hand, when the two forecasts are perfectly positively correlated ($\rho = 1$), a combined forecast would not be necessary.

Another approach is to make the combining weight of a given method inversely proportional to the sum of squared forecasting errors achieved by that particular method (over $T'$ periods). That is, for $j = 1, \dots, N$,

\[ w_j = \frac{\left(\sum_{t=1}^{T'} e_{j,t+h|t}^2\right)^{-1}}{\left(\sum_{t=1}^{T'} e_{1,t+h|t}^2\right)^{-1} + \cdots + \left(\sum_{t=1}^{T'} e_{N,t+h|t}^2\right)^{-1}}. \]

Thus, a method that yields a smaller sum of squared forecasting errors receives a larger weight for its forecast, and the combined forecast relies more heavily on the forecasts from "better" methods.

A more natural way is to find the combining weights that minimize the sum of squared forecasting errors. This amounts to fitting a regression model of $y_{t+h}$ on the forecasts $f_{i,t+h|t}$, $i = 1, \dots, N$, subject to the constraint that all weights sum to one, i.e.,

\[ y_{t+h} - f_{N,t+h|t} = w_1 (f_{1,t+h|t} - f_{N,t+h|t}) + w_2 (f_{2,t+h|t} - f_{N,t+h|t}) + \cdots + w_{N-1} (f_{N-1,t+h|t} - f_{N,t+h|t}) + u_t. \]

The combining weights are the estimated regression coefficients, and the combined forecasts are the fitted values of this model:

\[ f_{t+h|t} = \hat w_1 f_{1,t+h|t} + \hat w_2 f_{2,t+h|t} + \cdots + \hat w_{N-1} f_{N-1,t+h|t} + \Big(1 - \sum_{j=1}^{N-1} \hat w_j\Big) f_{N,t+h|t}. \]

More generally, the combined forecast may be determined by fitting the model:

\[ y_{t+h} = \alpha + \beta_1 f_{1,t+h|t} + \beta_2 f_{2,t+h|t} + \cdots + \beta_N f_{N,t+h|t} + u_t, \]

without restricting $\alpha = 0$ and the $\beta$'s to sum to one. Note, however, that there is no definite conclusion about the "best" method of combining forecasts. One may have to experiment with different methods before settling on a particular one.
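The unrestricted combining regression is a one-line least-squares problem. A minimal numpy sketch with two simulated unbiased forecasts (the data-generating setup is illustrative); by construction, the in-sample MSE of the fitted combination cannot exceed that of either individual forecast, since each individual forecast is itself a feasible fit:

```python
import numpy as np

def combine_ols(y, F):
    """Fit y_{t+h} = alpha + sum_i beta_i f_{i,t+h|t} + u_t by OLS.
    F has one column per individual forecast. Returns (coef, fitted)."""
    X = np.column_stack([np.ones(len(y)), F])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, X @ coef

rng = np.random.default_rng(7)
y = rng.standard_normal(200)
f1 = y + 0.8 * rng.standard_normal(200)   # two noisy, unbiased forecasts
f2 = y + 0.8 * rng.standard_normal(200)
coef, combined = combine_ols(y, np.column_stack([f1, f2]))
mse = lambda err: np.mean(err**2)
print(mse(y - combined), mse(y - f1), mse(y - f2))
```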

5.2 Forecast Evaluation

In practice, it is typical to partition the sample into two non-overlapping parts: the first sub-sample is used to determine an appropriate model for forecasting, and the

second sub-sample is used merely to evaluate the forecasts generated from the previously estimated model. As the information in the second sample is not used for model determination, the resulting forecasts are also referred to as "out-of-sample" forecasts. The forecasting performance of a particular model should be evaluated based on out-of-sample forecasts. There are various descriptive measures of the average quality of forecasts. For example:

1. Mean Squared Error (MSE): $MSE = \frac{1}{T'} \sum_{t=1}^{T'} e_{t+h|t}^2$, where $T'$ is the number of observations in the second sub-sample. A similar measure is the square root of MSE, also known as RMSE.

2. Mean Absolute Error (MAE): $MAE = \frac{1}{T'} \sum_{t=1}^{T'} |e_{t+h|t}|$.

3. Mean Absolute Percentage Error (MAPE): $MAPE = \frac{100}{T'} \sum_{t=1}^{T'} \frac{|e_{t+h|t}|}{y_{t+h}}$.

To compare forecasts from different methods, we may simply compare the magnitudes of their out-of-sample MSEs (or MAEs or MAPEs). We may also compute the ratio of the MSE of some model to the MSE of a benchmark model. Suppose the benchmark forecast is the naive forecast $f_{t+h|t} = y_{t-1}$. The RMSE ratio is also known as Theil's U:

\[ U = \left( \frac{\sum_{t=1}^{T'} (y_{t+h} - f_{t+h|t})^2}{\sum_{t=1}^{T'} (y_{t+h} - y_{t-1})^2} \right)^{1/2}. \]

A different way to evaluate the effectiveness of forecasts is to perform a regression:

\[ y_{t+h} = a + b f_{1,t+h|t} + u_t, \]

and test the joint hypothesis $a = 0$ and $b = 1$. Failing to reject the null hypothesis suggests that the forecasts are capable of describing the systematic variation of $y_{t+h}$. In addition, we may want to evaluate whether combining an alternative forecast $f_{2,t+h|t}$ with $f_{1,t+h|t}$ can reduce the MSE. For this purpose, we fit the model:

\[ y_{t+h} = a + b f_{1,t+h|t} + c f_{2,t+h|t} + u_t, \]

and test the joint hypothesis $a = 0$, $b = 1$, and $c = 0$. This model can of course be generalized to incorporate more alternative forecasts.

Finally, we examine a decomposition of the MSE. Let $S_f = [\sum_{t=1}^{T'} (f_{t+h|t} - \bar f)^2 / T']^{1/2}$, $S_y = [\sum_{t=1}^{T'} (y_{t+h} - \bar y)^2 / T']^{1/2}$, and

\[ r = \frac{1}{T' S_f S_y} \sum_{t=1}^{T'} (y_{t+h} - \bar y)(f_{t+h|t} - \bar f), \]

where $\bar f$ and $\bar y$ are the averages of $f_{t+h|t}$ and $y_{t+h}$, respectively. Then,

\[ MSE = \frac{1}{T'} \sum_{t=1}^{T'} [(y_{t+h} - \bar y) - (f_{t+h|t} - \bar f) + (\bar y - \bar f)]^2 = (\bar f - \bar y)^2 + (S_f - r S_y)^2 + (1 - r^2) S_y^2. \]

Let $UM$, $UR$, and $UD$ denote, respectively, the three terms on the right-hand side divided by the MSE. Hence, $UM + UR + UD = 1$. Clearly, $UM$ compares the means of $y_{t+h}$ and $f_{t+h|t}$. For the other two terms, we observe that $S_f - r S_y = (1 - r S_y / S_f) S_f$, where

\[ r S_y / S_f = \frac{\sum_{t=1}^{T'} (y_{t+h} - \bar y)(f_{t+h|t} - \bar f)}{\sum_{t=1}^{T'} (f_{t+h|t} - \bar f)^2} \]

is the slope, $\hat b$, of the regression of $y_{t+h}$ on $f_{t+h|t}$, and that

\[ (1 - r^2) S_y^2 = S_y^2 - \left(\frac{r S_y}{S_f}\right) r S_y S_f = \frac{1}{T'} \sum_{t=1}^{T'} (y_{t+h} - \bar y)[(y_{t+h} - \bar y) - \hat b (f_{t+h|t} - \bar f)] = \frac{1}{T'} \sum_{t=1}^{T'} [(y_{t+h} - \bar y) - \hat b (f_{t+h|t} - \bar f)]^2. \]

Therefore, $UR$ measures the difference between $\hat b$ and one, and $UD$ measures the magnitude of the residuals from the regression of $y_{t+h}$ on $f_{t+h|t}$. This decomposition shows that forecasts with small $UM$ (bias proportion), small $UR$ (regression proportion), and $UD$ (disturbance proportion) close to one will be preferred.
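The descriptive measures and the MSE decomposition above can be sketched in a few lines of numpy (function names illustrative; the MAPE line assumes $y_{t+h} > 0$, and `np.roll` is only a stand-in for the naive benchmark $y_{t-1}$):

```python
import numpy as np

def forecast_measures(y, f, y_naive):
    """MSE, MAE, MAPE (in percent), and Theil's U as defined in the text.
    y: realized y_{t+h}; f: forecasts; y_naive: naive benchmark y_{t-1}."""
    e = y - f
    mse = np.mean(e**2)
    mae = np.mean(np.abs(e))
    mape = 100.0 * np.mean(np.abs(e) / y)        # assumes positive y
    u = np.sqrt(np.sum(e**2) / np.sum((y - y_naive)**2))
    return mse, mae, mape, u

def mse_decomposition(y, f):
    """Bias (UM), regression (UR), and disturbance (UD) proportions of MSE."""
    fbar, ybar = f.mean(), y.mean()
    Sf = np.sqrt(np.mean((f - fbar)**2))
    Sy = np.sqrt(np.mean((y - ybar)**2))
    r = np.mean((y - ybar) * (f - fbar)) / (Sf * Sy)
    mse = np.mean((y - f)**2)
    return ((fbar - ybar)**2 / mse,
            (Sf - r * Sy)**2 / mse,
            (1 - r**2) * Sy**2 / mse)

rng = np.random.default_rng(3)
y = 10.0 + rng.standard_normal(100)
f = y + 0.5 * rng.standard_normal(100)      # an imperfect forecast
naive = np.roll(y, 1)                       # stand-in for y_{t-1}
um, ur, ud = mse_decomposition(y, f)
print(forecast_measures(y, f, naive), um + ur + ud)   # proportions sum to 1
```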

5.3 Test for Martingale Difference

Escanciano and Mayoral (2007) provide data-driven smooth tests for the martingale difference hypothesis (MDH). Let $\{Y_t\}$ be a strictly stationary and ergodic time series process defined on the probability space $(\Omega, \mathcal{F}, \mathbb{P})$. The MDH states that the best predictor, in a mean squared error sense, of $Y_t$ given $I_{t-1} := (Y_{t-1}, Y_{t-2}, \dots)'$ is just the unconditional expectation, which is zero for a martingale difference sequence (mds). In other words, the MDH states that $Y_t = X_t - X_{t-1}$, where $X_t$ is a martingale process with respect to the $\sigma$-field generated by $I_{t-1}$, i.e., $\mathcal{F}_{t-1} := \sigma(I_{t-1})$. The null hypothesis that $Y_t$ is a martingale difference series is

\[ H_0: E(Y_t \mid I_{t-1}) = 0 \quad \text{almost surely (a.s.)} \]

The alternative $H_1$ is the negation of the null, i.e., $Y_t$ is not an mds. Given a sample $\{y_t\}_{t=1}^{T}$, consider the marked empirical process (cf. Koul and Stute, 1999)

\[ R_T(\theta) = \frac{1}{\hat\sigma_T \sqrt{T}} \sum_{t=1}^{T} y_t 1(y_{t-1} < \theta), \qquad \theta \in \mathbb{R}, \]

where $\hat\sigma_T^2 = T^{-1} \sum_{t=1}^{T} y_t^2$ and $1(A)$ denotes the indicator function, i.e., $1(A) = 1$ if $A$ occurs and $0$ otherwise. Let $B$ be a standard Brownian motion on $[0,1]$, and define

\[ \tau^2(\theta) = \sigma^{-2} E[y_t^2 1(y_{t-1} \leq \theta)], \]

where $\sigma^2 = E(y_t^2)$. Notice that $\tau^2(\infty) = 1$, $\tau^2(-\infty) = 0$, and $\tau^2$ is nondecreasing and continuous. Then, under regularity conditions and the null hypothesis,

\[ R_T(\theta) \Rightarrow B(\tau^2(\theta)). \]

An immediate consequence of Theorem 1 and Lemma 3.1 in Chang (1990) is that, under some regularity conditions and $H_0$,

\[ CvM_T := \int_{\mathbb{R}} |R_T(\theta)|^2 \, \tau_T^2(d\theta) \stackrel{d}{\to} CvM_\infty := \int_{\mathbb{R}} |B(\tau^2(\theta))|^2 \, \tau^2(d\theta) = \int_0^1 |B(u)|^2 \, du, \]

where $\tau_T^2(\theta) = \hat\sigma_T^{-2} T^{-1} \sum_{t=1}^{T} y_t^2 1(y_{t-1} \leq \theta)$. Alternatively,

\[ KS_T := \sup_{\theta \in \mathbb{R}} |R_T(\theta)| \stackrel{d}{\to} \sup_{\theta \in \mathbb{R}} |B(\tau^2(\theta))| = \sup_{u \in [0,1]} |B(u)|. \]
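Because $R_T(\theta)$ is a step function in $\theta$ and $\tau_T^2(d\theta)$ puts mass $\hat\sigma_T^{-2} T^{-1} y_t^2$ at $\theta = y_{t-1}$, both statistics can be computed exactly by evaluating $R_T$ at the observed lagged values. A minimal numpy sketch (with the sum starting at $t = 2$, since $y_0$ is unobserved, an edge convention not spelled out in the text):

```python
import numpy as np

def mdh_statistics(y):
    """CvM_T and KS_T built from the marked empirical process R_T(theta).
    R_T jumps at each observed y_{t-1}, so both statistics follow from
    cumulative sums of the marks y_t taken in the order of sorted y_{t-1}."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    sigma2 = np.mean(y**2)                      # sigma_hat_T^2
    ylag, ycur = y[:-1], y[1:]                  # pairs (y_{t-1}, y_t), t = 2..T
    order = np.argsort(ylag)
    marks = ycur[order]
    norm = np.sqrt(sigma2 * T)
    prefix = np.cumsum(marks) / norm            # R_T just right of each jump
    rt_at = np.concatenate([[0.0], prefix[:-1]])  # R_T at jump points (strict <)
    mass = marks**2 / (sigma2 * T)              # tau_T^2 point masses
    cvm = np.sum(rt_at**2 * mass)
    ks = np.max(np.abs(prefix))
    return cvm, ks

rng = np.random.default_rng(4)
cvm, ks = mdh_statistics(rng.standard_normal(500))
print(cvm, ks)
```

Critical values for these limiting functionals of Brownian motion would be obtained by simulation, as in the references cited above.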
