

Stationary stochastic processes, parts of Chapters 2 and 6

Georg Lindgren, Holger Rootzén, and Maria Sandsten

Question marks indicate references to other parts of the book. Comments and plots regarding spectral densities are not supposed to be understood at this stage.

Chapter 1

Stationary processes

1.1 Introduction

In Section 1.2, we introduce two basic functions: the mean value function, which is the expected process value as a function of time t, and the covariance function, which is the covariance between process values at times s and t. We remind of some simple rules for expectations and covariances, for example that covariances are linear in both arguments. We also give many examples of how the mean value and covariance functions shall be interpreted. The main focus is on processes for which the statistical properties do not change with time – they are (statistically) stationary. Strict stationarity and weak stationarity are defined. A dynamical system, for example a linear system, is often described by a set of state variables, which summarize all important properties of the system at time t, and which change with time under the influence of some environmental variables. Often the variables are random, and then they must be modeled as a stochastic process. State variables are further dealt with in Chapter ??. The statistical problem of how to find good models for a random phenomenon is also dealt with in this chapter, in particular how one should estimate the mean value function and the covariance function from data. The dependence between different process values needs to be taken care of when constructing confidence intervals and testing hypotheses.

1.2 Moment functions

The statistical properties of a stochastic process {X(t), t ∈ T} are determined by its distribution functions. Expectation and standard deviation catch two important properties of the distribution of X(t), and for a stochastic process these may be functions of time. To describe the time dynamics of the sample functions, we also need some simple measures of the dependence over time. The statistical definitions are simple, but the practical interpretation can be complicated. We illustrate this by the simple concepts of “average temperature” and “day-to-day correlation”.



Figure 1.1: Daily average temperature in Målilla during January, for 1988 – 1997. The fat curves mark the years 1992 and 1997.

Example 1.1. (“Daily temperature”) Figure 1.1 shows plots of the temperature in the small Swedish village of Målilla, averaged over each day, during the month of January for the ten years 1988 – 1997. Obviously, there have been large variations between years, and it has been rather cold for several days in a row. The global circulation is known to be a very chaotic system, and it is hard to predict the weather more than a few days ahead. However, modern weather forecasting has adopted a statistical approach to prediction, together with the computer intensive numerical methods which form the basis for all weather forecasts. Nature is regarded as a stochastic weather generator, where the distributions depend on geographical location, time of the year, etc, and with strong dependence from day to day. One can very well imagine that the data in the figure are the results of such a “weather roulette”, which for each year decides on the dominant weather systems, and on the day-to-day variation. With the statistical approach, we can think of the ten years of data as observations of a stochastic process {X_1, ..., X_31}. The mean value function is m(t) = E[X_t]. Since there is no theoretical reason to assume any particular values for the expected temperatures, one has to rely on historical data. In meteorology, the observed mean temperature during a 30 year period is often used as a standard. The covariance structure in the temperature series can also be analyzed from the data. Figure 1.2 illustrates the dependence between the temperatures from one day to the next.


Figure 1.2: Scatter plots of temperatures for years 1988 – 1996 for two successive days (left plot) and two days, five days apart (right plot). One can see a weak similarity between temperatures for adjacent days, but it is hard to see any connection with five days separation.

For each of the nine years 1988 – 1996, the left part of Figure 1.2 shows scatter plots of the pairs (X_t, X_{t+1}), i.e., with the temperature one day on the horizontal axis and the temperature the next day on the vertical axis. There seems to be a weak dependence: two successive days are correlated. To the right we have similar scatter plots, but now with five days separation, i.e., the data are (X_t, X_{t+5}). There is almost no correlation between two days that are five days apart. 2

1.2.1 Definitions

We now introduce the basic statistical measures of average and correlation. Let {X(t), t ∈ T} be a real valued stochastic process with discrete or continuous time.

Definition 1.1 For any stochastic process, the first and second order moment functions are defined as

m(t) = E[X(t)]   mean value function (mvf)
v(t) = V[X(t)]   variance function (vf)
r(s, t) = C[X(s), X(t)]   covariance function (cvf)
b(s, t) = E[X(s)X(t)]   second-moment function
ρ(s, t) = ρ[X(s), X(t)]   correlation function

There are some simple relations between these functions:

r(t, t) = C[X(t), X(t)] = V[X(t)] = v(t),

r(s, t) = b(s, t) − m(s)m(t),

ρ(s, t) = C[X(s), X(t)] / \sqrt{V[X(s)] V[X(t)]} = r(s, t) / \sqrt{r(s, s) r(t, t)}.

These functions provide essential information about the process. The meaning of the mean value and variance functions are intuitively clear and easy to understand. For example, the mean value function describes how the expected level changes with time, like we expect colder weather during winter months than during summer. The (square root of the) variance function tells us what magnitude of fluctuations we can expect. The covariance function has no such immediate interpretation, even if its statistical meaning is clear enough as a covariance. For example, in the ocean wave example, Example ??, the covariance r(s, s + 5) is negative and r(s, s + 10) is positive, corresponding to the fact that measurements five seconds apart often fall on the opposite side of the mean level, while measurements at ten seconds distance often are on the same side. Simply stated, it is the similarity between observations as a function of the times of measurements. If there is more than one stochastic process in a study, one can distinguish the moment functions by indexing them, as m_X, r_X, etc. A complete name for the covariance function is then auto-covariance function, to distinguish it from a cross-covariance function. In Chapter ??, we will investigate this measure of co-variation between two stochastic processes.

Definition 1.2 The function

r_{X,Y}(s, t) = C[X(s), Y(t)] = E[X(s)Y(t)] − m_X(s) m_Y(t),

is called the cross-covariance function between {X(t), t ∈ T} and {Y(t), t ∈ T}.

1.2.2 Simple properties and rules

The first and second order moment functions are linear and bi-linear, respectively. We formulate the following generalization of the rules E[aX + bY] = a E[X] + b E[Y], V[aX + bY] = a^2 V[X] + b^2 V[Y], which hold for uncorrelated random variables X and Y.

Theorem 1.1. Let a_1, ..., a_k and b_1, ..., b_l be real constants, and let X_1, ..., X_k and Y_1, ..., Y_l be random variables in the same experiment, i.e., defined on a common sample space. Then

E[\sum_{i=1}^{k} a_i X_i] = \sum_{i=1}^{k} a_i E[X_i],

V[\sum_{i=1}^{k} a_i X_i] = \sum_{i=1}^{k} \sum_{j=1}^{k} a_i a_j C[X_i, X_j],

C[\sum_{i=1}^{k} a_i X_i, \sum_{j=1}^{l} b_j Y_j] = \sum_{i=1}^{k} \sum_{j=1}^{l} a_i b_j C[X_i, Y_j].

The rule for the covariance between sums of random variables, C[\sum_i a_i X_i, \sum_j b_j Y_j], is easy to remember and use: the total covariance between two sums is a double sum of all covariances between pairs of one term from the first sum and one term from the second sum.
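To make the double-sum rule concrete, here is a small numerical check in Python (an added sketch, not from the book; the mixing coefficients, sample size and seed are arbitrary choices): the covariance of two linear combinations is computed directly and via the double sum of pairwise covariances.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate correlated (X1, X2, Y1) by mixing independent standard normals.
n = 200_000
Z = rng.standard_normal((n, 3))
X = np.column_stack([Z[:, 0], 0.5 * Z[:, 0] + Z[:, 1]])   # X1, X2
Y = np.column_stack([Z[:, 1] - 0.3 * Z[:, 2]])             # Y1

a = np.array([2.0, -1.0])   # coefficients a_i
b = np.array([0.5])         # coefficients b_j

# Left hand side: covariance of the two linear combinations.
lhs = np.cov(X @ a, Y @ b)[0, 1]

# Right hand side: double sum of a_i * b_j * C[X_i, Y_j].
C = np.array([[np.cov(X[:, i], Y[:, j])[0, 1] for j in range(Y.shape[1])]
              for i in range(X.shape[1])])
rhs = a @ C @ b

print(lhs, rhs)   # the two numbers agree up to sampling error
```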

Remember that while two independent random variables X and Y are always uncorrelated, i.e., C[X, Y] = 0, the reverse does not necessarily hold: even if C[X, Y] = 0, there can be a strong dependence between X and Y. We finish the section with some examples of covariance calculations.

Example 1.2. Assume X_1 and X_2 to be independent random variables and define a new variable Z = X_1 − 2X_2. The variance of Z is then, since C[X_1, X_2] = 0,

V[Z] = V[X_1 − 2X_2] = C[X_1 − 2X_2, X_1 − 2X_2]
= C[X_1, X_1] − 2C[X_1, X_2] − 2C[X_2, X_1] + 4C[X_2, X_2] = V[X_1] + 4V[X_2].

We also calculate the variance for the variable Y = X_1 − 3:

V[Y] = V[X_1 − 3] = C[X_1 − 3, X_1 − 3]
= C[X_1, X_1] − C[X_1, 3] − C[3, X_1] + C[3, 3] = V[X_1],

i.e., the same as for X_1, which is clear as the difference between X_1 and Y is only a constant value, which does not affect the variance of the variable. 2

Example 1.3. From a sequence {U_t} of independent random variables with mean zero and variance σ^2, we construct a new process {X_t} by

X_t = U_t + 0.5 · U_{t−1}.

This is a “moving average” process, which is the topic of Chapter 2. By means of Theorem 1.1, we can calculate its mean value and covariance function. Of course, m(t) = E[X_t] = E[U_t + 0.5 · U_{t−1}] = 0. For the covariance function, we have to work harder, and to keep computations under control, we do separate calculations according to the size of t − s. First, take s = t,

r(t, t) = V[X_t] = V[U_t + 0.5 · U_{t−1}]
= V[U_t] + 0.5 · C[U_{t−1}, U_t] + 0.5 · C[U_t, U_{t−1}] + 0.5^2 · V[U_{t−1}]
= σ^2 + 0 + 0 + 0.25σ^2 = 1.25σ^2,

where we used that V[U_t] = V[U_{t−1}] = σ^2, and that U_t and U_{t−1} are independent, so C[U_t, U_{t−1}] = C[U_{t−1}, U_t] = 0. For s = t + 1 we get

r(s, t) = C[U_{t+1} + 0.5 · U_t, U_t + 0.5 · U_{t−1}]
= C[U_{t+1}, U_t] + 0.5 · C[U_{t+1}, U_{t−1}] + 0.5 · C[U_t, U_t] + 0.5^2 · C[U_t, U_{t−1}]
= 0 + 0 + 0.5 · V[U_t] + 0
= 0.5σ^2.

The case s = t − 1 gives the same result, and for s ≥ t + 2 or s ≤ t − 2, one easily finds that r(s, t) = 0. Process values X(s) and X(t), with time separation |s − t| greater than 1, are therefore uncorrelated (they are even independent). All moving average processes share the common property that they have a finite correlation time: after some time lag the correlation is exactly zero. 2
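The covariances derived in Example 1.3 are easy to verify by simulation. The following Python sketch (added here; σ, the sample size and the seed are arbitrary) estimates r(0), r(1) and r(2) for X_t = U_t + 0.5·U_{t−1} and compares them with 1.25σ^2, 0.5σ^2 and 0.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0
n = 100_000

U = rng.normal(0.0, sigma, size=n + 1)
X = U[1:] + 0.5 * U[:-1]            # X_t = U_t + 0.5 U_{t-1}

def cov_est(x, tau):
    """Sample covariance at lag tau (division by n, cf. Section 1.5.3)."""
    x = x - x.mean()
    return np.sum(x[:len(x) - tau] * x[tau:]) / len(x)

for tau, theory in [(0, 1.25 * sigma**2), (1, 0.5 * sigma**2), (2, 0.0)]:
    print(tau, cov_est(X, tau), theory)
```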

Example 1.4. (“Independent, stationary increments”) Processes with independent, stationary increments (see Section ??, page ??) have particularly simple expectation and covariance functions, as we shall now see. Let the process {X(t), t ≥ 0} start at X(0) = 0 and have independent, stationary increments, with finite variance. Thus, the distribution of the change X(s + t) − X(s) over an interval (s, s + t] only depends on the interval length t and not on its location. In particular, X(s + t) − X(s) has the same distribution as X(t) = X(t) − X(0), and also the same mean m(t) and variance v(t). We first show that both m(t) and v(t) are proportional to the interval length t. Since E[X(s + t) − X(s)] = E[X(t)], one has

m(s + t) = E[X(s + t)] = E[X(s)] + E[X(s + t) − X(s)] = m(s) + m(t),

which means that for s, t ≥ 0, the mean function is a solution to the equation m(s + t) = m(s) + m(t), which is known as Cauchy's functional equation. If we now look only for continuous solutions to the equation, it is easy to argue that m(t) is of the form (note: m(0) = 0),

m(t) = E[X(t)] = k_1 · t, t ≥ 0,

for some constant k_1 = m(1). (The reader could prove this by first taking t = 1/q, with integer q, and then t = p/q for integer p.) The variance has a similar form, which follows from the independence and stationarity of the increments. For s, t ≥ 0, we write X(s + t) as the sum of two increments, Y = X(s) = X(s) − X(0), and Z = X(s + t) − X(s). Then, Y has variance V[Y] = V[X(s)] = v(s), by definition. The second increment is over an interval with length t, and since the distribution of an increment only depends on the interval length and not on its location, Z has the same distribution as X(t), and hence V[Z] = V[X(t)] = v(t). Thus,

v(s + t)= V[X(s + t)] = V[Y + Z]= V[Y ]+ V[Z]= v(s)+ v(t).

As before, the only continuous solution is

v(t) = V[X(t)] = k_2 · t   (t ≥ 0)

for some constant k_2 = V[X(1)] ≥ 0. Thus, we have shown that both the mean and variance functions are proportional to the interval length.

Finally, we turn to the covariance function. First take the case s ≤ t. Then, we can split X(t) as the sum of X(s) and the increment from s to t, and get

r(s, t) = C[X(s), X(t)] = C[X(s), X(s) + (X(t) − X(s))]
= C[X(s), X(s)] + 0 = V[X(s)] = k_2 · s.

For s > t, we just interchange t and s, and realize that it is the minimum of the two times that determines the covariance: r(s, t) = k_2 · t. Summing up: the covariance function for a process with stationary, independent increments, starting with X(0) = 0, is

r(s, t) = V[X(1)] · min(s, t).

Besides the Poisson process, we will meet the Wiener process as an important example of a process of this type; see Section ??. 2
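As an added illustration (not part of the text), the formula r(s, t) = V[X(1)]·min(s, t) can be checked by simulating many realizations of a discrete-time random walk, which has independent, stationary increments and starts at 0; the step variance and the two time points below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n_real, n_time = 20_000, 50
step_var = 0.8                       # V[X(1)] for this walk

# Each row is one realization of X(t) = sum of i.i.d. increments, X(0) = 0.
steps = rng.normal(0.0, np.sqrt(step_var), size=(n_real, n_time))
X = np.cumsum(steps, axis=1)

s, t = 10, 30                        # compare r(s, t) with step_var * min(s, t)
emp = np.cov(X[:, s - 1], X[:, t - 1])[0, 1]   # column 0 is time 1
print(emp, step_var * min(s, t))
```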

Remark 1.1. We have defined the second-moment function as b(s, t) = E[X(s)X(t)], as distinguished from the covariance function, r(s, t) = b(s, t) − m(s)m(t). In the signal processing literature, b(s, t) is often called the auto-correlation function. Since in statistics the technical term correlation is reserved for the normalized covariance, we will not use that terminology. In daily language, correlation means just “co-variation”. A practical drawback with the second moment function is that the definition does not correct for a non-zero mean value. For signals with non-zero mean, b(s, t) may be dominated by the mean value product, hiding the interesting dependence structure. The covariance function is always equal to the second moment function for the mean value corrected series. 2

1.2.3 Interpretation of moments and moment functions

Expectation and the mean function

The mean function m(t) of a stochastic process {X(t), t ∈ T} is defined as the expected value of X(t) as a function of time. From the law of large numbers in probability theory we know the precise meaning of this statement: for many independent repetitions of the experiment, i.e., many independent observations of the random variable X(t), the arithmetic mean (i.e., the average) of the observations tends to be close to m(t). But as its name suggests, one would also like to interpret it in another way: the mean function should say something about the average of the realization x(t), over time. The relation between these two meanings of the word “average” is one of the subtle difficulties in the applications of stochastic process theory – later we shall try to throw some light upon the problem when we discuss how to estimate the mean value function, and introduce the concept of ergodicity in Section 1.5.2.


Figure 1.3: Measurements of EEG from two different channels (PZ and CZ), and a scatter plot of the two signals.

Correlation and covariance function

The second order moment functions, the covariance and correlation functions, measure a degree of correlation between process values at different times, r(s, t) = C[X(s), X(t)], ρ(s, t) = ρ[X(s), X(t)]. First, we discuss the correlation coefficient, ρ = ρ[X, Y], between two random variables X and Y with positive variance:

ρ = C[X, Y] / \sqrt{V[X] V[Y]} = E[(X − m_X)(Y − m_Y)] / \sqrt{V[X] V[Y]}.

The correlation coefficient is a dimensionless constant that remains unchanged after a change of scale: for constants a > 0, c > 0, b, d,

ρ(aX + b,cY + d)= ρ(X,Y ).

Further, it is easy to see that it is always bounded to be between −1 and +1. To see this, use Theorem 1.1 to find the variance of X − λY for the special choice λ = C[X, Y]/V[Y] = ρ \sqrt{V[X]/V[Y]}. Since a variance is always non-negative, we have

0 ≤ V[X − λY] = V[X] − 2λC[X, Y] + λ^2 V[Y] = V[X](1 − ρ^2),   (1.1)

which is possible only if −1 ≤ ρ ≤ 1.

Example 1.5. (“ElectroEncephaloGram, EEG”) The ElectroEncephaloGram is the graphic representation of spontaneous brain activity measured with electrodes attached to the scalp. Usually, EEG is measured from several channels at different positions of the head. Channels at nearby positions will be heavily correlated, which can be seen in Figure 1.3, where the curves have a very similar appearance. They are however not exactly the same, and using the samples as observations of two different stochastic processes, an estimate of the correlation coefficient between X(t) and

Y(t) will be ρ* ≈ 0.9, i.e., rather close to one. Figure 1.3 also shows a scatter plot of the two signals, where the strong correlation is seen as the samples are distributed close to a straight line. 2

The covariance E[(X − m_X)(Y − m_Y)] measures the degree of linear covariation. If there is a tendency for observations of X and Y to be either both large or both small, compared to their expected values, then the product (X − m_X)(Y − m_Y) is more often positive than negative, and the correlation is positive. If, on the other hand, large values of X often occur together with small values of Y, and vice versa, then the product is more often negative than positive, and the correlation is negative. From (1.1) we see that if the correlation coefficient is +1 or −1, there is an exact linear relation between X and Y, in the sense that there are constants a, b such that V[X − bY] = 0, which means that X − bY is a constant, i.e., P(X = a + bY) = 1. Repeated observations of the pair (X, Y) would then fall on a straight line. The closer the correlation is to ±1, the closer to a straight line are the observations.1 Figure 1.4 shows scatter plots of observations of two-dimensional normal variables with different degrees of correlation. As seen in the figure, there is quite a scatter around a straight line even with a correlation as high as 0.9.


Figure 1.4: Observations of two-dimensional normal variables (X, Y) with E[X] = E[Y] = 0 and V[X] = V[Y] = 1, for the correlation coefficients ρ = 0, 0.9, 0.5, and −0.5.

Now back to the interpretation of the covariance function and its scaled version, the correlation function, of a stochastic process.

1 Note, however, that the correlation coefficient only measures the degree of linear dependence, and one can easily construct an example with perfect dependence even though the variables are uncorrelated; take for example Y = X^2 with X ∈ N(0, 1).


Figure 1.5: Realizations of normal sequences with m_t = 0 and r(s, t) = ρ^{|s−t|}.

If the correlation function ρ(s, t) attains a value close to 1 for some arguments s and t, then the realizations x(s) and x(t) will vary together. If, on the other hand, the correlation is close to −1, the covariation is still strong, but goes in the opposite direction.

Example 1.6. This example illustrates how the sign of the correlation function is reflected in the variation of a stochastic process. Figure 1.5 shows realizations of a sequence {X_t} of normal random variables with m_t = 0 and r(t, t) = V[X_t] = 1, and the covariance function (= correlation function since the variance is 1) r(s, t) = ρ^{|s−t|}, for four different ρ-values. The realization with ρ = 0.9 shows rather strong correlation between neighboring observations, which becomes a little less obvious with ρ = 0.5. For ρ = −0.5 the correlation between observations next to each other, i.e., |s − t| = 1, is negative, and this is reflected in the alternating signs in the realization. Figure 1.6 illustrates the same thing in a different way. Pairs of observations (X_t, X_{t+k}), for t = 1, 2, ..., n, are plotted for three different time lags, k = 1, k = 2, k = 5. For ρ = 0.9 the correlation is always positive, but becomes weaker with increasing distance. For ρ = −0.9 the correlation is negative when the distance is odd, and positive when it is even, becoming weaker with increasing distance. 2
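Realizations like those in Figures 1.5 and 1.6 can be generated with a simple recursion: if V[X_t] = 1 and X_{t+1} = ρX_t + \sqrt{1 − ρ^2} e_{t+1} with independent standard normal e_t, then r(s, t) = ρ^{|s−t|}. The sketch below (added here, with arbitrary seed and length) generates such a sequence and compares a few estimated lag correlations with ρ^k.

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate(rho, n):
    """Gaussian sequence with m = 0, V[X_t] = 1 and r(s, t) = rho**|s - t|."""
    x = np.empty(n)
    x[0] = rng.standard_normal()
    for t in range(1, n):
        x[t] = rho * x[t - 1] + np.sqrt(1.0 - rho**2) * rng.standard_normal()
    return x

x = simulate(0.9, 50_000)
for k in (1, 2, 5):
    est = np.corrcoef(x[:-k], x[k:])[0, 1]
    print(k, est, 0.9**k)            # estimated vs. theoretical correlation
```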

1.3 Stationary processes

There is an unlimited number of ways to generate dependence between X(s) and X(t) in a stochastic process, and it is necessary to impose some restrictions and further assumptions if one wants to derive any useful and general properties about the process. There are three main assumptions that make the dependence manageable.


Figure 1.6: Scatter plots of (X_t, X_{t+k}) for different k-values and ρ = 0.9 and ρ = −0.9.

The first is the Markov principle, which says that the statistical distribution of what will happen between time s and time t depends on what happened up to time s only through the value at time s. This means that X(s) is a state variable that summarizes the history before time s. In Section 1.4, we will meet some applications of this concept. The second principle is a variation of the Markov principle, which assumes that the expected future change is 0, independently of how the process reached its present value. Processes with this property are called martingales, and they are central in stochastic calculus and financial statistics. The third principle that makes the dependence manageable is the stationarity principle, and that is the topic of this book. In everyday language, the word stationary indicates something that does not change with time, or stays permanently in its position. In statistics, it means that the statistical properties do not change. A random function is called “stationary” if the fluctuations have the same statistical distributions whenever one chooses to observe it. The word stationary is mostly used for processes in time. An alternative term is homogeneous, which can be used also for processes with a space parameter, for example a random surface. For a process to be stationary, all statistical properties have to be unchanged with time. This is a very strict requirement, and to make life simpler, one can often be content with a weaker condition, namely that the mean value and covariances do not change.

1.3.1 Strictly stationary processes

Definition 1.3 A stochastic process {X(t), t ∈ T} is called strictly stationary if its statistical distributions remain unchanged after a shift of the time scale.

Since the distributions of a stochastic process are defined by the finite-dimensional distribution functions,2 we can formulate an alternative definition of strict stationarity: If, for every n, every choice of times t_1, ..., t_n ∈ T, and every time lag τ such that t_i + τ ∈ T, the n-dimensional random vector (X(t_1 + τ), ..., X(t_n + τ)) has the same distribution as the vector (X(t_1), ..., X(t_n)), then the process {X(t), t ∈ T} is said to be strictly stationary.

If {X(t), t ∈ T} is strictly stationary, then the marginal distribution of X(t) is independent of t. Also the two-dimensional distributions of (X(t_1), X(t_2)) are independent of the absolute location of t_1 and t_2; only the distance t_1 − t_2 matters. As a consequence, the mean function m(t) is constant, and the covariance function r(s, t) is a function of t − s only, not of the absolute location of s and t. Also higher order moments, like the third order moment E[X(s)X(t)X(u)], remain unchanged if one adds a constant time shift to s, t, u, and so on for fourth, fifth order, etc.

1.3.2 Weakly stationary processes

There are very good reasons to study so called weakly stationary processes, where the first and second order moments are time invariant, i.e., where the mean is constant and the covariances only depend on the time distance. The two main reasons are, first of all, that weakly stationary Gaussian processes are automatically also strictly stationary, since their distributions are completely determined by mean values and covariances; see Section ?? and Chapter ??. Secondly, stochastic processes passed through linear filters are effectively handled by the first two moments of the input; see Chapters ??–??.

Definition 1.4 If the mean function m(t) is constant and the covariance function r(s, t) is everywhere finite and depends only on the time difference τ = t − s, then the process {X(t), t ∈ T} is called weakly stationary, or covariance stationary.

Note: When we say that a process is stationary, we mean that it is weakly stationary, if we do not explicitly say anything else. A process that is not stationary is called non-stationary. It is clear that every strictly stationary process with finite variance is also weakly stationary. If the process is normal and weakly stationary, then it is also strictly stationary, more on that in Chapter ??, but in general one cannot draw such a conclusion. For a stationary process, we write m for the constant mean value and make the following simplified definition and notation for the covariance function.

2 See Section ?? on Kolmogorov's existence theorem.

Definition 1.5 If {X(t), t ∈ T} is a weakly stationary process with mean m, the covariance and correlation functions3 are defined as

r(τ) = C[X(t), X(t + τ)] = E[(X(t) − m)(X(t + τ) − m)] = E[X(t)X(t + τ)] − m^2,
ρ(τ) = ρ[X(t), X(t + τ)] = r(τ)/r(0),

respectively. In particular, the variance is

r(0) = V[X(t)] = E[(X(t) − m)^2].

We will use the symbol τ to denote a time lag, as in the definition of r and ρ.

We have used the same notation for the covariance function for a stationary as well as for a non-stationary process. No confusion should arise from this – one argument for the stationary and two arguments for the general case. For a stationary process, r(s,s + τ)= r(τ). The mean and covariance function can tell us much about how process values are connected, but they fail to provide detailed information about the sample functions, as is seen in the next two examples.

Example 1.7. (“Random telegraph signal”) This extremely simple process jumps between two states, 0 and 1, according to the following rules. Let the signal X(t) start at time t = 0 with equal probability for the two states, i.e., P(X(0) = 0) = P(X(0) = 1) = 1/2, and let the switching times be decided by a Poisson process {Y(t), t ≥ 0} with intensity λ, independently of X(0). Then {X(t), t ≥ 0} is a weakly stationary process; in fact, it is also strictly stationary, but we don't show that. Let us calculate E[X(t)] and E[X(s)X(t)]. At time t, the signal is equal to

X(t) = (1/2) (1 − (−1)^{X(0)+Y(t)}),

since, for example, if X(0) = 0 and Y(t) is an even number, X(t) is back at 0, but if it has jumped an odd number of times, it is at 1. Since X(0) and Y(t) are independent,

E[X(t)] = E[(1/2)(1 − (−1)^{X(0)+Y(t)})] = 1/2 − (1/2) E[(−1)^{X(0)}] E[(−1)^{Y(t)}],

which is constant equal to 1/2, since E[(−1)^{X(0)}] = (1/2)(−1)^0 + (1/2)(−1)^1 = 0. As a byproduct, we get that P(X(t) = 0) = P(X(t) = 1) = E[X(t)] = 1/2. For E[X(s)X(t)], we observe that the product X(s)X(t) can be either 0 or 1, and it is 1 only when both X(s) and X(t) are 1. Therefore, for s < t,

E[X(s)X(t)] = P(X(s) = X(t) = 1) = P(X(s) = 1, X(t) − X(s) = 0).

3 Norbert Wiener introduced the term covariance function in his work on stochastic harmonic analysis in the nineteen twenties.

Now X(t) − X(s) = 0 only if there is an even number of jumps in (s, t], i.e., Y(t) − Y(s) is even. Using the independence between X(s) and the Poisson distributed increment Y(t) − Y(s), with expectation λ(t − s), we get for τ = t − s > 0,

E[X(s)X(t)] = P(X(s) = 1) · P(Y(t) − Y(s) is even)
= (1/2) P(Y(t) − Y(s) is even) = (1/2) \sum_{k=0,2,4,...} e^{−λτ} (λτ)^k / k!
= (1/4) e^{−λτ} {e^{λτ} + e^{−λτ}} = (1/4)(1 + e^{−2λτ}).

For 0 ≤ s ≤ t, we thus obtain the covariance function

r(τ) = E[X(s)X(s + τ)] − m^2 = (1/4) e^{−2λ|τ|}. 2
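A sketch (added, not from the book) of how the random telegraph signal can be simulated on a time grid and the covariance r(τ) = (1/4)e^{−2λ|τ|} checked empirically; λ, the grid spacing and the number of realizations are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(4)
lam, dt, n_time, n_real = 1.5, 0.05, 200, 50_000

# Poisson jump counts on the grid and a random start value X(0) in {0, 1}.
jumps = rng.poisson(lam * dt, size=(n_real, n_time)).cumsum(axis=1)
x0 = rng.integers(0, 2, size=(n_real, 1))
X = 0.5 * (1 - (-1.0) ** (x0 + jumps))        # X(t) on the grid

i, j = 40, 100                                # grid indices; tau = (j - i) * dt
tau = (j - i) * dt
emp = np.cov(X[:, i], X[:, j])[0, 1]
print(emp, 0.25 * np.exp(-2 * lam * tau))
```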

Example 1.8. Figure 1.7 shows realizations of two stationary processes with different distributions and quite different sample function behavior, but with exactly the same covariance function

r(τ) = σ^2 e^{−α|τ|}.


Figure 1.7: Realizations of two processes with the same covariance function, r(τ) = σ^2 e^{−α|τ|}: (a) random telegraph signal, (b) Gauss-Markov process.

The “random telegraph signal” in (a) jumps in a random fashion between the two levels 0 and 1, while the normal process in (b) is continuous, but rather irregular.

This normal process is called a Gauss-Markov process, and it will be studied more in Chapter ??, under the name of the Ornstein-Uhlenbeck process. 2

1.3.3 Important properties of the covariance function

All covariance functions share the following very important properties.

Theorem 1.2. If r(τ) is the covariance function for a stationary process {X(t), t ∈ T}, then

a. V[X(t)] = r(0) ≥ 0,
b. V[X(t + h) ± X(t)] = E[(X(t + h) ± X(t))^2] = 2(r(0) ± r(h)),
c. r(−τ) = r(τ),
d. |r(τ)| ≤ r(0),
e. if |r(τ)| = r(0) for some τ ≠ 0, then r is periodic,
f. if r(τ) is continuous for τ = 0, then r(τ) is continuous everywhere.

Proof: (a) is clear by definition.

(b) Take the variance of the variables X(t + h) + X(t) and X(t + h) − X(t):

V[X(t + h) ± X(t)] = V[X(t + h)] + V[X(t)] ± 2C[X(t), X(t + h)]
= r(0) + r(0) ± 2r(h) = 2(r(0) ± r(h)).

(c) The covariance is symmetric in the arguments, so

r(−τ) = C[X(t), X(t − τ)] = C[X(t − τ), X(t)] = r(t − (t − τ)) = r(τ).

(d) Since the variance of X(t + h) ± X(t) is non-negative regardless of the sign, part (b) gives that r(0) ± r(h) ≥ 0, and hence |r(h)| ≤ r(0).

(e) If r(τ) = r(0), part (b) gives that V[X(t + τ) − X(t)] = 0, which implies that X(t + τ) = X(t) for all t, so X(t) is periodic with period τ. If, on the other hand, r(τ) = −r(0), then V[X(t + τ) + X(t)] = 0, and X(t + τ) = −X(t) for all t, and X(t) is periodic with period 2τ. Finally, it is easy to see that if X(t) is periodic, then also r(t) is periodic.

(f) We consider the increment of the covariance function at t,

(r(t + h) − r(t))^2 = (C[X(0), X(t + h)] − C[X(0), X(t)])^2
= (C[X(0), X(t + h) − X(t)])^2,   (1.2)

where we used that C[U, V] − C[U, W] = C[U, V − W]. Further, covariances obey Schwarz' inequality,

(C[Y, Z])^2 ≤ V[Y] V[Z].

Applied to the right hand side in (1.2) this yields, according to (b),

(r(t + h) − r(t))^2 ≤ V[X(0)] · V[X(t + h) − X(t)] = 2r(0)(r(0) − r(h)).

If r(τ) is continuous for τ = 0, then the right hand side r(0) − r(h) → 0 as h → 0. Then also the left hand side (r(t + h) − r(t)) → 0, and hence r(τ) is continuous at τ = t. 2

The theorem is important since it restricts the class of functions that can be used as covariance functions for real stationary processes. For example, a covariance function must be symmetric and attain its maximum at τ = 0. It must also be continuous if it is continuous at the origin, which excludes for example the function in Figure 1.8. In Chapter ??, Theorem ??, we shall present a definite answer to the question, what functions can appear as covariance functions.


Figure 1.8: A function with a jump discontinuity for τ ≠ 0 cannot be a covariance function.

1.4 State variables

In Section 1.3 we mentioned the two main methods to model the dependence between X(s) and X(t) in a stochastic process: the Markov principle, and the stationarity principle. The two principles can be combined, and are often so when one seeks to model how a dynamical system evolves under the influence of external random forces or signals. A mathematical model of a dynamical system often consists of a set of state variables, which develop under the influence of a number of input signals, i.e., external factors that affect the system and may change with time. A set of variables may be called state variables for a dynamical system if

• they completely determine the state of the system at any given time t_0, and
• the state of the system at time t > t_0 is completely determined by the state variables at time t_0 and the input signals between t_0 and t.

The current values of the state variables for a dynamical system determine completely, together with the new input signals, the state variable values at any later time point. How the system arrived at its present state is of no interest. In the theory of dynamical systems, the exact relation between input signals and the state variables is often described by a differential equation, like (??). If the input signal is a stochastic process, it is not very meaningful to solve this equation, since the input is random and will be different from experiment to experiment. But there will always be reasons to determine the statistical properties of the system and its state variables, like the distribution, expected value, variance, covariances, etc., since these are defined by the physical properties of the system, and by the statistical properties of the input. We deal with these relations in Chapters ?? and ??.

1.5 Estimation of mean value and covariance function

So far, we have described the most simple characteristics of a stationary process, the mean value and covariance functions, and these were defined in a probabilistic way, as expectations that can be calculated from a distribution or probability density function. They were used to describe and predict what can be expected from a realization of the process. We did not pay much attention to how the model should be chosen. As always in statistics, one aim is data analysis, parameter estimation, and model fitting, in order to find a good model that can be used in practice for prediction, and for simulation and theoretical analysis. In this section, we discuss the properties of the natural estimators of the mean value and covariance function of a stationary process {X_n, n = 1, 2, ...} with discrete time, from a series of data x_1, ..., x_n. Before we start with the technical details, just a few words about how to think about parameter estimation, model fitting, and data analysis.

Formal parameter estimation: In the formal statistical setting, a model is postulated beforehand, and the data are assumed to be observations of random variables with distributions that are known, apart from some unknown parameters. The task is then to suggest a function of the observations that has good properties as an estimate of the unknown parameters. The model is not questioned, and the estimation procedure is evaluated from how close the parameter estimates are to the true parameter values.

Model identification: In this setting, one seeks to find which model, among many possible alternatives, is most likely to have produced the data. This includes parameter estimation as well as model choice. The procedure is evaluated from its ability to reproduce the observations and to predict future observations.

Model fitting: This is the most realistic situation in practice. No model is assumed to be “true”, and the task is to find a model, including suitable parameter values, that best reproduces observed data, and that well predicts the future.

In this course, we will stay with the first level, and the technical properties of estimation procedures, but the reader is reminded that this is a formal approach. No model should be regarded as true, unless there are some external reasons to assume some specific structure, for example derived from physical principles, and perhaps not even then!

1.5.1 Estimation of the mean value function

Let {X_n, n = 1, 2, ...} be a weakly stationary sequence with mean value m and with (known or unknown) covariance function r(τ). The process need not be strictly stationary, and we make no specific assumption about its distribution, apart from the constant, but unknown, mean value m. Let x_1, ..., x_n be observations of X_1, ..., X_n.

Theorem 1.3. (a) The arithmetic mean of the observations, m*_n = (1/n) \sum_{t=1}^{n} x_t, is an unbiased estimator of the mean value m of the process, i.e., E[m*_n] = m, regardless of the distribution.

(b) If the infinite series \sum_{t=0}^{∞} r(t) is convergent, then the asymptotic variance of m*_n is given by

lim_{n→∞} n V[m*_n] = \sum_{t=−∞}^{∞} r(t) = r(0) + 2 \sum_{t=1}^{∞} r(t),   (1.3)

which means that, for large n, the variance of the mean value estimator is V[m*_n] ≈ (1/n) \sum_t r(t).

(c) Under the condition in (b), m*_n is a consistent estimator of m as n → ∞, in the sense that E[(m*_n − m)^2] → 0, and also P(|m*_n − m| > ε) → 0, for all ε > 0.

Proof: (a) The expectation of a sum of random variables is equal to the sum of the expectations; this is true regardless of whether the variables are dependent or not. Therefore, E[m*_n] = (1/n) \sum_{t=1}^{n} E[X(t)] = m.

(b) We calculate the variance by means of Theorem 1.1. From the theorem,

V[m*_n] = V[(1/n) \sum_{t=1}^{n} X_t] = (1/n^2) \sum_{s=1}^{n} \sum_{t=1}^{n} C[X_s, X_t] = (1/n^2) \sum_{s=1}^{n} \sum_{t=1}^{n} r(s − t),

where we sum along the diagonals to collect all the n − |u| terms with s − t = u, to get V[m*_n] = (1/n^2) \sum_{u=−n+1}^{n−1} (n − |u|) r(u). Hence,

n V[m*_n] = −r(0) + (2/n) \sum_{u=0}^{n−1} (n − u) r(u).   (1.4)

Now, if \sum_{t=0}^{∞} r(t) is convergent, S_n = \sum_{t=0}^{n−1} r(t) → \sum_{t=0}^{∞} r(t) = S, say, which implies that (1/n) \sum_{k=1}^{n} S_k → S, as n → ∞. (If x_n → x, then also (1/n) \sum_{k=1}^{n} x_k → x.) Thus,

(1/n) \sum_{u=0}^{n−1} (n − u) r(u) = (1/n) \sum_{k=1}^{n} S_k → S,

and this, together with (1.4), gives the result, i.e., n V[m*_n] → −r(0) + 2 \sum_{u=0}^{∞} r(u) = \sum_{u=−∞}^{∞} r(u) = r(0) + 2 \sum_{u=1}^{∞} r(u).

(c) The first statement follows from (a) and (b), since E[(m*_n − m)^2] = V[m*_n] + (E[m*_n − m])^2 → 0. The second statement is a direct consequence of Chebysjev's inequality,4

P(|m*_n − m| > ε) ≤ E[(m*_n − m)^2] / ε^2.   2
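As an added numerical illustration of Theorem 1.3(b): for a sequence with r(τ) = ρ^{|τ|}, the sum \sum_t r(t) equals (1 + ρ)/(1 − ρ), so V[m*_n] ≈ (1 + ρ)/((1 − ρ)n). The sketch below (arbitrary ρ, n and seed) compares this with the variance of the sample mean over many simulated realizations.

```python
import numpy as np

rng = np.random.default_rng(5)
rho, n, n_real = 0.9, 400, 20_000

# Simulate many independent realizations of a sequence with r(tau) = rho**|tau|.
e = rng.standard_normal((n_real, n))
X = np.empty((n_real, n))
X[:, 0] = e[:, 0]
for t in range(1, n):
    X[:, t] = rho * X[:, t - 1] + np.sqrt(1 - rho**2) * e[:, t]

var_mean = X.mean(axis=1).var()                 # observed V[m*_n]
approx = (1 + rho) / (1 - rho) / n              # (1/n) * sum_t r(t)
print(var_mean, approx)
```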

The consequences of the theorem are extremely important for data analysis with dependent observations. A positive correlation between successive observations tends to increase the variance of the mean value estimate, and make it more uncertain. We will give two examples of this, first to demonstrate the difficulties with visual interpretation, and then to use the theorem for a more precise analysis.

Example 1.9. (“Dependence can deceive the eye”) Figure 1.9 shows 40 realizations of 25 successive data points in two stationary processes with discrete time. In the upper diagram, the data are dependent and each successive value contains part of the previous one, X_{t+1} = 0.9 X_t + e_t, where the e_t are independent normal variables. (This is an AR(1)-process, an autoregressive process, which we will study in detail in Chapter 2.) In the lower diagram, all variables are independent normal variables, Y_t = e'_t. The variances are chosen so that V[X_t] = V[Y_t]. The solid thick line connects the average of the 25 points in each sample. One should note that the observations in the AR(1)-process within each sample are less spread out than those in the samples with independent data. Even if there is less variation in each sample, the observed average values are more variable for the AR(1)-process. The calculated standard deviation within the 40 dependent samples is 0.68 on the average, while the average is 0.99 for independent data. Therefore, if one had only one sample to analyze, one could be led to believe that a dependent sample would give better precision in the estimate of the overall mean level. But it is just the opposite; the more spread out sample gives better precision. 2

Example 1.10. (“How many data samples are necessary”) How many data points should be sampled from a stationary process in order to obtain a specified precision in an estimate of the mean value function? As we saw from the previous example, the answer depends on the covariance structure. Successive time series data taken from nature often exhibit a very simple type of dependence, and often the covariance function can be approximated by a geometrically decreasing function, r(τ) = σ^2 e^{−α|τ|}. Suppose that we want to estimate the mean value m of a stationary sequence {X_t}, and that we have reasons to believe that the covariance function is r(τ) = σ^2 e^{−|τ|}. We estimate m by the average m*_n = x̄ = (1/n) \sum_{k=1}^{n} x_k of observations of X_1, ..., X_n.

4 Pafnuty Lvovich Chebysjev, Russian mathematician, 1821-1894.


Figure 1.9: Upper diagram: 40 samples of 25 dependent AR(1)-variables, X_{t+1} = 0.9 X_t + e_t. Lower diagram: independent data Y_t with V[Y_t] = V[X_t] = 1. Solid lines connect the averages of the 25 data points in each series.

If the variables had been uncorrelated, the standard deviation of m*_n would have been σ/√n. We use Theorem 1.3, and calculate

\sum_{t=−∞}^{∞} r(t) = σ^2 \sum_{t=−∞}^{∞} e^{−|t|} = σ^2 (1 + 2 \sum_{t=1}^{∞} e^{−t})
= σ^2 (1 + 2 · (1/e)/(1 − 1/e)) = σ^2 (e + 1)/(e − 1),

so m*_n is unbiased and consistent. If n is large, the variance is

V[m*_n] ≈ (σ^2/n) · (e + 1)/(e − 1)

and the standard deviation

D[m*_n] ≈ (σ/√n) ((e + 1)/(e − 1))^{1/2} = 1.471 σ/√n.

We see that positively correlated data gives almost 50% larger standard deviation in the m-estimate than uncorrelated data. To compensate for this reduction in precision, it is necessary to measure over a longer time period; more precisely, one has to obtain 1.471^2 n ≈ 2.16 n measurements instead of n.

The constant α determines the decay of correlation. For a general exponential covariance function r(τ) = σ^2 e^{−α|τ|} = σ^2 θ^{|τ|}, the asymptotic standard deviation is, for large n,

D[m*_n] ≈ (σ/√n) ((e^α + 1)/(e^α − 1))^{1/2} = (σ/√n) ((1 + θ)/(1 − θ))^{1/2},   (1.5)

which can be quite large for θ near 1. As a rule of thumb, one may have to increase the number of observations by a factor (K^{1/τ_K} + 1)/(K^{1/τ_K} − 1), where τ_K is the time lag where the correlation is equal to 1/K. 2
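The bookkeeping in Example 1.10 and the rule of thumb following (1.5) fit in a few lines of Python (an added sketch; the helper names and the values of K and τ_K are ours).

```python
import numpy as np

def inflation_factor(theta):
    """Std-dev factor ((1 + theta)/(1 - theta))**0.5 relative to independent data, cf. (1.5)."""
    return np.sqrt((1 + theta) / (1 - theta))

def sample_size_factor(theta):
    """How many times more observations are needed for the same precision."""
    return (1 + theta) / (1 - theta)

# Example 1.10: theta = 1/e.
theta = 1 / np.e
print(inflation_factor(theta), sample_size_factor(theta))   # ~1.471 and ~2.16

# Rule of thumb: correlation 1/K at lag tau_K corresponds to theta = K**(-1/tau_K).
K, tau_K = 10, 5
theta_K = K ** (-1.0 / tau_K)
print(sample_size_factor(theta_K))   # = (K**(1/tau_K) + 1) / (K**(1/tau_K) - 1)
```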

Example 1.11. (“Oscillating data can decrease variance”) If the decay parameter θ in r(τ) = σ^2 θ^{|τ|} is negative, the observations oscillate around the mean value, and the variance of the observed average will be smaller than for independent data. The “errors” tend to compensate each other. With θ = −1/e, instead of θ = 1/e as in the previous example, the standard deviation of the observed mean is D[m*_n] ≈ (σ/√n) ((e − 1)/(e + 1))^{1/2} = 0.6798 σ/√n. 2

Example 1.12. (Confidence interval for the mean) If the process {X_t} in Example 1.10 is a Gaussian process, as in Section ??, the estimator m*_n = (1/n) \sum_{1}^{n} X_t has a normal distribution with expectation m and approximative standard deviation D[m*_n] ≈ 1.471 σ/√n, i.e., m*_n ∈ N(m, D[m*_n]^2). This means, for example, that

P(m − λ_{α/2} D[m*_n] ≤ m*_n ≤ m + λ_{α/2} D[m*_n]) = 1 − α,   (1.6)

where λ_{α/2} is a quantile in the standard normal distribution:

P(−λ_{α/2} ≤ Y ≤ λ_{α/2}) = 1 − α,

if Y ∈ N(0, 1). Rearranging the inequality, we can write (1.6) as

P(m*_n − λ_{α/2} D[m*_n] ≤ m ≤ m*_n + λ_{α/2} D[m*_n]) = 1 − α.

We thus obtain a confidence interval for m,

I_m : {m*_n − λ_{α/2} D[m*_n], m*_n + λ_{α/2} D[m*_n]},   (1.7)

with confidence level 1 − α. The interpretation is this: If the experiment "observe X_1, ..., X_n and calculate m*_n and the interval I_m according to (1.7)" is repeated many times, some of the so constructed intervals will be “correct” and cover the true m-value, and others will not. In the long run, the proportion of correct intervals will be 1 − α.

Suppose now that we have observed the first 100 values of X_t, and got the sum x_1 + ··· + x_100 = 34.9. An estimate of m is m*_n = 34.9/100 = 0.349, and a 95% confidence interval for m is

0.349 ± λ_{0.025} · 1.471/√100 = 0.349 ± 1.96 · 0.1471 = 0.349 ± 0.288.

(The confidence level shall be 0.95, i.e., α = 0.05, and λ_{0.025} = 1.96.) Thus, we found the 95% confidence interval for m to be (0.061, 0.637). Compare this with the interval constructed under the assumption of independent observations:

0.349 ± λ_{0.025} · 1/√100 = 0.349 ± 1.96 · 0.1 = 0.349 ± 0.2;

see Figure 1.9, which shows the increased variability in the observed average for positively dependent variables. Of course, oscillating data with negative one-step correlation will give smaller variability. The analysis was based on the normality assumption, but it is approximately valid also for moderately non-normal data. 2
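The interval in Example 1.12 can be reproduced directly from the numbers given there; in the added sketch below the normal quantile 1.96 is hard-coded rather than taken from a statistics library.

```python
import numpy as np

n = 100
x_sum = 34.9
m_est = x_sum / n                      # 0.349

lam = 1.96                             # lambda_{0.025}, standard normal quantile
d_dependent = 1.471 / np.sqrt(n)       # std dev of m*_n for r(tau) = exp(-|tau|)
d_independent = 1.0 / np.sqrt(n)       # std dev if the data had been uncorrelated

for d in (d_dependent, d_independent):
    print(m_est - lam * d, m_est + lam * d)
# (0.061, 0.637) with dependence, (0.153, 0.545) under independence
```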

1.5.2 Ergodicity

Ensemble average

The expectation m = E[X] of a random variable X is what one will get on the average, in a long series of independent observations of X, i.e., the long run average of the result when the experiment is repeated many times:

m*_n = (1/n) \sum_{k=1}^{n} x_k → m, when n → ∞.   (1.8)

Here x_1, ..., x_n are n independent observations of X. The expectation m is called the ensemble average of the experiment, and it is the average of the possible outcomes of the experiment, weighted by their probabilities.

Time average

If X_t is a variable in a time series, or more generally in a stochastic process, the expectation is a function of time, m(t) = E[X_t], but if the process is stationary, strictly or weakly, it is a constant, i.e., all variables have the same expectation. A natural question is therefore: Can one replace repeated observations of X_t at a fixed t with observations of the entire time series, and estimate the common expectation m by the time average m*_n = (1/n) \sum_{t=1}^{n} x_t, observed in one single realization of the process? The vector (x_1, ..., x_n) is here one observation of (X_1, ..., X_n). We know, from the law of large numbers, that the average of repeated measurements, (1.8), converges to the ensemble average m as n → ∞. But is the same true for the time average? Does it also converge to m? The answer is yes, for some processes. These processes are called ergodic, or more precisely linearly ergodic. A linearly ergodic stationary sequence is a stationary process {X_n, n = 1, 2, ...} where the common mean value m (= the ensemble average) can be consistently estimated by the time average,

(x_1 + ··· + x_n)/n → m,

when x_1, ..., x_n are observations in one single realization of {X_t}. Theorem 1.3(c) gives a sufficient condition for a stationary sequence to be linearly ergodic: \sum_{0}^{∞} r(τ) < ∞. If the process is linearly ergodic, one can estimate the expectation of any linear function aX_t + b of X_t by the corresponding time average (1/n) \sum_{t=1}^{n} (a x_t + b). This explains the name, “linearly” ergodic.

Example 1.13. The essence of an ergodic phenomenon or process is that everything that can conceivably happen in repeated experiments also happens already in one single realization, if it is extended indefinitely. We give an example of a non-ergodic experiment, to clarify the meaning. What is the meaning of “the average earth temperature”? A statistical interpretation could be the expectation of a random experiment where the temperature was measured at a randomly chosen location. Choosing many locations at random at the same time would give a consistent estimate of the average temperature at the chosen time. The procedure gives all temperatures present on the earth at the time a chance to contribute to the estimate. However, the average may well change with time of the year, sun activity, volcanic activity, etc. Another type of average is obtained from stationary measurements over time. A long series of temperature data from Lund, spread out over the year, would give a good estimate of the average temperature in Lund, disregarding “global warming”, of course. The procedure gives all temperatures that occur in Lund over the year a chance to contribute to the estimate. The temperature process in Lund is probably linearly ergodic, but as a model for the earth temperature it is not ergodic. An automatic procedure to get an estimate of the average earth temperature would be to start a randomly moving measuring device (see Section ??) that criss-crosses the earth surface, continuously measuring the temperature. 2

1.5.3 Estimating the covariance function

The covariance function of a stationary sequence {X_n} is r(τ) = E[(X_t − m)(X_{t+τ} − m)] = E[X_t X_{t+τ}] − m^2, and we assume, to begin with, that m is known.

Theorem 1.4. The estimator

r*_n(τ) = (1/n) \sum_{t=1}^{n−τ} (x(t) − m)(x(t + τ) − m)   (τ ≥ 0)   (1.9)

is asymptotically unbiased, i.e., E[r*_n(τ)] → r(τ) when n → ∞.

Proof:

E[r*_n(τ)] = (1/n) \sum_{t=1}^{n−τ} E[(X_t − m)(X_{t+τ} − m)]
= (1/n) \sum_{t=1}^{n−τ} r(τ) = (1/n)(n − τ) r(τ) → r(τ) when n → ∞.

2

Remark 1.2. Since r*_n(τ) is asymptotically unbiased, it is consistent as soon as its variance goes to 0 when n → ∞. Theorem 1.3 applied to the process Y_t = (X_t − m)(X_{t+τ} − m) will yield the result, since Y_t has expectation r(τ). But to use the theorem, we must calculate the covariances between Y_s and Y_t, i.e.,

C[(X_s − m)(X_{s+τ} − m), (X_t − m)(X_{t+τ} − m)],

and that will require some knowledge of the fourth moments of the X-process. We deal with this problem in Theorem 1.5. 2

Remark 1.3. Why divide by n instead of n − τ? The estimator r*_n(τ) is only asymptotically unbiased for large n. There are many good reasons to use the biased form in (1.9), with the n in the denominator, instead of n − τ. The most important reason is that r*_n(τ) has all the properties of a true covariance function; in Theorem ?? in Chapter ??, we shall see what these are. Furthermore, division by n gives smaller mean square error, i.e.,

E[(r*_n(τ) − r(τ))^2]

is smaller with n in the denominator than with n − τ. 2
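A direct implementation of the estimator (1.9), with division by n as recommended in Remark 1.3, might look as follows (an added sketch; the function name is ours, and the MA(1)-data used for the check come from Example 1.3 with σ = 1).

```python
import numpy as np

def cov_estimate(x, max_lag, mean=None):
    """Sample covariance function r*_n(tau), tau = 0..max_lag, dividing by n.

    If mean is None the total average of the data is subtracted, as at the
    end of Section 1.5.3; otherwise the known mean is used, as in (1.9).
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    x = x - (x.mean() if mean is None else mean)
    return np.array([np.sum(x[:n - tau] * x[tau:]) / n for tau in range(max_lag + 1)])

# Example: the MA(1)-process of Example 1.3 with sigma = 1.
rng = np.random.default_rng(6)
U = rng.standard_normal(10_001)
X = U[1:] + 0.5 * U[:-1]
print(cov_estimate(X, 3))   # approximately [1.25, 0.5, 0.0, 0.0]
```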

Example 1.14. It can be hard to interpret an estimated covariance function, and it is easy to be misled by a visual inspection. It turns out that, for large n, both the variance V[r*_n(τ)] and the covariances C[r*_n(s), r*_n(t)] are of the same order, namely 1/n, and that the correlation coefficient

C[r*_n(s), r*_n(t)] / \sqrt{V[r*_n(s)] V[r*_n(t)]},   (1.10)

between two covariance function estimates will not go to 0 as n increases. There always remains some correlation between r*_n(s) and r*_n(t), which gives the covariance function estimate r*_n a regular, almost periodic shape, also when r(τ) is almost 0. This fact caused worries about the usefulness of sample covariance calculations, and spurred the interest for serious research on time series analysis in the nineteen forties. The exact expressions for the covariances of the covariance estimates are given in the following Theorem 1.5. The phenomenon is clear in Figure 1.10. We generated three realizations of an AR(2)-process and produced three estimates of its covariance function based on n = 128 observations each. Note, for example, that for τ ≈ 8 − 10, two of the covariance estimates are clearly positive and the third is negative, while the true covariance is almost zero; for more on the AR(2)-process, see Example 2.2 in Chapter 2. 2


Figure 1.10: Three estimates of the theoretical covariance function for an AR(2)-process X_t = X_{t−1} − 0.5X_{t−2} + e_t (n = 128 observations in each estimate). The true covariance function is bold.
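The experiment behind Figure 1.10 is easy to repeat (an added sketch, with our own helper functions and an arbitrary seed): simulate three independent realizations of X_t = X_{t−1} − 0.5X_{t−2} + e_t with n = 128, estimate the covariance function from each, and note how much the estimates differ.

```python
import numpy as np

rng = np.random.default_rng(7)
n, burn = 128, 500
a1, a2 = -1.0, 0.5                      # X_t + a1*X_{t-1} + a2*X_{t-2} = e_t

def simulate_ar2(n):
    e = rng.standard_normal(n + burn)
    x = np.zeros(n + burn)
    for t in range(2, n + burn):
        x[t] = -a1 * x[t - 1] - a2 * x[t - 2] + e[t]
    return x[burn:]                     # discard the burn-in, cf. Remark 2.1

def cov_estimate(x, max_lag):
    x = x - x.mean()
    return np.array([np.sum(x[:len(x) - k] * x[k:]) / len(x)
                     for k in range(max_lag + 1)])

for _ in range(3):
    est = cov_estimate(simulate_ar2(n), 30)
    print(np.round(est[:12], 2))        # the three estimates differ noticeably
```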

Theorem 1.5. (a) If {X_n, n = 1, 2, ...} is a stationary Gaussian process with mean m and covariance function r(τ), such that \sum_{0}^{∞} r(t)^2 < ∞, then r*_n(τ) defined by (1.9) is a consistent estimator of r(τ).

(b) Under the same condition, for s and t = s + τ,

n C[r*_n(s), r*_n(t)] → \sum_{u=−∞}^{∞} {r(u) r(u + τ) + r(u − s) r(u + t)},   (1.11)

when n → ∞.

(c) If X_n = \sum_{k=−∞}^{∞} c_k e_{n−k} is an infinite moving average (with \sum_{k=−∞}^{∞} |c_k| < ∞) of independent, identically distributed random variables {e_k} with E[e_k] = 0, V[e_k] = σ^2, and E[e_k^4] = η σ^4 < ∞, then the conclusion of (b) still holds, when the right hand side in (1.11) is replaced by

(η − 3) r(s) r(t) + \sum_{u=−∞}^{∞} {r(u) r(u + τ) + r(u − s) r(u + t)}.

Note that η = 3 when the e_k are Gaussian.

(d) Under the conditions in (a) or (c), the estimates are asymptotically normal when n → ∞.

Proof: We prove (b). Part (a) follows from (b) and Theorem 1.4. For part (c) and (d), we refer to [4, Chap. 7]. We can assume m = 0, and compute

n C[r*_n(s), r*_n(t)] = (1/n) C[\sum_{j=1}^{n−s} X_j X_{j+s}, \sum_{k=1}^{n−t} X_k X_{k+t}]
= (1/n) {\sum_{j=1}^{n−s} \sum_{k=1}^{n−t} E[X_j X_{j+s} X_k X_{k+t}] − \sum_{j=1}^{n−s} E[X_j X_{j+s}] · \sum_{k=1}^{n−t} E[X_k X_{k+t}]}.   (1.12)

Now, it is a nice property of the normal distribution, known as Isserlis' theorem, that the higher product moments can be expressed in terms of the covariances. In this case,

E[X_j X_{j+s} X_k X_{k+t}] = E[X_j X_{j+s}] E[X_k X_{k+t}] + E[X_j X_k] E[X_{j+s} X_{k+t}] + E[X_j X_{k+t}] E[X_{j+s} X_k].

Collecting terms with k − j = u and summing over u, the normed covariance (1.12) can, for τ ≥ 0, be written as

\sum_{u=−n+s+1}^{n−t−1} (1 − a(u)/n) {r_u r_{u+t−s} + r_{u−s} r_{u+t}} → \sum_{u=−∞}^{∞} {r(u) r(u + τ) + r(u − s) r(u + t)},

when n → ∞, where

a(u) = |t + u|, for u < 0,
a(u) = t, for 0 ≤ u ≤ τ,
a(u) = |s + u|, for u > τ.

The convergence holds under the condition that \sum_{0}^{∞} r(t)^2 is finite; cf. the proof of Theorem 1.3(b). 2

If the mean value m is unknown, one just subtracts an estimate, and uses the estimate

r*_n(τ) = (1/n) \sum_{t=1}^{n−τ} (x_t − x̄)(x_{t+τ} − x̄),

where x̄ is the total average of the n observations x_1, ..., x_n. The conclusions of Theorem 1.5 remain true.

Example 1.15. (“Interest rates”) Figure 1.11 shows the monthly interest rates for U.S. “1-year Treasury constant maturity” over the 19 years 1990-2008. There seems to be a cyclic variation around a downward trend. If we remove the linear trend by subtracting a line, we get a series of data that we regard as a realization of a stationary sequence for which we estimate the covariance function. 2


Figure 1.11: Interest rate with linear trend, the residuals = variation around the trend line, and the estimated covariance function for the residuals.

Testing zero correlation

In time series analysis, it is often important to make a statistical test to see if the variables in a stationary sequence are uncorrelated. Then, one can estimate the correlation function ρ(τ) = r(τ)/r(0) by

ρ*(τ) = r*_n(τ) / r*_n(0),

based on n sequential observations. If ρ(τ) = 0, for τ = 1, 2, ..., p, then the Box-Ljung statistic,

Q = n(n + 2) \sum_{τ=1}^{p} ρ*(τ)^2 / (n − τ),

has an approximative χ^2-distribution with p degrees of freedom, if n is large. This means that if Q > χ^2_α(p), one can reject the hypothesis of zero correlation. (χ^2_α(p) is the upper α-quantile of the χ^2-distribution function, with p degrees of freedom.) The somewhat simpler Box-Pierce statistic,

Q = n \sum_{τ=1}^{p} ρ*(τ)^2,

has the same asymptotic distribution, but requires a larger sample size to work properly.
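A minimal sketch of the Box-Ljung test (added here; the function name is ours, and the χ^2-quantile is taken from scipy.stats, an extra dependency not mentioned in the text).

```python
import numpy as np
from scipy.stats import chi2   # only used for the chi^2 quantile

def box_ljung(x, p, alpha=0.05):
    """Box-Ljung test of rho(1) = ... = rho(p) = 0 for the sequence x."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    r = np.array([np.sum(x[:n - k] * x[k:]) / n for k in range(p + 1)])
    rho = r[1:] / r[0]
    Q = n * (n + 2) * np.sum(rho**2 / (n - np.arange(1, p + 1)))
    return Q, Q > chi2.ppf(1 - alpha, df=p)    # statistic and rejection decision

rng = np.random.default_rng(8)
print(box_ljung(rng.standard_normal(500), p=10))   # white noise: usually not rejected

x = rng.standard_normal(501)
x = x[1:] + 0.5 * x[:-1]                           # MA(1): correlated at lag 1
print(box_ljung(x, p=10))                          # usually rejected
```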

1.5.4 Ergodicity a second time

Linear ergodicity means that one can estimate the expectation (= ensemble average) of a stationary process by means of the observed time average in a single realization. We have also mentioned that under certain conditions, also the covariance function, which is the ensemble average of a cross product,

r(τ) = E[(X(t) − m)(X(t + τ) − m)],

can be consistently estimated from the corresponding time average. A process with this property could be called ergodic of second order. In a completely ergodic, or simply “ergodic”, process one can consistently estimate any expectation

E[g(X(t_1), ..., X(t_p))],

where g is an arbitrary function of a finite number of X(t)-variables, by the corresponding time average in a single realization,

(1/n) \sum_{t=1}^{n} g(x(t + t_1), ..., x(t + t_p)).

Chapter 2

ARMA-processes

This chapter deals with two of the oldest and most useful of all stationary process models, the autoregressive AR-model and the moving average MA-model. They form the basic elements in time series analysis of both stationary and non-stationary sequences, including model identification and parameter estimation. Predictions can be made in an algorithmically simple way in these models, and they can be used in efficient Monte Carlo simulation of more general time series. Modeling a stochastic process by means of a spectral density gives a lot of flexibility, since every non-negative, symmetric, integrable function is possible. On the other hand, the dependence in the process can take many possible shapes, which makes estimation of the spectrum or covariance function difficult. Simplifying the covariance, by assuming some sort of independence in the process generation, is one way to get a more manageable model. The AR- and MA-models both contain such independent generators. The AR-models include feedback, for example coming from an automatic control system, and they can generate sequences with heavy dynamics. They are time-series versions of a Markov model. The MA-models are simpler, but less flexible, and they are used when correlations have a finite time span. In Section 2.1 we present the general covariance and spectral theory for AR- and MA-processes, and also for a combined model, the ARMA-model. Section 2.2 deals with parameter estimation in the AR-model, and Section 2.3 presents an introduction to prediction methods based on AR- and ARMA-models.

2.1 Auto-regression and Moving average: AR(p) and MA(q)

In this chapter, $\{e_t,\ t = 0, \pm 1, \ldots\}$ denotes white noise in discrete time, i.e., a sequence of uncorrelated random variables with mean 0 and variance $\sigma^2$,

$$E[e_t] = 0, \qquad C[e_s, e_t] = \begin{cases} \sigma^2 & \text{if } s = t, \\ 0 & \text{otherwise.} \end{cases}$$

The sequence $\{e_t\}$ is called the innovation process and its spectral density is constant, $R_e(f) = \sigma^2$ for $-1/2 < f \le 1/2$.

2.1.1 Autoregressive process, AR(p)

An autoregressive process of order $p$, or shorter an AR(p)-process, is created by white noise passing through a feedback filter as in Figure 2.1.

[Block diagram of a feedback filter generating $X_t = -a_1 X_{t-1} - a_2 X_{t-2} - \cdots - a_p X_{t-p} + e_t$ from white noise $e_t$.]

Figure 2.1: AR(p)-process. (The operator $T^{-1}$ delays the signal one time unit.)
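To make the feedback generation in Figure 2.1 concrete, here is a minimal Python simulation sketch; it is our own illustration, and the helper name simulate_ar, the burn-in device, and the example coefficients are assumptions, not taken from the text.

import numpy as np


def simulate_ar(a, sigma, n, burn_in=500, seed=0):
    """Simulate an AR(p)-process X_t = -a_1 X_{t-1} - ... - a_p X_{t-p} + e_t.

    `a` holds the coefficients a_1, ..., a_p of a stable generating polynomial.
    A burn-in period is discarded so that the output is approximately stationary.
    """
    rng = np.random.default_rng(seed)
    p = len(a)
    e = rng.normal(0.0, sigma, size=n + burn_in)
    x = np.zeros(n + burn_in)
    for t in range(p, n + burn_in):
        past = x[t - p:t][::-1]            # x_{t-1}, ..., x_{t-p}
        x[t] = -np.dot(a, past) + e[t]
    return x[burn_in:]


# Example: the AR(2)-process X_t = X_{t-1} - 0.5 X_{t-2} + e_t, i.e. a_1 = -1, a_2 = 0.5.
x = simulate_ar(a=[-1.0, 0.5], sigma=1.0, n=1000)
print(x.mean(), x.var())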

An AR(p)-process is defined by its generating polynomial $A(z)$. Let $a_0 = 1, a_1, \ldots, a_p$ be real coefficients and define the polynomial

$$A(z) = a_0 + a_1 z + \cdots + a_p z^p,$$
in the complex variable $z$. The polynomial is called stable if the characteristic equation
$$z^p A(z^{-1}) = a_0 z^p + a_1 z^{p-1} + \cdots + a_p = 0$$
has all its roots inside the unit circle, or, equivalently, if all the zeros of the generating polynomial $A(z)$ lie outside the unit circle.

Definition 2.1 Let $A(z)$ be a stable polynomial of degree $p$. A stationary sequence $\{X_t\}$ is called an AR(p)-process with generating polynomial $A(z)$, if the sequence $\{e_t\}$, given by
$$X_t + a_1 X_{t-1} + \cdots + a_p X_{t-p} = e_t, \qquad (2.1)$$
is a white noise sequence with $E[e_t] = 0$, constant variance $V[e_t] = \sigma^2$, and $e_t$ uncorrelated with $X_{t-1}, X_{t-2}, \ldots$. The variables $e_t$ are the innovations to the AR-process. In a Gaussian stationary AR-process, also the innovations are Gaussian.

Equation (2.1) becomes more informative if written in the form

$$X_t = -a_1 X_{t-1} - \cdots - a_p X_{t-p} + e_t,$$
where one can see how new values are generated as linear combinations of old values plus a small uncorrelated innovation. Note that it is important that the innovation at time $t$ is uncorrelated with the process so far; it should be a real “innovation”, introducing something new to the process. Of course, $e_t$ is correlated with $X_t$ and all subsequent $X_s$ for $s \ge t$. Figure 2.2 illustrates how the innovation $e_t$ influences future values $X_s$.


Figure 2.2: In an AR-process the innovation $e_t$ influences all $X_s$ for $s \ge t$.

Remark 2.1. If $A(z)$ is a stable polynomial of degree $p$, and $\{e_t\}$ a sequence of independent normal random variables, $e_t \in N(0, \sigma^2)$, there always exists a Gaussian stationary AR(p)-process with $A(z)$ as its generating polynomial and $\{e_t\}$ as innovations. The filter equation $X_t + a_1 X_{t-1} + \cdots + a_p X_{t-p} = e_t$ gives the $X$-process as the solution to a linear difference equation with right hand side equal to $\{e_t\}$. If the process was started a very long time ago, $T \approx -\infty$, the solution is approximately independent of the initial values. □

Theorem 2.1. If $\{X_t\}$ is an AR(p)-process, with generating polynomial $A(z)$ and innovation variance $\sigma^2$, then $m_X = E[X_t] = 0$, and the covariance function $r_X$ is the solution of the Yule-Walker equations,
$$r_X(k) + a_1 r_X(k-1) + \cdots + a_p r_X(k-p) = 0, \quad k = 1, 2, \ldots \qquad (2.2)$$
with initial condition
$$r_X(0) + a_1 r_X(1) + \cdots + a_p r_X(p) = \sigma^2. \qquad (2.3)$$
The general solution to (2.2) is of the form $r_X(\tau) = \sum_{k=1}^{p} C_k r_k^{\tau}$, where $r_k$, $k = 1, 2, \ldots, p$, with $|r_k| < 1$, are the roots of the characteristic equation, or modifications thereof if there are multiple roots.

Proof: The filter equation (2.1) is used to define the AR-process from the innovations, and Figure 2.1 illustrates how $X_t$ is obtained as a filtration of the $e_t$-sequence. Taking expectations in (2.1), we find that $m_e = E[e_t]$ and $m_X = E[X_t]$ satisfy the equation
$$m_X + a_1 m_X + \cdots + a_p m_X = m_e,$$
i.e., $m_X A(1) = m_e = 0$, and since $A(1) \neq 0$, one has $m_X = 0$.

To show that $r_X(\tau)$ satisfies the Yule-Walker equations, we take covariances between $X_{t-k}$ and the variables on both sides of equation (2.1),

$$C[X_{t-k},\, X_t + a_1 X_{t-1} + \cdots + a_p X_{t-p}] = C[X_{t-k}, e_t].$$

Here the left hand side is equal to
$$r_X(k) + a_1 r_X(k-1) + \cdots + a_p r_X(k-p),$$
while the right hand side is equal to $0$ for $k = 1, 2, \ldots$, and equal to $\sigma^2$ for $k = 0$:

$$C[X_{t-k}, e_t] = \begin{cases} 0 & \text{for } k = 1, 2, \ldots \\ \sigma^2 & \text{for } k = 0. \end{cases}$$
This follows from the characteristics of an AR-process: for $k = 1, 2, \ldots$ the innovations $e_t$ are uncorrelated with $X_{t-k}$, while for $k = 0$ we have
$$C[X_t, e_t] = C[-a_1 X_{t-1} - \cdots - a_p X_{t-p} + e_t,\, e_t] = C[e_t, e_t] = \sigma^2,$$
by definition. □

The Yule-Walker equation (2.2) is a linear difference equation, which can be solved recursively. To find the initial values, one has to solve the system of $p+1$ linear equations,
$$\begin{cases} r_X(0) + a_1 r_X(-1) + \ldots + a_p r_X(-p) = \sigma^2, \\ r_X(1) + a_1 r_X(0) + \ldots + a_p r_X(-p+1) = 0, \\ \quad \vdots \\ r_X(p) + a_1 r_X(p-1) + \ldots + a_p r_X(0) = 0. \end{cases} \qquad (2.4)$$
Note that there are $p+1$ equations and $p+1$ unknowns, since $r_X(-k) = r_X(k)$.

Remark 2.2. The Yule-Walker equations are named after George Udny Yule, British statistician (1871-1951), and Sir Gilbert Thomas Walker, British physicist, climatologist, and statistician (1868-1958), who were the first to use AR-processes as models for natural phenomena. In the 1920s, Yule worked on time series analysis and suggested the AR(2)-process as an alternative to the Fourier method as a means to describe periodicities and explain correlation in the sunspot cycle; cf. the comment about A. Schuster and the periodogram in Chapter ??. Yule's analysis was published as: On a method of investigating periodicities in disturbed series, with special reference to Wolfer's sunspot numbers, 1927.

G.T. Walker was trained in physics and mathematics in Cambridge, but worked for 20 years as head of the Indian Meteorological Department, where he was concerned with the serious problem of monsoon forecasting. He shared Yule's scepticism about deterministic Fourier methods, and favored correlation and regression methods. He made systematic studies of air pressure variability at Darwin, Australia, and found that it exhibited a “quasi-periodic” behavior, with no single period, but rather a band of periods – we would say “continuous spectrum” – between $3\frac{1}{3}$ and 4 years. Walker extended Yule's AR(2)-model to an AR(p)-model, derived the general form of the Yule-Walker equations, and applied it to the Darwin pressure: On periodicity in series of related terms, 1931. His name is now attached to the Walker oscillation, part of the complex El Niño – Southern Oscillation phenomenon$^1$. □

$^1$ See R.W. Katz: Sir Gilbert Walker and a connection between El Niño and statistics, Statistical Science, 17 (2002), 97-112.

Remark 2.3. There are (at least) three good reasons to use AR-processes in time series modeling:

• Many series are actually generated in a feedback system,

• The AR-process is flexible, and by a smart choice of coefficients it can approximate most covariance and spectrum structures; parameter estimation is simple,

• They are easy to use in forecasting: suppose we want to predict, at time $t$, the future value $X_{t+1}$, knowing all $\ldots, X_{t-p+1}, \ldots, X_t$. The linear predictor

$$\widehat{X}_{t+1} = -a_1 X_t - a_2 X_{t-1} - \cdots - a_p X_{t-p+1},$$

is the best prediction of $X_{t+1}$ in mean square sense. For further arguments, see Sections 2.2 and 2.3. □

Example 2.1. (”AR(1)-process”) A process in discrete time with geometrically decaying covariance function is an AR(1)-process, and it can be generated by filtering white noise through a one-step feedback filter. With $\theta_1 = -a_1$, the recurrence equation
$$X_t + a_1 X_{t-1} = e_t, \quad \text{i.e.,} \quad X_t = \theta_1 X_{t-1} + e_t,$$
has a stationary process solution if $|a_1| < 1$. With innovation variance $V[e_t] = \sigma^2$, the initial values for the Yule-Walker equation $r_X(k+1) = -a_1 r_X(k)$ are found from the equation system (2.4),

$$r_X(0) + a_1 r_X(1) = \sigma^2,$$

$$r_X(1) + a_1 r_X(0) = 0,$$
which gives $V[X_t] = r_X(0) = \sigma^2/(1 - a_1^2)$, and the covariance function,
$$r(\tau) = \frac{\sigma^2}{1 - a_1^2}\,(-a_1)^{|\tau|} = \frac{\sigma^2}{1 - \theta_1^2}\,\theta_1^{|\tau|}. \qquad \Box$$
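As a quick numerical check of the AR(1) covariance formula (our own sketch, reusing the assumed simulate_ar helper from the earlier block; the parameter values are arbitrary assumptions):

import numpy as np

a1, sigma = -0.7, 1.0                  # theta_1 = -a1 = 0.7, |a1| < 1 so the process is stationary
theta = -a1

# Theoretical covariance r(tau) = sigma^2 / (1 - theta^2) * theta^|tau|
r_theory = sigma**2 / (1 - theta**2) * theta**np.arange(6)

# Sample covariance from a long simulated realization
x = simulate_ar(a=[a1], sigma=sigma, n=100_000)
xc = x - x.mean()
n = len(x)
r_sample = np.array([np.sum(xc[:n - k] * xc[k:]) / n for k in range(6)])

print(np.round(r_theory, 3))
print(np.round(r_sample, 3))           # should agree to within sampling error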

Example 2.2. (”AR(2)-process”) An AR(2)-process is a simple model for damped random oscillations with “quasi-periodicity”, i.e., a more or less vague periodicity,

$$X_t + a_1 X_{t-1} + a_2 X_{t-2} = e_t.$$
The condition for stability is that the coefficients lie inside the triangle

$$|a_2| < 1, \qquad |a_1| < 1 + a_2,$$
illustrated in Figure 2.3.


Figure 2.3: Stability region for the AR(2)-process. The parabola $a_2 = a_1^2/4$ is the boundary between a covariance function of type (2.8) with complex roots and type (2.6) with real roots.

To find the variance and initial value in the Yule-Walker equation (2.2), we re-arrange (2.4), with $r(-k) = r(k)$, for $k = 1, 2$:
$$\begin{cases} r(0) + a_1 r(1) + a_2 r(2) = \sigma^2, & \\ a_1 r(0) + (1 + a_2) r(1) = 0, & (k = 1), \\ a_2 r(0) + a_1 r(1) + r(2) = 0, & (k = 2), \end{cases}$$
leading to the variance and first order covariance,
$$V[X_t] = r(0) = \sigma^2 \cdot \frac{1 + a_2}{1 - a_2} \cdot \frac{1}{(1 + a_2)^2 - a_1^2}, \qquad r(1) = -\sigma^2 \cdot \frac{a_1}{1 - a_2} \cdot \frac{1}{(1 + a_2)^2 - a_1^2}, \qquad (2.5)$$
respectively. We can now express the general solution to the Yule-Walker equation in terms of the roots
$$z_{1,2} = -a_1/2 \pm \sqrt{(a_1/2)^2 - a_2},$$
of the characteristic equation, $z^2 A(z^{-1}) = z^2 + a_1 z + a_2 = 0$. The covariance function is of one of the types,

$$\begin{aligned} r(\tau) &= K_1 z_1^{|\tau|} + K_2 z_2^{|\tau|}, & (2.6)\\ r(\tau) &= K_1 z_1^{|\tau|}(1 + K_2 |\tau|), & (2.7)\\ r(\tau) &= K_1 \rho^{|\tau|} \cos(\beta |\tau| - \phi), & (2.8) \end{aligned}$$
where the different types appear if the roots are (1) real-valued and different, (2) real-valued and equal, or (3) complex conjugated. For the real root cases, the constants $K_1$, $K_2$ can be found by solving the equation system

$$\begin{cases} K_1 + K_2 = r(0), \\ K_1 z_1 + K_2 z_2 = r(1), \end{cases}$$
with the starting values from (2.5). For the complex root case, write the complex conjugated roots in polar form,
$$z_1 = \rho e^{i2\pi f} \quad \text{and} \quad z_2 = \rho e^{-i2\pi f},$$
where $0 < \rho < 1$ and $0 < f \le 1/2$. Then, the covariance function $r(\tau)$ is (for $\tau \ge 0$),
$$\begin{aligned} r(\tau) &= K_1 z_1^\tau + K_2 z_2^\tau = \rho^\tau (K_1 e^{i2\pi f \tau} + K_2 e^{-i2\pi f \tau}) \\ &= \rho^\tau \big((K_1 + K_2)\cos(2\pi f \tau) + i(K_1 - K_2)\sin(2\pi f \tau)\big) \\ &= \rho^\tau \big(K_3 \cos(2\pi f \tau) + K_4 \sin(2\pi f \tau)\big), \end{aligned}$$
where $K_3$ and $K_4$ are real constants (since $r(\tau)$ is real-valued). With

$$K_5 = |K_3 + iK_4| = \sqrt{K_3^2 + K_4^2} \quad \text{and} \quad \phi = \arg(K_3 + iK_4),$$
we can write

$$K_3 = K_5 \cos\phi, \qquad K_4 = K_5 \sin\phi,$$
and find that
$$r(\tau) = \rho^\tau K_5 \cos(2\pi f \tau - \phi).$$
Figure 2.4 shows realizations together with covariance functions and spectral densities for two different Gaussian AR(2)-processes. Note the peak in the spectral density for the process in (a), not present in (b), depending on the roots $z_{1,2} = -a_1/2 \pm \sqrt{a_1^2/4 - a_2}$.

a) With $\sigma^2 = 1$, $a_1 = -1$ and $a_2 = 0.5$, the roots of the characteristic equation are complex conjugates, $z_{1,2} = (1 \pm i)/2 = 2^{-1/2} e^{\pm i\pi/4}$, and
$$X_t = X_{t-1} - 0.5 X_{t-2} + e_t,$$
$$r_X(\tau) = \sqrt{6.4}\; 2^{-|\tau|/2} \cos\!\left(\frac{\pi}{4}|\tau| - \theta\right), \quad \text{where } \theta = \arctan\frac{1}{3}.$$

b) With $\sigma^2 = 1$, $a_1 = -0.5$ and $a_2 = -0.25$, the roots of the characteristic equation are real, $z_{1,2} = (1 \pm \sqrt{5})/4$, and
$$X_t = 0.5 X_{t-1} + 0.25 X_{t-2} + e_t,$$
$$r(\tau) = \frac{0.96 + 0.32\sqrt{5}}{(-1 + \sqrt{5})^{|\tau|}} + \frac{0.96 - 0.32\sqrt{5}}{(-1 - \sqrt{5})^{|\tau|}}.$$

The possibility of having complex roots of the characteristic equation makes the AR(2)-process a very flexible modeling tool in the presence of “quasi-periodicities” near one single period; cf. Remark 2.2. □
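To illustrate the Yule-Walker machinery numerically, the following sketch (ours; the function name and all concrete values are assumptions) solves the system (2.4) for an AR(p)-process and compares the result with the closed-form covariance from case (a) above.

import numpy as np


def ar_covariance(a, sigma2, max_lag):
    """Covariance r_X(0), ..., r_X(max_lag) of an AR(p)-process from (2.2)-(2.4)."""
    a = np.asarray(a, dtype=float)
    p = len(a)
    a_full = np.concatenate(([1.0], a))            # a_0 = 1, a_1, ..., a_p
    # Build the (p+1) x (p+1) system in r(0), ..., r(p), using r(-k) = r(k).
    A = np.zeros((p + 1, p + 1))
    for k in range(p + 1):                         # equation for lag k
        for j in range(p + 1):                     # coefficient a_j multiplies r(k - j)
            A[k, abs(k - j)] += a_full[j]
    b = np.zeros(p + 1)
    b[0] = sigma2
    r = list(np.linalg.solve(A, b))                # r(0), ..., r(p)
    # Continue recursively: r(k) = -a_1 r(k-1) - ... - a_p r(k-p) for k > p.
    for k in range(p + 1, max_lag + 1):
        r.append(-np.dot(a, [r[k - j] for j in range(1, p + 1)]))
    return np.array(r[:max_lag + 1])


# AR(2) from case (a): a_1 = -1, a_2 = 0.5, sigma^2 = 1
r = ar_covariance([-1.0, 0.5], 1.0, max_lag=5)
tau = np.arange(6)
closed_form = np.sqrt(6.4) * 2.0**(-tau / 2) * np.cos(np.pi * tau / 4 - np.arctan(1 / 3))
print(np.round(r, 4))
print(np.round(closed_form, 4))                    # the two rows should coincide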


Figure 2.4: Realization, covariance function, and spectral density (log scale) for two different AR(2)-processes with (a) $a_1 = -1$, $a_2 = 0.5$, and (b) $a_1 = -0.5$, $a_2 = -0.25$.

2.1.2 Moving average, MA(q)

A moving average process is generated by filtration of a white noise process $\{e_t\}$ through a transversal filter; see Figure 2.5.

[Block diagram of a transversal filter generating $X_t = e_t + c_1 e_{t-1} + c_2 e_{t-2} + \cdots + c_q e_{t-q}$ from white noise $e_t$.]

Figure 2.5: MA(q)-process. (The operator $T^{-1}$ delays the signal one time unit.)

An MA(q)-process is defined by its generating polynomial

$$C(z) = c_0 + c_1 z + \cdots + c_q z^q.$$
There are no necessary restrictions on its zeros, like there are for the AR(p)-process, but it is often favorable to require that it has all its zeros outside the unit circle; the filter is then called invertible. Expressed in terms of the characteristic equation $z^q C(z^{-1}) = 0$, the roots should be inside the unit circle. Usually, one normalizes the polynomial and adjusts the innovation variance $V[e_t] = \sigma^2$, and takes $c_0 = 1$.

Definition 2.2 The process $\{X_t\}$, given by

$$X_t = e_t + c_1 e_{t-1} + \cdots + c_q e_{t-q},$$
is called a moving average process of order $q$, an MA(q)-process, with innovation sequence $\{e_t\}$ and generating polynomial $C(z)$.

The sequence $\{X_t\}$ is an improper average of the latest $q+1$ innovations. We do not require the weights $c_k$ to be positive, and their sum need not be equal to 1.

Theorem 2.2. An MA(q)-process $\{X_t\}$ is stationary, with $m_X = E[X_t] = 0$, and
$$r_X(\tau) = \begin{cases} \sigma^2 \sum_{j-k=|\tau|} c_j c_k, & \text{if } |\tau| \le q, \\ 0, & \text{otherwise.} \end{cases}$$
The main feature of an MA(q)-process is that its covariance function is 0 for $|\tau| > q$.

Proof: This is left to the reader. Start with the easiest case, which is treated in the next example and in ??. □
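The covariance formula of Theorem 2.2 is easy to evaluate numerically; here is a small Python sketch of ours (function name assumed), where the sum over $j - k = |\tau|$ is written as $\sigma^2\sum_j c_j c_{j+|\tau|}$. It reproduces, for instance, the MA(1) values of Example 2.3 below.

import numpy as np


def ma_covariance(c, sigma2, tau):
    """Covariance r_X(tau) of an MA(q)-process X_t = c_0 e_t + ... + c_q e_{t-q}."""
    c = np.asarray(c, dtype=float)         # c = (c_0, c_1, ..., c_q), usually c_0 = 1
    q = len(c) - 1
    k = abs(tau)
    if k > q:
        return 0.0
    return sigma2 * np.sum(c[:q - k + 1] * c[k:])


# MA(1) with c = (1, 0.9) and sigma^2 = 1, as in Example 2.3:
print([ma_covariance([1.0, 0.9], 1.0, tau) for tau in range(0, 4)])
# approximately [1.81, 0.9, 0.0, 0.0]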


Figure 2.6: Realizations, covariance functions, and spectral densities (log scale) for two different MA(1)-processes: (a) $X_t = e_t + 0.9 e_{t-1}$, (b) $X_t = e_t - 0.9 e_{t-1}$.

Example 2.3. For the MA(1)-process, $X_t = e_t + c_1 e_{t-1}$, the covariance is
$$r(\tau) = \begin{cases} \sigma^2(1 + c_1^2) & \text{for } \tau = 0, \\ \sigma^2 c_1 & \text{for } \tau = \pm 1, \\ 0 & \text{for } |\tau| \ge 2. \end{cases}$$
Figure 2.6 shows realizations, covariance functions, and spectral densities for two different Gaussian MA(1)-processes with $c_1 = \pm 0.9$, and
$$r(0) = 1.81, \quad r(1) = \pm 0.9, \quad R(f) = 1.81 \pm 1.8\cos 2\pi f. \qquad \Box$$
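For completeness, a small check of ours (not from the text; the function name is an assumption) of the MA spectral density via $R(f) = \sigma^2\,|C(e^{-i2\pi f})|^2$, which should reproduce $1.81 \pm 1.8\cos 2\pi f$ for $c_1 = \pm 0.9$:

import numpy as np


def ma_spectral_density(c, sigma2, f):
    """Spectral density R(f) = sigma^2 |C(exp(-i 2 pi f))|^2 of an MA(q)-process."""
    c = np.asarray(c, dtype=float)         # (c_0, ..., c_q)
    f = np.atleast_1d(f)
    z = np.exp(-2j * np.pi * np.outer(f, np.arange(len(c))))
    return sigma2 * np.abs(z @ c)**2


f = np.linspace(0, 0.5, 6)
print(np.round(ma_spectral_density([1.0, 0.9], 1.0, f), 3))
print(np.round(1.81 + 1.8 * np.cos(2 * np.pi * f), 3))   # the two rows should agree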

2.1.3 Autoregressive moving average, ARMA(p,q)

A natural generalization of the AR- and MA-processes is a combination, with one AR- and one MA-filter in series, letting the right hand side of the AR-definition (2.1) be an MA-process. The result is called an ARMA(p,q)-process,

$$X_t + a_1 X_{t-1} + \cdots + a_p X_{t-p} = e_t + c_1 e_{t-1} + \cdots + c_q e_{t-q},$$
where $\{e_t\}$ is a white noise process, such that $e_t$ and $X_{t-k}$ are uncorrelated for $k = 1, 2, \ldots$
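A minimal simulation sketch of the ARMA(p,q) recursion (our own Python illustration; the helper name and the ARMA(1,1) coefficients are assumptions), obtained by solving the defining equation for $X_t$ at each time step:

import numpy as np


def simulate_arma(a, c, sigma, n, burn_in=500, seed=0):
    """Simulate X_t + a_1 X_{t-1} + ... + a_p X_{t-p} = e_t + c_1 e_{t-1} + ... + c_q e_{t-q}."""
    a, c = np.asarray(a, float), np.asarray(c, float)
    p, q = len(a), len(c)
    rng = np.random.default_rng(seed)
    e = rng.normal(0.0, sigma, size=n + burn_in)
    x = np.zeros(n + burn_in)
    for t in range(max(p, q), n + burn_in):
        ar_part = -np.dot(a, x[t - p:t][::-1])           # -a_1 x_{t-1} - ... - a_p x_{t-p}
        ma_part = e[t] + np.dot(c, e[t - q:t][::-1])     # e_t + c_1 e_{t-1} + ... + c_q e_{t-q}
        x[t] = ar_part + ma_part
    return x[burn_in:]


# ARMA(1,1) with assumed parameters a_1 = -0.8, c_1 = 0.4:
x = simulate_arma(a=[-0.8], c=[0.4], sigma=1.0, n=1000)
print(x.mean(), x.var())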

2.2 Estimation of AR-parameters

AR-, MA-, and ARMA-models are the basic elements in statistical time series analysis, both for stationary and for non-stationary phenomena. Here, we will only give a first example of parameter estimation for the simplest case, the AR(p)-process.

Estimation of the parameters $a_1, \ldots, a_p$ and $\sigma^2 = V[e_t]$ in an AR(p)-process is easy. The AR-equation

$$X_t + a_1 X_{t-1} + \ldots + a_p X_{t-p} = e_t, \qquad (2.9)$$
can be seen as a multiple regression model,

$$X_t = -a_1 X_{t-1} - \ldots - a_p X_{t-p} + e_t,$$
where the residuals (= innovations) $e_t$ are uncorrelated with the regressors $X_{t-1}, X_{t-2}, \ldots$. With terminology borrowed from regression analysis, one can call $U_t = (-X_{t-1}, \ldots, -X_{t-p})$ the independent regressor variables, and regard the new observation $X_t$ as the dependent variable. With the parameter vector

$$\theta = (a_1, \ldots, a_p)',$$
we can write the AR-equation in standard multiple regression form,

$$X_t = U_t \theta + e_t,$$
and use standard regression technique. Suppose we have $n$ successive observations of the AR(p)-process (2.9), $x_1, x_2, \ldots, x_n$, and define $u_t = (-x_{t-1}, \ldots, -x_{t-p})$,
$$x_t = u_t \theta + e_t, \quad t = p+1, \ldots, n.$$
The least squares estimate $\widehat{\theta}$ is the $\theta$-value that minimizes
$$Q(\theta) = \sum_{t=p+1}^{n} (x_t - u_t \theta)^2.$$
The solution can be formulated in matrix language. With

$$X = \begin{pmatrix} x_{p+1} \\ x_{p+2} \\ \vdots \\ x_n \end{pmatrix}, \quad U = \begin{pmatrix} u_{p+1} \\ u_{p+2} \\ \vdots \\ u_n \end{pmatrix} = \begin{pmatrix} -x_p & -x_{p-1} & \cdots & -x_1 \\ -x_{p+1} & -x_p & \cdots & -x_2 \\ \vdots & \vdots & & \vdots \\ -x_{n-1} & -x_{n-2} & \cdots & -x_{n-p} \end{pmatrix}, \quad E = \begin{pmatrix} e_{p+1} \\ e_{p+2} \\ \vdots \\ e_n \end{pmatrix},$$
the regression equation can be written in compact form, $X = U\theta + E$, and the function to minimize is

$$Q(\theta) = (X - U\theta)'(X - U\theta).$$

Theorem 2.3. The least squares estimates of the parameters $(a_1, \ldots, a_p) = \theta$, and the innovation variance $\sigma^2 = V[e_t]$, in an AR(p)-process are given by

$$\widehat{\theta} = (U'U)^{-1} U'X, \qquad \widehat{\sigma}^2 = Q(\widehat{\theta})/(n - p).$$
The estimates are consistent, and converge to the true values when $n \to \infty$; see Appendix ??.

The theorem claims that if the observations come from an AR(p)-process, then the parameters can be correctly estimated if only the series is long enough. The covariance function can then also be estimated. However, one does not know for sure if the process is an AR(p)-process, and even if one did, the value of $p$ would probably be unknown. Statistical time series analysis has developed techniques to test possible model orders, and to evaluate how well the fitted model agrees with data.

Example 2.4. We use the estimation technique on the AR(2)-process from Example 2.2, $X_t = X_{t-1} - 0.5 X_{t-2} + e_t$. Based on $n = 512$ observations the estimates were amazingly close to the true values, namely $\widehat{a}_1 = -1.0361$, $\widehat{a}_2 = 0.4954$, $\widehat{\sigma} = 0.9690$; see Table 2.1 for the standard errors of the estimates. □

2.3 Prediction in AR- and ARMA-models

Forecasting or predicting future values in a time series is one of the most important applications of ARMA-models. Given a sequence of observations $x_t, x_{t-1}, x_{t-2}, \ldots$ of a stationary sequence $\{X_t\}$, one wants to predict the value $x_{t+\tau}$ of the process, $\tau$ time units later, as well as possible in mean square sense. The value of $\tau$ is called the prediction horizon. We only consider predictors that are linear combinations of observed values.

                   a1         a2        σ
  true value       -1          0.5      1
  estimated value  -1.0384     0.4954   0.969
  standard error    0.0384     0.0385

Table 2.1: Parametric estimates of AR(2)-parameters.
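A sketch of the least squares estimation in Theorem 2.3, applied as in Example 2.4 to a simulated AR(2)-process; this is our own Python illustration, reusing the assumed simulate_ar helper from the earlier sketch, and the sample size $n = 512$ is chosen only to mirror the example.

import numpy as np


def estimate_ar(x, p):
    """Least squares estimates of (a_1, ..., a_p) and sigma^2 in an AR(p)-process."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    X = x[p:]                                                          # x_{p+1}, ..., x_n
    U = np.column_stack([-x[p - k:n - k] for k in range(1, p + 1)])    # row t: (-x_{t-1}, ..., -x_{t-p})
    theta, *_ = np.linalg.lstsq(U, X, rcond=None)                      # (U'U)^{-1} U'X
    sigma2 = np.sum((X - U @ theta)**2) / (n - p)                      # Q(theta_hat) / (n - p)
    return theta, sigma2


# As in Example 2.4: X_t = X_{t-1} - 0.5 X_{t-2} + e_t, i.e. a_1 = -1, a_2 = 0.5
x = simulate_ar(a=[-1.0, 0.5], sigma=1.0, n=512)
theta_hat, sigma2_hat = estimate_ar(x, p=2)
print(theta_hat, np.sqrt(sigma2_hat))        # should be close to (-1, 0.5) and 1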

2.3.1 Forecasting an AR-process

Let us first consider one-step ahead prediction, i.e., $\tau = 1$, and assume that the process $\{X_t\}$ is an AR(p)-process,

$$X_t + a_1 X_{t-1} + \cdots + a_p X_{t-p} = e_t, \qquad (2.10)$$
with uncorrelated innovations $\{e_t\}$ with mean 0 and finite variance, $\sigma^2 = V[e_t]$, and with $e_{t+1}$ uncorrelated with $X_t, X_{t-1}, \ldots$. In the relation (2.10), shifted one time unit ahead,

$$X_{t+1} = -a_1 X_t - a_2 X_{t-1} - \cdots - a_p X_{t-p+1} + e_{t+1},$$
all terms on the right hand side are known at time $t$, except $e_{t+1}$, which in turn is uncorrelated with the observations of $X_t, X_{t-1}, \ldots$. It is then clear that it is not possible to predict the value of $e_{t+1}$ from the known observations – we only know that it will be an observation from a distribution with mean 0 and variance $\sigma^2$. The best thing to do is to predict $e_{t+1}$ with its expected value 0. The predictor of $X_{t+1}$ would then be

$$\widehat{X}_{t+1} = -a_1 X_t - a_2 X_{t-1} - \cdots - a_p X_{t-p+1}. \qquad (2.11)$$

Theorem 2.4. The predictor (2.11) is optimal in the sense that if $\widehat{Y}_{t+1}$ is any other linear predictor, based only on $X_t, X_{t-1}, \ldots$, then
$$E[(X_{t+1} - \widehat{Y}_{t+1})^2] \ge E[(X_{t+1} - \widehat{X}_{t+1})^2].$$

Proof: Since $\widehat{X}_{t+1}$ and $\widehat{Y}_{t+1}$ are based only on $X_t, X_{t-1}, \ldots$, they are uncorrelated with $e_{t+1}$, and since $X_{t+1} = \widehat{X}_{t+1} + e_{t+1}$, one has
$$\begin{aligned} E[(X_{t+1} - \widehat{Y}_{t+1})^2] &= E[(\widehat{X}_{t+1} + e_{t+1} - \widehat{Y}_{t+1})^2] \\ &= E[e_{t+1}^2] + 2E[e_{t+1}]E[\widehat{X}_{t+1} - \widehat{Y}_{t+1}] + E[(\widehat{X}_{t+1} - \widehat{Y}_{t+1})^2] \\ &= E[e_{t+1}^2] + E[(\widehat{X}_{t+1} - \widehat{Y}_{t+1})^2] \ge E[e_{t+1}^2] = E[(X_{t+1} - \widehat{X}_{t+1})^2], \end{aligned}$$
with equality only if $\widehat{Y}_{t+1} = \widehat{X}_{t+1}$. □

Repeating the one-step ahead prediction, one can extend the prediction horizon. To predict $X_{t+2}$, consider the identity

$$X_{t+2} = -a_1 X_{t+1} - a_2 X_t - \cdots - a_p X_{t-p+2} + e_{t+2},$$

and insert $X_{t+1} = \widehat{X}_{t+1} + e_{t+1}$, to get

$$\begin{aligned} X_{t+2} &= -a_1(-a_1 X_t - \cdots - a_p X_{t-p+1} + e_{t+1}) - a_2 X_t - \cdots - a_p X_{t-p+2} + e_{t+2} \\ &= (a_1^2 - a_2) X_t + (a_1 a_2 - a_3) X_{t-1} + \cdots + (a_1 a_{p-1} - a_p) X_{t-p+2} \\ &\quad + a_1 a_p X_{t-p+1} - a_1 e_{t+1} + e_{t+2}. \end{aligned}$$
Here, $-a_1 e_{t+1} + e_{t+2}$ is uncorrelated with $X_t, X_{t-1}, \ldots$, and in the same way as before, we see that the best two-step ahead predictor is

$$\widehat{X}_{t+2} = (a_1^2 - a_2) X_t + (a_1 a_2 - a_3) X_{t-1} + \cdots + (a_1 a_{p-1} - a_p) X_{t-p+2} + a_1 a_p X_{t-p+1}.$$
Repeating the procedure gives the best predictor, in mean square sense, for any prediction horizon.
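A small sketch of ours showing the one-step predictor (2.11) on simulated data, again using the assumed simulate_ar helper; the empirical prediction error variance should be close to $\sigma^2$, since the one-step error is exactly $e_{t+1}$.

import numpy as np

# One-step prediction of the AR(2)-process X_t = X_{t-1} - 0.5 X_{t-2} + e_t  (a_1 = -1, a_2 = 0.5)
a = np.array([-1.0, 0.5])
x = simulate_ar(a=list(a), sigma=1.0, n=5000)

p = len(a)
# Predictor (2.11): X^hat_{t+1} = -a_1 X_t - ... - a_p X_{t-p+1}
x_hat = np.array([-np.dot(a, x[t - p + 1:t + 1][::-1]) for t in range(p - 1, len(x) - 1)])
errors = x[p:] - x_hat
print(errors.var())          # approximately sigma^2 = 1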

2.3.2 Prediction of ARMA-processes

To predict an ARMA-process requires more work than for the AR-process, since the unobserved old innovations are correlated with the observed data, and have delayed influence on future observations. An optimal predictor therefore requires reconstruction of old $e_s$-values, based on observed $X_s$, $s \le t$. Let the ARMA-process be defined by

$$X_t + a_1 X_{t-1} + \cdots + a_p X_{t-p} = e_t + c_1 e_{t-1} + \cdots + c_q e_{t-q}, \qquad (2.12)$$
with $\{e_t\}$ as before. We present the solution to the one-step ahead prediction; the generalization to many steps ahead is very similar. We formulate the solution by means of the generating polynomials (with $a_0 = c_0 = 1$),
$$A(z) = 1 + a_1 z + \cdots + a_p z^p, \qquad C(z) = 1 + c_1 z + \cdots + c_q z^q,$$
and assume both polynomials have their zeros outside the unit circle, so $A(z)$ is stable and $C(z)$ is invertible. Further, define the backward translation operator $T^{-1}$,
$$T^{-1} X_t = X_{t-1}, \qquad T^{-1} e_t = e_{t-1}, \qquad T^{-2} X_t = (T^{-1})^2 X_t = T^{-1}(T^{-1} X_t) = T^{-1} X_{t-1} = X_{t-2}, \ \text{etc.}$$
The defining equation (2.12) can now be written in compact form as

$$A(T^{-1}) X_t = C(T^{-1}) e_t, \qquad (2.13)$$
and by formal operation with the polynomials one can write
$$X_{t+1} = \frac{C(T^{-1})}{A(T^{-1})}\, e_{t+1} = e_{t+1} + \frac{C(T^{-1}) - A(T^{-1})}{A(T^{-1})\, T^{-1}}\, T^{-1} e_{t+1} = e_{t+1} + \frac{C(T^{-1}) - A(T^{-1})}{A(T^{-1})\, T^{-1}}\, e_t. \qquad (2.14)$$

According to (2.13), $e_t = \frac{A(T^{-1})}{C(T^{-1})} X_t$, and inserting this into (2.14) we get

$$X_{t+1} = e_{t+1} + \frac{C(T^{-1}) - A(T^{-1})}{A(T^{-1})\, T^{-1}} \cdot \frac{A(T^{-1})}{C(T^{-1})}\, X_t = e_{t+1} + \frac{C(T^{-1}) - A(T^{-1})}{C(T^{-1})\, T^{-1}}\, X_t.$$

Here, the innovation et+1 is uncorrelated with known observations, while the second term only contains known X -values, and can be used as predictor. To find the explicit form, we expand the polynomial ratio in a power series,

$$\frac{C(z) - A(z)}{C(z)\, z} = \frac{(c_1 - a_1) z + (c_2 - a_2) z^2 + \cdots}{z(1 + c_1 z + \cdots + c_q z^q)} = d_0 + d_1 z + d_2 z^2 + \cdots,$$
which, with the $T^{-1}$-operator inserted, gives the desired form,

$$X_{t+1} = e_{t+1} + \frac{C(T^{-1}) - A(T^{-1})}{C(T^{-1})\, T^{-1}}\, X_t = e_{t+1} + d_0 X_t + d_1 X_{t-1} + d_2 X_{t-2} + \cdots.$$

Hence, the best predictor of Xt+1 is

$$\widehat{X}_{t+1} = \frac{C(T^{-1}) - A(T^{-1})}{C(T^{-1})\, T^{-1}}\, X_t = d_0 X_t + d_1 X_{t-1} + d_2 X_{t-2} + \cdots. \qquad (2.15)$$
Our computations have been formal, but one can show that if the sums

$$g_0 e_t + g_1 e_{t-1} + \cdots \quad \text{and} \quad d_0 X_t + d_1 X_{t-1} + \cdots, \qquad (2.16)$$
where $g_0, g_1, \ldots$ are the coefficients in the power series expansion of $(C(z) - A(z))/A(z)$, are absolutely convergent, then

$$X_{t+1} = e_{t+1} + \sum_{k=0}^{\infty} d_k X_{t-k},$$
and $\widehat{X}_{t+1}$ is the optimal predictor.

Here are some more arguments for the calculations. We have assumed that the polynomials $A(z)$ and $C(z)$ in the complex variable $z$ have their zeros outside the unit circle. Then, known theorems for analytic functions say that the radii of convergence for the power series $C(z) = \sum c_k z^k$ and $D(z) = \sum d_k z^k$ are greater than 1, and then $|c_k|, |d_k| \le \text{constant} \cdot \theta^k$, for some $\theta$, $|\theta| < 1$. This in turn implies, by some more elaborate probability theory, that all series in (2.16) are absolutely convergent for (almost) all realizations of $\{X_t\}$. In summary, the predictor $\widehat{X}_{t+1}$ is always well-defined and it is the optimal one-step ahead predictor in the stable-invertible case, when $A(z)$ and $C(z)$ have their zeros outside the unit circle.

For more details, see e.g. Åström: Introduction to Stochastic Control, [31], or Yaglom: An Introduction to the Theory of Stationary Random Functions, [30], for a classical account.
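As an illustration of (2.15), the coefficients $d_k$ can be computed numerically by power series expansion of $(C(z) - A(z))/(C(z)z)$; the following Python sketch is our own, and the function name and the ARMA(1,1) coefficients are assumptions.

import numpy as np


def predictor_coefficients(a, c, n_coef=20):
    """Coefficients d_0, d_1, ... in (C(z) - A(z)) / (C(z) z) = d_0 + d_1 z + ...

    `a` and `c` hold (a_1, ..., a_p) and (c_1, ..., c_q); a_0 = c_0 = 1 is implicit.
    """
    A = np.concatenate(([1.0], np.asarray(a, float)))
    C = np.concatenate(([1.0], np.asarray(c, float)))
    # The numerator C(z) - A(z) has zero constant term; dividing by z means dropping that term.
    m = max(len(A), len(C))
    num = np.zeros(m)
    num[:len(C)] += C
    num[:len(A)] -= A
    num = num[1:]                      # coefficients of (C(z) - A(z)) / z
    # Long division: num(z) = C(z) * (d_0 + d_1 z + ...), solved recursively for d_k.
    d = np.zeros(n_coef)
    for k in range(n_coef):
        nk = num[k] if k < len(num) else 0.0
        d[k] = nk - sum(C[j] * d[k - j] for j in range(1, min(k, len(C) - 1) + 1))
    return d


# ARMA(1,1): X_t - 0.8 X_{t-1} = e_t + 0.4 e_{t-1}  (a_1 = -0.8, c_1 = 0.4)
d = predictor_coefficients(a=[-0.8], c=[0.4], n_coef=6)
print(np.round(d, 4))                  # one-step predictor: X^hat_{t+1} = sum_k d_k X_{t-k}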

Example 2.5. (”Prediction of electrical power demand”) Trading electrical power on a daily or even hourly basis has become economically important, and the use of statistical methods has grown. Predictions of demand a few days ahead are one of the basic factors in the pricing of electrical power that takes place on the power markets. (Of course, also forecasts of the demand several months or years ahead are important, but that requires different data and different methods.)

Short term predictions are needed also for production planning of water-generated power, and for decisions about selling and buying electricity. If a distributor has made big errors in the prediction, he can be forced to start up expensive fossil production units, or sell the surplus at a bargain price.

Electrical power consumption is not a stationary process. It varies systematically over the year, depending on season and weather, and there are big differences between days of the week and times of the day. Before one can use any stationary process model, these systematic variations need to be estimated. If one has successfully estimated and subtracted the systematic part, one can hope to fit an ARMA-model to the residuals, i.e., the variation around the weekly/daily profile, and the parameters of the model have to be estimated. Then a new problem arises, since the parameters need not be constant, but may be time and weather dependent. Figure 2.7 shows observed power consumption during one autumn week, together with one-hour ahead predicted consumption based on an ARMA-model, and the prediction error. It turned out, in this experiment, that the ARMA-model contained just as much information about the future demand as a good weather prediction, and that there was no need to include any more data in the prediction. □

2.3.3 The orthogonality principle

The optimal AR-predictor (2.11) and Theorem 2.4 illustrate a general property of optimal linear prediction. Let $Y$ and $X_1, \ldots, X_n$ be correlated random variables with mean zero.

Theorem 2.5. A linear predictor, $\widehat{Y} = \sum_{k=1}^{n} a_k X_k$, of $Y$ by means of $X_1, \ldots, X_n$ is optimal in mean square sense if and only if the prediction error $Y - \widehat{Y}$ is uncorrelated with each $X_k$, i.e. $C[Y - \widehat{Y}, X_k] = 0$, for $k = 1, \ldots, n$.

The coefficients $\{a_k\}$ of the optimal predictor are a solution to the linear equation system

$$C[Y, X_k] = \sum_{j=1}^{n} a_j\, C[X_j, X_k], \qquad k = 1, \ldots, n. \qquad (2.17)$$
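As an illustration of the normal equations (2.17), a Python sketch of ours; the AR(1) covariances used as input are assumptions chosen only for the example. The optimal coefficients are obtained by solving a linear system built from the covariances.

import numpy as np

# Solve (2.17): C[Y, X_k] = sum_j a_j C[X_j, X_k] for the optimal linear predictor Y^hat = sum_k a_k X_k.
# Assumed example: predict Y = X_3 of an AR(1)-process with theta = 0.8 and V[e_t] = 1 from X_1, X_2,
# so that C[X_j, X_k] = theta^|j-k| / (1 - theta^2).
theta = 0.8


def r(k):
    """Covariance r(k) = theta^|k| / (1 - theta^2) of the assumed AR(1)-process."""
    return theta**abs(k) / (1 - theta**2)


Sigma = np.array([[r(0), r(1)],
                  [r(1), r(0)]])          # C[X_j, X_k], j, k = 1, 2
cov_yx = np.array([r(2), r(1)])           # C[Y, X_k],  k = 1, 2

a = np.linalg.solve(Sigma, cov_yx)
print(a)                                   # expected: [0, 0.8] -- only the most recent value matters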

Figure 2.7: Measured electricity consumption during one autumn week (top diagram), ARMA-predicted consumption, including weekday and hour correction, one hour ahead (middle diagram), and the prediction error (lower diagram).

Bibliography

[1] D. L. Bartholomew and J. A. Tague. Quadratic power spectrum estimation with orthogonal division multiple windows. IEEE Trans. on Signal Processing, 43(5):1279–1282, May 1995.

[2] J.S. Bendat & A.G. Piersol: Random Data. Wiley, New York, 2nd ed., 1986.

[3] P. Bloomfield: Fourier Analysis of Time Series: An Introduction. Wiley, New York 1976.

[4] P.J. Brockwell & R.A. Davis: Time Series: Theory and Methods. 2nd ed., Springer-Verlag, New York 1991.

[5] M. P. Clark and C. T. Mullis. Quadratic estimation of the power spectrum using orthogonal time-division multiple windows. IEEE Trans. on Signal Processing, 41(1):222–231, Jan 1993.

[6] J. W. Cooley and J. W. Tukey. An algorithm for the machine calculation of complex Fourier series. Math. Comput., 19:297–301, 1965.

[7] H. Cramér & M.R. Leadbetter: Stationary and Related Stochastic Processes. Wiley, New York 1967.

[8] F.A. Graybill: Introduction to Matrices with Applications in Statistics. Wadsworth, 1969.

[9] G. Grimmett and D. Stirzaker: Probability and Random Processes. Oxford University Press, 2001.

[10] K. Hasselmann et al.: Measurements of wind-wave growth and swell decay during the Joint North Sea Wave Project (JONSWAP). Deutsche Hydrographische Zeitschrift, Reihe A, No. 8, 1973.

[11] S.M. Kay: Modern Spectral Estimation: Theory and Application. Prentice-Hall, Englewood Cliffs 1988.

[12] A.N. Kolmogorov: Grundbegriffe der Wahrscheinlichkeitsrechnung. Springer-Verlag, Berlin, 1933.

[13] A.N. Kolmogorov: Foundations of the Theory of Probability, 2nd English Edition. Chelsea Publishing Company, New York, 1956. (Translation of [12].)


[14] G. Lindgren: Lectures on Stationary Stochastic Processes. Lund 2006.

[15] R.M. Mazo: Brownian Motion: Fluctuations, Dynamics and Applications. Oxford University Press, Oxford, 2002.

[16] L. Olbjer, U. Holst, J. Holst: Tidsserieanalys. Matematisk statistik, LTH, 5th Ed., 2002.

[17] D. B. Percival and A. T. Walden. Spectral Analysis for Physical Applications: Multitaper and Conventional Univariate Techniques. Cambridge University Press, 1993.

[18] D.S.G. Pollock: A Handbook of Time-Series Analysis, Signal Processing and Dynamics. Academic Press, 1999.

[19] S.O. Rice: Mathematical analysis of random noise. Bell System Technical Journal, Vols. 23 and 24, pp. 282–332 and 46–156.

[20] K. S. Riedel. Minimum bias multiple taper spectral estimation. IEEE Trans. on Signal Processing, 43(1):188–195, January 1995.

[21] A. Schuster. On the investigation of hidden periodicities with application to a supposed 26-day period of meteorological phenomena. Terr. Magnet., 3:13–41, 1898.

[22] D. Slepian. Prolate spheroidal wave functions, Fourier analysis and uncertainty – V: The discrete case. Bell System Technical Journal, 57(5):1371–1430, May-June 1978.

[23] G. Sparr and A. Sparr: Kontinuerliga system. Studentlitteratur, Lund, 2000.

[24] P. Stoica and R. Moses: Spectral Analysis of Signals. Pearson Prentice Hall, 2005.

[25] D. J. Thomson. Spectrum estimation and harmonic analysis. Proc. of the IEEE, 70(9):1055–1096, Sept 1982.

[26] A. T. Walden, E. McCoy, and D. B. Percival. The variance of multitaper spectrum estimates for real Gaussian processes. IEEE Trans. on Signal Processing, 42(2):479–482, Feb 1994.

[27] P. D. Welch. The use of fast Fourier transform for the estimation of power spectra: A method based on time averaging over short, modified periodograms. IEEE Trans. on Audio and Electroacoustics, AU-15(2):70–73, June 1967.

[28] http://en.wikipedia.org/wiki/Distribution_(mathematics)

[29] A.M. Yaglom: Correlation Theory of Stationary and Related Random Functions, I–II. Springer-Verlag, New York 1987.

[30] A.M. Yaglom: An Introduction to the Theory of Stationary Random Functions, Dover Publications (reprint of 1962 edition).

[31] K.J. Åström: Introduction to Stochastic Control. Academic Press, New York 1970.

[32] B. Øksendal: Stochastic Differential Equations: An Introduction with Applications. 6th Ed. Springer, 2003.