Beginning Tutorials
Regression with Time Series Errors
David A. Dickey, North Carolina State University
Abstract: The basic assumptions of regression are reviewed. Graphical and statistical methods for checking the assumptions are presented using a sales example. Departures from independence in time series data are emphasized and illustrated in the example. Several products from SAS Institute(TM) for analyzing regressions with time series errors are illustrated. The importance of the stochastic properties of the model input variables is emphasized. Forecasts from several models for the example data are compared.

1. Introduction

Regression is a tool that allows one to model the relationship between a response variable Y, which might be a mail order company's sales, and some explanatory variables usually denoted X_j, where X_1 might be the cost of one item from the company, X_2 the cost of a similar item from a competitor company, and X_3 the number of phone calls coming in to the company's switchboard. A typical regression model for this situation is

Y = β_0 + β_1 X_1 + β_2 X_2 + β_3 X_3 + e

where the regression coefficients, β_j, are unknown.

You would like to estimate these βs, for if you could, you would then have an equation for predicting a future Y from the associated Xs. Notice that even if the regression coefficients were known, such a prediction would require knowledge of future X values. For example, if t represents time and X_t = t, then part of our model consists of a simple linear time trend and there will be no surprises when we try to extend the time sequence 1, 2, 3, ..., n into the future. On the other hand, if X_t is the number of incoming phone calls at time t, then forecasting to time n+1 would require that some value be inserted for X_{n+1}, and this value will itself likely be a forecast. These two examples represent deterministic and stochastic explanatory variables, respectively.

The nature of the X variables will affect the forecast accuracy: obviously a person forecasting with a known future X is better off than one who must estimate that future X. Thus a problem we will need to deal with, if we want to put some sort of error bounds on our forecasts, is the incorporation of our level of uncertainty about the future X values.

The usual way of estimating the βs is the method used in PROC GLM and PROC REG. The method is referred to as ordinary least squares in that it finds estimates b_j of the parameters that minimize the error sum of squares

SSE = Σ_{t=1}^{n} (Y_t - (b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3))²

This SSE is a function of the estimates b_j, and much of the subject of calculus is concerned with finding values of arguments, like these b_j, that minimize a function, SSE in our case. Thus we have mathematical tools, relatively easy to implement on the computer, that allow us to find the minimizing values. This is what PROC REG and PROC GLM are set up to do.
Furthermore, statistical theory allows us to compute measures of uncertainty called standard errors for these b_j estimates and the resulting forecasts if certain conditions are satisfied. Note the expression "if certain conditions are satisfied." It is this with which we are concerned here.

In this paper we review these "certain conditions," indicate why they might be violated when data are taken over time, present methods for checking these conditions, and finally present corrections that can be applied if the conditions are violated. The corrections that we speak of are implemented in SAS Institute's PROC AUTOREG.

Throughout the paper we will use an artificial example in which X_t represents the number of phone calls in week t to a mail order company and Y_t is the number of shipments for that week. Figure 1 shows the data over a 3 year period. We are interested in estimating the company's growth, estimating the number of shipments generated per phone call, and forecasting phone calls and sales two weeks into the future.

2. Checking the usual assumptions

Our model is Y_t = β_0 + β_1 X_t + β_2 t + e_t. We assume

(A) Normality: the errors all come from normal distributions.

(B) Homogeneity: these normal distributions all have mean 0 and the same variance, σ².

(C) Independence: the correlation between e_i and e_j is 0 for i not equal to j.

We can check the normality assumption by drawing histograms and normal probability plots of the residuals. In Figures 2-4 we see a histogram of the residuals, a hanging histogram in which each bar becomes a line segment at the former bar midpoint, this line being hung from the normal curve rather than rising from the horizontal axis, and a plot of the residuals against their normal scores. These are very easy to produce using the following code:

proc capability graphics;
histogram r / normal hanging vref=0;
histogram r / normal;
qqplot r / normal(mu=est sigma=est);

The histograms look reasonably normal and the quantile-quantile plot reasonably straight. PROC CAPABILITY also presents tests of the normality hypothesis, but the theory behind these assumes independence, an assumption we have yet to check.

Not shown is a simple plot of residuals against predicted values. Because this looks uniform, as opposed to megaphone shaped, this check on the homogeneous variance assumption does not give us cause for concern.

The regression and subsequent calculation of residuals was accomplished with this code:

proc reg; model y = t x / dw;
proc reg; model y = t / dw;

where Y is sales, X phone calls, and t week number. The previous residual analysis was from the first regression. The advantage of the second regression is that only future values of t would be needed for forecasting, whereas for the first model we would need to know, or at least estimate, next week's phone calls to forecast sales.

Notice the dw options. These request the "Durbin-Watson" statistic, which is a test for autocorrelation, that is, correlation between successive residuals. Autocorrelation is a commonly occurring violation of the independence assumption when data are taken over time. The option also gives an estimate ρ̂ of the first order autocorrelation. We get dw = 1.407 and ρ̂ = 0.283 for the first model, dw = 0.969 and ρ̂ = 0.497 for the second model.
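For readers outside SAS, the same fit-then-diagnose step can be mimicked. This is a hypothetical Python sketch (numpy only; simulated data stand in for the sales series, with AR(1) errors built in so there is autocorrelation to find), computing the residuals, the lag 1 autocorrelation estimate, and the Durbin-Watson statistic that the dw option reports:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 156
t = np.arange(1.0, n + 1)
x = 100 + 0.3 * t + rng.normal(scale=10, size=n)    # stand-in "phone calls"
e = np.zeros(n)
for i in range(1, n):                               # AR(1) errors, rho = 0.3
    e[i] = 0.3 * e[i - 1] + rng.normal(scale=20)
y = 10 + 0.95 * x + e                               # stand-in "sales"

X1 = np.column_stack([np.ones(n), t, x])            # model y = t x
b1, *_ = np.linalg.lstsq(X1, y, rcond=None)
r = y - X1 @ b1                                     # residuals
rho_hat = np.sum(r[1:] * r[:-1]) / np.sum(r ** 2)   # lag 1 autocorrelation
dw = np.sum(np.diff(r) ** 2) / np.sum(r ** 2)       # Durbin-Watson statistic
```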
3. The Durbin-Watson statistic and first order autocorrelation

The Durbin-Watson statistic is

dw = Σ_{t=2}^{n} (r_t - r_{t-1})² / Σ_{t=1}^{n} r_t²

where r_t is the residual at time t. If e_t represents white noise (an uncorrelated sequence) then we find these expected values:

E{(e_t - e_{t-1})²} = E{e_t² - 2 e_t e_{t-1} + e_{t-1}²} = σ² + 0 + σ²

E{e_t²} = σ²

Thus Σ_{t=2}^{n} (e_t - e_{t-1})² / Σ_{t=1}^{n} e_t² should be near 2; that is, the Durbin-Watson statistic should be near 2 if calculated on a white noise sequence. If there is positive correlation between neighboring e's, then e_t and e_{t-1} would be more alike than in the white noise case, so that e_t - e_{t-1} would be smaller in magnitude and thus dw would move toward 0.

The first order correlation in the residuals is

ρ̂ = Σ_{t=2}^{n} (r_t - r̄)(r_{t-1} - r̄) / Σ_{t=1}^{n} (r_t - r̄)²

which is very close to what one would get by simply inserting r_t and r_{t-1} into the standard formula for a correlation. If, as in our example, the regression contains an intercept, then r̄ = 0. It is well known that if ρ̂ is computed on a white noise series and if the sample size n is reasonably large, then

Z = √n ρ̂ / √(1 - ρ̂²)

is approximately a N(0, 1) random variable, so that for large samples, values of |Z| exceeding 1.96 would give us reason to suspect that autocorrelation is present.

A bit of algebra demonstrates that the Durbin-Watson statistic is roughly equal to 2(1 - ρ̂), so that from our Z we could get a large sample approximate distribution for the Durbin-Watson statistic. The real contribution of Durbin and Watson was to show how to get the exact finite sample distribution of the statistic dw.

Unfortunately the Durbin-Watson theory shows that the exact distribution depends on the values of the X explanatory variables in the regression, so that each new problem encountered would require a new table of critical values. However, if none of the X variables are lagged Y values and the errors are normal, they were able to calculate bounds that hold for all critical values. Thus if you enter the tables of Durbin and Watson for a certain sample size and number of explanatory variables, you will see upper and lower bounds for the true critical value.

A dw to the left of the lower bound is clearly less than the critical value and thus too close to 0 to accept the independence hypothesis under which dw should be near 2. A dw to the right of the upper bound makes it clear that dw is closer to 2 than is the critical value, so we cannot reject independence. A dw between the bounds just tells you that the calculated dw and the critical value are between these numbers, so you have no idea how they are placed relative to each other.

Durbin and Watson also gave a computationally intensive way of computing p-values using the observed Xs. We will see how to get p-values from PROC AUTOREG. It should be noted that the restriction that lagged Ys not be included in the explanatory X variables still holds, so that a model with lagged Ys, explicitly or implicitly among the explanatory variables, would not produce exact p-values using the Durbin-Watson method.
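The near-identity dw ≈ 2(1 - ρ̂) is easy to verify numerically. A small Python check on simulated white noise (an illustrative sketch, not part of the paper's SAS analysis):

```python
import numpy as np

rng = np.random.default_rng(2)
r = rng.normal(size=5000)          # white-noise "residuals"
r = r - r.mean()                   # an intercept in the regression makes r-bar 0

dw = np.sum(np.diff(r) ** 2) / np.sum(r ** 2)       # Durbin-Watson statistic
rho = np.sum(r[1:] * r[:-1]) / np.sum(r ** 2)       # lag 1 autocorrelation
z = np.sqrt(len(r)) * rho / np.sqrt(1 - rho ** 2)   # approx N(0,1) under independence
```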
Because the large sample Z approximation is reasonably good, the tables of Durbin and Watson typically do not extend to very large n values, so for our example residuals we use Z = √n ρ̂ / √(1 - ρ̂²), getting

√156 (0.283) / √(1 - 0.283²) = 3.7

for the first model, and

√156 (0.497) / √(1 - 0.497²) = 7.2

for the second model. Using 1.96 as a critical value, we have strong evidence for autocorrelation in our example.

For our example we have normal residuals with homogeneous variance, but they are clearly not independent for either of our models.

4. Adjusting for autocorrelation

Suppose we have a simple linear regression model

Y_t = β_0 + β_1 X_{1t} + a_t

where, instead of white noise e_t, our error term satisfies a model such as

a_t = α_1 a_{t-1} + α_2 a_{t-2} + ... + α_p a_{t-p} + e_t

This error model is called autoregressive of order p, where the order refers to the number of lagged a's appearing in the equation for a_t. If p = 1, then the first order autocorrelation coefficient from PROC REG is a reasonable estimate of α_1, but in higher order models the relationship between the autoregressive coefficients α and the autocorrelations is much more convoluted.

What happens if we just ignore the autocorrelation?

(A) The estimates of the regression coefficients are still unbiased.

(B) The estimates of the regression coefficients vary more from sample to sample than do the best estimates, but may still be reasonably efficient.

(C) Estimates of standard errors for coefficients, and anything computed from them (t statistics, p-values and confidence intervals, for example), are biased - often badly biased.

Using our simple linear regression and an order 1 error process a_t = ρ a_{t-1} + e_t, we note that the equation holds at both times t and t-1 so that, multiplying through by the autoregressive parameter, we obtain

Y_t = β_0 + β_1 X_{1t} + a_t

ρ Y_{t-1} = ρ β_0 + ρ β_1 X_{1,t-1} + ρ a_{t-1}

and subtracting we have the transformed model

Y_t - ρ Y_{t-1} = β_0 (1 - ρ) + β_1 (X_{1t} - ρ X_{1,t-1}) + (a_t - ρ a_{t-1})

Now we note the following points about this transformed model:

(A) It is a linear model in the transformed variables (in parentheses).

(B) It has the same coefficients β as the original model.

(C) It has error term a_t - ρ a_{t-1}, where we are assuming a_t - ρ a_{t-1} = e_t; that is, this new model satisfies all the usual regression assumptions!

(D) It has n - 1, not n, observations.
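Points (A)-(C) can be checked numerically. In this hypothetical Python sketch (true values β_0 = 10, β_1 = 2, ρ = 0.5 are assumptions of the simulation, not the paper's data), transforming with the true ρ yields a regression with the same βs and whitened errors:

```python
import numpy as np

rng = np.random.default_rng(3)
n, rho = 400, 0.5
b0, b1 = 10.0, 2.0                       # true coefficients (hypothetical)
x = rng.normal(size=n)
a = np.zeros(n)
for i in range(1, n):                    # AR(1) errors
    a[i] = rho * a[i - 1] + rng.normal()
y = b0 + b1 * x + a

# transformed model: Y_t - rho Y_{t-1} = b0 (1 - rho) + b1 (X_t - rho X_{t-1}) + e_t
ys = y[1:] - rho * y[:-1]
xs = x[1:] - rho * x[:-1]
Z = np.column_stack([np.full(n - 1, 1 - rho), xs])
b, *_ = np.linalg.lstsq(Z, ys, rcond=None)   # ordinary least squares on n-1 rows
resid = ys - Z @ b
rho_e = np.sum(resid[1:] * resid[:-1]) / np.sum(resid ** 2)  # ≈ 0: errors whitened
```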
Note that point (C) implies that running an ordinary regression would suffice if we knew ρ or could approximate it well. We can recover the lost observation by noting that

√(1 - ρ²) Y_1 = √(1 - ρ²) β_0 + β_1 √(1 - ρ²) X_{11} + √(1 - ρ²) a_1

This works because Var(a_t) = σ² / (1 - ρ²). What we will do is regress column 1 on columns 2 and 3 in this table, using the first order autocorrelation of the residuals from an initial ordinary least squares regression as an estimate of ρ.

√(1 - ρ²) Y_1      √(1 - ρ²)    √(1 - ρ²) X_1
Y_2 - ρ Y_1        1 - ρ        X_2 - ρ X_1
Y_3 - ρ Y_2        1 - ρ        X_3 - ρ X_2
...                ...          ...
Y_n - ρ Y_{n-1}    1 - ρ        X_n - ρ X_{n-1}

These new estimates of the βs can be used to compute new residuals, a new estimate of ρ, new columns in the table, and the whole process can be iterated until the estimates stabilize.

Alternatively, one can simply notice at the outset that this whole procedure amounts to an attempt to minimize the error sum of squares in a nonlinear model and thus use standard nonlinear search techniques (i.e. full blown maximum likelihood estimation of ρ and the βs simultaneously) instead of the alternating technique above. Either way, we have estimated the model

Y_t = ρ Y_{t-1} + β_0 (1 - ρ) + β_1 (X_{1t} - ρ X_{1,t-1}) + e_t

which is seen to be a model that includes lagged Y's among the explanatory variables.

Now it is possible that the model needs more than 1 lagged residual. The procedure is essentially the same; we just need more terms in the transformation, and more than 1 observation at the beginning needs to be recovered in a special way.

5. PROC AUTOREG for the sales data

We use PROC AUTOREG on our sales data with the following code:

proc autoreg;
model y = t x / nlag=4 backstep dwprob partial;

The output begins with ordinary least squares, used to get the residuals and model them. Next come autocorrelations of the residuals. The lag j autocorrelation is simply an estimate of the correlation between a_t and a_{t-j} computed from the ordinary least squares residuals r_t (think of r_t as an estimate of a_t). The jth partial autocorrelation can be interpreted as an estimate of the last coefficient in the regression of a_t on a_{t-1}, ..., a_{t-j} and thus would be 0 after the appropriate number of autoregressive lags are included in the model. Time series experts use autocorrelations and partial autocorrelations to diagnose the nature of correlation in the residuals.

The drop in the autocorrelations after lag 1 is more dramatic than that in the partial autocorrelations. This suggests that a model other than autoregressive for the error terms might be considered, thus taking us into the realm of PROC ARIMA, which we discuss later.

Estimates of Autocorrelations

Lag    Covariance    Correlation
0      446.1458       1.000000
1      126.196        0.282858
2      -17.9134      -0.040151
3       27.51148      0.061665
4        8.913212     0.019978
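The alternating scheme described above, including the √(1 - ρ²) recovery of the first observation, can be sketched in a few lines of Python. This is an illustrative re-implementation on simulated data (true values β_0 = 5, β_1 = 1.5, ρ = 0.4 are assumptions of the sketch), not the internals of PROC AUTOREG:

```python
import numpy as np

rng = np.random.default_rng(4)
n, rho_true = 300, 0.4
x = rng.normal(size=n)
a = np.zeros(n)
a[0] = rng.normal() / np.sqrt(1 - rho_true ** 2)   # Var(a_t) = sigma^2 / (1 - rho^2)
for i in range(1, n):
    a[i] = rho_true * a[i - 1] + rng.normal()
y = 5.0 + 1.5 * x + a

X = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)          # initial ordinary least squares
for _ in range(10):                                # iterate until estimates stabilize
    r = y - X @ b
    rho = np.sum(r[1:] * r[:-1]) / np.sum(r ** 2)  # new estimate of rho
    w = np.sqrt(1 - rho ** 2)
    ys = np.concatenate([[w * y[0]], y[1:] - rho * y[:-1]])      # column 1
    Xs = np.vstack([[w, w * x[0]],                               # columns 2 and 3
                    np.column_stack([np.full(n - 1, 1 - rho),
                                     x[1:] - rho * x[:-1]])])
    b, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
```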
Partial Autocorrelations

1     0.282858
2    -0.130610
3     0.123244
4    -0.047635

The next part of the output is a backward elimination of insignificant autoregressive parameters (the αs), starting with the least significant.

Backward Elimination of Autoregressive Terms

Lag    Estimate     t-Ratio    Prob
4      0.047635     0.5821     0.5614
3     -0.123244    -1.5210     0.1304
2      0.130610     1.6188     0.1076

We are left, then, with just a lag 1 autoregressive model. The procedure next summarizes the model:

Estimates of the Autoregressive Parameters

Lag    Coefficient     Std Error    t Ratio
1      -0.28285812     0.077798     -3.636

The autoregressive coefficient -0.2829 divided by its standard error is t = -3.636, and since our sample size is reasonably large, a t exceeding 1.96 in magnitude is considered significant. The error mean square 419.7 estimates σ², the variance of e_t, so our error model is

a_t = 0.2829 a_{t-1} + e_t

The procedure next uses the estimated ρ to get the transformed variables, e.g. Y_t - 0.2829 Y_{t-1}, and runs ordinary least squares on the transformed variables. Because of the transformation, the resulting coefficients, standard errors, etc. are correct and, except for the fact that ρ is estimated instead of known, the X coefficients are the best linear unbiased estimates of the parameters.

Yule-Walker Estimates

SSE              63800.85    DFE          152
MSE              419.7425    Root MSE     20.48762
SBC              1401.124    AIC          1388.924
Reg Rsq          0.7101      Total Rsq    0.7764
Durbin-Watson    1.8850      PROB < DW    0.2013

The SBC and AIC are information criteria that can be used to compare models with differing numbers of parameters. The model delivering the smallest information criterion would be selected. The criteria trade the fit of the model off against its complexity, just as a person might trade the functionality of a new computer against its cost in deciding which machine to purchase.

We observe that the Durbin-Watson statistic on the transformed model is now close to 2, and an approximate p-value larger than 0.05 appears. This is not an exact p-value, since the transformed model implicitly uses an estimated coefficient on lagged Y's to predict the current Y and thus does not satisfy Durbin and Watson's assumptions.

Finally, there are 2 R-square values. This is because, in predicting Y one step ahead, we can use a prediction based on the X's and their coefficients only, or we can add to that a forecast of the next residual based on our autoregressive error model and the most recently observed residuals. The percentages of variation explained under these scenarios are the regression R-square and total R-square, respectively. In that sense, the total R-square is a predictability R-square, while the regression R-square tells how much of the predictability is associated with the X variables, which may be difficult or expensive to obtain - especially future values of them.
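The two-R-square idea can be illustrated numerically. This Python sketch follows the verbal description above on simulated data (it is not PROC AUTOREG's exact computation, whose definitions may differ in detail): predict one step ahead from the X's alone, then add the autoregressive forecast of the next residual, and compare the variation explained:

```python
import numpy as np

rng = np.random.default_rng(6)
n, rho = 500, 0.5
x = rng.normal(size=n)
a = np.zeros(n)
for i in range(1, n):                     # AR(1) errors
    a[i] = rho * a[i - 1] + rng.normal()
y = 3.0 + 2.0 * x + a

X = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
r = y - X @ b
rho_hat = np.sum(r[1:] * r[:-1]) / np.sum(r ** 2)

pred_reg = X @ b                              # X's and coefficients only
pred_tot = pred_reg[1:] + rho_hat * r[:-1]    # add forecast of the next residual

ss = lambda v: np.sum((v - v.mean()) ** 2)
reg_rsq = 1 - np.sum((y - pred_reg) ** 2) / ss(y)          # "regression R-square"
tot_rsq = 1 - np.sum((y[1:] - pred_tot) ** 2) / ss(y[1:])  # "total R-square"
```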
The model parameters part of the output is given next. A portion of this is shown here:

Variable     DF    B Value      t Ratio    Approx Prob
Intercept    1     5.571991      0.882     0.3793
T            1     0.038574      0.732     0.4656
X            1     0.947411     18.220     0.0001

We see that the time trend T, which seemed to appear in our graph, does not seem important after X is included in our model. Note that this does not say that there is no increase in sales, only that there is no increase beyond what would have been predicted from incoming phone calls. Phone calls X are strongly significant. We might have been happier had the significance results been reversed, since we need to supply next week's phone volume to predict next week's sales in a model involving X.

One confusing phenomenon that often occurs is that, although statistical theory indicates that the estimates from our transformed model should be better than the ordinary least squares estimates that ignore autocorrelation, the ordinary least squares printout shows smaller standard errors than the ones shown in PROC AUTOREG. How can that be, if the PROC AUTOREG method provides better estimates? The answer is simple - the ordinary least squares numbers are not good estimates of the standard errors and are often, but not always, deceptively small. In other words, by ignoring autocorrelation we are using inferior estimates of the parameters, but the standard errors falsely indicate that they are superior.

The three models estimated by PROC AUTOREG, with standard errors in parentheses, are:

Y_t = 5.5720 + 0.0386 t + 0.9474 X_t
     (6.3191)  (0.0527)   (0.052)

Y_t = 7.3681 + 0.9589 X_t
     (5.8111)  (0.0498)

Y_t = 83.8472 + 0.3388 t
     (11.0956)  (0.1222)

Choosing SBC as a criterion, we have

Model      SBC     MSE     ρ̂
X and t    1401    420     -0.2829
X only     1397    418     -0.2873
t only     1607    1658    -0.4972

so the model with X only is clearly preferred.

6. Forecasting

What are our choices in terms of forecasting? One option is to somehow get estimates of future X values. Here are some forecasts of phone volume which actually came from SAS PROC ARIMA. They have been appended to our original data, and you see their Y values are missing.

T      DATE        X          Y
154    12/16/94    153.000    141
155    12/23/94    146.000    159
156    12/30/94    180.000    220
157    01/06/95    157.292    .
158    01/13/95    141.705    .
159    01/20/95    131.008    .
160    01/27/95    123.665    .
161    02/03/95    118.625    .

Do we really think X will be 157.292 next week? Of course not; this is just a forecast, so the use of this X value in computing a forecast for Y will add to the inaccuracy of the forecast. On the other hand, since we stopped in week 156, the use of t = 157 for next week's t in our model will introduce no inaccuracy in the forecast.

The model with only X is

Y_t = 7.3681 + 0.9589 X_t + a_t
     (5.8111)  (0.050)
with a_t = 0.2873 a_{t-1} + e_t, and we are at the last week of year 3 in our data, so t = 156. Standard errors are in parentheses. The last observation was Y_156 = 220 and X_156 = 180, so that the residual, an estimate of a_t, would be r_156 = 220 - (7.3681 + 0.9589(180)) = 40.030, and we predict a_157 as (0.2873)(40.030) = 11.501. Now if we assume X_157 will be 157.292, then we predict Y_157 to be 7.3681 + 0.9589(157.292) + 11.501 = 158.195 + 11.501 = 169.696. This is the first forecast in the PROC AUTOREG output dataset, and an associated standard error (20.9) is used to compute a 95% prediction interval. The problem is that this standard error is computed assuming that X will be exactly 157.292 in the next time interval. Our true level of uncertainty in the forecast of Y would certainly be influenced by the variability in our imputed X value.
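The forecast arithmetic above can be checked directly; a few lines of Python reproduce the numbers in the text:

```python
# one-step forecast arithmetic from the text (model with X only)
b0, b1, rho = 7.3681, 0.9589, 0.2873
y156, x156 = 220.0, 180.0
x157 = 157.292                      # ARIMA forecast of next week's phone calls

r156 = y156 - (b0 + b1 * x156)      # residual, an estimate of a_156
a157 = rho * r156                   # forecast of next week's error
y157 = b0 + b1 * x157 + a157        # forecast of next week's sales: 169.696
```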
7. Using PROC ARIMA

We can model the whole process, X and Y, in PROC ARIMA. We first compute a model for X, which PROC ARIMA estimates as

X_t - 107.6 = 0.6864 (X_{t-1} - 107.6) + e_t

Now PROC ARIMA can fit and forecast the same model as PROC AUTOREG; however, it gives you the option of using the X model to provide forecasts of future X's, and it incorporates the uncertainty in those X's into the Y forecasts. Our forecasts of future X's were from PROC ARIMA, so the forecasts from the two procedures will match, but the forecast error variances will differ. We close by presenting graphs of forecasts and their standard errors from several scenarios.

In Figure 5, the forecasts from PROC AUTOREG and PROC ARIMA with their intervals are overlaid. The difference in widths of the forecast intervals illustrates the magnitude of error that is being ignored when one treats future X's as known when they are in fact forecasts.

Figure 6 shows the forecasts and intervals from the model that uses time t as the explanatory variable. Here we can properly treat future t's as known, but pay a price in that the model does not fit as well. The forecasted X (ARIMA) plot is overlaid on this; it gives the lower forecasts and intervals, and it is interesting to note that although this model fits substantially worse according to our statistical tests, once we admit that there is error in our forecasts of X, our forecasts and error bands are similar in both models. In the long term, the model with t will give forecasts that trend linearly upward, while the forecasts with the ARIMA model will return to the historic mean, as will always happen with a stationary ARIMA model.

Finally, PROC ARIMA allows the fitting of a larger class of error models than autoregressive. A lag 1 moving average model also provides an excellent fit to the error series for the sales data. Using the autoregressive model for X, a linear relationship between Y and X, and an order 1 moving average error term, our estimated model becomes

Y_t = 8.607 + 0.95 X_t + e_t + 0.37 e_{t-1}

X_t - 107.6 = 0.686 (X_{t-1} - 107.6) + u_t

Because the error term e_t + 0.37 e_{t-1} is a moving average, we estimated this model in PROC ARIMA:

proc arima data=a;
i var=x noprint; e p=1 ml;
i var=y crosscor=x nlag=5;
e input=x q=1 ml;
f lead=5 out=out5 id=t;
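The mean-reverting behavior of the fitted AR(1) model for phone calls is easy to reproduce. Iterating the estimated equation forward from the last observed week recovers, to within rounding of the reported coefficient, the X forecasts shown earlier (157.292 down toward 118.625):

```python
# mean-reverting forecasts from the AR(1) model for phone calls:
#   X_t - 107.6 = 0.6864 (X_{t-1} - 107.6) + e_t
mu, phi = 107.6, 0.6864
x = 180.0                        # last observed value, week 156
forecasts = []
for week in range(157, 162):     # weeks 157 through 161
    x = mu + phi * (x - mu)      # the best forecast sets e_t to 0
    forecasts.append(x)
# each step moves a fraction phi of the remaining distance back toward 107.6
```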
Because our input variable is modeled, we will get a crosscorrelation plot which has been prewhitened. It is a plot of correlations between Y at time t and X at time t - j which has been cleared of any indirect correlations caused by autocorrelation in the X series. It is thus a picture of how changes in X are incorporated into the Y series.

The plot of the crosscorrelations for our example will look similar to this:

Crosscorrelations

Lag     Corr
-5     -0.065    .  *|          .
-4      0.043    .   |*         .
-3     -0.090    . **|          .
-2     -0.060    .  *|          .
-1      0.008    .   |          .
 0      0.797    .   |**********
 1      0.057    .   |*         .
 2     -0.010    .   |          .
 3     -0.053    .  *|          .
 4      0.135    .   |***       .
 5     -0.124    . **|          .

"." marks two standard errors

It is seen that there is no lagged correlation, only contemporaneous correlation. The moving average error structure gave error mean square 408.7, as compared to 418.3, so it is the best fit of all models considered here by that criterion.

SAS is the registered trademark of SAS Institute Inc. in the USA and other countries. TM indicates USA registration.