STAT 153: Introduction to Time Series

Instructor: Aditya Guntuboyina
Lectures: 12:30 pm - 2 pm (Tuesdays and Thursdays)
Office Hours: 10 am - 11 am (Tuesdays and Thursdays), 423 Evans Hall

GSI: Brianna Heggeseth
Section: 10 am - 12 pm or 12 pm - 2 pm (Fridays)
Office Hours and Location: TBA

Announcements, lecture slides, assignments, etc. will be posted on the course site at bspace.

Tuesday, January 17, 12

Course Outline

A Time Series is a set of numerical observations, each one being recorded at a specific time. Examples of Time Series data are ubiquitous. The aim of this course is to teach you how to analyze such data.

[Figure: Population of the United States. x-axis: Year (once every ten years), 1800 to 2000; y-axis: US Population, 0 to 3.0e+08.]

Course Outline (continued)

There are two approaches to time series analysis:
• Time Domain Approach
• Frequency Domain Approach (also known as the Spectral or Fourier analysis of time series)

Very roughly, 60% of the course will be on Time Domain methods and 40% on Frequency Domain methods.

Time Domain Approach

Seeks an answer to the following question: Given the observed time series, how does one guess future values? This is the problem of Forecasting or Prediction.

Time Domain (continued)

Forecasting is carried out through three steps:
• Find a MODEL that adequately describes the evolution of the observed time series over time.
• Fit this model to the data. In other words, estimate the parameters of the model.
• Forecast based on the fitted model.

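The three forecasting steps can be sketched in R as follows. This is only a minimal illustration, not the method developed in the course: it uses R's built-in LakeHuron data set (annual lake levels, 1875-1972), and the choice of an AR(1) model and of a 5-year forecast horizon are assumptions made purely for the example.

```r
# Step 1: choose a model. AR(1) with a constant mean is an illustrative
# assumption here, not a model the slides prescribe for this series.
y <- LakeHuron  # built-in R data set: annual level in feet, 1875-1972

# Step 2: fit the model, i.e. estimate its parameters from the data.
fit <- arima(y, order = c(1, 0, 0))
print(fit)

# Step 3: forecast from the fitted model (next 5 years, an arbitrary horizon).
fc <- predict(fit, n.ahead = 5)
print(fc$pred)  # point forecasts
print(fc$se)    # forecast standard errors
```

The forecast standard errors grow with the horizon, which reflects that values further in the future are harder to predict.
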
Time Series Models

Most of our focus in the Time Domain part of the course will be on the following two classes of models:
• Trend + Seasonality + Stationary ARMA
• Differencing + Stationary ARMA (these comprise the ARIMA and Seasonal ARIMA models)

In the Time Domain part of the course, we study these models and learn how to execute each of the three steps outlined in the previous slide with them.

Time Series Models (continued)

These provide a sturdy toolkit for analyzing many practical time series data sets.

State-Space models are a modern and very powerful class of time series models. Forecasting in these models is carried out via an algorithm known as the Kalman Filter. We shall spend some time on these models although we will not have time to study them in depth.

Frequency Domain Approach

[Figure: Brightness of a variable star on 600 consecutive nights. x-axis: day, 0 to 600; y-axis: Brightness, 0 to 35.]

Frequency Domain (continued)

Based on the idea that the observed time series is made up of CYCLES having different frequencies.

In the Frequency Domain Approach, the data are analyzed by discovering these hidden cycles along with their relative strengths.

The key tool here is the Discrete Fourier Transform (DFT) or, more specifically, a function of the DFT known as the Periodogram.

Frequency Domain (continued)

In the Frequency Domain part of the course, we shall study the periodogram and its performance in discovering periodicities when the data are indeed made up of many different cycles.

It turns out that the raw periodogram is often too variable as an estimator of the true Spectrum and we shall study methods for improving it.

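To make the idea of "discovering hidden cycles" concrete, here is a rough sketch in R. The data are synthetic (one cosine cycle plus Gaussian noise); the amplitude, the frequency 25/600 and the scaling of the periodogram by 1/n are all assumptions chosen for the illustration, and other texts scale the periodogram differently.

```r
set.seed(1)
n <- 600                         # same length as the variable star series
tt <- 1:n
j_true <- 25                     # hypothetical hidden cycle: 25 cycles in n points
x <- 3 * cos(2 * pi * j_true * tt / n) + rnorm(n)

dft <- fft(x)                    # Discrete Fourier Transform of the data
pgram <- Mod(dft)^2 / n          # squared modulus, scaled by 1/n (one convention)

j <- 1:(n / 2)                   # Fourier frequencies j/n, up to 1/2
pg <- pgram[j + 1]               # element 1 of fft() output is frequency 0

j_peak <- j[which.max(pg)]       # frequency index with the largest periodogram value
cat("strongest cycle at frequency", j_peak / n,
    "i.e. period", n / j_peak, "time units\n")
```

The largest periodogram ordinate sits at the frequency of the planted cycle, while the ordinates at the other frequencies reflect only noise and fluctuate considerably from one frequency to the next.
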
Rest of this Lecture

• Some more Time Series Data Examples
• Simplest Time Series Model: Purely Random Process (Section 3.4.1)
• Sample Autocorrelation Coefficients and the Correlogram (Section 2.7 and Page 56)

[Figure: Annual Measurements of the level of Lake Huron, 1875-1972. x-axis: Year; y-axis: Level in feet, 576 to 582.]

[Figure: Monthly Accidental Deaths in the US from 1973-1978. x-axis: Time; y-axis: Number of Deaths, 7000 to 11000.]

The first step in the time domain analysis of a time series data set is to find a model that describes the evolution of the data over time well.

Basic Modelling Strategy:
• Start Simple
• Build Up

Simplest Model: $X_t$, $t = 1, \dots, n$, independent $N(0, \sigma^2)$.

This model is called a Purely Random Process or Gaussian White Noise.

[Figure: 100 Observations from Gaussian White Noise with unit Variance. x-axis: Time, 0 to 100; y-axis: Purely Random Process.]

Is this data set from a purely random process?

[Figure: time plot of 100 observations. x-axis: Time, 0 to 100; y-axis: Data.]

How to check if a given time series is purely random?

Answer: Think in terms of Forecasting.

For a purely random series, the given data can NOT help in predicting $X_{n+1}$. The best estimate of $X_{n+1}$ is $E(X_{n+1}) = 0$. In particular, $X_1$ cannot predict $X_2$, $X_2$ cannot predict $X_3$, and so on.
Therefore, the correlation coefficient between $Y = (X_1, \dots, X_{n-1})$ and $Z = (X_2, \dots, X_n)$ must be close to zero.

The formula for the correlation between $Y$ and $Z$ is

$$ r = \frac{\sum_{t=1}^{n-1} (X_t - \bar{X}_{(1)})(X_{t+1} - \bar{X}_{(2)})}{\sqrt{\sum_{t=1}^{n-1} (X_t - \bar{X}_{(1)})^2 \, \sum_{t=1}^{n-1} (X_{t+1} - \bar{X}_{(2)})^2}}, \qquad \bar{X}_{(1)} = \frac{\sum_{t=1}^{n-1} X_t}{n-1}, \quad \bar{X}_{(2)} = \frac{\sum_{t=1}^{n-1} X_{t+1}}{n-1}. $$

This formula is usually simplified to obtain

$$ r_1 = \frac{\sum_{t=1}^{n-1} (X_t - \bar{X})(X_{t+1} - \bar{X})}{\sum_{t=1}^{n} (X_t - \bar{X})^2}, \qquad \bar{X} = \frac{\sum_{t=1}^{n} X_t}{n}. $$

Note the subscript on the left hand side above.

Sample Autocorrelation Coefficients

The quantity $r_1$ is called the Sample Autocorrelation Coefficient of $X_1, \dots, X_n$ at lag one. Lag one because this correlation is between $X_t$ and $X_{t+1}$.

When $X_1, \dots, X_n$ are obtained from a Purely Random Process, $r_1$ is close to zero, particularly when $n$ is large.

One can similarly consider Sample Autocorrelations at other lags:

$$ r_k = \frac{\sum_{t=1}^{n-k} (X_t - \bar{X})(X_{t+k} - \bar{X})}{\sum_{t=1}^{n} (X_t - \bar{X})^2}, \qquad k = 1, 2, \dots $$

Correlogram

Mathematical Fact: When $X_1, \dots, X_n$ are obtained from a Purely Random process and $n$ is large, $r_1, r_2, \dots$ are approximately independently distributed according to $N(0, 1/n)$.

So one way of testing if the series is purely random is to plot the sample autocorrelations. This plot is known as the Correlogram.

Use the function acf() in R to get the Correlogram.

    ts.obs = rnorm(100)
    # Note: drop.lag.0 is not an argument of base R's stats::acf(); this call
    # appears to assume the acf() from the TSA package.
    acf(ts.obs, lag.max = 20, type = "correlation", plot = TRUE, drop.lag.0 = FALSE)

[Figure: Correlogram of a Purely Random Series of 100 Observations. x-axis: Lag, 0 to 20; y-axis: ACF, -0.2 to 1.0.]

The correlogram plots $r_k$ against $k$. $r_0$ always equals 1. The blue bands correspond to levels of $\pm 1.96/\sqrt{n}$.

Interpreting the Correlogram

When $X_1, \dots, X_n$ are obtained from a Purely Random process, the probability that a fixed $r_k$ lies outside the blue bands is approximately 0.05.

A value of $r_k$ outside the blue bands is significant, i.e., it gives evidence against pure randomness.
However, the overall probability of getting at least one $r_k$ outside the bands increases with the number of coefficients plotted. If 20 values of $r_k$ are plotted, one expects about one significant value (20 × 0.05 = 1) even under pure randomness.

Rules of Thumb (for deciding if a correlogram indicates departure from randomness)

Chatfield (page 56):
• A single $r_k$ just outside the bands may be ignored, but two or three values well outside indicate a departure from pure randomness.
• A single significant $r_k$ at a lag which has some physical interpretation, such as lag one or a lag corresponding to seasonal variation, also indicates evidence of non-randomness.

Is this data set from a purely random process?

[Figure: time plot of 100 observations. x-axis: Time, 0 to 100; y-axis: Data.]

[Figure: Correlogram of the Data in the Previous Slide. x-axis: Lag, 0 to 20; y-axis: ACF, -0.2 to 1.0.]

This data was generated from a moving average process.

Is this data set from a purely random process?

[Figure: time plot of 100 observations. x-axis: Time, 0 to 100; y-axis: Data.]

[Figure: Correlogram of the Data in the Previous Slide. x-axis: Lag, 0 to 20; y-axis: ACF, -0.4 to 1.0.]

Again, there is more structure in this dataset compared to pure randomness.

Is this data set from a purely random process?

[Figure: time plot of 100 observations. x-axis: Time, 0 to 100; y-axis: Data.]

[Figure: Correlogram of the Data in the Previous Slide. x-axis: Lag, 0 to 20; y-axis: ACF, -0.2 to 1.0.]

Lots of structure here.
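
The sample autocorrelations and the ±1.96/√n bands can also be computed by hand, which may help in reading a correlogram. The sketch below is an illustration under stated assumptions: the helper sample_acf() is a hypothetical name introduced here, the data are simulated, and the MA(1) coefficient 0.8 is an arbitrary choice (the slides do not say which moving average process produced their example).

```r
set.seed(153)
n <- 100
band <- 1.96 / sqrt(n)           # half-width of the correlogram bands

# r_k computed directly: lagged cross-products over the total sum of squares.
sample_acf <- function(x, k) {
  n <- length(x)
  xbar <- mean(x)
  sum((x[1:(n - k)] - xbar) * (x[(1 + k):n] - xbar)) / sum((x - xbar)^2)
}

wn <- rnorm(n)                   # purely random series
r_wn <- sapply(1:20, function(k) sample_acf(wn, k))

# Base R's acf() uses the same definition; drop its lag-0 value to compare.
r_builtin <- acf(wn, lag.max = 20, plot = FALSE)$acf[-1]
print(all.equal(r_wn, r_builtin))

# Chance that at least one of 20 r_k falls outside the bands,
# treating the r_k as independent with a 0.05 chance each.
p_any <- 1 - 0.95^20
print(round(p_any, 2))

# A series with structure: MA(1) with coefficient 0.8 (an assumed value).
ma <- arima.sim(model = list(ma = 0.8), n = n)
r_ma <- sapply(1:20, function(k) sample_acf(ma, k))
cat("r_k outside bands: white noise", sum(abs(r_wn) > band),
    ", MA(1)", sum(abs(r_ma) > band), "\n")
```

For this MA(1) choice the theoretical lag-one autocorrelation is 0.8 / (1 + 0.8^2) ≈ 0.49, well outside the bands of about ±0.2 for n = 100, so the lag-one value would typically be flagged by the rules of thumb above.
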
