Time Series Analysis in Python with statsmodels

Wes McKinney1 Josef Perktold2 Skipper Seabold3

1Department of Statistical Science Duke University

2Department of Economics University of North Carolina at Chapel Hill

3Department of Economics American University

10th Python in Science Conference, 13 July 2011

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 1 / 29 What is statsmodels?

A library for statistical modeling, implementing standard statistical models in Python using and SciPy Includes: Linear (regression) models of many forms Descriptive statistics Statistical tests Time series analysis ...and much more

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 2 / 29 What is Time Series Analysis?

Statistical modeling of time-ordered data observations Inferring structure, forecasting and simulation, and testing distributional assumptions about the data Modeling dynamic relationships among multiple time series Broad applications e.g. in economics, finance, neuroscience, signal processing...

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 3 / 29 Talk Overview

Brief update on statsmodels development Aside: user interface and data structures Descriptive statistics and tests Auto-regressive moving average models (ARMA) Vector autoregression (VAR) models Filtering tools (Hodrick-Prescott and others) Near future: Bayesian dynamic linear models (DLMs), ARCH / GARCH volatility models and beyond

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 4 / 29 Statsmodels development update

We’re now on GitHub! Join us:

http://github.com/statsmodels/statsmodels

Check out the slick Sphinx docs:

http://statsmodels.sourceforge.net

Development focus has been largely computational, i.e. writing correct, tested implementations of all the common classes of statistical models

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 5 / 29 Statsmodels development update

Major work to be done on providing a nice integrated user interface We must work together to close the gap between R and Python! Some important areas: Formula framework, for specifying model design matrices Need integrated rich statistical data structures () Data visualization of results should always be a few keystrokes away Write a “Statsmodels for R users” guide

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 6 / 29 Aside: statistical data structures and user interface

While I have a captive audience... Controversial fact: pandas is the only Python library currently providing data structures matching (and in many places exceeding) the richness of R’s data structures (for statistics) Let’s have a BoF session so I can justify this statement Feedback I hear is that end users find the fragmented, incohesive set of Python tools for and statistics to be confusing, frustrating, and certainly not compelling them to use Python... (Not to mention the packaging headaches)

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 7 / 29 Aside: statistical data structures and user interface

We need to “commit” ASAP (not 12 months from now) to a high level data structure(s) as the “primary data structure(s) for statistical data analysis” and communicate that clearly to end users Or we might as well all start programming in R...

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 8 / 29 Example data: EEG trace data

300

200

100

0

100

200

300

400

500

600 0 500 1000 1500 2000 2500 3000 3500 4000

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 9 / 29 Example data: Macroeconomic data

5.5 5.0 cpi 4.5 4.0 3.5 3.0 7.5 7.0 m1 6.5 6.0 5.5 5.0 4.5 9.5 realgdp 9.0

8.5

8.0

1960 1964 1968 1972 1976 1980 1984 1988 1992 1996 2000 2004 2008

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 10 / 29 Example data: Stock data

800 AAPL 700 GOOG MSFT 600 YHOO

500

400

300

200

100

0

2001 2002 2003 2004 2005 2006 2007 2008 2009

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 11 / 29 Descriptive statistics

Autocorrelation, partial autocorrelation plots Commonly used for identification in ARMA(p,q) and ARIMA(p,d,q) models acf = tsa.acf(eeg, 50) pacf = tsa.pacf(eeg, 50)

Autocorrelation Partial Autocorrelation 1.0 1.0

0.5 0.5

0.0 0.0

0.5 0.5

1.0 1.0 0 10 20 30 40 50 0 10 20 30 40 50

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 12 / 29 Statistical tests

Ljung-Box test for zero autocorrelation Unit root test for cointegration (Augmented Dickey-Fuller test) Granger-causality Whiteness (iid-ness) and normality See our conference paper (when the proceedings get published!)

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 13 / 29 Autoregressive moving average (ARMA) models

One of most common univariate time series models:

yt = µ + a1yt−1 + ... + ak yt−p + t + b1t−1 + ... + bqt−q 2 where E(t , s ) = 0, for t 6= s and t ∼ N (0, σ )

Exact log-likelihood can be evaluated via the Kalman filter, but the “conditional” likelihood is easier and commonly used statsmodels has tools for simulating ARMA processes with known coefficients ai , bi and also estimation given specified lag orders import scikits.statsmodels.tsa.arima_process as ap ar_coef = [1, .75, -.25]; ma_coef = [1, -.5] nobs = 100 y = ap.arma_generate_sample(ar_coef, ma_coef, nobs) y += 4 # add in constant

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 14 / 29 ARMA Estimation

Several likelihood-based estimators implemented (see docs) model = tsa.ARMA(y) result = model.fit(order=(2, 1), trend=’c’, method=’css-mle’, disp=-1) result.params # array([ 3.97, -0.97, -0.05, -0.13])

Standard model diagnostics, standard errors, information criteria (AIC, BIC, ...), etc available in the returned ARMAResults object

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 15 / 29 Vector Autoregression (VAR) models

Widely used model for modeling multiple (K-variate) time series, especially in macroeconomics:

Yt = A1Yt−1 + ... + ApYt−p + t , t ∼ N (0, Σ)

Matrices Ai are K × K.

Yt must be a stationary process (sometimes achieved by differencing). Related class of models (VECM) for modeling nonstationary (including cointegrated) processes

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 16 / 29 Vector Autoregression (VAR) models

>>> model = VAR(data); model.select_order(8) VAR Order Selection ======aic bic fpe hqic ------0 -27.83 -27.78 8.214e-13 -27.81 1 -28.77 -28.57 3.189e-13 -28.69 2 -29.00 -28.64* 2.556e-13 -28.85 3 -29.10 -28.60 2.304e-13 -28.90* 4 -29.09 -28.43 2.330e-13 -28.82 5 -29.13 -28.33 2.228e-13 -28.81 6 -29.14* -28.18 2.213e-13* -28.75 7 -29.07 -27.96 2.387e-13 -28.62 ======* Minimum

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 17 / 29 Vector Autoregression (VAR) models

>>> result = model.fit(2) >>> result.summary() # print summary for each variable Results for equation m1 ======coefficient std. error t-stat prob ------const 0.004968 0.001850 2.685 0.008 L1.m1 0.363636 0.071307 5.100 0.000 L1.realgdp -0.077460 0.092975 -0.833 0.406 L1.cpi -0.052387 0.128161 -0.409 0.683 L2.m1 0.250589 0.072050 3.478 0.001 L2.realgdp -0.085874 0.092032 -0.933 0.352 L2.cpi 0.169803 0.128376 1.323 0.188 ======

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 18 / 29 Vector Autoregression (VAR) models

>>> result = model.fit(2) >>> result.summary() # print summary for each variable Correlation matrix of residuals m1 realgdp cpi m1 1.000000 -0.055690 -0.297494 realgdp -0.055690 1.000000 0.115597 cpi -0.297494 0.115597 1.000000

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 19 / 29 VAR: Impulse Response analysis

Analyze systematic impact of unit “shock” to a single variable

irf = result.irf(10) irf.plot()

Impulse responses m1 m1 realgdp m1 cpi m1 1.0 → 0.2 → 0.4 → 0.8 0.1 0.3 0.2 0.6 0.0 0.1 0.4 0.1 0.0 0.2 0.2 0.1 0.2 0.0 0.3 0.3 0.2 0.4 0.4 0 2 4 6 8 10 0 2 4 6 8 10 0 2 4 6 8 10 m1 realgdp realgdp realgdp cpi realgdp 0.20 → 1.0 → 0.2 → 0.15 0.8 0.1 0.10 0.6 0.0 0.05 0.4 0.1 0.00 0.2 0.2 0.05 0.10 0.0 0.3 0.15 0.2 0.4 0 2 4 6 8 10 0 2 4 6 8 10 0 2 4 6 8 10 m1 cpi realgdp cpi cpi cpi 0.20 → 0.15 → 1.0 →

0.15 0.10 0.8 0.10 0.05 0.6 0.05 0.00 0.4 0.00 0.05 0.05 0.10 0.2 0.10 0.15 0.0 0 2 4 6 8 10 0 2 4 6 8 10 0 2 4 6 8 10

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 20 / 29 VAR: Forecast Error Variance Decomposition

Analyze contribution of each variable to forecasting error

fevd = result.fevd(20) fevd.plot()

Forecast error variance decomposition (FEVD) m1 m1 1.0 realgdp cpi 0.8

0.6

0.4

0.2

0.0 0 5 10 15 20 realgdp 1.2 1.0 0.8 0.6 0.4 0.2 0.0 0 5 10 15 20 cpi 1.2 1.0 0.8 0.6 0.4 0.2 0.0 0 5 10 15 20

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 21 / 29 VAR: Statistical tests

In [137]: result.test_causality(’m1’, [’cpi’, ’realgdp’]) Granger causality f-test ======Test statistic Critical Value p-value df ------1.248787 2.387325 0.289 (4, 579) ======H_0: [’cpi’, ’realgdp’] do not Granger-cause m1 Conclusion: fail to reject H_0 at 5.00% significance level

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 22 / 29 Filtering

Hodrick-Prescott (HP) filter separates a time series yt into a trend τt and a cyclical component ζt , so that yt = τt + ζt .

14 Inflation 12 Cyclical component Trend component 10

8

6

4

2

0

2

4

1962 1966 1970 1974 1978 1982 1986 1990 1994 1998 2002 2006

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 23 / 29 Filtering

In addition to the HP filter, 2 other filters popular in finance and economics, Baxter-King and Christiano-Fitzgerald, are available We refer you to our paper and the documentation for details on these:

Inflation and Unemployment: BK Filtered Inflation and Unemployment: CF Filtered INFL INFL 4 4 UNEMP UNEMP

2 2

0 0

2 2

4 4

1966 1971 1976 1981 1986 1991 1996 2001 2006 1963 1968 1973 1978 1983 1988 1993 1998 2003 2008

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 24 / 29 Preview: Bayesian dynamic linear models (DLM)

A state space model by another name:

0 yt = Ft θt + νt , νt ∼ N (0, Vt ) θt = Gθt−1 + ωt , ωt ∼ N (0, Wt )

Estimation of basic model by Kalman filter recursions. Provides elegant way to do time-varying linear regressions for forecasting Extensions: multivariate DLMs, stochastic volatility (SV) models, MCMC-based posterior sampling, mixtures of DLMs

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 25 / 29 Preview: DLM Example (Constant+Trend model)

model = Polynomial(2) dlm = DLM(close_px[’AAPL’], model.F, G=model.G, # model m0=m0, C0=C0, n0=n0, s0=s0, # priors state_discount=.95) # discount factor Constant + Trend DLM

200

150

100

50

Nov 2008 Jan 2009 Mar 2009 May 2009 Jul 2009 Sep 2009 Nov 2009

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 26 / 29 Preview: Stochastic volatility models

JPY-USD Exchange Rate Volatility Process 1.6

1.4

1.2

1.0

0.8

0.6

0.4

0.2 0 200 400 600 800 1000

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 27 / 29 Future: sandbox and beyond

ARCH / GARCH models for volatility Structural VAR and error correction models (ECM) for cointegrated processes Models with non-normally distributed errors Better data description, visualization, and interactive research tools More sophisticated Bayesian time series models

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 28 / 29 Conclusions

We’ve implemented many foundational models for time series analysis, but the field is very broad User interface can and should be much improved Repo: http://github.com/statsmodels/statsmodels Docs: http://statsmodels.sourceforge.net Contact: [email protected]

McKinney, Perktold, Seabold (statsmodels) Python Time Series Analysis SciPy Conference 2011 29 / 29