Deconvolutional Time Series Regression: a Technique for Modeling Temporally Diffuse Effects
Cory Shain and William Schuler
Department of Linguistics, The Ohio State University
[email protected], [email protected]

Abstract

Researchers in computational psycholinguistics frequently use linear models to study time series data generated by human subjects. However, time series may violate the assumptions of these models through temporal diffusion, where stimulus presentation has a lingering influence on the response as the rest of the experiment unfolds. This paper proposes a new statistical model that borrows from digital signal processing by recasting the predictors and response as convolutionally-related signals, using recent advances in machine learning to fit latent impulse response functions (IRFs) of arbitrary shape. A synthetic experiment shows successful recovery of true latent IRFs, and psycholinguistic experiments reveal plausible, replicable, and fine-grained estimates of latent temporal dynamics, with comparable or improved prediction quality to widely-used alternatives.

1 Introduction

Time series are abundant in many naturally-occurring phenomena of interest to science, and they frequently violate the assumptions of linear modeling and its generalizations. One confound that may be widespread in psycholinguistic data is temporal diffusion: the dependent variable may evolve slowly in response to its inputs, with the result that a particular predictor observed at a particular time may continue to exert an influence on the response as the rest of the process unfolds. If not properly controlled for, such a confound could have a detrimental impact on parameter estimation, model interpretation, and hypothesis testing.

The problem of temporal diffusion remains largely unsolved in the general case.¹ A standard approach for handling the possibility of temporally diffuse relationships between the predictors and the response is to use spillover or lag regressors, where the observed predictor value is used to predict subsequent observations of the response (Erlich and Rayner, 1983). But this strategy has several undesirable properties. First, the choice of spillover position(s) for a given predictor is difficult to motivate empirically. Second, in experiments with variably long trials, the use of relative event indices obscures potentially important details about the actual amount of time that passed between events. And third, including multiple spillover positions per predictor quickly leads to parametric explosion on realistically complex models over realistically sized data sets, especially if random effects structures are included.

As a solution to the problem of temporal diffusion, this paper proposes deconvolutional time series regression (DTSR), a technique that directly models diffusion by learning parametric impulse response functions (IRFs) of the predictors that mediate their relationship to the response variable over time. Parametric deconvolution is difficult in the general case because the likelihood surface depends on the choice of IRF kernel, requiring the user to re-derive estimators for each unique model structure. Furthermore, arbitrary IRF kernels are not guaranteed to afford analytical estimator functions or unique real-valued solutions. However, recent advances in machine learning have led to libraries like Tensorflow (Abadi et al., 2015), which uses auto-differentiation to support optimization of arbitrary computation graphs, and Edward (Tran et al., 2016), which enables black box variational inference (BBVI) on Tensorflow graphs.
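To make the spillover strategy described above concrete, the following minimal sketch (not from the paper; all variable names and values are invented) builds lagged copies of a hypothetical predictor by hand:

```python
# A minimal illustration of the spillover strategy discussed above:
# predictor values from earlier trials are copied forward as additional
# "lagged" regressors for the current response. All values are invented.

times     = [0.0, 0.3, 0.9, 1.1, 1.8]      # variably spaced trial times (s)
surprisal = [2.1, 5.4, 1.0, 3.3, 4.2]      # hypothetical predictor values

def spillover(xs, lag, pad=None):
    """Shift a predictor back by `lag` trials, padding the start."""
    return [pad] * lag + xs[:len(xs) - lag]

surprisal_s1 = spillover(surprisal, 1)     # spillover position 1
surprisal_s2 = spillover(surprisal, 2)     # spillover position 2

# Only relative trial order is encoded: surprisal_s1 pairs each response
# with the previous trial's predictor whether the trials are 0.3s or
# 0.6s apart, discarding the information in `times`, and each extra
# spillover position adds a full new regressor to the model.
```

Note how each additional spillover position adds a column (and, with random effects, many parameters), which is the parametric explosion the text describes.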
While these libraries are typically used to build and train deep networks, DTSR uses them to overcome the aforementioned difficulties with general-purpose temporal deconvolution by eliminating the need for hand-derivation of estimators and sampling distributions for each model.

¹Although recent work in computational psycholinguistics has begun to address separate but related problems in time series modeling (auto-correlation and non-stationarity) using generalized additive models (GAM) with a particular structure (Baayen et al., 2017, 2018).

Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2679–2689, Brussels, Belgium, October 31 – November 4, 2018. © 2018 Association for Computational Linguistics.

[Figure 1: Effects in a linear time series model]
[Figure 2: Linear time series model with spillover]
[Figure 3: Effects of predictors in DTSR]

The IRFs learned by DTSR are interpretable as estimates of the temporal shape of predictors' influence on the response variable. By convolving predictors with their IRFs, DTSR is able to consider arbitrarily long histories of independent variable observations in generating a given prediction, and (in contrast to spillover) model complexity is constant on the length of the history window. DTSR is thus a parsimonious technique for directly measuring temporal diffusion.

Figures 1–3 illustrate the present proposal and how it differs from linear time series models. As shown in Figure 1, a standard linear model assumes conditional independence of the response from all preceding observations of the predictor. This independence assumption can be weakened by including additional spillover predictors (Figure 2), at a cost of requiring additional parameters. In both cases, only the relative order of events is considered, not their actual distance in time. By contrast, DTSR recasts the predictor and response vectors as streams of impulses and responses (respectively) localized in time. It then fits latent IRFs that govern the influence of each predictor value on the response as a function of time (Figure 3).

This paper presents evidence that DTSR can (1) recover known underlying IRFs from synthetic data, (2) discover previously unknown temporal structure in human data (psycholinguistic reading time experiments), (3) provide support for the absence of temporal diffusion in settings where it might exist in principle, and (4) provide comparable (or in some cases improved) prediction quality to standard linear mixed-effects (LME) and generalized additive (GAM) models.

2 Related work

2.1 Non-deconvolutional time series modeling

The two most widely used tools for analyzing psycholinguistic time series are linear mixed effects regression (LME) (Bates et al., 2015) and generalized additive models (GAM) (Hastie and Tibshirani, 1986; Wood, 2006). LME learns a linear combination of the predictors that generates a given response variable. GAM generalizes linear models by allowing the response variable to be computed as the sum of smooth functions of one or more predictors.

In both approaches, responses are modeled as conditionally independent of preceding observations of predictors unless spillover terms are added, with the attendant drawbacks discussed in Section 1. To make this point more forcefully, take for example Shain et al. (2016), who find significant effects of constituent wrap-up (p = 2.33e-14) and dependency locality (p = 4.87e-10) in the Natural Stories self-paced reading corpus (Futrell et al., 2018). They argue that this constitutes the first strong evidence of memory effects in broad-coverage sentence processing. However, it turns out that when one baseline predictor, probabilistic context free grammar (PCFG) surprisal, is spilled over one position, the reported effects disappear: p = 0.816 for constituent wrap-up and p = 0.370 for dependency locality. Thus, a reasonable but ultimately inaccurate assumption about baseline effect timecourses can have a dramatic impact on the conclusions supported by the statistical model. DTSR offers a way forward by bringing temporal diffusion under direct statistical control.

2.2 Deconvolutional time series modeling

Deconvolutional modeling has long been used in a variety of scientific fields, including economics (Ramey, 2016), epidemiology (Goldstein et al., 2011), and neuroimaging (Friston et al., 1998). Non-parametric deconvolutional models quantize the time series and fit estimates for each time point within some window, similarly to the spillover approach discussed above. These estimates can be unconstrained, as in finite impulse response models (FIR) (H. Glover, 1999; Ward, 2006), or smoothed with some form of regularization (Goutte et al., 2000; Pedregosa et al., 2014). Additional post-hoc interpolation is necessary in order to obtain a closed-form continuous IRF. These non-parametric approaches are prone to parametric explosion as well as sparsity problems when trials are variably spaced in time.

Parametric deconvolutional approaches (i.e.
have already been developed.

3 Model definition

This section presents the mathematical definition of DTSR. For readability, only a fixed effects model is presented below, since mixed modeling substantially complicates the equations. The full model definition is provided in Appendix A. Note that the full definition is used to construct all reading time models reported in subsequent sections, since they contain random effects.

Let $X \in \mathbb{R}^{M \times K}$ be a design matrix of $M$ observations for $K$ predictor variables and $y \in \mathbb{R}^{N}$ be a vector of $N$ responses, both of which contain contiguous temporally-sorted time series. DTSR models the relationship between $X$ and $y$ using parameters consisting of:

• a scalar intercept $\mu \in \mathbb{R}$
• a vector $u \in \mathbb{R}^{K}$ of $K$ coefficients
• a matrix $A \in \mathbb{R}^{R \times K}$ of $R$ IRF kernel parameters for $K$ fixed impulse vectors
• a scalar variance $\sigma^2 \in \mathbb{R}$ of the response

To define the convolution step, let $g_k$ for $k \in \{1, 2, \dots, K\}$ be a set of parametric IRF kernels, one for each predictor; let $a \in \mathbb{R}^{M}$ and $b \in \mathbb{R}^{N}$ be vectors of timestamps associated with each observation in $X$ and $y$, respectively; and let $c \in \mathbb{N}^{M}$ and $d \in \mathbb{N}^{N}$ be vectors of series IDs associated with each observation in $X$ and $y$, respectively. A filter $F \in \mathbb{R}^{N \times M}$ admits only those observations in $X$ that precede $y[n]$ in the same time series:

$$F[n][m] \overset{\mathrm{def}}{=} \begin{cases} 1 & c[m] = d[n] \wedge a[m] \leq b[n] \\ 0 & \text{otherwise} \end{cases}$$
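As a rough numpy sketch of the filter and convolution step just described (a simplified illustration, not the paper's Tensorflow/Edward implementation; the exponential kernel stands in for the IRF families used in DTSR, and all values are invented):

```python
import numpy as np

# Toy dimensions: M = 3 predictor observations, N = 2 responses, K = 1.
X = np.array([[1.0], [2.0], [0.5]])        # M x K design matrix
a = np.array([0.0, 0.5, 1.2])              # timestamps of the rows of X
b = np.array([0.6, 1.5])                   # timestamps of the responses y
c = np.array([1, 1, 1])                    # series IDs for X
d = np.array([1, 1])                       # series IDs for y

# Filter F[n, m] = 1 iff observation m is in the same series as
# response n and does not follow it in time, per the definition above.
F = ((c[None, :] == d[:, None]) & (a[None, :] <= b[:, None])).astype(float)

# A parametric IRF kernel g(t); an exponential decay with rate beta
# stands in for the kernel families fit by DTSR.
def g(t, beta=2.0):
    return beta * np.exp(-beta * t)

# Convolution step: each response is a weighted sum of preceding
# impulses, weighted by the IRF evaluated at the elapsed time.
offsets = b[:, None] - a[None, :]          # N x M matrix of time offsets
H = F * g(np.clip(offsets, 0.0, None))     # IRF weights, masked by F
mu, u = 100.0, 10.0                        # intercept and coefficient
y_hat = mu + u * (H @ X)[:, 0]             # predicted responses
```

Because `H` is recomputed from the actual timestamps, arbitrarily long and variably spaced histories contribute to each prediction without adding parameters, in contrast to the spillover construction.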