
Deconvolutional Regression: A Technique for Modeling Temporally Diffuse Effects

Cory Shain (Department of Linguistics, The Ohio State University) [email protected]
William Schuler (Department of Linguistics, The Ohio State University) [email protected]

Abstract

Researchers in computational psycholinguistics frequently use linear models to study time series data generated by human subjects. However, time series may violate the assumptions of these models through temporal diffusion, where stimulus presentation has a lingering influence on the response as the rest of the experiment unfolds. This paper proposes a new statistical model that borrows from digital signal processing by recasting the predictors and response as convolutionally-related signals, using recent advances in machine learning to fit latent impulse response functions (IRFs) of arbitrary shape. A synthetic experiment shows successful recovery of true latent IRFs, and psycholinguistic experiments reveal plausible, replicable, and fine-grained estimates of latent temporal dynamics, with comparable or improved prediction quality to widely-used alternatives.

1 Introduction

Time series are abundant in many naturally-occurring phenomena of interest to science, and they frequently violate the assumptions of linear modeling and its generalizations. One confound that may be widespread in psycholinguistic data is temporal diffusion: the dependent variable may evolve slowly in response to its inputs, with the result that a particular predictor observed at a particular time may continue to exert an influence on the response as the rest of the process unfolds. If not properly controlled for, such a confound could have a detrimental impact on parameter estimation, model interpretation, and hypothesis testing.

The problem of temporal diffusion remains largely unsolved in the general case.1 A standard approach for handling the possibility of temporally diffuse relationships between the predictors and the response is to use spillover or lag regressors, where the observed predictor value is used to predict subsequent observations of the response (Erlich and Rayner, 1983). But this strategy has several undesirable properties. First, the choice of spillover position(s) for a given predictor is difficult to motivate empirically. Second, in experiments with variably long trials the use of relative event indices obscures potentially important details about the actual amount of time that passed between events. And third, including multiple spillover positions per predictor quickly leads to parametric explosion on realistically complex models over realistically sized data sets, especially if random effects structures are included.

As a solution to the problem of temporal diffusion, this paper proposes deconvolutional time series regression (DTSR), a technique that directly models diffusion by learning parametric impulse response functions (IRFs) of the predictors that mediate their relationship to the response variable over time. Parametric deconvolution is difficult in the general case because the likelihood surface depends on the choice of IRF kernel, requiring the user to re-derive estimators for each unique model structure. Furthermore, arbitrary IRF kernels are not guaranteed to afford analytical estimator functions or unique real-valued solutions. However, recent advances in machine learning have led to libraries like Tensorflow (Abadi et al., 2015) — which uses auto-differentiation to support optimization of arbitrary computation graphs — and Edward (Tran et al., 2016) — which enables black box variational inference (BBVI) on Tensorflow graphs.
While these libraries are typically used to build and train deep networks, DTSR uses them to overcome the aforementioned difficulties with general-purpose temporal deconvolution by eliminating the need for hand-derivation of estimators and sampling distributions for each model.

1 Although recent work in computational psycholinguistics has begun to address separate but related problems in time series modeling (auto-correlation and non-stationarity) using generalized additive models (GAM) with a particular structure (Baayen et al., 2017, 2018).


Figure 1: Effects in a linear time series model

Figure 2: Linear time series model with spillover

Figure 3: Effects of predictors in DTSR

The IRFs learned by DTSR are interpretable as estimates of the temporal shape of predictors' influence on the response variable. By convolving predictors with their IRFs, DTSR is able to consider arbitrarily long histories of independent variable observations in generating a given prediction, and (in contrast to spillover) model complexity is constant on the length of the history window. DTSR is thus a parsimonious technique for directly measuring temporal diffusion.

Figures 1–3 illustrate the present proposal and how it differs from linear time series models. As shown in Figure 1, a standard linear model assumes conditional independence of the response from all preceding observations of the predictor. This independence assumption can be weakened by including additional spillover predictors (Figure 2), at a cost of requiring additional parameters. In both cases, only the relative order of events is considered, not their actual distance in time. By contrast, DTSR recasts the predictor and response vectors as streams of impulses and responses (respectively) localized in time. It then fits latent IRFs that govern the influence of each predictor value on the response as a function of time (Figure 3).

This paper presents evidence that DTSR can (1) recover known underlying IRFs from synthetic data, (2) discover previously unknown temporal structure in human data (psycholinguistic reading time experiments), (3) provide support for the absence of temporal diffusion in settings where it might exist in principle, and (4) provide comparable (or in some cases improved) prediction quality to standard linear mixed-effects (LME) and generalized additive (GAM) models.

2 Related work

2.1 Non-deconvolutional time series modeling

The two most widely used tools for analyzing psycholinguistic time series are linear mixed effects regression (LME) (Bates et al., 2015) and generalized additive models (GAM) (Hastie and Tibshirani, 1986; Wood, 2006). LME learns a linear combination of the predictors that generates a given response variable. GAM generalizes linear models by allowing the response variable to be computed as the sum of smooth functions of one or more predictors.

In both approaches, responses are modeled as conditionally independent of preceding observations of predictors unless spillover terms are added, with the attendant drawbacks discussed in Section 1. To make this point more forcefully, take for example Shain et al. (2016), who find significant effects of constituent wrap-up (p = 2.33e-14) and dependency locality (p = 4.87e-10) in the Natural Stories self-paced reading corpus (Futrell et al., 2018).

They argue that this constitutes the first strong evidence of memory effects in broad-coverage sentence processing. However, it turns out that when one baseline predictor — probabilistic context free grammar (PCFG) surprisal — is spilled over one position, the reported effects disappear: p = 0.816 for constituent wrap-up and p = 0.370 for dependency locality. Thus, a reasonable but ultimately inaccurate assumption about baseline effect timecourses can have a dramatic impact on the conclusions supported by the statistical model. DTSR offers a way forward by bringing temporal diffusion under direct statistical control.

2.2 Deconvolutional time series modeling

Deconvolutional modeling has long been used in a variety of scientific fields, including economics (Ramey, 2016), epidemiology (Goldstein et al., 2011), and neuroimaging (Friston et al., 1998). Non-parametric deconvolutional models quantize the time series and fit estimates for each time point within some window, similarly to the spillover approach discussed above. These estimates can be unconstrained, as in finite impulse response (FIR) models (H. Glover, 1999; Ward, 2006), or smoothed with some form of regularization (Goutte et al., 2000; Pedregosa et al., 2014). Additional post-hoc interpolation is necessary in order to obtain a closed-form continuous IRF. These non-parametric approaches are prone to parametric explosion as well as sparsity problems when trials are variably spaced in time.

Parametric deconvolutional approaches (i.e. specific instantiations of DTSR) have evolved in certain fields (e.g. fMRI modeling) to solve particular problems, generally with some independently-motivated IRF kernel like the hemodynamic response function (HRF) (Friston et al., 1998; Lindquist and Wager, 2007; Lindquist et al., 2009). However, to our knowledge DTSR constitutes the first mathematical formulation and software implementation of general-purpose mixed effects parametric deconvolutional regression for arbitrary impulse response kernels. DTSR also supports Bayesian inference, enabling quantification of uncertainty in the absence of analytic formulae for standard errors. With these properties, DTSR expands the range of possible applications of parametric deconvolution beyond those fields for which appropriate formulations have already been developed.

3 Model definition

This section presents the mathematical definition of DTSR. For readability, only a fixed effects model is presented below, since mixed modeling substantially complicates the equations. The full model definition is provided in Appendix A. Note that the full definition is used to construct all reading time models reported in subsequent sections, since they contain random effects.

Let X ∈ R^{M×K} be a design matrix of M observations for K predictor variables and y ∈ R^N be a vector of N responses, both of which contain contiguous temporally-sorted time series. DTSR models the relationship between X and y using parameters consisting of:

• a scalar intercept µ ∈ R
• a vector u ∈ R^K of K coefficients
• a matrix A ∈ R^{R×K} of R IRF kernel parameters for K fixed impulse vectors
• a scalar variance σ² ∈ R of the response

To define the convolution step, let g_k for k ∈ {1, 2, ..., K} be a set of parametric IRF kernels, one for each predictor; let a ∈ R^M and b ∈ R^N be vectors of timestamps associated with each observation in X and y, respectively; and let c ∈ N^M and d ∈ N^N be vectors of series IDs associated with each observation in X and y, respectively. A filter F ∈ R^{N×M} admits only those observations in X that precede y_{[n]} in the same time series:

F_{[n,m]} \overset{\mathrm{def}}{=} \begin{cases} 1 & \text{if } c_{[m]} = d_{[n]} \wedge a_{[m]} \le b_{[n]} \\ 0 & \text{otherwise} \end{cases} \qquad (1)

The inputs X can be convolved with each IRF g_k by premultiplication with sparse matrix G_k ∈ R^{N×M} for k ∈ {1, 2, ..., K} as defined below:

G_k = g_k\left(b 1^\top - 1 a^\top;\ A_{[*,k]}\right) \odot F \qquad (2)

The convolution that yields the design matrix of convolved predictors X′ ∈ R^{N×K} is then defined using products of the G matrices and the design matrix X:2

X'_{[*,k]} \overset{\mathrm{def}}{=} G_k X_{[*,k]} \qquad (3)

2 This implementation of convolution is only exact when the predictors fully describe a discrete impulse signal. Exact convolution of samples from continuous signals is generally not possible because the signal is generally not analytically integrable. For continuous signals, DTSR can approximate the convolution as long as the predictor is interpolated between sample points at a fixed frequency prior to fitting.
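To make the convolution step concrete, the following NumPy sketch builds the filter of Equation 1 and the convolution of Equations 2–3 for toy data. It is an illustration only: the exponential-decay kernel g below is a stand-in for whatever parametric IRF kernel g_k is chosen, and none of the names correspond to the released DTSR package's API.

```python
import numpy as np

# Toy illustration of Equations (1)-(3). The exponential-decay kernel below is
# a stand-in for an arbitrary parametric IRF kernel g_k; it is not the kernel
# used in the paper's experiments.

M, N, K = 6, 4, 2                                  # impulses, responses, predictors
rng = np.random.default_rng(0)

X = rng.normal(size=(M, K))                        # design matrix of impulses
a = np.array([0.0, 0.2, 0.4, 0.9, 1.3, 1.8])       # impulse timestamps
b = np.array([0.5, 1.0, 1.5, 2.0])                 # response timestamps
c = np.zeros(M, dtype=int)                         # series ID of each impulse
d = np.zeros(N, dtype=int)                         # series ID of each response

# Equation (1): admit only impulses that precede the response in the same series.
F = (c[None, :] == d[:, None]) & (a[None, :] <= b[:, None])   # N x M

def g(t, theta):
    """Stand-in IRF kernel: exponential decay with rate theta."""
    return np.where(t >= 0, theta * np.exp(-theta * t), 0.0)

A = np.array([2.0, 5.0])                           # one kernel parameter per predictor

# Equation (2): evaluate the kernel at every response-impulse time offset, masked by F.
delays = b[:, None] - a[None, :]                   # N x M matrix b 1^T - 1 a^T
G = [g(delays, A[k]) * F for k in range(K)]

# Equation (3): convolved predictors X'[:, k] = G_k X[:, k].
X_conv = np.stack([G[k] @ X[:, k] for k in range(K)], axis=1)  # N x K
print(X_conv)
```

After this step the convolved predictors X′ enter an ordinary linear model, which is what makes the approach compatible with standard regression machinery.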

Following convolution, DTSR is simply a linear model. The full model mean is the sum of (1) the intercept µ and (2) the product of the convolved predictor matrix X′ and the coefficient vector u:

y \sim N\left(\mu + X' u,\ \sigma^2\right) \qquad (4)

4 Implementation

The present implementation defines the aforementioned equations3 as a Bayesian computation graph in Tensorflow and Edward and trains it with black box variational inference (BBVI) using the Nadam optimizer (Dozat, 2016)4 with a constant learning rate of 0.01 and minibatches of size 1024. For computational efficiency, histories are truncated at 128 timesteps. Prediction from the network uses an exponential moving average of parameter iterates with a decay rate of 0.998. Convergence was visually diagnosed.

The present experiments use a ShiftedGamma IRF kernel:

f(x; \alpha, \beta, \delta) = \frac{\beta^{\alpha} (x - \delta)^{\alpha - 1} e^{-\beta (x - \delta)}}{\Gamma(\alpha)} \qquad (5)

This is simply the PDF of the Gamma distribution augmented with a shift parameter δ allowing the lower bound of the support of the distribution to deviate from 0. We constrain δ to be strictly negative, thereby allowing the model to find a non-zero instantaneous response. We also constrain α to be strictly greater than 1, which deconfounds the shape and shift parameters. All bounded variables are constrained using the softplus bijection:

\mathrm{softplus}(x) = \log(e^x + 1) \qquad (6)

The ShiftedGamma kernel is used here because it can fit a wide range of response shapes and has precedent in the fMRI literature, where HRF kernels are often assumed to be Gamma-shaped (Lindquist et al., 2009).5

All parameters are given normal priors with unit variance. Prior means for the fixed IRF kernel parameters are domain-specific and discussed in the experiments sections below. To center the prior at an intercept-only model,6 prior means for the intercept µ and variance σ² are set (respectively) to the empirical mean and variance of the response, and prior means for both fixed coefficients and random effects7 are set to 0. Although the Bayesian implementation of DTSR is used for this study because it provides quantification of uncertainty, placing priors on the IRF kernel parameters is not crucial to the success of the system. In all experiments reported below, the MLE implementation arrives at similar solutions and achieves slightly better error.

In the interests of enabling the use of DTSR by the scientific community, the implementation of DTSR used here is offered as a documented open-source Python package with support for (1) Bayesian, variational Bayesian, and MLE inferences and (2) a variety of model structures and impulse response kernels. The Tensorflow backend also enables GPU acceleration where available. Source code and links to documentation are available at https://github.com/coryshain/dtsr.

5 Experiment 1: Synthetic data

An initial experiment fits DTSR estimates to synthetic data to determine whether the model can recover known ground truth IRFs. Synthetic data were synthesized using the following procedure. First, 20 input vectors of size 10,000 were drawn from a standard normal distribution. These values synthesize an impulse stream containing 20 covariates, each with 10,000 observations. A ShiftedGamma IRF was then drawn for each of the 20 covariates. Coefficients were drawn from a uniform distribution U(−50, 50), and IRF parameters were drawn from the following distributions: α ∼ U(1, 6), β ∼ U(0, 5), δ ∼ U(−1, 0). The prior means for the corresponding IRF kernel parameters are placed at the centers of these ranges.

The stream of responses was generated by convolving the covariates with their corresponding IRFs. Gaussian noise with standard deviation 20 was injected into the response following generation. The 10,000 trials were spaced 100ms apart.

3 As noted above, for expository purposes the definition in Section 3 only supports fixed-effects models. The full definition for mixed-effects DTSR models is provided in Appendix A. Mixed models are used throughout the experiments reported below.
4 The Adam optimizer (Kingma and Ba, 2014) with Nesterov momentum (Nesterov, 1983).
5 Other IRF kernels, including spline functions and composition of [...], are supported by the current implementation of DTSR but are not explored in these experiments. More details are provided in the software documentation.
6 A model in which the response is insensitive to the model structure.
7 See Appendix A for the definition of the mixed-effects DTSR model, which includes random effects.
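The following sketch mirrors the ShiftedGamma kernel of Equation 5 and the generative procedure of Experiment 1, scaled down (3 covariates and 1,000 trials rather than 20 and 10,000) so it runs quickly. It is an illustration of the setup described above under those scaled-down assumptions, not the script used to build the paper's synthetic data.

```python
import numpy as np
from math import gamma as gamma_fn

# Scaled-down sketch of the Experiment 1 generative procedure
# (3 covariates x 1,000 trials instead of 20 x 10,000).

def shifted_gamma_irf(t, alpha, beta, delta):
    """Equation (5): Gamma PDF augmented with a shift parameter delta."""
    z = np.maximum(t - delta, 1e-12)               # clamp to avoid NaNs before masking
    pdf = beta ** alpha * z ** (alpha - 1) * np.exp(-beta * z) / gamma_fn(alpha)
    return np.where(t - delta > 0, pdf, 0.0)

rng = np.random.default_rng(0)
n_trials, n_cov = 1000, 3
times = np.arange(n_trials) * 0.1                  # trials spaced 100ms apart

X = rng.standard_normal((n_trials, n_cov))         # impulse stream of covariates
coefs = rng.uniform(-50, 50, n_cov)                # coefficients ~ U(-50, 50)
alpha = rng.uniform(1, 6, n_cov)                   # IRF shape ~ U(1, 6)
beta = rng.uniform(0, 5, n_cov)                    # IRF rate ~ U(0, 5)
delta = rng.uniform(-1, 0, n_cov)                  # IRF shift ~ U(-1, 0)

# Convolve each covariate with its IRF; only preceding impulses contribute.
lags = times[:, None] - times[None, :]             # n_trials x n_trials time offsets
y = np.zeros(n_trials)
for k in range(n_cov):
    G = shifted_gamma_irf(lags, alpha[k], beta[k], delta[k]) * (lags >= 0)
    y += coefs[k] * (G @ X[:, k])

y += rng.normal(scale=20, size=n_trials)           # Gaussian noise, sd = 20
```

Because δ is drawn negative, each IRF is already nonzero at lag 0, which corresponds to the non-zero instantaneous response that the strictly negative δ constraint is meant to allow.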



Figure 4: Synthetic data. True IRFs (left) and estimated IRFs with 95% credible intervals (right).

As shown in Figure 4, the DTSR estimates for the synthetic data are very similar to the ground truth, confirming that when the data-generating model matches the assumptions of DTSR, DTSR can recover its latent structure with high fidelity.

6 Experiment 2: Human reading times

6.1 Background and experimental design

The main interest of DTSR is the potential to better understand real-world dynamical systems like the human sentence processing response. Therefore, Experiment 2 applies DTSR to three existing datasets of naturalistic reading: Natural Stories (Futrell et al., 2018), Dundee (Kennedy et al., 2003), and UCL (Frank et al., 2013).

Natural Stories is a self-paced reading (SPR) corpus consisting of narratives designed to provide context-rich, fluent-sounding stimuli that nonetheless contain many grammatical constructions that rarely occur naturally in texts. The public release of the corpus contains data collected from 181 subjects. The stimulus set contains 10 stories with a total of 485 sentences and 10,245 tokens, for a total of 848,768 fixation events.

Dundee is an eye-tracking (ET) corpus containing newspaper editorials read by 10 subjects, with incremental eye fixation data recorded during reading. The stimulus set contains 20 editorials with a total of 2,368 sentences and 51,502 tokens, for a total of 260,065 fixation events.

UCL is a reading corpus containing individual sentences that were extracted from novels written by amateur authors. The sentences were shuffled and presented in isolation to 42 subjects. The eye-tracking portion of the UCL corpus used in these experiments contains 205 sentences with a total of 1,931 tokens, for a total of 53,070 fixation events.

In all experiments, the response variable is log fixation duration (go-past duration for ET). Models use the following set of predictor variables in common use in psycholinguistics: Sentence Position (index of word in sentence), Trial (index of trial in series),8 Saccade Length (in words, ET only), Word Length (in characters), Unigram Logprob, and 5-gram Surprisal. Unigram Logprob and 5-gram Surprisal are computed by the KenLM toolkit (Heafield et al., 2013) trained on Gigaword 4 (Parker et al., 2009). In addition, DTSR enables fitting of a Rate predictor, which is simply a vector of ones, one for each observation, that is convolved using a latent IRF. Rate thus measures the response to density of stimulus presentation in the recent past. Since without deconvolution Rate is identical to the intercept, it is excluded from non-deconvolutional baseline models. Following standard practice in psycholinguistics, by-subject random coefficients for each of these predictors are included in all models (baseline and DTSR).9

ShiftedGamma IRFs are fitted to all predictors except Sentence Position, which is assigned a Dirac delta IRF (i.e. a linear coefficient) since it increases linearly within the sentence and is not expected to have a diffuse response. In plots, the Sentence Position estimate is shown as a stick function at time 0s. Prior means used for the IRF kernel parameters are α = 2, β = 5, and δ = −0.5. Together, these priors define an expected exponential-like IRF which decays to near-zero in about 1s, which seems plausible for human reading times. In practice they do not appear to be very constraining, since posterior means of fitted models often deviate quite far from these values.

8 Except UCL, which contains isolated sentences, in which case Trial is identical to Sentence Position.
9 By-subject IRF parameters were not used for this study because they substantially complicate the model and initial experiments using them showed little benefit on training data.



Figure 5: Human data. Estimated IRFs with 95% credible intervals for Natural Stories (left), Dundee (center) and UCL (right). Intervals are too tight to be seen.

Existing work provides some expectations about the relationships of these variables to reading time. Processing difficulty is expected to increase with Saccade Length, Word Length, and 5-gram Surprisal, and positive linear relationships have been shown experimentally (Demberg and Keller, 2008). Unigram Logprob is expected to be negatively correlated with reading times, since more frequent words are expected to be easier to process. Sentence Position, Trial, and Rate index different kinds of change in the response over time, and their relationship has not been carefully studied, in part for lack of deconvolutional regression tools. Although reading times tend to decrease over the course of the experiment (Baayen et al., 2018), suggesting an expected negative effect of Trial, this may be partially explained by temporal diffusion. For the present study, all predictors are rescaled by their standard deviations.10

In all reading experiments, data are partitioned into training (50%), development (25%) and test (25%) sets. Outlier filtering is also performed. For Natural Stories, following Shain et al. (2016), items are excluded if they have fixations shorter than 100ms or longer than 3000ms, if they start or end a sentence, or if subjects missed 4 or more subsequent comprehension questions. For Dundee, following van Schijndel and Schuler (2015), unfixated items are excluded as well as (1) items following saccades longer than 4 words and (2) starts and ends of sentences, screens, documents, and lines. For UCL, unfixated items are excluded as well as (1) items following saccades longer than 4 words and (2) sentence starts and ends. Partitioning and filtering are applied only to the response series. The entire predictor history remains visible to the model.

From a modeling perspective, the primary results of interest in Experiment 2 are the IRFs themselves and the insights they provide into human sentence processing. However, to check the reliability of the DTSR estimates, prediction quality on unseen data is compared to that of non-deconvolutional baseline models fitted with LME and GAM.11 Both baselines are fitted with and without three preceding spillover positions for each predictor (baselines with spillover are designated throughout this paper with the suffix -S).12

6.2 Results

The fitted IRFs for Natural Stories, Dundee, and UCL are shown in Figure 5. Effect sizes by corpus — computed here as the integral of each IRF over the first 10s — are shown in Table 1, along with 95% credible intervals (CI).

10 Except Rate, which has no variance and therefore cannot be scaled by its standard deviation of 0.
11 Formulae used to construct each model reported in this study are available in the associated code repository.
12 This number of spillover positions is among the largest attested in the psycholinguistic literature because model complexity in LME and GAM increases substantially with each spillover position added, especially when by-subject random slopes are included for each spillover position for each variable. Indeed, many of the baseline models run for these experiments are already at the limits of tractability, as shown by the non-convergence reported in certain cells of Table 2. An advantage of the DTSR approach is that it can consider arbitrarily long histories at no cost to model complexity. While this permits DTSR to consider longer histories than its competitors (in these experiments, 128 timepoints vs. 4), DTSR is more constrained in its use of history since it must apply the same set of IRFs to all datapoints, while the baselines essentially fit separate models for each spillover position.
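As one concrete rendering of the response-side filter described above for Natural Stories, the pandas sketch below drops the same classes of items. The column names (fdur_ms, sent_start, sent_end, questions_missed) are hypothetical placeholders rather than the corpus's actual field names, and the exact boundary handling is an assumption.

```python
import pandas as pd

def filter_natural_stories(responses: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical response-side outlier filter for Natural Stories.

    Column names are illustrative placeholders, not the corpus's real fields.
    """
    keep = (
        responses["fdur_ms"].between(100, 3000)      # drop fixations < 100ms or > 3000ms
        & ~responses["sent_start"]                    # drop sentence-initial items
        & ~responses["sent_end"]                      # drop sentence-final items
        & (responses["questions_missed"] < 4)         # drop items from inattentive subjects
    )
    # Filtering applies only to the response series; the predictor (impulse)
    # history is left intact so the convolution can still see all past events.
    return responses[keep]
```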

Predictor   Natural Stories               Dundee                         UCL
            Mean      2.5%      97.5%     Mean      2.5%      97.5%      Mean      2.5%      97.5%
Trial       -0.0053   -0.0057   -0.0049   -0.0085   -0.0010   -0.0071    —         —         —
Sent pos     0.0154    0.0148    0.0160    0.0004   -0.0013    0.0022    0.0340    0.0301    0.0379
Rate        -0.1853   -0.1858   -0.1848   -0.0649   -0.0659   -0.0640   -0.0806   -0.0832   -0.0781
Sac len      —         —         —         0.0249    0.0216    0.0207    0.0217    0.0209    0.0225
Word len     0.0020    0.0019    0.0021    0.0107    0.0105    0.0109   -8e-07    -1.7e-5    1.4e-5
Unigram      2.6e-6   -5e-6      2.2e-5   -2.0e-6   -3.9e-5    2.8e-5    1e-06    -4e-6      1.2e-5
5-gram       0.0057    0.0056    0.0059    0.0139    0.0134    0.0145    0.0159    0.0148    0.0171

Table 1: Effect sizes by corpus with 95% credible intervals based on 1024 posterior samples
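The effect sizes in Table 1 are described as the integral of each (coefficient-scaled) IRF over the first 10s. The sketch below shows one way such a summary could be computed by simple quadrature for a fitted ShiftedGamma IRF; the parameter values are placeholders, not estimates taken from the paper.

```python
import numpy as np
from math import gamma as gamma_fn

def shifted_gamma_irf(t, alpha, beta, delta):
    """ShiftedGamma kernel of Equation (5)."""
    z = np.maximum(t - delta, 1e-12)
    pdf = beta ** alpha * z ** (alpha - 1) * np.exp(-beta * z) / gamma_fn(alpha)
    return np.where(t - delta > 0, pdf, 0.0)

def effect_size(coef, alpha, beta, delta, horizon=10.0, step=0.001):
    """Coefficient times the IRF integral over [0, horizon], by trapezoidal rule."""
    t = np.arange(0.0, horizon + step, step)
    vals = shifted_gamma_irf(t, alpha, beta, delta)
    integral = np.sum((vals[:-1] + vals[1:]) * step / 2.0)
    return coef * integral

# Placeholder parameters, e.g. for a hypothetical 5-gram Surprisal IRF:
print(effect_size(coef=0.06, alpha=2.0, beta=5.0, delta=-0.5))
```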

The IRFs (curves) in these plots represent the expected change in the response over time from observing a unit impulse of the predictor. For example, the Dundee model estimates that observing a standard deviation of 5-gram surprisal engenders a slowdown of about 0.05 log ms instantaneously and a slowdown of about 0.03 log ms 250 ms after stimulus presentation. Because the response is reading time, positive IRFs represent inhibition and negative IRFs represent facilitation. Detailed interpretation of these curves is provided below in Section 6.3.

Table 2 shows prediction error from DTSR vs. baselines fitted to the same feature set. As shown, DTSR provides comparable or improved prediction performance to the baselines, even against the -S models, which are more heavily parameterized. DTSR outperforms LME models on unseen data across all corpora and generally improves upon or closely matches the performance of GAM (with no spillover). Compared to GAM-S (with three additional spillover positions), there is a clear advantage of DTSR for Natural Stories but not for the eye-tracking (ET) datasets. This is likely due to more pronounced temporal confounds in Natural Stories (especially of Rate, which the baseline models cannot estimate) compared to the other corpora.13 However, even in the absence of sufficiently diffuse effects to afford prediction improvements, the ability to measure diffusion directly is a major advantage of the DTSR model, since it can be used to detect the absence of diffusion in settings where it might in principle exist. Further discussion of the DTSR IRF estimates themselves is provided in Section 6.3.

As shown in Table 3, pooling across corpora, permutation testing reveals a significant improvement in MSE on test data of DTSR over each baseline system (p = 0.0001 for all comparisons).14

13 Note that GAM-S is more heavily parameterized than DTSR in that it fits multidimensional spline functions of each spillover position of each predictor. This makes it difficult to generalize information about effect timecourses from GAM fits, motivating the use of DTSR for studies in which timecourses are a quantity of interest.
14 To ensure comparability across corpora with different error variances, per-datum errors were first scaled by their standard deviations within each corpus. Standard deviations were computed over the joint set of error values in each pair of DTSR and baseline models.
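The sketch below illustrates the kind of pooled test summarized in Table 3 and footnote 14: per-datum errors from DTSR and a baseline are scaled by the standard deviation of their joint error set, and the paired differences are then resampled. The sign-flipping scheme is our assumption of a standard paired permutation test (the paper does not spell out the exact resampling procedure), and the error vectors here are simulated placeholders.

```python
import numpy as np

def paired_permutation_test(err_dtsr, err_base, n_perm=10000, seed=0):
    """Two-sided paired permutation test on per-datum error differences."""
    rng = np.random.default_rng(seed)
    scale = np.concatenate([err_dtsr, err_base]).std()    # joint scaling (footnote 14)
    diffs = (err_base - err_dtsr) / scale                  # positive = DTSR improvement
    observed = diffs.mean()
    null = np.empty(n_perm)
    for i in range(n_perm):
        signs = rng.choice([-1.0, 1.0], size=diffs.size)   # randomly swap model labels
        null[i] = (signs * diffs).mean()
    p = (np.sum(np.abs(null) >= abs(observed)) + 1) / (n_perm + 1)
    return observed, p

rng = np.random.default_rng(1)
base = rng.normal(1.0, 0.3, size=2000) ** 2                # simulated per-datum errors
dtsr = base * 0.95 + rng.normal(0.0, 0.02, size=2000) ** 2
print(paired_permutation_test(dtsr, base))
```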

System   Natural Stories               Dundee                        UCL
         Train     Dev       Test       Train     Dev       Test      Train     Dev       Test
LME      0.0803    0.0818    0.0815     0.2135    0.2133    0.2128    0.2613    0.2776    0.2561
LME-S    0.0789†   0.0807†   0.0804†    0.2099†   0.2103†   0.2095†   0.2509†   0.2754†   0.2557†
GAM      0.0798    0.0814    0.081      0.212     0.2116    0.2111    0.2576    0.2741    0.2538
GAM-S    0.0784    0.0802    0.0799     0.2083    0.2085    0.2078    0.2440    0.2661    0.2457
DTSR     0.0648    0.0655    0.0650     0.2100    0.2094    0.2088    0.2590    0.2752    0.2543

Table 2: Mean squared prediction error by system (daggers indicate convergence warnings)

Baseline   DTSR improvement (z-units)   p-value
LME        0.059                        0.0001***
LME-S      0.054                        0.0001***
GAM        0.057                        0.0001***
GAM-S      0.051                        0.0001***

Table 3: Overall pairwise significance of prediction improvement from DTSR vs. baselines

6.3 Discussion

Some key generalizations emerge from the DTSR estimates shown in Figure 5. The first is the pronounced facilitative role of Rate in all three models, but especially in Natural Stories. This means that fast reading in the recent past engenders fast reading in the present, because (1) observing a stimulus exerts a large-magnitude, diffuse, and negative (facilitative) influence on the subsequent response, and (2) the Rate contributions of the stimuli are additive. This result demonstrates an important pre-linguistic influence of inertia — a tendency toward slow overall change in base response rate. This effect is especially large-magnitude and diffuse in Natural Stories, which is self-paced reading and therefore differs in modality from the other datasets (which are eye-tracking). This suggests that SPR participants strongly habituate to repeated button pressing and stresses the importance of deconvolutional regression for bringing this low-level confound under control in analyzing SPR data, since it appears to have a large influence on the response and might otherwise confound model interpretation.

Second, effects are generally consistent with expectations: positive effects for Saccade Length, Word Length, and 5-gram Surprisal, and a negative effect of Trial. The null influence of Unigram Logprob is likely due to the presence in the model of both 5-gram Surprisal (which interpolates unigram probabilities) and Word Length (which is inversely correlated with Unigram Logprob). The biggest departure from prior expectations is the null estimate for Word Length in UCL. It appears that the contribution of Word Length in this corpus can be effectively explained by other variables.

Third, the response estimates for Dundee and UCL (both of which are eye-tracking) are very similar, which suggests that DTSR is discovering replicable population-level features of the temporal profile for eye-tracking data.

Fourth, there is a general asymmetry in degree of diffusion between low-level perceptual-motor variables like Saccade Length and Word Length, whose responses tend to decay quickly, and the high-level 5-gram Surprisal variable, whose response tends to decay more slowly. This is consistent with expectations from the sentence processing literature. Perceptual-motor variables involve rapid bottom-up computation (e.g. visual processing or motor planning/execution) and are therefore not expected to have a diffuse response, while surprisal involves top-down computation of future words given context, which might be more computationally expensive and therefore engender a slower response. While this outcome is suggested e.g. by the aforementioned finding that spillover 1 winds up being a stronger position for a surprisal predictor in the Shain et al. (2016) models, DTSR permits direct investigation of these dynamics.

7 A note on hypothesis testing

As a Bayesian model, DTSR supports hypothesis testing by querying the variational posterior. For example, as shown in Table 1, the credible interval (CI) for 5-gram Surprisal in Natural Stories does not include zero (rejecting the null hypothesis of no effect), while the CI for Unigram Logprob does (failing to reject). To control for effects of multicollinearity, one could perform ablative tests of fitted null and alternative models using (1) likelihood comparison or (2) predictive performance on unseen data.

However, DTSR estimates are obtained through non-convex stochastic optimization, which complicates hypothesis testing because of possible estimation noise due to (1) convergence to a local but not global optimum, (2) imperfect convergence to the local optimum, and/or (3) Monte Carlo estimation of the test statistic via posterior sampling. It cannot therefore be guaranteed that hypothesis testing results are due to differences in model structure rather than differences in relative amounts of estimation noise introduced by the fitting procedure. Thus, p-values (and, consequently, hypothesis tests) based on direct comparison of DTSR models should be considered approximate.

However, even in situations where such uncertainty in hypothesis testing is not acceptable, DTSR is appropriate for certain important use cases. First, DTSR can be used for exploratory data analysis in order to empirically motivate the spillover structure of the linear model. Spillover variables can be excluded or included based on the degree of temporal diffusion revealed by DTSR, permitting construction of linear models that are both parsimonious and effective for controlling temporal diffusion. Second, DTSR can be used to fit a data transform which is then applied to the data prior to statistical analysis. This approach is identical in spirit to e.g. the use of the canonical HRF to convolve predictors in fMRI models prior to linear regression. However, since DTSR is domain-general, it can be a valuable component in any analysis toolchain for time series.
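As an illustration of the posterior query described above, the snippet below computes a posterior mean and 95% credible interval for an effect size from Monte Carlo samples (Table 1 uses 1024 of them) and checks whether the interval excludes zero. The samples here are simulated stand-ins for output from the variational posterior, not actual model estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated stand-in for 1024 posterior samples of an effect size
# (e.g. the 10s IRF integral for 5-gram Surprisal in Natural Stories).
effect_samples = rng.normal(loc=0.0057, scale=0.0001, size=1024)

mean = effect_samples.mean()
lo, hi = np.percentile(effect_samples, [2.5, 97.5])
excludes_zero = lo > 0 or hi < 0        # crude check that the 95% CI excludes zero
print(f"mean={mean:.4f}  95% CI=[{lo:.4f}, {hi:.4f}]  excludes zero: {excludes_zero}")
```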

8 Conclusion

This paper presented a variational Bayesian deconvolutional time series regression method as a solution to the problem of temporal diffusion in psycholinguistic time series data and applied it to both synthetic and human responses in order to better understand and control for latent temporal dynamics. Results showed that DTSR can yield a plausible, replicable, parsimonious, insightful, and predictive model of a complex dynamical system like the human sentence processing response and therefore support the use of DTSR for psycholinguistic time series modeling. While the present study explored the use of DTSR to understand human reading times, DTSR can in principle also be used to deconvolve other kinds of response variables, such as the HRF in fMRI modeling or the power/coherence response in oscillatory measures like electroencephalography, suggesting a rich array of potential applications of DTSR in computational psycholinguistics.

Acknowledgements

The authors would like to thank the anonymous reviewers for their helpful comments. This work was supported by National Science Foundation grants #1551313 and #1816891. All views expressed are those of the authors and do not necessarily reflect the views of the National Science Foundation.

References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. Tensorflow: Large-scale machine learning on heterogeneous distributed systems.

Harald Baayen, Shravan Vasishth, Reinhold Kliegl, and Douglas Bates. 2017. The cave of shadows: Addressing the human factor with generalized additive mixed models. Journal of Memory and Language, 94(Supplement C):206–234.

R. Harald Baayen, Jacolien van Rij, Cecile de Cat, and Simon Wood. 2018. Autocorrelated errors in experimental data in the language sciences: Some solutions offered by Generalized Additive Mixed Models. In Dirk Speelman, Kris Heylen, and Dirk Geeraerts, editors, Mixed Effects Regression Models in Linguistics. Springer, Berlin.

Douglas Bates, Martin Mächler, Ben Bolker, and Steve Walker. 2015. Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1):1–48.

Vera Demberg and Frank Keller. 2008. Data from eye-tracking corpora as evidence for theories of syntactic processing complexity. Cognition, 109(2):193–210.

Timothy Dozat. 2016. Incorporating Nesterov momentum into Adam. In ICLR Workshop.

Kate Erlich and Keith Rayner. 1983. Pronoun assignment and semantic integration during reading: Eye movements and immediacy of processing. Journal of Verbal Learning & Verbal Behavior, 22:75–87.

Stefan L. Frank, Irene Fernandez Monsalve, Robin L. Thompson, and Gabriella Vigliocco. 2013. Reading time data for evaluating broad-coverage models of English sentence processing. Behavior Research Methods, 45:1182–1190.

Karl J. Friston, Oliver Josephs, Geraint Rees, and Robert Turner. 1998. Nonlinear event-related responses in fMRI. Magn. Reson. Med, pages 41–52.

Richard Futrell, Edward Gibson, Hal Tily, Anastasia Vishnevetsky, Steve Piantadosi, and Evelina Fedorenko. 2018. The Natural Stories corpus. In LREC 2018.

Edward Goldstein, Benjamin J. Cowling, Allison E. Aiello, Saki Takahashi, Gary King, Ying Lu, and Marc Lipsitch. 2011. Estimating incidence curves of several infections using symptom surveillance data. PLOS ONE, 6(8):1–8.

C. Goutte, F. A. Nielsen, and K. H. Hansen. 2000. Modeling the hemodynamic response in fMRI using smooth FIR filters. IEEE Transactions on Medical Imaging, 19(12):1188–1201.

Gary H. Glover. 1999. Deconvolution of impulse response in event-related BOLD fMRI. 9:416–29.

Trevor Hastie and Robert Tibshirani. 1986. Generalized additive models. Statist. Sci., 1(3):297–310.
Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable modified Kneser-Ney language model estimation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 690–696, Sofia, Bulgaria.

Alan Kennedy, James Pynte, and Robin Hill. 2003. The Dundee corpus. In Proceedings of the 12th European conference on eye movement.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Martin Lindquist and Tor Wager. 2007. Validity and power in hemodynamic response modeling: A comparison study and a new approach. 28:764–84.

Martin A. Lindquist, Ji Meng Loh, Lauren Y. Atlas, and Tor D. Wager. 2009. Modeling the hemodynamic response function in fMRI: Efficiency, bias and mis-modeling. NeuroImage, 45(1, Supplement 1):S187–S198.

Yurii E Nesterov. 1983. A method for solving the convex programming problem with convergence rate O(1/k²). In Dokl. Akad. Nauk SSSR, volume 269, pages 543–547.

Robert Parker, David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2009. English Gigaword LDC2009T13.

Fabian Pedregosa, Michael Eickenberg, Philippe Ciuciu, Alexandre Gramfort, and Bertrand Thirion. 2014. Data-driven HRF estimation for encoding and decoding models. 104.

V.A. Ramey. 2016. Macroeconomic shocks and their propagation. Volume 2 of Handbook of Macroeconomics, pages 71–162. Elsevier.

Cory Shain, Marten van Schijndel, Richard Futrell, Edward Gibson, and William Schuler. 2016. Memory access during incremental sentence processing causes reading time latency. In Proceedings of the Computational Linguistics for Linguistic Complexity Workshop, pages 49–58. Association for Computational Linguistics.

Dustin Tran, Alp Kucukelbir, Adji B. Dieng, Maja Rudolph, Dawen Liang, and David M. Blei. 2016. Edward: A library for probabilistic modeling, inference, and criticism. arXiv preprint arXiv:1610.09787.

Marten van Schijndel and William Schuler. 2015. Hierarchic syntax improves reading time prediction. In Proceedings of NAACL-HLT 2015. Association for Computational Linguistics.

B. Douglas Ward. 2006. Deconvolution analysis of fMRI time series data.

Simon N. Wood. 2006. Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC, Boca Raton.

A Definition of mixed effects DTSR

For expository purposes, in Section 3 the DTSR model was defined only for fixed effects. However, DTSR is compatible with mixed modeling and the implementation used here supports random effects in the model intercepts, coefficients, and IRF parameters. The full mixed-effects DTSR equations are presented below.

The definitions of X, y, µ, σ², a, b, c, d, F, M, N, K, and R presented in Section 3 are retained for the mixed model definition. The remaining variables and equations must be redefined to some extent. Mixed-effects DTSR models additionally contain the following parameters:

• a vector o ∈ R^O of O random intercepts
• a vector u ∈ R^U of U fixed coefficients
• a vector v ∈ R^V of V random coefficients
• a matrix A ∈ R^{R×L} of R fixed IRF kernel parameters for L fixed impulse vectors
• a matrix B ∈ R^{R×W} of R random IRF kernel parameters for W random impulse vectors

Random parameters o, v, and B are constrained to be zero-centered within each random grouping factor.

To support mixed modeling, the fixed and random effects must first be combined using additional utility matrices. Let O ∈ {0, 1}^{N×O} be a mask matrix for random intercepts. A vector q ∈ R^N of intercepts is:

q \overset{\mathrm{def}}{=} \mu + O o \qquad (7)

Let U ∈ {0, 1}^{L×U} be an indicator matrix for fixed coefficients, V ∈ {0, 1}^{L×V} be an indicator matrix for random coefficients, and V′ ∈ {0, 1}^{N×V} be a mask matrix for random coefficients. A matrix Q ∈ R^{N×L} of coefficients is:

Q \overset{\mathrm{def}}{=} 1 (U u)^\top + V' \,\mathrm{diag}(v)\, V^\top \qquad (8)

Let W ∈ {0, 1}^{L×W} be an indicator matrix for random IRF parameters and W′_1, ..., W′_N ∈ {0, 1}^{R×W} be mask matrices for random IRF parameters. Then matrices P_n ∈ R^{R×L} for n ∈ {1, 2, ..., N} are:

P_n \overset{\mathrm{def}}{=} A + (W'_n \odot B)\, W^\top \qquad (9)

In each equation above, the random effects parameters are masked using the random effects filter associated with each data point. Q and P_n are then transformed into the impulse vector space using the indicator matrices V and W, respectively. This procedure sums the random effects associated with each data point and adds them to the population-level parameters.

To define the convolution step, let g_l for l ∈ {1, 2, ..., L} be parametric IRF kernels, one for each impulse. Convolution of X with each IRF kernel is performed by premultiplying the inputs X with sparse matrix G_l ∈ R^{N×M} for l ∈ {1, 2, ..., L}:

(G_l)_{[n,*]} \overset{\mathrm{def}}{=} g_l\left(b_{[n]} - a^\top;\ (P_n)_{[*,l]}\right) \odot F_{[n,*]} \qquad (10)

Finally, let L ∈ {0, 1}^{K×L} be an indicator matrix mapping the K predictors of X to the corresponding L impulse vectors of the model.15 The convolution that yields the design matrix of convolved predictors X′ ∈ R^{N×L} is then defined using a product of the convolution matrices G, the design matrix X, and the impulse indicator L:

X'_{[*,l]} \overset{\mathrm{def}}{=} G_l X L_{[*,l]} \qquad (11)

The full model mean is the sum of (1) the intercepts and (2) the sum-product of the convolved predictors X′ with the coefficient parameters Q:

y \sim N\left(q + (X' \odot Q)\, 1,\ \sigma^2\right) \qquad (12)

15 Predictors and impulse vectors are distinguished because in principle multiple IRFs can be applied to the same predictor. In the usual case where this distinction is not needed, L is identity and K = L.
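To make the masking in Equations 7, 8, and 12 concrete, the toy NumPy sketch below assembles per-response intercepts and coefficients for two subjects and computes the model mean; Equation 9 applies the same masking pattern to the IRF kernel parameters. All sizes, masks, and values are illustrative only and do not come from any fitted model.

```python
import numpy as np

# Toy sketch of Equations (7), (8) and (12) with N = 4 responses, L = 2 impulse
# vectors, and 2 subjects. Values are illustrative only.

N, L = 4, 2
rng = np.random.default_rng(0)

mu = 6.0                               # population intercept
o = np.array([0.05, -0.05])            # by-subject random intercepts (zero-centered)
u = np.array([0.30, -0.10])            # fixed coefficients, one per impulse vector
v = np.array([0.02, -0.02])            # by-subject random slopes on impulse vector 0

subj = np.array([0, 0, 1, 1])          # subject who produced each response
O = np.eye(2)[subj]                    # Eq. (7) mask: N x O
Vp = np.eye(2)[subj]                   # Eq. (8) mask V': N x V
U = np.eye(L)                          # indicator: one fixed coefficient per impulse
V = np.array([[1, 1],                  # indicator: both random slopes attach
              [0, 0]])                 # to impulse vector 0 only

q = mu + O @ o                                                    # Equation (7)
Q = np.ones((N, 1)) @ (U @ u)[None, :] + Vp @ np.diag(v) @ V.T    # Equation (8)

X_conv = rng.normal(size=(N, L))       # stand-in for the convolved predictors X'
y_mean = q + (X_conv * Q).sum(axis=1)  # Equation (12): q + (X' ⊙ Q) 1
print(y_mean)
```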
