
Deconvolutional Regression: A Technique for Modeling Temporally Diffuse Effects

Cory Shain

Advisor: William Schuler

Department of Linguistics
The Ohio State University

Abstract

Psycholinguists frequently use linear models to study time series data generated by human subjects. However, time series may violate the assumptions of these models through temporal diffusion, where stimulus presentation has a lingering influence on the response as the rest of the experiment unfolds. This paper proposes a new statistical model that borrows from digital signal processing by recasting the predictors and response as convolutionally-related signals, using recent advances in machine learning to fit latent impulse response functions (IRFs) of arbitrary shape. A synthetic experiment shows successful recovery of true latent IRFs, and psycholinguistic experiments reveal plausible, replicable, and fine-grained estimates of latent temporal dynamics, with comparable or improved prediction quality to widely-used alternatives.

1 Introduction

Much of the data available to psycholinguistics is generated by processes that unfold in time. Examples include behavioral measures such as eye-movement records and self-paced reading latencies as well as neural measures like electroencephalography (EEG), magnetoencephalography (MEG), functional magnetic resonance imaging (fMRI), and electrocorticography (ECoG). If left uncontrolled, temporal confounds in psycholinguistic data can be problematic for interpretation and testing of the statistical models used to analyze them (Baayen et al., 2017, 2018).

This paper addresses one possible temporal confound which I will refer to as temporal diffusion. Temporal diffusion exists when the response of a dependent variable to some inputs evolves slowly as a function of time, with the result that a particular input observed at a particular time continues to exert an influence on the response as the rest of the process unfolds. Temporal diffusion has been carefully studied in some psychological subfields. For example, a sizeable literature on fMRI has investigated the structure of the hemodynamic response function (HRF), which is known to govern the relatively slow response of blood oxygenation to neural activity (Boynton et al., 1996; Friston et al., 1998; H. Glover, 1999; Ward, 2006; Lindquist and Wager, 2007; Lindquist et al., 2009). The HRF is an instantiation of the more general notion of impulse response function (IRF) from the field of signal processing (Madisetti, 1997), where the response $g \ast h$ of a dynamical system as a function of time is described as a convolution over time of an impulse $g$ with an IRF $h$:

$$(g \ast h)(t) = \int_0^t g(\tau)\, h(t - \tau)\, d\tau$$

The process of deconvolution seeks to infer the structure of $h$ (the IRF) given that the impulses $g$ (stimuli) and responses $g \ast h$ (the psycholinguistic response) are known.
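To make the convolution concrete, the following minimal NumPy sketch (with made-up event times and values, and an illustrative Gamma-shaped IRF) computes the diffuse response at a query time as the sum of each preceding impulse weighted by the IRF evaluated at its elapsed time:

```python
import numpy as np
from scipy.stats import gamma

# Hypothetical impulse stream: event onsets (s) and magnitudes (e.g., per-word surprisal)
event_times = np.array([0.0, 0.3, 0.9, 1.5])
event_vals  = np.array([1.0, 2.0, 0.5, 1.5])

def irf(t, alpha=2.0, beta=5.0):
    # An illustrative Gamma-shaped IRF h(t); the density is 0 for t < 0,
    # so events cannot influence the response before they occur
    return gamma.pdf(t, a=alpha, scale=1.0 / beta)

def response(t):
    # Discrete analogue of (g * h)(t): each impulse contributes its magnitude
    # times the IRF evaluated at the time elapsed since that impulse
    return float(np.sum(event_vals * irf(t - event_times)))

print(response(1.0))  # diffuse influence at t = 1.0 s of all preceding events
```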

Although particular attention has been paid to the importance of impulse responses in fMRI, there are other kinds of psycholinguistic measures in which temporal diffusion might reasonably play a role. This paper focuses on one such example: measures of reading time, specifically fixation durations in eye-tracking and response times in self-paced reading. It has been known for decades that the response to properties of words in human subjects' reading behavior may not be fully instantaneous, but may "spill over" into the reading of subsequent words (Erlich and Rayner, 1983). A standard approach for handling the possibility of temporally diffuse relationships between word properties and the reading response is to use spillover or lag regressors, where a word's properties are used to predict subsequent observations of the response.¹ But this strategy has several undesirable properties. First, the choice of spillover position(s) for a given predictor is difficult to motivate empirically. Second, since word fixations are variably long, the use of relative event indices obscures potentially important details about the actual amount of time that passed between events. Third, including multiple spillover positions per predictor quickly leads to parametric explosion in realistically complex models over realistically sized data sets, especially if random effects structures are included. And fourth, if the predictors are autocorrelated, the spillover variants of each predictor will exhibit colinearities.

¹See Section 2 for discussion.

[Figure 1: Effects in a linear time series model (predictor and response by trial index).]
[Figure 2: Linear time series model with spillover (predictor and response by trial index).]

Deconvolutional modeling provides a way forward by supporting discovery of temporal diffusion from the reading response data. However, existing deconvolutional frameworks such as finite impulse response (FIR) models (Dayal and MacGregor, 1996) and vector autoregressive (VAR) models (Sims, 1980) are not applicable to variably-spaced reading data because they discretize the time series, leading either to (1) severe sparsity if variable event durations are retained (few events are exactly 207ms apart) or (2) distortion if they are removed (all events are treated as if equally spaced).

As a solution to the problem of temporal diffusion, this paper proposes deconvolutional time series regression (DTSR), a continuous-time deconvolutional method that directly models diffusion by learning parametric IRFs of the predictors that mediate their relationship to the response variable over time. The implementation of DTSR proposed here takes advantage of the recent advent of machine learning libraries such as Tensorflow (Abadi et al., 2015), which supports optimization of arbitrary computation graphs and auto-differentiation, and Edward (Tran et al., 2016), which enables black box variational inference (BBVI) on Tensorflow graphs. While these libraries are typically used to build and train deep networks, DTSR uses them to overcome an important difficulty that otherwise holds for parametric deconvolution: the likelihood surface depends on the choice of IRF kernel, requiring re-derivation of estimators for each unique model structure. Auto-differentiation and Bayesian inference eliminate the need for hand-derivation of estimators and sampling distributions for each model.

The IRFs learned by DTSR are interpretable as estimates of the temporal shape of predictors' influence on the response variable. By convolving the predictors with their IRFs, DTSR is able to consider arbitrarily long histories of independent variable observations in generating a given prediction, and (in contrast to spillover) the complexity of the model is constant in the length of the history window. DTSR is thus a parsimonious technique for directly measuring temporal diffusion.

DTSR models are continuous-time and can therefore be optimized on naturalistic time series with variably-spaced events, including reading time data.

Figures 1-3 illustrate the present proposal and how it differs from linear time series models. As shown in Figure 1, a standard linear model assumes conditional independence of the response from all preceding observations of the predictor. This independence assumption can be weakened by including additional spillover predictors (Figure 2), at the cost of requiring additional parameters. In both cases, only the relative order of events is considered, not their actual distance in time. By contrast, DTSR recasts the predictor and response vectors as streams of impulses and responses (respectively) localized in time. It then fits latent IRFs that govern the influence of each predictor on the response as a function of time (Figure 3).

[Figure 3: Effects of predictors in DTSR. Panels: predictor, latent IRFs, convolved predictor, and response over time.]

This paper presents evidence that DTSR can (1) recover known underlying IRFs from synthetic data, (2) discover previously unknown temporal structure in human data (psycholinguistic reading time experiments), (3) provide support for the absence of temporal diffusion in settings where it might exist in principle, and (4) provide comparable (or in some cases improved) prediction quality relative to standard linear mixed-effects (LME) and generalized additive (GAM) models.

2 Related work

2.1 Non-deconvolutional time series modeling

The two most widely used tools for analyzing psycholinguistic time series are linear mixed effects regression (LME) (Bates et al., 2015) and generalized additive models (GAM) (Hastie and Tibshirani, 1986; Wood, 2006). LME learns a linear combination of the predictors that generates a given response variable. GAM generalizes linear models by allowing the response variable to be computed as the sum of smooth functions of one or more predictors.

In both approaches, responses are modeled as conditionally independent of preceding observations of predictors unless spillover terms are added, with the attendant drawbacks discussed in Section 1. To make this point more forcefully, take for example Shain et al. (2016), who found significant effects of constituent wrap-up (p = 2.33e-14) and dependency locality (p = 4.87e-10) in the Natural Stories self-paced reading corpus (Futrell et al., 2018). They argued that their result constituted the first strong evidence of memory effects in broad-coverage sentence processing. However, it turns out that when one baseline predictor, probabilistic context-free grammar (PCFG) surprisal, is spilled over one position, the reported effects disappear: p = 0.816 for constituent wrap-up and p = 0.370 for dependency locality. Thus, a reasonable but ultimately inaccurate assumption about baseline effect timecourses (in this case, that the PCFG effect did not spill over) can have a dramatic impact on the conclusions supported by the statistical model. DTSR offers a way forward by building the possibility of temporal diffusion directly into the estimates, thereby avoiding the need to choose spillover positions as hyperparameters.

2.2 Deconvolutional time series modeling

Deconvolutional modeling has long been used in a variety of scientific fields, including economics (Ramey, 2016), epidemiology (Goldstein et al., 2011), and neuroimaging (Friston et al., 1998). One widely-used approach to IRF discovery is finite impulse response (FIR) modeling (H. Glover, 1999; Ward, 2006). FIR models quantize the time series and use linear regression to fit estimates for each time point within some window, similarly to the spillover approach discussed above. These estimates can be unconstrained or smoothed with some form of regularization (Nikolaou and Vuthandam, 1998; Goutte et al., 2000; Pedregosa et al., 2014). Another major approach to deconvolution is vector autoregression (VAR), which discovers pairwise temporal relationships between all variables in the data (predictors and response) over some finite number of lags. VAR fits can be used to extract IRFs between pairs of variables. For both FIR and VAR models, additional post-hoc interpolation is necessary in order to obtain closed-form continuous-time IRFs. Non-parametric deconvolutional approaches like these are prone to parametric explosion and overfitting (Nikolaou and Vuthandam, 1998). Furthermore, as discussed above, their requirement of time discretization gives rise to sparsity or distortion when applied to time series with variable event durations. Finally, many psycholinguistic datasets contain data from many subjects and/or conditions, motivating the use of mixed-effects models. However, although mixed-effects non-parametric deconvolutional models have been proposed (Gorrostieta et al., 2012), modeling random variation in the deconvolutional estimates severely increases model complexity by adding random covariates for each predictor/timepoint pair in the model.

Continuous-time mixed-effects IRF estimation for arbitrary impulse response kernels would overcome these difficulties and greatly extend the range of time series data to which deconvolutional modeling can be applied. To my knowledge, DTSR is the first mathematical formulation and software implementation of such an approach. By including a parametric impulse response as part of model design, DTSR avoids time discretization and the attendant problems with model complexity discussed above. DTSR thus expands the range of possible applications of deconvolutional modeling to include settings with variable event duration and heterogeneous sources of data.

3 Model definition

This section presents the mathematical definition of DTSR. For readability, only a fixed effects model is presented below, since mixed modeling substantially complicates the equations. The full model definition is provided in Appendix A. Note that the full definition is used to construct all reading time models reported in subsequent sections, since they contain random effects.

Let $\mathbf{X} \in \mathbb{R}^{M \times K}$ be a design matrix of $M$ observations for $K$ predictor variables and $\mathbf{y} \in \mathbb{R}^N$ be a vector of $N$ responses, both of which contain contiguous temporally-sorted time series. DTSR models the relationship between $\mathbf{X}$ and $\mathbf{y}$ using parameters consisting of:

• a scalar intercept $\mu \in \mathbb{R}$
• a vector $\mathbf{u} \in \mathbb{R}^K$ of $K$ coefficients²
• a matrix $\mathbf{A} \in \mathbb{R}^{R \times K}$ of $R$ IRF kernel parameters for $K$ fixed impulse vectors
• a scalar variance $\sigma^2 \in \mathbb{R}$ of the response

A fixed-effects DTSR model therefore contains $2 + K + K \cdot R$ parameters: one intercept, $K$ coefficients (one for each impulse), $K \cdot R$ IRF parameters ($R$ parameters for each impulse), and one variance of the response. Mixed-effects DTSR models can also include random variation in the intercept, coefficients, and/or IRF parameters. This yields at most $1 + Z(1 + K + KR)$ estimates for a mixed effects model with $Z$ total random grouping factor levels, although sub-maximal numbers of estimates can arise from restricting randomness to substructures of the model (e.g. to the intercept only). For fuller discussion of mixed-effects DTSR models, see the Appendix.

To define the convolution step, let $g_k$ for $k \in \{1, 2, \ldots, K\}$ be a set of parametric IRF kernels, one for each predictor; let $\mathbf{a} \in \mathbb{R}^M$ and $\mathbf{b} \in \mathbb{R}^N$ be vectors of timestamps associated with each observation in $\mathbf{X}$ and $\mathbf{y}$, respectively; and let $\mathbf{c} \in \mathbb{N}^M$ and $\mathbf{d} \in \mathbb{N}^N$ be vectors of series IDs associated with each observation in $\mathbf{X}$ and $\mathbf{y}$, respectively. A filter $\mathbf{F} \in \mathbb{R}^{N \times M}$ admits only those observations in $\mathbf{X}$ that precede $\mathbf{y}_{[n]}$ in the same time series:

$$\mathbf{F}_{[n,m]} \overset{\text{def}}{=} \begin{cases} 1 & c_{[m]} = d_{[n]} \wedge a_{[m]} \leq b_{[n]} \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

The inputs $\mathbf{X}$ can be convolved with each IRF $g_k$ by premultiplication with sparse matrices $\mathbf{G}_k \in \mathbb{R}^{N \times M}$ for $k \in \{1, 2, \ldots, K\}$ as defined below:

$$\mathbf{G}_k \overset{\text{def}}{=} g_k\!\left(\mathbf{b}\mathbf{1}^\top - \mathbf{1}\mathbf{a}^\top;\ \mathbf{A}_{[*,k]}\right) \odot \mathbf{F} \qquad (2)$$

The convolution that yields the design matrix of convolved predictors $\mathbf{X}' \in \mathbb{R}^{N \times K}$ is then defined using products of the convolution matrices and the design matrix $\mathbf{X}$:³

$$\mathbf{X}'_{[*,k]} \overset{\text{def}}{=} \mathbf{G}_k\, \mathbf{X}_{[*,k]} \qquad (3)$$

The full model mean is the sum of (1) the intercept $\mu$ and (2) the product of the convolved predictor matrix $\mathbf{X}'$ and the coefficient vector $\mathbf{u}$:

$$\mathbf{y} \sim \mathcal{N}\!\left(\mu + \mathbf{X}'\mathbf{u},\ \sigma^2\right) \qquad (4)$$

²Throughout this paper I use the term coefficients to refer to what are often called slopes in linear models. This is to avoid falsely implying that the coefficients represent straight-line functions of the predictors, when in fact they are applied non-linearly to the predictors via the impulse response. Alternatively, the coefficients can be construed as slopes on the convolved predictors X′, as shown in eq. (4).

³This implementation of convolution is only exact when the predictors fully describe a discrete impulse signal. Exact convolution of samples from continuous signals is generally not possible because the signal is generally not analytically integrable. For continuous signals, DTSR can approximate the convolution as long as the predictor is interpolated between sample points at a fixed frequency prior to fitting.

4 A note on multicolinearity

Note that the formulation in eq. (4) is simply a linear model on the convolved design matrix X′. Therefore, the primary difference between linear and DTSR models is that DTSR additionally infers the parameters that generate X′ jointly with the model intercept and coefficients.

Since DTSR depends internally on linear combination to generate its outputs, it is vulnerable to confounds from multicolinearity (correlated predictors) in much the same way that linear models are. In linear models, multicolinearity increases uncertainty about how to allocate covariation between predictors and response, since the predictors themselves covary. In the extreme case of perfect multicolinearity (i.e. one or more predictors are an exact linear combination of one or more other predictors), the model has no solution (Neter et al., 1989).

Multicolinearity in DTSR works in much the same way, with the added complexity that DTSR models also have a temporal dimension, which may allow the fitting procedure to discover real characteristics of the global impulse response structure while struggling, proportionally to the degree of multicolinearity, to decompose that structure into predictor-wise IRFs. To understand this, note that the expected response t seconds after stimulus presentation is a weighted sum of the IRFs at t, with weights provided by the predictor values of the stimulus. When multicolinearity is low, the expected overall response can vary widely from one stimulus to another, since the IRFs are reweighted at each stimulus by roughly orthogonal predictor values. This variation in expected overall response provides clues to the system as to the magnitude, direction, and temporal shape of the individual response to each predictor. As multicolinearity increases, the expected overall response increasingly converges to a single shape which is shared across all stimuli (albeit scaled by the stimulus magnitude). In this setting, the model should still be able to correctly recover the global response characteristics, but may decompose it into predictor-wise responses that increasingly deviate from the true data-generating model. In the extreme case that each predictor is identical, the expected response is identical for each stimulus, and the model will construct IRFs whose summation approximates the true global response profile but whose attribution of IRF components to predictors is random.

The above predictions are borne out empirically by results presented in Section 6. While those results indicate that DTSR models are surprisingly robust to multicolinearity, models fitted to highly colinear data should be interpreted with caution, and perfectly colinear data should be avoided altogether. As in linear models, multicolinearity can be avoided by orthogonalizing predictors in advance (e.g. via principal components analysis). Empirical assessment of orthogonalization procedures in the DTSR setting is left to future work.

5 Implementation

The present implementation defines the equations from Section 3⁵ as a Bayesian computation graph in Tensorflow and Edward and trains it with black box variational inference (BBVI) using the Nadam optimizer (Dozat, 2016)⁴ with a constant learning rate of 0.01 and minibatches of size 1024. For computational efficiency, histories are truncated at 128 timesteps. Prediction from the network uses an exponential moving average of parameter iterates with a decay rate of 0.998. Convergence was visually diagnosed.

The present experiments use a ShiftedGamma IRF kernel:

$$f(x;\ \alpha, \beta, \delta) = \frac{\beta^{\alpha}\,(x - \delta)^{\alpha - 1}\,e^{-\beta(x - \delta)}}{\Gamma(\alpha)} \qquad (5)$$

This is simply the probability density function of the Gamma distribution augmented with a shift parameter δ allowing the lower bound of the support of the distribution to deviate from 0. The following additional constraints are imposed: (1) δ is strictly negative, thereby allowing the model to find a non-zero instantaneous response, and (2) the shape parameter α is strictly greater than 1, deconfounding the shape and shift parameters. All bounded variables are constrained using the softplus bijection:

$$\text{softplus}(x) = \log(e^x + 1)$$

The ShiftedGamma kernel is used here because it can fit a wide range of response shapes and has precedent in the fMRI literature, where HRF kernels are often assumed to be Gamma-shaped (Lindquist et al., 2009).⁶

⁴The Adam optimizer (Kingma and Ba, 2014) with Nesterov momentum (Nesterov, 1983).

⁵As noted above, for expository purposes the definition in Section 3 only supports fixed-effects models. The full definition for mixed-effects DTSR models is provided in Appendix A. Mixed models are used throughout the experiments reported below.

⁶Other IRF kernels, including spline functions and compositions of convolutions, are supported by the current implementation of DTSR but are not explored in these experiments. More details are provided in the software documentation.
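A minimal sketch of this kernel and its constraints, assuming SciPy's Gamma density and softplus-transformed unconstrained parameters (the package's internal parameterization may differ):

```python
import numpy as np
from scipy.stats import gamma

def softplus(x):
    # softplus(x) = log(exp(x) + 1), computed stably
    return np.logaddexp(0.0, x)

def constrain(alpha_raw, beta_raw, delta_raw):
    # alpha strictly greater than 1, beta strictly positive (assumed),
    # delta strictly negative so a non-zero instantaneous response is possible
    return 1.0 + softplus(alpha_raw), softplus(beta_raw), -softplus(delta_raw)

def shifted_gamma(x, alpha, beta, delta):
    # Eq. (5): Gamma density with rate beta, shifted so its support starts at delta
    return gamma.pdf(x - delta, a=alpha, scale=1.0 / beta)

alpha, beta, delta = constrain(0.5, 1.5, -0.7)
t = np.linspace(0.0, 2.0, 5)
print(shifted_gamma(t, alpha, beta, delta))   # IRF values at several latencies
```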
All parameters are given normal priors with unit variance. Prior means for the fixed IRF kernel parameters are domain-specific and discussed in the experiments sections below. To center the prior at an intercept-only model,⁷ prior means for the intercept µ and variance σ² are set (respectively) to the empirical mean and variance of the response, and prior means for both fixed coefficients and random effects⁸ are set to 0. Although the Bayesian implementation of DTSR is used for this study because it provides quantification of uncertainty, placing priors on the IRF kernel parameters is not crucial to the success of the system. In all experiments reported below, the MLE implementation arrives at similar solutions and achieves slightly better error.

In the interests of enabling the use of DTSR by the scientific community, the implementation of DTSR used here is offered as a documented open-source Python package with support for (1) Bayesian, variational Bayesian, and MLE inferences and (2) a variety of model structures and impulse response kernels. The Tensorflow backend also enables GPU acceleration where available. Source code and links to documentation are available at https://github.com/coryshain/dtsr.

⁷A model in which the response is insensitive to the model structure.

⁸See Appendix A for the definition of the mixed-effects DTSR model, which includes random effects.

6 Experiment 1: Synthetic data

An initial experiment fits DTSR estimates to synthetic datasets. This experiment has two purposes: (1) to determine whether the model can recover known ground truth IRFs and (2) to assess the impact of multicolinearity in the predictors. Synthetic data were created by convolving sets of random covariates with known impulse responses in order to generate simulated response variables. Each simulation contained 20 covariates, and predictor and response streams each contained 10,000 observations spaced 100ms apart. A single set of impulse responses was randomly drawn at the outset of the experiment and shared across all synthetic datasets. To produce the impulse responses, ground truth coefficients were drawn from a uniform distribution U(−50, 50), and ground truth IRF parameters were drawn from the following distributions: α ∼ U(1, 6), β ∼ U(0, 5), δ ∼ U(−1, 0).⁹ The prior means for the corresponding IRF kernel parameters were placed at the centers of these ranges.

⁹These ranges generally yield IRFs with peak dynamics within the system's visibility window of 12.8s. Visibility is curtailed by history truncation (128 timesteps, see above), leading to a maximum visibility of 12.8s since each trial is 0.1s long (128 × 0.1 = 12.8).

To manipulate multicolinearity in the predictors, predictor streams were drawn from multivariate normal distributions in which the variance-covariance matrix had a diagonal of 1 and all off-diagonal elements were set to the desired level of correlation. For example, predictors with correlation level ρ = 0.5 were drawn using the following variance-covariance matrix:

$$\begin{bmatrix}
1 & 0.5 & 0.5 & \cdots & 0.5 & 0.5 & 0.5 \\
0.5 & 1 & 0.5 & \cdots & 0.5 & 0.5 & 0.5 \\
0.5 & 0.5 & 1 & \cdots & 0.5 & 0.5 & 0.5 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\
0.5 & 0.5 & 0.5 & \cdots & 1 & 0.5 & 0.5 \\
0.5 & 0.5 & 0.5 & \cdots & 0.5 & 1 & 0.5 \\
0.5 & 0.5 & 0.5 & \cdots & 0.5 & 0.5 & 1
\end{bmatrix}$$
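A minimal sketch of this sampling scheme (the parameter values mirror the description above; the function and variable names are illustrative):

```python
import numpy as np

def draw_predictors(n_obs=10000, n_pred=20, rho=0.5, seed=0):
    # Variance-covariance matrix with unit variances and off-diagonal rho
    Sigma = np.full((n_pred, n_pred), rho)
    np.fill_diagonal(Sigma, 1.0)
    rng = np.random.default_rng(seed)
    X = rng.multivariate_normal(np.zeros(n_pred), Sigma, size=n_obs)
    t = np.arange(n_obs) * 0.1            # observations spaced 100 ms apart
    return X, t

X, t = draw_predictors(rho=0.5)
print(np.corrcoef(X, rowvar=False)[:3, :3])   # empirical correlations near 0.5
```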
Five sets of predictors were generated in this way, one for each of ρ = 0 (uncorrelated predictors), ρ = 0.25, ρ = 0.5, ρ = 0.75, and ρ = 1. Note that ρ = 1 is a degenerate case in which all 20 covariates are identical. As such, it is undecidable and therefore not a valid use case for DTSR. It is included in this experiment to assess the prediction outlined in Section 4 that in such a setting DTSR will faithfully recover the global response profile while decomposing and distributing it to predictors at random.

For each set of predictors, the stream of responses was generated by convolving the covariates with their corresponding IRFs and multiplying them by their coefficients. For each response vector, Gaussian noise with standard deviation 20 was drawn and added to the responses following generation.

As shown in the left column of Figure 4, the DTSR estimates for the synthetic data with ρ ≤ 0.75 are very similar to the ground truth, confirming that when the data-generating model matches the assumptions of DTSR, DTSR can recover its latent structure with high fidelity even in the presence of strong multicolinearity. Nonetheless, there are small degradations in model fidelity with increasing colinearity, supporting the hypothesis that the difficulty of model identification increases

(c) ρ = 0 (d) ρ = 0, summed Response

(e) ρ = 0.25 (f) ρ = 0.25, summed

(g) ρ = 0.5 (h) ρ = 0.5, summed

(i) ρ = 0.75 (j) ρ = 0.75, summed

(k) ρ = 1 (l) ρ = 1, summed

Time (s)

Figure 4: Synthetic data. Predictor-wise (left) and summed IRFs in the true (top) and estimated models at varying levels of impulse multicolinearity ρ. Estimates are shown with 95% credible intervals. with colinearity and motivating the avoidance of lus set contains 10 stories with a total of 485 sen- strongly colinear data when possible. tences and 10,245 word tokens, for a total 848,768 The right column of Figure4 shows a sum of fixation events (where one event is a single subject all the IRFs in the system. This corresponds to viewing a single word token). the expected response to a stimulus containing Dundee is an eye-tracking corpus containing a unit impulse for each predictor. Since this is newspaper editorials read by 10 subjects, with in- a deterministic function of the component IRFs, cremental eye fixation data recorded during read- summed estimated responses should closely ap- ing. The stimulus set contains 20 editorials with a proximate summed true responses, which Figure4 total of 2,368 sentences and 51,502 word tokens, shows to be the case in all models. The reason for a total of 260,065 fixation events (where one the summed response is of interest here is that, event is a single subject fixating a single word to- as argued in Section4, with increasing multico- ken for the first time from the left). linearity the synthetic response at each stimulus UCL is a reading corpus containing individual increasingly resembles (a multiple of) a single re- sentences that were extracted from novels written sponse profile. In these experiments where pre- by amateur authors. The sentences were shuffled dictors have identical means and variances and and presented in isolation to 42 subjects. The eye- are positively correlated, this response profile ap- tracking portion of the UCL corpus used in these proaches the summed response in actual fact, and experiments contains 205 sentences with a total of in such a setting DTSR is expected to find com- 1,931 word tokens, for a total of 53,070 fixation ponent IRFs that sum to the global response, even events. if their allocation to particular predictors is largely In all experiments, the response variable is arbitrary. The ρ = 1 example (bottom row) clearly log fixation duration (go-past duration for eye- shows this to be the case. Although the model has tracking). Models use the following set of pre- discovered component IRFs and assigned them to dictor variables in common use in psycholinguis- predictors in ways that do not resemble the true tics: Sentence position (index of word in sen- model at all, the sum of these IRFs nonetheless tence), Trial (index of trial in series),10 Saccade captures the true summed response with very high Length (in words, eye-tracking only), Word Length fidelity. The summed response profile is theonly (in characters), Unigram Log Probability, and 5- characteristic of the true model that DTSR could gram Surprisal. Unigram Log Probability and have learned in this setting, and DTSR does suc- 5-gram Surprisal are computed by the KenLM cessfully recover it. toolkit (Heafield et al., 2013) trained on Gigaword 4 (Parker et al., 2009). Examples of studies using 7 Experiment 2: Human reading times some or all of these predictors include Demberg 7.1 Background and experimental design and Keller(2008); Frank and Bod(2011); Smith and Levy(2013) and Baayen et al.(2018). 
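For readers who want to reproduce predictors of this kind, the following is a rough sketch using the KenLM Python bindings; the model path, tokenization, and log base are illustrative assumptions rather than the exact setup used here:

```python
import math
import kenlm  # Python bindings for the KenLM toolkit

# Hypothetical path to a 5-gram model trained on Gigaword (binary or ARPA format)
lm = kenlm.Model('gigaword4.5gram.binary')

def fivegram_surprisal(sentence):
    """Per-word 5-gram surprisal (negative log2 probability) for a tokenized sentence."""
    words = sentence.split()
    # full_scores yields one (log10 prob, ngram length, OOV flag) per word, plus </s>
    scores = list(lm.full_scores(sentence, bos=True, eos=True))
    return [(w, -logp / math.log10(2.0)) for w, (logp, _, _) in zip(words, scores)]

print(fivegram_surprisal('the cat sat on the mat'))
```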
The main interest of DTSR is the potential to bet- In addition, DTSR enables fitting of the impulse ter understand real-world dynamical systems like response to a Rate predictor, which is simply a the human sentence processing response. There- vector of ones, one for each observation (Bren- fore, Experiment 2 applies DTSR to three exist- nan et al., 2012). Rate can be viewed as a decon- ing datasets of naturalistic reading: Natural Sto- volutional intercept, providing information about ries (Futrell et al., 2018), Dundee (Kennedy et al., stimulus timing but no information about stimulus 2003), and UCL (Frank et al., 2013). properties. The fitted response to Rate is an esti- Natural Stories is a self-paced reading (SPR) mate of the baseline response of the system, with corpus consisting of context-rich narratives that expected deviation from that baseline governed by resemble fluent storytelling while nonetheless the estimated responses to the other predictors, containing many grammatical constructions that which vary from stimulus to stimulus. Rate esti- rarely occur naturally in texts. The public release mates also provide insight into the effect of tempo- of the corpus contains data collected from 181 ral density of stimulus presentation, since the Rate subjects who paged through the stories on a com- responses accumulate for stimuli that impinge on puter screen, pressing a button to reveal the next word. The amount of time spent on each word 10Except UCL, which contains isolated sentences, in was recorded as the response variable. The stimu- which case Trial is identical to Sentence Position. the system before it has effectively finished re- uments and sentences respectively. They are not sponding to Rate from previous stimuli. Since perceptual or linguistic properties of the experi- without deconvolution Rate is identical to the in- ment, and it is unclear how any diffuse impulse tercept, it is excluded from non-deconvolutional response attributed to them would be interpreted. baseline models. Following prior work (Demberg and Keller, 2008; Following standard practice in psycholinguis- Baayen et al., 2018), their presence in the model tics, by-subject random intercepts along with by- is motivated by the possibility of trends in the re- subject random coefficients for each of these pre- sponse. For this reason, ShiftedGamma IRFs are dictors are included in all models (baseline and fitted to all predictors except Trial and Sentence DTSR).11 All predictors are rescaled by their stan- Position, which are assigned a Dirac delta IRF (i.e. dard deviations prior to fitting.12 A single DTSR a linear coefficient). Consequently, in plots, the model was fitted to each corpus. Trial and Sentence Position estimates are shown Existing work provides some expectations as stick functions at time 0s. about the relationships of these variables to read- Specifying the Bayesian model requires stating ing time. Processing difficulty is expected to in- priors over the IRF parameters. Unfortunately, the crease with Saccade Length, Word Length, and 5- existing literature on human reading does not pro- gram Surprisal, and positive linear relationships vide detailed evidence as to the temporal char- have been shown experimentally (Demberg and acteristics of the response to predictors, due at Keller, 2008). Unigram Log Probability is ex- least in part to the difficulty of estimating these pected to be negatively correlated with reading characteristics without the aid of DTSR. 
Nonethe- times, since more frequent words are expected to less, some relevant signposts exist. For example, be easier to process. Sentence Position, Trial, and studies using many spillover positions have found Rate index different kinds of change in the re- the influence of linguistic predictors like surprisal sponse over time and their relationship has not to decrease monotonically with spillover position been carefully studied, in part for lack of deconvo- (Smith and Levy, 2013). Although spillover does lutional regression tools. Although reading times not have a clear continuous-time interpretation, tend to decrease over the course of the experi- these results support a strictly-decreasing shape ment (Baayen et al., 2018), suggesting an expected for the prior on the IRF kernel. In addition, the negative effect of Trial, this may be partially ex- timecourse of the sentence processing response plained by temporal diffusion. has been very carefully studied in the domain of The predictors Saccade Length, Word Length, electroencephalography, with consistent finding of Unigram Log Probability, and 5-gram Surprisal a number of distinct event-related potentials gen- are all motor, perceptual, or linguistic variables erally occurring within one second after stimulus to which the sentence processing system has been onset (Sur and Sinha, 2009). Together, these con- shown to respond upon word fixation (Demberg siderations support prior bias towards a stricly- and Keller, 2008) and to which the response might decreasing exponential-like IRF with response pri- not be perfectly instantaneous. To the extent that marily localized to the first second after stimulus temporally diffuse responses to any of these pre- onset. In this study, the prior means used to meet dictors exist, it is desirable that the model be this desideratum are α = 2, β = 5, and δ = −0.5. able to capture them. By contrast, Trial and Sen- Note that these are simply choices for the prior; tence Position merely index progress through doc- the model can deviate unboundedly far from them in its parameters as motivated by the data, poten- 11By-subject IRF parameters were not used for this study because they substantially complicate the model and ini- tially finding late-peaking responses, more rapidly tial experiments using them showed little benefit on training decaying responses, and more slowly decaying re- data. By-word random intercepts, though common in psy- sponses. In practice, the choice of prior does cholinguistic studies (Demberg and Keller, 2008), were also avoided because (1) estimates for Word Length and Unigram not appear to be very constraining, since posterior Log Probability are of interest for this study but are context- means of fitted models often deviate quite far from free and can therefore be wholly or partially consumed by the the prior means. random intercepts and (2) early experiments suggested that by-word intercepts led to overfitting (based on development In all reading experiments, data were parti- set performance) in both DTSR and baseline models. 12Except Rate, which has no variance and therefore cannot tioned into training (50%), development (25%) be scaled by its standard deviation of 0. and test (25%) sets. Outlier filtering was also td r vial nteascae oerepository. code associated the in available are study information boundaries. provides line Dundee and document, only screen, e.g. about since data, and long differences source (3) and in eye-tracking, items to relevant unfixated only e.g. 
are (2) saccades since citations), es- modality, (see precedent corpora in in these differences use differences that are (1) studies criteria by of tablished exclusion combination in a (Breen, by corpora driven prosody across likeimplicit Differences 2014). effects of boundary influence suffix the with paper this desig- throughout for are nated spillover positions with (baselines spillover predictor each preceding three non- without of that GAM. LME to and with compared fitted models is baseline data deconvolutional qual- unseen prediction on estimates, re- ity DTSR the the check of human to liability However, into provide processing. they sentence insights the them- and IRFs the selves are 2 Experiment in interest of sults predictor model. entire the The to visible remained series. history response the and to starts only sentence (2) and ends. words saccades 4 following items than (1) longer as well were items as unfixated excluded UCL, doc- For screens, lines. sentences, and of uments, ends words and 4 starts (2) than and longer wellas saccades as following items excluded (1) were items Schuler unfixated and (2015), Schijndel van questions. following comprehension Dundee, For subsequent missed more subjects if or or sentence, 4 a end 3000ms, or than start they longer if or fix- 100ms have than they shorter if ations excluded were Shain items (2016), following al. Stories, et Natural For performed. seen. be to tight too are Intervals (right). UCL and (center) 5 Figure Fixation duration (log ms) 15 14 13 rmamdln esetv,tepiayre- primary the perspective, modeling a From hsnme fsilvrpstosi mn h largest the among is positions spillover of number this This in reported model each construct to used Formulae the to minimize designed are filters outlier these of Most 13 : attoigadfleigwr applied were filtering and Partitioning ua data Human 14 ohbslnsaefte ihand with fitted are baselines Both siae Rswt 5 rdbeitrasfrNtrlSois(et,Dundee (left), Stories Natural for intervals credible 95% with IRFs Estimated . -S ). 15 ie(ms) Time h is eodatrsiuu ne.Teearealso There negative large-magnitude onset. stimulus after second first the Surprisal 5-gram responses estimated for the general, de- In that time. influence with cays instantaneous peak magni- a in with tude, monotonically estimates decreases all response In the facilitation. neg- represent and IRFs inhibition ative represent stimu- reading IRFs after is positive ms response time, the 250 Because ms presentation. log lus slow- 0.03 a about and of instantaneously down ms log 0.05 about of devi- standard a of Dundee ation observing the that example, estimates For model im- predictor. unit in the a of observing change pulse from expected time over the (curves) response represent the IRFs plots with The these along (CI). 1, in inTable intervals credible are shown — 95% over IRF 10s each first of integral the the as corpus by here sizes computed Effect — and . 5 Dundee, Figure in Stories, shown are for Natural UCL IRFs fitted The Results 7.2 tms pl h aesto Rst l aaons while datapoints, all to each spillover for IRFs position. of models that separate in set fit history essentially same baselines of the the use apply its in must 4), constrained it than more vs. histories timepoints much 128 is longer experiments, DTSR much these While (in consider competitors complexity. to its model DTSR to cost permits arbi- no consider this can at it histories that long An is trarily approach 2. 
Table DTSR the of of cells advantage certain by ex- shown in these as reported tractability, for non-convergence of run limits the the models at baseline already vari- the are each periments of for many position Indeed, spillover each able. for included random each are by-subject with slopes when substantially especially added, increases position GAM spillover and LME in com- model plexity because literature psycholinguistic the in attested odLength Word -rmsurprisal 5-gram , r oiieadcnetae in concentrated and positive are nga o Probability Log Unigram Rate nedr slowdown a engenders siae cosall across estimates and , Natural Stories Dundee UCL Predictor Mean 2.5% 97.5% Mean 2.5% 97.5% Mean 2.5% 97.5% Trial -0.0053 -0.0057 -0.0049 -0.0085 -0.0010 -0.0071 ——— Sent pos 0.0154 0.0148 0.0160 0.0004 -0.0013 0.0022 0.0340 0.0301 0.0379 Rate -0.1853 -0.1858 -0.1848 -0.0649 -0.0659 -0.0640 -0.0806 -0.0832 -0.0781 Sac len ——— 0.0249 0.0216 0.0207 0.0217 0.0209 0.0225 Word len 0.0020 0.0019 0.0021 0.0107 0.0105 0.0109 -8e-07 -1.7e-5 1.4e-5 Unigram 2.6e-6 -5e-6 2.2e-5 -2.0e-6 -3.9e-5 2.8e-5 1e-06 -4e-6 1.2e-5 5-gram 0.0057 0.0056 0.0059 0.0139 0.0134 0.0145 0.0159 0.0148 0.0171

Table 1: Effect sizes by corpus with 95% credible intervals based on 1024 posterior samples corpora, with an especially pronounced Rate re- 7.3 Discussion sponse in Natural Stories. More detailed interpre- Some key generalizations emerge from the DTSR tation of these curves is provided below in Sec- estimates shown in Figure5. The first is the pro- tion 7.3. nounced facilitative role of Rate in all three mod- Table2 shows prediction error from DTSR vs. els, but especially in Natural Stories. This means baselines fitted to the same feature set. As shown, that fast reading in the recent past engenders DTSR provides comparable or improved predic- fast reading in the present, because (1) observ- tion performance to the baselines, even against the ing a stimulus exerts a large-magnitude, diffuse, -S models which are more heavily parameterized. and negative (facilitative) influence on the sub- DTSR outperforms LME models on unseen data sequent response, and (2) the Rate contributions across all corpora and generally improves upon or of the stimuli are additive. This result demon- closely matches the performance of GAM (with no strates an important pre-linguistic influence of in- spillover). Compared to GAM-S (with three addi- ertia — a tendency toward slow overall change tional spillover positions), there is a clear advan- in base response rate. This effect is especially tage of DTSR for Natural Stories but not for the large-magnitude and diffuse in Natural Stories, eye-tracking datasets. This is likely due to more which is self-paced reading and therefore differs pronounced temporal confounds in Natural Sto- in modality from the other datasets (which are ries (especially of Rate, which the baseline models eye-tracking). This suggests that SPR participants cannot estimate) compared to the other corpora. strongly habituate to repeated button pressing and Note that GAM-S is more heavily parameterized stresses the importance of deconvolutional regres- than DTSR in that it fits multidimensional spline sion for bringing this low-level confound under functions of each spillover position of each pre- control in analyzing SPR data, since it appears to dictor. This makes it difficult to generalize infor- have a large influence on the response. If left un- mation about effect timecourses from GAM fits, controlled, variation due to Rate (i.e. due to the motivating the use of DTSR for studies in which timing structure of the stream of stimuli) might timecourses are a quantity of interest. be mis-attributed to other predictors, possibly con- Note also that even in the absence of very dif- founding model interpretation. fuse effects that afford prediction improvements, Second, effects are generally consistent with ex- the ability to measure diffusion directly is a major pectations: positive effects for Saccade Length, advantage of the DTSR model, since it can be used Word Length, and 5-gram Surprisal, and a neg- to detect the absence of diffusion in settings where ative effect of Trial. The null influence of Uni- it might in principle exist. gram Log Probability is likely due to the presence As shown in Table3, pooling across corpora, in the model of both 5-gram Surprisal (which in- permutation testing reveals a significant improve- terpolates unigram probabilities) and Word Length ment in MSE on test data of DTSR over each base- (which is inversely correlated with Unigram Log line system (p = 0.0001 for all comparisons).16 Probability). 
The biggest departure from prior ex- 16To ensure comparability across corpora with different er- pectations is the null estimate for Word Length ror variances, per-datum errors were first scaled by their stan- in UCL. It appears that the contribution of Word dard deviations within each corpus. Standard deviations were computed over the joint set of error values in each pair of Length in this corpus can be effectively explained DTSR and baseline models. The reason DTSR outperforms GAM-S in the pooled permutation test despite underperform- improvement on Natural Stories, which is itself much larger ing it on Dundee and UCL is that it gives a large relative than the other two datasets. Natural Stories Dundee UCL System Train Dev Test Train Dev Test Train Dev Test LME 0.0803 0.0818 0.0815 0.2135 0.2133 0.2128 0.2613 0.2776 0.2561 LME-S 0.0789† 0.0807† 0.0804† 0.2099† 0.2103† 0.2095† 0.2509† 0.2754† 0.2557† GAM 0.0798 0.0814 0.081 0.212 0.2116 0.2111 0.2576 0.2741 0.2538 GAM-S 0.0784 0.0802 0.0799 0.2083 0.2085 0.2078 0.2440 0.2661 0.2457 DTSR 0.0648 0.0655 0.0650 0.2100 0.2094 0.2088 0.2590 0.2752 0.2543

Table 2: Mean squared prediction error by system (daggers indicate convergence warnings)
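The pooled comparisons reported in the surrounding text are based on permutation tests over per-datum prediction errors. A minimal sketch of such a paired permutation test (the paper's exact procedure, e.g. its resampling count, is only partially specified here):

```python
import numpy as np

def paired_permutation_test(err_baseline, err_dtsr, n_iter=10000, seed=0):
    """Paired permutation test on per-datum (scaled) squared errors of two systems.

    Under the null hypothesis the two systems' errors for a datum are exchangeable,
    so we randomly swap them and ask how often the permuted mean improvement is at
    least as large as the observed one.
    """
    rng = np.random.default_rng(seed)
    diff = err_baseline - err_dtsr            # positive where DTSR is better
    observed = diff.mean()
    hits = 0
    for _ in range(n_iter):
        signs = rng.choice([-1.0, 1.0], size=diff.shape)
        if (signs * diff).mean() >= observed:
            hits += 1
    return (hits + 1) / (n_iter + 1)          # smoothed one-sided p-value

# err_baseline, err_dtsr: per-datum squared errors, scaled within each corpus
# before pooling, as described in the text
```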

Baseline DTSR improvement (z-units) p-value For example, as shown in Table1, the CI for 5- LME 0.059 0.0001∗∗∗ LME-S 0.054 0.0001∗∗∗ gram Surprisal in Natural Stories does not include GAM 0.057 0.0001∗∗∗ zero (rejecting the null hypothesis of no effect), ∗∗∗ GAM-S 0.051 0.0001 while the CI for Unigram logprob does (failing Table 3: Overall pairwise significance of prediction to reject). To control for effects of multicolinear- improvement from DTSR vs. baselines ity, one could perform ablative tests of fitted null and alternative models using non-parametric tests of predictive performance on in-sample or out-of- by other variables. sample data. Third, the response estimates for Dundee and However, DTSR estimates are obtained through UCL (both of which are eye-tracking) are similar, non-convex stochastic optimization, which com- which suggests that DTSR is discovering replica- plicates hypothesis testing because of possible es- ble population-level features of the temporal pro- timation noise due to (1) convergence to a lo- file for eye-tracking data. cal but not global optimum, (2) imperfect con- Fourth, there is a general asymmetry in degree vergence to the local optimum, and/or (3) Monte of diffusion between low-level perceptual-motor Carlo estimation of the test statistic via posterior Saccade Length Word Length variables like and , sampling. It cannot therefore be guaranteed that whose responses tend to decay quickly, and the hypothesis testing results are due to differences 5-gram Surprisal high-level variable, whose re- in model structure rather than differences in rel- sponse tends to decay more slowly, as shown by ative amounts of estimation noise introduced by the existence in all corpora of a point in time af- the fitting procedure. Thus, hypothesis tests based 5-gram Surprisal ter which the response exceeds on direct comparison of DTSR models rely on Saccade Length Word Length the and responses a (possibly incorrect) assumption that the mod- in magnitude. This is consistent with a view els are effectively optimal. The empirical results of sentence processing in which perceptual-motor presented in Section6 support this assumption by variables have a faster response because they in- showing DTSR fits that are consistently close to volve rapid bottom-up computation (e.g. visual the true data generating model, even in the pres- processing or motor planning/execution), while ence of strongly correlated predictors, suggesting surprisal has a slower response because it involves that non-global optima may not be a severe con- more expensive top-down computations about fu- found in practice. The procedure of statistically ture words given context (Friederici, 2002; Con- comparing models fitted via non-convex optimiza- nor et al., 2004; Bonte et al., 2005). While this tion is identical to the one typically followed for outcome is suggested e.g. by the aforementioned model comparison in machine learning (Demšar, finding that spillover 1 winds up being a stronger 2006), where differences between two models in position for a surprisal predictor in the Shain et al. performance on held-out data is subjected to sta- (2016) models, indicating a diffuse response that tistical testing, even when the fits are obtained in spreads to or even peaks at subsequent words, a non-convex deep learning setting. The outcomes DTSR permits direct investigation of these dy- of such tests are implicitly conditional on the fit- namics. ted parameters of each model. 
The probability of 8 A note on hypothesis testing that the fitted parameters are indeed optimal fora given experiment can be increased by Monte Carlo As a Bayesian model, DTSR supports hypothe- techniques in which multiple randomly initialized sis testing by querying the variational posterior. models are fitted and then aggregated, an approach to which DTSR is also amenable. related potentials (Walter et al., 1964) in oscilla- However, even in situations where such un- tory measures like electroencephalography, sug- certainty in hypothesis testing is not acceptable, gesting a rich array of potential applications of DTSR is appropriate for certain important use DTSR in computational psycholinguistics. cases. First, DTSR can be used for exploratory data analysis in order to empirically motivate the References spillover structure of the linear model. Spillover variables can be excluded or included based on the Martín Abadi, Ashish Agarwal, Paul Barham, Eugene degree of temporal diffusion revealed by DTSR, Brevdo, Zhifeng Chen, Craig Citro, Greg Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay permitting construction of linear models that are Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey both parsimonious and effective for controlling Irving, Michael Isard, Yangqing Jia, Rafal Jozefow- temporal diffusion. For example, if the average icz, Lukasz Kaiser, Manjunath Kudlur, Josh Lev- trial duration is 300ms and the estimated response enberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon to a predictor is near zero at that point, this could Shlens, Benoit Steiner, Ilya Sutskever, Kunal Tal- be used to motivate the exclusion of spillover vari- war, Paul Tucker, Vincent Vanhoucke, Vijay Vasude- ables for that predictor. To overcome the limita- van, Fernanda Viégas, Oriol Vinyals, Pete Warden, tion that linear models cannot fit estimates of Rate, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale the convolved Rate predictor from the DTSR fit Machine Learning on Heterogeneous Distributed can be extracted and supplied to the linear model Systems. as a predictor. Second, DTSR can be used to fit Harald Baayen, Shravan Vasishth, Reinhold Kliegl, and a data transform which is then applied to the data Douglas Bates. 2017. The cave of shadows: Ad- prior to statistical analysis. This approach is iden- dressing the human factor with generalized additive tical in spirit to e.g. the use of the canonical HRF mixed models. Journal of Memory and Language, to convolve predictors in fMRI models prior to lin- 94(Supplement C):206–234. ear regression (Boynton et al., 1996). The canoni- R Harald Baayen, Jacolien van Rij, Cecile de Cat, and cal HRF may be sub-optimal for the data at hand, Simon Wood. 2018. Autocorrelated errors in exper- yet likelihood maximization and model compari- imental data in the language sciences: Some solu- tions offered by Generalized Additive Mixed Mod- son are conditional on it. The same would hold els. In Dirk Speelman, Kris Heylen, and Dirk Geer- of linear fits to predictors convolved using DTSR. aerts, editors, Mixed Effects Regression Models in However, unlike the canonical HRF, since DTSR Linguistics. Springer, Berlin. is domain-general, it can be integrated into any Douglas Bates, Martin Mächler, Ben Bolker, and Steve analysis toolchain for time series. Walker. 2015. Fitting linear mixed-effects mod- els using lme4. Journal of Statistical Software, 9 Conclusion 67(1):1–48. 
Milene Bonte, Tiina Parviainen, Kaisa Hytönen, and This paper presented a variational Bayesian de- Riitta Salmelin. 2005. Time course of top-down and convolutional time series regression method as a bottom-up influences on syllable processing in the solution to the problem of temporal diffusion in auditory cortex. Cerebral Cortex, 16(1):115–123. psycholinguistic time series data and applied it to Geoffrey M Boynton, Stephen A Engel, Gary H Glover, both synthetic and human responses in order to and David J Heeger. 1996. Linear systems analysis better understand and control for latent temporal of functional magnetic resonance imaging in human V1. Journal of Neuroscience, 16(13):4207–4221. dynamics. Results showed that DTSR can yield a plausible, replicable, parsimonious, insightful, Mara Breen. 2014. Empirical investigations of the role and predictive model of a complex dynamical sys- of implicit prosody in sentence processing. Lan- guage and Linguistics Compass, 8(2):37–50. tem like the human sentence processing response and therefore support the use of DTSR for psy- Jonathan Brennan, Yuval Nir, Uri Hasson, Rafael cholinguistic time series modeling. While the Malach, David J Heeger, and Liina Pylkkänen. 2012. Syntactic structure building in the anterior temporal present study explored the use of DTSR to under- lobe during natural story listening. Brain and Lan- stand human reading times, DTSR can in princi- guage, 120(2):163–173. ple also be used to deconvolve other kinds of re- Charles E Connor, Howard E Egeth, and Steven Yantis. sponse variables, such as the HRF in fMRI mod- 2004. Visual attention: bottom-up versus top-down. eling (Boynton et al., 1996) or intracranial event- Current biology, 14(19):R850–R852. Bhupinder S Dayal and John F MacGregor. 1996. Iden- Trevor Hastie and Robert Tibshirani. 1986. General- tification of finite impulse response models: meth- ized additive models. Statist. Sci., 1(3):297–310. ods and robustness issues. Industrial & engineering chemistry research, 35(11):4078–4090. Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H Clark, and Philipp Koehn. 2013. Scalable modi- Vera Demberg and Frank Keller. 2008. Data from eye- fied Kneser-Ney language model estimation. In Pro- tracking corpora as evidence for theories of syntactic ceedings of the 51st Annual Meeting of the Associa- processing complexity. Cognition, 109(2):193–210. tion for Computational Linguistics, pages 690–696, Sofia, Bulgaria. Janez Demšar. 2006. Statistical comparisons of clas- sifiers over multiple data sets. Journal of Machine Alan Kennedy, James Pynte, and Robin Hill. 2003. Learning Research, 7(Jan):1–30. The Dundee corpus. In Proceedings of the 12th Eu- Timothy Dozat. 2016. Incorporating Nesterov momen- ropean conference on eye movement. tum into Adam. In ICLR Workshop. Diederik P Kingma and Jimmy Ba. 2014. Adam: Kate Erlich and Keith Rayner. 1983. Pronoun assign- A method for stochastic optimization. CoRR, ment and semantic integration during reading: Eye abs/1412.6. movements and immediacy of processing. Journal of Verbal Learning & Verbal Behavior, 22:75–87. Martin Lindquist and Tor Wager. 2007. Validity and power in hemodynamic response modeling: A com- Stefan L Frank and Rens Bod. 2011. Insensitivity of parison study and a new approach. Human brain the Human Sentence-Processing System to Hierar- mapping, 28:764–784. chical Structure. Psychological Science, 22(6):829– 834. Martin A Lindquist, Ji Meng Loh, Lauren Y Atlas, and Tor D Wager. 2009. 
Modeling the hemodynamic re- Stefan L Frank, Irene Fernandez Monsalve, Robin L sponse function in fMRI: Efficiency, bias and mis- Thompson, and Gabriella Vigliocco. 2013. Read- modeling. NeuroImage, 45(1, Supplement 1):S187 ing time data for evaluating broad-coverage models – S198. of English sentence processing. Behavior Research Methods, 45(4):1182–1190. Vijay Madisetti. 1997. The digital signal processing Angela D Friederici. 2002. Towards a neural basis of handbook. CRC press. auditory sentence processing. Trends in Cognitive Sciences, 6(2):78–84. Yurii E Nesterov. 1983. A method for solving the convex programming problem with convergence rate Karl J Friston, Oliver Josephs, Geraint Rees, and O(1/kˆ2). In Dokl. Akad. Nauk SSSR, volume 269, Robert Turner. 1998. Nonlinear event-related re- pages 543–547. sponses in fMRI. Magn. Reson. Med, pages 41–52. John Neter, William Wasserman, and Michael H Kut- Richard Futrell, Edward Gibson, Harry J . Tily, Idan ner. 1989. Applied linear regression models. Blank, Anastasia Vishnevetsky, Steven Piantadosi, and Evelina Fedorenko. 2018. The Natural Stories Michael Nikolaou and Premkiran Vuthandam. 1998. Corpus. In Proceedings of the Eleventh Interna- FIR model identification: Parsimony through ker- tional Conference on Language Resources and Eval- nel compression with . AIChE Journal, uation (LREC 2018), Paris, France. European Lan- 44(1):141–150. guage Resources Association (ELRA). Robert Parker, David Graff, Junbo Kong, Ke Chen, Edward Goldstein, Benjamin J Cowling, Allison E and Kazuaki Maeda. 2009. English Gigaword Aiello, Saki Takahashi, Gary King, Ying Lu, and LDC2009T13. Marc Lipsitch. 2011. Estimating Incidence Curves of Several Infections Using Symptom Surveillance Fabian Pedregosa, Michael Eickenberg, Philippe Ciu- Data. PLOS ONE, 6(8):1–8. ciu, Alexandre Gramfort, and Bertrand Thirion. 2014. Data-driven HRF estimation for encoding and Cristina Gorrostieta, Hernando Ombao, Patrick Bé- decoding models. NeuroImage, 104. dard, and Jerome N Sanes. 2012. Investigating brain connectivity using mixed effects vector autoregres- Valerie A Ramey. 2016. Macroeconomic shocks and sive models. NeuroImage, 59(4):3347–3355. their propagation. In John B Taylor and Harald C Goutte, F A Nielsen, and K H Hansen. 2000. Mod- Uhlig, editors, Handbook of Macroeconomics, vol- eling the hemodynamic response in fMRI using ume 2 of Handbook of Macroeconomics, pages 71– smooth FIR filters. IEEE Transactions on Medical 162. Elsevier. Imaging, 19(12):1188–1201. Marten van Schijndel and William Schuler. 2015. Hier- Gary H. Glover. 1999. Deconvolution of impulse re- archic syntax improves reading time prediction. In sponse in event-related BOLD fMRI. NeuroImage, Proceedings of NAACL-HLT 2015. Association for 9:416–429. Computational Linguistics. R×W Cory Shain, Marten van Schijndel, Richard Futrell, Ed- • a matrix ∈ R of R random IRF kernel ward Gibson, and William Schuler. 2016. Mem- parameters for W random impulse vectors ory access during incremental sentence processing causes reading time latency. In Proceedings of the Random parameters o, v, and are constrained to Computational Linguistics for Linguistic Complex- ity Workshop, pages 49–58. Association for Compu- be zero-centered. tational Linguistics. To support mixed modeling, the fixed and ran- dom effects must first be combined using addi- Christopher A Sims. 1980. Macroeconomics and real- tional utility matrices. Let O ∈ {0, 1}N×O be ity. Econometrica: Journal of the Econometric So- ciety, pages 1–48. 
a mask matrix for random intercepts. A vector N q ∈ R of intercepts is: Nathaniel J Smith and Roger Levy. 2013. The effect of word predictability on reading time is logarithmic. def Cognition, 128:302–319. q = µ + O o (6)

Shravani Sur and V K Sinha. 2009. Event-related po- Let ∈ {0, 1}L×U be an indicator matrix for fixed tential: An overview. Industrial psychiatry journal, coefficients, V ∈ {0, 1}L×V be an indicator ma- 18(1):70. trix for random coefficients, and V′ ∈ {0, 1}N×V Dustin Tran, Alp Kucukelbir, Adji B Dieng, Maja be a mask matrix for random coefficients. A ma- N×L Rudolph, Dawen Liang, and David M Blei. trix Q ∈ R of coefficients is: 2016. Edward: A library for probabilistic mod- eling, inference, and criticism. arXiv preprint def ⊤ ′ ⊤ arXiv:1610.09787. Q = 1 ( u) + V diag(v) V (7)

W. Walter, R. Cooper, V. J. Aldridge, W. C. McCallum, Let W ∈ {0, 1}L×W be an indicator matrix and A. L. Winter. 1964. Contingent Negative Varia- ′ ′ for random IRF parameters and W1,..., Wn ∈ tion: An Electric Sign of Sensori-Motor Association R×W and Expectancy in the Human Brain. Nature. {0, 1} be mask matrices for random IRF pa- R×L rameters. Then matrices Pn ∈ R for n ∈ B Douglas Ward. 2006. Deconvolution analysis of {1, 2,...,N} are: fMRI time series data. def ′ ⊤ Simon N Wood. 2006. Generalized additive models: Pn = A + (Wn⊙) W (8) An introduction with R. Chapman and Hall/CRC, Boca Raton. In each equation above, the random effects param- eters are masked using the random effects filter A Definition of mixed effects DTSR associated with each data point. Q and Pn are For expository purposes, in Section3 the DTSR then transformed into the impulse vector space us- model was defined only for fixed effects. How- ing the indicator matrices V and W, respectively. ever, DTSR is compatible with mixed modeling This procedure sums the random effects associ- and the implementation used here supports ran- ated with each data point and adds them to the dom effects in the model intercepts, coefficients, population-level parameters. and IRF parameters. The full mixed-effects DTSR To define the convolution step, let gl for l ∈ equations are presented below. {1, 2,...,L} be parametric IRF kernels, one for The definitions of X, y, µ, σ, a, b, c, d, F, M, each impulse. Convolution is performed by pre- N, K, and R presented in Section3 are retained multiplying the inputs X with L sparse matrices N×M for the mixed model definition. The remaining l ∈ R for l ∈ {1, 2, ..., L}: variables and equations must be redefined to some def ( ⊤ ) extent. Mixed-effects DTSR models additionally (l)[n,∗] = gl b[n] − a ;(Pn)[∗,l] ⊙ F[n,∗] (9) contain the following parameters: L ∈ {0, 1}K×L O Finally, let be an indicator ma- • a vector o ∈ R of O random intercepts trix mapping the K predictors of X to the corre- U 17 • a vector u ∈ R of U fixed coefficients sponding L impulse vectors of the model. The V • a vector v ∈ R of V random coefficients 17Predictors and impulse vectors are distinguished because R×L in principle multiple IRFs can be applied to the same predic- • a matrix A ∈ R of R fixed IRF kernel tor. In the usual case where this distinction is not needed, L parameters for L fixed impulse vectors is identity and K = L. convolution that yields the design matrix of con- ′ N×L volved predictors X ∈ R is then defined us- ing a product of the convolution matrices, the de- sign matrix, and the impulse indicator L:

$$\mathbf{X}'_{[*,l]} \overset{\text{def}}{=} \mathbf{G}_l\, \mathbf{X}\, \mathbf{L}_{[*,l]} \qquad (10)$$

The full model mean is the sum of (1) the intercepts and (2) the sum-product of the convolved predictors with the coefficient parameters:

$$\mathbf{y} \sim \mathcal{N}\!\left(\mathbf{q} + (\mathbf{X}' \odot \mathbf{Q})\,\mathbf{1},\ \sigma^2\right) \qquad (11)$$
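A tiny NumPy sketch of how the mean in eq. (11) combines per-observation intercepts and coefficients with the convolved predictors (array shapes follow the definitions above; the values are placeholders):

```python
import numpy as np

def mixed_model_mean(q, X_conv, Q):
    # Eq. (11): per-observation intercepts plus the row-wise sum-product of the
    # convolved predictors with their per-observation coefficients
    return q + (X_conv * Q).sum(axis=1)

N, L = 4, 2
q = np.zeros(N)                    # intercepts from eq. (6)
X_conv = np.ones((N, L))           # convolved predictors from eq. (10)
Q = np.full((N, L), 0.5)           # coefficients from eq. (7)
print(mixed_model_mean(q, X_conv, Q))
```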