arXiv:2003.03887v3 [stat.ME] 27 Jan 2021

Assessing the Significance of Directed and Multivariate Measures of Linear Dependence Between Time Series

Oliver M. Cliff,1,2,3,∗ Leonardo Novelli,1,3 Ben D. Fulcher,1,2 James M. Shine,1,4 and Joseph T. Lizier1,3

1 Centre for Complex Systems, The University of Sydney, Sydney NSW 2006, Australia
2 School of Physics, The University of Sydney, Sydney NSW 2006, Australia
3 Faculty of Engineering, The University of Sydney, Sydney NSW 2006, Australia
4 Brain and Mind Centre, School of Medical Sciences, The University of Sydney, Sydney NSW 2006, Australia

Inferring linear dependence between time series is central to our understanding of natural and artificial systems. Unfortunately, the hypothesis tests that are used to determine statistically significant directed or multivariate relationships from time-series data often yield spurious associations (Type I errors) or omit causal relationships (Type II errors). This is due to the autocorrelation present in the analysed time series, a property that is ubiquitous across diverse applications, from brain dynamics to climate change. Here we show that, for limited data, this issue cannot be mitigated by fitting a time-series model alone (e.g., in Granger causality or prewhitening approaches), and instead that the degrees of freedom in statistical tests should be altered to account for the effective sample size induced by cross-correlations in the observations. This insight enabled us to derive modified hypothesis tests for any multivariate correlation-based measures of linear dependence between covariance-stationary time series, including Granger causality and mutual information with Gaussian marginals. We use both numerical simulations (generated by autoregressive models and digital filtering) as well as recorded fMRI-neuroimaging data to show that our tests are unbiased for a variety of stationary time series. Our experiments demonstrate that the commonly used $F$- and $\chi^2$-tests can induce significant false-positive rates of up to 100% for both measures, with and without prewhitening of the signals. These findings suggest that many dependencies reported in the scientific literature may have been, and may continue to be, spuriously reported or missed if modified hypothesis tests are not used when analysing time series.

I. INTRODUCTION

Linear dependence measures such as Pearson correlation, canonical correlation analysis, and Granger causality are used in a broad range of scientific domains to investigate the complex relationships in both natural and artificial processes. Despite their widespread use, concerns have been raised about the hypothesis tests typically used to assess the statistical significance of such measures from time series [1–6]. Specifically, the presence of autocorrelation in a signal (one of two defining properties of a stationary time series [7, 8]) has been known to bias statistics since the beginning of time-series analysis [9]. If left unaccounted for, this bias yields a greater number of both spurious correlations and missed causalities (Type I and Type II errors) due to size and power distortions of the hypothesis tests. With the recent findings [1–6] suggesting that existing techniques do not adequately address autocorrelation, the accuracy of many reported results across the empirical sciences may be called into question.

The notion that autocorrelation affects the sampling distribution of time-series properties has a long history in statistics, with research often focusing on the relationship between two univariate processes. Seminal work by Bartlett [10, 11] revealed that autocorrelation can distort the degrees of freedom available to compute statistics such as Pearson correlation coefficients. In practical terms, this induces an “effective sample size”, where the effective number of independent samples used in computing an estimate is different to the actual length of the dataset. Two opposing strategies have been proposed for handling autocorrelation: remove the autoregressive (AR) components of the time series before computing statistics, or modify the hypothesis tests that assess the distorted measurements. The former approach, known as prewhitening, involves filtering the time series in order to render the residuals serially independent [12]. Prewhitening is known to have many issues, such as reducing the size and power properties of hypothesis tests both in theory [13] and in practice, with simulated [14] and recorded [15, 16] time-series data in a variety of domains. In contrast, the notion of modifying hypothesis tests remains relatively underused in practice, more often found in applications involving short time series and high autocorrelation, where the statistical bias of measures is most pronounced (with or without prewhitening), e.g., in fMRI-based neuroimaging [1, 2, 17], as well as environmental and ecological studies [18, 19]. Indeed, it was not until recently that the efficacy of a modified $z$-test for correlation analysis was demonstrated successfully on fMRI signals [2], which have been widely characterised using correlation coefficients [20]. Nevertheless, the theory of autocorrelation on the undirected relationship between bivariate time series is now well developed. However, the extension of this theory to multiple time series, and to directed relationships, remains incomplete.

∗ oliver.cliff@sydney.edu.au

Motivated to study directed dependencies in economics, Granger [21] introduced a measure of causal influence between AR models nearly 60 years ago. Since then, it has become exceedingly popular, exemplified by more than 100 000 works indexed by Google Scholar that contain the phrase “Granger causality” (as of June 2020). This impact is reflected in the measure's ubiquity in the scientific community beyond its origins in econometrics, generating highly influential results on phenomena ranging from brain dynamics [22–24] to climate change [25, 26] and political relationships [27, 28]. Granger causality controls for the confounding past of a process through linear regression, building statistics and hypothesis tests via residuals rather than the original process. However, researchers are becoming aware that certain preprocessing techniques that increase autocorrelation, such as filtering, raise the false-positive rate (FPR) of Granger causality tests when using the well-established $\chi^2$- and $F$-distributions [3–5]. Even though these empirical studies have demonstrated that established Granger causality tests have distorted size and power properties (exhibiting Type I and II errors), it has remained unclear why this occurs and how to correct it. In this paper, we illustrate that these errors are due to an inflated variance of the null distribution as a function of autocorrelation remaining in the residuals, in the same way that bivariate correlation is affected.

In order to unify Bartlett's earlier investigations on correlation coefficients (under autocorrelation) with more complex measures such as Granger causality, we must expand the former body of work to account for multivariate relationships. One such multivariate generalisation of Pearson correlation is referred to as Wilks' criterion [29], which quantifies the relationship between multiple sets of variables, and is $\Lambda$-distributed for independent observations [30]. In particular, we are interested in a special case of Wilks' criterion popularised by Hotelling [31] that focuses on two sets of variables, referred to as canonical correlation analysis. It was later established that, like Pearson correlation, estimates of canonical correlations are inefficient under autocorrelation [6], introducing Type I and II errors under hypothesis tests that assume independence (such as the $\Lambda$-distribution). Instead of deriving hypothesis tests directly for Wilks' criterion or canonical correlations, here we use the equivalent information-theoretic formulation. Information theory's general applicability arises in simply requiring a probability distribution that can be either parametric or non-parametric [32, 33]. When this probability distribution is modelled as a multivariate Gaussian, canonical correlation analysis and information theory overlap because mutual information can be decomposed into sums involving canonical variables [34]. Moreover, Granger causality is now understood as a special case of conditional mutual information, known as transfer entropy [35–37]. While this unification provides an elegant perspective, there remains a clear divide between the theoretical foundations of Bartlett (and others [38–40]) and the large family of multivariate linear dependence measures that information theory provides.

In this work we bridge this gap by leveraging the concept of the effective sample size to derive hypothesis tests for any correlation-based measure of linear dependence between covariance-stationary time series. This comprises a large family of well-known statistics based on ratios of generalised variance (such as Granger causality and mutual information) that we introduce in Sec. II. To achieve this, we first provide the one-tailed and two-tailed tests for the sample partial correlations between two univariate processes under autocorrelation in Sec. III. Although this result is important in its own right, in this work we primarily leverage it to construct the tests for more advanced inference procedures with multivariate and directed models of observed dynamics. Following this, we introduce the modified $\Lambda$-test (in Sec. IV), which we show is suitable for assessing the significance of any linear dependence measure that can be expressed as a ratio of generalised variances. Specifically, in Sec. V we use the two-tailed test to derive hypothesis tests for conditional mutual information estimates between bivariate time-series data. We then use the chain rule for mutual information to extend this result to multivariate time-series data. Finally, since Granger causality can be expressed as a conditional mutual information, in Sec. VI we extend our results further to derive Granger causality tests for both bivariate and multivariate time-series datasets. More broadly, the modified $\Lambda$-test can be used for any measure that can be expressed in terms of conditional mutual information (or, equivalently, Wilks' criterion or partial correlation), e.g., canonical correlations and partial autocorrelation [7, 8] or information-theoretic measures (for linear-Gaussian processes) such as predictive information [41, 42] and active information storage [43].

Using numerical simulations throughout Sec. VII, we validate the modified $\Lambda$-test and characterise the effect of autocorrelation on both the $\chi^2$- and $F$-test. Our experiments involve generating samples from two independent first-order AR models and iteratively filtering the output signal such that the autocorrelation is increased for both time series; this simulates empirical analysis in practice, and allows for the process parameters to be modified while ensuring that the null hypothesis (of no inter-process dependence) is not violated. We perform these experiments for mutual information and Granger causality in their unconditional, conditional, and multivariate forms. Our results generally agree with the hypotheses that the FPR of $F$- and $\chi^2$-tests can be inflated by either increasing the autocorrelation (through filtering) or, for the $\chi^2$-test, the number of conditionals (through increasing the dimension of mutual information or the history length of Granger causality). These experiments mirror empirical applications where digital filtering is often used in preprocessing for many purposes, such as handling nonstationary effects, which inadvertently increases autocorrelation and therefore the FPRs of unmodified tests.

Given minimally sufficient effective samples, however, we confirm that the modified $\Lambda$-tests remain unbiased for all scenarios. We thus show that, in contrast, the size (Type I errors) and power (Type II errors) of $F$- or $\chi^2$-tests are arbitrarily low for a large class of multivariate linear dependence measures (approximately zero in certain instances) and overwhelmingly depend on the parameters of the underlying independent processes. We further demonstrate that the common approach of prewhitening a signal (in order to remove the effect of autocorrelation) does not suffice to control the FPR in almost all cases.

Finally, by using a well-known brain-imaging dataset from the Human Connectome Project [44], we verify that our previous numerical simulations yield comparable results to experiments on commonly used datasets. For these experiments, the $\chi^2$-tests of mutual information and Granger causality yield concerningly high FPRs of over 80% and 65%, respectively, for a nominal significance of 5% (a 16- and a 13-fold increase), whereas our exact tests maintain the ideal FPR for all experiments. Open-source MATLAB code is made available to allow users to perform correct hypothesis testing for all dependence measures, as well as the above experiments, at: https://github.com/olivercliff/exact-linear-dependence.

Our theoretical and empirical findings suggest that this work presents the first statistically sound approach for testing the linear dependence between multivariate time-series data. Given that approaches such as prewhitening and Granger causality are specifically designed to account for autocorrelation, we conjecture that autocorrelation-induced statistical errors caused by $F$- or $\chi^2$-tests (and others) may be even more prevalent in prior publications than previously suggested by a number of authors [1, 3, 4]. In particular, our case study of brain-imaging data is concerning, because the neuroscience community employs techniques such as correlation, mutual information, and Granger causality in order to infer pairwise dependence (known as “functional connectivity”). Implementation of our approach will enable correct inference of linear relationships within complex systems across myriad scientific applications.

II. MEASURES OF LINEAR DEPENDENCE

In this work, we focus on multivariate signals,

$$ \{Z_1(t), \ldots, Z_m(t)\}, \quad t = 0, \pm 1, \pm 2, \ldots, \tag{1} $$

that is, a collection of $m$ series sampled at equally spaced time intervals. Writing

$$ Z(t) = (Z_1(t), \ldots, Z_m(t))', \tag{2} $$

we shall refer to the $m$ series as an $m$-dimensional vector of multiple time series such that $Z(t) \in \mathbb{R}^m$. For the purposes of inferring linear dependence, $Z$ is partitioned into one $k$-variate and one $l$-variate subprocess [45]:

$$ Z = \begin{bmatrix} X \\ Y \end{bmatrix}, \tag{3} $$

reflecting an interest in the relationship between $X$ and $Y$. The linear dependence of $X$ on $Y$ (or vice versa) is measured by a scalar value that quantifies how much the outcomes of $Y$ reduce uncertainty over outcomes of $X$. Theoretically, in the absence of a linear relationship between $X$ and $Y$ (the null hypothesis, $H_0$) the reduction of uncertainty is exactly zero, meaning that $Y$ does not linearly predict $X$ at all. In practice, however, we only have access to a finite-length dataset with $T$ observations over which to compute the measures, introducing a variation in statistical estimates and manifesting as non-zero values in the case of no relationship. Here, we present this dataset as an $m \times T$ matrix $z$ of consecutive real-valued samples $z(t) \in \mathbb{R}^m$ of the process $Z$ (again, this is partitioned into submatrices $x$ and $y$). To this end, the aim of linear-dependence tests is to infer whether there is a statistical dependence between $X$ and $Y$ based on the sample paths $x$ and $y$ alone.

We make the typical assumption that the underlying system, $Z$, is a second-order stationary, purely non-deterministic process [7, 8, 46]. An important consequence of covariance-stationarity is that the time series may be represented, after appropriate mean removal and differencing [47], by the ARMA model:

$$ Z(t) = a(t) + \sum_{u=1}^{p} \Phi(u) Z(t - u) + \sum_{u=1}^{q} \Theta(u) a(t - u), \tag{4} $$

where $\Phi$ and $\Theta$ are vectors of autoregressive (AR) and moving-average (MA) parameters, and $a(t)$ is uncorrelated noise (the innovation process). We further assume that the noise is Gaussian, $a(t) \sim \mathcal{N}(0, \Sigma)$ for some arbitrary noise covariance $\Sigma$, meaning that $Z$ is a linear-Gaussian process.

A. Cross-correlation and autocorrelation

For covariance-stationary time series, the relationship between $Z_i(t)$ and $Z_j(t + u)$ depends only on the difference in times $t$ and $t + u$ of the observation but not on $t$ itself. Once the mean has been removed, such processes are fully defined by their cross-correlation,

$$ \rho_{ij}(u) = \frac{\gamma_{ij}(u)}{\sqrt{\gamma_{ii}(0)\,\gamma_{jj}(0)}}, \tag{5} $$

with $\gamma_{ij}(u) = \mathrm{cov}(Z_i(t), Z_j(t + u))$ the cross-covariance between $Z_i(t)$ and $Z_j(t + u)$. If $\rho_{ii}(u) \neq 0$ for any $u > 0$, then the univariate process $Z_i$ exhibits autocorrelation, and the collection of $\rho_{ii}(u)$ for $u = 0, \pm 1, \pm 2, \ldots$ is generally called the autocorrelation function of $Z_i$. The sample cross-correlation coefficients are computed from time-series data $z$ as

$$ r_{ij}(u) = \frac{c_{ij}(u)}{\sqrt{c_{ii}(0)\,c_{jj}(0)}}, \tag{6} $$

with $c_{ij}(u) = N^{-1} \sum_{t=1}^{T} z_i(t) z_j(t + u)$, where $N = T - 1$ gives an unbiased estimate of the sample cross-covariance.
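As a rough numerical illustration of Eqs. (4) and (6), the sketch below simulates a univariate AR(1) process (a special case of the ARMA model with $p = 1$, $q = 0$) and evaluates its sample autocorrelation at a few lags. The function names `simulate_ar1` and `sample_cross_correlation`, the burn-in length, and the truncation of the sum to the $T - u$ available pairs are assumptions made for this example; it is not taken from the authors' MATLAB code.

```python
import numpy as np


def simulate_ar1(phi, T, rng, burn=200):
    """Simulate Z(t) = phi * Z(t-1) + a(t) with a(t) ~ N(0, 1).

    `burn` is a hypothetical burn-in length, discarded so that the
    returned samples are approximately stationary.
    """
    a = rng.standard_normal(T + burn)
    z = np.zeros(T + burn)
    for t in range(1, T + burn):
        z[t] = phi * z[t - 1] + a[t]
    return z[burn:]


def sample_cross_correlation(zi, zj, u):
    """Lag-u sample cross-correlation r_ij(u) in the spirit of Eq. (6), u >= 0.

    Assumption: the sum is truncated to the T - u pairs that exist in a
    finite sample, and both series are mean-removed first.
    """
    zi = zi - zi.mean()
    zj = zj - zj.mean()
    T = len(zi)
    N = T - 1
    c_ij = np.sum(zi[: T - u] * zj[u:]) / N
    c_ii = np.sum(zi * zi) / N
    c_jj = np.sum(zj * zj) / N
    return c_ij / np.sqrt(c_ii * c_jj)


rng = np.random.default_rng(0)
x = simulate_ar1(0.8, 2000, rng)
# Sample autocorrelation function of x at lags 0..3
r_xx = [sample_cross_correlation(x, x, u) for u in range(4)]
```

For an AR(1) process with coefficient 0.8, the lag-one autocorrelation should come out close to 0.8, which is the kind of serial dependence the later sections correct for.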

The first linear dependence measure we discuss is Pearson's product-moment correlation coefficient. For bivariate ($m = 2$) processes, $X$ and $Y$ are each univariate, and we denote the cross-correlation between these variables (Eq. (5)) as $\rho_{XY}(u)$. Pearson's correlation coefficient is the lag-zero cross-correlation $\rho_{XY} = \rho_{XY}(0)$, and quantifies the (symmetric) association between paired observations of $X$ and $Y$. The sample correlation coefficient is then given by

$$ r_{xy} = \frac{c_{xy}}{\sqrt{c_{xx}\,c_{yy}}}, \tag{7} $$

where $c_{xy} = c_{xy}(0)$ is the sample covariance, and $c_{xx} = c_{xx}(0)$ and $c_{yy} = c_{yy}(0)$ are sample variances [39].

In order to assess the statistical significance of linear dependence measures, such as the sample correlation coefficient, we must be able to compute their variance, i.e., $\sigma_r^2(x, y) = \mathrm{var}(r_{xy})$. For independent but autocorrelated stationary processes, the variance of the sample correlation coefficient can be estimated (to the first order) as [2, 38, 39]

$$ \hat{\sigma}_r^2(x, y) \approx T^{-1} \left( 1 + 2 \sum_{u=1}^{T-1} \frac{T - u}{T}\, r_{xx}(u)\, r_{yy}(u) \right), \tag{8} $$

where $r_{xx}(u)$ and $r_{yy}(u)$ are the lag-$u$ sample autocorrelations [48]. Although we refer to Eq. (8) as Bartlett's formula, this first-order approximation is due to Clifford et al. [39], who presented the variance estimator for spatial autocorrelation and an estimate of the effective number of independent samples for correlation coefficients:

$$ \hat{\eta}(x, y) = 1 + \hat{\sigma}_r^{-2}(x, y). \tag{9} $$

An important consequence of Eqs. (8) and (9) is that hypothesis tests, such as Student's $t$-test, should have degrees of freedom corresponding to the effective sample size of the analysed time series (i.e., the effective degrees of freedom [2, 11]), rather than the original sample size [39]. That is, if both $X$ and $Y$ are autocorrelated, then the test statistic for $r_{xy}$ follows a modified Student's $t$-distribution under the null:

$$ r_{xy} \sqrt{\frac{\hat{\eta}(x, y) - 2}{1 - r_{xy}^2}} \sim t(\hat{\eta}(x, y) - 2), \tag{10} $$

where $\hat{\eta}(x, y) - 2$ is the (estimated) effective degrees of freedom. An examination of this formula reveals that, when both $x$ and $y$ are positively autocorrelated, there are, effectively, fewer independent observations than in the original dataset ($\hat{\eta} < T$); if only one process is negatively autocorrelated, then there appear to be more independent observations than in the original dataset ($\hat{\eta} > T$) [39]. Consequently, when the modified degrees of freedom in Eq. (10) are neglected, inference procedures can either spuriously identify association (produce Type I errors) when $\hat{\eta} < T$ or miss actual correlations (Type II errors) when $\hat{\eta} > T$.

If either one (or both) of $x$ or $y$ is serially independent, then the sample correlation coefficient $r_{xy}$ can be tested against Student's $t$-distribution with $T - 2$ degrees of freedom. Thus, the textbook approach for minimising the deleterious effects of autocorrelation is to whiten one of the time series by filtering any AR components (referred to as prewhitening). The idea is that, by filtering any AR components, the residuals become uncorrelated and so statistical tests that have been developed for independent variables can now be used without modifying the degrees of freedom. In Sec. VIII, we discuss this approach in more detail, showing that linear-dependence tests applied to signals that have been “whitened” in this way still exhibit significant statistical bias (in some cases worse than without prewhitening).

B. Partial correlation

Partial correlation $\rho_{XY \cdot W}$ measures the association between $X$ and $Y$, whilst controlling for any concomitant effect of another $c$-variate process $W$ [49, 50]. Partial correlation is estimated by, first, computing the residual processes:

$$ e_{x|w} = x - \hat{x}(w), \tag{11} $$

$$ e_{y|w} = y - \hat{y}(w), \tag{12} $$

where $\hat{x}(w)$ denotes the linear prediction of $x$ from $w$ via ordinary least squares. Then, an appropriate test statistic for the null hypothesis $H_0 : \rho_{XY \cdot W} = 0$ of no relation between $X$ and $Y$, above any relationship with $W$, is the sample partial correlation:

$$ r_{xy \cdot w} = \frac{\sum_t e_{x|w}(t)\, e_{y|w}(t)}{\sqrt{\sum_t e_{x|w}^2(t)}\, \sqrt{\sum_t e_{y|w}^2(t)}}. \tag{13} $$

By contrasting the formulas for sample partial correlation (Eq. (13)) with sample cross-correlation (Eq. (6)), it is evident that the former is equivalent to the bivariate correlation between the residuals, i.e., $r_{xy \cdot w} = r_{e_{x|w} e_{y|w}}$. Unlike Pearson correlation, there is a dearth of research into the null distribution of partial correlation coefficients for autocorrelated time-series data. As such, our first theoretical contribution (in Sec. III) is a derivation for the null distribution of sample partial correlations (13) under autocorrelation, i.e., extending the modified $t$-test (for bivariate correlation (10)) to facilitate residual processes.

C. Wilks' criterion and canonical correlations

Relating two or more sets of variables is achieved similarly to partial correlation (13), with the exception that the generalised variance is used, rather than the conditional variance [29, 31]. Consider the relationship between the $k$-variate process $X$ and the $l$-variate process $Y$, in the context of a $c$-variate concomitant process $W$.
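The effective-sample-size correction for the bivariate case, Eqs. (8) to (10), can be sketched numerically as follows. This is a minimal illustration and not the authors' implementation: the function names, the `max_lag` truncation of the sum in Eq. (8) (a practical assumption to limit estimation noise at long lags), and the use of `scipy.signal.lfilter` to generate two independent AR(1) series are all choices made for this example.

```python
import numpy as np
from scipy import stats
from scipy.signal import lfilter


def autocorr(x, max_lag):
    """Sample autocorrelations r_xx(u) for u = 1..max_lag."""
    x = x - x.mean()
    c0 = np.dot(x, x)
    return np.array([np.dot(x[:-u], x[u:]) / c0 for u in range(1, max_lag + 1)])


def effective_sample_size(x, y, max_lag):
    """Effective sample size of Eq. (9), using the variance estimate of Eq. (8).

    Assumption: the sum over lags is truncated at `max_lag` rather than T - 1.
    """
    T = len(x)
    u = np.arange(1, max_lag + 1)
    var_r = (1.0 + 2.0 * np.sum((T - u) / T * autocorr(x, max_lag)
                                * autocorr(y, max_lag))) / T
    return 1.0 + 1.0 / var_r


def modified_t_test(x, y, max_lag):
    """Two-sided p-value for r_xy using the modified t-distribution of Eq. (10)."""
    r = np.corrcoef(x, y)[0, 1]
    dof = effective_sample_size(x, y, max_lag) - 2.0
    t_stat = r * np.sqrt(dof / (1.0 - r ** 2))
    p_value = 2.0 * stats.t.sf(abs(t_stat), dof)
    return r, dof, p_value


# Two independent, positively autocorrelated AR(1) series (coefficient 0.9)
rng = np.random.default_rng(1)
x = lfilter([1.0], [1.0, -0.9], rng.standard_normal(1000))
y = lfilter([1.0], [1.0, -0.9], rng.standard_normal(1000))

eta = effective_sample_size(x, y, max_lag=100)
r, dof, p = modified_t_test(x, y, max_lag=100)
```

With both series positively autocorrelated, `eta` should come out well below the nominal $T = 1000$, which is the situation in which an unmodified $t$-test would be too liberal.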

To measure their dependence, we use the same procedure as for univariate processes, except now the residuals $e_{x|w}$ and $e_{y|w}$ (from Eqs. (11) and (12)) are multivariate, making the sample covariance $s_{xy|w}$ an $m \times m$ matrix, rather than a scalar value. The generalised sample variance is the determinant of these sample covariances $|s_{xy|w}|$, and can be used to form a special case of (residual) Wilks' criterion [29]:

$$ \frac{|s_{xy|w}|}{|s_{x|w}|\,|s_{y|w}|}. \tag{14} $$

Although in general Wilks' criterion facilitates any number of partitions of $Z$, we will restrict our attention to two partitions (referring to the special case in Eq. (14) as Wilks' criterion when the meaning is clear). The null distribution of the ratio of independent generalised variances (14) is known as Wilks' $\Lambda$-distribution, with its exact analytic form derived by a number of authors on the basis of no cross-correlation and under the hypothesis that each variable within $X$, $Y$, or $W$ exhibits no autocorrelation [30, 51]. Hotelling [31] extensively studied the case of Wilks' criterion with two sets of variables, showing invariance under any internal linear transformation of these sets and a decomposition into canonical correlations with an asymptotic ($\chi^2$) null distribution. Much like their univariate counterparts, however, Hotelling's canonical correlations have been shown to be inefficient under autocorrelation [6]. Consequently, neither approach is suitable for inferring linear dependence between the majority of time-series data due to the ubiquity of autocorrelation. In Sec. IV, we address this issue by providing the null distribution to be used in the presence of autocorrelation (and of course when cross-correlations are present amongst any two variables, i.e., $X_i(t)$ and $X_j(t - u)$ may covary for any $i$, $j$, $t$, or $u$). An application that is of particular interest is mutual information, which is equivalent to Wilks' criterion (14) for Gaussian marginals.

D. Mutual information

Mutual information $I_{X;Y|W}$ is a fundamental concept in information theory (and a building block of many other measures) that quantifies the amount of information about a process $X$ obtained by observing another process $Y$ (potentially in the context of a third process $W$, making it a conditional mutual information) [33]. In general, information theory facilitates multivariate analysis by simply requiring well-defined probability distributions that can be either parametric or non-parametric. When these are normally distributed, mutual information takes a form that is equivalent to Wilks' criterion [33, 52]:

$$ \hat{I}_{x;y|w} = -\frac{1}{2} \log \left( \frac{|s_{xy|w}|}{|s_{x|w}|\,|s_{y|w}|} \right). \tag{15} $$

This formula is asymptotically equivalent to the nested log-likelihood ratio (LR) of two models [53], and thus we can use a null distribution also provided by Wilks [54]. Following Wilks' theorem [54], under the null hypothesis $H_0 : I_{X;Y|W} = 0$ and with normally distributed marginals, mutual information estimates are asymptotically chi-square distributed [35, 52, 53],

$$ 2T\, \hat{I}_{x;y|w} \sim \chi^2(kl). \tag{16} $$

For limited data, a more precise null distribution can be derived from the standard $F$-test, albeit for the more specific case of no autocorrelation and with one of the processes being univariate (see Eq. (A4) and surrounding discussion in Appendix A 2). That is, the mutual information between an i.i.d. variable $X$ and an $l$-variate $Y$, in the context of the concomitant $W$, is, under the null hypothesis $H_0 : I_{X;Y|W} = 0$,

$$ \frac{T - (l + c + 1)}{l} \left[ \exp\!\left( 2 \hat{I}_{x;y|w} \right) - 1 \right] \sim F(l, T - (l + c + 1)). \tag{17} $$

The same arguments in Eqs. (15)–(17) hold for unconditional mutual information $\hat{I}_{x;y}$ by setting $w = \emptyset$ (and $c = 0$).

Although we found no discussion on autocorrelation-induced biases of mutual information in the literature, statistical tests will clearly be incorrect if autocorrelation is not taken into account. This is evident from the well-known result that mutual information reduces to a function of sample correlation coefficients when the dependent processes are univariate [55]:

$$ 2 \hat{I}_{x;y|w} = -\log\!\left( 1 - r_{xy \cdot w}^2 \right). \tag{18} $$

Moreover, it is clear by observing the equivalence between Eqs. (14) and (15), noting that mutual information estimates can be decomposed into sums involving canonical variables [52]. By deriving the exact hypothesis tests for mutual information in Sec. V, we provide a critical component of general-purpose techniques for measuring undirected relationships between sets of variables. However, mutual information was not originally intended to measure autocorrelated time-series dependencies, nor does it naturally model directed dependencies; this is the intended purpose of Granger causality.

E. Granger causality

Granger causality was explicitly designed to capture one-way dependence between stochastic processes by taking into account the confounding influence of their past (i.e., the autocorrelation). By considering $X$ as a target (predictee) process and $Y$ as a source (predictor) process, Granger causality $F_{Y \to X|W}$ explicitly aims to measure the causality (predictability) in $Y$ about $X$ in the context of the relevant history of $X$ (and, potentially, a concomitant process $W$). Of course, the use of the term “causality” here refers to Wiener's definition (as a model of dependence based on prediction) rather than Pearl's (a mechanistic causal effect that can only be inferred using interventions); see [56] for a differentiation of these concepts.

The main assumption underlying Granger causality is that both $X$ and $Y$ are (vector) AR processes [21, 45]. That is, we assume that $X(t)$ and $Y(t)$ are causally dependent on the following states:

$$ X^{(p)}(t) = \begin{bmatrix} X(t - 1) \\ \vdots \\ X(t - p) \end{bmatrix}, \quad Y^{(q)}(t) = \begin{bmatrix} Y(t - 1) \\ \vdots \\ Y(t - q) \end{bmatrix}. \tag{19} $$

Under this assumption, the (directed) influence from $Y$ to $X$ is quantified by conditional mutual information [35]:

$$ F_{Y \to X|W}(p, q) = 2\, I_{X;Y^{(q)}|X^{(p)}W}. \tag{20} $$

Following Eq. (15), this measure can be estimated as a log-ratio of generalised variances [45]:

$$ \hat{F}_{y \to x|w}(p, q) = \log \left( \frac{|s_{x|x^{(p)}w}|}{|s_{x|x^{(p)}y^{(q)}w}|} \right). \tag{21} $$

Note that, except for the rather narrow case of $k = q = 1$, Granger causality is a multivariate measure (using generalised variances rather than conditional variances).

The AR orders, $p$ and $q$, of each process are typically determined by statistical tests such as partial autocorrelation [8], Burg's method [57], the Akaike or Bayesian information criterion (AIC or BIC), cross-validation [4], or active information storage [43, 58, 59]. In this paper, we use Burg's method to infer the model order due to its efficiency and stability over the Yule-Walker equations [57]. Further, using results from the main text, we discuss the relationship between partial autocorrelation and active information storage in Appendix E.

In general, hypothesis tests for Granger causality can be derived from Wilks' theorem [54]. That is, under the null hypothesis $H_0 : F_{Y \to X|W}(p, q) = 0$, estimates of the Granger causality from $Y$ to $X$ (Eq. (21)) are asymptotically chi-square distributed [45]:

$$ T\, \hat{F}_{y \to x|w}(p, q) \sim \chi^2(klq). \tag{22} $$

Alternatively, a finite-sample null distribution can be used if the predictee process is univariate ($k = 1$). Referring to Appendix A, the restricted model has $p + c + 1$ parameters, and the unrestricted model has $p + lq + c + 1$ parameters. Accordingly, we can build the statistic:

$$ \frac{T - (p + lq + c + 1)}{lq} \left[ \exp\!\left( \hat{F}_{y \to x|w}(p, q) \right) - 1 \right], \tag{23} $$

which, according to the standard $F$-test (Eq. (A4)), is distributed as

$$ F(lq, T - (p + lq + c + 1)). \tag{24} $$

It should be emphasised, however, that the $F$-test is only suitable for serially independent observations. Thus, although the Granger causality measure explicitly accounts for autocorrelation in vector AR processes, the established hypothesis tests assume either a sequence of completely independent residuals (the $F$-test) or infinite data (the $\chi^2$-test). The null distributions that we provide in Sec. VI overcome both of these issues, providing the first valid finite-sample tests for Granger causality.

III. MODIFIED TESTS FOR PARTIAL CORRELATION

In this section, we derive one-tailed and two-tailed tests for the null hypothesis, $H_0 : \rho_{XY \cdot W} = 0$, of no partial correlation between two univariate autocorrelated time series $x$ and $y$, given a third (potentially multivariate) process $w$. These tests are valid for any covariance-stationary time series $X$, $Y$, and $W$ and sample size $T$.

A. Modified Student's t-test for partial correlation

Recall from Eq. (13) that the sample partial correlation is equivalent to the sample correlation between $e_{x|w}$ and $e_{y|w}$, i.e., $r_{xy \cdot w} = r_{e_{x|w} e_{y|w}}$. Obtaining the null distribution for sample partial correlations between autocorrelated time series can thus be treated similarly to that of the correlation coefficients (see Eq. (10)).

A well-known result is that the sample partial correlation between independent observations is $t$-distributed, $t(\nu)$, under the null hypothesis $H_0 : \rho_{XY \cdot W} = 0$, with degrees of freedom $\nu = T - c - 2$ and $c = \dim(w(t))$ [60]. As such, the modified statistic and null distribution are:

$$ r_{xy \cdot w} \sqrt{\frac{\hat{\eta}(e_{x|w}, e_{y|w}) - c - 2}{1 - r_{xy \cdot w}^2}} \sim t\!\left( \hat{\eta}(e_{x|w}, e_{y|w}) - c - 2 \right). \tag{25} $$

Here, the (estimated) effective sample size $\hat{\eta}(e_{x|w}, e_{y|w})$ is still computed from Eq. (9) but with the autocorrelation functions of the residual vectors $e_{x|w}$ and $e_{y|w}$, rather than the original sample paths $x$ and $y$. Intuitively, this is because $r_{xy \cdot w}$ is itself a sample correlation of these residuals, so it is their autocorrelation, not that of the original time series, that directly determines the effective sample size. Another crucial addition is that the dimension of the conditional process $c = \dim(w(t))$ further reduces the number of degrees of freedom [60], for the same reason as in standard $F$-tests [4]. When the residual vectors, $e_{x|w}$ and $e_{y|w}$, are (serially) independent, then Eq. (25) becomes equivalent to the standard Student's $t$-distribution for partial correlation.
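To make the recipe of Eqs. (11) to (13) and (25) concrete, the following sketch residualises $x$ and $y$ on $w$ by ordinary least squares, estimates the effective sample size from the residual autocorrelations, and evaluates the modified $t$ statistic. It is an illustrative assumption-laden example rather than the authors' code: the helper names, the intercept in the regression, and the `max_lag` truncation of the sum in Eq. (8) are not specified in the text.

```python
import numpy as np
from scipy import stats
from scipy.signal import lfilter


def ols_residuals(target, w):
    """Residuals of `target` (T,) after OLS regression on `w` (T, c) plus an intercept."""
    design = np.column_stack([np.ones(len(target)), w])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef


def autocorr(e, max_lag):
    """Sample autocorrelations of e for lags 1..max_lag."""
    e = e - e.mean()
    c0 = np.dot(e, e)
    return np.array([np.dot(e[:-u], e[u:]) / c0 for u in range(1, max_lag + 1)])


def modified_partial_corr_test(x, y, w, max_lag):
    """Partial correlation (Eq. (13)) tested with the modified t-distribution (Eq. (25)).

    Assumption: the lag sum in Eq. (8) is truncated at `max_lag`.
    """
    T = len(x)
    w = np.asarray(w, dtype=float).reshape(T, -1)
    c = w.shape[1]
    ex, ey = ols_residuals(x, w), ols_residuals(y, w)
    r = np.dot(ex, ey) / np.sqrt(np.dot(ex, ex) * np.dot(ey, ey))
    # Effective sample size from the residual autocorrelations, Eqs. (8)-(9)
    u = np.arange(1, max_lag + 1)
    var_r = (1.0 + 2.0 * np.sum((T - u) / T * autocorr(ex, max_lag)
                                * autocorr(ey, max_lag))) / T
    eta = 1.0 + 1.0 / var_r
    dof = eta - c - 2.0
    t_stat = r * np.sqrt(dof / (1.0 - r ** 2))
    p_two_sided = 2.0 * stats.t.sf(abs(t_stat), dof)
    return r, dof, t_stat, p_two_sided


def ar1(rng, T, phi=0.9):
    """Hypothetical helper: AR(1) series driven by standard normal noise."""
    return lfilter([1.0], [1.0, -phi], rng.standard_normal(T))


rng = np.random.default_rng(2)
T = 1000
w = ar1(rng, T)
x = 0.5 * w + ar1(rng, T)   # x and y share only the common driver w
y = 0.5 * w + ar1(rng, T)
r, dof, t_stat, p = modified_partial_corr_test(x, y, w, max_lag=100)
```

Because `x` and `y` are conditionally independent given `w` in this toy setup, the null hypothesis holds; the effective degrees of freedom `dof` should nonetheless be far smaller than $T - c - 2$, reflecting the autocorrelation left in the residuals.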

B. Modified F-test for partial correlation

The Student's t-distribution in Eq. (25) allows for one-tailed (upper or lower) tests for the partial correlation by comparing the statistic (the LHS) against the quantiles of the t-distribution. For two-tailed tests, another common approach is to square the statistic and, subsequently, the null distribution. The square of a random variable $Z \sim t(\nu)$ that follows Student's t-distribution (with parameter $\nu$) follows an F-distribution with parameters 1 and $\nu$, i.e., $Z^2 \sim F(1, \nu)$. Thus, under the null hypothesis $H_0 : \rho_{XY \cdot W} = 0$, the square of the statistic in Eq. (25) (the LHS) follows an F-distribution:

$$ n_{xy|w} \frac{ r_{xy \cdot w}^2 }{ 1 - r_{xy \cdot w}^2 } \sim F(1, n_{xy|w}), \tag{26} $$

with an effective degree of freedom,

$$ n_{xy|w} = \hat{\eta}(e_{x|w}, e_{y|w}) - c - 2, \tag{27} $$

obtained from Eq. (9). We refer to the significance test that uses this distribution as the modified F-test. Note that a form of Eq. (26) without modifying the degree of freedom is commonly used for testing the coefficient of determination; thus this approach could also be used for constructing a finite-sample test of the coefficient of multiple correlation under autocorrelation.

IV. MODIFIED Λ-TESTS

Although the modified t- and F-tests introduced above are suitable for bivariate correlation-based measures, they are not appropriate for multivariate (and thus directed) null tests. Here, we introduce the $\Lambda^*$-distribution, which can be used for hypothesis testing of all linear dependence measures throughout this paper.

Recall that, for independent $X$ and $Y$ and $W$, Wilks' criterion (Eq. (14)) is Λ-distributed. The exact form of the Λ-distribution has been extensively studied, with known relationships to the F- and beta-distributions [30, 51]. The main purpose of this paper is to derive the finite-sample distribution of such statistics under autocorrelation, i.e., where $X(t) \not\perp\!\!\!\perp X(t - u)$ and $Y(t) \not\perp\!\!\!\perp Y(t - v)$ for some $u, v > 0$.

As we will show throughout this work, the distribution for Wilks' criterion with two independent but serially correlated processes can be described by products of Λ-distributed variables with different effective degrees of freedom. We denote this distribution as $\Lambda^*(n)$, with the parameter $n = (n_1, \ldots, n_b)'$ comprising the degrees of freedom of each independent Λ-distribution. That is, Wilks' criterion is, under the null hypothesis, $\Lambda^*$-distributed:

$$ \frac{ |s_{xy|w}| }{ |s_{x|w}|\,|s_{y|w}| } \sim \Lambda^*(n), \tag{28} $$

where the $\Lambda^*(n)$ distribution itself can be described by a product of independent Λ-distributed variables:

$$ \prod_{i=1}^{b} L_i, \quad \text{with } L_i \sim \Lambda(n_i, 1, 1). \tag{29} $$

Notice that this reduces to the Λ-distribution for two sets of independent variables [30]; however, with the $\Lambda^*$-distribution we are able to include the effective sample sizes.

Although the null distribution for the product of two independent Λ-distributed variates is known [30], deriving the exact distribution for the product of an arbitrary number of Λ-distributed variates is non-trivial. Fortunately, a relationship between the beta-, F-, and Λ-distributions [30, 51] allows for simple numerical methods. To generate the distribution $\Lambda^*(n)$, we could sample beta-distributed variables:

$$ \prod_{i=1}^{b} L_i = \prod_{i=1}^{b} V_i, \quad \text{with } V_i \sim B\!\left( \frac{n_i}{2}, \frac{1}{2} \right), \tag{30} $$

where $B(\alpha, \beta)$ is the beta distribution. Equivalently, one could sample independent F-distributed variables:

$$ \prod_{i=1}^{b} L_i = \prod_{i=1}^{b} \frac{n_i}{U_i + n_i}, \quad \text{with } U_i \sim F(1, n_i). \tag{31} $$

In our experiments (and open-source code), we opt to sample independent beta-distributed variables and construct the $\Lambda^*$-distribution from their product as per Eq. (30). Throughout this work, we refer to hypothesis tests that use the $\Lambda^*$-distribution and modify the degree of freedom to account for autocorrelation as "modified Λ-tests".

From the relationship between the beta-, F-, and Λ-distributions (see Eqs. (29)-(31)), it is clear that the modified Λ-test is a generalisation of the modified F-test, becoming equivalent for univariate statistics. For instance, returning to partial correlation, we have that:

$$ \frac{ |s_{xy|w}| }{ |s_{x|w}|\,|s_{y|w}| } = 1 - r_{xy \cdot w}^2 \sim \Lambda^*(n_{xy|w}), \tag{32} $$

where $n_{xy|w}$ is the effective degree of freedom. Thus, either the modified F-test or the modified Λ-test could be used for univariate statistics (i.e., ratios of conditional variances).

We can now derive explicit hypothesis tests for common directed and multivariate linear dependence measures using a similar approach. Note that, although this work explicitly targets linear dependence measures between time-series data, the modified Λ-test can be easily extended to more general likelihood tests for ratios of generalised variances under spatial autocorrelation [39], which is also known to affect the sampling properties of statistics. We begin by deriving tests for the (conditional) mutual information between both univariate and multivariate normally distributed processes.
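The beta-product construction in Eq. (30) lends itself to a short Monte Carlo sketch. The Python code below, using only the standard library, draws samples from $\Lambda^*(n)$ and turns them into an empirical p-value. The function names, the number of samples, and the lower-tail convention (small values of the Wilks-type statistic count as evidence of dependence) are assumptions made for illustration; this is not the authors' open-source implementation.

```python
import random


def sample_lambda_star(n, num_samples=10000, rng=None):
    """Draw samples from Lambda*(n) as products of independent
    Beta(n_i / 2, 1 / 2) variates, following Eq. (30)."""
    rng = rng or random.Random()
    samples = []
    for _ in range(num_samples):
        prod = 1.0
        for n_i in n:
            prod *= rng.betavariate(n_i / 2.0, 0.5)
        samples.append(prod)
    return samples


def lambda_star_p_value(statistic, n, num_samples=10000, rng=None):
    """Monte Carlo p-value (ASSUMED lower-tail convention): the fraction of
    null samples that are less than or equal to the observed statistic."""
    null = sample_lambda_star(n, num_samples, rng)
    return sum(s <= statistic for s in null) / num_samples


if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical example: statistic 1 - r^2 with r = 0.3 and one
    # effective degree of freedom n = 50 (univariate case, Eq. (32)).
    stat = 1.0 - 0.3 ** 2
    print("p-value:", lambda_star_p_value(stat, [50.0], rng=rng))
```

For a single degree-of-freedom entry this reduces to the modified F-test of Eq. (26); supplying several entries in `n` gives the multivariate case, under the independence assumption on the individual factors.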

V. MODIFIED TESTS FOR MUTUAL INFORMATION

In this section, we obtain hypothesis tests for the mutual information between multiple time series. We first present the hypothesis tests explicitly for conditional mutual information for bivariate time series and, by using the chain rule, obtain the null distribution for the mutual information between multivariate time series.

A. Two time series

The conditional mutual information for linear Gaussian processes (Eq. (18)) is equivalent to the statistic in Eq. (32). Therefore, estimates of conditional mutual information under the null hypothesis, $H_0 : I_{X;Y|W} = 0$, are $\Lambda^*$-distributed:

$$ \exp\!\left( -2\, \hat{I}_{x;y|w} \right) = 1 - r_{xy \cdot w}^2 \sim \Lambda^*(n_{xy|w}). \tag{33} $$

Of course, due to the relationship between the F- and Λ-distribution (noted in Eq. (32)), we can construct an equivalent modified F-test for conditional mutual information:

$$ n_{xy|w} \left[ \exp\!\left( 2\, \hat{I}_{x;y|w} \right) - 1 \right] \sim F(1, n_{xy|w}). \tag{34} $$

The null distributions we provide above explicitly account for autocorrelation via the effective degrees of freedom $n_{xy|w}$ and also reduce to the F-distribution for information-theoretic quantities when observations of the analysed time series are independent (cf. Eq. (17) by letting $\hat{\eta}(x, y) = T$). Further, when $x$ and $y$ are serially uncorrelated, and in the limit $T \to \infty$, Eq. (34) becomes equivalent to the $\chi^2$ null distribution for mutual information (see the discussion in Appendix A 2). Thus, the null distribution we present in Eq. (34) is a generalisation of both the standard F-test (which is applicable only for i.i.d. variables) and the asymptotic distribution (which is applicable only for infinite data).

Although they are special cases of the modified Λ-test, there are important distinctions here from the $\chi^2$-tests (16) and the standard F-tests (17) for conditional mutual information. The first is that we now have an effective sample size $\hat{\eta}$ that changes depending on the autocorrelation function of the residuals $e_{x|w}$ and $e_{y|w}$. The second is that the degrees of freedom $n_{xy|w}$ is further reduced by $c = \dim(w(t))$, the dimension of the conditional time series $w$, which appears in the finite-sample F-tests but not the asymptotic $\chi^2$-tests. Both of these differences introduce a significant bias in the estimation of linear dependence for many real-world applications, exemplified by the numerical simulations in Sec. VII A.

B. Multiple time series

Mutual information $I_{X;Y|W}$ can also be used to measure the dependence between multivariate processes $X$ and $Y$. Here, we apply the chain rule and the results from the previous section to obtain a partial correlation decomposition that can be used for constructing a null distribution in the presence of autocorrelation.

The chain rule provides a decomposition of mutual information as a sum of conditional mutual information terms:

$$ \hat{I}_{x;y|w} = \sum_{g=1}^{k} \sum_{h=1}^{l} \hat{I}^{\{gh\}}_{xy|w}. \tag{35} $$

That is, mutual information estimates $\hat{I}_{x;y|w}$ between a $k$-variate process $x$ and an $l$-variate process $y$, in the context of the $c$-variate concomitant $w$, can be computed by summing over conditional mutual information terms [33]. Each conditional mutual information term may be expressed as

$$ \hat{I}^{\{gh\}}_{xy|w} = \hat{I}_{x_g ; y_h | v^{\{gh\}}_{xy|w}}, \tag{36} $$

where the conditional for the $(g, h)$-term is given by

$$ v^{\{gh\}}_{xy|w} = \begin{pmatrix} x_{1:g-1} \\ y_{1:h-1} \\ w \end{pmatrix}, \tag{37} $$

with $x_{1:g} = (x_1', \ldots, x_g')'$ a $g \times T$ matrix when $0 < g \le k$, and the empty set, $x_{1:g} = \emptyset$, when $g = 0$. Using this notation, we have an equivalent expression for mutual information as

$$ \exp\!\left( -2\, \hat{I}_{x;y|w} \right) = \frac{ |s_{xy|w}| }{ |s_{x|w}|\,|s_{y|w}| } = \prod_{g,h} \exp\!\left( -2\, \hat{I}^{\{gh\}}_{xy|w} \right) = \prod_{g,h} \left( 1 - r^2_{x_g y_h \cdot v^{\{gh\}}_{xy|w}} \right). \tag{38} $$

Although this equation has a similar form to the well-known canonical correlation decomposition [30, 52], the correlations are over different variables. More importantly for our purposes, and assuming independence of these partial correlations (see below), its null distribution can thus be obtained from the $\Lambda^*$-distribution:

$$ \exp\!\left( -2\, \hat{I}_{x;y|w} \right) \sim \Lambda^*(n_{xy|w}), \tag{39} $$

with parameter vector

$$ n_{xy|w} = \left( n^{\{11\}}_{xy|w}, \ldots, n^{\{kl\}}_{xy|w} \right)'. \tag{40} $$

The remaining challenge is to compute the effective degree of freedom, $n^{\{gh\}}_{xy|w}$, for each independent partial correlation in Eq. (38). Recall that partial correlation can be computed from ordinary least squares. The residual vectors for the $(g, h)$ term in Eq. (38) are

$$ e^{\{gh\}}_{x|v} = x_g - \hat{x}_g\!\left( v^{\{gh\}}_{xy|w} \right), \tag{41} $$

$$ e^{\{gh\}}_{y|v} = y_h - \hat{y}_h\!\left( v^{\{gh\}}_{xy|w} \right), \tag{42} $$

where the hat $\hat{x}_g(\cdot)$ again denotes the linear prediction of $x_g$ from the input argument. Then, the degrees of freedom used in computing the $(g, h)$ sample partial correlation is:

$$ n^{\{gh\}}_{xy|w} = \hat{\eta}\!\left( e^{\{gh\}}_{x|v}, e^{\{gh\}}_{y|v} \right) - \dim\!\left( v^{\{gh\}}_{xy|w}(t) \right) - 2. \tag{43} $$

Throughout this analysis, we ordered the summations first over the dimensions of $y$, and then over the dimensions of $x$. In practice, the order of these operations is arbitrary and was initially imposed solely for clarity in the chain-rule formula.

It should be noted that the modified Λ-test assumes both that the residuals are completely independent and that the estimated degrees of freedom is approximately correct. If, instead, the residuals become slightly correlated due to statistical errors in the regression, the $\Lambda^*$-distribution should be generated by sampling dependent beta- or F-distributed variables. Further, the effective degree of freedom is a first-order approximation, which may introduce biases in the hypothesis tests. Although our numerical simulations in Sec. VII show no such biases, we discuss the potential solutions in Appendix D.

VI. MODIFIED TESTS FOR GRANGER CAUSALITY

Recall that Granger causality can be expressed as a conditional mutual information (see Eq. (20)). As such, we can leverage results from the previous section to introduce its null distribution. Many other information-theoretic and likelihood ratio-based measures could be similarly decomposed (from Wilks' criterion or conditional mutual information) in order to derive their finite-sample hypothesis tests.

A. Two time series

We shall first express the Granger causality for bivariate processes as a sum of conditional mutual information terms via the chain rule. Let upper indices (without parentheses) denote a backshifted variable, e.g., $X^j(t) = X(t - j)$ denotes the variable $X(t)$ lagged by $j$ time indices. Then, by applying the chain rule (35) to Granger causality (20), we can compute it as a sum of conditional mutual information estimates:

$$ \hat{F}_{y \to x|w}(p, q) = 2 \sum_{j=1}^{q} \hat{I}_{x ; y^j | v^{\{j\}}_{y \to x|w}}, \tag{44} $$

with the $j$th conditional as the matrix

$$ v^{\{j\}}_{y \to x|w} = \begin{pmatrix} x^{(p)} \\ y^{(j-1)} \\ w \end{pmatrix}, \tag{45} $$

and the limiting case giving $y^{(0)} = \emptyset$, i.e., the empty set. Again, following Eq. (39) we conclude that under the null hypothesis $H_0 : F_{Y \to X}(p, q) = 0$, Granger causality estimates are distributed as follows:

$$ \exp\!\left( -\hat{F}_{y \to x|w}(p, q) \right) \sim \Lambda^*(n_{y \to x|w}), \tag{46} $$

where

$$ n_{y \to x|w} = \left( n^{\{1\}}_{y \to x|w}, \ldots, n^{\{q\}}_{y \to x|w} \right)'. \tag{47} $$

Now, we can use the same approach from Sec. V to obtain the effective degrees of freedom used in computing Granger causality estimates. First, the residuals for the $j$th partial correlation in Eq. (44) are:

$$ e^{\{j\}}_{x|v_{y \to x|w}} = x - \hat{x}\!\left( v^{\{j\}}_{y \to x|w} \right), \tag{48} $$

$$ e^{\{j\}}_{y|v_{y \to x|w}} = y^j - \hat{y}^j\!\left( v^{\{j\}}_{y \to x|w} \right). \tag{49} $$

Thus, the number of degrees of freedom is different for each term, with the $j$th number computed as:

$$ n^{\{j\}}_{y \to x|w} = \hat{\eta}\!\left( e^{\{j\}}_{x|v_{y \to x|w}}, e^{\{j\}}_{y|v_{y \to x|w}} \right) - \dim\!\left( v^{\{j\}}_{y \to x|w}(t) \right) - 2. \tag{50} $$

Unlike the standard F-test for Granger causality (Eq. (24)), the modified Λ-test takes into account the effective number of degrees of freedom induced by autocorrelation in both the predictee and predictor processes, with the two approaches overlapping only when there is no autocorrelation in the residuals. This indicates that, with limited data, the F-test can only be used for assessing the significance of Granger causality estimates from independent observations ($y$) to univariate autocorrelated time series ($x$).

B. Multiple time series

Finally, we present hypothesis tests for the most complex linear dependence measure in the paper: the Granger causality from an $l$-variate predictor process $Y$ to a $k$-variate predictee process $X$, in the context of the $c$-variate concomitant process $W$. By virtue of the chain rule (35), this general expression of Granger causality (20) decomposes into three nested sums:

$$ \hat{F}_{y \to x|w}(p, q) = 2 \sum_{g=1}^{k} \sum_{h=1}^{l} \sum_{j=1}^{q} \hat{I}_{x_g ; y_h^j | v^{\{ghj\}}_{y \to x|w}}, \tag{51} $$

where the conditional for the $(g, h, j)$ mutual information term is:

$$ v^{\{ghj\}}_{y \to x|w} = \begin{pmatrix} x^{(p)}_{1:g} \\ y^{(q)}_{1:h-1} \\ y_h^{(j-1)} \\ w \end{pmatrix}. \tag{52} $$

That is, the $(g, h, j)$ term is the conditional Granger causality for dimension $g$ of $X$ and the predictor observation $j$ steps back of the $h$th subprocess of $Y$. This is conditioned on all dimensions and predictor observations below $g$, $h$, and $j$, as well as on $W$. Again, following Eq. (39), the distribution of Granger causality estimates, under the null hypothesis $H_0 : F_{Y \to X}(p, q) = 0$, is given as:

$$ \exp\!\left( -\hat{F}_{y \to x|w}(p, q) \right) \sim \Lambda^*(n_{y \to x|w}), \tag{53} $$

where

$$ n_{y \to x|w} = \left( n^{\{111\}}_{y \to x|w}, \ldots, n^{\{klq\}}_{y \to x|w} \right)'. $$

The effective degrees of freedom are, again, computed from the residuals, where the residual processes used in computing the $(g, h, j)$ partial correlation are:

$$ e^{\{ghj\}}_{x|v_{y \to x|w}} = x_g - \hat{x}_g\!\left( v^{\{ghj\}}_{y \to x|w} \right), \tag{54} $$

$$ e^{\{ghj\}}_{y|v_{y \to x|w}} = y_h^j - \hat{y}_h^j\!\left( v^{\{ghj\}}_{y \to x|w} \right). \tag{55} $$

The number of degrees of freedom for each term in the chained sum can then be computed from these residuals:

$$ n^{\{ghj\}}_{y \to x|w} = \hat{\eta}\!\left( e^{\{ghj\}}_{x|v_{y \to x|w}}, e^{\{ghj\}}_{y|v_{y \to x|w}} \right) - \dim\!\left( v^{\{ghj\}}_{y \to x|w}(t) \right) - 2. \tag{56} $$

We note that, although the same $p$ and $q$ are used for each of the subprocesses of $x$ and $y$ in our presentation, our decomposition facilitates setting an individual history length for each term in the chained sum. The only difference would be to infer the optimal history length ($p$ and $q$) for each residual vector in the chain.

VII. NUMERICAL VALIDATION

We perform numerical simulations in order to validate the modified Λ-test and characterize the effect of autocorrelation on the (unmodified) F- and $\chi^2$-tests. The simulations, detailed in Appendix B 1, involve generating observations from two first-order independent AR processes and iteratively filtering the output signal such that the autocorrelation is increased for both time series.

Following Barnett and Seth [4], we illustrate the finite-sample effects by generating relatively short stationary time series with $T = 2^9 = 512$ observations from the stochastic processes to obtain our dataset. We begin by sampling first-order AR processes to obtain our time-series data, $x$, $y$, and $w$. Then, to illustrate the effect of higher autocorrelation on the FPR of both methods, we digitally filter each time series along the time dimension with two types of low-pass causal filters: a finite-impulse response (FIR) linear-phase least-squares filter and an infinite-impulse response (IIR) Butterworth filter. The filter order is variable, with higher filter orders generally increasing the autocorrelation of the time series. For each experiment, we perform 1 000 trials and, using the statistical hypothesis-testing procedure described in Appendix A 4, we consider the FPR to be the proportion of p-values that are significant at the nominal level (typically 5% in this paper) in comparison to the relevant null distribution.

These experiments allow us to study how each test behaves under increasing levels of autocorrelation of both $x$ and $y$, whilst ensuring that the null hypothesis (of no dependence) is not violated. Rather than using a filter, it would of course be possible to increase the autocorrelation by selecting the ARMA parameters, $\Phi(u)$ and $\Theta(u)$, for each lag $u$. However, the formulations are equivalent: AR processes are all-pole IIR filters; MA processes are FIR filters; and ARMA processes are IIR filters with both poles and zeros. Given this equivalence, we opt for digitally filtering processes to increase their autocorrelation, as this is a common preprocessing step performed by practitioners to remove artefacts from time series (even differencing the signal is a type of filter). Moreover, as previously discussed, filtering the signals has been shown in the past to bias various dependence measures such as Granger causality [4]. Until this work, however, it has not been suggested that this bias is a function of autocorrelation, nor has a valid hypothesis test been proposed based on the autocorrelation function.

A. Mutual information tests for bivariate time series

First, we use this approach to evaluate the performance of the hypothesis tests on assessing the significance of mutual information estimates between two independent (but serially correlated) time series. Our results are shown in Fig. 1, where the "F-tests" are from the finite-sample distribution (Eq. (17)), "$\chi^2$-tests" refer to the asymptotic LR distributions (Eq. (16)), and the "Modified Λ-tests" refer to our hypothesis tests that account for autocorrelation (Eq. (33)). As the plots illustrate, both the F-tests and the $\chi^2$-tests overestimate the measures for higher filter orders (and therefore higher AR orders), yielding over 15% of false positives at the nominal significance of $\alpha = 0.05$, approximately three times the FPR expected from the test. The figures on the right illustrate the significance level $\alpha$ against the FPR for an 8th-order filter. From these figures, we can see that the FPR for the $\chi^2$-tests is higher than nominal for all significance levels

$\alpha \in (0, 1)$. In comparison, the modified Λ-test procedure yields the expected FPR for all filter orders.

FIG. 1. Modified tests correctly assess the significance of the mutual information estimated between two univariate time series under FIR (1a) and IIR (1b) filtering. The mutual information was measured (using Eq. (18)) between independent univariate time series (generated by Eq. (B2)) after filtering, and tested using the $\chi^2$-test (Eq. (16)), F-test (Eq. (17)), and the modified Λ-test (Eq. (33)). The plots show the effect of increasing the filter order on the FPR for both an (a) FIR filter and (b) IIR filter. The shaded regions on the right indicate $\alpha = 0.05$, whilst the shaded regions on the left show the 95% confidence interval for the FPR (as defined in Appendix A 4). The subplots on the right capture the FPR for all potential significance levels $\alpha$ ("Expected FPR") with an 8th-order filtered signal. An ideal distribution is where the FPR equals $\alpha$ and thus sits perfectly on the diagonal, as per our tests. [Plots omitted. Left panels: false positive rate (FPR) against filter order; right panels: actual FPR against expected FPR; curves: modified Λ-test, $\chi^2$-test, F-test.]

A filter order of zero in Fig. 1 refers to generating the time-series data with the first-order AR model (B2) without any digital filtering. In this case, the $\chi^2$- and F-tests yielded less than the nominal 5% FPR. This occurs when the number of effective samples becomes greater than the original sample size, $\eta(x, y) > T$. Referring to Bartlett's formula (9), this is due to the product of negative autocorrelation exhibited by the $Y$ process (induced by $\Phi_Y = -0.8$, indicating an effect of undersampling) and the positive autocorrelation exhibited by the $X$ process (induced by $\Phi_X = 0.3$). Counter-intuitively, this would imply that the observations are anti-correlated in time. Such conservative results for the $\chi^2$- and F-tests are likely to induce lower statistical power (i.e., a lower true-positive rate (TPR)) in scenarios when the effective sample size is greater than the original sample size. To verify this, we performed 1 000 trials where there was a small dependence of $X$ on $Y$ (see Appendix B 1). The TPR was 0.049 (SE of 0.0068) for the $\chi^2$-test and 0.1570 (SE of 0.0115) for the modified Λ-test. Thus, the power of our hypothesis test is about three times greater than that of the $\chi^2$-test in this scenario.

FIG. 2. Increasing the sample size does not mediate the effect of autocorrelation on the $\chi^2$- and F-tests for mutual information. We perform the same tests as Fig. 1, except with an exponentially increasing sample size and a fixed 8th-order FIR (2a) and IIR (2b) filter. The subplots on the right show the FPR for $T = 2^{11}$ samples. [Plots omitted. Left panels: false positive rate (FPR) against sample size; right panels: actual FPR against expected FPR; curves: modified Λ-test, $\chi^2$-test, F-test.]

In Fig. 2 we show that increasing the sample size does not mediate the effect of autocorrelation on the $\chi^2$- and F-tests. This is due to the fact that the effective degree of freedom is always a fraction of the degree of freedom. Thus, regardless of the sample size, both the asymptotic ($\chi^2$) and finite (F) tests are invalid, unless modified to account for effective sample size.
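The zero-filter-order case discussed above (where the effective sample size exceeds $T$) can be checked by hand under two stated assumptions: that the large-$T$ limit of Bartlett's formula (9), which is not reproduced in this section, takes the form $\eta(x, y) \approx T / \sum_{u} \rho_{XX}(u)\rho_{YY}(u)$, and that the unfiltered processes are first-order AR with autocorrelation functions $\rho_{XX}(u) = \Phi_X^{|u|}$ and $\rho_{YY}(u) = \Phi_Y^{|u|}$. The following is a sketch of that calculation rather than a result quoted from the paper:

$$
\sum_{u=-\infty}^{\infty} \rho_{XX}(u)\,\rho_{YY}(u)
= 1 + 2 \sum_{u=1}^{\infty} (\Phi_X \Phi_Y)^{u}
= \frac{1 + \Phi_X \Phi_Y}{1 - \Phi_X \Phi_Y}.
$$

With $\Phi_X = 0.3$ and $\Phi_Y = -0.8$, the product is $\Phi_X \Phi_Y = -0.24$, so that

$$
\eta(x, y) \approx T\, \frac{1 - \Phi_X \Phi_Y}{1 + \Phi_X \Phi_Y} = T\, \frac{1.24}{0.76} \approx 1.63\, T,
$$

i.e., roughly 835 effective samples for $T = 512$. Because the opposite-signed autocorrelations make the sum smaller than one, the unmodified tests use too few degrees of freedom in this case and are therefore conservative.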

B. Conditional mutual information tests for bivariate time series

We now extend the previous results by evaluating the effect of conditioning mutual information between $x$ and $y$ on an independent, tertiary process $w$. The FPRs for the $\chi^2$-tests, F-tests, and the modified Λ-tests from these experiments are presented in Fig. 3, and exhibit similar characteristics to those of the mutual information tests in Fig. 1. Increasing the filter order generally increases the FPR for both unmodified tests, yet the modified Λ-test remains unbiased, maintaining the expected FPR of 5%.

FIG. 3. Modified tests correctly assess the significance of the conditional mutual information estimated between two univariate time series, conditioned on a third univariate time series; each time series underwent FIR (3a) or IIR (3b) filtering. Conditional mutual information was measured (using Eq. (18)) between two univariate time series, given a third univariate process (generated by Eq. (B2) with $k = l = c = 1$), after filtering, and tested for significance using the $\chi^2$-, F-, and modified Λ-tests. The subplots on the right show the FPR for each significance level $\alpha$ when the signal is filtered with an 8th-order filter. [Plots omitted. Left panels: false positive rate (FPR) against filter order; right panels: actual FPR against expected FPR; curves: modified Λ-test, $\chi^2$-test, F-test.]

As discussed in Sec. V, an important distinction between the null distributions for mutual information (Eq. (33)) is that the effective degree of freedom in the $\Lambda^*$-distribution not only includes the effective sample size but also the dimension of the conditional $c$. To show the severity of the asymptotic approximation, we generate the $x$ and $y$ processes and filter the signal with an 8th-order FIR and IIR filter the same as before; however, we increase the number of independent processes $c = \dim(W(t))$ in the multivariate conditional. The result is shown in Fig. 4, where the FPR of the F- and $\chi^2$-tests increases somewhat linearly with the dimension, whereas the modified Λ-test remains unbiased. As evidenced by the improvement of the unmodified F-test over the $\chi^2$-test, this experiment demonstrates that even when the autocorrelation function is the same, the dimension of the conditional must also be included in the hypothesis tests.

FIG. 4. The FPR of asymptotic tests increases with the dimension of the $c$-variate conditional process. The plots are the same as per Fig. 3, except as a function of $c$ with a fixed 8th-order FIR (4a) and IIR filter (4b). The subplots on the right show the FPR for each significance level $\alpha$ when $c = 100$. [Plots omitted. Panels: (a) FIR filter with $c$-variate conditional, (b) IIR filter with $c$-variate conditional. Left panels: false positive rate (FPR) against dimension of conditional $c$; right panels: actual FPR against expected FPR; curves: modified Λ-test, $\chi^2$-test, F-test.]

C. Mutual information tests for multivariate time series

In this section, we present results for the hypothesis tests of mutual information between multivariate ($m > 2$) time series. The multivariate time series are partitioned into two independent sets of processes, $X$ and $Y$, one with dimension $k$ and one with dimension $l$. For each experiment, we let $k = l$ and use the state equations in Eq. (B2) to simulate $m = k + l$ independent AR

processes for $m = \{2, 4, \ldots, 10\}$. These signals are then filtered along the temporal dimension using 8th-order FIR and IIR filters. This signal generation process ensures that there is no correlation between signals within the same subprocess, i.e., $\rho_{ij} = 0$ for all $i \neq j$ with $i, j \in [1, m]$. The results are shown in Fig. 5, where increasing the dimension approximately linearly increases the FPR of the original LR test to over 70% for both filters (continuing to increase for larger $k$ and $l$), yet the modified Λ-test remains unbiased.

FIG. 5. Modified Λ-tests correctly assess the significance of the mutual information estimated between two multivariate time series. The mutual information was measured (using Eq. (38)) between the multivariate time series (generated by Eq. (B2) with an increasing dimension $k$ and $l$ of $X$ and $Y$), passed through 8th-order FIR (5a) and IIR (5b) filters, and tested for significance using the $\chi^2$-test (Eq. (16)) and the modified Λ-test (Eq. (39)). The subplots on the right show the FPR for each significance level $\alpha$ when $k = l = 3$. [Plots omitted. Left panels: false positive rate (FPR) against dimension $k, l$; right panels: actual FPR against expected FPR; curves: modified Λ-test, $\chi^2$-test.]

In the tests above, no correlations between subprocesses were included (e.g., $X_i(t)$ with $X_j(t - u)$ for $j \neq i$); however, an internal cross-correlation between any of these subprocesses may further decrease the size and power of unmodified tests. Our experiments of Granger causality in the following sections naturally incorporate examples with correlated subprocesses in the mutual information calculation.

FIG. 6. Modified Λ-tests correctly assess the significance of the Granger causality estimated from one univariate time series to another. Granger causality, with history lengths $p$ and $q$ chosen via Burg's method, is estimated (using Eq. (44)) between univariate time series (generated by Eq. (B2)) after smoothing with an FIR (6a) and an IIR (6b) filter. Estimates are then tested using the $\chi^2$-test (Eq. (22)), the F-test (Eq. (24)), and the modified Λ-test (Eq. (46)). The subplots on the right show the FPR for each significance level $\alpha$ with an 8th-order filter. [Plots omitted. Left panels: false positive rate (FPR) against filter order; right panels: actual FPR against expected FPR; curves: modified Λ-test, $\chi^2$-test, F-test.]

D. Granger causality tests for bivariate time series

This section examines the performance of each hypothesis test on estimates of Granger causality using the same (univariate) simulations from Sec. VII A. That is, the bivariate AR model (Eq. (B2) with $k = l = 1$ and no conditional, $c = 0$) is simulated to generate $T = 512$ observations of the $X$ and $Y$ processes, which are then passed through FIR and IIR filters. Referring to Fig. 6, we perform this with each filter order and each filter type (FIR and IIR). After generating these sample paths, the AR order of the predictee, $p$, and the predictor, $q$, were inferred from Burg's method [57]. We then compute Granger causality (via Eq. (44)) and use this estimate to obtain p-values from the CDFs of the F-distribution (Eq. (24)), the $\chi^2$-distribution (Eq. (22)), and the $\Lambda^*$-distribution (Eq. (46)). This is performed 1 000 times in order to obtain an FPR for each approach. Whilst our experiment here is equivalent to a conditional mutual information, unlike the experiments in previous sections, the autocorrelation within the time series naturally induces a cross-correlation amongst the variables within the predictor $Y^{(q)}(t)$ and conditional $X^{(p)}(t)$ processes. The results shown here illustrate that increasing the autocorrelation length via filtering increases the FPR of Granger causality under the F- and $\chi^2$-tests, particularly when using an IIR filter. In contrast, the FPR of Granger causality using the modified Λ-tests remains mostly unbiased. It should be noted that, although within the confidence bounds, the FPR of the modified Λ-tests appears not to be exact for high-order filters; sources of error regarding this potential bias are discussed in Appendix D.

The model order for our experiments above was chosen using Burg's method; however, there are numerous approaches to inferring the "optimal" AR order as outlined in the introduction, all of which can result in vastly different model orders. To ensure our results are not a consequence of poor model identification, in Fig. 7 we illustrate the FPRs for increasing the predictor history length $q$ from one to 200 (in increments of 20), whilst holding the autocorrelation length constant with an 8th-order filter. This effectively introduces more terms in Eq. (44), causing a larger divergence between the F-, $\chi^2$-, and modified Λ-tests. As expected, the FPR of the $\chi^2$- and F-tests linearly increases in this range, whereas the modified Λ-test remains consistent with the 5% FPR. This linear increase of the FPR in $\chi^2$- and F-tests is somewhat counter-intuitive to the notion of Granger causality, where one may expect that accounting for more history would reduce spurious correlations. However, the opposite is true, simply due to a lack of correct finite-sample distributions (in the case of the unmodified tests).

FIG. 7. The FPR of the F- and $\chi^2$-tests for Granger causality increases with an increasing dependence on the past for time series generated by an 8th-order FIR (7a) and IIR (7b) filter. We illustrate this by varying the history length of the predictor process, $q$, from one to 200. The subplots on the right show the FPR for each significance level $\alpha$ when $q = 20$ (for Fig. 7a) and $q = 40$ (for Fig. 7b) to approximately match the orders chosen via Burg's method in Fig. 6. [Plots omitted. Left panels: false positive rate (FPR) against history length $q$; right panels: actual FPR against expected FPR; curves: modified Λ-test, $\chi^2$-test, F-test.]

In Fig. 8 we show the effect of increasing the sample size for tests on Granger causality estimates. Here, we can see the $\chi^2$- and F-tests converging for sufficient sample sizes. Unlike mutual information (from Fig. 2), estimating Granger causality involves regressing the autocorrelation of the predictee first, with the variance of these residuals reducing as the sample size grows. Thus, the effective sample size asymptotically approaches the sample size; however, the precise rate of this convergence

FIG. 8. The F- and $\chi^2$-tests converge to the modified Λ-test for Granger causality with large sample sizes. This experiment is the same as Fig. 6, except with an exponentially increasing sample size and a fixed 8th-order FIR (8a) and IIR (8b) filter. [Plots omitted. Left panels: false positive rate (FPR) against sample size $T$; right panels: actual FPR against expected FPR; curves: modified Λ-test, $\chi^2$-test, F-test.]

depends on the autocorrelation function and may change for every pair of time series. Even for this simple example, we see that on the order of 100 000 samples are required for convergence, which is not realistic in many empirical scenarios.

[Figure 9: panels (a) and (b) plot the false positive rate (FPR) against the dimension k, l for the modified Λ-test and the χ²-test; insets plot actual against expected FPR.]

FIG. 9. Modified Λ-tests correctly assess the significance of the Granger causality estimated from one multivariate time series to another for increasing dimension k, l. Granger causality, with each predictee history length chosen optimally, is estimated (using Eq. (51)) between multivariate time series (generated by Eq. (B2)) after filtering via 8th-order FIR (9a) and IIR (9b) filters. Estimates are then tested using the χ²-tests (Eq. (22)) and our modified Λ-test (Eq. (53)). The subplots on the right show the FPR for each significance level α when k = l = 3.

E. Granger causality tests for multivariate time series

Finally, we can evaluate the effect of increasing the dimensionality of both processes on Granger causality inference. In these experiments, we vary the dimension of the processes X and Y from one to five. Recall from Eq. (51) that the number of terms involved in computing Granger causality (and its null distribution) is the product klq, of the dimensionality (k, l) and the history length of the predictor (q). Due to the relatively short time series length of T = 512 samples and high autocorrelation and dimensionality, allowing an arbitrary predictor history length of q results in the effective sample size approaching zero for the modified Λ-tests. Thus, for these experiments we fix the history length of the predictor to q = 1.

The results are shown in Fig. 9, where the χ²-tests inflate the FPR close to 100% for higher-dimensional processes. Although our corrected tests perform well for moderate dimensionality, when k, l > 3 with the IIR filter, the FPR of our modified Λ-tests begins to have numerical issues. This is caused by the regression matrix not being well conditioned, i.e., the ratio of regressors to data points is too high. Nonetheless, we can see from the figure that our tests maintain a low FPR, becoming more conservative when the regression is ill-posed. Moreover, a poorly conditioned regression can be easily tested for in practice. So, we conclude that with minimally sufficient observations, our tests maintain the desired FPR even for the most general case of multivariate Granger causality and, when the sample size is simply too small for reliable inference, our approach flags this as an issue.

VIII. EFFECT OF PREWHITENING

The rationale for applying prewhitening is to remove the autocorrelation in one time series such that the variance of computed statistics becomes equivalent to serially independent observations [12]. That is, instead of modifying the hypothesis tests, prewhitening modifies the input time series and thus the statistics themselves. Prewhitening is typically attempted by first inferring a model of one time series (x), and transforming the process x to a residual process $\tilde{x}$ through a filter constructed from its model. The same filter (with parameters inferred from x) is then applied to the other time series y to create $\tilde{y}$. Specifically, assuming any arbitrary ARMA(p, q) model for time series x, prewhitening involves learning the parameter vectors $\hat{\Phi}$ and $\hat{\Theta}$ from x, and then filtering the raw signals through the following equations:

$$\tilde{x}(t) = x(t) - \sum_{u=1}^{p} \hat{\Phi}(u)\, x(t-u) - \sum_{u=1}^{q} \hat{\Theta}(u)\, \tilde{x}(t-u),$$

$$\tilde{y}(t) = y(t) - \sum_{u=1}^{p} \hat{\Phi}(u)\, y(t-u) - \sum_{u=1}^{q} \hat{\Theta}(u)\, \tilde{y}(t-u).$$

Using the same linear transformation (filter) for both time series renders their correlation theoretically invariant [12]; however, the assumption is that $\tilde{x}$ is now serially independent, and so the variance of sample correlations (for instance) converges to $\hat{\sigma}_r(x, y) = 1/T$.

A significant challenge in prewhitening signals is in selecting an appropriate model of the autocorrelation function. Of course, for the same arguments as presented in Sec. II, after mean removal and differencing, the most general models for covariance-stationary time series are ARMA models. However, inference procedures to learn

both the order and parameters of ARMA models are computationally expensive. As such, many authors (and textbooks [12]) propose that an AR(p) model would suffice to render the residuals, $\tilde{x}$, independent, presuming that autocorrelations decay rapidly for stationary time series. Since this is the same assumption underlying Granger causality (regarding the residuals on the target after fitting an AR(p) model), our results from Sec. VII D suggest that statistics computed from signals prewhitened in this way will remain biased. In fMRI research, the most popular packages that are used for preprocessing time-series data (AFNI, FSL, and SPM) are similarly insufficient for handling autocorrelation due to their simplistic models [15]. Specifically, the package AFNI uses an ARMA(1,1) model learned from each voxel, whereas FSL uses Tukey tapering to smooth the data (see Appendix D), and SPM uses one global AR(1) model for all processes. Thus, each package assumes a fixed ARMA model can describe any arbitrary-order process. This is clearly insufficient, and if the wrong model is used, the residuals $\tilde{x}$ remain dependent, resulting in an unknown variance of statistical estimates. This is evidenced by consistently high FPRs in empirical studies [15].

[Figures 10 and 11: panels (a) and (b) plot the false positive rate (FPR) of the F-test against filter order (0 to 30) after prewhitening with AR(1), ARMA(1,1), AR(p), and ARMA(p,q) models; insets plot actual against expected FPR.]

FIG. 10. Prewhitening the time series does not mediate the bias in F-tests for mutual information estimates after FIR (10a) or IIR (10b) filtering. The experiments are as per Fig. 1, with four prewhitening approaches used: the AR(1), ARMA(1, 1), AR(p), and ARMA(p, q). For the AR(p) and ARMA(p, q) models, the optimal order is inferred from Burg's method and the BIC score. In many cases for the IIR-filtered time series, either the AR(p) or the ARMA(p, q) failed to learn a stable model, and so these trials were removed, inducing non-uniform confidence intervals (reflected in the shaded region).

FIG. 11. Prewhitening the time series does not mediate the bias in F-tests for Granger causality estimates after FIR (11a) or IIR (11b) filtering. The experiments are as per Fig. 6, with four prewhitening approaches used (as per Fig. 10): the AR(1), ARMA(1, 1), AR(p), and ARMA(p, q).

For completeness of this paper, however, we implement a number of prewhitening schemes in order to illustrate that such an approach is insufficient for assessing linear dependence between typical time series. Our experiments, on the same synthetic time series used in Sec. VII, show the effect of prewhitening univariate signals on the unmodified F-test for a number of different models: the AR(1) and ARMA(1, 1); as well as the AR(p) and ARMA(p, q) with the optimal model orders (p and q) learned from data. For the AR(p) model inference, we allow for p ∈ [1, 200] (where the sample size is T = 512), with the model order selected (and parameters inferred) using Burg's method [57]. Given the difficulty of learning higher-order ARMA(p, q) models, however, we restrict our search space, iterating through each potential p, q ∈ {1, ..., 5} and selecting the model with the lowest BIC score.

Our results for the performance of the F-test on mu-

Time series data Mutual information Granger causality

Raw t 1 tt 1 t t p 1 tt X;Y W − − − − Filtered I | X Y )

t W ( t q x Time − Significance level α Significance level α Significance level α 010101 100% ) t ( y

Test at 5% False positive rate (FPR) 0%

Modified Λ-test (fixed): 5.5% Modified Λ-test: 4.7% Modified Λ-test: 5.9% Modified Λ-test (optimal): 6.5% Autocorrelation χ2-test: 41.8% χ2-test: 83.4% χ2-test (optimal): 16.9% Lag χ2-test (fixed): 65.3%

FIG. 12. Application to brain-imaging data, demonstrating our correction for the otherwise dramatic inflation of false- positive rates (FPRs) of classical hypothesis tests of dependence estimates in the presence of autocorrelation. We perform 1 000 experiments with both the χ2-tests as well as the modified Λ-tests for inferring the significance of two dependence measures: mutual information(between two univariate and two bivariate processes), and Granger causality. For each of these experiments, we randomly selected two uncorrelated fMRI time series, X and Y , from the Human Connectome Project [44] (see AppendixB2 for more details). A sample window of these univariate processes are shown in the left panel (grey lines), with the common preprocessing technique of band-pass digital filtering later applied to the signals (red lines). At the bottom of the left panel, we plot the sample autocorrelation function from both the raw and the filtered signal, illustrating the higher autocorrelation length (and lower effective sample size) induced by digital filtering. The panels on the right show the FPR of each test applied to the filtered signals as a function of the significance level α. For ideal hypothesis testing, we expect a line along the diagonal (i.e., the FPR equals the significance level). The χ2-tests illustrate an increased FPR for all measures, as seen by both the plots and the FPR at 5% significance level. In contrast, our tests remain consistent with the expected FPR. tual information estimates after prewhitening are shown test [61] and, moreover, remained biased in most scenar- in Fig. 10. Contrasting these results to Fig.1, the only ios. We conclude that, regardless of the model selected, benefit of prewhitening appears to be for the FIR-filtered the additional burden of prewhitening over using a modi- time series when an ARMA(p, q) model is used. 
In al- fied hypothesis test is unjustified and that the outcome of most all other scenarios, the FPRs are either equiva- unmodified hypothesis tests (with or without prewhiten- lent to, or worse than, the original tests. Figure 11 il- ing) conveys inconsistent information about the underly- lustrates the effect of prewhitening on Granger causal- ing dependence structure between time series. ity F -tests (compare to Fig.6 without prewhitening). Again, prewhitening appears to serve no benefit to tests for Granger causality, even for the relatively advanced IX. CASE STUDY: HUMAN CONNECTOME ARMA(p, q) model. Concerningly, for IIR-filterd signals, PROJECT DATASET the FPR increases to over 60% for all ARMA models with no scenario where the prewhitening approach results in an FPR within the expected range. In AppendixC we Studies in computational neuroscience often leverage demonstrate similar results when using BIC and AICc statistical analysis in order to formulate and test biologi- scoring functions (as an alternative to Burg’s method) to cally plausible models. An important application in this infer AR(p) models for prewhitening. field is the study of the human brain through fMRI, which is abundant with short, autocorrelated time series that It is possible that, with an ideal model, the F - or χ2- have been studied using the measures of interest here [1– tests used for a prewhitened time series may be com- 5]. As such, it is an archetypal real-world application parable (in size properties or FPR) to the modified Λ- to illustrate the issues of autocorrelation for time-series test. However, even restricting our search space to an analysis. ARMA(5, 5) model resulted in an approximately 5 000- In fMRI research, the blood-oxygen level-dependent fold increase in computational time over the modified Λ- (BOLD) data is translated into a slowly varying (and 18 thus highly autocorrelated) multivariate time series that measures. 
By framing different dependence measures in traces the haemodynamic response of different locales unified theoretical terms, we provide the first demonstra- (voxels) in the brain. Digital filtering is then commonly tion of how Bartlett’s formula can be applied to derive used as a preprocessing step to reduce line noise, nonsta- unbiased hypothesis tests, termed modified Λ-tests, for tionarity and other artefacts in neuroimaging data. This mutual information (and, consequently, Wilks’ criterion induces an (either finite or infinite) impulse response that and Granger causality) for both univariate and multi- can increase autocorrelation, even if the original signals variate time-series data. These measures are used in a were not serially correlated. To characterise the FPR of wide range of disciplines, modelling myriad important linear dependence measures between empirical time se- processes from anthropogenic climate change [26] to the ries, we use completely independent time series from a brain dynamics of dementia patients [62]. The continued widely accessed brain-imaging dataset known as the the use of flawed testing procedures in empirical sciences is Human Connectome Project (HCP) resting state fMRI problematic, making it imperative that the corrections (rsfMRI) dataset [44] (see AppendixB2 for more detail; reported here be incorporated into future studies. time series are selected from different random regions of The effect of temporal autocorrelation on linear de- interest from different random subjects to ensure independence has long been investigated in statistics, how- pendence, and then digitally filtered). This process is ever the majority of research has focused on simple lin- shown in Fig. 12, where a sample window of the raw and ear correlations [1,2, 10, 11, 40], which are restricted filtered data from two independent time series appears to measuring symmetric bivariate dependence structures. on the left panel. 
These studies, while representing important milestones First, we illustrate the effect of autocorrelation on mu- in inferring the association between univariate time se- 2 tual information by using a χ -test, with significance level ries, suffer from the inability to capture both multivariate 5%, computed with T = 800 samples of each independent and directed dynamical dependence. Even though mea- time series. We begin by estimating mutual information sures such as mutual information and Granger causality ˆ Ix;y between two unrelated time series x and y, sampled can model a much richer set of dependence structures from the HCP dataset. We find that the test results are in temporal processes, the notion that their sampling significantly biased, yielding an FPR of 41.8% (a 9-fold properties could be altered in the presence of autocor- increase of the expected rate). After prewhitening with relation has thus far been largely overlooked. A major an AR(p) filter, this increased to over 88% (not shown in challenge of handling autocorrelation for more involved the figure). The modified Λ-test demonstrated an FPR dependence measures was in extending Bartlett’s formula of 4.7%, which is well within the acceptable confidence to multivariate relationships. Crucially, our approach fa- interval. Next, we perform the same tests but with the cilitates not only independent multivariate relationships mutual information between two sets of bivariate time but also correlated multivariate processes. Our results on series x and y (k = l = 2); this yields an 83% FPR the mutual information between multivariate time series (a 16-fold increase). When we correctly test these same used examples where the subprocesses were all indepen- measurements with the modified Λ-test (Eq. (39)), we dent. When extending this approach to Granger causal- find a FPR of 5.2%, matching the desired level within ity for arbitrarily large history lengths and dimensional- the confidence bounds. 
ity, the subprocesses become inherently correlated (since For Granger causality, we perform two different tests: one subprocess is a time-lagged version of another, and scenario (i) with the model orders, p and q, inferred for they involve significant autocorrelation). This provides each trial, and scenario (ii) with a fixed p = q = 100 strong evidence that our approach is a generalisation of for all trials. For unrelated signal pairs, the FPR of Bartlett’s dependence studies that is able to handle a Granger causality estimates using the χ2-test is 16.9% much richer class of multivariate dependency structures (with optimal history length around 18), increasing fur- beyond those already presented in this paper. ther to 90.5% after prewhitening with an AR(p) filter. Granger causality is the de facto measure of directed If a longer embedding length of 100 is chosen, the FPR dependence between stationary time series. The typi- is 66%—more than thirteen times the expected value— cal approach to assessing the significance of (potentially increasing to 83.3% after prewhitening with an AR(p) multivariate) estimates has been via the finite-sample F - model. When we test these same measurements with the tests [63] or the asymptotic χ2-test [45]. In this study, we modified Λ-test (46), we find a FPR of 5.5% and 6.5% have shown that higher levels of autocorrelation (equiva- for scenario (i) and (ii), respectively, completely remov- lently, a higher-order autoregression) in the signal inflates ing the false-positive bias exhibited by the χ2-tests. the variance of these statistical estimates, inducing significant bias for these traditional hypothesis tests. This means that using these tests induces errors when the pre- X. DISCUSSION dictor process is serially correlated—precisely the situa- tion that Granger causality was designed to address. 
One We have shown that the autocorrelation exhibited in might logically surmise that this issue could be mediated covariance-stationary time-series data induces bias in the by accounting for a longer history of the process, i.e., con- hypothesis tests of a broad class of linear dependence ditioning on additional AR variables. Referring to Fig.6, 19 we have shown that this will result in the FPR being even that the time-series innovations are Gaussian and that further inflated. This is particularly concerning given all relationships are linear. Thus, we have only discussed that the autocorrelation function is one of the two prop- linear-Gaussian probability distributions for information- erties that define a stationary process [7,8, 46]—the other theoretic measures. When instead applied to nonlin- is its mean, which has no effect on scale-invariant depen- ear time series, these probability distributions are often dence measures such as Granger causality, mutual infor- inferred using non-parametric density estimation tech- mation, and Pearson correlation. Similar to the think- niques such as nearest neighbour or kernel methods [67]. ing behind Granger causality, another common approach Spurious estimates of the nearest neighbour counts have to remove autocorrelation is prewhitening, which intends previously been observed for autocorrelated signals by to induce a serially independent process (of residuals) Theiler [68], who provided a solution by excluding ob- for hypothesis testing. Our experiments in Fig. 10 and servations that are close in time. This is now a popu- Fig. 11 (as well as AppendixC) illustrated that many lar approach to effectively account for autocorrelation in common approaches to prewhitening fail to control the density estimation for nonlinear time-series analysis. In FPR and, indeed, can further reduce the size of the un- fact, in introducing transfer entropy—now understood modified (F and χ2) hypothesis tests. 
We conclude that as a model-free extension of Granger causality [35]— the unmodified hypothesis tests cannot be used to reli- Schreiber explicitly recommended the use of a Theiler ably infer the significance of Granger causality or mu- window (also known as serial- or dynamic-correlation ex- tual information estimates when applied to covariance- clusion) when kernel estimation methods are used [36]. stationary time series and should be replaced with mod- The Theiler window approach has been demonstrated to ified tests, such as the modified Λ-tests, particularly for control the FPR for such estimators in practice [69], yet limited time-series data. remains a heuristic with no theoretical guarantees and, Prior work [4] had already established that digital similar to Pearson correlation, is often neglected in prac- filtering led to biased Granger causality estimates for tical estimation of transfer entropy [70]. We hypothesize shorter time-series, yet this effect was not understood nor that the methods outlined in our work could be extended able to be corrected until now. Due to the widespread in future to provide a more rigorous approach to handling use and influence of Granger causality across fields in- autocorrelated nonlinear time series through, e.g., non- cluding neuroscience, ecology and economics, underlined linear versions of Bartlett’s formula [71], facilitating a by any examination of the literature (see Sec.I), this broader class of information-theoretic measures. was a serious deficiency for directed inference of relation- Finally, the dependence structure discussed in this ships in time-series analysis. Much like correlation co- work is assumed to be in the time domain, whereas many efficients, the issue was magnified in fields dealing with empirical studies are concerned with other forms of de- short, highly autocorrelated time-series, as demonstrated pendence that could similarly bias hypothesis tests. Fu- in Fig. 
12 for computational neuroscience using fMRI ture work will be required to consider handling such cor- recordings. Many extensions to Granger causality have relation structures in a similar fashion to that which we been proposed to explicitly model the autocorrelation have presented, e.g., spectral models [21] or spatial au- function (such as ARMA [64] or state-space Granger tocorrelation [39]. Indeed, the formula for the effective causality [65]). However these approaches are also known sample size (Eq. (9)) was developed for spatial autocor- to exhibit significant false-positive biases [66], aligning relation and can thus be easily extended to handle spa- with our results in Sec.VIII on the ineffectiveness of tiotemporal autocorrelation, allowing for even broader prewhitening with ARMA models. In this paper, we class of null distributions that can be considered with showed that modifying the effective degrees of freedom of the modified Λ-test. the null distribution suffices to eliminate the bias across all examined time series, without the additional burden of inferring complex models or prewhitening. More advanced methods (such as state-space Granger causality) may retain other empirical advantages, but their hy- Appendix A: Statistical hypothesis tests pothesis tests are likely to require incorporation of similar modifications based on effective sample sizes. More The linear dependence measures discussed in this pa- broadly, our results strongly suggest that any hypothesis ˆ per are positive real-valued random variables Λ ∈ R>0 tests dealing with time-series analysis should be modified that can be expressed as the ratio of the generalised vari- to account for autocorrelation, regardless of regressing or ance of two models. In general, we consider two nested conditioning on AR components. 
The concerningly high models, the ‘restricted’ model with p0 parameters (un- FPRs exhibited in our experiments suggest that relation- der which the null hypothesis H0 is true) and the ‘unre- ships established using previously tested Granger causal- stricted’ model with p1 parameters (under which the al- ity estimates should perhaps be revisted; particularly in ternate hypothesis H1 is true). These models are referred fields that have high levels of autocorrelation and limited to as nested since p0 < p1 and the restricted model pa- data. rameter space is a subset of the unrestricted model space. Throughout this work, we have made the assumption The statistics are expressed in terms of the generalised 20 sample variance of these models: of (A3), i.e., − h i |s0| T p1 ˆ Λˆ = , (A1) Λ − 1 ∼ F (p1 − p0,T − p1). (A4) |s1| p1 − p0 Thus, the F -statistic is a function of the LR of two where si is the the residual sum-of-squares for model i models and we can show its asymptotic distribution and |si| is the generalised sample variance. The generalised sample variance is, asymptotically, inversely pro- is chi-square. First, a Taylor expansion of the LHS ˆ ˆ portional to the likelihood of each model, and so taking of the F -statistic in Eq. (A4) gives log (Λ) ≈ Λ − 1. the log of the ratio of generalised sample variances (A1) Moreover, for a random variable X ∼ F (ν1, ν2), then 2 is equivalent to the LR between two models (for a large Y = limν2→∞ ν1X has the chi-square distribution χ (ν1). enough number of samples) [8, 64]. For this reason, Thus, by this asymptotic relationship between the F - and 2 statistics of the form in Eq. (A1) appear in a number χ -distributions, we have: of linear dependence measures, such as mutual informa- d lim (T − p1) log (Λ)ˆ → (p1 − p0) F (p1 − p0,T ), (A5) tion (with Gaussian marginals) and Geweke’s definition T →∞ of Granger causality [45]. 2 T log (Λ)ˆ ∼ χ (p1 − p0). (A6)

This result is discussed throughout the paper to explain 1. Asymptotic likelihood-ratio test the diverging behaviour between the two tests.

Wilks’ theorem [54] is the basis of the χ2-test, which states that a test statistic constructed from the 3. Surrogate-distribution tests LR of nested models will asymptotically follow a χ2- distribution under the null hypothesis. Since all statistics Another established approach to empirically generat- 2 used in this work fit this definition, a χ -test can be used. ing a null distribution involves permuting, re-drawing That is, if the true model is the restricted model, then or rotating the observations of one variable x or y as T → ∞, the statistic is chi-square distributed: and computing the relevant statistic for each surrogate dataset [4, 49, 67]. Naive approaches to permuting or d 2 T log (Λ)ˆ → χ (p1 − p0). (A2) re-drawing will completely destroy the autocorrelation profile of that variable, making this empirical distribu- However, as we show throughout the main text, the χ2- tion representative of serially independent observations test has a significant bias when applied to limited and and similar to the analytic F -distribution. Indeed, such autocorrelated time-series data, which results in a large empirical generation of the CDF via permutation testing number of false positives. was attempted (for Granger causality) by Barnett and It is important to note that the LR test is but one Seth [4], and shown to incur the same inflated FPR is- of three classical procedures for hypothesis testing max- sues as the F - and χ2-tests. Alternatively, constrained imum likelihood estimates; the others are the Wald test realisation approaches [72] can be used to generate surro- and the Lagrange multiplier test [7,8, 46]. The three gate time-series data that exhibit certain properties (such tests overlap because the null distribution of each asymp- as the same power spectra) and have recently been shown totically follows the χ2-distribution. Thus, the same is- effective for handling autocorrelation in EEG data [73]. 
sues hold if one were to use any test on linear dependence Nonetheless, bootstrap tests are computationally ineffi- measures unless autocorrelation is considered in the null cient and known to exhibit size and power distortions, distributions. however typically less so than asymptotic tests [74]. As such, we consider a comparison to these empirical approaches outside the scope of our paper. 2. Finite-sample F -test

In regression analysis, the F-test is used to infer the significance of nested models of independent observations. Using the same notation as above, we obtain a distribution for comparing the nested models:

$$\frac{T - p_1}{p_1 - p_0}\,\frac{S_0 - S_1}{S_1} \sim F(p_1 - p_0,\, T - p_1). \tag{A3}$$

F-statistics can be reformulated as nested ratios of sample variances through simply rearranging the LHS

with limited data, i.e., the finite-sample null distribution.

4. Drawing inferences

Given an estimate $\hat{\Lambda}$ and null distribution (e.g., the F- or $\chi^2$-distributions), we use the same general hypothesis testing procedure for all measures. Here, we use the $\chi^2$-test for $\hat{\Lambda}$ as an example; however, the same applies to all linear dependence measures, including the $\Lambda^*$-distributions, which are numerically generated.

First, we set an arbitrary significance level $\alpha$, which is (ideally) the probability of rejecting the null hypothesis even if it were true; this is set to 5% in this paper unless stated otherwise. Then, the statistic $T \log(\hat{\Lambda})$ is input to the cumulative distribution function of the $\chi^2$-distribution with $p_1 - p_0$ degrees of freedom. The complement of the output is the p-value: the probability of measuring that value (or higher) under the null hypothesis $H_0 : \Lambda = 0$. If the p-value is below the significance level, then we deduce that the measured LR $\hat{\Lambda}$ is significant. A false positive occurs when the p-value is below the significance level $\alpha$ but the null hypothesis $H_0$ is true, i.e., there is no actual dependence between the variables ($\Lambda = 0$), yet $\hat{\Lambda}$ is considered significant. Ideally, we expect the proportion of false positives (the FPR) to match the significance level $\alpha$, i.e., one would expect an FPR of 0.05 for a 5% significance level $\alpha$. This same procedure is used for all linear dependence measures in this paper, with differing statistics and null distributions.

When the FPR is measured over R trials (usually R = 1 000 in this paper), confidence intervals can be determined based on the binomial distribution of R draws of a random variable each with $\alpha = 0.05$ chance of success.

Appendix B: Experimental setup

1. Numerical simulations

For our experimental validation, we use an AR model similar to the example proposed in [4], with two processes X and Y that have no interdependence, digitally filtered to increase their autocorrelation. We begin with the m-variate time-series data $z = (z(1), \ldots, z(T))$, i.e., an $m \times T$ matrix, where each realisation is generated from a first-order vector AR model:

$$z(t) = \Phi(1)\, z(t-1) + a(t), \tag{B1}$$

with

$$z(t) = \begin{bmatrix} x(t) \\ y(t) \end{bmatrix}, \qquad \Phi(1) = \begin{bmatrix} \Phi_X(1) & 0 \\ 0 & \Phi_Y(1) \end{bmatrix}, \tag{B2}$$

and AR parameters $\Phi_X(1) = 0.3 I_k$ and $\Phi_Y(1) = -0.8 I_l$. The innovations $a(t) = (a_1(t), \ldots, a_m(t))$ are uncorrelated with mean 0 and unit variance matrix $\mathrm{var}(a(t)) = I_m$. The matrix z is then partitioned into $k \times T$ and $l \times T$ matrices denoted x(t) and y(t). For each measure, we are interested in either the mutual information between x and y, or the Granger causality from y to x. If a third (conditional) process w is required to contextualise these measures (in their conditional forms), we consider another first-order AR process w, again using Eq. (B1), with $w(t) \in \mathbb{R}^c$, $\Phi_W(1) = 0.4 I_c$ and unit variance innovation process. Each process, x, y, and w, is then independently filtered with either an FIR or an IIR filter. Both filters were low-pass, with their cutoff set to a normalised frequency of $\pi/2$ radians and a variable filter order.

In our study of the mutual information between bivariate processes, we inject a causal influence from the univariate process Y to X in order to test the TPR (i.e., statistical power of the test). In this scenario, we generate bivariate time-series data z from the same state equations (Eq. (B1)) but with a small causal influence from Y to X in the AR parameters:

$$\Phi(1) = \begin{bmatrix} \Phi_X(1) & \Phi_{XY}(1) \\ 0 & \Phi_Y(1) \end{bmatrix}, \tag{B3}$$

with $\Phi_{XY}(1) = 0.03$, whilst $\Phi_X(1) = 0.3$ and $\Phi_Y(1) = -0.8$, as per Eq. (B2).

2. Human Connectome Project

The HCP rsfMRI dataset [44] comprises 500 subjects imaged at a 0.72 s sampling rate for 15 minutes in the (relatively quiescent) resting state. This results in 1 200 observations of spatially dense time series data, which is then parcellated into 333 regions of interest in the brain. Thus, the dataset contains 500 subjects with 333 brain regions each, and each of these regions is associated with a stationary time series of 1 200 observations. The raw (BOLD) data of each region was then preprocessed by removing the DC component, detrending, and applying a 3rd order zero-phase (or forward and reverse) Butterworth bandpass filter (0.01–0.08 Hz). These are common techniques used to remove potential artefacts. We also removed 200 observations from the start and the end of the time series, in order to minimise filter initialisation effects. This leaves T = 800 observations for the analysis.

In order to build a scenario where the null hypothesis holds, we conduct experiments on 1 000 time-series pairs, selecting different random regions of interest from different random subjects, making the corresponding time series completely independent of one another. The analysis was performed for mutual information (between both univariate and bivariate time series) and Granger causality in the same way as discussed for the simulated time-series experiments above. We use the same hypothesis testing procedure (discussed above) with a 5% significance level for the $\chi^2$-test and our newly proposed modified $\Lambda$-test.

Appendix C: Additional prewhitening tests

This appendix provides further evidence that standard prewhitening techniques are insufficient for many covariance-stationary time series. In the main text, we show that prewhitening time series is ineffective when the time-series models are either: AR(p) models that are inferred from Burg's method; or ARMA(p,q) models that are inferred from the BIC score (up to a maximum order of p = q = 5). Here, we extend these results to show that AR(p) models inferred via the AICc (AIC with small-sample correction) and BIC scoring functions are also insufficient.
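As a rough illustration of what inferring an AR(p) model via the AICc or BIC involves, the sketch below fits candidate orders and picks the one with the lowest score. It is not the procedure used for the experiments: it swaps in a Yule-Walker fit via the Levinson-Durbin recursion for brevity (an approach that [57] cautions against in practice), and the function names, the parameter count k = p + 1, and the Gaussian-likelihood form of the scores are all assumptions made for this example.

```python
import math

def sample_autocov(x, max_lag):
    """Biased sample autocovariances at lags 0..max_lag (for use with real data)."""
    n = len(x)
    mean = sum(x) / n
    return [sum((x[t] - mean) * (x[t + u] - mean) for t in range(n - u)) / n
            for u in range(max_lag + 1)]

def levinson_durbin(acov, max_order):
    """Fit AR(1)..AR(max_order) from autocovariances (Yule-Walker equations).

    Returns (pacf, var): pacf[p-1] is the lag-p partial autocorrelation and
    var[p] is the innovation variance of the AR(p) fit (var[0] = acov[0]).
    """
    phi, pacf, var = [], [], [acov[0]]
    for p in range(1, max_order + 1):
        num = acov[p] - sum(phi[j] * acov[p - 1 - j] for j in range(p - 1))
        k = num / var[-1]  # reflection coefficient = partial autocorrelation
        phi = [phi[j] - k * phi[p - 2 - j] for j in range(p - 1)] + [k]
        pacf.append(k)
        var.append(var[-1] * (1.0 - k * k))
    return pacf, var

def select_order(var, n_obs, criterion="bic"):
    """Return the AR order that minimises BIC or AICc over the candidates in var."""
    best_p, best_score = 0, math.inf
    for p, v in enumerate(var):
        k = p + 1  # assumed parameter count: p coefficients plus the noise variance
        score = n_obs * math.log(v)
        if criterion == "bic":
            score += k * math.log(n_obs)
        else:  # "aicc": AIC with small-sample correction
            score += 2 * k + 2 * k * (k + 1) / (n_obs - k - 1)
        if score < best_score:
            best_p, best_score = p, score
    return best_p

# Theoretical autocorrelations of an AR(1) process with coefficient 0.8
acov = [0.8 ** u for u in range(6)]
pacf, var = levinson_durbin(acov, max_order=5)
order_bic = select_order(var, n_obs=800, criterion="bic")
order_aicc = select_order(var, n_obs=800, criterion="aicc")
```

For recorded data, `sample_autocov(x, max_order)` would supply the autocovariances in place of the theoretical values used here. The residuals of the selected model would then serve as the prewhitened series.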

[Figure 13, panels (a) and (b). x-axis: filter order; y-axis: false positive rate (FPR). Curves: F-test with AR(1), AR(p) [AICc], and AR(p) [BIC] prewhitening. Insets: actual FPR against expected FPR.]

FIG. 13. Inferring the AR models via AICc and BIC does not consistently improve prewhitening results for mutual information tests with FIR (13a) or IIR (13b) filtering. The experiments are as per Fig. 1, with two additional prewhitening approaches used: the AR(p) model inferred from the AICc and BIC scores, respectively.

[Figure 14, panels (a) and (b). x-axis: filter order; y-axis: false positive rate (FPR). Curves: F-test with AR(1), AR(p) [AICc], and AR(p) [BIC] prewhitening. Insets: actual FPR against expected FPR.]

FIG. 14. Inferring the AR models via AICc and BIC does not improve prewhitening results for Granger causality tests with FIR (14a) or IIR (14b) filtering. The experiments are as per Fig. 6, with two additional prewhitening approaches used: the AR(p) model inferred from the AICc and BIC scores, respectively.
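Each FPR value in Figs. 13 and 14 is presumably an empirical proportion over repeated trials, following the procedure of Appendix A 4. The sketch below shows how a p-value, an empirical FPR, and the binomial interval around the nominal level could be computed. It is restricted to one degree of freedom ($p_1 - p_0 = 1$) so that the $\chi^2$ tail has a closed form; other degrees of freedom would need an incomplete gamma function. The function names and the toy statistics are assumptions for illustration only.

```python
import math

def chi2_sf_1dof(x):
    """P(chi-square with 1 degree of freedom > x), i.e., the p-value of statistic x."""
    return math.erfc(math.sqrt(x / 2.0)) if x > 0.0 else 1.0

def empirical_fpr(null_statistics, alpha=0.05):
    """Proportion of statistics generated under H0 whose p-value falls below alpha."""
    rejections = sum(1 for s in null_statistics if chi2_sf_1dof(s) < alpha)
    return rejections / len(null_statistics)

def fpr_interval(n_trials, alpha=0.05, level=0.95):
    """Central interval for the FPR of an unbiased test, from Binomial(n_trials, alpha)."""
    tail = (1.0 - level) / 2.0
    log_a, log_b = math.log(alpha), math.log(1.0 - alpha)
    cdf, lower, upper = 0.0, None, n_trials
    for k in range(n_trials + 1):
        log_pmf = (math.lgamma(n_trials + 1) - math.lgamma(k + 1)
                   - math.lgamma(n_trials - k + 1)
                   + k * log_a + (n_trials - k) * log_b)
        cdf += math.exp(log_pmf)
        if lower is None and cdf > tail:
            lower = k  # smallest count whose cumulative probability exceeds the lower tail
        if cdf >= 1.0 - tail:
            upper = k  # smallest count whose cumulative probability reaches the upper bound
            break
    return lower / n_trials, upper / n_trials

p_value = chi2_sf_1dof(3.841458820694124)   # statistic at the 95th percentile of chi2(1)
fpr = empirical_fpr([0.1, 5.0, 10.0, 1.0])  # toy statistics, two of which exceed 3.84
low, high = fpr_interval(1000)              # R = 1000 trials at alpha = 0.05
```

An empirical FPR that falls outside `(low, high)` would suggest that the test is biased for that experimental condition.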

In Fig. 13, we show the extended prewhitening results for mutual information tests. The algorithm iterates through all potential AR(p) models and selects the one that minimises the AICc and BIC scores (independently). For the FIR-filtered data, this approach shows a slightly increased FPR above the nominal value of 5%; this illustrates a marginal improvement over Burg's method from Fig. 10. However, for the IIR-filtered data, the FPR approaches 100% and is significantly worse than even methods with no or minimal prewhitening (cf. the AR(1) models or the standard F-test in Fig. 1). Similarly, prewhitening is shown to be insufficient for Granger causality tests in Fig. 14, where equivalent experiments were performed with no major qualitative differences (compared to Fig. 11).

We do not show any further results for ARMA(p,q) models, e.g., by increasing the maximum order or via a different criterion, because the procedure to learn the parameters of higher-order ARMA models was too computationally expensive using the standard functions. Given this constraint, all information criteria (AIC, AICc, or BIC) often chose the maximum order in practice, and so using alternative approaches was redundant. We conjecture that prewhitening with ARMA(p,q) models may perform favourably compared to AR(p) models given the ability to infer arbitrarily complex models. However, the constraints imposed by their inference procedures make testing this currently intractable in practice.

Appendix D: Considerations and extensions of Bartlett's formula

In our derivations we use a first-order approximation of Bartlett's formula (Eq. (8)), which was originally described for spatially autocorrelated processes. However, since Bartlett's seminal work [10], there have been a number of other extensions made to his formula as well as techniques intended to overcome the issues of its empirical computation.

One of the more general cases of Bartlett's formula is due to Roy [40], who provided the large-sample distribution between pairs of sample cross-correlations at differing lags. Consider the four processes $Z_i, Z_j, Z_k, Z_l$.

Let

$$\Delta_v(i, j, k, l) = \sum_{u=-\infty}^{\infty} \rho_{ij}(u)\,\rho_{kl}(u + v), \tag{D1}$$

where $\rho_{ij}(u)$ are the cross-correlations as per Eq. (5), and

$$s_{ab}(v) = T^{1/2}\,[r_{ab}(v) - \rho_{ab}(v)], \tag{D2}$$

as the standard error, with $r_{ab}(v)$ the sample cross-correlation in Eq. (6). In general, the asymptotic distribution of the standard error, $s_{ab}(v)$, is Gaussian with zero mean and covariance

$$\begin{aligned}
\lim_{T\to\infty} \mathrm{cov}(s_{ab}(v), s_{de}(w)) \approx{}& \Delta_{w-v}(a, d, b, e) + \Delta_{w+v}(b, d, a, e) \\
&- \rho_{ab}(v)\,[\Delta_w(a, d, a, e) + \Delta_w(b, d, b, e)] \\
&- \rho_{de}(w)\,[\Delta_v(b, d, a, d) + \Delta_v(b, e, a, e)] \\
&+ \tfrac{1}{2}\,\rho_{ab}(w)\,\rho_{de}(w)\,[\Delta_0(a, d, a, d) + \Delta_0(a, e, a, e) \\
&\qquad + \Delta_0(b, d, b, d) + \Delta_0(b, e, b, e)].
\end{aligned} \tag{D3}$$

This derivation can be further generalised to the non-Gaussian case, for instance by allowing for skewed distributions [75]. More recently, a first-order approximation of this formula was given by Afyouni et al. [2], which takes the same form as Eq. (D3), with Eq. (D1) slightly modified.

One consequence of knowing the full covariance structure (Eq. (D3)) is that such a distribution could further help in situations when the partial correlation terms that Wilks' statistic decomposes into are themselves correlated. That is, in the main text, we provided the variance for each $\Lambda^*$-distributed variable $L_i$ (as a beta distribution by assuming independence). By assuming independence, we were able to obtain the sampling distribution of Wilks' criterion as a product of these $\Lambda^*$-distributions. However, if the random variables were correlated, then we must use the multivariate form of Bartlett's formula (D3), which provides the covariance of each variable $L_i$ with all other variables $L_j$ (i.e., we would have $n_{ij}$ for each i and j, rather than just $n_i$). Accounting for this covariance would require knowing the distribution of the general product of correlated beta-distributions that, to the best of our knowledge, is not an established result.

Another use-case of Eq. (D3) is testing against an alternative hypothesis ($H_1 : \rho_{ab}(0) \neq 0$), which is discussed at length by Afyouni et al. [2] for correlation coefficients. One could follow the same logic from this paper to provide the alternative hypothesis test for linear dependence measures based on Wilks' statistic.

There are a number of special cases of Roy's formula that are worth noting. In the event that we are interested in the covariance $\mathrm{cov}(s_{ab}(v), s_{ab}(w))$ between cross-correlation estimates of two univariate processes $Z_a$ and $Z_b$ at arbitrary lags v and w, this is obtained from Eq. (D3) by setting d = a and e = b, reducing to the results reported in [76] and [7] (Theorem 11.2.3). Using this special case, the null distribution of the Pearson (zero-lag) correlation between two univariate processes, $\mathrm{var}(s_{ab}(0))$, can be obtained by setting v = w = 0. Finally, under the assumption that $\rho_{ab}(0) = 0$, most of these terms disappear and we are left with Bartlett's original formula [10]:

$$\lim_{T\to\infty} \mathrm{var}(s_{ab}(0)) \approx \lim_{T\to\infty} \mathrm{var}\big(T^{1/2}\,[r_{ab}(0)]\big) = \sum_{u=-\infty}^{\infty} \rho_{aa}(u)\,\rho_{bb}(u), \tag{D4}$$

with a similar form given by a first-order approximation [38]:

$$\mathrm{var}(s_{ab}(0)) \approx T^{-1} \sum_{u=-\infty}^{\infty} (T - |u|)\,\rho_{aa}(u)\,\rho_{bb}(u). \tag{D5}$$

Due to symmetry of the autocorrelation function about lag-zero for stationary processes, we can simply sum over the positive lags in Eq. (D5), u > 0, which was the form used throughout this paper (see Eq. (8)). For the exact relationship between the large-sample approximations and the first-order approximations, we refer the reader to the discussions in [11, 38, 77]. Bartlett did indeed present a formula irrespective of sample size [11], which may yield an improvement for small sample distributions and give minor practical advantages; however, we did not find this necessary for any experiments and instead follow the approximations in Eq. (D5).

Another potential source of error in the sampling distributions comes from incorrectly estimating the autocorrelation function $r_{aa}(u)$. Tapering (also known as data windowing) is commonly used in practice to regularise the autocorrelation samples to better estimate their true value [2, 50]. These approaches involve scaling the autocorrelation samples by some factor, with the maximum lag truncated below the dataset length. Using this method, we can adapt Bartlett's formula to

$$\mathrm{var}(s_{ab}(0)) \approx 1 + 2 \sum_{u=1}^{U} \frac{T - u}{T}\,\lambda(u)\,r_{aa}(u)\,r_{bb}(u), \tag{D6}$$

where $\lambda(u)$ are a set of weights called the lag window and U < T is the truncation point. The lag window comprises $\lambda(u)$ values that decrease with increasing u; two common approaches are the Parzen and the Tukey windows (see [50] for details). Numerous truncation points U have also been proposed, e.g., $T/4$, $T/5$, $\sqrt{T}$, and $2\sqrt{T}$ [2].

In the above few sections we outlined a number of potential factors that could introduce small size or power distortions in our hypothesis tests. To compare our approach with these more complex extensions, we ran experiments with the effective sample size computed from the full covariance matrix (D3), both with and without tapering. These were computed for the experiments from Fig. 1 in the paper; however, as mentioned above, the product of correlated beta- or F-distributed variates is unknown and so our modified $\Lambda$-test could not be performed. Instead, we used the sums of z-transformed partial correlations (each of which make up the conditional mutual information term), rather than the products of squared partial correlations. That is, by transforming each partial correlation, we expect the sum of these correlations to be approximately Gaussian. In performing these tests, we found no notable difference in the FPRs for any of the validation experiments, suggesting that these additions made no significant difference towards reducing size or power distortions.

Appendix E: Partial autocorrelation and active information storage

The partial autocorrelation function conveys important information regarding the dependence structure of an AR process [7]. For a univariate stationary time series Z, the partial autocorrelation $\alpha_Z(u)$ at lag u is the correlation between $Z(t)$ and $Z(t - u)$, adjusted for the intervening observations $Z^{(u-1)}(t) = [Z(t - 1); \ldots; Z(t - u + 1)]$. Denote $Z_u$ as the process of Z lagged by u time steps and $Z^{(u)}$ as the history up until that lag (inclusive), $Z^{(u)} = [Z_1; \ldots; Z_u]$. Then, for a stationary time series, the partial autocorrelation function is defined by [7]

$$\alpha_Z(1) = \rho_{Z Z_1}, \tag{E1}$$

and

$$\alpha_Z(u) = \rho_{Z Z_u \cdot Z^{(u-1)}}, \quad u > 1. \tag{E2}$$

Although we use Burg's method to identify the relevant history length p for an AR model of Z in this paper, it is common practice to use the partial autocorrelation function instead, since $\alpha_Z(u) = 0$ for u > p [7, 8, 46]. Again, this is a statistical estimate and thus the order p is inferred by testing each sample partial autocorrelation $\hat{\alpha}_Z(u)$ for significance against the null distribution.

Intriguingly, our work reveals a relationship between the partial autocorrelation function and active information storage [43] (a recently developed model-free measure for quantifying memory in a process) under the linear Gaussian assumption. The average active information storage $A_X$ quantifies the information storage in a process. For a p-order Markov process X, this is quantified by the mutual information between the relevant history $X^{(p)}(t)$ and variable $X(t)$, i.e.,

$$A_X(p) = I_{X;X^{(p)}}. \tag{E3}$$

Since the average active information storage is a specific type of mutual information, we can use the chain rule to decompose it into a sum of squared partial autocorrelations:

$$I_{X;X^{(p)}} = -\frac{1}{2} \sum_{u=1}^{p} \log\big(1 - \rho^2_{X X_u \cdot X^{(u-1)}}\big) = -\frac{1}{2} \sum_{u=1}^{p} \log\big(1 - [\alpha_X(u)]^2\big). \tag{E4}$$

This same logic can be straightforwardly applied to other measures such as excess entropy [41] and predictive information [42].

In addition to quantifying the memory within a process, active information storage is often used for inferring the optimal history length for both the Gaussian and non-Gaussian cases [67]. This is typically achieved by using the $\chi^2$-test to infer the significance of increasing the embedding lengths p. For AR processes with Gaussian innovations, we infer the embedding length p for X by first taking the difference $\delta_X(u) = A_X(u + 1) - A_X(u)$ and then generating a p-value by testing $2\,\delta_X(u)$ against a $\chi^2(1)$ distribution, which represents the null hypothesis of no increase in information storage. If the p-value is below a threshold (say 5%), then the null hypothesis is rejected and the lag is increased, u = u + 1. This process is iterated until the null hypothesis is accepted, at which point we surmise that the optimal lag p is the one at which $\delta_X(p + 1)$ is considered insignificant. This approach is similar to using the partial autocorrelation, as the difference $\delta_X(u)$ is equivalent to squared partial autocorrelation up to a factor of two. This can be seen from Eq. (E4):

$$\delta_X(u) = A_X(u + 1) - A_X(u) = -\frac{1}{2} \log\big(1 - [\alpha_X(u + 1)]^2\big). \tag{E5}$$

In contrast to measures of dependence between multiple processes, the $\chi^2$-test appears suitable here (without adjusting for an effective sample size) for testing $\delta_X(u)$ for u > p. This is because, after the full set of past variables is included in the regression, any higher order residuals $x_u - \hat{x}_u(\hat{x}^{(p)})$ with u > p have statistically zero autocorrelation for every lag.

ACKNOWLEDGMENTS

OMC and BDF were supported by NHMRC Ideas Grant 1183280. JTL was supported through the Australian Research Council DECRA grant DE160100630. JMS was supported through a University of Sydney Robinson Fellowship and NHMRC Project Grant 1156536. JMS and JTL were supported through The University of Sydney Research Accelerator (SOAR) Fellowship program. OMC, JTL and JMS were supported through The Centre for Translational Data Science at The University of Sydney's Research Incubator Funding Scheme. OMC, JTL, JMS and BDF were supported through the Centre for Complex Systems at The University of Sydney's Emerging Aspirations scheme. High performance computing facilities provided by The University of Sydney (Artemis) have contributed to the research results reported within this paper.

[1] C. E. Davey, D. B. Grayden, G. F. Egan, and L. A. Johnston, Filtering induces correlation in fMRI resting state data, NeuroImage 64, 728 (2013).
[2] S. Afyouni, S. M. Smith, and T. E. Nichols, Effective degrees of freedom of the Pearson's correlation coefficient under autocorrelation, NeuroImage (2019).
[3] E. Florin, J. Gross, J. Pfeifer, G. R. Fink, and L. Timmermann, The effect of filtering on Granger causality based multivariate causality measures, NeuroImage 50, 577 (2010).
[4] L. Barnett and A. K. Seth, Behaviour of Granger causality under filtering: Theoretical invariance and practical application, Journal of Neuroscience Methods 201, 404 (2011).
[5] A. K. Seth, A MATLAB toolbox for Granger causal connectivity analysis, Journal of Neuroscience Methods 186, 262 (2010).
[6] P. M. Robinson, Generalized canonical analysis for time series, Journal of Multivariate Analysis 3, 141 (1973).
[7] P. J. Brockwell and R. A. Davis, Time Series: Theory and Methods (Springer Science & Business Media, 1991).
[8] G. C. Reinsel, Elements of Multivariate Time Series Analysis (Springer Science & Business Media, 2003).
[9] G. U. Yule, Why do we sometimes get nonsense-correlations between time-series?–a study in sampling and the nature of time-series, Journal of the Royal Statistical Society 89, 1 (1926).
[10] M. S. Bartlett, Some aspects of the time-correlation problem in regard to tests of significance, Journal of the Royal Statistical Society 98, 536 (1935).
[11] M. S. Bartlett, On the theoretical specification and sampling properties of autocorrelated time-series, Supplement to the Journal of the Royal Statistical Society 8, 27 (1946).
[12] J. D. Cryer and K.-S. Chan, Time Series Analysis: With Applications in R (Springer Science & Business Media, 2008).
[13] D. Sul, P. C. Phillips, and C.-Y. Choi, Prewhitening bias in HAC estimation, Oxford Bulletin of Economics and Statistics 67, 517 (2005).
[14] M. Bayazit and B. Önöz, To prewhiten or not to prewhiten in trend analysis?, Hydrological Sciences Journal 52, 611 (2007).
[15] W. Olszowy, J. Aston, C. Rua, and G. B. Williams, Accurate autocorrelation modeling substantially improves fMRI reliability, Nature Communications 10, 1 (2019).
[16] S. Yue and C. Y. Wang, Applicability of prewhitening to eliminate the influence of serial correlation on the Mann-Kendall test, Water Resources Research 38, 4 (2002).
[17] M. R. Arbabshirani, E. Damaraju, R. Phlypo, S. Plis, E. Allen, S. Ma, D. Mathalon, A. Preda, J. G. Vaidya, T. Adali, and V. D. Calhoun, Impact of autocorrelation on functional connectivity, NeuroImage 102, 294 (2014).
[18] J. R. Bence, Analysis of short time series: correcting for autocorrelation, Ecology 76, 628 (1995).
[19] M. Macias-Fauria, A. Grinsted, S. Helama, and J. Holopainen, Persistence matters: Estimation of the statistical significance of paleoclimatic reconstruction statistics from autocorrelated time series, Dendrochronologia 30, 179 (2012).
[20] K. J. Friston, Functional and effective connectivity: a review, Brain Connectivity 1, 13 (2011).
[21] C. W. Granger, Investigating causal relations by econometric models and cross-spectral methods, Econometrica: Journal of the Econometric Society, 424 (1969).
[22] A. Roebroeck, E. Formisano, and R. Goebel, Mapping directed influence over the brain using Granger causality and fMRI, NeuroImage 25, 230 (2005).
[23] W. Liao, J. Ding, D. Marinazzo, Q. Xu, Z. Wang, C. Yuan, Z. Zhang, G. Lu, and H. Chen, Small-world directed networks in the human brain: Multivariate Granger causality analysis of resting-state fMRI, NeuroImage 54, 2683 (2011).
[24] K. Friston, R. Moran, and A. K. Seth, Analysing connectivity with Granger causality and dynamic causal modelling, Current Opinion in Neurobiology 23, 172 (2013).
[25] R. K. Kaufmann and D. I. Stern, Evidence for human influence on climate from hemispheric temperature relations, Nature 388, 39 (1997).
[26] D. D. Zhang, H. F. Lee, C. Wang, B. Li, Q. Pei, J. Zhang, and Y. An, The causality analysis of climate change and large-scale human crisis, Proceedings of the National Academy of Sciences 108, 17296 (2011).
[27] J. R. Freeman, Granger causality and the times series analysis of political relationships, American Journal of Political Science 27, 327 (1983).
[28] R. Reuveny and H. Kang, International trade, political conflict/cooperation, and Granger causality, American Journal of Political Science 40, 943 (1996).
[29] S. S. Wilks, Certain generalizations in the analysis of variance, Biometrika, 471 (1932).
[30] T. Pham-Gia, Exact distribution of the generalized Wilks's statistic and applications, Journal of Multivariate Analysis 99, 1698 (2008).
[31] H. Hotelling, Relations between two sets of variates, Biometrika 28, 321 (1936).
[32] D. J. MacKay, Information Theory, Inference and Learning Algorithms (Cambridge University Press, 2003).
[33] T. M. Cover and J. A. Thomas, Elements of Information Theory (John Wiley & Sons, 2012).
[34] J. Kay, Feature discovery under contextual supervision using mutual information, in Proc. of IEEE IJCNN, Vol. 4 (1992) pp. 79–84.
[35] L. Barnett, A. B. Barrett, and A. K. Seth, Granger causality and transfer entropy are equivalent for Gaussian variables, Physical Review Letters 103, 238701 (2009).
[36] T. Schreiber, Measuring information transfer, Physical Review Letters 85, 461 (2000).
[37] T. Bossomaier, L. Barnett, M. Harré, and J. T. Lizier, An Introduction to Transfer Entropy: Information Flow in Complex Systems (Springer, 2016).
[38] G. Bayley and J. Hammersley, The "effective" number of independent observations in an autocorrelated time series, Supplement to the Journal of the Royal Statistical Society 8, 184 (1946).
[39] P. Clifford, S. Richardson, and D. Hémon, Assessing the significance of the correlation between two spatial processes, Biometrics, 123 (1989).
[40] R. Roy, Asymptotic covariance structure of serial correlations in multivariate time series, Biometrika 76, 824 (1989).
[41] W. Bialek, I. Nemenman, and N. Tishby, Complexity through nonextensivity, Physica A: Statistical Mechanics and its Applications 302, 89 (2001).
[42] J. P. Crutchfield and D. P. Feldman, Regularities unseen, randomness observed: Levels of entropy convergence, Chaos: An Interdisciplinary Journal of Nonlinear Science 13, 25 (2003).
[43] J. T. Lizier, M. Prokopenko, and A. Y. Zomaya, Local measures of information storage in complex distributed computation, Information Sciences 208, 39 (2012).
[44] D. C. Van Essen, K. Ugurbil, E. Auerbach, D. Barch, T. Behrens, R. Bucholz, A. Chang, L. Chen, M. Corbetta, S. W. Curtiss, et al., The Human Connectome Project: a data acquisition perspective, NeuroImage 62, 2222 (2012).
[45] J. Geweke, Measurement of linear dependence and feedback between multiple time series, Journal of the American Statistical Association 77, 304 (1982).
[46] G. E. Box, G. M. Jenkins, G. C. Reinsel, and G. M. Ljung, Time Series Analysis: Forecasting and Control (John Wiley & Sons, 2015).
[47] The most general form is the ARIMA model, where the integrative (I) component accounts for non-stationary time series. Here we assume that appropriate differencing and mean removal has already been performed in order to remove any integration of the time series.
[48] The problem of recovering the autocorrelation functions from data, and thus the effective sample size, is nontrivial and has resulted in procedures such as tapering to handle noisy estimates [2, 50]; these notions are covered briefly in Appendix D.
[49] M. J. Anderson and J. Robinson, Permutation tests for linear models, Australian & New Zealand Journal of Statistics 43, 75 (2001).
[50] C. Chatfield, The Analysis of Time Series: An Introduction (Chapman and Hall/CRC, 2003).
[51] A. M. Mathai and P. Rathie, The exact distribution of Wilks' criterion, The Annals of Mathematical Statistics, 1010 (1971).
[52] D. R. Brillinger, Some data analyses using mutual information, Brazilian Journal of Probability and Statistics 18, 163 (2004).
[53] L. Barnett and T. Bossomaier, Transfer entropy as a log-likelihood ratio, Physical Review Letters 109, 138105 (2012).
[54] S. S. Wilks, The large-sample distribution of the likelihood ratio for testing composite hypotheses, The Annals of Mathematical Statistics 9, 60 (1938).
[55] C. E. Davey, D. B. Grayden, M. Gavrilescu, G. F. Egan, and L. A. Johnston, The equivalence of linear Gaussian connectivity techniques, Human Brain Mapping 34, 1999 (2013).
[56] J. T. Lizier and M. Prokopenko, Differentiating information transfer and causal effect, The European Physical Journal B 73, 605 (2010).
[57] M. De Hoon, T. Van der Hagen, H. Schoonewelle, and H. Van Dam, Why Yule-Walker should not be used for autoregressive modelling, Annals of Nuclear Energy 23, 1219 (1996).
[58] J. Garland, R. G. James, and E. Bradley, Leveraging information storage to select forecast-optimal parameters for delay-coordinate reconstructions, Physical Review E 93, 022221 (2016).
[59] In general, the variables representing the past need not be a sequence of consecutive temporal indices, nor have the same history length p for each dimension of X. For instance, one could use a Takens embedding [78, 79] or any other statistically significant set of variables [80]. However, for simplicity, in this work we follow the vector AR model often used in Granger causality analysis (Eq. (B1)) and thus use a consecutive sequence of variables as the history of X(t).
[60] J. Wishart, The generalised product moment distribution in samples from a normal multivariate population, Biometrika 20A, 32 (1928).
[61] Using the MATLAB estimate function in order to infer the parameters for each p and q; see our open-source toolkit for details.
[62] R. Franciotti, N. Falasca, L. Bonanni, F. Anzellotti, V. Maruotti, S. Comani, A. Thomas, A. Tartaro, J. Taylor, and M. Onofrj, Default network is not hypoactive in dementia with fluctuating cognition: an Alzheimer disease-dementia with Lewy bodies comparison, Neurobiology of Aging 34, 1148 (2013).
[63] L. Barnett and A. K. Seth, The MVGC multivariate Granger causality toolbox: A new approach to Granger-causal inference, Journal of Neuroscience Methods 223, 50 (2014).
[64] H. Boudjellaba, J.-M. Dufour, and R. Roy, Testing causality between two vectors in multivariate autoregressive moving average models, Journal of the American Statistical Association 87, 1082 (1992).
[65] L. Barnett and A. K. Seth, Granger causality for state-space models, Physical Review E 91, 040101(R) (2015).
[66] A. Gutknecht and L. Barnett, Sampling distribution for single-regression Granger causality estimators, arXiv preprint arXiv:1911.09625 (2019).
[67] J. T. Lizier, JIDT: An information-theoretic toolkit for studying the dynamics of complex systems, Frontiers in Robotics and AI 1, 11 (2014).
[68] J. Theiler, Spurious dimension from correlation algorithms applied to limited time-series data, Physical Review A 34, 2427 (1986).
[69] See the Supporting Information of Novelli et al. [80] for a similar experiment with transfer entropy on the same HCP rsfMRI dataset used in our Fig. 12.
[70] Weber et al. [81] report a contrasting finding of increased FPR in transfer entropy inference due to filtering; however, the use of a Theiler window is not specified.
[71] C. Francq and J.-M. Zakoïan, Bartlett's formula for a general class of nonlinear processes, Journal of Time Series Analysis 30, 449 (2009).
[72] T. Schreiber and A. Schmitz, Improved surrogate data for nonlinearity tests, Physical Review Letters 77, 635 (1996).
[73] N. Schaworonkow, D. A. Blythe, J. Kegeles, G. Curio, and V. V. Nikulin, Power-law dynamics in neuronal and behavioral data introduce spurious correlations, Human Brain Mapping 36, 2901 (2015).
[74] R. Davidson and J. G. MacKinnon, The size distortion of bootstrap tests, Econometric Theory, 361 (1999).
[75] N. Su and R. Lund, Multivariate versions of Bartlett's formula, Journal of Multivariate Analysis 105, 18 (2012).
[76] M. S. Bartlett, An Introduction to Stochastic Processes (Cambridge University Press, 1935).
[77] M. H. Quenouille, Notes on the calculation of autocorrelations of linear autoregressive schemes, Biometrika 34, 365 (1947).
[78] F. Takens, Detecting strange attractors in turbulence, in Dynamical Systems and Turbulence, Lecture Notes in Mathematics, Vol. 898, edited by D. Rand and L.-S. Young (Springer, Berlin / Heidelberg, 1981) pp. 366–381.
[79] O. M. Cliff, M. Prokopenko, and R. Fitch, Minimising the Kullback–Leibler divergence for model selection in distributed nonlinear systems, Entropy 20, 51 (2018).
[80] L. Novelli, P. Wollstadt, P. Mediano, M. Wibral, and J. T. Lizier, Large-scale directed network inference with multivariate transfer entropy and hierarchical statistical testing, Network Neuroscience 3, 827 (2019).
[81] I. Weber, E. Florin, M. Von Papen, and L. Timmermann, The influence of filtering and downsampling on the estimation of transfer entropy, PloS One 12 (2017).