


Unified Approach Based on the Minimum Description Length Principle in Granger Causality Analysis

FEI LI1, XUEWEI WANG1, QIANG LIN1, AND ZHENGHUI HU1 1College of Science, Zhejiang University of Technology, Hangzhou, Zhejiang 310023, China. Corresponding author: Zhenghui Hu (e-mail: [email protected]). This work is supported in part by National Key R&D Program of China under Grant 2018YFA0701400, in part by the Public Projects of Science Technology Department of Zhejiang Province under Grant LGF20H180015.

ABSTRACT Granger causality analysis (GCA) provides a powerful tool for uncovering the patterns of brain connectivity using neuroimaging techniques. In this paper, distinct from conventional two-stage GCA, we present a unified model selection approach based on the minimum description length (MDL) principle for GCA in the context of the general regression model paradigm. In comparison with conventional methods, our approach emphasizes that model selection should follow a single mathematical theory throughout the GCA process. Under this framework, all candidate models within the model space can be compared freely in the context of their code length, without the need for an intermediate model. We illustrate its advantages over the conventional two-stage GCA approach in 3-node and 5-node synthetic network experiments. The unified model selection approach is capable of identifying the actual connectivity while avoiding the false influences of noise. More importantly, the proposed approach obtained more consistent results on a challenging fMRI dataset, in which visual/auditory stimuli with the same presentation design give identical neural correlates of mental calculation, allowing one to evaluate the performance of different GCA methods. Moreover, the proposed approach has the potential to accommodate Granger causality representations in other function spaces, and the comparison between different GC representations in different function spaces can also be dealt with naturally in this framework.

INDEX TERMS Code length, Granger causality analysis (GCA), minimum description length (MDL), model selection.

I. INTRODUCTION
Causal connectivity analysis, also called effective connectivity, plays an increasingly important role in brain research using neuroimaging techniques [1]–[6]. It reflects a trend in neuroscience away from focusing on individual brain units (functional specialization) toward complex neural circuits (functional integration) at different spatial scales [7]–[9], where the integration among specialized areas is mediated by causal connectivity [10]–[15]. Causal connectivity refers to the influence that one neural system exerts over another, either in the absence of identifiable behavioral events or in the context of task performance [16], [17]. It provides important insights into brain organization. Causal connectivity analysis is based on temporal relations existing between time series recordings of neural activity, which may be obtained by neuroimaging techniques such as electroencephalography (EEG) [18]–[20], local field potentials (LFP) [20]–[22], magnetoencephalography (MEG) [23]–[25], and functional magnetic resonance imaging (fMRI) [26]–[29], etc.

Granger causality analysis (GCA) is an effective tool for detecting causal connections that can provide information about the dynamics and directionality of the associations. Causality in the Granger sense is based on the statistical predictability of one time series that derives from knowledge of one or more other time series [30], [31]. For two time series, X and Y, that are stochastic and wide-sense stationary (i.e., with constant mean and variance) [32], the general idea of Granger causality is that variable Y is said to have a causal influence on variable X when the prediction of variable X can be improved by incorporating the history information of variable Y.


In practical applications, GCA is usually conveyed in the context of linear autoregressive (AR) models of stochastic processes. AR models with different time lags represent the historical dependence of the variables and, with or without including other variables, are used to forecast the dynamics of the variable. The optimal model for prediction is then chosen by model selection techniques. The extension of GCA to nonlinear models that capture nonlinear causal relations is also available [33], [34]. The validity of the original GCA can also be affected if the errors after prediction with the model are not normally distributed [26], [35]. Several test methods of causality in the Granger sense with asymptotic noise distributions have been developed that are especially useful for task-related fMRI studies [36], [37]. Asymmetric causality testing has also recently been suggested in order to separate the causal impacts of positive and negative changes [38]. Moreover, Granger causality has also been expressed in other function spaces, e.g. Fourier spaces (frequency domain) [21], [39]–[45] and kernel Hilbert spaces [46], [47]. These methods are important in neuroscience studies since the causal influences between neuronal populations are often nonlinear and complicated due to various sources of uncertainty. On the other hand, the original GCA is limited to investigating causal connectivity between two nodes. There have been numerous efforts on expanding the approach from small networks with few nodes to large, complex networks [48], [49]. A recent approach presented an iterative scheme that gradually pruned the network by removing indirect connections, considered one at a time, to uncover large network structures with hundreds of nodes [50].

Despite these developments, the basic idea of Granger causality remains unchanged. It generally comprises two stages in this framework: (1) specifying the model order (the number of time lags associated with the variable's own history information and with external effects), and (2) deciding the optimal model. The order of the predictors is usually determined with the Akaike information criterion (AIC) [51] or the Bayesian information criterion (BIC) [52], whereas the optimal predictor is judged through statistical pairwise comparison [53]–[56]. Specifying the model order and selecting the optimal model thus comply with two completely different mathematical theories.

Indeed, from a mathematical perspective, both stages aim to select the best model capturing the essential features under investigation from a number of competing models in terms of the given observations. They are essentially the same problem of model selection. Model selection is the most important aspect of inference with causal models and allows one to test different hypotheses by comparing different models in terms of selection criteria. Two different theories applied to the same question might generate different selection criteria and therefore degrade the performance of GCA.

We argue that the two stages in GCA are both a problem of model selection, and should obey the same benchmark under one mathematical theory. In this paper, against the two-stage selection scheme, we therefore propose a unified model selection approach for GCA with the minimum description length (MDL) principle, in which model selection follows a single mathematical theory during the GCA process. We illustrate the benefits of introducing a unified model selection approach in simulated and real fMRI experiments. In particular, we compare the proposed approach with the conventional two-stage GCA approach on an empirical fMRI dataset for the validation of causal connectivity analysis.

The rest of the article is organized as follows. Section II introduces the basic GCA concept and discusses the potential problems in the current two-stage GCA scheme and our motivation for introducing the MDL principle into GCA. In Section III, the MDL principle is illustrated in detail, and the formula of the two-part MDL is derived with the Bernoulli distribution in the Markov model class. In Section IV, the MDL guided model selection for linear models is carried out, in the time domain and the frequency domain respectively. In Section V, we illustrate the advantages of our proposal over the conventional two-stage GCA approach in 3-node network and 5-node network synthetic experiments. At the same time, the proposed approach obtains more consistent results on a challenging fMRI dataset for causality investigation, the mental calculation network under visual and auditory stimuli, respectively. Section VI demonstrates the comparison between conventional two-stage GCA and our proposal from a modeling standpoint, and discusses its potential development in a wider Granger causality sense.

II. PROBLEM STATEMENT
Let variables X and Y be two stochastic and stationary time series. Now consider the following pair of (restricted and unrestricted) regression models

X[n] = \sum_{i=1}^{p} \beta_x[i] X[n-i] + u[n]
X[n] = \sum_{i=1}^{p} \beta_x[i] X[n-i] + \sum_{i=1}^{q} \beta_y[i] Y[n-i] + v[n],   (1)

where p and q are the model orders (the numbers of time lags) in X and Y, respectively, \beta_x and \beta_y are the model coefficients, and u and v are the residuals of the models. The order of the historical predictor p is usually determined with the AIC [51] or the BIC [52],

AIC = -2 \log(L(\hat{\theta})) + 2k
BIC = -2 \log(L(\hat{\theta})) + k \log(n)

where n is the sample size, k is the number of parameters which the model estimates, and \theta is the set of all parameters. L(\hat{\theta}) represents the likelihood of the model tested, given the data, evaluated at the maximum likelihood values of \theta.

The conventional GCA requires statistical significance to determine whether the unrestricted model provides a better prediction than the restricted model. The hierarchical F-statistic, based on the extra sum-of-squares principle [57], can be used to evaluate significant predictability, given as

F = \frac{(RSS_r - RSS_u)/q}{RSS_u/(n - p - q - 1)},   (2)

where RSS_r and RSS_u represent the sum of squared residuals of the restricted and the unrestricted model, respectively, and n is the total number of observations used to estimate the unrestricted model. The F-statistic approximately follows an F-distribution with (q, n - p - q - 1) degrees of freedom and is compared against the critical value F_0(q, n - p - q - 1). If F > F_0(q, n - p - q - 1), the variability of the residual of the unrestricted model is significantly less than the variability of the residual of the restricted model; there is then an improvement in the prediction of X due to Y, and we refer to this as a causal influence from Y to X. But in the current bivariate model, spurious connections will emerge frequently.
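To make the two-stage procedure concrete, the following Python sketch (our own illustration, not code from the paper; it assumes plain ordinary least squares and omits the intercept, as in Eq. (1)) fits the restricted and unrestricted models and returns the F value of Eq. (2).

```python
import numpy as np

def lagged(x, lags, start):
    """Design matrix whose columns are x[t-1], ..., x[t-lags] for t >= start."""
    return np.column_stack([x[start - i: len(x) - i] for i in range(1, lags + 1)])

def rss(design, target):
    """Residual sum of squares of an ordinary least-squares fit."""
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return float(np.sum((target - design @ beta) ** 2))

def granger_F(x, y, p, q):
    """F value of Eq. (2) for the restricted/unrestricted pair in Eq. (1)."""
    start = max(p, q)
    target = x[start:]
    X_r = lagged(x, p, start)                    # restricted: own history only
    X_u = np.hstack([X_r, lagged(y, q, start)])  # unrestricted: add the history of y
    rss_r, rss_u = rss(X_r, target), rss(X_u, target)
    n = len(target)
    return ((rss_r - rss_u) / q) / (rss_u / (n - p - q - 1))
```

In the conventional scheme, p and q would first be chosen by minimizing AIC or BIC over candidate lags, and the returned F value would then be compared with the critical value F_0(q, n - p - q - 1) at the chosen significance level.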


In order to remove spurious connections caused by indirect causalities between nodes, GCA also provides a measure of conditional causal connection by introducing another variable Z into Eqs. (1):

X[n] = \sum_{i=1}^{p} \beta_x[i] X[n-i] + \sum_{i=1}^{r} \beta_z[i] Z[n-i] + u[n]
X[n] = \sum_{i=1}^{p} \beta_x[i] X[n-i] + \sum_{i=1}^{q} \beta_y[i] Y[n-i] + \sum_{i=1}^{r} \beta_z[i] Z[n-i] + v[n].   (3)

Then, the causal influence from Y to X, conditional on Z, is defined as

F_{Y \to X | Z} = \ln \frac{var(u)}{var(v)}.   (4)

Consider three model spaces A, B, and C, where each space comprises three models. Let p_i and q_i, i = 1, 2, 3, denote the model orders regarding the endogenous information and the exogenous information from other variables, respectively, and let n, m be natural numbers,

A = { a_1 : p_1 = n, q_1 = 0;  a_2 : p_2 = n+1, q_2 = 0;  a_3 : p_3 = n+2, q_3 = 0 }
B = { b_1 : p_1 = n, q_1 = m;  b_2 : p_2 = n, q_2 = m+1;  b_3 : p_3 = n, q_3 = m+2 }
C = { c_1 : p_1 = n, q_1 = m;  c_2 : p_2 = n+1, q_2 = m;  c_3 : p_3 = n+1, q_3 = m+1 }.

We further assume that a_i, b_i, and c_i, i = 1, 2, 3, have the same residual variance. For space A, the models only specify the regression model orders of the endogenous information by AIC/BIC, and then specify the causal effect of endogenous information by F-statistics. The models in space B specify the regression model orders of the exogenous effect by AIC/BIC, and then specify the causal effect of exogenous information by F-statistics. But the models in space C specify the regression model orders of the endogenous and exogenous information separately, and specify the causal effect between c_1, c_2 and c_3 by F-statistics. In this situation, the model selection in conventional GCA is split into a two-stage scheme: the model orders are determined by AIC/BIC and then the causal effect is quantified by F-statistics sequentially. It is clear that the final inference might differ with the rules applied, even though the three classes are completely equivalent from a model description standpoint.

As stated above, specifying the effects of endogenous and exogenous information in conventional GCA is split into a two-stage scheme. Specifying the regression model orders of the historical information, which contains the regression of endogenous information in the restricted model and the regression of both endogenous and exogenous information in the unrestricted model, is mainly based on AIC/BIC theory; the causal effect of the exogenous information is then specified by F-statistics. However, specifying the two effects is the same kind of model selection problem from the mathematical perspective. Different theories might generate different benchmarks in the two stages of model selection, and therefore degrade the performance of GCA.

Aside from theoretical considerations, there are still some issues in the F-test itself to be discussed. The American Statistical Association's (ASA) statement on statistical significance and p-values has led to a collective rethinking in the entire scientific community, and many scientists believe that the application of p-values in current scientific research has been distorted [58]–[61]. This also leads many studies to selectively report results, and the dichotomous p-value is an arbitrary reductionism for scientific research. We believe that any reasonable study has its implied meaning, whether or not the p-value is less than 0.05. We should understand the true meaning of p-values, and it should not be over-exaggerated or degraded [62]–[64]. It is just a statistical tool, or just one of the mapping relationships, that is purely mathematical. Specifically, pairwise F-statistics in conventional GCA gives rise to several potential problems that might lead to misleading or unreasonable inferences in connectivity analysis.
Firstly, the model comparison with F-statistics is performed under a specific significance level. The assignment of the significance level is a subjective matter in the conventional GCA process [65], [66]. A significance level that is too low could cause false connections originating from noise, whereas a significance level that is too high could erase the actual connections. When there is no rule to justify the assignment of the significance level, F-statistics will lead to different connectivity results depending on the significance level chosen.

Secondly, the selection results of pairwise F-statistics are heavily dependent on the initially selected model and the search path in the model space. Consider the collection of three models M = {A, B, C} and let S denote the residual variance of each model, with S_A = S, S_B = S + \Delta S, and S_C = S - \Delta S. We further assume \Delta S \le I/2, where I is the interval satisfying statistical significance. The aim of F-statistics is to find the model appropriate to the observations from the model space, which is evidently model C in this case. The search path B \to A \to C will arrive at the optimal model C, whereas if we start at A and follow the search path A \to C \to B, we will reach the suboptimal model A; that is, the optimal model C is not considered. The distinct results obtained by using F-statistics for model selection along different search paths and from different initial models are given explicitly in Fig. 1.




FIGURE 1: Consider a conceivable case involving a collection of three candidate models M = {A, B, C} with residual variances S_A = S, S_B = S + \Delta S, and S_C = S - \Delta S. Suppose \Delta S \le I/2, where I is the interval satisfying statistical significance. It is clear that model C, with the minimum variance, is the optimal model in this case. Following search path B \to A \to C (a) we can arrive at the optimal model C, while if we start with A and follow the search path A \to C \to B (b), we reach the suboptimal model A. The results using F-statistics for model selection rely on both the search path and the initial model. Comparing traverses within the model space can only partially reduce the risk of model misspecification, and is not always available due to the nested relation between comparable models in F-statistics. Moreover, this strategy also leads to concern about the computational complexity.

In fact, F-statistics cannot discriminate models whose residual variances lie within a specific interval relative to the chosen significance level and the noise level. On the other hand, F-statistics uses the extra sum of squares principle to identify a better model. This means that only models with a nested relationship can be compared. The competing models then perform pairwise comparison in an indirect way, through an intermediate model, namely the unrestricted model in F-statistics. Therefore, the comparison between any two models is not available in practical applications, and the search path relies on the structure of all candidate models. In such a situation the optimal model is not always guaranteed.

The third potential problem relates to computational complexity. Consider the simplest case, where conditional causal connections are not taken into account (q = 1 and r = 0 in Eq. (3)). Suppose that there are m candidate models for any one directional causality in a network with n nodes; the number of F-tests that need to be performed is m(m-1)(n-1), and the total number of comparisons is mn(m-1)(n-1). The problem of computational complexity is compounded by conditional GCA, and it will still be intractable when investigating large networks [50].

Although different approaches might be applied for model selection in GCA [37], [67], the two stages are kept unchanged. They are generally based on two different mathematical theories in most GCA applications. However, as mentioned above, determining the model orders by AIC/BIC or quantifying the causal effects by F-statistics can essentially be considered as a generalized model selection problem from a modeling standpoint. Model selection by a single mathematical principle throughout GCA, which could easily be ignored, would determine the final result about the working pattern of the human brain. Thus the causal connectivity obtained by conventional GCA could be misleading, and the main reason may be the splitting of model selection into a two-stage scheme. Since all the issues we discuss can be attributed to model selection problems, a more practical model selection method needs to be enabled.

To keep consistency in model selection for GCA, we propose a code length guided framework based on the MDL principle, which Rissanen first proposed to quantify the parsimony of a model [68], to map the two-stage scheme into the same model space. Specifically, our proposal involves constructing a code length guided framework, so that the model selection process in GCA can be converted into comparing the code length of each model. That means our proposal incorporates the endogenous and the exogenous information into a unified model selection process, in which the information is quantified by converting it into code length in order to obtain causality. The two-stage scheme based on two different theories is unified under a single mathematical framework, the MDL principle, which guarantees a single benchmark in GCA methodology research.

Above all, the MDL is an information criterion that provides a generic solution for the model selection problem [69]. As a broad principle, the MDL represents a completely different approach to model selection relative to traditional statistical approaches, such as F-statistics and the AIC or BIC. Compared with the AIC/BIC, the use of the MDL does not require any assumptions about the data generation mechanism. In particular, a distribution does not need to be assigned to each model class. The objective of model selection in the MDL is not to estimate an assumed but unknown distribution, but to find models that more realistically represent the data [70]–[72].

Fundamentally, MDL has intellectual roots in the algorithmic or descriptive complexity theory of Kolmogorov, Chaitin, and Solomonoff [73].

Only considering the probability distribution as a basis for generating descriptions, Rissanen endowed MDL with a rich information-theoretic interpretation [74]–[79]. Because of these characteristics, and because it is reasonable to believe that the human brain obeys a minimum energy principle, causality analyzed with the help of the MDL principle may be more in line with physiological models of the brain. On the whole, our proposal takes the MDL principle as the single mathematical framework to select the generalized model for GCA, so as to ensure consistency, objectivity and parsimony.

III. THE MINIMUM DESCRIPTION LENGTH PRINCIPLE
The principle of parsimony is the soul of model selection. To implement the parsimony principle, one must quantify the parsimony of a model relative to the available data. With the help of the work of Kolmogorov [80], [81] and of Wallace and Boulton [82], Rissanen formulated MDL as a broad principle governing statistical modeling in general. At the beginning of modeling, all we have is the data. Luckily, with the help of the algorithmic or descriptive complexity theory of Kolmogorov, Chaitin, and Solomonoff, MDL regards a probability distribution as a descriptive standpoint. MDL has some connections with AIC and BIC, and sometimes it behaves similarly to them [72], [83]. The difference is that MDL fixes attention on the length function rather than on the code system. Therefore, many kinds of probability distributions can be compared in terms of their descriptive power [84]. Our code length guided framework can be used in generalized model selection as long as there is a probability distribution in the model.

A. PROBABILITY AND IDEALIZED CODE LENGTH
In order to describe the MDL principle explicitly, we deduce the formula of MDL in different cases. In the model selection process of MDL, we need to compare the code lengths obtained from the probability distributions, so it is essential to understand the relationship between probability and code length [72], [85].

A code \varrho on a set \mathcal{A} is simply a mapping from \mathcal{A} to a set of codewords. Let \mathcal{A} be a finite binary set and let Q denote a probability distribution of any element a in \mathcal{A}. The code length of a is -\log_2 Q(a), the negative logarithm of Q. For example, the Huffman code is one of the algorithms that constructs this relationship between probability and idealized code length [84]. Suppose that the elements of \mathcal{A} are generated according to a known distribution P. Given a code \varrho on \mathcal{A} with length function L, the expected code length of \varrho with respect to P is defined as

L_\varrho = \sum_{x \in \mathcal{A}} P(x) L(x).   (5)

As is well known, if \varrho is a prefix code, the code length L is equivalent to -\log_2 Q(x) for some distribution Q on \mathcal{A}. Given an arbitrary code, if no codeword is the prefix of any other, unique decodability is guaranteed; any code satisfying this codeword condition is called a prefix code. By Shannon's source coding theorem, for any prefix code \varrho on \mathcal{A} with length function L, the expected code length L_\varrho is bounded below by H(P), the entropy of P. That is,

L_\varrho \ge H(P) = -\sum_{x \in \mathcal{A}} P(x) \log_2 P(x),   (6)

where equality holds if and only if L = -\log_2 P; in other words, the expected code length is minimized when Q = P.

B. CRUDE TWO-PART CODE MDL
In our view, modeling is a process of finding the regularity in data and compressing it. In model selection within the MDL principle, what we have to do is select a suitable model based on the regularity of the object. Generally, the model we pick may be overfitting or too simple; but in model selection guided by the MDL principle, the complexity term and the error term in data fitting are both incorporated into the code length guided framework, which ensures the objectivity of the operation.

To date there are several forms of the MDL principle applied to polynomial or other types of hypothesis and model selection. In the original MDL, the description of the data set is usually divided into two parts: one part describes the model's self-information, and the other describes the data set with the help of the probability model chosen in part one. Consequently, here we first introduce the most common implementation of the idea, the two-part code version of MDL [72], [85].

Suppose the data are D \in \mathcal{X}^N where \mathcal{X} = {0, 1}. Then there is a probability distribution P \in \mathcal{M}, and we minimize

L_{1,2}(P, D) = L_1(D \mid P) + L_2(P).   (7)

The goal here is to select a reasonable model for D that makes good predictions of future data coming from the same source; we therefore model the data using the class \mathcal{B} of all Markov chains [85].

a: The first part
To get a better feel for the code L_1, we consider two examples. First, let P_\theta \in \mathcal{B}^{(1)} be some Bernoulli distribution with P_\theta(x = 1) = \theta, and let D = (x_1, \ldots, x_N). Since P_\theta(D) = \prod_i P_\theta(x_i) and the maximum likelihood estimate \hat{\theta} is equal to the frequency of 1 in D, the first part L_1(D \mid P), evaluated at \theta = \hat{\theta}, is given as

-\log P_{\hat{\theta}}(D) = -n_1 \log \hat{\theta} - n_0 \log(1 - \hat{\theta}) = -N[\hat{\theta} \log \hat{\theta} + (1 - \hat{\theta}) \log(1 - \hat{\theta})],   (8)

where n_j denotes the number of occurrences of symbol j in D. Second, let k = 2^\gamma; the \gamma-th order Markov chain model class, denoted by \mathcal{B}^{(k)}, is defined as

\mathcal{B}^{(k)} = \{ P_\theta^{(k)} \mid \theta \in \Theta^{(k)} \}, \quad \Theta^{(k)} = [0, 1]^k,

where \theta = (\eta_{[1|0\cdots0]}, \eta_{[1|0\cdots01]}, \ldots, \eta_{[1|1\cdots11]}). For any data sequence,

P_\theta(D) = \Big(\tfrac{1}{2}\Big)^{\gamma} \prod_{i=\gamma+1}^{N} P_\theta(x_i \mid x_{i-1}, \ldots, x_{i-\gamma})   (9)

and

-\log P_\theta(D) = -N \sum_{y \in \{0,1\}^{\log k}} \big( \hat{\eta}_{[1|y]} \log \eta_{[1|y]} + (1 - \hat{\eta}_{[1|y]}) \log(1 - \eta_{[1|y]}) \big) + \gamma.   (10)

Here \gamma = \log k is the number of bits needed to encode the first \gamma outcomes in D. The maximum likelihood parameters \hat{\eta}_{[1|y]} are equal to the conditional frequencies of 1 prefixed by y.
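As a small illustration of the first part of the code, the sketch below (our own; the function name is hypothetical) evaluates Eq. (8) in bits for a binary sequence, showing that a strongly biased sequence is cheaper to describe than a balanced one.

```python
import numpy as np

def bernoulli_first_part(data):
    """Code length -log2 P_theta_hat(D) of Eq. (8) for a 0/1 sequence, in bits."""
    data = np.asarray(data)
    N = data.size
    theta = data.mean()                      # maximum likelihood estimate: frequency of 1s
    if theta == 0.0 or theta == 1.0:         # a constant sequence costs essentially nothing
        return 0.0
    return -N * (theta * np.log2(theta) + (1 - theta) * np.log2(1 - theta))

rng = np.random.default_rng(0)
print(bernoulli_first_part(rng.integers(0, 2, 1000)))               # close to 1000 bits
print(bernoulli_first_part((rng.random(1000) < 0.9).astype(int)))   # roughly 470 bits
```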

b: The second part
In order to describe a P \in \mathcal{B}, we have to describe a pair (k, \theta). We encode the parameters of the distribution model by first encoding k with some prefix code C_{2a}, and then coding \theta with the help of a prefix code C_{2b}. The resulting code C_2 (the code for the second part) is then defined by C_2(k, \theta) = C_{2a}(k) C_{2b}(\theta \mid k). First, since k is a natural number, it can be encoded with the standard prefix code for the integers,

L_{2a}(k) = L_{\mathbb{N}}(k) = O(\log k).   (11)

Since \Theta^{(k)} = [0,1]^k is an uncountably infinite set, we have to discretize it first. More precisely, we restrict \Theta^{(k)} to some finite set \ddot{\Theta}_d^{(k)} of parameters that can be described using a finite precision of d bits per parameter. The number of elements of \ddot{\Theta}_d^{(k)} is then (2^d)^k, so it takes \log (2^d)^k = k \cdot d bits to describe any particular \vartheta \in \ddot{\Theta}_d^{(k)}, and we may call d the precision used to encode a parameter. Letting d_\vartheta be the smallest d such that \theta \in \ddot{\Theta}_d^{(k)}, this gives

L_2(k, \theta) = L_{\mathbb{N}}(k) + L_{\mathbb{N}}(d_\vartheta) + k d_\vartheta, if d_\vartheta < \infty; and \infty otherwise.   (12)

In the end, the crude two-part code MDL for Markov chain hypothesis selection is given by

\min_{k, d \in \mathbb{N};\; \theta \in \ddot{\Theta}_d^{(k)}} \big[ -\log P_{k,d}(D) + k d + L_{\mathbb{N}}(k) + L_{\mathbb{N}}(d) \big].   (13)

IV. CODE LENGTH GUIDED MODEL SELECTION IN CAUSALITY ANALYSIS
As stated above, conventional GCA splits the whole process into two stages, which actually model the endogenous information and the exogenous information. Since MDL has a close relationship with information theory, causality analysis with MDL will also be more convincing and suitable here. Further, the regression of endogenous information can be converted to code length, and the relative effect of exogenous variables can also be quantified by the code length guided framework, so that the whole model selection process for GCA is unified into the same model space.

The following is the MDL principle guided model selection in causality analysis [68]. A variable X_N is given by

x_t = a_1 x_{t-1} + a_2 x_{t-2} + \cdots + a_n x_{t-n} + \varepsilon_t.   (14)

The parameter vector consists of \theta = (n, \xi) with \xi = (\sigma^2, a_1, \ldots, a_n), where \sigma^2 = \xi_0 is the variance parameter of the zero-mean Gaussian distribution model for \varepsilon_t, and t = 1, \ldots, m, where m can be any value larger than n to keep the solution determined and N is the data length. In order to describe x_t, we turn to the Gaussian distribution for \varepsilon_t, and RSS = \sum_{t=1}^{m} \varepsilon_t^2 denotes the residual sum of squares corresponding to the estimation in the model. Since there is a Gaussian distribution model, applying the two-part form of MDL, the total code length is given as

L(x, \theta) = m \ln(\sqrt{2\pi}\,\sigma) + \frac{RSS}{2\sigma^2} + \sum_{i=0}^{n} \ln\frac{|\xi_i|}{\delta} + \ln(n + 1).   (15)

Here \delta is the precision, and it is optimal to choose \delta = 1/\sqrt{N} [72], [75], [86]. In particular, terms with |\xi_i|/\delta < 1 should be ignored.

A. TIME-DOMAIN FORMULATION
Combining the above formulas, the causality investigation with the code length guided framework in the time domain can be carried out. Given two time series X_N and Y_N, we consider two different models A and B (Eq. (16) and Eq. (17)) to describe x_t. The representations are

X_t = \sum_{j=1}^{n} a_{1j} X_{t-j} + \varepsilon_{1t}
Y_t = \sum_{j=1}^{n} d_{1j} Y_{t-j} + \eta_{1t}   (16)

where var(\varepsilon_{1t}) = \Sigma_1 and var(\eta_{1t}) = \Gamma_1. The bivariate regressive representations are given by

X_t = \sum_{j=1}^{n} a_{2j} X_{t-j} + \sum_{j=1}^{n} b_{2j} Y_{t-j} + \varepsilon_{2t}
Y_t = \sum_{j=1}^{n} c_{2j} X_{t-j} + \sum_{j=1}^{n} d_{2j} Y_{t-j} + \eta_{2t}   (17)

where var(\varepsilon_{2t}) = \Sigma_2 and var(\eta_{2t}) = \Gamma_2, and their contemporaneous covariance matrix is

\Sigma = \begin{pmatrix} \Sigma_2 & \Upsilon_2 \\ \Upsilon_2 & \Gamma_2 \end{pmatrix}, \quad where \; \Upsilon_2 = cov(\varepsilon_{2t}, \eta_{2t}).

Finally, \varepsilon_{1t} and \varepsilon_{2t} have Gaussian distributions with mean 0 and unknown variance \sigma^2; they denote the noise of the time series, i.e., the fitting residuals. Therefore, the distribution of the residual terms \varepsilon_t can serve as the standpoint from which to describe the model within MDL. Then, the code lengths obtained for models A and B can be compared to identify the causal influence between x_t and y_t.

In the whole process of model selection for GCA, the causality investigation is mapped into the unified code length guided framework. According to Eq. (15), the code lengths of model A and model B can be given respectively. By the definition of Granger causality, the influence from Y to X is defined within our code length guided framework as

F_{Y \to X} = L_X - L_{X+Y}.   (18)

Here L_X denotes the code length of the optimal model in Eq. (16), and L_{X+Y} denotes the code length of the optimal model in Eq. (17). If F_{Y \to X} > 0, a causal influence from Y to X exists; otherwise, there is no causal influence from Y to X.
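The following sketch (a minimal illustration of our own, assuming ordinary least squares fits, natural logarithms, and the precision \delta = 1/\sqrt{N} mentioned above) evaluates the two-part code length of Eq. (15) for an autoregressive model and applies Eq. (18) to decide whether Y Granger-causes X.

```python
import numpy as np

def lagged(x, lags, start):
    return np.column_stack([x[start - i: len(x) - i] for i in range(1, lags + 1)])

def code_length(design, target, N):
    """Two-part code length of Eq. (15) for one regression model."""
    xi, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ xi
    m = len(target)
    rss = float(np.sum(resid ** 2))
    sigma = np.sqrt(rss / m)                       # maximum likelihood noise scale
    delta = 1.0 / np.sqrt(N)                       # parameter precision
    params = np.concatenate(([sigma ** 2], xi))    # xi_0 = sigma^2, then the coefficients
    ratios = np.abs(params) / delta
    penalty = float(np.sum(np.log(ratios[ratios > 1.0])))  # terms with |xi_i|/delta < 1 ignored
    return (m * np.log(np.sqrt(2 * np.pi) * sigma) + rss / (2 * sigma ** 2)
            + penalty + np.log(len(xi) + 1))

def mdl_granger(x, y, max_lag=5):
    """F_{Y->X} of Eq. (18); a positive value indicates a causal influence from y to x."""
    N, start = len(x), max_lag
    target = x[start:]
    L_x = min(code_length(lagged(x, p, start), target, N)
              for p in range(1, max_lag + 1))
    L_xy = min(code_length(np.hstack([lagged(x, p, start), lagged(y, p, start)]), target, N)
               for p in range(1, max_lag + 1))
    return L_x - L_xy
```

Each candidate order is scored by its total code length, so the lag selection and the restricted/unrestricted comparison are decided by the same criterion, with no intermediate model and no significance threshold.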

As the causality is represented above, our proposal unifies the two-stage scheme into the code length guided framework, which avoids the inconsistency of two different mathematical theories and the subjectivity of F-statistics in the model selection of conventional GCA.

To compare with conditional GCA, we consider the influence from Y to X while controlling for the effect of the conditional node Z on X. Firstly, the joint autoregressive representations are given by

X_t = \sum_{j=1}^{n} a_{3j} X_{t-j} + \sum_{j=1}^{n} b_{3j} Z_{t-j} + \varepsilon_{3t}
X_t = \sum_{j=1}^{n} a_{4j} X_{t-j} + \sum_{j=1}^{n} b_{4j} Y_{t-j} + \sum_{j=1}^{n} c_{4j} Z_{t-j} + \varepsilon_{4t}   (19)

with var(\varepsilon_{3t}) = \Sigma_3 and var(\varepsilon_{4t}) = \Sigma_4. By the definition of conditional GCA, the causal influence from Y to X conditioned on Z is defined as

F_{Y \to X | Z} = L_{X+Z} - L_{X+Y+Z},   (20)

and it exists if F_{Y \to X | Z} > 0. In the same way, the causal influence from Z to X conditioned on Y is given by

F_{Z \to X | Y} = L_{X+Y} - L_{X+Y+Z},   (21)

and it exists if F_{Z \to X | Y} > 0. Clearly, in our code length guided framework, all candidate models can be compared freely in the context of their code length. Different from the traditional method, in which the causal connection is obtained by repeated pairwise comparisons between models, our method can map all candidate models into the same model space without the repeated comparisons and obtain the conditional causal influence directly. That is, if both F_{Y \to X} > 0 and F_{Z \to X} > 0 exist,

F_{Y,Z \to X} = \min(L_{X+Y}, L_{X+Z}) - L_{X+Y+Z}.   (22)

Here, if F_{Y,Z \to X} > 0, it means that both Y and Z have a direct influence on X. But if F_{Y,Z \to X} is less than 0, there are two cases. One is F_{Y,Z \to X} = (L_{X+Y} - L_{X+Y+Z}) < 0, which means that only Y has a direct influence on X. The other is F_{Y,Z \to X} = (L_{X+Z} - L_{X+Y+Z}) < 0, which means that Z impacts X directly. In the unified model space, multiple selected models can be compared directly by code length, which reduces the complexity of the algorithm. In this way, our proposal is more in line with Occam's razor, or the principle of parsimony.

B. FREQUENCY-DOMAIN FORMULATION
With the help of Geweke's work [39], the total interdependence between two time series X_t and Y_t can be decomposed into three components: two directional causal influences due to their interaction patterns, and the instantaneous influence due to factors possibly exogenous to the (X, Y) system (e.g. a common driving input) [41], [54]. Here other forms of regressive representations need to be considered. We first rewrite Eq. (16) and Eq. (17) as

\begin{pmatrix} a_2(L) & b_2(L) \\ c_2(L) & d_2(L) \end{pmatrix} \begin{pmatrix} X_t \\ Y_t \end{pmatrix} = \begin{pmatrix} \varepsilon_{2t} \\ \eta_{2t} \end{pmatrix},   (23)

where a_2(0) = 1, b_2(0) = 0, c_2(0) = 0, d_2(0) = 1, and the lag operator L denotes L X_t = X_{t-1}. Performing the Fourier transform on both sides of Eq. (23), we then left-multiply by

P = \begin{pmatrix} 1 & 0 \\ -\Upsilon_2/\Sigma_2 & 1 \end{pmatrix}   (24)

on both sides and rewrite the resulting equation; the normalized equations yield

\begin{pmatrix} X_\omega \\ Y_\omega \end{pmatrix} = \begin{pmatrix} D_{11}(\omega) & D_{12}(\omega) \\ D_{21}(\omega) & D_{22}(\omega) \end{pmatrix} \begin{pmatrix} E_2(\omega) \\ H'_2(\omega) \end{pmatrix},   (25)

where H'_2(\omega) = H_2(\omega) - \frac{\Upsilon_2}{\Sigma_2} E_2(\omega). The spectral matrix is S(\omega) = D(\omega)\,\Sigma\,D^{*}(\omega), where * denotes complex conjugation and matrix transposition. The spectrum of X_t is

S_{11}(\omega) = D_{11}(\omega)\,\Sigma_2\,D_{11}^{*}(\omega) + D_{12}(\omega)\,\Upsilon'_2\,D_{12}^{*}(\omega),   (26)

where \Upsilon'_2 = \Gamma_2 - \Upsilon_2^2/\Sigma_2. The first term in Eq. (26) is interpreted as the intrinsic influence and the second term as the causal influence on X_t due to Y_t at frequency \omega. Based on this transformation, the causal influence from Y_t to X_t at frequency \omega is

f_{Y \to X}(\omega) = \ln \frac{S_{xx}(\omega)}{D_{11}(\omega)\,\Sigma_2\,D_{11}^{*}(\omega)}.   (27)

The model orders of the historical information in the frequency domain for conventional GCA are determined by AIC/BIC. Distinct from conventional GCA, we obtain the causal connectivities between nodes at frequency \omega by the code length guided framework.

V. EXPERIMENTS
A. 3-NODE NETWORK SIMULATION EXPERIMENTS
1) Protocol for 3-node Network
To verify the performance of the MDL principle in causality analysis, a simple 3-node network is used. There were 1000 data points in the time series of each node; to keep the stationarity of the synthetic data, the first 700 data points were removed. The initial value of each node was 1, and the variance of the noise terms \varepsilon_i (i = 1, 2, 3) varied from 0.15 to 0.35. The nodes were generated by

y_{1,t} = 1.5 y_{1,t-1} - 0.9 y_{1,t-2} + \varepsilon_1
y_{2,t} = 0.8 y_{1,t-1} + 0.2 y_{2,t-1} + \varepsilon_2   (28)
y_{3,t} = -0.8 y_{1,t-1} + 0.4 y_{3,t-1} + \varepsilon_3

2) Results from 3-node Network
Firstly, to ensure the rigor of the proposal, the distribution of the residual between the selected model and the observed data was verified. It closely followed a Gaussian distribution with mean 0, as seen in Fig. 2a and 2b, which guaranteed the validity of our initial description standpoint. The code length changed with the time lag of the autoregressive model as shown in Fig. 2c; it dropped to its minimum when the time lag was 2, which corresponds to the generating model in Eq. (28).

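For reference, a sketch of the data generator for Eq. (28) (our own illustration; the burn-in handling and the reconstructed coefficient signs follow the protocol above and are assumptions, not code from the paper):

```python
import numpy as np

def simulate_3node(n_keep=300, n_total=1000, noise_var=0.25, seed=0):
    """One realization of the 3-node network in Eq. (28); early samples are discarded as burn-in."""
    rng = np.random.default_rng(seed)
    y = np.ones((n_total, 3))                                  # initial values of each node set to 1
    e = rng.normal(0.0, np.sqrt(noise_var), size=(n_total, 3))
    for t in range(2, n_total):
        y[t, 0] = 1.5 * y[t - 1, 0] - 0.9 * y[t - 2, 0] + e[t, 0]
        y[t, 1] = 0.8 * y[t - 1, 0] + 0.2 * y[t - 1, 1] + e[t, 1]
        y[t, 2] = -0.8 * y[t - 1, 0] + 0.4 * y[t - 1, 2] + e[t, 2]
    return y[n_total - n_keep:]                                # keep only the last n_keep points
```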


FIGURE 2: The fitted Gaussian distribution of the residual. (a): a histogram of the residual values in the data together with the fitted normal density function. (b): a normal probability plot comparing the distribution of the data to the normal distribution. (c): the code length obtained by our proposal as a function of the time lag for node 1.


FIGURE 3: Comparison between the two methods under low-variance noise (0.15–0.35); the numbers on the arrows indicate the accuracy with which the causal influence was identified in 1000 samples. Specifically, the accuracy represents only the probability that the single connection direction was identified by the two methods. (a): causal network obtained by conventional GCA (F-statistics, P < 0.05). (b): causal network obtained by our proposal (MDL method).

TABLE 1: Comparison between conventional GCA and MDL method in 3-node network

Noise level | Method                 | Node 1 | Node 2 | Node 3 | Total
Low         | F-statistics, α = 0.1  | 82.3   | 71.8   | 72.1   | 65.4
Low         | F-statistics, α = 0.05 | 90.9   | 85.5   | 85.5   | 82
Low         | F-statistics, α = 0.01 | 98.1   | 96     | 96.2   | 95.2
Low         | MDL                    | 98.4   | 97.3   | 97.2   | 96.5
Moderate    | F-statistics, α = 0.1  | 81.4   | 72.5   | 71.8   | 64.8
Moderate    | F-statistics, α = 0.05 | 89.4   | 84.6   | 84.7   | 79.9
Moderate    | F-statistics, α = 0.01 | 97.6   | 96.2   | 96.3   | 95.1
Moderate    | MDL                    | 98.5   | 97.7   | 97.4   | 96.8
High        | F-statistics, α = 0.1  | 80.3   | 72     | 71.7   | 64.5
High        | F-statistics, α = 0.05 | 89.5   | 83.6   | 83.7   | 79.1
High        | F-statistics, α = 0.01 | 98     | 96.7   | 96.7   | 95.7
High        | MDL                    | 98.8   | 98     | 97.6   | 97.2

All values are accuracies in %. The variance of the low noise level data ranges from 1.5 to 3.5, the moderate level from 2.5 to 4.5, and the high level from 3.5 to 5.5. α denotes the significance level of the F-test in conventional GCA. The accuracy of node i (i = 1, 2, 3) only measures the causal connectivities with node i, not including the connections between the other two nodes. The total accuracy represents the accuracy with which the true model is found; it quantifies only the causal influence from node 1 to node 2 and node 3.

The causal connectivities identified by our proposal are shown in Fig. 3b, where the causal connections were obtained by comparing code lengths according to Eq. (15); the causal connectivities obtained by conventional GCA are shown in Fig. 3a. We found that the causal influences from node 1 to node 2 and node 3 were identified by both conventional GCA and our proposal. For the other four causal influences, our proposal almost guaranteed 99% accuracy, whereas conventional GCA only guaranteed 95% accuracy.


FIGURE 4: Causal connectivities of the 3-node network obtained in the frequency domain. Here we identified the causal influence at frequencies \omega from 1 to 30 Hz, and at 50 and 100 Hz.

At the same time, experiments with three different noise levels and three confidence intervals were carried out; the comparison between conventional GCA and our proposal is shown in Table 1. The accuracy of the causal connections associated with each node and of the overall true model was also verified. Comparing the accuracy of a single node and of the overall network at the same noise level and the same confidence interval, it is unevenly distributed for conventional GCA. Meanwhile, conventional GCA was very sensitive to the different confidence intervals, especially in the identification of the overall network, even though it had a stable performance at different noise levels. In general, our proposal always showed a relatively good performance, whether in the identification of the whole network or of a single node. It is worth noting that the results obtained by conventional GCA at α = 0.01 in the F-statistics were close to the results of our proposal. These results were also in line with our expectations, because our proposal considers the complexity of the model more thoroughly. As we emphasized above, only one mathematical principle guides the model selection throughout the GCA process, so our proposal is a more rigorous, or more robust, approach.

The oscillations in neurophysiological systems and neuroscience data are thought to constrain and organize neural activity within and between functional networks across a wide range of temporal and spatial scales. Geweke-Granger causality demonstrated that oscillations at specific frequencies are associated with the activation or inactivation of different encephalic regions. For conventional GCA in the frequency domain, the model selection of the history information is determined by AIC or BIC. For our code length guided framework, the selected models for the history information regress into the true model space automatically. Due to its intellectual roots in descriptive complexity and its close tie with information theory, our proposal may be more capable of identifying causality in the frequency domain.

As in the time domain, the causal connectivities at frequency \omega in the 3-node network obtained by our proposal are shown in Fig. 4. In particular, the causal connectivities in the frequency domain were obtained between pairs of nodes, which means that we did not introduce conditional GCA in the frequency domain. This is mainly because there still seems to be some obfuscation with the method of using the conditional GCA concept in the frequency domain. Therefore, as shown in Fig. 4, some non-existent connections between nodes were often misjudged, except for the causal influences 1 → 2 and 1 → 3. We found that the causal connectivities between node 2 and node 3 had a bigger chance of being misjudged at low frequencies (0–10 Hz). Actually, compared with the results obtained by conventional GCA in the time domain, the causalities between two nodes were more legible in the frequency domain, which means that only direct causalities showed consistent results in our analysis. For example, stable and significant causal influences existed only for 1 → 2 and 1 → 3.

B. 5-NODE NETWORK SIMULATION EXPERIMENTS
1) Protocol for 5-node network
To verify in more detail whether the proposed MDL method is just a conventional GCA method with a higher level of confidence, a more complex 5-node network was used to further verify the robustness and the validity of our proposal, as shown in Fig. 5. The noise terms \varepsilon_i (i = 1, 2, ..., 5) followed Gaussian distributions with mean 0, and the variance ranged from 0.15 to 0.3. The first two initial values of the nodes were 1, and the nodes were given by

x_{1,t} = 0.792 x_{1,t-1} - 0.278 x_{1,t-2} + \varepsilon_1
x_{2,t} = 0.768 x_{2,t-1} - 0.503 x_{2,t-2} + 0.83 x_{1,t-1} - 0.32 x_{1,t-2} + \varepsilon_2
x_{3,t} = 0.67 x_{3,t-1} - 0.312 x_{3,t-2} + 0.56 x_{2,t-1} - 0.42 x_{2,t-2} + \varepsilon_3   (29)
x_{4,t} = 0.733 x_{4,t-1} - 0.27 x_{4,t-2} + 0.72 x_{2,t-1} - 0.27 x_{2,t-2} + 0.52 x_{3,t-1} - 0.456 x_{3,t-2} + 0.76 x_{5,t-1} - 0.33 x_{5,t-2} + \varepsilon_4
x_{5,t} = 0.845 x_{5,t-1} - 0.24 x_{5,t-2} + 0.68 x_{4,t-1} - 0.254 x_{4,t-2} + \varepsilon_5




FIGURE 5: The relationships of the simulation data sets in the 5-node network.


FIGURE 6: Causal connectivities between the 5 nodes identified by conventional GCA and by our proposal, respectively. (a): causal connection network obtained by conventional GCA with α = 0.05 in the F-test. (b): causal connection network obtained by our proposal (MDL method).

TABLE 2: 5-node networks

                  Edge   | F-statistics (α = 0.05) | F-statistics (α = 0.01) | MDL method
True connections:
                  1 → 2  | 1000/1000               | 1000/1000               | 1000/1000
                  2 → 3  | 1000/1000               | 1000/1000               | 1000/1000
                  2 → 4  | 1000/1000               | 1000/1000               | 1000/1000
                  3 → 4  | 507/1000                | 456/1000                | 1000/1000
                  4 → 5  | 1000/1000               | 1000/1000               | 1000/1000
                  5 → 4  | 929/1000                | 780/1000                | 1000/1000
False connections:
                  1 → 3  | 36/1000                 | 8/1000                  | 4/1000
                  1 → 4  | 43/1000                 | 1/1000                  | 1/1000
                  1 → 5  | 11/1000                 | 0/1000                  | 0/1000
                  2 → 1  | 54/1000                 | 12/1000                 | 15/1000
                  2 → 5  | 58/1000                 | 14/1000                 | 15/1000
                  3 → 1  | 44/1000                 | 17/1000                 | 7/1000
                  3 → 2  | 15/1000                 | 2/1000                  | 2/1000
                  3 → 5  | 83/1000                 | 29/1000                 | 5/1000
                  4 → 1  | 30/1000                 | 8/1000                  | 14/1000
                  4 → 2  | 6/1000                  | 0/1000                  | 2/1000
                  4 → 3  | 5/1000                  | 1/1000                  | 1/1000
                  5 → 1  | 53/1000                 | 18/1000                 | 16/1000
                  5 → 2  | 17/1000                 | 1/1000                  | 2/1000
                  5 → 3  | 41/1000                 | 8/1000                  | 11/1000

Causal connectivities between nodes in the 5-node network under low-variance noise (0.15–0.3). A true connection represents a causal influence that truly exists in Fig. 5; a false connection means one that does not exist.

2) Results from 5-node Network
The result of the connections between the 5 nodes is shown in Fig. 6b; the causal influences analyzed by code length were largely consistent with the connections in Fig. 5. Simultaneously, the causalities analyzed by conventional GCA between the 5 nodes are shown in Fig. 6a. As with the results in the 3-node network, our proposal showed 100% consistency with the connections in Fig. 5 between directly causally related nodes, as seen in Table 2. For the other connections in the 5-node network, the accuracy of our proposal was also not below 98.4%.


FIGURE 7: Causal connectivities of the 5-node network obtained in the frequency domain. As for the 3-node network, we identified the causal influence at frequencies \omega from 1 to 30 Hz, and at 50 and 100 Hz.

In contrast, conventional GCA did not show the same robustness as our proposal. Causal connectivities between directly related nodes were not well identified; for example, only 507 of 1000 samples were identified as causal influences from node 3 to node 4, and the causal influence from node 5 to node 4 was identified with 92.9% accuracy. For the other connections, the accuracy of conventional GCA was also poorer than that of our proposal. Clearly, at a higher confidence level (α = 0.01) of conventional GCA, although the specificity of causal identification increased, its sensitivity decreased.

More importantly, whether for the causal influence from node 3 to node 4 or from node 5 to node 4, the significance level of the connection was not sufficient for it to be identified, even for α = 0.01 in the F-statistics. Therefore, the results demonstrate that our proposal is not at all equivalent to conventional GCA with a higher significance level in F-statistics. Obviously, when the target network was more complicated in simulation, our proposal showed a more desirable property in the time domain, while conventional GCA frequently made mistakes. Fortunately, our proposal performed very well regardless of the existence of relationships between nodes, even in more complicated networks. As in the 3-node network, the resulting causal network was unrelated to the varying variance of the noise, from (0.15–0.3) to (0.35–0.5). In conditional causality analysis, our approach reduced the complexity of the algorithm, eliminating the need for repeated pairwise comparisons between models, while ensuring the accuracy of the results.

Subsequently, we identified the causal connectivities among the 5-node network in the frequency domain with our proposal, as seen in Fig. 7. As in the 3-node network, the causal connection networks were identified without introducing conditional GCA, and the causal connection network had regular characteristics. A causal influence existing in the 5-node network, whether direct or indirect, was more stably identified in the frequency domain. Similarly, since conditional GCA was not considered, other non-existent causal connectivities had a chance of being identified, and in the 5-node network the possibility of being misjudged was even greater. Therefore, it is necessary to introduce conditional GCA to distinguish direct from indirect influences between system components in the frequency domain. At the same time, we found that removing the noise frequency component is an obstacle to causality analysis; the main reason is that we have no prior knowledge about which component is noise in the raw data.


FIGURE 8: Mental calculation in the CSA-control state under the two stimuli (visual and auditory). The activation regions were processed with SPM12; the control state means that the subject was at rest without mental calculation. (a): CSA-control state under visual stimulus. (b): CSA-control state under auditory stimulus. (P < 0.0001, uncorrected)

FIGURE 9: The similarity in the 4-node mental calculation network and in the 6-node full network, respectively. The left panels compare the similarity for each individual subject, and the right panels compare the distributions of similarities collectively.

C. REAL DATA
1) Experimental protocol
In the study we let ten subjects perform simple one-digit (consisting of 1–10) serial addition (SSA) and complex two-digit (consisting of 1–5) serial addition (CSA) under visual stimulation and simultaneously measured their brain activities with fMRI. Immediately afterwards, we asked the subjects to perform the same serial addition arithmetic tasks under auditory stimulation. Nine right-handed healthy subjects (four female, 24 ± 1.5 years old) and one left-handed healthy female subject (24 years old) participated. One subject's data (a right-handed male) was discarded due to excessive head motion. All subjects volunteered to participate in this study and gave informal written consent.

The causal connection network of one subject performing the same task should be the same or similar, which is also the basic assumption in group analysis. Therefore, for mental calculation under different stimuli, the causal connection network of the same subject may differ at the input node of the stimulus, but we have reason to postulate that the causal connections inside the mental calculation network should be the same, or at least similar. Thus, we compare the similarity of the causal connection networks within the same mental calculation task under the different stimuli. More directly, in order to quantify which method is more robust, we compare the similarity of the resulting networks under auditory and visual stimulation for each subject. To obtain the similarity between networks, we define a measure to quantify the similarity,

S = \frac{\sum\sum (A \cap B)}{\sum\sum (A \cup B)}.   (30)

Here S represents the measured similarity, and A and B are the connection matrices of the mental calculation networks. The numerator is the sum over the intersection of the connected edges, and the denominator is the sum over the union of the connected edges.
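A small sketch of the similarity measure in Eq. (30), assuming the networks are stored as binary directed adjacency matrices (the function name is ours):

```python
import numpy as np

def network_similarity(A, B):
    """Eq. (30): shared directed edges divided by the edges in the union of two binary networks."""
    A = np.asarray(A, dtype=bool)
    B = np.asarray(B, dtype=bool)
    union = np.logical_or(A, B).sum()
    if union == 0:
        return 1.0                          # two empty networks are trivially identical
    return np.logical_and(A, B).sum() / union

# Example: two 4-node networks that share two of their edges.
A = np.zeros((4, 4), int); A[0, 1] = A[1, 2] = A[2, 3] = 1
B = np.zeros((4, 4), int); B[0, 1] = B[1, 2] = B[3, 2] = 1
print(network_similarity(A, B))             # 0.5 (2 shared edges out of 4 in the union)
```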



FIGURE 10: Causal connectivities of subject 2 in the mental calculation network under the two stimuli. The cyan node represents the visual/auditory stimulus node; the blue/red nodes relate to the mental calculation network stated in the text. (a) Statistics (visual) and (b) Statistics (auditory): mental calculation network obtained by conventional GCA. (c) MDL method (visual) and (d) MDL method (auditory): mental calculation network obtained by the proposed approach.

2) Mental calculation network
We have already verified the validity of MDL on simulated data, but the behavior on real data determines the true robustness of the methods. Firstly, the activation regions of mental calculation under the different stimuli are shown in Fig. 8; as postulated above, the activation regions for the two senses were identical. The similarities in the 4-node connection network and in the 6-node full connection network are shown in Fig. 9, respectively. In the 4-node network, we found that the causal connection networks of mental calculation under visual and auditory stimulation were very similar for most subjects, except for subject 6; for seven of the nine subjects the similarities were above 0.6. Turning to conventional GCA, we found that only three subjects had causal connection networks with a similarity above 0.6 between the two stimuli. Meanwhile, only for subject 1 and subject 9 was the similarity of the causal networks between the two senses higher than with our proposal, and even for these two subjects the similarities obtained with our proposal were close to those of conventional GCA, especially for subject 1. Further, the similarities in the 6-node full connection network are also shown in Fig. 9. Due to the difference in the input node of the stimulus, the similarities of the two methods were mostly not above 0.5, but there were still 3 subjects whose similarities under our proposal were mostly above 0.5, especially subject 7. Clearly, whether inside the mental calculation connection network or in the 6-node full connection network, the similarities identified by our proposal were higher than those of conventional GCA for most of our subjects.

The causal mental calculation networks of two subjects are then shown in Fig. 10 and Fig. 11, respectively. As stated above, the input node of the stimulus should not be included in the networks to be compared, so we removed the causal connections between the stimulus node and the other four nodes of the mental calculation network. Obviously, our proposal also showed a desirable robustness on fMRI data. As seen in Fig. 10 and Fig. 11, for subject 2 and subject 7 the mental calculation networks under the different stimuli were almost identical. Moreover, even for different subjects, the causal connection networks were similar to a large extent. As for conventional two-stage GCA, the causal connection networks of the two subjects above were more irregular, as seen in Fig. 10 and Fig. 11. Compared with the similarity between the causal networks identified by our proposal, the networks obtained by conventional GCA had almost no consistent characteristics; the networks under the different stimuli appeared to be unrelated. Conventional two-stage GCA gave inconclusive results for mental calculation. Consistent with the results in simulation, our proposal was more robust in identifying causal connection networks, especially in complex networks.



FIGURE 11: Causal connectivities of subject 7 in the mental calculation network under the two stimuli. The cyan node represents the visual/auditory stimulus node; the blue/red nodes relate to the mental calculation network stated in the text. (a) Statistics (visual) and (b) Statistics (auditory): mental calculation network obtained by conventional GCA. (c) MDL method (visual) and (d) MDL method (auditory): mental calculation network obtained by the proposed approach.

VI. CONCLUSIONS AND PERSPECTIVE
A. CONTRIBUTIONS AND DISCUSSIONS
The novelty of the present study is not in including the MDL principle in the GCA procedure, but rather in considering the MDL principle as a unifying strategy for model selection in the analysis procedure. Conventional GCA usually consists of two stages: (1) AIC/BIC for the predictors associated with internal or external information, and (2) F-statistics for evaluating the relative effects of exogenous variables. We emphasize that these two parts both fall within the scope of model selection, and that distinct theories might generate different model selection criteria. Model selection should follow the same mathematical theory throughout the GCA process.

In this paper, we have addressed this concern and have proposed a unified model selection approach based on the MDL principle for GCA in the context of the general regression model paradigm. We have demonstrated its efficacy over the conventional two-stage GCA approach in 3-node network and 5-node network synthetic experiments. All results confirm the superiority of the proposed approach over conventional two-stage GCA.
More importantly, the proposed approach obtained more consistent results in a challenging fMRI dataset, in which visual/auditory stimuli with the same presentation design give identical neural correlates of mental calculation, allowing one to evaluate the performance of different GCA methods.


This provides clear experimental proof that the unified GCA is superior to conventional GCA [87]. The results are also consistent with the current consensus that model selection is crucial when investigating causal connectivity with neuroimaging techniques [88].

Essentially, both the MDL and the statistical strategies, and indeed all model selection methods, seek a trade-off between goodness of fit and the complexity of the model involved. The MDL principle uses a code length that accounts for both the model complexity and the fitting error to achieve this trade-off, whereas the statistical strategy conveys it implicitly through the degrees of freedom of the F-distribution. In the MDL framework, all competing models can be compared directly in terms of description length, without the need for an intermediate model. In contrast, the F-statistic resorts to the extra-sum-of-squares principle to determine which model gives the better fit to the data, which requires one model (the restricted model) to be nested within another (the unrestricted model). We argue that this pairwise comparison is responsible for the inferior performance of the statistical strategy: the model selection procedure is confined to comparisons between models with a nested relationship, which impairs its performance.
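The following minimal Python sketch (illustrative only; the two-part code length L = (n/2) log(RSS/n) + (k/2) log n is one common form and is assumed here, not necessarily the exact form used in this work) contrasts the two strategies on a toy regression: the F-test scores only a nested pair of models, while the description length scores every candidate model directly.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical data: y depends on x1 only; x2 is an irrelevant regressor.
n = 200
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)
y = 0.8 * x1 + rng.standard_normal(n)

def rss(X, y):
    """Residual sum of squares of an ordinary least-squares fit."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ beta
    return float(r @ r)

candidates = {                       # all candidate regressor sets (with intercept)
    "intercept only": np.ones((n, 1)),
    "x1":             np.column_stack([np.ones(n), x1]),
    "x1 + x2":        np.column_stack([np.ones(n), x1, x2]),
}

# (1) Statistical strategy: extra-sum-of-squares F-test of the nested pair only.
rss_r, k_r = rss(candidates["x1"], y), 2
rss_u, k_u = rss(candidates["x1 + x2"], y), 3
F = ((rss_r - rss_u) / (k_u - k_r)) / (rss_u / (n - k_u))
p = 1 - stats.f.cdf(F, k_u - k_r, n - k_u)
print(f"F = {F:.3f}, p = {p:.3f}  (does adding x2 improve the nested fit?)")

# (2) Two-part MDL: every candidate receives a code length and is compared directly.
for name, X in candidates.items():
    k = X.shape[1]
    L = 0.5 * n * np.log(rss(X, y) / n) + 0.5 * k * np.log(n)
    print(f"{name:15s} description length = {L:.1f}")
# The model with the shortest description length is selected; no nesting is required.
```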
B. FUTURE WORKS
In this study we have focused on a general regression model paradigm that provides a natural route from the two-part form of MDL to GCA. As a general principle for statistical model selection, the MDL principle admits many forms of description length, corresponding to different coding schemes. These MDL forms can be viewed as imposing an adaptive penalty on model size. Although all forms achieve pointwise and minimax lower bounds on redundancy [72], further investigation is required to determine the optimal coding scheme for a given neuroimaging modality or noise level. Future animal experiments that provide intracranial recordings and fMRI measurements of the same neural network acquired synchronously in the same animal will help to guide and test the development of these forms. Moreover, GCA schemes have also been formulated in different function spaces [49], [89]. Because the MDL principle has rich connections with Bayesian statistics [72], it offers potential approaches to Granger causality representations in other function spaces. More importantly, the MDL principle allows the comparison of any two model classes in terms of code length, regardless of their forms [72]. This robust feature has the potential to accommodate representations of Granger causality generated from different model classes, so that causal relations can also be investigated across disparate function spaces.
REFERENCES
[1] K. Amunts, C. Ebell, J. Muller, M. Telefont, A. Knoll, and T. Lippert, “The human brain project: Creating a European research infrastructure to decode the human brain,” Neuron, vol. 92, no. 3, pp. 574–581, 2016.
[2] C. L. Martin and M. Chun, “The BRAIN initiative: Building, strengthening, and sustaining,” Neuron, vol. 92, no. 3, pp. 570–573, 2016.
[3] M. M. Poo, J. L. Du, N. Y. Ip, Z. Q. Xiong, B. Xu, and T. Tan, “China brain project: Basic neuroscience, brain diseases, and brain-inspired computing,” Neuron, vol. 92, no. 3, pp. 591–596, 2016.
[4] A. B. A. S. Committee et al., “Australian brain alliance,” Neuron, vol. 92, no. 3, pp. 597–600, 2016.
[5] H. Okano, E. Sasaki, T. Yamamori, A. Iriki, T. Shimogori, Y. Yamaguchi, K. Kasai, and A. Miyawaki, “Brain/MINDS: A Japanese national brain project for marmoset neuroscience,” Neuron, vol. 92, no. 3, pp. 582–590, 2016.
[6] S. J. Jeong, H. Lee, E. M. Hur, Y. Choe, J. W. Koo, J. C. Rah, K. J. Lee, H. H. Lim, W. Sun, and C. Moon, “Korea brain initiative: Integration and control of brain functions,” Neuron, vol. 92, no. 3, pp. 607–611, 2016.
[7] S. H. Strogatz, “Exploring complex networks,” Nature, vol. 410, no. 6825, p. 268, 2001.
[8] O. Sporns, D. R. Chialvo, M. Kaiser, and C. C. Hilgetag, “Organization, development and function of complex brain networks,” Trends in Cognitive Sciences, vol. 8, no. 9, pp. 418–425, 2004.
[9] E. Bullmore and O. Sporns, “Complex brain networks: graph theoretical analysis of structural and functional systems,” Nature Reviews Neuroscience, vol. 10, no. 3, pp. 186–198, 2009.
[10] S. L. Bressler and V. Menon, “Large-scale brain networks in cognition: emerging methods and principles,” Trends in Cognitive Sciences, vol. 14, no. 6, pp. 277–290, 2010.
[11] M. Rubinov and O. Sporns, “Complex network measures of brain connectivity: Uses and interpretations,” NeuroImage, vol. 52, no. 3, pp. 1059–1069, 2010.
[12] O. Sporns, “The human connectome: a complex network,” Annals of the New York Academy of Sciences, vol. 1224, no. 1, pp. 109–125, 2011.
[13] M. Siegel, T. H. Donner, and A. K. Engel, “Spectral fingerprints of large-scale neuronal interactions,” Nature Reviews Neuroscience, vol. 13, no. 2, pp. 121–134, 2012.
[14] R. M. Hutchison, T. Womelsdorf, E. A. Allen, P. A. Bandettini, V. D. Calhoun, M. Corbetta, S. Della Penna, J. H. Duyn, G. H. Glover, J. Gonzalez-Castillo, D. A. Handwerker, S. Keilholz, V. Kiviniemi, D. A. Leopold, F. de Pasquale, O. Sporns, M. Walter, and C. Chang, “Dynamic functional connectivity: Promise, issues, and interpretations,” NeuroImage, vol. 80, pp. 360–378, 2013.
[15] S. L. Valk, B. C. Bernhardt, A. Böckler, M. Trautwein, and T. Singer, “Socio-cognitive phenotypes differentially modulate large-scale structural covariance networks,” Cerebral Cortex, vol. 27, no. 2, 2017.
[16] K. E. Stephan and A. Roebroeck, “A short history of causal modeling of fMRI data,” NeuroImage, vol. 62, pp. 856–863, 2012.
[17] J. Tsunada, A. S. K. Liu, J. I. Gold, and Y. E. Cohen, “Causal contribution of primate auditory cortex to auditory perceptual decision-making,” Nature Neuroscience, vol. 19, pp. 135–142, 2016.
[18] D. Olivier, G. Isabelle, S. Sandrine, R. Sebastien, D. Colin, S. Christoph, D. Antoine, and V.-S. Pedro, “Identifying neural drivers with functional MRI: An electrophysiological validation,” PLoS Biology, vol. 6, no. 12, pp. 2683–2697, 2008.
[19] T. Van Kerkoerle, M. W. Self, B. Dagnino, M.-A. Gariel-Mathis, J. Poort, C. Van Der Togt, and P. R. Roelfsema, “Alpha and gamma oscillations characterize feedback and feedforward processing in monkey visual cortex,” Proceedings of the National Academy of Sciences, vol. 111, no. 40, pp. 14332–14341, 2014.
[20] A. Bastos, J. Vezoli, C. A. Bosman, J. M. Schoffelen, R. Oostenveld, J. R. Dowdall, P. Deweerd, H. Kennedy, and P. Fries, “Visual areas exert feedforward and feedback influences through distinct frequency channels,” Neuron, vol. 85, no. 2, pp. 390–401, 2015.
[21] A. Brovelli, M. Ding, A. Ledberg, Y. Chen, R. Nakamura, and S. L. Bressler, “Beta oscillations in a large-scale sensorimotor cortical network: directional influences revealed by Granger causality,” Proceedings of the National Academy of Sciences, vol. 101, no. 26, pp. 9849–9854, 2004.
[22] X. Wang, Y. Chen, and M. Ding, “Estimating Granger causality after stimulus onset: A cautionary note,” NeuroImage, vol. 41, no. 3, pp. 767–776, 2008.
[23] D. W. Gow Jr, J. A. Segawa, S. P. Ahlfors, and F.-H. Lin, “Lexical influences on speech perception: a Granger causality analysis of MEG and EEG source estimates,” NeuroImage, vol. 43, no. 3, pp. 614–623, 2008.
[24] M. Ploner, J. M. Schoffelen, A. Schnitzler, and J. Gross, “Functional integration within the human pain system as revealed by Granger causality,” Human Brain Mapping, vol. 30, no. 12, pp. 4025–4032, 2010.
[25] E. Florin, J. Gross, J. Pfeifer, G. R. Fink, and L. Timmermann, “The effect of filtering on Granger causality based multivariate causality measures,” NeuroImage, vol. 50, no. 2, pp. 577–588, 2010.


[26] A. Roebroeck, E. Formisano, and R. Goebel, “Mapping directed influence over the brain using Granger causality and fMRI,” NeuroImage, vol. 25, no. 1, pp. 230–242, 2005.
[27] D. Sridharan, D. J. Levitin, and V. Menon, “A critical role for the right fronto-insular cortex in switching between central-executive and default-mode networks,” Proceedings of the National Academy of Sciences, vol. 105, no. 34, pp. 12569–12574, 2008.
[28] V. Menon and L. Q. Uddin, “Saliency, switching, attention and control: A network model of insula function,” Brain Structure and Function, vol. 214, no. 5-6, pp. 655–667, 2010.
[29] S. Ryali, K. Supekar, T. Chen, and V. Menon, “Multivariate dynamical systems models for estimating causal interactions in fMRI,” NeuroImage, vol. 54, no. 2, pp. 807–823, 2011.
[30] N. Wiener, “The theory of prediction,” Modern Mathematics for Engineers, 1956.
[31] C. W. J. Granger, “Investigating causal relations by econometric models and cross-spectral methods,” Econometrica, vol. 37, no. 3, pp. 424–438, 1969.
[32] C. Granger and P. Newbold, “Spurious regressions in econometrics,” Journal of Econometrics, vol. 2, pp. 111–120, 1974.
[33] M. Chávez, J. Martinerie, and M. Le Van Quyen, “Statistical assessment of nonlinear causality: application to epileptic EEG signals,” Journal of Neuroscience Methods, vol. 124, no. 2, pp. 113–128, 2003.
[34] B. Gourévitch, R. Le Bouquin-Jeannès, and G. Faucon, “Linear and nonlinear causality between signals: methods, examples and neurophysiological applications,” Biological Cybernetics, vol. 95, no. 4, pp. 349–369, 2006.
[35] R. Goebel, A. Roebroeck, D. S. Kim, and E. Formisano, “Investigating directed cortical interactions in time-resolved fMRI data using vector autoregressive modeling and Granger causality mapping,” Magnetic Resonance Imaging, vol. 21, pp. 1251–1261, 2003.
[36] R. S. Hacker and A. Hatemi-J, “Tests for causality between integrated variables using asymptotic and bootstrap distributions: Theory and application,” Applied Economics, vol. 38, no. 13, pp. 1489–1500, 2006.
[37] M. Kaminski, Multichannel Data Analysis in Biomedical Research, 2nd ed. Pacific Grove, CA, USA: Duxbury Resource Center, 2007.
[38] A. Hatemi-J, “Asymmetric causality tests with an application,” Empirical Economics, vol. 43, no. 1, pp. 447–456, 2012.
[39] J. Geweke, “Measurement of linear dependence and feedback between multiple time series,” Journal of the American Statistical Association, vol. 77, no. 378, pp. 304–313, 1982.
[40] M. Ding, Y. Chen, and S. L. Bressler, “Granger causality: Basic theory and application to neuroscience,” Quantitative Biology, pp. 826–831, 2006.
[41] S. Guo, J. Wu, M. Ding, and J. Feng, “Uncovering interactions in the frequency domain,” PLoS Computational Biology, vol. 4, no. 5, p. e1000087, 2008.
[42] A. K. Seth, “A MATLAB toolbox for Granger causal connectivity analysis,” Journal of Neuroscience Methods, vol. 186, no. 2, pp. 262–273, 2010.
[43] L. Barnett and A. K. Seth, “The MVGC multivariate Granger causality toolbox: A new approach to Granger-causal inference,” Journal of Neuroscience Methods, vol. 223, pp. 50–68, 2014.
[44] P. A. Stokes and P. L. Purdon, “A study of problems encountered in Granger causality analysis from a neuroscience perspective,” Proceedings of the National Academy of Sciences, vol. 114, no. 34, p. E7063, 2017.
[45] L. Ning and Y. Rathi, “A dynamic regression approach for frequency-domain partial coherence and causality analysis of functional brain networks,” IEEE Transactions on Medical Imaging, vol. 37, no. 9, pp. 1957–1969, 2017.
[46] W. Liao, D. Marinazzo, Z. Pan, Q. Gong, and H. Chen, “Kernel Granger causality mapping effective connectivity on fMRI data,” IEEE Transactions on Medical Imaging, vol. 28, no. 11, pp. 1825–1835, 2009.
[47] Z. T. Chen, K. Zhang, L. W. Chan, and B. Schölkopf, “Causal discovery via reproducing kernel Hilbert space embedding,” Neural Computation, vol. 26, no. 7, pp. 1484–1517, 2014.
[48] M. E. J. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Physical Review E, vol. 69, p. 026113, 2004.
[49] D. Marinazzo, M. Pellicoro, and S. Stramaglia, “Kernel method for nonlinear Granger causality,” Physical Review Letters, vol. 100, no. 14, p. 144103, 2008.
[50] C. L. Zou, C. Ladroue, S. X. Guo, and J. F. Feng, “Identifying interactions in the time and frequency domains in local and global networks - a Granger causality approach,” BMC Bioinformatics, vol. 11, no. 337, pp. 1–17, 2010.
[51] H. Akaike, “A new look at the statistical model identification,” in Selected Papers of Hirotugu Akaike. Springer, 1974, pp. 215–222.
[52] G. Schwarz et al., “Estimating the dimension of a model,” The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.
[53] G. Deshpande, S. Laconte, G. A. James, S. Peltier, and X. Hu, “Multivariate Granger causality analysis of fMRI data,” Human Brain Mapping, vol. 30, no. 4, pp. 1361–1373, 2010.
[54] S. Guo, C. Ladroue, and J. Feng, Granger Causality: Theory and Applications. Springer, 2010.
[55] S. L. Bressler and A. K. Seth, “Wiener–Granger causality: A well established methodology,” NeuroImage, vol. 58, no. 2, pp. 323–329, 2011.
[56] X. Li, G. Marrelec, R. F. Hess, and H. Benali, “A nonlinear identification method to study effective connectivity in functional MRI,” Medical Image Analysis, vol. 14, no. 1, pp. 30–38, 2010.
[57] G. Casella and R. L. Berger, Statistical Inference, 2nd ed. Pacific Grove, CA, USA: Duxbury Resource Center, 2001.
[58] R. L. Wasserstein, N. A. Lazar et al., “The ASA's statement on p-values: context, process, and purpose,” The American Statistician, vol. 70, no. 2, pp. 129–133, 2016.
[59] R. L. Wasserstein, A. L. Schirm, and N. A. Lazar, “Moving to a world beyond ‘p < 0.05’,” 2019.
[60] V. Amrhein, S. Greenland, and B. McShane, “Scientists rise up against statistical significance,” 2019.
[61] B. B. McShane, D. Gal, A. Gelman, C. Robert, and J. L. Tackett, “Abandon statistical significance,” The American Statistician, vol. 73, no. sup1, pp. 235–245, 2019.
[62] V. Amrhein and S. Greenland, “Remove, rather than redefine, statistical significance,” Nature Human Behaviour, vol. 2, no. 1, p. 4, 2018.
[63] V. Amrhein, D. Trafimow, and S. Greenland, “Inferential statistics as descriptive statistics: There is no replication crisis if we don't expect replication,” The American Statistician, vol. 73, no. sup1, pp. 262–270, 2019.
[64] S. H. Hurlbert, R. A. Levine, and J. Utts, “Coup de grâce for a tough old bull: ‘statistically significant’ expires,” The American Statistician, vol. 73, no. sup1, pp. 352–357, 2019.
[65] X. Miao, X. Wu, R. Li, K. Chen, and L. Yao, “Altered connectivity pattern of hubs in default-mode network with Alzheimer's disease: a Granger causality modeling approach,” PLoS One, vol. 6, no. 10, p. e25546, 2011.
[66] G. Deshpande, P. Santhanam, and X. Hu, “Instantaneous and causal connectivity in resting state brain networks derived from functional MRI data,” NeuroImage, vol. 54, no. 2, pp. 1043–1052, 2011.
[67] X. F. Li, D. Coyle, L. Maguire, T. M. McGinnity, and H. Benali, “A model selection method for nonlinear system identification based fMRI effective connectivity analysis,” IEEE Transactions on Medical Imaging, vol. 30, no. 7, pp. 1365–1380, 2011.
[68] J. Rissanen, “Modeling by shortest data description,” Automatica, vol. 14, no. 5, pp. 465–471, 1978.
[69] P. G. Bryant and O. I. Cordero-Brana, “Model selection using the minimum description length principle,” The American Statistician, vol. 54, no. 4, pp. 257–268, 2000.
[70] P. D. Grünwald, I. J. Myung, and M. A. Pitt, Advances in Minimum Description Length: Theory and Applications, ser. Neural Information Processing Series. Cambridge, MA, USA: The MIT Press, 2005.
[71] M. Hansen and B. Yu, “Bridging AIC and BIC: An MDL model selection criterion,” in Workshop on Detection, Estimation, Classification, and Imaging, Santa Fe, NM, USA, 1999, pp. 24–26.
[72] M. H. Hansen and B. Yu, “Model selection and the principle of minimum description length,” Journal of the American Statistical Association, vol. 96, no. 454, pp. 746–774, 2001.
[73] M. Li, P. Vitányi et al., An Introduction to Kolmogorov Complexity and Its Applications. Springer, 2008, vol. 3.
[74] C. E. Shannon, “A mathematical theory of communication,” Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
[75] J. Rissanen, “A universal prior for integers and estimation by minimum description length,” Annals of Statistics, vol. 11, no. 2, pp. 416–431, 1983.
[76] J. Rissanen, “Universal coding, information, prediction, and estimation,” IEEE Transactions on Information Theory, vol. 30, no. 4, pp. 629–636, 1984.
[77] J. Rissanen, “Stochastic complexity and modeling,” Annals of Statistics, vol. 14, no. 3, pp. 1080–1100, 1986.
[78] J. Rissanen, “Stochastic complexity,” Journal of the Royal Statistical Society, vol. 49, no. 3, pp. 223–239, 1987.
[79] J. J. Rissanen, “Fisher information and stochastic complexity,” IEEE Transactions on Information Theory, vol. 42, no. 1, pp. 40–47, 1996.


[80] A. N. Kolmogorov, “Three approaches to the quantitative definition of information,” Problems of Information Transmission, vol. 1, no. 1, pp. 1–7, 1965.
[81] A. Kolmogorov, “Logical basis for information theory and probability theory,” IEEE Transactions on Information Theory, vol. 14, no. 5, pp. 662–664, 1968.
[82] C. S. Wallace and D. M. Boulton, “An information measure for classification,” The Computer Journal, vol. 11, no. 2, pp. 185–194, 1968.
[83] J. Rissanen, Information and Complexity in Statistical Modeling, ser. Information Science and Statistics. New York, NY, USA: Springer Verlag, 2005.
[84] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley, 1991.
[85] P. D. Grünwald, The Minimum Description Length Principle. MIT Press, 2007.
[86] J. Rissanen, Stochastic Complexity in Statistical Inquiry. World Scientific, 1989.
[87] F. Li, X. Wang, P. Shi, Q. Lin, and Z. Hu, “Neural network can be revealed with functional MRI: evidence from self-consistent experiment,” Nature Communications (submitted), 2020.
[88] P. A. Valdes-Sosa, A. Roebroeck, J. Daunizeau, and K. Friston, “Effective connectivity: influence, causality and biophysical modeling,” NeuroImage, vol. 58, no. 2, pp. 339–361, 2011.
[89] L. Angelini, M. Pellicoro, and S. Stramaglia, “Granger causality for circular variables,” Physics Letters A, vol. 373, no. 29, pp. 2467–2470, 2009.
