Unified Model Selection Approach Based on Minimum Description

Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.

Digital Object Identiﬁer 10.1109/ACCESS.2017.DOI

Uniﬁed Model Selection Approach Based on Minimum Description Length Principle in Granger Causality Analysis

FEI LI1, XUEWEI WANG1, QIANG LIN1, AND ZHENGHUI HU1 1College of Science, Zhejiang University of Technology, Hangzhou, Zhejiang 310023, China. Corresponding author: Zhenghui Hu (e-mail: [email protected]). This work is supported in part by National Key R&D Program of China under Grant 2018YFA0701400, in part by the Public Projects of Science Technology Department of Zhejiang Province under Grant LGF20H180015.

ABSTRACT Granger causality analysis (GCA) provides a powerful tool for uncovering the patterns of brain connectivity mechanism using neuroimaging techniques. In this paper, distinct from conventional two- stage GCA, we present a unified model selection approach based on the minimum description length (MDL) principle for GCA in the context of the general regression model paradigm. In comparison with conventional methods, our approach emphasize that model selection should follow a single mathematical theory during the GCA process. Under this framework, all candidate models within the model space might be compared freely in the context of the code length, without the need for an intermediate model. We illustrate its advantages over conventional two-stage GCA approach in a 3-node network and a 5-node network synthetic experiments. The unified model selection approach is capable of identifying the actual connectivity while avoiding the false influences of noise. More importantly, the proposed approach obtained more consistent results in a challenging fMRI dataset, in which visual/auditory stimulus with the same presentation design gives identical neural correlates of mental calculation, allowing one to evaluate the performance of different GCA methods. Moreover, the proposed approach has potential to accommodate other Granger causality representations in other function space. The comparison between different GC representations in different function spaces can also be naturally deal with in the framework.

INDEX TERMS Code length, granger causality analysis (GCA), minimum description length (MDL), model selection.

I. INTRODUCTION cephalography (EEG) [18]–[20], local field potentials (LFP) [20]–[22], magnetoencephalography (MEG) [23]–[25], and arXiv:1910.11537v2 [stat.ME] 19 Mar 2020 AUSAL connectivity analysis, also called effective con- functional magnetic resonance imaging (fMRI) [26]–[29] C nectivity, plays an increasingly important role in brain etc. research using neuroimaging techniques [1]–[6]. It reflects a trend in neuroscience away from focusing on individual Granger causality analysis (GCA) is an effective tool for brain unit (functional specialization) toward complex neural detecting the causal connections that can provide information circuits (functional integration) at different spatial scales about the dynamics and directionality of the associations. [7]–[9], where the integration among specialized areas is Causality in the Granger sense is based on the statistical mediated by causal connectivity [10]–[15]. Causal connec- predictability of one time series that derives from knowledge tivity refers to the influence that one neural system exerts of one or more other time series [30], [31]. For two time over another, either in the absence of identifiable behavioral series, X and Y , that are stochastic and wide-sense stationary events or in the context of task performance [16], [17]. It (i.e., constant means and variances) [32], the general idea of provides important insights into brain organization. Causal Granger causality is that variable Y is said to have a causal connectivity analysis is based on temporal relations existing influence on variable X when the prediction of variable X between time series recordings of neural activity, which may could be improved by incorporating the history information be obtained by neuroimaging techniques, such as electroen- of variable Y .

VOLUME 4, 2016 1 Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

In practical applications, the GCA is usually conveyed in selection approach for GCA with the minimum description the context of linear autoregressive (AR) models of stochastic length (MDL) principle that model selection should follow processes. The AR models with different time lags repre- a single mathematical theory during the GCA process. We sent the historical dependence on variables, with or without illustrated the benefits of introducing a unified model se- including other variables, to forecast the dynamics of the lection approach in simulated and real fMRI experiments. variable. The optimal model for predication then is chosen by Especially, we compared the proposed approach with con- model selection techniques. The extension of GCA to nonlin- ventional two-stage GCA one in an empirical fMRI dataset ear model that capture nonlinear causal relations is also avail- for the validation of causal connectivity analysis. able [33], [34]. And the validity of original GCA can also The rest of the article is organized as follows. Section II be affected if the errors after the prediction with the model introduces the basic GCA concept and discusses the potential is not normally distributed [26], [35]. Several test methods problems in the current two-stage GCA scheme and our of causality in Granger sense for the data with asymptotical motivation for introducing the MDL principle in GCA. In noise distribution have been developed that is especially Section III, the MDL principle has been illustrated in detail, useful for task-related fMRI studies [36], [37]. Asymmetric and the formula of two-part MDL also has been derived with causality testing recently has also been suggested in order to Bernoulli distribution in Markov model class. Immediately, separate the causal impacts of positive and negative changes in Section IV, the MDL guided model selection for linear [38]. Moreover, Granger causality have been also expressed model has been carried out, in time domain and frequency in other function spaces, e.g. Fourier spaces (frequency domain respectively. In Section V, We illustrate the ad- domain) [21], [39]–[45], kernel Hilbert spaces [46], [47]. vantages of our proposal over conventional two-stage GCA These methods are important in neuroscience studies since approach in a 3-node network and a 5-node network syn- the causal influences between neuronal populations are often thetic experiments. At the same time, the proposed approach nonlinear and have complicated statistics due to various obtained more consistent results in a challenge fMRI dataset sources of uncertainties. On the other hand, the original for causality investigation, mental calculation network under GCA is limited to investigate causal connectivity between visual and auditory stimulus, respectively. Section VI demon- two nodes. There have been numerous efforts on expanding strates the comparison between conventional two-stage GCA the approach from small network with few nodes to large, and our proposal from modeling standpoint, and discusses its complex network [48], [49]. A recent approach presented potential development in a wider Granger causality sense. an iterative scheme that gradually pruned the network by removing indirect connections, considered one at a time, to II. PROBLEM STATEMENT uncover large network structure with hundreds of nodes [50]. Let variables X and Y be two stochastic and stationary time Despite these developments, the basic idea of Granger series. Now consider the following pair of (restricted and causality remains unchanged. It generally comprises two unrestricted) regression models stages in this framework: (1) specifying the model order ( Pp X[n] = βx[i]X[n i] + u[n] (the number of time lags that is associated with its own i=1 − X[n] = Pp β [i]X[n i] + Pq β [i]Y [n i] + v[n], history information and external effects), and (2) deciding i=1 x i=1 y − − (1) the optimal model. The order of the predictors is usually where p and q are the model orders (the numbers of time determined with the Akaike information criterion (AIC) [51] lags) in X and Y , respectively, β and β are the model or the Bayesian information criterion (BIC) [52], whereas x y coefficients, and u and v are the residual of the models. The the optimal predictor is judged through statistical pairwise order of historical predictor p is usually determined with the comparison [53]–[56]. Specifying model order and selecting AIC [51] or the BIC [52], the optimal model comply two completely different mathematical theories. AIC = 2log(L(θ)) + 2k − Indeed, from a mathematical perspective, the two stages BIC = 2log(L(θ)) + klog(n) are all to select the best model capturing essential features − under investigation from a number of competing models in Where n is the sample size, and k is the number of parameters terms of given observations. They are essentially the same which your model estimates, and θ is the set of all parameters. problem of model selection. Model selection is the most im- L(θ) represents the likelihood of the model tested, given your portant aspect of inference with causal models and allows one data, when evaluated at maximum likelihood values of θ. to test different hypotheses by comparing different models in The conventional GCA requires statistical significance to terms of selection criteria. Two different theories applied in determine whether the unrestricted model provides a better the same question might generate different selection criteria prediction than the restricted model. The hierarchical F - and therefore degrade the performance of GCA. statistics, based on the extra sum-of-squares principle [57], We argue that the two stages in GCA are the problem can be used to evaluate significant predictability, given as of model selection, and should obey the same benchmark (RSSr RSSu)/q F = − vs F0(q, n p q 1) under one mathematical theory. In this paper, against two- RSSu/(n p q 1) − − − stage selection scheme, we therefore proposed a unify model − − − (2)

2 VOLUME 4, 2016 Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

where RSSr and RSSu represent the sum of squared resid- effect is quantified by F-statistics sequentially. It is clear uals of restricted model and unrestricted model, respectively, that the final inference might differ with rules applied, even n is the total number of observations to estimate the unre- though three classes are completely equivalent from a model stricted model. The F -statistics approximately follows an F - description standpoint. distribution with (q, n p q 1) degrees of freedom. As stated above, specifying the effects of endogenous − − − If F > F0(q, n p q 1), the variability of the residual of and exogenous information in conventional GCA is split − − − unrestricted model is significantly less than the variability of into a two-stage scheme. Specifying the regression model the residual of restricted model, then there is an improvement orders of historical information, which contains the regres- in the prediction of X due to Y , and we refer to this as causal sion of endogenous information in restricted model and the influence from Y to X. But in the current bivariate model, regression of both endogenous and exogenous information spurious connections will emerge frequently. in unrestricted model, is mainly based on AIC/BIC theory In order to remove spurious connections caused by indirect separately, then specifying the causal effect of exogenous causalities between nodes, GCA also provides a measure of information by F-statistics. However, specifying the two ef- conditional causal connection by introducing another vari- fects is the same kind of problem of model selection from able Z into Eqs. (1): the mathematical perspective. Different theories might gener-  p r ated different benchmarks in two stages for model selection, X[n] = P β [i]X[n i] + P β [i]Z[n i] + u[n]  i=1 x i=1 z therefore degrade the performance of GCA.  Pp − Pq − X[n] = i=1 βx[i]X[n i] + i=1 βy[i]Y [n i] Aside from theoretical considerations, there are still some  Pr − −  + βz[i]Z[n i] + v[n]. issues in the F-test itself to be discussed. The American i=1 − (3) Statistical Association’s (ASA) statement on statistical sig- Then, the causal influence from Y to X, conditional on Z, is nificance and p-values has led to a collective rethinking of defined as the entire scientific community, and many scientists believe var(u) Y →X|Z = ln . (4) that the application of p-values in current scientific research F var(v) has been distorted [58]–[61]. This also leads many studies Consider three model spaces , , and , where each to selectively report results, and the dichotomous p-value A B C space comprises three models. Let pi and qi, i = 1, 2, 3, is arbitrary reductionism for scientific research. We believe denote the model orders regarding its endogenous informa- that any reasonable study has its implied meaning, whether tion and the exogenous information from other variables, or not the p-value is less than 0.05. We should understand respectively. And n N, m N, the true meaning of p-values, and it should not be over- ∈ ∈ exaggerated or degraded [62]–[64]. It is just a statistical = a1 : p1 = n, q1 = 0; A { tool, or just one of the mapping relationships that is purely a2 : p2 = n + 1, q2 = 0; mathematical. Specifically, pairwise F -statistics in conven- a3 : p3 = n + 2, q3 = 0 tional GCA arouses several potential problems that might } lead to misleading or unreasonable inferences in connectivity F = b1 : p1 = n, q1 = m + 0; analysis. Firstly, the model comparison with -statistics is B { performed under a specific significance level. The assignment b : p = n, q = m + 1; 2 2 2 of significance level is a subjective matter in conventional b3 : p3 = n, q3 = m + 2 } GCA process [65], [66]. A significance level that is too low could cause the false connection noise originated, whereas = c1 : p1 = n, q1 = m; a significant level that is too high could erase the actual C { c2 : p2 = n + 1, q2 = m; connection. When there is no rule to justify the assignment of the significance level, F -statistics will lead to different con- c3 : p3 = n + 1, q3 = m + 1 . } nectivity results depending on the significance level chosen. We further assume that ai, bi, and ci, i = 1, 2, 3, have the Secondly, the selection results by pairwise F -statistics same residual variance. For space , the models only specify are heavily dependent on the initially selected model and A the regression model orders of the endogenous information the search path in model space. Consider the collection of by AIC/BIC, and then specify the causal effect of endogenous three models = A, B, C and let denote the residual M { } S information by F-statistics. The models in space specify the variance of each model with SA = S, SB = S + ∆S, and B I regression model orders of the exogenous effect by AIC/BIC, SC = S ∆S. We further assume ∆S < I, where I is − 2 ≤ then specify the causal effect of exogenous information by the interval satisfying statistical significance. The aim of F - F-statistics. But the models specify the regression model statistics is to find the model appropriate to the observations C orders of endogenous information and exogenous informa- from the model space, which is evidently model C in this tion separately, and specify the causal effect between c1, c2 case. The search path B A C will arrive at optimal → → and c3 by F-statistics. In this situation, the model selection model C, whereas if we start at A and follow the search path in conventional GCA is split into a two-stage scheme, the A C B, we will reach suboptimal model A, that → → model orders is determining by AIC/BIC and then the causal is, optimal model C is not considered. The distinct results

VOLUME 4, 2016 3 1 1

Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

S A S A S A S A

B B C C B B C C S + ∆S S + ∆S S ∆S S ∆S S + ∆S S + ∆S S ∆S S ∆S − − − − (a) Search Path:(a)(a)B Search SearchA Path: Path:CAB AC CB (b)(b) Search Search Path: Path:(b) SearchBA CA Path: BAC C B → → →→ →→ → →→ → → FIGURE 1: Consider a conceivable case involving a collection of three candidate models = A, B, C with residual I M { } variances of SA = S, SB = S + ∆S, and SC = S ∆S. Suppose ∆S < I, where I is the interval satisfying statistical − 2 ≤ significance. It is clear that model C, with the minimum variance, is the optimal model in this case. Following search path B A C (a) we can arrive at optimal model C, while if we start with A and follow the search path A C B (b), we → → → → reach suboptimal model A. The results using F -statistics for model selection rely on both the search path and the initial model. Comparing transverses within the model space can only partially reduce the risk of model misspecification, and is not always available due to the nested relation between comparable models in F -statistics. Moreover, this strategy also leads to concern about the computational complexity. obtained by using F -statistics for model selection along with by conventional GCA could be misleading, and main reason different search paths and initial models are given explicitly may be splitting model selection into a two-stage scheme. in Fig. 1. In fact, F -statistics can not discriminate the models Since all the issues we discuss can be attributed to model by residual variance within a specific range relative to the selection problems, then a more practical model selection chosen significance level and the noise level. On the other method need to be enabled. hand, F -statistics uses the extra sum of squares principle to To keep consistency in model selection for GCA, we identify a better model. This means that only models with a proposed a code length guided framework based on MDL nested relationship can be compared. Then the competitive principle, which Rissanen first proposed to quantify parsi- models perform pairwise comparison in an indirect way, mony of a model [68], to map the two-stage scheme into through intermediate model, namely unrestricted model the same model space. Specifically, our proposal involves in F -statistics. Therefore, the comparison between any two constructing a code length guided framework, then the model models is not available in practical applications, and the selection process in GCA can be converted into comparing search path relies on the structure of all candidate models. the code length of each model. That means our proposal In such situation the optimal model is not always guaranteed. incorporated the endogenous and the exogenous information The third potential problem relates to computational com- into a unified model selection process, which the information plexity. Consider the simplest case where conditional causal will be quantified by converting into code length to obtain connections are not taken into account (q = 1 and r = 0 causality. The two-stage scheme based on the two different in Eq. (3)). Suppose that there are m candidate models theories is unified under the single mathematical framework, for any one directional causality in the network with n MDL principle, which guarantees the only benchmark in nodes, the number of F -tests that needs to be performed is GCA methodology research. m(m 1)(n 1), and the total number of comparisons is And above all, the MDL is an information criterion that − − mn(m 1)(n 1). The problem on computational complex- provides a generic solution for the model selection problem − − ity is compounded by conditional GCA, but it still will be [69]. As a broad principle, the MDL represents a completely intractable while investigating large network [50]. different approach for model selection relative to traditional Although different approaches might be applied for model statistical approaches, such as F -statistics and the AIC or selection in GCA [37], [67], the two stages are kept un- BIC. Compared with the AIC/BIC, the use of the MDL changed. They are generally based on two different math- does not require any assumptions about the data generation ematical theories in most GCA applications. However, as mechanism. In particular, a prior probability distribution does mentioned above, determining the model orders by AIC/BIC not need to be assigned to each model class. The objective of or quantifying the causal effects by F-statistics can essen- model selection in the MDL is not to estimate an assumed tially be considered as a generalized model selection problem but unknown distribution, but to find models that more real- from a modeling standpoint. And model selection by a single istically represent the data [70]–[72]. mathematical principle throughout GCA, which could be Fundamentally, MDL has intellectual roots in the algo- easily ignored, would determine the final result working rithmic or descriptive complexity theory of Kolmogorov, pattern of human brain. Thus the causal connectivity obtained Chaitin, and Solomonoff [73]. Only considering the prob-

4 VOLUME 4, 2016 Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS ability distribution as a basis for generating descriptions, By Shannon’s Source Coding Theorem, for any prefix code Rissanen endowed MDL with a rich information-theoretic % on with length function L, the expected L% is bounded A interpretation [74]–[79]. Due to these characteristics, and it’s below by H(P ), the entropy of P . That is reasonable to believe human brain meet minimum energy X L% H(P ) = P (x) log P (x), (6) principle, causality analyzed with help of MDL principle may ≥ − 2 be more in line with physiological models of the brain. On the x⊂A whole, our proposal takes the MDL principle as the single where equality holds if and only if L = log P , in other − 2 mathematical framework to select the generalized model for words, the expected code length is minimized when Q = P . GCA, which to ensure the consistency, objectivity and the parsimony. B. CRUDE TWO-PART CODE MDL In our view, modeling is a process that find the regularity III. THE MINIMUM DESCRIPTION LENGTH PRINCIPLE of data and compress it. In model selection within MDL The principle of parsimony is the soul of model selection. principle, What we have to do is selecting a suitable model To implement the parsimony principle, one must quantify based on the probability distribution of the object. Generally, parsimony of a model relative to the available data. With the model we picked is overfitting or too simple. But model help of the work of Kolmogorov [80], [81], Wallace and selection guided by the MDL principle, the complexity term Boulton [82], Rissenan formulated MDL as a broad principle or the error term in data fitting is incorporated into code governing statistical modeling in general. At beginning of length guided framework, which ensures the objectivity of modeling, all we have is only the data. Luckily, With the the operation. help of the algorithmic or description complexity theory of Until now, there are several forms of MDL principle to Kolmogorov, Chaitin, and Solomonoff, MDL regards a prob- polynomial or other types of hypothesis and model selection. ability distribution as a descriptive standpoint. And MDL But at the original MDL, it usually divides the modeling for has some connections with AIC and BIC, sometimes it be- the data set into two parts, one part is to describe the model’s haves similarly to AIC and BIC [72], [83]. The difference self-information. The other is to describe the data set with the is that MDL fixes attention on the length function rather help of chosen probability model in part one. Consequently, than the code system. Therefore, many kinds of probability here we firstly introduce the most common implementation distribution can be compared in terms of their descriptive of the idea – the two part code version of MDL [72], [85]. power [84]. Our code length guided framework can be used in Suppose the data D n where = 0, 1 . Then there ∈ X X { } generalized model selection as long as there is a probability is a probability P , and minimize distribution in the model. ∈ M L1,2(P,D) = L1(D P ) + L2(P ). (7) | A. PROBABILITY AND IDEALIZED CODE LENGTH Here, it will select a reasonable model for D to make good In order to describe the MDL principle explicitly, we deduced predictions of future data coming from the same source, the formula of MDL in different cases. In model selection which therefore models the data using the class of all B process of MDL, we need to compare the code length ob- Markov chains [85]. tained by its probability distribution, so it is essential to understand the relationship between probability and the code a: The first part length [72], [85]. To get a better feel for the code L1, we prepare to consider A code % on a set is simply a mapping from to a set (1) A A two examples. First, let Pθ B be some Bernoulli of codewords. Let be a finite binary set and let Q denote ∈ A distribution with Pθ(x = 1) = θ, and let D = (x1, , xN ). a probability distribution of the any element a in . The Q ˆ ··· A Since Pθ(D) = Pθ(xi) and θ is equal to the frequency of code length of a is that log2 Q, the negative logarithm of − 1 in D, the first part L1(D P ) is given as Q. For example, the Huffman code is one of the algorithms | that constructed this relationship between probability and log Pθ(D) = n1 log θ n0 log(1 θ) − − − − (8) idealized code length [84]. Suppose that elements of are ˆ ˆ A = N[θ log θ + (1 θ) log(1 θ)], generated according a known distribution P . Given a code % − − − on with length function L, the expected code length of % where nj denotes the number of occurrences of symbol j in A γ with respect to P is defined as D. Let k = 2 , the γth-order Markov chain model is denoted by (k), it’s defined as X B L% = P (x)L(x). (5) (k) (k) (k) k = Pθ θ Θ ; Θ = [0, 1] . x⊂A B { | ∈ } As is well known, if % is prefix code, the code length L Where θ = (η[1 0 ... 0], η[1 0 ... 01], . . . , η[1 0 ... 11]), for | | | is equivalent to log Q(x) for some distribution Q on . all n, xn − 2 A There is given an arbitrary code, if no codeword is the prefix N of any other, the unique decodability is guaranteed. Any code 1 γ Y Pθ(D) = ( ) Pθ(xi xi−1, , xi−γ ) (9) 2 | ··· satisfying this codeword condition is called a prefix code. i=γ+1

VOLUME 4, 2016 5 Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS and And the parameter vector consists of data θ = (n, ξ) and 2 2 X ξ = (σ , a1, , an), where σ = ξ0 is the variance- log Pθ(D) = N (ˆη[1|y] log η[1 y] ··· parameter of zero-mean Gaussian distribution model for t. − − log k | y∈{0,1} (10) And t = 1, , m, which m can be anyone more than n +(1 ηˆ ) log(1 η )) + γ. ··· − [1|y] − [1|y] to keep the solution determined, N is the data length. In order to describe xt, turn to Gaussian distribution for t. Here γ = log k is the number of bits needed to encode the Pm 2 first γ outcomes in D. The maximum likelihood parameters And RSS = t=1(t ) denotes the residual sum of squares corresponding to the estimation in the model. Clearly there is ηˆ[1|y] are equal to the conditional frequencies of 1 prefixed by y. a Gaussian distribution model, applying the two part form of MDL, the total code length is given as b: The second part n RSS X ξi L(x, θ) = m ln √2πσ + + ln | | + ln(n + 1). In order to describe a P , we have to describe a pair 2σ2 δ ∈ B i=0 (k, θ). We encode the all parameter in the distribution model (15) by firstly encoding k, which will use some prefix code C2a, Where δ is the precision, and it’s optimal to choose 1/√N and then code θ with the help of the prefix code C2a. The |ξi| [72], [75], [86]. Specially, δ < 1 should be ignored. resulting code C2(the code length of the first part) is then defined by C2(k, θ) = C2a(k)C2b(θ k). Firstly, N is nature | A. TIME-DOMAIN FORMULATION number set, Combining with the above formula, the causality investi- L2a(k) = LN (k) = O(log k). (11) gation with the code length guided framework in the time domain can be carried out. There are two time-series XN and (k) k Since Θ = [0, 1] is an uncountable infinite set, we YN , then we consider two different models A and B (in Eq. would have to discretize it firstly. More precisely, we will (16) and Eq. (17)) to describe xt. The representations are (k) restrict Θ(k) to some finite set Θ¨ of parameters that can be d ( Pn described using a finite precision of d bits per parameter. Xt = a1iXt−j + 1t j−1 (16) ¨ (k) d k Pn Now that the number of elements of Θd is (2 ) . Thus Yt = j−1 d1iYt−j + η1t it needs log(2d)k = k d bits to describe any particular ¨ (k) · where var(1t) = Σ1 and var(η1t) = Γ1. Bivariate regres- ϑ Θd , and may call d the precision used to encode a ∈ (k) sive representations are given, parameter. Letting dϑ be the smallest d such that θ Θ¨ , ∈ d this gives ( Pn Pn Xt = a2iXt−j + b2iYt−j + 2t j−1 j−1 (17) ( Y = Pn c X + Pn d Y + η LN (k) + LN (dϑ) + kdϑ, if dϑ < t j−1 2i t−j j−1 2i t−j 2t L2(k, θ) = ∞ (12) , otherwise. ∞ where var(2t) = Σ2 and var(η2t) = Γ2, and their contem- poraneous covariance matrix is At the end, the crude two-part code MDL for Markov chain hypothesis selection is given by, Σ2 Υ2 Σ = where Υ2 = cov(2t, η2t). Υ2 Γ2 min log Pk,d(D) + kd + LN (k) + LN (d). k,d∈N ;θ∈Θ¨ (k) − d Finally, 1t and 2t have Gaussian distribution with mean 0 (13) and unknown variance σ2, which denote the noise of time- series are fitting residual. Therefore, the distribution of resid- IV. CODE LENGTH GUIDED MODEL SELECTION IN ual terms t can be a standpoint to describe the model within CAUSALITY ANALYSIS MDL. Then, the code length of model A and B we obtained As stated above, conventional GCA splits the whole process can be compared to identify the causal influence between xt into two stages which are actually modeling endogenous and yt. In the whole process of model selection for GCA, information and exogenous information. Since MDL has a the causality investigation was mapped into the unified code close relationship with information theory, causality analysis length guided framework. According to Eq.(15), the code with MDL here will also be more convincing and suitable. length of model A and model B can be given respectively. Further the regression of endogenous information can be By the definition of Granger causality, the influence from Y converted to code length, and relative effect of exogenous to X is defined by our code length guided framework, variables can be also quantified by the code length guided framework, which the whole model selection process for FY →X = LX LX+Y . (18) − GCA is unified into same model space. The following is that MDL principle guided model selec- Where LX denotes the code length of optimal model in Eq.(16), and LX+Y denotes the code length of optimal model tion in causality analysis [68], and variable XN is given, in Eq.(17). If FY →X > 0, it means causal influence from Y xt = a1xt−1 + a2xt−2 + + anxt−n + t. (14) to X existed. Otherwise, there is no causal influence existed ··· 6 VOLUME 4, 2016 Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

from Y to X. As the causality represented above, our pro- where a2(0) = 1, b2(0) = 0, c2(0) = 0, d2(0) = 1, the posal can unify two-stage scheme into the code length guided lag operator L denotes LXt = Xt−1. Performing Fourier framework, which can avoid inconsistency of two different transform on both sides of Eq.(23), then we left-multiply mathematical theories or the subjectivity of F-statistics in 1 0 model selection of the conventional GCA. P = (24) Υ2 1 To compare conditional GCA, we consider the influence − Σ2 from Y to X while controlling for the effect from conditional on both sides and rewrite the result equation, the normalized node Z to X. Firstly, the joint autoregressive representation equations yield is given Xω D11(ω) D12(ω) E2(ω) ( n n = 0 , (25) P P Yω D21(ω) D22(ω) H (ω) Xt = j−1 a3iXt−j + j−1 b3iZt−j + 3t 2 Pn Pn Pn Xt = a4iXt−j + b4iYt−j + c4iZt−j + 4t 0 Υ2 j−1 j−1 j−1 where H2(ω) = H2(ω) E2(ω). The spectral matrix is − Σ2 (19) S(ω) = D(ω)ΣD∗(ω), where denotes complex conjugate and var(3t) = Σ3, var(4t) = Σ3. By the definition of ∗ and matrix transpose. The spectrum of Xt is conditional GCA, if FY →X > 0 existed, causal influence ∗ 0 ∗ from Y to X conditioned Z is defined S11(ω) = D11(ω)Σ2D (ω) D12(ω)Υ D (ω), (26) 11 − 2 12 0 Υ2 F = LX+Z LX+Y +Z . (20) where Υ2 = Γ2 Υ2. The first term in Eq.(26) is Y →X|Z − Σ2 − represented as the intrinsic influence and the second term as Same as above, if FZ→X > 0 existed, causal influence from the causal influence of Xt due to Yt at frequency ω. Based Z to X conditioned Y is given on this transformation, the causal influence from Yt to Xt at frequency ω is F = LX+Y LX+Y +Z (21) Z→X|Y − Sxx(ω) Clearly, in our code length guided framework, all candi- fY −→X (ω) = ln ∗ . (27) D11(ω)Σ2D11(ω) date models can be compared freely in the context of their code length. Different from the traditional method, the causal The model orders of historical information in frequency connection is obtained by repeated pairwise comparison be- domain for conventional GCA, are determined by AIC/BIC. tween models, our method can map all candidate models into Distinct from conventional GCA, we obtained the causal the same model space without the repeating comparison and connectivities between nodes at frequency ω by the code to obtain the conditional causal influence directly. Which is, length guided framework. if both FY →X > 0 and FZ→X > 0 existed, V. EXPERIMENTS FY,Z→X = min(LX+Y ,LX+Z ) LX+Y +Z . (22) A. 3-NODE NETWORK SIMULATION EXPERIMENTS − 1) Protocol for 3-node Network Here if F > 0 existed, it means that both Y and Y,Z→X To verify the performance of MDL principle in synthetic data Z have direct influence on X. But if F is less than Y,Z→X experiment, a simple 3-node network is enabled. There were 0, there will be two cases. One is FY,Z→X = (LX+Y − 1000 data points in time series of each node, to keep the L ) < 0 existed, it means only Y has direct influence X+Y +Z stationarity of synthetic data, the first 700 data points were on X. The other is FY,Z→X = (LX+Z LX+Y +Z ) < 0 − removed. The initial value of each node was 1, the variance existed, it means that Z impacts X directly. In the unified of noise term (i = 1, 2, 3) varied from 0.15 to 0.35. The model space, multiple selected models can be directly com- i nodes was generated by pared by code length, which can release the complexity of  the algorithm. In this way, our proposal is more in line with y1,t = 1.5y1,t−1 0.9y1,t−2 + 1 Occam’s razor, or the principle of parsimony.  − y2,t = 0.8y1,t−1 + 0.2y2,t−1 + 2 (28)  y3,t = 0.8y1,t−1 + 0.4y3,t−1 + 3 B. FREQUENCY-DOMAIN FORMULATION − With help of Geweke’s work [39], the total interdependence 2) Results from 3-node Network X Y between two time series t and t can be decomposed into Firstly, to ensure the rigor of the proposal, the distribution three components: two directional causal influences due to of residual between the selected model and the observed their interaction patterns, and the instantaneous influence due data was verified. It was almost corresponding Gaussian (X,Y ) to factors possibly exogenous to the system (e.g. distribution with mean 0, seen in Fig. 2a and 2b, which guar- a common driving input) [41], [54]. Here other forms of anteed the validity for our initial description standpoint. And regressive representations need to be considered, We first code length changed with different time lag in autoregressive rewrite Eq.(16) and Eq.(17) model had shown in Fig. 2c, it dropped down to the minimum a (L) b (L) X when time lag was 2, which corresponded with generated 2 2 t = 2t (23) c2(L) d2(L) Yt η2t model in Eq.(28).

VOLUME 4, 2016 7 1 1

Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

40 448 0.995 35 0.99 446 0.95 30 0.9 444 25 0.75

20 0.5 442 Probability Frequency 15 0.25 Code Length 440 0.1 10 0.05 438 5 0.01 0.005 0 436 -3 -2 -1 0 1 2 3 -3 -2 -1 0 1 2 3 1 2 3 4 5 6 7 8 9 10 Data Data Time Lag (a) (b) (c) FIGURE 2: The ﬁtting Guassian distribution of the residual. (a): a histogram of values in data and ﬁts a normal density function. (b): comparing the distribution of the data to the normal distribution. (c): code length obtained by our proposal, changed with different time lag in node 1.

1 1 1 1

0.047 1 0.0470.0471 1 0.047 1 0.008 1 0.0090.0081 1 0.009 1

0.056 0.056 0.009 0.009 2 2 3 3 2 2 3 3 0.049 0.049 0.01 0.01 (a) F-statistics (P <(a)(a)0. F-statistics05) F-statistics (P < 0..05) (b)(b) MDL MDL method (b) MDL method FIGURE 3: Comparison between two methods within low variance noise(0.15-0.35), the numbers on the arrow indicated the accuracy that the causal influence was identified in 1000 samples. Specially, the accuracy was only represent the probability that the single connection direction was identified by two methods.(a):causal network obtained by conventional GCA. (b):causal network obtained by our proposal.

TABLE 1: Comparison between conventional GCA and MDL method in 3-node network

Accuracy(%) Noise level Method Node 1 Node 2 Node 3 Total α = 0.1 82.3 71.8 72.1 65.4 F-statistics α = 0.05 90.9 85.5 85.5 82 Low α = 0.01 98.1 96 96.2 95.2 MDL 98.4 97.3 97.2 96.5 α = 0.1 81.4 72.5 71.8 64.8 F-statistics α = 0.05 89.4 84.6 84.7 79.9 Moderate α = 0.01 97.6 96.2 96.3 95.1 MDL 98.5 97.7 97.4 96.8 α = 0.1 80.3 72 71.7 64.5 F-statistics α = 0.05 89.5 83.6 83.7 79.1 High α = 0.01 98 96.7 96.7 95.7 MDL 98.8 98 97.6 97.2 The variance of the low noise level data ranges from 1.5 to 3.5, and the moderate(high) level data ranges from 2.5 to 4.5(3.5-5.5). 1 α denotes the significance level of F- test in conventional GCA. The accuracy of node− i(i=1,2,3) only measures the causal connectivities with node i, not including the connections between the other two nodes. And the total accuracy represents the accuracy that the true model is found, it means only quantifies the causal influence from node 1 to node 2 and node 3.

Then, the causal connectivities identified by our proposal 1 to node 2 and node 3 were identified both in GCA and had shown in Fig. 3b, which obtained causal connection our proposal. In other four causal influences, our proposal by comparing the code lengths according to Eq. (15), then almost guaranteed 99% accuracy. But conventional GCA the causal connectivities obtained by conventional GCA had only guaranteed 95% accuracy. shown in Fig. 3a. We found that causal influences from node At the same time, experiments with three different noise

8 VOLUME 4, 2016 Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

1->2 2->1 3->1

1 1 1

0.8 0.8 0.8

0.6 0.6 0.6 Causality Causality Causality 0.4 0.4 0.4

0.2 0.2 0.2

0 0 0 1 10 100 1 10 100 1 10 100 Frequency Frequency Frequency 1->3 2->3 3->2

1 1 1

0.8 0.8 0.8

0.6 0.6 0.6 Causality Causality Causality 0.4 0.4 0.4

0.2 0.2 0.2

0 0 0 1 10 100 1 10 100 1 10 100 Frequency Frequency Frequency FIGURE 4: Causal connectivities between 3-node network obtained in frequency domain. Here we identified the causal influence at frequency ω range from 1 30, and at 50,100Hz. ∼ levels and three confidence intervals had been carried out, did not introduce conditional GCA in the frequency domain. the comparison between conventional GCA and our proposal This is mainly because there seems to be still some obfusca- showed in Table. 1. The accuracy of causal connection asso- tion with the method of using conditional GCA concept in the ciated with each node and the overall true model was also frequency domain. Therefore, as shown in Fig. 4, some non- verified. Comparing the accuracy of the single node and existent connections between nodes were often misjudged, the overall network at the same noise level and the same except for the causal influence 1 2 and 1 3. But we → → confidence interval, it’s unevenly distributed in conventional found that causal connectivities between node 2 and node 3 GCA. Meanwhile, the conventional GCA were very sensitive had a bigger chance to be misjudged at low frequency (0 10 − due to the different confidence intervals, especially in the Hz). Actually, comparing results obtained by conventional identification of the overall network, even it had a stable per- GCA in the time domain, causalities between two nodes were formance at different noise level. In generally, our proposal more legible in frequency domain, which meant only direct always showed a relatively good performance whether it’s causalities showed more consistent results in our analysis. identification in the whole network or a single node. It was For example, there were only stable and significant causal worth noting that the results obtained by conventional GCA influence existed in 1 2 and 1 3. → → when α = 0.01 in F-statistics were close to the results of our proposal. The results were also in line with our expectations, B. 5-NODE NETWORK SIMULATION EXPERIMENTS that was because our proposal consider the complexity of the 1) Protocol for 5-node network model more thoughtfully. As we emphasized above, there is To verified in more detail whether the proposed MDL method only one mathematical principle guiding the model selection is just a conventional GCA method with a higher level of throughout the GCA process, thus our proposal is a more confidence, a complex 5-node network was given to further rigorous approach or a more robust approach. verified the robustness and the validity of our proposal, seen The oscillations in neurophysiological systems and neu- in Fig. 5. Noise terms i(i = 1, 2, ..., 5) were the Guassian roscience data are thought to constrain and organize neural distribution with mean 0, and the variance ranged from 0.15 activity within and between functional networks across a to 0.3. The first two initial values of nodes are 1, they were wide range of temporal and spatial scales. Geweke-Granger given by causality demonstrated that the oscillations at specific fre-  quencies had been associated with the activation or inactiva- x = 0.792x 0.278x +  1,t 1,t−1 1,t−2 1 tion of different encephalic region. But for conventional GCA  − x2,t = 0.768x2,t−1 0.503x2,t−2 + 0.83x1,t−1 in frequency domain, model selection of history information  −  0.32x1,t−2 + 2 is determined by AIC or BIC. For our code length guided −  framework, the selected models for history information will x3,t = 0.67x3,t−1 0.312x3,t−2 + 0.56x2,t−1  − regress into the true model space automatically. Due to its  0.42x2,t−2 + 3 − (29) intellectual roots in descriptive complexity and close tie with x = 0.733x 0.27x + 0.72x  4,t 4,t−1 4,t−2 2,t−1 information theory, our proposal may be more capable to  −  0.27x2,t−2 + 0.52x3,t−1 0, 456x3,t−2 identified causality in frequency domain. − −  +0.76x5,t−1 0.33x5,t−2 + 4  − Same as time domain, causal connectivities at frequency  x5,t = 0.845x5,t−1 0.24x5,t−2 + 0.68x4,t−1 ω between 3-node network obtained by our proposal showed  −  0.254x4,t−2 + 5 in Fig. 4. In particular, causal connectivities in frequency do- − main were obtained within two nodes, which meant that we

VOLUME 4, 2016 9 1

Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

2 4 5

1 3

FIGURE 5: The relationships of simulation data sets in 5-node network. Fig. 1: The relationships of simulation data sets in 5-node network.

160 500 140

400 120

2->3 100 1->2 2->4 300 4->5 1->2 2->4 80 F value 2->5 3->4 Code length 200 60 3->5 5->4 2->3 3->5 40 2->5 4->5 5->4 100 3->4 1->3 1->4 1->3 1->4 20 5->3 5->3 2->1 4->4 1->5 3->2 4->1 4->3 5->1 3->1 3->2 4->1 4->2 5->1 5->2 2->1 3->1 4->2 5->2 5->5 1->1 1->5 2->2 3->3 4->3 5->5 1->1 2->2 3->3 4->4 0 0

Edges Edges (a) F-statistics (P < 0.05) (b) MDL method FIGURE 6: Causal connectivities between 5 nodes identiﬁed by conventional GCA and our proposal respectively. (a): causal connection network was obtained by conventional GCA at α = 0.05 in F-test. (b): causal connection network obtained by our proposal.

TABLE 2: 5-NODE NETWORKS Edge F-statistics(α = 0.05) (α = 0.01) MDL method 1 2 1000/1000 (1000/1000) 1000/1000 2 −→ 3 1000/1000 (1000/1000) 1000/1000 2 −→ 4 1000/1000 (1000/1000) 1000/1000 true connections 3 −→ 4 507/1000 (456/1000) 1000/1000 4 −→ 5 1000/1000 (1000/1000) 1000/1000 5 −→ 4 929/1000 (780/1000) 1000/1000 −→ 1 3 36/1000 (8/1000) 4/1000 1 −→ 4 43/1000 (1/1000) 1/1000 1 −→ 5 11/1000 (0/1000) 0/1000 2 −→ 1 54/1000 (12/1000) 15/1000 2 −→ 5 58/1000 (14/1000) 15/1000 3 −→ 1 44/1000 (17/1000) 7/1000 3 −→ 2 15/1000 (2/1000) 2/1000 false connections 3 −→ 5 83/1000 (29/1000) 5/1000 4 −→ 1 30/1000 (8/1000) 14/1000 4 −→ 2 6/1000 (0/1000) 2/1000 4 −→ 3 5/1000 (1/1000) 1/1000 5 −→ 1 53/1000 (18/1000) 16/1000 5 −→ 2 17/1000 (1/1000) 2/1000 5 −→ 3 41/1000 (8/1000) 11/1000 −→ Causal connectivities between nodes in 5-node network within low variance noise(0.15-0.3). True connection represents the causal inﬂuence truly existed in Fig. 5, false connection means not existed.

2) Results from 5-node Network Fig. 6a. Same as results in 3-node network, our proposal showed 100% consistency with connections in Fig. 5 be- The result of connections between 5 nodes showed in Fig. 6b, tween directly causal related nodes, seen in Table. 2. And causal inﬂuence analyzed by code length was largely consis- in other connections in 5-node network, the accuracy of our tent with connections in Fig. 5. Simultaneously, causalities proposal also was not below 98.4%. Whereas, conventional analyzed by conventional GCA between 5 nodes showed in

10 VOLUME 4, 2016 Author et al.: Preparation of Papers for IEEE TRANSACTIONS and JOURNALS

1->2 1->3 1->4 1->5

1 1 1 1