<<

bioRxiv preprint doi: https://doi.org/10.1101/2020.06.16.154377; this version posted June 16, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

Estimating Transfer Entropy in Continuous Time Between Neural Spike Trains or Other Event-Based Data

David P. Shorten,^1 Richard E. Spinney,^{1,2} and Joseph T. Lizier^1

^1 Complex Systems Research Group and Centre for Complex Systems, Faculty of Engineering, The University of Sydney, NSW 2006, Australia

^2 School of Physics and EMBL Australia Node Single Molecule Science, School of Medicine, The University of New South Wales, Sydney, Australia


Abstract: Transfer entropy (TE) is a widely used measure of directed information flows in a number of domains including neuroscience. Many real-world time series for which we are interested in information flows come in the form of (near) instantaneous events occurring over time, including the spiking of biological neurons, trades on stock markets and posts to social media. However, there exist severe limitations to the current approach to TE estimation on such event-based data via discretising the time series into time bins: it is not consistent, has high bias, converges slowly and cannot simultaneously capture relationships that occur with very fine time precision as well as those that occur over long time intervals. Building on recent work which derived a theoretical framework for TE in continuous time, we present an estimation framework for TE on event-based data and develop a k-nearest-neighbours estimator within this framework. This estimator is provably consistent, has favourable bias properties and converges orders of magnitude more quickly than the discrete-time estimator on synthetic examples. We also develop a local permutation scheme for generating null surrogate time series to test for the statistical significance of the TE and, as such, test for the conditional independence between the history of one point process and the updates of another, signifying the lack of a causal connection under certain weak assumptions. Our approach is capable of detecting conditional independence or otherwise even in the presence of strong pairwise time-directed correlations. The power of this approach is further demonstrated on the inference of the connectivity of biophysical models of a spiking neural circuit inspired by the pyloric circuit of the crustacean stomatogastric ganglion, succeeding where previous related estimators have failed.

AUTHOR SUMMARY

Transfer Entropy (TE) is an information-theoretic measure commonly used in neuroscience to measure the directed statistical dependence between a source and a target time series, possibly also conditioned on other processes. Along with measuring information flows, it is used for the inference of directed functional and effective networks from time series data. The currently-used technique for estimating TE on neural spike trains first time-discretises the data and then applies a straightforward or "plug-in" information-theoretic estimation procedure. This approach has numerous drawbacks: it is very biased, it cannot capture relationships occurring on both fine and large timescales simultaneously, converges very slowly as more data is obtained, and indeed does not even converge to the correct value. We present a new estimator for TE which operates in continuous time, demonstrating via application to synthetic examples that it addresses these problems, and can reliably differentiate statistically significant flows from (conditionally) independent spike trains. Further, we also apply it to more biologically-realistic spike trains obtained from a biophysical model of the pyloric circuit of the crustacean stomatogastric ganglion; our correct inference of the underlying connection structure here provides an important validation for our approach where similar methods have previously failed.

FIG. 1: Constructing history embeddings for a source Y and target X: the discrete-time approach (left) bins the processes in time, whereas the continuous-time approach (right) performs no time binning, with history embeddings constructed from the raw event times.

I. INTRODUCTION

In analysing time series data from complex dynamical systems, such as in neuroscience, it is often useful to have a notion of information flow. We intuitively describe the activities of brains in terms of such information flows: for instance, information from the visual world must flow to the visual cortex where it will be encoded [1]. Conversely, information coded in the motor cortex must flow to muscles where it will be enacted [2].

Transfer entropy (TE) [3, 4] has become a widely accepted measure of such flows. It is defined as the conditional mutual information between the past of a source time-series process and the present state of a target process, conditioned on the past of the target. More specifically (in discrete time), the transfer entropy rate [5] is:

$$\dot{T}_{Y \to X} = \lim_{\Delta t \to 0} \frac{1}{N_T \Delta t} \sum_{t=1}^{N_T} \ln \frac{p\left(x_t \mid \mathbf{x}_{<t}, \mathbf{y}_{<t}\right)}{p\left(x_t \mid \mathbf{x}_{<t}\right)}. \tag{1}$$

Here the information flow is being measured from a source process Y to a target X, I(· ; · | ·) is the conditional mutual information [6], x_t is the current state of the target, x_{<t} is the history of the target, y_{<t} is the history of the source, ∆t is the spacing between samples (in units of time), τ is the length of the time series and N_T = τ/∆t is the number of time samples. The histories x_{<t} and y_{<t} are embedding vectors of the m and l most recent samples of the target and source, respectively. The limit ∆t → 0 is taken in order to ensure convergence in the limit of small time bin size.

It is also possible to condition the TE on additional processes [4]. Given additional processes Z = {Z_1, Z_2, ..., Z_{n_Z}} with histories Z_{<t} = {Z_{1,<t}, Z_{2,<t}, ..., Z_{n_Z,<t}}, we can write the conditional TE rate as:

$$\dot{T}_{Y \to X \mid Z} = \lim_{\Delta t \to 0} \frac{1}{\Delta t} I\left(X_t \,;\, \mathbf{Y}_{<t} \mid \mathbf{X}_{<t}, \mathbf{Z}_{<t}\right). \tag{2}$$

When combined with a suitable statistical significance test, the conditional TE can be used to show that the present state of X is conditionally independent of the past of Y, conditional on the past of X and of the conditional processes Z. Such conditional independence implies the absence of an active causal connection between X and Y under certain weak assumptions [7, 8], and TE is widely used for inferring directed functional and effective network models [9–12] (and see [4, Sec. 7.2]).

TE has received widespread application in neuroscience [13, 14]. Uses have included network inference as above, as well as measuring the direction and magnitude of information flows [15, 16] and determining transmission delays [17]. Such applications have been performed using data from diverse sources such as MEG [18, 19], EEG [20], fMRI [21], electrode arrays [22], calcium imaging [11] and simulations [23].

Previous applications of TE to spike trains [22, 24–34] and other types of event-based data [35], including for the purpose of network inference [11, 36, 37], have relied on time discretisation. As shown in Fig. 1, the time series is divided into small bins of width ∆t. The value of a sample for each bin could then be assigned as either binary, denoting the presence or absence of events (spikes) in the bin, or a natural number denoting the number of events (spikes) that fell within the bin. A choice is made as to the number of time bins, l and m, to include in the source and target history embeddings y_{<t} and x_{<t}. This results in a finite number of possible history embeddings. For a given combination x_t, x_{<t} and y_{<t}, the probability of the target value conditioned on the histories, p(x_t | x_{<t}, y_{<t}), can be directly estimated using the so-called plug-in (histogram) estimator. The probability of the target value conditioned on the target history alone, p(x_t | x_{<t}), can be estimated in the same fashion. From these estimates the TE can then be calculated as per (1). (See Results for examples of the application of the discrete-time TE estimator to synthetic examples including spiking events from simulations of model neurons.)
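The plug-in procedure just described can be sketched as follows. This is our own minimal illustration, not the authors' implementation: the function name, the binarised bin values and the counting scheme are our choices. Histories are read off as tuples of past bin values, and the conditional probabilities are estimated by direct counting.

```python
import numpy as np
from collections import Counter

def plugin_te_rate(source, target, dt, l, m, t_max):
    """Plug-in (histogram) discrete-time TE rate estimate, in nats per unit time.

    source, target: arrays of event (spike) times; dt: bin width;
    l, m: number of source/target history bins; t_max: recording length.
    """
    bins = np.arange(0.0, t_max + dt, dt)
    # binarise: 1 if the bin contains at least one spike, else 0
    x = np.histogram(target, bins)[0].clip(max=1)
    y = np.histogram(source, bins)[0].clip(max=1)

    joint = Counter()  # counts of (x_t, target history, source history)
    cond = Counter()   # counts of (x_t, target history)
    start = max(l, m)
    for t in range(start, len(x)):
        hx = tuple(x[t - m:t])
        hy = tuple(y[t - l:t])
        joint[(x[t], hx, hy)] += 1
        cond[(x[t], hx)] += 1
    n = len(x) - start

    # marginal counts of the history configurations
    hist_xy = Counter()
    hist_x = Counter()
    for (xt, hx, hy), c in joint.items():
        hist_xy[(hx, hy)] += c
    for (xt, hx), c in cond.items():
        hist_x[hx] += c

    # TE = (1/(n*dt)) * sum over samples of ln[ p(x_t|hx,hy) / p(x_t|hx) ]
    te = 0.0
    for (xt, hx, hy), c in joint.items():
        p_joint = c / hist_xy[(hx, hy)]       # estimate of p(x_t | hx, hy)
        p_cond = cond[(xt, hx)] / hist_x[hx]  # estimate of p(x_t | hx)
        te += c * np.log(p_joint / p_cond)
    return te / (n * dt)
```

Because the plug-in estimate is an empirical conditional mutual information, it is always non-negative; on independent processes the (large, positive) bias discussed below is visible directly.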

There are two large disadvantages to this approach [5]. If the process is genuinely occurring in discrete time, then the estimation procedure just described is consistent. That is, it is guaranteed to converge to the true value of the TE in the limit of infinite data. However, if we are considering a fundamentally continuous-time process, such as a neuron's action potential, then the lossy transformation of time discretisation will result in an inaccurate estimate of the TE. Thus, in these cases, any estimator based on time discretisation is not consistent. Secondly, whilst the loss of resolution of the discretisation will decrease with decreasing bin size ∆t, this requires larger dimensionality of the history embeddings to capture correlations over similar time intervals. This increase in dimension will result in an exponential increase in the state space size being sampled and therefore the data requirements. This renders the estimation problem intractable for the typical dataset sizes present in neuroscience.

In practice then, the application of transfer entropy to event-based data such as spike trains has required a trade-off between fully resolving interactions that occur with fine time precision and capturing correlations that occur across long time intervals. There is substantial evidence that spike correlations at the millisecond and sub-millisecond scale play a role in encoding visual stimuli [38, 39], motor control [40] and speech [41]. On the other hand, correlations in spike trains exist over lengths of hundreds of milliseconds [42]. A discrete-time TE estimator cannot capture both of these classes of effects simultaneously.

Recent work by Spinney et al. [5] derived a continuous-time formalism for TE. It was demonstrated that, for stationary point processes such as spike trains, the pairwise TE rate is given by:

$$\dot{T}_{Y \to X} = \frac{1}{\tau} \sum_{i=1}^{N_X} \ln \frac{\lambda_{x \mid \mathbf{x}_{<x}, \mathbf{y}_{<x}}\left[\mathbf{x}_{<x_i}, \mathbf{y}_{<x_i}\right]}{\lambda_{x \mid \mathbf{x}_{<x}}\left[\mathbf{x}_{<x_i}\right]}. \tag{3}$$

It is worth emphasizing that, in this context, the processes X and Y are series of the time points x_i of events i. This is contrasted with (1), where X and Y are series of values at the sampled time points t_i. To avoid confusion we use the notation that the y_i ∈ Y are the raw time points and y_{<x_i} is the history embedding of the source observed at the target event time x_i (see Methods). Here, λ_{x|x<x,y<x} is the instantaneous rate of the target process conditioned on the histories of the target x_{<x_i} and the source y_{<x_i}, evaluated at the time points x_i of the events in the target process. λ_{x|x<x} is the instantaneous rate of the target conditioned on its history alone, ignoring the history of the source; it is the marginalisation of λ_{x|x<x,y<x} over the source histories. τ is the length of the time series. The conditional TE rate is given analogously by:

$$\dot{T}_{Y \to X \mid Z} = \frac{1}{\tau} \sum_{i=1}^{N_X} \ln \frac{\lambda_{x \mid \mathbf{x}_{<x}, \mathbf{y}_{<x}, \mathbf{z}_{<x}}\left[\mathbf{x}_{<x_i}, \mathbf{y}_{<x_i}, \mathbf{z}_{<x_i}\right]}{\lambda_{x \mid \mathbf{x}_{<x}, \mathbf{z}_{<x}}\left[\mathbf{x}_{<x_i}, \mathbf{z}_{<x_i}\right]}. \tag{4}$$

Here, λ_{x|x<x,y<x,z<x} is the instantaneous rate of the target process conditioned on the histories of the target x_{<x_i}, the source y_{<x_i} and the conditioning processes z_{<x_i}, evaluated at the target events, and λ_{x|x<x,z<x} is the corresponding rate conditioned on the histories of the target and the conditioning processes, ignoring the history of the source.
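When the conditional rates are known analytically, the sum in (3) can be evaluated directly as a sum of log rate ratios over the target events. The following sketch is a hypothetical illustration of that evaluation only; the paper's actual estimator must estimate these rate ratios from data, and the function and its rate-function arguments are our own:

```python
import numpy as np

def continuous_time_te_rate(target_events, rate_joint, rate_cond, tau):
    """TE rate via (3): (1/tau) * sum_i ln( lambda_joint(x_i) / lambda_cond(x_i) ).

    rate_joint(t): target rate given both target and source histories at time t.
    rate_cond(t): target rate given the target history alone.
    tau: length of the observed time series.
    """
    log_ratios = [np.log(rate_joint(t) / rate_cond(t)) for t in target_events]
    return float(sum(log_ratios)) / tau
```

Note that the sum runs only over target events, consistent with the observation above that nothing needs to be computed in the 'empty' space between them.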

Crucially, it was demonstrated by Spinney et al., and later shown more rigorously by Cooper and Edgar [43], that if the discrete-time formalism of the TE (in (1)) could be properly estimated in the limit ∆t → 0, then it would converge to the same value as the continuous-time formalism. This is due to the contributions to the TE from the times between target events cancelling. Yet there are two important distinctions in the continuous-time formalism which hold promise to address the consistency issues of the discrete-time formalism. First, the basis in continuous time allows us to represent the history embeddings by inter-event intervals, suggesting the possibility of jointly capturing subtleties in both short and long time-scale effects that has evaded discrete-time approaches. (See Fig. 1 for a diagrammatic representation of these history embeddings, contrasted with the traditional way of constructing histories for the discrete-time estimator.) Second, note the important distinction that the sums in (3) and (4) are taken over the N_X (spiking) events in the target during the time series over interval τ; this contrasts to a sum over all time-steps in the discrete-time formalism. An estimation strategy based on (3) and (4) would only be required to calculate quantities at events, ignoring the 'empty' space in between. This implies a potential computational advantage, as well as eliminating one source of estimation variability.

These factors all point to the advantages of estimating TE for event-based data using the continuous-time formalism in (3) and (4). This paper presents an empirical approach to doing so. The estimator (presented in Methods) operates by considering the probability densities of the history embeddings observed at events and contrasts these with the probability densities of those embeddings being observed at other (randomly sampled) points. This approach is distinct in utilising a novel Bayesian inversion on (4) in order to operate on these probability densities of the history embeddings, rather than making a more difficult direct estimation of spike rates. This further allows us to utilise k-Nearest-Neighbour (kNN) estimators for entropy terms based on these probability densities, bringing known advantages of consistency, data efficiency, low sensitivity to parameters and known bias corrections. By combining these entropy estimators, and adding a further bias-reduction step, we arrive at our proposed estimator. The resulting estimator is consistent (see Methods section IV A 5) and is demonstrated on synthetic examples in Results to be substantially superior to estimators based on time discretisation across a number of metrics.
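The kNN entropy estimators referred to here are of the Kozachenko-Leonenko type, in which the density around each sample is inferred from the distance to its k-th nearest neighbour. Below is a minimal sketch of that general technique under the ℓ1 (Manhattan) norm used in this paper; it is our own illustration, not the proposed TE estimator itself, and the integer-argument digamma helper is ours:

```python
import numpy as np
from math import factorial, log

def _digamma_int(n):
    # digamma at a positive integer n: -gamma + sum_{k=1}^{n-1} 1/k
    return -0.5772156649015329 + sum(1.0 / k for k in range(1, n))

def kl_entropy(samples, k=4):
    """Kozachenko-Leonenko kNN entropy estimate (in nats), l1 norm.

    samples: (N, d) array of points drawn from the unknown density.
    """
    x = np.asarray(samples, dtype=float)
    n, d = x.shape
    # pairwise l1 distances; eps[i] = distance from point i to its k-th neighbour
    dists = np.abs(x[:, None, :] - x[None, :, :]).sum(axis=2)
    np.fill_diagonal(dists, np.inf)
    eps = np.sort(dists, axis=1)[:, k - 1]
    # log-volume of the unit-radius l1 ball in d dimensions: 2^d / d!
    log_cd = d * log(2.0) - log(factorial(d))
    return _digamma_int(n) - _digamma_int(k) + log_cd + d * float(np.mean(np.log(eps)))
```

For example, scaling all samples by a factor c shifts the estimate by exactly d ln c, mirroring the behaviour of differential entropy; this kind of structural correctness is one reason these estimators have such favourable bias properties.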

Surrogate samples, used to test whether a TE estimate is statistically different from zero, have historically been created by either shuffling the original source samples or by shifting the source process in time. However, this results in surrogates which conform to an incorrect null hypothesis that the transitions in the target are completely independent of the source histories. That is, they conform to the factorisation p(X_t, Y_{<t} | X_{<t}, Z_{<t}) = p(X_t | X_{<t}, Z_{<t}) p(Y_{<t}). This leads to the possibility that a process pair which conforms to the correct null of conditional independence, but in which the source histories are not completely independent of the target, will incorrectly be flagged as exhibiting statistically significant TE.

As shown in Results, this can lead to incredibly high false positive rates in certain settings such as the presence of strong common driver effects. Therefore, in order to have a suitable significance test for use in conjunction with the proposed estimator, we also present (in Methods section IV B) an adaptation of a recently proposed local permutation method [8] to our specific case. This adapted scheme produces surrogates which conform to the correct null hypothesis of conditional independence of the present of the target and the source history, given the histories of the target and further conditioning processes. This is the condition that p(X_t, Y_{<t} | X_{<t}, Z_{<t}) = p(X_t | X_{<t}, Z_{<t}) p(Y_{<t} | X_{<t}, Z_{<t}). All deviations of the surrogates from the original process are thus consistent with the appropriate null.
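The core idea of the local permutation approach can be sketched as follows: source histories are permuted only among samples whose target (and conditioning) histories are similar, which preserves p(X_t | X_{<t}, Z_{<t}) while breaking any source-target dependence. This is a simplified illustration of the approach of [8], not the exact procedure of Methods section IV B; in particular, it samples neighbours with replacement:

```python
import numpy as np

def local_permutation_surrogate(source_hist, target_hist, k=5, rng=None):
    """Swap source history embeddings among k-nearest neighbours in
    target-history space (l1 distances), approximating the null
    p(x_t, y_hist | x_hist) = p(x_t | x_hist) p(y_hist | x_hist).

    source_hist: (N, d_y) source history embeddings.
    target_hist: (N, d_x) target (and conditioning) history embeddings.
    """
    rng = np.random.default_rng(rng)
    th = np.asarray(target_hist, dtype=float)
    sh = np.asarray(source_hist, dtype=float)
    n = len(th)
    dists = np.abs(th[:, None, :] - th[None, :, :]).sum(axis=2)
    surrogate = np.empty_like(sh)
    for i in range(n):
        # k nearest target histories (includes sample i itself)
        neighbours = np.argsort(dists[i])[:k]
        j = rng.choice(neighbours)
        surrogate[i] = sh[j]  # a source history from a similar context
    return surrogate
```

Because every surrogate source history is drawn from a sample with a nearly identical target history, any remaining target-history dependence (e.g. a common driver imprinted on both processes) survives in the surrogates, which is exactly what the naive shuffle and time-shift methods destroy.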

We show in Results that the combination of the proposed estimator and surrogate generation method is capable of correctly distinguishing between zero and non-zero information flow in cases where the history of the source might have a strong pairwise correlation with the occurrence of events in the target, but is nevertheless conditionally independent. The combination of the discrete-time estimator and a traditional method of surrogate generation is shown to be incapable of making this distinction. Further, we demonstrate that the proposed combination is capable of inferring the connectivity of a simple circuit of biophysical model neurons inspired by the crustacean stomatogastric ganglion, a task which the discrete-time estimator was shown to be incapable of performing. This provides strong impetus for the application of the proposed techniques to investigate information flows in spike-train data recorded from biological neurons.

II. RESULTS

The first two subsections here present the results of the continuous-time estimator applied to two different synthetic examples for which the ground truth value of the TE is known. The first example considers independent processes where Ṫ_{Y→X} = 0, whilst the second examines coupled Poisson processes with a known, non-zero Ṫ_{Y→X}. The continuous-time estimator's performance is also contrasted with that of the discrete-time estimator. The emphasis of these sections is on properties of the estimators: their bias, variance and consistency (see Methods).


The third and fourth subsections present the results of the combination of the continuous-time estimator and the local permutation surrogate generation scheme applied to two examples: one synthetic and the other a biologically plausible model of neural activity. The comparison of the estimates to a population of surrogates produces p-values for the statistical significance of the TE values being distinct from zero and implying a directed relationship. The results are compared to the known connectivity of the studied systems. These p-values could be translated into other metrics such as ROC curves and false-positive rates, but we choose to instead visualise the distributions of the p-values themselves. The combination of the discrete-time estimator along with a traditional method for surrogate generation (time shifts) is also applied to these examples for comparison.

A. No TE Between Independent Homogeneous Poisson Processes

(a) k = 1 (b) k = 5

FIG. 2: Evaluation of the continuous-time estimator on independent homogeneous Poisson processes. The blue line shows the average TE rate across multiple runs and the shaded area shows the standard deviation. Plots are shown for two different values of the k nearest neighbours, and four different values of the ratio of the number of sample points to the number of events N_U/N_X (see Methods, sections IV A 4 and IV A 5).

The simplest example on which we can attempt to validate the estimator is independent homogeneous Poisson processes, where the true value of the TE between such processes is zero.

Pairs of independent homogeneous Poisson processes were generated, each with rate λ̄ = 1, and contiguous sequences of N_X ∈ {1 × 10^2, 1 × 10^3, 1 × 10^4, 1 × 10^5} target events were selected. For the continuous-time estimator, the parameter N_U for the number of placed sample points was varied (see Methods sections IV A 4 and IV A 5) to check the sensitivity of estimates to this parameter. At each of these numbers of target events N_X, the averages are taken across 1000, 100, 20 and 20 tested process pairs respectively.

(a) l = 2, m = 2 (b) l = 5, m = 5

FIG. 3: Result of the discrete-time estimator applied to independent homogeneous Poisson processes. The solid line shows the average TE rate across multiple runs and the shaded area shows the standard deviation. Plots are shown for four different values of the bin width ∆t as well as different source and target embedding lengths, l and m.
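Such pairs of independent unit-rate processes can be generated by cumulatively summing exponentially distributed inter-event intervals, a standard construction (the function and variable names here are ours):

```python
import numpy as np

def homogeneous_poisson(rate, n_events, rng):
    """Event times of a homogeneous Poisson process with the given rate:
    inter-event intervals are i.i.d. Exponential(rate)."""
    return np.cumsum(rng.exponential(1.0 / rate, size=n_events))

rng = np.random.default_rng(0)
source = homogeneous_poisson(1.0, 10_000, rng)
target = homogeneous_poisson(1.0, 10_000, rng)  # independent of source
```

Because the two trains are generated from independent interval sequences, any non-zero TE estimate between them is pure estimator bias and variance, which is what this example isolates.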

Fig. 2 shows the results of these runs for the continuous-time estimator, using various parameter settings. In all cases, the Manhattan (ℓ1) norm is used as the distance metric and the embedding lengths are set to l_X = l_Y = 1 spike (see Methods sections IV A 4 and IV A 5). For this example, the set of conditioning processes Z is empty.

The plots show that the continuous-time estimator converges to the true value of the TE (0). This is a numerical confirmation of its consistency for independent processes. Moreover, it exhibits very low bias (as compared to the discrete-time estimator, Fig. 3) for all values of the k nearest neighbours and N_U/N_X parameters. The variance is relatively large for k = 1, although it is dramatically better for k = 5; this reflects known results for the variance of this class of estimators as a function of k [44].

Fig. 3 shows the result of the discrete-time estimator applied to the same independent homogeneous Poisson processes for two different combinations of the source and target history embedding lengths, l and m time bins, and four different bin sizes ∆t. At each of the numbers of target events N_X, the averages are taken across 1000, 100, 100 and 100 tested process pairs respectively. The variance of this estimator on this process is low and comparable to the continuous-time estimator; however, the bias is very large and positive for short processes. It would appear that, in the case of these independent processes, it is consistent.

B. Consistent TE Between Unidirectionally Coupled Processes

(a) Firing rate of the target process as a function of the time since the most recent source spike (t_y^1) (b) Continuous-time estimator (c) Discrete-time estimator

FIG. 4: The discrete-time and continuous-time estimators were run on coupled Poisson processes for which the ground-truth value of the TE is known. (a) shows the firing rate of the target process as a function of the history of the source. (b) and (c) show the estimates of the TE provided by the two estimators. The blue line shows the average TE rate across multiple runs and the shaded area shows the standard deviation. The black line shows the true value of the TE. For the continuous-time estimator the parameter values of N_U/N_X = 1 and k = 4 were used along with the ℓ1 (Manhattan) norm. Plots are shown for three different values of the length of the target history component l_X. For the discrete-time estimator, plots are shown for four different values of the bin width ∆t. The source and target history embedding lengths are chosen such that they extend back one time unit.

The estimators were also tested on an example of unidirectionally coupled spiking processes with a known value of TE (previously presented as example B in [5]). Here, the source process Y is a homogeneous Poisson process. The target process X is produced as a conditional point process where the instantaneous rate is a function of the time since the most recent source event. More specifically:

$$\lambda_{x \mid \mathbf{x}_{<x}, \mathbf{y}_{<x}}\left[\mathbf{x}_{<x}, \mathbf{y}_{<x}\right] = \lambda_{x \mid y}\left(t_y^1\right). \tag{5}$$

Here, t_y^1 is the time since the most recent source event. As a function of t_y^1, the target spike rate λ_{x|y} deviates from a base rate, with this dependence cut off beyond t_cut (the resulting rate function is plotted in Fig. 4a). The simulation was performed using a thinning algorithm [45]. Specifically, we first generated the source process at rate λ̄_y. We then generate the target as a homogeneous Poisson process with rate λ_h such that λ_h > λ_{x|y}(t_y^1) for all values of t_y^1. We then went back through all the events in this process and removed each event with probability 1 − λ_{x|y}(t_y^1)/λ_h. As with the previous example, once a pair of processes had been generated, a contiguous sequence of N_X target events was selected. Tests were conducted for the values of N_X ∈ {1 × 10^2, 1 × 10^3, 1 × 10^4, 1 × 10^5}. For the continuous-time estimator, the number of placed sample points N_U was set equal to N_X (see Methods sections IV A 4 and IV A 5). At each N_X, the averages are taken over 1000, 100, 20 and 20 tested process pairs respectively. Due to its slow convergence, the discrete-time estimator was also evaluated at a value of N_X = 1 × 10^6 target spikes, averaged over 20 process pairs.
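The thinning (rejection) procedure described above can be sketched generically as follows. The rate function passed in is a stand-in argument, not the specific λ_{x|y} used in this example, and the oversampling factor for candidate events is our own choice:

```python
import numpy as np

def thin_target(source, rate_fn, rate_max, t_max, rng):
    """Generate target events whose instantaneous rate is rate_fn(t_y),
    with t_y the time since the most recent source event, via thinning.

    Requires rate_max >= rate_fn(t_y) for all t_y."""
    # candidate events from a homogeneous Poisson process at rate_max
    # (oversample by 3x so the cumulative sum comfortably exceeds t_max)
    n_cand = int(3 * rate_max * t_max)
    candidates = np.cumsum(rng.exponential(1.0 / rate_max, size=n_cand))
    candidates = candidates[candidates < t_max]
    kept = []
    for t in candidates:
        prev = source[source < t]
        if len(prev) == 0:
            continue  # no source history yet, rate undefined
        t_y = t - prev[-1]  # time since most recent source event
        # keep the candidate with probability rate_fn(t_y) / rate_max
        if rng.uniform() < rate_fn(t_y) / rate_max:
            kept.append(t)
    return np.array(kept)
```

When rate_fn is constant and equal to the bound rate_max, every candidate is kept and the output reduces to a homogeneous Poisson process, which is a convenient sanity check on the implementation.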

Spinney et al. [5] present a numerical method for calculating the TE for this process. For the parameter values used here the true value of the TE is 0.5076 ± 0.001.

Given that we know that the dependence of the target on the source is fully determined by the distance to the most recent event in the source, we used a source embedding length of l_Y = 1. The estimators were run with three different values of the target embedding length l_X ∈ {1, 2, 3} (see Methods sections IV A 4 and IV A 5). For this example, the set of conditioning processes Z is empty.

Fig. 4b shows the results of the continuous-time estimator applied to the simulated data. We used the value of k = 4 and the Manhattan (ℓ1) norm. The results displayed are as expected in that for a short target history embedding length of l_X = 1 spike, the estimator converges to a slight over-estimate of the TE. The overestimate at shorter target history embedding lengths l_X can be explained in that perfect estimates of the $\sum_{i=1}^{N_X} \ln \lambda_{x \mid \mathbf{x}_{<x}}$ term require target history embeddings which cover the full extent of the target's history dependence; shorter values of l_X don't cover this period in many cases, leaving this rate underestimated and therefore the TE overestimated. For longer values of l_X ∈ {2, 3} we see that they converge closely to the true value of the TE. This is a further numerical confirmation of the consistency of the continuous-time estimator.

Fig. 4c shows the results of the discrete-time estimator applied to the same process, run for four different values of the bin width ∆t ∈ {1, 0.5, 0.2, 0.1} time units. The number of bins included in the history embeddings was chosen such that they extended one time unit back (the known length of the history dependence). Smaller bin sizes could not be used as this caused the machine on which the experiments were running to run out of memory. The plots are a clear demonstration that the discrete-time estimator is very biased and not consistent. At a bin size of ∆t = 0.2 it converges to a value less than half the true TE. Moreover, its convergence is incredibly slow. At the bin size of ∆t = 0.1 it would appear to not have converged even after 1 million target events, and indeed it is not even converging to the true value of the TE. The significance of the performance improvement by our estimator is explored further in Discussion.

C. Identifying Conditional Independence Despite Strong Pairwise Correlations

262 The existence of a set of conditioning processes under which the present of the target

263 component is conditionally independent of the past of the source implies that, under certain

264 weak assumptions, there is no causal connection from the source to the target [46–48]. More

265 importantly, TE can be used to test for such conditional independence (see Methods), thus

266 motivating its use in network inference. A large challenge faced in testing for conditional

267 independence is correctly identifying “spurious” correlations, whereby conditionally inde-

13 bioRxiv preprint doi: https://doi.org/10.1101/2020.06.16.154377; this version posted June 16, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license.

T + ξM M

aD1 + ξD1 D1

aD2 + ξD2 D2

FIG. 5: Diagram of the noisy copy process. Events in the mother process M occur periodically with intervals T + ξM (ξ is used for noise terms). Events in the daughter

processes D1 and D2 occur after each event in the mother process, at a distance of aDi + ξDi

(a) Raw TE and surrogate values (b) Bias corrected TE values

˙ FIG. 6: Results of the continuous-time estimator run on a noisy copy process TD1 D2 M , where conditioning on a strong common driver M should lead to zero information→ flow| being inferred. The shift σ of the source, relative to the target and common driver, controls the strength of the correlation between the source and target (maximal at zero shift). For each shift, the estimator is run on both the original process as well as embeddings generated via two surrogate generation methods: our proposed local permutation method and a traditional source time-shift method. The shaded regions represent the standard deviation. The bias of the estimator changes with the shift, and we expect the estimates to be consistent with appropriately generated surrogates reflecting the same strong common driver effect. This is the case for our local permutation surrogates, as shown in (a). This leads to the correct bias-corrected TE value of 0, as shown in (b).

268 pendent components might have a strong pairwise correlation. This problem is particularly

pronounced when investigating the spiking activity of biological neurons, populations of which often exhibit highly correlated behaviour through various forms of synchrony [49–51], or common drivers [52, 53]. In this subsection we demonstrate that the combination of the presented estimator and surrogate generation scheme is particularly adept at identifying conditional independence in the face of strong pairwise correlations on a synthetic example. Moreover, the combination of the traditional discrete-time estimator and surrogate generation techniques is demonstrated to be ineffective on this task.

FIG. 7: The p values obtained when using the continuous-time and discrete-time estimators to infer non-zero information flow in the noisy copy process. The estimators are applied to both Ṫ_{D1→D2|M} (expected to have zero flow) and Ṫ_{M→D2|D1} (expected to have non-zero flow and therefore be indicated as statistically significant). (a) Continuous time. (b) Discrete time. Each panel shows results for both daughter noise levels, σ_D = 5e-2 and σ_D = 7.5e-2, for the zero-flow and non-zero-flow cases. Only the results from the continuous-time estimator match these expectations. Ticks represent the particular combination of estimator and surrogate generation scheme making the correct inference in the majority of cases when a cutoff value of p = 0.05 is used. Diamonds show outliers.

The chosen synthetic example models a common driver effect, where an apparent directed coupling between a source and target is only due to a common parent. In such cases, despite a strong induced correlation between the source history and the occurrence of an event in the target, we expect to measure zero information flow when conditioning on the common driver. Our system here consists of a quasi-periodic 'mother' process M (the common driver) and two 'daughter' processes, D1 and D2 (see Fig. 5 for a diagram of the process). The mother process contains events occurring at intervals of T + ξ_M, with the daughter processes being noisy copies with each event shifted by an amount a_{Di} + ξ_{Di}. We also choose that a_{D1} < a_{D2}; so long as the difference between these a_{Di} values is large compared to the size of the noise terms, this will ensure that the events in D1 precede those in D2. When conditioning on the mother process, the TE from the first daughter to the second, Ṫ_{D1→D2|M}, should be 0. However, the history of the source daughter process is strongly correlated with the occurrence of events in the second daughter process — the events in D1 will precede those in D2 by the constant amount a_{D2} − a_{D1} plus a small noise term ξ_{D2} − ξ_{D1}.

Due to the noise in the system, this level of correlation will gradually break down if we

shift the source daughter process relative to the others. This allows us to do two things. Firstly, we can get an idea of the bias of the estimator on conditionally independent processes for different levels of pairwise correlation between the history of the source and events in the target. Secondly, we can evaluate different schemes of generating surrogate TE distributions as a function of this correlation. We would expect that, for well-generated surrogates which reflect the relationships to the conditional process, the TE estimates on these conditionally independent processes will closely match the surrogate distribution.

We simulated this process using the parameter values of T = 1.0, a_{D1} = 0.25, a_{D2} = 0.5, ξ_M ∼ N(0, σ_M²), ξ_{D1} ∼ N(0, σ_D²) and ξ_{D2} ∼ N(0, σ_D²), with σ_M = 0.05 and σ_D = 0.05. Once the process had been simulated, the source process D1 was shifted by an amount σ. We used values of σ between −10T and 10T, at intervals of 0.13T. For each such σ, the process was simulated 200 times. For each simulation, the TE was estimated on the original process with the shift in the first daughter as well as on a surrogate generated according to our proposed local permutation scheme (see Sec. IV B for a detailed description). The parameter values of k_perm = 10 and N_{U,surrogate} = N_X were used. For comparison, we also generated surrogates according to the traditional source time-shift method, where this shift was distributed uniformly at random between 200 and 300. A contiguous region of 50000 target events was extracted and the estimation was performed on this data. The continuous-time estimator used the parameter values of l_X = l_Y = l_{Z1} = 1, k = 10, N_U = N_X and the Manhattan (ℓ1) norm.
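The noisy copy process just described is simple to reproduce. The following is a minimal sketch, assuming only the parameter values stated above; the function and variable names are ours, not from the paper's implementation:

```python
import numpy as np

def simulate_noisy_copy(n_events, T=1.0, a_d1=0.25, a_d2=0.5,
                        sigma_m=0.05, sigma_d=0.05, seed=0):
    """Simulate the mother process M and its two noisy daughter copies."""
    rng = np.random.default_rng(seed)
    # Mother process: quasi-periodic events at intervals of T + xi_M.
    mother = np.cumsum(T + rng.normal(0.0, sigma_m, n_events))
    # Each daughter copies every mother event with lag a_Di + xi_Di.
    d1 = mother + a_d1 + rng.normal(0.0, sigma_d, n_events)
    d2 = mother + a_d2 + rng.normal(0.0, sigma_d, n_events)
    return mother, d1, d2

mother, d1, d2 = simulate_noisy_copy(50000)
# D1 events precede D2 events by a_D2 - a_D1 = 0.25 on average,
# plus the small noise term xi_D2 - xi_D1.
```

With the noise scales used here, shifting D1 relative to the other processes (the shift σ of Fig. 6) then simply amounts to adding a constant to the D1 event times.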

The results in Fig. 6a demonstrate that the null distribution of TE values produced by the local permutation surrogate generation scheme closely matches the distribution of TE values produced by the continuous-time estimator applied to the original data. Whilst the raw TE estimates retain a slight negative bias (explored further in the Discussion), we can generate a bias-corrected TE with the surrogate mean subtracted from the original estimate (giving an "effective transfer entropy" [54]). This bias-corrected TE, as displayed in Fig. 6b, is consistent with zero because of the close match between our estimated value and surrogates, which is the desired result in this scenario. By contrast, the TE values estimated on the surrogates generated by the traditional time-shift method are substantially lower than those estimated on the original process (Fig. 6a); comparison to these would produce very high false positive rates for significant directed relationships. This is most pronounced for high levels of pairwise source-target correlation (with shifts σ near zero). The reason behind this difference in the two approaches is easy to intuit. The traditional time-shift method destroys all relationship between the history of the source and the occurrence of events in the target. This means that we are comparing estimates of the TE on processes where there is a strong pairwise correlation between the history of the source and the occurrence of target events, with estimates of the TE on surrogate processes where these are fully independent.

Specifically, in discrete time, the joint distribution of the present of the target and the source history, conditioned on the other histories, decomposes as p(X_t, Y_{<t} | X_{<t}, Z_{<t}) = p(X_t | X_{<t}, Z_{<t}) p(Y_{<t}). By contrast, in our proposed local permutation scheme, whilst the history of the source and the occurrence of events in the target are conditionally independent, this scheme maintains the relationship between the history of the source and the mediating variable, which in this case is the history of the mother process. That is, the scheme produces surrogates where (working in discrete time for now) the joint distribution of the present of the target and the source history, conditioned on the other histories, decomposes as p(X_t, Y_{<t} | X_{<t}, Z_{<t}) = p(X_t | X_{<t}, Z_{<t}) p(Y_{<t} | X_{<t}, Z_{<t}). See Methods for how this decomposition carries over to the continuous-time framework.
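In code, one plausible form of such a local permutation (a simplified sketch of the idea, not the exact algorithm of Methods Sec. IV B; all names are illustrative) redistributes source-history embeddings only among samples whose target and conditioning histories are similar:

```python
import numpy as np

def local_permutation_surrogate(y, xz, k_perm=10, seed=0):
    """Swap each sample's source history y[i] with that of one of its
    k_perm nearest neighbours in (target, conditioning)-history space xz,
    approximately preserving p(y | x, z) while destroying any direct
    dependence of the target's present on y."""
    rng = np.random.default_rng(seed)
    n = len(y)
    # Brute-force l1 (Manhattan) distances between history embeddings.
    dists = np.abs(xz[:, None, :] - xz[None, :, :]).sum(axis=-1)
    order = np.argsort(dists, axis=1)  # column 0 is the point itself
    y_surr = np.empty_like(y)
    used = np.zeros(n, dtype=bool)
    for i in rng.permutation(n):
        neighbours = order[i, 1:k_perm + 1]
        free = neighbours[~used[neighbours]]
        # Prefer a neighbour whose source history has not been used yet.
        j = free[0] if len(free) else rng.choice(neighbours)
        y_surr[i] = y[j]
        used[j] = True
    return y_surr
```

Because a sample's replacement source history comes from a neighbour with a similar (X, Z) history, the surrogate retains the source-conditional relationship, which is exactly what the time-shift method destroys.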

We then confirm that the proposed scheme is able to identify cases where an information flow does exist. To do so, we applied it to measure Ṫ_{M→D2|D1} in the above system, where we would expect to see non-zero information flow from the common driver or mother to one daughter process, conditioned on the other. The setup used was identical to the above, however focussing on a shift of σ = 0, and, for completeness, two different levels of noise in the daughter processes were used: σ_D = 0.05 and σ_D = 0.075. We recorded the p values produced by the combination of the proposed continuous-time estimator and the local permutation surrogate generation scheme when testing for conditional information flow Ṫ_{M→D2|D1}, as well as contrasting to the expected zero flow in Ṫ_{D1→D2|M}. These flows were measured in 10 runs each and the distributions of the resulting p values are shown in Fig. 7. We observe that our proposed combination assigns a p value of zero in every instance of Ṫ_{M→D2|D1} as expected; whilst for Ṫ_{D1→D2|M} it assigns p values in a broad distribution above zero, meaning the estimates are consistent with the null distribution as expected.

We also applied the combination of the discrete-time estimator and the traditional time-shift method of surrogate generation to this same task of distinguishing between zero and non-zero conditional information flows. We used time bins of width ∆t = 0.05 and history lengths of 7 samples for the target, source and conditional histories. We also applied a procedure to determine optimal lags to the target for both the conditional and source histories, following [17]. That is, before calculating the conditional TE from the source to the target, we determined the optimal lag between the conditional history and the target by calculating the pairwise TE between the conditioning process and the target for all lags between 0 and 10. The lag which produced the maximum such TE was used. We then calculated the conditional TE between the source and the target, using this determined lag for the conditioning process, for all lags to the source process between 0 and 10. The TE was then determined to be the maximum TE estimated over all these different lags applied to the source process. This procedure was applied when estimating the TE on the original process as well as on each separate surrogate. The results of this procedure are also displayed in Fig. 7. Here we see that the combination of the discrete-time estimator and the traditional time-shift method of surrogate generation assigns a p value of zero to every single example of both Ṫ_{M→D2|D1} and Ṫ_{D1→D2|M}. This result – contradicting the expectation that Ṫ_{D1→D2|M} is consistent with zero – suggests that this benchmark approach has an extremely high false positive rate here.
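As an illustration of this lag-search procedure, the sketch below wraps a toy plug-in discrete-time TE estimator (our own simplified stand-in for exposition, not the estimator used in the paper) and maximises over candidate source lags:

```python
import numpy as np
from collections import Counter

def _cond_entropy_bits(pairs):
    """Plug-in H(value | condition) in bits from (value, condition) tuples."""
    joint = Counter(pairs)
    cond = Counter(c for _, c in pairs)
    n = len(pairs)
    return -sum((k / n) * np.log2(k / cond[c]) for (v, c), k in joint.items())

def plugin_te(source, target, hist_len=2, lag=1):
    """Discrete-time TE (bits) = H(X_t | X_hist) - H(X_t | X_hist, Y_hist)."""
    triples = []
    for t in range(hist_len + lag, len(target)):
        x_hist = tuple(target[t - hist_len:t])
        y_hist = tuple(source[t - lag - hist_len + 1:t - lag + 1])
        triples.append((target[t], x_hist, y_hist))
    h_x = _cond_entropy_bits([(v, xh) for v, xh, _ in triples])
    h_xy = _cond_entropy_bits([(v, (xh, yh)) for v, xh, yh in triples])
    return h_x - h_xy

def te_with_lag_search(source, target, hist_len=2, max_lag=10):
    """Take the maximum TE over candidate source lags, as described above."""
    return max(plugin_te(source, target, hist_len, lag)
               for lag in range(1, max_lag + 1))
```

Note that maximising an estimate over lags introduces its own positive bias, which is one reason the identical search must also be applied to every surrogate.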

D. Connectivity Inference of the Pyloric Circuit of the Crustacean Stomatogastric Ganglion

The pyloric circuit of the crustacean stomatogastric ganglion has been proposed as a benchmark circuit on which to test spike-based connectivity inference techniques [55, 56]. It has been shown that Granger causality (which is equivalent to TE under the assumption of Gaussian variables [57]) is unable to infer the connectivity of this network [55]. In contrast, we demonstrate here the efficacy of the proposed estimators for TE on models of this system.

The crustacean stomatogastric ganglion [58–60] has received substantial research attention as a simple model circuit. Of great importance to network inference is the fact that its full connectivity is known. The pyloric circuit is a partially independent component of the greater circuit and consists of an Anterior Burster (AB) neuron, two Pyloric Driver (PD) neurons, a Lateral Pyloric (LP) neuron and multiple Pyloric (PY) neurons. As the AB neuron is electrically coupled to the PD neurons and the PY neurons are identical, for the purposes of modelling and network inference, the circuit is usually represented by a single AB/PD complex, and single LP and PY neurons [55, 56, 61, 62].

The AB/PD complex undergoes self-sustained rhythmic bursting. It inhibits the LP and PY neurons through slow cholinergic and fast glutamatergic synapses. These neurons then burst on rebound from this inhibition. The PY and LP neurons also inhibit one another through fast glutamatergic synapses and the LP neuron similarly inhibits the AB/PD complex.

FIG. 8: Results of both estimator and surrogate generation combinations being applied to data from simulations of a biophysical model of a neural circuit inspired by the pyloric circuit of the crustacean stomatogastric ganglion. (a) Circuit connectivity diagram. (b) Example membrane potential traces produced by the circuit. (c) Distribution of p values from the continuous-time estimator and the local permutation surrogate generation method, for each of the six possible directed connections (AB/PD→LP, AB/PD→PY, LP→AB/PD, LP→PY, PY→AB/PD and PY→LP). (d) The corresponding distributions from the discrete-time estimator and the source time-shift surrogate generation method. The circuit, shown in (a), is fully connected apart from the missing connection between the PY neuron and the AB/PD complex, and generates membrane potential traces which are bursty and highly periodic with cross-correlated activity. The distribution of p values in (c) demonstrates that the continuous-time combination is capable of correctly inferring this circuit in most runs. By contrast, the discrete-time combination, whose p values are shown in (d), mis-specified the connectivity in every run. Ticks represent the particular combination of estimator and surrogate generation scheme making the correct inference in the majority of cases when a cutoff value of p = 0.05 is used.

Fig. 8 shows sample membrane potential traces from simulations of this circuit as well as a connectivity diagram. Despite its small size, inference of its connectivity is challenging [55, 56] due to the fact that it is highly periodic. Although there is no structural connection from the PY to the AB/PD neuron, there is a strong, time-directed correlation between their activity — the PY neuron always bursts shortly before the AB/PD. Recognising that this is a spurious correlation requires fully resolving the influence of the AB/PD's history on itself as well as that of the LP on the AB/PD. To further complicate matters, the true connection between the LP and AB/PD neurons is very challenging to detect. The AB/PD complex will continue bursting regardless of any input from the LP. Correctly inferring this connection requires detecting the subtle changes in the timing of AB/PD bursts that result from the activity of the LP.

Previous work on the inference of the pyloric circuit has used both in vitro and in silico data [55, 56]. We ran simulations of biophysical models inspired by this network, similar to those used in [55] (see Appendix A). Attempts were then made to infer this network by detecting non-zero conditional information flow from the spiking event times produced by the simulations. This was done by, for every source-target pair, estimating the TE from the source to the target, conditioned on the activity of the third remaining neuron. Both the combination of the proposed continuous-time estimator and local permutation surrogate generation scheme and the combination of the discrete-time estimator and source time-shift surrogate generation scheme were applied to this task.

Both combinations were applied to ten independent simulations of the network and the number of target events N_X = 1 × 10⁴ was used. For the continuous-time estimator the parameter values of l_X = l_Y = l_{Z1} = 1, k = 10, N_U = N_X, N_{U,surrogate} = N_X and k_perm = 5 were used, along with the Manhattan (ℓ1) norm (see Methods Sec. IV A 4 and IV A 5). The discrete-time estimator made use of a bin size of ∆t = 0.05 and history embedding lengths of seven bins for each of the source, target and conditioning processes. Searches were performed to determine the optimum embedding lag for both the source and conditioning histories (as per Sec. II C) with a maximum search value of 20 bins being used. We designed the search procedure to include times up to the inter-burst interval (around 1 time unit), which placed an effective lower bound on the width of the time bins (as bin sizes below ∆t = 0.05 resulted in impractically large search spaces). For both estimators, p values were inferred from 100 independently generated surrogates (see Methods Sec. IV B). The source time-shift surrogate generation scheme used time shifts distributed uniformly randomly between 200 and 400 time units.
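The surrogate-based quantities used here can be summarised in a few lines. A sketch, assuming the common convention of a one-sided empirical p value (the paper's exact tie-handling may differ):

```python
import numpy as np

def surrogate_p_value(te_original, te_surrogates):
    """Fraction of surrogate TE values at least as large as the original."""
    return float(np.mean(np.asarray(te_surrogates) >= te_original))

def effective_te(te_original, te_surrogates):
    """Bias-corrected ("effective") TE [54]: subtract the surrogate mean."""
    return float(te_original - np.mean(te_surrogates))

surr = [0.10, 0.20, 0.60, 0.05]
print(surrogate_p_value(0.5, surr))  # 0.25
print(effective_te(0.5, surr))       # 0.2625
```

A small p value indicates that the estimate on the original data is inconsistent with the null distribution of zero conditional information flow.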

Figures 8c and 8d show the distributions of p values resulting from the application of both estimator and surrogate generation scheme combinations. The continuous-time estimator and local permutation surrogate generation scheme were able to correctly infer the connectivity of the network in the majority of cases (indicated by p values approaching 0 for the true positives, and spread throughout [0, 1] for the true negatives). On the other hand, the discrete-time estimator and source time-shift surrogate generation scheme produced the same two erroneous predictions on every run: a connection between the PY neuron and the AB/PD complex along with a missing connection between the LP and PY neurons.

III. DISCUSSION

Despite transfer entropy being a popular tool within neuroscience and other domains of enquiry [9–11, 13–19], it has received more limited application to event-based data such as spike trains. This is at least partially due to current estimation techniques requiring the process to be recast as a discrete-time phenomenon. The resulting discrete-time estimation task has been beset by difficulties including a lack of consistency, high bias, slow convergence and an inability to capture effects which occur over fine and large time scales simultaneously.

This paper has built on recent work presenting a continuous-time formalism for TE [5] in order to derive an estimation framework for TE on event-based data in continuous time. This framework has the unique advantage of only estimating quantities at events in the target process, providing a significant computational advantage. Instead of comparing spike rates conditioned on specific histories at each target spiking event, though, we use a Bayesian inversion to instead make the empirically easier comparison of probabilities of histories at target events versus anywhere else along the target process. This comparison, using KL divergences, is made using k-NN techniques, which brings desirable properties such as efficiency for the estimator. This estimator is provably consistent. Moreover, as it operates on inter-event intervals, it is capable of capturing relationships which occur with fine time precision along with those that occur over longer time distances.

The estimator was first evaluated on two simple examples for which the ground truth is known: pairs of independent Poisson processes (Sec. II A) as well as pairs of processes unidirectionally coupled through a simple functional relationship (Sec. II B). The discrete-time estimator was also applied to these processes. It was found that the continuous-time estimator had substantially lower bias than the discrete-time estimator, converged orders of magnitude faster (in terms of the number of sample spikes required), and was relatively insensitive to parameter selections. Moreover, these examples provided numerical confirmation of the consistency of the continuous-time estimator, and further demonstration that the discrete-time estimator is not consistent. The latter simple example highlighted the magnitude of the shortcomings of the discrete-time estimator. In the authors' experience, spike-train datasets which contain 1 million spiking events for a single neuron are vanishingly rare. However, even in the unlikely circumstance that the discrete-time estimator is presented with a dataset of this size, as in Sec. II B, it could not accurately estimate the TE for a simple one-way relationship between only two neurons. Moreover, this example neatly demonstrates a notable problem with the use of the discrete-time estimator, which is that it provides wildly different estimates for different values of ∆t. In real-world applications, where the ground truth is unknown, there is no principled method for choosing which resulting TE value from the various bin sizes to use.
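The binning step at the root of this problem is easy to state in code. A minimal sketch (our own illustration, not the paper's implementation): the same event times yield a different binary sequence, and even a different number of distinguishable events, for every choice of ∆t:

```python
import numpy as np

def bin_spike_train(spike_times, dt, t_end):
    """Convert event times into the binary sequence used by the
    discrete-time estimator: 1 if a bin contains at least one event."""
    binary = np.zeros(int(np.ceil(t_end / dt)), dtype=int)
    binary[(np.asarray(spike_times) / dt).astype(int)] = 1
    return binary

spikes = [0.113, 0.125, 0.503, 2.301]
# Coarse bins merge the first two events into a single occupied bin:
print(bin_spike_train(spikes, 0.5, 3.0).sum())   # 3
# Fine bins keep them distinct, but the sequence is now 300 samples long:
print(bin_spike_train(spikes, 0.01, 3.0).sum())  # 4
```

The continuous-time estimator avoids this trade-off entirely by operating on the raw inter-event intervals.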

One of the principal use-cases of TE is the inference of non-zero information flow. As the TE is estimated from finite data, we require a manner of determining the statistical significance of the estimated values. Traditional methods of surrogate generation for TE either shift the source in time, or shuffle the source embeddings. However, whilst this retains the relationship of the target to its past and other conditionals, it completely destroys the relationship between the source and any conditioning processes, which can lead to very high false positive rates, as detailed in Results Sec. II C and Sec. IV B. We developed a local permutation scheme for use in conjunction with this estimator which is able to maintain the relationship of the source history embeddings with the history embeddings of the target and conditioning processes. The combination of the proposed estimator and this surrogate generation scheme was applied to an example where the history of the source and the occurrence of events in the target are highly correlated, but conditionally independent given their common driver (Sec. II C). The established time-shift method for surrogate generation produced a null distribution of TE values substantially below that estimated on the original data, incorrectly implying non-zero information flow. Conversely, the proposed local permutation method produced a null distribution which closely tracked the estimates on the original data. The proposed combination was also shown to be able to correctly distinguish between cases of zero and non-zero information flow. When applied to the same example, the combination of the discrete-time estimator and the traditional method of time-shifted surrogates inferred the existence of information flow in all cases, even when no such flow was present.

Finally, our proposed approach was applied to the inference of the connectivity of the pyloric network of the crustacean stomatogastric ganglion. The inference of this network is challenging due to its highly periodic dynamics. Granger causality, using a more established estimator, has historically been demonstrated to be incapable of inferring its connectivity [55]; furthermore, we showed that the discrete-time binary-valued TE estimator (with time-shifted surrogates) also could not successfully infer the connectivity here. Despite these challenges, our combination of continuous-time estimator and surrogate generation scheme was able to correctly infer the connectivity of this network. This provides an important validation of the efficacy of our presented approaches on a challenging example of representative biological spiking data.

This work represents a substantial step forward in the estimation of information flows from event-based data. To the best of the authors' knowledge it is the first consistent estimator of TE for event-based data. That is, it is the first estimator which is known to converge to the true value of the TE in the limit of infinite data, let alone to provide efficient estimates with finite data. As demonstrated in Results Sec. II A and Sec. II B, it has substantially more favourable bias and convergence properties as compared to the discrete-time estimator. The fact that this estimator uses raw inter-event intervals as its history representation allows it to efficiently capture relevant information from the past of the source, target and conditional processes. This allows it to simultaneously measure relationships that occur both on very fine time scales as well as those that occur over long intervals. This was highlighted in Results Sec. II D, where it was shown that our proposed approach is able to correctly infer the connectivity of the pyloric circuit of the crustacean stomatogastric ganglion. The inference of this circuit requires capturing subtle changes in spike timing. However, its bursty nature means that there are long intervals of no spiking activity. This is contrasted with the poor performance of the discrete-time estimator on this same task, as above. The use of the discrete-time estimator requires a hard trade-off in the choice of bin size: small bins will be able to capture relationships that occur over finer timescales but will result in an estimator that is blind to history effects existing over large intervals. Conversely, whilst larger bins might be capable of capturing these relationships occurring over larger intervals, the estimator will be blind to effects occurring with fine temporal precision. Moreover, to the best of our knowledge, this work showcases the first use of a surrogate generation scheme for statistical significance estimates which correctly handles strong source-conditional relationships for event-based data. This has a crucial practical benefit in that it greatly reduces the occurrence of false positives in cases where the history of a source is strongly correlated with the present of the target, but conditionally independent of it.

Inspection of some plots, notably Fig. 6 of Sec. II C, shows that, in some cases, the estimator can exhibit a small though non-negligible bias. Indeed, similar biases can readily be demonstrated with the standard KSG estimator for transfer entropy on continuous variables in discrete time, in similar circumstances where a strong source-target relationship is fully explained by a conditional. The reason for the small remaining bias is that, while the underlying assumption of the nearest-neighbour estimators is of a fixed probability density within the range of the k nearest neighbours, strong conditional relationships tend to result in correlations remaining between the variables within this range. For the common use-case of inferring non-zero information flows this small remaining bias will not be an issue, as the proposed method for surrogate generation is capable of producing null distributions with very similar bias properties. Furthermore, such bias can be removed from an estimate by subtracting the mean of the surrogate distribution (as shown via the effective transfer entropy [54] in Results). However, it is foreseeable that certain scenarios might benefit from an estimator with lower bias, without having to resort to generating surrogates. In such cases it will likely prove beneficial to explore the combination of various existing bias reduction techniques for k-NN estimators with the approach proposed here. These include performing a whitening transformation on the data [63], transforming each marginal distribution to uniform, or exploring alternative approaches to sharing radii across entropy terms (see Sec. IV A 5). The authors believe that the most probable cause of the observed bias in the case of strong pairwise correlations is that they cause the assumption of local uniformity (see Methods) to be violated. Gao et al. [64] have proposed a method for reducing the bias of k-NN information-theoretic estimators which specifically addresses cases where local uniformity does not apply. The application of this technique to our estimator holds promise for addressing this bias.

IV. METHODS

There are a variety of approaches available for estimating information theoretic quantities from continuous-valued data [65]; here we focus on methods for generating estimates $\hat{\dot{T}}_{Y \to X | \mathcal{Z}}$ of a true underlying (conditional) transfer entropy $\dot{T}_{Y \to X | \mathcal{Z}}$.

The nature of estimation means that our estimates $\hat{\dot{T}}_{Y \to X | \mathcal{Z}}$ may have a bias with respect to the true value $\dot{T}_{Y \to X | \mathcal{Z}}$, and a variance, as a function of some metric $n$ of the size of the data being provided to the estimator (we use the number of spikes, or events, in the target process). The bias is a measure of the degree to which the estimator systematically deviates from the true value of the quantity being estimated, for finite data size. It is expressed as $\mathrm{bias}(\hat{\dot{T}}_{Y \to X | \mathcal{Z}}) = \mathbb{E}[\hat{\dot{T}}_{Y \to X | \mathcal{Z}}] - \dot{T}_{Y \to X | \mathcal{Z}}$. The variance of an estimator is a measure of the degree to which it provides different estimates for distinct, finite samples from the same process. It is expressed as $\mathrm{variance}(\hat{\dot{T}}_{Y \to X | \mathcal{Z}}) = \mathbb{E}[\hat{\dot{T}}_{Y \to X | \mathcal{Z}}^2] - \mathbb{E}[\hat{\dot{T}}_{Y \to X | \mathcal{Z}}]^2$. Another important property is consistency, which refers to whether, in the limit of infinite data, the estimator converges to the true value. That is, an estimator is consistent if and only if $\lim_{n \to \infty} \hat{\dot{T}}_{Y \to X | \mathcal{Z}} = \dot{T}_{Y \to X | \mathcal{Z}}$.

The first half of this methods section is concerned with the derivation of a consistent estimator of TE which operates in continuous time. In order to be able to test for non-zero information flow given finite data, we require a surrogate generation scheme to use in conjunction with the estimator. Such a surrogate generation scheme should produce surrogate history samples that conform to the null hypothesis of zero information flow. The second half of this section will focus on a scheme for generating these surrogates.

The presented estimator and surrogate generation scheme have been implemented in a software package which will be made available on publication (see the Implementation subsection).

A. Continuous-time estimator for transfer entropy between spike trains

In the following subsections, we describe the algorithm for our estimator $\hat{\dot{T}}_{Y \to X | \mathcal{Z}}$ of the transfer entropy between spike trains. We first outline our choice of a k-NN type estimator, due to the desirable consistency and bias properties of this class of estimator. In order to use such an estimator type, we then describe a Bayesian inversion we apply to the definition of


transfer entropy for spiking processes, which allows us to operate on probability densities of histories of the processes, rather than directly on spike rates. This results in a sum of differential entropies to which k-NN estimator techniques can be applied. The evaluation of these entropy terms using k-NN estimators requires a method for sampling history embeddings, which is presented before attention is turned to a technique for combining the separate k-NN estimators in a manner that will reduce the bias of the final estimate.

1. Consideration of estimator type

Although there has been much recent progress on parametric information-theoretic estimators [66], such estimators will always inject modelling assumptions into the estimation process. Even in the case that large, general, parametric models are used (as in [67]), there are no known methods of determining whether such a model is capturing all dependencies present within the data.

In comparison, nonparametric estimators make fewer explicit assumptions regarding the probability distributions. Early approaches included the use of kernels for the estimation of the probability densities [68]; however, this has the disadvantage of operating at a fixed kernel 'resolution'. An improvement was achieved by the successful, widely applied class of nonparametric estimators making use of k-nearest-neighbour statistics [44, 69-71], which dynamically adjust their resolution given the local density of points. Crucially, there are consistency proofs [70, 72] for k-NN estimators, meaning that these methods are known to converge to the true values in the limit of infinite data size. These estimators operate by decomposing the information quantity of interest into a sum of differential entropy terms $H_*$. Each entropy term is subsequently estimated by estimating the probability densities $p(x_i)$ at all the points in the sample by finding the distances to the $k$th nearest neighbours of the points $x_i$. The average of the logarithms of these densities is found and is adjusted by bias correction terms. In some instances, most notably the KSG estimator for mutual information [44], many of the terms in each entropy estimate cancel and so each entropy is only implicitly estimated.

Such bias and consistency properties are highly desirable; given the efficacy of k-NN estimators, it would be advantageous to be able to make use of such techniques in order to estimate the transfer entropy of point processes in continuous time. However, the continuous-time formulations in (3) and (4) contain no entropy terms, being written in terms of rates as opposed to probability densities. Moreover, the estimators for each differential entropy term $H_*$ in a standard k-NN approach operate on sets of points in $\mathbb{R}^d$, and it is unclear how these may be applied in our case.

The following subsection is concerned with deriving an expression for continuous-time transfer entropy on spike trains as a sum of $H_*$ terms, in order to define a k-NN type estimator.

2. Formulating continuous-time TE as a sum of differential entropies

Consider two point processes $X$ and $Y$ represented by sets of real numbers, where each element represents the time of an event. That is, $X \in \mathbb{R}^{N_X}$ and $Y \in \mathbb{R}^{N_Y}$. Further, consider the set of extra conditioning point processes $\mathcal{Z} = \{Z_1, Z_2, \dots, Z_{n_Z}\}$, $Z_i \in \mathbb{R}^{N_{Z_i}}$. We can define a counting process $N_X(t)$ on $X$. $N_X(t)$ is a natural number representing the 'state' of the process. This state is incremented by one at the occurrence of an event. The instantaneous firing rate of the target is then $\lambda_X(t) = \lim_{\Delta t \to 0} p\left(N_X(t + \Delta t) - N_X(t) = 1\right)/\Delta t$. Using this expression, (4) can then be rewritten as

$$\dot{T}_{Y \to X | \mathcal{Z}} = \bar{\lambda}_X \, \mathbb{E}_{P_X}\!\left[\ln \lim_{\Delta t \to 0} \frac{p_U\!\left(N_X(x + \Delta t) - N_X(x) = 1 \mid \mathbf{x}_{<x}, \mathbf{y}_{<x}, \mathbf{z}_{<x}\right)}{p_U\!\left(N_X(x + \Delta t) - N_X(x) = 1 \mid \mathbf{x}_{<x}, \mathbf{z}_{<x}\right)}\right] \quad (5)$$

Here, $\bar{\lambda}_X$ is the average, unconditional firing rate of the target process ($\bar{\lambda}_X = N_X/\tau$), whilst $\mathbf{x}_{<x} \in X_{<x}$, $\mathbf{y}_{<x} \in Y_{<x}$ and $\mathbf{z}_{<x} = \{\mathbf{z}_{1,<x}, \mathbf{z}_{2,<x}, \dots, \mathbf{z}_{n_Z,<x}\} \in \mathcal{Z}_{<x}$ are the histories of the target, source and conditioning processes. The probability density $p_U$ is taken to represent the probability density at any arbitrary point in the target process, unconditional of events in any of the processes. By contrast, $p_X$ is taken to represent the probability density of observing a quantity at target events. The expectation $\mathbb{E}_{P_X}$ is taken over this distribution. That is, $\mathbb{E}_{P_X}[f(Y)] = \int_Y f(\mathbf{y})\, p_X(\mathbf{y})\, d\mathbf{y}$.

By applying Bayes' rule we can make a Bayesian inversion to arrive at:

$$\dot{T}_{Y \to X | \mathcal{Z}} = \bar{\lambda}_X \, \mathbb{E}_{P_X}\!\left[\ln \lim_{\Delta t \to 0} \frac{p_U\!\left(\mathbf{x}_{<x}, \mathbf{y}_{<x}, \mathbf{z}_{<x} \mid N_X(x + \Delta t) - N_X(x) = 1\right)\, p_U\!\left(\mathbf{x}_{<x}, \mathbf{z}_{<x}\right)}{p_U\!\left(\mathbf{x}_{<x}, \mathbf{z}_{<x} \mid N_X(x + \Delta t) - N_X(x) = 1\right)\, p_U\!\left(\mathbf{x}_{<x}, \mathbf{y}_{<x}, \mathbf{z}_{<x}\right)}\right] \quad (6)$$


Equation (6) can be written as

$$\dot{T}_{Y \to X | \mathcal{Z}} = \bar{\lambda}_X \, \mathbb{E}_{P_X}\!\left[\ln \frac{p_X\!\left(\mathbf{x}_{<x}, \mathbf{y}_{<x}, \mathbf{z}_{<x}\right)\, p_U\!\left(\mathbf{x}_{<x}, \mathbf{z}_{<x}\right)}{p_X\!\left(\mathbf{x}_{<x}, \mathbf{z}_{<x}\right)\, p_U\!\left(\mathbf{x}_{<x}, \mathbf{y}_{<x}, \mathbf{z}_{<x}\right)}\right] \quad (7)$$

Equation (7) can be written as a sum of differential entropy and cross entropy terms:

$$\dot{T}_{Y \to X | \mathcal{Z}} = \bar{\lambda}_X \left[ -H\!\left(X_{<x}, Y_{<x}, \mathcal{Z}_{<x}\right) + H\!\left(X_{<x}, \mathcal{Z}_{<x}\right) + H_{P_U}\!\left(X_{<x}, Y_{<x}, \mathcal{Z}_{<x}\right) - H_{P_U}\!\left(X_{<x}, \mathcal{Z}_{<x}\right) \right] \quad (8)$$

Here, $H$ refers to an entropy term and $H_{P_U}$ refers to a cross entropy term. More specifically,

$$H\!\left(X_{<x}, \mathcal{Z}_{<x}\right) = -\int p_X\!\left(\mathbf{x}_{<x}, \mathbf{z}_{<x}\right) \ln p_X\!\left(\mathbf{x}_{<x}, \mathbf{z}_{<x}\right)\, d\mathbf{x}_{<x}\, d\mathbf{z}_{<x}$$

and

$$H_{P_U}\!\left(X_{<x}, \mathcal{Z}_{<x}\right) = -\int p_X\!\left(\mathbf{x}_{<x}, \mathbf{z}_{<x}\right) \ln p_U\!\left(\mathbf{x}_{<x}, \mathbf{z}_{<x}\right)\, d\mathbf{x}_{<x}\, d\mathbf{z}_{<x}$$

It is worth noting in passing that (7) can also be written as a difference of Kullback-Leibler divergences:

$$\dot{T}_{Y \to X | \mathcal{Z}} = \bar{\lambda}_X \left[ D_{\mathrm{KL}}\!\left(P_X(X_{<x}, Y_{<x}, \mathcal{Z}_{<x}) \,\|\, P_U(X_{<x}, Y_{<x}, \mathcal{Z}_{<x})\right) - D_{\mathrm{KL}}\!\left(P_X(X_{<x}, \mathcal{Z}_{<x}) \,\|\, P_U(X_{<x}, \mathcal{Z}_{<x})\right) \right] \quad (9)$$

The expressions in equations (8) and (9) represent a general framework for estimating the TE between point processes in continuous time. Any estimator of differential entropy $\hat{H}$ which can be adapted to the estimation of cross entropies can be plugged into (8) in order to estimate the TE. Similarly, any estimator of the KL divergence can be plugged into (9).

3. Constructing k-NN estimators for differential entropies and cross entropies

Following similar steps to the derivations in [44, 63, 72], assume that we have an (unknown) probability distribution $\mu(x)$ for $x \in \mathbb{R}^d$. Note that here $X$ is a general random variable (not necessarily a point process). We also have a set $X$ of $N_X$ points drawn from $\mu$.


In order to estimate the differential entropy $H$ we need to construct estimates of the form

$$\hat{H}(X) = -\frac{1}{N_X} \sum_{i=1}^{N_X} \widehat{\ln \mu}(x_i) \quad (10)$$

where $\widehat{\ln \mu}(x_i)$ is an estimate of the logarithm of the true density. Denote by $\epsilon(k, x_i, X)$ the distance to the $k$th nearest neighbour of $x_i$ in the set $X$ under some norm $L$. Further, let $p_i^{\mu}$ be the probability mass of the $\epsilon$-ball surrounding $x_i$. If we make the assumption that $\mu(x_i)$ is constant within the $\epsilon$-ball, we have $p_i^{\mu} = \frac{k}{N_X - 1} = c_{d,L}\, \epsilon(k, x_i, X)^d\, \mu(x_i)$, where $c_{d,L}$ is the volume of the $d$-dimensional unit ball under the norm $L$. Using this relationship, we can construct a simple estimator of the differential entropy:

$$\hat{H}(X) = -\frac{1}{N_X} \sum_{i=1}^{N_X} \ln \frac{k}{(N_X - 1)\, c_{d,L}\, \epsilon(k, x_i, X)^d} \quad (11)$$

We then add the bias-correction term $\ln k - \psi(k)$, where $\psi(x) = \Gamma^{-1}(x)\, d\Gamma(x)/dx$ is the digamma function and $\Gamma(x)$ the gamma function. This yields $\hat{H}_{\mathrm{KL}}$, the Kozachenko-Leonenko [69] estimator of differential entropy:

$$\hat{H}_{\mathrm{KL}}(X) = -\psi(k) + \ln(N_X - 1) + \ln c_{d,L} + \frac{d}{N_X} \sum_{i=1}^{N_X} \ln \epsilon(k, x_i, X) \quad (12)$$

This estimator has been shown to be consistent [69, 73].
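As an illustration of (12), the following sketch (our own Python; the paper's implementation is in Julia) realises the Kozachenko-Leonenko estimator with a brute-force neighbour search under the maximum norm, for which the unit-ball volume is $c_{d,L_\infty} = 2^d$:

```python
import math
import random

def digamma(k):
    # psi(k) for a positive integer k: -gamma + sum_{m=1}^{k-1} 1/m.
    return -0.5772156649015329 + sum(1.0 / m for m in range(1, k))

def kl_entropy(points, k=4):
    # Kozachenko-Leonenko differential entropy estimator, eq. (12), under the
    # max norm. eps is the distance from each point to its kth nearest
    # neighbour within the same set (the point itself is excluded).
    n, d = len(points), len(points[0])
    log_eps = 0.0
    for i, x in enumerate(points):
        dists = sorted(
            max(abs(a - b) for a, b in zip(x, y))
            for j, y in enumerate(points) if j != i
        )
        log_eps += math.log(dists[k - 1])
    return -digamma(k) + math.log(n - 1) + d * math.log(2) + (d / n) * log_eps

rng = random.Random(0)
sample = [(rng.random(),) for _ in range(1000)]
# The differential entropy of U(0, 1) is 0 nats; the estimate should be close.
print(kl_entropy(sample))
```

A practical implementation would replace the $O(N^2)$ brute-force search with a k-d tree, but the estimator itself is exactly the expression in (12).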

Assume that we now have two (unknown) probability distributions $\mu(x)$ and $\beta(x)$. We have a set $X$ of $N_X$ points drawn from $\mu$ and a set $Y$ of $N_Y$ points drawn from $\beta$. Using similar arguments to above, we denote by $\epsilon(k, x_i, Y)$ the distance from the $i$th element of $X$ to its $k$th nearest neighbour in $Y$. We then make the assumption that $\beta(x_i)$ is constant within the $\epsilon$-ball, and we have $p_i^{\beta} = \frac{k}{N_Y} = c_{d,L}\, \epsilon(k, x_i, Y)^d\, \beta(x_i)$. We can then construct a naive estimator of the cross entropy

$$\hat{H}_{\beta}(X) = -\frac{1}{N_X} \sum_{i=1}^{N_X} \ln \frac{k}{N_Y\, c_{d,L}\, \epsilon(k, x_i, Y)^d} \quad (13)$$

Again, we add the bias-correction term $\ln k - \psi(k)$ to arrive at an estimator of the cross entropy:

$$\hat{H}_{\beta,\mathrm{KL}}(X) = -\psi(k) + \ln N_Y + \ln c_{d,L} + \frac{d}{N_X} \sum_{i=1}^{N_X} \ln \epsilon(k, x_i, Y) \quad (14)$$

This estimator has been shown to be consistent [73].

Attention should be brought to the fundamental difference between estimating entropies and cross entropies using k-NN estimators. An entropy estimator takes a set $X$ and, for each $x_i \in X$, performs a nearest neighbour search in the same set $X$. An estimator of cross entropy takes two sets, $X$ and $Y$, and, for each $x_i \in X$, performs a nearest neighbour search in the other set $Y$.

We will be interested in applying these estimators to the entropy and cross entropy terms in (8). For instance, we could use $\hat{H}_{\beta,\mathrm{KL}}(X)$ to estimate $H_{P_U}(X_{<x}, \mathcal{Z}_{<x})$, noting that $\mu = p_X(\mathbf{x}_{<x}, \mathbf{z}_{<x})$ and $\beta = p_U(\mathbf{x}_{<x}, \mathbf{z}_{<x})$. This will be discussed in Sec. IV A 5, after we first consider how to represent the history embeddings $\mathbf{x}_{<x}$, $\mathbf{y}_{<x}$ and $\mathbf{z}_{<x}$, as well as sample them from their distributions.

4. Selection and Representation of Sample Histories for Entropy Estimation

FIG. 9: Examples of history embeddings. (a) shows an example of a joint embedding constructed at a target event ($t_i = x_i$); (b) shows an embedding constructed at a non-event point.

Inspection of (7) and (8) informs us that we will need to be able to estimate four distinct differential entropy terms and, implicitly, the associated probability densities:

1. The probability density of the target, source and conditioning histories at target events, $p_X(\mathbf{x}_{<x}, \mathbf{y}_{<x}, \mathbf{z}_{<x})$.


Algorithm 1: the CT TE estimator.
Input: /* The joint history embeddings at the target events and at the sampled points */ $J_{<x} = \{\mathbf{j}_{<x_i}\}^{N_X}$, $J_{<U} = \{\mathbf{j}_{<U_i}\}^{N_U}$ […]
[…]
15: $k_{C,i} \leftarrow \mathrm{findNumberOfNeighboursInRadius}(r_{\mathrm{conditioning},i}, \mathbf{c}_{\dots})$ […]
[…]
19: $\hat{\dot{T}}_{Y \to X | \mathcal{Z}, \mathrm{CT}} \leftarrow \frac{1}{\tau}\, \hat{\dot{T}}_{Y \to X | \mathcal{Z}, \mathrm{CT}}$


2. The probability density of the target and conditioning histories at target events, $p_X(\mathbf{x}_{<x}, \mathbf{z}_{<x})$.

3. The probability density of the target, source and conditioning histories independent of target activity, $p_U(\mathbf{x}_{<x}, \mathbf{y}_{<x}, \mathbf{z}_{<x})$.

4. The probability density of the target and conditioning histories independent of target activity, $p_U(\mathbf{x}_{<x}, \mathbf{z}_{<x})$.

Estimation of these probability densities will require an associated set of samples for a k-NN estimator to operate on. These samples for $\mathbf{x}_{<x}$ […] The lengths of the embeddings (in terms of how many previous spikes are referred to) must be restricted in order to avoid the difficulties associated with the estimation of probability densities in high dimensions. The lengths of the history embeddings along each process are specified by the parameters $l_X, l_Y, l_{Z_1}, \dots, l_{Z_{n_Z}}$.

We label the sets of samples as $J_{<x} = \{\mathbf{j}_{<x_i}\}^{N_X}$, $C_{<x} = \{\mathbf{c}_{<x_i}\}^{N_X}$, $J_{<U} = \{\mathbf{j}_{<U_i}\}^{N_U}$, […] (i.e. without the source). For the two sets of joint embeddings $J_{<*}$ (where $* \in \{x, U\}$), each $\mathbf{j}_{<*_i} \in J_{<*}$ is made up of target, source and conditioning components. That is, $\mathbf{j}_{<*_i} = \{\mathbf{x}_{<*_i}, \mathbf{y}_{<*_i}, \mathbf{z}_{<*_i}\}$, where $\mathbf{z}_{<*_i} = \{\mathbf{z}_{1,<*_i}, \mathbf{z}_{2,<*_i}, \dots, \mathbf{z}_{n_Z,<*_i}\}$. Similarly, for the two sets of conditioning embeddings $C_{<*}$ (where $* \in \{x, U\}$), each $\mathbf{c}_{<*_i} \in C_{<*}$ is made up of target and conditioning components. That is, $\mathbf{c}_{<*_i} = \{\mathbf{x}_{<*_i}, \mathbf{z}_{<*_i}\}$.

Each set of embeddings $J_{<*}$ is constructed from a set of observation points $T \in \mathbb{R}^{N_T}$. Each individual embedding $\mathbf{j}_{<*_i}$ is constructed at one such observation $t_i$. We denote by $\mathrm{pred}(t_i, P)$ the index of the most recent event in the process $P$ to occur before the observation point $t_i$.


The values of $\mathbf{x}_{<*_i} = \{x^1_{<*_i}, x^2_{<*_i}, \dots, x^{l_X}_{<*_i}\} \in X_{<*}$ are set as follows:

$$x^k_{<*_i} := \begin{cases} t_i - x_{\mathrm{pred}(t_i, X)} & k = 1 \\ x_{\mathrm{pred}(t_i, X) - k + 2} - x_{\mathrm{pred}(t_i, X) - k + 1} & k \neq 1 \end{cases} \quad (15)$$

Here, the $t_i \in T$ are the raw observation points and the $x_j \in X$ are the raw event times in the process $X$. The first element of $\mathbf{x}_{<*_i}$ is then the interval between the observation time and the most recent target event time $x_{\mathrm{pred}(t_i, X)}$. The second element of $\mathbf{x}_{<*_i}$ is the inter-event interval between this most recent event time and the next most recent event time, and so forth.

The values of $\mathbf{y}_{<*_i} = \{y^1_{<*_i}, y^2_{<*_i}, \dots, y^{l_Y}_{<*_i}\} \in Y_{<*}$ and $\mathbf{z}_{j,<*_i} = \{z^1_{j,<*_i}, z^2_{j,<*_i}, \dots, z^{l_{Z_j}}_{j,<*_i}\} \in Z_{j,<*}$ are set in the same manner. The set of samples $J_{<x} \in \mathbb{R}^{N_X \times (l_X + l_Y + \sum_j l_{Z_j})}$ is constructed […] at the event times $x$ of the target process $X$. As such, $J_{<x}$ is composed of $X_{<x}$, $Y_{<x}$ and $\mathcal{Z}_{<x}$. […]

The set of samples $J_{<U}$ is constructed at $N_U$ observation points placed independently of events in the target process. These $N_U$ sample points between the first and last events of the target process $X$ can either be placed randomly or at fixed intervals. In the experiments presented in this paper they were placed randomly. Importantly, note that $N_U$ is not necessarily equal to $N_X$, with their ratio $N_U/N_X$ a parameter for the estimator which is investigated in our Results. We also have that $J_{<U}$ is composed of $X_{<U}$, $Y_{<U}$ and $\mathcal{Z}_{<U}$. Figure 9 shows diagrammatic examples of these embeddings.

For $J_{<x}$, the first target interval measures the time from the spike at $t_i = x_i$ back to the previous spike, which is not the case for $J_{<U}$. The conditioning sets $C_{<x}$ and $C_{<U}$ are constructed in the same manner as $J_{<x}$ and $J_{<U}$; however, the source embeddings $\mathbf{y}_{<*}$ are discarded. We will also have that $C_{<x}$ is composed of $X_{<x}$ and $\mathcal{Z}_{<x}$ […]
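A minimal sketch of the embedding construction in (15), in illustrative Python (function name ours); bisection over the sorted event times plays the role of $\mathrm{pred}(t_i, X)$:

```python
import bisect

def history_embedding(t, events, l):
    # Eq. (15): the first element is the interval from the observation time t
    # back to the most recent event before t; elements 2..l are the preceding
    # inter-event intervals. `events` must be sorted in ascending order.
    pred = bisect.bisect_left(events, t) - 1  # index of most recent event before t
    if pred < l - 1:
        raise ValueError("need at least l events before t")
    embedding = [t - events[pred]]
    for k in range(2, l + 1):
        embedding.append(events[pred - k + 2] - events[pred - k + 1])
    return embedding

spikes = [0.0, 0.7, 1.1, 2.0, 3.5]
print(history_embedding(2.2, spikes, 3))  # approximately [0.2, 0.9, 0.4]
print(history_embedding(2.0, spikes, 2))  # at an event: reaches back to the spike at 1.1
```

Note that when the observation point coincides with an event time, as for the samples in $J_{<x}$, the first interval reaches back to the previous spike, consistent with the construction described above.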


5. Combining $\hat{H}_*$ estimators for $\hat{\dot{T}}_{Y \to X | \mathcal{Z}}$

With sets of samples and their embedded representation determined as per the previous subsection, we are now ready to estimate each of the four $\hat{H}_*$ terms in (8). Here we consider how to combine the entropy and cross entropy estimators of these terms ((12) and (14)) into a single estimator.

We could simply estimate each $H_*$ term in (8) using $\hat{H}_{\mathrm{KL}}$ as specified in (12) and $\hat{H}_{p_U,\mathrm{KL}}$ as specified in (14), with the same number $k$ of nearest neighbours in each of the four estimators and at each sample in the set for each estimator. Following the convention introduced in [72], we shall refer to this as a 4KL estimator of transfer entropy (the '4' refers to the use of 4 k-NN searches and the 'KL' to Kozachenko-Leonenko):

$$\hat{\dot{T}}_{Y \to X | \mathcal{Z}, 4\mathrm{KL}} = \frac{\bar{\lambda}_X}{N_X} \sum_{i=1}^{N_X} \left[ l_J \ln \frac{\epsilon(k, \mathbf{j}_{<x_i}, J_{<U})}{\epsilon(k, \mathbf{j}_{<x_i}, J_{<x})} + l_C \ln \frac{\epsilon(k, \mathbf{c}_{<x_i}, C_{<x})}{\epsilon(k, \mathbf{c}_{<x_i}, C_{<U})} \right] \quad (16)$$

Here, $l_J = l_X + l_Y + \sum_{j=1}^{n_Z} l_{Z_j}$ is the dimension of the joint samples and $l_C = l_X + \sum_{j=1}^{n_Z} l_{Z_j}$

is the dimension of the conditional-only samples. Note that the $\ln(N_X - 1) - \psi(k)$ terms cancel between the $J_{<x}$ and $C_{<x}$ terms (as do the $\ln(N_U) - \psi(k)$ terms between the $J_{<U}$ and $C_{<U}$ terms). All searches are centred on the samples constructed at target events (the cross-entropies which evaluate probability densities using $J_{<U}$ and $C_{<U}$ still evaluate those densities on the samples $\mathbf{j}_{<x_i} \in J_{<x}$ and $\mathbf{c}_{<x_i} \in C_{<x}$, following the definitions in (8)).
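The structure of (16) can be sketched as follows (illustrative Python with brute-force max-norm searches, names ours; the paper's package is in Julia). With the source component independent of the target both at events and at sampled points, the two KL divergence terms should approximately cancel and the estimate should sit near zero:

```python
import math
import random

def kth_nn_dist(point, pts, k, exclude_self=False):
    # Distance from `point` to its kth nearest neighbour in `pts` (max norm).
    # When searching a point's own set, the zero self-distance is skipped.
    dists = sorted(max(abs(a - b) for a, b in zip(point, p)) for p in pts)
    return dists[k] if exclude_self else dists[k - 1]

def te_4kl(J_x, J_U, C_x, C_U, mean_rate, k=4):
    # Eq. (16): the 4KL estimator. The psi(k), ln N and ball-volume terms all
    # cancel, leaving only log-ratios of kth-nearest-neighbour distances,
    # weighted by the joint (l_J) and conditional-only (l_C) dimensions.
    l_J, l_C = len(J_x[0]), len(C_x[0])
    total = 0.0
    for j, c in zip(J_x, C_x):
        total += l_J * (math.log(kth_nn_dist(j, J_U, k))
                        - math.log(kth_nn_dist(j, J_x, k, exclude_self=True)))
        total += l_C * (math.log(kth_nn_dist(c, C_x, k, exclude_self=True))
                        - math.log(kth_nn_dist(c, C_U, k)))
    return mean_rate * total / len(J_x)

rng = random.Random(0)
J_x = [(rng.random(), rng.random()) for _ in range(400)]  # (target, source) histories at events
J_U = [(rng.random(), rng.random()) for _ in range(400)]  # ... at sampled points
C_x = [(x,) for x, _ in J_x]                              # source component discarded
C_U = [(x,) for x, _ in J_U]
print(te_4kl(J_x, J_U, C_x, C_U, mean_rate=1.0))  # near zero: source independent of target
```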

It is, however, not only possible to use a different $k$ at every sample, but desirable when the $k$ are chosen judiciously (as detailed below). We shall refer to this as the generalised k-NN estimator:


$$\hat{\dot{T}}_{Y \to X | \mathcal{Z}, \mathrm{generalised}} = \frac{\bar{\lambda}_X}{N_X} \sum_{i=1}^{N_X} \left[ \psi\!\left(k_{J_{<x},i}\right) - \psi\!\left(k_{J_{<U},i}\right) - \psi\!\left(k_{C_{<x},i}\right) + \psi\!\left(k_{C_{<U},i}\right) + l_J \ln \frac{\epsilon_{J_{<U},i}}{\epsilon_{J_{<x},i}} + l_C \ln \frac{\epsilon_{C_{<x},i}}{\epsilon_{C_{<U},i}} \right] \quad (17)$$

Here $k_{A,i}$ is the number of neighbours used for the $i$th sample in set $A$ for the corresponding entropy estimator for that set of samples. By theorems 3 and 4 of [63], this estimator (and, by implication, the 4KL estimator) is asymptotically consistent. Application of the generalised estimator requires a scheme for choosing the $k_{A,i}$ at each sample. Work on constructing k-NN estimators for mutual information [44] and KL divergence [63] has found advantages in having certain $H_*$ terms share the same or similar radii, e.g. resulting in lower overall bias due to components of biases of individual $H_*$ terms cancelling. Given that we have four $H_*$ terms, there are a number of approaches we could take to sharing radii.

Our algorithm, which we refer to as the CT estimator of TE, $\hat{\dot{T}}_{Y \to X | \mathcal{Z}, \mathrm{CT}}$, is specified in detail in Algorithm 1. It applies the approach of Wang et al. (referred to as the 'bias improved' estimator in [63]) to each of the Kullback-Leibler divergence terms separately. In broad strokes, whereas (16) uses the same $k$ for each nearest-neighbour search, this estimator uses the same radius for each of the two nearest-neighbour searches relating to a given KL divergence term. In practice, this requires first performing searches with a fixed $k$ in order to determine the radius to use. As such, we start with a fixed parameter $k_{\mathrm{global}}$, which will be the minimum number of nearest neighbours in any search space. For each joint sample at a target event, that is, each $\mathbf{j}_{<x_i} \in J_{<x}$, we find the distance to its $k_{\mathrm{global}}$th nearest neighbour within the same set $J_{<x}$ (line 3 of Algorithm 1). We perform a similar $k_{\mathrm{global}}$-NN search for $\mathbf{j}_{<x_i}$ in the set of samples constructed independent of target activity, $J_{<U}$ (line 4). We define a search radius as the maximum of these two distances (line 5). We then find the number of points in $J_{<x}$ falling within this search radius, and denote this number $k_{J_{<x},i}$ (line 6). We also find twice the distance to the $k_{J_{<x},i}$th nearest neighbour of $\mathbf{j}_{<x_i}$ in $J_{<x}$, $\epsilon_{J_{<x},i}$ (line 7), along with the corresponding count $k_{J_{<U},i}$ and distance $\epsilon_{J_{<U},i}$ for the set $J_{<U}$ (lines 8 and 9).

In the majority of cases, only one of these two $\epsilon$ terms will be exactly twice the search radius, and its associated $k_{A,i}$ will equal $k_{\mathrm{global}}$. In such cases, the other $\epsilon$ will be less than twice the search radius and its associated $k_{A,i}$ will be greater than or equal to $k_{\mathrm{global}}$.

The same set of steps is followed for each conditioning history embedding that was constructed at an event in the target process, that is, each $\mathbf{c}_{<x_i} \in C_{<x}$, with the searches performed in the sets $C_{<x}$ and $C_{<U}$ (lines 10 to 16). The values that we have found for $k_{J_{<x},i}$, $k_{J_{<U},i}$, $k_{C_{<x},i}$ and $k_{C_{<U},i}$ ($\epsilon_{J_{<x},i}$, $\epsilon_{J_{<U},i}$, $\epsilon_{C_{<x},i}$ and $\epsilon_{C_{<U},i}$) are then combined as per (17) to produce the final estimate (lines 17 and 19 of Algorithm 1).

6. Handling Dynamic Correlations

The derivation of the k-NN estimators for entropy and cross entropy given in section IV A 3 assumes that the points are independent [44]. However, nearby interspike intervals might be autocorrelated (e.g. during bursts), and indeed our method for constructing history embeddings (see section IV A 4) will incorporate the same interspike intervals at different positions in consecutive samples. This contradicts the assumption of independence. In order to satisfy the assumption of independence when counting neighbours, conventional neighbour counting estimators can be made to ignore matches within a dynamic or serial correlation exclusion window (a.k.a. Theiler windows [74, 75]).

For our estimator, we maintain a record of the start and end times of each history embedding, providing us with an exclusion window. The start time is recorded as the time of the first event that formed part of an interval which was included in the sample. This event could come from the embedding of any of the processes from which the sample was constructed. The end of the window is the observation point from which the sample is constructed. When performing nearest neighbour and radius searches (lines 3, 4, 6, 7, 8, 9, 10, 11, 13, 14, 15 and 16 of Algorithm 1 and line 6 of Algorithm 2), any sample whose exclusion window overlaps with the exclusion window of the sample around which the search is taking place is ignored. Subtleties concerning dynamic correlation exclusion for surrogate calculations are considered in the next section.



B. Local Permutation Method for Surrogate Generation

A common use of this estimator would be to ascertain whether there is a non-zero conditional information flow between two components of a system. When using TE for directed functional network inference, this is the criterion we use to determine the presence or absence of a connection. Given that we are estimating the TE from finite samples, we require a statistical test in order to determine the significance of the measured TE value. Unfortunately, analytic results do not exist for the sampling distribution of k-NN estimators of information theoretic quantities [8]. This necessitates a scheme for generating surrogate samples from which the null distribution can be empirically constructed.

It is instructive to first consider the more general case of testing for non-zero mutual information. As the mutual information between $X$ and $Y$ is zero if and only if $X$ and $Y$ are independent, testing for non-zero mutual information is a test for statistical dependence. As such, we are testing against the null hypothesis that $X$ and $Y$ are independent ($X \perp\!\!\!\perp Y$) or, equivalently, that the joint probability distribution of $X$ and $Y$ factorises as $p(X, Y) = p(X)p(Y)$. It is straightforward to construct surrogate pairs $(\check{x}, \check{y})$ that conform to this null hypothesis. We start with the original pairs $(x, y)$ and resample the $y$ values across pairs, commonly by shuffling (in conjunction with handling dynamic correlations, as per the later subsection). This shuffling process will maintain the marginal distributions $p(X)$


Algorithm 2: Algorithm for the local permutation method for surrogate generation.
Input: /* The joint history embeddings at the target events */ $J_{<x} = \{\mathbf{j}_{<x_i}\}^{N_X} = \{\{\mathbf{x}_{<x_i}, \mathbf{y}_{<x_i}, \mathbf{z}_{<x_i}\}\}^{N_X}$ […]
Output: $\check{J}_{<x}$

/* Set to keep a record of the used indices in the independently sampled embeddings. */
1: $U \leftarrow \phi$
2: $\check{J}_{<x} \leftarrow \phi$
[…]
4: $I \leftarrow \mathrm{shuffle}(I)$
5: for $i \in I$ do
  /* Search for the nearest neighbours in the set of embeddings at sampled points, ignoring the source components. */
6:   $N \leftarrow \mathrm{findIndicesOfNearestNeighbours}\!\left(k_{\mathrm{perm}}, \{\mathbf{x}_{<x_i}, \mathbf{z}_{<x_i}\}, \{\mathbf{x}_{<U}, \mathbf{z}_{<U}\}\right)$
7:   $E \leftarrow N \setminus (N \cap U)$
8:   if $\|E\| > 0$ then
9:     $h \leftarrow \mathrm{chooseRandomElement}(E)$
10:  end
11:  else
12:    $h \leftarrow \mathrm{chooseRandomElement}(N)$
13:  end
  /* Append the set of surrogate samples with an embedding composed of the original target and conditioning components (at index i), but with the source component swapped for that at index h of the independently sampled embeddings. */
14:  $\check{J}_{<x} \leftarrow \check{J}_{<x} \cup \{\{\mathbf{x}_{<x_i}, \mathbf{y}_{<U_h}, \mathbf{z}_{<x_i}\}\}$
[…]


and $p(Y)$, and the same number of samples, but will destroy any relationship between $X$ and $Y$, yielding the required factorisation for the null hypothesis. One shuffling process produces one set of surrogate samples; estimates of mutual information on populations of such surrogate sample sets yield a null distribution for the mutual information.

As transfer entropy is a conditional mutual information ($I(X_t ; Y_{<t} \mid X_{<t}, \mathcal{Z}_{<t})$), we are testing against the null hypothesis that the target is conditionally independent of the history of the source ($X_t \perp\!\!\!\perp Y_{<t} \mid X_{<t}, \mathcal{Z}_{<t}$). […] Surrogates have traditionally been generated by shuffling the source history embeddings or by shifting the source time series (see discussions in e.g. [4, 76]).

These approaches lead to various problems. These problems stem from the fact that they destroy any relationship between the source history ($Y_{<t}$) and the histories of the target and conditioning processes ($X_{<t}$ and $\mathcal{Z}_{<t}$), such that the surrogate joint distribution factorises as $p(X_t, Y_{<t} \mid X_{<t}, \mathcal{Z}_{<t}) = p(X_t \mid X_{<t}, \mathcal{Z}_{<t})\, p(Y_{<t})$ [8]. The issue can be illustrated by considering a system whereby the conditioning processes $\mathcal{Z}_{<t}$ drive the activity of the target $X_t$ as well as the history of the source $Y_{<t}$, leaving $Y_{<t}$ correlated with $X_t$, but conditionally independent. This is the classic case of a "spurious correlation" between $Y_{<t}$ and $X_t$. If, in such a case, we use time shifted or shuffled source surrogates to test for the significance of the TE, we will be comparing the TE measured when $X_t$ and $Y_{<t}$ are dependent (albeit potentially conditionally independent) with surrogates where they are independent. This subtle difference in the formulation of the null may result in a high false positive rate in a test for conditional independence. An analysis of such a system is presented in section II C. Alternately, if we can generate surrogates where the joint probability distribution factorises correctly and the relationship between $Y_{<t}$ and the histories $X_{<t}$ and $\mathcal{Z}_{<t}$ is maintained, then $Y_{<t}$ and $X_t$ remain correlated but conditionally independent. We would anticipate conditional independence tests using surrogates generated under this properly formed null to have a false positive rate closer to what we expect.

Generating surrogates for testing for conditional dependence is relatively straightforward in the case of discrete-valued conditioning variables. If we are testing for dependence between


$X$ and $Y$ given $Z$, then, for each unique value of $Z$, we can shuffle the associated values of $Y$. This maintains the distributions $p(X \mid Z)$ and $p(Y \mid Z)$ whilst, for any given value of $Z$, the relationship between the associated $X$ and $Y$ values is destroyed.

The problem is more challenging when $Z$ can take on continuous values. However, recent work by Runge [8] demonstrated the efficacy of a local permutation technique. In this approach, to generate one surrogate sample set, we separately generate a surrogate sample $(x, \check{y}, z)$ for each sample $(x, y, z)$ in the original set. We find the $k_{\mathrm{perm}}$ nearest neighbours of $z$ in $Z$: one of these neighbours, $z'$, is chosen at random, and $y$ is swapped with the associated $y'$ to produce the surrogate sample $(x, y', z)$. In order to reduce the occurrence of duplicate $y$ values, a set $U$ of used indices is maintained. After finding the $k_{\mathrm{perm}}$ nearest neighbours, those that have already been used are removed from the candidate set. If this results in an empty candidate set, one of the original $k_{\mathrm{perm}}$ candidates is chosen at random. Otherwise, this choice is made from the reduced set. As before, a surrogate conditional mutual information is estimated for every surrogate sample set, and a population of such surrogate estimates provides the null distribution.

This approach needs to be adapted slightly in order to be applied to our particular case, because we have implicitly removed the target variable (whether or not the target is spiking) from our samples via the novel Bayesian inversion. We can rewrite (7) as:

$$\dot{T}_{Y \to X | \mathcal{Z}} = \bar{\lambda}_X \, \mathbb{E}_{P_X}\!\left[\ln \frac{p_X\!\left(\mathbf{x}_{<x}, \mathbf{y}_{<x}, \mathbf{z}_{<x}\right)}{p_X\!\left(\mathbf{x}_{<x}, \mathbf{z}_{<x}\right)\, p_U\!\left(\mathbf{y}_{<x} \mid \mathbf{x}_{<x}, \mathbf{z}_{<x}\right)}\right] \quad (18)$$

This makes it clear that we are testing whether the following factorisation holds: $p_X(\mathbf{x}_{<x}, \mathbf{y}_{<x}, \mathbf{z}_{<x}) = p_X(\mathbf{x}_{<x}, \mathbf{z}_{<x})\, p_U(\mathbf{y}_{<x} \mid \mathbf{x}_{<x}, \mathbf{z}_{<x})$ (recall the difference between probability densities at target events, $p_X$, and at arbitrary points, $p_U$). We thus require surrogate samples $\check{J}_{<x}$ constructed in a way that maintains the relationship between the source histories and the histories of

884 the target and conditioning processes, but decouples (only) the source histories from target

885 events. (As above, simply shuffling the source histories across J

886 events does not properly maintain the relationship of the source to the target and condition-

887 ing histories). The procedure to achieve this is detailed in algorithm 2. We start with the

NX 888 samples at target events J = x , y , z and resample the source components

890 Usurr of NU,surr points sampled independently of events in the target. This set is constructed

891 in the same manner as J

892 points (N = N ) at which the embeddings are constructed. For each original sample U,surr 6 U kperm 893 j from J then, we then find the nearest neighbours x , z of x , z

895 neighbours (line 9 or 12), and add a sample x , y , z to J (line 14). The {

897 keep a record of used indices in order to reduce the incidence of duplicate y

898 For each redrawn surrogate sample set J

899 is estimated (utilising the same J

900 for the original TE estimate) following the algorithm outlined earlier; the population of such

901 surrogate estimates provides the null distribution as before.
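The resampling step of this surrogate generation might be sketched as follows (illustrative Python; we assume history embeddings stored as flat tuples, use a brute-force Euclidean search in place of a KD-tree, and omit the dynamic-exclusion bookkeeping; names are ours):

```python
import math
import random

def resample_source_histories(J_X, U_surr, k_perm, seed=0):
    """Build one surrogate sample set: for each target-event sample (x, y, z)
    in J_X, keep the target and conditioning histories (x, z) but swap in the
    source history y of one of the k_perm nearest (x, z) neighbours among
    points sampled independently of target events (U_surr)."""
    rng = random.Random(seed)

    def dist(a, b):
        # Euclidean distance between the concatenated (x, z) embeddings
        return math.dist(a[0] + a[2], b[0] + b[2])

    used = set()  # indices already swapped in, to reduce duplicates
    J_X_surr = []
    for x, y, z in J_X:
        order = sorted(range(len(U_surr)), key=lambda i: dist((x, y, z), U_surr[i]))
        candidates = order[:k_perm]
        unused = [i for i in candidates if i not in used]
        i = rng.choice(unused if unused else candidates)
        used.add(i)
        J_X_surr.append((x, U_surr[i][1], z))
    return J_X_surr
```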

Finally, note an additional subtlety for dynamic correlation exclusion in the surrogate calculations. Samples in the surrogate calculations will have had their history components originating from two different time windows: one from the construction of the original sample, the other from the sample with which the source component was swapped. A record is kept of both these exclusion windows and, during neighbour searches, points are excluded if their exclusion windows intersect either of the exclusion windows of the surrogate history embedding.
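The window-intersection test itself is simple; a sketch with hypothetical helper names:

```python
def windows_overlap(w1, w2):
    """True if the closed intervals w1 = (start, end) and w2 intersect."""
    return w1[0] <= w2[1] and w2[0] <= w1[1]

def excluded(query_windows, candidate_windows):
    """Exclude a candidate neighbour if any of its exclusion windows
    intersects any of the query's (surrogate samples carry two windows)."""
    return any(windows_overlap(q, c) for q in query_windows for c in candidate_windows)
```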

C. Implementation

Algorithms 1 and 2, as well as all experiments, were implemented in the Julia language. All code will be made available upon publication. Implementations of k-NN information-theoretic estimators have commonly made use of KD-trees to speed up the nearest-neighbour searches [76]. A popular Julia nearest neighbours library (NearestNeighbors.jl [77]) was modified such that checks for dynamic exclusion windows (see section IV A 6) were performed during the KD-tree searches when considering adding neighbours to the candidate set. The implementation of the algorithms is available at the following repository: github.com/dpshorten/CoTETE.jl. Scripts to run the experiments in the paper can be found at github.com/dpshorten/CoTETE_experiments.


V. ACKNOWLEDGEMENTS

We would like to thank Mike Li for performing preliminary benchmarking of the performance of the discrete-time estimator on point processes.

VI. AUTHOR CONTRIBUTIONS

Conceptualization: David P. Shorten, Richard E. Spinney and Joseph T. Lizier
Funding acquisition: Joseph T. Lizier
Methodology: David P. Shorten, Richard E. Spinney and Joseph T. Lizier
Software: David P. Shorten
Supervision: Richard E. Spinney and Joseph T. Lizier
Writing - original draft: David P. Shorten
Writing - review & editing: David P. Shorten, Richard E. Spinney and Joseph T. Lizier

VII. FUNDING STATEMENT

JL was supported through the Australian Research Council DECRA Fellowship grant DE160100630 and through The University of Sydney Research Accelerator (SOAR) prize program. High performance computing facilities provided by The University of Sydney (artemis) have contributed to the research results reported within this paper.

[1] B. A. Olshausen and D. J. Field, Emergence of simple-cell receptive field properties by learning a sparse code for natural images, Nature 381, 607 (1996).
[2] A. P. Georgopoulos, J. Ashe, N. Smyrnis, and M. Taira, The motor cortex and the coding of force, Science 256, 1692 (1992).
[3] T. Schreiber, Measuring information transfer, Physical Review Letters 85, 461 (2000).
[4] T. Bossomaier, L. Barnett, M. Harré, and J. T. Lizier, An Introduction to Transfer Entropy (Cham: Springer International Publishing, 2016).
[5] R. E. Spinney, M. Prokopenko, and J. T. Lizier, Transfer entropy in continuous time, with applications to jump and neural spiking processes, Physical Review E 95, 032319 (2017).
[6] D. J. C. MacKay, Information Theory, Inference and Learning Algorithms (Cambridge University Press, 2003).
[7] J. Runge, J. Heitzig, V. Petoukhov, and J. Kurths, Escaping the curse of dimensionality in estimating multivariate transfer entropy, Physical Review Letters 108, 258701 (2012).
[8] J. Runge, Conditional independence testing based on a nearest-neighbor estimator of conditional mutual information, arXiv preprint arXiv:1709.01447 (2017).
[9] L. Novelli, P. Wollstadt, P. Mediano, M. Wibral, and J. T. Lizier, Large-scale directed network inference with multivariate transfer entropy and hierarchical statistical testing, Network Neuroscience, 1 (2019).
[10] C. J. Honey, R. Kötter, M. Breakspear, and O. Sporns, Network structure of cerebral cortex shapes functional connectivity on multiple time scales, Proceedings of the National Academy of Sciences 104, 10240 (2007).
[11] O. Stetter, D. Battaglia, J. Soriano, and T. Geisel, Model-free reconstruction of excitatory neuronal connectivity from calcium imaging signals, PLoS Computational Biology 8, e1002653 (2012).
[12] J. Sun and E. M. Bollt, Causation entropy identifies indirect influences, dominance of neighbors and anticipatory couplings, Physica D: Nonlinear Phenomena 267, 49 (2014).
[13] M. Wibral, R. Vicente, and J. T. Lizier, Directed Information Measures in Neuroscience (Springer, 2014).
[14] N. M. Timme and C. Lapish, A tutorial for information theory in neuroscience, eNeuro 5 (2018).
[15] A. Palmigiano, T. Geisel, F. Wolf, and D. Battaglia, Flexible information routing by transient synchrony, Nature Neuroscience 20, 1014 (2017).
[16] M. Lungarella and O. Sporns, Mapping information flow in sensorimotor networks, PLoS Computational Biology 2, e144 (2006).
[17] M. Wibral, N. Pampu, V. Priesemann, F. Siebenhühner, H. Seiwert, M. Lindner, J. T. Lizier, and R. Vicente, Measuring information-transfer delays, PLoS ONE 8, e55809 (2013).
[18] M. Wibral, B. Rahm, M. Rieder, M. Lindner, R. Vicente, and J. Kaiser, Transfer entropy in magnetoencephalographic data: quantifying information flow in cortical and cerebellar networks, Progress in Biophysics and Molecular Biology 105, 80 (2011).
[19] A. Brodski-Guerniero, M. J. Naumer, V. Moliadze, J. Chan, H. Althen, F. Ferreira-Santos, J. T. Lizier, S. Schlitt, J. Kitzerow, M. Schütz, et al., Predictable information in neural signals during resting state is reduced in autism spectrum disorder, Human Brain Mapping 39, 3227 (2018).
[20] V. A. Vakorin, N. Kovacevic, and A. R. McIntosh, Exploring transient transfer entropy based on a group-wise ICA decomposition of EEG data, NeuroImage 49, 1593 (2010).
[21] J. T. Lizier, J. Heinzle, A. Horstmann, J.-D. Haynes, and M. Prokopenko, Multivariate information-theoretic measures reveal directed information structure and task relevant changes in fMRI connectivity, Journal of Computational Neuroscience 30, 85 (2011).
[22] M. Wibral, C. Finn, P. Wollstadt, J. Lizier, and V. Priesemann, Quantifying information modification in developing neural networks via partial information decomposition, Entropy 19, 494 (2017).
[23] M. Li, Y. Han, M. J. Aburn, M. Breakspear, R. A. Poldrack, J. M. Shine, and J. T. Lizier, Transitions in brain-network level information processing dynamics are driven by alterations in neural gain, bioRxiv, 581538 (2019).
[24] J.-P. Thivierge, Scale-free and economical features of functional connectivity in neuronal networks, Physical Review E 90, 022721 (2014).
[25] J. J. Harris, E. Engl, D. Attwell, and R. B. Jolivet, Energy-efficient information transfer at thalamocortical synapses, PLoS Computational Biology 15, e1007226 (2019).
[26] N. M. Timme, S. Ito, M. Myroshnychenko, S. Nigam, M. Shimono, F.-C. Yeh, P. Hottowy, A. M. Litke, and J. M. Beggs, High-degree neurons feed cortical computations, PLoS Computational Biology 12, e1004858 (2016).
[27] N. Timme, S. Ito, M. Myroshnychenko, F.-C. Yeh, E. Hiolski, A. M. Litke, and J. M. Beggs, Multiplex networks of cortical and hippocampal neurons revealed at different timescales, BMC Neuroscience 15, P212 (2014).
[28] K. E. Schroeder, Z. T. Irwin, M. Gaidica, J. N. Bentley, P. G. Patil, G. A. Mashour, and C. A. Chestek, Disruption of corticocortical information transfer during ketamine anesthesia in the primate brain, NeuroImage 134, 459 (2016).
[29] R. Kobayashi and K. Kitano, Impact of network topology on inference of synaptic connectivity from multi-neuronal spike data simulated by a large-scale cortical network model, Journal of Computational Neuroscience 35, 109 (2013).
[30] S. Nigam, M. Shimono, S. Ito, F.-C. Yeh, N. Timme, M. Myroshnychenko, C. C. Lapish, Z. Tosi, P. Hottowy, W. C. Smith, et al., Rich-club organization in effective connectivity among cortical neurons, Journal of Neuroscience 36, 670 (2016).
[31] S. Ito, M. E. Hansen, R. Heiland, A. Lumsdaine, A. M. Litke, and J. M. Beggs, Extending transfer entropy improves identification of effective connectivity in a spiking cortical network model, PLoS ONE 6, e27431 (2011).
[32] S. A. Neymotin, K. M. Jacobs, A. A. Fenton, and W. W. Lytton, Synaptic information transfer in computer models of neocortical columns, Journal of Computational Neuroscience 30, 69 (2011).
[33] B. Gourévitch and J. J. Eggermont, Evaluating information transfer between auditory cortical neurons, Journal of Neurophysiology 97, 2533 (2007).
[34] A. Buehlmann and G. Deco, Optimal information transfer in the cortex through synchronization, PLoS Computational Biology 6, e1000934 (2010).
[35] G. Ver Steeg and A. Galstyan, Information transfer in social media, in Proceedings of the 21st International Conference on World Wide Web (2012) pp. 509-518.
[36] J. G. Orlandi, O. Stetter, J. Soriano, T. Geisel, and D. Battaglia, Transfer entropy reconstruction and labeling of neuronal connections from simulated calcium imaging, PLoS ONE 9, e98842 (2014).
[37] I. M. de Abril, J. Yoshimoto, and K. Doya, Connectivity inference from neural recording data: Challenges, mathematical bases and research directions, Neural Networks 102, 120 (2018).
[38] I. Nemenman, G. D. Lewen, W. Bialek, and R. R. de Ruyter van Steveninck, Neural coding of natural stimuli: information at sub-millisecond resolution, PLoS Computational Biology 4, e1000025 (2008).
[39] C. Kayser, N. K. Logothetis, and S. Panzeri, Millisecond encoding precision of auditory cortex neurons, Proceedings of the National Academy of Sciences 107, 16976 (2010).
[40] S. J. Sober, S. Sponberg, I. Nemenman, and L. H. Ting, Millisecond spike timing codes for motor control, Trends in Neurosciences 41, 644 (2018).
[41] J. A. Garcia-Lazaro, L. A. Belliveau, and N. A. Lesica, Independent population coding of speech with sub-millisecond precision, Journal of Neuroscience 33, 19362 (2013).
[42] J. W. Aldridge and S. Gilman, The temporal structure of spike trains in the primate basal ganglia: afferent regulation of bursting demonstrated with precentral cerebral cortical ablation, Brain Research 543, 123 (1991).
[43] J. N. Cooper and C. D. Edgar, Transfer entropy in continuous time, arXiv preprint arXiv:1905.06406 (2019).
[44] A. Kraskov, H. Stögbauer, and P. Grassberger, Estimating mutual information, Physical Review E 69, 066138 (2004).
[45] P. W. Lewis and G. S. Shedler, Simulation of nonhomogeneous Poisson processes by thinning, Naval Research Logistics Quarterly 26, 403 (1979).
[46] P. Spirtes, C. N. Glymour, R. Scheines, and D. Heckerman, Causation, Prediction, and Search (MIT Press, 2000).
[47] J. Peters, D. Janzing, and B. Schölkopf, Elements of Causal Inference: Foundations and Learning Algorithms (MIT Press, 2017).
[48] J. Runge, Causal network reconstruction from time series: From theoretical assumptions to practical estimation, Chaos: An Interdisciplinary Journal of Nonlinear Science 28, 075310 (2018).
[49] G. Buzsaki, Rhythms of the Brain (Oxford University Press, 2006).
[50] A. Riehle, S. Grün, M. Diesmann, and A. Aertsen, Spike synchronization and rate modulation differentially involved in motor cortical function, Science 278, 1950 (1997).
[51] E. Maeda, H. Robinson, and A. Kawana, The mechanisms of generation and propagation of synchronized bursting in developing networks of cortical neurons, Journal of Neuroscience 15, 6834 (1995).
[52] A. Litwin-Kumar, A.-M. M. Oswald, N. N. Urban, and B. Doiron, Balanced synaptic input shapes the correlation between neural spike trains, PLoS Computational Biology 7 (2011).
[53] P. K. Trong and F. Rieke, Origin of correlated activity between parasol retinal ganglion cells, Nature Neuroscience 11, 1343 (2008).
[54] R. Marschinski and H. Kantz, Analysing the information flow between financial time series, The European Physical Journal B 30, 275 (2002).
[55] T. Kispersky, G. J. Gutierrez, and E. Marder, Functional connectivity in a rhythmic inhibitory circuit using Granger causality, Neural Systems & Circuits 1, 9 (2011).
[56] F. Gerhard, T. Kispersky, G. J. Gutierrez, E. Marder, M. Kramer, and U. Eden, Successful reconstruction of a physiological circuit with known connectivity from spiking activity alone, PLoS Computational Biology 9, e1003138 (2013).
[57] L. Barnett, A. B. Barrett, and A. K. Seth, Granger causality and transfer entropy are equivalent for Gaussian variables, Physical Review Letters 103, 238701 (2009).
[58] A. I. Selverston, Dynamic Biological Networks: The Stomatogastric Nervous System (MIT Press, 1992).
[59] E. Marder and D. Bucher, Understanding circuit dynamics using the stomatogastric nervous system of lobsters and crabs, Annu. Rev. Physiol. 69, 291 (2007).
[60] E. Marder and D. Bucher, Central pattern generators and the control of rhythmic movements, Current Biology 11, R986 (2001).
[61] A. A. Prinz, D. Bucher, and E. Marder, Similar network activity from disparate circuit parameters, Nature Neuroscience 7, 1345 (2004).
[62] T. O'Leary, A. H. Williams, A. Franci, and E. Marder, Cell types, network homeostasis, and pathological compensation from a biologically plausible ion channel expression model, Neuron 82, 809 (2014).
[63] Q. Wang, S. R. Kulkarni, and S. Verdú, Divergence estimation for multidimensional densities via k-nearest-neighbor distances, IEEE Transactions on Information Theory 55, 2392 (2009).
[64] S. Gao, G. Ver Steeg, and A. Galstyan, Efficient estimation of mutual information for strongly dependent variables, in Artificial Intelligence and Statistics (2015) pp. 277-286.
[65] S. Khan, S. Bandyopadhyay, A. R. Ganguly, S. Saigal, D. J. Erickson III, V. Protopopescu, and G. Ostrouchov, Relative performance of mutual information estimation methods for quantifying the dependence among short and noisy data, Physical Review E 76, 026209 (2007).
[66] B. Poole, S. Ozair, A. van den Oord, A. A. Alemi, and G. Tucker, On variational lower bounds of mutual information, in NeurIPS Workshop on Bayesian Deep Learning (2018).
[67] M. I. Belghazi, A. Baratin, S. Rajeswar, S. Ozair, Y. Bengio, A. Courville, and R. D. Hjelm, MINE: mutual information neural estimation, arXiv preprint arXiv:1801.04062 (2018).
[68] J. Beirlant, E. J. Dudewicz, L. Györfi, and E. C. Van der Meulen, Nonparametric entropy estimation: An overview, International Journal of Mathematical and Statistical Sciences 6, 17 (1997).
[69] L. Kozachenko and N. N. Leonenko, Sample estimate of the entropy of a random vector, Problemy Peredachi Informatsii 23, 9 (1987).
[70] T. B. Berrett, R. J. Samworth, M. Yuan, et al., Efficient multivariate entropy estimation via k-nearest neighbour distances, The Annals of Statistics 47, 288 (2019).
[71] K. Moon, K. Sricharan, K. Greenewald, and A. Hero, Ensemble estimation of information divergence, Entropy 20, 560 (2018).
[72] W. Gao, S. Oh, and P. Viswanath, Demystifying fixed k-nearest neighbor information estimators, IEEE Transactions on Information Theory 64, 5629 (2018).
[73] N. Leonenko, L. Pronzato, and V. Savani, A class of Rényi information estimators for multidimensional densities, The Annals of Statistics 36, 2153 (2008).
[74] J. Theiler, Estimating fractal dimension, JOSA A 7, 1055 (1990).
[75] H. Kantz and T. Schreiber, Nonlinear Time Series Analysis, Vol. 7 (Cambridge University Press, 2004).
[76] J. T. Lizier, JIDT: an information-theoretic toolkit for studying the dynamics of complex systems, Frontiers in Robotics and AI 1, 11 (2014).
[77] github.com/KristofferC/NearestNeighbors.jl.
[78] A. A. Prinz, C. P. Billimoria, and E. Marder, Alternative to hand-tuning conductance-based models: construction and analysis of databases of model neurons, Journal of Neurophysiology 90, 3998 (2003).
[79] J. H. Goldwyn and E. Shea-Brown, The what and where of adding channel noise to the Hodgkin-Huxley equations, PLoS Computational Biology 7 (2011).
[80] P. Rowat, Interspike interval statistics in the stochastic Hodgkin-Huxley model: Coexistence of gamma frequency bursts and highly irregular firing, Neural Computation 19, 1215 (2007).

Appendix A: Simulation of the Pyloric Circuit of the STG

The simulation approach used follows very closely that presented in [61, 62, 78]. The only significant deviation is the addition of a noise term.

We modelled the neurons of the pyloric circuit using a conductance-based model.


Current   E (mV)   q   p   m∞                                                   h∞
I_Na      50       3   1   1/(1 + exp((V + 25.5)/−5.29))                        1/(1 + exp((V + 48.9)/5.18))
I_CaT     —        3   1   1/(1 + exp((V + 27.1)/−7.2))                         1/(1 + exp((V + 32.1)/5.5))
I_CaS     —        3   1   1/(1 + exp((V + 33)/−8.1))                           1/(1 + exp((V + 60)/6.2))
I_A       −80      3   1   1/(1 + exp((V + 27.2)/−8.7))                         1/(1 + exp((V + 56.9)/4.9))
I_KCa     −80      4   0   ([Ca]/([Ca] + 3)) · 1/(1 + exp((V + 28.3)/−12.6))    —
I_Kd      −80      4   0   1/(1 + exp((V + 12.3)/−11.8))                        —
I_H       −20      1   0   1/(1 + exp((V + 75)/5.5))                            —

TABLE I: Parameters and functions used in the conductance-based model.

The membrane potential V evolves according to

C dV/dt = − Σ_i I_i − Σ_s I_s + ξ    (A1)

C = 0.628 nF is the membrane capacitance and ξ is a noise term. Each current is specified by

I_i = ḡ_i m_i^{q_i} h_i^{p_i} (V − E_i)

E_i is the reversal potential and its values are listed in Table I. The reversal potentials associated with the calcium channels are not listed as these are calculated according to the Nernst equation. Specifically, E_Ca = (RT/2F) log₁₀([Ca²⁺]_ext/[Ca²⁺]), where R = 8.314 J K⁻¹ mol⁻¹ is the universal gas constant, T = 293.3 K is the temperature and F = 96 485.332 12 C mol⁻¹ is Faraday's constant. [Ca²⁺]_ext = 3 mM is the extracellular Ca²⁺ concentration and [Ca²⁺] is the intracellular Ca²⁺ concentration. The intracellular Ca²⁺ concentration evolves according


Current   τ_m                                                  τ_h
I_Na      2.64 − 2.52/(1 + exp((V + 120)/−25))                 (1.34/(1 + exp((V + 62.9)/−10))) (1.5 + 1/(1 + exp((V + 34.9)/3.6)))
I_CaT     43.4 − 42.6/(1 + exp((V + 68.1)/−20.5))              210 − 179.6/(1 + exp((V + 55)/−16.9))
I_CaS     2.8 + 14/(exp((V + 27)/10) + exp((V + 70)/−13))      120 + 300/(exp((V + 55)/9) + exp((V + 65)/−16))
I_A       23.2 − 20.8/(1 + exp((V + 32.9)/−15.2))              77.2 − 58.4/(1 + exp((V + 38.9)/−26.5))
I_KCa     180.6 − 150.2/(1 + exp((V + 46)/−22.7))              —
I_Kd      14.4 − 12.8/(1 + exp((V + 28.3)/−19.2))              —
I_H       2/(exp((V + 169.7)/−11.6) + exp((V − 26.7)/14.3))    —

TABLE II: Parameters and functions used in the conductance-based model. Time constants are in ms.

to

τ_Ca d[Ca²⁺]/dt = −f (I_CaT + I_CaS) − [Ca²⁺] + [Ca²⁺]₀

[Ca²⁺]₀ = 0.05 µM is the steady-state Ca²⁺ concentration, f = 14.96 µM nA⁻¹ and τ_Ca = 200 ms.
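Plugging the stated constants into the Nernst expression above (evaluated exactly as written, with a base-10 logarithm; an implementation using the natural logarithm would scale the result by ln 10) gives, for illustration, the calcium reversal potential at the steady-state intracellular concentration:

```python
import math

# Constants as stated in the text
R = 8.314        # J K^-1 mol^-1, universal gas constant
T = 293.3        # K, temperature
F = 96485.33212  # C mol^-1, Faraday's constant

def E_Ca(ca_ext_mM, ca_in_mM):
    """Calcium reversal potential in volts, evaluating the expression
    in the text as written (base-10 logarithm)."""
    return (R * T / (2.0 * F)) * math.log10(ca_ext_mM / ca_in_mM)
```

With [Ca²⁺]_ext = 3 mM and [Ca²⁺] = 0.05 µM (5 × 10⁻⁵ mM), this evaluates to roughly 60 mV; as the intracellular concentration rises during activity, E_Ca falls.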

The values of q_i and p_i are listed in Table I. The activation variables m_i evolve according to

τ_{m_i} dm_i/dt = m_{∞,i} − m_i

The inactivation variables h_i evolve according to

τ_{h_i} dh_i/dt = h_{∞,i} − h_i

m_{∞,i} and h_{∞,i} are given in Table I and τ_{m_i} and τ_{h_i} are given in Table II.
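As an illustration of how the entries of Tables I and II drive these dynamics, a forward-Euler update of the I_Na activation variable can be sketched as (illustrative Python; the sigmoid and time-constant expressions are those tabulated above, and the function names are ours):

```python
import math

def m_inf_Na(V):
    """Steady-state activation m_inf of I_Na (Table I); V in mV."""
    return 1.0 / (1.0 + math.exp((V + 25.5) / -5.29))

def tau_m_Na(V):
    """Activation time constant of I_Na in ms (Table II)."""
    return 2.64 - 2.52 / (1.0 + math.exp((V + 120.0) / -25.0))

def euler_step(m, V, dt):
    """One forward-Euler step of tau_m dm/dt = m_inf(V) - m (dt in ms)."""
    return m + dt * (m_inf_Na(V) - m) / tau_m_Na(V)
```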

The synaptic currents are specified by

I_s = ḡ_s a_s (V − E_s)

E_s is the synaptic reversal potential. It was set to −70 mV for glutamatergic synapses and −80 mV for cholinergic synapses. The activation variable a_s evolves according to

τ_{a_s} da_s/dt = a_{∞,s} − a_s

where

a_{∞,s} = 1 / (1 + exp((V_th − V_pre)/∆))

and

τ_{a_s} = (1 − a_{∞,s}) / k₋

∆ = 5 mV provides the slope of the activation curve. V_th = −35 mV is the half-activation potential of the synapse. V_pre is the membrane potential of the presynaptic neuron. k₋ is the rate constant for transmitter-receptor dissociation. For the glutamatergic synapses we used k₋ = 0.025 ms⁻¹ and for the cholinergic synapses we used k₋ = 0.01 ms⁻¹.
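The synaptic activation curve and its time constant under the stated parameters can be evaluated as follows (illustrative Python; the defaults encode ∆ = 5 mV and V_th = −35 mV):

```python
import math

def a_inf(V_pre, V_th=-35.0, delta=5.0):
    """Steady-state synaptic activation; V_pre, V_th in mV."""
    return 1.0 / (1.0 + math.exp((V_th - V_pre) / delta))

def tau_a(V_pre, k_minus, V_th=-35.0, delta=5.0):
    """Synaptic activation time constant in ms: (1 - a_inf) / k_minus."""
    return (1.0 - a_inf(V_pre, V_th, delta)) / k_minus
```

For a strongly hyperpolarised presynaptic neuron the activation is near zero and the time constant approaches 1/k₋ (40 ms for the glutamatergic synapses).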

Simulations of the pyloric rhythm can be run with fixed maximum conductance values ḡ_i and ḡ_s, as in [61]. However, it was found that these models were less robust to the addition of a noise term. It was, therefore, decided to use adaptive conductances as described in [62]. Each conductance evolved according to

τ_g dḡ_j/dt = m_j − ḡ_j

where

τ_{m_j} dm_j/dt = [Ca²⁺] − [Ca²⁺]_tgt

τ_g = 100 ms was common across all channels. The time constants τ_{m_j} are listed in Table III. The time constants provided in [62] were not used, as these produce a pyloric rhythm with an unrealistically short period. Instead, the approach presented in [62] for arriving at conductance time constants from desired conductance values was used. The conductance values in [61] were used as these desired values. Specifically, we used the values presented


Conductance   AB/PD    LP      PY
g_Na          2.5      10      10
g_CaT         400      1e13    400
g_CaS         166.7    250     400
g_A           20       50      20
g_KCa         100      200     1e13
g_Kd          10       20      8
g_H           1×10⁵    2×10⁴   2×10⁴

TABLE III: The conductance time constants τ_{m_j}. All values are in seconds.

LP → AB/PD, glutamatergic     3×10³
AB/PD → LP, cholinergic       250
AB/PD → LP, glutamatergic     5×10³
PY → LP, glutamatergic        8×10³
AB/PD → PY, cholinergic       1×10⁴
AB/PD → PY, glutamatergic     400
LP → PY, glutamatergic        1×10⁵

TABLE IV: The conductance time constants τ_{m_j} for the synapses. All values are in seconds.

in table 2 for AB/PD 1, LP 2 and PY 1. The time constant τ_{m_j} associated with ḡ_j was set as τ_{m_j} = c/ḡ_j, where c is a constant with units of seconds that was adjusted by hand so that the activity converged in a reasonable amount of time and the time constants were of the same order of magnitude as those provided in [62]. As in [62], the leak conductances were fixed: they were set at 0 for the AB/PD neuron, 0.0628 µS for the PY neuron and 0.1256 µS for the LP neuron.

The TE approach to network inference will not work in a fully deterministic system. As such, noise was added to the system. There are a number of techniques for adding noise to conductance-based models [79]. The simplest such technique is to add noise to the currents in (A1). Although this is not a biophysically realistic method, it has been shown to produce behaviours which closely match those produced by more realistic techniques [79, 80]. As such, we decided to make use of this procedure in our simulations. The associated noise term is shown in (A1). The noise was generated using an AR(1) process

ξ_t = θ ξ_{t−1} + ε_t    (A2)

We used the parameter value θ = 0.01. ε_t was distributed normally with mean 0 and standard deviation 8 × 10⁻¹³. A simulation timestep of ∆t = 0.01 ms was used.
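The noise process (A2) is straightforward to generate; a sketch (illustrative Python, any Gaussian random number generator will do):

```python
import random

def ar1_noise(n, theta=0.01, sigma=8e-13, seed=0):
    """Generate n samples of the AR(1) noise process (A2):
    xi_t = theta * xi_{t-1} + eps_t, with eps_t ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    xi, samples = 0.0, []
    for _ in range(n):
        xi = theta * xi + rng.gauss(0.0, sigma)
        samples.append(xi)
    return samples
```

One such sample is drawn per simulation timestep of ∆t = 0.01 ms and added to the membrane currents in (A1).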
