arXiv:2003.04179v2 [cs.IT] 16 May 2020

Capacity of Continuous Channels with Memory via Directed Information Neural Estimator

Ziv Aharoni (Ben Gurion University), Dor Tsur (Ben Gurion University), Ziv Goldfeld (Cornell University), Haim H. Permuter (Ben Gurion University)

Abstract—Calculating the capacity (with or without feedback) of channels with memory and continuous alphabets is a challenging task. It requires optimizing the directed information (DI) rate over all channel input distributions. The objective is a multi-letter expression, whose analytic solution is known only for a few specific cases. When no analytic solution is available or the channel model is unknown, there is no unified framework for calculating or even approximating capacity. This work proposes a novel capacity estimation algorithm that treats the channel as a 'black-box', both when feedback is and is not present. The algorithm has two main ingredients: (i) a neural distribution transformer (NDT) model that shapes the input distribution, and (ii) the DI neural estimator (DINE) that estimates the communication rate of the current NDT model. These models are trained by an alternating maximization procedure to both estimate the channel capacity and obtain an NDT for the optimal input distribution. The method is demonstrated on the moving average additive Gaussian noise channel, where it is shown that both the capacity and the feedback capacity are estimated without knowledge of the channel transition kernel. The proposed estimation framework opens the door to a myriad of capacity approximation results for continuous alphabet channels that were inaccessible until now.

I. INTRODUCTION

Many discrete-time continuous-alphabet communication channels involve correlated noise or inter-symbol interference (ISI). Two predominant scenarios are when communication over such channels is performed with or without feedback from the receiver back to the transmitter. The respective fundamental rates of reliable communication over such channels are the feedback (FB) and feedforward (FF) capacity. Starting from the latter, the FF capacity of an n-fold point-to-point channel P_{Y^n|X^n} is given by [1]

    C_{FF} = \lim_{n\to\infty} \sup_{P_{X^n}} \frac{1}{n} I(X^n; Y^n).    (1)

In the presence of feedback, the FB capacity, denoted C_FB, is [17]

    C_{FB} = \lim_{n\to\infty} \sup_{P_{X^n \| Y^{n-1}}} \frac{1}{n} I(X^n \to Y^n),    (2)

where

    I(X^n \to Y^n) := \sum_{i=1}^{n} I(X^i; Y_i | Y^{i-1})    (3)

is the directed information (DI) from the input sequence X^n to the output Y^n [8], and P_{X^n \| Y^{n-1}} := \prod_{i=1}^{n} P_{X_i | X^{i-1}, Y^{i-1}} is the distribution of X^n causally conditioned on Y^{n-1} (see [21], [24] for further details). Built on (3), the DI rate is defined for stationary processes as

    I(X \to Y) := \lim_{n\to\infty} \frac{1}{n} I(X^n \to Y^n).    (4)

As shown in [8], when feedback is not present, the optimization problem (2) (which amounts to optimizing over P_{X^n \| Y^{n-1}} rather than P_{X^n}) coincides with (1). Thus, DI provides a unified framework for representing both FF and FB capacities.

Computing C_FF and C_FB requires solving a multi-letter optimization problem. Closed form solutions to this challenging task are known only in several special cases. Common examples for C_FF are the Gaussian channel with memory [14] and the ISI Gaussian channel [15]. There are no known non-trivial extensions of these solutions to the non-Gaussian case. For C_FB, a solution for the 1st order moving average additive Gaussian noise channel (MA(1)-AGN) was found in [12], and another closed form characterization is available for the auto-regressive moving-average (ARMA) AGN channel [11]. To the best of our knowledge, these are the only examples of continuous channels with memory whose FB capacity is known in closed form. Furthermore, when the channel model is unknown, there is no numerically tractable method for approximating the capacity from samples.

Recent progress related to capacity computation via deep learning (DL) was made in [9], where the mutual information neural estimator (MINE) [2] was used to learn modulations for memoryless channels. Later, [19] proposed a reinforcement learning based algorithm that estimates and maximizes the DI rate, but only for discrete alphabet channels with a known channel model.

Inspired by the above, we develop a framework for estimating the FF and FB capacity of continuous-alphabet channels with memory, without knowing the channel model. Our method does not require the channel transition kernel; we only assume that the channel is stationary and that its outputs can be sampled by feeding it with inputs. Central to our method are a new DI neural estimator (DINE), used to evaluate the communication rate, and a neural distribution transformer (NDT), used to simulate input distributions. Together, DINE and NDT lay the groundwork for our capacity estimation algorithm. In the remainder of this section, we describe DINE, NDT, and their integration into the capacity estimator.

A. Directed Information Neural Estimation

The estimation of mutual information (MI) from samples using neural networks (NNs) is a recently proposed approach [2], [3]. It is especially effective when the involved random variables (RVs) are continuous. The concept originated from [2], where MINE was proposed. The core idea is to represent MI using the Donsker-Varadhan (DV) variational formula

    I(X; Y) = \sup_{T: \mathcal{X}\times\mathcal{Y} \to \mathbb{R}} \mathbb{E}[T(X, Y)] - \log \mathbb{E}\big[e^{T(\tilde{X}, \tilde{Y})}\big],    (5)

where (X, Y) \sim P_{XY} and (\tilde{X}, \tilde{Y}) \sim P_X \otimes P_Y. The supremum is over all measurable functions T for which both expectations are finite. Parameterizing T by an NN and replacing the expectations with empirical averages enables gradient ascent optimization to estimate I(X; Y).
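For concreteness, the following is a minimal sketch of such a DV-based MI estimator. It is not taken from [2] nor from our released implementation; the MLP size, optimizer settings, and the toy jointly Gaussian data are illustrative assumptions, and PyTorch is used here purely for the sketch.

import torch
import torch.nn as nn

class DVPotential(nn.Module):
    """Simple MLP realizing the DV potential T(x, y)."""
    def __init__(self, dim_x, dim_y, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_x + dim_y, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))
    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=-1)).squeeze(-1)

def dv_lower_bound(T, x, y):
    """Empirical DV bound: E[T(x,y)] - log E[exp(T(x, y_shuffled))], cf. (5)."""
    joint = T(x, y).mean()
    y_marg = y[torch.randperm(y.shape[0])]            # approximate samples from P_X (x) P_Y
    marginal = torch.logsumexp(T(x, y_marg), dim=0) - torch.log(torch.tensor(float(x.shape[0])))
    return joint - marginal

# Toy example: X ~ N(0,1), Y = X + N(0,1); true MI = 0.5*log(2) ~= 0.35 nats.
torch.manual_seed(0)
x = torch.randn(4096, 1)
y = x + torch.randn(4096, 1)
T = DVPotential(1, 1)
opt = torch.optim.Adam(T.parameters(), lr=1e-3)
for step in range(2000):
    opt.zero_grad()
    loss = -dv_lower_bound(T, x, y)    # gradient ascent on the DV bound
    loss.backward()
    opt.step()
print("estimated MI (nats):", dv_lower_bound(T, x, y).item())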
A variant of MINE that goes through estimating the underlying entropy terms was proposed in [3]. The new estimators were shown empirically to perform extremely well, especially for continuous alphabets.

Herein, we propose a new estimator for the DI rate I(X → Y). The DI is factorized as

    I(X^n \to Y^n) = h(Y^n) - h(Y^n \| X^n),    (6)

where h(Y^n) is the differential entropy of Y^n and h(Y^n \| X^n) := \sum_{i=1}^{n} h(Y_i | Y^{i-1}, X^i). Applying the approach of [3] to the entropy terms, we expand each as a Kullback-Leibler (KL) divergence plus a cross-entropy (CE) residual and invoke the DV representation. To account for memory, we derive a formula valid for causally dependent data, which involves RNNs as function approximators (rather than the FF network used in the independently and identically distributed (i.i.d.) case). Thus, DINE is an RNN-based estimator of the DI rate from X^n to Y^n based on their samples.

Estimation of DI between discrete-valued processes was studied in [25]-[27]. An estimator of the transfer entropy, which upper bounds the DI for jointly Markov processes with finite memory, was proposed in [16]. DINE, on the other hand, assumes neither Markovity nor discrete alphabets, and can be applied to continuous-valued stationary and ergodic processes. A detailed description of the DINE algorithm is given in Subsection II-A.

B. Neural Distribution Transformer and Capacity Estimation

DINE accounts for one of the two tasks involved in estimating capacity: it estimates the objective of (2). It then remains to optimize this objective over input distributions. To that end, we design a deep generative model, termed the NDT, to approximate the channel input distribution. This is similar in flavor to the generators used in generative adversarial networks [23]. The NDT maps i.i.d. noise into samples of the channel input distribution. For estimating FB capacity, the NDT also receives the channel feedback as input, in addition to the i.i.d. noise. Together, NDT and DINE form the overall system that estimates the capacity, as shown in Fig. 1.

Fig. 1. The overall capacity estimator: NDT generates samples that are fed into the channel. DINE uses these samples to improve its estimation of the communication rate. DINE then supplies the gradient for the optimization of the NDT.

The capacity estimation algorithm trains the DINE and NDT models together via an alternating optimization procedure (i.e., fixing the parameters of one model while training the other). DINE estimates the communication rate of a fixed NDT input distribution, and the NDT is trained to increase its rate with respect to a fixed DINE model. Proceeding until convergence, this results in the capacity estimate, as well as an NDT generative model for the achieving input distribution. We demonstrate our method on the MA(1)-AGN channel. Both C_FF and C_FB are estimated using the same algorithm, using the channel as a black-box solely to generate samples. The estimation results are compared with the analytic solutions to show the effectiveness of the proposed approach.

II. METHODOLOGY

We give a high-level description of the algorithm and its building blocks. Due to space limitations, full details are reserved for the extended version of this paper. The implementation is available on GitHub.†

† https://github.com/zivaharoni/capacity-estimator-via-dine

A. Directed Information Estimation Method

We propose a new estimator of the DI rate between two correlated stationary processes, termed DINE. Building on [3], we factorize each term in (6) as

    h(Y^n) = h_{CE}(P_{Y^n}, P_{Y^{n-1}} \otimes P_{\tilde{Y}}) - D_{KL}(P_{Y^n} \| P_{Y^{n-1}} \otimes P_{\tilde{Y}})
    h(Y^n \| X^n) = h_{CE}(P_{Y^n \| X^n}, P_{Y^{n-1} \| X^{n-1}} \otimes P_{\tilde{Y}} | P_{X^n}) - D_{KL}(P_{Y^n \| X^n} \| P_{Y^{n-1} \| X^{n-1}} \otimes P_{\tilde{Y}} | P_{X^n}),    (7)

where h_{CE}(P_X, Q_X) and D_{KL}(P_X \| Q_X) are, respectively, the CE and the KL divergence between P_X and Q_X, with

    h_{CE}(P_{Y|X}, Q_{Y|X} | P_X) := \int_{\mathcal{X}} h_{CE}(P_{Y|X=x}, Q_{Y|X=x}) \, dP_X(x)
    D_{KL}(P_{Y|X} \| Q_{Y|X} | P_X) := \int_{\mathcal{X}} D_{KL}(P_{Y|X=x} \| Q_{Y|X=x}) \, dP_X(x)    (8)

denoting their conditional versions, and P_{\tilde{Y}} is a uniform reference measure over the support of the dataset. To simplify notation, we use the shorthands

    D_Y^{(n)} := D_{KL}(P_{Y^n} \| P_{Y^{n-1}} \otimes P_{\tilde{Y}})
    D_{Y\|X}^{(n)} := D_{KL}(P_{Y^n \| X^n} \| P_{Y^{n-1} \| X^{n-1}} \otimes P_{\tilde{Y}} | P_{X^n}).    (9)
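One simple way to realize the uniform reference measure P_{\tilde{Y}} in practice is to draw reference samples uniformly over the per-coordinate range of the observed outputs. The helper below is a sketch under that assumption; the released implementation may construct the reference differently.

import numpy as np

def sample_uniform_reference(y, num_samples, margin=0.0):
    """Draw reference samples uniformly over the (boxed) support of the data.

    y: array of shape (n, dim) holding observed channel outputs.
    margin: optional slack added around the empirical support.
    """
    lo = y.min(axis=0) - margin
    hi = y.max(axis=0) + margin
    return np.random.uniform(lo, hi, size=(num_samples, y.shape[1]))

# Example: reference samples matched to a batch of scalar outputs.
y = np.random.randn(1000, 1)
y_ref = sample_uniform_reference(y, num_samples=1000)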
Subtracting the two identities in (7), and observing that the difference of the CE terms equals the DI at the former time step, we have

    I(X^n \to Y^n) = I(X^{n-1} \to Y^{n-1}) + D_{Y\|X}^{(n)} - D_Y^{(n)}.    (10)

Note that the difference of KL divergences equals I(X^n; Y_n | Y^{n-1}). For stationary processes we take the limit and obtain

    \lim_{n\to\infty} \big( D_{Y\|X}^{(n)} - D_Y^{(n)} \big) = \lim_{n\to\infty} I(X^n; Y_n | Y^{n-1}) = I(X \to Y).    (11)

Each D_KL is expanded by its DV representation [4] as

    D_Y^{(n)} = \sup_{T: \Omega \to \mathbb{R}} \mathbb{E}[T(Y^n)] - \log \mathbb{E}\big[e^{T(Y^{n-1}, \tilde{Y})}\big]
    D_{Y\|X}^{(n)} = \sup_{T: \Omega \to \mathbb{R}} \mathbb{E}[T(Y^n \| X^n)] - \log \mathbb{E}\big[e^{T(Y^{n-1} \| X^{n-1}, \tilde{Y})}\big].    (12)

To maximize (12), each DV potential is parametrized by a modified LSTM, and expected values are estimated by empirical averages over the dataset D_n := {(x_i, y_i)}_{i=1}^{n}. Thus, the optimization objectives are

    \hat{D}_{Y\|X}(\theta_{Y\|X}, D_n) := \frac{1}{n}\sum_{i=1}^{n} T_{\theta_{Y\|X}}(y_i | x^i, y^{i-1}) - \log\bigg( \frac{1}{n}\sum_{i=1}^{n} e^{T_{\theta_{Y\|X}}(\tilde{y}_i | x^i, y^{i-1})} \bigg)
    \hat{D}_Y(\theta_Y, D_n) := \frac{1}{n}\sum_{i=1}^{n} T_{\theta_Y}(y_i | y^{i-1}) - \log\bigg( \frac{1}{n}\sum_{i=1}^{n} e^{T_{\theta_Y}(\tilde{y}_i | y^{i-1})} \bigg),    (13)

where \tilde{y}_i \sim P_{\tilde{Y}} i.i.d. and T_{\theta_Y}, T_{\theta_{Y\|X}} are the parametrized potentials. The estimator is given by

    \hat{I}_{D_n}(X \to Y) := \sup_{\theta_{Y\|X} \in \Theta_{Y\|X}} \hat{D}_{Y\|X} - \sup_{\theta_Y \in \Theta_Y} \hat{D}_Y.    (14)

By the universal approximation of RNNs [6] and Breiman's theorem [7], the maximizer of (14) approaches I(X → Y) as the number of samples grows, provided the neural networks are sufficiently expressive.

To capture the time dependencies in D_n, we introduce a modified LSTM network for functional approximation. The LSTM [5] is an RNN that receives a time series {y_i}_{i=1}^{T} as input and, for each i, performs a recursive non-linear transform to calculate its hidden state s_i. We denote the LSTM map by F: (y_i, s_{i-1}) \mapsto s_i; the full characterization of F is provided in [5]. We modify the structure of the LSTM to perform the calculations

    s_i = F(y_i, s_{i-1}) = s(y_i | y^{i-1}),    \tilde{s}_i = F(\tilde{y}_i, s_{i-1}) = s(\tilde{y}_i | y^{i-1}).    (15)

A similar modification is introduced for \hat{D}_{Y\|X} by substituting y_i with (y_i, x_i) and \tilde{y}_i with (\tilde{y}_i, x_i):

    s_i = F(y_i, x_i, s_{i-1}) = s(y_i | y^{i-1}, x^i),    \tilde{s}_i = F(\tilde{y}_i, x_i, s_{i-1}) = s(\tilde{y}_i | y^{i-1}, x^i).    (16)

A visualization of a modified LSTM cell (unrolled) is shown in Fig. 2. The LSTM cell's output is the sequence {(s_i, \tilde{s}_i)}_{i=1}^{n}, which is fed into a fully-connected layer to obtain T_{\theta_Y} and T_{\theta_{Y\|X}}. As demonstrated by Algorithm 1 and Fig. 3, in each iteration we draw D_B, a subset of D_n of size B. We feed the NN with D_B to acquire T_{\theta_Y} and T_{\theta_{Y\|X}}; these enter the loss function (13), and gradients are calculated to update the NN parameters \theta_Y, \theta_{Y\|X}.

Algorithm 1 Directed Information Rate Estimation
input: samples of the process, D_n.
output: \hat{I}_{D_n}(X → Y), the estimated directed information rate.
Initialize the network parameters \theta_Y, \theta_{Y\|X}.
Step 1, Optimization:
repeat
    Draw a batch D_B = {(x_{(i-1)T}^{iT}, y_{(i-1)T}^{iT})}_{i=1}^{B}
    Feed the network with the examples and compute the losses \hat{D}_{Y\|X}(\theta_{Y\|X}, D_B), \hat{D}_Y(\theta_Y, D_B)
    Update the network parameters:
        \theta_{Y\|X} ← \theta_{Y\|X} + \nabla \hat{D}_{Y\|X}(\theta_{Y\|X}, D_B)
        \theta_Y ← \theta_Y + \nabla \hat{D}_Y(\theta_Y, D_B)
until convergence
Step 2, Perform a Monte Carlo estimation over D_n and subtract the loss evaluations to obtain the estimate:
    \hat{I}_{D_n}(X → Y) = \hat{D}_{Y\|X}(\theta_{Y\|X}, D_n) - \hat{D}_Y(\theta_Y, D_n)

Fig. 2. The modified LSTM cell unrolled in the DINE architecture of \hat{D}_Y. Recursively, at each time i, (y_i, s_{i-1}) and (\tilde{y}_i, s_{i-1}) are mapped to s_i and \tilde{s}_i, respectively.
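To illustrate how (13) and (15) fit together, the following PyTorch sketch implements the \hat{D}_Y branch only: an LSTM cell driven by the data supplies the state s_{i-1}, the same cell evaluates the reference sample \tilde{y}_i against that state, and a dense layer produces the DV potentials. The layer sizes, optimizer, and toy data are illustrative assumptions, the code is a simplified stand-in for the released implementation, and the \hat{D}_{Y\|X} branch is obtained analogously by feeding (y_i, x_i) pairs.

import torch
import torch.nn as nn

class DYPotential(nn.Module):
    """Sketch of the D_Y branch of DINE: modified LSTM cell + dense DV layer."""
    def __init__(self, dim_y, hidden=32):
        super().__init__()
        self.cell = nn.LSTMCell(dim_y, hidden)
        self.dv = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                nn.Linear(hidden, 1))

    def forward(self, y, y_ref):
        """y, y_ref: tensors of shape (T, batch, dim_y). Returns (T_data, T_ref)."""
        batch = y.shape[1]
        h = torch.zeros(batch, self.cell.hidden_size)
        c = torch.zeros(batch, self.cell.hidden_size)
        t_data, t_ref = [], []
        for i in range(y.shape[0]):
            # Reference branch: evaluate y_ref[i] against the *data* state s_{i-1}, cf. (15).
            h_ref, _ = self.cell(y_ref[i], (h, c))
            # Data branch: advance the state with the true sample y[i].
            h, c = self.cell(y[i], (h, c))
            t_data.append(self.dv(h))
            t_ref.append(self.dv(h_ref))
        return torch.cat(t_data), torch.cat(t_ref)

def dv_objective(t_data, t_ref):
    """Empirical DV objective, cf. (13)."""
    n = torch.tensor(float(t_ref.numel()))
    return t_data.mean() - (torch.logsumexp(t_ref.flatten(), dim=0) - torch.log(n))

# Toy usage: correlated Gaussian outputs, uniform reference over their range.
torch.manual_seed(0)
T_len, B = 20, 64
y = torch.randn(T_len, B, 1).cumsum(dim=0) * 0.1
y_ref = torch.rand_like(y) * (y.max() - y.min()) + y.min()
model = DYPotential(dim_y=1)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(500):
    opt.zero_grad()
    t_data, t_ref = model(y, y_ref)
    loss = -dv_objective(t_data, t_ref)    # ascend the DV bound on D_Y^(n)
    loss.backward()
    opt.step()
print("estimated D_Y term:", dv_objective(*model(y, y_ref)).item())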

Algorithm 2 Capacity Estimation
input: continuous channel, feedback indicator.
output: \hat{I}_{D_n}(X → Y, \mu), the estimated capacity.
Initialize the DINE parameters \theta_Y, \theta_{Y\|X}
Initialize the NDT parameters \mu
if feedback indicator then
    Add feedback to the NDT
repeat
    Step 1: Train DINE
        Generate B sequences of length T of i.i.d. random noise
        Compute D_B = {(x_i^T, y_i^T)}_{i=1}^{B} with the NDT and the channel
        Compute \hat{D}_{Y\|X}(\theta_{Y\|X}, D_B), \hat{D}_Y(\theta_Y, D_B)
        Update the DINE parameters:
            \theta_{Y\|X} ← \theta_{Y\|X} + \nabla \hat{D}_{Y\|X}(\theta_{Y\|X}, D_B)
            \theta_Y ← \theta_Y + \nabla \hat{D}_Y(\theta_Y, D_B)
    Step 2: Train NDT
        Generate B sequences of length T of i.i.d. random noise
        Compute D_B = {(x_i^T, y_i^T)}_{i=1}^{B} with the NDT and the channel
        Compute the objective:
            \hat{I}_{D_B}(X → Y, \mu) = \hat{D}_{Y\|X}(\theta_{Y\|X}, D_B) - \hat{D}_Y(\theta_Y, D_B)
        Update the NDT parameters:
            \mu ← \mu + \nabla_{\mu} \hat{I}_{D_B}(X → Y, \mu)
until convergence
Monte Carlo evaluation of \hat{I}_{D_n}(X → Y, \mu)
return \hat{I}_{D_n}(X → Y, \mu)

Fig. 3. End-to-end architecture for estimating \hat{D}_Y(\theta_Y, D_n). For each batch of time sequences, a batch of the same size is sampled from the reference measure. Together, these samples are fed into the NN to compute the potentials T_{\theta_Y} on the data and reference samples, from which the estimate is assembled.

B. Neural Distribution Transformer

The DINE model is an effective approach to estimate the argument of (2). However, finding the capacity also requires maximizing the DI with respect to the input distribution. For this purpose we present the NDT model, which represents a general input distribution of the channel. At each iteration i = 1, ..., n, the NDT maps an i.i.d. noise vector N_i to a channel input variable X_i. When feedback is present, the NDT maps (N_i, Y^{i-1}) → X_i. Thus, the NDT is represented by an RNN with parameters \mu, as shown in Fig. 4. The NDT model is used to generate the channel input X^n, and DINE estimates the DI between X^n and Y^n.
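A minimal sketch of an NDT along these lines is given below: an LSTM cell maps i.i.d. Gaussian noise (optionally concatenated with the previous channel output when feedback is used) to a channel input, and a final normalization rescales the batch to respect the average power constraint. The architecture sizes and the batch-wise normalization rule are illustrative assumptions, not necessarily those of the released implementation.

import torch
import torch.nn as nn

class NDT(nn.Module):
    """Sketch of a neural distribution transformer with optional feedback."""
    def __init__(self, noise_dim=1, hidden=32, feedback=False, power=1.0):
        super().__init__()
        in_dim = noise_dim + (1 if feedback else 0)
        self.cell = nn.LSTMCell(in_dim, hidden)
        self.out = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
        self.feedback = feedback
        self.power = power

    def step(self, noise, state, y_prev=None):
        """Map (N_i, Y_{i-1}) to an unnormalized channel input X_i."""
        inp = noise if not self.feedback else torch.cat([noise, y_prev], dim=-1)
        h, c = self.cell(inp, state)
        return self.out(h), (h, c)

    def normalize(self, x, eps=1e-8):
        """Rescale a batch of inputs so that E[X^2] <= P (batch-wise, only downscaling)."""
        scale = torch.sqrt(torch.tensor(self.power) / (x.pow(2).mean() + eps))
        return x * torch.clamp(scale, max=1.0)

# Example: generate a batch of input sequences (no feedback).
torch.manual_seed(0)
ndt, B, T = NDT(feedback=False, power=1.0), 64, 20
state = (torch.zeros(B, 32), torch.zeros(B, 32))
xs = []
for i in range(T):
    x_i, state = ndt.step(torch.randn(B, 1), state)
    xs.append(ndt.normalize(x_i))
x = torch.stack(xs)    # shape (T, B, 1), fed to the channel as a black box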

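Coupling the two models follows the alternating structure of Algorithm 2. The skeleton below is purely structural; the three callables are hypothetical placeholders standing in for the DINE update step, the NDT update step, and the final long Monte-Carlo evaluation of (14), and are not part of the released code.

def estimate_capacity(dine_step, ndt_step, monte_carlo_estimate, num_iterations=50_000):
    """Alternating optimization skeleton of Algorithm 2."""
    for _ in range(num_iterations):
        dine_step()                  # Step 1: tighten the DI estimate for the current NDT
        ndt_step()                   # Step 2: push the NDT to increase the estimated DI
    return monte_carlo_estimate()    # long evaluation of (14) on fresh samples

# Dummy usage with no-op steps, just to show the calling convention.
print(estimate_capacity(lambda: None, lambda: None, lambda: 0.0, num_iterations=3))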
Fig. 4. The NDT. The noise and the past channel output (if feedback is applied) are fed into an NN. The last layer performs a normalization to obey the power constraint, if needed.

C. Complete Architecture Layout

Combining the DINE and NDT models into a complete system enables capacity estimation. As shown in Fig. 1, the NDT model is fed with i.i.d. noise and its output is the sample sequence X^n. These samples are fed into the channel to generate outputs. Then, DINE uses (X^n, Y^n) to produce the estimate \hat{I}_{D_n}(X → Y). To estimate capacity, the DINE and NDT models are trained together. The training scheme, as shown in Algorithm 2, is a variant of an alternating maximization procedure. This procedure iterates between updating the DINE parameters \theta and the NDT parameters \mu, each time keeping one of the models fixed. At the end of training, a long Monte-Carlo evaluation of ~10^6 samples is done in order to estimate the expectations in (13). Applying this algorithm to channels with memory estimates their capacity without any specific knowledge of the underlying channel distribution. Next, we demonstrate the effectiveness of this algorithm on continuous alphabet channels.

III. NUMERICAL RESULTS

We demonstrate the performance of Algorithm 2 on the AWGN channel and on the first order MA-AGN channel. The numerical results are then compared with the analytic solutions to verify the effectiveness of the proposed method.

A. AWGN channel

The power constrained AWGN channel is considered. This is an instance of a memoryless, continuous-alphabet channel for which the analytic solution is known. The channel model is

    Y_i = X_i + Z_i,    i \in \mathbb{N},    (17)

where Z_i \sim \mathcal{N}(0, \sigma^2) are i.i.d. RVs, and X_i is the channel input sequence bound to the power constraint \mathbb{E}[X_i^2] \le P. The capacity of this channel is given by C = \frac{1}{2}\log(1 + P/\sigma^2). In our implementation we chose \sigma^2 = 1 and estimated the capacity for a range of P values. The numerical results are compared to the analytic solution in Fig. 5, where a clear correspondence is seen.

B. Gaussian MA(1) channel

We consider both the FB (C_FB) and the FF (C_FF) capacity of the MA(1) Gaussian channel. The model here is

    Z_i = \alpha U_{i-1} + U_i,
    Y_i = X_i + Z_i,    (18)

where U_i \sim \mathcal{N}(0, 1) are i.i.d., X_i is the channel input sequence bound to the power constraint \mathbb{E}[X_i^2] \le P, and Y_i is the channel output.
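Both channels of this section can be exposed to Algorithm 2 purely as samplers. The snippets below are a sketch of such black-box samplers matching (17)-(18), together with the closed-form AWGN benchmark; the MA coefficient value in the example call is an arbitrary choice of ours, and U_{-1} is taken as zero for the first symbol.

import numpy as np

def awgn_channel(x, sigma2=1.0):
    """AWGN channel (17): Y_i = X_i + Z_i with Z_i ~ N(0, sigma2)."""
    return np.asarray(x) + np.random.normal(scale=np.sqrt(sigma2), size=np.shape(x))

def ma1_agn_channel(x, alpha=0.5):
    """MA(1)-AGN channel (18): Z_i = alpha*U_{i-1} + U_i, Y_i = X_i + Z_i.

    x has shape (T,) or (T, batch); axis 0 is time and U is i.i.d. N(0, 1).
    """
    x = np.asarray(x)
    u = np.random.standard_normal(size=x.shape)
    z = u.copy()
    z[1:] += alpha * u[:-1]        # Z_i = U_i + alpha * U_{i-1}
    return x + z

def awgn_capacity(p, sigma2=1.0):
    """Closed-form benchmark C = 0.5 * log(1 + P / sigma2), in nats."""
    return 0.5 * np.log(1.0 + p / sigma2)

print(awgn_capacity(1.0))   # ~0.3466 nats at SNR = 0 dB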

Fig. 5. Estimation of the AWGN channel capacity for various SNR values.

Fig. 7. Performance of C_FB estimation in the MA(1)-AGN channel.

1) Feedforward capacity: The FF capacity of the MA(1) Gaussian channel with an input power constraint can be obtained via the water-filling algorithm [14]. This is the benchmark against which we compare the quality of the C_FF estimate produced by Algorithm 2. Results are shown in Fig. 6.
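The water-filling benchmark can be computed directly from the noise power spectral density of (18), N(\omega) = |1 + \alpha e^{-j\omega}|^2 = 1 + \alpha^2 + 2\alpha\cos\omega. The sketch below is our own discretization of the standard procedure in [14] (uniform frequency grid and bisection on the water level), not the exact benchmark code.

import numpy as np

def ma1_ff_capacity(p, alpha, num_freq=4096):
    """Water-filling FF capacity (nats per channel use) of the MA(1)-AGN
    channel (18) under the average power constraint E[X^2] <= p."""
    w = np.linspace(0.0, np.pi, num_freq)
    noise_psd = 1.0 + alpha**2 + 2.0 * alpha * np.cos(w)   # N(omega) = |1 + alpha e^{-jw}|^2

    def allocated_power(level):
        # (1/pi) * integral over [0, pi] of (level - N(w))^+, via a uniform-grid average.
        return np.maximum(level - noise_psd, 0.0).mean()

    lo, hi = noise_psd.min(), noise_psd.max() + p          # bracket the water level
    for _ in range(100):                                   # bisection
        mid = 0.5 * (lo + hi)
        if allocated_power(mid) < p:
            lo = mid
        else:
            hi = mid
    signal_psd = np.maximum(0.5 * (lo + hi) - noise_psd, 0.0)
    return 0.5 * np.log1p(signal_psd / noise_psd).mean()

# Example benchmark point (alpha is a model parameter, chosen here arbitrarily).
print(ma1_ff_capacity(p=1.0, alpha=0.5))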

2) Feedback capacity: Computing the FB capacity of the ARMA(k) Gaussian channel can be formulated as a dynamic program, which is then solved via an iterative algorithm [11]. For the particular case of (18), C_FB is given by -\log(x_0), where x_0 is a root of a 4th order polynomial equation. The estimates of C_FB produced by Algorithm 2 are compared to the analytic solutions in Fig. 7. The optimization dynamics of our algorithm are shown in Fig. 8.

Fig. 6. Performance of C_FF estimation in the MA(1)-AGN channel.

Fig. 8. Optimization progress of the DI rate in Algorithm 2 for the FB setting with P = 1. The information rates were estimated by a Monte-Carlo evaluation of (14) with 10^5 samples.

IV. CONCLUSION AND FUTURE WORK

We presented a methodology for estimating FF and FB capacities that uses the channel as a black-box, i.e., without assuming the channel model is known and relying only on its output samples. The main building blocks were a novel DI estimator (DINE) and the NDT model, both implemented based on RNNs. The performance of the estimator was tested on the AWGN and MA(1)-AGN channels, showing estimates that agree well with the analytic solutions.

Despite the empirical effectiveness of DINE, we stress that it is neither a lower nor an upper bound on the true DI (see (6)-(7)). A main goal going forward is to revise DINE so that it provably lower bounds the true value. This would imply that the induced capacity estimator lower bounds the theoretical fundamental limit. Extension of our method to multiuser channels is also of interest, as capacity results in multiuser information theory are quite scarce. Another objective is coupling DINE with theoretical performance guarantees.

REFERENCES

[1] R. G. Gallager. Information theory and reliable communication. Vol. 2. New York: Wiley, 1968.
[2] M. I. Belghazi, A. Baratin, S. Rajeswar, S. Ozair, Y. Bengio, A. Courville, and R. D. Hjelm. MINE: Mutual information neural estimation. arXiv preprint arXiv:1801.04062. June 2018.
[3] C. Chan, A. Al-Bashabsheh, H. P. Huang, M. Lim, D. S. H. Tam, and C. Zhao. Neural entropic estimation: A faster path to mutual information estimation. arXiv preprint arXiv:1905.12957. May 2019.
[4] M. Donsker and S. Varadhan. Asymptotic evaluation of certain Markov process expectations for large time, IV. Communications on Pure and Applied Mathematics, 36(2):183-212. March 1983.
[5] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation 9(8): 1735-1780. November 1997.
[6] A. M. Schaefer and H. G. Zimmermann. Recurrent neural networks are universal approximators. International Journal of Neural Systems 17.04: 253-263. 2007.
[7] L. Breiman. The individual ergodic theorem of information theory. The Annals of Mathematical Statistics: 809-811. September 1957.
[8] J. Massey. Causality, feedback, and directed information. Proc. Int. Symp. Inf. Theory Appl., pp. 303-305. November 1990.
[9] R. Fritschek, R. F. Schaefer, and G. Wunder. Deep learning for channel coding via neural mutual information estimation. arXiv preprint arXiv:1903.02865. March 2019.
[10] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks 2.5: 359-366. March 1989.
[11] S. Yang, A. Kavcic, and S. Tatikonda. On the feedback capacity of power-constrained Gaussian noise channels with memory. IEEE Trans. Inf. Theory 53.3: 929-954. March 2007.
[12] Y. H. Kim. Feedback capacity of the first-order moving average Gaussian channel. IEEE Trans. Inf. Theory 52.7: 3063-3079. July 2006.
[13] Y. H. Kim. Feedback capacity of stationary Gaussian channels. IEEE Trans. Inf. Theory 56.1: 57-85. January 2010.
[14] T. M. Cover and J. A. Thomas. Elements of information theory. John Wiley and Sons, 2012.
[15] W. Hirt and J. L. Massey. Capacity of the discrete-time Gaussian channel with intersymbol interference. IEEE Trans. Inf. Theory 34.3: 38-38. May 1988.
[16] J. Zhang, O. Simeone, Z. Cvetkovic, E. Abela, and M. Richardson. ITENE: Intrinsic transfer entropy neural estimator. arXiv preprint arXiv:1912.07277. January 2020.
[17] Y. H. Kim. A coding theorem for a class of stationary channels with feedback. IEEE Trans. Inf. Theory 54.4: 1488-1499. April 2008.
[18] S. Yang, A. Kavcic, and S. Tatikonda. Feedback capacity of stationary sources over Gaussian intersymbol interference channels. GLOBECOM, 2006.
[19] Z. Aharoni, O. Sabag, and H. H. Permuter. Computing the feedback capacity of finite state channels using reinforcement learning. 2019 IEEE International Symposium on Information Theory (ISIT). IEEE, 2019.
[20] S. Molavipour, G. Bassi, and M. Skoglund. Conditional mutual information neural estimator. arXiv preprint arXiv:1911.02277. November 2019.
[21] H. H. Permuter, Y. H. Kim, and T. Weissman. Interpretations of directed information in portfolio theory, data compression, and hypothesis testing. IEEE Trans. Inf. Theory 57.6: 3248-3259. June 2011.
[22] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114. 2013.
[23] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (pp. 2672-2680). 2014.
[24] G. Kramer. Directed information for channels with feedback. Hartung-Gorre, 1998.
[25] L. Zhao, H. Permuter, Y. Kim, and T. Weissman. Universal estimation of directed information. IEEE Trans. Inf. Theory 59.10: 6220-6242. October 2013.
[26] I. Kontoyiannis and M. Skoularidou. Estimating the directed information and testing for causality. IEEE Trans. Inf. Theory 62.11: 6053-6067. November 2016.
[27] C. J. Quinn, T. P. Coleman, N. Kiyavash, et al. Estimating the directed information to infer causal relationships in ensemble neural spike train recordings. J. Comput. Neurosci. 30, 17-44. June 2010.