arXiv:2003.04179v2 [cs.IT] 16 May 2020

Capacity of Continuous Channels with Memory via Directed Information Neural Estimator

Ziv Aharoni (Ben Gurion University), Dor Tsur (Ben Gurion University), Ziv Goldfeld (Cornell University), Haim H. Permuter (Ben Gurion University)

Abstract—Calculating the capacity (with or without feedback) of channels with memory and continuous alphabets is a challenging task. It requires optimizing the directed information (DI) rate over all channel input distributions. The objective is a multi-letter expression, whose analytic solution is known only for a few specific cases. When no analytic solution is available or the channel model is unknown, there is no unified framework for calculating or even approximating capacity. This work proposes a novel capacity estimation algorithm that treats the channel as a 'black-box', both when feedback is and is not present. The algorithm has two main ingredients: (i) a neural distribution transformer (NDT) model that shapes the input distribution, and (ii) the DI neural estimator (DINE) that estimates the communication rate of the current NDT model. These models are trained by an alternating maximization procedure to both estimate the channel capacity and obtain an NDT for the optimal input distribution. The method is demonstrated on the moving average additive Gaussian noise channel, where it is shown that both the capacity and the feedback capacity are estimated without knowledge of the channel transition kernel. The proposed estimation framework opens the door to a myriad of capacity approximation results for continuous alphabet channels that were inaccessible until now.

I. INTRODUCTION

Many discrete-time continuous-alphabet communication channels involve correlated noise or inter-symbol interference (ISI). Two predominant scenarios are when communication over such channels is performed with or without feedback from the receiver back to the transmitter. The respective fundamental rates of reliable communication over such channels are the feedback (FB) and feedforward (FF) capacity. Starting from the latter, the FF capacity of an n-fold point-to-point channel P_{Y^n|X^n} is given by [1]

    C_{FF} = \lim_{n\to\infty} \sup_{P_{X^n}} \frac{1}{n} I(X^n; Y^n).    (1)

In the presence of feedback, the FB capacity, denoted C_FB, is [17]

    C_{FB} = \lim_{n\to\infty} \sup_{P_{X^n \| Y^{n-1}}} \frac{1}{n} I(X^n \to Y^n),    (2)

where

    I(X^n \to Y^n) := \sum_{i=1}^{n} I(X^i; Y_i | Y^{i-1})    (3)

is the directed information (DI) from the input sequence X^n to the output Y^n [8], and P_{X^n \| Y^{n-1}} := \prod_{i=1}^{n} P_{X_i | X^{i-1}, Y^{i-1}} is the distribution of X^n causally conditioned on Y^{n-1} (see [21], [24] for further details). Built on (3), the DI rate is defined for stationary processes as

    I(X \to Y) := \lim_{n\to\infty} \frac{1}{n} I(X^n \to Y^n).    (4)

As shown in [8], when feedback is not present, the optimization problem (2) (which amounts to optimizing over P_{X^n \| Y^{n-1}} rather than P_{X^n}) coincides with (1). Thus, DI provides a unified framework for representing both FF and FB capacities.

Computing C_FF and C_FB requires solving a multi-letter optimization problem. Closed form solutions to this challenging task are known only in several special cases. Common examples for C_FF are the Gaussian channel with memory [14] and the ISI Gaussian channel [15]. There are no known non-trivial extensions of these solutions to the non-Gaussian case. For C_FB, a solution for the 1st order moving average additive Gaussian noise channel (MA(1)-AGN) was found in [12], and another closed form characterization is available for the auto-regressive moving-average (ARMA) AGN channel [11]. To the best of our knowledge, these are the only examples of continuous channels with memory whose FB capacity is known in closed form. Furthermore, when the channel model is unknown, there is no numerically tractable method for approximating the capacity from samples.

Recent progress related to capacity computation via deep learning (DL) was made in [9], where the mutual information neural estimator (MINE) [2] was used to learn modulations for memoryless channels. Later, [19] proposed a reinforcement learning based algorithm that estimates and maximizes the DI rate, but only for discrete alphabet channels with a known channel model.

Inspired by the above, we develop a framework for estimating the FF and FB capacity of continuous-alphabet channels with memory, without knowing the channel model. Our method does not require the channel transition kernel; we only assume that the channel is stationary and that its outputs can be sampled by feeding it with inputs. Central to our method are a new DI neural estimator (DINE), used to evaluate the communication rate, and a neural distribution transformer (NDT), used to simulate input distributions. Together, DINE and NDT lay the groundwork for our capacity estimation algorithm. In the remainder of this section, we describe DINE, NDT, and their integration into the capacity estimator.

A. Directed Information Neural Estimation

The estimation of mutual information (MI) from samples using neural networks (NNs) is a recently proposed approach [2], [3]. It is especially effective when the involved random variables (RVs) are continuous. The concept originated from [2], where MINE was proposed. The core idea is to represent MI using the Donsker-Varadhan (DV) variational formula

    I(X; Y) = \sup_{T: \mathcal{X}\times\mathcal{Y} \to \mathbb{R}} \mathbb{E}[T(X, Y)] - \log \mathbb{E}\big[e^{T(\tilde{X}, \tilde{Y})}\big],    (5)

where (X, Y) \sim P_{XY} and (\tilde{X}, \tilde{Y}) \sim P_X \otimes P_Y. The supremum is over all measurable functions T for which both expectations are finite. Parameterizing T by an NN and replacing the expectations with empirical averages enables gradient ascent optimization to estimate I(X; Y).
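For concreteness, the following is a minimal sketch of such a DV-based MI estimator. It is not taken from [2] nor from our released implementation; the MLP size, optimizer settings, and the toy jointly Gaussian data are illustrative assumptions, and PyTorch is used here purely for the sketch.

import torch
import torch.nn as nn

class DVPotential(nn.Module):
    """Simple MLP realizing the DV potential T(x, y)."""
    def __init__(self, dim_x, dim_y, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_x + dim_y, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))
    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=-1)).squeeze(-1)

def dv_lower_bound(T, x, y):
    """Empirical DV bound: E[T(x,y)] - log E[exp(T(x, y_shuffled))], cf. (5)."""
    joint = T(x, y).mean()
    y_marg = y[torch.randperm(y.shape[0])]            # approximate samples from P_X (x) P_Y
    marginal = torch.logsumexp(T(x, y_marg), dim=0) - torch.log(torch.tensor(float(x.shape[0])))
    return joint - marginal

# Toy example: X ~ N(0,1), Y = X + N(0,1); true MI = 0.5*log(2) ~= 0.35 nats.
torch.manual_seed(0)
x = torch.randn(4096, 1)
y = x + torch.randn(4096, 1)
T = DVPotential(1, 1)
opt = torch.optim.Adam(T.parameters(), lr=1e-3)
for step in range(2000):
    opt.zero_grad()
    loss = -dv_lower_bound(T, x, y)    # gradient ascent on the DV bound
    loss.backward()
    opt.step()
print("estimated MI (nats):", dv_lower_bound(T, x, y).item())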
A variant of MINE that goes through estimating the underlying entropy terms was proposed in [3]. The new estimators were shown empirically to perform extremely well, especially for continuous alphabets.

Herein, we propose a new estimator for the DI rate I(X → Y). The DI is factorized as

    I(X^n \to Y^n) = h(Y^n) - h(Y^n \| X^n),    (6)

where h(Y^n) is the differential entropy of Y^n and h(Y^n \| X^n) := \sum_{i=1}^{n} h(Y_i | Y^{i-1}, X^i). Applying the approach of [3] to the entropy terms, we expand each as a Kullback-Leibler (KL) divergence plus a cross-entropy (CE) residual and invoke the DV representation. To account for memory, we derive a formula valid for causally dependent data, which involves RNNs as function approximators (rather than the FF network used in the independently and identically distributed (i.i.d.) case). Thus, DINE is an RNN-based estimator of the DI rate from X^n to Y^n based on their samples.

Estimation of DI between discrete-valued processes was studied in [25]-[27]. An estimator of the transfer entropy, which upper bounds the DI for jointly Markov processes with finite memory, was proposed in [16]. DINE, on the other hand, assumes neither Markovity nor discrete alphabets, and can be applied to continuous-valued stationary and ergodic processes. A detailed description of the DINE algorithm is given in Subsection II-A.

B. Neural Distribution Transformer and Capacity Estimation

DINE accounts for one of the two tasks involved in estimating capacity: it estimates the objective of (2). It then remains to optimize this objective over input distributions. To that end, we design a deep generative model, termed the NDT, to approximate the channel input distribution. This is similar in flavor to the generators used in generative adversarial networks [23]. The NDT maps i.i.d. noise into samples of the channel input distribution. For estimating FB capacity, the NDT also receives the channel feedback as input, in addition to the i.i.d. noise. Together, NDT and DINE form the overall system that estimates the capacity, as shown in Fig. 1.

Fig. 1. The overall capacity estimator: NDT generates samples that are fed into the channel. DINE uses these samples to improve its estimation of the communication rate. DINE then supplies the gradient for the optimization of the NDT.

The capacity estimation algorithm trains the DINE and NDT models together via an alternating optimization procedure (i.e., fixing the parameters of one model while training the other). DINE estimates the communication rate of a fixed NDT input distribution, and the NDT is trained to increase its rate with respect to a fixed DINE model. Proceeding until convergence, this results in the capacity estimate, as well as an NDT generative model for the achieving input distribution. We demonstrate our method on the MA(1)-AGN channel. Both C_FF and C_FB are estimated using the same algorithm, using the channel as a black-box solely to generate samples. The estimation results are compared with the analytic solutions to show the effectiveness of the proposed approach.

II. METHODOLOGY

We give a high-level description of the algorithm and its building blocks. Due to space limitations, full details are reserved for the extended version of this paper. The implementation is available on GitHub.†

† https://github.com/zivaharoni/capacity-estimator-via-dine

A. Directed Information Estimation Method

We propose a new estimator of the DI rate between two correlated stationary processes, termed DINE. Building on [3], we factorize each term in (6) as

    h(Y^n) = h_{CE}(P_{Y^n}, P_{Y^{n-1}} \otimes P_{\tilde{Y}}) - D_{KL}(P_{Y^n} \| P_{Y^{n-1}} \otimes P_{\tilde{Y}})
    h(Y^n \| X^n) = h_{CE}(P_{Y^n \| X^n}, P_{Y^{n-1} \| X^{n-1}} \otimes P_{\tilde{Y}} | P_{X^n}) - D_{KL}(P_{Y^n \| X^n} \| P_{Y^{n-1} \| X^{n-1}} \otimes P_{\tilde{Y}} | P_{X^n}),    (7)

where h_{CE}(P_X, Q_X) and D_{KL}(P_X \| Q_X) are, respectively, the CE and the KL divergence between P_X and Q_X, with

    h_{CE}(P_{Y|X}, Q_{Y|X} | P_X) := \int_{\mathcal{X}} h_{CE}(P_{Y|X=x}, Q_{Y|X=x}) \, dP_X(x)
    D_{KL}(P_{Y|X} \| Q_{Y|X} | P_X) := \int_{\mathcal{X}} D_{KL}(P_{Y|X=x} \| Q_{Y|X=x}) \, dP_X(x)    (8)

denoting their conditional versions, and P_{\tilde{Y}} is a uniform reference measure over the support of the dataset. To simplify notation, we use the shorthands

    D_Y^{(n)} := D_{KL}(P_{Y^n} \| P_{Y^{n-1}} \otimes P_{\tilde{Y}})
    D_{Y\|X}^{(n)} := D_{KL}(P_{Y^n \| X^n} \| P_{Y^{n-1} \| X^{n-1}} \otimes P_{\tilde{Y}} | P_{X^n}).    (9)
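One simple way to realize the uniform reference measure P_{\tilde{Y}} in practice is to draw reference samples uniformly over the per-coordinate range of the observed outputs. The helper below is a sketch under that assumption; the released implementation may construct the reference differently.

import numpy as np

def sample_uniform_reference(y, num_samples, margin=0.0):
    """Draw reference samples uniformly over the (boxed) support of the data.

    y: array of shape (n, dim) holding observed channel outputs.
    margin: optional slack added around the empirical support.
    """
    lo = y.min(axis=0) - margin
    hi = y.max(axis=0) + margin
    return np.random.uniform(lo, hi, size=(num_samples, y.shape[1]))

# Example: reference samples matched to a batch of scalar outputs.
y = np.random.randn(1000, 1)
y_ref = sample_uniform_reference(y, num_samples=1000)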
Subtracting the two identities in (7), and observing that the difference of the CE terms equals the DI at the former time step, we have

    I(X^n \to Y^n) = I(X^{n-1} \to Y^{n-1}) + D_{Y\|X}^{(n)} - D_Y^{(n)}.    (10)

Note that the difference of KL divergences equals I(X^n; Y_n | Y^{n-1}). For stationary processes we take the limit and obtain

    \lim_{n\to\infty} \big( D_{Y\|X}^{(n)} - D_Y^{(n)} \big) = \lim_{n\to\infty} I(X^n; Y_n | Y^{n-1}) = I(X \to Y).    (11)

Each D_KL is expanded by its DV representation [4] as

    D_Y^{(n)} = \sup_{T: \Omega \to \mathbb{R}} \mathbb{E}[T(Y^n)] - \log \mathbb{E}\big[e^{T(Y^{n-1}, \tilde{Y})}\big]
    D_{Y\|X}^{(n)} = \sup_{T: \Omega \to \mathbb{R}} \mathbb{E}[T(Y^n \| X^n)] - \log \mathbb{E}\big[e^{T(Y^{n-1} \| X^{n-1}, \tilde{Y})}\big].    (12)

To maximize (12), each DV potential is parametrized by a modified LSTM, and expected values are estimated by empirical averages over the dataset D_n := {(x_i, y_i)}_{i=1}^{n}. Thus, the optimization objectives are

    \hat{D}_{Y\|X}(\theta_{Y\|X}, D_n) := \frac{1}{n}\sum_{i=1}^{n} T_{\theta_{Y\|X}}(y_i | x^i, y^{i-1}) - \log\bigg( \frac{1}{n}\sum_{i=1}^{n} e^{T_{\theta_{Y\|X}}(\tilde{y}_i | x^i, y^{i-1})} \bigg)
    \hat{D}_Y(\theta_Y, D_n) := \frac{1}{n}\sum_{i=1}^{n} T_{\theta_Y}(y_i | y^{i-1}) - \log\bigg( \frac{1}{n}\sum_{i=1}^{n} e^{T_{\theta_Y}(\tilde{y}_i | y^{i-1})} \bigg),    (13)

where \tilde{y}_i \sim P_{\tilde{Y}} i.i.d. and T_{\theta_Y}, T_{\theta_{Y\|X}} are the parametrized potentials. The estimator is given by

    \hat{I}_{D_n}(X \to Y) := \sup_{\theta_{Y\|X} \in \Theta_{Y\|X}} \hat{D}_{Y\|X} - \sup_{\theta_Y \in \Theta_Y} \hat{D}_Y.    (14)

By the universal approximation of RNNs [6] and Breiman's theorem [7], the maximizer of (14) approaches I(X → Y) as the number of samples grows, provided the neural networks are sufficiently expressive.

To capture the time dependencies in D_n, we introduce a modified LSTM network for functional approximation. The LSTM [5] is an RNN that receives a time series {y_i}_{i=1}^{T} as input and, for each i, performs a recursive non-linear transform to calculate its hidden state s_i. We denote the LSTM map by F: (y_i, s_{i-1}) \mapsto s_i; the full characterization of F is provided in [5]. We modify the structure of the LSTM to perform the calculations

    s_i = F(y_i, s_{i-1}) = s(y_i | y^{i-1}),    \tilde{s}_i = F(\tilde{y}_i, s_{i-1}) = s(\tilde{y}_i | y^{i-1}).    (15)

A similar modification is introduced for \hat{D}_{Y\|X} by substituting y_i with (y_i, x_i) and \tilde{y}_i with (\tilde{y}_i, x_i):

    s_i = F(y_i, x_i, s_{i-1}) = s(y_i | y^{i-1}, x^i),    \tilde{s}_i = F(\tilde{y}_i, x_i, s_{i-1}) = s(\tilde{y}_i | y^{i-1}, x^i).    (16)

A visualization of a modified LSTM cell (unrolled) is shown in Fig. 2. The LSTM cell's output is the sequence {(s_i, \tilde{s}_i)}_{i=1}^{n}, which is fed into a fully-connected layer to obtain T_{\theta_Y} and T_{\theta_{Y\|X}}. As demonstrated by Algorithm 1 and Fig. 3, in each iteration we draw D_B, a subset of D_n of size B. We feed the NN with D_B to acquire T_{\theta_Y} and T_{\theta_{Y\|X}}; these enter the loss function (13), and gradients are calculated to update the NN parameters \theta_Y, \theta_{Y\|X}.

Algorithm 1 Directed Information Rate Estimation
input: samples of the process, D_n.
output: \hat{I}_{D_n}(X → Y), the estimated directed information rate.
Initialize the network parameters \theta_Y, \theta_{Y\|X}.
Step 1, Optimization:
repeat
    Draw a batch D_B = {(x_{(i-1)T}^{iT}, y_{(i-1)T}^{iT})}_{i=1}^{B}
    Feed the network with the examples and compute the losses \hat{D}_{Y\|X}(\theta_{Y\|X}, D_B), \hat{D}_Y(\theta_Y, D_B)
    Update the network parameters:
        \theta_{Y\|X} ← \theta_{Y\|X} + \nabla \hat{D}_{Y\|X}(\theta_{Y\|X}, D_B)
        \theta_Y ← \theta_Y + \nabla \hat{D}_Y(\theta_Y, D_B)
until convergence
Step 2, Perform a Monte Carlo estimation over D_n and subtract the loss evaluations to obtain the estimate:
    \hat{I}_{D_n}(X → Y) = \hat{D}_{Y\|X}(\theta_{Y\|X}, D_n) - \hat{D}_Y(\theta_Y, D_n)

Fig. 2. The modified LSTM cell unrolled in the DINE architecture of \hat{D}_Y. Recursively, at each time i, (y_i, s_{i-1}) and (\tilde{y}_i, s_{i-1}) are mapped to s_i and \tilde{s}_i, respectively.
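To illustrate how (13) and (15) fit together, the following PyTorch sketch implements the \hat{D}_Y branch only: an LSTM cell driven by the data supplies the state s_{i-1}, the same cell evaluates the reference sample \tilde{y}_i against that state, and a dense layer produces the DV potentials. The layer sizes, optimizer, and toy data are illustrative assumptions, the code is a simplified stand-in for the released implementation, and the \hat{D}_{Y\|X} branch is obtained analogously by feeding (y_i, x_i) pairs.

import torch
import torch.nn as nn

class DYPotential(nn.Module):
    """Sketch of the D_Y branch of DINE: modified LSTM cell + dense DV layer."""
    def __init__(self, dim_y, hidden=32):
        super().__init__()
        self.cell = nn.LSTMCell(dim_y, hidden)
        self.dv = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                nn.Linear(hidden, 1))

    def forward(self, y, y_ref):
        """y, y_ref: tensors of shape (T, batch, dim_y). Returns (T_data, T_ref)."""
        batch = y.shape[1]
        h = torch.zeros(batch, self.cell.hidden_size)
        c = torch.zeros(batch, self.cell.hidden_size)
        t_data, t_ref = [], []
        for i in range(y.shape[0]):
            # Reference branch: evaluate y_ref[i] against the *data* state s_{i-1}, cf. (15).
            h_ref, _ = self.cell(y_ref[i], (h, c))
            # Data branch: advance the state with the true sample y[i].
            h, c = self.cell(y[i], (h, c))
            t_data.append(self.dv(h))
            t_ref.append(self.dv(h_ref))
        return torch.cat(t_data), torch.cat(t_ref)

def dv_objective(t_data, t_ref):
    """Empirical DV objective, cf. (13)."""
    n = torch.tensor(float(t_ref.numel()))
    return t_data.mean() - (torch.logsumexp(t_ref.flatten(), dim=0) - torch.log(n))

# Toy usage: correlated Gaussian outputs, uniform reference over their range.
torch.manual_seed(0)
T_len, B = 20, 64
y = torch.randn(T_len, B, 1).cumsum(dim=0) * 0.1
y_ref = torch.rand_like(y) * (y.max() - y.min()) + y.min()
model = DYPotential(dim_y=1)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(500):
    opt.zero_grad()
    t_data, t_ref = model(y, y_ref)
    loss = -dv_objective(t_data, t_ref)    # ascend the DV bound on D_Y^(n)
    loss.backward()
    opt.step()
print("estimated D_Y term:", dv_objective(*model(y, y_ref)).item())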

Algorithm 2 Capacity Estimation
input: continuous channel, feedback indicator.
output: \hat{I}_{D_n}(X → Y, \mu), the estimated capacity.
Initialize the DINE parameters \theta_Y, \theta_{Y\|X}
Initialize the NDT parameters \mu
if feedback indicator then
    Add feedback to the NDT
repeat
    Step 1: Train DINE
        Generate B sequences of length T of i.i.d. random noise
        Compute D_B = {(x_i^T, y_i^T)}_{i=1}^{B} with the NDT and the channel
        Compute \hat{D}_{Y\|X}(\theta_{Y\|X}, D_B), \hat{D}_Y(\theta_Y, D_B)
        Update the DINE parameters:
            \theta_{Y\|X} ← \theta_{Y\|X} + \nabla \hat{D}_{Y\|X}(\theta_{Y\|X}, D_B)
            \theta_Y ← \theta_Y + \nabla \hat{D}_Y(\theta_Y, D_B)
    Step 2: Train NDT
        Generate B sequences of length T of i.i.d. random noise
        Compute D_B = {(x_i^T, y_i^T)}_{i=1}^{B} with the NDT and the channel
        Compute the objective:
            \hat{I}_{D_B}(X → Y, \mu) = \hat{D}_{Y\|X}(\theta_{Y\|X}, D_B) - \hat{D}_Y(\theta_Y, D_B)
        Update the NDT parameters:
            \mu ← \mu + \nabla_{\mu} \hat{I}_{D_B}(X → Y, \mu)
until convergence
Monte Carlo evaluation of \hat{I}_{D_n}(X → Y, \mu)
return \hat{I}_{D_n}(X → Y, \mu)

Fig. 3. End-to-end architecture for estimating \hat{D}_Y(\theta_Y, D_n). For each batch of time sequences, a batch of the same size is sampled from the reference measure. Together, these samples are fed into the NN to compute the potentials T_{\theta_Y} on the data and reference samples, from which the estimate is assembled.

B. Neural Distribution Transformer

The DINE model is an effective approach to estimate the argument of (2). However, finding the capacity also requires maximizing the DI with respect to the input distribution. For this purpose we present the NDT model, which represents a general input distribution of the channel. At each iteration i = 1, ..., n, the NDT maps an i.i.d. noise vector N_i to a channel input variable X_i. When feedback is present, the NDT maps (N_i, Y^{i-1}) → X_i. Thus, the NDT is represented by an RNN with parameters \mu, as shown in Fig. 4. The NDT model is used to generate the channel input X^n, and DINE estimates the DI between X^n and Y^n.
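A minimal sketch of an NDT along these lines is given below: an LSTM cell maps i.i.d. Gaussian noise (optionally concatenated with the previous channel output when feedback is used) to a channel input, and a final normalization rescales the batch to respect the average power constraint. The architecture sizes and the batch-wise normalization rule are illustrative assumptions, not necessarily those of the released implementation.

import torch
import torch.nn as nn

class NDT(nn.Module):
    """Sketch of a neural distribution transformer with optional feedback."""
    def __init__(self, noise_dim=1, hidden=32, feedback=False, power=1.0):
        super().__init__()
        in_dim = noise_dim + (1 if feedback else 0)
        self.cell = nn.LSTMCell(in_dim, hidden)
        self.out = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
        self.feedback = feedback
        self.power = power

    def step(self, noise, state, y_prev=None):
        """Map (N_i, Y_{i-1}) to an unnormalized channel input X_i."""
        inp = noise if not self.feedback else torch.cat([noise, y_prev], dim=-1)
        h, c = self.cell(inp, state)
        return self.out(h), (h, c)

    def normalize(self, x, eps=1e-8):
        """Rescale a batch of inputs so that E[X^2] <= P (batch-wise, only downscaling)."""
        scale = torch.sqrt(torch.tensor(self.power) / (x.pow(2).mean() + eps))
        return x * torch.clamp(scale, max=1.0)

# Example: generate a batch of input sequences (no feedback).
torch.manual_seed(0)
ndt, B, T = NDT(feedback=False, power=1.0), 64, 20
state = (torch.zeros(B, 32), torch.zeros(B, 32))
xs = []
for i in range(T):
    x_i, state = ndt.step(torch.randn(B, 1), state)
    xs.append(ndt.normalize(x_i))
x = torch.stack(xs)    # shape (T, B, 1), fed to the channel as a black box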

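Coupling the two models follows the alternating structure of Algorithm 2. The skeleton below is purely structural; the three callables are hypothetical placeholders standing in for the DINE update step, the NDT update step, and the final long Monte-Carlo evaluation of (14), and are not part of the released code.

def estimate_capacity(dine_step, ndt_step, monte_carlo_estimate, num_iterations=50_000):
    """Alternating optimization skeleton of Algorithm 2."""
    for _ in range(num_iterations):
        dine_step()                  # Step 1: tighten the DI estimate for the current NDT
        ndt_step()                   # Step 2: push the NDT to increase the estimated DI
    return monte_carlo_estimate()    # long evaluation of (14) on fresh samples

# Dummy usage with no-op steps, just to show the calling convention.
print(estimate_capacity(lambda: None, lambda: None, lambda: 0.0, num_iterations=3))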
Fig. 4. The NDT. The noise and the past channel output (if feedback is applied) are fed into an NN. The last layer performs a normalization to obey the power constraint, if needed.

C. Complete Architecture Layout

Combining the DINE and NDT models into a complete system enables capacity estimation. As shown in Fig. 1, the NDT model is fed with i.i.d. noise and its output is the sample sequence X^n. These samples are fed into the channel to generate outputs. Then, DINE uses (X^n, Y^n) to produce the estimate \hat{I}_{D_n}(X → Y). To estimate capacity, the DINE and NDT models are trained together. The training scheme, as shown in Algorithm 2, is a variant of an alternating maximization procedure. This procedure iterates between updating the DINE parameters \theta and the NDT parameters \mu, each time keeping one of the models fixed. At the end of training, a long Monte-Carlo evaluation of ~10^6 samples is done in order to estimate the expectations in (13). Applying this algorithm to channels with memory estimates their capacity without any specific knowledge of the underlying channel distribution. Next, we demonstrate the effectiveness of this algorithm on continuous alphabet channels.

III. NUMERICAL RESULTS

We demonstrate the performance of Algorithm 2 on the AWGN channel and on the first order MA-AGN channel. The numerical results are then compared with the analytic solutions to verify the effectiveness of the proposed method.

A. AWGN channel

The power constrained AWGN channel is considered. This is an instance of a memoryless, continuous-alphabet channel for which the analytic solution is known. The channel model is

    Y_i = X_i + Z_i,    i \in \mathbb{N},    (17)

where Z_i \sim \mathcal{N}(0, \sigma^2) are i.i.d. RVs, and X_i is the channel input sequence bound to the power constraint \mathbb{E}[X_i^2] \le P. The capacity of this channel is given by C = \frac{1}{2}\log(1 + P/\sigma^2). In our implementation we chose \sigma^2 = 1 and estimated the capacity for a range of P values. The numerical results are compared to the analytic solution in Fig. 5, where a clear correspondence is seen.

B. Gaussian MA(1) channel

We consider both the FB (C_FB) and the FF (C_FF) capacity of the MA(1) Gaussian channel. The model here is

    Z_i = \alpha U_{i-1} + U_i,
    Y_i = X_i + Z_i,    (18)

where U_i \sim \mathcal{N}(0, 1) are i.i.d., X_i is the channel input sequence bound to the power constraint \mathbb{E}[X_i^2] \le P, and Y_i is the channel output.
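Both channels of this section can be exposed to Algorithm 2 purely as samplers. The snippets below are a sketch of such black-box samplers matching (17)-(18), together with the closed-form AWGN benchmark; the MA coefficient value in the example call is an arbitrary choice of ours, and U_{-1} is taken as zero for the first symbol.

import numpy as np

def awgn_channel(x, sigma2=1.0):
    """AWGN channel (17): Y_i = X_i + Z_i with Z_i ~ N(0, sigma2)."""
    return np.asarray(x) + np.random.normal(scale=np.sqrt(sigma2), size=np.shape(x))

def ma1_agn_channel(x, alpha=0.5):
    """MA(1)-AGN channel (18): Z_i = alpha*U_{i-1} + U_i, Y_i = X_i + Z_i.

    x has shape (T,) or (T, batch); axis 0 is time and U is i.i.d. N(0, 1).
    """
    x = np.asarray(x)
    u = np.random.standard_normal(size=x.shape)
    z = u.copy()
    z[1:] += alpha * u[:-1]        # Z_i = U_i + alpha * U_{i-1}
    return x + z

def awgn_capacity(p, sigma2=1.0):
    """Closed-form benchmark C = 0.5 * log(1 + P / sigma2), in nats."""
    return 0.5 * np.log(1.0 + p / sigma2)

print(awgn_capacity(1.0))   # ~0.3466 nats at SNR = 0 dB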

Fig. 5. Estimation of the AWGN channel capacity for various SNR values.

Fig. 7. Performance of C_FB estimation in the MA(1)-AGN channel.

1) Feedforward capacity: The FF capacity of the MA(1) Gaussian channel with an input power constraint can be obtained via the water-filling algorithm [14]. This is the benchmark against which we compare the quality of the C_FF estimate produced by Algorithm 2. Results are shown in Fig. 6.
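The water-filling benchmark can be computed directly from the noise power spectral density of (18), N(\omega) = |1 + \alpha e^{-j\omega}|^2 = 1 + \alpha^2 + 2\alpha\cos\omega. The sketch below is our own discretization of the standard procedure in [14] (uniform frequency grid and bisection on the water level), not the exact benchmark code.

import numpy as np

def ma1_ff_capacity(p, alpha, num_freq=4096):
    """Water-filling FF capacity (nats per channel use) of the MA(1)-AGN
    channel (18) under the average power constraint E[X^2] <= p."""
    w = np.linspace(0.0, np.pi, num_freq)
    noise_psd = 1.0 + alpha**2 + 2.0 * alpha * np.cos(w)   # N(omega) = |1 + alpha e^{-jw}|^2

    def allocated_power(level):
        # (1/pi) * integral over [0, pi] of (level - N(w))^+, via a uniform-grid average.
        return np.maximum(level - noise_psd, 0.0).mean()

    lo, hi = noise_psd.min(), noise_psd.max() + p          # bracket the water level
    for _ in range(100):                                   # bisection
        mid = 0.5 * (lo + hi)
        if allocated_power(mid) < p:
            lo = mid
        else:
            hi = mid
    signal_psd = np.maximum(0.5 * (lo + hi) - noise_psd, 0.0)
    return 0.5 * np.log1p(signal_psd / noise_psd).mean()

# Example benchmark point (alpha is a model parameter, chosen here arbitrarily).
print(ma1_ff_capacity(p=1.0, alpha=0.5))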

2) Feedback capacity: Computing the FB capacity of the ARMA(k) Gaussian channel can be formulated as a dynamic program, which is then solved via an iterative algorithm [11]. For the particular case of (18), C_FB is given by -\log(x_0), where x_0 is a root of a 4th order polynomial equation. The estimates of C_FB produced by Algorithm 2 are compared to the analytic solutions in Fig. 7. The optimization dynamics of our algorithm are shown in Fig. 8.

Fig. 6. Performance of C_FF estimation in the MA(1)-AGN channel.

Fig. 8. Optimization progress of the DI rate in Algorithm 2 for the FB setting with P = 1. The information rates were estimated by a Monte-Carlo evaluation of (14) with 10^5 samples.

IV. CONCLUSION AND FUTURE WORK

We presented a methodology for estimating FF and FB capacities that uses the channel as a black-box, i.e., without assuming the channel model is known and relying only on its output samples. The main building blocks were a novel DI estimator (DINE) and the NDT model, both implemented based on RNNs. The performance of the estimator was tested on the AWGN and MA(1)-AGN channels, showing estimates that agree well with the analytic solutions.

Despite the empirical effectiveness of DINE, we stress that it is neither a lower nor an upper bound on the true DI (see (6)-(7)). A main goal going forward is to revise DINE so that it provably lower bounds the true value. This would imply that the induced capacity estimator lower bounds the theoretical fundamental limit. Extension of our method to multiuser channels is also of interest, as capacity results in multiuser information theory are quite scarce. Another objective is coupling DINE with theoretical performance guarantees.

REFERENCES

[1] R. G. Gallager. Information theory and reliable communication. Vol. 2. New York: Wiley, 1968.
[2] M. I. Belghazi, A. Baratin, S. Rajeswar, S. Ozair, Y. Bengio, A. Courville, and R. D. Hjelm. MINE: Mutual information neural estimation. arXiv preprint arXiv:1801.04062. June 2018.
[3] C. Chan, A. Al-Bashabsheh, H. P. Huang, M. Lim, D. S. H. Tam, and C. Zhao. Neural entropic estimation: A faster path to mutual information estimation. arXiv preprint arXiv:1905.12957. May 2019.
[4] M. Donsker and S. Varadhan. Asymptotic evaluation of certain Markov process expectations for large time, IV. Communications on Pure and Applied Mathematics, 36(2):183-212. March 1983.
[5] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation 9(8): 1735-1780. November 1997.
[6] A. M. Schaefer and H. G. Zimmermann. Recurrent neural networks are universal approximators. International Journal of Neural Systems 17.04: 253-263. 2007.
[7] L. Breiman. The individual ergodic theorem of information theory. The Annals of Mathematical Statistics: 809-811. September 1957.
[8] J. Massey. Causality, feedback, and directed information. Proc. Int. Symp. Inf. Theory Appl., pp. 303-305. November 1990.
[9] R. Fritschek, R. F. Schaefer, and G. Wunder. Deep learning for channel coding via neural mutual information estimation. arXiv preprint arXiv:1903.02865. March 2019.
[10] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks 2.5: 359-366. March 1989.
[11] S. Yang, A. Kavcic, and S. Tatikonda. On the feedback capacity of power-constrained Gaussian noise channels with memory. IEEE Trans. Inf. Theory 53.3: 929-954. March 2007.
[12] Y. H. Kim. Feedback capacity of the first-order moving average Gaussian channel. IEEE Trans. Inf. Theory 52.7: 3063-3079. July 2006.
[13] Y. H. Kim. Feedback capacity of stationary Gaussian channels. IEEE Trans. Inf. Theory 56.1: 57-85. January 2010.
[14] T. M. Cover and J. A. Thomas. Elements of information theory. John Wiley and Sons, 2012.
[15] W. Hirt and J. L. Massey. Capacity of the discrete-time Gaussian channel with intersymbol interference. IEEE Trans. Inf. Theory 34.3: 38-38. May 1988.
[16] J. Zhang, O. Simeone, Z. Cvetkovic, E. Abela, and M. Richardson. ITENE: Intrinsic transfer entropy neural estimator. arXiv preprint arXiv:1912.07277. January 2020.
[17] Y. H. Kim. A coding theorem for a class of stationary channels with feedback. IEEE Trans. Inf. Theory 54.4: 1488-1499. April 2008.
[18] S. Yang, A. Kavcic, and S. Tatikonda. Feedback capacity of stationary sources over Gaussian intersymbol interference channels. GLOBECOM, 2006.
[19] Z. Aharoni, O. Sabag, and H. H. Permuter. Computing the feedback capacity of finite state channels using reinforcement learning. 2019 IEEE International Symposium on Information Theory (ISIT). IEEE, 2019.
[20] S. Molavipour, G. Bassi, and M. Skoglund. Conditional mutual information neural estimator. arXiv preprint arXiv:1911.02277. November 2019.
[21] H. H. Permuter, Y. H. Kim, and T. Weissman. Interpretations of directed information in portfolio theory, data compression, and hypothesis testing. IEEE Trans. Inf. Theory 57.6: 3248-3259. June 2011.
[22] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114. 2013.
[23] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (pp. 2672-2680). 2014.
[24] G. Kramer. Directed information for channels with feedback. Hartung-Gorre, 1998.
[25] L. Zhao, H. Permuter, Y. Kim, and T. Weissman. Universal estimation of directed information. IEEE Trans. Inf. Theory 59.10: 6220-6242. October 2013.
[26] I. Kontoyiannis and M. Skoularidou. Estimating the directed information and testing for causality. IEEE Trans. Inf. Theory 62.11: 6053-6067. November 2016.
[27] C. J. Quinn, T. P. Coleman, N. Kiyavash, et al. Estimating the directed information to infer causal relationships in ensemble neural spike train recordings. J. Comput. Neurosci. 30, 17-44. June 2010.