Deep Learning for Channel Coding via Neural Estimation

Rick Fritschek∗, Rafael F. Schaefer†, and Gerhard Wunder∗
∗Heisenberg Communications and Information Theory Group, Freie Universität Berlin, Takustr. 9, 14195 Berlin, Germany
†Information Theory and Applications Chair, Technische Universität Berlin, Einsteinufer 25, 10587 Berlin, Germany
Email: {rick.fritschek, g.wunder}@fu-berlin.de, [email protected]

Abstract—End-to-end deep learning for communication systems, i.e., systems whose encoder and decoder are learned, has attracted significant interest recently, due to its performance which comes close to well-developed classical encoder-decoder designs. However, one of the drawbacks of current learning approaches is that a differentiable channel model is needed for the training of the underlying neural networks. In real-world scenarios, such a channel model is hardly available and often the channel density is not even known at all. Some works, therefore, focus on a generative approach, i.e., generating the channel from samples, or rely on reinforcement learning to circumvent this problem. We present a novel approach which utilizes a recently proposed neural estimator of mutual information. We use this estimator to optimize the encoder for a maximized mutual information, only relying on channel samples. Moreover, we show that our approach achieves the same performance as state-of-the-art end-to-end learning with perfect channel model knowledge.

I. INTRODUCTION

Deep learning based methods for wireless communication are an emerging field whose performance is becoming competitive with state-of-the-art techniques that evolved over decades of research. One of the most prominent recent examples is the end-to-end learning of communication systems utilizing (deep) neural networks (NNs) as encoding and decoding functions with an in-between noise layer that represents the channel [1]. In this configuration, the system resembles the concept of an autoencoder in the field of machine learning, which does not compress but adds redundancy to increase reliability. These encoder-decoder systems can achieve bit error rates which come close to practical baseline techniques if they are used for over-the-air transmissions [2]. This is promising since complex encoding and decoding functions can be learned on-the-fly without extensive communication-theoretic analysis and design, possibly enabling future communication systems to better cope with new and changing channel scenarios and use-cases. However, the previously mentioned approaches have the drawback that they require a known channel model to properly choose the noise layer within the autoencoder. Moreover, this channel model needs to be differentiable to enable back-propagation through the whole system to optimize, i.e., learn, the optimal weights of the NN.

One approach is to assume a generic channel model, e.g., a Gaussian model. The idea is then to first learn according to this general model and subsequently fine-tune the receiver, i.e., the weights of the decoding part of the NN, based on the actual received signals. This approach was implemented in [2]. Another approach is to use generative adversarial networks (GANs), which were introduced in [3]. GANs are composed of two competing neural networks, i.e., a generative NN and a discriminative NN. Here, the generative NN tries to transform a uniform input to the real data distribution, whereas the discriminative NN compares the samples of the real distribution (from the data) to the fake generated distribution and tries to estimate the probability that a sample came from the real data. Therefore, both neural networks are competing against each other. Due to their effectiveness and good performance, GANs are a popular and active research direction. In our problem of end-to-end learning for communications, GANs were used in [4], [5] to produce an artificial channel model which approximates the true channel distribution and can therefore be used for end-to-end training of the encoder and decoder. The third approach uses reinforcement learning (RL) and therefore circumvents the back-propagation problem itself by using a feedback link [6]. In this approach, the transmitter can be seen as an agent which performs actions in an environment and receives a reward through the feedback link. The transmitter can then be optimized to minimize an arbitrary loss function connected to the actions, i.e., the transmitted signals in our case. This work was subsequently extended towards noisy feedback links in [7]. However, the drawback of this approach is that RL is known for its sample inefficiency, meaning that it needs large sample sizes to achieve high accuracy. Moreover, both approaches (via RL and GAN) still have a dependence on the receiver of the system. As the GAN approach approximates the channel density as a surrogate for the missing channel model, it still needs end-to-end learning in the last step. The RL approach, on the other hand, needs the feedback of the receiver and therefore also depends on the decoder. In this paper, we make progress on this by proposing a method that is completely independent of the decoder. This circumvents the challenge of a missing channel model.
Our contribution: From a communication theoretic perspective, we know that the optimal transmission rate is a function of the mutual information I(X; Y) between input X and output Y of a channel p(y|x). For example, the capacity C of an additive white Gaussian noise (AWGN) channel (as properly defined in the next section) is given by the maximum of the mutual information over the input distribution under an average power constraint P, i.e.,

    C = \max_{p(x):\, E[|X|^2] \le P} I(X;Y).    (1)

This suggests using the mutual information as a metric to learn the optimal channel encoding function of AWGN channels as well as of other communication channels. However, the mutual information also depends on the channel probability distribution. But instead of approximating the channel probability distribution itself, we approximate the mutual information between samples of the channel input and output and optimize the encoder weights by maximizing this mutual information, see Fig. 1. For that, we utilize a recent neural estimator of mutual information [8] and integrate it in our communication framework. We are, therefore, independent of the decoder and can reliably train our encoding function using only channel samples.

Fig. 1: The figure shows our channel coding approach. An approximation Ĩ_θ of the mutual information between channel input and output samples is used to optimize the neural network of the channel encoder. Both neural networks are alternatingly trained until convergence. (Block diagram: message m, embedding, dense NN, normalize, channel; transmitter training optimizes Ĩ_θ via feedback of channel samples; receiver training: dense NN, softmax, arg max, loss, estimate M̂.)

II. POINT-TO-POINT COMMUNICATION MODEL

Notation: We stick to the convention of upper case random variables X and lower case realizations x, i.e., X ∼ p(x), where p(x) is the probability mass or density function of X. Moreover, p(y|x) is the corresponding conditional probability mass or density function. X^n denotes a random vector and |X| the cardinality of a set X. The expectation is denoted by E[·]. We also use [M] to denote the set {1, 2, ..., M}.

We consider a communication model with a transmitter, a channel, and a receiver. The transmitter wants to send a message m out of a message set M over a noisy channel to the receiver. For that, it uses an encoding function f: M → C^n, i.e., x^n(m) = f(m), to make the transmission robust against noise, which results in a transmission rate R = (log |M|)/n, i.e., |M| = 2^{nR}. Moreover, we assume an average power constraint (1/n) \sum_{i=1}^{n} |x_i|^2 \le P on the corresponding codewords. In this work, our communication channel is an AWGN channel such that the received signal is given as

    Y_i = X_i + Z_i,  i \in [n],    (2)

where the noise Z_i is i.i.d. over i, i.e., Z_i ∼ CN(0, σ²). The receiver uses a decoder to compute an estimate M̂ and recover the original message. Moreover, the average block error rate is defined as the probability of error averaged over all messages, i.e.,

    P_e = \frac{1}{|M|} \sum_{m=1}^{|M|} \Pr(\hat{M} \neq m).
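As a point of reference (not part of the paper's implementation), the following NumPy sketch instantiates the model of this section for a toy random codebook: uniform messages are encoded, sent over the AWGN channel (2), decoded by minimum distance, and the empirical block-error rate P_e is computed. The codebook, power, and noise variance are illustrative assumptions, and the power constraint is enforced on average over the codebook.

```python
import numpy as np

rng = np.random.default_rng(0)
M, n, P, sigma2 = 16, 1, 1.0, 0.05      # message set size, channel uses, power, noise variance

# toy encoder f: M -> C^n, scaled to meet the average power constraint with P = 1
codebook = rng.standard_normal((M, n)) + 1j * rng.standard_normal((M, n))
codebook *= np.sqrt(P / np.mean(np.abs(codebook) ** 2))

k = 100_000
m = rng.integers(0, M, size=k)                       # uniformly drawn message indices
z = np.sqrt(sigma2 / 2) * (rng.standard_normal((k, n)) + 1j * rng.standard_normal((k, n)))
y = codebook[m] + z                                  # received signal, cf. (2)

# minimum-distance decoding as a stand-in decoder g: C^n -> M
d = (np.abs(y[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
m_hat = d.argmin(axis=1)
print("empirical block-error rate P_e:", np.mean(m_hat != m))
print("Gaussian-input AWGN benchmark log2(1 + P/sigma2):", np.log2(1 + P / sigma2), "bits")
```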

III. NEURAL ESTIMATION OF MUTUAL INFORMATION

A straightforward computation and therefore evaluation of the mutual information is difficult due to its dependence on the joint probability density of the underlying random variables. A fallback solution is therefore a limitation to approximations. The main challenge here is to provide an accurate and stable approximation from low sample sizes. Common approaches are based for example on binning of the probability space [9], [10], k-nearest neighbor statistics [11]–[13], maximum likelihood estimation [14], and variational lower bounds [15]. We focus on a recently proposed estimator [8], coined mutual information neural estimation (MINE), which utilizes the Donsker-Varadhan representation of the Kullback-Leibler divergence, which in turn is connected to the mutual information by

    I(X;Y) = D_{KL}(p(x,y) \| p(x)p(y)) = \int_{X \times Y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)} \, dx \, dy.    (3)

The Donsker-Varadhan representation can be stated as

    D_{KL}(P \| Q) = \sup_{g: \Omega \to \mathbb{R}} E_P[g] - \log(E_Q[e^{g}]),    (4)

where the supremum is taken over all measurable functions g such that the expectations are finite. Now, depending on the function class, the right hand side of (4) yields a lower bound on the KL-divergence, which is tight for optimal functions. In [8], Belghazi et al. proposed to choose a neural network T_θ, parametrized with θ ∈ Θ, as function family. Moreover, they show that the resulting estimator is consistent in the sense that it converges to the true value for increasing sample size k. This yields the estimator

    I(X;Y) \ge \sup_{\theta \in \Theta} E_{p(x,y)}[T_\theta(X,Y)] - \log(E_{p(x)p(y)}[e^{T_\theta(X,Y)}]).    (5)

Another closely related estimator is based on f-divergence representations [16] and was recently applied for f-GANs [17], which use the Fenchel duality to bound the f-divergence from below as

    D_f(P \| Q) \ge \sup_{g: \Omega \to \mathbb{R}} E_P[g] - E_Q[f^*(g)],    (6)

where the supremum is over all measurable functions g such that the expectations are finite. Moreover, [17] also proposed to choose a parametrized neural network family for this function class and provide a table for the right choice of the conjugate dual function f^*, which is f^*(y) = e^{y-1} for the KL-divergence, to obtain a lower bound on the KL-divergence. This leads to the estimator

    I(X;Y) \ge \sup_{\theta \in \Theta} E_{p(x,y)}[T_\theta(X,Y)] - E_{p(x)p(y)}[e^{T_\theta(X,Y)-1}].    (7)

We note that both estimators can be derived through application of the Fenchel duality. Moreover, both lower bounds share the same supremum; however, over the choice of functions T, (5) is closer to the supremum than (7), see [18]. The work of [8] compares both approximations and shows that (5) provides a tighter estimate for high-dimensional variables. We therefore focus on the Donsker-Varadhan based estimator (5) in our following implementation.
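As a quick sanity check of the two bounds (not part of the paper), the following NumPy sketch evaluates the Donsker-Varadhan objective (5) and the f-divergence objective (7) on samples of a correlated Gaussian pair. Instead of a trained network T_θ, it plugs in the known optimal critics for this toy case, log(p(x,y)/(p(x)p(y))) for (5) and the same quantity shifted by one for (7); both should come out close to the closed-form mutual information -0.5 log(1 - ρ²) in nats.

```python
import numpy as np

rng = np.random.default_rng(1)
rho, k = 0.5, 100_000

# correlated standard Gaussian pair (X, Y); true I(X;Y) = -0.5*log(1 - rho^2) nats
x = rng.standard_normal(k)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(k)
x_marg, y_marg = x, rng.permutation(y)        # shuffling y breaks the dependence -> p(x)p(y)

def log_ratio(x, y):
    """Closed-form log(p(x,y)/(p(x)p(y))) for this bivariate Gaussian example."""
    return (-0.5 * np.log(1 - rho**2)
            - (x**2 - 2 * rho * x * y + y**2) / (2 * (1 - rho**2))
            + (x**2 + y**2) / 2)

def dv_bound(t_joint, t_marg):
    """Donsker-Varadhan objective, cf. (5): E_p(x,y)[T] - log E_p(x)p(y)[e^T]."""
    return t_joint.mean() - np.log(np.mean(np.exp(t_marg)))

def f_bound(t_joint, t_marg):
    """f-divergence objective, cf. (7): E_p(x,y)[T] - E_p(x)p(y)[e^(T-1)]."""
    return t_joint.mean() - np.mean(np.exp(t_marg - 1))

t_joint, t_marg = log_ratio(x, y), log_ratio(x_marg, y_marg)
print("true I(X;Y):", -0.5 * np.log(1 - rho**2))
print("DV bound (5):", dv_bound(t_joint, t_marg))
print("f-div bound (7):", f_bound(t_joint + 1.0, t_marg + 1.0))  # its maximizer is 1 + log-ratio
# In MINE, the critic is a neural network T_theta optimized by gradient ascent instead.
```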
IV. IMPLEMENTATION

A. Encoder Training via Mutual Information Estimation

Our encoder architecture is modelled as in [19], i.e., it consists of an embedding layer and dense hidden layers, converts 2n real values to n complex values, and normalizes them, see Fig. 1. However, unlike other end-to-end learning approaches, we estimate the mutual information from samples and train the encoder network by maximizing this mutual information estimate. This enables deep learning for channel encoding without explicit knowledge of the channel density function.
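For concreteness, here is a minimal sketch of such an encoder mapping in plain NumPy (the paper's implementation is in TensorFlow): a message index is embedded, passed through a dense hidden layer, mapped to 2n real values, interpreted as n complex symbols, and normalized to unit average power. The layer sizes, the ReLU activation, and the random initialization are illustrative assumptions, not the paper's exact choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_encoder(M, n, hidden=64, emb_dim=16):
    """Random initial weights for an embedding + dense encoder (illustrative sizes)."""
    return {
        "emb": rng.normal(0, 0.1, (M, emb_dim)),      # embedding table, one row per message
        "W1": rng.normal(0, 0.1, (emb_dim, hidden)),  # dense hidden layer
        "b1": np.zeros(hidden),
        "W2": rng.normal(0, 0.1, (hidden, 2 * n)),    # outputs 2n real values
        "b2": np.zeros(2 * n),
    }

def encode(params, m, n):
    """Map message indices m (shape [batch]) to normalized complex codewords x^n."""
    h = np.maximum(params["emb"][m] @ params["W1"] + params["b1"], 0.0)  # ReLU
    r = h @ params["W2"] + params["b2"]                                  # 2n real values
    x = r[:, :n] + 1j * r[:, n:]                                         # n complex symbols
    # unit average power normalization E(|X_i|^2) = 1 over batch and signal dimension
    return x / np.sqrt(np.mean(np.abs(x) ** 2))

params = make_encoder(M=16, n=1)
m = rng.integers(0, 16, size=200)
x = encode(params, m, n=1)
print(x.shape, np.mean(np.abs(x) ** 2))  # (200, 1), average power approximately 1.0
```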
Our mutual information estimation uses the Donsker-Varadhan estimator, see (5). For that we maximize the estimator term

    I_\theta(X^n;Y^n) := E_{p(x,y)}[T_\theta(X^n,Y^n)] - \log E_{p(x)p(y)}[e^{T_\theta(X^n,Y^n)}]

over θ with the Adam optimizer [20] and a learning rate of 0.0005. Note that we do not have access to the true joint distribution p(x,y) and the marginal distributions p(x) and p(y). We therefore use samples of these distributions and approximate the expectations by the sample average. This yields the following estimator for k samples

    \tilde{I}_\theta(X^n;Y^n) := \frac{1}{k} \sum_{i=1}^{k} T_\theta\big(x^n_{(i)}, y^n_{(i)}\big) - \log\Big(\frac{1}{k} \sum_{i=1}^{k} e^{T_\theta(x^n_{(i)}, \bar{y}^n_{(i)})}\Big),    (8)

where the k samples of the joint distribution p(x^n, y^n), for the first term in (8), are produced via uniform generation of messages m and sending them through the initialized encoder, which generates X^n of p(x^n, y^n). The corresponding samples y^n of Y^n are generated by our AWGN channel, see Section II, where the noise variance σ² is scaled such that we have a resulting signal-to-noise ratio per bit of E_b/N_0 = 7 dB. Note also that the encoded signal x^n has a unit average power normalization E(|X_i|²) = 1, where the expectation is over the signal dimension and the batch size. The samples of the marginal distributions, for the second term in (8), are generated by dropping either x^n_{(i)} or y^n_{(i)} from the joint samples (x^n_{(i)}, y^n_{(i)}) and dropping the other in the next k samples, as proposed in [8]. Therefore, a total batch size of 2k leads to k samples of the joint and k samples of the marginal distribution. In the estimator, T_θ represents a neural network with two fully connected hidden layers and a linear output node, see Fig. 2. Our estimator network uses 20 nodes per hidden layer, because the mutual information value for our AWGN model stabilized at around 15 nodes, see Fig. 3. However, we remark that the MINE implementation of [8] uses 3 fully connected layers with 400 nodes per layer to produce the results for the 25 Gaussians data set. A higher dimensionality may therefore require more nodes to produce stable results. As in [8], we initialized the weights in T_θ with a low standard deviation (σ = 0.05) to circumvent unstable behaviour in conjunction with the log term. The approximation (7) is more stable in that regard; however, the outlook of a better estimate in high dimensions motivated the use of the Donsker-Varadhan based estimator.

Fig. 2: Neural network representation of our approximation function T_θ. The samples of X^n and Y^n are concatenated and fed into the network. The network is comprised of two hidden layers with 20 nodes and ReLU activation function. The output function is linear.

Our training of the encoder is now implemented in two phases. After the initialization of the encoder weights, we train the mutual information estimation network for an initial round with 1000 iterations and a batch size of 200. In the second phase, we alternate between maximizing (8) over the encoder weights φ and the estimator weights θ:

    \max_{\phi} \max_{\theta} \tilde{I}_\theta(X^n_\phi(m); Y^n).

We maximize over the encoder weights φ with batch sizes {100, 100, 1000} and {1000, 10000, 10000} iterations with a learning rate α of {0.01, 0.001, 0.001}, respectively. After every {100, 1000, 1000} iterations, i.e., 10 times during every cycle, we maximize over the estimator weights θ again with a batch size of 200.
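To make the batch construction around (8) concrete, the following NumPy sketch (the paper's implementation is in TensorFlow) draws 2k uniform messages, maps them through a stand-in unit-power codebook, passes them through the AWGN channel of Section II, forms k joint and k marginal pairs as described above, and evaluates the Donsker-Varadhan estimate for a toy critic in place of the trained network T_θ. The E_b/N_0-to-σ² conversion shown is one common convention and an assumption on our part.

```python
import numpy as np

rng = np.random.default_rng(2)
M, n, k = 16, 1, 200            # messages, complex channel uses, samples of the joint
bits = np.log2(M) / n           # bits per complex channel use

# noise variance from Eb/N0 = 7 dB (assumed convention; E_s = 1 after normalization)
ebno = 10 ** (7 / 10)
sigma2 = 1.0 / (bits * ebno)

# stand-in encoder: a table of M unit-average-power codewords (randomly initialized)
codebook = rng.standard_normal((M, n)) + 1j * rng.standard_normal((M, n))
codebook /= np.sqrt(np.mean(np.abs(codebook) ** 2))

def channel(x):
    """AWGN channel of Section II: y = x + z, z ~ CN(0, sigma2)."""
    z = np.sqrt(sigma2 / 2) * (rng.standard_normal(x.shape) + 1j * rng.standard_normal(x.shape))
    return x + z

def mi_estimate(T, x_joint, y_joint, x_marg, y_marg):
    """Sample-based Donsker-Varadhan estimate, cf. (8)."""
    return np.mean(T(x_joint, y_joint)) - np.log(np.mean(np.exp(T(x_marg, y_marg))))

# a batch of 2k message indices; the first k give joint samples, the second k are used
# to build marginal samples by re-pairing x and y (the "dropping" scheme of [8])
m = rng.integers(0, M, size=2 * k)
x = codebook[m]
y = channel(x)
x_joint, y_joint = x[:k], y[:k]
x_marg, y_marg = x[:k], y[k:]          # y from an independent batch breaks the dependence

def T(x, y):
    """Toy critic standing in for the trained two-hidden-layer network T_theta."""
    return np.real(x * np.conj(y)).sum(axis=1)

print("I_tilde estimate (nats):", mi_estimate(T, x_joint, y_joint, x_marg, y_marg))
# In the implementation, T is the neural network T_theta and this estimate is maximized
# alternately over the estimator weights theta and the encoder weights phi with Adam.
```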

Fig. 3: The resulting constellation points for 16 symbols are shown by varying the number of nodes in the mutual information estimation network T_θ. The constellations are for the node sizes 2 (a), 4 (b), 6 (c), 8 (d), 10 (e), 12 (f), 14 (g), and 16 (h).

Fig. 4: The resulting averaged block-error rates P_e are shown for n = 1 and the following systems: a standard 16-QAM coding scheme; an end-to-end learning system based on cross-entropy (CE) with known channel distribution; and our proposed mutual information maximization encoding and cross-entropy decoding system (MI+CE) without known channel distribution, i.e., sample based. (Block-error rate versus E_b/N_0 in dB; curves: Autoencoder MI+CE, Autoencoder CE, 16QAM.)

B. Decoder Training via Cross-Entropy

To evaluate our new method, we use a standard cross-entropy based NN decoder, consisting of a conversion from complex to real values, followed by dense hidden layers, a softmax layer, and an arg max layer, see Fig. 1. Let ν_{|M|} be the output of the last dense layer in the decoder network. The softmax function takes ν_{|M|} and returns a vector of probabilities for the message set, i.e., p ∈ (0,1)^{|M|}, where the entries p_m are calculated by

    p_m = f(\nu_{|M|})_m := \frac{\exp(\nu_m)}{\sum_i \exp(\nu_i)}.

The decoder then declares the estimated message to be m̂ = arg max_m p_m. For the training, we uniformly generate message indexes m and feed them into our previously trained encoder NN, which generates the codewords x^n(m). The codewords are then sent over the channel. The receiver gets a noisy signal y^n and feeds it into the decoder. The decoder outputs the estimated probabilities p_m of the received message index, which are fed into a cross-entropy function together with the true index m:

    H(M, \hat{M}) = -\sum_{m \in M} p(m) \log p_{\mathrm{decoder}}(m) = -E_{p(m)}[\log p_{\mathrm{decoder}}(m)],

which is estimated by averaging over the sample size k. This yields the cross-entropy cost function for the decoder weights ψ:

    J(\psi) = -\frac{1}{k} \sum_{i=1}^{k} \log p_{m_i},

where m_i represents the index of the message of the i-th sample. Finally, we use the Adam optimizer to train the decoder weights ψ by minimizing the cross-entropy. We remark that we assume that our decoding system knows the sent message index; this can be achieved by using a fixed seed on a random number generator, as proposed in [7].
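The decoder training just described can be sketched compactly. The following NumPy toy (the paper uses TensorFlow with the Adam optimizer) trains a one-hidden-layer softmax decoder with the cross-entropy cost J(ψ) on a fixed, randomly chosen unit-power codebook standing in for the trained encoder; the layer sizes, plain SGD updates, noise variance, and iteration counts are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
M, n, sigma2 = 16, 1, 0.05          # messages, complex channel uses, noise variance (illustrative)

# stand-in for the previously trained encoder: a fixed unit-power codebook
codebook = rng.standard_normal((M, n)) + 1j * rng.standard_normal((M, n))
codebook /= np.sqrt(np.mean(np.abs(codebook) ** 2))

# decoder: [Re(y), Im(y)] -> dense (ReLU) -> dense -> softmax over |M| messages
H = 64
W1 = rng.normal(0, 0.1, (2 * n, H)); b1 = np.zeros(H)
W2 = rng.normal(0, 0.1, (H, M));     b2 = np.zeros(M)

def forward(y):
    feats = np.concatenate([y.real, y.imag], axis=1)      # complex -> 2n real values
    h = np.maximum(feats @ W1 + b1, 0.0)
    logits = h @ W2 + b2
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    return feats, h, p / p.sum(axis=1, keepdims=True)     # softmax probabilities

lr, k = 0.05, 200                                         # plain SGD here; the paper uses Adam
for it in range(2000):
    m = rng.integers(0, M, size=k)                        # known at the receiver via a fixed seed
    z = np.sqrt(sigma2 / 2) * (rng.standard_normal((k, n)) + 1j * rng.standard_normal((k, n)))
    feats, h, p = forward(codebook[m] + z)
    # cross-entropy J(psi) and its gradient w.r.t. the logits, (p - one-hot)/k
    J = -np.mean(np.log(p[np.arange(k), m] + 1e-12))
    g = p.copy(); g[np.arange(k), m] -= 1.0; g /= k
    gW2 = h.T @ g; gb2 = g.sum(axis=0)
    gh = (g @ W2.T) * (h > 0)
    gW1 = feats.T @ gh; gb1 = gh.sum(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2; W1 -= lr * gW1; b1 -= lr * gb1

m_test = rng.integers(0, M, size=10_000)
z_test = np.sqrt(sigma2 / 2) * (rng.standard_normal((10_000, n)) + 1j * rng.standard_normal((10_000, n)))
_, _, p_test = forward(codebook[m_test] + z_test)
print("empirical block-error rate:", np.mean(p_test.argmax(axis=1) != m_test))
```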

C. Results

We have implemented our new approach using the TensorFlow [21] framework. The resulting symbol error curves can be seen in Fig. 4, which shows that the performance of our proposed method is indistinguishable from the approximated theoretical 16-QAM performance and the state-of-the-art autoencoder NN, which uses the knowledge of the channel in conjunction with a cross-entropy loss. Moreover, Fig. 6 shows the resulting constellations of the encoder for 16 symbols, for (a) the standard cross-entropy approach and (b) the mutual information maximization approach. Furthermore, we show in Fig. 5 the calculated mutual information values after training, for several signal-to-noise ratios. There, we compare these values for 16, 32, and 64 symbols. We remark that the performance of the encoders is dependent on the SNR during training. If we train, for example, the 64-symbol encoder at a high SNR, then the curve gets closer to 6 bits but also decreases in the low SNR regime. It can be seen that the mutual information bound comes close to the expected values, i.e., in the range of m-ary QAM, for mid to high SNR ranges.

Fig. 5: Mutual information estimation evaluations for trained encoders for n = 1 and {16, 32, 64} symbols. The encoder and the estimation network T_θ were trained as in Section IV-A with increasing SNR values: {10 : 14, 14 : 18, 17 : 21} dB for {16, 32, 64} symbols, respectively. (Mutual information in bits versus SNR in dB; the log(1 + SNR) curve is shown as a reference.)

Fig. 6: The resulting encoding constellations are shown for 16 symbols based on: (a) the standard cross-entropy approach; (b) the mutual information estimation based approach.

V. CONCLUSIONS AND OUTLOOK

We have shown that the recently developed mutual information neural estimator (MINE) can be used to train a channel encoding setup by alternating the maximization of the estimated mutual information over the estimator weights and the encoder weights. The training works without explicit knowledge of the channel density function and rather approximates a function of the channel, i.e., the mutual information, based on the samples of the input and output of the channel. We believe that this can perform better than an end-to-end learning setup, because the encoder basically uses the expert information about which performance function (i.e., the mutual information) it needs to optimize the encoding in order to perform well. This is in contrast to the end-to-end learning approach, where the neural network system needs to learn this information on its own. Furthermore, our method can be implemented without changes at the receiver side and is therefore suitable for fast deployment. The investigation of sample size bounds for our method is on-going work. Intuitively, it requires fewer samples than a GAN based approach due to the fact that we do not need to exactly generate the channel density. However, future research needs to investigate the performance under low sample size scenarios and compare it to GAN and RL based approaches. Moreover, the stability of the estimator, in comparison to the f-divergence estimator, needs to be investigated under different channel models.

REFERENCES

[1] T. O'Shea and J. Hoydis, "An introduction to deep learning for the physical layer," IEEE Transactions on Cognitive Communications and Networking, vol. 3, no. 4, pp. 563–575, Dec 2017.
[2] S. Dörner, S. Cammerer, J. Hoydis, and S. ten Brink, "Deep learning based communication over the air," IEEE Journal of Selected Topics in Signal Processing, vol. 12, no. 1, pp. 132–143, Feb 2018.
[3] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[4] H. Ye, G. Y. Li, B.-H. F. Juang, and K. Sivanesan, "Channel agnostic end-to-end learning based communication systems with conditional GAN," arXiv preprint arXiv:1807.00447, 2018.
[5] T. J. O'Shea, T. Roy, N. West, and B. C. Hilburn, "Physical layer communications system design over-the-air using adversarial networks," arXiv preprint arXiv:1803.03145, 2018.
[6] F. A. Aoudia and J. Hoydis, "End-to-end learning of communications systems without a channel model," arXiv preprint arXiv:1804.02276, 2018.
[7] M. Goutay, F. A. Aoudia, and J. Hoydis, "Deep reinforcement learning autoencoder with noisy feedback," arXiv preprint arXiv:1810.05419, 2018.
[8] I. Belghazi, S. Rajeswar, A. Baratin, R. D. Hjelm, and A. Courville, "MINE: Mutual information neural estimation," arXiv preprint arXiv:1801.04062, 2018.
[9] A. M. Fraser and H. L. Swinney, "Independent coordinates for strange attractors from mutual information," Physical Review A, vol. 33, no. 2, p. 1134, 1986.
[10] G. A. Darbellay and I. Vajda, "Estimation of the information by an adaptive partitioning of the observation space," IEEE Transactions on Information Theory, vol. 45, no. 4, pp. 1315–1321, 1999.
[11] A. Kraskov, H. Stögbauer, and P. Grassberger, "Estimating mutual information," Physical Review E, vol. 69, no. 6, p. 066138, 2004.
[12] S. Gao, G. Ver Steeg, and A. Galstyan, "Efficient estimation of mutual information for strongly dependent variables," in Artificial Intelligence and Statistics, 2015, pp. 277–286.
[13] W. Gao, S. Oh, and P. Viswanath, "Demystifying fixed k-nearest neighbor information estimators," IEEE Transactions on Information Theory, vol. 64, no. 8, pp. 5629–5661, Aug 2018.
[14] T. Suzuki, M. Sugiyama, J. Sese, and T. Kanamori, "Approximating mutual information by maximum likelihood density ratio estimation," in New Challenges for Feature Selection in Data Mining and Knowledge Discovery, 2008, pp. 5–20.
[15] D. Barber and F. Agakov, "The IM algorithm: A variational approach to information maximization," in Proceedings of the 16th International Conference on Neural Information Processing Systems. MIT Press, 2003, pp. 201–208.
[16] X. Nguyen, M. J. Wainwright, and M. I. Jordan, "Estimating divergence functionals and the likelihood ratio by convex risk minimization," IEEE Transactions on Information Theory, vol. 56, no. 11, pp. 5847–5861, 2010.
[17] S. Nowozin, B. Cseke, and R. Tomioka, "f-GAN: Training generative neural samplers using variational divergence minimization," in Advances in Neural Information Processing Systems, 2016, pp. 271–279.
[18] A. Ruderman, M. Reid, D. García-García, and J. Petterson, "Tighter variational representations of f-divergences via restriction to probability measures," arXiv preprint arXiv:1206.4664, 2012.
[19] S. Dörner, S. Cammerer, J. Hoydis, and S. ten Brink, "Deep learning based communication over the air," IEEE Journal of Selected Topics in Signal Processing, vol. 12, no. 1, pp. 132–143, 2018.
[20] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[21] J. Dean, R. Monga et al., "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/