Deep Learning for Channel Coding via Neural Estimation

Rick Fritschek∗, Rafael F. Schaefer†, and Gerhard Wunder∗
∗Heisenberg Communications and Information Theory Group, Freie Universität Berlin, Takustr. 9, 14195 Berlin, Germany
†Information Theory and Applications Chair, Technische Universität Berlin, Einsteinufer 25, 10587 Berlin, Germany
Email: {rick.fritschek, g.wunder}@fu-berlin.de, [email protected]

Abstract—End-to-end deep learning for communication systems, i.e., systems whose encoder and decoder are learned, has attracted significant interest recently, due to its performance which comes close to well-developed classical encoder-decoder designs. However, one of the drawbacks of current learning approaches is that a differentiable channel model is needed for the training of the underlying neural networks. In real-world scenarios, such a channel model is hardly available and often the channel density is not even known at all. Some works, therefore, focus on a generative approach, i.e., generating the channel from samples, or rely on reinforcement learning to circumvent this problem. We present a novel approach which utilizes a recently proposed neural estimator of mutual information. We use this estimator to optimize the encoder for a maximized mutual information, only relying on channel samples. Moreover, we show that our approach achieves the same performance as state-of-the-art end-to-end learning with perfect channel model knowledge.

I. INTRODUCTION

Deep learning based methods for wireless communication are an emerging field whose performance is becoming competitive with state-of-the-art techniques that evolved over decades of research. One of the most prominent recent examples is the end-to-end learning of communication systems utilizing (deep) neural networks (NNs) as encoding and decoding functions with an in-between noise layer that represents the channel [1]. In this configuration, the system resembles the concept of an autoencoder in the field of machine learning, which does not compress but adds redundancy to increase reliability. These encoder-decoder systems can achieve bit error rates which come close to practical baseline techniques if they are used for over-the-air transmissions [2]. This is promising since complex encoding and decoding functions can be learned on-the-fly without extensive communication-theoretic analysis and design, possibly enabling future communication systems to better cope with new and changing channel scenarios and use-cases. However, the previously mentioned approaches have the drawback that they require a known channel model to properly choose the noise layer within the autoencoder. Moreover, this channel model needs to be differentiable to enable back-propagation through the whole system to optimize, i.e., learn, the optimal weights of the NN.

One approach is to assume a generic channel model, e.g., a Gaussian model. The idea is then to first learn according to this general model and subsequently fine-tune the receiver, i.e., the weights of the decoding part of the NN, based on the actual received signals. This approach was implemented in [2]. Another approach is to use generative adversarial networks (GANs), which were introduced in [3]. GANs are composed of two competing neural networks, i.e., a generative NN and a discriminative NN. Here, the generative NN tries to transform a uniform input to the real data distribution, whereas the discriminative NN compares the samples of the real distribution (from the data) to the fake generated distribution and tries to estimate the probability that a sample came from the real data. Therefore, both neural networks are competing against each other. Due to their effectiveness and good performance, GANs are a popular and active research direction. In our problem of end-to-end learning for communications, GANs were used in [4], [5] to produce an artificial channel model which approximates the true channel distribution and can therefore be used for end-to-end training of the encoder and decoder. The third approach uses reinforcement learning (RL) and therefore circumvents the back-propagation problem itself by using a feedback link [6]. In this approach, the transmitter can be seen as an agent which performs actions in an environment and receives a reward through the feedback link. The transmitter can then be optimized to minimize an arbitrary loss function connected to the actions, i.e., the transmitted signals in our case. This work was subsequently extended towards noisy feedback links in [7]. However, the drawback of this approach is that RL is known for its sample inefficiency, meaning that it needs large sample sizes to achieve high accuracy. Moreover, both approaches (via RL and GAN) still have a dependence on the receiver of the system. As the GAN approach approximates the channel density as a surrogate for the missing channel model, it still needs end-to-end learning in the last step. The RL approach, on the other hand, needs the feedback of the receiver and therefore also depends on the decoder. In this paper, we make progress on this by proposing a method that is completely independent of the decoder. This circumvents the challenge of a missing channel model.
Our contribution: From a communication theoretic perspective, we know that the optimal transmission rate is a function of the mutual information I(X; Y) between input X and output Y of a channel p(y|x). For example, the capacity C of an additive white Gaussian noise (AWGN) channel (as properly defined in the next section) is given by the maximum of the mutual information over the input distribution under an average power constraint P, i.e.,

    C = \max_{p(x):\, E[|X|^2] \le P} I(X;Y).    (1)

This suggests using the mutual information as a metric to learn the optimal channel encoding function of AWGN channels as well as of other communication channels. However, the mutual information also depends on the channel probability distribution. But instead of approximating the channel probability distribution itself, we approximate the mutual information between samples of the channel input and output and optimize the encoder weights by maximizing this mutual information, see Fig. 1. For that, we utilize a recent neural estimator of mutual information [8] and integrate it in our communication framework. We are, therefore, independent of the decoder and can reliably train our encoding function using only channel samples.

Fig. 1: The figure shows our channel coding approach. An approximation Ĩ_θ of the mutual information between channel input and output samples is used to optimize the neural network of the channel encoder. Both neural networks are alternatingly trained until convergence. (Block diagram: message m, embedding, dense NN, normalize, channel; transmitter training optimizes Ĩ_θ via feedback of channel samples; receiver training: dense NN, softmax, arg max, loss, estimate M̂.)

II. POINT-TO-POINT COMMUNICATION MODEL

Notation: We stick to the convention of upper case random variables X and lower case realizations x, i.e., X ∼ p(x), where p(x) is the probability mass or density function of X. Moreover, p(y|x) is the corresponding conditional probability mass or density function. X^n denotes a random vector and |X| the cardinality of a set X. The expectation is denoted by E[·]. We also use [M] to denote the set {1, 2, ..., M}.

We consider a communication model with a transmitter, a channel, and a receiver. The transmitter wants to send a message m out of a message set M over a noisy channel to the receiver. For that, it uses an encoding function f: M → C^n, i.e., x^n(m) = f(m), to make the transmission robust against noise, which results in a transmission rate R = (log |M|)/n, i.e., |M| = 2^{nR}. Moreover, we assume an average power constraint (1/n) \sum_{i=1}^{n} |x_i|^2 \le P on the corresponding codewords. In this work, our communication channel is an AWGN channel such that the received signal is given as

    Y_i = X_i + Z_i,  i \in [n],    (2)

where the noise Z_i is i.i.d. over i, i.e., Z_i ∼ CN(0, σ²). The receiver uses a decoder to compute an estimate M̂ and recover the original message. Moreover, the average block error rate is defined as the probability of error averaged over all messages, i.e.,

    P_e = \frac{1}{|M|} \sum_{m=1}^{|M|} \Pr(\hat{M} \neq m).
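As a point of reference (not part of the paper's implementation), the following NumPy sketch instantiates the model of this section for a toy random codebook: uniform messages are encoded, sent over the AWGN channel (2), decoded by minimum distance, and the empirical block-error rate P_e is computed. The codebook, power, and noise variance are illustrative assumptions, and the power constraint is enforced on average over the codebook.

```python
import numpy as np

rng = np.random.default_rng(0)
M, n, P, sigma2 = 16, 1, 1.0, 0.05      # message set size, channel uses, power, noise variance

# toy encoder f: M -> C^n, scaled to meet the average power constraint with P = 1
codebook = rng.standard_normal((M, n)) + 1j * rng.standard_normal((M, n))
codebook *= np.sqrt(P / np.mean(np.abs(codebook) ** 2))

k = 100_000
m = rng.integers(0, M, size=k)                       # uniformly drawn message indices
z = np.sqrt(sigma2 / 2) * (rng.standard_normal((k, n)) + 1j * rng.standard_normal((k, n)))
y = codebook[m] + z                                  # received signal, cf. (2)

# minimum-distance decoding as a stand-in decoder g: C^n -> M
d = (np.abs(y[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
m_hat = d.argmin(axis=1)
print("empirical block-error rate P_e:", np.mean(m_hat != m))
print("Gaussian-input AWGN benchmark log2(1 + P/sigma2):", np.log2(1 + P / sigma2), "bits")
```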

III. NEURAL ESTIMATION OF MUTUAL INFORMATION

A straightforward computation and therefore evaluation of the mutual information is difficult due to its dependence on the joint probability density of the underlying random variables. A fallback solution is therefore a limitation to approximations. The main challenge here is to provide an accurate and stable approximation from low sample sizes. Common approaches are based for example on binning of the probability space [9], [10], k-nearest neighbor statistics [11]–[13], maximum likelihood estimation [14], and variational lower bounds [15]. We focus on a recently proposed estimator [8], coined mutual information neural estimation (MINE), which utilizes the Donsker-Varadhan representation of the Kullback-Leibler divergence, which in turn is connected to the mutual information by

    I(X;Y) = D_{KL}(p(x,y) \| p(x)p(y)) = \int_{X \times Y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)} \, dx \, dy.    (3)

The Donsker-Varadhan representation can be stated as

    D_{KL}(P \| Q) = \sup_{g: \Omega \to \mathbb{R}} E_P[g] - \log(E_Q[e^{g}]),    (4)

where the supremum is taken over all measurable functions g such that the expectations are finite. Now, depending on the function class, the right hand side of (4) yields a lower bound on the KL-divergence, which is tight for optimal functions. In [8], Belghazi et al. proposed to choose a neural network T_θ, parametrized with θ ∈ Θ, as function family. Moreover, they show that the resulting estimator is consistent in the sense that it converges to the true value for increasing sample size k. This yields the estimator

    I(X;Y) \ge \sup_{\theta \in \Theta} E_{p(x,y)}[T_\theta(X,Y)] - \log(E_{p(x)p(y)}[e^{T_\theta(X,Y)}]).    (5)

Another closely related estimator is based on f-divergence representations [16] and was recently applied for f-GANs [17], which use the Fenchel duality to bound the f-divergence from below as

    D_f(P \| Q) \ge \sup_{g: \Omega \to \mathbb{R}} E_P[g] - E_Q[f^*(g)],    (6)

where the supremum is over all measurable functions g such that the expectations are finite. Moreover, [17] also proposed to choose a parametrized neural network family for this function class and provide a table for the right choice of the conjugate dual function f^*, which is f^*(y) = e^{y-1} for the KL-divergence, to obtain a lower bound on the KL-divergence. This leads to the estimator

    I(X;Y) \ge \sup_{\theta \in \Theta} E_{p(x,y)}[T_\theta(X,Y)] - E_{p(x)p(y)}[e^{T_\theta(X,Y)-1}].    (7)

We note that both estimators can be derived through application of the Fenchel duality. Moreover, both lower bounds share the same supremum; however, over the choice of functions T, (5) is closer to the supremum than (7), see [18]. The work of [8] compares both approximations and shows that (5) provides a tighter estimate for high-dimensional variables. We therefore focus on the Donsker-Varadhan based estimator (5) in our following implementation.
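As a quick sanity check of the two bounds (not part of the paper), the following NumPy sketch evaluates the Donsker-Varadhan objective (5) and the f-divergence objective (7) on samples of a correlated Gaussian pair. Instead of a trained network T_θ, it plugs in the known optimal critics for this toy case, log(p(x,y)/(p(x)p(y))) for (5) and the same quantity shifted by one for (7); both should come out close to the closed-form mutual information -0.5 log(1 - ρ²) in nats.

```python
import numpy as np

rng = np.random.default_rng(1)
rho, k = 0.5, 100_000

# correlated standard Gaussian pair (X, Y); true I(X;Y) = -0.5*log(1 - rho^2) nats
x = rng.standard_normal(k)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(k)
x_marg, y_marg = x, rng.permutation(y)        # shuffling y breaks the dependence -> p(x)p(y)

def log_ratio(x, y):
    """Closed-form log(p(x,y)/(p(x)p(y))) for this bivariate Gaussian example."""
    return (-0.5 * np.log(1 - rho**2)
            - (x**2 - 2 * rho * x * y + y**2) / (2 * (1 - rho**2))
            + (x**2 + y**2) / 2)

def dv_bound(t_joint, t_marg):
    """Donsker-Varadhan objective, cf. (5): E_p(x,y)[T] - log E_p(x)p(y)[e^T]."""
    return t_joint.mean() - np.log(np.mean(np.exp(t_marg)))

def f_bound(t_joint, t_marg):
    """f-divergence objective, cf. (7): E_p(x,y)[T] - E_p(x)p(y)[e^(T-1)]."""
    return t_joint.mean() - np.mean(np.exp(t_marg - 1))

t_joint, t_marg = log_ratio(x, y), log_ratio(x_marg, y_marg)
print("true I(X;Y):", -0.5 * np.log(1 - rho**2))
print("DV bound (5):", dv_bound(t_joint, t_marg))
print("f-div bound (7):", f_bound(t_joint + 1.0, t_marg + 1.0))  # its maximizer is 1 + log-ratio
# In MINE, the critic is a neural network T_theta optimized by gradient ascent instead.
```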
IV. IMPLEMENTATION

A. Encoder Training via Mutual Information Estimation

Our encoder architecture is modelled as in [19], i.e., it consists of an embedding layer and dense hidden layers, converts 2n real values to n complex values, and normalizes them, see Fig. 1. However, unlike other end-to-end learning approaches, we estimate the mutual information from samples and train the encoder network by maximizing this mutual information estimate. This enables deep learning for channel encoding without explicit knowledge of the channel density function.
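For concreteness, here is a minimal sketch of such an encoder mapping in plain NumPy (the paper's implementation is in TensorFlow): a message index is embedded, passed through a dense hidden layer, mapped to 2n real values, interpreted as n complex symbols, and normalized to unit average power. The layer sizes, the ReLU activation, and the random initialization are illustrative assumptions, not the paper's exact choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_encoder(M, n, hidden=64, emb_dim=16):
    """Random initial weights for an embedding + dense encoder (illustrative sizes)."""
    return {
        "emb": rng.normal(0, 0.1, (M, emb_dim)),      # embedding table, one row per message
        "W1": rng.normal(0, 0.1, (emb_dim, hidden)),  # dense hidden layer
        "b1": np.zeros(hidden),
        "W2": rng.normal(0, 0.1, (hidden, 2 * n)),    # outputs 2n real values
        "b2": np.zeros(2 * n),
    }

def encode(params, m, n):
    """Map message indices m (shape [batch]) to normalized complex codewords x^n."""
    h = np.maximum(params["emb"][m] @ params["W1"] + params["b1"], 0.0)  # ReLU
    r = h @ params["W2"] + params["b2"]                                  # 2n real values
    x = r[:, :n] + 1j * r[:, n:]                                         # n complex symbols
    # unit average power normalization E(|X_i|^2) = 1 over batch and signal dimension
    return x / np.sqrt(np.mean(np.abs(x) ** 2))

params = make_encoder(M=16, n=1)
m = rng.integers(0, 16, size=200)
x = encode(params, m, n=1)
print(x.shape, np.mean(np.abs(x) ** 2))  # (200, 1), average power approximately 1.0
```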
Our mutual information estimation uses the Donsker-Varadhan estimator, see (5). For that we maximize the estimator term

    I_\theta(X^n;Y^n) := E_{p(x,y)}[T_\theta(X^n,Y^n)] - \log E_{p(x)p(y)}[e^{T_\theta(X^n,Y^n)}]

over θ with the Adam optimizer [20] and a learning rate of 0.0005. Note that we do not have access to the true joint distribution p(x,y) and the marginal distributions p(x) and p(y). We therefore use samples of these distributions and approximate the expectations by the sample average. This yields the following estimator for k samples

    \tilde{I}_\theta(X^n;Y^n) := \frac{1}{k} \sum_{i=1}^{k} T_\theta\big(x^n_{(i)}, y^n_{(i)}\big) - \log\Big(\frac{1}{k} \sum_{i=1}^{k} e^{T_\theta(x^n_{(i)}, \bar{y}^n_{(i)})}\Big),    (8)

where the k samples of the joint distribution p(x^n, y^n), for the first term in (8), are produced via uniform generation of messages m and sending them through the initialized encoder, which generates X^n of p(x^n, y^n). The corresponding samples y^n of Y^n are generated by our AWGN channel, see Section II, where the noise variance σ² is scaled such that we have a resulting signal-to-noise ratio per bit of E_b/N_0 = 7 dB. Note also that the encoded signal x^n has a unit average power normalization E(|X_i|²) = 1, where the expectation is over the signal dimension and the batch size. The samples of the marginal distributions, for the second term in (8), are generated by dropping either x^n_{(i)} or y^n_{(i)} from the joint samples (x^n_{(i)}, y^n_{(i)}) and dropping the other in the next k samples, as proposed in [8]. Therefore, a total batch size of 2k leads to k samples of the joint and k samples of the marginal distribution. In the estimator, T_θ represents a neural network with two fully connected hidden layers and a linear output node, see Fig. 2. Our estimator network uses 20 nodes per hidden layer, because the mutual information value for our AWGN model stabilized at around 15 nodes, see Fig. 3. However, we remark that the MINE implementation of [8] uses 3 fully connected layers with 400 nodes per layer to produce the results for the 25 Gaussians data set. A higher dimensionality may therefore require more nodes to produce stable results. As in [8], we initialized the weights in T_θ with a low standard deviation (σ = 0.05) to circumvent unstable behaviour in conjunction with the log term. The approximation (7) is more stable in that regard; however, the outlook of a better estimate in high dimensions motivated the use of the Donsker-Varadhan based estimator.

Fig. 2: Neural network representation of our approximation function T_θ. The samples of X^n and Y^n are concatenated and fed into the network. The network is comprised of two hidden layers with 20 nodes and ReLU activation function. The output function is linear.

Our training of the encoder is now implemented in two phases. After the initialization of the encoder weights, we train the mutual information estimation network for an initial round with 1000 iterations and a batch size of 200. In the second phase, we alternate between maximizing (8) over the encoder weights φ and the estimator weights θ:

    \max_{\phi} \max_{\theta} \tilde{I}_\theta(X^n_\phi(m); Y^n).

We maximize over the encoder weights φ with batch sizes {100, 100, 1000} and {1000, 10000, 10000} iterations with a learning rate α of {0.01, 0.001, 0.001}, respectively. After every {100, 1000, 1000} iterations, i.e., 10 times during every cycle, we maximize over the estimator weights θ again with a batch size of 200.
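To make the batch construction around (8) concrete, the following NumPy sketch (the paper's implementation is in TensorFlow) draws 2k uniform messages, maps them through a stand-in unit-power codebook, passes them through the AWGN channel of Section II, forms k joint and k marginal pairs as described above, and evaluates the Donsker-Varadhan estimate for a toy critic in place of the trained network T_θ. The E_b/N_0-to-σ² conversion shown is one common convention and an assumption on our part.

```python
import numpy as np

rng = np.random.default_rng(2)
M, n, k = 16, 1, 200            # messages, complex channel uses, samples of the joint
bits = np.log2(M) / n           # bits per complex channel use

# noise variance from Eb/N0 = 7 dB (assumed convention; E_s = 1 after normalization)
ebno = 10 ** (7 / 10)
sigma2 = 1.0 / (bits * ebno)

# stand-in encoder: a table of M unit-average-power codewords (randomly initialized)
codebook = rng.standard_normal((M, n)) + 1j * rng.standard_normal((M, n))
codebook /= np.sqrt(np.mean(np.abs(codebook) ** 2))

def channel(x):
    """AWGN channel of Section II: y = x + z, z ~ CN(0, sigma2)."""
    z = np.sqrt(sigma2 / 2) * (rng.standard_normal(x.shape) + 1j * rng.standard_normal(x.shape))
    return x + z

def mi_estimate(T, x_joint, y_joint, x_marg, y_marg):
    """Sample-based Donsker-Varadhan estimate, cf. (8)."""
    return np.mean(T(x_joint, y_joint)) - np.log(np.mean(np.exp(T(x_marg, y_marg))))

# a batch of 2k message indices; the first k give joint samples, the second k are used
# to build marginal samples by re-pairing x and y (the "dropping" scheme of [8])
m = rng.integers(0, M, size=2 * k)
x = codebook[m]
y = channel(x)
x_joint, y_joint = x[:k], y[:k]
x_marg, y_marg = x[:k], y[k:]          # y from an independent batch breaks the dependence

def T(x, y):
    """Toy critic standing in for the trained two-hidden-layer network T_theta."""
    return np.real(x * np.conj(y)).sum(axis=1)

print("I_tilde estimate (nats):", mi_estimate(T, x_joint, y_joint, x_marg, y_marg))
# In the implementation, T is the neural network T_theta and this estimate is maximized
# alternately over the estimator weights theta and the encoder weights phi with Adam.
```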

Fig. 3: The resulting constellation points for 16 symbols are shown by varying the number of nodes in the mutual information estimation network T_θ. The constellations are for the node sizes 2 (a), 4 (b), 6 (c), 8 (d), 10 (e), 12 (f), 14 (g), and 16 (h).

Fig. 4: The resulting averaged block-error rates P_e are shown for n = 1 and the following systems: a standard 16-QAM coding scheme; an end-to-end learning system based on cross-entropy (CE) with known channel distribution; and our proposed mutual information maximization encoding and cross-entropy decoding system (MI+CE) without known channel distribution, i.e., sample based. (Block-error rate versus E_b/N_0 in dB; curves: Autoencoder MI+CE, Autoencoder CE, 16QAM.)

B. Decoder Training via Cross-Entropy

To evaluate our new method, we use a standard cross-entropy based NN decoder, consisting of a conversion from complex to real values, followed by dense hidden layers, a softmax layer, and an arg max layer, see Fig. 1. Let ν_{|M|} be the output of the last dense layer in the decoder network. The softmax function takes ν_{|M|} and returns a vector of probabilities for the message set, i.e., p ∈ (0,1)^{|M|}, where the entries p_m are calculated by

    p_m = f(\nu_{|M|})_m := \frac{\exp(\nu_m)}{\sum_i \exp(\nu_i)}.

The decoder then declares the estimated message to be m̂ = arg max_m p_m. For the training, we uniformly generate message indexes m and feed them into our previously trained encoder NN, which generates the codewords x^n(m). The codewords are then sent over the channel. The receiver gets a noisy signal y^n and feeds it into the decoder. The decoder outputs the estimated probabilities p_m of the received message index, which are fed into a cross-entropy function together with the true index m:

    H(M, \hat{M}) = -\sum_{m \in M} p(m) \log p_{\mathrm{decoder}}(m) = -E_{p(m)}[\log p_{\mathrm{decoder}}(m)],

which is estimated by averaging over the sample size k. This yields the cross-entropy cost function for the decoder weights ψ:

    J(\psi) = -\frac{1}{k} \sum_{i=1}^{k} \log p_{m_i},

where m_i represents the index of the message of the i-th sample. Finally, we use the Adam optimizer to train the decoder weights ψ by minimizing the cross-entropy. We remark that we assume that our decoding system knows the sent message index; this can be achieved by using a fixed seed on a random number generator, as proposed in [7].
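The decoder training just described can be sketched compactly. The following NumPy toy (the paper uses TensorFlow with the Adam optimizer) trains a one-hidden-layer softmax decoder with the cross-entropy cost J(ψ) on a fixed, randomly chosen unit-power codebook standing in for the trained encoder; the layer sizes, plain SGD updates, noise variance, and iteration counts are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
M, n, sigma2 = 16, 1, 0.05          # messages, complex channel uses, noise variance (illustrative)

# stand-in for the previously trained encoder: a fixed unit-power codebook
codebook = rng.standard_normal((M, n)) + 1j * rng.standard_normal((M, n))
codebook /= np.sqrt(np.mean(np.abs(codebook) ** 2))

# decoder: [Re(y), Im(y)] -> dense (ReLU) -> dense -> softmax over |M| messages
H = 64
W1 = rng.normal(0, 0.1, (2 * n, H)); b1 = np.zeros(H)
W2 = rng.normal(0, 0.1, (H, M));     b2 = np.zeros(M)

def forward(y):
    feats = np.concatenate([y.real, y.imag], axis=1)      # complex -> 2n real values
    h = np.maximum(feats @ W1 + b1, 0.0)
    logits = h @ W2 + b2
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    return feats, h, p / p.sum(axis=1, keepdims=True)     # softmax probabilities

lr, k = 0.05, 200                                         # plain SGD here; the paper uses Adam
for it in range(2000):
    m = rng.integers(0, M, size=k)                        # known at the receiver via a fixed seed
    z = np.sqrt(sigma2 / 2) * (rng.standard_normal((k, n)) + 1j * rng.standard_normal((k, n)))
    feats, h, p = forward(codebook[m] + z)
    # cross-entropy J(psi) and its gradient w.r.t. the logits, (p - one-hot)/k
    J = -np.mean(np.log(p[np.arange(k), m] + 1e-12))
    g = p.copy(); g[np.arange(k), m] -= 1.0; g /= k
    gW2 = h.T @ g; gb2 = g.sum(axis=0)
    gh = (g @ W2.T) * (h > 0)
    gW1 = feats.T @ gh; gb1 = gh.sum(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2; W1 -= lr * gW1; b1 -= lr * gb1

m_test = rng.integers(0, M, size=10_000)
z_test = np.sqrt(sigma2 / 2) * (rng.standard_normal((10_000, n)) + 1j * rng.standard_normal((10_000, n)))
_, _, p_test = forward(codebook[m_test] + z_test)
print("empirical block-error rate:", np.mean(p_test.argmax(axis=1) != m_test))
```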

C. Results

We have implemented our new approach using the TensorFlow [21] framework. The resulting symbol error curves can be seen in Fig. 4, which shows that the performance of our proposed method is indistinguishable from the approximated theoretical 16-QAM performance and the state-of-the-art autoencoder NN, which uses the knowledge of the channel in conjunction with a cross-entropy loss. Moreover, Fig. 6 shows the resulting constellations of the encoder for 16 symbols, for (a) the standard cross-entropy approach and (b) the mutual information maximization approach. Furthermore, we show in Fig. 5 the calculated mutual information values after training, for several signal-to-noise ratios. There, we compare these values for 16, 32, and 64 symbols. We remark that the performance of the encoders is dependent on the SNR during training. If we train, for example, the 64-symbol encoder at a high SNR, then the curve gets closer to 6 bits but also decreases in the low SNR regime. It can be seen that the mutual information bound comes close to the expected values, i.e., in the range of m-ary QAM, for mid to high SNR ranges.

Fig. 5: Mutual information estimation evaluations for trained encoders for n = 1 and {16, 32, 64} symbols. The encoder and the estimation network T_θ were trained as in Section IV-A with increasing SNR values: {10 : 14, 14 : 18, 17 : 21} dB for {16, 32, 64} symbols, respectively. (Mutual information in bits versus SNR in dB; the log(1 + SNR) curve is shown as a reference.)

Fig. 6: The resulting encoding constellations are shown for 16 symbols based on: (a) the standard cross-entropy approach; (b) the mutual information estimation based approach.

V. CONCLUSIONS AND OUTLOOK

We have shown that the recently developed mutual information neural estimator (MINE) can be used to train a channel encoding setup by alternating the maximization of the estimated mutual information over the estimator weights and the encoder weights. The training works without explicit knowledge of the channel density function and rather approximates a function of the channel, i.e., the mutual information, based on the samples of the input and output of the channel. We believe that this can perform better than an end-to-end learning setup, because the encoder basically uses the expert information about which performance function (i.e., the mutual information) it needs to optimize the encoding in order to perform well. This is in contrast to the end-to-end learning approach, where the neural network system needs to learn this information on its own. Furthermore, our method can be implemented without changes at the receiver side and is therefore suitable for fast deployment. The investigation of sample size bounds for our method is on-going work. Intuitively, it requires fewer samples than a GAN based approach due to the fact that we do not need to exactly generate the channel density. However, future research needs to investigate the performance under low sample size scenarios and compare it to GAN and RL based approaches. Moreover, the stability of the estimator, in comparison to the f-divergence estimator, needs to be investigated under different channel models.

REFERENCES

[1] T. O'Shea and J. Hoydis, "An introduction to deep learning for the physical layer," IEEE Transactions on Cognitive Communications and Networking, vol. 3, no. 4, pp. 563–575, Dec 2017.
[2] S. Dörner, S. Cammerer, J. Hoydis, and S. ten Brink, "Deep learning based communication over the air," IEEE Journal of Selected Topics in Signal Processing, vol. 12, no. 1, pp. 132–143, Feb 2018.
[3] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[4] H. Ye, G. Y. Li, B.-H. F. Juang, and K. Sivanesan, "Channel agnostic end-to-end learning based communication systems with conditional GAN," arXiv preprint arXiv:1807.00447, 2018.
[5] T. J. O'Shea, T. Roy, N. West, and B. C. Hilburn, "Physical layer communications system design over-the-air using adversarial networks," arXiv preprint arXiv:1803.03145, 2018.
[6] F. A. Aoudia and J. Hoydis, "End-to-end learning of communications systems without a channel model," arXiv preprint arXiv:1804.02276, 2018.
[7] M. Goutay, F. A. Aoudia, and J. Hoydis, "Deep reinforcement learning autoencoder with noisy feedback," arXiv preprint arXiv:1810.05419, 2018.
[8] I. Belghazi, S. Rajeswar, A. Baratin, R. D. Hjelm, and A. Courville, "MINE: Mutual information neural estimation," arXiv preprint arXiv:1801.04062, 2018.
[9] A. M. Fraser and H. L. Swinney, "Independent coordinates for strange attractors from mutual information," Physical Review A, vol. 33, no. 2, p. 1134, 1986.
[10] G. A. Darbellay and I. Vajda, "Estimation of the information by an adaptive partitioning of the observation space," IEEE Transactions on Information Theory, vol. 45, no. 4, pp. 1315–1321, 1999.
[11] A. Kraskov, H. Stögbauer, and P. Grassberger, "Estimating mutual information," Physical Review E, vol. 69, no. 6, p. 066138, 2004.
[12] S. Gao, G. Ver Steeg, and A. Galstyan, "Efficient estimation of mutual information for strongly dependent variables," in Artificial Intelligence and Statistics, 2015, pp. 277–286.
[13] W. Gao, S. Oh, and P. Viswanath, "Demystifying fixed k-nearest neighbor information estimators," IEEE Transactions on Information Theory, vol. 64, no. 8, pp. 5629–5661, Aug 2018.
[14] T. Suzuki, M. Sugiyama, J. Sese, and T. Kanamori, "Approximating mutual information by maximum likelihood density ratio estimation," in New Challenges for Feature Selection in Data Mining and Knowledge Discovery, 2008, pp. 5–20.
[15] D. Barber and F. Agakov, "The IM algorithm: A variational approach to information maximization," in Proceedings of the 16th International Conference on Neural Information Processing Systems. MIT Press, 2003, pp. 201–208.
[16] X. Nguyen, M. J. Wainwright, and M. I. Jordan, "Estimating divergence functionals and the likelihood ratio by convex risk minimization," IEEE Transactions on Information Theory, vol. 56, no. 11, pp. 5847–5861, 2010.
[17] S. Nowozin, B. Cseke, and R. Tomioka, "f-GAN: Training generative neural samplers using variational divergence minimization," in Advances in Neural Information Processing Systems, 2016, pp. 271–279.
[18] A. Ruderman, M. Reid, D. García-García, and J. Petterson, "Tighter variational representations of f-divergences via restriction to probability measures," arXiv preprint arXiv:1206.4664, 2012.
[19] S. Dörner, S. Cammerer, J. Hoydis, and S. ten Brink, "Deep learning based communication over the air," IEEE Journal of Selected Topics in Signal Processing, vol. 12, no. 1, pp. 132–143, 2018.
[20] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[21] J. Dean, R. Monga et al., "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/