

Gradient-free training of autoencoders for non-differentiable communication channels

Ognjen Jovanovic, Student Member, IEEE, Metodi P. Yankov, Member, IEEE, Francesco Da Ros, Senior Member, IEEE, and Darko Zibar

O. Jovanovic, M. P. Yankov, F. Da Ros, and D. Zibar are with the Department of Photonic Engineering, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark, e-mail: [email protected]

Abstract—Training of autoencoders using the back-propagation algorithm is challenging for non-differentiable channel models or in an experimental environment where gradients cannot be computed. In this paper, we study a gradient-free training method based on the cubature Kalman filter. To numerically validate the method, the autoencoder is employed to perform geometric constellation shaping on differentiable communication channels, showing the same performance as the back-propagation algorithm. Further investigation is done on a non-differentiable channel that includes: laser phase noise, additive white Gaussian noise and blind phase search-based phase noise compensation. Our results indicate that the autoencoder can be successfully optimized using the proposed training method to achieve better robustness to residual phase noise with respect to standard constellation schemes such as Quadrature Amplitude Modulation and Iterative Polar Modulation for the considered conditions.

Index Terms—Optical fiber communication, cubature Kalman filter, end-to-end learning, geometric constellation shaping, phase noise.

I. INTRODUCTION

COMMUNICATION systems consist of a transmitter and receiver designed with the aim to reliably transfer information from one end to another over a physical channel, e.g. air or optical fiber. Typically, both the transmitter and receiver are structured as a chain of multiple independent blocks, such as channel coding, modulation, pulse shaping, and equalization. Even though this block-wise approach has proven to be efficient, it is uncertain if it achieves the best possible end-to-end performance, e.g. highest throughput.

End-to-end learning of a communication system, which optimizes the transmitter and receiver (including all of their processes) jointly for a specific channel model and performance metric, was introduced in [1]. A communication system is perceived as an autoencoder (AE) [2] by representing the transmitter and receiver as neural networks (NNs) and considering the channel model as a non-trainable layer. Optimizing the AE is typically done by applying gradient-based algorithms, which require a differentiable channel model. In [3], the AE was trained numerically for a wireless link and its benefits were demonstrated experimentally. The idea has been expanded and applied to orthogonal frequency-division multiplexing (OFDM) [4] and multiple-input multiple-output (MIMO) [5]. In order to optimize full communication systems for optical fiber transmission, the AE has been applied to a dispersive linear fiber channel [6], [7], and to nonlinear frequency division multiplexing (NFDM) [8], [9] for transmission over the nonlinear dispersive fiber channel model.

The AE approach can be utilized for geometric constellation shaping (GCS) in order to close the gap to the theoretically achievable information rate, which is experienced with uniformly distributed signalling (such as conventional quadrature amplitude modulation (QAM)). Geometric constellation shaping is a process of optimizing the positions of the constellation points on the I/Q plane. The main goal of GCS is to achieve the best possible trade-off between Euclidean distance and energy distribution of the constellation for the given channel. Embedding the AE with different optical fiber channel models to learn geometric constellation shapes was demonstrated for perturbative models of the nonlinear fiber channel [10], [11], for non-dispersive channels [12], or for linear links with up to 1.2 dB of optical signal-to-noise ratio (SNR) gain with respect to standard QAM [13].

All the aforementioned works fulfilled the requirement that the channel model is differentiable in order to perform classical gradient-based optimization of an AE. This requirement is often too strict because 1) not all channel models are differentiable; 2) approximating a channel with a simple differentiable model results in inaccuracies [14]; 3) an accurate differentiable channel model can be too complicated for reliable optimization.

As an alternative, it was proposed to use generative adversarial networks (GANs) to provide a simple differentiable channel model of the complex or non-differentiable observed system [15]. The GANs are used to train the AE as demonstrated in [16] for wireless communication and in [14] for non-coherent fiber optic communication. However, using a GAN as a channel model adds an extra step to the AE optimization. For every optimization step of the AE, new data needs to be obtained from the non-differentiable channel model and used to train the GAN. The GAN requires a lot of training samples to learn an accurate channel approximation. Therefore, the overall process of AE optimization using GANs can be time-consuming [14]. Moreover, the data generated by the GAN during the AE optimization is synthetic and it is just an approximation of the probability distribution of the actual channel.

In [17], [18], a two-phase alternating algorithm for training of the AE without a channel model was presented. The algorithm is based on observing the transmitter and receiver as two separate NNs, applying a variant of reinforcement learning to train the transmitter and supervised learning to train the receiver. This approach is in contrast with a joint optimization since the optimization of the transmitter and receiver is not done simultaneously, and it requires more samples to converge [19]. A simultaneous optimization approach for an AE without channel knowledge, using simultaneous perturbation stochastic approximation (SPSA) [20] for gradient estimation, was demonstrated in [19]. However, it was shown that the variance of the gradient estimation increases with the number of parameters [18]. As a consequence, SPSA struggles to train AEs with a large number of parameters, e.g. a large constellation size if targeting constellation shaping.
In this paper, the differentiable channel model requirement is lifted by proposing a derivative-free optimization method for AEs. This allows the encoder and the decoder to be optimized simultaneously for arbitrary black-box channels, including non-numerical ones (e.g. experimental test-beds). Also, the method has the potential for online optimization. The proposed method is exemplified by adopting the cubature Kalman filter (CKF) [21] for the optimization of an AE. In order to show that this method can be used for differentiable channel models, the AE performs GCS for differentiable AWGN and nonlinear phase noise channels, resulting in nearly identical performance to typically used gradient-based optimization. Then, the AE is trained to perform GCS on a phase noise channel with residual phase noise, resulting from a non-differentiable carrier phase recovery algorithm.

The remainder of the paper is organized as follows. In Section II the basic principles of estimating the mutual information and the key concepts of using an AE for GCS are described. Section III explains how the weights of the AE can be optimized using the CKF. A detailed description of the system, the AE architecture and channel models, is provided in Section IV. Section V provides results on the mutual information achieved by the AE for different channel models. In Section VI the simulation results and future work are discussed. The conclusions are summarized in Section VII.

Notations: Boldface denotes multivariate quantities such as vectors (lowercase) and matrices (uppercase). The sets of real and complex numbers are denoted as $\mathbb{R}$ and $\mathbb{C}$, respectively. The subscripts $k$ and $j$ indicate time and iteration step, respectively. The covariance matrix of a matrix $\mathbf{A}$ is denoted as $\mathbf{P}_{AA}$, whereas the cross-covariance matrix of matrices $\mathbf{A}$ and $\mathbf{B}$ is denoted as $\mathbf{P}_{AB}$. The $i$-th column of a matrix $\mathbf{A}$ is represented by $\mathbf{A}_i$, whereas the $i$-th scalar element of a vector $\mathbf{a}$ is represented by $\mathbf{a}^{(i)}$. The subscript $j|j-1$ is used to indicate that the current value of the matrix (vector) is conditional on the previous iteration. If two subscripts occur for the same matrix, they are separated by a comma, e.g. $\mathbf{A}_{i,j|j-1}$ or $\mathbf{P}_{AA,j|j-1}$.

II. FUNDAMENTALS OF GEOMETRIC CONSTELLATION SHAPING

A. Mutual information

Consider $X$ to be a sequence of complex constellation points (symbols) that take values from $\mathcal{X} = \{x_1, x_2, \ldots, x_M\}$ with a uniform probability mass function $P_X(x) = \frac{1}{M}$, where $M$ is the number of constellation points. The entropy of the constellation, $H(X) = -\sum_{x \in \mathcal{X}} P_X(x)\log_2 P_X(x) = \log_2(M) = m$, represents the number of bits carried by a symbol. Let $Y$ be a continuous complex output of a memoryless channel with $X$ as its input. The input-output relation of the channel is governed by the channel transition probability density $p_{Y|X}(y|x)$. The amount of information that $Y$ contains about $X$ in bits per symbol is represented by the mutual information (MI)

$$I(X;Y) = H(X) - H_p(X|Y) = m - H_p(X|Y) = \sum_{x \in \mathcal{X}} P_X(x) \int_{\mathbb{C}} p_{Y|X}(y|x)\,\log_2\frac{p_{Y|X}(y|x)}{p_Y(y)}\,dy, \qquad (1)$$

where $\mathbb{C}$ denotes the set of complex numbers, $H_p(X|Y) = \mathbb{E}[-\log_2 p_{X|Y}(x|y)]$ is the conditional entropy of $X$ given $Y$, and $p_Y(y)$ is the probability density function of $Y$.

In order to calculate Eq. (1), the transition probability $p_{Y|X}(y|x)$ must be known. Since the main topic of this paper is black-box channels (which do not have known explicit analytical expressions), the transition probability is unknown. In such cases, Eq. (1) needs to be bounded. A lower bound on the MI, also known as an achievable information rate (AIR), can be obtained by using the mismatched decoding approach and assuming the transition probability $q_{Y|X}(y|x)$ of an auxiliary channel instead of the true $p_{Y|X}(y|x)$ [22],

$$I(X;Y) \geq H(X) - \hat{H}_q(X|Y) = m - \hat{H}_q(X|Y) = \sum_{x \in \mathcal{X}} P_X(x) \int_{\mathbb{C}} p_{Y|X}(y|x)\,\log_2\frac{q_{Y|X}(y|x)}{q_Y(y)}\,dy, \qquad (2)$$

where $\hat{H}_q(X|Y) = \mathbb{E}[-\log_2 q_{X|Y}(x|y)]$ is an upper bound on the true conditional entropy $H_p(X|Y)$. The inequality turns to equality only when $q_{Y|X}(y|x) = p_{Y|X}(y|x)$.
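As a concrete illustration of Eq. (2), the AIR can be estimated by Monte Carlo averaging over channel input-output pairs once an auxiliary channel is fixed. The sketch below assumes a complex Gaussian auxiliary channel (the choice used later in Section IV-C) and uniformly distributed inputs; the function and variable names are illustrative and not taken from the paper.

```python
import numpy as np

def air_gaussian_auxiliary(y, x_idx, constellation, sigma2):
    """Monte Carlo estimate of the AIR lower bound of Eq. (2) for a
    complex Gaussian auxiliary channel q(y|x) with total variance sigma2
    and uniformly distributed input symbols (assumed conventions)."""
    d2 = np.abs(y[:, None] - constellation[None, :]) ** 2      # |y - x'|^2 for all x'
    q_yx = np.exp(-d2 / sigma2) / (np.pi * sigma2)             # q(y|x')
    q_y = q_yx.mean(axis=1)                                    # q(y) under uniform P_X
    info = np.log2(q_yx[np.arange(len(y)), x_idx] / q_y)       # log2 q(y|x)/q(y)
    return info.mean()                                         # bits/symbol

# usage sketch: random 64-point constellation over an AWGN channel at 18 dB SNR
rng = np.random.default_rng(0)
const = rng.normal(size=64) + 1j * rng.normal(size=64)
const /= np.sqrt(np.mean(np.abs(const) ** 2))                  # unit average power
idx = rng.integers(0, 64, 100_000)
sigma2 = 10 ** (-18 / 10)
noise = np.sqrt(sigma2 / 2) * (rng.normal(size=idx.size) + 1j * rng.normal(size=idx.size))
print(air_gaussian_auxiliary(const[idx] + noise, idx, const, sigma2))
```

Because the auxiliary channel matches the true channel in this toy example, the estimate approaches the exact MI; for the black-box channels considered later it only provides a lower bound.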

B. Geometric constellation shaping with autoencoders

Typically, GCS involves the optimization of the constellation points in the complex plane with the aim to maximize the MI $I(X;Y)$. The MI can be maximized without explicit channel knowledge or assumptions by leveraging AEs to perform GCS. The AE accomplishes this by making an approximation $\hat{p}_{X|Y}(x|y)$ of the true posterior distribution $p_{X|Y}(x|y)$ using the decoder NN. It was shown that an AE can be used to improve the lower bound on the MI [12].

The considered communication system employing an AE for geometrical constellation shaping is shown in Fig. 1. The encoder and the decoder are represented by feed-forward neural networks $NN_e(\mathbf{w}_e)$ and $NN_d(\mathbf{w}_d)$, parameterized with trainable weights $\mathbf{w}_e \in \mathbb{R}^{N_e}$ and $\mathbf{w}_d \in \mathbb{R}^{N_d}$, respectively. The number of weights, including all trainable weights and biases of the NN, in the encoder is denoted by $N_e$, whereas the number of weights in the decoder is $N_d$. The overall goal is to find the corresponding $NN_e(\mathbf{w}_e)$ and $NN_d(\mathbf{w}_d)$ topologies and the weight set $\mathbf{w} = \{\mathbf{w}_e, \mathbf{w}_d\} \in \mathbb{R}^N$, where $N = N_e + N_d$ is the total number of trainable weights, that would maximize the MI between the transmitted and the received symbols for the considered channel. In this paper, the goal of the encoder is to shape the input sequence to a geometrically optimized constellation that is robust to channel impairments, whereas the decoder learns to reconstruct the transmitted symbols with high fidelity.

Fig. 1. Example of autoencoder model for geometrical constellation shaping. The number of input/output nodes of the AE and the hidden layers are used for illustration purposes.

The encoder $NN_e(\mathbf{w}_e)$ performs a mapping of the input one-hot encoded vector $\mathbf{u}_k \in \mathbb{R}^M$ to an output $\tilde{x}_k = \mathrm{Re}\{\tilde{x}_k\} + i\cdot\mathrm{Im}\{\tilde{x}_k\} \in \mathbb{C}$, representing a point in a constellation plane, where $k$ is a running index representing time, $\mathrm{Re}\{\cdot\}$ is the real part and $\mathrm{Im}\{\cdot\}$ is the imaginary part of the complex point. Normalized to an average power of 1, the complex symbol $x_k$ is then sent through the channel $c(\cdot)$. It should be emphasized that the channel $c(\cdot)$ can be any channel for which an output can be generated for a given input. The output of the channel, denoted by $y_k$, is then applied to the decoder $NN_d(\mathbf{w}_d)$, which uses a softmax output layer to output a vector of posterior probabilities $\mathbf{s}_k \in \mathbb{R}^M$. The full data propagation throughout the AE is represented with $h(\cdot)$ such that $\mathbf{s}_k = h(k, \mathbf{w}, \mathbf{u}_k)$.

The described AE structure can be considered an $M$-class pattern-classification problem. For such a problem, it is appropriate to optimize the weights $\mathbf{w}$ by minimizing the cross-entropy cost function [2]

$$J_{CE}(k, \mathbf{w}) = -\sum_{i=1}^{M} \mathbf{t}_k^{(i)} \log \mathbf{s}_k^{(i)} = -\sum_{i=1}^{M} \mathbf{t}_k^{(i)} \log h^{(i)}(k, \mathbf{w}, \mathbf{u}_k), \qquad (3)$$

where $\mathbf{t}_k$ is the target AE output and $(i)$ denotes the $i$-th output node of the decoder neural network $NN_d$. Even though the target AE output is $\mathbf{t}_k = \mathbf{u}_k$, the notation $\mathbf{t}_k$ will be used in the rest of the paper for clarity. The cross-entropy cost function can be used to calculate an AE-based upper bound $\hat{H}_{\hat{p}}(X|Y) = \mathbb{E}[-\log \hat{p}_{X|Y}(x|y)] \approx \frac{1}{K}\sum_{k=1}^{K} J_{CE}(k, \mathbf{w})$ on the true conditional entropy $H_p(X|Y)$, where $K$ is the number of symbols transmitted. Replacing $\hat{H}_q(X|Y)$ in Eq. (2) with $\hat{H}_{\hat{p}}(X|Y)$, an AE-based lower bound on the MI is obtained. Therefore, minimizing the cross-entropy cost function is equivalent to maximizing the MI between transmitted and received symbols, thereby satisfying the overall goal of the system. The weights $\mathbf{w}$ can be optimized by applying Bayesian filtering techniques [23], as discussed in the following.

III. TRAINING OF AUTOENCODERS USING BAYESIAN FILTERING

In order to apply Bayesian filtering techniques for the optimization, the AE structure depicted in Fig. 1 has to be described using a state-space modelling framework. The state-space model is described by a pair of equations, known as the process equation and the measurement equation. The process equation describes the evolution of the states, which are non-observable variables that we would like to estimate ($\mathbf{w}$ for the considered case). The corresponding measurement equation relates observable variables ($\mathbf{s}_k$ for the considered case) to the states [23].

In order to correctly perform the optimization of the AE using Bayesian filtering techniques, the state-space model has to be defined specifically for the system under consideration. In the following subsection III-A, the state-space model will be constructed so that it fulfills the requirements of the AE structure and the desired performance metric (MI for the considered case). First, the process equation will be defined, followed by an adaptation of the measurement equation to fit the system requirements (cross-entropy cost function and batch optimization). In subsection III-B, the optimization procedure will be detailed based on the established state-space model.

A. Autoencoder state-space model

For the AE, the weights associated with the encoder and decoder NNs form column vectors constructed by stacking the weights associated with each neuron, starting with the first neuron in the first hidden layer and progressing first in width (next neuron) and then in depth (next layer). The two column vectors are then stacked to form a single column weight vector $\mathbf{w}_j = [\mathbf{w}_{e,j}^T, \mathbf{w}_{d,j}^T]^T$, where $j$ is the $j$-th iteration step. The AE weights are considered as states and their evolution is described by the first-order auto-regressive equation [23]:

$$\mathbf{w}_j = \mathbf{w}_{j-1} + \mathbf{q}_{j-1}, \qquad (4)$$

where $\mathbf{q}_{j-1}$ represents the process-noise term, which corresponds to white Gaussian noise of zero mean and covariance matrix $\mathbf{Q}_{j-1}$. For simplicity reasons, the covariance matrix is defined as a diagonal matrix $\mathbf{Q}_{j-1} = Q_{j-1}\mathbf{I}$, where $\mathbf{I}$ is an identity matrix. The process-noise term is intentionally included to avoid the states being trapped in a local minimum during the initial training stage.

The measurement equation for the state-space model describing the AE is defined such that it assumes that the target output $\mathbf{t}_j$ is a noisy measurement of the actual AE output:

$$\mathbf{t}_j = h(j, \mathbf{w}_j, \mathbf{u}_j) + \mathbf{r}_j, \qquad (5)$$

where $\mathbf{r}_j$ is the measurement-noise term, which corresponds to white Gaussian noise of zero mean and covariance matrix $\mathbf{R}_j$. The term $\mathbf{u}_j$ is the input of the AE at iteration $j$. With the measurement equation in its current form, Bayesian filtering techniques implicitly minimize the sum of squared errors cost function [23],

$$J(j, \mathbf{w}_j) = \sum_{i=1}^{M}\left(\mathbf{t}_j^{(i)} - h^{(i)}(j, \mathbf{w}_j, \mathbf{u}_j)\right)^2. \qquad (6)$$

However, such a cost function does not comply with the goal of maximizing the MI between the transmitted and received symbols. Therefore, the measurement equation needs to be adapted to a form where the weight estimate $\mathbf{w}_j$ is obtained by minimizing the cross-entropy cost function. The adaptation is done following the steps explained in [23]. First, the measurement equation (5) can be rewritten as

$$\mathbf{0} = (\mathbf{t}_j - h(j, \mathbf{w}_j, \mathbf{u}_j)) + \mathbf{r}_j, \qquad (7)$$

where the measurement is forced to take the vector value $\mathbf{0}$. Then, the vector-valued measurement equation can be reformulated as a scalar-valued equation

$$0 = \sqrt{\sum_{i=1}^{M}\left(\mathbf{t}_j^{(i)} - h^{(i)}(j, \mathbf{w}_j, \mathbf{u}_j)\right)^2} + r_j = \sqrt{J(j, \mathbf{w}_j)} + r_j, \qquad (8)$$

where $r_j$ is now a one-dimensional zero-mean Gaussian noise with variance $R_j$.

The scalar reformulation of the measurement equation offers significant benefits: 1) it reduces the computational complexity of the Bayesian filter optimization; 2) it improves the numerical stability of the Bayesian filter [23]; 3) the cost function is directly incorporated into the measurement equation. The new formulation of the measurement equation allows various cost functions to be fitted and used with Bayesian filtering techniques.
The final step of the adaptation is incorporating the required cross-entropy cost function from Eq. (3) into Eq. (8),

$$\tilde{t}_j = 0 = \sqrt{-\sum_{i=1}^{M}\mathbf{t}_j^{(i)}\log h^{(i)}(j, \mathbf{w}_j, \mathbf{u}_j)} + r_j = \tilde{h}(j, \mathbf{w}_j, \mathbf{u}_j, \mathbf{t}_j) + r_j. \qquad (9)$$

Bayesian filtering techniques typically perform a single weight-vector update on the basis of a single input-output data pair. In the case of the AE, due to the normalization of the constellation, it would be advantageous to perform the weight update on a batch of data. Batch optimization is referred to as multistream training in Kalman filter related work and is described in [24]. Multiple instances of the input, a batch of size $B$, are propagated through the system with the same weight set $\mathbf{w}_j$. The input to the AE is now $\mathbf{U}_j = [\mathbf{u}_{j\cdot B}, \mathbf{u}_{j\cdot B+1}, \ldots, \mathbf{u}_{j\cdot B+(B-1)}]^T$, each instance of which is propagated through the AE to form a vector-valued measurement,

$$\tilde{\mathbf{t}}_j = \mathbf{0} = [\tilde{t}_{j\cdot B}, \tilde{t}_{j\cdot B+1}, \ldots, \tilde{t}_{j\cdot B+(B-1)}]^T = \tilde{h}(j, \mathbf{w}_j, \mathbf{U}_j, \mathbf{T}_j) + \mathbf{r}_j, \qquad (10)$$

where $\mathbf{T}_j$ is the target output corresponding to the AE input $\mathbf{U}_j$ and $\mathbf{r}_j = [r_{j\cdot B}, r_{j\cdot B+1}, \ldots, r_{j\cdot B+(B-1)}]^T$ is a vector that represents different independent noise realizations. A diagonal covariance matrix $\mathbf{R}_j = R_j\mathbf{I} \in \mathbb{R}^{B\times B}$ is used to describe $\mathbf{r}_j$.

Now that both the process and measurement equations are defined to fulfill the requirements of the AE, a Bayesian filtering technique can be used to estimate the weights. Based on the Bayesian filtering paradigm, a complete statistical description of the state at iteration $j$ is provided by the posterior density of the state. Exploiting the new measurement, the old posterior density of the state is updated in two steps, prediction and correction. In the correction step, the posterior density of the state is computed exploiting the predictive density obtained in the prediction step. The weight estimation within the state-space model given by Eq. (4) and (9) can be solved by using various nonlinear Bayesian state estimation techniques [23]. In this paper, the focus was put on using the cubature Kalman filter (CKF) [21]. The advantages of using the CKF are its accuracy and, most importantly, that it does not require computation of gradients, providing gradient-free training of AEs.
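To make the adapted measurement model concrete, the sketch below evaluates the scalar measurement of Eq. (9) for each element of a batch, yielding the vector-valued measurement of Eq. (10). The forward pass $h(\cdot)$ is passed in as a user-supplied function, which is an assumption of the sketch; the paper does not prescribe an implementation.

```python
import numpy as np

def measurement_h_tilde(w, U, T, forward):
    """Batched scalar measurement of Eqs. (9)-(10).
    w: stacked AE weight vector; U, T: batch of one-hot inputs and targets;
    forward(w, u): assumed AE forward pass returning the softmax output s."""
    out = np.empty(len(U))
    for b, (u, t) in enumerate(zip(U, T)):
        s = forward(w, u)                        # decoder posteriors, shape (M,)
        ce = -np.sum(t * np.log(s + 1e-12))      # cross-entropy of Eq. (3)
        out[b] = np.sqrt(ce)                     # scalar measurement of Eq. (9)
    return out                                   # vector-valued measurement of Eq. (10)
```

In practice the forward pass would be bound to this function (e.g. with `functools.partial`) before it is handed to the filter, so that the filter only sees a mapping from weights to measurements.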

Fig. 2. Schematic of the optimization process. The covariance matrix calculation is denoted as $cov(\cdot)$, whereas $cov(\cdot,\cdot)$ denotes the cross-covariance matrix calculation. A bar is used to represent a column vector, a square to represent a square matrix, and multiple bars encapsulated in a rectangle represent a rectangular matrix. Two colors have been used with $\mathbf{W}_{j|j-1}$ to indicate that one half of the matrix is the result of an addition and the other half the result of a subtraction. The unit time delay operator is denoted as $Z^{-1}$.

B. Cubature Kalman filter

The assumption of the CKF is that the predictive and the posterior densities of the state are Gaussian distributions described by their means and covariances. In the prediction step, the predictive density of the current process state is computed based on the posterior density of the previous process state. The mean and covariance of the predictive density of the process state are denoted as $\hat{\mathbf{w}}_{j|j-1} \in \mathbb{R}^N$ and $\mathbf{P}_{j|j-1} \in \mathbb{R}^{N\times N}$, respectively. Since the process equation (4) yields a linear transition, $\hat{\mathbf{w}}_{j|j-1}$ and $\mathbf{P}_{j|j-1}$ are defined as:

$$\hat{\mathbf{w}}_{j|j-1} = \mathbf{w}_{j-1}, \qquad (11)$$
$$\mathbf{P}_{j|j-1} = \mathbf{P}_{j-1} + \mathbf{Q}_{j-1}. \qquad (12)$$

In the correction step, the posterior density of the state is obtained from the predicted measurement density. In order to calculate the predicted measurement density, $2N$ cubature points are formed

$$\mathbf{W}_{j|j-1} = \left[\hat{\mathbf{w}}_{j|j-1} + \epsilon\sqrt{\mathbf{P}_{j|j-1}}, \ \hat{\mathbf{w}}_{j|j-1} - \epsilon\sqrt{\mathbf{P}_{j|j-1}}\right] \in \mathbb{R}^{N\times 2N}, \qquad (13)$$

where $\sqrt{(\cdot)}$ is the square root of a matrix satisfying the relation $\mathbf{P}_{j|j-1} = \sqrt{\mathbf{P}_{j|j-1}}\sqrt{\mathbf{P}_{j|j-1}}^T$, and the unit cubature point is $\epsilon = \sqrt{N}$. Each column $\mathbf{W}_{i,j|j-1}$ of the cubature points, where $i = 1, 2, \ldots, 2N$ represents the $i$-th column of $\mathbf{W}_{j|j-1}$, is a different realization of the AE weights, and the input is propagated through each of them to obtain the output of each realization, $\mathbf{T}_{j|j-1} \in \mathbb{R}^{B\times 2N}$. The obtained outputs are used to calculate the predicted measurement $\hat{\mathbf{t}}_{j|j-1} \in \mathbb{R}^B$:

$$\mathbf{T}_{j|j-1} = \tilde{h}(j, \mathbf{W}_{j|j-1}, \mathbf{U}_j, \mathbf{T}_j), \qquad (14)$$
$$\hat{\mathbf{t}}_{j|j-1} = \frac{1}{2N}\sum_{i=1}^{2N}\mathbf{T}_{i,j|j-1}, \qquad (15)$$

where $\mathbf{T}_{i,j|j-1}$ represents the $i$-th column of $\mathbf{T}_{j|j-1}$. The covariance associated with the predicted measurement density, $\mathbf{P}_{TT,j|j-1} \in \mathbb{R}^{B\times B}$, also known as the innovation covariance, is estimated as

$$\mathbf{P}_{TT,j|j-1} = cov(\mathbf{T}_{j|j-1}) + \mathbf{R}_j = \frac{1}{2N}\sum_{i=1}^{2N}(\mathbf{T}_{i,j|j-1} - \hat{\mathbf{t}}_{j|j-1})(\mathbf{T}_{i,j|j-1} - \hat{\mathbf{t}}_{j|j-1})^T + \mathbf{R}_j, \qquad (16)$$

where $cov(\cdot)$ denotes the covariance matrix calculation. The cross-covariance $\mathbf{P}_{WT,j|j-1} \in \mathbb{R}^{N\times B}$ of the state and the measurement is calculated as

$$\mathbf{P}_{WT,j|j-1} = cov(\mathbf{W}_{j|j-1}, \mathbf{T}_{j|j-1}) = \frac{1}{2N}\sum_{i=1}^{2N}(\mathbf{W}_{i,j|j-1} - \hat{\mathbf{w}}_{j|j-1})(\mathbf{T}_{i,j|j-1} - \hat{\mathbf{t}}_{j|j-1})^T, \qquad (17)$$

where $cov(\cdot,\cdot)$ denotes the cross-covariance matrix calculation. The Kalman gain $\mathbf{G}_j \in \mathbb{R}^{N\times B}$ is then calculated as

$$\mathbf{G}_j = \mathbf{P}_{WT,j|j-1}\mathbf{P}_{TT,j|j-1}^{-1}, \qquad (18)$$

and used to update the weights $\mathbf{w}_j$ conditional on the measurement,

$$\mathbf{w}_j = \hat{\mathbf{w}}_{j|j-1} + \mathbf{G}_j(\tilde{\mathbf{t}}_j - \hat{\mathbf{t}}_{j|j-1}) = \hat{\mathbf{w}}_{j|j-1} + \mathbf{G}_j(-\hat{\mathbf{t}}_{j|j-1}). \qquad (19)$$

Based on Eq. (10), the target output is $\tilde{\mathbf{t}}_j = \mathbf{0}$, therefore it can be neglected in Eq. (19). The Kalman gain $\mathbf{G}_j$ is also used to update the covariance $\mathbf{P}_j$,

$$\mathbf{P}_j = \mathbf{P}_{j|j-1} - \mathbf{G}_j\mathbf{P}_{TT,j|j-1}\mathbf{G}_j^T. \qquad (20)$$

Observing Eq. (13)–(18), it can be noticed that the complexity of the algorithm depends on the number of weights $N$ and the batch size $B$. The noise covariance matrices $\mathbf{Q}_{j-1}$ and $\mathbf{R}_j$ are hyperparameters of the CKF algorithm, and its performance depends on how they are chosen. Fig. 2 illustrates the optimization process described by Eq. (11)–(20).
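A compact sketch of one CKF iteration, following Eqs. (11)–(20), is given below. Here `h_tilde(w, U, T)` is assumed to be the batched measurement function of Eq. (10) (e.g. the earlier sketch with its forward pass bound via `functools.partial`), and the Cholesky factor is used as one possible matrix square root; both choices are assumptions of the sketch, since the paper fixes only the equations, not an implementation.

```python
import numpy as np

def ckf_step(w, P, Q, R, U, T, h_tilde):
    """One prediction + correction step of the CKF, Eqs. (11)-(20).
    w: (N,) weight estimate, P: (N,N) state covariance,
    Q: (N,N) process noise, R: (B,B) measurement noise."""
    N = w.size
    # prediction step, Eqs. (11)-(12)
    w_pred, P_pred = w, P + Q
    # cubature points, Eq. (13) (Cholesky factor assumed as the matrix square root)
    S = np.linalg.cholesky(P_pred)
    eps = np.sqrt(N)
    W = np.hstack([w_pred[:, None] + eps * S, w_pred[:, None] - eps * S])   # (N, 2N)
    # propagate every weight realization through the AE, Eqs. (14)-(15)
    Tm = np.column_stack([h_tilde(W[:, i], U, T) for i in range(2 * N)])    # (B, 2N)
    t_pred = Tm.mean(axis=1)
    # innovation and cross-covariance, Eqs. (16)-(17)
    dT = Tm - t_pred[:, None]
    dW = W - w_pred[:, None]
    P_tt = dT @ dT.T / (2 * N) + R
    P_wt = dW @ dT.T / (2 * N)
    # Kalman gain and update, Eqs. (18)-(20); the target measurement is the zero vector
    G = P_wt @ np.linalg.inv(P_tt)
    return w_pred - G @ t_pred, P_pred - G @ P_tt @ G.T
```

Repeating this step on fresh batches until the cost converges corresponds to the optimization loop illustrated in Fig. 2; no gradient of the channel or of the networks is required at any point.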

IV. COMMUNICATION SYSTEM DESCRIPTION

A. AE model

In [25], it was demonstrated that state-of-the-art generalized mutual information (GMI) performance can be achieved with an encoder NN with no hidden layers and with biases set to zero. Since the GMI is a good indicator of system performance [26], an encoder NN with no hidden layers and with biases set to zero has been used in this paper. As a result of a coarse optimization, the decoder neural network has a single hidden layer of $M/2$ nodes and Leaky ReLU as the activation function. The AE architecture is summarized in Table I. The initial weight vector $\hat{\mathbf{w}}_0$ is initialized using Glorot initialization [27], whereas the initial covariance $\mathbf{P}_0$ is an identity matrix.

TABLE I
PARAMETERS OF THE ENCODER AND DECODER NEURAL NETWORKS

                                      Encoder NN   Decoder NN
# of input nodes                      M            2
# of hidden layers                    0            1
# of nodes per hidden layer           0            M/2
# of output nodes                     2            M
Bias                                  No           Yes
Hidden layer activation function      None         Leaky ReLU
Output layer activation function      Linear       Softmax

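As a rough illustration of the architecture in Table I, the sketch below builds the encoder (a bias-free linear mapping from the $M$-dimensional one-hot input to the two constellation coordinates) and the decoder (one hidden layer of $M/2$ Leaky-ReLU units followed by a softmax). The weight ordering, the Glorot scaling and the power normalization are assumptions made for the sketch, not the authors' code.

```python
import numpy as np

M = 64                          # constellation size (Table I)
H = M // 2                      # decoder hidden nodes (Table I)

def init_weights(rng):
    """Glorot-style initialization of the stacked weight vector w (layout assumed)."""
    w_enc = rng.normal(0.0, np.sqrt(2.0 / (M + 2)), size=(M, 2))   # encoder: M -> 2, no bias
    w_h   = rng.normal(0.0, np.sqrt(2.0 / (2 + H)), size=(2, H))   # decoder hidden layer
    b_h   = np.zeros(H)
    w_out = rng.normal(0.0, np.sqrt(2.0 / (H + M)), size=(H, M))   # decoder output layer
    b_out = np.zeros(M)
    return np.concatenate([a.ravel() for a in (w_enc, w_h, b_h, w_out, b_out)])

def ae_forward(w, u, channel=lambda x: x):
    """Forward pass of Fig. 1: one-hot input -> constellation point ->
    channel -> decoder softmax posteriors."""
    i = 0
    w_enc = w[i:i + M * 2].reshape(M, 2); i += M * 2
    w_h   = w[i:i + 2 * H].reshape(2, H); i += 2 * H
    b_h   = w[i:i + H];                   i += H
    w_out = w[i:i + H * M].reshape(H, M); i += H * M
    b_out = w[i:i + M]
    # encoder weights double as the constellation; normalize to unit average power
    const = w_enc / np.sqrt(np.mean(np.sum(w_enc ** 2, axis=1)))
    x = complex(*(u @ const))                     # transmitted symbol x_k
    y = channel(x)                                # any black-box channel, 1 sample/symbol
    a = np.array([y.real, y.imag]) @ w_h + b_h
    h = np.where(a > 0, a, 0.01 * a)              # Leaky ReLU
    z = h @ w_out + b_out
    e = np.exp(z - z.max())
    return e / e.sum()                            # softmax posteriors s_k
```

With the channel bound to a concrete model, `ae_forward` can serve as the `forward` argument of the measurement function sketched in Section III.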
B. Embedded channel models

In this paper, three channel models that operate on a one sample per symbol basis, with symbol rate $R_s$, are embedded into the autoencoder. Those channel models are illustrated in Fig. 3.

Fig. 3. Embedded channel models consist of: 1) additive white Gaussian noise (AWGN); 2) nonlinear phase noise (NLPN); 3) phase noise, AWGN and the blind phase search algorithm for phase recovery.

1) AWGN channel: The noise variance is determined by the signal-to-noise ratio (SNR): $\sigma_N^2 = \frac{1}{SNR}$. As the AWGN channel model is differentiable, a fair comparison of the performance of the AE trained using the CKF and gradient-based methods can be obtained. For the considered case, we employ the backpropagation algorithm using the Adam optimizer as the benchmark gradient-based training method [28].

2) Nonlinear phase noise channel: A memoryless fiber channel model that includes multiple spans $N_{sp}$ with ideal lumped amplification is considered. Erbium-Doped Fiber Amplifiers (EDFAs), described by the noise figure $NF$ and amplifier gain $G$, are used and they introduce amplified spontaneous emission (ASE) noise $n_{ASE}$. Similar to [12], the channel model is obtained from the nonlinear Schrödinger equation by neglecting the dispersion. The channel model shows the impact of a data-dependent nonlinear phase shift known as the nonlinear phase noise (NLPN) and it is defined by the recursion

$$z_k^{(i+1)} = z_k^{(i)}\,e^{\,i\gamma L_{eff}|z_k^{(i)}|^2} + n_{ASE,k}^{(i+1)}, \qquad 0 \leq i < N_{sp}, \qquad (21)$$

where $(i)$ denotes the $i$-th fiber span, $\gamma$ is the nonlinearity parameter, and $L_{eff} = (1 - e^{-\alpha L})/\alpha$ is the effective length of a span. The actual length of a span is $L$ and $\alpha$ is the attenuation coefficient. The noise term $n_{ASE}^{(i+1)}$ is a zero-mean Gaussian random variable with variance $P_n = hF_c R_s(G\cdot NF - 1)/2$ [29], where $h$ is Planck's constant and $F_c$ is the carrier frequency. The input to the channel is the power-rescaled signal $z_k^{(0)} = \sqrt{P_{in}}\,x_k$, where $P_{in}$ is the launch power. The output of the channel, $y_k = z_k^{(N_{sp})}/\sqrt{P_{in}}$, is normalized before being processed by the decoder.

The NLPN channel model serves as the second differentiable channel model used for comparison of the CKF and gradient-based methods. As for the previous channel model, the Adam optimizer is used as the benchmark gradient-based method.
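A sketch of the span-by-span NLPN recursion of Eq. (21) is given below, with parameter defaults mirroring Section V-B. The assumption that the EDFA gain exactly compensates the span loss, and the per-quadrature interpretation of $P_n$, are the sketch's own; they are not stated explicitly in the paper.

```python
import numpy as np

def nlpn_channel(x, P_in_dBm, rng, N_sp=10, gamma=1.27, alpha_dB=0.2, L=100.0,
                 NF_dB=5.0, Rs=32e9, Fc=193.41e12):
    """Memoryless NLPN channel of Eq. (21): N_sp fiber spans with ideal lumped EDFAs.
    x: unit-power complex symbols; gamma in 1/(W km), alpha_dB in dB/km, L in km."""
    h_planck = 6.62607015e-34
    alpha = alpha_dB * np.log(10) / 10.0               # attenuation in 1/km
    L_eff = (1.0 - np.exp(-alpha * L)) / alpha         # effective span length in km
    G = 10 ** (alpha_dB * L / 10.0)                    # gain compensating the span loss (assumed)
    NF = 10 ** (NF_dB / 10.0)
    P_n = h_planck * Fc * Rs * (G * NF - 1) / 2        # ASE variance per quadrature [W] (assumed)
    P_in = 1e-3 * 10 ** (P_in_dBm / 10.0)              # launch power [W]
    z = np.sqrt(P_in) * np.asarray(x, dtype=complex)   # power-rescaled input z^(0)
    for _ in range(N_sp):
        z = z * np.exp(1j * gamma * L_eff * np.abs(z) ** 2)        # nonlinear phase shift
        z = z + np.sqrt(P_n) * (rng.normal(size=z.shape)
                                + 1j * rng.normal(size=z.shape))    # ASE noise
    return z / np.sqrt(P_in)                           # normalized output y_k
```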
AWGN channel a mismatched Gaussian receiver [31], [32] and assume the The studied SNR region includes values from the interval transition probability 푞푌 |푋 (푦|푥) in Eq. (2) is of an auxiliary Gaussian channel SNR = {10, 11,..., 25} dB. Since the Gaussian receiver is optimal for the AWGN channel, the presented results for this 1 ||푦 − 푥||2 푞 (푦|푥) = exp (− ), (28) channel model only include the Gaussian receiver. 푌 |푋 √︃ 2 2휋휎2 2휎퐺 The simulation results of the testing for the AWGN channel 퐺 are shown in Fig. 4, which illustrates the performance in MI 2 2 where 휎퐺 is the noise variance and ||푦 − 푥|| is a squared with respect to the SNR. The learned constellations using the Euclidean distance between the symbols. Applying the Bayes’ AE trained with the CKF and the backpropagation are denoted theorem, the conditional probability that a specific 푥 was sent as AE-CKF and AE-BP, respectively. This notation will be observing 푦 is given as used throughout the rest of this section. The learned constel- lations AE-CKF and AE-BP result in a similar performance in 푝 (푥)푞 | (푦|푥) ( | ) = 푋 푌 푋 the MI with an average difference of around 0.01 bits/symbol. 푞푋 |푌 푥 푦 푖 푖 . (29) Í 푖 ( = ) ( | = ) 푥 ∈X 푝푋 푥 푥 푞푌 |푋 푦 푥 푥 Compared to QAM, the two constellations achieve higher MI The Monte Carlo approach can be used to evaluate Eq. (29). in the low SNR region, as expected, whereas the difference is 2 The noise variance 휎퐺 can be estimated from the channel marginal compared to the IPM. input-output pairs. The insets illustrate the learned constellations AE-CKF and In the case of the AWGN channel model, the assumed tran- AE-BP when training on SNR = 18 dB. The Euclidian distance sition probability 푞푌 |푋 (푦|푥) is identical to the true 푝푌 |푋 (푦|푥). between some of the points in AE-CKF is quite small, but this Therefore, the Gaussian receiver is the maximum likelihood does not have a negative effect on the MI performance. Max- (ML) receiver for the AWGN channel and the estimated noise imum shaping gain is not achieved by maximizing Euclidean 2 2 variance 휎퐺 = 휎푁 . Therefore, on the AWGN channel, the distance but rather by optimizing the trade-off between the performance of the Gaussian receiver is an upper bound for Euclidian distance and the energy distribution of the signal. As the performance of the NN decoder. expected for an AWGN channel, the optimizer pushes points However, in the case of channel models 2) and 3), the towards the origin in order to resemble the AWGN-optimal true transition probability 푝푌 |푋 (푦|푥) is approximated by Gaussian distribution of the signal energy, thus providing 푞푌 |푋 (푦|푥). The combined of the nonlinear phase overall gain. noise and the additive noise of channel model 2) is assumed to be purely Gaussian and approximated with the noise variance B. NLPN channel 휎2 . The same assumption and approximation is made for 퐺 The training of the AE is performed by sweeping the launch the combined distortion of the residual phase noise and the power 푃 in the interval 푃 = {−8, −7.5,..., 0} dBm for the additive noise of channel model 3). Therefore, the Gaussian 푖푛 푖푛 fiber parameters 훾 = 1.27 1 , 훼 = 0.2 dB , 푁퐹 = 5 dB, receiver is a mismatched receiver for both of these channel 푊 ·푘푚 푘푚 퐹 = 193.41 THz, 푁 = 10, and 퐿 = 100 km. The models. 푐 푠 푝 MI estimation for the QAM and IPM constellation is per- formed using the mismatched Gaussian receiver. The results V. 
C. Gaussian receiver

For optical communication, it is a common approach to use a mismatched Gaussian receiver [31], [32] and assume that the transition probability $q_{Y|X}(y|x)$ in Eq. (2) is that of an auxiliary Gaussian channel,

$$q_{Y|X}(y|x) = \frac{1}{\sqrt{2\pi\sigma_G^2}}\exp\left(-\frac{||y-x||^2}{2\sigma_G^2}\right), \qquad (28)$$

where $\sigma_G^2$ is the noise variance and $||y-x||^2$ is the squared Euclidean distance between the symbols. Applying Bayes' theorem, the conditional probability that a specific $x$ was sent given the observation $y$ is

$$q_{X|Y}(x|y) = \frac{P_X(x)\,q_{Y|X}(y|x)}{\sum_{x^i\in\mathcal{X}} P_X(x = x^i)\,q_{Y|X}(y|x = x^i)}. \qquad (29)$$

The Monte Carlo approach can be used to evaluate Eq. (29). The noise variance $\sigma_G^2$ can be estimated from the channel input-output pairs.

In the case of the AWGN channel model, the assumed transition probability $q_{Y|X}(y|x)$ is identical to the true $p_{Y|X}(y|x)$. Therefore, the Gaussian receiver is the maximum likelihood (ML) receiver for the AWGN channel, and the estimated noise variance is $\sigma_G^2 = \sigma_N^2$. Consequently, on the AWGN channel, the performance of the Gaussian receiver is an upper bound for the performance of the NN decoder.

However, in the case of channel models 2) and 3), the true transition probability $p_{Y|X}(y|x)$ is approximated by $q_{Y|X}(y|x)$. The combined distortion of the nonlinear phase noise and the additive noise of channel model 2) is assumed to be purely Gaussian and approximated with the noise variance $\sigma_G^2$. The same assumption and approximation is made for the combined distortion of the residual phase noise and the additive noise of channel model 3). Therefore, the Gaussian receiver is a mismatched receiver for both of these channel models.
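The posterior of Eq. (29) under the Gaussian auxiliary channel of Eq. (28) can be evaluated directly, for example as sketched below; the noise-variance estimate from input-output pairs is one simple possibility and an assumption of the sketch.

```python
import numpy as np

def gaussian_receiver(y, constellation, sigma2_G):
    """Symbol posteriors q(x|y) of Eq. (29) under the Gaussian auxiliary
    channel of Eq. (28), assuming uniformly distributed inputs.
    Returns a (K, M) matrix of posterior probabilities."""
    d2 = np.abs(y[:, None] - constellation[None, :]) ** 2
    lik = np.exp(-d2 / (2.0 * sigma2_G))          # unnormalized q(y|x), Eq. (28)
    return lik / lik.sum(axis=1, keepdims=True)   # Bayes' rule, Eq. (29)

def estimate_sigma2(y, x):
    """Per-dimension noise-variance estimate from input-output pairs (one simple choice)."""
    return np.mean(np.abs(y - x) ** 2) / 2.0
```

Feeding these posteriors into the conditional-entropy term of Eq. (2) gives mismatched-receiver MI estimates of the kind reported in the next section.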

V. NUMERICAL RESULTS

The system symbol rate is $R_s$ = 32 GBd and the size of the constellation is $M$ = 64. At each training iteration $j$, a batch $\mathbf{U}_j$ of $B = 32\cdot M$ one-hot encoded input vectors $\mathbf{u}_k$ is generated. The hyperparameters of the CKF algorithm, $Q_{j-1}$ and $R_j$, are coarsely optimized using a standard grid search method. Both hyperparameters are sampled from the set $\{1, 10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}, 10^{-6}\}$ and the CKF algorithm is applied with each combination. The combination that achieves the best mutual information is chosen. The training was done for each SNR value. The AE is trained until the cost function converges. Afterwards, the weights of the AE are fixed and testing is performed. The testing was done by running 100 simulations with $10^5$ symbols per simulation. Each of the trained AEs was tested with the same channel parameters it was trained on.

The constellations learned by the AE will be compared to iterative polar modulation (IPM) [33] and square QAM (referred to simply as QAM in the following). The IPM was chosen as a benchmark because it is a near-optimal constellation shape for the AWGN channel [33].

A. AWGN channel

The studied SNR region includes values from the interval SNR = {10, 11, ..., 25} dB. Since the Gaussian receiver is optimal for the AWGN channel, the presented results for this channel model only include the Gaussian receiver.

The simulation results of the testing for the AWGN channel are shown in Fig. 4, which illustrates the performance in MI with respect to the SNR. The constellations learned using the AE trained with the CKF and with backpropagation are denoted as AE-CKF and AE-BP, respectively. This notation will be used throughout the rest of this section. The learned constellations AE-CKF and AE-BP result in a similar performance in MI, with an average difference of around 0.01 bits/symbol. Compared to QAM, the two constellations achieve higher MI in the low SNR region, as expected, whereas the difference is marginal compared to the IPM.

Fig. 4. AWGN channel: Mutual information as a function of SNR for constellation size $M$ = 64. Inset: example of the learned constellations with CKF and BP at SNR = 18 dB.

The insets illustrate the learned constellations AE-CKF and AE-BP when training at SNR = 18 dB. The Euclidean distance between some of the points in AE-CKF is quite small, but this does not have a negative effect on the MI performance. Maximum shaping gain is not achieved by maximizing the Euclidean distance but rather by optimizing the trade-off between the Euclidean distance and the energy distribution of the signal. As expected for an AWGN channel, the optimizer pushes points towards the origin in order to resemble the AWGN-optimal Gaussian distribution of the signal energy, thus providing overall gain.

B. NLPN channel

The training of the AE is performed by sweeping the launch power $P_{in}$ in the interval $P_{in}$ = {−8, −7.5, ..., 0} dBm for the fiber parameters $\gamma$ = 1.27 1/(W·km), $\alpha$ = 0.2 dB/km, $NF$ = 5 dB, $F_c$ = 193.41 THz, $N_{sp}$ = 10, and $L$ = 100 km. The MI estimation for the QAM and IPM constellations is performed using the mismatched Gaussian receiver. The results for the constellations learned by applying the CKF (AE-CKF) and backpropagation (AE-BP) include the MI estimation for both the mismatched Gaussian receiver and the decoder NN.

Fig. 5. NLPN channel: Mutual information as a function of launch power $P_{in}$ for constellation size $M$ = 64. Inset: example of the learned constellations with CKF and BP for launch power $P_{in}$ = −2.5 dBm.

In Fig. 5, the MI performance with respect to the launch power $P_{in}$ is shown. For AE-CKF and AE-BP, the dashed lines represent the MI performance when the mismatched Gaussian receiver is used and the solid lines represent the MI performance when using the decoder NN. The learned constellations AE-CKF and AE-BP outperform both QAM and IPM in terms of MI, with both the mismatched Gaussian receiver and the decoder NN. Observing just the mismatched Gaussian receiver results, the learned constellations achieve a greater maximum MI by up to 0.12 and 0.17 bits/symbol compared to QAM and IPM, respectively. Observing the studied launch power region, the maximum gain in MI compared to QAM was obtained at $P_{in}$ = 0 dBm and it amounts to around 0.26 bits/symbol. At the same launch power, the maximum gain in MI compared to IPM was achieved, reaching a value of up to 0.52 bits/symbol. It can be concluded that the learned constellations are more robust to nonlinear phase noise than QAM and IPM.

The two learned constellations AE-CKF and AE-BP have similar performance for both receivers in question. The average difference for both the mismatched Gaussian receiver and the decoder NN is around 0.01 bits/symbol. Observing the MI performance obtained with the decoder NN, similar mitigation and compensation was observed as was shown in [12]. The insets illustrate the learned constellations AE-CKF and AE-BP when training at launch power $P_{in}$ = −2.5 dBm.

The obtained results demonstrate that the proposed AE optimization method, the CKF, can achieve similar performance to BP when the derivative of the cost function with respect to the encoder weights $\mathbf{w}_e$ can be calculated.

C. Non-differentiable phase noise channel

The training of the AE is performed for each of the SNR values in the interval SNR = {15, 16, ..., 20} dB, combined with the BPS parameters $N_s$ = 36 and $W_{BPS}$ = {40, 64} and linewidth $\Delta\nu$ = 100 kHz. A mismatched Gaussian receiver is used instead of the decoder NN for the MI estimation during the testing phase in order to have a fair comparison between all the constellations.

In Fig. 6(a)-(b), the MI performance with respect to SNR is shown. The dashed lines show the performance of the constellations for the AWGN channel. The dashed line AE-CKF$_{AWGN}$ presents the MI obtained when the AE-CKF trained for channel model 3) is tested on channel model 1) for the respective SNR. These results are added in order to observe the penalty introduced by the phase noise. The solid lines represent the average MI value over the 100 test simulations at a given SNR, whereas the upper limit of the error bar is the maximum obtained MI value and the lower limit shows the 25th percentile. Therefore, the error bars represent the 75% of simulations with the highest MI. Constellations optimized on channel models 1) and 2) have been tested on channel model 3) and the curves representing the obtained results are denoted as "AE-CKF 1)" and "AE-CKF 2)", respectively. The insets illustrate the learned constellation AE-CKF when training at SNR = 18 dB.

Fig. 6(a) shows the MI as a function of SNR when the BPS parameters are $N_s$ = 36 and $W_{BPS}$ = 64. For the SNR values with a mean MI less than 3.5 bits/symbol, all points are removed from the figure for visual clarity. For the used BPS parameters, phase slips occur at SNR = 15 dB for the QAM constellation, resulting in a wide error bar. The IPM constellation is designed for the AWGN channel and it is not suitable for a phase noise channel. The BPS algorithm does not estimate the phase accurately, resulting in wide error bars. Observing the studied SNR region, the AE learned constellation achieves the highest gain in MI compared to QAM at SNR = 15 dB, and it amounts to around 0.45 bits/symbol. The gain in MI compared to QAM is 0.15 bits/symbol at SNR = 16 dB, and with increasing SNR the gain slowly decays to 0.08 bits/symbol, which is achieved at SNR = 20 dB. Observing the width of the error bars, it can be noticed that the spread of MI performance is smaller for the AE learned constellation compared to QAM, especially for lower SNR values. It can also be noticed that the maximum achieved MI value of QAM is lower than the 25th percentile of the AE learned constellation, implying that the AE learned constellation has a better performance in at least 75% of the test cases.

Fig. 6. Non-differentiable phase noise channel: Mutual information as a function of SNR for constellation size $M$ = 64. The parameters of the BPS are: (a) $N_s$ = 36 and $W_{BPS}$ = 64 and (b) $N_s$ = 36 and $W_{BPS}$ = 40. The line shows the mean value, whereas the upper limit of the error bar is the max value and the lower limit is the 25th percentile. Points that have a mean less than 3.5 and 3 bits/symbol are omitted from figures (a) and (b), respectively, for visual clarity. Inset: example of the learned constellation at SNR = 18 dB.

The AE-CKF and IPM constellations achieve similar maximum MI at high SNR, but the AE-CKF is more robust and has a significantly higher mean MI value, whereas at SNR = 20 dB the two constellations have marginal differences in MI. The AE-CKF 1) achieves similar performance as AE-CKF for SNR = {19, 20} dB due to the fact that the main impairment at these SNR values is AWGN. Even though both IPM and AE-CKF 1) are optimized for the AWGN channel, the AE-CKF 1) performs better than IPM when applied to the non-differentiable channel. However, the AE-CKF 2) has similar performance as IPM. As expected, both AE-CKF 1) and 2) do not perform well for the given non-differentiable channel since they were not optimized for it.

Next, the potential improvement at lower BPS complexity is analyzed. The window size was decreased to $W_{BPS}$ = 40 and the results are illustrated in Fig. 6(b). All of the SNR points with a mean MI less than 3 bits/symbol were removed from the figure for visual clarity. Reducing the complexity of the BPS affects its performance, resulting in less accurate phase estimates. Therefore, the mentioned issues regarding QAM and IPM occur even for higher SNR values. The gain in MI achieved by the learned constellation compared to QAM at SNR = 20 dB is similar to the one achieved in the previous scenario, whereas compared to IPM for the same SNR = 20 dB a greater gain was achieved, reaching a value of around 0.41 bits/symbol. The achieved gain increases with decreasing SNR. The gain at SNR = 17 dB already exceeds the highest gain achieved in the previous scenario, reaching a gain of around 0.7 bits/symbol. Similarly to the previous scenario, the error bars of the AE-CKF constellation are less spread than for QAM and IPM at the same SNR, indicating better robustness of the constellation. The less accurate phase estimation affects the performance of AE-CKF 1) and AE-CKF 2), which are in this scenario more penalized than QAM, but still less than IPM. At SNR = 20 dB, the AE-CKF 1) stands out with a higher MI than QAM, but still lower than AE-CKF.

The dashed lines in Fig. 6 show that the AE-CKF constellation learned by training on channel model 3) lost AWGN shaping gain compared to when it was trained directly on the AWGN channel. The AE-CKF$_{AWGN}$ and QAM$_{AWGN}$ now have similar performance in MI. Comparing the situations with and without phase noise, the penalty in MI is significantly lower for AE-CKF than for QAM. This demonstrates that the AE-CKF constellation is more robust to phase noise at the expense of a slight loss of AWGN shaping gain.

The previous figures show results when the AE was trained and tested on the same SNR value, to show the best achieved MI performance at the given SNR. However, in practice an SNR estimation error of up to ∼1 dB can be encountered [34]. Fig. 7 shows the MI performance of the constellations when they are tested on different SNR values than the ones they were trained on. The results are only for the case when the BPS parameters are $N_s$ = 36 and $W_{BPS}$ = 64, and the curve denoted as "Envelope" is the same as the "AE-CKF" curve from Fig. 6(a); it indicates the training points as well. It has been expanded with the MI obtained when training at SNR = 14 dB, in order to show the robustness of the constellation trained at SNR = 15 dB. The results obtained when training at SNR = {18, 19, 20} dB are similar, and therefore only the results obtained for SNR = 18 dB are shown. All of the constellations show at least up to ±0.5 dB of robustness to SNR estimation errors with less than 1% penalty in MI. The constellation trained at SNR = 18 dB can be used for the SNR = [17, 20] dB interval with less than 1% penalty in MI, but at the expense of a considerable performance deterioration for SNR < 17 dB.

It has been established that the AE-CKF constellations outperform QAM and IPM when using the mismatched Gaussian receiver.

Fig. 7. Non-differentiable phase noise channel: Mutual information as a function of SNR for constellation size $M$ = 64 and BPS parameters $N_s$ = 36 and $W_{BPS}$ = 64. Sweeping the observed SNR interval with constellations trained at different SNR values.

Fig. 8. Non-differentiable phase noise channel: Mutual information as a function of SNR for constellation size $M$ = 64. The dashed lines represent the mismatched Gaussian receiver, whereas the solid lines represent the NN receiver.

Now, a comparison between the mismatched Gaussian receiver and the NN decoder for the AE-CKF constellations is provided, since the encoder and decoder come as a pair. In Fig. 8 the MI is illustrated for the mismatched Gaussian receiver with dashed lines and the NN decoder with solid lines, for $W_{BPS}$ = 64 and $W_{BPS}$ = 40. For lower SNR, when the estimation of the phase noise is worse, the NN decoder outperforms the Gaussian receiver. As the SNR increases, the estimation of the phase noise improves, resulting in AWGN being the main source of distortion. Thus, the gain in MI that the NN decoder achieves compared to the mismatched Gaussian receiver decreases, since the mismatched Gaussian receiver is optimal for the AWGN channel. For SNR = {19, 20} dB the estimation of the phase noise is highly accurate for all the observed cases, therefore all have similar performance in regards to MI. In the case of $W_{BPS}$ = 64, the estimation of the phase noise is still quite accurate at SNR = 15 dB and the NN decoder achieves around 0.09 bits/symbol gain compared to the Gaussian receiver. In the case of $W_{BPS}$ = 40, the estimation of the phase noise is significantly degraded at SNR = 15 dB and the NN decoder achieves up to around 0.44 bits/symbol gain compared to the Gaussian receiver.

VI. DISCUSSION

In this paper, the differentiable channel model requirement is lifted by proposing a derivative-free optimization method for AEs. The proposed method is exemplified by adopting the CKF for the AE weight optimization. Most notably, the results of this study imply that the AE can be optimized with arbitrary black-box channels and not only with known differentiable ones. Although the training capabilities of the proposed method were demonstrated in a few scenarios, the proposed method should allow the training of autoencoders on arbitrary channel models, e.g. experimental test-beds, which is the ultimate goal. This is left for future work.

The CKF algorithm can be applied to the training of recurrent NNs, which is important if considering a channel with finite memory, such as the dispersive optical channel. The proposed state-space framework supports training of recurrent NNs [23], which showed superior performance compared to feed-forward NNs when applied to a channel with finite memory [7]. In such a case, the weight set is simply expanded with the weights of the recurrent connections, meaning an identical state-space model for the kernel weights is used. The CKF derivation from this paper is therefore valid. This study is out of the scope of this paper, but is an interesting direction for future work.

VII. CONCLUSION

We have proposed and numerically demonstrated a derivative-free method for training autoencoders for geometrical constellation shaping. This is achieved by expressing the autoencoder weights and system as a state-space model and then applying the cubature Kalman filter (CKF) for state (encoder and decoder weights) estimation. For the differentiable AWGN and NLPN channel models, it was shown that the performance, in terms of the mutual information of the learned constellations, is almost identical to when the training is performed using the standard backpropagation algorithm. The CKF-trained autoencoder was also tested on a phase noise channel with a non-differentiable phase recovery algorithm. In such a case, the autoencoder-learned constellations achieved significant performance and robustness improvements with respect to conventional constellation shapes optimized for an AWGN channel. It should be emphasized that the proposed method can be applied to any AE structure and not just for GCS as it was used in this paper.

ACKNOWLEDGMENT

This work was financially supported by the European Research Council through the ERC-CoG FRECOM project (grant agreement no. 771878), the Villum Young Investigator OPTIC-AI project (grant no. 29334), and DNRF SPOC, DNRF123.

REFERENCES

[1] T. O'Shea and J. Hoydis, "An Introduction to Deep Learning for the Physical Layer," IEEE Transactions on Cognitive Communications and Networking, vol. 3, no. 4, pp. 563–575, 2017.
[2] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep learning. MIT Press, Cambridge, 2016, vol. 1.
[3] S. Dörner, S. Cammerer, J. Hoydis, and S. ten Brink, "Deep Learning Based Communication Over the Air," IEEE Journal of Selected Topics in Signal Processing, vol. 12, no. 1, pp. 132–143, 2018.
[4] A. Felix, S. Cammerer, S. Dörner, J. Hoydis, and S. ten Brink, "OFDM-Autoencoder for End-to-End Learning of Communications Systems," in 2018 IEEE 19th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), 2018, pp. 1–5.
[5] T. J. O'Shea, T. Erpek, and T. C. Clancy, "Physical layer deep learning of encodings for the MIMO fading channel," in 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2017, pp. 76–80.
[6] B. Karanov, M. Chagnon, F. Thouin, T. A. Eriksson, H. Bulow, D. Lavery, P. Bayvel, and L. Schmalen, "End-to-End Deep Learning of Optical Fiber Communications," Journal of Lightwave Technology, vol. 36, no. 20, pp. 4843–4855, 2018.
[7] B. Karanov, D. Lavery, P. Bayvel, and L. Schmalen, "End-to-end optimized transmission over dispersive intensity-modulated channels using bidirectional recurrent neural networks," Optics Express, vol. 27, no. 14, p. 19650, 2019.
[8] S. Gaiarin, R. T. Jones, F. Da Ros, and D. Zibar, "End-to-end optimized nonlinear Fourier transform-based coherent communications," Conference on Lasers and Electro-Optics (CLEO), p. SF2L.4, 2020.
[9] S. Gaiarin, F. Da Ros, R. T. Jones, and D. Zibar, "End-to-End Optimization of Coherent Optical Communications Over the Split-Step Fourier Method Guided by the Nonlinear Fourier Transform Theory," Journal of Lightwave Technology, vol. 39, no. 2, pp. 418–428, 2021.
[10] R. T. Jones, T. A. Eriksson, M. P. Yankov, and D. Zibar, "Deep Learning of Geometric Constellation Shaping Including Fiber Nonlinearities," European Conference on Optical Communication (ECOC), 2018.
[11] R. T. Jones, M. P. Yankov, and D. Zibar, "End-to-end learning for GMI optimized geometric constellation shape," in European Conference on Optical Communication (ECOC), 2019, pp. 1–3.
[12] S. Li, C. Häger, N. Garcia, and H. Wymeersch, "Achievable Information Rates for Nonlinear Fiber Communication via End-to-end Autoencoder Learning," European Conference on Optical Communication (ECOC), 2018.
[13] M. Schaedler, S. Calabrò, F. Pittalà, G. Böcherer, M. Kuschnerov, C. Bluemm, and S. Pachnicke, "Neural network assisted geometric shaping for 800Gbit/s and 1Tbit/s optical transmission," 2020 Optical Fiber Communications Conference and Exhibition (OFC), vol. Part F174-, no. DM, pp. 3–5, 2020.
[14] B. Karanov, M. Chagnon, V. Aref, D. Lavery, P. Bayvel, and L. Schmalen, "Concept and experimental demonstration of optical IM/DD end-to-end system optimization using a generative model," 2020 Optical Fiber Communications Conference and Exhibition (OFC), pp. 1–3, 2020.
[15] T. J. O'Shea, T. Roy, and N. West, "Approximating the void: Learning stochastic channel models from observation with variational generative adversarial networks," in 2019 International Conference on Computing, Networking and Communications (ICNC). IEEE, 2019, pp. 681–686.
[16] H. Ye, G. Y. Li, B.-H. F. Juang, and K. Sivanesan, "Channel agnostic end-to-end learning based communication systems with conditional GAN," in 2018 IEEE Globecom Workshops (GC Wkshps). IEEE, 2018, pp. 1–5.
[17] F. A. Aoudia and J. Hoydis, "End-to-End Learning of Communications Systems Without a Channel Model," in 2018 52nd Asilomar Conference on Signals, Systems, and Computers, 2018, pp. 298–303.
[18] ——, "Model-Free Training of End-to-End Communication Systems," IEEE Journal on Selected Areas in Communications, vol. 37, no. 11, pp. 2503–2516, 2019.
[19] V. Raj and S. Kalyani, "Backpropagating Through the Air: Deep Learning at Physical Layer Without Channel Models," IEEE Communications Letters, vol. 22, no. 11, pp. 2278–2281, 2018.
[20] J. C. Spall et al., "Multivariate stochastic approximation using a simultaneous perturbation gradient approximation," IEEE Transactions on Automatic Control, vol. 37, no. 3, pp. 332–341, 1992.
[21] S. Haykin and I. Arasaratnam, "Cubature Kalman filters," IEEE Transactions on Automatic Control, vol. 54, no. 6, pp. 1254–1269, 2009.
[22] D. M. Arnold, H.-A. Loeliger, P. O. Vontobel, A. Kavcic, and W. Zeng, "Simulation-Based Computation of Information Rates for Channels With Memory," IEEE Transactions on Information Theory, vol. 52, no. 8, pp. 3498–3508, 2006.
[23] I. Arasaratnam and S. Haykin, "Nonlinear Bayesian filters for training recurrent neural networks," in Mexican International Conference on Artificial Intelligence. Springer, 2008, pp. 12–33.
[24] L. A. Feldkamp and G. V. Puskorius, "A signal processing framework based on dynamic neural networks with application to problems in adaptation, filtering, and classification," Proceedings of the IEEE, vol. 86, no. 11, pp. 2259–2277, 1998.
[25] K. Gümüs, A. Alvarado, B. Chen, C. Häger, and E. Agrell, "End-to-end learning of geometrical shaping maximizing generalized mutual information," Optical Fiber Communication Conference (OFC) 2020, pp. 10–12, 2020.
[26] A. Alvarado, E. Agrell, D. Lavery, R. Maher, and P. Bayvel, "Replacing the Soft-Decision FEC Limit Paradigm in the Design of Optical Communication Systems," Journal of Lightwave Technology, vol. 33, no. 20, pp. 4338–4352, Oct 2015.
[27] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," Journal of Machine Learning Research, vol. 9, pp. 249–256, 2010.
[28] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[29] R. Essiambre, G. Kramer, P. J. Winzer, G. J. Foschini, and B. Goebel, "Capacity Limits of Optical Fiber Networks," Journal of Lightwave Technology, vol. 28, no. 4, pp. 662–701, 2010.
[30] T. Pfau, S. Hoffmann, and R. Noé, "Hardware-efficient coherent digital receiver concept with feedforward carrier recovery for M-QAM constellations," Journal of Lightwave Technology, vol. 27, no. 8, pp. 989–999, 2009.
[31] A. Lapidoth and S. S. Shitz, "On Information Rates for Mismatched Decoders," IEEE Transactions on Information Theory, vol. 40, no. 6, pp. 1953–1967, 1994.
[32] M. P. Yankov, F. Da Ros, E. P. da Silva, S. Forchhammer, K. J. Larsen, L. K. Oxenløwe, M. Galili, and D. Zibar, "Constellation Shaping for WDM Systems Using 256QAM/1024QAM With Probabilistic Optimization," Journal of Lightwave Technology, vol. 34, no. 22, pp. 5146–5156, 2016.
[33] I. B. Djordjevic, H. G. Batshon, L. Xu, and T. Wang, "Coded polarization-multiplexed iterative polar modulation (PM-IPM) for beyond 400 Gb/s serial optical transmission," Optical Fiber Communication Conference, p. OMK2, 2010.
[34] F. N. Khan, Z. Dong, C. Lu, A. P. T. Lau, X. Zhou, and C. Xie, "Optical performance monitoring for fiber-optic communication networks," in Enabling Technologies for High Spectral-Efficiency Coherent Optical Communication Networks. Wiley Online Library, 2016.