
arXiv:1904.09252v2 [eess.SP] 4 Nov 2019

Learning Physical-Layer Communication with Quantized Feedback

Jinxiang Song, Bile Peng, Christian Häger, Henk Wymeersch, Anant Sahai

J. Song, B. Peng, C. Häger, and H. Wymeersch are with the Department of Electrical Engineering, Chalmers University of Technology, Gothenburg, Sweden, email: {bile.peng, christian.haeger, henkw}@chalmers.se. C. Häger is also with the Department of Electrical and Computer Engineering, Duke University, Durham, USA. A. Sahai is with the Department of Electrical Engineering and Computer Science, UC Berkeley, Berkeley, USA. The work of C. Häger was supported by the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant No. 749798. The work of H. Wymeersch was supported by the Swedish Research Council under grant No. 2018-03701.

Abstract—Data-driven optimization of transmitters and receivers can reveal new modulation and detection schemes and enable physical-layer communication over unknown channels. Previous work has shown that practical implementations of this approach require a feedback signal from the receiver to the transmitter. In this paper, we study the impact of quantized feedback on data-driven learning of physical-layer communication. A novel quantization method is proposed, which exploits the specific properties of the feedback signal and is suitable for nonstationary signal distributions. The method is evaluated for both linear and nonlinear channels. Simulation results show that feedback quantization does not appreciably affect the learning process and can lead to similar performance as compared to the case where unquantized feedback is used for training, even with 1-bit quantization. In addition, it is shown that learning is surprisingly robust to noisy feedback where random bit flips are applied to the quantization bits.

I. INTRODUCTION

As communication systems become more complex, physical-layer design, i.e., devising optimal transmission and detection methods, has become harder as well. This is true in wireless communication, where hardware impairments and quantization noise have increasingly become a limitation on the achievable performance, but also in optical communication for which the nonlinear nature of the channel precludes the use of standard approaches. This has led to a new line of research on physical-layer communication where the transmission and detection methods are learned from data. The general idea is to regard the transmitter and receiver configurations as parameterized functions (e.g., neural networks) and find good parameter configurations using large-scale gradient-based optimization approaches from machine learning.

Data-driven methods have mainly focused on learning receivers assuming a given transmitter and channel, e.g., for MIMO detection [1] or channel decoding [2]. These methods have led to algorithms that either perform better or exhibit lower complexity than model-based algorithms. More recently, end-to-end learning of both the transmitter and receiver has been proposed for various physical-layer applications including wireless [3], [4], nonlinear optical [5]–[7], and visible light communication [8].
In practice, gradient-based transmitter optimization is problematic since it requires a known and differentiable channel model. One approach to circumvent this limitation is to first learn a differentiable surrogate channel model, e.g., through an adversarial process [9], [10]. We follow a different approach based on stochastic transmitters, where the transmitted symbol for a fixed message is assumed to be a random variable [11]–[13]. This allows for the computation of surrogate gradients during the training process, which can then be used to update the transmitter parameters.¹ A related approach is proposed in [14].

¹See [12, Sec. III-C] for a discussion about the relationship between the approaches in [11]–[13] and [14].

In order to compute the surrogate gradients, the transmitter must receive a feedback signal from the receiver. This feedback signal can either be perfect [11]–[14] or noisy. In the latter case, one can regard the feedback transmission as a separate communication problem for which optimized transmitter and receiver pairs can again be learned, which was proposed in [15]. The training scheme in [15] alternates between optimizing the different transmitter/receiver pairs, with the intuition that better training of one pair leads to better training for the other pair (and vice versa). Thus, both communication systems continuously and simultaneously improve until some predefined stopping criterion is met (see Alg. 3 in [15]). The feedback transmission in [15], however, assumed that real numbers are transmitted over an additive white Gaussian noise (AWGN) channel. In practice, the feedback signals, including the signals in [15], will be quantized to a finite number of bits. To the best of our knowledge, quantization of the feedback signal has not yet been considered in the literature. Studies on quantization have so far only been conducted in terms of finite resolution when the learned transmitter and receiver models are implemented in practice, for example for receiver processing [16]–[20].

In this paper, we analyze the impact of quantization of the feedback signal on data-driven learning of physical-layer communication over an unknown channel. Compared to [15], the feedback transmission scheme is not learned. Instead, we show that due to the specific properties of the feedback signal, an adaptive pre-processing followed by a simple fixed quantization scheme can lead to similar performance as compared to the case where unquantized feedback is used for training, even with 1-bit quantization. We provide a theoretical justification for the proposed approach and perform extensive simulations for both linear Gaussian and nonlinear phase-noise channels. The detailed contributions in this paper are as follows:

1) We propose a novel quantization method for feedback signals in data-driven learning of physical-layer communication. The proposed method addresses a major shortcoming in previous work, in particular the assumption in [15] that feedback losses can be transmitted as unquantized real numbers over an AWGN channel.

2) We conduct a thorough numerical study demonstrating the effectiveness of the proposed scheme. We investigate the impact of the number of quantization bits on the performance and the training process, showing that 1-bit quantization can provide performance similar to unquantized feedback. In addition, it is shown that the scheme is robust to noisy feedback where the quantized signal is perturbed by random bit flips.

3) We provide a theoretical justification for the effectiveness of the proposed approach in the form of Propositions 1 and 2. In particular, it is proved that feedback quantization and bit flips manifest themselves merely as a scaling of the expected gradient used for parameter training. Moreover, upper bounds on the variance of the gradient are derived in terms of the Fisher information matrix of the transmitter parameters.

Notation: Vectors will be denoted with lower case letters in bold (e.g., x), with xn or [x]n referring to the n-th entry in x; matrices will be denoted in bold capitals (e.g., X); E{x} denotes the expectation operator; V{x} denotes the variance (the trace of the covariance matrix) of the random vector x (i.e., V{x} = E{x⊺x} − (E{x})⊺(E{x})).

Fig. 1: Data-driven learning model where the discrete time index k (e.g., mk) is omitted for all variables. The quantization and binary feedback is shown in the lower dashed box, while the proposed pre-processor is highlighted. Note that w = 0 for the receiver learning (Sec. III-A).

II. SYSTEM MODEL

We wish to transmit messages m ∈ {1, ..., M} over an a priori unknown static memoryless channel which is defined by a conditional probability density function (PDF) p(y|x), where x, y ∈ C and M is the total number of messages.²

²In this paper, we restrict ourselves to two-dimensional (i.e., complex-valued) channel models, where the generalization to an arbitrary number of dimensions is straightforward.

The communication system is implemented by representing the transmitter and receiver as two parameterized functions fτ : {1, ..., M} → C and fρ : C → [0, 1]^M, where [a, b]^M is the M-fold Cartesian product of the [a, b]-interval (i.e., the elements in [a, b]^M are vectors of length M with entries between a and b inclusively) and τ and ρ are sets of transmitter and receiver parameters, respectively. The transmitter maps the k-th message mk to a complex symbol xk = fτ(mk), where an average power constraint according to E{|xk|²} ≤ P is assumed. The symbol xk is sent over the channel and the receiver maps the channel observation yk to a probability vector qk = fρ(yk), where one may interpret the components of qk as estimated posterior probabilities for each possible message. Finally, the receiver outputs an estimated message according to m̂k = argmax_m [qk]_m, where [x]_m returns the m-th component of x. The setup is depicted in the top branch of the block diagram in Fig. 1, where the random perturbation w in the transmitter can be ignored for now.

We further assume that there exists a feedback link from the receiver to the transmitter, which, as we will see below, facilitates transmitter learning. In general, our goal is to learn optimal transmitter and receiver mappings fτ and fρ using limited feedback.

III. DATA-DRIVEN LEARNING

In order to find good parameter configurations for τ and ρ, a suitable optimization criterion is required. Due to the reliance on gradient-based methods, conventional criteria such as the symbol error probability Pr(mk ≠ m̂k) cannot be used directly. Instead, it is common to minimize the expected cross-entropy loss defined by

    ℓ(τ, ρ) ≜ −E{log([fρ(yk)]mk)},    (1)

where the dependence of ℓ(τ, ρ) on τ is implicit through the distribution of yk.

A major practical hurdle is the fact that the gradient ∇τ ℓ(τ, ρ) cannot actually be evaluated because it requires a known and differentiable channel model. To solve this problem, we apply the alternating optimization approach proposed in [11], [12], which we briefly review in the following. For this approach, one alternates between optimizing first the receiver parameters ρ and then the transmitter parameters τ for a certain number of iterations N. To that end, it is assumed that the transmitter and receiver share common knowledge about a database of training data mk.

A. Receiver Learning

For the receiver optimization, the transmitter parameters τ are assumed to be fixed. The transmitter maps a mini-batch of uniformly random training messages mk, k ∈ {1, ..., BR}, to symbols satisfying the power constraint and transmits them over the channel. The receiver observes y1, ..., yBR and generates BR probability vectors fρ(y1), ..., fρ(yBR). The receiver then updates its parameters ρ according to ρ_{i+1} = ρ_i − αR ∇ρ ℓ̃R(ρ_i), where

    ℓ̃R(ρ) = −(1/BR) Σ_{k=1}^{BR} log([fρ(yk)]mk)    (2)

is the empirical cross-entropy loss associated with the mini-batch and αR is the learning rate. This procedure is repeated iteratively for a fixed number of iterations NR.
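For concreteness, one receiver update based on (2) can be sketched in a few lines of TensorFlow. This is a minimal illustration under the assumptions of Sec. III-A, not the authors' released implementation; the model object, optimizer, and batch representation are placeholders.

```python
import tensorflow as tf

def receiver_step(f_rho, optimizer, y_batch, m_batch):
    """One receiver update: minimize the empirical cross-entropy (2).

    f_rho   -- Keras model mapping channel observations to M probabilities
    y_batch -- channel observations as (B_R, 2) arrays of real/imaginary parts
    m_batch -- transmitted message indices, shape (B_R,)
    """
    with tf.GradientTape() as tape:
        q = f_rho(y_batch)                       # (B_R, M) posterior estimates
        # per-sample loss equals -log([f_rho(y_k)]_{m_k})
        per_sample = tf.keras.losses.sparse_categorical_crossentropy(m_batch, q)
        loss = tf.reduce_mean(per_sample)        # empirical loss (2)
    grads = tape.gradient(loss, f_rho.trainable_variables)
    optimizer.apply_gradients(zip(grads, f_rho.trainable_variables))
    return loss
```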

B. Transmitter Learning

For the transmitter optimization, the receiver parameters are assumed to be fixed. The transmitter generates a mini-batch of uniformly random training messages mk, k ∈ {1, ..., BT}, and performs the symbol mapping as before. However, before transmitting the symbols over the channel, a small Gaussian perturbation is applied, which yields x̃k = xk + wk, where wk ~ CN(0, σp²) and reasonable choices for σp² are discussed in Sec. V. Hence, we can interpret the transmitter as stochastic, described by the PDF

    πτ(x̃k|mk) = (1/(πσp²)) exp(−|x̃k − fτ(mk)|²/σp²).    (3)

Based on the received channel observations, the receiver then computes per-sample losses lk = −log([fρ(yk)]mk) ∈ R for k ∈ {1, ..., BT}, and feeds these back to the transmitter via the feedback link. The corresponding received losses are denoted by l̂k, where ideal feedback corresponds to l̂k = lk. Finally, the transmitter updates its parameters τ according to τ_{i+1} = τ_i − αT ∇τ ℓ̃T(τ_i), where

    ∇τ ℓ̃T(τ) = (1/BT) Σ_{k=1}^{BT} l̂k ∇τ log πτ(x̃k|mk).    (4)

This procedure is repeated iteratively for a fixed number of iterations NT, after which the alternating optimization continues again with the receiver learning. The total number of gradient steps in the entire optimization is given by N(NT + NR).

A theoretical justification for the gradient in (4) can be found in [11]–[13]. In particular, it can be shown that the gradient of ℓT(τ) = E{lk} is given by

    ∇τ ℓT(τ) = E{lk ∇τ log πτ(x̃k|mk)},    (5)

where the expectations are over the message, transmitter, and channel distributions. Note that (4) is the corresponding sample average for finite mini-batches assuming l̂k = lk.

Remark 1. As pointed out in previous work, the transmitter optimization can be regarded as a simple form of reinforcement learning. In particular, one may interpret the transmitter as an agent exploring its environment according to a stochastic exploration policy defined by (3) and receiving (negative) rewards in the form of per-sample losses. The state is the message mk and the transmitted symbol x̃k is the corresponding action. The learning setup belongs to the class of policy gradient methods, which rely on optimizing parameterized policies using gradient descent. We will make use of the following well-known property of policy gradient learning:³

    E{∇τ log πτ(x̃k|mk)} = 0.    (6)

³To see this, one may first apply ∇τ log πτ = ∇τ πτ / πτ and then use the fact that ∫ ∇τ πτ(x̃|m) dx̃ = 0 since ∫ πτ(x̃|m) dx̃ = 1.
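A hedged TensorFlow sketch of the transmitter update (4): the perturbed symbols are treated as constants (the "actions"), and the surrogate gradient is obtained by differentiating the loss-weighted log-probability of the Gaussian exploration policy (3). Function and variable names are illustrative placeholders, not the authors' code.

```python
import tensorflow as tf

def transmitter_step(f_tau, optimizer, m_batch, l_hat, x_tilde, sigma_p2):
    """One transmitter update using the surrogate gradient (4).

    m_batch  -- message indices of the mini-batch, shape (B_T,)
    l_hat    -- per-sample losses received over the feedback link, shape (B_T,)
    x_tilde  -- perturbed symbols that were transmitted, shape (B_T, 2) (real/imag)
    sigma_p2 -- exploration variance of the Gaussian policy (3)
    """
    x_tilde = tf.stop_gradient(tf.cast(x_tilde, tf.float32))  # actions are fixed here
    l_hat = tf.stop_gradient(tf.cast(l_hat, tf.float32))
    with tf.GradientTape() as tape:
        x = f_tau(m_batch)                                     # policy mean f_tau(m_k), (B_T, 2)
        sq_dist = tf.reduce_sum((x_tilde - x) ** 2, axis=-1)   # |x~_k - f_tau(m_k)|^2
        log_pi = -sq_dist / sigma_p2                           # log pi_tau up to a constant
        surrogate = tf.reduce_mean(l_hat * log_pi)             # its gradient equals (4)
    grads = tape.gradient(surrogate, f_tau.trainable_variables)
    optimizer.apply_gradients(zip(grads, f_tau.trainable_variables))
```

The additive constant −log(πσp²) of log πτ does not affect the gradient, so it is omitted; the optimizer's descent step on the surrogate then reproduces τ ← τ − αT ∇τ ℓ̃T(τ).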

C. Loss Transformation

The per-sample losses can be transformed through a pre-processing function f : R → R, which is known as reward shaping in the context of reinforcement learning [21]. Possible examples for f include:

• Clipping: setting f(lk) = min(β, lk) is used to deal with large loss variations and stabilize training [22].

• Baseline: setting f(lk) = lk − β is called a constant baseline [23] and is often used to reduce the variance of the Monte Carlo estimate of the stochastic gradient [21].

• Scaling: setting f(lk) = βlk only affects the magnitude of the gradient step, but this can be compensated with methods using adaptive step sizes (including the widely used Adam optimizer [24]). However, aggressive scaling can adversely affect the performance [25], [26].

To summarize, it has been shown that training with transformed losses, i.e., assuming l̂k = f(lk) in (4), is quite robust and can even be beneficial in some cases (e.g., by reducing gradient variance through baselines). Hence, one may conclude that the training success is to a large extent determined by the relative ordering of the losses (i.e., the distinction between good actions and bad actions). In this paper, reward shaping is exploited for pre-processing before quantizing the transformed losses to a finite number of bits.
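The three transformations above are one-liners; the plain Python/NumPy sketch below, with β as an illustrative parameter, only serves to make the reward-shaping notation concrete.

```python
import numpy as np

def clip(l, beta):      # f(l_k) = min(beta, l_k): limits large loss variations
    return np.minimum(beta, l)

def baseline(l, beta):  # f(l_k) = l_k - beta: constant baseline, reduces gradient variance
    return l - beta

def scale(l, beta):     # f(l_k) = beta * l_k: only rescales the gradient step
    return beta * l
```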

IV. LEARNING WITH QUANTIZED FEEDBACK

Previous work has mostly relied on ideal feedback, where l̂k = lk [11]–[14]. Robustness of learning with respect to additive noise according to l̂k = lk + nk, nk ~ N(0, σ²), was demonstrated in [15]. In this paper, we take a different view and assume that there only exists a binary feedback channel from the receiver to the transmitter. In this case, the losses must be quantized before transmission.

A. Conventional Quantization

Optimal Quantization: Given a distribution of the losses p(lk) and q bits that can be used for quantization, the mean squared quantization error is

    D = E{(lk − Q(lk))²}.    (7)

With q bits, there are 2^q possible quantization levels which can be optimized to minimize D, e.g., using the Lloyd-Max algorithm [27].

Adaptive Quantization: In our setting, the distribution of the per-sample losses varies over time as illustrated in Fig. 2. For non-stationary variables, adaptive quantization can be used. The source distribution can be estimated based on a finite number of previously seen values and then adapted based on the Lloyd-Max algorithm. If the source and sink adapt based on quantized values, no additional information needs to be exchanged. If adaptation is performed based on unquantized samples, the new quantization levels need to be conveyed from the source to the sink. In either case, a sufficient number of realizations are needed to accurately estimate the loss distribution and the speed of adaptation is fixed.

Fig. 2: Illustration of the non-stationary loss distribution p(lk) as a function of the number of training iterations (N = 1, 10, and 100) in the alternating optimization.

Fixed Quantization: We aim for a strategy that does not require overhead between transmitter and receiver. A simple non-adaptive strategy is to apply a fixed quantization. Under fixed quantization, we divide up the range [0, l̄] into 2^q equal-size regions of size ∆ = l̄/2^q so that

    Q(l) = ∆/2 + ⌊l/∆⌋∆.    (8)

Here, l̄ is the largest loss value of interest. The corresponding thresholds are located at m·l̄/2^q, where m ∈ {1, ..., 2^q − 1}. Hence, the function Q(l) and its inverse Q⁻¹(l) are fully determined by l̄ and the number of bits q.
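A minimal NumPy sketch of the fixed quantizer (8) and its inverse; the ascending (natural) bit labeling matches the mapping assumed later in Sec. IV-B. This is an illustrative implementation, not taken from the paper's released code.

```python
import numpy as np

def quantize(l, l_bar, q):
    """Fixed q-bit quantizer (8): map a loss in [0, l_bar] to an integer label."""
    delta = l_bar / 2 ** q
    idx = np.floor(np.clip(l, 0.0, l_bar - 1e-12) / delta).astype(int)
    return idx                      # label i corresponds to level delta/2 + i*delta

def dequantize(idx, l_bar, q):
    """Inverse mapping Q^{-1}: reconstruct the mid-point of the quantization region."""
    delta = l_bar / 2 ** q
    return delta / 2 + idx * delta

# Example: 1-bit quantization on [0, 1] reproduces the levels 1/4 and 3/4.
levels = dequantize(quantize(np.array([0.1, 0.9]), 1.0, 1), 1.0, 1)   # -> [0.25, 0.75]
```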

B. Proposed Quantization

Given the fact that losses can be transformed without much impact on the optimization, as described in Sec. III-C, we propose a novel strategy that employs adaptive pre-processing followed by a fixed quantization scheme. The proposed method operates on mini-batches of size BT. In particular, the receiver (source) applies the following steps:

1) Clipping: we clip the losses to lie within a range [lmin, lmax]. Here, lmin is the smallest loss in the current mini-batch, while lmax is chosen such that the 5% largest losses in the mini-batch are clipped. This effectively excludes very large per-sample losses which may be regarded as outliers. We denote this operation by fclip(·).

2) Baseline: we then shift the losses with a fixed baseline lmin. This ensures that all losses are within the range [0, lmax − lmin]. We denote this operation by fbl(·).

3) Scaling: we scale all the losses by 1/(lmax − lmin), so that they are within the range [0, 1]. We denote this operation by fsc(·).

4) Fixed quantization: finally, we use a fixed quantization with q bits and send Q(l̃k), where Q(·) is defined in (8) and l̃k = f(lk) = fsc(fbl(fclip(lk))), i.e., f ≜ fsc ◦ fbl ◦ fclip denotes the entire pre-processing. For simplicity, a natural mapping of quantized losses to bit vectors is assumed, where quantization levels are mapped in ascending order to (0, ..., 0, 0)⊺, (0, ..., 0, 1)⊺, ..., (1, ..., 1, 1)⊺. In general, one may also try to optimize the mapping of bit vectors to the quantization levels in order to improve the robustness of the feedback transmission.

The transmitter (sink) has no knowledge of the functions fclip(·), fbl(·), or fsc(·), and interprets the losses as being in the interval [0, 1]. It thus applies l̂k = Q⁻¹(l̃k) ∈ [0, 1] and uses the values l̂k in (4). We note that some aspects of this approach are reminiscent of the Pop-Art algorithm from [28], where shifting and scaling are used to address non-stationarity during learning. In particular, Pop-Art can be used for general supervised learning, where the goal is to fit the outcome of a parameterized function (e.g., a neural network) to given targets (e.g., labels) by minimizing a loss function. Pop-Art adaptively normalizes the targets in order to deal with large magnitude variations and also address non-stationary targets. However, Pop-Art and the proposed method are different algorithms that have been proposed in different contexts, e.g., Pop-Art does not deal with quantization issues during learning.

In terms of complexity overhead, the proposed method requires one sorting operation in order to identify and clip the largest losses in each mini-batch (step 1). The baseline and scaling (steps 2 and 3) can be implemented with one real addition followed by one real multiplication. Finally, the quantizer can be implemented by using a look-up table approach. At the transmitter side (sink), the method only requires the dequantization step, which again can be implemented using a look-up table.
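The following NumPy sketch illustrates the receiver-side pipeline (steps 1–4) and the transmitter-side reconstruction in one place. The 5% clipping fraction and the natural bit mapping follow the description above; everything else (names, structure, the example loss distribution) is illustrative only.

```python
import numpy as np

def preprocess_and_quantize(l, q, clip_frac=0.05):
    """Receiver side: clip, shift, and scale a mini-batch of losses, then quantize with q bits."""
    l_min = np.min(l)
    l_max = np.quantile(l, 1.0 - clip_frac)                 # clip the 5% largest losses
    l_tilde = (np.clip(l, l_min, l_max) - l_min) / (l_max - l_min)   # steps 1-3: f = f_sc o f_bl o f_clip
    delta = 1.0 / 2 ** q                                     # fixed quantizer (8) on [0, 1]
    return np.minimum(np.floor(l_tilde / delta), 2 ** q - 1).astype(int)  # integer labels

def reconstruct(labels, q):
    """Transmitter side: map labels back to levels in [0, 1] and use them as l_hat in (4)."""
    delta = 1.0 / 2 ** q
    return delta / 2 + labels * delta

# Example with a 1-bit feedback channel:
rng = np.random.default_rng(0)
losses = rng.exponential(scale=1.0, size=64)                 # stand-in for per-sample losses
l_hat = reconstruct(preprocess_and_quantize(losses, q=1), q=1)   # values in {0.25, 0.75}
```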
C. Impact of Feedback Quantization

The effect of quantization can be assessed via the Bussgang theorem [29], which is a generalization of the MMSE decomposition. If we assume lk ~ p(l) with mean µl and variance σl², then

    Q(lk) = g·lk + wk,    (9)

in which g ∈ R is the Bussgang gain and wk is a random variable, uncorrelated with lk, provided we set

    g = (E{lk Q(lk)} − µl E{Q(lk)}) / σl².    (10)

In general, the distribution of wk may be hard (or impossible) to derive in closed form. Note that the mean of wk is E{Q(lk)} − gµl and the variance is V{Q(lk)} − g²σl². When the number of quantization bits q increases, Q(lk) → lk and thus g → 1.

If we replace lk with Q(lk) in (5), denote the corresponding gradient function by ∇τ ℓT^q(τ), and substitute (9), then the following proposition holds.

Proposition 1. Let γk = lk ∇τ log πτ(x̃k|mk), lk ∈ [0, 1], with ∇τ ℓT(τ) = E{γk}, and γk^q = Q(lk) ∇τ log πτ(x̃k|mk). Then

    E{γk^q} = ∇τ ℓT^q(τ) = g ∇τ ℓT(τ),    (11)
    V{γk^q} ≤ g² V{γk} + (2gw̄ + w̄²) tr{J(τ)},    (12)

where J(τ) = E{∇τ log πτ(x̃k|mk) ∇τ⊺ log πτ(x̃k|mk)} ⪰ 0 is the Fisher information matrix of the transmitter parameters τ and w̄ = max_l |gl − Q(l)| = |1 − 1/2^{q−1} − g| is a measure of the maximum quantization error.

Proof: See Appendix.

Hence, the impact of quantization, under a sufficiently large mini-batch size, is a scaling of the expected gradient. Note that this scaling will differ for each mini-batch. The variance is affected in two ways: a scaling with g² and an additive term that depends on the maximum quantization error and the Fisher information at τ. When q increases, g → 1 and w̄ → 0, so that V{γk^q} → V{γk}, as expected.

In general, the value of g is hard to compute in closed form, but for 1-bit quantization and a Gaussian loss distribution, (10) admits a closed-form solution.⁴ In particular,

    g = 1/√(8πσl²)                       if µl = 1/2,
    g = exp(−1/(8σl²)) / √(8πσl²)        if µl ∈ {0, 1}.    (13)

⁴For Gaussian losses, w̄ in Proposition 1 is not defined. The proposition can be modified to deal with unbounded losses.

In light of the distributions from Fig. 2, we observe that (after loss transformation) for most iterations, µl ≈ 1/2 and σl² will be moderate (around 1/(8π)), leading to g ≈ 1. Only after many iterations µl < 1/2 and σl² will be small, leading to g ≪ 1. Hence, for sufficiently large batch sizes, 1-bit quantization should not significantly affect the learning convergence rate.
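As a sanity check on (13), the sketch below estimates the Bussgang gain (10) by Monte Carlo for 1-bit quantization of a Gaussian loss with mean 1/2 and compares it to the closed form 1/√(8πσl²). The script is illustrative and assumes the 1-bit quantization levels 1/4 and 3/4 used throughout Sec. IV.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_l, sigma_l = 0.5, 0.2
l = rng.normal(mu_l, sigma_l, size=1_000_000)          # Gaussian per-sample losses

Q = np.where(l < 0.5, 0.25, 0.75)                      # 1-bit quantizer on [0, 1]

# Bussgang gain (10): g = (E{l Q(l)} - mu_l E{Q(l)}) / sigma_l^2
g_mc = (np.mean(l * Q) - mu_l * np.mean(Q)) / sigma_l ** 2
g_cf = 1.0 / np.sqrt(8 * np.pi * sigma_l ** 2)         # closed form (13) for mu_l = 1/2

print(g_mc, g_cf)   # the two values agree up to Monte Carlo error (about 0.997 each)
```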
D. Impact of Noisy Feedback Channels

For the proposed pre-processing and quantization scheme, distortions are introduced through the function f(·) (in particular the clipping) and the quantizer Q(·). Moreover, additional impairments may be introduced when the quantized losses are transmitted over a noisy feedback channel. We will consider the case where the feedback channel is a binary symmetric channel with flip probability p ∈ [0, 1/2). Our numerical results (see Sec. V-B4) indicate that the learning process is robust against such distortions, even for very high flip probabilities. In order to explain this behavior, it is instructive to first consider the case where the transmitted per-sample losses are entirely random and completely unrelated to the training data. In that case, one finds that

    E{l̂k ∇τ log πτ(x̃k|mk)} = E{l̂k} E{∇τ log πτ(x̃k|mk)} = 0

regardless of the loss distribution or quantization scheme. The interpretation is that for large mini-batch sizes, random losses simply "average out" and the applied gradient in (4) is close to zero. We can exploit this behavior and make the following statement.

Proposition 2. Let γ̃k = l̂k ∇τ log πτ(x̃k|mk) where the binary version of Q(lk) has been subjected to a binary symmetric channel with flip probability p to yield l̂k. Then, for 1-bit and 2-bit quantization with a natural mapping of bit vectors to quantized losses, we have

    E{γ̃k} = ∇τ ℓ̃T^q(τ) = (1 − 2p) ∇τ ℓT^q(τ).

Moreover, for 1-bit quantization,

    V{γ̃k} ≤ V{γk^q} + 4p(1 − p) ‖∇τ ℓT^q(τ)‖² + p·tr{J(τ)}.

Proof: See Appendix.

Hence, for a sufficiently large mini-batch size, the gradient is simply scaled by a factor 1 − 2p. This means that even under very noisy feedback, learning should be possible.

Remark 2. Note that when using small mini-batches, the empirical gradients computed via (4) will deviate from the expected value (1 − 2p)∇τ ℓT^q(τ): they will not be scaled exactly by 1 − 2p and they will be perturbed by the average value of p∇τ log πτ(x̃k|mk). Hence, robustness against large p can only be offered for large mini-batch sizes.
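The effect used in Proposition 2 is easy to reproduce numerically: the sketch below pushes 1-bit quantized losses through a binary symmetric channel and checks that the average received loss approaches (1 − 2p)Q(lk) + p, as derived in the Appendix. Names and the bit mapping are illustrative.

```python
import numpy as np

def binary_symmetric_channel(bits, p, rng):
    """Flip each feedback bit independently with probability p."""
    return bits ^ (rng.random(bits.shape) < p)

rng = np.random.default_rng(0)
p = 0.3
bits = np.ones(1_000_000, dtype=bool)                  # 1-bit labels for Q(l_k) = 3/4
received = binary_symmetric_channel(bits, p, rng)
l_hat = np.where(received, 0.75, 0.25)                 # natural mapping back to {1/4, 3/4}

print(l_hat.mean(), (1 - 2 * p) * 0.75 + p)            # both values are close to 0.6
```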

V. NUMERICAL RESULTS

In this section, we provide extensive numerical results to verify and illustrate the effectiveness of the proposed loss quantization scheme. In the following, the binary feedback channel is always assumed to be noiseless except for the results presented in Sec. V-B4.⁵

⁵TensorFlow source code is available at https://github.com/henkwymeersch/quantizedfeedback.

A. Setup and Parameters

1) Channel Models: We consider two memoryless channel models p(y|x): the standard AWGN channel y = x + n, where n ~ CN(0, σ²), and a simplified memoryless fiber-optic channel which is defined by the recursion

    x_{i+1} = x_i exp(jLγ|x_i|²/K) + n_{i+1},  0 ≤ i < K,    (14)

where x_0 = x is the channel input, y = x_K is the channel output, n_{i+1} ~ CN(0, σ²/K), L is the total link length, σ² is the noise power, and γ ≥ 0 is a nonlinearity parameter. Note that this channel reverts to the AWGN channel when γ = 0. For our numerical analysis, we set L = 5000 km, γ = 1.27 rad/W/km, K = 50, and σ² = −21.3 dBm, which are the same parameters as in [6], [12], [30]. For both channels, we define SNR ≜ P/σ². Since the noise power is assumed to be fixed, the SNR is varied by varying the signal power P.

The model in (14) assumes ideal distributed amplification across the optical link and is obtained from the nonlinear Schrödinger equation by neglecting dispersive effects, see, e.g., [31] for more details about the derivation. Because dispersive effects are ignored, the model does not necessarily reflect the actual channel conditions in realistic fiber-optic transmission. The main interest in this model stems from its simplicity and analytical tractability while still capturing some realistic nonlinear effects, in particular the nonlinear phase noise. The model has been studied intensively in the literature, including detection schemes [32]–[34], signal constellations [33], [35], capacity bounds [30], [31], [36], [37], and most recently also in the context of machine learning [6], [12]. In the following, we refer to the model as the nonlinear phase-noise channel to highlight the fact that it should not be seen as an accurate model for fiber-optic transmission.
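A hedged NumPy sketch of the recursion (14) with the parameter values listed above; symbol powers are interpreted in watts so that the dBm quantities convert directly, and the function name and example constellation are illustrative.

```python
import numpy as np

def nonlinear_phase_noise_channel(x, K=50, L=5000.0, gamma=1.27,
                                  sigma2_dbm=-21.3, rng=None):
    """Simulate the recursion (14); x holds complex symbols whose power |x|^2 is in watts."""
    rng = np.random.default_rng(0) if rng is None else rng
    sigma2 = 10 ** (sigma2_dbm / 10) * 1e-3            # noise power: -21.3 dBm in watts
    for _ in range(K):
        n = np.sqrt(sigma2 / (2 * K)) * (rng.standard_normal(x.shape)
                                         + 1j * rng.standard_normal(x.shape))
        x = x * np.exp(1j * L * gamma * np.abs(x) ** 2 / K) + n
    return x

# Example: symbols at P = -3 dBm experience a noticeable nonlinear phase rotation.
P = 10 ** (-3 / 10) * 1e-3                              # input power in watts
x = np.sqrt(P) * np.exp(1j * np.pi / 4 * np.array([1, 3, 5, 7]))
y = nonlinear_phase_noise_channel(x)
```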

2) Transmitter and Receiver Networks: Following previous work, the functions fτ and fρ are implemented as multi-layer neural networks. A message m is first mapped to a M-dimensional "one-hot" vector where the m-th element is 1 and all other elements are 0. Each neuron takes inputs from the previous layer and generates an output according to a learned linear mapping followed by a fixed nonlinear activation function. The final two outputs of the transmitter network are normalized to ensure (1/B) Σ_{k=1}^{B} |xk|² = P, B ∈ {BT, BR}, and then used as the channel input. The real and imaginary parts of the channel observation serve as the input to the receiver network. All network parameters are summarized in Table I, where M = 16.

TABLE I: Neural network parameters, where M = 16

                        transmitter fτ             receiver fρ
  layer                 1      2-3     4           1      2-3     4
  number of neurons     M      30      2           2      50      M
  activation function   -      ReLU    linear      -      ReLU    softmax
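A hedged Keras sketch of networks matching Table I (layer 1 is the input layer); the power normalization and the training loop are omitted here, and the exact layer and initializer choices of the released code may differ.

```python
import tensorflow as tf

M = 16  # number of messages

# Transmitter f_tau: one-hot message -> two real outputs (real and imaginary part of x).
transmitter = tf.keras.Sequential([
    tf.keras.layers.Dense(30, activation="relu", input_shape=(M,)),  # layers 2-3 (layer 1: M-dim one-hot input)
    tf.keras.layers.Dense(30, activation="relu"),
    tf.keras.layers.Dense(2, activation=None),                       # layer 4: 2 linear outputs
])

# Receiver f_rho: real/imaginary parts of y -> M "posterior" probabilities.
receiver = tf.keras.Sequential([
    tf.keras.layers.Dense(50, activation="relu", input_shape=(2,)),  # layers 2-3 (layer 1: [Re{y}, Im{y}])
    tf.keras.layers.Dense(50, activation="relu"),
    tf.keras.layers.Dense(M, activation="softmax"),                  # layer 4: softmax over messages
])
```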

3) Training Procedure: For the alternating optimization, we first fix the transmitter and train the receiver for NR = 30 iterations with a mini-batch size of BR = 64. Then, the receiver is fixed and the transmitter is trained for NT = 20 iterations with BT = 64. This procedure is repeated N = 4000 times for the AWGN channel. For the nonlinear phase-noise channel, we found that more iterations are typically required to converge, especially at high input powers, and we consequently set N = 6000. The Adam optimizer is used to perform the gradient updates, where αT = 0.001 and αR = 0.008. The reason behind the unequal number of training iterations for the transmitter and receiver is that the receiver network is slightly bigger than the transmitter network and thus requires more training iterations to converge.

4) Transmitter Exploration Variance: We found that the parameter σp² has to be carefully chosen to ensure successful training. In particular, choosing σp² too small will result in insufficient exploration and slow down the training process. On the other hand, if σp² is chosen too large, the resulting noise may in fact be larger than the actual channel noise, resulting in many falsely detected messages and unstable training. In our simulations, we use σp² = P · 10⁻³.

Fig. 3: Symbol error rate achieved for M = 16 on the AWGN channel and the nonlinear phase-noise channel, versus input power P (dBm) and SNR (dB), comparing 16-QAM with an ML detector, learning without quantization, and learning with 1-bit quantization. The training SNR is 15 dB for the AWGN channel, whereas training is done separately for each input power (i.e., SNR) for the nonlinear phase-noise channel.

Fig. 4: Learned decision regions for the nonlinear phase-noise channel, M = 16, and P = −3 dBm (a) without quantizing per-sample losses and (b) using the proposed quantization scheme and 1-bit quantization.

B. Results and Discussion

1) Perfect vs Quantized Feedback: We start by evaluating the impact of quantized feedback on the system performance, measured in terms of the symbol error rate (SER). For the AWGN channel, the transmitter and receiver are trained for a fixed SNR = 15 dB (i.e., P = −6.3 dBm, since σ² = −21.3 dBm so that SNR = P/σ² corresponds to −6.3 dBm − (−21.3 dBm) = 15 dB) and then evaluated over a range of SNRs by changing the signal power (similar to, e.g., [12]). For the nonlinear phase-noise channel, this approach cannot be used because optimal signal constellations and receivers are highly dependent on the transmit power.⁶ Therefore, a separate transmitter–receiver pair is trained for each input power P. Fig. 3 shows the achieved SER assuming both perfect feedback without quantization and a 1-bit feedback signal based on the proposed method. For both channels, the resulting communication systems with 1-bit feedback quantization have very similar performance to the scenario where perfect feedback is used for training, indicating that the feedback quantization does not significantly affect the learning process. As a reference, the performance of standard 16-QAM with a maximum-likelihood (ML) detector is also shown. The ML detector makes a decision according to

    x̂ML = argmax_{m ∈ {1,...,M}} p(y|sm),    (15)

where s1, ..., sM are all constellation points. For the nonlinear phase-noise channel, the channel likelihood p(y|x) can be derived in closed form, see [32, p. 225]. For the AWGN channel, (15) is equivalent to a standard minimum Euclidean-distance detector. The learning approach outperforms this baseline for both channels, which is explained by the fact that the transmitter neural network learns better modulation formats (i.e., signal constellations) compared to 16-QAM.

⁶In principle, the optimal signal constellation may also depend on the SNR for the AWGN channel.

Fig. 4 visualizes the learned decision regions for the quantized (right) and unquantized (left) feedback schemes assuming the nonlinear phase-noise channel with P = −3 dBm. Only slight differences are observed which can be largely attributed to the randomness of the training process.
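For the AWGN case, the ML rule (15) reduces to nearest-neighbor detection, which the short NumPy sketch below illustrates; the 16-QAM constellation and normalization are illustrative assumptions, not the trained constellations of the paper.

```python
import numpy as np

def ml_detect_awgn(y, constellation):
    """ML detection (15) for the AWGN channel: pick the nearest constellation point."""
    d = np.abs(y[:, None] - constellation[None, :])   # |y - s_m| for every candidate m
    return np.argmin(d, axis=1)                        # index of the most likely message

# Example: 16-QAM normalized to unit average power.
levels = np.array([-3, -1, 1, 3]) / np.sqrt(10)
s = (levels[:, None] + 1j * levels[None, :]).ravel()
m_hat = ml_detect_awgn(np.array([0.3 + 0.9j, -0.95 - 0.3j]), s)
```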
2) Impact of Number of Quantization Bits: Next, the nonlinear phase-noise channel for a fixed input power P = −3 dBm is considered to numerically evaluate the impact of the number of quantization bits on the performance. Fig. 5 shows the achieved SER when different schemes are used for quantizing the per-sample losses. For a fixed quantization scheme without pre-processing (see Sec. IV-A), the performance of the trained system is highly sensitive to the number of quantization bits and the assumed quantization range [0, l̄]. For l̄ = 10 with 1 quantization bit, the system performance deteriorates noticeably and the training outcome becomes unstable, as indicated by the error bars (which are averaged over 10 different training runs). For the proposed quantization scheme, the performance of the trained system is (i) essentially independent of the number of bits used for quantization and (ii) virtually indistinguishable from a system trained with unquantized feedback.

Fig. 5: Impact of the number of quantization bits (1 to 5) on the achieved symbol error rate for the nonlinear phase-noise channel with M = 16, P = −3 dBm, comparing unquantized feedback, the proposed scheme (pre-processing, fixed quantization), and fixed quantization without pre-processing (l̄ = 4 and l̄ = 10). Results are averaged over 10 different training runs where error bars indicate the standard deviation between the runs.

3) Impact on Convergence Rate: In Fig. 6, we show the evolution of the empirical cross-entropy loss ℓ̃T(τ) during the alternating optimization for the nonlinear phase-noise channel with P = −3 dBm. It can be seen that quantization manifests itself primarily in terms of a slightly decreased convergence rate during training. For the scenario where per-sample losses are quantized with 5 bits, the empirical losses ℓ̃T(τ) converged after about 160 iterations, which is the same as in the case of unquantized feedback. For 1-bit quantization, the training converges slightly slower, after around 200 iterations, which is a minor degradation compared to the entire training time. However, the slower convergence rate implies that it is harder to deal with changes in the channel. Hence, with 1-bit quantization, the coherence time should be longer compared to with unquantized feedback.

Fig. 6: Evolution of ℓ̃T(τ) during the alternating optimization for the nonlinear phase-noise channel with M = 16, P = −3 dBm, comparing pre-processing without quantization, 1-bit quantization, and 5-bit quantization. Results are averaged over 15 different training runs where the shaded area indicates one standard deviation between the runs.

Fig. 7: Performance on the nonlinear phase-noise channel with M = 16, P = −3 dBm when transmitting quantized losses over a noisy feedback channel modeled as a binary symmetric channel with flip probability p (curves: 1-bit and 2-bit quantization with BT = 64, and 1-bit quantization with BT = 640). Results are averaged over 10 runs where the error bars indicate one standard deviation between runs.

4) Impact of Noisy Feedback: In order to numerically evaluate the effect of noise during the feedback transmission, we consider again the nonlinear phase-noise channel for a fixed input power P = −3 dBm. Fig. 7 shows the achieved SER when transmitting the quantized per-sample losses over a binary symmetric channel with flip probability p (see Sec. IV-D). It can be seen that the proposed quantization scheme is highly robust to the channel noise. For the assumed mini-batch size BT = 64, performance starts to decrease only for very high flip probabilities and remains essentially unchanged for p < 0.1 with 1-bit quantization and for p < 0.2 with 2-bit quantization. A theoretical justification for this behavior is provided in Proposition 2, which states that the channel noise manifests itself only as a scaling of the expected gradient. Thus, one may also expect that the learning process can withstand even higher flip probabilities by simply increasing the mini-batch size.
Indeed, Fig. 7 shows that when increasing the mini-batch size from BT = 64 to BT = 640, the noise tolerance for 1-bit quantization increases significantly and performance remains unchanged for flip probabilities as high as p = 0.3.

Note that for p = 0.5, the achieved SER is slightly better than (M − 1)/M ≈ 0.938 corresponding to random guessing.

This is because the receiver learning is still active, even though the transmitter only performs random explorations.

VI. CONCLUSIONS

We have proposed a novel method for data-driven learning of physical-layer communication in the presence of a binary feedback channel. Our method relies on an adaptive clipping, shifting, and scaling of losses followed by a fixed quantization at the receiver, and a fixed reconstruction method at the transmitter. We have shown that the proposed method (i) can lead to good performance even under 1-bit feedback; (ii) does not significantly affect the convergence speed of learning; and (iii) is highly robust to noise in the feedback channel. The proposed method can be applied beyond physical-layer communication, to reinforcement learning problems in general, and distributed multi-agent learning in particular.

APPENDIX

Proof of Proposition 1

The mean of γk^q can be computed as

    E{γk^q} = ∇τ ℓT^q(τ)
            = E{Q(lk) ∇τ log πτ(x̃k|mk)}
            = g E{lk ∇τ log πτ(x̃k|mk)} + E{wk ∇τ log πτ(x̃k|mk)}
            = g E{lk ∇τ log πτ(x̃k|mk)} + E{wk} E{∇τ log πτ(x̃k|mk)}
            = g E{lk ∇τ log πτ(x̃k|mk)} = g ∇τ ℓT(τ).

We have made use of the fact that wk is uncorrelated with lk and that (6) holds. The variance can similarly be bounded as follows:

    V{γk^q} = E{(Q(lk))² ‖∇τ log πτ(x̃k|mk)‖²} − g² ‖∇τ ℓT(τ)‖²
            = g² E{lk² ‖∇ log πτ(x̃k|mk)‖²} − g² ‖∇τ ℓT(τ)‖²
              + E{wk² ‖∇ log πτ(x̃k|mk)‖²} + 2 E{g lk wk ‖∇ log πτ(x̃k|mk)‖²}
            ≤ g² V{γk} + w̄² tr{J(τ)} − 2g E{wk lk ‖∇ log πτ(x̃k|mk)‖²}
            ≤ g² V{γk} + w̄² tr{J(τ)} + 2gw̄ tr{J(τ)}.

We have made use of −wk lk = lk(g lk − Q(lk)) ≤ max_{lk} |g lk − Q(lk)| = w̄, that lk ≤ 1, and that tr{J(τ)} = E{‖∇ log πτ(x̃k|mk)‖²}.

Proof of Proposition 2

For the proposed adaptive pre-processing and fixed 1-bit quantization, the quantized losses lk are either ∆/2 = 1/4 or 1 − ∆/2 = 3/4. Assuming transmission over the binary symmetric channel, the gradient in (5) can be written as

    ∇τ ℓ̃T(τ) = E{Q(lk)^{1−nk} (1 − Q(lk))^{nk} ∇τ log πτ(x̃k|mk)},

where nk are independent and identically distributed Bernoulli random variables with parameter p. Since nk is independent of all other random variables, we can compute

    E[Q(lk)^{1−nk} (1 − Q(lk))^{nk} | Q(lk)] = (1 − 2p)Q(lk) + p.

Hence,

    E{γ̃k} = ∇τ ℓ̃T(τ)
           = E{((1 − 2p)Q(lk) + p) ∇τ log πτ(x̃k|mk)}
           = (1 − 2p) E{Q(lk) ∇τ log πτ(x̃k|mk)} + p E{∇τ log πτ(x̃k|mk)}
           = (1 − 2p) ∇τ ℓT^q(τ),

where the last step follows from (6). For 2-bit quantization, the possible values are ∆/2 = 1/8 (corresponding to bits 00), 3∆/2 = 3/8 (corresponding to 01), 1 − 3∆/2 = 5/8 (corresponding to 10), and 1 − ∆/2 = 7/8 (corresponding to 11). It then follows that when the transmitted loss is Q(lk), the received loss is

    Q(lk)        with prob. (1 − p)²,
    1 − Q(lk)    with prob. p²,
    other        with prob. p(1 − p),

so that the expected received loss is (1 − 2p)Q(lk) + p. The variance under 1-bit quantization can be computed as

    V{γ̃k} = E{(γ̃k)²} − (1 − 2p)² ‖∇τ ℓT^q(τ)‖²
           = E{(Q(lk))^{2(1−nk)} (1 − Q(lk))^{2nk} ‖∇τ log πτ(x̃k|mk)‖²} − (1 − 2p)² ‖∇τ ℓT^q(τ)‖²
           = E{Q²(lk) ‖∇τ log πτ(x̃k|mk)‖²} + p E{‖∇ log πτ(x̃k|mk)‖²}
             − 2p E{Q(lk) ‖∇ log πτ(x̃k|mk)‖²} − (1 − 2p)² ‖∇τ ℓT^q(τ)‖²
           = V{γk^q} + 4p(1 − p) ‖∇τ ℓT^q(τ)‖² + p tr{J(τ)} − 2p E{Q(lk) ‖∇τ log πτ(x̃k|mk)‖²}
           ≤ V{γk^q} + 4p(1 − p) ‖∇τ ℓT^q(τ)‖² + p tr{J(τ)},

where the last step holds since Q(lk) ≥ 0.

REFERENCES

[1] N. Samuel, T. Diskin, and A. Wiesel, "Deep MIMO Detection," in Proc. IEEE Int. Workshop on Signal Processing Advances in Wireless Communications (SPAWC), 2017.
[2] E. Nachmani, E. Marciano, L. Lugosch, W. J. Gross, D. Burshtein, and Y. Be'ery, "Deep Learning Methods for Improved Decoding of Linear Codes," IEEE J. Sel. Topics Signal Proc., vol. 12, no. 1, pp. 119–131, Feb. 2018.
[3] T. O'Shea and J. Hoydis, "An Introduction to Deep Learning for the Physical Layer," IEEE Trans. on Cognitive Communications and Networking, vol. 3, no. 4, pp. 563–575, Dec. 2017.
[4] S. Dörner, S. Cammerer, J. Hoydis, and S. ten Brink, "Deep Learning-Based Communication Over the Air," IEEE J. Sel. Topics Signal Proc., vol. 12, no. 1, pp. 132–143, Feb. 2018.
[5] B. Karanov, M. Chagnon, F. Thouin, T. A. Eriksson, H. Bülow, D. Lavery, P. Bayvel, and L. Schmalen, "End-to-end deep learning of optical fiber communications," J. Lightw. Technol., vol. 36, no. 20, pp. 4843–4855, 2018.
[6] S. Li, C. Häger, N. Garcia, and H. Wymeersch, "Achievable information rates for nonlinear fiber communication via end-to-end learning," in Proc. European Conf. Optical Communication (ECOC), Rome, Italy, 2018.
[7] R. T. Jones, T. A. Eriksson, M. P. Yankov, and D. Zibar, "Deep Learning of Geometric Constellation Shaping including Fiber Nonlinearities," in Proc. European Conf. Optical Communication (ECOC), Rome, Italy, 2018.
[8] H. Lee, I. Lee, and S. H. Lee, "Deep learning based transceiver design for multi-colored VLC systems," Opt. Express, vol. 26, no. 5, pp. 6222–6238, Mar. 2018.
[9] T. J. O'Shea, T. Roy, and N. West, "Approximating the Void: Learning Stochastic Channel Models from Observation with Variational Generative Adversarial Networks," arXiv:1805.06350, 2018.

[10] H. Ye, G. Y. Li, B.-H. F. Juang, and K. Sivanesan, "Channel Agnostic End-to-End Learning based Communication Systems with Conditional GAN," arXiv:1807.00447, 2018.
[11] F. A. Aoudia and J. Hoydis, "End-to-End Learning of Communications Systems Without a Channel Model," arXiv:1804.02276, 2018.
[12] ——, "Model-free Training of End-to-end Communication Systems," arXiv:1812.05929, 2018.
[13] C. de Vrieze, S. Barratt, D. Tsai, and A. Sahai, "Cooperative Multi-Agent Reinforcement Learning for Low-Level Wireless Communication," arXiv:1801.04541, 2018.
[14] V. Raj and S. Kalyani, "Backpropagating Through the Air: Deep Learning at Physical Layer Without Channel Models," IEEE Commun. Lett., vol. 22, no. 11, pp. 2278–2281, Nov. 2018.
[15] M. Goutay, F. A. Aoudia, and J. Hoydis, "Deep Reinforcement Learning Autoencoder with Noisy Feedback," arXiv:1810.05419, 2018.
[16] M. Kim, W. Lee, J. Yoon, and O. Jo, "Building Encoder and Decoder with Deep Neural Networks: On the Way to Reality," arXiv:1808.02401, 2018.
[17] Z.-L. Tang, S.-M. Li, and L.-J. Yu, "Implementation of Deep Learning-based Automatic Modulation Classifier on FPGA SDR Platform," Electronics, vol. 7, no. 7, p. 122, 2018.
[18] C.-F. Teng, C.-H. Wu, K.-S. Ho, and A.-Y. Wu, "Low-complexity Recurrent Neural Network-based Polar Decoder with Weight Quantization Mechanism," arXiv:1810.12154, 2018.
[19] C. Fougstedt, C. Häger, L. Svensson, H. D. Pfister, and P. Larsson-Edefors, "ASIC Implementation of Time-Domain Digital Backpropagation with Deep-Learned Chromatic Dispersion Filters," in Proc. European Conf. Optical Communication (ECOC), Rome, Italy, 2018.
[20] F. A. Aoudia and J. Hoydis, "Towards Hardware Implementation of Neural Network-based Communication Algorithms," arXiv:1902.06939, 2019.
[21] A. Y. Ng, D. Harada, and S. J. Russell, "Policy invariance under reward transformations: Theory and application to reward shaping," in Proc. 16th Int. Conf. on Machine Learning. Morgan Kaufmann Publishers Inc., 1999, pp. 278–287.
[22] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.
[23] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[24] D. P. Kingma and J. Ba, "Adam: A Method for Stochastic Optimization," in Proc. ICLR, 2015.
[25] S. Gu, T. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine, "Q-Prop: Sample-efficient policy gradient with an off-policy critic," arXiv:1611.02247, 2016.
[26] R. Islam, P. Henderson, M. Gomrokchi, and D. Precup, "Reproducibility of benchmarked deep reinforcement learning tasks for continuous control," arXiv:1708.04133, 2017.
[27] S. Lloyd, "Least squares quantization in PCM," IEEE Trans. Inf. Theory, vol. 28, no. 2, pp. 129–137, 1982.
[28] H. P. van Hasselt, A. Guez, M. Hessel, V. Mnih, and D. Silver, "Learning values across many orders of magnitude," in Advances in Neural Information Processing Systems, 2016, pp. 4287–4295.
[29] H. Rowe, "Memoryless nonlinearities with Gaussian inputs: Elementary results," The Bell System Technical Journal, vol. 61, no. 7, pp. 1519–1525, 1982.
[30] K. Keykhosravi, G. Durisi, and E. Agrell, "Accuracy Assessment of Nondispersive Optical Perturbative Models through Capacity Analysis," Entropy, vol. 21, no. 8, pp. 1–19, Aug. 2019.
[31] M. I. Yousefi and F. R. Kschischang, "On the per-sample capacity of nondispersive optical fibers," IEEE Trans. Inf. Theory, vol. 57, no. 11, pp. 7522–7541, Nov. 2011.
[32] K.-P. Ho, Phase-Modulated Optical Communication Systems. Springer, 2005.
[33] A. P. Lau and J. M. Kahn, "16-QAM Signal Design and Detection in Presence of Nonlinear Phase Noise." IEEE, Jul. 2007, pp. 53–54.
[34] A. S. Tan, H. Wymeersch, P. Johannisson, E. Agrell, P. Andrekson, and M. Karlsson, "An ML-based detector for optical communication in the presence of nonlinear phase noise," in Proc. IEEE Int. Conf. Communications (ICC), 2011, pp. 1–5.
[35] C. Häger, A. Graell i Amat, A. Alvarado, and E. Agrell, "Design of APSK Constellations for Coherent Optical Channels with Nonlinear Phase Noise," IEEE Trans. Commun., vol. 61, no. 8, pp. 3362–3373, Aug. 2013.
[36] K. S. Turitsyn, S. A. Derevyanko, I. V. Yurkevich, and S. K. Turitsyn, "Information Capacity of Optical Fiber Channels with Zero Average Dispersion," Phys. Rev. Lett., vol. 91, no. 20, p. 203901, Nov. 2003.
[37] K. Keykhosravi, G. Durisi, and E. Agrell, "A tighter upper bound on the capacity of the nondispersive optical fiber channel," in Proc. European Conf. Optical Communication (ECOC), 2017, pp. 1–3.