Distributed Stochastic Gradient Descent Over-The-Air,” in Proc

Home , Gradient descent

arXiv:1901.00844v3 [cs.DC] 7 Apr 2020 iern,Ipra olg odn odnS72Z ..(e U.K. 2AZ, SW7 London London, College [email protected]). Imperial gineering, niern,Ipra olg odn ei o ihteDep the with now Princeton [email protected]). University, mail: is Princeton He Engineering, London. Electrical College Imperial assum Engineering, inherently this While network dataset. massive neural a a on often trained algorithm, algorith is learning centralized powerful dat on a sensor focuses where future of trend types its current various algorithms the about for however, (ML) tailored predictions developed learning or being machine are system, specialized a continuously Many inferences of states. be make state to must the au- processed about sensors (IoT), and things from communicated, pri of collected, data are Internet technologies where reality data. examples, extended this or driving, of tonomous sense p can make that intelligence and collaborative and collection, data pesu omncto,mkn tvr trciefrlow- for edge. attractive wireless very the at it applications making ML o latency communication, lack furt The up A-DSGD devices. in speeds encoding/decoding edge abi channel A- of their and showing power and quantization constant), computation D-DSGD size the (whi both dataset harnessing devices of total that number the the observe keeping increasing also by schem better We perform digital DSGD devic literature. other across the distribution bandwidth outperforms data in tha significantly low in problem bias D-DSGD and to classification while robust a power improvement more for is The low illustrate DSGD channel. also at We the compelling regimes. over alignm particularly natural estimates the is gradient and more bandwidth the its limited to of the thanks of D-DSGD available use than efficient the faster s converges results by A-DSGD ove Numerical code. that directly imposed digital sent any space employing are without MAC projections dimensional These lower bandwidth. channel a the to and estimates, A-DS gradient them their In sparsify approach. first this devices for the analysis convergence provide and aueo h ieesMCfor MAC wireless the of introduc nature then We (MAC). ov novel channel PS access the to multiple estimates underlying gradient their erro transmit and and quantization mulation, gradient employ lin devices orthogonal wireless the over PS the this to Following remote separat transmitted estim and gradient a assume compressed local are of where approaches communication, help Standard bandwidth- and the computation (PS). and with dist server power (DSGD) out parameter where carry descent datasets gradient edge, local stochastic wireless with devices the wireless limited at (ML) ing hog trigGatBAO areetN.677854). No. (agreement BEACON Grant Starting through ahn eriga h ieesEg:Distributed Edge: Wireless the at Learning Machine hswr a enspotdb h uoenRsac Counci Research European the by supported been has work This .Gnü swt h eateto lcrcladElectroni and Electrical of Department the with is Gündüz D. .MhmaiAiiwswt h eateto lcrcland Electrical of Department the with was Amiri Mohammadi M. ayeegn ehooisivlemsieaonsof amounts massive involve technologies emerging Many Abstract analog oamdMhmaiAmiri, Mohammadi Mohammad W td olbrtv/fdrtdmcielearn- machine federated collaborative/ study —We cee aldADG,wihepot h additive the exploits which A-DSGD, called scheme, tcatcGain ecn Over-the-Air Descent Gradient Stochastic digital prah eitoueDDG,i which in D-DSGD, introduce we approach, .I I. NTRODUCTION over-the-air rdetcomputation, gradient J054 S (e- USA 08544, NJ , tdn ebr IEEE Member, Student rmn of artment project n Electronic ributed accu- r (ERC) l rocess iyin lity rthe er En- c the r -mail: ,A- t, how GD, ates ms, her me ent a e ks. es, es es a; le e f , G) nwihtemdlprmtr tthe at des parameters gradient model iterative the through which of out in minimization (GD), carried The typically model. is learning (1) the by defined function where where B r pae codn to according updated are hc satisfies which where eadn.Ised nsohsi D(G)teparameter gradient stochastic the a (SGD) with GD updated is stochastic vector in prohibitiv Instead, becomes GD demanding. of iteration each datasets massive of rtv Ltcnqe,as nw as known also more develop techniques, to much ML orative is a alternative viable in Therefore, practically constraints. costly and desirable users’ too delay, infringe much a be too may introduce may to and may bandwidth, manner o data and reliable energy case of collected a the terms the in in processor, processor transmitting central central devices, a edge at wireless data of availability the rcs aasmlsi aallwiemitiigaglobal devices a vector maintaining (DSGD), while parameter SGD parallel consistent ev in distributed or samples In tens data process across devices. distributed of is hundreds dataset the when lelization etrwt epc oislcldataset parameter local global its the to on respect based with vector vector gradient a computes rmaltedvcs tudtstegoa aaee vector parameter global the updates gradients it computed to devices, the according receives the PS all the Once from PS. the to result clls function loss ical collaboratively model. (PS) learning server the parameter a wi at central datasets a FL local of consider with help devices the we where paper, existi edge, this and network In intelligence wireless it). edge [1] enable of (see to applications approaches communications of survey limited capabili a requiring processing for and devices, datasets edge local of the exploit can that |B 1 θ m Lpolm fe eur h iiiaino h empir- the of minimization the require often problems ML ersnstedtstwt size with dataset the represents t | +1 P n ei Gündüz, Deniz and , η θ M = u t ∈ ∈B stelann aea iteration at rate learning the is θ eoe h ubro eie,and devices, of number the denotes R t m θ − d ∇ t +1 eoe h oe aaeest eoptimized, be to parameters model the denotes η E F f t ∇ [ ( ( = g θ θ θ F ( t t = ) θ +1 , θ ( u t t θ ]= )] − ) t = = ) |B| stesohsi rdeto h cur- the of gradient stochastic the is 1 η θ t t ∇ M X 1 θ θ eirMme,IEEE Member, Senior − F t t nec trto,device iteration, each In . X − η u ( t θ ∈B η · t m M t ) g |B| G loalw paral- allows also SGD . f |B| =1 ( 1 eeae erig(FL) learning federated θ ( t and , θ g t oee,i h case the in However, . u X ) , m B ∈B u , m ( ) θ t ∇ , n ed the sends and , t iteration, -th f t ) f ( · , ) ( θ g steloss the is t m , u ( collab- ) θ , t train ) cent ties ely θ (4) (1) (3) (2) ng en m , th ly t 1 f , , 2 rent model computed at device m, m [M], using the locally and learning, we consider a wireless multiple access chan- ∈ available portion of the dataset, m. nel (MAC) from the devices to the PS, through which the While the communication bottleneckB in DSGD has been PS receives the gradients computed by the devices at each widely acknowledged in the literature [2]–[4], ML literature iteration of the DSGD algorithm. The standard approach to ignores the characteristics of the communication channel, and this problem, aligned with the literature on FL, is a separate simply focus on reducing the amount of data transmitted from approach to computation and communication. each device to the PS. In this paper, we consider DSGD We first follow this separation-based approach, and propose over a shared wireless medium, and explicitly model the a digital DSGD scheme, which will be called D-DSGD. In channel from the devices to the PS, and treat each iteration this scheme, the devices first compress their gradient estimates of the DSGD algorithm as a distributed wireless computation to a finite number of bits through quantization. Then, some problem, taking into account the physical properties of the access scheme, e.g., time, frequency or code division multiple wireless medium. We will provide two distinct approaches for access, combined with error correction coding is employed to this problem, based on digital and analog communications, transmit compressed gradient estimates to the PS in a reliable respectively. We will show that analog “over-the-air” compu- manner. In this work, to understand the performance limits of tation can significantly speed up wireless DSGD, particularly the digital approach, we assume capacity-achieving channel in bandwidth-limited and low-power settings, typically expe- codes are utilized at the devices. The optimal solution for rienced by wireless edge devices. this scheme will require carefully allocating channel resources across the devices and the available power of each device across iterations, together with an efficient gradient quantiza- A. Prior Works tion scheme. For gradient compression, we will consider state- Extensive efforts have been made in recent years to enable of-the-art quantization approaches together with local error collaborative learning across distributed mobile devices. In the accumulation [21]. FL framework [5]–[7], devices collaboratively learn a model It is known that separation-based approaches to distributed with the help of a PS. In practical implementations, bandwidth compression and computing over wireless channels are sub- of the communication channel from the devices to the PS is optimal in general [27], [28]. We propose an alternative analog the main bottleneck in FL [5]–[12]. Therefore, it is essential to communication approach, in which the devices transmit their reduce the communication requirements of collaborative ML. local gradient estimates directly over the wireless channel, To this end, three main approaches, namely quantization, spar- extending our previous work in [29]. This scheme is motivated sification, and local updates, and their various combinations by the fact that the PS is not interested in the individual have been considered in the literature. Quantization methods gradient vectors, but in their average, and the wireless MAC implement lossy compression of the gradient vectors by quan- automatically provides the PS with the sum of the gradients tizing each of their entries to a finite-bit low precision value (plus a noise term). However, the bandwidth available at each [2], [4], [13]–[18]. Sparsification reduces the communication iteration may not be sufficient to transmit the whole gradient time by sending only some values of the gradient vectors [3], vector in an uncoded fashion. Hence, to compress the gradients [19]–[24], and can be considered as another way of lossy to the dimension of the limited bandwidth resources, we em- compression, but it is assumed that the chosen entries of the ploy an analog compression scheme inspired by compressive gradient vectors are transmitted reliably, e.g., at a very high sensing, which was previously introduced in [30] for analog resolution. Another approach is to reduce the frequency of image transmission over bandwidth-limited wireless channels. communication from the devices by allowing local parameter In this analog computation scheme, called A-DSGD, devices updates [5]–[12], [25], [26]. Although there are many papers first sparsify their local gradient estimates (after adding the in the FL literature, most of these works ignore the physical accumulated local error). These sparsified gradient vectors are and network layer aspects of the problem, which are essential projected to the channel bandwidth using a pseudo-random for FL implementation at the network edge, where the training measurement matrix, as in compressive sensing. Then, all the takes place over a shared wireless medium. To the best of our devices transmit the resultant real-valued vectors to the PS knowledge, this is the first paper to address the bandwidth and simultaneously in an analog fashion, by simply scaling them power limitations in collaborative edge learning by taking into to meet the average transmit power constraints. The PS tries to account the physical properties of the wireless medium. reconstruct the sum of the actual sparse gradient vectors from its noisy observation. We use approximate message passing (AMP) algorithm to do this at the PS [31]. We also prove the B. Our Contributions convergence of the A-DSGD algorithm considering a strongly Most of the current literature on distributed ML and FL convex loss function. ignore the physical layer aspects of communication, and Numerical results show that the proposed analog scheme consider interference-and-error-free links from the devices to A-DSGD has a better convergence behaviour compared to the PS to communicate their local gradient estimates, possibly its digital counterpart, while D-DSGD outperforms the digital in compressed form to reduce the amount of information schemes studied in [2], [16]. Moreover, we observe that the that must be transmitted. Due to the prevalence of wireless performance of A-DSGD degrades only slightly by introduc- networks and the increasing availability of edge devices, e.g., ing bias in the datasets across the devices; while the degrada- mobile phones, sensors, etc., for large-scale data collection tion in D-DSGD is larger, it still outperforms other digital 3 approaches [2], [16] by a large margin. The performances of both A-DSGD and D-DSGD algorithms improve with the number of devices, keeping the total size of the dataset constant, despite the inherent resource sharing approach for D-DSGD. Furthermore, the performance of A-DSGD degrades only slightly even with a significant reduction in the available average transmit power. We argue that, these benefits of A- DSGD are due to the inherent ability of analog communications to benefit from the signal-superposition characteristic of the wireless medium. Also, reduction in the communication bandwidth of the MAC deteriorates the performance of D- DSGD much more compared to A-DSGD. Fig. 1: Illustration of the FL framework at the wireless edge. A similar over-the-air computation approach is also considered in parallel works [32], [33]. These works consider, respectively, SISO and SIMO fading MACs from the devices gradient computed by device m using local data samples. At to the PS, and focus on aligning the gradient estimates received each iteration of the DSGD algorithm in (4), the local gradient from different devices to have the same power at the PS to estimates of the devices are sent to the PS over s uses of a allow correct computation by performing power control and Gaussian MAC, characterized by: device selection. Our work is extended to a fading MAC M y(t)= xm(t)+ z(t), (5) model in [34]. The distinctive contributions of our work m=1 with respect to [32] and [33] are i) the introduction of a s where xm(t) R isX the length-s channel input vector purely digital separate gradient compression and communica- transmitted by device∈ m at iteration t, y(t) Rs is the channel tion scheme; ii) the consideration of a bandwidth-constrained output received by the PS, and z(t) Rs ∈is the independent channel, which requires (digital or analog) compression of additive white Gaussian noise (AWGN)∈ vector with each entry the gradient estimates; iii) error accumulation at the devices independent and identically distributed (i.i.d.) according to to improve the quality of gradient estimates by keeping 0, σ2 . Since we focus on DSGD, the channel input vector track of the information lost due to compression; and iv) Nof device m at iteration t is a function of the current parameter power allocation across iterations to dynamically adapt to vector θt and the local dataset m, and more specifically the the diminishing gradient variance. We also remark that both B current gradient estimate at device m, gm (θt), m [M]. [32] and [33] consider transmitting model updates, while For a total of T iterations of DSGD algorithm, the following∈ we focus on gradient transmission, which is more energy- average transmit power constraint is imposed per eache device: efficient as each device transmits only the innovation obtained T through gradient descent at that particular iteration (together 1 2 ¯ xm(t) 2 P, m [M], (6) with error accumulation), whereas model transmission wastes T t=1 || || ≤ ∀ ∈ a significant portion of the transmit power by sending the averaged overX iterations of the DSGD algorithm. The goal already known previous model parameters from all the devices. is to recover the average of the local gradient estimates 1 M Our work can easily be combined with the federated averaging M m=1 gm (θt) at the PS, and update the model parameter algorithm in [6], where multiple SGD iterations are carried as in (4). However, due to the pre-processing performed at P out locally before forwarding the models differences to the each device and the noise added by the wireless channel, it PS. We can also apply momentum correction [3], which is not possible to recover the average gradient perfectly at improves the convergence speed of the DSGD algorithms with the PS, and instead, it uses a noisy estimate to update the communication delay. model parameter vector; i.e., we have θt+1 = φ(θt, y(t)) for some update function φ : Rd Rs Rd. The updated model × → C. Notations parameter is then multicast to the devices by the PS through an error-free shared link. We assume that the PS is not limited in R represents the set of real values. For positive integer i, we power or bandwidth, so the devices receive a consistent global , 1 let [i] 1,...,i , and i denotes a column vector of length parameter vector for their computations in the next iteration. i with all{ entries}1. 0, σ2 denotes a zero-mean normal The transmission of the local gradient estimates, g (θt), distribution with varianceN σ2. We denote the cardinality of set m m [M], to the PS with the goal of PS reconstructing by , and l2 norm of vector x by x . Also, we represent ∀their∈ average can be considered as a distributed function Athe l |A|induced norm of a rectangular matrixk k A by A . 2 k k computation problem over a MAC [28]. We will consider both a digital approach treating computation and communication II. SYSTEM MODEL separately, and an analog approach that does not use any We consider federated SGD at the wireless edge, where M coding, and instead applies gradient sparsification followed by wireless edge devices employ SGD with the help of a remote a linear transformation to compress the gradients, which are PS, to which they are connected through a noisy wireless MAC then transmitted simultaneously in an uncoded fashion. (see Fig. 1). Let m denote the set of data samples available We remark that we assume a simple channel model (Gaus- at device m, m B [M], and g (θ ) Rd be the stochastic sian MAC) for the bandwidth-limited uplink transmission from ∈ m t ∈ 4

∆ the devices to the PS in this paper in order to highlight the gradient estimate vector gm (θt)+ m(t), of dimension d, main ideas behind the proposed digital and analog collabo- to zero, where q d/2 (to have a communication-efficient t ≤ rative learning. The digital and analog approaches proposed scheme, in practice, the goal is to have qt d, t). Then, in this paper can be extended to more complicated channel it computes the mean values of all the remaining≪ ∀ positive models as it has been done in the follow up works [34]–[37]. entries and all the remaining negative entries, denoted by + + µ (t) and µ− (t), respectively. If µ (t) > µ− (t) , then it m m m | m | III. DIGITAL DSGD(D-DSGD) sets all the entries with negative values to zero and all the entries with positive values to µ+ (t), and vice versa. We In this section, we present DSGD at the wireless network m denote the resulting sparse vector at device m by gq (θ ), edge utilizing digital compression and transmission over the m t and device m updates the local accumulated error vector as wireless MAC, referred to as the digital DSGD (D-DSGD) ∆ (t +1) = g (θ ) + ∆ (t) gq (θ ), m [M]. It algorithm. Since we do not know the variances of the gradient m m t m m t then aims to send gq (θ ) over the− channel by transmitting∈ estimates at different devices, we allocate the power equally m t its mean value and the positions of its non-zero entries. For among the devices, so that device m sends x (t) with power m this purpose, we use a 32-bit representation of the absolute P , i.e., x (t) 2 = P , where P values are chosen to satisfy t m 2 t t value of the mean (either µ+ (t) or µ (t) ) along with 1 the average|| transmit|| power constraint over T iterations m m− bit indicating its sign. To send the| positions| of the non- 1 T zero entries, it is assumed in [21] that the distribution of the Pt P.¯ (7) T t=1 ≤ distances between the non-zero entries is geometrical with Due to the intrinsic symmetryX of the model, we assume that the success probability qt, which allows them to use Golomb devices transmit at the same rate at each iteration (while the encoding to send these distances with a total number of bits rate may change across iterations depending on the allocated 1 log((√5 1)/2) b∗ , where − . b∗ + 2 b∗ = 1+ log2 log(1 qt) power, P ). Accordingly, the total number of bits that can 1 (1 qt) − t − − However, we argue that, sending d bits to transmit be transmitted from each of the devices over s uses of the log2 qt Gaussian MAC, described in (5), is upper bounded by the positions of the non-zero entries is sufficient regardless of the distribution of the positions. This can be achieved by s MP R = log 1+ t , (8) simply enumerating all possible sparsity patterns. Thus, with t 2M 2 sσ2 D-DSGD, the total number of bits sent by each device at where MPt/s is the sum-power per channel use. Note that iteration t is given by this is an upper bound since it is the Shannon capacity of the d underlying Gaussian MAC, and we further assumed that the r = log + 33, (9) t 2 q capacity can be achieved over a finite blocklength of s. t where qt is chosen as the highest integer satisfying rt Rt. Remark 1. We note that having a distinct sum power MPt at We will also study the impact of introducing more devices≤ each iteration t enables each user to transmit different num- into the system. With the reducing cost of sensing and comput- bers of bits at different iterations. This corresponds to a novel ing devices, we can consider introducing more devices, each gradient compression scheme for DSGD, in which the devices coming with its own power resources. We assume that the can adjust over time the amount of information they send to size of the total dataset remains constant, which allows each the PS about their gradient estimates. They can send more sensor to save computation time and energy. Assume that the information bits at the beginning of the DSGD algorithm when number of devices is increased by a factor κ> 1. We assume the gradient estimates have higher variances, and reduce the that the total power consumed by the devices at iteration t also number of transmitted bits over time as the variance decreases. increases by factor κ. We can see that the maximum number We observed empirically that this improves the performance of bits that can be sent by each device is strictly smaller in the compared to the standard approach in the literature, where the new system. this means that the PS receives a less accurate same compression scheme is applied at each iteration [21]. gradient estimate from each device, but from more devices. We will adopt the scheme proposed in [21] for gradient Numerical results for the D-DSGD scheme and its comparison compression at each iteration of the DSGD scheme. We to analog transmission are relegated to Section VI. modify this scheme by allowing different numbers of bits to be Remark 2. We have established a framework for digital trans- transmitted by the devices at each iteration. At each iteration mission of the gradients over a band-limited wireless channel, the devices sparsify their gradient estimates as described where other gradient quantization techniques can also be below. In order to retain the accuracy of their local gradient utilized. In Section VI, we will compare the performance of estimates, devices employ error accumulation [4], [38], where the proposed D-DSGD scheme with that of sign DSGD (S- the accumulated error vector at device m until iteration t is de- DSGD) [16] and quantized DSGD (Q-DSGD) [2], adopted ∆ Rd ∆ 0 noted by m(t) , where we set m(0) = , m [M]. for digital transmission over a limited capacity MAC. Hence, after the∈ computation of the local gradient∀ estimate∈ for parameter vector θt, i.e., gm (θt), device m updates its ∆ IV. ANALOG DSGD(A-DSGD) estimate with the accumulated error as gm (θt) + m(t), m [M]. At iteration t, device m, m [M], sets all Next, we propose an analog DSGD algorithm, called A- ∈ ∈ but the highest qt and the smallest qt of the entries of its DSGD, which does not employ any digital code, either for 5 compression or channel coding, and instead all the devices Algorithm 1 A-DSGD transmit their gradient estimates simultaneously in an uncoded 1: Initialize θ = 0 and ∆ (0) = = ∆ (0) = 0 manner. This is motivated by the fact that the PS is not inter- 0 1 ··· M 2: for t =0,...,T 1 do ested in the individual gradients, but only in their average. The devices do: − underlying wireless MAC provides the sum of the gradients, • 3: for m =1,...,M in parallel do which is used by the PS to update the parameter vector. See 4: Compute gm (θt) with respect to m Algorithm 1 for a description of the A-DSGD scheme. ec ∆ B 5: gm (θt)= gm (θt)+ m(t) Similarly to D-DSGD, devices employ local error accumu- sp ec 6: gm (θt)=spk (gm (θt)) lation. Hence, after computing the local gradient estimate for ∆ ec sp 7: m(t +1)= gm (θt) gm (θt) parameter vector θt, each device updates its estimate with the sp − ec 8: g˜m (θt)= As 1gm (θt) accumulated error as g (θ ) , g (θ )+ ∆ (t), m [M]. − T m t m t m T ∈ 9: x The challenge in the analog transmission approach is to m (t)= αm(t)g˜m (θt) αm(t) compress the gradient vectors to the available channel band- 10: end for hp p i width. In many modern ML applications, such as deep neural PS does: • 1 11: g θ A ys 1 networks, the parameter vector, and hence the gradient vec- ˆ ( t) = AMP s−1 ys(t) − (t) tors, have extremely large dimensions, whereas the channel 12: θt+1 = θt ηt gˆ (θt) − · bandwidth, measured by parameter s, is small due to the 13: end for bandwidth and latency limitations. Thus, transmitting all the model parameters one-by-one in an uncoded/analog fashion is not possible as we typically have d s. In the following, we propose a power allocation scheme ≫ Lossy compression at any required level is at least theoret- designing the scaling coefficients αm(t), m,t. ically possible in the digital domain. For the analog scheme, We set s˜ = s 1, which requires ∀s 2. At it- − p ≥ in order to reduce the dimension of the gradient vector to that eration t, we set am(t) = αm(t), and device m sp of the channel, the devices apply gradient sparsification. In computes g˜m (θt) = As 1gm (θt), and sends vector − p T particular, device m sets all but the k elements of the error- T xm (t) = α (t)g˜ (θ ) α (t) with the same compensated resulting vector gec (θ ) with the highest mag- m m t m m t x 2 nitudes to zero. We denote the sparse vector at device m by power Pt =h m (t) 2 satisfying the averagei power con- 1 Tp|| ||¯ p sp straint t=1 Pt P , for m [M]. Accordingly, scaling gm (θt), m [M]. This k-level sparsification is represented T ≤ ∈ ∈ sp ec factor αm(t) is determined to satisfy by function spk in Algorithm 1, i.e., gm (θt)=spk (gm (θt)). P The accumulated error at device m, m [M], which is the p 2 ec ∈ Pt = αm(t) g˜m (θt) 2 +1 , (12) difference between gm (θt) and its sparsified version, i.e., it k k ec maintains those elements of vector gm (θt) which are set to which yields zero as the result of sparsification at iteration t, is then updated P according to α (t)= t , for m [M]. (13) m 2 ∈ g˜m (θt) 2 +1 ∆ (t +1)=gec (θ ) gsp (θ )= g (θ )+ ∆ (t) k k m m t − m t m t m 2 Since g˜m (θt) 2 may vary across devices, so can αm(t). sp (g (θt)+ ∆m(t)) . (10) k k − k m That is why, at iteration t, device m allocates one channel p We would like to transmit only the non-zero entries of these use to provide the value of αm(t) to the PS along with its sparse vectors. However, simply ignoring the zero elements scaled low-dimensional gradient vector g˜ (θ ), m [M]. p m t ∈ would require transmitting their indeces to the PS separately. Accordingly, the received vector at the PS is given by To avoid this additional data transmission, we will employ a M sp As 1 m=1 αm(t)gm (θt) random projection matrix, similarly to compressive sensing. y (θt)= − + z(t), (14) M α (t) A similar idea is recently used in [30] for analog image " P m=1p m # transmission over a bandwidth-limited channel. where is replaced by (13).p For , we define s˜ d αm(t) P i [s] A pseudo-random matrix As˜ R × , for some s˜ s, ∈ ∈ ≤ T with each entry i.i.d. according to (0, 1/s˜), is generated yi (t) , y (t) y (t) y (t) , (15) N 1 2 ··· i and shared between the PS and the devices before start- i , T z (t) z1(t) z2(t) zi(t) , (16) ing the computations. At iteration t, device m computes ··· , sp Rs˜ g˜m (θt) As˜gm (θt) , and transmits xm(t) = where yj (t) and zj (t) denote the j-th entry ofy (t) and z(t), ∈ T T T , where a Rs s˜, over respectively. Thus, we have αm(t)˜gm (θt) am(t) m(t) − ∈ M theh MAC, m [M], whilei satisfying the average power s 1 sp s 1 p ∈ y − (t)= As 1 αm(t)gm (θt)+ z − (t), constraint (6). The PS receives − m=1 (17a) M X p M y (t)= xm (t)+ z(t) m=1 ys (t)= αm(t)+ zs(t). (17b) m=1 XA M α (t)gsp (θ ) s˜ m=1 m m t z (11) X p 1 M sp = M + (t). Note that the goal is to recover M m=1 gm (θt) at the am(t) s 1 " P mp=1 # PS, while, from y − (t) given in (17a), the PS observes a P P 6

M sp noisy version of the weighted sum m=1 αm(t)gm (θt) [M], where g˜m,i (θt) is the i-th entry of vector g˜m (θt), i az , 1 ∈ projected into a low-dimensional vector through As 1. Ac- [˜s]. We also define g˜m (θt) g˜m (θt) µm(t) s˜, m [M]. 2 p − cording to (13), each value of g˜ (θP) results in a distinct The power of vector gãz (θ ) is given by− ∈ k m t k2 m t scaling factor αm(t). However, for large enough d and m , az 2 2 2 2 |B | g˜m (θt) 2 = g˜m (θt) 2 sµ˜ m(t). (19) the values of g˜m (θt) , m [M], are not going to k k k k − k k2 ∀ ∈ be too different across devices. As a result, scaling factors We set s˜ = s 2, which requires s 3. We also set T αm(t), m [M], are not going to be very different either. az − az ≥ ∀ ∈ am(t) = αm (t)µm(t) αm (t) , and after computing Accordingly, to diminish the effect of scaled gradient vectors, sp p s 1 g˜m (θt)= As 2gm (θt), device m, m [M], sends we choose to scale down the received vector y − (t) at the p − p ∈ T T PS, given in (17a), with the sum of the scaling factors, i.e., x az az az az M m (t)= αm (t)˜gm (θt) αm (t)µm(t) αm (t) , αm(t), whose noisy version is received by the PS as m=1 h (20)i ys (t) given in (17b). The resulting scaled vector at the PS is p p p P p with power s 1 M y − (t) αm(t) sp =As 1 M gm (θt) 2 az 2 2 y (t) − m=1 xm (t) 2 = αm (t) g˜m (θt) 2 (s 3)µm(t) + 1 , (21) s i=1 pαi(t)+ zs(t) || || k k − − X T 1 s 1 1 ¯ + p z − (t), (18) which is chosen to be equal to Pt, such that T t=1 Pt P . M P ≤ αi(t)+ zs(t) Thus, we have, for m [M], i=1 ∈ P where αm(t), m P[M]p, is given in (13). By our choice, the az Pt ∈ 1 M sp s 1 αm (t)= 2 . (22) PS tries to recover g (θt) from y − (t) /ys (t) 2 M m=1 m g˜m (θt) 2 (s 3)µm(t)+1 knowing the measurement matrix As 1. The PS estimates k k − − P − Compared to the transmit power given in (12), we observe that gˆ (θt) using the AMP algorithm. The estimate gˆ (θt) is then az removing the mean reduces the transmit power by αm (t)(s used to update the model parameter as θt+1 = θt ηt gˆ (θt). 2 − − · 3)µm(t), m [M]. The received vector at the PS is given by Remark 3. We remark here that, with SGD the empirical vari- ∈ M az az ance of the stochastic gradient vectors reduce over time ap- m=1 αm (t)g˜m (θt) y (θ )= M αaz(t)µ (t) + z(t) proaching zero asymptotically. The power should be allocated t P m=1p m m  over iterations taking into account this decaying behaviour M αaz(t)  P mp=1 m  of gradient variance, while making sure that the noise term M az sp  1 αPm (t) (pAs 2gm (θt) µm(t) s 2) would not become dominant. To reduce the variation in the m=1 − − − M az z =  m=1 αm (t)µm(t)  + (t), scaling factors αm(t), m,t, variance reduction techniques P p M ∀ ¯ αaz(t) can be used [39]. We also note that setting Pt = P , t, results  P mp=1 m  p ∀   (23) in a special case, where the power is allocated uniformly over P p time to be resistant against the noise term. where we have M Remark 4. Increasing M can help increase the convergence s 2 az sp y − (t)= As 2 αm (t)gm (θt) speed for A-DSGD. This is due to the fact that having − m=1 M more signals superposed over the MAC leads to more robust X azp 1 s 2 αm (t)µm(t) s 2 + z − (t), (24a) transmission against noise, particularly when the ratio ¯ 2 − m=1 − P/sσ M X paz is relatively small, as we will observe in Fig. 6. Also, for a ys 1 (t)= αm (t)µm(t)+ zs 1(t), (24b) 1 M sp − m=1 − larger M value, M m=1 gm (θt) provides a better estimate M M X p az of 1 g (θ ), and receiving information from a larger ys (t)= αm (t)+ zs(t). (24c) M m=1 m tP m=1 number of devices can make these estimates more reliable. X p 1 M sp P The PS performs AMP to recover M m=1 gm (θt) from the Remark 5. In the proposed A-DSGD algorithm, the spar- following vector sification level, , results in a trade-off. For a relatively P k 1 1 M sp s 2 1 small value of k, g (θt) can be more reli- y − (t)+ ys 1 (t) s 2 = M m=1 m − − 1 M ys (t) ably recovered from m=1 g˜m (θt); however it may MP M az not provide an accurate estimate of the actual aver- αm (t) sp As 2 gm (θt) 1 M P − m=1 M az age gradient M m=1 gm (θt). Whereas, with a higher i=1 pαi (t)+ zs(t) 1 M sp X 1 s 2 k value, m=1 gm (θt) provides a better estimate of zs 1(t) s 2 + z − (t) M P − −P p 1 M 1 M sp + M . (25) g (θt), but reliable recovery of g (θt) az M m=1 m M m=1 m i=1 αi (t)+ zs(t) P 1 M from the vector m=1 g˜m (θt) is less likely. P M P P p P A. Mean-Removal for Efficient Transmission V. CONVERGENCE ANALYSIS OF A-DSGD ALGORITHM To have a more efficient usage of the available power, each device can remove the mean value of its gradient estimate In this section we provide convergence analysis of A-DSGD before scaling and sending it. We define the mean value of presented in Algorithm 1. For simplicity, we assume that ηt = g˜ (θ )= A g (θ ) as µ (t) , 1 s˜ g˜ (θ ), for m η, t. We consider a differentiable and c-strongly convex loss m t s˜ m t m s˜ i=1 m,i t ∈ ∀ P 7 function F , i.e., x, y Rd, it satisfies Lemma 2. Consider a random vector u Rd with each ∀ ∈ 2 ∈ T c 2 entry i.i.d. according to (0, σu). For δ (0, 1), we F (y) F (x) F (x) (y x)+ y x . (26) N ∈ , − ≥ ∇ − 2 k − k have Pr u σuρ(δ) = δ, where we define ρ(δ) 1 {k k≥ } 1/2 2γ− (Γ(d/2)(1 δ), d/2) , and γ (a, x) is the lower A. Preliminaries incomplete gamma− function defined as: We first present some preliminaries and background for the x , a 1 ι essential instruments of the convergence analysis. γ (a, x) ι − e− dι, x 0,a> 0, (32) ˆ0 ≥ Assumption 1. The average of the first moment of the gradi- 1 and γ− (a,y) is its inverse, which returns the value of x, such ents at different devices is bounded as follows: that γ (a, x)= y, and Γ (a) is the gamma function defined as 1 M E ∞ [ gm (θt) ] G, t, (27) , a 1 ι M m=1 k k ≤ ∀ Γ (a) ι − e− dι, a> 0. (33) ˆ0 where the expectationX is over the randomness of the gradients. Proof. We highlight that u 2 follows the Chi-square distri- Assumption 2. For scaling factors α (t), m [M], given k k m ∀ ∈ bution, and we have [43] √αm(t) 1 in (13), we approximate M p , t. 2 αi(t)+zs(t) M 2 γ(d/2,u/(2σ )) Pi=1 √ ≈ ∀ Pr u u = u . (34) The rationale behind Assumption 2 is assuming that the k k ≤ Γ(d/2) n o noise term is small compared to the sum of the scaling Accordingly, it follows that factors, M α (t), and the l norm of the gradient vectors i=1 i 2 1 1/2 at different devices are not highly skewed. Accordingly, the Pr u σu 2γ− (Γ(d/2)(1 δ), d/2) P p k k≥ − model parameter vector in Algorithm 1 is updated as follows: n 2 2 1 o = Pr u 2σuγ− (Γ(d/2)(1 δ), d/2) = δ. (35) M k k ≥ − 1 sp n o θt+1 = θt η AMPAs−1 As 1 gm (θt) − · − m=1 M X 1 s 1 By choosing δ appropriately, we can guarantee, with prob- + M z − (t) . (28) α (t) ability arbitrary close to 1, that u σuρ(δ). i=1 i k k≤ Lemma 1. [40]–[42] ConsiderP reconstructingp vector x Rd Lemma 3. Having σω(t) defined as in (31), by taking the with sparsity level k and x 2 = P from a noisy observation∈ expectation with respect to the gradients, we have, t, k k s d ∀ y = A x + z, where A R × is the measurement matrix, t+1 s s ∈ E σ 1 λ and z Rs is the noise vector with each entry i.i.d. with [σω(t)] σmax − G +1 , (36) ≤ M√Pt 1 λ (0, σ2∈). If s > k, the AMP algorithm reconstructs − N where σ , d +1. , A max s 1 xˆ AMP s (y)= x + στ ω, (29) − q Rd 2 Proof. See Appendix A. where each entry of ω is i.i.d. with (0, 1), and στ ∈ 2 2 N decreases monotonically from σ + P to σ . That is, the noisy Lemma 4. Based on the results presented above, taking observation y is effectively transformed into xˆ = x + σω. expectation with respect to the gradients yields

Assumption 3. We assume that the sparsity pattern of vector M E 1 sp M sp η (gm (θt) gm (θt)) σω(t)w(t) ηv(t), g (θt) is smaller than s 1, t. M m=1 − − ≤ m=1 m − ∀ X (37a) sp PIf all the sparse vectors gm (θt), m [M], have the same sparsity pattern, Assumption 3 holds∀ trivially∈ since we set k< where we define, t 0, ∀ ≥ s 1. On the other hand, in general we can guarantee that (1 + λ)(1 λt) Assumption− 3 holds by setting k s. v(t) ,λ − +1 G ≪ 1 λ According to Lemma 1 and Assumption 3, the model − σ 1 λt+1 parameter update given in (28) can be rewritten as + ρ(δ) σmax − G +1 , (37b) M√Pt 1 λ M 1 sp − θt+1 = θt η g (θt)+ σω(t)w(t) , (30) Proof. See Appendix C. − M m=1 m X where we define , σ B. Convergence Rate σω(t) M , (31) m=1 αm(t) Let θ∗ be the optimum parameter vector minimizing the loss , 2 and each entry of ω(t) RPd is i.i.d.p with (0, 1). function F . For ε> 0, we define θ : θ θ∗ ε as ∈ N the success region to which the parameterS { k vector− shouldk ≤ con-} Rd Corollary 1. For vector x , it is easy to verify that verge. We establish the convergence using rate supermartingale ∈ d k x sp (x) λ x , where λ , − , and the equality stochastic processes (refer to [23], [44] for a detailed definition k − k k ≤ k k d holds if all the entries of x have theq same magnitude. of the supermartingale and rate supermartingale processes). 8

0.9 0.9

0.8 0.8

0.7 0.7

0.6 0.6

0.5 0.5

0.4

Test accuracy Test 0.4 accuracy Test

Error-free shared link Error-free shared link 0.3 A-DSGD 0.3 A-DSGD D-DSGD D-DSGD 0.2 SignSGD 0.2 SignSGD QSGD QSGD 0.1 0.1 0 50 100 150 200 250 300 350 400 0 50 100 150 200 250 300 350 400 Iteration count, t Iteration count, t (a) IID data distribution (b) Non-IID data distribution Fig. 2: Test accuracy of different schemes for IID and non-IID data distribution scenarios with M = 25, B = 1000, P¯ = 500, s = d/2, k = s/2, and P = P¯, t. t ∀

Definition 1. [44] For any stochastic process, a non-negative Proof. See Appendix D. d (t+1) process Wt : R × R with horizon B is a rate T 1 → Rd We highlight that the reduction term t=0− v(t) in the supermartingale if i) for any sequence θt,..., θ0 and denominator is due to the sparsification, random projection E ∈ any t B, [Wt+1 (θt+1,..., θ0)] Wt (θt,..., θ0). ii) (compression), and the noise introduced byP the wireless chan- ≤ ≤ d T B and any sequence θT ,..., θ0 R , if the algorithm nel. To show that Pr ET 0 asymptotically, we consider ∀ ≤ ∈ ¯{ } → does not enter the success region by time T , i.e., if θ / , a special case Pt = P , in which t ∈ S t T , it must satisfy WT (θT ,..., θ0) T . T −1 2λG σρ(δ) σ G ∀ ≤ ≥ v(t)= + max + 1 T t=0 1 λ M√P¯ 1 λ Next we present a rate supermartingale process for the − − X T T +1 stochastic process with the update in (4), introduced in [44]. λ(1 + λ)(1 λ )G σρ(δ)σmax(1 λ )G . (42) −2 + − − (1 λ) M√P¯(1 λ)2 ! Statement 1. Consider the stochastic process in (4) with − − 2 learning rate η< 2cε/G . We define the process Wt by T 1 After replacing t=0− v(t) in (41), it is easy to verify that, ∗ 2 , ε e θt θ for η bounded as in (40), Pr ET 0 as T . Wt(θt, ..., θ0) log k − k + t, (38) P { } → → ∞ 2ηcε η2G2 ε − VI. EXPERIMENTS if the algorithm has not entered the success region by iteration t, i.e., if θi / , i t, and Wt(θt, ..., θ0) = Here we evaluate the performances of A-DSGD and D- ∈ S ∀ ≤ DSGD for the task of image classification. We run experiments Wτ1 1(θτ1 1, ..., θ0) otherwise, where τ1 is the smallest index − − for which θτ . The process Wt is a rate supermartingale on the MNIST dataset [45] with N = 60000 training and with horizon ∈B S = . It is also L-Lipschitz in its first 10000 test data samples, and train a single layer neural ∞ 2 2 1 coordinate, where L , 2√ε(2ηcε η G )− ; that is, given network with d = 7850 parameters utilizing ADAM optimizer d − 2 any x, y R ,t 1, and any sequence θt 1, ..., θ0, we have [46]. We set the channel noise variance to σ = 1. The ∈ ≥ − performance is measured as the accuracy with respect to the kWt (y, θt−1,..., θ0) − Wt (x, θt−1,..., θ0)k≤ L ky − xk . (39) test dataset, called test accuracy, versus iteration count t. Theorem 1. Under the assumptions described above, for We consider two scenarios for data distribution across devices: in IID data distribution, B training data samples, se- T 1 2 cεT √ε t=0− v(t) lected at random, are assigned to each device at the beginning η< − , (40) of training; while in non-IID data distribution, each device has T G2 P access to B training data samples, where B/2 of the training the probability that the A-DSGD algorithm does not enter the data samples are selected at random from only one class/label; success region by time T is bounded by that is, for each device, we first select two classes/labels at ε e θ∗ 2 random, and then randomly select B/2 data samples from each Pr ET log k k , { }≤ 2 2 T −1 ε of the two classes/labels. At each iteration, devices use all the (2ηcε η G ) T ηL t=0 v(t) − − B local data samples to compute their gradient estimates, i.e., (41) P the batch size is equal to the size of the local datasets. where ET denotes the event of not arriving at the success When performing A-DSGD, we use mean removal power region at time T . allocation scheme only for the first 20 iterations. The reason 9

0 Pt Pt P P 0 t t P Pt Pt P ̄ P P P 0 t

0 0 00 0 00 0 00 0 50 100 150 200 250 300 t Iteration t Fig. 3: Performance of the A-DSGD and D-DSGD algorithms Fig. 4: Performance of the A-DSGD and D-DSGD algorithms for different power allocation schemes when M = 25, B = for different P¯ values P¯ 200, 1000 , when M = 25, B = 1000, P¯ = 200, s = d/2, and k = s/2. 1000, s = d/2, k = s/2,∈{ and P = P¯, } t. t ∀ is that the gradients are more aggressive at the beginning of t. Observe that A-DSGD outperforms all digital approaches A-DSGD, and their mean values are expected to be relatively ∀that separate computation from communication. For both data diverse, while they converge to zero. Also, for comparison, we distribution scenarios, the gap between the performance of consider a benchmark, where the PS receives the average of the error-free shared link approach and that of A-DSGD is the actual gradient estimates at the devices, 1 M g θ , M m=1 m ( t) very small, and D-DSGD significantly outperforms the other in a noiseless fashion, and updates the model parameter vector P two digital schemes SignSGD and QSGD. Unlike the digital according to this error-free observation. We refer to this schemes, the performance loss of A-DSGD in non-IID data scheme as the error-free shared link approach. distribution scenario compared to the IID scenario is negligible In Fig. 2, we investigate IID and non-IID data distribution illustrating the robustness of A-DSGD to the bias in data scenarios. For comparison, we also consider two alternative distribution. The reduction in the convergence rate of A-DSGD digital schemes that employ SignSGD [16] and QSGD [2] in the non-IID scenario is mainly due to the mismatch in the algorithms for gradient compression, where each device ap- sparsity patterns of gradients at different devices, especially plies the coding schemes proposed in [16] and [2] to a limited in the initial iterations. We also observe that the performance number of the elements of its gradient vector, specified by its of D-DSGD degrades much less compared to SignSGD and link capacity. To be more precise, assume that, at iteration t, QSGD when the data is biased across devices. at each device a limited number of q and q entries of t,S t,Q In all the following experiments, we only consider IID data the gradient estimate with highest magnitudes are selected for distribution scenario. In Fig. 3 we investigate the performance delivery with SignSGD and QSGD, respectively. Following of D-DSGD for various power allocation designs for T = 300 the SignSGD scheme [16], at iteration t, a total number of DSGD iterations with M = 25, B = 1000, P¯ = 200, s = d/2, d and k = s/2 . We consider four different power allocation rt,S = log2 + qt,S bits, t, (43) ⌊ ⌋ q ∀ schemes: constant power allocation P = P¯, and t,S t are delivered to transmit the sign of each selected entry, as well 2 Pt,LH,stair = 100 (t 1)+1 , t [300], (45a) as their locations, and qt,S is then highest integer satisfying 299 − ∈ rt,S Rt. On the other hand, following the QSGD approach 100 , if 1 t 100 , ≤ ≤ ≤ [2], each device sends a quantized version of each of the qt,Q Pt,LH = 200, if 101 t 200, (45b) selected entries, the l2-norm of the resultant vector with qt,Q ≤ ≤ 300, if 201 t 300, non-zero entries, and the locations of the non-zero entries. ≤ ≤ Hence, for a quantization level of 2lQ , each device transmits 300, if 1 t 100,  ≤ ≤ d Pt,HL = 200, if 101 t 200, (45c) r = 32 + log +(1+ l )q bits, t, (44) ≤ ≤ t,Q 2 q Q t,Q ∀ 100, if 201 t 300. t,Q ≤ ≤ where qt,Q is set as then highest integer satisfying rt,Q Rt. With Pt,LH,stair, average transmit power increases linearly ≤  For QSGD, we consider a quantization level lQ = 2. We from P1 = 100 to P300 = 300. We also consider the error- consider s = d/2 channel uses, M = 25 devices each with free shared link bound, and A-DSGD with Pt = P¯, t. Note B = 1000 training data samples, average power P¯ = 500, and that the performance of A-DSGD is almost the same∀ as in we set a fixed ratio k = s/2 for sparsification, and P = P¯, Fig. 2a, despite 60% reduction in average transmit power. ⌊ ⌋ t 10

0.9

0.8

0.7

0.6 M B M B 0.5 M B P Test accuracy Test M B P 0.4 Error-free shared link M B P A-DSGD, s = d/2 A-DSGD, s = 3d/10 M B P 0.3 D-DSGD, s = d/2 M B P D-DSGD, s = 3d/10 M B P 0.2 0 50 100 150 200 250 300 0 200 400 600 800 1000 Iteration count, t Iteration t Fig. 5: Performance of the A-DSGD and D-DSGD algorithms Fig. 6: Performance of the A-DSGD and D-DSGD algorithms for different s values, s d/2, 3d/10 , when M = 20, for different (M,B) pairs (M,B) (10, 2000), (20, 1000) , B = 1000, P¯ = 500, and k∈= { s/2 , where} P = P¯, t. where s = d/2, k = s/2, and P =∈{P¯, t. } ⌊ ⌋ t ∀ t ∀

The performance of D-DSGD falls short of A-DSGD for all D-DSGD for different M and P¯ values, while keeping the the power allocation schemes under consideration. Comparing total number of data samples available across all the devices, the digital scheme for various power allocation designs, we MB, fixed. We consider (M,B) (10, 2000), (20, 1000) , observe that it is better to allocate less power to the initial and, for each setting, P¯ 1, 500∈. Given{ s = d/4 , we set} iterations, when the gradients are more aggressive, and save ∈{ } ⌊ ⌋ k = s/2 , and Pt = P¯, t. We highlight that, for P¯ =1, D- the power for the later iterations to increase the final accuracy; DSGD⌊ fails⌋ since the devices∀ cannot transmit any information even though this will result in a lower convergence speed. bits. We observe that increasing M while fixing MB does Also, increasing the power slowly, as in Pt,LH,stair, provides not have any visible impact on the performance in the error- a better performance in general, compared to occassional free shared link setting. The convergence speed of A-DSGD jumps in the transmit power. Overall, letting Pt vary over improves with M for both P¯ values. We note that although time provides an additional degree-of-freedom, which can the accuracy of individual gradient estimates degrade with M improve the accuracy of D-DSGD by designing an efficient (due to the reduction in the training sample size available at power allocation scheme having statistical knowledge about each device), A-DSGD benefits from the additional power the dynamic of the gradients over time. introduced by each device, which increases the robustness In Fig. 4, we compare the performance of A-DSGD with against noise thanks to the superposed signals each with that of D-DSGD for different values of the available average average power P¯. Similarly, the convergence speed of D- transmit power P¯ 200, 1000 . We consider s = d/2, DSGD improves with M for P¯ = 500. Note that even though ∈ { } M = 25, and B = 1000. We set k = s/2 and Pt = P¯, the number of bits allocated to each device reduces with M, ⌊ ⌋ t. As in Fig. 2, we observe that A-DSGD outperforms D- i.e., each device sends a less accurate estimate of its gradient, ∀ DSGD, and the gap between A-DSGD and the error-free since each new device comes with its own additional power, shared link approach is relatively small. We did not include the the total number of bits transmitted per data sample increases. performance of A-DSGD for P¯ = 1000 since it is very close Also, we observe that, for P¯ = 500, the performance of A- to the one with P¯ = 200. Unlike A-DSGD, the performance DSGD improves slightly by increasing M, while for P¯ = 1, of D-DSGD significantly deteriorates by reducing P¯. Thus, we the improvement is notable. Thus, even for large enough P¯, conclude that the analog approach is particularly attractive for increasing M slightly improves the performance of A-DSGD. learning across low-power devices as it allows them to align In Fig. 7, we investigate the impact of s on the perfor- their limited transmission powers to dominate the noise term. mance of A-DSGD. We consider s d/10, d/5, d/2 for a In Fig. 5, we compare A-DSGD and D-DSGD for different distributed system with M = 25, B ∈= {1000, P¯ = 50, and} for channel bandwidth values, s d/2, 3d/10 , and M = 20 any s values, we set k = 4s/5 , and P = P¯, t. In Fig. 7a, ∈ { } ⌊ ⌋ t ∀ and B = 1000, where we set Pt = 500, t, and k = s/2 . we consider different s values for each iteration of A-DSGD. ∀ ⌊ ⌋ Performance deterioration of D-DSGD is notable when chan- This may correspond to allocating different number of channel nel bandwidth reduces from s = d/2 to s = 3d/10. On the resources for each iteration, for example, allocating more sub- other hand, we observe that A-DSGD is robust to the channel bands through OFDM. On the other hand, in Fig. 7b, we limit bandwidth reduction as well, providing yet another advantage the number of channel uses at each communication round, and of analog over-the-air computation for edge learning. assume that the transmission time at each iteration linearly in- In Fig. 6, we compare the performance of A-DSGD and creases with s, and evaluate the performance of A-DSGD with 11

0.90 0.90

0.85 0.85

0.80 0.80

0.75 0.75

0.70 0.70

0.65 0.65 Test accuracy Test Test accuracy Test

0.60 0.60 Error-free shared link Error-free shared link A-DSGD, s = d/2 A-DSGD, s = d/5 0.55 0.55 A-DSGD, s = d/5 A-DSGD, s = d/10 A-DSGD, s = d/10 A-DSGD, s = d/2 0.50 0.50 0 50 100 150 200 250 300 350 400 0 50000 100000 150000 200000 250000 300000 Iteration count, t Total number of symbols, ts (a) Test accuracy versus iteration count, t (b) Test accuracy versus number of transmitted symbols, ts Fig. 7: Performance of the A-DSGD algorithm for different s values s d/10, d/5, d/2 , when M = 25, B = 1000, P¯ = 50, and k = 4s/5 , where P = P¯, t. ∈{ } ⌊ ⌋ t ∀ respect to ts, the total number of transmitted symbols. This As opposed to the standard approach in FL, which ig- would correspond to the convergence time if these symbols nores the channel aspects, and simply aims at reducing the are transmitted over time, using a single subcarrier. Therefore, communication load by compressing the gradients to a fixed assigning more resources to each iteration, i.e., increasing s, level, here we incorporate the wireless channel characteristics means that less number of iterations can be implemented by and constraints into the system design. We considered both the same time. Observe from Fig. 7a that, as expected, utilizing a digital approach (D-DSGD) that separates computation and more channel uses improves the performance both in terms communication, and an analog approach (A-DSGD) that ex- of convergence speed and final accuracy. On the other hand, ploits the superposition property of the wireless channel to when we consider the total number of transmitted symbols in have the average gradient computed over-the-air. Fig. 7b the performance of A-DSGD improves significantly In D-DSGD, the amount of information bits sent by each by reducing s from s = d/2 to s = d/5, which shows that device at each iteration can be adaptively adjusted with respect performing more SGD iterations with less accurate gradient to the average transmit power constraint P¯, and each device estimates at each iteration, i.e., smaller s values, can be better. quantizes its gradient followed by capacity-achieving channel This clearly shows the benefit of A-DSGD, in which the coding. We have shown that, under a finite power constraint, devices can transmit low dimensional gradient vectors at each the performance of D-DSGD can be improved by allocating iteration instead of sending the entire gradient vectors over more power to later iterations to compensate for the decreasing several iterations. However, this trend does not continue, and gradient variance, and provide better protection against noise. the performance of A-DSGD almost remains unchanged by In A-DSGD, we have proposed gradient sparsification further reducing s from s = d/5 to s = d/10. According followed by random linear projection employing the same to the results illustrated in Fig. 7b, we conclude that, when projection matrix at all the devices. This allowed reducing the communication bandwidth is limited, the value s can be the typically very large parameter vector dimension to the a design parameter taking into account the total desired time limited channel bandwidth. The devices then transmit these (number of iterations) to finish an ML task through DSGD. compressed gradient vectors simultaneously over the MAC We note here that we do not consider the computation time, to exploit the superposition property of the wireless medium. which remain mostly the same independent of s in different This analog approach allows a much more efficient use of the devices, as the same number of gradients are computed. available limited channel bandwidth. Numerical results have shown significant improvement in VII. CONCLUSIONS the performance with the analog approach, particularly in the We have studied collaborative/ federated learning at the low-power and low-bandwidth regimes. Moreover, when non- wireless edge, where wireless devices aim to minimize an IID data distribution is considered, the performance loss of empirical loss function collaboratively with the help of a both A-DSGD and D-DSGD is much smaller compared to remote PS. Devices have their own local datasets, and they SignSGD [16] and QSGD [2], A-DSGD being more robust to communicate with the PS over a wireless MAC. We have bias in data distribution across devices. We have also observed assumed that the communication from the PS to the devices is that the performances of both D-DSGD and A-DSGD improve noiseless, so the updated parameter vector is shared with the as more devices are introduced thanks to the additional power devices in a lossless fashion. introduced by each device. 12

APPENDIX A APPENDIX B PROOF OF LEMMA 3 PROOF OF INEQUALITY (52)

To prove the inequality in (52), we need to show that, for By plugging (13) into (31), we have a1, ..., an > 0, M 1 √Pt − 2 n n σω(t)= σ n i=1 ai 2 n n ai, (55) m=1 g˜ (θ ) +1 a ≤ m t i=1 j=1,j=i j i=1 X k k Q 6 X M q 1 √P − which correspondsP to Q σ t . (46) ≤ g˜ (θ ) +1 n n n n m=1 m t X k k a a n2 a 0. (56) j k − i ≥ Here we ﬁnd an upper bound on g˜m (θt) . We have, m,t, i=1 j=1,j=i k=1 i=1 k k ∀ X Y6 X Y sp g˜m (θt) = As 1gm (θt) We have k k k − k ∆ n n n 2 n As 1 ( gm (θt) + m (t) ) . (47) aj ak n ak ≤k − k k k k k i=1 j=1,j=6 i k=1 − k=1 X n Y X Y n We note that As 1 corresponds to the largest singular value n n 2 n 2 k − k = aj + aj ak n ak of random matrix As 1. We use the result presented in [47, j=1 j=1,j=6 n k=1,k=6 i,j − − i=1 k=1 Theorem 2] to estimate As 1 . According to [47, Theorem XnY n X n Y n Y − 2 k Rk n m = aj ak n(n 1) ak 2], for a random matrix A × with the (i, j)-th entry i=1 j=1,j=6 n k=1,k=6 i,j − − k=1 ∈ n n i.i.d. with E [a ]=0 and E a2 =1/n2, i [n], j [m], X X 2 Y2 Y i,j i,j = (ai + aj ) ak n(n 1) ak and n = κm, as m , the largest singular∈ value of∈A is (i,j)∈[n]2,i=6 j k=1,k=6 i,j − − k=1 → ∞ X 2 2 Yn Y asymptotically given by A . Based on this = (ai + aj ) ak σmax( )= m/n+1 (i,j)∈[n]2,i=6 j k=1,k=6 i,j result, since d 0, we estimate As 1 with its asymptotic X Y n − aiaj a ≫ k p k 2 2 k largest singular value denoted by σmax = d/(s 1)+1. − (i,j)∈[n] ,i=6 j k=1,k=6 i,j ∆ − 2 2 X n Y Next we upper bound m (t) . According to Corollary 1, = (ai + aj 2aiaj ) ak p (i,j)∈[n]2,i=6 j k=1,k=6 i,j we have, m,t, k k − X 2 n Y ∀ = (ai aj ) ak 0, (57) ∆ ∆ (i,j)∈[n]2,i=6 j − k=1,k=6 i,j ≥ m (t) λ gm (θt 1)+ m (t 1) k k≤ k − − k X Y ∆ which completes the proof of inequality (52). λ( gm (θt 1) + m (t 1) ), (48) ≤ k − k k − k where iterating it downward yields t 1 APPENDIX C ∆ − t i m (t) λ − gm (θi) . (49) PROOF OF LEMMA 4 k k≤ i=0 k k By replacing As 1 withXσmax and plugging (49) into (47), By taking the expectation with respect to the gradients, we it follows k − k have t M t i 1 g˜m (θt) σmax λ − gm (θi) . (50) E η (g (θt) sp (g (θt)+ ∆m(t))) σω(t)w(t) k k≤ i=0 k k M m − k m − " m=1 # X Using the above inequality inX (46) yields M 1 η E (g (θt)+ ∆m(t) sp (g (θt)+ ∆m(t))) M 1 ≤ M m − k m σ 1 − " m=1 # σω(t) . X t t i M ≤ √Pt σmax λ − g (θi) +1 1 m=1 i=0 m + ηE ∆m(t) + η ω(t) E [σω(t)] X k k M m=1 k k (51) P XM M (a) E 1 ∆ E 1 ∆ For a , ..., a R+, we prove in Appendix B that = η m(t + 1) + η m(t) 1 n ∈ M m=1 M m=1 n X X 1 a + η ω(t) E [σω(t)] i=1 i . (52) k k n 1 2 (b) M t ≤ n E 1 t+1−i i=1 ai P η λ gm (θi) ≤ M m=1 i=0 k k According to (52), weP can rewrite (51) as follows: X X E 1 M t−1 t−i M t + η λ gm (θi) σ t i M m=1 i=0 k k σ (t) σ λ − g (θ ) +1 . ω 2 max m i X X ≤ M √Pt m=1 i=0 k k + η ω(t) E [σω(t)] , k k t X X (53) (c) (1 + λ)(1 λ ) ηλ − + 1 G Independence of the gradients across time and devices yields ≤ 1 λ − t+1 t M σ 1 λ σ + ηρ(δ) σmax − G + 1 , (58) t i M√Pt 1 λ E [σω(t)] σmax λ − E [ g (θi) ]+ M ≤ M 2√P k m k − t i=0 m=1 where (a) follows from the deﬁnition ∆ (t) in (10), (b) X X m σ 1 λt+1 follows from the inequality in (49), and (c) follows from σmax − G +1 . (54) ≤ M√P 1 λ Assumption 1, as well as the result presented in Lemma 3. t − 13

APPENDIX D where (a) follows form (62), and (b) follows since Wt is rate PROOF OF THEOREM 1 supermartingale. Thus, for η bounded as in (40), we have E [W ] Pr E 0 . (64) We define the following process for the A-DSGD algorithm { T }≤ T 1 T ηL i=0− v(i) with the model update estimated as (30), t 0, − ≥ Replacing W0 from (38) completes the proof of Theorem 1. t 1 P − Vt(θt, ..., θ0) , Wt(θt, ..., θ0) ηL v(i), (59) − i=0 X REFERENCES if θi / , i t, and Vt(θt, ..., θ0) = Vτ2 1(θτ2 1, ..., θ0), ∈ S ∀ ≤ − − [1] J. Park, S. Samarakoon, M. Bennis, and M. Debbah, “Wireless network where τ2 denotes the smallest iteration index such that θτ2 . When the algorithm has not succeeded to the success∈ intelligence at the edge,” CoRR, vol. abs/1812.02858, 2018. [Online]. S Available: http://arxiv.org/abs/1812.02858 region, we have, t 0, [2] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: ∀ ≥ Communication-efficient SGD via gradient quantization and encoding,” Vt+1(θt+1, ..., θ0) arXiv:1610.02132 [cs.LG], Dec. 2017. M [3] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, “Deep gradient 1 compression: Reducing the communication bandwidth for distributed = W θ η gsp (θ )+ σ (t)w(t) , θ , ..., θ t+1 t − M m t ω t 0 training,” arXiv:1712.01887 [cs.CV], Feb. 2018. m=1 [4] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, “1-bit stochastic gradient t X descent and its application to data-parallel distributed training of speech ηL v(i) DNNs,” in INTERSPEECH, Singapore, Sep. 2014, pp. 1058–1062. i=0 − [5] J. Konecny, H. B. McMahan, F. X. Yu, P. Richtarik, A. T. Suresh, and X 1 M D. Bacon, “Federated learning: Strategies for improving communication Wt+1 θt η g (θt) , θt, ..., θ0 ≤ − M m=1 m efficiency,” arXiv:1610.05492 [cs.LG], Oct. 2017. [6] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, M X 1 sp “Communication-efficient learning of deep networks from decentralized + L η (g (θ ) g (θ )) σ (t)w(t) data,” in Proc. AISTATS, 2017. M m t − m t − ω m=1 [7] B. McMahan and D. Ramage, “Federated learning: Collabora- X t tive machine learning without centralized training data,” [on- ηL v(i), (60) line]. Available. https://ai.googleblog.com/2017/04/federated-learning- − collaborative.html, Apr. 2017. i=0 X [8] V. Smith, C.-K. Chiang, M. Sanjabi, and A. S. Talwalkar, “Federated where the inequality is due to the L-Lipschitz property of multi-task learning,” arXiv:1705.10467 [cs.LG], May 2017. [9] J. Konecny and P. Richtarik, “Randomized distributed mean estimation: process Wt in its first coordinate. By taking expectation of Accuracy vs communication,” arXiv:1611.07555 [cs.DC], Nov. 2016. the terms on both sides of the inequality in (60) with respect [10] J. Konecny, B. McMahan, and D. Ramage, “Federated optimization: Dis- to the randomness of the gradients, it follows that tributed optimization beyond the datacenter,” arXiv:1511.03575 [cs.LG], Nov. 2015. E θ θ [11] T. Nishio and R. Yonetani, “Client selection for federated learning with [Vt+1( t+1, ..., 0)] heterogeneous resources in mobile edge,” arXiv:1804.08333 [cs.NI], 1 M Apr. 2018. E Wt+1 θt η g (θt) , θt, ..., θ0 [12] S. Samarakoon, M. Bennis, W. Saad, and M. Debbah, “Distributed fed- ≤ − M m=1 m erated learning for ultra-reliable low-latency vehicular communications,” M X arXiv:1807.08127 [cs.IT], Jul. 2018. E 1 sp + ηL (gm (θt) gm (θt)) σω(t)w(t) [13] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li, “TernGrad: M m=1 − − Ternary gradients to reduce communication in distributed deep learning,” t X ηLE v(i) arXiv:1705.07878 [cs.LG], Dec. 2017. − i=0 [14] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, “DoReFa-Net: h i t Training low bitwidth convolutional neural networks with low bitwidth (a) X gradients,” arXiv:1606.06160 [cs.NE], Feb. 2018. W (θ , ..., θ )+ ηLv(t) ηL v(i)= V (θ , ..., θ ), ≤ t t 0 − t t 0 [15] H. Wang, S. Sievert, S. Liu, Z. Charles, D. Papailiopoulos, and i=0 S. Wright, “ATOMO: Communication-efficient learning via atomic X (61) sparsification,” arXiv:1806.04090 [stat.ML], Jun. 2018. [16] J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandku- where (a) follows since Wt is a rate supermartingale process, mar, “signSGD: Compressed optimisation for non-convex problems,” as well as the result in Lemma 4. Accordingly, we have arXiv:1802.04434 [cs.LG], Aug. 2018. [17] J. Wu, W. Huang, J. Huang, and T. Zhang, “Error compensated quan- E [Vt+1(θt+1, ..., θ0)] Vt(θt, ..., θ0), t. (62) tized SGD and its applications to large-scale distributed optimization,” ≤ ∀ arXiv:1806.08054 [cs.CV], Jun. 2018. We denote the complement of E by Ec, t. With the [18] B. Li, W. Wen, J. Mao, S. Li, Y. Chen, and H. Li, “Running sparse t t ∀ and low-precision neural network: When algorithm meets hardware,” in expectation with respect to the gradients, we have Proc. Asia and South Pacific Design Automation Conference (ASP-DAC), Jeju, South Korea, Jan. 2018. (a) [19] N. Strom, “Scalable distributed DNN training using commodity gpu E [W0]= E [V0] E [VT ] ≥ cloud computing,” in Proc. Conference of the International Speech E E c c = [VT ET ]Pr ET + [VT ET ]Pr ET Communication Association (INTERSPEECH), 2015. | { } | { } [20] A. F. Aji and K. Heafield, “Sparse communication for distributed E [V E ]Pr E ≥ T | T { T } gradient descent,” arXiv:1704.05021 [cs.CL], Jul. 2017. T 1 [21] F. Sattler, S. Wiedemann, K.-R. Müller, and W. Samek, “Sparse binary − = E [WT ET ] ηL v(i) Pr ET compression: Towards distributed deep learning with minimal commu- | − i=0 { } nication,” arXiv:1805.08768 [cs.LG], May 2018. (b) T 1 X [22] C. Renggli, D. Alistarh, T. Hoefler, and M. Aghagolzadeh, “Spar- − T ηL v(i) Pr ET , (63) CML: High-performance sparse communication for machine learning,” ≥ − i=0 { } arXiv:1802.08021 [cs.DC], Oct. 2018. X 14

[23] D. Alistarh, T. Hoefler, M. Johansson, S. Khirirat, N. Konstantinov, and C. Renggli, “The convergence of sparsified gradient methods,” arXiv:1809.10505 [cs.LG], Sep. 2018. [24] Y. Tsuzuku, H. Imachi, and T. Akiba, “Variance-based gradient compression for efficient distributed deep learning,” arXiv:1802.06058 [cs.LG], Feb. 2018. [25] S. U. Stich, “Local SGD converges fast and communicates little,” arXiv:1805.09767 [math.OC], Jun. 2018. [26] T. Lin, S. U. Stich, and M. Jaggi, “Don’t use large mini-batches, use local SGD,” arXiv:1808.07217 [cs.LG], Oct. 2018. [27] D. Gunduz, E. Erkip, A. Goldsmith, and H. V. Poor, “Source and channel coding for correlated sources over multiuser channels,” IEEE Transactions on Information Theory, vol. 55, no. 9, pp. 3927–3944, Sep. 2009. [28] M. Goldenbaum and S. Stanczak, “Robust analog function computation via wireless multiple-access channels,” IEEE Trans. Commun., vol. 61, no. 9, pp. 3863–3877, Sep. 2013. [29] M. M. Amiri and D. Gündüz, “Machine learning at the wireless edge: Distributed stochastic gradient descent over-the-air,” in Proc. IEEE Int’l Symp. on Inform. Theory (ISIT), Paris, France, Jul. 2019, pp. 1432–1436. [30] T-Y. Tung and D. Gündüz, “SparseCast: Hybrid digital-analog wireless image transmission exploiting frequency-domain sparsity,” IEEE Com- mun. Lett., vol. 22, no. 12, pp. 2451–2454, Dec. 2018. [31] D. L. Donoho, A. Maleki, and A. Montanari, “Message-passing algorithms for compressed sensing,” Proc. Nat. Acad. Sci. USA, vol. 106, no. 45, pp. 18 914–18 919, Nov. 2009. [32] G. Zhu, Y. Wang, and K. Huang, “Broadband analog aggregation for low-latency federated edge learning,” IEEE Trans. Wireless Commun., vol. 19, no. 1, pp. 491–506, Jan. 2020. [33] K. Yang, T. Jiang, Y. Shi, and Z. Ding, “Federated learning via over- the-air computation,” arXiv:1812.11750 [cs.LG], Jan. 2019. [34] M. M. Amiri and D. Gündüz, “Federated learning over wireless fading channels,” IEEE Trans. Wireless Commun., Early Access, Feb. 2020. [35] ——, “Over-the-air machine learning at the wireless edge,” in Proc. IEEE International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), Cannes, France, Jul. 2019, pp. 1–5. [36] J.-H. Ahn, O. Simeone, and J. Kang, “Wireless federated distillation for distributed edge learning with heterogeneous data,” arXiv:1907.02745 [cs.IT], Jul. 2019. [37] M. M. Amiri, T. M. Duman, and D. Gündüz, “Collaborative machine learning at the wireless edge with blind transmitters,” in Proc. IEEE Global Conference on Signal and Information Processing (GlobalSIP), Ottawa, ON, Canada, Nov. 2019, pp. 1–5. [38] N. Strom, “Scalable distributed DNN training using commodity GPU cloud computing,” in INTERSPEECH, 2015, pp. 1488–1492. [39] R. Johnson and T. Zhang, “Accelerating stochastic gradient descent using predictive variance reduction,” in Proc. International Conference on Neural Information Processing Systems (NIPS), Navada, USA, Dec. 2013. [40] M. Bayati and A. Montanari, “The dynamics of message passing on dense graphs, with applications to compressed sensing,” IEEE Trans. Inform. Theory, vol. 57, no. 2, pp. 764–785, Feb. 2011. [41] C. Rush, A. Greig, and R. Venkataramanan, “Capacity-achieving sparse superposition codes via approximate message passing decoding,” IEEE Trans. Inform. Theory, vol. 63, no. 3, pp. 1476–1500, Mar. 2017. [42] C. Rush and R. Venkataramanan, “The error probability of sparse superposition codes with approximate message passing decoding,” IEEE Trans. Inform. Theory, vol. 65, no. 5, pp. 3278–3303, Nov. 2018. [43] A. Mood, F. A. Graybill, and D. C. Boes, Introduction to the Theory of Statistics. Third ed., McGraw-Hill, 1974. [44] C. D. Sa, C. Zhang, K. Olukotun, and C. Re, “Taming the wild: A unified analysis of hogwild-style algorithms,” in Proc. International Conference on Neural Information Processing Systems (NIPS), 2015. [45] Y. LeCun, C. Cortes, and C. Burges, “The MNIST database of hand- written digits,” http://yann.lecun.com/exdb/mnist/, 1998. [46] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv:1412.6980 [cs.LG], Jan. 2017. [47] Z. D. Bai and Y. Q. Yin, “Limit of the smallest eigenvalue of a large dimensional sample covariance matrix,” The Annals of Probability, vol. 21, no. 3, pp. 1275–1294, 1993. This figure "Deniz.jpg" is available in "jpg" format from:

http://arxiv.org/ps/1901.00844v3