arXiv:1901.00844v3 [cs.DC] 7 Apr 2020

Machine Learning at the Wireless Edge: Distributed Stochastic Gradient Descent Over-the-Air

Mohammad Mohammadi Amiri, Student Member, IEEE, and Deniz Gündüz, Senior Member, IEEE

Abstract—We study collaborative/federated machine learning (ML) at the wireless edge, where power- and bandwidth-limited wireless devices with local datasets carry out distributed stochastic gradient descent (DSGD) with the help of a remote parameter server (PS). Standard approaches assume separate computation and communication, where compressed local gradient estimates are transmitted to the PS over orthogonal links. Following this digital approach, we introduce D-DSGD, in which the wireless devices employ gradient quantization and error accumulation, and transmit their gradient estimates to the PS over the underlying multiple access channel (MAC). We then introduce a novel analog scheme, called A-DSGD, which exploits the additive nature of the wireless MAC for over-the-air gradient computation, and provide convergence analysis for this approach. In A-DSGD, the devices first sparsify their gradient estimates, and then project them to a lower dimensional space imposed by the available channel bandwidth. These projections are sent directly over the MAC without employing any digital code. Numerical results show that A-DSGD converges faster than D-DSGD thanks to its more efficient use of the limited bandwidth and the natural alignment of the gradient estimates over the channel. The improvement is particularly compelling at low power and low bandwidth regimes. We also illustrate for a classification problem that A-DSGD is more robust to bias in data distribution across devices, while D-DSGD significantly outperforms other digital schemes in the literature. We also observe that both D-DSGD and A-DSGD perform better with an increasing number of devices (while keeping the total dataset size constant), showing their ability in harnessing the computation power of edge devices. The lack of quantization and channel encoding/decoding in A-DSGD further speeds up communication, making it very attractive for low-latency ML applications at the wireless edge.

This work has been supported by the European Research Council (ERC) through Starting Grant BEACON (agreement No. 677854).

M. Mohammadi Amiri was with the Department of Electrical and Electronic Engineering, Imperial College London. He is now with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544, USA.

D. Gündüz is with the Department of Electrical and Electronic Engineering, Imperial College London, London SW7 2AZ, U.K.

I. INTRODUCTION

Many emerging technologies involve massive amounts of data collection, and collaborative intelligence that can process and make sense of this data. Autonomous driving, extended reality, and the Internet of things (IoT) are prime examples, where data from sensors must be continuously collected, communicated, and processed to make inferences about the state of a system, or predictions about its future states. Many specialized machine learning (ML) algorithms are being developed tailored for various types of sensor data; however, the current trend focuses on centralized algorithms, where a powerful learning algorithm, often a massive neural network, is trained on a massive dataset. While this inherently assumes the availability of data at a central processor, in the case of wireless edge devices, transmitting the collected data to a central processor in a reliable manner may be too costly in terms of energy and bandwidth, may introduce too much delay, and may infringe users' privacy constraints. Therefore, a much more desirable and practically viable alternative is to develop collaborative ML techniques, also known as federated learning (FL), that can exploit the local datasets and processing capabilities of edge devices, requiring limited communications (see [1] for a survey of existing approaches to enable edge intelligence and applications of edge ML). In this paper, we consider FL at the wireless edge, where devices with local datasets collaboratively train a learning model with the help of a central parameter server (PS).

ML problems often require the minimization of the empirical loss function

F(θ) = (1/|B|) Σ_{u∈B} f(θ, u),   (1)

where θ ∈ R^d denotes the model parameters to be optimized, B represents the dataset with size |B|, and f(·) is the loss function defined by the learning model. The minimization of (1) is typically carried out through iterative gradient descent (GD), in which the model parameters at the t-th iteration, θ_t, are updated according to

θ_{t+1} = θ_t − η_t ∇F(θ_t) = θ_t − (η_t/|B|) Σ_{u∈B} ∇f(θ_t, u),   (2)

where η_t is the learning rate at iteration t. However, in the case of massive datasets each iteration of GD becomes prohibitively demanding. Instead, in stochastic GD (SGD) the parameter vector is updated with a stochastic gradient

θ_{t+1} = θ_t − η_t · g(θ_t),   (3)

which satisfies E[g(θ_t)] = ∇F(θ_t). SGD also allows parallelization when the dataset is distributed across tens or even hundreds of devices. In distributed SGD (DSGD), devices process data samples in parallel while maintaining a globally consistent parameter vector θ_t. In each iteration, device m computes a gradient vector based on the global parameter vector with respect to its local dataset B_m, and sends the result to the PS. Once the PS receives the computed gradients from all the devices, it updates the global parameter vector according to

θ_{t+1} = θ_t − (η_t/M) Σ_{m=1}^M g_m(θ_t),   (4)

where M denotes the number of devices, and g_m(θ_t) = (1/|B_m|) Σ_{u∈B_m} ∇f(θ_t, u) is the stochastic gradient of the current model computed at device m, m ∈ [M], using the locally available portion of the dataset, B_m.

While the communication bottleneck in DSGD has been widely acknowledged in the literature [2]–[4], the ML literature ignores the characteristics of the communication channel, and simply focuses on reducing the amount of data transmitted from each device to the PS. In this paper, we consider DSGD over a shared wireless medium, and explicitly model the channel from the devices to the PS, and treat each iteration of the DSGD algorithm as a distributed wireless computation problem, taking into account the physical properties of the wireless medium. We will provide two distinct approaches for this problem, based on digital and analog communications, respectively. We will show that analog "over-the-air" computation can significantly speed up wireless DSGD, particularly in bandwidth-limited and low-power settings, typically experienced by wireless edge devices.

A. Prior Works

Extensive efforts have been made in recent years to enable collaborative learning across distributed mobile devices. In the FL framework [5]–[7], devices collaboratively learn a model with the help of a PS. In practical implementations, the bandwidth of the communication channel from the devices to the PS is the main bottleneck in FL [5]–[12]. Therefore, it is essential to reduce the communication requirements of collaborative ML. To this end, three main approaches, namely quantization, sparsification, and local updates, and their various combinations have been considered in the literature. Quantization methods implement lossy compression of the gradient vectors by quantizing each of their entries to a finite-bit low precision value [2], [4], [13]–[18]. Sparsification reduces the communication time by sending only some values of the gradient vectors [3], [19]–[24], and can be considered as another way of lossy compression, but it is assumed that the chosen entries of the gradient vectors are transmitted reliably, e.g., at a very high resolution. Another approach is to reduce the frequency of communication from the devices by allowing local parameter updates [5]–[12], [25], [26]. Although there are many papers in the FL literature, most of these works ignore the physical and network layer aspects of the problem, which are essential for FL implementation at the network edge, where the training takes place over a shared wireless medium. To the best of our knowledge, this is the first paper to address the bandwidth and power limitations in collaborative edge learning by taking into account the physical properties of the wireless medium.

B. Our Contributions

Most of the current literature on distributed ML and FL ignores the physical layer aspects of communication, and considers interference- and error-free links from the devices to the PS to communicate their local gradient estimates, possibly in compressed form to reduce the amount of information that must be transmitted. Due to the prevalence of wireless networks and the increasing availability of edge devices, e.g., mobile phones, sensors, etc., for large-scale data collection and learning, we consider a wireless multiple access channel (MAC) from the devices to the PS, through which the PS receives the gradients computed by the devices at each iteration of the DSGD algorithm. The standard approach to this problem, aligned with the literature on FL, is a separate approach to computation and communication.

We first follow this separation-based approach, and propose a digital DSGD scheme, which will be called D-DSGD. In this scheme, the devices first compress their gradient estimates to a finite number of bits through quantization. Then, some access scheme, e.g., time, frequency or code division multiple access, combined with error correction coding, is employed to transmit the compressed gradient estimates to the PS in a reliable manner. In this work, to understand the performance limits of the digital approach, we assume capacity-achieving channel codes are utilized at the devices. The optimal solution for this scheme will require carefully allocating channel resources across the devices and the available power of each device across iterations, together with an efficient gradient quantization scheme. For gradient compression, we will consider state-of-the-art quantization approaches together with local error accumulation [21].

It is known that separation-based approaches to distributed compression and computing over wireless channels are suboptimal in general [27], [28]. We propose an alternative analog communication approach, in which the devices transmit their local gradient estimates directly over the wireless channel, extending our previous work in [29]. This scheme is motivated by the fact that the PS is not interested in the individual gradient vectors, but in their average, and the wireless MAC automatically provides the PS with the sum of the gradients (plus a noise term). However, the bandwidth available at each iteration may not be sufficient to transmit the whole gradient vector in an uncoded fashion. Hence, to compress the gradients to the dimension of the limited bandwidth resources, we employ an analog compression scheme inspired by compressive sensing, which was previously introduced in [30] for analog image transmission over bandwidth-limited wireless channels. In this analog computation scheme, called A-DSGD, devices first sparsify their local gradient estimates (after adding the accumulated local error). These sparsified gradient vectors are projected to the channel bandwidth using a pseudo-random measurement matrix, as in compressive sensing. Then, all the devices transmit the resultant real-valued vectors to the PS simultaneously in an analog fashion, by simply scaling them to meet the average transmit power constraints. The PS tries to reconstruct the sum of the actual sparse gradient vectors from its noisy observation. We use the approximate message passing (AMP) algorithm to do this at the PS [31]. We also prove the convergence of the A-DSGD algorithm considering a strongly convex loss function.

Numerical results show that the proposed analog scheme A-DSGD has a better convergence behaviour compared to its digital counterpart, while D-DSGD outperforms the digital schemes studied in [2], [16]. Moreover, we observe that the performance of A-DSGD degrades only slightly by introducing bias in the datasets across the devices; while the degradation in D-DSGD is larger, it still outperforms other digital approaches [2], [16] by a large margin. The performances of both A-DSGD and D-DSGD algorithms improve with the number of devices, keeping the total size of the dataset constant, despite the inherent resource sharing approach of D-DSGD. Furthermore, the performance of A-DSGD degrades only slightly even with a significant reduction in the available average transmit power. We argue that these benefits of A-DSGD are due to the inherent ability of analog communications to benefit from the signal-superposition characteristic of the wireless medium. Also, a reduction in the communication bandwidth of the MAC deteriorates the performance of D-DSGD much more compared to A-DSGD.

A similar over-the-air computation approach is also considered in parallel works [32], [33]. These works consider, respectively, SISO and SIMO fading MACs from the devices to the PS, and focus on aligning the gradient estimates received from different devices to have the same power at the PS to allow correct computation by performing power control and device selection. Our work is extended to a fading MAC model in [34]. The distinctive contributions of our work with respect to [32] and [33] are i) the introduction of a purely digital separate gradient compression and communication scheme; ii) the consideration of a bandwidth-constrained channel, which requires (digital or analog) compression of the gradient estimates; iii) error accumulation at the devices to improve the quality of gradient estimates by keeping track of the information lost due to compression; and iv) power allocation across iterations to dynamically adapt to the diminishing gradient variance. We also remark that both [32] and [33] consider transmitting model updates, while we focus on gradient transmission, which is more energy-efficient as each device transmits only the innovation obtained through gradient descent at that particular iteration (together with error accumulation), whereas model transmission wastes a significant portion of the transmit power by sending the already known previous model parameters from all the devices. Our work can easily be combined with the federated averaging algorithm in [6], where multiple SGD iterations are carried out locally before forwarding the model differences to the PS. We can also apply momentum correction [3], which improves the convergence speed of DSGD algorithms with communication delay.

C. Notations

R represents the set of real values. For a positive integer i, we let [i] ≜ {1, ..., i}, and 1_i denotes a column vector of length i with all entries 1. N(0, σ²) denotes a zero-mean normal distribution with variance σ². We denote the cardinality of set A by |A|, and the l₂ norm of vector x by ‖x‖. Also, we represent the l₂ induced norm of a rectangular matrix A by ‖A‖.

II. SYSTEM MODEL

We consider federated SGD at the wireless edge, where M wireless edge devices employ SGD with the help of a remote PS, to which they are connected through a noisy wireless MAC (see Fig. 1). Let B_m denote the set of data samples available at device m, m ∈ [M], and g_m(θ_t) ∈ R^d be the stochastic gradient computed by device m using local data samples.

Fig. 1: Illustration of the FL framework at the wireless edge.

At each iteration of the DSGD algorithm in (4), the local gradient estimates of the devices are sent to the PS over s uses of a Gaussian MAC, characterized by:

y(t) = Σ_{m=1}^M x_m(t) + z(t),   (5)

where x_m(t) ∈ R^s is the length-s channel input vector transmitted by device m at iteration t, y(t) ∈ R^s is the channel output received by the PS, and z(t) ∈ R^s is the independent additive white Gaussian noise (AWGN) vector with each entry independent and identically distributed (i.i.d.) according to N(0, σ²). Since we focus on DSGD, the channel input vector of device m at iteration t is a function of the current parameter vector θ_t and the local dataset B_m, and more specifically the current gradient estimate at device m, g_m(θ_t), m ∈ [M]. For a total of T iterations of the DSGD algorithm, the following average transmit power constraint is imposed per device:

(1/T) Σ_{t=1}^T ‖x_m(t)‖₂² ≤ P̄, ∀m ∈ [M],   (6)

averaged over the iterations of the DSGD algorithm. The goal is to recover the average of the local gradient estimates (1/M) Σ_{m=1}^M g_m(θ_t) at the PS, and update the model parameter as in (4). However, due to the pre-processing performed at each device and the noise added by the wireless channel, it is not possible to recover the average gradient perfectly at the PS; instead, it uses a noisy estimate to update the model parameter vector, i.e., we have θ_{t+1} = φ(θ_t, y(t)) for some update function φ: R^d × R^s → R^d. The updated model parameter is then multicast to the devices by the PS through an error-free shared link. We assume that the PS is not limited in power or bandwidth, so the devices receive a consistent global parameter vector for their computations in the next iteration.

The transmission of the local gradient estimates, g_m(θ_t), ∀m ∈ [M], to the PS with the goal of the PS reconstructing their average can be considered as a distributed function computation problem over a MAC [28]. We will consider both a digital approach treating computation and communication separately, and an analog approach that does not use any coding, and instead applies gradient sparsification followed by a linear transformation to compress the gradients, which are then transmitted simultaneously in an uncoded fashion.
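As a concrete illustration of the system model above, the following is a minimal numpy sketch of DSGD over a Gaussian MAC in the spirit of (4)–(6). The quadratic loss, the toy dimensions, and the single fixed power scaling known at the PS are illustrative assumptions of this sketch, not the paper's setup (the schemes in this paper convey scaling information explicitly and enforce the power constraint per iteration).

```python
import numpy as np

# Toy sketch of DSGD over a Gaussian MAC, cf. (4)-(6). The quadratic loss,
# dimensions, and the fixed scaling known at the PS are illustrative
# assumptions; the average power constraint (6) is not enforced exactly here.
rng = np.random.default_rng(0)
M, d, sigma, P_bar = 4, 20, 0.1, 10.0   # devices, model size, noise std, power budget
s = d                                    # assume enough bandwidth for uncoded transmission

X = [rng.standard_normal((64, d)) for _ in range(M)]   # local datasets B_m
y = [Xm @ np.ones(d) for Xm in X]                      # synthetic targets (true model: all ones)
theta = np.zeros(d)
scale = np.sqrt(P_bar)                                 # common scaling, known at the PS

for t in range(300):
    inputs = []
    for m in range(M):
        idx = rng.choice(64, 8, replace=False)         # mini-batch for g_m(theta_t)
        g = X[m][idx].T @ (X[m][idx] @ theta - y[m][idx]) / 8
        inputs.append(scale * g)                       # channel input x_m(t)
    received = sum(inputs) + sigma * rng.standard_normal(s)   # MAC output, eq. (5)
    theta -= 0.05 * received / (scale * M)             # noisy version of the update (4)
```

Because the MAC output is the superposition of all channel inputs, the PS obtains (a noisy, scaled version of) the sum of the gradients in a single shot, which is exactly the quantity needed in (4); this is the observation the analog scheme in Section IV builds on.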
We remark that we assume a simple channel model (Gaussian MAC) for the bandwidth-limited uplink transmission from the devices to the PS in this paper in order to highlight the main ideas behind the proposed digital and analog collaborative learning. The digital and analog approaches proposed in this paper can be extended to more complicated channel models, as has been done in the follow-up works [34]–[37].

III. DIGITAL DSGD (D-DSGD)

In this section, we present DSGD at the wireless network edge utilizing digital compression and transmission over the wireless MAC, referred to as the digital DSGD (D-DSGD) algorithm. Since we do not know the variances of the gradient estimates at different devices, we allocate the power equally among the devices, so that device m sends x_m(t) with power P_t, i.e., ‖x_m(t)‖₂² = P_t, where the P_t values are chosen to satisfy the average transmit power constraint over T iterations

(1/T) Σ_{t=1}^T P_t ≤ P̄.   (7)

Due to the intrinsic symmetry of the model, we assume that the devices transmit at the same rate at each iteration (while the rate may change across iterations depending on the allocated power, P_t). Accordingly, the total number of bits that can be transmitted from each of the devices over s uses of the Gaussian MAC, described in (5), is upper bounded by

R_t = (s/2M) log₂(1 + MP_t/(sσ²)),   (8)

where MP_t/s is the sum-power per channel use. Note that this is an upper bound since it is the Shannon capacity of the underlying Gaussian MAC, and we further assumed that the capacity can be achieved over a finite blocklength of s.

Remark 1. We note that having a distinct sum power MP_t at each iteration t enables each user to transmit different numbers of bits at different iterations. This corresponds to a novel gradient compression scheme for DSGD, in which the devices can adjust over time the amount of information they send to the PS about their gradient estimates. They can send more information bits at the beginning of the DSGD algorithm, when the gradient estimates have higher variances, and reduce the number of transmitted bits over time as the variance decreases. We observed empirically that this improves the performance compared to the standard approach in the literature, where the same compression scheme is applied at each iteration [21].

We will adopt the scheme proposed in [21] for gradient compression at each iteration of the DSGD scheme. We modify this scheme by allowing different numbers of bits to be transmitted by the devices at each iteration. At each iteration the devices sparsify their gradient estimates as described below. In order to retain the accuracy of their local gradient estimates, devices employ error accumulation [4], [38], where the accumulated error vector at device m until iteration t is denoted by Δ_m(t) ∈ R^d, where we set Δ_m(0) = 0, ∀m ∈ [M]. Hence, after the computation of the local gradient estimate for parameter vector θ_t, i.e., g_m(θ_t), device m updates its estimate with the accumulated error as g_m(θ_t) + Δ_m(t), ∀m ∈ [M]. At iteration t, device m, m ∈ [M], sets all but the highest q_t and the smallest q_t of the entries of its gradient estimate vector g_m(θ_t) + Δ_m(t), of dimension d, to zero, where q_t ≤ d/2 (to have a communication-efficient scheme, in practice, the goal is to have q_t ≪ d, ∀t). Then, it computes the mean values of all the remaining positive entries and all the remaining negative entries, denoted by μ_m⁺(t) and μ_m⁻(t), respectively. If μ_m⁺(t) > |μ_m⁻(t)|, then it sets all the entries with negative values to zero and all the entries with positive values to μ_m⁺(t), and vice versa. We denote the resulting sparse vector at device m by g^q_m(θ_t), and device m updates the local accumulated error vector as Δ_m(t+1) = g_m(θ_t) + Δ_m(t) − g^q_m(θ_t), ∀m ∈ [M]. It then aims to send g^q_m(θ_t) over the channel by transmitting its mean value and the positions of its non-zero entries. For this purpose, we use a 32-bit representation of the absolute value of the mean (either μ_m⁺(t) or |μ_m⁻(t)|) along with 1 bit indicating its sign. To send the positions of the non-zero entries, it is assumed in [21] that the distribution of the distances between the non-zero entries is geometric with success probability q_t, which allows them to use Golomb encoding to send these distances with a total number of bits b* + 1/(1 − (1 − q_t)^{2^{b*}}), where b* = 1 + ⌊log₂(log((√5 − 1)/2)/log(1 − q_t))⌋. However, we argue that sending log₂(d choose q_t) bits to transmit the positions of the non-zero entries is sufficient regardless of the distribution of the positions. This can be achieved by simply enumerating all possible sparsity patterns. Thus, with D-DSGD, the total number of bits sent by each device at iteration t is given by

r_t = log₂(d choose q_t) + 33,   (9)

where q_t is chosen as the highest integer satisfying r_t ≤ R_t.

We will also study the impact of introducing more devices into the system. With the reducing cost of sensing and computing devices, we can consider introducing more devices, each coming with its own power resources. We assume that the size of the total dataset remains constant, which allows each sensor to save computation time and energy. Assume that the number of devices is increased by a factor κ > 1. We assume that the total power consumed by the devices at iteration t also increases by factor κ. We can see that the maximum number of bits that can be sent by each device is strictly smaller in the new system. This means that the PS receives a less accurate gradient estimate from each device, but from more devices. Numerical results for the D-DSGD scheme and its comparison to analog transmission are relegated to Section VI.

Remark 2. We have established a framework for digital transmission of the gradients over a band-limited wireless channel, where other gradient quantization techniques can also be utilized. In Section VI, we will compare the performance of the proposed D-DSGD scheme with that of sign DSGD (S-DSGD) [16] and quantized DSGD (Q-DSGD) [2], adopted for digital transmission over a limited capacity MAC.

IV. ANALOG DSGD (A-DSGD)

Next, we propose an analog DSGD algorithm, called A-DSGD, which does not employ any digital code, either for compression or channel coding; instead, all the devices transmit their gradient estimates simultaneously in an uncoded manner. This is motivated by the fact that the PS is not interested in the individual gradients, but only in their average. The underlying wireless MAC provides the sum of the gradients, which is used by the PS to update the parameter vector. See Algorithm 1 for a description of the A-DSGD scheme.

Algorithm 1 A-DSGD
1: Initialize θ_0 = 0 and Δ_1(0) = ··· = Δ_M(0) = 0
2: for t = 0, ..., T − 1 do
   • The devices do:
3:   for m = 1, ..., M in parallel do
4:     Compute g_m(θ_t) with respect to B_m
5:     g^ec_m(θ_t) = g_m(θ_t) + Δ_m(t)
6:     g^sp_m(θ_t) = sp_k(g^ec_m(θ_t))
7:     Δ_m(t+1) = g^ec_m(θ_t) − g^sp_m(θ_t)
8:     g̃_m(θ_t) = A_{s−1} g^sp_m(θ_t)
9:     x_m(t) = [√(α_m(t)) g̃_m(θ_t)^T, √(α_m(t))]^T
10:   end for
   • The PS does:
11:   ĝ(θ_t) = AMP_{A_{s−1}}(y^{s−1}(t)/y_s(t))
12:   θ_{t+1} = θ_t − η_t · ĝ(θ_t)
13: end for

Similarly to D-DSGD, devices employ local error accumulation. Hence, after computing the local gradient estimate for parameter vector θ_t, each device updates its estimate with the accumulated error as g^ec_m(θ_t) ≜ g_m(θ_t) + Δ_m(t), m ∈ [M].

The challenge in the analog transmission approach is to compress the gradient vectors to the available channel bandwidth. In many modern ML applications, such as deep neural networks, the parameter vector, and hence the gradient vectors, have extremely large dimensions, whereas the channel bandwidth, measured by parameter s, is small due to the bandwidth and latency limitations. Thus, transmitting all the model parameters one-by-one in an uncoded/analog fashion is not possible as we typically have d ≫ s.

Lossy compression at any required level is at least theoretically possible in the digital domain. For the analog scheme, in order to reduce the dimension of the gradient vector to that of the channel, the devices apply gradient sparsification. In particular, device m sets all but the k elements of the error-compensated vector g^ec_m(θ_t) with the highest magnitudes to zero. We denote the sparse vector at device m by g^sp_m(θ_t), m ∈ [M]. This k-level sparsification is represented by the function sp_k in Algorithm 1, i.e., g^sp_m(θ_t) = sp_k(g^ec_m(θ_t)). The accumulated error at device m, m ∈ [M], which is the difference between g^ec_m(θ_t) and its sparsified version, i.e., it maintains those elements of vector g^ec_m(θ_t) which are set to zero as the result of sparsification at iteration t, is then updated according to

Δ_m(t+1) = g^ec_m(θ_t) − g^sp_m(θ_t) = g_m(θ_t) + Δ_m(t) − sp_k(g_m(θ_t) + Δ_m(t)).   (10)

We would like to transmit only the non-zero entries of these sparse vectors. However, simply ignoring the zero elements would require transmitting their indices to the PS separately. To avoid this additional data transmission, we will employ a random projection matrix, similarly to compressive sensing. A similar idea was recently used in [30] for analog image transmission over a bandwidth-limited channel.

A pseudo-random matrix A_s̃ ∈ R^{s̃×d}, for some s̃ ≤ s, with each entry i.i.d. according to N(0, 1/s̃), is generated and shared between the PS and the devices before starting the computations. At iteration t, device m computes g̃_m(θ_t) ≜ A_s̃ g^sp_m(θ_t) ∈ R^s̃, and transmits x_m(t) = [√(α_m(t)) g̃_m(θ_t)^T, a_m(t)^T]^T, where a_m(t) ∈ R^{s−s̃}, over the MAC, m ∈ [M], while satisfying the average power constraint (6). The PS receives

y(t) = Σ_{m=1}^M x_m(t) + z(t) = [A_s̃ Σ_{m=1}^M √(α_m(t)) g^sp_m(θ_t); Σ_{m=1}^M a_m(t)] + z(t).   (11)

In the following, we propose a power allocation scheme designing the scaling coefficients α_m(t), ∀m, t.

We set s̃ = s − 1, which requires s ≥ 2. At iteration t, we set a_m(t) = √(α_m(t)), and device m computes g̃_m(θ_t) = A_{s−1} g^sp_m(θ_t), and sends the vector x_m(t) = [√(α_m(t)) g̃_m(θ_t)^T, √(α_m(t))]^T with the same power P_t = ‖x_m(t)‖₂², satisfying the average power constraint (1/T) Σ_{t=1}^T P_t ≤ P̄, for m ∈ [M]. Accordingly, the scaling factor α_m(t) is determined to satisfy

P_t = α_m(t)(‖g̃_m(θ_t)‖₂² + 1),   (12)

which yields

α_m(t) = P_t/(‖g̃_m(θ_t)‖₂² + 1), for m ∈ [M].   (13)

Since ‖g̃_m(θ_t)‖₂² may vary across devices, so can α_m(t). That is why, at iteration t, device m allocates one channel use to provide the value of √(α_m(t)) to the PS along with its scaled low-dimensional gradient vector g̃_m(θ_t), m ∈ [M]. Accordingly, the received vector at the PS is given by

y(θ_t) = [A_{s−1} Σ_{m=1}^M √(α_m(t)) g^sp_m(θ_t); Σ_{m=1}^M √(α_m(t))] + z(t),   (14)

where α_m(t) is given by (13). For i ∈ [s], we define

y^i(t) ≜ [y_1(t), y_2(t), ..., y_i(t)]^T,   (15)
z^i(t) ≜ [z_1(t), z_2(t), ..., z_i(t)]^T,   (16)

where y_j(t) and z_j(t) denote the j-th entries of y(t) and z(t), respectively. Thus, we have

y^{s−1}(t) = A_{s−1} Σ_{m=1}^M √(α_m(t)) g^sp_m(θ_t) + z^{s−1}(t),   (17a)
y_s(t) = Σ_{m=1}^M √(α_m(t)) + z_s(t).   (17b)
Note that the goal is to recover (1/M) Σ_{m=1}^M g^sp_m(θ_t) at the PS, while, from y^{s−1}(t) given in (17a), the PS observes a noisy version of the weighted sum Σ_{m=1}^M √(α_m(t)) g^sp_m(θ_t) projected into a low-dimensional vector through A_{s−1}. According to (13), each value of ‖g̃_m(θ_t)‖₂² results in a distinct scaling factor α_m(t). However, for large enough d and |B_m|, the values of ‖g̃_m(θ_t)‖₂², ∀m ∈ [M], are not going to be too different across devices. As a result, the scaling factors α_m(t), ∀m ∈ [M], are not going to be very different either. Accordingly, to diminish the effect of the scaled gradient vectors, we choose to scale down the received vector y^{s−1}(t) at the PS, given in (17a), with the sum of the scaling factors, i.e., Σ_{m=1}^M √(α_m(t)), whose noisy version is received by the PS as y_s(t), given in (17b). The resulting scaled vector at the PS is

y^{s−1}(t)/y_s(t) = A_{s−1} Σ_{m=1}^M (√(α_m(t))/(Σ_{i=1}^M √(α_i(t)) + z_s(t))) g^sp_m(θ_t) + (1/(Σ_{i=1}^M √(α_i(t)) + z_s(t))) z^{s−1}(t),   (18)

where α_m(t), m ∈ [M], is given in (13). By our choice, the PS tries to recover (1/M) Σ_{m=1}^M g^sp_m(θ_t) from y^{s−1}(t)/y_s(t), knowing the measurement matrix A_{s−1}. The PS estimates ĝ(θ_t) using the AMP algorithm. The estimate ĝ(θ_t) is then used to update the model parameter as θ_{t+1} = θ_t − η_t · ĝ(θ_t).

Remark 3. We remark here that, with SGD, the empirical variance of the stochastic gradient vectors reduces over time, approaching zero asymptotically. The power should be allocated over iterations taking into account this decaying behaviour of the gradient variance, while making sure that the noise term does not become dominant. To reduce the variation in the scaling factors α_m(t), ∀m, t, variance reduction techniques can be used [39]. We also note that setting P_t = P̄, ∀t, results in a special case, where the power is allocated uniformly over time to be resistant against the noise term.

Remark 4. Increasing M can help increase the convergence speed of A-DSGD. This is due to the fact that having more signals superposed over the MAC leads to more robust transmission against noise, particularly when the ratio P̄/(sσ²) is relatively small, as we will observe in Fig. 6. Also, for a larger M value, (1/M) Σ_{m=1}^M g^sp_m(θ_t) provides a better estimate of (1/M) Σ_{m=1}^M g_m(θ_t), and receiving information from a larger number of devices can make these estimates more reliable.

Remark 5. In the proposed A-DSGD algorithm, the sparsification level, k, results in a trade-off. For a relatively small value of k, (1/M) Σ_{m=1}^M g^sp_m(θ_t) can be more reliably recovered from (1/M) Σ_{m=1}^M g̃_m(θ_t); however, it may not provide an accurate estimate of the actual average gradient (1/M) Σ_{m=1}^M g_m(θ_t). Whereas, with a higher k value, (1/M) Σ_{m=1}^M g^sp_m(θ_t) provides a better estimate of (1/M) Σ_{m=1}^M g_m(θ_t), but reliable recovery of (1/M) Σ_{m=1}^M g^sp_m(θ_t) from the vector (1/M) Σ_{m=1}^M g̃_m(θ_t) is less likely.

A. Mean-Removal for Efficient Transmission

To have a more efficient usage of the available power, each device can remove the mean value of its gradient estimate before scaling and sending it. We define the mean value of g̃_m(θ_t) = A_s̃ g^sp_m(θ_t) as μ_m(t) ≜ (1/s̃) Σ_{i=1}^s̃ g̃_{m,i}(θ_t), for m ∈ [M], where g̃_{m,i}(θ_t) is the i-th entry of vector g̃_m(θ_t), i ∈ [s̃]. We also define g̃^az_m(θ_t) ≜ g̃_m(θ_t) − μ_m(t) 1_s̃, m ∈ [M]. The power of vector g̃^az_m(θ_t) is given by

‖g̃^az_m(θ_t)‖₂² = ‖g̃_m(θ_t)‖₂² − s̃ μ_m²(t).   (19)

We set s̃ = s − 2, which requires s ≥ 3. We also set a_m(t) = [√(α^az_m(t)) μ_m(t), √(α^az_m(t))]^T, and after computing g̃_m(θ_t) = A_{s−2} g^sp_m(θ_t), device m, m ∈ [M], sends

x^az_m(t) = [√(α^az_m(t)) g̃^az_m(θ_t)^T, √(α^az_m(t)) μ_m(t), √(α^az_m(t))]^T,   (20)

with power

‖x^az_m(t)‖₂² = α^az_m(t)(‖g̃_m(θ_t)‖₂² − (s − 3) μ_m²(t) + 1),   (21)

which is chosen to be equal to P_t, such that (1/T) Σ_{t=1}^T P_t ≤ P̄. Thus, we have, for m ∈ [M],

α^az_m(t) = P_t/(‖g̃_m(θ_t)‖₂² − (s − 3) μ_m²(t) + 1).   (22)

Compared to the transmit power given in (12), we observe that removing the mean reduces the transmit power by α^az_m(t)(s − 3) μ_m²(t), m ∈ [M]. The received vector at the PS is given by

y(θ_t) = [Σ_{m=1}^M √(α^az_m(t)) (A_{s−2} g^sp_m(θ_t) − μ_m(t) 1_{s−2}); Σ_{m=1}^M √(α^az_m(t)) μ_m(t); Σ_{m=1}^M √(α^az_m(t))] + z(t),   (23)

where we have

y^{s−2}(t) = A_{s−2} Σ_{m=1}^M √(α^az_m(t)) g^sp_m(θ_t) − Σ_{m=1}^M √(α^az_m(t)) μ_m(t) 1_{s−2} + z^{s−2}(t),   (24a)
y_{s−1}(t) = Σ_{m=1}^M √(α^az_m(t)) μ_m(t) + z_{s−1}(t),   (24b)
y_s(t) = Σ_{m=1}^M √(α^az_m(t)) + z_s(t).   (24c)

The PS performs AMP to recover (1/M) Σ_{m=1}^M g^sp_m(θ_t) from the following vector:

(1/y_s(t))(y^{s−2}(t) + y_{s−1}(t) 1_{s−2}) = A_{s−2} Σ_{m=1}^M (√(α^az_m(t))/(Σ_{i=1}^M √(α^az_i(t)) + z_s(t))) g^sp_m(θ_t) + (z_{s−1}(t) 1_{s−2} + z^{s−2}(t))/(Σ_{i=1}^M √(α^az_i(t)) + z_s(t)).   (25)

V. CONVERGENCE ANALYSIS OF A-DSGD ALGORITHM

In this section we provide the convergence analysis of A-DSGD presented in Algorithm 1. For simplicity, we assume that η_t = η, ∀t. We consider a differentiable and c-strongly convex loss function F, i.e., ∀x, y ∈ R^d, it satisfies

F(y) − F(x) ≥ ∇F(x)^T (y − x) + (c/2) ‖y − x‖².   (26)

A. Preliminaries

We first present some preliminaries and background for the essential instruments of the convergence analysis.

Assumption 1. The average of the first moment of the gradients at different devices is bounded as follows:

(1/M) Σ_{m=1}^M E[‖g_m(θ_t)‖] ≤ G, ∀t,   (27)

where the expectation is over the randomness of the gradients.

Assumption 2. For the scaling factors α_m(t), ∀m ∈ [M], given in (13), we approximate √(α_m(t))/(Σ_{i=1}^M √(α_i(t)) + z_s(t)) ≈ 1/M, ∀m, t.

The rationale behind Assumption 2 is assuming that the noise term is small compared to the sum of the scaling factors, Σ_{i=1}^M √(α_i(t)), and that the l₂ norms of the gradient vectors at different devices are not highly skewed. Accordingly, the model parameter vector in Algorithm 1 is updated as follows:

θ_{t+1} = θ_t − η · AMP_{A_{s−1}}(A_{s−1} (1/M) Σ_{m=1}^M g^sp_m(θ_t) + (1/(Σ_{i=1}^M √(α_i(t)))) z^{s−1}(t)).   (28)

Lemma 1 ([40]–[42]). Consider reconstructing a vector x ∈ R^d with sparsity level k and ‖x‖₂² = P from a noisy observation y = A_s x + z, where A_s ∈ R^{s×d} is the measurement matrix, and z ∈ R^s is the noise vector with each entry i.i.d. with N(0, σ²). If s > k, the AMP algorithm reconstructs

x̂ = AMP_{A_s}(y) = x + σ_τ ω,   (29)

where each entry of ω ∈ R^d is i.i.d. with N(0, 1), and σ_τ² decreases monotonically from σ² + P to σ². That is, the noisy observation y is effectively transformed into x̂ = x + σ_τ ω.

Lemma 2. Consider a random vector u ∈ R^d with each entry i.i.d. according to N(0, σ_u²). For δ ∈ (0, 1), we have Pr{‖u‖ ≥ σ_u ρ(δ)} = δ, where we define ρ(δ) ≜ (2 γ⁻¹(Γ(d/2)(1 − δ), d/2))^{1/2}, and γ(a, x) is the lower incomplete gamma function defined as

γ(a, x) ≜ ∫₀^x ι^{a−1} e^{−ι} dι, x ≥ 0, a > 0,   (32)

and γ⁻¹(a, y) is its inverse, which returns the value of x such that γ(a, x) = y, and Γ(a) is the gamma function defined as

Γ(a) ≜ ∫₀^∞ ι^{a−1} e^{−ι} dι, a > 0.   (33)

Proof. We highlight that ‖u‖² follows the Chi-square distribution, and we have [43]

Pr{‖u‖² ≤ u} = γ(d/2, u/(2σ_u²))/Γ(d/2).   (34)

Accordingly, it follows that

Pr{‖u‖ ≥ σ_u (2 γ⁻¹(Γ(d/2)(1 − δ), d/2))^{1/2}} = Pr{‖u‖² ≥ 2 σ_u² γ⁻¹(Γ(d/2)(1 − δ), d/2)} = δ.   (35)

By choosing δ appropriately, we can guarantee, with probability arbitrarily close to 1, that ‖u‖ ≤ σ_u ρ(δ).

Lemma 3. With σ_ω(t) defined as in (31), by taking the expectation with respect to the gradients, we have, ∀t,

E[σ_ω(t)] ≤ σ_max (((1 − λ^{t+1})/(1 − λ)) G/(M √P_t) + 1),   (36)

where σ_max ≜ √(d/(s − 1)) + 1.

Proof. See Appendix A.

Lemma 4. Based on the results presented above, taking the expectation with respect to the gradients yields
Assumption 3. We assume that the sparsity pattern of vector M E 1 sp M sp η (gm (θt) gm (θt)) σω(t)w(t) ηv(t), g (θt) is smaller than s 1, t. M m=1 − − ≤ m=1 m − ∀ X (37a) sp PIf all the sparse vectors gm (θt), m [M], have the same sparsity pattern, Assumption 3 holds∀ trivially∈ since we set k< where we define, t 0, ∀ ≥ s 1. On the other hand, in general we can guarantee that (1 + λ)(1 λt) Assumption− 3 holds by setting k s. v(t) ,λ − +1 G ≪ 1 λ According to Lemma 1 and Assumption 3, the model − σ 1 λt+1 parameter update given in (28) can be rewritten as + ρ(δ) σmax − G +1 , (37b) M√Pt 1 λ M 1 sp − θt+1 = θt η g (θt)+ σω(t)w(t) , (30) Proof. See Appendix C. − M m=1 m X where we define , σ B. Convergence Rate σω(t) M , (31) m=1 αm(t) Let θ∗ be the optimum parameter vector minimizing the loss , 2 and each entry of ω(t) RPd is i.i.d.p with (0, 1). function F . For ε> 0, we define θ : θ θ∗ ε as ∈ N the success region to which the parameterS { k vector− shouldk ≤ con-} Rd Corollary 1. For vector x , it is easy to verify that verge. We establish the convergence using rate supermartingale ∈ d k x sp (x) λ x , where λ , − , and the equality stochastic processes (refer to [23], [44] for a detailed definition k − k k ≤ k k d holds if all the entries of x have theq same magnitude. of the supermartingale and rate supermartingale processes). 8
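Corollary 1's sparsification bound is easy to check numerically. The sketch below is our own illustration, not from the paper; the helper `sparsify_top_k` stands in for the sp_k(·) operator. It verifies ‖x − sp_k(x)‖ ≤ λ‖x‖ with λ = √((d−k)/d), including the equality case of equal-magnitude entries:

```python
import numpy as np

def sparsify_top_k(x, k):
    """Stand-in for sp_k(.): keep the k largest-magnitude entries, zero the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

rng = np.random.default_rng(0)
d, k = 1000, 100
lam = np.sqrt((d - k) / d)

x = rng.standard_normal(d)
err = np.linalg.norm(x - sparsify_top_k(x, k))
assert err <= lam * np.linalg.norm(x)          # Corollary 1 bound

x_eq = np.ones(d)                              # equal-magnitude entries
err_eq = np.linalg.norm(x_eq - sparsify_top_k(x_eq, k))
assert np.isclose(err_eq, lam * np.linalg.norm(x_eq))   # equality case
```

For a Gaussian vector the top-k residual is typically far below the worst-case bound, which is attained only when all entries carry equal energy.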
[Figure] Fig. 2: Test accuracy of different schemes for IID and non-IID data distribution scenarios with M = 25, B = 1000, P̄ = 500, s = d/2, k = s/2, and P_t = P̄, ∀t. (a) IID data distribution; (b) non-IID data distribution. Each panel plots test accuracy versus iteration count t for the error-free shared link baseline, A-DSGD, D-DSGD, SignSGD, and QSGD.
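As a toy illustration of the over-the-air aggregation underlying the A-DSGD curves in Fig. 2, the following sketch (ours; it simplifies the power allocation to α_m = P/‖g̃_m‖² and omits the AMP decoding step) superposes M scaled, sparsified, projected gradients over a Gaussian MAC and normalizes by the noisy sum of scaling factors, in the spirit of (18):

```python
import numpy as np

rng = np.random.default_rng(1)
M, d, s, k, P = 10, 200, 100, 20, 500.0

A = rng.standard_normal((s, d)) / np.sqrt(s)   # shared pseudo-random projection
grads = rng.standard_normal((M, d))            # stand-in local gradient estimates

def top_k(x, k):
    """Keep the k largest-magnitude entries of x, zero the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

g_sp = np.stack([top_k(g, k) for g in grads])  # sparsified gradients
g_tilde = g_sp @ A.T                           # low-dimensional projections
alpha = P / np.sum(g_tilde**2, axis=1)         # simplified power scaling (cf. (12)-(13))

# superposition over the Gaussian MAC (unit noise variance), plus the noisy
# sum of the scaling factors sent on the extra channel use
y = (np.sqrt(alpha)[:, None] * g_tilde).sum(axis=0) + rng.standard_normal(s)
y_s = np.sqrt(alpha).sum() + rng.standard_normal()

est = y / y_s                                  # scaled vector at the PS, cf. (18)
target = A @ g_sp.mean(axis=0)                 # what the PS aims to recover
rel_err = np.linalg.norm(est - target) / np.linalg.norm(target)
assert rel_err < 0.5                           # close, up to channel noise
```

With this power budget the scaled received vector already tracks A times the average sparse gradient; in A-DSGD proper, the PS would additionally run AMP on this vector to undo the random projection.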
Definition 1. [44] For any stochastic process, a non-negative process W_t : R^{d×(t+1)} → R with horizon B is a rate supermartingale if i) for any sequence θ_t, …, θ_0 ∈ R^d and any t ≤ B, E[W_{t+1}(θ_{t+1}, …, θ_0)] ≤ W_t(θ_t, …, θ_0); and ii) for all T ≤ B and any sequence θ_T, …, θ_0 ∈ R^d, if the algorithm does not enter the success region by time T, i.e., if θ_t ∉ S, ∀t ≤ T, it must satisfy W_T(θ_T, …, θ_0) ≥ T.

Next we present a rate supermartingale process for the stochastic process with the update in (4), introduced in [44].

Statement 1. Consider the stochastic process in (4) with learning rate η < 2cε/G². We define the process W_t by

  W_t(θ_t, …, θ_0) ≜ [ε / (2ηcε − η²G²)] log( e ‖θ_t − θ*‖² / ε ) + t,   (38)

if the algorithm has not entered the success region by iteration t, i.e., if θ_i ∉ S, ∀i ≤ t, and W_t(θ_t, …, θ_0) = W_{τ_1−1}(θ_{τ_1−1}, …, θ_0) otherwise, where τ_1 is the smallest index for which θ_{τ_1} ∈ S. The process W_t is a rate supermartingale with horizon B = ∞. It is also L-Lipschitz in its first coordinate, where L ≜ 2√ε (2ηcε − η²G²)^{-1}; that is, given any x, y ∈ R^d, t ≥ 1, and any sequence θ_{t−1}, …, θ_0, we have

  ‖W_t(y, θ_{t−1}, …, θ_0) − W_t(x, θ_{t−1}, …, θ_0)‖ ≤ L ‖y − x‖.   (39)

Theorem 1. Under the assumptions described above, for

  η < ( cεT − √ε Σ_{t=0}^{T−1} v(t) ) / (T G²),   (40)

the probability that the A-DSGD algorithm does not enter the success region by time T is bounded by

  Pr{E_T} ≤ [ε / ( (2ηcε − η²G²) ( T − ηL Σ_{t=0}^{T−1} v(t) ) )] log( e ‖θ*‖² / ε ),   (41)

where E_T denotes the event of not arriving at the success region at time T.

Proof. See Appendix D.

We highlight that the reduction term Σ_{t=0}^{T−1} v(t) in the denominator is due to the sparsification, the random projection (compression), and the noise introduced by the wireless channel. To show that Pr{E_T} → 0 asymptotically, we consider the special case P_t = P̄, in which

  Σ_{t=0}^{T−1} v(t) = T ( 2λG/(1−λ) + [σρ(δ)/(M√P̄)] ( σ_max G/(1−λ) + 1 ) )
                      − ( λ(1+λ)(1−λ^T)G/(1−λ)² + σρ(δ)σ_max(1−λ^{T+1})G / (M√P̄ (1−λ)²) ).   (42)

After replacing Σ_{t=0}^{T−1} v(t) in (41), it is easy to verify that, for η bounded as in (40), Pr{E_T} → 0 as T → ∞.

VI. EXPERIMENTS

Here we evaluate the performance of A-DSGD and D-DSGD for the task of image classification. We run experiments on the MNIST dataset [45], with N = 60000 training and 10000 test data samples, and train a single-layer neural network with d = 7850 parameters using the ADAM optimizer [46]. We set the channel noise variance to σ² = 1. The performance is measured as the accuracy with respect to the test dataset, called the test accuracy, versus the iteration count t.

We consider two scenarios for the data distribution across devices: in IID data distribution, B training data samples, selected at random, are assigned to each device at the beginning of training; while in non-IID data distribution, each device has access to B training data samples, where B/2 of the training data samples are selected at random from only one class/label; that is, for each device, we first select two classes/labels at random, and then randomly select B/2 data samples from each of the two classes/labels. At each iteration, devices use all the B local data samples to compute their gradient estimates, i.e., the batch size is equal to the size of the local datasets.

When performing A-DSGD, we use the mean-removal power allocation scheme only for the first 20 iterations. The reason
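The non-IID assignment described above (two random classes per device, B/2 samples from each) can be sketched as follows; this is our own illustration, with toy labels standing in for the 60000 MNIST training labels:

```python
import numpy as np

def non_iid_split(labels, num_devices, B, num_classes=10, seed=0):
    """Give each device B samples: B/2 from each of two randomly chosen classes."""
    rng = np.random.default_rng(seed)
    by_class = [np.flatnonzero(labels == c) for c in range(num_classes)]
    splits = []
    for _ in range(num_devices):
        c1, c2 = rng.choice(num_classes, size=2, replace=False)
        idx = np.concatenate([rng.choice(by_class[c1], B // 2, replace=False),
                              rng.choice(by_class[c2], B // 2, replace=False)])
        splits.append(idx)
    return splits

# toy labels: 10 classes, 200 samples each
labels = np.repeat(np.arange(10), 200)
splits = non_iid_split(labels, num_devices=5, B=100)
for idx in splits:
    assert len(idx) == 100
    assert len(np.unique(labels[idx])) == 2    # exactly two classes per device
```

Each device then uses all B of its local samples per iteration, matching the full-batch setting of the experiments.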