This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2020.3033286, IEEE Transactions on Pattern Analysis and Machine Intelligence

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (SUBMITTED OCTOBER 20, 2020) 1

Lazily Aggregated Quantized Gradient Innovation for Communication-Efficient Federated Learning

Jun Sun, Student Member, IEEE, Tianyi Chen, Member, IEEE, Georgios B. Giannakis, Fellow, IEEE, Qinmin Yang, Senior Member, IEEE, and Zaiyue Yang, Member, IEEE

Abstract—This paper focuses on communication-efficient federated learning problem, and develops a novel distributed quantized gradient approach, which is characterized by adaptive communications of the quantized gradients. Specifically, the federated learning builds upon the server-worker infrastructure, where the workers calculate local gradients and upload them to the server; then the server obtain the global gradient by aggregating all the local gradients and utilizes it to update the model parameter. The key idea to save communications from the worker to the server is to quantize gradients as well as skip less informative quantized gradient communications by reusing previous gradients. Quantizing and skipping result in ‘lazy’ worker-server communications, which justifies the term Lazily Aggregated Quantized (LAQ) gradient. Theoretically, the LAQ algorithm achieves the same linear convergence as the in the strongly convex case, while effecting major savings in the communication in terms of transmitted bits and communication rounds. Empirically, extensive experiments using realistic data corroborate a significant communication reduction compared with state-of-the-art gradient- and stochastic gradient-based algorithms.

Index Terms—federated learning, communication-efficient, gradient innovation, quantization !

1 INTRODUCTION exploiting the distributed computing resources to speed up RAINING today’s functions (models) the training of machine learning models [5, 6]. T relies on an enormous amount of data collected by In this context, communication-efficient federated learn- a massive number of mobile devices. This comes with ing methods have gained popularity recently [3]. Most substantial computational cost, and raises serious privacy methods build on simple gradient updates, and are cen- concerns when the training is centralized. In addition to tered around gradient compression to save communication, cloud computing, these considerations drive the vision that including gradient quantization and sparsification, as outlined future machine learning tasks must be performed to the in the following overview of the prior art in this area. extent possible in a distributed fashion at the network edge, 1.1 Prior art namely devices [2]. When distributed learning is carried out in a server- Quantization. Today’s computers usually utilize 32 or 64 worker setup with possibly heterogeneous devices and bits to quantize the floating point number, which is assumed datasets as well as privacy considerations, it is referred to to be accurate enough in most algorithms. By quantization as federated learning [3, 4]. The server updates the learning in this paper, we mean fewer bits are employed. Toward the parameters utilizing the information (usually gradients) goal of reducing communications, quantization compresses collected from local workers, and then broadcasts the pa- transmitted information (e.g., the gradient) by limiting the rameters to workers. In this setup, the server obtains the number of bits that represent floating point numbers, and aggregate information without requesting the raw data– has been successfully applied to several engineering tasks what also respects privacy and mitigates the computation employing wireless sensor networks [7]. In the context of burden at the server. Such a learning paradigm however, distributed machine learning, a 1-bit binary quantization incurs communication overhead that does not scale with scheme has been proposed in [8, 9]. Multi-bit quantization the number of workers. This is aggravated in , methods have been developed in [10, 11], where an ad- which involves high-dimensional learning parameters. In justable quantization level endows flexibility to balance the fact, communication delay has become a bottleneck for fully communication cost and the convergence rate. Other variants of quantized gradient schemes include ternary quantization Part of the results in this paper has been presented in Proc. of Neural [12], variance-reduced quantization [13], error compensation Information Processing, Vancouver, Canada, December 8-14, 2019 [1]. Jun Sun and Qinmin Yang are with college of Control Science and Engineering, [14], and gradient difference quantizaiton [11, 15]; and it the State Key Laboratory of Industrial Control Technology, Zhejiang University, is shown in [11] that the linear convergence rate can be Hangzhou, China. maintained with gradient difference quantization. Tianyi Chen is with Department of Electrical, Computer, and Systems Sparsification. Sparsification amounts to discarding some Engineering, Rensselaer Polytechnic Institute, Troy, NY 12180, USA. Georgios B. 
Giannakis is with the Department of Electrical and Computer entries of the gradient and the most straightforward scheme Engineering and the Digital Technology Center, University of Minnesota, is to transmit only gradient components with large enough Minneapolis, MN 55455, USA. magnitudes [16]. Surprisingly, the desired accuracy can be Zaiyue Yang is with the Department of Mechanical and Energy Engineering, attained even with 99% of the gradients being dropped in Southern University of Science and Technology, Shenzhen, China. Corresponding authors: Zaiyue Yang, email: [email protected], and some cases [17]. To reduce information losses, gradient com- Qinmin Yang, email: [email protected]. ponents with small values are accumulated and then applied

0162-8828 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. Authorized licensed use limited to: University of Minnesota. Downloaded on December 19,2020 at 16:23:34 UTC from IEEE Xplore. Restrictions apply. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2020.3033286, IEEE Transactions on Pattern Analysis and Machine Intelligence

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (SUBMITTED OCTOBER 20, 2020) 2

when they exceed a certain threshold [18]. The accumulated the server; upon receiving all the local gradients, the server gradient offers variance reduction for the sparsified stochastic then updates the parameter vector following the GD iteration gradient descent (SGD) iterates [19]. With its impressive k+1 k X k GD: θ = θ α fm θ (2) empirical performance granted, apart from recent efforts [20], − ∇ deterministic sparsification schemes fall short in performance m∈M guarantees. Their randomized counterparts though come where superscript k indexes the iteration, α is the stepsize, k P k with the so-termed unbiased sparsification, which provably and ∇f(θ ) = m∈M ∇fm(θ ) is the aggregated (or, global) offers convergence guarantees [21, 22]. gradient. It is clear that to implement (2), the server has to Quantization and sparsification can also be employed communicate with all workers to obtain ‘fresh’ gradients k M jointly [3, 23, 24]. Nevertheless, they both introduce noise to {∇fm θ }m=1. In several settings though, communication (S)GD updates, which deteriorates convergence in general. is much slower than computation [5]. Thus, as the number of For problems with strongly convex losses, gradient compres- workers grows, worker-server communications become the sion algorithms either converge linearly to the neighborhood bottleneck [35]. This becomes more challenging when adopt- of the optimal solution, or, converge at sublinear rate. The ing popular deep learning models with high-dimensional exception is [11], which only focuses on reducing the required parameters, and correspondingly large gradients. This clearly bits per communication, but not the total number of rounds. prompts the need for communication-eficient learning. However, for message exchanging, e.g., the p-dimensional Before introducing our communication-efficient learning model parameter or the gradient, other latencies, such as, ini- approach, we revisit the canonical form of popular quantized tiating communication links, queueing, and propagating the (Q) GD methods [8]-[15] in the simple setup of (1) with one message, can be comparable to the message size-dependent server and M workers: transmission latency [25]. This motivates saving the number k+1 k X k QGD: θ = θ α Q θ (3) of communication rounds, sometimes even more than the m − m∈M bits per round. k Apart from the aforementioned gradient compression where Qm θ is the quantized gradient that coarsely ap- k approaches, communication-efficient schemes aiming to proximates the local gradient ∇fm(θ ). While the exact reduce communication rounds have been developed by quantization scheme varies across algorithms, transmitting k leveraging higher-order information [26, 27], periodic ag- Qm θ generally requires fewer bits than transmitting k gregation [4, 28, 29], and recently by adaptive aggregation its accurate counterpart ∇fm(θ ). Similar to GD however, k [30, 31, 32]; see also [33] for a lower bound on the number of only when all the local quantized gradients {Qm θ } are communication rounds. However, simultaneously saving collected, the server can update θ. communication bits and rounds without sacrificing the In this context, the present paper puts forth a quantized desired convergence guarantees has not been addressed, and gradient innovation method (as simple as QGD) that also constitutes the goal of this paper. Note that asynchronous skips communication rounds. 
Different from the downlink algorithms, such as DC-ASGD [34], which primarily aim to server-to-worker communications that can be performed k save run time can also result in communication reduction. simultaneously (e.g., by broadcasting θ ), the server in the Our method and asynchronous schemes are complementary uplink receives the workers’ gradients in the presence of to each other, and our algorithms to be presented can also be interference, whose mitigation costs resources, e.g., extra extended to the asynchronous version. latency or bandwidth. For this reason, our focus here is on reducing the number of worker-to-server uplink communi- 1.2 Context and contributions in a nutshell cations, which we will also refer to as uploads. Our Lazily Aggregated Quantized (LAQ) GD update is given by (cf. (3)) We first review the standard distributed server-worker learn- X ing architecture that typically aims at solving an optimization LAQ: θk+1 = θk α k with k = k−1+ δQk (4) − ∇ ∇ ∇ m problem of the form m∈Mk N k X Xm where ∇ is an approximate aggregated gradient that sum- min fm(θ) with fm(θ) := `(xm,n; θ) (1) k δQk := θ marizes the parameter change at iteration , and m m∈M n=1 k ˆk−1 Qm(θ )−Qm(θm ) is the difference between two quantized p k where θ ∈ R denotes the parameter to be learned; M with gradients of fm at the current iterate θ and the previous ˆk−1 |M| = M is the set of workers; xm,n represents the n- copy θm . With a judiciously selected criterion that will be th data vector at worker m (e.g., feature vector and its elaborated later, Mk denotes the subset of workers whose k label); Nm is the number of data samples at worker m; local δQm is uploaded in iteration k, while the parameter ˆk k k while `(x; θ) denotes the loss associated with θ and x; and vector iterates are given by θm := θ , ∀m ∈ M , and ˆk ˆk−1 k ˆk−1 fm(θ) stands for the local loss corresponding to θ and all θm := θm , m / . For worker m, the copy θm is data at worker m. For ease in exposition, we further let employed to remember∀ ∈ M the model parameter when last time P f(θ) := m∈M fm(θ) denote the overall loss function. it is selected to communicate with the server. Throughout this paper, we consider implementing dis- In comparison to QGD as (3) where ‘fresh’ quantized tributed gradient descent (GD) in the commonly employed gradient is required from each and every worker, the key worker-server setup. Since the data samples are distributed idea of LAQ is to obtain ∇k by refining the previous aggre- across the workers, in each iteration the workers need to gated gradient k−1 with the selected gradient differences k ∇ download the model parameter from the server, calculate the δQm m∈Mk ; that is, using only the new gradients from { } k local gradient using local data and upload the gradients to the selected workers in M , while reusing the outdated

0162-8828 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. Authorized licensed use limited to: University of Minnesota. Downloaded on December 19,2020 at 16:23:34 UTC from IEEE Xplore. Restrictions apply. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2020.3033286, IEEE Transactions on Pattern Analysis and Machine Intelligence

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (SUBMITTED OCTOBER 20, 2020) 3

k−1 gradients from the rest of the workers. With stored k−1 k k [Qm(θm̂ )]i [∇fm(θ )]i [Qm(θ )]i in the server, this simple modification scales down∇ the per- iteration communication rounds from QGD’s M to LAQ’s |Mk|. Note that one round of communication through out [ ) k k this paper means one worker’s upload. 2τRm Rm Compared with alternative quantization schemes, we have that i) LAQ quantizes the gradient innovation — the Fig. 1. Quantization example (b = 3) difference of the current gradient relative to the previous Q (θˆk−1) τ quantized gradient; and ii) LAQ skips the gradient commu- outdated gradients m m stored in the memory and δQk nication — if the gradient innovation of a worker is not known a priori, upon receiving m the server can recover significant enough, the communication of this worker is the quantized gradient as k ˆk−1 k skipped. We will rigorously establish that LAQ achieves Qm(θ ) = Qm(θm ) + δQm (7) the same linear convergence as GD under the strongly convex assumption on the loss function. Numerical tests Figure 1 presents an example for quantizing one coor- will demonstrate that our approach outperforms competing dinate of the gradient with b = 3 bits. The original value 3 methods in terms of both communication bits and rounds. is quantized with 3 bits and 2 = 8 values, each of which k Notation. Bold lowercase fonts will be used to denote column covers an iterval of length 2τRm centered at itself. With k k k vectors; kxk2 and kxk∞ the `2-norm and `∞-norm of x, εm := ∇fm(θ ) − Qm(θ ) denoting the local quantization respectively; and [x]i the i-th entry of x; while bac will stand error, it is clear that the quantization error is not larger that for the floor of a; and | · | for the cardinality of a set or vector. half of the length of the interval that each value covers, namely, k k ε ∞ τR . (8) 2 LAQ: A LAZILY AGGREGATED QUANTIZED GRA- k mk ≤ m DIENTAPPROACH The aggregated quantized gradient is Q(θk) = P k With the goal of reducing the communication overhead, m∈M Qm(θ ), and the aggregated quantization k k k PM k two complementary techniques are incorporated in our error is ε := ∇f(θ ) − Q(θ ) = m=1 εm; that is, algorithm design: i) gradient innovation-based quantization; Q(θk) = ∇f(θk) − εk. and ii) gradient innovation-based uploading or aggregation L A Q LAQ — giving the name azily ggregated uantized ( ) 2.2 Gradient innovation-based aggregation gradient. The former reduces the number of bits per upload, while the latter cuts down the number of uploads, and jointly The intuition behind lazy gradient aggregation is that if the they effect parsimony in communications. The remainder of difference of two consecutive locally quantized gradients this section elaborates further on LAQ. is small, it is safe to skip the redundant gradient upload, and reuse the previous one at the server. In addition, we 2.1 Gradient innovation-based quantization also ensure the server has a relatively “fresh" gradient for each worker by enforcing communication if any worker Quantization limits the number of bits to encode a vector has not uploaded during the last t¯ rounds. We set a clock during communication. Suppose we use b bits to quantize tm, m ∈ M for worker m counting the number of iterations each coordinate of the gradient in contrast to 32 or 64 bits since last time it uploaded information. Equipped with the used by most computers. 
With Q denoting the quantization quantization and selection, our LAQ update takes the form operator, the quantized gradient per worker m at iteration k we presented in (4). Q (θk) = Q(∇f (θk),Q (θˆk−1)) is m m m m , which depends on the Now it only remains to design the selection criterion to k ˆk−1 gradient ∇fm(θ ) and its previous quantization Qm(θm ). decide which worker to upload the quantized gradient or The gradient is element-wise quantized by projecting to the its innovation. We propose the following communication closest point in a uniformly discretized grid. The grid is a p- criterion: worker m ∈ M skips the upload at iteration k, if it ˆk−1 dimensional hypercube with center at Qm(θm ) and radius satisfies k k ˆk−1 b Rm = k∇fm(θ ) − Qm(θm )k∞. With τ := 1/(2 − 1) defin- D k−1 k 2 1 X k+1−d k−d 2 kQ (θˆ ) − Q (θ )k ≤ ξ kθ − θ k ing the quantization granularity, the gradient innovation m m m 2 α2M 2 d 2 k ˆk−1 d=1 [fm(θ )]i [Qm(θm )]i at worker m is mapped to an integer −  k 2 k−1 2 as + 3 kεmk2 + kεˆm k2 ; (9a) $ k ˆk−1 k % k [∇fm(θ )]i − [Qm(θm )]i + Rm 1 tm ≤ t¯ (9b) [qm(θ )]i = k + (5) 2τRm 2 D where D ≤ t¯ and {ξd}d=1 are predetermined constants, and b k−1 ˆk−1 ˆk−1 which falls in 0, 2, , 2 1 , and thus can be encoded εˆm = ∇fm(θm ) − Qm(θm ) is the quantization error of { ··· −k } by b bits. Note that adding Rm in the numerator ensures the the local gradient when last time worker m uploads gradient k non-negativity of [qm(θ )]i, and adding 1/2 in (5) guarantees innovation to the server. We will prove in the next section rounding to the closest point. Hence, the quantized gradient that the LAQ iterates in (4) converge, and are communication innovation at worker m is (with 1 := [1 ··· 1]>) efficient. k k k−1 k k k With reference to Figure 2, LAQ can be summarized as δQ = Qm(θ ) − Qm(θˆ ) = 2τR qm(θ ) − R 1 (6) m m m m follows. At iteration k, the server broadcasts the learning k which can be transmitted by 32 + bp bits (32 bits for Rm and parameter vector to all workers. Each worker computes the k bp bits for qm(θ )) instead of the original 32p bits. With the gradient, and quantizes it to decide whether to upload the

0162-8828 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. Authorized licensed use limited to: University of Minnesota. Downloaded on December 19,2020 at 16:23:34 UTC from IEEE Xplore. Restrictions apply. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2020.3033286, IEEE Transactions on Pattern Analysis and Machine Intelligence

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (SUBMITTED OCTOBER 20, 2020) 4 Algorithm 2 LAQ D ¯ Server θk+1 θk α k 1: Input: stepsize α > 0, b, D, {ξd}d=1 and t. = − ∇ k ˆ0 2: Initialize: θ , and {Qm(θm), tm}m∈M. 3: for k = 1, 2, ··· ,K do θk δQk θk θk δQk k Workers 1 M 4: Server broadcasts θ to all workers. 5: for m = 1, 2, ··· ,M do k k Quantization Quantization Quantization 6: Worker m computes ∇fm(θ ) and Qm(θ ). ⋯ ⋯ m Selection Selection Selection 7: if (9) holds for worker then 8: Worker m uploads nothing. ˆk ˆk−1 9: Set θm = θm and tm ← tm + 1. Fig. 2. Federated learning via LAQ 10: else k 11: Worker m uploads δQm via (6). ˆk k 12: Set θm = θ , and tm = 0. Algorithm 1 QGD 13: end if 1: Input: stepsize α > 0, quantization bit b. 14: end for 2: Initialize: θ0. 15: Server updates θ according to (4). 3: for k = 1, 2, ··· ,K do 16: end for 4: Server broadcasts θk to all workers. 5: for m = 1, 2, ··· ,M do k k 6: Worker m computes ∇fm(θ ) and Qm(θ ). 3.1 Development of the communication skipping rule k 7: Worker m uploads δQm via (6). 8: end for To illuminate the difference between LAQ and GD, consider 9: Server updates θ following (4) with Mk = M. re-writing (4) as 10: end for k+1 k k X ˆk−1 k θ =θ − α[∇Q(θ ) + (Qm(θm ) − Qm(θ ))] k m∈Mc k k k k X ˆk−1 k quantized gradient innovation δQm. Upon receiving the gra- =θ − α[∇f(θ ) − ε + (Qm(θm ) − Qm(θ ))] k dient innovation from selected workers, the server updates m∈Mc the learning vector. These steps are listed in Algorithm 2. where k := k denotes the subset of workers that Note that in this paper, by privacy preserving, we mean c skip communicationM M\M with the server at iteration k. Compared the private raw data of each worker are not released to other with the GD iteration in (2), the gradient employed here workers nor the server. It is true that the local information degrades due to the quantization error εk and the missed might be partially inferred by reverse engineering of the gra- P ˆk−1 k gradient innovation k (Qm(θ ) − Qm(θ ))]. It is dient. However, the gradient compression of LAQ introduces m∈Mc clear that if a sufficiently large number of bits is used to noise to the gradient, which helps promote privacy [36, 37]. D quantize the gradient, and all {ξd} are set to 0, causing To further investigate how to mitigate privacy leakage and d=1 Mk := M, then LAQ reduces to GD. Thus, adjusting b and quantify the degree of privacy remains to be our future work. D {ξd}d=1 directly influences the performance of LAQ. To compare the descent amount of LAQ with that of GD, we first establish the one step descent for both algorithms. Based on Assumption 1, the next lemma holds for GD. 3 CONVERGENCE AND COMMUNICATION ANALYSIS Lemma 1. The GD update yields the following descent

Our subsequent convergence analysis of LAQ relies on the k+1 k k f(θ ) − f(θ ) ≤ ∆GD (11) following assumptions on fm( ) and f( ): · · k αL k 2 where ∆GD := −(1 − )αk∇f(θ )k2. Assumption 1. The local gradient ∇fm(·) is Lm-Lipschitz con- 2 tinuous, and the global gradient ∇f(·) is L-Lipschitz continuous; The descent of LAQ differs from that of GD due to the L L i.e., there exist constants m and such that quantization and selection, as specified in the next lemma. (For readability, some proofs are deferred to Section 8) k∇fm(θ1) − ∇fm(θ2)k2 ≤Lmkθ1 − θ2k2, ∀θ1, θ2; (10a) Lemma 2. The LAQ update yields the following descent k∇f(θ1) − ∇f(θ2)k2 ≤Lkθ1 − θ2k2, ∀θ1, θ2. (10b) k+1 k k k 2 f(θ ) − f(θ ) ≤ ∆LAQ + αkε k2 (12) Assumption 2. The function f(·) is µ-strongly convex, meaning k α k 2 P ˆk−1 that there exists a constant µ > 0 such that where ∆LAQ := − k∇f(θ )k2 + αk k (Qm(θm ) − 2 m∈Mc k 2 L 1 k+1 k 2 Qm(θ ))k2 + ( − )kθ − θ k2. µ 2 2α f(θ ) − f(θ ) ≥ h∇f(θ ), θ − θ i + kθ − θ k2, ∀θ , θ . 1 2 2 1 2 2 1 2 2 1 2 At this point, it is instructive to shed more light on LAQ’s gradient skipping rule. If we fix for simplicity α = 1/L, it Before establishing our performance analysis results, we follows readily that first present salient features of our communication skipping α ∆k = − ||∇f(θk)||2; rule. The rationale behind the selection criterion (9) is to GD 2 2 k α k 2 X k−1 k 2 judiciously compare the descent amount of GD versus that ∆ = − ||∇f(θ )|| + α|| (Q (θˆ ) − Q (θ ))|| . LAQ 2 2 m m m 2 of LAQ. k m∈Mc

0162-8828 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. Authorized licensed use limited to: University of Minnesota. Downloaded on December 19,2020 at 16:23:34 UTC from IEEE Xplore. Restrictions apply. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2020.3033286, IEEE Transactions on Pattern Analysis and Machine Intelligence

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (SUBMITTED OCTOBER 20, 2020) 5

The lazy aggregation criterion selects the quantized and σ1 = 1 c with gradient innovation by assessing its contribution to the −  2  loss function decrease. For LAQ to be more communication 1 1 1 3bL [ −4(a+2Dξ+6ab)]a 2 −( 2 a+Dξ + 3ab)+ aL2 M  efficient than GD, each LAQ upload should bring more c=min 2 , m . κ D − d + 1 descent, that is   k k ∆LAQ ∆ (20) ≤ GD . (13) |Mk| M

After simple manipulation, it can be shown that (13) boils Proof. It follows from (8) that down to k X k−1 k 2 |Mc | k 2 k+1 2 2 k+1 2 || (Qm(θˆ ) − Qm(θ ))|| ≤ ||∇f(θ )|| (14) kεm k∞ ≤ τ (Rm ) m 2 2M 2 k 2 k+1 k k m∈Mc = τ k∇fm(θ ) − ∇fm(θ ) + ∇fm(θ ) k k ˆk 2 which implies that since − Qm(θ ) + Qm(θ ) − Qm(θm)k∞ (21) 2 k+1 k 2 2 k 2 X ˆk−1 k 2 ≤ 3τ Lmkθ − θ k2 + 3τ kεmk∞ || (Qm(θm ) − Qm(θ ))||2 2 k ˆk 2 m∈Mk + 3τ kQm(θ ) − Qm(θm)k2 c (15) k X ˆk−1 k 2 ≤ |Mc | ||(Qm(θm ) − Qm(θ )||2, k Then the one-step Lyapunov function difference is m∈Mc bounded as the following condition is sufficient to guarantee (14): ˆk−1 k 2 k 2 2 k k+1 k ||(Qm(θm ) − Qm(θ )||2 ≤ ||∇f(θ )||2/(2M ), ∀m ∈ Mc . V(θ ) − V(θ ) D E α (16) ≤ −α ∇f(θk),Q(θk) + k∇f(θk)k2 However, it is impossible to check (16) locally per worker, 2 2 k α X k−1 k 2 because the fully aggregated gradient ∇f(θ ) is required, + k Q (θˆ ) − Q (θ )k 2 m m m 2 k which is exactly what we want to avoid. This motivates m∈Mc k 2 circumventing ||∇f(θ )||2 by using its approximation L 2 2 k+1 k 2 + ( + β1 + 3γτ Lm)kθ − θ k2 D 2 k 2 2 X k+1−d k−d 2 D−1 ||∇f(θ )|| ≈ ξd||θ − θ || (17) 2 α2 2 X k+1−d k−d 2 k+1−D k−D 2 k=1 + (βd+1 − βd)kθ − θ k2 − βDkθ − θ k2 d=1 D where {ξd} are constants. The main reason why (17) 2 X k 2 2 X k−1 k 2 d=1 + γ(3τ − 1) kε k + 3γτ kQ (θˆ ) − Q (θ )k holds is that ∇f(θk) can be approximated by weighting past m ∞ m m m 2 m∈M m∈M gradients or parameter differences since f(·) is L-smooth. D E α ≤ −α ∇f(θk),Q(θk) + k∇f(θk)k2 Combining (17) and (16) leads to (9a) with the quantization 2 2 error ignored. α L 2 2 −1 2 X k−1 k 2 +[ +( +β +3γτ L )(1+ρ )α ]k Q (θˆ )−Q (θ )k 2 2 1 m 2 m m m 2 m∈Mk 3.2 Convergence analysis c L + ( + β + 3γτ 2L2 )(1 + ρ )α2kQ(θk)k2 The rationale of the previous subsection regarding LAQ’s 2 1 m 2 2 skipping rule is not mathematically rigorous, but we will D−1 X k+1−d k−d 2 establish here that it guarantees convergent iterates. To + (βd+1 − βd)kθ − θ k2 this end, and with θ∗ denoting the optimal solution of (1), d=1 k+1−D k−D 2 2 X k 2 consider the Lyapunov function associated with LAQ as − βDkθ − θ k2 + γ(3τ − 1) kεmk∞ k k ∗ m∈M V(θ ) := f(θ ) f(θ ) (18) 2 X k−1 k 2 − + 3γτ kQm(θˆ ) − Qm(θ )k D D m 2 X X ξj X m∈M + θk+1−d θk−d 2 + γ εk 2 . α 2 m ∞ d=1 j=d k − k m∈M k k where the second inequality follows from Young’s inequality, k 2 2 −1 2 Before we quantify the process of V(θ ) in the ensuing namely ka + bk2 ≤ (1 + ρ)kak2 + (1 + ρ )kbk2. lemma, it is worth pointing out that the Lyapunov function Considering the criterion (9a), we have associated with LAQ is a strict generalization of that used in GD or LAG [30, 31], which not only takes into account the X ˆk−1 k 2 k Qm(θm )−Qm(θ )k2 delayed iterates but also the quantization error. k m∈Mc k X ˆk−1 k 2 Lemma 3. Under Assumption 1 and 2, and by fixing parameters ≤ |Mc | kQm(θm )−Qm(θ )k2 (D−d+1)ξ a 2 bL ξ ξ ξ β α γτ k 1 = 2 = = , d = α , = L , and = L2 , m∈Mc ··· m with a, b > 0, the Lyapunov function obeys the inequality k 2 D |Mc | X k+1−d k−d 2 k X k 2 k−1 2 ≤ ξdkθ − θ k +3|M | (kε k + kεˆ k ) k+1 k 2 1 t0 α2|M|2 2 c m 2 m 2 V(θ ) ≤ σ1V(θ ) + BMp max V(θ ) (19) d=1 k γ k−t¯≤t0≤k−1 m∈Mc D 1 X k+1−d k−d 2 X k 2 k−1 2 where the constant is defined as ≤ ξ kθ − θ k + 3M (kε k + kεˆ k ). α2 d 2 m 2 m 2 3a 3a 2a 9bL d=1 m∈Mk B Dξ ab M, c = [ + ( + 3 + 9 ) + 2 ] (22) 2L 2 L MLm

0162-8828 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. Authorized licensed use limited to: University of Minnesota. Downloaded on December 19,2020 at 16:23:34 UTC from IEEE Xplore. Restrictions apply. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2020.3033286, IEEE Transactions on Pattern Analysis and Machine Intelligence

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (SUBMITTED OCTOBER 20, 2020) 6

1 2 2 2o Thus, the one-step Lyapunov difference satisfies α + (L + 2β1 + 6γτ Lm)(1 + ρ2)α . 2ρ1 k+1 k V(θ ) − V(θ ) Assumption 2 implies that f(·) satisfies the PL condition: α k 2 D k kE ≤ − k∇f(θ )k2 + α ∇f(θ ), ε k ∗ k 2 2 2µ(f(θ ) − f(θ )) ≤ k∇f(θ )k2. (28) α L −1 2 2 D [ +( +β1)(1+ρ2 )α ]M +3γτ X k+1−d k−d 2 Plugging (28) into (25) gives + 2 2 ξ kθ −θ k α2M d 2 d=1 k+1 k k 2 X k 2 k−1 2 V(θ ) ≤ σ1V(θ ) + B[kε k2 + (kεmk2 + kεˆm k2)] L 2 2 2 k k 2 m∈M + ( + β1 + 3γτ Lm)(1 + ρ2)α k∇f(θ ) − ε k2 2 2 X k 2 D−1 + γ(3τ − 1) kεmk∞ X k+1−d k−d 2 k+1−D k−D 2 m∈M + (βd+1 −βd)kθ −θ k2 − βDkθ −θ k2 k 2 2 X k 2 d=1 ≤ σ1V(θ ) + [BMp + B + γ(3τ − 1)] kεmk2 3α 3L 2 2 −1 2 2 m∈M + [ + ( + 3β1 + 9γτ Lm)(1 + ρ2 )α + 3γτ ]M 2 X k−1 2 2 2 + Bp kεˆm k2 X k 2 k−1 2 2 X k 2 × (kεmk2 + kεˆm k2) + γ(3τ − 1) kεmk∞. m∈M m∈Mk m∈M c where σ1 = 1 − c. (23) By choosing parameter stepsize α that impose the following Since for any ρ1 > 0, it holds that inequality hold 2 2 D k kE 1 k 2 1 k 2 [BMp + B + γ(3τ − 1)] ≤ 0, (29) ∇f(θ ), ε ≤ ρ1k∇f(θ )k2 + kε k2 (24) 2 2ρ1 one can obtain

we can rewrite (23) as k+1 k 2 1 X k−1 2 (θ ) ≤ σ (θ ) + Bp · γ kεˆ k V 1V γ m 2 m∈M k+1 k V(θ ) − V(θ ) k 2 1 X t0 ≤ σ1V(θ ) + Bp max V(θ ) 1 1 2 2 2 k 2 γ k−t¯≤t0≤k−1 ≤ [(− + ρ )α + (L + 2β + 6γτ L )(1 + ρ )α ]k∇f(θ )k m∈M 2 2 1 1 m 2 2 k 2 1 t0 α L 2 2 −1 2 2 ξD ≤ σ1V(θ ) + BMp max V(θ ). + {[( + ( + β + 3γτ L )(1 + ρ )α )M + 3γτ ] γ k−t¯≤t0≤k−1 2 2 1 m 2 α2M k+1−D k−D 2 1 (D−d+1)ξ a − βD}kθ − θ k2 For simplicity, we fix ρ1 = 2 , ρ2 = 1, βd = α , α = L , 2 bL D−1 and γτ = L2 , with a, b > 0. Consequently, we obtain X α L 2 2 −1 2 2 ξd m + {[( + ( + β1 + 3γτ L )(1 + ρ )α )M + 3γτ ] 2 2 m 2 α2M h 3α 3L i d=1 2 2 2 2 B = + ( + 3β1 + 9γτ Lm)α + 3γτ M + β − β }kθk+1−d − θk−dk2 2 2 d+1 d 2 h 3a 3a 2a 9bL i 3α 3L 2 2 −1 2 2 = + ( + 3Dξ + 9ab) + 2 M + [ + ( + 3β + 9γτ L )(1 + ρ )α + 3γτ ]M 2L 2 L MLm 2 2 1 m 2 X k 2 k−1 2 and × (kεmk2 + kεˆm k2) m∈Mk c  2  1 1 1 3bL −( a+Dξ + 3ab)+ 2 1 2 2 2 k 2 [ 2 −4(a+2Dξ+6ab)]a 2 2 aLmM  + [ α + (L + 2β1 + 6γτ Lm)(1 + ρ2)α ]kε k2 c=min , . 2ρ1  κ D − d + 1  2 X k 2 + γ(3τ − 1) kεmk∞. (25) (30) m∈M Here the proof is complete. It is straightforward that the following condition guarantees Theorem 1. Under the same assumptions and parameters of the first three terms in (25) are nonpositive Lemma 3, the Lyapunov function converges at a linear rate; that is,

1 1 2 2 2 there exists a constant σ2 ∈ (0, 1) such that (− + ρ1)α + (L + 2β1 + 6γτ Lm)(1 + ρ2)α ≤ 0; 2 2 k k 0 α L 2 2 −1 2 2 V(θ ) σ2 V(θ ) (31) {[ 2 +( 2 +β1 +3γτ Lm)(1+ρ2 )α ]M +3γτ }ξD ≤ ≤ βD; 1 2 2 1 1+t¯ α M where σ2 = (σ1 + BMp γ ) . α L 2 2 −1 2 2 {[ 2 +( 2 +β1 +3γτ Lm)(1+ρ2 )α ]M +3γτ }ξd ≤βd −βd+1. Proof. We first present a critical lemma that will be used to α2M (26) prove our result. For ease of exposition, we define the constants c and B as k Lemma 4. [38, Lemma 3.2] Let V denote a sequence of { } n 2 2 2 nonnegative real numbers satisfying the following inequality for c = min (1 − ρ1)α − 2µ(L + 2β1 + 6γτ Lm)(1 + ρ2)α , some nonnegative constants p and q. α L 2 2 −1 2 2 ξD 1−[( +( +β1 +3γτ Lm)(1+ρ2 )α )M +3γτ ] 2 , k+1 k k 2 2 α MβD V pV + q max V , k 0. (32) α L 2 2 −1 2 2 ≤ (k−d(k))+≤l≤k ≥ β [( +( +β1 +3γτ L )(1+ρ )α )M +3γτ ]ξd o 1− d+1 − 2 2 m 2 β α2Mβ If p + q < 1 and 0 d(k) dmax for some positive constant d d ≤ ≤ (27) dmax, then k k 0 and, V r V , k 1, (33) ≤ ∀ ≥ nh 3α 3L i 1 2 2 −1 2 2 1+d B = max +( +3β1 +9γτ Lm)(1 + ρ )α +3γτ M, where r = (p + q) max . 2 2 2

0162-8828 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. Authorized licensed use limited to: University of Minnesota. Downloaded on December 19,2020 at 16:23:34 UTC from IEEE Xplore. Restrictions apply. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2020.3033286, IEEE Transactions on Pattern Analysis and Machine Intelligence

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (SUBMITTED OCTOBER 20, 2020) 7

Following [38, Lemma 3.2], given that the Lyapunov Algorithm 3 SLAQ D function obeys (19) and if the following condition is satisfied 1: Input: stepsize α > 0, b, D, {ξd}d=1 and t¯, and batchsize S. k ˆ0 2: Initialize: θ , and {Qm(θm), tm}m∈M. 2 1 σ1 + BMp < 1 (34) 3: for k = 1, 2, ··· ,K do γ 4: Server broadcasts θk to all workers. 1 1 5: for m = 1, 2, ··· ,M do σ = (σ + BMp2 ) 1+t¯ then it guarantees (31) holds with 2 1 γ . 6: Worker m draws S samples and computes the average k In the sequel, we will show that we can indeed find a set gradient at these S samples ∇fm(θ ), along with the k of parameters that make (34) hold. For the design parameter quantized gradient Qm(θ ). D, we impose D κ. From (20), it is obvious that the 7: if (9) holds for worker m then following condition≤ 8: Worker m does not upload anything. ˆk ˆk−1 2 9: Set θm = θm and tm ← tm + 1. 1 1 1 3bL 10: else [ − 4(a + 2Dξ + 6ab)]a ≤ − ( a + Dξ + 3ab) + 2 (35) k 2 2 2 aLmM 11: Worker m uploads δQm via (6). ˆk k guarantees 12: Set θm = θ , and tm = 0. 13: end if 1 [ 4(a + 2Dξ + 6ab)]a 14: end for c = 2 − . (36) κ 15: Server updates θ according to (4). 1 16: end for [ 2 −4(a+2Dξ+6ab)]a Thus, we obtain σ1 = 1 c = 1 κ . − − 1 1 1 It can be verified that choosing a = 20 , b = 10 , Dξ = 50 Algorithm 4 TWO-LAQ 93L2 2 1 2 2 m 9 D ¯ and τ 100κ /[M p ( 10L2 + M )] is a sufficient condition 1: Input: stepsize α > 0, b, D, {ξd}d=1 and t. ≤ 1 ˜0 0 ˆ0 for (26), (35) and (34) being satisfied. With above selected 2: Initialize: θ , θ = θ , and {Qm(θm), tm}m∈M. 1 3: for k = 1, 2, ··· ,K do parameters, we can obtain σ1 = 1 1000κ and − 4: Server calculate θ˜k broadcasts quantized model innova- 2 1 k 1 2 2 93Lm 9 2 ¯ tion δθ to all workers. σ2 = (1 − + M p ( + )τ ) 1+t ∈ (0, 1) (37) 1000κ 100L2 10M 5: for m = 1, 2, ··· ,M do k k 6: Worker m computes θ˜ according to (41) and ∇fm(θ˜ ) which together with (31) indicates the linear convergence of k and Qm(θ˜ ). the Lyapunov function and completes the proof. 7: if (9) holds for worker m then From (37), it is obvious that if the quantization is accurate 8: Worker m does not upload anything. ˆk ˆk−1 enough, i.e., τ 2 0, and no communication is skipped, i.e., 9: Set θm = θm and tm ← tm + 1. → 10: else t¯ = 1, the dependence of convergence rate on condition k 1 11: Worker m uploads δQm via (6). number is of order , the same as the gradient descent. ˆk ˜k κ 12: Set θm = θ , and tm = 0. Compared to the LAG analysis in [30], the analysis for 13: end if LAQ is more involved, because it needs to deal with not 14: end for only outdated but also quantized (inexact) gradients. The 15: Server updates θk+1 according to (4). latter challenges the monotonicity of the Lyapunov function 16: end for in (18), which is the building block of the analysis in [30]. We

tackle this issue by i) considering the outdated gradient in k the quantization (6); and, ii) accounting for the quantization the worst-case performance. Due to the error εm introduced error in the new selection criterion (9). As a result, Theorem by the communication skipping and gradient compression, 1 establishes that LAQ retains the linear convergence rate it is not theoretically established that LAQ outperforms GD. even when quantization error is present. This is because a However, our empirical studies will demonstrate that LAQ controlled quantization error also converges at a linear rate. significantly outperforms GD in terms of communication. To As for the improvement relative to the conference version prove that LAQ is more communication-efficient than GD is more challenging and is in our future agenda. [1], the convergence rate σ2 is explicitly characterized by the quantization parameter τ and the maximum communication skipping interval t¯. 4 GENERALIZING LAQ D In this section, we broaden the scope of LAQ by developing Proposition 1. If under Assumption 1, we choose {ξd}d=1 to the stochastic LAQ and the two-way communication-efficient satisfy ξ1 ≥ ξ2 ≥ · · · ≥ ξD, and define dm, m ∈ M as LAQ-driven federated learning, as we elaborate next.  2 2 2 dm := max d|Lm ≤ ξd/(3α M D), d ∈ {1, 2, ··· ,D} (38) d 4.1 Stochastic LAQ it suffices for worker m to have at most k/(dm + 1) uploads with the server until the k-th iteration. Thanks to its well-documented merits in reducing complexity, stochastic gradient descent (SGD) has been widely employed This proposition asserts that the communication inten- by learning tasks involving large-scale training data. Here sity per worker is determined by the smoothness of the we show that LAQ can also benefit from its stochastic corresponding local loss function. Workers with smaller counterpart, namely SLAQ, that is developed with a simple modification in the criterion. Specifically, (9a) in SLAQ is smoothness constant communicate with the server less replaced by frequently, which justifies the term lazily communication. D For some technical issues, the current bound on the s k−1 s k 2 1 X k+1−d k−d 2 kQ (θˆ ) − Q (θ )k ≤ ξ kθ − θ k convergence rate is relatively loose in order to account for m m m 2 α2M 2 d 2 d=1

0162-8828 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. Authorized licensed use limited to: University of Minnesota. Downloaded on December 19,2020 at 16:23:34 UTC from IEEE Xplore. Restrictions apply. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2020.3033286, IEEE Transactions on Pattern Analysis and Machine Intelligence

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (SUBMITTED OCTOBER 20, 2020) 8

5 NUMERICALTESTS k k k θ +1 = θ − α∇ Server To validate our theoretical analysis and demonstrate the per- Quantization formance of LAQ in improving communication efficiency for δθk δQk δθk δθk δQk practical machine learning tasks, we evaluate our algorithm Workers 1 M on the regularized logistic regression (LR), which is strongly convex, and a neural network (NN) classifier involving a Quantization ⋯ Quantization ⋯ Quantization Selection Selection Selection nonconvex loss. For our experiments, we use the MNIST dataset [39], which is equally distributed across M = 10 work- ers. Throughout, we set D = 10, ξ = = ξ = 0.8/D, Fig. 3. Federated learning via TWO-LAQ 1 D and t¯= 100. ···

 k 2 k−1 2 + 3 kεmk2 + kεˆm k2 + var (39) 5.1 Simulation setup where superscript s will henceforth denote the stochastic LR classifier. Consider a multi-class classifier with say C = counterpart of LAQ quantities defined so far; and var is 10 classes, that relies on logistic regression trained using a constant. Compared with (9a), the constant added in the the MNIST dataset. Each training vector xm,n comprises stochastic case is to compensate for the variance coming from f l f F a feature-label pair (xm,n, xm,n), where xm,n ∈ R is the the stochastic sampling. In practice, we can use the empirical l C feature vector and xm,n ∈ R denotes the one-hot label variance to approximate the variance, that is, the variance C×F vector. The model θ ∈ R here is a matrix, which is slightly computed according to the drawn samples per iteration. different from previous description, and it is adopted for Apart from the criterion, SLAQ is different from LAQ convenience but does not change the learning problem. The only in the local (stochastic) gradient calculation. Specifically, estimated probability of (m, n)-th sample to belong to class the worker randomly draws S samples from its training set i is given by and computes the stochastic gradient as l f xˆm,n = softmax(θxm,n) (42) S s 1 X which can be explicitly written as fm(θ) = θ`(xm,n; θ) . (40) ∇ S ∇ f n=1 [θxm,n]i l e [xˆ ]i = , ∀i ∈ {1, 2, ··· ,C}. (43) m,n C f The quantization and other operations are the same as P [θxm,n]j j=1 e before. The SLAQ is summarized in Algorithm 3. Following the convention, we also consider here that the global loss The regularized logistic regression classifier relies on the function is scaled by the total number of training samples. following cross-entropy loss plus a regularizer C X l l λ T `(xm,n, θ) = − [x ]i log[xˆ ]i + Tr(θ θ) (44) 4.2 Two-way quantization m,n m,n 2 i=1 So far, communication savings have been achieved by where Tr(·) denotes trace operator, and θT is the transpose skipping uploads and quantizing the uploaded gradient of θ. Having defined `(xm,n, θ), the local loss functions are innovation. A natural extension is to also quantize the model PNm fm(θ) = n=1 `(xm,n; θ), and the global loss function is innovation in the downlink, which results in what we term given by Two Way Lazily Aggregated Gradient (TWO-LAQ). 1 X ˜k k ˜k−1 f(θ) = f (θ) (45) Let θ = (θ , θ ) describe the quantization model. N m First, the modelQ innovation is quantized as (5), and thus we m∈M omit the details. Then, for the server to workers communica- where N is the total number of data samples. In our tests, tion, only the quantized model innovation δθk is broadcast. we set the regularizer coefficient to λ = 0.01. With θ˜k−1 stored in memory, both the server and the workers NN classifier. In our tests, we employ a ReLU network com- update θ˜k as prising one hidden layer having 200 nodes with dimensions θ˜k = θ˜k−1 + δθk. (41) of the input and output layers being 784(28 × 28) and 10, respectively. The regularizer parameter is set to λ = 0.01. Different from possible alternatives, each worker m does CNN classifier. For the test in CIFAR 10 dataset, we adopt not have the accurate model θk. As a result, each worker the convolutional neural network (CNN) which consists k k has to compute the local gradient fm(θ˜ ) based on θ˜ . The of 3 VGG-type blocks [40]. 
Each block is constructed by ∇ rest follows the LAG algorithm, meaning workers quantize stacking two convolutional layers with small 3 3 filters the gradient innovation, and upload it, if it is large enough; followed by a max pooling layer. The numbers× of filters otherwise, they skip this upload round. for the convolutional layers in the three blocks are 32, 64, The steps of TWO-LAQ are summarized in Algorithm 4, and 128, respectively. ReLU activation function is used in whose implementation is illustrated in Figure 3. Comparing each layer and padding is utilized on the convolutional Figure 3 with gradient descent, TWO-LAQ improves com- layers to ensure the height and width of the output feature munication efficiency at the expense of extra memory at the matches the inputs. Additionally, each block is followed by server and workers. Indeed, the server needs to store θk, a dropout layer with the rate of 20%. This is followed by a θ˜k and k; and each worker m needs to store θˆk, θ˜k and fully connected layer with 128 nodes and then the softmax ˜k ∇ k Qm(θ ). In contrast, GD only requires the server to store θ . layer. We add an l2 regularization with coefficient 0.001.

0162-8828 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. Authorized licensed use limited to: University of Minnesota. Downloaded on December 19,2020 at 16:23:34 UTC from IEEE Xplore. Restrictions apply. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2020.3033286, IEEE Transactions on Pattern Analysis and Machine Intelligence

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (SUBMITTED OCTOBER 20, 2020) 9

101 101 101 b=4 b=4 0 0 0 10 b=8 10 b=8 10 ) ) ) ∗ ∗ ∗ 1 b=16 1 b=16 1 f f f 10 10− 10− − b=20 b=20 − − − 2 2 2 ( f b=32 ( f ( f b=32 10− 10− 10−

3 3 3 10− 10− 10− b=4 b=8 4 4 4 10− 10− 10−

Loss residual Loss residual b=16 Loss residual 5 5 5 b=20 10− 10− 10− b=32 6 6 6 10− 10− 10− 0 1000 2000 3000 0 1000 2000 3000 106 107 108 109 Number of iterations Number of Communications Number of bits (a) Loss v.s. iteration (b) Loss v.s. communication (c) Loss v.s. bit Fig. 4. Convergence of LAQ under different quantization bits (logistic regression)

LAQ 0 0 0 10 TWO-LAQ 10 10 ) ) )

∗ GD ∗ ∗ f QGD f f − − − 2 2 2

( f LAG ( f ( f 10− 10− 10−

LAQ LAQ TWO-LAQ TWO-LAQ 4 4 4 10− 10− 10−

Loss residual Loss residual GD Loss residual GD QGD QGD LAG LAG 6 6 6 10− 10− 10− 0 1000 2000 3000 101 102 103 104 106 107 108 109 1010 Number of iterations Number of communications Number of bits (a) Loss v.s. iteration (b) Loss v.s. communication (c) Loss v.s. bit Fig. 5. Convergence of the loss function (logistic regression)

LAQ LAQ LAQ TWO-LAQ TWO-LAQ TWO-LAQ 101 101 101

|| GD || GD || GD f f f QGD QGD QGD ||∇ LAG ||∇ LAG ||∇ LAG 1 1 1 10− 10− 10−

Gradient norm 3 Gradient norm 3 Gradient norm 3 10− 10− 10−

0 2000 4000 6000 8000 0 20000 40000 60000 80000 0 1 2 3 4 Number of iterations Number of communications Number of bits 1011 × (a) Gradient v.s. iteration (b) Gradient v.s. communication (c) Gradient v.s. bit

Fig. 6. Convergence of gradient norm (neural network)

5.2 Numerical tests problem. Clearly, Figure 5(a) corroborates Theorem 1, namely the linear convergence for the strongly convex loss function. Figure 4 illustrates the convergence with different number As illustrated in Figure 5(b), LAQ incurs a smaller number of quantization bits. It shows that utilizing fewer bits to of communication rounds than GD and QGD thanks to our quantize the gradient moderately increases the number of innovation selection rule, yet more rounds than LAG due iterations, but markedly reduces the overall number of to the quantization error. Nevertheless, the total number of transmitted bits. To benchmark LAQ, we compare it with transmitted bits of LAQ is significantly reduced compared two classes of algorithms, namely GD and minibatch SGD with that of LAG, as demonstrated in Figure 5(c). For the NN ones, corresponding to the following two tests. classifier, Figure 6 reports the convergence of the gradient Parameters. For GD algorithms, we fix D = 10, ξ1 = ξ2 = norm, where LAQ also shows competitive performance for ¯ ··· = ξD = 0.8/D, t = 100, and we set α = 0.02, and b = 4 or nonconvex loss functions. Similar to what is observed for LR 8 for LR and NN classifiers, respectively. For minibatch SGD classification, LAQ outperforms the benchmark algorithms algorithms, the minibath size is 500 and α = 0.008; b = 3 for in terms of transtimitted bits. TWO-LAQ which additionally LR and b = 8 for NN. leverages model innovation quantization saves more bits Gradient-based tests. The benchmark algorithms include than LAQ as show in Figures 5(c) and 6(c). Table 1 sum- GD, QGD [11] and lazily aggregated gradient (LAG) [30]. marizes the detailed comparison of mentioned algorithms Figure 5 shows the convergence of loss residual for the LR

0162-8828 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. Authorized licensed use limited to: University of Minnesota. Downloaded on December 19,2020 at 16:23:34 UTC from IEEE Xplore. Restrictions apply. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2020.3033286, IEEE Transactions on Pattern Analysis and Machine Intelligence

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (SUBMITTED OCTOBER 20, 2020) 10

Algorithm Iteration # Communication # Bit # Accuracy LAQ 2626 572 6.78 × 108 0.9082 TWO-LAQ 2576 734 1.04 × 108 0.9082 Logistic GD 2763 27630 7.63 × 109 0.9082 QGD 2760 27600 1.56 × 109 0.9082 LAG 2620 2431 1.27 × 109 0.9082 LAQ 8000 32729 8.23 × 1010 0.9433 TWO-LAQ 8000 30741 4.93 × 1010 0.9433 Neural network GD 8000 80000 4.48 × 1011 0.9433 QGD 8000 80000 1.42 × 1011 0.9433 LAG 8000 30818 1.98 × 1011 0.9433 TABLE 1 Comparison of gradient-based algorithms. For logistic regression, all algorithms terminate when loss residual reaches 10−6; for neural network, all algorithms run a fixed number of iterations.

Algorithm Iteration # Communication # Bit # Accuracy LAQ 1000 6359 3.34 × 1010 87.96 TWO-LAQ 1000 5785 2.76 × 1010 87.92 GD 1000 10000 1.09 × 1011 87.86 QGD 1000 10000 4.69 × 1010 87.95 LAG 1000 6542 7.44 × 1010 87.91 TABLE 2 Tests for CIFAR 10 with CNN

0.900 0.84 0.8 0.875 0.82 0.850 0.6 0.825 0.80 Test accuracy Test accuracy Q Q LA Test accuracy 0.800 LAQ 0.78 LA 0.4 GD GD GD QGD 0.775 QGD QGD 0.76 LAG LAG 0.750 LAG 6 7 8 9 10 4 5 6 7 8 10 10 10 10 10 104 105 106 107 10 10 10 10 10 Number of bits Number of bits Number of bits (a) MNIST (b) ijcnn1 (c) covtype

Fig. 7. Tests on different datasets

including the number of iterations, uploads and bits needed Stochastic gradient-based tests. The stochastic version of to reach a given accuracy. LAQ abbreviated as SLAQ is tested and compared with stochastic gradient descent (SGD), quantized stochastic Tests on more datasets. Figure 7 exhibits the test accu- gradient descent (QSGD) [10], sparsified stochastic gradient racy of the aforementioned algorithms on three commonly descent (SSGD) [21], deep gradient compression (DGC) [18], used datasets, namely MNIST, ijcnn1 and covtype. Applied sign-SGD, [8] and tern-Grad [12]. For the all the stochastic- to all these datasets, LAQ saves transmitted bits while based algorithms, each worker draws a bath of 500 data maintaining the same accuracy. In addition, we test our samples to calculate a stochastic local gradient per iteration. algorithm on more challenging dataset—CIFAR 10, for As demonstrated in Figures 9 and 10, SLAQ requires the which CNN is utilized and Adam [41] is applied. The lowest number of communication rounds and bits. Albeit convergence of the loss function is plotted in Figure 8, and sign-SGD and tern-Grad need only 1 bit and 2 bits for the validation accuracy is shown in Figure 11. A detailed each entry of the gradient, respectively, they have larger comparison with above mentioned benchmark algorithms is quantization error and require a smaller stepsize to ensure summarized in Table 2. These tests with different datasets convergence. Therefore it takes more iterations for these two and different algorithms (gradient descent, Adam and the algorithms to reach the same loss (or accuracy), and needs following stochastic gradient descent) all demonstrate that to transmit more bits than that for SLAQ. In this stochastic the proposed communication saving scheme indeed provides gradient test, although the improvement of communication satisfied improvement in communication efficiency and thus efficiency by SLAQ is not as evident as LAQ compared with has promising potential for variety of distributed learning GD-based algorithms, SLAQ still outperforms the cutting- applications.

0162-8828 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. Authorized licensed use limited to: University of Minnesota. Downloaded on December 19,2020 at 16:23:34 UTC from IEEE Xplore. Restrictions apply. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2020.3033286, IEEE Transactions on Pattern Analysis and Machine Intelligence

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (SUBMITTED OCTOBER 20, 2020) 11

100 100 100 LAQ LAQ LAQ TWO-LAQ TWO-LAQ TWO-LAQ ) ) )

∗ GD ∗ GD ∗ GD f QGD f QGD f QGD − − −

( f LAG ( f LAG ( f LAG 1 1 1 10− 10− 10− Loss residual Loss residual Loss residual

2 2 2 10− 10− 10− 0 200 400 600 800 1000 0 2000 4000 6000 8000 10000 0.0 0.2 0.4 0.6 0.8 1.0 Number of iterations Number of communications Number of bits 1011 × (a) Loss v.s. iteration (b) Loss v.s. communication (c) Loss v.s. bit Fig. 8. Convergence of loss function (CNN for CIFAR 10)

2.5 SGD 2.5 SGD 2.5 SGD QSGD QSGD QSGD SSGD SSGD SSGD 2.0 2.0 2.0 DGC DGC DGC

) sign-SGD ) sign-SGD ) sign-SGD f f f 1.5 tern-Grad 1.5 tern-Grad 1.5 tern-Grad

Loss ( SLAQ Loss ( SLAQ Loss ( SLAQ

1.0 1.0 1.0

0.5 0.5 0.5 0 500 1000 1500 2000 0 5000 10000 15000 20000 0.0 0.5 1.0 1.5 2.0 2.5 Number of iterations Number of communications Number of bits 109 × (a) Loss v.s. iteration (b) Loss v.s. communication (c) Loss v.s. bit Fig. 9. Convergence of loss function (logistic regression)

4 SGD 4 SGD 4 SGD QSGD QSGD QSGD SSGD SSGD SSGD 3 DGC 3 DGC 3 DGC

) sign-SGD ) sign-SGD ) sign-SGD f f f tern-Grad tern-Grad tern-Grad

Loss ( 2 SLAQ Loss ( 2 SLAQ Loss ( 2 SLAQ

1 1 1

0 1000 2000 3000 0 10000 20000 30000 0.00 0.25 0.50 0.75 1.00 Number of iterations Number of communications Number of bits 1011 × (a) Loss v.s. iteration (b) Loss v.s. communication (c) Loss v.s. bit Fig. 10. Convergence of loss function (neural network)

Algorithm Iteration # Communication # Bit # Accuracy SLAQ 1000 4494 1.06 × 108 0.9060 SGD 1000 10000 2.51 × 109 0.9044 8 Logistic QSGD 1000 10000 6.89 × 10 0.9062 SSGD 1000 10000 1.26 × 108 0.9056 DGC 1000 10000 5.21 × 109 0.9082 sign-SGD 2000 20000 1.57 × 108 0.8905 tern-Grad 2000 20000 3.14 × 108 0.8942 SLAQ 1500 4342 4.14 × 109 0.9360 SGD 1500 15000 7.63 × 1010 0.9354 10 Neural network QSGD 1500 15000 3.84 × 10 0.9353 SSGD 1500 15000 1.14 × 1011 0.9356 DGC 1500 15000 3.60 × 1010 0.9362 sign-SGD 3000 30000 4.77 × 109 0.9277 tern-Grad 3000 30000 9.54 × 109 0.9299 TABLE 3 Performance comparison of mini-batch stochastic gradient-based algorithms.

0162-8828 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. Authorized licensed use limited to: University of Minnesota. Downloaded on December 19,2020 at 16:23:34 UTC from IEEE Xplore. Restrictions apply. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2020.3033286, IEEE Transactions on Pattern Analysis and Machine Intelligence

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (SUBMITTED OCTOBER 20, 2020) 12

edge schemes, namely QSGD, SSGD, sign-SGD and tern-Grad. The results are summarized in Table 3.

Fig. 11. Accuracy vs. number of bits (CNN for CIFAR-10) for GD, LAG, QGD, LAQ, and TWO-LAQ.

6 CONCLUSIONS

This paper investigated communication-efficient federated learning, and developed LAQ, an approach that integrates quantization and adaptive communication techniques based on gradient innovation. Compared with the GD method, LAQ introduces errors to the gradient, yet it still preserves linear convergence for strongly convex problems. This is a remarkable result considering that LAQ significantly reduces both communication bits and rounds. Experiments on strongly convex and nonconvex learning problems verified our theoretical analysis and demonstrated the merits of LAQ over recent popular approaches. Furthermore, two variants of LAQ, termed TWO-LAQ and SLAQ, also exhibit promising performance and outperform prevalent compression schemes in the empirical studies.

7 ACKNOWLEDGEMENTS

The work by Jun Sun and Zaiyue Yang is supported in part by NSFC Grant 61873118, in part by the Shenzhen Committee on Science and Innovations under Grant GJHZ20180411143603361, and in part by the Department of Science and Technology of Guangdong Province under Grant 2018A050506003. The work by Jun Sun is also supported by the China Scholarship Council. The work by Qinmin Yang is supported in part by the Key-Area Research and Development Program of Guangdong Province (No. 2018B010107002), and in part by the National Natural Science Foundation of China (61751205). The work by Georgios Giannakis is supported partially by NSF 1901134.

8 PROOFS

8.1 Proof of Lemma 2

For successive LAQ updates, it is not difficult to show that
\begin{align*}
f(\theta^{k+1}) - f(\theta^{k})
&\leq \Big\langle \nabla f(\theta^{k}),\, -\alpha\Big[\nabla f(\theta^{k}) - \varepsilon^{k} + \sum_{m\in\mathcal{M}_c^{k}}\big(Q_m(\hat{\theta}_m^{k-1}) - Q_m(\theta^{k})\big)\Big]\Big\rangle + \frac{L}{2}\|\theta^{k+1}-\theta^{k}\|_2^2 \\
&= -\alpha\|\nabla f(\theta^{k})\|_2^2 + \frac{L}{2}\|\theta^{k+1}-\theta^{k}\|_2^2 + \alpha\Big\langle \nabla f(\theta^{k}),\, \varepsilon^{k} - \sum_{m\in\mathcal{M}_c^{k}}\big(Q_m(\hat{\theta}_m^{k-1}) - Q_m(\theta^{k})\big)\Big\rangle \\
&= -\alpha\|\nabla f(\theta^{k})\|_2^2 - \frac{1}{2\alpha}\|\theta^{k+1}-\theta^{k}\|_2^2 + \frac{L}{2}\|\theta^{k+1}-\theta^{k}\|_2^2 + \frac{\alpha}{2}\Big[\|\nabla f(\theta^{k})\|_2^2 + \Big\|\varepsilon^{k} - \sum_{m\in\mathcal{M}_c^{k}}\big(Q_m(\hat{\theta}_m^{k-1}) - Q_m(\theta^{k})\big)\Big\|_2^2\Big] \\
&\leq -\frac{\alpha}{2}\|\nabla f(\theta^{k})\|_2^2 + \alpha\Big\|\sum_{m\in\mathcal{M}_c^{k}}\big(Q_m(\hat{\theta}_m^{k-1}) - Q_m(\theta^{k})\big)\Big\|_2^2 + \Big(\frac{L}{2}-\frac{1}{2\alpha}\Big)\|\theta^{k+1}-\theta^{k}\|_2^2 + \alpha\|\varepsilon^{k}\|_2^2,
\end{align*}
where the second equality follows from the identity $\langle a, b\rangle = \frac{1}{2}(\|a\|^2 + \|b\|^2 - \|a-b\|^2)$, and the last inequality from the fact that $\|\sum_{i=1}^{n} a_i\|_2^2 \leq n\sum_{i=1}^{n}\|a_i\|_2^2$.

8.2 Proof of Proposition 1

Suppose that at the current iteration $k$, the last iteration at which worker $m$ communicated with the server is $k-d'$, where $1 \leq d' \leq d_m$. Having $\hat{\theta}_m^{k-1} = \theta^{k-d'}$, we thus deduce that
\begin{align}
\|Q_m(\hat{\theta}_m^{k-1}) - Q_m(\theta^{k})\|_2^2
&= \|Q_m(\theta^{k-d'}) - \nabla f_m(\theta^{k-d'}) - Q_m(\theta^{k}) + \nabla f_m(\theta^{k}) + \nabla f_m(\theta^{k-d'}) - \nabla f_m(\theta^{k})\|_2^2 \nonumber\\
&\leq 3\big(\|\nabla f_m(\theta^{k-d'}) - \nabla f_m(\theta^{k})\|_2^2 + \|\varepsilon_m^{k}\|_2^2 + \|\varepsilon_m^{k-d'}\|_2^2\big) \nonumber\\
&\leq 3L_m^2\|\theta^{k-d'} - \theta^{k}\|_2^2 + 3\big(\|\varepsilon_m^{k}\|_2^2 + \|\varepsilon_m^{k-d'}\|_2^2\big) \tag{46}\\
&= 3L_m^2\Big\|\sum_{d=1}^{d'}\big(\theta^{k+1-d} - \theta^{k-d}\big)\Big\|_2^2 + 3\big(\|\varepsilon_m^{k}\|_2^2 + \|\varepsilon_m^{k-d'}\|_2^2\big) \nonumber\\
&\leq 3L_m^2 d' \sum_{d=1}^{d'}\|\theta^{k+1-d} - \theta^{k-d}\|_2^2 + 3\big(\|\varepsilon_m^{k}\|_2^2 + \|\varepsilon_m^{k-d'}\|_2^2\big). \nonumber
\end{align}
From the definition of $d_m$ and since $\xi_1 \geq \xi_2 \geq \cdots \geq \xi_D$, it can be inferred that
\begin{equation}
L_m^2 \leq \frac{\xi_{d'}}{3\alpha^2 M^2 D}, \quad \text{for all } d' \text{ satisfying } 1 \leq d' \leq d_m. \tag{47}
\end{equation}
Substituting (47) into (46) yields
\begin{align}
\|Q_m(\hat{\theta}_m^{k-1}) - Q_m(\theta^{k})\|_2^2
&\leq \frac{\xi_{d'}}{\alpha^2 M^2}\sum_{d=1}^{d'}\|\theta^{k+1-d} - \theta^{k-d}\|_2^2 + 3\big(\|\varepsilon_m^{k}\|_2^2 + \|\hat{\varepsilon}_m^{k-1}\|_2^2\big) \nonumber\\
&\leq \frac{1}{\alpha^2 M^2}\sum_{d=1}^{D}\xi_d\|\theta^{k+1-d} - \theta^{k-d}\|_2^2 + 3\big(\|\varepsilon_m^{k}\|_2^2 + \|\hat{\varepsilon}_m^{k-1}\|_2^2\big), \tag{48}
\end{align}
where $\hat{\varepsilon}_m^{k-1} = \varepsilon_m^{k-d'}$ denotes the quantization error at the iterate last uploaded by worker $m$. This exactly implies that (9a) is satisfied. Since $d_m \leq D \leq \bar{t}$, the criterion (9) holds, which means that worker $m$ will not upload her/his information until at least $t_m$ iterations after the last upload. In the first $k$ iterations, worker $m$ will therefore have at most $k/(d_m + 1)$ uploads to the server.
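As a concrete reading of the selection rule used above, the sketch below checks whether worker m may skip an upload at iteration k: the innovation of its quantized gradient is compared against a weighted sum of recent model changes plus the two quantization-error terms, mirroring the right-hand side of (48). This is a minimal NumPy sketch under our own variable names (may_skip_upload, model_diffs, and so on), with α, M, D and the weights ξ_d assumed given; it is an illustration of the reconstructed criterion, not a verbatim restatement of the algorithm's implementation.

import numpy as np

def may_skip_upload(q_prev, q_curr, model_diffs, xi, alpha, num_workers,
                    quant_err_curr, quant_err_prev):
    """Return True if worker m may reuse its previously uploaded quantized gradient.

    q_prev         : quantized gradient last uploaded by worker m, Q_m(theta_hat^{k-1})
    q_curr         : freshly quantized local gradient, Q_m(theta^k)
    model_diffs    : the D most recent model differences theta^{k+1-d} - theta^{k-d}
    xi             : weights xi_1 >= ... >= xi_D of the selection rule
    quant_err_curr : norm of the current quantization error ||eps_m^k||
    quant_err_prev : norm of the error at the last uploaded iterate ||eps_hat_m^{k-1}||
    """
    innovation = np.linalg.norm(q_prev - q_curr) ** 2
    # Right-hand side in the spirit of (9a)/(48): weighted recent progress plus error slack.
    progress = sum(w * np.linalg.norm(diff) ** 2 for w, diff in zip(xi, model_diffs))
    rhs = progress / (alpha ** 2 * num_workers ** 2) \
          + 3.0 * (quant_err_curr ** 2 + quant_err_prev ** 2)
    return innovation <= rhs

In the algorithm's server-worker loop, each worker would evaluate such a test locally after receiving the broadcast model and transmit its quantized gradient innovation only when the test fails, which is what produces the reduced communication counts reported in the experiments.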

Jun Sun (S'16) received the B.S. degree from the College of Astronautics, Nanjing University of Aeronautics and Astronautics, China, in 2015. He is currently pursuing the Ph.D. degree with the College of Control Science and Engineering, Zhejiang University. He is a member of the Group of Networked Sensing and Control in the State Key Laboratory of Industrial Control Technology at Zhejiang University. His research interests include game theory and optimization with applications in electricity markets, data centers, and machine learning.

Tianyi Chen received the B.Eng. degree in Communication Science and Engineering from Fudan University in 2014, and the M.Sc. and Ph.D. degrees in Electrical and Computer Engineering (ECE) from the University of Minnesota (UMN) in 2016 and 2019, respectively. Since August 2019, he has been with the Department of Electrical, Computer, and Systems Engineering at Rensselaer Polytechnic Institute as an Assistant Professor. During 2017-2018, he was a visiting scholar at Harvard University, the University of California, Los Angeles, and the University of Illinois Urbana-Champaign. He was a Best Student Paper Award finalist at the 2017 Asilomar Conference on Signals, Systems, and Computers. He received the National Scholarship from China in 2013, the UMN ECE Department Fellowship in 2014, and the UMN Doctoral Dissertation Fellowship in 2017. His research interests lie in optimization and statistical signal processing with applications to distributed learning, reinforcement learning, and wireless networks.

Georgios B. Giannakis (Fellow'97) received his Diploma in Electrical Engineering from the National Technical University of Athens, Greece, in 1981. From 1982 to 1986 he was with the University of Southern California (USC), where he received his M.Sc. in Electrical Engineering (1983), M.Sc. in Mathematics (1986), and Ph.D. in Electrical Engineering (1986). He was a faculty member with the University of Virginia from 1987 to 1998, and since 1999 he has been a professor with the University of Minnesota, where he holds an ADC Endowed Chair, a University of Minnesota McKnight Presidential Chair in ECE, and serves as director of the Digital Technology Center. His general interests span the areas of statistical learning, communications, and networking, subjects on which he has published more than 470 journal papers, 770 conference papers, 26 book chapters, two edited books, and two research monographs. His current research focuses on data science and network science with applications to the Internet of Things and power networks with renewables. He is the (co-)inventor of 34 issued patents, and the (co-)recipient of 9 best journal paper awards from the IEEE Signal Processing (SP) and Communications Societies, including the G. Marconi Prize Paper Award in Wireless Communications. He also received the IEEE-SPS Norbert Wiener Society Award (2019); EURASIP's A. Papoulis Society Award (2020); Technical Achievement Awards from the IEEE-SPS (2000) and from EURASIP (2005); the IEEE ComSoc Education Award (2019); the G. W. Taylor Award for Distinguished Research from the University of Minnesota; and the IEEE Fourier Technical Field Award (2015). He is a foreign member of the Academia Europaea, and a Fellow of the National Academy of Inventors, the European Academy of Sciences, IEEE, and EURASIP. He has served the IEEE in a number of posts, including that of a Distinguished Lecturer for the IEEE-SPS.

Qinmin Yang received the Bachelor's degree in Electrical Engineering from Civil Aviation University of China, Tianjin, China, in 2001, the Master of Science degree in Control Science and Engineering from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2004, and the Ph.D. degree in Electrical Engineering from the University of Missouri-Rolla, MO, USA, in 2007. From 2007 to 2008, he was a Postdoctoral Research Associate at the University of Missouri-Rolla. From 2008 to 2009, he was a system engineer with Caterpillar Inc. From 2009 to 2010, he was a Postdoctoral Research Associate at the University of Connecticut. Since 2010, he has been with the State Key Laboratory of Industrial Control Technology, the College of Control Science and Engineering, Zhejiang University, China, where he is currently a professor. He has also held visiting positions at the University of Toronto and Lehigh University. He has been serving as an Associate Editor for IEEE Transactions on Systems, Man, and Cybernetics: Systems, IEEE Transactions on Neural Networks and Learning Systems, Transactions of the Institute of Measurement and Control, and Automatica Sinica. His research interests include intelligent control, renewable energy systems, smart grid, and industrial big data.

Zaiyue Yang (M'10) received the B.S. and M.S. degrees from the Department of Automation, University of Science and Technology of China, Hefei, China, in 2001 and 2004, respectively, and the Ph.D. degree from the Department of Mechanical Engineering, University of Hong Kong, in 2008. He was a Postdoctoral Fellow and Research Associate with the Department of Applied Mathematics, Hong Kong Polytechnic University, before joining the College of Control Science and Engineering, Zhejiang University, Hangzhou, China, in 2010. He then joined the Department of Mechanical and Energy Engineering, Southern University of Science and Technology, Shenzhen, China, in 2017, where he is currently a Professor. His current research interests include smart grid, signal processing, and control theory. Prof. Yang is an Associate Editor for the IEEE Transactions on Industrial Informatics.
