Federated Composite Optimization

Honglin Yuan 1 2    Manzil Zaheer 3    Sashank Reddi 3

Abstract

Federated Learning (FL) is a distributed learning paradigm that scales on-device learning collaboratively and privately. Standard FL algorithms such as FEDAVG are primarily geared towards smooth unconstrained settings. In this paper, we study the Federated Composite Optimization (FCO) problem, in which the loss function contains a non-smooth regularizer. Such problems arise naturally in FL applications that involve sparsity, low-rank, monotonicity, or more general constraints. We first show that straightforward extensions of primal algorithms such as FEDAVG are not well-suited for FCO since they suffer from the "curse of primal averaging," resulting in poor convergence. As a solution, we propose a new primal-dual algorithm, Federated Dual Averaging (FEDDUALAVG), which by employing a novel server dual averaging procedure circumvents the curse of primal averaging. Our theoretical analysis and empirical experiments demonstrate that FEDDUALAVG outperforms the other baselines.

Figure 1. Results on sparse (ℓ1-regularized) logistic regression for a federated fMRI dataset based on (Haxby, 2001). centralized corresponds to training on the centralized dataset gathered from all the training clients. local corresponds to training on the local data from only one training client without communication. FEDAVG (∂) corresponds to running FEDAVG with the subgradient in lieu of SGD to handle the non-smooth ℓ1-regularizer. FEDMID is another straightforward extension of FEDAVG running the local proximal gradient method (see Section 3.1 for details). Using our proposed algorithm FEDDUALAVG, one can 1) achieve performance comparable to the centralized baseline without the need to gather client data, and 2) significantly outperform the local baseline on the isolated data and the FEDAVG baseline. See Section 5.3 for details.

1. Introduction

Federated Learning (FL, Konečný et al. 2015; McMahan et al. 2017) is a novel distributed learning paradigm in which a large number of clients collaboratively train a shared model without disclosing their private local data. The two most distinct features of FL, when compared to classic distributed learning settings, are (1) heterogeneity in data amongst the clients and (2) very high cost to communicate with a client. Due to these aspects, classic distributed optimization algorithms have been rendered ineffective in FL settings (Kairouz et al., 2019). Several algorithms specifically catered towards FL settings have been proposed to address these issues. The most prominent amongst them is the Federated Averaging (FEDAVG) algorithm, which, by employing local SGD updates, significantly reduces the communication overhead under moderate client heterogeneity. Several follow-up works have focused on improving FEDAVG in various ways (e.g., Li et al. 2020a; Karimireddy et al. 2020; Reddi et al. 2020; Yuan & Ma 2020).

Existing FL research primarily focuses on unconstrained smooth objectives; however, many FL applications involve non-smooth objectives. Such problems arise naturally in the context of regularization (e.g., sparsity, low-rank, monotonicity, or additional constraints on the model). For instance, consider the problem of cross-silo biomedical FL, where medical organizations collaboratively aim to learn a global model on their patients' data without sharing it. In such applications, sparsity constraints are of paramount importance due to the nature of the problem, as it involves only a few data samples (e.g., patients) but with very high dimensions (e.g., fMRI scans). For the purpose of illustration, in Fig. 1 we present results on a federated sparse (ℓ1-regularized) logistic regression task for an fMRI dataset (Haxby, 2001).

1 Stanford University. 2 Based on work performed at Google Research. 3 Google Research. Correspondence to: Honglin Yuan.
Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

As shown, using a federated approach that can handle non-smooth objectives enables us to find a highly accurate sparse solution without sharing client data.

In this paper, we propose to study the Federated Composite Optimization (FCO) problem. As in standard FL, the losses are distributed to M clients. In addition, we assume all the clients share the same, possibly non-smooth, non-finite regularizer ψ. Formally, (FCO) is of the following form
\[
  \min_{w \in \mathbb{R}^d} \; \Phi(w) := F(w) + \psi(w) := \frac{1}{M} \sum_{m=1}^{M} F_m(w) + \psi(w),
  \qquad \text{(FCO)}
\]
where F_m(w) := E_{ξ^m ∼ D_m} f(w; ξ^m) is the loss at the m-th client and D_m is its local data distribution. We assume that each client m can access ∇f(w; ξ^m) by drawing independent samples ξ^m from its local distribution D_m. Common examples of ψ(w) include the ℓ1-regularizer or, more broadly, the ℓp-regularizer, the nuclear-norm regularizer (for matrix variables), the total variation (semi-)norm, etc. The (FCO) reduces to the standard federated optimization problem if ψ ≡ 0. The (FCO) also covers constrained federated optimization if one takes ψ to be the constraint characteristic χ_C(w) := 0 if w ∈ C and +∞ otherwise.

Standard FL algorithms such as FEDAVG (see Algorithm 1) and its variants (e.g., Li et al. 2020a; Karimireddy et al. 2020) are primarily tailored to smooth unconstrained settings and are therefore not well-suited for FCO. The most straightforward extension of FEDAVG towards (FCO) is to apply the local subgradient method (Shor, 1985) in lieu of SGD. This approach is largely ineffective due to the intrinsic slow convergence of the subgradient approach (Boyd et al., 2003), which is also demonstrated in Fig. 1 (marked FEDAVG (∂)). A more natural extension of FEDAVG is to replace the local SGD with proximal SGD (Parikh & Boyd 2014, a.k.a. projected SGD for constrained problems), or more generally, mirror descent (Duchi et al., 2010). We refer to this algorithm as Federated Mirror Descent (FEDMID, see Algorithm 2).

The most noticeable drawback of a primal-averaging method like FEDMID is the "curse of primal averaging": the desired regularization of FCO may be rendered completely ineffective by the server averaging step typically used in FL. For instance, consider an ℓ1-regularized logistic regression setting. Although each client is able to obtain a sparse solution, simply averaging the client states will inevitably yield a dense solution. See Fig. 2 for an illustrative example.

Figure 2. Illustration of the "curse of primal averaging". While each client of FEDMID can locate a sparse solution, simply averaging them will yield a much denser solution on the server side.

To overcome this challenge, we propose a novel primal-dual algorithm named Federated Dual Averaging (FEDDUALAVG, see Algorithm 3). Unlike FEDMID (or its precursor FEDAVG), the server averaging step of FEDDUALAVG operates in the dual space instead of the primal. Locally, each client runs the dual averaging algorithm (Nesterov, 2009) by tracking a pair of primal and dual states. During communication, the dual states are averaged across the clients. Thus, FEDDUALAVG employs a novel double averaging procedure: averaging of dual states across clients (as in FEDAVG), and averaging of gradients in the dual space (as in sequential dual averaging). Since both levels of averaging operate in the dual space, we can show that FEDDUALAVG provably overcomes the curse of primal averaging. Specifically, we prove that FEDDUALAVG can attain significantly lower communication complexity when deployed with a large client learning rate.

Contributions. In light of the above discussion, let us summarize our key contributions below:

• We propose a generalized federated learning problem, namely Federated Composite Optimization (FCO), with non-smooth regularizers and constraints.
• We propose a natural extension of FEDAVG, namely Federated Mirror Descent (FEDMID). We show that FEDMID can attain the mini-batch rate in the small client learning rate regime (Section 4.1). We argue that FEDMID may suffer from the "curse of primal averaging," which results in poor convergence, especially in the large client learning rate regime (Section 3.2).
• We propose a novel primal-dual algorithm named Federated Dual Averaging (FEDDUALAVG), which provably overcomes the curse of primal averaging (Section 3.3). Under certain realistic conditions, we show that, by virtue of the "double averaging" property, FEDDUALAVG can attain significantly lower communication complexity (Section 4.2).
• We demonstrate the empirical performance of FEDMID and FEDDUALAVG on various tasks, including ℓ1-regularization, nuclear-norm regularization, and various constraints in FL (Section 5).

Notations. We use [n] to denote the set {1, ..., n}. We use ⟨·,·⟩ to denote the inner product, ‖·‖ to denote an arbitrary norm, and ‖·‖* to denote its dual norm, unless otherwise specified. We use ‖·‖2 to denote the ℓ2 norm of a vector or the operator norm of a matrix, and ‖·‖_A to denote the vector norm induced by a positive definite matrix A, namely ‖w‖_A := √⟨w, Aw⟩. For any convex function g(w), we use g*(z) to denote its convex conjugate g*(z) := sup_{w ∈ R^d} {⟨z, w⟩ − g(w)}. We use w* to denote the optimum of the problem (FCO). We use O, Θ to hide multiplicative absolute constants only, and we write x ≲ y to denote x = O(y).

1.1. Related Work

In this subsection, we briefly discuss the main related work. We provide a more detailed literature review in Appendix A, including the relation to the classic composite optimization and distributed consensus optimization literature.

The first analysis of general FEDAVG was established by Stich (2019) for homogeneous client datasets. This result was improved by Haddadpour et al. (2019b); Khaled et al. (2020); Woodworth et al. (2020b); Yuan & Ma (2020) via tighter analysis and accelerated algorithms. For heterogeneous clients, numerous recent papers (Haddadpour et al., 2019b; Khaled et al., 2020; Li et al., 2020b; Koloskova et al., 2020; Woodworth et al., 2020a) studied the convergence of FEDAVG under various notions of heterogeneity measure. FEDAVG has also been studied for non-convex objectives (Zhou & Cong, 2018; Haddadpour et al., 2019a; Wang & Joshi, 2018; Yu & Jin, 2019; Yu et al., 2019a;b). Other variants of FEDAVG have been proposed to overcome heterogeneity challenges (e.g., Mohri et al. 2019; Liang et al. 2019; Li et al. 2020a; Wang et al. 2020; Karimireddy et al. 2020; Pathak & Wainwright 2020; Fallah et al. 2020; Hanzely et al. 2020; T. Dinh et al. 2020; Lin et al. 2020; He et al. 2020; Bistritz et al. 2020; Zhang et al. 2020). We refer readers to (Kairouz et al., 2019) for a comprehensive survey of recent advances in FL.

We note that none of the aforementioned works can handle non-smooth problems such as (FCO). Furthermore, the contributions of this work can potentially be integrated with other emerging techniques in FL (e.g., acceleration, adaptivity, variance reduction) to overcome challenges in FL such as communication efficiency and client heterogeneity.

2. Preliminaries

In this section, we review the necessary background for composite optimization and federated learning. A detailed technical exposition of these topics is relegated to Appendix C.

2.1. Composite Optimization

Composite optimization covers a variety of statistical inference, machine learning, and signal processing problems. Standard (non-distributed) composite optimization is defined as
\[
  \min_{w \in \mathbb{R}^d} \; \mathbb{E}_{\xi \sim \mathcal{D}} f(w; \xi) + \psi(w),
  \qquad \text{(CO)}
\]
where ψ is a non-smooth, possibly non-finite regularizer.

Proximal Gradient Method. A natural extension of SGD for (CO) is the following proximal gradient method (PGM):
\[
  w_{t+1} \leftarrow \mathrm{prox}_{\eta\psi}\big(w_t - \eta \nabla f(w_t; \xi_t)\big)
  = \arg\min_{w} \Big( \langle \eta \nabla f(w_t; \xi_t), w \rangle + \tfrac{1}{2}\|w - w_t\|_2^2 + \eta \psi(w) \Big).
  \qquad (2.1)
\]
The sub-problem Eq. (2.1) can be motivated by optimizing a quadratic upper bound of f together with the original ψ. This problem can often be solved efficiently by virtue of the special structure of ψ. For instance, one can verify that PGM reduces to projected SGD if ψ is a constraint characteristic χ_C, to soft thresholding if ψ(w) = λ‖w‖1, and to weight decay if ψ(w) := λ‖w‖2².
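To make the proximal oracles concrete, here is a minimal NumPy sketch (illustrative helper names, not from the paper) of the two closed-form proximal maps mentioned above, soft thresholding for ψ(w) = λ‖w‖1 and Euclidean projection onto a ball constraint, together with one stochastic PGM step in the sense of Eq. (2.1):

```python
import numpy as np

def prox_l1(v, tau):
    """Soft thresholding: proximal operator of tau * ||.||_1 evaluated at v."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def proj_l2_ball(v, radius):
    """Euclidean projection onto {w : ||w||_2 <= radius}, i.e. the prox of the constraint characteristic."""
    norm = np.linalg.norm(v)
    return v if norm <= radius else v * (radius / norm)

def pgm_step(w, grad, eta, lam):
    """One proximal (stochastic) gradient step for f + lam * ||.||_1, cf. Eq. (2.1)."""
    return prox_l1(w - eta * grad, eta * lam)
```

Replacing `prox_l1` by `proj_l2_ball` (or any other prox) recovers projected SGD for the corresponding constrained problem.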

Mirror Descent / Bregman-PGM. PGM can be generalized to the Bregman-PGM if one replaces the Euclidean proximity term by a general Bregman divergence, namely
\[
  w_{t+1} \leftarrow \arg\min_{w} \Big( \langle \eta \nabla f(w_t; \xi_t), w \rangle + \eta \psi(w) + D_h(w, w_t) \Big),
  \qquad (2.2)
\]
where h is a strongly convex distance-generating function and D_h is the Bregman divergence, which reduces to the Euclidean distance if one takes h(w) = ½‖w‖2². We will still refer to this step as a proximal step for ease of reference. This general formulation (2.2) enables an equivalent primal-dual interpretation:
\[
  w_{t+1} \leftarrow \nabla (h + \eta\psi)^* \big( \nabla h(w_t) - \eta \nabla f(w_t; \xi_t) \big).
  \qquad (2.3)
\]
A common interpretation of (2.3) is to decompose it into the following three sub-steps (Nemirovski & Yudin, 1983):

(a) Apply ∇h to carry w_t to a dual state (denoted as z_t).
(b) Update z_t to y_{t+1} with the gradient queried at w_t.
(c) Map y_{t+1} back to the primal via ∇(h + ηψ)*.

This formulation is known as the composite objective mirror descent (COMID, Duchi et al. 2010), or simply mirror descent in the literature (Flammarion & Bach, 2017).

Dual Averaging. An alternative approach for (CO) is the following dual averaging algorithm (Nesterov, 2009):
\[
  z_{t+1} \leftarrow z_t - \eta \nabla f\big( \nabla (h + \eta t \psi)^* (z_t); \xi_t \big).
  \qquad (2.4)
\]
Similarly, we can decompose (2.4) into two sub-steps:

(a) Apply ∇(h + ηtψ)* to map the dual state z_t to the primal w_t. Note that this sub-step can be reformulated as
\[
  w_t = \arg\min_{w} \big( \langle -z_t, w \rangle + \eta t \psi(w) + h(w) \big),
\]
which allows for efficient computation for many ψ.
(b) Update z_t to z_{t+1} with the gradient queried at w_t.

Dual averaging is also known as the "lazy" mirror descent algorithm (Bubeck, 2015) since it skips the forward mapping (∇h) step. Theoretically, mirror descent and dual averaging often share similar convergence rates for sequential (CO) (e.g., for smooth convex f, c.f. Flammarion & Bach 2017).

Remark. There are other algorithms that are popular for certain types of (CO) problems. For example, the Frank-Wolfe method (Frank & Wolfe, 1956; Jaggi, 2013) solves constrained optimization with a linear optimization oracle. The smoothing method (Nesterov, 2005) can also handle non-smoothness in objectives, but it is in general less efficient than specialized CO algorithms such as dual averaging (c.f., Nesterov 2018). In this work, we mostly focus on mirror descent and dual averaging algorithms since they only employ simple proximal oracles such as projection and soft thresholding. We refer readers to Appendix A.2 for additional related work in composite optimization.
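As a concrete instance of the recursion (2.4), the following is a hedged NumPy sketch of sequential composite dual averaging for the ℓ1 case with the Euclidean distance-generating function h(w) = ½‖w‖2², for which ∇(h + ηtψ)* is again soft thresholding. The function names and the `grad_fn` interface are illustrative assumptions, not the paper's code.

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def dual_averaging_l1(grad_fn, dim, eta, lam, num_steps):
    """Composite dual averaging (2.4) with h(w) = 0.5 * ||w||_2^2 and psi = lam * ||.||_1.

    grad_fn(w) should return a (stochastic) gradient of f at w.
    """
    z = np.zeros(dim)  # dual state; z_0 = grad h(w_0) = 0 for w_0 = 0
    for t in range(num_steps):
        w = soft_threshold(z, eta * t * lam)  # sub-step (a): dual -> primal map, grad (h + eta*t*psi)^*
        z = z - eta * grad_fn(w)              # sub-step (b): accumulate the gradient in the dual space
    return soft_threshold(z, eta * num_steps * lam)
```

Note that the regularizer only enters through the dual-to-primal map; the dual state itself is just a running sum of gradients, which is the property FEDDUALAVG exploits later.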
2.2. Federated Averaging

Federated Averaging (FEDAVG, McMahan et al. 2017) is the de facto standard algorithm for Federated Learning with unconstrained smooth objectives (namely ψ ≡ 0 for (FCO)). In this work, we follow the exposition of (Reddi et al., 2020), which splits the client learning rate and the server learning rate, offering more flexibility (see Algorithm 1).

Algorithm 1 Federated Averaging (FEDAVG)
1: procedure FEDAVG(w_0, η_c, η_s)
2:   for r = 0, ..., R−1 do
3:     sample a subset of clients S_r ⊆ [M]
4:     on client m ∈ S_r in parallel do
5:       w^m_{r,0} ← w_r                                   ▷ broadcast client initialization
6:       for k = 0, ..., K−1 do
7:         g^m_{r,k} ← ∇f(w^m_{r,k}; ξ^m_{r,k})             ▷ query gradient
8:         w^m_{r,k+1} ← w^m_{r,k} − η_c · g^m_{r,k}         ▷ client update
9:     Δ_r = (1/|S_r|) Σ_{m∈S_r} (w^m_{r,K} − w^m_{r,0})
10:    w_{r+1} ← w_r + η_s · Δ_r                             ▷ server update

FEDAVG involves a series of rounds in which each round consists of a client update phase and a server update phase. We denote the total number of rounds as R. At the beginning of each round r, a subset of clients S_r is sampled from the client pool of size M. The server state is then broadcast to the sampled clients as the client initialization. During the client update phase (highlighted in blue shade), each sampled client runs local SGD for K steps with client learning rate η_c on its own data. We use w^m_{r,k} to denote the m-th client state at the k-th local step of the r-th round. During the server update phase, the server averages the updates of the sampled clients and treats it as a pseudo-anti-gradient Δ_r (Line 9). The server then takes a server update step to update its server state with server learning rate η_s and the pseudo-anti-gradient Δ_r (Line 10).
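The following is a minimal NumPy sketch of one FEDAVG round as described in Algorithm 1 (full participation; the `client_grad_fns` interface and the sequential loop are illustrative simplifications, not the paper's implementation):

```python
import numpy as np

def fedavg_round(w_server, client_grad_fns, eta_c, eta_s, K):
    """One round of Algorithm 1: local SGD on each client, then a server step on the averaged update.

    client_grad_fns[m](w) returns a stochastic gradient of F_m at w.
    """
    deltas = []
    for grad_fn in client_grad_fns:          # "on client m in parallel" (written sequentially for clarity)
        w = w_server.copy()                   # broadcast client initialization (Line 5)
        for _ in range(K):
            w = w - eta_c * grad_fn(w)        # local SGD step (Lines 7-8)
        deltas.append(w - w_server)           # this client's round update
    delta_r = np.mean(deltas, axis=0)          # pseudo-anti-gradient (Line 9)
    return w_server + eta_s * delta_r          # server update (Line 10)
```

Feeding the returned state back into `fedavg_round` for R rounds reproduces the outer loop of Algorithm 1; the sketches for FEDMID and FEDDUALAVG below only change the client step and the aggregation space.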

3. Proposed Algorithms for FCO

In this section, we explore possible solutions to approach (FCO). As mentioned earlier, existing FL algorithms such as FEDAVG and its variants do not solve (FCO). Although it is possible to apply FEDAVG to non-smooth settings by using the subgradient in place of the gradient, such an approach is usually ineffective owing to the intrinsic slow convergence of subgradient methods (Boyd et al., 2003).

3.1. Federated Mirror Descent (FEDMID)

A more natural extension of FEDAVG towards (FCO) is to replace the local SGD steps in FEDAVG with local proximal gradient (mirror descent) steps (2.3). The resulting algorithm, which we refer to as Federated Mirror Descent (FEDMID),¹ is outlined in Algorithm 2.

Algorithm 2 Federated Mirror Descent (FEDMID)
1: procedure FEDMID(w_0, η_c, η_s)
2:   for r = 0, ..., R−1 do
3:     sample a subset of clients S_r ⊆ [M]
4:     on client m ∈ S_r in parallel do
5:       w^m_{r,0} ← w_r                                            ▷ broadcast primal initialization
6:       for k = 0, ..., K−1 do
7:         g^m_{r,k} ← ∇f(w^m_{r,k}; ξ^m_{r,k})                      ▷ query gradient
8:         w^m_{r,k+1} ← ∇(h + η_c ψ)*( ∇h(w^m_{r,k}) − η_c · g^m_{r,k} )
9:     Δ_r = (1/|S_r|) Σ_{m∈S_r} (w^m_{r,K} − w^m_{r,0})
10:    w_{r+1} ← ∇(h + η_s η_c K ψ)*( ∇h(w_r) + η_s · Δ_r )

Specifically, we make two changes compared to FEDAVG:

• The client local SGD steps in FEDAVG are replaced with proximal gradient steps (Line 8).
• The server update step is replaced with another proximal step (Line 10).

As a sanity check, for constrained (FCO) with ψ = χ_C, if one takes server learning rate η_s = 1 and the Euclidean distance h(w) = ½‖w‖2², FEDMID simply reduces to the following parallel projected SGD with periodic averaging (a code sketch of one such round follows below):

(a) Each sampled client runs K steps of projected SGD following w^m_{r,k+1} ← Proj_C(w^m_{r,k} − η_c g^m_{r,k}).
(b) After K local steps, the server simply averages the client states following w_{r+1} ← (1/|S_r|) Σ_{m∈S_r} w^m_{r,K}.

¹ Despite sharing the same term "prox", FEDMID is fundamentally different from FEDPROX (Li et al., 2020a). The proximal step in FEDPROX was introduced to regularize the client drift caused by heterogeneity, whereas the proximal step in this work is to overcome the non-smoothness of ψ. The problems approached by the two methods are also different: FEDPROX still solves an unconstrained smooth problem, whereas ours approaches (FCO).
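For concreteness, here is a hedged NumPy sketch of one FEDMID round (Algorithm 2) specialized to h(w) = ½‖w‖2² and ψ(w) = λ‖w‖1, so that both proximal maps become soft thresholding; it reuses the structure of the FEDAVG sketch above and is only an illustration under those assumptions:

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fedmid_round_l1(w_server, client_grad_fns, eta_c, eta_s, K, lam):
    """One FEDMID round (Algorithm 2) with h = 0.5*||.||_2^2 and psi = lam*||.||_1."""
    deltas = []
    for grad_fn in client_grad_fns:
        w = w_server.copy()                                            # broadcast primal initialization (Line 5)
        for _ in range(K):
            w = soft_threshold(w - eta_c * grad_fn(w), eta_c * lam)    # client proximal gradient step (Line 8)
        deltas.append(w - w_server)
    delta_r = np.mean(deltas, axis=0)                                   # average of primal updates (Line 9)
    return soft_threshold(w_server + eta_s * delta_r, eta_s * eta_c * K * lam)  # server proximal step (Line 10)
```

Swapping `soft_threshold` for a projection recovers the parallel projected SGD with periodic averaging described in items (a) and (b).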

3.2. Limitation of FEDMID: Curse of Primal Averaging

Despite its simplicity, FEDMID exhibits a major limitation, which we refer to as the "curse of primal averaging": the server averaging step in FEDMID may severely impede the optimization progress. To understand this phenomenon, let us consider the following two illustrative examples:

• Constrained problem: Suppose the optimum of the aforementioned constrained problem resides on a non-flat boundary of C. Even when each client is able to obtain a local solution on the boundary, the average of these solutions will almost surely be off the boundary (and hence away from the optimum) due to the curvature.
• Federated ℓ1-regularized logistic regression problem: Suppose each client obtains a local sparse solution; simply averaging them across clients will invariably yield a non-sparse solution.

As we will see theoretically (Section 4) and empirically (Section 5), the "curse of primal averaging" indeed hampers the performance of FEDMID.

3.3. Federated Dual Averaging (FEDDUALAVG)

Before we look into the solution to the curse of primal averaging, let us briefly investigate the cause of this effect. Recall that in standard smooth FL settings, the server averaging step is helpful because it implicitly pools the stochastic gradients and thereby reduces the variance (Stich, 2019). In FEDMID, however, the server averaging operates on the post-proximal primal states, but the gradient is updated in the dual space (recall the primal-dual interpretation of mirror descent in Section 2.1). This primal/dual mismatch creates an obstacle for primal averaging to benefit from the pooling of stochastic gradients in the dual space. This thought experiment suggests the importance of aligning the gradient update and the server averaging.

Building upon this intuition, we propose a novel primal-dual algorithm, named Federated Dual Averaging (FEDDUALAVG, Algorithm 3), which provably addresses the curse of primal averaging. The major novelty of FEDDUALAVG, in comparison with FEDMID or its precursor FEDAVG, is to operate the server averaging in the dual space instead of the primal. This facilitates the server to aggregate the gradient information, since the gradients are also accumulated in the dual space. A toy numerical sketch below illustrates both the failure mode and the dual-space remedy.
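The following toy NumPy illustration (synthetic numbers, not an experiment from the paper) makes the ℓ1 example above concrete: averaging sparse post-proximal primal states destroys sparsity, while averaging the pre-proximal (dual) states and applying the proximal map once afterwards preserves it.

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

rng = np.random.default_rng(0)
tau = 1.0
duals = rng.normal(size=(8, 20))                       # pre-proximal states of 8 hypothetical clients

primal_per_client = soft_threshold(duals, tau)         # each client's own state is sparse
primal_average = primal_per_client.mean(axis=0)        # FEDMID-style primal averaging
dual_then_prox = soft_threshold(duals.mean(axis=0), tau)  # average in "dual" space, then prox once

print("nonzeros per client:", (primal_per_client != 0).sum(axis=1))
print("nonzeros after primal averaging:", (primal_average != 0).sum())     # typically dense
print("nonzeros after dual averaging + prox:", (dual_then_prox != 0).sum())  # remains sparse
```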

Algorithm 3 Federated Dual Averaging (FEDDUALAVG)
1: procedure FEDDUALAVG(w_0, η_c, η_s)
2:   z_0 ← ∇h(w_0)                                            ▷ server dual initialization
3:   for r = 0, ..., R−1 do
4:     sample a subset of clients S_r ⊆ [M]
5:     on client m ∈ S_r in parallel do
6:       z^m_{r,0} ← z_r                                      ▷ broadcast dual initialization
7:       for k = 0, ..., K−1 do
8:         η̃_{r,k} ← η_s η_c r K + η_c k
9:         w^m_{r,k} ← ∇(h + η̃_{r,k} ψ)*(z^m_{r,k})            ▷ retrieve primal
10:        g^m_{r,k} ← ∇f(w^m_{r,k}; ξ^m_{r,k})                 ▷ query gradient
11:        z^m_{r,k+1} ← z^m_{r,k} − η_c g^m_{r,k}              ▷ client dual update
12:    Δ_r = (1/|S_r|) Σ_{m∈S_r} (z^m_{r,K} − z^m_{r,0})
13:    z_{r+1} ← z_r + η_s Δ_r                                  ▷ server dual update
14:    w_{r+1} ← ∇(h + η_s η_c (r+1) K ψ)*(z_{r+1})              ▷ (optional) retrieve server primal state

Formally, each client maintains a pair of primal and dual states (w^m_{r,k}, z^m_{r,k}). At the beginning of each round, the client dual state is initialized with the server dual state. During the client update stage, each client runs dual averaging steps following (2.4) to update its primal and dual states (highlighted in blue shade). The coefficient of ψ, namely η̃_{r,k}, balances the contributions from F and ψ. At the end of each client update phase, the dual updates (instead of the primal updates) are returned to the server. The server then averages the dual updates of the sampled clients and updates the server dual state. We observe that the averaging in FEDDUALAVG is two-fold: (1) averaging of gradients in the dual space within a client and (2) averaging of dual states across clients at the server. As we shall see shortly in our theoretical analysis, this novel "double" averaging of FEDDUALAVG in the non-smooth case enables lower communication complexity and faster convergence of FEDDUALAVG under realistic assumptions.
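Below is a hedged NumPy sketch of one FEDDUALAVG round (Algorithm 3) for the same Euclidean ℓ1 setting used in the earlier sketches, so that ∇(h + η̃ψ)* is soft thresholding; full participation is assumed and all names are illustrative rather than the paper's implementation.

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def feddualavg_round_l1(z_server, r, client_grad_fns, eta_c, eta_s, K, lam):
    """One FEDDUALAVG round (Algorithm 3) with h = 0.5*||.||_2^2 and psi = lam*||.||_1.

    z_server is the server dual state z_r; returns (z_{r+1}, w_{r+1}).
    """
    deltas = []
    for grad_fn in client_grad_fns:
        z = z_server.copy()                                   # broadcast dual initialization (Line 6)
        for k in range(K):
            eta_tilde = eta_s * eta_c * r * K + eta_c * k     # coefficient of psi (Line 8)
            w = soft_threshold(z, eta_tilde * lam)            # retrieve primal (Line 9)
            z = z - eta_c * grad_fn(w)                        # client dual update (Lines 10-11)
        deltas.append(z - z_server)                           # dual (not primal) update returned to the server
    z_next = z_server + eta_s * np.mean(deltas, axis=0)       # server dual update (Lines 12-13)
    w_next = soft_threshold(z_next, eta_s * eta_c * (r + 1) * K * lam)  # optional primal retrieval (Line 14)
    return z_next, w_next
```

The only structural difference from the FEDMID sketch is that the averaging happens on the accumulated dual states, with the proximal map applied only when a primal point is needed.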

4. Theoretical Results

In this section, we present the theoretical results for FEDMID and FEDDUALAVG. We assume the following assumptions throughout the paper. The convex analysis definitions in Assumption 1 are reviewed in Appendix C.

Assumption 1. Let ‖·‖ be a norm and ‖·‖* be its dual.

(a) ψ: R^d → R ∪ {+∞} is a closed convex function with closed dom ψ. Assume that Φ(w) = F(w) + ψ(w) attains a finite optimum at w* ∈ dom ψ.
(b) h: R^d → R ∪ {+∞} is a Legendre function that is 1-strongly-convex w.r.t. ‖·‖. Assume dom ψ ⊆ dom h.
(c) f(·, ξ): R^d → R is a closed convex function that is differentiable on dom ψ for any fixed ξ. In addition, f(·, ξ) is L-smooth w.r.t. ‖·‖ on dom ψ, namely for any u, w ∈ dom ψ,
\[
  f(u; \xi) \le f(w; \xi) + \langle \nabla f(w; \xi), u - w \rangle + \tfrac{L}{2} \|u - w\|^2.
\]
(d) ∇f has σ²-bounded variance over D_m under ‖·‖* within dom ψ, namely for any w ∈ dom ψ,
\[
  \mathbb{E}_{\xi \sim \mathcal{D}_m} \|\nabla f(w, \xi) - \nabla F_m(w)\|_*^2 \le \sigma^2, \quad \text{for any } m \in [M].
\]
(e) Assume that all the M clients participate in the client updates for every round, namely S_r = [M].

Assumption 1(a) & (b) are fairly standard for composite optimization analysis (c.f. Flammarion & Bach 2017). Assumption 1(c) & (d) are standard assumptions in the stochastic federated optimization literature (Khaled et al., 2020; Woodworth et al., 2020b). (e) is assumed to simplify the exposition of the theoretical results. All results presented can easily be generalized to the partial participation case.

Remark. This work focuses on convex settings because non-convex composite optimization (with either F or ψ non-convex) is noticeably challenging and under-developed even for non-distributed settings. This is in sharp contrast to non-convex smooth optimization, for which simple algorithms such as SGD readily work. Existing literature on non-convex CO (e.g., Attouch et al. 2013; Chouzenoux et al. 2014; Li & Pong 2015; Bredies et al. 2015) typically relies on non-trivial additional assumptions (such as K-Ł conditions) and sophisticated algorithms. Hence, it is beyond the scope of this work to study non-convex FCO.²

² However, we conjecture that for simple non-convex settings (e.g., optimizing a non-convex f over a convex set, as tested in Appendix B.5), it is possible to show convergence and obtain similar advantageous results for FEDDUALAVG.

4.1. FEDMID and FEDDUALAVG: Small Client Learning Rate Regime

We first show that both FEDMID and FEDDUALAVG are (asymptotically) at least as good as stochastic mini-batch algorithms with R iterations and batch size MK when the client learning rate η_c is sufficiently small.

Theorem 4.1 (Simplified from Theorem F.1). Assuming Assumption 1, then for a sufficiently small client learning rate η_c and server learning rate η_s = Θ(min{ 1/(η_c K L), B^{1/2} M^{1/2} / (σ η_c K^{1/2} R^{1/2}) }), both FEDDUALAVG and FEDMID can output ŵ such that
\[
  \mathbb{E}[\Phi(\hat{w})] - \Phi(w^\star) \lesssim \frac{LB}{R} + \frac{\sigma B^{1/2}}{\sqrt{MKR}},
  \qquad (4.1)
\]
where B := D_h(w*, w_0).

The intuition is that when η_c is small, the client update will not drift too far away from its initialization of the round. Due to space constraints, the proof is relegated to Appendix F.

4.2. FEDDUALAVG with a Larger Client Learning Rate: Usefulness of Local Steps

In this subsection, we show that FEDDUALAVG may attain stronger results with a larger client learning rate. In addition to possibly faster convergence, Theorems 4.2 and 4.3 also indicate that FEDDUALAVG allows for a much broader search scope of efficient learning-rate configurations, which is of key importance for practical purposes.

Bounded Gradient. We first consider the setting with bounded gradients. Unlike in the unconstrained case, the gradient bound may be particularly reasonable when the constraint set is bounded.

Theorem 4.2 (Simplified from Theorem D.1). Assuming Assumption 1 and sup_{w ∈ dom ψ} ‖∇f(w, ξ)‖* ≤ G, then for FEDDUALAVG with η_s = 1 and η_c ≤ 1/(4L), considering
\[
  \hat{w} := \frac{1}{KR} \sum_{r=0}^{R-1} \sum_{k=1}^{K} \left[ \nabla (h + \tilde{\eta}_{r,k}\psi)^* \left( \frac{1}{M} \sum_{m=1}^{M} z^m_{r,k} \right) \right],
  \qquad (4.2)
\]
the following inequality holds
\[
  \mathbb{E}[\Phi(\hat{w})] - \Phi(w^\star) \lesssim \frac{B}{\eta_c K R} + \frac{\eta_c \sigma^2}{M} + \eta_c^2 L K^2 G^2,
\]
where B := D_h(w*, w_0). Moreover, there exists η_c such that
\[
  \mathbb{E}[\Phi(\hat{w})] - \Phi(w^\star) \lesssim \frac{LB}{KR} + \frac{\sigma B^{1/2}}{\sqrt{MKR}} + \frac{L^{1/3} B^{2/3} G^{2/3}}{R^{2/3}}.
  \qquad (4.3)
\]
We refer the reader to Appendix D for the complete proof details of Theorem 4.2.

Remark. The result in Theorem 4.2 not only matches the rate by Stich (2019) for smooth, unconstrained FEDAVG but also allows for a general non-smooth composite ψ, a general Bregman divergence induced by h, and an arbitrary norm ‖·‖. Compared with the small learning rate result Theorem 4.1, the first term in Eq. (4.3) is improved from LB/R to LB/(KR), whereas the third term incurs an additional loss for infrequent communication. One can verify that the bound Eq. (4.3) is better than Eq. (4.1) if R ≲ L²B/G². Therefore, the larger client learning rate may be preferred when the communication is not too infrequent.
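As an added, hedged reasoning step (a standard balancing argument sketched by us, not reproduced from Appendix D): Eq. (4.3) follows from the first bound in Theorem 4.2 by choosing η_c to equalize the competing terms, subject to the cap η_c ≤ 1/(4L).

```latex
% Balance the three terms  B/(eta_c K R) + eta_c sigma^2 / M + eta_c^2 L K^2 G^2  of Theorem 4.2:
\eta_c \le \tfrac{1}{4L}
  \;\Rightarrow\; \frac{B}{\eta_c K R} \gtrsim \frac{LB}{KR};
\qquad
\frac{B}{\eta_c K R} = \frac{\eta_c \sigma^2}{M}
  \;\Rightarrow\; \eta_c = \sqrt{\frac{BM}{\sigma^2 K R}}
  \;\Rightarrow\; \text{both terms} \simeq \frac{\sigma B^{1/2}}{\sqrt{MKR}};
\qquad
\frac{B}{\eta_c K R} = \eta_c^2 L K^2 G^2
  \;\Rightarrow\; \eta_c = \Big(\frac{B}{L K^3 R G^2}\Big)^{1/3}
  \;\Rightarrow\; \text{both terms} \simeq \frac{L^{1/3} B^{2/3} G^{2/3}}{R^{2/3}}.
% Taking eta_c as the smallest of the three candidates yields the three terms of Eq. (4.3).
```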

Bounded Heterogeneity. Next, we consider the setting with bounded heterogeneity. For simplicity, we focus on the case where the loss F is quadratic, as shown in Assumption 2. We will discuss other options to relax the quadratic assumption in Section 4.3.

Assumption 2 (Bounded heterogeneity, quadratic).

(a) The heterogeneity of ∇F_m is bounded, namely
\[
  \sup_{w \in \mathrm{dom}\,\psi} \|\nabla F_m(w) - \nabla F(w)\|_*^2 \le \zeta^2, \quad \text{for any } m \in [M].
\]
(b) F(w) := ½ w^⊤ Q w + c^⊤ w for some Q ⪰ 0.
(c) Assume Assumption 1 is satisfied in which the norm ‖·‖ is taken to be the ‖·‖_Q-norm, namely ‖w‖_Q := √(w^⊤ Q w / ‖Q‖₂).

Remark. Assumption 2(a) is a standard assumption to bound the heterogeneity among clients (e.g., Woodworth et al. 2020a). Note that Assumption 2 only assumes the objective F to be quadratic. We do not impose any stronger assumptions on either the composite function ψ or the distance-generating function h. Therefore, this result still applies to a broad class of problems such as LASSO.

The following results hold under Assumption 2. We sketch the proof in Section 4.3 and defer the details to Appendix E.

Theorem 4.3 (Simplified from Theorem E.1). Assuming Assumption 2, then for FEDDUALAVG with η_s = 1 and η_c ≤ 1/(4L), the following inequality holds
\[
  \mathbb{E}[\Phi(\hat{w})] - \Phi(w^\star) \lesssim \frac{B}{\eta_c K R} + \frac{\eta_c \sigma^2}{M} + \eta_c^2 L K \sigma^2 + \eta_c^2 L K^2 \zeta^2,
\]
where ŵ is the same as defined in Eq. (4.2) and B := D_h(w*, w_0). Moreover, there exists η_c such that
\[
  \mathbb{E}[\Phi(\hat{w})] - \Phi(w^\star) \lesssim \frac{LB}{KR} + \frac{\sigma B^{1/2}}{\sqrt{MKR}} + \frac{L^{1/3} B^{2/3} \sigma^{2/3}}{K^{1/3} R^{2/3}} + \frac{L^{1/3} B^{2/3} \zeta^{2/3}}{R^{2/3}}.
  \qquad (4.4)
\]

Remark. The result in Theorem 4.3 matches the best-known convergence rate for smooth, unconstrained FEDAVG (Khaled et al., 2020; Woodworth et al., 2020a), while our results allow for a general composite ψ and non-Euclidean distances. Compared with Theorem 4.2, the overhead in Eq. (4.4) involves the variance σ and the heterogeneity ζ but no longer depends on G. The bound Eq. (4.4) can significantly outperform the previous ones when the variance σ and the heterogeneity ζ are relatively mild.

4.3. Proof Sketch and Discussions

In this subsection, we demonstrate our proof framework by sketching the proof of Theorem 4.3.

Step 1: Convergence of Dual Shadow Sequence. We start by characterizing the convergence of the dual shadow sequence z̄_{r,k} := (1/M) Σ_{m=1}^M z^m_{r,k}. The key observation for FEDDUALAVG when η_s = 1 is the following relation (an elementary check is spelled out at the end of this subsection):
\[
  \bar{z}_{r,k+1} = \bar{z}_{r,k} - \eta_c \cdot \frac{1}{M} \sum_{m=1}^{M} \nabla f(w^m_{r,k}; \xi^m_{r,k}).
  \qquad (4.5)
\]
This suggests that the shadow sequence z̄_{r,k} almost executes a dual averaging update (2.4), but with the perturbed gradient (1/M) Σ_{m=1}^M ∇f(w^m_{r,k}; ξ^m_{r,k}). To this end, we extend the perturbed iterate analysis framework (Mania et al., 2017) to the dual space. Theoretically, we show the following Lemma 4.4, with the proof relegated to Appendix D.2.

Lemma 4.4 (Convergence of dual shadow sequence of FEDDUALAVG, simplified version of Lemma D.2). Assuming Assumption 1, then for FEDDUALAVG with η_s = 1 and η_c ≤ 1/(4L), the following inequality holds
\[
  \frac{1}{KR} \sum_{r=0}^{R-1} \sum_{k=1}^{K} \mathbb{E}\Big[ \Phi\big( \nabla (h + \tilde{\eta}_{r,k}\psi)^* (\bar{z}_{r,k}) \big) \Big] - \Phi(w^\star)
  \lesssim
  \underbrace{\frac{B}{\eta_c K R} + \frac{\eta_c \sigma^2}{M}}_{\text{rate if synchronized every iteration}}
  \; + \;
  \underbrace{\frac{\eta_c L}{MKR} \sum_{r=0}^{R-1} \sum_{k=0}^{K-1} \sum_{m=1}^{M} \mathbb{E}\,\|\bar{z}_{r,k} - z^m_{r,k}\|_*^2}_{\text{discrepancy overhead}}.
  \qquad (4.6)
\]
The first two terms correspond to the rate when FEDDUALAVG is synchronized at every step. The last term corresponds to the overhead for not synchronizing at every step, which we call the "discrepancy overhead". Lemma 4.4 can serve as a general interface towards the convergence of FEDDUALAVG, as it only assumes the blanket Assumption 1.

Remark. Note that the relation (4.5) is not satisfied by FEDMID due to the incommutability of the proximal operator and the averaging operator, which thereby breaks Lemma 4.4. Intuitively, this means that FEDMID fails to pool the gradients properly (up to a high-order error) in the absence of communication. FEDDUALAVG overcomes the incommutability issue because all the gradients are accumulated and averaged in the dual space, whereas the proximal step only operates at the interface from dual to primal. This key difference explains the "curse of primal averaging" from the theoretical perspective.

Step 2: Bounding the Discrepancy Overhead via Stability Analysis. The next step is to bound the discrepancy term introduced in Eq. (4.6). Intuitively, this term characterizes the stability of FEDDUALAVG, in the sense of how far a single client can deviate from the average (in the dual space) if there is no synchronization for k steps. However, unlike the smooth convex unconstrained settings, in which the stability of SGD is known to be well-behaved (Hardt et al., 2016), the stability analysis for composite optimization is challenging and absent from the literature. We identify that the main challenge originates from the asymmetry of the Bregman divergence. In this work, we provide a set of simple conditions, namely Assumption 2, such that the stability of FEDDUALAVG is well-behaved.

Lemma 4.5 (Dual stability of FEDDUALAVG under Assumption 2, simplified version of Lemma E.2). Under the same settings as Theorem 4.3, the following inequality holds
\[
  \frac{1}{M} \sum_{m=1}^{M} \mathbb{E}\,\|\bar{z}_{r,k} - z^m_{r,k}\|_*^2 \lesssim \eta_c^2 K \sigma^2 + \eta_c^2 K^2 \zeta^2.
\]

Step 3: Deciding η_c. The final step is to plug the bound from Step 2 back into Step 1 and find an appropriate η_c to optimize the resulting upper bound. We defer the details to Appendix E.
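To make Step 1 concrete, the relation (4.5) can be verified directly from the client dual update in Algorithm 3; the short derivation below is our own spelled-out check under full participation and η_s = 1, not text from the paper's appendix.

```latex
% Average the client dual update (Line 11), z^m_{r,k+1} = z^m_{r,k} - eta_c * grad f(w^m_{r,k}; xi^m_{r,k}),
% over the M participating clients:
\bar{z}_{r,k+1}
  = \frac{1}{M}\sum_{m=1}^{M} z^m_{r,k+1}
  = \frac{1}{M}\sum_{m=1}^{M}\Big( z^m_{r,k} - \eta_c \nabla f(w^m_{r,k};\xi^m_{r,k}) \Big)
  = \bar{z}_{r,k} - \eta_c \cdot \frac{1}{M}\sum_{m=1}^{M} \nabla f(w^m_{r,k};\xi^m_{r,k}),
% which is exactly (4.5). With eta_s = 1 and z^m_{r,0} = z_r for all m, the server dual update gives
% z_{r+1} = z_r + \frac{1}{M}\sum_m (z^m_{r,K} - z^m_{r,0}) = \bar{z}_{r,K},
% so the same recursion carries over across round boundaries.
```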

5. Numerical Experiments

In this section, we validate our theory and demonstrate the efficiency of the algorithms via numerical experiments. We mostly compare FEDDUALAVG with FEDMID since the latter serves as a natural baseline. We do not present subgradient-FEDAVG in this section due to its consistent ineffectiveness, as demonstrated in Fig. 1 (marked FEDAVG (∂)). To examine the necessity of the client proximal step, we also test two less-principled versions of FEDMID and FEDDUALAVG, in which the proximal steps are only performed on the server side. We refer to these two versions as FEDMID-OSP and FEDDUALAVG-OSP, where "OSP" stands for "only server proximal," with pseudo-code provided in Appendix B.1. We provide the complete setup details in Appendix B, including but not limited to hyper-parameter tuning, dataset processing, and evaluation metrics. The source code is available at https://github.com/hongliny/FCO-ICML21.

5.1. Federated LASSO for Sparse Feature Recovery

In this subsection, we consider the LASSO (ℓ1-regularized least-squares) problem on a synthetic dataset, motivated by models from the biomedical and signal processing literature (e.g., Ryali et al. 2010; Chen et al. 2012). The goal is to recover the sparse signal w from noisy observations (x, y):
\[
  \min_{w \in \mathbb{R}^d,\, b \in \mathbb{R}} \; \frac{1}{M} \sum_{m=1}^{M} \mathbb{E}_{(x,y) \sim \mathcal{D}_m} (x^\top w + b - y)^2 + \lambda \|w\|_1.
\]
To generate the synthetic dataset, we first fix a sparse ground truth w_real ∈ R^d and some bias b_real ∈ R, and then sample the dataset (x, y) following y = x^⊤ w_real + b_real + ε for some noise ε. We let the distribution of (x, y) vary over clients to simulate heterogeneity. We select λ so that the centralized solver (on the gathered data) can successfully recover the sparsity pattern. Since the ground truth w_real is known, we can assess the quality of the recovered sparse features by comparing them with the ground truth.

We evaluate the performance by recording precision, recall, sparsity density, and F1-score. We tune the client learning rate η_c and the server learning rate η_s only to attain the best F1-score. The results are presented in Fig. 3. The best learning-rate configuration is η_c = 0.01, η_s = 1 for FEDDUALAVG, and η_c = 0.001, η_s = 0.3 for the other algorithms (including FEDMID). This matches our theory that FEDDUALAVG can benefit from larger learning rates. We defer the rest of the setup details and further experiments to Appendix B.2.

Figure 3. Sparsity recovery on a synthetic LASSO problem with 50% sparse ground truth. Observe that FEDDUALAVG not only identifies most of the sparsity pattern but also is the fastest. It is also worth noting that the less-principled FEDDUALAVG-OSP is also very competitive. The poor performance of FEDMID can be attributed to the "curse of primal averaging", as the server averaging step "smooths out" the sparsity pattern, which is corroborated empirically by the least sparse solution obtained by FEDMID.

5.2. Federated Low-Rank Matrix Estimation via Nuclear-Norm Regularization

In this subsection, we consider a low-rank matrix estimation problem via nuclear-norm regularization:
\[
  \min_{W,\, b} \; \frac{1}{M} \sum_{m=1}^{M} \mathbb{E}_{(X,y) \sim \mathcal{D}_m} (\langle X, W \rangle + b - y)^2 + \lambda \|W\|_{\mathrm{nuc}},
\]
where ‖W‖_nuc denotes the matrix nuclear norm. The goal is to recover a low-rank matrix W from noisy observations (X, y). This formulation captures a variety of problems such as low-rank matrix completion and recommendation systems (Candès & Recht, 2009). Note that the proximal operator with respect to the nuclear-norm regularizer reduces to the singular-value thresholding operation (Cai et al., 2010); a short sketch of this operation is given at the end of this subsection.

We evaluate the algorithms on a synthetic federated dataset with a known low-rank ground truth W_real ∈ R^{d1×d2} and bias b_real ∈ R, similar to the above LASSO experiments. We focus on four metrics for this task: the training (regularized) loss, the validation mean squared error, the recovered rank, and the recovery error in Frobenius norm ‖W_output − W_real‖_F. We tune the client learning rate η_c and the server learning rate η_s only to attain the best recovery error. We also record the results obtained by a deterministic solver on the centralized data, marked as optimum. The results are presented in Fig. 4. We provide the rest of the setup details and more experiments in Appendix B.3.

Figure 4. Low-rank matrix estimation comparison on a synthetic dataset with a ground truth of rank 16. We observe that FEDDUALAVG finds the solution with the exact rank in fewer than 100 communication rounds. FEDMID and FEDMID-OSP converge more slowly in loss and rank. The unprincipled FEDDUALAVG-OSP can generate low-rank solutions but is far less accurate.
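For reference, the following is a minimal NumPy sketch of the singular-value thresholding operation (the standard construction of the nuclear-norm proximal map via Cai et al. 2010; the helper name and usage comment are ours, not the paper's code):

```python
import numpy as np

def svt(W, tau):
    """Proximal operator of tau * ||.||_nuc: soft-threshold the singular values of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return (U * s_shrunk) @ Vt

# Usage inside the FEDMID / FEDDUALAVG sketches above: replace the vector soft threshold by
# svt(.), e.g. W_next = svt(Z_next, eta_tilde * lam) for the dual-to-primal map on matrices.
W = np.random.default_rng(1).normal(size=(6, 4))
print(np.linalg.matrix_rank(svt(W, 1.5)))  # thresholding the singular values reduces the rank
```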

5.3. Sparse Logistic Regression for fMRI Scans

In this subsection, we consider the cross-silo setup of learning a binary classifier on fMRI scans. For this purpose, we use the data collected by Haxby (2001) to understand the pattern of response in the ventral temporal (vt) area of the brain given a visual stimulus. There were six subjects doing image recognition in a block-design experiment over 11 to 12 sessions, with a total of 71 sessions. Each session consists of 18 fMRI scans under the stimuli of a picture of either a house or a face. We use the nilearn package (Abraham et al., 2014) to normalize and transform the four-dimensional raw fMRI scan data into an array with 39,912 volumetric pixels (voxels) using the standard mask. We learn a sparse (ℓ1-regularized) binary logistic regression on the voxels to classify the stimulus given the voxel input. Enforcing sparsity is crucial for this task, as it allows domain experts to understand which part of the brain differentiates between the stimuli.

We select five (out of six) subjects as the training set and the last subject as the held-out validation set. We treat each session as a client, with a total of 59 training clients and 12 validation clients, where each client possesses the voxel data of 18 scans. As in the previous experiments, we tune the client learning rate η_c and the server learning rate η_s only. We set the ℓ1-regularization strength to 10^{-3}. For each setup, we run the federated algorithms for 300 communication rounds.

We compare the algorithms with two non-federated baselines: 1) centralized corresponds to training on the centralized dataset gathered from all the training clients; 2) local corresponds to training on the local data from only one training client without communication. The results are shown in Fig. 5. In Appendix B.4.2, we provide another presentation of this experiment to visualize the progress of the federated algorithms and to assess the robustness to learning-rate configurations. The results suggest that FEDDUALAVG not only recovers sparse and accurate solutions, but also behaves most robustly to learning-rate configurations. We defer the rest of the setup details to Appendix B.4.

In Appendix B.5, we provide another set of experiments on federated constrained optimization for the Federated EMNIST dataset (Caldas et al., 2019).

Figure 5. Results on ℓ1-regularized logistic regression for the fMRI data from (Haxby, 2001). We observe that FEDDUALAVG yields sparse and accurate solutions that are comparable with the centralized baseline. FEDMID and FEDMID-OSP provide denser solutions that are relatively less accurate. The unprincipled FEDDUALAVG-OSP can provide sparse solutions but is far less accurate.

6. Conclusion

In this paper, we have shown the shortcomings of primal FL algorithms for FCO and proposed a primal-dual method (FEDDUALAVG) to tackle them. Our theoretical and empirical analyses provide strong evidence to support the superior performance of FEDDUALAVG over natural baselines. Potential future directions include control-variate and acceleration based methods for FCO, and applying FCO to personalized settings.

Acknowledgement

We would like to thank Zachary Charles, Zheng Xu, Andrew Hard, Ehsan Amid, Amr Ahmed, Aranyak Mehta, Qian Li, Junzi Zhang, Tengyu Ma, and the TensorFlow Federated team for helpful discussions at various stages of this work. Honglin Yuan would like to thank the support of the TOTAL Innovation Scholars program. We would like to thank the anonymous reviewers for their suggestions and comments.

References

Abraham, A., Pedregosa, F., Eickenberg, M., Gervais, P., Mueller, A., Kossaifi, J., Gramfort, A., Thirion, B., and Varoquaux, G. Machine learning for neuroimaging with scikit-learn. Frontiers in Neuroinformatics, 8, 2014.
Al-Shedivat, M., Gillenwater, J., Xing, E., and Rostamizadeh, A. Federated Learning via Posterior Averaging: A New Perspective and Practical Algorithms. In International Conference on Learning Representations, 2021.
Attouch, H., Bolte, J., and Svaiter, B. F. Convergence of descent methods for semi-algebraic and tame problems: Proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods. Mathematical Programming, 137(1-2), 2013.
Bauschke, H. H., Borwein, J. M., et al. Legendre functions and the method of random Bregman projections. Journal of Convex Analysis, 4(1), 1997.
Beck, A. and Teboulle, M. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3), 2003.
Bistritz, I., Mann, A., and Bambos, N. Distributed Distillation for On-Device Learning. In Advances in Neural Information Processing Systems 33, 2020.
Boyd, S., Xiao, L., and Mutapcic, A. Subgradient methods. Lecture notes of EE392o, Stanford University, Autumn Quarter 2003-2004, 2003.
Bredies, K., Lorenz, D. A., and Reiterer, S. Minimization of Non-smooth, Non-convex Functionals by Iterative Thresholding. Journal of Optimization Theory and Applications, 165(1), 2015.
Bregman, L. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3), 1967.
Bubeck, S. Convex Optimization: Algorithms and Complexity. 2015.
Cai, J.-F., Candès, E. J., and Shen, Z. A Singular Value Thresholding Algorithm for Matrix Completion. SIAM Journal on Optimization, 20(4), 2010.
Caldas, S., Duddu, S. M. K., Wu, P., Li, T., Konečný, J., McMahan, H. B., Smith, V., and Talwalkar, A. LEAF: A Benchmark for Federated Settings. In NeurIPS 2019 Workshop on Federated Learning for Data Privacy and Confidentiality, 2019.
Candès, E. J. and Recht, B. Exact Matrix Completion via Convex Optimization. Foundations of Computational Mathematics, 9(6), 2009.
Chen, F., Luo, M., Dong, Z., Li, Z., and He, X. Federated Meta-Learning with Fast Convergence and Efficient Communication. arXiv:1802.07876 [cs], 2019.
Chen, X., Lin, Q., and Pena, J. Optimal regularized dual averaging methods for stochastic optimization. In Advances in Neural Information Processing Systems 25. Curran Associates, Inc., 2012.
Chouzenoux, E., Pesquet, J.-C., and Repetti, A. Variable Metric Forward-Backward Algorithm for Minimizing the Sum of a Differentiable Function and a Convex Function. Journal of Optimization Theory and Applications, 162(1), 2014.
Daubechies, I., Defrise, M., and De Mol, C. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11), 2004.
Dekel, O., Gilad-Bachrach, R., Shamir, O., and Xiao, L. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(6), 2012.
Deng, Y., Kamani, M. M., and Mahdavi, M. Adaptive Personalized Federated Learning. arXiv:2003.13461 [cs, stat], 2020.
Diakonikolas, J. and Orecchia, L. The Approximate Duality Gap Technique: A Unified Theory of First-Order Methods. SIAM Journal on Optimization, 29(1), 2019.
Duchi, J. and Singer, Y. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10(99), 2009.
Duchi, J. C., Shalev-Shwartz, S., Singer, Y., and Tewari, A. Composite objective mirror descent. In COLT 2010, 2010.
Duchi, J. C., Agarwal, A., and Wainwright, M. J. Dual Averaging for Distributed Optimization: Convergence Analysis and Network Scaling. IEEE Transactions on Automatic Control, 57(3), 2012.
Fallah, A., Mokhtari, A., and Ozdaglar, A. E. Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach. In Advances in Neural Information Processing Systems 33, 2020.
Flammarion, N. and Bach, F. Stochastic composite least-squares regression with convergence rate O(1/n). In Proceedings of the 2017 Conference on Learning Theory, volume 65. PMLR, 2017.
Frank, M. and Wolfe, P. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2), 1956.
Godichon-Baggioni, A. and Saadane, S. On the rates of convergence of parallelized averaged stochastic gradient algorithms. Statistics, 54(3), 2020.
Haddadpour, F., Kamani, M. M., Mahdavi, M., and Cadambe, V. Trading redundancy for communication: Speeding up distributed SGD for non-convex optimization. In Proceedings of the 36th International Conference on Machine Learning, volume 97. PMLR, 2019a.
Haddadpour, F., Kamani, M. M., Mahdavi, M., and Cadambe, V. Local SGD with periodic averaging: Tighter analysis and adaptive synchronization. In Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 2019b.
Hallac, D., Leskovec, J., and Boyd, S. Network Lasso: Clustering and Optimization in Large Graphs. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015.
Hanzely, F., Hanzely, S., Horváth, S., and Richtárik, P. Lower bounds and optimal algorithms for personalized federated learning. In Advances in Neural Information Processing Systems 33, 2020.
Hard, A., Rao, K., Mathews, R., Ramaswamy, S., Beaufays, F., Augenstein, S., Eichner, H., Kiddon, C., and Ramage, D. Federated Learning for Mobile Keyboard Prediction. arXiv:1811.03604 [cs], 2018.
Hard, A., Partridge, K., Nguyen, C., Subrahmanya, N., Shah, A., Zhu, P., Lopez-Moreno, I., and Mathews, R. Training keyword spotting models on non-iid data with federated learning. In Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25-29 October 2020. ISCA, 2020.
Hardt, M., Recht, B., and Singer, Y. Train faster, generalize better: Stability of stochastic gradient descent. In Proceedings of the 33rd International Conference on Machine Learning, volume 48. PMLR, 2016.
Hartmann, F., Suh, S., Komarzewski, A., Smith, T. D., and Segall, I. Federated Learning for Ranking Browser History Suggestions. arXiv:1911.11807 [cs, stat], 2019.
Haxby, J. V. Distributed and Overlapping Representations of Faces and Objects in Ventral Temporal Cortex. Science, 293(5539), 2001.
He, C., Annavaram, M., and Avestimehr, S. Group Knowledge Transfer: Federated Learning of Large CNNs at the Edge. In Advances in Neural Information Processing Systems 33, 2020.
Hiriart-Urruty, J.-B. and Lemaréchal, C. Fundamentals of Convex Analysis. Springer Berlin Heidelberg, 2001.
Ingerman, A. and Ostrowski, K. Introducing TensorFlow Federated, 2019.
Jaggi, M. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, volume 28. PMLR, 2013.
Jain, P., Kakade, S. M., Kidambi, R., Netrapalli, P., and Sidford, A. Accelerating stochastic gradient descent for least squares regression. In Proceedings of the 31st Conference on Learning Theory, volume 75. PMLR, 2018.
Jiang, Y., Konečný, J., Rush, K., and Kannan, S. Improving Federated Learning Personalization via Model Agnostic Meta Learning. arXiv:1909.12488 [cs, stat], 2019.
Jung, A. On the Complexity of Sparse Label Propagation. Frontiers in Applied Mathematics and Statistics, 4, 2018.
Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., D'Oliveira, R. G. L., Rouayheb, S. E., Evans, D., Gardner, J., Garrett, Z., Gascón, A., Ghazi, B., Gibbons, P. B., Gruteser, M., Harchaoui, Z., He, C., He, L., Huo, Z., Hutchinson, B., Hsu, J., Jaggi, M., Javidi, T., Joshi, G., Khodak, M., Konečný, J., Korolova, A., Koushanfar, F., Koyejo, S., Lepoint, T., Liu, Y., Mittal, P., Mohri, M., Nock, R., Özgür, A., Pagh, R., Raykova, M., Qi, H., Ramage, D., Raskar, R., Song, D., Song, W., Stich, S. U., Sun, Z., Suresh, A. T., Tramèr, F., Vepakomma, P., Wang, J., Xiong, L., Xu, Z., Yang, Q., Yu, F. X., Yu, H., and Zhao, S. Advances and Open Problems in Federated Learning. arXiv:1912.04977 [cs, stat], 2019.
Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S. J., Stich, S. U., and Suresh, A. T. SCAFFOLD: Stochastic Controlled Averaging for Federated Learning. In Proceedings of the International Conference on Machine Learning (ICML 2020), 2020.
Khaled, A., Mishchenko, K., and Richtárik, P. Tighter Theory for Local SGD on Identical and Heterogeneous Data. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108. PMLR, 2020.
Koloskova, A., Loizou, N., Boreiri, S., Jaggi, M., and Stich, S. U. A Unified Theory of Decentralized SGD with Changing Topology and Local Updates. In Proceedings of the International Conference on Machine Learning (ICML 2020), 2020.
Konečný, J., McMahan, B., and Ramage, D. Federated optimization: Distributed optimization beyond the datacenter. In 8th NIPS Workshop on Optimization for Machine Learning, 2015.
Langford, J., Li, L., and Zhang, T. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10(28), 2009.
Li, G. and Pong, T. K. Global Convergence of Splitting Methods for Nonconvex Composite Optimization. SIAM Journal on Optimization, 25(4), 2015.
Li, T., Sahu, A. K., Zaheer, M., Sanjabi, M., Talwalkar, A., and Smith, V. Federated optimization in heterogeneous networks. In Proceedings of Machine Learning and Systems 2020, 2020a.
Li, X., Huang, K., Yang, W., Wang, S., and Zhang, Z. On the convergence of FedAvg on non-iid data. In International Conference on Learning Representations, 2020b.
Liang, X., Shen, S., Liu, J., Pan, Z., Chen, E., and Cheng, Y. Variance Reduced Local SGD with Lower Communication Complexity. arXiv:1912.12844 [cs, math, stat], 2019.
Lin, T., Kong, L., Stich, S. U., and Jaggi, M. Ensemble Distillation for Robust Model Fusion in Federated Learning. In Advances in Neural Information Processing Systems 33, 2020.
Liu, S., Chen, P.-Y., and Hero, A. O. Accelerated Distributed Dual Averaging Over Evolving Networks of Growing Connectivity. IEEE Transactions on Signal Processing, 66(7), 2018.
Lu, H., Freund, R. M., and Nesterov, Y. Relatively Smooth Convex Optimization by First-Order Methods, and Applications. SIAM Journal on Optimization, 28(1), 2018.
Mahdavi, M., Yang, T., Jin, R., Zhu, S., and Yi, J. Stochastic gradient descent with only one projection. In Advances in Neural Information Processing Systems 25. Curran Associates, Inc., 2012.
Mania, H., Pan, X., Papailiopoulos, D., Recht, B., Ramchandran, K., and Jordan, M. I. Perturbed Iterate Analysis for Asynchronous Stochastic Optimization. SIAM Journal on Optimization, 27(4), 2017.
Mcdonald, R., Mohri, M., Silberman, N., Walker, D., and Mann, G. S. Efficient large-scale distributed training of conditional maximum entropy models. In Advances in Neural Information Processing Systems 22. Curran Associates, Inc., 2009.
McMahan, B. Follow-the-regularized-leader and mirror descent: Equivalence theorems and L1 regularization. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15. PMLR, 2011.
McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54. PMLR, 2017.
Mohri, M., Sivek, G., and Suresh, A. T. Agnostic federated learning. In Proceedings of the 36th International Conference on Machine Learning, volume 97. PMLR, 2019.
Mokhtari, A. and Ribeiro, A. DSA: Decentralized double stochastic averaging gradient algorithm. Journal of Machine Learning Research, 17(61), 2016.
Nedic, A. and Ozdaglar, A. Distributed Subgradient Methods for Multi-Agent Optimization. IEEE Transactions on Automatic Control, 54(1), 2009.
Nedich, A. Convergence Rate of Distributed Averaging Dynamics and Optimization in Networks, volume 2. 2015.
Nemirovski, A. and Yudin, D. B. Problem complexity and method efficiency in optimization. Wiley, 1983.
Nesterov, Y. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1), 2005.
Nesterov, Y. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1), 2009.
Nesterov, Y. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1), 2013.
Nesterov, Y. Lectures on Convex Optimization. 2018.
Parikh, N. and Boyd, S. P. Proximal Algorithms, volume 1. Now Publishers Inc., 2014.
Pathak, R. and Wainwright, M. J. FedSplit: An algorithmic framework for fast federated optimization. In Advances in Neural Information Processing Systems 33, 2020.
Rabbat, M. Multi-agent mirror descent for decentralized stochastic optimization. In 2015 IEEE 6th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP). IEEE, 2015.
Reddi, S., Charles, Z., Zaheer, M., Garrett, Z., Rush, K., Konečný, J., Kumar, S., and McMahan, H. B. Adaptive Federated Optimization. arXiv:2003.00295 [cs, math, stat], 2020.
Rockafellar, R. T. Convex Analysis. Number 28. Princeton University Press, 1970.
Rosenblatt, J. D. and Nadler, B. On the optimality of averaging in distributed statistical learning. Information and Inference, 5(4), 2016.
Ryali, S., Supekar, K., Abrams, D. A., and Menon, V. Sparse logistic regression for whole-brain classification of fMRI data. NeuroImage, 51(2), 2010.
Shamir, O. and Srebro, N. Distributed stochastic optimization and learning. In 2014 52nd Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2014.
Shi, W., Ling, Q., Wu, G., and Yin, W. EXTRA: An Exact First-Order Algorithm for Decentralized Consensus Optimization. SIAM Journal on Optimization, 25(2), 2015a.
Shi, W., Ling, Q., Wu, G., and Yin, W. A Proximal Gradient Algorithm for Decentralized Composite Optimization. IEEE Transactions on Signal Processing, 63(22), 2015b.
Shor, N. Z. Minimization Methods for Non-Differentiable Functions, volume 3. Springer Berlin Heidelberg, 1985.
Smith, V., Chiang, C.-K., Sanjabi, M., and Talwalkar, A. S. Federated multi-task learning. In Advances in Neural Information Processing Systems 30. Curran Associates, Inc., 2017.
Stich, S. U. Local SGD converges fast and communicates little. In International Conference on Learning Representations, 2019.
Sundhar Ram, S., Nedić, A., and Veeravalli, V. V. Distributed Stochastic Subgradient Projection Algorithms for Convex Optimization. Journal of Optimization Theory and Applications, 147(3), 2010.
T. Dinh, C., Tran, N., and Nguyen, T. D. Personalized Federated Learning with Moreau Envelopes. In Advances in Neural Information Processing Systems 33, 2020.
Tong, Q., Liang, G., Zhu, T., and Bi, J. Federated Nonconvex Sparse Learning. arXiv:2101.00052 [cs], 2020.
Tsianos, K. I. and Rabbat, M. G. Distributed dual averaging for convex optimization under communication delays. In 2012 American Control Conference (ACC). IEEE, 2012.
Tsianos, K. I., Lawlor, S., and Rabbat, M. G. Push-Sum Distributed Dual Averaging for convex optimization. In 2012 IEEE 51st IEEE Conference on Decision and Control (CDC). IEEE, 2012.
Wang, J. and Joshi, G. Cooperative SGD: A Unified Framework for the Design and Analysis of Communication-Efficient SGD Algorithms. arXiv:1808.07576 [cs, stat], 2018.
Wang, J., Tantia, V., Ballas, N., and Rabbat, M. SlowMo: Improving communication-efficient distributed SGD with slow momentum. In International Conference on Learning Representations, 2020.
Woodworth, B., Patel, K. K., and Srebro, N. Minibatch vs Local SGD for Heterogeneous Distributed Learning. In Advances in Neural Information Processing Systems 33, 2020a.
Woodworth, B., Patel, K. K., Stich, S. U., Dai, Z., Bullins, B., McMahan, H. B., Shamir, O., and Srebro, N. Is Local SGD Better than Minibatch SGD? In Proceedings of the International Conference on Machine Learning (ICML 2020), 2020b.
Xiao, L. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11(88), 2010.
Yang, T., Lin, Q., and Zhang, L. A richer theory of convex constrained optimization with reduced projections and improved rates. In Proceedings of the 34th International Conference on Machine Learning, volume 70. PMLR, 2017.
Yu, H. and Jin, R. On the computation and communication complexity of parallel SGD with dynamic batch sizes for stochastic non-convex optimization. In Proceedings of the 36th International Conference on Machine Learning, volume 97. PMLR, 2019.
Yu, H., Jin, R., and Yang, S. On the linear speedup analysis of communication efficient momentum SGD for distributed non-convex optimization. In Proceedings of the 36th International Conference on Machine Learning, volume 97. PMLR, 2019a.
Yu, H., Yang, S., and Zhu, S. Parallel Restarted SGD with Faster Convergence and Less Communication: Demystifying Why Model Averaging Works for Deep Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 2019b.
Yuan, D., Hong, Y., Ho, D. W., and Jiang, G. Optimal distributed stochastic mirror descent for strongly convex optimization. Automatica, 90, 2018.
Yuan, D., Hong, Y., Ho, D. W. C., and Xu, S. Distributed Mirror Descent for Online Composite Optimization. IEEE Transactions on Automatic Control, 2020.
Yuan, H. and Ma, T. Federated Accelerated Stochastic Gradient Descent. In Advances in Neural Information Processing Systems 33, 2020.
Yuan, K., Ling, Q., and Yin, W. On the Convergence of Decentralized Gradient Descent. SIAM Journal on Optimization, 26(3), 2016.
Zhang, X., Hong, M., Dhople, S., Yin, W., and Liu, Y. FedPD: A Federated Learning Framework with Optimal Rates and Adaptivity to Non-IID Data. arXiv:2005.11418 [cs, stat], 2020.
Zhou, F. and Cong, G. On the convergence properties of a k-step averaging stochastic gradient descent algorithm for nonconvex optimization. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, 2018.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), August 21-24, 2003, Washington, DC, USA. AAAI Press, 2003.
Zinkevich, M., Weimer, M., Li, L., and Smola, A. J. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems 23. Curran Associates, Inc., 2010.