Agnostic Federated Learning

Mehryar Mohri 1,2   Gary Sivek 1   Ananda Theertha Suresh 1

Abstract

A key learning scenario in large-scale applications is that of federated learning, where a centralized model is trained based on data originating from a large number of clients. We argue that, with the existing training and inference, federated models can be biased towards different clients. Instead, we propose a new framework of agnostic federated learning, where the centralized model is optimized for any target distribution formed by a mixture of the client distributions. We further show that this framework naturally yields a notion of fairness. We present data-dependent Rademacher complexity guarantees for learning with this objective, which guide the definition of an algorithm for agnostic federated learning. We also give a fast stochastic optimization algorithm for solving the corresponding optimization problem, for which we prove convergence bounds, assuming a convex loss function and a convex hypothesis set. We further empirically demonstrate the benefits of our approach in several datasets. Beyond federated learning, our framework and algorithm can be of interest to other learning scenarios such as cloud computing, domain adaptation, drifting, and other contexts where the training and test distributions do not coincide.

1 Google Research, New York; 2 Courant Institute of Mathematical Sciences, New York, NY. Correspondence to: Ananda Theertha Suresh.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

1. Motivation

A key learning scenario in large-scale applications is that of federated learning. In that scenario, a centralized model is trained based on data originating from a large number of clients, which may be mobile phones, other mobile devices, or sensors (Konečný, McMahan, Yu, Richtárik, Suresh, and Bacon, 2016b; Konečný, McMahan, Ramage, and Richtárik, 2016a). The training data typically remains distributed over the clients, each with possibly unreliable or relatively slow network connections.

Federated learning raises several types of issues and has been the topic of multiple research efforts. These include systems, networking and communication bottleneck problems due to frequent exchanges between the central server and the clients (McMahan et al., 2017). Other research efforts include the design of more efficient communication strategies (Konečný, McMahan, Yu, Richtárik, Suresh, and Bacon, 2016b; Konečný, McMahan, Ramage, and Richtárik, 2016a; Suresh, Yu, Kumar, and McMahan, 2017), devising efficient distributed optimization methods benefiting from differential privacy guarantees (Agarwal, Suresh, Yu, Kumar, and McMahan, 2018), as well as recent lower bound guarantees for parallel stochastic optimization with a dependency graph (Woodworth, Wang, Smith, McMahan, and Srebro, 2018).

Another important problem in federated learning, which appears more generally in distributed and other learning setups, is that of fairness. In many instances in practice, the resulting learning models may be biased or unfair: they may discriminate against some protected groups (Bickel, Hammel, and O'Connell, 1975; Hardt, Price, Srebro, et al., 2016). As a simple example, a regression algorithm predicting a person's salary could be using that person's gender. This is a central problem in modern machine learning that does not seem to have been specifically studied in the context of federated learning.

While many problems related to federated learning have been extensively studied, the key objective of learning in that context seems not to have been carefully examined. We are also not aware of statistical guarantees derived for learning in this scenario. A crucial reason for such questions to emerge in this context is that the target distribution for which the centralized model is learned is unspecified. Which expected loss is federated learning seeking to minimize? Most centralized models for standard federated learning are trained on the aggregate training sample obtained from the subsamples drawn from the clients. Thus, if we denote by D_k the distribution associated to client k, by m_k the size of the sample available from that client, and by m the total sample size, then, intrinsically, the centralized model is trained to minimize the loss with respect to the uniform distribution U = Σ_{k=1}^p (m_k/m) D_k.
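The uniform aggregation weights m_k/m can be made concrete with a small numerical sketch (illustrative Python, not from the paper; the fleet sizes and user-type fractions are hypothetical numbers in the spirit of the smartphone example discussed in this section):

```python
import numpy as np

# Hypothetical client populations: a large fleet of expensive phones
# (70% technical users) and a small fleet of other phones (5% technical).
fleet_sizes = np.array([10000.0, 500.0])   # assumed per-fleet sample counts m_k
tech_fraction = np.array([0.70, 0.05])     # assumed P(technical user | fleet k)

m_bar = fleet_sizes / fleet_sizes.sum()    # uniform-aggregation weights m_k / m
aggregate_tech = float(m_bar @ tech_fraction)  # fraction of technical users in U
target_tech = 0.05                         # a target domain dominated by general users
```

Because the large fleet dominates the weights m_k/m, the aggregate distribution U is close to that fleet's user mix, far from a target domain of mostly general users.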

But why should U be the target distribution of the learning model? Is U the distribution that we expect to observe at test time? What guarantees can be derived for the deployed system?

We argue that, in many common instances, the uniform distribution is not the natural objective distribution and that seeking to minimize the expected loss with respect to the specific distribution U is risky. This is because the target distribution may be in general quite different from U. In many cases, that can result in a suboptimal or even a detrimental performance. For example, imagine a plausible scenario of federated learning where the learner has access to a large population of expensive mobile phones, which are more commonly adopted by software engineers or other technical users (say 70%) than by other users (30%), and a small population of other mobile phones, used less often by technical users (5%) and significantly more often by other users (95%). The centralized model would then essentially be based on the uniform distribution over the expensive clients. But, clearly, such a model would not be adapted to the wide general target domain formed by the majority of phones, with a 5%-95% split of technical versus general users. Many other realistic examples of this type can help illustrate the learning problem resulting from a mismatch between the target distribution and U. In fact, it is not clear why minimizing the expected loss with respect to U could be beneficial for the clients, whose distributions are the D_k's.

Thus, we put forward a new framework of agnostic federated learning (AFL), where the centralized model is optimized for any possible target distribution formed by a mixture of the client distributions. Instead of optimizing the centralized model for a specific distribution, with the high risk of a mismatch with the target, we define an agnostic and more risk-averse objective. We show that, for some target mixture distributions, the cross-entropy loss of the hypothesis obtained by minimization with respect to the uniform distribution U can be worse than that of the hypothesis obtained in AFL by a constant additive term, even if the learner has access to infinite samples (Section 2.2).

We further show that our AFL framework naturally yields a notion of fairness, which we refer to as good-intent fairness (Section 2.3). Indeed, the predictor solution of the optimization problem for our AFL framework treats all protected categories similarly. Beyond federated learning, our framework and solution also cover related problems in cloud-based learning services, where customers may not have any training data at their disposal or may not be willing to share that data with the cloud due to privacy concerns. In that case too, the server needs to train a model without access to the training data. Our framework and algorithm can also be of interest to other learning scenarios such as domain adaptation, drifting, and other contexts where the training and test distributions do not coincide. In Appendix A, we give an extensive discussion of related work, including connections with the broad literature of domain adaptation.

The rest of the paper is organized as follows. In Section 2, we give a formal description of AFL. Next, we give a detailed theoretical analysis of learning within the AFL framework (Section 3), as well as a learning algorithm based on that theory (Section 4). We also present an efficient convex optimization algorithm for solving the optimization problem defining our algorithm (Section 4.2). In Section 5, we present a series of experiments comparing our solution with existing federated learning solutions. In Appendix B, we discuss several extensions of the AFL framework.

2. Learning scenario

In this section, we introduce the learning scenario of agnostic federated learning we consider. We then argue that the uniform solution commonly adopted in standard federated learning may not be an adequate solution, thereby further justifying our agnostic model. Next, we show the benefit of our model in fairness learning.

We start with some general notation and definitions used throughout the paper. Let X denote the input space and Y the output space. We primarily discuss a multi-class classification problem where Y is a finite set of classes, but much of our results can be extended straightforwardly to regression and other problems. The hypotheses we consider are of the form h: X → Δ_Y, where Δ_Y stands for the simplex over Y. Thus, h(x) is a probability distribution over the classes or categories that can be assigned to x ∈ X. We denote by H a family of such hypotheses h. We also denote by ℓ a loss function defined over Δ_Y × Y and taking non-negative values. The loss of h ∈ H for a labeled sample (x, y) ∈ X × Y is given by ℓ(h(x), y). One key example in applications is the cross-entropy loss, which is defined as follows: ℓ(h(x), y) = −log(P_{y′∼h(x)}[y′ = y]). We denote by L_D(h) the expected loss of a hypothesis h with respect to a distribution D over X × Y:

    L_D(h) = E_{(x,y)∼D} [ℓ(h(x), y)],

and by h_D its minimizer: h_D = argmin_{h∈H} L_D(h).

2.1. Agnostic federated learning

We consider a learning scenario where the learner receives p samples S_1, ..., S_p, with each S_k = ((x_{k,1}, y_{k,1}), ..., (x_{k,m_k}, y_{k,m_k})) ∈ (X × Y)^{m_k} of size m_k drawn i.i.d. from a possibly different domain or distribution D_k. We denote by D̂_k the empirical distribution associated to the sample S_k of size m_k drawn from D_k^{m_k}. The learner's objective is to determine a hypothesis h ∈ H that performs well on some target distribution. Let m = Σ_{k=1}^p m_k.
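As a concrete illustration of the loss definitions above, the following sketch (illustrative numpy code, not part of the paper; the constant toy hypothesis is hypothetical) evaluates the cross-entropy loss ℓ(h(x), y) = −log h(x)[y] and an empirical average approximating L_D(h):

```python
import numpy as np

def cross_entropy_loss(probs, y):
    # probs is h(x), a distribution over the classes; y is the true class index
    return float(-np.log(probs[y]))

def empirical_loss(h, samples):
    # Average of per-example losses: an empirical estimate of L_D(h)
    return float(np.mean([cross_entropy_loss(h(x), y) for x, y in samples]))

def h(x):
    # Toy hypothesis: predicts the same distribution for every input
    return np.array([0.7, 0.2, 0.1])

samples = [("a", 0), ("b", 0), ("c", 1)]
loss = empirical_loss(h, samples)
```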

Figure 1. Illustration of the agnostic federated learning scenario.

This scenario coincides with that of federated learning where training is done with the uniform distribution over the union of all samples S_k, where all samples are uniformly weighted, that is Û = Σ_{k=1}^p (m_k/m) D̂_k, and where the underlying assumption is that the target distribution is U = Σ_{k=1}^p (m_k/m) D_k. We will not adopt that assumption since it is rather restrictive and since, as discussed later, it can lead to solutions that are detrimental to domain users. Instead, we consider an agnostic federated learning (AFL) scenario where the target distribution can be modeled as an unknown mixture of the distributions D_k, k = 1, ..., p, that is D_λ = Σ_{k=1}^p λ_k D_k for some λ ∈ Δ_p. Since the mixture weight λ is unknown, here, the learner must come up with a solution that is favorable for any λ in the simplex, or any λ in a subset Λ ⊆ Δ_p. Thus, we define the agnostic loss (or agnostic risk) L_{D_Λ}(h) associated to a predictor h ∈ H as

    L_{D_Λ}(h) = max_{λ∈Λ} L_{D_λ}(h).    (1)

We extend our previous definitions and denote by h_{D_Λ} the minimizer of this loss: h_{D_Λ} = argmin_{h∈H} L_{D_Λ}(h).

In practice, the learner has access to the distributions D_k only via the finite samples S_k. Thus, for any λ ∈ Δ_p, instead of the mixture D_λ, only the λ-mixture of empirical distributions, D̄_λ = Σ_{k=1}^p λ_k D̂_k, is accessible. (1) This leads to the definition of L_{D̄_Λ}(h), the agnostic empirical loss of a hypothesis h ∈ H for a subset of the simplex Λ:

    L_{D̄_Λ}(h) = max_{λ∈Λ} L_{D̄_λ}(h).

We denote by h_{D̄_Λ} the minimizer of this loss: h_{D̄_Λ} = argmin_{h∈H} L_{D̄_Λ}(h). In the next section, we present generalization bounds relating the expected and empirical agnostic losses L_{D_Λ}(h) and L_{D̄_Λ}(h) for all h ∈ H.

(1) Note that D̄_λ is distinct from an empirical distribution D̂_λ, which would be based on a sample drawn from D_λ; D̄_λ is based on samples drawn from the D_k's.

Notice that the domains D_k discussed thus far need not coincide with the clients. In fact, when the number of clients is very large and Λ is the full simplex, Λ = Δ_p, it is typically preferable to consider instead domains defined by clusters of clients, as discussed in Appendix B. On the other hand, if p is small or Λ more restrictive, then the model may not perform well on certain domains of interest. We mitigate the effect of large p values using a suitable regularization term derived from our theory.

2.2. Comparison with federated learning

Here, we further argue that the uniform solution h_Û commonly adopted in federated learning may not provide a satisfactory performance compared with a solution of the agnostic problem. This further motivates our AFL model. As already discussed, since the target distribution is unknown, the natural method for the learner is to select a hypothesis minimizing the agnostic loss L_{D_Λ}. Does the predictor minimizing the agnostic loss coincide with the solution h_Û of standard federated learning? How poor can the performance of standard federated learning be? We first show that the loss of h_U can be higher than the optimal loss achieved by h_{D_Λ} by a constant, even if the number of samples tends to infinity, that is, even if the learner has access to the distributions D_k and uses the predictor h_U.

Proposition 1. [Appendix C.1] Let ℓ be the cross-entropy loss. Then, there exist Λ, H, and D_k, k ∈ [p], such that the following inequality holds:

    L_{D_Λ}(h_U) ≥ L_{D_Λ}(h_{D_Λ}) + log(2/√3).

2.3. Good-intent fairness in learning

Fairness in machine learning has received much attention in the recent past (Bickel et al., 1975; Hardt et al., 2016). There is now a broad literature on the topic with a variety of definitions of the notion of fairness. In a typical scenario, there is a protected class c among p classes c_1, c_2, ..., c_p. While there are many definitions of fairness, the main objective of a fairness algorithm is to reduce bias and ensure that the model is fair to all p protected categories, under some definition of fairness. The most common sources of bias in machine learning algorithms are training data bias and overfitting bias. We first provide a brief explanation and illustration of both:

• Biased training data: consider a regression task where the goal is to predict the salary of a person based on features such as education, location, age, and gender. Let gender be the protected class. If, in the training data, there is a consistent discrimination against women irrespective of their education, e.g., their salary is lower, then we can conclude that the training data is inherently biased.

• Biased training procedure: consider an image recognition task where the protected category is race. If the model is heavily trained on images of certain races, then the resulting model will be biased because of over-fitting.

Our model of AFL can help define a notion of good-intent fairness, where we reduce the bias in the training procedure. Furthermore, if training procedure bias exists, it naturally highlights it. Suppose we are interested in a classification problem and there is a protected feature class c, which can take one of p values c_1, c_2, ..., c_p. Then, we define D_k as the conditional distribution with the protected class being c_k. If D is the true underlying distribution, then

    D_k(x, y) = D(x, y | c(x, y) = c_k).

Let Λ = {δ_k : k ∈ [p]} be the collection of Dirac measures over the indices k in [p]. With these definitions, a natural fairness principle consists of ensuring that the test loss is the same for all underlying protected classes, that is, for all λ ∈ Λ. This is called the maxmin principle (Rawls, 2009), a special case of the CVaR fairness risk (Williamson & Menon, 2019).

With the above intent in mind, we define a good-intent fairness algorithm as one seeking to minimize the agnostic loss L_{D_Λ}. Thus, the objective of the algorithm is to minimize the maximum loss incurred on any of the underlying protected classes and hence does not overfit the data to any particular model at the cost of others. Furthermore, it does not degrade the performance of the other classes so long as it does not affect the loss of the most-sensitive protected category. We further note that our approach does not reduce bias in the training data and is useful only for mitigating the training procedure bias.

3. Learning bounds

We now present learning guarantees for agnostic federated learning. Let G denote the family of the losses associated to a hypothesis set H: G = {(x, y) ↦ ℓ(h(x), y) : h ∈ H}. Our learning bounds are based on the following notion of weighted Rademacher complexity, which is defined for any hypothesis set H, vector of sample sizes m = (m_1, ..., m_p), and mixture weight λ ∈ Δ_p by the following expression:

    R_m(G, λ) = E_{S_k∼D_k^{m_k}, σ} [ sup_{h∈H} Σ_{k=1}^p (λ_k/m_k) Σ_{i=1}^{m_k} σ_{k,i} ℓ(h(x_{k,i}), y_{k,i}) ],    (2)

where S_k = ((x_{k,1}, y_{k,1}), ..., (x_{k,m_k}, y_{k,m_k})) is a sample of size m_k and σ = (σ_{k,i})_{k∈[p], i∈[m_k]} is a collection of Rademacher variables, that is, uniformly distributed random variables taking values in {−1, +1}. We also define the minimax weighted Rademacher complexity for a subset Λ ⊆ Δ_p by

    R_m(G, Λ) = max_{λ∈Λ} R_m(G, λ).    (3)

Let m̄ = (m_1/m, ..., m_p/m) denote the empirical distribution over Δ_p defined by the sample sizes m_k, where m = Σ_{k=1}^p m_k. We define the skewness of Λ with respect to m̄ by

    s(Λ ∥ m̄) = max_{λ∈Λ} χ²(λ ∥ m̄) + 1,    (4)

where, for any two distributions p and q in Δ_p, the chi-squared divergence χ²(p ∥ q) is given by χ²(p ∥ q) = Σ_{k=1}^p (p_k − q_k)²/q_k. We also denote by Λ_ε a minimum ε-cover of Λ in ℓ_1 distance, that is, Λ_ε = argmin_{Λ′∈C(Λ,ε)} |Λ′|, where C(Λ, ε) is the set of subsets Λ′ ⊆ Δ_p such that for every λ ∈ Λ, there exists λ′ ∈ Λ′ with Σ_{k=1}^p |λ_k − λ′_k| ≤ ε.

Our first learning guarantee is presented in terms of R_m(G, Λ), the skewness parameter s(Λ ∥ m̄), and the ε-cover Λ_ε.

Theorem 1. [Appendix C.2] Assume that the loss ℓ is bounded by M > 0. Fix ε > 0 and m = (m_1, ..., m_p). Then, for any δ > 0, with probability at least 1 − δ over the draw of the samples S_k ∼ D_k^{m_k}, for all h ∈ H and λ ∈ Λ, L_{D_λ}(h) is upper bounded by

    L_{D̄_λ}(h) + 2 R_m(G, λ) + M ε + M √( (s(λ ∥ m̄)/(2m)) log(|Λ_ε|/δ) ),

where m = Σ_{k=1}^p m_k.

It can be shown that, for a given λ, the variance of the loss depends on the skewness parameter, and hence the generalization bound can also be lower bounded in terms of the skewness parameter (Theorem 9 in Cortes et al. (2010)). Note that the bound of Theorem 1 is instance-specific, i.e., it depends on the target distribution D_λ and increases monotonically as λ moves away from m̄. Thus, for target domains with λ ≈ m̄, the bound is more favorable. The theorem supplies upper bounds for agnostic losses: they can be obtained simply by taking the maximum over λ ∈ Λ. The following result shows that, for a family of functions taking values in {−1, +1}, the Rademacher complexity R_m(G, Λ) can be bounded in terms of the VC-dimension and the skewness of Λ.

Lemma 1. [Appendix C.3] Let ℓ be a loss function taking values in {−1, +1} and such that the family of losses G admits VC-dimension d. Then, the following upper bound holds for the weighted Rademacher complexity of G:

    R_m(G, Λ) ≤ √( 2 s(Λ ∥ m̄) (d/m) log(em/d) ).

Both Lemma 1 and the generalization bound of Theorem 1 can thus be expressed in terms of the skewness parameter s(Λ ∥ m̄). Note that, when Λ contains only one distribution and it is the uniform distribution, that is λ_k = m_k/m, then the skewness is equal to one and the results coincide with the standard guarantees in supervised learning.
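The skewness parameter of Equation (4) is straightforward to compute. The sketch below (illustrative numpy code with toy sample sizes, not from the paper) also checks two boundary cases: for Λ = Δ_p the maximum of χ²(λ ∥ m̄) is attained at a Dirac measure on the smallest domain, so the skewness equals m / min_k m_k, and for Λ = {m̄} the skewness equals one.

```python
import numpy as np

def chi_squared(p, q):
    # χ²(p ∥ q) = Σ_k (p_k − q_k)² / q_k for distributions p, q in Δ_p
    return float(np.sum((p - q) ** 2 / q))

def skewness(Lambda, m_bar):
    # s(Λ ∥ m̄) = max_{λ∈Λ} χ²(λ ∥ m̄) + 1   (Equation (4))
    return max(chi_squared(lam, m_bar) for lam in Lambda) + 1.0

m = np.array([10.0, 40.0, 50.0])      # per-domain sample sizes m_k (toy values)
m_bar = m / m.sum()                   # empirical distribution m̄
s_full = skewness(np.eye(3), m_bar)   # Λ = Δ_p: max attained at a Dirac measure
s_single = skewness([m_bar], m_bar)   # Λ = {m̄}: skewness equals one
```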

Theorem 1 and Lemma 1 also provide guidelines for choosing the domains and Λ. When p is large and Λ = Δ_p, the number of samples per domain could be small, the skewness parameter s(Λ ∥ m̄) = max_{1≤k≤p} m/m_k would then be large, and the generalization guarantees for the model would become weaker. We suggest some guidelines for choosing domains in Appendix B. We further note that, for a given p, if Λ contains distributions that are close to m̄, then the model generalizes well.

Theorem 1 can be straightforwardly extended to cover the case where the test samples are drawn from some distribution D instead of D_λ. Define ℓ_1(D, D_Λ) by ℓ_1(D, D_Λ) = min_{λ∈Λ} ℓ_1(D, D_λ). Then, the following result holds.

Corollary 1. Assume that the loss function ℓ is bounded by M. Then, for any ε > 0 and δ > 0, with probability at least 1 − δ, the following inequality holds for all h ∈ H:

    L_D(h) ≤ L_{D̄_Λ}(h) + 2 R_m(G, Λ) + M ℓ_1(D, D_Λ) + M ε + M √( (s(Λ ∥ m̄)/(2m)) log(|Λ_ε|/δ) ).

One straightforward choice of the parameter ε is ε = 1/√m but, depending on Λ and the other parameters of the bound, more favorable choices may be possible. We conclude this section by adding that alternative learning bounds can be derived for this problem, as discussed in Appendix D.

4. Algorithm

4.1. Regularization

The learning guarantees of the previous section suggest minimizing the sum of the empirical AFL term L_{D̄_Λ}(h), a term controlling the complexity of H, and a term depending on the skewness parameter. Observe that, since L_{D̄_λ}(h) is linear in λ, the following equality holds:

    L_{D̄_Λ}(h) = L_{D̄_conv(Λ)}(h),    (5)

where conv(Λ) is the convex hull of Λ. Assume that H is a vector space that can be equipped with a norm ∥·∥, as with most hypothesis sets used in learning applications. Then, given Λ and the regularization parameters γ ≥ 0 and µ ≥ 0, our learning guarantees suggest the following minimization problem:

    min_{h∈H} max_{λ∈conv(Λ)} L_{D̄_λ}(h) + γ ∥h∥ + µ χ²(λ ∥ m̄).    (6)

This defines our algorithm for AFL. Assume that ℓ is a convex function of its first argument. Then, L_{D̄_λ}(h) is a convex function of h. Since ∥h∥ is a convex function of h for any choice of the norm, for a fixed λ, the objective L_{D̄_λ}(h) + γ∥h∥ + µχ²(λ ∥ m̄) is a convex function of h. The maximum over λ (taken in any set) of a family of convex functions is convex. Thus, max_{λ∈conv(Λ)} L_{D̄_λ}(h) + γ∥h∥ + µχ²(λ ∥ m̄) is a convex function of h and, when the hypothesis set H is convex, (6) is a convex optimization problem. In the next subsection, we present an efficient optimization solution for this problem in the Euclidean norm, for which we prove convergence guarantees. In Appendix F.1, we generalize the results to other norms.

4.2. Optimization algorithm

When the loss function ℓ is convex, the AFL minmax optimization problem above can be solved using projected gradient descent or other instances of the generic mirror descent algorithm (Nemirovski & Yudin, 1983). However, for large datasets, that is p and m large, this can be computationally costly and typically slow in practice. Juditsky, Nemirovski, and Tauvel (2011) proposed a stochastic Mirror-Prox algorithm for solving stochastic variational inequalities, which would be applicable in our context. We present a simplified version of their algorithm for the AFL problem that admits a more straightforward analysis and that is also substantially easier to implement.

Our optimization problem is over two sets of parameters: the hypothesis h ∈ H and the mixture weight λ ∈ Λ. In what follows, we denote by W a non-empty subset of R^N and by w ∈ W a vector of parameters defining a predictor h. Thus, we rewrite losses and optimization solutions only in terms of w, instead of h. We use the following notation:

    L(w, λ) = Σ_{k=1}^p λ_k L_k(w),    (7)

where L_k(w) stands for L_{D̂_k}(h), the empirical loss of the hypothesis h ∈ H (corresponding to w) on domain k: L_k(w) = (1/m_k) Σ_{i=1}^{m_k} ℓ(h(x_{k,i}), y_{k,i}). We consider the unregularized version of problem (6); we note that regularization with respect to w does not make the optimization harder. Thus, we study the following problem given by the set of variables w:

    min_{w∈W} max_{λ∈Λ} L(w, λ).    (8)

Observe that problem (8) admits a natural game-theoretic interpretation as a two-player game, where nature selects λ ∈ Λ to maximize the objective, while the learner seeks w ∈ W minimizing the loss. We are interested in finding the equilibrium of this game, which is attained for some w*, the minimizer of Equation (8), and λ* ∈ Λ, the hardest domain mixture weights. At the equilibrium, moving w away from w* or λ away from λ* increases the objective function.

Hence, λ* can be viewed as the center of Λ in the manifold imposed by the loss function L, whereas λ_Û, the empirical distribution of the samples, may lie elsewhere, as illustrated by Figure 2.

Figure 2. Illustration of the positions in Λ of λ*, of λ_Û, the mixture weight corresponding to the distribution Û, and of an arbitrary λ; λ* defines the least risky distribution D_{λ*} for which to optimize the expected loss.

By Equation (5), using the set conv(Λ) instead of Λ does not affect the solution of the optimization problem. In view of that, in what follows, we assume, without loss of generality, that Λ is a convex set. Observe that, since L(w, λ) is not an average of functions, standard stochastic gradient descent algorithms cannot be used to minimize this objective. We present instead a new stochastic gradient-type algorithm for this problem.

Let ∇_w L(w, λ) denote the gradient of the loss function with respect to w and ∇_λ L(w, λ) the gradient with respect to λ. Let δ_w L(w, λ) and δ_λ L(w, λ) be unbiased estimates of these gradients, that is,

    E_δ[δ_w L(w, λ)] = ∇_w L(w, λ),   E_δ[δ_λ L(w, λ)] = ∇_λ L(w, λ).

We first give an optimization algorithm, STOCHASTIC-AFL, for the AFL problem, assuming access to such unbiased estimates. The pseudocode of the algorithm is given in Figure 3. At each step, the algorithm computes a stochastic gradient with respect to λ and w and updates the model accordingly. It then projects λ onto Λ by computing a value in Λ via convex minimization. If Λ is the full simplex, then there exist simple and efficient algorithms for this projection (Duchi et al., 2008). The process is repeated for T steps and the average of the weights is returned.

Algorithm STOCHASTIC-AFL

Initialization: w_0 ∈ W and λ_0 ∈ Λ. Parameters: step sizes γ_w > 0 and γ_λ > 0.
For t = 1 to T:
  1. Obtain stochastic gradients: δ_w L(w_{t−1}, λ_{t−1}) and δ_λ L(w_{t−1}, λ_{t−1}).
  2. w_t = PROJECT(w_{t−1} − γ_w δ_w L(w_{t−1}, λ_{t−1}), W).
  3. λ_t = PROJECT(λ_{t−1} + γ_λ δ_λ L(w_{t−1}, λ_{t−1}), Λ).
Output: w^A = (1/T) Σ_{t=1}^T w_t and λ^A = (1/T) Σ_{t=1}^T λ_t.

Subroutine PROJECT: Input: x′, X. Output: x = argmin_{x∈X} ∥x − x′∥_2.

Figure 3. Pseudocode of the STOCHASTIC-AFL algorithm.

There are several natural candidates for the sampling method defining the stochastic gradients. We highlight two techniques, PERDOMAINGRADIENT and WEIGHTEDGRADIENT, and analyze the time complexity and give bounds on the variance of both in Lemmas 3 and 4, respectively.

4.3. Analysis

Throughout this section, for simplicity, we adopt the notation introduced for Equation (7). Our convergence guarantees hold under the following assumptions, which are similar to those adopted for the convergence proofs of gradient descent-type algorithms.

Properties 1. Assume that the following properties hold for the loss function L and the sets W and Λ ⊆ Δ_p:

1. Convexity: w ↦ L(w, λ) is convex for any λ ∈ Λ.
2. Compactness: max_{λ∈Λ} ∥λ∥_2 ≤ R_Λ and max_{w∈W} ∥w∥_2 ≤ R_W.
3. Bounded gradients: ∥∇_w L(w, λ)∥_2 ≤ G_w and ∥∇_λ L(w, λ)∥_2 ≤ G_λ for all w ∈ W and λ ∈ Λ.
4. Bounded stochastic variance: E[∥δ_w L(w, λ) − ∇_w L(w, λ)∥_2²] ≤ σ_w² and E[∥δ_λ L(w, λ) − ∇_λ L(w, λ)∥_2²] ≤ σ_λ² for all w ∈ W and λ ∈ Λ.
5. Time complexity: U_w denotes the time complexity of computing δ_w L(w, λ), U_λ that of computing δ_λ L(w, λ), U_p that of the projection, and d denotes the dimensionality of W.

Theorem 2. [Appendix E.1] Assume that Properties 1 hold. Then, the following guarantee holds for STOCHASTIC-AFL:

    E[max_{λ∈Λ} L(w^A, λ)] − min_{w∈W} max_{λ∈Λ} L(w, λ) ≤ (3 R_W √(σ_w² + G_w²))/√T + (3 R_Λ √(σ_λ² + G_λ²))/√T,

for the step sizes γ_w = 2R_W/√(T(σ_w² + G_w²)) and γ_λ = 2R_Λ/√(T(σ_λ² + G_λ²)), and the time complexity of the algorithm is in O((U_λ + U_w + U_p + d + p) T).

We note that similar algorithms have been proposed for solving minimax objectives (Namkoong & Duchi, 2016; Chen et al., 2017). Chen et al. (2017) assume the existence of an α-approximate Bayesian oracle, whereas our guarantees hold regardless of such assumptions. Namkoong & Duchi (2016) use importance sampling to obtain λ-gradients; thus, their convergence guarantee for the Euclidean norm depends inversely on a lower bound on min_{λ∈Λ} min_{k∈[p]} λ_k. In contrast, our convergence guarantees are not affected by that.
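A minimal implementation sketch of the updates of Figure 3 (illustrative Python, not the authors' code): it uses exact per-domain losses and gradients in place of the unbiased stochastic estimates, takes W to be all of R^N so the w-projection is the identity, uses a sort-based Euclidean projection onto the simplex in the style of Duchi et al. (2008), and runs on a toy problem with two quadratic domain losses whose equilibrium is w* = 0 with λ* = (1/2, 1/2).

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection onto the probability simplex
    # (sort-based method in the style of Duchi et al., 2008).
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

def stochastic_afl(domain_losses, domain_grads, w0, p, T, gamma_w, gamma_lam):
    # Sketch of STOCHASTIC-AFL. domain_losses(w) returns (L_1(w), ..., L_p(w)),
    # which is also the gradient of L(w, λ) with respect to λ;
    # domain_grads(w, lam) returns a gradient of L(w, λ) with respect to w.
    w, lam = np.asarray(w0, dtype=float), np.full(p, 1.0 / p)
    w_sum, lam_sum = np.zeros_like(w), np.zeros(p)
    for _ in range(T):
        losses = domain_losses(w)
        w = w - gamma_w * domain_grads(w, lam)           # descent step in w
        lam = project_simplex(lam + gamma_lam * losses)  # ascent step in λ
        w_sum += w
        lam_sum += lam
    return w_sum / T, lam_sum / T                        # averaged iterates

# Toy problem: two domains with losses L_k(w) = (w - c_k)^2, c = (1, -1).
centers = np.array([1.0, -1.0])
losses_fn = lambda w: (w[0] - centers) ** 2
grads_fn = lambda w, lam: np.array([float(np.sum(lam * 2.0 * (w[0] - centers)))])
w_avg, lam_avg = stochastic_afl(losses_fn, grads_fn, w0=[0.5], p=2,
                                T=5000, gamma_w=0.02, gamma_lam=0.02)
```

On this toy problem, the averaged iterates approach the equilibrium w* = 0 and the hardest mixture weights λ* = (1/2, 1/2).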

In contrast, our convergence guarantees are not affected by that.

4.4. Stochastic gradients

The convergence results of Theorem 4 depend on the variance of the stochastic gradients. We first discuss the stochastic gradient for $\lambda$. Notice that the gradient for $\lambda$ is independent of $\lambda$. Thus, a natural choice for the stochastic gradient with respect to $\lambda$ is based on uniformly sampling a domain $K \in [p]$ and then sampling $x_{K,i}$ from domain $K$. This leads to the definition of the stochastic gradient $\delta_\lambda L(w, \lambda)$ shown in Figure 4.

Stochastic gradient for $\lambda$:
1. Sample $K \in [p]$ according to the uniform distribution. Sample $I_K \in [m_K]$ according to the uniform distribution.
2. Output: $\delta_\lambda L(w, \lambda)$ such that $[\delta_\lambda L(w, \lambda)]_K = p \, L_{K,I_K}(w)$ and, for all $k \neq K$, $[\delta_\lambda L(w, \lambda)]_k = 0$.

PERDOMAIN-stochastic gradient for $w$:
1. For each $k \in [p]$, sample $J_k \in [m_k]$ according to the uniform distribution.
2. Output: $\delta_w L(w, \lambda) = \sum_{k=1}^{p} \lambda_k \nabla_w L_{k,J_k}(w)$.

WEIGHTED-stochastic gradient for $w$:
1. Sample $K \in [p]$ according to the distribution $\lambda$. Sample $J_K \in [m_K]$ according to the uniform distribution.
2. Output: $\delta_w L(w, \lambda) = \nabla_w L_{K,J_K}(w)$.

Figure 4. Definition of the stochastic gradients with respect to $\lambda$ and $w$.

The following lemma bounds the variance for that definition of $\delta_\lambda L(w, \lambda)$.

Lemma 2. [Appendix E.2] The stochastic gradient $\delta_\lambda L(w, \lambda)$ is unbiased. Further, if the loss function is bounded by $M$, then the following upper bound holds for the variance of $\delta_\lambda L(w, \lambda)$:
$$\sigma_\lambda^2 = \max_{w \in W, \lambda \in \Lambda} \operatorname{Var}(\delta_\lambda L(w, \lambda)) \leq p^2 M^2.$$

If the above variance is too high, then we can sample one $J_k$ for every domain $k$. This is the same as computing the gradient of a batch and reduces the variance by a factor of $p$.

The gradient with respect to $w$ depends on both $\lambda$ and $w$. There are two natural stochastic gradients: the PERDOMAIN-stochastic gradient and the WEIGHTED-stochastic gradient. For a PERDOMAIN-stochastic gradient, we sample an element uniformly from $[m_k]$ for each $k \in [p]$. For the WEIGHTED-stochastic gradient, we sample a domain according to $\lambda$ and sample an element out of it. We can now bound the variance of both the PERDOMAIN and WEIGHTED stochastic gradients. Let $U$ denote the time complexity of computing the loss and the gradient with respect to $w$ for a single sample.

Lemma 3. [Appendix E.3] The PERDOMAIN stochastic gradient is unbiased and runs in time $pU + O(p \log m)$, and its variance satisfies $\sigma_w^2 \leq R(\Lambda)\, \sigma_I^2(w)$, where
$$\sigma_I^2(w) = \max_{w \in W,\, k \in [p]} \frac{1}{m_k} \sum_{j=1}^{m_k} \big\| \nabla_w L_{k,j}(w) - \nabla_w L_k(w) \big\|^2.$$

Lemma 4. [Appendix E.4] The WEIGHTED stochastic gradient is unbiased and runs in time $U + O(p + \log m)$, and its variance satisfies the following inequality: $\sigma_w^2 \leq \sigma_I^2(w) + \sigma_O^2(w)$, where
$$\sigma_O^2(w) = \max_{w \in W,\, \lambda \in \Lambda} \sum_{k=1}^{p} \lambda_k \big\| \nabla_w L_k(w) - \nabla_w L(w, \lambda) \big\|^2$$
and $\sigma_I^2(w)$ is defined in Lemma 3.

Since $R(\Lambda) \leq 1$, at first glance, the above two lemmas may suggest that the PERDOMAIN stochastic gradient is always better than the WEIGHTED stochastic gradient. Note, however, that the time complexity of the algorithms is dominated by $U$, and thus the time complexity of the PERDOMAIN-stochastic gradient is roughly $p$ times larger than that of the WEIGHTED-stochastic gradient. Hence, if $p$ is small, it is preferable to choose the PERDOMAIN-stochastic gradient. For large values of $p$, we analyze the differences in Appendix E.5.

4.5. Related optimization algorithms

In Appendix F.1, we show that STOCHASTIC-AFL can be extended to the case where arbitrary mirror maps are used, as in the standard mirror descent algorithm. In Appendix F.2, we give an algorithm with convergence rate $O(\log T / T)$ when the loss function is strongly convex. Finally, in Appendix F.3, we present an optimistic version of STOCHASTIC-AFL.

5. Experiments

To study the benefits of our AFL algorithm, we carried out experiments with three datasets. Even though our optimization convergence guarantees hold only for convex loss functions and stochastic gradients, we show that our domain-agnostic learning also performs well for non-convex loss functions and for variants of stochastic gradient descent such as Adagrad. In all the experiments, we compare the domain-agnostic model with the model trained with $\hat{U}$, the uniform distribution over the union of samples, and with the models trained on individual domains. In all the experiments, we used PERDOMAIN stochastic gradients and set $\Lambda = \Delta_p$. All algorithms were implemented in TensorFlow (Abadi et al., 2015).
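The sampling procedures of Figure 4 can be sketched in a few lines. The following is an illustrative sketch only, not the implementation used in our experiments: the squared loss for a linear model and its gradient are assumed helpers chosen for concreteness, and each domain is represented as a list of `(x, y)` pairs.

```python
import random

def loss(w, x, y):
    """Squared loss of a single sample (an illustrative convex choice)."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    return 0.5 * (pred - y) ** 2

def grad_w(w, x, y):
    """Gradient of the squared loss with respect to w."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]

def stochastic_grad_lambda(w, domains):
    """delta_lambda L(w, lambda): p * L_{K,I_K}(w) in coordinate K, 0 elsewhere."""
    p = len(domains)
    K = random.randrange(p)            # uniform domain
    x, y = random.choice(domains[K])   # uniform sample within domain K
    g = [0.0] * p
    g[K] = p * loss(w, x, y)           # unbiased: E[g_k] = L_k(w)
    return g

def perdomain_grad_w(w, lam, domains):
    """PERDOMAIN: one uniform sample per domain, combined with weights lambda_k."""
    g = [0.0] * len(w)
    for k, domain in enumerate(domains):
        x, y = random.choice(domain)
        gk = grad_w(w, x, y)
        g = [gi + lam[k] * gki for gi, gki in zip(g, gk)]
    return g

def weighted_grad_w(w, lam, domains):
    """WEIGHTED: sample one domain K ~ lambda, then one uniform sample from it."""
    K = random.choices(range(len(domains)), weights=lam)[0]
    x, y = random.choice(domains[K])
    return grad_w(w, x, y)
```

Consistent with Lemmas 3 and 4, `perdomain_grad_w` touches one sample per domain ($p$ gradient evaluations, lower variance) while `weighted_grad_w` touches a single sample per step.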

5.1. Adult dataset

The Adult dataset is a census dataset from the UCI Machine Learning Repository (Blake, 1998). The task consists of predicting whether a person's income exceeds $50,000. We split this dataset into two domains depending on whether the person had a doctorate degree, resulting in two domains: the doctorate domain and the non-doctorate domain. We trained a logistic regression model with just the categorical features and the Adagrad optimizer. The performance of the models, averaged over 50 runs, is reported in Table 1.

Table 1. Adult dataset: test accuracy for various test domains of models trained with different loss functions.

Training loss function   | $\hat{U}$      | doctorate      | non-doctorate  | $\bar{D}_\Lambda$
$L_{doctorate}$          | 53.35 ± 0.91   | 73.58 ± 0.48   | 53.12 ± 0.89   | 53.12 ± 0.89
$L_{non-doctorate}$      | 82.15 ± 0.09   | 69.46 ± 0.29   | 82.29 ± 0.09   | 69.46 ± 0.29
$L_{\hat{U}}$            | 82.10 ± 0.09   | 69.61 ± 0.35   | 82.24 ± 0.09   | 69.61 ± 0.35
$L_{\bar{D}_\Lambda}$    | 80.10 ± 0.39   | 71.53 ± 0.88   | 80.20 ± 0.40   | 71.53 ± 0.88

The performance on $\bar{D}_\Lambda$ of the model trained with $\hat{U}$, that is, standard federated learning, is about 69.6%. In contrast, the performance of our AFL model is at least about 71.5% on any target distribution $D_\lambda$. The uniform average over the domains of the test accuracy of the AFL model is slightly less than that of the uniform model, but the agnostic model is less biased and performs better on $\bar{D}_\Lambda$.

5.2. Fashion MNIST

The Fashion MNIST dataset (Xiao et al., 2017) is an MNIST-like dataset where images are classified into 10 categories of clothing, instead of handwritten digits. We extracted the subset of the data labeled with the three categories t-shirt/top, pullover, and shirt, and split this subset into three domains, each consisting of one class of clothing. We then trained a classifier for the three classes using logistic regression and the Adam optimizer. The results are shown in Table 2. Since here the domain uniquely identifies the label, in this experiment we did not compare against models trained on specific domains.

Table 2. Fashion MNIST dataset: test accuracy for various test domains of models trained with different loss functions.

Training loss function   | $\hat{U}$    | shirt        | pullover     | t-shirt/top  | $\bar{D}_\Lambda$
$L_{\hat{U}}$            | 81.8 ± 1.3   | 71.2 ± 7.8   | 87.8 ± 6.0   | 86.2 ± 4.9   | 71.2 ± 7.8
$L_{\bar{D}_\Lambda}$    | 82.3 ± 0.9   | 74.5 ± 6.0   | 87.6 ± 4.5   | 84.9 ± 4.4   | 74.5 ± 6.0

Of the three domains or classes, the shirt class is the hardest one to distinguish from the others. The domain-agnostic model improves the performance on shirt more than it degrades it on pullover and t-shirt/top, leading to both a shirt-specific and an overall accuracy improvement when compared to the model trained with the uniform distribution $\hat{U}$. Furthermore, in this experiment, note that our agnostic learning solution not only improves the loss of the worst domain, but also generalizes better and hence improves the average test accuracy.

5.3. Language models

Motivated by the keyboard application (Hard et al., 2018), where a single client uses a trained language model in multiple environments such as chat apps, email, and web input, we created a dataset that combines two very different types of language data: conversation and document. For conversation, we used the Cornell movie dataset, which contains movie dialogues (Danescu-Niculescu-Mizil & Lee, 2011). For documents, we used the Penn TreeBank (PTB) dataset (Marcus et al., 1993). We created a single dataset by combining both of the above corpora, with conversation and document as domains. We preprocessed the data to remove punctuation, capitalized the data uniformly, and computed a vocabulary of the 10,000 most frequent words. We trained a two-layer LSTM model with the momentum optimizer. The performance of the models is measured by their perplexity, that is, the exponent of the cross-entropy loss. The results are reported in Table 3.

Table 3. Test perplexity for various test domains of models trained with different loss functions.

Training loss function   | $\hat{U}$ | doc.    | con.    | $\bar{D}_\Lambda$
$L_{doc.}$               | 414.96    | 83.97   | 615.75  | 615.75
$L_{con.}$               | 108.97    | 1138.76 | 61.01   | 1138.76
$L_{\hat{U}}$            | 68.18     | 96.98   | 62.50   | 96.98
$L_{\bar{D}_\Lambda}$    | 79.98     | 86.33   | 78.48   | 86.33

Of the two domains, the document domain is the one admitting the higher perplexity. For this domain, the test perplexity of the domain-agnostic model is close to that of the model trained only on document data and is better than that of the model trained with the uniform distribution $\hat{U}$.

6. Conclusion

We introduced a new framework for federated learning, based on principled learning objectives, for which we presented a detailed theoretical analysis, a learning algorithm motivated by our theory, a new stochastic optimization solution for large-scale problems, and several extensions. Our experimental results suggest that our solution can lead to significant benefits in practice. In addition, our framework and algorithms benefit from favorable fairness properties. This constitutes a global solution that we hope will be generally adopted in federated learning, as well as in other related learning tasks such as domain adaptation.

7. Acknowledgements

We thank Corinna Cortes, Satyen Kale, Shankar Kumar, Rajiv Mathews, and Brendan McMahan for helpful comments and discussions.
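For concreteness, the basic update underlying the experiments above, gradient descent on $w$ and projected gradient ascent on $\lambda$ over the simplex $\Lambda = \Delta_p$, can be sketched as follows. This is a hypothetical illustration, not the paper's TensorFlow implementation; the gradient callbacks and step sizes are placeholders, and the projection follows the sorting-based Euclidean projection of Duchi et al. (2008).

```python
def project_simplex(v):
    """Euclidean projection onto the probability simplex (Duchi et al., 2008)."""
    u = sorted(v, reverse=True)
    css, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        css += ui
        t = (css - 1.0) / i
        if ui - t > 0:       # the condition holds for a prefix of the sorted values
            theta = t        # keep the threshold from the last valid index
    return [max(vi - theta, 0.0) for vi in v]

def afl_step(w, lam, grad_w_fn, grad_lam_fn, gamma_w=0.1, gamma_lam=0.1):
    """One minimax step: descent on w, projected ascent on lambda."""
    gw = grad_w_fn(w, lam)   # stochastic gradient with respect to w
    gl = grad_lam_fn(w, lam) # stochastic gradient with respect to lambda
    w_new = [wi - gamma_w * gi for wi, gi in zip(w, gw)]
    lam_new = project_simplex([li + gamma_lam * gi for li, gi in zip(lam, gl)])
    return w_new, lam_new
```

The projection keeps $\lambda$ a valid mixture weight after every ascent step, so the adversary's distribution always remains in $\Delta_p$.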

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.

Agarwal, N., Suresh, A. T., Yu, F. X., Kumar, S., and McMahan, B. cpSGD: Communication-efficient and differentially-private distributed SGD. In Proceedings of NeurIPS, pp. 7575–7586, 2018.

Banerjee, A., Merugu, S., Dhillon, I. S., and Ghosh, J. Clustering with Bregman divergences. Journal of Machine Learning Research, 6(Oct):1705–1749, 2005.

Ben-David, S., Blitzer, J., Crammer, K., and Pereira, F. Analysis of representations for domain adaptation. In NIPS, pp. 137–144, 2006.

Bickel, P. J., Hammel, E. A., and O'Connell, J. W. Sex bias in graduate admissions: Data from Berkeley. Science, 187(4175):398–404, 1975.

Blake, C. L. UCI repository of machine learning databases, Irvine, University of California. http://www.ics.uci.edu/~mlearn/MLRepository, 1998.

Blitzer, J., Dredze, M., and Pereira, F. Biographies, Bollywood, Boom-boxes and Blenders: Domain adaptation for sentiment classification. In Proceedings of ACL 2007, Prague, Czech Republic, 2007.

Chen, R. S., Lucier, B., Singer, Y., and Syrgkanis, V. Robust optimization for non-convex objectives. In Advances in Neural Information Processing Systems, pp. 4705–4714, 2017.

Cortes, C. and Mohri, M. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519:103–126, 2014.

Cortes, C., Mansour, Y., and Mohri, M. Learning bounds for importance weighting. In Advances in Neural Information Processing Systems, pp. 442–450, 2010.

Cortes, C., Mohri, M., and Muñoz Medina, A. Adaptation algorithm and theory based on generalized discrepancy. In KDD, pp. 169–178, 2015.

Danescu-Niculescu-Mizil, C. and Lee, L. Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs. In Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics, pp. 76–87. Association for Computational Linguistics, 2011.

Daskalakis, C., Ilyas, A., Syrgkanis, V., and Zeng, H. Training GANs with optimism. arXiv preprint arXiv:1711.00141, 2017.

Dredze, M., Blitzer, J., Talukdar, P. P., Ganchev, K., Graca, J., and Pereira, F. Frustratingly hard domain adaptation for parsing. In Proceedings of CoNLL 2007, Prague, Czech Republic, 2007.

Duchi, J. C., Shalev-Shwartz, S., Singer, Y., and Chandra, T. Efficient projections onto the l1-ball for learning in high dimensions. In ICML, pp. 272–279, 2008.

Farnia, F. and Tse, D. A minimax approach to supervised learning. In Proceedings of NIPS, pp. 4240–4248, 2016.

Ganin, Y. and Lempitsky, V. S. Unsupervised domain adaptation by backpropagation. In ICML, volume 37, pp. 1180–1189, 2015.

Gauvain, J.-L. and Lee, C.-H. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Transactions on Speech and Audio Processing, 2(2):291–298, 1994.

Girshick, R. B., Donahue, J., Darrell, T., and Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pp. 580–587, 2014.

Gong, B., Shi, Y., Sha, F., and Grauman, K. Geodesic flow kernel for unsupervised domain adaptation. In CVPR, pp. 2066–2073, 2012.

Gong, B., Grauman, K., and Sha, F. Connecting the dots with landmarks: Discriminatively learning domain-invariant features for unsupervised domain adaptation. In ICML, volume 28, pp. 222–230, 2013a.

Gong, B., Grauman, K., and Sha, F. Reshaping visual datasets for domain adaptation. In NIPS, pp. 1286–1294, 2013b.

Hard, A., Rao, K., Mathews, R., Beaufays, F., Augenstein, S., Eichner, H., Kiddon, C., and Ramage, D. Federated learning for mobile keyboard prediction. arXiv preprint arXiv:1811.03604, 2018.

Hardt, M., Price, E., Srebro, N., et al. Equality of opportunity in supervised learning. In Proceedings of NIPS, pp. 3315–3323, 2016.

Hoffman, J., Kulis, B., Darrell, T., and Saenko, K. Discovering latent domains for multisource domain adaptation. In ECCV, volume 7573, pp. 702–715, 2012.

Hoffman, J., Rodner, E., Donahue, J., Saenko, K., and Darrell, T. Efficient learning of domain-invariant image representations. In ICLR, 2013.

Hoffman, J., Mohri, M., and Zhang, N. Algorithms and theory for multiple-source adaptation. In Proceedings of NeurIPS, pp. 8256–8266, 2018.

Jelinek, F. Statistical Methods for Speech Recognition. The MIT Press, 1998.

Jiang, J. and Zhai, C. Instance weighting for domain adaptation in NLP. In Proceedings of ACL 2007, pp. 264–271, Prague, Czech Republic, 2007. Association for Computational Linguistics.

Juditsky, A., Nemirovski, A., and Tauvel, C. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 1(1):17–58, 2011.

Koltchinskii, V. and Panchenko, D. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30, 2002.

Konečný, J., McMahan, H. B., Ramage, D., and Richtárik, P. Federated optimization: Distributed machine learning for on-device intelligence. arXiv preprint arXiv:1610.02527, 2016a.

Konečný, J., McMahan, H. B., Yu, F. X., Richtárik, P., Suresh, A. T., and Bacon, D. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016b.

Lee, J. and Raginsky, M. Minimax statistical learning and domain adaptation with Wasserstein distances. arXiv preprint arXiv:1705.07815, 2017.

Legetter, C. J. and Woodland, P. C. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech and Language, pp. 171–185, 1995.

Liu, J., Zhou, J., and Luo, X. Multiple source domain adaptation: A sharper bound using weighted Rademacher complexity. In Technologies and Applications of Artificial Intelligence (TAAI), pp. 546–553. IEEE, 2015.

Long, M., Cao, Y., Wang, J., and Jordan, M. I. Learning transferable features with deep adaptation networks. In ICML, volume 37, pp. 97–105, 2015.

Mansour, Y., Mohri, M., and Rostamizadeh, A. Multiple source adaptation and the Rényi divergence. In UAI, pp. 367–374, 2009a.

Mansour, Y., Mohri, M., and Rostamizadeh, A. Domain adaptation: Learning bounds and algorithms. In COLT, 2009b.

Mansour, Y., Mohri, M., and Rostamizadeh, A. Domain adaptation with multiple sources. In NIPS, pp. 1041–1048, 2009c.

Marcus, M. P., Marcinkiewicz, M. A., and Santorini, B. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.

Martínez, A. M. Recognizing imprecisely localized, partially occluded, and expression variant faces from a single sample per class. IEEE Trans. Pattern Anal. Mach. Intell., 24(6):748–763, 2002.

McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of AISTATS, pp. 1273–1282, 2017.

Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of Machine Learning. MIT Press, second edition, 2018.

Muandet, K., Balduzzi, D., and Schölkopf, B. Domain generalization via invariant feature representation. In ICML, volume 28, pp. 10–18, 2013.

Namkoong, H. and Duchi, J. C. Stochastic gradient methods for distributionally robust optimization with f-divergences. In Advances in Neural Information Processing Systems, pp. 2208–2216, 2016.

Nemirovski, A. S. and Yudin, D. B. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.

Pan, S. J. and Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng., 22(10):1345–1359, 2010.

Pietra, S. D., Pietra, V. D., Mercer, R. L., and Roukos, S. Adaptive language modeling using minimum discriminant estimation. In HLT '91: Proceedings of the Workshop on Speech and Natural Language, pp. 103–106, Morristown, NJ, USA, 1992. Association for Computational Linguistics.

Raju, A., Hedayatnia, B., Liu, L., Gandhe, A., Khatri, C., Metallinou, A., Venkatesh, A., and Rastrow, A. Contextual language model adaptation for conversational agents. arXiv preprint arXiv:1806.10215, 2018.

Rakhlin, S. and Sridharan, K. Optimization, learning, and games with predictable sequences. In Proceedings of NIPS, pp. 3066–3074, 2013.

Rawls, J. A Theory of Justice. Harvard University Press, 2009.

Roark, B. and Bacchiani, M. Supervised and unsupervised PCFG adaptation to novel domains. In Proceedings of HLT-NAACL, 2003.

Rockafellar, R. T. Convex Analysis. Princeton University Press, 1997.

Rosenfeld, R. A maximum entropy approach to adaptive statistical language modeling. Computer Speech and Language, 10:187–228, 1996.

Saenko, K., Kulis, B., Fritz, M., and Darrell, T. Adapting visual category models to new domains. In ECCV, volume 6314, pp. 213–226, 2010.

Suresh, A. T., Yu, F. X., Kumar, S., and McMahan, H. B. Distributed mean estimation with limited communication. In Proceedings of the 34th International Conference on Machine Learning, pp. 3329–3337, 2017.

Tzeng, E., Hoffman, J., Darrell, T., and Saenko, K. Simultaneous deep transfer across domains and tasks. In ICCV, pp. 4068–4076, 2015.

Williamson, R. C. and Menon, A. K. Fairness risk measures. arXiv preprint arXiv:1901.08665, 2019.

Woodworth, B. E., Wang, J., Smith, A. D., McMahan, B., and Srebro, N. Graph oracle models, lower bounds, and gaps for parallel stochastic optimization. In Proceedings of NeurIPS, pp. 8505–8515, 2018.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. CoRR, abs/1708.07747, 2017. URL http://arxiv.org/abs/1708.07747.

Xu, Z., Li, W., Niu, L., and Xu, D. Exploiting low-rank structure from latent domains for domain generalization. In ECCV, volume 8691, pp. 628–643, 2014.

Yang, J., Yan, R., and Hauptmann, A. G. Cross-domain video concept detection using adaptive SVMs. In ACM Multimedia, pp. 188–197, 2007.

Zhang, K., Gong, M., and Schölkopf, B. Multi-source domain adaptation: A causal view. In AAAI, pp. 3150–3157, 2015.