Fast-Convergent Federated Learning

IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 39, NO. 1, JANUARY 2021 201 Fast-Convergent Federated Learning Hung T. Nguyen , Vikash Sehwag, Seyyedali Hosseinalipour, Member, IEEE, Christopher G. Brinton , Senior Member, IEEE, Mung Chiang, Fellow, IEEE, and H. Vincent Poor , Life Fellow, IEEE Abstract— Federated learning has emerged recently as a promising solution for distributing machine learning tasks through modern networks of mobile devices. Recent studies have obtained lower bounds on the expected decrease in model loss that is achieved through each round of federated learning. However, convergence generally requires a large number of communication rounds, which induces delay in model training and is costly in terms of network resources. In this paper, we propose a fast-convergent federated learning algorithm, called FOLB,which performs intelligent sampling of devices in each round of model Fig. 1. Different from standard federated learning algorithms which are based training to optimize the expected convergence speed. We first on uniform sampling, our proposed methodology improves convergence rates theoretically characterize a lower bound on improvement that through intelligent sampling that factors in the values of local updates that devices provide. can be obtained in each round if devices are selected according to the expected improvement their local models will provide to FOLB the current global model. Then, we show that obtains this Most applications of interest today involve machine learning bound through uniform sampling by weighting device updates according to their gradient information. FOLB is able to handle (ML). Federated learning (FL) has emerged recently as a both communication and computation heterogeneity of devices technique for distributing ML model training across edge by adapting the aggregations according to estimates of device’s devices. It allows solving machine learning tasks in a dis- capabilities of contributing to the updates. We evaluate FOLB tributing setting comprising a central server and multiple in comparison with existing federated learning algorithms and participating “worker” nodes, where the nodes themselves experimentally show its improvement in trained model accuracy, convergence speed, and/or model stability across various machine collect the data and never transfer it over the network, which learning tasks and datasets. minimizes privacy concerns. At the same time, the federated learning setting introduces challenges of statistical and system Index Terms— Federated learning, distributed optimization, fast convergence rate. heterogeneity that traditional distributed optimization methods [2]–[11] are not designed for and thus may fail to provide convergence guarantees. I. INTRODUCTION One such challenge is the number of devices that must par- VER the past decade, the intelligence of devices at ticipate in each round of computation. To provide convergence Othe network edge has increased substantially. Today, guarantees, recent studies [12]–[15] in distributed learning smartphones, wearables, sensors, and other Internet-connected have to assume full participation of all devices in every round devices possess significant computation and communication of optimization, which results in excessively high communica- capabilities, especially when considered collectively. This has tion costs in edge network settings. On the other hand, [6], [8], created interest in migrating computing methodologies from [10], [16]–[19] violate the statistical heterogeneity property. cloud to edge-centric to provide near-real-time results [1]. In contrast, FL techniques provide flexibility in selecting only a fraction of clients in each round of computations [20]. Manuscript received July 27, 2020; revised September 27, 2020; accepted October 21, 2020. Date of publication November 9, 2020; date of current However, such a selection of devices, which is often done version December 16, 2020. The work of Hung T. Nguyen and Mung uniformly, naturally causes the convergence rates to be slower. Chiang was supported in part by the Defense Advanced Research Projects In this paper, we take into consideration that in each com- Agency (DARPA) under Contract AWD1005371 and Contract AWD1005468. The work of H. Vincent Poor was supported in part by the U.S. National putation round, some clients provide more valuable updates Science Foundation under Grant CCF-1908308. (Corresponding author: in terms of reducing the overall model loss than others, Hung T. Nguyen.) as illustrated in Figure 1. By taking this into account, we show Hung T. Nguyen, Vikash Sehwag, and H. Vincent Poor are with the Department of Electrical Engineering, Princeton University, Prince- that the convergence in federated learning can be vastly ton, NJ 08544 USA (e-mail: [email protected]; [email protected]; improved with an appropriate non-uniform device selection [email protected]). method. We first theoretically characterize the overall loss Seyyedali Hosseinalipour, Christopher G. Brinton, and Mung Chiang are with the School of Electrical and Computer Engineering, Purdue Uni- decrease of the non-uniform version of the recent state-of- versity, West Lafayette, IN 47907 USA (e-mail: [email protected]; the-art FedProx algorithm [21], where clients in each round [email protected]; [email protected]). are selected based on a target probability distribution. Under Color versions of one or more figures in this article are available at https://doi.org/10.1109/JSAC.2020.3036952. such a non-uniform device selection scheme, we obtain a lower Digital Object Identifier 10.1109/JSAC.2020.3036952 bound on the expected decrease in global loss function at every 0733-8716 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information. Authorized licensed use limited to: Purdue University. Downloaded on December 22,2020 at 14:36:39 UTC from IEEE Xplore. Restrictions apply. 202 IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 39, NO. 1, JANUARY 2021 computation round at the central server. We further improve • We perform extensive experiments on synthetic, vision, this bound by incorporating gradient information from each and language datasets to demonstrate the success of device into the aggregation of local parameter updates and FOLB over FedAvg and FedProx algorithms in terms characterize a device selection distribution, named LB-near- of model accuracy, training stability, and/or convergence optimal, which can achieve a near-optimal lower bound over speed (Section VI). all non-uniform distributions at each round. Straightforwardly computing such distribution in every round involves a heavy communication step across all devices B. Related Work which defeats the purpose of federated learning where the Distributed optimization has been vastly studied in the assumption is that only a subset of devices participates in each literature [2]–[11] which focuses on a datacenter environment round. We address this communication challenge with a novel model where (i) the distribution of data to different machine federated learning algorithm, named FOLB, which is based is under control, e.g., uniformly at random, and (ii) all the on a simple yet effective re-weighting mechanism of updated machines are relatively close to one another, e.g., minimal parameters received from participating devices in every round. cost of communication. However, those approaches no longer With twice the number of devices selected in baseline feder- work on the emerging environment of distributed mobile ated learning settings, i.e., as in the popular FedAvg and Fed- devices due to its peculiar characteristics, including non-i.i.d. Prox algorithms, FOLB achieves the near-optimal decrease in and unbalanced data distributions, limited communication, and global loss as that of the LB-near-optimal device selection heterogeneity of computation between devices. Thus, many distribution, whereas with the same number of devices, FOLB recent efforts [6], [8], [10], [12]–[24] have been devoted to provides a guarantee of global loss decrease close to that of coping with these new challenges. the LB-near-optimal and even better in some cases. Most of the existing works [6], [8], [10], [12]–[19] either Another challenge in federated learning is device hetero- assume the full participation of all devices or violate sta- geneity, which affects the computation and communication tistical heterogeneity property inherent in our environment. capabilities across devices. We demonstrate that FOLB can McMahan et al. [20] was the first to define federated learning easily adapt to such device heterogeneity by adjusting its setting in which a learning task is solved by a loose federation re-weighting mechanism of the updated parameters returned of participating devices which are coordinated by a central from participating devices. Computing the re-weighting coef- server and proposed the heuristic FedAvg algorithm. FedAvg ficients involves presumed constants which are related to runs through multiple rounds of optimization, in each round, the loss function characteristics and solvers used in distrib- it randomly selects a small set of K devices to perform uted devices, and more importantly, may not be available local stochastic gradient descent with respect to their local beforehand. Even estimating those constants may be difficult data. Then, the locally updated model parameters are sent and incur considerable

Load more